Gatk_benchmark Genome Analysis Example Dataset
License: Other
GATK (Genome Analysis Toolkit) is an open-source bioinformatics toolkit developed by the Broad Institute of MIT and Harvard.
The goal of the project is to provide a standardized analysis workflow for high-throughput (next-generation sequencing, NGS) data. It is mainly used for:
- Quality control (QC) of DNA/RNA sequencing data;
- Sequence alignment and recalibration;
- Variant calling, identifying SNPs, InDels and other variants;
- Joint genotyping at the population level.
 
GATK is one of the most widely used analysis frameworks in genomics, with applications in human whole-genome sequencing, cancer genome research, and precision medicine.
The associated paper is "The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data", published in 2010 by the Broad Institute of Harvard and MIT in collaboration with the Center for Human Genetic Research at Massachusetts General Hospital.
Introduction to the sample dataset
GATK's analysis pipeline uses unaligned BAM files (uBAM) as a unified starting point, while sequencers typically output FASTQ files or pre-aligned BAM files. To ensure consistent and reproducible analysis, data from different sources must be converted to the uBAM format.
This project provides two typical examples:
- Conversion from FASTQ to unaligned BAM (FastqToSam);
- Reversion from aligned BAM to unaligned BAM (RevertSam).
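
The choice between the two tools depends only on the kind of input file. As a minimal sketch, the helper below maps a file name to the Picard tool named above. The function name `pick_ubam_tool` and the list of recognized extensions are assumptions made for illustration, not part of the dataset or of Picard, and the extension is only a heuristic: a `.bam` file may already be unaligned.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the Picard tool from the input file extension.
pick_ubam_tool() {
    case "$1" in
        *.fastq|*.fq|*.fastq.gz|*.fq.gz) echo "FastqToSam" ;;   # raw sequencer output
        *.bam)                           echo "RevertSam" ;;    # BAM input, assumed aligned
        *)                               echo "unknown"; return 1 ;;
    esac
}

pick_ubam_tool 6484_snippet_1.fastq   # prints FastqToSam
pick_ubam_tool 6484_snippet.bam       # prints RevertSam
```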
 
tutorial6484FastqToSam.tar.gz
This dataset contains FASTQ files, the raw data format output by the sequencer, which record the base sequence and per-base quality values of each read. They are used to demonstrate how Picard's FastqToSam tool converts the FASTQ files of a paired-end run into an unaligned BAM. The resulting .bam file contains the raw sequences, quality values, and read group information, but no alignment position information.
- Example of conversion command:

  bash
    # FASTQ, FASTQ2: first and second read files of the paired-end run
    # READ_GROUP_NAME: required; read group name (default is A, changed here)
    # SAMPLE_NAME: required; sample name
    # LIBRARY_NAME: required; library name
    # PLATFORM: recommended; sequencing platform type (e.g. illumina)
    # RUN_DATE: date and time of the sequencing run
    java -Xmx8G -jar picard.jar FastqToSam \
    FASTQ=6484_snippet_1.fastq \
    FASTQ2=6484_snippet_2.fastq \
    OUTPUT=6484_snippet_fastqtosam.bam \
    READ_GROUP_NAME=H0164.2 \
    SAMPLE_NAME=NA12878 \
    LIBRARY_NAME=Solexa-272222 \
    PLATFORM_UNIT=H0164ALXX140820.2 \
    PLATFORM=illumina \
    SEQUENCING_CENTER=BI \
    RUN_DATE=2014-08-20T00:00:00-0400
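
Because the two FASTQ files hold the mates of the same paired-end reads, it can help to confirm that they contain the same number of records before converting them. The sketch below is one possible pre-check, not a step of the tutorial. It assumes uncompressed FASTQ with the common four-lines-per-record layout and a trailing newline, and the function names `count_fastq_reads` and `check_fastq_pair` are hypothetical.

```shell
#!/usr/bin/env bash
# Count FASTQ records, assuming four lines per record (header, bases, '+', qualities)
# and a newline at the end of the file.
count_fastq_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Succeed only if both mate files hold the same number of records.
check_fastq_pair() {
    local n1 n2
    n1=$(count_fastq_reads "$1")
    n2=$(count_fastq_reads "$2")
    if [ "$n1" -ne "$n2" ]; then
        echo "read count mismatch: $1 has $n1, $2 has $n2" >&2
        return 1
    fi
    echo "both files contain $n1 reads"
}

# Usage with the tutorial's file names; skipped when the files are not present.
if [ -f 6484_snippet_1.fastq ] && [ -f 6484_snippet_2.fastq ]; then
    check_fastq_pair 6484_snippet_1.fastq 6484_snippet_2.fastq
fi
```

A matching count does not prove that the mates are in the same order, so this is only a quick sanity check.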
tutorial6484RevertSam.tar.gz
This dataset is in BAM format, a binary format for sequencing reads, here produced by aligning the reads from FASTQ files. It stores the sequences and their positions on the genome more compactly than text formats. The example demonstrates how to use Picard's RevertSam tool to revert an aligned BAM file to an unaligned state for re-alignment or re-analysis.
- Example of conversion command:

  bash
    # MAX_DISCARD_FRACTION: informational only; does not affect processing
    # ATTRIBUTE_TO_CLEAR=AS: Picard releases from September 2015 onward clear AS by default
    # SORT_ORDER, RESTORE_ORIGINAL_QUALITIES, REMOVE_DUPLICATE_INFORMATION and
    # REMOVE_ALIGNMENT_INFORMATION are shown with their default values
    java -Xmx8G -jar /path/picard.jar RevertSam \
    I=6484_snippet.bam \
    O=6484_snippet_revertsam.bam \
    SANITIZE=true \
    MAX_DISCARD_FRACTION=0.005 \
    ATTRIBUTE_TO_CLEAR=XT \
    ATTRIBUTE_TO_CLEAR=XN \
    ATTRIBUTE_TO_CLEAR=AS \
    ATTRIBUTE_TO_CLEAR=OC \
    ATTRIBUTE_TO_CLEAR=OP \
    SORT_ORDER=queryname \
    RESTORE_ORIGINAL_QUALITIES=true \
    REMOVE_DUPLICATE_INFORMATION=true \
    REMOVE_ALIGNMENT_INFORMATION=true
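
The RevertSam command repeats `ATTRIBUTE_TO_CLEAR` once per tag. When the tag list varies between runs, the repeated arguments can be generated instead of typed by hand. The following is a minimal sketch: the helper name `clear_args` is hypothetical, and the command is only echoed as a preview rather than executed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one ATTRIBUTE_TO_CLEAR=<tag> argument per tag given.
clear_args() {
    local tag
    for tag in "$@"; do
        printf 'ATTRIBUTE_TO_CLEAR=%s\n' "$tag"
    done
}

# Preview the command without running it. The command substitution is left
# unquoted on purpose so that each generated argument becomes a separate word.
echo java -Xmx8G -jar /path/picard.jar RevertSam \
    I=6484_snippet.bam O=6484_snippet_revertsam.bam \
    $(clear_args XT XN AS OC OP)
```

Dropping the leading `echo` would run the command, provided the Picard jar path is replaced with a real location.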