Software used in the alignment module

bwa_mem

Align .fastq files to a reference genome and generate a .bam file.

Rule

rule bwa_mem:
    input:
        reads=lambda wildcards: alignment_input(wildcards),
        idx=[
            config.get("bwa_mem", {}).get("amb", ""),
            config.get("bwa_mem", {}).get("ann", ""),
            config.get("bwa_mem", {}).get("bwt", ""),
            config.get("bwa_mem", {}).get("pac", ""),
            config.get("bwa_mem", {}).get("sa", ""),
        ],
    output:
        bam=temp("alignment/bwa_mem/{sample}_{type}_{flowcell}_{lane}_{barcode}.bam"),
    params:
        extra=lambda wildcards: "%s %s %s"
        % (
            config.get("bwa_mem", {}).get("extra", ""),
            config.get("bwa_mem", {}).get("read_group", generate_read_group(wildcards)),
            get_deduplication_option(wildcards),
        ),
        sorting=config.get("bwa_mem", {}).get("sort", "samtools"),
        sort_order=config.get("bwa_mem", {}).get("sort_order", "coordinate"),
        sort_extra="-@ %s"
        % str(config.get("bwa_mem", config["default_resources"]).get("threads", config["default_resources"]["threads"])),
    log:
        "alignment/bwa_mem/{sample}_{type}_{flowcell}_{lane}_{barcode}.bam.log",
    benchmark:
        repeat(
            "alignment/bwa_mem/{sample}_{type}_{flowcell}_{lane}_{barcode}.bam.benchmark.tsv",
            config.get("bwa_mem", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("bwa_mem", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("bwa_mem", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("bwa_mem", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("bwa_mem", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("bwa_mem", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("bwa_mem", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("bwa_mem", {}).get("container", config["default_container"])
    message:
        "{rule}: align fastq files {input.reads} using bwa mem against {input.idx[2]}"
    wrapper:
        "v1.3.1/bio/bwa/mem"

input / output files

Rule parameters	Key	Value	Description
input	reads	`lambda wildcards: alignment_input(wildcards)`	fastq files from the same sample, obtained by get_fastq_file defined in `common.smk`
input	idx	`[ config.get("bwa_mem", {}).get("amb", ""), config.get("bwa_mem", {}).get("ann", ""), config.get("bwa_mem", {}).get("bwt", ""), config.get("bwa_mem", {}).get("pac", ""), config.get("bwa_mem", {}).get("sa", ""), ]`	reference index files for bwa-mem; their locations are defined in `config.yaml`
output	bam	`"alignment/bwa_mem/{sample}_{type}_{flowcell}_{lane}_{barcode}.bam"`	aligned bam file. NOTE: if the fastq files are divided into different lanes for the same sample, they will be aligned separately

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded. NOTE: if the sample is marked with umi in the deduplication column of the samples.tsv file, the -Y flag is added by the get_deduplication_option function (in `common.smk`)
read_group	string	RG string that will be added to the generated bam file. By default the RG string is generated by the function generate_read_group defined in `common.smk`
sorting	string	program handling the bam sorting (default samtools)
sort_order	string	how the bam file should be sorted (default coordinate)
sort_extra	string	parameters that should be forwarded to sorting (e.g., number of threads)
amb	string	fasta reference amb file
ann	string	fasta reference ann file
bwt	string	fasta reference bwt file
pac	string	fasta reference pac file
sa	string	fasta reference sa file
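
The `extra` string handed to the wrapper is assembled from three parts: the configured `extra`, a read group, and a deduplication option. The sketch below mimics that composition in plain Python. `fake_read_group` and `fake_deduplication_option` are hypothetical stand-ins for `generate_read_group` and `get_deduplication_option` in `common.smk`, whose real implementations are not shown here, and the read group format is only an assumption.

```python
# Hypothetical stand-ins for helpers defined in common.smk; their real
# behaviour may differ from what is shown here.
def fake_read_group(sample, sample_type, flowcell, lane, barcode):
    # Assumed format: a bwa mem -R argument with ID and SM fields.
    read_group_id = "%s_%s.%s.%s.%s" % (sample, sample_type, flowcell, lane, barcode)
    return "-R '@RG\\tID:%s\\tSM:%s_%s'" % (read_group_id, sample, sample_type)


def fake_deduplication_option(deduplication):
    # The settings table states that -Y is added for samples marked "umi".
    return "-Y" if deduplication == "umi" else ""


def compose_extra(config, read_group, dedup_option):
    # Same "%s %s %s" pattern as the params section of the rule.
    bwa_config = config.get("bwa_mem", {})
    return "%s %s %s" % (
        bwa_config.get("extra", ""),
        bwa_config.get("read_group", read_group),
        dedup_option,
    )


config = {"bwa_mem": {"extra": "-K 100000000"}}  # illustrative value
extra = compose_extra(
    config,
    fake_read_group("NA12878", "N", "HXXXX", "L001", "ACGTACGT"),
    fake_deduplication_option("umi"),
)
print(extra)
```

A `read_group` entry in the configuration replaces the generated read group, because it is looked up first with the generated value as the fallback.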

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available NOTE: bwa mem uses a large amount of memory.
mem_per_cpu	integer	memory in MB used per cpu NOTE: bwa mem uses a large amount of memory.
partition	string	partition to use on cluster
time	string	max execution time
threads	integer	number of threads to be available RECOMMENDATION: use multiple threads for decreased run time. NOTE: if multiple threads are used, the memory must also be increased (mem_mb)
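
Every resource in the rule is resolved with the same two-level lookup: a rule-specific value if present, otherwise the value under `default_resources`. A minimal sketch of that pattern follows; the `resolve` helper and the numbers in `config` are illustrative assumptions, not values from the module.

```python
def resolve(config, rule_name, key):
    # Rule-specific value wins; otherwise fall back to default_resources.
    return config.get(rule_name, {}).get(key, config["default_resources"][key])


# Illustrative settings only; real values live in the resources file.
config = {
    "default_resources": {"threads": 1, "mem_mb": 6144, "time": "4:00:00"},
    "bwa_mem": {"threads": 8, "mem_mb": 49152},
}

print(resolve(config, "bwa_mem", "threads"))        # rule-specific value
print(resolve(config, "bwa_mem", "time"))           # falls back to the default
print(resolve(config, "bwa_mem_merge", "threads"))  # rule absent, default used
```

Note that the default is evaluated even when a rule-specific value exists, so `default_resources` must define every key that is looked up this way.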

bwa_mem_merge

Merge .bam files from the same sample using samtools merge.

Rule

rule bwa_mem_merge:
    input:
        bams=lambda wildcards: [
            "alignment/bwa_mem/{sample}_{type}_%s_%s_%s.bam" % (u.flowcell, u.lane, u.barcode)
            for u in get_units(units, wildcards)
        ],
    output:
        bam=temp("alignment/bwa_mem/{sample}_{type}_unsorted.bam"),
    params:
        config.get("bwa_mem_merge", {}).get("extra", ""),
    log:
        "alignment/bwa_mem/{sample}_{type}_unsorted.bam.log",
    benchmark:
        repeat(
            "alignment/bwa_mem/{sample}_{type}_unsorted.bam.benchmark.tsv",
            config.get("bwa_mem_merge", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("bwa_mem_merge", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("bwa_mem_merge", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("bwa_mem_merge", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("bwa_mem_merge", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("bwa_mem_merge", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("bwa_mem_merge", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("bwa_mem_merge", {}).get("container", config["default_container"])
    message:
        "{rule}: merge bam file {input} using samtools"
    wrapper:
        "v1.1.0/bio/samtools/merge"

input / output files

Rule parameters	Key	Value	Description
input	bams	`lambda wildcards: [ "alignment/bwa_mem/{sample}_{type}_%s_%s_%s.bam" % (u.flowcell, u.lane, u.barcode) for u in get_units(units, wildcards) ]`	bam files from the same sample (and the same sample type); the list of bam files is built from the information in the `units.tsv` file
output	bam	`"alignment/bwa_mem/{sample}_{type}_unsorted.bam"`	an unsorted merged bam file
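
To illustrate how the list of per-lane bam files is assembled, the sketch below rebuilds the input list outside Snakemake. `Unit` and `fake_get_units` are hypothetical stand-ins for the rows of `units.tsv` and the `get_units` helper; in the real rule, `{sample}` and `{type}` are wildcards filled in by Snakemake rather than function arguments.

```python
from collections import namedtuple

# Hypothetical unit record; the real rows come from units.tsv.
Unit = namedtuple("Unit", ["sample", "type", "flowcell", "lane", "barcode"])

units = [
    Unit("NA12878", "N", "HXXXX", "L001", "ACGTACGT"),
    Unit("NA12878", "N", "HXXXX", "L002", "ACGTACGT"),
    Unit("NA12878", "T", "HYYYY", "L001", "TTGCAATG"),
]


def fake_get_units(units, sample, sample_type):
    # Stand-in for get_units: keep rows matching the sample and type.
    return [u for u in units if u.sample == sample and u.type == sample_type]


def merge_inputs(units, sample, sample_type):
    # One per-lane bam path per matching unit, as in the rule's input lambda.
    return [
        "alignment/bwa_mem/%s_%s_%s_%s_%s.bam"
        % (sample, sample_type, u.flowcell, u.lane, u.barcode)
        for u in fake_get_units(units, sample, sample_type)
    ]


bams = merge_inputs(units, "NA12878", "N")
print(bams)
```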

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded RECOMMENDATION: use -c -p to only keep one of the read group IDs when merging files from the same sample

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time

bwa_mem_realign_consensus_reads

Realign after consensus read creation by fgbio_call_and_filter_consensus_reads and generate a .bam file.

Rule

rule bwa_mem_realign_consensus_reads:
    input:
        bam="alignment/fgbio_call_and_filter_consensus_reads/{sample}_{type}.umi.unmapped_bam",
    output:
        bam=temp("alignment/bwa_mem_realign_consensus_reads/{sample}_{type}.umi_unsorted.bam"),
    params:
        extra_bwa_mem=config.get("bwa_mem_realign_consensus_reads", {}).get("extra_bwa_mem", ""),
        reference=config.get("reference", {}).get("fasta", ""),
        tmp_dir="alignment/tmp_realign_{sample}_{type}",
        fgbio_sorted_unmapped="alignment/tmp_realign_{sample}_{type}/fgbio_query_sorted.bam",
    log:
        "alignment/bwa_mem_realign_consensus_reads/{sample}_{type}.umi.bam.log",
    benchmark:
        repeat(
            "alignment/bwa_mem_realign_consensus_reads/{sample}_{type}.umi.bam.benchmark.tsv",
            config.get("bwa_mem_realign_consensus_reads", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("bwa_mem_realign_consensus_reads", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("bwa_mem_realign_consensus_reads", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("bwa_mem_realign_consensus_reads", {}).get(
            "mem_per_cpu", config["default_resources"]["mem_per_cpu"]
        ),
        partition=config.get("bwa_mem_realign_consensus_reads", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("bwa_mem_realign_consensus_reads", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("bwa_mem_realign_consensus_reads", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("bwa_mem_realign_consensus_reads", {}).get("container", config["default_container"])
    message:
        "{rule}: realign unmapped consensus reads found in {input.bam}"
    shell:
        'sh -c "'
        "set -e; "
        "mkdir -p {params.tmp_dir}; "
        "trap 'rm -rf {params.tmp_dir}' EXIT; "

        "fgbio -Xmx16g SortBam -i {input.bam} -s Queryname -o {params.fgbio_sorted_unmapped}; "

        "samtools fastq -n {params.fgbio_sorted_unmapped} | "
        "bwa mem -t {threads} -p -K 150000000 -Y {params.reference} {params.extra_bwa_mem} - | "
        "fgbio -Xmx16g SortBam -i /dev/stdin -s Queryname -o /dev/stdout | "
        "fgbio -Xmx16g ZipperBams "
        "--unmapped {params.fgbio_sorted_unmapped} "
        "--ref {params.reference} "
        "--tags-to-reverse cd ce ad ae bd be aq bq "
        "--tags-to-revcomp ac bc "
        "-o {output.bam}"
        '" >& {log}'

input / output files

Rule parameters	Key	Value	Description
input	bam	`"alignment/fgbio_call_and_filter_consensus_reads/{sample}_{type}.umi.unmapped_bam"`	unmapped `bam` file with consensus reads based on umi barcodes
output	bam	`"alignment/bwa_mem_realign_consensus_reads/{sample}_{type}.umi_unsorted.bam"`	realigned and sorted `bam` file based on umi consensus reads
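
The shell command of this rule runs in two steps: a query-name sort of the unmapped consensus reads, then a streamed pipeline that realigns them and zips the original tags back on. The sketch below only assembles the command strings to make that order explicit. `realign_commands` is a hypothetical helper, and the tag-reversal options and output argument of ZipperBams are left out for brevity.

```python
def realign_commands(input_bam, reference, threads, tmp_dir, extra_bwa_mem=""):
    sorted_unmapped = "%s/fgbio_query_sorted.bam" % tmp_dir
    # Step 1: query-name sort the unmapped consensus reads into a temp file.
    sort_step = "fgbio -Xmx16g SortBam -i %s -s Queryname -o %s" % (
        input_bam,
        sorted_unmapped,
    )
    # Step 2: convert to fastq, realign, sort again and merge tags back.
    stages = [
        "samtools fastq -n %s" % sorted_unmapped,
        "bwa mem -t %d -p -K 150000000 -Y %s %s -" % (threads, reference, extra_bwa_mem),
        "fgbio -Xmx16g SortBam -i /dev/stdin -s Queryname -o /dev/stdout",
        "fgbio -Xmx16g ZipperBams --unmapped %s --ref %s" % (sorted_unmapped, reference),
    ]
    return sort_step, " | ".join(stages)


sort_step, pipeline = realign_commands(
    "sample_N.umi.unmapped_bam", "ref.fasta", 4, "alignment/tmp"
)
print(sort_step)
print(pipeline)
```

The sorted unmapped file is read twice, once as the fastq source and once by ZipperBams, which is why it is written to a temporary directory instead of being streamed.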

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra_bwa_mem	string	parameters that should be forwarded to bwa_mem
extra_sort	string	parameters that should be forwarded to samtools sort
extra_zipper_bam	string	parameters that should be forwarded to fgbio's ZipperBams

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available RECOMMENDATION: Use multiple threads for decreased run time. NOTE: If multiple threads are used, the memory must also be increased (mem_mb)
time	string	max execution time

fgbio_call_and_filter_consensus_reads

Call and filter consensus reads based on UMIs using fgbio (CallDuplexConsensusReads followed by FilterConsensusReads).

Rule

rule fgbio_call_and_filter_consensus_reads:
    input:
        bam="alignment/fgbio_group_reads_by_umi/{sample}_{type}.umi.bam",
    output:
        bam=temp("alignment/fgbio_call_and_filter_consensus_reads/{sample}_{type}.umi.unmapped_bam"),
    params:
        extra_call=config.get("fgbio_call_and_filter_consensus_reads", {}).get("extra_call", ""),
        extra_filter=config.get("fgbio_call_and_filter_consensus_reads", {}).get("extra_filter", ""),
        max_base_error_rate=config.get("fgbio_call_and_filter_consensus_reads", {}).get("max_base_error_rate", "0.2"),
        min_reads_call=config.get("fgbio_call_and_filter_consensus_reads", {}).get("min_reads_call", "1 1 1"),
        min_reads_filter=config.get("fgbio_call_and_filter_consensus_reads", {}).get("min_reads_filter", "1 1 1"),
        min_input_base_quality_call=config.get("fgbio_call_and_filter_consensus_reads", {}).get(
            "min_input_base_quality_call", "20"
        ),
        min_input_base_quality_filter=config.get("fgbio_call_and_filter_consensus_reads", {}).get(
            "min_input_base_quality_filter", "45"
        ),
        reference=config.get("reference", {}).get("fasta", ""),
    log:
        "alignment/fgbio_call_and_filter_consensus_reads/{sample}_{type}.umi.unmapped.bam.log",
    benchmark:
        repeat(
            "alignment/fgbio_call_and_filter_consensus_reads/{sample}_{type}.umi.unmapped.bam.benchmark.tsv",
            config.get("fgbio_call_and_filter_consensus_reads", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("fgbio_call_and_filter_consensus_reads", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("fgbio_call_and_filter_consensus_reads", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("fgbio_call_and_filter_consensus_reads", {}).get(
            "mem_per_cpu", config["default_resources"]["mem_per_cpu"]
        ),
        partition=config.get("fgbio_call_and_filter_consensus_reads", {}).get(
            "partition", config["default_resources"]["partition"]
        ),
        threads=config.get("fgbio_call_and_filter_consensus_reads", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("fgbio_call_and_filter_consensus_reads", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("fgbio_call_and_filter_consensus_reads", {}).get("container", config["default_container"])
    message:
        "{rule}: call and filter consensus reads in {input.bam} into an unmapped bam file"
    shell:
        'sh -c "'
        "fgbio -Xmx4g --compression 0 CallDuplexConsensusReads "
        "--input {input.bam} "
        "--output /dev/stdout "
        "--min-reads {params.min_reads_call} "
        "--min-input-base-quality {params.min_input_base_quality_call} "
        "--threads {threads} "
        "{params.extra_call} "
        "| fgbio -Xmx8g --compression 1 FilterConsensusReads "
        "--input /dev/stdin "
        "--output {output.bam} "
        "--ref {params.reference} "
        "--min-reads {params.min_reads_filter} "
        "--min-base-quality {params.min_input_base_quality_filter} "
        "--max-base-error-rate {params.max_base_error_rate} "
        '{params.extra_filter}" >& {log}'

input / output files

Rule parameters	Key	Value	Description
input	bam	`"alignment/fgbio_group_reads_by_umi/{sample}_{type}.umi.bam"`	input `bam` file with umi tags
output	bam	`"alignment/fgbio_call_and_filter_consensus_reads/{sample}_{type}.umi.unmapped_bam"`	unmapped `bam` file with consensus reads that are hard filtered

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra_call	string	parameters that should be forwarded to CallDuplexConsensusReads
extra_filter	string	parameters that should be forwarded to FilterConsensusReads
max_base_error_rate	string	mask bases with N if the fraction of reads that differ from the consensus is higher than the max error rate in FilterConsensusReads
min_reads_call	string	String of three numbers; reads are filtered if the number of reads is under these numbers in CallDuplexConsensusReads. The first number applies to reads from both strands, while the second and third numbers apply to the individual strands. The first number must be greater than or equal to the other numbers.
min_reads_filter	string	String of three numbers; reads are filtered if the number of reads is under these numbers in FilterConsensusReads. The first number applies to reads from both strands, while the second and third numbers apply to the individual strands. The first number must be greater than or equal to the other numbers.
min_input_base_quality_call	integer	only consider bases over min base quality in consensus creation in CallDuplexConsensusReads
min_input_base_quality_filter	integer	mask bases with N if under min base quality in consensus creation in FilterConsensusReads
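
The `min_reads_call` and `min_reads_filter` settings are strings of three numbers with an ordering constraint. A small sketch of a check that follows the description in the table is given below; `parse_min_reads` is a hypothetical helper, not part of the module, and fgbio itself may enforce further rules.

```python
def parse_min_reads(value):
    # Split a string such as "3 1 1" into three integers.
    parts = [int(p) for p in value.split()]
    if len(parts) != 3:
        raise ValueError("expected three numbers, got %r" % value)
    both_strands, strand_a, strand_b = parts
    # The first number must be >= each of the per-strand numbers.
    if both_strands < strand_a or both_strands < strand_b:
        raise ValueError("first number must be >= the others: %r" % value)
    return both_strands, strand_a, strand_b


print(parse_min_reads("1 1 1"))  # the default used by the rule
print(parse_min_reads("3 2 1"))
```

Running such a check when the configuration is loaded would surface a malformed value before any fgbio job is submitted.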

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available NOTE: must be at least 8 GB
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available RECOMMENDATION: Use multiple threads for decreased run time.
time	string	max execution time

fgbio_group_reads_by_umi

Group and sort reads based on umi using fgbio in preparation for fgbio_call_and_filter_consensus_reads. The mate tags (such as MQ) are added upstream by samblaster in fgbio_copy_umi_from_read_name.

Rule

rule fgbio_group_reads_by_umi:
    input:
        bam="alignment/fgbio_copy_umi_from_read_name/{sample}_{type}.umi.bam",
    output:
        bam="alignment/fgbio_group_reads_by_umi/{sample}_{type}.umi.bam",
        histo="alignment/fgbio_group_reads_by_umi/{sample}_{type}.umi.histo.tsv",
    params:
        extra=config.get("fgbio_group_reads_by_umi", {}).get("extra", ""),
        umi_strategy=config.get("fgbio_group_reads_by_umi", {}).get("umi_strategy", "paired"),
    log:
        "alignment/fgbio_group_reads_by_umi/{sample}_{type}.umi.bam.log",
    benchmark:
        repeat(
            "alignment/fgbio_group_reads_by_umi/{sample}_{type}.umi.bam.benchmark.tsv",
            config.get("fgbio_group_reads_by_umi", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("fgbio_group_reads_by_umi", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("fgbio_group_reads_by_umi", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("fgbio_group_reads_by_umi", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("fgbio_group_reads_by_umi", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("fgbio_group_reads_by_umi", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("fgbio_group_reads_by_umi", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("fgbio_group_reads_by_umi", {}).get("container", config["default_container"])
    message:
        "{rule}: group reads by umi in {input.bam} and output umi sorted bam"
    shell:
        "(fgbio GroupReadsByUmi "
        "-i {input.bam} "
        "-o {output.bam} "
        "-f {output.histo} "
        "-s {params.umi_strategy} "
        "{params.extra}) &> {log}"

input / output files

Rule parameters	Key	Value	Description
input	bam	`"alignment/fgbio_copy_umi_from_read_name/{sample}_{type}.umi.bam"`	input bam file
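output	bam	`"alignment/fgbio_group_reads_by_umi/{sample}_{type}.umi.bam"`	output bam file grouped and sorted by umi
output	histo	`"alignment/fgbio_group_reads_by_umi/{sample}_{type}.umi.histo.tsv"`	family size histogram written by GroupReadsByUmi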

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded
umi_strategy	string	umi strategy for how the umis should be grouped (paired for duplex umis)

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time

fgbio_call_overlapping_consensus_bases

Call consensus bases from overlapping reads.

Rule

rule fgbio_call_overlapping_consensus_bases:
    input:
        bam="alignment/bwa_mem_realign_consensus_reads/{sample}_{type}.umi.bam",
        ref=config.get("reference", {}).get("fasta", ""),
    output:
        bam=temp("alignment/fgbio_call_overlapping_consensus_bases/{sample}_{type}.umi.bam"),
        metrics=temp("alignment/fgbio_call_overlapping_consensus_bases/{sample}_{type}.umi.metrics.txt"),
    params:
        agreement_strategy=config.get("fgbio_call_overlapping_consensus_bases", {}).get("agreement_strategy", "Consensus"),
        disagreement_strategy=config.get("fgbio_call_overlapping_consensus_bases", {}).get("disagreement_strategy", "Consensus"),
        extra=config.get("fgbio_call_overlapping_consensus_bases", {}).get("extra", ""),
        jvm_args=config.get("fgbio_call_overlapping_consensus_bases", {}).get("jvm_args", "-Xmx6g"),
    log:
        "alignment/fgbio_call_overlapping_consensus_bases/{sample}_{type}.umi.bam.log",
    benchmark:
        repeat(
            "alignment/fgbio_call_overlapping_consensus_bases/{sample}_{type}.umi.bam.benchmark.tsv",
            config.get("fgbio_call_overlapping_consensus_bases", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("fgbio_call_overlapping_consensus_bases", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("fgbio_call_overlapping_consensus_bases", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("fgbio_call_overlapping_consensus_bases", {}).get(
            "mem_per_cpu", config["default_resources"]["mem_per_cpu"]
        ),
        partition=config.get("fgbio_call_overlapping_consensus_bases", {}).get(
            "partition", config["default_resources"]["partition"]
        ),
        threads=config.get("fgbio_call_overlapping_consensus_bases", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("fgbio_call_overlapping_consensus_bases", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("fgbio_call_overlapping_consensus_bases", {}).get("container", config["default_container"])
    message:
        "{rule}: call overlapping consensus bases on {input.bam}"
    shell:
        'sh -c "'
        "fgbio {params.jvm_args} CallOverlappingConsensusBases "
        "--input {input.bam} "
        "--output {output.bam} "
        "--metrics {output.metrics} "
        "--ref {input.ref} "
        "--agreement-strategy {params.agreement_strategy} "
        "--disagreement-strategy {params.disagreement_strategy} "
        '{params.extra}" >& {log}'

input / output files

Rule parameters	Key	Value	Description
input	bam	`"alignment/bwa_mem_realign_consensus_reads/{sample}_{type}.umi.bam"`	input query sorted `bam` file
input	ref	`config.get("reference", {}).get("fasta", "")`	genome reference file
output	bam	`"alignment/fgbio_call_overlapping_consensus_bases/{sample}_{type}.umi.bam"`	output `bam` file with corrected overlapping consensus bases
output	metrics	`"alignment/fgbio_call_overlapping_consensus_bases/{sample}_{type}.umi.metrics.txt"`	output `metrics` file with statistics of the overlapping consensus bases correction

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
agreement_strategy	string	set agreement strategy for fgbio_call_overlapping_consensus_bases
disagreement_strategy	string	set disagreement strategy for fgbio_call_overlapping_consensus_bases
extra	string	parameters that should be forwarded
jvm_args	string	set jvm args for fgbio_call_overlapping_consensus_bases

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time

fgbio_copy_umi_from_read_name

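Copies the UMI at the end of the BAM’s read name to the RX tag using fgbio in preparation for fgbio_group_reads_by_umi. Secondary and supplementary alignments are removed with samtools, and mate tags are added with samblaster before the UMI is copied.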
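
As a rough illustration of what "the UMI at the end of the read name" means, the sketch below takes the last field of a read name. It assumes a colon-delimited read name and a UMI made of the characters A, C, G, T, N, `-` or `+`; the helper `umi_from_read_name` and the example read names are hypothetical, and fgbio's own parsing may differ in detail.

```python
def umi_from_read_name(read_name, delimiter=":"):
    # Take the last delimiter-separated field of the read name as the UMI.
    umi = read_name.rsplit(delimiter, 1)[-1]
    # Assumed validity check: only bases and the duplex separators - or +.
    if not umi or set(umi) - set("ACGTN-+"):
        raise ValueError("no UMI found at the end of %r" % read_name)
    return umi


single = umi_from_read_name("A00123:45:HXXXX:1:1101:1000:2000:ACGTACGT")
duplex = umi_from_read_name("A00123:45:HXXXX:1:1101:1000:2000:ACG-TTA")
print(single, duplex)
```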

Rule

rule fgbio_copy_umi_from_read_name:
    input:
        bam="alignment/bwa_mem/{sample}_{type}.umi.bam",
    output:
        bam=temp("alignment/fgbio_copy_umi_from_read_name/{sample}_{type}.umi.bam"),
    params:
        extra=config.get("fgbio_copy_umi_from_read_name", {}).get("extra", ""),
    log:
        "alignment/fgbio_copy_umi_from_read_name/{sample}_{type}.umi.bam.log",
    benchmark:
        repeat(
            "alignment/fgbio_copy_umi_from_read_name/{sample}_{type}.umi.bam.benchmark.tsv",
            config.get("fgbio_copy_umi_from_read_name", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("fgbio_copy_umi_from_read_name", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("fgbio_copy_umi_from_read_name", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("fgbio_copy_umi_from_read_name", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("fgbio_copy_umi_from_read_name", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("fgbio_copy_umi_from_read_name", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("fgbio_copy_umi_from_read_name", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("fgbio_copy_umi_from_read_name", {}).get("container", config["default_container"])
    message:
        "{rule}: Copy UMI from read name to sam tag on {input.bam}"
    shell:
        'sh -c "'
        "(samtools view "
        "-h "
        "-F 0x900 "
        "{input.bam} "
        "| samblaster "
        "--addMateTags "
        "--ignoreUnmated "
        "| fgbio CopyUmiFromReadName "
        "-i /dev/stdin "
        "-o {output.bam} "
        '{params.extra})" &> {log}'

input / output files

Rule parameters	Key	Value	Description
input	bam	`"alignment/bwa_mem/{sample}_{type}.umi.bam"`	input `bam` file
output	bam	`"alignment/fgbio_copy_umi_from_read_name/{sample}_{type}.umi.bam"`	Output `bam` file with umi tag added (default tag name RX) extracted from the read name
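
The `-F 0x900` option in the rule's `samtools view` call excludes records that have either the secondary (0x100) or the supplementary (0x800) flag set. The sketch below reproduces that bit test; `passes_filter` is a hypothetical helper name and the example flag values are illustrative.

```python
SECONDARY = 0x100      # SAM flag: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag: supplementary alignment
EXCLUDE_MASK = SECONDARY | SUPPLEMENTARY  # equals 0x900


def passes_filter(flag, exclude_mask=EXCLUDE_MASK):
    # samtools view -F drops a record if any bit of the mask is set.
    return (flag & exclude_mask) == 0


primary = 99                       # paired, proper pair, mate reverse, first in pair
secondary = primary | SECONDARY
supplementary = primary | SUPPLEMENTARY

print(passes_filter(primary), passes_filter(secondary), passes_filter(supplementary))
```

Only primary alignments reach samblaster and CopyUmiFromReadName, so each read pair is represented by a single record per mate.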

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available RECOMMENDATION: Use multiple threads for decreased run time.
time	string	max execution time

minimap2_align

Align long read sequencing data stored in a bam file to a reference genome to produce a bam file with aligned reads.

Rule

rule minimap2_align:
    input:
        query=lambda wildcards: get_minimap2_query(wildcards),
        target=expand(
            "alignment/minimap2_index/{ref}.{preset}.mmi",
            ref=os.path.basename(config.get("reference", {}).get("fasta", "")),
            preset=config.get("minimap2_align", {}).get("preset", ""),
        ),
    output:
        bam=temp("alignment/minimap2_align/{sample}_{type}_{processing_unit}_{barcode}.bam"),
    params:
        extra=lambda wildcards, input: "%s %s -x %s"
        % (
            config.get("minimap2_align", {}).get("extra", ""),
            config.get("minimap2_align", {}).get("read_group", generate_minimap2_read_group(wildcards, input)),
            config.get("minimap2_align", {}).get("preset", ""),
        ),
        sorting=config.get("minimap2_align", {}).get("sort_order", "coordinate"),
        sort_extra=config.get("minimap2_align", {}).get("sort_extra", ""),
    log:
        "alignment/minimap2_align/{sample}_{type}_{processing_unit}_{barcode}.bam.log",
    benchmark:
        repeat(
            "alignment/minimap2_align/{sample}_{type}_{processing_unit}_{barcode}.bam.benchmark.tsv",
            config.get("minimap2_align", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("minimap2_align", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("minimap2_align", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("minimap2_align", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("minimap2_align", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("minimap2_align", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("minimap2_align", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("minimap2_align", {}).get("container", config["default_container"])
    message:
        "{rule}: run minimap2 to align reads from {input.query} to {input.target}"
    wrapper:
        "v4.3.0/bio/minimap2/aligner"

input / output files

Rule parameters	Key	Value	Description
input	target	`expand( "alignment/minimap2_index/{ref}.{preset}.mmi", ref=os.path.basename(config.get("reference", {}).get("fasta", "")), preset=config.get("minimap2_align", {}).get("preset", ""), )`	a minimap2 index file for the reference genome
input	query	`lambda wildcards: get_minimap2_query(wildcards)`	bam file with unaligned reads
output	bam	`"alignment/minimap2_align/{sample}_{type}_{processing_unit}_{barcode}.bam"`	bam file with aligned reads (Note that the methylation tags will also be present in the aligned bam file)
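
The index path used as `target` is derived from the reference fasta file name and the configured preset. The sketch below reproduces that derivation with `str.format` instead of Snakemake's `expand`, so it returns a single string rather than a list; `minimap2_index_path` and the paths in `config` are illustrative assumptions.

```python
import os


def minimap2_index_path(config):
    # Mirrors the expand() call: basename of the fasta plus the preset.
    ref = os.path.basename(config.get("reference", {}).get("fasta", ""))
    preset = config.get("minimap2_align", {}).get("preset", "")
    return "alignment/minimap2_index/{ref}.{preset}.mmi".format(ref=ref, preset=preset)


config = {
    "reference": {"fasta": "/data/ref/genome.fasta"},  # illustrative path
    "minimap2_align": {"preset": "map-ont"},
}
index_path = minimap2_index_path(config)
print(index_path)
```

With neither setting present, the same expression yields `alignment/minimap2_index/..mmi`, so both `reference: fasta` and `minimap2_align: preset` need to be configured for the path to be meaningful.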

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded
mmi	string	fasta reference mmi file (generated with the same preset as specified in the config)
preset	string	minimap2 preset options for various types of long read sequencing data (e.g., map-hifi or map-ont)
read_group	string	RG string will be added to bam file generated. The RG string is generated by the function generate_minimap2_read_group defined in `common.smk`
sort_order	string	how the bam file should be sorted (default coordinate)
sort_extra	string	parameters that should be forwarded to sorting (NB. do not set -@ or --threads here, this is set from {threads})
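
Because the thread count for sorting is taken from `{threads}`, a thread option in `sort_extra` would conflict with it. One way to catch this early is a small check such as the sketch below; `check_sort_extra` is a hypothetical helper and is not part of the module.

```python
import shlex


def check_sort_extra(sort_extra):
    # Reject thread options, since the thread count comes from {threads}.
    for token in shlex.split(sort_extra):
        if token.startswith("-@") or token.startswith("--threads"):
            raise ValueError("do not set threads in sort_extra: %r" % token)
    return sort_extra


print(check_sort_extra("-m 2G"))  # accepted unchanged
```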

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time

minimap2_index

Prepare reference index file for minimap2.

Rule

rule minimap2_index:
    input:
        target=config.get("reference", {}).get("fasta", ""),
    output:
        mmi=expand(
            "alignment/minimap2_index/{ref}.{preset}.mmi",
            ref=os.path.basename(config.get("reference", {}).get("fasta", "")),
            preset=config.get("minimap2_align", {}).get("preset", ""),
        ),
    params:
        extra=set_minimap2_preset,
    log:
        "alignment/minimap2_index/minimap2_index.log",
    benchmark:
        repeat(
            "alignment/minimap2_index/minimap2_index.benchmark.tsv", config.get("minimap2_index", {}).get("benchmark_repeats", 1)
        )
    threads: config.get("minimap2_index", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("minimap2_index", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("minimap2_index", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("minimap2_index", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("minimap2_index", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("minimap2_index", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("minimap2_index", {}).get("container", config["default_container"])
    message:
        "{rule}: index {input.target} with minimap2"
    wrapper:
        "v4.3.0/bio/minimap2/index"

input / output files

Rule parameters	Key	Value	Description
input	target	`config.get("reference", {}).get("fasta", "")`	reference/target genome
output	mmi	`expand( "alignment/minimap2_index/{ref}.{preset}.mmi", ref=os.path.basename(config.get("reference", {}).get("fasta", "")), preset=config.get("minimap2_align", {}).get("preset", ""), )`	minimap2 index file of the target/reference genome
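
The index file name is derived from two configuration values: the base name of the reference fasta and the preset. Note that the preset is read from the `minimap2_align` section, not from `minimap2_index`. The following is a plain-Python sketch of the path that `expand` yields for a single reference and preset; the fasta path and preset value are hypothetical and Snakemake itself is not imported.

```python
import os

# Hypothetical configuration values; the fasta path and preset are illustrative only.
config = {
    "reference": {"fasta": "/data/ref/genome.fasta"},
    "minimap2_align": {"preset": "map-ont"},
}

# Same lookups as in the rule's output section.
ref = os.path.basename(config.get("reference", {}).get("fasta", ""))
preset = config.get("minimap2_align", {}).get("preset", "")

# Plain-Python equivalent of the single path produced for one ref and one preset.
mmi = "alignment/minimap2_index/{ref}.{preset}.mmi".format(ref=ref, preset=preset)
print(mmi)
```

Because the preset is part of the file name, changing the preset leads to a new index file rather than reuse of the old one.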

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time
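
All rules in this module resolve their resources with the same two-level lookup: a rule-specific value is used when present, otherwise the value under `default_resources` applies. The sketch below reproduces that lookup in plain Python. The configuration values and the helper name `resolve_resource` are assumptions made for illustration and are not part of the module.

```python
# Hypothetical configuration; all values are illustrative only.
config = {
    "default_resources": {
        "threads": 1,
        "mem_mb": 6144,
        "mem_per_cpu": 6144,
        "partition": "core",
        "time": "4:00:00",
    },
    "minimap2_index": {"threads": 8, "mem_mb": 49152},
}


def resolve_resource(config, rule, key):
    """Mirror config.get(rule, {}).get(key, config["default_resources"][key])."""
    return config.get(rule, {}).get(key, config["default_resources"][key])


index_threads = resolve_resource(config, "minimap2_index", "threads")  # rule-specific value
index_time = resolve_resource(config, "minimap2_index", "time")  # falls back to the default
merge_mem = resolve_resource(config, "minimap2_merge", "mem_mb")  # rule absent: default is used
```

A rule that has no entry at all therefore still runs, using only the defaults.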

minimap2_merge

Merge minimap2 .bam files from the same sample using samtools merge.

Rule

rule minimap2_merge:
    input:
        bams=lambda wildcards: [
            "alignment/minimap2_align/{sample}_{type}_%s_%s.bam" % (u.processing_unit, u.barcode)
            for u in get_units(units, wildcards)
        ],
    output:
        bam=temp("alignment/minimap2_align/{sample}_{type}.bam"),
    params:
        extra=config.get("minimap2_merge", {}).get("extra", ""),
    log:
        "alignment/minimap2_align/{sample}_{type}.bam.log",
    benchmark:
        repeat(
            "alignment/minimap2_align/{sample}_{type}.bam.benchmark.tsv",
            config.get("minimap2", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("minimap2_merge", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("minimap2_merge", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("minimap2_merge", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("minimap2_merge", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("minimap2_merge", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("minimap2_merge", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("minimap2_merge", {}).get("container", config["default_container"])
    message:
        "{rule}: merge {input.bams} using samtools merge"
    wrapper:
        "v3.9.0/bio/samtools/merge"

input / output files

Rule parameters	Key	Value	Description
input	bams	`lambda wildcards: [ "alignment/minimap2_align/{sample}_{type}_%s_%s.bam" % (u.processing_unit, u.barcode) for u in get_units(units, wildcards) ]`	bam files from the same sample (and the same sample type); the list of bam files is obtained using the information in the `units.tsv` file
output	bam	`"alignment/minimap2_align/{sample}_{type}.bam"`	a sorted merged bam file
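
The input function yields one bam path per unit of the sample. As a rough sketch of that list construction, the example below uses a hypothetical `Unit` tuple in place of the rows returned by `get_units`, and fills in the sample and type directly, whereas the rule leaves `{sample}` and `{type}` as wildcards for Snakemake to substitute.

```python
from collections import namedtuple

# Hypothetical stand-in for rows of units.tsv; only the two columns used by the rule are modelled.
Unit = namedtuple("Unit", ["processing_unit", "barcode"])


def merge_inputs(sample, sample_type, sample_units):
    """Build one per-unit bam path for each unit of a sample."""
    return [
        "alignment/minimap2_align/%s_%s_%s_%s.bam" % (sample, sample_type, u.processing_unit, u.barcode)
        for u in sample_units
    ]


# Illustrative sample with two processing units sharing one barcode.
sample_units = [Unit("run1", "bc01"), Unit("run2", "bc01")]
bams = merge_inputs("NA12878", "N", sample_units)
```

Each entry in `bams` corresponds to one output of `minimap2_align`, and the merged file drops the unit and barcode parts of the name.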

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded. RECOMMENDATION: use -c -p to keep only one of the read group IDs when merging files from the same sample and flowcell

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time

pbmm2_align

Align long-read sequencing data stored in a bam file to a reference genome to produce a bam file with aligned reads.

Rule

rule pbmm2_align:
    input:
        query=lambda wildcards: get_minimap2_query(wildcards),
        reference=expand(
            "alignment/pbmm2_index/{ref}.{preset}.mmi",
            ref=os.path.basename(config.get("reference", {}).get("fasta", "")),
            preset=config.get("pbmm2_align", {}).get("preset", ""),
        ),
    output:
        bam=temp("alignment/pbmm2_align/{sample}_{type}_{processing_unit}_{barcode}.bam"),
    params:
        preset=config.get("pbmm2_align", {}).get("preset", ""),
        sample=lambda wildcards: f"{wildcards.sample}_{wildcards.type}",
        loglevel="INFO",
        extra=" --sort %s " % (config.get("pbmm2_align", {}).get("extra", "")),
    log:
        bam="alignment/pbmm2_align/{sample}_{type}_{processing_unit}_{barcode}.bam.log",
    benchmark:
        repeat(
            "alignment/pbmm2_align/{sample}_{type}_{processing_unit}_{barcode}.bam.benchmark.tsv",
            config.get("pbmm2_align", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("pbmm2_align", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("pbmm2_align", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("pbmm2_align", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("pbmm2_align", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("pbmm2_align", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("pbmm2_align", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("pbmm2_align", {}).get("container", config["default_container"])
    message:
        "{rule}: Align reads in {input.query} against {input.reference}"
    wrapper:
        "v4.3.0/bio/pbmm2/align"

input / output files

Rule parameters	Key	Value	Description
input	query	`lambda wildcards: get_minimap2_query(wildcards)`	bam file of unaligned pacbio reads
input	reference	`expand( "alignment/pbmm2_index/{ref}.{preset}.mmi", ref=os.path.basename(config.get("reference", {}).get("fasta", "")), preset=config.get("pbmm2_align", {}).get("preset", ""), )`	pbmm2 reference index file
output	bam	`"alignment/pbmm2_align/{sample}_{type}_{processing_unit}_{barcode}.bam"`	bam file with aligned reads (Note that the methylation tags will also be present in the aligned bam file)
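
Two of the rule's params are computed rather than read directly: `sample` joins the sample and type wildcards, and `extra` always starts with `--sort` before any user-supplied options. A small sketch of how these resolve is shown below; the wildcard values and the `extra` setting are hypothetical.

```python
from types import SimpleNamespace

# Hypothetical configuration and wildcard values, for illustration only.
config = {"pbmm2_align": {"preset": "HIFI", "extra": "--unmapped"}}
wildcards = SimpleNamespace(sample="NA12878", type="N")

# Same expressions as in the rule's params section.
sample = f"{wildcards.sample}_{wildcards.type}"
extra = " --sort %s " % (config.get("pbmm2_align", {}).get("extra", ""))

# With no extra configured, only --sort remains.
extra_default = " --sort %s " % ({}.get("extra", ""))
```

Since `--sort` is added by the rule itself, it does not need to be repeated in the `extra` setting.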

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded
preset	string	pbmm2 preset for the type of PacBio sequencing data (e.g., HIFI)

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time

pbmm2_index

Prepare reference index file for pbmm2.

Rule

rule pbmm2_index:
    input:
        reference=config.get("reference", {}).get("fasta", ""),
    output:
        mmi=expand(
            "alignment/pbmm2_index/{ref}.{preset}.mmi",
            ref=os.path.basename(config.get("reference", {}).get("fasta", "")),
            preset=config.get("pbmm2_align", {}).get("preset", ""),
        ),
    params:
        preset=config.get("pbmm2_align", {}).get("preset", ""),
        extra=config.get("pbmm2_index", {}).get("extra", ""),
    log:
        "alignment/pbmm2_index/pbmm2_index.log",
    benchmark:
        repeat("alignment/pbmm2_index/pbmm2_index.benchmark.tsv", config.get("pbmm2_index", {}).get("benchmark_repeats", 1))
    threads: config.get("pbmm2_index", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("pbmm2_index", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("pbmm2_index", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("pbmm2_index", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("pbmm2_index", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("pbmm2_index", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("pbmm2_index", {}).get("container", config["default_container"])
    message:
        "{rule}: index {input.reference} with pbmm2"
    wrapper:
        "v3.9.0/bio/pbmm2/index"

input / output files

Rule parameters	Key	Value	Description
input	reference	`config.get("reference", {}).get("fasta", "")`	target/reference genome fasta file
output	mmi	`expand( "alignment/pbmm2_index/{ref}.{preset}.mmi", ref=os.path.basename(config.get("reference", {}).get("fasta", "")), preset=config.get("pbmm2_align", {}).get("preset", ""), )`	pbmm2 index file of the target/reference genome

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded
preset	string	preset for indexing the target genome

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time

pbmm2_merge

Merge pbmm2 .bam files from the same sample using samtools merge.

Rule

rule pbmm2_merge:
    input:
        bams=lambda wildcards: [
            "alignment/pbmm2_align/{sample}_{type}_%s_%s.bam" % (u.processing_unit, u.barcode)
            for u in get_units(units, wildcards)
        ],
    output:
        bam=temp("alignment/pbmm2_align/{sample}_{type}.bam"),
    params:
        extra=config.get("pbmm2_merge", {}).get("extra", ""),
    log:
        "alignment/pbmm2_align/{sample}_{type}.bam.log",
    benchmark:
        repeat(
            "alignment/pbmm2_align/{sample}_{type}.bam.benchmark.tsv",
            config.get("pbmm2_align", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("pbmm2_merge", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("pbmm2_merge", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("pbmm2_merge", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("pbmm2_merge", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("pbmm2_merge", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("pbmm2_merge", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("pbmm2_merge", {}).get("container", config["default_container"])
    message:
        "{rule}: merge bam file {input} using samtools"
    wrapper:
        "v3.9.0/bio/samtools/merge"

input / output files

Rule parameters	Key	Value	Description
input	bams	`lambda wildcards: [ "alignment/pbmm2_align/{sample}_{type}_%s_%s.bam" % (u.processing_unit, u.barcode) for u in get_units(units, wildcards) ]`	bam files from the same sample (and the same sample type); the list of bam files is obtained using the information in the `units.tsv` file
output	bam	`"alignment/pbmm2_align/{sample}_{type}.bam"`	a sorted merged bam file

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time

picard_mark_duplicates

Generate a bam file for a single chromosome with duplicates marked.

Rule

rule picard_mark_duplicates:
    input:
        bams="alignment/samtools_extract_reads/{sample}_{type}_{chr}.bam",
    output:
        bam=temp("alignment/picard_mark_duplicates/{sample}_{type}_{chr}.bam"),
        metrics=temp("alignment/picard_mark_duplicates/{sample}_{type}_{chr}.metrics.txt"),
    params:
        extra=config.get("picard_mark_duplicates", {}).get("extra", ""),
    log:
        "alignment/picard_mark_duplicates/{sample}_{type}_{chr}.bam.log",
    benchmark:
        repeat(
            "alignment/picard_mark_duplicates/{sample}_{type}_{chr}.bam.benchmark.tsv",
            config.get("picard_mark_duplicates", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("picard_mark_duplicates", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("picard_mark_duplicates", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("picard_mark_duplicates", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("picard_mark_duplicates", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("picard_mark_duplicates", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("picard_mark_duplicates", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("picard_mark_duplicates", {}).get("container", config["default_container"])
    message:
        "{rule}: mark duplicates in {input} using picard"
    wrapper:
        "v1.25.0/bio/picard/markduplicates"

input / output files

Rule parameters	Key	Value	Description
input	bams	`"alignment/samtools_extract_reads/{sample}_{type}_{chr}.bam"`	bam file containing one chromosome
output	bam	`"alignment/picard_mark_duplicates/{sample}_{type}_{chr}.bam"`	duplicate marked bam file containing one chromosome
output	metrics	`"alignment/picard_mark_duplicates/{sample}_{type}_{chr}.metrics.txt"`	duplicate statistics for qc
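
Because the rule works on one chromosome at a time, each chromosome gets its own input, output and metrics file. The sketch below lists those paths for a few chromosomes; the helper name `per_chromosome_paths` and the sample values are hypothetical.

```python
def per_chromosome_paths(sample, sample_type, chromosomes):
    """Map each chromosome to the (input bam, marked bam, metrics) paths used by the rule."""
    stem = "%s_%s" % (sample, sample_type)
    return {
        chrom: (
            "alignment/samtools_extract_reads/%s_%s.bam" % (stem, chrom),
            "alignment/picard_mark_duplicates/%s_%s.bam" % (stem, chrom),
            "alignment/picard_mark_duplicates/%s_%s.metrics.txt" % (stem, chrom),
        )
        for chrom in chromosomes
    }


# Illustrative sample and chromosome names.
paths = per_chromosome_paths("NA12878", "T", ["chr1", "chr2"])
```

The first element of each tuple is the output of `samtools_extract_reads` for the same chromosome.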

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
time	string	max execution time
threads	integer	number of threads to be available

picard_mark_duplicates_non_chr

Generate a bam file for non-chromosomal contigs and unmapped reads with duplicates marked.

Rule

rule picard_mark_duplicates_non_chr:
    input:
        bams="alignment/samtools_extract_reads/{sample}_{type}_non_chr.bam",
    output:
        bam=temp("alignment/picard_mark_duplicates/{sample}_{type}_non_chr.bam"),
        metrics=temp("alignment/picard_mark_duplicates/{sample}_{type}_non_chr.metrics.txt"),
    params:
        extra=config.get("picard_mark_duplicates_non_chr", {}).get("extra", ""),
    log:
        "alignment/picard_mark_duplicates_non_chr/{sample}_{type}.output.log",
    benchmark:
        repeat(
            "alignment/picard_mark_duplicates_non_chr/{sample}_{type}.output.benchmark.tsv",
            config.get("picard_mark_duplicates_non_chr", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("picard_mark_duplicates_non_chr", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("picard_mark_duplicates_non_chr", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("picard_mark_duplicates_non_chr", {}).get(
            "mem_per_cpu", config["default_resources"]["mem_per_cpu"]
        ),
        partition=config.get("picard_mark_duplicates_non_chr", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("picard_mark_duplicates_non_chr", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("picard_mark_duplicates_non_chr", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("picard_mark_duplicates", {}).get("container", config["default_container"])
    message:
        "{rule}: mark duplicates in {input.bams} using picard"
    wrapper:
        "v1.25.0/bio/picard/markduplicates"

input / output files

Rule parameters	Key	Value	Description
input	bams	`"alignment/samtools_extract_reads/{sample}_{type}_non_chr.bam"`	bam file containing non-chromosomal contigs requested in the config and unmapped reads
output	bam	`"alignment/picard_mark_duplicates/{sample}_{type}_non_chr.bam"`	duplicate marked bam file containing non-chromosomal contigs requested in the config and unmapped reads
output	metrics	`"alignment/picard_mark_duplicates/{sample}_{type}_non_chr.metrics.txt"`	duplicate statistics for qc

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
time	string	max execution time
threads	integer	number of threads to be available

samtools_extract_reads

Extract reads from each chromosome and put them into separate .bam files using samtools view.

Rule

rule samtools_extract_reads:
    input:
        bam="alignment/bwa_mem/{sample}_{type}.bam",
        bai="alignment/bwa_mem/{sample}_{type}.bam.bai",
    output:
        bam=temp("alignment/samtools_extract_reads/{sample}_{type}_{chr}.bam"),
    params:
        extra=config.get("samtools_extract_reads", {}).get("extra", ""),
    log:
        "alignment/samtools_extract_reads/{sample}_{type}_{chr}.bam.log",
    benchmark:
        repeat(
            "alignment/samtools_extract_reads/{sample}_{type}_{chr}.bam.benchmark.tsv",
            config.get("samtools_extract_reads", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("samtools_extract_reads", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("samtools_extract_reads", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("samtools_extract_reads", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("samtools_extract_reads", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("samtools_extract_reads", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("samtools_extract_reads", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("samtools_extract_reads", {}).get("container", config["default_container"])
    message:
        "{rule}: create bam {output} with only reads from {wildcards.chr}"
    shell:
        "(samtools view -@ {threads} {params.extra} -b {input} {wildcards.chr} > {output}) &> {log}"

input / output files

Rule parameters	Key	Value	Description
input	bam	`"alignment/bwa_mem/{sample}_{type}.bam"`	bam file
input	bai	`"alignment/bwa_mem/{sample}_{type}.bam.bai"`	bam index file
output	bam	`"alignment/samtools_extract_reads/{sample}_{type}_{chr}.bam"`	one bam file for each chromosome

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
time	string	max execution time
threads	integer	number of threads to be available

samtools_extract_reads_non_chr

Extract reads from non-chromosomal contigs and unmapped reads to separate .bam files using samtools view.

Rule

rule samtools_extract_reads_non_chr:
    input:
        bam="alignment/bwa_mem/{sample}_{type}.bam",
        bai="alignment/bwa_mem/{sample}_{type}.bam.bai",
    output:
        bam=temp("alignment/samtools_extract_reads/{sample}_{type}_non_chr.bam"),
    params:
        contigs=get_contig_list,
        extra=config.get("samtools_extract_reads_non_chr", {}).get("extra", ""),
    log:
        "alignment/samtools_extract_reads/{sample}_{type}_non_chr.bam.log",
    benchmark:
        repeat(
            "alignment/samtools_extract_reads/{sample}_{type}_non_chr.bam.benchmark.tsv",
            config.get("samtools_extract_reads_non_chr", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("samtools_extract_reads_non_chr", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("samtools_extract_reads_non_chr", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("samtools_extract_reads_non_chr", {}).get(
            "mem_per_cpu", config["default_resources"]["mem_per_cpu"]
        ),
        partition=config.get("samtools_extract_reads_non_chr", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("samtools_extract_reads_non_chr", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("samtools_extract_reads_non_chr", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("samtools_extract_reads_non_chr", {}).get("container", config["default_container"])
    message:
        "{rule}: create bam {output} with only reads from {params.contigs}"
    shell:
        "(samtools view -@ {threads} {params.extra} -b {input} {params.contigs} '*' > {output}) &> {log}"

input / output files

Rule parameters	Key	Value	Description
input	bam	`"alignment/bwa_mem/{sample}_{type}.bam"`	bam file
input	bai	`"alignment/bwa_mem/{sample}_{type}.bam.bai"`	bam index file
output	bam	`"alignment/samtools_extract_reads/{sample}_{type}_non_chr.bam"`	one bam file containing reads from the non-chromosomal contigs requested using `merge_contigs` in the config, together with unmapped reads
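
The shell command passes the requested contigs as regions, followed by the region `'*'`, which makes samtools view also output the unmapped reads stored at the end of the file. The sketch below assembles such a command as an argument list. It assumes that the contigs are available as a list of names and that only the bam path is given on the command line (samtools locates the index next to it); the helper name `view_command` and the contig names are hypothetical.

```python
import shlex


def view_command(bam, contigs, threads=1, extra=""):
    """Assemble a samtools view argument list for the given contigs plus unmapped reads."""
    cmd = ["samtools", "view", "-@", str(threads)]
    cmd += shlex.split(extra)  # user-supplied options, if any
    cmd += ["-b", bam]
    cmd += list(contigs)  # one region per requested contig
    cmd.append("*")  # the region '*' selects unmapped reads at the end of the file
    return cmd


# Illustrative bam path and contig names.
cmd = view_command("alignment/bwa_mem/NA12878_N.bam", ["chrM", "chrUn_KI270742v1"], threads=2)
rendered = shlex.join(cmd)  # quotes the '*' so the shell does not expand it
```

Quoting the `*` matters: unquoted, the shell would expand it to the file names in the working directory.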

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time

samtools_extract_reads_umi

Extract reads from each chromosome of the realigned UMI consensus .bam file and put them into separate .bam files using samtools view.

Rule

rule samtools_extract_reads_umi:
    input:
        bam="alignment/bwa_mem_realign_consensus_reads/{sample}_{type}.umi.bam",
        bai="alignment/bwa_mem_realign_consensus_reads/{sample}_{type}.umi.bam.bai",
    output:
        bam=temp("alignment/samtools_extract_reads_umi/{sample}_{type}_{chr}.umi.bam"),
    params:
        extra=config.get("samtools_extract_reads", {}).get("extra", ""),
    log:
        "alignment/samtools_extract_reads_umi/{sample}_{type}_{chr}.umi.bam.log",
    benchmark:
        repeat(
            "alignment/samtools_extract_reads_umi/{sample}_{type}_{chr}.bam.benchmark.tsv",
            config.get("samtools_extract_reads_umi", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("samtools_extract_reads_umi", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("samtools_extract_reads_umi", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("samtools_extract_reads_umi", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("samtools_extract_reads_umi", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("samtools_extract_reads_umi", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("samtools_extract_reads_umi", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("samtools_extract_reads_umi", {}).get("container", config["default_container"])
    message:
        "{rule}: create bam {output} with only reads from {wildcards.chr}"
    shell:
        "(samtools view -@ {threads} {params.extra} -b {input} {wildcards.chr} > {output}) &> {log}"

input / output files

Rule parameters	Key	Value	Description
input	bam	`"alignment/bwa_mem_realign_consensus_reads/{sample}_{type}.umi.bam"`	bam file
input	bai	`"alignment/bwa_mem_realign_consensus_reads/{sample}_{type}.umi.bam.bai"`	bam index file
output	bam	`"alignment/samtools_extract_reads_umi/{sample}_{type}_{chr}.umi.bam"`	one bam file for each chromosome

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
time	string	max execution time
threads	integer	number of threads to be available

samtools_extract_reads_non_chr_umi

Extract reads from the contigs specified by `merge_contigs`, together with unmapped reads, from the realigned UMI consensus .bam file and put them into one .bam file using samtools view.

Rule

rule samtools_extract_reads_non_chr_umi:
    input:
        bam="alignment/bwa_mem_realign_consensus_reads/{sample}_{type}.umi.bam",
        bai="alignment/bwa_mem_realign_consensus_reads/{sample}_{type}.umi.bam.bai",
    output:
        bam=temp("alignment/samtools_extract_reads/{sample}_{type}_non_chr.umi.bam"),
    params:
        contigs=get_contig_list,
        extra=config.get("samtools_extract_reads_non_chr_umi", {}).get("extra", ""),
    log:
        "alignment/samtools_extract_reads_non_chr_umi/{sample}_{type}_non_chr.umi.bam.log",
    benchmark:
        repeat(
            "alignment/samtools_extract_reads_non_chr_umi/{sample}_{type}_non_chr.umi.bam.benchmark.tsv",
            config.get("samtools_extract_reads_non_chr_umi", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("samtools_extract_reads_non_chr_umi", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("samtools_extract_reads_non_chr_umi", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("samtools_extract_reads_non_chr_umi", {}).get(
            "mem_per_cpu", config["default_resources"]["mem_per_cpu"]
        ),
        partition=config.get("samtools_extract_reads_non_chr_umi", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("samtools_extract_reads_non_chr_umi", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("samtools_extract_reads_non_chr_umi", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("samtools_extract_reads_non_chr_umi", {}).get("container", config["default_container"])
    message:
        "{rule}: create bam {output} with only reads from {params.contigs}"
    shell:
        "(samtools view -@ {threads} {params.extra} -b {input} {params.contigs} '*' > {output}) &> {log}"

input / output files

Rule parameters	Key	Value	Description
input	bam	`"alignment/bwa_mem_realign_consensus_reads/{sample}_{type}.umi.bam"`	bam file
input	bai	`"alignment/bwa_mem_realign_consensus_reads/{sample}_{type}.umi.bam.bai"`	bam index file
output	bam	`"alignment/samtools_extract_reads/{sample}_{type}_non_chr.umi.bam"`	one bam file containing reads from the non-chromosomal contigs requested using `merge_contigs` in the config, together with unmapped reads

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu
partition	string	partition to use on cluster
threads	integer	number of threads to be available
time	string	max execution time

samtools_fastq

Convert a bam file to separate fastq files.

Rule

rule samtools_fastq:
    input:
        bam="alignment/fgbio_call_and_filter_consensus_reads/{sample}_{type}.umi.unmapped_bam",
    output:
        fastq1="alignment/samtools_fastq/{sample}_{type}.fastq1.umi.fastq.gz",
        fastq2="alignment/samtools_fastq/{sample}_{type}.fastq2.umi.fastq.gz",
    params:
        sort=config.get("samtools_fastq", {}).get("sort", "-m 4G"),
        fastq=config.get("samtools_fastq", {}).get("fastq", "-n"),
    log:
        "alignment/samtools_fastq/{sample}_{type}.output.log",
    benchmark:
        repeat(
            "alignment/samtools_fastq/{sample}_{type}.output.benchmark.tsv",
            config.get("samtools_fastq", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("samtools_fastq", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("samtools_fastq", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("samtools_fastq", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("samtools_fastq", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("samtools_fastq", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("samtools_fastq", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("samtools_fastq", {}).get("container", config["default_container"])
    message:
        "{rule}: Convert the bam file {input.bam} into a fastq file"
    wrapper:
        "v2.6.0/bio/samtools/fastq/separate"

input / output files

Rule parameters	Key	Value	Description
input	bam	`"alignment/fgbio_call_and_filter_consensus_reads/{sample}_{type}.umi.unmapped_bam"`	input bam file
output	fastq1	`"alignment/samtools_fastq/{sample}_{type}.fastq1.umi.fastq.gz"`	fastq file with the first read in the read pair
output	fastq2	`"alignment/samtools_fastq/{sample}_{type}.fastq2.umi.fastq.gz"`	fastq file with the second read in the read pair

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
sort	string	parameters that should be forwarded to samtools sort
fastq	string	parameters that should be forwarded to samtools fastq

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory in MB used per cpu. NOTE: should be at least the amount of memory set in the sort parameters
partition	string	partition to use on cluster
time	string	max execution time
threads	integer	number of threads to be available. RECOMMENDATION: use at least 3 threads so that sort gets at least 1 exclusive thread. NOTE: if multiple threads are used, the memory (mem_mb) must also be increased
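
The note on `mem_per_cpu` can be checked before a run. The sketch below parses the `-m` value from the `sort` parameter string (samtools sort reads it as memory per thread, in bytes or with a K, M or G suffix) and compares it with `mem_per_cpu` in MB. The helper names are hypothetical and the parsing only covers upper-case suffixes.

```python
import re

# Conversion factors from a samtools size suffix to MB (binary units).
_UNITS_MB = {"K": 1 / 1024, "M": 1, "G": 1024}


def sort_memory_mb(sort_params):
    """Return the per-thread memory in MB requested via -m, or None if -m is absent."""
    match = re.search(r"-m\s*(\d+)([KMG]?)", sort_params)
    if match is None:
        return None
    value, suffix = int(match.group(1)), match.group(2)
    if suffix == "":
        return value / (1024 * 1024)  # no suffix: the value is given in bytes
    return value * _UNITS_MB[suffix]


def mem_per_cpu_is_sufficient(sort_params, mem_per_cpu_mb):
    """Check that mem_per_cpu covers the memory requested for each sort thread."""
    requested = sort_memory_mb(sort_params)
    return requested is None or mem_per_cpu_mb >= requested


# The rule's default sort setting is "-m 4G".
default_ok = mem_per_cpu_is_sufficient("-m 4G", 6144)
too_small = mem_per_cpu_is_sufficient("-m 4G", 2048)
```

With the default `-m 4G`, a `mem_per_cpu` below 4096 would fail this check.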

samtools_index

Index .bam files using samtools index.

Rule

rule samtools_index:
    input:
        bam="{file}.bam",
    output:
        bai=temp("{file}.bam.bai"),
    params:
        extra=config.get("samtools_index", {}).get("extra", ""),
    log:
        "{file}.bam.bai.log",
    benchmark:
        repeat(
            "{file}.bam.bai.benchmark.tsv",
            config.get("samtools_index", {}).get("benchmark_repeats", 1),
        )
    container:
        config.get("samtools_index", {}).get("container", config["default_container"])
    threads: config.get("samtools_index", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("samtools_index", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("samtools_index", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("samtools_index", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("samtools_index", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("samtools_index", {}).get("time", config["default_resources"]["time"]),
    message:
        "{rule}: create index for {input.bam}"
    wrapper:
        "v1.1.0/bio/samtools/index"

input / output files

Rule parameters	Key	Value	Description
input	bam	`"{file}.bam"`	bam file
output	bai	`"{file}.bam.bai"`	bam index file

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory used per cpu
partition	string	partition to use on cluster
time	string	max execution time
threads	integer	number of threads to be available
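
Every rule in this module reads its settings with the same two-step lookup: first the rule's own section, then `default_resources`. The sketch below reproduces that lookup in plain Python. The `config` dictionary and its values are made up for illustration, `resolve_resource` is a hypothetical helper that is not part of the module, and it assumes the values from `resources.yaml` end up in the same `config` dictionary that the rule reads.

```python
# Hypothetical configuration; all values are placeholders for illustration.
config = {
    "default_resources": {
        "threads": 1,
        "mem_mb": 6144,
        "mem_per_cpu": 6144,
        "partition": "core",
        "time": "4:00:00",
    },
    # Only `threads` is overridden for samtools_index.
    "samtools_index": {"threads": 2},
}


def resolve_resource(config, rule, key):
    """Return the rule-specific value if set, else the default_resources value."""
    return config.get(rule, {}).get(key, config["default_resources"][key])


# Rule-specific value wins when it exists.
print(resolve_resource(config, "samtools_index", "threads"))
# Missing keys fall back to default_resources.
print(resolve_resource(config, "samtools_index", "mem_mb"))
```

A rule with no section at all in the configuration, for example `samtools_sort` in the dictionary above, gets every resource from `default_resources`.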

samtools_merge_bam

Merge .bam files from the same sample using samtools merge.

Rule

rule samtools_merge_bam:
    input:
        bams=get_chrom_bams,
        non_chr_bams="alignment/picard_mark_duplicates/{sample}_{type}_non_chr.bam"
        if config.get("reference", {}).get("merge_contigs", None) is not None
        else [],
    output:
        bam=temp("alignment/samtools_merge_bam/{sample}_{type}_unsorted.bam"),
    params:
        extra=config.get("samtools_merge_bam", {}).get("extra", ""),
    log:
        "alignment/samtools_merge_bam/{sample}_{type}_unsorted.bam.log",
    benchmark:
        repeat(
            "alignment/samtools_merge_bam/{sample}_{type}_unsorted.bam.benchmark.tsv",
            config.get("samtools_merge_bam", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("samtools_merge_bam", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("samtools_merge_bam", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("samtools_merge_bam", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("samtools_merge_bam", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("samtools_merge_bam", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("samtools_merge_bam", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("samtools", {}).get("container", config["default_container"])
    message:
        "{rule}: merge chr bam files, creating {output}"
    wrapper:
        "v1.1.0/bio/samtools/merge"

input / output files

Rule parameters	Key	Value	Description
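input	bams	`get_chrom_bams`	list of bam files for all the chromosomes; the list is generated by the function extract_chr defined in the hydra-genetics module
input	non_chr_bams	`"alignment/picard_mark_duplicates/{sample}_{type}_non_chr.bam"`	bam file for the non-chromosome contigs; only used when `merge_contigs` is set under `reference` in the config, otherwise an empty list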
output	bam	`"alignment/samtools_merge_bam/{sample}_{type}_unsorted.bam"`	merged unsorted bam file

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded. RECOMMENDED: use -c -p to keep only one of the read group IDs when merging files from the same sample

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory used per cpu
partition	string	partition to use on cluster
time	string	max execution time
threads	integer	number of threads to be available
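
The `non_chr_bams` input of the rule is conditional: it is only requested when `merge_contigs` is set under `reference` in the configuration. The sketch below mirrors that conditional expression as a plain Python function. `non_chr_input` is a hypothetical helper written for illustration, and unlike the rule it fills in the wildcards directly instead of leaving them to Snakemake.

```python
def non_chr_input(config, sample, sample_type):
    """Mirror the conditional `non_chr_bams` input of samtools_merge_bam."""
    merge_contigs = config.get("reference", {}).get("merge_contigs", None)
    if merge_contigs is not None:
        # merge_contigs is configured: the extra bam file becomes an input.
        return f"alignment/picard_mark_duplicates/{sample}_{sample_type}_non_chr.bam"
    # Not configured: an empty list adds no input file.
    return []


# Without a `reference` section nothing extra is merged.
print(non_chr_input({}, "S1", "T"))
# With merge_contigs set (the value here is a made-up example).
print(non_chr_input({"reference": {"merge_contigs": ["chrUn"]}}, "S1", "T"))
```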

samtools_sort

Sort .bam files using samtools sort.

Rule

rule samtools_sort:
    input:
        bam="{file}_unsorted.bam",
    output:
        bam=temp("{file}.bam"),
    params:
        extra=config.get("samtools_sort", {}).get("extra", ""),
    log:
        "{file}.bam.sort.log",
    benchmark:
        repeat(
            "{file}.bam.sort.benchmark.tsv",
            config.get("samtools_sort", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("samtools_sort", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("samtools_sort", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("samtools_sort", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("samtools_sort", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("samtools_sort", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("samtools_sort", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("samtools_sort", {}).get("container", config["default_container"])
    message:
        "{rule}: sort bam file {input.bam} using samtools"
    wrapper:
        "v2.0.0/bio/samtools/sort"

input / output files

Rule parameters	Key	Value	Description
input	bam	`"{file}_unsorted.bam"`	unsorted bam file
output	bam	`"{file}.bam"`	sorted bam file

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory used per cpu
partition	string	partition to use on cluster
time	string	max execution time
threads	integer	number of threads to be available
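
Because the rule is written with the generic `{file}` wildcard, it applies to any requested `.bam` file for which a matching `_unsorted.bam` can be produced. This is how the output of `samtools_merge_bam` (`..._unsorted.bam`) becomes the sorted file that `samtools_filter_reads` takes as input. The sketch below imitates the path mapping with a regular expression, assuming Snakemake's default wildcard pattern `.+`; `unsorted_input_for` is a hypothetical helper, not a function in the module.

```python
import re


def unsorted_input_for(output_path):
    """Map a requested sorted bam path to the unsorted input the rule expects."""
    # "{file}.bam" with the wildcard matching one or more characters.
    match = re.fullmatch(r"(?P<file>.+)\.bam", output_path)
    if match is None:
        raise ValueError(f"not a .bam path: {output_path}")
    return f"{match.group('file')}_unsorted.bam"


print(unsorted_input_for("alignment/samtools_merge_bam/S1_T.bam"))
```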

samtools_sort_umi

Sort .bam files using samtools sort. Sort on query name.

Rule

rule samtools_sort:
    input:
        bam="{file}_unsorted.bam",
    output:
        bam=temp("{file}.bam"),
    params:
        extra=config.get("samtools_sort", {}).get("extra", ""),
    log:
        "{file}.bam.sort.log",
    benchmark:
        repeat(
            "{file}.bam.sort.benchmark.tsv",
            config.get("samtools_sort", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("samtools_sort", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("samtools_sort", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("samtools_sort", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("samtools_sort", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("samtools_sort", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("samtools_sort", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("samtools_sort", {}).get("container", config["default_container"])
    message:
        "{rule}: sort bam file {input.bam} using samtools"
    wrapper:
        "v2.0.0/bio/samtools/sort"

input / output files

Rule parameters	Key	Value	Description
input	bam	`"{file}_unsorted.bam"`	unsorted bam file
output	bam	`"{file}.bam"`	sorted bam file

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory used per cpu
partition	string	partition to use on cluster
time	string	max execution time
threads	integer	number of threads to be available

samtools_filter_reads

Filter .bam files using samtools view.

Rule

rule samtools_filter_reads:
    input:
        bam="alignment/samtools_merge_bam/{sample}_{type}.bam",
    output:
        bam=temp("alignment/samtools_filter_reads/{sample}_{type}.bam"),
    params:
        extra=config.get("samtools_filter_reads", {}).get("extra", "-f 2"),
    log:
        "alignment/samtools_filter_reads/{sample}_{type}.bam.log",
    benchmark:
        repeat(
            "alignment/samtools_filter_reads/{sample}_{type}.bam.benchmark.tsv",
            config.get("samtools_filter_reads", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("samtools_filter_reads", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("samtools_filter_reads", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("samtools_filter_reads", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("samtools_filter_reads", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("samtools_filter_reads", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("samtools_filter_reads", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("samtools_filter_reads", {}).get("container", config["default_container"])
    message:
        "{rule}: filter reads in {input.bam} with {params.extra}"
    shell:
        "(samtools view -@ {threads} {params.extra} -b {input.bam} > {output.bam}) &> {log}"

input / output files

Rule parameters	Key	Value	Description
input	bam	`"alignment/samtools_merge_bam/{sample}_{type}.bam"`	input bam file
output	bam	`"alignment/samtools_filter_reads/{sample}_{type}.bam"`	filtered bam file

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory used per cpu
partition	string	partition to use on cluster
time	string	max execution time
threads	integer	number of threads to be available
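
The default value of `extra` is `-f 2`. With `samtools view`, `-f` keeps only alignments that have all of the given FLAG bits set, and bit `0x2` marks a read as mapped in a proper pair. The sketch below shows that bit test in Python; the function name and the example flag values are chosen for illustration only.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: each segment properly aligned


def kept_by_required_flags(flag, required=PROPER_PAIR):
    """Return True if every bit in `required` is set in `flag` (the -f test)."""
    return (flag & required) == required


# 99 = paired (1) + proper pair (2) + mate reverse (32) + first in pair (64)
print(kept_by_required_flags(99))
# 73 = paired (1) + mate unmapped (8) + first in pair (64): no proper-pair bit
print(kept_by_required_flags(73))
```

Setting `extra` in `config.yaml` replaces the default entirely, so `-f 2` has to be repeated there if the proper-pair filter should stay active alongside other options.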

star

Align .fastq files to a reference genome and generate a .bam file. STAR is a split-read-aware aligner for RNA data.

Rule

rule star:
    input:
        fq1="prealignment/merged/{sample}_{type}_fastq1.fastq.gz",
        fq2="prealignment/merged/{sample}_{type}_fastq2.fastq.gz",
        idx=config.get("star", {}).get("genome_index", ""),
    output:
        bam=temp("alignment/star/{sample}_{type}.bam"),
        sj=temp("alignment/star/{sample}_{type}.SJ.out.tab"),
    params:
        extra=config.get("star", {}).get("extra", "--outSAMtype BAM SortedByCoordinate"),
        idx="{input.idx}",
    log:
        "alignment/star/{sample}_{type}.bam.log",
    benchmark:
        repeat("alignment/star/{sample}_{type}.bam.benchmark.tsv", config.get("star", {}).get("benchmark_repeats", 1))
    threads: config.get("star", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("star", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("star", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("star", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("star", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("star", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("star", {}).get("container", config["default_container"])
    message:
        "{rule}: align with star, creating {output.bam}"
    wrapper:
        "v1.3.2/bio/star/align"

input / output files

Rule parameters	Key	Value	Description
input	fq1	`"prealignment/merged/{sample}_{type}_fastq1.fastq.gz"`	merged fastq file from read 1
input	fq2	`"prealignment/merged/{sample}_{type}_fastq2.fastq.gz"`	merged fastq file from read 2
input	idx	`config.get("star", {}).get("genome_index", "")`	STAR reference genome index; the location is set in `config.yaml`
output	bam	`"alignment/star/{sample}_{type}.bam"`	aligned bam file
output	sj	`"alignment/star/{sample}_{type}.SJ.out.tab"`	junction file with split-read information, useful for interpreting RNA data

Configuration

Software settings (`config.yaml`)

Key	Type	Description
benchmark_repeats	integer	set number of times benchmark should be repeated
container	string	name or path to docker/singularity container
extra	string	parameters that should be forwarded. RECOMMENDATION: use --outSAMtype BAM SortedByCoordinate to output a coordinate-sorted bam file instead of a sam file
genome_index	string	path to star reference index

Resources settings (`resources.yaml`)

Key	Type	Description
mem_mb	integer	max memory in MB to be available
mem_per_cpu	integer	memory used per cpu
partition	string	partition to use on cluster
time	string	max execution time
threads	integer	number of threads to be available. RECOMMENDATION: use multiple threads for decreased run time. NOTE: if multiple threads are used, the memory must also be increased (mem_mb)
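
The `sj` output is a tab-separated file with one splice junction per line. As a minimal sketch of how it could be read downstream, the code below parses the columns described in the STAR manual: chromosome, intron start, intron end, strand (0 undefined, 1 plus, 2 minus), intron motif, annotation status, uniquely mapping reads, multi-mapping reads and maximum overhang. The `Junction` type, the `read_junctions` function and the example line are all made up for illustration and are not part of the module.

```python
import csv
import io
from typing import NamedTuple


class Junction(NamedTuple):
    chrom: str
    start: int          # first base of the intron (1-based)
    end: int            # last base of the intron (1-based)
    strand: str
    unique_reads: int   # uniquely mapping reads crossing the junction
    multi_reads: int    # multi-mapping reads crossing the junction


STRAND = {"0": ".", "1": "+", "2": "-"}


def read_junctions(handle):
    """Yield one Junction per non-empty line of an SJ.out.tab file."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        yield Junction(row[0], int(row[1]), int(row[2]), STRAND[row[3]], int(row[6]), int(row[7]))


# Made-up example line in SJ.out.tab layout.
example = "chr1\t14830\t14969\t2\t2\t1\t5\t3\t38\n"
junctions = list(read_junctions(io.StringIO(example)))
print(junctions[0])
```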
