Incorrect MD5 checksums being used for some files #331

Open
@tanaes

Description of the bug

I've been having an issue with some files failing at checksum in some studies. Upon investigation, for at least some of these failing samples, it appears to be due to the pipeline not picking the correct MD5 value from the metadata.

For example, manually downloading this file completes and yields a checksum beginning with 7b7e0:

(aspera) jonsan@nf-head:~/fetchngs/EA_pharma/fetchngs_exec/test$ ascp     -QT -l 300m -P33001     -i $CONDA_PREFIX/etc/aspera/aspera_bypass_dsa.pem     [email protected]:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz     SRX13191258_SRR17001000_1.fastq.gz
SRX13191258_SRR17001000_1.fastq.gz                                                                                                                 100%   26MB 15.3Mb/s    00:10
Completed: 26745K bytes transferred in 11 seconds
 (19634K bits/sec), in 1 file.
(aspera) jonsan@nf-head:~/fetchngs/EA_pharma/fetchngs_exec/test$ md5sum SRX13191258_SRR17001000_1.fastq.gz
7b7e0af5429bcb54b2c232489ea8212b  SRX13191258_SRR17001000_1.fastq.gz

However, looking at the command.sh file for this operation, the pipeline is comparing with a 3fcee checksum:

#!/bin/bash -euo pipefail
ascp \
    -QT -l 300m -P33001 \
    -i $CONDA_PREFIX/etc/aspera/aspera_bypass_dsa.pem \
    [email protected]:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz \
    SRX13191258_SRR17001000_1.fastq.gz

echo "3fcee2e72a2ec6221cac142538aff092  SRX13191258_SRR17001000_1.fastq.gz" > SRX13191258_SRR17001000_1.fastq.gz.md5
md5sum -c SRX13191258_SRR17001000_1.fastq.gz.md5

The metadata downloaded for this run contains both checksums, but in different columns: 7b7e0 is the first entry of fastq_md5, while 3fcee is its second entry and also the value of md5_1:

fastq_md5	**7b7e0**af5429bcb54b2c232489ea8212b**;3fcee**2e72a2ec6221cac142538aff092;383df08e03e1cd1ee071fd67c16b085b
fastq_bytes	27387589;1445187226;1481254395
fastq_ftp	ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_2.fastq.gz
fastq_galaxy	ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_2.fastq.gz
fastq_aspera	fasp.sra.ebi.ac.uk:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz;fasp.sra.ebi.ac.uk:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_1.fastq.gz;fasp.sra.ebi.ac.uk:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_2.fastq.gz
fastq_1	ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_1.fastq.gz
fastq_2	ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_2.fastq.gz
md5_1	**3fcee**2e72a2ec6221cac142538aff092
md5_2	383df08e03e1cd1ee071fd67c16b085b
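
To check which file a given checksum belongs to, the semicolon-separated fields can be lined up entry by entry. This is a minimal diagnostic sketch, assuming fastq_ftp and fastq_md5 list their entries in the same order (as they appear to in the metadata above). The helper name `locate_md5` is hypothetical and not part of the pipeline:

```python
def locate_md5(observed_md5, fastq_ftp, fastq_md5):
    """Return the file name whose metadata MD5 equals observed_md5, or None.

    Assumes both fields are semicolon-separated and index-aligned.
    """
    for url, md5 in zip(fastq_ftp.split(";"), fastq_md5.split(";")):
        if md5 == observed_md5:
            # Keep only the file name at the end of the URL.
            return url.rsplit("/", 1)[-1]
    return None


# Values copied from the metadata for SRR17001000.
prefix = "ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000"
fastq_ftp = f"{prefix}.fastq.gz;{prefix}_1.fastq.gz;{prefix}_2.fastq.gz"
fastq_md5 = (
    "7b7e0af5429bcb54b2c232489ea8212b;"
    "3fcee2e72a2ec6221cac142538aff092;"
    "383df08e03e1cd1ee071fd67c16b085b"
)

downloaded = locate_md5("7b7e0af5429bcb54b2c232489ea8212b", fastq_ftp, fastq_md5)
expected = locate_md5("3fcee2e72a2ec6221cac142538aff092", fastq_ftp, fastq_md5)
print(downloaded)  # SRR17001000.fastq.gz
print(expected)    # SRR17001000_1.fastq.gz
```

Under that assumption, the file that was downloaded and the file whose checksum was used for verification are two different entries.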

It appears that there are three fastq files, and the workflow is grabbing the first one (maybe an index read? It's much smaller than the other two), renaming it to _1.fastq.gz, and then checking it against md5_1, which is the MD5 of the actual SRR17001000_1.fastq.gz. I haven't yet looked in the code to find the logic that splits reads 1 and 2, but it may be making too liberal an assumption about the structure of the fastq_ftp field.
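
One possible way to avoid a positional assumption is to classify each entry by its file name suffix and keep the URL and MD5 together. This is only a sketch: the function name `pair_reads` and the suffix rule are hypothetical, it is not the pipeline's actual logic, and it assumes the URL and MD5 fields are index-aligned:

```python
import re


def pair_reads(fastq_urls, fastq_md5):
    """Map "1", "2", or "unpaired" to a (url, md5) tuple.

    Assumes both fields are semicolon-separated and index-aligned.
    If several entries lack a _1/_2 suffix, only the last one is kept.
    """
    urls = fastq_urls.split(";")
    md5s = fastq_md5.split(";")
    if len(urls) != len(md5s):
        raise ValueError("URL and MD5 fields have different numbers of entries")

    reads = {}
    for url, md5 in zip(urls, md5s):
        # Classify by file name suffix rather than by position in the list.
        match = re.search(r"_([12])\.fastq\.gz$", url)
        key = match.group(1) if match else "unpaired"
        reads[key] = (url, md5)
    return reads


# Values copied from the metadata for SRR17001000.
base = "fasp.sra.ebi.ac.uk:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000"
reads = pair_reads(
    f"{base}.fastq.gz;{base}_1.fastq.gz;{base}_2.fastq.gz",
    "7b7e0af5429bcb54b2c232489ea8212b;"
    "3fcee2e72a2ec6221cac142538aff092;"
    "383df08e03e1cd1ee071fd67c16b085b",
)
print(reads["1"])
```

With this approach, read 1 resolves to the _1.fastq.gz URL together with the 3fcee checksum, and the unsuffixed file stays separate with its own 7b7e0 checksum.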

Maybe related to issue #260?

Either way, this is leading to failed downloads, so it seems like it should be considered a bug.

Command used and terminal output

Relevant files

No response

System information

No response
