Description of the bug
I've been having an issue with some files failing at checksum in some studies. Upon investigation, for at least some of these failing samples, it appears to be due to the pipeline not picking the correct MD5 value from the metadata.
For example, manually downloading this file completes and yields a checksum beginning 7b7e0:
(aspera) jonsan@nf-head:~/fetchngs/EA_pharma/fetchngs_exec/test$ ascp -QT -l 300m -P33001 -i $CONDA_PREFIX/etc/aspera/aspera_bypass_dsa.pem [email protected]:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz SRX13191258_SRR17001000_1.fastq.gz
SRX13191258_SRR17001000_1.fastq.gz 100% 26MB 15.3Mb/s 00:10
Completed: 26745K bytes transferred in 11 seconds
(19634K bits/sec), in 1 file.
(aspera) jonsan@nf-head:~/fetchngs/EA_pharma/fetchngs_exec/test$ md5sum SRX13191258_SRR17001000_1.fastq.gz
7b7e0af5429bcb54b2c232489ea8212b SRX13191258_SRR17001000_1.fastq.gz
However, looking at the command.sh file for this operation, the pipeline is comparing with a 3fcee checksum:
#!/bin/bash -euo pipefail
ascp \
-QT -l 300m -P33001 \
-i $CONDA_PREFIX/etc/aspera/aspera_bypass_dsa.pem \
[email protected]:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz \
SRX13191258_SRR17001000_1.fastq.gz
echo "3fcee2e72a2ec6221cac142538aff092 SRX13191258_SRR17001000_1.fastq.gz" > SRX13191258_SRR17001000_1.fastq.gz.md5
md5sum -c SRX13191258_SRR17001000_1.fastq.gz.md5
If we look at the metadata downloaded for this run, we see both checksums being represented, but in different columns:
fastq_md5 **7b7e0**af5429bcb54b2c232489ea8212b;**3fcee**2e72a2ec6221cac142538aff092;383df08e03e1cd1ee071fd67c16b085b
fastq_bytes 27387589;1445187226;1481254395
fastq_ftp ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_2.fastq.gz
fastq_galaxy ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_2.fastq.gz
fastq_aspera fasp.sra.ebi.ac.uk:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000.fastq.gz;fasp.sra.ebi.ac.uk:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_1.fastq.gz;fasp.sra.ebi.ac.uk:/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_2.fastq.gz
fastq_1 ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_1.fastq.gz
fastq_2 ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000_2.fastq.gz
md5_1 **3fcee**2e72a2ec6221cac142538aff092
md5_2 383df08e03e1cd1ee071fd67c16b085b
It appears that there are three FASTQ files for this run. The workflow downloads the first one (maybe an index read? It's much smaller than the other two), renames it to _1.fastq.gz, and then compares it against the MD5 of the real _1 file (md5_1). I haven't looked in the code yet to find the logic that splits reads 1 and 2, but it may be making too liberal an assumption about the structure of the fastq_ftp field.
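To illustrate the mismatch, here is a minimal Python sketch of one way the semicolon-delimited fields could be paired so that each URL keeps its own checksum. This is not the pipeline's actual code: the function name pair_fastq_fields and the "unpaired" label are hypothetical, and it assumes the entries in fastq_ftp and fastq_md5 are listed in the same order, as they appear to be in the metadata above.

```python
import re


def pair_fastq_fields(fastq_ftp: str, fastq_md5: str) -> dict:
    """Pair each URL with the MD5 at the same position, keyed by read label.

    Hypothetical helper: the label comes from the filename suffix
    (_1 or _2), not from the entry's position in the list.
    """
    urls = fastq_ftp.split(";")
    md5s = fastq_md5.split(";")
    if len(urls) != len(md5s):
        raise ValueError("fastq_ftp and fastq_md5 have different entry counts")

    paired = {}
    for url, md5 in zip(urls, md5s):
        # Files without a _1/_2 suffix are labelled "unpaired" (assumption).
        match = re.search(r"_([12])\.fastq\.gz$", url)
        label = match.group(1) if match else "unpaired"
        paired[label] = (url, md5)
    return paired


# Values copied from the metadata for SRR17001000 shown above.
base = "ftp.sra.ebi.ac.uk/vol1/fastq/SRR170/000/SRR17001000/SRR17001000"
fastq_ftp = f"{base}.fastq.gz;{base}_1.fastq.gz;{base}_2.fastq.gz"
fastq_md5 = (
    "7b7e0af5429bcb54b2c232489ea8212b;"
    "3fcee2e72a2ec6221cac142538aff092;"
    "383df08e03e1cd1ee071fd67c16b085b"
)

paired = pair_fastq_fields(fastq_ftp, fastq_md5)
for label, (url, md5) in paired.items():
    print(label, md5, url)
```

With this pairing, the 7b7e0 checksum stays attached to SRR17001000.fastq.gz, and 3fcee stays attached to SRR17001000_1.fastq.gz, which is the pairing the failing command.sh does not respect.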
Maybe related to issue #260?
Either way, this is leading to failed downloads, so it seems it should properly be considered a bug.
Command used and terminal output
Relevant files
No response
System information
No response