Results¶
To assess sideRETRO performance and accuracy, we made a simulated dataset whereupon we could have control of true positive and negative values.
Dataset¶
Our dataset for testing is composed of 5 simulated human whole-genome sequencing with 40x of depth and 10 randomly distributed retrocopies each. In total, we had a list of 30 retrocopies consisting of the last 1000 bases of the largest transcript of the parental gene. All retrocopies was stochastically designed for chromosome, position, strand and zygosity.
ACAP3 | AMY2B | ATM | BCL2 | BRAF |
BRCA1 | BRIP1 | DVL1 | FANCD2 | FAT1 |
GNB1 | KLHL17 | MECP2 | MUTYH | NPHP4 |
OR4F5 | PBX1 | PRCC | PTEN | RER1 |
SAMD11 | SET | SIRT1 | SMARCE1 | SOX2 |
TMEM52 | TP53 | TPM3 | USP8 | XPO1 |
For each individual human whole-genome, we sampled with replacement 10 retrocopies from our list, so that some events were shared among part of our subjects.
Parental gene | Chromosome | Position | Strand | Zygosity |
---|---|---|---|---|
AMY2B | chr5 | 122895832 | - | HE |
ACAP3 | chr11 | 133246346 | + | H0 |
ATM | chr19 | 16443740 | + | HO |
BCL2 | chr4 | 146453786 | + | HO |
BRAF | chr18 | 22375812 | - | HO |
BRCA1 | chr7 | 102911369 | - | HO |
BRIP1 | chr12 | 113357216 | - | HO |
DVL1 | chr5 | 54105196 | - | HE |
FANCD2 | chr3 | 121339674 | + | HE |
FAT1 | chr16 | 65659952 | - | HE |
GNB1 | chr12 | 70734022 | + | HE |
MUTYH | chr4 | 2716745 | + | HO |
PBX1 | chr21 | 22718521 | - | HE |
PRCC | chr1 | 190777903 | - | HO |
PTEN | chr2 | 140252560 | - | HE |
RER1 | chr13 | 55179109 | + | HE |
SAMD11 | chr4 | 14585135 | + | HO |
SET | chr7 | 154178578 | - | HO |
SIRT1 | chrX | 120716688 | - | HO |
SOX2 | chr10 | 88689163 | - | HE |
TMEM52 | chr1 | 82897536 | + | HO |
TP53 | chr5 | 42938944 | - | HO |
TPM3 | chr7 | 3488208 | - | HE |
USP8 | chr7 | 41333317 | + | HO |
XPO1 | chr10 | 33172062 | - | HE |
IND1 | IND2 | IND3 | IND4 | IND5 |
---|---|---|---|---|
ATM | AMY2B | ATM | AMY2B | AMY2B |
BCL2 | BRCA1 | GNB1 | BRAF | BRIP1 |
BRIP | FANCD2 | MECP | BRIP1 | FANCD2 |
DVL1 | FAT1 | PBX1 | DVL1 | NPHP4 |
MECP | KLHL17 | PRCC | FANCD2 | PTEN |
OR4F | MUTYH | PTEN | GNB1 | RER1 |
PTEN | RER1 | RER1 | KLHL17 | SMARCE1 |
SET | SET | SAMD | MECP2 | TP53 |
SOX2 | SMARCE1 | SIRT | SET | TPM3 |
XPO1 | SOX2 | XPO1 | TMEM52 | XPO1 |
Simulation¶
We used the SANDY tool (version v0.23), A straightforward and complete next-generation sequencing read simulator [1], for simulate all 5 genomes according to the structural variations that we designed and according to the sampling. SANDY demands 2 steps for the task: First we indexed all retrocopies for each individual and then we could simulate all whole-genome sequencing.
# Reference genome
REF_FASTA=hg38.fa
# Coverage depth
DEPTH=40
# Retrocopies by individual
IND=(ind1.txt ind2.txt ind3.txt ind4.txt ind5.txt)
# Index all retrocopies
for ind in "${IND[@]}"; do
sandy variation add --structural-variation=${ind%%.txt} $ind
done
# Simulate all genomes
for ind in "${IND[@]}"; do
sandy_index=${ind%%.txt}
sandy genome \
--id='%i.%U_%c:%S-%E_%v' \
--structural-variation=$sandy_index \
--output-dir=$sandy_index \
--jobs=20 \
--seed=42 \
--quality-profile='hiseq_101' \
--coverage=$DEPTH \
--verbose \
$REF_FASTA
done
As result we have a pair of FASTQ files (forward and reverse complement) for each simulated individual. Next it is required to align our sequencing data against the human reference genome in order to generate mapped files in SAM format. We used BWA aligner (version 0.7.9) [2] for accomplish this task.
# Individual directories with the
# simulated data
IND_DIR=(ind1 ind2 ind3 ind4 ind5)
# Reference genome
REF_FASTA=hg38.fa
# Index reference genome
bwa index $REF_FASTA
# Alignment
for ind in "${IND[@]}"; do
bwa mem \
-t 10 \
$REF_FASTA \
$ind/out_R1_001.fastq.gz \
$ind/out_R2_001.fastq.gz > $ind/out.sam
done
Running sideRETRO¶
After our simulated dataset was ready, we could test the sideRETRO capabilities to detect the designed somatic retrocopies.
# Our simulated SAM files list
LIST=(ind1/out.sam ind2/out.sam ind3/out.sam ind4/out.sam ind5/out.sam)
# GENCODE annotation v31
ANNOTATION=gencode.v31.annotation.gff3.gz
# GENCODE reference genome
REF_FASTA=hg38.fa
# Run process-sample step
sider process-sample \
--cache-size=20000000 \
--output-dir=result \
--threads=5 \
--max-distance=15000 \
--alignment-frac=0.9 \
--phred-quality=20 \
--sorted \
--log-file=ps.log \
--annotation-file=$ANNOTATION \
"${LIST[@]}"
# Run merge-call step
sider merge-call \
--cache-size=20000000 \
--epsilon=500 \
--min-pts=20 \
--genotype-support=5 \
--near-gene-rank=3 \
--log-file=mc.log \
--threads=10 \
--phred-quality=20 \
--in-place \
result/out.db
# Finally run make-vcf
sider make-vcf \
--reference-file=$REF_FASTA \
--prefix=simulation \
result/out.db
Download¶
Our output simulation.vcf
, as well as the files to be indexed by SANDY,
ind1.txt
, ind2.txt
, ind3.txt
, ind4.txt
and
ind5.txt
, can be downloaded at
simulation.tar.gz
.
Analysis¶
All retrocopies were detected except for 5 retrocopies from the following parental
genes that were not detected in any individual: KLHL17
, MECP2
,
NPHP4
, OR4F5
and SMARCE1
. In a manual check, those
retrocopies do not present abnormal reads in any
individual - which could enable their recognition by our pipeline. However to make a
fair and reproducible analysis, we will consider it as a methodological error of our
tool. So sideRETRO found \(\frac{23}{28}\) retrocopies, which means 82%.
In more detail, we will show that result and more concerning:
Genomic coordinate¶
Regarding the genomic coordinate, sideRETRO got it right 100% for chromosome and strand. The performance in detecting the insertion point position was satisfactory with a MEDSE of 1 base.
Parental gene | Actual position | Predicted position | Error | ||||
---|---|---|---|---|---|---|---|
AMY2B | chr5 | 122895832 | - | chr5 | 122895831 | - | 1 |
ATM | chr19 | 16443740 | + | chr19 | 16443738 | + | 2 |
BCL2 | chr4 | 146453786 | + | chr4 | 146453783 | + | 3 |
BRAF | chr18 | 22375812 | - | chr18 | 22375811 | - | 1 |
BRCA1 | chr7 | 102911369 | - | chr7 | 102911368 | - | 1 |
BRIP1 | chr12 | 113357216 | - | chr12 | 113357216 | - | 0 |
DVL1 | chr5 | 54105196 | - | chr5 | 54105195 | - | 1 |
FANCD2 | chr3 | 121339674 | + | chr3 | 121339673 | + | 1 |
FAT1 | chr16 | 65659952 | - | chr16 | 65659951 | - | 1 |
GNB1 | chr12 | 70734022 | + | chr12 | 70734022 | + | 0 |
MUTYH | chr4 | 2716745 | + | chr4 | 2716760 | + | 15 |
PBX1 | chr21 | 22718521 | - | chr21 | 22718520 | - | 1 |
PRCC | chr1 | 190777903 | - | chr1 | 190777902 | - | 1 |
PTEN | chr2 | 140252560 | - | chr2 | 140252559 | - | 1 |
RER1 | chr13 | 55179109 | + | chr13 | 55179108 | + | 1 |
SAMD11 | chr4 | 14585135 | + | chr4 | 14585134 | + | 1 |
SET | chr7 | 154178578 | - | chr7 | 154178578 | - | 0 |
SIRT1 | chrX | 120716688 | - | chrX | 120716687 | - | 1 |
SOX2 | chr10 | 88689163 | - | chr10 | 88689162 | - | 1 |
TMEM52 | chr1 | 82897536 | + | chr1 | 82897534 | + | 2 |
TP53 | chr5 | 42938944 | - | chr5 | 42938943 | - | 1 |
TPM3 | chr7 | 3488208 | - | chr7 | 3488207 | - | 1 |
XPO1 | chr10 | 33172062 | - | chr10 | 33172056 | - | 6 |
Chromosome | Strand | Position | |
---|---|---|---|
MSE | MEDSE | ||
100% | 100% | 12.7 bases | 1 base |
Note
The retrocopies from MUTYH
and XPO1
do not present
splitted reads, so they are annotated as IMPRECISE
at simulation.vcf
file. So it can explain why their insertion point position
detection has the biggest absolute error, which pulls the MSE up.
Zygosity¶
With the exception of the 5 undetected retrocopies (parental genes KLHL17
,
MECP2
, NPHP4
, OR4F5
and SMARCE1
), it was possible
to accurately calculate the zygosity. Only for the MUTYH
’s retrocopy, the
zygosity was wrongly assigned as heterozygous for the individual IND2
- when
the true value was homozygous alternate. As MUTYH
’s retrocopy was
imprecisely annotated for the lack of splitted reads, we may expect a reference
allele depth overestimation on the predicted site - which explains the heterozygous
attribution.
To calculate the performance in terms of reference/alternate allele detection, we made a confusion matrix for each individual and an overall matrix with the junction of all simulations:
TP | True Positive |
FP | False Positive |
TN | True Negative |
FN | False Negative |
TPR | True Positive Rate or Sensitivity |
TNR | True Negative Rate or Specificity |
PPV | Positive Predictive Value or Precision |
NPV | Negative Predictive Value |
ACC | Accuracy |
There was no false positive at all - as can be seen by the precision and specificity with 100% value for all individuals. The worst accuracy was at individual 2, with 91.07% value, and best was at individual 3, with 96.43% value. The overall sensitivity, specificity and accuracy were: 81%, 100% and 93.21% value consecutively.
References and Further Reading¶
[1] | Miller, Thiago et al. (2019). galantelab/sandy: Release v0.23 (Version v0.23). Zenodo. http://doi.org/10.5281/zenodo.2589575. |
[2] | Li H. and Durbin R. (2009). Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168]. |