AspSeq assembly progress
Raw data
Illumina Paired End
This was our initial dataset, sequenced several years ago on a HiSeq with older chemistry. Because of that chemistry, the data shows some GC bias.
Library Name | Coverage | MeanRL | MedianRL | Count | MeanQ | MedianQ | MeanGC | MedianGC | Insert Size |
---|---|---|---|---|---|---|---|---|---|
PE150 | 11.44 | 101 | 101 | 54275444 | 31.77 | 36 | 0.4079 | 0.4046 | 150 |
PE150.1 | 11.44 | 101 | 101 | 54275444 | 32.58 | 36.99 | 0.4073 | 0.4043 | 150 |
PE150-1 | 26.41 | 101 | 101 | 125234671 | 32.4 | 36.58 | 0.3859 | 0.3861 | 150 |
PE150-1.1 | 26.41 | 101 | 101 | 125234671 | 30.58 | 35.96 | 0.3864 | 0.3861 | 150 |
PE300 | 21.42 | 101 | 101 | 101587753 | 28.92 | 35 | 0.3769 | 0.3762 | 300 |
PE300.1 | 21.42 | 101 | 101 | 101587753 | 30.78 | 36 | 0.3769 | 0.3762 | 300 |
PE650 | 13.4 | 101 | 101 | 63533295 | 28.3 | 35.01 | 0.3791 | 0.382 | 650 |
PE650.1 | 13.4 | 101 | 101 | 63533295 | 30.84 | 36.02 | 0.3783 | 0.3792 | 650 |
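The Coverage column in these tables appears to be consistent with read count times mean read length, divided by the 479Mbp flow-cytometry genome size estimate quoted under Assembly statistics. The sketch below reproduces that arithmetic for a few rows of the table above; the function name and the assumption that this is how the column was derived are ours, not confirmed by the pipeline that produced the tables.

```python
GENOME_SIZE_BP = 479_000_000  # flow-cytometry estimate quoted in this document


def coverage(read_count: int, mean_read_length: float,
             genome_size: int = GENOME_SIZE_BP) -> float:
    """Sequencing depth: total sequenced bases divided by genome size."""
    return read_count * mean_read_length / genome_size


# (read count, mean read length) copied from the table above
libraries = {
    "PE150": (54275444, 101),
    "PE150-1": (125234671, 101),
    "PE300": (101587753, 101),
    "PE650": (63533295, 101),
}

for name, (count, mean_rl) in libraries.items():
    print(f"{name}\t{coverage(count, mean_rl):.2f}")
```

Each row is treated here as a single read file, so the figure is the depth contributed by that file alone.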
454 reads
For the initial project, we also generated about 10X of 454 data.
Library Name | Coverage | MeanRL | MedianRL | Count | MeanQ | MedianQ | MeanGC | MedianGC | Insert Size |
---|---|---|---|---|---|---|---|---|---|
asp201Run1se_1 | 0.2848 | 287.5 | 288.1 | 474412 | 31.08 | 34.85 | 0.3607 | 0.3641 | NA |
asp201Run1se_2 | 0.2932 | 303.3 | 306.4 | 463040 | 31.44 | 35.24 | 0.3617 | 0.3651 | NA |
asp201Run2pe_1 | 0.2496 | 292.1 | 313.5 | 409339 | 28.94 | 30.8 | 0.3786 | 0.3695 | 3000 |
asp201Run2pe_2 | 0.2557 | 298.1 | 321 | 410843 | 29.09 | 31 | 0.3795 | 0.3699 | 3000 |
asp201Run3se_1 | 0.000635 | 130.8 | 77.35 | 2326 | 23.8 | 23.93 | 0.4796 | 0.4809 | NA |
asp201Run3se_2 | 0.0001951 | 134.8 | 81.94 | 693 | 26.01 | 25.24 | 0.4788 | 0.4833 | NA |
asp201Run4se_1 | 0.4005 | 346.6 | 404.7 | 553438 | 31.07 | 34.37 | 0.4061 | 0.3855 | NA |
asp201Run4se_2 | 0.3629 | 337 | 390.4 | 515844 | 30.53 | 33.15 | 0.4113 | 0.3894 | NA |
asp201Run5se_1 | 0.5258 | 342 | 372.6 | 736322 | 29.61 | 32.07 | 0.3464 | 0.3494 | NA |
asp201Run5se_2 | 0.5788 | 342.5 | 376.1 | 809422 | 30.02 | 32.65 | 0.3504 | 0.3529 | NA |
asp201Run6se_1 | 0.3801 | 355.2 | 390.7 | 512577 | 30.39 | 33.66 | 0.3441 | 0.3474 | NA |
asp201Run6se_2 | 0.4044 | 364.2 | 402.2 | 531805 | 30.64 | 34.06 | 0.3444 | 0.3473 | NA |
asp201Run7se_1 | 0.3812 | 346 | 385.1 | 527682 | 29.98 | 32.57 | 0.3487 | 0.3521 | NA |
asp201Run7se_2 | 0.3919 | 348.8 | 388 | 538231 | 29.9 | 32.4 | 0.3494 | 0.3526 | NA |
asp201Run8se_1 | 0.3791 | 307.3 | 329.5 | 590877 | 28.68 | 30.66 | 0.349 | 0.3519 | NA |
asp201Run8se_2 | 0.44 | 306.5 | 328.9 | 687637 | 28.53 | 30.35 | 0.3482 | 0.3512 | NA |
GWDW1HR01 | 0.7442 | 533.4 | 583.6 | 668353 | 27.6 | 28.59 | 0.3646 | 0.3641 | NA |
GWDW1HR02 | 0.7451 | 541.6 | 595.3 | 658997 | 27.71 | 28.77 | 0.3641 | 0.3639 | NA |
GWLD3LU01 | 0.7276 | 524.9 | 568.8 | 663912 | 28.57 | 30.15 | 0.3643 | 0.3644 | NA |
GWLD3LU02 | 0.6257 | 483.7 | 521 | 619620 | 27.83 | 28.78 | 0.3667 | 0.3663 | NA |
GWLD5AX01 | 0.365 | 447.2 | 489.3 | 390891 | 27.44 | 28.06 | 0.3703 | 0.3699 | NA |
GWLD5AX02 | 0.5504 | 498.6 | 541.1 | 528744 | 28.3 | 29.59 | 0.3676 | 0.3674 | NA |
GWLFG7T01 | 0.7119 | 557.3 | 639.7 | 611862 | 28.65 | 30.44 | 0.3634 | 0.3644 | NA |
GWLFG7T02 | 0.4897 | 511.9 | 582 | 458205 | 28.66 | 30.29 | 0.3662 | 0.367 | NA |
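As a quick cross-check of the "about 10X" figure, the per-library coverage values in the table above can simply be summed. This is a minimal sketch; the variable names are ours, and the values are copied by hand from the table.

```python
# Per-library coverage values copied from the 454 table above
coverage_454 = [
    0.2848, 0.2932, 0.2496, 0.2557, 0.000635, 0.0001951,  # Run1 to Run3
    0.4005, 0.3629, 0.5258, 0.5788, 0.3801, 0.4044,       # Run4 to Run6
    0.3812, 0.3919, 0.3791, 0.44,                         # Run7 to Run8
    0.7442, 0.7451, 0.7276, 0.6257,                       # GWDW1HR, GWLD3LU
    0.365, 0.5504, 0.7119, 0.4897,                        # GWLD5AX, GWLFG7T
]

total_454 = sum(coverage_454)
print(f"{len(coverage_454)} libraries, total coverage {total_454:.2f}X")
```

The sum comes to roughly 10.3X, in line with the stated total.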
Illumina experimental runs
As part of our collaboration with the Swedish central sequencing facility, they tested new protocols on their HiSeq and MiSeq machines to sequence long overlapping fragments. The fragments here are 450bp in total and were sequenced with overlapping 300bp reads, hence the negative "insert" sizes in the table. The HiSeq machine was run in Rapid Mode.
Library Name | Coverage | MeanRL | MedianRL | Count | MeanQ | MedianQ | MeanGC | MedianGC | "Insert" size |
---|---|---|---|---|---|---|---|---|---|
MiSeq-300 | 18.06 | 301 | 301 | 28740790 | 33.24 | 37 | 0.3499 | 0.3505 | -150 |
MiSeq-300.1 | 18.06 | 301 | 301 | 28740790 | 28.6 | 36.16 | 0.3553 | 0.3525 | -150 |
HiSeq-300 | 113.1 | 301 | 301 | 1.8e+08 | 34.05 | 38 | 0.3464 | 0.3488 | -150 |
HiSeq-300.1 | 113.1 | 301 | 301 | 1.8e+08 | 28.75 | 38 | 0.3491 | 0.3484 | -150 |
HiSeq-300.2 | 113.6 | 301 | 301 | 180772624 | 34.19 | 38 | 0.3461 | 0.3488 | -150 |
HiSeq-300.3 | 113.6 | 301 | 301 | 180772624 | 29.01 | 38 | 0.3484 | 0.3455 | -150 |
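One way to read the negative "insert" sizes in the table above is as the inner distance between the two mates of a pair, which becomes negative once the reads overlap. The sketch below works this out for nominal 300bp reads on a 450bp fragment; interpreting the column as inner distance is our assumption, and the function names are illustrative.

```python
def inner_distance(read_length: int, fragment_length: int) -> int:
    """Gap between the two mates; negative when the mates overlap."""
    return fragment_length - 2 * read_length


def read_overlap(read_length: int, fragment_length: int) -> int:
    """Number of fragment bases covered by both mates (0 if they do not overlap)."""
    return max(0, 2 * read_length - fragment_length)


# Nominal values: 300bp reads on a 450bp fragment
print(inner_distance(300, 450))  # -150
print(read_overlap(300, 450))    # 150
```

Under this reading, each pair can in principle be merged into a single sequence spanning the full 450bp fragment.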
Jumping reads
To scaffold the initial assemblies, we generated some Mate Pair libraries. With the exception of the 10Kb library, these suffer from high PE contamination and mediocre overall quality:
Library Name | Coverage | MeanRL | MedianRL | Count | MeanQ | MedianQ | MeanGC | MedianGC | Insert size |
---|---|---|---|---|---|---|---|---|---|
3KbMP | 22.15 | 101 | 101 | 1.05e+08 | 32.66 | 36.74 | 0.3632 | 0.3586 | 3000 |
3KbMP.1 | 22.15 | 101 | 101 | 1.05e+08 | 32.48 | 37 | 0.3635 | 0.3632 | 3000 |
10KbMP | 2.811 | 49 | 49 | 27481587 | 36.76 | 38.96 | 0.3653 | 0.3673 | 10000 |
10KbMP.1 | 2.811 | 49 | 49 | 27481587 | 32.89 | 37.02 | 0.3761 | 0.3784 | 10000 |
5KbMP | 2.959 | 101 | 101 | 14031890 | 34.45 | 36.4 | 0.3643 | 0.3564 | 5000 |
5KbMP.1 | 2.959 | 101 | 101 | 14031890 | 33.64 | 36.22 | 0.3641 | 0.3564 | 5000 |
3KbMP.2 | 8.774 | 101 | 101 | 41611057 | 32.31 | 35.28 | 0.367 | 0.3663 | 3000 |
3KbMP.3 | 8.774 | 101 | 101 | 41611057 | 34.9 | 38.04 | 0.3664 | 0.3663 | 3000 |
5KbMP.2 | 12.01 | 81.87 | 93.95 | 70277953 | 37.6 | 39 | 0.3485 | 0.3536 | 5000 |
5KbMP.3 | 12.02 | 81.89 | 93.96 | 70277953 | 37.34 | 38.96 | 0.3484 | 0.3532 | 5000 |
Fosmid pools
Recently, we ran a pilot to sequence fosmid pools and fosmid ends. It included ~8x PacBio (filtered subreads) of 5x1000 fosmids as well as fosmid end sequencing, which generated jumping libraries with 40Kb insert sizes. Only about 5-10% of the initial fosmid end data maps with the expected insert sizes.
Library Name | Coverage | MeanRL | MedianRL | Count | MeanQ | MedianQ | MeanGC | MedianGC | Insert Size |
---|---|---|---|---|---|---|---|---|---|
pb_162-1 | 1.568 | 5729 | 5861 | 131109 | 43.2 | 44.59 | 0.3922 | 0.3769 | NA |
pb_162-2 | 1.726 | 7931 | 7964 | 104220 | 43.09 | 44 | 0.3991 | 0.3832 | NA |
pb_162-3 | 1.668 | 7749 | 7742 | 103112 | 43.07 | 44 | 0.3977 | 0.3823 | NA |
pb_162-4 | 1.64 | 7518 | 7466 | 104519 | 43.04 | 44 | 0.3994 | 0.3826 | NA |
pb_162-5 | 1.302 | 6992 | 6707 | 89210 | 42.98 | 44 | 0.3983 | 0.3826 | NA |
FE1 | 4.918 | 151 | 151 | 15602157 | 35.96 | 38 | 0.4081 | 0.4159 | 40000 |
FE1.1 | 4.918 | 151 | 151 | 15602157 | 35.18 | 38 | 0.4082 | 0.4106 | 40000 |
FE2 | 4.968 | 151 | 151 | 15758362 | 35.95 | 38 | 0.4082 | 0.4106 | 40000 |
FE2.1 | 4.968 | 151 | 151 | 15758362 | 35.29 | 38 | 0.408 | 0.4106 | 40000 |
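The share of fosmid end pairs that map with the expected insert size can be estimated from the observed mapping distances of the pairs. The following is a minimal sketch of that calculation, not the procedure actually used for the figure quoted above: the 25% tolerance window is an arbitrary assumption, and the distances in the example are made-up toy values rather than real data.

```python
def fraction_with_expected_insert(observed, expected, tolerance=0.25):
    """Fraction of pairs whose absolute mapping distance lies within
    expected * (1 +/- tolerance). The tolerance is an assumed cutoff."""
    if not observed:
        return 0.0
    low = expected * (1 - tolerance)
    high = expected * (1 + tolerance)
    hits = sum(1 for distance in observed if low <= abs(distance) <= high)
    return hits / len(observed)


# Toy distances (bp) for illustration only; NOT taken from the FE libraries
toy_distances = [350, 420, 38500, 41000, 500, 610, 290, 45000, 800, 300]
print(fraction_with_expected_insert(toy_distances, expected=40000))
```

In practice the distances would come from the template length reported by the aligner for each mapped pair.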
Genomic PacBio
In addition to the previous PacBio data, we have generated 60X of genomic PacBio data (filtered subreads).
Library Name | Coverage | MeanRL | MedianRL | Count | MeanQ | MedianQ | MeanGC | MedianGC |
---|---|---|---|---|---|---|---|---|
pb_158 | 60.28 | 8062 | 7771 | 3581635 | 42.6 | 43.29 | 0.3648 | 0.3539 |
Assembly statistics
The assembly currently in use (Potra v1.1) is a hybrid assembly built from the initial Illumina PE data, the 454 data, and the Illumina Mate Pair data. Two newer generations of the assembly are listed here. The first is an assembly of the high-coverage overlapping Illumina data using DISCOVAR de novo; the second is a PacBio-only FALCON assembly. The expected genome size is 479Mbp, as estimated by flow cytometry.
Assembly | # Scaffolds | N50 | Total length |
---|---|---|---|
Potra v1.1 | 204318 | 44Kbp (Scaffold) | 387Mbp |
DISCOVAR | 44375 | 31Kbp (Contig) | 491Mbp |
FALCON | 4845 | 484Kbp (Contig) | 477Mbp |
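For reference, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The sketch below shows one common way to compute it, and also expresses each assembly's total length from the table above as a fraction of the 479Mbp estimate. The function and variable names are ours, and the short length list is a toy example, not data from these assemblies.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half
    of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequences given")


# Toy example: total is 32, and the two longest sequences already cover 16
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8

EXPECTED_GENOME_MBP = 479  # flow-cytometry estimate from the text
total_length_mbp = {"Potra v1.1": 387, "DISCOVAR": 491, "FALCON": 477}

fraction_of_expected = {
    name: length / EXPECTED_GENOME_MBP
    for name, length in total_length_mbp.items()
}
for name, fraction in fraction_of_expected.items():
    print(f"{name}\t{fraction:.1%} of expected genome size")
```

Note that the N50 values in the table mix scaffold and contig N50, so they are not directly comparable across rows.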