Data science at a molecular diagnostic company here. Glad to see this post.
One thing I would like to share is this collaborative review of deep learning, called Deep Review, which @MicPie mentioned. I am particularly interested in the DANN score for potentially linking multiple SNPs to a trait as linear weights; here is the link to the paper: https://academic.oup.com/bioinformatics/article/31/5/761/2748191
I am super curious if anyone in the fastai community has taken a look at this paper and thought about how fastai can fit into the picture.
On first glance, the publication Ching et al., bioRxiv, 2017 that the Greene Lab's GitHub led to seems worth reading.
This "perspective" paper, Zou et al., Nature Genetics, 2019, by authors out of Stanford, CZ Biohub, Scripps, etc., might also be a nice introduction to deep learning in genomics. @axelstram
I'm happy to advise on writing scientific papers, but I have a lot to learn in relation to writing deep learning papers.
An outline for a scientific paper is generally:
Introduction
Start with a few lines that explain what motivated your work - what the field was missing
Briefly summarize the work that came before yours and made your work possible. Include anything that stimulated your ideas
In the final paragraph write what you aimed to do and what you found
Results & Discussion
Explain the steps you took to generate some results, state the results, and relate your results to the results that others had obtained
If others haven't shown results like yours, state something like "to our knowledge, this is the first time this type of results has been shared"
Repeat for each logical step, including your final result.
Conclusion
The first line should summarize what you found in a broader sense than what you wrote in the introduction
State why what you did is important in a broad sense
Relate what you did to what others have done
Hypothesize about what could come next
Methods
Walk people through the more minute details of how you did stuff
Generally, scientists will write their methods section first, then the results/discussion section, followed by the introduction, and finish with the conclusion.
Hey Thom, I think your hunch is right: there is some data that is determined not to be of good enough quality to pass a threshold, because it is unclear how to determine where the read should actually "align", or which part of that squiggle is correct. Often the scientific way of looking into these areas would be to use a different sequencing method that targets the questionable region. This means that companies would have train/test data to use to build a model.
Initially I thought we should move it out, but now that I've had a look around, I don't think there is much related activity outside of part 2. So I think we can wait until the end of these classes.
While trying to understand the difference between transductive and inductive transfer learning, I googled and found a Quora answer by Waleed Kadous (AI PhD) that was easy to understand. It's pasted below:
Imagine you have training data, but only a subset of it has labels.
For example, say you are trying to classify whether an image has a flower in it or not. You have 100,000 images, but you only have 1,000 images that you know definitively contain a flower, and another 1,000 that you know don't contain a flower. The other 98,000 you have no idea about: maybe they have flowers, maybe they don't.
Inductive learning works by looking at the 2,000 labeled examples and building a classifier on this. Transductive learning (also known as semi-supervised learning) says: "Wait: maybe the other 98,000 images don't have labels, but they tell me something about the problem space. Maybe I can still use them to help improve my accuracy."
Bonus: There's one more really interesting type of learning, which is active learning. That is when you look through the 98,000 examples and can select a subset and request labels from an oracle. So the algorithm might say: "OK, of those 98,000, can you label this set of 50 images for me? That will help me build a better classifier."
So my understanding is:
Inductive transfer learning only uses labeled data
Transductive transfer learning uses labeled and unlabeled data to get an idea of the scope of the data that the model will attempt to classify
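The contrast above can be sketched with a toy 1-D nearest-neighbour example (all data made up for illustration): the inductive classifier consults only the labeled points, while the transductive/semi-supervised one first pseudo-labels the unlabeled pool and then reuses those points, so a cluster of unlabeled examples can shift its decision.

```python
# Toy 1-D sketch (hypothetical data) contrasting inductive and transductive
# use of unlabeled points with a 1-nearest-neighbour rule.

def nearest_label(x, points):
    """Label of the nearest (position, label) pair to x."""
    return min(points, key=lambda p: abs(p[0] - x))[1]

labeled = [(0.0, "no_flower"), (10.0, "flower")]  # the two known examples
unlabeled = [6.0, 7.0, 8.0, 9.0]                  # an unlabeled cluster

query = 4.0

# Inductive: only the labeled set is used; 0.0 is nearer than 10.0.
inductive = nearest_label(query, labeled)

# Transductive/semi-supervised step: pseudo-label the pool first, then
# classify the query using labeled + pseudo-labeled points. The unlabeled
# cluster around 6-9 gets pseudo-labeled "flower" and pulls the boundary.
pseudo = [(x, nearest_label(x, labeled)) for x in unlabeled]
transductive = nearest_label(query, labeled + pseudo)

print(inductive, transductive)  # no_flower flower
```

The predictions differ precisely because the unlabeled points told the second classifier something about the shape of the data, which is the point of the Quora answer.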
According to your Quora quote, transductive learning is another name for semi-supervised learning.
A quick search using the term "semi-supervised" instead of "transductive" yielded this article https://arxiv.org/abs/1812.05313 that might shed light on your question.
How could I bulk-download genome data from the web? I am trying to replicate @KarlH's notebooks on my computer, and downloading a few dozen bacterial sequences from the NCBI database took an hour or so of tedious clicking. Is there a better way?
I downloaded .fna files for the entire collection of bacteria using a regular browser. It is a 15 GB download, and it did not take too long. I am training a language model on 500 genomes now; one epoch on my GPU takes 5 hours. I will keep you posted on the results.
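For anyone who wants to script the download instead of clicking, here is a rough sketch. It assumes the usual NCBI RefSeq FTP layout, where `assembly_summary.txt` is a tab-separated table whose `ftp_path` column (index 19 in my copy; check the header of your download) points at a directory containing a `*_genomic.fna.gz` named after that directory. Treat the URL and column index as assumptions to verify against the file you actually get.

```python
# Hedged sketch: bulk-download bacterial genome FASTAs from NCBI RefSeq.
# The summary URL and the ftp_path column index are assumptions about the
# current NCBI FTP layout; verify them before running a large download.
import urllib.request

SUMMARY_URL = "https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt"

def genomic_fna_url(ftp_path):
    """Build the URL of the gzipped genome FASTA inside one assembly directory."""
    accession = ftp_path.rsplit("/", 1)[-1]          # e.g. GCF_000005845.2_ASM584v2
    return ftp_path + "/" + accession + "_genomic.fna.gz"

def download_some(n=10):
    """Fetch the summary table and download the first n genome FASTAs."""
    with urllib.request.urlopen(SUMMARY_URL) as resp:
        lines = resp.read().decode().splitlines()
    done = 0
    for line in lines:
        if line.startswith("#"):                     # skip comment/header rows
            continue
        ftp_path = line.split("\t")[19]              # assumed 'ftp_path' column
        url = genomic_fna_url(ftp_path)
        urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])
        done += 1
        if done >= n:
            break

# URL construction only (no network needed):
print(genomic_fna_url(
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000005845.2_ASM584v2"))
```

Call `download_some(50)` (or whatever count you need) to replace the hour of clicking with one loop.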
Training on 1000 genomes now. Accuracy has improved to 0.19, breaking through the previous barrier of 0.18. That is a good sign, because so far I have gone through less than 10% of the training data. It is painfully slow, but training is going in the right direction: accuracy keeps improving and the model is not overfitting yet.
If any of you know how to interpret the accuracy metric in the context of a language model, please explain it to me. I only understand accuracy in the context of classification, and a language model is not classification, or is it?
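It actually is classification: at every position the language model predicts the next token out of the whole vocabulary, so accuracy is the fraction of positions where the highest-scoring token matches the actual next token. A minimal illustration with made-up tokens:

```python
# A language model is a classifier at every position: it predicts the next
# token out of the vocabulary. Accuracy = fraction of positions where the
# model's top choice equals the actual next token. Toy data for illustration.

targets     = ["A", "C", "G", "T", "A"]   # actual next tokens
predictions = ["A", "C", "T", "T", "G"]   # model's argmax at each position

correct = sum(p == t for p, t in zip(predictions, targets))
accuracy = correct / len(targets)
print(accuracy)   # 3 of 5 positions match -> 0.6
```

One caveat for genomic models: what counts as "chance level" depends on the tokenizer. If tokens are k-mers rather than single bases, the vocabulary is large and an accuracy like 0.18-0.19 can be well above random guessing.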