Official project group thread

I believe that since we were having issues keeping a host on all the time (hosting had to bounce back and forth, and if the chain was broken Jeremy had to be online to assign a new one), he shut it off for now?

Is there a regular study group meeting aside from the local ones?

I haven’t seen one yet. I would like to see more topic-oriented study groups, like NLP, Vision, etc. Some people might be doing that in the project-based groups.

What do you have in mind?

Maybe we can use MS Teams; it’s free. We use it here for education and it works fine too.

Any update on this? The Zoom room for projects has been inactive for days…

4 Likes

I’ve become increasingly interested in fully understanding what Zeiler and Fergus did in visualizing the hidden layers in a neural network. There’s some good information in this post from last year. In particular I’m interested to see if it’s possible to do this for NLP as well as image recognition (but just implementing it with current fastai tools for imagery would be a great start).
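As a first building block, something like this minimal sketch of capturing intermediate activations with PyTorch forward hooks might work; the model and layer choice here are just illustrative assumptions, not what Zeiler and Fergus used:

import torch
import torchvision.models as models

# Capture intermediate activations with a forward hook; each stored
# channel can then be rendered as a heatmap for inspection.
model = models.resnet18(pretrained=True).eval()
activations = {}

def save_activation(name):
    def hook(module, inp, out):
        activations[name] = out.detach()
    return hook

model.layer1.register_forward_hook(save_activation("layer1"))

x = torch.randn(1, 3, 224, 224)   # stand-in for a real image batch
with torch.no_grad():
    model(x)

print(activations["layer1"].shape)  # torch.Size([1, 64, 56, 56])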

To me it seems critically important to understand the progression that the network is going through, at least inasmuch as human beings can recognize and understand it, in order to arrive at an answer.

If anyone is interested in collaborating or contributing, please let me know!

Thanks,

David

Any update on the Zoom link? I tried joining today for the first time in around a week and saw the “meeting scheduled for…” message referenced earlier in the thread. Is there another group besides the Discord chat?

5 Likes

Hi, I am new to deep learning and would love any feedback, especially on the best way to convert my input into images that a CNN would understand. Here is a description of my project:

Modeling DNA sequencing error

My project is in the field of bioinformatics. The main idea is to estimate the sequencing error that results from sequencing machines.

Background:

The DNA sequence is a very long string of the characters A, C, G, T. The order of those characters is very well defined, thanks to the Human Genome Project in 2000. Variations in that order are what differentiate one person from another, including people with disease. When we take a sample from a person, we are interested in what the sequence is at a specific location. To do this, we extract the DNA from the sample and run it through a sequencing machine, which basically spits out what each position is along the DNA.

Now, let’s assume that the specific position we are interested in has the letter T, and the sequence around it is ACCGGTGTAAA. If the sequencing machine makes no mistakes when reading the DNA, it should output ACCGGTGTAAA. However, sometimes the machine reads something wrong and instead spits out, for example, ACCGGAGTAAA. This is called a sequencing error, and the probability of that error happening depends on the sequence content of the nearby positions.

For this reason, when we do sequencing, we do not ask the machine to read exactly one piece of DNA; we extract thousands of copies of that same piece and sequence them all. The reasoning is that the probability of the machine making that same error at that same position in all of the thousands of reads is low. However, again, depending on the sequence context, some regions will have higher error rates than others.

Dataset:
I have a whole bunch of data from normal samples. I identified the exact positions I am interested in and sequenced them.

Example:
Let’s say we have one specific position where the actual sequence is supposed to be ACCGGTGTAAA. After sequencing 10 times, these are the reads we got (* marks the position we are interested in):
-----*-----
ACCGGTGTAAA
ACCGGTGTAAA
ACCGGTGTAAA
ACCGGTGTAAA
ACCGGAGTAAA
ACCGGAGTAAA
ACCGGTGTAAA
ACCGGTGTAAA
ACCGGTGTAAA
ACCGGTGTAAA

In this case, we can say that the total depth is 10 and the error (reading an A instead of a T at the target position) is 2/10. Let’s call this last metric the error rate.
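As a quick sketch in Python, computing those two numbers from the example reads above:

# Compute depth and error rate for the example reads above.
reads = [
    "ACCGGTGTAAA", "ACCGGTGTAAA", "ACCGGTGTAAA", "ACCGGTGTAAA",
    "ACCGGAGTAAA", "ACCGGAGTAAA", "ACCGGTGTAAA", "ACCGGTGTAAA",
    "ACCGGTGTAAA", "ACCGGTGTAAA",
]
target_pos = 5        # 0-based index of the position of interest
expected_base = "T"   # the base we expect at that position

depth = len(reads)
errors = sum(1 for r in reads if r[target_pos] != expected_base)
error_rate = errors / depth
print(depth, errors, error_rate)  # 10 2 0.2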

Model:
I am solving a regression problem, where I am trying to predict the error rate based on the sequence content. My inputs would be:

  1. The sequence content of the expected sequence
  2. The total depth of that sequence (in the previous example, 10; after all, getting 2 errors out of 10 reads is not the same as getting 4 errors out of 20 reads)

So my data looks like:
sequence       depth   error rate (label to predict)
ACCGGTGTAAA    10      0.2
CCGTCAGTTAA    20      0.1

Initial Idea
My initial thought is to transform the sequence into an image using one-hot encoding. For the given example, the matrix would look like:
A C C G G T G T A A A
A 1 0 0 0 0 0 0 0 1 1 1
C 0 1 1 0 0 0 0 0 0 0 0
G 0 0 0 1 1 0 1 0 0 0 0
T 0 0 0 0 0 1 0 1 0 0 0
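In code, a minimal sketch of this encoding (assuming PyTorch):

import torch
import torch.nn.functional as F

BASES = "ACGT"

def one_hot(seq):
    # Returns a (4, len(seq)) float tensor: one row per base, as above.
    idx = torch.tensor([BASES.index(b) for b in seq])
    return F.one_hot(idx, num_classes=4).T.float()

x = one_hot("ACCGGTGTAAA")
print(x.shape)  # torch.Size([4, 11])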

Then I would transform this into an image/tensor, similar to what was shown in lesson 3 for the MNIST dataset. My questions now are:

  1. Does this sequence representation make sense, or is my sequence really just a single vector, making it overkill to convert it into an image?
  2. How do I pass the depth as a second feature or channel to the CNN?
  3. What is the best architecture to start experimenting with? I am guessing a ResNet would not be suitable for this type of problem.

Thank you for taking the time to read all this :)

3 Likes

@Dina IIRC people have done this with language models (and ULMFiT in particular) before. See here:

kheyer/Genomic-ULMFiT

sergeman/fastai-genomic

3 Likes

I would look into recurrent neural networks (in particular, LSTMs) for this problem first. They are a somewhat old technique, but they are simpler to use than more modern alternatives, and they work well on this kind of problem.

Some resources:

Long Short-Term Memory (LSTM) networks
https://colah.github.io/posts/2015-08-Understanding-LSTMs/

This is a GREAT presentation to get an intuition, before you dive into the details: https://livefreeordichotomize.com/2017/11/08/lstm-neural-nets-as-told-by-baseball/

For your problem, since you want to predict at a position in the middle of the sequence, you can use a bidirectional LSTM.
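A minimal sketch of what that could look like in PyTorch; the hidden size, the regression head, and the way depth is appended are illustrative assumptions, not a tuned design:

import torch
import torch.nn as nn

class ErrorRateLSTM(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden + 1, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid())  # error rate lies in [0, 1]

    def forward(self, seq, depth):
        out, _ = self.lstm(seq)           # (batch, seq_len, 2*hidden)
        mid = out[:, out.shape[1] // 2]   # hidden state at the target position
        return self.head(torch.cat([mid, depth.unsqueeze(1)], dim=1))

model = ErrorRateLSTM()
seq = torch.randn(8, 11, 4)      # stand-in for one-hot encoded reads
depth = torch.full((8,), 10.0)   # read depth per example
print(model(seq, depth).shape)   # torch.Size([8, 1])

Reading out the hidden state at the target position gives the head context from both directions around that position, which is exactly what the bidirectional layer provides.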

2 Likes

I’m assuming that the lack of responses means the project group sessions are done for now. Does everyone meet up in the smaller study groups? Is there a “meta-thread” with a list of the groups, or do I just dig through the forum? Thanks.

3 Likes

You could look at the Source Code Study Group

Also, the Fastbook Study Group

Note that both of these study groups generally have fairly advanced discussions.

2 Likes

Would anyone be interested in teaming up with me on the flower classification Kaggle competition https://www.kaggle.com/c/flower-classification-with-tpus as a group project? We can compare how fastai2 does on GPUs versus TPUs, and we might be able to compare it to TensorFlow. We can also try techniques such as data augmentation, GANs for semi-supervised learning, and label smoothing: https://towardsdatascience.com/what-is-label-smoothing-108debd7ef06
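For label smoothing specifically, fastai2 ships a LabelSmoothingCrossEntropy loss. A minimal sketch; the data path and block settings below are placeholders, not the actual competition pipeline:

from fastai.vision.all import *

# Placeholder pipeline: "flowers/" is a hypothetical class-per-folder
# layout, not the competition's TFRecord data.
dls = ImageDataLoaders.from_folder(Path("flowers"), valid_pct=0.2,
                                   item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=accuracy,
                    loss_func=LabelSmoothingCrossEntropy())
learn.fine_tune(1)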

1 Like

Looks interesting. Happy to team up. But it looks like fastai still doesn’t work with TPUs yet (?), and this competition seems to be geared towards TPU usage. It seems PyTorch only recently started supporting TPUs: https://discuss.pytorch.org/t/pytorch-tpu-support/25504

1 Like

Best case, we can test on fastai and then reproduce on TensorFlow. Google would at least appreciate the feedback on where TensorFlow is falling short; for instance, it absorbed Keras to improve usability. Worst case, we can always run the fastai code on GPUs and then export the model to run on TPUs or even CPUs. Kaggle’s flower dataset seems like a good one to test our knowledge from the first few classes, even if it’s not quite what Google was looking for when it started this competition.

1 Like

Sounds good! Happy to collaborate. Have you started working on this already?

Thanks for the info.

1 Like

I have not started but plan to soon. I just formed a team. What is your Kaggle username so I can invite you?

Awesome. You’ll need to form a team in order to merge. I went with “Fast Team”, which seems simple enough. You can merge with that team, or you can tell me your team name and I can try merging.

Hey David, I’m interested. I also want to get a deeper understanding of their approach.

1 Like