Official project group thread

Any update on this? The Zoom room for projects has been inactive for days…

4 Likes

I’ve become increasingly interested in fully understanding what Zeiler and Fergus did in visualizing the hidden layers in a neural network. There’s some good information in this post from last year. In particular I’m interested to see if it’s possible to do this for NLP as well as image recognition (but just implementing it with current fastai tools for imagery would be a great start).

To me it seems critically important to understand the progression that the network is going through, at least inasmuch as human beings can recognize and understand it, in order to arrive at an answer.
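As a very low-tech starting point (not the Zeiler-Fergus deconvnet approach itself, just plain PyTorch forward hooks), here is a hedged sketch of how one might capture an intermediate layer's activations to inspect; the layer choice and sizes are arbitrary:

```python
import torch
from torchvision import models

model = models.resnet18(pretrained=False)  # weights optional for the mechanics
activations = {}

def save_activation(name):
    def hook(module, inp, out):
        activations[name] = out.detach()
    return hook

# register a hook on an early residual block to grab its output
model.layer1.register_forward_hook(save_activation("layer1"))

x = torch.rand(1, 3, 224, 224)
model(x)
print(activations["layer1"].shape)  # torch.Size([1, 64, 56, 56])
```

From there one could plot individual channels of the captured tensor to see what each filter responds to.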

If anyone is interested in collaborating or contributing, please let me know!

Thanks,

David

Any update on the Zoom link? I tried joining today for the first time in about a week and saw the “meeting scheduled for…” message referenced earlier in the thread. Is there another group besides the Discord chat?

5 Likes

Hi, I am new to deep learning and I would love any feedback, especially on the best way to convert my input into images that a CNN would understand. Here is a description of my project:

Modeling DNA sequencing error

My project is in the field of bioinformatics. The main idea of the problem is to estimate sequencing error that results from sequencing machines.

Background:

A DNA sequence is a very long array of the characters A, C, G, T. The order of those characters is very well defined, thanks to the Human Genome Project in 2000. Variations in that order are what differentiate a healthy person from one with a disease. When we take a sample from a person, we are interested in knowing what the sequence at a specific location is. To do this, we extract the DNA from the sample and run it through a sequencing machine, which basically spits out what each position along the DNA is.

Now, let's assume that the specific position we are interested in has the letter T, and the sequence around it is ACCGGTGTAAA. If the sequencing machine does not make any mistakes when reading the DNA, it should output ACCGGTGTAAA. Sometimes, however, the machine reads something wrong and spits out, for example, ACCGGAGTAAA. This is called a sequencing error, and the probability of it happening depends on the sequence content of the nearby positions.

For this reason, when we do sequencing we do not ask the machine to read exactly one piece of DNA; we extract thousands of copies of that same piece and sequence them all. The reason is that the probability of the machine making that same error at that same position in all of the thousands of reads is low. However, depending on the sequence context, some regions will still have higher error rates than others.

Dataset:
I have a whole bunch of data from normal samples. I have identified the exact positions I am interested in and sequenced them.

Example:
Let's say we have one specific position where the actual sequence is supposed to be ACCGGTGTAAA. After sequencing 10 times, these are the reads we got (* marks the position we are interested in):
-----*-----
ACCGGTGTAAA
ACCGGTGTAAA
ACCGGTGTAAA
ACCGGTGTAAA
ACCGGAGTAAA
ACCGGAGTAAA
ACCGGTGTAAA
ACCGGTGTAAA
ACCGGTGTAAA
ACCGGTGTAAA

In this case, we say that the total depth is 10 and the error (reading an A instead of a T at the target position) is 2/10 = 0.2. Let's call this last metric the error rate.

Model:
I am solving a regression problem, trying to predict the error rate based on the sequence content. My inputs would be:

  1. The sequence content of the expected sequence
  2. The total depth of that sequence (10 in the previous example; after all, getting 2 errors out of 10 reads is not the same as getting 4 errors out of 20 reads)

So my data looks like:
sequence       depth   error rate (predicted label)
ACCGGTGTAAA    10      0.2
CCGTCAGTTAA    20      0.1

Initial Idea
My initial thought is to transform the sequence into an image using one-hot encoding. For the given example, the matrix would look like:
  A C C G G T G T A A A
A 1 0 0 0 0 0 0 0 1 1 1
C 0 1 1 0 0 0 0 0 0 0 0
G 0 0 0 1 1 0 1 0 0 0 0
T 0 0 0 0 0 1 0 1 0 0 0

Then I would transform this to an image/tensor, similar to what was shown in lesson 3 for the MNIST dataset. My questions now are:

  1. Does this sequence representation make sense, or is my sequence really just a one-dimensional vector, making it overkill to convert it into an image?
  2. How do I pass the depth as a second feature or channel to the CNN?
  3. What is the best architecture to start experimenting with? I am guessing a ResNet would not be suitable for this type of problem.
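For concreteness, here is a minimal sketch of the encoding from question 1, together with one possible answer to question 2 (passing the depth as a second, constant channel). This is only an illustration under those assumptions; the helper names are made up:

```python
import numpy as np
import torch

BASES = "ACGT"  # row order of the one-hot matrix

def one_hot(seq):
    """Encode a DNA string as a 4 x len(seq) one-hot matrix."""
    mat = np.zeros((4, len(seq)), dtype=np.float32)
    for i, base in enumerate(seq):
        mat[BASES.index(base), i] = 1.0
    return mat

def to_tensor(seq, depth):
    """Stack the one-hot matrix with a constant depth channel,
    giving a 2 x 4 x len(seq) tensor a CNN could consume."""
    hot = one_hot(seq)
    depth_channel = np.full_like(hot, float(depth))
    return torch.from_numpy(np.stack([hot, depth_channel]))

x = to_tensor("ACCGGTGTAAA", depth=10)
print(x.shape)  # torch.Size([2, 4, 11])
```

An alternative to the depth channel would be to concatenate the (normalised) depth with the flattened convolutional features just before the final regression layer.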

Thank you for taking the time to read all this :slight_smile:

3 Likes

@Dina IIRC people have done this with language models (including ULMFiT) before. See here:

kheyer/Genomic-ULMFiT

sergeman/fastai-genomic

3 Likes

I would look into recurrent neural networks (in particular, LSTMs) for this problem first. They are a somewhat old technique, but they are simpler to use than more modern alternatives and they work well on this kind of problem.

Some resources:

Long Short-Term Memory networks
https://colah.github.io/posts/2015-08-Understanding-LSTMs/

This is a GREAT presentation to get an intuition, before you dive into the details: https://livefreeordichotomize.com/2017/11/08/lstm-neural-nets-as-told-by-baseball/

For your problem, since you want to predict in the middle of the sequence, you can use a bi-directional LSTM:
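For example, a minimal PyTorch sketch of that idea, assuming the one-hot encoding discussed above (4 features per position) and reading out the representation at the centre position; all sizes and names here are placeholders:

```python
import torch
import torch.nn as nn

class ErrorRateLSTM(nn.Module):
    """Bidirectional LSTM that reads a one-hot DNA sequence plus a
    scalar depth and regresses the error rate (a value in [0, 1])."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden + 1, 1)  # +1 for the depth feature

    def forward(self, seq, depth):
        # seq: (batch, seq_len, 4), depth: (batch, 1)
        out, _ = self.lstm(seq)
        centre = out[:, out.shape[1] // 2, :]  # states at the target position
        return torch.sigmoid(self.head(torch.cat([centre, depth], dim=1)))

model = ErrorRateLSTM()
seq = torch.rand(8, 11, 4)      # batch of 8 sequences of length 11
depth = torch.rand(8, 1)        # normalised depths
print(model(seq, depth).shape)  # torch.Size([8, 1])
```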

2 Likes

I’m assuming that the lack of responses means the project group sessions are done for now. Does everyone meet up in the smaller study groups? Is there a “meta-thread” with a list of the groups, or do I just dig through the forum? Thanks.

3 Likes

You could look at the Source Code Study Group

Also, the Fastbook Study Group

Note that both these study groups generally have advanced discussions.

2 Likes

Would anyone be interested in teaming up with me on the flower classification Kaggle competition (https://www.kaggle.com/c/flower-classification-with-tpus) as a group project? We can compare how fastai2 does with GPUs and TPUs, and we might be able to compare it to TensorFlow. We can also try techniques such as data augmentation, GANs for semi-supervised learning, and label smoothing (https://towardsdatascience.com/what-is-label-smoothing-108debd7ef06).
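For reference, label smoothing is already built into fastai2 as a drop-in loss function. A minimal sketch, assuming `dls` is whatever DataLoaders we end up building for the flower images:

```python
from fastai.vision.all import *

# assumes `dls` is an ImageDataLoaders built from the flower dataset
learn = cnn_learner(dls, resnet34,
                    loss_func=LabelSmoothingCrossEntropy(),
                    metrics=accuracy)
learn.fine_tune(5)
```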

1 Like

Looks interesting. Happy to team up. But it looks like fastai still doesn’t work with TPUs yet (?), and this competition seems to be geared towards TPU usage. PyTorch did start supporting TPUs recently: https://discuss.pytorch.org/t/pytorch-tpu-support/25504

1 Like

Best-case scenario, we can test on fastai and then reproduce on TensorFlow. Google would at least appreciate the feedback on where TensorFlow is falling short; for instance, it absorbed Keras to improve usability. Worst case, we can always run the fastai code on GPUs and then export the model to run on TPUs or even CPUs. Kaggle’s flower dataset seems like a good one to test our knowledge from the first few classes, even if it’s not quite what Google was looking for when it started this competition.

1 Like

Sounds good! Happy to collaborate. Have you started working on this already?

Thanks for the info.

1 Like

I have not started but plan to soon. I just formed a team. What is your Kaggle user name, so I can invite you?

Awesome. You’ll need to form a team in order to merge. I went with “Fast Team”; seems simple enough. You can merge with that team, or you can tell me your team name and I can try merging.

Hey David, I’m interested. I also want to have a deeper understanding of their approach.

1 Like

Hey Kofi, great!

I’ve been thinking that this might be the best approach in terms of learning:

https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html

I think he provides a lot of information, some of which I don’t understand, but all of which seems really interesting. For instance:

“Note that we only go up to the last convolutional layer – we don’t include fully-connected layers. The reason is that adding the fully connected layers forces you to use a fixed input size for the model (224x224, the original ImageNet format). By only keeping the convolutional modules, our model can be adapted to arbitrary input sizes.”

This makes me think of many questions, including:
  1. Wait, what? The last convolutional layer is fully connected? Last in what direction? If it’s the one before the target, how is it different from all the others? In the diagrams I’ve seen, each layer is fully connected, which means each node in layer n gets inputs from every node in layer n-1.
  2. “Adding the fully connected layers forces you to use a fixed input size for the model”: I don’t understand this at all. I believe he wrote somewhere else that in Keras, each layer is aware of how many inputs are coming in, so it “does it automatically for you”. So that should be able to be turned on or off, I would think. And then what does it mean to not be fully connected?
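For what it’s worth, the quoted claim is easy to poke at in code: keep only the convolutional modules of a pretrained network and it will accept inputs of different sizes, whereas the fully-connected head expects one fixed flattened size. A quick sketch, assuming torchvision (the weights don’t matter for the shape check):

```python
import torch
from torchvision import models

vgg = models.vgg16(pretrained=False)  # VGG16, as used in the Keras post
conv_only = vgg.features              # just the convolutional modules, no FC head

for size in (224, 331):
    x = torch.rand(1, 3, size, size)
    print(size, conv_only(x).shape)
# 224 -> torch.Size([1, 512, 7, 7])
# 331 -> torch.Size([1, 512, 10, 10])
```

The fully-connected layers pin the input size because their first weight matrix has its input dimension baked in at construction time (512 * 7 * 7 for VGG16 at 224x224); the conv layers just slide over whatever spatial extent they’re given.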

At any rate, how do you feel about trying to get this understood and working?

Thanks,

David

I’ve just retired from my work on Masks4All so I can focus full-time on fast.ai again, and I’m planning to use Discord as a regular hangout place. So please drop by and hang out if you’re interested in being involved with rolling out fastai2, the course, etc. I’m hoping to do some screen sharing as well, although haven’t set up specific times yet.

Here’s the Discord:

16 Likes

Nice! I will drop by for sure. BTW, I forgot the “go live” date of the 2020 course… is it still September, or did it change?

This link shows as invalid in Discord.