I am working on a problem of classifying peptides as positive or negative based on their protein sequence. The dataset has only one attribute, the protein sequence, and the sequences have different lengths. My problem is that I don’t know how to handle this kind of data or how to apply ML to such a dataset.
I can share what worked for me. I had a problem and didn’t know how to apply ML to it; I really didn’t know much about ML at all. I found fastai and have worked through the lessons for about 3 years (hey, slow learner, not the sharpest knife in the drawer, yeah, I get it). But I kept working the examples in the courses and finally found one that was quite similar to my problem. With Jeremy’s rewrite, I am confident of having industrial-strength results.
Start running the Jupyter Notebook examples. Don’t delve too deep. If an example doesn’t look like your problem, bail out and go to the next one.
Pay attention to the lesson on the universal approximation theorem.
Hey @ksksingh022! It sounds like your problem could be described as a “sequence classification” problem - i.e. you start with a sequence of amino acids, and try to predict a binary output.
This is very closely related to lots of text classification problems - for example, extracting sentiment (positive or negative) from a piece of text.
If you’re able to express your protein sequence as a piece of text (e.g. map each amino acid to a letter), you should be able to use pretty standard features of the fastai or PyTorch libraries.
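To make the "map each amino acid to a letter" idea a bit more concrete, here is a rough sketch in plain Python of how variable-length sequences could be turned into equal-length integer rows. It assumes the 20 standard one-letter amino acid codes; the names `AMINO_ACIDS`, `encode`, and `pad_batch`, and the example peptides, are made up for illustration and are not part of fastai or PyTorch.

```python
# Assumption: sequences use the 20 standard one-letter amino acid codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_IDX = 0                      # index reserved for padding
UNK_IDX = len(AMINO_ACIDS) + 1   # index for any unexpected character

# Letters map to 1..20 so that 0 stays free for padding.
AA_TO_IDX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(seq):
    """Turn a peptide string into a list of integer token ids."""
    return [AA_TO_IDX.get(aa, UNK_IDX) for aa in seq.upper()]


def pad_batch(seqs):
    """Encode each sequence and right-pad all of them to the longest length."""
    encoded = [encode(s) for s in seqs]
    max_len = max(len(e) for e in encoded)
    return [e + [PAD_IDX] * (max_len - len(e)) for e in encoded]


# Hypothetical peptides of different lengths.
batch = pad_batch(["ACDK", "GG", "MKTAYIAK"])
for row in batch:
    print(row)
```

Rows like these can then be fed to an embedding layer (for example `torch.nn.Embedding` with `padding_idx=0`) followed by a recurrent or convolutional classifier, which is roughly what text classification pipelines do with words or characters.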
This bit of fastai docs might be a good starting point.
Or this PyTorch tutorial.
As @mike00 mentioned, though, deep learning isn’t simple, and if this is your first experience with it you’d be better off going through the course first. Good luck!