FitLaM for long sequences

fizx · April 26, 2018, 10:05pm

I’m using FitLaM to classify git merge conflicts as left/right. I trained a BPE+LSTM language model over a gig of C code. I’ve found ~5000 C merge conflicts in the wild with sequence lengths less than 1400 tokens. A FitLaM classifier gets these right around 85% of the time. I thought I’d have to do a lot of domain-specific things or advanced tricks, so it’s amazing (suspicious?) to see this work so well basically out of the box.

Obviously, many source code files are bigger than this. Is there something more obvious or interesting to do than taking an ensemble of sliding windows?
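
For reference, a minimal sketch of the sliding-window ensemble mentioned above. It assumes a hypothetical `predict_proba` callable that maps one window of token ids to P(left); the 1400-token window matches the length limit in the post, while the 700-token stride and plain averaging are illustrative choices, not anything from the thread.

```python
from typing import Callable, List, Sequence


def sliding_windows(tokens: Sequence[int], window: int = 1400,
                    stride: int = 700) -> List[Sequence[int]]:
    """Split tokens into overlapping windows of at most `window` tokens.

    A final window anchored at the end is added when the stride would
    otherwise leave the tail of the sequence uncovered.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [tokens]
    starts = list(range(0, len(tokens) - window + 1, stride))
    if starts[-1] + window < len(tokens):
        starts.append(len(tokens) - window)
    return [tokens[s:s + window] for s in starts]


def ensemble_predict(tokens: Sequence[int],
                     predict_proba: Callable[[Sequence[int]], float],
                     window: int = 1400, stride: int = 700) -> float:
    """Average the per-window P(left) scores (hypothetical classifier)."""
    probs = [predict_proba(w) for w in sliding_windows(tokens, window, stride)]
    return sum(probs) / len(probs)
```

Averaging treats every window equally; taking the max or weighting windows that overlap the conflict markers would be other options to compare.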

anamariapopescug · April 26, 2018, 10:48pm

This is awesome - did you try any other baseline before? (just curious if it’s an “easy” or “hard” problem)

fizx · April 26, 2018, 11:05pm

This is awesome - did you try any other baseline before?

No. I should perhaps grab lingpipe or weka (or maybe something else that’s not 10 years out of date?) and see if something trivial/bag-of-wordsy gets any traction. I’m slightly worried that I could be doing the equivalent of labelling all grassy hillsides as containing sheep. If a trivial model gets traction, that might be indicative.
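
As a rough idea of what a trivial bag-of-words baseline could look like without any toolkit, here is a minimal multinomial naive Bayes sketch in plain Python. The function names `train_nb` and `predict_nb` and the `(tokens, label)` input format are assumptions for illustration; this is not what lingpipe or weka do internally.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Set, Tuple

Model = Tuple[Counter, Dict[str, Counter], Set[str]]


def train_nb(examples: Sequence[Tuple[Sequence[str], str]]) -> Model:
    """Count tokens per class; token order is ignored (bag of words)."""
    class_counts: Counter = Counter()
    token_counts: Dict[str, Counter] = {}
    vocab: Set[str] = set()
    for tokens, label in examples:
        class_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def predict_nb(model: Model, tokens: Sequence[str]) -> str:
    """Return the class with the highest add-one-smoothed log score."""
    class_counts, token_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        counts = token_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(n / total)
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are skipped
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

If something this simple lands well above chance on held-out conflicts, that would hint that surface token statistics carry much of the signal.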

(just curious if it’s an “easy” or “hard” problem).

I think it’s a hard problem. Even for a human, these are hard to figure out a non-trivial percentage of the time.

jeremy · April 26, 2018, 11:09pm

I’m looking forward to hearing what you find out!

anamariapopescug · April 26, 2018, 11:24pm

My intuition says you’d still be better off with FitLaM (or similar) than just some BOW-based model … just a question of how much, should you care :). Is the data fairly balanced, btw? Anyway - very fun, good luck with the investigation.

fizx · April 26, 2018, 11:31pm

In the wild, merge conflicts are 3:2 left:right. I undersampled left to balance the data.
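
A minimal sketch of that kind of undersampling, assuming examples are `(features, label)` pairs; the `undersample` helper and the fixed seed are hypothetical, chosen only to make the illustration reproducible.

```python
import random
from collections import defaultdict
from typing import Any, List, Sequence, Tuple


def undersample(examples: Sequence[Tuple[Any, str]],
                seed: int = 0) -> List[Tuple[Any, str]]:
    """Randomly drop examples so every label is as frequent as the rarest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    # Size of the smallest class, e.g. "right" for a 3:2 left:right split.
    n = min(len(items) for items in by_label.values())
    balanced: List[Tuple[Any, str]] = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, n))
    rng.shuffle(balanced)
    return balanced
```

With the classes balanced this way, always guessing one side scores 50%, versus 60% for always guessing left on the natural 3:2 distribution.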