Let's apply our skills to some other datasets. One of the image recognition competitions currently running on Kaggle is Dog Breed Identification, which we can use as our sandbox. It should be a good competition to start with, as these classes belong to ImageNet categories, so building a CNN from scratch should not be required.
It's a pity, but it looks like this competition, even though it is a playground, has some rules:
Privately sharing code or data outside of teams is not permitted. It’s okay to share code if made available to all participants on the forums.
So I am not sure how to collaborate on the forum without violating this rule. Maybe @jeremy can help us with advice on this? I thought we could form a team of all fastai students, and after that we could share everything on the forum. Or maybe Kaggle's rule also covers the fastai forums?
One more thing to consider is:
Submission Limits
You may submit a maximum of 5 entries per day.
One idea is to work on the problem individually or in small teams for 3-4 weeks, and later merge together and compare the predictions to get a good stacked model out of uncorrelated predictions from different teams.
I agree it makes the most sense to start in smaller teams so we can maximize our submission counts per day leading up to the merger deadline, at which point we could elect to join forces. As @ar_ai mentioned, we could improve our scores a lot just by ensembling/averaging all of our predictions together. Just keep in mind we wouldn’t be able to share our code outside teams.
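For anyone who hasn't done this before, here is a minimal sketch of the averaging idea. It assumes each team ends up with an array of class probabilities of shape (n_images, n_classes), with rows and columns in the same order for every team. The function name and the toy numbers are made up for illustration, not taken from anyone's actual pipeline.

```python
import numpy as np

def average_predictions(prediction_sets):
    """Average several (n_images, n_classes) probability arrays.

    Assumes every array uses the same image order and class order.
    """
    stacked = np.stack(prediction_sets)   # shape: (n_models, n_images, n_classes)
    mean = stacked.mean(axis=0)           # simple unweighted average
    # Renormalize so each row still sums to 1 after averaging.
    return mean / mean.sum(axis=1, keepdims=True)

# Toy predictions from two hypothetical teams: 2 images, 3 breeds.
team_a = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3]])
team_b = np.array([[0.5, 0.4, 0.1],
                   [0.2, 0.2, 0.6]])

ensemble = average_predictions([team_a, team_b])
print(ensemble)
```

A plain mean is the simplest option. Weighting each team by its validation score or fitting a stacking model on held-out predictions are natural next steps, but they need a shared validation set.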
After the merger, the team's total number of submissions can’t exceed the number of days the competition has been running multiplied by 5. So we can’t form big teams if individual participants have already submitted a lot. We have to keep that in mind as well. We may have to rely on internal validation a lot.
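To make the arithmetic concrete, here is a rough sketch of that check. It assumes the cap applies to the sum of every member's submissions so far, as described above. The helper name and the counts are hypothetical.

```python
DAILY_LIMIT = 5  # maximum entries per day, from the rules quoted above

def can_merge(submission_counts, days_running, daily_limit=DAILY_LIMIT):
    """Return True if the combined submission count fits under the cap.

    Assumption: the cap is days_running * daily_limit, applied to the
    sum of all prospective members' submissions.
    """
    return sum(submission_counts) <= days_running * daily_limit

# Hypothetical example: 20 days in, so the cap is 20 * 5 = 100.
three_members = [40, 35, 30]   # 105 submissions in total
two_members = [40, 35]         # 75 submissions in total

print(can_merge(three_members, days_running=20))
print(can_merge(two_members, days_running=20))
```

With these made-up numbers the three-person merge is over the cap, while the two-person merge is fine, which is why heavy individual submitters limit how big a team can get.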
I either keep it on my machine locally or use a private GitHub repo so it can be shared among my team members, and release it publicly after the comp ends.
Often in teams one member can host the repo and invite the others as contributors, so not everyone needs to have a paid GitHub plan. Or, like you said, Bitbucket is another good option for free private repos.
There are a couple of different Kaggle command-line tools you can use to download the datasets to AWS etc. Below are two options that I know of that work well.