Homework ideas or assignments for the course

msivanes · August 16, 2020, 1:43pm

General

The fast.ai Datasets page is a great way to get hands-on with a different dataset.
Google Dataset Search is another way to find datasets related to a particular topic.
More than 200 NLP datasets are available at the Big Bad NLP Database.
The GitHub repository maintained by @nirantk, which links to datasets, is also a great way to find inspiration for projects.

Vision

Classification
- Spot the mask
MultiLabel
- Bengali.AI Handwritten Grapheme Classification
- Planet: Understanding the Amazon from Space
Object Segmentation (CamVid)
- iMaterialist (Fashion) 2020 at FGVC7 | Kaggle
Object Detection
- Pascal Visual Object Classes 2012 & HomePage

Tabular

Regression
- House Sale Price Prediction - Advanced Regression techniques
- Predict Patient Survival
Classification
Tabular fastai2 baselines and data are available from @muellerzr's repository for the following:
- Poker Hand Induction
- Higgs Boson

Collab

Zindi Restaurant Recommendation
Datasets (Datasets – RS_c) and Courses (Online Courses On Recommender Systems – RS_c)

Text

Classification - RNN/Transformers
- Zindi To vaccinate [Dataset]
MultiLabel
- Toxic Comment Classification Challenge | Kaggle
TextExtraction
- Tweet Sentiment Extraction | Kaggle
Question Answering
- MS Marco Question Answering (Note: as mentioned on GitHub, transfer learning was not explored on this dataset.)

Multimodal

PetFinder - Combining images, text, and tabular data for prediction.

TimeSeries

TBA

Ranking

MS Marco Passage Ranking

Other Competitions

Dravidian-CodeMix — sentiment analysis for Dravidian languages in code-mixed text found on social media
IEEE BigData 2020 Cup — a data mining challenge to predict escalations in customer technical support using natural language techniques
NLC2CMD — translate English descriptions of command-line tasks to their corresponding Bash syntax
Contradictory, My Dear Watson: Detecting contradiction and entailment in multilingual text using TPUs. This is a playground-type competition based on natural language inference (NLI) to determine whether pairs of sentences are related. Participants are challenged to create an NLI model from a dataset that includes text in 15 different languages.
Hate Speech and Offensive Content Identification in Indo-European Languages provides a forum and data challenge for promoting multilingual research on detecting problematic content. This year the dataset contains 10K annotated tweets in English, German, and Hindi. The first subtask focuses on detecting hate, offensive, or profane content in the text. The second subtask is more granular: it classifies the specific type of problematic content.

strickvl · August 22, 2020, 8:17am

FYI the first example (fast.ai Datasets) gives me a 404 error when I try to open it. I guess this was from v3 of the course and those links don’t work any more?

msivanes · August 22, 2020, 8:03pm

Thanks for reporting. Jeremy already fixed the broken link.

ricardocalleja · August 23, 2020, 3:20pm

Thanks for your suggestions.

msivanes · September 17, 2020, 12:27pm

[Wiki Update]

Tabular Baselines
Multimodal type dataset

aifizdo · October 11, 2022, 7:07pm

I edited the link to the fast.ai dataset. Thank you for sharing this post.