General
- The fast.ai Datasets page is a great way to get hands-on experience with a variety of datasets.
- Google Dataset Search is another way to find datasets related to a particular topic.
- The Big Bad NLP Database lists more than 200 NLP datasets.
- The GitHub repository maintained by @nirantk, which links to datasets, is also a great source of inspiration for projects.
Vision
- Classification
- MultiLabel
- Object Segmentation (Camvid)
- Object Detection
Tabular
- Regression
- Classification
Tabular fastai2 baselines and data are available from @muellerzr's repository for the following:
- Poker Hand Induction
- Higgs Boson
Collab
- Zindi Restaurant Recommendation
- Datasets: Datasets (RS_c); Courses: Online Courses On Recommender Systems (RS_c)
Text
- Classification - RNN/Transformers
- MultiLabel
- TextExtraction
- Question Answering
- MS MARCO Question Answering (Note: transfer learning was not explored on this dataset, as mentioned on GitHub)
Multimodal
- PetFinder - Combining images, text, and tabular data for prediction.
TimeSeries
- TBA
Ranking
Other Competitions
- Dravidian-CodeMix: sentiment analysis for Dravidian languages in code-mixed text found on social media
- IEEE BigData 2020 Cup: a data mining challenge to predict escalations in customer technical support using natural language techniques
- NLC2CMD: translate English descriptions of command-line tasks to their corresponding Bash syntax
- Contradictory, My Dear Watson: Detecting contradiction and entailment in multilingual text using TPUs. This is a playground-type competition based on natural language inference (NLI), where the task is to determine whether pairs of sentences are related. Participants are challenged to create an NLI model from a dataset that includes text in 15 different languages.
- Hate Speech and Offensive Content Identification in Indo-European Languages provides a forum and data challenge for promoting multilingual research on detecting problematic content. This year's dataset contains 10K annotated tweets in English, German, and Hindi. The first subtask focuses on detecting hate, offensive, or profane content in the text. The second subtask is more granular: it classifies the type of such content.