Finding DataBlock Nirvana with v2 - Part 1

It is done!

Well, part 1 is at least :slight_smile:

As I started updating my v1 post by the same name, I soon realized that one article wasn’t going to do it. In part 1 we begin with an overview of the key components of a DataBlock and how it works. From there, we look at how we did this in the old days with PyTorch Datasets and DataLoaders, then change things piece by piece, using more and more of v2 along the way, until we get back to our glorious, < 10 lines, DataBlock example.

If you find any typos, anything confusing, or anything just flat-out wrong, please let me know.

If there are other things about the DataBlock API, or even the whole v2 data processing pipeline in general, that you would like to see covered in subsequent parts, leave a comment below.

Thanks and enjoy part 1!



Working on Part 2 …

What do folks want to see covered? Here are a few ideas I’m playing with (maybe rate them on a 1-10 scale with 1 being most interested and 10 the least):

  • When Does What Happen (e.g., when do all the bits in a DataBlock get executed and why might you want to put your code in one bit vs. another, like, why would you likely want to do tokenization in a type transform vs. an item or batch transform)?

  • Resolving common DataBlock errors (e.g., look at common errors and their fixes, as well as how to read a stack trace yourself and work out where and why things are breaking)

  • DataBlock Patterns (e.g., look at common patterns and best practices for setting up your DataBlock based on raw data source and task)

  • Transforms Ins and Outs (e.g., an in-depth look at transforms and how you might use them)

  • Combining Tabular with Text (e.g., a look at how we can construct a DataBlock with both tabular and text data)

  • Walk-Thru on blurr’s DataBlock integration with huggingface transformers (walk thru/discussion on how I got the DataBlock API to work for folks that want to train huggingface models with fastai v2).

Those are just some ideas … open to others as well.


I am interested in seeing working examples using more than one output block. For example having two CategoryBlocks or one CategoryBlock and a RegressionBlock.

Check out my library blurr, and in particular the extractive QA bits that use two category blocks to learn the beginning and end of where the answer is found.