Any resource for learning how to create our own dataset?

I’ve completed 2 Lessons of Part 1.
I want to make a classifier for languages. But there’s no good image data for it. I plan to create dataset. Is there a good guide for helping me understand the structure, citations required etc for creating one.


I tried search api images but the data scrapped isn’t good and uniform; leading less accurate models.

Not sure if I understand you correctly. What dataset are you trying to create for language? Recognise language from image?

I’m not sure what are yours goals. Please precise. Have you already tried datasets from spacy tools? This project has a lot of data for LLMs in many languages. URL

@anandms @usernotabuser
Sorry if I left something while specifying my query.
I wanted to create a language classifier based on image as inputs. I tried using the search api method to scrape images but those data was not uniformly scaled, had some other language like en comparison in it and data was not of quality.
I’ve just started with part1 of the course and thought I can create a good dataset. So wanted to know if there is any good blog for it which can help me in structuring the data.

Welcome and hello,

Put a weight tensor into loss function. I made for you example in Pytorch:

import torch
import torch.nn as nn
import torch.optim as optim
from import DataLoader, TensorDataset

# Example data (real data should be used here)
# Assume each image is represented as a vector of features (here, 10 features)
n_features = 10
# Goat images data, represented as latent compressed vectors
goats_data = torch.rand(200, n_features)
# Labels for goats, labeled as 0
goats_labels = torch.zeros(200, dtype=torch.long)

# Cow images data, also represented as latent compressed vectors
cows_data = torch.rand(100, n_features)
# Labels for cows, labeled as 1
cows_labels = torch.ones(100, dtype=torch.long)

# Combining data and labels
data =, cows_data), 0)
labels =, cows_labels), 0)

# Creating DataLoader for batch processing
dataset = TensorDataset(data, labels)
dataloader = DataLoader(dataset, batch_size=10, shuffle=True)

# Defining a simple neural network model
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(n_features, 5)  # Hidden layer
        self.fc2 = nn.Linear(5, 2)  # Output layer (2 classes: goats and cows)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = SimpleNN()

# Weights for the loss function to handle class imbalance
weights = torch.tensor([0.5, 1.0])  # Higher weight for less represented class (cows)
criterion = nn.CrossEntropyLoss(weight=weights)

# Optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(10):  # Number of epochs
    for batch in dataloader:
        inputs, targets = batch
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

There's no need to make a new dataset if there're techniques to work on unbalance dataset. Enjoy yourself :)

EDIT: and read the docs!

I haven’t reached this lesson but I’ll ask chatGPT for detailed step explanation.

I’m sure that there would be a way of dealing with unclean dataset but would you mind taking a peek at the notebook bellow? Let me know if you think this dataset can go along your referred method.

Thanks :slight_smile:


With all regards, but I steal don’t know what kind of classifier you intend to create. Is this OCR? Photos of docs in *.jpg? picture or paint me what your problem is?


!pip install -U fastbook, fastai

It’s language classifier.
Goal is to recognize the language (either mandarin or japanese) from the images not to read them.

1 Like

Now I understand. Very good idea. There are some players in the field, some of them are open source and completely free of any charges.

Open console window or terminal:

sudo apt-get install tesseract-ocr

If you got windows or mac google for tesseract-ocr install file.

Then using python in Jupyter Notebook or any python interpreter on the same mashine where you’ve got installed tesseract utils load a module to OCR and translate those languages, just like this:

import pytesseract
from PIL import Image

# Set the path for the Tesseract OCR (uncomment and set the path if necessary)
# pytesseract.pytesseract.tesseract_cmd = 'path_to_tesseract.exe' # For Windows

# Load the image
image ='path_to_image.jpg')

# Use Tesseract OCR for Mandarin
text_mandarin = pytesseract.image_to_string(image, lang='chi_sim')

# Use Tesseract OCR for Japanese
text_japanese = pytesseract.image_to_string(image, lang='jpn')


I hope this will be helpful for U. Have a nice day :slight_smile:

Ok… Your approach is to use ocr to detects the language (if it is able to) then it can be classified as good data for the particular language and we can retain it for further model training for image language classification.

Is that correct?

But we still have to manually look into the non-uniform scaling issue of the images. See bellow.

1 Like

Sorry, I mislead you. This is very ambitious idea to build such a stuff. I so amazed. One guy which is not liked here said one smart sentence:‘simplify, simplify and simplify’. This is also my approach. J. Howard’s I think also. Google: labels smoothing with BCEloss. In shortcut this means that you adjust 0 and 1 labels, to for example 0,8 and/or app zeros labels to float 0.2. This will result in lower confidence of the model in predicting labels and force the model to look up for higher differencing feature maps for those to similar languages. Other techniques as I show above is applicable (tensor of label weights) if you’ve got unbalanced dataset, this mean for ex. 1000 photos of Mandarin and 500 Japanese. Good look, BTW: you’ve got a good augmentation! :slight_smile: Use binarycrossentropy from logits for this case.


あなたの野心的なアプローチには本当に感心しています。fast.aiでの最初のレッスンにはかなり難しい課題ですね。YouTubeには、fast.aiを使って犬と猫の分類器を作る方法の指導ビデオが溢れています。だから、あなたの独創性も評価しています。そのような言語識別機能を持つ分類器を作ることが人生の目標の一つであれば、最も人気のあるチャットボットのAPIについて調べてみてください。この方法で行動すれば、目標を迅速に達成するだけでなく、その上で収入を得ることもできるでしょう。あなたは日本から来たような印象を受けます :wink: 。実り多い一日をお祈りします :slight_smile:


Hi there!
If I understood correctly, you want to train an image classifier to tell if the language is Mandarin or Japanese, not doing OCR, but simple image classification.

The dataset construction is straightforward, you just need images of Mandarin text and images of Japanese text… You could sort them into two folders. With just that you could train a simple model to classify them. The problem is, if using the image search API is not yielding good results, you will have to do this more manually (eg downloading Google search images into the various folders).

Given you have enough data, and that the characters are different enough, I think it sounds doable!

But from the images you shared, I see a different problem, in that there are images with phonetics and such. That can be a problem because the classifier would need to associate different-looking images with the same label… Maybe you can try different categories for that? (Japanese-phonetic) or such. But that adds complexity to the dataset construction and labeling task.

Hope this helps!

Ok…so update:

Approach 2: Using Character based dataset
I tried to use Chinese and Japanese character’s dataset which had individual characters.

  • Fine tuning the model leads to complete partial predictions.
  • Any image input that I’m using as test set in model prediction is always giving ‘Chinese’ as prediction

Is there an issue with the dataset or am I just trying to force use a method that isn’t possible to work with?

The problem I see here is that you are training on a kind of image (character level) and then testing on different images (text or sentence level).

That’s not gonna work properly because the model you are training can’t recognize that difference and adjust accordingly. The fact that the Chinese images are correctly predicted is more of a “coincidence” even, in this case. Because you didn’t train the model to recognize Chinese or Japanese text, but separated characters.

What you could try is a separate step to segment your testing data into characters before the prediction, kind of a “pre-processing” step. It can even be done with some other DL model…

Another problem you could be having is an imbalanced dataset. I glanced at your code, but it seems the Chinese dataset is much larger than the Japanese one. Even if you are selecting fewer images, you should guarantee the same amount for each category.

But the main problem persists: you have to train on the same kind of images you are trying to predict later on. It’s like the example in the book, if you want to recognize bears in surveillance camera footage to warn people against possible attacks, you should train on bear images taken from those cameras. If you train on pictures from the internet (well shot, well lit, centered, and front-facing bears) then your model is gonna perform badly in production.

I understand that the model’s purpose is not being used for prediction in current stage. Just like the surveillance camera example in book.
I’d like to know more about preprocessing step…is it mentioned in any part of the book? I’m currently on lesson 4 so the concept hasn’t came across to me… Is there any article or blog that I can refer to implement a simple preprocessing step?

Also, number of images used for Chinese and Japanese are roughly the same, It’s around 12000 and 14000. Thanks for the insight about preprocessing step… I’ll try to find about it and implement it
Thanks @santibescho :grinning:

Your welcome!
Yeah, you should investigate pre-processing in general. In the course, some of that is taken care of by the library, for example when you build the DataBlocks and apply some transformations. Resizing or cropping for example would be a preprocessing step, to ensure all images are the same size.
In other types of models you may want to normalize your data, for example.