I am struggling to create my own dataset

I’ve gone through the course and read Building Machine Learning Powered Applications by Emmanuel Ameisen… yet I still don’t have a comprehensive understanding of how to structure my own dataset. I vaguely understand that with fast.ai the data needs to be in a DataFrame, and that with other libraries it should be in either a Python dictionary or JSON format. Any general resources would be greatly appreciated.

More specifically, I have trouble imagining how my multimodal data should be structured. The data that I have occurs in series and includes photos, documents, text, and numbers.

  • ~300 photos for each series used as input to be processed
  • Of the 300 photos, ~30 are selected and captioned with one sentence of text as output
  • For each series there is a document with text giving more description/context of what happened in the series.
  • The results from the previous series (30 photos with sentence-length captions and a document giving more description/context), available as additional input

I’m uncertain if I understand your question, but I guess you are referring to the format / structure that your data should be saved as?

I think it doesn’t matter too much what format your data is saved in; what matters is that it contains valuable information. In the end, it boils down to providing your model with an input (x) and an output (y) to train on. After training, you can provide a new input (x) and ask the model for an output. In your case, you could provide the images as input (x) and the captions as output (y), if that’s what you’re trying to predict.

Now, the process of converting your data into x and y can be entirely different for each use case. In your case, you could save the images into a folder (train) and create a .csv file in which you link each image with its document (which contains the description). You could then load the .csv in Python and read in the files. You could use ImageDataLoaders.from_csv for that (docs) if you want to use FastAI.
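As a rough sketch of that CSV-building step (get_caption_for is a hypothetical stand-in for however you look up each image’s text):

```python
import pandas as pd
from pathlib import Path

# Sketch: build labels.csv linking each training image to its label/caption.
# get_caption_for() is a hypothetical helper: replace it with however you
# extract the text that belongs to an image.
rows = []
for img in sorted(Path('data/train').glob('*.jpg')):
    rows.append({'fname': img.name, 'label': get_caption_for(img)})
pd.DataFrame(rows).to_csv('data/labels.csv', index=False)
```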

To summarize, structuring your data depends on what you are trying to predict with it. There are many ways to load data into Python to hand it over to the model (JSON, folders, CSV, bytes, …), but the container is less important; what matters most is that the process is simple and efficient and that the model gets the correct x and y to train on.

Correct, I apologize for any vagueness. Your advice was what I was looking for, thank you. I might need to train a separate model just to reduce the ~300 images down to ~30, but there isn’t really any information to parameterize the selection: simply the entire group as X and the subselection as Y. Do I need to go in by hand and create some type of parameters that led to the selection of the 30? Or can I simply feed in what I have and hope that it figures it out?

I still can’t quite wrap my head around the data structure and goals that you have. I used the term “season” instead of “series”; please correct me if this is wrong. This is how I imagine the setup based on your description:

  • input:
    • season
      • 300 photos
      • Q1: Are there any other inputs besides photos? Are captions already included? Are descriptions already included?
  • output:
    • 30 selected photos + captions
    • description of season
  • goal:
    • distill 300 photos into 30
      • Q2: What should this selection be based on? Importance of the scene? Contents of the picture?
    • generate captions for the photos that reflect their contents
    • generate results of previous season
      • Q3: What should the result contain? Should it just be the description of the previous season 1:1 (if such a description already exists) or should it be a summary, or something else?
  • tasks:
    • rank importance of photos
    • generate caption for each photo
    • generate result of a (previous) season

Please tell me if my understanding is correct. If I’m clear about the setup, I could maybe give you advice on how to structure your data / train the model.

My goal is to take a folder of unculled and uncaptioned photos as input and pair it with the culled and captioned photos as output. Next, using the captioned photos (and the output of the previous report), I want to generate a description of the progress since last month. I believe this should happen with three models:

1. Photo culling
2. Photo captioning of the culled photos
3. Description generation

Instead of season, I would use the term monthly. Here is a list of answers to your questions and the folder structure. I will attach some example data as well (OK, it looks like it won’t let me post CSV files). I am at the stage of cleaning and preparing the data. Do you think I should have separate Excel files for each model’s input and output, or should I keep all the data in one Excel file with different sheets? I would also love your input on how to approach training these models once the data is cleaned and organized.

Folder structure:
root_folder
├───project18001 (Rept01 has been processed)
│   ├───Rept01 (01/01/2023)
│   │   ├───culled_and_captioned.pdf (for outputs)
│   │   ├───photos_folder_to_be_culled (as input)
│   │   │   └───img[i].jpg (to be culled)
│   │   └───output (processed folder)
│   │       ├───img[x].jpg (culled)
│   │       ├───captions.csv (Project Number, Report Number, Image Caption, Name Hash)
│   │       ├───to_be_culled.csv (Project Name, Rept Number, Image Path, Name Hash)
│   │       └───report_info.csv (Project Name, Rept Number, Culled Count, To Be Culled Count, Previous Report)
│   └───Rept02 (02/01/2023) (not processed)
│       ├───culled_and_captioned.pdf
│       └───photos_folder_to_be_culled
│           └───img[i].jpg
└───project18002 (not processed)
    ├───Rept01 (10/01/2022)
    │   ├───culled_and_captioned.pdf
    │   └───photos_folder_to_be_culled
    │       └───img[i].jpg
    └───Rept02 (11/01/2022)
        ├───culled_and_captioned.pdf
        └───photos_folder_to_be_culled
            └───img[i].jpg

A1. By default, there are no other inputs besides a batch of photos every month. However, I am going to create an additional input from the previous report, if there is one. So, for example, Project 18001 Report 2 can have the output of the previous report as an example of what was selected correctly, but Rept01 will not have that. This could pose a problem, since there will be more captioned photos at later stages of a project (e.g. Rept 8), but I hope to simply use it as context and trust that the model will learn from the thousands of other projects it will be trained on.

A2. A human would make these selections based on criteria that are dynamic from project to project, or even report to report: phase of construction, capturing one photo of every type of room or of every floor… I don’t see any way of creating an algorithm to capture everything; it’s too dynamic. The importance and contents change every report/month.

A3. The results should contain a description of the construction progress that occurred since the previous month. It should be generated given the context of the captioned photos. Here is an example for Rept01, associated with the attached Excel sample data files:

SITEWORK:
• Earthwork: Rough grading around building pads is complete.
• Roads & Walks: Work has not yet commenced.
• Amenities & Site Improvements: Work has not yet commenced.
• Landscaping: Work has not yet commenced.
SITE UTILITIES:
• Electrical: Mobilization and some underground work is in progress.
• Domestic Water: Work has not yet commenced.
• Fire Protection: Work has not yet commenced.
• Sanitary Sewer: Underground piping is in progress.
• Storm Sewer: Storm sewer structure installation is in progress.
• Natural Gas: Work has not yet commenced.
VERTICAL CONSTRUCTION:
• Masonry: Masonry skin as part of the precast panels is in progress.
• Concrete: Concrete slab-on-grade foundations are complete. Building B precast wall panels have been erected. Forming of Building A panels is in progress.
• Structural Steel: Work has not yet commenced.
• Envelope: Work has not yet commenced.
• Carpentry: Work has not yet commenced.
• Plumbing: Work has not yet commenced.
• Electrical: Work has not yet commenced.
• Mechanical: Work has not yet commenced.
• Fire Protection: Work has not yet commenced.
• Drywall: Work has not yet commenced.
• Flooring: Work has not yet commenced.

Here is an example of the culled and captioned photos as output:

Project Number | Report Number | Image Path | Image Caption | Name Hash
18109 - Redacted project name | 1 | C:\Users\Redacted\sample data\complete\18109 - Redacted Project Name\09 - Inspection Reports\Rept01 10-29-18 PA01\output\Redacted_project_name_01_img1.png | main photo | 97a816874eb84a4284967cd609026e3b
18109 - Redacted project name | 1 | C:\Users\Redacted\sample data\complete\18109 - Redacted Project Name\09 - Inspection Reports\Rept01 10-29-18 PA01\output\Redacted_project_name_01_img2.png | Site - Storm sewer structures stored on site | 9f0ce7e0c50b60b8c0d307771a7a4421
18109 - Redacted project name | 1 | C:\Users\Redacted\sample data\complete\18109 - Redacted Project Name\09 - Inspection Reports\Rept01 10-29-18 PA01\output\Redacted_project_name_01_img3.png | Site - Rough grading in progress | dcae1672af6aa56d9401235458f531a8
18109 - Redacted project name | 1 | C:\Users\Redacted\sample data\complete\18109 - Redacted Project Name\09 - Inspection Reports\Rept01 10-29-18 PA01\output\Redacted_project_name_01_img4.png | Site - Rough grading in progress | eca83bcca38a0cb9154773cdab0ccb0f
18109 - Redacted project name | 1 | C:\Users\Redacted\sample data\complete\18109 - Redacted Project Name\09 - Inspection Reports\Rept01 10-29-18 PA01\output\Redacted_project_name_01_img5.png | Site - Steel components and rebar stored on site | db29ce0fb40827d77fc118cbff0999f6
18109 - Redacted project name | 1 | C:\Users\Redacted\sample data\complete\18109 - Redacted Project Name\09 - Inspection Reports\Rept01 10-29-18 PA01\output\Redacted_project_name_01_img6.png | Site - Sanitary sewer system in progress | 5ff8e86aafcb28090e6afbb70694259a
18109 - Redacted project name | 1 | C:\Users\Redacted\sample data\complete\18109 - Redacted Project Name\09 - Inspection Reports\Rept01 10-29-18 PA01\output\Redacted_project_name_01_img7.png | Site - Storm sewer structures stored on site | 85bba9ad3467b4455f358966506f6247
18109 - Redacted project name | 1 | C:\Users\Redacted\sample data\complete\18109 - Redacted Project Name\09 - Inspection Reports\Rept01 10-29-18 PA01\output\Redacted_project_name_01

Here is a sample of the entire set of photos to be culled:

Image Path | Name Hash
C:\Users\Documents\sample data\complete\18109 - Redacted Project Name\09 - Inspection Reports\Rept02 11-30-18 PA02\Const Photos 11-30-18\IMG_7708.JPG | 7388af7b5726b68b3c21ef66d74cc7ac
C:\Users\Documents\sample data\complete\18109 - Redacted Project Name\09 - Inspection Reports\Rept02 11-30-18 PA02\Const Photos 11-30-18\IMG_7709.JPG | 651d91416fbe08720213c81f1fc8f818
C:\Users\Documents\sample data\complete\18109 - Redacted Project Name\09 - Inspection Reports\Rept02 11-30-18 PA02\Const Photos 11-30-18\IMG_7710.JPG | 4f7c16160865b2c86925443352433086

And here is a sample of the report information, which includes the previous report path. I really don’t know if I need to provide more than just the path… probably so. I am still cleaning and organizing the data from one project, and then I’ll figure out how to automate it for every project. Then the fun begins…

Project Name | Rept Number | Culled Count | To Be Culled Count | Previous Report
18109 - Redacted Project Name | 2 | 13 | 36 | C:\Users\Documents\sample data\complete\18109 - Redacted Project Name\09 - Inspection Reports\Rept01\output
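For automating this across every project, I have something like the following in mind for generating to_be_culled.csv (using an MD5 of each file’s contents to fill the Name Hash column; any stable per-image hash would do):

```python
import hashlib
from pathlib import Path
import pandas as pd

# Sketch: walk a report's photo folder and emit one Image Path + Name Hash
# row per image. MD5 of the file bytes serves as a stable identifier here.
def build_to_be_culled(photos_dir: Path) -> pd.DataFrame:
    rows = []
    for img in sorted(photos_dir.glob('*.JPG')):
        digest = hashlib.md5(img.read_bytes()).hexdigest()
        rows.append({'Image Path': str(img), 'Name Hash': digest})
    return pd.DataFrame(rows)

df = build_to_be_culled(Path(r'C:\path\to\Const Photos 11-30-18'))  # placeholder path
df.to_csv('to_be_culled.csv', index=False)
```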

Generally speaking, if you want to train different models, then I’d go for separate CSV files, because it may result in a cleaner setup to work with. It’s important to note that this approach should be adapted to what your data looks like right now: if it’s more effort to create separate CSV files than to load in one big file and split it up with pandas, then I would just use one big file.
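For instance, if everything already sits in one big file, the split is a one-liner per task; a rough sketch (the ‘task’ column is a hypothetical marker for which model a row belongs to):

```python
import pandas as pd

# Sketch: split one combined file into per-model frames. The 'task' column
# is hypothetical; any column identifying which model a row feeds will work.
df = pd.read_csv('all_data.csv')
culling_df = df[df['task'] == 'culling']
captioning_df = df[df['task'] == 'captioning']
description_df = df[df['task'] == 'description']
```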

Without going too much into detail for this project, I think what you’re looking for is a supervised learning approach. It’s a good idea to create a model for each individual task, e.g. one for the culling, one for the captioning and one for the description generation. If it turns out that there are more tasks involved, another model may be required.

The key to training any of these models is to provide a dataset with specific and clean input and output values. If you want to train a model to caption a photo, I’d expect a dataset which contains a photo (pixel values or file) as input and the caption (string) as an output to train the model. The description could be generated by a fine-tuned language model. You could provide the captions as input and descriptions as output to train on. The rest is on the model to figure out.
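In table form, the two training sets could look something like this (toy rows borrowed from your samples; column names are placeholders):

```python
import pandas as pd

# Sketch: the captioning model's dataset, one (photo, caption) pair per row.
caption_ds = pd.DataFrame({
    'photo': ['img1.png', 'img3.png'],  # x: image file
    'caption': ['main photo', 'Site - Rough grading in progress'],  # y: text
})

# Sketch: the description model's dataset, one report per row, with all of
# that report's captions joined together as the input.
description_ds = pd.DataFrame({
    'captions': ['main photo; Site - Rough grading in progress'],  # x
    'description': ['Earthwork: Rough grading around building pads is complete.'],  # y
})
```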

To really get the best result for this project, I’d also suggest optimizing everything besides the dataset. For example, you could look for pretrained models for creating captions for photos and use your data to fine-tune such a model instead of creating one from scratch. Lastly, I’d also think about what baseline models you could set up to measure the performance of the models you will train later on. This way, you can ensure that your models actually work and provide value.

Thank you for your response. I have some brief initial questions.

  1. How can I select one column of a CSV file as input and another as output? And how could I do that when the input is in a column of one CSV file and the output is in a column of a different CSV file? I have a hard time understanding the docs.
  2. How can I use pretrained models? I asked a question (linked below) that touches on this: Lesson 2 official topic - #605 by cullenaryArtist

Feel free to simply send links for me to explore

It depends on the library and functionality you are using. For example, when using ImageDataLoaders from FastAI, you could use the from_csv function (docs) to use a CSV as your input. In this function, you can provide the parameters fn_col for the input (x; in this case, a column containing the file names) and label_col for the output (y; in this case, a column containing the outputs).
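A minimal sketch, assuming a labels.csv with ‘fname’ and ‘label’ columns and images under data/train (your names will differ):

```python
from fastai.vision.all import ImageDataLoaders, Resize

# Sketch: load (x, y) pairs from a CSV with fastai.
dls = ImageDataLoaders.from_csv(
    'data',                 # project root containing labels.csv
    csv_fname='labels.csv',
    folder='train',         # subfolder with the image files
    fn_col='fname',         # x: column of image file names
    label_col='label',      # y: column of labels/captions
    valid_pct=0.2,          # hold out 20% for validation
    item_tfms=Resize(224),  # resize so images can be batched
)
```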

There are several approaches to this problem. If we continue with the example of ImageDataLoaders, you could load both CSV files into DataFrames with pandas and merge them into one with the merge function. You could then use the from_df function (docs) from FastAI ImageDataLoaders to load the combined DataFrame, and then define fn_col and label_col as in the first example.
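For example, with your two files (both carry a Name Hash column in your samples, so it can serve as the merge key):

```python
import pandas as pd
from fastai.vision.all import ImageDataLoaders, Resize

# Sketch: join the paths file with the captions file on the shared hash,
# then hand the merged frame to fastai.
captions = pd.read_csv('captions.csv')
paths = pd.read_csv('to_be_culled.csv')
merged = paths.merge(captions, on='Name Hash')

dls = ImageDataLoaders.from_df(
    merged,
    fn_col='Image Path',        # x: file paths (fastai joins these onto
                                # `path`, so relative paths are safest)
    label_col='Image Caption',  # y: the captions
    valid_pct=0.2,
    item_tfms=Resize(224),
)
```

Keep in mind that the default loader treats each distinct caption as a class, which is fine for a first experiment but not for free-text caption generation.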

FastAI already provides some functionality to achieve this. If we continue with the ImageDataLoaders, you would create a vision_learner and specify an existing model to fine-tune. FastAI offers a few defaults, like resnet34, to choose from. You could then call the fine_tune function to tune the model (example). It’s important to note that this process will differ depending on what you’re working with. For example, if you want to process natural language, you’d have to take additional steps before fine-tuning (check out chapter 4 of the course to learn about NLP). The example I provided works for image classification.
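Continuing the sketch with the DataLoaders from above:

```python
from fastai.vision.all import vision_learner, resnet34, error_rate

# Sketch: fine-tune a pretrained resnet34 on our data.
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(3)  # a handful of epochs is usually enough to adapt
```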

Hope this helped.

Yes this does help tremendously, thank you. Some final thoughts for now:

  1. What is the text equivalent of ImageDataLoaders?

  2. How could I retrain CLIP for images or GPT for NLP?
    GitHub - openai/CLIP: CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
    Customizing GPT-3 for your application

As I said, to fine-tune a model for NLP there are more steps involved, which are all described in chapter 4 of part 1 of the course. In the course, we were working with HuggingFace, not with the FastAI library. I found this piece of documentation which, combined with the lecture video, will help you out.
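That said, to directly answer question 1: fastai’s counterpart is TextDataLoaders. A minimal sketch, assuming a CSV with ‘text’ and ‘label’ columns (hypothetical names):

```python
import pandas as pd
from fastai.text.all import TextDataLoaders

# Sketch: load (text, label) pairs for fastai's text models.
df = pd.read_csv('reports.csv')  # hypothetical file
dls = TextDataLoaders.from_df(df, text_col='text', label_col='label', valid_pct=0.2)
```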

I have no experience fine-tuning CLIP or GPT, as I’m still learning myself. But I think the general concepts from fine-tuning vision_learner and NLP models will still hold true. I found this article on how to fine-tune CLIP, maybe that’s useful too.

Thank you again for all of your help.

I’m glad it helped