GLAMs (Galleries, Libraries, Archives and Museums) fastai study group

I think the blog and notebook are great :slight_smile:

My only suggestion, which might not apply so much this time, is printing an additional metric like F1 during training. I've found that 'real-life' data often ends up being imbalanced, and F1 can be useful as an additional insight into how well the model is performing. Again, I'm not sure how much that applies here, and the confusion matrix gives some insight into this too.
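
In case it's useful, here's roughly what I mean. This is only a minimal sketch: it assumes you already have a DataLoaders object called dls from earlier in the notebook, and depending on your fastai version the learner constructor may be cnn_learner rather than vision_learner.

```python
# Minimal sketch, assuming a `dls` DataLoaders already exists.
# F1Score is fastai's single-label F1 metric; average='macro' weights every
# class equally, which is usually what you want for imbalanced data.
# (For multi-label problems use F1ScoreMulti instead.)
from fastai.vision.all import *

learn = vision_learner(dls, resnet34, metrics=[accuracy, F1Score(average='macro')])
learn.fine_tune(3)  # F1 is now printed alongside accuracy each epoch
```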

:rotating_light: A reminder that we'll have the call for lesson 4 this Tuesday (Google Calendar link to future calls). As usual, I'll post the Zoom link here on Tuesday. :rotating_light:

Last time we had some time for showing projects we'd been working on. I suggest we start with this again. I also thought it might be worth quickly discussing:

  • whether there are any topics we want to plan further work around. A few ideas have already come up (https://github.com/davanstrien/fastai4GLAMS/issues/8). If there are some common shared interests, we could spend some time (probably asynchronously to start) exploring how to tackle these, with the eventual aim of coming up with some example notebooks.
  • In a similar vein, if people have or know of open datasets from a GLAM setting which could usefully be labelled (by non-experts), we could think about whether to try and label a dataset together.

Of course, feel free to suggest other ideas below or during the call.

See some of you on Tuesday!

Link for tonight's call:

Topic: fastai4glams call
Time: Oct 20, 2020 05:00 PM London

Join Zoom Meeting

Meeting ID: 993 5711 0603
Passcode: 576773

In case you're really trying to load up on Zoom calls today, the monthly community call for AI4LAM is today, immediately preceding the Study Group call, at 1500 UTC (4 PM UK and 11 AM EDT). Agenda and Zoom link are here: https://docs.google.com/document/d/1TthlO7-VVT4iHeuJIpzzE764cFcsbN9YMDD6HLOQxL4/edit?usp=sharing. There will be 3 lightning talks on "Machine Learning and Images", which seems relevant.

Glen - it would be fun to hook up to any of the GLAM IIIF repositories as open datasets and gently pull down (resized) images to work with, along with metadata as available. With problems that are pretty easy for a classifier (e.g. map, photograph, letter, newspaper page, etc.) you can get started with surprisingly small samples, train, predict the categories for a larger sample, then add them to the training set with some light review and editing. You may also find that it is easier to think of this as a multi-label problem, as some pages will contain multiple key elements (e.g. a map and a drawing). The only thing that changes is that you can't really keep images in directories named by label any more.
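
To make that last point concrete, here is a rough, untested sketch: the base URL, CSV columns and label format are all hypothetical placeholders for whatever repository and metadata you end up using. It pulls resized derivatives via the IIIF Image API and builds a multi-label DataBlock from a CSV of labels rather than label-named directories.

```python
# Rough sketch only: URLs, file names and column names are hypothetical.
from pathlib import Path
import pandas as pd
import requests
from fastai.vision.all import *

def iiif_thumb_url(image_service_base, max_px=512):
    # IIIF Image API: {base}/{region}/{size}/{rotation}/{quality}.{format}
    # "!512,512" asks the server to fit the image inside a 512px box.
    return f"{image_service_base}/full/!{max_px},{max_px}/0/default.jpg"

def download_thumb(url, dest):
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(requests.get(url, timeout=30).content)

# Labels live in a CSV instead of directory names: one row per image with a
# 'fname' column and a space-delimited 'labels' column such as "map drawing".
df = pd.read_csv('labels.csv')

dblock = DataBlock(
    blocks=(ImageBlock, MultiCategoryBlock),
    get_x=ColReader('fname', pref='thumbs/'),
    get_y=ColReader('labels', label_delim=' '),
    splitter=RandomSplitter(seed=42),
    item_tfms=Resize(224),
)
dls = dblock.dataloaders(df)
learn = vision_learner(dls, resnet34, metrics=accuracy_multi)
```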

Enjoy!

Hi all,

During this week's call, we discussed working together on a notebook/project which addresses a broader GLAM use case. We decided that trying to come up with some example multimodal models might be a good starting point. In case you haven't come across this term, multimodal models take inputs of different types, e.g. text + images, and predict a single output.

The idea behind this approach is that you give the model more useful information by not restricting it to one type of source. GLAM data will often already have some metadata associated with it, so, for example, you may have a collection of images with associated metadata (year, location, etc.) and feed that metadata + an image into a model to predict some new labels. With a bit of luck, the metadata will also give the model some signal as to what the labels might be, i.e. maybe particular years in your collection are more likely to feature a label than others. Hopefully, by giving the model the combination of inputs, it will do better than it would with each information source on its own.
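
To make that more concrete, here is a very rough sketch in plain PyTorch with made-up sizes: a CNN encodes the image, a small MLP encodes some numeric metadata, and the two feature vectors are concatenated before a final classification layer. This is just one way of combining the inputs, not a recommendation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ImagePlusMetadata(nn.Module):
    """Concatenate CNN image features with an MLP over numeric metadata."""
    def __init__(self, n_meta_features, n_classes):
        super().__init__()
        cnn = resnet18()               # in practice you would use pretrained weights
        cnn.fc = nn.Identity()         # keep the 512-dim image feature vector
        self.image_encoder = cnn
        self.meta_encoder = nn.Sequential(
            nn.Linear(n_meta_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.head = nn.Linear(512 + 64, n_classes)

    def forward(self, image, metadata):
        feats = torch.cat([self.image_encoder(image), self.meta_encoder(metadata)], dim=1)
        return self.head(feats)

# e.g. a batch of 4 images plus 10 numeric metadata columns (year, location codes, ...)
model = ImagePlusMetadata(n_meta_features=10, n_classes=5)
logits = model(torch.randn(4, 3, 224, 224), torch.randn(4, 10))  # shape (4, 5)
```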

To tackle this together, we could start by finding some potential datasets which might work well for this approach. Since the course so far has focused mostly on images, it might be best to start with an example that combines an image with another type of data. This could potentially be a 'free text' field or more structured metadata. I suggest we start collating some options in the forum, and we can then decide whether to focus on creating a notebook around a single dataset or aim to tackle multiple datasets. Ideally, this would be a dataset that is open or could be made open.

Another thing that may be useful to collate is existing examples of, or discussions about, this type of approach. These could be examples of implementing the code in fastai/PyTorch, or articles discussing some of the theory behind these types of models. I will try and dig around for some resources in the next few days and post them in reply to this thread.

Anyone should feel welcome to join, even if you haven't joined the Zoom calls. We can keep a good chunk of this asynchronous, and I will endeavour to write up any other discussions so people can track progress without subjecting themselves to more Zoom calls :slight_smile:

:rotating_light: A reminder that we'll have the call for lesson 5 this Tuesday (Google Calendar link to future calls). As usual, I'll post the Zoom link here on Tuesday. :rotating_light:

Sorry to miss the call tomorrow but I have a conflicting meeting.

Zoom details for this evening:

Join Zoom Meeting

Meeting ID: 986 6413 5223
Passcode: 807268

:rotating_light: A reminder that we'll have the call for lesson 6 this Tuesday (Google Calendar link to future calls). As usual, I'll post the Zoom link here on Tuesday.

Apologies - I can't attend this one. (But I look forward to meeting in a fortnight's time, when I'll be in hotel quarantine in Sydney and looking for relief from boredom and a view of the harbour!)
Susan

Zoom details for this evening:

Topic: fastai4glams call lesson 6
Time: Nov 17, 2020 05:00 PM London

Join Zoom Meeting

Meeting ID: 925 3405 2670
Passcode: 160613

No worries, see you next time and hope the quarantine is not too dull!

A note for the next session (lesson 7), thanks to @AdamF:

There appear to be several issues in fastbook/clean/09_Tabular.ipynb that have been reported in the forum, but not in GitHub issues. This issue aggregates them together.

I confirmed that all of them occur in the version included in the fastdotai/fastai-course Docker image of 19-Nov-2020. I presume that the official Docker image version of the course should run through cleanly without any errors.

  • Some supporting modules are not installed: pip install kaggle waterfallcharts treeinterpreter dtreeviz (forum article)
  • Downloading the Kaggle file bluebook-for-bulldozers does not appear to work from Python. There are multiple reports of this (including here), with the workaround being to download manually via the browser or via the command line (kaggle competitions download -c bluebook-for-bulldozers).
  • The code to download also fails because it tries to create a directory but needs a parents=True added to path.mkdir (see); a sketch combining this fix with the download workaround above follows this list.
  • The load/save pickle methods should be changed to load_pickle/save_pickle (reported a couple of times on the forum, including here).
  • m_rmse(m, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y) raises an error: xs_filt2 should be xs_filt, as in m_rmse(m, xs_filt, y_filt), m_rmse(m2, valid_xs_time2, valid_y) (see here).
  • procs_nn = [Categorify, FillMissing, Normalize] causes an error in the following line. The suggested workaround is to remove Normalize from the list (see here).
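
For convenience, here is a minimal sketch pulling the first few fixes together. The path is just an example (not the one hard-coded in the notebook), and the kaggle CLI needs an API token in ~/.kaggle/kaggle.json.

```python
# Sketch only: combines the mkdir fix, the command-line download workaround
# and the renamed pickle helpers. Paths and file names are examples.
from pathlib import Path

path = Path('bluebook-for-bulldozers')
path.mkdir(parents=True, exist_ok=True)   # parents=True avoids the mkdir error

# If the Python download fails, fall back to the CLI and unzip into `path`:
#   kaggle competitions download -c bluebook-for-bulldozers -p bluebook-for-bulldozers

# With recent fastai/fastcore, use the renamed pickle helpers, e.g.:
#   save_pickle(path/'to.pkl', to)
#   to = load_pickle(path/'to.pkl')
```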

I haven't tried running the notebooks yet, but hopefully the above will help with debugging any issues you come across.

Hey @Danielvs. Nice summary here.
However, for the Normalize issue workaround, this solution seems to work better.

Thanks for sharing that :slight_smile:

:rotating_light: A reminder that we'll have the call for lesson 7 this Tuesday (Google Calendar link to future calls). As usual, I'll post the Zoom link here on Tuesday.

Apologies @Danielvs - I won't make it to the meeting today.
Susan

Link for the call this evening:

Join Zoom Meeting

Meeting ID: 984 9293 3442
Passcode: 474931

No worries, see you next time