Who are you?
Hello, I'm Nicky Nicolson & I'm a senior research leader in Biodiversity Informatics at Kew.
As an applied computer scientist working at a collections-based botanical research institute, my remit is to provide more efficient data mobilisation and decision support: "to curate and provide data-rich evidence from Kew's unrivalled collections as a global asset for scientific research".
Digitising and integrating data is expensive, so I'm particularly interested in how we can use ML techniques to break tasks down into smaller subsets, some of which may be automatable, whilst others can be passed to humans for investigation / verification.
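To make that concrete, here is a minimal, purely illustrative sketch of that kind of triage: records where a model is confident are accepted automatically, the rest are routed to a person. The data, classifier and confidence threshold are toy assumptions, not an actual Kew pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data standing in for digitisation records and their labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression().fit(X[:150], y[:150])

# Route each new record by the model's confidence in its top prediction.
confidence = model.predict_proba(X[150:]).max(axis=1)
threshold = 0.9  # assumed cut-off; in practice tuned against verification cost
auto_accept = X[150:][confidence >= threshold]
human_review = X[150:][confidence < threshold]
print(f"{len(auto_accept)} records automated, {len(human_review)} sent for human review")
```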
In addition to scientific research at Kew, we recognise that the collections represent history (history of science, collecting & social history), and we are starting to develop more projects to investigate these aspects of the collections, integrating our holdings with those from other organisations.
Why are you interested in machine learning / deep learning?
We've a lot of data, but crucially also access to experts to interpret results. Scientific research with biological specimens cuts across many different data formats (structured data, images, texts), and traditionally we've done a lot of laborious work to extract, integrate and annotate these; the outputs of those efforts could be re-purposed as training data for machine learning.
Do you already have some potential problems you are using (or would like to use) machine learning for?
I've been working on data-mining the collectors (people and team entities) from messy, aggregated datasets of specimen metadata using clustering techniques, and using these to integrate data from specimens held in different institutional collections. General writeup here.
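For a flavour of the general idea (a toy sketch only, not the actual methodology or data described in the writeup), variant collector name strings can be grouped with character n-gram TF-IDF vectors and a density-based clusterer such as DBSCAN:

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example strings; real aggregated specimen metadata is far messier.
collector_strings = [
    "Hooker, J.D.", "J. D. Hooker", "Hooker JD",
    "Wallich, N.", "N. Wallich", "Wallich, Nathaniel",
]

# Character n-grams are robust to punctuation and name-ordering differences.
vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(collector_strings)

# eps chosen by eye for this toy example; min_samples=1 gives every string a cluster.
labels = DBSCAN(eps=0.6, min_samples=1, metric="cosine").fit_predict(vectors)

for name, label in zip(collector_strings, labels):
    print(label, name)
```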
Datasets you are keen to work with? (either labelled or unlabelled)
As well as using data aggregated into the Global Biodiversity Information Facility data portal (structured specimen data and specimen images), I'd be interested in learning more about techniques for dealing with text data. This could range in scale from fairly abbreviated sentences describing habitats or collecting localities to the kinds of bibliographic works digitised through the Biodiversity Heritage Library. Also, as our data is highly interconnected, techniques for working with graph data structures would be of interest.
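As a small illustration of what I mean by interconnected, graph-shaped data (the records, identifiers and relation names below are invented), specimens, collectors and taxa can be held as nodes with edges for the relationships between them, e.g. using networkx:

```python
import networkx as nx

# Invented records: each specimen references a collector and a taxon.
specimens = [
    {"id": "SPEC-001", "collector": "Hooker, J.D.", "taxon": "Rhododendron arboreum"},
    {"id": "SPEC-002", "collector": "Hooker, J.D.", "taxon": "Rhododendron falconeri"},
]

G = nx.Graph()
for s in specimens:
    G.add_node(s["id"], kind="specimen")
    G.add_node(s["collector"], kind="collector")
    G.add_node(s["taxon"], kind="taxon")
    G.add_edge(s["id"], s["collector"], relation="collected_by")
    G.add_edge(s["id"], s["taxon"], relation="identified_as")

print(G.number_of_nodes(), "nodes and", G.number_of_edges(), "edges")
```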
Is there anything that you think would help you get prepared to follow the fastai course (e.g. if you are a bit rusty with Python)?
I'm happy with Python & sklearn etc., though deep learning & GPU technology are new to me.