Who are you?
Hello, I'm Nicky Nicolson & I'm a senior research leader in Biodiversity Informatics at Kew.
As an applied computer scientist working at a collections-based botanical research institute, my remit is to provide more efficient data mobilisation and decision support: "to curate and provide data-rich evidence from Kew's unrivalled collections as a global asset for scientific research".
Digitising and integrating data is expensive, so I'm particularly interested in how we can use ML techniques to break tasks down into smaller subsets, some of which may be automatable, whilst others can be passed to humans for investigation / verification.
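To make that concrete, here is a minimal, purely illustrative sketch of that kind of triage: records where a model is confident are accepted automatically, the rest are routed to a person. The data, classifier and confidence threshold are toy assumptions, not an actual Kew pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data standing in for digitisation records and their labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression().fit(X[:150], y[:150])

# Route each new record by the model's confidence in its top prediction.
confidence = model.predict_proba(X[150:]).max(axis=1)
threshold = 0.9  # assumed cut-off; in practice tuned against verification cost
auto_accept = X[150:][confidence >= threshold]
human_review = X[150:][confidence < threshold]
print(f"{len(auto_accept)} records automated, {len(human_review)} sent for human review")
```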
In addition to scientific research at Kew, we recognise that the collections represent history (history of science, collecting & social history), and we are starting to develop more projects to investigate these aspects of the collections, integrating our holdings with those from other organisations.
Why are you interested in machine learning / deep learning?
We've a lot of data, but crucially also access to experts to interpret results. Scientific research with biological specimens cuts across many different data formats (structured data, images, texts), and traditionally we've done a lot of laborious work to extract, integrate and annotate these; the outputs of those efforts could be re-purposed as training data for machine learning.
Do you already have some potential problems you are using (or would like to use) machine learning for?
I've been working on data-mining the collectors (people and team entities) from messy, aggregated datasets of specimen metadata using clustering techniques, and using these to integrate data from specimens held in different institutional collections. General writeup here.
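For a flavour of the general idea (a toy sketch only, not the actual methodology or data described in the writeup), variant collector name strings can be grouped with character n-gram TF-IDF vectors and a density-based clusterer such as DBSCAN:

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example strings; real aggregated specimen metadata is far messier.
collector_strings = [
    "Hooker, J.D.", "J. D. Hooker", "Hooker JD",
    "Wallich, N.", "N. Wallich", "Wallich, Nathaniel",
]

# Character n-grams are robust to punctuation and name-ordering differences.
vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(collector_strings)

# eps chosen by eye for this toy example; min_samples=1 gives every string a cluster.
labels = DBSCAN(eps=0.6, min_samples=1, metric="cosine").fit_predict(vectors)

for name, label in zip(collector_strings, labels):
    print(label, name)
```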
Datasets you are keen to work with? (either labelled or unlabelled)
As well as using data aggregated into the Global Biodiversity Information Facility data portal (structured specimen data and specimen images), I'd be interested in learning more about techniques for dealing with text data. This could range in scale from fairly abbreviated sentences describing habitats or collecting localities to the kinds of bibliographic works digitised through the Biodiversity Heritage Library. Also, as our data is highly interconnected, techniques for working with graph data structures would be of interest.
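As a small illustration of what I mean by interconnected, graph-shaped data (the records, identifiers and relation names below are invented), specimens, collectors and taxa can be held as nodes with edges for the relationships between them, e.g. using networkx:

```python
import networkx as nx

# Invented records: each specimen references a collector and a taxon.
specimens = [
    {"id": "SPEC-001", "collector": "Hooker, J.D.", "taxon": "Rhododendron arboreum"},
    {"id": "SPEC-002", "collector": "Hooker, J.D.", "taxon": "Rhododendron falconeri"},
]

G = nx.Graph()
for s in specimens:
    G.add_node(s["id"], kind="specimen")
    G.add_node(s["collector"], kind="collector")
    G.add_node(s["taxon"], kind="taxon")
    G.add_edge(s["id"], s["collector"], relation="collected_by")
    G.add_edge(s["id"], s["taxon"], relation="identified_as")

print(G.number_of_nodes(), "nodes and", G.number_of_edges(), "edges")
```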
Is there anything that you think would help you get prepared to follow the fastai course (e.g. if you are a bit rusty with Python)?
I'm happy with Python & sklearn etc., though deep learning & GPU technology are new to me.