Entity Recognition on 60-year-old Images Using YOLO

Amsterdam, 2023-08-24, Rinske Zandhuis & Richard Zijdeman

Introduction

The IISH is a research institute and holder of many archival documents, including thousands of images. Many of these images have been digitized but cannot yet be found through search, because they contain little metadata. In two weeks, physics student Rinske Zandhuis applied image recognition to determine the entities depicted in photos. Given the short time frame, creating training data was not an option. Instead, Zandhuis evaluated to what extent the latest version of YOLO would be able to detect entities in 2,300 images made by the famous press photographer Ben van Meerendonk in the 1950s and 1960s.

YOLO is trained to recognise 80 different types of items, such as persons, cars, chairs and ties. Below we explore the results Zandhuis obtained in the two-week period. The scripts she created to obtain these results are available via GitHub. IISG-imagerecognition.py applies YOLO to detect the entities and creates a four-column data frame: the image (number), the detected item, the probability of occurrence, and the coordinates of a bounding box around the entity. To circumvent memory issues when detecting entities in all 2,300 images, the data frame is written to multiple CSV files, and the script IISG-adding-files.py was created to merge them into one large CSV file. Evaluation of the over 24,000 recognised items was out of scope for the two-week period, but IISG_create_dataframe_check.py shows snippets of where entities are supposed to be, and a user can validate these with 'y' or 'n', after which the results are written to a data frame. Finally, IISG-making-linked-data.py creates the Linked Data that we will explore below.
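The chunk-and-merge step can be sketched as follows. This is a minimal illustration, not the original scripts: pandas is assumed, and the column names, file names and example detections are all hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical column layout: the four detection columns described above.
COLUMNS = ["image", "item", "probability", "bbox"]

def write_chunk(rows, path):
    """Write one batch of detections to its own chunk CSV (avoids holding
    all 2,300 images' detections in memory at once)."""
    pd.DataFrame(rows, columns=COLUMNS).to_csv(path, index=False)

def merge_chunks(pattern, out_path):
    """Concatenate all chunk CSVs matching `pattern` into one large CSV,
    in the spirit of IISG-adding-files.py."""
    frames = [pd.read_csv(p) for p in sorted(glob.glob(pattern))]
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(out_path, index=False)
    return merged

# Tiny demonstration with two made-up chunks.
with tempfile.TemporaryDirectory() as tmp:
    write_chunk([["IISG_0001", "person", 0.91, "10,20,110,220"]],
                os.path.join(tmp, "chunk_000.csv"))
    write_chunk([["IISG_0002", "horse", 0.64, "35,40,180,210"]],
                os.path.join(tmp, "chunk_001.csv"))
    merged = merge_chunks(os.path.join(tmp, "chunk_*.csv"),
                          os.path.join(tmp, "all_detections.csv"))
print(len(merged))  # 2
```

Writing per-batch files and concatenating afterwards keeps peak memory bounded by the size of one batch rather than the whole collection.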

Data

Nearly all of the YOLO entities have been recognised (78 out of 80). This is somewhat surprising, as not all entities might be expected to appear in the 60-year-old images by Van Meerendonk. While surely some entities have been falsely recognised, the number of detected entities is far greater than anticipated: the list of 80 entities seems applicable to the second half of the 20th century. We used Wikidata labels to link to the entities, which results in small discrepancies (e.g. Wikidata's 'table' rather than YOLO's 'dining table'). The two figures below present the 10 most and least often detected entities. Person entities have been detected most often (16k) and are omitted from the visualisation. Least often detected are 'hairdrier', 'microwave' and 'toaster' (3 times), 'mouse', 'snowboard', 'keyboard' and 'orange', and finally 'donut' and 'giraffe' (once).
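The label alignment can be sketched as a simple lookup in which most YOLO class names pass through to Wikidata unchanged. The 'dining table'/'table' pair comes from the text above; any further divergent pairs would be added to the mapping in the same way.

```python
# Sketch of aligning YOLO class names with the Wikidata labels used for
# linking. Only the divergent cases need an explicit entry.
YOLO_TO_WIKIDATA_LABEL = {
    "dining table": "table",  # Wikidata uses the broader label
}

def wikidata_label(yolo_class: str) -> str:
    """Return the Wikidata label corresponding to a YOLO class name."""
    return YOLO_TO_WIKIDATA_LABEL.get(yolo_class, yolo_class)

print(wikidata_label("dining table"))  # table
print(wikidata_label("person"))        # person
```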

Obviously, there are some mistakes in the recognition. The 'giraffe' is actually a may tree, microwaves resemble typical 1950s square TVs, and an occasional crack in an image depicting the sky is recognised as a kite. In subsequent efforts it would be important to look into two things. First, to what extent such mistakes are structural and thus introduce bias (such as the case where 1950s TVs are not recognised as such). Second, to what extent entities that are depicted are not recognised at all, and whether there again is a structural component to it. For example, differences in dress between men and women might result in one sex being less often recognised as a person.

Below are the results for some images supposedly depicting a 'horse'. You can click on an image to zoom into the detected area. You can look for other Wikidata entities using the search box. To alter the query entirely, click on 'try this query yourself'. For example, you can change 'Limit' to a different number to see more images. It is interesting to see that patterns in the haystack apparently resemble horse legs.

Images by Entity

Having converted the results of the entity recognition to Linked Data, we can directly query characteristics of the images along with the original metadata of the photos. For example, one could ask for images depicting a 'handbag' as detected by YOLO, combined with a word from the image title and a period in which the photograph was taken. Below we ask: "show me all images, for the entire period, depicting a 'handbag' and having 'Meerendonk' in the image's title". Among the results is a unique image of the Van Meerendonk couple. One could also search for 'fork' and 'Sinatra' and find a picture of him and Ava Gardner at Schiphol Airport. It is a bit tedious, but also rewarding, to look for other matches: try 'car' and 'RAI', 'horse' and 'staking', 'cat' and 'straat', or any other combination of entities and words you might expect in the image titles. (Hint: choose an entity and add a single space as keyword.)
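In spirit, such a combined query boils down to a join and filter. The sketch below mimics it in pandas on tiny in-memory tables; the actual exploration runs SPARQL over the Linked Data, and all column names, image numbers and titles here are made up for illustration.

```python
import pandas as pd

# Hypothetical YOLO detections (image, detected entity).
detections = pd.DataFrame({
    "image": ["IISG_0001", "IISG_0002", "IISG_0003"],
    "item":  ["handbag",   "handbag",   "fork"],
})

# Hypothetical archival metadata (title and year per image).
metadata = pd.DataFrame({
    "image": ["IISG_0001", "IISG_0002", "IISG_0003"],
    "title": ["Van Meerendonk at home", "Street scene", "Sinatra at Schiphol"],
    "year":  [1956, 1958, 1951],
})

def find_images(entity, keyword, start=1945, end=1970):
    """Images depicting `entity` whose title contains `keyword`,
    taken within the given period."""
    joined = detections.merge(metadata, on="image")
    mask = (
        (joined["item"] == entity)
        & joined["title"].str.contains(keyword, case=False)
        & joined["year"].between(start, end)
    )
    return joined.loc[mask, "image"].tolist()

print(find_images("handbag", "Meerendonk"))  # ['IISG_0001']
print(find_images("fork", "Sinatra"))        # ['IISG_0003']
```

The join on the image identifier is what the Linked Data representation makes possible: the detected entities and the archive's own metadata become two views on the same resource.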

Round up

Image recognition has come a long way. Without prior knowledge of the topic, Zandhuis showed that even an out-of-the-box algorithm is of value for detecting common entities, even in 60-year-old photographs. Moreover, by using Linked Data, Zandhuis illustrates the riches that lie ahead in combining traditional metadata with metadata newly derived from digitised images in archives. For example, filtering for images with 10+ persons would retrieve all group photos. Images with fewer than three persons, a desk and ties might reveal official moments such as the signing of documents. Surely, archives would like to train detection algorithms to enhance precision and detect entities characteristic of the archive at hand. But archives also struggle for money, and a deep dive into out-of-the-box tooling might give a unique sense of the possibilities. You Only Live Once.
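The group-photo filter suggested above can be sketched as a simple count over the detection table. Again a hypothetical illustration: column names, image numbers and the example detections are assumptions.

```python
import pandas as pd

# Made-up detections: one crowded image and one with a couple and a tie.
rows = ([("IISG_0001", "person")] * 12
        + [("IISG_0002", "person")] * 2
        + [("IISG_0002", "tie")])
detections = pd.DataFrame(rows, columns=["image", "item"])

def group_photos(df, min_persons=10):
    """Images with at least `min_persons` detected 'person' entities."""
    counts = df[df["item"] == "person"].groupby("image").size()
    return counts[counts >= min_persons].index.tolist()

print(group_photos(detections))  # ['IISG_0001']
```

The same pattern, with a different threshold and extra conditions on co-occurring entities such as 'tie', would approximate the "official moment" filter.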