Creating an Image Dataset for VQA: Part 2 - Extracting Real Images from Production Videos

We will focus on collecting real production images by extracting them from production videos.

Aleksandra Krasnodębska

2023-03-05

Introduction

In part one of this article, we explored the process of generating synthetic images to train our Visual Question Answering (VQA) model. Now, we will be moving on to the next step, which involves acquiring real images to enhance our dataset.

There are two ways to obtain authentic images. The first is to download pictures from search engines such as Google or from stock photo websites, or to look for open-source datasets. The second, less conventional, is to extract frames from videos of factory production. This part covers the second option, using a video about chip production from the YouTube channel Factories in Poland (pl: Fabryki w Polsce).

Processing the video

Step 1. Download a video from YouTube:

We can use pytube, an awesome Python library.

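from pytube import YouTube

link = "https://www.youtube.com/watch?v=UdPCQiu_qZA"
output_file = "movie.mp4"

yt = YouTube(link)
# Pick the highest-resolution progressive stream (video and audio in one file).
stream = yt.streams.get_highest_resolution()
# Save under a fixed name so the next step can find the file.
# Recent pytube versions use `filename` as given, including the extension.
stream.download(filename=output_file)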

The get_highest_resolution() method selects the highest-resolution progressive stream (video and audio in a single file); download() then saves it to disk.

Step 2. Cut the downloaded video into frames.

We use the code from a clear tutorial on www.thepythoncode.com. It requires the opencv-python library.

python extract_frames_opencv.py movie.mp4

Depending on our needs, we can change the number of frames saved per second by adjusting the SAVING_FRAMES_PER_SECOND value in the script.
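The tutorial script is not reproduced here. As a rough illustration of how sampling at a fixed rate can work, the sketch below first decides which frame indices to keep and then writes those frames with OpenCV. The function names `frame_indices_to_save` and `extract_frames`, the output folder, and the file naming are assumptions made for this example and are not taken from the tutorial.

```python
from pathlib import Path
from typing import List


def frame_indices_to_save(total_frames: int, video_fps: float, saving_fps: float) -> List[int]:
    """Return the indices of frames to keep so that about `saving_fps` frames per second are saved."""
    if total_frames <= 0 or video_fps <= 0 or saving_fps <= 0:
        return []
    # Never try to save more frames per second than the video contains.
    saving_fps = min(saving_fps, video_fps)
    step = video_fps / saving_fps  # distance between saved frames, in source frames
    indices = []
    i = 0
    while int(i * step) < total_frames:
        indices.append(int(i * step))
        i += 1
    return indices


def extract_frames(video_path: str, output_dir: str, saving_fps: float = 10.0) -> int:
    """Save the sampled frames of `video_path` as JPEG files and return how many were written."""
    import cv2  # imported here so the helper above works without opencv-python installed

    capture = cv2.VideoCapture(video_path)
    if not capture.isOpened():
        raise FileNotFoundError(f"Could not open video: {video_path}")
    video_fps = capture.get(cv2.CAP_PROP_FPS)
    total_frames = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    wanted = set(frame_indices_to_save(total_frames, video_fps, saving_fps))

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = 0
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of video or read error
            break
        if index in wanted:
            cv2.imwrite(str(out / f"frame_{index:06d}.jpg"), frame)
            saved += 1
        index += 1
    capture.release()
    return saved
```

With these hypothetical helpers, `extract_frames("movie.mp4", "frames", saving_fps=1)` would keep about one frame per second. A lower rate gives fewer near-duplicate frames, which is often preferable when the frames are meant for a training set.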

Step 3. Check results:

Below are sample frames from the automatically cut video.

As we can see, some frames do not show the production process. To remove them, we will build an image classifier.

Step 4. Build a classifier

We will use the torch library to build our classifier and albumentations to augment the dataset. As the backbone model, we will use efficientnet_b0, which is small but effective for our problem. For augmentation, we will use only transformations that do not alter the original image too much.

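import albumentations as A
from albumentations.pytorch import ToTensorV2

transform = A.Compose(
    [
        # Mild photometric and geometric changes that keep the frame recognizable.
        A.RandomBrightnessContrast(p=0.5, brightness_limit=0.2, contrast_limit=0.2),
        A.ShiftScaleRotate(p=0.5, rotate_limit=20),
        A.Blur(blur_limit=(3, 4), p=0.5),
        # Resize to height 240 and width 426, then normalize with the ImageNet mean and std.
        A.Resize(240, 426),
        A.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        # Convert the HWC NumPy array to a CHW torch tensor.
        ToTensorV2(),
    ]
)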

Below is an augmented version of one original frame (without normalization).

Conclusion

By following these steps, we can extract authentic images of factory production from videos and build a more diverse dataset for our VQA model.

There are several benefits to extracting images from factory production videos. Firstly, many videos on YouTube are available in 720p or higher, so the extracted frames are of good quality.

Secondly, the extracted frames resemble real photos taken during factory work in terms of image quality, camera height, and viewing angle. This makes them well suited to our VQA model, which aims to answer questions based on mobile phone photos.

Furthermore, there are numerous videos available on various topics, which provides us with the flexibility to choose the appropriate video for our machine learning problem.

It is important to note that copyright laws vary from country to country, with the biggest differences between the US and EU. Therefore, it is crucial to obtain permission from the author before processing and utilizing any videos from YouTube.