Creating image dataset for VQA part1 synthetic data

Experiment to generate image data for VQA system. In this part 1 of the dataset creation process, synthetic images are generated using stable diffusion model through prompt engineering.

Maciej Osowski

2023-02-15

Without data there is no ML solution

One crucial step in every machine learning project we undertake at Emplocity is having access to high-quality, labeled data. However, obtaining such data is a challenging and time-consuming task that typically demands significant resources to collect, label, and prepare. Furthermore, in some cases, there may be no existing data to leverage.

Despite the difficulties associated with obtaining high-quality data, investing time and effort into this process is essential for the success of any machine learning project. We recognize the importance of this step and are committed to going the extra mile to ensure that our projects are built on solid foundations of high-quality data.

The project - expectations

Visual question answering is a cutting-edge field in machine learning that combines computer vision and natural language processing to answer questions based on visual information. In an industrial factory environment, there is a need for employees to quickly recognize the functionalities of machines and production lines. This is where a visual question answering model can be of great help. At Emplocity, we were tasked with delivering such a functionality for our core client. With help of camera, a machine learning model can process visual information to answer questions from employees in real-time. Implementing a visual question answering model in an industrial factory environment can greatly improve efficiency and safety by reducing the need for lengthy manuals and on-the-job training. Instead, employees can quickly get the information they need by asking the ML model, freeing up time for other important tasks. Additionally, the model can be continuously trained to improve its accuracy over time, ensuring that it remains a valuable tool. However, before implementing such a time-saving functionality, we require a substantial amount of data.

Generating prompts

Our idea was to create synthetic images with stable diffusion model. And for that we needed to experiment with prompting the SD model with many example prompts

Creating prompts for stable diffusion models to generate indoor factory images can be a challenging task. It requires carefully crafting prompts that are specific enough to guide the model towards generating realistic images, while also allowing for enough variability to generate diverse images. At times, it can be difficult to strike the right balance between specificity and variability, leading to inconsistent or unsatisfactory results.

To tackle this challenge, we leveraged the power of ChatGPT API to create a range of prompts that could be used to guide stable diffusion models to generate indoor factory images. Some examples of the prompts we created include:

"An image of a production line in a factory setting, with workers and machinery in action."

"An image of a warehouse filled with pallets of goods, with forklifts moving around and workers managing inventory."

"An image of a laboratory, with scientists conducting experiments and analyzing data."

"An image of a manufacturing plant, with assembly lines and conveyor belts moving products through the production process."

"An image of a maintenance room, with tools and equipment neatly organized on shelves."

"An image of an office, with desks, chairs, and computers in place, and workers going about their daily tasks."

"An image of a shipping and receiving area, with trucks being loaded and unloaded and workers managing logistics."

"An image of a cafeteria or break room, with tables and chairs, vending machines, and workers taking a break."

"An image of a clean room, with specialized equipment and workers wearing protective gear."

By leveraging the power of ChatGPT, we were able to create prompts that could be used to guide stable diffusion models to generate a wide range of indoor factory images with impressive accuracy and realism.

Generating images

The generated prompts were insufficient in detail to create the level of accuracy and image conditions required for our project. Moreover, while tinkering with the ChatGPT API, we discovered that it takes a significant amount of time to prompt the model for the expected results. It can take much longer to make the model deliver what we expect, particularly when it's a one-off creation, than to write it ourselves. This highlights an important lesson for future use of LLM for data creation: one must carefully consider where the actual line of time savings lies when experimenting with LLMs and prompts.

During the process of experimenting with prompts, we attempted various techniques to achieve just the right images.

A digital image of factory storage, 4k, detailed, realistic, plain |neg few detail:1

A realistic car factory assembly line, 4k, detailed, realistic, plain |neg employees:1

A hyperrealistic photo of  welder working in factory setting , 4k, mobile phone photo |neg dark:1

    Editorial Style Photo of laboratory with computers in factory setting , 4k, task lighting |neg sunlight:1

    realistic Style image of server stack room with servers in industrial setting 4k, detailed, symetric |neg hallucinations:1

As these are only a few examples, we experimented with different parameter settings and used several guides that can be found in the "Inspiration and Help" section. We also attempted to upscale, inpaint, and alter images into higher resolution using various techniques. However, none of these methods yielded satisfactory results in time efficient manner.

Conclusion

Stable diffusion is a new and interesting way to generate images that can assist artists and graphic designers. However, currently, this algorithm has limited use in creating synthetic realistic images for computer vision tasks.

We think it's essential to present and describe ML experiments, even if the data preparation pipelines are not up to expectations. This demonstrates the everyday reality of tinkering with the ML process for product creation, which involves iterations and pivot moments.

For this type of project, there are simpler and more time-saving methods to obtain images. We will present them in part two