Automatic Data Collection For Machine Learning Models
Automatic Image Classification Generator lets you automate data collection, training, and testing for machine learning models. Powered by SerpApi’s Google Images Scraper API, the tool removes the need for manual data entry and reduces human error by providing preprocessing functionality in its workflow.
This week we will discuss how to automate the training and testing processes for machine learning models, and write an automation script that gathers information on the best possible model. This way we can gather insights on how to optimize the model and decide which scenarios to cover when creating a command line tool version of a rudimentary artificial intelligence that recognizes images. I only tested the functionality of the code and didn’t run it to completion, as that is a time-consuming process with my current setup. But it should give you an idea of how you could utilize it for your business processes.
For more information on the state of the tool, and how it was created, you may scroll to the bottom of the page.
What are the benefits of automated data collection?
The software uses SerpApi’s Google Images API to automatically collect the images used in the training and validation testing of machine learning models, bypassing the manual processes otherwise necessary. For more information on real-time automated data collection for creating datasets in other formats, such as using SerpApi’s Google Scholar Scraper API to automate case studies, or on other use cases, you may head to the Use SERP Data to Build Machine Learning Models page.
SerpApi’s Scraping Engines provide the necessary automated data capture systems via fast, easy-to-understand, and complete APIs. You may Register to Claim Free Credits. You may also head to the Pricing page to get informed on the cost of the project you have in mind.
Before we dive into how to streamline automatic data capture for the different metadata of machine learning models, I would like to give the reader an idea of how the Automatic Image Classification Generator and SerpApi’s Google Images Scraper API can be utilized in a different way. Although formats other than images are not supported in the Generator, paper forms or paper documents in image formats can be utilized to further enhance text data. Here is an example query:
With queries such as “Images with Plato Quotes”, you may enhance the dataset you have on quotes from famous people by utilizing Optical Character Recognition (OCR) or Intelligent Character Recognition (ICR), and create a process automation service for an automated quotes post bot on social media.
Without giving definite templates for them, here are some ideas: by utilizing the automated data capture methods SerpApi provides, you may create all kinds of products, such as classification models to improve barcode scanners, management tools for old paper documents, an image recognition app that could replace RFID readers in stocking or optical mark recognition, automatic data collection tools for the healthcare industry that could replace manual data collection, software that collects publicly available personal data to reduce government intervention in private data, or a QR code scanner that leverages different data collection methods to provide more data than the link alone. The possibilities are abundant with the type of data SerpApi can provide.
Automated Model Data Analysis
Because the data collection process is done via SerpApi’s Google Images Scraper API, the images we collect are already preprocessed by using queries such as the one below:
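Judging from the label names used later in this article, such a query looks like American Hairless Terrier imagesize:500x500, where the imagesize: operator asks for images of exactly that size. The sketch below only assembles the request URL, so it runs without a network call or an API key. The endpoint and the engine, tbm, q, and api_key parameter names are assumptions based on SerpApi's public documentation and should be verified there; build_image_query and YOUR_API_KEY are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://serpapi.com/search"  # assumed endpoint, verify in the docs


def build_image_query(label, size=500, api_key="YOUR_API_KEY"):
    """Return a search URL for square images of the given label (hypothetical helper)."""
    params = {
        "engine": "google",   # assumed parameter names
        "tbm": "isch",        # image search
        "q": "{} imagesize:{}x{}".format(label, size, size),
        "api_key": api_key,
    }
    return "{}?{}".format(SEARCH_ENDPOINT, urlencode(params))


url = build_image_query("American Hairless Terrier")
print(url)
```

Because every image returned for such a query already has the same dimensions, no cropping or padding step is needed before resizing.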
Improved data quality directly eases the cross-comparison of different algorithms and reduces the need for human intervention. A good example follows. We are creating a CNN (Convolutional Neural Network) with multiple Conv2d and MaxPool2d layers, and we want to calculate the input size for the first fully connected layer. To do this for a rectangular image, we would need to calculate the spatial resolution for height and width separately, which is something to expand in the Generator later. For now, we will use the previously gathered 500x500 images and apply square kernels to them in each convolutional and pooling layer. Here is the easy-to-understand function which has the formula embedded in it:
import math

def calculate_fully_connected(layers, size):
    # Apply the Conv2d / MaxPool2d output-size formula layer by layer.
    # Layers without these keys fall back to the defaults, which leave size unchanged.
    for layer in layers:
        k = layer.get("kernel_size", 1)
        p = layer.get("padding", 0)
        s = layer.get("stride", 1)
        d = layer.get("dilation", 1)
        size = math.floor((size + 2*p - d*(k-1) - 1)/s + 1)
    return size
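To make the formula explicit: for one spatial dimension, each layer maps an input size n to floor((n + 2p - d(k - 1) - 1) / s + 1), with kernel size k, padding p, stride s, and dilation d. The sketch below applies the same formula to height and width separately, which is one way the rectangular case mentioned above could be handled. The function name calculate_output_shape and the shortened layer list are illustrative and not part of the Generator.

```python
import math


def calculate_output_shape(layers, height, width):
    """Trace (height, width) through layers that use square kernels."""
    for layer in layers:
        k = layer.get("kernel_size", 1)
        p = layer.get("padding", 0)
        s = layer.get("stride", 1)
        d = layer.get("dilation", 1)
        height = math.floor((height + 2 * p - d * (k - 1) - 1) / s + 1)
        width = math.floor((width + 2 * p - d * (k - 1) - 1) / s + 1)
    return height, width


# Only the layers that change the spatial size are listed here.
example_layers = [
    {"name": "Conv2d", "kernel_size": 5},
    {"name": "MaxPool2d", "kernel_size": 2, "stride": 2},
    {"name": "Conv2d", "kernel_size": 5},
    {"name": "MaxPool2d", "kernel_size": 2, "stride": 2},
]

print(calculate_output_shape(example_layers, 32, 32))  # (5, 5)
print(calculate_output_shape(example_layers, 32, 48))  # (5, 9)
```

For the square 32x32 case, the side length goes 32, 28, 14, 10, 5, which is the single number the original function returns.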
Let’s declare different models we want to test out in classifying these images in a list. For now, I will put only one model in there:
models = [
    [
        {"name": "Conv2d", "in_channels": 3, "out_channels": 6, "kernel_size": 5},
        {"name": "ReLU", "inplace": True},
        {"name": "MaxPool2d", "kernel_size": 2, "stride": 2},
        {"name": "Conv2d", "in_channels": 6, "out_channels": 16, "kernel_size": 5},
        {"name": "ReLU", "inplace": True},
        {"name": "MaxPool2d", "kernel_size": 2, "stride": 2},
        {"name": "Flatten", "start_dim": 1},
        {"name": "Linear", "in_features": "change_with_calculated_fn_size", "out_features": 120},
        {"name": "ReLU", "inplace": True},
        {"name": "Linear", "in_features": 120, "out_features": 84},
        {"name": "ReLU", "inplace": True},
        {"name": "Linear", "in_features": 84, "out_features": "n_labels"}
    ]
]
Let’s also declare different optimizers we want to test out in another list:
optimizers = [
    "AdamW"
]
For now we will only test AdamW, and the only parameter that changes will be the learning rate.
Next, we will declare a range, and stepping interval for the learning rate:
# Build the grid from integer steps so that it starts at 0.001, ends at 1.0,
# and does not accumulate floating-point error from repeated addition.
lr_range = [round(step * 0.001, 3) for step in range(1, 1001)]
The learning rate will range from 0.001 to 1.0 in increments of 0.001.
Also, we need to declare a list of loss functions. We will only declare one:
loss_functions = [
    "PoissonNLLLoss"
]
We also need a counter for each iterative action we will take:
i = 0
We could have taken this next part from the model, but we will declare it by hand: output_size stands for the number of output channels of the last Conv2d layer in the model. It will be useful for calculating the input size of the first fully connected layer.
output_size = 16
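If you would rather read this value from the model definition, a small helper can pick the out_channels of the last Conv2d layer. This is a sketch that assumes layers are described with the same dictionary keys as in the models list; last_conv_out_channels is a hypothetical name and the layer list is shortened.

```python
def last_conv_out_channels(layers):
    """Return out_channels of the last Conv2d layer, or None if there is none."""
    channels = None
    for layer in layers:
        if layer.get("name") == "Conv2d":
            channels = layer["out_channels"]
    return channels


example_layers = [
    {"name": "Conv2d", "in_channels": 3, "out_channels": 6, "kernel_size": 5},
    {"name": "MaxPool2d", "kernel_size": 2, "stride": 2},
    {"name": "Conv2d", "in_channels": 6, "out_channels": 16, "kernel_size": 5},
    {"name": "Linear", "in_features": 84, "out_features": 10},
]

output_size = last_conv_out_channels(example_layers)
print(output_size)  # 16
```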
We will decrease the size of the 500x500 images to 32x32 to save processing power. Again, we could have taken it from the training dictionary declaration. It is going to be used as the second parameter of the calculate_fully_connected function.
image_size = 32
Let’s declare an empty list for storing the automated training command dictionaries:
training_dicts = []
Let’s start iterating over different lists. Since we have only one item in each list except lr_range, this iteration will test the best learning rate over a specific epoch size for the optimizer AdamW using loss function PoissonNLLLoss, and using the CNN model described.
for model in models:
    for optimizer in optimizers:
        for lr in lr_range:
            for loss_function in loss_functions:
Let’s give each model a different name using the counter:
model_name = "american_dog_species_iterated_{}".format(str(i))
The next step is to calculate the fully connected layer’s input size. Notice that this is not the size to be used, but a variable used in finding it. We will use the CNN model we declared in the models list, and the image size. Since we use 32x32 images with 5x5 and 2x2 kernels, we can safely calculate one number and use it for both height and width:
calculated_fc_size = calculate_fully_connected(model,image_size)
Then we need to replace the input size of the fully connected layer with the calculated value:
for layer in model:
    if layer["name"] == "Linear" and layer["in_features"] == "change_with_calculated_fn_size":
        # Assuming the image shape and the kernels are squares
        layer["in_features"] = calculated_fc_size * calculated_fc_size * output_size
        break
As you can see, the final result will be the calculated number squared times the output size.
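For the model above, the arithmetic works out as follows: a 32x32 input shrinks to 28, 14, 10, and finally 5 pixels per side, so the first Linear layer receives 5 * 5 * 16 = 400 features. Note that the loop edits the model list in place, so later iterations reuse the already edited list. The sketch below is an optional variation that fills the placeholder on a deep copy instead; fill_in_features is a hypothetical helper and the layer list is shortened for brevity.

```python
import copy

PLACEHOLDER = "change_with_calculated_fn_size"


def fill_in_features(layers, calculated_fc_size, output_size):
    """Return a copy of layers with the first placeholder replaced."""
    filled = copy.deepcopy(layers)
    for layer in filled:
        if layer["name"] == "Linear" and layer["in_features"] == PLACEHOLDER:
            # Assuming square images and square kernels
            layer["in_features"] = calculated_fc_size * calculated_fc_size * output_size
            break
    return filled


template = [
    {"name": "Flatten", "start_dim": 1},
    {"name": "Linear", "in_features": PLACEHOLDER, "out_features": 120},
]

filled = fill_in_features(template, calculated_fc_size=5, output_size=16)
print(filled[1]["in_features"])    # 400
print(template[1]["in_features"])  # still the placeholder string
```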
Now we need to declare the training dictionary with the variables we are iterating:
training_dict = {
    "model_name": model_name,
    "criterion": {"name": loss_function},
    "optimizer": {
        "name": optimizer,
        "lr": lr  # the learning rate of the current iteration
    },
    "batch_size": 4,
    "n_epoch": 5,
    "n_labels": 0,
    "image_ops": [
        {"resize": {"size": [image_size, image_size], "resample": "Image.ANTIALIAS"}},
        {"convert": {"mode": "'RGB'"}}
    ],
    "transform": {
        "ToTensor": True,
        "Normalize": {"mean": [0.5, 0.5, 0.5], "std": [0.5, 0.5, 0.5]}
    },
    "target_transform": {"ToTensor": True},
    "label_names": [
        "American Hairless Terrier imagesize:500x500",
        "Alaskan Malamute imagesize:500x500",
        "American Eskimo Dog imagesize:500x500",
        "Australian Shepherd imagesize:500x500",
        "Boston Terrier imagesize:500x500",
        "Boykin Spaniel imagesize:500x500",
        "Chesapeake Bay Retriever imagesize:500x500",
        "Catahoula Leopard Dog imagesize:500x500",
        "Toy Fox Terrier imagesize:500x500"
    ],
    "model": {"name": "", "layers": model}
}
Let’s add the training dictionary to training_dicts to collect them all in one list:
training_dicts.append(training_dict)
Finally we increment the counter:
i = i + 1
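Put together, the loop body above runs once per combination of model, optimizer, learning rate, and loss function. As a sketch of an equivalent structure, the four nested loops can be flattened with itertools.product, and enumerate can stand in for the manual counter. The placeholder model list and the reduced dictionary here are for illustration only; the real training dictionary carries all the keys shown earlier.

```python
import itertools

models = [["layers of model 0"]]  # placeholder for the real layer lists
optimizers = ["AdamW"]
lr_range = [round(step * 0.001, 3) for step in range(1, 1001)]
loss_functions = ["PoissonNLLLoss"]

training_dicts = []
combinations = itertools.product(models, optimizers, lr_range, loss_functions)
for i, (model, optimizer, lr, loss_function) in enumerate(combinations):
    training_dicts.append({
        "model_name": "american_dog_species_iterated_{}".format(i),
        "criterion": {"name": loss_function},
        "optimizer": {"name": optimizer, "lr": lr},
        "model": {"name": "", "layers": model},
    })

print(len(training_dicts))               # 1000
print(training_dicts[0]["optimizer"])    # {'name': 'AdamW', 'lr': 0.001}
print(training_dicts[-1]["model_name"])  # american_dog_species_iterated_999
```

With one model, one optimizer, and one loss function, the number of training runs equals the number of learning rates, so a coarser grid shortens the whole experiment proportionally.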
For this next part, you need to have the Automatic Image Classification Generator running so that it can receive the calls for automatic training:
import json
import time

import requests

BASE_URL = "http://localhost:8000"
HEADERS = {"Content-Type": "application/json"}

results = []
for training_dict in training_dicts:
    print("---")
    print("Training Model: {}".format(training_dict['model_name']))
    status_url = "{}/find_attempt/?name={}".format(BASE_URL, training_dict["model_name"])
    body = json.dumps(training_dict)
    response = requests.post(BASE_URL + "/train", headers=HEADERS, data=body, allow_redirects=True)
    if response.status_code == 200:
        # Poll until the Generator reports that training has finished
        while True:
            response = requests.post(status_url, headers=HEADERS, allow_redirects=True)
            if response.json()['status'] == "Trained":
                break
            time.sleep(0.001)
        # Test the trained model with a limit of 100 images
        testing_dict = training_dict
        testing_dict['limit'] = 100
        body = json.dumps(testing_dict)
        response = requests.post(BASE_URL + "/test", headers=HEADERS, data=body, allow_redirects=True)
        if response.status_code == 200:
            # Poll until the Generator reports that testing has finished
            while True:
                response = requests.post(status_url, headers=HEADERS, allow_redirects=True)
                if response.json()['status'] == "Complete":
                    break
                time.sleep(0.001)
            results = results + [response.json()]
            print("Accuracy: {}".format(response.json()['accuracy']))
            print("---")
For each training dictionary, we send a request to the train endpoint to train the model, poll find_attempt until training is done, test the model with 100 random images from the provided labels at the test endpoint, and then poll again until testing is over. Finally, we store the training and testing data of each automated process in a list called results.
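Both waiting loops follow the same pattern: ask for the status, stop when it matches, otherwise sleep and retry. A minimal sketch of that pattern as a reusable helper with a timeout is shown below. The wait_for_status name, the timeout, and the polling interval are assumptions, and the status source is passed in as a function so that the example runs without the Generator; in practice it would wrap the find_attempt request.

```python
import time


def wait_for_status(fetch_status, target, timeout=3600.0, interval=1.0):
    """Poll fetch_status() until it returns target; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status == target:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "status {!r} not reached, last was {!r}".format(target, status)
            )
        time.sleep(interval)


# Stand-in for the find_attempt endpoint: "Training" twice, then "Trained".
fake_statuses = iter(["Training", "Training", "Trained"])
result = wait_for_status(lambda: next(fake_statuses), "Trained", timeout=5.0, interval=0.01)
print(result)  # Trained
```

A timeout keeps the script from hanging forever if a training attempt fails and never reaches the expected status.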
In the end, to find the most efficient setup, all we need to do is look for the process with the highest accuracy. Finally, we print the most accurate setup to the user:
accuracy = 0.0
most_accurate_training = None
for result in results:
    if accuracy < result['accuracy']:
        # Remember the best accuracy seen so far
        accuracy = result['accuracy']
        most_accurate_training = result
print(most_accurate_training)
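The same selection can also be written with Python's built-in max and a key function. This sketch uses made-up accuracy values purely to show the mechanics; they are not results from the Generator, and most_accurate is a hypothetical helper name.

```python
def most_accurate(results):
    """Return the result with the highest accuracy, or None for an empty list."""
    return max(results, key=lambda result: result["accuracy"], default=None)


# Illustrative values only
sample_results = [
    {"name": "run_0", "accuracy": 0.31},
    {"name": "run_1", "accuracy": 0.47},
    {"name": "run_2", "accuracy": 0.22},
]

best = most_accurate(sample_results)
print(best["name"])  # run_1
```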
You may find the full code in the gist below:
Conclusion
I am grateful to the reader for their attention and to the Brilliant People of SerpApi for their support. In the coming weeks, we will analyze different patterns for different kinds of tasks requiring various image classification models, discuss how to transfer the learning experience of older models created to newer models created ad hoc, and utilize loss data to measure the training performance. In the end, we will minimize it in a command line tool format to make it useful for the general public.
Originally published at https://serpapi.com on September 1, 2022.