Search engines like Google and Bing have tools that allow the user to search for specific types of images. Some types, such as line drawings or animated GIFs, are trivial to identify, as the search engine can simply check whether the picture is black-and-white or whether the file is a .gif. Other types of images, such as photographs and clip art, are more difficult to distinguish. In this project, I will attempt to create a model that can categorize a given image as either a stock photo or a clip art image.
Clip art images are typically vector drawings made using a computer, and are often heavily stylized. For the purposes of this project, all clip art images have been converted to .png files. Stock photos are professional photographs, often of objects or people in particular poses. They are usually sold for commercial use in presentations, papers, etc. and can be edited using image editing software such as Photoshop. Here is an example of each type of image:
The picture on the left is a clip art image, while the picture on the right is a (rather amusing) stock photo.
I created the image dataset myself, using a modified version of a Java program I made years ago, called ClipArtBot.
ClipArtBot's initial purpose was for streaming on Twitch: viewers can post a command such as "!clipart taco" into the chat, and a random clip art image of a taco will appear on the stream. ClipArtBot can pull images from many sites (and do many other unrelated things), but for the purposes of this project, it only pulls images from two specific websites: OpenClipart for the clip art images and Bigstock for the stock photos.
Because I needed a large dataset, manually typing thousands of commands into a Twitch chat to fetch each image wasn't feasible, so I had ClipArtBot automatically query these websites for an image 1000 times each. ClipArtBot can't grab completely random images, so I found a text file containing the 1000 most common English words, available here. I then had ClipArtBot run a query using each of the 1000 words on both websites. I ended up with 966 clip art images and 956 stock photos, since the websites could not find images for some of the words.
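The collection loop itself lives in ClipArtBot's Java code, but the idea is roughly the sketch below, rewritten in R. Here build_query_url(), the word-list file name, and the saved file names are placeholders for illustration, not the real ClipArtBot code or the sites' actual APIs:
library(readr)

# Rough sketch of the collection loop; build_query_url() is a placeholder,
# not a real API for either website.
words <- read_lines("1000-most-common-words.txt")

build_query_url <- function(site, word) {
  # placeholder: construct a search/download URL for the given site and word
  sprintf("https://example.com/%s/search?q=%s", site, word)
}

for (word in words) {
  for (site in c("OpenClipart", "Bigstock")) {
    url <- build_query_url(site, word)
    dest <- file.path("C:/pictures", site, paste0(word, ".png"))
    # tryCatch() because some words return no image at all
    tryCatch(download.file(url, dest, mode = "wb"),
             error = function(e) NULL)
  }
}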
Next, I needed to convert the images into a format usable by R. I used the code below to do this:
library(readr)
library(dplyr)
library(EBImage)   # readImage(), rotate(), resize()

set.seed(1)

# build a table of file names with type 0 = clip art, 1 = stock photo
data <- data_frame(fname = dir("C:/pictures/OpenClipart", full.names = TRUE),
                   type = 0)
data <- bind_rows(data,
                  data_frame(fname = dir("C:/pictures/Bigstock", full.names = TRUE),
                             type = 1))
data$train_id <- rep(c("train", "test"), length.out = nrow(data))
data <- data[sample(1:nrow(data)),]

# read each image into a 112 x 112 x 3 slice of the array
X1 <- array(0, dim = c(nrow(data), 112, 112, 3))
for (i in seq_len(nrow(data))) {
  img <- readImage(data$fname[i])
  img <- rotate(img, 270)
  img <- resize(img, w = 112, h = 112)
  if (!is.na(dim(img)[3])) {
    if (dim(img)[3] == 2) {
      # grayscale + alpha image: average the two channels into one
      img <- apply(img, c(1, 2), mean)
    }
  }
  if (length(dim(img)) == 2) {
    # single-channel image: replicate the channel three times
    img <- abind::abind(img, img, img, along = 3)
  }
  X1[i,,,] <- as.array(img)[, 112:1, 1:3]
  cat(sprintf("%04d of %04d\n", i, nrow(data)))
}
The process of creating the RDS file was quite tricky. Some of the clip art images were black-and-white, which meant they were read in as two-dimensional arrays with no channel dimension at all, rather than having the expected three color channels. I had to use the abind command to replicate the single channel three times and give these images the proper dimensions. Another set of clip art images had two channels, where one channel was the grayscale intensity and the other was the transparency level. For these images, I averaged across the two channels to collapse them into a single grayscale channel, making them equivalent to the first problem set of images, and then expanded them to three channels using the same abind method.
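As a small illustration of what those two fixes do, using toy arrays rather than the actual images:
# grayscale image: a plain 112 x 112 matrix with no channel dimension
gray <- matrix(runif(112 * 112), nrow = 112)
dim(gray)                                      # 112 112
rgb_from_gray <- abind::abind(gray, gray, gray, along = 3)
dim(rgb_from_gray)                             # 112 112 3

# two-channel image: grayscale intensity plus a transparency channel
gray_alpha <- array(runif(112 * 112 * 2), dim = c(112, 112, 2))
collapsed <- apply(gray_alpha, c(1, 2), mean)  # average the two channels
dim(collapsed)                                 # 112 112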
The order in which the images appear in the dataset is randomized, as neural networks train best on a shuffled data set. Setting the seed to 1 should cause the ordering of the images to be the same each time the code block is run, but just to be safe, I saved a copy of the CSV file and RDS dataset to use when making the model.
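The save step itself is roughly the following (the paths match the ones read back in below):
# save the shuffled metadata and the image array so later runs use the same ordering
write_csv(data, "C:/cmsc395/data.csv")
write_rds(X1, "C:/cmsc395/x112.rds")
Reading them back in for the modeling step: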
data <- read_csv("C:/cmsc395/data.csv")
## Parsed with column specification:
## cols(
## fname = col_character(),
## type = col_integer(),
## train_id = col_character()
## )
x112 <- read_rds("C:/cmsc395/x112.rds")
We can look at a few of the images to make sure that they loaded into R correctly:
par(mar = c(0, 0, 0, 0))
par(mfrow = c(4, 6))
set.seed(1)
for (i in 1:48) {
  plot(0, 0, xlim = c(0, 1), ylim = c(0, 1), axes = FALSE, type = "n")
  rasterImage(x112[i,,,], 0, 0, 1, 1)
}
It appears that the images are loaded properly. For clip art images where the background was transparent, the background was replaced with solid black. This makes sense, as the transparency channel was dropped when loading the images into R.
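The black backgrounds suggest the underlying RGB values in those transparent regions were zero, so once the alpha channel is discarded by the 1:3 subset in the loading loop, only black remains. A tiny made-up example:
# a 2 x 2 image with four channels (RGBA), all values zero: fully transparent
img_rgba <- array(0, dim = c(2, 2, 4))
img_rgb <- img_rgba[, , 1:3]   # the same subset used in the loading loop drops alpha
img_rgb                        # all zeros, which renders as solid black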
First, I tried making a neural network model without using any transfer learning, with the idea that I would compare the model with one that does use transfer learning. Here is that model:
library(keras)

# flatten each 112 x 112 x 3 image into a single feature vector
X <- t(apply(x112, 1, cbind))
y <- data$type

X_train <- X[data$train_id == "train",]
y_train <- to_categorical(y[data$train_id == "train"], num_classes = 2)

model <- keras_model_sequential()
model %>%
  layer_dense(units = 128, kernel_initializer = "glorot_normal",
              input_shape = c((112^2) * 3)) %>%
  layer_activation(activation = "relu") %>%
  layer_dropout(rate = 0.125) %>%
  layer_dense(units = 128, kernel_initializer = "glorot_normal") %>%
  layer_activation(activation = "relu") %>%
  layer_dropout(rate = 0.125) %>%
  layer_dense(units = 2) %>%
  layer_activation(activation = "softmax")

model %>% compile(loss = 'categorical_crossentropy',
                  optimizer = optimizer_sgd(lr = 0.005, momentum = 0.9,
                                            nesterov = TRUE),
                  metrics = c('accuracy'))
model
## Model
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense_1 (Dense) (None, 128) 4817024
## ___________________________________________________________________________
## activation_1 (Activation) (None, 128) 0
## ___________________________________________________________________________
## dropout_1 (Dropout) (None, 128) 0
## ___________________________________________________________________________
## dense_2 (Dense) (None, 128) 16512
## ___________________________________________________________________________
## activation_2 (Activation) (None, 128) 0
## ___________________________________________________________________________
## dropout_2 (Dropout) (None, 128) 0
## ___________________________________________________________________________
## dense_3 (Dense) (None, 2) 258
## ___________________________________________________________________________
## activation_3 (Activation) (None, 2) 0
## ===========================================================================
## Total params: 4,833,794
## Trainable params: 4,833,794
## Non-trainable params: 0
## ___________________________________________________________________________
history <- model %>%
  fit(X_train, y_train, epochs = 10,
      validation_split = 0.1, batch_size = 32)
plot(history)
data$type_pred <- predict_classes(model, X)
tapply(data$type == data$type_pred, data$train_id, mean)
## test train
## 0.8210198 0.8397503
The classification rate without transfer learning is roughly 82 percent on the test set. The model also seems fairly unstable: many small changes, such as altering the number of hidden units, cause it to "flat-line" and perform no better than chance.
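A rough way to probe this instability is to refit the same kind of architecture with different hidden-layer sizes and compare the validation accuracies. A minimal sketch (the unit sizes tried here are arbitrary choices):
# refit a simplified version of the dense model with different hidden-layer
# sizes and record the final validation accuracy of each run
try_units <- function(units) {
  m <- keras_model_sequential() %>%
    layer_dense(units = units, kernel_initializer = "glorot_normal",
                input_shape = c((112^2) * 3)) %>%
    layer_activation(activation = "relu") %>%
    layer_dropout(rate = 0.125) %>%
    layer_dense(units = 2) %>%
    layer_activation(activation = "softmax")
  m %>% compile(loss = 'categorical_crossentropy',
                optimizer = optimizer_sgd(lr = 0.005, momentum = 0.9,
                                          nesterov = TRUE),
                metrics = c('accuracy'))
  h <- m %>% fit(X_train, y_train, epochs = 10,
                 validation_split = 0.1, batch_size = 32, verbose = 0)
  tail(h$metrics[["val_acc"]], 1)   # "val_accuracy" in newer Keras versions
}
sapply(c(32, 64, 128, 256), try_units)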
Next, I tried making a model that did use transfer learning, using the pretrained VGG16 network as a feature extractor:
# extract VGG16 features (without the top classification layers) and take the
# maximum of each feature map as the input representation
vgg16 <- application_vgg16(weights = "imagenet", include_top = FALSE)
x112_pp <- imagenet_preprocess_input(255 * x112)
features <- vgg16 %>% predict(x112_pp)
X <- t(apply(features, 1, function(v) apply(v, 3, max)))
y <- data$type

X_train <- X[data$train_id == "train",]
y_train <- to_categorical(y[data$train_id == "train"], num_classes = 2)

model <- keras_model_sequential()
model %>%
  layer_dense(units = 256, input_shape = ncol(X_train)) %>%
  layer_activation(activation = "sigmoid") %>%
  layer_dropout(rate = 0.25) %>%
  layer_dense(units = 256) %>%
  layer_activation(activation = "sigmoid") %>%
  layer_dropout(rate = 0.25) %>%
  layer_dense(units = 2) %>%
  layer_activation(activation = "softmax")

model %>% compile(loss = 'categorical_crossentropy',
                  optimizer = optimizer_rmsprop(lr = 0.0005),
                  metrics = c('accuracy'))
model
## Model
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## dense_4 (Dense) (None, 256) 131328
## ___________________________________________________________________________
## activation_4 (Activation) (None, 256) 0
## ___________________________________________________________________________
## dropout_3 (Dropout) (None, 256) 0
## ___________________________________________________________________________
## dense_5 (Dense) (None, 256) 65792
## ___________________________________________________________________________
## activation_5 (Activation) (None, 256) 0
## ___________________________________________________________________________
## dropout_4 (Dropout) (None, 256) 0
## ___________________________________________________________________________
## dense_6 (Dense) (None, 2) 514
## ___________________________________________________________________________
## activation_6 (Activation) (None, 2) 0
## ===========================================================================
## Total params: 197,634
## Trainable params: 197,634
## Non-trainable params: 0
## ___________________________________________________________________________
history <- model %>%
  fit(X_train, y_train, epochs = 40,
      validation_split = 0.1, batch_size = 32)
plot(history)
data$type_pred <- predict_classes(model, X)
tapply(data$type == data$type_pred, data$train_id, mean)
## test train
## 0.8782518 0.9823101
With transfer learning, the classification rate on the test set is now roughly 86-88 percent (about 88 percent on this run). This is an increase of about 5-6 percentage points over the non-transfer-learning model, so transfer learning clearly helps on this data set. I used the sigmoid activation, as it seemed to perform better than the relu and selu activations. Since this model is the best of the models I tried, it is the final model.
I wanted to see what the final model was getting wrong, so I made a confusion matrix:
table(actual = data$type[data$train_id == "test"],
      predicted = data$type_pred[data$train_id == "test"])
## predicted
## actual 0 1
## 0 411 72
## 1 45 433
The errors are split fairly evenly between the two classes; the model isn't systematically over-predicting one image type over the other.
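The same table can be turned into per-class error rates to make that comparison explicit:
# per-class rates on the test set: each row sums to 1, so the off-diagonal
# entry is the proportion of that true class predicted as the other class
conf <- table(actual = data$type[data$train_id == "test"],
              predicted = data$type_pred[data$train_id == "test"])
prop.table(conf, margin = 1)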
Next, I looked at all of the negative examples. Here are all of the clip art images incorrectly categorized as stock photos:
par(mfrow = c(4, 6))
id <- which(data$type != data$type_pred & data$type == 0 & data$train_id == "test")
for (i in id) {
  par(mar = rep(0, 4L))
  plot(0, 0, xlim = c(0, 1), ylim = c(0, 1), axes = FALSE, type = "n")
  rasterImage(x112[i,,,], 0, 0, 1, 1)
}
Almost all of these images fill the entire frame, lacking the black space around the edges that most of the clip art images have. Some are very detailed and, at this low resolution, almost indistinguishable from actual photographs. While a few are fairly clearly clip art, overall it makes sense that the model would struggle with these images.
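One rough way to quantify this impression would be to compare the average brightness of the outer border of the misclassified clip art images against the correctly classified ones. A sketch, with an arbitrary 5-pixel border width:
# mean brightness of a b-pixel border around the edge of a 112 x 112 x 3 image
border_brightness <- function(img, b = 5) {
  edges <- c(img[1:b, , ], img[(112 - b + 1):112, , ],
             img[, 1:b, ], img[, (112 - b + 1):112, ])
  mean(edges)
}
miss_ids <- which(data$type != data$type_pred & data$type == 0 & data$train_id == "test")
hit_ids  <- which(data$type == data$type_pred & data$type == 0 & data$train_id == "test")
mean(sapply(miss_ids, function(i) border_brightness(x112[i,,,])))
mean(sapply(hit_ids,  function(i) border_brightness(x112[i,,,])))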
Here are all of the stock photos incorrectly categorized as clip art:
par(mfrow = c(4, 6))
id <- which(data$type != data$type_pred & data$type == 1 & data$train_id == "test")
for (i in id) {
  par(mar = rep(0, 4L))
  plot(0, 0, xlim = c(0, 1), ylim = c(0, 1), axes = FALSE, type = "n")
  rasterImage(x112[i,,,], 0, 0, 1, 1)
}
One thing that is clear is that some of the stock photos pulled from Bigstock.com were actually clip art, so it makes sense for the model to categorize them as such. Most of the other pictures have simple, dark backgrounds, which also makes sense, as many of the clip art images had a solid black background. The model seems to particularly struggle with images made up of small repeating patterns, which were prevalent in both sets of negative examples.
The neural network model with transfer learning was able to correctly classify roughly 87 percent of clip art images and stock photos. At first, I was slightly disappointed with this result, as I had hoped for a classification rate of over 90 percent. However, after seeing the negative examples, I can understand why achieving such a rate would be difficult.