Toward an Automatic Blog Photo Annotater

Machine Perception Project 2005
Bryan Klimt

Abstract

For my project, I will give an overview of existing approaches to photo classification, and evaluate their application to the problem of automatic photo annotation. My practical goal in this project is to develop a tool to automatically label the personal photos I place on my website, based on my previous labels. My academic goal in this project is to learn about image and photo classification, as I have never studied this area before. This project is applicable to the course because the techniques used in solving this problem cover several areas of machine perception, including computer vision and face recognition, as well as technologies for explicit user modeling, such as Markov models.

Introduction

I have a website where I post all of the photos I take. So far there are 416 photos on this site starting from last summer and that number is growing rapidly. In order to find photos in this large and growing dataset, I have implemented a search function whereby a user can type in keywords, and the system will find the relevant pictures. In order to make this search function work, I have so far been manually labeling each new photo I take with all of the keywords I think may be relevant to the photo. This is a very time-consuming and tedious process, and one that may be mitigated with intelligent use of machine perception and vision technologies, such as face recognition and photo classification.

The first step in building some statistical learner that can be trained to automatically label these documents is to determine which features of the data to use. Because these are simply photos that I have taken (and not, for instance, images on the web), there are really only two main types of features available: visual features , and the time the photo was taken. Both types of features are explored in the experiments of this project.

There has been extensive previous research on the use of visual features for photo retrieval and classification (well over 100 papers). gives an extensive list of these references, so I will not list them all here. They identify three main types of image features that have been used: color features, texture features, and shape features. Color features are the most common type of visual feature used. Generally, this involves using the Color Histogram of the photo. These can be compared using one of many metrics, such as L1 or L2 distance. Color based methods have the benefit of being fairly robust to image size and complication. However, they often do not perform as well as methods which use more domain-specific features. Several color-based methods are explored in this project.

Another common type of visual feature used is texture. defines texture as "the visual patterns that have properties of homogeneity that do not result from the presence of only a single color or intensity." A common way of extracting texture information from photos is that of Wavelet transforms. While this approach has been very successful in image retrieval, it will not be explored in this project. Time constraints allowed only a few techniques to be tried, and an analysis of this dataset, as given below, showed that the other types of features might be more useful. However, this would be an interesting direction for future research.

The third type of visual feature, shape information, is domain dependent, and has in the past included such activities as face identification and fingerprint recognition. As we will see below, face recognition is an obvious candidate for a classification feature in this dataset. In this project, I have experimented with several of these approaches. I will discuss several of them below, and show the performance of the best technique.

As mentioned previously, we can also use the time the photo was taken as another, non-visual, feature. This has not been explored previously in the field of photo classification, but is common in other tasks involving information streams, such as SPAM filtering and speech recognition. Sometimes this is done by modeling the time-distribution of categories. For some tasks though, such as with speech recognition the best performance has been with Hidden Markov Models . This approach is discussed in later sections.

In the following sections, I will first give an overview of the characteristics of this dataset which may be useful in determining what approaches to take. Then, I will describe in details the methods I have chosen to implement. Finally, I will discuss and analyze the results of these experiments, and give my conclusions.

Dataset

In order to determine what techniques will be more effective on this dataset, we must first take a close look at the features of the dataset, such as what types of keywords have been used to annotate photos, and how these relate to the photos and to each other. In figures 1 and 2, you can see the distribution of keywords in this dataset.

This distribution seems to roughly follows Zipf's law, with 5-10% of the keywords occurring roughly 90-95% of the time. This is good for us, because it means if we concentrate on the most common keywords, we will be able to do well on the majority of the photos. It is also advantagous because the most common keywords will have a lot more data for our statistical learner to train on. Obviously for the keywords that only appear one or two times, there will not be sufficient evidence to predictively label them no matter what approach we take.

Now that we know we need to focus on the most common categories, we need to examine what those categories are to determine what algorithms are most appropriate for them. Table 1 lists 45 the most common keywords.

Table 1: Top Keywords
 

As you can see, these keywords encompass a wide variety of types. However, they can be broken down to some specific types, such as locations, patterns, and people. I have chosen 15 of these keywords, given in bold above, to use in the experiments in this project. I have chosen only 15 because the small number makes the experiments manageably and interpretable. I have chosen these categories in particular to give some diversity representative of the different types of keywords encountered. Specifically, these can be grouped as:

Locations: china, italy, pisa, tiantan, florence, firenze, beijing, lucca
Colors or Patterns: sky, water, trees, cloud, hazy
People: bryan, amy, shyam

(A link to this dataset is given in the References section.)

Methods

I have conducted experiments using several of the approaches mentioned above. I will describe my approaches in detail in this section.

Color Similarity

To deal with the color or pattern type keywords, I have chosen to use color histogram similarity. This technique was introduced first introduced by . To find the color-similarity of two images, we first create a histogram for each image. A histogram is a feature vector where each dimension is a specific color. The magnitude of the vector for a photo in each dimension is the number of pixels that color in the photo. To compare two photos, we measure the distance between their vectors. This can be done using many different kinds of distance measures, but the most common in color histogram research have been L1, L2, and L-infinity. L1 is just the sum of the differences in the values in the 2 vectors. L2 is the euclidian distance and was found by to generally give the best performance, although not in every case. The L-infinity distance is merely the largest difference in the magnitudes between the two vectors in any one dimension.

also introduced a way of dealing with color sparsity. Color sparsity is an issue because, for example, one photo might have all black, while another has all very dark gray. While these two photos are intuitively very similar, normal color histogram approaches would have found no similarity because they have no color in common. In order to overcome this, they introduced the idea of the cumulative color histogram. In this technique, the value for a color is its histogram value plus the sum of the values for all of the colors less than it, where a color is less than another color if each of its components (red, green, and blue), is less than the other's.

In these experiments, I have used both color similarity and cumulative color similarity. The results for these experiments is given below. For a distance metric, I have used the L2 norm. I chose L2 after trying L1, L2, and L-infinity and finding L2 to give the best results, although the L1 results were quite similar.

k-Nearest Neighbors

Unfortunately, color similarity only gives a way of comparing two photos, not a way of classifying them. In order to meet this need, I have implemented a simple k-nearest neighbor classifier, as described in . While some other classification algorithms consistantly perform better than kNN, many, such as support vector machines, require a vector space model, whereas kNN only requires distances between each of the photos. To give a basic idea of how this works, in the case of 1-NN, a new photo will simply be assigned the keywords of its predecessor. Some discussion is given below of how to choose the k for kNN, but this is not the focus of the paper.

Face Detection

For the class of keywords dealing with people, I have chosen to use face recognition technology to identify the people in the photos. However, before a face can be scanned to determine who it is, we must first find all of the faces in the photos. For this task, I have chosen to use the CMU Robotics Institute Online Face Detection Algorithm Demo, as described in and . To be honest, my main motivation in choosing this particular approach is the free implementation available on the web, but its performance is quite good on this dataset, and would be difficult to beat.

This face detector works by first looping over every coordinate in the photo. At each coordinate, it also loops over several possible sizes that the face could be. It then classifies this subimage as either a face or not. The face classification is implemented using a Bayesian network whose structure is learned statistically from previous training examples. One advantage of this online classifier is that is comes pre-trained to recognize faces, having been trained on the MIT-CMU face dataset.

Face Recognition

Once faces have been detected, we need some method for determining the similarity of faces so that we can identify them. Two of the most common algorithms for doing this are Principal Component Analysis and Independent Component Analysis . These algorithms work by using training examples to select a mapping to a feature space of reduced dimensionality to reduce the noise in the photos. Unfortunately, all of the previous face recognition systems require a large number of large trainin examples. Furthermore, they generally tend to require that photos of faces be well-cropped, with the faces all pointing the same direction, and with certain features, such as eyes, manually labeled. This negatively affect the performance of face recognition algorithms on this dataset, as we will see below.

To meet time constraints (and to ensure correct results), I have use the CSU Face Identification Evaluation System, as described in . This system takes in a set of photos and creates a similarity matrix, using one of several algorithms, which allows us to determine the similarity between any two faces. I tried both PCA, ICA, and PCA with Linear Discriminant Analysis (LDA), but their results were virtually identical, so I have only reported the results of PCA in this project.

Once we have calculated the distance between two faces, we can then use a k-nearest neighbor classifier to predict the keywords of a new photo. This works in the following way. First, we locate the faces in a new photo. If no faces are found, we don't assign the photo any keywords. If faces are found, each one is identified by finding the k-nearest face neighbors and choosing the identity of the face based on that. Faces are matched to keywords whenever a photo is encountered with exactly one face.

Time Modeling

In addition to the standard methods for photo classification, I noticed that with this particular dataset, time information would likely be a useful feature for categorization. For instance, the most common keyword, "china", is used to label photos I took while I was in China. Since I've only been there once, most of the photos labeled "china" occur in one temporal cluster. Thus, some method that can take into account time information should be able to perform quite well on keywords like this one.

One of the most popular techniques for labeling time sequences is Hidden Markov Models . HMMs work by modeling the states that a system may be in, as well as the probability of transitioning from one state to another, as well as the probability of giving each output at each state. With this particular problem, however, the states are not actually hidden. For instance, the obvious hidden states for the "china" keyword represent either being in China or not being in China. Thus, a model for this keyword would be something like in figure 3.

With such a simple, well-defined model, it is not necessary to use an algorithm to learn the parameters; they are obvious given our knowledge of the problem. So, we can actually approximate using an HMM like this one using the same k-Nearest Neighbor approach we used with the other approaches. For instance, the HMM shown in figure 3 will give identical classification performance as a 1-NN model. I have therefore, again, used kNN to do use time features for classification. I do not believe that HMMs would provide any better performance, given the simplicity of our data model, although that may be an area for future research.

Time-based k-Nearest Neighbor approaches also have the advantage of being able to reduce the noise in the dataset by increasing k. For instance, if I failed to label a photo with "china", that may not mean I have left China; it might just be a human labeling error. With k=2, this will be avoided, with the cost being a false positive for the first photo after leaving China.

Experimental Setup

The experimental setup for this project is quite simple. I considered the photos on my blog to be a sequence stream arriving in the system. When a new photo arrives it is unlabeled, and the system uses its classifier(s) to automatically label the photo. The predictions made by the system are then stored for evaluation, and the labeled photo is shown to me. I then correct the annotations and resubmit the photos to the system so that it can learn from my corrections.

Results & Analysis

In order to measure the performance of the methods described above, I have used the standard metrics for evaluation of classification: precision, recall, and F1 score . Precision measures the fraction of the time the classifier is right when it says a photos should have a keyword. Recall measures what fraction of the photos with that keyword the system identified. F1 is the harmonic mean of these two measures, and gives a good final metric when we are looking for a classifier that has a good measure of both. Specifically, given the contingency table in Table 2, these metrics are defined below.

Has the keywordDoes not have the keyword
Predicted to have itab
Not predicted to have itcd
Table 2: Evaluation contingency table

 

Eq. 1, 2, and 3: Evaluation metrics

Color Similarity

The performance of the color histogram-based classifier on the 15 keywords I have chosen is shown in figures 4, 5, and 6. As you can see the performance is quite low. However, the low performance of the classifier is due a little more to low recall rather than low precision. It is difficult to see a pattern in which keywords this classifier did well in. While our intuition says that "sky", "cloud", and "hazy" should have distinct color signatures, only "hazy" performed particularly well. This suggests that the color signatures of photos have more to do with factors other than the subject of the photo.

In all, the performance of this color-based classifier is poor enought that it would not be a useful tool for helping me annotate my photos. So, let's see if the cumulative color classifier has better performance, as the literature said it would. This is given in figure 7. While the overall performance of this classifier is better than that of the simple classifier, it is by no means decisive. One item of interest in this chart is the high performance of the "shyam" category, which is an annotation for a particular person. The reason this is interesting is that all of the pictures of Shyam were taken at the same time, and they were with a different camera than the rest of the photos in the dataset. This suggests that choice of camera plays a larger role in determining color similarity than the content of the photo itself.

Comparing the two color-based methods on the categories where we think they would be most effective, we can see that, while the noise-reducing cumulative color-histograms do tend to perform better, especially on larger categories, it is not consistently better. In all, these color based methods perform so poorly, they would only be useful on a few keywords, such as "cloud" and "hazy", but it would be difficult to determine automatically which keywords they would be useful for.

Face Detection

For the face detection based methods, I will first discuss the performance of the face detector itself. The full output of the face detector can be seen on this huge, slow-loading, browser-crashing web page. I have summed up the performance in table 3.

FacesNot Faces
Predicted Faces6830
Not Predicted Faces13n/a
Table 3: Face Detector Performance

The face detector has 69% precision and 84% recall. However, most of the false positives occur near the end of the dataset, so the effective precision for most of the experiment was much higher. I also would not expect another face detector to do better, as many of the false positives do look vaguely like human faces. These results indicate that the face detecter worked adequately, although the real test will be if it helps the face-based classifier.

Face Recognition

The results of the face recognition based classifier are given in figure 9. This classifier generally does better on the keywords describing people than the color-based classifier. This is mostly unsurprising though, as it may be a result of the face detector without any recognition at all. Also, the classifier is able to make some predictions about the most common classes on the left, even though they don't describe people. This is because the classes are so common it learns a correlation between them and my face. (Also, it could probably get performance this good just by always saying "yes".)

In an absolute sense, this classifiers performance is still bad enough to not be very useful for a practical application. This is almost certainly because of the small size of the dataset though. As the literature suggested, face recognition requires a large number of large photos, but here we only have around 20-25 examples per class. Furthermore, the average size of a face in these photos is only 24x32 pixels, not nearly large enough for good performance. However, as we will see later, the face-based annotator can claim the best performance for the most common face, mine ("bryan"). This suggests that more training data may yet make this classifier useful.

Time Modeling

Since a major goal of this project was to learn about photo classification, the time-based classifier is, in a sense, the least interesting. However, it uses domain-specific information, and should do quite well. As I mentioned in the example given above, I went to China only once, so the performance of a time-based classifier should be nearly perfect for that class. Figure 10 shows the performance of this classifier with k=1. (As with all of the figures on this page, k was chosen to optimize F1 score.)

Unsurprisingly, the time-based classifier has quite good performance, indicating sadly that the best behavior for a photo annotation tool would be to simply always start a photo with the annotations of the previous photo. As predicted, the classifier did particularly well on keywords that correspond to location, and not as well on colors and people. The worst performance was on "bryan", because my appearance is sporadic through the dataset. One contradiction to this trend is "beijing", simply because I was inconsistent in my manual labeling.

One of the reasons I chose to model time with a kNN classifier was that larger values of k would be able to reduce noise in the dataset (i.e. if I missed a photo while labeling). Figure 11 shows the tradeoff of performance at various levels of k, and we see that every keyword worked best with k=1, indicating that the dataset was not very noisy.

Conclusions

The full chart of F1 scores for all classifiers and keywords is given in this large chart. It is clear from looking at this chart that the time-based classifier used its domain knowledge to outperform all of the other methods on every keyword with one exception. For the "bryan" keyword, the face-recognizer performed better because photos of me are not temporally clustered in the dataset, and my face is the most common. This is nice because it suggests that the face-based classifier will improve in performance as the number of photos on my blog increases.

The color classifier, on the other hand, had pretty much bad performance all around, and may not ever be appropriate for this task.

Future Work

This project was only a start on solving this need, and there is still plenty of work to be done on creating this automatic annotator. For instance, I have still not explored pattern-based classifiers, which may give good performance on the hardest keywords, such as "trees" and "water". Also, I have not tried any hybrid methods here. Often, it is possible to outperform any single classifier by combining other classifiers intelligently.

As to the particular classifiers used in this project, there are still some approaches we can take. The time-based classifier may still be improved by using other algorithms such as real HMMs (which will mostly be good for determining whether a time-based approach will work, rather than improving its results). The face recognizer may be improved by trying other algorithms or by finding ways to increase the size of the training set, such as using unlabeled faces on the web for feature selection.

References

Note: The Powerpoint presentation provides some additional examples and observations.

The entire dataset for this project, as well as all of the source code, tools, etc. can be found here.