Tuesday, February 25, 2014

Image Similarity Assessment #2 - Getting Data

The first challenge in machine learning research is getting data. The second challenge is getting quality data. For my research I needed a decent amount of real-world images to test my algorithms against. Below are the options I tried.

NUS-WIDE

The people at the National University of Singapore did a great job creating the NUS-WIDE dataset. It contains roughly 250,000 images scraped from Flickr, tagged and sorted into categories describing their content. The authors also provide some precomputed feature data, such as histograms and SIFT descriptors. It's like a researcher's dream.

Well, kinda. I have worked with this dataset in the past, and I found some issues. First of all, the dataset is provided as a set of image links, which makes sense given the number of images. The problem is that the dataset was created in 2009, so some images are gone. That alone isn't a big issue, even though it means some further processing is needed. The main issue is that Flickr doesn't respond with 404 Not Found for a missing image. It responds with a placeholder image instead.
I'm not really networking savvy, so there might be a way to detect the redirect and act accordingly, but I didn't know of one. In the end, a fraction of the images I scraped from Flickr were this particular image.

If you have a few hundred images, you can remove those manually. With several thousand, not really. Of course there are ways to automate it (CRC checking, image hashing, ...), but that means extra work and time that could be spent differently.
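As a rough illustration of the hashing idea, here is a minimal sketch that flags downloaded files whose bytes match one known copy of the placeholder. It assumes Flickr always serves byte-identical placeholder data, which I have not verified. The function names and the directory layout are made up for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_placeholders(image_dir, placeholder_path):
    """List files in image_dir whose bytes match the known placeholder exactly."""
    placeholder_digest = file_digest(placeholder_path)
    return sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and file_digest(p) == placeholder_digest
    )
```

An exact hash only catches identical files. If the placeholder were ever re-encoded or resized, a perceptual image hash would be needed instead.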

Imgur

Another idea I had was to write a custom Imgur scraper. The URL of an Imgur image contains a random string 4 to 6 characters long. So my first attempt was generating random URLs and hoping for the best.

Well, it turns out that Imgur's address space is only about 30-40% full. That means only 30-40% of randomly generated strings actually point to images. To put it in perspective, for 20,000 images you need to generate roughly 50,000 to 67,000 random strings. "That's easy!" you say, and you are right: generating the URLs is not the problem. The main issue is downloading the images.
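The arithmetic is easy to check with a small sketch. The ID alphabet (letters and digits) is my assumption about how the strings look, and the helper names are invented for the example.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits  # assumed ID alphabet


def random_image_id(rng, min_len=4, max_len=6):
    """Generate one candidate ID of 4 to 6 characters."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def expected_attempts(wanted, hit_percent):
    """Expected number of random IDs needed to hit `wanted` existing images.

    Uses integer ceiling division so the result is exact.
    """
    return -(-wanted * 100 // hit_percent)


rng = random.Random(0)
print(random_image_id(rng))           # one random candidate ID
print(expected_attempts(20_000, 40))  # 50000
print(expected_attempts(20_000, 30))  # 66667
```

These are expected values only. An actual run needs somewhat more or fewer attempts depending on luck.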

The PC my code runs on has a public IP address. This means that if Imgur decides I'm scraping too much data, they can easily block me. Yeah, I might be a bit paranoid, but better safe than sorry. Fast forward a few days: I noticed that the Imgur gallery page has a Random image link. This means I don't have to care about generating the URLs. Imgur will do the hard work. So getting the images was as easy as writing a simple HTML parser and following the Random link. To overcome the problem with the public IP, I decided to use public proxies.

First I found a webpage with a list of free proxies, then I parsed the proxy IPs and used them for connecting to Imgur. It turned out that not all proxies are that good. Some didn't even work. So I implemented a simple selection algorithm that works in a similar manner to an OS process scheduler. The better a proxy performed, the higher its priority, and the priorities decayed every few cycles.
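A minimal sketch of this kind of selection could look like the following. It is not my original code: the scoring rules, the decay factor and the class name are assumptions made for illustration, and the proxy addresses in the usage lines are placeholders.

```python
import random


class ProxyPool:
    """Pick proxies at random, weighted by a score that tracks recent performance."""

    def __init__(self, proxies, decay=0.5, decay_every=10, rng=None):
        self.scores = {proxy: 1.0 for proxy in proxies}  # everyone starts equal
        self.decay = decay              # share of the bonus/penalty kept at each decay
        self.decay_every = decay_every  # number of picks between decays
        self.picks = 0
        self.rng = rng or random.Random()

    def pick(self):
        """Return one proxy; higher-scored proxies are chosen more often."""
        self.picks += 1
        if self.picks % self.decay_every == 0:
            self._decay_scores()
        proxies = list(self.scores)
        weights = [self.scores[p] for p in proxies]
        return self.rng.choices(proxies, weights=weights, k=1)[0]

    def report(self, proxy, ok, seconds=0.0):
        """Reward a fast successful download; halve the score after a failure."""
        if ok:
            self.scores[proxy] += 1.0 / (1.0 + seconds)
        else:
            self.scores[proxy] *= 0.5

    def _decay_scores(self):
        """Pull every score back toward the neutral value 1.0."""
        for proxy, score in self.scores.items():
            self.scores[proxy] = 1.0 + (score - 1.0) * self.decay


pool = ProxyPool(["10.0.0.1:8080", "10.0.0.2:3128"])  # placeholder addresses
proxy = pool.pick()
pool.report(proxy, ok=True, seconds=7.0)
```

The decay serves the same purpose as in a process scheduler: a proxy that was good an hour ago has to keep earning its priority, and one that failed a few times eventually gets another chance.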

Well, this certainly worked. It worked really well, to be honest. The problem is that it was slow. The mean time for getting one image was around 7 seconds, and in roughly 20% of cases it took longer than 20 seconds. A quick calculation showed that it would take a few days to get a decent number of images. Not a really appealing option.
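The quick calculation, sketched with the 7-second mean from above. The image counts are example targets I picked for illustration, and the sketch assumes the images are downloaded one after another.

```python
MEAN_SECONDS_PER_IMAGE = 7  # mean download time reported above
SECONDS_PER_DAY = 86_400


def download_days(image_count, seconds_per_image=MEAN_SECONDS_PER_IMAGE):
    """Days needed to fetch image_count images one after another."""
    return image_count * seconds_per_image / SECONDS_PER_DAY


for count in (20_000, 100_000):  # example targets, not measured values
    print(f"{count} images: {download_days(count):.1f} days")
```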

MIR FLICKR

In the end I found the MIR FLICKR dataset. This was a really lucky shot. The researchers at Leiden University assembled an impressive dataset of 1 million images from Flickr. And unlike NUS-WIDE, they provide the actual images for download.

There are two options for download: the older, smaller dataset with 25,000 images and the newer one with 1 million images. I decided to use the smaller one, mostly because I don't have enough time and computing power at hand to crunch through 1M images in reasonable time.

Monday, February 24, 2014

Image Similarity Assessment #1 - Preface

Preface

For the past few months I have been working on my master's thesis. The topic is Image Similarity Assessment. The goal of my work is to evaluate different algorithms for comparing and retrieving similar images from a given dataset. Basically the thing that Google does so well.

A few years back I did similar research based on feature extraction and comparison. The process was to run a wavelet transform, extract the interesting coefficients and use them as a key for searching. Well, it turned out that this approach isn't very useful. In the end I was able to get >98% accuracy for corrupted images (changed aspect ratio, added blur), but it didn't work well for finding similar images.

When I started working on my thesis, my supervisor asked me to try a different approach: neural networks. More specifically, deep neural networks. Researchers from the LISA Lab in Montreal found new ways to train deep neural networks effectively, and my goal is to try this new stuff. In short, my goal is to train a deep auto-encoder which generates low-dimensional codes used for image retrieval.

The idea behind my work is that the user provides an input image and the software returns images which are in some way similar. I have to add a word of warning here. The algorithm "sees" differently than humans do. To prove my point, look at the following images:

Those two images were selected as similar even though they have nothing in common. I have to admit that this is the result of an algorithm with very poor performance, but it proves my point. The algorithm doesn't see an image of a beach and an image of a tiger. It sees a bunch of yellow-orange pixels (fur/sand), a bunch of gray and white pixels (clouds and foam/fur and wall) and a bunch of blue pixels (sky).
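To make this concrete, here is a small sketch of how a purely color-based comparison ends up with such a match. It assumes a coarse color histogram with histogram intersection as the similarity measure, which is not necessarily the algorithm that produced the pair above. The two "images" are made-up pixel lists, not the real pictures.

```python
from collections import Counter


def color_histogram(pixels, bins_per_channel=4):
    """Normalized histogram of coarsely quantized (r, g, b) pixels, 0-255 each."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bucket: n / total for bucket, n in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: the color mass shared by two normalized histograms."""
    return sum(min(share, h2.get(bucket, 0.0)) for bucket, share in h1.items())


# Two tiny synthetic "images" as flat pixel lists (made-up colors).
# beach: sand, foam, sky; tiger: fur, white fur and wall, sky
beach = [(230, 180, 90)] * 50 + [(240, 240, 240)] * 20 + [(60, 120, 220)] * 30
tiger = [(225, 170, 80)] * 45 + [(235, 235, 235)] * 25 + [(50, 110, 215)] * 30

similarity = histogram_intersection(color_histogram(beach), color_histogram(tiger))
print(f"similarity: {similarity:.2f}")
```

The histogram throws away where the pixels are, so sand and fur land in the same bucket and the two scenes score as nearly identical.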

As you can see, this is quite a challenging task, and I'm really curious what results my research will yield.