Tuesday, March 4, 2014

Image Similarity Assessment #3 - Big Data

As I have mentioned before, I have quite a big dataset: 25,000 colour images. Each image has one dimension set to 500 px and the other scaled to preserve the original aspect ratio. The images are JPEGs and the downloaded file is roughly 2 GB.

So the first step was to prepare those images for processing. To start, I decided to scale the images to a uniform size of 256x256 pixels. Then I converted them to grayscale and normalized the values to the range <0,1>. Under these conditions each image takes exactly 262,144 bytes (65,536 pixels at 4 bytes each); multiply that by 25,000 and we are at 6.1 GB. My PC has only 8 GB of RAM, so I decided to trim that down. I changed the image size to 128x128 and the dataset shrank to 1.5 GB, which is still too big for my GPU (1 GB) but fits in RAM just fine.
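The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper name dataset_bytes is made up for illustration, and it assumes 4 bytes per pixel, which is what 262,144 bytes for 256x256 pixels implies.

```python
def dataset_bytes(n_images, side, bytes_per_pixel):
    """Raw size of n_images square grayscale images, in bytes."""
    return n_images * side * side * bytes_per_pixel

GIB = 2 ** 30  # the GB figures in the text match binary gigabytes

for side in (256, 128):
    size = dataset_bytes(25000, side, 4)  # assumption: 4-byte floats
    print("%dx%d: %.1f GiB" % (side, side, size / GIB))
# 256x256: 6.1 GiB
# 128x128: 1.5 GiB
```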

The process

To change the size and colour depth of the images I used ImageMagick. The rest of the data preparation was done in Python, using NumPy and SciPy.
The process was straightforward: read an image, flatten it into a vector (numpy.ravel), divide by 255 and push it into a storage matrix as a new row. The last step was dumping the matrix to disk using cPickle.
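A minimal sketch of that loop might look like the following. It is not the original script: the function name images_to_matrix is hypothetical, random arrays stand in for the real image files, the float32 storage type is an assumption, and the reading and pickling steps are left out.

```python
import numpy as np

def images_to_matrix(images, dtype=np.float32):
    """Flatten each grayscale image into one row of a storage matrix."""
    images = list(images)
    n_pixels = images[0].size
    data = np.empty((len(images), n_pixels), dtype=dtype)
    for row, img in enumerate(images):
        # ravel() turns the 2-D image into a vector; 255.0 keeps the division float
        data[row] = np.ravel(img) / 255.0
    return data

# Stand-in data: three random 128x128 "images" instead of real files.
rng = np.random.RandomState(0)
fake_images = [rng.randint(0, 256, size=(128, 128)).astype(np.uint8)
               for _ in range(3)]
matrix = images_to_matrix(fake_images)
print(matrix.shape, matrix.dtype)  # (3, 16384) float32
```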

Well, on paper it seems nice and easy. But there was some trouble. The pickled dataset was 13 GB (this was for the 256x256 images). As I found out, Python's float is actually a double (a 64-bit number), so every pixel took 8 bytes instead of 4.
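The effect is easy to reproduce on a tiny array. The sketch below uses a made-up 16x16 image; dividing an 8-bit array by a Python float gives 64-bit values unless the result is cast explicitly.

```python
import numpy as np

# A made-up 16x16 image holding every 8-bit value once.
img = np.arange(256, dtype=np.uint8).reshape(16, 16)

as_double = img / 255.0                       # NumPy picks float64 here
as_single = (img / 255.0).astype(np.float32)  # explicit cast to 32-bit floats

print(as_double.dtype, as_double.nbytes)  # float64 2048
print(as_single.dtype, as_single.nbytes)  # float32 1024
```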

So after this experience I redid the calculations and decided to shrink the images even more. The other change was abandoning pickle and switching to HDF5 (h5py), which provides transparent compression and is usually faster than pickle.

So after these design changes I updated the conversion script and oh boy, was I surprised when the resulting data file was only 7 MB.

No, that's not a typo. If you have any knowledge of information theory, you can tell that it stinks. And you are right. At that moment, though, I was so excited about the result that it clouded my judgement.

It wasn't until a few days later that I found the error. When I was rewriting the conversion script, I made a tiny error with huge consequences. Can you spot it?

img_norm = img/255

Yep, integer division. Only pixels that were pure white yielded 1. Every other value was 0. That's why the compression was so efficient. Even RLE can achieve an amazing compression ratio when the data is almost nothing but zeros.
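The bug and its effect on compression can be reproduced with a short sketch. Two things here are assumptions: a random array stands in for a real image, and zlib stands in for the HDF5 compression. In Python 3 the `/` operator always does true division, so `//` is used to mimic what `/` did to integer arrays in Python 2.

```python
import zlib
import numpy as np

# Stand-in for a real grayscale image (random 8-bit pixels).
rng = np.random.RandomState(0)
img = rng.randint(0, 256, size=(128, 128)).astype(np.uint8)
img[0, 0] = 255  # make sure at least one pure white pixel exists

broken = img // 255   # integer division: 1 for pure white, 0 for everything else
fixed = img / 255.0   # float division: values spread over <0,1>

# zlib is only a stand-in for the compression used in the real data file.
broken_size = len(zlib.compress(broken.astype(np.float32).tobytes()))
fixed_size = len(zlib.compress(fixed.astype(np.float32).tobytes()))

print(np.unique(broken))        # [0 1]
print(broken_size, fixed_size)  # the broken version compresses far better
```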

So the lessons learned were: be careful with your intuition while learning Python, and triple-check every line you write.

Tuesday, February 25, 2014

Image Similarity Assessment #2 - Getting Data

The first challenge in machine learning research is getting data. The second challenge is getting quality data. For my research I needed a decent number of real-world images to test my algorithms against. Below are the options I tried.


People at the National University of Singapore did a great job creating the NUS-WIDE dataset. This dataset contains ~250,000 images scraped from Flickr. The images are tagged and sorted into categories describing their content. The authors also provide some feature data, such as histograms and SIFT descriptors. It's like a researcher's dream.

Well, kinda. I have worked with this dataset in the past, and I found some issues. First of all, the dataset is provided as a set of image links, which makes sense given the number of images. The problem is that because the dataset was created in 2009, some images are gone. That alone isn't a big issue, even though it means some further processing is needed. The main issue is that Flickr doesn't respond with 404 File Not Found for them. It responds with a placeholder image instead.

I'm not really networking savvy, so there might be ways to detect the redirect and act accordingly, but I didn't know them. In the end, a fraction of the images I scraped from Flickr were this particular image.

If you have a few hundred images you can remove those manually; if you have several thousand, not really. Of course there are some ways to do it (CRC checking, image hashing, ...), but that means extra work and time that could be spent differently.
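For illustration, the hashing idea could be sketched like this. It is not something I ran on the dataset: the helper names are made up, short byte strings stand in for downloaded files, and the approach only works if every copy of the unwanted image is byte-for-byte identical.

```python
import hashlib

def sha1_of(data):
    """Hex digest of a file's raw bytes."""
    return hashlib.sha1(data).hexdigest()

def drop_placeholders(files, placeholder_bytes):
    """Keep only files whose bytes differ from the known unwanted image."""
    bad = sha1_of(placeholder_bytes)
    return {name: data for name, data in files.items()
            if sha1_of(data) != bad}

# Stand-in data: byte strings instead of real downloaded JPEGs.
placeholder = b"fake placeholder image bytes"
downloads = {
    "a.jpg": b"real photo 1",
    "b.jpg": placeholder,
    "c.jpg": b"real photo 2",
}
kept = drop_placeholders(downloads, placeholder)
print(sorted(kept))  # ['a.jpg', 'c.jpg']
```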


Another idea I had was to write a custom Imgur scraper. The URL of an Imgur image contains a random string 4 to 6 characters long. So my first attempt was generating random URLs and hoping for the best.

Well, it turns out that Imgur's address space is only about 30-40% full. That means only 30-40% of randomly generated strings are actually images. To put it in perspective, for 20,000 images you need to generate roughly 50,000 - 67,000 random strings (20,000 divided by 0.4 and by 0.3). "That's easy!" you say, and you are right: the problem is not generating the URLs. The main issue is downloading the images.

The PC my code runs on has a public IP. This means that if Imgur decides I'm scraping too much data, they can easily block me. Yeah, I might be a bit paranoid, but better safe than sorry. Fast forward a few days: I noticed that the Imgur gallery page has a Random image link. This means I don't have to care about generating the URLs; Imgur will do the hard work. So getting the images was as easy as writing a simple HTML parser and following the Random link. To overcome the problem with the public IP, I decided to use public proxy services.

First I found a web page with a list of free proxies, then I parsed the proxy IPs and used them for connecting to Imgur. It turned out that not all proxies are that good. Some didn't even work. So I implemented a simple selection algorithm that works in a similar manner to an OS process scheduler: the better a proxy performed, the higher its priority, and the priorities decayed every few cycles.
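A rough sketch of such a priority scheme is shown below. It is not my original code: the class name ProxyPool, the scoring rule and the decay parameters are all illustrative assumptions, and the proxy addresses are documentation-only placeholders.

```python
class ProxyPool:
    """Pick the proxy with the highest priority; decay priorities over time."""

    def __init__(self, proxies, decay=0.5, decay_every=3):
        self.priority = {p: 1.0 for p in proxies}
        self.decay = decay              # assumed decay factor
        self.decay_every = decay_every  # assumed number of cycles between decays
        self.cycles = 0

    def pick(self):
        # Highest priority wins; ties go to the proxy listed first.
        return max(self.priority, key=self.priority.get)

    def report(self, proxy, success, seconds=0.0):
        if success:
            # Faster responses earn a bigger boost (assumed scoring rule).
            self.priority[proxy] += 1.0 / (1.0 + seconds)
        else:
            self.priority[proxy] -= 1.0
        self.cycles += 1
        if self.cycles % self.decay_every == 0:
            # Pull every priority towards zero so old results fade out.
            for p in self.priority:
                self.priority[p] *= self.decay

pool = ProxyPool(["192.0.2.1:8080", "192.0.2.2:8080"])
pool.report("192.0.2.1:8080", success=False)
print(pool.pick())  # 192.0.2.2:8080
```

Because the decay pulls negative priorities back towards zero as well, a proxy that failed once eventually gets another chance.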

Well, this certainly worked. It worked really well, to be honest. The problem is, it was slow. The mean time for getting one image was around 7 seconds, and in roughly 20% of cases it was worse than 20 s. A quick calculation showed that it would take a few days to get a decent number of images. Not a really appealing option.


In the end I found the MIR FLICKR dataset. This was a really lucky shot. The researchers at Leiden University obtained an impressive dataset of 1 million images from Flickr. And unlike NUS-WIDE, they provide the actual images for download.

There are two options for download: the older, smaller dataset with 25,000 images and the newer one with 1 million images. I decided to use the smaller one, mostly because I don't have enough time and processing power at hand to crunch through 1M images in reasonable time.

Monday, February 24, 2014

Image Similarity Assessment #1 - Preface


For the past few months I have been working on my master's thesis. The topic is Image Similarity Assessment. The goal of my work is to evaluate different algorithms for comparing and retrieving similar images from a given dataset. Basically the thing that Google does so well.

A few years back I did similar research based on feature extraction and comparison. The process was to run a wavelet transform, extract the interesting coefficients and use them as a key for searching. Well, it turned out that this approach isn't very useful. In the end I was able to get >98% accuracy for corrupted images (changed aspect ratio, added blur), but for finding similar images it didn't work well.

When I started work on my thesis, my supervisor asked me to try a different approach: neural networks. More specifically, deep neural networks. The researchers from LISA Lab in Montreal found new ways to train deep neural networks effectively, and my goal is to try this new stuff. In short, my goal is to train a deep auto-encoder that generates low-dimensional codes used for image retrieval.
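To make the retrieval idea concrete, here is a minimal sketch of looking up similar images once every image has a low-dimensional code. Everything in it is an assumption made for illustration: random vectors stand in for codes from a trained auto-encoder, the code length of 32 is arbitrary, and plain Euclidean distance is just one possible similarity measure.

```python
import numpy as np

def most_similar(codes, query_index, k=3):
    """Indices of the k codes closest (Euclidean) to the query code."""
    distances = np.linalg.norm(codes - codes[query_index], axis=1)
    order = np.argsort(distances)
    return [int(i) for i in order if i != query_index][:k]

# Stand-in codes: 100 random 32-dimensional vectors instead of real encodings.
rng = np.random.RandomState(0)
codes = rng.rand(100, 32)
codes[7] = codes[42] + 0.001  # plant a near-duplicate of "image" 42

print(most_similar(codes, 42, k=1))  # [7]
```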

The idea behind my work is that the user provides an input image and the software returns images which are in some way similar. I have to add a word of warning here: the algorithm "sees" differently than humans do. To prove my point, look at the following images:

Those two images were selected as similar even though they have nothing in common. I have to admit that this is the result of an algorithm with very poor performance, but it proves my point. The algorithm doesn't see an image of a beach and an image of a tiger. It sees a bunch of yellow-orange pixels (fur/sand), a bunch of gray and white ones (clouds and foam/fur and wall) and a bunch of blue pixels (sky).
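One way to see how this can happen is to compare images by colour distribution only. The sketch below assumes a similarity measure based on coarse colour histograms, which is not necessarily the algorithm that produced the match above; the function names and the tiny synthetic images are made up for illustration.

```python
import numpy as np

def colour_histogram(img, bins=4):
    """Normalized joint RGB histogram with bins**3 cells."""
    pixels = img.reshape(-1, 3).astype(np.float64)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=[(0, 256)] * 3)
    return hist.ravel() / pixels.shape[0]

def histogram_distance(a, b):
    """L1 distance between histograms: 0.0 means identical colour mix."""
    return float(np.abs(colour_histogram(a) - colour_histogram(b)).sum())

yellow, blue, green = (230, 180, 40), (40, 90, 220), (20, 200, 30)

a = np.zeros((8, 8, 3), dtype=np.uint8)
a[:4], a[4:] = yellow, blue        # yellow on top, blue below
b = np.zeros((8, 8, 3), dtype=np.uint8)
b[:, :4], b[:, 4:] = blue, yellow  # same colours, different layout
c = np.zeros((8, 8, 3), dtype=np.uint8)
c[:] = green                       # a completely different colour

print(histogram_distance(a, b))  # 0.0
print(histogram_distance(a, c))  # 2.0
```

Images a and b look nothing alike in layout, yet the histogram cannot tell them apart because it throws away where the colours are.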

As you can see, this is quite a challenging task, and I'm really interested in what results my research will yield.