Tuesday, March 4, 2014

Image Similarity Assessment #3 - Big Data

As I have mentioned before, I have quite a big dataset: 25,000 colour images. Each image has one dimension set to 500 px and the other scaled to preserve the original aspect ratio. The images are JPEGs, and the downloaded file is roughly 2 GB.

So the first step was to prepare those images for processing. To start, I decided to scale the images to a uniform size of 256x256 pixels. Then I converted them to grayscale and normalized the values to the range <0,1>. Under these conditions each image takes exactly 262,144 bytes (65,536 pixels at 4 bytes per 32-bit float); multiply that by 25,000 and we are at 6.1 GB. My PC has only 8 GB of RAM, so I decided to trim that down. I changed the image size to 128x128, which brought the dataset down to 1.5 GB - still too big for my GPU (1 GB), but it fits in RAM just fine.
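
For the curious, here is the arithmetic behind those numbers as a quick sketch, assuming 4 bytes (a 32-bit float) per pixel:

n_images = 25000
for side in (256, 128):
    bytes_per_image = side * side * 4                      # pixels * sizeof(float32)
    total_gib = n_images * bytes_per_image / float(2 ** 30)
    print("%dx%d: %d bytes/image, %.1f GiB total" % (side, side, bytes_per_image, total_gib))

# 256x256: 262144 bytes/image, 6.1 GiB total
# 128x128: 65536 bytes/image, 1.5 GiB total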

The process

To change the size and color depth of the images I used ImageMagick, but the preparation of the data itself was done in Python, using numpy and scipy.
The process was straightforward: read an image, flatten it into a vector (numpy.ravel), divide by 255 and push it into a storage matrix as a new row. The last step was dumping the matrix to disk using cPickle.
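
A minimal sketch of that first version, Python 2 style since that's what I was running (the input path and the scipy image reader are illustrative, not the exact original script):

import cPickle as pickle
import glob

import numpy as np
from scipy import misc

files = sorted(glob.glob("images/*.jpg"))  # hypothetical location of the resized images
data = np.zeros((len(files), 256 * 256))   # note: numpy defaults to float64 here

for row, path in enumerate(files):
    img = misc.imread(path)                # grayscale JPEG comes back as a 2D uint8 array
    data[row] = np.ravel(img) / 255.0      # flatten and normalize to <0,1>

with open("dataset.pkl", "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)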

Well, on paper it seems nice and easy, but there was some trouble: the pickled dataset took 13 GB (this was for the 256x256 images). As I found out, Python's float is actually a double (a 64-bit number), so every pixel cost 8 bytes instead of the 4 I had counted on.

After this experience I redid the calculations and decided to shrink the images even more. The other change was abandoning pickle and switching to HDF5 (h5py), which provides transparent compression and is usually faster than pickle.
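
The HDF5 version is just as short; a sketch (file and dataset names are illustrative, and data is the matrix built in the previous snippet):

import h5py
import numpy as np

# float32 halves the per-pixel cost; gzip compression is handled transparently by HDF5
with h5py.File("dataset.h5", "w") as f:
    f.create_dataset("images", data=data.astype(np.float32), compression="gzip")

# and reading it back later:
with h5py.File("dataset.h5", "r") as f:
    data = f["images"][:]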

After these design changes I updated the conversion script, and oh boy, was I surprised when the resulting data file was only 7 MB.

No, that's not a typo. If you have any knowledge of information theory, you can tell something stinks - and you would be right. But at the moment I was so excited about the result that it blinded my judgement.

It wasn't until a few days later that I found the error. When I was rewriting the conversion script, I made a tiny mistake with huge consequences. Can you spot it?

img_norm = img/255

Yep, integer division. Only pixels that were pure white yielded 1; every other value became 0. That's why the compression was so efficient - even simple RLE achieves an amazing compression ratio when the data is almost nothing but zeros.
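
To see the bug in isolation (this is Python 2's classic division; Python 3 would divide these as floats):

import numpy as np

img = np.array([0, 64, 128, 254, 255], dtype=np.uint8)

print(img / 255)                     # integer division: [0 0 0 0 1] - only pure white survives
print(img / 255.0)                   # the fix - floating-point division, values in <0,1> as intended
print(img.astype(np.float32) / 255)  # an equivalent fix: cast first, then divide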

So the lessons learned: be careful with intuition while learning Python, and triple-check every line you write.