NUS-WIDE
People at the National University of Singapore did a great job creating the NUS-WIDE dataset. It contains ~250,000 images scraped from Flickr. The images are tagged and sorted into categories describing their content, and the authors also provide pre-extracted feature data such as histograms and SIFT descriptors. It's like a researcher's dream.

Well, kinda. I have worked with this dataset in the past, and I found some issues. First of all, the dataset is provided as a set of image links, which makes sense given the number of images. The problem is that, because the dataset was created in 2009, some images are gone. That alone isn't a big deal, even though it means some further processing is needed. The main issue is that Flickr doesn't respond with 404 Not Found. Flickr responds with the following image:
I'm not really networking-savvy, so there might be ways to detect the redirect and act accordingly, but I didn't know them at the time. In the end, a noticeable fraction of the images I scraped from Flickr were copies of this particular placeholder.
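In hindsight, detecting the redirect is straightforward. Here is a minimal sketch using Python's requests library; it assumes the placeholder really is served via an HTTP redirect (if Flickr returned it directly with a 200, this check alone wouldn't catch it):

```python
import requests

def fetch_if_available(url):
    """Fetch an image, treating any redirect as a missing photo."""
    resp = requests.get(url, allow_redirects=False, timeout=10)
    if resp.status_code in (301, 302, 303, 307, 308):
        return None  # redirected, most likely to the placeholder image
    resp.raise_for_status()
    return resp.content
```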
If you have a few hundred images you can remove those manually; with several thousand, not really. Of course there are ways to automate it (CRC checks, image hashing, ...), but that means extra work and time that could be spent differently.
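For instance, the image-hashing route can be as simple as hashing one known copy of the placeholder and deleting every downloaded file with identical bytes. A minimal sketch; the file paths are hypothetical:

```python
import hashlib
from pathlib import Path

def file_md5(path):
    """MD5 of a file's raw bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

# Hash one known copy of the Flickr placeholder (path is hypothetical).
placeholder_md5 = file_md5("placeholder.jpg")

# Delete every downloaded image whose bytes match the placeholder.
for image in Path("images").glob("*.jpg"):
    if file_md5(image) == placeholder_md5:
        image.unlink()
```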
Imgur
Another idea I had was to write a custom Imgur scraper. The URL of an Imgur image consists of a random string 4 to 6 characters long, so my first try was generating random URLs and hoping for the best.

Well, it turns out that Imgur's address space is only about 30-40% full, meaning only 30-40% of randomly generated strings actually point to images. To put that in perspective: at that hit rate, getting 20,000 images means generating roughly 50,000 to 67,000 random strings. "That's easy!" you say, and you are right; the problem is not generating the URLs. The main issue is downloading the images.
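Generating the candidates is indeed the trivial part. Something like this sketch would do, assuming the 4-6 character alphanumeric IDs described above; the i.imgur.com host and .jpg suffix are assumptions about Imgur's URL scheme:

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits

def random_imgur_url():
    """Build a candidate Imgur image URL from a random 4-6 char ID.

    Whether the ID resolves to an actual image can only be
    discovered by requesting it.
    """
    length = random.randint(4, 6)
    image_id = "".join(random.choice(ALPHABET) for _ in range(length))
    return f"https://i.imgur.com/{image_id}.jpg"
```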
The PC my code runs on has a public IP, which means that if Imgur decides I'm scraping too much data, they can easily block me. Yeah, I might be a bit paranoid, but better safe than sorry. Fast forward a few days: I noticed that Imgur's gallery page has a Random image link. That means I don't have to care about generating URLs at all; Imgur does the hard work. Getting the images was then as easy as writing a simple HTML parser and following the Random link. To work around the public IP problem, I decided to use public proxies.
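The parser itself can stay tiny. A sketch of the idea using requests and BeautifulSoup; the /random endpoint and the og:image meta tag are assumptions about Imgur's page structure at the time, so the selectors may need adjusting:

```python
import requests
from bs4 import BeautifulSoup

def grab_random_image(session):
    """Follow Imgur's random-image page and return the image bytes."""
    page = session.get("https://imgur.com/random", timeout=10)
    soup = BeautifulSoup(page.text, "html.parser")
    # The direct image URL is assumed to sit in the og:image meta tag.
    meta = soup.find("meta", property="og:image")
    if meta is None:
        return None
    return session.get(meta["content"], timeout=10).content
```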
First I found a webpage with a list of free proxies, parsed the proxy IPs, and used them for connecting to Imgur. It turned out that not all proxies are that good; some didn't even work. So I implemented a simple selection algorithm that works in a similar manner to an OS process scheduler: the better a proxy performed, the higher its priority, with priorities decaying every few cycles.
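The idea in code looks roughly like this; the scoring and decay constants are placeholders, not tuned values:

```python
class ProxyScheduler:
    """Priority-based proxy picker, loosely modeled on an OS scheduler.

    Fast, reliable proxies earn higher priority; every few picks all
    priorities decay, so one lucky proxy doesn't starve the rest.
    """

    def __init__(self, proxies, decay_every=10, decay_factor=0.5):
        self.scores = {p: 1.0 for p in proxies}
        self.decay_every = decay_every
        self.decay_factor = decay_factor
        self.picks = 0

    def pick(self):
        """Return the highest-scoring proxy, decaying periodically."""
        self.picks += 1
        if self.picks % self.decay_every == 0:
            for p in self.scores:
                self.scores[p] *= self.decay_factor
        return max(self.scores, key=self.scores.get)

    def report(self, proxy, elapsed, ok):
        """Reward fast successful requests, punish failures."""
        if ok:
            self.scores[proxy] += 1.0 / max(elapsed, 0.1)
        else:
            self.scores[proxy] = max(self.scores[proxy] - 1.0, 0.0)
```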
Well, this certainly worked. It worked really well, to be honest. The problem is, it was slow. The mean time for getting one image was around 7 seconds, and in roughly 20% of cases it was worse than 20 seconds. A quick calculation showed it would take days to get a decent number of images: at 7 seconds per image, 20,000 images is already close to 40 hours of sequential downloading, before counting retries and dead proxies. Not really an appealing option.
MIR FLICKR
In the end I found the MIR FLICKR dataset, which was a really lucky shot. The researchers at Leiden University collected an impressive dataset of 1 million images from Flickr, and unlike NUS-WIDE, they provide the actual images for download.

There are two options: the older, smaller dataset with 25,000 images, and the newer one with 1 million images. I decided to use the smaller one, mostly because I don't have enough time and processing power at hand to crunch through the 1M images in a reasonable time.