If you are not familiar with Microsoft COCO, you should be. It's a treasure trove of data for your learning pleasure! There just happens to be one pesky problem with it: when you try to find the files for training/testing, the entries in the annotations file that ships with MS COCO do not include the actual file name, only the image id. This sounds fine, except that the image files you download have a prefix and a bunch of leading zeros in their names! In this article we will go through how to get the data ready.
Getting the Data Set
This is a big data set, so I like to open up a bash prompt (yes, a Linux bash prompt) on Windows and run wget with the link. You can alternatively just follow the download link and let your browser handle it (this kept crashing my box, though): http://mscoco.org/dataset/#download
You need to download the images as well as the annotations file.
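If the browser download keeps failing and wget is not at hand, a streamed download from Python is one alternative. The sketch below is only an illustration: `download_file` is a hypothetical helper, and the URL and destination in the usage comment are placeholders you would replace with a link copied from the download page above and a folder of your choice. It writes the response to disk in chunks, so the archive never has to fit in memory.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest_path, chunk_size=1024 * 1024):
    """Stream the resource at `url` to `dest_path` and return the bytes written.

    Hypothetical helper for illustration; it does not resume partial downloads.
    """
    dest = Path(dest_path)
    # Create the target folder if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Copy chunk by chunk instead of reading the whole archive at once.
        shutil.copyfileobj(response, out, chunk_size)
    return dest.stat().st_size


# Usage (placeholders, not real values):
# size = download_file("<link copied from the download page>", "<folder>/images.zip")
# print(f"wrote {size} bytes")
```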
Opening the Annotations File with Python
If you haven't already, set up Visual Studio Code with CNTK using the previous article here. Once you have your environment up and running with your .vscode folder and settings/launch files, you can execute the code below interactively.
The following code opens the annotations into a pandas data frame.
#%% imports
import pandas as pd
import json
import os

ann_base_dir = 'C:/data/coco/instances_train-val2014/annotations/'
train_lbls = 'instances_train2014.json'
test_lbls = 'instances_val2014.json'

#%% read data
# Load the validation annotations and flatten them into a data frame
with open(ann_base_dir + test_lbls, 'r') as f:
    raw_json = json.load(f)
# pd.json_normalize replaces pd.io.json.json_normalize in pandas 1.0 and later
ann_df = pd.json_normalize(raw_json['annotations'])
Notice that image_id is just an integer. When you download the data, though, the files are named COCO_val2014_000000someid.jpg. This stinks.
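To see the mismatch concretely, it can help to pair a few annotations with their files by hand. The sketch below assumes the instances file also carries a top-level `images` list whose entries have `id` and `file_name` fields, which is how the COCO instances format is laid out; the tiny `sample` dictionary is made-up data standing in for the real JSON, and the category ids in it are arbitrary.

```python
# Made-up sample in the shape of a COCO instances file (assumption: a
# top-level 'images' list with 'id' and 'file_name' sits next to 'annotations').
sample = {
    "images": [
        {"id": 42, "file_name": "COCO_val2014_000000000042.jpg"},
        {"id": 73, "file_name": "COCO_val2014_000000000073.jpg"},
    ],
    "annotations": [
        {"id": 1, "image_id": 42, "category_id": 18},
        {"id": 2, "image_id": 73, "category_id": 1},
        {"id": 3, "image_id": 42, "category_id": 1},
    ],
}

# Map each image id to its on-disk file name.
id_to_file = {img["id"]: img["file_name"] for img in sample["images"]}

# Pair every annotation with the file it belongs to.
rows = [
    (ann["id"], ann["image_id"], id_to_file[ann["image_id"]])
    for ann in sample["annotations"]
]

for ann_id, image_id, file_name in rows:
    print(ann_id, image_id, file_name)
```

A lookup like this is an alternative to renaming: the annotation rows keep the bare id, and the dictionary supplies the long file name when it is needed.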
List Comprehension is Neat
So I did a quick test using a list comprehension before I let my code loose on the big data set. Here is the code to quickly test it.
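#%% Remove Leading Zeros from Image Files
base_dir = 'C:/data/coco/val2014/val2014/'
# Keep the part after the last underscore, then strip its leading zeros:
# 'COCO_val2014_000000someid.jpg' -> 'someid.jpg'
[s.split('_')[-1].lstrip("0") for s in os.listdir(base_dir)]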
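To check the expression without touching the disk, it can be run on a few made-up names first. This is a sketch: `short_name` is a hypothetical helper, the sample names only imitate the COCO_val2014_000000someid.jpg pattern, and taking the piece after the last underscore is an assumption about which part of the split is wanted.

```python
def short_name(file_name):
    """Drop everything up to the last underscore, then strip leading zeros."""
    return file_name.split("_")[-1].lstrip("0")


# Made-up names that follow the pattern of the downloaded files.
samples = [
    "COCO_val2014_000000000042.jpg",
    "COCO_val2014_000000581929.jpg",
    "COCO_val2014_000000100000.jpg",
]

shortened = [short_name(s) for s in samples]
print(shortened)  # ['42.jpg', '581929.jpg', '100000.jpg']
```

Only the leading zeros go away; zeros inside or at the end of the id stay put, so the result still matches the image_id.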
Updating the File Names
Below is the code to actually go through every file and update the file name using the technique we tested out in the list comprehension.
#%%
for file in os.listdir(base_dir):
    # Keep the part after the last underscore and strip its leading zeros
    removed_zeros = file.split('_')[-1].lstrip("0")
    os.rename(base_dir + file, base_dir + removed_zeros)
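Since os.rename changes files in place, a dry run first is cheap insurance. The following is a minimal sketch, not the code used for this article: `plan_renames` and `apply_renames` are hypothetical helpers, and the demo runs against a temporary folder with empty stand-in files rather than the real val2014 directory.

```python
import os
import tempfile


def plan_renames(directory):
    """Return (old, new) name pairs for files whose names would change."""
    plan = []
    for name in sorted(os.listdir(directory)):
        new_name = name.split("_")[-1].lstrip("0")
        if new_name != name:
            plan.append((name, new_name))
    return plan


def apply_renames(directory, plan):
    """Rename files according to `plan`, refusing to overwrite anything."""
    targets = [new for _, new in plan]
    if len(set(targets)) != len(targets):
        raise ValueError("two files would end up with the same name")
    for old, new in plan:
        dst = os.path.join(directory, new)
        if os.path.exists(dst):
            raise FileExistsError(dst)
        os.rename(os.path.join(directory, old), dst)


# Demo on a throwaway folder with empty stand-in files.
demo_dir = tempfile.mkdtemp()
for name in ["COCO_val2014_000000000042.jpg", "COCO_val2014_000000000073.jpg"]:
    open(os.path.join(demo_dir, name), "w").close()

plan = plan_renames(demo_dir)
print(plan)  # dry run: inspect the pairs before touching anything
apply_renames(demo_dir, plan)
print(sorted(os.listdir(demo_dir)))
```

Running `plan_renames` a second time returns an empty list, because names that are already short are left alone.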
Alright! That's it, folks. Now you are ready to go with creating your label files, and you know the image names in the directory are simply the image_id you have plus .jpg at the end. Happy deep learning!