Complex Neural Network Data Modelling with CNTK

Hello World,

This article is an exciting one for me, because once you internalize how this works, the world really becomes your oyster in terms of what you can model and with what kind of data.  In this example we are going to take some sample images and some random vector features and merge them together.  In a more realistic scenario you might take something like an image, plus some contextual tabular data, and want to merge those two data sets into a single prediction.

CNTK vs TensorFlow

First, some general notes.  I've been an avid user of TensorFlow, but I'm moving back toward CNTK now that it has reached general availability and several issues have been fixed.  CNTK is shaping up to be the superior framework of the two.  It really boils down to the data deserializers, data composability, and the simplicity of rehydrating models for inference in a production ecosystem; CNTK scores better on all of those fronts, as well as on performance.  The learning curve from TensorFlow to CNTK was gentle; I was up and running in about two days.  CNTK's APIs are a bit easier to deal with, but structurally the two frameworks are very similar in what they do.

Download the Full Sample

I know this can be a bit of a tricky topic, so the sample can be downloaded from here as a Python notebook.  You can download the full data, code, label files, etc. from here.  Since I am providing a full sample, this article will focus only on the key items: the merging, modelling, and predictions.

The Label Files

So we compose two label files together: the first is an image map file and the second is a CTF text file.  The reason we split them into two files is that the ImageDeserializer has a ton of great transforms you can apply.  This lets you experiment with image sizes more easily and quickly, and augment your data set through random cropping, shearing, etc., all made available through the transforms in a Keras-like fashion.

The Image Map File

Basically we just want the location of each image.  We include a 'fake' class of 0 simply because the format requires one.  Just make sure your locations are right.  Notice that this file uses relative paths, which are resolved relative to the working directory of the CNTK program.  Alternatively, you can provide fully qualified paths such as: C:\projects\samples\CNTK_CompositeReader\images\image1.png

images\image1.png	0
images\image2.png	0
images\image3.png	0
images\image4.png	0
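If you want to generate a map file like the one above programmatically, a minimal sketch looks like this (the output file name `train_map.txt` is hypothetical; use whatever name you pass to create_reader):

```python
# Build the tab-separated rows shown above: image path, then the fake class label 0.
rows = ["images\\image%d.png\t0" % i for i in range(1, 5)]

# Write them out, one row per line (file name here is hypothetical).
with open("train_map.txt", "w") as f:
    f.write("\n".join(rows))
```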

The CTF Map File

This file contains a one-hot-encoded feature of 'something' with 3 categories as our tabular feature, and a multi-valued regression label corresponding to output nodes 0, 1, 2, and 3.  You would use this type of label for anything that requires multiple predictions from a single pass.  I have mostly used it for robotic controls, but perhaps you want to predict the prices of two or three things simultaneously (who knows what you want to do).

|features 0 0 1 |label 1 2 3 4
|features 0 0 1 |label 1 2 3 4
|features 0 1 0 |label 4 3 2 1
|features 0 1 0 |label 4 3 2 1
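The |features column is just a one-hot encoding over the 3 categories. Here is a small sketch of how rows like these could be produced; the category names are made up for illustration:

```python
categories = ["red", "green", "blue"]  # hypothetical category names

def one_hot(category):
    # build the one-hot vector for the |features field
    return [1 if c == category else 0 for c in categories]

def ctf_row(category, label_values):
    # format one CTF line, e.g. "|features 0 0 1 |label 1 2 3 4"
    feats = " ".join(str(v) for v in one_hot(category))
    label = " ".join(str(v) for v in label_values)
    return "|features %s |label %s" % (feats, label)

print(ctf_row("blue", [1, 2, 3, 4]))  # |features 0 0 1 |label 1 2 3 4
```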

Multiple Input Variables

import cntk as C

# Define the data dimensions
num_channels = 3
image_width = 32
image_height = 32

input_conv_model_shape = (num_channels, image_width, image_height)    # images are 32 x 32 with 3 color channels
conv_model_num_features = 32
input_tab_features = 3
num_regression_outputs = 4

x_i = C.input_variable(input_conv_model_shape)
x_t = C.input_variable(input_tab_features)
y = C.input_variable(num_regression_outputs)

Here we simply define some data shapes.  Our image shape is flexible, as is conv_model_num_features (the number of features output by the conv net).  input_tab_features corresponds to the size-3 vector under |features in our CTF file, and num_regression_outputs corresponds to the 4 output nodes under |label.

Data Reader Definition

import sys
import cntk.io.transforms as xforms

# Create a COMPOSITE reader to read data from both the image map and CTF files
def create_reader(map_file, ctf_file, is_training, num_regression_outputs):
    # create transforms
    transforms = []
    # training uses data augmentation (random-side cropping only)
    if is_training:
        transforms += [xforms.crop(crop_type='randomside', side_ratio=0.8)]
    transforms += [
        xforms.scale(width=image_width, height=image_height, channels=num_channels, interpolations='linear')]

    # create IMAGE DESERIALIZER for the map file
    feature_source = C.io.ImageDeserializer(map_file, C.io.StreamDefs(
        features_image = C.io.StreamDef(field='image', transforms=transforms)))
    # create CTF DESERIALIZER for the CTF file
    label_source = C.io.CTFDeserializer(ctf_file, C.io.StreamDefs(
        labels = C.io.StreamDef(field="label", shape=num_regression_outputs, is_sparse=False),
        features_tabular = C.io.StreamDef(field="features", shape=3, is_sparse=False)))

    # create a minibatch source by compositing them together
    return C.io.MinibatchSource([feature_source, label_source], max_samples=sys.maxsize, randomize=is_training)

Here you can see we create two deserializers and specifically name the stream variables features_image, features_tabular, and labels.  We then return a minibatch source composed from both deserializers.  This source is where we will extract our minibatch data to push into x_i and x_t.

Model Definition

# function to build the model
def create_model(x_i, x_t):
    with C.layers.default_options(init=C.layers.glorot_uniform(), activation=C.relu):
        h = x_i
        h = C.layers.Convolution2D(filter_shape=(5,5), num_filters=8, strides=(1,1), pad=True, name="first_conv")(h)
        h = C.layers.MaxPooling(filter_shape=(2,2), strides=(2,2), name="first_max")(h)
        h = C.layers.Convolution2D(filter_shape=(5,5), num_filters=16, strides=(1,1), pad=True, name="second_conv")(h)
        h = C.layers.MaxPooling(filter_shape=(3,3), strides=(3,3), name="second_max")(h)
        # create a feature map from the convolutional stack
        h = C.layers.Dense(conv_model_num_features, name="feature_map")(h)
        # merge the convolutional feature map with the raw tabular data
        h = C.splice(h, x_t, axis=0)
        # mix the merged data in a dense stack
        h = C.layers.Dense(conv_model_num_features, name="merged_dense_1")(h)
        p = C.layers.Dense(num_regression_outputs, activation=None, name="prediction")(h)
        return p

Here you can see we take in x_i and x_t (image and tabular data respectively).  We push the image through a conv net, and the dense layer then produces a feature map, which is spliced with the tabular data.  The merged vector then goes through a dense stack to produce the regression outputs.  This approach lets us use the right type of network for each type of data, produce a feature map, and splice everything together so the final output takes all available data into consideration.
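The splice step is just a concatenation along the feature axis. A minimal NumPy sketch (outside CNTK, with made-up values) of what happens to the shapes:

```python
import numpy as np

# hypothetical 32-element feature map produced by the conv stack's Dense layer
conv_feature_map = np.arange(32, dtype=np.float32)
# the 3-element one-hot tabular feature from the CTF file
tabular_features = np.array([0, 0, 1], dtype=np.float32)

# C.splice(h, x_t, axis=0) concatenates along the feature axis,
# so the merged dense layers see a single 35-element vector
merged = np.concatenate([conv_feature_map, tabular_features])
print(merged.shape)  # (35,)
```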

Loss & Error Functions

def create_criterion_function(model, labels):
    loss = C.losses.squared_error(model, labels)
    errs = loss
    return loss, errs # (model, labels) -> (loss, error metric)

Here we use squared_error as both our loss and our error function, because we are producing a regressor as opposed to a classifier.  For a classifier, we would use a cross-entropy loss (typically with softmax) instead.
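To make the metric concrete, here is what squared error computes for a single 4-output prediction, sketched in plain NumPy with made-up numbers:

```python
import numpy as np

prediction = np.array([1.5, 2.0, 2.5, 4.0])  # hypothetical model output
label      = np.array([1.0, 2.0, 3.0, 4.0])  # target from the CTF file

# squared error sums the squared differences over the output nodes:
# (1.5-1)^2 + 0 + (2.5-3)^2 + 0 = 0.5
loss = np.sum((prediction - label) ** 2)
print(loss)  # 0.5
```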

Input Maps

    # Map the data streams to the inputs and labels.
    input_map = {
        y   : train_reader.streams.labels,
        x_i : train_reader.streams.features_image,
        x_t : train_reader.streams.features_tabular
    }

Here you can see we map x_i, x_t, and y to the reader streams inside the train_test function, referring to the specific stream names returned by each deserializer: labels, features_image, and features_tabular.  We need to build the same map for the test reader streams as well as the train reader streams.

Execute the Train and Test Loop

z = create_model(x_i, x_t)
reader_train = create_reader("", "train.ctf", True, num_regression_outputs)
reader_test = create_reader("", "test.ctf", False, num_regression_outputs)

train_test(reader_train, reader_test)

Notice we pass both x_i and x_t (our input variables) into the model.  We then create our readers for training and testing, and finally call the train_test loop.


data= {Input('Input3', [#], [3 x 32 x 32]): MinibatchData(data=Value([64 x 1 x 3 x 32 x 32], GPU), samples=64, seqs=64), Input('Input5', [#], [4]): MinibatchData(data=Value([64 x 1 x 4], GPU), samples=64, seqs=64), Input('Input4', [#], [3]): MinibatchData(data=Value([64 x 1 x 3], GPU), samples=64, seqs=64)}
Minibatch: 0, Loss: 8025.0112, Error: 802501.12%
Minibatch: 500, Loss: 27.8553, Error: 2785.53%
Minibatch: 1000, Loss: 26.5268, Error: 2652.68%
Minibatch: 1500, Loss: 25.2454, Error: 2524.54%
Minibatch: 2000, Loss: 24.0102, Error: 2401.02%
Minibatch: 2500, Loss: 22.7667, Error: 2276.67%
Minibatch: 3000, Loss: 21.5771, Error: 2157.71%

Alright, so it's training and learning!  WOO HOO!  Specifically, notice our input shapes: we have a 3 x 32 x 32 image, plus a shape-4 input and a shape-3 input (our label and our tabular data, respectively).


So there you go: how you can use CNTK to do sophisticated neural network modelling on extremely large data sets, using as many GPUs as you want, to produce complex outputs.  Understanding the data input pipelines is really the most complex part of all this, and it's great to see how clean CNTK makes the data pipeline for large data sets.
