
An Introduction to Machine Learning

By Jeff Scherrer | June 10, 2017 | February 7th, 2019


In just a few decades, software engineering has revolutionized the human capacity for problem solving. Today, an engineer can easily write software that performs addition and subtraction, searches for items in databases, and even runs complex algorithms. But a task like recognizing the text on a street sign, something that is very easy for a human, is still very difficult for a computer.

The next revolution in computing, machine learning, is attacking this problem by giving software itself the capacity to learn how to solve such tasks. Boosted by the power and scalability of modern hardware and the cloud, machine learning is turning the artificial intelligence theories of yesterday into reality and blurring the lines between the human brain and software.


Machine Learning Basics

Most articles introducing artificial neural networks show what’s referred to as a feed-forward network. It is the simplest way to understand how these networks work, although many of the same techniques also apply to architectures that vary from the typical feed-forward design. It’s best to start with the simplest form and expand on the concept from there.


Neural networks simulate the simplest aspect of the computation happening in a human brain that researchers understand: charge flowing from one neuron to another, at massive scale.

The connections between these neurons are called synapses. They carry charge from one neuron to the next, and they can also weaken that charge depending on the strength of the synapse. Billions of neurons and synapses working together form modules, and even hierarchies of modules, and these modules form entire thoughts.

In an artificial neural network, the analog of a neuron is a node, and the analog of synapse strength is the weight on the connection between two nodes.

When defining a neural network, the first step is describing the architecture: how the nodes are structured and how they connect to each other. Input nodes typically receive values between -1.0 and 1.0. Often these are simply 0 or 1, representing off or on, respectively.

Output nodes contain the computed results of the network, which are generally between -1.0 and 1.0, or between 0.0 and 1.0.

In between the input nodes and output nodes is a hidden layer of nodes. This intermediate computation adds another set of weighted connections to the network’s formula, which gives the network the capability to handle more complex problems.
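
As a rough illustration of the pieces described so far, the sketch below feeds two input values through one hidden layer to a single output node. The layer sizes, the weight values, the bias terms, and the sigmoid activation are all assumptions chosen for the example; the article does not prescribe any of them.

```python
import math

def sigmoid(x):
    """Squash any real number into the range 0.0 to 1.0."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """Compute one layer: each node sums its weighted inputs, adds a bias, then applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

# Hypothetical weights for a 2-input, 3-hidden-node, 1-output network.
hidden_weights = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.7, -0.2, 0.4]]
output_biases = [0.05]

def feed_forward(inputs):
    """Pass the inputs through the hidden layer, then through the output layer."""
    hidden = layer_forward(inputs, hidden_weights, hidden_biases)
    return layer_forward(hidden, output_weights, output_biases)

print(feed_forward([1.0, 0.0]))
```

Because of the sigmoid, the single output value always lands between 0.0 and 1.0, which matches one of the output ranges mentioned above.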

For a network capable of processing characters, every usable character has an input node. For a network capable of processing words, every usable word has an input node. Likewise, an output node must exist for every possible classification of the data. This can be thought of like a dictionary, where every input node and every output node corresponds to exactly one entry.
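
One common way to realize "one input node per character" is a one-hot encoding. The sketch below is only an illustration: the vocabulary of lowercase letters plus a space is an assumption, and the names `VOCABULARY`, `CHAR_TO_NODE`, and `one_hot` are hypothetical.

```python
# Hypothetical vocabulary: every usable character gets its own input node.
VOCABULARY = "abcdefghijklmnopqrstuvwxyz "
CHAR_TO_NODE = {ch: i for i, ch in enumerate(VOCABULARY)}

def one_hot(ch):
    """Return one value per input node: 1.0 for the matching character, 0.0 elsewhere."""
    values = [0.0] * len(VOCABULARY)
    values[CHAR_TO_NODE[ch]] = 1.0
    return values

encoded = one_hot("c")
print(len(encoded), encoded.index(1.0))
```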

Simple output labels like dog, cat, and tree can be used to classify images, commonly with one output node for each label. In addition to these simple classifications, combinations of output nodes can express more complex ones, for instance identifying every object within an image or its position within the 2-dimensional space of the image.
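
To make the mapping from output nodes to labels concrete, here is a small sketch. The activation values are made up, and the 0.5 threshold in the multi-label variant is an assumption rather than a standard.

```python
LABELS = ["dog", "cat", "tree"]  # one output node per label (illustrative)

def classify(outputs):
    """Pick the label whose output node has the highest activation."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return LABELS[best]

def detect(outputs, threshold=0.5):
    """Multi-label variant: report every label whose node exceeds a threshold."""
    return [label for label, value in zip(LABELS, outputs) if value > threshold]

print(classify([0.1, 0.85, 0.3]))  # cat
print(detect([0.7, 0.85, 0.3]))    # ['dog', 'cat']
```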


The weights between the nodes are represented as floating-point values. Collectively they are what gives the network meaning: they are the training state of the network.

Training a Neural Network

Once an architecture is defined, training can begin. Weights are typically initialized to random values, and training must move them toward values that produce the expected outputs. This requires significant volumes of training data consisting of inputs to the network as well as the expected outputs when the network is fed those inputs.

Each record of training data is processed once per epoch, and training typically runs for hundreds, maybe thousands, of epochs. For each record, the network’s output is compared with the expected output to produce an error value. That delta is propagated backwards through the network, layer by layer, to adjust each weight. The adjustment uses the derivative of the formula that the network uses when it is feeding forward, which indicates how much each weight contributed to the error. This training technique is called back-propagation.
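
The following is a minimal sketch of back-propagation for a network with one hidden layer, written in plain Python. Several things here are assumptions made for illustration: the sigmoid activation, the squared-error measure, the bias weights, the learning rate of 0.5, and the toy data set (the logical OR of two inputs). Real toolkits implement the same idea far more efficiently.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(data, hidden_size=3, epochs=2000, rate=0.5, seed=0):
    """Train a one-hidden-layer network and return the mean squared error of each epoch."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Weights start random; the last entry in each row acts as a bias weight.
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden_size)]
    w_out = [rng.uniform(-1, 1) for _ in range(hidden_size + 1)]
    errors = []
    for _ in range(epochs):
        total = 0.0
        for inputs, target in data:
            # Feed forward.
            x = list(inputs) + [1.0]
            hidden = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in w_hidden]
            h = hidden + [1.0]
            output = sigmoid(sum(w * v for w, v in zip(w_out, h)))
            total += (target - output) ** 2
            # Propagate the error backwards; s * (1 - s) is the derivative of the sigmoid.
            delta_out = (target - output) * output * (1.0 - output)
            delta_hidden = [
                delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                for j in range(hidden_size)
            ]
            # Adjust each weight in proportion to its share of the error.
            for j in range(hidden_size + 1):
                w_out[j] += rate * delta_out * h[j]
            for j in range(hidden_size):
                for i in range(n_in + 1):
                    w_hidden[j][i] += rate * delta_hidden[j] * x[i]
        errors.append(total / len(data))
    return errors

# Toy training data: the logical OR of two inputs.
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
errors = train(OR_DATA)
print(errors[0], errors[-1])
```

The two printed numbers are the error after the first and the last epoch; on this toy problem the second should be clearly smaller than the first.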

If the model is converging, the error rate should decrease from one epoch to the next. The goal is to keep training until the error rate is within an acceptable range. The duration of this training process varies with the amount of data and the complexity of the network architecture.
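
A training loop built around this idea might look like the sketch below. It stops when the error reaches an acceptable target, and it also gives up if the error stops improving, which is one simple way to notice that a model is not converging. The function `run_training`, its parameters, and the `train_one_epoch` callable are hypothetical names, and the stand-in "model" simply halves its error every epoch.

```python
def run_training(train_one_epoch, target_error=0.01, max_epochs=1000, patience=20):
    """Train until the error is acceptable, stalls, or the epoch budget runs out.

    train_one_epoch is a hypothetical callable that runs one epoch and returns its error rate.
    """
    best = float("inf")
    error = float("inf")
    stalled = 0
    for epoch in range(1, max_epochs + 1):
        error = train_one_epoch()
        if error <= target_error:
            return epoch, error, "converged"
        if error < best:
            best, stalled = error, 0
        else:
            # No improvement this epoch; give up after `patience` such epochs in a row.
            stalled += 1
            if stalled >= patience:
                return epoch, error, "stalled"
    return max_epochs, error, "budget exhausted"

# Stand-in for a real model: the error halves every epoch.
state = {"error": 1.0}

def fake_epoch():
    state["error"] /= 2
    return state["error"]

print(run_training(fake_epoch))  # (7, 0.0078125, 'converged')
```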

The error rate of a neural network is different from errors in applications and services. Rather than an exception being raised or an error being logged, the error rate refers to how often the network, in its current training state, comes to a wrong answer.

Typically, the training data is kept separate from the testing data, because test results on the same data the network trained on would be misleadingly good. The goal of testing a network properly is to confirm that it has established general patterns from the data rather than memorizing it. The only way to validate this is to test the network on data it has never trained on.
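
Keeping the two sets apart can be as simple as shuffling the records once and holding a portion back. In this sketch the 20% test fraction and the fixed seed are assumptions, and `split_records` is a hypothetical helper.

```python
import random

def split_records(records, test_fraction=0.2, seed=42):
    """Shuffle the records and hold out a fraction that training never sees."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    test_count = int(len(shuffled) * test_fraction)
    return shuffled[test_count:], shuffled[:test_count]

train_set, test_set = split_records(range(10))
print(len(train_set), len(test_set))  # 8 2
```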

Deep Neural Networks

Deep neural networks can be thought of as feed-forward networks with many hidden layers. The more hidden layers, the “deeper” the network is considered to be. Simply put, the more layers a network has, the more complex the scenarios it can solve.

Increasing the number of layers certainly increases complex problem-solving capabilities, but it also increases training and evaluation time. In addition, the fixed size of neural networks makes them challenging to use with temporal data, or data that is dynamic in length. This is where different types of layers come into play, including Convolution, Recurrence, Dense, Dropout, and BatchNormalization.
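
To give a feel for how recurrence copes with data of varying length, the sketch below folds a sequence into a single fixed-size state, one element at a time. It is a deliberately tiny illustration: a real recurrent layer uses vectors and learned weight matrices, and the two weight values and the tanh activation here are assumptions.

```python
import math

def recurrent_summary(sequence, w_input=0.5, w_state=0.8):
    """Fold a sequence of any length into a single fixed-size state value.

    The two weights are illustrative; a trained network would learn them.
    """
    state = 0.0
    for value in sequence:
        # The new state mixes the current input with the previous state.
        state = math.tanh(w_input * value + w_state * state)
    return state

# Sequences of different lengths both reduce to one number.
print(recurrent_summary([1.0, 0.0, 1.0]))
print(recurrent_summary([1.0, 0.0, 1.0, 1.0, 0.0]))
```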

These layers are based on theories for handling specific types of scenarios within models, such as image scaling, rotation, and translation; word/vector mappings; and temporal data. Data doesn’t have to flow through them strictly in sequence as it does in a simple feed-forward network. Some architectures split the flow or bypass layers altogether. This is where the architecture of the network begins to play a bigger role.

Back-propagation can be augmented with techniques like reinforcement learning. Reinforcement learning allows networks to be trained not on already existing data, but on feedback from reinforcement algorithms external to the neural network. These algorithms calculate the error of an outcome that was decided by the neural network.


Imagine an over-simplified neural network operating a self-driving car. The car’s cameras feed into the neural network. The neural network’s outputs control the steering, throttle, and brakes. The algorithms reinforce the training based on the commands given by the neural network. If the car crashes, negative reinforcement occurs on the network’s decisions leading up to the crash. If a car stays on the road, or more specifically within its lane, positive reinforcement occurs on the network’s decisions while operating safely.

Just calculating crashes and lane positioning for reinforcement learning requires that these algorithms score the car on a negative-to-positive spectrum. Writing these algorithms is a significant part of the effort in reinforcement learning, since they must have nuance to be effective. Scores that are just crash/no-crash are not very helpful. A graded score, such as an estimate of how likely the car is to crash, helps the network learn faster.
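
A graded score for the self-driving example could be sketched as follows. Everything specific here is hypothetical: the function name `score_driving`, the use of distance from the lane center as the signal, the 1.8 meter half-width, and the linear fall-off. The point is only that the score varies smoothly instead of jumping between crash and no-crash.

```python
def score_driving(lane_offset, lane_half_width=1.8, crashed=False):
    """Score one moment of driving on a spectrum from -1.0 (crash) to 1.0 (lane center).

    lane_offset is the hypothetical distance in meters from the lane center.
    """
    if crashed:
        return -1.0
    # 0.0 at the center, 1.0 at the lane edge, larger once outside the lane.
    drift = abs(lane_offset) / lane_half_width
    # Reward falls linearly with drift and goes negative past the lane edge.
    return max(-1.0, 1.0 - drift)

print(score_driving(0.0))                # centered in the lane
print(score_driving(0.9))                # halfway to the lane edge
print(score_driving(2.7))                # outside the lane
print(score_driving(0.0, crashed=True))  # crash
```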

Speeding up the learning process is a constant effort, and there are ways of doing this with the data, software, or hardware. Many of the machine learning tools can run on GPUs, which accelerates training cycles because the calculations are distributed across many GPU cores.

Microsoft’s Cognitive Toolkit is capable of scaling across multiple machines that each have multiple GPUs. Microsoft has integrated techniques that allow these types of learning calculations to distribute well across a cluster of networked computers.

Microsoft, Google, Amazon, and Facebook have created services that are abstracted APIs to their already trained models. This exposes more commonplace machine learning models in a way that allows developers to evaluate data against them easily.


Even with GPUs, some models can take anywhere from days to months to train. The types of problems we are now attempting to solve with machine learning require an almost incomprehensible volume of data. Cloud computing offers some benefit, but we are nowhere close to what the human brain and more complex network architectures are capable of solving.

There’s also a limitation of talent. Large companies with a focus on AI have already hired much of the world’s best machine learning talent, and there aren’t yet enough people skilled in machine learning. The toolkits that exist are also relatively new to software engineers, and their adoption is in its infancy.

Those less familiar with machine learning may think it’s as simple as taking a neural network and pumping data through it. In reality, models themselves have varying architectures (combinations of layers) for solving different types of problems. The process of building a network architecture that works is one of trial and error. In a typical scenario, several networks are trained and tested before settling on one, with variations in network architecture, input/output data structure, learning rate, and momentum.
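
The trial-and-error loop over learning rate and momentum can be sketched as a small grid search. To keep the example self-contained, the "model" is a stand-in: `train_toy_model` is a hypothetical function that minimizes a simple quadratic with momentum-based gradient descent, and the candidate values are arbitrary. A real search would train and test an actual network for each combination, which is what makes the process so time consuming.

```python
from itertools import product

def train_toy_model(rate, momentum, steps=50):
    """Stand-in for real training: minimize (w - 3)^2 with momentum-based gradient descent."""
    w, velocity = 0.0, 0.0
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)
        # Momentum carries over part of the previous step into the current one.
        velocity = momentum * velocity - rate * gradient
        w += velocity
    return (w - 3.0) ** 2  # final error

def search(rates, momentums):
    """Try every combination and return (error, rate, momentum) for the best one."""
    results = [(train_toy_model(r, m), r, m) for r, m in product(rates, momentums)]
    return min(results)

best_error, best_rate, best_momentum = search([0.001, 0.01, 0.1], [0.0, 0.5, 0.9])
print(best_rate, best_momentum, best_error)
```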

This process is still challenging and can be very time consuming. Training has to be monitored, and potentially stopped if the model isn’t converging. Making these adjustments across multiple networks requires intuition about why each adjustment is being made. This type of experimentation just hasn’t become mainstream yet.

An AI Future

Google has created a processor it calls a TPU (Tensor Processing Unit). Google reported that its first-generation TPU ran machine learning inference 15 to 30 times faster than the contemporary GPUs and CPUs it was compared against. Google’s TensorFlow machine learning toolkit is capable of running its models on TPUs.

Microsoft has been using FPGAs (Field Programmable Gate Arrays) optimized for machine learning tasks. These differ from custom silicon because they can be reprogrammed on-the-fly. These FPGAs have been deployed to nearly every computer within the Azure data centers. The primary reason for deploying these to the data centers was for monitoring resource consumption, but an underlying motivation was accelerated machine learning.

Intel and Qualcomm have been putting custom machine learning cores into their processors, which brings enhanced machine learning capabilities to consumer devices like PCs and phones and to the common applications that run on them.

Combine the enhancements in hardware with the elasticity of the cloud, and training can be done cost-effectively at massive scale. The resulting trained models can then be supplied to field devices for use in cognitive applications.

Expect rapid improvements in the architecture and training algorithms of these artificial neural networks. While new machine learning concepts are being discovered regularly, almost all of the concepts we use today are actually decades old, some dating back to the 1950s. Previous research is also being actively revisited and integrated into modern architectures and engines.

Because training machine learning models on real-world data is expensive, simulations are being used to train networks for the real world. In the self-driving car example, the car might have to crash 100 times before it starts learning how to drive around corners.

What if we could train that self-driving car within a simulation of the real world? The images that normally come in from cameras and LIDAR are generated from 3D scenes within the simulator, much like a video game is rendered. The GPS, accelerometer, and other sensors are electronic devices that provide some signal to the computer driving the car, and these can all be synthesized within the simulator as well.
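
As a toy illustration of synthesizing a sensor, the sketch below generates noisy GPS-style readings from positions the simulator already knows exactly. The function name `simulate_gps`, the Gaussian noise model, and the 1.5 meter noise level are assumptions; a real simulator models its sensors in far more detail.

```python
import random

def simulate_gps(true_positions, noise_meters=1.5, seed=7):
    """Synthesize noisy GPS readings from the simulator's ground-truth positions."""
    rng = random.Random(seed)
    return [
        (x + rng.gauss(0.0, noise_meters), y + rng.gauss(0.0, noise_meters))
        for x, y in true_positions
    ]

# Ground truth comes for free inside a simulator: the car drives along a straight line.
truth = [(float(i), 0.0) for i in range(5)]
readings = simulate_gps(truth)

# Each training record pairs a synthetic sensor reading with the known true position.
training_records = list(zip(readings, truth))
print(training_records[0])
```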

This allows training data to be gathered much less expensively. The data may not be perfect when translated to the real world. But research shows that decent accuracy can be achieved within simulators, and then improved once in the real world.

Microsoft recently open-sourced an application called AirSim which allows researchers to experiment with computer vision completely within a simulated environment. Initially it appears to be targeting flying drones, but they claim it will soon be able to supply scenarios for more types of autonomous vehicles.

We can also architect other solutions so that they collect data in a way that is already optimized for machine learning research or model training. The more data available for training models, the better. So it makes sense to be capturing that data in existing solutions with the anticipation of it having value for training in the near future.

With machine learning, we’re currently just scratching the surface. Simple forms of it are already affecting our daily lives in almost unnoticeable ways. Soon we will begin to see disruption across many industries, and it’s a critical time for businesses to get involved to ensure they’re at the forefront of this disruption.

Author Jeff Scherrer
