Machine learning is one of the most popular topics online. More and more articles, discussions, and examples of ML applied to real-world tasks appear in the media. Among these, the feasibility of running machine learning models in mobile applications gets particular attention.
As a junior mobile developer, I wanted to get involved and figure out how neural nets are built and how they work. Our approach at Dashdevs is to learn by doing, so this article is about my experiments and findings.
I decided to start building my skills by implementing real-time object detection in a mobile application.
For now, the main obstacle to using ML models in applications is the limited performance of mobile devices. These limitations are especially challenging for deep learning models.
Training a model in advance and then integrating the pre-trained model into the mobile application is the way to overcome these capacity restrictions.
Such pre-trained models should fit several requirements:
- The model should have a limited size (because it is loaded into RAM and occupies a large share of the GPU and/or CPU computing resources).
- The model should process a large amount of data within a reasonable timeframe and without draining the device's battery or overheating the device.
Since I am an iOS developer, the easiest way for me to implement an app with object detection would be to use Core ML 2, Apple's machine learning framework. It helps to integrate machine learning functionality, such as computer vision, into a mobile application via the Apple Vision API. Moreover, we are not tied to one particular type of machine learning model: we can use our own or third-party models.
Such a model can be created and trained with Create ML (Apple's GPU-accelerated framework, used from Swift, for training native machine learning models on a Mac), which immediately produces a model with the .mlmodel file extension. Alternatively, a model created and trained with third-party frameworks (Facebook Caffe and Caffe2, Keras, scikit-learn, XGBoost, LibSVM, Google TensorFlow Lite) can be converted to this format. Core ML 2 has a converter for such models and can update them from cloud-based services such as Amazon Web Services (AWS) or Microsoft Azure.
I like the last approach most, and I especially prefer using TensorFlow Lite (ML Kit framework) for creating and pre-training models.
Unlike Core ML 2, TensorFlow Lite is a cross-platform framework and supports Android. It means that the chosen template or custom created pre-trained model can be integrated into an application for both platforms. It is vital because usually Android and iOS mobile apps are developed for the customer in parallel.
TensorFlow Models Compatibility
An additional advantage of TensorFlow is its large library of free and open-source templates of machine learning models and APIs.
Compared to Core ML, which is a rather opaque framework, TensorFlow does not hide information about the models used.
The choice of the neural network model
Object detection is a computer vision technique for locating selected objects in digital images and videos. It is more complicated than object classification. Classification only identifies the main object or objects of an image, while detection can find several objects, classify them, and determine where they are in the image.
An object detection model identifies object boundaries and frames every object it finds. It also determines the likely classification variants for each object.
Thus, an object detection model combines classification with localization.
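To make the "class plus location" idea concrete, here is a minimal Python sketch of what a detector's output could look like and how an app might filter it. The `Detection` type, the 0.5 score threshold, the normalized (ymin, xmin, ymax, xmax) box layout, and the sample values are all illustrative assumptions, not the output format of any particular model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object: a bounding box plus its best class guess (hypothetical type)."""
    label: str
    score: float                             # confidence in [0, 1]
    box: Tuple[float, float, float, float]   # (ymin, xmin, ymax, xmax), normalized to [0, 1]


def filter_detections(detections: List[Detection], min_score: float = 0.5) -> List[Detection]:
    """Keep only confident detections, most confident first."""
    kept = [d for d in detections if d.score >= min_score]
    return sorted(kept, key=lambda d: d.score, reverse=True)


def to_pixels(box, width: int, height: int):
    """Convert a normalized box to pixel coordinates (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = box
    return (round(xmin * width), round(ymin * height),
            round(xmax * width), round(ymax * height))


# Made-up sample output for a single 640x480 frame.
raw = [
    Detection("dog", 0.91, (0.10, 0.20, 0.80, 0.60)),
    Detection("cat", 0.32, (0.05, 0.55, 0.40, 0.95)),
    Detection("person", 0.77, (0.00, 0.50, 1.00, 1.00)),
]
confident = filter_detections(raw)
for d in confident:
    print(d.label, d.score, to_pixels(d.box, width=640, height=480))
```

A classifier would return only the labels and scores; the box is what a detector adds for each object it finds.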
I have researched the topic a lot and found that the choice of a deep learning algorithm mainly depends on the structure of the data available for analysis. Below are several principles that I would recommend following when choosing a learning algorithm.
- If images or similar topological structures are the base of the input dataset, then using a convolutional neural network (CNN) as a model is a good choice.
- If the input is a fixed-size vector, then feed-forward neural networks (FFNN), such as the multilayer perceptron, are used.
- If the input is a sequence, such as a series of images, then a recurrent neural network (RNN) or a recursive neural network (RvNN) is a better fit.
For my project, I chose a convolutional neural network as the classifier.
Selecting an architecture
Convolutional neural networks can be designed to be lightweight and compact. Such designs significantly reduce both the size of the model and the processing time per cycle, while keeping recognition accuracy at an acceptable level and mitigating the negative effects of unbalanced data on recognition results.
Selection of a Model
Among the variety of architectures of such networks, I would like to draw your attention to the Xception architecture. Using an Inception-like architecture of this kind instead of a very deep one may help to reduce the number of parameters in your convolutional network.
As you can see from the table, Xception exceeds the accuracy of ResNet50 but is slightly inferior to InceptionResNetV2. At the same time, it wins the size comparison and therefore requires fewer resources for training and using the model.
MobileNet versions V1 and V2 are more advanced developments of the architecture described above.
As the architecture of the convolutional neural network, I chose MobileNetV2. It works effectively on mobile devices as a base image classifier.
MobileNetV2 has the following structure of the main block.
Main Block Structure
There are three convolution layers in the block:
- expansion layer, a 1×1 convolution that increases the number of channels at the entrance of the block;
- depthwise convolution, which filters the input data;
- 1×1 projection layer, a pointwise convolution that combines the filtered values to create new features.
In MobileNetV1, the pointwise convolution either keeps the number of channels the same or doubles it. MobileNetV2 does the opposite: it reduces the number of channels. That is why this layer is known as a projection layer: it takes data with a large number of dimensions (channels) and outputs a tensor with a reduced number of channels. This layer type is also called a bottleneck since it reduces the amount of data that passes through the network.
Pointwise and depthwise convolution layers can be combined into reusable blocks to achieve better performance compared to standard convolution layers.
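The saving from splitting a convolution into depthwise and pointwise parts can be checked with simple arithmetic. The sketch below counts weights only (biases and batch normalization are ignored), and the layer dimensions are an arbitrary example rather than values taken from MobileNet.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 128 input channels, 256 output channels.
std = standard_conv_params(3, 128, 256)
sep = depthwise_separable_params(3, 128, 256)
print(std, sep, round(std / sep, 1))  # 294912 33920 8.7
```

For this example layer, the separable version needs roughly one ninth of the weights, which is where most of the size and speed advantage comes from.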
The expansion layer is a 1×1 convolution too. Its purpose is to expand the number of data channels before they go into the depthwise convolution. Consequently, on the expansion layer, the number of output channels always exceeds the number of input channels, just the opposite of the projection layer. The degree of data expansion is determined by the expansion factor. It is one of the crucial hyperparameters for experimenting with architectural trade-offs. The default expansion factor is 6.
Each layer is followed by batch normalization and the ReLU6 activation function. The exception is the projection layer: no activation function is applied to its output.
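As a rough illustration of the block described above, the following sketch traces the channel count and the number of convolution weights through one block with the default expansion factor of 6. The function names and the 24-channel example are assumptions made for illustration; batch normalization parameters and biases are not counted.

```python
def relu6(x: float) -> float:
    """ReLU6 activation: clamp the value to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def bottleneck_block(c_in: int, c_out: int, expansion: int = 6, k: int = 3):
    """Return (layer description, output channels, weight count) for each layer."""
    c_mid = c_in * expansion  # channels after the expansion layer
    return [
        ("expansion 1x1 + BN + ReLU6", c_mid, c_in * c_mid),
        ("depthwise 3x3 + BN + ReLU6", c_mid, k * k * c_mid),
        ("projection 1x1 + BN, no activation", c_out, c_mid * c_out),
    ]


# Hypothetical block with 24 input and 24 output channels.
layers = bottleneck_block(c_in=24, c_out=24)
for name, channels, weights in layers:
    print(f"{name:36s} channels={channels:4d} weights={weights}")
print("total weights:", sum(w for _, _, w in layers))
```

The tensor is wide only inside the block (144 channels here) and narrow between blocks (24 channels), which is why the projection layer is called a bottleneck.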
The complete architecture of MobileNetV2 consists of 17 such blocks in a row, with the number of channels growing gradually from block to block. The very first block is slightly different from the others: it uses an ordinary 3×3 convolution with 32 channels instead of the expansion layer. The blocks are followed by a regular 1×1 convolution layer, a global average pooling layer, and a classification layer.
Experiments with detector
My choice of a detector was based on the recommendation that single-shot (one-pass) object detectors are better suited for mobile devices.
A single-shot object detector predicts all object boundaries in a single pass through the neural network. The most popular examples of such detectors are YOLO, SSD, SqueezeDet, and DetectNet.
As an object detector, I selected SSDLite, a modification of the single-shot SSD detector, for the following reasons.
- SSDLite does not depend on the type of the base network and thus can work with different models, including MobileNet.
- SSDLite can deliver real-time results (30 FPS and up).
- SSDLite uses depthwise separable layers instead of ordinary convolutions in the object detection part of the network.
Thus, our object detection model combines the architectures of two convolutional neural networks: MobileNetV2 and SSDLite, a modification of the single-shot SSD detector.
Convolutional Neural Network Architecture
Working with the object detector, I used MobileNet as a feature extractor for the second neural network.
Configuring the neural network
Classification accuracy and speed depend on the number of model parameters. Reducing the number of parameters results in a decrease in the model size and processing time. However, it also affects the accuracy of the classification.
Two hyperparameters of the MobileNet architecture determine the size of the network: α is the width multiplier, and ρ is the resolution multiplier. The width multiplier controls the number of channels in each layer, and the resolution multiplier controls the spatial dimensions of the input tensors.
The width factor α denotes how much the number of channels is reduced. If the width factor is 1, the network starts with 32 channels and ends with 1,024.
The factor ρ denotes the reduction of the size of the input image. The default input size is 224×224 pixels.
Both factors can be used to scale the network down: by reducing them, we increase the speed and reduce memory consumption, at the cost of some recognition accuracy.
To find a balance between the accuracy of the classification and the size of the classifier, the MobileNetV2 network was set with the following parameters: mobilenet_v2_0.75_224 (value 224 corresponds to ρ = 1).
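The effect of the two multipliers can be sketched with a little arithmetic. In the snippet below, rounding channel counts to a multiple of 8 is an assumption borrowed from common MobileNet implementations, and the α²·ρ² cost estimate is only a rough approximation of how compute scales; neither is taken from this project's code.

```python
def scale_channels(base: int, alpha: float, divisor: int = 8) -> int:
    """Scale a channel count by alpha, rounded to a multiple of `divisor` (assumed rule)."""
    scaled = base * alpha
    rounded = max(divisor, int(scaled + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * scaled:  # do not shrink by more than 10% through rounding
        rounded += divisor
    return rounded


def input_size(rho: float, base: int = 224) -> int:
    """Input resolution (one side, in pixels) for a given resolution multiplier."""
    return round(base * rho)


def relative_cost(alpha: float, rho: float) -> float:
    """Approximate compute cost relative to alpha = 1, rho = 1."""
    return alpha ** 2 * rho ** 2


# Settings matching mobilenet_v2_0.75_224: alpha = 0.75, rho = 1.
alpha, rho = 0.75, 1.0
print(scale_channels(32, alpha), scale_channels(1024, alpha))  # 24 768
print(input_size(rho))                                         # 224
print(relative_cost(alpha, rho))                               # 0.5625
```

Under these assumptions, a width factor of 0.75 keeps the full 224×224 input but cuts the estimated compute to a little over half of the full-width network.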
The number of parameters for MobileNetV2 + SSDLite, in this case, was 4.3 M.
Ways of model size reduction
To reduce the size of the pre-trained model even further, you can freeze the model, optimize it for inference, and perform post-training quantization.
Model Size Reduction
TensorFlow Lite relies on the TensorFlow library for training a model.
TensorFlow generates two files. The .pb file is a computational graph serialized with Protobuf. This file does not contain variables; they are stored in the .ckpt file.
The process of converting variables into constants (fixed values) is called freezing. Frozen graphs (models) are created with the help of freeze_graph.py, which also simplifies the computational graph.
With the optimize_for_inference option, you can remove operations that are unnecessary for inference from the graph.
Post-training quantization is a common technique that reduces the size of the model and cuts latency by up to three times with a slight decrease in the accuracy of the model.
Post-training quantization, which converts floating-point weights to 8-bit precision, is included as an option in the TensorFlow Lite model converter.
The quantized model is almost three times smaller than the original model.
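The idea behind 8-bit quantization can be shown with a few lines of plain Python. This is a simplified affine (scale plus offset) scheme written for illustration; it is not the exact algorithm used by the TensorFlow Lite converter, and the weight values are made up.

```python
def quantize(weights, num_bits: int = 8):
    """Map floats to unsigned integers in [0, 2**num_bits - 1]."""
    qmax = 2 ** num_bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0  # fall back to 1.0 if all weights are equal
    q = [round((w - w_min) / scale) for w in weights]
    return q, scale, w_min


def dequantize(q, scale: float, w_min: float):
    """Recover approximate float values from the quantized integers."""
    return [v * scale + w_min for v in q]


# Made-up weights; a real layer would hold thousands of values.
weights = [-1.0, -0.25, 0.1, 0.5, 1.0]
q, scale, w_min = quantize(weights)
restored = dequantize(q, scale, w_min)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print("largest rounding error:", max_error)
print("bytes per weight: 4 (float32) -> 1 (8-bit)")
```

Each weight now takes one byte instead of four, and the rounding error stays within half a quantization step. The whole model file presumably shrinks by less than four times because not everything in it is weight data that gets quantized.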
The choice of a dataset
The quantized model ssdlite_mobilenet_v2_, pre-trained on the Common Objects in Context (MSCOCO) dataset, was downloaded from the TensorFlow library.
The MSCOCO dataset is vast: the 2017 train set contains about 118,000 images (18 GB), and the 2017 test set about 41,000 images (6 GB). The number of classes is almost ninety. Since I was not going to train the model on my custom data, I was quite satisfied with such a pre-trained model.
Reasons to use a sample model from a library
If the architecture of the sample model had not worked for me, or if I had had to train the existing model, or any other model that solves a similar problem, on my custom dataset, I would have had to spend much more time and proceed as follows.
I immediately rejected the idea of creating new neural network architectures for solving academic tasks because, in my opinion, the average mobile developer does not need this for real projects. However, if you are eager to accept the challenge, know a little Python, and have a lot of free time, you are welcome to try. Still, such work may be of little value: the TensorFlow-Slim Image Classification Library has a number of open-source models pre-trained on the ImageNet dataset. CNN models such as InceptionV3 or MobileNets (the library offers 16 variations of MobileNet_v1 with different settings) are, I suppose, a good set in which to find a ready solution for many problems. In the TensorFlow library, which is constantly updated, you can find next-generation models pre-trained on different datasets. InceptionV3 demonstrates better accuracy than MobileNets, but it is also larger in size.
The selected model can be retrained on your own custom dataset if needed. Or you may train it on the ImageNet set (1000 classes of objects). The creation of your own dataset is a topic worth a separate article. ;)
Then the model needs to be converted from the standard TensorFlow format to TensorFlow Lite and integrated into iOS or Android applications using the API.
After this hypothetical digression, I will return to the description of my experiment on the implementation of neural networks in mobile applications.
The borrowed pre-trained model was integrated into mobile applications that detect objects on the screen of the device. The quality of its work was tested on different devices by detecting various moving and stationary objects.
Here are photos and videos of some moments of the experiment.
Sample Screenshot with Predictions
I would like to mention that I measured the temperature of all devices in the processor area with an infrared thermometer. During the first minute of the application's work, the temperature increased by approximately 10 °C and then remained constant for the next five minutes of the experiment.
As a result, I have a pre-trained model that can be integrated into a mobile application. The model is capable of detecting objects on a device screen, while the running app does not overheat the device or overload it.
Although it depends on the viewing angle, distance to an object, and lighting conditions, the detection accuracy of my machine learning model is in line with what is typical for such models.
My experiments with machine learning and neural networks were intended for my personal upskilling as a part of the sci-tech crew of <em>Dashdevs</em>. Our team implements machine learning into digital business products for our clients. <em>Clipshot</em> is an example of how we combine machine learning, science, and math in mobile applications.