<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Apache MXNet - Medium]]></title>
        <description><![CDATA[Apache MXNet (incubating) is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity. - Medium]]></description>
        <link>https://medium.com/apache-mxnet?source=rss----d1802b798def---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Apache MXNet - Medium</title>
            <link>https://medium.com/apache-mxnet?source=rss----d1802b798def---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 18 Apr 2026 16:51:18 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/apache-mxnet" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Start fitting CV models like Scikit-learn with GluonCV 0.10]]></title>
            <link>https://medium.com/apache-mxnet/start-fitting-cv-models-like-scikit-learn-with-gluoncv-0-10-931ff910a38?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/931ff910a38</guid>
            <category><![CDATA[mxnet]]></category>
            <category><![CDATA[gluoncv]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[automl]]></category>
            <category><![CDATA[pytorch]]></category>
            <dc:creator><![CDATA[Joshua Z. Zhang]]></dc:creator>
            <pubDate>Fri, 26 Mar 2021 04:06:43 GMT</pubDate>
            <atom:updated>2021-03-26T04:06:43.431Z</atom:updated>
            <content:encoded><![CDATA[<h3>Start optimizing CV models in less than 10 lines of code with GluonCV 0.10</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*au87fuYmm7KJo6BSGOo4uw.png" /><figcaption>Train computer vision models with HPO in less than 10 lines of code!</figcaption></figure><p>For a long time, <a href="https://cv.gluon.ai">GluonCV</a> has focused on providing diverse models for various computer vision tasks, from fundamental image classification and object detection to depth estimation and video action recognition. <br>Now it’s time for the long-overdue high-level training/inference API to be available to all users! Are you tired of digging through implementation details in order to compose the right model, e.g., neural network architectures, training loops, metrics, and losses? The new auto estimators and task handlers might be the right tools for you, and new features such as hyper-parameter optimization (HPO) help you reach better models without extensive manual exploration.</p><h3>GluonCV Auto Estimators</h3><p>GluonCV’s Auto Estimators are new in version 0.10. They are similar to scikit-learn estimators and allow you to initiate training in fewer than three lines of code. They also offer a friendly customization interface:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*f3Pm8NK8taWfOkOyxM_Pzg.png" /><figcaption>A single line fit function to train an image classifier from web datasets</figcaption></figure><p>The new auto estimators pack simple predict functions which support raw image paths, data frames, and GluonCV datasets. They also allow you to save and load trained estimators without knowing the underlying model implementation. In the past, for example, you needed to know the corresponding pre-processing and post-processing functions for a particular model, otherwise it would produce wrong outputs. 
We recognize that these error-prone components are tricky and should be avoided. With the new estimator API, you can get rid of unnecessary code snippets and prototype a working solution faster.</p><h3>GluonCV Auto Tasks</h3><p>On top of Auto Estimators, this release also includes Auto Tasks for hyper-parameter optimization (HPO). Compared with the pure algorithm-based estimators, the auto tasks provide identical APIs and functionality but allow you to fit with HPO under a specified num_trials and time_limit. For object detection, they allow multiple algorithms (such as SSDEstimator and FasterRCNNEstimator) to be tuned as a categorical search space. For example, a complete HPO task can be as simple as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1MqF3amRU_ERJ2crXvbDrQ.png" /><figcaption>An intuitive API to search for the best configuration in a specified search space</figcaption></figure><p>Note that the search space configuration (`search_args`) can either be a dictionary as shown in the above example, or a YAML configuration file (.yml, .yaml). With the latter, you can create a code-free tuning job by editing the config file alone.</p><h3>Summary</h3><p><a href="https://github.com/dmlc/gluon-cv/releases/tag/v0.10.0">GluonCV 0.10</a> offers two new modules: GluonCV Auto Estimators and GluonCV Auto Tasks. They allow you to bootstrap computer vision tasks in a few lines of code while still exposing a powerful and customizable training pipeline, supporting hundreds of hyperparameters via YAML config files. Most importantly, the high-level API is deep learning framework agnostic, which allows you to easily extend it with new models and tasks implemented in MXNet or PyTorch. 
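To make the num_trials idea concrete, here is a framework-free sketch of how an HPO driver can sample trial configurations from a mixed categorical/continuous search space. The names and the plain random-search strategy are illustrative assumptions for this post, not the actual GluonCV implementation:

```python
import random

# Illustrative only: these names are NOT the gluoncv API.
# A search space mixes categorical choices (e.g. which estimator)
# with continuous ranges (e.g. the learning rate).
search_space = {
    'estimator': ['ssd', 'faster_rcnn'],  # categorical
    'lr': (1e-4, 1e-1),                   # continuous (low, high)
}

def sample_trial(space, rng):
    """Draw one configuration from the search space."""
    cfg = {}
    for name, spec in space.items():
        if isinstance(spec, tuple):
            cfg[name] = rng.uniform(*spec)   # continuous parameter
        else:
            cfg[name] = rng.choice(spec)     # categorical parameter
    return cfg

def random_search(space, score_fn, num_trials, seed=0):
    """Evaluate num_trials sampled configs and keep the best one."""
    rng = random.Random(seed)
    trials = [sample_trial(space, rng) for _ in range(num_trials)]
    return max(trials, key=score_fn)

# toy scoring function standing in for a real validation metric
best = random_search(search_space, lambda c: -abs(c['lr'] - 0.01),
                     num_trials=25)
```

In the real auto tasks, each sampled configuration launches a full training run and the score is a validation metric; time_limit simply caps how long that trial loop may run.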
To get started, try these <a href="https://cv.gluon.ai/build/examples_auto_module/index.html">tutorials</a> first!</p><h3>Acknowledgment</h3><p>Special thanks to @<a href="https://github.com/dmlc/gluon-cv/commits?author=tmwangcas">tmwangcas</a>, <a href="https://github.com/karan6181">@karan6181</a>, @<a href="https://github.com/bryanyzhu">bryanyzhu</a>, @<a href="https://github.com/dmlc/gluon-cv/commits?author=Zarand3r">Zarand3r</a>, @<a href="https://github.com/dmlc/gluon-cv/commits?author=ChaiBapchya">ChaiBapchya</a>, @<a href="https://github.com/dmlc/gluon-cv/commits?author=domin1985">domin1985</a>, @<a href="https://github.com/yinweisu">yinweisu</a>, @<a href="https://github.com/mzolfaghari">mzolfaghari</a>, @<a href="https://github.com/dmlc/gluon-cv/commits?author=StevenJokess">StevenJokess</a> for contributing to this release.</p><p>Please Like/Star/Fork/Comment/Contribute if you like GluonCV, and don’t forget to check out our latest content. Stay tuned!</p><ul><li><a href="https://gluon-cv.mxnet.io/index.html">GluonCV Website</a></li><li><a href="https://github.com/dmlc/gluon-cv">GluonCV Github</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=931ff910a38" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/start-fitting-cv-models-like-scikit-learn-with-gluoncv-0-10-931ff910a38">Start fitting CV models like Scikit-learn with GluonCV 0.10</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Optimizing inference on CPU in MXNet 2.0]]></title>
            <link>https://medium.com/apache-mxnet/optimizing-inference-on-cpu-in-mxnet-2-0-1852ff9729b4?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/1852ff9729b4</guid>
            <category><![CDATA[mxnet]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[quantization]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[intel]]></category>
            <dc:creator><![CDATA[Adam Grygielski]]></dc:creator>
            <pubDate>Mon, 22 Feb 2021 19:20:32 GMT</pubDate>
            <atom:updated>2021-02-22T23:55:40.205Z</atom:updated>
            <content:encoded><![CDATA[<h3>Optimizing inference on CPU in the Upcoming Apache MXNet 2.0</h3><p><em>Authors: </em><a href="https://github.com/grygielski"><em>Adam Grygielski</em></a><em>, </em><a href="https://github.com/bgawrych"><em>Bartłomiej Gawrych</em></a><em>, </em><a href="https://github.com/sfraczek"><em>Sylwester Fraczek</em></a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*W8xA-XS2fwuTVqoK0FYN6Q.jpeg" /></figure><h3>Introduction</h3><p>Deep learning inference is the process of deploying a trained neural network to perform prediction on unseen data. It is a commonly deployed workload on cloud servers. To provide a good user experience it has to be fast, so it is important to use optimized solutions; optimization also reduces hardware load and energy cost.</p><p>There are two types of performance bottlenecks to consider when optimizing a neural network: heavy compute-bound operations like convolution or fully-connected layers, and many small memory-bound elementwise operations such as ReLU or transposition.</p><p>A couple of methods have been invented for optimizing neural networks. Operator fusion chains operations together, speeding up memory-bound operations by reducing memory I/O. Quantization speeds up compute-bound operations by lowering precision, thereby simplifying computation while also reducing the amount of data being processed.</p><p>The upcoming version 2.0 of Apache MXNet (incubating) introduces some changes in the interface. The Gluon API has become the default, superseding the Symbol and Model APIs; it unifies the flexibility of imperative programming with the performance benefits of symbolic programming. MXNet 2.0 also fully supports NumPy semantics.</p><h3>Operator Fusion</h3><p>In previous versions of Apache MXNet, operator fusion was enabled by default when MXNet was built with the Intel oneDNN library. 
However, in MXNet 2.0 the Module API and GraphExecutor were replaced by the CachedOp executor, and Gluon’s Block is now the first-choice API to define and execute a model, instead of the symbolic API.</p><p>Let’s recap what operator fusion is and what benefits come along with it. Before we move on to that, it is worth mentioning that MXNet supports two execution modes:</p><ul><li>imperative mode — the model is executed step-by-step as defined, and we can access each tensor between operator executions</li><li>symbolic mode — based on the defined model, the MXNet engine creates a graph which can be optimized, e.g. by pre-allocating memory; we can’t access tensors between operator executions</li></ul><p>Imperative mode is great for debugging, while symbolic mode provides a big performance boost. You can find more information about the imperative and symbolic modes <a href="https://classic.d2l.ai/chapter_computational-performance/hybridize.html">here</a>.</p><p>Every model can be represented as a directed graph where operators and tensors are represented as nodes. Each connection between operators, where the output of one operator is the input of another, is represented in the graph as an edge.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*03DOI_29XeAIgv2f36lUCQ.png" /></figure><p>Operator fusion is nothing more than replacing two or more consecutive operators with a single new operator that combines the functions of all of them. As a result, we get a smaller graph — only one kernel is invoked instead of several (one per operator). These fused kernels are optimized to run as fast as possible by utilizing all available features of a modern CPU. 
We also get rid of MXNet engine overhead between operator calls and reduce memory access (read/write) operations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2txh-y_TauT50PgxJygKag.png" /></figure><p><strong>How do I fuse operators in my model?</strong><br>To fuse a model in MXNet 2.0 there are two requirements:</p><ul><li>the model must be defined as a subclass of HybridBlock or Symbol</li><li>the model must contain one of the specific operator patterns which can be fused</li></ul><p>At the time of publishing this article, MXNet supports three major fusion patterns:</p><ul><li>Convolution + BatchNorm + ReLU/GELU/LeakyReLU/sigmoid/SoftReLU/tanh + Elementwise Add</li><li>FullyConnected + Elementwise Op (ReLU, Square, Logistic, SoftReLu, BoundedRelu, Sqrt, Exp, Abs)</li><li>BatchNorm + ReLU</li></ul><p>As an example, we define a small network (a sample block from the ResNet architecture):</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d17e0c84759af266b197f47298d06444/href">https://medium.com/media/d17e0c84759af266b197f47298d06444/href</a></iframe><p>Both the HybridBlock and Symbol classes provide an API to easily run operator fusion. All we have to do is add a single line of code enabling the fusion passes on our model:</p><pre>net.optimize_for(data, backend=&#39;MKLDNN&#39;)</pre><p>To apply the passes on a Symbol, we call the <strong>optimize_for</strong> function on a symbol instance without passing input data. Notice that Symbol’s <strong>optimize_for</strong> does not work in-place, so we must assign the result to a new variable:</p><pre>optimized_symbol = sym.optimize_for(backend=&#39;MKLDNN&#39;)</pre><p>For the above model definition, in a naive benchmark with artificial data we gain up to a 1.25x speedup without any accuracy loss on our testing machine. 
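To see why the Convolution + BatchNorm pattern is so profitable, note that at inference time BatchNorm is just an affine transform, so its parameters can be folded directly into the convolution weights and bias. Here is a minimal NumPy sketch of that idea (using a 1x1 convolution, which reduces to a matrix multiply, rather than oneDNN's actual fused kernels):

```python
import numpy as np

def conv1x1(x, w, b):
    # a 1x1 convolution over (N, C_in) activations is just a matmul
    return x @ w.T + b

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # fold BatchNorm's scale/shift into the conv weights and bias
    scale = gamma / np.sqrt(var + eps)
    return w * scale[:, None], (b - mean) * scale + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w, b = rng.normal(size=(2, 3)), rng.normal(size=2)
gamma, beta = rng.normal(size=2), rng.normal(size=2)
mean, var = rng.normal(size=2), rng.uniform(0.5, 2.0, size=2)

# two-operator reference: convolution followed by BatchNorm
y_ref = gamma * (conv1x1(x, w, b) - mean) / np.sqrt(var + 1e-5) + beta
# single fused operator: same result, one kernel, one memory pass
w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)
y_fused = conv1x1(x, w_f, b_f)
```

The two outputs agree to floating-point precision, which is why the fused graph can drop the BatchNorm node entirely.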
Detailed data on other models’ speedups is shown in the “Performance and accuracy results” part of the article.</p><h3>Quantization</h3><p>Similar to the 1.x versions, MXNet 2.0 continues to support model quantization from float32 to either signed or unsigned INT8 (s8/u8). Using a quantized model for inference on modern CPUs can greatly increase the performance of your workloads. Quantized models utilize the VNNI (Vector Neural Network Instructions) instruction set [2] to speed up execution of compute-heavy operations like convolution or a dot product. Moreover, using the int8 data type reduces the amount of data read in memory-bound operations like pooling or elementwise functions. MXNet uses optimized kernels delivered by the <a href="https://github.com/oneapi-src/oneDNN">Intel® oneDNN</a> [3] library to speed up model execution.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*umGfUUazPg1Dz0HgtHVLUA.png" /></figure><p>During the quantization procedure, a floating-point model is first fused using the fusion patterns mentioned in the previous chapter. After that, operators that have int8 kernels are marked as quantized and surrounded by quantize/requantize/dequantize operators. Finally, the model can either be calibrated to get rid of the requantize operators or left as it is to calculate scales at runtime. There are two major ways of quantizing an fp32 model:</p><ul><li><strong>without calibration</strong>. All we have to do is call the <strong>quantize_net </strong>function with an fp32 Gluon model and a list of input data shapes. However, this approach is not recommended in terms of performance: it results in requantize nodes in the graph that calculate min/max values during each forward pass. 
Calibrating a model before deploying it results in much faster inference.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pObYh5K4SWzo58F0XIvZeg.png" /></figure><ul><li><strong>with calibration</strong>. In this approach, after quantizing the graph, the model is run on user-supplied calibration data to collect statistics of the quantized layers and set min/max values as parameters. The choice of these thresholds is based on the chosen calibration method.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cnEgomIqdjvlvz9T5EYLmg.png" /></figure><p>Currently, three calibration methods are supported:</p><ul><li><strong>naive </strong>— min/max values from the calibration run.</li><li><strong>entropy </strong>— uses KL divergence to determine the best symmetrical quantization thresholds for a given histogram of values.</li><li><strong>custom </strong>— uses a user-defined CalibrationCollector to control the calibration process.</li></ul><p>In MXNet 2.0, the quantization procedure has been adjusted to work well with Gluon models, since Gluon is now the main API. The goal was to allow the user to quantize an fp32 HybridBlock model in just a few lines of code.</p><h4>Quantization flow in MXNet 2.0</h4><p>As an example of the quantization procedure, we will use a pretrained <strong>resnet50_v1 </strong>from <strong>model_zoo.vision</strong>. 
To get it, we simply have to run the following code:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a3d86475380175353c3a849648fbaa6d/href">https://medium.com/media/a3d86475380175353c3a849648fbaa6d/href</a></iframe><p>To compare performance, we will use a simple function calculating total inference time of the model on artificial data:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/edddf3cd87726f2e6205e5d840e21136/href">https://medium.com/media/edddf3cd87726f2e6205e5d840e21136/href</a></iframe><p>Now, to get a quantized model, all we need to do is call the <strong>quantize_net</strong> function from <strong>contrib.quantize</strong> and provide a list of input data shapes:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/49087ee38f8ed55c7fc940f7c1ecf17d/href">https://medium.com/media/49087ee38f8ed55c7fc940f7c1ecf17d/href</a></iframe><p>This way, we have a quantized `resnet50_v1` ready to run inference tasks. However, since we didn’t use any calibration data, we won’t get a satisfactory performance boost.<br>We can compare these two models with our <strong>benchmark_net </strong>function to calculate the total speedup. To get better results, we should first hybridize both models with the <strong>static_shape </strong>and <strong>static_alloc </strong>flags set to <strong>True</strong>. This lets MXNet know that we won’t be changing shapes in the model at runtime, so it can pre-allocate memory and avoid runtime allocations. To properly evaluate the performance benefits of quantization, we should compare against an fp32 model with the <strong>MKLDNN </strong>backend enabled. 
This is because the quantization procedure fuses the graph before quantizing it; otherwise we would be measuring the combined benefit of fusion and quantization.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/425f36346e2864b6e955e0e1c9cd542c/href">https://medium.com/media/425f36346e2864b6e955e0e1c9cd542c/href</a></iframe><p>Output:</p><pre>&gt; Speedup: 0.72x</pre><p>As we can see, we didn’t get any performance benefit from using int8. It turns out that calculating min/max values at runtime adds a big overhead. Now let’s try calibrating the model before actual execution. The only difference in the code is that we have to provide <strong>quantize_net </strong>with calibration data stored in a <strong>DataLoader</strong>. For this example we will use the same <strong>dummy_data</strong>, but in a real use case it would most probably be a small portion of the validation dataset. Notice that we no longer have to provide the <strong>data_shape </strong>attribute because it will be taken from <strong>calib_data</strong>. If we don’t specify the <strong>calib_batches </strong>parameter, the whole <strong>calib_data</strong> will be used.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b7c7712b0b918dc7b7358a734bdeca71/href">https://medium.com/media/b7c7712b0b918dc7b7358a734bdeca71/href</a></iframe><p>Output:</p><pre>&gt; Speedup: 3.8x</pre><p>This time we got a significant performance boost with just a few lines of code.</p><p>The other aspect of lowering the precision of a model is the difference in its accuracy. We will check that on the previously tested <strong>resnet50_v1 </strong>with the ImageNet dataset. To run this example you will need the ImageNet dataset prepared with <a href="https://cv.gluon.ai/build/examples_datasets/imagenet.html">this tutorial</a> and stored in <strong>path_to_imagenet</strong>. 
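The arithmetic behind the naive calibration mode can be sketched in a few lines of NumPy: take the global min/max over the calibration batches, derive an affine scale, and map float32 values to 8 bits. This is an illustration of the idea only, not MXNet's actual int8 kernels:

```python
import numpy as np

def naive_calibrate(batches):
    # 'naive' calibration: global min/max across the calibration run
    lo = min(float(b.min()) for b in batches)
    hi = max(float(b.max()) for b in batches)
    return lo, hi

def quantize_u8(x, lo, hi):
    # affine quantization of float32 values in [lo, hi] to uint8
    scale = 255.0 / (hi - lo)
    q = np.clip(np.round((x - lo) * scale), 0, 255).astype(np.uint8)
    return q, scale

def dequantize_u8(q, lo, scale):
    return q.astype(np.float32) / scale + lo

rng = np.random.default_rng(0)
calib = [rng.uniform(-1.0, 1.0, size=(8, 16)).astype(np.float32)
         for _ in range(10)]
lo, hi = naive_calibrate(calib)

# round-trip error on calibrated data stays within half a
# quantization step
q, scale = quantize_u8(calib[0], lo, hi)
err = float(np.abs(calib[0] - dequantize_u8(q, lo, scale)).max())
```

Calibration fixes lo/hi (and hence the scale) ahead of time, which is exactly what lets the runtime drop the per-batch min/max bookkeeping that made the uncalibrated model slow.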
Let’s compare the top-1 and top-5 accuracy of the standard fp32 model with the quantized int8 model calibrated using the <strong>naive </strong>and <strong>entropy </strong>calibration modes. We will use only 10 batches of the validation dataset to calibrate the quantized model.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/11275649828c1ef5d868d8c3f89b367d/href">https://medium.com/media/11275649828c1ef5d868d8c3f89b367d/href</a></iframe><p>Output:</p><pre>&gt; FP32 Top1 Accuracy: 0.76364 Top5 Accuracy: 0.93094<br>&gt; INT8Naive Top1 Accuracy: 0.76028 Top5 Accuracy: 0.92796<br>&gt; INT8Entropy Top1 Accuracy: 0.76404 Top5 Accuracy: 0.93042</pre><p>We can see that we get almost the same accuracy for the quantized model, but we can run it with much less computing power and lower latency. The difference between calibration methods depends on the model itself, the activation layers used, and the size of the calibration data, so the method should be chosen empirically. However, if you are not experienced with custom calibration methods, you should stick to <strong>entropy </strong>by default.</p><h4>Custom layer collectors and calibrating the model</h4><p>We have prepared an interface that gives the user as much flexibility as possible in almost every step of quantization; one result of this approach is the custom LayerOutputCollector parameter of the quantization API mentioned earlier.</p><p>Layer collectors are responsible for collecting statistics of each node in the graph — this means we can observe the input/output data of every operator executed. We are able to do this by utilizing the <strong>register_op_hook </strong>method of the <strong>HybridBlock </strong>class.</p><p>To write your own layer collector, your class has to inherit from the provided <strong>CalibrationCollector</strong> class to stay consistent with the API. 
As code is worth a thousand words, below we show an example implementation of a CalibrationCollector:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/92d4c12456e1ecab6ab7645e37807b30/href">https://medium.com/media/92d4c12456e1ecab6ab7645e37807b30/href</a></iframe><p>We ‘inject’ the names of the nodes which require calibration into the <strong>include_layers </strong>instance attribute of the custom collector. It is organized this way because collecting statistics for every node is very time-consuming; thanks to this attribute you can skip nodes that do not need calibration (you can also ignore it and implement your own logic for picking the nodes to calibrate).</p><p>After all statistics are collected, the <strong>post_collect </strong>function is called. In <strong>post_collect </strong>you can implement additional logic to process the gathered data and return a dictionary mapping each node name to a tuple of min and max thresholds, which will be used to calibrate the nodes.</p><p><strong>Example of usage with the quantization API:</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5061127ac8d8db0c7083ff9045ab1346/href">https://medium.com/media/5061127ac8d8db0c7083ff9045ab1346/href</a></iframe><p>Writing your own collector logic is not easy, but if you know what you are doing you can get better accuracy. 
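The collector contract described above can be summarized in plain Python. The following is a hypothetical stand-in, with method names mirroring the description rather than the exact MXNet 2.0 signatures:

```python
class MinMaxCollector:
    """Hypothetical sketch of a CalibrationCollector-style class."""

    def __init__(self, include_layers=None):
        # names of nodes that require calibration ('injected' by the
        # framework); None means collect statistics for every node
        self.include_layers = include_layers
        self.stats = {}

    def collect(self, name, output):
        # called for each executed operator via a forward hook
        if self.include_layers is not None and name not in self.include_layers:
            return  # skip nodes that do not need calibration
        lo, hi = min(output), max(output)
        if name in self.stats:
            old_lo, old_hi = self.stats[name]
            lo, hi = min(lo, old_lo), max(hi, old_hi)
        self.stats[name] = (lo, hi)

    def post_collect(self):
        # post-process the gathered data; a real collector could clip
        # extreme values here (as BertLayerCollector does)
        return dict(self.stats)

collector = MinMaxCollector(include_layers={'conv0'})
collector.collect('conv0', [0.1, -0.4, 2.0])
collector.collect('conv0', [3.5, -1.0])
collector.collect('relu0', [9.9])  # ignored: not in include_layers
thresholds = collector.post_collect()
```

The returned dictionary of per-node (min, max) thresholds is the shape of result the quantization flow expects from post_collect.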
In a previous version of MXNet and GluonNLP, our colleagues wrote a layer collector for BERT models [1], <a href="https://github.com/dmlc/gluon-nlp/blob/v0.10.x/src/gluonnlp/calibration/collector.py">BertLayerCollector</a>, which clips the minimum/maximum values of some layers to get better accuracy (note that it’s not compliant with MXNet 2.0 — you can refer to the <a href="https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/quantization.py#L193-L320">predefined collectors in the MXNet 2.0 code</a>).</p><h3>Performance and accuracy results</h3><p>Here we present performance results for some CV models. We compare fused and quantized models to base models hybridized with static memory allocation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wXFtTCaoQ59BcjMUx_JrvQ.png" /><figcaption>Relative Inference Performance (img/sec) for Batch Size 128</figcaption></figure><p>As you can see, the popular ResNet50 gains a 62% speedup from operator fusion alone, and the quantized version is more than 6x faster than the base fp32 model. We may also observe that <strong>mobilenet </strong>benefits the most from the presented optimizations, mainly due to efficient fusing. Moreover, in the quantized version, relu6 (the activation function in mobilenet) can be achieved just by adjusting the scale factors of the int8 computation, so we could optimize it even more.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8kCHeVdfCsL6vyE7umhnSw.png" /><figcaption>ImageNet(ILSVRC2012) TOP1 validation accuracy</figcaption></figure><p>These are the TOP1 ImageNet accuracy results for computer vision models; as you can see, by calling a single function you can gain a significant performance boost almost without losing accuracy. In the chart, the biggest accuracy drop is visible for mobilenet v2, but it is still only half a percentage point below the fp32 model. 
On the other hand, VGG19 has no accuracy loss at all and, after quantization, is over 4 times faster than the baseline model.</p><h3>Summary</h3><p>In this article, we showed how you can speed up your MXNet model on CPU with oneDNN support. If you’re fine with a tiny accuracy drop, model quantization is a great method of significantly improving performance. Otherwise, you can still use operator fusion backed by oneDNN primitives without any side effects; in that case the performance boost is not as large as with quantization, but accuracy doesn’t change.</p><p>Moreover, the article described how you can utilize the new <strong>CalibrationCollector </strong>class to have better control over the calibration process. Such a solution gives you flexibility and a chance to increase the accuracy of your quantized model.</p><p>Many things have changed with the introduction of MXNet 2.0. The Gluon API has taken over from Executor-based execution, so users now have to manually call the <strong>optimize_for </strong>function to benefit from the CPU optimization passes. We have also introduced the new <strong>BatchNorm + ReLU</strong> fusion to address ResNet v2 models. Our future plans are to add support for more and more int8 models, not only in the CV domain. We are also exploring new fusion opportunities so we can speed up popular models even more.</p><h3>References</h3><p>[1] Devlin, Jacob, et al. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).<br>[2] Banu Nagasundaram. 
“Vector Neural Network Instructions Enable Int8 AI Inference on Intel Architecture.” <a href="https://www.intel.ai/vnni-enables-inference/">https://www.intel.ai/vnni-enables-inference/</a>, 2019.<br>[3] Intel® oneDNN library (<a href="https://github.com/oneapi-src/oneDNN">https://github.com/oneapi-src/oneDNN</a>)</p><h3>Benchmark environment</h3><p><strong>CPU:</strong> Intel<strong>®</strong> Xeon<strong>®</strong> Platinum 8280L CPU @ 2.70GHz<br><strong>Memory:</strong> 187 GB RAM<br><strong>OS:</strong> CentOS Linux 7 (Core)<br><strong>Compiler:</strong> gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1–5)<br><strong>MXNet Commit SHA:</strong> 3746babc8fdb211584a9a661207061cb646b01a8<br><strong>oneDNN Commit SHA:</strong> 2e4732679f0211bb311780d0f383cf2dce9baca7</p><h3>Notices and Disclaimers</h3><p>© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.</p><p>Performance varies by use, configuration and other factors. Learn more at <a href="http://www.Intel.com/PerformanceIndex">www.Intel.com/PerformanceIndex</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1852ff9729b4" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/optimizing-inference-on-cpu-in-mxnet-2-0-1852ff9729b4">Optimizing inference on CPU in MXNet 2.0</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using MXNet NDArray for fast GPU algebra on images]]></title>
            <link>https://medium.com/apache-mxnet/using-mxnet-ndarray-for-fast-gpu-algebra-on-images-a3cd63c94c62?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/a3cd63c94c62</guid>
            <category><![CDATA[gpu]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[apache-mxnet]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[numpy]]></category>
            <dc:creator><![CDATA[Olivier Cruchant]]></dc:creator>
            <pubDate>Mon, 15 Feb 2021 18:55:04 GMT</pubDate>
            <atom:updated>2021-02-15T18:55:04.030Z</atom:updated>
            <content:encoded><![CDATA[<h3>Using MXNet NDArray for Fast GPU Algebra on Images</h3><p><em>By Philippe Saadé and Olivier Cruchant</em></p><h3>Introduction</h3><p><a href="https://mxnet.apache.org/versions/1.7.0/">Apache MXNet</a> is a modern deep learning framework featuring both a pleasant developer experience and high-performance training and inference. In order to get the best possible performance while exposing a friendly interface, the MXNet community re-developed from scratch a suite of optimized tensor algebra kernels, historically accessible via the Python MXNet <a href="https://mxnet.incubator.apache.org/versions/1.7.0/api/python/docs/api/ndarray/index.html">NDArray library</a>. The initial MXNet <a href="https://mxnet.apache.org/versions/1.0.0/api/python/ndarray/ndarray.html">NDArray API</a> is very similar to <a href="https://numpy.org/doc/stable/reference/">NumPy</a>, yet it features major internal differences enabling high-performance deep learning: (1) it can run on GPU, (2) it supports auto-differentiation, (3) it runs asynchronously, executing only the code that needs to run, and in an optimized order. In order to make the powerful MXNet NDArray API fully accessible to NumPy developers, the MXNet community released in 2019 the mxnet.numpy library, which implements the NumPy API (<a href="https://medium.com/apache-mxnet/a-new-numpy-interface-for-apache-mxnet-incubating-dbb4a4096f9f">Announcement on MXNet Medium</a>). In this blog post, we illustrate the strength of the MXNet NDArray library for fast and compact algebra over images. We first highlight a couple of NDArray concepts and then assemble them into a function that we provide both as an mxnet.ndarray implementation and an mxnet.numpy implementation. As an illustration, we use a fictional use-case of anomalous image detection via simple pixel difference. 
This can be a reasonable baseline when looking for anomalous areas over a batch of near-identical images with the same viewpoint and luminosity, for example the images below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iAa3VR0-Bk97SYoMFrNv4w.png" /><figcaption><em>From left to right: (1) correct sample (2) synthetic anomalous sample with a stain (3) anomalous sample after difference with the cross-batch mean.</em></figcaption></figure><h4><strong>Notable concepts</strong></h4><p>We will use the following libraries:</p><pre>import itertools as itr<br>import os<br>import mxnet as mx<br>from mxnet import image as mxim<br>from mxnet import init, gluon, nd<br>import matplotlib.cm as cm<br>from matplotlib import pyplot as plt<br>from matplotlib import image as mpim</pre><p>MXNet features the image library that we can use to read images:</p><pre>picture = mxim.imread(&#39;orig1.JPG&#39;)</pre><pre><em># .asnumpy() converts MXNet NDArrays to NumPy arrays</em><br>plt.imshow(picture.asnumpy())</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*jtp5ghCvqkseI42IniGyHQ.jpeg" /><figcaption><em>This is a non-anomalous image. In our synthetic dataset, we have 10 such images, and two images with visual anomalies</em></figcaption></figure><p>MXNet NDArray leverages the MXNet <a href="https://mxnet.apache.org/versions/1.7.0/api/python/docs/api/mxnet/context/index.html">context</a> concept, which controls the hardware context (CPU or GPU) of objects. For example, we can send our previously loaded image to the GPU (if on a GPU-equipped instance) with the following snippet:</p><pre>pic_on_gpu = picture.copyto(mx.gpu())</pre><p>Note that we can also use picture.as_in_context(mx.gpu()) to copy the picture to the GPU; the advantage of the latter is that it performs the copy only if the image is not already on the GPU. 
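Before looking at the MXNet implementation below, the heart of the approach (subtract the cross-batch mean image, then flag images with large pixel deviations) fits in a few lines of plain NumPy. This sketch omits the convolution smoothing step that the full function adds:

```python
import numpy as np

def find_anomalies(batch, threshold):
    # batch: (N, H, W) array of aligned grayscale images
    mean_img = batch.mean(axis=0)          # cross-batch mean image
    deltas = np.abs(batch - mean_img)      # per-pixel deviation
    top = deltas.max(axis=(1, 2))          # worst pixel of each image
    return np.nonzero(top > threshold)[0]  # indices of anomalous images

rng = np.random.default_rng(0)
batch = rng.normal(0.5, 0.01, size=(12, 32, 32))  # near-identical images
batch[3, 10:15, 10:15] += 0.5                     # synthetic "stain"
flagged = find_anomalies(batch, threshold=0.2)
```

The threshold here is arbitrary and, like the constants in the full function below, would need tuning on representative data.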
<br> <br> Similarly to NumPy, MXNet NDArray is a great platform to apply algebraic transformations to batches of records. Writing transformations over batches of images is particularly relevant on GPUs, which excel at batch processing. In this demo we use <a href="https://mxnet.apache.org/versions/1.0.0/api/python/ndarray/ndarray.html#mxnet.ndarray.Concat">ndarray.concat()</a>, <a href="https://mxnet.apache.org/versions/1.0.0/api/python/ndarray/ndarray.html#mxnet.ndarray.mean">ndarray.mean()</a>, <a href="https://mxnet.apache.org/versions/1.0.0/api/python/ndarray/ndarray.html#mxnet.ndarray.abs">ndarray.abs()</a> and <a href="https://mxnet.apache.org/versions/1.0.0/api/python/ndarray/ndarray.html#mxnet.ndarray.max">ndarray.max()</a>, all very similar to their NumPy counterparts, to manipulate and analyze a batch of images.<br> <br> Since MXNet is primarily designed for deep learning, it features abundant neural network primitives. Its imperative Python front-end <a href="https://mxnet.apache.org/versions/1.4.1/gluon/index.html">Gluon</a> provides numerous model layers, and in this blog, even though we use neither deep learning nor machine learning, we borrow from Gluon its 2D-convolution (<a href="https://mxnet.apache.org/versions/1.6/api/python/docs/api/gluon/nn/index.html#mxnet.gluon.nn.Conv2D">mxnet.gluon.nn.Conv2D</a>) with all coefficients initialized at 1 to apply a smoothing effect and isolate contiguous areas of anomalous pixels.</p><h3>Wrapping it all in a compact function</h3><p>The inline code block below proposes an example function to run the above-mentioned analysis. It first reads a batch of images, then subtracts the cross-batch mean image from all of them, applies a hard-coded convolution to each of them and returns the images that have high-value pixels after the convolution. Note that this code features a couple of hard-coded constants that would be worth tuning over representative data. 
This sample is provided as an illustration of the NDArray capabilities and may deserve further refinement and testing before being used in the real world.</p><h4>Implementation with mxnet.ndarray</h4><pre>def mxnd_find_anomalies(folder, gpu=False, save_viz=True, threshold=300):<br><em>    &quot;&quot;&quot;<br>    Parameters<br>        ----------<br>        folder : str<br>            local directory with the batch of images to score<br>        gpu : bool, optional<br>            whether to use GPU or CPU<br>        save_viz : bool, optional<br>            saves the pictures of deltas and convolutions<br>        threshold : int, optional<br>            conv threshold to be anomalous. Should be tuned!           <br>    &quot;&quot;&quot;</em><br>    <br>    ctx = mx.gpu() if gpu else mx.cpu()  <em># Set context</em><br>    <br>    <em># read images in a 4D-NDArray</em><br>    pics = os.listdir(folder)<br>    ims = [mxim.imread(folder+&#39;/&#39;+pic).expand_dims(0).as_in_context(ctx) for pic in pics]<br>    ims = nd.concat(*ims, dim=0)<br>        <br>    <em># Compute the average image</em><br>    avg = nd.mean(ims.astype(&#39;float32&#39;), axis=0)<br>    <br>    <em># Remove the mean from every image, then average-pool on color</em><br>    deltas = nd.mean(ims.astype(&#39;float32&#39;)-avg, axis=3)<br>    <br>    <em># Apply a batched convolution to all the deltas, then apply abs</em><br>    conv = gluon.nn.Conv2D(1, kernel_size=5, use_bias=False)<br>    conv.initialize(init.Constant(1))<br>    conv.collect_params().reset_ctx(ctx)<br>    <br>    <em># we expand dims for conv, which expects 3D pics (with channels)</em><br>    conv_delta = nd.abs(conv(deltas.expand_dims(1)))<br>    <br>    <em># Max-pool over the image</em><br>    top_deltas = nd.max(conv_delta, axis=(1,2,3))<br>    <br>    <em># Return pics that have a convolution &gt; threshold</em><br>    anomalies = list(itr.compress(pics, top_deltas &gt; threshold))<br>    an_indexes = 
list(itr.compress(range(len(pics)), top_deltas &gt; threshold))<br>    <br>    if save_viz:  <em># optionally, save images for interpretability</em><br>        for a, i in zip(anomalies, an_indexes):<br>            mpim.imsave(&#39;pixmap-&#39;+a, deltas[i].asnumpy(), format=&#39;png&#39;)<br>            mpim.imsave(&#39;convmap-&#39;+a, conv_delta[i][0].asnumpy(), format=&#39;png&#39;)<br>                <br>    return list(anomalies)</pre><h3>Implementation with mxnet.numpy</h3><p>In the code block below, we implement the same logic as above, while replacing the mxnet.ndarray API with the mxnet.numpy API. Since mxnet.numpy implements the NumPy API, we need to replace a couple of methods with their NumPy counterparts, notably:</p><ul><li>Expanding array dimensions: array.expand_dims(0) above becomes np.expand_dims(array, 0)</li><li>Concatenation: nd.concat(*ims, dim=0) becomes np.concatenate(ims, axis=0)</li></ul><p>We further need to import mxnet.numpy and notify the MXNet backend of our use of NumPy semantics, with the extra import and configuration below:</p><pre>from mxnet import np, npx<br>npx.set_np()</pre><p>And here is the final implementation of our baseline anomaly detection function using the MXNet NumPy API:</p><pre>def mxnp_find_anomalies(folder, gpu=False, save_viz=True, threshold=300):<br>    <em>&quot;&quot;&quot;<br>    Parameters<br>        ----------<br>        folder : str<br>            local directory with the batch of images to score<br>        gpu : bool, optional<br>            whether to use GPU or CPU<br>        save_viz : bool, optional<br>            saves the pictures of deltas and convolutions<br>        threshold : int, optional<br>            conv threshold to be anomalous. 
Should be tuned!<br>    &quot;&quot;&quot;</em><br>    <br>    ctx = mx.gpu() if gpu else mx.cpu()<em>  # Set context</em><br>    <br>    <em># read images in a 4D-NDArray</em><br>    pics = os.listdir(folder)<br>    ims = [np.expand_dims(mxim.imread(folder+&#39;/&#39;+pic), 0).as_in_context(ctx) for pic in pics]<br>    ims = np.concatenate(ims, axis=0)<br>        <br>    <em># Compute the average image</em><br>    avg = np.mean(ims.astype(&#39;float32&#39;), axis=0)<br>    <br>    <em># Remove the mean from every image, then average-pool on color</em><br>    deltas = np.mean(ims.astype(&#39;float32&#39;)-avg, axis=3)<br>    <br>    <em># Apply a batched convolution to all the deltas, then apply abs</em><br>    conv = gluon.nn.Conv2D(1, kernel_size=5, use_bias=False)<br>    conv.initialize(init.Constant(1))<br>    conv.collect_params().reset_ctx(ctx)<br>    <br><em>    # we expand dims for conv, which expects 3D pics (with channels)</em><br>    conv_delta = np.abs(conv(np.expand_dims(deltas, 1)))<br>    <br>    <em># Max-pool over the image</em><br>    top_deltas = np.max(conv_delta, axis=(1,2,3))<br>    <br>    <em># Return pics that have a convolution &gt; threshold</em><br>    anomalies = list(itr.compress(pics, top_deltas &gt; threshold))<br>    an_indexes = list(itr.compress(range(len(pics)), top_deltas &gt; threshold))<br>    <br>    if save_viz:  <em># optionally, save images for interpretability</em><br>        for a, i in zip(anomalies, an_indexes):<br>            mpim.imsave(&#39;pixmap-&#39;+a, deltas[i].asnumpy(), format=&#39;png&#39;)<br>            mpim.imsave(&#39;convmap-&#39;+a, conv_delta[i][0].asnumpy(), format=&#39;png&#39;)<br>                <br>    return list(anomalies)</pre><p>Here are two anomalies the function found over our synthetic data:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sqVO0tXGVdHCh3wxnfIcQA.png" /></figure><p>Both functions have comparable latencies, which makes sense since they use the same 
back-end kernels. Here are the execution latencies, excluding visualization creation, over a batch of 12 1000px*750px images with MXNet 1.7, on a GPU-equipped Amazon SageMaker ml.p3.2xlarge Notebook (average of 10 runs):</p><ul><li>The mxnet.ndarray code runs in <strong>780ms</strong> and the mxnet.numpy code in <strong>750ms</strong> on the 8 vCPUs (Intel Xeon E5)</li><li>The mxnet.ndarray code runs in <strong>100ms</strong> and the mxnet.numpy code in <strong>75ms</strong> on the GPU (NVIDIA Tesla V100), close to 90% faster than on CPU</li></ul><h3>Conclusion</h3><p>Even though MXNet is a deep learning framework, its primitives can be creatively used beyond neural network development. In this post, we showed how to perform simple, GPU-accelerated algebra over images using its NDArray library, which NumPy developers can use with no extra learning curve via the mxnet.numpy library. Do not hesitate to take a look at it, contribute and engage with the community!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a3cd63c94c62" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/using-mxnet-ndarray-for-fast-gpu-algebra-on-images-a3cd63c94c62">Using MXNet NDArray for fast GPU algebra on images</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Which one is the best algorithm for video action recognition?]]></title>
            <link>https://medium.com/apache-mxnet/which-one-is-the-best-algorithm-for-video-action-recognition-298fb5c4ad4f?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/298fb5c4ad4f</guid>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[videos]]></category>
            <category><![CDATA[action-recognition]]></category>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Yi Zhu]]></dc:creator>
            <pubDate>Wed, 06 Jan 2021 03:36:25 GMT</pubDate>
            <atom:updated>2021-01-06T03:36:24.936Z</atom:updated>
            <content:encoded><![CDATA[<p>Author: <a href="https://github.com/bryanyzhu">Yi Zhu</a>, Applied Scientist at Amazon</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ioRec3e4bqWluDZTYk3CjA.png" /><figcaption><strong>A chronological overview of recent representative work in video action recognition</strong></figcaption></figure><p>Over time, computer vision researchers have shifted their focus from image to video, 2D to 3D, and supervised to unsupervised. One of the trends, video understanding, has become a hot topic. Video human action recognition, a basic task within video understanding, also attracts lots of attention. As shown in the timeline above, more and more algorithms for video action recognition are proposed each year. So, how do newcomers to this field know which is the right model for their use-case?</p><p>To address this question, here comes the GluonCV 0.9.0 release, which provides not only a large model zoo for video action recognition (<a href="https://cv.gluon.ai/model_zoo/action_recognition.html">46 pretrained models</a>, both in PyTorch and Apache MXNet), but also step-by-step tutorials (<a href="https://github.com/dmlc/gluon-cv/blob/master/scripts/action-recognition/feat_extract_pytorch.py">feature extraction</a>, <a href="https://cv.gluon.ai/build/examples_torch_action_recognition/finetune_custom.html">model finetuning</a>, <a href="https://github.com/dmlc/gluon-cv/blob/master/scripts/action-recognition/get_flops.py">FLOPS computation</a>), a 30-page <a href="https://arxiv.org/abs/2012.06567">survey paper</a> covering 200+ recent papers, and video lectures on <a href="https://www.youtube.com/watch?v=Jwt0Wtlv_uo&amp;list=PLGCZZzK2R0X6RQiQrbShUULsbF1qeC17d">YouTube</a>. 
If you want to get started with computer vision on video, don’t hesitate to try it out, as we believe this will help you develop new skills and figure out the right model for your scenario.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*U2xai8pWyT_Ry6AVEA1zRQ.png" /><figcaption><strong>Our survey paper and CVPR2020 tutorial video lectures.</strong></figcaption></figure><p>Video researchers have long found it difficult to compare results due to variations between datasets and differing evaluation methods. In response, we began reproducing a large number of popular algorithms using the same dataset and the same data augmentation steps (see the <a href="https://github.com/dmlc/gluon-cv/blob/master/scripts/action-recognition/ARXIV.md">readme</a> for steps to reproduce the results). Throughout this project we made several interesting observations. First, despite the fact that 3D CNNs have higher accuracy than 2D CNNs, they also have higher latency, which makes them less than ideal for deployment to edge devices or real-time inference. Second, pre-training a model on a large-scale dataset is usually more effective than improving the model itself. For example, the <a href="https://arxiv.org/abs/1904.02811">CSN</a> model (ICCV 2019) pretrained on a large-scale dataset easily outperforms recent methods by a large margin. Hence, for real-world applications, it might be more cost-effective to collect and clean your data, rather than pursuing the latest SOTA model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CNhZB4gXWyo3FVSQRIwytg.png" /><figcaption><strong>Benchmark results on the Kinetics400 dataset. Time computation does not include IO cost.</strong></figcaption></figure><p>Furthermore, training a SOTA video action recognition model consumes lots of compute resources. Even using a high-end machine with 8 V100 GPUs, it will take more than a week to obtain a decent accuracy. 
To help you iterate faster, we provide support for <a href="https://pytorch.org/tutorials/intermediate/ddp_tutorial.html">DistributedDataParallel (DDP)</a> in PyTorch, and <a href="https://arxiv.org/abs/1912.00998">Multi-grid training</a>. As shown in the bar plot below, under the same setting of using 8 GPUs, the baseline using DataParallel (DP) needs 250 hours to finish the 100-epoch training of the I3D model, while GluonCV can finish it within 41 hours, which is 6 times faster without performance degradation. If we use 4 machines with a total of 32 V100s, the training can be done in 10 hours, achieving near-linear scalability. By further comparison, the <a href="https://github.com/open-mmlab/mmaction2/blob/master/configs/recognition/i3d/README.md">mmaction2</a> toolbox required 148 hours to train the same model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Km2o16WCvEo9W036ksIkEA.png" /><figcaption><strong>Training time comparison (a standard I3D model with ResNet50 backbone).</strong></figcaption></figure><h3>Summary</h3><p>In summary, GluonCV 0.9.0 provides a whole package for learning video action recognition: a survey paper, video lectures, Jupyter demos, the model zoo, and a deployment tutorial. In future releases, we will add more models in PyTorch, covering object tracking, multi-modality video modeling, self-supervised representation learning, etc. Welcome aboard with GluonCV, feel free to raise issues, and contribute back!</p><h3>Acknowledgement</h3><p>Special thanks to <a href="https://github.com/Arthurlxy">@Arthurlxy</a> <a href="https://github.com/ECHO960">@ECHO960</a> <a href="https://github.com/zhreshold">@zhreshold</a> <a href="https://github.com/yinweisu">@yinweisu</a> for their support in this release. Thanks to <a href="https://github.com/coocoo90">@coocoo90</a> for contributing the CSN and R2+1D models. And thanks to the other contributors for the bug fixes and improvements. 
Please Like/Star/Fork/Comment/Contribute if you like GluonCV!</p><ul><li><a href="https://gluon-cv.mxnet.io/index.html">GluonCV Website</a></li><li><a href="https://github.com/dmlc/gluon-cv">GluonCV Github</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=298fb5c4ad4f" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/which-one-is-the-best-algorithm-for-video-action-recognition-298fb5c4ad4f">Which one is the best algorithm for video action recognition?</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using Optuna to Optimize Gluon Hyperparameters]]></title>
            <link>https://medium.com/apache-mxnet/using-optuna-to-optimize-gluon-hyperparameters-2fefa2549ba2?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/2fefa2549ba2</guid>
            <category><![CDATA[mxnet]]></category>
            <category><![CDATA[optuna]]></category>
            <category><![CDATA[hyperparameter-tuning]]></category>
            <dc:creator><![CDATA[Crissman Loomis]]></dc:creator>
            <pubDate>Mon, 07 Dec 2020 15:02:46 GMT</pubDate>
            <atom:updated>2020-12-07T15:02:46.632Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nqT7f1htNYHTv9AFFsOfYA.png" /></figure><h3>Gluon + Optuna!</h3><p>Optuna is a hyperparameter optimization framework applicable to machine learning frameworks and black-box optimization solvers. The Gluon library in Apache MXNet (incubating) provides a clear, concise, and simple API for deep learning. It makes it easy to prototype, build, and train deep learning models without sacrificing training speed. Combining the two allows for automatic tuning of hyperparameters to find the best performing models.</p><h3>Creating the Objective Function</h3><p>Optuna is a black-box optimizer: it needs an objective function that returns a numerical value evaluating the performance of the hyperparameters, which Optuna uses to decide where to sample in upcoming trials.</p><p>In our example, we will be doing this for <a href="https://github.com/optuna/optuna/blob/master/examples/gluon_simple.py">identifying MNIST characters</a> from the <a href="https://github.com/optuna/optuna">Optuna GitHub</a> examples folder. In this case, the objective function frame looks like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/88940298542b91d94f65c982c731074e/href">https://medium.com/media/88940298542b91d94f65c982c731074e/href</a></iframe><p>Notice that the objective function is passed an Optuna-specific trial argument, which is used to specify which hyperparameters should be tuned. The function returns the accuracy of the model as val_acc, which Optuna uses as feedback on the performance of the trial.</p><h3>Defining the hyperparameters to be tuned</h3><p>Optuna allows you to define the types and ranges of hyperparameters you want to tune directly within your code using the trial object. 
This saves the effort of learning specialized syntax for hyperparameters, and also means you can use normal Python code for looping through or defining your hyperparameters.</p><p>Optuna supports a variety of hyperparameter settings, which can be used to optimize floats, integers, or discrete categorical values. Numerical values can be suggested from a logarithmic continuum as well. In our MNIST example, we optimize the model optimizer hyperparameters within the objective function:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/66f68a39912ca30fcbf28b7c825fad24/href">https://medium.com/media/66f68a39912ca30fcbf28b7c825fad24/href</a></iframe><p>The optimizer is selected from a list of possible choices by optimizer_name = trial.suggest_categorical(&quot;optimizer&quot;, [&quot;Adam&quot;, &quot;RMSprop&quot;, &quot;SGD&quot;]). The learning rate should vary by orders of magnitude, so log=True is used in the setting of lr = trial.suggest_float(&quot;lr&quot;, 1e-5, 1e-1, log=True), which will vary the values logarithmically from 0.00001 to 0.1.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0bf13d93e34e5031ef6e13346320976c/href">https://medium.com/media/0bf13d93e34e5031ef6e13346320976c/href</a></iframe><p>For the definition of the model itself, Optuna leverages eager mode to allow normal Python looping to determine the number of layers and nodes in each layer with trial.suggest_int(“n_layers”, 1, 3) for the layers and trial.suggest_int(“n_units_l{}”.format(i), 4, 128) for the number of nodes in each layer, such as n_units_l1 or n_units_l2.</p><h3>Running the Trials</h3><p>The default sampler in Optuna is the Tree-structured Parzen Estimator (TPE), a form of Bayesian optimization. 
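</p><p>To make the suggest calls concrete, here is a compact sketch of an objective in the same spirit as the embedded gists. The FakeTrial class below is a made-up stand-in for Optuna&#39;s trial object, included only so the sketch runs without Optuna installed, and the returned score is a stub; with Optuna installed you would delete FakeTrial, train a real Gluon model inside the objective, and pass the function straight to study.optimize:</p>

```python
import math
import random

class FakeTrial:
    """Made-up stand-in for Optuna's trial object, used only to run this sketch."""
    def suggest_categorical(self, name, choices):
        return random.choice(choices)

    def suggest_float(self, name, low, high, log=False):
        if log:  # sample uniformly in log space, as Optuna does with log=True
            return math.exp(random.uniform(math.log(low), math.log(high)))
        return random.uniform(low, high)

    def suggest_int(self, name, low, high):
        return random.randint(low, high)

def objective(trial):
    # Each suggest_* call both declares the search space and draws a value
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "RMSprop", "SGD"])
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 3)
    # Stub score standing in for the validation accuracy of a trained model;
    # a real objective would build and train a Gluon network with these values
    return random.random()

score = objective(FakeTrial())
print(score)
```

<p>Note that optimizer_name, lr and n_layers are drawn but unused in this stub; in a real objective they would parameterize the Gluon trainer and network.</p><p>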
Optuna uses TPE to search more efficiently than a random search, by choosing points closer to previous good results.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jsSKCHgDTPZnvF6C.png" /></figure><p>To run the trials, create a study object, which sets the direction of optimization (&quot;maximize&quot; or &quot;minimize&quot;), along with other settings. Then call study.optimize(objective, n_trials=100, timeout=600) to do one hundred trials, with a timeout after ten minutes.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/966857ff753de811880ecbe3ecc862d4/href">https://medium.com/media/966857ff753de811880ecbe3ecc862d4/href</a></iframe><p>Each trial is chosen after evaluating all the trials that have been previously done, using the sampler to make a smart guess about where the best hyperparameter values can be found.</p><p>The best values from the trials can be accessed through study.best_trial, and other methods of viewing the trials, such as formatting in a dataframe, are available.</p><h3>Pruning — Early Stopping of Poor Trials</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2TrkgiQX9ALnHSDZ.png" /></figure><p>Pruning trials is a form of early stopping which terminates unpromising trials, so that computing time can be used for trials that show more potential. 
To enable pruning, it’s necessary to open up the black box of the objective function a little: the function must report intermediate results to Optuna, so that Optuna can compare the trial’s progress with that of other trials and decide whether to stop it early; it must check whether Optuna has requested termination; and it must let the trial in session terminate cleanly after recording its results.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b6fcb05d017690acf371688f865c43f2/href">https://medium.com/media/b6fcb05d017690acf371688f865c43f2/href</a></iframe><p>trial.report is used to communicate with Optuna about the progress of the trial. In this example, the objective function communicates the current epoch and the accuracy. trial.should_prune() is how Optuna communicates to the objective function whether it should terminate early.</p><h3>To the Future, and Beyond!</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*DFXSyK7P8O8WLm9A.png" /><figcaption>Plot Contour Visualization</figcaption></figure><p>For those interested, Optuna has many other features, including <a href="https://optuna.readthedocs.io/en/latest/reference/visualization.html">visualizations</a>, alternative samplers, optimizers, and pruning algorithms, as well as the ability to create user-defined versions. If you have more computing resources available, Optuna also provides an easy interface for parallel trials to increase tuning speed.</p><p>Give <a href="https://optuna.org">Optuna</a> a try today. 
See <a href="https://optuna.readthedocs.io/en/stable/installation.html">Installation</a> and the <a href="https://github.com/optuna/optuna">Optuna GitHub</a>.</p><p><em>This post uses Gluon with Apache MXNet 1.7.0 and optuna 2.3.0.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2fefa2549ba2" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/using-optuna-to-optimize-gluon-hyperparameters-2fefa2549ba2">Using Optuna to Optimize Gluon Hyperparameters</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Tallying votes in a meeting using pose detection: A case study with Nexity and GluonCV]]></title>
            <link>https://medium.com/apache-mxnet/tallying-votes-in-a-meeting-using-pose-detection-a-case-study-with-nexity-and-gluoncv-7aaf5b3fede6?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/7aaf5b3fede6</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[sagemaker]]></category>
            <category><![CDATA[mxnet]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[gluoncv]]></category>
            <dc:creator><![CDATA[Valentin Lecerf]]></dc:creator>
            <pubDate>Wed, 21 Oct 2020 20:34:51 GMT</pubDate>
            <atom:updated>2020-10-21T20:34:50.945Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/647/1*652n9t4BgonxnTjBGsjMLA.png" /><figcaption>Picture from the camera; on the left side the scenario, and on the right side the output of the applied model</figcaption></figure><h3>Who is Nexity, what was the challenge?</h3><p><a href="https://www.nexity.fr/en/group"><em>Nexity</em></a><em> is France’s leading integrated real estate group, with business operations in all areas of real estate development and services (residential real estate, commercial real estate, real estate services to individuals and real estate services to companies, distribution networks and client relations, major urban projects) and enjoys a strong presence across all industry cycles (short, medium and long). Nexity has heavily adopted the cloud as a catalyst for technology-driven innovation.</em></p><h3>How did Nexity solve the challenge?</h3><p><em>This initiative occurred in the context of an internal innovation hackathon organized at Nexity. The goal of the hackathon was to raise executive-level awareness of the potential of cloud technologies. Volunteer teams had to compete by demonstrating an innovative system that could be built quickly using cloud services. Each team could choose its own scenario. We decided to work on the services we could offer at our general meetings of co-owners. Real-estate management involves a lot of in-person meetings where decisions are submitted to show-of-hand votes and manually tallied. The team “KFC”, consisting of engineers, tech leaders and project managers Antoine Pellet, Vincent Boidin, Xavier Top, Valentin Lecerf, Jérémy Desvaux and Grégory Hivin, decided to tackle the challenge of automatically tallying votes of a show-of-hand poll in real-time from fictional yet representative pictures of meeting rooms. 
In this post, the team tells about its experience with Apache MXNet and GluonCV.<br> <br> “Initially, our idea was to use an object detection model to detect raised hands. Due to the very limited implementation time, we switched to a pre-trained pose estimation model from the MXNet </em><a href="https://gluon-cv.mxnet.io/model_zoo/pose.html"><em>GluonCV model zoo</em></a><em>. This model gave us a list of coordinates representing people, and from this we could easily deduce whether a right or left hand was up. We used the coordinates of hands, shoulders and head to decide if an arm is raised or not. Without specific tuning we observed a 70% success rate. The main challenge was the inference latency, as we implemented a web application displaying a real-time vote count over a 1080p video stream. Some important aspects to keep in mind for this type of solution are (1) the performance (need to handle a video stream) and (2) the video size and quality.”</em></p><h3>The code</h3><p><em>The team developed and tested from an </em><a href="https://aws.amazon.com/sagemaker/"><em>Amazon SageMaker</em></a><em> Notebook and provided the following sample code.<br> <br> Installation of gluoncv and import of relevant packages:</em></p><pre><em>!
pip install gluoncv</em></pre><pre><em>from gluoncv import model_zoo, data, utils<br>from gluoncv.data.transforms.pose import detector_to_simple_pose, heatmap_to_coord<br>import mxnet as mx<br>from mxnet import nd<br>from mxnet.gluon.model_zoo import vision as models</em></pre><p><em>Loading the detector and the pose estimation model, and restricting the detector to the person class only:</em></p><pre><em>detector = model_zoo.get_model(&#39;yolo3_mobilenet1.0_coco&#39;, pretrained=True)<br>pose_net = model_zoo.get_model(&#39;simple_pose_resnet18_v1b&#39;, pretrained=True)<br>detector.reset_class([&quot;person&quot;], reuse_weights=[&#39;person&#39;])</em></pre><p><em>Run person detection and feed its output into pose estimation. Note that the pose network must run before its heatmap can be converted to coordinates (in the GluonCV tutorials, x and img are the transformed tensor and the display image returned by data.transforms.presets.yolo.load_test):</em></p><pre><em>class_IDs, scores, bounding_boxs = detector(x)<br>pose_input, upscale_bbox = detector_to_simple_pose(img, class_IDs, scores, bounding_boxs)<br>predicted_heatmap = pose_net(pose_input)<br>pred_coords, confidence = heatmap_to_coord(predicted_heatmap, upscale_bbox)</em></pre><p><em>For each pose detection, check if the left hand or the right hand is raised (a smaller y coordinate means higher in the image):</em></p><pre><em>if (coords[10][1] &lt; coords[8][1] &lt; coords[6][1]  # left hand raised<br>  or coords[9][1] &lt; coords[7][1] &lt; coords[5][1]):  # right hand raised<br>    return True<br>else:<br>    return False</em></pre><p><em>The full mapping between prediction coefficients and body joints was listed, to facilitate development:</em></p><pre><em># 0 # Facepoints<br># 1 # Facepoints<br># 2 # Facepoints<br># 3 # Facepoints<br># 4 # Facepoints<br># 5 # right shoulder<br># 6 # left shoulder<br># 7 # right elbow<br># 8 # left elbow<br># 9 # right hand<br># 10 # left hand<br># 11 # pelvis right<br># 12 # pelvis left<br># 13 # right knee<br># 14 # left knee<br># 15 # right foot<br># 16 # left foot</em></pre><h3>Going further with GluonCV</h3><p><em>Abundant models, training and inference scripts are available in the </em><a href="https://gluon-cv.mxnet.io/contents.html"><em>gluoncv</em></a><em> package, which as of May 2020 features 14 pre-trained models and 4 tutorials for pose estimation alone!</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/597/1*eHOhHdT1keobnugqOUY1KQ.png" /><figcaption><em>Accuracy-Throughput tradeoff in the </em><a href="https://gluon-cv.mxnet.io/model_zoo/pose.html"><em>gluoncv pose estimation model zoo</em></a></figcaption></figure><p><a href="https://gluon-cv.mxnet.io/"><em>GluonCV</em></a><em> is a Python computer vision toolkit built on top of the efficient Apache MXNet deep learning framework. GluonCV also comes with features dedicated to inference optimization, such as pruned models and quantization functions.<br> Once scientific code is developed, it can be exposed as a service via a web server. </em><a href="https://github.com/awslabs/multi-model-server"><em>Multi-Model Server (MMS)</em></a><em> is a framework-agnostic model server that can be deployed as-is on compatible platforms, but that also ships as the managed backend of the </em><a href="https://github.com/aws/sagemaker-mxnet-inference-toolkit"><em>SageMaker MXNet</em></a><em> inference containers. Finally, numerous ideas can be exploited to improve the performance and economics of the deployment, such as model compilation and hardware acceleration. 
Those concepts were presented in a </em><a href="https://medium.com/apache-mxnet/faster-cheaper-leaner-improving-real-time-ml-inference-using-apache-mxnet-2ee245668b55"><em>previous blog post</em></a><em>.<br> <br> This is not the first time deep learning is used on a real-estate or urban planning use-case, yet this is a particularly original and creative one! Among existing deep learning research relating or relevant to real-estate, we note in particular:<br> <br> — In </em><a href="https://medium.com/compass-true-north/similar-homes-and-homepage-recommendations-new-frontiers-of-ai-in-real-estate-1102330561eb"><em>Launching Similar Homes and Real-Time Personalized Recommendations</em></a><em> (Gautam Narula, Ran Ding, Samuel Weiss, and Joseph Sirosh) Compass researchers describe the challenge of recommending real-estate listings. A deep embedding model is developed with Apache MXNet to learn listing similarity. Significant business impact is reported (+153% click-through rate and +107% engagement actions)<br> — In 2018, </em><a href="https://developmentseed.org/"><em>Development Seed</em></a><em> </em><a href="https://medium.com/devseed/use-label-maker-and-amazon-sagemaker-to-automatically-map-buildings-in-vietnam-a63090fb399f"><em>described using Apache MXNet</em></a><em> to classify building presence from aerial imagery<br> — In an </em><a href="https://aws.amazon.com/fr/blogs/machine-learning/using-deep-learning-on-aws-to-lower-property-damage-losses-from-natural-disasters/"><em>AWS ML Blog post</em></a><em>, the property data analytics company </em><a href="https://www.eagleview.com/"><em>EagleView</em></a><em> presents aerial computer vision solutions developed with Apache MXNet to assess urban damage created by natural disasters. 
Using deep learning, EagleView can assess property damage within 24 hours and inform insurers and homeowners more rapidly.</em></p><h3>Conclusion</h3><p><em>In summary, if you are working on a computer vision use-case, chances are GluonCV and Apache MXNet can help you drastically reduce your time to result while maintaining state-of-the-art accuracy and efficiency. Please do not hesitate to give it a try, contribute to those projects and reach out to the community on the forum </em><a href="https://discuss.mxnet.io/"><em>discuss.mxnet.io/</em></a><em>!</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7aaf5b3fede6" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/tallying-votes-in-a-meeting-using-pose-detection-a-case-study-with-nexity-and-gluoncv-7aaf5b3fede6">Tallying votes in a meeting using pose detection: A case study with Nexity and GluonCV</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Analyze Driving Scenes with GluonCV 0.8]]></title>
            <link>https://medium.com/apache-mxnet/analyze-driving-scenes-with-gluoncv-0-8-56ac37664977?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/56ac37664977</guid>
            <category><![CDATA[semantic-segmentation]]></category>
            <category><![CDATA[depth-estimation]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[semi-supervised-learning]]></category>
            <category><![CDATA[autonomous-cars]]></category>
            <dc:creator><![CDATA[Yi Zhu]]></dc:creator>
            <pubDate>Tue, 15 Sep 2020 07:10:18 GMT</pubDate>
            <atom:updated>2020-09-15T07:10:18.599Z</atom:updated>
<content:encoded><![CDATA[<p>Author: <a href="https://github.com/bryanyzhu">Yi Zhu</a>, Applied Scientist at Amazon</p><p>Driving scene analysis models are now appearing in more and more applications, for example generating street scenes in an <a href="https://www.youtube.com/watch?v=ayPqjPekn7g">AI-rendered virtual world</a> or building <a href="https://www.youtube.com/watch?v=cB_ez2MNHMo">self-driving cars</a>. Recently, <a href="https://github.com/intel-isl/OpenBot">OpenBot</a> combined a smartphone and a small electric vehicle into a low-cost robot that can support workloads like pedestrian following and real-time autonomous navigation. Do you want to build some more advanced features for your own applications or robots? Given the surging interest in autonomous driving applications, we include a new task called depth estimation and more semantic segmentation models in this latest GluonCV 0.8 release.</p><h3><a href="https://gluon-cv.mxnet.io/model_zoo/depth.html">Depth Estimation</a></h3><p>Depth estimation is a long-standing computer vision task and an important step towards inferring scene geometry from 2D images. The goal of depth estimation is to predict the depth value of each pixel, given a single RGB image (monocular setting) or left-right image pairs (stereo setting). Recently, self-supervised depth estimation has come to dominate the field thanks to its strong performance and the fact that it requires no annotated data. <a href="http://visual.cs.ucl.ac.uk/pubs/monoDepth/">Monodepth</a> and <a href="https://arxiv.org/abs/1806.01260">Monodepth2</a> are important milestones which have established new single-camera depth estimation baselines.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*Ej4JSOb-rHVHuhT-AhXcmw.gif" /></figure><p>Given Monodepth2’s popularity and strong performance, we provide a GluonCV implementation of Monodepth2 in this release. 
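As background on the stereo setting, depth and disparity are related by depth = focal_length × baseline / disparity. A tiny pure-Python illustration (the focal length and baseline are approximate KITTI values, used here only for the example):

```python
# Stereo depth from disparity: depth = f * B / d, where f is the focal length
# in pixels, B the camera baseline in meters, and d the disparity in pixels.
# The defaults below are approximate KITTI values, used purely for illustration.
def disparity_to_depth(disparity_px, focal_px=721.0, baseline_m=0.54):
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

far = disparity_to_depth(10.0)    # small disparity -> distant object (~39 m)
near = disparity_to_depth(100.0)  # large disparity -> close object (~3.9 m)
```

Self-supervised stereo methods like Monodepth2 exploit exactly this relation: a network that predicts good disparities implicitly predicts depth, with the left-right image pair providing the training signal instead of labels.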
We provide models pretrained on <a href="http://www.cvlibs.net/datasets/kitti/">KITTI</a> and training logs for all three settings: monocular, stereo and monocular+stereo. All results are reproducible and similar to the numbers reported in the original publications.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Strsfw6hKWtH_3u8QIXnXQ.png" /></figure><p>To get you started, we provide a number of detailed tutorials, such as:</p><ul><li><a href="https://gluon-cv.mxnet.io/build/examples_depth/demo_monodepth2.html">how to predict depth from an image</a></li><li><a href="https://gluon-cv.mxnet.io/build/examples_depth/videos_monodepth2.html">how to generate a demo video as seen above</a></li><li><a href="https://gluon-cv.mxnet.io/build/examples_depth/train_monodepth2.html">how to train a Monodepth2 model</a></li><li><a href="https://gluon-cv.mxnet.io/build/examples_depth/test_monodepth2_posenet.html">how to predict the trajectory of image sequences using PoseNet in Monodepth2</a></li></ul><p>Once you have the estimated depth and trajectory, it will be easier for your autonomous bot to avoid obstacles. Note that the monocular and monocular+stereo settings were added after the release date, so if you are eager to try them out right away, feel free to install the nightly version of GluonCV.</p><h3><a href="https://gluon-cv.mxnet.io/model_zoo/segmentation.html">Better/Faster segmentation models</a></h3><p>We include two new semantic segmentation models in this release: <a href="https://arxiv.org/abs/1809.02983">DANet</a> and <a href="https://arxiv.org/abs/1902.04502">FastSCNN</a>. DANet is one of the state-of-the-art models on several segmentation benchmarks, and FastSCNN is one of the most popular real-time segmentation models. 
The following table shows their performance on the Cityscapes validation set.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rphKouhY82daQMoqLo8aEg.png" /></figure><p>Our FastSCNN model is an improved variant from our <a href="https://arxiv.org/abs/2004.14960">recent paper</a> using semi-supervised learning, i.e., its performance of 72.3 mIoU is better than the 68.6 mIoU reported in the original paper. To the best of our knowledge, 72.3 mIoU makes this the highest-scoring implementation of FastSCNN and one of the best real-time semantic segmentation models. It can run at a speed of 80.8 fps on a single V100 GPU card given 1024x2048 input video sequences.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*sFIl2R2eYGQu4_9hTCbSjg.gif" /></figure><p>If you want to train a segmentation model on your own dataset/location, we provide <a href="https://gluon-cv.mxnet.io/tutorials/index.html#semantic-segmentation">tutorials</a> on how to train and test segmentation models. You can also try our semi-supervised learning method to automatically generate pseudo labels, so that you don’t need to label your own data! We have demonstrated strong cross-domain generalization performance in <a href="https://arxiv.org/abs/2004.14960">our paper.</a></p><h3>Summary</h3><p>GluonCV v0.8 adds Monodepth2, DANet and FastSCNN to our model zoo, which can help you analyze driving scenes for your own applications. You can use these models for much more than just driving scenes, though. One example would be indoor scene analysis. So open up your mind: the sky is the limit. 
Please check out our <a href="https://gluon-cv.mxnet.io/tutorials/index.html">tutorials</a> and <a href="https://gluon-cv.mxnet.io/model_zoo/index.html">model zoo</a> for more details.</p><h3>Acknowledgement</h3><p>We sincerely thank the following contributors:<br><a href="https://github.com/zhreshold">@zhreshold</a>, <a href="https://github.com/KuangHaofei">@KuangHaofei</a>, <a href="https://github.com/xdeng7">@xdeng7</a>, <a href="https://github.com/ytian8">@ytian8</a>, <a href="https://github.com/FrankYoungchen">@FrankYoungchen</a>, <a href="https://github.com/bryanyzhu">@bryanyzhu</a>, <a href="https://github.com/Jerryzcn">@Jerryzcn</a>, <a href="https://github.com/yezqNLP">@yezqNLP</a>, <a href="https://github.com/LauLauThom">@LauLauThom</a>, <a href="https://github.com/karan6181">@karan6181</a>, <a href="https://github.com/chinakook">@chinakook</a>, <a href="https://github.com/tkhe">@tkhe</a>, <a href="https://github.com/tirkarthi">@tirkarthi</a>, <a href="https://github.com/mseth10">@mseth10</a>, <a href="https://github.com/ksindwan">@ksindwan</a>, <a href="https://github.com/Neutron3529">@Neutron3529</a>, <a href="https://github.com/Aktcob">@Aktcob</a>, <a href="https://github.com/tmyapple">@tmyapple</a>, <a href="https://github.com/chongruo">@chongruo</a>, <a href="https://github.com/xinyu-intel">@xinyu-intel</a></p><p>Please Like/Star/Fork/Comment/Contribute if you like GluonCV!</p><ul><li><a href="https://gluon-cv.mxnet.io/index.html">GluonCV Website</a></li><li><a href="https://github.com/dmlc/gluon-cv">GluonCV Github</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=56ac37664977" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/analyze-driving-scenes-with-gluoncv-0-8-56ac37664977">Analyze Driving Scenes with GluonCV 0.8</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by 
highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Streaming Inference Pipeline — Deploying MXNet model on AWS Lambda]]></title>
            <link>https://medium.com/apache-mxnet/streaming-inference-pipeline-deploying-mxnet-model-on-aws-lambda-7ce6bc8f4cc8?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/7ce6bc8f4cc8</guid>
            <category><![CDATA[optimization]]></category>
            <category><![CDATA[mxnet]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[ml-model-deployment]]></category>
            <category><![CDATA[aws-lambda]]></category>
            <dc:creator><![CDATA[WayTrue]]></dc:creator>
            <pubDate>Sat, 11 Jul 2020 04:18:38 GMT</pubDate>
            <atom:updated>2020-07-15T16:37:18.111Z</atom:updated>
<content:encoded><![CDATA[<h3>Build a Streaming Inference Pipeline by Deploying Apache MXNet on AWS Lambda</h3><p><em>Thanks to Sandeep Krishnamurthy, Olivier Cruchant, Thom Lane and Kevin Mould for their feedback</em></p><h3>Introduction</h3><p>AWS Lambda is a compute service that allows you to run code without provisioning or managing servers. It also introduces scalability, with the ability to run <a href="https://docs.aws.amazon.com/lambda/latest/dg/invocation-scaling.html">up to 3000 tasks concurrently</a>. More importantly, you only pay for the compute time you consume, which can lead to significant cost savings in certain situations. By combining deep learning models with AWS Lambda, you can get outstanding model performance with all of this convenience. In this post, we will walk through how to build a streaming inference pipeline using Apache MXNet and AWS Lambda. In particular, we will discuss how to deploy a package that exceeds the typical Lambda upload limit. Finally, we will analyze the pipeline performance and costs.</p><h3>Use case scenario</h3><p>A deep learning streaming inference pipeline is a perfect use case for MXNet on AWS Lambda. Let’s assume an e-commerce company wants to extract metadata from their product images. For example, <a href="https://medium.com/airbnb-engineering/categorizing-listing-photos-at-airbnb-f9483f3ab7e3">Airbnb uses image classification models</a> to identify what type of room (kitchen, pool, garden) is displayed in a picture. This is where the streaming inference pipeline comes in: it continuously monitors the data flow, performs image classification whenever an image is acquired, and saves the results into long-term storage (such as Amazon S3). AWS Lambda manages everything for you: from provisioning to scaling. 
Additionally, the company only pays for the compute time that AWS Lambda needs, which makes it very cost efficient for intermittent workloads.</p><h3>Inference pipeline workflow</h3><p>The above data extraction use case boils down to the following workflow:</p><ol><li>An image is uploaded to an Amazon S3 bucket (input bucket).</li><li>The image triggers the AWS Lambda function to retrieve resources from the resource bucket and perform image classification using the MXNet model.</li><li>The inference result is stored in another Amazon S3 bucket (output bucket).</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VVBspL-9ALwfz1ODZ97x4w.png" /></figure><h3>Build the pipeline</h3><p>This section will show you how to build the above pipeline step by step. Before we start, please don’t forget to <a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html">configure your AWS Command Line Interface (CLI)</a> if you haven’t already done so:</p><pre>$ aws configure</pre><h4>Amazon S3 buckets</h4><p>The first step is to create the following storage components of the pipeline:</p><ul><li>input_bucket: An S3 bucket to receive the input images.</li><li>resource_bucket: An S3 bucket to host the resource files for the Lambda function during runtime.</li><li>output_bucket: An S3 bucket to store the inference results.</li></ul><p>You can either use the Amazon S3 management console or the following CLI commands to create the S3 buckets:</p><pre>$ aws s3api create-bucket --bucket your-input-bucket-name \<br>                          --region your-region \<br>                          --create-bucket-configuration \<br>                          LocationConstraint=your-region<br>$ aws s3api create-bucket --bucket your-resource-bucket-name \<br>                          --region your-region \<br>                          --create-bucket-configuration \<br>                          LocationConstraint=your-region<br>$ aws s3api create-bucket 
--bucket your-output-bucket-name \<br>                          --region your-region \<br>                          --create-bucket-configuration \<br>                          LocationConstraint=your-region</pre><h4>Prepare the deployment package</h4><p>The next step is to prepare the deployment package for AWS Lambda. A <a href="https://docs.aws.amazon.com/lambda/latest/dg/python-package.html">deployment package</a> is a compressed archive that contains the function code and dependencies. AWS Lambda will run this code when triggered. In our case, it should contain the following items:</p><ul><li>lambda_function.py — the main function that performs inference during runtime.</li><li>model.params — the model parameter file. In this case we use a pre-trained ResNet50_v2 model (download <a href="https://github.com/waytrue17/MXNet-on-Lambda/blob/master/resnet50_v2.params">here</a>). You can also use your own model. More details on saving/loading models can be found <a href="https://mxnet.apache.org/api/python/docs/tutorials/packages/gluon/blocks/save_load_params.html">here</a>.</li><li>synset.txt — the label file for the Imagenet dataset. It maps the model output (integer) to an object class (string). For instance, ‘559’ is mapped to ‘folding chair’. You can find it <a href="https://github.com/waytrue17/MXNet-on-Lambda/blob/master/synset.txt">here</a>.</li><li>dependencies — the libraries that the lambda function depends on, such as MXNet, numpy, etc.</li></ul><p><strong>Dependencies</strong><br>The following command downloads the current released version of MXNet (1.6.0) and all its dependencies into the folder package in the current directory.</p><pre>$ pip install mxnet -t ./package</pre><p>Note: please make sure that you download the dependencies within an operating system that is compatible with Amazon Linux (AL/AL2), since the Lambda function runs on Amazon Linux. 
The easiest way to do this is to run an Amazon EC2 instance with Amazon Linux, or you can use<a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/amazon_linux_container_image.html"> Docker to set up an Amazon Linux environment.</a><br> <br><strong>Lambda upload limit</strong><br>Adding everything to the package folder, we end up with the following file structure:</p><pre>package<br>----lambda_function.py<br>----resnet50_v2.params<br>----synset.txt<br>----mxnet<br>----numpy<br>----numpy.libs<br>...<br>----urllib3</pre><p>Our current package exceeds 370 MB (mxnet 190 MB, resnet50_v2.params 100 MB, numpy 50 MB, numpy.libs 30 MB), but the deployment package limit for Lambda is 250 MB. This makes it impossible to upload the whole package as is. However, AWS Lambda provides an <a href="https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html">additional 512 MB</a> of storage inside the “/tmp” directory (while running the Lambda function). All files in this directory persist during the life cycle of the Lambda instance. Thus, we can split the package into smaller parts: one uploaded to AWS Lambda directly, and the rest downloaded into the “/tmp” directory at runtime. By splitting the package and exploiting the “/tmp” storage, we increase the size limit from 250 MB to more than 750 MB. <br> <br>In this implementation, we split the package as seen below. You can split it in other ways, as long as it fits within the AWS Lambda constraints.</p><pre>pkg_lambda<br>----lambda_function.py<br>----synset.txt<br>----mxnet</pre><pre>pkg_tmp<br>----numpy<br>----numpy.libs<br>...<br>----urllib3</pre><pre>resnet50_v2.params</pre><p><strong>pkg_lambda</strong><br>This package contains all the files that will be uploaded to AWS Lambda directly. The size is reduced to around 200 MB (from 370 MB before). 
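As a sanity check before compressing and uploading, a candidate split can be validated against the two limits. A minimal Python sketch (the component names and MB figures are the approximate ones quoted above, and `split_fits` is an illustrative helper, not part of any AWS tooling):

```python
# Sanity-check a package split against AWS Lambda storage limits (sizes in MB).
# 250 MB is the deployment package quota, 512 MB the "/tmp" quota; component
# sizes are the approximate figures quoted in this post.
DIRECT_LIMIT_MB = 250   # deployment package uploaded to Lambda
TMP_LIMIT_MB = 512      # files fetched into "/tmp" at runtime

def split_fits(direct_mb, tmp_mb):
    """Return True if both halves of the split respect their limits."""
    return (sum(direct_mb.values()) <= DIRECT_LIMIT_MB
            and sum(tmp_mb.values()) <= TMP_LIMIT_MB)

pkg_lambda = {"mxnet": 190, "handler_and_labels": 1}       # uploaded directly
pkg_tmp = {"deps": 85, "resnet50_v2.params": 100}          # fetched into /tmp

assert split_fits(pkg_lambda, pkg_tmp)          # the chosen split fits
assert not split_fits({"everything": 370}, {})  # the unsplit package does not
```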
Use the following command to compress it and get pkg_lambda.zip:</p><pre>~$ cd pkg_lambda<br>~/pkg_lambda$ zip -r9 ${OLDPWD}/pkg_lambda.zip .</pre><p>We also provide our pkg_lambda.zip <a href="https://github.com/waytrue17/MXNet-on-Lambda/blob/master/mxnet/pkg_lambda.zip">here</a>. <br><br>You can either use the Amazon S3 management console or the following CLI command to upload it to the resource_bucket:</p><pre>$ aws s3 cp pkg_lambda.zip s3://your-resource-bucket-name</pre><p><strong>pkg_tmp and model file</strong><br>These files will be stored in the resource_bucket and get downloaded into the “/tmp” directory during the AWS Lambda runtime. The size is around 85 MB. Use the following command to get pkg_tmp.tar.gz:</p><pre>~$ cd pkg_tmp<br>~/pkg_tmp$ tar -cvzf ${OLDPWD}/pkg_tmp.tar.gz .</pre><p>We also provide our pkg_tmp.tar.gz <a href="https://github.com/waytrue17/MXNet-on-Lambda/blob/master/mxnet/pkg_tmp.tar.gz">here</a>.<br> <br>You can either use the Amazon S3 management console or the following CLI command to upload it to the resource_bucket:</p><pre>$ aws s3 cp pkg_tmp.tar.gz s3://your-resource-bucket-name</pre><p>The last thing to upload is the model file:</p><pre>$ aws s3 cp resnet50_v2.params s3://your-resource-bucket-name</pre><p>Note: here we choose to separate the model file from pkg_tmp. Uploading it separately, without compressing it into the package, buys us more space in the “/tmp” directory when downloading it. It also makes updating the model file easier, since we don’t need to update the whole pkg_tmp.</p><p><strong>lambda_function.py</strong><br>lambda_function.py contains a function called lambda_handler which takes the trigger event (an image in our case) as an input and performs the model inference. 
<br> <br>In this file we:</p><ul><li>Specify how the AWS Lambda runtime downloads pkg_tmp and the model file from resource_bucket.</li><li>Transform the input images into the required format for the model.</li><li>Perform image classification inference.</li><li>Send the results to output_bucket.</li></ul><p>Note: lambda_function.py contains multiple non-event-specific tasks, such as downloading pkg_tmp, loading model parameters, generating the label list, etc. We should put these code blocks above the lambda_handler function, at module scope, to reduce the inference latency. More details can be found in our implementation of <a href="https://github.com/waytrue17/MXNet-on-Lambda/blob/master/lambda_function.py">lambda_function.py</a>.</p><h4>Creating an AWS Lambda function</h4><p>Once the deployment packages are ready, the next step is to create the AWS Lambda function. <br> <br><strong>Set an IAM role</strong><br>First, we need to create an IAM role that permits AWS Lambda to communicate with the other AWS components in the pipeline. Go to the AWS IAM management console: Roles → Create role → Lambda, and attach the following policies:</p><ul><li>AmazonS3FullAccess</li><li>AWSLambdaBasicExecutionRole</li><li>CloudWatchEventsFullAccess</li></ul><p>More details on IAM roles can be found <a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-intro-execution-role.html">here</a>.</p><p><strong>Create the Lambda function</strong><br>The next step is to create the AWS Lambda function, attach the IAM role to it and upload the pkg_lambda. 
You can either use the AWS Lambda management console or the following CLI command:</p><pre>$ aws lambda create-function --function-name your-function-name \<br>             --code S3Bucket=your-resource-bucket-name,S3Key=pkg_lambda.zip \<br>             --handler lambda_function.lambda_handler \<br>             --runtime python3.7 \<br>             --role arn:aws:iam::your-aws-account:role/your-role-name \<br>             --timeout 30 \<br>             --memory-size 1024</pre><p><strong>Add a trigger to Lambda</strong><br>The last step is to add an S3 “All object create events” trigger to the Lambda function. Go to the AWS Lambda management console → Functions → Select the function we just created → add trigger → select S3 → specify input_bucket for bucket name → specify “All object create events” for Event type.<br> <br><strong>Test</strong><br>At this point, we’ve successfully built an inference pipeline. To test it, upload an image to the input_bucket. A new file that contains the predicted object class should appear in the output_bucket.</p><h3>MXNet with MKL-DNN</h3><p>Intel MKL-DNN provides various highly vectorized and threaded operators to accelerate deep learning frameworks. MXNet supports MKL-DNN to achieve better training and inference performance. In this section, we will demonstrate how to enable MKL-DNN in our inference pipeline. <br> <br>The only difference is the MXNet library. Use the following command to download the MKL-DNN-enabled MXNet (version 1.6.0):</p><pre>$ pip install mxnet-mkl -t ./package</pre><p>Note that the MXNet library itself reaches 280 MB, and therefore no longer fits within the 250 MB limitation. 
We must rearrange the deployment package as follows to satisfy the limitation:</p><pre>pkg_lambda_mkl<br>----lambda_function.py<br>----synset.txt<br>----numpy<br>----numpy.libs<br>...<br>----urllib3</pre><pre>pkg_tmp_mkl<br>----mxnet</pre><pre>resnet50_v2.params</pre><p>You can prepare your own packages or download our pkg_lambda_mkl.zip and pkg_tmp_mkl.tar.gz <a href="https://github.com/waytrue17/MXNet-on-Lambda/tree/master/mxnet-mkl">here</a>. Once the two packages are uploaded to the resource_bucket, you can either use the AWS Lambda management console or run the following command to update the pipeline and enable MKL-DNN:</p><pre>$ aws lambda update-function-code \<br>             --function-name your-function-name \<br>             --s3-bucket your-resource-bucket-name \<br>             --s3-key pkg_lambda_mkl.zip</pre><h3>Inference latency</h3><p>Latency is a key factor in our inference pipeline. With AWS Lambda functions, the latency depends on whether the instance is “<a href="https://mikhail.io/serverless/coldstarts/aws/">cold</a>” or “warm”. Cold start inference is much slower due to the initialization tasks. Latency also depends on CPU resources. We observe that a larger memory allocation in Lambda typically means lower latency, but at a higher cost. In this section, we summarize the latency for both mxnet and mxnet-mkl inference.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1008/1*Yk3eP7CmIqTwM4Rjxixx7g.png" /></figure><p>The figure above shows the cold start inference latency. Each value is measured by averaging 5 cold start inference latencies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1010/1*AHO9uWJDOAojM5EZ9-gK0g.png" /></figure><p>The figure above shows the warm start inference latency. Each value is measured by averaging 20 consecutive warm start inference latencies. <br> <br>We observe that enabling MKL-DNN increases the cold start latency due to its larger package size. 
The overall cold start latency is around 6 seconds, which is a huge time cost for deep learning inference. In production, we should try to avoid cold starts. On the other hand, a warm start takes around 400 ms in general, which is acceptable in many use cases. The figure also indicates that enabling MKL-DNN improves the inference latency by around 15%.</p><h3>Inference cost</h3><p>Cost is another important factor for the pipeline. Here we mainly focus on the warm start cost, since it applies to most inference requests. AWS Lambda offers free usage of 1M requests and 400,000 GB-seconds per month. Beyond the free usage, it charges on a per-request basis. The following figure shows the cost per million requests after free usage, calculated based on <a href="https://aws.amazon.com/lambda/pricing/">AWS Lambda pricing</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1008/1*rVa-IX3QIk8kWDKqDpZC-A.png" /></figure><p>We see that:</p><ul><li>A smaller memory allocation lowers the cost, although it increases latency.</li><li>Enabling MKL-DNN not only speeds up the inference but also lowers the cost.</li></ul><h3>Cost efficiency</h3><p>Traditional deep learning servers (like Amazon EC2 instances) have fixed charges per month. On the other hand, the cost of Lambda is proportional to the number of inference requests. We benchmark the cost between mxnet-mkl on AWS Lambda and a commonly used EC2 setup, c5.xlarge+eia2.medium (the same setup as this <a href="https://aws.amazon.com/blogs/machine-learning/serving-deep-learning-at-curalate-with-apache-mxnet-aws-lambda-and-amazon-elastic-inference/?ref=Welcome.AI">study</a>), which costs $208.80 per month.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/878/1*29fC9V8M477ABwWicU7LRA.png" /></figure><p>From the figure above, we observe that the costs of AWS Lambda 2048 MB and the c5 instance intersect at around 15 million image requests, showing that Lambda is more cost efficient when the monthly request count is below that number. 
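This break-even point can be approximated with back-of-the-envelope arithmetic; a sketch assuming the 2020 public Lambda rates ($0.20 per million requests plus $0.0000166667 per GB-second) and an illustrative 400 ms warm invocation:

```python
# Back-of-the-envelope Lambda vs. EC2 break-even point (post-free-tier).
# Pricing figures are the public 2020 rates; the 0.4 s warm latency is an
# illustrative assumption, so the result is approximate.
PER_REQUEST = 0.20 / 1_000_000   # $ per request
PER_GB_SECOND = 0.0000166667     # $ per GB-second
EC2_MONTHLY = 208.80             # c5.xlarge + eia2.medium, $ per month

def lambda_cost_per_request(memory_mb, duration_s):
    return PER_REQUEST + (memory_mb / 1024) * duration_s * PER_GB_SECOND

break_even = EC2_MONTHLY / lambda_cost_per_request(2048, 0.4)
print(f"break-even: {break_even / 1e6:.1f} M requests/month")
# prints: break-even: 15.4 M requests/month
```

At 1024 MB the GB-second term roughly halves, pushing the break-even higher; the exact figure depends on the measured warm latency at each memory size.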
The threshold between Lambda 1024 and c5 is around 22 million requests, indicating a greater potential for AWS Lambda’s cost efficiency.</p><h3>Further Reading</h3><p>In this section, we would like to discuss some observations we discovered while building the pipeline.</p><h4>Cold Start vs Warm Start</h4><p>From the previous analysis, we saw the cost advantages of the Lambda pipeline over the EC2 instance. However, those calculations assume “warm start” latency for most of the requests. Therefore, it is crucial to minimize “cold starts” of the AWS Lambda function and keep them “warm” as much as possible. In fact, an empty AWS Lambda instance will be invoked when the first request arrives. It first sets up the environment and then processes the actual request. This is the so called “cold start,” which takes longer due to the initialization steps. Once the cold start has completed, the AWS Lambda instance caches all the setup during its life cycle and gets “warm”. That’s the reason why “warm start” latency is so much better. Unfortunately, a warm instance only waits for about 5 to 10 min for the next request. The instance will be recycled if the interval between two requests is longer than that period of time, and Lambda will have to invoke another instance starting from a “cold start”. As a result, having a traffic pattern with a request interval of less than 5 mins can reduce the frequency of cold starts. Otherwise, applying a simple <a href="https://read.acloud.guru/how-to-keep-your-lambda-functions-warm-9d7e1aa6e2f0">pre-warming technique</a> can keep AWS Lambda instances alive and solve the cold start issue.</p><h4>Deploying other models</h4><p>It is relatively easy to deploy other deep learning models with this pipeline. We only need to upload the new model file and the corresponding synset.txt file to the resource bucket, and then modify the lambda_function.py accordingly. 
The only restriction when deploying a model on AWS Lambda is the size limitation of the packages: 250 MB for Lambda and 512 MB for “/tmp”. In the above implementation, the pkg_tmp is 85 MB, which leaves more than 400 MB for the model files. This should be sufficient for most MXNet computer vision models, e.g. ResNet152 (240 MB), Faster-RCNN (170 MB), Mask-RCNN (180 MB). However, enabling MKL-DNN would squeeze the space available for the model file to around 200 MB. For large models that don’t fit within the limitations, please consider other techniques to further reduce the package size, such as the <a href="https://github.com/aws-samples/djl-demo/tree/master/lambda-model-serving">Deep Java Library</a> and the <a href="https://github.com/neo-ai/neo-ai-dlr">Deep Learning Runtime</a>. You can also combine <a href="https://aws.amazon.com/blogs/compute/using-amazon-efs-for-aws-lambda-in-your-serverless-applications/">Lambda with Amazon EFS</a> to host larger files.</p><h4>Changing the storage component</h4><p>You may notice that the pipeline is built using various AWS services. This makes it flexible and easy to swap some of the components for other AWS services. Amazon DynamoDB is a fully managed database that performs data queries in a very efficient way. It also allows us to store the inference result along with other customer-defined information. Here, we will show you how to switch the output storage from an Amazon S3 bucket to an Amazon DynamoDB table.<br><br>First, define a simple table that stores three features for each input image. 
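Once such a table exists, each inference result maps onto one item keyed by a UUID. A minimal sketch of building that item in DynamoDB's low-level attribute-value format ({"S": ...} marks a string attribute; the helper name build_item is a placeholder, and the attribute names follow the table design in this section):

```python
import uuid

def build_item(image_name, object_class):
    """Construct the DynamoDB item for one classified image."""
    return {
        "UUID": {"S": str(uuid.uuid4())},   # partition key, unique per image
        "ImageName": {"S": image_name},      # source file name
        "ObjectClass": {"S": object_class},  # predicted class label
    }

item = build_item("chair.jpg", "folding chair")
# real code would then call: dynamodb.put_item(TableName=..., Item=item)
```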
You can edit or add more based on your needs.</p><ul><li>UUID: a UUID4 code as the unique identifier of the image.</li><li>ImageName: the file name of the image.</li><li>ObjectClass: the predicted object class for the image.</li></ul><p>Use either the Amazon DynamoDB management console or the following CLI command to create the DynamoDB table:</p><pre>$ aws dynamodb create-table \<br>      --table-name your-output-table-name \<br>      --attribute-definitions AttributeName=UUID,AttributeType=S \<br>      --key-schema AttributeName=UUID,KeyType=HASH \<br>      --provisioned-throughput \<br>      ReadCapacityUnits=5,WriteCapacityUnits=5</pre><p>Add the following IAM policy to the IAM role we created previously:</p><ul><li>AmazonDynamoDBFullAccess</li></ul><p>Next, modify the lambda_function.py to send the result to the table, prepare the deployment packages and upload the two packages into the resource_bucket. Our pkg_lambda_ddb.zip and pkg_tmp_ddb.tar.gz can be found <a href="https://github.com/waytrue17/MXNet-on-Lambda/tree/master/mxnet-ddb">here</a>.<br><br>Finally, you can either use the AWS Lambda management console or run the following command to update the pipeline and connect the DynamoDB table to the pipeline:</p><pre>$ aws lambda update-function-code \<br>             --function-name your-function-name \<br>             --s3-bucket your-resource-bucket-name \<br>             --s3-key pkg_lambda_ddb.zip</pre><h3>Conclusion</h3><p>In this article, we demonstrated step by step how to build a streaming inference pipeline using MXNet and AWS Lambda. We addressed the AWS Lambda upload limit issue by splitting the deployment package and exploiting “/tmp” directory (giving us over 750 MB of potential storage space in total). We then benchmarked the inference performance of MXNet with and without MKL-DNN. Our results indicated that enabling MKL-DNN could speed up the inference by about 15%. 
Finally, we showed that AWS Lambda inference is more cost-efficient than a c5.xlarge instance when monthly requests stay below 22 million in our case.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7ce6bc8f4cc8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/streaming-inference-pipeline-deploying-mxnet-model-on-aws-lambda-7ce6bc8f4cc8">Streaming Inference Pipeline — Deploying MXNet model on AWS Lambda</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speed up your BERT inference by 3x on CPUs using Apache TVM]]></title>
            <link>https://medium.com/apache-mxnet/speed-up-your-bert-inference-by-3x-on-cpus-using-apache-tvm-9cf7776cd7f8?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/9cf7776cd7f8</guid>
            <category><![CDATA[optimization]]></category>
            <category><![CDATA[bert]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[mxnet]]></category>
            <category><![CDATA[nlp]]></category>
            <dc:creator><![CDATA[Haichen Shen]]></dc:creator>
            <pubDate>Wed, 08 Jul 2020 14:01:03 GMT</pubDate>
            <atom:updated>2020-07-08T14:01:03.619Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introduction</h3><p>BERT (Bidirectional Encoder Representations from Transformers) [1], a pre-trained natural language processing (NLP) model, was proposed by Google in 2018 and now plays an important role in NLP tasks such as text classification, sentiment analysis, question answering, and more. However, BERT is known to be compute-intensive even for inference. As production pipelines shift from lightweight models such as LSTMs to BERT, BERT inference performance is becoming increasingly critical for reaching the desired latency and scalability.</p><p>Recently, deep learning compilers such as TVM [2], MLIR [3], and Glow [4], which use compiler techniques to optimize model inference, have gained traction in both academia and industry. Progress in this area [2, 5] has demonstrated that deep learning compilers can achieve lower inference latency than frameworks. At Amazon Web Services, we make multiple contributions to the <a href="https://tvm.apache.org/">Apache TVM</a> open source project, and we use TVM to speed up many model inference use cases, including BERT, on various platforms. We also offer a service, called <a href="https://aws.amazon.com/sagemaker/neo/">Amazon SageMaker Neo</a>, that provides a managed compilation experience across a variety of frameworks, operators and hardware targets.</p><p>In this blog, we are going to share our recent progress on improving BERT inference performance on CPUs (e.g., c5 and m5 instances on <a href="https://aws.amazon.com/ec2/">Amazon EC2</a>) and show you how to use TVM to reproduce our results. Overall, using TVM can help achieve up to <strong>2.9x</strong> lower latency on EC2 c5.9xlarge instances and up to <strong>2.3x</strong> higher throughput. 
More importantly, achieving these results requires almost no developer effort, as we will demonstrate in this post.</p><h3>BERT inference improvement</h3><p>We will use the BERT-base [1] and DistilBERT [6] models to demonstrate our improvement of BERT inference performance using TVM. The BERT-base model contains 12 layers of transformer blocks, requires 11.2 GFLOPs (floating-point operations) per inference, and has 109M weights. DistilBERT is a distilled, smaller version of the BERT model with only 6 layers of transformer blocks. It requires 5.6 GFLOPs, has 67M weights, and thus runs faster during inference. Computation is counted for batch size 1 and sequence length 128.</p><p>We take the pre-trained BERT-base and DistilBERT models from the <a href="https://gluon-nlp.mxnet.io/">GluonNLP</a> (0.9.1) model zoo. The data type is fp32 in all benchmarks, as Intel CPUs do not support fp16 operations. We measure the inference latency for sequence lengths 64, 128, and 256 when batch size is 1. We compare the latency of <a href="https://mxnet.apache.org/">Apache MXNet</a> (<a href="https://mxnet.apache.org/versions/1.6/get_started?version=v1.6.0&amp;platform=linux&amp;language=python&amp;environ=pip&amp;processor=cpu">mkl 1.6.0</a>) and <a href="https://tvm.apache.org/">Apache TVM</a> on an EC2 <a href="https://aws.amazon.com/ec2/instance-types/c5/">c5.9xlarge</a> instance (Intel CPU with 18 physical cores). The table below shows that the latency can be reduced by 2.1x to 2.9x via TVM optimization for the BERT-base and DistilBERT models. Notably, the latency of DistilBERT under sequence length 128 is only 9.5ms on CPUs after optimization. 
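</p><p>(As a quick sanity check on the model sizes quoted above: the 109M-weight figure for BERT-base can be reproduced from the standard BERT-base hyper-parameters in [1] with a short back-of-envelope count.)</p>

```python
# Rough parameter count for BERT-base (hyper-parameters from the paper).
vocab, max_pos, type_vocab = 30522, 512, 2
hidden, ffn, layers = 768, 3072, 12

embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden  # + LayerNorm
attention  = 4 * (hidden * hidden + hidden)     # Q, K, V, output projections
ffn_block  = (hidden * ffn + ffn) + (ffn * hidden + hidden)
layernorms = 2 * 2 * hidden                     # two LayerNorms per layer
per_layer  = attention + ffn_block + layernorms
pooler     = hidden * hidden + hidden

total = embeddings + layers * per_layer + pooler
print(f"{total / 1e6:.1f}M parameters")          # ~109M, matching the text
```

<p>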
In comparison, the <a href="https://cloudblogs.microsoft.com/opensource/2020/01/21/microsoft-onnx-open-source-optimizations-transformer-inference-gpu-cpu/">ONNX runtime</a> achieves 9ms on similar CPUs using a <strong>3-layer</strong> BERT model which is 2x smaller than the DistilBERT model.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/33f4eb82760f729c18282d22d2187f1b/href">https://medium.com/media/33f4eb82760f729c18282d22d2187f1b/href</a></iframe><p>We further evaluate the throughput of BERT with batch size 4 and sequence length 128 on more types of EC2 instances, including c5.2xlarge, c5.9xlarge, m5.2xlarge, and m5.12xlarge. TVM consistently achieves higher throughput for the BERT base and DistilBERT models with an average of 2x improvement.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OL-v9avLgyiwR7RGKg2d1A.png" /></figure><p>The performance gain comes from three aspects: (1) small operators get fused together to reduce overhead in memory transfer between cache and main memory, (2) kernels generated by TVM achieve better performance in general, and (3) TVM performs graph-level optimizations and replaces some heavyweight math operators such as erf with an approximate implementation.</p><h3>How can you optimize BERT using TVM?</h3><p>Next, let’s walk through the steps to reproduce the results shown above.</p><p>We first launch a CPU instance such as c5.9xlarge on <a href="https://aws.amazon.com/ec2/">Amazon EC2</a> using <a href="https://docs.aws.amazon.com/dlami/latest/devguide/ubuntu18-04.html">AWS Deep Learning AMI, Ubuntu 18.04</a>. 
After we ssh into the server, we activate the conda environment mxnet_p36 and install GluonNLP Python packages:</p><pre>source activate mxnet_p36<br>pip install gluonnlp==0.9.1</pre><p>Next, we install TVM:</p><pre>pip install <a href="https://tvm-build-public.s3-us-west-2.amazonaws.com/dlami-cpu-mkl/tvm-0.7.dev1-cp36-cp36m-linux_x86_64.whl">https://tvm-build-public.s3-us-west-2.amazonaws.com/dlami-cpu-mkl/tvm-0.7.dev1-cp36-cp36m-linux_x86_64.whl</a><br>pip install <a href="https://tvm-build-public.s3-us-west-2.amazonaws.com/dlami-cpu-mkl/topi-0.7.dev1-py3-none-any.whl">https://tvm-build-public.s3-us-west-2.amazonaws.com/dlami-cpu-mkl/topi-0.7.dev1-py3-none-any.whl</a></pre><p>These Python wheel packages are compiled for the Deep Learning AMI, and are not guaranteed to be compatible with other environments. You can also compile TVM from source (instructions in the appendix).</p><p>After we’ve installed TVM, we can optimize the BERT model. You can find the steps to train and fine-tune a BERT model using GluonNLP from the <a href="https://gluon-nlp.mxnet.io/examples/sentence_embedding/bert.html">tutorials</a> on the GluonNLP website. Once we’ve instantiated a BERT model from GluonNLP, we can compile and optimize it using TVM.</p><p>We first need to convert the BERT model from MXNet to TVM Relay IR given an input shape mapping. Relay IR is an intermediate representation used in TVM that represents the computation graph of model architecture.</p><pre># Load model from GluonNLP<br>...<br>mx_model = nlp.model.BERTClassifier(...)</pre><pre># Convert MXNet model to TVM<br>shape_dict = {<br>    &#39;data0&#39;: (batch, seq_length),<br>    &#39;data1&#39;: (batch, seq_length),<br>    &#39;data2&#39;: (batch,)<br>}<br>mod, params = relay.frontend.from_mxnet(mx_model, shape_dict)</pre><p>We then define the target and optimization level to compile the model using TVM. 
The CPUs in both c5 and m5 instances support <a href="https://www.intel.com/content/www/us/en/architecture-and-technology/avx-512-overview.html">AVX-512</a> instructions, which can boost floating-point computation via vectorization, so the CPU architecture (-mcpu=skylake-avx512) is specified in the target to enable this. We also utilize the <a href="https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html">Intel MKL</a> library (-libs=cblas), which features an optimized BLAS library on Intel CPUs, to speed up the computation of operators like matrix multiplication. In addition, we explicitly turn on an optimization pass called “FastMath”. This pass replaces certain heavyweight math operators such as erf with an approximate implementation in order to reduce latency.</p><pre>target = &quot;llvm -mcpu=skylake-avx512 -libs=cblas&quot;<br>with relay.build_config(opt_level=3, required_pass=[&quot;FastMath&quot;]):<br>    graph, lib, cparams = relay.build(mod, target, params=params)</pre><p>Finally, we create a lightweight executor included in TVM, and initialize it with weights and input data. We can then run the executor and examine the output. We also sanity-checked the TVM output against MXNet to verify correctness.</p><pre>ctx = tvm.cpu()<br>rt = runtime.create(graph, lib, ctx)<br>rt.set_input(**cparams)<br>rt.set_input(data0=inputs, data1=token_types, data2=valid_length)<br>rt.run()<br>out = rt.get_output(0)<br>print(out.asnumpy())</pre><pre># verify the correctness<br>tvm.testing.assert_allclose(out.asnumpy(), mx_out.asnumpy(), rtol=1e-3, atol=1e-3)</pre><p>The full script can be found at <a href="https://gist.github.com/icemelon9/860d3d2c9566d6f69fa8112840dd95c1">https://gist.github.com/icemelon9/860d3d2c9566d6f69fa8112840dd95c1</a>.</p><h3>Conclusion</h3><p>In summary, we are seeing significant improvement of BERT inference using TVM on CPUs. The latency of BERT inference is reduced by up to 2.9x and the throughput is increased by up to 2.3x. 
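</p><p>If you would like to verify these numbers on your own instance, any minimal timing harness around the executor works. Below is a standard-library sketch; TVM also provides a built-in time_evaluator on the runtime module for the same purpose.</p>

```python
import statistics
import time

def benchmark(fn, warmup=10, runs=100):
    """Time a zero-argument callable and report latency in milliseconds."""
    for _ in range(warmup):              # warm up caches and lazy initialization
        fn()
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {"mean_ms": statistics.mean(samples_ms),
            "p50_ms": statistics.median(samples_ms)}
```

<p>Call it as benchmark(lambda: rt.run()) after setting the inputs as shown above.</p><p>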
It takes only a few lines of code to achieve this improvement and deploy the model.</p><p>Today, this solution works well in scenarios where padding to a fixed sequence length is possible. We are currently working on expanding the support to include dynamic sequence lengths. Stay tuned for more news.</p><h3>Appendix</h3><h4>Compile TVM from source</h4><p>This <a href="https://tvm.apache.org/docs/install/from_source.html">document</a> from the TVM website provides instructions on how to compile and install TVM from source. Here we only describe the specific dependencies and configurations needed to achieve the best performance for BERT inference.</p><p>First, you need to install two dependencies in addition to the build requirements listed in the TVM document: LLVM (&gt;=6.0) and the Intel MKL library (instructions can be found <a href="https://software.intel.com/content/www/us/en/develop/articles/installing-intel-free-libs-and-python-apt-repo.html">here</a>).</p><p>Next, we compile the source and build the shared library.</p><pre>git clone --recursive <a href="https://github.com/apache/incubator-tvm.git">https://github.com/apache/incubator-tvm.git</a> tvm<br>cd tvm &amp;&amp; mkdir build &amp;&amp; cd build<br>cmake -DUSE_LLVM=/path/to/llvm-config -DUSE_BLAS=mkl -DUSE_OPENMP=intel ..<br>make -j4</pre><p>Finally, we install the TVM python package.</p><pre>cd python; python setup.py install --user; cd ..<br>cd topi/python; python setup.py install --user; cd ../..</pre><h3>Reference</h3><p>[1] Jacob Devlin, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” <em>arXiv preprint arXiv:1810.04805</em> (2018).<br>[2] Tianqi Chen, et al. “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” <em>13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18),</em> 2018.<br>[3] Chris Lattner, et al. “MLIR: A Compiler Infrastructure for the End of Moore’s Law”. 
<em>arXiv preprint arXiv:2002.11054</em>, 2020. <br>[4] Nadav Rotem, et al. “Glow: Graph Lowering Compiler Techniques for Neural Networks”. <em>CoRR</em>, abs/1805.00907, 2018. <br>[5] Yizhi Liu, et al. “Optimizing CNN Model Inference on CPUs.” <em>2019 USENIX Annual Technical Conference (USENIX ATC 19),</em> 2019.<br>[6] Victor Sanh, et al. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.” <em>arXiv preprint arXiv:1910.01108</em> (2019).</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9cf7776cd7f8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/speed-up-your-bert-inference-by-3x-on-cpus-using-apache-tvm-9cf7776cd7f8">Speed up your BERT inference by 3x on CPUs using Apache TVM</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[GluonCV 0.7: ResNeSt, Next Generation Backbone]]></title>
            <link>https://medium.com/apache-mxnet/gluoncv-0-7-resnest-next-generation-backbone-5c797863c92?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/5c797863c92</guid>
            <category><![CDATA[apache-mxnet]]></category>
            <category><![CDATA[gluoncv]]></category>
            <category><![CDATA[stateoftheart]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Jerry Zhang]]></dc:creator>
            <pubDate>Thu, 14 May 2020 18:56:11 GMT</pubDate>
            <atom:updated>2020-05-14T22:16:31.237Z</atom:updated>
            <content:encoded><![CDATA[<p><em>Authors: Jerry Zhang, </em><a href="https://www.linkedin.com/in/thomas-brady/"><em>Thomas Brady</em></a></p><p>Since the introduction of <a href="https://en.wikipedia.org/wiki/AlexNet">AlexNet</a>, the <a href="http://www.image-net.org/challenges/LSVRC/2012/">2012 ImageNet challenge</a> champion, neural networks trained for image classification have been used as backbones for addressing other tasks such as object detection, semantic and instance segmentation, or pose estimation. We typically call these backbone networks, as they are shared by a variety of tasks. <a href="https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035">ResNet</a>, introduced in 2015, has through its many permutations been the reigning backbone for a while now. In fact, most research conducted for applications downstream of image classification still uses ResNet, despite the great strides made in computer vision by the research community. ResNet’s continued efficacy is often attributed to its <a href="https://mlsauce.wordpress.com/2018/05/29/modularity-in-deep-learning-a-farm-to-table-hot-take-part-i-basics/">modularity</a> and ease of use for <a href="https://mxnet.apache.org/api/python/docs/tutorials/packages/gluon/image/pretrained_models.html">transfer learning</a>. With modularity and transfer learning in mind, the GluonCV team is pleased to introduce the new backbone network ResNeSt in this release (GluonCV 0.7).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_NSrruLvr63bx1nSWw-HBw.png" /><figcaption>ResNet vs SE-Net vs ResNeSt</figcaption></figure><p><strong>ResNeSt</strong> retains the inherent modularity and transfer learning capabilities of ResNet while boosting the accuracy of a variety of vision tasks, including image classification, object detection and semantic segmentation. 
Besides ResNet, our work also takes inspiration from the use of channel attention in <a href="https://arxiv.org/abs/1709.01507">SE-Net</a>, a network designed in 2017. When comparing ResNeSt to the previous state-of-the-art backbone network <a href="https://arxiv.org/abs/1905.11946">EfficientNet</a>, which was partially designed using neural architecture search, we discovered that ResNeSt improves both speed and accuracy on a GPU. For example, our ResNeSt-269 achieves slightly higher accuracy than EfficientNet-B7, while lowering the latency by around 30%. In addition, we discovered it is easy to adapt ResNeSt to downstream tasks like object detection and semantic segmentation. By simply swapping <em>ResNet </em>with <strong>ResNeSt</strong> without tuning hyper-parameters, we improve <a href="https://arxiv.org/abs/1506.01497">Faster R-CNN</a> <a href="http://cocodataset.org/#home">COCO</a> <a href="https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52">mAP</a> by approximately 4%, and <a href="https://arxiv.org/abs/1706.05587">DeepLabV3</a> <a href="https://groups.csail.mit.edu/vision/datasets/ADE20K/">ADE20K</a> mIoU by around 3%, taking the previous best performance of a ResNet-based model as our baseline.</p><h3>Image Classification</h3><p>Backbone networks are usually pre-trained on the ImageNet-1K dataset, with their weights then used for various downstream tasks. Thus, accurate classification of images is of great importance to high-level computer vision. Included in the GluonCV 0.7 release are the four latest ResNeSt backbones with differing levels of complexity, accompanied by the respective code we used for training to reproduce our results. In previous releases, our best result came from using the <a href="https://towardsdatascience.com/squeeze-and-excitation-networks-9ef5e71eacd7">SENet-154</a> model as the classification backbone, which produced a top-1 accuracy score of 81.26% on the ImageNet dataset. 
All our new models, except for ResNeSt-50, achieve higher accuracy than SENet-154, our most accurate model from the previous release. The following are the detailed results:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Jtkm_2_3x6SJWQCcQkcTNw.png" /><figcaption>Average Latency vs. Top-1 Accuracy on ImageNet</figcaption></figure><p>In addition, we benchmarked our ResNeSt model against EfficientNet, using a single V100 GPU with a batch size of 16. As shown in the graph below, ResNeSt outperforms EfficientNet, with higher accuracy and lower latency.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hr7-sbd8ZH4MSD11wni4aQ.png" /></figure><h3>Object Detection</h3><p>ResNeSt achieves great results on image classification, but how does it perform on other downstream tasks? To demonstrate that ResNeSt can improve downstream tasks, we replaced the original <em>ResNet</em> with <strong>ResNeSt</strong> in Faster R-CNN, resulting in the mean Average Precision (mAP) improving by 3% as reported in <a href="https://arxiv.org/pdf/2004.08955.pdf">our paper</a>. In GluonCV 0.7, we include a new bag of tricks for our Faster R-CNN models, such as synchronized batch normalization, random scale augmentation, and a deeper box head (4 convolutions + 1 dense). With these improvements, we are able to increase the mAP to 42.7, which is higher than the previous result with ResNet-101. This is slightly higher than what we report in the paper, as we use 26 epochs (a 2x learning rate schedule) as opposed to the 13 epochs in the paper.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ix3U-G78-731QCsegY70Nw.png" /><figcaption>Object Detection with Faster R-CNN</figcaption></figure><h3>Semantic Segmentation</h3><p>We also provide two new semantic segmentation models in this release. 
In our research, we swapped <a href="https://arxiv.org/abs/1706.05587">DeepLabV3</a>’s <em>ResNet</em> backbone with <strong>ResNeSt</strong>, and obtained a 2.8% gain in mean intersection over union (mIoU) and a 1% gain in pixel accuracy, reaching a state-of-the-art <a href="https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=resnest-split-attention-networks">result</a> on the <a href="https://groups.csail.mit.edu/vision/datasets/ADE20K/">ADE20K</a> dataset. By simply dropping in ResNeSt as the backbone, our models outperformed numerous other models designed specifically for semantic segmentation, such as <a href="https://arxiv.org/abs/1908.03930">ACNet</a> and <a href="https://arxiv.org/abs/1908.07919">HRNet</a>, demonstrating the versatility of ResNeSt in generalizing to different tasks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*e21ZGNeYgEmGebdZOOwHTA.png" /><figcaption>Semantic Segmentation with DeepLabV3</figcaption></figure><h3>Summary</h3><p>GluonCV 0.7 brings you the latest image classification backbone, one that significantly improves downstream tasks. The new models introduced in this release improve upon our existing model zoo, providing you with more capable computer vision models. With GluonCV 0.7, you can now use our state-of-the-art ResNeSt in your research or production. 
For more details, you can also check out our paper <a href="https://arxiv.org/pdf/2004.08955.pdf">here</a>.</p><h3>Acknowledgement</h3><p>We sincerely thank the following contributors:<br> <a href="https://github.com/zhreshold">@zhreshold</a>, <a href="https://github.com/adursun">@adursun</a>, <a href="https://github.com/KuangHaofei">@KuangHaofei</a>, <a href="https://github.com/bryanyzhu">@bryanyzhu</a>, <a href="https://github.com/FrankYoungchen">@FrankYoungchen</a>, <a href="https://github.com/ElectronicElephant">@ElectronicElephant</a>, <a href="https://github.com/lgov">@lgov</a>, <a href="https://github.com/astonzhang">@astonzhang</a>, <a href="https://github.com/ruslo">@ruslo</a>, <a href="https://github.com/mjamroz">@mjamroz</a>, <a href="https://github.com/LauLauThom">@LauLauThom</a>, <a href="https://github.com/karan6181">@karan6181</a>, <a href="https://github.com/turiphro">@turiphro</a>, <a href="https://github.com/chinakook">@chinakook</a>, <a href="https://github.com/zhanghang1989">@zhanghang1989</a>, <a href="https://github.com/Jerryzcn">@Jerryzcn</a></p><h3>Links</h3><p>Please Like/Star/Fork/Comment/Contribute if you like GluonCV!<br> <a href="https://gluon-cv.mxnet.io/index.html">GluonCV Website</a><br> <a href="https://github.com/dmlc/gluon-cv">GluonCV Github</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5c797863c92" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/gluoncv-0-7-resnest-next-generation-backbone-5c797863c92">GluonCV 0.7: ResNeSt, Next Generation Backbone</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>