<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Apache MXNet - Medium]]></title>
        <description><![CDATA[Apache MXNet (incubating) is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity. - Medium]]></description>
        <link>https://medium.com/apache-mxnet?source=rss----d1802b798def---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Apache MXNet - Medium</title>
            <link>https://medium.com/apache-mxnet?source=rss----d1802b798def---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 18 Apr 2026 16:51:18 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/apache-mxnet" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Start fitting CV models like Scikit-learn with GluonCV 0.10]]></title>
            <link>https://medium.com/apache-mxnet/start-fitting-cv-models-like-scikit-learn-with-gluoncv-0-10-931ff910a38?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/931ff910a38</guid>
            <category><![CDATA[mxnet]]></category>
            <category><![CDATA[gluoncv]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[automl]]></category>
            <category><![CDATA[pytorch]]></category>
            <dc:creator><![CDATA[Joshua Z. Zhang]]></dc:creator>
            <pubDate>Fri, 26 Mar 2021 04:06:43 GMT</pubDate>
            <atom:updated>2021-03-26T04:06:43.431Z</atom:updated>
            <content:encoded><![CDATA[<h3>Start optimizing CV models in less than 10 lines of code with GluonCV 0.10</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*au87fuYmm7KJo6BSGOo4uw.png" /><figcaption>Train computer vision models with HPO in less than 10 lines of code!</figcaption></figure><p>For a long time, <a href="https://cv.gluon.ai">GluonCV</a> has focused on providing diverse models for various computer vision tasks, from fundamental image classification and object detection to depth estimation and video action recognition. <br>Now it’s time for the long-overdue high-level training/inference API to be available to all users! Are you tired of digging through implementation details in order to compose the right model, e.g., neural network architectures, training loops, metrics, and losses? The new auto estimators and task handlers might be the right tools for you, and new features such as hyper-parameter optimization (HPO) help you reach better models without extensive manual exploration.</p><h3>GluonCV Auto Estimators</h3><p>GluonCV’s Auto Estimators are new in version 0.10. They are similar to scikit-learn estimators and allow you to initiate training in fewer than three lines of code. They also offer a friendly customization interface:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*f3Pm8NK8taWfOkOyxM_Pzg.png" /><figcaption>A single line fit function to train an image classifier from web datasets</figcaption></figure><p>The new auto estimators pack simple predict functions which support raw image paths, data frames, and GluonCV datasets. They also allow you to save and load trained estimators without knowing the underlying model implementation. In the past, for example, you needed to know the corresponding pre-processing and post-processing functions for a particular model, otherwise it would produce wrong outputs. 
We recognize that these error-prone components are tricky and should be avoided. With the new estimator API, you can get rid of unnecessary code snippets and prototype a working solution faster.</p><h3>GluonCV Auto Tasks</h3><p>On top of Auto Estimators, this release also includes Auto Tasks for hyper-parameter optimization (HPO). Compared with the pure algorithm-based estimators, the auto tasks provide identical APIs and functionality but allow you to fit with HPO under a specified num_trials and time_limit. For object detection, they allow multiple algorithms (such as SSDEstimator and FasterRCNNEstimator) to be tuned as a categorical search space. For example, a complete HPO task can be as simple as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1MqF3amRU_ERJ2crXvbDrQ.png" /><figcaption>An intuitive API to search for the best configuration in a specified search space</figcaption></figure><p>Note that the search space configuration (`search_args`) can either be a dictionary as shown in the above example, or a YAML configuration file (.yml, .yaml). With the latter, you can create a code-free tuning job by editing the config file alone.</p><h3>Summary</h3><p><a href="https://github.com/dmlc/gluon-cv/releases/tag/v0.10.0">GluonCV 0.10</a> offers two new modules: GluonCV Auto Estimators and GluonCV Auto Tasks. They allow you to bootstrap computer vision tasks in a few lines of code while still exposing a powerful and customizable training pipeline, supporting hundreds of hyperparameters via YAML config files. Most importantly, the high-level API is deep learning framework agnostic, which allows you to easily extend it with new models and tasks implemented in MXNet or PyTorch. 
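To make the num_trials idea concrete, here is a framework-free sketch of how an HPO driver can sample trial configurations from a mixed categorical/continuous search space. The names and the plain random-search strategy are illustrative assumptions for this post, not the actual GluonCV implementation:

```python
import random

# Illustrative only: these names are NOT the gluoncv API.
# A search space mixes categorical choices (e.g. which estimator)
# with continuous ranges (e.g. the learning rate).
search_space = {
    'estimator': ['ssd', 'faster_rcnn'],  # categorical
    'lr': (1e-4, 1e-1),                   # continuous (low, high)
}

def sample_trial(space, rng):
    """Draw one configuration from the search space."""
    cfg = {}
    for name, spec in space.items():
        if isinstance(spec, tuple):
            cfg[name] = rng.uniform(*spec)   # continuous parameter
        else:
            cfg[name] = rng.choice(spec)     # categorical parameter
    return cfg

def random_search(space, score_fn, num_trials, seed=0):
    """Evaluate num_trials sampled configs and keep the best one."""
    rng = random.Random(seed)
    trials = [sample_trial(space, rng) for _ in range(num_trials)]
    return max(trials, key=score_fn)

# toy scoring function standing in for a real validation metric
best = random_search(search_space, lambda c: -abs(c['lr'] - 0.01),
                     num_trials=25)
```

In the real auto tasks, each sampled configuration launches a full training run and the score is a validation metric; time_limit simply caps how long that trial loop may run.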
To get started, try these <a href="https://cv.gluon.ai/build/examples_auto_module/index.html">tutorials</a> first!</p><h3>Acknowledgment</h3><p>Special thanks to @<a href="https://github.com/dmlc/gluon-cv/commits?author=tmwangcas">tmwangcas</a>, <a href="https://github.com/karan6181">@karan6181</a>, @<a href="https://github.com/bryanyzhu">bryanyzhu</a>, @<a href="https://github.com/dmlc/gluon-cv/commits?author=Zarand3r">Zarand3r</a>, @<a href="https://github.com/dmlc/gluon-cv/commits?author=ChaiBapchya">ChaiBapchya</a>, @<a href="https://github.com/dmlc/gluon-cv/commits?author=domin1985">domin1985</a>, @<a href="https://github.com/yinweisu">yinweisu</a>, @<a href="https://github.com/mzolfaghari">mzolfaghari</a>, @<a href="https://github.com/dmlc/gluon-cv/commits?author=StevenJokess">StevenJokess</a> for contributing to this release.</p><p>Please Like/Star/Fork/Comment/Contribute if you like GluonCV, and don’t forget to check out our latest content. Stay tuned!</p><ul><li><a href="https://gluon-cv.mxnet.io/index.html">GluonCV Website</a></li><li><a href="https://github.com/dmlc/gluon-cv">GluonCV Github</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=931ff910a38" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/start-fitting-cv-models-like-scikit-learn-with-gluoncv-0-10-931ff910a38">Start fitting CV models like Scikit-learn with GluonCV 0.10</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Optimizing inference on CPU in MXNet 2.0]]></title>
            <link>https://medium.com/apache-mxnet/optimizing-inference-on-cpu-in-mxnet-2-0-1852ff9729b4?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/1852ff9729b4</guid>
            <category><![CDATA[mxnet]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[quantization]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[intel]]></category>
            <dc:creator><![CDATA[Adam Grygielski]]></dc:creator>
            <pubDate>Mon, 22 Feb 2021 19:20:32 GMT</pubDate>
            <atom:updated>2021-02-22T23:55:40.205Z</atom:updated>
            <content:encoded><![CDATA[<h3>Optimizing inference on CPU in the Upcoming Apache MXNet 2.0</h3><p><em>Authors: </em><a href="https://github.com/grygielski"><em>Adam Grygielski</em></a><em>, </em><a href="https://github.com/bgawrych"><em>Bartłomiej Gawrych</em></a><em>, </em><a href="https://github.com/sfraczek"><em>Sylwester Fraczek</em></a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*W8xA-XS2fwuTVqoK0FYN6Q.jpeg" /></figure><h3>Introduction</h3><p>Deep learning inference is the process of deploying a trained neural network to perform prediction on unseen data. It is a commonly deployed workload on cloud servers. To provide a good user experience it has to be fast, so it is important to use optimized solutions; optimization also reduces hardware load and energy cost.</p><p>There are two types of performance bottlenecks to consider when optimizing a neural network: heavy compute-bound operations like convolution or fully-connected layers, and many small memory-bound elementwise operations such as ReLU or transposition.</p><p>A couple of methods have been invented for optimizing neural networks. Operator fusion chains operations together, speeding up memory-bound operations by reducing memory I/O. Quantization speeds up compute-bound operations by lowering precision, thereby simplifying computation while also reducing the amount of data being processed.</p><p>The upcoming version 2.0 of Apache MXNet (incubating) introduces some changes in the interface. The Gluon API has become the default, superseding the Symbol and Model APIs; it unifies the flexibility of imperative programming with the performance benefits of symbolic programming. MXNet 2.0 also fully supports NumPy semantics.</p><h3>Operator Fusion</h3><p>In previous versions of Apache MXNet, operator fusion was enabled by default when MXNet was built with the Intel oneDNN library. 
However, in MXNet 2.0 the Module API and GraphExecutor were replaced by the CachedOp executor, and Gluon’s Block is now the first-choice API to define and execute a model, instead of the symbolic API.</p><p>Let’s recap what operator fusion is and what benefits come along with it. Before we move on to that, it is worth mentioning that MXNet supports two execution modes:</p><ul><li>imperative mode — the model is executed step-by-step as defined, and we can access each tensor between operator executions</li><li>symbolic mode — based on the defined model, the MXNet engine creates a graph which can be optimized, e.g. by pre-allocating memory; we can’t access tensors between operator executions</li></ul><p>Imperative mode is great for debugging, while symbolic mode provides a big performance boost. You can find more information about the imperative and symbolic modes <a href="https://classic.d2l.ai/chapter_computational-performance/hybridize.html">here</a>.</p><p>Every model can be represented as a directed graph where operators and tensors are represented as nodes. Each connection between operators, where the output of one operator is the input of another, is represented in the graph as an edge.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*03DOI_29XeAIgv2f36lUCQ.png" /></figure><p>Operator fusion is nothing more than replacing two or more consecutive operators with a single new operator that combines the functions of all of them. As a result, we get a smaller graph — only one kernel is invoked instead of several (one per operator). These fused kernels are optimized to run as fast as possible by utilizing all available features of a modern CPU. 
We also get rid of MXNet engine overhead between operator calls and reduce memory access (read/write) operations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2txh-y_TauT50PgxJygKag.png" /></figure><p><strong>How do I fuse operators in my model?</strong><br>To fuse a model in MXNet 2.0 there are two requirements:</p><ul><li>the model must be defined as a subclass of HybridBlock or Symbol</li><li>the model must contain one of the specific operator patterns which can be fused</li></ul><p>At the time of publishing this article, MXNet supports three major fusion patterns:</p><ul><li>Convolution + BatchNorm + ReLU/GELU/LeakyReLU/sigmoid/SoftReLU/tanh + Elementwise Add</li><li>FullyConnected + Elementwise Op (ReLU, Square, Logistic, SoftReLu, BoundedRelu, Sqrt, Exp, Abs)</li><li>BatchNorm + ReLU</li></ul><p>As an example, we define a small network (a sample block from the ResNet architecture):</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d17e0c84759af266b197f47298d06444/href">https://medium.com/media/d17e0c84759af266b197f47298d06444/href</a></iframe><p>Both the HybridBlock and Symbol classes provide an API to easily run operator fusion. All we have to do is add a single line of code enabling the fusion passes on our model:</p><pre>net.optimize_for(data, backend=&#39;MKLDNN&#39;)</pre><p>To apply the passes on a Symbol, we call the <strong>optimize_for</strong> function on a symbol instance without passing input data. Notice that Symbol’s <strong>optimize_for</strong> does not work in-place, so we must assign the result to a new variable:</p><pre>optimized_symbol = sym.optimize_for(backend=&#39;MKLDNN&#39;)</pre><p>For the above model definition, in a naive benchmark with artificial data we gain up to a 1.25x speedup without any accuracy loss on our testing machine. 
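To see why the Convolution + BatchNorm pattern is so profitable, note that at inference time BatchNorm is just an affine transform, so its parameters can be folded directly into the convolution weights and bias. Here is a minimal NumPy sketch of that idea (using a 1x1 convolution, which reduces to a matrix multiply, rather than oneDNN's actual fused kernels):

```python
import numpy as np

def conv1x1(x, w, b):
    # a 1x1 convolution over (N, C_in) activations is just a matmul
    return x @ w.T + b

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # fold BatchNorm's scale/shift into the conv weights and bias
    scale = gamma / np.sqrt(var + eps)
    return w * scale[:, None], (b - mean) * scale + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w, b = rng.normal(size=(2, 3)), rng.normal(size=2)
gamma, beta = rng.normal(size=2), rng.normal(size=2)
mean, var = rng.normal(size=2), rng.uniform(0.5, 2.0, size=2)

# two-operator reference: convolution followed by BatchNorm
y_ref = gamma * (conv1x1(x, w, b) - mean) / np.sqrt(var + 1e-5) + beta
# single fused operator: same result, one kernel, one memory pass
w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)
y_fused = conv1x1(x, w_f, b_f)
```

The two outputs agree to floating-point precision, which is why the fused graph can drop the BatchNorm node entirely.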
Detailed data on other models’ speedups is shown in the “Performance and accuracy results” part of the article.</p><h3>Quantization</h3><p>Similar to the 1.x versions, MXNet 2.0 continues to support model quantization from float32 to either signed or unsigned INT8 (s8/u8). Using a quantized model for inference on modern CPUs can greatly increase the performance of your workloads. Quantized models utilize the VNNI (Vector Neural Network Instructions) instruction set [2] to speed up execution of compute-heavy operations like convolution or a dot product. Moreover, using the int8 data type reduces the amount of data read in memory-bound operations like pooling or elementwise functions. MXNet uses optimized kernels delivered by the <a href="https://github.com/oneapi-src/oneDNN">Intel® oneDNN</a> [3] library to speed up model execution.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*umGfUUazPg1Dz0HgtHVLUA.png" /></figure><p>During the quantization procedure, a floating-point model is first fused using the fusion patterns mentioned in the previous chapter. After that, operators that have int8 kernels are marked as quantized and surrounded by quantize/requantize/dequantize operators. Finally, the model can either be calibrated to get rid of the requantize operators or left as it is to calculate scales at runtime. There are two major ways of quantizing an fp32 model:</p><ul><li><strong>without calibration</strong>. All we have to do is call the <strong>quantize_net </strong>function with an fp32 Gluon model and a list of input data shapes. However, this approach is not recommended in terms of performance: it results in requantize nodes in the graph that calculate min/max values during each forward pass. 
Calibrating a model before deploying it results in much faster inference.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pObYh5K4SWzo58F0XIvZeg.png" /></figure><ul><li><strong>with calibration</strong>. In this approach, after quantizing the graph, the model is run on user-supplied calibration data to collect statistics of the quantized layers and set min/max values as parameters. The choice of these thresholds is based on the chosen calibration method.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cnEgomIqdjvlvz9T5EYLmg.png" /></figure><p>Currently, three calibration methods are supported:</p><ul><li><strong>naive </strong>— min/max values from the calibration run.</li><li><strong>entropy </strong>— uses KL divergence to determine the best symmetrical quantization thresholds for a given histogram of values.</li><li><strong>custom </strong>— uses a user-defined CalibrationCollector to control the calibration process.</li></ul><p>In MXNet 2.0, the quantization procedure has been adjusted to work well with Gluon models, since Gluon is now the main API. The goal was to allow the user to quantize an fp32 HybridBlock model in just a few lines of code.</p><h4>Quantization flow in MXNet 2.0</h4><p>As an example of the quantization procedure, we will use a pretrained <strong>resnet50_v1 </strong>from <strong>model_zoo.vision</strong>. 
To get it, we simply have to run the following code:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a3d86475380175353c3a849648fbaa6d/href">https://medium.com/media/a3d86475380175353c3a849648fbaa6d/href</a></iframe><p>To compare performance, we will use a simple function calculating total inference time of the model on artificial data:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/edddf3cd87726f2e6205e5d840e21136/href">https://medium.com/media/edddf3cd87726f2e6205e5d840e21136/href</a></iframe><p>Now, to get a quantized model, all we need to do is call the <strong>quantize_net</strong> function from <strong>contrib.quantize</strong> and provide a list of input data shapes:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/49087ee38f8ed55c7fc940f7c1ecf17d/href">https://medium.com/media/49087ee38f8ed55c7fc940f7c1ecf17d/href</a></iframe><p>This way, we have a quantized `resnet50_v1` ready to run inference tasks. However, since we didn’t use any calibration data, we won’t get a satisfactory performance boost.<br>We can compare these two models with our <strong>benchmark_net </strong>function to calculate the total speedup. To get better results, we should first hybridize both models with the <strong>static_shape </strong>and <strong>static_alloc </strong>flags set to <strong>True</strong>. This lets MXNet know that we won’t be changing shapes in the model at runtime, so it can pre-allocate memory and avoid runtime allocations. To properly evaluate the performance benefits of quantization, we should compare against an fp32 model with the <strong>MKLDNN </strong>backend enabled. 
This is because the quantization procedure fuses the graph before quantizing it; otherwise we would be measuring the combined benefit of fusion and quantization.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/425f36346e2864b6e955e0e1c9cd542c/href">https://medium.com/media/425f36346e2864b6e955e0e1c9cd542c/href</a></iframe><p>Output:</p><pre>&gt; Speedup: 0.72x</pre><p>As we can see, we didn’t get any performance benefit from using int8. It turns out that calculating min/max values at runtime adds a big overhead. Now let’s try calibrating the model before actual execution. The only difference in the code is that we have to provide <strong>quantize_net </strong>with calibration data stored in a <strong>DataLoader</strong>. For this example we will use the same <strong>dummy_data</strong>, but in a real use case it would most probably be a small portion of the validation dataset. Notice that we no longer have to provide the <strong>data_shape </strong>attribute because it will be taken from <strong>calib_data</strong>. If we don’t specify the <strong>calib_batches </strong>parameter, the whole <strong>calib_data</strong> will be used.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b7c7712b0b918dc7b7358a734bdeca71/href">https://medium.com/media/b7c7712b0b918dc7b7358a734bdeca71/href</a></iframe><p>Output:</p><pre>&gt; Speedup: 3.8x</pre><p>This time we got a significant performance boost with just a few lines of code.</p><p>The other aspect of lowering the precision of a model is the difference in its accuracy. We will check that on the previously tested <strong>resnet50_v1 </strong>with the ImageNet dataset. To run this example you will need the ImageNet dataset prepared with <a href="https://cv.gluon.ai/build/examples_datasets/imagenet.html">this tutorial</a> and stored in <strong>path_to_imagenet</strong>. 
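The arithmetic behind the naive calibration mode can be sketched in a few lines of NumPy: take the global min/max over the calibration batches, derive an affine scale, and map float32 values to 8 bits. This is an illustration of the idea only, not MXNet's actual int8 kernels:

```python
import numpy as np

def naive_calibrate(batches):
    # 'naive' calibration: global min/max across the calibration run
    lo = min(float(b.min()) for b in batches)
    hi = max(float(b.max()) for b in batches)
    return lo, hi

def quantize_u8(x, lo, hi):
    # affine quantization of float32 values in [lo, hi] to uint8
    scale = 255.0 / (hi - lo)
    q = np.clip(np.round((x - lo) * scale), 0, 255).astype(np.uint8)
    return q, scale

def dequantize_u8(q, lo, scale):
    return q.astype(np.float32) / scale + lo

rng = np.random.default_rng(0)
calib = [rng.uniform(-1.0, 1.0, size=(8, 16)).astype(np.float32)
         for _ in range(10)]
lo, hi = naive_calibrate(calib)

# round-trip error on calibrated data stays within half a
# quantization step
q, scale = quantize_u8(calib[0], lo, hi)
err = float(np.abs(calib[0] - dequantize_u8(q, lo, scale)).max())
```

Calibration fixes lo/hi (and hence the scale) ahead of time, which is exactly what lets the runtime drop the per-batch min/max bookkeeping that made the uncalibrated model slow.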
Let’s compare the top-1 and top-5 accuracy of the standard fp32 model with the quantized int8 model calibrated using the <strong>naive </strong>and <strong>entropy </strong>calibration modes. We will use only 10 batches of the validation dataset to calibrate the quantized model.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/11275649828c1ef5d868d8c3f89b367d/href">https://medium.com/media/11275649828c1ef5d868d8c3f89b367d/href</a></iframe><p>Output:</p><pre>&gt; FP32 Top1 Accuracy: 0.76364 Top5 Accuracy: 0.93094<br>&gt; INT8Naive Top1 Accuracy: 0.76028 Top5 Accuracy: 0.92796<br>&gt; INT8Entropy Top1 Accuracy: 0.76404 Top5 Accuracy: 0.93042</pre><p>We can see that we get almost the same accuracy for the quantized model, but we can run it with much less computing power and lower latency. The difference between calibration methods depends on the model itself, the activation layers used, and the size of the calibration data, so the method should be chosen empirically. However, if you are not experienced with custom calibration methods, you should stick to <strong>entropy </strong>by default.</p><h4>Custom layer collectors and calibrating the model</h4><p>We have prepared an interface that gives the user as much flexibility as possible in almost every step of quantization; one result of this approach is the custom LayerOutputCollector parameter of the quantization API mentioned earlier.</p><p>Layer collectors are responsible for collecting statistics of each node in the graph — this means we can observe the input/output data of every operator executed. We are able to do this by utilizing the <strong>register_op_hook </strong>method of the <strong>HybridBlock </strong>class.</p><p>To write your own layer collector, your class has to inherit from the provided <strong>CalibrationCollector</strong> class to stay consistent with the API. 
As code is worth a thousand words, below we show an example implementation of a CalibrationCollector:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/92d4c12456e1ecab6ab7645e37807b30/href">https://medium.com/media/92d4c12456e1ecab6ab7645e37807b30/href</a></iframe><p>We ‘inject’ the names of the nodes which require calibration into the <strong>include_layers </strong>instance attribute of the custom collector. It is organized this way because collecting statistics for every node is very time-consuming; thanks to this attribute you can skip nodes that do not need calibration (you can also ignore it and implement your own logic for picking the nodes to calibrate).</p><p>After all statistics are collected, the <strong>post_collect </strong>function is called. In <strong>post_collect </strong>you can implement additional logic to process the gathered data and return a dictionary mapping each node name to a tuple of min and max thresholds, which will be used to calibrate the nodes.</p><p><strong>Example of usage with the quantization API:</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5061127ac8d8db0c7083ff9045ab1346/href">https://medium.com/media/5061127ac8d8db0c7083ff9045ab1346/href</a></iframe><p>Writing your own collector logic is not easy, but if you know what you are doing you can get better accuracy. 
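The collector contract described above can be summarized in plain Python. The following is a hypothetical stand-in, with method names mirroring the description rather than the exact MXNet 2.0 signatures:

```python
class MinMaxCollector:
    """Hypothetical sketch of a CalibrationCollector-style class."""

    def __init__(self, include_layers=None):
        # names of nodes that require calibration ('injected' by the
        # framework); None means collect statistics for every node
        self.include_layers = include_layers
        self.stats = {}

    def collect(self, name, output):
        # called for each executed operator via a forward hook
        if self.include_layers is not None and name not in self.include_layers:
            return  # skip nodes that do not need calibration
        lo, hi = min(output), max(output)
        if name in self.stats:
            old_lo, old_hi = self.stats[name]
            lo, hi = min(lo, old_lo), max(hi, old_hi)
        self.stats[name] = (lo, hi)

    def post_collect(self):
        # post-process the gathered data; a real collector could clip
        # extreme values here (as BertLayerCollector does)
        return dict(self.stats)

collector = MinMaxCollector(include_layers={'conv0'})
collector.collect('conv0', [0.1, -0.4, 2.0])
collector.collect('conv0', [3.5, -1.0])
collector.collect('relu0', [9.9])  # ignored: not in include_layers
thresholds = collector.post_collect()
```

The returned dictionary of per-node (min, max) thresholds is the shape of result the quantization flow expects from post_collect.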
In a previous version of MXNet and GluonNLP, our colleagues wrote a layer collector for BERT models [1], <a href="https://github.com/dmlc/gluon-nlp/blob/v0.10.x/src/gluonnlp/calibration/collector.py">BertLayerCollector</a>, which clips the minimum/maximum values of some layers to get better accuracy (note that it’s not compliant with MXNet 2.0 — you can refer to the <a href="https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/quantization.py#L193-L320">predefined collectors in the MXNet 2.0 code</a>).</p><h3>Performance and accuracy results</h3><p>Here we present performance results for some CV models. We compare fused and quantized models to base models hybridized with static memory allocation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wXFtTCaoQ59BcjMUx_JrvQ.png" /><figcaption>Relative Inference Performance (img/sec) for Batch Size 128</figcaption></figure><p>As you can see, the popular ResNet50 gains a 62% speedup from operator fusion alone, and the quantized version is more than 6x faster than the base fp32 model. We may also observe that <strong>mobilenet </strong>benefits the most from the presented optimizations, mainly due to efficient fusing. Moreover, in the quantized version, relu6 (the activation function in mobilenet) can be achieved just by adjusting the scale factors of the int8 computation, so we could optimize it even more.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8kCHeVdfCsL6vyE7umhnSw.png" /><figcaption>ImageNet(ILSVRC2012) TOP1 validation accuracy</figcaption></figure><p>These are the TOP1 ImageNet accuracy results for computer vision models; as you can see, by calling a single function you can gain a significant performance boost almost without losing accuracy. In the chart, the biggest accuracy drop is visible for mobilenet v2, but it is still only half a percentage point below the fp32 model. 
On the other hand, VGG19 has no accuracy loss at all and, after quantization, is over 4 times faster than the baseline model.</p><h3>Summary</h3><p>In this article, we showed how you can speed up your MXNet model on CPU with oneDNN support. If you’re fine with a tiny accuracy drop, model quantization is a great method of significantly improving performance. Otherwise, you can still use operator fusion backed by oneDNN primitives without any side effects; in that case the performance boost is not as large as with quantization, but accuracy doesn’t change.</p><p>Moreover, the article described how you can utilize the new <strong>CalibrationCollector </strong>class to have better control over the calibration process. Such a solution gives you flexibility and a chance to increase the accuracy of your quantized model.</p><p>Many things have changed with the introduction of MXNet 2.0. The Gluon API has taken over from Executor-based execution, so users now have to manually call the <strong>optimize_for </strong>function to benefit from the CPU optimization passes. We have also introduced the new <strong>BatchNorm + ReLU</strong> fusion to address ResNet v2 models. Our future plans are to add support for more and more int8 models, not only in the CV domain. We are also exploring new fusion opportunities so we can speed up popular models even more.</p><h3>References</h3><p>[1] Devlin, Jacob, et al. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).<br>[2] Banu Nagasundaram. 
“Vector Neural Network Instructions Enable Int8 AI Inference on Intel Architecture.” <a href="https://www.intel.ai/vnni-enables-inference/">https://www.intel.ai/vnni-enables-inference/</a>, 2019.<br>[3] Intel® oneDNN library (<a href="https://github.com/oneapi-src/oneDNN">https://github.com/oneapi-src/oneDNN</a>)</p><h3>Benchmark environment</h3><p><strong>CPU:</strong> Intel<strong>®</strong> Xeon<strong>®</strong> Platinum 8280L CPU @ 2.70GHz<br><strong>Memory:</strong> 187 GB RAM<br><strong>OS:</strong> CentOS Linux 7 (Core)<br><strong>Compiler:</strong> gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1–5)<br><strong>MXNet Commit SHA:</strong> 3746babc8fdb211584a9a661207061cb646b01a8<br><strong>oneDNN Commit SHA:</strong> 2e4732679f0211bb311780d0f383cf2dce9baca7</p><h3>Notices and Disclaimers</h3><p>© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.</p><p>Performance varies by use, configuration and other factors. Learn more at <a href="http://www.Intel.com/PerformanceIndex">www.Intel.com/PerformanceIndex</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1852ff9729b4" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/optimizing-inference-on-cpu-in-mxnet-2-0-1852ff9729b4">Optimizing inference on CPU in MXNet 2.0</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using MXNet NDArray for fast GPU algebra on images]]></title>
            <link>https://medium.com/apache-mxnet/using-mxnet-ndarray-for-fast-gpu-algebra-on-images-a3cd63c94c62?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/a3cd63c94c62</guid>
            <category><![CDATA[gpu]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[apache-mxnet]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[numpy]]></category>
            <dc:creator><![CDATA[Olivier Cruchant]]></dc:creator>
            <pubDate>Mon, 15 Feb 2021 18:55:04 GMT</pubDate>
            <atom:updated>2021-02-15T18:55:04.030Z</atom:updated>
            <content:encoded><![CDATA[<h3>Using MXNet NDArray for Fast GPU Algebra on Images</h3><p><em>By Philippe Saadé and Olivier Cruchant</em></p><h3>Introduction</h3><p><a href="https://mxnet.apache.org/versions/1.7.0/">Apache MXNet</a> is a modern deep learning framework featuring both a pleasant developer experience and high-performance training and inference. In order to get the best possible performance while exposing a friendly interface, the MXNet community re-developed from scratch a suite of optimized tensor algebra kernels, historically accessible via the Python MXNet <a href="https://mxnet.incubator.apache.org/versions/1.7.0/api/python/docs/api/ndarray/index.html">NDArray library</a>. The initial MXNet <a href="https://mxnet.apache.org/versions/1.0.0/api/python/ndarray/ndarray.html">NDArray API</a> is very similar to <a href="https://numpy.org/doc/stable/reference/">NumPy</a>, yet it features major internal differences enabling high-performance deep learning: (1) it can run on GPU, (2) it supports auto-differentiation, (3) it runs asynchronously, executing only the code that needs to run, and in an optimized order. In order to make the powerful MXNet NDArray API fully accessible to NumPy developers, the MXNet community released in 2019 the mxnet.numpy library, which implements the NumPy API (<a href="https://medium.com/apache-mxnet/a-new-numpy-interface-for-apache-mxnet-incubating-dbb4a4096f9f">Announcement on MXNet Medium</a>). In this blog post, we illustrate the strength of the MXNet NDArray library for fast and compact algebra over images. We first highlight a couple of NDArray concepts and then assemble them into a function that we provide both as an mxnet.ndarray implementation and an mxnet.numpy implementation. As an illustration, we use a fictional use-case of anomalous image detection via simple pixel difference. 
This can be a reasonable baseline when looking for anomalous areas over a batch of near-identical images with the same viewpoint and luminosity, for example the images below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iAa3VR0-Bk97SYoMFrNv4w.png" /><figcaption><em>From left to right: (1) correct sample (2) synthetic anomalous sample with a stain (3) anomalous sample after difference with the cross-batch mean.</em></figcaption></figure><h4><strong>Notable concepts</strong></h4><p>We will use the following libraries:</p><pre>import itertools as itr<br>import os<br>import mxnet as mx<br>from mxnet import image as mxim<br>from mxnet import init, gluon, nd<br>import matplotlib.cm as cm<br>from matplotlib import pyplot as plt<br>from matplotlib import image as mpim</pre><p>MXNet features the image library that we can use to read images:</p><pre>picture = mxim.imread(&#39;orig1.JPG&#39;)</pre><pre><em># .asnumpy() converts MXNet NDArrays to NumPy arrays</em><br>plt.imshow(picture.asnumpy())</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*jtp5ghCvqkseI42IniGyHQ.jpeg" /><figcaption><em>This is a non-anomalous image. In our synthetic dataset, we have 10 such images, and two images with visual anomalies</em></figcaption></figure><p>MXNet NDArray leverages the MXNet <a href="https://mxnet.apache.org/versions/1.7.0/api/python/docs/api/mxnet/context/index.html">context</a> concept, which controls the hardware context (CPU or GPU) of objects. For example, we can send our previously loaded image to the GPU (if on a GPU-equipped instance) with the following snippet:</p><pre>pic_on_gpu = picture.copyto(mx.gpu())</pre><p>Note that we can also use picture.as_in_context(mx.gpu()) to copy the picture to the GPU; the advantage of the latter is that it performs the copy only if the image is not already on the GPU. 
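Before looking at the MXNet implementation below, the heart of the approach (subtract the cross-batch mean image, then flag images with large pixel deviations) fits in a few lines of plain NumPy. This sketch omits the convolution smoothing step that the full function adds:

```python
import numpy as np

def find_anomalies(batch, threshold):
    # batch: (N, H, W) array of aligned grayscale images
    mean_img = batch.mean(axis=0)          # cross-batch mean image
    deltas = np.abs(batch - mean_img)      # per-pixel deviation
    top = deltas.max(axis=(1, 2))          # worst pixel of each image
    return np.nonzero(top > threshold)[0]  # indices of anomalous images

rng = np.random.default_rng(0)
batch = rng.normal(0.5, 0.01, size=(12, 32, 32))  # near-identical images
batch[3, 10:15, 10:15] += 0.5                     # synthetic "stain"
flagged = find_anomalies(batch, threshold=0.2)
```

The threshold here is arbitrary and, like the constants in the full function below, would need tuning on representative data.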
<br> <br> Similarly to NumPy, MXNet NDArray is a great platform to apply algebraic transformations to batches of records. Writing transformations over batches of images is particularly relevant on GPUs, which excel at batch processing. In this demo we use <a href="https://mxnet.apache.org/versions/1.0.0/api/python/ndarray/ndarray.html#mxnet.ndarray.Concat">ndarray.concat()</a>, <a href="https://mxnet.apache.org/versions/1.0.0/api/python/ndarray/ndarray.html#mxnet.ndarray.mean">ndarray.mean()</a>, <a href="https://mxnet.apache.org/versions/1.0.0/api/python/ndarray/ndarray.html#mxnet.ndarray.abs">ndarray.abs()</a> and <a href="https://mxnet.apache.org/versions/1.0.0/api/python/ndarray/ndarray.html#mxnet.ndarray.max">ndarray.max()</a>, all very similar to their NumPy counterparts, to manipulate and analyze a batch of images.<br> <br> Since MXNet is primarily designed for deep learning, it features abundant neural network primitives. Its imperative Python front-end <a href="https://mxnet.apache.org/versions/1.4.1/gluon/index.html">Gluon</a> provides numerous model layers, and in this blog, even though we use neither deep learning nor machine learning, we borrow from Gluon its 2D-convolution (<a href="https://mxnet.apache.org/versions/1.6/api/python/docs/api/gluon/nn/index.html#mxnet.gluon.nn.Conv2D">mxnet.gluon.nn.Conv2D</a>) with all coefficients initialized at 1 to apply a smoothing effect and isolate contiguous areas of anomalous pixels.</p><h3>Wrapping it all in a compact function</h3><p>The inline code block below proposes an example function to run the above-mentioned analysis. It first reads a batch of images, then subtracts the cross-batch mean image from all of them, applies a hard-coded convolution to each of them and returns the images that have high-value pixels after the convolution. Note that this code features a couple of hard-coded constants that would be worth tuning over representative data. 
This sample is provided as an illustration of the NDArray capabilities and may deserve further refinement and testing before being used in the real world.</p><h4>Implementation with mxnet.ndarray</h4><pre>def mxnd_find_anomalies(folder, gpu=False, save_viz=True, threshold=300):<br><em>    &quot;&quot;&quot;<br>    Parameters<br>        ----------<br>        folder : str<br>            local directory with the batch of images to score<br>        gpu : bool, optional<br>            whether to use GPU or CPU<br>        save_viz : bool, optional<br>            saves the pictures of deltas and convolutions<br>        threshold : int, optional<br>            conv threshold to be anomalous. Should be tuned!           <br>    &quot;&quot;&quot;</em><br>    <br>    ctx = mx.gpu() if gpu else mx.cpu()  <em># Set context</em><br>    <br>    <em># read images in a 4D-NDArray</em><br>    pics = os.listdir(folder)<br>    ims = [mxim.imread(folder+&#39;/&#39;+pic).expand_dims(0).as_in_context(ctx) for pic in pics]<br>    ims = nd.concat(*ims, dim=0)<br>        <br>    <em># Compute the average image</em><br>    avg = nd.mean(ims.astype(&#39;float32&#39;), axis=0)<br>    <br>    <em># Remove the mean from every image, then average-pool on color</em><br>    deltas = nd.mean(ims.astype(&#39;float32&#39;)-avg, axis=3)<br>    <br>    <em># Apply a batched convolution to all the deltas, then apply abs</em><br>    conv = gluon.nn.Conv2D(1, kernel_size=5, use_bias=False)<br>    conv.initialize(init.Constant(1))<br>    conv.collect_params().reset_ctx(ctx)<br>    <br>    <em># we expand dims for conv, which expects 3D pics (with channels)</em><br>    conv_delta = nd.abs(conv(deltas.expand_dims(1)))<br>    <br>    <em># Max-pool over the image</em><br>    top_deltas = nd.max(conv_delta, axis=(1,2,3))<br>    <br>    <em># Return pics that have a convolution &gt; threshold</em><br>    anomalies = list(itr.compress(pics, top_deltas &gt; threshold))<br>    an_indexes = 
list(itr.compress(range(len(pics)), top_deltas &gt; threshold))<br>    <br>    if save_viz:  <em># optionally, save images for interpretability</em><br>        for a, i in zip(anomalies, an_indexes):<br>            mpim.imsave(&#39;pixmap-&#39;+a, deltas[i].asnumpy(), format=&#39;png&#39;)<br>            mpim.imsave(&#39;convmap-&#39;+a, conv_delta[i][0].asnumpy(), format=&#39;png&#39;)<br>                <br>    return list(anomalies)</pre><h3>Implementation with mxnet.numpy</h3><p>In the code block below, we implement the same logic as above, while replacing the mxnet.ndarray API with the mxnet.numpy API. Since mxnet.numpy implements the NumPy API, we need to replace a couple of methods with their NumPy counterparts, notably:</p><ul><li>Expanding array dimensions: array.expand_dims(0) above becomes np.expand_dims(array, 0)</li><li>Concatenation: nd.concat(*ims, dim=0) becomes np.concatenate(ims, axis=0)</li></ul><p>We further need to import mxnet.numpy and notify the MXNet backend of our use of NumPy semantics, with the extra import and configuration below:</p><pre>from mxnet import np, npx<br>npx.set_np()</pre><p>And here is the final implementation of our baseline anomaly detection function using the MXNet NumPy API:</p><pre>def mxnp_find_anomalies(folder, gpu=False, save_viz=True, threshold=300):<br>    <em>&quot;&quot;&quot;<br>    Parameters<br>        ----------<br>        folder : str<br>            local directory with the batch of images to score<br>        gpu : bool, optional<br>            whether to use GPU or CPU<br>        save_viz : bool, optional<br>            saves the pictures of deltas and convolutions<br>        threshold : int, optional<br>            conv threshold to be anomalous. 
Should be tuned!<br>    &quot;&quot;&quot;</em><br>    <br>    ctx = mx.gpu() if gpu else mx.cpu()<em>  # Set context</em><br>    <br>    <em># read images in a 4D-NDArray</em><br>    pics = os.listdir(folder)<br>    ims = [np.expand_dims(mxim.imread(folder+&#39;/&#39;+pic), 0).as_in_context(ctx) for pic in pics]<br>    ims = np.concatenate(ims, axis=0)<br>        <br>    <em># Compute the average image</em><br>    avg = np.mean(ims.astype(&#39;float32&#39;), axis=0)<br>    <br>    <em># Remove the mean from every image, then average-pool on color</em><br>    deltas = np.mean(ims.astype(&#39;float32&#39;)-avg, axis=3)<br>    <br>    <em># Apply a batched convolution to all the deltas, then apply abs</em><br>    conv = gluon.nn.Conv2D(1, kernel_size=5, use_bias=False)<br>    conv.initialize(init.Constant(1))<br>    conv.collect_params().reset_ctx(ctx)<br>    <br><em>    # we expand dims for conv, which expects 3D pics (with channels)</em><br>    conv_delta = np.abs(conv(np.expand_dims(deltas, 1)))<br>    <br>    <em># Max-pool over the image</em><br>    top_deltas = np.max(conv_delta, axis=(1,2,3))<br>    <br>    <em># Return pics that have a convolution &gt; threshold</em><br>    anomalies = list(itr.compress(pics, top_deltas &gt; threshold))<br>    an_indexes = list(itr.compress(range(len(pics)), top_deltas &gt; threshold))<br>    <br>    if save_viz:  <em># optionally, save images for interpretability</em><br>        for a, i in zip(anomalies, an_indexes):<br>            mpim.imsave(&#39;pixmap-&#39;+a, deltas[i].asnumpy(), format=&#39;png&#39;)<br>            mpim.imsave(&#39;convmap-&#39;+a, conv_delta[i][0].asnumpy(), format=&#39;png&#39;)<br>                <br>    return list(anomalies)</pre><p>Here are two anomalies the function found over our synthetic data:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sqVO0tXGVdHCh3wxnfIcQA.png" /></figure><p>Both functions have comparable latencies, which makes sense since they use the same 
back-end kernels. Here are the execution latencies, excluding visualization creation, over a batch of 12 1000px*750px images with MXNet 1.7, on a GPU-equipped Amazon SageMaker ml.p3.2xlarge Notebook (average of 10 runs):</p><ul><li>The mxnet.ndarray code runs in <strong>780ms</strong> and the mxnet.numpy code in <strong>750ms</strong> on the 8 vCPUs (Intel Xeon E5)</li><li>The mxnet.ndarray code runs in <strong>100ms</strong> and the mxnet.numpy code in <strong>75ms</strong> on the GPU (NVIDIA Tesla V100), close to 90% faster than on CPU</li></ul><h3>Conclusion</h3><p>Even though MXNet is a deep learning framework, its primitives can be creatively used beyond neural network development. In this post, we showed how to perform simple, GPU-accelerated algebra over images using its NDArray library, which NumPy developers can use with no extra learning curve via the mxnet.numpy library. Do not hesitate to take a look at it, contribute and engage with the community!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a3cd63c94c62" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/using-mxnet-ndarray-for-fast-gpu-algebra-on-images-a3cd63c94c62">Using MXNet NDArray for fast GPU algebra on images</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Which one is the best algorithm for video action recognition?]]></title>
            <link>https://medium.com/apache-mxnet/which-one-is-the-best-algorithm-for-video-action-recognition-298fb5c4ad4f?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/298fb5c4ad4f</guid>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[videos]]></category>
            <category><![CDATA[action-recognition]]></category>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Yi Zhu]]></dc:creator>
            <pubDate>Wed, 06 Jan 2021 03:36:25 GMT</pubDate>
            <atom:updated>2021-01-06T03:36:24.936Z</atom:updated>
            <content:encoded><![CDATA[<p>Author: <a href="https://github.com/bryanyzhu">Yi Zhu</a>, Applied Scientist at Amazon</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ioRec3e4bqWluDZTYk3CjA.png" /><figcaption><strong>A chronological overview of recent representative work in video action recognition</strong></figcaption></figure><p>Over time, computer vision researchers have shifted their focus from image to video, 2D to 3D, and supervised to unsupervised. One of the trends, video understanding, has become a hot topic. Video human action recognition, a basic task within video understanding, also attracts lots of attention. As shown in the timeline above, more and more algorithms for video action recognition are proposed each year. So, how do newcomers to this field know which is the right model for their use-case?</p><p>To address this question, here comes the GluonCV 0.9.0 release, which provides not only a large model zoo for video action recognition (<a href="https://cv.gluon.ai/model_zoo/action_recognition.html">46 pretrained models</a>, both in PyTorch and Apache MXNet), but also step-by-step tutorials (<a href="https://github.com/dmlc/gluon-cv/blob/master/scripts/action-recognition/feat_extract_pytorch.py">feature extraction</a>, <a href="https://cv.gluon.ai/build/examples_torch_action_recognition/finetune_custom.html">model finetuning</a>, <a href="https://github.com/dmlc/gluon-cv/blob/master/scripts/action-recognition/get_flops.py">FLOPS computation</a>), a 30-page <a href="https://arxiv.org/abs/2012.06567">survey paper</a> covering 200+ recent papers, and video lectures on <a href="https://www.youtube.com/watch?v=Jwt0Wtlv_uo&amp;list=PLGCZZzK2R0X6RQiQrbShUULsbF1qeC17d">YouTube</a>. 
If you want to get started with computer vision on video, don’t hesitate to try it out, as we believe this will help you develop new skills and figure out the right model for your scenario.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*U2xai8pWyT_Ry6AVEA1zRQ.png" /><figcaption><strong>Our survey paper and CVPR2020 tutorial video lectures.</strong></figcaption></figure><p>Video researchers have long found it difficult to compare results due to variations between datasets and differing evaluation methods. In response, we began reproducing a large number of popular algorithms using the same dataset and the same data augmentation steps (see the <a href="https://github.com/dmlc/gluon-cv/blob/master/scripts/action-recognition/ARXIV.md">readme</a> for steps to reproduce the results). Throughout this project we made several interesting observations. First, despite the fact that 3D CNNs have higher accuracy than 2D CNNs, they also have higher latency, which makes them less than ideal for deployment to edge devices or real-time inference. Second, pre-training a model on a large-scale dataset is usually more effective than improving the model itself. For example, the <a href="https://arxiv.org/abs/1904.02811">CSN</a> model (ICCV 2019) pretrained on a large-scale dataset easily outperforms recent methods by a large margin. Hence, for real-world applications, it might be more cost-effective to collect and clean your data, rather than pursuing the latest SOTA model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CNhZB4gXWyo3FVSQRIwytg.png" /><figcaption><strong>Benchmark results on the Kinetics400 dataset. Time computation does not include IO cost.</strong></figcaption></figure><p>Furthermore, training a SOTA video action recognition model consumes lots of compute resources. Even using a high-end machine with 8 V100 GPUs, it will take more than a week to obtain a decent accuracy. 
To help you iterate faster, we provide support for <a href="https://pytorch.org/tutorials/intermediate/ddp_tutorial.html">DistributedDataParallel (DDP)</a> in PyTorch, and <a href="https://arxiv.org/abs/1912.00998">Multi-grid training</a>. As shown in the bar plot below, under the same setting of using 8 GPUs, the baseline using DataParallel (DP) needs 250 hours to finish the 100-epoch training of the I3D model, while GluonCV can finish it within 41 hours, which is 6 times faster without performance degradation. If we use 4 machines with a total of 32 V100s, the training can be done in 10 hours, achieving near-linear scalability. By further comparison, the <a href="https://github.com/open-mmlab/mmaction2/blob/master/configs/recognition/i3d/README.md">mmaction2</a> toolbox required 148 hours to train the same model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Km2o16WCvEo9W036ksIkEA.png" /><figcaption><strong>Training time comparison (a standard I3D model with ResNet50 backbone).</strong></figcaption></figure><h3>Summary</h3><p>In summary, GluonCV 0.9.0 provides a whole package for learning video action recognition: a survey paper, video lectures, Jupyter demos, the model zoo, and a deployment tutorial. In future releases, we will add more models in PyTorch, covering object tracking, multi-modality video modeling, self-supervised representation learning, etc. Welcome aboard with GluonCV, feel free to raise issues, and contribute back!</p><h3>Acknowledgement</h3><p>Special thanks to <a href="https://github.com/Arthurlxy">@Arthurlxy</a> <a href="https://github.com/ECHO960">@ECHO960</a> <a href="https://github.com/zhreshold">@zhreshold</a> <a href="https://github.com/yinweisu">@yinweisu</a> for their support in this release. Thanks to <a href="https://github.com/coocoo90">@coocoo90</a> for contributing the CSN and R2+1D models. And thanks to the other contributors for the bug fixes and improvements. 
Please Like/Star/Fork/Comment/Contribute if you like GluonCV!</p><ul><li><a href="https://gluon-cv.mxnet.io/index.html">GluonCV Website</a></li><li><a href="https://github.com/dmlc/gluon-cv">GluonCV Github</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=298fb5c4ad4f" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/which-one-is-the-best-algorithm-for-video-action-recognition-298fb5c4ad4f">Which one is the best algorithm for video action recognition?</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using Optuna to Optimize Gluon Hyperparameters]]></title>
            <link>https://medium.com/apache-mxnet/using-optuna-to-optimize-gluon-hyperparameters-2fefa2549ba2?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/2fefa2549ba2</guid>
            <category><![CDATA[mxnet]]></category>
            <category><![CDATA[optuna]]></category>
            <category><![CDATA[hyperparameter-tuning]]></category>
            <dc:creator><![CDATA[Crissman Loomis]]></dc:creator>
            <pubDate>Mon, 07 Dec 2020 15:02:46 GMT</pubDate>
            <atom:updated>2020-12-07T15:02:46.632Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nqT7f1htNYHTv9AFFsOfYA.png" /></figure><h3>Gluon + Optuna!</h3><p>Optuna is a hyperparameter optimization framework applicable to machine learning frameworks and black-box optimization solvers. The Gluon library in Apache MXNet (incubating) provides a clear, concise, and simple API for deep learning. It makes it easy to prototype, build, and train deep learning models without sacrificing training speed. Combining the two allows for automatic tuning of hyperparameters to find the best performing models.</p><h3>Creating the Objective Function</h3><p>Optuna is a black-box optimizer: it needs an objective function that returns a numerical value evaluating the performance of the hyperparameters, which Optuna uses to decide where to sample in upcoming trials.</p><p>In our example, we will be doing this for <a href="https://github.com/optuna/optuna/blob/master/examples/gluon_simple.py">identifying MNIST characters</a> from the <a href="https://github.com/optuna/optuna">Optuna GitHub</a> examples folder. In this case, the objective function frame looks like this:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/88940298542b91d94f65c982c731074e/href">https://medium.com/media/88940298542b91d94f65c982c731074e/href</a></iframe><p>Notice that the objective function is passed an Optuna-specific trial argument, which is used to specify which hyperparameters should be tuned. The function returns the accuracy of the model as val_acc, which Optuna uses as feedback on the performance of the trial.</p><h3>Defining the hyperparameters to be tuned</h3><p>Optuna allows you to define the types and ranges of hyperparameters you want to tune directly within your code using the trial object. 
This saves the effort of learning specialized syntax for hyperparameters, and also means you can use normal Python code for looping through or defining your hyperparameters.</p><p>Optuna supports a variety of hyperparameter settings, which can be used to optimize floats, integers, or discrete categorical values. Numerical values can be suggested from a logarithmic continuum as well. In our MNIST example, we optimize the model optimizer hyperparameters within the objective function:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/66f68a39912ca30fcbf28b7c825fad24/href">https://medium.com/media/66f68a39912ca30fcbf28b7c825fad24/href</a></iframe><p>The optimizer is selected from a list of possible choices by optimizer_name = trial.suggest_categorical(&quot;optimizer&quot;, [&quot;Adam&quot;, &quot;RMSprop&quot;, &quot;SGD&quot;]). The learning rate should vary by orders of magnitude, so log=True is used in the setting of lr = trial.suggest_float(&quot;lr&quot;, 1e-5, 1e-1, log=True), which will vary the values logarithmically from 0.00001 to 0.1.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0bf13d93e34e5031ef6e13346320976c/href">https://medium.com/media/0bf13d93e34e5031ef6e13346320976c/href</a></iframe><p>For the definition of the model itself, Optuna leverages eager mode to allow normal Python looping to determine the number of layers and nodes in each layer with trial.suggest_int(“n_layers”, 1, 3) for the layers and trial.suggest_int(“n_units_l{}”.format(i), 4, 128) for the number of nodes in each layer, such as n_units_l1 or n_units_l2.</p><h3>Running the Trials</h3><p>The default sampler in Optuna is the Tree-structured Parzen Estimator (TPE), a form of Bayesian optimization. 
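</p><p>To make the suggest calls concrete, here is a compact sketch of an objective in the same spirit as the embedded gists. The FakeTrial class below is a made-up stand-in for Optuna&#39;s trial object, included only so the sketch runs without Optuna installed, and the returned score is a stub; with Optuna installed you would delete FakeTrial, train a real Gluon model inside the objective, and pass the function straight to study.optimize:</p>

```python
import math
import random

class FakeTrial:
    """Made-up stand-in for Optuna's trial object, used only to run this sketch."""
    def suggest_categorical(self, name, choices):
        return random.choice(choices)

    def suggest_float(self, name, low, high, log=False):
        if log:  # sample uniformly in log space, as Optuna does with log=True
            return math.exp(random.uniform(math.log(low), math.log(high)))
        return random.uniform(low, high)

    def suggest_int(self, name, low, high):
        return random.randint(low, high)

def objective(trial):
    # Each suggest_* call both declares the search space and draws a value
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "RMSprop", "SGD"])
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 3)
    # Stub score standing in for the validation accuracy of a trained model;
    # a real objective would build and train a Gluon network with these values
    return random.random()

score = objective(FakeTrial())
print(score)
```

<p>Note that optimizer_name, lr and n_layers are drawn but unused in this stub; in a real objective they would parameterize the Gluon trainer and network.</p><p>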
Optuna uses TPE to search more efficiently than a random search, by choosing points closer to previous good results.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jsSKCHgDTPZnvF6C.png" /></figure><p>To run the trials, create a study object, which sets the direction of optimization (&quot;maximize&quot; or &quot;minimize&quot;), along with other settings. Then call study.optimize(objective, n_trials=100, timeout=600) to do one hundred trials, with a timeout after ten minutes.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/966857ff753de811880ecbe3ecc862d4/href">https://medium.com/media/966857ff753de811880ecbe3ecc862d4/href</a></iframe><p>Each trial is chosen after evaluating all the trials that have been previously done, using the sampler to make a smart guess about where the best hyperparameter values can be found.</p><p>The best values from the trials can be accessed through study.best_trial, and other methods of viewing the trials, such as formatting in a dataframe, are available.</p><h3>Pruning — Early Stopping of Poor Trials</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2TrkgiQX9ALnHSDZ.png" /></figure><p>Pruning trials is a form of early stopping which terminates unpromising trials, so that computing time can be used for trials that show more potential. 
To enable pruning, it’s necessary to open up the black box of the objective function a little: the function must report intermediate results to Optuna, so that Optuna can compare the trial’s progress with that of other trials and decide whether to stop it early; it must check whether Optuna has requested termination; and it must let the trial in session terminate cleanly after recording its results.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b6fcb05d017690acf371688f865c43f2/href">https://medium.com/media/b6fcb05d017690acf371688f865c43f2/href</a></iframe><p>trial.report is used to communicate with Optuna about the progress of the trial. In this example, the objective function communicates the current epoch and the accuracy. trial.should_prune() is how Optuna communicates to the objective function whether it should terminate early.</p><h3>To the Future, and Beyond!</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*DFXSyK7P8O8WLm9A.png" /><figcaption>Plot Contour Visualization</figcaption></figure><p>For those interested, Optuna has many other features, including <a href="https://optuna.readthedocs.io/en/latest/reference/visualization.html">visualizations</a>, alternative samplers, optimizers, and pruning algorithms, as well as the ability to create user-defined versions. If you have more computing resources available, Optuna also provides an easy interface for parallel trials to increase tuning speed.</p><p>Give <a href="https://optuna.org">Optuna</a> a try today. 
See <a href="https://optuna.readthedocs.io/en/stable/installation.html">Installation</a> and the <a href="https://github.com/optuna/optuna">Optuna GitHub</a>.</p><p><em>This post uses Gluon with Apache MXNet 1.7.0 and optuna 2.3.0.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2fefa2549ba2" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/using-optuna-to-optimize-gluon-hyperparameters-2fefa2549ba2">Using Optuna to Optimize Gluon Hyperparameters</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Tallying votes in a meeting using pose detection: A case study with Nexity and GluonCV]]></title>
            <link>https://medium.com/apache-mxnet/tallying-votes-in-a-meeting-using-pose-detection-a-case-study-with-nexity-and-gluoncv-7aaf5b3fede6?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/7aaf5b3fede6</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[sagemaker]]></category>
            <category><![CDATA[mxnet]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[gluoncv]]></category>
            <dc:creator><![CDATA[Valentin Lecerf]]></dc:creator>
            <pubDate>Wed, 21 Oct 2020 20:34:51 GMT</pubDate>
            <atom:updated>2020-10-21T20:34:50.945Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/647/1*652n9t4BgonxnTjBGsjMLA.png" /><figcaption>Picture from the camera; on the left side the scenario, and on the right side the output of the applied model</figcaption></figure><h3>Who is Nexity, what was the challenge?</h3><p><a href="https://www.nexity.fr/en/group"><em>Nexity</em></a><em> is France’s leading integrated real estate group, with business operations in all areas of real estate development and services (residential real estate, commercial real estate, real estate services to individuals and real estate services to companies, distribution networks and client relations, major urban projects) and enjoys a strong presence across all industry cycles (short, medium and long). Nexity has heavily adopted the cloud as a catalyst for technology-driven innovation.</em></p><h3>How did Nexity solve the challenge?</h3><p><em>This initiative occurred in the context of an internal innovation hackathon organized at Nexity. The goal of the hackathon was to raise executive-level awareness of the potential of cloud technologies. Volunteer teams had to compete by demonstrating an innovative system that could be built quickly using cloud services. Each team could choose its own scenario. We decided to work on the services we could offer at our general meetings of co-owners. Real-estate management involves a lot of in-person meetings where decisions are submitted to show-of-hand votes and manually tallied. The team “KFC”, consisting of engineers, tech leaders and project managers Antoine Pellet, Vincent Boidin, Xavier Top, Valentin Lecerf, Jérémy Desvaux and Grégory Hivin, decided to tackle the challenge of automatically tallying votes of a show-of-hand poll in real-time from fictional yet representative pictures of meeting rooms. 
In this post, the team tells about its experience with Apache MXNet and GluonCV.<br> <br> “Initially, our idea was to use an object detection model to detect raised hands. Due to the very limited implementation time, we switched to a pre-trained pose estimation model from the MXNet </em><a href="https://gluon-cv.mxnet.io/model_zoo/pose.html"><em>GluonCV model zoo</em></a><em>. This model gave us a list of coordinates representing people, and from this we could easily deduce whether a right or left hand was up. We used the coordinates of hands, shoulders and head to decide if an arm is raised or not. Without specific tuning we observed a 70% success rate. The main challenge was the inference latency, as we implemented a web application displaying a real-time vote count over a 1080p video stream. Some important aspects to keep in mind for this type of solution are (1) the performance (need to handle a video stream) and (2) the video size and quality.”</em></p><h3>The code</h3><p><em>The team developed and tested from an </em><a href="https://aws.amazon.com/sagemaker/"><em>Amazon SageMaker</em></a><em> Notebook and provided the following sample code.<br> <br> Installation of gluoncv and import of relevant packages:</em></p><pre><em>!
pip install gluoncv</em></pre><pre><em>from gluoncv import model_zoo, data, utils<br>from gluoncv.data.transforms.pose import detector_to_simple_pose, heatmap_to_coord<br>import mxnet as mx<br>from mxnet import nd<br>from mxnet.gluon.model_zoo import vision as models</em></pre><p><em>Loading the detector and the pose estimation model, and restricting the detector to the person class only:</em></p><pre><em>detector = model_zoo.get_model(&#39;yolo3_mobilenet1.0_coco&#39;, pretrained=True)<br>pose_net = model_zoo.get_model(&#39;simple_pose_resnet18_v1b&#39;, pretrained=True)<br>detector.reset_class([&quot;person&quot;], reuse_weights=[&#39;person&#39;])</em></pre><p><em>Run person detection and feed its output into pose estimation. Note that the pose network must run before its heatmap can be converted to coordinates (in the GluonCV tutorials, x and img are the transformed tensor and the display image returned by data.transforms.presets.yolo.load_test):</em></p><pre><em>class_IDs, scores, bounding_boxs = detector(x)<br>pose_input, upscale_bbox = detector_to_simple_pose(img, class_IDs, scores, bounding_boxs)<br>predicted_heatmap = pose_net(pose_input)<br>pred_coords, confidence = heatmap_to_coord(predicted_heatmap, upscale_bbox)</em></pre><p><em>For each pose detection, check if the left hand or the right hand is raised (a smaller y coordinate means higher in the image):</em></p><pre><em>if (coords[10][1] &lt; coords[8][1] &lt; coords[6][1]  # left hand raised<br>  or coords[9][1] &lt; coords[7][1] &lt; coords[5][1]):  # right hand raised<br>    return True<br>else:<br>    return False</em></pre><p><em>The full mapping between prediction coefficients and body joints was listed, to facilitate development:</em></p><pre><em># 0 # Facepoints<br># 1 # Facepoints<br># 2 # Facepoints<br># 3 # Facepoints<br># 4 # Facepoints<br># 5 # right shoulder<br># 6 # left shoulder<br># 7 # right elbow<br># 8 # left elbow<br># 9 # right hand<br># 10 # left hand<br># 11 # pelvis right<br># 12 # pelvis left<br># 13 # right knee<br># 14 # left knee<br># 15 # right foot<br># 16 # left foot</em></pre><h3>Going further with GluonCV</h3><p><em>Abundant models, training and inference scripts are available in the </em><a href="https://gluon-cv.mxnet.io/contents.html"><em>gluoncv</em></a><em> package, which as of May 2020 features 14 pre-trained models and 4 tutorials for pose estimation alone!</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/597/1*eHOhHdT1keobnugqOUY1KQ.png" /><figcaption><em>Accuracy-Throughput tradeoff in the </em><a href="https://gluon-cv.mxnet.io/model_zoo/pose.html"><em>gluoncv pose estimation model zoo</em></a></figcaption></figure><p><a href="https://gluon-cv.mxnet.io/"><em>GluonCV</em></a><em> is a Python computer vision toolkit built on top of the efficient Apache MXNet deep learning framework. GluonCV also comes with features dedicated to inference optimization, such as pruned models and quantization functions.<br> Once scientific code is developed, it can be exposed as a service via a web server. </em><a href="https://github.com/awslabs/multi-model-server"><em>Multi-Model Server (MMS)</em></a><em> is a framework-agnostic model server that can be deployed as-is on compatible platforms, but that also ships as the managed backend of the </em><a href="https://github.com/aws/sagemaker-mxnet-inference-toolkit"><em>SageMaker MXNet</em></a><em> inference containers. Finally, numerous ideas can be exploited to improve the performance and economics of the deployment, such as model compilation and hardware acceleration. 
Those concepts were presented in a </em><a href="https://medium.com/apache-mxnet/faster-cheaper-leaner-improving-real-time-ml-inference-using-apache-mxnet-2ee245668b55"><em>previous blog post</em></a><em>.<br> <br> This is not the first time deep learning is used on a real-estate or urban planning use-case, yet this is a particularly original and creative one! Among existing deep learning research relating or relevant to real-estate, we note in particular:<br> <br> — In </em><a href="https://medium.com/compass-true-north/similar-homes-and-homepage-recommendations-new-frontiers-of-ai-in-real-estate-1102330561eb"><em>Launching Similar Homes and Real-Time Personalized Recommendations</em></a><em> (Gautam Narula, Ran Ding, Samuel Weiss, and Joseph Sirosh) Compass researchers describe the challenge of recommending real-estate listings. A deep embedding model is developed with Apache MXNet to learn listing similarity. Significant business impact is reported (+153% click-through rate and +107% engagement actions)<br> — In 2018, </em><a href="https://developmentseed.org/"><em>Development Seed</em></a><em> </em><a href="https://medium.com/devseed/use-label-maker-and-amazon-sagemaker-to-automatically-map-buildings-in-vietnam-a63090fb399f"><em>described using Apache MXNet</em></a><em> to classify building presence from aerial imagery<br> — In an </em><a href="https://aws.amazon.com/fr/blogs/machine-learning/using-deep-learning-on-aws-to-lower-property-damage-losses-from-natural-disasters/"><em>AWS ML Blog post</em></a><em>, the property data analytics company </em><a href="https://www.eagleview.com/"><em>EagleView</em></a><em> presents aerial computer vision solutions developed with Apache MXNet to assess urban damage created by natural disasters. 
Using deep learning, EagleView can assess property damage within 24 hours and inform insurers and homeowners more rapidly.</em></p><h3>Conclusion</h3><p><em>In summary, if you are working on a computer vision use-case, chances are GluonCV and Apache MXNet can help you drastically reduce your time to result while maintaining state-of-the-art accuracy and efficiency. Please do not hesitate to give it a try, contribute to those projects and reach out to the community on the forum </em><a href="https://discuss.mxnet.io/"><em>discuss.mxnet.io/</em></a><em>!</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7aaf5b3fede6" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/tallying-votes-in-a-meeting-using-pose-detection-a-case-study-with-nexity-and-gluoncv-7aaf5b3fede6">Tallying votes in a meeting using pose detection: A case study with Nexity and GluonCV</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Analyze Driving Scenes with GluonCV 0.8]]></title>
            <link>https://medium.com/apache-mxnet/analyze-driving-scenes-with-gluoncv-0-8-56ac37664977?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/56ac37664977</guid>
            <category><![CDATA[semantic-segmentation]]></category>
            <category><![CDATA[depth-estimation]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[semi-supervised-learning]]></category>
            <category><![CDATA[autonomous-cars]]></category>
            <dc:creator><![CDATA[Yi Zhu]]></dc:creator>
            <pubDate>Tue, 15 Sep 2020 07:10:18 GMT</pubDate>
            <atom:updated>2020-09-15T07:10:18.599Z</atom:updated>
<content:encoded><![CDATA[<p>Author: <a href="https://github.com/bryanyzhu">Yi Zhu</a>, Applied Scientist at Amazon</p><p>Driving scene analysis models are now appearing in more and more applications, for example generating street scenes in an <a href="https://www.youtube.com/watch?v=ayPqjPekn7g">AI-rendered virtual world</a> or building <a href="https://www.youtube.com/watch?v=cB_ez2MNHMo">self-driving cars</a>. Recently, <a href="https://github.com/intel-isl/OpenBot">OpenBot</a> combined a smartphone and a small electric vehicle into a low-cost robot that can support workloads like pedestrian following and real-time autonomous navigation. Do you want to build some more advanced features for your own applications or robots? Given the surging interest in autonomous driving applications, we include a new task called depth estimation and more semantic segmentation models in this latest GluonCV 0.8 release.</p><h3><a href="https://gluon-cv.mxnet.io/model_zoo/depth.html">Depth Estimation</a></h3><p>Depth estimation is a long-standing computer vision task and an important step towards inferring scene geometry from 2D images. The goal of depth estimation is to predict the depth value of each pixel, given a single RGB image (monocular setting) or left-right image pairs (stereo setting). Recently, self-supervised depth estimation has come to dominate the field thanks to its strong performance and the fact that it requires no annotated data. <a href="http://visual.cs.ucl.ac.uk/pubs/monoDepth/">Monodepth</a> and <a href="https://arxiv.org/abs/1806.01260">Monodepth2</a> are important milestones which have established new single-camera depth estimation baselines.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*Ej4JSOb-rHVHuhT-AhXcmw.gif" /></figure><p>Given Monodepth2’s popularity and strong performance, we provide a GluonCV implementation of Monodepth2 in this release. 
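As background on the stereo setting, depth and disparity are related by depth = focal_length × baseline / disparity. A tiny pure-Python illustration (the focal length and baseline are approximate KITTI values, used here only for the example):

```python
# Stereo depth from disparity: depth = f * B / d, where f is the focal length
# in pixels, B the camera baseline in meters, and d the disparity in pixels.
# The defaults below are approximate KITTI values, used purely for illustration.
def disparity_to_depth(disparity_px, focal_px=721.0, baseline_m=0.54):
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

far = disparity_to_depth(10.0)    # small disparity -> distant object (~39 m)
near = disparity_to_depth(100.0)  # large disparity -> close object (~3.9 m)
```

Self-supervised stereo methods like Monodepth2 exploit exactly this relation: a network that predicts good disparities implicitly predicts depth, with the left-right image pair providing the training signal instead of labels.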
We provide models pretrained on <a href="http://www.cvlibs.net/datasets/kitti/">KITTI</a> and training logs for all three settings: monocular, stereo and monocular+stereo. All results are reproducible and similar to the numbers reported in the original publications.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Strsfw6hKWtH_3u8QIXnXQ.png" /></figure><p>To get you started, we provide a number of detailed tutorials, such as:</p><ul><li><a href="https://gluon-cv.mxnet.io/build/examples_depth/demo_monodepth2.html">how to predict depth from an image</a></li><li><a href="https://gluon-cv.mxnet.io/build/examples_depth/videos_monodepth2.html">how to generate a demo video as seen above</a></li><li><a href="https://gluon-cv.mxnet.io/build/examples_depth/train_monodepth2.html">how to train a Monodepth2 model</a></li><li><a href="https://gluon-cv.mxnet.io/build/examples_depth/test_monodepth2_posenet.html">how to predict the trajectory of image sequences using PoseNet in Monodepth2</a></li></ul><p>Once you have the estimated depth and trajectory, it will be easier for your autonomous bot to avoid obstacles. Note that the monocular and monocular+stereo settings were added after the release date, so if you are eager to try them out right away, feel free to install the nightly version of GluonCV.</p><h3><a href="https://gluon-cv.mxnet.io/model_zoo/segmentation.html">Better/Faster segmentation models</a></h3><p>We include two new semantic segmentation models in this release: <a href="https://arxiv.org/abs/1809.02983">DANet</a> and <a href="https://arxiv.org/abs/1902.04502">FastSCNN</a>. DANet is one of the state-of-the-art models on several segmentation benchmarks, and FastSCNN is one of the most popular real-time segmentation models. 
The following table shows their performance on the Cityscapes validation set.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rphKouhY82daQMoqLo8aEg.png" /></figure><p>Our FastSCNN model is an improved variant from our <a href="https://arxiv.org/abs/2004.14960">recent paper</a> using semi-supervised learning, i.e., its performance of 72.3 mIoU is better than the 68.6 mIoU reported in the original paper. To the best of our knowledge, 72.3 mIoU makes this the highest-scoring implementation of FastSCNN and one of the best real-time semantic segmentation models. It can run at a speed of 80.8 fps on a single V100 GPU card given 1024x2048 input video sequences.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*sFIl2R2eYGQu4_9hTCbSjg.gif" /></figure><p>If you want to train a segmentation model on your own dataset/location, we provide <a href="https://gluon-cv.mxnet.io/tutorials/index.html#semantic-segmentation">tutorials</a> on how to train and test segmentation models. You can also try our semi-supervised learning method to automatically generate pseudo labels, so that you don’t need to label your own data! We have demonstrated strong cross-domain generalization performance in <a href="https://arxiv.org/abs/2004.14960">our paper.</a></p><h3>Summary</h3><p>GluonCV v0.8 adds Monodepth2, DANet and FastSCNN to our model zoo, which can help you analyze driving scenes for your own applications. You can use these models for much more than just driving scenes, though. One example would be indoor scene analysis. So open up your mind: the sky is the limit. 
Please check out our <a href="https://gluon-cv.mxnet.io/tutorials/index.html">tutorials</a> and <a href="https://gluon-cv.mxnet.io/model_zoo/index.html">model zoo</a> for more details.</p><h3>Acknowledgement</h3><p>We sincerely thank the following contributors:<br><a href="https://github.com/zhreshold">@zhreshold</a>, <a href="https://github.com/KuangHaofei">@KuangHaofei</a>, <a href="https://github.com/xdeng7">@xdeng7</a>, <a href="https://github.com/ytian8">@ytian8</a>, <a href="https://github.com/FrankYoungchen">@FrankYoungchen</a>, <a href="https://github.com/bryanyzhu">@bryanyzhu</a>, <a href="https://github.com/Jerryzcn">@Jerryzcn</a>, <a href="https://github.com/yezqNLP">@yezqNLP</a>, <a href="https://github.com/LauLauThom">@LauLauThom</a>, <a href="https://github.com/karan6181">@karan6181</a>, <a href="https://github.com/chinakook">@chinakook</a>, <a href="https://github.com/tkhe">@tkhe</a>, <a href="https://github.com/tirkarthi">@tirkarthi</a>, <a href="https://github.com/mseth10">@mseth10</a>, <a href="https://github.com/ksindwan">@ksindwan</a>, <a href="https://github.com/Neutron3529">@Neutron3529</a>, <a href="https://github.com/Aktcob">@Aktcob</a>, <a href="https://github.com/tmyapple">@tmyapple</a>, <a href="https://github.com/chongruo">@chongruo</a>, <a href="https://github.com/xinyu-intel">@xinyu-intel</a></p><p>Please Like/Star/Fork/Comment/Contribute if you like GluonCV!</p><ul><li><a href="https://gluon-cv.mxnet.io/index.html">GluonCV Website</a></li><li><a href="https://github.com/dmlc/gluon-cv">GluonCV Github</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=56ac37664977" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/analyze-driving-scenes-with-gluoncv-0-8-56ac37664977">Analyze Driving Scenes with GluonCV 0.8</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by 
highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Streaming Inference Pipeline — Deploying MXNet model on AWS Lambda]]></title>
            <link>https://medium.com/apache-mxnet/streaming-inference-pipeline-deploying-mxnet-model-on-aws-lambda-7ce6bc8f4cc8?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/7ce6bc8f4cc8</guid>
            <category><![CDATA[optimization]]></category>
            <category><![CDATA[mxnet]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[ml-model-deployment]]></category>
            <category><![CDATA[aws-lambda]]></category>
            <dc:creator><![CDATA[WayTrue]]></dc:creator>
            <pubDate>Sat, 11 Jul 2020 04:18:38 GMT</pubDate>
            <atom:updated>2020-07-15T16:37:18.111Z</atom:updated>
<content:encoded><![CDATA[<h3>Build a Streaming Inference Pipeline by Deploying Apache MXNet on AWS Lambda</h3><p><em>Thanks to Sandeep Krishnamurthy, Olivier Cruchant, Thom Lane and Kevin Mould for their feedback</em></p><h3>Introduction</h3><p>AWS Lambda is a compute service that allows you to run code without provisioning or managing servers. It also introduces scalability, with the ability to run <a href="https://docs.aws.amazon.com/lambda/latest/dg/invocation-scaling.html">up to 3000 tasks concurrently</a>. More importantly, you only pay for the compute time you consume, which can lead to significant cost savings in certain situations. By combining deep learning models with AWS Lambda, you can get outstanding model performance with all of this convenience. In this post, we will walk through how to build a streaming inference pipeline using Apache MXNet and AWS Lambda. In particular, we will discuss how to deploy a package that exceeds the typical Lambda upload limit. Finally, we will analyze the pipeline performance and costs.</p><h3>Use case scenario</h3><p>A deep learning streaming inference pipeline is a perfect use case for MXNet on AWS Lambda. Let’s assume an e-commerce company wants to extract metadata from their product images. For example, <a href="https://medium.com/airbnb-engineering/categorizing-listing-photos-at-airbnb-f9483f3ab7e3">Airbnb uses image classification models</a> to identify what type of room (kitchen, pool, garden) is displayed in a picture. This is where the streaming inference pipeline comes in: it continuously monitors the data flow, performs image classification whenever an image is acquired, and saves the results into long-term storage (such as Amazon S3). AWS Lambda manages everything for you: from provisioning to scaling. 
Additionally, the company only pays for the compute time that AWS Lambda needs, which makes it very cost efficient for intermittent workloads.</p><h3>Inference pipeline workflow</h3><p>The above data extraction use case boils down to the following workflow:</p><ol><li>An image is uploaded to an Amazon S3 bucket (input bucket).</li><li>The image triggers the AWS Lambda function to retrieve resources from the resource bucket and perform image classification using the MXNet model.</li><li>The inference result is stored in another Amazon S3 bucket (output bucket).</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VVBspL-9ALwfz1ODZ97x4w.png" /></figure><h3>Build the pipeline</h3><p>This section will show you how to build the above pipeline step by step. Before we start, please don’t forget to <a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html">configure your AWS Command Line Interface (CLI)</a> if you haven’t already done so:</p><pre>$ aws configure</pre><h4>Amazon S3 buckets</h4><p>The first step is to create the following storage components of the pipeline:</p><ul><li>input_bucket: An S3 bucket to receive the input images.</li><li>resource_bucket: An S3 bucket to host the resource files for the Lambda function during runtime.</li><li>output_bucket: An S3 bucket to store the inference results.</li></ul><p>You can either use the Amazon S3 management console or the following CLI commands to create the S3 buckets:</p><pre>$ aws s3api create-bucket --bucket your-input-bucket-name \<br>                          --region your-region \<br>                          --create-bucket-configuration \<br>                          LocationConstraint=your-region<br>$ aws s3api create-bucket --bucket your-resource-bucket-name \<br>                          --region your-region \<br>                          --create-bucket-configuration \<br>                          LocationConstraint=your-region<br>$ aws s3api create-bucket 
--bucket your-output-bucket-name \<br>                          --region your-region \<br>                          --create-bucket-configuration \<br>                          LocationConstraint=your-region</pre><h4>Prepare the deployment package</h4><p>The next step is to prepare the deployment package for AWS Lambda. A <a href="https://docs.aws.amazon.com/lambda/latest/dg/python-package.html">deployment package</a> is a compressed archive that contains the function code and dependencies. AWS Lambda will run this code when triggered. In our case, it should contain the following items:</p><ul><li>lambda_function.py — the main function that performs inference during runtime.</li><li>model.params — the model parameter file. In this case we use a pre-trained ResNet50_v2 model (download <a href="https://github.com/waytrue17/MXNet-on-Lambda/blob/master/resnet50_v2.params">here</a>). You can also use your own model. More details on saving/loading models can be found <a href="https://mxnet.apache.org/api/python/docs/tutorials/packages/gluon/blocks/save_load_params.html">here</a>.</li><li>synset.txt — the label file for the Imagenet dataset. It maps the model output (integer) to an object class (string). For instance, ‘559’ is mapped to ‘folding chair’. You can find it <a href="https://github.com/waytrue17/MXNet-on-Lambda/blob/master/synset.txt">here</a>.</li><li>dependencies — the libraries that the lambda function depends on, such as MXNet, numpy, etc.</li></ul><p><strong>Dependencies</strong><br>The following command downloads the current released version of MXNet (1.6.0) and all its dependencies into the folder package in the current directory.</p><pre>$ pip install mxnet -t ./package</pre><p>Note: please make sure that you download the dependencies within an operating system that is compatible with Amazon Linux (AL/AL2), since the Lambda function runs on Amazon Linux. 
The easiest way to do this is to run an Amazon EC2 instance with Amazon Linux, or you can use<a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/amazon_linux_container_image.html"> Docker to set up an Amazon Linux environment.</a><br> <br><strong>Lambda upload limit</strong><br>Adding everything to the package folder, we end up with the following file structure:</p><pre>package<br>----lambda_function.py<br>----resnet50_v2.params<br>----synset.txt<br>----mxnet<br>----numpy<br>----numpy.libs<br>...<br>----urllib3</pre><p>Our current package exceeds 370 MB (mxnet 190 MB, resnet50_v2.params 100 MB, numpy 50 MB, numpy.libs 30 MB), but the deployment package limit for Lambda is 250 MB. This makes it impossible to upload the whole package as is. However, AWS Lambda provides an <a href="https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html">additional 512 MB</a> of storage inside the “/tmp” directory (while running the Lambda function). All files in this directory persist during the life cycle of the Lambda instance. Thus, we can split the package into smaller parts: one uploaded to AWS Lambda directly, and the rest downloaded into the “/tmp” directory at runtime. By splitting the package and exploiting the “/tmp” storage, we increase the size limit from 250 MB to more than 750 MB. <br> <br>In this implementation, we split the package as seen below. You can split it in other ways, as long as it fits within the AWS Lambda constraints.</p><pre>pkg_lambda<br>----lambda_function.py<br>----synset.txt<br>----mxnet</pre><pre>pkg_tmp<br>----numpy<br>----numpy.libs<br>...<br>----urllib3</pre><pre>resnet50_v2.params</pre><p><strong>pkg_lambda</strong><br>This package contains all the files that will be uploaded to AWS Lambda directly. The size is reduced to around 200 MB (from 370 MB before). 
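As a sanity check before compressing and uploading, a candidate split can be validated against the two limits. A minimal Python sketch (the component names and MB figures are the approximate ones quoted above, and `split_fits` is an illustrative helper, not part of any AWS tooling):

```python
# Sanity-check a package split against AWS Lambda storage limits (sizes in MB).
# 250 MB is the deployment package quota, 512 MB the "/tmp" quota; component
# sizes are the approximate figures quoted in this post.
DIRECT_LIMIT_MB = 250   # deployment package uploaded to Lambda
TMP_LIMIT_MB = 512      # files fetched into "/tmp" at runtime

def split_fits(direct_mb, tmp_mb):
    """Return True if both halves of the split respect their limits."""
    return (sum(direct_mb.values()) <= DIRECT_LIMIT_MB
            and sum(tmp_mb.values()) <= TMP_LIMIT_MB)

pkg_lambda = {"mxnet": 190, "handler_and_labels": 1}       # uploaded directly
pkg_tmp = {"deps": 85, "resnet50_v2.params": 100}          # fetched into /tmp

assert split_fits(pkg_lambda, pkg_tmp)          # the chosen split fits
assert not split_fits({"everything": 370}, {})  # the unsplit package does not
```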
Use the following command to compress it and get pkg_lambda.zip:</p><pre>~$ cd pkg_lambda<br>~/pkg_lambda$ zip -r9 ${OLDPWD}/pkg_lambda.zip .</pre><p>We also provide our pkg_lambda.zip <a href="https://github.com/waytrue17/MXNet-on-Lambda/blob/master/mxnet/pkg_lambda.zip">here</a>. <br><br>You can either use the Amazon S3 management console or the following CLI command to upload it to the resource_bucket:</p><pre>$ aws s3 cp pkg_lambda.zip s3://your-resource-bucket-name</pre><p><strong>pkg_tmp and model file</strong><br>These files will be stored in the resource_bucket and get downloaded into the “/tmp” directory during the AWS Lambda runtime. The size is around 85 MB. Use the following command to get pkg_tmp.tar.gz:</p><pre>~$ cd pkg_tmp<br>~/pkg_tmp$ tar -cvzf ${OLDPWD}/pkg_tmp.tar.gz .</pre><p>We also provide our pkg_tmp.tar.gz <a href="https://github.com/waytrue17/MXNet-on-Lambda/blob/master/mxnet/pkg_tmp.tar.gz">here</a>.<br> <br>You can either use the Amazon S3 management console or the following CLI command to upload it to the resource_bucket:</p><pre>$ aws s3 cp pkg_tmp.tar.gz s3://your-resource-bucket-name</pre><p>The last thing to upload is the model file:</p><pre>$ aws s3 cp resnet50_v2.params s3://your-resource-bucket-name</pre><p>Note: here we choose to separate the model file from pkg_tmp. Uploading it separately, without compressing it into the package, buys us more space in the “/tmp” directory when downloading it. It also makes updating the model file easier, since we don’t need to update the whole pkg_tmp.</p><p><strong>lambda_function.py</strong><br>lambda_function.py contains a function called lambda_handler which takes the trigger event (an image in our case) as an input and performs the model inference. 
<br> <br>In this file we:</p><ul><li>Specify how the AWS Lambda runtime downloads pkg_tmp and the model file from resource_bucket.</li><li>Transform the input images into the required format for the model.</li><li>Perform image classification inference.</li><li>Send the results to output_bucket.</li></ul><p>Note: lambda_function.py contains multiple non-event-specific tasks, such as downloading pkg_tmp, loading model parameters, generating the label list, etc. We should put these code blocks above the lambda_handler function, at module scope, to reduce the inference latency. More details can be found in our implementation of <a href="https://github.com/waytrue17/MXNet-on-Lambda/blob/master/lambda_function.py">lambda_function.py</a>.</p><h4>Creating an AWS Lambda function</h4><p>Once the deployment packages are ready, the next step is to create the AWS Lambda function. <br> <br><strong>Set an IAM role</strong><br>First, we need to create an IAM role that permits AWS Lambda to communicate with the other AWS components in the pipeline. Go to the AWS IAM management console: Roles → Create role → Lambda, and attach the following policies:</p><ul><li>AmazonS3FullAccess</li><li>AWSLambdaBasicExecutionRole</li><li>CloudWatchEventsFullAccess</li></ul><p>More details on IAM roles can be found <a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-intro-execution-role.html">here</a>.</p><p><strong>Create the Lambda function</strong><br>The next step is to create the AWS Lambda function, attach the IAM role to it and upload the pkg_lambda. 
You can either use the AWS Lambda management console or the following CLI command:</p><pre>$ aws lambda create-function --function-name your-function-name \<br>             --code S3Bucket=your-resource-bucket-name,S3Key=pkg_lambda.zip \<br>             --handler lambda_function.lambda_handler \<br>             --runtime python3.7 \<br>             --role arn:aws:iam::your-aws-account:role/your-role-name \<br>             --timeout 30 \<br>             --memory-size 1024</pre><p><strong>Add a trigger to Lambda</strong><br>The last step is to add an S3 “All object create events” trigger to the Lambda function. Go to the AWS Lambda management console → Functions → Select the function we just created → add trigger → select S3 → specify input_bucket for bucket name → specify “All object create events” for Event type.<br> <br><strong>Test</strong><br>At this point, we’ve successfully built an inference pipeline. To test it, upload an image to the input_bucket. A new file that contains the predicted object class should appear in the output_bucket.</p><h3>MXNet with MKL-DNN</h3><p>Intel MKL-DNN provides various highly vectorized and threaded operators to accelerate deep learning frameworks. MXNet supports MKL-DNN to achieve better training and inference performance. In this section, we will demonstrate how to enable MKL-DNN in our inference pipeline. <br> <br>The only difference is the MXNet library. Use the following command to download the MKL-DNN-enabled MXNet (version 1.6.0):</p><pre>$ pip install mxnet-mkl -t ./package</pre><p>Note that the MXNet library itself reaches 280 MB, and therefore no longer fits within the 250 MB limitation. 
We must rearrange the deployment package as follows to satisfy the limitation:</p><pre>pkg_lambda_mkl<br>----lambda_function.py<br>----synset.txt<br>----numpy<br>----numpy.libs<br>...<br>----urllib3</pre><pre>pkg_tmp_mkl<br>----mxnet</pre><pre>resnet50_v2.params</pre><p>You can prepare your own packages or download our pkg_lambda_mkl.zip and pkg_tmp_mkl.tar.gz <a href="https://github.com/waytrue17/MXNet-on-Lambda/tree/master/mxnet-mkl">here</a>. Once the two packages are uploaded to the resource_bucket, you can either use the AWS Lambda management console or run the following command to update the pipeline and enable MKL-DNN:</p><pre>$ aws lambda update-function-code \<br>             --function-name your-function-name \<br>             --s3-bucket your-resource-bucket-name \<br>             --s3-key pkg_lambda_mkl.zip</pre><h3>Inference latency</h3><p>Latency is a key factor in our inference pipeline. With AWS Lambda functions, the latency depends on whether the instance is “<a href="https://mikhail.io/serverless/coldstarts/aws/">cold</a>” or “warm”. Cold start inference is much slower due to the initialization tasks. Latency also depends on CPU resources. We observe that a larger memory allocation in Lambda typically means lower latency, but at a higher cost. In this section, we summarize the latency for both mxnet and mxnet-mkl inference.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1008/1*Yk3eP7CmIqTwM4Rjxixx7g.png" /></figure><p>The figure above shows the cold start inference latency. Each value is measured by averaging 5 cold start inference latencies.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1010/1*AHO9uWJDOAojM5EZ9-gK0g.png" /></figure><p>The figure above shows the warm start inference latency. Each value is measured by averaging 20 consecutive warm start inference latencies. <br> <br>We observe that enabling MKL-DNN increases the cold start latency due to its larger package size. 
The overall cold start latency is around 6 seconds, which is a huge time cost for deep learning inference. In production, we should try to avoid cold starts. On the other hand, a warm start takes around 400 ms in general, which is acceptable in many use cases. The figure also indicates that enabling MKL-DNN improves the inference latency by around 15%.</p><h3>Inference cost</h3><p>Cost is another important factor for the pipeline. Here we mainly focus on the warm start cost, since it applies to most inference requests. AWS Lambda offers free usage of 1M requests and 400,000 GB-seconds per month. Beyond the free usage, it charges on a per-request basis. The following figure shows the cost per million requests after free usage, calculated based on <a href="https://aws.amazon.com/lambda/pricing/">AWS Lambda pricing</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1008/1*rVa-IX3QIk8kWDKqDpZC-A.png" /></figure><p>We see that:</p><ul><li>A smaller memory allocation lowers the cost, although it increases latency.</li><li>Enabling MKL-DNN not only speeds up the inference but also lowers the cost.</li></ul><h3>Cost efficiency</h3><p>Traditional deep learning servers (like Amazon EC2 instances) have fixed charges per month. On the other hand, the cost of Lambda is proportional to the number of inference requests. We benchmark the cost between mxnet-mkl on AWS Lambda and a commonly used EC2 setup, c5.xlarge+eia2.medium (the same setup as this <a href="https://aws.amazon.com/blogs/machine-learning/serving-deep-learning-at-curalate-with-apache-mxnet-aws-lambda-and-amazon-elastic-inference/?ref=Welcome.AI">study</a>), which costs $208.80 per month.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/878/1*29fC9V8M477ABwWicU7LRA.png" /></figure><p>From the figure above, we observe that the costs of AWS Lambda 2048 MB and the c5 instance intersect at around 15 million image requests, showing that Lambda is more cost efficient when the monthly request count is below that number. 
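This break-even point can be approximated with back-of-the-envelope arithmetic; a sketch assuming the 2020 public Lambda rates ($0.20 per million requests plus $0.0000166667 per GB-second) and an illustrative 400 ms warm invocation:

```python
# Back-of-the-envelope Lambda vs. EC2 break-even point (post-free-tier).
# Pricing figures are the public 2020 rates; the 0.4 s warm latency is an
# illustrative assumption, so the result is approximate.
PER_REQUEST = 0.20 / 1_000_000   # $ per request
PER_GB_SECOND = 0.0000166667     # $ per GB-second
EC2_MONTHLY = 208.80             # c5.xlarge + eia2.medium, $ per month

def lambda_cost_per_request(memory_mb, duration_s):
    return PER_REQUEST + (memory_mb / 1024) * duration_s * PER_GB_SECOND

break_even = EC2_MONTHLY / lambda_cost_per_request(2048, 0.4)
print(f"break-even: {break_even / 1e6:.1f} M requests/month")
# prints: break-even: 15.4 M requests/month
```

At 1024 MB the GB-second term roughly halves, pushing the break-even higher; the exact figure depends on the measured warm latency at each memory size.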
The threshold between Lambda 1024 and c5 is around 22 million requests, indicating a greater potential for AWS Lambda’s cost efficiency.</p><h3>Further Reading</h3><p>In this section, we would like to discuss some observations we discovered while building the pipeline.</p><h4>Cold Start vs Warm Start</h4><p>From the previous analysis, we saw the cost advantages of the Lambda pipeline over the EC2 instance. However, those calculations assume “warm start” latency for most of the requests. Therefore, it is crucial to minimize “cold starts” of the AWS Lambda function and keep them “warm” as much as possible. In fact, an empty AWS Lambda instance will be invoked when the first request arrives. It first sets up the environment and then processes the actual request. This is the so called “cold start,” which takes longer due to the initialization steps. Once the cold start has completed, the AWS Lambda instance caches all the setup during its life cycle and gets “warm”. That’s the reason why “warm start” latency is so much better. Unfortunately, a warm instance only waits for about 5 to 10 min for the next request. The instance will be recycled if the interval between two requests is longer than that period of time, and Lambda will have to invoke another instance starting from a “cold start”. As a result, having a traffic pattern with a request interval of less than 5 mins can reduce the frequency of cold starts. Otherwise, applying a simple <a href="https://read.acloud.guru/how-to-keep-your-lambda-functions-warm-9d7e1aa6e2f0">pre-warming technique</a> can keep AWS Lambda instances alive and solve the cold start issue.</p><h4>Deploying other models</h4><p>It is relatively easy to deploy other deep learning models with this pipeline. We only need to upload the new model file and the corresponding synset.txt file to the resource bucket, and then modify the lambda_function.py accordingly. 
The only restriction when deploying a model on AWS Lambda is the size limitation of the packages: 250 MB for Lambda and 512 MB for “/tmp”. In the above implementation, the pkg_tmp is 85 MB, which leaves more than 400 MB for the model files. This should be sufficient for most MXNet computer vision models, e.g. ResNet152 (240 MB), Faster-RCNN (170 MB), Mask-RCNN (180 MB). However, enabling MKL-DNN would squeeze the space available for the model file to around 200 MB. For large models that don’t fit within the limitations, please consider other techniques to further reduce the package size, such as the <a href="https://github.com/aws-samples/djl-demo/tree/master/lambda-model-serving">Deep Java Library</a> and the <a href="https://github.com/neo-ai/neo-ai-dlr">Deep Learning Runtime</a>. You can also combine <a href="https://aws.amazon.com/blogs/compute/using-amazon-efs-for-aws-lambda-in-your-serverless-applications/">Lambda with Amazon EFS</a> to host larger files.</p><h4>Changing the storage component</h4><p>You may notice that the pipeline is built using various AWS services. This makes it flexible and easy to swap some of the components for other AWS services. Amazon DynamoDB is a fully managed database that performs data queries in a very efficient way. It also allows us to store the inference result along with other customer-defined information. Here, we will show you how to switch the output storage from an Amazon S3 bucket to an Amazon DynamoDB table.<br><br>First, define a simple table that stores three features for each input image. 
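Once such a table exists, each inference result maps onto one item keyed by a UUID. A minimal sketch of building that item in DynamoDB's low-level attribute-value format ({"S": ...} marks a string attribute; the helper name build_item is a placeholder, and the attribute names follow the table design in this section):

```python
import uuid

def build_item(image_name, object_class):
    """Construct the DynamoDB item for one classified image."""
    return {
        "UUID": {"S": str(uuid.uuid4())},   # partition key, unique per image
        "ImageName": {"S": image_name},      # source file name
        "ObjectClass": {"S": object_class},  # predicted class label
    }

item = build_item("chair.jpg", "folding chair")
# real code would then call: dynamodb.put_item(TableName=..., Item=item)
```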
You can edit or add more based on your needs.</p><ul><li>UUID: a UUID4 code as the unique identifier of the image.</li><li>ImageName: the file name of the image.</li><li>ObjectClass: the predicted object class for the image.</li></ul><p>Use either the Amazon DynamoDB management console or the following CLI command to create the DynamoDB table:</p><pre>$ aws dynamodb create-table \<br>      --table-name your-output-table-name \<br>      --attribute-definitions AttributeName=UUID,AttributeType=S \<br>      --key-schema AttributeName=UUID,KeyType=HASH \<br>      --provisioned-throughput \<br>      ReadCapacityUnits=5,WriteCapacityUnits=5</pre><p>Add the following IAM policy to the IAM role we created previously:</p><ul><li>AmazonDynamoDBFullAccess</li></ul><p>Next, modify the lambda_function.py to send the result to the table, prepare the deployment packages and upload the two packages into the resource_bucket. Our pkg_lambda_ddb.zip and pkg_tmp_ddb.tar.gz can be found <a href="https://github.com/waytrue17/MXNet-on-Lambda/tree/master/mxnet-ddb">here</a>.<br><br>Finally, you can either use the AWS Lambda management console or run the following command to update the pipeline and connect the DynamoDB table to the pipeline:</p><pre>$ aws lambda update-function-code \<br>             --function-name your-function-name \<br>             --s3-bucket your-resource-bucket-name \<br>             --s3-key pkg_lambda_ddb.zip</pre><h3>Conclusion</h3><p>In this article, we demonstrated step by step how to build a streaming inference pipeline using MXNet and AWS Lambda. We addressed the AWS Lambda upload limit issue by splitting the deployment package and exploiting “/tmp” directory (giving us over 750 MB of potential storage space in total). We then benchmarked the inference performance of MXNet with and without MKL-DNN. Our results indicated that enabling MKL-DNN could speed up the inference by about 15%. 
Finally, we showed that AWS Lambda inference is more cost-efficient than a c5.xlarge instance when monthly requests stay below 22 million in our case.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7ce6bc8f4cc8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/streaming-inference-pipeline-deploying-mxnet-model-on-aws-lambda-7ce6bc8f4cc8">Streaming Inference Pipeline — Deploying MXNet model on AWS Lambda</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speed up your BERT inference by 3x on CPUs using Apache TVM]]></title>
            <link>https://medium.com/apache-mxnet/speed-up-your-bert-inference-by-3x-on-cpus-using-apache-tvm-9cf7776cd7f8?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/9cf7776cd7f8</guid>
            <category><![CDATA[optimization]]></category>
            <category><![CDATA[bert]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[mxnet]]></category>
            <category><![CDATA[nlp]]></category>
            <dc:creator><![CDATA[Haichen Shen]]></dc:creator>
            <pubDate>Wed, 08 Jul 2020 14:01:03 GMT</pubDate>
            <atom:updated>2020-07-08T14:01:03.619Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introduction</h3><p>BERT (Bidirectional Encoder Representations from Transformers) [1], a pre-trained natural language processing (NLP) model, was proposed by Google in 2018 and now plays an important role in NLP tasks such as text classification, sentiment analysis, question answering, and more. However, BERT is known to be compute-intensive even for inference. As production pipelines shift from lightweight models such as LSTMs to BERT, BERT inference performance is becoming increasingly critical for reaching the desired latency and scalability.</p><p>Recently, deep learning compilers such as TVM [2], MLIR [3], and Glow [4], which use compiler techniques to optimize model inference, have gained traction in both academia and industry. Progress in this area [2, 5] has demonstrated that deep learning compilers can achieve lower inference latency than frameworks. At Amazon Web Services, we make multiple contributions to the <a href="https://tvm.apache.org/">Apache TVM</a> open source project, and we use TVM to speed up many model inference use cases, including BERT, on various platforms. We also offer a service, called <a href="https://aws.amazon.com/sagemaker/neo/">Amazon SageMaker Neo</a>, that provides a managed compilation experience across a variety of frameworks, operators and hardware targets.</p><p>In this blog, we are going to share our recent progress on improving BERT inference performance on CPUs (e.g., c5 and m5 instances on <a href="https://aws.amazon.com/ec2/">Amazon EC2</a>) and show you how to use TVM to reproduce our results. Overall, using TVM can help achieve up to <strong>2.9x</strong> lower latency on EC2 c5.9xlarge instances and up to <strong>2.3x</strong> higher throughput. 
More importantly, achieving these results requires almost no developer effort, as we will demonstrate in this post.</p><h3>BERT inference improvement</h3><p>We will use the BERT-base [1] and DistilBERT [6] models to demonstrate our improvement of BERT inference performance using TVM. The BERT-base model contains 12 layers of transformer blocks, requires 11.2 GFLOPs (floating-point operations) per inference, and has 109M weights. DistilBERT is a distilled, smaller version of the BERT model with only 6 layers of transformer blocks. It requires 5.6 GFLOPs, has 67M weights, and thus runs faster during inference. Computation is counted for batch size 1 and sequence length 128.</p><p>We take the pre-trained BERT-base and DistilBERT models from the <a href="https://gluon-nlp.mxnet.io/">GluonNLP</a> (0.9.1) model zoo. The data type is fp32 in all benchmarks, as Intel CPUs do not support fp16 operations. We measure the inference latency for sequence lengths 64, 128, and 256 when batch size is 1. We compare the latency of <a href="https://mxnet.apache.org/">Apache MXNet</a> (<a href="https://mxnet.apache.org/versions/1.6/get_started?version=v1.6.0&amp;platform=linux&amp;language=python&amp;environ=pip&amp;processor=cpu">mkl 1.6.0</a>) and <a href="https://tvm.apache.org/">Apache TVM</a> on an EC2 <a href="https://aws.amazon.com/ec2/instance-types/c5/">c5.9xlarge</a> instance (Intel CPU with 18 physical cores). The table below shows that the latency can be reduced by 2.1x to 2.9x via TVM optimization for the BERT-base and DistilBERT models. Notably, the latency of DistilBERT under sequence length 128 is only 9.5ms on CPUs after optimization. 
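</p><p>(As a quick sanity check on the model sizes quoted above: the 109M-weight figure for BERT-base can be reproduced from the standard BERT-base hyper-parameters in [1] with a short back-of-envelope count.)</p>

```python
# Rough parameter count for BERT-base (hyper-parameters from the paper).
vocab, max_pos, type_vocab = 30522, 512, 2
hidden, ffn, layers = 768, 3072, 12

embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden  # + LayerNorm
attention  = 4 * (hidden * hidden + hidden)     # Q, K, V, output projections
ffn_block  = (hidden * ffn + ffn) + (ffn * hidden + hidden)
layernorms = 2 * 2 * hidden                     # two LayerNorms per layer
per_layer  = attention + ffn_block + layernorms
pooler     = hidden * hidden + hidden

total = embeddings + layers * per_layer + pooler
print(f"{total / 1e6:.1f}M parameters")          # ~109M, matching the text
```

<p>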
In comparison, the <a href="https://cloudblogs.microsoft.com/opensource/2020/01/21/microsoft-onnx-open-source-optimizations-transformer-inference-gpu-cpu/">ONNX runtime</a> achieves 9ms on similar CPUs using a <strong>3-layer</strong> BERT model which is 2x smaller than the DistilBERT model.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/33f4eb82760f729c18282d22d2187f1b/href">https://medium.com/media/33f4eb82760f729c18282d22d2187f1b/href</a></iframe><p>We further evaluate the throughput of BERT with batch size 4 and sequence length 128 on more types of EC2 instances, including c5.2xlarge, c5.9xlarge, m5.2xlarge, and m5.12xlarge. TVM consistently achieves higher throughput for the BERT base and DistilBERT models with an average of 2x improvement.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OL-v9avLgyiwR7RGKg2d1A.png" /></figure><p>The performance gain comes from three aspects: (1) small operators get fused together to reduce overhead in memory transfer between cache and main memory, (2) kernels generated by TVM achieve better performance in general, and (3) TVM performs graph-level optimizations and replaces some heavyweight math operators such as erf with an approximate implementation.</p><h3>How can you optimize BERT using TVM?</h3><p>Next, let’s walk through the steps to reproduce the results shown above.</p><p>We first launch a CPU instance such as c5.9xlarge on <a href="https://aws.amazon.com/ec2/">Amazon EC2</a> using <a href="https://docs.aws.amazon.com/dlami/latest/devguide/ubuntu18-04.html">AWS Deep Learning AMI, Ubuntu 18.04</a>. 
After we ssh into the server, we activate the conda environment mxnet_p36 and install GluonNLP Python packages:</p><pre>source activate mxnet_p36<br>pip install gluonnlp==0.9.1</pre><p>Next, we install TVM:</p><pre>pip install <a href="https://tvm-build-public.s3-us-west-2.amazonaws.com/dlami-cpu-mkl/tvm-0.7.dev1-cp36-cp36m-linux_x86_64.whl">https://tvm-build-public.s3-us-west-2.amazonaws.com/dlami-cpu-mkl/tvm-0.7.dev1-cp36-cp36m-linux_x86_64.whl</a><br>pip install <a href="https://tvm-build-public.s3-us-west-2.amazonaws.com/dlami-cpu-mkl/topi-0.7.dev1-py3-none-any.whl">https://tvm-build-public.s3-us-west-2.amazonaws.com/dlami-cpu-mkl/topi-0.7.dev1-py3-none-any.whl</a></pre><p>These Python wheel packages are compiled for the Deep Learning AMI, and are not guaranteed to be compatible with other environments. You can also compile TVM from source (instructions in the appendix).</p><p>After we’ve installed TVM, we can optimize the BERT model. You can find the steps to train and fine-tune a BERT model using GluonNLP from the <a href="https://gluon-nlp.mxnet.io/examples/sentence_embedding/bert.html">tutorials</a> on the GluonNLP website. Once we’ve instantiated a BERT model from GluonNLP, we can compile and optimize it using TVM.</p><p>We first need to convert the BERT model from MXNet to TVM Relay IR given an input shape mapping. Relay IR is an intermediate representation used in TVM that represents the computation graph of model architecture.</p><pre># Load model from GluonNLP<br>...<br>mx_model = nlp.model.BERTClassifier(...)</pre><pre># Convert MXNet model to TVM<br>shape_dict = {<br>    &#39;data0&#39;: (batch, seq_length),<br>    &#39;data1&#39;: (batch, seq_length),<br>    &#39;data2&#39;: (batch,)<br>}<br>mod, params = relay.frontend.from_mxnet(mx_model, shape_dict)</pre><p>We then define the target and optimization level to compile the model using TVM. 
The CPUs in both c5 and m5 instances support <a href="https://www.intel.com/content/www/us/en/architecture-and-technology/avx-512-overview.html">AVX-512</a> instructions, which can boost floating-point computation via vectorization, so the CPU architecture (-mcpu=skylake-avx512) is specified in the target to enable this. We also utilize the <a href="https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html">Intel MKL</a> library (-libs=cblas), which features an optimized BLAS library on Intel CPUs, to speed up the computation of operators like matrix multiplication. In addition, we explicitly turn on an optimization pass called “FastMath”. This pass replaces certain heavyweight math operators such as erf with an approximate implementation in order to reduce latency.</p><pre>target = &quot;llvm -mcpu=skylake-avx512 -libs=cblas&quot;<br>with relay.build_config(opt_level=3, required_pass=[&quot;FastMath&quot;]):<br>    graph, lib, cparams = relay.build(mod, target, params=params)</pre><p>Finally, we create a lightweight executor included in TVM, and initialize it with weights and input data. We can then run the executor and examine the output. We also sanity-checked the TVM output against MXNet to verify correctness.</p><pre>ctx = tvm.cpu()<br>rt = runtime.create(graph, lib, ctx)<br>rt.set_input(**cparams)<br>rt.set_input(data0=inputs, data1=token_types, data2=valid_length)<br>rt.run()<br>out = rt.get_output(0)<br>print(out.asnumpy())</pre><pre># verify the correctness<br>tvm.testing.assert_allclose(out.asnumpy(), mx_out.asnumpy(), rtol=1e-3, atol=1e-3)</pre><p>The full script can be found at <a href="https://gist.github.com/icemelon9/860d3d2c9566d6f69fa8112840dd95c1">https://gist.github.com/icemelon9/860d3d2c9566d6f69fa8112840dd95c1</a>.</p><h3>Conclusion</h3><p>In summary, we are seeing significant improvement of BERT inference using TVM on CPUs. The latency of BERT inference is reduced by up to 2.9x and the throughput is increased by up to 2.3x. 
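</p><p>If you would like to verify these numbers on your own instance, any minimal timing harness around the executor works. Below is a standard-library sketch; TVM also provides a built-in time_evaluator on the runtime module for the same purpose.</p>

```python
import statistics
import time

def benchmark(fn, warmup=10, runs=100):
    """Time a zero-argument callable and report latency in milliseconds."""
    for _ in range(warmup):              # warm up caches and lazy initialization
        fn()
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {"mean_ms": statistics.mean(samples_ms),
            "p50_ms": statistics.median(samples_ms)}
```

<p>Call it as benchmark(lambda: rt.run()) after setting the inputs as shown above.</p><p>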
It takes only a few lines of code to achieve this improvement and deploy the model.</p><p>Today, this solution works well in scenarios where padding to a fixed sequence length is possible. We are currently working on expanding the support to include dynamic sequence lengths. Stay tuned for more news.</p><h3>Appendix</h3><h4>Compile TVM from source</h4><p>This <a href="https://tvm.apache.org/docs/install/from_source.html">document</a> from the TVM website provides instructions on how to compile and install TVM from source. Here we only describe the specific dependencies and configurations needed to achieve the best performance for BERT inference.</p><p>First, you need to install two dependencies in addition to the build requirements listed in the TVM document: LLVM (&gt;=6.0) and the Intel MKL library (instructions can be found <a href="https://software.intel.com/content/www/us/en/develop/articles/installing-intel-free-libs-and-python-apt-repo.html">here</a>).</p><p>Next, we compile the source and build the shared library.</p><pre>git clone --recursive <a href="https://github.com/apache/incubator-tvm.git">https://github.com/apache/incubator-tvm.git</a> tvm<br>cd tvm &amp;&amp; mkdir build &amp;&amp; cd build<br>cmake -DUSE_LLVM=/path/to/llvm-config -DUSE_BLAS=mkl -DUSE_OPENMP=intel ..<br>make -j4</pre><p>Finally, we install the TVM python package.</p><pre>cd python; python setup.py install --user; cd ..<br>cd topi/python; python setup.py install --user; cd ../..</pre><h3>Reference</h3><p>[1] Jacob Devlin, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” <em>arXiv preprint arXiv:1810.04805</em> (2018).<br>[2] Tianqi Chen, et al. “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” <em>13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18),</em> 2018.<br>[3] Chris Lattner, et al. “MLIR: A Compiler Infrastructure for the End of Moore’s Law”. 
<em>arXiv preprint arXiv:2002.11054</em>, 2020. <br>[4] Nadav Rotem, et al. “Glow: Graph Lowering Compiler Techniques for Neural Networks”. <em>CoRR</em>, abs/1805.00907, 2018. <br>[5] Yizhi Liu, et al. “Optimizing CNN Model Inference on CPUs.” <em>2019 USENIX Annual Technical Conference (USENIX ATC 19),</em> 2019.<br>[6] Victor Sanh, et al. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.” <em>arXiv preprint arXiv:1910.01108</em> (2019).</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9cf7776cd7f8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/speed-up-your-bert-inference-by-3x-on-cpus-using-apache-tvm-9cf7776cd7f8">Speed up your BERT inference by 3x on CPUs using Apache TVM</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[GluonCV 0.7: ResNeSt, Next Generation Backbone]]></title>
            <link>https://medium.com/apache-mxnet/gluoncv-0-7-resnest-next-generation-backbone-5c797863c92?source=rss----d1802b798def---4</link>
            <guid isPermaLink="false">https://medium.com/p/5c797863c92</guid>
            <category><![CDATA[apache-mxnet]]></category>
            <category><![CDATA[gluoncv]]></category>
            <category><![CDATA[stateoftheart]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Jerry Zhang]]></dc:creator>
            <pubDate>Thu, 14 May 2020 18:56:11 GMT</pubDate>
            <atom:updated>2020-05-14T22:16:31.237Z</atom:updated>
            <content:encoded><![CDATA[<p><em>Authors: Jerry Zhang, </em><a href="https://www.linkedin.com/in/thomas-brady/"><em>Thomas Brady</em></a></p><p>Since the introduction of <a href="https://en.wikipedia.org/wiki/AlexNet">AlexNet</a>, the <a href="http://www.image-net.org/challenges/LSVRC/2012/">2012 ImageNet challenge</a> champion, neural networks trained for image classification have been used as backbones for addressing other tasks such as object detection, semantic and instance segmentation, or pose estimation. We typically call these backbone networks, as they are shared by a variety of tasks. <a href="https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035">ResNet</a>, introduced in 2015, has through its many permutations been the reigning backbone for a while now. In fact, most research conducted for applications downstream of image classification still uses ResNet, despite the great strides made in computer vision by the research community. ResNet’s continued efficacy is often attributed to its <a href="https://mlsauce.wordpress.com/2018/05/29/modularity-in-deep-learning-a-farm-to-table-hot-take-part-i-basics/">modularity</a> and ease of use for <a href="https://mxnet.apache.org/api/python/docs/tutorials/packages/gluon/image/pretrained_models.html">transfer learning</a>. With modularity and transfer learning in mind, the GluonCV team is pleased to introduce the new backbone network ResNeSt in this release (GluonCV 0.7).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_NSrruLvr63bx1nSWw-HBw.png" /><figcaption>ResNet vs SE-Net vs ResNeSt</figcaption></figure><p><strong>ResNeSt</strong> retains the inherent modularity and transfer learning capabilities of ResNet while boosting the accuracy of a variety of vision tasks, including image classification, object detection and semantic segmentation. 
Besides ResNet, our work also takes inspiration from the use of channel attention in <a href="https://arxiv.org/abs/1709.01507">SE-Net</a>, a network designed in 2017. When comparing ResNeSt to the previous state-of-the-art backbone network <a href="https://arxiv.org/abs/1905.11946">EfficientNet</a>, which was partially designed using neural architecture search, we discovered that ResNeSt improves both speed and accuracy on a GPU. For example, our ResNeSt-269 achieves slightly higher accuracy than EfficientNet-B7, while lowering the latency by around 30%. In addition, we discovered it is easy to adapt ResNeSt to downstream tasks like object detection and semantic segmentation. By simply swapping <em>ResNet </em>with <strong>ResNeSt</strong> without tuning hyper-parameters, we improve <a href="https://arxiv.org/abs/1506.01497">Faster R-CNN</a> <a href="http://cocodataset.org/#home">COCO</a> <a href="https://towardsdatascience.com/breaking-down-mean-average-precision-map-ae462f623a52">mAP</a> by approximately 4%, and <a href="https://arxiv.org/abs/1706.05587">DeepLabV3</a> <a href="https://groups.csail.mit.edu/vision/datasets/ADE20K/">ADE20K</a> mIoU by around 3%, taking the previous best performance of a ResNet-based model as our baseline.</p><h3>Image Classification</h3><p>Backbone networks are usually pre-trained on the ImageNet-1K dataset, with their weights then used for various downstream tasks. Thus, accurate classification of images is of great importance to high-level computer vision. Included in the GluonCV 0.7 release are the four latest ResNeSt backbones with differing levels of complexity, accompanied by the respective code we used for training to reproduce our results. In previous releases, our best result came from using the <a href="https://towardsdatascience.com/squeeze-and-excitation-networks-9ef5e71eacd7">SENet-154</a> model as the classification backbone, which produced a top-1 accuracy score of 81.26% on the ImageNet dataset. 
All our new models, except for ResNeSt-50, achieve higher accuracy than SENet-154, our most accurate model from the previous release. The following are the detailed results:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Jtkm_2_3x6SJWQCcQkcTNw.png" /><figcaption>Average Latency vs. Top-1 Accuracy on ImageNet</figcaption></figure><p>In addition, we benchmarked our ResNeSt model against EfficientNet, using a single V100 GPU with a batch size of 16. As shown in the graph below, ResNeSt outperforms EfficientNet, with higher accuracy and lower latency.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hr7-sbd8ZH4MSD11wni4aQ.png" /></figure><h3>Object Detection</h3><p>ResNeSt achieves great results on image classification, but how does it perform on other downstream tasks? To demonstrate that ResNeSt can improve downstream tasks, we replaced the original <em>ResNet</em> with <strong>ResNeSt</strong> in Faster R-CNN, resulting in the mean Average Precision (mAP) improving by 3% as reported in <a href="https://arxiv.org/pdf/2004.08955.pdf">our paper</a>. In GluonCV 0.7, we include a new bag of tricks for our Faster R-CNN models, such as synchronized batch normalization, random scale augmentation, and a deeper box head (4 convolutions + 1 dense). With these improvements, we are able to increase the mAP to 42.7, which is higher than the previous result with ResNet-101. This is slightly higher than what we report in the paper, as we use 26 epochs (a 2x learning rate schedule) as opposed to the 13 epochs in the paper.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ix3U-G78-731QCsegY70Nw.png" /><figcaption>Object Detection with Faster R-CNN</figcaption></figure><h3>Semantic Segmentation</h3><p>We also provide two new semantic segmentation models in this release. 
In our research, we swapped <a href="https://arxiv.org/abs/1706.05587">DeepLabV3</a>’s <em>ResNet</em> backbone with <strong>ResNeSt</strong>, and obtained a 2.8% gain in mean intersection over union (mIoU) and a 1% gain in pixel accuracy, reaching a state-of-the-art <a href="https://paperswithcode.com/sota/semantic-segmentation-on-ade20k?p=resnest-split-attention-networks">result</a> on the <a href="https://groups.csail.mit.edu/vision/datasets/ADE20K/">ADE20K</a> dataset. By simply dropping in ResNeSt as the backbone, our models outperformed numerous other models designed specifically for semantic segmentation, such as <a href="https://arxiv.org/abs/1908.03930">ACNet</a> and <a href="https://arxiv.org/abs/1908.07919">HRNet</a>, demonstrating the versatility of ResNeSt in generalizing to different tasks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*e21ZGNeYgEmGebdZOOwHTA.png" /><figcaption>Semantic Segmentation with DeepLabV3</figcaption></figure><h3>Summary</h3><p>GluonCV 0.7 brings you the latest image classification backbone, one that significantly improves downstream tasks. The new models introduced in this release improve upon our existing model zoo, providing you with more capable computer vision models. With GluonCV 0.7, you can now use our state-of-the-art ResNeSt in your research or production. 
For more details, you can also check out our paper <a href="https://arxiv.org/pdf/2004.08955.pdf">here</a>.</p><h3>Acknowledgement</h3><p>We sincerely thank the following contributors:<br> <a href="https://github.com/zhreshold">@zhreshold</a>, <a href="https://github.com/adursun">@adursun</a>, <a href="https://github.com/KuangHaofei">@KuangHaofei</a>, <a href="https://github.com/bryanyzhu">@bryanyzhu</a>, <a href="https://github.com/FrankYoungchen">@FrankYoungchen</a>, <a href="https://github.com/ElectronicElephant">@ElectronicElephant</a>, <a href="https://github.com/lgov">@lgov</a>, <a href="https://github.com/astonzhang">@astonzhang</a>, <a href="https://github.com/ruslo">@ruslo</a>, <a href="https://github.com/mjamroz">@mjamroz</a>, <a href="https://github.com/LauLauThom">@LauLauThom</a>, <a href="https://github.com/karan6181">@karan6181</a>, <a href="https://github.com/turiphro">@turiphro</a>, <a href="https://github.com/chinakook">@chinakook</a>, <a href="https://github.com/zhanghang1989">@zhanghang1989</a>, <a href="https://github.com/Jerryzcn">@Jerryzcn</a></p><h3>Links</h3><p>Please Like/Star/Fork/Comment/Contribute if you like GluonCV!<br> <a href="https://gluon-cv.mxnet.io/index.html">GluonCV Website</a><br> <a href="https://github.com/dmlc/gluon-cv">GluonCV Github</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5c797863c92" width="1" height="1" alt=""><hr><p><a href="https://medium.com/apache-mxnet/gluoncv-0-7-resnest-next-generation-backbone-5c797863c92">GluonCV 0.7: ResNeSt, Next Generation Backbone</a> was originally published in <a href="https://medium.com/apache-mxnet">Apache MXNet</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>