<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.2.2">Jekyll</generator><link href="https://polukhin.tech/feed.xml" rel="self" type="application/atom+xml" /><link href="https://polukhin.tech/" rel="alternate" type="text/html" /><updated>2024-03-25T17:01:08+00:00</updated><id>https://polukhin.tech/feed.xml</id><title type="html">Andrii Polukhin</title><subtitle>Writing about neural networks and artificial intelligence. Keywords: machine learning, computer vision, object detection, deep learning, research, edge device, optimization.</subtitle><author><name>Andrii Polukhin</name><email>pandrii000@gmail.com</email></author><entry><title type="html">Efficiency in AI: Practical tips for optimizing AI</title><link href="https://polukhin.tech/2023/12/21/efficiency-in-ai" rel="alternate" type="text/html" title="Efficiency in AI: Practical tips for optimizing AI" /><published>2023-12-21T00:00:00+00:00</published><updated>2023-12-21T00:00:00+00:00</updated><id>https://polukhin.tech/2023/12/21/efficiency-in-ai</id><content type="html" xml:base="https://polukhin.tech/2023/12/21/efficiency-in-ai"><![CDATA[<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh">#TODO</span>
</code></pre></div></div>

<iframe src="./assets/efficiency-in-ai-presentation-ua.pdf" width="100%" height="480px" frameborder="0"></iframe>

<p><a href="./assets/efficiency-in-ai-presentation-ua.pdf">Download PDF</a></p>]]></content><author><name>Andrii Polukhin</name><email>pandrii000@gmail.com</email></author><summary type="html"><![CDATA[#TODO]]></summary></entry><entry><title type="html">How Object Detection Evolved: From Region Proposals and Haar Cascades to Zero-Shot Techniques</title><link href="https://polukhin.tech/2023/06/19/how-object-detection-evolved" rel="alternate" type="text/html" title="How Object Detection Evolved: From Region Proposals and Haar Cascades to Zero-Shot Techniques" /><published>2023-06-19T00:00:00+00:00</published><updated>2023-06-19T00:00:00+00:00</updated><id>https://polukhin.tech/2023/06/19/how-object-detection-evolved</id><content type="html" xml:base="https://polukhin.tech/2023/06/19/how-object-detection-evolved"><![CDATA[<p>The development of Object Detection algorithms has come a long way, starting with early computer vision and reaching a high level of accuracy through deep learning. In this post, we will look at the stages of development of these algorithms and the main methods used in modern object detection systems.</p>

<p>We start with a review of the early <strong>traditional object detection methods</strong>: Viola-Jones Detectors, HOG Detector, Part-based Approaches. They were actively used at the beginning of the development of this field.</p>

<p>Then, gradually move on to more modern <strong>deep learning object detection approaches</strong> based on Two-stage and One-stage Object Detection neural networks: RCNN, YOLO, SSD, CenterNet. These methods provide an end-to-end architecture that allows the algorithm to be adapted to any input data.</p>

<p>We end with the <strong>Zero-Shot object detection methods</strong>, which allow you to search for any objects in images without even training a neural network: OWL-ViT, GLIP, Segment Anything, GVT.</p>

<figure class="image" style="width:100%">
  <img src="./assets/object_detection_example.png" />
  <figcaption><a href="https://dagshub.com/blog/yolov6/">Object Detection Example</a></figcaption>
</figure>

<h2 id="introduction">Introduction</h2>

<p>In this article, we’ll look at the topic of Object Detection: what it means, why it matters, and why it’s an interesting entry point for newcomers to the field. We will then turn to general roadmaps, examining several important charts, and try to analyze the trends and evolution of Object Detection as a whole.</p>

<p>Let’s start with the basics. The Object Detection task can be formulated very simply: <strong>What objects are located where?</strong></p>

<p>Object Detection is a crucial task in computer vision where the objective is to identify and locate various objects, like cars, cyclists, and traffic lights, within an image. This is achieved by defining rectangular regions using coordinates (<code class="language-plaintext highlighter-rouge">xmin</code>, <code class="language-plaintext highlighter-rouge">ymin</code>, <code class="language-plaintext highlighter-rouge">xmax</code>, <code class="language-plaintext highlighter-rouge">ymax</code>) and associating them with a classification and probability vector (<code class="language-plaintext highlighter-rouge">p_1, p_2, ..., p_n</code>). Object Detection surpasses image classification in practical significance as it enables the detection of objects for subsequent analysis, modification, or classification. Techniques like Stable Diffusion and Face Swap leverage object detection to manipulate and replace objects or faces in images. Challenges arise when multiple objects of the same class, such as pedestrians, overlap, prompting the use of Non-Maximum Suppression as a common solution, although alternative methods are emerging.</p>
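<p>The handling of overlapping predictions mentioned above can be made concrete. Below is a minimal sketch of Intersection-over-Union (IoU) and greedy Non-Maximum Suppression for boxes in (<code class="language-plaintext highlighter-rouge">xmin</code>, <code class="language-plaintext highlighter-rouge">ymin</code>, <code class="language-plaintext highlighter-rouge">xmax</code>, <code class="language-plaintext highlighter-rouge">ymax</code>) format; the function names and the 0.5 threshold are illustrative, not taken from any particular library.</p>

```python
import numpy as np

def iou(a, b):
    """Intersection-over-Union of two boxes in (xmin, ymin, xmax, ymax) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, then drop every
    remaining box that overlaps it above the IoU threshold."""
    order = np.argsort(scores)[::-1]          # indices, best score first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < iou_threshold])
    return keep
```

<p>For example, two strongly overlapping boxes with scores 0.9 and 0.8 collapse to the single higher-scoring one, while a distant third box survives.</p>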

<p>Object Detection is an extremely important task that has been developed and improved over the past ~30 years.</p>

<figure class="image" style="width:100%">
  <img src="./assets/object_detection_importance.png" />
  <figcaption><a href="https://futuristech.com.au/services/computer-vision/object-detection/">Object Detection In Real World</a></figcaption>
</figure>

<p>The development of Object Detection algorithms is an active area of research, and we continue to observe new trends and advances in this field.</p>

<p>Its applications are widespread in a variety of industries, from video surveillance and autonomous cars to medicine and retail, specifically:</p>

<ul>
  <li>Video surveillance: detecting people and their location in video footage, monitoring the distance between them.</li>
  <li>Automotive industry: using computer vision to understand the environment and road safety.</li>
  <li>Medical diagnostics: detecting tumors and other abnormalities in images for analysis.</li>
  <li>Retail: automating the process of accounting for goods, detecting theft.</li>
  <li>Agriculture: monitoring plant health, disease detection, field analysis, and other agricultural tasks.</li>
</ul>

<h2 id="a-road-maps-of-object-detection">Road Maps of Object Detection</h2>

<p>Let’s start with the “roadmaps”. This section contains figures from several survey articles that describe the development of object detection methods. Let’s give an overview of these roadmaps, focusing on a few important points.</p>

<h3 id="road-map-general">Road Map (general)</h3>

<figure class="image" style="width:100%">
  <img src="./assets/od_roadmap_01.png" />
  <figcaption><a href="https://link.springer.com/article/10.1007/s11263-019-01247-4">(source)</a></figcaption>
</figure>

<p>At the beginning of the diagram above, you can see that convolutional neural networks have been in development since the 1980s. In 1998, LeCun et al. introduced LeNet-5, a landmark CNN architecture for handwritten digit recognition.</p>

<p>The accompanying dataset, known as MNIST, originated in the 1990s and has since become a popular benchmark for evaluating machine learning and deep learning algorithms.</p>

<p>Later, demand for object detection grew, but at the time there were few effective methods for it.</p>

<p>The first significant method was the Viola-Jones or Haar cascade, which was fast and easy to use on the PCs of the time, providing an acceptable speed of several frames per second (FPS).</p>

<p>A few years later, the HOG Detector method was introduced as an alternative to Viola-Jones, primarily focusing on capturing object shapes and contours.</p>

<p>Later, methods using Deformable Parts Models (DPM) came into play, which for a long time occupied leading positions in object detection accuracy ratings.</p>

<p>In 2012, the first large deep neural networks appeared, including AlexNet. Although these early networks were slow and computationally intensive, subsequent architectures such as MobileNet became progressively more efficient.</p>

<p>These models provided high-quality representative image features that can describe the context and detect a wide range of objects.</p>

<p>One of the most important aspects of these methods is their ‘end-to-end’ nature: the input image undergoes a sequence of differentiable operations, enabling holistic processing within a single architecture.</p>

<h3 id="road-map-more-traditional-methods">Road Map (more traditional methods)</h3>

<p>The development of object detection has passed through two historical periods: the “traditional object detection period” (pre-2014) and the “deep learning-based detection period” (post-2014).</p>

<p>In the period before 2014, most object detection algorithms were built on handcrafted features. For example, in 2001, P. Viola and M. Jones achieved unconstrained real-time human face detection with the VJ detector. This detector used a “sliding window” technique that went through all possible locations and scales in an image to check whether any window contained a human face.</p>

<figure class="image" style="width:100%">
  <img src="./assets/od_roadmap_02.png" />
  <figcaption><a href="https://arxiv.org/abs/1905.05055">(source)</a></figcaption>
</figure>

<p>SIFT (Scale-Invariant Feature Transform) is a method for extracting and describing distinctive features in images. Traditional computer vision offers many functions for describing an image: for example, you can compute the image gradient from the differences between neighboring pixel values, an important step in feature extraction, or apply the Laplacian operator to detect edges. SIFT divides the image into keypoint neighborhoods, or interest regions, and extracts features describing each region. By comparing these features with a descriptor representing the object, the region most similar to the object can be identified, giving a high probability that the object is located there.</p>
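<p>The gradient computation described above can be sketched in a few lines of NumPy; this is a toy finite-difference version (central differences, no smoothing), not the full SIFT pipeline.</p>

```python
import numpy as np

def image_gradients(img):
    """Per-pixel gradients from differences of neighboring pixel values."""
    img = img.astype(float)
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # central difference along x
    dy[1:-1, :] = img[2:, :] - img[:-2, :]   # central difference along y
    magnitude = np.hypot(dx, dy)             # gradient strength
    orientation = np.arctan2(dy, dx)         # gradient direction (radians)
    return magnitude, orientation
```

<p>On an image with a vertical edge, the magnitude peaks along the edge and the orientation points horizontally, which is exactly the kind of local structure that keypoint descriptors summarize.</p>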

<p>At the simplest level, this means finding the most similar part of the image. What other traditional methods are there? To be honest, I didn’t really look into it because <strong>almost none of them are used anymore</strong>. The only thing that might still be used is Haar cascades, if you need fast face detection.</p>

<h3 id="road-map-deep-learning-methods">Road Map (deep learning methods)</h3>

<p>In the period after 2014, with the advent of deep learning, object detection began to develop at an unprecedented pace. For example, in 2014, R. Girshick et al. proposed the Regions with CNN features (RCNN) method, which significantly improved the accuracy of object detection.</p>

<p>Continuing, in 2015, S. Ren et al. proposed the Faster RCNN detector, the first near-real-time detector based on deep learning. The main contribution of Faster-RCNN is the introduction of the Region Proposal Network (RPN), which makes it possible to obtain region proposals almost for free. From R-CNN to Faster RCNN, the individual blocks of the object detection system, such as proposal generation, feature extraction, bounding box regression, etc., were gradually integrated into a single end-to-end learning framework.</p>

<p>In 2017, T.-Y. Lin et al. proposed Feature Pyramid Networks (FPN). FPNs have shown significant progress in detecting objects with a wide range of scales. By using FPNs in the main Faster R-CNN system, it achieves the best results of detecting a single model on the COCO dataset without additional tuning.</p>

<p>As for single-stage detectors, YOLO (You Only Look Once) was proposed by J. Redmon et al. in 2015. YOLO is extremely fast: the fast version of YOLO runs at 155 frames per second. YOLO applies a single neural network to the full image. This network divides the image into regions and simultaneously predicts bounding boxes and class probabilities for each region.</p>

<figure class="image" style="width:100%">
  <img src="./assets/od_roadmap_03.png" />
  <figcaption><a href="https://link.springer.com/article/10.1007/s11263-019-01247-4">(source)</a></figcaption>
</figure>

<h3 id="object-detection-metrics-improvements">Object Detection Metrics Improvements</h3>

<p>An important aspect when evaluating Object Detection algorithms is the mean Average Precision (mAP) metric. It summarizes the trade-off between the algorithm’s precision and recall as the confidence threshold is varied. The confidence threshold in algorithms such as YOLOv5 lets you discard predictions with low probability; the balance between precision and recall is exactly what mAP captures.</p>
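<p>A tiny sketch of how precision and recall move with the confidence threshold; the input format (a list of scored predictions flagged as true or false positives) is a simplification for illustration.</p>

```python
def precision_recall(predictions, num_ground_truth, conf_threshold):
    """predictions: list of (confidence, is_true_positive) pairs.
    Keep predictions above the threshold, then compute precision and recall."""
    kept = [is_tp for conf, is_tp in predictions if conf >= conf_threshold]
    tp = sum(kept)                      # booleans sum to the TP count
    fp = len(kept) - tp
    precision = tp / (tp + fp) if kept else 1.0
    recall = tp / num_ground_truth
    return precision, recall
```

<p>Raising the threshold discards low-confidence detections: precision tends to rise while recall falls, and mAP averages precision over the full sweep of this trade-off.</p>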

<p>Let’s analyze in detail the development and improvement of object detection mAP on the VOC07, VOC12, and MS-COCO datasets.</p>

<figure class="image" style="width:100%">
  <img src="./assets/od_accuracy.png" />
  <figcaption>Accuracy improvement of object detection on VOC07, VOC12 and MS-COCO datasets. <a href="https://arxiv.org/abs/1905.05055">(source)</a></figcaption>
</figure>

<ul>
  <li>On the VOC07 dataset, the mAP of object detection increased from 21% in 2008 (DPM method) to 83.8% in 2018 (RefineDet). <strong>+62.8%</strong></li>
  <li>On the VOC12 dataset, the mAP increased from 53.7% in 2014 (R-CNN method) to 83.5% in 2018 (RefineDet). <strong>+29.8%</strong></li>
  <li>On the MS-COCO dataset, the mAP of object detection increased from 19.7% in 2015 (Fast R-CNN method) to 57.7% in 2021 (Swin Transformer). <strong>+38%</strong></li>
</ul>

<p>These data confirm the significant progress in object detection in recent years, especially with the advent of deep learning and its application in object detection methods.</p>

<h2 id="traditional-detection-methods">Traditional Detection Methods</h2>

<p>The world of object detection algorithms has seen many changes since face detection methods were first actively researched. In this article, we pick up the story in 2001, by which time several reviews of face detection methods had already been published.</p>

<p>At that time, there were two main approaches: Image-based and Feature-based. Image-based approaches used methods such as SVMs (Support Vector Machines) and Linear Subspace Methods. They also made use of convolutional neural networks (CNNs) like LeNet, which played a significant role in early image recognition tasks. Statistical methods were also employed, including techniques such as Gaussian mixture models and probabilistic models based on the normal distribution.</p>

<p>Although some of these methods were interesting from a research point of view and may have their value for general familiarization, they are no longer used much in modern object detection systems. Instead, modern approaches are based on large neural networks that allow for efficient image comparison and object recognition. Such approaches provide much more representative results.</p>

<figure class="image" style="width:70%">
  <img src="./assets/face_detection_2001_methods.png" />
  <figcaption><a href="https://api.semanticscholar.org/CorpusID:15724653">Face Detection Methods in 2001.</a></figcaption>
</figure>

<h3 id="viola-jones-detectors-2001">Viola-Jones Detectors (2001)</h3>

<p>One of these algorithms is the Haar cascade, also known as the Viola-Jones algorithm.</p>

<figure class="image" style="width:70%">
  <img src="./assets/Viola-Jones-algorithm-parts-a-combination-of-regions-b-Haar-Features-c-cascade.png" />
  <figcaption><a href="https://www.researchgate.net/figure/Viola-Jones-algorithm-parts-a-combination-of-regions-b-Haar-Features-c-cascade_fig1_282972331">Viola-Jones algorithm parts: (a) combination of regions, (b) Haar Features, (c) cascade classifier, (d) Haar feature applies to the image, and (e) LBP feature.</a></figcaption>
</figure>

<p>The Haar cascade algorithm is based on a simple idea. If we want to detect faces in an image, generally speaking, all faces have similar characteristics, such as two eyes, a nose, and a mouth. For example, the eyes usually have a certain shape, the bottom of the face is darker because of shadows, and the cheeks and nose can be highlighted when taking a photo.</p>

<p>Thus, we can form a set of templates that describe these face characteristics. These templates can be in the form of small squares or rectangles. Convolution operations are used to convolve these templates with image patches to generate feature maps, which are subsequently analyzed for object detection.</p>
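<p>In practice, Viola and Jones evaluate these rectangular templates in constant time using an integral image (summed-area table). A minimal sketch of one two-rectangle Haar-like feature follows; the function names are illustrative.</p>

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero top/left border for simple lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of pixels in the h-by-w rectangle with top-left corner (y, x),
    computed from just four integral-image lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, y, x, h, w):
    """Horizontal two-rectangle Haar feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, y, x, h, half) - rect_sum(ii, y, x + half, h, half)
```

<p>Because every rectangle sum costs four lookups regardless of its size, thousands of such features can be evaluated per window, which is what makes the cascade fast.</p>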

<p>The cascade approach of the Haar algorithm is used because of its advantages. The authors use a boosting method and sequentially apply different templates, which allows detecting faces with a lot of variability, such as tilts and lighting conditions. After sequentially applying different classifiers based on the cascade of templates, the algorithm makes decisions at each stage to determine whether to continue evaluating a candidate region as a face or reject it.</p>

<p>As a result, we get an object detector that works quickly and can show good results when various factors, including training data, feature selection, and application context, are considered.</p>

<h3 id="hog-detector-2005">HOG Detector (2005)</h3>

<figure class="image" style="width:100%">
  <img src="./assets/Object-detection-algorithm-using-HOG-features.png" />
  <figcaption><a href="https://www.researchgate.net/figure/Object-detection-algorithm-using-HOG-features_fig19_305510342">Object detection algorithm using HOG features.</a></figcaption>
</figure>

<p>The HOG (Histogram of Oriented Gradients) algorithm was invented in 2005 and differs from deep learning image processing methods by not using neural networks.</p>

<ol>
  <li>
    <p>First, the image is divided into small subpictures of 8x8 pixels. For each subimage, gradients are calculated, resulting in a set of gradient values. These values are distributed into a histogram with a specified number of bins, representing the distribution of gradients in that subregion. The histograms from multiple subregions are concatenated to form the feature vector.</p>
  </li>
  <li>
    <p>Next, the histograms are normalized across overlapping blocks of cells (typically with an L2 norm). This reduces sensitivity to changes in illumination and contrast between different parts of the image.</p>
  </li>
  <li>
    <p>After normalizing the histograms, a descriptor is computed for each region covered by a sliding window that moves across the image at multiple scales and aspect ratios. By examining these detection windows and comparing the feature vectors extracted from them, objects like faces can be detected. A trained classifier, often a support vector machine (SVM), is used to determine whether the object of interest is present.</p>
  </li>
</ol>
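<p>The first two steps above can be sketched for a single cell; this simplified version uses unsigned orientations, nine bins, no vote interpolation, and normalizes one cell instead of a block of cells.</p>

```python
import numpy as np

def cell_histogram(cell, num_bins=9):
    """Orientation histogram of one HOG cell (e.g. 8x8 pixels): gradient
    magnitudes vote into orientation bins covering [0, 180) degrees."""
    cell = cell.astype(float)
    dx = np.zeros_like(cell)
    dy = np.zeros_like(cell)
    dx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]
    dy[1:-1, :] = cell[2:, :] - cell[:-2, :]
    magnitude = np.hypot(dx, dy)
    angle = np.rad2deg(np.arctan2(dy, dx)) % 180.0        # unsigned orientation
    bins = (angle // (180.0 / num_bins)).astype(int) % num_bins
    hist = np.zeros(num_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())      # magnitude-weighted votes
    return hist

def l2_normalize(hist, eps=1e-6):
    """Normalization step, reduced here from a block of cells to one cell."""
    return hist / np.sqrt(np.sum(hist ** 2) + eps ** 2)
```

<p>A cell containing a vertical edge puts all its votes into the horizontal-gradient bin; concatenating such normalized histograms over the detection window yields the descriptor fed to the SVM.</p>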

<p>While this method can detect faces, it may not be as effective in detecting fine-grained details or complex structures such as scratches or brain tumors, limiting its use for such tasks.</p>

<p>At first glance, one might suggest incorporating more complex features that consider color and other parameters, and indeed, further research has explored such modifications. For instance, combining HOG with other feature descriptors like Histograms of Color or Haar-like features has shown promising results. Additionally, there exist effective methods that leverage partial features for object detection, such as combining multiple feature descriptors to find objects like a person or a face. Although these methods can be more intricate, they have demonstrated improved accuracy in certain scenarios.</p>

<p>Overall, the HOG method is an effective approach for detecting objects in images, particularly for tasks like face detection. By utilizing mathematical methods and gradient-based features, it achieves good results. Nevertheless, further research and modifications of the method can lead to improvements in its efficiency and accuracy.</p>

<h3 id="part-based-approaches">Part-based Approaches</h3>

<ul>
  <li>Deformable Part-based Model (2010)</li>
  <li>Implicit Shape Model (2004)</li>
</ul>

<p>The Deformable Part-based Model (DPM), proposed by Felzenszwalb et al. in 2010, is an object detection method based on the concept of variable-shaped parts. The Implicit Shape Model (ISM), proposed by Leibe et al. in 2004, is an object detection method that represents the shape of an object as a set of local features and uses statistical methods to find the most likely areas of an object in an image. Both methods have been widely used in object detection tasks, helping to improve the accuracy and reliability of image processing algorithms.</p>

<figure class="image" style="width:100%">
  <img src="./assets/deformable_part_model.png" />
  <figcaption><a href="https://arxiv.org/abs/1704.05519">(source)</a></figcaption>
</figure>

<h2 id="deep-learning-based-detection-methods">Deep Learning-based Detection Methods</h2>

<p>At the beginning of any deep learning-based object detection process, we have an input image that is fed to the model. This image can be processed in its original form or resized to a fixed size. Then, at each scale, we search for objects, and the results are averaged. There are different approaches to this task.</p>

<figure class="image" style="width:100%">
  <img src="./assets/od_dl_scheme.png" />
  <figcaption><a href="https://ieeexplore.ieee.org/document/10098596">The components of an ordinary object detection model.</a></figcaption>
</figure>

<p>After processing an image or a set of images, they are transferred to the model backbone. The task of the backbone is to extract various features from the image, creating feature vectors that describe the image. There are many different backbone models, such as AlexNet, VGG, ResNet, YOLO (using a modified version of DarkNet as its backbone), EfficientNet, MobileNet, and DenseNet.</p>

<p>The obtained features are passed from the backbone to the intermediate layers responsible for feature refinement before passing them to the head. In some architectures, there may be no intermediate module between the backbone and the head, and the features are directly passed to the head for generating the final bounding boxes and class predictions. The overall goal is to determine the location and class of objects.</p>

<h3 id="two--and-one--stage-detectors">Two- and One- Stage Detectors</h3>

<figure class="image" style="width:100%">
  <img src="./assets/one-two-stage-detectors.png" />
  <figcaption><a href="https://www.mdpi.com/2072-4292/13/1/89">Deep learning object detection meta-architectures.</a></figcaption>
</figure>

<p>Object detection algorithms can be divided into two categories: two-stage and one-stage. In two-stage algorithms, the backbone and neck are responsible for creating region proposals. Features are extracted from the image and passed to a neural network that returns a set of potential object locations along with confidence scores. Alternatively, a selective search algorithm can be used to generate object proposals. The extracted features, together with the proposals, are passed to the subsequent stages of the algorithm for further processing.</p>

<p>In contrast, one-stage algorithms use a simpler and faster approach: the image is processed directly, and features are extracted to detect objects without an explicit proposal-generation step.</p>

<h3 id="two-stage-detectors">Two-Stage Detectors</h3>

<h4 id="rcnn-2014">RCNN (2014)</h4>

<p>The first algorithm we will talk about is R-CNN (Region-based Convolutional Neural Network). It introduced the concept of region-based detection using a multi-stage pipeline. We take an image and generate region proposals. Each proposal is then warped to a fixed size and passed through a convolutional neural network (CNN) that extracts a fixed-length feature vector, followed by fully connected layers. These layers perform object classification, determining the object’s presence and its class. Additionally, the network incorporates bounding box regression to refine the coordinates of the bounding box around the object.</p>

<figure class="image" style="width:100%">
  <img src="./assets/rcnn.png" />
  <figcaption><a href="https://arxiv.org/abs/2104.11892">Illustration of the internal architecture of RCNN.</a></figcaption>
</figure>

<h4 id="fast-rcnn-2015">Fast RCNN (2015)</h4>

<p>Based on R-CNN, a modified algorithm called Fast R-CNN was developed. It is faster than its predecessor because it does not run the network separately on each cropped image region; instead it works with a feature map that the neural network computes once for the whole image, which is much smaller than the original image. From this shared map, features for each proposed region are extracted with RoI pooling, a single-level variant of Spatial Pyramid Pooling (SPP) that produces a fixed-size representation regardless of the region’s size. These features are passed to the fully connected layers, which perform object classification and bounding box refinement, as in the previous architecture.</p>
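<p>The key operation in Fast R-CNN is pooling an arbitrarily sized proposal on the shared feature map into a fixed-size grid. A minimal single-channel sketch of RoI max pooling (argument names are illustrative):</p>

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool the region roi = (y1, x1, y2, x2) of a 2-D feature map
    into a fixed output_size grid, as in Fast R-CNN's RoI pooling."""
    y1, x1, y2, x2 = roi
    region = feature_map[y1:y2, x1:x2]
    oh, ow = output_size
    # Split the region into an oh-by-ow grid of (possibly unequal) sub-windows.
    ys = np.linspace(0, region.shape[0], oh + 1).astype(int)
    xs = np.linspace(0, region.shape[1], ow + 1).astype(int)
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out
```

<p>Whatever the proposal’s size, the output is always the same fixed grid, so it can be flattened and fed to the fully connected layers.</p>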

<figure class="image" style="width:100%">
  <img src="./assets/fast-rcnn.png" />
  <figcaption><a href="https://arxiv.org/abs/2104.11892">Illustration of the internal architecture of Fast RCNN.</a></figcaption>
</figure>

<h4 id="faster-rcnn-2015">Faster RCNN (2015)</h4>

<p>Another modification that has significantly accelerated the process is Faster R-CNN. In this algorithm, regional proposals are generated using a neural network. This enables more fine-grained processing, simplifying training and facilitating the application of various optimization techniques to optimize the network for different platforms.</p>

<figure class="image" style="width:100%">
  <img src="./assets/faster-rcnn.png" />
  <figcaption><a href="https://arxiv.org/abs/2104.11892">Illustration of the internal architecture of Faster RCNN.</a></figcaption>
</figure>

<h4 id="fpn-2017">FPN (2017)</h4>

<p>The next algorithm we’ll talk about today is Feature Pyramid Networks (FPN), a modification of Faster R-CNN. Although it provides more accurate results than Faster R-CNN, FPN maintains a similar processing speed. In FPN, feature maps are extracted not only from the last layer of the network, as before, but from different stages of image processing. These features are then aggregated through a top-down pathway and lateral connections using element-wise addition, and the resulting feature pyramid is used to generate region proposals for the network that performs object classification and detection.</p>
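<p>One merge step of the top-down pathway can be sketched as follows; for brevity this uses nearest-neighbor 2x upsampling and omits the 1x1 lateral convolution that FPN applies to the finer map.</p>

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a 2-D feature map."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def fpn_merge(top_down, lateral):
    """One FPN merge: upsample the coarser top-down map and add the
    finer lateral map element-wise."""
    return upsample2x(top_down) + lateral
```

<p>Repeating this step down the pyramid gives every level both high resolution (from the lateral map) and strong semantics (from the top-down map).</p>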

<figure class="image" style="width:100%">
  <img src="./assets/fpn.png" />
  <figcaption><a href="https://arxiv.org/abs/2104.11892">Illustration of the internal architecture of Feature Pyramid Networks (FPN).</a></figcaption>
</figure>

<p>This is an overview of some two-stage object detection algorithms. Each of them has its own advantages and can be used depending on the needs of your project.</p>

<h4 id="backbones">Backbones</h4>

<p>The development of Object Detection algorithms is an active research area, where much attention is paid to the impact of various architectures, such as backbone, on the accuracy and efficiency of detectors. We will look at the impact of backbone on two-stage detectors and discuss important aspects of this issue.</p>

<figure class="image" style="width:100%">
  <img src="./assets/backbone_mAP.png" />
  <figcaption><a href="https://arxiv.org/abs/1905.05055">A comparison of detection accuracy of three detectors: Faster RCNN, R-FCN and SSD on MS-COCO dataset with different detection backbones.</a></figcaption>
</figure>

<p>In order to generate accurate object proposals based on features, it is necessary to have high-quality features that will allow you to find objects in the image. The choice of a suitable backbone architecture has a significant impact on the accuracy of the detector. For example, popular architectures such as MobileNet, Inception, and ResNet exhibit different efficiency and accuracy characteristics.</p>

<p>The quality of the extracted features can be evaluated by training the backbone on an object detection dataset with ground-truth bounding boxes, using a suitable loss function. The head of the architecture is typically modified or augmented with additional layers to enable object detection.</p>

<p>Training the backbone in the Faster R-CNN architecture can be challenging due to its interdependence with other components. In this case, the neural network components, including the backbone, region proposal network, and object detection heads, are trained jointly.</p>

<p>The feature extractor is not frozen after pre-training: it continues to be fine-tuned jointly with the other components throughout detection training.</p>

<p>An interesting characteristic of Faster R-CNN is its two-stage learning process, which involves training the region proposal network (RPN) first, followed by the training of object detection heads.</p>

<p>Currently, training one-stage object detection algorithms, such as YOLO (You Only Look Once) or SSD (Single Shot MultiBox Detector), has been greatly simplified as they are trained in one pass, but they still have their own nuances.</p>

<h3 id="one-stage-detectors">One-Stage Detectors</h3>

<h4 id="yolo-2015">YOLO (2015)</h4>

<p>One of the one-stage detectors is YOLO (You Only Look Once). Although a simple diagram below does not fully describe the inner workings of the algorithm, it helps to understand the general concept.</p>

<p>The image is divided into a grid of cells, where the size of the grid is configurable. Each cell contains features used for object detection.</p>

<p>The main idea is that each grid cell predicts a small fixed number of bounding boxes (two per cell in the original paper) together with class probabilities; the cell containing an object’s center is responsible for detecting that object.</p>
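<p>The grid assignment can be sketched directly: the cell containing a box’s center is responsible for that box, and the center is encoded as an offset inside the cell, as in YOLOv1 (the function name is illustrative).</p>

```python
def yolo_cell_for_box(box, image_size, grid_size=7):
    """Return the (row, col) grid cell responsible for a box, plus the
    box center's offsets inside that cell, each normalized to [0, 1).
    box = (xmin, ymin, xmax, ymax) in pixels; image_size = (width, height)."""
    xmin, ymin, xmax, ymax = box
    w, h = image_size
    cx = (xmin + xmax) / 2 / w            # center, normalized to [0, 1)
    cy = (ymin + ymax) / 2 / h
    col = int(cx * grid_size)
    row = int(cy * grid_size)
    x_off = cx * grid_size - col          # position inside the cell
    y_off = cy * grid_size - row
    return (row, col), (x_off, y_off)
```

<p>For example, a box centered at pixel (50, 50) of a 700x700 image lands in the top-left cell of a 7x7 grid, exactly in the middle of that cell.</p>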

<figure class="image" style="width:100%">
  <img src="./assets/yolo.png" />
  <figcaption><a href="https://arxiv.org/abs/2104.11892">Illustration of the internal architecture of You Only Look Once (YOLO).</a></figcaption>
</figure>

<h4 id="ssd-2015">SSD (2015)</h4>

<p>Another one-stage detector is SSD (Single Shot MultiBox Detector), which works by aggregating features from several layers of different resolutions into the final classifier and regressor.</p>

<figure class="image" style="width:100%">
  <img src="./assets/ssd.png" />
  <figcaption><a href="https://arxiv.org/abs/2104.11892">Illustration of the internal architecture of Single Shot MultiBox Detector (SSD).</a></figcaption>
</figure>

<h4 id="retinanet-2017">RetinaNet (2017)</h4>

<p>RetinaNet is another significant one-stage detector. It combines a feature pyramid network, which aggregates context and texture features across scales, with a focal loss function that focuses training on hard examples, achieving high accuracy in object localization.</p>
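<p>The focal loss is simple to state: it scales the usual cross-entropy by <code class="language-plaintext highlighter-rouge">(1 - p_t)^gamma</code>, so well-classified examples contribute almost nothing. A NumPy sketch of the binary case:</p>

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss from RetinaNet: p is the predicted foreground
    probability, y is 1 for foreground and 0 for background."""
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```

<p>With gamma = 0 and alpha = 1 this reduces to plain cross-entropy; with gamma = 2, an easy positive (p = 0.9) is penalized far less than a hard one (p = 0.1), which keeps the huge number of easy background boxes from dominating training.</p>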

<figure class="image" style="width:100%">
  <img src="./assets/retinanet.png" />
  <figcaption><a href="https://arxiv.org/abs/2104.11892">Illustration of the internal architecture of RetinaNet.</a></figcaption>
</figure>

<h4 id="centernet-2019">CenterNet (2019)</h4>

<p>The CenterNet architecture, released in 2019, is another one-stage solution worth mentioning.</p>

<figure class="image" style="width:100%">
  <img src="./assets/centernet.png" />
  <figcaption><a href="https://arxiv.org/abs/2104.11892">Illustration of the internal architecture of CenterNet.</a></figcaption>
</figure>

<p>Forming such prediction grids has its challenges, especially when working with large satellite images containing, for example, thousands of objects.</p>

<p>Instead of defining a bounding box, CenterNet assigns a center point to each object. This allows for object detection and counting using the center point, in combination with the predicted offsets, to generate a bounding box that encompasses the object.</p>

<p>A significant feature of CenterNet is the use of an Hourglass Backbone, which enables multi-scale information fusion and enhances the model’s ability to capture contextual information.</p>

<p>After the Hourglass Backbone, CenterNet performs keypoint assessment and detection.</p>
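A minimal sketch of the center-point idea, assuming illustrative heatmap, size, and offset arrays (not the actual CenterNet code): each heatmap peak becomes a center, and the predicted width/height and sub-pixel offset at that location reconstruct the box.

```python
import numpy as np

def centers_to_boxes(heatmap, sizes, offsets, thresh=0.5):
    """Turn a CenterNet-style center heatmap into bounding boxes.

    heatmap: (H, W) center-point scores (single class here).
    sizes:   (H, W, 2) predicted box width/height at each location.
    offsets: (H, W, 2) sub-pixel center offsets.
    """
    ys, xs = np.where(heatmap > thresh)
    boxes = []
    for y, x in zip(ys, xs):
        w, h = sizes[y, x]
        dx, dy = offsets[y, x]
        cx, cy = x + dx, y + dy
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2,
                      float(heatmap[y, x])))
    return boxes

heatmap = np.zeros((8, 8)); heatmap[4, 4] = 0.9
sizes = np.zeros((8, 8, 2)); sizes[4, 4] = (4.0, 2.0)
offsets = np.zeros((8, 8, 2))
print(centers_to_boxes(heatmap, sizes, offsets))  # one box centered at (4, 4)
```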

<h3 id="object-detectors-by-category">Object Detectors by Category</h3>

<p>The unified and efficient development of algorithms for object detection is a hot topic in the field of computer vision. Today, there are several families of object detector algorithms that differ in approach and results. One of them is anchor-based two-stage detectors, which use anchors to propose candidate regions that are then analyzed to classify and localize objects.</p>

<figure class="image" style="width:100%">
  <img src="./assets/od_categories.png" />
  <figcaption><a href="https://ieeexplore.ieee.org/document/10098596">The number of state-of-the-art object detectors, by category, published in top journals and evaluated on MS-COCO.</a></figcaption>
</figure>

<p>Another option is anchor-free detectors, which identify regions with objects without anchors, using other strategies such as keypoints or center points. One of the new and promising directions is transformer-based object detectors. First appearing around 2019, these algorithms, built on the Visual Transformer, are highly accurate. However, they have limitations in performance and computing power because they are not yet well optimized for some platforms. Nevertheless, active research and optimization of these algorithms are already underway.</p>

<h3 id="transformer-based-detectors">Transformer-based Detectors</h3>

<p>Now I want to draw your attention to the features of transformer-based object detectors. For example, they don’t use Non-Maximum Suppression, a method we’ll talk about later.</p>

<figure class="image" style="width:100%">
  <img src="./assets/vit.png" />
  <figcaption><a href="https://arxiv.org/abs/2010.11929">Model overview. We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. In order to perform classification, we use the standard approach of adding an extra learnable “classification token” to the sequence.</a></figcaption>
</figure>

<p>A transformer-based object detector is an algorithm that uses the Visual Transformer architecture, which builds on the attention mechanism. The attention-based Transformer model was introduced in the paper “Attention Is All You Need”, published by Vaswani et al. in 2017, which revolutionized sequence transduction tasks.</p>

<p>The transformer has repeating blocks and is primarily based on self-attention mechanisms, enabling it to capture dependencies between different positions in the input sequence.</p>

<p>This allows transformers to become a powerful tool in text processing, such as text understanding and text generation. After their success in the field of text processing, transformers are also being used in the field of computer vision, in particular for object detectors.</p>

<p>Visual data is processed by dividing the image into patches and then processing these patches using a transformer. This approach greatly simplifies image processing compared to convolutional networks.</p>

<p>The image patches are flattened and treated as sequences of vectors, which are sequentially processed by the transformer. The output can be utilized for various tasks, including object detection.</p>

<p>For efficient work with vectors, Positional Embedding is used to incorporate positional information into the vectors by adding positional encoding vectors that represent the relative positions of elements.</p>
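The patch-flattening and positional-encoding steps above can be sketched as follows; the patch size and the encoding scheme (the sinusoidal variant from the original Transformer) are illustrative choices.

```python
import numpy as np

def patchify(image, patch=4):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    n_h, n_w = H // patch, W // patch
    return (image[:n_h * patch, :n_w * patch]
            .reshape(n_h, patch, n_w, patch, C)
            .transpose(0, 2, 1, 3, 4)
            .reshape(n_h * n_w, patch * patch * C))

def sinusoidal_positions(n_tokens, dim):
    """Fixed sinusoidal positional encodings, as in the original Transformer."""
    pos = np.arange(n_tokens)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

image = np.random.rand(32, 32, 3)
tokens = patchify(image, patch=4)          # 8x8 grid -> 64 patch vectors
tokens = tokens + sinusoidal_positions(tokens.shape[0], tokens.shape[1])
print(tokens.shape)  # (64, 48)
```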

<h4 id="detr-2020">DETR (2020)</h4>

<p>DETR is an end-to-end object detection model that directly predicts the bounding boxes and class labels of objects in an image without using Non-Max Suppression. It is trained with a set-based loss: bipartite matching uniquely assigns each prediction to a ground-truth box, and the model’s parameters are optimized against the matched pairs using labeled training data.</p>

<figure class="image" style="width:100%">
  <img src="./assets/detr.png" />
  <figcaption><a href="https://towardsdatascience.com/detr-end-to-end-object-detection-with-transformers-and-implementation-of-python-8f195015c94d">DETR directly predicts (in parallel) the final set of detections by combining a common CNN with a transformer architecture. During training, bipartite matching uniquely assigns predictions with ground truth boxes. Prediction with no match should yield a “no object” class prediction.</a></figcaption>
</figure>
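The bipartite matching used during training can be illustrated with a toy brute-force matcher; DETR itself uses the Hungarian algorithm, and the costs below are made-up stand-ins for the real matching cost (a mix of classification and box terms such as 1 − IoU).

```python
from itertools import permutations

import numpy as np

def bipartite_match(cost):
    """Brute-force minimum-cost one-to-one matching (predictions x targets).

    Fine for a toy example with a handful of boxes; real implementations
    use the Hungarian algorithm for efficiency.
    """
    n_pred, n_tgt = cost.shape
    best, best_perm = float("inf"), None
    for perm in permutations(range(n_pred), n_tgt):
        total = sum(cost[p, t] for t, p in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    # best_perm[t] is the prediction assigned to target t
    return best, best_perm

# rows: 3 predictions, cols: 2 ground-truth boxes
cost = np.array([[0.9, 0.1],
                 [0.2, 0.8],
                 [0.5, 0.5]])
print(bipartite_match(cost))  # prediction 1 -> target 0, prediction 0 -> target 1
```

Predictions left unmatched (here, prediction 2) are trained towards the "no object" class, as the figure caption above describes.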

<h4 id="swin-2021">Swin (2021)</h4>

<p>In 2021, the SWIN (<strong>S</strong>hifted <strong>win</strong>dows) architecture was introduced. The idea behind the SWIN transformer is quite simple. It uses the Visual Transformer technique, which splits the input image into patches, but the SWIN transformer backbone is similar to a Feature Pyramid Network.</p>

<figure class="image" style="width:100%">
  <img src="./assets/swin.png" />
  <figcaption><a href="http://arxiv.org/abs/2103.14030">The proposed Swin Transformer builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red).</a></figcaption>
</figure>

<p>Instead of attending over all patches of the image at once, the SWIN transformer divides the input image into a grid of fixed-size patches, computes self-attention within local windows, and progressively merges patches into a hierarchical feature map. This approach enhances the model’s understanding of spatial relationships, improving object localization and segmentation results. The updated second version of the SWIN Transformer demonstrates improved performance in various tasks, such as object detection and instance segmentation.</p>
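The window partitioning at the heart of Swin can be sketched like this; the window size and feature shapes are illustrative, and shifting is shown only as a roll of the feature map.

```python
import numpy as np

def window_partition(x, window=4):
    """Split a (H, W, C) feature map into non-overlapping local windows.

    Swin computes self-attention only inside each window, which gives it
    linear complexity in image size; alternating layers shift the window
    grid so information flows across window borders.
    """
    H, W, C = x.shape
    x = x.reshape(H // window, window, W // window, window, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, C)

feat = np.random.rand(8, 8, 16)
windows = window_partition(feat, window=4)
print(windows.shape)  # (4, 16, 16): 4 windows of 16 tokens each

# shifted windows: roll the map by half a window before partitioning
shifted = np.roll(feat, shift=(-2, -2), axis=(0, 1))
print(window_partition(shifted, window=4).shape)  # (4, 16, 16)
```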

<h3 id="non-max-suppression-nms">Non-Max Suppression (NMS)</h3>

<figure class="image" style="width:100%">
  <img src="./assets/nms_evolution.png" />
  <figcaption><a href="https://arxiv.org/abs/1905.05055">Evolution of non-max suppression (NMS) techniques in object detection from 1994 to 2021: 1) Greedy selection, 2) Bounding box aggregation, 3) Learning to NMS, and 4) NMS-free detection.</a></figcaption>
</figure>

<p>One of the key steps of Object Detection is the Non-Max Suppression (NMS) algorithm, which is used to combine overlapping bounding boxes and obtain one final result.</p>

<p>Let’s imagine a situation where we use the YOLO algorithm for object detection. After processing the image, YOLO returns 4 bounding boxes, and all of them overlap. The question arises: which of these bounding boxes should be considered correct? Each of them may cover only a certain part of the object, but we are interested in the whole object and its exact position.</p>

<p>There were several stages in the development of these techniques. One early approach, known as “greedy selection,” was used by the Viola-Jones and Haar-based detectors. The idea is to repeatedly select the highest-scoring of the overlapping bounding boxes as the correct result and discard the boxes that overlap it. However, this approach has limitations in terms of detection accuracy.</p>

<p>Another common technique is “bounding box aggregation.” At this stage, all bounding boxes are combined according to specific principles, such as averaging the coordinates of bounding box centers or choosing the maximum or minimum coverage value. The choice of a particular method depends on the task at hand.</p>

<p>In general, the goal is to use all bounding boxes while preserving information by aggregating them. However, it should be borne in mind that this approach also has its drawbacks, especially when using inaccurate bounding boxes that may affect the final result.</p>
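As an illustration of aggregation, here is a simple score-weighted averaging of one cluster of overlapping boxes, similar in spirit to Weighted Boxes Fusion; the boxes and scores are made up.

```python
import numpy as np

def aggregate_boxes(boxes, scores):
    """Fuse a cluster of overlapping (x1, y1, x2, y2) boxes into one.

    Instead of discarding overlapping boxes (as greedy NMS does), average
    their coordinates weighted by confidence, so every box contributes.
    """
    w = np.asarray(scores) / np.sum(scores)
    return (w[:, None] * np.asarray(boxes, float)).sum(axis=0)

cluster = [[0, 0, 10, 10], [2, 2, 12, 12], [1, 1, 11, 11]]
scores = [0.9, 0.3, 0.6]
print(aggregate_boxes(cluster, scores))  # one fused box near the cluster center
```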

<p>Non-Max Suppression (NMS) is performed as a post-processing step after the object detection model generates bounding box predictions. NMS filters out redundant bounding boxes, selecting only the most confident and non-overlapping boxes as the final output.</p>
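A compact sketch of greedy NMS, the most common form of this post-processing step; the IoU threshold of 0.5 is typical but arbitrary.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: box 1 is suppressed by box 0
```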

<p>There are also NMS-free detectors, such as DETR, that do not rely on non-max suppression. These detectors immediately return one bounding box per object, reducing the need for additional steps after detection. CenterNet, for example, returns key points, eliminating the need for NMS: each object is represented by one center point plus a predicted width and height that define its bounding rectangle.</p>

<h2 id="zero--one--few---shot-object-detection">(Zero | One | Few) - Shot Object Detection</h2>

<p>In the development of algorithms for object detection, we are gradually moving towards the topic of Few-, One-, and Zero-Shot Object Detection. In this section, we will focus less on technical details and more on a high-level, simple description of how zero-shot object detection can be performed.</p>

<h3 id="multimodality">Multimodality</h3>

<figure class="image" style="width:80%">
  <img src="./assets/multimodal_workflow.png" />
  <figcaption><a href="https://www.v7labs.com/blog/multimodal-deep-learning-guide">Workflow of a typical multimodal. Three unimodal neural networks encode the different input modalities independently. After feature extraction, fusion modules combine the different modalities (optionally in pairs), and finally, the fused features are inserted into a classification network.</a></figcaption>
</figure>

<p>The key concept in this context is multimodality, which means that a neural network can understand several types of data simultaneously. For example, it can be a combination of image and text, image and sound, text and sound, or even image, text, and sound at the same time.</p>

<figure class="image">
  <img src="./assets/multimodal_model.png" />
  <figcaption><a href="https://research.aimultiple.com/multimodal-learning/">(source)</a></figcaption>
</figure>

<p>In this approach, we have several input signals, each processed by a corresponding module: a separate module for text, a separate module for images, and a separate module for audio. Together these modules form one single neural network that works from start to finish, which is called an end-to-end architecture.</p>

<p>Next, fusion modules are used. They may have different names, but they perform the same function - they combine image, text, and audio features and perform certain operations on them. For example, perhaps they look for the most similar image feature vector to a text feature vector. This is similar to the principle of CLIP architecture, which we’ll talk about later.</p>

<h3 id="clip-2021">CLIP (2021)</h3>

<p>CLIP adds <strong>image-text connection</strong> to understand <strong>the content</strong> of the image.</p>

<figure class="image">
  <img src="./assets/clip.png" />
  <figcaption><a href="https://openai.com/research/clip">CLIP by OpenAI.</a></figcaption>
</figure>

<p>CLIP is a revolutionary development. The main idea behind CLIP is that it creates a connection between images and texts to better understand the context of the image. CLIP uses two models - TextEncoder and ImageEncoder. Each of these models converts data into a vector format.</p>

<p>CLIP is trained on a dataset consisting of text-image pairs, with each pair containing a text description and a corresponding image. During training, the model tries to find the TextEncoder and ImageEncoder parameters so that the vectors obtained for the text and image are similar to each other. The goal is to have the vectors of other text descriptions be different from the target image vector.</p>

<p>When using CLIP for Zero-Shot Object Detection, we can feed in an image and a list of words or phrases related to the objects we want to find. For example, for an image of a dog, we can use the TextEncoder to create a vector for each candidate prompt, such as “A photo of a dog”. We then compare the image vector with the vector of each text in the list. The text with the smallest distance to the image vector indicates the object shown in the image.</p>
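The comparison step can be sketched with cosine similarity; the embeddings below are hand-made stand-ins for what CLIP's TextEncoder and ImageEncoder would actually produce.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# stand-in embeddings; real CLIP encoders would produce these vectors
image_emb = np.array([[0.9, 0.1, 0.0]])    # embedding of a dog photo
text_embs = np.array([[0.8, 0.2, 0.1],     # "a photo of a dog"
                      [0.1, 0.9, 0.2],     # "a photo of a cat"
                      [0.0, 0.1, 0.9]])    # "a photo of a car"
prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

sims = cosine_sim(image_emb, text_embs)[0]
print(prompts[int(np.argmax(sims))])  # prints: a photo of a dog
```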

<p>Thus, we can use CLIP to classify objects in images even without separately training the model on a specific dataset with objects. This approach opens up a wide range of possibilities for applying CLIP in the field of Object Detection, where we can utilize the relationships between texts and images to find objects in images.</p>

<h3 id="owl-vit-2022">OWL-ViT (2022)</h3>

<p>OWL-ViT adds <strong>image-level patches</strong> to understand <strong>the location</strong> of the objects.</p>

<figure class="image" style="width:70%">
  <img src="./assets/owl_vit_00.png" />
  <figcaption><a href="https://arxiv.org/abs/2205.06230">OWL-ViT: Image-level contrastive pre-training.</a></figcaption>
</figure>

<p>In 2022, a new multimodal architecture, OWL-ViT, was introduced for object detection. This network, which is available on the Hugging Face platform, has gained considerable interest in the research and practice community. Let me tell you more about it.</p>

<p>The basic idea is to create embeddings of an image and text, and then compare these embeddings. The image is processed through a Vision Transformer, which generates a set of embeddings. Then, the Vision Transformer applies self-attention and feed-forward networks to these embeddings. Although some of the steps may seem confusing, in practice they help to improve the quality of the model.</p>

<p>Finally, during the training phase, a contrastive loss function is used to encourage corresponding image-text pairs to have similar embeddings, and non-corresponding pairs to have distinct embeddings. The model predicts a bounding box and the probability that a certain text embedding applies to a particular object.</p>

<p>It should be noted that the accuracy of object detection may be limited. The authors of the original model used a process of fine-tuning the pre-trained model with object detection datasets using a bipartite matching loss. This process assists in improving the quality of the detected bounding boxes. More information about this process is shown in the diagrams below.</p>

<figure class="image" style="width:100%">
  <img src="./assets/owl_vit_01.png" />
  <figcaption><a href="https://arxiv.org/abs/2205.06230">OWL-ViT: Transfer to open-vocabulary detection.</a></figcaption>
</figure>

<p>Now let’s look at an additional feature of this multimodal model. In addition to text, you can use an image as a template. For example, if you have a photo of a butterfly, you can use it as a search query and find similar images. The model is able to analyze both text and images based on common properties.</p>

<figure class="image" style="width:100%">
  <img src="./assets/owl_vit_02.png" />
  <figcaption><a href="https://arxiv.org/abs/2205.06230">OWL-ViT: Example of one-shot image-conditioned detection.</a></figcaption>
</figure>

<h3 id="glip-2022">GLIP (2022)</h3>

<p>GLIP adds <strong>word-level understanding</strong> to find the objects <strong>by the semantics</strong> of the prompt.</p>

<figure class="image" style="width:100%">
  <img src="./assets/glip.png" />
  <figcaption><a href="http://arxiv.org/abs/2112.03857">GLIP zero-shot transfers to various detection tasks, by writing the categories of interest into a text prompt.</a></figcaption>
</figure>

<p>GLIP (2022) goes further by grounding the semantics of a text prompt in the image. Let’s illustrate this with an example. Suppose we have a sentence about a woman holding a hair dryer and wearing glasses, together with an image showing exactly that. GLIP reformulates object detection as phrase grounding: accepting both an image and a text prompt as inputs, it aligns entities in the text, such as the person, the hairdryer, and the glasses, with regions of the image.</p>

<figure class="image">
  <img src="./assets/glip_01.png" />
  <figcaption><a href="http://arxiv.org/abs/2112.03857">We reformulate detection as a grounding task by aligning each region/box to phrases in a text prompt. We add the cross-modality deep fusion to early fuse information from two modalities and to learn a language-aware visual representation.</a></figcaption>
</figure>

<p>This technology offers a new approach to finding objects in an image based on their semantic correspondence with a text prompt. Now, we are not just identifying objects, but also associating parts of the text with components of the image.</p>

<p>Even if you only provide the name of the object, such as “Stingray”, GLIP will be able to find it, though perhaps with lower accuracy. However, if you add a description, such as “flat fish”, it provides additional context about what you are looking for. Prompt engineering matters for modern Zero-Shot Object Detection methods just as it does for ChatGPT.</p>

<figure class="image" style="width:100%">
  <img src="./assets/glip_02.png" />
  <figcaption><a href="http://arxiv.org/abs/2112.03857">A manual prompt tuning example from the Aquarium dataset in ODinW. Given an expressive prompt (“flat and round”), zero-shot GLIP can detect the novel entity “stingray” better.</a></figcaption>
</figure>

<h3 id="segment-anything-2023">Segment Anything (2023)</h3>

<p>Segment Anything (SAM) adds <strong>masks</strong> to see <strong>the pixel-level</strong> location of the objects.</p>

<p>This algorithm, introduced in 2023, can not only detect objects in images but also segment them by predicting masks at the pixel level.</p>

<figure class="image" style="width:100%">
  <img src="./assets/sam_title.png" />
  <figcaption><a href="https://arxiv.org/abs/2304.02643">(source)</a></figcaption>
</figure>

<p>One of the main features of Segment Anything is its usage of image and prompt encoders to create an overall image embedding, which can be used to segment images based on prompts. These prompts can be spatial, textual, or a combination of both. For instance, you could input “person” as a text prompt, and the algorithm would strive to segment all objects in the image related to a person.</p>

<p>This not only allows you to segment different areas in images, but also to understand the layout and content of the scene. Using the segmentation masks produced by the algorithm, one could potentially perform tasks such as counting the number of instances of an object, given the appropriate post-processing steps.</p>

<figure class="image" style="width:100%">
  <img src="./assets/sam_masks.png" />
  <figcaption><a href="https://arxiv.org/abs/2304.02643">(source)</a></figcaption>
</figure>
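As a sketch of the counting idea: given SAM-style binary masks, one could filter out noise and near-duplicates before counting. The thresholds and masks below are illustrative, not part of the SAM pipeline.

```python
import numpy as np

def count_instances(masks, min_area=10, overlap_thresh=0.9):
    """Count distinct object instances from a list of binary masks.

    Drops tiny masks and masks that almost entirely overlap an already
    kept mask, then counts what remains.
    """
    kept = []
    for m in masks:
        if m.sum() < min_area:
            continue  # too small, likely noise
        duplicate = any(
            np.logical_and(m, k).sum() / min(m.sum(), k.sum()) > overlap_thresh
            for k in kept)
        if not duplicate:
            kept.append(m)
    return len(kept)

a = np.zeros((16, 16), bool); a[2:8, 2:8] = True       # instance 1
b = np.zeros((16, 16), bool); b[3:8, 3:8] = True       # near-duplicate of 1
c = np.zeros((16, 16), bool); c[10:14, 10:14] = True   # instance 2
tiny = np.zeros((16, 16), bool); tiny[0, 0] = True     # noise
print(count_instances([a, b, c, tiny]))  # prints: 2
```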

<h3 id="good-visual-tokenizers-2023">Good Visual Tokenizers (2023)</h3>

<p>GVT adds <strong>usage of the Large Language Model</strong> to <strong>investigate</strong> the image with the text.</p>

<p>Good Visual Tokenizers (GVT) is a new Multimodal Large Language Model (MLLM) that involves a visual tokenizer optimized through proper pre-training methods. This tokenizer aids in understanding both the semantic and fine-grained aspects of visual data.</p>

<figure class="image" style="width:100%">
  <img src="./assets/gvt.png" />
  <figcaption><a href="https://arxiv.org/abs/2305.12223">Different tasks require visual understanding of different perspectives. Mainstream vision-language tasks, e.g., (a) VQA and (b) Image Captioning mainly focus on semantic understanding of the image. In this work, we also study two fine-grained visual understanding tasks: (c) Object Counting (OC) and (d) Multi-Class Identification (MCI).</a></figcaption>
</figure>

<p>GVT introduces an optimized visual tokenizer within a Large Language Model, enabling a more comprehensive investigation of images along with the associated text. While the application of these algorithms to specific domains such as medical imagery might require additional research, GVT has already demonstrated superior performance on tasks involving visual question answering, image captioning, and fine-grained visual understanding tasks such as object counting and multi-class identification.</p>

<figure class="image" style="width:100%">
  <img src="./assets/gvt_llm_scheme.png" />
  <figcaption><a href="https://arxiv.org/abs/2305.12223">Framework of GVT. First distill the features of a pretrained CLIP via smoothed L1 loss. Then, use it to encode images into a set of tokens, which are fed into the Perceiver Resampler as soft prompts. Together with language instructions, these prompts are fed into LLM to generate responses. Only the Perceiver Resampler is optimized in this process.</a></figcaption>
</figure>

<p>Integrating text and images into one model allows you to expand your data understanding and processing capabilities. By using algorithms like the ones above, significant advances can be made in solving a variety of tasks that previously required complex algorithms and large amounts of data.</p>

<p><u>To sum up Zero-Shot Object Detection</u>:</p>

<ol>
  <li>CLIP adds <strong>image-text connection</strong> to understand <strong>the content</strong> of the image.</li>
  <li>OWL-ViT adds <strong>image-level patches</strong> to understand <strong>the location</strong> of the objects.</li>
  <li>GLIP adds <strong>word-level understanding</strong> to find the objects <strong>by the semantics</strong> of the prompt.</li>
  <li>SAM adds <strong>masks</strong> to see <strong>the pixel-level</strong> location of the objects.</li>
  <li>GVT adds <strong>usage of the Large Language Model</strong> to <strong>investigate</strong> the image with the text.</li>
</ol>

<hr />

<p>You can learn more about Object Detection Evolution by exploring my presentation below.</p>

<iframe src="./assets/how-object-detection-evolved.pdf" width="100%" height="480px" frameborder="0"></iframe>
<caption><center>Download: <a href="./assets/how-object-detection-evolved.pdf">how-object-detection-evolved.pdf</a>.</center></caption>

<p>Also, check out the online meeting based on this presentation below. Speaking language is <strong>Ukrainian</strong>.</p>

<iframe width="100%" height="480px" src="https://www.youtube.com/embed/F-GZ1A-fWWU" frameborder="0" allowfullscreen=""></iframe>

<h2 id="conclusion">Conclusion</h2>

<p>The evolution of object detection algorithms has been a remarkable journey, from the early days of computer vision to the current state-of-the-art deep learning techniques. Starting with traditional methods like Viola-Jones Detectors and HOG Detectors, we witnessed the transition to more advanced approaches such as RCNN, YOLO, SSD, and CenterNet, which introduced end-to-end architectures for improved adaptability. However, the most groundbreaking leap came with Zero-Shot object detection methods like OWL-ViT, GLIP, Segment Anything, and GVT, enabling us to detect objects in images without the need for extensive neural network training!</p>

<hr />

<p>Thank you for taking the time to read this article. If you found it informative and engaging, feel free to connect with me through my social media channels.</p>

<ul>
  <li><a href="https://www.linkedin.com/in/pandrii000/">LinkedIn</a></li>
  <li><a href="https://github.com/pandrii000">GitHub</a></li>
  <li><a href="https://www.medium.com/@pandrii000">Medium</a></li>
  <li><a href="https://t.me/eiaioi">Telegram</a></li>
  <li><a href="mailto:pandrii000@gmail.com">Mail</a></li>
</ul>

<p>If you have any questions or feedback, please feel free to leave a comment below or contact me directly via any communication channel in the footer of the website.</p>

<p>I look forward to sharing more insights and knowledge with you in the future!</p>]]></content><author><name>Andrii Polukhin</name><email>pandrii000@gmail.com</email></author><summary type="html"><![CDATA[The development of Object Detection algorithms has come a long way, starting with early computer vision and reaching a high level of accuracy through deep learning. In this post, we will look at the stages of development of these algorithms and the main methods used in modern object detection systems.]]></summary></entry><entry><title type="html">ChatGPT in Machine Learning</title><link href="https://polukhin.tech/2023/02/21/chatgpt-in-machine-learning" rel="alternate" type="text/html" title="ChatGPT in Machine Learning" /><published>2023-02-21T00:00:00+00:00</published><updated>2023-02-21T00:00:00+00:00</updated><id>https://polukhin.tech/2023/02/21/chatgpt-in-machine-learning</id><content type="html" xml:base="https://polukhin.tech/2023/02/21/chatgpt-in-machine-learning"><![CDATA[<figure class="image">
  <img src="./assets/891777600_A_robot_sitting_next_to_a_human_in_an_office__trending_on_artstation__beautiful_coloring__4k__vibran.png" />
  <figcaption>A robot sitting next to a human in an office, trending on artstation, beautiful coloring, 4k, vibrant, blue and yellow, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<h2 id="introduction">Introduction</h2>

<p>Everyone knows about the <a href="https://openai.com/blog/chatgpt/">ChatGPT</a>.</p>

<p>It is also very good at explaining Machine Learning and Deep Learning concepts, at least those known up to 2021. In this post, I would like to demonstrate its knowledge of Artificial Intelligence. Let’s test it 🔥</p>

<p><em>Please keep in mind that the information below is LLM-generated and could be totally incorrect.</em></p>

<h2 id="top-lists">Top lists</h2>

<p>In this section, I will test ChatGPT’s knowledge of the most common, valuable, and efficient Machine Learning entities across various domains.</p>

<h3 id="top-5-machine-learning-scientists">Top 5 Machine Learning Scientists</h3>

<figure class="image">
  <img src="./assets/3745277450_A_young_researcher_in_a_lab_coat_hiding_from_a_demon_arm_emanating_from_a_laptop_on_his_office_desk_.png" />
  <figcaption>A young researcher in a lab coat hiding from a demon arm emanating from a laptop on his office desk. cinematic, photorealistic, vintage, artstation, painterly, expressive, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>Provide the list of the top 5 most valuable Machine Learning scientists and their appropriate studies, that made the biggest impact on the Artificial Intelligence growth. Format the information into the table Markdown format, with the following columns: [“Name”, “Contribution to AI”, “University”, “The most valuable publication in AI”].</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Contribution to AI</th>
      <th>University</th>
      <th>The most valuable publication in AI</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Geoffrey Hinton</td>
      <td>Pioneer of Deep Learning, created backpropagation algorithm</td>
      <td>University of Toronto</td>
      <td>“Deep Learning” textbook, “ImageNet Classification with Deep Convolutional Neural Networks”</td>
    </tr>
    <tr>
      <td>Yann LeCun</td>
      <td>Inventor of Convolutional Neural Networks</td>
      <td>New York University</td>
      <td>“Convolutional Networks for Images, Speech, and Time-Series”</td>
    </tr>
    <tr>
      <td>Yoshua Bengio</td>
      <td>Pioneer of Deep Learning, developed early neural language models</td>
      <td>University of Montreal</td>
      <td>“Deep Learning” textbook, “A Neural Probabilistic Language Model”</td>
    </tr>
    <tr>
      <td>Andrew Ng</td>
      <td>Founder of Google Brain, developed popular online course “Machine Learning”</td>
      <td>Stanford University</td>
      <td>“Deep Learning” textbook, “ImageNet Classification with Deep Convolutional Neural Networks”</td>
    </tr>
    <tr>
      <td>Fei-Fei Li</td>
      <td>Director of Stanford AI Lab, developed ImageNet dataset</td>
      <td>Stanford University</td>
      <td>“ImageNet: A Large-Scale Hierarchical Image Database”</td>
    </tr>
  </tbody>
</table>

<h3 id="top-5-computer-vision-loss-functions">Top 5 Computer Vision Loss Functions</h3>

<figure class="image">
  <img src="./assets/3752020442_Humanoid_robot_with_an_old_school_camera_for_a_head__sci___fi__4_k__realistic__detailed__gadgets__co.png" />
  <figcaption>Humanoid robot with an old school camera for a head, sci - fi, 4 k, realistic, detailed, gadgets, colorful, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>Provide a list of 5 the most common loss functions used in Computer Vision in Deep Learning and their specifications. Format the information in the table of Markdown format, with the following columns: “Name” with the name of the loss function; “Input” with the value, which is the input of the function, and its interpretation; and “Output” with the value, which is the input of the function, and its interpretation. Use LaTeX for “Input” and “Output” columns.</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Input</th>
      <th>Output</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Mean Squared Error (MSE)</td>
      <td>\({(x_i,y_i)}_{i=1}^n\) - pairs of predicted and true values</td>
      <td>\(L=\frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2\) - the mean squared error between predicted and true values</td>
    </tr>
    <tr>
      <td>Binary Cross-Entropy Loss</td>
      <td>\({(x_i,y_i)}_{i=1}^n\) - pairs of predicted and true labels, where \(y_i\in {0,1}\)</td>
      <td>\(L=-\frac{1}{n}\sum_{i=1}^n y_i\log(f(x_i)) + (1-y_i)\log(1-f(x_i))\) - the negative log-likelihood of the predicted labels given the true labels</td>
    </tr>
    <tr>
      <td>Categorical Cross-Entropy Loss</td>
      <td>\({(x_i,y_i)}_{i=1}^n\) - pairs of predicted and true labels, where \(y_i\in {0,1}^k\) (one-hot encoding)</td>
      <td>\(L=-\frac{1}{n}\sum_{i=1}^n \sum_{j=1}^k y_{i,j}\log(f(x_i)_j)\) - the negative log-likelihood of the predicted probabilities for each class given the true labels</td>
    </tr>
    <tr>
      <td>Kullback-Leibler (KL) Divergence</td>
      <td>\(p\) - true distribution, \(q\) - predicted distribution</td>
      <td>\(L=\sum_{i}p_i\log\frac{p_i}{q_i}\) - the measure of how much the predicted distribution diverges from the true distribution</td>
    </tr>
    <tr>
      <td>Hinge Loss</td>
      <td>\({(x_i,y_i)}_{i=1}^n\) - pairs of predicted and true labels, where \(y_i\in {-1,1}\)</td>
      <td>\(L=\frac{1}{n}\sum_{i=1}^n \max(0,1-y_i f(x_i))\) - the average of the distances between the predicted and true values, taking into account only the samples with a prediction error</td>
    </tr>
  </tbody>
</table>

<h3 id="top-5-machine-learning-optimization-algorithms">Top 5 Machine Learning Optimization Algorithms</h3>

<figure class="image">
  <img src="./assets/2718317050_a_picture_of_robot_studying_in_a_classroom__mid_shot__artistic__high_resolution__cartoon__detailed.png" />
  <figcaption>a picture of robot studying in a classroom, mid shot, artistic, high-resolution, cartoon, detailed, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>Provide a list of the top 8 most efficient techniques to train Data Science models, like gradient descent, evolutional algorithms, etc. Also, provide their specifications. Format the information in the table Markdown of format, with the following columns: “Name” with the name of the algorithm; “Algorithm” with the two-sentence description of the algorithm, “Advantages” with the pros of the algorithm; “Disadvantages” with the cons of the algorithm.</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Algorithm</th>
      <th>Advantages</th>
      <th>Disadvantages</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Gradient Descent</td>
      <td>Iterative optimization algorithm that updates model parameters in the direction of the negative gradient of the loss function</td>
      <td>Efficient and widely used, can handle large datasets</td>
      <td>Sensitive to learning rate and can get stuck in local optima</td>
    </tr>
    <tr>
      <td>Evolutionary Algorithms (EA)</td>
      <td>Population-based optimization algorithms inspired by natural selection, typically applied to discrete parameter spaces</td>
      <td>Can find global optima, can handle noisy and non-differentiable functions, useful for feature selection</td>
      <td>Computationally expensive, requires large population sizes and long training times, can be sensitive to hyperparameters</td>
    </tr>
    <tr>
      <td>Particle Swarm Optimization (PSO)</td>
      <td>Population-based optimization algorithm inspired by the collective behavior of bird flocks and fish schools, typically applied to continuous parameter spaces</td>
      <td>Can find global optima and handle non-differentiable functions, computationally efficient, can handle constraints and multi-objective optimization</td>
      <td>Limited exploration in high-dimensional spaces, may get stuck in local optima, sensitive to hyperparameters</td>
    </tr>
    <tr>
      <td>Bayesian Optimization (BO)</td>
      <td>Sequential model-based optimization algorithm that uses a probabilistic surrogate model to select the next point to evaluate</td>
      <td>Efficient use of evaluations, can handle noisy and non-differentiable functions, can handle constraints and multi-objective optimization</td>
      <td>Computationally expensive, requires a good surrogate model, can be sensitive to hyperparameters</td>
    </tr>
    <tr>
      <td>Reinforcement Learning (RL)</td>
      <td>Learning paradigm in which an agent interacts with an environment and learns to take actions that maximize a reward signal</td>
      <td>Can learn optimal policies for complex tasks and handle partial observability and delayed rewards</td>
      <td>Can be sample inefficient and require a large number of training episodes, can suffer from instability and non-convergence</td>
    </tr>
  </tbody>
</table>
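<p>To make the evolutionary-algorithm row concrete, here is a minimal (1+1) evolution strategy sketch, a toy illustration of the idea rather than a production implementation:</p>

```python
import random

def one_plus_one_es(f, x0=5.0, sigma=0.5, steps=300, seed=0):
    """Minimal (1+1) evolution strategy: mutate, keep the mutant only if it improves f."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    for _ in range(steps):
        cand = x + rng.gauss(0, sigma)  # Gaussian mutation
        fc = f(cand)
        if fc < fx:                     # greedy selection
            x, fx = cand, fc
    return x

best = one_plus_one_es(lambda x: x * x)
print(best)  # near the minimum at 0
```

Note that no gradient is used anywhere, which is why such methods also work on noisy and non-differentiable objectives, as the table states.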

<h3 id="the-most-difficult-algorithm-of-deep-learning">The Most Difficult Algorithm of Deep Learning</h3>

<figure class="image">
  <img src="./assets/2581524278_robot_sitting_on_a_ledge__looking_at_the_distance__muted_colors__by_jean_baptiste_monge__artistic__h.png" />
  <figcaption>robot sitting on a ledge, looking at the distance, muted colors, in the style of Jean-Baptiste Monge, artistic, high-resolution, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>What is the most difficult algorithm of Deep Learning? Provide an explanation of how it works. Provide technical details. Provide the reference to the original publication.</p>
</blockquote>

<p>The most difficult algorithm in Deep Learning is the Neural Turing Machine (NTM), which combines a neural network with a memory bank that can be read and written to, allowing it to learn algorithms and perform tasks that require a large amount of memory. The NTM was introduced by Graves et al. in the paper “Neural Turing Machines” (2014) [1]. The architecture of the NTM consists of a controller neural network that interacts with an external memory matrix using read and write heads. The controller network takes as input a query vector and produces read and write keys, which are used to address the memory matrix. The read and write heads then read from and write to the memory matrix based on the keys, with the read head outputting a read vector that is concatenated with the query vector to produce the final output.</p>

<p>The NTM is trained using backpropagation through time (BPTT) and can be used to perform tasks such as copying and sorting sequences, learning to perform algorithms such as binary addition and multiplication, and even playing simple games. The NTM has the potential to revolutionize the field of Artificial Intelligence by enabling machines to learn algorithms and perform tasks that would otherwise be difficult or impossible to perform using standard neural networks.</p>

<p>[1] Graves, A., Wayne, G., &amp; Danihelka, I. (2014). Neural turing machines. arXiv preprint arXiv:1410.5401.</p>

<h2 id="algorithms-explanation">Algorithms explanation</h2>

<p>Let’s ask ChatGPT to explain some specific algorithms.</p>

<h3 id="adam">Adam</h3>

<figure class="image">
  <img src="./assets/2581524278_Artificial_intelligence_taking_over_the_whole_world_and_killing_all_humans.png" />
  <figcaption>Artificial intelligence taking over the whole world and killing all humans, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>Provide an explanation of how the Adam gradient descent optimization algorithm works. Use LaTeX. Fit the explanation into 3 sentences. Use mathematical formulas.</p>
</blockquote>

<p>Adam is an adaptive learning rate optimization algorithm that computes individual adaptive learning rates for different parameters. It keeps running estimates of the first and second raw moments of the gradients and updates the parameters using bias-corrected versions of both estimates. The algorithm combines the advantages of Adagrad and RMSprop and can be expressed in the following equations:</p>

\[m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t\]

\[v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2\]

\[\hat{m}_t = \frac{m_t}{1-\beta_1^t}\]

\[\hat{v}_t = \frac{v_t}{1-\beta_2^t}\]

\[\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t}+\epsilon} \hat{m}_t,\]

<p>where \(m_t\) and \(v_t\) are the first and second moments of the gradients, \(\hat{m}_t\) and \(\hat{v}_t\) are the bias-corrected estimates of the moments, \(\theta_t\) is the current set of parameters, \(g_t\) is the gradient at time \(t\), \(\alpha\) is the learning rate, \(\beta_1\) and \(\beta_2\) are the exponential decay rates for the first and second moment estimates, and \(\epsilon\) is a small value added to the denominator for numerical stability.</p>
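<p>As an illustrative sketch (not a reference implementation), the update equations map directly to NumPy:</p>

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step index used for bias correction."""
    m = beta1 * m + (1 - beta1) * g           # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * g ** 2      # second-moment estimate v_t
    m_hat = m / (1 - beta1 ** t)              # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimize f(theta) = theta^2, whose gradient is 2*theta
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, alpha=0.01)
print(theta)  # close to 0
```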

<h3 id="transformer">Transformer</h3>

<figure class="image">
  <img src="./assets/2581524278_optimus_prime__ultra_realistic__artistic__high_resolution__detailed__4k__cyber_punk.png" />
  <figcaption>optimus prime, ultra-realistic, artistic, high-resolution, detailed, 4k, cyber punk, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>Provide an explanation of how the Transformer algorithm from the publication “Attention Is All You Need” works. Use LaTeX. Use mathematical formulas. Draw an ASCII diagram explaining how the attention query, key, and value matrices are multiplied in its pipeline.</p>
</blockquote>

<p>The Transformer is an attention-based neural network architecture used for various natural language processing (NLP) tasks. It was introduced in the paper “Attention Is All You Need” by Vaswani et al. (2017). The Transformer model consists of an encoder and a decoder, both of which are composed of self-attention layers and feed-forward layers.</p>

<p>Self-attention is used to compute the output of each token in the input sequence by taking a weighted sum of the embeddings of all the tokens in the sequence, where the weights are computed by attending to all the tokens in the sequence based on their relevance to the current token. The attention weight between a query \(q_i\) and a key \(k_j\) is computed using the dot product of their embeddings, scaled by the square root of the dimensionality of the embeddings, and then passed through a softmax function:</p>

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V\]

<p>where \(Q\), \(K\), and \(V\) are the query, key, and value matrices, respectively, and \(d_k\) is the dimensionality of the keys. The resulting attention matrix is then multiplied by the value matrix \(V\) to obtain the final output.</p>

<p>In the Transformer, the self-attention mechanism is used in a multi-head fashion, where the input embeddings are projected to multiple subspaces and attention is applied independently in each subspace, allowing the model to attend to different aspects of the input. The outputs from each attention head are concatenated and projected back to the original dimensionality.</p>

<p>Here is an ASCII diagram of the attention mechanism:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      Q               K               V
      |               |               |
      v               v               v
+-----+-----+   +-----+-----+   +-----+-----+
|  q1 | ... |   |  k1 | ... |   |  v1 | ... |
+-----+-----+   +-----+-----+   +-----+-----+
      |               |               |
      +---------------+---------------+
                      |
                  attention
                      |
                      v
      +-------------------------------+
      |          output token         |
      +-------------------------------+
</code></pre></div></div>
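<p>The attention computation is easy to verify numerically. An illustrative NumPy sketch (the shapes are chosen arbitrarily for the example):</p>

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

# 3 query tokens, 4 key/value tokens, d_k = 8, d_v = 5
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 5))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # one d_v-dimensional output per query token: (3, 5)
```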

<h3 id="gpt-3">GPT-3</h3>

<figure class="image">
  <img src="./assets/2581524278_a_page_of_text.png" />
  <figcaption>a page of text, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>Provide an explanation of how the GPT-3 algorithm works. Provide deep learning technical details. Use LaTeX. Use mathematical formulas. Fit the explanation into 5 sentences. Highlight the most important information. Provide the original publication in APA style.</p>
</blockquote>

<p>GPT-3 (Generative Pre-trained Transformer 3) is a language generation model developed by OpenAI, which uses deep learning techniques to generate human-like natural language. The model is trained on a massive dataset of diverse text to learn the patterns of language and is fine-tuned to specific natural language processing tasks.</p>

<p>The architecture of GPT-3 is based on a Transformer model with 175 billion parameters, which allows it to generate high-quality text with low error rates. The model uses unsupervised learning to generate coherent and contextually relevant language.</p>

<p>One of the unique features of GPT-3 is its ability to perform a wide range of natural language processing tasks, such as language translation, summarization, and question-answering, without the need for task-specific training data.</p>

<p>The model uses a combination of attention mechanisms, which allow it to focus on relevant parts of the input text, and a probabilistic language model, which generates the output text based on the learned distribution of language patterns.</p>

<p>According to Brown et al. (2020), the performance of GPT-3 on several natural language processing benchmarks exceeds the state-of-the-art in many cases, demonstrating the effectiveness of its architecture and training methodology.</p>

<p>Reference:</p>

<p>Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., … &amp; Amodei, D. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.</p>

<h2 id="compare-architectures">Compare architectures</h2>

<p>ChatGPT should compare different Deep Learning architectures and explain their differences, or their evolution, relative to the previous version of each model.</p>

<h3 id="yolo-v1-to-v5-comparison">YOLO v1 To v5 Comparison</h3>

<figure class="image">
  <img src="./assets/2581524278_you_only_look_once.png" />
  <figcaption>you only look once, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>Create a markdown table with the columns “Architecture” and “Changes”, comparing YOLO v1, v2, v3, v4, and v5 each with the previous version.</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>Architecture</th>
      <th>Changes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>YOLO v1</td>
      <td>- Basic architecture with a single scale detection approach<br />- Backbone based on Darknet<br />- No anchor boxes used<br />- Limited to detecting 20 classes</td>
    </tr>
    <tr>
      <td>YOLO v2</td>
      <td>- Introduces anchor boxes to handle multiple scales<br />- Batch normalization added to improve convergence<br />- Convolutional layers replaced with Residual blocks<br />- Increased the number of convolutional layers<br />- Detection map is now shared across multiple anchor boxes<br />- Achieves higher accuracy and can detect 80 classes</td>
    </tr>
    <tr>
      <td>YOLO v3</td>
      <td>- Further increased the number of convolutional layers and added skip connections<br />- Incorporates multiple scales using feature pyramids<br />- Introduced three detection heads to capture different scales and aspect ratios<br />- Improvements in training and network regularization<br />- Achieves state-of-the-art accuracy in object detection tasks</td>
    </tr>
    <tr>
      <td>YOLO v4</td>
      <td>- Utilizes several advanced features such as CSPDarknet53 and SPP block<br />- Introduced Mish activation function<br />- Use of Spatial Attention and Context Aggregation Modules<br />- Introduced Cross-Stage Partial Network (CSPNet)<br />- More accurate and faster than YOLOv3</td>
    </tr>
    <tr>
      <td>YOLO v5</td>
      <td>- Introduced the use of anchor-free object detection with FCOS<br />- Smaller and faster than previous versions<br />- Use of BiFPN to fuse features at multiple scales<br />- Implemented a new loss function and post-processing approach<br />- Achieves comparable accuracy to YOLOv4 while being significantly faster</td>
    </tr>
  </tbody>
</table>
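<p>A concept shared by the anchor-based versions above is matching predicted and ground-truth boxes by Intersection-over-Union (IoU). A small illustrative sketch (my own, not taken from any YOLO codebase):</p>

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)          # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)           # intersection / union

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175
```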

<h3 id="gradient-descend-roadmap">Gradient Descent Roadmap</h3>

<figure class="image">
  <img src="./assets/980890439_gradient_descent_roadmap.png" />
  <figcaption>gradient descent roadmap, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>Provide the roadmap for the development of the gradient descent optimization algorithms, such as SGD, Adam, AdamW, etc. Make a markdown table with the columns: “Name”, with the name of the algorithm; “Publication”, with the publication that presented this algorithm, in APA style; and “Contribution”, with the improvements made by this algorithm. Sort rows in ascending order by publication time.</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Publication</th>
      <th>Contribution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Gradient Descent</td>
      <td>Cauchy, A. L. (1847). “Méthode générale pour la résolution de systèmes d’équations simultanées”. Comptes rendus de l’Académie des sciences de Paris 25: 536–538.</td>
      <td>Classic optimization algorithm, updates model parameters based on the gradient of the loss function with respect to the parameters.</td>
    </tr>
    <tr>
      <td>Stochastic Gradient Descent (SGD)</td>
      <td>Robbins, H., &amp; Monro, S. (1951). “A stochastic approximation method”. Annals of Mathematical Statistics 22 (3): 400–407.</td>
      <td>Optimizes the model by computing the gradient on a single example, making it faster than the standard Gradient Descent algorithm.</td>
    </tr>
    <tr>
      <td>Adagrad</td>
      <td>Duchi, J., Hazan, E., &amp; Singer, Y. (2011). “Adaptive subgradient methods for online learning and stochastic optimization”. Journal of Machine Learning Research, 12: 2121-2159.</td>
      <td>Adapts the learning rate to the parameters, so that frequent parameters get a smaller learning rate, while infrequent parameters get a higher learning rate.</td>
    </tr>
    <tr>
      <td>Adadelta</td>
      <td>Zeiler, M. D. (2012). “ADADELTA: An adaptive learning rate method”. arXiv preprint arXiv:1212.5701.</td>
      <td>Computes an adaptive learning rate, which takes into account the moving average of the squared gradients.</td>
    </tr>
    <tr>
      <td>RMSprop</td>
      <td>Tieleman, T., &amp; Hinton, G. (2012). “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude”. COURSERA: Neural Networks for Machine Learning 4: 2-4.</td>
      <td>Adapts the learning rate based on the moving average of the squared gradients, normalizing the gradient by the root-mean-square of the previous gradients.</td>
    </tr>
    <tr>
      <td>Adam</td>
      <td>Kingma, D. P., &amp; Ba, J. (2015). “Adam: A method for stochastic optimization”. arXiv preprint arXiv:1412.6980.</td>
      <td>Computes adaptive learning rates for each parameter, taking into account both the first and second moments of the gradients, resulting in faster convergence and better generalization performance.</td>
    </tr>
    <tr>
      <td>Nadam</td>
      <td>Dozat, T. (2016). “Incorporating Nesterov Momentum into Adam”. ICLR workshop.</td>
      <td>Adds the Nesterov momentum to Adam, allowing for faster convergence in the case of saddle points.</td>
    </tr>
    <tr>
      <td>AdamW</td>
      <td>Loshchilov, I., &amp; Hutter, F. (2017). “Fixing weight decay regularization in Adam”. arXiv preprint arXiv:1711.05101.</td>
      <td>Fixes the weight decay problem in Adam, resulting in better generalization performance.</td>
    </tr>
    <tr>
      <td>RAdam</td>
      <td>Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., &amp; Han, J. (2019). “On the variance of the adaptive learning rate and beyond”. arXiv preprint arXiv:1908.03265.</td>
      <td>Rectifies the variance of the adaptive learning rate in Adam, resulting in better generalization performance.</td>
    </tr>
  </tbody>
</table>
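<p>As a toy illustration of how the earliest of these update rules behave (my own sketch, not taken from the cited papers), here they are applied to the one-dimensional function \(f(\theta)=\theta^2\):</p>

```python
import numpy as np

def minimize(update, theta=5.0, steps=200):
    """Run an update rule on f(theta) = theta^2, whose gradient is 2*theta."""
    state = {}
    for t in range(1, steps + 1):
        theta = update(theta, 2 * theta, state, t)
    return theta

def sgd(theta, g, state, t, lr=0.1):
    return theta - lr * g

def adagrad(theta, g, state, t, lr=0.5, eps=1e-8):
    state["G"] = state.get("G", 0.0) + g ** 2                   # accumulated squared gradients
    return theta - lr * g / (np.sqrt(state["G"]) + eps)

def rmsprop(theta, g, state, t, lr=0.05, rho=0.9, eps=1e-8):
    state["v"] = rho * state.get("v", 0.0) + (1 - rho) * g ** 2  # moving average of g^2
    return theta - lr * g / (np.sqrt(state["v"]) + eps)

for rule in (sgd, adagrad, rmsprop):
    print(rule.__name__, minimize(rule))
```

All three approach the minimum at 0; the adaptive rules do so while normalizing each step by a history of squared gradients, which is exactly the progression the table describes.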

<h3 id="computer-vision-roadmap">Computer Vision Roadmap</h3>

<figure class="image">
  <img src="./assets/3373870629_Computer_Vision__eye__artist_style__robot.png" />
  <figcaption>Computer Vision, eye, artist style, robot, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>Develop a roadmap for the development of computer vision architectures, beginning with LeNet. Make a markdown table with the columns: “Name”, with the name of the architecture; “Publication”, with the publication that presented this architecture, in APA style; and “Contribution”, with the improvements made by this architecture. Sort rows in ascending order by publication time.</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Publication</th>
      <th>Contribution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>LeNet</td>
      <td>LeCun, Y., Bottou, L., Bengio, Y., &amp; Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.</td>
      <td>One of the first successful convolutional neural networks (CNNs) for handwritten digit recognition.</td>
    </tr>
    <tr>
      <td>AlexNet</td>
      <td>Krizhevsky, A., Sutskever, I., &amp; Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).</td>
      <td>First CNN to win the ImageNet Large Scale Visual Recognition Competition (ILSVRC). Introducing a new architecture with a deeper and wider network, ReLU activation function, and dropout regularization.</td>
    </tr>
    <tr>
      <td>VGGNet</td>
      <td>Simonyan, K., &amp; Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.</td>
      <td>A deeper architecture with up to 19 layers, all with small convolutional filters. This architecture set a new benchmark on ImageNet.</td>
    </tr>
    <tr>
      <td>Inception v1</td>
      <td>Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … &amp; Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).</td>
      <td>Introducing the inception module, which consists of multiple parallel convolutional layers with different filter sizes and pooling operations. This module improved the accuracy while keeping the number of parameters low.</td>
    </tr>
    <tr>
      <td>ResNet</td>
      <td>He, K., Zhang, X., Ren, S., &amp; Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).</td>
      <td>Introducing residual connections that allow deeper architectures by solving the vanishing gradient problem. This architecture set a new state-of-the-art on ImageNet.</td>
    </tr>
    <tr>
      <td>Inception v3</td>
      <td>Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., &amp; Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818-2826).</td>
      <td>Further development of the inception module, introducing factorized 7x7 convolutions, and adding batch normalization. This architecture set a new state-of-the-art on ImageNet.</td>
    </tr>
    <tr>
      <td>Xception</td>
      <td>Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1251-1258).</td>
      <td>Introducing depthwise separable convolutions that separate spatial and channel-wise convolutions, reducing the number of parameters while improving accuracy. This architecture set a new state-of-the-art on ImageNet.</td>
    </tr>
    <tr>
      <td>EfficientNet</td>
      <td>Tan, M., &amp; Le, Q. V. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105-6114).</td>
      <td>Proposing a compound scaling method that scales the network’s depth, width, and resolution uniformly. This architecture set a new state-of-the-art on ImageNet while being more computationally efficient than previous architectures.</td>
    </tr>
  </tbody>
</table>
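<p>To make the Xception row concrete, here is a back-of-the-envelope sketch of the parameter savings from depthwise separable convolutions (the channel counts are illustrative, and bias terms are ignored):</p>

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def separable_conv_params(c_in, c_out, k):
    """Depthwise k x k (one filter per input channel) + 1x1 pointwise projection."""
    return k * k * c_in + c_in * c_out

c_in, c_out, k = 128, 256, 3
std = conv_params(c_in, c_out, k)            # 3*3*128*256 = 294912
sep = separable_conv_params(c_in, c_out, k)  # 3*3*128 + 128*256 = 33920
print(std, sep, round(std / sep, 1))         # roughly an 8.7x reduction
```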

<h3 id="nlp-roadmap">NLP Roadmap</h3>

<figure class="image">
  <img src="./assets/380870119_natural_language_processing__artist_style__robot.png" />
  <figcaption>natural language processing, artist style, robot, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>Develop a roadmap for the development of NLP architectures. Make a markdown table with the columns: “Name”, with the name of the architecture; “Publication”, with the publication that presented this architecture, in APA style; and “Contribution”, with the improvements made by this architecture. Sort rows in ascending order by publication time.</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Publication</th>
      <th>Contribution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Hidden Markov Models (HMM)</td>
      <td>Rabiner, L. R. (1989)</td>
      <td>Introduced a statistical model for predicting a sequence of unknown variables.</td>
    </tr>
    <tr>
      <td>Recurrent Neural Networks (RNN)</td>
      <td>Elman, J. L. (1990)</td>
      <td>Introduced a neural network architecture capable of handling sequential data by using feedback connections.</td>
    </tr>
    <tr>
      <td>Long Short-Term Memory (LSTM)</td>
      <td>Hochreiter, S., &amp; Schmidhuber, J. (1997)</td>
      <td>Improved RNN by introducing a memory cell and gating mechanism, allowing better handling of long-term dependencies.</td>
    </tr>
    <tr>
      <td>Convolutional Neural Networks (CNN)</td>
      <td>LeCun, Y., Bottou, L., Bengio, Y., &amp; Haffner, P. (1998)</td>
      <td>Introduced an architecture capable of extracting features from local regions of the input data, allowing for translation invariance.</td>
    </tr>
    <tr>
      <td>CBOW</td>
      <td>Mikolov, T., Chen, K., Corrado, G., &amp; Dean, J. (2013). Efficient estimation of word representations in vector space.</td>
      <td>Proposed a simple neural network architecture for computing continuous distributed representations of words that is computationally efficient and suitable for large-scale NLP tasks.</td>
    </tr>
    <tr>
      <td>skip-gram</td>
      <td>Mikolov, T., Chen, K., Corrado, G., &amp; Dean, J. (2013). Efficient estimation of word representations in vector space.</td>
      <td>Proposed a simple neural network architecture for computing continuous distributed representations of words that is computationally efficient and suitable for large-scale NLP tasks.</td>
    </tr>
    <tr>
      <td>GRU</td>
      <td>Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., &amp; Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation.</td>
      <td>Proposed the gated recurrent unit (GRU), which is a variant of the LSTM architecture that has fewer parameters and is easier to train, while still being capable of modeling long-term dependencies.</td>
    </tr>
    <tr>
      <td>Transformer</td>
      <td>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … &amp; Polosukhin, I. (2017)</td>
      <td>Introduced a model based solely on self-attention, which has achieved state-of-the-art results in many NLP tasks.</td>
    </tr>
    <tr>
      <td>BERT</td>
      <td>Devlin, J., Chang, M. W., Lee, K., &amp; Toutanova, K. (2018)</td>
      <td>Introduced a model based on the Transformer architecture that pretrains on a large corpus and then fine-tunes on downstream NLP tasks, achieving state-of-the-art results in many of them.</td>
    </tr>
    <tr>
      <td>GPT-2</td>
      <td>Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., &amp; Sutskever, I. (2019)</td>
      <td>Introduced a Transformer-based model pre-trained on a massive corpus, which has demonstrated impressive generation capabilities in text and beyond.</td>
    </tr>
    <tr>
      <td>T5</td>
      <td>Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., … &amp; Liu, P. J. (2019)</td>
      <td>Introduced a model capable of performing various NLP tasks through a single training objective and achieved state-of-the-art results in several benchmarks.</td>
    </tr>
    <tr>
      <td>GPT-3</td>
      <td>Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., … &amp; Amodei, D. (2020). Language models are few-shot learners.</td>
      <td>Presented the GPT-3 model, which is a large-scale version of GPT-2 with 175 billion parameters, and demonstrated impressive few-shot learning capabilities, meaning that it can perform new tasks from only a few examples without task-specific fine-tuning.</td>
    </tr>
  </tbody>
</table>
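<p>As a small illustration of the skip-gram idea from the word2vec rows above (a toy sketch of the data-preparation step, not of the model itself):</p>

```python
def skip_gram_pairs(tokens, window=2):
    """(center, context) training pairs as used by word2vec's skip-gram objective."""
    pairs = []
    for i, center in enumerate(tokens):
        # every token within `window` positions of the center is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skip_gram_pairs(["the", "cat", "sat"], window=1))
```

CBOW uses the same windows but in the opposite direction: it predicts the center word from its context.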

<h3 id="time-series-roadmap">Time Series Roadmap</h3>

<figure class="image">
  <img src="./assets/3236229149_time__artist_style.png" />
  <figcaption>time, artist style, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>Develop a roadmap for the development of Time Series models. Make a markdown table with the columns: “Name”, with the name of the model; “Publication”, with the publication that presented this model, in APA style; and “Contribution”, with the improvements made by this model. Sort rows in ascending order by publication time.</p>
</blockquote>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Publication</th>
      <th>Contribution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>ARMA</td>
      <td>Box, G.E.P., Jenkins, G.M. (1970). Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day.</td>
      <td>Introduced the ARMA model, which is widely used for forecasting stationary time series data by combining the AR (autoregressive) and MA (moving average) models.</td>
    </tr>
    <tr>
      <td>ARIMA</td>
      <td>Box, G.E.P., Jenkins, G.M. (1976). Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day.</td>
      <td>Extended the ARMA model by introducing the integration component, enabling the model to handle non-stationary time series data.</td>
    </tr>
    <tr>
      <td>GARCH</td>
      <td>Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31, 307–327.</td>
      <td>Introduced the GARCH model, which can model the volatility clustering commonly found in financial time series data by combining the autoregressive and moving average components with a time-varying variance.</td>
    </tr>
    <tr>
      <td>LSTM</td>
      <td>Hochreiter, S., &amp; Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.</td>
      <td>Introduced the LSTM architecture, which is a type of recurrent neural network that can learn long-term dependencies in time series data by incorporating memory cells that can store and retrieve information over time.</td>
    </tr>
    <tr>
      <td>GRU</td>
      <td>Cho, K., Van Merrienboer, B., Bahdanau, D., &amp; Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.</td>
      <td>Introduced the GRU architecture, which is another type of recurrent neural network that can learn long-term dependencies in time series data, but with fewer parameters than LSTM.</td>
    </tr>
    <tr>
      <td>WaveNet</td>
      <td>van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.</td>
      <td>Introduced the WaveNet architecture, which is a type of autoregressive model that can generate high-quality audio by predicting the next sample in the waveform using a stack of dilated causal convolutional layers.</td>
    </tr>
    <tr>
      <td>TCN</td>
      <td>Bai, S., Kolter, J.Z., &amp; Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.</td>
      <td>Introduced the TCN architecture, which is a type of convolutional neural network that can learn long-term dependencies in time series data with dilated causal convolutions, achieving state-of-the-art performance in several sequence modeling tasks.</td>
    </tr>
    <tr>
      <td>Transformer</td>
      <td>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., et al. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.</td>
      <td>Introduced the Transformer architecture, which is a type of self-attention based model that can capture global dependencies in time series data by computing attention weights between all time steps, achieving state-of-the-art performance in several NLP tasks and time series forecasting.</td>
    </tr>
  </tbody>
</table>
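<p>The dilated causal convolutions behind WaveNet and TCN trade depth for context length. A short sketch of the receptive-field arithmetic (the layer counts here are illustrative):</p>

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated causal convolutions:
    each layer with dilation d adds (kernel_size - 1) * d time steps."""
    return 1 + (kernel_size - 1) * sum(dilations)

# doubling dilations, as in WaveNet/TCN
dilations = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
print(receptive_field(2, dilations))      # 1 + 1 * 1023 = 1024 time steps
```

Ten layers are enough to see over a thousand past time steps, which is why these models can capture long-range dependencies without recurrence.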

<h2 id="personal-development">Personal Development</h2>

<p>In this section, I’ll ask ChatGPT for some tips on how I may develop into a specialist in the field of machine learning.</p>

<h3 id="how-to-become-an-expert-in-ai">How to Become an Expert in AI</h3>

<figure class="image">
  <img src="./assets/935559909_expert_in_AI__wise_old_man__artist_style.png" />
  <figcaption>expert in AI, wise old man, artist style, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>Develop a roadmap for me, as a beginner in Machine Learning, how to develop myself into a specialist in the field of machine learning.</p>
</blockquote>

<ol>
  <li>Learn the basics of Mathematics, especially Calculus, Linear Algebra, and Probability Theory.</li>
  <li>Get familiar with programming languages, such as Python and R, as they are widely used in Machine Learning.</li>
  <li>Understand the fundamental Machine Learning concepts, such as supervised and unsupervised learning, overfitting, underfitting, and bias-variance tradeoff.</li>
  <li>Learn how to work with data by acquiring skills in data preprocessing, feature engineering, and data visualization.</li>
  <li>Master the core Machine Learning algorithms, such as linear regression, logistic regression, decision trees, random forests, and support vector machines (SVM).</li>
  <li>Gain knowledge of Deep Learning and Neural Networks, including concepts like backpropagation, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Long Short-Term Memory (LSTM) networks.</li>
  <li>Learn about advanced techniques in Machine Learning, such as ensemble methods, clustering, reinforcement learning, and generative models.</li>
  <li>Practice on real-world projects and challenges to gain hands-on experience and improve your skills.</li>
  <li>Keep up-to-date with the latest research and trends in Machine Learning through reading research papers, attending conferences, and following experts in the field.</li>
  <li>Consider pursuing a degree or certification program in Machine Learning, such as a Master’s in Data Science or a Machine Learning Engineer certification.</li>
</ol>

<p>Remember, developing yourself into a specialist in Machine Learning takes time, effort, and continuous learning. Stay committed to learning and practice consistently to achieve your goals.</p>

<h3 id="organizing-ai-knowledge">Organizing AI Knowledge</h3>

<figure class="image">
  <img src="./assets/4185775491_organizing_thoughts__mindset__knowledge__order__minimalistic_style.png" />
  <figcaption>organizing thoughts, mindset, knowledge, order, minimalistic style, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>How can I effectively remember the countless different machine learning and deep learning algorithms? Provide how to structure the information, which tool could help organize the knowledge, and how to find the information quickly.</p>
</blockquote>

<p>To effectively remember the different machine learning and deep learning algorithms, it is important to have a structured approach to organizing the information. One way to do this is to create a comprehensive set of notes that includes:</p>

<ol>
  <li>A brief description of the algorithm</li>
  <li>The mathematical formulas and equations that are used</li>
  <li>The pros and cons of the algorithm</li>
  <li>Real-world examples of where the algorithm has been used</li>
</ol>

<p>Tools like Evernote, OneNote, or Google Keep can help you organize your notes and make it easy to search for specific algorithms quickly. These tools allow you to create different notebooks or sections for different topics and easily add tags to your notes to help you find them later.</p>

<p>Another effective way to remember the algorithms is to create flashcards. Write down the name of the algorithm on one side and the key details on the other. Use these flashcards to quiz yourself regularly.</p>

<p>It is also important to regularly review your notes and practice implementing the algorithms. Participating in online courses or finding coding challenges can help you solidify your understanding of the algorithms and improve your ability to apply them in real-world scenarios. Additionally, joining online communities or attending meetups related to machine learning can give you the opportunity to discuss and learn from others who are also interested in the topic.</p>

<h3 id="how-to-pass-an-interview-for-junior-ml-engineer">How to Pass an Interview for Junior ML Engineer</h3>

<figure class="image">
  <img src="./assets/553067973_interviewing_the_junior_Machine_Learning_Engineer__cartoon_style.png" />
  <figcaption>interviewing the junior Machine Learning Engineer, cartoon style, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<blockquote>
  <p>How could I successfully pass the interview for the position of junior Machine Learning Engineer?</p>
</blockquote>

<p>To successfully pass the interview for the position of junior Machine Learning Engineer, here are some tips:</p>

<ol>
  <li>
    <p>Brush up on your basics: Be thorough with the fundamentals of mathematics, statistics, and programming, as these are the building blocks of machine learning.</p>
  </li>
  <li>
    <p>Practice coding: Practice coding exercises, algorithms, and data structures in the language you are most comfortable with. Be familiar with libraries and frameworks such as TensorFlow, PyTorch, and Scikit-Learn.</p>
  </li>
  <li>
    <p>Learn about machine learning algorithms: Be familiar with different machine learning algorithms such as linear regression, logistic regression, decision trees, random forests, support vector machines, and neural networks.</p>
  </li>
  <li>
    <p>Learn about deep learning: Be familiar with deep learning techniques such as convolutional neural networks, recurrent neural networks, and long short-term memory networks.</p>
  </li>
  <li>
    <p>Learn about data preparation: Be familiar with techniques for preparing data such as feature scaling, normalization, and one-hot encoding.</p>
  </li>
  <li>
    <p>Learn about evaluation metrics: Be familiar with evaluation metrics such as accuracy, precision, recall, F1 score, and ROC curve.</p>
  </li>
  <li>
    <p>Be able to explain your thought process: During the interview, be able to explain your thought process and the approach you take to solve a particular problem.</p>
  </li>
  <li>
    <p>Be familiar with common machine learning challenges: Be familiar with common challenges such as overfitting, underfitting, bias-variance trade-off, and regularization.</p>
  </li>
  <li>
    <p>Practice communication: Practice your communication skills as you will need to explain complex concepts to non-technical stakeholders.</p>
  </li>
  <li>
    <p>Use resources: Use resources such as online courses, books, and open-source projects to expand your knowledge.</p>
  </li>
</ol>

<p>By following these tips and practicing regularly, you can increase your chances of successfully passing the interview for the position of junior Machine Learning Engineer.</p>

<h2 id="conclusion">Conclusion</h2>

<figure class="image">
  <img src="./assets/416953266_a_small_robot__like_WALL_E_dreamy_ultra_wide_shot__atmospheric__hyper_realistic__epic_composition__c.png" />
  <figcaption>a small robot, like WALL-E,dreamy ultra wide shot, atmospheric, hyper realistic, epic composition, cinematic, octane render, artstation landscape vista photography by Carr Clifton &amp; Galen Rowell, 16K resolution, Landscape veduta photo by Dustin Lefevre &amp; tdraw, detailed landscape painting by Ivan Shishkin, DeviantArt, Flickr, rendered in Enscape, Miyazaki, Nausicaa Ghibli, Breath of The Wild, 4k detailed post processing, artstation, rendering by octane, unreal engine –iw 10 –ar 9:16, by <a href="https://beta.dreamstudio.ai">DreamStudio</a></figcaption>
</figure>

<p>ChatGPT has memorized a wide range of machine learning, data science, and deep learning methods. It can be really useful when you need to recall certain information quickly.</p>

<p>Because ChatGPT regularly omits information, enters dates and constant values inaccurately, and otherwise makes mistakes, <strong>I do not recommend using it to study any new subject</strong>.</p>

<p>ChatGPT is a great tool for expanding on a message, continuing work in progress, reviewing specifics, or conducting high-level research. It can effectively compare several well-known algorithms, create roadmaps, and quickly, and often correctly, come up with fresh ideas.</p>

<h2 id="acknowledges">Acknowledgements</h2>

<p>Thanks to the ChatGPT team for creating this fantastic tool and enabling others to accelerate our research and development!</p>

<p>Thank you for reading this post on ChatGPT in Machine Learning! I hope that you found this information informative and useful. If you have any questions or feedback, please feel free to leave a comment below or contact me directly via any communication channel in the footer of the website. Also, be sure to check out my blog for more great content on machine learning and deep learning.</p>]]></content><author><name>Andrii Polukhin</name><email>pandrii000@gmail.com</email></author><summary type="html"><![CDATA[A robot sitting next to a human in an office, trending on artstation, beautiful coloring, 4k, vibrant, blue and yellow, by DreamStudio]]></summary></entry><entry><title type="html">Classification Metrics: Visual Explanations</title><link href="https://polukhin.tech/2022/11/26/classification-metrics-visual-explanations" rel="alternate" type="text/html" title="Classification Metrics: Visual Explanations" /><published>2022-11-26T00:00:00+00:00</published><updated>2022-11-26T00:00:00+00:00</updated><id>https://polukhin.tech/2022/11/26/classification-metrics-visual-explanations</id><content type="html" xml:base="https://polukhin.tech/2022/11/26/classification-metrics-visual-explanations"><![CDATA[<p>This post will visually explain such concepts as <strong>Accuracy</strong>, <strong>Precision</strong>, <strong>Recall</strong>, <strong>F1-Score</strong>, <strong>ROC Curve</strong>, and <strong>AUC</strong>, which are part of developing any classification, detection, or segmentation task in machine learning. All the images were created by the author.</p>

<p>I would also suggest you read the following articles about those metrics, which are highly informative and could give you a better understanding of the metrics evaluation process:</p>

<ul>
  <li><a href="https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative">Classification: True vs. False and Positive vs. Negative</a></li>
  <li><a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy">Classification: Accuracy</a></li>
  <li><a href="https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc">Classification: ROC Curve and AUC</a></li>
  <li><a href="https://towardsdatascience.com/precision-and-recall-made-simple-afb5e098970f">Precision and Recall Made Simple</a></li>
  <li><a href="https://medium.com/mlearning-ai/evaluating-classification-models-why-accuracy-is-not-enough-abf3d9c93a69">Evaluating Classification Models: Why Accuracy Is Not Enough</a></li>
  <li><a href="https://medium.com/opex-analytics/why-you-need-to-understand-the-trade-off-between-precision-and-recall-525a33919942">Precision and Recall: Understanding the Trade-Off</a></li>
  <li><a href="https://towardsdatascience.com/essential-things-you-need-to-know-about-f1-score-dbd973bf1a3">Essential Things You Need to Know About F1-Score</a></li>
  <li><a href="https://analyticsindiamag.com/understanding-the-auc-roc-curve-in-machine-learning-classification/">Understanding the AUC-ROC Curve in Machine Learning Classification</a></li>
  <li><a href="https://mlu-explain.github.io/precision-recall/">Precision &amp; Recall</a></li>
  <li><a href="https://mlu-explain.github.io/roc-auc/">ROC &amp; AUC</a></li>
</ul>

<h2 id="the-key-element">The Key Element</h2>

<p>The key element of all these metrics is <strong>True Positive (TP)</strong>, <strong>True Negative (TN)</strong>, <strong>False Positive (FP)</strong>, and <strong>False Negative (FN)</strong> metrics, which came from Statistics, specifically from <a href="https://www.coursera.org/lecture/stanford-statistics/the-idea-behind-testing-hypotheses-fvJGd">Hypothesis Testing</a>.</p>

<figure class="image" style="width:70%">
  <img src="./assets/confusion_matrix.png" />
  <figcaption>Confusion Matrix</figcaption>
</figure>

<ul>
  <li><strong>True Positive</strong> is about how many positive samples were classified as positive.</li>
  <li><strong>True Negative</strong> is about how many negative samples were classified as negative.</li>
  <li><strong>False Positive</strong> is about how many negative samples were classified as positive.</li>
  <li><strong>False Negative</strong> is about how many positive samples were classified as negative.</li>
</ul>
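<p>These four counts can be computed directly from binary labels and predictions. Below is a minimal sketch in plain Python, assuming labels are encoded as 1 for positive and 0 for negative:</p>

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Six samples: two TP, two TN, one FP, one FN
print(confusion_counts([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))  # (2, 2, 1, 1)
```

<p>Every metric below is just a different ratio of these four counts.</p>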

<h2 id="accuracy">Accuracy</h2>

<p>Accuracy shows <strong>how many correct classifications you have made</strong>.</p>

<figure class="image" style="width:70%">
  <img src="./assets/accuracy.png" />
  <figcaption>Accuracy</figcaption>
</figure>

<h2 id="precision">Precision</h2>

<p>Precision shows <strong>how many positive predictions were correct</strong>.</p>

<figure class="image" style="width:70%">
  <img src="./assets/precision.png" />
  <figcaption>Precision</figcaption>
</figure>

<h2 id="recall">Recall</h2>

<p>Recall shows <strong>how many of the actual positive samples were predicted correctly</strong>.</p>

<figure class="image" style="width:70%">
  <img src="./assets/recall.png" />
  <figcaption>Recall</figcaption>
</figure>

<h2 id="f1-score">F1-Score</h2>

<p>F1-Score is simply the <strong><a href="https://en.wikipedia.org/wiki/Harmonic_mean">harmonic mean</a> between Precision and Recall</strong>.</p>

<figure class="image" style="width:70%">
  <img src="./assets/f1_score.png" />
  <figcaption>F1-Score</figcaption>
</figure>
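<p>To make the formulas above concrete, here is a minimal sketch that computes all four metrics from the confusion-matrix counts (it assumes the denominators are non-zero):</p>

```python
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # correct / all samples
    precision = tp / (tp + fp)                          # correct / predicted positive
    recall = tp / (tp + fn)                             # correct / actual positive
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=4, tn=3, fp=1, fn=1)
print(round(prec, 3), round(rec, 3), round(f1, 3))  # 0.8 0.8 0.8
```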

<h2 id="roc-curve">ROC Curve</h2>

<p>ROC Curve stands for “Receiver Operating Characteristic” and depends on two parameters:</p>

<ol>
  <li><strong>True Positive Rate (TPR)</strong>, also known as <strong>Recall</strong>.</li>
  <li><strong>False Positive Rate (FPR)</strong>, which is the probability that an actual negative sample is predicted as positive.</li>
</ol>

<figure class="image" style="width:70%">
  <img src="./assets/false_positive_rate.png" />
  <figcaption>False Positive Rate</figcaption>
</figure>

<p>Using different threshold values from 0 to 1, a ROC Curve is created by plotting FPR values on the X-axis, and TPR values on the Y-axis.</p>

<figure class="image" style="width:80%">
  <img src="./assets/roc_curve.png" />
  <figcaption>ROC Curve</figcaption>
</figure>

<h2 id="auc">AUC</h2>

<p>AUC stands for “Area under the ROC Curve” and measures the two-dimensional <strong>area underneath the entire ROC curve from (0,0) to (1,1)</strong>.</p>

<figure class="image" style="width:80%">
  <img src="./assets/auc.png" />
  <figcaption>AUC</figcaption>
</figure>
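<p>Both the curve and its area can be sketched in a few lines: sweep a set of thresholds, collect the (FPR, TPR) points, and integrate with the trapezoidal rule. A minimal illustration in plain Python with hypothetical example data:</p>

```python
def roc_points(y_true, scores, thresholds):
    """Return sorted (FPR, TPR) points, one per threshold."""
    p = sum(y_true)           # number of actual positives
    n = len(y_true) - p       # number of actual negatives
    points = []
    for thr in thresholds:
        preds = [1 if s >= thr else 0 for s in scores]
        tp = sum(1 for t, q in zip(y_true, preds) if t == 1 and q == 1)
        fp = sum(1 for t, q in zip(y_true, preds) if t == 0 and q == 1)
        points.append((fp / n, tp / p))
    return sorted(points)

def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
pts = roc_points(y_true, scores, thresholds=[1.1, 0.8, 0.4, 0.35, 0.1])
print(auc_trapezoid(pts))  # 0.75
```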

<h2 id="conclusion">Conclusion</h2>

<p>These metrics are widely used across machine learning topics, so it is essential to build a clear intuition about how they work, how to interpret them, and, finally, how to push them toward 100%.</p>]]></content><author><name>Andrii Polukhin</name><email>pandrii000@gmail.com</email></author><summary type="html"><![CDATA[This post will visually describe the problem of understanding of such concepts as Accuracy, Precision, Recall, F1-Score, ROC Curve, and AUC, which are part of the development of any classification, detection, segmentation, etc. tasks in machine learning. All the images were created by the author.]]></summary></entry><entry><title type="html">Pruning: The History And Overview</title><link href="https://polukhin.tech/2022/10/27/pruning-the-history-and-overview" rel="alternate" type="text/html" title="Pruning: The History And Overview" /><published>2022-10-27T00:00:00+00:00</published><updated>2022-10-27T00:00:00+00:00</updated><id>https://polukhin.tech/2022/10/27/pruning-the-history-and-overview</id><content type="html" xml:base="https://polukhin.tech/2022/10/27/pruning-the-history-and-overview"><![CDATA[<figure class="image">
  <img src="./assets/pruning-before-and-after.png" />
  <figcaption><a href="http://arxiv.org/abs/1607.03250">Pruning: Before and After</a></figcaption>
</figure>

<h2 id="definition">Definition</h2>

<p><strong>Simply speaking,</strong> pruning is a process of shrinking a network by eliminating parameters.</p>

<p><strong>Formally,</strong> let’s define a neural network architecture as a family of functions \(f(x; ·)\).
Given a model \(f(x; W)\) as input, neural network pruning produces a new model \(f(x; M \bigodot W')\).</p>

<p>Here,</p>

<ul>
  <li>\(M \in \{0, 1\} ^ {\vert W' \vert}\) is a binary mask that sets certain parameters of the model to \(0\).</li>
  <li>\(W′\) is a collection of parameters that might be different from \(W\).</li>
  <li>\(\bigodot\) is the elementwise product operator.</li>
</ul>

<p>In production, the pruned parameters of \(W\) are either set to zero or removed entirely, rather than kept behind an explicit mask.</p>
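<p>As an illustration, the mask-and-multiply formulation above can be sketched with NumPy. The magnitude-based scoring used here is just one possible choice of how to build \(M\):</p>

```python
import numpy as np

def magnitude_mask(W, sparsity):
    """Binary mask that zeroes out the `sparsity` fraction of smallest-|w| weights.

    Ties at the threshold are kept, so the achieved sparsity is approximate.
    """
    k = int(sparsity * W.size)
    threshold = np.sort(np.abs(W), axis=None)[k]  # k-th smallest magnitude
    return (np.abs(W) >= threshold).astype(W.dtype)

W = np.array([[0.5, -0.1], [0.02, -1.3]])
M = magnitude_mask(W, sparsity=0.5)  # prune half of the weights
W_pruned = M * W                     # the elementwise product M ⊙ W
print(np.count_nonzero(W_pruned))    # 2 (of the 4 weights survive)
```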

<h2 id="history">History</h2>

<p>Pruning has been used since the late 1980s. Among the most famous publications of those years are <a href="https://papers.nips.cc/paper/1989/hash/6c9882bbac1c7093bd25041881277658-Abstract.html">LeCun et al. (1989). Optimal Brain Damage</a>, <a href="https://ieeexplore.ieee.org/abstract/document/80236">Karnin (1990). A simple procedure for pruning back-propagation trained neural networks</a>, and <a href="https://ieeexplore.ieee.org/abstract/document/298572">Hassibi et al. (1993). Optimal Brain Surgeon and general network pruning</a>.</p>

<p>The earliest pruning methods relied on brute force: set each weight to zero in turn and evaluate the change in the error. This is a very non-optimal \(O(MW^3)\) algorithm, where \(W\) is the number of weights and \(M\) is the number of training examples. See the excellent survey <a href="https://ieeexplore.ieee.org/document/248452">Reed. (1993). Pruning Algorithms – A Survey</a> for more details about the pruning algorithms of the 80s and 90s.</p>

<p>According to <a href="https://www.semanticscholar.org/">Semantic Scholar</a>, among the most influential pruning papers nowadays are <a href="https://api.semanticscholar.org/CorpusID:2134321">Han et al. (2015). Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding</a>, <a href="https://api.semanticscholar.org/CorpusID:14089312">Li et al. (2016). Pruning Filters for Efficient ConvNets</a> and <a href="https://api.semanticscholar.org/CorpusID:20157893">He et al. (2017). Channel Pruning for Accelerating Very Deep Neural Networks</a>.</p>

<p>Given the high interest in neural network pruning over the last few years, it seems only appropriate to investigate the relative effectiveness of the various pruning methods. The methods vary in <em>how many parameters to prune</em>, <em>which ones to prune</em>, <em>when to prune</em>, and <em>how to fine-tune the model after pruning</em>. See <a href="http://arxiv.org/abs/2003.03033">Blalock et al. (2020). What is the State of Neural Network Pruning?</a> for a review of pruning methods as of 2020.</p>

<h2 id="why-pruning">Why Pruning?</h2>

<p>Most publications about pruning report at least one of these two metrics:</p>
<ul>
  <li>the number of FLOPs needed to run inference with the pruned network;</li>
  <li>the number of model parameters that were removed.</li>
</ul>

<p>In practice, though, the benefits extend beyond these two metrics.</p>

<p>Reducing the number of model parameters affects the model’s storage footprint, the device’s memory consumption, inference compute costs, inference energy usage, and more.</p>

<p>By adding tiny amounts of compression during pruning, the <em>model’s accuracy</em> may occasionally <em>even be increased</em> (probably because of regularization effect, see <a href="https://arxiv.org/abs/1808.08558">Suzuki et al. (2018). Spectral Pruning: Compressing Deep Neural Networks via Spectral Analysis and its Generalization Error</a> for more details).</p>

<figure class="image" style="width:70%">
  <img src="./assets/specres.png" />
  <figcaption><a href="http://arxiv.org/abs/1808.08558">Pruning (<b>Spec-ResA</b> model) increased the accuracy on ImageNet</a></figcaption>
</figure>

<p>There is also a <a href="https://arxiv.org/abs/1906.06307">hypothesis</a> that sparse models tend to outperform dense ones for a fixed number of parameters.</p>

<h2 id="pruning-methods">Pruning Methods</h2>

<p>The majority of neural network pruning techniques are based on the algorithm below:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">prune_and_finetune</span><span class="p">(</span><span class="n">N</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">X</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
    <span class="s">'''
    Input:
        N is the number of iterations of pruning
        X is the dataset on which to train and fine-tune
    Return:
        M is the mask of 0 and 1, with the size |W|
        W is the final model weights, fine-tuned after pruning
    '''</span>

    <span class="c1"># model weights
</span>    <span class="n">W</span> <span class="o">=</span> <span class="n">initialize</span><span class="p">()</span>

    <span class="c1"># f is a training function
</span>    <span class="n">W</span> <span class="o">=</span> <span class="n">train_to_convergence</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">W</span><span class="p">)</span>

    <span class="c1"># M is a mask of 0 and 1, with the size |W|
</span>    <span class="n">M</span> <span class="o">=</span> <span class="n">generate_mask</span><span class="p">(</span><span class="n">W</span><span class="p">)</span>

    <span class="c1"># do N iterations of pruning
</span>    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N</span><span class="p">):</span>

        <span class="c1"># Each parameter or structural element in the network is given a score s from S,
</span>        <span class="c1"># and the network is pruned based on these scores.
</span>        <span class="n">S</span> <span class="o">=</span> <span class="n">score</span><span class="p">(</span><span class="n">W</span><span class="p">)</span>

        <span class="c1"># prune the model, applying binary mask M according to the S score
</span>        <span class="n">M</span> <span class="o">=</span> <span class="n">prune</span><span class="p">(</span><span class="n">M</span><span class="p">,</span> <span class="n">S</span><span class="p">)</span>

        <span class="c1"># fine-tune the model after pruning
</span>        <span class="n">W</span> <span class="o">=</span> <span class="n">fine_tune</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">M</span><span class="p">,</span> <span class="n">W</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">M</span><span class="p">,</span> <span class="n">W</span>
</code></pre></div></div>

<p>The network is initially trained to convergence in this algorithm. After that, a score is assigned to each parameter or structural component in the network, and the network is pruned in accordance with these scores. The network must be further trained (referred to as “fine-tuning”) to recover when pruning affects its accuracy. The network’s size is steadily decreased by a series of pruning and fine-tuning iterations.</p>

<p>Different pruning techniques answer different design questions. Before you prune, you should consider the following key questions.</p>

<ul>
  <li>Which structure should we prune?
    <ul>
      <li><strong>Unstructured pruning.</strong> Individual weights are pruned via unstructured pruning, which ignores their location inside each tensor. [<a href="https://papers.nips.cc/paper/1989/hash/6c9882bbac1c7093bd25041881277658-Abstract.html">LeCun et al. (1989)</a>, <a href="https://arxiv.org/abs/1506.02626">Han et al. (2015)</a>]</li>
      <li><strong>Structured pruning.</strong> Entire structures, such as convolutional filters, channels, or blocks of weights, are removed as part of structured pruning. [<a href="https://arxiv.org/abs/1608.08710">Li et al. (2016)</a>]</li>
    </ul>
  </li>
  <li>When should pruning be applied?
    <ul>
      <li><strong>Regularization and gradual pruning.</strong> The network is pruned during the training process, resulting in a pruned network at the end of training. [<a href="https://arxiv.org/abs/1712.01312">Louizos et al. (2017)</a>]</li>
      <li><strong>Retraining.</strong> The network is pruned after training. To restore accuracy, the pruned network is often retrained or fine-tuned. [<a href="https://arxiv.org/abs/1506.02626">Han et al. (2015)</a>]</li>
    </ul>
  </li>
  <li>Which weights should we prune?
    <ul>
      <li><strong>Global Magnitude Pruning.</strong> Prunes the weights with the lowest absolute value anywhere in the network.</li>
      <li><strong>Layerwise Magnitude Pruning.</strong> For each layer of the model, prunes the weights with the lowest absolute value.</li>
      <li><strong>Global Gradient Magnitude Pruning.</strong> Prunes the individual weights that, after a batch of inputs has been evaluated, have the lowest absolute value of \(\text{weight} * \text{gradient}\).</li>
      <li><strong>Layerwise Gradient Magnitude Pruning.</strong> For each layer of the model, prunes the weights that, after a batch of inputs has been evaluated, have the lowest absolute value of \(\text{weight} * \text{gradient}\).</li>
      <li><strong>Random Pruning.</strong> Individually prunes each weight with a probability equal to the fraction of the network that has to be pruned.</li>
    </ul>
  </li>
</ul>
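<p>To make the “which weights” question concrete, the hypothetical sketch below contrasts global and layerwise magnitude pruning on a toy two-layer weight set. Note how the global variant can wipe out an entire layer whose weights are uniformly small:</p>

```python
import numpy as np

layers = {
    "layer1": np.array([0.9, 0.8, 0.7, 0.6]),     # large weights
    "layer2": np.array([0.05, 0.04, 0.03, 0.02]), # small weights
}

def global_prune(layers, sparsity):
    """One threshold for the whole network: the globally smallest |w| are removed."""
    all_w = np.sort(np.concatenate([np.abs(w) for w in layers.values()]))
    threshold = all_w[int(sparsity * all_w.size)]
    return {name: np.where(np.abs(w) >= threshold, w, 0.0)
            for name, w in layers.items()}

def layerwise_prune(layers, sparsity):
    """One threshold per layer: each layer loses the same fraction of weights."""
    out = {}
    for name, w in layers.items():
        threshold = np.sort(np.abs(w))[int(sparsity * w.size)]
        out[name] = np.where(np.abs(w) >= threshold, w, 0.0)
    return out

g = global_prune(layers, sparsity=0.5)     # layer2 is pruned away entirely
l = layerwise_prune(layers, sparsity=0.5)  # each layer keeps its two largest weights
```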

<h2 id="practical-usage">Practical Usage</h2>

<p>The most popular libraries for pruning:</p>

<ul>
  <li><a href="https://GitHub.com/he-y/Awesome-Pruning"><img src="https://img.shields.io/github/stars/he-y/Awesome-Pruning?style=social&amp;label=Stars&amp;maxAge=2592000" alt="GitHub stars" /></a> <a href="https://github.com/he-y/Awesome-Pruning">he-y/Awesome-Pruning</a> – A curated list of neural network pruning resources.</li>
  <li><a href="https://GitHub.com/openvinotoolkit/nncf"><img src="https://img.shields.io/github/stars/openvinotoolkit/nncf?style=social&amp;label=Stars&amp;maxAge=2592000" alt="GitHub stars" /></a> <a href="https://github.com/openvinotoolkit/nncf">openvinotoolkit/nncf</a> – Neural Network Compression Framework for enhanced OpenVINO™ inference</li>
  <li><a href="https://GitHub.com/neuralmagic/sparseml"><img src="https://img.shields.io/github/stars/neuralmagic/sparseml?style=social&amp;label=Stars&amp;maxAge=2592000" alt="GitHub stars" /></a> <a href="https://github.com/neuralmagic/sparseml">neuralmagic/sparseml</a> – Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models</li>
  <li><a href="https://GitHub.com/alibaba/TinyNeuralNetwork"><img src="https://img.shields.io/github/stars/alibaba/TinyNeuralNetwork?style=social&amp;label=Stars&amp;maxAge=2592000" alt="GitHub stars" /></a> <a href="https://github.com/alibaba/TinyNeuralNetwork">alibaba/TinyNeuralNetwork</a> – TinyNeuralNetwork is an efficient and easy-to-use deep learning model compression framework.</li>
  <li><a href="https://GitHub.com/intel/neural-compressor"><img src="https://img.shields.io/github/stars/intel/neural-compressor?style=social&amp;label=Stars&amp;maxAge=2592000" alt="GitHub stars" /></a> <a href="https://github.com/intel/neural-compressor">intel/neural-compressor</a> – An open-source Python library supporting popular model compression techniques on all mainstream deep learning frameworks (TensorFlow, PyTorch, ONNX Runtime, and MXNet).</li>
</ul>

<h2 id="conclusion">Conclusion</h2>

<p>Pruning is a useful technique for reducing network size, accelerating the inference process, reducing memory footprint, etc.</p>

<figure class="image" style="width:50%">
  <img src="./assets/optimization.png" class="image" />
  <figcaption><a href="https://xkcd.com/1691/">Optimization</a></figcaption>
</figure>]]></content><author><name>Andrii Polukhin</name><email>pandrii000@gmail.com</email></author><summary type="html"><![CDATA[Pruning: Before and After]]></summary></entry><entry><title type="html">How Much Data Are Augmentations Worth?</title><link href="https://polukhin.tech/2022/10/18/how-much-data-are-augmentations-worth" rel="alternate" type="text/html" title="How Much Data Are Augmentations Worth?" /><published>2022-10-18T00:00:00+00:00</published><updated>2022-10-18T00:00:00+00:00</updated><id>https://polukhin.tech/2022/10/18/how-much-data-are-augmentations-worth</id><content type="html" xml:base="https://polukhin.tech/2022/10/18/how-much-data-are-augmentations-worth"><![CDATA[<p>This post is a short review of the publication</p>
<blockquote>
  <p><a href="http://arxiv.org/abs/2210.06441">Geiping, J., Goldblum, M., Somepalli, G., Shwartz-Ziv, R., Goldstein, T., &amp; Wilson, A. G. (2022). How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization (arXiv:2210.06441)</a></p>
</blockquote>

<h2 id="tldr">TLDR</h2>

<p>This publication is a study of <strong>how data augmentation influences model invariance and robustness during training</strong>, conducted by the University of Maryland, College Park, and New York University.</p>

<p><img src="assets/power-laws.png" alt="" /></p>

<h2 id="key-points">Key Points</h2>

<ul>
  <li>Augmentation can improve model performance even more than adding new real data, even when the augmentation is inconsistent with the test data distribution, i.e. when it generates out-of-domain data.</li>
  <li>In various data regimes, different augmentations are beneficial; for example, when there is little data, aggressive augmentations like horizontal flipping are preferable, whereas more data favors cautious augmentations like vertical flipping.</li>
  <li>At smaller data scales, augmentations are preferred, while invariant neural network architectures overtake them in the large-sample regime. Augmentations can be advantageous even for invariances that seem unrelated to one another.</li>
  <li>Across neural network widths and topologies, the relative gains from augmentations remain generally stable as sample sizes increase, although the absolute benefits depend on the architecture.</li>
</ul>]]></content><author><name>Andrii Polukhin</name><email>pandrii000@gmail.com</email></author><summary type="html"><![CDATA[This post is a short review of the publication Geiping, J., Goldblum, M., Somepalli, G., Shwartz-Ziv, R., Goldstein, T., &amp; Wilson, A. G. (2022). How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization (arXiv:2210.06441)]]></summary></entry><entry><title type="html">Lightweight Neural Network Architectures</title><link href="https://polukhin.tech/2022/10/17/lightweight-neural-network-architectures" rel="alternate" type="text/html" title="Lightweight Neural Network Architectures" /><published>2022-10-17T00:00:00+00:00</published><updated>2022-10-17T00:00:00+00:00</updated><id>https://polukhin.tech/2022/10/17/lightweight-neural-network-architectures</id><content type="html" xml:base="https://polukhin.tech/2022/10/17/lightweight-neural-network-architectures"><![CDATA[<p>As the field of Deep Learning continues to grow, the demand for efficient and lightweight neural networks becomes increasingly important. In this blog post, we will explore six lightweight neural network architectures.</p>

<ul>
  <li>One of the most popular families of lightweight neural networks is the <strong>MobileNet</strong> family. MobileNets are designed to be efficient on mobile devices and have achieved state-of-the-art performance on various computer vision tasks.</li>
  <li>The family of <strong>model scaling formulas</strong> offers another route to efficient networks. These models scale properties such as the number of filters and layers according to a scaling coefficient, making it straightforward to trade accuracy for compute across a range of budgets.</li>
  <li><strong>Neural Architecture Search (NAS)</strong> automates the design of lightweight networks. NAS uses machine learning algorithms to search for the best neural network architecture for a given task, and has achieved state-of-the-art performance on various computer vision tasks.</li>
  <li><strong>Group convolution</strong> reduces computation by splitting the input channels into groups and applying a separate set of convolutional filters to each group. This cuts the number of parameters while largely preserving accuracy.</li>
  <li><strong>Squeeze &amp; Excitation</strong> blocks add a channel-wise gating mechanism that selectively highlights important features in the input. This technique has been shown to improve the accuracy of lightweight neural networks at a small extra cost.</li>
  <li>Finally, the <strong>Mobile Transformer</strong> family is a newer approach that combines the efficiency of MobileNets with the attention mechanism of Transformers, resulting in models that are both efficient and accurate.</li>
</ul>
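
<p>As a rough illustration of why group and depthwise separable convolutions (the core building block of MobileNets) shrink models, their parameter counts can be compared directly. The sketch below is plain Python; the layer sizes are arbitrary examples chosen for illustration, not taken from any particular architecture.</p>

```python
# Parameter counts of a convolutional layer (bias terms ignored for simplicity).

def conv_params(c_in, c_out, k, groups=1):
    """Weights in a k x k convolution whose channels are split into `groups`."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * k * k * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel) + 1x1 pointwise conv."""
    depthwise = conv_params(c_in, c_in, k, groups=c_in)
    pointwise = conv_params(c_in, c_out, 1)
    return depthwise + pointwise

c_in, c_out, k = 128, 256, 3
print(conv_params(c_in, c_out, k))                  # standard:  294912
print(conv_params(c_in, c_out, k, groups=4))        # grouped:    73728
print(depthwise_separable_params(c_in, c_out, k))   # separable:  33920
```

<p>For this example layer, four-way grouping divides the weight count by the number of groups, and the depthwise separable factorization is almost 9x smaller than the standard convolution, which is where most of MobileNet's savings come from.</p>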

<p>In summary, lightweight neural networks are becoming increasingly important in the field of Deep Learning. By exploring different lightweight neural network architectures, we can create models that are both efficient and accurate.</p>

<p>You can learn more about these families in the presentation below, or <a href="./assets/lightweight-neural-network-architectures.pdf">download the PDF</a>.</p>

<iframe src="./assets/lightweight-neural-network-architectures.pdf" width="100%" height="480px" frameborder="0"></iframe>

<p>Also, check out the online meeting based on this presentation below.</p>

<iframe width="100%" height="480px" src="https://www.youtube.com/embed/tIUNH3PilXU" frameborder="0" allowfullscreen=""></iframe>]]></content><author><name>Andrii Polukhin</name><email>pandrii000@gmail.com</email></author><summary type="html"><![CDATA[As the field of Deep Learning continues to grow, the demand for efficient and lightweight neural networks becomes increasingly important. In this blog post, we will explore six lightweight neural network architectures.]]></summary></entry><entry><title type="html">Welcome!</title><link href="https://polukhin.tech/2022/10/16/welcome" rel="alternate" type="text/html" title="Welcome!" /><published>2022-10-16T00:00:00+00:00</published><updated>2022-10-16T00:00:00+00:00</updated><id>https://polukhin.tech/2022/10/16/welcome</id><content type="html" xml:base="https://polukhin.tech/2022/10/16/welcome"><![CDATA[<p>Hello ✋! My name is Andrii Polukhin, I am a deep learning enthusiast.</p>

<p>After working in the field of artificial intelligence for some time, I decided it was time to give back to the community. <strong>This blog is about my own machine learning experiences</strong>.</p>

<p><em>Please shoot me an <a href="mailto:pandrii000@gmail.com">email</a> if you are interested in a machine learning collaboration or have a cool concept to put into practice.</em></p>]]></content><author><name>Andrii Polukhin</name><email>pandrii000@gmail.com</email></author><category term="en" /><summary type="html"><![CDATA[Hello ✋! My name is Andrii Polukhin, I am a deep learning enthusiast.]]></summary></entry></feed>