<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.1">Jekyll</generator><link href="https://janakiev.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://janakiev.com/" rel="alternate" type="text/html" /><updated>2024-07-10T02:29:45-05:00</updated><id>https://janakiev.com/feed.xml</id><title type="html">njanakiev</title><subtitle>All things data // data science / data engineering / data visualization / GIS</subtitle><author><name>Nikolai Janakiev</name></author><entry><title type="html">Downloading Images with Python, PIL, Requests, and Urllib</title><link href="https://janakiev.com/blog/python-pilow-download-image/" rel="alternate" type="text/html" title="Downloading Images with Python, PIL, Requests, and Urllib" /><published>2023-04-12T00:00:00-05:00</published><updated>2023-04-12T00:00:00-05:00</updated><id>https://janakiev.com/blog/python-pilow-download-image</id><content type="html" xml:base="https://janakiev.com/blog/python-pilow-download-image/"><![CDATA[<p>Python is a great language for automating tasks, and downloading images is one of those tasks that can be easily automated. In this article, you’ll see how to use the Python Imaging Library (PIL) or rather Pillow, Requests, and Urllib to download images from the web.</p>

<h1 id="download-an-image-with-requests">Download an Image with Requests</h1>

<p>To get an image from a URL, you can use the <a href="https://requests.readthedocs.io/en/latest/">requests</a> package with the following lines of code. (Note: this was tested using requests <code class="language-plaintext highlighter-rouge">2.27.1</code>.) The script downloads this picture of <a href="https://unsplash.com/photos/73F4pKoUkM0">Lake Tekapo, New Zealand</a> by <a href="https://unsplash.com/@tokeller">Tobias Keller</a> on Unsplash:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">requests</span>

<span class="n">filepath</span> <span class="o">=</span> <span class="s">"assets/unsplash_image.jpg"</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">"https://images.unsplash.com/photo-1465056836041-7f43ac27dcb5?w=720"</span>

<span class="n">r</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">if</span> <span class="n">r</span><span class="p">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">filepath</span><span class="p">,</span> <span class="s">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">content</span><span class="p">)</span>
</code></pre></div></div>

<h1 id="download-an-image-as-pil-image">Download an Image as PIL Image</h1>

<p>In some cases, it is not necessary or possible to save the image to disk, and it needs to be processed right away. Setting the parameter <code class="language-plaintext highlighter-rouge">stream=True</code> defers downloading the response body until it is accessed; reading <code class="language-plaintext highlighter-rouge">r.content</code> then loads the whole image into memory. For more information have a look at the <a href="https://requests.readthedocs.io/en/latest/user/advanced/#body-content-workflow">documentation</a>. The bytes then need to be wrapped in an <a href="https://docs.python.org/3/library/io.html#io.BytesIO">io.BytesIO</a> binary stream to be consumed by Pillow:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>

<span class="n">w</span><span class="p">,</span> <span class="n">h</span> <span class="o">=</span> <span class="mi">800</span><span class="p">,</span> <span class="mi">600</span>
<span class="n">filepath</span> <span class="o">=</span> <span class="s">"image.jpg"</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">"https://images.unsplash.com/photo-1465056836041-7f43ac27dcb5?w=720"</span>

<span class="n">r</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">stream</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">if</span> <span class="n">r</span><span class="p">.</span><span class="n">status_code</span> <span class="o">==</span> <span class="mi">200</span><span class="p">:</span>
    <span class="n">img</span> <span class="o">=</span> <span class="n">Image</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="n">io</span><span class="p">.</span><span class="n">BytesIO</span><span class="p">(</span><span class="n">r</span><span class="p">.</span><span class="n">content</span><span class="p">))</span>
    
    <span class="c1"># Do something with image
</span>    
    <span class="c1"># Save image to file
</span>    <span class="n">img</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">filepath</span><span class="p">)</span>
</code></pre></div></div>

<h1 id="download-an-image-using-urllib">Download an Image using Urllib</h1>

<p>Sometimes you cannot install the requests library and need to rely on the Python standard library. In this case, you can use <a href="https://docs.python.org/3/library/urllib.request.html#urllib.request.urlretrieve">urllib.request.urlretrieve</a> from the urllib package:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">urllib.request</span>

<span class="n">filepath</span> <span class="o">=</span> <span class="s">"image.jpg"</span>
<span class="n">url</span> <span class="o">=</span> <span class="s">"https://images.unsplash.com/photo-1465056836041-7f43ac27dcb5?w=720"</span>

<span class="n">urllib</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">urlretrieve</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">filepath</span><span class="p">)</span>
</code></pre></div></div>
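
<p>The Python documentation lists <code class="language-plaintext highlighter-rouge">urlretrieve</code> as a legacy interface, so an alternative is to combine <code class="language-plaintext highlighter-rouge">urllib.request.urlopen</code> with <code class="language-plaintext highlighter-rouge">shutil.copyfileobj</code>. The following is a minimal sketch, not a definitive implementation: the helper name <code class="language-plaintext highlighter-rouge">download</code>, the default timeout, and the file names are assumptions for illustration, and the demo copies a local <code class="language-plaintext highlighter-rouge">file://</code> URL so that it runs without network access:</p>

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download(url, filepath, timeout=10):
    """Stream the response body at `url` into `filepath` (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(filepath, "wb") as f:
            # Copy in chunks instead of holding the whole body in memory
            shutil.copyfileobj(response, f)
    return filepath


# Demo with a local file:// URL; replace it with an https:// image URL in practice
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.jpg"
    source.write_bytes(b"not really a JPEG")
    target = download(source.as_uri(), Path(tmp) / "copy.jpg")
    copied = target.read_bytes()
```

<p>Copying the response in chunks keeps memory usage low for large images, and the explicit timeout avoids waiting indefinitely on an unresponsive server.</p>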

<h1 id="resources">Resources</h1>

<ul>
  <li><a href="https://requests.readthedocs.io/en/latest/">Requests</a></li>
  <li><a href="https://pillow.readthedocs.io/en/stable/">Pillow</a> - Pillow is the friendly PIL fork</li>
  <li><a href="https://docs.python.org/3/library/urllib.html">urllib</a> - URL handling modules</li>
</ul>]]></content><author><name>Nikolai Janakiev</name></author><category term="blog" /><category term="Python" /><category term="PIL" /><summary type="html"><![CDATA[Python is a great language for automating tasks, and downloading images is one of those tasks that can be easily automated. In this article, you’ll see how to use the Python Imaging Library (PIL) or rather Pillow, Requests, and Urllib to download images from the web.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://janakiev.com/assets/python-pilow-download-image_files/photo-1465056836041-7f43ac27dcb5.jpg" /><media:content medium="image" url="https://janakiev.com/assets/python-pilow-download-image_files/photo-1465056836041-7f43ac27dcb5.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Virtual Environments in Python with venv</title><link href="https://janakiev.com/blog/python-venv/" rel="alternate" type="text/html" title="Virtual Environments in Python with venv" /><published>2023-01-20T00:00:00-06:00</published><updated>2023-01-20T00:00:00-06:00</updated><id>https://janakiev.com/blog/python-venv</id><content type="html" xml:base="https://janakiev.com/blog/python-venv/"><![CDATA[<p>Python’s built-in venv module makes it easy to create virtual environments for your Python projects. Virtual environments are isolated spaces where your Python packages and their dependencies live. This means that each project can have its own dependencies, regardless of what other projects are doing.</p>

<h1 id="create-a-virtual-environment">Create a Virtual Environment</h1>

<p>Create an environment with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-m</span> venv ./venv
python <span class="nt">-m</span> venv /path/to/venv
</code></pre></div></div>

<p>Activate an environment with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">source </span>venv/bin/activate
<span class="nb">source</span> /path/to/venv/bin/activate
</code></pre></div></div>

<p>Make sure to test if <code class="language-plaintext highlighter-rouge">python</code> and <code class="language-plaintext highlighter-rouge">pip</code> are indeed in the environment by typing:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>which python
<span class="c"># /absolute/path/to/venv/bin/python</span>
which pip
<span class="c"># /absolute/path/to/venv/bin/pip</span>
</code></pre></div></div>
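
<p>The same check can be sketched from within Python itself. This is a small example under the assumption that the environment was created with <code class="language-plaintext highlighter-rouge">venv</code>; the function name <code class="language-plaintext highlighter-rouge">in_virtual_environment</code> is made up for illustration:</p>

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtual_environment())
```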

<p>Install packages from a <code class="language-plaintext highlighter-rouge">requirements.txt</code> with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">-r</span> requirements.txt
</code></pre></div></div>

<p>Deactivate an environment with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>deactivate
</code></pre></div></div>

<h1 id="resources">Resources</h1>

<ul>
  <li><a href="https://docs.python.org/3/library/venv.html">venv — Creation of virtual environments</a></li>
</ul>]]></content><author><name>Nikolai Janakiev</name></author><category term="blog" /><category term="Python" /><category term="PIL" /><summary type="html"><![CDATA[Python’s built-in venv module makes it easy to create virtual environments for your Python projects. Virtual environments are isolated spaces where your Python packages and their dependencies live. This means that each project can have its own dependencies, regardless of what other projects are doing.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://janakiev.com/assets/python-venv_files/Bendix_G-15_module.jpg" /><media:content medium="image" url="https://janakiev.com/assets/python-venv_files/Bendix_G-15_module.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Object Serialization with JSON and compressed JSON in Python</title><link href="https://janakiev.com/blog/python-json/" rel="alternate" type="text/html" title="Object Serialization with JSON and compressed JSON in Python" /><published>2022-11-22T00:00:00-06:00</published><updated>2022-11-22T00:00:00-06:00</updated><id>https://janakiev.com/blog/python-json</id><content type="html" xml:base="https://janakiev.com/blog/python-json/"><![CDATA[<p>JSON is a popular data format for storing data in a structured way. Python has a built-in module called json that can be used to work with JSON data. In this article, we will see how to use the json module to serialize and deserialize data in Python.</p>

<h1 id="reading-and-writing-json-in-python">Reading and Writing JSON in Python</h1>

<p>Python ships with a built-in <a href="https://docs.python.org/3/library/json.html">JSON</a> encoder and decoder. To write and read JSON files, you can use the <code class="language-plaintext highlighter-rouge">dump()</code> and <code class="language-plaintext highlighter-rouge">load()</code> functions respectively. They have the same names as their counterparts in the pickle module, which makes them easy to remember.</p>

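<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>

<span class="c1"># Example data (hypothetical); any JSON-serializable object works
</span><span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s">"name"</span><span class="p">:</span> <span class="s">"example"</span><span class="p">,</span> <span class="s">"values"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]}</span>

<span class="c1"># Writing a JSON file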
</span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'data.json'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">json</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>

<span class="c1"># Reading a JSON file
</span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'data.json'</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</code></pre></div></div>

<p>You can also encode and decode JSON to and from a string with the <code class="language-plaintext highlighter-rouge">dumps()</code> and <code class="language-plaintext highlighter-rouge">loads()</code> functions respectively. To encode an object as a JSON string, type:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">json_string</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</code></pre></div></div>

<p>And to decode JSON you can type:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">json_string</span><span class="p">)</span>
</code></pre></div></div>

<p>This comes in handy when you work with REST APIs, since many of them use JSON as input and/or output.</p>
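
<p>As a small illustration, the following sketch round-trips a made-up payload through a string. The payload and variable names are assumptions for this example; <code class="language-plaintext highlighter-rouge">indent</code>, <code class="language-plaintext highlighter-rouge">sort_keys</code> and <code class="language-plaintext highlighter-rouge">separators</code> are standard parameters of <code class="language-plaintext highlighter-rouge">json.dumps()</code>:</p>

```python
import json

# Hypothetical payload, as it might be sent to or received from a REST API
payload = {"id": 42, "tags": ["python", "json"], "active": True, "score": None}

# Human-readable output with stable key order, e.g. for logging or diffs
json_string = json.dumps(payload, indent=2, sort_keys=True)

# Decoding restores the original Python objects
restored = json.loads(json_string)

# Compact output without extra whitespace, e.g. for sending over the network
compact = json.dumps(payload, separators=(",", ":"))
```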

<h1 id="reading-and-writing-gzip-compressed-json-in-python">Reading and Writing GZIP Compressed JSON in Python</h1>

<p>It is also possible to compress the JSON in order to save storage space by typing the following:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">gzip</span>
<span class="kn">import</span> <span class="nn">json</span>

<span class="k">with</span> <span class="n">gzip</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">"data.json.gz"</span><span class="p">,</span> <span class="s">'wt'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'utf-8'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">json</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
</code></pre></div></div>

<p>To load the compressed JSON, type:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">gzip</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">"data.json.gz"</span><span class="p">,</span> <span class="s">'rt'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'utf-8'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</code></pre></div></div>

<p>This is especially useful when caching large amounts of JSON output.</p>
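
<p>If the JSON should stay in memory, for example when caching in a key-value store, the compression can also be applied to bytes directly. This is a sketch using <code class="language-plaintext highlighter-rouge">gzip.compress()</code> and <code class="language-plaintext highlighter-rouge">gzip.decompress()</code> from the standard library; the example data is made up, and the actual savings depend on how repetitive your data is:</p>

```python
import gzip
import json

data = {"values": list(range(1000))}  # hypothetical, repetitive example data

# Encode to JSON, then to UTF-8 bytes, then compress in memory
raw = json.dumps(data).encode("utf-8")
compressed = gzip.compress(raw)

# Reverse the steps to get the original object back
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))

print(len(raw), "bytes uncompressed,", len(compressed), "bytes compressed")
```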

<h1 id="resources">Resources</h1>

<ul>
  <li><a href="https://docs.python.org/3/library/json.html">Python json documentation</a></li>
  <li><a href="https://docs.python.org/3/library/">Python standard library</a></li>
  <li><a href="https://docs.python.org/3/library/persistence.html">Data persistence documentation</a></li>
</ul>]]></content><author><name>Nikolai Janakiev</name></author><category term="blog" /><category term="Python" /><category term="Pickle" /><category term="Data Engineering" /><summary type="html"><![CDATA[JSON is a popular data format for storing data in a structured way. Python has a built-in module called json that can be used to work with JSON data. In this article, we will see how to use the json module to serialize and deserialize data in Python.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://janakiev.com/assets/python-json_files/Construction_in_Gibraltar.jpg" /><media:content medium="image" url="https://janakiev.com/assets/python-json_files/Construction_in_Gibraltar.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">How to Save Temporary Changes in Git Using Git Stash</title><link href="https://janakiev.com/blog/git-stash/" rel="alternate" type="text/html" title="How to Save Temporary Changes in Git Using Git Stash" /><published>2022-11-21T00:00:00-06:00</published><updated>2022-11-21T00:00:00-06:00</updated><id>https://janakiev.com/blog/git-stash</id><content type="html" xml:base="https://janakiev.com/blog/git-stash/"><![CDATA[<p>Git stashing is a way to temporarily save changes that you do not want to commit yet. This is useful if you need to switch branches, but do not want to commit your changes first.</p>

<h1 id="stash-your-changes-in-git">Stash your Changes in Git</h1>

<p>To stash your changes, type:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git stash
git stash push -m "description of stash"
</code></pre></div></div>

<p>Once you are done, you can reapply your stash with:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git stash apply
git stash apply 2  # stash at index 2 in the output of git stash list
git stash apply stash@{2}
</code></pre></div></div>

<h1 id="listing-your-stash">Listing your Stash</h1>

<p>To show your stored stashes, type:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git stash list
</code></pre></div></div>

<h1 id="cleaning-up-the-stash">Cleaning up the Stash</h1>

<p>Reapply a stash and remove it from the stash list with:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git stash pop 2
</code></pre></div></div>

<p>If you want to remove a stash, type:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git stash drop 2
</code></pre></div></div>

<p>Finally, to remove all items from stash, type:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git stash clear
</code></pre></div></div>

<h1 id="resources">Resources</h1>

<ul>
  <li><a href="https://git-scm.com/book/en/v2/Git-Tools-Stashing-and-Cleaning">7.3 Git Tools - Stashing and Cleaning</a></li>
</ul>]]></content><author><name>Nikolai Janakiev</name></author><category term="blog" /><category term="Git" /><summary type="html"><![CDATA[Git stashing is a way to temporarily save changes that you do not want to commit yet. This is useful if you need to switch branches, but do not want to commit your changes first.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://janakiev.com/assets/git-stash_files/Salle_de_lecture_reserve_Bibliotheque_Sainte-Genevieve_n4.jpg" /><media:content medium="image" url="https://janakiev.com/assets/git-stash_files/Salle_de_lecture_reserve_Bibliotheque_Sainte-Genevieve_n4.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Running Prometheus with Systemd</title><link href="https://janakiev.com/blog/prometheus-setup-systemd/" rel="alternate" type="text/html" title="Running Prometheus with Systemd" /><published>2022-11-10T00:00:00-06:00</published><updated>2022-11-10T00:00:00-06:00</updated><id>https://janakiev.com/blog/prometheus-setup-systemd</id><content type="html" xml:base="https://janakiev.com/blog/prometheus-setup-systemd/"><![CDATA[<p>Prometheus is a powerful open-source monitoring system that can be used to collect and track a variety of metrics for your applications. In this guide, we will cover how to get Prometheus up and running with systemd on an Ubuntu or Debian server.</p>

<h1 id="download-and-install-prometheus">Download and Install Prometheus</h1>

<p>Create a dedicated <code class="language-plaintext highlighter-rouge">prometheus</code> user with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>useradd <span class="nt">-M</span> <span class="nt">-U</span> prometheus
</code></pre></div></div>

<p>Select a version for your system from <a href="https://prometheus.io/download/">here</a> and download it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://github.com/prometheus/prometheus/releases/download/v2.40.0-rc.0/prometheus-2.40.0-rc.0.linux-amd64.tar.gz
<span class="nb">tar</span> <span class="nt">-xzvf</span> prometheus-2.40.0-rc.0.linux-amd64.tar.gz
<span class="nb">sudo mv </span>prometheus-2.40.0-rc.0.linux-amd64 /opt/prometheus
</code></pre></div></div>

<p>Change the ownership of the folder to the <code class="language-plaintext highlighter-rouge">prometheus</code> user with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo chown </span>prometheus:prometheus <span class="nt">-R</span> /opt/prometheus
</code></pre></div></div>

<h1 id="create-systemd-unit-file">Create Systemd Unit File</h1>

<p>Create a systemd service file at <code class="language-plaintext highlighter-rouge">/etc/systemd/system/prometheus.service</code> with the following contents:</p>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[Unit]</span>
<span class="py">Description</span><span class="p">=</span><span class="s">Prometheus Server</span>
<span class="py">Documentation</span><span class="p">=</span><span class="s">https://prometheus.io/docs/introduction/overview/</span>
<span class="py">After</span><span class="p">=</span><span class="s">network-online.target</span>

<span class="nn">[Service]</span>
<span class="py">User</span><span class="p">=</span><span class="s">prometheus</span>
<span class="py">Group</span><span class="p">=</span><span class="s">prometheus</span>
<span class="py">Restart</span><span class="p">=</span><span class="s">on-failure</span>
<span class="py">ExecStart</span><span class="p">=</span><span class="s">/opt/prometheus/prometheus </span><span class="se">\
</span>  <span class="s">--config.file=/opt/prometheus/prometheus.yml </span><span class="se">\
</span>  <span class="s">--storage.tsdb.path=/opt/prometheus/data </span><span class="se">\
</span>  <span class="s">--storage.tsdb.retention.time=30d</span>

<span class="nn">[Install]</span>
<span class="py">WantedBy</span><span class="p">=</span><span class="s">multi-user.target</span>
</code></pre></div></div>

<p>Reload systemd and start the Prometheus service with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl daemon-reload
<span class="nb">sudo </span>systemctl start prometheus.service
</code></pre></div></div>

<p>Enable the service to start at system start-up:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl <span class="nb">enable </span>prometheus.service
</code></pre></div></div>

<p>Check the status of the service with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>systemctl status prometheus.service
</code></pre></div></div>

<p>To view the logs of Prometheus for troubleshooting, type:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>journalctl <span class="nt">-u</span> prometheus.service <span class="nt">-f</span>
</code></pre></div></div>

<h1 id="resources">Resources</h1>

<ul>
  <li><a href="https://prometheus.io/">prometheus.io</a></li>
  <li><a href="https://prometheus.io/download/">Prometheus Download</a></li>
  <li><a href="https://prometheus.io/docs/introduction/overview/">Prometheus Documentation</a></li>
</ul>]]></content><author><name>Nikolai Janakiev</name></author><category term="blog" /><category term="DevOps" /><category term="Prometheus" /><category term="Systemd" /><summary type="html"><![CDATA[Prometheus is a powerful open-source monitoring system that can be used to collect and track a variety of metrics for your applications. In this guide, we will cover how to get Prometheus up and running with systemd on a Ubuntu or Debian server.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://janakiev.com/assets/prometheus-setup-systemd_files/PC_securite_fregate_Surcouf-IMG_5881.jpg" /><media:content medium="image" url="https://janakiev.com/assets/prometheus-setup-systemd_files/PC_securite_fregate_Surcouf-IMG_5881.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Reading and Writing Parquet Files on S3 with Pandas and PyArrow</title><link href="https://janakiev.com/blog/pandas-pyarrow-parquet-s3/" rel="alternate" type="text/html" title="Reading and Writing Parquet Files on S3 with Pandas and PyArrow" /><published>2022-04-10T00:00:00-05:00</published><updated>2022-04-10T00:00:00-05:00</updated><id>https://janakiev.com/blog/pandas-pyarrow-parquet-s3</id><content type="html" xml:base="https://janakiev.com/blog/pandas-pyarrow-parquet-s3/"><![CDATA[<p>When working with large amounts of data, a common approach is to store the data in S3 buckets. Instead of dumping the data as CSV files or plain text files, a good option is to use <a href="https://parquet.apache.org/">Apache Parquet</a>. In this short guide you’ll see how to read and write Parquet files on S3 using Python, Pandas and <a href="https://arrow.apache.org/docs/python/index.html">PyArrow</a>.</p>

<p>This guide was tested using <a href="https://contabo.com/">Contabo</a> object storage, <a href="https://min.io/">MinIO</a>, and <a href="https://linode.gvw92c.net/LPg90o">Linode</a> Object Storage. You should be able to use it on most S3-compatible providers and software.</p>

<h1 id="prepare-connection">Prepare Connection</h1>

<p>Prepare the S3 environment variables in a file called <code class="language-plaintext highlighter-rouge">.env</code> in the project folder with the following contents:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>S3_REGION=eu-central-1
S3_ENDPOINT=https://eu-central-1.domain.com
S3_ACCESS_KEY=XXXX
S3_SECRET_KEY=XXXX
</code></pre></div></div>

<p>Prepare an S3 bucket that you want to use. In this case we’ll be using the <code class="language-plaintext highlighter-rouge">s3://s3-example</code> bucket to store and access our data. Next, prepare some random example data with:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'data'</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">((</span><span class="mi">1000</span><span class="p">,))})</span>
<span class="n">df</span><span class="p">.</span><span class="n">to_parquet</span><span class="p">(</span><span class="s">"data/data.parquet"</span><span class="p">)</span>
</code></pre></div></div>

<p>Load the environment variables in your script with <a href="https://github.com/theskumar/python-dotenv">python-dotenv</a>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span>
<span class="n">load_dotenv</span><span class="p">()</span>
</code></pre></div></div>
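
<p>Before opening the connection, it can help to check that all four variables from the <code class="language-plaintext highlighter-rouge">.env</code> file are actually set, since a missing one would otherwise only surface as a <code class="language-plaintext highlighter-rouge">KeyError</code> when reading <code class="language-plaintext highlighter-rouge">os.environ</code>. The following is a minimal sketch; the names <code class="language-plaintext highlighter-rouge">REQUIRED_VARIABLES</code> and <code class="language-plaintext highlighter-rouge">missing_variables</code> are hypothetical helpers for illustration:</p>

```python
import os

# Variable names as defined in the .env file of this guide
REQUIRED_VARIABLES = ("S3_REGION", "S3_ENDPOINT", "S3_ACCESS_KEY", "S3_SECRET_KEY")


def missing_variables(environ=None):
    """Return the names of required variables that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


missing = missing_variables()
if missing:
    print("Missing S3 settings:", ", ".join(missing))
```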

<p>Now, prepare the S3 connection with:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">s3fs</span>

<span class="n">fs</span> <span class="o">=</span> <span class="n">s3fs</span><span class="p">.</span><span class="n">S3FileSystem</span><span class="p">(</span>
    <span class="n">anon</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="n">use_ssl</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">client_kwargs</span><span class="o">=</span><span class="p">{</span>
        <span class="s">"region_name"</span><span class="p">:</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'S3_REGION'</span><span class="p">],</span>
        <span class="s">"endpoint_url"</span><span class="p">:</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'S3_ENDPOINT'</span><span class="p">],</span>
        <span class="s">"aws_access_key_id"</span><span class="p">:</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'S3_ACCESS_KEY'</span><span class="p">],</span>
        <span class="s">"aws_secret_access_key"</span><span class="p">:</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'S3_SECRET_KEY'</span><span class="p">],</span>
        <span class="s">"verify"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">)</span>
</code></pre></div></div>

<h1 id="write-pandas-dataframe-to-s3-as-parquet">Write Pandas DataFrame to S3 as Parquet</h1>

<p>Save the DataFrame to S3 using s3fs and Pandas:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">fs</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">'s3-example/data.parquet'</span><span class="p">,</span> <span class="s">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">df</span><span class="p">.</span><span class="n">to_parquet</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</code></pre></div></div>

<p>Save the DataFrame to S3 using s3fs and PyArrow:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pyarrow</span> <span class="k">as</span> <span class="n">pa</span>
<span class="kn">import</span> <span class="nn">pyarrow.parquet</span> <span class="k">as</span> <span class="n">pq</span>
<span class="kn">from</span> <span class="nn">pyarrow</span> <span class="kn">import</span> <span class="n">Table</span>

<span class="n">s3_filepath</span> <span class="o">=</span> <span class="s">'s3-example/data.parquet'</span>

<span class="n">pq</span><span class="p">.</span><span class="n">write_to_dataset</span><span class="p">(</span>
    <span class="n">Table</span><span class="p">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">df</span><span class="p">),</span>
    <span class="n">s3_filepath</span><span class="p">,</span>
    <span class="n">filesystem</span><span class="o">=</span><span class="n">fs</span><span class="p">,</span>
    <span class="n">use_dictionary</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">compression</span><span class="o">=</span><span class="s">"snappy"</span><span class="p">,</span>
    <span class="n">version</span><span class="o">=</span><span class="s">"2.4"</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p>You can also upload this file with <a href="https://s3tools.org/s3cmd">s3cmd</a> by typing:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>s3cmd \
  --config ~/.s3cfg \
  put data/data.parquet s3://s3-example
</code></pre></div></div>

<h1 id="reading-parquet-file-from-s3-as-pandas-dataframe">Reading Parquet File from S3 as Pandas DataFrame</h1>

<p>Now, let’s have a look at the Parquet file by using PyArrow:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">s3_filepath</span> <span class="o">=</span> <span class="s">"s3-example/data.parquet"</span>

<span class="n">pf</span> <span class="o">=</span> <span class="n">pq</span><span class="p">.</span><span class="n">ParquetDataset</span><span class="p">(</span>
    <span class="n">s3_filepath</span><span class="p">,</span>
    <span class="n">filesystem</span><span class="o">=</span><span class="n">fs</span><span class="p">)</span>
</code></pre></div></div>

<p>To read the data set into a Pandas DataFrame, you can call <code class="language-plaintext highlighter-rouge">pf.read().to_pandas()</code>. You can also already explore the metadata with <code class="language-plaintext highlighter-rouge">pf.metadata</code> or the schema with <code class="language-plaintext highlighter-rouge">pf.schema</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pf</span><span class="p">.</span><span class="n">metadata</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pf</span><span class="p">.</span><span class="n">schema</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;pyarrow._parquet.ParquetSchema object at 0x7f1c2fa4a300&gt;
required group field_id=-1 schema {
  optional double field_id=-1 data;
}
</code></pre></div></div>

<p>When using <code class="language-plaintext highlighter-rouge">ParquetDataset</code>, you can also use multiple paths. You can get those for example with:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">s3_filepath</span> <span class="o">=</span> <span class="s">'s3://s3-example'</span>
<span class="n">s3_filepaths</span> <span class="o">=</span> <span class="p">[</span><span class="n">path</span> <span class="k">for</span> <span class="n">path</span> <span class="ow">in</span> <span class="n">fs</span><span class="p">.</span><span class="n">ls</span><span class="p">(</span><span class="n">s3_filepath</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">path</span><span class="p">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">'.parquet'</span><span class="p">)]</span>
<span class="n">s3_filepaths</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>['s3-example/data.parquet', 's3-example/data.parquet']
</code></pre></div></div>

<h1 id="resources">Resources</h1>

<ul>
  <li><a href="https://s3fs.readthedocs.io/en/latest/">s3fs.readthedocs.io</a> - S3Fs Documentation</li>
  <li><a href="https://arrow.apache.org/docs/python/index.html">PyArrow</a> - Apache Arrow Python bindings</li>
  <li><a href="https://parquet.apache.org/">Apache Parquet</a></li>
</ul>]]></content><author><name>Nikolai Janakiev</name></author><category term="blog" /><category term="Data Engineering" /><category term="Big Data" /><category term="S3" /><category term="Python" /><category term="Pandas" /><category term="PyArrow" /><summary type="html"><![CDATA[When working with large amounts of data, a common approach is to store the data in S3 buckets. Instead of dumping the data as CSV files or plain text files, a good option is to use Apache Parquet. In this short guide you’ll see how to read and write Parquet files on S3 using Python, Pandas and PyArrow.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://janakiev.com/assets/pandas-pyarrow-parquet-s3_files/For_Hudson_Engineering_Corp._at_Carthage,_Texas_(8637451053).jpg" /><media:content medium="image" url="https://janakiev.com/assets/pandas-pyarrow-parquet-s3_files/For_Hudson_Engineering_Corp._at_Carthage,_Texas_(8637451053).jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Working with Credentials and Configurations in Python</title><link href="https://janakiev.com/blog/python-credentials-and-configuration/" rel="alternate" type="text/html" title="Working with Credentials and Configurations in Python" /><published>2022-01-11T00:00:00-06:00</published><updated>2022-01-11T00:00:00-06:00</updated><id>https://janakiev.com/blog/python-credentials-and-configuration</id><content type="html" xml:base="https://janakiev.com/blog/python-credentials-and-configuration/"><![CDATA[<p>When writing programs, there is often a large set of configuration and credentials that should not be hard-coded in the program. This also makes the customization of the program much easier and more generally applicable. There are various ways to handle configuration and credentials and you will see here a few of the popular and common ways to do that with Python.</p>

<p><strong>One important note right from the start:</strong> When using version control, always make sure not to commit credentials and configuration to the repository, as this could become a serious security issue. You can add those files to <code class="language-plaintext highlighter-rouge">.gitignore</code> to avoid pushing them to version control. Sometimes it is useful to keep general configuration in version control as well, but that depends on your use case.</p>

<h1 id="python-configuration-files">Python Configuration Files</h1>

<p>The first and probably most straightforward way is to have a <code class="language-plaintext highlighter-rouge">config.py</code> file somewhere in the project folder that you add to your <code class="language-plaintext highlighter-rouge">.gitignore</code> file. A similar pattern can be found in <a href="https://flask.palletsprojects.com/en/2.0.x/">Flask</a>, where you can also structure the configuration based on different contexts like development, production, and testing. The <code class="language-plaintext highlighter-rouge">config.py</code> would look something like:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">host</span> <span class="o">=</span> <span class="s">'localhost'</span>
<span class="n">port</span> <span class="o">=</span> <span class="mi">8080</span>
<span class="n">username</span> <span class="o">=</span> <span class="s">'user'</span>
<span class="n">password</span> <span class="o">=</span> <span class="s">'password'</span>
</code></pre></div></div>

<p>You would simply import it and use it like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">config</span>

<span class="n">host</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">host</span>
<span class="n">port</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">port</span>
<span class="n">username</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">username</span>
<span class="n">password</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">password</span>
</code></pre></div></div>
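
<p>The context-based structure mentioned above could be sketched with plain classes, one per context. This is only an illustration: the class names, the <code class="language-plaintext highlighter-rouge">debug</code> attribute, and the <code class="language-plaintext highlighter-rouge">get_config</code> helper are assumptions made for this example and are not taken from Flask.</p>

```python
# config.py (hypothetical layout; all names are illustrative)
class BaseConfig:
    host = 'localhost'
    port = 8080
    debug = False


class DevelopmentConfig(BaseConfig):
    # Only override what differs from the base configuration
    debug = True


class ProductionConfig(BaseConfig):
    host = '0.0.0.0'


# Map a context name to its configuration class
configs = {
    'development': DevelopmentConfig,
    'production': ProductionConfig,
}


def get_config(context='development'):
    """Return the configuration class for the given context."""
    return configs[context]


config = get_config('production')
print(config.host, config.port, config.debug)
```

<p>Shared values live in the base class, so each context only states what it changes.</p>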

<h1 id="environment-variables">Environment Variables</h1>

<p>You can access environment variables with <a href="https://docs.python.org/3/library/os.html#os.environ">os.environ</a>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>

<span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'SHELL'</span><span class="p">]</span>
</code></pre></div></div>

<p>This will throw a <code class="language-plaintext highlighter-rouge">KeyError</code> if the variable does not exist. You can check if the variable exists with <code class="language-plaintext highlighter-rouge">"SHELL" in os.environ</code>. Sometimes it's more elegant to get <code class="language-plaintext highlighter-rouge">None</code> or a default value instead of getting an error when a variable does not exist. This can be done like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># return None if VAR does not exist
</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'VAR'</span><span class="p">)</span>

<span class="c1"># return "default" if VAR does not exist
</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'VAR'</span><span class="p">,</span> <span class="s">"default"</span><span class="p">)</span>
</code></pre></div></div>

<p>You can combine this with the previous way to have a <code class="language-plaintext highlighter-rouge">config.py</code> with the following contents:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>

<span class="n">host</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'APP_HOST'</span><span class="p">,</span> <span class="s">'localhost'</span><span class="p">)</span>
<span class="n">port</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'APP_PORT'</span><span class="p">,</span> <span class="mi">8080</span><span class="p">))</span>
<span class="n">username</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'APP_USERNAME'</span><span class="p">)</span>
<span class="n">password</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'APP_PASSWORD'</span><span class="p">)</span>
</code></pre></div></div>
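
<p>With <code class="language-plaintext highlighter-rouge">os.environ.get</code>, missing credentials silently become <code class="language-plaintext highlighter-rouge">None</code>, and every value that is set arrives as a string. The following is a minimal sketch of one way to fail early and convert types explicitly. The helper <code class="language-plaintext highlighter-rouge">require_env</code> is a hypothetical name introduced here, and the variables are set inside the script only to keep the example self-contained.</p>

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Simulate variables that would normally be set by the shell
os.environ['APP_USERNAME'] = 'user'
os.environ['APP_PORT'] = '9090'

username = require_env('APP_USERNAME')

# Environment values are always strings, so convert numbers explicitly
port = int(os.environ.get('APP_PORT', '8080'))

print(username, port)
```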

<h1 id="python-dotenv">Python Dotenv</h1>

<p>Oftentimes you want to have the environment variables in a dedicated <code class="language-plaintext highlighter-rouge">.env</code> file outside of version control. One way is to load the file in the shell beforehand with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">source</span> .env
</code></pre></div></div>
<p>This is sometimes error-prone or not possible depending on the setup, so it's sometimes better to load the file dynamically with <a href="https://github.com/theskumar/python-dotenv">python-dotenv</a>. You can install the package with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">-U</span> python-dotenv
</code></pre></div></div>

<p>Load the <code class="language-plaintext highlighter-rouge">.env</code> file in your program with:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span>

<span class="n">load_dotenv</span><span class="p">()</span>
</code></pre></div></div>

<p>If your environment file is located somewhere else, you can load it with:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">load_dotenv</span><span class="p">(</span><span class="s">"/path/to/.env"</span><span class="p">)</span>
</code></pre></div></div>

<p>Now, you can access the variables from the file through <code class="language-plaintext highlighter-rouge">os.environ</code> as you saw before.</p>

<h1 id="javascript-object-notation-json">JavaScript Object Notation (JSON)</h1>

<p>JSON is another handy file format to store your configuration, as Python supports it natively through its standard library. If you are working with frontend code, you are already familiar with its usefulness and ubiquity.</p>

<p>You can prepare your configurations as a <a href="https://www.json.org/json-en.html">JSON</a> (JavaScript Object Notation) in a <code class="language-plaintext highlighter-rouge">config.json</code> with the following example configuration:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
    </span><span class="nl">"host"</span><span class="p">:</span><span class="w"> </span><span class="s2">"localhost"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"port"</span><span class="p">:</span><span class="w"> </span><span class="mi">8080</span><span class="p">,</span><span class="w">
    </span><span class="nl">"credentials"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"username"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"password"</span><span class="p">:</span><span class="w"> </span><span class="s2">"password"</span><span class="w">
    </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>You can load this configuration then with the built-in <code class="language-plaintext highlighter-rouge">json</code> package:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>

<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'config.json'</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">config</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</code></pre></div></div>

<p>This returns the data as (nested) dictionaries and lists which you can access the way you are used to (<code class="language-plaintext highlighter-rouge">config['host']</code> or <code class="language-plaintext highlighter-rouge">config.get('host')</code>).</p>
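
<p>As a short illustration of accessing nested values, here is a sketch that parses the same configuration from a string with <code class="language-plaintext highlighter-rouge">json.loads</code> so that it runs without a file on disk. The <code class="language-plaintext highlighter-rouge">timeout</code> key is a hypothetical addition used only to show a default value.</p>

```python
import json

# Same structure as config.json, inlined to keep the example self-contained
raw = '''
{
    "host": "localhost",
    "port": 8080,
    "credentials": {"username": "user", "password": "password"}
}
'''
config = json.loads(raw)

# Nested objects become nested dictionaries
username = config['credentials']['username']

# JSON numbers keep their type, so no conversion is needed
port = config['port']

# Hypothetical key that is not in the file: fall back to a default
timeout = config.get('timeout', 30)

print(username, port, timeout)
```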

<h1 id="yet-another-markup-language-yaml">Yet Another Markup Language (YAML)</h1>

<p>Another popular way to store configurations and credentials is the (in)famous <a href="https://yaml.org/">YAML</a> format. It is much simpler to use but has some minor quirks when using more complicated formatting. Here is the previous configuration as a YAML file:</p>

<div class="language-yml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">host</span><span class="pi">:</span> <span class="s">localhost</span>
<span class="na">port</span><span class="pi">:</span> <span class="m">8080</span>
<span class="na">credentials</span><span class="pi">:</span>
  <span class="na">username</span><span class="pi">:</span> <span class="s">user</span>
  <span class="na">password</span><span class="pi">:</span> <span class="s">password</span>
</code></pre></div></div>

<p>There are various packages that you can use, most commonly <a href="https://pyyaml.org/">PyYAML</a>. You can install it with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">-U</span> PyYAML
</code></pre></div></div>

<p>To load the configuration, you can type:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">yaml</span>

<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"config.yml"</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">config</span> <span class="o">=</span> <span class="n">yaml</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">Loader</span><span class="o">=</span><span class="n">yaml</span><span class="p">.</span><span class="n">FullLoader</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">config</code> can be used as previously seen with the JSON example.</p>

<p>Note that you need to specify a <code class="language-plaintext highlighter-rouge">Loader</code> in <code class="language-plaintext highlighter-rouge">PyYAML 5.1+</code> because of a vulnerability. Read more about it <a href="https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation">here</a>. For input you do not fully trust, <code class="language-plaintext highlighter-rouge">yaml.safe_load(f)</code> is the safer choice. Another common alternative to <code class="language-plaintext highlighter-rouge">PyYAML</code> is <a href="https://github.com/omry/omegaconf">omegaconf</a>, which includes many other useful parsers for various different file types.</p>

<h1 id="using-a-configuration-parser">Using a Configuration Parser</h1>

<p>The Python standard library includes the <a href="https://docs.python.org/3/library/configparser.html">configparser</a> module which can work with configuration files similar to the Microsoft Windows INI files. You can prepare the configuration in <code class="language-plaintext highlighter-rouge">config.ini</code> with the following contents:</p>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[DEFAULT]</span>
<span class="py">host</span> <span class="p">=</span> <span class="s">localhost</span>
<span class="py">port</span> <span class="p">=</span> <span class="s">8080</span>

<span class="nn">[credentials]</span>
<span class="py">username</span> <span class="p">=</span> <span class="s">user</span>
<span class="py">password</span> <span class="p">=</span> <span class="s">password</span>
</code></pre></div></div>

<p>The configuration is separated into sections like <code class="language-plaintext highlighter-rouge">[credentials]</code> and within those sections the configuration is stored as key-value pairs like <code class="language-plaintext highlighter-rouge">host = localhost</code>.</p>

<p>You can load and use the previous configuration as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">configparser</span>

<span class="n">config</span> <span class="o">=</span> <span class="n">configparser</span><span class="p">.</span><span class="n">ConfigParser</span><span class="p">()</span>
<span class="n">config</span><span class="p">.</span><span class="n">read</span><span class="p">(</span><span class="s">"config.ini"</span><span class="p">)</span>

<span class="n">host</span> <span class="o">=</span> <span class="n">config</span><span class="p">[</span><span class="s">'DEFAULT'</span><span class="p">][</span><span class="s">'host'</span><span class="p">]</span>
<span class="n">port</span> <span class="o">=</span> <span class="n">config</span><span class="p">[</span><span class="s">'DEFAULT'</span><span class="p">][</span><span class="s">'port'</span><span class="p">]</span>
<span class="n">username</span> <span class="o">=</span> <span class="n">config</span><span class="p">[</span><span class="s">'credentials'</span><span class="p">][</span><span class="s">'username'</span><span class="p">]</span>
<span class="n">password</span> <span class="o">=</span> <span class="n">config</span><span class="p">[</span><span class="s">'credentials'</span><span class="p">][</span><span class="s">'password'</span><span class="p">]</span>
</code></pre></div></div>

<p>As you can see, to access the values you have to type <code class="language-plaintext highlighter-rouge">config[section][element]</code>. To get all sections as a list, you can type <code class="language-plaintext highlighter-rouge">config.sections()</code>. For more information, have a look at the <a href="https://docs.python.org/3/library/configparser.html">documentation</a>.</p>
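
<p>Values read this way are always strings, and the <code class="language-plaintext highlighter-rouge">[DEFAULT]</code> section behaves differently from regular sections. The following sketch shows both points using <code class="language-plaintext highlighter-rouge">read_string</code> so that it runs without a file on disk. The <code class="language-plaintext highlighter-rouge">timeout</code> option is a hypothetical addition used only to demonstrate a fallback value.</p>

```python
import configparser

config = configparser.ConfigParser()
# Same contents as config.ini, inlined to keep the example self-contained
config.read_string("""
[DEFAULT]
host = localhost
port = 8080

[credentials]
username = user
password = password
""")

# Plain indexing returns strings; getint converts the value to an integer
port = config['DEFAULT'].getint('port')

# Options from DEFAULT are visible from every other section
host = config['credentials']['host']

# Hypothetical option that is not in the file: use a fallback instead
timeout = config.getint('credentials', 'timeout', fallback=30)

# DEFAULT is not listed as a regular section
sections = config.sections()

print(port, host, timeout, sections)
```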

<h1 id="parsing-command-line-options">Parsing Command-line Options</h1>

<p>It is also possible to get credentials and configuration through arguments by using the built-in <a href="https://docs.python.org/3/library/argparse.html">argparse</a> module.</p>

<p>You can initialize the argument parser with:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">argparse</span>

<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="p">.</span><span class="n">ArgumentParser</span><span class="p">(</span>
    <span class="n">description</span><span class="o">=</span><span class="s">"Example Program"</span><span class="p">)</span>

<span class="c1"># Required arguments
</span><span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="n">action</span><span class="o">=</span><span class="s">'store'</span><span class="p">,</span>
    <span class="n">dest</span><span class="o">=</span><span class="s">'username'</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"session username"</span><span class="p">)</span>
<span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="n">action</span><span class="o">=</span><span class="s">'store'</span><span class="p">,</span>
    <span class="n">dest</span><span class="o">=</span><span class="s">'password'</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s">"session password"</span><span class="p">)</span>

<span class="c1"># Optional arguments with default values
</span><span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">"-H"</span><span class="p">,</span> <span class="s">"--host"</span><span class="p">,</span> <span class="n">action</span><span class="o">=</span><span class="s">'store'</span><span class="p">,</span>
    <span class="n">dest</span><span class="o">=</span><span class="s">'host'</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="s">"localhost"</span><span class="p">,</span>
    <span class="n">help</span><span class="o">=</span><span class="s">"connection host"</span><span class="p">)</span>
<span class="c1"># Allow only arguments of type int
</span><span class="n">parser</span><span class="p">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">"-P"</span><span class="p">,</span> <span class="s">"--port"</span><span class="p">,</span> <span class="n">action</span><span class="o">=</span><span class="s">'store'</span><span class="p">,</span>
    <span class="n">dest</span><span class="o">=</span><span class="s">'port'</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="mi">8080</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">int</span><span class="p">,</span>
    <span class="n">help</span><span class="o">=</span><span class="s">"connection port"</span><span class="p">)</span>
</code></pre></div></div>

<p>Now, you can parse the arguments with:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="p">.</span><span class="n">parse_args</span><span class="p">()</span>

<span class="n">host</span> <span class="o">=</span> <span class="n">args</span><span class="p">.</span><span class="n">host</span>
<span class="n">port</span> <span class="o">=</span> <span class="n">args</span><span class="p">.</span><span class="n">port</span>
<span class="n">username</span> <span class="o">=</span> <span class="n">args</span><span class="p">.</span><span class="n">username</span>
<span class="n">password</span> <span class="o">=</span> <span class="n">args</span><span class="p">.</span><span class="n">password</span>
</code></pre></div></div>

<p>If you save this program in <code class="language-plaintext highlighter-rouge">example.py</code> and type <code class="language-plaintext highlighter-rouge">python example.py -h</code>, you will receive the following help description:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>usage: example.py [-h] [-H HOST] [-P PORT] username password

Example Program

positional arguments:
  username              session username
  password              session password

optional arguments:
  -h, --help            show this help message and exit
  -H HOST, --host HOST  connection host
  -P PORT, --port PORT  connection port
</code></pre></div></div>
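
<p>To try a parser like this without a terminal, you can pass the arguments as a list to <code class="language-plaintext highlighter-rouge">parse_args</code>. The sketch below rebuilds an equivalent parser in a shorter form, naming the positional arguments directly instead of using <code class="language-plaintext highlighter-rouge">dest</code>. The argument values are made up for illustration.</p>

```python
import argparse

parser = argparse.ArgumentParser(description="Example Program")

# Positional (required) arguments
parser.add_argument('username', help="session username")
parser.add_argument('password', help="session password")

# Optional arguments with default values
parser.add_argument("-H", "--host", default="localhost",
                    help="connection host")
parser.add_argument("-P", "--port", default=8080, type=int,
                    help="connection port")

# Passing a list skips sys.argv, which is handy in notebooks and tests
args = parser.parse_args(['alice', 'secret', '--port', '9090'])

print(args.username, args.host, args.port)
```

<p>Because of <code class="language-plaintext highlighter-rouge">type=int</code>, the port is converted to an integer during parsing, while the host falls back to its default.</p>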

<p>Another alternative to <code class="language-plaintext highlighter-rouge">argparse</code> is <a href="https://github.com/tiangolo/typer">typer</a> which makes some of the parsing easier for complex CLI tools.</p>

<h1 id="conclusion">Conclusion</h1>

<p>Here you saw a few common and popular ways to load configuration and credentials in Python, but there are many more if those are not sufficient for your use case. You can always resort to XML if you really wish. If a way that you find particularly useful is missing, feel free to add it in the comments below.</p>

<h1 id="resources">Resources</h1>

<ul>
  <li>2014 - <a href="https://martin-thoma.com/configuration-files-in-python/">Configuration files in Python</a></li>
  <li><a href="https://www.digitalocean.com/community/tutorials/how-to-read-and-set-environmental-and-shell-variables-on-a-linux-vps">How To Read and Set Environmental and Shell Variables on a Linux VPS</a></li>
  <li>GitHub - <a href="https://github.com/theskumar/python-dotenv">theskumar/python-dotenv</a></li>
  <li>GitHub - <a href="https://github.com/omry/omegaconf">omry/omegaconf</a></li>
  <li>GitHub - <a href="https://github.com/tiangolo/typer">tiangolo/typer</a></li>
</ul>]]></content><author><name>Nikolai Janakiev</name></author><category term="blog" /><category term="Python" /><summary type="html"><![CDATA[When writing programs, there is often a large set of configuration and credentials that should not be hard-coded in the program. This also makes the customization of the program much easier and more generally applicable. There are various ways to handle configuration and credentials and you will see here a few of the popular and common ways to do that with Python.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://janakiev.com/assets/python-credentials-and-configuration_files/K%C3%B6ln,_Hohenzollernbr%C3%BCcke_--_2014_--_1879.jpg" /><media:content medium="image" url="https://janakiev.com/assets/python-credentials-and-configuration_files/K%C3%B6ln,_Hohenzollernbr%C3%BCcke_--_2014_--_1879.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">tqdm Cheat Sheet</title><link href="https://janakiev.com/blog/tqdm-cheat-sheet/" rel="alternate" type="text/html" title="tqdm Cheat Sheet" /><published>2021-12-20T00:00:00-06:00</published><updated>2021-12-20T00:00:00-06:00</updated><id>https://janakiev.com/blog/tqdm-cheat-sheet</id><content type="html" xml:base="https://janakiev.com/blog/tqdm-cheat-sheet/"><![CDATA[<p>tqdm is a fast, user-friendly and extensible progress bar for Python and shell programs. Here you’ll find a collection of useful commands for quick reference.</p>

<h1 id="installation">Installation</h1>

<p>Install <a href="https://tqdm.github.io/">tqdm</a> with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># With pip</span>
pip <span class="nb">install </span>tqdm

<span class="c"># With anaconda</span>
conda <span class="nb">install</span> <span class="nt">-c</span> conda-forge tqdm
</code></pre></div></div>

<p>To install tqdm for JupyterLab, you need to have <a href="https://ipywidgets.readthedocs.io/en/latest/index.html">ipywidgets</a> installed. You can install it with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># With pip</span>
pip <span class="nb">install </span>ipywidgets

<span class="c"># With anaconda</span>
conda <span class="nb">install</span> <span class="nt">-c</span> conda-forge ipywidgets
</code></pre></div></div>

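<p>Enable ipywidgets for the classic Jupyter Notebook with the following command (recent JupyterLab versions typically pick up the widgets without this step):</p>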

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyter nbextension <span class="nb">enable</span> <span class="nt">--py</span> widgetsnbextension
</code></pre></div></div>

<h1 id="cheat-sheet">Cheat Sheet</h1>

<p>To import tqdm so that it works in both notebooks and shell programs, use:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">tqdm.auto</span> <span class="kn">import</span> <span class="n">tqdm</span>
</code></pre></div></div>

<p>Iterate over a range with:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">)):</span>
    <span class="c1"># do something
</span></code></pre></div></div>

<p>Add a description to the progress bar with:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span> <span class="n">desc</span><span class="o">=</span><span class="s">"First loop"</span><span class="p">):</span>
    <span class="c1"># do something
</span></code></pre></div></div>

<p>Iterate over a Pandas table with:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">iterrows</span><span class="p">(),</span> <span class="n">total</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)):</span>
    <span class="c1"># do something with that row
</span></code></pre></div></div>
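
<p>The <code class="language-plaintext highlighter-rouge">total=len(df)</code> argument matters because <code class="language-plaintext highlighter-rouge">df.iterrows()</code> is a generator and has no length that tqdm could read on its own. The same pattern applies to any generator. Here is a minimal sketch, where <code class="language-plaintext highlighter-rouge">squares</code> is a hypothetical generator used only for illustration:</p>

```python
from tqdm.auto import tqdm


def squares(n):
    # A generator has no len(), so tqdm cannot infer the total by itself
    for k in range(n):
        yield k * k


n = 50

# Passing total explicitly lets tqdm show a percentage and an ETA
result = sum(tqdm(squares(n), total=n, desc="Squares"))
print(result)
```

<p>Without <code class="language-plaintext highlighter-rouge">total</code>, tqdm still counts iterations, but it can only display the running count and rate.</p>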

<p>Update the description of the progress bar while iterating:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pbar</span> <span class="o">=</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">pbar</span><span class="p">:</span>
    <span class="n">pbar</span><span class="p">.</span><span class="n">set_description</span><span class="p">(</span><span class="sa">f</span><span class="s">"Element </span><span class="si">{</span><span class="n">i</span><span class="si">:</span><span class="mi">03</span><span class="n">d</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="c1"># do something
</span></code></pre></div></div>

<p>Show progress for nested loops:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">)):</span>
    <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span> <span class="n">leave</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="c1"># do something
</span></code></pre></div></div>

<p>The option <code class="language-plaintext highlighter-rouge">leave=False</code> discards nested bars upon completion.</p>

<h1 id="resources">Resources</h1>

<ul>
  <li><a href="https://tqdm.github.io/">tqdm.github.io</a></li>
  <li>Github - <a href="https://github.com/tqdm/tqdm">tqdm/tqdm</a></li>
</ul>]]></content><author><name>Nikolai Janakiev</name></author><category term="blog" /><category term="Python" /><category term="Jupyter" /><summary type="html"><![CDATA[tqdm is a fast, user-friendly and extensible progress bar for Python and shell programs. Here you’ll find a collection of useful commands for quick reference.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://janakiev.com/assets/tqdm-cheat-sheet_files/tqdm_title_image.jpg" /><media:content medium="image" url="https://janakiev.com/assets/tqdm-cheat-sheet_files/tqdm_title_image.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Remove Jupyter Notebook Output from Terminal and when using Git</title><link href="https://janakiev.com/blog/jupyter-git-remove-output/" rel="alternate" type="text/html" title="Remove Jupyter Notebook Output from Terminal and when using Git" /><published>2021-11-06T00:00:00-05:00</published><updated>2021-11-06T00:00:00-05:00</updated><id>https://janakiev.com/blog/jupyter-git-remove-output</id><content type="html" xml:base="https://janakiev.com/blog/jupyter-git-remove-output/"><![CDATA[<p>Oftentimes you want to delete the output of a Jupyter notebook before committing it to a repository, but in most cases you still want to keep the notebook output for yourself. In this short guide you will see how to delete the notebook output automatically when committing notebooks to a repository while keeping the outputs in your local copy.</p>

<h1 id="removing-the-notebook-output-in-the-command-line">Removing the Notebook Output in the Command-line</h1>

<p>The first tool for this job is <a href="https://nbconvert.readthedocs.io/en/latest/">nbconvert</a>, a command-line tool for converting and processing Jupyter notebooks. First, check your installed version of <code class="language-plaintext highlighter-rouge">nbconvert</code> by typing:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyter nbconvert <span class="nt">--version</span>
</code></pre></div></div>

<p>In order to delete the output, you can type the following command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyter nbconvert <span class="se">\</span>
  <span class="nt">--clear-output</span> <span class="se">\</span>
  <span class="nt">--to</span> notebook <span class="se">\</span>
  <span class="nt">--output</span><span class="o">=</span>new_notebook <span class="se">\</span>
  notebook.ipynb
</code></pre></div></div>

<p>To remove the output in place, you can type:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyter nbconvert <span class="se">\</span>
  <span class="nt">--clear-output</span> <span class="se">\</span>
  <span class="nt">--inplace</span> <span class="se">\</span>
  notebook.ipynb
</code></pre></div></div>

<p>If you have <code class="language-plaintext highlighter-rouge">nbconvert</code> below version <code class="language-plaintext highlighter-rouge">6.0</code>, change the command to:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyter nbconvert <span class="se">\</span>
  <span class="nt">--ClearOutputPreprocessor</span>.enabled<span class="o">=</span>True <span class="se">\</span>
  <span class="nt">--to</span> notebook <span class="se">\</span>
  <span class="nt">--output</span><span class="o">=</span>new_notebook <span class="se">\</span>
  notebook.ipynb
</code></pre></div></div>

<p>It is also possible to remove notebook outputs in batch:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>find <span class="k">*</span>.ipynb <span class="se">\</span>
  <span class="nt">-exec</span> jupyter nbconvert <span class="nt">--clear-output</span> <span class="nt">--inplace</span> <span class="o">{}</span> <span class="se">\;</span>
</code></pre></div></div>

<p>Or, by using a simple loop:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for </span>f <span class="k">in</span> <span class="k">*</span>.ipynb<span class="p">;</span> <span class="k">do
  </span>jupyter nbconvert <span class="nt">--clear-output</span> <span class="nt">--inplace</span> <span class="s2">"</span><span class="nv">$f</span><span class="s2">"</span>
<span class="k">done</span>
</code></pre></div></div>
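
<p>To get a feel for what clearing the output means at the file level, here is a rough sketch using only the standard library. It assumes the nbformat 4 JSON layout, and <code class="language-plaintext highlighter-rouge">clear_outputs</code> is a hypothetical helper written for illustration. It is not how <code class="language-plaintext highlighter-rouge">nbconvert</code> is implemented, and <code class="language-plaintext highlighter-rouge">nbconvert</code> may handle additional details such as cell metadata:</p>

```python
import json


def clear_outputs(notebook):
    """Empty the outputs of all code cells (modifies the dict in place)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


# Hypothetical minimal notebook, assumed to follow the nbformat 4 layout
nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["1 + 1"],
            "execution_count": 3,
            "outputs": [
                {
                    "output_type": "execute_result",
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                    "execution_count": 3,
                }
            ],
        },
    ],
}

cleaned = clear_outputs(nb)
print(json.dumps(cleaned["cells"][1], indent=1))
```

<p>For real notebooks, the <code class="language-plaintext highlighter-rouge">nbconvert</code> commands above remain the safer choice, since they validate the notebook format for you.</p>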

<h1 id="removing-the-notebook-output-automatically-when-committing">Removing the Notebook Output automatically when Committing</h1>

<p>Register a new filter by appending the following lines to the <code class="language-plaintext highlighter-rouge">.git/config</code> in your chosen repository:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[filter "remove-notebook-output"]
    clean = "jupyter nbconvert --clear-output --to=notebook --stdin --stdout --log-level=ERROR"
</code></pre></div></div>

<p>If you want the filter to be available globally, append it to <code class="language-plaintext highlighter-rouge">~/.gitconfig</code> instead. Also, remember to check the <code class="language-plaintext highlighter-rouge">nbconvert</code> version and change the command as shown previously.</p>

<p>Now, append the following lines to the <a href="https://git-scm.com/docs/gitattributes">.gitattributes</a> file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*.ipynb filter=remove-notebook-output
</code></pre></div></div>

<p>If you want to apply those filters only to a specific folder you can instead append:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>folder/*.ipynb filter=remove-notebook-output
</code></pre></div></div>

<p>That’s all! Now you should be able to commit jupyter notebooks to git repositories without output if you followed all the steps.</p>

<h1 id="resources">Resources</h1>

<p>For more resources, have a look at:</p>

<ul>
  <li><a href="https://nbconvert.readthedocs.io/en/latest/">nbconvert Documentation</a></li>
  <li><a href="https://git-scm.com/docs/gitattributes">gitattributes Documentation</a></li>
  <li><a href="https://git-scm.com/docs/git-config">git-config Documentation</a></li>
  <li><a href="https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes">Customizing Git - Git Attributes</a></li>
</ul>]]></content><author><name>Nikolai Janakiev</name></author><category term="blog" /><category term="Jupyter" /><category term="Git" /><summary type="html"><![CDATA[Oftentimes you want to delete the output of a Jupyter notebook before committing it to a repository, but in most cases you still want to keep the notebook output for yourself. In this short guide you will see how to delete the notebook output automatically when committing notebooks to a repository while keeping the outputs in your local copy.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://janakiev.com/assets/jupyter-git-remove-output_files/jupyter_git.jpg" /><media:content medium="image" url="https://janakiev.com/assets/jupyter-git-remove-output_files/jupyter_git.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Reading and Writing Pandas DataFrames in Chunks</title><link href="https://janakiev.com/blog/python-pandas-chunks/" rel="alternate" type="text/html" title="Reading and Writing Pandas DataFrames in Chunks" /><published>2021-04-03T00:00:00-05:00</published><updated>2021-04-03T00:00:00-05:00</updated><id>https://janakiev.com/blog/python-pandas-chunks</id><content type="html" xml:base="https://janakiev.com/blog/python-pandas-chunks/"><![CDATA[<p>This is a quick example of how to chunk a large data set with <a href="https://pandas.pydata.org/">Pandas</a> that otherwise won’t fit into memory. In this short example you will see how to apply this to CSV files with <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html">pandas.read_csv</a>.</p>

<h1 id="create-pandas-iterator">Create Pandas Iterator</h1>

<p>First, create a <code class="language-plaintext highlighter-rouge">TextFileReader</code> object for iteration. This won’t load the data until you start iterating over it. Here it splits the data into DataFrames with 10,000 rows each:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="n">df_iterator</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
    <span class="s">'input_data.csv.gz'</span><span class="p">,</span> 
    <span class="n">chunksize</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span>
    <span class="n">compression</span><span class="o">=</span><span class="s">'gzip'</span><span class="p">)</span>
</code></pre></div></div>

<h1 id="iterate-over-the-file-in-batches">Iterate over the File in Batches</h1>

<p>Now you can use the iterator to load the chunked DataFrames one at a time. Here, <code class="language-plaintext highlighter-rouge">do_something(df_chunk)</code> stands for whatever operation you need to apply to each chunk:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">df_chunk</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">df_iterator</span><span class="p">):</span>

    <span class="n">do_something</span><span class="p">(</span><span class="n">df_chunk</span><span class="p">)</span>
    
    <span class="c1"># Set writing mode to append after first chunk
</span>    <span class="n">mode</span> <span class="o">=</span> <span class="s">'w'</span> <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="s">'a'</span>
    
    <span class="c1"># Add header if it is the first chunk
</span>    <span class="n">header</span> <span class="o">=</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span>

    <span class="n">df_chunk</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span>
        <span class="s">"dst_data.csv.gz"</span><span class="p">,</span>
        <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>  <span class="c1"># Skip index column
</span>        <span class="n">header</span><span class="o">=</span><span class="n">header</span><span class="p">,</span> 
        <span class="n">mode</span><span class="o">=</span><span class="n">mode</span><span class="p">,</span>
        <span class="n">compression</span><span class="o">=</span><span class="s">'gzip'</span><span class="p">)</span>
</code></pre></div></div>

<p>By default, Pandas infers the compression from the filename. Other supported compression formats include <code class="language-plaintext highlighter-rouge">bz2</code>, <code class="language-plaintext highlighter-rouge">zip</code>, and <code class="language-plaintext highlighter-rouge">xz</code>.</p>
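
<p>Chunking is also useful when you only need an aggregate and never have to hold the full table in memory. The following is a small sketch of that idea. It uses a hypothetical in-memory CSV with a single <code class="language-plaintext highlighter-rouge">value</code> column as a stand-in for a large file, and a tiny <code class="language-plaintext highlighter-rouge">chunksize</code> so that the chunk boundaries are easy to follow:</p>

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
n_chunks = 0

# Each iteration yields a DataFrame with at most 4 rows
for df_chunk in pd.read_csv(csv_data, chunksize=4):
    total += int(df_chunk["value"].sum())
    n_chunks += 1

print(total, n_chunks)
```

<p>Only the running total is kept between iterations, so memory usage is bounded by the size of a single chunk.</p>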

<h1 id="resources">Resources</h1>

<p>For more information on chunking, have a look at the documentation on <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking">chunking</a>. Another useful tool, when working with data that won’t fit into memory, is <a href="https://dask.org/">Dask</a>. Dask can parallelize the workload on multiple cores or even multiple machines, although it is not a drop-in replacement for Pandas and is better viewed as a wrapper for Pandas.</p>]]></content><author><name>Nikolai Janakiev</name></author><category term="blog" /><category term="Python" /><category term="Pandas" /><summary type="html"><![CDATA[This is a quick example of how to chunk a large data set with Pandas that otherwise won’t fit into memory. In this short example you will see how to apply this to CSV files with pandas.read_csv.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://janakiev.com/assets/python-pandas-chunks_files/Washington_National_Records_Center_Stack_Area_with_Employee_Servicing_Records.jpg" /><media:content medium="image" url="https://janakiev.com/assets/python-pandas-chunks_files/Washington_National_Records_Center_Stack_Area_with_Employee_Servicing_Records.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>