njanakiev

Downloading Images with Python, PIL, Requests, and Urllib

2023-04-12T00:00:00-05:00

Python is a great language for automating tasks, and downloading images is one of those tasks that can be easily automated. In this article, you’ll see how to use the Python Imaging Library (PIL) or rather Pillow, Requests, and Urllib to download images from the web.

Download an Image with Requests

To get an image from a URL, you can use the requests package with the following lines of code. (Note: this was tested using requests 2.27.1.) In this case, the script will download this picture of Lake Tekapo, New Zealand by Tobias Keller on Unsplash:

import requests

filepath = "assets/unsplash_image.jpg"
url = "https://images.unsplash.com/photo-1465056836041-7f43ac27dcb5?w=720"

r = requests.get(url)
if r.status_code == 200:
    with open(filepath, 'wb') as f:
        f.write(r.content)

Download an Image as PIL Image

In some cases, it is not necessary or possible to save the image to disk, and it needs to be processed right away. Passing stream=True tells requests to defer downloading the response body until it is accessed; reading r.content then loads the whole image into memory. For more information, have a look at the documentation. The downloaded bytes then need to be wrapped in an io.BytesIO binary stream to be consumed by Pillow:

import io
import requests
from PIL import Image

filepath = "image.jpg"
url = "https://images.unsplash.com/photo-1465056836041-7f43ac27dcb5?w=720"

r = requests.get(url, stream=True)
if r.status_code == 200:
    img = Image.open(io.BytesIO(r.content))
    
    # Do something with image
    
    # Save image to file
    img.save(filepath)

Download an Image using Urllib

Sometimes you cannot install the requests library and need to rely on the Python standard library instead. In this case, you can use urllib.request.urlretrieve from the urllib package:

import urllib.request

filepath = "image.jpg"
url = "https://images.unsplash.com/photo-1465056836041-7f43ac27dcb5?w=720"

urllib.request.urlretrieve(url, filepath)
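
The Python documentation lists urlretrieve under its legacy interface, so here is a minimal sketch of the same download using urllib.request.urlopen and shutil.copyfileobj instead. The helper name download_file and the timeout value are assumptions made for illustration and are not part of the original script:

```python
import shutil
import urllib.request


def download_file(url, filepath, timeout=10):
    """Stream the resource at `url` into `filepath` (hypothetical helper)."""
    # urlopen raises urllib.error.URLError (or its subclass HTTPError) on failure
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(filepath, "wb") as f:
            # Copy in chunks instead of reading the whole body into memory
            shutil.copyfileobj(response, f)
    return filepath
```

It can be called in the same way as before, for example download_file(url, "image.jpg").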

Resources

Requests
Pillow - Pillow is the friendly PIL fork
urllib - Open arbitrary resources by URL

Virtual Environments in Python with venv

2023-01-20T00:00:00-06:00

Python’s built-in venv module makes it easy to create virtual environments for your Python projects. Virtual environments are isolated spaces where your Python packages and their dependencies live. This means that each project can have its own dependencies, regardless of what other projects are doing.

Create a Virtual Environment

Create an environment with:

python -m venv ./venv
python -m venv /path/to/venv

Activate an environment with:

source venv/bin/activate
source /path/to/venv/bin/activate

Make sure to test if python and pip are indeed in the environment by typing:

which python
# /absolute/path/to/venv/bin/python
which pip
# /absolute/path/to/venv/bin/pip
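
The same check can also be sketched from within Python. The function name in_virtualenv is a hypothetical helper; it relies on the fact that inside an environment created with venv, sys.prefix differs from sys.base_prefix:

```python
import sys


def in_virtualenv():
    """Return True if the interpreter runs inside a venv environment."""
    # Inside a venv, sys.prefix points to the environment while
    # sys.base_prefix still points to the base Python installation
    return sys.prefix != sys.base_prefix


print(sys.executable)   # path of the running interpreter
print(in_virtualenv())
```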

Install packages from a requirements.txt with:

pip install -r requirements.txt

Deactivate an environment with:

deactivate

Resources

venv — Creation of virtual environments

Object Serialization with JSON and compressed JSON in Python

2022-11-22T00:00:00-06:00

JSON is a popular data format for storing data in a structured way. Python has a built-in module called json that can be used to work with JSON data. In this article, we will see how to use the json module to serialize and deserialize data in Python.

Reading and Writing JSON in Python

Python ships with a JSON encoder and decoder out of the box. To store and load JSON, you can use the dump() and load() functions respectively. Since they have the same names as their counterparts in the pickle module, they are easy to remember.

import json

# Writing a JSON file
with open('data.json', 'w') as f:
    json.dump(data, f)

# Reading a JSON file
with open('data.json', 'r') as f:
    data = json.load(f)

You can additionally encode and decode JSON to and from a string, which is done with the dumps() and loads() functions respectively. Encoding can be done like this:

json_string = json.dumps(data)

And to decode JSON you can type:

data = json.loads(json_string)
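
As a small round-trip sketch with hypothetical example data: the decoded object equals the original as long as it only contains JSON-native types. Tuples come back as lists and non-string dictionary keys come back as strings.

```python
import json

# Hypothetical example data; dicts, lists, strings, numbers,
# booleans, and None can all be serialized
data = {"name": "example", "values": [1, 2.5, None], "active": True}

# indent and sort_keys only affect the formatting of the string
json_string = json.dumps(data, indent=2, sort_keys=True)
restored = json.loads(json_string)

print(json_string)
print(restored == data)
```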

This comes in handy when you work with REST APIs, since many APIs use JSON as input and/or output.

Reading and Writing GZIP Compressed JSON in Python

It is also possible to compress the JSON in order to save storage space by typing the following:

import gzip
import json

with gzip.open("data.json.gz", 'wt', encoding='utf-8') as f:
    json.dump(data, f)

To load the compressed JSON, type:

with gzip.open("data.json.gz", 'rt', encoding='utf-8') as f:
    data = json.load(f)

This is especially useful when caching large amounts of JSON outputs.
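
To get a feeling for the savings, here is a hedged sketch that writes the same hypothetical, highly repetitive data once as plain JSON and once as GZIP-compressed JSON into a temporary folder and compares the file sizes. The actual ratio depends entirely on your data:

```python
import gzip
import json
import os
import tempfile

# Hypothetical, highly repetitive example data
data = [{"id": i, "label": "measurement"} for i in range(1000)]

with tempfile.TemporaryDirectory() as tmpdir:
    plain_path = os.path.join(tmpdir, "data.json")
    gzip_path = os.path.join(tmpdir, "data.json.gz")

    with open(plain_path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    with gzip.open(gzip_path, "wt", encoding="utf-8") as f:
        json.dump(data, f)

    # Read the compressed file back to verify the round trip
    with gzip.open(gzip_path, "rt", encoding="utf-8") as f:
        restored = json.load(f)

    plain_size = os.path.getsize(plain_path)
    gzip_size = os.path.getsize(gzip_path)

print(restored == data)
print(plain_size, gzip_size)
```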

How to Save Temporary Changes in Git Using Git Stash

2022-11-21T00:00:00-06:00

Git stashing is a way to temporarily save changes that you do not want to commit yet. This is useful if you need to switch branches, but do not want to commit your changes first.

Stash your Changes in Git

To stash your changes, type:

git stash
git stash push -m "description of stash"

Once you are done, you can reapply your stash with:

git stash apply
git stash apply 2  # stash with index 2 in git stash list (same as stash@{2})
git stash apply stash@{2}

Listing your Stash

To show your stored stashes, type:

git stash list

Cleaning up the Stash

Reapply a stash and remove it from the stash list with:

git stash pop 2

If you want to remove a stash, type:

git stash drop 2

Finally, to remove all items from stash, type:

git stash clear

Resources

7.3 Git Tools - Stashing and Cleaning

Running Prometheus with Systemd

2022-11-10T00:00:00-06:00

Prometheus is a powerful open-source monitoring system that can be used to collect and track a variety of metrics for your applications. In this guide, we will cover how to get Prometheus up and running with systemd on an Ubuntu or Debian server.

Download and Install Prometheus

Create a dedicated prometheus user with:

sudo useradd -M -U prometheus

Select a version for your system from here and download it:

wget https://github.com/prometheus/prometheus/releases/download/v2.40.0-rc.0/prometheus-2.40.0-rc.0.linux-amd64.tar.gz
tar -xzvf prometheus-2.40.0-rc.0.linux-amd64.tar.gz
sudo mv prometheus-2.40.0-rc.0.linux-amd64 /opt/prometheus

Change the folder ownership to the prometheus user with:

sudo chown prometheus:prometheus -R /opt/prometheus

Create Systemd Unit File

Create a systemd service in /etc/systemd/system/prometheus.service with the following contents:

[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Restart=on-failure
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --storage.tsdb.retention.time=30d

[Install]
WantedBy=multi-user.target

Start systemd service of Prometheus with:

sudo systemctl daemon-reload
sudo systemctl start prometheus.service

Enable the service to start at system start-up:

sudo systemctl enable prometheus.service

Check the status of the service with:

sudo systemctl status prometheus.service

To view the logs of Prometheus for troubleshooting, type:

sudo journalctl -u prometheus.service -f

Reading and Writing Parquet Files on S3 with Pandas and PyArrow

2022-04-10T00:00:00-05:00

When working with large amounts of data, a common approach is to store the data in S3 buckets. Instead of dumping the data as CSV files or plain text files, a good option is to use Apache Parquet. In this short guide you’ll see how to read and write Parquet files on S3 using Python, Pandas and PyArrow.

This guide was tested using Contabo object storage, MinIO, and Linode Object Storage. You should be able to use it on most S3-compatible providers and software.

Prepare Connection

Prepare the S3 environment variables in a file called .env in the project folder with the following contents:

S3_REGION=eu-central-1
S3_ENDPOINT=https://eu-central-1.domain.com
S3_ACCESS_KEY=XXXX
S3_SECRET_KEY=XXXX

Prepare some S3 bucket that you want to use. In this case we’ll be using s3://s3-example bucket to store and access our data. Next, prepare some random example data with:

import numpy as np
import pandas as pd

df = pd.DataFrame({'data': np.random.random((1000,))})
df.to_parquet("data/data.parquet")

Load the environment variables in your script with python-dotenv:

from dotenv import load_dotenv
load_dotenv()

Now, prepare the S3 connection with:

import os
import s3fs

fs = s3fs.S3FileSystem(
    anon=False,
    use_ssl=True,
    client_kwargs={
        "region_name": os.environ['S3_REGION'],
        "endpoint_url": os.environ['S3_ENDPOINT'],
        "aws_access_key_id": os.environ['S3_ACCESS_KEY'],
        "aws_secret_access_key": os.environ['S3_SECRET_KEY'],
        "verify": True,
    }
)

Write Pandas DataFrame to S3 as Parquet

Save the DataFrame to S3 using s3fs and Pandas:

with fs.open('s3-example/data.parquet', 'wb') as f:
    df.to_parquet(f)

Save the DataFrame to S3 using s3fs and PyArrow:

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import Table

s3_filepath = 's3-example/data.parquet'

pq.write_to_dataset(
    Table.from_pandas(df),
    s3_filepath,
    filesystem=fs,
    use_dictionary=True,
    compression="snappy",
    version="2.4",
)

You can also upload this file with s3cmd by typing:

s3cmd \
  --config ~/.s3cfg \
  put data/data.parquet s3://s3-example

Reading Parquet File from S3 as Pandas DataFrame

Now, let’s have a look at the Parquet file by using PyArrow:

s3_filepath = "s3-example/data.parquet"

pf = pq.ParquetDataset(
    s3_filepath,
    filesystem=fs)

Now, you can already explore the metadata with pf.metadata or the schema with pf.schema:

pf.metadata

pf.schema

required group field_id=-1 schema {
  optional double field_id=-1 data;
}

To read the data set into a Pandas DataFrame, call pf.read().to_pandas().

When using ParquetDataset, you can also use multiple paths. You can get those for example with:

s3_filepath = 's3://s3-example'
s3_filepaths = [path for path in fs.ls(s3_filepath)
                if path.endswith('.parquet')]
s3_filepaths

['s3-example/data.parquet', 's3-example/data.parquet']

Resources

s3fs.readthedocs.io - S3Fs Documentation
PyArrow - Apache Arrow Python bindings
Apache Parquet

Working with Credentials and Configurations in Python

2022-01-11T00:00:00-06:00

When writing programs, there is often a large set of configuration and credentials that should not be hard-coded in the program. This also makes the customization of the program much easier and more generally applicable. There are various ways to handle configuration and credentials and you will see here a few of the popular and common ways to do that with Python.

One important note right from the start: When using version control, always make sure not to commit credentials and configuration into the repository, as this could become a serious security issue. You can add those files to .gitignore to avoid pushing them to version control. Sometimes it is useful to have general configuration in version control as well, but that depends on your use case.

Python Configuration Files

The first and probably most straightforward way is to have a config.py file somewhere in the project folder that you add to your .gitignore file. A similar pattern can be found in Flask, where you can also structure the configuration based on different contexts like development, production, and testing. The config.py would look something like:

host = 'localhost'
port = 8080
username = 'user'
password = 'password'

You would simply import it and use it like this:

import config

host = config.host
port = config.port
username = config.username
password = config.password

Environment Variables

You can access environment variables with os.environ:

import os

os.environ['SHELL']

This will throw a KeyError if the variable does not exist. You can check if the variable exists with "SHELL" in os.environ. Sometimes it's more elegant to get None or a default value instead of getting an error when a variable does not exist. This can be done like this:

# return None if VAR does not exist
os.environ.get('VAR')

# return "default" if VAR does not exist
os.environ.get('VAR', "default")

You can combine this with the previous way to have a config.py with the following contents:

import os

host = os.environ.get('APP_HOST', 'localhost')
port = int(os.environ.get('APP_PORT', 8080))  # environment values are strings
username = os.environ.get('APP_USERNAME')
password = os.environ.get('APP_PASSWORD')
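
Values read from os.environ are always strings, so numeric settings such as a port need an explicit conversion. Here is a minimal sketch of a helper that does this and fails with a readable message on invalid input. The name get_env_int is hypothetical and not part of the standard library:

```python
import os


def get_env_int(name, default):
    """Read an integer from the environment (hypothetical helper)."""
    value = os.environ.get(name)
    if value is None:
        # Variable is not set, fall back to the default
        return default
    try:
        return int(value)
    except ValueError:
        raise ValueError(
            f"{name} must be an integer, got {value!r}") from None


port = get_env_int("APP_PORT", 8080)
```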

Python Dotenv

Oftentimes you want to have the environment variables in a dedicated .env file outside of version control. One way is to load the file before with:

source .env

This is sometimes error-prone or not possible depending on the setup, so it's sometimes better to load the file dynamically with python-dotenv. You can install the package with:

pip install -U python-dotenv

Load the .env file in your program with:

from dotenv import load_dotenv

load_dotenv()

If your environment file is located somewhere else, you can load it with:

load_dotenv("/path/to/.env")

Now, you can use the environment file as you saw before.

JavaScript Object Notation (JSON)

JSON is another handy file format to store your configuration, as Python supports it natively through the built-in json module. If you are working with frontend code, you are already familiar with its usefulness and ubiquity.

You can prepare your configurations as a JSON (JavaScript Object Notation) in a config.json with the following example configuration:

{
    "host": "localhost",
    "port": 8080,
    "credentials": {
        "username": "user",
        "password": "password"
    }
}

You can load this configuration then with the built-in json package:

import json

with open('config.json', 'r') as f:
    config = json.load(f)

This returns the data as (nested) dictionaries and lists which you can access the way you are used to (config['host'] or config.get('host')).

Yet Another Markup Language (YAML)

Another popular way to store configurations and credentials is the (in)famous YAML format. It is much simpler to use but has some minor quirks when using more complicated formatting. Here is the previous configuration as a YAML file:

host: localhost
port: 8080
credentials:
  username: user
  password: password

There are various packages that you can use. Most commonly PyYAML. You can install it with:

pip install -U PyYAML

To load the configuration, you can type:

import yaml

with open("config.yml", 'r') as f:
    config = yaml.load(f, Loader=yaml.FullLoader)

The config can be used as previously seen with the JSON example.

Note that you need to pass a Loader in PyYAML 5.1+ because of a vulnerability. Read more about it here. Another common alternative to PyYAML is omegaconf, which includes many other useful parsers for various different file types.

Using a Configuration Parser

The Python standard library includes the configparser module which can work with configuration files similar to the Microsoft Windows INI files. You can prepare the configuration in config.ini with the following contents:

[DEFAULT]
host = localhost
port = 8080

[credentials]
username = user
password = password

The configuration is separated into sections like [credentials] and within those sections the configuration is stored as key-value pairs like host = localhost.

You can load and use the previous configuration as follows:

import configparser

config = configparser.ConfigParser()
config.read("config.ini")

host = config['DEFAULT']['host']
port = config['DEFAULT']['port']
username = config['credentials']['username']
password = config['credentials']['password']

As you can see, to access the values you have to type config[section][element]. To get all sections as a list, you can type config.sections(). For more information, have a look at the documentation.
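
Two details are easy to miss here: all values are returned as strings, and keys from the [DEFAULT] section are visible in every other section. The following sketch illustrates both, with the configuration from above inlined as a string to keep the example self-contained. The timeout key is a hypothetical setting that is intentionally missing from the file:

```python
import configparser

# Same contents as the config.ini above, inlined for the sketch
CONFIG_TEXT = """
[DEFAULT]
host = localhost
port = 8080

[credentials]
username = user
password = password
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# Values are stored as strings; getint() converts them
port = config["DEFAULT"].getint("port")

# Keys from [DEFAULT] are also available in every other section
host = config["credentials"]["host"]

# fallback is returned when a key is missing
timeout = config["credentials"].getint("timeout", fallback=30)

# sections() does not include the special DEFAULT section
print(config.sections())
```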

Parsing Command-line Options

It is also possible to get credentials and configuration through arguments by using the built-in argparse module.

You can initialize the argument parser with:

import argparse

parser = argparse.ArgumentParser(
    description="Example Program")

# Required arguments
parser.add_argument(action='store',
    dest='username', help="session username")
parser.add_argument(action='store',
    dest='password', help="session password")

# Optional arguments with default values
parser.add_argument("-H", "--host", action='store',
    dest='host', default="localhost",
    help="connection host")
# Allow only arguments of type int
parser.add_argument("-P", "--port", action='store',
    dest='port', default=8080, type=int,
    help="connection port")

Now, you can parse the arguments with:

args = parser.parse_args()

host = args.host
port = args.port
username = args.username
password = args.password

If you save this program in example.py and type python example.py -h, you will receive the following help description:

usage: example.py [-h] [-H HOST] [-P PORT] username password

Example Program

positional arguments:
  username              session username
  password              session password

optional arguments:
  -h, --help            show this help message and exit
  -H HOST, --host HOST  connection host
  -P PORT, --port PORT  connection port

Another alternative to argparse is typer which makes some of the parsing easier for complex CLI tools.
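
For testing or experimenting in a notebook, parse_args also accepts an explicit list of strings instead of reading sys.argv. The following sketch builds an equivalent parser in a function (build_parser is a hypothetical name) and parses a made-up argument list:

```python
import argparse


def build_parser():
    """Create a parser equivalent to the one above (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Example Program")
    # Positional (required) arguments
    parser.add_argument("username", help="session username")
    parser.add_argument("password", help="session password")
    # Optional arguments with default values
    parser.add_argument("-H", "--host", default="localhost",
                        help="connection host")
    parser.add_argument("-P", "--port", default=8080, type=int,
                        help="connection port")
    return parser


# Made-up argument list in place of sys.argv[1:]
args = build_parser().parse_args(["alice", "secret", "-P", "9000"])
print(args.host, args.port, args.username)
```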

Conclusion

Here you saw a few common and popular ways to load configuration and credentials in Python, but there are many more ways if those are not sufficient for your use case. You can always resort to XML if you really wish. If you are missing an approach that you find particularly useful, feel free to add it in the comments below.

Resources

2014 - Configuration files in Python
How To Read and Set Environmental and Shell Variables on a Linux VPS
Github - theskumar/python-dotenv
Github - omry/omegaconf
Github - tiangolo/typer

tqdm Cheat Sheet

2021-12-20T00:00:00-06:00

tqdm is a fast, user-friendly and extensible progress bar for Python and shell programs. Here you’ll find a collection of useful commands for quick reference.

Installation

Install tqdm with:

# With pip
pip install tqdm

# With anaconda
conda install -c conda-forge tqdm

To install tqdm for JupyterLab, you need to have ipywidgets installed. You can install it with:

# With pip
pip install ipywidgets

# With anaconda
conda install -c conda-forge ipywidgets

Enable ipywidgets for jupyter with:

jupyter nbextension enable --py widgetsnbextension

Cheat Sheet

To import tqdm to work both for notebooks and shell programs type:

from tqdm.auto import tqdm

Iterate over a range with:

for i in tqdm(range(100)):
    pass  # do something

Add description to the progress bar with:

for i in tqdm(range(100), desc="First loop"):
    pass  # do something

Iterate over a Pandas table with:

for idx, row in tqdm(df.iterrows(), total=len(df)):
    pass  # do something with that row

Add changing description to progress bar:

pbar = tqdm(range(100))
for i in pbar:
    pbar.set_description(f"Element {i:03d}")
    pass  # do something

Show progress for nested loops:

for i in tqdm(range(10)):
    for j in tqdm(range(100), leave=False):
        pass  # do something

The option leave=False discards nested bars upon completion.

Resources

tqdm.github.io
Github - tqdm/tqdm

Remove Jupyter Notebook Output from Terminal and when using Git

2021-11-06T00:00:00-05:00

Oftentimes you want to delete the output of a Jupyter notebook before committing it to a repository, but in most cases you still want to keep the notebook output for yourself. In this short guide you will see how to delete the notebook output automatically when committing notebooks to a repository while keeping the outputs locally.

Removing the Notebook Output in the Command-line

The first tool to do this job is the nbconvert command-line tool to work with jupyter notebooks. First, check your installed version of nbconvert by typing:

jupyter nbconvert --version

In order to delete the output, you can type the following command:

jupyter nbconvert \
  --clear-output \
  --to notebook \
  --output=new_notebook \
  notebook.ipynb

To remove the output in place, you can type:

jupyter nbconvert \
  --clear-output \
  --inplace \
  notebook.ipynb

If you have nbconvert below version 6.0, change the command to:

jupyter nbconvert \
  --ClearOutputPreprocessor.enabled=True \
  --to notebook \
  --output=new_notebook \
  notebook.ipynb

It is also possible to remove notebook outputs in batch:

find *.ipynb \
  -exec jupyter nbconvert --clear-output --inplace {} \;

Or, by using a simple loop:

for f in *.ipynb; do
  jupyter nbconvert --clear-output --inplace "$f"
done
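
As an alternative sketch that does not use nbconvert at all: a notebook file is plain JSON, so the outputs can also be cleared with the built-in json module. This assumes the nbformat 4 layout, where each entry in cells has a cell_type and code cells carry outputs and execution_count. Both function names are hypothetical, and unlike nbconvert this does not validate the notebook:

```python
import json


def clear_outputs(notebook):
    """Remove outputs and execution counts from a notebook dict in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_notebook_file(path):
    """Rewrite the notebook at `path` without outputs (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        notebook = json.load(f)
    clear_outputs(notebook)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1)
        f.write("\n")
```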

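Removing the Notebook Output automatically when Committing

To remove the output automatically on commit, add the following filter to the .git/config file of your repository: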

[filter "remove-notebook-output"]
    clean = "jupyter nbconvert --clear-output --to=notebook --stdin --stdout --log-level=ERROR"

If you want the filter to be available globally, append it to ~/.gitconfig instead. Also, remember to check the nbconvert version and change the command as shown previously.

Now, append the following lines to the .gitattributes file:

*.ipynb filter=remove-notebook-output

If you want to apply those filters only to a specific folder you can instead append:

folder/*.ipynb filter=remove-notebook-output

That’s all! Now you should be able to commit jupyter notebooks to git repositories without output if you followed all the steps.

Reading and Writing Pandas DataFrames in Chunks

2021-04-03T00:00:00-05:00

This is a quick example of how to chunk a large data set with Pandas that otherwise won’t fit into memory. In this short example you will see how to apply this to CSV files with pandas.read_csv.

Create Pandas Iterator

First, create a TextFileReader object for iteration. This won’t load the data until you start iterating over it. Here it chunks the data in DataFrames with 10000 rows each:

import pandas as pd

df_iterator = pd.read_csv(
    'input_data.csv.gz', 
    chunksize=10000,
    compression='gzip')

Iterate over the File in Batches

Now, you can use the iterator to load the chunked DataFrames iteratively. Here you have a function do_something(df_chunk), that is some operation that you need to have done on the table:

for i, df_chunk in enumerate(df_iterator):

    do_something(df_chunk)
    
    # Set writing mode to append after first chunk
    mode = 'w' if i == 0 else 'a'
    
    # Add header if it is the first chunk
    header = i == 0

    df_chunk.to_csv(
        "dst_data.csv.gz",
        index=False,  # Skip index column
        header=header, 
        mode=mode,
        compression='gzip')

By default, Pandas infers the compression from the filename. Other supported compression formats include bz2, zip, and xz.
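
When the chunks only need to be aggregated and not written back, the same iterator pattern can be sketched as follows. The in-memory CSV and the column name value are made up for illustration and stand in for a file that is too large to load at once:

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
n_chunks = 0
# chunksize=4 yields DataFrames with at most 4 rows each
for df_chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(df_chunk["value"].sum())
    n_chunks += 1

print(total, n_chunks)
```

Only one chunk is held in memory at a time, while total accumulates the result across all chunks.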

Resources

For more information on chunking, have a look at the documentation on chunking. Another useful tool when working with data that won’t fit into memory is Dask. Dask can parallelize the workload on multiple cores or even multiple machines, although it is not a drop-in replacement for Pandas and is better viewed as a wrapper around Pandas.