<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://karay.me/feed.xml" rel="self" type="application/atom+xml" /><link href="https://karay.me/" rel="alternate" type="text/html" /><updated>2026-03-02T01:48:56+00:00</updated><id>https://karay.me/feed.xml</id><title type="html">Aray Karjauv</title><subtitle>Aray&apos;s personal page.</subtitle><entry><title type="html">Hands-on Guide to Multi-Language Speech Recognition and Speaker diarization</title><link href="https://karay.me/2023/03/31/speech-recognition-and-diarisation.html" rel="alternate" type="text/html" title="Hands-on Guide to Multi-Language Speech Recognition and Speaker diarization" /><published>2023-03-31T15:45:00+00:00</published><updated>2023-03-31T15:45:00+00:00</updated><id>https://karay.me/2023/03/31/speech-recognition-and-diarisation</id><content type="html" xml:base="https://karay.me/2023/03/31/speech-recognition-and-diarisation.html"><![CDATA[<p>Multi-Language speech recognition and speaker diarization are two important tasks in the field of audio processing. Speech recognition can be defined as the process of converting spoken language into written text, while speaker diarization involves segmenting an audio recording and assigning each segment to a particular speaker. These techniques are used in a variety of applications, including podcasts and conference transcription.</p>

<p>In this blog post, you will learn how to build a pipeline for multi-language speech recognition and speaker diarization using existing libraries.</p>

<!--more-->

<h2 id="introduction">Introduction</h2>

<p>Podcasts are a great example of how this technology can be useful. They have grown steadily in popularity, which has led to an increasing demand for tools that can automatically transcribe and segment episodes, saving a significant amount of manual work. Many podcasts feature multiple speakers and are often distributed in audio format only. With speaker diarization, podcast producers can automatically identify each speaker and generate subtitles for each one. This not only makes the podcast more accessible to hard-of-hearing listeners but also makes it easier to search for specific topics within an episode or to create chapters for YouTube.</p>

<p>Before diving into the Jupyter notebook, let me briefly introduce three libraries that form the backbone of this pipeline.</p>

<p><a href="https://github.com/facebookresearch/denoiser"><strong>Denoiser</strong></a> is a PyTorch implementation of Meta’s paper <a href="https://arxiv.org/abs/2006.12847">Real Time Speech Enhancement in the Waveform Domain</a>. It removes background noise, enhancing speech from the raw waveform in real time on a laptop CPU.</p>

<p><a href="https://github.com/pyannote/pyannote-audio"><strong>Pyannote</strong></a> is an open-source toolkit for audio segmentation. It can identify the individual speakers in a recording and separate their turns.</p>

<p><a href="https://github.com/openai/whisper"><strong>Whisper</strong></a> is OpenAI’s automatic speech recognition system, trained on 680,000 hours of multilingual and multitask data collected from the internet. The researchers show that training on such a large and diverse dataset improves robustness to accents and background noise. The model can not only automatically detect the language and recognize speech, but can also translate the text into any of 99 languages.</p>

<p>Interestingly, translation into languages other than English has not been officially announced. I accidentally stumbled upon this capability while experimenting with the model; the repository only states that it can translate any of the supported languages into English.</p>

<p>This demo also contains an HTML5 video player with custom controls. Specifically, it implements a YouTube-like timeline that is divided into chapters for each speaker.</p>

<p>The Jupyter Notebook can be found on my <a href="https://github.com/karray/speech-recognition-and-diarization">GitHub</a> or you can run it on <a href="https://colab.research.google.com/github/karray/speech-recognition-and-diarization/blob/main/diar_speech.ipynb">Google Colab</a>. If you encounter any problems, you are welcome to open an issue on GitHub.</p>

<h2 id="implementation">Implementation</h2>

<p>The notebook has a Setup section that installs packages and defines helper functions. We will go through all sections and look at each cell step by step.</p>

<h3 id="install-dependencies">Install dependencies</h3>

<p>To begin, we install the dependencies. This must be done in a specific order due to version conflicts between Pyannote and PyTorch Lightning.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">pyannote</span><span class="p">.</span><span class="n">audio</span><span class="o">==</span><span class="mf">2.1</span><span class="p">.</span><span class="mi">1</span> <span class="n">denoiser</span><span class="o">==</span><span class="mf">0.1</span><span class="p">.</span><span class="mi">5</span> <span class="n">moviepy</span><span class="o">==</span><span class="mf">1.0</span><span class="p">.</span><span class="mi">3</span> <span class="n">pydub</span><span class="o">==</span><span class="mf">0.25</span><span class="p">.</span><span class="mi">1</span> <span class="n">git</span><span class="o">+</span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="p">.</span><span class="n">com</span><span class="o">/</span><span class="n">openai</span><span class="o">/</span><span class="n">whisper</span><span class="p">.</span><span class="n">git</span><span class="o">@</span><span class="n">v20230124</span>
<span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">omegaconf</span><span class="o">==</span><span class="mf">2.3</span><span class="p">.</span><span class="mi">0</span> <span class="n">pytorch</span><span class="o">-</span><span class="n">lightning</span><span class="o">==</span><span class="mf">1.8</span><span class="p">.</span><span class="mi">4</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>If you want to try out this demo on your own computer, you will also need to install the <a href="https://ffmpeg.org/">ffmpeg</a> package, since we will be processing video and audio files.</p>
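<p>To fail fast when <code class="language-plaintext highlighter-rouge">ffmpeg</code> is missing, you can check for the binary before running the pipeline. This is a minimal sketch and not part of the original notebook:</p>

```python
import shutil

def has_ffmpeg():
    """Return True if the ffmpeg binary is available on PATH."""
    return shutil.which("ffmpeg") is not None

if not has_ffmpeg():
    print("ffmpeg not found - please install it from https://ffmpeg.org/")
```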

<h3 id="start-web-server">Start web server</h3>

<p>This cell installs and starts a web server on the Google Colab virtual machine. This is needed to host the HTML player and resources.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="err">!</span><span class="n">npm</span> <span class="n">install</span> <span class="n">http</span><span class="o">-</span><span class="n">server</span> <span class="o">-</span><span class="n">g</span>

<span class="kn">import</span> <span class="nn">subprocess</span>
<span class="n">subprocess</span><span class="p">.</span><span class="n">Popen</span><span class="p">([</span><span class="s">'http-server'</span><span class="p">,</span> <span class="s">'-p'</span><span class="p">,</span> <span class="s">'8000'</span><span class="p">]);</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Although Python’s built-in <a href="https://docs.python.org/3/library/http.server.html">http.server</a> could be used, it lacks support for <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests">Range requests</a>, which are needed to rewind the video.</p>
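<p>For reference, a Range request asks the server for a byte slice of a file (for example <code class="language-plaintext highlighter-rouge">Range: bytes=0-1023</code>), and a server that supports it answers with status <code class="language-plaintext highlighter-rouge">206 Partial Content</code>. The following sketch of parsing such a header is purely illustrative and not part of the notebook:</p>

```python
import re

def parse_range(header, file_size):
    """Parse a 'bytes=start-end' Range header into absolute byte offsets."""
    m = re.fullmatch(r"bytes=(\d*)-(\d*)", header.strip())
    if not m or (not m.group(1) and not m.group(2)):
        return None
    start = int(m.group(1)) if m.group(1) else None
    end = int(m.group(2)) if m.group(2) else None
    if start is None:                   # suffix range: last N bytes
        start = max(file_size - end, 0)
        end = file_size - 1
    elif end is None or end >= file_size:
        end = file_size - 1             # open-ended range: to end of file
    return start, end

print(parse_range("bytes=0-1023", 5000))  # (0, 1023)
print(parse_range("bytes=-500", 5000))    # (4500, 4999)
```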

<h3 id="html-player-template">HTML player template</h3>

<p>The next cell defines the HTML5 video player template and contains only a string with JavaScript and CSS.</p>

<h3 id="main-code">Main code</h3>

<p>This section contains the most important part of the demo. Let’s examine the code more closely. We’ll start by importing the required libraries and loading the pretrained models.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="rouge-code"><pre><span class="c1"># Imports...
</span>
<span class="n">denoise_model</span> <span class="o">=</span> <span class="n">pretrained</span><span class="p">.</span><span class="n">get_model</span><span class="p">(</span><span class="n">Namespace</span><span class="p">(</span><span class="n">model_path</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">dns48</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">dns64</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">master64</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">valentini_nc</span><span class="o">=</span><span class="bp">False</span><span class="p">)).</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
<span class="n">denoise_model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
<span class="n">whisper_model</span> <span class="o">=</span> <span class="n">whisper</span><span class="p">.</span><span class="n">load_model</span><span class="p">(</span><span class="s">"large"</span><span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
<span class="n">whisper_model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">split_audio</code> function extracts the audio from a video file and divides it into smaller pieces using the MoviePy package, which is a wrapper around <code class="language-plaintext highlighter-rouge">ffmpeg</code>. This is done to ensure that the audio can fit into the available memory. <code class="language-plaintext highlighter-rouge">chunk_size</code> controls the duration of the chunks. The function returns the total duration of the video (which is required for building a timeline) and saves the audio chunks into the <code class="language-plaintext highlighter-rouge">tmpdirname</code> directory for further processing.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">split_audio</span><span class="p">(</span><span class="n">tmpdirname</span><span class="p">,</span> <span class="n">video</span><span class="p">,</span> <span class="n">chunk_size</span><span class="o">=</span><span class="mi">120</span><span class="p">):</span>
    <span class="s">"""
    Split audio into chunks of chunk_size
    """</span>
    <span class="n">path</span> <span class="o">=</span> <span class="n">opj</span><span class="p">(</span><span class="n">tmpdirname</span><span class="p">,</span> <span class="s">'noisy_chunks'</span><span class="p">)</span>
    <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
    <span class="c1"># extract audio from video
</span>    <span class="n">audio</span> <span class="o">=</span> <span class="n">AudioFileClip</span><span class="p">(</span><span class="n">video</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
    <span class="k">with</span> <span class="n">tempfile</span><span class="p">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">suffix</span><span class="o">=</span><span class="s">".wav"</span><span class="p">,</span> <span class="n">delete</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">audio_fp</span><span class="p">:</span>
        <span class="n">audio</span><span class="p">.</span><span class="n">write_audiofile</span><span class="p">(</span><span class="n">audio_fp</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

        <span class="c1"># iterate over chunk start times (the last chunk may be shorter)
</span>        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">audio</span><span class="p">.</span><span class="n">duration</span><span class="p">,</span> <span class="n">chunk_size</span><span class="p">)):</span>
            <span class="n">ffmpeg_extract_subclip</span><span class="p">(</span><span class="n">audio_fp</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">chunk</span><span class="p">,</span> <span class="nb">min</span><span class="p">(</span><span class="n">chunk</span> <span class="o">+</span> <span class="n">chunk_size</span><span class="p">,</span> <span class="n">audio</span><span class="p">.</span><span class="n">duration</span><span class="p">),</span>
                                <span class="n">targetname</span><span class="o">=</span><span class="n">opj</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">i</span><span class="si">:</span><span class="mi">09</span><span class="si">}</span><span class="s">.wav'</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">audio</span><span class="p">.</span><span class="n">duration</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">get_speakers</code> function removes noise from the chunks, reassembles them back to a cleaned audio file, and passes this file into the Pyannote pipeline for speaker diarization.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">get_speakers</span><span class="p">(</span><span class="n">tmpdirname</span><span class="p">,</span> <span class="n">use_auth_token</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
    <span class="n">files</span> <span class="o">=</span> <span class="n">find_audio_files</span><span class="p">(</span><span class="n">opj</span><span class="p">(</span><span class="n">tmpdirname</span><span class="p">,</span> <span class="s">'noisy_chunks'</span><span class="p">))</span>
    <span class="n">dset</span> <span class="o">=</span> <span class="n">Audioset</span><span class="p">(</span><span class="n">files</span><span class="p">,</span> <span class="n">with_path</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                    <span class="n">sample_rate</span><span class="o">=</span><span class="n">denoise_model</span><span class="p">.</span><span class="n">sample_rate</span><span class="p">,</span> <span class="n">channels</span><span class="o">=</span><span class="n">denoise_model</span><span class="p">.</span><span class="n">chin</span><span class="p">,</span> <span class="n">convert</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    
    <span class="n">loader</span> <span class="o">=</span> <span class="n">distrib</span><span class="p">.</span><span class="n">loader</span><span class="p">(</span><span class="n">dset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">distrib</span><span class="p">.</span><span class="n">barrier</span><span class="p">()</span>

    <span class="k">print</span><span class="p">(</span><span class="s">'removing noise...'</span><span class="p">)</span>
    <span class="n">enhanced_chunks</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">with</span> <span class="n">tempfile</span><span class="p">.</span><span class="n">TemporaryDirectory</span><span class="p">()</span> <span class="k">as</span> <span class="n">denoised_tmpdirname</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">data</span> <span class="ow">in</span> <span class="n">loader</span><span class="p">:</span>
            <span class="n">noisy_signals</span><span class="p">,</span> <span class="n">filenames</span> <span class="o">=</span> <span class="n">data</span>
            <span class="n">noisy_signals</span> <span class="o">=</span> <span class="n">noisy_signals</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
            
            <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
                <span class="n">wav</span> <span class="o">=</span> <span class="n">denoise_model</span><span class="p">(</span><span class="n">noisy_signals</span><span class="p">).</span><span class="n">squeeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
            <span class="n">wav</span> <span class="o">=</span> <span class="n">wav</span> <span class="o">/</span> <span class="nb">max</span><span class="p">(</span><span class="n">wav</span><span class="p">.</span><span class="nb">abs</span><span class="p">().</span><span class="nb">max</span><span class="p">().</span><span class="n">item</span><span class="p">(),</span> <span class="mi">1</span><span class="p">)</span>

            <span class="n">name</span> <span class="o">=</span> <span class="n">opj</span><span class="p">(</span><span class="n">denoised_tmpdirname</span><span class="p">,</span> <span class="n">filenames</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">'/'</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
            <span class="n">torchaudio</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">wav</span><span class="p">.</span><span class="n">cpu</span><span class="p">(),</span> <span class="n">denoise_model</span><span class="p">.</span><span class="n">sample_rate</span><span class="p">)</span>
            <span class="n">enhanced_chunks</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>

        <span class="k">print</span><span class="p">(</span><span class="s">'reassembling chunks...'</span><span class="p">)</span>
        <span class="n">clips</span> <span class="o">=</span> <span class="p">[</span><span class="n">AudioFileClip</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">enhanced_chunks</span><span class="p">)]</span>
        <span class="n">final_clip</span> <span class="o">=</span> <span class="n">concatenate_audioclips</span><span class="p">(</span><span class="n">clips</span><span class="p">)</span>
        <span class="n">cleaned_path</span> <span class="o">=</span> <span class="n">opj</span><span class="p">(</span><span class="n">tmpdirname</span><span class="p">,</span> <span class="s">'cleaned.wav'</span><span class="p">)</span>
        <span class="n">final_clip</span><span class="p">.</span><span class="n">write_audiofile</span><span class="p">(</span><span class="n">cleaned_path</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

        <span class="k">print</span><span class="p">(</span><span class="s">'identifying speakers...'</span><span class="p">)</span>
        <span class="c1"># load pre-trained model
</span>        <span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">'pyannote/speaker-diarization'</span><span class="p">,</span> <span class="n">use_auth_token</span><span class="o">=</span><span class="n">use_auth_token</span><span class="p">)</span>
    
        <span class="k">return</span> <span class="nb">str</span><span class="p">(</span><span class="n">pipeline</span><span class="p">({</span><span class="s">'uri'</span><span class="p">:</span> <span class="s">''</span><span class="p">,</span> <span class="s">'audio'</span><span class="p">:</span> <span class="n">cleaned_path</span><span class="p">})).</span><span class="n">split</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">),</span> <span class="n">cleaned_path</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The function returns an array of time codes for the speaker turns and the path to the cleaned audio file that will be used for transcription.</p>
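<p>Each line of this array looks roughly like <code class="language-plaintext highlighter-rouge">[ 00:00:03.168 --&gt;  00:00:07.561] A SPEAKER_01</code>, and the regular expressions in <code class="language-plaintext highlighter-rouge">get_subtitles</code> rely on this layout. A minimal sketch of that parsing step, with a possible implementation of the <code class="language-plaintext highlighter-rouge">str_to_seconds</code> helper (the notebook’s Setup section defines its own version, which may differ):</p>

```python
import re

def str_to_seconds(t):
    """Convert 'HH:MM:SS.mmm' to seconds (assumed helper)."""
    h, m, s = t.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

line = "[ 00:00:03.168 -->  00:00:07.561] A SPEAKER_01"
start, end = re.findall(r"\d{2}:\d{2}:\d{2}\.\d{3}", line)
speaker = re.findall(r"\w+$", line)[0]
print(str_to_seconds(start), str_to_seconds(end), speaker)
# 3.168 7.561 SPEAKER_01
```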

<p>As we will be downloading pretrained models from Hugging Face, we need to set <code class="language-plaintext highlighter-rouge">use_auth_token</code>. We will use <a href="https://huggingface.co/docs/huggingface_hub/main/en/package_reference/login#huggingface_hub.notebook_login">notebook_login</a> to store the token in the config file. Setting it to <code class="language-plaintext highlighter-rouge">True</code> indicates that the token will be read from that config file. You are also required to accept Pyannote’s <a href="https://huggingface.co/pyannote/speaker-diarization">speaker-diarization</a> and <a href="https://huggingface.co/pyannote/segmentation">segmentation</a> user conditions.</p>

<p>Finally, the function <code class="language-plaintext highlighter-rouge">get_subtitles</code> transcribes audio and composes a dictionary with subtitles.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">get_subtitles</span><span class="p">(</span><span class="n">timecodes</span><span class="p">,</span> <span class="n">clened_audio_path</span><span class="p">,</span> <span class="n">language</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="k">if</span><span class="p">(</span><span class="n">device</span> <span class="o">==</span> <span class="s">'cpu'</span><span class="p">):</span>
        <span class="n">options</span> <span class="o">=</span> <span class="n">whisper</span><span class="p">.</span><span class="n">DecodingOptions</span><span class="p">(</span><span class="n">language</span><span class="o">=</span><span class="n">language</span><span class="p">,</span> <span class="n">fp16</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">options</span> <span class="o">=</span> <span class="n">whisper</span><span class="p">.</span><span class="n">DecodingOptions</span><span class="p">(</span><span class="n">language</span><span class="o">=</span><span class="n">language</span><span class="p">)</span>

    <span class="n">timeline</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="n">prev_speaker</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">prev_start</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">timecodes</span><span class="p">:</span>
        <span class="n">start</span><span class="p">,</span> <span class="n">end</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="sa">r</span><span class="s">'\d{2}:\d{2}:\d{2}.\d{3}'</span><span class="p">,</span> <span class="n">line</span><span class="p">)</span>
        <span class="n">start</span> <span class="o">=</span> <span class="n">str_to_seconds</span><span class="p">(</span><span class="n">start</span><span class="p">)</span>
        <span class="n">end</span> <span class="o">=</span> <span class="n">str_to_seconds</span><span class="p">(</span><span class="n">end</span><span class="p">)</span>
        <span class="n">speaker</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">findall</span><span class="p">(</span><span class="sa">r</span><span class="s">'\w+$'</span><span class="p">,</span> <span class="n">line</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>

        <span class="c1"># extract a segment of the audio for a speaker
</span>        <span class="k">with</span> <span class="n">tempfile</span><span class="p">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">suffix</span><span class="o">=</span><span class="s">".wav"</span><span class="p">,</span> <span class="n">delete</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">audio_fp</span><span class="p">:</span>
            <span class="n">ffmpeg_extract_subclip</span><span class="p">(</span><span class="n">clened_audio_path</span><span class="p">,</span> <span class="n">start</span><span class="p">,</span> <span class="n">end</span><span class="p">,</span>
                                    <span class="n">targetname</span><span class="o">=</span><span class="n">audio_fp</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>

            <span class="c1"># load audio and pad/trim it to fit 30 seconds
</span>            <span class="n">audio</span> <span class="o">=</span> <span class="n">whisper</span><span class="p">.</span><span class="n">load_audio</span><span class="p">(</span><span class="n">audio_fp</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
            <span class="n">audio</span> <span class="o">=</span> <span class="n">whisper</span><span class="p">.</span><span class="n">pad_or_trim</span><span class="p">(</span><span class="n">audio</span><span class="p">)</span>  
            <span class="c1"># make log-Mel spectrogram and move to the same device as the model
</span>            <span class="n">mel</span> <span class="o">=</span> <span class="n">whisper</span><span class="p">.</span><span class="n">log_mel_spectrogram</span><span class="p">(</span><span class="n">audio</span><span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="n">whisper_model</span><span class="p">.</span><span class="n">device</span><span class="p">)</span>
            <span class="c1"># decode the audio
</span>            <span class="n">result</span> <span class="o">=</span> <span class="n">whisper</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="n">whisper_model</span><span class="p">,</span> <span class="n">mel</span><span class="p">,</span> <span class="n">options</span><span class="p">)</span>

            <span class="k">if</span><span class="p">(</span><span class="n">speaker</span> <span class="o">==</span> <span class="n">prev_speaker</span><span class="p">):</span>
                <span class="n">timeline</span><span class="p">[</span><span class="n">prev_start</span><span class="p">][</span><span class="s">'text'</span><span class="p">]</span> <span class="o">+=</span> <span class="sa">f</span><span class="s">' &lt;</span><span class="si">{</span><span class="n">seconds_to_str</span><span class="p">(</span><span class="n">start</span><span class="p">)</span><span class="si">}</span><span class="s">&gt;</span><span class="si">{</span><span class="n">result</span><span class="p">.</span><span class="n">text</span><span class="si">}</span><span class="s">'</span>
                <span class="n">timeline</span><span class="p">[</span><span class="n">prev_start</span><span class="p">][</span><span class="s">'end'</span><span class="p">]</span> <span class="o">=</span> <span class="n">end</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">timeline</span><span class="p">[</span><span class="n">start</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="s">'end'</span><span class="p">:</span> <span class="n">end</span><span class="p">,</span> 
                                    <span class="s">'speaker'</span><span class="p">:</span> <span class="n">speaker</span><span class="p">,</span>
                                    <span class="s">'text'</span><span class="p">:</span> <span class="sa">f</span><span class="s">'&lt;v.</span><span class="si">{</span><span class="n">speaker</span><span class="si">}</span><span class="s">&gt;</span><span class="si">{</span><span class="n">speaker</span><span class="si">}</span><span class="s">&lt;/v&gt;: </span><span class="si">{</span><span class="n">result</span><span class="p">.</span><span class="n">text</span><span class="si">}</span><span class="s">'</span><span class="p">}</span>
                <span class="n">prev_start</span> <span class="o">=</span> <span class="n">start</span>

            <span class="n">prev_speaker</span> <span class="o">=</span> <span class="n">speaker</span>

    <span class="k">return</span> <span class="n">timeline</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This function performs speech recognition on the audio segments corresponding to each speaker turn using the pretrained Whisper model. It does this by iterating through the time codes produced by the <code class="language-plaintext highlighter-rouge">get_speakers</code> function and extracting a segment for each speaker from the clean audio. It then computes the <a href="https://en.wikipedia.org/wiki/Mel-frequency_cepstrum">log-Mel spectrogram</a> of the audio and passes it into the <code class="language-plaintext highlighter-rouge">whisper_model</code> function for speech recognition. Finally, the resulting transcription in <a href="https://developer.mozilla.org/en-US/docs/Web/API/WebVTT_API">VTT</a> format and the speaker’s ID are added to the <code class="language-plaintext highlighter-rouge">timeline</code> dictionary, with the start time of the speaker’s turn serving as the <code class="language-plaintext highlighter-rouge">key</code> for the dictionary.</p>
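<p>To make the structure concrete, here is a hedged sketch (not taken from the notebook) of how such a <code class="language-plaintext highlighter-rouge">timeline</code> dictionary could be serialized into a WebVTT file, with a possible <code class="language-plaintext highlighter-rouge">seconds_to_str</code> helper mirroring the one assumed in the Setup section:</p>

```python
def seconds_to_str(s):
    """Format seconds as 'HH:MM:SS.mmm' (assumed helper)."""
    h, rem = divmod(s, 3600)
    m, sec = divmod(rem, 60)
    return f"{int(h):02}:{int(m):02}:{sec:06.3f}"

def timeline_to_vtt(timeline):
    """Serialize {start: {'end', 'speaker', 'text'}} into WebVTT cues."""
    cues = ["WEBVTT", ""]
    for start in sorted(timeline):
        entry = timeline[start]
        cues.append(f"{seconds_to_str(start)} --> {seconds_to_str(entry['end'])}")
        cues.append(entry["text"])
        cues.append("")
    return "\n".join(cues)

demo = {0.5: {"end": 4.0, "speaker": "SPEAKER_00", "text": "Hello!"}}
print(timeline_to_vtt(demo))
```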

<h3 id="ui-code">UI code</h3>

<p>This section defines an input form that allows users to provide a link to a video or upload a file. The form also features a language drop-down menu <code class="language-plaintext highlighter-rouge">Translate to</code>, which includes all the languages that Whisper supports. If the user selects the <code class="language-plaintext highlighter-rouge">Original</code> option, the model will receive <code class="language-plaintext highlighter-rouge">None</code> as the language parameter, indicating that the model should automatically detect the language.</p>

<p>Additionally, helper functions are provided for displaying the video player either directly in the notebook or in a new tab. Depending on where the player is rendered, we need to replace the URL placeholders for the video and subtitles accordingly.</p>

<p>If the video player is rendered within the notebook, the base URL will be <code class="language-plaintext highlighter-rouge">http://localhost:8000/</code> (as the server was started on port <code class="language-plaintext highlighter-rouge">8000</code>). Jupyter will automatically replace this URL with the correct one during requests.</p>

<p>In the case where the player is opened in a separate tab, we need to obtain an external URL for the Colab virtual machine. To accomplish this, we use Colab’s helper function <a href="https://github.com/googlecolab/colabtools/blob/0e3c20fb16bf1891d62b4db67645902a843e186a/google/colab/output/_js.py#L23">eval_js</a> to interact with JavaScript within the current cell’s context by executing Colab’s JS function <a href="https://github.com/googlecolab/colabtools/blob/a601b1fdde246573c78fc002b931e0b1ea96fcd7/packages/outputframe/lib/index.d.ts#L30">proxyPort</a>. It will return the current server URL. Be aware that your browser may need to allow third-party cookies for this to function properly.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="c1"># Get an external URL to the virtual machine
</span><span class="kn">from</span> <span class="nn">google.colab.output</span> <span class="kn">import</span> <span class="n">eval_js</span>
<span class="n">url</span> <span class="o">=</span> <span class="n">eval_js</span><span class="p">(</span><span class="s">"google.colab.kernel.proxyPort(8000)"</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Furthermore, it includes a workaround for the <a href="https://github.com/pyannote/pyannote-audio/issues/1269">A UTF-8 locale is required. Got ANSI_X3.4-1968</a> problem, which occurs after installing Pyannote. For unknown reasons, the <code class="language-plaintext highlighter-rouge">locale</code> is set to <code class="language-plaintext highlighter-rouge">ANSI_X3.4-1968</code>; as a temporary fix, we can override the <a href="https://docs.python.org/3/library/locale.html#locale.getpreferredencoding">locale.getpreferredencoding</a> function as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">locale</span>
<span class="n">locale</span><span class="p">.</span><span class="n">getpreferredencoding</span> <span class="o">=</span> <span class="k">lambda</span><span class="p">:</span> <span class="s">"UTF-8"</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Lastly, the <code class="language-plaintext highlighter-rouge">process</code> function is responsible for preparing the video file, as well as the video player templates, and invoking functions that are defined in the Main code section to generate subtitles.</p>

<h2 id="conclusion">Conclusion</h2>

<p>To sum up, this tutorial provides a comprehensive guide to building a speech recognition and speaker diarization pipeline utilizing three different models.</p>

<p>With the recent release of the <a href="https://openai.com/blog/introducing-chatgpt-and-whisper-apis">Whisper API</a>, developers can now easily integrate the model into their apps.</p>

<p>Additionally, Whisper has recently been <a href="https://github.com/ggerganov/whisper.cpp">ported to C++</a>, which enables high-performance inferencing. This makes it feasible to execute the model on a CPU or even on a Raspberry Pi. Moreover, thanks to WebAssembly (more details on this in my post on <a href="https://karay.me/2022/07/12/bringing-python-to-the-web.html">Python on a Webpage</a>), the model can also run within a web browser.  You can try out Whisper in the browser for yourself in this <a href="https://whisper.ggerganov.com/">live demo</a>.</p>]]></content><author><name></name></author><category term="Whisper" /><category term="Pyannote" /><category term="Speech-Recognition" /><category term="Speaker-diarization" /><summary type="html"><![CDATA[Learn how to transcribe any video in one of 99 languages, identify speakers, and translate text into any of these languages.]]></summary></entry><entry><title type="html">Turning StyleGAN into a latent feature extractor</title><link href="https://karay.me/2023/01/06/turning-stylegan-into-a-latent-feature-extractor.html" rel="alternate" type="text/html" title="Turning StyleGAN into a latent feature extractor" /><published>2023-01-06T11:18:00+00:00</published><updated>2023-01-06T11:18:00+00:00</updated><id>https://karay.me/2023/01/06/turning-stylegan-into-a-latent-feature-extractor</id><content type="html" xml:base="https://karay.me/2023/01/06/turning-stylegan-into-a-latent-feature-extractor.html"><![CDATA[<p>While Generative Adversarial Networks (GANs) are primarily known for their ability to generate high-quality synthetic images, their main task is to learn a latent feature representation of real data. In addition, recent improvements to the original GAN allow it to learn a disentangled latent representation, enabling us to obtain semantically meaningful embeddings.</p>

<p>This property could possibly allow GANs to be used as high-level feature extractors. However, the problem is that the original GAN architecture is not invertible or, in other words, it is impossible to project real images into the latent space.</p>

<p>This article addresses this issue and attempts to answer whether GANs can extract meaningful features from real images and if they are suitable for downstream tasks.</p>

<!--more-->

<p>StyleGAN [1] has revolutionized the creation of synthetic images, and its successor, StyleGAN2 [2], has become the de facto basis for many state-of-the-art generative models. One reason for this is that, along with high image quality, it attempts to solve the problem of latent-space entanglement: by introducing perceptual path length (<a href="https://paperswithcode.com/method/path-length-regularization">PPL</a>) regularization, it encourages each latent variable to control a single abstract feature.</p>

<p>Disentangled representations are representations in which the factors of variation in the data are captured separately and independently. This means that each dimension of the representation corresponds to a single factor of variation, and changing that dimension affects only that factor and no others (e.g., hairstyle for human faces).</p>

<p><img src="/assets/img/posts/stylegan-with-encoder/disentanglement.png" alt="Illustrative example taken from StyleGAN" /></p>

<p><span class="image-description"><strong>Illustrative example taken from StyleGAN [1]</strong>. Two factors of variation (image features, e.g., masculinity and hair length): (a) An example training set where some combination (e.g., long-haired males) is missing. (b) This forces the mapping from $\mathcal{Z}$ to image features to become curved so that the forbidden combination disappears in $\mathcal{Z}$ to prevent the sampling of invalid combinations. (c) The learned mapping from $\mathcal{Z}$ to $\mathcal{W}$ is able to “undo” much of the warping.</span></p>

<p>As mentioned, one of the limitations is that a common GAN is non-invertible, meaning it can only generate images from random noise and cannot extract embeddings from real images. Although there are methods to project real images into GAN’s latent space, the most popular one is slow and computationally expensive, as it is based on iterative optimization. Instead, we can train an encoder along with the generator and discriminator. From this point of view, a GAN can be viewed as a self-supervised representation-learning approach with a contrastive loss, where real images are positive examples and the generator produces negative ones.</p>

<p>Essentially, the discriminator in a GAN already has an encoding part, as it is nothing more than a simple CNN binary classification model, and CNNs are known to be good at extracting features from images. As a matter of fact, we can logically decompose it into a CNN encoder network and a fully connected discriminator network. So, instead of adding another network, we can reuse the discriminator’s weights, saving memory and computational resources.</p>

<p><img src="/assets/img/posts/stylegan-with-encoder/ALAE.png" alt="**Architecture of Adversarial Latent Autoencoder [3].**" /></p>

<p><span class="image-description"><strong>Architecture of Adversarial Latent Autoencoder [3].</strong></span></p>

<p>The approach described in this article is based on the architecture proposed in “Adversarial Latent Autoencoders” (ALAE) [3]. To make the latent spaces of the Mapping network $F$ and encoder $E$ consistent with each other, the authors add an additional term to the GAN loss:</p>

\[L_{\text{consistency}} = E_{p(z)}\bigg [ ||F(z) - E \circ G \circ F(z) ||_2^2 \bigg ]\]

<p>This term forces the encoder to produce the same latent vector from a synthetic image used to generate it. More precisely, we first generate an intermediate vector from noise, $w = F(z)$, then generate a synthetic image from it, $x^\prime = G(w)$, and encode it back into an intermediate vector, $w^{\prime} = E(x^\prime)$. Finally, we minimize the $L_2$ norm between these vectors, $|| w - w^\prime||_2^2$.</p>

<p>In contrast to autoencoders, where the loss calculates an error element-wise in pixel space, this loss operates in latent space. The pixel-wise $L_2$ loss is one of the reasons why autoencoders have not been as successful as GANs in generating diverse and high-quality images [4]. Applying it in pixel space does not reflect human visual perception: shifting an image by even one pixel may cause a large pixel-wise error, while its representation in latent space would barely change. The $L_2$ norm can therefore be used more effectively in latent space, which provides invariance to transformations such as translation.</p>
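<p>The shift-sensitivity argument is easy to verify numerically. In this toy NumPy illustration, a checkerboard image shifted by one pixel is identical to the original up to translation, yet every pixel flips, so the pixel-wise error is maximal:</p>

```python
import numpy as np

# A 64x64 checkerboard of 0s and 1s
img = (np.indices((64, 64)).sum(axis=0) % 2).astype(float)
# The same pattern shifted right by one pixel
shifted = np.roll(img, 1, axis=1)

# Pixel-wise mean squared error: every cell flips, so the error is maximal
pixel_l2 = np.mean((img - shifted) ** 2)
print(pixel_l2)  # 1.0 despite perceptually identical content
```

<p>A latent-space loss, by contrast, would assign nearly identical embeddings to both images.</p>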

<p>Additionally, ALAE introduces an information flow between the generator and discriminator, which makes the model more complex but can improve convergence speed and image quality. In this example, I leave it out to keep everything simple.</p>

<h2 id="implementation">Implementation</h2>

<p>To demonstrate this approach I chose an unofficial StyleGAN2 PyTorch <a href="https://github.com/rosinality/stylegan2-pytorch">implementation</a>.</p>

<p>The main change is the introduction of a new loss term which I called <a href="https://github.com/karray/stylegan2-pytorch/blob/master/solver_celeba.py#L295">consistency loss</a>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="n">z</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">args</span><span class="p">.</span><span class="n">batch</span><span class="p">,</span> <span class="n">args</span><span class="p">.</span><span class="n">latent</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">device</span><span class="p">)</span>
<span class="n">w_z</span> <span class="o">=</span> <span class="n">mapping</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
<span class="n">fake_img</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">generator</span><span class="p">([</span><span class="n">w_z</span><span class="p">])</span>
<span class="n">w_e</span> <span class="o">=</span> <span class="n">encoder</span><span class="p">(</span><span class="n">fake_img</span><span class="p">)</span>
<span class="n">consistency_loss</span> <span class="o">=</span> <span class="p">(</span><span class="n">w_z</span> <span class="o">-</span> <span class="n">w_e</span><span class="p">).</span><span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Basically, that’s all. We could use the original implementation of the discriminator and slightly change it to return intermediate results, right after the last convolutional layer. But I find it much cleaner to split the <a href="https://github.com/karray/stylegan2-pytorch/blob/master/model.py#L869">discriminator</a> into two independent networks: <a href="https://github.com/karray/stylegan2-pytorch/blob/master/model.py#L945">Encoder</a> and <a href="https://github.com/karray/stylegan2-pytorch/blob/master/model.py#L930">DiscriminatorMini</a>.</p>

<p>Since the <a href="https://github.com/karray/stylegan2-pytorch/blob/master/model.py#L621">Generator</a> in this implementation is combined with the mapping network, I also split it into 2 separate networks: <a href="https://github.com/karray/stylegan2-pytorch/blob/master/model.py#L420">Generator1</a> and <a href="https://github.com/karray/stylegan2-pytorch/blob/master/model.py#L391">MappingNetwork</a>.</p>

<h2 id="evaluation">Evaluation</h2>

<p>To quantitatively evaluate the encoder, I trained a baseline ResNet18 model on the raw images, and logistic regression together with an SVM with a linear kernel on the embeddings (the latter two only for the MNIST and PCam datasets).</p>

<p>The expected result is that the embeddings will be linearly separable and the accuracy of the base models will be similar to that of linear models. This assumption is based on the use of PPL, which enforces a disentangled and linearly separable latent space.</p>
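<p>The separability check itself takes only a few lines with scikit-learn. This is a sketch on synthetic stand-in data — here <code class="language-plaintext highlighter-rouge">emb</code> and <code class="language-plaintext highlighter-rouge">labels</code> would be replaced by the encoder outputs and the dataset labels:</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Stand-in embeddings: two well-separated Gaussian clusters in 512-D
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(-2.0, 1.0, (200, 512)),
                 rng.normal(2.0, 1.0, (200, 512))])
labels = np.array([0] * 200 + [1] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(emb, labels, random_state=0)
for model in (LogisticRegression(max_iter=1000), LinearSVC()):
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(type(model).__name__, acc)
```

<p>If the embeddings are linearly separable, both linear models should approach the accuracy of the ResNet18 baseline.</p>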

<p>Visual inspection still remains the standard evaluation approach, so I generated synthetic images to check that the model was not broken, and also visualized the embeddings using <a href="https://umap-learn.readthedocs.io/en/latest/plotting.html#interactive-plotting-and-hover-tools">UMAP</a> to see if they form clusters.</p>

<h2 id="results">Results</h2>

<p>I trained this model on three different datasets: <a href="https://paperswithcode.com/dataset/mnist">MNIST</a>,  <a href="https://paperswithcode.com/dataset/celeba-hq">CelebA</a> + <a href="https://paperswithcode.com/dataset/ffhq">FFHQ</a>, and <a href="https://github.com/basveeling/pcam">PCam</a>, and moved the <a href="https://github.com/karray/stylegan2-pytorch/blob/master/train.py">training logic</a> to the <a href="https://github.com/karray/stylegan2-pytorch/blob/master/solver_mnist.py">solver_mnist.py</a>, <a href="https://github.com/karray/stylegan2-pytorch/blob/master/solver_celeba.py">solver_celeba.py</a>, and <a href="https://github.com/karray/stylegan2-pytorch/blob/master/solver_pcam.py">solver_pcam.py</a>, respectively. Each of the solvers has been slightly adjusted to match the dataset requirements. There is also a <a href="https://www.kaggle.com/code/karray/stylegan-with-encoder">notebook</a> with pretrained models where you can reproduce the results.</p>

<h3 id="mnist">MNIST</h3>

<p>Since the images in the MNIST dataset are only 28x28 pixels (they were converted to 32x32x3) and the dataset itself is very simple, I first trained the model on it to verify that there were no bugs and that the algorithm worked as expected.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre>python3 solver_mnist.py <span class="nt">--path</span> path/to/save/dataset <span class="nt">--size</span> 128 <span class="nt">--name</span> &lt;Project name&gt; <span class="nt">--run_name</span> &lt;experiment name&gt; <span class="nt">--batch</span> 32 <span class="nt">--iter</span> 10000 <span class="nt">--augment</span> <span class="nt">--wandb</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Where <code class="language-plaintext highlighter-rouge">--name</code> and <code class="language-plaintext highlighter-rouge">--run_name</code> are used for <a href="https://wandb.ai">wandb</a> logging. The description of the parameters for each solver can be found in the help strings.</p>

<p>First I generate some random images to see if the changes didn’t break the model:</p>

<p><img src="/assets/img/posts/stylegan-with-encoder/mnist_from_z.png" alt="Synthetic numbers" /></p>

<p>Next, I check if the encoder produces latent features from the same distribution as the generator by encoding real images from the test set (that the model hasn’t seen) and generating new ones from these embeddings:</p>

<p><img src="/assets/img/posts/stylegan-with-encoder/mnist_reconstruction.png" alt="The first row represents the original images, the second row demonstrates the reconstruction" /></p>

<p><span class="image-description">The first row represents the original images, the second row demonstrates the reconstruction</span></p>

<p>This figure demonstrates that the reconstruction works fairly well but is not ideal (one of the <code class="language-plaintext highlighter-rouge">8</code>s was reconstructed as a <code class="language-plaintext highlighter-rouge">3</code>).</p>

<p>Additionally, I encoded the whole test set and used the embeddings to demonstrate querying top N similar images using cosine similarity:</p>

<p><img src="/assets/img/posts/stylegan-with-encoder/mnist_querying.png" alt="Searching for the most similar images. The first column contains real images." /></p>

<p><span class="image-description">Searching for the most similar images. The first column contains real images</span></p>
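<p>Top-N querying with cosine similarity reduces to a normalized dot product. A small NumPy sketch, where <code class="language-plaintext highlighter-rouge">embeddings</code> stands in for the encoded test set and <code class="language-plaintext highlighter-rouge">query</code> for a single encoded image:</p>

```python
import numpy as np

def top_n_similar(query, embeddings, n=5):
    """Return indices of the n embeddings most similar to the query."""
    # Normalize rows so the dot product equals cosine similarity
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = emb @ q
    # Sort by descending similarity and keep the top n indices
    return np.argsort(scores)[::-1][:n]

# Tiny demo: the query points in the same direction as row 2
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 2.0])
print(top_n_similar(query, embeddings, n=1))  # → [2]
```

<p>Because cosine similarity ignores vector magnitude, it compares the direction of embeddings, which is what matters for semantic similarity.</p>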

<p>Now, we move on to the quantitative assessment. As mentioned earlier, I trained linear SVM and logistic regression models to check if the embeddings are linearly separable. These models were trained on embeddings produced from half of the test set (which the GAN did not see), and the other half was used as a validation set. Both models reached 99% accuracy. The <code class="language-plaintext highlighter-rouge">ResNet18</code> model was trained on the raw images from the training set and validated on the entire test set. It also achieved 99% accuracy, which indicates that the GAN model has successfully learned a disentangled latent representation.</p>

<table>
  <tbody>
    <tr>
      <td><img src="/assets/img/posts/stylegan-with-encoder/mnist_regression_confusion_matrix.png" alt="Logistic regression confusion matrix" /></td>
      <td><img src="/assets/img/posts/stylegan-with-encoder/mnist_svm_confusion_matrix.png" alt="SVM confusion matrix" /></td>
      <td><img src="/assets/img/posts/stylegan-with-encoder/mnist_resnet_confusion_matrix.png" alt="ResNet confusion matrix" /></td>
    </tr>
  </tbody>
</table>

<p>Finally, I visualize the embeddings by projecting them into 2d space using UMAP:</p>

<p><img src="/assets/img/posts/stylegan-with-encoder/mnist_visualization.png" alt="MNIST embeddings visualization" /></p>

<p><span class="image-description"><strong>MNIST embeddings visualization.</strong> Each color represents a number from 0 to 9</span></p>

<p>This visualization demonstrates that there are clear clusters with few misassignments, supporting the statement that the model was able to learn a linearly separable (and thus disentangled) latent representation. A look at the interactive <a href="https://karay.me/examples/stylegan2-with-encoder/mnist.html">visualization</a> suggests that most of the misassigned samples look very similar to the nearest ones. I especially like how crossed <code class="language-plaintext highlighter-rouge">7</code> forms a separate cluster, although this would cause problems if we wanted to label clusters.</p>

<h3 id="celeba-and-ffhq">CelebA and FFHQ</h3>

<p>After testing the model on MNIST, it was trained on the CelebA + FFHQ datasets.</p>

<p>As before, let’s generate some random images to see if the model works correctly:</p>

<p><img src="/assets/img/posts/stylegan-with-encoder/celeba_from_z.png" alt="Synthetic images generated from noise using custom ALAE" /></p>

<p><span class="image-description"><strong>Synthetic images generated from noise using custom ALAE</strong></span></p>

<p>Now, let’s reconstruct real images:</p>

<p><img src="/assets/img/posts/stylegan-with-encoder/celeba_reconstruction.png" alt="The first row represents original images, the second row demonstrates reconstruction" /></p>

<p><span class="image-description">The first row represents original images, the second row demonstrates reconstruction</span></p>

<p>We can see that the images have been reconstructed inaccurately.</p>

<p>And here is a visualization of embedding with the gender attribute highlighted:</p>

<p><img src="/assets/img/posts/stylegan-with-encoder/celeba_visualization.png" alt="CelebA test set visualization of embeddings" /></p>

<p><span class="image-description"><strong>CelebA test set visualization of embeddings.</strong> Orange - female, blue - male.</span></p>

<p>At first glance, the model seems to have succeeded in capturing the gender attribute, but a closer look at the interactive <a href="https://karay.me/examples/stylegan2-with-encoder/celeba_ffhq.html">visualization</a> reveals that the haircut may play a greater role.</p>

<p><img src="/assets/img/posts/stylegan-with-encoder/celeba_visualization_male.png" alt="Misassignment gender attribute" /></p>

<p>However, which features were decisive remains open. For instance, this diagram shows the attribute <code class="language-plaintext highlighter-rouge">black hair</code>:</p>

<p><img src="/assets/img/posts/stylegan-with-encoder/celeba_visualization_hair.png" alt="Visualization of black hair attribute" /></p>

<p><span class="image-description"><code class="language-plaintext highlighter-rouge">black hair</code> in blue</span></p>

<p>As previously mentioned, the reconstruction loss may decrease the quality of images. To test this, I added a reconstruction loss between real and generated images in pixel space. The figure below shows the results.</p>

<p><img src="/assets/img/posts/stylegan-with-encoder/pixelwsie_reconstruction.png" alt="Pixelwise reconstruction loss" /></p>

<p><span class="image-description">The first row shows real images; the second shows the reconstruction</span></p>

<p>The results confirm that optimizing a GAN in latent space is generally the better approach for image generation.</p>

<h3 id="camelyon">Camelyon</h3>

<p>Finally, the model was trained on the <a href="https://camelyon16.grand-challenge.org/">Camelyon</a> dataset, which consists of H&amp;E-stained whole-slide images of lymph node sections containing normal tissue or breast cancer metastases.</p>

<p>Similar to the MNIST experiment, I trained a linear SVM and logistic regression on the test set.</p>

<table>
  <tbody>
    <tr>
      <td><img src="/assets/img/posts/stylegan-with-encoder/pcam_resnet_confusion_matrix.png" alt="ResNet confusion matrix" /></td>
      <td><img src="/assets/img/posts/stylegan-with-encoder/pcam_regression_confusion_matrix.png" alt="Logistic regression confusion matrix" /></td>
      <td><img src="/assets/img/posts/stylegan-with-encoder/pcam_svm_confusion_matrix.png" alt="SVM confusion matrix" /></td>
    </tr>
  </tbody>
</table>

<p>As we can see, the ResNet18 model reached 77% accuracy, whereas the linear models trained on embeddings reached only 50%, which is no better than random guessing.</p>

<p>And here is a <a href="https://karay.me/examples/stylegan2-with-encoder/pcam.html">visualization</a> of the embeddings:</p>

<p><img src="/assets/img/posts/stylegan-with-encoder/pcam_visualization.png" alt="PCam visualization" /></p>

<p>This diagram shows embeddings colored by their class (normal, cancer). As you can see, these classes do not form clusters. This indicates that the model did not capture the cancer cells, making the approach useless for this dataset.</p>

<p>Note that there are several point clusters, indicating that the dataset contains duplicates, completely black patches, and patches without tissue.</p>

<h2 id="conclusion">Conclusion</h2>

<p>As we have just seen, decomposing and reusing the encoder part of the discriminator and adding a simple consistency loss allow real images to be projected into the latent space. Having disentangled embeddings can potentially allow us to identify features in the latent space and assign semantic attributes to them, which may let us explain predictions in downstream tasks, assuming that the latent representation is indeed disentangled.</p>

<p>However, the linear separability of the embeddings does not necessarily mean that the latent representation is disentangled, nor does the visualization with UMAP. This question, therefore, remains open for further investigation. Nonetheless, we still can use embeddings to search for similar samples and, for example, clean and balance datasets.</p>

<p>Another issue is that the encoder approach is not optimal, causing the model to fail to reconstruct images accurately. There are already better methods for inverting real images, for instance, combining the encoder approach with an optimization technique, but this is also not ideal, as we still need to run an iterative optimization until we obtain reasonable embeddings. I encourage you to watch this <a href="https://www.youtube.com/watch?v=zyBQ9obuqfQ">talk</a> on the topic.</p>

<p>In conclusion, StyleGAN2 with an encoder appears able to capture coarse features such as hair color, or the scanner color palette in digital pathology, but may struggle with fine features that are only a few pixels in size. Further investigation is needed to confirm these findings.</p>

<h2 id="references">References</h2>

<p>[1] Karras, T., Laine, S., &amp; Aila, T. (2018). A Style-Based Generator Architecture for Generative Adversarial Networks. <a href="https://arxiv.org/abs/1812.04948">arXiv</a></p>

<p>[2] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., &amp; Aila, T. (2019). Analyzing and Improving the Image Quality of StyleGAN. <a href="https://arxiv.org/abs/1912.04958">arXiv</a></p>

<p>[3] Pidhorskyi, S., Adjeroh, D., &amp; Doretto, G. (2020). Adversarial Latent Autoencoders. <a href="https://arxiv.org/abs/2004.04467">arXiv</a></p>

<p>[4] Wu, Zongze, Dani Lischinski, and Eli Shechtman. “Stylespace analysis: Disentangled controls for StyleGAN image generation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. <a href="https://openaccess.thecvf.com/content/CVPR2021/papers/Wu_StyleSpace_Analysis_Disentangled_Controls_for_StyleGAN_Image_Generation_CVPR_2021_paper.pdf">PDF</a></p>]]></content><author><name></name></author><category term="StyleGAN" /><category term="GAN" /><category term="deep-learning" /><category term="representation-learning" /><category term="self-supervised-learning" /><summary type="html"><![CDATA[This blog post explores the potential for using StyleGAN as a tool for extracting latent features from images. It investigates the challenges of using it for self-supervised representation learning and assesses its effectiveness at extracting meaningful features.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://karay.me/assets/img/posts/stylegan-with-encoder/mnist_visualization.png" /><media:content medium="image" url="https://karay.me/assets/img/posts/stylegan-with-encoder/mnist_visualization.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Bringing Python to the Web</title><link href="https://karay.me/2022/07/12/bringing-python-to-the-web.html" rel="alternate" type="text/html" title="Bringing Python to the Web" /><published>2022-07-12T10:01:27+00:00</published><updated>2022-07-12T10:01:27+00:00</updated><id>https://karay.me/2022/07/12/bringing-python-to-the-web</id><content type="html" xml:base="https://karay.me/2022/07/12/bringing-python-to-the-web.html"><![CDATA[<p>Have you ever wanted to share your cool Python app with the world without deploying an entire Django server or developing a mobile app just for a small project?</p>

<p>Good news, you don’t have to! All you need is to add one JavaScript library to your HTML page and it will even work on mobile devices, allowing you to mix JS with Python so you can take advantage of both worlds.</p>

<!--more-->

<p>Take a look at this REPL example:</p>

<div class="full-width">
    <div class="wrapper">
        <div class="full-width-content">
            
<div id="load-simple-example" style="text-align: center;">
    <button onclick="loadExample()">Load Example</button>
    <div>
        <strong>Note that this may take some time and cause the page to freeze</strong>.
    </div>
</div>
<div id="simple-example" style="display: none;">

    Output:
    <textarea id="output" style="width: 100%;" rows="10" disabled=""></textarea>
    <textarea id="code" rows="3">
import numpy as np
np.ones((10,))
        </textarea>
    <button id="run" onclick="evaluatePython()">Run</button>
    <div>You can execute any Python code. Just enter something in the box above and click the button. </div>

    <script type="text/javascript">
        const output = document.getElementById("output")
        const code = document.getElementById("code")

        function loadExample() {
            document.getElementById('load-simple-example').remove()
            let div = document.getElementById('simple-example')
            div.style.display = 'block'

            output.value = 'Initializing...\n'

            if (!document.getElementById('pyodide-script')) {
                let pyodide_script = document.createElement('script')
                pyodide_script.id = 'pyodide-script'
                pyodide_script.type = 'text/javascript'
                pyodide_script.addEventListener('load', async () => {
                    // init pyodide
                    window.pyodide = await loadPyodide({stdout: addToOutput, stderr: addToOutput}) // redirect stdout and stderr to addToOutput
                    output.value += 'Ready!\n' 
                })
                pyodide_script.src = 'https://cdn.jsdelivr.net/pyodide/v0.21.3/full/pyodide.js'
                div.appendChild(pyodide_script)
            }
            else {
                output.value += 'Ready!\n'
            }

        }

        function addToOutput(s) {
            output.value += `${s}\n`
            output.scrollTop = output.scrollHeight
        }

        async function evaluatePython() {
            addToOutput(`>>>${code.value}`)

            // Since pyodide 0.18.0, you must call loadPackagesFromImports() to import any python packages referenced via import statements in your code. This function will no longer do it for you.
            await pyodide.loadPackagesFromImports(code.value, addToOutput, addToOutput)
            try {
                let result = await pyodide.runPythonAsync(code.value)
                addToOutput(`${result}`)
            }
            catch (e) {
                addToOutput(`${e}`)
            }
            code.value = ''
        }
    </script>
</div>

        </div>
    </div>
</div>

<div class="alert alert-info">
    <b>Note:</b>
    <p>This guide has been updated to Pyodide v0.21.3.</p>

</div>

<p>Witchcraft! This is made possible by <a href="https://webassembly.org/">WebAssembly</a> (Wasm) and the <a href="https://github.com/iodide-project/pyodide">Pyodide project</a>. You can also open the <a href="/examples/pyodide_repl.html" target="_blank">pyodide_repl.html</a> (<a href="https://github.com/karray/karray.github.io/blob/master/examples/pyodide_repl.html">source code</a>) example in a new tab.</p>

<p>So what can we actually do? Spoiler: with the power of Python and JS, we can do almost anything. But before getting into the details, let me first tell you the little story behind this post.</p>

<p>I recently started a hobby <a href="http://karay.me/truepyxel/">project</a> where I implemented image pixelation. I decided to write it in Python, as the language has a wealth of libraries for working with images. The problem was that I couldn’t easily share the app without developing an Android app or finding hosting and deploying a Django or Flask server.</p>

<p>I’d heard about WebAssembly before and had wanted to try it out for a long time. Searching the Internet for “webassembly python”, I immediately came across a link to an interesting article, “<a href="https://hacks.mozilla.org/2019/04/pyodide-bringing-the-scientific-python-stack-to-the-browser/">Pyodide: Bringing the scientific Python stack to the browser</a>”. Unfortunately, the article is mainly about the <a href="https://github.com/iodide-project/iodide">iodide project</a>, which is no longer in development, and the documentation for <a href="https://github.com/iodide-project/pyodide">Pyodide</a> was sparse.</p>

<p>The idea for this article came to me when I decided to contribute to the project by improving its <a href="https://github.com/pyodide/pyodide/pull/767">documentation</a>, after piecing together information about the API and running a number of code experiments.</p>

<p>Here I would like to share my experience. I will also give more examples and discuss some issues.</p>

<h1 id="what-is-pyodide"><strong>What is Pyodide?</strong></h1>

<p>According to the official <a href="https://github.com/pyodide/pyodide">repository</a>,</p>

<blockquote>
  <p>Pyodide is a port of CPython to WebAssembly/Emscripten.
It was created in 2018 by <a href="https://github.com/mdboom">Michael Droettboom</a> at Mozilla as part of the Iodide project. Iodide is an experimental web-based notebook environment for literate scientific computing and communication.</p>
</blockquote>

<p>All of this is made possible by Wasm.</p>

<blockquote>
  <p>WebAssembly is a new type of code that can be run in modern web browsers and provides new features and major gains in performance. It is not primarily intended to be written by hand, rather it is designed to be an effective compilation target for source languages like C, C++, Rust, etc.</p>
</blockquote>

<p>Wasm could potentially have a huge impact on the future of front-end development by extending the JS stack with numerous libraries and opening new possibilities for developers programming in languages other than JS. For example, there are already projects using it under the hood, such as <a href="https://pyscript.net/">PyScript</a> by Anaconda.</p>

<p>So, it’s time to get your hands dirty. Let’s take a closer look at a minimal example</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="rouge-code"><pre><span class="cp">&lt;!DOCTYPE html&gt;</span>
<span class="nt">&lt;html&gt;</span>
<span class="nt">&lt;head&gt;</span>
<span class="nt">&lt;script </span><span class="na">src=</span><span class="s">"https://cdn.jsdelivr.net/pyodide/v0.21.3/full/pyodide.js"</span><span class="nt">&gt;&lt;/script&gt;</span>
<span class="nt">&lt;script&gt;</span>
  <span class="p">(</span><span class="k">async</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span> <span class="c1">// create anonymous async function to enable await</span>
    <span class="kd">const</span> <span class="nx">pyodide</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">loadPyodide</span><span class="p">();</span>
    <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">pyodide</span><span class="p">.</span><span class="nx">runPython</span><span class="p">(</span><span class="s2">`
import sys
sys.version
    `</span><span class="p">));</span>
  <span class="p">})();</span> <span class="c1">// call the async function immediately</span>
<span class="nt">&lt;/script&gt;</span>
<span class="nt">&lt;/head&gt;</span>
<span class="nt">&lt;body&gt;</span>
<span class="nt">&lt;/body&gt;</span>
<span class="nt">&lt;/html&gt;</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>First of all, we have to include the <code class="language-plaintext highlighter-rouge">pyodide.js</code> script by adding the CDN URL</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="nt">&lt;script </span><span class="na">src=</span><span class="s">"https://cdn.jsdelivr.net/pyodide/v0.21.3/full/pyodide.js"</span><span class="nt">&gt;&lt;/script&gt;</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>After this, we must load the main Pyodide wasm module using <a href="https://pyodide.org/en/stable/usage/api/js-api.html#globalThis.loadPyodide">loadPyodide</a> and wait until the Python environment is bootstrapped</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="kd">const</span> <span class="nx">pyodide</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">loadPyodide</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Finally, we can run Python code</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">pyodide</span><span class="p">.</span><span class="nx">runPython</span><span class="p">(</span><span class="dl">'</span><span class="s1">import sys; sys.version</span><span class="dl">'</span><span class="p">))</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<!-- Note that if we want to load `pyodide.js` from a source other than the official CDN (e.g. own server), we have to set the base Plugin URL before including the `pyodide.js` as follows
This sets the path for downloading Python packages. -->

<p>By default, the environment only includes standard Python modules such as <code class="language-plaintext highlighter-rouge">sys</code>, <code class="language-plaintext highlighter-rouge">csv</code>, etc. If we want to import a third-party package like <code class="language-plaintext highlighter-rouge">numpy</code>, we have two options: we can either pre-load the required packages manually and then import them in Python</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="rouge-code"><pre><span class="k">await</span> <span class="nx">pyodide</span><span class="p">.</span><span class="nx">loadPackage</span><span class="p">(</span><span class="dl">'</span><span class="s1">numpy</span><span class="dl">'</span><span class="p">);</span>
<span class="c1">// numpy is now available</span>
<span class="nx">pyodide</span><span class="p">.</span><span class="nx">runPython</span><span class="p">(</span><span class="dl">'</span><span class="s1">import numpy as np</span><span class="dl">'</span><span class="p">)</span>
<span class="c1">// create a numpy array</span>
<span class="nx">np_array</span> <span class="o">=</span> <span class="nx">pyodide</span><span class="p">.</span><span class="nx">runPython</span><span class="p">(</span><span class="dl">'</span><span class="s1">np.ones((3, 3))</span><span class="dl">'</span><span class="p">)</span>
<span class="c1">// convert Python array to JS array</span>
<span class="nx">np_array</span> <span class="o">=</span> <span class="nx">np_array</span><span class="p">.</span><span class="nx">toJs</span><span class="p">()</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">np_array</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>or we can use the <a href="https://pyodide.org/en/stable/usage/api/js-api.html#pyodide.loadPackagesFromImports">pyodide.loadPackagesFromImports</a> function that will automatically download all packages that the code snippet imports</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="kd">const</span> <span class="nx">python_code</span> <span class="o">=</span> <span class="s2">`
import numpy as np
np.ones((3,3))
`</span><span class="p">;</span>
<span class="p">(</span><span class="k">async</span> <span class="p">()</span> <span class="o">=&gt;</span> <span class="p">{</span>
  <span class="k">await</span> <span class="nx">pyodide</span><span class="p">.</span><span class="nx">loadPackagesFromImports</span><span class="p">(</span><span class="nx">python_code</span><span class="p">)</span>
  <span class="kd">const</span> <span class="nx">result</span> <span class="o">=</span> <span class="nx">pyodide</span><span class="p">.</span><span class="nx">runPython</span><span class="p">(</span><span class="nx">python_code</span><span class="p">)</span>
  <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">result</span><span class="p">.</span><span class="nx">toJs</span><span class="p">())</span>
<span class="p">})()</span> <span class="c1">// call the function immediately</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<div class="alert alert-warning">
    <b>Note:</b>
<p>Since Pyodide 0.18.0, <a href="https://pyodide.org/en/stable/usage/api/js-api.html#pyodide.runPythonAsync">pyodide.runPythonAsync</a> no longer loads packages automatically, so <code class="language-plaintext highlighter-rouge">loadPackagesFromImports</code> should be called beforehand. It does not download packages from PyPI, only packages included in the Pyodide distribution (see the <a href="https://pyodide.org/en/stable/usage/packages-in-pyodide.html#packages-in-pyodide">packages list</a>). More information about loading packages can be found <a href="https://pyodide.org/en/stable/usage/loading-packages.html">here</a>.</p>

</div>

<!-- <div class="alert alert-warning">
    <b>Note:</b>
    <p>although the function is called <code class="language-plaintext highlighter-rouge">Async</code>, it still blocks the main thread. To run Python code asynchronously, we can use <a href="https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API">WebWorkers</a>.</p>

</div> -->

<p>Okay, but how can we use all of this? In fact, we can replace JS and use Python as the main language for web development. Pyodide provides a bridge between JS and Python scopes.</p>

<h1 id="accessing-javascript-scope-from-python"><strong>Accessing JavaScript scope from Python</strong></h1>

<p>The JS scope can be accessed from Python through the <code class="language-plaintext highlighter-rouge">js</code> module. This module gives us access to the global object <code class="language-plaintext highlighter-rouge">window</code> and allows us to directly manipulate the DOM and access global variables and functions from Python. In other words, <code class="language-plaintext highlighter-rouge">js</code> is an alias for <code class="language-plaintext highlighter-rouge">window</code>, so we can either import <code class="language-plaintext highlighter-rouge">window</code> with <code class="language-plaintext highlighter-rouge">from js import window</code> or just use <code class="language-plaintext highlighter-rouge">js</code> directly.</p>

<p>Why not try it yourself? You can either try it out in the live demo above or open the <a href="https://karay.me/examples/pyodide_repl.html">demo</a> in a new tab.</p>

<!-- **Please be aware that execution of the code may take a while and the UI thread will be blocked until all packages have been downloaded.** -->

<p>Just run this Python code and watch what happens.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">js</span> <span class="kn">import</span> <span class="n">document</span>

<span class="n">div</span> <span class="o">=</span> <span class="n">document</span><span class="p">.</span><span class="n">createElement</span><span class="p">(</span><span class="s">'div'</span><span class="p">)</span>
<span class="n">div</span><span class="p">.</span><span class="n">innerHTML</span> <span class="o">=</span> <span class="s">'&lt;h1&gt;This element was created from Python&lt;/h1&gt;'</span>
<span class="n">document</span><span class="p">.</span><span class="n">getElementById</span><span class="p">(</span><span class="s">'simple-example'</span><span class="p">).</span><span class="n">prepend</span><span class="p">(</span><span class="n">div</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>We have just created an <code class="language-plaintext highlighter-rouge">h1</code> heading at the top of the example’s container using Python. Isn’t it cool?!</p>

<p>We first created a <code class="language-plaintext highlighter-rouge">div</code> element and then inserted it into the <code class="language-plaintext highlighter-rouge">&lt;div id='simple-example'&gt;</code> using the JS <code class="language-plaintext highlighter-rouge">document</code> interface.</p>

<p>Since we have full control over the <code class="language-plaintext highlighter-rouge">window</code> object, we can also handle all events from Python. Let’s add a button at the bottom of the example that clears the output when clicked</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">js</span> <span class="kn">import</span> <span class="n">document</span>

<span class="k">def</span> <span class="nf">handle_clear_output</span><span class="p">(</span><span class="n">event</span><span class="p">):</span>
  <span class="n">output_area</span> <span class="o">=</span> <span class="n">document</span><span class="p">.</span><span class="n">getElementById</span><span class="p">(</span><span class="s">'output'</span><span class="p">)</span>
  <span class="n">output_area</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="s">''</span>

<span class="n">clear_button</span> <span class="o">=</span> <span class="n">document</span><span class="p">.</span><span class="n">createElement</span><span class="p">(</span><span class="s">'button'</span><span class="p">)</span>
<span class="n">clear_button</span><span class="p">.</span><span class="n">innerHTML</span> <span class="o">=</span> <span class="s">'Clear output'</span>
<span class="n">clear_button</span><span class="p">.</span><span class="n">onclick</span> <span class="o">=</span> <span class="n">handle_clear_output</span>
<span class="n">document</span><span class="p">.</span><span class="n">getElementById</span><span class="p">(</span><span class="s">'simple-example'</span><span class="p">).</span><span class="n">appendChild</span><span class="p">(</span><span class="n">clear_button</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Note that we now use a Python function as an event handler.</p>

<div class="alert alert-info">
    <b>Note:</b>
<p>We can only access properties of the <code class="language-plaintext highlighter-rouge">window</code> object. That is, we can only reach variables attached directly to the window or declared globally with the <code class="language-plaintext highlighter-rouge">var</code> statement. Because <code class="language-plaintext highlighter-rouge">let</code>, like <code class="language-plaintext highlighter-rouge">const</code>, declares a block-scoped local variable, it does not create a property on the window object when declared at the top level.</p>

</div>
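<p>To see this rule in action without a browser, here is a plain-Python sketch that uses a stand-in object for <code class="language-plaintext highlighter-rouge">window</code> (the real <code class="language-plaintext highlighter-rouge">js.window</code> only exists inside Pyodide): a top-level <code class="language-plaintext highlighter-rouge">var</code> ends up as a window attribute, while <code class="language-plaintext highlighter-rouge">let</code> never does.</p>

```python
# Stand-in for the browser `window` object; inside Pyodide you would use
# `from js import window` instead. This only sketches the visibility rule.
class FakeWindow:
    pass

window = FakeWindow()

# `var a = 1;` at the top level of a script attaches `a` to window:
window.a = 1
# `let b = 2;` is block-scoped and never becomes a window property,
# so we simply never set it here.

print(hasattr(window, 'a'))  # True  -> visible from Python via js.window
print(hasattr(window, 'b'))  # False -> invisible from Python
```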

<h1 id="http-requests"><strong>HTTP requests</strong></h1>

<p>Python’s go-to library for HTTP requests is the third-party <code class="language-plaintext highlighter-rouge">requests</code> package. However, it is still <a href="https://pyodide.org/en/stable/project/roadmap.html#write-http-client-in-terms-of-web-apis">not supported</a> by Pyodide. Luckily, we can use the <a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API">Fetch API</a> to make HTTP requests from Python.</p>

<p>Pyodide used to support JS <code class="language-plaintext highlighter-rouge">then/catch/finally</code> promise functions and we could use <code class="language-plaintext highlighter-rouge">fetch</code> as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">js</span> <span class="kn">import</span> <span class="n">window</span>
<span class="n">window</span><span class="p">.</span><span class="n">fetch</span><span class="p">(</span><span class="s">'https://karay.me/assets/misc/test.json'</span><span class="p">)</span>
      <span class="p">.</span><span class="n">then</span><span class="p">(</span><span class="k">lambda</span> <span class="n">resp</span><span class="p">:</span> <span class="n">resp</span><span class="p">.</span><span class="n">json</span><span class="p">()).</span><span class="n">then</span><span class="p">(</span><span class="k">lambda</span> <span class="n">data</span><span class="p">:</span> <span class="n">data</span><span class="p">.</span><span class="n">msg</span><span class="p">)</span>
      <span class="p">.</span><span class="n">catch</span><span class="p">(</span><span class="k">lambda</span> <span class="n">err</span><span class="p">:</span> <span class="s">'there was an error: '</span><span class="o">+</span><span class="n">err</span><span class="p">.</span><span class="n">message</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>I personally find this example very cool. JS has the arrow function expression, introduced in ES6, which is very handy for creating callbacks inline. Python’s closest alternative is the <code class="language-plaintext highlighter-rouge">lambda</code> expression. Here we write the code in a JS style and take advantage of promise chains. The <code class="language-plaintext highlighter-rouge">resp.json()</code> function converts the response body into an object that we can then access from Python. This approach also lets us handle rejections.</p>
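<p>The same chaining pattern can be sketched in plain Python, independently of Pyodide. The toy <code class="language-plaintext highlighter-rouge">Chain</code> class below is purely illustrative (it is not part of Pyodide’s API), but it shows how <code class="language-plaintext highlighter-rouge">lambda</code> expressions play the role of JS arrow-function callbacks in <code class="language-plaintext highlighter-rouge">then</code>/<code class="language-plaintext highlighter-rouge">catch</code> chains.</p>

```python
# Toy promise-like chain (illustrative only, not Pyodide's API): `then`
# transforms the value, `catch` intercepts any exception raised earlier.
class Chain:
    def __init__(self, value=None, error=None):
        self._value, self._error = value, error

    def then(self, callback):
        if self._error is not None:
            return self  # an earlier failure skips the remaining `then`s
        try:
            return Chain(callback(self._value))
        except Exception as exc:
            return Chain(error=exc)

    def catch(self, callback):
        if self._error is not None:
            return Chain(callback(self._error))
        return self

# Mirrors fetch(...).then(resp -> resp.json()).then(data -> data.msg):
result = (Chain({'msg': 'hello'})
          .then(lambda resp: resp['msg'])
          .catch(lambda err: 'there was an error: ' + str(err)))
print(result._value)  # hello
```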

<p>However, since <a href="https://pyodide.org/en/stable/project/release-notes/v0.17.0.html#release-notes">v0.17</a>, Pyodide implements <code class="language-plaintext highlighter-rouge">await</code> for <a href="https://pyodide.org/en/stable/usage/api/python-api/ffi.html#pyodide.ffi.JsProxy">JsProxy</a>. When JS returns a <code class="language-plaintext highlighter-rouge">Promise</code>, Pyodide converts it to a <code class="language-plaintext highlighter-rouge">Future</code> in Python, which allows us to use <code class="language-plaintext highlighter-rouge">await</code>. This object, however, has no <code class="language-plaintext highlighter-rouge">then/catch/finally</code> attributes, so it is no longer possible to build chains as in older versions. This should be <a href="https://github.com/pyodide/pyodide/issues/2923">fixed</a> in the future, but for now, we can use the <code class="language-plaintext highlighter-rouge">await</code> keyword to wait for the response:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">js</span> <span class="kn">import</span> <span class="n">window</span>

<span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="n">window</span><span class="p">.</span><span class="n">fetch</span><span class="p">(</span><span class="s">'https://karay.me/assets/misc/test.json'</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="k">await</span> <span class="n">resp</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">data</span><span class="p">))</span>
<span class="c1"># convert JsProxy to Python dict
</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">to_py</span><span class="p">()</span>
<span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<div class="alert alert-info">
    <b>Note:</b>
    <p>Since the code on the demo page is executed using <a href="https://pyodide.org/en/stable/usage/api/js-api.html#pyodide.runPythonAsync">runPythonAsync</a> we can use <code class="language-plaintext highlighter-rouge">await</code> outside of a function.</p>

</div>

<p>As you probably noticed, we had to convert the <code class="language-plaintext highlighter-rouge">JsProxy</code> object to a Python <code class="language-plaintext highlighter-rouge">dict</code> using <a href="https://pyodide.org/en/stable/usage/api/python-api/ffi.html#pyodide.ffi.JsProxy.to_py">JsProxy.to_py</a>. This is required when we communicate between JS and Python. However, some standard types do not need to be converted since this is done implicitly. You can find more information about this <a href="https://pyodide.org/en/stable/usage/type-conversions.html">here</a>.</p>

<!-- The key difference is that it is not a real `Promise`. Therefore, the chaining will execute synchronously and the last value in the chain will be returned instead of a new `Promise`. Besides, as the project is still under development, there are some [issues](https://github.com/iodide-project/pyodide/issues/769). For example, we cannot use `Promise.finally` as this keyword is reserved in Python. -->

<h1 id="accessing-python-scope-from-js"><strong>Accessing Python scope from JS</strong></h1>

<p>We can also go in the opposite direction and get full access to the Python scope from JS through the <a href="https://pyodide.org/en/stable/usage/api/js-api.html?highlight=globals.get#pyodide.globals">pyodide.globals.get()</a> function. Additionally, similar to Python’s <code class="language-plaintext highlighter-rouge">JsProxy.to_py</code>, we also need to convert the returned object to JS type using <a href="https://pyodide.org/en/stable/usage/api/js-api.html#PyProxy.toJs">PyProxy.toJs</a> (we’ve already done this in previous examples). For example, if we import <code class="language-plaintext highlighter-rouge">numpy</code> into the Python scope, we can immediately use it from JS. This option is for those who prefer JS but want to take advantage of Python libraries.</p>

<p>Let’s try it live</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">([</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">])</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Now, I will ask you to open the browser console and run this JS code</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="nx">pyodide</span><span class="p">.</span><span class="nx">globals</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="dl">'</span><span class="s1">x</span><span class="dl">'</span><span class="p">).</span><span class="nx">toJs</span><span class="p">()</span>
<span class="c1">// &gt;&gt;&gt; [Float64Array(3), Float64Array(3), Float64Array(3)]</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>To access Python scope from JS, we use the <a href="https://pyodide.org/en/stable/usage/api/js-api.html#pyodide.globals">pyodide.globals.get()</a> that takes the name of the variable or class as an argument. The returned object is a <code class="language-plaintext highlighter-rouge">PyProxy</code> that we convert to JS using <code class="language-plaintext highlighter-rouge">toJs()</code>.</p>

<p>As you can see, the <code class="language-plaintext highlighter-rouge">x</code> variable was converted to a JS typed array. In earlier versions (prior to v0.17.0), we could directly access the Python scope:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="kd">let</span> <span class="nx">x</span> <span class="o">=</span> <span class="nx">pyodide</span><span class="p">.</span><span class="nx">globals</span><span class="p">.</span><span class="nx">np</span><span class="p">.</span><span class="nx">ones</span><span class="p">(</span><span class="k">new</span> <span class="nb">Int32Array</span><span class="p">([</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">]))</span>
<span class="c1">// x &gt;&gt;&gt; [Float64Array(3), Float64Array(3), Float64Array(3)]</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>Now, we have to manually convert the <code class="language-plaintext highlighter-rouge">shape</code> parameter into a Python type using <a href="https://pyodide.org/en/stable/usage/api/js-api.html#pyodide.toPy">pyodide.toPy</a> and then convert the result back to JS:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="kd">let</span> <span class="nx">x</span> <span class="o">=</span> <span class="nx">pyodide</span><span class="p">.</span><span class="nx">globals</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="dl">'</span><span class="s1">np</span><span class="dl">'</span><span class="p">).</span><span class="nx">ones</span><span class="p">(</span><span class="nx">pyodide</span><span class="p">.</span><span class="nx">toPy</span><span class="p">([</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">])).</span><span class="nx">toJs</span><span class="p">()</span>
<span class="c1">// x &gt;&gt;&gt; [Float64Array(3), Float64Array(3), Float64Array(3)]</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This may <a href="https://github.com/pyodide/pyodide/pull/2906">change</a> in the future, and hopefully most types will be converted implicitly.</p>

<p>Since we have full access to the scope, we can also re-assign new values or even JS functions to variables, and create new ones from JS, using the <code class="language-plaintext highlighter-rouge">globals.set</code> function. Feel free to experiment with the code in the browser console.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="rouge-code"><pre><span class="c1">// re-assign a new value to an existing Python variable</span>
<span class="nx">pyodide</span><span class="p">.</span><span class="nx">globals</span><span class="p">.</span><span class="kd">set</span><span class="p">(</span><span class="dl">'</span><span class="s1">x</span><span class="dl">'</span><span class="p">,</span> <span class="dl">'</span><span class="s1">x is now string</span><span class="dl">'</span><span class="p">)</span>
<span class="c1">// create a new js function that will be available from Python</span>
<span class="c1">// this will show a browser alert if the function is called from Python and msg is not null (None in Python)</span>
<span class="nx">pyodide</span><span class="p">.</span><span class="nx">globals</span><span class="p">.</span><span class="kd">set</span><span class="p">(</span><span class="dl">'</span><span class="s1">alert</span><span class="dl">'</span><span class="p">,</span> <span class="nx">msg</span> <span class="o">=&gt;</span> <span class="nx">msg</span> <span class="o">&amp;&amp;</span> <span class="nx">alert</span><span class="p">(</span><span class="nx">msg</span><span class="p">))</span>
<span class="c1">// this new function will also be available in Python and will return the area of the window</span>
<span class="nx">pyodide</span><span class="p">.</span><span class="nx">globals</span><span class="p">.</span><span class="kd">set</span><span class="p">(</span><span class="dl">'</span><span class="s1">window_square</span><span class="dl">'</span><span class="p">,</span> <span class="kd">function</span><span class="p">(){</span>
  <span class="k">return</span> <span class="nx">innerHeight</span><span class="o">*</span><span class="nx">innerWidth</span>
<span class="p">})</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>All of these variables and functions will be available in the global Python scope:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
</pre></td><td class="rouge-code"><pre><span class="n">alert</span><span class="p">(</span><span class="sa">f</span><span class="s">'Hi from Python. Windows square: </span><span class="si">{</span><span class="n">window_square</span><span class="p">()</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<h1 id="installing-packages"><strong>Installing packages</strong></h1>

<p>If we want to import a module that is not in the Pyodide repository, say <code class="language-plaintext highlighter-rouge">seaborn</code>, we will get the following error</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sb</span>
<span class="c1"># =&gt; ModuleNotFoundError: No module named 'seaborn'
</span></pre></td></tr></tbody></table></code></pre></div></div>

<p>Pyodide currently supports a limited number of <a href="https://github.com/iodide-project/pyodide/tree/master/packages">packages</a>, but you can install the unsupported ones yourself using <a href="https://pyodide.org/en/stable/usage/api/micropip-api.html#micropip.install">micropip</a> module</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="rouge-code"><pre><span class="kn">import</span> <span class="nn">micropip</span>

<span class="k">await</span> <span class="n">micropip</span><span class="p">.</span><span class="n">install</span><span class="p">(</span><span class="s">'seaborn'</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>However, this does not guarantee that the module will work correctly. Also note that a package can only be installed this way if a wheel file is available on <a href="https://pypi.org/">PyPI</a>.</p>

<blockquote>
  <p>If a package is not found in the Pyodide repository, it will be loaded from PyPI. Micropip can only load pure Python packages, or packages with C extensions that are built for Pyodide.</p>
</blockquote>
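<p>As a rough sketch of how an installation failure can be handled: <code class="language-plaintext highlighter-rouge">install_or_report</code> is a hypothetical helper, not part of the micropip API, and it assumes micropip signals a missing compatible wheel with a <code class="language-plaintext highlighter-rouge">ValueError</code> (check the micropip docs for your version).</p>

```python
# Sketch: guard a micropip install so the same code runs inside and
# outside Pyodide. Outside the browser, micropip is unavailable.
try:
    import micropip
except ImportError:
    micropip = None  # not running under Pyodide

async def install_or_report(package):
    """Try to install `package` via micropip; return a status message."""
    if micropip is None:
        return package + ': micropip not available (not running in Pyodide)'
    try:
        await micropip.install(package)
        return package + ': installed'
    except ValueError as err:
        # micropip raises ValueError when it cannot find a compatible wheel
        return package + ': ' + str(err)
```

<p>Inside Pyodide this can be awaited directly (<code class="language-plaintext highlighter-rouge">await install_or_report('seaborn')</code>); in plain CPython it simply reports that micropip is missing.</p>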

<p>The recent major release (<a href="https://blog.pyodide.org/posts/0.21-release/">0.21-release</a>) introduces improvements to the systems for building and loading packages. It is now much easier to build and use binary wheels that are not included in the distribution. It also includes a large number of popular packages, such as <code class="language-plaintext highlighter-rouge">bitarray</code>, <code class="language-plaintext highlighter-rouge">opencv-python</code>, <code class="language-plaintext highlighter-rouge">shapely</code>, and <code class="language-plaintext highlighter-rouge">xgboost</code>.</p>

<p>Detailed information on how to install and build packages can be found <a href="https://pyodide.org/en/stable/development/new-packages.html">here</a>.</p>

<h1 id="advanced-example"><strong>Advanced example</strong></h1>

<p>Finally, let’s look at the last example. Here we will create a plot using <code class="language-plaintext highlighter-rouge">matplotlib</code> and display it on the page. You can reproduce the result by running the following code on the <a href="https://karay.me/examples/pyodide_repl.html">demo page</a>.</p>

<p>First, we import all the necessary modules. Since this loads a number of dependencies, the import may take a few minutes. The download progress can be seen in the browser console.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="rouge-code"><pre><span class="kn">from</span> <span class="nn">js</span> <span class="kn">import</span> <span class="n">document</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="n">stats</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">io</span><span class="p">,</span> <span class="n">base64</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">numpy</code> and <code class="language-plaintext highlighter-rouge">scipy.stats</code> modules are used to create a Probability Density Function (PDF). The <code class="language-plaintext highlighter-rouge">io</code> and <code class="language-plaintext highlighter-rouge">base64</code> modules are used to encode the plot into a Base64 string, which we will later set as the source for an <code class="language-plaintext highlighter-rouge">&lt;img&gt;</code> tag.</p>
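<p>The encoding step can be tried in isolation, outside the browser. In the sketch below the PNG signature bytes stand in for what <code class="language-plaintext highlighter-rouge">plt.savefig</code> would write into the buffer; <code class="language-plaintext highlighter-rouge">to_data_uri</code> is a hypothetical helper, not part of any library:</p>

```python
import base64
import io

def to_data_uri(png_bytes):
    """Encode raw PNG bytes as a data URI usable as an <img> src."""
    buf = io.BytesIO()
    buf.write(png_bytes)  # in the real handler, plt.savefig(buf, format='png') writes here
    buf.seek(0)
    return 'data:image/png;base64,' + base64.b64encode(buf.read()).decode('UTF-8')

# every PNG file starts with the same 8-byte signature
uri = to_data_uri(b'\x89PNG\r\n\x1a\n')
print(uri)  # -> data:image/png;base64,iVBORw0KGgo=
```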

<p>Now let’s create the HTML layout</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="rouge-code"><pre><span class="n">div_container</span> <span class="o">=</span> <span class="n">document</span><span class="p">.</span><span class="n">createElement</span><span class="p">(</span><span class="s">'div'</span><span class="p">)</span>
<span class="n">div_container</span><span class="p">.</span><span class="n">innerHTML</span> <span class="o">=</span> <span class="s">"""
  &lt;br&gt;&lt;br&gt;
  mu:
  &lt;input id='mu' value='1' type="number"&gt;
  &lt;br&gt;&lt;br&gt;
  sigma:
  &lt;input id='sigma' value='1' type="number"&gt;
  &lt;br&gt;&lt;br&gt;
  &lt;button onclick='pyodide.globals.get("generate_plot_img")()'&gt;Plot&lt;/button&gt;
  &lt;br&gt;
  &lt;img id="fig" /&gt;
"""</span>
<span class="n">document</span><span class="p">.</span><span class="n">body</span><span class="p">.</span><span class="n">appendChild</span><span class="p">(</span><span class="n">div_container</span><span class="p">)</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>The layout is pretty simple. The only thing I want to draw your attention to is that we set <code class="language-plaintext highlighter-rouge">pyodide.globals.get("generate_plot_img")()</code> as the button’s <code class="language-plaintext highlighter-rouge">onclick</code> handler. Here, we get the <code class="language-plaintext highlighter-rouge">generate_plot_img</code> function from the Python scope and immediately call it.</p>

<p>After that, we define the handler function itself</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
</pre></td><td class="rouge-code"><pre><span class="k">def</span> <span class="nf">generate_plot_img</span><span class="p">():</span>
  <span class="c1"># get values from inputs
</span>  <span class="n">mu</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">document</span><span class="p">.</span><span class="n">getElementById</span><span class="p">(</span><span class="s">'mu'</span><span class="p">).</span><span class="n">value</span><span class="p">)</span>
  <span class="n">sigma</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">document</span><span class="p">.</span><span class="n">getElementById</span><span class="p">(</span><span class="s">'sigma'</span><span class="p">).</span><span class="n">value</span><span class="p">)</span>
  <span class="c1"># generate an interval
</span>  <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="n">mu</span> <span class="o">-</span> <span class="mi">3</span><span class="o">*</span><span class="n">sigma</span><span class="p">,</span> <span class="n">mu</span> <span class="o">+</span> <span class="mi">3</span><span class="o">*</span><span class="n">sigma</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
  <span class="c1"># calculate PDF for each value in the x given mu and sigma and plot a line
</span>  <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">))</span>
  <span class="c1"># create buffer for an image
</span>  <span class="n">buf</span> <span class="o">=</span> <span class="n">io</span><span class="p">.</span><span class="n">BytesIO</span><span class="p">()</span>
  <span class="c1"># copy the plot into the buffer
</span>  <span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s">'png'</span><span class="p">)</span>
  <span class="n">buf</span><span class="p">.</span><span class="n">seek</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
  <span class="c1"># encode the image as Base64 string
</span>  <span class="n">img_str</span> <span class="o">=</span> <span class="s">'data:image/png;base64,'</span> <span class="o">+</span> <span class="n">base64</span><span class="p">.</span><span class="n">b64encode</span><span class="p">(</span><span class="n">buf</span><span class="p">.</span><span class="n">read</span><span class="p">()).</span><span class="n">decode</span><span class="p">(</span><span class="s">'UTF-8'</span><span class="p">)</span>
  <span class="c1"># show the image
</span>  <span class="n">img_tag</span> <span class="o">=</span> <span class="n">document</span><span class="p">.</span><span class="n">getElementById</span><span class="p">(</span><span class="s">'fig'</span><span class="p">)</span>
  <span class="n">img_tag</span><span class="p">.</span><span class="n">src</span> <span class="o">=</span> <span class="n">img_str</span>
  <span class="n">buf</span><span class="p">.</span><span class="n">close</span><span class="p">()</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>This function generates a plot and encodes it as a Base64 string, which is then set as the source of the <code class="language-plaintext highlighter-rouge">img</code> tag.</p>

<p>You should get the following result:</p>

<div class="full-width">
    <div class="wrapper">
        <div class="full-width-content">
            
<div id="load-advanced-example" style="text-align: center;">
    <button onclick="loadAdvancedExample()">Load Example</button>
    <div>
        <strong>Note that this may take some time and cause the page to freeze</strong>.
    </div>
</div>
<div id="advanced-example-live" style="display: none;">

    <br /><br />
    mu:
    <input id="mu" value="1" type="number" />
    <br /><br />
    sigma:
    <input id="sigma" value="1" type="number" />
    <br /><br />
    <button onclick="pyodide.globals.get(&quot;generate_plot_img&quot;)()" id="plot-btn">Plot</button>
    <br />
    <img id="fig" />

    <script type="text/javascript">

        const python_code = `
from js import document
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import io, base64

def generate_plot_img():
    # get values from inputs
    mu = int(document.getElementById('mu').value)
    sigma = int(document.getElementById('sigma').value)
    # generate an interval
    x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
    # calculate PDF for each value in the x given mu and sigma and plot a line
    plt.plot(x, stats.norm.pdf(x, mu, sigma))
    # create buffer for an image
    buf = io.BytesIO()
    # copy the plot into the buffer
    plt.savefig(buf, format='png')
    buf.seek(0)
    # encode the image as Base64 string
    img_str = 'data:image/png;base64,' + base64.b64encode(buf.read()).decode('UTF-8')
    # show the image
    img_tag = document.getElementById('fig')
    img_tag.src = img_str
    buf.close()`
        
        let plt_btn = document.getElementById('plot-btn')

        async function preparePlot() {
            await pyodide.loadPackagesFromImports(python_code)
            await pyodide.runPythonAsync(python_code)
            pyodide.globals.get("generate_plot_img")()
            plt_btn.innerHTML = 'Plot'
            plt_btn.disabled = false
        }

        function loadAdvancedExample() {
            document.getElementById('load-advanced-example').remove()
            plt_btn.innerHTML = 'Plotting...'
            plt_btn.disabled = true
            let div = document.getElementById('advanced-example-live')
            div.style.display = 'block'

            if (!document.getElementById('pyodide-script')) {
                let pyodide_script = document.createElement('script')
                pyodide_script.id = 'pyodide-script'
                pyodide_script.type = 'text/javascript'
                pyodide_script.addEventListener('load', async () => {
                    // init pyodide
                    window.pyodide = await loadPyodide()
                    preparePlot()
                })
                pyodide_script.src = 'https://cdn.jsdelivr.net/pyodide/v0.21.3/full/pyodide.js'
                div.appendChild(pyodide_script)
            }
            else {
                preparePlot()
            }
        }

    </script>
</div>

        </div>
    </div>
</div>


<p>Every time we click the button, <code class="language-plaintext highlighter-rouge">generate_plot_img</code> is called. The function gets the values from the inputs, generates a plot, and sets it as the source of the <code class="language-plaintext highlighter-rouge">img</code> tag. Since the <code class="language-plaintext highlighter-rouge">plt</code> figure is never closed, we can add more curves to the same figure by changing the <code class="language-plaintext highlighter-rouge">mu</code> and <code class="language-plaintext highlighter-rouge">sigma</code> values and plotting again.</p>
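<p>If accumulating curves is not what you want, the figure can be cleared before each new plot. The sketch below is a headless variation on the handler above — <code class="language-plaintext highlighter-rouge">plot_pdf</code> is a hypothetical helper, and the PDF is computed directly with NumPy instead of <code class="language-plaintext highlighter-rouge">scipy.stats</code>:</p>

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, so this also runs outside a browser
import matplotlib.pyplot as plt
import numpy as np

def plot_pdf(mu, sigma, overlay=False):
    """Plot a normal PDF; clear previously drawn curves unless overlaying."""
    if not overlay:
        plt.clf()  # drop everything drawn so far
    x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
    pdf = np.exp(-(x - mu)**2 / (2*sigma**2)) / (sigma * np.sqrt(2*np.pi))
    plt.plot(x, pdf)
    return len(plt.gca().lines)  # how many curves the axes hold now

plot_pdf(0, 1)                # fresh figure: 1 curve
plot_pdf(1, 2, overlay=True)  # same figure: 2 curves
plot_pdf(0, 3)                # cleared again: 1 curve
```

<p>In the browser handler, the equivalent change would be a single <code class="language-plaintext highlighter-rouge">plt.clf()</code> call at the top of <code class="language-plaintext highlighter-rouge">generate_plot_img</code>.</p>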


<h1 id="conclusion"><strong>Conclusion</strong></h1>

<p>Thanks to Pyodide, we can mix JS and Python and use the two languages interchangeably, allowing us to get the best of both worlds and speed up prototyping.</p>

<p>On the one hand, it enables us to extend JS with a vast number of libraries. On the other hand, it gives us the power of HTML and CSS to create a modern GUI. The final application can then be shared as a single HTML document or uploaded to any free hosting service such as GitHub Pages.</p>

<p>There are, of course, some limitations. Apart from the issues discussed earlier, the main one is the lack of multithreading, which can be partially worked around using <a href="https://pyodide.org/en/stable/usage/webworker.html">WebWorkers</a>.</p>

<p>As mentioned at the beginning, the Iodide project is no longer in development. Pyodide is a subproject of Iodide and is <a href="https://github.com/iodide-project/pyodide/issues/766">still supported</a> by its community, so I encourage everyone to contribute to the project.</p>

<p>As the project is developing quickly, most of the issues mentioned in this guide should be resolved soon. On the other hand, new breaking changes may also be introduced, so it is worth using the latest version and checking the <a href="https://pyodide.org/en/stable/project/changelog.html">changelog</a> before starting a new project.</p>

<p>Wasm is a great technology that opens up many possibilities, and it has a great future. Since almost any existing C/C++ project can be compiled to Wasm, there are already many interesting ports that let you run games such as <a href="http://www.continuation-labs.com/projects/d3wasm/#online-demonstration">Doom 3</a> and <a href="https://milek7.pl/openttd-wasm/">Open Transport Tycoon Deluxe</a> inside modern web browsers, and Google <a href="https://developers.googleblog.com/2020/01/mediapipe-on-web.html">uses Wasm</a> to run <a href="https://google.github.io/mediapipe/getting_started/javascript.html">MediaPipe</a> on the web.</p>

<p>Furthermore, <a href="https://github.com/bytecodealliance/wasmtime/blob/main/docs/WASI-intro.md">WebAssembly System Interface (WASI)</a> makes it possible to take full advantage of Wasm outside the browser:</p>

<blockquote>
  <p>It’s designed to be independent of browsers, so it doesn’t depend on Web APIs or JS, and isn’t limited by the need to be compatible with JS. And it has integrated capability-based security, so it extends WebAssembly’s characteristic sandboxing to include I/O.</p>
</blockquote>

<p>For example, WASI makes it possible to import modules written in any language into <a href="https://nodejs.org/api/wasi.html">Node.js</a> or into other languages (e.g. to import a Rust module into Python), and a recent Pyodide <a href="https://blog.pyodide.org/posts/0.21-release/#rust-and-cmake-support">release</a> introduced support for Rust packages.</p>


<p>I hope this guide was helpful to you and that you enjoyed playing with Pyodide as much as I did.</p>]]></content><author><name></name></author><category term="pyodide" /><category term="python" /><category term="web" /><summary type="html"><![CDATA[Pyodide is a Python distribution for the web. It runs Python in the browser using WebAssembly, and lets you call Python from JavaScript. This post will show you how to use it.]]></summary></entry></feed>