Gesture-Driven Audiovisual System
Authors: Zofia Mizgalewicz
This project develops an interactive audiovisual system controlled by hand gestures captured through a webcam. I use TouchDesigner to process a live camera feed, detect hand blobs, and extract control signals — position, movement, and possibly size. These signals are mapped to parameters of a visual system, allowing the user to influence the visual environment through gestures.
The goal is an interface where body movement directly shapes what appears on screen — demonstrating real-time computer vision, signal processing, and interactive graphics.
(Links will be added on completion.)
Milestone 1 (09.03) — TD on Linux + Working Camera Input
- TouchDesigner installed and running stably on Linux via Wine/Bottles
- Webcam recognized and accessible as a video input inside TD
- Basic threshold -> crop -> blob track pipeline tested
Getting TouchDesigner to run on Linux through Wine is not easy — device passthrough, library compatibility, and rendering backend all needed to be resolved before any actual project work could begin.
Environment setup: TouchDesigner is Windows-only software. To run it on Linux, I used Bottles — a GUI tool that manages isolated Wine environments. Wine is a compatibility layer that lets Windows applications run on Linux without a virtual machine.
The camera problem: The default Wine runner did not pass the webcam through to TouchDesigner — the device was simply invisible inside TD. Switching the runner to Soda (soda-9.0-1) solved the camera passthrough.
Graphics translation: DXVK (dxvk-2.7.1) translates Direct3D 8/9/10/11 calls to Vulkan, and VKD3D-Proton handles Direct3D 12. These components allow TouchDesigner's renderer to work on Linux hardware.

Result: I built the initial pipeline: threshold -> crop -> blobtrack. The blob tracker is detecting the hand and producing output, which already gives me a start on Milestone 2.
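For context, the sketch below shows roughly what that threshold -> crop -> blob-track chain does, expressed in OpenCV rather than TD nodes. It is an illustrative equivalent, not the actual network; the crop region and threshold value are placeholder guesses.

```python
# Sketch: an OpenCV equivalent of the threshold -> crop -> blob-track chain.
# Not the TouchDesigner network itself; the crop coordinates and threshold
# value below are placeholders.
import cv2

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break

    # Crop: limit the image to the gesture zone
    roi = frame[100:400, 200:500]

    # Threshold: turn the (bright) hand into a white blob on black
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 180, 255, cv2.THRESH_BINARY)

    # Blob track: take the largest contour as the hand, report its centroid
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        hand = max(contours, key=cv2.contourArea)
        m = cv2.moments(hand)
        if m["m00"] > 0:
            cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
            print(f"blob at ({cx:.0f}, {cy:.0f}), area {cv2.contourArea(hand):.0f}")

    cv2.imshow("mask", mask)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```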

Milestone 2 (23.03) — Stable Blob Detection Pipeline
- Reliable hand isolation under varying lighting conditions (3.5h)
- Tuned threshold / background subtraction for clean binary mask (2h)
- Crop region configured to focus on gesture zone (0.5h)
- BlobTrack2 consistently tracking the hand blob without false positives (3h)
The initial plan was TD's native blob tracking with threshold and background subtraction. Surprise, surprise: it didn't work :( Background subtraction worked only in very limited lighting conditions and created scary ghost-like figures in others. Tracking the hands of ghosts is (as far as I know) not possible.
Updated timeline:
- trying to make things work with TD blobtrack - 4h
- reliable hand isolation under varying lighting conditions with MediaPipe - 2h

Fallback strategy - MediaPipe: I replaced the TD-native vision pipeline with Google's MediaPipe hand tracking, running as an external Python process that streams 21-point hand landmarks into TouchDesigner via OSC. This approach is robust across skin tones and lighting conditions, and supports up to two simultaneous hands.
Architecture:
- Python script owns the webcam (OpenCV + MediaPipe)
- Hand landmarks extracted and converted to normalised control signals
- Signals streamed to TD over OSC (/hand0/x, /hand0/y, /hand0/size, /hands/count)
- Camera feedback with tracking is shown in an OpenCV preview window
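A minimal sketch of the Python bridge is below. The OSC port, the size heuristic (bounding-box span of the landmarks), and the use of the landmark centroid are my own illustrative assumptions; the OSC addresses match the ones listed above.

```python
# Sketch of the landmark -> OSC bridge (simplified; port 9000 and the
# size heuristic are placeholder assumptions, not the exact project values).
import cv2
import mediapipe as mp
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 9000)   # TouchDesigner listens with an OSC In CHOP
hands = mp.solutions.hands.Hands(max_num_hands=2, min_detection_confidence=0.5)
cap = cv2.VideoCapture(0)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    detected = results.multi_hand_landmarks or []
    client.send_message("/hands/count", len(detected))

    for i, hand in enumerate(detected):
        xs = [lm.x for lm in hand.landmark]
        ys = [lm.y for lm in hand.landmark]
        # Normalised centre of the 21 landmarks, plus a crude "size" from the
        # bounding-box span (a stand-in for distance to the camera).
        client.send_message(f"/hand{i}/x", sum(xs) / len(xs))
        client.send_message(f"/hand{i}/y", sum(ys) / len(ys))
        client.send_message(f"/hand{i}/size", max(xs) - min(xs))

    cv2.imshow("tracking preview", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```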

Milestone 3 (06.04) — Control Signal Extraction
First reactive visual, a circle controlled by the hand:
- Building first visual (2h)
- Hand X/Y position drives circle position on screen (1.5h)
- Hand distance to camera drives circle size (1h)
- Smoothing to remove jitter and give a responsive, glidable feel (2h)
This step went surprisingly smoothly and took less time than planned :)
Signal smoothing: MediaPipe's raw landmark data jitters slightly from frame to frame. The circle twitched even when my hand was still, so I lagged the signal, smoothing incoming values over ~150 ms — enough to kill the noise while still feeling responsive.
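In TouchDesigner this is just a lag on the incoming channels; as a sketch of the same idea in Python, a one-pole exponential smoother with a ~150 ms time constant would look like this (the class name and frame time are illustrative):

```python
import math

class Smoother:
    """One-pole exponential smoother, roughly equivalent to lagging a CHOP channel."""
    def __init__(self, time_constant=0.15):   # ~150 ms
        self.time_constant = time_constant
        self.value = None

    def update(self, target, dt):
        if self.value is None:
            self.value = target
        # The coefficient depends on the frame time, so behaviour is the same at any FPS.
        alpha = 1.0 - math.exp(-dt / self.time_constant)
        self.value += alpha * (target - self.value)
        return self.value

# e.g. at 60 fps: smooth_x = Smoother(); x = smooth_x.update(raw_x, 1 / 60)
```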
First reactive visual: a circle whose position and radius are driven by hand0/x, hand0/y, and hand0/size. Moving my hand moves the circle; bringing my hand closer to the camera makes the circle grow. The coordinate ranges need to be remapped so that they match what the user sees and everything stays within the camera boundaries - the computer was fine without this, but it matters for the user experience.
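The remapping itself is just a linear range map with clamping. A minimal sketch is below; the 0.2-0.8 input range is an illustrative guess at the usable part of the frame, not the project's actual numbers.

```python
def remap(value, in_min, in_max, out_min, out_max):
    """Linearly map value from one range to another, clamped to the output range."""
    t = (value - in_min) / (in_max - in_min)
    t = max(0.0, min(1.0, t))
    return out_min + t * (out_max - out_min)

# e.g. keep the circle on screen even though the hand only covers part of the frame:
# screen_x = remap(hand_x, 0.2, 0.8, 0.0, 1.0)
```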
The pictures below show how the circle follows the hand position.


Result: the full chain works end-to-end with low latency. This simple circle is far from the final visual, but it confirms that every part of the pipeline is ready for more ambitious generative systems.
Milestone 4 (20.04) — Generative Visual System (First Version)
Generative visual system responding live to hand gestures:
- Visual development — building the graphics, tuning the aesthetic (6h)
- Hand position and distance control parameters of the visual system (1h)
- Smoothing the system so that after interaction, it slowly returns to its zero position (2h)
- Adding sound as a bonus experimental layer (1.5h)
This milestone took me way longer than planned, mostly because of the artistic choices I had to make along the way and the mismatch between my skills and my vision. There were lots of experiments and ideas, trying out different textures and colours.
My inspiration for this part was:
- Cocteau Twins' album Victorialand - dreamy, not rushed, very atmospheric
Feet Like Fins from the album Victorialand on YouTube
- Jan Garbarek's album Dis - slow, modal, Nordic jazz, lots of space
Vandrere from the album Dis on YouTube.
Below is one of the candidates I quite liked but didn't go with - a water droplet distortion.

System: The final choice was a displaced noise field - hands warp the texture, and the warping slowly fades back to the original state over about a second. This gave the piece a contemplative, liquid feel, quite atmospheric.
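The fade-back is a per-frame decay on the displacement amount. As a small sketch, this is the factor that gives roughly a one-second return at 60 fps (treating "1% of the peak" as "back to normal", which is my own choice of threshold):

```python
# Per-frame decay factor so the warp amount falls to ~1% of its value
# in about one second at 60 fps (threshold and frame rate are assumptions).
frames = 60                      # frames in one second
decay = 0.01 ** (1 / frames)     # ~0.926 per frame

# each frame: warp_amount *= decay   (hand movement pushes warp_amount back up)
```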
The picture below shows the displaced noise field being warped by hand movement, with the camera input and mapped hands in the top right; the cyan glow is the coloured trail following the hand position.

Adding colour: once the warping worked, I added a second layer - luminous coloured trails painted by the hands. The colour layer has its own feedback loop with slow decay, so trails linger and bleed together as you move. Two hands with the same colour are composited over the displaced noise. The picture below shows the system so far.
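The trail layer is a standard feedback loop: each frame, the previous output is faded slightly and the new hand marks are drawn on top. A NumPy sketch of the idea is below; the resolution, decay value, and brush shape are illustrative, not the project's actual settings.

```python
import numpy as np

H, W = 720, 1280
trails = np.zeros((H, W, 3), dtype=np.float32)   # accumulation buffer
DECAY = 0.96                                     # <1: trails linger, then fade

def draw_hand(buffer, x, y, colour=(0.0, 1.0, 1.0), radius=12):
    """Stamp a soft cyan dot at the (normalised) hand position."""
    cx, cy = int(x * W), int(y * H)
    ys, xs = np.ogrid[:H, :W]
    mask = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    buffer[mask] = np.maximum(buffer[mask], colour)

def step(hand_positions):
    """One frame of the feedback loop: fade the old trails, paint the new marks."""
    global trails
    trails *= DECAY                  # slow decay: old strokes bleed away
    for x, y in hand_positions:
        draw_hand(trails, x, y)
    return trails                    # composited over the displaced noise
```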

Sound: I started experimenting with a sound layer. My reference was Jan Garbarek, so I built a simple pentatonic scale controlled by hand vertical position. I experimented with real saxophone samples to get closer to the Garbarek tone, but it never sounded good, so I went back to the simpler sine-wave synth approach. Less realistic, but coherent, and it sits nicely under the visuals without fighting them.
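The mapping from hand height to pitch is a simple quantisation onto a pentatonic scale. A sketch is below; the minor-pentatonic layout, root note, and two-octave span are my own illustrative assumptions, not necessarily the exact scale used.

```python
# Map normalised hand height (0 = bottom, 1 = top) onto a pentatonic scale.
# Scale layout, root note, and octave span are illustrative assumptions.
PENTATONIC = [0, 3, 5, 7, 10]         # minor-pentatonic intervals in semitones
ROOT_MIDI = 57                        # A3

def hand_y_to_freq(y, octaves=2):
    steps = len(PENTATONIC) * octaves
    i = min(int(y * steps), steps - 1)
    octave, degree = divmod(i, len(PENTATONIC))
    midi = ROOT_MIDI + 12 * octave + PENTATONIC[degree]
    return 440.0 * 2 ** ((midi - 69) / 12)   # MIDI note -> Hz

# hand_y_to_freq(0.0) gives 220 Hz (A3); higher hand positions step up the scale.
```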

Milestone 5 (04.05) — Polish and Expressiveness
- Dual-screen setup: participant mirror on laptop, visuals on big screen (2h)
- Idle state when no hands are detected (0.5h)
- User testing with other people and adjustments based on observation (1h)
- Sound refinements, release envelope, subtle low drone layer (2h)
- Final aesthetic tuning: colors, bloom, trail decay (3h)
Milestone 6 (18.05) — Final Presentation
- All milestone goals reviewed and documented
- 1–2 minute demo video recorded and embedded
- Repository cleaned up with a README explaining setup and usage
- Live demonstration performed for the class