Gesture-Driven Audiovisual System
Authors: Zofia Mizgalewicz
This project develops an interactive audiovisual system controlled by hand gestures captured through a webcam. I use TouchDesigner to process a live camera feed, detect hand blobs, and extract control signals — position, movement, and possibly size. These signals are mapped to parameters of a visual system, allowing the user to influence the visual environment through gestures.
The goal is an interface where body movement directly shapes what appears on screen — demonstrating real-time computer vision, signal processing, and interactive graphics.
(Links will be added on completion.)
Milestone 1 (09.03) — TD on Linux + Working Camera Input
- TouchDesigner installed and running stably on Linux via Wine/Bottles
- Webcam recognized and accessible as a video input inside TD
- Basic threshold -> crop -> blob track pipeline tested
Getting TouchDesigner to run on Linux through Wine is not easy — device passthrough, library compatibility, and rendering backend all needed to be resolved before any actual project work could begin.
Environment setup: TouchDesigner is Windows-only software. To run it on Linux, I used Bottles — a GUI tool that manages isolated Wine environments. Wine is a compatibility layer that lets Windows applications run on Linux without a virtual machine.
The camera problem: The default Wine runner did not pass the webcam through to TouchDesigner — the device was simply invisible inside TD. Switching the runner to Soda (soda-9.0-1) solved the camera passthrough.
Graphics translation: DXVK (dxvk-2.7.1) translates Direct3D 8/9/10/11 calls to Vulkan, and VKD3D-Proton handles Direct3D 12. These components allow TouchDesigner's renderer to work on Linux hardware.

Result: I built the initial pipeline: threshold -> crop -> blobtrack. The blob tracker is detecting the hand and producing output, which already gives me a start on Milestone 2.
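For context, the sketch below shows roughly what that threshold -> crop -> blob-track chain does, expressed in OpenCV rather than TD nodes. It is an illustrative equivalent, not the actual network; the crop region and threshold value are placeholder guesses.

```python
# Sketch: an OpenCV equivalent of the threshold -> crop -> blob-track chain.
# Not the TouchDesigner network itself; the crop coordinates and threshold
# value below are placeholders.
import cv2

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break

    # Crop: limit the image to the gesture zone
    roi = frame[100:400, 200:500]

    # Threshold: turn the (bright) hand into a white blob on black
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 180, 255, cv2.THRESH_BINARY)

    # Blob track: take the largest contour as the hand, report its centroid
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        hand = max(contours, key=cv2.contourArea)
        m = cv2.moments(hand)
        if m["m00"] > 0:
            cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
            print(f"blob at ({cx:.0f}, {cy:.0f}), area {cv2.contourArea(hand):.0f}")

    cv2.imshow("mask", mask)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```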

Milestone 2 (23.03) — Stable Blob Detection Pipeline
- Reliable hand isolation under varying lighting conditions (3.5h)
- Tuned threshold / background subtraction for clean binary mask (2h)
- Crop region configured to focus on gesture zone (0.5h)
- BlobTrack2 consistently tracking the hand blob without false positives (3h)
The initial plan was TD's native blob tracking with threshold and background subtraction. Surprise, surprise: it didn't work :( Background subtraction worked only in very limited lighting conditions and created scary ghost-like figures in others. Tracking the hands of ghosts is (as far as I know) not possible.
Updated timeline:
- trying to make things work with TD blobtrack - 4h
- reliable hand isolation under varying lighting conditions with MediaPipe - 2h

Fallback strategy - MediaPipe: I replaced the TD-native vision pipeline with Google's MediaPipe hand tracking, running as an external Python process that streams 21-point hand landmarks into TouchDesigner via OSC. This approach is robust across skin tones and lighting conditions, and supports up to two simultaneous hands.
Architecture:
- Python script owns the webcam (OpenCV + MediaPipe)
- Hand landmarks extracted and converted to normalised control signals
- Signals streamed to TD over OSC (/hand0/x, /hand0/y, /hand0/size, /hands/count)
- Camera feedback with tracking is shown in an OpenCV preview window
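A minimal sketch of the Python bridge is below. The OSC port, the size heuristic (bounding-box span of the landmarks), and the use of the landmark centroid are my own illustrative assumptions; the OSC addresses match the ones listed above.

```python
# Sketch of the landmark -> OSC bridge (simplified; port 9000 and the
# size heuristic are placeholder assumptions, not the exact project values).
import cv2
import mediapipe as mp
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 9000)   # TouchDesigner listens with an OSC In CHOP
hands = mp.solutions.hands.Hands(max_num_hands=2, min_detection_confidence=0.5)
cap = cv2.VideoCapture(0)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    detected = results.multi_hand_landmarks or []
    client.send_message("/hands/count", len(detected))

    for i, hand in enumerate(detected):
        xs = [lm.x for lm in hand.landmark]
        ys = [lm.y for lm in hand.landmark]
        # Normalised centre of the 21 landmarks, plus a crude "size" from the
        # bounding-box span (a stand-in for distance to the camera).
        client.send_message(f"/hand{i}/x", sum(xs) / len(xs))
        client.send_message(f"/hand{i}/y", sum(ys) / len(ys))
        client.send_message(f"/hand{i}/size", max(xs) - min(xs))

    cv2.imshow("tracking preview", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```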

Milestone 3 (06.04) — Control Signal Extraction
First reactive visual, a circle controlled by the hand:
- Building first visual (2h)
- Hand X/Y position drives circle position on screen (1.5h)
- Hand distance to camera drives circle size (1h)
- Smoothing to remove jitter and give a responsive, glidable feel (2h)
This step went surprisingly smoothly and took less time than planned :)
Signal smoothing: MediaPipe's raw landmark data jitters slightly from frame to frame. The circle twitched even when my hand was still, so I lagged the signal, smoothing incoming values over ~150 ms — enough to kill the noise while still feeling responsive.
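In TouchDesigner this is just a lag on the incoming channels; as a sketch of the same idea in Python, a one-pole exponential smoother with a ~150 ms time constant would look like this (the class name and frame time are illustrative):

```python
import math

class Smoother:
    """One-pole exponential smoother, roughly equivalent to lagging a CHOP channel."""
    def __init__(self, time_constant=0.15):   # ~150 ms
        self.time_constant = time_constant
        self.value = None

    def update(self, target, dt):
        if self.value is None:
            self.value = target
        # The coefficient depends on the frame time, so behaviour is the same at any FPS.
        alpha = 1.0 - math.exp(-dt / self.time_constant)
        self.value += alpha * (target - self.value)
        return self.value

# e.g. at 60 fps: smooth_x = Smoother(); x = smooth_x.update(raw_x, 1 / 60)
```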
First reactive visual: a circle whose position and radius are driven by hand0/x, hand0/y, and hand0/size. Moving my hand moves the circle; bringing my hand closer to the camera makes the circle grow. The coordinate ranges need to be remapped so that they match what the user sees and everything stays within the camera boundaries - the computer was fine without this, but it matters for the user experience.
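The remapping itself is just a linear range map with clamping. A minimal sketch is below; the 0.2-0.8 input range is an illustrative guess at the usable part of the frame, not the project's actual numbers.

```python
def remap(value, in_min, in_max, out_min, out_max):
    """Linearly map value from one range to another, clamped to the output range."""
    t = (value - in_min) / (in_max - in_min)
    t = max(0.0, min(1.0, t))
    return out_min + t * (out_max - out_min)

# e.g. keep the circle on screen even though the hand only covers part of the frame:
# screen_x = remap(hand_x, 0.2, 0.8, 0.0, 1.0)
```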
The pictures below show how the circle follows the hand position.


Result: the full chain works end-to-end with low latency. This simple circle is far from the final visual, but it confirms that every part of the pipeline is ready for more ambitious generative systems.
Milestone 4 (20.04) — Generative Visual System (First Version)
Generative visual system responding live to hand gestures:
- Visual development — building the graphics, tuning the aesthetic (6h)
- Hand position and distance control parameters of the visual system (1h)
- Smoothing the system so that after interaction, it slowly returns to its zero position (2h)
- Adding sound as a bonus experimental layer (1.5h)
This milestone took me way longer than planned, mostly because of the artistic choices I had to make along the way and the mismatch between my skills and my vision. There were lots of experiments and ideas, trying out different textures and colours.
My inspiration for this part was:
- Cocteau Twins' album Victorialand - dreamy, not rushed, very atmospheric
Feet Like Fins from the album Victorialand on YouTube
- Jan Garbarek's album Dis - slow, modal, Nordic jazz, lots of space
Vandrere from the album Dis on YouTube.
Below is one of the candidates I quite liked but didn't go with - a water droplet distortion.

System: The final choice was a displaced noise field - hands warp the texture, and the warping slowly fades back to the original state over about a second. This gave the piece a contemplative, liquid feel, quite atmospheric.
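The fade-back is a per-frame decay on the displacement amount. As a small sketch, this is the factor that gives roughly a one-second return at 60 fps (treating "1% of the peak" as "back to normal", which is my own choice of threshold):

```python
# Per-frame decay factor so the warp amount falls to ~1% of its value
# in about one second at 60 fps (threshold and frame rate are assumptions).
frames = 60                      # frames in one second
decay = 0.01 ** (1 / frames)     # ~0.926 per frame

# each frame: warp_amount *= decay   (hand movement pushes warp_amount back up)
```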
The picture below shows the displaced noise field being warped by hand movement, with the camera input and mapped hands in the top right; the cyan glow is the coloured trail following the hand position.

Adding colour: once the warping worked, I added a second layer - luminous coloured trails painted by the hands. The colour layer has its own feedback loop with slow decay, so trails linger and bleed together as you move. Two hands with the same colour are composited over the displaced noise. The picture below shows the system so far.
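The trail layer is a standard feedback loop: each frame, the previous output is faded slightly and the new hand marks are drawn on top. A NumPy sketch of the idea is below; the resolution, decay value, and brush shape are illustrative, not the project's actual settings.

```python
import numpy as np

H, W = 720, 1280
trails = np.zeros((H, W, 3), dtype=np.float32)   # accumulation buffer
DECAY = 0.96                                     # <1: trails linger, then fade

def draw_hand(buffer, x, y, colour=(0.0, 1.0, 1.0), radius=12):
    """Stamp a soft cyan dot at the (normalised) hand position."""
    cx, cy = int(x * W), int(y * H)
    ys, xs = np.ogrid[:H, :W]
    mask = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    buffer[mask] = np.maximum(buffer[mask], colour)

def step(hand_positions):
    """One frame of the feedback loop: fade the old trails, paint the new marks."""
    global trails
    trails *= DECAY                  # slow decay: old strokes bleed away
    for x, y in hand_positions:
        draw_hand(trails, x, y)
    return trails                    # composited over the displaced noise
```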

Sound: I started experimenting with a sound layer. My reference was Jan Garbarek, so I built a simple pentatonic scale controlled by hand vertical position. I experimented with real saxophone samples to get closer to the Garbarek tone, but it never sounded good, so I went back to the simpler sine-wave synth approach. Less realistic, but coherent, and it sits nicely under the visuals without fighting them.
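The mapping from hand height to pitch is a simple quantisation onto a pentatonic scale. A sketch is below; the minor-pentatonic layout, root note, and two-octave span are my own illustrative assumptions, not necessarily the exact scale used.

```python
# Map normalised hand height (0 = bottom, 1 = top) onto a pentatonic scale.
# Scale layout, root note, and octave span are illustrative assumptions.
PENTATONIC = [0, 3, 5, 7, 10]         # minor-pentatonic intervals in semitones
ROOT_MIDI = 57                        # A3

def hand_y_to_freq(y, octaves=2):
    steps = len(PENTATONIC) * octaves
    i = min(int(y * steps), steps - 1)
    octave, degree = divmod(i, len(PENTATONIC))
    midi = ROOT_MIDI + 12 * octave + PENTATONIC[degree]
    return 440.0 * 2 ** ((midi - 69) / 12)   # MIDI note -> Hz

# hand_y_to_freq(0.0) gives 220 Hz (A3); higher hand positions step up the scale.
```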

Milestone 5 (04.05) — Polish and Expressiveness
- Dual-screen setup: participant mirror on laptop, visuals on big screen (2h)
- Idle state when no hands are detected (0.5h)
- User testing with other people and adjustments based on observation (1h)
- Sound refinements, release envelope, subtle low drone layer (2h)
- Final aesthetic tuning: colors, bloom, trail decay (3h)
Milestone 6 (18.05) — Final Presentation
- All milestone goals reviewed and documented
- 1–2 minute demo video recorded and embedded
- Repository cleaned up with a README explaining setup and usage
- Live demonstration performed for the class