VideoDog – An AI Video Logging Tool That Runs Locally

Update: After posting this 10/13, I’ve had many people reach out to remind me there are companies doing what this program does. This includes Shade, Valossa, MomentsLab, TwelveLabs, Clarifai, ioMoVo – even Azure, Amazon, and Google have solutions. However, none of these run their AI on device; they all require moving assets to the cloud first. I’m more interested in running a product like this locally for privacy and other workflow considerations.

This last weekend, I set out to build an AI-powered video logging tool (an AI Media Asset Management program) that runs locally on my Mac Mini using lightweight open-source models. Below is how I approached it, what worked, and what I learned.

Like a lot of millennials who grew up in action sports, I’ve got hard drives full of old footage collecting dust. Every so often I get the urge to make an edit or tell a story, especially after 15 years in the Tetons, but it’s nearly impossible to find the clips I want. Hobbyists like myself almost never tag or organize their footage in a way that’s searchable years later.

So I decided to tackle that problem myself, and the results show some real promise.

I’ve attached the flowchart (at bottom), but here’s the gist for anyone who doesn’t want to have a stroke looking at it: the Python-based program ingests video into an embedded SQLite3 database and uses SmolVLM2-2.2B-Instruct (via Hugging Face) as the primary vision model to “look” through each video, sampling one frame every X frames. Whisper handles the audio, and CLIP/nano serves as a fallback for visual and object detection.
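To make the gist concrete, here’s a minimal sketch of the ingest-and-log loop. The schema, table names, and `caption_frame` stub are my illustration, not the actual program; a real version would decode frames (e.g. with OpenCV) and call SmolVLM2 where the stub sits.

```python
import sqlite3

# Illustrative schema: one row per clip, one row per sampled frame.
SCHEMA = """
CREATE TABLE IF NOT EXISTS clips (
    id   INTEGER PRIMARY KEY,
    path TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS frame_logs (
    clip_id   INTEGER REFERENCES clips(id),
    frame_idx INTEGER,
    caption   TEXT
);
"""

def caption_frame(frame_idx: int) -> str:
    # Placeholder for the vision-model call (SmolVLM2 via Hugging Face).
    return f"stub caption for frame {frame_idx}"

def log_clip(db: sqlite3.Connection, path: str, n_frames: int, every: int) -> None:
    """Log one caption per sampled frame (every `every` frames) into SQLite."""
    db.executescript(SCHEMA)
    db.execute("INSERT OR IGNORE INTO clips (path) VALUES (?)", (path,))
    clip_id = db.execute("SELECT id FROM clips WHERE path = ?", (path,)).fetchone()[0]
    for idx in range(0, n_frames, every):
        db.execute(
            "INSERT INTO frame_logs (clip_id, frame_idx, caption) VALUES (?, ?, ?)",
            (clip_id, idx, caption_frame(idx)),
        )
    db.commit()

db = sqlite3.connect(":memory:")
log_clip(db, "pow_day_2012.mp4", n_frames=300, every=60)
rows = db.execute("SELECT frame_idx, caption FROM frame_logs ORDER BY frame_idx").fetchall()
print(rows)  # frames 0, 60, 120, 180, 240
```

Because everything lands in plain SQLite, the reporting tool later in the post is just queries over this table.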

How’d it go?

I built a little “reporting” tool that lets me analyze results. Here is what the model is seeing/saying.

As a POC, it works, and it solves the problem. But (and it’s a big but) it’s slow. This reminds me of trying to run a video game on your outdated computer in 1998. It “runs” but you can’t really play. The primary vision model takes about 30 seconds per frame to process, which means my Mac Mini would be maxed out for months if I ran it across all my drives.
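To put numbers on “maxed out for months”: a quick back-of-envelope estimate. The archive size and sampling rate below are assumptions for illustration; only the ~30 s/frame figure comes from my testing.

```python
# Rough throughput estimate for local inference on the Mac Mini.
SECONDS_PER_FRAME = 30    # observed SmolVLM2 inference time per frame
SAMPLE_EVERY_S = 10       # assume one sampled frame per 10 s of footage
HOURS_OF_FOOTAGE = 500    # hypothetical archive size across all drives

frames = HOURS_OF_FOOTAGE * 3600 / SAMPLE_EVERY_S   # 180,000 frames
days = frames * SECONDS_PER_FRAME / 86400
print(f"{days:.1f} days of nonstop inference")       # 62.5 days
```

Even with generous sampling, a modest archive lands in “months” territory, which is what motivates the fixes below.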

Here is the program itself. Bare bones for now.


Possible fixes:

1) Analyze fewer frames. This helps, but eventually defeats the purpose.
2) Downsize the frame capture before inference. Worth exploring.
3) Move from Apple Silicon to an Nvidia GPU and leverage CUDA. But I’m not spending money just yet.
4) Extract frames and batch-process them in the cloud on a big-boy GPU. This would 1000% work, though it means paying for GPU time.
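Option 2 is mostly just arithmetic before the model ever sees a pixel. Here’s a sketch of the resize calculation; the 512-pixel target is an assumption, and whether a given model tolerates arbitrary input sizes is model-dependent.

```python
def downsize(width: int, height: int, target_long_edge: int = 512) -> tuple[int, int]:
    """Scale dimensions so the longer edge equals target_long_edge,
    preserving aspect ratio. Never upscales."""
    scale = target_long_edge / max(width, height)
    if scale >= 1:
        return width, height  # already small enough
    return round(width * scale), round(height * scale)

print(downsize(3840, 2160))  # a 4K frame becomes (512, 288)
print(downsize(320, 240))    # small frames pass through unchanged
```

Fewer input pixels means fewer vision tokens per frame, which is where most of the per-frame cost goes.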

Ultimately, I want users to be able to generate rough video edits just by writing prompts, using their own footage. To me, this is the kind of AI content creation we want. Eventually, I’d like to tie it directly into Premiere through a plugin.
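As a toy illustration of prompt-driven retrieval over logged footage: once captions live in SQLite, even naive keyword matching gets you a rough “find my clips” step. Table and column names here are invented for the example; a real version would use text embeddings rather than `LIKE`.

```python
import sqlite3

# Toy database of captioned frames (names/content are made up).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE frame_logs (clip TEXT, frame_idx INTEGER, caption TEXT)")
db.executemany(
    "INSERT INTO frame_logs VALUES (?, ?, ?)",
    [
        ("jh_2014.mp4", 120, "skier dropping a cliff in deep powder"),
        ("jh_2014.mp4", 480, "lift line on a sunny day"),
        ("gt_2016.mp4", 60, "powder turns through trees"),
    ],
)

def search(prompt_terms: list[str]) -> list[tuple[str, int]]:
    """Return (clip, frame) pairs whose captions contain every prompt term."""
    clause = " AND ".join("caption LIKE ?" for _ in prompt_terms)
    params = [f"%{t}%" for t in prompt_terms]
    return db.execute(
        f"SELECT clip, frame_idx FROM frame_logs WHERE {clause}", params
    ).fetchall()

hits = search(["powder"])
print(hits)  # both powder clips, neither lift-line clip
```

Chaining the matched timestamps into a cut list is the missing piece between this and a rough auto-edit.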

Getting from where I am now to the holy grail of text-to-video editing with your own footage would be a pile of work. There are a lot of nuances to work through, and probably some model fine-tuning with RLHF if I want to really dial it in for this use case, which I frankly welcome as a way to further my learning. But unless GoPro wants to hire me (they should), I don’t know that it’s worth it.

All that said, it’s surprisingly empowering to run a model locally, even if it’s slow. And while SmolVLM2 isn’t state-of-the-art, it’s impressive for its size.

So, what do you think? Anyone else interested in this project?