Making "Simple Made Easy" easier, simply

In Simple Made Easy, Rich Hickey talks about the difference between “simple” (i.e., uncomplicated) and “easy” (i.e., convenient). This is just one of the many great talks that are available on YouTube and other sites.

As much as I like these talks, a couple of things disturb me about them. First, the video content (especially slides and screencasts) isn’t easy for blind users to access. Second, there isn’t any way to index the content of the slides.

I’ve mused for some years about ways to improve this situation, but it always seemed like an insuperable challenge. However, advances in the Elixir ecosystem (e.g., Broadway, Bumblebee, Nx) may be bringing a relatively simple solution into reach. If you find this (speculative!) notion appealing, please read on, comment, etc.

Problem Description

A typical, well-edited conference presentation video will show the speaker, some slides and/or screen content, and perhaps a banner giving the talk and/or conference name, etc. The layout will vary, based on the taste of the person doing the video editing.

So much for input. The desired output would be a set of time-stamped summaries of the slides, preferably in a format such as Markdown. This could be used, along with the audio stream, to allow a blind user to gain access to most of the material being presented. It could also give any interested party an easy way to search for keywords, etc.
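For example, the output for a single slide might look something like this (the timestamp, heading style, and slide text are purely illustrative, not a settled format):

```markdown
## [00:12:30] Slide 17

- first bullet of the slide’s text
- second bullet of the slide’s text
```

A page full of such sections, one per slide, would be both screen-reader-friendly and trivially searchable.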

Here’s a high-level rundown of the steps that might be involved (a rough sketch of the frame-extraction step follows the list):

  • Capture the video stream from the web site.
  • Convert the stream into a time-tagged series of images.
  • Extract the portion of each image containing the slide.
  • Analyze the slide’s textual content.
  • Generate markup to replicate the text and formatting.
  • Save the (time-stamped) markup as a web page.
  • Rinse, repeat…
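As a very rough sketch of the second step: once the video has been saved locally, ffmpeg can already turn it into a series of time-tagged frames. Something like the following (the module name, output layout, and ten-second sampling interval are all just assumptions for illustration; it assumes the `ffmpeg` binary is installed):

```elixir
defmodule SlideExtract.Frames do
  @moduledoc """
  Rough sketch: sample one frame every ten seconds from a locally saved
  video, writing PNGs whose file names encode their index (and thus their
  approximate offset into the talk). Assumes `ffmpeg` is on the PATH.
  """

  @seconds_per_frame 10

  def extract(video_path, out_dir \\ "frames") do
    File.mkdir_p!(out_dir)

    # -vf fps=1/10    -> keep one frame every 10 seconds
    # frame_%04d.png  -> frame_0001.png is ~0 s, frame_0002.png is ~10 s, ...
    {output, status} =
      System.cmd(
        "ffmpeg",
        [
          "-y",
          "-i", video_path,
          "-vf", "fps=1/#{@seconds_per_frame}",
          Path.join(out_dir, "frame_%04d.png")
        ],
        stderr_to_stdout: true
      )

    if status == 0, do: {:ok, out_dir}, else: {:error, output}
  end

  @doc "Approximate offset (in seconds) of a given frame index."
  def timestamp(frame_index), do: (frame_index - 1) * @seconds_per_frame
end
```

Each resulting PNG could then be handed to whatever does the slide extraction and text analysis.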

Of course, there will be complications. Dynamic content, embedded graphics, and live coding all come to mind. However, even a partial solution would be much better than the current impasse.

Might anyone have comments, clues, and/or assistance to offer?

-r


Sounds like a great idea, Rich. I’d be surprised if YouTube isn’t already working on something to make videos in general more accessible. That’s also where I’d start: a broad system that identifies types of videos and applies rules based on the type.

I’m not sure if it’s built in now, but I think TikTok (and probably other similar services) has options that do things like automatically generating subtitles (as well as text-to-speech), so how they go about it could be worth looking into as well.

Not sure whether this is of any help, but found this on huggingface:

If you decide to do it, good luck - keep us posted! :023:

FYI, I just posted a related issue on GitHub:

-r


@AstonJ Yeah, YouTube already does speech-to-text on the video’s audio, accessed via ... > Show transcript. If I’m understanding @Rich_Morin’s proposal correctly, the interesting bit is extracting time-stamped text from slides embedded in the video.

From what I can tell, Bumblebee doesn’t currently offer a high-level API for the image segmentation and OCR tasks that will likely be necessary. We could try making the simplifying assumption that segmentation isn’t strictly necessary, skip isolating the slide portion of each video frame, and just focus on OCR.

We could then use OpenCV via Evision to slice up the video, as outlined in Video object detection in Elixir using Nx and Bumblebee. From there, Using Tesseract OCR in Elixir/Phoenix shows how we could grab the contents of each video frame, via Tesseract OCR: Text localization and detection. Just thinking out loud…
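As a very rough sketch of just the OCR half, shelling out to the tesseract CLI rather than using a NIF wrapper (the module name, frame file layout, and ten-second sampling interval are assumptions carried over for illustration, not anything Bumblebee or Evision provide):

```elixir
defmodule SlideOCR do
  @moduledoc """
  Rough sketch: run Tesseract over a directory of frames sampled at a fixed
  interval and pair each frame's recognized text with its approximate
  timestamp. Assumes the `tesseract` binary is installed and that frames
  are named frame_0001.png, frame_0002.png, ...
  """

  @seconds_per_frame 10

  def run(frames_dir \\ "frames") do
    frames_dir
    |> Path.join("frame_*.png")
    |> Path.wildcard()
    |> Enum.sort()
    |> Enum.map(fn path ->
      # "stdout" tells Tesseract to print the recognized text rather than
      # writing it to an output file; a non-zero exit crashes the sketch.
      {text, 0} = System.cmd("tesseract", [path, "stdout"])
      {timestamp(path), String.trim(text)}
    end)
  end

  # frame_0001.png -> 0 s, frame_0002.png -> 10 s, and so on.
  defp timestamp(path) do
    [index] =
      Regex.run(~r/frame_(\d+)\.png$/, Path.basename(path), capture: :all_but_first)

    (String.to_integer(index) - 1) * @seconds_per_frame
  end
end
```

The resulting `{timestamp, text}` pairs could then be rendered into the time-stamped Markdown described above.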
