Making "Simple Made Easy" easier, simply

In Simple Made Easy, Rich Hickey talks about the difference between “simple” (i.e., uncomplicated) and “easy” (i.e., convenient). This is just one of the many great talks that are available on YouTube and other sites.

As much as I like these talks, a couple of things disturb me about them. First, the video content (especially slides and screencasts) isn’t easy for blind users to access. Second, there isn’t any way to index the content of the slides.

I’ve mused for some years about ways to improve this situation, but it always seemed like an insuperable challenge. However, advances in the Elixir ecosystem (e.g., Broadway, Bumblebee, Nx) may be bringing a relatively simple solution into reach. If you find this (speculative!) notion appealing, please read on, comment, etc.

Problem Description

A typical, well-edited conference presentation video will show the speaker, some slides and/or screen content, and perhaps a banner giving the talk and/or conference name, etc. The layout will vary, based on the taste of the person doing the video editing.

So much for input. The desired output would be a set of time-stamped summaries of the slides, preferably in a format such as Markdown. This could be used, along with the audio stream, to allow a blind user to gain access to most of the material being presented. It could also give any interested party an easy way to search for keywords, etc.
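For example, the output for a single slide might look something like this (the timestamp, heading style, and slide text are purely illustrative, not a settled format):

```markdown
## [00:12:30] Slide 17

- first bullet of the slide’s text
- second bullet of the slide’s text
```

A page full of such sections, one per slide, would be both screen-reader-friendly and trivially searchable.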

Here’s a high-level rundown of the steps that might be involved (a rough sketch of the frame-extraction step follows the list):

  • Capture the video stream from the web site.
  • Convert the stream into a time-tagged series of images.
  • Extract the portion of each image containing the slide.
  • Analyze the slide’s textual content.
  • Generate markup to replicate the text and formatting.
  • Save the (time-stamped) markup as a web page.
  • Rinse, repeat…
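As a very rough sketch of the second step: once the video has been saved locally, ffmpeg can already turn it into a series of time-tagged frames. Something like the following (the module name, output layout, and ten-second sampling interval are all just assumptions for illustration; it assumes the `ffmpeg` binary is installed):

```elixir
defmodule SlideExtract.Frames do
  @moduledoc """
  Rough sketch: sample one frame every ten seconds from a locally saved
  video, writing PNGs whose file names encode their index (and thus their
  approximate offset into the talk). Assumes `ffmpeg` is on the PATH.
  """

  @seconds_per_frame 10

  def extract(video_path, out_dir \\ "frames") do
    File.mkdir_p!(out_dir)

    # -vf fps=1/10    -> keep one frame every 10 seconds
    # frame_%04d.png  -> frame_0001.png is ~0 s, frame_0002.png is ~10 s, ...
    {output, status} =
      System.cmd(
        "ffmpeg",
        [
          "-y",
          "-i", video_path,
          "-vf", "fps=1/#{@seconds_per_frame}",
          Path.join(out_dir, "frame_%04d.png")
        ],
        stderr_to_stdout: true
      )

    if status == 0, do: {:ok, out_dir}, else: {:error, output}
  end

  @doc "Approximate offset (in seconds) of a given frame index."
  def timestamp(frame_index), do: (frame_index - 1) * @seconds_per_frame
end
```

Each resulting PNG could then be handed to whatever does the slide extraction and text analysis.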

Of course, there will be complications. Dynamic content, embedded graphics, and live coding all come to mind. However, even a partial solution would be much better than the current impasse.

Might anyone have comments, clues, and/or assistance to offer?

-r


Sounds like a great idea, Rich. I’d be surprised if YouTube isn’t already working on something to make videos in general more accessible. That’s also where I’d start: a broad system that identifies types of videos and applies rules based on the type.

I’m not sure if it’s built in now, but I think TikTok (and probably other similar services) has options that do things like automatically generating subtitles (as well as text-to-speech), so how they go about it could be worth looking into as well.

Not sure whether this is of any help, but found this on huggingface:

If you decide to do it, good luck - keep us posted! :023:

FYI, I just posted a related issue on GitHub:

-r


@AstonJ Yeah, YouTube already does speech-to-text on the video’s audio, accessed via ... > Show transcript. If I’m understanding @Rich_Morin’s proposal correctly, the interesting bit is extracting time-stamped text from slides embedded in the video.

From what I can tell, Bumblebee doesn’t currently offer a high-level API for the image segmentation and OCR tasks that will likely be necessary. We could try making the simplifying assumption that segmentation isn’t strictly necessary, skip isolating the slide portion of each video frame, and just focus on OCR.

We could then use OpenCV via Evision to slice up the video, as outlined in Video object detection in Elixir using Nx and Bumblebee. From there, Using Tesseract OCR in Elixir/Phoenix shows how we could grab the contents of each video frame, via Tesseract OCR: Text localization and detection. Just thinking out loud…
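As a very rough sketch of just the OCR half, shelling out to the tesseract CLI rather than using a NIF wrapper (the module name, frame file layout, and ten-second sampling interval are assumptions carried over for illustration, not anything Bumblebee or Evision provide):

```elixir
defmodule SlideOCR do
  @moduledoc """
  Rough sketch: run Tesseract over a directory of frames sampled at a fixed
  interval and pair each frame's recognized text with its approximate
  timestamp. Assumes the `tesseract` binary is installed and that frames
  are named frame_0001.png, frame_0002.png, ...
  """

  @seconds_per_frame 10

  def run(frames_dir \\ "frames") do
    frames_dir
    |> Path.join("frame_*.png")
    |> Path.wildcard()
    |> Enum.sort()
    |> Enum.map(fn path ->
      # "stdout" tells Tesseract to print the recognized text rather than
      # writing it to an output file; a non-zero exit crashes the sketch.
      {text, 0} = System.cmd("tesseract", [path, "stdout"])
      {timestamp(path), String.trim(text)}
    end)
  end

  # frame_0001.png -> 0 s, frame_0002.png -> 10 s, and so on.
  defp timestamp(path) do
    [index] =
      Regex.run(~r/frame_(\d+)\.png$/, Path.basename(path), capture: :all_but_first)

    (String.to_integer(index) - 1) * @seconds_per_frame
  end
end
```

The resulting `{timestamp, text}` pairs could then be rendered into the time-stamped Markdown described above.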
