In his talk “Simple Made Easy”, Rich Hickey talks about the difference between “simple” (i.e., uncomplicated) and “easy” (i.e., convenient). This is just one of the many great talks that are available on YouTube and other sites.
As much as I like these talks, a couple of things disturb me about them. First, the video content (especially slides and screencasts) isn’t easy for blind users to access. Second, there is no way to index or search the textual content of the slides.
I’ve mused for some years about ways to improve this situation, but it always seemed like an insuperable challenge. However, advances in the Elixir ecosystem (e.g., Broadway, Bumblebee, Nx) may be bringing a relatively simple solution into reach. If you find this (speculative!) notion appealing, please read on, comment, etc.
Problem Description
A typical, well-edited conference presentation video will show the speaker, some slides and/or screen content, and perhaps a banner giving the talk or conference name. The layout will vary, depending on the taste of the person doing the video editing.
So much for input. The desired output would be a set of time-stamped summaries of the slides, preferably in a format such as Markdown. Combined with the audio stream, this would let a blind user follow most of the material being presented. It would also give any interested party an easy way to search for keywords, etc.
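For concreteness, the entry for a single slide might look something like this. The timestamp and slide text are invented for illustration, and the exact layout is wide open:

```markdown
## [00:12:30]

### Simple vs. Easy

- Simple: uncomplicated; one role, one task
- Easy: convenient; near at hand
```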
Here’s a high-level rundown of the steps that might be involved:
- Capture the video stream from the web site (first sketch below).
- Convert the stream into a time-tagged series of still images (also first sketch).
- Extract the portion of each image containing the slide (second sketch).
- Analyze the slide’s textual content, e.g., via OCR (also second sketch).
- Generate markup to replicate the text and formatting (third sketch).
- Save the (time-stamped) markup as a web page (also third sketch).
- Rinse, repeat…
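To make the first two steps concrete, here is a minimal Elixir sketch, assuming `yt-dlp` and `ffmpeg` are installed and on the PATH. The module name, file paths, and the one-frame-every-five-seconds sampling rate are illustrative assumptions, not established APIs:

```elixir
defmodule TalkScraper.Capture do
  @moduledoc """
  Sketch of steps 1-2: fetch a talk video and sample it into a
  time-tagged series of still frames. Assumes `yt-dlp` and `ffmpeg`
  are on the PATH; all names and rates are illustrative.
  """

  # Download the talk to a local file (step 1).
  def fetch(url, out_file \\ "talk.mp4") do
    {_, 0} = System.cmd("yt-dlp", ["-o", out_file, url])
    out_file
  end

  # Sample one frame every `interval` seconds (step 2).
  # ffmpeg's fps filter writes frame_0001.png, frame_0002.png, ...,
  # so frame N corresponds roughly to (N - 1) * interval seconds.
  def to_frames(video, dir \\ "frames", interval \\ 5) do
    File.mkdir_p!(dir)

    {_, 0} =
      System.cmd("ffmpeg", [
        "-y", "-i", video,
        "-vf", "fps=1/#{interval}",
        Path.join(dir, "frame_%04d.png")
      ])

    # Return a list of {seconds, path} pairs, in time order.
    dir
    |> File.ls!()
    |> Enum.sort()
    |> Enum.with_index()
    |> Enum.map(fn {name, i} -> {i * interval, Path.join(dir, name)} end)
  end
end
```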
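Steps 3 and 4 might look like the following, again leaning on external tools: ffmpeg’s crop filter and the Tesseract OCR CLI. The crop geometry is a made-up placeholder; in practice it would have to be detected or configured per video layout. A Bumblebee-hosted vision model might eventually replace Tesseract here, but I haven’t verified that path:

```elixir
defmodule TalkScraper.Ocr do
  @moduledoc """
  Sketch of steps 3-4: crop the slide region out of a frame and run
  OCR on it. Assumes `ffmpeg` and `tesseract` are on the PATH; the
  crop geometry is a placeholder needing per-layout tuning.
  """

  # Crop `width:height:x:y` out of the frame (step 3).
  def crop_slide(frame, out \\ "slide.png", geometry \\ "1280:720:0:0") do
    {_, 0} =
      System.cmd("ffmpeg", ["-y", "-i", frame, "-vf", "crop=#{geometry}", out])

    out
  end

  # OCR the cropped slide to plain text (step 4).
  # `tesseract <image> stdout` prints recognized text to standard output.
  def extract_text(image) do
    {text, 0} = System.cmd("tesseract", [image, "stdout"])
    String.trim(text)
  end
end
```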
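Finally, steps 5 and 6: rendering `{seconds, text}` pairs as a time-stamped Markdown page. The section layout is my own invention; anything that pairs a readable timestamp with the recognized text would serve:

```elixir
defmodule TalkScraper.Markdown do
  @moduledoc """
  Sketch of steps 5-6: render {seconds, text} pairs as a Markdown
  page with one time-stamped section per slide. The layout is
  illustrative, not a fixed format.
  """

  def render(entries, title \\ "Untitled Talk") do
    sections =
      Enum.map(entries, fn {seconds, text} ->
        "## [#{hhmmss(seconds)}]\n\n#{text}\n"
      end)

    Enum.join(["# #{title}\n" | sections], "\n")
  end

  def save(markdown, path \\ "talk.md"), do: File.write!(path, markdown)

  # Format integer seconds as HH:MM:SS.
  defp hhmmss(s) do
    [div(s, 3600), rem(div(s, 60), 60), rem(s, 60)]
    |> Enum.map_join(":", &String.pad_leading(Integer.to_string(&1), 2, "0"))
  end
end
```

Wired together, the whole pipeline is a short pipe (the URL and title are placeholders):

```elixir
# Hypothetical end-to-end run over one talk.
frames =
  "https://example.com/talk"
  |> TalkScraper.Capture.fetch()
  |> TalkScraper.Capture.to_frames()

entries =
  Enum.map(frames, fn {ts, frame} ->
    {ts, frame |> TalkScraper.Ocr.crop_slide() |> TalkScraper.Ocr.extract_text()}
  end)

entries
|> TalkScraper.Markdown.render("Simple Made Easy")
|> TalkScraper.Markdown.save()
```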
Of course, there will be complications. Dynamic content, embedded graphics, and live coding all come to mind. However, even a partial solution would be much better than the current impasse.
Might anyone have comments, clues, and/or assistance to offer?
-r