Clarification on DETS/Nerves issue mentioned at ElixirConf

nerves

#1

I’m at the NYC Elixir meetup tonight, and a talk here on ETS/DETS by James Fertel reminded me that I promised @cjbell and some other folks at the last meetup that I would post this here to try to clarify something I was told by the Nerves maintainers (@fhunleth and company) at ElixirConf.

Correct me if I’m wrong, Frank, but you told us not to use DETS or Mnesia on embedded/Nerves projects because they were somehow corrupting SD cards… This sounded very weird to me and to others I talked to, so I wanted to document it here and clarify the cause, if anyone knows it. My only guess was that SD cards somehow lack the wear leveling that SSDs apply to distribute writes evenly across storage, but others thought that didn’t sound right/possible. We wanted to understand what’s known, since DETS and Mnesia would be super convenient for Nerves projects if the problem could be solved, and if it can’t be, there should be some easily findable discussion to save future peeps some hair pulling :slight_smile:


#2

Hmm, that’s possible. I know most SD cards are crazy low quality, and it would explain the short lifetimes I’ve seen from them.


#3

DETS overwrites its data files on updates, and if power is removed abruptly, DETS may have to repair the file on the next boot. There’s some information about this at http://erlang.org/doc/man/dets.html near the warning about using CTRL+C. DETS can’t always fix a corrupted file, and when it can’t, you lose everything in the file. That’s the issue. For many embedded devices, it isn’t possible to ensure that the device is always shut down gracefully, so you would end up writing code to deal with lost DETS tables. If a lost table held a parameter that doesn’t have a good default or can’t be automatically restored, then the device stops working. If you have access to the device to fix the files manually or the DETS tables don’t contain anything critical, then using DETS seems fine.

You may also be interested in how DETS corruption impacted a commandline history implementation: https://github.com/ferd/erlang-history#how-do-you-store-history. Back when I first saw the note, it surprised me, but it’s the same issue.

As for SD card wear levelling (or the lack of it), that can be a problem too. I’ve mainly seen it in devices that log at high verbosity or save video, but if you’re not hammering the SD card, I think that you should be fine. There’s a chance that you’ll run into a flaky SD card, but things seem better now than they did 5 years ago for me (anecdotally, of course).

Also, everyone I know has more than one file for saving data on their production devices. Critical configuration and provisioning data is stored in separate files or separate disk partitions to isolate issues with the main database. They do this even if their main database has journalling, etc. that recovers from crashes and poweroffs well. It may be overkill, but some devices are so hard to manually access that it seems worth it just in case.
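To make the "separate file for critical configuration" idea concrete, here is a minimal sketch. It is written in Python purely for illustration, and the same pattern carries over to other languages. It assumes a POSIX filesystem where renaming over an existing file is atomic. The file format (JSON), the `DEFAULTS` values, and the function names are all hypothetical and not taken from any Nerves project.

```python
import json
import os
import tempfile

# Hypothetical defaults, used when the file is missing or unreadable.
DEFAULTS = {"hostname": "device", "log_level": "info"}


def save_config(path, config):
    """Write config to a temp file, flush it to storage, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".config-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(config, f)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        # Readers see either the old file or the new one, never a half-written one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # Note: a stricter version would also fsync the directory; omitted for brevity.


def load_config(path):
    """Return the stored config, or a copy of DEFAULTS if the file is missing or corrupt."""
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, ValueError):  # missing file, I/O error, or invalid JSON
        return dict(DEFAULTS)
    if not isinstance(data, dict):
        return dict(DEFAULTS)
    return {**DEFAULTS, **data}  # stored values override the defaults
```

The point of the sketch is that losing the main database never touches this file, and a damaged copy of this file degrades to known defaults instead of stopping the device. Whether that is enough depends on whether every critical parameter really has a usable default.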

Hope this helps,
Frank


#4

Thanks Frank! Apologies for the delayed response. I also asked about this at the meetup, and someone there mentioned a problem with Linux AIO, I think, maybe one where it bypasses system drivers? It sounds like they may have been talking about something like this: https://patchwork.kernel.org/patch/8848531/ but that’s definitely not my area. So maybe there are 3 separate possible causes, and they’re hard to disentangle if they’re somewhat non-deterministic? Curious if anyone else knows more about this, but in any case, glad to have it semi-documented for future alchemists considering or struggling with this problem!


#5

I don’t think that it’s anything so complicated. The DETS documentation warns that it’s susceptible to corruption by a user CTRL+C’ing to exit the VM. Surviving a CTRL+C is not a Linux kernel issue. Given how upfront they are about this, I’d say that this is a design decision - maybe to keep things simple or maybe since this isn’t something that happens in their environment. I don’t think that it matters, though, since we’re not lacking for database options (SQLite is popular in embedded and it handles unexpected poweroffs) and there are embedded environments that don’t have to worry about CTRL+C’s or unexpected poweroffs.
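As a rough illustration of what leaning on SQLite’s transactions looks like, here is a small sketch using Python’s built-in `sqlite3` module. The `settings` table, the function names, and the choice of pragmas are assumptions made for this example. Real durability also depends on the storage device honoring flushes, so treat it as a starting point and not as a guarantee.

```python
import sqlite3


def open_settings_db(path):
    """Open (or create) a small key/value settings database."""
    conn = sqlite3.connect(path)
    # Assumption: WAL journaling with synchronous=FULL is a durability-leaning
    # choice; tune both for your storage and write rate.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=FULL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS settings ("
        "key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def put_many(conn, pairs):
    """Write all pairs in one transaction: either every row lands or none does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany(
            "INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)",
            pairs,
        )


def get(conn, key, default=None):
    """Return the stored value for key, or default if the key is absent."""
    row = conn.execute(
        "SELECT value FROM settings WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else default
```

If a batch fails partway through (or, in the power-loss case, never reaches commit), the rollback journal or WAL lets SQLite return to the last committed state on the next open, which is the property DETS does not promise.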


#6

I haven’t investigated the issue for at least 5 years, so I don’t know if the situation is still as bad, but:

I believe most SD cards do apply some kind of wear-levelling and write caching. I would guess that the cheapest cards do it very badly, with the more expensive ones doing much better (no surprise).
Wear-levelling is completely transparent to the host system, and is performed on physical NAND blocks. If the filesystem sitting on top of the SD card has its blocks misaligned with regard to the NAND blocks, the lifetime of the SD card is reduced (see http://3gfp.com/wp/2014/07/formatting-sd-cards-for-speed-and-lifetime/).

Also, write patterns play a very important role: writing a single byte to a NAND block entails writing the entire contents of that NAND block to a new NAND block. This of course reduces the lifetime of the NAND blocks. Should the power be cut during the relocation of the NAND block, the entire NAND block may be lost/corrupted, and not just the single byte written.
I remember running into some issues with having SQLite databases on an SD card, but unfortunately I can’t remember the details.
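
To put rough numbers on the write pattern and alignment points, here is a back-of-the-envelope sketch in Python. It uses the simplified model described above, where any write causes every NAND block it overlaps to be rewritten in full. The 128 KiB block size is a hypothetical figure chosen for the example; real block sizes vary by card and are often undocumented, and real controllers cache and remap writes in ways this model ignores.

```python
def blocks_touched(offset, length, block_size):
    """Number of NAND blocks that a write of `length` bytes at `offset` overlaps."""
    if length <= 0:
        return 0
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1


def write_amplification(offset, length, block_size):
    """Physical bytes rewritten per logical byte, assuming whole-block rewrites."""
    return blocks_touched(offset, length, block_size) * block_size / length


BLOCK = 128 * 1024  # hypothetical NAND block size in bytes

print(write_amplification(0, 1, BLOCK))        # single byte: 131072.0
print(write_amplification(0, BLOCK, BLOCK))    # aligned full block: 1.0
print(write_amplification(512, BLOCK, BLOCK))  # same size, misaligned: 2.0
```

Under this model a misaligned block-sized write costs twice as much as an aligned one, and many tiny writes are by far the worst case, which fits the advice about not hammering the card with high-verbosity logs.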