Designing and implementing robust systems in production?

I’m fairly confident in my programming skills when dealing with small components; however, I’m a bit uncertain when it comes to actually deploying code that must be kept running for years, be upgraded with new features, and hold customer data.

I’m aware of several issues that need to be addressed:

  • Security
  • Backups
  • Logging
  • Monitoring
  • Software Upgrades
  • Hardware Upgrades/Migrations
  • Networking

Am I missing any? What are some good resources that address these concerns in a practical way?


Scaling? :wink:

Mastering all of these would take a dozen books and several years of hands-on experience at best, but it’s a great goal to strive for. Getting it perfect the first time sounds ideal, yet it’s not exactly realistic. In software development we have the luxury of iteration (unlike, for example, architects or doctors), so I’d say start with what you know and deal with the issues as they come, while still asking for directions and feedback.

Sometimes focusing too much on learning the necessary information upfront can prevent you from actually building things (speaking from experience, I’m still working on it :slight_smile:).


I have built enough systems in my career (with varying degrees of success), so for me it’s time to stop reacting and start acting :slight_smile:

A dozen books seems about right, and not too much work; if you have such a list, I would love to see your recommendations. I have 10+ years of hands-on experience: not enough to master any of the above, but enough to have hunches about things that might go wrong.

I wouldn’t put scaling into the same category; it is an issue only for really popular platforms, whereas the list I presented applies even to intranet-targeting applications.

I am talking about scaling to millions of users, though; scaling to a huge amount of data is application-specific and needs to be addressed in the initial design of the system.

I want to apologize first if I came across as offensive or doubting your experience, definitely not my intention! I think that was the inexperienced me kicking in.

I really wish I could help, but I don’t think I’m in a position to recommend any tips or books yet, so I’ll just subscribe to this thread in case others chime in. :slight_smile:

Oh, sorry if my tone was off! No offence taken :slight_smile:

  • Disaster recovery / Business continuity
  • High availability
  • Service management (ITIL, …)

Some of these issues, at least from the point of view of the runtime system, are addressed in Designing for Scalability with Erlang/OTP; especially the last few chapters:

  • System Principles and Release Handling
  • Release Upgrades
  • Distributed Architectures
  • Systems That Never Stop
  • Scaling Out
  • Monitoring and Preemptive Support

Many of the others feel like they depend a lot on your chosen hosting strategy; if you rely on a database service, and deploy to private or public cloud, perhaps with Docker, or some kind of Heroku-like service, you wouldn’t need to worry about things like backups, hardware, etc. On the other hand, maybe then you can’t cluster in the most straightforward way, perform hot code upgrades, and so forth.

Most of what I know in this area comes from practical experience after 15 years as a sysadmin, DBA and ERP developer (the mix has varied over time), followed by a few years more focused on automation, “devops” and BI development, so I’d be hard pressed to point to a specific set of learning resources, unfortunately :frowning:

One book that stands out though, in the system administration area, is The Practice of System and Network Administration (I own an earlier edition).


It’s true that most of these relate to sysadmin work, so perhaps I should ask for sysadmin resources instead? I have added the suggested books to my Amazon list!

We have a lot of small clients who ask that the services we provide be run on-premises, but who lack any kind of IT department to support them. So far there haven’t been any catastrophic failures, but I believe this is because our choice of tech (especially the database, Postgres) is pretty boring and comes with sane defaults. I want to make sure that by the time something bad happens, at least I know where to start :slight_smile:

Hmm… well, while a lot of the stuff in the sysadmin book I suggested is somewhat geared towards fairly large scale infrastructure, there are also many habits that are good to start picking up even if / while you operate on a smaller scale… So I’d definitely recommend reading it through and picking up what you think could be of benefit :slight_smile:

This is an excellent talk about designing systems:


I’m going to describe industry-specific practices used by companies that build black-box network devices (Ericsson, Cisco, Endace, Riverbed, etc.). They spend a lot of time and effort ‘productizing’ their software so that the device is easily manageable by the customer and needs little sysadmin-type work, which lowers the total cost of ownership. We don’t want sysadmins editing files under /etc/ to configure features of the system. We also want to be able to report the status of the device, perform backups, upgrade software, and install new software or even new versions of the Linux kernel. We want to do all of this without impeding traffic, and to be able to revert to a previous version of the software should the upgrade fail or cause a performance degradation. The devices might be running on a remote mountain where physical access is by helicopter and therefore prohibitively expensive. Typically these devices run software that cannot easily be dockerized or virtualized. Where Docker is an option, you can always snapshot the image before a configuration change, make the change, and roll back to the previous image if you don’t like it; these devices usually don’t have that option. To satisfy these requirements, these products will offer most of the following:

  • A CLI, programmable via netconf
  • A configuration management database. This acts as a central point for all configuration. It becomes an interface to /etc/ or wherever the application reads its config from. It also acts as a bus for other applications: they listen for config changes and act accordingly.
  • Backup and restore of config
  • Performance management - the ability to monitor aspects of the system
  • Split-partition installation - a system upgrade is done on the inactive partition, thus preserving the software on the working partition.
  • Transforms on the configuration database schema to allow clean upgrade/downgrade of software so you don’t end up with problems where the application is expecting a field that doesn’t exist in the config database.
  • Authentication and auditing methods. We want to know who made what change to the system.
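
To make a few of these bullets concrete, here is a minimal sketch in Python of a configuration store that notifies listeners on change, records who changed what, and applies schema transforms for upgrade and downgrade. Everything in it (`ConfigStore`, `migrate`, the `ntp_server` / `ntp_servers` fields, the in-memory storage) is a hypothetical illustration, not how ConfD or any other product mentioned in this thread actually works.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Config = Dict[str, object]
Listener = Callable[[str, object, object], None]  # (key, old value, new value)


@dataclass
class ConfigStore:
    """Hypothetical in-memory stand-in for a configuration database."""
    version: int = 1
    data: Config = field(default_factory=dict)
    listeners: List[Listener] = field(default_factory=list)
    # Audit trail entries: (user, key, old value, new value)
    audit_log: List[Tuple[str, str, object, object]] = field(default_factory=list)

    def subscribe(self, listener: Listener) -> None:
        """Register an application that wants to hear about config changes."""
        self.listeners.append(listener)

    def set(self, user: str, key: str, value: object) -> None:
        """Change one setting, record who did it, then notify listeners."""
        old = self.data.get(key)
        self.data[key] = value
        self.audit_log.append((user, key, old, value))
        for listener in self.listeners:
            listener(key, old, value)


def up_1_to_2(data: Config) -> Config:
    """Schema v1 -> v2: the single 'ntp_server' becomes a list 'ntp_servers'."""
    new = dict(data)
    new["ntp_servers"] = [new.pop("ntp_server")] if "ntp_server" in new else []
    return new


def down_2_to_1(data: Config) -> Config:
    """Schema v2 -> v1: keep only the first server (any others are dropped)."""
    old = dict(data)
    servers = old.pop("ntp_servers", [])
    if servers:
        old["ntp_server"] = servers[0]
    return old


# Transforms keyed by the schema version they start from.
UP: Dict[int, Callable[[Config], Config]] = {1: up_1_to_2}
DOWN: Dict[int, Callable[[Config], Config]] = {2: down_2_to_1}


def migrate(store: ConfigStore, target: int) -> None:
    """Step the stored config up or down one schema version at a time."""
    while store.version < target:
        store.data = UP[store.version](store.data)
        store.version += 1
    while store.version > target:
        store.data = DOWN[store.version](store.data)
        store.version -= 1


store = ConfigStore()
store.subscribe(lambda key, old, new: print(f"{key}: {old!r} -> {new!r}"))
store.set("alice", "ntp_server", "10.0.0.1")
migrate(store, 2)  # software upgrade: config now matches the v2 schema
migrate(store, 1)  # rollback: config matches the v1 schema again
```

The point of pairing every "up" transform with a "down" transform is that a failed software upgrade can be reverted without the old application finding fields it does not understand. A real system would also persist the store and run the transforms as part of the upgrade on the inactive partition.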

Commercial offerings for this sort of platform come from Cisco via TailF, and previously Tallmaple had a system before it was bought by Fireeye. See the confd training videos for a better overview of what really goes into making a production-ready application. In the open source world there is OpenSAF, but the way upgrades and installation of applications are handled is terrible because the design requires you to program in XML.

In my view this isn’t a sysadmin question. Sysadmin and devops work is what happens when your product or application isn’t ‘productized’ properly, so you have to reach for Docker, Heroku, Kubernetes or whatever to upgrade it or keep it highly available. That said, making a manageable product and going the full distance the way a Cisco might is too much work for a small development team.


Great insight. Thanks for posting.