What's the cost of DB as a service?

Its default settings are very generic: not optimized for any specific load, but decent on average.

I’ve had to tweak things on occasion, and even minor tweaks can make a significant difference for some workload types! :slight_smile:

I find it easier just to use a plugin to replicate all changes to another backup server in real time. If the load ever got high enough, I could even have both serve ‘most’ queries without issue. PostgreSQL has some fantastic plugins!
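
For anyone curious, one way to do this kind of near-real-time replication is PostgreSQL’s built-in logical replication; a minimal sketch (the host, database and role names are placeholders, and the actual plugin in use here may well be something else, like pglogical or plain streaming replication):

```sql
-- On the primary: logical decoding must be enabled first (requires a restart).
ALTER SYSTEM SET wal_level = 'logical';

-- Publish changes from every table.
CREATE PUBLICATION all_changes FOR ALL TABLES;

-- On the backup server (with the same schema already created there):
CREATE SUBSCRIPTION backup_site
  CONNECTION 'host=primary.example.com dbname=app user=replicator password=secret'
  PUBLICATION all_changes;
```

The subscriber stays a normal, queryable database, which is what makes the “have both serve most queries” option possible.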

Yeah, this is fairly true of PostgreSQL; changing the memory settings is the main thing I do.
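
For reference, the usual memory knobs look something like this; the values are placeholders for a machine with roughly 16 GB of RAM, not recommendations:

```sql
-- Rough starting points only; tune against the actual workload.
ALTER SYSTEM SET shared_buffers = '4GB';          -- often ~25% of RAM; needs a restart
ALTER SYSTEM SET effective_cache_size = '12GB';   -- planner hint: RAM available for caching
ALTER SYSTEM SET work_mem = '64MB';               -- per sort/hash node, per query
ALTER SYSTEM SET maintenance_work_mem = '1GB';    -- VACUUM, CREATE INDEX, etc.
SELECT pg_reload_conf();                          -- picks up everything except shared_buffers
```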

In my case it would have been more like $30k+/month. If your site’s load is low enough that $50 covers AWS, then it is probably easier just to get a droplet or whatever they are called instead. ^.^;

And cost, they are SOOOO much cheaper!!

For me the dedicated server has always been significantly cheaper than *aaS offerings.

Never an issue here either. My big server is replicated in near real-time to both another, smaller server and my home server. ‘If’ it ever goes down (it never has in 19 or 20 years! knocks on wood) I can reverse-proxy to a backup in seconds (I could automate that, but eh, I’ve never had the need…).

Doesn’t really take that much time, and for a saving of $30k I’ll happily do it myself; plus it’s enjoyable figuring everything out. ^.^

4 Likes

This is really the entire conversation for any aaS offering though. Why use Sendgrid when you can set up your own Postfix, etc.?

When it comes to the database, it’s usually the most critical part of your system and people tend to feel a lot better about trusting a platform utilized by many people over learning and setting up each line item themselves. I manage several largish PG instances now (couple of TB each) and if we were running in AWS I’d have used theirs in a heartbeat (RDS or Aurora).

The trade-off bonus is that I can install extensions that aren’t on their whitelist, which I do, and which I appreciate.

Without RDS/Aurora I have to:

  • Set up backups
  • Set up WAL archiving and backups (see the sketch after this list)
  • Manually create a replica
  • Manually run version upgrades
  • Forgo the automatic zero-downtime upgrades you’d get through their managed failover
  • Manually set up monitoring, both within the DB and on the server
  • Manually monitor logs
  • Manually set up alerting and paging
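
To make the first two bullets concrete, a minimal sketch of hand-rolled WAL archiving; the paths are placeholders, and in practice most people reach for a tool like pgBackRest or WAL-G rather than a bare archive_command:

```sql
-- Continuous WAL archiving, the basis for point-in-time recovery.
ALTER SYSTEM SET archive_mode = 'on';   -- requires a restart
ALTER SYSTEM SET archive_command =
  'test ! -f /backups/wal/%f && cp %p /backups/wal/%f';

-- A periodic base backup is still needed alongside the WAL stream, e.g. from cron:
--   pg_basebackup -D /backups/base/$(date +%F) -X stream -c fast
```

And that only covers the backup bullets; the replica, monitoring, and alerting items are each their own pile of work.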

Running on GCP, the PG offering they have isn’t really as polished, but Compute Engine persistent disks do make managing your own a lot easier. Running our own has been very stable, but I’d still sleep better with everything AWS has to offer.

4 Likes

I use Postfix :003:

Have you looked into the cost difference? I would be interested in what it might be.

2 Likes

I have. When time costs are factored in, Aurora PG is a clear winner (even over regular RDS, because that still requires a little bit of hand-holding). With large databases on RDS I still used to spend about 6 hours a month looking into things. Aurora took that to 0 hours, with lower service costs.

Aurora is the primary reason I’d choose AWS-based platforms for anything serious. It’s that good.

4 Likes

I’d be interested in knowing the cost difference (minus time/other costs) for a straight comparison, but no problem if you can’t share.

For many ‘solopreneurs’ (quoting the OP here :lol:) time is something they might have, whereas they might not have limitless funds.

4 Likes

Cost isn’t simple; it varies with a number of factors, including risk.

You’re talking about never having had a failure, which is fine, and if you’re only considering a single node then banking on that can totally make sense. I’m considering 100K+ servers, 1K+ applications and 1K+ databases deployed across 1K+ physical locations. With a mean time to failure of various components in the 10-year range, that means I need to think in terms of continuous partial failure as the norm. Servers will fail every day, probably in cascades/clusters, because that’s how shit happens.
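
To put a rough number on that: 100,000 servers with a mean time to failure of around 10 years works out to roughly 100,000 / 3,650 ≈ 27 failures per day on average, before any clustering.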

You’re talking about being able to flip things over manually. I’m in a world where a critical failure at the wrong time can cost tens of thousands of dollars per second. How many seconds will it take you to solve the problem? I’m in a world where an increase in latency can cost millions of dollars as people abandon an app, or a site, or a shopping cart in a store.

Say I’m in a highly seasonal business, where our hardware utilization is high for a small part of the year and very low for the rest: *aaS services let me scale and pay for what I need, when I need it. That makes it cheaper in real terms.

Say I have processes that only run 5 * 8. I can turn off my databases for the rest of the week and pay only for storage; that’s 40 of the week’s 168 hours, so I’m paying for compute roughly a quarter of the time. That makes them much cheaper.

Combine scaling only when I need it and scheduled shutdowns and I can reduce my costs from *aaS offerings by a really high percentage.

Can you build something that solves those problems yourself? Absolutely! Is it easy or cheap? No, it isn’t. It also isn’t easy to test it and make sure that it will actually work when you need it to. Cloud/*aaS providers are much more capable of investing the time and energy necessary for that.

Not all money is created equal, either: in general, purchasing hardware is a capital expenditure, which means I need to carry the value of that hardware for years, pay taxes on it, and either have the money up front or borrow it. SaaS services are all operational expenditure, which makes them more attractive from a taxation and net-present-value-of-capital perspective.

Say I’m in San Francisco, where the full cost of an employee is somewhere in the region of $250K per year - salary, pension, healthcare, training, insurance, holidays, office-space, travel, etc. etc. etc. If I’m considering 2000 staff, what percentage of our total productive output should we spend on database administration?

Overall, I want to minimize the total cost of ownership of our databases whilst creating an acceptable risk profile and the right performance characteristics. *aaS is sometimes the right option for that, and sometimes it isn’t.

I toss out these figures not because I or the company I work for is special, but precisely because we’re absolutely not. There are thousands of companies like us around the world. This is the reality of enterprise software; we’re not optimizing for the cost of individual servers, we’re optimizing for organizational-level concerns and we’re architecting for risk because, in a big enough corpus of possibilities, ten-thousand-to-one risks happen dozens of times every day.

Your concerns are different from mine, I totally respect that your choices work for you, and I’m not suggesting you should do anything different. It doesn’t work for me and it certainly isn’t always cheaper, in real terms.

10 Likes

For many ‘solopreneurs’ (quoting the OP here :lol:) time is something they might have, whereas they might not have limitless funds.

I’ll just speak to my own experience: time to spend learning is available, though TBH (in year four) I’m past the point where the cost of an “aaS” is a limiting factor. In Heroku-speak, I’m still on Standard dynos and the first non-Hobby level of database.

For me, the biggest cost comes in terms of a recovery situation, since my escalation path is… myself learning something new under pressure.

So, “time to learn” is certainly there. “Time to fix” isn’t. Does that resonate?

5 Likes

That was my last job. ^.^;

For just over 8 years I worked at a place managing about 40,000 old SCO UNIX servers (they tried to replace them with Windows 2003 but never could, so most locations ended up running two servers, one of each; I primarily managed the SCO UNIX ones). Very much not cloud: they were all physical servers at each physical location, and they had to keep working if the internet went down, if the central servers couldn’t push updates, etc… etc… And they were rock solid, unlike the Windows ones; we generally lost only 1 or 2 a day (we sent out replacements, but on-sites usually had a backup in the city for immediate replacement as well).

Precisely! That is why we had to have such uptime, even in the case of internet or even power failures; you can’t have that if you are completely cloud-based. Even the low-yield places still did multiple thousands of dollars a day; the average was around $30k/day (the high-yield ones were… impressive…).

This is why physical servers are so important.

And then your internet goes down and you are out $750 every minute, for example. With our servers we still ran fine without internet: I could still connect over dial-up if I needed to, the server used dial-up to verify credit/debit transactions, and even if dial-up was down (say the power was out in the city and things were running on backup) it could still batch the transactions (we’d eat the cost if a debit/credit transaction ended up not going through; it was worth it to keep up the client’s goodwill, and it was usually a pittance, if anything).

Or just pay $2500 outright for one of these SCO UNIX servers (~$5000 for the full server/backup setup) per location, and it would last 7-10 years on average (excepting a few outliers; we had a fun issue where a magnetic field generated by a motor was killing one site’s server much, much faster until that was figured out…), through downtime, power outages, etc… Even with the cost of power, that’s still generally less than a dollar a day for reliability that is unachievable with *aaS.

Although these places were on and operational 18 to 24 hours a day, depending on the location and day of the week.

It was actually quite easy to manage, and it was significantly cheaper than *aaS would have been, especially with the very common downtime (on average ~2-5% of all sites had internet connectivity issues at any given time, generally for multiple days at a stretch, because local ISPs tend to really suck; I have a special hatred for AT&T and Comcast because of that job…).

Except it’s also a service that you don’t own or control at all; you are at the whims of, say, AWS going down for a period of time, as happened not long ago. We only had to worry about each site individually, not everything going down at once. There was no reason for a site to be unable to take credit cards for more than 15 minutes once a day (a certain report-generation job accessed and locked that system), though many on-site locations were down longer because they weren’t doing things right (and if customers complained to us at corporate we would raise hell with the site, check their logs and all, and always got proof).

We had ~4 Level 3 people (the programmers), ~8 Level 2 people (system diagnostics and repair), and about 30-50 Level 1 people (limited access; they mostly walked the sites through the issues they could solve themselves, or passed them up to L2 or L3 for us to handle), for all ~40k locations (we didn’t handle China; their locations were… another team, because China is irritating, though they had a total of like 20 people altogether). That’s to handle the database, synchronization, communication issues, failures, everything; and we handled it efficiently, fast, and properly.

And that is precisely my point: with *aaS you have a lot of things outside of your control. Even a trivial internet outage of just a couple of days can put a site in the red; we couldn’t stand that, and I doubt most other places could either.

I run my own personal things on dedicated servers as well; I truly think it is the best way of running things, especially if you are reliant on the internet being up and accessible (that’s why even my 4 current personal servers are hosted with 3 different companies: if any of them has issues, I can have another take up the slack immediately, which you can’t do if your stuff is hosted at, say, AWS and AWS stops responding for 2 hours).

Oh, and as for the above servers: if the server went down, the individual client machines on-site would take over about 60% of its functionality, batching all the data until the server was set back up again, at which point it would take all the accumulated information back.

2 Likes

In both of these sections I think you might be conflating two different things: edge devices which need to be in physical locations and servers which can be anywhere. Capable edge devices with durable on-board storage and fallback manual processes are definitely important to overall reliability. But that says nothing about where server infrastructure should be. In fact, the more capable the edge device the less important it is where the server infrastructure is.

I totally agree, you can’t use cloud services for edge devices, because they need to be at the edge, not in the middle.

The cloud offers me the option of multi-provider, multi-region, multi-availability-zone, global replication over dedicated links, giving me tested, hardened, reliable and fast redundancy against more types of failure than I can manage by myself.

The other consideration is that I’m not talking about just edge devices or just internet access, but both: I have physical locations, customer-facing web properties and mobile applications, as well as global supply-chain and logistics concerns.

I can’t pay for 1 server, because I want multiple redundant nodes, in multiple data-centers, in each of the eastern US, western US, western Europe, northern India, China, and Japan. I also want production, development and test setups that replicate my production environment. That means I need to buy 3 servers * 3 locations * 6 regions * 3 environments, giving me an ideal state of 162 servers. I don’t want to do that. Even if I decide I only want dev and test clusters near my engineers (accepting that I won’t actually develop or test my global failover plans, my intra-region failover, or my regional latency), I’m still buying dozens of servers.

With cloud options I don’t need any of that, I pay for exactly what I use, not what I might need. When I’m buying servers I always have to pay for what I might need.

Maybe loss-free, global replication of many petabytes of data and tens of millions of transactions per second, across 1K+ databases and 5K+ physical database servers, balancing consistency and availability concerns whilst providing consistent latency, is easy for you. But I suspect that most people don’t find it to be so.

There are tiers of concerns here, each of which has an on-prem analog:

  • Loss of a node.
  • Loss of part of a data-center (a rack, hypervisor, whatever…).
  • Loss of a whole data-center.
  • Loss of all of the data-centers in a region.
  • Loss of all of the data-centers in a country.
  • Loss of all of the data-centers of a provider.
  • Loss of all of the data-centers of all providers everywhere.

If the last one happens, we’re screwed whatever happens because it probably means the internet itself is down everywhere, for everyone.

All of the rest are solvable with a multi-cloud strategy, almost for free since you don’t need to pay for these features until you use them, unless you need to reserve capacity. They are only solvable with your own servers at great expense.

I was talking about 2000 or so people just involved in the process of creating and maintaining software. The total headcount is a couple of orders of magnitude higher than that. Even 1% of 2000 * 1/4 million is… lots (20 people at $250K is $5M a year).

Defense in depth, my friend, defense in depth. Building that yourself is possible but expensive, since you have to buy the infrastructure before you need it. You get most of it with a single cloud provider in a trivial manner, and a multi-cloud strategy gives you additional layers, at additional cost and complexity.

You can only fall back to other hardware if you’ve pre-purchased the spare capacity to do so. If you’re running at 89+% utilization rates you can’t do that. If you’re not running at high utilization rates under peak load you’ve over-provisioned.

Like I said, I’m not judging the solutions that you’ve created to solve your problems, but they don’t solve my problems.

I also think I’ve said enough on this subject, so I’ll leave the last word to you.

3 Likes

What kind of app is it and what kind of backup frequency are you opting for (or need) on the DBaaS?

The edge devices weren’t the servers I was speaking of; those were the terminals, numbering from 4 at the lowest, to 8 on average, up to 60 in the largest areas. The server is what they communicated with, and it synced to a central set of servers once a day (if it could: it tried over the internet, then tried dial-up, and otherwise held the data for later).

Essentially it was this:

  • 4-60 terminals per site
  • 2 servers per site
  • A whole ton of secondary, display-only devices on site, anywhere from 10 to 80
  • About 40,000 sites
  • A central server setup at corporate headquarters with significant redundancy and extreme CPU load from all the constant report generation (it had a few dozen maxed-out blade servers by the time I left), along with 3 internet paths, a bank of (I think) 4096 dial-in devices, dedicated fiber connections between the warehouses in different cities that did not go over the internet, etc…

Even then, those central servers only managed sync data; if they went entirely down (which never happened in the 8+ years I was there), all sites would still be fully up and functional, maybe just with some out-of-date pricing or description information (and that could always be uploaded manually pretty trivially, which we had to do via floppy or USB when a site had no internet or dial-up connectivity).

The only time a site would be technologically down (worst case, if the technology all blew up at once, they could still operate manually, just with no credit/debit transactions really; the rest was documented and entered later) was if the main server in the back was down, the backup server in the city didn’t exist, the main #1 terminal (a beefier terminal than the rest) was down (it acted as an intermediate backup server until the main backup server was installed; all terminals migrated to it if the back server went down), and all the terminals went down (in the worst case the terminals themselves could run in standalone mode and resync later when some type of server came back up). The main corporate servers were mostly just for data-aggregation purposes, generally checked by an owner or area supervisor once a day at most (since they only updated once a day anyway).

Oh, and yes, the main corporate servers do host a large website, online ordering, mobile ordering interfaces, etc… all of which would interact with a site over the internet (and since a site’s individual internet is flaky, the uptime on that remote ordering is not a big deal, not that the big servers ever went down anyway). And yes, the global supply chain of this company is probably only dwarfed by Amazon; it likely even outdid Walmart. :wink:

That sounds like a large enough business to warrant complete ownership of their entire infrastructure at that point, however.

The thing is, even ‘just’ for the report servers, the cost of any modern cloud offering would have absolutely dwarfed the cost of the entire infrastructure at corporate in full, people’s salaries included. You offset the hassle of managing it yourself or hiring dedicated people at an absolutely exorbitant price! If you are small enough that it is a small price, a dedicated VM should serve you quite well while also being cheaper; and if you need more power, a dedicated server does not cost that much either, significantly less than cloud offerings, unless you are running only a couple of hours of CPU a month (in which case, why not just run it in-house?!).

That sounds about like what we had: petabytes of data, all stored since they started keeping it electronically in the early ’90s, across 40k databases, with certain information (far from all of it) synced to the main servers for report generation (detailed reports were generated at each site’s own server, which was also quite busy generating reports; that was most of its CPU load by far). The 40k sites did not need consistent data, only data relevant to their specific site plus item data (the largest synced list), synced from area databases (which were area-specific modifications of the central worldwide pull).

Consistent latency is the exact opposite of what was possible: as stated earlier, a huge chunk of sites routinely had no connectivity at all, and every single site had connectivity issues because of local ISPs at least a few times in its life. Everything was built around the assumption of extremely flaky connections (the satellite connections to the sites in the middle of nowhere were especially ‘fun’ ^.^;) and non-realtime synchronization (since realtime is just outright impossible a lot of the time). The system would work fine even when they eventually deploy sites to the Moon or Mars, once humans finally settle those. Perfect consistency across all servers is an impossibility in physical reality.

Nicely, at that last job all sites would still be completely functional through any of those failures, and still mostly functional through the worst of them.

Significantly less expense: there’s no way a cloud offering would be able to support such a service. The only cloud offering they ever tried was remote-site inter-telecom handling, which pretty quickly died off because the reliability of most local ISPs made it untenable. So imagine running the entire business off cloud infrastructure, whether over a flaky local ISP that drops out often, a satellite connection with over 4 seconds of latency during the rare times it actually works, a 19.6k dial-up line (though it was surprising to ever get even 2400 baud at some sites due to all the noise on the line), or with no connectivity at all for stretches of days at a time.

The software for all our systems was developed by well under 20 people, the support systems by another 20 or so, and the L1 support staff numbered at most 100 in the busiest of years.

A significantly greater cost in the long run, and quite often a very, very difficult migration path off of them to more efficient setups, such as dedicated systems, especially when you start tossing in link issues at user-local interfaces.

If the main server of a site ran at, say, 100% utilization (report generation is a horror), then when it failed the backup servers wouldn’t pick that work back up; they did what they needed to do to get the site fully operational. The reports are extra fluff that isn’t needed in an emergency, so you don’t need that utilization at those times.

Or you’re just not using the hardware effectively. There was always “something” for the main site servers to do, whether finalizing credit transactions a little faster than once a day, or generating one of the big monthly reports that scanned the millions to billions of rows added in that period to keep it up to date, or whatever; they always had ‘something’ to do and would happily pause the fluff if any situation demanded it.

Not my creation; it was built up by the company over 25 years of running that infrastructure in the face of significant reliability concerns (this sounds a lot like Erlang’s/BEAM’s history…). ^.^

Aww, but I’m finding this fun; I don’t often get to reminisce about that place, and it’s hard to relate it to people. ^.^

I’m really, really curious whether a cloud service could ever actually replace functionality such as this, but I just don’t see it as possible, purely because of physical/real-life limitations unrelated to the hardware. The big servers really do practically nothing there other than generate reports and push new item information out a few weeks ahead of when it activates (that early because of the reliability concerns they’ve built everything around: everything has an ‘activation’ date and is pushed down well in advance; many things are planned out multiple years in advance and already pushed to the site servers). Their reliability is not a big thing, and yet they’ve never had a complete outage with them (a partial outage just slows down report generation; it never stopped, which is the difference between getting a report in 15 minutes compared to 2 hours).

A big issue I see with Database as a Service specifically (rather than ‘everything’ as a service) is that it is remote: you have latency to contend with, you have no local replication (at least I haven’t seen any service where, say, AWS or Google helps with local replication; it has to be application-driven, which is useless work for the application), you have outages when their service goes down even when your application servers are up, etc… etc…

What kind of app is it and what kind of backup frequency are you opting for (or need) on the DBaaS?

Ed tech. Parent-teacher communication through SMS. I’m content right now with daily Heroku backups because, in a worst-case scenario, I can make API calls to my SMS vendor to recreate sent messages. If I had more hands on this than my own, I’m sure I’d have that code written right now. :wink:

If your DC has noticeable latency to one of the DCs of the big 3 IaaS platforms, then you need to switch DCs; and if a few additional ms of latency to your DB matter, then a lot of other factors would stop you from using DBaaS anyhow. :slight_smile:

Heroku also has a rollback feature that allows you to revert the database state to a previous point in time.

2 Likes