Belt - A Flexible File-Storage Library

wmnnd · April 16, 2017, 4:27pm

Hi there,

for my project DBLSQD, I needed a file storage solution that is a bit more flexible than Arc. Because I thought others might find it useful as well, I decided to share it with everyone and release it as free software. You can install version 0.1.1 from hex.pm.

What can it do?

Belt allows you to store files on remote or local systems. Currently, SFTP, S3 and the local filesystem are supported as targets. Belt is especially useful if you want to target multiple storage systems that are configured at runtime (e.g. because they are user-provided).
Belt is built on top of GenStage and also supports asynchronous uploads.

Belt also offers some convenience functions such as retrieving hashes of a file.

How does it work?

Here is a little example of how to use Belt:

#Simple file upload
{:ok, config} = Belt.Provider.SFTP.new(host: "example.com", directory: "/var/files",
                                       user: "…", password: "…")
Belt.store(config, "/path/to/local/file.ext")
#=> {:ok, %Belt.FileInfo{…}}


#Asynchronous file upload
{:ok, config} = Belt.Provider.S3.new(access_key_id: "…", secret_access_key: "…",
                                     bucket: "belt-file-bucket")
{:ok, job} = Belt.store_async(config, "/path/to/local/file.ext")
#Do other things while Belt is uploading in the background
Belt.await(job)
#=> {:ok, %Belt.FileInfo{…}}

You can read more on how to install and configure Belt in the Getting Started guide.

Roadmap

Belt is definitely usable the way it is right now but I have planned several additional features such as an Ecto integration.

Feedback welcome!

Suggestions for useful new features are more than welcome.
Also, if you discover bugs or find the documentation lacking in any way, please file a bug report.

Eiji · April 16, 2017, 6:24pm

@wmnnd: nice!

I have some questions:

1. "supports asynchronous uploads": does this include adding uploads to a queue, so that only x files are sent at a time overall and only y files at a time per storage type?
2. Do you support, or plan to support, an upload-done callback/event?
3. Do you support, or plan to support, creating mirrors between storages?
4. Do you support, or plan to support, syncing files/folders automatically?
5. Do you support, or plan to support, priorities? For example: you have a 1 GB folder, but none of your project’s storage providers offers that much space, so you want to split it automatically across x providers and then download/sync that folder to another machine automatically.

wmnnd · April 16, 2017, 8:34pm

@Eiji: Thank you for your questions!

1. This is not currently implemented, but I will add it in the next release! You will be able to limit uploads both overall and per storage Provider.
2. Belt has a concept of Jobs that work similarly to Tasks in Elixir. You can already use that mechanism to create callbacks. But I suppose it’s a feature worth exploring. How would you like to see it implemented? Would you want to be able to execute arbitrary code when a Job has been completed? Is this considered good practice in Elixir? So far I haven’t seen many examples of this kind of callback being used.
3. Queries to multiple targets are not there yet, but they are already on my to-do list.
4. I think this would exceed the scope of a pure file storage library, but you could easily implement something like this on top of Belt.
5. This is an interesting idea, but Belt does not currently keep track of how much storage space a provider has available.
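
As a rough illustration of the "callback on completion" idea discussed above, here is a minimal sketch that uses only Elixir's standard `Task` module. The module name `UploadCallback` and the stand-in upload function are hypothetical; in a real application the stand-in would presumably be replaced by something like `Belt.store/2`, or `Belt.store_async/2` followed by `Belt.await/1`. This is not how Belt itself implements Jobs.

```elixir
defmodule UploadCallback do
  # Hypothetical helper, not part of Belt.
  # `upload_fun` is any zero-arity function that blocks until the upload is done,
  # e.g. fn -> Belt.store(config, path) end (assumed wiring).
  # `callback` receives whatever `upload_fun` returned.
  def run_with_callback(upload_fun, callback)
      when is_function(upload_fun, 0) and is_function(callback, 1) do
    Task.start(fn ->
      result = upload_fun.()
      callback.(result)
    end)
  end
end

parent = self()

# Stand-in upload that succeeds immediately; replace with a real upload call.
{:ok, _pid} =
  UploadCallback.run_with_callback(
    fn -> {:ok, :fake_file_info} end,
    fn result -> send(parent, {:upload_done, result}) end
  )

receive do
  {:upload_done, result} -> IO.inspect(result, label: "upload finished")
after
  1_000 -> IO.puts("timed out waiting for upload")
end
```

Sending a message back to the calling process, as done here, tends to be more common in Elixir than running arbitrary callback code, because the caller stays in control of when and how it reacts.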

hassan · April 16, 2017, 8:38pm

Currently, SFTP, S3 and the local filesystem are supported as targets.

Cool, I’m working a lot with AWS right now, so this response may be
colored by spending a lot of time with the AWS CLI, but

It seems wrong to have to specify a bucket as an option to config;
aws s3 ls alone will give me a list of buckets, which is useful for
further processing
I may have missed it, but most aws-related tools support pulling
secret keys from the standard ~/.aws/credentials file (and reading
an AWS_PROFILE env var if present). Is that supported? And if
not would you entertain a PR?

Regardless, thanks for publishing this.

wmnnd · April 16, 2017, 8:57pm

@hassan: Thank you for your feedback!

One of the ideas behind Belt is to abstract different storage providers behind a unified API. From how I see S3, only a set of credentials with a bucket is an actual valid storage destination. But maybe it’d be worth adding a wrapper for ExAws.S3.list_buckets/1 to Belt.Provider.S3. Do you think that would be worth it in order to avoid having to work directly with ExAws?
Right now, there is no mechanism in Belt that pulls default configuration for any provider. However, I would like to add support for default destinations so you can also use Belt if you don’t need to configure providers at runtime. ExAws which is used by Belt already has support for reading AWS CLI files. How do you imagine using Belt with defaults read from the CLI file? My current approach would be something like this:

{:ok, config} = Belt.Provider.S3.default_config(override_options \\ [])

You could then configure defaults in your config file. A provider could also automatically try to pull common environment variables. What do you think about this idea?

Eiji · April 16, 2017, 9:51pm

@wmnnd:
1, 3 - nice
2. I think GenStage would be a really good fit here: I could listen to the events you produce, so that in async mode a background job could catch errors and run callbacks on specific producer events.
4. If we are talking about a Producer -> Consumers scenario, it could be good to have a standalone process that checks for differences between shared files/folders; once the app gets an event that something has changed, it could call a callback to (ask to) sync the data.

hassan · April 16, 2017, 10:19pm

One of the ideas behind Belt is to abstract different storage providers behind a unified API. From how I see S3, only a set of credentials with a bucket is an actual valid storage destination. But maybe it’d be worth adding a wrapper for ExAws.S3.list_buckets/1 to Belt.Provider.S3. Do you think that would be worth it in order to avoid having to work directly with ExAws?

Put that way, probably not a high priority

Right now, there is no mechanism in Belt that pulls default configuration for any provider. However, I would like to add support for default destinations so you can also use Belt if you don’t need to configure providers at runtime. ExAws which is used by Belt already has support for reading AWS CLI files. How do you imagine using Belt with defaults read from the CLI file? My current approach would be something like this:

{:ok, config} = Belt.Provider.S3.default_config(override_options \\ [])

You could then configure defaults in your config file. A provider could also automatically try to pull common environment variables. What do you think about this idea?

Good point, let me look at how ExAws handles config first before
offering any more useless suggestions

LostKobrakai · April 17, 2017, 8:32am

I’m sure you did your research, but I want to share it anyway: there’s https://flysystem.thephpleague.com/, which does what you created, but in PHP. This might be interesting in terms of kick-starting other cloud providers.

wmnnd · April 17, 2017, 8:51am

@LostKobrakai: Cool, thank you for pointing this out!
I’ll see if I can’t get some good ideas from there. Is there something in particular that you like about Flysystem?

wmnnd · April 19, 2017, 2:48pm

New version 0.1.2

I have just released a small update, version 0.1.2.
It comes with the following changes:

Changelog

Added Belt.delete_all/2 and Belt.delete_scope/3
Added default configurations with Belt.Provider.default/1
You can now configure your own defaults with Mix.Config. Belt.Provider.S3 can also make use of the AWS environment variables and AWS CLI config (read more). Thanks to @hassan for suggesting this.

Roadmap

Next up will be:

Improved configuration of the maximum number of concurrent jobs
Caching of completed job replies.
Ecto integration.

Feedback and suggestions regarding functionality, code quality and/or documentation are always welcome!

wmnnd · April 25, 2017, 9:05am

New version 0.1.4

I have just released a little update, version 0.1.4.

Changelog

Added Belt.Ecto.Config for storing Provider config structs with Ecto
You can now directly store config structs in your database and Belt will take care of serializing and deserializing those structs so that no manual conversion is necessary. This is what it looks like:

#in migrations

create table(:belt_providers) do
  add :config, :map #Belt.Ecto.Config uses Ecto primitive :map
end

#in schemas

schema "belt_providers" do
  field :config, Belt.Ecto.Config
end

Roadmap

Next up will be:

Improved configuration of the maximum number of concurrent jobs
Caching of completed job replies.
Ecto Type for storing FileInfo structs

I’m looking forward to your feedback.

odyright · April 29, 2017, 8:30pm

Hello @wmnnd, does Belt work painlessly with the umbrella structure in Phoenix 1.3 rc1?

wmnnd · April 29, 2017, 8:53pm

Hey @odyright, thank you for your question!
Belt does not in any way rely on Phoenix or the project structure created by Phoenix’s generators; it is its own independent OTP application.
This means you can simply declare Belt as a dependency and add it to the extra_applications list in one (or several) of your own umbrella applications.
The Getting Started guide explains how the installation works in more detail.
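
For orientation, the setup described above might look roughly like the following `mix.exs` of one umbrella child app. The app name `:my_app`, the module name and the version requirement are assumptions made for illustration; the Getting Started guide remains the authoritative reference.

```elixir
# apps/my_app/mix.exs (hypothetical umbrella child app)
defmodule MyApp.Mixfile do
  use Mix.Project

  def project do
    [app: :my_app, version: "0.1.0", deps: deps()]
  end

  def application do
    # Start Belt's OTP application alongside this app.
    [extra_applications: [:logger, :belt]]
  end

  defp deps do
    # Version requirement is an assumption; check hex.pm for the current release.
    [{:belt, "~> 0.1"}]
  end
end
```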

wmnnd · July 3, 2017, 9:07am

New version 0.1.6

For this announcement I have joined the two small releases 0.1.5 and 0.1.6 that bring a couple of small bug fixes and improvements.

Changelog

Only define SFTP and S3 providers if dependencies are available
Improve compatibility with non-AWS S3 providers by normalizing headers
Correctly assemble non-presigned S3 URLs
Fix incompatible API in Belt.Ecto
Ignore :enoent errors in Filesystem Provider implementation of Belt.delete

wmnnd · July 15, 2017, 11:45pm

New version 0.1.8

Since DBLSQD is finally up and running, there is at least one instance of Belt out in the wild now.

More features will come as DBLSQD and Belt evolve. For now, these are the changes of the most recent release:

Changelog

Improvement: When storing files on SFTP, hashes are now calculated locally. Retrieving hashes when none were requested is now a no-op.
Improvement: Better compatibility with SFTP servers with small maximum packet sizes. Thanks to the obscure and undocumented (yet public) functions :ssh_sftp.send_window/2 and :ssh_sftp.recv_window/2, the packet size is now determined dynamically.
Fixed bug causing infinite loops when listing files on some SFTP servers
Fixed connecting to SFTP servers with verify_host set to false
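
To give an idea of what calculating a hash locally can look like, here is a minimal sketch using only the Elixir and Erlang standard libraries. The module name `LocalHash` and the chunk size are assumptions for illustration; this is not Belt's actual implementation.

```elixir
defmodule LocalHash do
  # Read the file in 64 KiB chunks so large files are never fully loaded into memory.
  @chunk_size 65_536

  # Returns the SHA-256 digest of the file at `path` as a lowercase hex string.
  def sha256(path) do
    File.open!(path, [:read, :binary], fn io ->
      io
      |> IO.binstream(@chunk_size)
      |> Enum.reduce(:crypto.hash_init(:sha256), fn chunk, acc ->
        :crypto.hash_update(acc, chunk)
      end)
      |> :crypto.hash_final()
      |> Base.encode16(case: :lower)
    end)
  end
end
```

Hashing the local copy avoids a second round trip to the SFTP server just to read the file back.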

Roadmap

Make jobs more flexible
Explore the possibility of using ETS directly or ConCache as a backend for jobs

Feedback and suggestions are welcome as always!

wmnnd · August 14, 2017, 9:14am

Elixir 1.5 compatibility update 0.1.9

Version 0.1.9 of Belt is now compatible with Elixir 1.5.
The only thing that changed is actually just a dependency on a slightly newer version of GenStage.

abitdodgy · August 23, 2017, 10:43pm

@wmnnd Does Belt support direct uploads to S3, bypassing the Phoenix application? A bit like Refile for Ruby.

wmnnd · August 25, 2017, 8:08am

@abitdodgy Thank you for your question. Direct S3 uploads are currently not supported because they would not allow for certain features such as calculating a file’s hashes and file size.
Could you explain your use case for Belt with direct uploads in a bit more detail?

NobbZ · August 25, 2017, 8:19am

A relayed transfer costs bandwidth and therefore money.

When I have to receive an upload to my personal server first, I have to pay for that transfer.
Then I have to push the payload over to S3, which is another transfer I have to pay for.
Last but not least, I have to pay whatever Amazon charges me for using their service.

I’d like to eliminate the first 2 costs from this list.

(I’m not a user of Belt, nor of S3, but that’s the first thing that comes to my mind when thinking even briefly about it)

I do assume, though, that I can ask S3 for at least the file size after the fact. If it is a proper service, it will also have some way to retrieve a hash value for uploads, so they can easily be verified before pushing the same file again.

abitdodgy · August 25, 2017, 1:11pm

@wmnnd Precisely what @NobbZ said, but I want to add a couple more notes. Beyond cost, the issue becomes more obvious when using a PaaS like Heroku, where Dyno sizes are small and can’t handle large files. Bypassing the app removes the need for heavy processing power and memory consumption. In Ruby, I used Refile, which is incredibly easy and pleasant to work with… except when it comes to medium and large files, where out-of-memory exceptions happen all the time, sometimes even for small files, like 1 MB.

Granted, Elixir’s architecture makes a big difference, but I never tested this. My assumption is that PaaS services aren’t suited for dealing with files.