Steamroller: an opinionated Erlang code formatter

dtip · December 12, 2019, 11:31am

I’m an Erlang developer originally but have been working with Elixir for a while now. The quality of the Elixir toolchain, libraries, and community is fantastic. The focus always seems to be on modern tools and modern development best practice.

In particular I’m a massive fan of mix format. Code consistency is so important for readability and it’s great to see this as a first-class citizen in Elixir.

I feel like this is an area where Erlang is lacking. We have erl_tidy but in my completely biased opinion it makes Erlang code harder to read!

I’ve been building Steamroller to bring more of an Elixir-flavour code format to Erlang. It’s opinionated, which means you can change the line length and nothing else, and it’s super early stage, which means you can expect it to break if you try to run it on a large codebase. It can handle the rebar3 source code, but at the moment can’t deal with everything in the OTP source.

Steamroller works by tokenising the source code with erl_scan:string and formatting that. Any existing formatting is blasted and everything is standardised. The abstract form is checked before and after formatting to make sure that the code is still functionally the same.
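
To make the before/after check concrete, here is a minimal sketch of comparing abstract forms with location information erased. It is not Steamroller's actual code: the module name `ast_check` and its functions are hypothetical, and it assumes source without macros, since `erl_parse` cannot parse unexpanded `?MACRO` tokens.

```erlang
-module(ast_check).
-export([same_ast/2, forms/1]).

%% Hypothetical helper, not part of Steamroller.
%% Parse source text into abstract forms with all annotations reset.
forms(Src) ->
    {ok, Tokens, _End} = erl_scan:string(Src),
    [strip(parse(FormTokens)) || FormTokens <- split_forms(Tokens, [], [])].

%% Split the token stream into one token list per form, cutting after each dot.
split_forms([], [], Acc) ->
    lists:reverse(Acc);
split_forms([{dot, _} = Dot | Rest], Cur, Acc) ->
    split_forms(Rest, [], [lists:reverse([Dot | Cur]) | Acc]);
split_forms([Token | Rest], Cur, Acc) ->
    split_forms(Rest, [Token | Cur], Acc).

parse(FormTokens) ->
    {ok, Form} = erl_parse:parse_form(FormTokens),
    Form.

%% Erase line information so that only the structure is compared.
strip(Form) ->
    erl_parse:map_anno(fun(_Anno) -> erl_anno:new(0) end, Form).

%% True when both source strings parse to the same abstract forms.
same_ast(Before, After) ->
    forms(Before) =:= forms(After).
```

With this sketch, `ast_check:same_ast("f(X) -> X + 1.", "f(X) ->\n    X + 1.\n")` holds because only the layout differs, while changing `1` to `2` makes the comparison fail.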

You can get it from hex as a rebar3 plugin: {plugins, [steamroller]}.
You can run it with: rebar3 steamroll
and as part of CI: rebar3 steamroll --check

Roadmap as of 2019-12-12:

Autoformat the OTP source code without crashing
Tidy up, refactor, and document the source so it’s not a total mess
Add whitespace padding similar to Elixir
Consider changing the indent from 4 to 2 spaces
Add editor integration for people who don’t use Vim

Feel free to try it out and complain at me when it breaks. Aim is to work on it gradually through 2020 and get a polished v1 by the end of the year. Comments, feedback, bug reports, and improvement suggestions are all welcome!

NobbZ · December 12, 2019, 11:41am

Will it provide an API that I can use to pretty print generated AST from erl_syntax?

dtip · December 12, 2019, 12:01pm

No plans to add that at the moment. For now the interface takes a file name and formats the file. It probably wouldn’t be tough to turn it into a more generic pretty-printer though. Definitely worth thinking about once it’s in a more stable state!

NobbZ · December 12, 2019, 12:07pm

That’s sad to hear, as I currently use erl_pp to pretty print generated code in the “erlang exercism test generator”, which is basically erl_tidy from AST instead of from file… So the generated code looks ugly, and I’m searching for alternatives.

I think I will check your output and probably apply it as a post-processing step once the output is more readable.

rvirding · December 12, 2019, 2:03pm

A very quick question: how do you handle comments wrt whether they start with %, %% or %%%?

dtip · December 12, 2019, 2:19pm

At the moment %% and %%% are treated the same: they’re padded with blank lines before and after.
% comments aren’t padded with blank lines so they can sit right on top of the piece of code they’re related to.
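
As an illustration of the rule described above, the following hand-written module shows the intended layout. It is a sketch of the described behaviour, not verified Steamroller output, and the module and function names are made up.

```erlang
-module(comment_demo).
-export([area/1]).

%% A double-percent comment: padded with a blank line before and after.

area({square, Side}) ->
    % A single-percent comment sits right on top of the code it describes.
    Side * Side;
area({circle, Radius}) ->
    math:pi() * Radius * Radius.
```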

Comment handling definitely needs a bit more thought to get right. At the moment inline comments aren’t allowed, as in mix format. But I can see situations where it can be handy to have in-line comments - it could make it harder for the comment to accidentally go stale. Would be curious to hear from the Elixir team why they decided to disallow in-line comments in their formatter.

rvirding · December 12, 2019, 2:34pm

The current comment style with different handling of comments depending on the number of % at the start of the comment is “classic”. It has always been like that in the emacs erlang mode; there is direct support for it in emacs.

It is in fact much older, coming from MIT Maclisp from the 60’s and 70’s, which is where emacs originally came from and therefore got its handling.

I don’t know why the Elixir formatter does its comments the way it does. I find it really strange.

dtip · December 12, 2019, 4:56pm

Ah of course, erlang emacs mode:

Lines with one %-character is indented to the right of the code. The column is specified by the variable comment-column, by default column 48 is used.

Lines with two %-characters will be indented to the same depth as code would have been in the same situation.

Lines with three of more %-characters are indented to the left margin.
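
Put together, the three rules quoted above would give a layout roughly like the following hand-written sketch. The module is hypothetical and the right-hand comment is only approximately aligned to the default comment column.

```erlang
%%% Three or more percent signs: flush against the left margin,
%%% typically used for file-level or module-level commentary.
-module(emacs_comment_demo).
-export([classify/1]).

classify(N) when N < 0 ->
    %% Two percent signs: indented to the same depth as the code.
    negative;
classify(0) ->                                  % One percent sign: pushed right
    zero;
classify(_) ->
    positive.
```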

Perhaps I’m working only on dodgy codebases but I feel like I rarely see this style in practice.

Here’s an example of that % style in action from erlang/otp/lib/stdlib/src/timer.erl:

-spec handle_info(term(), timers()) -> {'noreply', timers(), timeout()}.
handle_info(timeout, Ts) ->                       % Handle timeouts 
    Timeout = timer_timeout(system_time()),
    {noreply, Ts, Timeout};
handle_info({'EXIT',  Pid, _Reason}, Ts) ->       % Oops, someone died
    pid_delete(Pid),
    {noreply, Ts, next_timeout()};
handle_info(_OtherMsg, Ts) ->                     % Other Msg's
    {noreply, Ts, next_timeout()}.

Personally I find this style of mid-line floating comment impossible to visually parse. I would guess from the lack of seeing it in the wild that other developers feel the same way.

Instead what I see in practice is, roughly speaking, the emacs erlang mode definitions shifted over by one %:

Lines with one %-character will be indented to the same depth as code would have been in the same situation.

Lines with two or more %-characters are indented to the left margin.

Or even more simply (the way it’s implemented in Steamroller currently): %% for big, important comments describing behaviour; % for quick mid-function comments.

I did a quick check on the rebar3 and otp source to see how often %%% is used as a proxy for how often people stick to emacs style erlang:

Rebar3:

$ find . -name '*.[he]rl' | wc -l
179
$ find . -name '*.[he]rl' | xargs grep -l '%%% ' | wc -l
30
$ find . -name '*.[he]rl' | xargs grep -l '%% ' | wc -l
161

→ 17% of files use %%%. A quick visual inspection shows most using %% for module-level comments.

OTP:

$ find . -name '*.[he]rl' | wc -l
4144
$ find . -name '*.[he]rl' | xargs grep -l '%%% ' | wc -l
1210
$ find . -name '*.[he]rl' | xargs grep -l '%% ' | wc -l 
3766

→ 29% of files making use of %%% here.
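
The percentages above can be re-derived from the raw counts. One caveat about the second grep pattern: `'%% '` also matches lines that begin with `%%% `, so the 161 and 3766 counts include files that only use `%%%`. The sketch below (hypothetical module, it does not re-run the measurements) shows the arithmetic and a stricter pattern that would separate the two cases.

```erlang
-module(comment_stats).
-export([percent/2, loose_match/1, strict_match/1]).

%% Share of matching files, rounded to a whole percent.
percent(Matching, Total) ->
    round(100 * Matching / Total).

%% Same pattern as grep '%% ': also fires on lines starting with "%%% ".
loose_match(Line) ->
    re:run(Line, "%% ", [{capture, none}]) =:= match.

%% Stricter variant: the two percent signs must not follow another one.
strict_match(Line) ->
    re:run(Line, "(^|[^%])%% ", [{capture, none}]) =:= match.
```

Here `comment_stats:percent(30, 179)` gives 17 and `comment_stats:percent(1210, 4144)` gives 29, matching the figures quoted above.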

I think we should format comments in the simplest and most practical way possible whilst keeping everything easy to read. For me, the erlang emacs mode style doesn’t improve readability. Perhaps most of the rest of the Erlang world disagrees and this thing will totally flop. That’s all part of the fun of throwing it out there!

NobbZ · December 13, 2019, 7:57am

I skimmed Steamroller’s code. Is it formatting itself?

In that case I consider its code much nicer than erl_pp’s output already, but it’s a bit too indent-heavy for my taste.

Having each closing paren/bracket/etc. on its own line would make it unbearable for me to use on my own codebase, though I’d still consider it for the generated test suites mentioned above.

elbrujohalcon · December 13, 2019, 12:33pm

Well… it seems like it’s all of us or none of us, right?
We (as in NextRoll) are also developing a rebar3 formatting plugin.
We were a bit less creative with the name and just called it rebar3 format (also on hex.pm).
And we know that @michalmuskala is working on one as well based on his comments here and here.
We literally spent decades hearing/reading complaints about the lack of a formatter without anybody implementing one and now… you have not one, not two, but three of them!
And that’s if you don’t count rebar3_fmt which requires you to have emacs installed.

Anyway… rebar3_format is pretty similar to steamroller, but we based it on OTP’s erl_prettypr and therefore it’s more akin to emacs-mode.
It’s also not so opinionated because our idea is that you can have a canonical format (i.e. using the plugin without options) and use that format when you commit/push your code, but you can also have as many rebar3 profiles as you want, each one with your favorite options. That way, you can rebar3 as brujo format your code when you start working and then rebar3 format it right before committing/pushing it.

In any case, as its version number (0.0.3) indicates, it’s also in its very early stages and that’s why we were not too eager to make it public yet.

What we do have is a fairly extensive base of test cases (some of them copied from erlang/OTP itself, some others created by us) here.
As an exercise, I tried to use steamroller on that app and I found…

A bug in rebar3_format that I promptly reported as an issue.
That steamroller doesn’t support attributes without parentheses (like -module my_mod.) which are totally valid.
That steamroller doesn’t like macros used in weird type specs (like -spec ?MODULE:my_fun(…)), which, to be fair, rebar3 format doesn’t really handle even when it doesn’t complain (hence the bug above).

But the most important result is that given the proper configuration (mostly {inline_expressions, false} and {sub_indent, 4}) and disregarding the already reported issues, both formatters produce very similar output.

Now, as a maintainability fanboy, I’m very happy that we’re focusing on these tools now. They’re a great addition for the Erlang ecosystem in general. On the other hand, having multiple formatters around is better than having none of them, but it’s also worse than having just a single one.

dtip · December 13, 2019, 2:41pm

Sure is! There’s a hack in the rebar.config so that it uses the latest version of itself when formatting itself.

I’ve been meaning to do a bit of bracket grouping to reduce the vertical space usage a bit. Could you post an example or two which is painful for you and how you think it ought to look?

dtip · December 13, 2019, 2:46pm

What are the odds! Looks like we started work within a week of each other.

Thanks a lot for the bug reports - I’ve released an update with better macro handling and support for attributes without parentheses (it adds parentheses).

It might be worth teaming up to prevent duplicating work, but it depends what your goals are for rebar3_format. I feel quite strongly that formatters should be opinionated and have the bare minimum configuration. It seems like you’d like to offer config choices to your users, which is fair enough.

Interesting that @michalmuskala is building an autoformatter too! Can you share your plans?

NobbZ · December 13, 2019, 2:59pm

old-reliable/steamroller/blob/master/src/steamroller_algebra.erl#L189-L193

concat(X, Y, Break) -> cons(X, cons(break(Break), Y)).

%%
%% Token Consumption
%%

In my opinion the case … of should be on the same line as the LHS of the match (unless of course other rules forbid that, e.g. the expression on the LHS or between case and of is too long), with the clauses indented only one level relative to the match:

Doc1 = case PrevTerm of
    function_comment -> newline(Doc0, Doc);
    _ -> newlines(Doc0, Doc)
end,

old-reliable/steamroller/blob/master/src/steamroller_algebra.erl#L217-L226

generate_doc_([{'-', _} = H0, {atom, _, Type} = H1, {'(', _} | Rest0], Doc, PrevTerm)
when Type == type orelse Type == opaque ->
  % Remove brackets from Types
  Rest1 = remove_matching('(', ')', Rest0),
  generate_doc_([H0, H1 | Rest1], Doc, PrevTerm);

generate_doc_([{'-', _}, {atom, _, Type} | Tokens], Doc0, PrevTerm)
when Type == type orelse Type == opaque ->
  % Type
  {_ForceBreak, Group, Rest} = type(Tokens),

If you break one clause into multiple lines, do so for all of them; it makes it easier to distinguish match arms. Also, when you break for a when, don’t put it on the same level as the clause’s body or the match. It took me quite a while to figure out that it’s actually a single clause rather than two on lines 220 to 222.

Also I’d like to have some whitespace between clauses.

The above paragraphs are heavily inspired by the Elixir formatter though ;D

I won’t give an example of how I’d prefer that snippet to look.

old-reliable/steamroller/blob/master/src/steamroller_algebra.erl#L356-L369

        {ForceBreak0, newline(Clauses)}
    end,
  {ForceBreak, Doc, Tokens1}.


-spec type(tokens()) -> {force_break(), doc(), tokens()}.
type(Tokens0) ->
  {FunctionTokens0, Tokens1, Token} = get_until(dot, Tokens0),
  FunctionTokens1 = FunctionTokens0 ++ [Token],
  {ForceBreak, Doc} =
    case get_until('::', FunctionTokens1) of
      {_, _, {dot, _}} ->
        % This is not actually a type with clauses.
        % We can get here if there is funny code inside an ifdef.

I’d say the same for arbitrary function calls as I’ve already said about the case above.

Also, instead of closing one paren per line, I’d really prefer to have them closed “lisp-style”:

Doc = force_break(
    ForceBreak,
    group(
        space(
            cons(
                group(space(text(<<"case">>), CaseArg)),
                nest(?indent, space(text(<<" of">>), GroupedClauses))),
            text(<<"end">>)),
        inherit)),

elbrujohalcon · December 13, 2019, 3:08pm

Teaming up looks nice, but yeah… we want our formatter to be both opinionated and configurable.
As I said in my comment above, our goal is to allow the following development process:

Say you need to change something in your_module.erl

You format it to your favorite fashion: rebar3 as dtip format --files src/your_module.erl.
You do your work.
You format it to the canonical style with rebar3 format.
You commit & push your changes.

Of course, some of these steps can be automated by git-hooks or rebar3 aliases.
So far we only slightly deviated from what erl_prettypr does for the canonical style, but we added a bunch of config options so everyone can work on code that looks nice to them.
This is inspired by how the Smalltalk formatter used to work many years ago. It automatically formatted the code in the user’s favorite style for editing, but it saved the code in canonical format for storage and distribution.

Josh · December 14, 2019, 4:07am

I would be very happy if I could find something like that for JavaScript. There are too many radical formatting styles and it takes up a lot of mental processing power to switch all the time. Other people have different preferences/requirements, so that seems like an ideal solution.

I guess it might make looking at diffs a bit more confusing though, since git would see different code than the editor.

dtip · December 14, 2019, 10:29am

I could be wrong, but I thought the definition of an opinionated formatter is that it offers minimal configuration. Otherwise what’s the difference between an opinionated formatter and any other formatter?

I can see why your development process is attractive to some. It seems like it carries quite a lot of mental overhead. What happens if you’re helping a colleague to debug and have to read code on their machine? What happens if you’re reviewing a PR on e.g. the GitHub web interface?

The massive advantage, in my opinion, of a tool like mix format is that pretty much all Elixir code looks the same. They’ve put a lot of time and effort into designing a tool which improves the readability of Elixir code so developers don’t have to think about it. As an Elixir developer, you learn to read one style of Elixir and that skill is then transferable to any Elixir codebase in any environment. I think it would be invaluable if we could bring this to Erlang!

The disadvantage of trying to introduce a similar tool to Erlang is that plenty of people have their own ideas about how Erlang should look and those ideas are all fairly well-ingrained. It makes that initial barrier of getting used to a new style a bit more painful. I think it pays off in the long run, both for individuals and teams.

rvirding · December 15, 2019, 2:28pm

I don’t understand the issue here. There is only one REAL way to format Erlang and that is the emacs way. That is just how Erlang should look.

NobbZ · December 15, 2019, 2:34pm

And what’s the name of the function I need to call for emacs formatting my source file?

We are not only talking about indenting, but about fully formatting a file.

Emacs does not care whether I write [Foo|Bar] or [Foo | Bar], it will not canonically format it, it will just indent it if necessary. Or have I missed something in the docs?
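
The spacing point can be checked at the token level: with the default line-only locations, `erl_scan` produces identical tokens for both spellings, which is why a tool that prints from tokens has to pick one canonical spacing. A minimal sketch, with a hypothetical module name:

```erlang
-module(spacing_demo).
-export([tokens/1, same_tokens/2]).

%% Tokenise a source string; locations are line numbers only by default,
%% so horizontal whitespace leaves no trace in the result.
tokens(Src) ->
    {ok, Tokens, _End} = erl_scan:string(Src),
    Tokens.

%% True when two source strings scan to exactly the same tokens.
same_tokens(A, B) ->
    tokens(A) =:= tokens(B).
```

For example, `spacing_demo:same_tokens("[Foo|Bar].", "[Foo | Bar].")` holds, whereas comparing against `"[Foo,Bar]."` does not.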

Also, as I’m generating code programmatically, I cannot rely on emacs being available when generating the module’s source code. I need a way to generate erlang source code from some erl_syntax:syntaxTree/0. I can’t do that using emacs.
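
For context, this is roughly what generating source from an erl_syntax tree looks like with the pretty-printer that ships in OTP's syntax_tools. It is a minimal sketch with a hypothetical module name and a made-up function being generated, not the exercism generator's code.

```erlang
-module(gen_demo).
-export([source/0]).

%% Build the syntax tree for `double(X) -> X * 2.` and render it as text.
source() ->
    X = erl_syntax:variable('X'),
    Body = erl_syntax:infix_expr(X, erl_syntax:operator('*'), erl_syntax:integer(2)),
    Clause = erl_syntax:clause([X], none, [Body]),
    Function = erl_syntax:function(erl_syntax:atom(double), [Clause]),
    %% erl_prettypr:format/1 chooses the layout; no editor is involved.
    lists:flatten(erl_prettypr:format(Function)).
```

The returned string can be written to a file or scanned and parsed again, so the whole pipeline stays inside the Erlang VM.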

rvirding · December 15, 2019, 2:38pm

I was joking, hence the smiley.

NobbZ · December 15, 2019, 2:38pm

I missed that.