Binary encoding across STDIO between two processes causing unexpected results

Background:
I’m writing a small program that communicates with a process over STDIO, here’s an excerpt from their docs explaining how it works:

// This program will fork the process under test and communicate with it over
// its stdin/stdout:
//
//     +--------+   pipe   +----------+
//     | tester | <------> | testee   |
//     |        |          |          |
//     |  C++   |          | any lang |
//     +--------+          +----------+
//
// The tester contains all of the test cases and their expected output.
// The testee is a simple program written in the target language that reads
// each test case and attempts to produce acceptable output for it.
//
// Every test consists of a ConformanceRequest/ConformanceResponse
// request/reply pair.  The protocol on the pipe is simply:
//
//   1. tester sends 4-byte length N (little endian)
//   2. tester sends N bytes representing a ConformanceRequest proto
//   3. testee sends 4-byte length M (little endian)
//   4. testee sends M bytes representing a ConformanceResponse proto

Here’s my current WIP script that is being exercised by the tester:

Here’s the tester procedure in C++ to follow along with how it interacts:

I’ve added some printlns strategically to trace how its working, here’s what I see that causes the issue:

[libprotobuf INFO conformance_test_runner.cc:160] TEST: Recommended.Proto2.ProtobufInput.ValidDataScalarBinary.ENUM[1].ProtobufOutput
[libprotobuf INFO conformance_test_runner.cc:364] PARENT READ: 4
>> CHILD TX LEN: 5 
[libprotobuf INFO conformance_test_runner.cc:201] << PARENT RX LEN:5
[libprotobuf INFO conformance_test_runner.cc:364] PARENT READ: 5
[libprotobuf INFO conformance_test.cc:370] TEST DONE

[libprotobuf INFO conformance_test_runner.cc:160] TEST: Required.Proto2.ProtobufInput.ValidDataScalar.ENUM[2].ProtobufOutput
[libprotobuf INFO conformance_test_runner.cc:364] PARENT READ: 4
>> CHILD TX LEN: 5 
[libprotobuf INFO conformance_test_runner.cc:201] << PARENT RX LEN:5
[libprotobuf INFO conformance_test_runner.cc:364] PARENT READ: 5
[libprotobuf INFO conformance_test.cc:370] TEST DONE

[libprotobuf INFO conformance_test_runner.cc:160] TEST: Recommended.Proto2.ProtobufInput.ValidDataScalarBinary.ENUM[2].ProtobufOutput
[libprotobuf INFO conformance_test_runner.cc:364] PARENT READ: 4
>> CHILD TX LEN: 5 
[libprotobuf INFO conformance_test_runner.cc:201] << PARENT RX LEN:5
[libprotobuf INFO conformance_test_runner.cc:364] PARENT READ: 5
[libprotobuf INFO conformance_test.cc:370] TEST DONE

[libprotobuf INFO conformance_test_runner.cc:160] TEST: Required.Proto2.ProtobufInput.ValidDataScalar.ENUM[3].ProtobufOutput
[libprotobuf INFO conformance_test_runner.cc:364] PARENT READ: 4
>> CHILD TX LEN: 14 
[libprotobuf INFO conformance_test_runner.cc:201] << PARENT RX LEN:976367882
[libprotobuf INFO conformance_test_runner.cc:364] PARENT READ: 976367882

As you can see in that last run, the elixir side belives its transmitting a length of 14 but the parent receives 976367882, which is way off and so the process hangs never being satisfied with that result. I have 966 tests working as intended w.r.t this protocol, its this particular case that is way off and I can’t understand why. I’ve been trying to debug this for a couple of days and I’m not sure where the issue is exactly.

Any advice is appreciated!

1 Like

You should also put Elixir IO.puts “logging” statements and show us their output just before the script sends length + data back.

What are other lengths? Because maybe there is problem (from what I see all other. cases use len == 5).

The parent appears to be consuming the 4-character string “:21\n”, which it interprets as a 32-bit integer, producing 976367882. Maybe that’s part of the stack trace of an exception thrown by the Elixir script, or something else you can relate to the output of that script…?

2 Likes

I think you’re onto something! I didn’t make the connection thats probably a stack trace being written to STDIO somehow… I thought I was being really careful with writes to STDIO, does anything look potentially suspect to you?

Well, line 21 is cast_response(). If this is a GenServer cast it might trigger an exception in another process, which could cause unexpected output before the current process gets to output its next length. Normally I would expect exceptions to be written to stderr, but maybe in your setup that’s not the case, or the exception is caught and output to stdout by some part of your script…

This isn’t a GenServer its literally just this one file as an escript. Printing out the data I’m sending there I get:

�
 ¨ÿÿÿÿÿÿÿÿÿ

Not sure how to make that more useful, but that is what I’m sending along. Here’s a log snip of that in context:

[libprotobuf INFO conformance_test_runner.cc:160] TEST: Recommended.Proto2.ProtobufInput.ValidDataScalarBinary.ENUM[2].ProtobufOutput
[libprotobuf INFO conformance_test_runner.cc:174] LEN B4=0
[libprotobuf INFO conformance_test_runner.cc:366] PARENT READ: 4
�¨[libprotobuf INFO conformance_test_runner.cc:203] << PARENT RX LEN:5
[libprotobuf INFO conformance_test_runner.cc:366] PARENT READ: 5
[libprotobuf INFO conformance_test.cc:370] TEST DONE

[libprotobuf INFO conformance_test_runner.cc:160] TEST: Required.Proto2.ProtobufInput.ValidDataScalar.ENUM[3].ProtobufOutput
[libprotobuf INFO conformance_test_runner.cc:174] LEN B4=0
[libprotobuf INFO conformance_test_runner.cc:366] PARENT READ: 4
�
 ¨ÿÿÿÿÿÿÿÿÿ[libprotobuf INFO conformance_test_runner.cc:203] << PARENT RX LEN:14
[libprotobuf INFO conformance_test_runner.cc:366] PARENT READ: 14
[libprotobuf INFO conformance_test.cc:370] TEST DONE

[libprotobuf INFO conformance_test_runner.cc:160] TEST: Recommended.Proto2.ProtobufInput.ValidDataScalarBinary.ENUM[3].ProtobufOutput
[libprotobuf INFO conformance_test_runner.cc:174] LEN B4=0
[libprotobuf INFO conformance_test_runner.cc:366] PARENT READ: 4
[libprotobuf INFO conformance_test_runner.cc:203] << PARENT RX LEN:976367882
�
 ¨ÿÿÿÿÿÿÿÿÿ[libprotobuf INFO conformance_test_runner.cc:366] PARENT READ: 976367882

EDIT: Should also mention that the JSONified representation of the message I am sending back is valid looking, so the actual packet transmission must be the source of the issue.

EDIT 2: I think you’re on to something, its changed now to “12:\n” with this version of the script:

EDIT 3: actually, its 12: or :21 based on endianness, and it should be little-endian.

Right, it it’s little-endian then the input is “\n12:”, which doesn’t look much like a stack trace. Still, it might be a clue, maybe…?

1 Like

Here’s the JSON format of the request and response being sent through the program that causes the weird length:

REQ: {"message_type":"protobuf_test_messages.proto2.TestAllTypesProto2","protobuf_payload":"qAH///////////8B","requested_output_format":"PROTOBUF","test_category":"BINARY_TEST"}
RES: {"protobuf_payload":"qAH///////////8B"}

The idea here is that I’m basically decoding, copying that value, and re-encoding it back as a response.

EDIT: Here’s the IO.inspect of what I’m sending over using binwrite exactly:

<<14, 0, 0, 0, 26, 12, 168, 1, 255, 255, 255, 255, 255, 255, 255, 255, 255, 1>>

EDIT 2: Another interesting thing, this number seems to change every time I restart, now I’m getting: 976433418 which decodes as \n13: Not sure how its getting that from the packet above :confused:

I FIGURED IT OUT AND IT WAS DUMB:

Logger.disable(self())

Something somewhere is logging to STDIO during encoding/decoding which causes random mayhem! This silences it, although I’ll redirect it to STDERR instead. Thanks everyone!

EDIT: for those following along at home, RE: the number keeps changing it was a TIMESTAMP!

16:16:28.597 [warn] unknown enum value 18446744073709551615 when decoding for {:enum, ProtobufTestMessages.Proto3.TestAllTypesProto3.NestedEnum}

2 Likes