Fixing stutters in Papers Please on Linux

Tue Dec 28, 2021

Since switching to Linux some time ago, I have had my fair share of problems running games. To be fair, most of these games were not designed to run on Linux, but the awesome Proton and the underlying Wine project make it easy to run most games I am interested in seamlessly. Most of the time when problems arise, games refuse to start, but some workaround exists to make them run.

When I wanted to play some Papers Please, I was delighted to see that a native port exists, which should make it easy to run. Installing and starting the game from GOG was easy enough, but already in the main menu something was off. After starting a game it was clear that the animations were pausing every few seconds for around a second, which made it almost unplayable.

Looking at ProtonDB, people are complaining about stutters. Some report that the 64-bit version stutters, whereas the 32-bit version works. Unluckily for me, GOG only offers a 64-bit download on Linux. One comment says there is “a half a second freeze every few seconds”, which seems similar to my experience.

There has to be a reason why the game stutters; the developers must have tested this configuration! As a side note, I wonder whether the 64-bit version worked at some point, but without a good starting point that is hard to test. So let's investigate where the pauses are coming from. Linux features a whole suite of tools to debug many different aspects, from performance to correctness. I am assuming here that the stutters are not inherent in the game logic, because the 32-bit version works correctly.

What is going on?

The first tool I am using is strace. In its default configuration it runs a program, records all system calls the program makes, and reports them on the console. Without any filters, every call is reported, which can be overwhelming, but without a hint where to look, I have to record everything.

After some digging through the available options, I discover the -T and -t flags, which record the time spent in each syscall and a timestamp for when it was made. So I record some data using strace -T -t ./PapersPlease 2> log.txt and sift through the mess:

...
13:08:06.067929 pselect6(9, [8], NULL, NULL, {tv_sec=0, tv_nsec=0}, NULL) = 0 (Timeout) <0.000005>
13:08:06.067946 recvmsg(8, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000004>
13:08:06.067962 pselect6(9, [8], NULL, NULL, {tv_sec=0, tv_nsec=0}, NULL) = 0 (Timeout) <0.000005>
13:08:06.068033 getpid()                = 2541133 <0.000005>
13:08:06.068048 getpid()                = 2541133 <0.000004>
13:08:06.068062 getpid()                = 2541133 <0.000004>
13:08:06.068077 getpid()                = 2541133 <0.000004>
13:08:06.068091 getpid()                = 2541133 <0.000004>
13:08:06.068105 getpid()                = 2541133 <0.000004>
13:08:06.068119 poll([{fd=8, events=0}], 1, 0) = 0 (Timeout) <0.000005>
13:08:06.068138 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.019100>
-------------------------------------------------------------------------------
13:08:06.087261 ioctl(19, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x52, 0x10), 0x7ffce133ef70) = 0 <0.000008>
13:08:06.087288 sched_yield()           = 0 <0.000005>
13:08:06.087306 sched_yield()           = 0 <0.000005>
13:08:06.087325 sched_yield()           = 0 <0.000004>
13:08:06.087340 sched_yield()           = 0 <0.000004>
13:08:06.087357 recvmsg(8, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\6\0\324\0015\304j\10\352\1\0\0\v\0\300\3\0\0\0\0\36\6)\3X\3t\1\20\0\1\0"..., iov_len=4096}], msg_i>
13:08:06.087391 recvmsg(8, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000005>
13:08:06.087410 recvmsg(8, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000006>
13:08:06.087433 recvmsg(8, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000005>
13:08:06.087454 recvmsg(8, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000005>
...

strace records a lot of information: the parameters passed into each syscall as well as the return value. For some parameters, like flags, strace even decodes their meaning. We can also see that the game makes at least some syscall every few milliseconds. I have highlighted the longest gap between syscalls, which originates from the poll call before it taking 19 milliseconds to complete. The first parameter to poll contains a file descriptor, which we can resolve to a path using lsof -p <pid>. In this case it resolves to /dev/nvidia0, so some communication with my graphics card. The delay of 19 ms is close to the infamous 16.67 ms in which a game has to draw a single frame to still achieve 60 fps. This is probably related to v-sync and lets us filter out normal operation.
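For reference, resolving the descriptor while the game is running can look like this; the PID is the one reported by the getpid() calls in the trace, and the second line is an equivalent /proc lookup:

lsof -p 2541133                # lists every file the game has open, including fd 19
readlink /proc/2541133/fd/19   # or ask procfs directly; here it points to /dev/nvidia0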

Using this knowledge, we can filter for polling on file descriptor 19 using grep "fd=19" log.txt:

...
13:08:05.987848 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.018904>
13:08:06.007749 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.019502>
13:08:06.028516 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.018306>
13:08:06.048071 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.018689>
13:08:06.068138 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.019100>
13:08:06.088641 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.018155>
13:08:06.108200 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.018565>
-------------------------------------------------------------------------------
13:08:07.519463 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.000005>
13:08:07.522046 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.004434>
13:08:07.539017 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.008308>
13:08:07.554338 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.012495>
13:08:07.570526 poll([{fd=19, events=POLLIN|POLLPRI}], 1, 1000) = 1 ([{fd=19, revents=POLLIN|POLLPRI}]) <0.016243>
...

Most of the time, poll is called roughly every 19-20 ms, but here the pattern is interrupted by a 1.4 s pause. The following poll returns in effectively zero time, because the frame is already late and no waiting is needed to keep a consistent frame rate. A pause of 1.4 s is also very noticeable for a player and consistent with the comment on ProtonDB about the half-second stuttering. The difference between 0.5 s and 1.4 s might be down to human perception, or the problem might be worse on my system; either way, we have a clue.

During the long pause a few syscalls stand out, because I have not seen them elsewhere:

13:08:06.127285 openat(AT_FDCWD, "/dev/input", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 29 <0.000010>
13:08:06.127311 newfstatat(29, "", {st_mode=S_IFDIR|0755, st_size=720, ...}, AT_EMPTY_PATH) = 0 <0.000005>
13:08:06.127340 getdents64(29, 0x55fb3ef36880 /* 36 entries */, 32768) = 1128 <0.000014>
13:08:06.127369 stat("/dev/input/event10", {st_mode=S_IFCHR|0660, st_rdev=makedev(0xd, 0x4a), ...}) = 0 <0.000005>
13:08:06.127388 openat(AT_FDCWD, "/dev/input/event10", O_RDONLY) = 32 <0.000007>
13:08:06.127406 ioctl(32, EVIOCGBIT(0, 8), [EV_SYN EV_KEY EV_MSC]) = 8 <0.000006>
13:08:06.127430 ioctl(32, EVIOCGBIT(EV_KEY, 96), [KEY_POWER KEY_SLEEP KEY_WAKEUP]) = 96 <0.000004>
13:08:06.127447 ioctl(32, EVIOCGBIT(EV_ABS, 8), []) = 8 <0.000005>
13:08:06.127463 close(32)               = 0 <0.039483>
13:08:06.173139 stat("/dev/input/event9", {st_mode=S_IFCHR|0660, st_rdev=makedev(0xd, 0x49), ...}) = 0 <0.000008>
13:08:06.173177 openat(AT_FDCWD, "/dev/input/event9", O_RDONLY) = 32 <0.000009>
13:08:06.173200 ioctl(32, EVIOCGBIT(0, 8), [EV_SYN EV_KEY EV_REL EV_ABS ...]) = 8 <0.000005>
13:08:06.173219 ioctl(32, EVIOCGBIT(EV_KEY, 96), [KEY_ESC KEY_ENTER KEY_KPMINUS KEY_KPPLUS ...]) = 96 <0.000005>
13:08:06.173239 ioctl(32, EVIOCGBIT(EV_ABS, 8), [ABS_VOLUME]) = 8 <0.000005>
13:08:06.173256 close(32)               = 0 <0.023215>
...
13:08:07.465881 stat("/dev/input/event0", {st_mode=S_IFCHR|0660, st_rdev=makedev(0xd, 0x40), ...}) = 0 <0.000008>
13:08:07.465904 openat(AT_FDCWD, "/dev/input/event0", O_RDONLY) = 32 <0.000009>
13:08:07.465926 ioctl(32, EVIOCGBIT(0, 8), [EV_SYN EV_KEY]) = 8 <0.000005>
13:08:07.465943 ioctl(32, EVIOCGBIT(EV_KEY, 96), [KEY_POWER]) = 96 <0.000004>
13:08:07.465958 ioctl(32, EVIOCGBIT(EV_ABS, 8), []) = 8 <0.000004>
13:08:07.465974 close(32)               = 0 <0.017162>
13:08:07.483156 getdents64(29, 0x55fb3ef36880 /* 0 entries */, 32768) = 0 <0.000007>
13:08:07.483179 close(29)               = 0 <0.000006>

The snippet starts by opening /dev/input, the directory containing the input device nodes, which yields file descriptor 29. Using getdents64 the program iterates over the directory entries, which are subsequently opened, checked for some capability flags, and closed. Finally, file descriptor 29 is closed. Overall this operation takes around 1.4 s; it seems we have found where the pause originates. The main delay is caused by the close calls, which take up to 100 ms each (not included in the snippet). This result is unexpected, as a simple operation like close should be much faster. The last line of the snippet even shows how fast it can be: closing file descriptor 29 takes only 6 µs. That is a factor of more than 10,000! Something fishy is going on here.

We can understand better what is going on once we know the calling stack trace that leads to the slow close operation. Luckily, strace is loaded with useful features: -k does exactly what we need and records the calling stack trace. We only need the stack traces for close calls, which we can filter with -e trace=close.
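Putting the flags together, the recording step looks roughly like this (same redirection as before):

strace -T -t -k -e trace=close ./PapersPlease 2> log.txt

This results in a trace like the following: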

13:08:07.465974 close(32)               = 0 <0.017162>
 > /usr/lib/libpthread-2.33.so(__close+0x3b) [0x1282b]
 > ~/Games/gog/papers-please/game/lime.ndll(MaybeAddDevice+0x127) [0x4a6d07]
 > ~/Games/gog/papers-please/game/lime.ndll(LINUX_JoystickDetect+0x1db) [0x4a70eb]
 > ~/Games/gog/papers-please/game/lime.ndll(LINUX_JoystickInit+0x91) [0x4a7191]
 > ~/Games/gog/papers-please/game/lime.ndll(SDL_JoystickInit+0x3f) [0x47d8ff]
 > ~/Games/gog/papers-please/game/lime.ndll(SDL_InitSubSystem_REAL+0x1c5) [0x471c15]
 > ~/Games/gog/papers-please/game/lime.ndll(lime::SDLApplication::SDLApplication()+0xa1) [0xea6a1]
 > ~/Games/gog/papers-please/game/lime.ndll(lime::CreateApplication()+0x1b) [0xeb23b]
 > ~/Games/gog/papers-please/game/lime.ndll(lime::lime_application_create()+0x9) [0xb9929]
 > ~/Games/gog/papers-please/game/PapersPlease(lime::_internal::backend::native::NativeApplication_obj::__construct(hx::ObjectPtr<lime::app::Application_obj>)+0x48b) [>
 > ~/Games/gog/papers-please/game/PapersPlease(lime::_internal::backend::native::NativeApplication_obj::__alloc(hx::ImmixAllocator*, hx::ObjectPtr<lime::app::Applicati>
 > ~/Games/gog/papers-please/game/PapersPlease(lime::app::Application_obj::__construct()+0xf6) [0xdfc206]
 > ~/Games/gog/papers-please/game/PapersPlease(openfl::display::Application_obj::__construct()+0x3b) [0x5d8a9b]
 > ~/Games/gog/papers-please/game/PapersPlease(openfl::display::Application_obj::__alloc(hx::ImmixAllocator*)+0x41) [0x5d8da1]
 > ~/Games/gog/papers-please/game/PapersPlease(ApplicationMain_obj::create(Dynamic)+0x6d) [0xdf3a4d]
 > ~/Games/gog/papers-please/game/PapersPlease(ApplicationMain_obj::main()+0x66) [0x2b8936]
 > ~/Games/gog/papers-please/game/PapersPlease(main+0x51) [0x2be641]
 > /usr/lib/libc-2.33.so(__libc_start_main+0xd5) [0x27b25]
 > ~/Games/gog/papers-please/game/PapersPlease(_start+0x2a) [0x2bf48a]

The PapersPlease binary contains symbols, which helps a lot. The function calling close is MaybeAddDevice, called from SDL's joystick detection code (LINUX_JoystickDetect and SDL_JoystickInit in the trace). Searching through GitHub, I can find the code for MaybeAddDevice: it opens the given path, checks whether it is a joystick, and closes it again, once for every input device on my computer. Somehow, there are 30 entries in my /dev/input, which probably makes the effect more pronounced on my machine. A simplified sketch of the function is shown below.
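This is a simplified paraphrase of the function, not the exact upstream SDL source; IsJoystick stands in for SDL's internal helper that issues the EVIOCGBIT ioctls we saw in the trace:

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* SDL-internal helper that classifies the device via EVIOCGBIT ioctls;
   declared here only so the sketch is self-contained. */
int IsJoystick(int fd, const char *path);

static void MaybeAddDevice(const char *path)
{
    struct stat sb;
    if (stat(path, &sb) == -1) {
        return;                     /* device node vanished, nothing to do */
    }
    /* upstream also bails out here if the device is already in its list */
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return;
    }
    if (IsJoystick(fd, path)) {
        /* remember the device in SDL's joystick list */
    }
    close(fd);                      /* the call that stalls for tens of milliseconds */
}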

At this point, we have an idea what is going on: Papers Please uses SDL on Linux, which checks every few seconds whether a joystick has been connected. To do this, every input device is probed. Normally each of these checks should be almost instantaneous, but here it takes between 20 and 100 ms. On computers with a lot of input devices this easily adds up to more than a second, which results in a noticeable stutter, because the check runs on the main game loop.

Why is close slow?

Time to bring out the big guns: bcc. This toolkit can be used to instrument user processes and the kernel at various points. Writing your own tools with it is not too hard, but it also comes bundled with a number of handy ready-made ones. One of these is funcslower, which takes a symbol from a process (or a kernel symbol) and measures how long each invocation takes; if an invocation exceeds a threshold, it is logged. With /usr/share/bcc/tools/funcslower /usr/lib/libpthread-2.33.so:close -p <pid> we can log close calls which take longer than 1 ms, the default threshold.

Tracing function calls slower than 1 ms... Ctrl+C to quit.
TIME       COMM           PID    LAT(ms)             RVAL FUNC
0.000000   PapersPlease   539544   43.49                0 /usr/lib/libpthread-2.33.so:close
0.043507   PapersPlease   539544   32.42                0 /usr/lib/libpthread-2.33.so:close
0.075961   PapersPlease   539544   20.86                0 /usr/lib/libpthread-2.33.so:close
...
1.163494   PapersPlease   539544   21.81                0 /usr/lib/libpthread-2.33.so:close
1.185336   PapersPlease   539544   31.48                0 /usr/lib/libpthread-2.33.so:close
1.216830   PapersPlease   539544   21.78                0 /usr/lib/libpthread-2.33.so:close
3.000077   PapersPlease   539544   23.92                0 /usr/lib/libpthread-2.33.so:close
3.024016   PapersPlease   539544   32.80                0 /usr/lib/libpthread-2.33.so:close
3.056834   PapersPlease   539544   39.98                0 /usr/lib/libpthread-2.33.so:close
...
4.261013   PapersPlease   539544   35.80                0 /usr/lib/libpthread-2.33.so:close
4.296830   PapersPlease   539544   26.65                0 /usr/lib/libpthread-2.33.so:close
4.323498   PapersPlease   539544   26.65                0 /usr/lib/libpthread-2.33.so:close
6.000148   PapersPlease   539544   43.33                0 /usr/lib/libpthread-2.33.so:close
6.043497   PapersPlease   539544   39.99                0 /usr/lib/libpthread-2.33.so:close
6.083495   PapersPlease   539544   26.66                0 /usr/lib/libpthread-2.33.so:close
...
7.265320   PapersPlease   539544   31.49                0 /usr/lib/libpthread-2.33.so:close
7.296819   PapersPlease   539544   26.66                0 /usr/lib/libpthread-2.33.so:close
7.323483   PapersPlease   539544   26.66                0 /usr/lib/libpthread-2.33.so:close

This reproduces the result of our earlier analysis: every few seconds, close calls are slow. We can even see that the stuttering starts every three seconds and on my machine lasts 1.2 to 1.4 s. But where is this latency coming from? In principle, the close function should just invoke the syscall and return, so it should not introduce latency itself, at least nothing measurable in milliseconds. The syscall is handled in the kernel by the sys_close function, which on my machine carries the prefix __x64. So we start funcslower again with /usr/share/bcc/tools/funcslower __x64_sys_close -p <pid> -t and we get... nothing. Either we used the wrong function, or the function completes in less than 1 ms. With the -u 1 flag, we can lower the threshold and report every invocation that takes longer than 1 µs.
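The adjusted invocation is the same command with the lower threshold:

#> /usr/share/bcc/tools/funcslower -u 1 __x64_sys_close -p <pid> -t

And with this setup we do get results: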

TIME       COMM           PID    LAT(us)             RVAL FUNC
0.000000   PapersPlease   539544    1.00                0 __x64_sys_close
2.240417   PapersPlease   539544    1.55                0 __x64_sys_close
2.269315   PapersPlease   539544    1.33                0 __x64_sys_close
2.311854   PapersPlease   539544    1.03                0 __x64_sys_close
2.535184   PapersPlease   539544    1.31                0 __x64_sys_close
2.591848   PapersPlease   539544    1.11                0 __x64_sys_close
2.725182   PapersPlease   539544    1.14                0 __x64_sys_close
3.018515   PapersPlease   539544    1.00                0 __x64_sys_close
3.151844   PapersPlease   539544    1.04                0 __x64_sys_close
3.205187   PapersPlease   539544    1.55                0 __x64_sys_close
3.259313   PapersPlease   539544    1.08                0 __x64_sys_close
3.485194   PapersPlease   539544    1.49                0 __x64_sys_close
5.240487   PapersPlease   539544    1.54                0 __x64_sys_close
5.274988   PapersPlease   539544    1.14                0 __x64_sys_close

We can still see the repetition after three seconds, the time between the second and the second-to-last call. But only 11 of the 30 close calls from one scan are recorded, and their latency is very close to 1 µs; most likely the missing calls were slightly faster than 1 µs. Either way, the kernel handler is fast enough not to matter for our 50 ms delay.

In theory, if the kernel does not add latency, the measured delay has to come from the close function in libpthread. We can use gdb to disassemble a specific function: gdb -batch -ex "file /usr/lib/libpthread-2.33.so" -ex "disas close". The resulting disassembly is shown below. The three interesting points are the two calls to pthread-internal functions as well as the syscall instruction, which is the actual transition into the kernel. The remaining instructions are negligible from a latency perspective. Ideally, we would use a profiling tool like perf to locate where the latency originates, but I do not know how to use perf to analyze a single function. perf samples the stack every few ticks, but at those moments the process may be doing something else entirely, and the times when MaybeAddDevice is running might simply not be sampled.

Dump of assembler code for function close:
   ...
   0x0000000000012810 <+32>:	sub    $0x18,%rsp
   0x0000000000012814 <+36>:	mov    %edi,0xc(%rsp)
   0x0000000000012818 <+40>:	call   0x12450 <__pthread_enable_asynccancel>
   0x000000000001281d <+45>:	mov    0xc(%rsp),%edi
   0x0000000000012821 <+49>:	mov    %eax,%r8d
   0x0000000000012824 <+52>:	mov    $0x3,%eax
   0x0000000000012829 <+57>:	syscall
   0x000000000001282b <+59>:	cmp    $0xfffffffffffff000,%rax
   0x0000000000012831 <+65>:	ja     0x12868 <close+120>
   0x0000000000012833 <+67>:	mov    %r8d,%edi
   0x0000000000012836 <+70>:	mov    %eax,0xc(%rsp)
   0x000000000001283a <+74>:	call   0x124d0 <__pthread_disable_asynccancel>
   0x000000000001283f <+79>:	mov    0xc(%rsp),%eax
   0x0000000000012843 <+83>:	add    $0x18,%rsp
   0x0000000000012847 <+87>:	ret
   ...
   0x0000000000012868 <+120>:	mov    0x9739(%rip),%rdx        # 0x1bfa8
   0x000000000001286f <+127>:	neg    %eax
   0x0000000000012871 <+129>:	mov    %eax,%fs:(%rdx)
   0x0000000000012874 <+132>:	mov    $0xffffffff,%eax
   0x0000000000012879 <+137>:	jmp    0x12833 <close+67>

We can build our own toy profiler with the related funclatency tool from bcc, which records a histogram of all latencies, so we do not miss any calls. Besides the function we want to trace, we can also specify an offset into the function at which the measurement begins. Using this feature, we can start tracing at different points inside the function and see how the latencies change; the latency added by an instruction is then the difference between the measurements before and after it. This method is not reliable when microprofiling on the order of microseconds or less, but our latency is long enough to make a measurable difference.

As the main latency should come from the three calls, we can measure at the points between them. For my testing I used offsets 57 and 59, i.e. right before and right after the syscall instruction. With this setup I get the following histograms.

#> /usr/share/bcc/tools/funclatency /lib/libpthread-2.33.so:close -o 57 -p <pid>
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 0        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 0        |                                        |
    524288 -> 1048575    : 0        |                                        |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 0        |                                        |
   4194304 -> 8388607    : 0        |                                        |
   8388608 -> 16777215   : 0        |                                        |
  16777216 -> 33554431   : 17       |*****************                       |
  33554432 -> 67108863   : 39       |****************************************|
  67108864 -> 134217727  : 2        |**                                      |

avg = 46994220 nsecs, total: 2725664779 nsecs, count: 58
#> /usr/share/bcc/tools/funclatency /lib/libpthread-2.33.so:close -o 59 -p <pid>
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 15       |********                                |
      1024 -> 2047       : 68       |****************************************|
      2048 -> 4095       : 27       |***************                         |
      4096 -> 8191       : 1        |                                        |

avg = 1678 nsecs, total: 186310 nsecs, count: 111

This is as far as I can take the analysis. I think it shows that the latency originates in the kernel, but not directly in the __x64_sys_close function; it comes from something else the kernel is doing around the syscall. Poking around with perf, I came across the kernel function synchronize_rcu, which might be the root cause. It is part of the kernel's RCU (read-copy-update) machinery, a synchronization mechanism for concurrently accessed data structures. That is a rabbit hole I am not going down this time.

Even with all this information, it is not obvious how to fix the problem: I am not a kernel developer, and the latency might well be caused by some other process touching the same data. So a proper fix is out of reach, but remember that the whole point of enumerating the input devices every few seconds is to detect newly connected joysticks. Papers Please is not the kind of game where a joystick, or more likely a game controller, is useful. So why do we need this functionality in the first place?

Binary patching

The function we are looking at is named MaybeAddDevice. Judging by that name, not adding the device should be acceptable behaviour, so changing the implementation to return immediately seems like a sensible option. We can make this change by editing the binary at the point where the function is located.

MaybeAddDevice is part of lime.ndll, a shared library that probably contains the game engine. With objdump -t we can list all the symbols that lime.ndll imports or exports; the two relevant entries are shown below. The first number is the address at which the function will end up at runtime, plus a random base offset. After that, the type of the symbol is given: here we have one global (g) and one local (l) symbol. With this information we could fire up a hex editor and patch the binary directly, but there is, in my opinion, a more elegant way.

000000000049c030 g     F .text	0000000000000009              SDL_SemWait
...
00000000004a6be0 l     F .text	000000000000032d              MaybeAddDevice

On Linux there is a mechanism to inject our own code into a process: we can load our own shared library using the LD_PRELOAD environment variable, and every global function from a shared library can be overridden this way. You may have noticed, however, that the function we are interested in is not a global symbol, which complicates our endeavour slightly. Instead of overriding the function directly, we override the library loading function dlopen and inject ourselves into the loading process: we forward all calls to the original dlopen until we encounter the lime.ndll library.

At this point the library is loaded into memory, but we do not know where. We can query the location of global symbols using dlsym, and with the addresses from objdump we can calculate where MaybeAddDevice lies relative to SDL_SemWait; because both live in the same section, .text, their relative distance does not change (the exact offset is worked out below). So we just write to the calculated address, right? Well, on modern systems there are protections in place so that executable memory cannot be written to. This is an important mitigation to reduce the attack surface of software, because otherwise an attacker could write their own code into memory and simply jump there; but we want to use this power for good! Fortunately, Linux also allows changing the protection at runtime using mprotect, otherwise software relying on JIT compilation could not work.
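Plugging in the two addresses from the objdump output above gives the distance between the symbols:

0x4a6be0 (MaybeAddDevice) - 0x49c030 (SDL_SemWait) = 0xabb0 = 43952 bytes

This is the offset passed to the patching code below.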

So, we hook into the library loading process, find the location of the function, change the memory protection to writable, and patch the function. The patching itself is rather easy: we write a ret as the first instruction, which is encoded as the single byte 0xc3. The next time the function is called, it returns immediately.

#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <dlfcn.h>
#include <string>
#include <sys/mman.h>
#include <unistd.h>

using namespace std::string_literals;

// Round an address down to the start of its page, so mprotect gets a page-aligned pointer.
static void* page_round_down(void* addr, long page_size) {
    return reinterpret_cast<void*>(
        reinterpret_cast<uintptr_t>(addr) & ~static_cast<uintptr_t>(page_size - 1));
}

void patch_function(void* ptr, size_t offset) {
    char* func = reinterpret_cast<char*>(ptr);
    func += offset;
    const long page_size = sysconf(_SC_PAGESIZE);
    // Make the page containing the first instruction writable.
    int result = mprotect(page_round_down(func, page_size), 1, PROT_READ | PROT_WRITE | PROT_EXEC);
    if(result) {
        fprintf(stderr, "mprotect failed: %d %d\n", result, errno);
        return;
    }
    func[0] = static_cast<char>(0xc3); // the actual patching: overwrite the first instruction with ret
    mprotect(page_round_down(func, page_size), 1, PROT_READ | PROT_EXEC);
}

using dlopen_t = void* (*)(const char*, int);

extern "C"
void* dlopen(const char* filename, int flags) {
    static dlopen_t real_dlopen = nullptr;
    if(!real_dlopen)
        real_dlopen = reinterpret_cast<dlopen_t>(dlsym(RTLD_NEXT, "dlopen"));
    void* handle = real_dlopen(filename, flags);
    // Only patch the game engine library; pass everything else through untouched.
    if(!filename || filename != "././lime.ndll"s) {
        return handle;
    }
    void* func = dlsym(handle, "SDL_SemWait");
    if(!func) {
        fprintf(stderr, "could not find symbol: SDL_SemWait\n");
        return handle;
    }
    // MaybeAddDevice lies 43952 bytes after SDL_SemWait (see the objdump output above).
    patch_function(func, 43952);
    return handle;
}

We can compile this code into a shared library and inject it using LD_PRELOAD. A full implementation can be found here. I have extended the code to allow setting the symbol we look up and the offset via environment variables; this way, even if the binary changes, the code does not need to be recompiled.
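Building the library is an ordinary shared-object compile. Something along these lines should work; the source file name is my own choice, the output name matches the launch script below:

g++ -shared -fPIC -o papers_please_fix.so papers_please_fix.cpp -ldl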

The environment variables can be set in the launch.sh file like this:

export LD_PRELOAD=$(realpath papers_please_fix.so)
export PP_SYMBOL=SDL_SemWait PP_OFFSET=43952

With all of this in place, Papers Please works as intended, at least if you do not want to play it with a joystick. Unfortunately, I still do not know whether the slow close call is a bug and, if so, where it comes from. Nevertheless, I can finally play Papers Please, and I hope you can too, or at least that you learned something along the way.