A curious case of memory leak in a node.js app

data: 29 listopada, 2016

czas czytania: 20 min

autor: Aleksander Kasiuk

Categories

This is a story of a bug I had an opportunity to fix in the ancient times of node.js 0.10. The new and shiny node.js 0.12 was still just glimmering somewhere on the horizon, we were still young and beautiful. Some of us, at least.

Let me take you on an adventure through time and space. Hopefully, there will be some learning involved along the way. So hop on the magic school bus and we’re off!

The bug

It all started just as usual – with a bug report. Our performance team noticed a repeating pattern in one of the applications we were developing for a sizeable distributed system – a slow, linear increase in memory consumption of each worker process, followed by an imminent crash of one of the workers.

It wasn’t as critical issue as it seems, though. The performance testing environment was set up for determining the maximum load our system could handle, which meant that our applications were processing a constant load of requests far exceeding the projected peak load of the production system for days at a time. During week-long tests, the application would have several workers died due to the found memory leak and have them instantly brought back up. The data processing was resumed, nothing got lost and the drop in performance for the split second one of the workers was down was hardly noticeable. Overall, the impact on the system was minimal.

Anyhow, we started investigating.

The hunt

Immediately, we were faced with a problem. It took several hours of processing on a computer much more powerful than our developer PCs for the leak to become apparent, which slowed us down significantly. We had no chance to actually reproduce the application crashing. Instead, we settled on running the data processing for an hour or two at a time, doing other tasks in the meantime, and then returning to analyse the results. Sometimes we left the processing overnight hoping to find something new in the morning. What exactly did we look at?

We started small, with node-inspector, a Chrome Developer Tools implementation for node.js. At the time, node-inspector was a godsend allowing us to not only easily debug the application but also to capture CPU profiles and V8 heap snapshots of the running application for analysis. A lot of time has passed since then and a similar feature has been introduced into node.js itself – starting with the version 6.3.0 you can use --inspect flag to enable it. But I digress.

That last feature, V8 heap snapshots, is exactly what we focused on. For a week or so, we were going through a cycle of running the test data through the application, generating a heap snapshot and sifting through the results, finding nothing peculiar and going back to the start. Rinse and repeat. To our dismay, all those attempts yielded no results. The reported heap size was consistently staying at 20-30 megabytes. Such range may not seem consistent but comparing to the memory usage of processes dying on the performance environment – around 1.5 gigabytes – it started to become glaringly obvious that we’re looking in a wrong place.

Unfortunately, we’ve spent quite a lot of time on this single bug already and considering the fact that in the production environment a single crash wouldn’t have a chance to occur for days, maybe even weeks, the priority of the issue quickly decreased. We were still at a stage where functionalities were missing or were incomplete, so this particular issue was put on a back burner.

Even though the issue was deemed insignificant from the perspective of the project’s success, it stuck with us, the developers. It was like a splinter that you feel you’re oh-so-close to pull out but can’t quite grab it. Most of the time you don’t even notice it’s there. But when you do notice – it just can’t stop bothering you. And just like that splinter, this bug remained in our minds.

No wonder, then, that we returned to it periodically over the course of the next few months, when the pressure temporarily subsided and we had some time to look into those less important issues.

Returning with a clear mind, we tried a different approach. As we depended on numerous native modules, built with node-gyp, we had the growing suspicion the memory leak might not only not be created by us after all, but also not even be hidden in JavaScript code, being a result of a bug in C++ code of one of our dependencies instead.

For this reason, we decided to pull out the big guns. Being more specific, the big gun – Valgrind. For those of you who don’t know, Valgrind is a framework for building tools analysing applications in runtime. One of the most commonly used tools being shipped with Valgrind is Memcheck. Not getting into too much detail, which I might not be able to provide reliably anyway, Memcheck’s modus operandi is replacing all dynamic memory allocation/deallocation C functions (malloc, calloc, realloc and free) with its own implementations, allowing it to track all dynamic memory usage within the program. This, in turn, allows detecting a range of common problems with memory access in your programs, such as out-of-bounds array access, reading after free and, you guessed it, memory leaks.

However, all that power comes at a price. The problem our big gun is encumbered with, as many big things are, is that it moves slowly. Very slowly, in fact. Running an application through Memcheck causes a slowdown, on average, by a factor of 20 to 30, which is bad in and of itself. In our case, it was terrible. For us, such slowdown meant that whatever we managed to process within an hour, as little as it was, would now take a full day, maybe more. Having no other options in sight, we gave it a shot anyway.

Before we delve any deeper, it would make sense to mention what exactly you can find in Memcheck report and how to make sure you get the most out of it. Firstly, I’d like to focus on the categories that allocated memory blocks get divided into, which we’ve got four of, namely:

still reachable – the non-concerning kind, Memcheck is certain that there are variables in the code that hold addresses of those memory blocks and they could’ve been freed before the program exit.
definitely lost – the opposite of the previous ones; there are no more pointers that allow you access to the given memory blocks and there is no way to call free on them. These definitely call for a fix in the code.
indirectly lost – this one is a bit less straightforward, as it involves having a data structure containing pointers. If, for example, you created a linked list and lost all pointers to the first element of the list, then that would make the first element of that list definitely lost and all remaining elements indirectly lost. Even though pointers to memory blocks allocated for those elements exist in the preceding items, there’s still no way to actually reach them from the outside. These would most likely be fixed simply by fixing the error that caused the initial block to be marked as definitely lost.
possibly lost– it’s another category that is not very straightforward, as it relies on the notion of interior pointers. An interior pointer is basically a pointer that, instead of holding an address of the start of the allocated block, holds the address of somewhere in the middle of the block. There are multiple reasons for why that may be the case, many of them valid. One of such reasons may be using a data structure utilising first few bytes of the block to hold information about the size of the block internally, but, for convenience or else, being referenced to using pointers skipping this size data. If any of such pointers are detected in a pointer chain leading up to a memory block, then it is marked as possibly lost, as Memcheck has no way of knowing whether the offset in the pointer is intended or appeared as a result of an error. All such cases require human insight to determine whether they should be corrected or not.

That’s the most basic rundown, for more details and in-depth explanation I’d like to refer you to the Memcheck’s manual itself.

As for making the most out of the report, Memcheck is kind enough not only to let you know that you have memory leaks but also to tell you the exact line in your code where the leaky memory has been allocated. That is, of course, provided the binary you’re running contains debug symbols, which likely isn’t the case for most of the node.js native packages. Thus, before you start your debugging session, make sure to look through your node_modules directory, find all binding.gyp files within it and modify them to include compiler flag enabling debug symbols generation (-g in case of gcc) in appropriate cflags array (more details on where it should be located within the file available in the GYP documentation). This isn’t the official way of building a debug release with node-gyp – that would be installing node-gyp so that the command is available, calling node-gyp build --debug in directories of all native modules and then digging through their code to replace Release with Debug in all require calls that load the compiled module. Too much hassle, if you ask me. With my approach, you only have to modify the binding.gyp file and call npm rebuild in that directory – fewer changes and nothing more to set up. Once you’ve done that and load your node.js application with Memcheck, the generated report should contain filenames and line numbers of all the affected C/C++ files.

Now, it’s time to return back to our story. Armed with a new weapon, we were hoping for a breakthrough. I mean – if the memory leak doesn’t show up in JavaScript heap dump, we have to have some leaky code on the C++ side of things. Surely that has to be the case, hasn’t it? Well… not exactly. And don’t call me Shirley.

True, we did find some definitely lost memory blocks in one of our dependencies. At that point, we were overjoyed. Even though the size of those blocks was in mere hundreds of bytes, it seemed to us that we might have found the culprit. This part of the code was executed for each batch of the processed data and even though the leak was marginal, we took into consideration the short span of our tests, coupled with slowdown introduced by Memcheck and our limited processing power in comparison to the much beefier performance environment. Through the advanced technique of guesstimation, we arrived at a conclusion that this particular bug being the source of the whole problem is within the realms of possibility. We quickly forked the library, fixed the issue, pushed the fixes to our private NPM repository, quickly reran Memcheck to confirm the bug is gone and asked our performance team to run the tests once more.

The next day came and we were eagerly waiting for the test results to come, ready to pat ourselves on the back for job well done. As you can probably guess by now, the results were less than desirable. In fact, we weren’t even sure if anything changed at all. Now, it wasn’t the fact that this particular fix didn’t work that left us disappointed. It was the realisation that, once again, we were completely clueless about where to go from here. Soon enough, more important tasks appeared and we were forced to leave the issue of the elusive memory leak unresolved.

Even more time has passed before one of my colleagues decided to have another go at the problem. Once more, with a clear mind and strong conviction that we must be missing something obvious, he tried to retrace our steps and started with node-inspector. After a day or two, he decided to share some of his heap dumps with me, so I could look through them as well. This is something we’ve never done before. Every single time we analysed heap dumps, we did it each on our own, without bothering to export them to a file.

And this was where we noticed something very peculiar.

The revelation

Even though the heap size reported within the DevTools was roughly the same across all files I received, the smallest of them was merely 30-40 megabytes in size, while the largest – well over a hundred. Which is odd. „Why could that be the case?”, you might ask. What untold mysteries hide within these files? These are some very good questions. I’ve asked them myself.

At this point, it became obvious that there are some things Chrome DevTools were hiding from us all this time. Not trusting the tool anymore, I’ve decided to peek inside the files with text editor. This allowed me to discover that the generated snapshots are nothing more than a JSON document, the structure of which was fairly easy to work with even without a parser.

The three main parts of a heap dump are nodes, edges and strings arrays. Nodes are various types of entities kept in the memory by V8, such as objects, strings, numbers, compiled regular expressions, but also, for example, closures. Edges are connections that exist between those nodes. Both nodes and edges are stored as one-dimensional arrays of numbers, but, for your convenience, the file comes with not only line breaks separating each node and edge but also human-readable descriptions of what each of the fields of their representation means. Talk about being developer-friendly!

Let’s look what I’ve found inside, shall we? For this purpose, I will be using a heap dump which you can download here – I’ve generated it after a few minutes of running a minimal program that reproduced the issue. The first thing I noticed was a staggering total number of edges in comparison to the total number of nodes: 3,858,154 to 13,538. More than that, 3,792,103 of them, over 98% of the total, originated from a single node. In line 10 of the file you can see ,3,9,23,0,3792103. The most interesting fields here are the second, third and the last one – name, ID of the node and outgoing edge count. Name is the index of the string in the strings array, which reads (Handle scope). Scrolling through the file, you will inevitably stumble upon the unmissable close-to-four million repeated lines reading ,3,3415,80. Here, the most important fields for us are the second and third – name and ID of target node. Again, name is an index from the strings array, reading undefined_value and – surprise, surprise – there’s no node with ID 80 in the file.

What it meant was that our application, for some reason, kept millions upon millions of handles to undefined. When I discovered this, all I could think of for a few moments was „Huh, well that’s weird”. Suddenly my eyes opened a bit wider as the thought „Wait a moment… That reminds me of something!” sparked into my mind. You see, at this point, we had some strong suspicions our database connectivity module of choice, node-odbc, was the offender. The reason being similar leaky behaviour had been already observed in our other applications, just not as pronounced. Since the number of database queries generated by them was much lower than in the first application this problem was discovered in, we started to believe the size of the memory leak is somehow proportional to that number.

I quickly found the C++ file I was after. I’d read this code many times before, trying to find something that could potentially cause a memory leak, so I knew where to look. This would be good moment to explain a bit how managing the JavaScript scopes works in the C++ API. Let’s use a simple example for that.

void exampleFunction() {
    HandleScope scope;

    Local<Integer> value = Integer::New(Isolate::GetCurrent(), 0);
}

At the start of the function, we created a new instance of HandleScope, which created a new handle scope and pushed it to the top of the scope stack. From that point on, all handles are created within this scope. In V8, you can have object be referenced by Local and Persistent handles. In our example, we create a local handle to a new Integer object containing value 0. Next, the exampleFunction ends, which causes the scope variable go out of scope of that C++ function. This triggers the destructor of the scope object, removing it from the scope stack and removing all local handles created within that scope. Any object, for which all local handles to it are removed, is marked for removal by garbage collector. This can be prevented by creating a persistent handle instead, but that’s just a side note. I’ve prepared a simple animation to showcase the behaviour of our snippet.

exampleFunction execution

Now, we will modify our function a bit to make it return handle to an object in a way that ensures the object itself won’t get marked for GC.

Local<Integer> returningFunction() {
    EscapableHandleScope scope;

    Local<Integer> zeroValue = Integer::New(Isolate::GetCurrent(), 0);
    Local<Integer> oneValue = Integer::New(Isolate::GetCurrent(), 1);

    return scope.Escape(oneValue);
}

What scope.Escape() does is that it creates a new local handle in handle scope one step below our current scope in the stack and copies the value of the handle given as the parameter to the newly created one, thus ensuring the referenced object is still accessible. In the given example, object referenced by the zeroValue handle will be marked for GC after returningFunction, while object referenced by the oneValue handle will not. Why not create a persistent handle instead and return it from the function without scope.Escape(), you might ask? If we did, we’d have to remember to make sure to dispose of it manually somewhere else in the code, which obviously is error-prone. The way we did it, the handle created by scope.Escape() will be properly deleted on its own once the scope it was created within is disposed of. Again, let’s take a look at an animation presenting the behaviour of the code.

returningFunction execution

This is one part of the puzzle. Another thing that needs explanation is how node.js allows you to make synchronous, blocking C++ code seem asynchronous. I’ve already talked about this in more detail in my previous post. To keep the explanation brief here, let’s just say that to achieve this, you run the code you expect to be blocking for longer periods of time on a separate thread, and once it finishes its job, the results are passed to another function you provided, which is run in the main thread on the next cycle of the JS event loop.

So here I was, staring at several scope.Close(Undefined()); – our examples were using the current version of the API, this is how Escape() was called back then and Undefined() simply returned a handle to the undefined – inside those callback functions for asynchronous tasks, wondering if they are truly the reason for the observed behaviour. To confirm that, I edited the method for querying database so that it wouldn’t do any actual work. Instead, only starting a job on the separate thread, doing nothing particular there as well and returning to the callback function in main thread, which would only create and close the scope. I also added a global variable that I would increment and printf out on each call to the function triggering the threaded execution, just so I could compare that number to the reported edge count coming from handle scope in a heap dump. I then followed with creation of the script I already mentioned – we analysed its heap dump. It simply created a connection to the database and ran our gutted querying function in a loop. I’ve let the counter reach several million, which didn’t take very long and the memory usage ballooned, generated the heap snapshot and immediately killed the application right after. I opened the generated file, looked at the number of edges coming out of (Handle scope), compared it to the last reported value of our counter, and Bob’s your uncle – they were almost the same.

At that moment I knew I had found the „how?”, now all that was left before I fixed the issue was figuring out the „why?”.

The resolution

Certainly many of you are familiar with rubber duck debugging technique. There are times where you’ve read a particular bit of code so many times that you are certain it is obvious what it does. It was no different this time. I had seen those scopes being closed a hundred times. I knew what they were doing and it never occurred to me to question it. Not until that day.

You see, with these three functions that we used for the sake of having asynchronous code in JavaScript – the one called from JavaScript, triggering the execution on the separate thread, the one running our blocking code on a thread and the one collecting results and invoking JavaScript callback with the results – only the first and the last ones are running within a V8 context. Looking from JS side, even though you pass a callback to an asynchronous function, it can still return you something synchronously. That would be a thing that is returned from scope.Escape() of our first C++ function. However, despite invoking a JavaScript callback, we also had a handle to undefined returned synchronously from the last C++ function. Which begs a question: where did that handle go?

This was the key. It was being created, most likely, in the global handle scope, which lives throughout the entire lifetime of our node.js application. The third C++ function, the callback with the results from the thread, is not executed on top of the call stack of the original function that triggered the processing. Instead, a brand new call stack is being created, starting with the function we’re currently in. This means that the concept of closing/escaping the scope and returning data synchronously doesn’t really make sense here – there’s nothing that could make use of anything we return.

Once I realised that, I removed all scope.Close(Undefined()); expressions from the C++ callback functions, rebuilt the package and ran my test program once more. Poof. No more multiplying handles, no more leaking memory.

The performance tests using fixed version of the library only confirmed this result. The next day I submitted a pull request to the official repository, finally closing the issue.

The afterword

This bug is among the most enjoyable I’ve had the opportunity to approach. While frustrating at times, when our attempts at fixing it were thwarted, it offered many chances to use new tools and learn a lot about the inner workings of node.js and V8 in general. It also provided an unmatched satisfaction of being able to finally get closure after nearly 10 months of having it live in the back of our minds.

I hope you learned something new and useful today or, at the very least, enjoyed the story. Do you have similar stories of your own? Or maybe you’d like to know a bit more about anything mentioned in this post. If you do, please let me know in the comments below.

A curious case of memory leak in a node.js app

The bug

The hunt

The revelation

The resolution

The afterword

Zobacz więcej postów

On problems with threads in node.js

Angular 2+ fundamentals based on angular-cli created app

Zainteresowały Cię nasze artykuły?

Kontakt

Social Media