Memory leak/GC issues in Ignition gateway 7.9

Yesterday we had a big issue with the Ignition gateway running out of memory, causing a lot of HDD swapping, which knocked out other important services on the server.

After some investigation, it looks a lot like a memory leak to me.

[attachment=1]memoryLeak.jpg[/attachment]

We’re running some other projects on 7.9 without problems. So it must be something with the modules or the scripts we use for this specific project.

It’s the first time we’ve used the Web Browser module, with a custom-made web page. A memory leak in the web page itself is also possible of course, but I’d expect that to show up as a client memory leak, not on the server. Perhaps it’s the Web Browser module itself?

Meanwhile we’ve increased the allocated memory, so it should at least hold out a little longer.

Also notice the resemblance to this post: https://inductiveautomation.com/forum/viewtopic.php?f=72&t=16439&p=60085, though we don’t appear to get those big drops. (I still don’t understand what those big drops are; the regular small spikes are normal Java garbage collection, but I’ve never heard of a garbage collection happening only once every 6 hours.)

[attachment=0]modules.JPG[/attachment]

Here’s my current config:

# Java Additional Parameters
wrapper.java.additional.1=-XX:+UseConcMarkSweepGC
wrapper.java.additional.2=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional.3=-Ddata.dir=data
wrapper.java.additional.4=-Dorg.apache.catalina.loader.WebappClassLoader.ENABLE_CLEAR_REFERENCES=false
#wrapper.java.additional.5=-Xdebug
#wrapper.java.additional.6=-Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000

# Initial Java Heap Size (in MB); was 1024
wrapper.java.initmemory=2048

# Maximum Java Heap Size (in MB); was 1024
wrapper.java.maxmemory=2048

What’s the reason to use the ConcMarkSweepGC?

According to the Oracle documentation, it’s mainly useful for low-latency applications, and it doesn’t compact the heap. The part about not compacting the heap in particular seems strange to me for a service that should keep running (it’s fine for something that only has to run a few hours, but over a long period the heap will probably get fragmented).

And I also wonder whether latency is that important for the gateway: since Ignition tends to work a lot with polling, overall latency is quite low anyway.

This Phil Turmel post gives configuration details and some information about changing the garbage collector algorithm.

I haven’t tried it myself, but it seems to have given Phil good results. Maybe someone from IA can tell us what the criteria were for keeping the current default configuration rather than changing it, when it was introduced as a recommendation in Java.
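
If we do try it, I assume the change would amount to swapping the two CMS flags in data/ignition.conf for their G1 counterparts, something like this (the 100 ms pause target is only an illustrative value, not an IA recommendation):

# Java Additional Parameters (hypothetical G1 variant of the config above)
wrapper.java.additional.1=-XX:+UseG1GC
wrapper.java.additional.2=-XX:MaxGCPauseMillis=100
# -XX:+CMSClassUnloadingEnabled is CMS-specific, so it's dropped here
wrapper.java.additional.3=-Ddata.dir=data
wrapper.java.additional.4=-Dorg.apache.catalina.loader.WebappClassLoader.ENABLE_CLEAR_REFERENCES=false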

Regards,

[quote=“Sanderd17”]Here’s my current config:
What’s the reason to use the ConcMarkSweepGC?

According to the Oracle documentation, it’s mainly useful for low-latency applications, and it doesn’t compact the heap. The part about not compacting the heap in particular seems strange to me for a service that should keep running (it’s fine for something that only has to run a few hours, but over a long period the heap will probably get fragmented).

And I also wonder whether latency is that important for the gateway: since Ignition tends to work a lot with polling, overall latency is quite low anyway.[/quote]

In this case, being non-compacting means that the tenured area grows and eventually results in a stop-the-world pause, not that the heap is forever fragmented and unusable for long-running applications.

At the time we chose CMS it seemed a better option than the parallel collector, as we were trying to avoid frequent 500 ms+ pauses (there’s now the ClockDriftDetector warning, and in 7.9 the status overview UI that gives info on recent pauses, caused for any reason, not just GC). I can’t remember much more about why we chose it over the parallel collector… it was a while ago :confused:

Some people have had good results with G1, but it’s not recommended for heap sizes smaller than 4-6 GB. I’ve also found in testing that G1 will raise the baseline CPU usage of your gateway.

That being said, G1 is going to become Java’s default collector in Java 9, so it’s probably ready for use.

Changing your GC isn’t going to fix a memory leak though.


Your graph shows garbage collection and out-of-memory with less than 1 GB occupied, yet your config is for 2 GB. Is this a 32-bit machine?

FWIW, I’ve had very good results with the G1GC in keeping latencies down – less than 100 milliseconds is quite doable – in order to support high packet rate data collection. From what I’ve seen, your app should have a low-water mark (lowest value just after a big GC cycle) of 10-20 % of your max allowed memory. This gives the GC plenty of flexibility to choose when/if to do compaction.

As for the memory leak, you’ll want to use jmap and related tools to examine the distribution of memory allocations among the various classes. That’ll give you the details you need to hunt it down.
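
A minimal sketch of that workflow, assuming the JDK tools are on the PATH (<pid> is a placeholder for the gateway JVM’s process id):

jps -l                                                   # list running JVMs to find the gateway's pid
jmap -histo:live <pid> | head -n 30                      # per-class histogram of live objects
jmap -dump:live,format=b,file=ignition-heap.hprof <pid>  # full heap dump for Eclipse MAT or jhat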


Thanks for all the replies.

My config was indeed 1 GB, but I changed it to 2 GB after the problems. I’m sorry if that wasn’t clear.

Apparently, it’s not a memory leak after all. Rather, the big GC happens only rarely, and it caused big stress on the HDD when we had our problems. According to support, there’s apparently also a bug in table updates causing too much IO (which should be fixed in version 7.9.1). The combination of the two might have been deadly.

[attachment=0]bigGC.JPG[/attachment]

I’m reading some docs about the G1GC and GC in general, and I think I understand what went wrong.

The problem is the tenured data, which kept growing until a full GC was needed. By that point, because the data was old, and because some other apps had been opened and closed, most of the unreferenced tenured data had been paged out to the HDD.

So when the cleaning operation began, the GC had to fetch a lot of data back from the HDD, causing too much IO and timeouts in other threads and other apps (and an unresponsive user interface, which led the operators to give the same command multiple times, making the whole thing worse).

So what we want is to clean up the tenured data more often, or at least read it more often so it doesn’t get paged out to the HDD. Reading about the G1GC, I notice it has a “mixed GC”, and for that it executes a multi-phase concurrent marking cycle in which it marks the liveness of the tenured regions. So it reads the tenured data regularly, not just at a full GC. I haven’t read the docs on Concurrent-Mark-Sweep yet, but I guess it only has a minor and a full GC, is that true?

In that case, switching to the G1GC should help, as it will read the tenured data often enough to prevent large page files and keep the mixed GC fast enough. And in any case, a mixed GC won’t try to clean the entire tenured space at once, but will do it in chunks, still giving other threads some IO time in the case of HDD-paged memory.
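
And if the marking cycle still doesn’t run often enough, my understanding is that G1’s initiating threshold can be lowered so concurrent marking starts earlier; a sketch of the knob I have in mind (the 35% value and the parameter numbering are just examples):

# Start G1's concurrent marking at 35% heap occupancy instead of the
# default 45%, so tenured regions get scanned (and stay resident) sooner
wrapper.java.additional.3=-XX:InitiatingHeapOccupancyPercent=35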


Thanks a lot for all your help. Using the G1GC has made the memory footprint a lot more stable: it now stays between 250 and 1400 MB. I don’t see a rising trend anymore, so most likely there will be no reason for the big garbage collections that can thrash the HDD.

[attachment=0]G1GC.JPG[/attachment]
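
To keep an eye on it, we could also enable GC logging; a sketch with Java 8-era HotSpot flags (the parameter numbering and log path are just examples):

# Log each collection with timestamps, to verify pause times and frequency
wrapper.java.additional.5=-XX:+PrintGCDetails
wrapper.java.additional.6=-XX:+PrintGCDateStamps
wrapper.java.additional.7=-Xloggc:logs/gc.log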