Understanding Memory Usage of Your meshIQ Products

meshIQ products are primarily Java-based. To configure your environment correctly, it is important to understand a few things about system memory.

Memory Concepts

The following diagram gives a high-level view of how memory is used in a system. Like a balloon, it can only hold so many applications at a time. There are typically a few large applications and many small ones that fit into the available space. The balloon shown is full but not ready to burst.

Common concepts

Total memory represents the total amount of space available to run applications. This includes system memory and swap space.

For applications to run, they must be resident in system memory (physical RAM). Most people are familiar with system memory, since it is the amount of memory listed when you buy a home computer.

Less commonly understood is what happens when resident memory exceeds system memory. To let applications consume more memory than physically exists, the operating system writes unused pages to disk and restores them when they are needed. On older home computers, you could tell this was happening because the disk drive became active when you switched from one application to another. On Linux systems this activity is called swapping; on others, it is referred to as paging. In the balloon analogy, imagine siphoning air out to make room for another application's pages, then repeating the process to put the original pages back.

For high-volume systems, turning off swap (paging) is typically recommended, because reading from and writing to disk is very slow; even solid-state drives are no match for system memory. With swap turned off, total memory and system memory are the same size, and the operating system cannot allow the total resident size to exceed physical memory.
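On Linux, the figures described above can be read from /proc/meminfo, where MemTotal is system memory and SwapTotal is swap space (both reported in kB). The sketch below is a minimal illustration, not a meshIQ tool, and its function names are hypothetical. It computes total memory as system memory plus swap and reports whether swap is turned off.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}; sizes are in kB."""
    values: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only lines of the form "Name:   12345 kB" (or a bare number).
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def summarize_memory(meminfo_text: str) -> dict[str, object]:
    """Report system memory, swap, and total memory (system + swap) in kB."""
    info = parse_meminfo(meminfo_text)
    system_kb = info.get("MemTotal", 0)
    swap_kb = info.get("SwapTotal", 0)
    return {
        "system_kb": system_kb,
        "swap_kb": swap_kb,
        "total_kb": system_kb + swap_kb,  # total memory = system memory + swap
        "swap_off": swap_kb == 0,  # with swap off, total equals system memory
    }
```

On a Linux server you might call `summarize_memory(open("/proc/meminfo").read())`; other operating systems do not provide this file.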

Java-Specific Concepts

Drilling down one level further into Java applications, the most commonly discussed memory is the Java heap, where all Java objects are stored. While the techniques used to manage this storage differ by Java version, the most important behavior is consistent. When a Java application allocates a new object, the storage comes from the heap. When that object is no longer used, its storage is not reclaimed immediately: it is returned to the heap only when garbage collection (GC) runs, and it typically is not made available to other applications right away. As a result, heap usage grows and shrinks constantly over time as GC runs. Recent Java versions have optimized GC to reduce its overhead and its impact on memory usage.

In addition to the heap, Java applications also consume native memory, also referred to as non-heap storage. Because the meshIQ platform allocates many processing threads, connections, and file handles, the amount of native memory can be considerable.

Recent Java versions are opportunistic when allocating the heap: just because an application has asked for 8GB does not mean all of it will be resident right away. However, because the amount actually used is driven by the GC algorithm, you should expect that all of the requested heap will eventually be used. Current Java tuning guidance suggests setting the initial heap size (-Xms) and the maximum heap size (-Xmx) to the same value, because expanding and contracting the heap is expensive; if the application is expected to use the memory eventually, it is better to allocate it up front. For estimation purposes, treat the maximum heap size plus the required native memory as memory that will ultimately be resident.
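To apply this estimate, you need the -Xms and -Xmx values each JVM was started with, which appear on the command line shown by top -c or ps. The sketch below is a hypothetical helper, not part of the meshIQ products: it parses those flags (with the k, m, or g suffixes the JVM accepts), checks whether the initial and maximum sizes match, and adds a native-memory figure that you must supply yourself, since native usage depends on the workload. It does not handle other ways of setting the heap, such as -XX:MaxHeapSize.

```python
import re

# Multipliers for the size suffixes accepted on -Xms/-Xmx (no suffix = bytes).
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}
_HEAP_FLAG = re.compile(r"^-Xm([sx])(\d+)([kKmMgG]?)$")


def heap_flags(args: list[str]) -> dict[str, int]:
    """Return {"Xms": bytes, "Xmx": bytes} for whichever flags are present.

    If a flag is repeated, this sketch keeps the last occurrence.
    """
    found: dict[str, int] = {}
    for arg in args:
        match = _HEAP_FLAG.match(arg)
        if match:
            kind, number, unit = match.groups()
            found[f"Xm{kind}"] = int(number) * _UNITS[unit.lower()]
    return found


def initial_equals_max(args: list[str]) -> bool:
    """True when -Xms and -Xmx are both set and equal, as the tuning tip suggests."""
    flags = heap_flags(args)
    return "Xms" in flags and flags.get("Xms") == flags.get("Xmx")


def estimated_resident_bytes(args: list[str], native_bytes: int) -> int:
    """Maximum heap plus a caller-supplied (assumed) native-memory estimate."""
    flags = heap_flags(args)
    if "Xmx" not in flags:
        raise ValueError("no -Xmx flag found; the JVM would choose its own default")
    return flags["Xmx"] + native_bytes
```

For example, a process started as `java -Xms8g -Xmx8g -jar app.jar` (a hypothetical jar name) with an assumed 2GB of native memory would be budgeted at 10GB of resident memory.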

Storage usage and meshIQ products

When problems arise in the meshIQ platform, they are most commonly storage related, because our applications consume and produce a lot of data. The amount of heap allocated determines how much resident memory is required, and that memory competes with every other application running on the server. Ideally, all heap storage would be used and resident in system memory, with no swap activity.

Typical Problem Scenarios

The typical problems that can occur are:

Insufficient memory to host all of the applications
Excessive garbage collection cycles consuming high amounts of CPU
Running out of heap storage, so that new Java objects cannot be created
Insufficient system memory to run all of the applications resulting in swapping (paging)

Case 1: When available memory is depleted, the operating system has to take action; letting the balloon pop is not an option. To make space, it picks one of the applications to terminate (on Linux, this is the job of the out-of-memory (OOM) killer). Terminating one of the small applications would make a negligible difference, so it picks one that frees a meaningful amount of memory. Because the meshIQ processes consume the most memory, they are likely candidates. Unlike cases 2 and 3, the meshIQ applications cannot log any message beyond the fact that they stopped, nor can they create NRD files.

Reducing Storage requirements for total memory

To address this condition, you can increase the memory available on the server, reduce the workload of the CEP, or move other components to other servers. Do not increase the heap size to solve this issue; a larger heap only makes the problem worse, and failures will happen more frequently.

If you can add more memory to the server, that is usually the quickest option.

One way to reduce the memory workload is to determine whether the heap is overallocated. A common response we have seen when resolving case 2 or 3 is doubling the heap size: if you ran out at 4GB and increased it to 8GB, you may have needed only 6GB. If you have multiple meshIQ applications on the same server, these small overallocations add up.
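Rather than doubling, you can size the heap from the peak amount of live data you observed (for example, heap used after a full GC, taken from GC logs or monitoring) plus some headroom. In the sketch below, the function names, the 25% headroom, and the 512 MiB rounding step are hypothetical choices, not meshIQ recommendations; adjust them to your own measurements.

```python
def suggested_heap_mib(peak_live_mib: int, headroom_pct: int = 25, step_mib: int = 512) -> int:
    """Peak live data plus headroom, rounded up to a multiple of step_mib.

    headroom_pct and step_mib are illustrative defaults, not recommendations.
    """
    needed = -(-peak_live_mib * (100 + headroom_pct) // 100)  # ceiling division
    return -(-needed // step_mib) * step_mib  # round up to the next step


def overallocation_mib(services: dict[str, tuple[int, int]], **kwargs: int) -> int:
    """Total (configured heap - suggested heap) over services that are overallocated.

    services maps a name to (configured_heap_mib, observed_peak_live_mib).
    """
    return sum(
        max(0, configured - suggested_heap_mib(peak, **kwargs))
        for configured, peak in services.values()
    )
```

With a peak of 4710 MiB of live data, `suggested_heap_mib(4710)` gives 6144 MiB (6GB) rather than 8GB, and two services configured that way at 8GB each are overallocated by 4096 MiB in total.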

Often, the best option is to review the workload and distribute it across multiple servers. The meshIQ architecture is designed to be distributed: databases, messaging servers, and even the meshIQ services can all be spread across a number of servers. If you use Solr, it should also run on its own server(s).

Case 2: When the working set of the service is close to the size of the Java heap, GC runs almost continually. Imagine bailing out a leaky boat. With only one person in the boat, you can get away with bailing periodically. When the boat is full of people, you need to bail almost constantly to keep it afloat. Similarly, GC consumes the CPU and the service slows down.
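One way to spot this condition is to measure what fraction of wall-clock time the JVM spends in GC. The JDK's `jstat -gcutil <pid> 10s` command prints a GCT column with the cumulative GC time in seconds at each interval; the sketch below (hypothetical helper names) parses that output and computes the GC share of each interval. What counts as "too much" is an assumption you need to set for your own environment.

```python
def gct_samples(jstat_output: str) -> list[float]:
    """Extract the cumulative GCT column (seconds) from `jstat -gcutil` output."""
    rows = [line.split() for line in jstat_output.strip().splitlines() if line.strip()]
    header = rows[0]
    gct_index = header.index("GCT")
    # Skip the header and any repeated header rows that jstat may print.
    return [float(row[gct_index]) for row in rows[1:] if row != header]


def gc_overhead(samples: list[float], interval_s: float) -> list[float]:
    """Fraction of each sampling interval spent in GC (0.0 to 1.0)."""
    return [(later - earlier) / interval_s for earlier, later in zip(samples, samples[1:])]
```

A value that stays high across many intervals matches the "bailing almost constantly" picture above.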

Case 3: Continuing the analogy, if you put one more person in the boat, no amount of bailing may help. When the heap is exhausted, the JVM cannot create new objects and throws an OutOfMemoryError. When this happens, the applications typically log an "out of memory" message and shut down.
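Because the applications usually log the error before shutting down, scanning their logs for java.lang.OutOfMemoryError can confirm this case. The detail text after the class name distinguishes, for example, "Java heap space" from "GC overhead limit exceeded", which the HotSpot JVM reports when it spends almost all of its time in GC while recovering very little heap. The log location and format are assumptions here, so the sketch below (a hypothetical function) simply works on a list of lines.

```python
import re

# Matches the class name and captures any detail text after the colon.
_OOM = re.compile(r"java\.lang\.OutOfMemoryError(?::\s*(?P<detail>.*))?")


def find_oom_errors(lines: list[str]) -> list[tuple[int, str]]:
    """Return (1-based line number, detail text) for each OutOfMemoryError found."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = _OOM.search(line)
        if match:
            hits.append((number, (match.group("detail") or "").strip()))
    return hits
```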

Reducing Storage requirements for java heap

In these cases, the resolution is the same: either increase the heap size (get a bigger boat) or reduce the workload (carry less cargo). The meshIQ CEP engines create a diagnostics file (NRD) that dumps the current state when these conditions are detected. The support team can use the NRD file to recommend how to reduce the workload, for example by reducing the amount of data being collected or by dividing the workload among other instances. Monitor the memory profile on an ongoing basis to understand it; waiting until the system dies makes it difficult to determine what pushed it past its capacity. There are also limits to how much you can increase the heap size. Besides consuming more system memory, a larger heap increases the time needed to perform garbage collection and can still lead to system resource shortages.
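Ongoing monitoring is most useful when it reveals a trend before the failure. A minimal sketch, assuming you already collect periodic samples of heap used after GC (from GC logs, JMX, or another monitoring tool): fit a least-squares line through the samples and project when it would reach the maximum heap size. The function name and the choice of a linear fit are illustrative only; real workloads often grow in steps rather than in a straight line.

```python
from typing import Optional


def project_exhaustion(samples: list[tuple[float, float]], max_heap: float) -> Optional[float]:
    """Project when heap-used-after-GC reaches max_heap, using a least-squares line.

    samples are (time, used_after_gc) pairs in consistent units (e.g. seconds, MiB).
    Returns the projected time, or None if there is no upward trend.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking: no projected exhaustion
    last_t = samples[-1][0]
    fitted_now = mean_u + slope * (last_t - mean_t)  # trend value at the latest sample
    return last_t + (max_heap - fitted_now) / slope
```

With hourly samples of 1000, 2000, and 3000 MiB against an 8000 MiB heap, the projection lands five hours after the last sample.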

Case 4: As mentioned above, the general recommendation is to have enough system memory to hold all of the allocated memory. If swap is active, the system begins to page when demand exceeds its physical capacity, which typically results in inconsistent application behavior.

For example, if you run two Java applications with 8GB heaps on a system with 12GB of memory, their combined 16GB of heap exceeds physical memory by 4GB, and that shortfall has to be paged to and from disk. Both applications will run slower, as will other applications on the system.

16GB might seem like the minimum for this example, but that does not account for native memory, other applications, third-party tooling, and operating system services that will also be running.
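The arithmetic in this example can be written out as a quick check. The sketch below (a hypothetical function) sums each JVM's maximum heap and estimated native memory, adds an allowance for the operating system and other software, and subtracts the system memory; a positive result is the amount that would have to be paged. The native and OS figures used in the note after it are hypothetical placeholders, not measured values.

```python
GIB = 1024**3


def memory_shortfall(system_bytes: int, jvms: list[tuple[int, int]], other_bytes: int = 0) -> int:
    """Bytes of demand beyond physical memory; zero or negative means it fits.

    jvms is a list of (max_heap_bytes, native_estimate_bytes) pairs.
    other_bytes covers the OS, third-party tools, and other applications.
    """
    demand = sum(heap + native for heap, native in jvms) + other_bytes
    return demand - system_bytes
```

With no native or OS allowance, `memory_shortfall(12 * GIB, [(8 * GIB, 0), (8 * GIB, 0)])` returns 4GB, as described above. Assuming 1GB of native memory per JVM and 2GB for everything else, even a 16GB server is still 4GB short.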

Using system tools to understand your memory usage

The top command displays available memory and current usage. If swap is turned off, the swap line shows 0 total.

Running top -c displays each process's full command line, which helps identify the processes using the memory. Press Shift+M to sort by memory usage, which reflects resident memory (RES), and use the PID or COMMAND column to identify each process. Virtual memory (VIRT) can appear very high for multi-threaded applications, but it does not reflect actual memory use and can be ignored in most cases.
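For scripting, or for capturing a snapshot to share with support, `ps -eo pid,rss,args` prints the same resident-memory information non-interactively (RSS in KiB with the Linux procps ps). The sketch below, with a hypothetical function name, parses that output and returns the largest processes by resident memory.

```python
def top_by_rss(ps_output: str, limit: int = 5) -> list[tuple[int, int, str]]:
    """Parse `ps -eo pid,rss,args` output; return (pid, rss_kib, command), largest first."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # first line is the header
        parts = line.split(None, 2)  # the command line itself may contain spaces
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[0]), int(parts[1]), parts[2]))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]
```

For example: `top_by_rss(subprocess.run(["ps", "-eo", "pid,rss,args"], capture_output=True, text=True).stdout)`.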

Use dmesg to get detailed memory statistics and any events from the kernel's out-of-memory (OOM) killer. The command dmesg -T | grep -i 'kill' shows whether a process was killed.
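A case-insensitive grep for "kill" can also match unrelated lines. When the OOM killer terminates a process, the kernel logs a line containing "Killed process <pid> (<name>)", so a slightly stricter match picks out just those events. A minimal sketch with a hypothetical function name, operating on the text of dmesg -T output:

```python
import re

# The kernel's OOM-killer log line contains "Killed process <pid> (<name>)".
_KILLED = re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]*)\)")


def oom_kills(dmesg_text: str) -> list[tuple[int, str, str]]:
    """Return (pid, process name, full line) for each OOM-killer kill in dmesg output."""
    kills = []
    for line in dmesg_text.splitlines():
        match = _KILLED.search(line)
        if match:
            kills.append((int(match.group("pid")), match.group("name"), line.strip()))
    return kills
```

You might feed it `subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout`; depending on the system's settings, reading dmesg may require root.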