This article describes a series of performance benchmarks that demonstrate the scaling properties of the Linux Kernel-based Virtual Machine (KVM) running WebSphere Portal on Windows virtual machines. It first describes the libvirt options for CPU configuration and then documents the WebSphere Portal benchmark environment. Finally, the results show how WebSphere Portal scales when CPUs are pinned, when NUMA boundaries are crossed, and when other configuration changes are made.
As is typical of Unix-based systems, virtualization in Linux is composed of multiple, layered pieces. KVM itself is just the kernel module that allows access to the virtualization extensions of the underlying processor. A separate component is required to actually create a virtual machine. In practice, this component is split into two parts: libvirt and QEMU.
libvirt is an abstraction layer around the concept of virtualization. It provides a common layer to configure and manage virtual machines (VMs). libvirt does not directly create, configure or run virtual machines; instead it delegates to an underlying emulator or hypervisor. libvirt supports a number of hypervisor technologies, the most common of which is QEMU.
libvirt configuration is done via XML using the virtual shell command processor ‘virsh’. Each virtual machine has its own configuration file that specifies all data needed to define a VM. These include the number of CPUs; network configuration and MAC addresses; and disks and disk image locations among many other parameters.
KVM guests can also use a set of paravirtualized drivers. These are the so-called ‘virtio’ drivers. Rather than fully emulating a device for the VM, these drivers are aware of the underlying hypervisor and are able to cooperate with it directly for improved disk and network performance.
QEMU is the actual virtualization engine that runs virtual machines. libvirt commands used to start a VM are converted into lower level QEMU commands that actually start the QEMU processes. Each QEMU process represents a virtual machine on the host operating system.
QEMU can emulate x86 and other processor architectures, but this is not necessary on x86 Linux. Instead, QEMU uses the KVM kernel module to run a virtual machine natively on x86 hardware by using the processor's built-in virtualization instructions.
From the host Linux operating system perspective, a virtual machine is just another process. By default, process groups are used for security isolation between VMs, but otherwise the process behaves like any other. CPU time is scheduled, memory is allocated and resources are managed by the Linux kernel using the same code and algorithms a ‘normal’ process is governed by.
This can have performance implications since there are, by default, no restrictions on:
1. Other processes running on the host
2. Which VM, if any, has higher priority
3. How and when processes are scheduled to run
4. Which CPUs or NUMA nodes a given process will execute on
For issue 1, minimal installations and careful configuration can be used to ensure there is little other processing work needed on the host OS. Issues 2 and 3 are areas for further study and are not covered in this article.
This article will examine issue 4 by running WebSphere Portal in various CPU configurations. Two classes of configurations will be measured: 4 virtual CPUs (vCPUs) and 8 vCPUs. The 4 vCPU configurations will be used to establish a baseline for which tunings give the best performance. The 8 vCPU configurations will expand on these results to determine the impact of different NUMA configurations.
Note that a virtual CPU is the CPU that the guest operating system sees. There does not need to be a physical CPU for each virtual CPU. Across all VMs running on a host, the total number of virtual CPUs can exceed the physical CPU count. This condition is referred to as overcommitment. Overcommitting CPUs can cause performance issues if more virtual CPU resources are needed than there are physical CPUs (i.e. if all VMs on a host are busy).
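The overcommitment condition described above is simple arithmetic. The following is a minimal sketch; the function names and the VM sizes are hypothetical and chosen only for illustration.

```python
def overcommit_ratio(vcpus_per_vm, physical_cpus):
    """Return the total vCPU count divided by the host's physical CPU count."""
    if physical_cpus <= 0:
        raise ValueError("physical_cpus must be positive")
    return sum(vcpus_per_vm) / physical_cpus


def is_overcommitted(vcpus_per_vm, physical_cpus):
    """True when the VMs together define more vCPUs than the host has CPUs."""
    return sum(vcpus_per_vm) > physical_cpus


if __name__ == "__main__":
    # Hypothetical host with 24 logical CPUs and three VMs.
    vms = [8, 4, 4]
    print("ratio:", overcommit_ratio(vms, 24))
    print("overcommitted:", is_overcommitted(vms, 24))
```

A ratio above 1.0 only becomes a problem when the VMs are busy at the same time, which is the caveat the paragraph above makes.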
By default, a virtual machine can run on any CPU in the system. The Linux scheduler will attempt to maintain processes and threads on ‘recently run’ CPUs, but this is not guaranteed over periods of time longer than a few seconds. This has performance implications because every time a process moves to a different CPU, the new processor’s L2 cache is likely empty. Process memory has to be re-fetched from main memory which slows down the entire VM.
It is possible however to ‘pin’ a process to a specific set of CPUs. Only these CPUs will be used by the given process. libvirt controls CPU pinning using the cpuset attribute within the <vcpu> element in the configuration file for a particular VM. By default, there is no cpuset attribute specified which implies that there is no CPU pinning.
For example, to pin a VM to CPUs 0-3, the following configuration stanza can be used:
<vcpu placement='static' cpuset='0-3'>4</vcpu>
Contrast this to the default, which omits the cpuset attribute and therefore specifies no pinning:
<vcpu placement='static'>4</vcpu>
Note that it is possible to pin to more CPUs than there are vCPUs (i.e. the cpuset attribute specifies more CPUs than what the <vcpu> element contains). For example, <vcpu placement='static' cpuset='0-5'>4</vcpu>, would allow only 4 vCPUs but those 4 vCPUs would be able to run on any physical CPU in the first processor socket.
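To see how a cpuset value relates to the vCPU count, the string can be expanded into individual CPU numbers. This is a sketch only: the helper names are hypothetical, and it assumes the cpuset syntax of comma-separated CPU numbers, ranges, and caret-prefixed exclusions.

```python
def parse_cpuset(spec):
    """Expand a cpuset string such as '0-3' or '0-5,^2' into a sorted CPU list."""
    included, excluded = set(), set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        # A leading caret excludes CPUs that another part of the string included.
        target = excluded if part.startswith("^") else included
        part = part.lstrip("^")
        if "-" in part:
            low, high = part.split("-", 1)
            target.update(range(int(low), int(high) + 1))
        else:
            target.add(int(part))
    return sorted(included - excluded)


def spare_cpus(vcpus, cpuset):
    """Return how many pinned physical CPUs exceed the vCPU count (negative if fewer)."""
    return len(parse_cpuset(cpuset)) - vcpus


if __name__ == "__main__":
    for spec in ("0-3", "0-5"):
        print(spec, parse_cpuset(spec), "spare CPUs:", spare_cpus(4, spec))
```

With 4 vCPUs, a cpuset of '0-5' leaves two more physical CPUs than vCPUs, which matches the example in the paragraph above.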
While CPU pinning ensures that the given process only runs on the specified CPUs, there is nothing to stop other processes from also using those CPUs. The performance impact of this could be significant if two large VMs are both trying to consume a large amount of CPU at the same time. It could also affect VMs running on overcommitted hosts. Neither of these conditions will be explicitly measured by the benchmarks documented here.
Be aware that it is possible to pin to hyperthreads. This may or may not be what is intended, so care should be taken when specifying the CPU numbers. The virsh capabilities command can be used to determine which CPUs belong to which NUMA node. Within the node, the higher numbered CPUs are the hyperthreads. For example, here is partial output from a 12-core system:
Listing 1. Partial output from virsh capabilities
Note that there are two <cell> elements, one for each NUMA (non-uniform memory access) node. On this system each processor socket has a separate memory controller and is thus a separate NUMA node. Each processor socket has 6 cores with hyperthreading, so there are 24 logical CPU threads in the system. In cell one, CPUs are numbered 0-5 then 12-17. CPUs 12-17 are the hyperthreads in this node. Similarly, CPUs 18-23 are the hyperthreads in the other node.
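The CPU-to-node mapping described above can also be extracted programmatically. The sketch below builds an illustrative topology fragment that follows the numbering described in the text; it is not the original listing, and the cell ids 0 and 1 as well as the exact element layout are assumptions. On a real host the XML would come from the virsh capabilities command instead.

```python
import xml.etree.ElementTree as ET


def build_sample_capabilities():
    """Build an illustrative topology fragment (assumed layout, not real output)."""
    layout = {
        0: list(range(0, 6)) + list(range(12, 18)),
        1: list(range(6, 12)) + list(range(18, 24)),
    }
    cells = []
    for cell_id, cpus in layout.items():
        cpu_xml = "".join("<cpu id='%d'/>" % cpu for cpu in cpus)
        cells.append("<cell id='%d'><cpus num='%d'>%s</cpus></cell>"
                     % (cell_id, len(cpus), cpu_xml))
    return ("<capabilities><host><topology><cells num='2'>%s"
            "</cells></topology></host></capabilities>" % "".join(cells))


def cpus_by_numa_node(capabilities_xml):
    """Map each NUMA cell id to the sorted list of CPU ids it contains."""
    root = ET.fromstring(capabilities_xml)
    nodes = {}
    for cell in root.iter("cell"):
        nodes[int(cell.get("id"))] = sorted(
            int(cpu.get("id")) for cpu in cell.iter("cpu"))
    return nodes


def split_cores_and_hyperthreads(cpu_ids):
    """Assume the higher-numbered half of a node's CPUs are the hyperthreads."""
    half = len(cpu_ids) // 2
    return cpu_ids[:half], cpu_ids[half:]


if __name__ == "__main__":
    for node, cpus in cpus_by_numa_node(build_sample_capabilities()).items():
        cores, threads = split_cores_and_hyperthreads(cpus)
        print("node", node, "cores", cores, "hyperthreads", threads)
```

The split into cores and hyperthreads relies on the numbering convention stated in the article and should be checked against the actual host before pinning.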
Memory accesses to the other NUMA node will be slower than memory access to the local node. The benchmarks documented in this article will show the performance effects of splitting a virtual machine across NUMA nodes.
While pinning restricts a virtual machine to a particular set of CPUs, it does not restrict where the vCPUs are actually executed. A particular vCPU can run on any of the pinned CPUs. This may have similar performance implications as not pinning at all. Even in a multithreaded workload, L2 caching could play an important role. Running all work for a particular thread on the same real CPU could result in improved performance.
libvirt supports explicitly mapping a vCPU to a real CPU via the <vcpupin> element. By default, there are no <vcpupin> statements, thus any vCPU can run on any physical host CPU.
For example, to pin each vCPU to a specific real CPU, the following configuration can be used:
Listing 2. vCPU pinning example configuration
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='6'/>
  <vcpupin vcpu='5' cpuset='7'/>
  <vcpupin vcpu='6' cpuset='8'/>
  <vcpupin vcpu='7' cpuset='9'/>
</cputune>
Given the above output from the virsh capabilities command, this <cputune> element will configure an 8 vCPU VM to use 4 real CPUs on the first NUMA node (CPUs 0, 1, 2 and 3) and 4 real CPUs on the second NUMA node (CPUs 6, 7, 8 and 9). Note that the virtual CPU numbering should not contain gaps; all vCPU numbers should be specified.
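When the mapping needs to be regenerated for different CPU choices, the <vcpupin> elements can be produced from a list of physical CPUs. This is a minimal sketch with a hypothetical helper name; numbering the vCPUs with enumerate guarantees that the vCPU numbers contain no gaps.

```python
import xml.etree.ElementTree as ET


def build_cputune(physical_cpus):
    """Create a <cputune> element pinning vCPU n to the n-th physical CPU given."""
    cputune = ET.Element("cputune")
    for vcpu, pcpu in enumerate(physical_cpus):
        ET.SubElement(cputune, "vcpupin", vcpu=str(vcpu), cpuset=str(pcpu))
    return cputune


if __name__ == "__main__":
    # Four CPUs on each NUMA node, as in the example above.
    element = build_cputune([0, 1, 2, 3, 6, 7, 8, 9])
    print(ET.tostring(element, encoding="unicode"))
```

The generated element still has to be placed in the VM's configuration file by hand or with virsh.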
Virtual Machine Topology
Regardless of which physical CPUs are being used by a VM, what the virtual machine itself sees can also be configured. By default, each virtual CPU is seen by the VM as a separate CPU in a separate socket. So, a 4 vCPU machine will be seen as 4 single CPU processors by the VM OS. This may have performance implications if the guest operating system schedules differently when hyperthreads are present.
To change the VM topology, the libvirt <topology> element can be used. For example, to expose 4 vCPUs as a single dual-core processor with hyperthreading, the following can be added to the VM’s configuration file:
<topology sockets='1' cores='2' threads='2'/>
Similarly, this same system could be configured as a 4 core processor without hyperthreading with the following:
<topology sockets='1' cores='4' threads='1'/>
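In both examples the product of sockets, cores and threads equals the number of vCPUs. A quick check like the one below can catch a mismatch before editing the configuration; the function names are hypothetical, and the sketch assumes the product is meant to equal the vCPU count.

```python
def topology_vcpus(sockets, cores, threads):
    """Number of logical CPUs a guest sees for a given topology."""
    return sockets * cores * threads


def topology_matches(vcpus, sockets, cores, threads):
    """True when the topology accounts for exactly the configured vCPUs."""
    return topology_vcpus(sockets, cores, threads) == vcpus


if __name__ == "__main__":
    # The two 4 vCPU layouts shown above.
    print(topology_matches(4, sockets=1, cores=2, threads=2))
    print(topology_matches(4, sockets=1, cores=4, threads=1))
```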
WebSphere Portal Benchmark
To answer the questions posed above, a set of measurements was run on WebSphere Portal 126.96.36.199.
For this benchmark, a simple set of test portlets was installed. These portlets make no requests to external databases, files or other systems. Tuning was applied as specified in the Portal 8.0 Tuning Guide to ensure that database access by Portal is minimal after warmup. This means that the benchmark will only be limited by the CPU resources of the Portal server.
To simulate a realistic customer application, the test portlets were installed on a set of several hundred portal pages in a 2 level hierarchy. Each user can see 26 pages. Users are stored in a separate LDAP server that is not running on the virtual machine under test.
The host system under test is an IBM xSeries x3550 M3 running a minimal install of Red Hat Enterprise Linux (RHEL) 6.3. It has two 6-core Intel Xeon X5650 processors (Westmere) running at 2.67 GHz; hyperthreading is enabled. This system has 96 GB of memory. It runs the following virtual machines:
Portal Server
- various CPU configurations
- 16 GB memory
- WebSphere Portal 188.8.131.52
- IBM HTTP Server, 32 bit (when running with 4 vCPUs)
- Windows 2008 R2 Standard
HTTP Server
- 4 vCPUs
- 8 GB memory
- IBM HTTP Server, 64 bit
LDAP Server
- 16 GB memory
- IBM Tivoli Directory Server 6.3
The Portal server, the HTTP server and the LDAP server are connected to the same bridged network on the host. Traffic between these servers never leaves the host system. All the virtual machines are using file based virtual disks stored on a single SAN volume connected to the host. virtio drivers are used for both the network and disks.
A separate server was used to run DB2 for the Portal databases. Both this server and the LDAP server never used more than 10% CPU during the execution of these measurements. Tuning of these servers is not covered in this article.
For the measurements with 4 vCPUs, the HTTP server was run on the same virtual machine as Portal. With 8 vCPUs, however, Portal required more throughput than the 32-bit Windows version of IBM HTTP Server (IHS) can support. For these measurements, the HTTP server VM was used instead. In either case, the IBM HTTP Server is configured to cache static content in a memory cache. At steady state, the majority of the requests being made to Portal are for dynamic pages, not static content in portlet WAR files.
For all measurements, the same Portal virtual machine was used. The configuration was changed by altering the libvirt configuration rather than using a different virtual machine. This was done to ensure that the system being tested was otherwise identical. Note that the tuning done on this system was done to achieve maximum throughput on the largest configuration for this system. Some settings may not be ideal for smaller virtual machine configurations, but this was deemed acceptable given the goals of these benchmarks.
The above configuration is used to run a measurement workload with Rational Performance Tester (RPT). This workload consists of five separate paths through the site. One path is unauthenticated and only visits public pages. In paths two and three, the users log in and access either pages all users can see or pages specific to the groups the user is a member of. In the fourth and fifth paths, users access pages which contain 'application' portlets that drive more CPU utilization by simulating a shopping cart and other business processes. These paths are differentiated like paths two and three; one contains apps all users can see, the other only apps applicable to the user's groups.
In RPT, this workload is ramped up to increasing transaction levels by increasing the number of virtual users. Each level is run for 15 minutes to ensure sufficient transactions are measured. The measurement is stopped when the response times for any transaction exceed one second, on average, for a given throughput level. Peak throughput is recorded as the level before the maximum; i.e. the last point when all response times were below thresholds.
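The stopping rule and the definition of peak throughput can be expressed compactly. The sketch below is not part of RPT; the function name, the data layout and the sample numbers are all hypothetical and only illustrate the rule described above.

```python
THRESHOLD_SECONDS = 1.0


def peak_throughput(levels, threshold=THRESHOLD_SECONDS):
    """Return the last throughput level at which every transaction stayed in bounds.

    levels is a list of (throughput, {transaction: average_response_seconds})
    tuples in ramp-up order. The scan stops at the first level where any
    average response time exceeds the threshold.
    """
    peak = None
    for throughput, avg_times in levels:
        if any(seconds > threshold for seconds in avg_times.values()):
            break
        peak = throughput
    return peak


if __name__ == "__main__":
    # Hypothetical ramp: throughput units and timings are made up.
    ramp = [
        (100, {"login": 0.20, "cart": 0.30}),
        (200, {"login": 0.40, "cart": 0.60}),
        (300, {"login": 0.70, "cart": 1.20}),
    ]
    print(peak_throughput(ramp))
```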
The RPT agents were all run on IBM xSeries systems running Red Hat Enterprise Linux 6.3. The agent systems are connected to the system under test by a dedicated gigabit network used only for benchmark traffic.
Results with 4 vCPUs
With 4 vCPUs the following configurations were benchmarked:
- Default (no CPU pinning, default 4 processor topology); this is the baseline
- Pinned to 4 real CPUs
- Pinned to 2 real CPUs and 2 hyperthreads
- Pinned to 2 real CPUs and 2 hyperthreads; VM topology set to match
For these measurements, CPUs were pinned using the cpuset configuration attribute. No vCPUs were explicitly pinned to a specific physical CPU using the <vcpupin> configuration element.
One additional run was also measured with the numad process running. numad, which was added in RHEL 6.3, monitors processes for memory allocations that cross NUMA boundaries and attempts to re-balance or move memory to the node the process is running on.
Running these tests gives the following results:
Figure 1. 4 vCPU results
Not surprisingly, pinning the VM process to 4 real CPUs gives the best results since the virtual machine gets the benefits of warmed up caches and full processor cores. Examination of host system statistics also shows that the hyperthreads associated with these processors were minimally utilized during the benchmark. So, this configuration also had nearly exclusive access to the CPU cores.
Conversely, pinning to hyperthreaded CPUs degrades performance below even the default configuration. Interestingly, having the virtual machine be aware of the configuration does not change throughput. However, at peak load, the VM is at almost 100% CPU utilization so there is little room for optimization. At the plateau before peak load, response times were 10% (about 25 ms) faster. This is not a conclusive result given that the VM is still at 99% CPU. Further investigation is needed at lower loads to see if the topology gives consistently better response times.
Looking at host system statistics, having numad enabled appears to have caused the virtual machine to ignore the libvirt pinning configuration. During the measurement, all CPUs in the first NUMA node were in use rather than just the 4 that were supposed to be pinned. In addition, compared to the default settings, numad also seems to add about 3% overhead. In these benchmarks, the VMs under test were the only load on the system. Further investigation would be needed to determine if numad provided a benefit on a fully loaded host system.
Results with 8 vCPUs
The host KVM system for these measurements is a dual socket server. Each socket has 6 cores. So, with 8 vCPUs, the Portal virtual machine must cross NUMA boundaries if it is pinned to 8 real CPUs. The measurements run with 8 vCPUs therefore investigate how NUMA affects overall performance and which configurations are better when dealing with multiple memory cells.
With 8 vCPUs the following configurations were benchmarked:
- Default (no CPU pinning); this is the baseline
- Pinned to 8 real CPUs; 6 CPUs pinned on one NUMA node, 2 on the other
- Pinned to 8 real CPUs; 4 CPUs pinned on one NUMA node, 4 on the other
- Same as previous, but the HTTP server VM was also pinned with 2 CPUs on one NUMA node and 2 on the other
- Pinned to 8 real CPUs; 4 CPUs pinned on one NUMA node, 4 on the other; virtual CPUs explicitly mapped to real CPUs. The HTTP server VM was not pinned.
- Same as previous, but with NUMA configuration on the virtual machine so the VM is NUMA aware. The HTTP server VM was not pinned.
In the final configuration, NUMA awareness means that the virtual machine topology was altered so that the VM has two NUMA nodes, just like the host. For this configuration, the 16 GB of guest memory was split evenly between the two NUMA nodes using the following configuration:
<cell cpus='0-3' memory='8388608'/>
<cell cpus='4-7' memory='8388608'/>
The expectation is that this configuration will perform better since the VM can optimize its memory accesses to match how the host is allocating memory to the virtual machine.
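The memory values in the <cell> elements above are consistent with kibibytes: 16 GB is 16 × 1024 × 1024 = 16777216 KiB, and half of that is 8388608. The following sketch generates an even split; the helper name is hypothetical, and wrapping the cells in a <numa> element is an assumption about where they sit in the VM's configuration.

```python
import xml.etree.ElementTree as ET


def build_numa_cells(total_memory_kib, vcpus, nodes):
    """Split vCPUs and memory evenly across guest NUMA cells."""
    if vcpus % nodes or total_memory_kib % nodes:
        raise ValueError("vCPUs and memory must divide evenly across nodes")
    cpus_per_node = vcpus // nodes
    memory_per_node = total_memory_kib // nodes
    numa = ET.Element("numa")
    for node in range(nodes):
        first = node * cpus_per_node
        last = first + cpus_per_node - 1
        cpus = str(first) if first == last else "%d-%d" % (first, last)
        ET.SubElement(numa, "cell", cpus=cpus, memory=str(memory_per_node))
    return numa


if __name__ == "__main__":
    # 16 GB of guest memory expressed in KiB, 8 vCPUs, 2 NUMA nodes.
    element = build_numa_cells(16 * 1024 * 1024, vcpus=8, nodes=2)
    print(ET.tostring(element, encoding="unicode"))
```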
One issue with 8 vCPUs is that the default libvirt configuration is presented as 8 separate CPU sockets to a virtual machine. Windows 2008 Standard edition does not support this configuration; Enterprise edition must be used for 8 CPU support. In order to avoid a system reinstall, the topology was configured so that the VM sees a single-socket, 4-core hyperthreaded processor. This topology was used for all measurements.
Running these tests gives the following results:
Figure 2. 8 vCPU results
While not shown, the 8 vCPU configuration did give more than double the throughput of the 4 vCPU configurations. This is consistent with doubling the CPUs and moving the HTTP server to another virtual machine.
Compared to the default configuration, CPU pinning gives better performance, just as in the 4 vCPU runs. Having 4 CPUs pinned to each NUMA node performs slightly better than having more CPUs on one of the nodes. This is expected because it allows a more even distribution of CPUs to memory on each NUMA node.
During all these measurements, the HTTP server was always below 15% CPU utilization. Given that low utilization, it is not surprising that pinning the HTTP server virtual machine had almost no effect. Explicitly mapping the virtual CPUs to real CPUs further increases throughput by less than 1%.
While all of these configurations explicitly cross NUMA boundaries, the Windows VM is not aware of NUMA by default. By making the virtual machine OS aware of NUMA, it can optimize its own memory usage. This results in a further 2% increase in overall throughput. While not documented above, configurations with 4 vCPUs pinned with a 2/2 split did not perform better than the default 4 vCPU configuration regardless of the VM being NUMA aware. This suggests that, by default, the Linux scheduler is attempting to keep processes on a single NUMA node when possible.
Finally, while not shown in the chart, two additional tunings were tried: enabling zone_reclaim mode and changing the swappiness value to 0. These were suggested in http://publib.boulder.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp. Both of these settings resulted in reduced overall throughput compared to the default configuration. However, these settings deal with how Linux allocates and frees memory. The system was never under any memory pressure during benchmarks, so it is possible that these values could still be useful when running a system that has almost all its memory allocated by virtual machines.
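Before changing either setting, it can be useful to record the current values on the host. The sketch below assumes the standard Linux locations /proc/sys/vm/zone_reclaim_mode and /proc/sys/vm/swappiness; the function name is hypothetical, and a value of None means the file could not be read on this system.

```python
from pathlib import Path


def read_vm_settings(proc_root="/proc/sys/vm",
                     names=("zone_reclaim_mode", "swappiness")):
    """Read integer kernel VM settings, returning None for any that are unavailable."""
    settings = {}
    for name in names:
        path = Path(proc_root) / name
        try:
            settings[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # The file is missing, unreadable, or does not hold an integer.
            settings[name] = None
    return settings


if __name__ == "__main__":
    for name, value in read_vm_settings().items():
        print(name, "=", value)
```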
Overall, KVM is an easy-to-use, built-in option for virtualizing workloads on x86 systems. Running a WebSphere Portal workload, a Windows 2008 virtual machine running on Red Hat Enterprise Linux 6.3 was able to drive significant throughput with out-of-the-box settings. Additional VM configuration options specific to the host machine architecture allowed about 9% more throughput on a virtual machine with 8 virtual CPUs.
Significant performance improvements were seen when pinning the virtual CPUs to a limited set of physical CPU cores on the host. Additional, minor gains can be made by mapping vCPUs explicitly to physical cores. While not measured, these gains will probably only be seen on KVM hosts that are not overcommitted. On overcommitted hosts, pinning may reduce performance because the host will have fewer options for allocating resources fairly between all virtual machines running on the host.
When running virtual machines with more vCPUs than physical CPUs in a NUMA node, awareness of NUMA node boundaries should also be exposed to the VM. This gives the guest OS the ability to optimize memory access by matching the memory configuration of the host. Virtual machines should be allocated an equal number of CPUs on each NUMA node.
About the authors
is the WebSphere Portal Performance team lead. He has over 13 years of performance testing and analysis experience. He has been with IBM Lotus for five years and has previously worked on IBM Connections and IBM SmartCloud.
Andrew P. Citron
is a performance engineer on the WebSphere Portal Performance team in IBM's Research Triangle Park. He is responsible for all Portal benchmarks on Windows and contributes to the WebSphere Portal tuning guides for each Portal release.