5. Performance

In this section we study the performance of Orleans. We start with synthetic micro benchmarks targeting specific parts of the system. Next we report on whole-system performance running the production code of the Halo Presence service described in Section 4.1. The synthetic micro benchmarks were run 5 times for 10 minutes each; the production performance evaluation runs lasted 30 minutes each.

The measurements were performed on a cluster of up to 125 servers, each with two AMD Quad-Core Opteron processors running at 2.10GHz, for a total of 8 cores per server, and 32GB of RAM, all running 64-bit Windows Server 2008 R2 and the .NET 4.5 framework.

5.1 Synthetic Micro Benchmarks

Asynchronous IO and cooperative multi-tasking

In this benchmark we evaluate the effectiveness of asynchronous messaging with cooperative multi-tasking. We show how these two mechanisms can efficiently mask latency in the actor's work. This test uses 1000 actors and issues requests from multiple load generators to fully saturate the system. Every request models a situation where an actor issues a remote call to another actor or an external service, such as storage, and we vary the latency of this simulated call. Since the remote invocation is asynchronous, the current request is not blocked and the calling thread can be released to work on another request. Figure 4 shows that the increased latency of the simulated external calls from within actors has very little impact on the overall throughput of the system.
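As a rough illustration of the pattern this benchmark exercises, the sketch below shows an actor method that awaits a simulated external call. The interface, class, and helper names are hypothetical and not taken from the benchmark code; the point is only that the awaited call releases the scheduler thread.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical actor interface: Orleans actor methods return Task/Task<T>,
// so a pending remote call does not hold on to a scheduler thread.
public interface IWorkerActor
{
    Task<string> ProcessRequest(int requestId);
}

public class WorkerActor : IWorkerActor
{
    public async Task<string> ProcessRequest(int requestId)
    {
        // Simulated remote call to another actor or an external service.
        // While the await is pending, the scheduler thread is released to
        // execute turns of other requests, which is why added latency has
        // little effect on overall throughput (Figure 4).
        await SimulateExternalCall(TimeSpan.FromMilliseconds(50));
        return "done: " + requestId;
    }

    // Stand-in for the benchmark's variable-latency external call.
    private static Task SimulateExternalCall(TimeSpan latency)
    {
        return Task.Delay(latency);
    }
}
```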

The Orleans scheduler uses a small number of compute threads, usually equal to the number of CPU cores, with cooperative multitasking. This is known to be more efficient than using a large number of threads. We run a throughput test with short ping messages while varying the number of threads used by the Orleans scheduler. The results are shown in Figure 5. As expected, throughput degrades steadily as the number of threads increases, due to the growing overhead of thread context switching, extra memory usage, longer OS scheduling queues, and reduced cache locality.
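A minimal sketch of the kind of driver such a ping throughput test could use is shown below, assuming a hypothetical IPingActor interface; the number of scheduler threads is a server-side Orleans configuration setting and is not shown here. The driver keeps a fixed number of short requests in flight per actor and reports the sustained request rate.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical ping interface standing in for the benchmark's short request.
public interface IPingActor
{
    Task Ping();
}

public static class PingThroughputTest
{
    // Keeps 'requestsInFlightPerActor' pings outstanding against each actor
    // for 'duration' and returns the measured requests per second.
    public static async Task<double> Run(
        IPingActor[] actors, int requestsInFlightPerActor, TimeSpan duration)
    {
        long completed = 0;
        var sw = Stopwatch.StartNew();

        async Task Worker(IPingActor actor)
        {
            while (sw.Elapsed < duration)
            {
                await actor.Ping();                    // short, cheap request
                Interlocked.Increment(ref completed);  // count completed calls
            }
        }

        // Launch the pipelined workers and wait out the measurement window.
        var workers = actors
            .SelectMany(a => Enumerable.Repeat(a, requestsInFlightPerActor))
            .Select(Worker)
            .ToArray();
        await Task.WhenAll(workers);

        return Interlocked.Read(ref completed) / sw.Elapsed.TotalSeconds;
    }
}
```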

Price of isolation

The isolation of actors in Orleans implies that arguments of actor calls have to be deep copied. In this benchmark the client calls a first actor with a given argument, and that actor calls a second actor with the same argument, once passed as is and once passed as Immutable (meaning it is not copied). In the benchmark, 50% of the calls are local and 50% remote. In general, the larger the fraction of remote calls, the smaller the throughput drop due to deep copying, since the overhead of serialization and remote messaging grows and the copy becomes a smaller share of the total cost. In a large application running on hundreds of servers the majority of calls would be remote, and thus the price of deep copying would shrink.
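The sketch below illustrates the two variants of this call chain. To stay self-contained it defines its own minimal Immutable&lt;T&gt; wrapper as a stand-in for the one the text refers to, and the actor interfaces and method names are hypothetical; the real Orleans types and signatures may differ.

```csharp
using System.Threading.Tasks;

// Minimal stand-in for the Immutable<T> wrapper the text refers to: it marks
// a value the caller promises not to mutate, so the runtime can skip the
// deep copy on local calls. Actual Orleans types may differ.
public struct Immutable<T>
{
    public T Value { get; }
    public Immutable(T value) { Value = value; }
}

// Hypothetical actor interfaces for the two-hop call in the benchmark.
public interface ISecondActor
{
    Task Consume(byte[] copied);              // deep copied on local calls
    Task Consume(Immutable<byte[]> shared);   // passed without copying
}

public interface IFirstActor
{
    Task Forward(byte[] payload);
}

public class FirstActor : IFirstActor
{
    private readonly ISecondActor second;
    public FirstActor(ISecondActor second) { this.second = second; }

    public async Task Forward(byte[] payload)
    {
        // "As is": isolation requires a deep copy when the callee is local.
        await second.Consume(payload);

        // Wrapped as Immutable: no copy, at the cost of a programmer promise
        // that neither side mutates the buffer afterwards.
        await second.Consume(new Immutable<byte[]>(payload));
    }
}
```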

Table 1 shows the price of deep copying (in request throughput) for three data types. For a simple byte[] it is very small, about 4%. For a dictionary, more data is copied, but the price is still below 10%. For a complicated data structure, a dictionary whose elements are themselves mutable complex types, the overhead grows significantly.

5.2 Halo Presence Performance Evaluation

Scalability in the number of servers

We run the production actor code of the Halo 4 Presence service in our test cluster with 1 million actors. We use enough load generators to fully saturate the Orleans nodes with generated heartbeat traffic and measure the maximum throughput the service can sustain. In this test the nodes run stably at 95-97% CPU utilization the whole time. Each heartbeat request incurs at least two RPCs: client to a Router actor, and the Router actor to a Session actor. The first call is always remote; the second is usually remote because of the random placement of Halo 4 Session actors. Figure 6 shows that the throughput with 25 servers is about 130,000 heartbeats per second (about 5,200 per server). This throughput scales almost linearly as the number of servers grows to 125.
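The sketch below outlines the two-RPC heartbeat path described in Section 4.1 (client to Router actor, Router actor to Session actor). The interfaces, method names, payload handling, and the placeholder decompression are simplifications for illustration, not the production code.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical interfaces sketching the heartbeat path: the Router actor
// decodes the incoming blob and forwards it to the owning Session actor.
public interface ISessionActor
{
    Task ProcessHeartbeat(byte[] heartbeat);
}

public interface IRouterActor
{
    Task RouteHeartbeat(byte[] compressedHeartbeat);
}

public class RouterActor : IRouterActor
{
    // Stand-in for resolving a Session actor reference by session id.
    private readonly Func<long, ISessionActor> getSession;
    public RouterActor(Func<long, ISessionActor> getSession)
    {
        this.getSession = getSession;
    }

    public Task RouteHeartbeat(byte[] compressedHeartbeat)
    {
        // CPU-intensive decompression, then find which game session the
        // heartbeat belongs to (both simplified to placeholders here).
        byte[] heartbeat = Decompress(compressedHeartbeat);
        long sessionId = ExtractSessionId(heartbeat);

        // Second RPC: usually remote, because Session actors are placed
        // randomly across the cluster.
        return getSession(sessionId).ProcessHeartbeat(heartbeat);
    }

    private static byte[] Decompress(byte[] blob) => blob;    // placeholder
    private static long ExtractSessionId(byte[] blob) => 0;   // placeholder
}
```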

Scalability in the number of actors

In this test the number of servers is fixed at 25 and we saturate the system with multiple load generators. In Figure 7 we see that the throughput remains almost the same as the number of actors increases from 2 thousand to 2 million. The small degradation at the largest actor counts is due to the increased size of internal data structures.

Latency as a function of load

We measure the latency of heartbeat calls. The number of servers is fixed at 25 and we vary the load by increasing the number of load generators. In Figure 8 the x-axis depicts the average CPU utilization of the 25 servers. The median latency is about 6.5 milliseconds (ms) for up to 19% CPU utilization and grows to 10ms and 15ms for 34% and 55% CPU utilization, respectively. Recall that every heartbeat involves two RPCs, including a CPU-intensive blob decompression. In addition, a small fraction of heartbeats trigger additional actors that were omitted from our description above. The latency of those heartbeats is naturally higher due to the extra hop and the additional CPU-intensive processing, which contributes to the higher mean, standard deviation, and 95th percentile.
