奏为臣病垂危,自知不起,口占遗疏,仰求圣鉴事。窃臣体气素健,向能耐劳,服官四十余年,未尝因病请假。前在马关受伤,流血过久,遂成眩晕。去夏冒暑北上,复患泄泻,元气大伤。入都后又以事机不顺,朝夕焦思,往往彻夜不眠,胃纳日减,触发旧疾时 作时止。迭蒙圣慈垂询,特赏假期,慰谕周详,感激涕零。
和约幸得竣事,俄约仍无定期,上贻宵旰之忧,是臣未终心事。每一念及,忧灼五中。本月十九夜,忽喀血碗余,数日之间,遂至沉笃,群医束手,知难久延。谨口占遗疏,烦臣子经述恭校写成,固封以俟。
伏念臣受知最早,蒙恩最深,每念时局艰危,不敢自称衰病。 惟冀稍延余息,重睹中兴。赍志以终,殁身难瞑。现值京师初复,銮辂未归,和议新成,东事尚棘,根本至计,处处可虞。窃念多难兴邦,殷忧启圣。伏读迭次谕 旨,举行新政,力图自强。庆亲王等皆臣久经共事之人,此次复同更患难,定能一心效力,翼赞讦谟。臣在九泉,庶无遗憾。至臣子孙,皆受国厚恩,唯有勖其守身读书,勉图报效。属纩在即,瞻望无时,长辞圣明,无任依恋之至。谨叩谢天恩,乞皇太后、皇上圣鉴。谨奏。
- Latency Numbers http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html, it will be less change in the future
- What Every Programmer Should Know About Memory.pdf
-The Secret to 10 Million Concurrent Connections -The Kernel is the Problem, Not the Solution
Now that we have the C10K concurrent connection problem licked, how do we level up and support 10 million concurrent connections? Impossible you say. Nope, systems right now are delivering 10 million concurrent connections using techniques that are as radical as they may be unfamiliar.
To learn how it’s done we turn to Robert Graham, CEO of Errata Security, and his absolutely fantastic talk at Shmoocon 2013 called C10M Defending The Internet At Scale.
Robert has a brilliant way of framing the problem that I’ve never heard of before. He starts with a little bit of history, relating how Unix wasn’t originally designed to be a general server OS, it was designed to be a control system for a telephone network. It was the telephone network that actually transported the data so there was a clean separation between the control plane and the data plane. The problem is we now use Unix servers as part of the data plane, which we shouldn’t do at all. If we were designing a kernel for handling one application per server we would design it very differently than for a multi-user kernel.
Which is why he says the key is to understand:
- The kernel isn’t the solution. The kernel is the problem.
Which means:
- Don’t let the kernel do all the heavy lifting. Take packet handling, memory management, and processor scheduling out of the kernel and put it into the application, where it can be done efficiently. Let Linux handle the control plane and let the the application handle the data plane.
The result will be a system that can handle 10 million concurrent connections with 200 clock cycles for packet handling and 1400 hundred clock cycles for application logic. As a main memory access costs 300 clock cycles it’s key to design in way that minimizes code and cache misses.
With a data plane oriented system you can process 10 million packets per second. With a control plane oriented system you only get 1 million packets per second.
If this seems extreme keep in mind the old saying: scalability is specialization. To do something great you can’t outsource performance to the OS. You have to do it yourself.
Now, let’s learn how Robert creates a system capable of handling 10 million concurrent connections...
C10K Problem - So Last Decade
A decade ago engineers tackled the C10K scalability problems that prevented servers from handling more than 10,000 concurrent connections. This problem was solved by fixing OS kernels and moving away from threaded servers like Apache to event-driven servers like Nginx and Node. This process has taken a decade as people have been moving away from Apache to scalable servers. In the last few years we’ve seen faster adoption of scalable servers.
The Apache Problem
- The Apache problem is the more connections the worse the performance.
- Key insight: performance and scalability or orthogonal concepts. They don’t mean the same thing. When people talk about scale they often are talking about performance, but there’s a difference between scale and performance. As we’ll see with Apache.
- With short term connections that last a few seconds, say a quick transaction, if you are executing a 1000 TPS then you’ll only have about a 1000 concurrent connections to the server.
- Change the length of the transactions to 10 seconds, now at 1000 TPS you’ll have 10K connections open. Apache’s performance drops off a cliff though which opens you to DoS attacks. Just do a lot of downloads and Apache falls over.
- If you are handling 5,000 connections per second and you want to handle 10K, what do you do? Let’s say you upgrade hardware and double it the processor speed. What happens? You get double the performance but you don’t get double the scale. The scale may only go to 6K connections per second. Same thing happens if you keep on doubling. 16x the performance is great but you still haven’t got to 10K connections. Performance is not the same as scalability.
- The problem was Apache would fork a CGI process and then kill it. This didn’t scale.
- Why? Servers could not handle 10K concurrent connections because of O(n^2) algorithms used in the kernel.
- Two basic problems in the kernel:
- Connection = thread/process. As a packet came in it would walk down all 10K processes in the kernel to figure out which thread should handle the packet
- Connections = select/poll (single thread). Same scalability problem. Each packet had to walk a list of sockets.
- Solution: fix the kernel to make lookups in constant time
- Threads now constant time context switch regardless of number of threads.
- Came with a new scalable epoll()/IOCompletionPort constant time socket lookup.
- Two basic problems in the kernel:
- Thread scheduling still didn’t scale so servers scaled using epoll with sockets which led to the asynchronous programming model embodied in Node and Nginx. This shifted software to a different performance graph. Even with a slower server when you add more connections the performance doesn’t drop off a cliff. At 10K connections a laptop is even faster than a 16 core server.
The C10M Problem - The Next Decade
In the very near future servers will need to handle millions of concurrent connections. With IPV6 the number of potential connections from each server is in the millions so we need to go to the next level of scalability.
- Examples of applications that will need this sort of scalability: IDS/IPS because they connection to a server backbone. Other examples: DNS root server, TOR node, Nmap of Internet, video streaming, banking, Carrier NAT, Voip PBX, load balancer, web cache, firewall, email receive, spam filtering.
- Often people who see Internet scale problems are appliances rather than servers because they are selling hardware + software. You buy the device and insert it into your datacenter. These devices may contain an Intel motherboard or Network processors and specialized chips for encryption, packet inspection, etc.
- X86 prices on Newegg as of Feb 2013 - $5K for 40gpbs, 32-cores, 256gigs RAM. The servers can do more than 10K connections. If they can’t it’s because you’ve made bad choices with software. It’s not the underlying hardware that’s the issues. This hardware can easily scale to 10 million concurrent connections.
What the 10M Concurrent Connection Challenge means:
- 10 million concurrent connections
- 1 million connections/second - sustained rate at about 10 seconds a connections
- 10 gigabits/second connection - fast connections to the Internet.
- 10 million packets/second - expect current servers to handle 50K packets per second, this is going to a higher level. Servers used to be able to handle 100K interrupts per second and every packet caused interrupts.
- 10 microsecond latency - scalable servers might handle the scale but latency would spike.
- 10 microsecond jitter - limit the maximum latency
- 10 coherent CPU cores - software should scale to larger numbers of cores. Typically software only scales easily to four cores. Servers can scale to many more cores so software needs to be rewritten to support larger core machines.
We’ve Learned Unix Not Network Programming
- A generation of programmers has learned network programming by reading Unix Networking Programming by W. Richard Stevens. The problem is the book is about Unix, not just network programming. It tells you to let Unix do all the heavy lifting and you just write a small little server on top of Unix. But the kernel doesn’t scale. The solution is to move outside the kernel and do all the heavy lifting yourself.
- An example of the impact of this is to consider Apache’s thread per connection model. What this means is the thread scheduler determines which read() to call next depending on which data arrives. You are using the thread scheduling system as the packet scheduling system. (I really like this, never thought of it that way before).
- What Nginx says it don’t use thread scheduling as the packet scheduler. Do the packet scheduling yourself. Use select to find the socket, we know it has data so we can read immediately and it won’t block, and then process the data.
- Lesson: Let Unix handle the network stack, but you handle everything from that point on.
How do you write software that scales?
How do change your software to make it scale? A lot of or rules of thumb are false about how much hardware can handle. We need to know what the performance capabilities actually are.
To go to the next level the problems we need to solve are:
- packet scalability
- multi-core scalability
- memory scalability
Packet Scaling - Write Your Own Custom Driver to Bypass the Stack
- The problem with packets is they go through the Unix kernel. The network stack is complicated and slow. The path of packets to your application needs to be more direct. Don’t let the OS handle the packets.
- The way to do this is to write your own driver. All the driver does is send the packet to your application instead of through the stack. You can find drivers: PF_RING, Netmap, Intel DPDK (data plane development kit). The Intel is closed source, but there’s a lot of support around it.
- How fast? Intel has a benchmark where the process 80 million packets per second (200 clock cycles per packet) on a fairly lightweight server. This is through user mode too. The packet makes its way up through to user mode and then down again to go out. Linux doesn’t do more than a million packets per second when getting UDP packets up to user mode and out again. Performance is 80-1 of a customer driver to a Linux.
- For the 10 million packets per second goal if 200 clock cycles are used in getting the packet that leaves 1400 clocks cycles to implement functionally like a DNS/IDS.
- With PF_RING you get raw packets so you have to do your TCP stack. People are doing user mode stacks. For Intel there is an available TCP stack that offers really scalable performance.
Multi-Core Scalability
Multi-core scalability is not the same thing as multi-threading scalability. We’re all familiar with the idea processors are not getting faster, but we are getting more of them.
Most code doesn’t scale past 4 cores. As we add more cores it’s not just that performance levels off, we can get slower and slower as we add more cores. That’s because software is written badly. We want software as we add more cores to scale nearly linearly. Want to get faster as we add more cores.
Multi-threading coding is not multi-core coding
- Multi-threading:
- More than one thread per CPU core
- Locks to coordinate threads (done via system calls)
- Each thread a different task
- Multi-core:
- One thread per CPU core
- When two threads/cores access the same data they can’t stop and wait for each other
- All threads part of the same task
- Our problem is how to spread an application across many cores.
- Locks in Unix are implemented in the kernel. What happens at 4 cores using locks is that most software starts waiting for other threads to give up a lock. So the kernel starts eating up more performance than you gain from having more CPUs.
- What we need is an architecture that is more like a freeway than an intersection controlled by a stop light. We want no waiting where everyone continues at their own pace with as little overhead as possible.
- Solutions:
- Keep data structures per core. Then on aggregation read all the counters.
- Atomics. Instructions supported by the CPU that can called from C. Guaranteed to be atomic, never conflict. Expensive, so don’t want to use for everything.
- Lock-free data structures. Accessible by threads that never stop and wait for each other. Don’t do it yourself. It’s very complex to work across different architectures.
- Threading model. Pipelined vs worker thread model. It’s not just synchronization that’s the problem, but how your threads are architected.
- Processor affinity. Tell the OS to use the first two cores. Then set where your threads run on which cores. You can also do the same thing with interrupts. So you own these CPUs and Linux doesn’t.
Memory Scalability
- The problem is if you have 20gigs of RAM and let’s say you use 2k per connection, then if you only have 20meg L3 cache, none of that data will be in cache. It costs 300 clock cycles to go out to main memory, at which time the CPU isn’t doing anything.
- Think about this with our 1400 clock cycle budge per packet. Remember 200 clocks/pkt overhead. We only have 4 cache misses per packet and that's a problem.
- Co-locate Data
- Don’t scribble data all over memory via pointers. Each time you follow a pointer it will be a cache miss: [hash pointer] -> [Task Control Block] -> [Socket] -> [App]. That’s four cache misses.
- Keep all the data together in one chunk of memory: [TCB | Socket | App]. Prereserve memory by preallocating all the blocks. This reduces cache misses from 4 to 1.
- Paging
- The paging table for 32gigs require 64MB of paging tables which doesn’t fit in cache. So you have two caches misses, one for the paging table and one for what it points to. This is detail we can’t ignore for scalable software.
- Solutions: compress data; use cache efficient structures instead of binary search tree that has a lot of memory accesses
- NUMA architectures double the main memory access time. Memory may not be on a local socket but is on another socket.
- Memory pools
- Preallocate all memory all at once on startup.
- Allocate on a per object, per thread, and per socket basis.
- Hyper-threading
- Network processors can run up to 4 threads per processor, Intel only has 2.
- This masks the latency, for example, from memory accesses because when one thread waits the other goes at full speed.
- Hugepages
- Reduces page table size. Reserve memory from the start and then your application manages the memory.
Summary
- NIC
- Problem: going through the kernel doesn’t work well.
- Solution: take the adapter away from the OS by using your own driver and manage them yourself
- CPU
- Problem: if you use traditional kernel methods to coordinate your application it doesn’t work well.
- Solution: Give Linux the first two CPUs and you application manages the remaining CPUs. No interrupts will happen on those CPUs that you don’t allow.
- Memory
- Problem: Takes special care to make work well.
- Solution: At system startup allocate most of the memory in hugepages that you manage.
The control plane is left to Linux, for the data plane, nothing. The data plane runs in application code. It never interacts with the kernel. There’s no thread scheduling, no system calls, no interrupts, nothing.
Yet, what you have is code running on Linux that you can debug normally, it’s not some sort of weird hardware system that you need custom engineer for. You get the performance of custom hardware that you would expect for your data plane, but with your familiar programming and development environment.
Related Articles
看内核patchs的时候,经常会看到一些清理代码的patch,理由是unnecessary,比如这个:
[PATCH] vfs: no need to check about IS_IMMUTABLE in do_fallocate
From: Ashish Sangwan <a.sangwan <at> samsung.com> In do_fallocate, first there is check for FMODE_WRITE and after that there is second check for IS_IMMUTABLE. A file cannot be opened in write mode if the corresponding inode is immutable, hence the second check is not required. Signed-off-by: Ashish Sangwan <a.sangwan <at> samsung.com> Signed-off-by: Namjae Jeon <namjae.jeon <at> samsung.com> --- fs/open.c | 3 --- 1 files changed, 0 insertions(+), 3 deletions(-) diff --git a/fs/open.c b/fs/open.c index 8c74100..939e402 100644 --- a/fs/open.c +++ b/fs/open.c <at> <at> -245,9 +245,6 <at> <at> int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len) if (mode & FALLOC_FL_PUNCH_HOLE && IS_APPEND(inode)) return -EPERM; - if (IS_IMMUTABLE(inode)) - return -EPERM; - /* * Revalidate the write permissions, in case security policy has * changed since the files were opened.
事实上看起来这处代码是多余的。不过我更偏向的想法是,基础函数或者通常一下比较通用的代码,要尽可能的具备语义自描述的特性,就是说,它不需要理解外界的环境是什么样的,只对输入做符合规格的检查。
比如这里的do_fallocate()函数,正常情况下,能以write模式打开的文件,它的inode肯定不是IS_IMMUTABLE,所以这个patch里把这个不必要的一步去掉了。
但是反过来考虑一下,如果将来某天,我们可能为了满足某种特别的需求,即使文件的inode是IS_IMMUTABLE的,也可以以write模式打开,但是实现这个功能代码的作者,未必会想到此处可能会引发BUG,而且这种地方还不少。
- J.A.K.Mouw@its.tudelft.nl Linux Kernel Procfs Guide
1. Introduction
The /proc file system (procfs) is a special file system in the linux kernel. It’s a virtual file system: it is not associated with a block device but exists only in memory. The files in the procfs are there to allow userland programs access to certain information from the kernel (like process information in /proc/[0-9]+/), but also for debug purposes (like /proc/ksyms). This guide describes the use of the procfs file system from within the Linux kernel. It starts by introducing all relevant functions to manage the files within the file system.
After that it shows how to communicate with userland, and some tips and tricks will be pointed out. Finally a complete example will be shown. Note that the files in /proc/sys are sysctl files: they don’t belong to procfs and are governed by a completely different API described in the Kernel API book.
2. Managing procfs entries
This chapter describes the functions that various kernel components use to populate the procfs with files, symlinks, device nodes, and directories. A minor note before we start: if you want to use any of the procfs functions, be sure to include the correct header file! This should be one of the first lines in your code:
#include <linux/proc_fs.h>
2.1. Creating a regular file
struct proc_dir_entry* create_proc_entry(const char* name, mode_t mode, struct proc_dir_entry* parent);
This function creates a regular file with the name name, file mode mode in the directory parent. To create a file in the root of the procfs, use NULL as parent parameter. When successful, the function will return a pointer to the freshly created struct proc_dir_entry; otherwise it will return NULL. Chapter 3 describes how to do something useful with regular files. Note that it is specifically supported that you can pass a path that spans multiple directories. For example create_proc_entry("drivers/via0/info") will create the via0 directory if necessary, with standard 0755 permissions. If you only want to be able to read the file, the function create_proc_read_entry described in Section 4.1 may be used to create and initialise the procfs entry in one single call.
2.2. Creating a symlink
struct proc_dir_entry* proc_symlink(const char* name, struct proc_dir_entry* parent, const char* dest);
This creates a symlink in the procfs directory parent that points from name to dest. This translates in userland to ln -s dest name.
2.3. Creating a device
struct proc_dir_entry* proc_mknod(const char* name, mode_t mode, struct proc_dir_entry* parent, kdev_t rdev);
Creates a device file name with mode mode in the procfs directory parent. The device file will work on the device rdev, which can be generated by using the MKDEV macro from linux/kdev_t.h. The mode parameter must contain S_IFBLK or S_IFCHR to create a device node. Compare with userland mknod --mode=mode name rdev.
2.4. Creating a directory
struct proc_dir_entry* proc_mkdir(const char* name, struct proc_dir_entry* parent);
Create a directory name in the procfs directory parent.
2.5. Removing an entry
void remove_proc_entry(const char* name, struct proc_dir_entry* parent);
Removes the entry name in the directory parent from the procfs. Entries are removed by their name, not by the struct proc_dir_entry returned by the various create functions. Note that this function doesn’t recursively remove entries. Be sure to free the data entry from the struct proc_dir_entry before remove_proc_entry is called (that is: if there was some data allocated, of course). See Section 3.3 for more information on using the data entry.
3. Communicating with userland
Instead of reading (or writing) information directly from kernel memory, procfs works with call back functions for files: functions that are called when a specific file is being read or written. Such functions have to be initialised after the procfs file is created by setting the read_proc and/or write_proc fields in the struct proc_dir_entry* that the function create_proc_entry returned:
struct proc_dir_entry* entry; entry->read_proc = read_proc_foo; entry->write_proc = write_proc_foo;
If you only want to use a the read_proc, the function create_proc_read_entry described in Section 4.1 may be used to create and initialise the procfs entry in one single call.
3.1. Reading data
The read function is a call back function that allows userland processes to read data from the kernel. The read function should have the following format:
int read_func(char* page, char** start, off_t off, int count, int* eof, void* data);
The read function should write its information into the page. For proper use, the function should start writing at an offset of off in page and write at most count bytes, but because most read functions are quite simple and only return a small amount of information, these two parameters are usually ignored (it breaks pagers like more and less, but cat still works). If the off and count parameters are properly used, eof should be used to signal that the end of the file has been reached by writing 1 to the memory location eof points to.
The parameter start doesn’t seem to be used anywhere in the kernel. The data parameter can be used to create a single call back function for several files, see Section 3.3. The read_func function must return the number of bytes written into the page. Chapter 5 shows how to use a read call back function.
3.2. Writing data
The write call back function allows a userland process to write data to the kernel, so it has some kind of control over the kernel. The write function should have the following format:
int write_func(struct file* file, const char* buffer, unsigned long count, void* data);
The write function should read count bytes at maximum from the buffer. Note that the buffer doesn’t live in the kernel’s memory space, so it should first be copied to kernel space with copy_from_user. The file parameter is usually ignored. Section 3.3 shows how to use the data parameter. Again, Chapter 5 shows how to use this call back function.
3.3. A single call back for many files
When a large number of almost identical files is used, it’s quite inconvenient to use a separate call back function for each file. A better approach is to have a single call back function that distinguishes between the files by using the data field in struct
proc_dir_entry. First of all, the data field has to be initialised:
struct proc_dir_entry* entry; struct my_file_data *file_data; file_data = kmalloc(sizeof(struct my_file_data), GFP_KERNEL); entry->data = file_data;
The data field is a void *, so it can be initialised with anything. Now that the data field is set, the read_proc and write_proc can use it to distinguish between files because they get it passed into their data parameter:
int foo_read_func(char *page, char **start, off_t off,
int count, int *eof, void *data)
{
int len;
if(data == file_data) {
/* special case for this file */
} else {
/* normal processing */
}
return len;
}
Be sure to free the data data field when removing the procfs entry.
4. Tips and tricks
4.1. Convenience functions
struct proc_dir_entry* create_proc_read_entry(const char* name, mode_t mode, struct proc_dir_entry* parent, read_proc_t* read_proc, void* data);
This function creates a regular file in exactly the same way as create_proc_entry from Section 2.1 does, but also allows to set the read function read_proc in one call. This function can set the data as well, like explained in Section 3.3.
4.2. Modules
If procfs is being used from within a module, be sure to set the owner field in the struct proc_dir_entry to THIS_MODULE.
struct proc_dir_entry* entry; entry->owner = THIS_MODULE;
4.3. Mode and ownership
Sometimes it is useful to change the mode and/or ownership of a procfs entry. Here is an example that shows how to achieve that:
struct proc_dir_entry* entry; entry->mode = S_IWUSR |S_IRUSR | S_IRGRP | S_IROTH; entry->uid = 0;entry->gid = 100;
- http://blog.golang.org/2011/06/profiling-go-programs.html
At Scala Days 2011 a few weeks ago, Robert Hundt presented a paper titled “Loop Recognition in C++/Java/Go/Scala.” The paper implemented a specific loop finding algorithm, such as you might use in a flow analysis pass of a compiler, in C++, Go, Java, Scala, and then used those programs to draw conclusions about typical performance concerns in these languages. The Go program presented in that paper runs quite slowly, making it an excellent opportunity to demonstrate how to use Go's profiling tools to take a slow program and make it faster.
By using Go's profiling tools to identify and correct specific bottlenecks, we can make the Go loop finding program run an order of magnitude faster and use 6x less memory.
Hundt's paper does not specify which versions of the C++, Go, Java, and Scala tools he used. In this blog post, we will be using the most recent weekly snapshot of the 6g Go compiler and the version of g++ that ships with the Ubuntu Natty distribution. (We will not be using Java or Scala, because we are not skilled at writing efficient programs in either of those languages, so the comparison would be unfair. Since C++ was the fastest language in the paper, the comparisons here with C++ should suffice.)
$ 6g -V 6g version weekly.2011-06-16 8787 $ g++ --version g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2 ... $
The programs are run on a Lenovo X201s with a 2.13 GHz Core i7-based CPU and 4 GB of RAM running Ubuntu Natty's Linux 2.6.38-8-generic kernel. The machine is running with CPU frequency scaling disabled via
$ sudo bash
# for i in /sys/devices/system/cpu/cpu[0-9]
do
echo performance > $i/cpufreq/scaling_governor
done
#
We've taken Hundt's benchmark programs in C++ and Go, combined each into a single source file, and removed all but one line of output. We'll time the program using Linux's time utility with a format that shows user time, system time, real time, and maximum memory usage:
$ cat xtime #!/bin/sh /usr/bin/time -f '%Uu %Ss %er %MkB %C' "$@" $ $ make havlak1cc g++ -O3 -o havlak1cc havlak1.cc $ xtime havlak1cc # of loops: 76002 (total 3800100) loop-0, nest: 0, depth: 0 27.37u 0.08s 27.47r 716864kB havlak1cc $ $ make havlak1 6g havlak1.go 6l -o havlak1 havlak1.6 $ xtime havlak1 # of loops: 76000 (including 1 artificial root node) 56.63u 0.26s 56.92r 1642640kB havlak1 $
The C++ program runs in 27.47 seconds and uses 700 MB of memory. The Go program runs in 56.92 seconds and uses 1604 MB of memory. (These measurements are difficult to reconcile with the ones in the paper, but the point of this post is to explore how to use gopprof, not to reproduce the results from the paper.)
To start tuning the Go program, we have to enable profiling. If the code used the Go testing package's benchmarking support, we could use gotest's standard -cpuprofile and -memprofile flags. In a standalone program like this one, we have to import runtime/pprof and add a few lines of code:
var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")
func main() {
flag.Parse()
if *cpuprofile != "" {
f, err := os.Create(*cpuprofile)
if err != nil {
log.Fatal(err)
}
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()
}
...
The new code defines a flag named cpuprofile, calls the Go flag library to parse the command line flags, and then, if the cpuprofile flag has been set on the command line, starts CPU profiling redirected to that file. The profiler requires a final call to StopCPUProfile to flush any pending writes to the file before the program exits; we use defer to make sure this happens as main returns.
After adding that code, we can run the program with the new -cpuprofile flag and then run gopprof to interpret the profile.
$ make havlak1.prof havlak1 -cpuprofile=havlak1.prof # of loops: 76000 (including 1 artificial root node) $ gopprof havlak1 havlak1.prof Welcome to pprof! For help, type 'help'. (pprof)
The gopprof program is a slight variant of Google's pprof C++ profiler. The most important command is topN, which shows the top N samples in the profile:
(pprof) top10
Total: 5758 samples
1028 17.9% 17.9% 1161 20.2% hash_lookup
696 12.1% 29.9% 697 12.1% scanblock
565 9.8% 39.8% 1042 18.1% hash_insert_internal
420 7.3% 47.0% 4278 74.3% main.FindLoops
225 3.9% 51.0% 1149 20.0% main.DFS
225 3.9% 54.9% 226 3.9% memhash
198 3.4% 58.3% 437 7.6% sweep
172 3.0% 61.3% 1902 33.0% runtime.mallocgc
102 1.8% 63.1% 500 8.7% runtime.MCache_Alloc
102 1.8% 64.8% 102 1.8% runtime.memmove
(pprof)
When CPU profiling is enabled, the Go program stops about 100 times per second and records a sample consisting of the program counters on the currently executing goroutine's stack. The profile has 5758 samples, so it was running for a bit over 57 seconds. In the gopprof output, there is a row for each function that appeared in a sample. The first two columns show the number of samples in which the function was running (as opposed to waiting for a called function to return), as a raw count and as a percentage of total samples. The hash_lookup function was running during 1028 samples, or 17.9%. The top10 output is sorted by this sample count. The third column shows the running total during the listing: the first three rows account for 39.8% of the samples. The fourth and fifth columns show the number of samples in which the function appeared (either running or waiting for a called function to return). The main.FindLoops function was running in 7.3% of the samples, but it was on the call stack (it or functions it called were running) in 74.3% of the samples.
To sort by the fourth and fifth columns, use the -cum (for cumulative) flag:
(pprof) top5 -cum
Total: 5758 samples
0 0.0% 0.0% 4301 74.7% main.main
0 0.0% 0.0% 4301 74.7% runtime.initdone
0 0.0% 0.0% 4301 74.7% runtime.mainstart
0 0.0% 0.0% 4278 74.3% main.FindHavlakLoops
420 7.3% 7.3% 4278 74.3% main.FindLoops
(pprof)
In fact the total for main.FindLoops and main.main should have been 100%, but each stack sample only includes the bottom 100 stack frames; during about a quarter of the samples, the recursive main.DFS function was more than 100 frames deeper than main.main so the complete trace was truncated.
The stack trace samples contain more interesting data about function call relationships than the text listings can show. The web command writes a graph of the profile data in SVG format and opens it in a web browser. (There is also a gv command that writes PostScript and opens it in Ghostview. For either command, you need graphviz installed.)
(pprof) web
A small fragment of the full graph looks like:

main.DFS to itself.
Just at a glance, we can see that the program spends much of its time in hash operations, which correspond to use of Go's map values. We can tell web to use only samples that include a specific function, such as hash_lookup, which clears some of the noise from the graph:
(pprof) web hash_lookup

runtime.mapaccess1 are being made by main.FindLoops and main.DFS.
Now that we have a rough idea of the big picture, it's time to zoom in on a particular function. Let's look at main.DFS first, just because it is a shorter function:
(pprof) list DFS
Total: 5758 samples
ROUTINE ====================== main.DFS in /home/rsc/g/benchgraffiti/havlak/havlak1.go
samples 225 Total 2296 (flat / cumulative)
3 3 240: func DFS(currentNode *BasicBlock, nodes []*UnionFindNode, number map[*BasicBlock]int, last []int, current int) int {
18 19 241: nodes[current].Init(currentNode, current)
. 166 242: number[currentNode] = current
. . 243:
2 2 244: lastid := current
167 167 245: for _, target := range currentNode.OutEdges {
17 508 246: if number[target] == unvisited {
10 1157 247: lastid = DFS(target, nodes, number, last, lastid+1)
. . 248: }
. . 249: }
7 273 250: last[number[currentNode]] = lastid
1 1 251: return lastid
. . 252: }
(pprof)
The listing shows the source code for the DFS function (really, for every function matching the regular expression DFS). The first three columns are the number of samples taken while running that line, the number of samples taken while running that line or in code called from that line, and the line number in the file. The related command disasm shows a disassembly of the function instead of a source listing; when there are enough samples this can help you see which instructions are expensive. The weblist command mixes the two modes: it shows a source listing in which clicking a line shows the disassembly.
Since we already know that the time is going into map lookups implemented by the hash runtime functions, we care most about the second column. A large fraction of time is spent in recursive calls to DFS (line 247), as would be expected from a recursive traversal. Excluding the recursion, it looks like the time is going into the accesses to the number map on lines 242, 246, and 250. For that particular lookup, a map is not the most efficient choice. Just as they would be in a compiler, the basic block structures have unique sequence numbers assigned to them. Instead of using a map[*BasicBlock]int we can use a []int, a slice indexed by the block number. There's no reason to use a map when an array or slice will do.
Changing number from a map to a slice requires editing seven lines in the program and cut its run time by nearly a factor of two:
$ make havlak2
6g havlak2.go
6l -o havlak2 havlak2.6
rm havlak2.6
$ xtime havlak2 # diff from havlak1
# of loops: 76000 (including 1 artificial root node)
30.88u 0.24s 31.14r 1564608kB havlak2
$
We can run the profiler again to confirm that main.DFS is no longer a significant part of the run time:
$ make havlak2.prof
havlak2 -cpuprofile=havlak2.prof
# of loops: 76000 (including 1 artificial root node)
$ gopprof havlak2 havlak2.prof
Welcome to pprof! For help, type 'help'.
(pprof) top5
Total: 3099 samples
626 20.2% 20.2% 626 20.2% scanblock
309 10.0% 30.2% 2839 91.6% main.FindLoops
176 5.7% 35.9% 1732 55.9% runtime.mallocgc
173 5.6% 41.4% 397 12.8% sweep
101 3.3% 44.7% 111 3.6% main.DFS
(pprof)
main.DFS still appears in the profile, but its total time has dropped from 20.0% to 3.6%. The rest of the program runtime has dropped too. Now the program is spending most of its time allocating memory and garbage collecting (runtime.mallocgc, which both allocates and runs periodic garbage collections, accounts for 55.9% of the time). To find out why the garbage collector is running so much, we have to find out what is allocating memory. One way is to add memory profiling to the program. We'll arrange that if the -memprofile flag is supplied, the program stops after one iteration of the loop finding, writes a memory profile, and exits:
var memprofile = flag.String("memprofile", "", "write memory profile to this file")
...
FindHavlakLoops(cfgraph, lsgraph)
if *memprofile != "" {
f, err := os.Create(*memprofile)
if err != nil {
log.Fatal(err)
}
pprof.WriteHeapProfile(f)
f.Close()
return
}
We invoke the program with -memprofile flag to write a profile:
$ make havlak3.mprof # diff from havlak2 havlak3 -memprofile=havlak3.mprof $
We use gopprof exactly the same way. Now the samples we are examining are memory allocations, not clock ticks.
$ gopprof havlak3 havlak3.mprof
Adjusting heap profiles for 1-in-524288 sampling rate
Welcome to pprof! For help, type 'help'.
(pprof) top5
Total: 118.3 MB
66.1 55.8% 55.8% 103.7 87.7% main.FindLoops
30.5 25.8% 81.6% 30.5 25.8% main.*LSG·NewLoop
10.0 8.5% 90.1% 10.0 8.5% main.NewBasicBlock
6.5 5.5% 95.6% 6.5 5.5% main.*SimpleLoop·AddNode
2.1 1.7% 97.3% 12.1 10.2% main.*CFG·CreateNode
(pprof)
Gopprof reports that FindLoops has allocated approximately 66.1 of the 118.3 MB in use; NewLoop accounts for another 30.5 MB. To reduce overhead, the memory profiler only records information for approximately one block per half megabyte allocated (the “1-in-524288 sampling rate”), so these are approximations to the actual counts.
To find the memory allocations, we can list those functions.
(pprof) list FindLoops
Total: 118.3 MB
ROUTINE ====================== main.FindLoops in /home/rsc/g/benchgraffiti/havlak/havlak3.go
MB 66.1 Total 103.7 (flat / cumulative)
...
. . 267:
1.9 1.9 268: nonBackPreds := make([]map[int]bool, size)
3.8 3.8 269: backPreds := make([][]int, size)
. . 270:
1.0 1.0 271: number := make([]int, size)
1.0 1.0 272: header := make([]int, size, size)
1.0 1.0 273: types := make([]int, size, size)
1.0 1.0 274: last := make([]int, size, size)
1.9 1.9 275: nodes := make([]*UnionFindNode, size, size)
. . 276:
. . 277: for i := 0; i < size; i++ {
5.5 5.5 278: nodes[i] = new(UnionFindNode)
. . 279: }
...
. . 286: for i, bb := range cfgraph.Blocks {
. . 287: number[bb.Name] = unvisited
48.0 48.0 288: nonBackPreds[i] = make(map[int]bool)
. . 289: }
...
(pprof) list NewLoop
Total: 118.3 MB
ROUTINE ====================== main.*LSG·NewLoop in /home/rsc/g/benchgraffiti/havlak/havlak3.go
. . 578: func (lsg *LSG) NewLoop() *SimpleLoop {
2.5 2.5 579: loop := new(SimpleLoop)
7.5 7.5 580: loop.basicBlocks = make(map[*BasicBlock]bool)
20.5 20.5 581: loop.Children = make(map[*SimpleLoop]bool)
...
. . 588: }
(pprof)
It looks like the current bottleneck is the same as the last one: using maps where simpler data structures suffice. FindLoops is allocating about 48 MB of maps, and NewLoop is allocating another 20 MB.
As an aside, if we run gopprof with the --inuse_objects flag, it will report allocation counts instead of sizes:
$ gopprof --inuse_objects havlak3 havlak3.mprof
Adjusting heap profiles for 1-in-524288 sampling rate
Welcome to pprof! For help, type 'help'.
(pprof) list NewLoop
Total: 1604080 objects
ROUTINE ====================== main.*LSG·NewLoop in /home/rsc/g/benchgraffiti/havlak/havlak3.go
. . 578: func (lsg *LSG) NewLoop() *SimpleLoop {
54613 54613 579: loop := new(SimpleLoop)
75678 75678 580: loop.basicBlocks = make(map[*BasicBlock]bool)
207530 207530 581: loop.Children = make(map[*SimpleLoop]bool)
...
. . 588: }
(pprof)
Since the 200,000 maps account for 20 MB, it looks like the initial map allocation takes about 100 bytes. That's reasonable when a map is being used to hold key-value pairs, but not when a map is being used as a stand-in for a simple set, as it is here.
Instead of using a map, we can use a simple slice to list the elements. In all but one of the cases where maps are being used, it is impossible for the algorithm to insert a duplicate element. In the one remaining case, we can write a simple variant of the append built-in function:
func appendUnique(a []int, x int) []int {
for _, y := range a {
if x == y {
return a
}
}
return append(a, x)
}
In addition to writing that function, changing the Go program to use slices instead of maps requires changing just a few lines of code.
$ xtime havlak4 # diff from havlak3 # of loops: 76000 (including 1 artificial root node) 18.35u 0.11s 18.48r 575792kB havlak4 $
We're now at 3x faster than when we started. Let's look at a CPU profile again.
$ gopprof havlak4 havlak4.prof
Welcome to pprof! For help, type 'help'.
(pprof) top10
Total: 1851 samples
283 15.3% 15.3% 283 15.3% scanblock
233 12.6% 27.9% 1622 87.6% main.FindLoops
142 7.7% 35.5% 1054 56.9% runtime.mallocgc
112 6.1% 41.6% 276 14.9% sweep
111 6.0% 47.6% 115 6.2% main.DFS
85 4.6% 52.2% 661 35.7% runtime.growslice
84 4.5% 56.7% 84 4.5% runtime.memmove
69 3.7% 60.5% 281 15.2% runtime.MCache_Alloc
67 3.6% 64.1% 84 4.5% MCentral_Alloc
67 3.6% 67.7% 93 5.0% MCentral_Free
(pprof)
Now memory allocation and the consequent garbage collection (runtime.mallocgc) accounts for 56.9% of our run time. Another way to look at why the system is garbage collecting is to look at the allocations that are causing the collections, the ones that spend most of the time in mallocgc:
(pprof) web mallocgc

gopprof to ignore nodes that don't account for at least 10% of the samples:
$ gopprof --nodefraction=0.1 6.out prof Welcome to pprof! For help, type 'help'. (pprof) web mallocgc

FindLoops is triggering most of the garbage collection. If we list FindLoops we can see that much of it is right at the beginning:
(pprof) list FindLoops
. . 270: func FindLoops(cfgraph *CFG, lsgraph *LSG) {
. . 271: if cfgraph.Start == nil {
. . 272: return
. . 273: }
. . 274:
. . 275: size := cfgraph.NumNodes()
. . 276:
. 17 277: nonBackPreds := make([][]int, size)
. 82 278: backPreds := make([][]int, size)
. . 279:
. 2 280: number := make([]int, size)
. 1 281: header := make([]int, size, size)
. 61 282: types := make([]int, size, size)
. . 283: last := make([]int, size, size)
. 58 284: nodes := make([]*UnionFindNode, size, size)
. . 285:
2 2 286: for i := 0; i < size; i++ {
. 261 287: nodes[i] = new(UnionFindNode)
. . 288: }
...
(pprof)
Every time FindLoops is called, it allocates some sizable bookkeeping structures. Since the benchmark calls FindLoops 50 times, these add up to a significant amount of garbage, so a significant amount of work for the garbage collector.
Having a garbage-collected language doesn't mean you can ignore memory allocation issues. In this case, a simple solution is to introduce a cache so that each call to FindLoops reuses the previous call's storage when possible. (In fact, in Hundt's paper, he explains that the Java program needed just this change to get anything like reasonable performance, but he did not make the same change in the other garbage-collected implementations.)
We'll add a global cache structure:
var cache struct {
size int
nonBackPreds [][]int
backPreds [][]int
number []int
header []int
types []int
last []int
nodes []*UnionFindNode
}
and then have FindLoops consult it as a replacement for allocation:
if cache.size < size {
cache.size = size
cache.nonBackPreds = make([][]int, size)
cache.backPreds = make([][]int, size)
cache.number = make([]int, size)
cache.header = make([]int, size)
cache.types = make([]int, size)
cache.last = make([]int, size)
cache.nodes = make([]*UnionFindNode, size)
for i := range cache.nodes {
cache.nodes[i] = new(UnionFindNode)
}
}
nonBackPreds := cache.nonBackPreds[:size]
for i := range nonBackPreds {
nonBackPreds[i] = nonBackPreds[i][:0]
}
backPreds := cache.backPreds[:size]
for i := range nonBackPreds {
backPreds[i] = backPreds[i][:0]
}
number := cache.number[:size]
header := cache.header[:size]
types := cache.types[:size]
last := cache.last[:size]
nodes := cache.nodes[:size]
Such a global variable is bad engineering practice, of course: it means that concurrent calls to FindLoops are now unsafe. For now, we are making the minimal possible changes in order to understand what is important for the performance of our program; this change is simple and mirrors the code in the Java implementation. The final version of the Go program will use a separate LoopFinder instance to track this memory, restoring the possibility of concurrent use.
$ xtime havlak5 # diff from havlak4 # of loops: 76000 (including 1 artificial root node) 12.59u 0.07s 12.67r 584496kB havlak5 $
There's more we can do to clean up the program and make it faster, but none of it requires profiling techniques that we haven't already shown. The work list used in the inner loop can be reused across iterations and across calls to FindLoops, and it can be combined with the separate “node pool” generated during that pass. Similarly, the loop graph storage can be reused on each iteration instead of reallocated. In addition to these performance changes, the final version is written using idiomatic Go style, using data structures and methods. The stylistic changes have only a minor effect on the run time: the algorithm and constraints are unchanged.
The final version runs in 3.84 seconds and uses 257 MB of memory:
$ xtime havlak6 # of loops: 76000 (including 1 artificial root node) 3.79u 0.04s 3.84r 263472kB havlak6 $
That's nearly 15 times faster than the program we started with. Even if we disable reuse of the generated loop graph, so that the only cached memory is the loop finding bookeeping, the program still runs 10x faster than the original and uses 2.5x less memory.
$ xtime havlak6 -reuseloopgraph=false # of loops: 76000 (including 1 artificial root node) 5.74u 0.10s 5.84r 617040kB havlak6 -reuseloopgraph=false $
Of course, it's no longer fair to compare this Go program to the original C++ program, which used inefficient data structures like sets where vectors would be more appropriate. As a sanity check, we translated the final Go program into equivalent C++ code. Its execution time is similar to the Go program's:
$ xtime havlak6cc # of loops: 76000 (including 1 artificial root node) 4.04u 0.38s 4.42r 387744kB havlak6cc $
The Go program runs slightly faster because the C++ program is using automatic deletes and allocation instead of an explicit cache. That makes the C++ program a bit shorter and easier to write, but not dramatically so:
$ wc havlak6.cc; wc havlak6.go 401 1220 9040 havlak6.cc 461 1441 9467 havlak6.go $
Benchmarks are only as good as the programs they measure. We used gopprof to study an inefficient Go program and then to improve its performance by an order of magnitude and to reduce its memory usage by a factor of six. A subsequent comparison with an equivalently optimized C++ program shows that Go can be competitive with C++ when programmers are careful about how much garbage is generated by inner loops.
The program sources, Linux x86-64 binaries, and profiles used to write this post are available in the benchgraffiti project on Google Code.
As mentioned above, gotest includes these profiling flags already: define a benchmark function and you're all set. There is also a standard HTTP interface to profiling data. In an HTTP server, adding
import _ "http/pprof"
will install handlers for a few URLs under /debug/pprof/. Then you can run gopprof with a single argument—the URL to your server's profiling data—and it will download and examine a live profile.
gopprof http://localhost:6060/debug/pprof/profile # 30-second CPU profile gopprof http://localhost:6060/debug/pprof/heap # heap profile
- Russ Cox
