Friday, June 14, 2013

The root of all evil is optimization? Or apathy?

You've probably heard the refrain "Premature optimization is the root of all evil".

Well, how did that turn out? Every Windows computer is filled with little gadgets in the system tray. Users may not know what those little icons are, but each one seems to make the hard drive thrash for 10 additional seconds on startup and to use 10 additional megabytes of RAM. Plus there are all these hidden services that the user cannot see or measure, but that slow down the PC just as much as those system-tray icons. Of course, some of these apps are small, optimized, and necessary, but a few bad apples (which users have no way to locate) ruin the startup experience.

I have a somewhat different opinion: I think that optimization is the main job of many programmers—not all programmers by any means, but many. Look, here's a method that searches a list of points for the one closest to a given point:

  using System.Windows;
  ...
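  // Note: IndexOfMin is an extension method (it is not part of standard LINQ);
  // it returns the index of the item that minimizes the given expression.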
  public int ClosestPointTo(Point near, IList<Point> list)
  {
    return list.IndexOfMin(p => (p - near).Length);
  }

Well, that was easy! So how come I spent days researching and writing an R*-tree implementation? Because the easy solution is just too damn slow! Anybody can find "naïve" solutions to problems, and if that were all we needed, it would be easy to find programmers who are "good enough". But inevitably, as the problem size grows larger, the naïve solution isn't good enough. And when the problem size gets really huge (as it inevitably will for somebody), even the solutions everyone thought were good become useless.
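
To make that concrete: the whole point of a spatial index, whether an R*-tree or something cruder, is that a query only has to look at a few nearby points instead of all of them. Here is a rough sketch of the principle using a simple uniform grid (an illustration only, not the R*-tree I actually wrote):

  // Illustration only -- a crude uniform grid, not a real R*-tree. Points are
  // bucketed by cell so that a nearest-point query inspects a handful of
  // nearby cells instead of scanning the whole list.
  using System;
  using System.Collections.Generic;
  using System.Windows; // Point, Vector

  public class PointGrid
  {
    readonly double _cellSize;
    readonly Dictionary<(int, int), List<Point>> _cells = new Dictionary<(int, int), List<Point>>();

    public PointGrid(IEnumerable<Point> points, double cellSize)
    {
      _cellSize = cellSize;
      foreach (var p in points) {
        var key = ((int)Math.Floor(p.X / cellSize), (int)Math.Floor(p.Y / cellSize));
        if (!_cells.TryGetValue(key, out var cell))
          _cells[key] = cell = new List<Point>();
        cell.Add(p);
      }
    }

    // Scan outward, one ring of cells at a time, and stop after the first ring
    // that contains any points. (A production-quality index would scan one
    // extra ring to guarantee the true nearest point; this sketch just conveys
    // the idea.)
    public Point? ClosestTo(Point near, int maxRings = 1000)
    {
      int cx = (int)Math.Floor(near.X / _cellSize), cy = (int)Math.Floor(near.Y / _cellSize);
      Point? best = null;
      double bestDist = double.MaxValue;
      for (int ring = 0; ring <= maxRings && best == null; ring++)
        for (int x = cx - ring; x <= cx + ring; x++)
          for (int y = cy - ring; y <= cy + ring; y++)
            if ((Math.Abs(x - cx) == ring || Math.Abs(y - cy) == ring) // only cells in the new ring
                && _cells.TryGetValue((x, y), out var cell))
              foreach (var p in cell) {
                double d = (p - near).Length;
                if (d < bestDist) { bestDist = d; best = p; }
              }
      return best;
    }
  }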

I admit, I have a bad habit of premature optimization, and it is a vice, sometimes. For example, for my company I wrote turn-by-turn navigation software called FastNav, which requires files in a proprietary "NaviMap" format that are converted from Shapefiles. I thought it would be neat to "script" the conversion process with text files, using ANTLR to parse expressions that would then be interpreted, but I was concerned that interpreting dynamically-typed expressions would be slow. So I spent a lot of time writing a little compiler that converted these textual expressions to statically-typed, compiled MSIL, typed according to the schema of the Shapefile.
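
The converter itself is proprietary, but the general trick -- baking the schema's type information into a compiled, statically-typed delegate instead of re-interpreting a dynamically-typed expression for every record -- can be sketched with System.Linq.Expressions (the real thing parsed with ANTLR and emitted MSIL directly, and the field name below is made up):

  // Sketch of the technique only, not the FastNav converter. Given that the
  // schema says a field is an int, compile a typed filter once and reuse it
  // for every record.
  using System;
  using System.Linq.Expressions;

  static class ExprDemo
  {
    // e.g. a script says "POPULATION > 50000" and the schema says POPULATION
    // is an int field (hypothetical field name, for illustration).
    public static Func<int, bool> CompileGreaterThan(int threshold)
    {
      ParameterExpression value = Expression.Parameter(typeof(int), "value");
      Expression body = Expression.GreaterThan(value, Expression.Constant(threshold));
      return Expression.Lambda<Func<int, bool>>(body, value).Compile();
    }

    static void Main()
    {
      Func<int, bool> filter = CompileGreaterThan(50000);
      Console.WriteLine(filter(123456)); // True
      Console.WriteLine(filter(999));    // False
    }
  }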

So now I have this ultra-fast expression evaluator. But guess what? My MapConverter is still slow to this day, because it turned out that the bottleneck was elsewhere. (I improved the worst bottleneck, which left other bottlenecks that were harder to fix; users weren't complaining, so I let it be.)

But FastNav itself was also too slow, running on a 400MHz WinCE device, and I spent six months just optimizing it until my boss was satisfied (and that was after writing all the drawing primitives from scratch, because WinCE's drawing code is abysmal and AGG, while faster than WinCE's native drawing, was still too slow). There are no good profilers for WinCE, and over time I developed an intuition that the bottlenecks of the code on WinCE were dramatically different from the bottlenecks on the desktop (actually, the desktop version had no real bottlenecks to speak of); so much so that desktop profiling results were completely useless. So I painstakingly benchmarked flash I/O performance (horribly slow), floating point (horribly slow), font drawing (horribly slow if the text is rotated, fast otherwise), and all the various modules of FastNav, optimizing each one until the product was finally usable. I even optimized std::vector (writing a replacement called inline_vector and then, finding that this didn't really help, a simpler replacement called mini_vector), implemented my own hashtables, and replaced the memory manager (new and delete) to optimize small allocations.

Optimization has always been an important and necessary part of my work. Around 1998 I wrote the (now-dead) Super Nintendo emulator SNEqr in C++, but I was told that C optimizers are "really good now" and there's no reason to use assembly language anymore. Well, silly me, I believed them and wrote the CPU emulator in C--it was horrifically slow, and became about 10 times faster when I rewrote it in assembly language. And every program I write seems to end up with something that is too slow, something that either gets optimized--or that users just have to live with.

After all this experience, I have a tendency to optimize sooner rather than later, and it can be a bad habit because I may choose to optimize the wrong things--the non-bottlenecks.

But I'm convinced that premature optimization is nowhere near as bad as not giving a f*ck, which is a much more common practice. For example, how is it that every other program needs a splash screen and takes a few seconds to start on an idle system? I have never written a program that takes more than a second or two to start--except thanks to big and slow third-party components.

I recently made an app for taxi dispatchers called IntelliMap. On my fast developer machine it takes at least 6 seconds to restart after having started once. But that's the WPF version. I originally wrote it with WinForms, and that version restarts instantaneously, as if it were Notepad or something. But I was told that the user interface looked "too 90s" and should be modernized using WPF and Infragistics controls. Luckily I had used the MVVM pattern, so I was able to write new view code against the same models and viewmodels. In fact, I kept the WinForms version operational in the same executable file: to this day you can start it with the "--winforms" switch and it starts instantly, albeit with a "90s" interface, and with fewer features since I didn't bother maintaining the WinForms version.
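
IntelliMap's code isn't something I can show here, but the general shape is easy to sketch (the names below are invented for illustration): a command-line switch picks the view layer at startup, while both UIs share the same viewmodel.

  // Hypothetical sketch, not IntelliMap's actual code: the viewmodel knows
  // nothing about either UI toolkit, so the entry point can choose one.
  using System;
  using System.Linq;

  class DispatchViewModel // shared by both UIs (INotifyPropertyChanged omitted)
  {
    public string Status { get; set; }
  }

  static class Program
  {
    [STAThread]
    static void Main(string[] args)
    {
      var vm = new DispatchViewModel { Status = "Idle" };

      if (args.Contains("--winforms")) {
        // Plain WinForms shell: starts almost instantly.
        var form = new System.Windows.Forms.Form { Text = "IntelliMap (WinForms)" };
        // ...bind WinForms controls to vm here...
        System.Windows.Forms.Application.Run(form);
      } else {
        // WPF shell: prettier, but much slower to start.
        var window = new System.Windows.Window { Title = "IntelliMap (WPF)", DataContext = vm };
        new System.Windows.Application().Run(window);
      }
    }
  }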

The problem is worse than it sounds. Because while it might take 6 or 7 seconds to restart on my fast developer machine, it takes over 45 seconds to restart (not cold-start) on one of our client's machines. And it's not just startup time; the WPF UI is more sluggish, and uses a lot more memory.

This really ticks me off. I wrote a program that starts instantly. But then I had to use second- and third-party libraries that are hella slow. The "Premature optimization" argument says you should wait to optimize; wait until your application is slow, then profile the code to find and remove the bottlenecks. But there are three problems:

  1. If you're waiting for it to be slow on your fast developer machine, you're waiting too long. Some of your end users will have much older, slower hardware.
  2. If you don't have good habits, your code will be slow throughout, so there won't be just one or two bottlenecks to optimize but many; each bottleneck you fix just makes the next one more apparent. If you have a habit of writing fast code, the bottlenecks will be fewer and you'll expend less effort optimizing (unless, of course, the OS itself is slow--WinCE, I'm talking to you).
  3. This argument is hard to apply to libraries. Apparently Microsoft and Infragistics felt that their WPF controls were fast and lean enough for them. But they're not fast enough for me! When writing a low-level library that other people will rely on, it's no good for a developer to wait until it feels too slow to them. Libraries are used for many reasons by many people. Library code that is not a bottleneck in one application will surely become a bottleneck in some other application. Every application stresses low-level libraries somehow, but each app causes stress in a different place. This implies that core, oft-used libraries should be optimized uniformly. You might say "well, even so, it's only inner loops that need optimization". But lots of non-loops need optimization too, just in case the client application calls those non-loops inside an important loop.

In my opinion, the lower level the code is, the more important its speed is. I write a lot of low-level code, so speed is almost always important to me. And it's irritating to have to rely on slow libraries written by others, especially closed-source commercial stuff that I cannot even understand, let alone do anything about.

And when it comes to code that is used by lots of different people, premature optimization of the public interface or the system architecture may be warranted even when optimization of the implementation is not. I don't have any great examples handy, but consider the IEnumerator interface: you have to call MoveNext() and then Current--two interface calls per iteration. Since this is a fundamental, ubiquitous interface, used constantly by everyone and often used in tight loops, it would have been good if it could iterate and return the next item with a single interface call: bool MoveNext(out T current). Since it's a public API though, it cannot be changed; that's why public interfaces need careful design up-front. (Of course, performance isn't the only factor in API design; things like flexibility and ease-of-use are equally important.)
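
To make the cost concrete, here's the shape of the standard interface next to the hypothetical single-call variant (the latter is just an illustration, not a real BCL type):

  // The standard IEnumerator<T> costs two interface calls per element:
  //   bool MoveNext();
  //   T Current { get; }
  //
  // A hypothetical alternative that advances and fetches in one call:
  interface IFastEnumerator<T>
  {
    // Returns false at the end of the sequence; otherwise writes the next
    // element into 'current'. One interface call per element.
    bool MoveNext(out T current);
  }

  static class Example
  {
    // A tight loop pays for one interface dispatch per item instead of two.
    static long Sum(IFastEnumerator<int> e)
    {
      long total = 0;
      int n;
      while (e.MoveNext(out n))
        total += n;
      return total;
    }
  }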

You don't have to optimize alone

Some people seem to believe that there are two kinds of languages: fast and efficient languages that make the developer work harder, like C/C++, and "RAD" languages that are easy to use and productive, like Ruby or C#, but hit speed limits at runtime. The underlying assumption is that you can't have runtime performance and productivity in a single language, and I reject that assumption in the strongest terms. Such a language can exist, should exist, and even does exist to some extent (e.g. D2).

The arguments against premature optimization are that it's a waste of time if you're optimizing the wrong thing, or that it makes code harder to understand, or that you might make a mistake and turn correct code into faulty code. But what if it was easy, and what if it didn't harm readability at all? Would there be a reason not to optimize then?

An obvious example of this kind of optimization--easy optimization that doesn't harm readability--is when you go to your compiler settings and enable optimizations. Ahh, all in a day's work! But the optimizer can't do everything, because you have knowledge that it does not, and it will probably never be smart enough to convert your linear search of a sorted list into a binary search.
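
For example, only the programmer knows that a particular list happens to be sorted; the optimizer will happily leave a linear scan as a linear scan. A small illustration (hypothetical code, just to show the point):

  // The compiler can't turn LinearIndexOf into a binary search, because only
  // the programmer knows the list is sorted.
  using System;
  using System.Collections.Generic;

  static class SearchDemo
  {
    static int LinearIndexOf(List<int> sortedList, int item)
    {
      for (int i = 0; i < sortedList.Count; i++) // O(n): what the optimizer is stuck with
        if (sortedList[i] == item)
          return i;
      return -1;
    }

    static void Main()
    {
      var sorted = new List<int>();
      for (int i = 0; i < 1000000; i++)
        sorted.Add(i * 2);

      Console.WriteLine(LinearIndexOf(sorted, 1999998));  // O(n) scan
      Console.WriteLine(sorted.BinarySearch(1999998));    // O(log n), but only because we asked
    }
  }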

In the long run, a big part of the solution is to give the compiler more knowledge, e.g. by using programming languages with features like "effects" and "global optimizations". Another big part of the solution is to have standard libraries that not only offer lots of fast algorithms to call upon, but also make those algorithms easy to find and use. But there are also simple, easy optimizations that programmers could apply themselves in their own code, if only our programming languages were tweaked a little. For instance, let's say you want to run some sort of search through some data structure, and scan the results only if there is more than one of them. It's so easy to write this:

  if (DoSearch().Results.Count > 1)
     foreach(var r in DoSearch().Results) {
        // do something with r
     }

Now, if we refactor it like this instead:

  var rs = DoSearch().Results;
  if (rs.Count > 1)
     foreach(var r in rs) {
        // do something with r
     }

That's 20 seconds we'll never get back. Or perhaps more than 20 seconds if this is an "else if" clause in a chain, because you'll have to spend some time thinking about how to refactor it:

  if (...) {
     ...
  } else if (DoSearch().Results.Count > 1) {
     foreach(var r in DoSearch().Results) {
        // do something with r
     }
  } else if (...) {
     ...
  }

Plus, the refactoring makes the code longer now. So should we even bother? Isn't this the premature optimization we were warned about? But what if the data structure is large? We shouldn't do the search twice, should we?

I say this dilemma shouldn't exist; it is entirely a flaw in the programming language. That's why I defined the "quick binding" operator for EC#:

  if (DoSearch().Results=:rs.Count > 1)
     foreach(var r in rs) {
        // do something with r
     }

It may look weird at first (it's the reverse of Go's := short variable declaration operator, and I am considering whether to use "::" for this operator instead), but now there's no dilemma. It's easy to create a new variable to hold the value of DoSearch().Results, so you may as well just do it and move on. No need to weigh the pros and cons of "premature optimization".

Another good example is LINQ. Suppose we have a long list of numbers and we'd like to derive another list of numbers from it. It doesn't matter exactly what the query does; here's one:

  List<int> numbers = ... // get a list somehow
  var numbers2 = (from n in numbers where n < 100000 select n + 1).ToList();

But LINQ involves a bunch of delegate calls which, given the way .NET is designed, cannot be inlined. You can get more speed using a plain loop like this:

  for (int trial = 0; trial < 200; trial++)
  {
    numbers2 = new List<int>();
    for (int i = 0; i < numbers.Count; i++) {
      int n = numbers[i];
      if (n < 100000)
        numbers2.Add(n + 1);
    }
  }

I just benchmarked this on my PC (200 trials, 1000000 random numbers of at most 6 digits) and got:

LINQ:    2500ms (100579 results)
for:     859ms (100579 results)
foreach: 1703ms (100579 results)

So the plain for-loop is almost 3 times faster (surprisingly, the foreach version, not shown above, is only slightly faster than LINQ).
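
For reference, the foreach version is essentially the same benchmark loop with foreach in place of indexing (reconstructed here for illustration; it is not the exact code that was timed):

  for (int trial = 0; trial < 200; trial++)
  {
    numbers2 = new List<int>();
    foreach (int n in numbers) {
      if (n < 100000)
        numbers2.Add(n + 1);
    }
  }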

So, should you write a plain for-loop instead to get that extra speed? Usually, the answer is "of course not". But my answer is "of course not--your computer should write the plain for-loop instead".

This is one of many reasons why EC# will have a procedural macro system: so that end-users can optimize code themselves, using macros to detect certain code patterns and rewrite them into more efficient forms. Of course, the most common optimizations can be bundled into a DLL and shared, so most people will not write these transformations themselves. Typically, a user will simply add a global attribute like [assembly: LinqToForLoop] to install the macro in their program, or attach the attribute to a single class or method for more conservative optimization.

I haven't actually figured out how the LinqToForLoop macro code would look. My thinking right now is that this type of macro, that looks for a code pattern and rewrites it, should "register" the kinds of nodes it wants to look at. The compiler will look for these nodes on the macro's behalf and give them to the macro when found. This will be more efficient than the obvious solution of simply passing "the whole program" to the macro and letting the macro find things itself. Since programmers will inevitably use lots of macros, it would be terribly inefficient for each one to scan the program separately.
