Aug 14, 2015

My Favorite Bugs: Java Memory Leak

This is a story of how I managed to create the largest memory leak I have ever seen, in fact, the largest memory leak I have ever heard of. I did it using Java, a garbage collected programming language. You are not supposed to be able to have memory leaks in garbage collected languages.

This story is not in any way a slam at Java. I really like Java. It is a story about unintended consequences and how what you do not know can bite you right on the ass.

I know a lot about garbage collectors and garbage collected languages. I have implemented Lisp several times, both as an interpreter and as a compiler. That means I have written several garbage collectors. I have actually written at least a half dozen garbage collectors of various sorts. I have read at least 100 papers and several books on garbage collection and at least that many on storage management in general. Not counting garbage collectors, I have written... OK, I've lost count... general heap management systems using many different methods, including both relocating handle-based systems and pointer-based systems.

I know a little bit about memory management. I do not claim to be an expert. In the bad old days you often just had to write your own memory managers to make it possible to write the application that you were actually working on. Anyway, I was so very glad to find myself in a situation where I was able to use a garbage collected language. So very, very glad not to have to write yet another memory manager, not to have to worry about memory management at all. And, that was my first mistake.

It takes a few mistakes to create a memory leak in a garbage collected language.

I was writing a web spider. I was working on a research project that required a web spider, and at the same time I was supposed to be evaluating Java. Java was actually new when I was doing this work, i.e., it was a while ago. So, I decided to write a web spider in Java to kill multiple birds with one stone.

So, I did just that. I wrote a multithreaded web spider in Java. In the end it was very handy for collecting the statistics I was looking for and it showed me some statistics I was not expecting. What did I find? Well I found that the distribution of vocabulary used on the web varies significantly depending on where you start your web spider. I found that (at the time) there was a porn site within three links of any page on whitehouse.gov. Of course (at the time) whitehouse.com was a porn site. People just can't seem to avoid typing ".com" when they mean ".gov".

I also found out that using whitehouse.gov as a starting point for testing a web spider is a bad idea. I was working for a large phone company at the time, and it seems that upper management was not pleased when they got a telephone call from the real live White House, the Official Residence of POTUS, asking the telco to cut it out. They were OK with me hitting them at night, but not during working hours. That taught me to salt the search queue with many starting URLs so that the initial start-up activity would be spread all over the net and not dumped on one poor little old web server. It also taught me that paying taxes is not a justification for beating up the White House web server.

I also found out that every URL I hit was being recorded by the Telephone Company Police (yes, really). I found out about this little detail when the TCP showed up in my boss's office with a list of the tens of thousands (yes, really) of porn sites I had hit from my PC. After she started breathing again she explained what was going on and the TCP left without dragging me, and possibly her, out to the parking lot and firing me on the spot. (I am talking about the company that fired Dilbert...) The fact is that if you start a web spider you will find porn sites.

It turned out the TCP were really more interested in how I was managing to hit thousands of porn sites per minute without downloading any pictures. They found the pattern to be ... odd.

OK, so I wrote and used a web spider and had some interesting experiences and actually accomplished several tasks that I was getting paid to do. But, there is still the story about the giant memory leak to tell.

This all started with the problem of parsing web pages. To write a web spider you have to examine a web page and find all the HTTP type URLs that are in it. You store the URLs in a queue if you want to do a breadth first search or a stack if you want to do a depth first search. You can also sort the URLs by first assigning them a weight based on some measure you apply to them. You also need a list of URLs you have already visited so that you do not visit them again. The web, and webs in general, have lots and lots of cycles.
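For the curious, here is a minimal sketch of that bookkeeping in modern Java. The names (Frontier and so on) are my own inventions for this post, not from the original spider:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    class Frontier {
        private final Deque<String> queue = new ArrayDeque<>();
        private final Set<String> visited = new HashSet<>();

        // Enqueue a URL only if we have never seen it before; the visited
        // set is what keeps the spider out of the web's many cycles.
        void add(String url) {
            if (visited.add(url)) {   // add() returns false for duplicates
                queue.addLast(url);   // addFirst() here would give DFS instead
            }
        }

        String next() {
            return queue.pollFirst(); // null when the frontier is empty
        }
    }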

(Oh yeah, if you want a really useful web spider you should never do what those pesky robots.txt files tell you to do. Really, they do get in the way. It can be fun to download them and analyze them though. They give you clues for finding really interesting stuff.)

I thought that the way to find the URLs was to parse the HTML and look in all the places you are supposed to have URLs. I wrote a parser that, like most parsers, treated the source document as a stream. That should work, right? It turns out that very few web pages that I found were syntactically correct HTML. Tags were misspelled or just plain missing. If the tags were there, their contents were not what they were supposed to be. It seems the syntax of HTML is not defined by the standard documents but by what web browsers will accept. Web browsers are very flexible about what they accept as valid HTML.

I also found out that most of the URLs, especially the ones people do not want you to find, are hidden inside chunks of Javascript. You can't find those by parsing HTML. If I were writing it today I would definitely scan all the CSS for URLs. I've found some interesting stuff in CSS without a web spider.

My first attempt at parsing web pages was a total bust. The technique that worked was to download the entire web page as a single chunk of text and apply pattern matching to it. So I did just that. I downloaded each page, stored it in a Java string, and searched for patterns. For the first version of the working spider I did everything in memory. Since everything was being done in memory I was not surprised when the spider ran out of memory. That turned out to be a mistake. I should have been surprised.
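If I were doing the pattern matching in modern Java it might look something like the sketch below. This is only an illustration (java.util.regex did not even exist yet when I wrote the spider), and the pattern is deliberately crude:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class UrlScraper {
        // Grab anything that looks like an http URL, whether it lives in
        // a tag, a chunk of Javascript, or plain text.
        private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

        static List<String> findUrls(String page) {
            List<String> urls = new ArrayList<>();
            Matcher m = URL.matcher(page);
            while (m.find()) {
                urls.add(m.group());
            }
            return urls;
        }
    }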

So, what was taking up memory? Well, I had a queue for storing URLs that were waiting to be scanned. The values in that queue were actually pairs of URLs. I stored the source URL of the page along with the new URL. I did that so that if the new URL didn't work the spider would log both URLs and I could go back and learn something new about how to hide URLs. There was also a hash that I used to keep track of URLs that had been scanned so that the spider would not get caught in loops and rescan pages multiple times. Then there were the structures I used to store the statistics I was interested in.

OK, so I expected to run out of memory, and it did, so I assumed that I needed to reduce memory use. I did not suspect a memory leak. What to do? Well, I built a file-based queue. That was slower than RAM, but it moved all that information out to disk. Sure enough, I still ran out of RAM. Moving the queue to a file had no effect. Time to move the hash, right?
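The file-based queue does not need to be fancy. Here is a sketch of the idea, my own reconstruction rather than the original code, assuming one tab-separated URL pair per line:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.io.RandomAccessFile;

    class FileQueue {
        private final PrintWriter tail;      // appends new entries at the end
        private final RandomAccessFile head; // consumes entries from the front

        FileQueue(String path) throws IOException {
            tail = new PrintWriter(new FileWriter(path, true), true); // append, autoflush
            head = new RandomAccessFile(path, "r");
        }

        void add(String sourceUrl, String newUrl) {
            tail.println(sourceUrl + "\t" + newUrl);
        }

        // Returns {sourceUrl, newUrl}, or null when the queue is drained.
        String[] next() throws IOException {
            String line = head.readLine(); // byte-oriented, fine for URLs
            return line == null ? null : line.split("\t", 2);
        }
    }

Consumed entries just stay on disk, which is slower than RAM but also means a scan can be stopped and restarted.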

I had some old C code for B-Tree files so I converted that to Java. (Later I converted the same code to C++. Good code never dies...) That code let me substitute a B-Tree file for the hash that I had been using to keep track of previously visited URLs. Putting all that stuff on disk also let me stop a scan and restart it at a later time.

The spider was still running out of memory. I learned how to monitor memory usage in a Java program. I learned how to dig through RAM in a Java program. I dumped all sorts of information as the spider ran around the web. I noticed something. The program still ran out of RAM, but the rate at which it used RAM declined over time. Really weird. I now suspected a memory leak.

I finally figured it out. I also just found out that what caught me was fixed in 2012, in Java 1.7.0_06. BTW, that is only about 15 years after I reported the problem to Sun. The problem was with the original implementation of strings in Java and my lack of understanding of that design. If I had understood the design I would not have had a problem, so I can hardly put all the blame on Java. What you don't know can kill you.

I believe that the original version of Java strings was designed to both save memory and avoid memory fragmentation. Strings are hell for garbage collectors and all other storage managers because they tend to have random sizes. Most data types have a single fixed size. If you create one and delete it, then create another one, you are usually going to find a free block exactly the size you need. But not so with strings. Strings tend to be small and random-sized. Just finding a piece of memory the right size can be time-consuming. But, worse, you wind up with lots of small, random-sized blobs of memory scattered all through your heap. These nasty little blobs can take up a lot of space but not be large enough to be useful. So, the original designers of Java tried to avoid all those problems. Can't blame them. And, they did a pretty good job too. I have a lot of respect for Gosling.

When I downloaded those pages and stored them in a single string I was creating a large string. I "assumed" that when I was done with the string it would be garbage collected. But, the strings were not being garbage collected. In the original version of Java a string had a pointer to the text of the string, plus an offset to the starting position of the string within the text and a length. This representation lets you take substrings of a string without copying any part of the text of the original string. What happens is that the substring has a pointer to the original text of the string, but the offset indicates the starting character of the substring in the original text and the length is the length of the substring. Allowing substrings to share the text of the original string helps avoid all the problems with memory management I described above. The trouble is that I was keeping substrings around in memory as part of my statistical information. Keeping a substring around pinned the entire original string in memory.
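In outline, with field names simplified (this is a sketch of the idea, not the actual library source), the old representation worked like this:

    // Roughly how a String looked before Java 1.7.0_06.
    // substring() copies no characters at all.
    class OldString {
        char[] value;  // the text -- shared with every substring taken from it
        int offset;    // where this string starts inside value
        int count;     // how many characters belong to this string

        OldString substring(int begin, int end) {
            OldString s = new OldString();
            s.value = this.value;           // same backing array as the parent!
            s.offset = this.offset + begin;
            s.count = end - begin;
            return s;                       // keeps the whole parent array reachable
        }
    }

As long as any substring is reachable, the garbage collector sees the entire parent char array as live, no matter how few of its characters the substring actually uses.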

I downloaded a page into a single string. I then scanned for words and looked them up in my "dictionary". If the word was already in the dictionary I incremented its count by one. If not, I added the word to the dictionary with a count of one. That meant that every time I added a word to my dictionary I pinned the entire source page in RAM. Even though I had no reference to the original text, the substring did, and so the page could not be garbage collected.
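The buggy loop looked roughly like this (again a reconstruction, not the original code). On the old String implementation, split() handed back substrings that shared the page's character array:

    import java.util.HashMap;
    import java.util.Map;

    class Dictionary {
        private final Map<String, Integer> counts = new HashMap<>();

        void countWords(String page) {
            for (String word : page.split("\\W+")) {
                if (word.isEmpty()) continue; // split() can yield a leading empty string
                Integer n = counts.get(word);
                counts.put(word, n == null ? 1 : n + 1); // new word -> whole page pinned
            }
        }
    }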

The first pages I downloaded had lots of new words in them because the dictionary started out empty. As time went on fewer and fewer new words were found. Many pages had no new words. If a page had no new words it would be garbage collected. But, every time I found a new word, and I eventually found tens of thousands of words, an entire page was being pinned in memory.

Once I understood the problem, the solution was easy. Just create an actual copy of any substring I wanted to keep around. Doing that removed all those pesky invisible references and suddenly I could let the spider run for weeks without running out of RAM.
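In terms of the Dictionary sketch above, the fix is one extra line. As I recall, on the old implementation the new String(String) constructor copied the characters whenever the source string was a small view of a larger array:

    void countWords(String page) {
        for (String word : page.split("\\W+")) {
            if (word.isEmpty()) continue;
            String key = new String(word); // private copy, detached from the page
            Integer n = counts.get(key);
            counts.put(key, n == null ? 1 : n + 1);
        }
    }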

So what is the moral of our story?
  • Do not make assumptions about anything. 
  • Do not use any language feature if you do not understand how it is implemented.
  • You must understand the implications of the implementation.
Those are good ideas for programmers using any language. For language implementers and designers the moral should be "all the use cases you can imagine are a tiny subset of the use cases that exist." Or, as Shakespeare said so well:
There are more things in heaven and earth, Horatio,
Than are dreamt of in your philosophy.
Another way to say that would be:
  • Do not make assumptions about how programmers will use your language.
  • Features should work the way programmers believe they work.
  • The distance between clever and stupid is very short. 
I guess the sum total of my morals is "Do not make assumptions". The question is, are you wise enough to realize you are making an assumption? In the case of this bug I was not that wise. That is why it is one of my favorite bugs. Fortunately it only cost me about a week's work.





Aug 6, 2015

Rainbow Time

This isn't one of my usual grumpy stories. This morning I had rainbow time. It was wonderful.

When my children were young, Mother's Day became a real problem for me. My two kids always made some kind of craft, usually at school, for Mom. But, they also wanted to get her something special. They wanted to get her something beautiful and artistic, something Mom would love and remember forever. On the other hand, taking two active, inquisitive children into stores that sell beautiful artistic things usually means taking two active, inquisitive children into stores that have many fragile things that little hands should not touch. One parent, two kids, the math just doesn't work. You can't watch them both at the same time.

But I found a solution. My wife loves Swarovski-style cut lead glass crystals. And, every child I have ever met loves rainbows. Those prismatic crystals are wonderful rainbow makers.

There is a Swarovski store near where we live and one of its nicest features is that everything in the store is either out of the reach of small children or in locked cases. These stores are actually childproof. I could take my kids in there and know that even if I got distracted my children could do no damage. They were good kids. Well behaved. But, they were children.

The trouble with giving someone a beautiful prismatic lead glass crystal is that the crystals usually wind up in a box in a drawer where they may be looked at only once every few years, if they are looked at at all. Or, they wind up in a curio cabinet where they become part of the wallpaper and are never looked at at all.

I found a solution to that too. When I was a little kid, almost eight years old, I saw the movie Pollyanna. That movie had a surprisingly large effect on my life.

Back then, in the late '50s and early '60s, my parents would dress the three of us, me and my two sisters, in our PJs and load us in the back of the car with our pillows and some blankets. Mom always made sure to tuck in a blanket over the whole back seat so that we wouldn't sweat and stick to it. Then we would head out to the drive-in movie theater. The idea was that my parents would watch the double feature and that we would fall asleep sometime during the first film of the night. It usually worked very well, but sometimes I would manage to stay awake and watch the whole first film. But I always fell asleep in the back seat of the car, curled up with my sisters like a pile of sleeping kittens. I have a few memories of waking in my father's arms as he carried me from the car to my bed and tucked me in.

In the movie, Pollyanna notices that the prismatic decorations hanging on a huge chandelier cast "rainbows" on the wall when sunlight hits them just right. She and her guardian then take all the rainbow makers and hang them in a window. As the sun sets the light streams through the window and casts what seem like hundreds of rainbows on the walls of the room. I saw the same effect in the old Hansen Planetarium in Salt Lake City. It had beveled glass windows, and just as the sun set over the Great Salt Lake the second floor gallery would light up with dozens of rainbows.

Those Mother's Day crystals are wonderful rainbow makers. We put cup hooks at the tops of our living room and kitchen windows and hung the crystals where, once in a while, the light through the window casts dozens of dancing rainbows across the whole first floor of my home.

Rainbow time is the time when sunlight manages to come through the windows at just the right angle to get past the trees and clouds and be refracted by the crystals into dancing rainbows. It doesn't happen every day. It only happens when the sun rises in the right position and there are no clouds or mist to block the direct rays of sunshine. But when it happens it is amazing and wonderful.

When my children were small we would play a game where we tried to catch a rainbow. We would put our hands in just the right place to catch a rainbow in our palms. Then we would follow the rainbow as far as we could toward the crystals and toward the walls.

Over the decades many of the crystals have wound up in a display case. But there are two left in the arch of the living room window. This morning I had rainbow time and I caught a rainbow in the palm of my hand and on my big toe. My children are grown now and I'm a grandfather. But, I still get to play catch the rainbow and remember when I was the father of two young children. I tried my best to fill their world with wonder.

I believe that my fascination with prisms, inspired by that movie, is part of what drove me to become so curious and so fascinated by science. I saw light going in one side. I saw the rainbow coming out the other side. I was desperate to understand how and why that could happen. I now know the standard explanation, and I also know that at the bottommost level we do not know why the Universe works that way. There are so many unanswered questions. I hope we never run out of questions and people who are willing to spend their lives answering one more level of "Why?"