This story is not in anyway a slam at Java. I really like Java. It is a story about unintended consequences and how what you do not know can bite you right on the ass.
I know a lot about garbage collectors and garbage collected languages. I have implemented Lisp several times. I have implemented it as an interpreter and as a compiler. That means I have written several garbage collectors. I have actually written at least a half dozen garbage collectors of various sorts. I have read at least 100 papers and several books on garbage collection and at least that many on storage management in general. Not counting garbage collectors I have written... Ok, I've lost count... general heap management systems using many different methods including both relocating handle based systems and pointer based systems.
I know a little bit about memory management. I do not claim to be an expert. In the bad old days you just often had to write your own memory managers to make it possible to write the application you that you were actually working on. Anyway, I was so very glad to find myself in a situation where I was able to use a garbage collected language. So very very glad not to have to write yet another memory manager, not to have to worry about memory management at all. And, that was my first mistake.
It takes a few mistakes to create a memory leak in a garbage collected language.
I was writing a web spider. I was working on a research project that required a web spider and at the same time was supposed to be evaluating Java. It was actually new when I was doing this work. I.E. it was a while ago. So, I decided to write a web spider in Java to kill multiple birds with one stone.
So, I did just that. I wrote a multithreaded web spider in Java. In the end it was very handy for collecting the statistics I was looking for and it showed me some statistics I was not expecting. What did I find? Well I found that the distribution of vocabulary used on the web varies significantly depending on where you start your web spider. I found that (at the time) there was a porn site within three links of any page on whitehouse.gov. Of course (at the time) whitehouse.com was a porn site. People just can't seem to avoid typing ".com" when they mean ".gov".
I also found out that using whitehouse.gov as a starting point for testing a web spider is a bad idea. I was working for a large phone company at the time and it seems that upper management was not pleased when they got a telephone call from the real live White House, the Official Residence of POTUS asking the telco to cut it out. They were OK with me hitting them at night, but not during working hours. That taught me to salt the search queue with many starting URLs so that the initial start up activity would be spread all over the net and not dumped on one poor little old web server. It also taught me that paying taxes is not a justification for beating up the White House web server.
I also found out that every URL I hit was being recorded by the Telephone Company Police (yes, really). I found out about this little detail when the TCP showed up in my bosses office with a list of the tens of thousands (yes, really) porn sites I had hit from my PC. After she started breathing again she explained what was going on and the TCP left without dragging me, and possibly her, out to the parking lot and firing me on the spot. (I am talking about the company that fired Dilbert...) The fact is that if you start a web spider you will find porn sites.
It turned out the TCP were really more interested in how I was managing to hit thousands of porn sites per minute without downloading any pictures. They found the pattern to be ... odd.
OK, so I wrote and used a web spider and had some interesting experiences and actually accomplished several tasks that I was getting paid to do. But, there is still the story about the giant memory leak to tell.
This all started with the problem of parsing web pages. To write a web spider you have to examine a web page and find all the HTTP type URLs that are in it. You store the URLs in a queue if you want to do a breadth first search or a stack if you want to do a depth first search. You can also sort the URLs by first assigning them a weight based on some measure you apply to them. You also need a list of URLs you have already visited so that you do not visit them again. The web, and webs in general, have lots and lots of cycles.
(Oh yeah, if you want a really useful web spider you should never do what those pesky robot.txt files tell you to do. Really they do get in the way. It can be fun to down load them and analyze them though. They give you clues for finding really interesting stuff.)
I thought that the way to find the URLs was to parse the HTML and look in all the places you are supposed to have URLs. I wrote a parser that, like most parsers, treated the source document as a stream. That should work, right? It turns out that very few web pages that I found were syntactically correct HTML. Tags were misspelled or just plain missing. If the tags were there their contents were not what they were supposed to be. It seems the syntax of HTML is not defined by the standard documents but by what web browsers will accept. Web browsers are very flexible about what they accept as valid HTML.
My first attempt at parsing web pages was a total bust. The technique that worked was to down load the entire web page as a single chunk of text and apply pattern matching to it. So I did just that. I down loaded each page and stored it in a Java string and searched for patterns. For the first version of the working spider I did everything in memory. Since everything was being done in memory I was not surprised when the spider ran out of memory. That turned out to be a mistake. I should have been surprised.
So, what was taking up memory? Well I had a queue for storing URLs that were waiting to be scanned. The values in that queue were actually pairs of URLs. I stored the source URL of the page along with the new URL. I did that so that if the new URL didn't work the spider would log both URLs so I could go back and learn something new about how to hide URLs. There was also a hash that I used to keep track of URLs that had been scanned so that the spider would not get caught in loops and rescan pages multiple times. Then there were the structures I used to store the statistics I was interested in.
OK, so I expected to run out of memory and It did so I assumed that I needed to reduce memory use. I did not suspect a memory leak. What to do? Well, I built a file based queue. That was slower than RAM but it moved all that information out to a disk. Sure enough, I still ran out of RAM. Moving the queue to a file had no effect. Time to move the hash, right?
I had some old C code for B-Tree files so I converted that to Java. (Later I converted the same code to C++. Good code never dies...) That code let me substitute a B-Tree file for the hash that I had been using to keep track of previously visited URLs. Putting all that stuff on disk also let me stop a scan and restart it at a later time.
The spider was still running out of memory. I learned how to monitor memory usage in a Java program. I learned how to dig through RAM in a Java program. I dumped all sorts of information as the spider ran around the web. I noticed something. The program still ran out of RAM, but the rate at which it used RAM declined over time. Really weird. I now suspected a memory leak.
I finally figured it out. I also just found out that what caught me was fixed in 2013 in version Java 1.7.0_06. BTW, that is only about 15 years after I reported the problem to Sun. The problem was with the original implementation of strings in Java and my lack of understanding of that design. If I had understood the design I would not have had a problem so I can hardly put all the blame on Java. What you don't know can kill you.
I believe that the original version of Java strings was designed to both save memory and avoid memory fragmentation. Strings are hell for garbage collectors and all other storage managers because they tend to have random sizes. Most data types have a single fixed size. If you create one and delete it then create another one you are usually going to find a free block exactly the size you need. But, not so with strings. Strings tend to be small and random sized. Just finding a piece of memory the right size can be time consuming. But, worse, you wind up with lots of small random sized blobs of memory scattered all through your heap. These little nasty blobs can take up a lot of space but not be large enough to be useful. So, the original designers of Java tried to avoid all those problems. Can't blame them. And, they did a pretty good job too. I have a lot of respect for Gosling.
When I download those pages and stored them in a single string I was creating a large string. I "assumed" that when I was done with the string it would be garbage collected. But, the strings were not being garbage collected. In the original version of Java a string has a pointer to the text of the string, plus an offset to the starting position of the string within the text and a length. This representation lets you take substrings of a string with out copying any part of the text of the original string. What happens is that the substring has a pointer to the original text of the string, but the offset indicates the starting byte of the substring in the original text and the length is the length of the substring. Allowing substrings to share the text of the original string helps avoid all the problems with memory management I described above. The trouble is that I was keeping substrings around in memory as part of my statistical information. Keeping a substring around pinned the entire original string into memory.
I down loaded a page into a single string. I then scanned for words and looked them up in my "dictionary". If the word was already in the dictionary I incremented its count by one. If not, I added the word to the dictionary with a count of one. That means that every time I added a word to my dictionary I pinned the entire source page in RAM. Even though I had no reference to the original text, the substring did and so the page could not be garbage collected.
The first pages I down loaded had lots of new words in them because the dictionary started out empty. As time went on fewer and fewer new words were found. Many pages had no new words. If a page had no new words is would be garbage collected. But, every time I found a new word, and I eventually found tens of thousands of words, an entire page was being pinned in memory.
Once I understood the problem, the solution was easy. Just create an actual copy of any substring I wanted to keep around. Doing that removed all those pesky invisible references and suddenly I could let the spider run for weeks without running out of RAM.
So what is the moral of our story?
- Do not make assumptions about anything.
- Do not use any language feature if you do not understand how it is implemented.
- You must understand the implications of the implementation.
There are more things in heaven and earth, Horatio,Another way to say that would be:
Than are dreamt of in your philosophy.
- Do not make assumptions about how programmers will use your language.
- Features should work the way programmers believe they work.
- The distance between clever and stupid is very short.