SDL: Adding Atomic Operations to SDL 1.3, part #1

We, the folks on the libSDL.org mailing list, have been talking for years about adding atomic operations to SDL. One user even went so far as to post code for a set of atomic operations. I have been wanting to write some articles about atomic operations and I wanted to use a set that was not tied to any one operating system or processor, so an SDL atomic operations library was exactly what I wanted. But, it wasn't there yet.

I asked about the patch that had already been posted and wound up volunteering to add it to SDL 1.3. Which I did. The patch was several years old by then and it didn't compile correctly on Linux so I had to patch the patch. It included a test program and when it compiled and the test program ran I checked it in. Well, the trouble with developing SDL is that the code is compiled on many different machines and operating systems. The patch broke SDL on Windows and the Mac. I didn't build it on either of those system because I do not have Windows or Mac at home.

After everyone on the mailing list had a good look at the patch we decided to pull it and I offered to write a new version. That set me off on a round of research. I need to find a useful set of atomic operations that are supported on a range of processors and across a wide range of operating systems. I looked at Linux, Windows, and Mac OSX. I looked at ARM, x86, and x86_64 processors. I believe that those operating systems and processors cover the majority of SDL users. I admit that the processors list is a bit short. But, when I said I looked at Linux, I wasn't being entirely accurate. I really looked at GCC, the GNU Compiler Collection. I looked at GCC because it 1) supports atomic operations as an extension to C/C++ and 2) it compiles code for a huge number of different processors and operating systems. In other words, there is a huge amount of knowledge embedded in GCC. Knowledge I can extract and use.

The one thing I did not try to do was wade in with a bunch of assembly code and try to create my own low level set of atomic operations. There are a number of libraries on the web that do exactly that. I believe that is a serious mistake. There are a large number of people working on GCC, Windows, Mac OSX, and any other OS you want to name who are paid, maybe full time, to develop, test, and maintain libraries of atomic operations. Not to mention that to work properly atomic operations really need some help from the compiler. So, it would be silly for me to roll my own.

After looking at the atomic operations provided by three different tool sets I was faced with the fact that they all support a different set of atomic operations working on different sizes of operands. I looked for a common subset and didn't like what I wound up with. There were two problem areas, the operations themselves and the size of the operands. GCC, supports atomic operations on 8, 16, 32, and 64 bit values, but not on all processors. Mac OSX supports an odd assortment of 8 bit, 32 bit, and 64 bit operands. Windows provides a large set of atomic operations but it is very inconsistent about operand sizes. Some operations support 8, 16, 32, and 64 bit operands while others support only 32, and 64 bit operands. Windows even has a few atomic operations that take 128 bit operands.

At this point in the project I stewed over the problem for a while, asked a number of questions an the mailing list and finally decided that the best thing to do was get something out so that people could react to it. I find that many, if not most, people are very good at reactive thinking but are not so good at proactive thinking. That is, faced with a bear, they will do a very good job of finding a way to get a way from the bear, but they are not so good about thinking of ways to avoid the bear in the first place. That means that to find out what people really want you have to first give them what they asked for.

I decided to pick a set of atomic operations and operands that I believe are useful and let people react to my solution. I decided that if a processor or OS does not support one of my operations I will emulate the operation. SDL has a tradition of emulating missing stuff so emulating missing atomic operations seems reasonable to me. I emulate the operations using a mutex to insure atomic behavior. The mutex can be a spin lock or, if that isn't possible it can be an SDL_mutex.

I decided to support 8, 16, and 32 operations by default and 64 bit operations on systems where SDL says it has a 64 bit data type. The SDL build system defines SDL_HAS_64BIT_TYPE when a system supports 64 bit values.

The operations I picked are ones I know can be used to implement spin locks, mutexes, and queues. These are:

  • Exchange

  • CompareThenSet

  • TestThenSet

  • Clear

  • FetchThenIncrement

  • FetchThenDecrement

  • FetchThenAdd

  • FetchThenSubtract

  • IncrementThenFetch

  • DecrementThenFetch

  • AddThenFetch

  • SubtractThenFetch

I thought a long time about the names. I decided to use the word “then” where traditionally the word “and” was used. I use “then” because it clearly states the order in which parts on an operation occur. I was always bothered by the use of “and” because it does not clearly state the order of operations. The SDL API name precedes the operation with “SDL_Atomic” and adds 8, 16, 32, or 64 to the end to indicate the size of the operands.

The SDL atomic operations library is in /src/atomic/. The /src/atomic/linux directory contains a complete implementation of the library for Linux. The /src/atomic/dummy directory contains a completely emulated version of the library that should build anywhere that SDL 1.3 builds, but does not actually use any atomic operations. The win32 and macosx subdirectories currently contain copies of the dummy library. I need help with those libraries.

The source code for the library uses a system of #ifdefs and #defines to tell which atomic operations are built in and which must be emulated. For each atomic operation in the library there is a line like:

#define nativeExchange8(ptr, value) (__sync_lock_test_and_set(ptr, value))

that defines the native operations in terms of system provided functions. If the operation is not #defined then the code for that operations in the library will build using emulation code. If it is #defined than it will be built using the native operation. There is another set of #ifdefs and #defines used to include support for emulation. That code gives you the opportunity to use mutex functions that I wrote or to write your own.

 

Comments

My God

What a complex  things you have talking about I had linux over my laptops which i used sell but not so much demand in the market

In my experience, atomic ops

In my experience, atomic ops were often slower than uncontended mutexes, but that they scaled better as more cores/threads were involved. Last I checked, though, the tipping point was at around 8 cores, which isn't exactly a typical configuration at the moment. Some benchmarks would go a long way to help decide whether to go one way or another

----------------------------------------------------

itil exam

benchmarks?

It'd be really neat to have a benchmark for these operations (and mutexes as well), so that one could investigate what works well on their particular platform and what doesn't.

In my experience, atomic ops were often slower than uncontended mutexes, but that they scaled better as more cores/threads were involved. Last I checked, though, the tipping point was at around 8 cores, which isn't exactly a typical configuration at the moment. Some benchmarks would go a long way to help decide whether to go one way or another.

Re: Benchmarks?

I agree completely.

My opinion is that existing Mutexes are designed to make sure that hardware threads are busy doing real work and not wasting time in busy waits. That makes perfect sense when hardware threads are scarce. But mutex implementations tend to use a single queue to keep track of ready software threads. Access to that queue should become a real bottle neck when there are many hardware threads. That means that when their are many hardware threads it makes sense to spend time in busy-waits to avoid time spent waiting to access the ready queue.

I would love to see the benchmarks to back that up. And, as soon as I get a 16+ core (or hyperthread) processor I'll do them if someone else hasn't already done it. I should be getting that machine in less than 5 years, but not less than 2 years. So, don't hold your breath.

Bob Pendleton

Who's online

There are currently 0 users and 2 guests online.