Programming

Scary pitfalls when using Spring annotation based transactions

Spring annotation based declarative transactions uses AOP to very easily add transactions around methods. Using it looks like this.

@Transactional
public Stuff createStuff(Input input) {
       Stuff stuff = new Stuff(input);
       dao.storeStuff(stuff);
       dao2.registerStuff(stuff2);
       return stuff;
}

Explained very quickly: If the method returns successfully then it commits, if a runtime exception passes the proxy boundary around the method, or if the transaction is marked rollback-only, then the transaction manager will do a rollback of the transaction.

It is incredibly easy to use, and it saves a lot of boilerplate compared to a more procedural approach where commit and rollback calls are specified explicitly. While useful, it comes with the cost of being tied to the scope of the annotated method, along with several other more hidden risks and costs.

Below are several examples of problematic code I have seen caused by annotation based transactions. Similar code examples and tests can be found in this repo on Github.

Catching exceptions in a transaction annotated method

In the method below, it is very easy to think that all errors are handled, and that this method will never throw, but if the transaction is marked as rollback only, this method will throw outside the try catch block. Breaking expectations and causing unexpected errors.

@Transactional
public Stuff createStuff(Input input) {
       Stuff stuff = null;
       try {
           stuff = new Stuff(input);
           dao.storeStuff(stuff);
           dao2.registerStuff(stuff2);
       }
       catch(RuntimeException e) {
           logger.info("No problem…");
           //Do more stuff
       }
       return stuff;
}

This is much more visible with a explicit transaction scope. I have seen this several times, even by experienced developers. These errors are hard to test for, and when they happen lead to very weird behaviour, since the exception is absolutely not expected where it suddenly occurs.

Order of proxies can change outcome of transaction

It is critical that AOP interceptors are stacked in a sensible order. We usually do not want to commit a transaction, and then get a timeout error from Hystrix. Or have the transaction commit, and then have validation constraints on the Stuff instance fail. Ask yourself. Do you know in which order each annotation will be applied here? Does everyone on your team know?

@Transactional
@HystrixCommand("abc")
@Valid
@OtherAOPstuff
@EvenMoreAOPStuff
public Stuff createStuff(Input input) {
       Stuff stuff = new Stuff(input);
       dao.storeStuff(stuff);
       dao2.registerStuff(stuff2);
       return stuff;
}

The order by default is (though due to this Hystrix bug/feature validation will never be applied O_o) :

Transaction start
Hystrix
Validation
Validation
Hystrix
Transaction commit/rollback

It would be nice to have Hystrix outside the Transaction, since that avoids the transaction boundary when the short-circuit is open. This is possible with ordering the AOP interceptors,at the same time, it is important to avoid Validation happening outside the Transaction, since then we might commit and then throw an exception (typically a rollback is wanted if data is invalid).

Nested transactions joins existing transaction by default

This is very nefarious, and I have no idea why nesting transactions does not cause an exception by default.

The example that I think best shows this problem, is when you reuse a method in a service class which uses a transaction further down the call chain. That method uses a transaction to ensure it is rolled back if it throws an exception, but the calling method catches that exception and does some business decision based on the exception.

Now along comes an unsuspecting victim who has a transactional method that wants to use this method for some business purpose.

What happens now, is that the second transaction joins the first. Then an exception is thrown by one of the database queries in moveStuffInDB. That exception passes the first transaction boundary marking the transaction rollback only but is then caught. Processing continues, and then the exception reappears at the last boundary on commit. This means that the semantics of reusedMethod might have been changed a lot, and there is no way createStuff can see from the outside that this will happen.

@Transactional
public Stuff createStuff(Input input) {
       Stuff stuff = new Stuff(input);
       dao.storeStuff(stuff);
       reused.reusedMethod(stuff.getInfo());
       return stuff;
}

//A method in another class calling another transactional method.
public void reusedMethod(String input) {
     try {
         deepClass.moveStuffInDB(input);
     }
     catch(RuntimeException e) {
         //Swallow exception and do some action
     }
}

//Tx method with several db update calls
@Transactional
public void moveStuffInDB(String input) {
      dao.moveStuff(input);
      dao2.moveStuff(input);
}

To add to the confusion, use checked exceptions

When a checked exception passes the boundary of an @Transactional annotated method, it does not cause rollback by default (it can be changed in the annotation though). Add Spring Batch, which will also start transactions that a nested transactions will happily join, and transaction outcomes become even more foggy and hard to reason about.

I think using nested transactions usually is a bad idea, and it should be an opt in feature. In the normal case throwing an exception if a transaction within a transaction is encountered seems sensible to me.

Lambdas provide a much better API for transactions

In any language with good support for lambdas, I think it is very natural to do declarative transaction management using them. Spring sort of supports this, but you need to depend on a platformTransactionManager and a TransactionTemplate. This seems weird to me, given that the annotation based transactions does not make you depend on these.

With lambdas, transactions can be done like below (example is using Kotlin). This is declarative, while providing explicit transaction boundaries. It is therefore very easy to catch around the transaction, and the transaction does not have an unclear order when combined with other annotations.


fun createStuff(stuff : Input ) : Stuff {
       return doInTransaction { //Explicit tx scope
           val stuff = new Stuff(input);
           dao.storeStuff(stuff);
           dao2.registerStuff(stuff2);
           stuff 
       }
}

fun multipleTransactionsInMethod(stuff : Input ) : Stuff {
       val res = doInTransaction {
           val stuff = dao.storeStuff(stuff);
           stuff;
       }

       val res2 = doInTransaction {
           val stuff = dao2.storeStuff(stuff);
           stuff;
       }

       try {
           doInTransaction {
              dao.storeStuff(stuff);
            }
       } catch(e : RuntimeException) {
           //Handle stuff
       }
}

Along with the previously mentioned advantages, this also makes transactions much more visible, and allows easy inspection of the transaction code by developers.

To be honest I am a bit uncertain how good it is to rely on exceptions passing boundaries to do rollback in the first place, but if that is the chosen way, I think the above approach is much better then annotation based transactions. I am really looking forward to try Exposed, which is an ORM tool from JetBrains that seems to go in this direction.

Learning what WebRTC SDP a=setup values mean

My very hacky webRTC datachannel implementation stopped working a while back, and I could not figure out why.

The way it behaved was hard to understand. Signaling worked as expected, and I received a STUN on the correct port and responded to that. Both Firefox and Chrome reported the response as fine, and kept sending new STUN heartbeats at the normal rate, but no DTLS handshake was initiated.

Initially i thought something was really broken with my STUN/DTLS multiplexing, but I soon figured out that it behaved as expected.

This meant I was probably sending some wrong parameter during signaling, but what?

This is the SDP of my answer to the given offer from the browser.

v=0
o=- 1234567 2 IN IP4 192.168.1.158
s=-
t=0 0
a=group:BUNDLE data
a=msid-semantic: WMS
m=application 51410 DTLS/SCTP 5000
c=IN IP4 192.168.1.158
a=candidate:1 1 udp 2113937151 192.168.1.158 51410 typ host
a=ice-ufrag:4a64ca2b
a=ice-pwd:a26b6a15a8b4d35db21692d37906840a
a=ice-options:trickle
a=fingerprint:sha-256 C9:E2:48:09:47:C8:CC:B3:51:A8:A1:C5:AA:63:51:26:50:1D:FF:76:AE:EF:CB:31:0C:E7:41:21:5A:11:FA:D5
a=setup:actpass
a=mid:data
a=sctpmap:5000 webrtc-datachannel 1024

SDPs are confusing to me, and figuring out what stuff really means in WebRTC context is a lesson in reading RFCs with a microscope, while always wondering if this is the correct RFC for this concrete problem.

Suddenly it dawned on me that it seemed like both sides were waiting for the other side to initiate the DTLS handshake.

It turned out this was the problem:

a=setup:actpass

Since i responded with actpass in my answer SDP, the browser could not know if it wanted to initiate DTLS or not, and I guess it defaults to passive now. actpass is an illegal response value according to this RFC, and defaulting to passive is probably more correct then active. Setting a=setup:passive fixed the issue, since that tells the browser to be the initiating party.

Good times.

Draft Engine

Start a draft

For a while I have been working on a generic draft engine for card games. In trading card games (TCGs), drafting is a way to distribute cards in a semi random way, where players interact with how cards are distributed. In the TCG world this is distinct from sealed deck (semi randomly distributed cards, but no player interaction during dustribution) and constructed (you design your deck before playing from a set of allowed cards).

Supported draft styles

My draft engine supports two styles of draft:

Grid draft: A draft style for two people where you select rows or columns of cards from 9 face up cards.
Regular draft: In this draft style you pick a card from a pack and pass the pack to the next player. It works with 2-8 players, but 6-8 is recommended.

The draft engine in action. This example is an Magic the Gathering grid draft.

These forms of draft can be used for most kinds of TCGs. Since the engine is not tied to any specific kind of game, you can draft anything you can give a name and an image. You can draft your family photos if you want to.

Drafts with custom content

The engine works by using a very simple JSON structure to supply card names and card images, it looks like this:

[
  [
    {
      "name": "CardOnePackOne",
      "url": "http://crazymedia.com/cardonepackone.png",
      "id": 0
    },
    {
      "name": "CardTwoPackOne",
      "url": "http://crazymedia.com/cardtwopackone.png",
      "id": 0
    }
  ],
  [
    {
      "name": "CardOnePackTwo",
      "url": "http://crazymedia.com/cardonepacktwo.png",
      "id": 0
    },
    {
      "name": "CardTwoPackTwo",
      "url": "http://crazymedia.com/cardtwopacktwo.png",
      "id": 0
    }
  ]
]

The values are pretty self explanatory, but for clarity:

“name” – The name of the card, which you can export when finished drafting.
“url” – An URL pointing to an image of the card.
“id” – Not in use, so 0 is a fine value.

The engine comes with several predefined card list. Packs will then be drawn from those lists, but if you want to supply your own set (for example a cube or your own game) it is possible to start a draft where you send in any number of packs of cards using the JSON format shown above.

Have fun drafting.

Writing a WebRTC data channel implementation

After looking at webtorrent a while ago I really wanted to dig deep down into WebRTC to see how it actually worked. WebRTC is a web standard that allows two browsers to set up a peer to peer connection (usually browsers only talk through intermediary servers). It also brings unreliable and unordered networking to the browser, which is great for some kinds of applications.

Searching for WebRTC and Java on Github mostly gave me some very simple applications that use a Java server to set up a peer to peer application between two browsers. There also seemed to be some unmaintained JNI wrappers for the WebRTC libs. What I wanted was a pure Java WebRTC implementation so that I could talk directly to the browser more on my terms.

Since this was lacking I decided to try to write my own Java WebRTC implementation. To limit the scope, I decided that I would only try to talk to Firefox, and only support an unordered unreliable DataChannel. Which is WebRTC’s greatest selling point for me. What I ended up with can be found on my profile on GitHub. It is not pretty or solid, but it sort of works.

Protocols

WebRTC is made up of bunch of protocols: ICE (to establish communication), STUN (binding and NAT traversal), DTLS (encryption), SCPT (over UDP) (message flow control), TURN (fallback for difficult network situations, which I have chosen to ignore).

The initial part of starting a WebRTC connection uses the ICE protocol, the ICE protocol is typically carried out over an encrypted WebSocket connection, though you can use encrypted pigeons if you like.

To be honest I have not looked a lot at ICE, but the parts most relevant for WebRTC works a bit like this:

A goes: This is my password and certificate signature, this is referred to as an offer.
B goes: This is my password and certificate signature, this is referred to as an answer.
Then A and B send each other different options of connectivity as candidates. I am currently blissfully ignorant of how this is supposed to work, in my current implementation I just send the first candidate from the browser and replace IP, port and auth data in the response.

The message format used in the browsers WebRTC implementation for offers and answers is JSON containing an SDP. The SDP contains tons of information that is not used (afaik). The values I have touched are marked in green below, for more information about the fields, this is a great overview.

The green highlights the important parts of the offer and candidate — The important parts of the offer and candidates, this is SDPARTA!

Important SDP values:

fingerprint:sha-256 – DTLS certificate sha-256 fingerprint.
ice-pwd – Your chosen password for STUN.
ice-ufrag – Your chosen user for STUN.
192.168.1.100 58713 – Your offered address and port.

STUN

Once candidates have been exchanged, a STUN binding request is sent to establish connectivity, this request must be responded to with a binding response with a xor mapped address. The received message integrity hash is a HMAC-SHA1 which can be checked against the message hashed with the ice-pwd as key.

DTLS

After the STUN the next step is a DTLS handshake. Thank you Legion of the Bouncy Castle, without you I would have given up here.

Over the same UDP socket where DTLS is running, STUN must be multiplexed. STUN is here used to send “consent” requests every X seconds. If you do not respond to the STUN the browser reports ICE error and shuts down the connection. Bouncy Castle DTLS will silently drop these STUN “consent” packets, so this was a pain to figure out and debug.

SCTP

Once the DTLS is set up an SCTP connection is initiated. SCTP can be configured to work in all constellations of ordered/unordered and reliable/unreliable.

SCTP over UDP challenges

With SCTP over UDP several issues crop up. My main languages are all JVM based (Kotlin, Java). There is no SCTP implementation in Java, but there is an API for the systems SCTP. The API is very hard to integrate with an UDP socket though (I tried a bit, but it seemed very hard), so the simpler solution is probably to use a userland C lib like the one used by libjitsi.

Sadly i did not know libjitsi existed so I started hacking my own SCTP implementation in Java land. Way more fun…

Writing my own SCTP implementation

The SCTP spec is mostly ok, if a bit unclear in some parts. Once I found the SCTP parameter list it got better. It should also be mentioned that SCTP uses a CRC32C for checksums, this differs from the one used by STUN. This took me quite a while to figure out.

I currently have a very minimal “working” SCTP implementation. Meaning that it can talk to a browser, but does not shut down nicely, and does not do congestion control, and does not handle resends properly. These are very tricky areas to do well, and I have no clue, so I will probably change to use the lib used by libjitsi.

Getting the all important firefox logs

To not go in blind when debugging, logging from the connected browser is invaluable. This is my script to run firefox with WebRTC logging. The different modules can be turned on and off by not including them in the NSPR_LOG_MODULES. It took me a while to figure out the correct modules, so hopefully this helps someone else.

#!/bin/bash
dat=`echo ~`
path="$dat/webrtc_firefox.log"
export NSPR_LOG_FILE=$path
export NSPR_LOG_MODULES='signaling:5,mtransport:5,SCTP:5'
export R_LOG_LEVEL=9
export R_LOG_DESTINATION=stderr
open /Applications/Firefox.app/

Where is the code?

If this was interesting and you want to take a look at the source code. It is available on my GitHub profile.

Webtorrent; we get signal.

Lately I have been looking into WebRTC which is an open standard for real time communication between browsers. It allows communication styles in the browser which previously were not possible.

WebTorrent uses WebRTC for a protocol similar to bittorrent. Instead of having to download a program or install a browser plugin, you can download a webtorrent (which currently must be seeded as a webtorrent, by forexample instant.io) directly in the browser.

To test Webtorrent, I made a small javascript project. WT-widgets, is a collection of graphics for starting webtorrent downloads and showing download progress. Below is an example where you can download Artifact for Mac OS X using webtorrent.

No seeds

If there are no seeds, you can seed the file yourself by visiting the differently styled button below. This button uses a feature of WT-widgets that does a fallback to XMLHttpRequest after 5 seconds. When it is finished downloading the file, it will start seeding. Then the first button should work, since there is a seed.

The file will be downloaded in your browser, and you can then copy it to your filesystem by clicking the link that appears when the file finished downloading. This does not follow the usual download flow, so one of the aims of WT-widgets is to ensure that it clear to the user that a download is happening. I am not sure how well my widgets succeed in that regard, as my current widgets might not be the best at communicating that there is a download happening.

Suggestions or pull requests with fixes/additions are always welcome. I hope to expand WT-widgets with some widgets that show progress horizontally, as well as some widgets that more clearly show when it is in the different states of a download.

What Webtorrent sorely needs

While seeding in the browser works great, I do not want to have a browser fired up at all times to ensure there are seeds for my content. The best option I found for seeding webtorrents was webtorrent-hybrid, but when I tried it, I sadly could not get it to build.

Once Webtorrent has a solid solution for server side bootstrap seeding, I think it is will become a great alternative for distributing some kinds of content.

Take off every ‘ZIG’!!

Scape – a very ninja scripting language

I made a small scripting language that runs in the browser. It is very ninja. To see the ninja, first open Javascript console and write:

function recur() {recur()};recur();

Hopefully it blew the stack. Then type this into the Scape REPL:

def recur() recur(); recur();

When you are convinced it will infinitely loop without blowing the stack, hit ctrl-c to stop further processing.

Rincewinds rave, that is black magic! Also called tail call elimination. Scape code is not evaluated by snarfing functions from Javascript (JS functions do not have tail call elimination before ECMAScript 6), but instead is compiled to its own set of instructions, which are then run on a stack machine (running in the Javascript VM). During parsing Scape functions are checked for whether they can use tail call elimination. If they can, they get different instructions that reuse the existing stack frame.

More magic

Scape has forward mode automatic differentiation as a language feature. Automatic differentiation allows you to compute the derivative of a function, without having to define the derivative explicitly.

Without automatic differentiation, this would be the way to compute the partial derivative of the function $f(x,y) = x^{2}y^{2}$ for $x$ and $y$ :

def fun(x,y) * (* x x) (* y y);
def diff_fun(x,y) [* (* 2 x) (* y y),* (* 2 y) (* x x)];
diff_fun(4,5);
[200, 160]

With automatic differentiation in Scape, this is how it is done:

def fun(x,y) * (* x x) (* y y);
diff(fun(4,5));
[200, 160]

This is very useful for a number of numerical methods involving derivatives. The feature is currently experimental, it might interact with non-double types in funky ways.

Wai?

Mostly just for fun. I also started toying with the idea to make a safe scripting language for use in networked games. A language and runtime that would allow the player to define custom logic during gameplay without being able to ruin the experience for other players.

A dream would be a personalized Starcraft where it is you and your custom control scripts versus the other player and his scripts.

I hope to create a simple real time multiplayer game to show how I imagine it working. For now, playing with the Scape REPL is the only way to try the language.

Sayōnara

Prepare your Welding Blaster; play Artifact

I made a computer game together with Nils Eriksen Meling and Sven Meling; It is named Artifact; go try it! It plays similarly to Barrack, a favourite game from my youth. In addition to the action and strategy found in Barrack, Artifact also adds an aspect of resource management.

In Artifact, each of 40 levels consists of a screen filled with a variety of different orbs. These orbs must be isolated by you. Once 75% of the playing field has been isolated, the level ends. You are rewarded for playing fast, and for ending the level with a high % of the screen isolated.

To accomplish your task you control a dual-direction Welding Blaster that can be transformed to fire vertically or horizontally. When fired, a welding beam will slowly traverse the screen in 2 directions. Once both ends reach a wall, the playing field will be cut, isolating balls on each side of the beam. However, care must be taken not to hit the orbs, as they will create a feedback surge, damaging the Welding Blaster.

Artifact is free and will stay so, but developing and maintaining it is not free. If you enjoy Artifact, I hope you will consider a donation. However, most of all, I hope you enjoy playing it.
If you have problems with installation on the supported platforms, please send a mail to: .

Download Artifact

Useful if caught in overly simplistic space combat

One of my favorite games, and by far the one I used the most time creating content for during my youth is Escape Velocity Nova. The Escape Velocity series are 2D space RPGs, which means there is lots of 2D space combat. In these types of games you typically have three main types of weapons, non-turreted, turreted and homing. Non-turreted and homing weapons are not really that complicated. The non-turreted fire in one direction (typically forward), while the homing weapon typically turns as hard as it can towards the target at maximum speed. Turreted weapons were always a puzzle to me though, they somehow knew where you would be, and fired to intercept your ship. I always wondered how this was done.

Some years later during a visualization class (on how to find intersections between geometric primitives) I was wondering if I could use geometric equations to figure out where a spaceship needs to aim its torpedo, if it knows the position and velocity of the target. Below is the solution I came up with. It assumes constant speed for the target and the torpedo.

The positions of something moving at constant speed, can be described by this equation where $t$ denotes time, $\vec{v}$ the current velocity and $\vec{p_o}$ is the starting position. This restricts the position of the target to a line, where the position on the line is determined by $t$ .

(1) $\begin{equation*} \vec{p} = \vec{p_o} + \vec{v}t \end{equation*}$

For 2 dimensions this is:

(2) $\begin{equation*} \begin{pmatrix}p_x\\p_y\end{pmatrix} = \begin{pmatrix}p_{xo}\\p_{yo}\end{pmatrix} + \begin{pmatrix}v_x\\v_y\end{pmatrix}t \end{equation*}$

The possible locations $p$ of a constant speed ( $s$ ) bullet fired from point $c$ in any direction, are given by this equation, where again time is denoted by $t$ . This restrict the possible positions to a circle which expands with time.

(3) $\begin{equation*} (st)^{2} = (\vec{p} - \vec{c})^{2} \end{equation*}$

For 2 dimensions this is:

(4) $\begin{equation*} (st)^2 = (p_{x} - c_x)^2 + (p_{y} - c_y)^2 \end{equation*}$

Now inserting the restrictions on $p_x$ and $p_y$ from 2 into 4 gives a polynomial that only depends on $t$ :

(5) $\begin{equation*} (st)^2 = ((p_{xo}+v_xt) - c_x)^2 + ((p_{yo}+v_yt) - c_y)^2 \end{equation*}$

Rearranging gives:

(6) $\begin{equation*} s^2t^2 = (p_{xo}-c_x + v_xt)^2 + (p_{yo}-c_y + v_yt)^2 \end{equation*}$

Simplifying by setting constants $i = p_{xo}-c_x$ and $j = p_{yo} - c_y$ and moving towards standard form:

(7) $\begin{equation*} s^2t^2 = (i)^2 + 2iv_xt + v_x^2t^2 + (j)^2 + 2jv_yt + v_y^2t^2 \end{equation*}$

Rearranging to standard form:

(8) $\begin{equation*} 0 = (v_y^2 + v_x^2 - s^2)t^2 + 2(iv_x + jv_y)t + (i)^2 + (j)^2 \end{equation*}$

This is a quadratic equation, with coefficients:

(9) $\begin{equation*} a = v_y^2 + v_x^2 - s^2 \end{equation*}$

(10) $\begin{equation*} b=2(iv_x + jv_y) \end{equation*}$

(11) $\begin{equation*} c=i^2 + j^2 \end{equation*}$

Solving the quadratic equation gives two solution for $t$ , if an answer is positive and has no imaginary part, there is a solution to the problem.

Inserting a solution for $t$ in equation 1 gives the $x,y$ coordinates where the torpedo should be aimed, and where it will hit the target.

Below is an implementation of this, where the constant speed of the torpedo can be set. If there are two solutions, the smallest $t$ is chosen. If there is no solution (typically because the target it too fast for the torpedo to catch) the torpedo is never launched.

As you can see in some cases, this does not take into account the size of the target, only the targets center. This can be remedied by using a representative selection of boundary points of the target instead, and firing if one of them would connect.

The solution does also not take into account how fast the target can change direction, or when the torpedo runs out of fuel. This can be handled by not firing if the time from the torpedo is fired to the impact is too long.

This method can also quite easily be expanded for an accelerating weapon or target. This will result in a higher degree polynomial though, which are mostly only possible to solve using numerical methods like Newtons’s method for example.

The van der Corput sequence

Lately I have been looking for a solid way to create a robust implementation moving a spaceship from one point to another. This turned out to be a hard problem, and led me to a lot of new literature, which included Steven M. LaValle’s Planning Algorithms. Randomly reading from it I found a small gem called the van der Corput sequence.

The van der Corput sequence is useful for sampling in an interval. LaValle discusses it as an option to random sampling within the context of sampling for planning algorithms. I had thought about similar issues in the context of root finding, where I always wondered how to predictably generate a sequence that would distribute samples evenly over an interval, as well as do so in a manner that would not “favor” parts of the interval over others.

One naive way to sample over an interval, is to split the interval in X pieces and do the samples in order. If X is 8 and this method is used the interval would be explored as shown to the left in the figure below (clearly this means the the early parts of the interval will be explored first). Using the van der Corput sequence would result in exploration as seen on the right (which explores the interval much more evenly).

Generating the van der Corput sequence is surprisingly simple and elegant, just flip the bits of the naive sequence as seen in the binary column above. I find this very peculiar, since mirrored donkeys do not normally turn into cheetahs.

The great properties of this sequence is that whenever you end it, it will have explored the interval pretty evenly. A second nice property is that you can easily continue the sequence beyond an initial size, by creating the naive sequence at the position you want to start from, and keep reversing the results.

This method will definitely go into my toolbox of things to consider whenever I am thinking about using random sampling.

Tips for more robust GPU programs using C++ and OpenCL

When I started programming for the GPU, I struggled to keep my GPU programs robust. This was partly because I was very new to C++ and C. OpenCL is very low level and this in combination with my inexperience led to a lot of not so robust prototypes. While my code worked for prototypes, it was very cumbersome to code and not very reusable.

In this post I will share some of the solutions I found to some of the issues I had. These deal with allocating and releasing memory on the GPU, as well as converting relatively cleanly between STL vector and GPU memory representation.

Simplifying release and allocation of memory

Basic memory allocating on the GPU using OpenCl is done by using the code below. Where memory is a reference to the memory on the GPU.

cl_mem memory = clCreateBuffer(context, CL_MEM_READ_WRITE, 
                               numberOfBytes,NULL, NULL);

To release the memory allocated, the below function has to be called with the memory object as argument.

clReleaseMemObject(memory);

This leads to several ways to leak GPU memory. Especially if your code throws exceptions or does something else which “looses” the reference to the memory object and you forgot to release the memory explicitly.

A way to mitigate this in C++ is to create an object which on initialisation allocates memory, and which when its destructor is called release the memory. This ensures that when the object goes out of scope or is deleted the memory on the GPU is also released. This is basically RAII (Resource Acquisition Is Initialization)

A very basic (leaving out code for copy constructors etc) object for GPU memory would look like this.

class GPUMemory {
private:
    cl_mem memory;
public:
    GPUMemory(size_t e,const cl_context &ctx) {
        memory = clCreateBuffer(ctx, CL_MEM_READ_WRITE, e,
                                NULL, NULL);
    }
    ~GPUMemory() {
        //release if out of scoped or deleted
        clReleaseMemObject(memory);
    }

    /*
     * Copy constructor, copy assignment operator etc goes here 
     * implementation should be according to your need 
     */
}

// Possible usage
void doAwesomeCalculation (size_t size,cl_context &context) {
       GPUMemory map(size,context);
       // Do calculation, throw exceptions, etc
} // Out of scope so GPU memory is released

This means that you do not have to explicitly release memory for every weird way your application could behave, which avoids a lot of code repetition and leak opportunities. As long as the memory in allocated in the correct scope you mostly do not have to think more about it. It should be noted that you should follow rule of three and implement copy constructor and assignment operator as well. The implementation may wary depending on how you want your object to behave. I’ll leave that for another blog post.

From vector to GPU memory

Vector is a very useful data structure when you want something sent to the GPU and back, since everything is stored adjacently in memory, no fetching or disbursing on the CPU side is needed. To convert a vector to the kind of memory object I designed above, this is the code i use:

template  GPUMemory createAndFillMemory(const vector data,
                                                    cl_context context,
                                                    cl_command_queue queue) {
   cl_int err;
   if(data.size() > 0) {
        GPUMemory mem = GPUMemory(data.size(),context);
        err = clEnqueueWriteBuffer(queue, mem.getMemory(), CL_TRUE, 0, mem.getBytes(),
                                       (void*)&data[0], 0, NULL, NULL);

        if(err != CL_SUCCESS) {
            throw runtime_error("Loading buffer failed");
        }
        return mem;
    }
    throw runtime_error("no content in vector");
}

As you can see the GPUMemory object has been extended to use templates, since the use I want in this case is a generic container for some struct. This saves a lot of code since you would need to do this conversion manually for each different struct you are planning to put on the GPU if you do not have a similar method. It also ensures that you do not make offset or type errors when converting to and from the different representations.

The GPUMemory class has also been extended to contain information about the amount of objects added, as well as type size.

Conversion back from GPUMemory object to a vector is done using this function:

template  vector readBackMemory(GPUMemory &data,
                                           cl_context context,
                                           cl_command_queue queue) {
   vector vec(data.getElements(),T());
   cl_int err;  
   if(data.getElements() > 0) {
        err = clEnqueueReadBuffer(queue, data.getMemory(), CL_TRUE, 0, data.getBytes(),
                                  &vec[0], 0, NULL, NULL);
        if(err != CL_SUCCESS) {
            checkError(err);
            throw runtime_error("Content not read back from GPU");
        }
        return vec;
    }
    throw runtime_error("No content in vector");
}

General conversion from a vector containing any object to GPU memory is not something I would recommend. While this is possible using the code above, conversion using these methods should be restricted to primitives and structs of fixed size. You should also avoid pointers, as it makes little sense to send a pointer pointing to objects in CPU memory to the GPU.

There is still room for mistakes as you will see by the following example. When you use memory on the GPU, no metadata carry over from the C++ code. So on the GPU side you are responsible for using the memory you transferred correctly, the methods above only ensure that the copying of memory back and forth from the GPU results in correctly typed and sized arrays.

An example

A simple example can be to copy some particles to the GPU, apply some calculation and then copy the particles back. On the CPU side I then define a particle struct:

struct GPUParticle{
    cl_float3 pos;
    cl_int identifier;
    cl_float3 direction;
};

On the GPU (OpenCL kernel) side, a simple kernel to be applied could look like this:

struct GPUParticle{
    float3 pos;
    int identifier;
    float3 direction;
};

__kernel void copyParticles(
                       global struct GPUParticle * data,
                       global struct GPUParticle * dataOut
                       )
{
    int x = get_global_id(0);
    //do calculation here
    dataOut[x] = data[x];
}

It is important to keep the structs on both sides similar in structure and byte size, so that there are no offset issues (you should run tests for this). This means that data is not type safe when sent and retrieved from the GPU, but I still think that partial type safety is much better then none at all, since this leave much fewer avenues for such errors to occur.

Finally to apply the kernel to the data I create this method:

vector Stuff::testCopy(vector &data) {
    GPUMemory mem = utils.createAndFillMemory(data, context, cmd_queue[0]);

    GPUMemory out = GPUMemory(mem.getElements(),context);

    size_t work_size[1];
    work_size[0] = mem.getElements();

    err  = clSetKernelArg(copyParticles,  0,
                          sizeof(cl_mem), mem.getMemoryRef());
    err |= clSetKernelArg(copyParticles,  1,
                          sizeof(cl_mem), out.getMemoryRef());

    if(err != CL_SUCCESS) {
        utils.checkError(err);
        runtime_error("Kernel setup failed");
    }

    // Apply kernel to data
    err = clEnqueueNDRangeKernel(cmd_queue[0], copyParticles,
                                 1, NULL, work_size, NULL, 0,
                                 NULL, NULL);
    if(err != CL_SUCCESS) {
        utils.checkError(err);
        runtime_error("Calculation failed");
    }
    return utils.readBackMemory(out,context, cmd_queue[0]);
}

There are of course problems where these techniques are not suitable or possible, but I use them a lot and they have made my GPU code much more robust. If you have suggestions for improvement or your own small snippets, please add a comment.

Some of my inspiration for this:
Bjarne Stroustrup Going native 2012