BBTouch, code, iPhone, multitouch

Some thoughts on BBTouch optimization

23 Jul

So, the old programming lesson, never optimize before it's time, is a good lesson. However, I think that BBTouch has reached a milestone and is in need of some optimization.

To this end I have been applying the various instruments that Xcode provides to identify the bottlenecks in the code, and then the good old standby of just printing out the time before and after various events to see if I can reduce the amount of time spent in the tight loops.
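For anyone who wants to try the print-the-time trick, here is a minimal sketch in plain C. It is not the BBTouch code; the helper names `now_us` and `elapsed_us` are made up for illustration, and it assumes a POSIX system where `gettimeofday` is available.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Current wall-clock time in microseconds (hypothetical helper). */
static uint64_t now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (uint64_t)tv.tv_sec * 1000000u + (uint64_t)tv.tv_usec;
}

/* Microseconds elapsed since a value previously returned by now_us(). */
static uint64_t elapsed_us(uint64_t start_us)
{
    return now_us() - start_us;
}
```

Grab `now_us()` just before the code you care about, call `elapsed_us()` on that value just after, and print the result. Averaging over a few hundred frames smooths out the noise.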

For BBTouch, at this point, there are two big bottlenecks:

First, when the data comes out of the sequence grabber, there is lots of buffer moving and color-space converting going on before you get the 'final' NSImage that is used for the blob detection. This (according to Shark) takes up about 12% of the processing time of the app during blob detection.

Second, the blob detecting itself. There is an earlier post on that subject so I won't go into it much here, but suffice it to say that I have optimized the blob detector (at least as it applies to the data structures I am currently using) to probably within 85-90% of what I can squeeze out of it. (To get the last 10-15% I would have to trade off too much code maintainability, and I don't think the small performance increase is worth that.) In any case, Shark tells me that BBTouch is spending about 14% of its time just blob detecting, which is fairly intensive, but not too bad considering that it has to look at every single pixel of every image. So we will be happy with that for now.

(There is also quite a bit of processor time being wasted just drawing the raw video into the config view, but that goes away when you close the config window, so I am going to ignore that for now.)

So! That leaves us with the fairly large loss of 12% of our time just getting the bits out of the sequence grabber. In a perfect world, I would just pass the pointer to the start of the SG pixel buffer on up to the blob detector, and there would be NO time lost in the shuffling of data. Unfortunately we need those bits to be in a certain order for the blob detector to work at peak efficiency, and that order is a planar, packed 8-bit greyscale image (i.e. a single stream of bytes, each one representing a pixel, and if the image is 640 pixels wide then the 641st byte is the first pixel of the second row).
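To make the packed layout concrete, here is a small C sketch of the addressing it allows. The function name `grey_at` is hypothetical and not part of BBTouch; the only assumption is one byte per pixel with no padding at the end of a row.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of pixel (x, y) in a packed 8-bit greyscale buffer:
   rows sit back to back, so row y starts at byte y * width. */
static uint8_t grey_at(const uint8_t *pixels, size_t width, size_t x, size_t y)
{
    return pixels[y * width + x];
}
```

With a width of 640, pixel (0, 1) lands at index 640, which is the 641st byte of the stream.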

Why is this? In order for this particular algorithm to work properly, the area we are blob detecting in (known to BBTouch as the ROI, or region of interest) needs to be bounded by at LEAST a single pixel-width line of 'bad pixels' (in the generic case, black pixels, or 0x00 valued bytes); otherwise the algorithm will go out of bounds and not work. So that is important.

The other reason is simplicity in addressing the memory block. If we have an unpacked non-planar image format, then we have to do a bunch of extra math to figure out where each pixel is in the byte-soup, and over 220,000 pixels, a few extra adds and multiplies add up to lots of extra lag.
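One way to guarantee that ring of 'bad pixels' is to zero the outer edge of the buffer before detecting. The sketch below is only an illustration: `clear_border` is a made-up name, and it assumes the ROI is the whole packed 8-bit buffer, which may not match how BBTouch actually defines its ROI.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Set the outermost one-pixel ring of a packed 8-bit image to 0x00,
   so a neighbor-following algorithm always hits a black pixel
   before it can step outside the buffer. */
static void clear_border(uint8_t *pixels, size_t width, size_t height)
{
    if (width == 0 || height == 0) {
        return;
    }
    memset(pixels, 0, width);                        /* top row */
    memset(pixels + (height - 1) * width, 0, width); /* bottom row */
    for (size_t y = 0; y < height; y++) {
        pixels[y * width] = 0;                       /* left column */
        pixels[y * width + width - 1] = 0;           /* right column */
    }
}
```

The cost is roughly one pass over the perimeter, which is tiny next to a pass over every pixel.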

So! To that end: I am playing around with CIImages. CIImages are really great for lots of reasons, and bad for a few reasons, but the good outweighs the bad in this case. First, the CIImage is never rendered until the very last second when it is 'needed'. This is great. Currently the SG buffer gets rendered into a CGImageRef, then re-drawn as an 8-bit greyscale CGImageRef, then those bytes are stuffed into an NSBitmapImageRep which is then attached to an NSImage object. I don't actually know how many times the entire pixel buffer gets copied in that case, but it takes about 4000 microseconds (us). And all this happens before I even start thinking about blob detecting (which also takes about 4000us for a 640x480 image). (Don't forget, at 30 fps you only have about 33000us to mangle data and stuff before the next frame comes barging in, and if you take all that time just detecting blobs, then the other apps that are, say, rendering exciting OpenGL worlds based on your MT input will have no processing power to do that. So time is of the essence.)

OK, so here are some positive results: I replaced all the crap in the SG with a single 'createCGIMage' call (which jams the pixels from the SG into a CGImageRef format, but doesn't actually do anything, i.e. it doesn't render it right away like an NSImage does), and then I wrap that in a nice CIImage (also not rendered, so the data hasn't actually gone anywhere, just the pointer changing hands).

Of course, I need the image to be in 8-bit planar greyscale. CIImage doesn't do that, but it comes close. I can use the CIMaximumComponent filter to make an ARGB image that is kinda planar.

(i.e. the data for a regular ARGB pixel might look like A:255 R:128 G:17 B:92; after the filter it looks like A:255 R:128 G:128 B:128. This is the maximum, also known as the 'value' or the 'brightness', which is all I care about for blobs. This 'faux' planar is good because I really just need to look at a single byte (say, the red component) to get the value that I need to check against, so it is almost a planar stream.)
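For illustration, here is what that per-pixel maximum works out to if you did it by hand on the CPU. This is not how Core Image runs the filter and it is not BBTouch code; the function names are made up, and it assumes the bytes sit in memory in A, R, G, B order.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest of the three color components. */
static uint8_t max_component(uint8_t r, uint8_t g, uint8_t b)
{
    uint8_t m = r > g ? r : g;
    return m > b ? m : b;
}

/* Replace R, G and B of every pixel with their maximum, in place.
   Alpha is left untouched. Assumes 4 bytes per pixel in A,R,G,B order. */
static void argb_to_max_grey(uint8_t *argb, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; i++) {
        uint8_t *p = argb + i * 4; /* p[0]=A, p[1]=R, p[2]=G, p[3]=B */
        uint8_t m = max_component(p[1], p[2], p[3]);
        p[1] = m;
        p[2] = m;
        p[3] = m;
    }
}
```

Fed the example pixel above (A:255 R:128 G:17 B:92), this produces A:255 R:128 G:128 B:128.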

What does all this cost? I run a big ole CIFilter on the image buffer, and then stick it into a CIImage?
Well, that bit goes down from 4000us to around 45us. No rendering of the buffer. God, I love Apple.

"But wait!" you say. "You have to render it to a pixel buffer at some time, otherwise you can't blob detect." This is true, and rendering that out to a nice bitmap takes about 100us. So now we are up to about 145us instead of 4000us. Pretty good. All thanks to Apple's Core Image framework. Neato!

Now, the downside (you knew it was coming). The downside is that the image is no longer in a nice 8-bit format; it is still in ARGB. I still need to alter the blob detecting code to handle the different bit format (I haven't done this just yet). I am guessing that the extra stuff to deal with this will add 1000us to my blob detection processing. This should (theoretically) still yield about a 2500us gain in time (from about 8000us for the SG render + blob to about 5500us for render + blob), which is still a hefty decrease of roughly 31% in processor time. This will hopefully leave a nice chunk of about 28000us per frame for any other apps on the machine.
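As a rough idea of what the change to the blob detector's pixel lookup might look like, here is a C sketch that reads the brightness straight out of the faux-planar ARGB buffer. Nothing here is the real BBTouch code: the names are hypothetical, and it assumes A, R, G, B byte order with no padding at the end of each row.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    ARGB_BYTES_PER_PIXEL = 4, /* A, R, G, B */
    ARGB_RED_OFFSET = 1       /* after the max filter, R == G == B */
};

/* Brightness of pixel (x, y): the same row-major index as the 8-bit
   case, scaled by 4 bytes per pixel, plus the offset of the red byte. */
static uint8_t argb_brightness_at(const uint8_t *argb, size_t width,
                                  size_t x, size_t y)
{
    return argb[(y * width + x) * ARGB_BYTES_PER_PIXEL + ARGB_RED_OFFSET];
}
```

Compared with the packed 8-bit lookup this is one extra multiply and one extra add per pixel, plus four times the memory traffic, which is the kind of overhead the 1000us guess is meant to cover.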

(Not to mention if one wanted to port this to a slower architecture, like the iPhone, say, if you wanted to detect blobs with the built-in camera and then send the TUIO information via WiFi to your other machine that is in control of a projector or something. Wouldn't that be neat? A $200 fully contained TUIO-generating camera platform... hmmm.)

Anyhow, why am I writing all this? Mostly because I needed a break from the bits and bytes. Maybe some other Cocoa nerd will find it useful, who knows?

I will keep you all updated on the progress in any case.

Well, we could get away with a non-packed data block (i.e. the ends of each row have junk padding data to make it fit a specific buffer size), but then we just have to add a few more cycles to each loop iteration just to figure out the address for each pixel.
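Here is a sketch of what walking a padded (non-packed) buffer looks like in C. The function and the brightness threshold are made up for illustration; the point is just the per-row bookkeeping needed to step over the junk bytes, which a packed buffer avoids by letting you run one pointer straight through.

```c
#include <stddef.h>
#include <stdint.h>

/* Count pixels at or above `threshold` in an 8-bit image whose rows are
   padded out to `bytes_per_row` bytes (bytes_per_row >= width). */
static size_t count_bright(const uint8_t *pixels, size_t width, size_t height,
                           size_t bytes_per_row, uint8_t threshold)
{
    size_t count = 0;
    for (size_t y = 0; y < height; y++) {
        /* Recompute the row start so the padding is never read. */
        const uint8_t *row = pixels + y * bytes_per_row;
        for (size_t x = 0; x < width; x++) {
            if (row[x] >= threshold) {
                count++;
            }
        }
    }
    return count;
}
```

When `bytes_per_row` equals `width` the padding disappears and this collapses back to the packed case.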


About

My full name is Ben Britten Smith.

I go by Ben Britten because Ben Smith is a bit too common and using my full name is a mouthful.

I live in Melbourne, Australia and service clients all over the globe.

