Tuesday, February 24, 2009

Darryl Gove's blog

Friday Jan 30, 2009

OSUM presentation: Multithreaded programming for CMT systems

My 8am PST presentation for OSUM seemed to go well. The slides from the presentation are available, and the presentation can be streamed from the Elluminate website.

There are a number of OSUM presentations available; the full list is on the site (registration required).

Posted at 12:06PM Jan 30, 2009 by Darryl Gove in Sun | Comments[0]
Wednesday Jan 28, 2009

Tying the bell on the cat

Diane Meirowitz has finally written the document that many of us have either thought about writing, or wished that someone had already written. This is the document that maps gcc compiler flags to Sun Studio compiler flags.

Posted at 01:07PM Jan 28, 2009 by Darryl Gove in Sun | Comments[0]

A look inside the Sun compiler team

At the end of last year I was asked to appear in a short video for AMD and give an elevator pitch for performance analysis. The video is now up on the site.

They also took videos of some of the Solaris folks, as well as a few others from the compiler team.

Yuan Lin has a short talk about parallelisation. My manager, Fu-Hwa Wang, talks about the work of our organisation.

Posted at 10:14AM Jan 28, 2009 by Darryl Gove in music | Comments[0]
Friday Jan 23, 2009

OSUM presentation on multi-threaded coding

I'll be giving a presentation titled "Multi-threaded coding for CMT processors" to OSUM members next Friday (8am PST). If you are an OSUM member you can read the details here. OSUM stands for Open Source University Meetup - the definition is:

"OSUM (pronounced "awesome") is a global community of students that are passionate about Free and Open Source Software (FOSS) and how it is Changing (Y)Our World. We call it a "Meetup" to encourage collaboration between student groups to create an even stronger open source community.".

Posted at 03:57PM Jan 23, 2009 by Darryl Gove in Sun | Comments[3]
Friday Jan 16, 2009

Out of memory in the Performance Analyzer

I've been working on an Analyzer experiment from a long running multithreaded application. Being MT I really needed to see the Timeline view to make sense of what was happening. However, when I switched to the Timeline I got a Java Out of Memory error (insufficient heap space).

Tracking this down, I used prstat to watch the Java application run and the memory footprint increase. I'd expected it to get to 4GB and die at that point, so I was rather surprised when the process was only consuming 1.1GB when the error occurred.

I looked at the command-line options for the Java process using pargs, and spotted the flag -Xmx1024m, which sets the maximum heap size to 1GB. OK, found the culprit. You can use the -J option to analyzer to pass flags to the JVM. The following invocation of the analyzer sets the limit to 4GB:

$ analyzer -J-Xmx4096m test.1.er

If you need more memory than that, you'll have to go to the 64-bit JVM, and allocate an appropriate amount of memory:

$ analyzer -J-d64 -J-Xmx8192m test.1.er

Posted at 03:17PM Jan 16, 2009 by Darryl Gove in Sun | Comments[2]
Thursday Jan 15, 2009

Computer history articles on the BBC

The BBC seems to be running a series on UK computer history. Code breaking at Bletchley Park. Packet switching. The Manchester Baby. Computer pioneers.

Posted at 12:55PM Jan 15, 2009 by Darryl Gove in Personal | Comments[0]
Tuesday Jan 13, 2009

Engelbart - Evolving Collective Intelligence

The December break is one of the few times when I'm able to find chunks of time to read. One of the books I was given was Doug Engelbart's "Evolving Collective Intelligence", which is really a (very) short (88 pages!) set of essays calling for action on using computers to improve our 'collective intelligence', meaning our ability to manage complexity. The book was a bit disappointing in that it didn't feel very focused, and I didn't come away with a clear 'message'. However, it did talk extensively about the 1968 demo.

The 1968 demo is described by the Stanford website that hosts the video as an event that:

"was the public debut of the computer mouse. But the mouse was only one of many innovations demonstrated that day, including hypertext, object addressing and dynamic file linking, as well as shared-screen collaboration involving two persons at different sites communicating over a network with audio and video interface."

It was also described by Steven Levy as "The mother of all demos".

Now I've just got to find the time to watch nearly two hours of video....

Posted at 01:17PM Jan 13, 2009 by Darryl Gove in music | Comments[1]

Bay Area Model

After nearly 10 years of living in the Bay Area, we visited the Bay Area Model in Sausalito. The model simulates the water flow in the San Francisco Bay and the surrounding areas. Construction started in 1956 as part of a feasibility study for building a dam. Based on the modelling work they decided not to build the dam, but the model continued to grow until 2000, when it was replaced by computer models.

I had imagined that it would be table-sized, or perhaps the size of a room. It's not. It's much larger, as you can see in the pictures. We ended up spending about an hour there; we could probably have stayed longer reading more of the details.

Posted at 01:11AM Jan 13, 2009 by Darryl Gove in Personal |
Thursday Jan 08, 2009

Illuminated light switches and fluorescent lights

Just back from the break. Spent some time doing odd jobs around the house. We use compact fluorescent light bulbs around the house, and I'd noticed one of them flickering when it was supposed to be switched off. The reason dawned on me when I was thinking about the illuminated switch that controls that particular light.

I started thinking about where the power to make the switch glow came from. After a moment of reflection it was clear that there had to be a small current flowing in order for the switch to light up. And if there's a small current flowing through the switch, then there has to be one flowing through the fluorescent bulb - presumably just enough to make it flicker.

I found a blog entry which confirmed this, plus showed some nice illustrations.

Unfortunately there are only two solutions. (1) Don't use illuminated switches or (2) Don't use fluorescent lights.

Posted at 10:06PM Jan 08, 2009 by Darryl Gove in Personal |
Tuesday Dec 23, 2008

Debugging inline templates with dbx

I've been working on inline templates to improve the performance of a couple of hot routines in a customer's code. I've written a couple of articles on this kind of work if you want to find out more: an introductory article which covers the rules, and an article specifically about using VIS instructions.

Anyway, one of the most important things to do is to write a test harness; it's very easy to make a mistake and have the template not work in some particular situation. For these routines, one of my colleagues had already written a test harness. I ended up extending it to try a different corner case, and at that point discovered that my code no longer validated. The problem turned out to be a branch that should have been 'branch >= 2' but which I'd coded as 'branch != 2'. The original test cases terminated with the value 2 at this point, but the new test I added ended up with the value 1, which should still have terminated; the inline template as written didn't handle it correctly.

So I fired up dbx to take a look at what was going on:

$ cc -g test.c test.il
$ dbx a.out
Reading a.out
Reading ld.so.1
Reading libc.so.1
(dbx) stop at 150
(dbx) run
stopped in main at line 150 in file "test.c"
150 res1=campare(&buff1[j],buff2,i);

The stop at command tells the debugger to stop at the problem line number (more details). However, the problem actually occurred when j was equal to 1, so I really needed to specify the breakpoint more precisely (more details).

(dbx) status
*(2) stop at "mcmp-test-all.c":150
(dbx) delete 2
(dbx) stop at 150 -if j==1
(3) stop at "mcmp-test-all.c":150 -if j == 1
(dbx) run
Running: a.out
(process id 14983)

That got me to the point where the problem occurred. My initial thought was to step through the execution of the inline template using the nexti command. However, this is pretty inefficient:

(dbx) nexti
stopped in main at 0x00011cfc
0x00011cfc: main+0x1394: sll %l0, 1, %l1
(dbx) nexti
stopped in main at 0x00011d00
0x00011d00: main+0x1398: add %l3, %l1, %l0
(dbx) nexti
stopped in main at 0x00011d04
0x00011d04: main+0x139c: ld [%fp - 1044], %l1

It could take quite a large number of instructions before I actually encountered the problem code, and each step takes three lines on screen. However, there's a tracei command which traces execution at the assembly code level (more details).

(dbx) tracei next
(dbx) cont
0x00011d08: main+0x13a0: mov %l0, %o0
0x00011d0c: main+0x13a4: mov %l2, %o1
0x00011d10: main+0x13a8: mov %l1, %o2
0x00011d14: main+0x13ac: nop

The output took me through the code, and knowing the code path I had expected, I could pretty easily see the branch that caused the code to diverge.

Posted at 02:26PM Dec 23, 2008 by Darryl Gove in Sun |
Tuesday Dec 16, 2008

OpenSPARC Internals available on Amazon

OpenSPARC Internals is now available from Amazon, as well as print-on-demand from Lulu, and as a free (after registration) download.

Posted at 03:19PM Dec 16, 2008 by Darryl Gove in Sun |
Thursday Nov 13, 2008

How to learn SPARC assembly language

I got a question this morning about how to learn SPARC assembly language. It's a topic that I cover briefly in my book; however, the coverage there was never meant to be complete. The text in my book is meant as a quick guide to reading SPARC (and x86) assembly, so that the later examples make some kind of sense. The basics are the instruction format:

[instruction] [source register 1], [source register 2], [destination register]

For example:

faddd %f0, %f2, %f4

Means:

%f4 = %f0 + %f2

The other thing to learn that's different about SPARC is the branch delay slot, where the instruction placed after the branch is actually executed as part of the branch. This is different from x86, where a branch instruction delimits the block of code.

With those basics out of the way, the next thing to do would be to take a look at the SPARC Architecture manual, which is a very detailed reference to all the software-visible implementation details.

Finally, I'd suggest just writing some simple codes, and profiling them using the Sun Studio Performance Analyzer. Use the disassembly view tab and the architecture manual to see how the instructions are used in practice.

Posted at 11:29AM Nov 13, 2008 by Darryl Gove in Sun | Comments[15]

November Sun Studio Express release

The November release of Sun Studio Express is out now. The features are listed on the wiki. The main news is that the OpenMP 3.0 implementation has been completed in this release.

Posted at 10:30AM Nov 13, 2008 by Darryl Gove in Sun |
Tuesday Nov 11, 2008

New SPEC CPU search programme announced

SPEC has announced the search programme for the follow-up to the CPU2006 benchmark suite. They are looking for compute-intensive codes (real production apps, not microkernels or artificial benchmarks) that cover a broad range of subject domains. They need both the code and the workloads. There are financial rewards for completing the various selection hurdles.

Posted at 11:53AM Nov 11, 2008 by Darryl Gove in Sun |
Monday Nov 10, 2008

Compiler flags for building python

One of my colleagues has posted compiler flags for building python. The quick summary is that you can get about a 25+% performance gain from using crossfile optimisation, -xO4, and profile feedback.

Posted at 06:26PM Nov 10, 2008 by Darryl Gove in Sun | Comments[3]

Poster for London workshop

Just got the poster for the OpenSPARC Workshop in London in December.

Posted at 05:48PM Nov 10, 2008 by Darryl Gove in Sun |
Wednesday Nov 05, 2008

OpenSPARC workshop in London

I'm thrilled to have been asked to present at the OpenSPARC workshop to be run in London on December 4th and 5th. I'll be covering the 'software topics'. There's no charge for attending the workshop.

Posted at 12:16PM Nov 05, 2008 by Darryl Gove in Sun |

OpenSPARC presentations

As part of the OpenSPARC book, we were asked to provide slideware and to present that slideware. The details of what's available are listed on the OpenSPARC site, and are available for free from the site on wikis.sun.com.

I contributed two sections. I produced the slides and did the voice over for the material on developing for CMT, the accompanying slides are also available. I also did a voice over for someone else's slides on Operating Systems for CMT (again slides available).

The recording sessions were ok, but a bit strange since it was just myself and the sound engineer working in a meeting room in Santa Clara. I get a lot of energy from live presentations, particularly the interactions with people, and I found the setup rather too quiet for my liking.

The Sun Studio presentation was relatively easy. It runs for nearly an hour, and there's a couple of places where I felt that additional slides would have helped the flow. The Operating Systems presentation was much harder as it was trying to weave a story around someone else's slide deck.

Posted at 11:08AM Nov 05, 2008 by Darryl Gove in Sun |

OpenSPARC Internals book

The OpenSPARC Internals book has been released. This is available as a free (after registration) pdf or as a print-on-demand book. The book contains a lot of very detailed information about the OpenSPARC processors, and my contribution was a chapter about Sun Studio, tools, and developing for CMT.

Posted at 09:38AM Nov 05, 2008 by Darryl Gove in Sun | Comments[2]
Tuesday Nov 04, 2008

Job available in this performance analysis team

We're advertising a job opening in this group. We're looking for someone who's keen on doing performance analysis on x86 and SPARC platforms. The req number is 561456, and you can read the details on sun.com. If you have any questions, please do feel free to contact me.

Posted at 03:05PM Nov 04, 2008 by Darryl Gove in Sun |
Friday Oct 31, 2008

The limits of parallelism

We all know Amdahl's law. The way I tend to think of it is that if you reduce the time spent in the hot region of code, the most benefit you can get is the total time that was initially spent there. However, the original setting for the 'law' was parallelisation - the runtime improvement depends on the proportion of the code that can be made to run in parallel.

Aside: when I'm looking at profiles of applications and I see a hot region of code, I typically consider what the improvement in runtime would be if I entirely eliminated the time spent there, or if I halved it. I then use this as a guide to whether it's worth the effort of changing the code.

The issue with Amdahl is that it's completely unrealistic to consider parallelisation without also considering the synchronisation overhead introduced when you use multiple threads. So let's do that and see what happens. Assume that:

P = parallel runtime
S = serial runtime
N = number of threads
Z = synchronisation cost

Amdahl would give you:

P = S / N

The flaw is that I can keep adding processors and the parallel runtime keeps getting smaller and smaller - so why would I ever stop? A more accurate equation would be something like:

P = S / N + Z*log(N)

This is probably a fair approximation to the cost of synchronisation - some kind of binary tree that synchronises all the threads. So we can differentiate:

dP/dN = -S / N^2 + Z / N

And then solve:

0 = -S / N^2 + Z / N
S / N = Z
N = S / Z

OK, it's a bit disappointing: you start off with a nasty-looking equation and end up with a simple ratio. But let's take a look at what the ratio actually means. Suppose I reduce the synchronisation cost (Z). If I keep the work constant, then I can scale to a greater number of threads on the system with the lower synchronisation cost. Or, if I keep the number of threads constant, I can make a smaller chunk of work run in parallel.

Let's take a practical example. If I do synchronisation between threads on traditional SMP system, then communication between cores occurs at memory latency. Let's say that's ~200ns. Now compare that with a CMT system, where the synchronisation between threads can occur through the second level cache, with a latency of ~20ns. That's a 10x reduction in latency, which means that I can either use 10x the threads on the same chunk of work, or I can run in parallel a chunk of work that is 10x smaller.

The logical conclusion is that CMT is a perfect enabler of microparallelism. You have both a system with huge numbers of threads, and synchronisation costs between threads that are potentially very low.

Now, that's exciting!

Posted at 08:15AM Oct 31, 2008 by Darryl Gove in Sun |
Thursday Oct 30, 2008

The multi-core is complex meme (but it's not)

Hidden amongst the interesting stories (gaming museum opening, Tennant stepping down) on the BBC was this little gem from Andrew Herbert, the head of Microsoft Research in the UK.

The article describes how multi-core computing is hard. Here's a snippet of it:

"For exciting, also read 'complicated'; this presents huge programming challenges as we have to address the immensely complex interplay between multiple processors (think of a juggler riding a unicycle on a high wire, and you're starting to get the idea)."

Now, I just happened to see this particular article, but there's plenty of other places where the same meme appears. And yes, writing a multi-threaded application can be very complex, but probably only if you do it badly :) I mean just how complex is:

#pragma omp parallel for

OK, so it's not fair to compare using OpenMP to parallelise a loop with writing some convoluted juggler-riding-a-unicycle application, but let's take a look at the example he uses:

"Handwriting recognition systems, for example, work by either identifying pen movement, or by recognising the written image itself. Currently, each approach works, but each has certain drawbacks that hinder adoption as a serious interface.

Now, with a different processor focusing on each recognition approach, learning our handwriting style and combining results, multi-core PCs will dramatically increase the accuracy of handwriting recognition."

This sounds like a good example (better than the old example of running a virus scanner on one core whilst you worked on the other), but to me it implies two independent tasks, or two independent threads. Yes, having a multicore chip means that the two tasks can execute in parallel, but given a sufficiently fast single-core processor we could use the same approach.

So yes, to get the best from multi-core you need multi-threaded programming (or multi-process, or virtualisation, or consolidation, but that's not the current discussion). But multi-threaded programming, whilst it can be tricky, is pretty well understood and, more importantly, for quite a large range of codes easy to do using OpenMP.

So I'm going to put it the other way around. It's easy to find parallelism in today's environment - from the desktop, through gaming, to numeric codes. There are abundant examples of many threads being simultaneously active. Where it gets really exciting (for exciting read "fun") is when you start looking at using CMT processors to parallelise things that previously were not practical to run with multiple threads.

Posted at 10:45PM Oct 30, 2008 by Darryl Gove in music |
Tuesday Oct 28, 2008

Second life - Utilising CMT slides and transcript

Just finished presenting in Second Life. This time the experience was not so good: my audio cut out unexpectedly during the presentation, so I ended up having to use chat to present the material. I was very glad that I'd gone to the effort of writing a script before the presentation; however, reading off the screen is not as effective as presenting the material.

Anyway, I found the 'statistics' panel in the environment, which indicated that I was down to <1 FPS, with massive lag. Interestingly, after the presentation, once everyone had left the presentation area, the FPS went up to 10-12. The SL program was still maxing out the CPU (as you might expect - I guess there's no reason to back off until the program hits the frame rate of the screen), but it was much more responsive - things actually happened when I clicked the on-screen controls.

So, I'm sorry for anyone who found the experience frustrating. I did too. And thank you to those people who turned up, and persevered, and particularly to the bunch of people who participated in the discussion at the end.

Anyway, for those who are interested, the slides and transcript for the presentation are available.

Posted at 10:59AM Oct 28, 2008 by Darryl Gove in Sun | Comments[3]

More Sun Studio resources from AMD

Bao from AMD pointed me at these two additional resources. A cheat sheet for Sun Studio. I disagree with the suggestion on it to use -xO2, I would suggest using -O instead. There's also a Solaris Developer Zone.

Posted at 10:42AM Oct 28, 2008 by Darryl Gove in Sun | Comments[1]
Monday Oct 27, 2008

Tutorial on writing a hardware performance counter profiler

This post looks like a very useful tutorial on writing a hardware performance counter profiler. Probably a good read for anyone who fancies writing their own Performance Analyzer.

Posted at 10:14PM Oct 27, 2008 by Darryl Gove in Sun |

Book apparently available in India

It looks like my book is available in India from Dorling Kindersley. I've only just found this out. I presume the book is not localised. The link is the only book store I could find selling it.

Posted at 02:46PM Oct 27, 2008 by Darryl Gove in Sun |

x86 compiler flags

This AMD document summarises the optimisation flags available for many x86 compilers (Sun Studio, PGI, Intel etc.). It's about a year old, but it looks OK for Sun Studio. However, it talks about -xcrossfile, which is ancient history - use -xipo instead!

Posted at 09:18AM Oct 27, 2008 by Darryl Gove in Sun | Comments[2]

Pecha Kucha - How to write the presentation

For those that haven't heard of it, Pecha Kucha is a presentation format where you present 20 slides on a topic, with the twist that each slide is displayed for only 20 seconds. I'd first read about it on Presentation Zen, and last week I got the opportunity to experience it first hand.

It actually took me quite some time to put my slides together. This was for a couple of reasons. First of all, I wanted to use the opportunity to put together a more graphical set of slides than I normally do. I normally have to present a lot of textual information, and I really wanted to practice getting away from the bullet points. The second problem was that structuring the presentation was quite different.

When I present, I have a very clear idea of the points I want to make, and the material necessary to support each point. I also have a structure, which builds up the material in a suitable way. Now this works very well when I have no constraints on the slides or the material that I present on each slide. The problems are that some slides might have only a single point to make - so I don't need to talk for long; and most talks don't have exactly twenty points that I want to make.

So I needed an algorithm for coming up with the talk and slides. The way I ended up doing it was to list the twenty slides and write a script, allocating a couple of sentences to each slide. I timed how much text I could generally say in 20 seconds in order to estimate how much text to assign to each slide. This was a bit of a shock, as twenty seconds can feel like a surprisingly long time (my initial estimate was one sentence per slide, when I can actually deliver two).

Now, the presentation is bound not to fit onto twenty slides, so after the first pass, you probably need to either add a couple more ideas into the presentation, or remove some of the material.

Once I'd got the text down, I gathered up some material for the visuals. Although this was fun, it was actually easier than the step of drafting the flow.

On the evening, I noticed that there were several alternative approaches. The first was to put twenty slides of material up, and then talk without trying to connect to the material on every slide. Another approach was to produce what was basically a twenty-slide (bullet point) presentation, and then talk through it.

My presentation went well enough; there was one point in the middle where I actually caught up with the slide show and paused for probably 10 seconds. The rest of the time I managed to stay pretty much in step with the visuals as they changed.

The one caveat is that I did practice the talk a couple of times beforehand, which turned out to be very helpful. Since the slides change automatically, you don't have the option of staying longer on one slide to make a more complex point, or of quickly skipping to the next slide if there's not much to say.

Posted at 07:17AM Oct 27, 2008 by Darryl Gove in Sun |
Sunday Oct 26, 2008

Second life presentation

I'll be presenting in Second Life on Tuesday 28th at 9am PST. The title of the talk is "Utilising CMT systems".

Posted at 10:53PM Oct 26, 2008 by Darryl Gove in Sun |
Tuesday Oct 07, 2008

Voting in California

We recently became citizens, so this was the first election in which we'd be eligible to vote. Probably somewhat enthusiastically I sat down with the thick set of documents and tried to figure out what to vote for.

What was surprising to me, voting for the first time, was that I didn't just get to pick the President (well, I guess they'll let some other people have a say too ;), together with various Senators etc. There are also a bunch of propositions which I could vote for or against. So it's not like I just get to put a tick in one box; I had to read a fair-sized telephone directory of arguments, then try to make sense of which argument was most convincing.

Most of the arguments are, unfortunately, just that. Here are some quotes, both for and against, on one of the propositions: "Don't believe the scare tactics.", "[Proposition]... over time saves California $2.5 billion.", "[Proposition]... will massively increase costs to taxpayers.".

Anyway, I don't blog to discuss politics. But I figured it might be useful to provide a table showing the various propositions and the positions adopted by some of the political parties.

The Republican and Democratic parties just give out a list of the propositions and whether they are for or against them. I've included a link to the Green party which includes some analysis behind their decisions and a link to Pete Stahl who came up first when I searched for other discussions of the propositions.
Proposition | Democratic | Republican | Green | Pete Stahl
1A Safe, Reliable High-Speed Passenger Train Bond Act for the 21st Century. | Yes | No | No position | tbd
2 Treatment of Farm Animals. Statute. | Yes | No | Yes | Yes
3 Children's Hospital Bond Act. Grant Program. Statute. | Yes | No | No | No
4 Waiting Period and Parental Notification Before Termination of Minor's Pregnancy. Constitutional Amendment. | No | Yes | No | No
5 Nonviolent Offenders. Sentencing, Parole and Rehabilitation. Statute. | Yes | No | Yes | Yes
6 Criminal Penalties and Laws. Public Safety Funding. Statute. | No | Yes | No | No
7 Renewable Energy. Statute. | No | No | No | No
8 Limit on Marriage. Constitutional Amendment. | No | Yes | No | No
9 Criminal Justice System. Victims' Rights. Parole. Constitutional Amendment and Statute. | No | Yes | No | No
10 Bonds. Alternative Fuel Vehicles and Renewable Energy. Statute. | Neutral | No | No | No
11 Redistricting. Constitutional Amendment and Statute. | No | No position | No | Yes
12 Veterans' Bond Act of 2008. | Yes | Yes | Yes, with reservations | Yes

There are some other political parties which I've not included. The Peace and Freedom party is broadly in line with the Green party. The Libertarian party basically recommends voting No on those propositions it actually cares about. Neither of these gives any information about their motivation.

Posted at 08:12AM Oct 07, 2008 by Darryl Gove in Personal |