My View David's view on anything and everything…


Multi-Threaded Part 2

I recently assembled a 16-core AMD Opteron based workstation for use as a software test system.

The relevant hardware includes:

- AMD Opteron 6376, 16-Core Server Processor, Socket G34, 2.3GHz base, 16MB L3 Cache.
- 8GB Memory in 2×4GB PC3-12800 DDR3-1600 CL9-9-9-24 (note: running dual channel).
- SuperMicro MBD-H8SGL-O ATX Server Motherboard, Socket G34, AMD SR5650 chipset.
- Microsoft Windows 8.1 Enterprise 64-bit.

The architecture of the Opteron 6376 is as follows:

- The 16-core processor is split into two banks of 8 cores.
- Each 8-core bank has its own 8MB L3 cache (16MB L3 in total).
- Each 8-core bank is then split into 4 sets of 2 cores.
- Each 2-core set shares a 2MB L2 cache and a 64KB instruction cache.
- Each core then has its own 16KB L1 data cache.

The Opteron 6376 has a variable clock frequency dependent on the number of cores in use:

Core count  -  16
Idle frequency  -  1.4 GHz
Base frequency  -  2.3 GHz
Turbo frequency (more than 8 cores)  -  2.6 GHz
Turbo frequency (8 cores or less)  -  3.2 GHz
System bus speed  -  3200 MHz
HyperTransport  -  6.40 GT/s


Multi-Threading Performance

Note that the main point of this test is scalability of the software when adding more cores, and not a comparison to other processor hardware.
The test was run multiple times and the results averaged.
Note that quad-channel memory would probably improve these scores. Generating the large noisemap is heavily memory dependent, and since memory is a resource shared by all threads, the faster it is, the better.

Heightmap dimensions of 4096×4096
Gradient Noise Preset: Eroded Rivers

1 thread = 38808 ms = 38.808 seconds
175 ms variance, 3.2 GHz

4 threads = 11515 ms = 11.515 seconds
36 ms variance, 3.2 GHz
3.370 times faster than 1 thread

8 threads = 6849 ms = 6.849 seconds
329 ms variance, 3.2 GHz
5.666 times faster than 1 thread

16 threads = 3702 ms = 3.702 seconds
135 ms variance, 2.6 GHz
10.483 times faster than 1 thread
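
Each "times faster" figure above is the single-thread time divided by the multi-thread time. As a quick check, the following minimal Python sketch recomputes them from the measured averages and adds parallel efficiency (speedup divided by thread count). The function and variable names are illustrative only and are not part of TerreSculptor.

```python
# Measured average times from the results above, in milliseconds.
timings_ms = {1: 38808, 4: 11515, 8: 6849, 16: 3702}

def speedup(threads, timings=timings_ms):
    """Single-thread time divided by the time at the given thread count."""
    return timings[1] / timings[threads]

def efficiency(threads, timings=timings_ms):
    """Speedup per thread; 1.0 would be perfect linear scaling."""
    return speedup(threads, timings) / threads

for n in sorted(timings_ms):
    print(f"{n:2d} threads: {speedup(n):6.3f}x speedup, "
          f"{efficiency(n):.1%} efficiency")
```

By this measure, efficiency falls from roughly 84% at 4 threads to about 66% at 16 threads, which fits a workload that increasingly contends for shared memory as threads are added.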

For comparison from the previous blog post:

Intel Core i7-2600K, 3.4GHz quad-core with Hyper-Threading
8 threads = 4890 ms = 4.89 seconds

Results graphed:



When going from 1 thread to 16 threads on this processor, TerreSculptor noisemap generation improves from an average of 38.8 seconds to 3.7 seconds, which is more than 10x faster.

This falls short of a true 16x improvement because of the lower clock frequency with all 16 cores running, shared memory access, processor cache access delays, and similar overheads. Even so, the more than 10x performance increase is a significant productivity boost and is easily noticeable when using the software.
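
To get a rough sense of how much of the gap the clock drop alone could explain, the ideal 16x can be scaled by the ratio of the 16-thread turbo frequency (2.6 GHz) to the single-thread turbo frequency (3.2 GHz). The sketch below does that arithmetic. It assumes performance scales linearly with clock frequency, which is a simplification for a memory-bound workload, and the names used are illustrative only.

```python
# Turbo frequencies reported with the results above, in GHz.
FREQ_1_THREAD = 3.2
FREQ_16_THREADS = 2.6

# Measured average times, in milliseconds.
TIME_1_THREAD = 38808
TIME_16_THREADS = 3702

def clock_limited_speedup(threads, freq_parallel, freq_single):
    """Best-case speedup if performance scaled linearly with clock (an assumption)."""
    return threads * freq_parallel / freq_single

ceiling = clock_limited_speedup(16, FREQ_16_THREADS, FREQ_1_THREAD)
measured = TIME_1_THREAD / TIME_16_THREADS

print(f"clock-limited ceiling: {ceiling:.2f}x")
print(f"measured speedup:      {measured:.2f}x")
print(f"fraction of ceiling:   {measured / ceiling:.1%}")
```

Under that assumption the ceiling is about 13x rather than 16x, and the measured result reaches roughly 81% of it. The remaining gap would then come from the shared memory and cache effects mentioned above, plus any work that does not run in parallel.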

Future builds of TerreSculptor will also include multi-threading improvements to the Modifiers and to any other sections of the application that can benefit from more threads.

Note that the next build of TerreSculptor is almost ready for public access.


