In January 2010, we covered Nvidia's first parallel processing products, the C2050 and C2070. These PCI Express Gen2 cards respectively occupied one or two slots in a workstation, and included either 3GB or 6GB of onboard GDDR5 memory.
Nvidia also sold the M2070 and M2050, which used the same technology but came with passive heatsinks rather than active cooling. Designed to be offered only in pre-approved system solutions, they're still on sale, but the M2090 (below) now bolsters this product line at the high end.

Nvidia's Tesla M2090
The x2070 and x2050 products were originally claimed to offer 512 "Cuda" processor cores, delivering from 520 to 640 gigaflops of double-precision floating-point performance. Nvidia now lists them as having 448 Cuda cores each, with 515-gigaflop performance.
Timothy Prickett Morgan writes in an article for The Register that Nvidia apparently had some yield and heat issues with its initial Fermi chips, leading to the shortfall in specs. But, he adds, the new M2090 features a new tape-out of the Fermi design using Taiwan Semiconductor Manufacturing Corp's 40nm process.
As a result, the M2090 now has all 512 Cuda cores activated. Core clock speed is up by 13 percent to 1.3GHz, and GDDR5 memory speed is up by 18.6 percent to 1.85GHz, Morgan says.
For its part, Nvidia says the M2090 delivers 665 gigaflops of peak double-precision performance, "enabling application acceleration by up to ten times compared to using a CPU alone." In the latest version of Amber 11 -- one of the most widely used applications for simulating behaviors of biomolecules -- four Tesla M2090 GPUs coupled with four CPUs delivered record performance of 69 nanoseconds of simulation per day, the company adds. (In contrast, the fastest Amber performance recorded on a CPU-only supercomputer is said to have been 46 ns/day.)
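As a rough cross-check, the peak figures quoted in this story can be approximated from core count and clock speed. The Python sketch below assumes that each Cuda core completes one fused multiply-add (counted as two floating-point operations) per clock, that double precision runs at half that rate, and that the 13 percent clock increase is measured against the older 448-core boards. None of these assumptions is spelled out by Nvidia here, and the function names are hypothetical.

```python
def peak_dp_gflops(cuda_cores, clock_ghz, flops_per_fma=2, dp_rate=0.5):
    """Estimate peak double-precision gigaflops.

    Assumptions (not from the article): one fused multiply-add per core
    per clock, counted as two operations, with double precision running
    at half the single-precision rate.
    """
    return cuda_cores * clock_ghz * flops_per_fma * dp_rate


def clock_before_increase(new_clock_ghz, percent_increase):
    """Back out the earlier clock speed from a quoted percentage increase."""
    return new_clock_ghz / (1 + percent_increase / 100)


# M2090: 512 cores at 1.3GHz, as reported above.
m2090_gflops = peak_dp_gflops(512, 1.3)

# Older boards: 448 cores, clock inferred from the 13 percent increase.
older_clock_ghz = clock_before_increase(1.3, 13)
older_gflops = peak_dp_gflops(448, older_clock_ghz)

print(f"M2090 estimate:    {m2090_gflops:.1f} gigaflops")
print(f"Older clock:       {older_clock_ghz:.2f}GHz (inferred)")
print(f"448-core estimate: {older_gflops:.1f} gigaflops")
```

Under those assumptions the M2090 works out to about 665.6 gigaflops, in line with Nvidia's 665-gigaflop claim, and the 448-core estimate lands close to the 515 gigaflops now listed for the x2050 and x2070.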
Ross Walker, assistant research professor at the San Diego Supercomputer Center and principal contributor to the Amber code, is quoted as saying, "This is the fastest result ever reported. With Tesla M2090 GPUs, Amber users in university departments can obtain application performance that outstrips what is possible even with extensive supercomputer access."
In addition to Amber, the Tesla M2090 GPU is ideally suited to a wide range of GPU-accelerated HPC applications, according to Nvidia. These are said to include molecular dynamics applications (NAMD and GROMACS); computer-aided engineering applications (ANSYS Mechanical, Altair Acusolve and Simulia Abaqus); earth science applications (WRF, HOMME and ASUCA); oil and gas applications (Paradigm Voxelgeo and Schlumberger Petrel); and other key applications such as MATLAB, GADGET2 and GPU-BLAST.

Nvidia says the Tesla M2090 will be offered in servers such as the HP ProLiant SL390s G7 (above). This device incorporates up to eight of the boards in a half-width 4U chassis and, "with a configuration of eight GPUs to two CPUs, offers the highest GPU-to-CPU density on the market," according to the company.
Glenn Keels, director of marketing for HP's Hyperscale business unit, stated, "Clients running intensive data center applications require systems that can process massive amounts of complex data quickly and efficiently. The decade-long collaboration between HP and Nvidia has created one of the industry's fastest CPU to GPU configurations available, delivering clients the needed processing power and speed to handle the most complex scientific computations."
Background on Nvidia's Fermi
Fermi, announced in 2009 at Nvidia's inaugural GPU Technology Conference in San Jose, amounts to a third generation of products embodying the company's "GPU computing" model. The first generation was the G80 unified graphics/computing architecture, introduced in November 2006 and later embodied in the GeForce 8800, Quadro FX 5600, and Tesla C870 GPU products. The G80 was the first GPU to replace separate vertex and pixel pipelines with a single unified processor, the first to utilize a scalar thread processor, and the first to support C, according to the company.
The second generation was the GT200, introduced in the GeForce GTX 280, Quadro FX 5800, and Tesla T10 GPUs. GT200 increased the number of streaming processor cores -- subsequently referred to as "Cuda" cores -- from 128 to 240. It also added "hardware memory access coalescing," improving memory access efficiency, along with double-precision floating point support, Nvidia says.

Fermi, implemented in a GPU containing more than three billion transistors, more than doubles the number of Cuda cores, organizing them into 16 SMs (streaming multiprocessors) with 32 cores apiece. Sporting up to 6GB of GDDR5 RAM, Fermi is the first product of its type to support ECC (error correcting code), the company says.
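The core counts cited for the three generations can be tallied with simple arithmetic. The short Python sketch below uses only figures from this story; the variable names are illustrative.

```python
# Fermi organization as described above: 16 SMs with 32 cores apiece.
STREAMING_MULTIPROCESSORS = 16
CORES_PER_SM = 32
fermi_cores = STREAMING_MULTIPROCESSORS * CORES_PER_SM

# Core counts per generation, as reported in this story.
cores_by_generation = {
    "G80": 128,
    "GT200": 240,
    "Fermi": fermi_cores,
}

for name, cores in cores_by_generation.items():
    print(f"{name}: {cores} Cuda cores")

# Growth from GT200 to Fermi (the "more than doubles" claim).
fermi_vs_gt200 = fermi_cores / cores_by_generation["GT200"]
print(f"Fermi has {fermi_vs_gt200:.2f}x the cores of GT200")
```

The product of 16 and 32 gives the 512 cores now activated on the M2090, a little over twice the GT200's 240.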
In October, Nvidia claimed the following additional features for Fermi:
According to Nvidia, Fermi was the first product of its type to support C++, complementing existing support for C, Fortran, Java, Python, OpenCL and DirectCompute. Fermi also supports Nexus (below), touted as "the world's first fully integrated heterogeneous computing application development environment within Microsoft Visual Studio."

Nvidia's Nexus
Source: Nvidia
Nexus is composed of the following three components, according to Nvidia:
Availability
Nvidia did not spell out pricing or availability for the Tesla M2090, but the boards appear to be orderable now in the HP server mentioned in this story. More information can be found on Nvidia's Tesla product page.
Jonathan Angel can be reached at jonathan.angel@ziffdavisenterprise.com and followed at www.twitter.com/gadgetsense.