The ATI FireStream with the Cypress chip has 1,600 stream processors, organized into 20 SIMD engines, and a slew of supporting electronics wrapped around them. The AMD GPU has full hardware support for the DirectCompute 11 and OpenCL 1.1 number-crunching interfaces, and also has 32-bit atomic operations, flexible 32KB local data shares, a 64KB global data share, global synchronization, and append/consume buffers etched onto its silicon.
With all of its cores working properly, the Cypress GPU can deliver 2.72 teraflops of single-precision and 544 gigaflops of double-precision floating point performance. While some workloads, including certain life sciences and oil and gas exploration apps, get by just fine on single precision, most flop heads care about double precision. And in this case, the ATI Cypress GPU can hold its own against the best Fermi that Nvidia has.
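The peak figures quoted above can be reproduced with a back-of-envelope calculation. The sketch below treats the 1,600 figure as the count of single-precision ALUs and rests on three assumptions that are not stated in the text: each ALU retires one multiply-add (two floating point operations) per clock, the chip runs at 850MHz (the clock that matches the quoted 2.72 teraflops), and double precision runs at one-fifth the single-precision rate. The names are illustrative only.

```python
# Back-of-envelope peak throughput for a Cypress-class GPU.
ALUS = 1600            # processing units, from the text
FLOPS_PER_CLOCK = 2    # assumption: one multiply-add per ALU per clock
CLOCK_MHZ = 850        # assumption: clock chosen to match the quoted 2.72 teraflops
DP_DIVISOR = 5         # assumption: double precision at one-fifth the SP rate


def peak_gflops(alus, flops_per_clock, clock_mhz):
    """Theoretical peak in gigaflops: ALUs x flops per clock x clock rate."""
    return alus * flops_per_clock * clock_mhz / 1000


sp_gflops = peak_gflops(ALUS, FLOPS_PER_CLOCK, CLOCK_MHZ)
dp_gflops = sp_gflops / DP_DIVISOR

print(f"single precision: {sp_gflops / 1000:.2f} teraflops")  # 2.72
print(f"double precision: {dp_gflops:.0f} gigaflops")         # 544
```

These are theoretical ceilings; real applications rarely keep every ALU busy on every clock, so sustained performance will land below them.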
I deliver highly optimized, portable OpenCL source code that runs on any OpenCL-capable CPU or GPU and is tuned to get the most calculation speed out of each device. One generic codebase is the best way to prepare code for the future, and additional functionality is just a stone's throw away.