Nvidia GeForce GTX 1660 Ti 6GB Review: Turing Without the RTX


11/21/2019 Update: Since the launch of the GTX 1660 Ti in February 2019, the GPU landscape has changed dramatically, with a swath of “Super” cards based on the same Turing architecture, but pushing both higher performance and lower prices than the company’s initial Turing lineup. Most relevant to potential buyers of the GTX 1660 Ti is the GeForce GTX 1660 Super, which delivers similar performance to the 1660 Ti, at a lower starting price of $229. At this writing, that’s about $30 less than the lowest-price GTX 1660 Ti.

Nvidia's GeForce GTX 1660 Ti is built on TU116, an all-new graphics processor that incorporates Turing's improved shaders, its unified cache architecture, support for adaptive shading, and a full complement of video encode/decode acceleration features. The GPU is paired with GDDR6 memory, just like the higher-end GeForce RTX 20-series models. But it isn't fast enough to justify tacking on RT cores for accelerated ray tracing or Tensor cores for inferencing in games. As a result, TU116 is a leaner chip with a list of specifications that emphasizes today's top titles.

EVGA’s take on the GeForce GTX 1660 Ti

Nvidia says that GeForce GTX 1660 Ti will start at $279 and completely replace GeForce GTX 1060 6GB. Although that base price is $30 (or 12 percent) higher than where the Pascal-based 1060 6GB began its journey back in 2016, the company claims GeForce GTX 1660 Ti is up to 1.5 times faster, and at the same 120W board power rating, no less.


Improved performance per dollar isn’t something we’ve seen much of from the Turing generation thus far. Can Nvidia turn that around with a GPU more purpose-built for performance at 1920 x 1080?

Meet TU116: Turing Sans RT and Tensor Cores

We’ve seen Nvidia launch four separate GPUs as it escorts us down the Turing hierarchy. With each, the company peels away resources to target lower price points. But we know it’s trying to maintain balance along the way, minimizing the bottlenecks that’d unnecessarily rob lower-end processors of their peak performance.

GeForce RTX 2060 is equipped with 44 percent of the 2080 Ti’s CUDA cores and texture units, 54 percent of its ROPs and memory bandwidth, and 50 percent of its L2 cache. Before the 2060 launched, we suspected that luxuries like RT and Tensor cores would no longer make sense at those levels. But a series of patches for Battlefield V—the one ray tracing-enabled game available at the time—enabled big performance gains, proving that Turing’s signature features could still be utilized at playable frame rates.

It turns out we were off by one tier. Nvidia considers TU116 the boundary where shading horsepower drops low enough to preclude Turing’s future-looking capabilities from serving much purpose. After stripping away the RT and Tensor cores, we’re left with a 284mm² chip composed of 6.6 billion transistors manufactured using TSMC’s 12nm FinFET process. But despite its smaller transistors, TU116 is still 42 percent larger than the GP106 processor that preceded it.

Some of the growth is attributable to Turing’s more sophisticated shaders. Like the higher-end GeForce RTX 20-series cards, GeForce GTX 1660 Ti supports simultaneous execution of FP32 arithmetic instructions, which constitute most shader workloads, and INT32 operations (for addressing/fetching data, floating-point min/max, compare, etc.). When you hear about Turing cores achieving better performance than Pascal at a given clock rate, this capability largely explains why.
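To make that concurrency concrete, here's a minimal CUDA sketch of our own (the kernel and its names are purely illustrative, not Nvidia's code): the integer index math and the floating-point multiply-add are independent streams of work, which Turing can feed to its separate INT32 and FP32 pipes at the same time, where Pascal would push everything through the same cores.

```cuda
// Illustrative kernel: INT32 and FP32 work interleave in a typical shader-like
// loop body. On Turing, the index/address arithmetic (INT32 pipe) can execute
// concurrently with the fused multiply-add (FP32 pipe).
__global__ void scale_gather(const float *src, float *dst,
                             const int *indices, float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // INT32: thread index math
    if (i < n) {
        int j = indices[i];                         // INT32: gather offset
        dst[i] = fmaf(src[j], scale, 1.0f);         // FP32: fused multiply-add
    }
}
```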

Turing’s Streaming Multiprocessors are composed of fewer CUDA cores than Pascal’s, but the design compensates in part by spreading more SMs across each GPU. The newer architecture assigns one scheduler to each set of 16 CUDA cores (twice Pascal’s ratio), along with one dispatch unit per 16 CUDA cores (the same as Pascal). Four of those 16-core groupings comprise the SM, along with 96KB of cache that can be configured as 64KB L1/32KB shared memory or vice versa, and four texture units. Because a 32-thread warp issued to a 16-core grouping keeps it busy for two clock cycles, each scheduler only needs to issue an instruction to the CUDA cores every other cycle to keep them full. In between, it’s free to issue a different instruction to any other unit, including the INT32 cores.

In TU116 specifically, Nvidia says it replaces Turing’s Tensor cores with 128 dedicated FP16 cores per SM, which allow GeForce GTX 1660 Ti to process half-precision operations at 2x the rate of FP32. The other Turing-based GPUs boast double-rate FP16 as well, though, so it’s unclear how GeForce GTX 1660 Ti is unique within its family. More obvious, based on the chart below, is that the 1660 Ti delivers a massive improvement in half-precision throughput compared to GeForce GTX 1060 and its Pascal-based GP106 chip.
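As a rough illustration of how that double rate is reached in practice, CUDA exposes packed half-precision math through the __half2 type; this sketch (ours, not Nvidia's) retires two FP16 fused multiply-adds per instruction, which is how dedicated FP16 units hit twice the FP32 throughput:

```cuda
#include <cuda_fp16.h>

// Illustrative packed-FP16 AXPY: each __half2 holds two half values, so one
// __hfma2 instruction performs two half-precision FMAs at once.
__global__ void axpy_half2(const __half2 *x, __half2 *y, __half2 alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hfma2(alpha, x[i], y[i]);  // two FP16 multiply-adds per lane pair
}
```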

But when we run Sandra’s Scientific Analysis module, which tests general matrix multiplies, we see how much more FP16 throughput TU106 achieves with its Tensor cores compared to TU116. GeForce GTX 1060, which supported FP16 only nominally (Pascal’s GP106 ran half-precision at 1/64 of its FP32 rate), barely registers on the chart at all.
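Sandra's internals aren't public, but the Tensor-core path a GEMM test can exercise on TU106 maps onto CUDA's warp-level WMMA API, which TU116's stripped-down SMs cannot run. A simplified sketch of one 16x16x16 tile (the kernel name, pointers, and leading dimensions are illustrative; real code tiles a full matrix):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp multiplies a 16x16 FP16 tile of A by a 16x16 FP16 tile of B,
// accumulating into FP32 on the Tensor cores, in just a few instructions.
__global__ void wmma_tile(const half *a, const half *b, float *c,
                          int lda, int ldb, int ldc)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, lda);          // load A tile
    wmma::load_matrix_sync(b_frag, b, ldb);          // load B tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // 16x16x16 matrix FMA
    wmma::store_matrix_sync(c, c_frag, ldc, wmma::mem_row_major);
}
```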

In addition to the Turing architecture’s shaders and unified cache, TU116 also supports a pair of algorithms called Content Adaptive Shading and Motion Adaptive Shading, together referred to as Variable Rate Shading. We covered this technology in Nvidia’s Turing Architecture Explored: Inside the GeForce RTX 2080. That story also introduced Turing’s accelerated video encode and decode capabilities, which carry over to GeForce GTX 1660 Ti as well.

Putting It All Together…

Nvidia packs 24 SMs into TU116, splitting them between three Graphics Processing Clusters. With 64 FP32 cores per SM, that’s 1,536 CUDA cores and 96 texture units across the entire GPU. Board partners will undoubtedly target a range of frequencies to fill the gap between GTX 1660 Ti and RTX 2060. However, the official base clock rate is 1,500 MHz with a GPU Boost specification of 1,770 MHz. Our EVGA GeForce GTX 1660 Ti XC Black Gaming sample topped out around 1,845 MHz through three runs of Metro: Last Light, while other cards we’ve seen readily exceed 2,000 MHz. On paper, then, GeForce GTX 1660 Ti offers up to 5.4 TFLOPS of FP32 performance and 10.9 TFLOPS of FP16 throughput.
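Those headline figures follow directly from the specifications above, counting each fused multiply-add as two operations and using the rated 1,770 MHz GPU Boost clock:

```latex
% Peak-throughput arithmetic (FMA = 2 FLOPs per core per clock)
\begin{aligned}
\text{FP32: } & 1536 \times 2\,\tfrac{\text{FLOP}}{\text{clock}} \times 1.77\,\text{GHz} \approx 5.4\,\text{TFLOPS} \\
\text{FP16: } & 2 \times 5.4\,\text{TFLOPS} \approx 10.9\,\text{TFLOPS}
\end{aligned}
```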

Six 32-bit memory controllers give TU116 an aggregate 192-bit bus, which is populated by 12 Gb/s GDDR6 modules (Micron MT61K256M32JE-12:A) that push up to 288 GB/s. That’s 50% more memory bandwidth than GeForce GTX 1060 gets, helping GeForce GTX 1660 Ti maintain its performance advantage at 2560 x 1440 with anti-aliasing enabled.
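Both the 288 GB/s figure and the 50% uplift check out against the published specs, since GeForce GTX 1060 pairs 8 Gb/s GDDR5 with the same 192-bit bus:

```latex
% Memory bandwidth: per-pin data rate times bus width, converted to bytes
\begin{aligned}
\text{GTX 1660 Ti: } & 12\,\text{Gb/s} \times 192\,\text{bit} \div 8 = 288\,\text{GB/s} \\
\text{GTX 1060: }    & 8\,\text{Gb/s} \times 192\,\text{bit} \div 8 = 192\,\text{GB/s}
\qquad (288 / 192 = 1.5\times)
\end{aligned}
```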

Each memory controller is associated with eight ROPs and a 256KB slice of L2 cache. In total, TU116 exposes 48 ROPs and 1.5MB of L2. GeForce GTX 1660 Ti’s ROP count matches that of RTX 2060, which also utilizes 48 render outputs. But its L2 cache slices are half as large.

Despite a larger die, a 50%-higher transistor count, and a more aggressive GPU Boost clock rate, GeForce GTX 1660 Ti is rated for the same 120W as GeForce GTX 1060. Unfortunately, neither graphics card includes multi-GPU support. Nvidia continues pushing the narrative that SLI is meant to drive higher absolute performance, rather than give gamers a way to match a pricier single-GPU configuration with a pair of cheaper cards.