Among all the fanfare and sizzle at NVIDIA’s virtual GPU Technology Conference (GTC) this week, the California AI and gaming platform powerhouse finally announced its next-gen PC gaming graphics cards based on its Ada Lovelace GPU architecture. Named after an English mathematician and computer pioneer, NVIDIA’s Lovelace is indeed a beast of a slab of silicon with a more-of-everything design approach, built on a bleeding-edge TSMC 4N chip fab process. However, its base chip architecture was also designed with innovations in its various silicon engines, in an effort to scale performance beyond the limitations of Moore’s Law, in which transistor density seems to be reaching a point of ever-diminishing returns with every new fab node.
GeForce RTX 4090 And 4080 Brute-Force Silicon Enhancements
Indeed, there’s little question that NVIDIA’s Lovelace GPU is much beefier than its previous-gen Ampere architecture. The new GeForce RTX 4090 has 16,384 CUDA cores and 24GB of GDDR6X memory, versus 10,752 CUDA cores (same memory) in an RTX 3090 Ti. That said, the new GeForce RTX 4080 12GB has 7,680 CUDA cores, versus 8,960 in an RTX 3080 12GB, while the RTX 4080 16GB card has 9,728 cores, which is fewer than the 10,240 CUDA cores of an RTX 3080 Ti. These RTX 4080 series specs and model branding might be a sticking point for those who are just counting cores, but performance scaling here simply isn’t linear, especially when you consider these new GeForce RTX 40 series cards have boost clocks north of 2.5GHz, whereas the previous gen topped out at 1.75GHz.
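To see why counting cores alone can mislead, a rough back-of-the-envelope estimate multiplies core count by clock speed. The sketch below uses the core counts quoted above and assumes, purely for illustration, a flat 2.5GHz boost for the RTX 40 cards, a flat 1.75GHz for the previous generation, and two FP32 operations per core per clock (one fused multiply-add). The function name and the uniform clocks are assumptions of this sketch, and theoretical peak throughput is not the same thing as in-game performance.

```python
# Rough peak-FP32 estimate: cores x clock x 2 ops (one fused multiply-add per clock).
# Clock speeds are the round figures from the text, not per-card specifications.

def peak_tflops(cuda_cores: int, boost_ghz: float) -> float:
    """Theoretical FP32 TFLOPS, assuming 2 floating-point ops per core per clock."""
    return cuda_cores * boost_ghz * 2 / 1000.0

# (CUDA cores, assumed boost clock in GHz)
cards = {
    "RTX 4080 12GB": (7_680, 2.5),
    "RTX 3080": (8_960, 1.75),
    "RTX 4080 16GB": (9_728, 2.5),
    "RTX 3080 Ti": (10_240, 1.75),
}

estimates = {name: peak_tflops(cores, ghz) for name, (cores, ghz) in cards.items()}

for name, tflops in estimates.items():
    print(f"{name}: ~{tflops:.1f} TFLOPS (theoretical)")

# Fewer cores but a higher clock: compare the 12GB RTX 4080 with the RTX 3080.
ratio = estimates["RTX 4080 12GB"] / estimates["RTX 3080"]
print(f"RTX 4080 12GB vs RTX 3080: {ratio:.2f}x")
```

Under these assumptions the card with fewer cores still comes out ahead on paper, which is the point about non-linear scaling; real results will depend on memory bandwidth, architecture and the workload.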
Beyond these core counts, speeds and feeds, there are several new enhancements and innovations that NVIDIA points to for Ada Lovelace performance gains, and ultimately what will usher in new levels of image fidelity and immersion for gamers. Chief among them are new Ray Tracing core innovations, as well as 4th gen Tensor cores that are now claimed to push over 2X the TFLOP throughput. In addition, Lovelace will also support AV1 video encode/decode in hardware, much like Intel’s Arc series, which should be a boon for game streaming performance with much lower overhead at some point in the future.
Shader Execution Reordering Innovation Boosts Ray Tracing Performance, Affording Better RT Effects
Ray Tracing (RT) is a graphics rendering technique for lighting and reflection effects with much higher and more accurate visual fidelity than traditional rasterization, though it has a much higher computational overhead as well. Before the advent of ray tracing, traditional rasterization was a very orderly, deterministic process. RT doesn’t allow for this natural coherence and parts of a 3D rendered scene therefore can’t be rendered concurrently, causing stalls in the pipeline.
This issue greatly limits ray traced effects in modern game engines. However, NVIDIA’s Ada Lovelace GPU arch supports a new technique, called Shader Execution Reordering (SER), that adds a stage in the RT pipeline that batches and reorders work so rays that are running the same program can run together more efficiently (see illustration above).
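The benefit of grouping rays by the program they run can be illustrated with a toy model. The sketch below is not NVIDIA's API or a description of the hardware; it simply assumes that rays execute in fixed-size groups (32 here, an assumed figure) and that each group needs one pass per distinct shader it contains. The shader names and the `passes_needed` helper are hypothetical.

```python
# Toy model of shader execution reordering. NOT NVIDIA's API or hardware behavior.
# Each ray hit needs a shader; a group of WARP_SIZE hits runs in lockstep, and we
# assume the group needs one pass per distinct shader it contains.
import random

WARP_SIZE = 32  # assumed group size for this sketch

def passes_needed(shader_ids, warp_size=WARP_SIZE):
    """Total passes when each group runs once per distinct shader among its rays."""
    total = 0
    for start in range(0, len(shader_ids), warp_size):
        warp = shader_ids[start:start + warp_size]
        total += len(set(warp))
    return total

rng = random.Random(0)  # fixed seed so the run is repeatable
hits = [rng.choice(["metal", "glass", "skin", "foliage"]) for _ in range(1024)]

unsorted_passes = passes_needed(hits)
# The reordering step: bring hits that run the same shader together before grouping.
reordered_passes = passes_needed(sorted(hits))

print(f"passes without reordering: {unsorted_passes}")
print(f"passes with reordering:    {reordered_passes}")
```

In this model the reordered case needs close to one pass per group, while the unsorted case pays for nearly every shader in every group. Real hardware also has to weigh the cost of the reordering itself, which this sketch ignores.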
NVIDIA claims SER can offer up to a 2X improvement in RT rendering performance, and specifically highlighted a new version of the game Cyberpunk 2077 with new higher levels of RT effects, including an Overdrive mode that allows for 635 RT operations per pixel (above) for great visuals. Finally, it should be noted that game developers need to coordinate with NVIDIA on best practices for RT workload optimization and sorting, so NVIDIA has an API available for devs to help optimize their game engines and rendering techniques with this feature.
DLSS 3 Image Upscaling And The AI Supercomputer Behind Your Gaming Experience
NVIDIA’s DLSS, or Deep Learning Super-Sampling, technology is a performance reclamation technique that has delivered nice perf gains for gamers who either want to dial up visual fidelity with ray tracing, or boost FPS (Frames Per Second) for higher resolution gameplay on GeForce cards. The technology uses machine learning models, pre-trained in NVIDIA’s datacenters, to infer higher resolution frames, while allowing the rest of the graphics pipeline to run at lower resolution for higher performance and lower latency, with image quality similar to that of the native higher resolution image. Though AMD and Intel have competitive upscaling techniques as well (FSR and XeSS), DLSS is now in its third iteration and has been well-received and deployed by game developers, with 200 game titles and apps that currently utilize the tech.
Where NVIDIA’s new DLSS 3 (only supported on RTX 40 series cards) differs from its previous-gen DLSS 2 is that NVIDIA’s new architecture has gotten so fast that the GPU can now generate entire frames in real time for much higher performance, while maintaining excellent image quality. With DLSS 3, NVIDIA gave examples of the AI rendering half the frames in a sequence, and 7 out of 8 pixels overall once upscaling and frame generation are combined.
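The 7-out-of-8 figure follows from simple arithmetic if one assumes the upscaler renders a quarter of the output pixels (for example 1920x1080 scaled to 3840x2160) and that every second frame is generated entirely by AI. Both of those ratios are assumptions of this sketch rather than figures stated above, and the variable names are illustrative.

```python
from fractions import Fraction

# Assumptions for this sketch: the upscaler renders one quarter of the output
# pixels (e.g. 1920x1080 scaled to 3840x2160), and frame generation produces
# every second frame entirely with AI.
rendered_per_upscaled_frame = Fraction(1920 * 1080, 3840 * 2160)  # 1/4
upscaled_frame_share = Fraction(1, 2)  # half the frames are rendered, then upscaled

# Traditionally rendered pixels as a share of all displayed pixels.
rendered_share = rendered_per_upscaled_frame * upscaled_frame_share
ai_share = 1 - rendered_share

print(f"traditionally rendered: {rendered_share}")  # 1/8
print(f"AI-generated:           {ai_share}")        # 7/8
```

With a less aggressive upscaling ratio the AI-generated share would be smaller, so the 7/8 figure is best read as describing one particular quality setting.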
Without getting too deep into the weeds, GeForce RTX 40 cards achieve this in part thanks to a much faster Optical Flow Accelerator that calculates the motion of pixels in a scene. This accelerator has an understanding of how lighting and shadows should properly render as an object moves, then it feeds all that information into the Tensor AI engines on the chip (above diagram), to make a decision on how best to generate the frame. Because the generated frames are produced on the GPU, this frame generation technique can also help improve performance in game engines that are typically more CPU-bound.
NVIDIA showed Cyberpunk and Microsoft Flight Simulator demos with the technology, for impressive 2X performance gains with great visual fidelity. NVIDIA will also have a streamlined DLSS 3 AI plugin for easier game dev integration, and both the Unity and Unreal game engines will natively support the tech as well. Further, the company noted that, in addition to Cyberpunk and MS Flight Sim, there will be 35 game titles at launch that will support DLSS 3, with more coming, in what NVIDIA claims is the fastest uptake ever of its technology.
GeForce RTX 4090 And RTX 4080 Performance Expectations And Take-Aways
Forgive the graph below, which is a bit of an eye chart here in the Forbes engine. Regardless, NVIDIA was fairly direct about performance expectations for the new GeForce RTX 40 series, showing an $899 GeForce RTX 4080 meeting or sometimes handily beating the previous-gen GeForce RTX 3090 Ti, which had a $1,999 MSRP at launch. The company also showed much higher performance levels in next-gen games that support DLSS 3, like Cyberpunk 2077.
Overall, as you can faintly see in the graph above, GeForce RTX 4090 and 4080 series cards can perform up to 2X to 4X faster (RTX 4090) than the mighty GeForce RTX 3090 Ti. However, all of the comparisons in the graph above were run with DLSS (left, up to 2X) or DLSS 3 (right, up to 4X). It will be very interesting to see how performance shakes out with DLSS turned off in traditional game play, though one could ask why you would bother turning it off, as long as a game supports the technology.
NVIDIA’s new family of initial GeForce RTX 40 cards is listed above with their respective price points. Speeds, feeds and configurations aside, the company has rolled out an extremely potent product offering here, that it claims brings a big performance-per-dollar lift of 3X on average for its RTX 4080 cards and 4X for its RTX 4090 cards, versus its previous generation. It’s important to note that these performance claims are made with its new DLSS 3 technology in play, however, so again it will be interesting to see how performance shakes out across the board, with DLSS on and off, as well as ray tracing-enabled games and traditional rasterization gaming workloads.
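As a point of reference for what a performance-per-dollar lift means, the short sketch below applies the definition to the two prices quoted earlier, the $899 RTX 4080 and the $1,999 RTX 3090 Ti. The `relative_perf=1.0` input encodes only the "meets the RTX 3090 Ti" case and is an illustrative value, not a benchmark result; the function name is hypothetical.

```python
# Performance per dollar relative to a baseline card.
# Prices are the ones quoted in the text; relative_perf=1.0 encodes the "meets the
# RTX 3090 Ti" case and is an illustrative input, not a benchmark result.

def perf_per_dollar_lift(relative_perf: float, price: float, baseline_price: float) -> float:
    """Times more performance per dollar than the baseline (baseline perf = 1.0)."""
    return (relative_perf / price) / (1.0 / baseline_price)

lift_at_parity = perf_per_dollar_lift(relative_perf=1.0, price=899, baseline_price=1999)
print(f"perf-per-dollar lift at equal performance: {lift_at_parity:.2f}x")
```

Price alone accounts for a little over 2X at equal performance in this comparison, so any larger claimed lift has to come from the new card actually running faster, which is where DLSS 3 enters NVIDIA's numbers.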
Thoughts On Ada Lovelace And The Future Of Gaming From NVIDIA’s CEO
Last but certainly not least, I had the chance to meet with NVIDIA CEO Jensen Huang in conference this week, and I asked him about how advantageous the move from Samsung’s 8N chip fab process to TSMC 4N was for this generation. Jensen noted that his design team realized an “about 15%” uplift from process alone, while the remainder of RTX 40’s performance gains come from silicon innovation like SER (Shader Execution Re-Ordering) and DLSS. Huang noted that, while TSMC’s 4N process is far more advanced, “unfortunately the cost goes up by more than 15%,” and that scaling transistor density alone isn’t enough and no longer gets the job done, because “Moore’s Law is dead.” Further, Jensen noted, “and it’s not because TSMC is trying to capture more profit. That’s just not true. Their cost has gone up. You can tell that their cycle time has gone up because the number of steps of the process has gone up.”
Huang went on to explain that “the way that we solved it, Dave, with Ada is architecture. The compounding benefit of several different architectures and the big lever, the giant lever was artificial intelligence and tensor cores. That’s the giant lever…And so I think we have to overcome the weakness that we’re at the end of Moore’s Law, not by giving up, but by coming up with a lot more clever techniques, and thank goodness artificial intelligence came just in time.”
You have to admire Jensen’s passion for the company, his products and the burgeoning field of AI. There’s little question artificial intelligence is a “big lever,” as Huang notes. AI is becoming pervasive now in so many areas of technology, and driving higher fidelity visuals for PC gaming is a natural evolution to be sure.