Bring your Metal app to Apple silicon Macs

Meet the Tile Based Deferred Rendering (TBDR) GPU architecture for Apple silicon Macs — the heart of your Metal app or game's graphics performance. Learn how you can translate or port your graphics-intensive app over to Apple silicon, and how to take advantage of TBDR and Metal when building natively for the platform. We'll look at how TBDR compares with the Immediate Mode Rendering pipeline of older Macs, go through common issues you may face when bringing an app or game over, and explore how to offer incredible performance when building with the native SDK. We've designed this session in tandem with “Optimize Metal Performance for Apple silicon Macs.” After you've watched this session be sure to check that out next.

Related Videos

WWDC23
- Bring your game to Mac, Part 3: Render with Metal

Hello and welcome to WWDC. I'm Gokhan Avkarogullari, and I lead the graphics acceleration driver teams for Apple Silicon.
Today my colleague Sam and I will tell you how to bring your Metal-based apps to Apple Silicon Macs, and then how to make them faster, more robust, and ready for the future. For many years, Apple has designed the graphics processors found in the amazing A-series chips used in iPhone, iPad, and Apple TV. Now, for the first time, we're bringing our deep expertise in GPU design to the Mac. So let's get started.
The Apple Silicon Mac contains an Apple-designed GPU that is built to handle the workloads found in desktop games and pro apps, all while delivering new levels of speed and power efficiency. This is Dirt Rally from Feral Games. This was recorded directly from an Apple Silicon Mac running the unmodified x86 compiled binary, translated using Rosetta. You can see all the modern rendering techniques applied to high-quality game assets: complex shading, reflections, particle effects, motion blur, and lush vegetation make the game look great, and of course the Apple-designed GPU makes it run smoothly. How do we deliver this level of performance? We do it using the same powerful tile-based deferred rendering architecture used on iPhone, iPad, and Apple TV.
If you're familiar with these devices, then you know that there are several important advantages to this architecture that are exposed to the Metal graphics and compute API. Capabilities such as programmable blending, tile shaders, and local image blocks are now available to take your Mac app or game to a whole new level of performance. In fact, the GPU in the Apple Silicon Mac supports a unified Metal feature set that combines what was previously available on both macOS and iOS. And we've done our homework to make this move as easy as possible. This enables you to get the benefits of the faster, more power-efficient GPU with more features, without having to retool your app or game.
However, it's really important for you to follow the guidance in today's talk to account for certain behavioral differences and really unlock the full potential of the platform. Before we get into the specifics, let's take a look at the path you will take.
On Intel-based Macs, your app runs natively. On Apple Silicon Macs, your app will run out of the box under Rosetta's highly optimized translation layer. Under translation, your app will be automatically opted into several Metal consistency features that ensure your apps look great too.
Translation and consistency features do have some performance costs. So your next step will be to recompile your app for the native Apple Silicon architecture. You will then find and fix any Metal consistency issues. And if you want optimal performance on Apple GPUs, check out the session by my colleagues Michael and Don, Optimize Metal Performance for Apple Silicon Macs.
Let's look at today's agenda. The first section is about adapting your app for the Apple GPU. I will start by reviewing the GPU architecture of Intel-based Macs and the differences with Apple GPUs, and I'll discuss the API support on the Apple Silicon Mac and the Metal features you can use in your apps. Different GPU families will behave differently when it comes to undefined behaviors, programming errors, and performance characteristics. So in the second section, we will talk about a few common issues we observe in some apps coming from Intel-based Macs. And finally, Sam will talk about what we recommend you do for consistent rendering using Metal on the Mac.
So let's start with adapting your app for the Apple GPU. Let me highlight some of the major changes from the Intel-based Mac to the Apple Silicon Mac when it comes to the GPU. The Apple Silicon Mac contains an Apple-designed GPU, whereas Intel-based Macs contain GPUs from Intel, AMD, and Nvidia. The Apple GPU is built on a tile-based deferred renderer architecture, whereas Intel, AMD, and Nvidia GPUs are known as immediate mode renderers. I will cover the differences in a minute. Apple-designed GPUs combine the feature sets of iOS and macOS. Both the Mac 2 and Apple GPU family feature sets are supported. Intel-based Macs support only the Mac 2 family. Now let me describe to you how tile-based deferred renderers differ from immediate mode renderers.
Let us start with an Immediate Mode Renderer abbreviated here as IMR.
This is a logical view of the architecture. From left to right, the hardware fetches vertex attributes, executes the vertex shaders, and transforms the geometry into screen space with the help of fixed-function hardware. Then the fragment stage rasterizes the triangles and performs some early depth testing for efficiency.
An important note here is that triangles can span the entire screen, or two triangles from the same draw call might end up in widely different places on the screen, meaning that the hardware needs the entire depth buffer and can only cache a small subset of it.
The early depth test is followed by shading. When the early depth test cannot be done, the depth testing is done after fragment shading.
There's also alpha test and blending, where access to the color buffer is required. Just as for the depth buffer, the color buffer is needed in its entirety and only a small part of it can be cached.
Now that we reviewed the IMR architecture, I will move on to the tile-based deferred renderer and highlight the differences step by step. The first change is tiling the geometry in screen space and creating a database of all geometry in memory, which I'm going to refer to as the tiled vertex buffer. Since we have all the geometry information for an entire tile, we can rasterize it up front. As a result, for each pixel in the tile, we know for which opaque fragment the shading will have to happen. We also know which opaque fragments will fail the depth test. As a result, the hidden surfaces are entirely removed. We don't shade any fragment that will later be occluded by another one. We can do this with just a tile-sized depth buffer on the chip rather than the entire depth buffer. It does not matter if a triangle is full screen or if triangles in the same draw call are widely distributed on the screen. We get perfect depth testing with just a tile-sized on-chip depth buffer, thanks to the tiled vertex buffer.
Additionally, if we're not going to use that depth buffer later, we can make use of the memoryless render target feature and not store the depth data and not use memory for the depth buffer. It's a win-win situation. Not only that, we can also do blending and alpha testing without ever having to load from the full color buffer in memory. Because we process an entire tile at once, we only need a tile's worth of color buffer on the chip. Please note that the arrow for the color buffer in memory is one way. We never need to access it for alpha testing, write mask, blending, or anything else.
This architecture allows alpha test and blending operations to be performed by the shader core instead of fixed-function hardware. It also gives you access to the contents of the color buffer in your shader, which you can use for programmable blending, custom MSAA resolve, and so on.
Obviously, feedback such as depth writes between the depth buffer and shading is covered by this architecture.
To recap, tile-based deferred renderers have two main phases: tiling, where all the geometry is processed, and rendering, where all the pixels are processed. Let us summarize. On a TBDR GPU, a well-written app uses a lot less memory bandwidth than on an IMR GPU. Since processing is done for a full tile, all blending happens in registers, so it's really fast and power efficient. This is great for games, pro apps, and UI performance as well. There is no reason to re-fetch color, depth, or stencil buffers over and over again. As a result, memory bandwidth is saved, performance increases, and power consumption goes down.
Now that we reviewed the architecture, we can move on to the app's view of the GPU. With the Apple Silicon Mac, you get what you had before and more. This new Mac supports everything the Intel-based Mac supported. All the APIs we and you love, such as Metal, MPS, Core Animation, and so on, are available. The other APIs that are also in use, such as OpenGL and OpenCL, are also available on the Apple Silicon Mac. I will briefly talk about GL and CL first and then move on to Metal. OpenGL and OpenCL are still deprecated, but they're not yet removed. They're present for backward compatibility with the library of existing Mac apps that use these legacy technologies. OpenGL is supported up to version 4.1 and OpenCL up to version 1.2. For features not in the core, you should always use extension or query checks before using them.
Note that Apple-designed GPUs are engineered for Metal. If you're still using OpenGL and OpenCL in your app or game, you should prioritize moving to Metal as soon as possible. Otherwise you will continue to leave performance and capability on the table. Now on to more exciting things. Let's talk about the new features that Apple-designed GPUs bring to the Mac.
Apple Silicon Macs have the feature set defined by the Metal GPU family Mac 2. Additionally, for the first time ever, the features that are specific to Apple GPUs are coming to the Mac. These features provide tremendous upside for your apps in terms of better performance, lower power consumption, and lower memory usage. Through the use of features such as programmable blending, tile shaders, and local image blocks, you can accelerate deferred, forward, and all types of tiled rendering. With memoryless render targets and on-chip MSAA resolve, the MSAA feature shines on this system. You can even use MSAA with deferred rendering, because all the processing is done on the chip in this architecture, without the cost of additional memory bandwidth and storage.
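As a rough illustration of the memoryless render target and on-chip MSAA resolve combination just mentioned, here is a minimal Swift sketch. The helper name makeMemorylessMSAAPass, the 4x sample count, and the use of .apple2 as the family check are assumptions made for this example and do not come from the session; the resolve target is assumed to be a regular 2D texture created with render target usage.

```swift
import Metal

/// Hypothetical helper: renders into a memoryless 4x MSAA color target and
/// resolves on chip into `resolveTarget`, so the multisampled data never
/// touches system memory.
@available(macOS 11.0, iOS 13.0, tvOS 13.0, *)
func makeMemorylessMSAAPass(device: MTLDevice,
                            resolveTarget: MTLTexture) -> MTLRenderPassDescriptor? {
    // Memoryless storage is an Apple GPU feature, so query the family first.
    guard device.supportsFamily(.apple2) else { return nil }

    let msaaDesc = MTLTextureDescriptor()
    msaaDesc.textureType = .type2DMultisample
    msaaDesc.pixelFormat = resolveTarget.pixelFormat
    msaaDesc.width = resolveTarget.width
    msaaDesc.height = resolveTarget.height
    msaaDesc.sampleCount = 4
    msaaDesc.usage = .renderTarget
    msaaDesc.storageMode = .memoryless   // backed by tile memory only

    guard let msaaTexture = device.makeTexture(descriptor: msaaDesc) else { return nil }

    let pass = MTLRenderPassDescriptor()
    pass.colorAttachments[0].texture = msaaTexture
    pass.colorAttachments[0].resolveTexture = resolveTarget
    // A memoryless texture has nothing to load from, so start from a clear.
    pass.colorAttachments[0].loadAction = .clear
    // Only the resolved result is written out; the samples are discarded.
    pass.colorAttachments[0].storeAction = .multisampleResolve
    return pass
}
```

On a GPU that does not report an Apple family, the helper returns nil and the caller would fall back to a regular private multisample texture.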
There are many other features, such as ASTC textures, image block sample coverage control, and so on, that open up opportunities for new techniques and optimizations. Let's recap. Your app should work fine, as all APIs are already supported. Be aware of the underlying tile-based deferred renderer architecture and use its best practices. And don't stop there. Take advantage of the new features and the new architecture. They will improve your app's performance even more.
Now we're going to look into some Metal API misuse that we observed in some apps on the Mac. We will review what the right usages are. I will start with Metal feature detection. After that, Sam will take over and talk about a few app correctness issues that we observed and how to fix them: load and store actions, position invariance, threadgroup memory synchronization, and sampling the previous depth while forming new depth values.
Let us start with Metal feature detection. Previously we talked about the unified feature set support. So how do you go about finding the availability of these features? Applications should query Metal GPU features directly. Metal organizes features into different GPU families that can be queried for a given device. Some individual feature queries are also available. Make use of those for features that vary within the same GPU family. One important note: querying the GPU name to know if a particular feature is supported is not best practice. The code is not future proof. Let's see an example. Here is some code basing feature usage on the OS, differentiating between macOS and iOS platforms. Then it uses the GPU name to decide some other information, such as simdgroupSize and whether the device is marked as low power. And finally, it assumes that Apple GPU features are not available on macOS. This is not true anymore with the Apple Silicon Mac. Instead, applications should gather this information using the Metal API.
The supportsFamily API can be used to know if Apple GPU features are supported. The simdgroup size can be found using the threadExecutionWidth property of a compute pipeline object. And to know if a GPU needs to be treated as integrated or discrete, use the isLowPower API.
Note that for Apple GPUs isLowPower returns false, which means that you should treat these GPUs in a similar way to discrete GPUs, not integrated GPUs. This is because the performance characteristics of Apple GPUs are in line with discrete ones, not integrated ones. Despite the property name, though, Apple GPUs are also way, way more power efficient than both integrated and discrete GPUs. Now that we reviewed Metal feature detection, it's time to review how to correctly use load and store actions. Sam will tell you all about it and more. Sam?
Thanks, Gokhan. Let's now talk about a common issue related to Metal load and store actions.
First, a quick recap. Load and store actions control the action performed on an attachment - either color, depth, or stencil - at the start and end of a render pass. On Apple GPUs, they directly control the state of the on-chip tile memory. This memory is initialized based on load actions and flushed based on store actions. Incorrect usage can cause performance issues if attachments are unnecessarily loaded or stored.
Please refer to the video Optimize Metal Performance for Apple Silicon Macs to learn how to best use load and store actions. But in this section, let's focus on correctness issues. Here is an example showing some incorrect usage on the left. We had rendered the skybox in an earlier pass and then drew the scene in the final pass. However, we used the DontCare load action for the final pass, causing the skybox to not be uploaded to tile memory and generating graphics artifacts. Setting loadAction to load avoids this issue.
Let's take a look at the diagram to understand what is happening. We first start with a previously rendered texture in system memory. If a DontCare load action is chosen, Apple GPUs will not upload the texture from system memory to tile memory, leaving the tile memory uninitialized. The scene will then be drawn on top of uninitialized data, and if the drawing does not cover the full frame buffer, uninitialized data will remain, creating correctness issues. However, if the load action load is chosen, the tile memory is correctly initialized. In this case the scene is rendered on top of the previous content. Let's now look at store actions. If a DontCare store action is chosen, the tile memory is not flushed to system memory, leaving the texture in system memory potentially uninitialized. This is not the desired behavior if this texture is consumed by a later render pass.
However, if the store action store is used, the rendering done in tile memory is populated to system memory. Let's summarize how to correctly use load and store actions.
For load actions, make sure to load only if the previous content needs to be preserved. This is the case when a draw does not cover the full frame buffer and accumulation is needed on top of the previous rendering. If you don't need accumulation, do not use the load action load, as it costs performance.
Similarly, choose to store only if the content will be consumed in a later pass. Otherwise, make sure not to store, to avoid extra memory traffic.
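The load and store rules above can be sketched in Swift roughly as follows. The helper names (chooseLoadAction, chooseStoreAction, makeScenePass) and the skybox scenario wiring are hypothetical, chosen only to mirror the example from the talk; loadAction, storeAction, and the MTLRenderPassDescriptor attachments are the regular Metal API.

```swift
import Metal

/// Load only when earlier contents must survive; otherwise clear or don't care.
func chooseLoadAction(preservesPreviousContents: Bool, needsClear: Bool) -> MTLLoadAction {
    if preservesPreviousContents { return .load }
    return needsClear ? .clear : .dontCare
}

/// Store only when a later pass (or the display) consumes the result.
func chooseStoreAction(consumedLater: Bool) -> MTLStoreAction {
    return consumedLater ? .store : .dontCare
}

/// Final scene pass drawn on top of a skybox rendered by an earlier pass.
func makeScenePass(color: MTLTexture, depth: MTLTexture) -> MTLRenderPassDescriptor {
    let pass = MTLRenderPassDescriptor()

    // The scene does not cover every pixel, so the skybox must be loaded
    // into tile memory and the final image must be stored.
    pass.colorAttachments[0].texture = color
    pass.colorAttachments[0].loadAction =
        chooseLoadAction(preservesPreviousContents: true, needsClear: false)
    pass.colorAttachments[0].storeAction = chooseStoreAction(consumedLater: true)

    // Depth starts fresh and (in this sketch) is not read by any later pass,
    // so it is cleared on load and never written back to system memory.
    pass.depthAttachment.texture = depth
    pass.depthAttachment.loadAction =
        chooseLoadAction(preservesPreviousContents: false, needsClear: true)
    pass.depthAttachment.clearDepth = 1.0
    pass.depthAttachment.storeAction = chooseStoreAction(consumedLater: false)
    return pass
}
```

Keeping the decision in two small functions makes the intent reviewable: every .load or .store has to be justified by a "preserved" or "consumed later" argument at the call site.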
Let's carry on with an issue related to position invariance. The results of the same vertex position calculation across different vertex shaders can be slightly different.
This is because, by default, the Apple GPU compiler highly optimizes vertex shaders. As a consequence, different optimizations between vertex shaders might lead to position value differences. If an application relies on position invariance without enabling it, correctness issues might be observed. So in Metal, position invariance is not guaranteed by default; it has to be explicitly enabled.
This is exactly what is happening in this image. Due to position invariance being disabled, position outputs are slightly different between two passes, causing many pixels to be discarded. Here is the same image with position invariance correctly enabled.
Let's illustrate this issue with a diagram. Let's assume that the first pass of a multi-pass algorithm has rendered some geometry and stored the depth texture in system memory for later usage.
Then comes pass2, which loads this depth texture. Also note that the tile memory that is holding color values is cleared at the beginning of the render pass. Pass2 then starts rendering the same geometry as pass1, but this time with the depth test set to equal. And due to position invariance being disabled, pass2 generates slightly different depth values than pass1. This causes some pixels to fail the depth test and be discarded. For those pixels, the cleared color is populated to system memory, causing correctness issues.
Let's now take a look at the code. Here we have the two vertex shaders used in the previous diagram: vertexPass1 and vertexPass2. As you can see, the position computation is exactly the same between the two shaders. It calls compute position in both cases. But there is no guarantee that it will actually produce the exact same values for the position. This is due to the code around the position computation, which might trigger differences in compiler optimizations affecting the position itself. How can this be avoided? First, make sure to pass the preserveInvariance option to the Metal compiler when creating a Metal library. Note that there is a similar option if you use the offline compiler. And for all the vertex shaders within this library that require position invariance, just add the invariant keyword on the position output. Note that you need to set both of these options to ensure position invariance. Since position invariance comes at a performance cost, carefully enable these options only for shaders that really require it. We encourage you to go through your application and identify vertex shaders that require position invariance. This will likely be needed where the depth compare function is set to equal, as it would require matching depth values between different passes. And to enable position invariance, make sure to use the preserveInvariance flag during compilation and the invariant keyword on the position output in your vertex shaders.
Now let's take a look at threadgroup memory synchronization. This is an area where you can get great performance if you have a good understanding of the underlying hardware architecture. Let's first refresh the notion of threadgroups and simdgroups. Compute passes launch a set of individual threads organized into threadgroups. The threads within a threadgroup execute together and can share a common block of threadgroup memory. Access to this memory needs to be properly synchronized for correct ordering.
Threadgroups are organized into simdgroups that execute concurrently on the GPU. However the simdgroup size is GPU specific - it is 32 on Apple GPUs.
Knowing the simdgroup size can be very useful to optimize compute shaders. Particularly if there is only one simdgroup per threadgroup there is no need to synchronize between simdgroups. So in order to ensure correct synchronization on any GPU you should query the simdgroup size through the Metal API at runtime. If an application does not use the correct simdgroup size synchronization might be missing causing correctness issues. Let's see an example.
The artifacts in this image are due to the fact that the application expects only one simdgroup in a threadgroup of size 64, omitting any kind of threadgroup memory synchronization. Here is the correct image with proper synchronization.
Let's dig into the compute shader used in this application. This kernel makes use of threadgroup memory. It has also been launched using a threadgroup size of 64. You can also see that the threadgroup memory is first initialized and read back later in the shader. Note that there is cross-thread communication, as the buffer written from one thread will be consumed on another thread. However, nothing in the code guarantees correct ordering. Something is missing here.
You probably already guessed what is missing: memory synchronization. Here is what the code looks like when adding proper synchronization. First, the application queries the simdgroup size using the built-in threads_per_simdgroup. This can then be used to inject the correct synchronization.
If there is only one simdgroup per threadgroup - in this shader, that means simd_size is 64 - then a simple simdgroup barrier is enough to guarantee memory ordering. However, if there are several simdgroups, a full threadgroup barrier is required to synchronize all the simdgroups. And even though extra control flow is generally not good practice, in this particular case, as simd_size is a constant, the compiler is able to optimize out the code path that will not be taken. This code will ensure correct synchronization on any GPU. For best performance, you might want to rewrite your shaders with a simdgroup size of 32 in mind to avoid threadgroup barriers, as they are expensive. Applications could have different sets of shaders optimized for a given simdgroup size and choose at runtime which variant to use. With that in mind, go through your compute shaders and ensure correct threadgroup memory synchronization. GPUs have different simdgroup sizes, which will require different synchronization.
Make sure to query the simdgroup size, either in your shader using the built-in threads_per_simdgroup or with the Metal API using threadExecutionWidth.
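On the host side, choosing a shader variant at runtime could look roughly like the Swift sketch below. The kernel names reduceSingleSimdgroup and reduceMultiSimdgroup are hypothetical: the first is assumed to rely only on simdgroup barriers, the second to use full threadgroup barriers. The width is queried on the very pipeline that would run, since threadExecutionWidth is a property of a compiled pipeline.

```swift
import Metal

/// True when a threadgroup spans more than one simdgroup, which means
/// simdgroup barriers alone cannot order threadgroup memory accesses.
func needsThreadgroupBarrier(simdgroupSize: Int, threadgroupSize: Int) -> Bool {
    return threadgroupSize > simdgroupSize
}

/// Returns the cheaper single-simdgroup variant when it is safe for this
/// device and threadgroup size, otherwise the variant with full barriers.
func makeReducePipeline(device: MTLDevice,
                        library: MTLLibrary,
                        threadgroupSize: Int) throws -> MTLComputePipelineState? {
    if let fastFunction = library.makeFunction(name: "reduceSingleSimdgroup") {
        let fast = try device.makeComputePipelineState(function: fastFunction)
        // Check the simdgroup size of the pipeline we actually intend to use.
        if !needsThreadgroupBarrier(simdgroupSize: fast.threadExecutionWidth,
                                    threadgroupSize: threadgroupSize) {
            return fast
        }
    }
    // Fallback that is correct for any simdgroup size.
    guard let safeFunction = library.makeFunction(name: "reduceMultiSimdgroup") else {
        return nil
    }
    return try device.makeComputePipelineState(function: safeFunction)
}
```

With a threadgroup size of 64 and a simdgroup size of 32, this picks the threadgroup barrier variant; dispatching with a threadgroup size of 32 instead would allow the cheaper one.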
And for better performance, write multiple versions of your shader, each optimized for a given simdgroup size.
Let's now explore a final issue that we observed, related to depth and stencil texture sampling. Applications should ensure correct texture and attachment bindings. More specifically, a texture used as an attachment cannot also be sampled in the same render pass. This might create concurrent read and write access to the same underlying texture and trigger correctness issues.
This is exactly what is happening in this image. The current depth attachment is also being sampled in the same render pass. No artifacts are observed if the application avoids this undefined behavior. A diagram will help us understand what is happening.
We first start with a depth texture in system memory. Then the first draw comes in and starts accumulating depth.
Fragment shaders within this draw execute and sample the depth texture from system memory. Then comes a later draw. Same story here. It will accumulate depth and sample the depth texture. One important point in this diagram is the fact that depth is accumulated prior to fragment shader execution. This will be the case for opaque geometry where fragment shaders don't modify depth.
At this point in the render pass, all the depth information is finalized. Apple GPUs will then start flushing the on-chip depth/stencil memory back to system memory. This is where concurrent read and write access to the same depth texture introduces correctness issues. And note that this race condition can happen on any draw in your render pass, not only the last one.
In summary, ensure that your application does not sample the current depth/stencil attachments in the same render pass. This is undefined behavior on any GPU. Also, do not use texture or memory barriers to work around this issue. Those are very expensive, especially on Apple GPUs using a tile-based deferred renderer architecture. Instead, if your application really requires sampling the current attachments, create a second copy for sampling.
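One possible way to create that second copy is a blit between render passes, sketched below in Swift. The helper name makeDepthSnapshot is hypothetical, and the sketch assumes a non-multisampled 2D depth texture with a single mip level that was written with a store action of store by an earlier pass (so it is not memoryless).

```swift
import Metal

/// Hypothetical helper: copies the depth attachment into a separate texture
/// so a later render pass can sample the copy while rendering to the original.
@available(macOS 10.15, iOS 13.0, tvOS 13.0, *)
func makeDepthSnapshot(device: MTLDevice,
                       commandBuffer: MTLCommandBuffer,
                       depthAttachment: MTLTexture) -> MTLTexture? {
    let desc = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: depthAttachment.pixelFormat,
        width: depthAttachment.width,
        height: depthAttachment.height,
        mipmapped: false)
    desc.usage = .shaderRead        // the snapshot is only ever sampled
    desc.storageMode = .private

    guard let snapshot = device.makeTexture(descriptor: desc),
          let blit = commandBuffer.makeBlitCommandEncoder() else {
        return nil
    }
    // Encode the copy before the render pass that samples the snapshot.
    blit.copy(from: depthAttachment, to: snapshot)
    blit.endEncoding()
    return snapshot
}
```

In a real renderer the snapshot texture would normally be allocated once and reused every frame; it is created inline here only to keep the sketch self-contained.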
Let's now wrap up with what you need to do in order to get consistent rendering using Metal. In our application testing, we observed three main issues affecting graphics correctness. That is why we developed three Metal workarounds that are enabled for backward compatibility.
These workarounds are only applied to applications compiled with the macOS Catalina SDK or earlier. The first issue is when applications use DontCare load actions where a load should have been used. In this case, Metal remaps all DontCare load actions to actual loads to avoid correctness issues.
The second issue is related to position invariance. Some applications rely on position invariance without enabling it. Metal then forces position invariance for all vertex shaders. And finally, if the same depth texture is both sampled and rendered to in the same render pass, Metal will snapshot the texture prior to rendering. Again, these workarounds are only enabled for applications compiled with the macOS Catalina SDK or earlier. And as you probably suspect, they also come at a performance cost. So this is really not what you want. To help you identify these issues, we augmented the Metal API validation layer to catch a few of them, such as the misuse of DontCare load and store actions as well as sampling the current depth and stencil attachments.
Here is a recap of the Metal best practices to get consistent rendering across different GPUs. Go through your application and make sure that this is already what you are doing. And if not, just follow these simple recommendations.
Regarding Metal GPU feature detection, make sure that it is API driven by querying the Metal GPU features directly. Carefully choose your load and store actions to address both correctness and performance issues. Do not unnecessarily load or store attachments, as it will increase memory traffic. Also consider using memoryless attachments to lower the memory footprint of your application. These can typically be used for depth attachments. Enable position invariance where required. Ensure correct threadgroup memory synchronization by querying the simdgroup size and using the appropriate synchronization primitives. And finally, avoid sampling the current depth and stencil attachments by snapshotting the attachments prior to rendering.
We are very excited to bring Apple Silicon to the Mac. This means that your applications will be more efficient and run faster out of the box, so we can't wait to see how much faster they will run once you start optimizing for Apple Silicon. This will also enable you to use brand new features on the Mac that were not possible before, such as programmable blending, tile shaders, memoryless framebuffers, and more.
And finally, this transition aligns the technology used on Mac platforms with existing Apple platforms. That means that you can easily share a common code base across all Apple products. And this makes it even easier to bring all your iPad and iPhone applications to the Mac. Apple Silicon Macs bring new opportunities for you. At the beginning of this session, we showed with Dirt Rally how games get amazing graphics performance. On Apple GPUs, even the most demanding pro applications will get great performance.
As an example, I would like to show you a video of Cinema 4D. This is recorded live on an Apple Silicon Mac. Note that it is also running the original x86 binary under Rosetta translation. In this video, the application is rendering a high-polygon model in real time with realistic camera properties such as user-specified aperture, f-stop, and focal point. Enjoy how smoothly it runs on an Apple Silicon Mac while rendering a graphics-demanding scene! Where does this all fit in the transition process to Apple Silicon Macs? At this point, your application should look great without Metal workarounds, running on the new macOS SDK. But your journey should not stop here. To take your application to the next level and greatly improve performance, make sure to check out the session Optimize Metal Performance for Apple Silicon Macs. Thank you.

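{
    // Apple GPU features are reported per GPU family, not per OS or GPU name
    self.appleGPUFeatures = metalDevice.supportsFamily(.apple5)
    
    // The simdgroup size is a property of the compiled compute pipeline
    self.simdgroupSize = computePipeline.threadExecutionWidth
    
    // Returns false on Apple GPUs: treat them like discrete GPUs
    self.isLowPower = metalDevice.isLowPower
}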

20:58 - Enabling position invariance

// Renderer.swift
let options = MTLCompileOptions()
options.preserveInvariance = true

library = try device.makeLibrary(source: sourceString,
                                 options: options)


// vertex.metal
struct VertexOut
{
    float4 pos [[position, invariant]];
    float data;
};

24:25 - Threadgroup synchronization

// Correct synchronization

// launched with threadgroup size = 64
kernel void kernelMain(uint tid [[ thread_index_in_threadgroup ]],
                       uint simd_size [[ threads_per_simdgroup ]],
                       device uint * res [[ buffer(0) ]])
{
    threadgroup uint buf[64];
    
    buf[tid] = initBuffer(tid);

    if (simd_size == 64u)
        simdgroup_barrier(mem_flags::mem_threadgroup);
    else
        threadgroup_barrier(mem_flags::mem_threadgroup);

    uint index = (tid < 32) ? tid + 32 : tid - 32;
    res[tid] = buf[tid] + buf[index];
}
