Discover Swift enhancements in the Vision framework
The Vision Framework API has been redesigned to leverage modern Swift features like concurrency, making it easier and faster to integrate a wide array of Vision algorithms into your app. We'll tour the updated API and share sample code, along with best practices, to help you get the benefits of this framework with less coding effort. We'll also demonstrate two new features: image aesthetics and holistic body pose.
Chapters
- 0:00 - Introduction
- 1:07 - New Vision API
- 1:47 - Get started with Vision
- 8:59 - Optimize with Swift Concurrency
- 11:05 - Update an existing Vision app
- 13:46 - What’s new in Vision?
Hi. I’m Megan Williams, and I work on the Vision framework team. Vision is a framework that offers computer vision APIs developers can use to create incredible apps and experiences. Let me show you some of the things you can do with the Vision framework. Vision can detect faces and face landmarks like eyes, nose, and mouth.
The Vision framework can recognize text in 18 different languages, including Korean, Swedish, and Chinese! For health and fitness applications, Vision can track body poses and trajectories.
Vision also has a hand-pose tracking feature that enables a whole new way of interacting with an Apple device without touching a screen. These are just a few examples. Thousands of developers all around the world have used Vision to build incredible apps. This year, it will be even easier to bring computer vision into your apps. We are excited to announce a new API that is just as powerful as before, now with streamlined syntax designed for Swift.
We are also introducing full support for Swift Concurrency and Swift 6, enabling you to write the most performant apps yet. In the rest of this video, we will show you how to get started with the new Swift API. Then, we’ll demonstrate how to optimize your apps with Swift Concurrency. Next, we will explain how to update existing Vision applications to use the new API. And finally, we’ll introduce some new capabilities in Vision this year. Let’s get started! In Vision, everything begins with a request. Think of a Vision request as a question we would ask of an image, such as "Where are the faces in this image?" "What is the total on this receipt?" Or, "What is the subject of this image?" We can ask these questions with DetectFaceRectanglesRequest to find faces, RecognizeTextRequest to understand text, and GenerateObjectnessBasedSaliencyImageRequest to find salient objects in an image.
Now that we have some questions, we also want some answers. Vision provides the answers to these questions in the form of observations, which are produced by performing requests.
DetectFaceRectanglesRequest produces FaceObservations that tell us where the faces are. RecognizedTextObservations help us understand text, and SaliencyImageObservations highlight what’s important in the image.
These are just a few examples. Vision has 31 different requests, each representing a type of image analysis. While there are way too many requests to cover in this video, I want to highlight some of my favorites. Vision has requests for common computer vision tasks like image classification and text recognition. Vision can detect and recognize a variety of objects in images, like barcodes, people, and animals. There are also APIs for body pose estimation, in both 2D and 3D. Vision can also detect moving objects and track them over multiple frames! Now that we understand what requests are, let’s see what an example looks like in code. I am going to build a grocery store application. In order to navigate all the different products in a grocery store, I want to be able to scan their barcodes. I can do this with DetectBarcodesRequest.
Let’s scan the barcode in this image. I’ll begin by creating the request. Then, I’ll perform the request on the image.
This will produce barcodeObservations, one observation for each barcode detected in the image.
The request may fail, so I need to add a try statement.
Another important note is that the new APIs are asynchronous, so the app must await the results. And that’s it. We can detect barcodes with just three lines of code.
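Here’s a minimal sketch of those three lines, assuming the image is available as a file URL (perform(on:) also accepts other sources, like CGImage) and that each observation exposes its payload as a string:

```swift
import Vision

// Hypothetical image location for this sketch.
let imageURL = URL(filePath: "/path/to/groceryItem.jpg")

// 1. Create the request.
let request = DetectBarcodesRequest()

// 2. Perform the request on the image. It may fail, so we use try,
//    and the API is asynchronous, so we await the results.
let barcodeObservations = try await request.perform(on: imageURL)

// 3. One observation is produced per detected barcode.
for barcode in barcodeObservations {
    print(barcode.payloadString ?? "no payload")
}
```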
This is just one example, but all Vision requests follow the same general structure.
Once I’ve detected a barcode, I can get the product information from the barcode’s payload and use it to look up the product. I also want to know where the barcode is located in the image, so I can highlight it for my users. I can get the barcode’s location from the .boundingBox property. It is important to note that Vision observations have coordinates that are normalized to the image.
In Vision’s coordinate system, the coordinates are normalized between 0 and 1, and the origin is in the lower-left corner.
This may be different from other frameworks you are used to, such as SwiftUI, which places the origin in the upper-left corner. But don’t worry. We are offering a new API this year to help convert between different coordinate systems. To convert from Vision’s normalized coordinates back into the coordinates of the original image, I can call toImageCoordinates() and pass in the size of my image. I can also specify whether the coordinate origin should be in the upper-left or lower-left corner of the image. I’m going to use upperLeft.
This gives us the bounding boxes in the same coordinate space as the original image, with the origin in the upper-left corner. Now we know where the barcodes are located within the original image and can highlight them for our users.
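A sketch of that conversion, reusing the observations from above; the image dimensions here are assumed:

```swift
// The pixel size of the original image (assumed for this sketch).
let imageSize = CGSize(width: 3024, height: 4032)

for barcode in barcodeObservations {
    // Convert the normalized bounding box into image coordinates,
    // with the origin in the upper-left corner to match SwiftUI.
    let rect = barcode.boundingBox.toImageCoordinates(imageSize, origin: .upperLeft)
    print("Barcode at \(rect)")
}
```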
Now that I have an app that can scan for barcodes, I can make a few optimizations to make the app even better. Many requests have properties that can be configured to fine-tune them for better performance. By default, DetectBarcodesRequest scans for many different types of barcodes. Because I am building a grocery store application, I can instead scan only for barcodes commonly encountered in grocery stores.
I will set the symbologies property on the request to just scan for .ean13. This will be more efficient than scanning for all barcode types, and will give my app better performance.
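In code, that’s a single property assignment before performing the request:

```swift
var request = DetectBarcodesRequest()

// Scan only for EAN-13, the symbology found on most grocery products.
// Narrowing the symbologies is more efficient than scanning for all types.
request.symbologies = [.ean13]

let observations = try await request.perform(on: imageURL)
```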
Now that we’ve covered basic API usage, let’s highlight some more advanced use cases for our grocery store application. We can scan images for barcodes, but what if the product doesn’t have a barcode? Instead, I’ll identify the product by the text on the label. I will need to perform a second request, and in this case, I’ll use RecognizeTextRequest().
I will create the requests like before.
While I could perform the requests one at a time, for optimal performance Vision recommends performing requests together. To do this I’ll need an ImageRequestHandler.
Think of an ImageRequestHandler as a container for an image. I’ll use the handler to perform the requests. The handler will return one result for each request.
Now I can scan for barcodes in an image and identify any text labels at the same time.
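A sketch of performing both requests through one handler; the returned tuple contains one result per request, in the order the requests were passed in:

```swift
let barcodesRequest = DetectBarcodesRequest()
let textRequest = RecognizeTextRequest()

// The handler is a container for the image, letting several
// requests be performed against it together.
let handler = ImageRequestHandler(imageURL)

// One result per request, matching the argument order.
let (barcodes, text) = try await handler.perform(barcodesRequest, textRequest)
```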
One thing to note is that this perform call uses parameter pack syntax, which allows us to perform an arbitrary number of requests. Another thing I want to mention is that using this method, we must wait for all requests to finish before we can use their results.
This may be fine for some applications. However, in our grocery store application, if the DetectBarcodesRequest detects a barcode, we don’t want to wait for the RecognizeTextRequest to finish before scanning it. Vision provides an alternate API called performAll that allows us to perform multiple requests together and process each request’s results as soon as it finishes. performAll returns results as a stream, meaning that when a request is done, its results are returned immediately, even while other requests are still running.
To access observations, I await results from the stream.
This lets me use the barcode observations immediately when they become available, without waiting for the RecognizeTextRequest to finish.
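A sketch of the streaming variant; the result-case names below follow the pattern from the session’s examples (each case named after its request), so treat them as illustrative:

```swift
let requests: [any VisionRequest] = [
    DetectBarcodesRequest(),
    RecognizeTextRequest()
]

let handler = ImageRequestHandler(imageURL)

// Results arrive as an async stream, in completion order.
for await result in handler.performAll(requests) {
    switch result {
    case .detectBarcodes(_, let barcodes):
        // Act on barcodes immediately; text recognition may still be running.
        print("Found \(barcodes.count) barcode(s)")
    case .recognizeText(_, let observations):
        print("Found \(observations.count) text region(s)")
    default:
        break
    }
}
```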
Now that we’ve covered the basics, let’s talk about optimizing Vision APIs with Swift Concurrency. There are many times when an application may want to process multiple images. These images can be processed one at a time in a for-loop. However, for best performance, we can use concurrency.
This allows us to take batches of images and process them at the same time.
Let’s say we have a photo library with a bunch of images. I want to display the images in a grid view; however, the images are different sizes. The first thing I need to do is crop the images to a square.
The goal is to crop to the main subject of the image. I will use GenerateObjectnessBasedSaliencyImageRequest to identify the location of the main subjects in the image, and create a crop around that.
I’m going to write a function, generateThumbnail, that creates a crop around the salient parts of the image. Then, I’ll use a for-loop to iterate over the images in my library and generate a thumbnail for each one. I’m processing each image one at a time, which can be slow. I can use concurrency to speed this up.
I’ll instead use TaskGroups. With TaskGroups, I can create multiple tasks that perform requests in parallel. I need to make one more adjustment to this code.
Vision requests can be memory-intensive, so I recommend limiting the number of Vision requests performed at the same time to reduce your app’s memory footprint. In this case, I will limit the number of tasks to 5.
When a task finishes, I will add another task to begin processing the next image. This guarantees I am processing no more than 5 requests at a time. I have chosen to limit the number of tasks to 5 for my application. However, you may find a different limit works better for your applications.
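Here’s a sketch of that bounded pattern, assuming a generateThumbnail(url:) helper that performs the saliency request and returns the cropped image:

```swift
// Generate thumbnails with at most five Vision requests in flight,
// keeping the app's memory footprint in check.
func generateAllThumbnails(for urls: [URL]) async throws -> [CGImage] {
    let maxConcurrentTasks = 5  // tune this limit for your app
    var thumbnails: [CGImage] = []

    try await withThrowingTaskGroup(of: CGImage.self) { group in
        var remaining = urls.makeIterator()

        // Seed the group with the first batch of tasks.
        for _ in 0..<maxConcurrentTasks {
            if let url = remaining.next() {
                group.addTask { try await generateThumbnail(url: url) }
            }
        }

        // Each time a task finishes, start processing the next image,
        // so no more than five requests ever run at once.
        while let thumbnail = try await group.next() {
            thumbnails.append(thumbnail)
            if let url = remaining.next() {
                group.addTask { try await generateThumbnail(url: url) }
            }
        }
    }
    return thumbnails
}
```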
Now we’ll demonstrate how to update your applications to the new APIs. Our old Swift APIs aren’t going away. However, to take full advantage of Swift 6 and Swift Concurrency, I strongly recommend adopting the new APIs in your application.
Before we begin, we have one additional housekeeping note. To reduce our memory footprint, Vision will remove CPU and GPU support for some requests on devices with a Neural Engine.
On these devices, the Neural Engine is the most performant option.
You can always check what compute devices are supported by a request using the supportedComputeDevices() API.
Now, let’s dive into how to update an existing Vision application to use the new API. We can do this in just three steps.
First, we need to adopt the new request and observation types. Most requests in the old API have a direct equivalent in the new API.
To get the new request or observation type, remove the VN prefix from Vision type names.
Next, Vision has eliminated the completion handler for requests and replaced it with async/await syntax.
Finally, the observations produced by requests are returned directly from the perform() call. Let’s update some code. This code uses the old Vision API to detect barcodes in an image.
I’ll update VNDetectBarcodesRequest to use the new request by removing the VN prefix.
In the old API, we processed the results in the completion handler. Now, results are returned directly from perform(), so I will remove the completion handler.
I will instead adopt async/await syntax for the perform() call.
To use the asynchronous API, I also need to make my function async.
Instead of getting barcode observations from the completion handler, the observations are now returned directly from the perform() call. You may also notice that I don’t have to unwrap the optional anymore. Lastly, since I’m just performing one request, the imageRequestHandler is not really needed. I’ll remove it to simplify the code.
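Put together, the migration looks roughly like this; the processing bodies are kept minimal:

```swift
// Before: VN-prefixed types, a completion handler, and a request handler.
func detectBarcodes(in image: CGImage) {
    let request = VNDetectBarcodesRequest { request, error in
        guard let observations = request.results as? [VNBarcodeObservation] else { return }
        print(observations.count)
    }
    let handler = VNImageRequestHandler(cgImage: image)
    try? handler.perform([request])
}

// After: no prefix, async/await, and observations returned directly.
func detectBarcodes(in image: CGImage) async throws {
    let request = DetectBarcodesRequest()
    let observations = try await request.perform(on: image)
    print(observations.count)
}
```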
And that’s it.
By adopting the new API, the number of lines of code has been reduced from ten to just six! The streamlined syntax is awesome. Finally, there are two new capabilities offered by Vision this year. Vision is introducing a new request called CalculateImageAestheticsScoresRequest. This request can be used to assess image quality and find memorable photos.
There are multiple factors that get analyzed to determine image quality such as blur and exposure. This quality is represented by an overall score assigned to the image.
The request also identifies utility images, which are images that may be useful but are not particularly memorable, like screenshots or photos of receipts. Let’s see some examples.
This image of a beautiful landscape has a high score. It’s well taken, with good exposure, and the scene looks very memorable.
This image doesn’t have a clear focus. It looks like it was taken by accident and has a low score.
This image of a wooden box is technically well taken, but it might not be a memorable photo that you would want to share with friends. This is a utility image.
To use the new API, just use CalculateImageAestheticsScoresRequest. This request generates an ImageAestheticsScoresObservation, which has an overall score, ranging from -1 to 1, indicating how well taken the image is. This observation also has another property called isUtility, which is true for photos that are well taken but have non-memorable content.
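A minimal sketch, again assuming a file URL as input:

```swift
let request = CalculateImageAestheticsScoresRequest()
let observation = try await request.perform(on: imageURL)

// overallScore ranges from -1 (poorly taken) to 1 (well taken).
print("Aesthetics score: \(observation.overallScore)")

// Utility images are well taken but not memorable,
// like screenshots or photos of receipts.
if observation.isUtility {
    print("This is a utility image.")
}
```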
Also new to Vision this year is holistic body pose.
Previously in Vision, we had separate requests for body and hands pose detection.
Holistic body pose allows you to detect hands and body together.
To use holistic body pose, create a DetectHumanBodyPoseRequest.
Set detectsHands on the request to true.
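A sketch of that setup; the rightHand and leftHand property names below are assumptions based on the description that follows:

```swift
var request = DetectHumanBodyPoseRequest()

// Opt in to detecting hands along with the body pose.
request.detectsHands = true

let bodyPoses = try await request.perform(on: imageURL)

for bodyPose in bodyPoses {
    // Hypothetical property names; each hand observation is
    // present only when that hand was detected.
    if let rightHand = bodyPose.rightHand {
        print("Detected a right hand: \(rightHand)")
    }
    if let leftHand = bodyPose.leftHand {
        print("Detected a left hand: \(leftHand)")
    }
}
```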
This request produces a HumanBodyPoseObservation, which now has two additional properties: one for the right-hand observation and one for the left. That was a lot of new material, so let me recap. Our new Swift API works well in the Swift ecosystem by supporting Concurrency and Swift 6. This will make adoption of Vision in your Swift applications a lot easier.
With that in mind, Vision will only introduce new features in Swift moving forward. Our existing APIs are not going away, but we highly recommend adopting the new API. The new Image Aesthetics and Holistic Body Pose APIs allow you to add new functionality to your applications.
Thanks for watching. And I can’t wait to see all the great apps you create with Vision.