
Manipulate AR objects using UI Gesture and Hand Gesture

This is Shuochen Wang from the R&D department at Flect. In this blog I am going to explain how to move AR objects by both translation and rotation using UI Gesture and Hand Gesture.

Introduction

Why do we need to learn about AR?

Learning about AR (Augmented Reality) has become more important than ever. In 2020, AR disappeared from Gartner’s Hype Cycle. This means

AR has reached maturity and became an industry-proofed technology that executives can safely invest in to improve and innovate their business.(AR Post)1

In other words, AR is going to be an essential technology behind other new, trending technologies. As a software developer or engineer, one needs to learn how to use AR.

Why use Apple for AR apps?

The first decision one has to make is whether to develop the AR app for the Apple or the Android platform. I have chosen Apple for the following reasons. First, to quote from Apple's official website:

Apple has the world’s largest AR platform, with thousands of AR apps on the App Store 2

Second, with the introduction of the LiDAR Scanner, there are many things you can achieve that were not possible before. The LiDAR Scanner (sensor) is available on the iPad Pro 12.9-inch, iPad Pro 11-inch, iPhone 12 Pro, and iPhone 12 Pro Max. Specifically, the LiDAR Scanner allows the phone to accurately detect the distance of a real-world object (a hand in our case) and lets it interact with the AR objects in the app. By using the LiDAR Scanner, the app becomes an MR (mixed reality) app instead of a pure AR app.
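If you plan to rely on the LiDAR Scanner, it is worth checking for support at runtime. Below is a minimal sketch using ARKit's scene-reconstruction capability check, which only LiDAR-equipped devices pass; the exact configuration shown is an illustrative assumption, not code from this project:

    import ARKit

    // Scene reconstruction is only supported on LiDAR-equipped devices,
    // so this check can be used to decide whether to enable it.
    let configuration = ARWorldTrackingConfiguration()
    if ARWorldTrackingConfiguration.supportsSceneReconstruction(.mesh) {
        // LiDAR available: ARKit can build a mesh of the real world.
        configuration.sceneReconstruction = .mesh
    } else {
        // No LiDAR: fall back to plain world tracking.
        print("Scene reconstruction (LiDAR) is not supported on this device")
    }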

What is the difference between VR, MR and AR?

Before moving on, it is necessary to distinguish between VR (Virtual Reality), MR and AR.

Firstly there is VR, which is a fully immersive virtual environment, usually delivered through a head-mounted display (HMD). The user is completely immersed in the virtual environment and is unable to see the real world. An example of a VR product is the Oculus Quest.

On the other end of the spectrum there is AR. AR works by adding 3D-rendered objects to the real world. However, the AR objects cannot interact with the real world. Most of the AR apps on smartphones are of this kind.

Finally there is MR. Also referred to as AR 2.0, MR blends the physical and digital worlds. This means the AR objects are aware of, and can interact with, the real world. Conversely, the real world can affect the AR objects. An example of an MR device is the Microsoft HoloLens.

What is ARKit?

ARKit is Apple's framework that handles the processing needed to build Augmented Reality apps and games for iOS devices, and it is updated every year. ARKit integrates seamlessly with the LiDAR Scanner. Overall, ARKit is highly polished, extremely accurate, and outstandingly realistic. ARKit provides the high-level API; it is still necessary to specify a framework to render (draw) its contents. As the rendering option, SceneKit will be used.

What is SceneKit?

SceneKit is a high-performance rendering engine and 3D graphics framework. It is built on top of Metal (the lowest-level rendering framework), which delivers the highest performance possible. In this project, there will be no adjustments to the rendering.

What is UI Gesture?

Quoting from Apple's official documentation,

A gesture-recognizer object—or, simply, a gesture recognizer—decouples the logic for recognizing a sequence of touches (or other input) and acting on that recognition. 3

This means that if one of the specified gestures is detected on the phone screen, the recognizer calls its action method (for example an IBAction) to perform the appropriate activity. UIKit provides gesture recognizers for a variety of gestures such as Tap, Pinch, Rotation, Swipe, Pan and so on. In this blog, we will use the Pan and Rotation gestures.

What is Hand Gesture?

Since 2020, it has also been possible to manipulate AR objects using hand gestures. Hand gesture recognition can be achieved with the Vision framework. The Vision framework allows the phone to identify the pose of a person's body or hands. Not only can the Vision framework detect whether a hand is present, it can also locate the fingers. With this information one can easily specify which finger joint to track. This Qiita post (https://qiita.com/john-rocky/items/29c2cf791051c7205302) summarizes the finger joint locations nicely.

[Figure: Variable names for the finger joints in the Vision framework]

Start Up

Setting up the view

The first thing we need to do is create an AR project. Go to File -> New -> Project, then select Augmented Reality App.

Next, give the project a name of your choice and select the development account that you have registered. Leave all the other settings as they are (SceneKit, Swift) and follow the wizard to finish creating the Augmented Reality App template.

The default project comes with all the skeleton files that one needs for the app. It even comes with a nice AR airplane if you wish to use it. In this example, we will create our own simple AR objects, so delete everything inside the viewDidLoad method.

Then, check the Main.storyboard view to see if it contains an ARSCNView. If not, we need to add the ARKit SceneKit View, referred to as ARSCNView, to the app. This needs to be done both in Main.storyboard and in ViewController.swift. First of all, we add an ARSCNView in Main.storyboard by clicking on the plus sign in the top right corner.

[Figure: ARSCNView]

Now drag the ARSCNView so that it fully occupies the screen. This makes the app display the ARSCNView once it is launched.

Next, it is necessary to link the view to ViewController.swift. Click Editor -> Assistant to open dual views. Make one view display Main.storyboard and the other ViewController.swift. Then click the view while holding the control key and drag it into ViewController.swift to add it as an IBOutlet. This IBOutlet will be referred to as "sceneView" for the rest of the project. Change viewDidLoad to the code below:
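For reference, the resulting class declaration in ViewController.swift should look roughly like this (the Augmented Reality App template already declares the same outlet by default):

    import UIKit
    import SceneKit
    import ARKit

    class ViewController: UIViewController, ARSCNViewDelegate {

        // The ARSCNView connected from Main.storyboard
        @IBOutlet var sceneView: ARSCNView!

        // viewDidLoad and the other methods discussed below go here
    }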

 override func viewDidLoad() {
        super.viewDidLoad()
        
        // Set the view's delegate
        sceneView.delegate = self
        
        // Show statistics such as fps and timing information
        sceneView.showsStatistics = true
          
        let configuration = ARWorldTrackingConfiguration()
        sceneView.session.run(configuration)
    }

Code snippet 1 Basic viewDidLoad

Adding AR objects

The last step of the initial setup is to add the AR objects to the sceneView. Every AR object, regardless of shape and size, is considered a node. We are going to create three AR cubes, so we create a method called addCube:

    func addCube(position: SCNVector3, name: String) -> SCNNode{
        let box = SCNBox(width: 0.02, height: 0.02, length: 0.02, chamferRadius: 0)
        let boxNode = SCNNode(geometry: box)
        box.firstMaterial?.diffuse.contents = UIColor.red
        boxNode.position = position
        boxNode.name = name
        return boxNode
    }

Code snippet 2 addCube

A cube can be created with the pre-built class SCNBox. The method addCube creates a cube with the given name at any position we desire. The center of the camera is the origin of this AR world. Distances are measured in meters, so 0.1 corresponds to 10 cm in real life. Having a name for each node is also important, because it is the only way to distinguish between the nodes. The reason we do not add a physicsBody is that doing so would cause the cubes to fall to the ground because of gravity, and even with gravity removed, the cubes would bounce off each other when they collide.
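Because each node carries a name, it can later be retrieved from the scene graph by that name. A minimal sketch using SceneKit's standard lookup (for example inside the view controller, after the cubes have been added):

    // Look up a cube by the name given in addCube.
    // childNode(withName:recursively:) returns nil if no such node exists.
    if let box2 = sceneView.scene.rootNode.childNode(withName: "box2", recursively: true) {
        // For example, recolor the cube we found.
        box2.geometry?.firstMaterial?.diffuse.contents = UIColor.blue
    }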

Now all we need to do is to call this method to create the node, and then add the node to the sceneView. The completed code becomes:

    override func viewDidLoad() {
        super.viewDidLoad()
        
        // Set the view's delegate
        sceneView.delegate = self
        
        // Show statistics such as fps and timing information
        sceneView.showsStatistics = true
        
        let boxNode1 = addCube(position: SCNVector3(0.05,0,0), name: "box")
        let boxNode2 = addCube(position: SCNVector3(0,-0.05,0), name: "box2")
        let boxNode3 = addCube(position: SCNVector3(0,0.05,0), name: "box3")

        sceneView.scene.rootNode.addChildNode(boxNode1)
        sceneView.scene.rootNode.addChildNode(boxNode2)
        sceneView.scene.rootNode.addChildNode(boxNode3)
        
        let configuration = ARWorldTrackingConfiguration()
        sceneView.session.run(configuration)
    }

Code snippet 3 Complete viewDidLoad

We have successfully set up the AR world environment, and upon launch the app displays three AR cubes at the coordinates we have specified.

Setting up UI Gesture recognizers

One of the most basic capabilities of any AR application is being able to manipulate the AR objects. First we control the objects using UI Gesture recognizers because they are easier to set up; then I will explain how to control the objects using hand gestures in the following section.

Before we can define the actions themselves, we need to add the recognizers to the sceneView. This can be achieved with the following two lines:

sceneView.addGestureRecognizer(UIPanGestureRecognizer(target: self, action: #selector(ViewController.handleMove(_:))))
sceneView.addGestureRecognizer(UIRotationGestureRecognizer(target: self, action: #selector(ViewController.handleRotate(_:))))

Code snippet 4 addGestureRecognizer

Setting up the Pan Gesture for Translation

With a pan gesture, a finger is pressed on the screen and the recognizer extracts the coordinate of where it is pressed, so we can move the object to that position as the finger moves. This is the perfect gesture for dragging an object. For an instant position change without dragging, use a Tap Gesture instead.

So how do we achieve translation using the Pan Gesture? First, we obtain the currently touched position as a coordinate. Then we conduct a hit test to determine whether this position actually hits any of the AR objects we have placed. Next, we convert the on-screen coordinate to a world coordinate of the AR world. Finally, we move the node using that world coordinate. The complete code is as follows:

    @objc func handleMove(_ gesture: UIPanGestureRecognizer) {

        //1. Get The Current Touch Point
        let location = gesture.location(in: self.sceneView)

        //2. Get The Next Feature Point Etc
        guard let nodeHitTest = self.sceneView.hitTest(location, options: nil).first else { print("no node"); return }

        let nodeHit = nodeHitTest.node

        //3. Convert To World Coordinates
        let worldTransform = nodeHitTest.simdWorldCoordinates

        //4. Apply To The Node
        nodeHit.position = SCNVector3(worldTransform.x, worldTransform.y, 0)
    }

Code snippet 5 handleMove single movement version

We can move the AR cube that is hit by the hit test anywhere we want using the Pan Gesture. What if we want to move the whole group at the same time (i.e. maintaining the relative distance between the cubes)? We can modify the code as follows:

    @objc func handleMove(_ gesture: UIPanGestureRecognizer) {

        //1. Get The Current Touch Point
        let location = gesture.location(in: self.sceneView)

        //2. Get The Next Feature Point Etc
        guard let nodeHitTest = self.sceneView.hitTest(location, options: nil).first else { print("no node"); return }

        let nodeHit = nodeHitTest.node
        // Remember where the touched cube was before moving it
        let original_x = nodeHitTest.node.position.x
        let original_y = nodeHitTest.node.position.y

        //3. Convert To World Coordinates
        let worldTransform = nodeHitTest.simdWorldCoordinates

        //4. Apply To The Node
        nodeHit.position = SCNVector3(worldTransform.x, worldTransform.y, 0)

        //5. Apply the same offset to the sibling cubes so the group keeps its shape
        for node in nodeHit.parent!.childNodes {
            if node.name != nodeHit.name {
                let old_x = node.position.x
                let old_y = node.position.y
                node.position = SCNVector3((nodeHit.simdPosition.x - original_x + old_x), (nodeHit.simdPosition.y - original_y + old_y), 0)
            }
        }
    }

Code snippet 6 handleMove group movement version

For the node that is hit by the hit test, we just apply the transformation as before. However, we cannot do the same for the other two cubes, because that would move all three cubes into one place. Instead, we calculate the offset by which the touched cube moved and apply the same offset to the other cubes, thus maintaining the relative distances. The following video demonstrates the code in action.
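As a side note, another way to keep the cubes together (the "grouping nodes" idea listed as an optional feature in the repository) would be to parent all cubes under a single container node and translate only that container. A rough sketch, not the approach used in this blog:

    // Hypothetical alternative: group the cubes under one container node (e.g. in viewDidLoad).
    let group = SCNNode()
    group.name = "cubeGroup"
    group.addChildNode(addCube(position: SCNVector3(0.05, 0, 0), name: "box"))
    group.addChildNode(addCube(position: SCNVector3(0, -0.05, 0), name: "box2"))
    group.addChildNode(addCube(position: SCNVector3(0, 0.05, 0), name: "box3"))
    sceneView.scene.rootNode.addChildNode(group)

    // In handleMove, the hit test still returns an individual cube, so move its parent instead:
    // nodeHit.parent?.position = SCNVector3(worldTransform.x, worldTransform.y, 0)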

Setting up the Rotation Gesture for Rotation

Rotation works similarly to translation. We need to keep track of whether the fingers are still on the screen, which indicates that the rotation is still in progress. We can add extra animations while the cube is rotating, although this is not necessary. When we rotate, we change the eulerAngles of the node. We can also rotate every node in the scene, as before, by modifying the code in the same way as in code snippet 6. This time it is even easier, because we no longer need to keep track of the relative distance between the cubes.
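The handler below uses two pieces of state, currentAngleY and isRotating. They are assumed here to be declared as properties on the view controller, for example:

    // Assumed to be declared on ViewController:
    // the cube's last committed rotation around the y-axis, and a flag
    // indicating whether a rotation gesture is currently in progress.
    var currentAngleY: Float = 0.0
    var isRotating = false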

    @objc func handleRotate(_ gesture: UIRotationGestureRecognizer) {
        let location = gesture.location(in: sceneView)
        guard let nodeHitTest = self.sceneView.hitTest(location, options: nil).first else { print("no node"); return }
        let nodeHit = nodeHitTest.node

        //1. Get The Current Rotation From The Gesture
        let rotation = Float(gesture.rotation)

        //2. If The Gesture State Has Changed Set The Node's eulerAngles.y
        if gesture.state == .changed {
            isRotating = true
            nodeHit.eulerAngles.y = currentAngleY + rotation
        }

        //3. If The Gesture Has Ended Store The Last Angle Of The Cube
        if gesture.state == .ended {
            currentAngleY = nodeHit.eulerAngles.y
            isRotating = false
        }
    }

Code snippet 7 handleRotate

Setting up Hand Gesture

So far, most of what I have covered is rather fundamental. The app controls well but somehow feels lacking in terms of engagement from the user. How can we add more MR capability to this app? One way is to add hand gesture recognition. Instead of touching the screen to control the objects, we can control the objects with hand gestures. Now the app can be controlled both with UI gestures and with hand gestures.

Setting up Hand Gesture for Translation

The reader may have noticed that the previous UI Gesture method can be achieved with just a few lines of code. This is because a lot of the methods have been predefined by the API and all we need to do is call them. When we use hand gestures, there is no equivalent of addGestureRecognizer, so we have to define the flow from scratch. It also will not work if we define it in viewDidLoad, because viewDidLoad is executed only once, while we need the camera to track the hand continuously. One possible calling timing is to add the method in the following manner:

    func renderer(_ renderer: SCNSceneRenderer, didRenderScene scene: SCNScene, atTime time: TimeInterval) {
        DispatchQueue.main.async {
           self.updateCoreML()
        }
    }

Code snippet 8 renderer

In this way, updateCoreML is called every time a frame is rendered, making our hand detection work in real time.

So what happens inside updateCoreML? In terms of the process, first we capture the camera image as a CMSampleBuffer at every fixed interval, then pass it to the Vision framework, which performs the request through a VNImageRequestHandler and stores the locations of the 21 finger joints in an array. Finally we retrieve the information that we need, which is the location of the index tip and its distance from the camera. You can edit this part to make the activation gesture more sophisticated. For example, you can make hand tracking work only when the thumb and index finger are closed together. In addition, you can even design different gestures for translation and rotation. In this blog, I will use the simplest model, which is a single finger joint and the distance of that joint.
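For reference, here is a minimal sketch of what updateCoreML could look like. The repository's implementation differs in detail (it feeds a CMSampleBuffer and stores all 21 joints); this sketch uses the current ARFrame's capturedImage instead, and names such as handPoseRequest and indexTip are illustrative assumptions. It assumes import Vision at the top of ViewController.swift, with both the property and the method living on the view controller:

    // Created once and reused for every frame.
    let handPoseRequest = VNDetectHumanHandPoseRequest()

    func updateCoreML() {
        // 1. Grab the latest camera image from the AR session.
        guard let pixelBuffer = sceneView.session.currentFrame?.capturedImage else { return }

        // 2. Run the hand pose request on that image.
        handPoseRequest.maximumHandCount = 1
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
        do {
            try handler.perform([handPoseRequest])
        } catch {
            print("Hand pose request failed: \(error)")
            return
        }

        // 3. Retrieve the index finger tip in Vision's normalized coordinates.
        guard let observation = handPoseRequest.results?.first,
              let indexTipPoint = try? observation.recognizedPoint(.indexTip),
              indexTipPoint.confidence > 0.3 else { return }

        // Vision's origin is at the bottom-left, so flip the y value before
        // converting to screen coordinates with VNImagePointForNormalizedPoint
        // (see the next snippet).
        let indexTip: CGPoint? = CGPoint(x: indexTipPoint.location.x,
                                         y: 1 - indexTipPoint.location.y)
        _ = indexTip
    }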

The Vision framework does all the error handling, so if values are stored in the array, it means a hand is present and all the joints were detected correctly. Although the code sample stores all 21 finger joints, we only need one of them, indexTip. Once this has been stored as a CGPoint, we need to convert the normalized point to screen coordinates with VNImagePointForNormalizedPoint in the following line:

let indexTip2 = VNImagePointForNormalizedPoint(indexTip!, Int(self.sceneView.bounds.size.width), Int(self.sceneView.bounds.size.height))

Code snippet 9 VNImagePoint

Once we have this data, we can perform the same hit test as in the UI Gesture method to determine whether the finger hits an AR cube. We can move all the cubes in the scene, or we can move only the affected cube. Below I provide the code for moving all the cubes, together with the video demo.

            guard let nodeHitTest = self.sceneView.hitTest(indexTip2, options: nil).first else { return }

            let nodeHit = nodeHitTest.node
            let original_x = nodeHitTest.node.position.x
            let original_y = nodeHitTest.node.position.y

            //3. Convert To World Coordinates
            let worldTransform = nodeHitTest.simdWorldCoordinates

            //4. Apply To The Node
            nodeHit.position = SCNVector3(worldTransform.x, worldTransform.y, 0)

            for node in nodeHit.parent!.childNodes {
                if node.name != nil {
                    if node.name != nodeHit.name {
                        let old_x = node.position.x
                        let old_y = node.position.y
                        node.position = SCNVector3((nodeHit.simdPosition.x - original_x + old_x), (nodeHit.simdPosition.y - original_y + old_y), 0)
                    }
                }
            }

Code snippet 10 Translation using hand gesture

It is even possible to use the distance of the finger joint from the camera as the z value. In that case, it is best to apply a Kalman filter to reduce the noise from the LiDAR Scanner. Also, since the distance from the camera is in a different coordinate system from the AR world coordinate system, one needs to apply the change in the z value rather than the raw distance from the camera; otherwise, the cube would move very far from the origin.
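For reference, a very simple one-dimensional Kalman filter for smoothing the depth reading could look like the sketch below. The noise constants are illustrative assumptions, not tuned values from the repository:

    // A minimal 1-D Kalman filter sketch for smoothing noisy depth readings.
    struct SimpleKalmanFilter {
        var estimate: Float = 0            // current filtered depth
        var errorCovariance: Float = 1     // uncertainty of the estimate
        let processNoise: Float = 1e-4     // how much the true depth is expected to drift
        let measurementNoise: Float = 1e-2 // how noisy the LiDAR readings are assumed to be

        mutating func update(measurement: Float) -> Float {
            // Predict: assume the depth stays roughly constant between frames.
            errorCovariance += processNoise

            // Update: blend the prediction with the new measurement.
            let kalmanGain = errorCovariance / (errorCovariance + measurementNoise)
            estimate += kalmanGain * (measurement - estimate)
            errorCovariance *= (1 - kalmanGain)
            return estimate
        }
    }

    // Usage: feed the raw distance from the camera every frame.
    // var depthFilter = SimpleKalmanFilter()
    // let smoothedZ = depthFilter.update(measurement: rawDistanceFromCamera)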

Setting up Hand Gesture for Rotation

Again, rotation works similarly to translation. We still call the Vision framework to perform the hand pose request and obtain the position of indexTip2. Then the code becomes:

guard let nodeHitTest = self.sceneView.hitTest(indexTip2, options: nil).first else { print("no node"); return }
let nodeHit = nodeHitTest.node

// Rotate the hit cube around the y-axis by a fixed increment (0.1 rad)
// on every frame while the finger stays on the cube.
nodeHit.eulerAngles.y = currentAngleY + 0.1
currentAngleY += 0.1

Code snippet 11 Rotation using hand gesture

This time we do not even need the z value, as it makes little sense for rotation. Below is the rotation video in action:

Now our app offers a much more immersive experience, because the user can actually control the AR objects using hand gestures.

As I stated before, I am using the simplest model with only one finger position, so it is not possible to do both translation and rotation with hand gestures at the same time. If you want to support both through hand gestures, it is necessary to create a method that determines the hand pose first. One possible implementation would be to use the hand gesture "2" for translation and "3" for rotation. Another would be to use "thumb up" for translation and "thumb down" for rotation. As long as your hand gestures can be distinguished from each other, you can add more gestures to cover zoom and other MR operations (copying an object, changing its color, etc.) as well.
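As an illustration of such a pose classifier, one could compare the thumb tip and wrist positions returned by Vision to decide between "thumb up" and "thumb down". This is only a sketch; the confidence threshold and the mapping to translation/rotation are assumptions:

    // Hypothetical pose classifier (requires import Vision).
    enum HandPose {
        case thumbUp, thumbDown, unknown
    }

    func classifyPose(from observation: VNHumanHandPoseObservation) -> HandPose {
        guard let thumbTip = try? observation.recognizedPoint(.thumbTip),
              let wrist = try? observation.recognizedPoint(.wrist),
              thumbTip.confidence > 0.3, wrist.confidence > 0.3 else { return .unknown }

        // In Vision's normalized coordinates the origin is at the bottom-left,
        // so a larger y means higher up in the image.
        if thumbTip.location.y > wrist.location.y {
            return .thumbUp       // e.g. trigger translation
        } else {
            return .thumbDown     // e.g. trigger rotation
        }
    }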

Conclusion

In this blog I have explained how to set up an AR project with AR objects, and then control the objects using UI gestures and hand gestures. There are many other possible UI gestures to implement, and what can be achieved using hand tracking and the LiDAR Scanner is almost endless. I hope this blog has helped you begin your AR journey.

Reference

The complete source code has been uploaded to my GitHub repository. It includes three optional features that you can implement (Kalman filter, tap gesture, grouping nodes).


Microsoft documentation explaining what mixed reality is: Mixed Reality とは - Mixed Reality | Microsoft Docs

Apple documentation for handling UIKit gestures: https://developer.apple.com/documentation/uikit/touches_presses_and_gestures/handling_uikit_gestures

Human Interface Guidelines by Apple on how to design better UIs: Human Interface Guidelines - Design - Apple Developer

Apple documentation for the Vision framework: https://developer.apple.com/documentation/vision/detecting_hand_poses_with_vision

Apple documentation for the joint names: https://developer.apple.com/documentation/vision/vnhumanhandposeobservation/jointname

One way to implement hand pose detection (written in Korean): [Vision] Detect Hand Pose with Vision (2) - ThumbUp & ThumbDown

URLs for the superscript references: