In the previous article, "Geometry of scribbles with ARKit + Vision + iOS14", I copied hand-drawn scribbles; this time I looked into the processing needed to copy a real object. As an overview, Vision + CoreML (DeepLabV3) is used to cut the object out, and ARKit's depthMap (iOS 14 and later) is used to give it a three-dimensional shape.
Completed image
Since the depth information is captured only once, at the moment the copy button is tapped, the result is a relief-like, signboard-style three-dimensional effect. Below is the result of copying a car and displaying it in an SCNView. This article mainly describes the parts of working with depthMap that gave me trouble.
Before getting into the main subject, a few notes on the depth information that can be obtained from ARKit and its reliability.
depthMap
On LiDAR-equipped devices such as the iPhone 12 Pro, depth information can be acquired in real time from the rear camera. In this example, the depth information for the middle part of the screen is visualized in grayscale, but the actual values are distances from the camera in meters (checked on an iPhone 12 Pro, the type is Float32; here the 0 to 5 m range is mapped to 0 to 255 gray levels). The depth in this GIF looks slightly delayed, but that is because iOS screen recording was running; without recording, the delay is barely noticeable. How to acquire and process the depth information is described later.
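As an aside, here is a minimal sketch (my own, not from the article's rendering code) of the 0–5 m to 0–255 grayscale mapping described above:

```swift
// Map a depth value in meters to a gray level, clamping to the 0-5 m range.
func grayValue(forDepthMeters depth: Float32) -> UInt8 {
    let clamped = min(max(depth, 0.0), 5.0)   // 0 m ... 5 m
    return UInt8(clamped / 5.0 * 255.0)       // 0 m -> 0, 5 m -> 255
}
```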
confidenceMap
ARKit also provides a reliability (confidence) map at the same resolution as the depth map. The confidence is defined by ARConfidenceLevel and has three levels: low, medium, and high, with high being the most reliable. This image visualizes the confidence as low = black, medium = gray, high = white. You can see that the confidence near object contours is relatively poor, and that it drops when small objects such as fingers are lined up toward the camera. The surface facing the camera tends to be high confidence, and confidence decreases toward the sides; since there is only one LiDAR sensor, it makes sense that accuracy drops on surfaces the sensor can barely see.
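For reference, a minimal sketch (my own) of how ARConfidenceLevel values could be mapped to the gray levels used in that visualization:

```swift
import ARKit

// low = black, medium = gray, high = white
func grayLevel(for confidence: UInt8) -> UInt8 {
    switch Int(confidence) {
    case ARConfidenceLevel.high.rawValue:   return 255
    case ARConfidenceLevel.medium.rawValue: return 128
    default:                                return 0
    }
}
```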
In this article, it is realized by the following procedure.
① Crop the central 513 x 513 pixels of the captured image
② Crop the depth information at the same position / aspect ratio as ①
③ Crop the depth confidence information in the same way as ②
④ Run segmentation on ① with Vision + CoreML
⑤ Get the coordinates that belong to the object recognized in ④ and are highly reliable according to ③
⑥ Convert the 2D depth information from ⑤ into 3D coordinates
⑦ Turn ⑥ into a 3D model and add it to the scene
Step ② gave me the most trouble. The conversion was awkward because the depth information had to be extracted to match the position and size cropped in ①.
The confidence obtained in ③ is used in the judgment in ⑤ because depth values at the boundary between an object and the background, or between objects, are often suspicious (extremely far away), which would distort the 3D model.
The procedure is explained below.
Vision + CoreML is used to cut out the object, and the Core ML model is the DeepLabV3 model published by Apple. Since the image size this model can handle is 513 x 513, the central part of the captured image is cropped to that size. Note that the captured image passed from ARKit does not take the device orientation or screen size into account; see the previously written ARKit + Vision + iOS14 Geometry ① [Outline Detection] for how to crop the displayed part.
Depth information is provided as a **Float32 array**, so CIImage/CGImage cannot be used to crop or process it. When I started this sample I assumed I could treat it as an image, but the values are distances in meters, so that does not work. I also considered splitting each Float32 into 8-bit chunks and packing them into RGBA, but image processing (color space conversion, edge handling, and so on) might not preserve the original Float32 values, so I decided to handle the raw values properly.
The depth information (depthMap) can be obtained via sceneDepth of ARFrame. In this sample it is accessed inside an ARFrame extension, so the code looks like this.
guard let pixelBuffer = self.sceneDepth?.depthMap else { return ([], 0) }
Checking on iPhone 12 Pro + iOS 14.2, the depth map obtained from depthMap is 256 x 192, while the image obtained from capturedImage of ARFrame is 1920 x 1440.
Both are 1.333:1, so it should be possible to crop the depth map using displayTransform of ARFrame, just as in ①.
var displayTransform = self.displayTransform(for: .portrait, viewportSize: viewPortSize)
The first question is what to pass as viewPortSize when getting displayTransform. The API documentation says "The size, in points, of the view intended for rendering the camera image.", which suggests the intended drawing size, but the depth map's size differs from the screen size. The documentation also states "The affine transform does not scale to the viewport's pixel size.", so presumably only the aspect ratio matters and the absolute size does not need to be given. The result of actually checking is as follows (verified on iPhone 12 Pro + iOS 14.2).
The CGAffineTransform obtained with the screen size in points and the one obtained with only the aspect ratio match, so I decided to pass the screen's aspect ratio. From here, the part of depthMap that CoreML can analyze is extracted using this affine transform.
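The check itself amounts to something like the following sketch (variable names are mine; `frame` is the current ARFrame and `scnView` the ARSCNView); printing both transforms showed identical values:

```swift
// Compare the transform for the viewport in points with one built from the aspect ratio alone.
let pointSize = scnView.bounds.size   // e.g. 390 x 844 pt on iPhone 12 Pro
let aspectOnlySize = CGSize(width: 1.0, height: pointSize.height / pointSize.width)
print(frame.displayTransform(for: .portrait, viewportSize: pointSize))
print(frame.displayTransform(for: .portrait, viewportSize: aspectOnlySize))
```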
displayTransform is a CGAffineTransform whose contents form the following matrix.
\begin{bmatrix}
a & b & 0 \\
c & d & 0 \\
t_{x} & t_{y} & 1 \\
\end{bmatrix}
In portrait orientation, the captured image (and the depth map) is flipped on both the X and Y axes, so that flip is applied as well.
\begin{bmatrix}
-1 & 0 & 0 \\
0 & -1 & 0 \\
1 & 1 & 1 \\
\end{bmatrix}
\begin{bmatrix}
a & b & 0 \\
c & d & 0 \\
t_{x} & t_{y} & 1 \\
\end{bmatrix}
=
\begin{bmatrix}
-a & -b & 0 \\
-c & -d & 0 \\
a+c+t_{x} & b+d+t_{y} & 1 \\
\end{bmatrix}
\begin{bmatrix}
x' & y' & 1 \\
\end{bmatrix}
=
\begin{bmatrix}
x & y & 1 \\
\end{bmatrix}
×
\begin{bmatrix}
-a & -b & 0 \\
-c & -d & 0 \\
a+c+t_{x} & b+d+t_{y} & 1 \\
\end{bmatrix}
x' = -ax - cy + a + c + t_{x} \\
= -a(x-1) - c(y-1) + t_{x} \\
y' = -bx - dy + b + d + t_{y} \\
= -b(x-1) - d(y-1) + t_{y} \\
Substituting the values actually obtained:
x' = 1.62(y-1)+ 1.31 \\
y' = -(x-1) \\
The following can be seen from this result.
- The x-axis and y-axis are swapped.
- The converted x ranges from -0.31 to 1.31. In other words, only the central 1/1.62 portion of the original y range ends up in the visible 0 to 1 range.
- The converted y ranges from 1.0 down to 0.0, so the full original x range is displayed.
Since both sides along the converted x-axis need to be cut off, we compute that size. Because the crop is centered, it can be calculated from c and the height of the depth map.
let sideCutoff = Int((1.0 - (1.0 / displayTransform.c)) / 2.0 * CGFloat(pixelBuffer.height))
With c ≈ 1.62 and a depth-map height of 192, this works out to (1 - 1/1.62) / 2 × 192 ≈ 36. Since the acquired depth map is **256 x 192**, cutting 36 off both sides gives **120 x 256** after conversion (vertical and horizontal swapped). Finally, since the data passed to CoreML must be square, the central square is cut out, resulting in 120 x 120.
func cropPortraitCenterData<T>(sideCutoff: Int) -> ([T], Int) {
CVPixelBufferLockBaseAddress(self, CVPixelBufferLockFlags(rawValue: CVOptionFlags(0)))
guard let baseAddress = CVPixelBufferGetBaseAddress(self) else { return ([], 0) }
let pointer = UnsafeMutableBufferPointer<T>(start: baseAddress.assumingMemoryBound(to: T.self),
count: width * height)
var dataArray: [T] = []
//Compute the on-screen width: the cutoff is removed from both sides of the full horizontal size.
//* Confusingly, the data ARKit provides in portrait is in landscape orientation, so the height is used here.
let size = height - sideCutoff * 2
//Get the data of the vertical center part of the screen. The acquisition order is upside down.
for x in (Int((width / 2) - (size / 2)) ..< Int((width / 2) + (size / 2))).reversed() {
//Get the data of the horizontal center part of the screen. The acquisition order is reversed left and right.
for y in (sideCutoff ..< (height - sideCutoff)).reversed() {
let index = y * width + x
dataArray.append(pointer[index])
}
}
CVPixelBufferUnlockBaseAddress(self, CVPixelBufferLockFlags(rawValue: CVOptionFlags(0)))
return (dataArray, size)
}
The cropping process is written as an extension of CVPixelBuffer. The element type is handled with generics because the confidence data described later is stored in the CVPixelBuffer as an array of UInt8, whereas the depth is Float32. I couldn't find anything that applies an affine transform to the array contents in one shot, so I implemented the axis swap and the vertical / horizontal flips myself (portrait only).
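For completeness, calling the generic method looks like this minimal sketch (the element type is fixed by the declared tuple type; the buffer names are hypothetical, and sideCutoff 36 is the value computed above):

```swift
// T is inferred from the declared result type: Float32 for depth, UInt8 for confidence.
let (depthValues, depthSize): ([Float32], Int) = depthPixelBuffer.cropPortraitCenterData(sideCutoff: 36)
let (confidenceValues, _): ([UInt8], Int) = confidencePixelBuffer.cropPortraitCenterData(sideCutoff: 36)
```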
Once converted this far, it is easy to composite with the camera image, as in the depthMap and confidenceMap GIFs at the beginning of this article (for those, the data is converted to an MTLTexture with Metal and recorded).
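As a rough idea of that step (my own minimal sketch, not the article's recording pipeline), the cropped Float32 depth array can be uploaded into an MTLTexture like this:

```swift
import Metal

// Upload a square Float32 depth array into an .r32Float texture for shader-side visualization.
func makeDepthTexture(device: MTLDevice, depthArray: [Float32], size: Int) -> MTLTexture? {
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .r32Float,
                                                              width: size,
                                                              height: size,
                                                              mipmapped: false)
    descriptor.usage = [.shaderRead]
    guard let texture = device.makeTexture(descriptor: descriptor) else { return nil }
    depthArray.withUnsafeBytes { bytes in
        texture.replace(region: MTLRegionMake2D(0, 0, size, size),
                        mipmapLevel: 0,
                        withBytes: bytes.baseAddress!,
                        bytesPerRow: size * MemoryLayout<Float32>.stride)
    }
    return texture
}
```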
The processing is the same as in ②. The confidenceMap can be obtained via sceneDepth of ARFrame.
guard let pixelBuffer = self.sceneDepth?.confidenceMap else { return ([], 0) }
Note that while the depth information is Float32, the confidence information is UInt8.
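If you want to confirm the element types rather than take them on faith, a quick runtime check looks like this sketch (`frame` is assumed to be the current ARFrame):

```swift
// Print whether the pixel formats match the expected ones: a 32-bit float depth map
// and a one-component 8-bit confidence map.
if let depth = frame.sceneDepth?.depthMap,
   let confidence = frame.sceneDepth?.confidenceMap {
    print(CVPixelBufferGetPixelFormatType(depth) == kCVPixelFormatType_DepthFloat32)
    print(CVPixelBufferGetPixelFormatType(confidence) == kCVPixelFormatType_OneComponent8)
}
```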
Object segmentation is performed with Vision + CoreML (DeepLabV3). The method is the same as in this reference article, "Simple Semantic Image Segmentation in an iOS Application — DeepLabV3 Implementation".
The types of objects DeepLabV3 can segment are as follows, as listed in the model's Metadata in Xcode.
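For reference, a Swift rendering of that label list (the 21 PASCAL VOC classes the published DeepLabV3 model is trained on; treat the exact spelling as indicative and check the model metadata in Xcode for the authoritative list):

```swift
// Index 15 is "person", which is why a hand comes back with label 15 below.
let deepLabV3Labels = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "diningTable", "dog", "horse", "motorbike",
    "person", "pottedPlant", "sheep", "sofa", "train", "tvOrMonitor"
]
```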
When a hand is actually recognized, the label value 15 (the position of "person" in the labels) is obtained. This is an example of photographing a hand and overlaying the segmentation result on the captured image; you can see the hand recognition lags slightly. This time I wanted to recognize not only people but also cars and flower pots, so I used DeepLabV3, but if you only want to recognize the human body, ARKit's People Occlusion is faster. Reference: Everyone is a Super Saiyan with ARKit + Metal
The point is how to access the MLMultiArray that stores the segmentation results. MLMultiArray has a dataPointer property, which can be accessed as an array via assumingMemoryBound(to:).
guard let observations = request.results as? [VNCoreMLFeatureValueObservation],
let segmentationmap = observations.first?.featureValue.multiArrayValue else { return }
//The segmentation result is an array of Int32
let labels = segmentationmap.dataPointer.assumingMemoryBound(to: Int32.self)
let centerLabel = labels[centerIndex]
The result is accessed as an array of Int32 because that is how the output is defined when you inspect DeepLabV3.mlmodel in Xcode, and the values do read correctly as Int32.
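As an aside, MLMultiArray can also be read through its [NSNumber] subscript, which avoids raw pointer handling at the cost of some overhead. A minimal sketch (not what this sample uses), reading one label from the segmentationmap above:

```swift
// The DeepLabV3 output has shape [513, 513], indexed as [row, column].
let y = 256, x = 256
let label = segmentationmap[[NSNumber(value: y), NSNumber(value: x)]].int32Value
```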
Whether to draw is decided by the following rules.
- Of the segmentation results, only the object that matches the one at the center of the screen is displayed.
- Only the parts whose depth information has high confidence are displayed.
The segmentation result is checked by the following isInSegment closure.
let depthDeeplabV3ScaleFactor = Float(self.detectSize) / Float(self.depthSize) //Detection resolution/Depth resolution ratio
let isInSegment: (Int, Int) -> Bool = { (x, y) in
guard x < self.depthSize && y < self.depthSize else { return false }
let segmentX = Int(Float(x) * depthDeeplabV3ScaleFactor + depthDeeplabV3ScaleFactor / 2)
let segmentY = Int(Float(y) * depthDeeplabV3ScaleFactor + depthDeeplabV3ScaleFactor / 2)
let segmentIndex = (segmentedImageSize - segmentY - 1) * segmentedImageSize + segmentX
//Model if it matches the label in the center
return labels[segmentIndex] == centerLabel
}
The arguments are coordinates in the depth map. When this sample runs on iPhone 12 Pro + iOS 14.2 the depth map is 120 x 120, so each coordinate is checked against the corresponding position in the 513 x 513 segmentation result.
The confidence is checked by the following isConfidentDepth closure.
//Only reliable depth information is 3D modeled
let isConfidentDepth: (Int, Int) -> Bool = { (x, y) in
guard x < self.depthSize && y < self.depthSize else { return false }
return self.depthArray[y * self.depthSize + x] >= 0.0
}
Before this check, depth values with low confidence have already been set to -1 in depthArray, so any value of 0.0 or greater is treated as valid.
By the way, this is where -1 is set.
guard depthArray.count == depthConfidenceArray.count else { return }
self.depthArray = depthConfidenceArray.enumerated().map {
//Rewrite the depth to -1 if the confidence is less than high
return $0.element >= UInt8(ARConfidenceLevel.high.rawValue) ? depthArray[$0.offset] : -1
}
The depth reliability information is checked and the corresponding depth information value is rewritten.
self.cameraIntrinsicsInversed = frame.camera.intrinsics.inverse
//(Omitted)
let depth = self.depthArray[y * self.depthSize + x]
let x_px = Float(x) * depthScreenScaleFactor
let y_px = Float(y) * depthScreenScaleFactor
//Convert 2D depth information to 3D
let localPoint = cameraIntrinsicsInversed * simd_float3(x_px, y_px, 1) * depth
The intrinsics of ARCamera contains the matrix that projects 3D camera-space coordinates onto the 2D image. By using its inverse, the 2D depth samples can be unprojected back into 3D.
The code for this part is based on Apple's LiDAR sample Visualizing a Point Cloud Using Scene Depth.
For an intuition of what intrinsics contains, see this article: "I tried to superimpose the evaluation value on a real Othello board with ARKit".
Now that we have the information to be displayed as 3D (the 3D coordinates of an arbitrary image segment and the part with highly reliable depth information), we will model it.
The mesh is built directly from the layout of the depth array. For each depth sample, polygons are created by the rule that the samples directly to the right and directly below form one triangle (pointing down), and the samples directly below and below-left form another (pointing up).
//Triangle (downward)
if isInSegment(x, y), isInSegment(x, y + 1), isInSegment(x + 1, y),
isConfidentDepth(x, y), isConfidentDepth(x, y + 1), isConfidentDepth(x + 1, y) {
//Add the polygon indices if all vertices are inside the segment and their depth is reliable
indices.append(Int32(y * self.depthSize + x))
indices.append(Int32(y * self.depthSize + x + 1))
indices.append(Int32((y + 1) * self.depthSize + x))
if localPoint.y > yMax { yMax = localPoint.y }
if localPoint.y < yMin { yMin = localPoint.y }
}
//Triangle (upward)
if isInSegment(x, y), isInSegment(x - 1, y + 1), isInSegment(x, y + 1),
isConfidentDepth(x, y), isConfidentDepth(x - 1, y + 1), isConfidentDepth(x, y + 1){
//Add the polygon indices if all vertices are inside the segment and their depth is reliable
indices.append(Int32(y * self.depthSize + x))
indices.append(Int32((y + 1) * self.depthSize + x))
indices.append(Int32((y + 1) * self.depthSize + x - 1))
if localPoint.y > yMax { yMax = localPoint.y }
if localPoint.y < yMin { yMin = localPoint.y }
}
With this rule alone, jagged edges are noticeable wherever the boundary accuracy is poor; there is room for improvement here.
For details on how to create geometry & nodes, see this article "How to create custom geometry with SceneKit + bonus".
Once the geometry is created, it will be possible to determine the collision with the floor.
let bodyGeometry = SCNBox(width: 5.0,
height: CGFloat(yMax - yMin),
length: 5.0,
chamferRadius: 0.0)
let bodyShape = SCNPhysicsShape(geometry: bodyGeometry, options: nil)
node.physicsBody = SCNPhysicsBody(type: .dynamic, shape: bodyShape)
//Drop from 3m above
node.simdWorldPosition = SIMD3<Float>(0.0, 3.0, 0.0)
DispatchQueue.main.async {
self.scnView.scene.rootNode.addChildNode(node)
}
Since the created geometry is based on depth information, it spans from the camera out to the distance of the target object, and the geometry's origin is the camera position. The SCNPhysicsShape used for collision detection should ideally match the geometry's size, but here it only needs to collide with the floor, so only the height is taken from the geometry; the width and depth are set to a size (5 m) that is sure to hit the floor.
In this sample, the geometry is created with the camera as the origin, referenced to the direction the camera is facing; the horizontal plane is not taken into account. In the example above, the car was shot looking down at an angle, so the geometry is built at that tilt. If the horizontal were handled properly, a level geometry could be created on the outer white line (the bounding box). I left this for another time, since Apple's LiDAR sample Visualizing a Point Cloud Using Scene Depth appears to convert to world coordinates properly at draw time.
That's all for the explanation. I put this together by trial and error, so there are probably mistakes and better approaches; I'd appreciate it if you pointed them out.
ViewController.swift
import ARKit
import Vision
import UIKit
import SceneKit
class ViewController: UIViewController, ARSessionDelegate, ARSCNViewDelegate {
@IBOutlet weak var scnView: ARSCNView!
//Segmentation size. Matches the input image size of DeepLabV3.mlmodel
private let detectSize: CGFloat = 513.0
// Vision Model
private var visonRequest: VNCoreMLRequest?
//Depth processing result
private var depthArray: [Float32] = []
private var depthSize = 0
private var cameraIntrinsicsInversed: simd_float3x3?
//Texture image for the 3D model. Set to the captured image when the copy button is pressed.
private var texutreImage: CGImage?
//Copy button pressed
private var isButtonPressed = false
//Floor thickness(m)
private let floorThickness: CGFloat = 0.5
//Local coordinates of the floor. Lower the Y coordinate by the thickness of the floor
private lazy var floorLocalPosition = SCNVector3(0.0, -floorThickness/2, 0.0)
//First recognized anchor
private var firstAnchorUUID: UUID?
override func viewDidLoad() {
super.viewDidLoad()
//CoreML settings
setupVison()
//AR Session started
self.scnView.delegate = self
self.scnView.session.delegate = self
let configuration = ARWorldTrackingConfiguration()
if ARWorldTrackingConfiguration.supportsFrameSemantics(.sceneDepth) {
configuration.planeDetection = [.horizontal]
configuration.frameSemantics = [.sceneDepth]
self.scnView.session.run(configuration, options: [.removeExistingAnchors, .resetTracking])
} else {
print("Does not work on this terminal")
}
}
//Anchor added
func renderer(_: SCNSceneRenderer, didAdd node: SCNNode, for anchor: ARAnchor) {
guard anchor is ARPlaneAnchor, self.firstAnchorUUID == nil else { return }
self.firstAnchorUUID = anchor.identifier
//Add floor node
let floorNode = SCNScene.makeFloorNode(width: 10.0, height: self.floorThickness, length: 10.0)
floorNode.position = floorLocalPosition
DispatchQueue.main.async {
node.addChildNode(floorNode)
}
}
//Anchor updated
func renderer(_: SCNSceneRenderer, didUpdate node: SCNNode, for anchor: ARAnchor) {
guard anchor is ARPlaneAnchor else { return }
if let childNode = node.childNodes.first {
DispatchQueue.main.async {
//Reposition floor nodes
childNode.position = self.floorLocalPosition
}
}
}
//AR frame updated
func session(_ session: ARSession, didUpdate frame: ARFrame) {
guard self.isButtonPressed else { return }
self.isButtonPressed = false
let aspectRatio = self.scnView.bounds.height / self.scnView.bounds.width
//Crop the center of the captured image to the DeepLabV3 size (513x513)
let image = frame.cropCenterSquareImage(fullWidthScale: self.detectSize,
aspectRatio: aspectRatio,
orientation: self.scnView.window!.windowScene!.interfaceOrientation)
let context = CIContext(options: nil)
self.texutreImage = context.createCGImage(image, from: image.extent)
//Get depth information
let (depthArray, depthSize) = frame.cropPortraitCenterSquareDepth(aspectRatio: aspectRatio)
//Get depth confidence information
let (depthConfidenceArray, _) = frame.cropPortraitCenterSquareDepthConfidence(aspectRatio: aspectRatio)
//Extract only highly reliable depth information
guard depthArray.count == depthConfidenceArray.count else { return }
self.depthArray = depthConfidenceArray.enumerated().map {
//Rewrite the depth to -1 if the confidence is less than high
return $0.element >= UInt8(ARConfidenceLevel.high.rawValue) ? depthArray[$0.offset] : -1
}
self.depthSize = depthSize
//Inverse matrix of "camera focal length and center point offset information". Prepared to extend 2D depth information to 3D. Reference: https://qiita.com/tanaka-a/items/042fdbd3da6d6332e7e2
self.cameraIntrinsicsInversed = frame.camera.intrinsics.inverse
//Perform segmentation
let handler = VNImageRequestHandler(ciImage: image, options: [:])
try? handler.perform([self.visonRequest!])
}
//The copy button was pressed
@IBAction func pressButton(_ sender: Any) {
isButtonPressed = true
}
}
// MARK: -
extension ViewController {
private func setupVison() {
guard let visionModel = try? VNCoreMLModel(for: DeepLabV3(configuration: MLModelConfiguration()).model) else { return }
let request = VNCoreMLRequest(model: visionModel) { request, error in
//Receive segmentation results
guard let observations = request.results as? [VNCoreMLFeatureValueObservation],
let segmentationmap = observations.first?.featureValue.multiArrayValue else { return }
//3D model generation of mask part
self.draw3DModel(segmentationmap: segmentationmap)
}
request.imageCropAndScaleOption = .centerCrop
self.visonRequest = request
}
private func draw3DModel(segmentationmap: MLMultiArray) {
guard !self.depthArray.isEmpty, let cameraIntrinsicsInversed = self.cameraIntrinsicsInversed else { return }
//The segmentation result is an array of Int32
let labels = segmentationmap.dataPointer.assumingMemoryBound(to: Int32.self)
//Get the label in the center of the screen
let segmentedImageSize = Int(self.detectSize)
let centerIndex = (segmentedImageSize / 2) * segmentedImageSize + (segmentedImageSize / 2)
let centerLabel = labels[centerIndex]
print("Label value in the center of the screen[\(centerLabel)]")
//For each depth sample (120x120 when this sample runs on iPhone 12 Pro + iOS 14.2), determine whether it is a 3D-model target by referring to the corresponding position in the segmentation result (513x513)
let depthDeeplabV3ScaleFactor = Float(self.detectSize) / Float(self.depthSize) //Detection resolution/Depth resolution ratio
let isInSegment: (Int, Int) -> Bool = { (x, y) in
guard x < self.depthSize && y < self.depthSize else { return false }
let segmentX = Int(Float(x) * depthDeeplabV3ScaleFactor + depthDeeplabV3ScaleFactor / 2)
let segmentY = Int(Float(y) * depthDeeplabV3ScaleFactor + depthDeeplabV3ScaleFactor / 2)
let segmentIndex = (segmentedImageSize - segmentY - 1) * segmentedImageSize + segmentX
//Model if it matches the label in the center
return labels[segmentIndex] == centerLabel
}
//Only reliable depth information is 3D modeled
let isConfidentDepth: (Int, Int) -> Bool = { (x, y) in
guard x < self.depthSize && y < self.depthSize else { return false }
return self.depthArray[y * self.depthSize + x] >= 0.0
}
//Generate polygon vertex coordinates and texture coordinates
var vertices: [SCNVector3] = []
var texcoords: [CGPoint] = []
var indices: [Int32] = []
var yMax: Float = 0.0
var yMin: Float = 0.0
let depthScreenScaleFactor = Float(self.scnView.bounds.width * UIScreen.screens.first!.scale / CGFloat(self.depthSize))
for y in 0 ..< self.depthSize {
for x in 0 ..< self.depthSize {
//Create vertex coordinates (vertices that never end up referenced by an index are created as well)
let depth = self.depthArray[y * self.depthSize + x]
let x_px = Float(x) * depthScreenScaleFactor
let y_px = Float(y) * depthScreenScaleFactor
//Convert 2D depth information to 3D
let localPoint = cameraIntrinsicsInversed * simd_float3(x_px, y_px, 1) * depth
//Depth values are positive, but SceneKit's -Z axis points away from the camera, so the Z sign is inverted
vertices.append(SCNVector3(localPoint.x, localPoint.y, -localPoint.z))
//Use the coordinates on the captured image as the texture coordinates
let x_coord = CGFloat(x) * CGFloat(depthDeeplabV3ScaleFactor) / self.detectSize
let y_coord = CGFloat(y) * CGFloat(depthDeeplabV3ScaleFactor) / self.detectSize
texcoords.append(CGPoint(x: x_coord, y: 1 - y_coord))
//Triangle (downward)
if isInSegment(x, y), isInSegment(x, y + 1), isInSegment(x + 1, y),
isConfidentDepth(x, y), isConfidentDepth(x, y + 1), isConfidentDepth(x + 1, y) {
//Add the polygon indices if all vertices are inside the segment and their depth is reliable
indices.append(Int32(y * self.depthSize + x))
indices.append(Int32(y * self.depthSize + x + 1))
indices.append(Int32((y + 1) * self.depthSize + x))
if localPoint.y > yMax { yMax = localPoint.y }
if localPoint.y < yMin { yMin = localPoint.y }
}
//Triangle (upward)
if isInSegment(x, y), isInSegment(x - 1, y + 1), isInSegment(x, y + 1),
isConfidentDepth(x, y), isConfidentDepth(x - 1, y + 1), isConfidentDepth(x, y + 1){
//Add the polygon indices if all vertices are inside the segment and their depth is reliable
indices.append(Int32(y * self.depthSize + x))
indices.append(Int32((y + 1) * self.depthSize + x))
indices.append(Int32((y + 1) * self.depthSize + x - 1))
if localPoint.y > yMax { yMax = localPoint.y }
if localPoint.y < yMin { yMin = localPoint.y }
}
}
}
//Geometry creation
let vertexSource = SCNGeometrySource(vertices: vertices)
let texcoordSource = SCNGeometrySource(textureCoordinates: texcoords)
let geometryElement = SCNGeometryElement(indices: indices, primitiveType: .triangles)
let geometry = SCNGeometry(sources: [vertexSource, texcoordSource], elements: [geometryElement])
//Material creation
let material = SCNMaterial()
material.lightingModel = .constant
material.diffuse.contents = self.texutreImage
geometry.materials = [material]
//Node creation
let node = SCNNode(geometry: geometry)
//The collision shape only needs a rough size; it just has to hit the floor reliably
let bodyGeometry = SCNBox(width: 5.0,
height: CGFloat(yMax - yMin),
length: 5.0,
chamferRadius: 0.0)
let bodyShape = SCNPhysicsShape(geometry: bodyGeometry, options: nil)
node.physicsBody = SCNPhysicsBody(type: .dynamic, shape: bodyShape)
//Drop from 3m above
node.simdWorldPosition = SIMD3<Float>(0.0, 3.0, 0.0)
DispatchQueue.main.async {
self.scnView.scene.rootNode.addChildNode(node)
}
}
}
SwiftExtensions.swift
import UIKit
import ARKit
extension CVPixelBuffer {
var width: Int { CVPixelBufferGetWidth(self) }
var height: Int { CVPixelBufferGetHeight(self) }
func cropPortraitCenterData<T>(sideCutoff: Int) -> ([T], Int) {
CVPixelBufferLockBaseAddress(self, CVPixelBufferLockFlags(rawValue: CVOptionFlags(0)))
guard let baseAddress = CVPixelBufferGetBaseAddress(self) else { return ([], 0) }
let pointer = UnsafeMutableBufferPointer<T>(start: baseAddress.assumingMemoryBound(to: T.self),
count: width * height)
var dataArray: [T] = []
//Compute the on-screen width: the cutoff is removed from both sides of the full horizontal size.
//* Confusingly, the data ARKit provides in portrait is in landscape orientation, so the height is used here.
let size = height - sideCutoff * 2
//Get the data of the vertical center part of the screen. The acquisition order is upside down.
for x in (Int((width / 2) - (size / 2)) ..< Int((width / 2) + (size / 2))).reversed() {
//Get the data of the horizontal center part of the screen. The acquisition order is reversed left and right.
for y in (sideCutoff ..< (height - sideCutoff)).reversed() {
let index = y * width + x
dataArray.append(pointer[index])
}
}
CVPixelBufferUnlockBaseAddress(self, CVPixelBufferLockFlags(rawValue: CVOptionFlags(0)))
return (dataArray, size)
}
}
extension ARFrame {
func cropCenterSquareImage(fullWidthScale: CGFloat, aspectRatio: CGFloat, orientation: UIInterfaceOrientation) -> CIImage {
let pixelBuffer = self.capturedImage
//Convert input image to screen size
let imageSize = CGSize(width: pixelBuffer.width, height: pixelBuffer.height)
let image = CIImage(cvImageBuffer: pixelBuffer)
// 1) Normalize the input image to 0.0...1.0 coordinates
let normalizeTransform = CGAffineTransform(scaleX: 1.0/imageSize.width, y: 1.0/imageSize.height)
// 2)For portraits, flip the X and Y axes
var flipTransform = CGAffineTransform.identity
if orientation.isPortrait {
//Invert both X and Y axes
flipTransform = CGAffineTransform(scaleX: -1, y: -1)
//Since both the X and Y axes move to the minus side, move to the plus side.
flipTransform = flipTransform.concatenating(CGAffineTransform(translationX: 1, y: 1))
}
// 3)Move to the orientation / position of the screen on the input image
let viewPortSize = CGSize(width: fullWidthScale, height: fullWidthScale * aspectRatio)
let displayTransform = self.displayTransform(for: orientation, viewportSize: viewPortSize)
// 4) Convert from the 0.0...1.0 coordinate system to the screen coordinate system
let toViewPortTransform = CGAffineTransform(scaleX: viewPortSize.width, y: viewPortSize.height)
// 5) Apply transforms 1) through 4) and crop the converted image to the specified size
let transformedImage = image
.transformed(by: normalizeTransform
.concatenating(flipTransform)
.concatenating(displayTransform)
.concatenating(toViewPortTransform))
.cropped(to: CGRect(x: 0,
y: CGFloat(Int(viewPortSize.height / 2.0 - fullWidthScale / 2.0)),
width: fullWidthScale,
height: fullWidthScale))
return transformedImage
}
func cropPortraitCenterSquareDepth(aspectRatio: CGFloat) -> ([Float32], Int) {
guard let pixelBuffer = self.sceneDepth?.depthMap else { return ([], 0) }
return cropPortraitCenterSquareMap(pixelBuffer, aspectRatio)
}
func cropPortraitCenterSquareDepthConfidence(aspectRatio: CGFloat) -> ([UInt8], Int) {
guard let pixelBuffer = self.sceneDepth?.confidenceMap else { return ([], 0) }
return cropPortraitCenterSquareMap(pixelBuffer, aspectRatio)
}
private func cropPortraitCenterSquareMap<T>(_ pixelBuffer: CVPixelBuffer, _ aspectRatio: CGFloat) -> ([T], Int) {
let viewPortSize = CGSize(width: 1.0, height: aspectRatio)
var displayTransform = self.displayTransform(for: .portrait, viewportSize: viewPortSize)
//In the case of portrait, both X-axis and Y-axis are inverted
var flipTransform = CGAffineTransform(scaleX: -1, y: -1)
//Since both the X and Y axes move to the minus side, move to the plus side.
flipTransform = flipTransform.concatenating(CGAffineTransform(translationX: 1, y: 1))
displayTransform = displayTransform.concatenating(flipTransform)
let sideCutoff = Int((1.0 - (1.0 / displayTransform.c)) / 2.0 * CGFloat(pixelBuffer.height))
return pixelBuffer.cropPortraitCenterData(sideCutoff: sideCutoff)
}
}
extension SCNScene {
static func makeFloorNode(width: CGFloat, height: CGFloat, length: CGFloat) -> SCNNode {
let geometry = SCNBox(width: width, height: height, length: length, chamferRadius: 0.0)
let material = SCNMaterial()
material.lightingModel = .shadowOnly
geometry.materials = [material]
let node = SCNNode(geometry: geometry)
node.castsShadow = false
node.physicsBody = SCNPhysicsBody.static()
return node
}
}