With this hands-on guide to 3D deep learning, developers working with 3D computer vision will be able to put their knowledge to work and get up and running in no time.
Complete with step-by-step explanations of essential concepts and practical examples, this book lets you explore and gain a thorough understanding of state-of-the-art 3D deep learning. You'll see how to use PyTorch3D for basic 3D mesh and point cloud data processing, including loading and saving PLY and OBJ files, projecting 3D points onto camera coordinates using perspective or orthographic camera models, rendering point clouds and meshes to images, and much more. As you implement some of the latest 3D deep learning algorithms, such as differentiable rendering, NeRF, SynSin, and Mesh R-CNN, you'll realize how coding for these deep learning models becomes easier using the PyTorch3D library.
By the end of this deep learning book, you’ll be ready to implement your own 3D deep learning models confidently.
Design and develop your computer vision model with 3D data using PyTorch3D and more
Xudong Ma
Vishakh Hegde
Lilit Yolyan
BIRMINGHAM—MUMBAI
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Dinesh Chaudhary
Content Development Editor: Joseph Sunil
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Project Coordinator: Farheen Fathima
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Ponraj Dhandapani
Marketing Coordinator: Shifa Ansari
First published: November 2022
Production reference: 1211022
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80324-782-3
www.packt.com
"To my wife and family, for their support and encouragement at every step". - Vishakh Hegde
"To my family and friends, whose love and support have been my biggest motivation". - Lilit Yolyan
Xudong Ma is a Staff Machine Learning Engineer with Grabango Inc. in Berkeley, California. He was a Senior Machine Learning Engineer at Facebook (Meta) Oculus and worked closely with the PyTorch3D team on 3D facial tracking projects. He has many years of experience working on computer vision, machine learning, and deep learning and holds a Ph.D. in Electrical and Computer Engineering.
Vishakh Hegde is a Machine Learning and Computer Vision researcher. He has over 7 years of experience in the field, during which he has authored multiple well-cited research papers and published patents. He holds a master's degree from Stanford University, specializing in applied mathematics and machine learning, and a BS and MS in Physics from IIT Madras. He previously worked at Schlumberger and Matroid. He is a Senior Applied Scientist at Ambient.ai, where he helped build their weapon detection system, which is deployed at several Global Fortune 500 companies. He is now leveraging his expertise and passion for solving business challenges to build a technology startup in Silicon Valley. You can learn more about him on his website.
I would like to thank the computer vision researchers whose breakthrough research I got to write about. I want to thank the reviewers for their feedback and the wonderful team at Packt Publishing for giving me the chance to be creative. Finally, I want to thank my wife and family for all their support and encouragement when I most needed it.
Lilit Yolyan is a machine learning researcher working on her Ph.D. at YSU. Her research focuses on building computer vision solutions for smart cities using remote sensing data. She has 5 years of experience in the field of computer vision and has worked on a complex driver safety solution to be deployed by many well-known car manufacturing companies.
Eya Abid is a Master of Engineering student specializing in deep learning and computer vision. She is an AI instructor with NVIDIA and works on quantum machine learning at CERN.
I would like to dedicate this work first to my family, friends, and whoever helped me through this process. A special dedication to Aymen, to whom I am forever grateful.
Ramesh Sekhar is the CEO and co-founder of Dapster.ai, a company that builds affordable and easily deployable robots that perform the most arduous tasks in warehouses. Ramesh has worked at companies such as Symbol, Motorola, and Zebra and specializes in building products at the intersection of computer vision, AI, and robotics. He has a BS in Electrical Engineering and an MS in Computer Science. Ramesh founded Dapster.ai in 2020. Dapster's mission is to build robots that positively impact human beings by performing dangerous and unhealthy tasks. Their vision is to unlock better jobs, fortify supply chains, and better negotiate the challenges arising from climate change.
Utkarsh Srivastava is an AI/ML professional, trainer, YouTuber, and blogger. He loves to tackle and develop ML, NLP, and computer vision algorithms to solve complex problems. He started his data science career as a blogger on his blog (datamahadev.com) and YouTube channel (datamahadev), followed by working as a senior data science trainer at an institute in Gujarat. Additionally, he has trained and counseled 1,000+ working professionals and students in AI/ML. Utkarsh has completed 40+ freelance training and development work/projects in data science and analytics, AI/ML, Python development, and SQL. He hails from Lucknow and is currently settled in Bangalore, India, as an analyst at Deloitte USI Consulting.
I would like to thank my mother, Mrs. Rupam Srivastava, for her continuous guidance and support throughout my hardships and struggles. Thanks also to the Supreme Para-Brahman.
Mason McGough is a Sr. R&D Engineer and Computer Vision Specialist at Lowe's Innovation Labs. He has a passion for imaging and has spent over a decade solving computer vision problems across a broad range of industrial and academic disciplines, including geology, bioinformatics, game development, and retail. Most recently, he has been exploring the use of digital twins and 3D scanning for retail stores.
I wish to thank Andy Lykos, Joseph Canzano, Alexander Arango, Oleg Alexander, Erin Clark, and my family for their support.
This first part of the book defines the basic concepts of 3D data and image processing that are essential to our later discussions. It makes the book self-contained, so readers do not need any other resources to get started with learning about PyTorch3D.
This part includes the following chapters:
Chapter 1, Introducing 3D Data Processing
Chapter 2, Introducing 3D Computer Vision and Geometry

In this chapter, we are going to discuss some basic concepts that are fundamental to 3D deep learning and that will be used frequently in later chapters. We will start by setting up our development environment and installing all the necessary software packages, including Anaconda, Python, PyTorch, and PyTorch3D. We will then learn about the most frequently used ways to represent 3D data – for example, point clouds, meshes, and voxels – and the many ways we can manipulate them and convert them between formats. We will then move on to 3D data file formats, such as PLY and OBJ files, and discuss 3D coordinate systems. Finally, we will discuss camera models, which describe how 3D data is mapped to 2D images.
After reading this chapter, you will be able to debug 3D deep learning algorithms easily by inspecting output data files. With a solid understanding of coordinate systems and camera models, you will be ready to build on that knowledge and learn about more advanced 3D deep learning topics.
In this chapter, we’re going to cover the following main topics:
Setting up a development environment and installing Anaconda, PyTorch, and PyTorch3D
3D data representation
3D data formats – PLY and OBJ files
3D coordinate systems and conversion between them
Camera models – perspective and orthographic cameras

In order to run the example code snippets in this book, you will need a computer, ideally with a GPU. However, running the code snippets with only CPUs is possible.
The recommended computer configuration includes the following:
A GPU such as a GTX or RTX series card with at least 8 GB of memory
Python 3
The PyTorch and PyTorch3D libraries

The code snippets for this chapter can be found at https://github.com/PacktPublishing/3D-Deep-Learning-with-Python.
Let us first set up a development environment for all the coding exercises in this book. We recommend using a Linux machine for all the Python code examples in this book:
We will first set up Anaconda. Anaconda is a widely used Python distribution that comes bundled with the powerful CPython implementation. One advantage of using Anaconda is its package management system, which enables users to create virtual environments easily. The individual edition of Anaconda is free for solo practitioners, students, and researchers. To install Anaconda, we recommend visiting the website, anaconda.com, for detailed instructions. The easiest way to install Anaconda is usually by running a script downloaded from the website. After setting up Anaconda, run the following command to create a virtual environment with Python 3.7:

$ conda create -n python3d python=3.7
This command will create a virtual environment with Python version 3.7. In order to use this virtual environment, we first need to activate it with the following command:

$ source activate python3d
Install PyTorch. Detailed instructions on installing PyTorch can be found on its web page at www.pytorch.org/get-started/locally/. For example, I will install PyTorch 1.9.1 on my Ubuntu desktop with CUDA 11.1, as follows:

$ conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
Install PyTorch3D. PyTorch3D is an open source Python library for 3D computer vision, recently released by Facebook AI Research. PyTorch3D provides many utility functions to easily manipulate 3D data. Designed with deep learning in mind, it can handle almost all 3D data in mini-batches, including cameras, point clouds, and meshes. Another key feature of PyTorch3D is its implementation of a very important 3D deep learning technique called differentiable rendering. However, the biggest advantage of PyTorch3D as a 3D deep learning library is its close ties to PyTorch.

PyTorch3D has some dependencies, and detailed instructions on how to install them can be found on the PyTorch3D GitHub home page at github.com/facebookresearch/pytorch3d. After all the dependencies have been installed by following the instructions from the website, installing PyTorch3D can be easily done by running the following command:
$ conda install pytorch3d -c pytorch3d
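Once everything is installed, a quick sanity check (a minimal sketch; the exact versions printed will depend on your installation) confirms that both libraries import correctly and that PyTorch can see the GPU:

import torch
import pytorch3d

# Print the installed versions and check whether CUDA is available
print(torch.__version__)
print(pytorch3d.__version__)
print(torch.cuda.is_available())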
Now that we have set up the development environment, let's go ahead and start learning about 3D data representations.
In this section, we will learn about the most frequently used representations of 3D data. Choosing a data representation is a particularly important design decision for many 3D deep learning systems. For example, point clouds do not have grid-like structures, so convolutions usually cannot be applied to them directly. Voxel representations do have grid-like structures; however, they tend to consume a large amount of computer memory. We will discuss the pros and cons of these representations in more detail in this section. The most widely used 3D data representations are point clouds, meshes, and voxels.
A 3D point cloud is a very straightforward representation of 3D objects, where each point cloud is just a collection of 3D points, and each 3D point is represented by one three-dimensional tuple (x, y, z). The raw measurements of many depth cameras are usually 3D point clouds.
From a deep learning point of view, 3D point clouds are an unordered and irregular data type. Unlike regular images, where we can define neighboring pixels for each individual pixel, there is no clear and regular definition of neighboring points for each point in a point cloud – that is, convolutions usually cannot be applied to point clouds. Thus, special types of deep learning models need to be used for processing point clouds, such as PointNet: https://arxiv.org/abs/1612.00593.
Another issue for point clouds as training data for 3D deep learning is the heterogeneous data issue – that is, within one training dataset, different point clouds may contain different numbers of 3D points. One approach to avoiding this heterogeneous data issue is to force all the point clouds to have the same number of points. However, this may not always be possible – for example, the number of points returned by a depth camera may differ from frame to frame.
The heterogeneous data may create some difficulties for mini-batch gradient descent in training deep learning models. Most deep learning frameworks assume that each mini-batch contains training examples of the same size and dimensions. Such homogeneous data is preferred because it can be most efficiently processed by modern parallel processing hardware, such as GPUs. Handling heterogeneous mini-batches in an efficient way needs some additional work. Luckily, PyTorch3D provides many ways of handling heterogeneous mini-batches efficiently, which are important for 3D deep learning.
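For example, here is a minimal sketch (the point counts are arbitrary) of how PyTorch3D's Pointclouds structure stores a heterogeneous mini-batch, exposing both padded and packed views of the same data:

import torch
from pytorch3d.structures import Pointclouds

# Two point clouds with different numbers of points in one mini-batch
points_a = torch.rand(100, 3)
points_b = torch.rand(60, 3)
batch = Pointclouds(points=[points_a, points_b])

# Padded view: one (2, 100, 3) tensor, with the smaller cloud zero-padded
print(batch.points_padded().shape)
# Packed view: all points concatenated into one (160, 3) tensor
print(batch.points_packed().shape)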
Meshes are another widely used 3D data representation. Like points in point clouds, each mesh contains a set of 3D points called vertices. In addition, each mesh also contains a set of polygons called faces, which are defined on vertices.
In most data-driven applications, meshes are a result of post-processing raw measurements from depth cameras. Alternatively, they may be manually created during the process of 3D asset design. Compared to point clouds, meshes contain additional geometric information, encode topology, and have surface-normal information. This additional information becomes especially useful in training deep learning models. For example, graph convolutional neural networks usually treat meshes as graphs and define convolutional operations using the vertex-neighboring information.
Just like point clouds, meshes also have similar heterogeneous data issues. Again, PyTorch3D provides efficient ways for handling heterogeneous mini-batches for mesh data, which makes 3D deep learning efficient.
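As a sketch (with toy geometry made up for illustration), a heterogeneous mini-batch of meshes can be built with PyTorch3D's Meshes structure in the same spirit:

import torch
from pytorch3d.structures import Meshes

# A single triangle and a square made of two triangles, in one mini-batch
verts1 = torch.tensor([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
faces1 = torch.tensor([[0, 1, 2]])
verts2 = torch.tensor([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                       [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
faces2 = torch.tensor([[0, 1, 2], [0, 2, 3]])
batch = Meshes(verts=[verts1, verts2], faces=[faces1, faces2])

print(batch.verts_padded().shape)  # (2, 4, 3): padded to the largest mesh
print(batch.faces_packed().shape)  # (3, 3): all faces concatenated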
Another important 3D data representation is voxel representation. A voxel is the counterpart of a pixel in 3D computer vision. A pixel is defined by dividing a rectangle in 2D into smaller rectangles, where each small rectangle is one pixel. Similarly, a voxel is defined by dividing a 3D cube into smaller cubes, where each small cube is called a voxel. The process is shown in the following figure:
Figure 1.1 – Voxel representation is the 3D counterpart of 2D pixel representation, where a cubic space is divided into small volume elements
Voxel representations usually use Truncated Signed Distance Functions (TSDFs) to represent 3D surfaces. A Signed Distance Function (SDF) can be defined at each voxel as the (signed) distance between the center of the voxel to the closest point on the surface. A positive sign in an SDF indicates that the voxel center is outside an object. The only difference between a TSDF and an SDF is that the values of a TSDF are truncated, such that the values of a TSDF always range from -1 to +1.
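As a minimal sketch (the volume size and truncation distance below are made-up values for illustration), converting an SDF volume into a TSDF under this convention is just a scale and a clamp:

import torch

# A hypothetical SDF volume: the signed distance at each voxel center
sdf = torch.randn(32, 32, 32)

# Scale by a chosen truncation distance and clamp, so all values lie in [-1, +1]
truncation_distance = 0.1
tsdf = torch.clamp(sdf / truncation_distance, min=-1.0, max=1.0)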
Unlike point clouds and meshes, voxel representation is ordered and regular. This property is similar to that of pixels in images and enables the use of convolutional filters in deep learning models. One potential disadvantage of voxel representation is that it usually requires more computer memory, although this can be reduced by using techniques such as hashing. Nevertheless, voxel representation is an important 3D data representation.
There are 3D data representations other than the ones mentioned here. For example, multi-view representations use multiple images taken from different viewpoints to represent a 3D scene. RGB-D representations use an additional depth channel to represent a 3D scene. However, in this book, we will not be diving too deep into these 3D representations. Now that we have learned the basics of 3D data representations, we will dive into a few commonly used file formats for point clouds and meshes.
In this section, we are going to discuss the two most frequently used data file formats for representing point clouds and meshes: the PLY file format and the OBJ file format. We are going to discuss the formats themselves and how to load and save them using PyTorch3D. PyTorch3D provides excellent utility functions, so loading from and saving to these file formats is efficient and easy.

The PLY file format was developed in the mid-1990s by a group of researchers from Stanford University and has since evolved into one of the most widely used 3D data file formats. The file format has both an ASCII version and a binary version. The binary version is preferred in cases where file size and processing efficiency are needed, while the ASCII version makes it quite easy to debug. Here, we will discuss the basic format of PLY files and how to use both Open3D and PyTorch3D to load and visualize 3D data from PLY files.
An example, a cube.ply file, is shown in the following code snippet:
ply
format ascii 1.0
comment created for the book 3D Deep Learning with Python
element vertex 8
property float32 x
property float32 y
property float32 z
element face 12
property list uint8 int32 vertex_indices
end_header
-1 -1 -1
1 -1 -1
1 1 -1
-1 1 -1
-1 -1 1
1 -1 1
1 1 1
-1 1 1
3 0 1 2
3 5 4 7
3 6 2 1
3 3 7 4
3 7 3 2
3 5 1 0
3 0 2 3
3 5 7 6
3 6 1 5
3 3 4 0
3 7 2 6
3 5 0 4

As seen here, each PLY file contains a header part and a data part. The first line of every ASCII PLY file is always ply, which indicates that this is a PLY file. The second line, format ascii 1.0, shows that the file is of the ASCII type, with a version number. Any line starting with comment is considered a comment line, and thus anything following comment is ignored when the PLY file is loaded by a computer. The element vertex 8 line means that the first type of data in the PLY file is vertex and we have eight vertices. property float32 x means that each vertex has a property named x of the float32 type. Similarly, each vertex also has y and z properties. Here, each vertex is one 3D point. The element face 12 line means that the second type of data in this PLY file is of the face type and we have 12 faces. property list uint8 int32 vertex_indices shows that each face is a list of vertex indices. The header part of the PLY file always ends with an end_header line.
The first part of the data part of the PLY file consists of eight lines, where each line is the record for one vertex. The three numbers in each line represent the three x, y, and z properties of the vertex. For example, the three numbers -1, -1, -1 specify that the vertex has an x coordinate of -1, y coordinate of -1, and z coordinate of -1.
The second part of the data part of the PLY file consists of 12 lines, where each line is the record for one face. The first number in the sequence indicates the number of vertices that the face has, and the following numbers are the vertex indices. The vertex indices are determined by the order in which the vertices are declared in the PLY file.
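As a minimal sketch (assuming the cube.ply file shown above sits in the current working directory), this file can be loaded with PyTorch3D's load_ply utility:

from pytorch3d.io import load_ply

# verts is a (V, 3) float tensor of vertex coordinates;
# faces is an (F, 3) long tensor of vertex indices
verts, faces = load_ply("cube.ply")
print(verts.shape)  # torch.Size([8, 3])
print(faces.shape)  # torch.Size([12, 3])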