Real-Time Geometry Scanning System
Introduction
The field of structure from motion within the study of computer vision is active and evolving. Existing approaches for using cameras to obtain the 3D structure of a scene use visual correspondence and tracking across multiple views to triangulate the position of points in the scene. This is a well-studied problem with entire textbooks written about the various stages of its solution, such as An Invitation to 3-D Vision: From Images to Geometric Models, by Yi Ma, Stefano Soatto, Jana Kosecka, and Shankar Sastry.
Unfortunately, purely vision-based approaches for using camera images to calculate the 3D geometry of a scene suffer from a number of well-known drawbacks. High-quality visual features must exist, and correspondences between them must be established across multiple views. The process of matching correspondences is subject to noise that depends on each view. Views without visual features, like images of the floors or walls of a building, are not usable at all. In addition, alignment of the views and triangulation of the geometry involve a considerable amount of computational expense. For many applications this expense is acceptable, but even then the constructed geometry may be incomplete due to the presence of "holes" in places where the user forgot to scan.
The recent emergence of geometry cameras that use structured patterns of infrared light to construct a camera-space depth map in hardware solves a number of these problems. The infrared light projection and reconstruction occur outside of the visible light spectrum, so the system does not depend on visible features at all. Many of the cameras are relatively inexpensive, and the geometry construction occurs in real time. They do, however, introduce a number of new problems of their own. Infrared light from sources such as the sun or other infrared cameras may interfere with the projected pattern. The projected light may also reflect off of different surfaces, confusing the algorithm used to perform the reconstruction. Also, the data is only available as a depth map in camera space; it is impossible to obtain data about any geometry that is not immediately visible in the current view.
The former problems may be solved by keeping the infrared camera out of direct sunlight and avoiding highly reflective surfaces like mirrors and consumer electronics liquid-crystal displays. This system presents a novel approach to solving the latter problem by saving the data obtained by each view of the camera into a global data structure. Provided with a means to obtain the pose of the camera at each frame, each frame's depth map can be transformed into a colored point cloud in world space, relative to some origin and set of coordinate axes. Once enough such vertices are obtained, they may be linked together into a triangle mesh, assigned texture coordinates, and saved to a common geometry definition file format like Wavefront OBJ. The scanned geometry is then ready for use in real-time computer graphics applications like virtual tourism or video games, or in offline applications like ray-tracing renderers.
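As a minimal sketch of this projection step (assuming pinhole depth-camera intrinsics and a row-major 4x4 camera-to-world pose matrix from the tracking system; none of the names below come from the project's source code), a single depth pixel can be turned into a world-space point like this:

```cpp
// Minimal sketch (not the project's actual code): unproject one depth pixel
// into camera space with assumed pinhole intrinsics, then move it into world
// space with a row-major 4x4 camera-to-world pose from the tracking system.
#include <array>

struct Vec3 { float x, y, z; };

// Assumed depth-camera intrinsics: focal lengths fx, fy and principal point cx, cy (pixels).
struct Intrinsics { float fx, fy, cx, cy; };

// Pixel (u, v) with metric depth z -> camera-space point.
Vec3 unproject(const Intrinsics& K, float u, float v, float z) {
    return { (u - K.cx) * z / K.fx,
             (v - K.cy) * z / K.fy,
             z };
}

// Apply a row-major 4x4 camera-to-world matrix to a camera-space point.
Vec3 toWorld(const std::array<float, 16>& M, const Vec3& p) {
    return { M[0]*p.x + M[1]*p.y + M[2]*p.z  + M[3],
             M[4]*p.x + M[5]*p.y + M[6]*p.z  + M[7],
             M[8]*p.x + M[9]*p.y + M[10]*p.z + M[11] };
}
```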
This system uses a high-quality tracking system to obtain the pose of the camera in real time. The projected points are inserted into a novel but simple data structure based on spatial hashing. It uses ray tracing to enforce a maximum of one layer of scanned points for each corresponding real surface. While the positional noise in the 3D data cannot be eliminated due to hardware constraints, it can be characterized and corrected for as well as possible. We believe that this system is able to scan and generate triangle meshes in many types of scenes that other geometry scanning systems are completely unable to process, such as those with few or no visual features. In addition, our data structures keep the total data size of the scanned scene small, even though the camera sends us millions of colored points per second.
Features and Progress
- Reading in data from the camera - accomplished using libfreenect
- Projecting the depth map into camera space - accomplished by modifying the glview example source code
- Transforming the camera space point cloud into world space - accomplished by constructing a camera matrix using data obtained by a Calit2 tracking system, installed and calibrated by Andrew Prudhomme (thanks!)
- Ensuring that scanned surfaces are represented by at most one layer of points:
- Each scanned point is added to the scene with a surrounding sphere of constant radius. When considering whether to add a new point to the scene, the system traces a ray from the camera origin in the direction of the new point. If that ray intersects the sphere of any other point in the scene, the new point is NOT added. Nearby spheres quickly form surfaces through which no new rays can penetrate, enforcing the "single layer" invariant (a sketch of this test appears after this list).
- Testing all points in the scene for intersection quickly becomes computationally prohibitive. However, traditional ray tracing acceleration data structures like bounding volume hierarchies or kd-trees do not address the problem well, because they are optimized for fast lookups but not fast inserts. Since many new points can be added to the scene each frame, the insertion cost becomes high. Our data structure establishes a bounding box around the entire scannable area and coarsely divides it into axis-aligned regions called bins. These bins are quite large and each one can store thousands of points, unlike a voxel grid, which generally stores either one or zero points per voxel (a sketch of such a binning structure appears after this list).
- Addressing noise:
- The ray tracing system addresses noise in the form of depth values that vary over time: once a surface is covered by a single layer of points, later noisy samples of the same surface are rejected rather than added as extra layers. This type of noise can be characterized with a density function and corrected for.
- Another form of noise arises from the translation between the optical center of the infrared camera and the center of mass of the tracking client unit. When the tracking system reports a rotation, it is relative to the tracking client, not necessarily the infrared camera itself. While the transformations generally match, surfaces may appear translated by four to five centimeters under different rotations of the camera. To correct for this, we implemented a high depth gradient filter that removes samples whose depth changes too quickly from pixel to pixel. This effectively restricts scanning to surfaces viewed nearly head-on, where the camera's optical axis is close to orthogonal to the surface's tangent plane. Since each surface can only be scanned from a small range of angles, the rotational noise for that surface stays more or less constant, and we avoid adding multiple layers of the same surface to the scene (a sketch of this filter appears after this list).
- Virtual reality installation - in progress
- The system comprises a scanning process and a rendering process. The rendering process has been ported to OpenCOVER and therefore runs on any COVISE-equipped system.
- Work is currently underway to get the system ported into the StarCAVE.
- Real-time triangle meshing - in progress, collaboration with Robert Pardridge and James Lu [1]
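As a rough illustration of the "single layer" test above, the following sketch rejects a candidate world-space point whenever the ray from the camera origin toward it intersects the sphere around any previously accepted point. The names and the fixed-radius parameter are assumptions for illustration, not the project's actual code.

```cpp
// Illustrative sketch (not the project's code) of the "single layer" test:
// accept a new world-space point only if the ray from the camera origin toward
// it does not intersect the sphere around any existing point.
#include <vector>

struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Does the ray from 'origin' through 'target' pass through a sphere of
// 'radius' centered at 'center'?
bool hitsSphere(Vec3 origin, Vec3 target, Vec3 center, float radius) {
    Vec3 d = sub(target, origin);                 // ray direction (unnormalized)
    float len2 = dot(d, d);
    if (len2 == 0.0f) return false;
    float t = dot(sub(center, origin), d) / len2; // parameter of closest approach
    if (t <= 0.0f) return false;                  // sphere is behind the camera
    Vec3 closest = { origin.x + t*d.x, origin.y + t*d.y, origin.z + t*d.z };
    Vec3 diff = sub(center, closest);
    return dot(diff, diff) < radius * radius;
}

// Accept the candidate only if its ray reaches it without crossing an existing sphere.
bool acceptPoint(Vec3 camOrigin, Vec3 candidate,
                 const std::vector<Vec3>& existing, float radius) {
    for (const Vec3& p : existing)
        if (hitsSphere(camOrigin, candidate, p, radius)) return false;
    return true;
}
```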
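The coarse binning structure could look roughly like the sketch below: a bounding box around the scannable area is split into a small number of large axis-aligned bins, each holding many points, so inserts are cheap appends and a ray test only needs to consult the bins the ray actually passes through. The class name, method names, and grid resolution are assumptions for illustration, not the project's actual data structure.

```cpp
// Rough sketch (assumed, not the project's code) of coarse spatial binning:
// the scannable volume is split into large axis-aligned bins, each of which
// can hold thousands of points; a point maps to a bin by integer division.
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

class BinGrid {
public:
    // 'minCorner'/'maxCorner' bound the scannable area; 'binsPerAxis' is kept
    // small so that bins stay large (unlike a one-point-per-cell voxel grid).
    BinGrid(Vec3 minCorner, Vec3 maxCorner, int binsPerAxis)
        : min_(minCorner), max_(maxCorner), n_(binsPerAxis),
          bins_(static_cast<std::size_t>(binsPerAxis) * binsPerAxis * binsPerAxis) {}

    // O(1) insert: append the point to its bin.
    void insert(const Vec3& p) { bins_[indexOf(p)].push_back(p); }

    // Candidate set for a ray test: only the points in the bins the ray passes
    // through need to be checked for sphere intersection.
    const std::vector<Vec3>& binContaining(const Vec3& p) const { return bins_[indexOf(p)]; }

private:
    std::size_t indexOf(const Vec3& p) const {
        int ix = clampIndex(static_cast<int>((p.x - min_.x) / (max_.x - min_.x) * n_));
        int iy = clampIndex(static_cast<int>((p.y - min_.y) / (max_.y - min_.y) * n_));
        int iz = clampIndex(static_cast<int>((p.z - min_.z) / (max_.z - min_.z) * n_));
        return (static_cast<std::size_t>(ix) * n_ + iy) * n_ + iz;
    }
    int clampIndex(int i) const { return i < 0 ? 0 : (i >= n_ ? n_ - 1 : i); }

    Vec3 min_, max_;
    int n_;
    std::vector<std::vector<Vec3>> bins_;
};
```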
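The high depth gradient filter could be sketched as below: a depth sample is discarded when its depth differs from any horizontal or vertical neighbor by more than a threshold, which rejects surfaces seen at grazing angles. The depth-image layout and the threshold parameter are assumptions for illustration.

```cpp
// Illustrative sketch of the high-depth-gradient filter (assumptions: row-major
// depth image in meters, 0 = no reading): drop samples whose depth differs too
// much from a horizontal or vertical neighbor, rejecting grazing-angle surfaces.
#include <cmath>
#include <vector>

bool keepSample(const std::vector<float>& depth, int width, int height,
                int x, int y, float maxGradient /* meters per pixel, assumed */) {
    float d = depth[y * width + x];
    if (d <= 0.0f) return false;                       // no valid reading
    const int dx[4] = { 1, -1, 0, 0 };
    const int dy[4] = { 0, 0, 1, -1 };
    for (int i = 0; i < 4; ++i) {
        int nx = x + dx[i], ny = y + dy[i];
        if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
        float nd = depth[ny * width + nx];
        if (nd <= 0.0f || std::fabs(nd - d) > maxGradient)
            return false;                              // depth changes too quickly
    }
    return true;
}
```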