
[Kinect] Kinect Natural User Interface (NUI) Overview

by deviAk 2012. 1. 8.

Kinect for Windows Architecture

The SDK provides a sophisticated software library and tools to help developers use the rich form of Kinect-based natural input, which senses and reacts to real-world events.

The Kinect sensor and associated software library interact with your application, as shown in Figure 1.

Figure 1. Hardware and software interaction with an application

The components of the SDK are shown in Figure 2.

Figure 2. SDK architecture

The components of the SDK shown in Figure 2 include the following:

  • Kinect hardware - The hardware components, including the Kinect sensor and the USB hub, through which the sensor is connected to the computer.
  • Microsoft Kinect drivers - The Windows 7 drivers for the Kinect sensor, which are installed as part of the SDK setup process as described in this document. The Microsoft Kinect drivers support:
    • The Kinect sensor’s microphone array as a kernel-mode audio device that you can access through the standard audio APIs in Windows.
    • Streaming image and depth data.
    • Device enumeration functions that enable an application to use more than one Kinect sensor that is connected to the computer.
  • KinectAudio DirectX Media Object (DMO) - The Kinect DMO that extends the microphone array support in Windows 7 to expose beamforming and source localization functionality.
  • Windows 7 standard APIs - The audio, speech, and media APIs in Windows 7, as described in the Windows 7 SDK and the Microsoft Speech SDK.
The NUI API

The NUI API is the core of the Kinect for Windows API. It supports fundamental image and device management features, including the following:

  • Access to the Kinect sensors that are connected to the computer.
  • Access to image and depth data streams from the Kinect image sensors.
  • Delivery of a processed version of image and depth data to support skeletal tracking. 
This SDK includes C++ and C# versions of the SkeletalViewer sample. SkeletalViewer shows how to use the NUI API in an application to capture data streams from the NUI Image camera, use skeletal images, and process sensor data. For more information, see “Skeletal Viewer Walkthrough” on the SDK website.

NUI API Initialization

The Microsoft Kinect drivers support the use of multiple Kinect sensors on a single computer. The NUI API includes functions that enumerate the sensors, so that you can determine how many sensors are connected to the machine, get the name of a particular sensor, and individually open and set streaming characteristics for each sensor.

Although the SDK supports an application using multiple Kinect sensors, only one application can use each sensor at any given time.

Sensor Enumeration and Access

C++ and managed code applications enumerate the available Kinect sensors, open a sensor, and initialize the NUI API in one of the following ways:

To initialize the NUI API and use only one Kinect sensor in a C++ application

  1. Call NuiInitialize. This function initializes the first instance of the Kinect sensor device on the system.
  2. Call other NuiXxx functions to stream image and skeleton data and manage the cameras.
  3. Call NuiShutdown when use of the Kinect sensor is complete.
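
As a minimal C++ sketch of these three steps (assuming the beta SDK header name MSR_NuiApi.h and two of the initialization flags described later under "Initialization Options"):

    #include <windows.h>
    #include "MSR_NuiApi.h"   // NUI API header as shipped with the SDK beta

    int main()
    {
        // Step 1: initialize the first Kinect sensor on the system.
        // The flags declare which pipeline stages the application will use.
        HRESULT hr = NuiInitialize(NUI_INITIALIZE_FLAG_USES_COLOR |
                                   NUI_INITIALIZE_FLAG_USES_SKELETON);
        if (FAILED(hr))
            return -1;

        // Step 2: call other NuiXxx functions here to open streams,
        // retrieve frames, and manage the cameras.

        // Step 3: release the sensor when the application is finished with it.
        NuiShutdown();
        return 0;
    }
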
To initialize the NUI API and use more than one Kinect sensor in a C++ application
  1. Call MSR_NuiDeviceCount to determine how many sensors are available.
  2. Call MSR_NuiCreateInstanceByIndex to create an instance for each sensor that the application uses. This function returns an INuiInstance interface pointer for the instance.
  3. Call INuiInstance::NuiInitialize to initialize the NUI API for the sensor.
  4. Call other methods on the INuiInstance interface to stream image and skeleton data and manage the cameras.
  5. Call INuiInstance::NuiShutdown on a sensor instance to close the NUI API when use of that sensor is complete.
  6. Call MSR_NuiDestroyInstance to destroy the instance.
To initialize the NUI API and use one Kinect sensor in managed code
  1. Create a new Runtime object and leave the parameter list empty, as in the following C# code:
       nui = new Runtime();
     This constructor creates an object that represents the first instance of the Kinect sensor device on the system.
  2. Call Runtime.Initialize to initialize the NUI API for the sensor.
  3. Call additional methods in the managed interface to stream image and skeleton data and to manage the cameras.
  4. Call Runtime.Shutdown when use of the Kinect sensor is complete.
To initialize the NUI API and use more than one Kinect sensor in managed code
  1. Call MSR_NuiDeviceCount to determine how many sensors are available.
  2. Create a new Runtime object and pass the index of a sensor, as in the following C# code:
       nui = new Runtime(index);
     This constructor creates an object that represents a particular instance of the Kinect sensor device on the system.
  3. Call Runtime.Initialize to initialize the NUI API for that device instance.
  4. Call additional methods in the managed interface to stream image and skeleton data and to manage the cameras.
  5. Call Runtime.Shutdown when use of that device instance is complete.

Initialization Options

The NUI API processes data from the Kinect sensor through a multistage pipeline. At initialization, the application specifies the subsystems that it uses, so that the runtime can start the required portions of the pipeline. An application can choose one or more of the following options:

  • Color - The application streams color image data from the sensor.
  • Depth - The application streams depth image data from the sensor.
  • Depth and player index - The application streams depth data from the sensor and requires the player index that the skeleton tracking engine generates.
  • Skeleton - The application uses skeleton position data.

These options determine the valid stream types and resolutions for the application. For example, if an application does not indicate at NUI API initialization that it uses depth, it cannot later open a depth stream.
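
In the native API, these options correspond to the NUI_INITIALIZE_FLAG_USES_* values that are combined and passed to NuiInitialize (or INuiInstance::NuiInitialize). A short sketch, assuming the beta SDK header name:

    #include <windows.h>
    #include "MSR_NuiApi.h"   // header name as in the SDK beta

    // Initialize the NUI API for depth-plus-player-index and skeleton data.
    // A stream type that is not requested here cannot be opened later.
    HRESULT InitializeForDepthAndSkeleton()
    {
        return NuiInitialize(NUI_INITIALIZE_FLAG_USES_DEPTH_AND_PLAYER_INDEX |
                             NUI_INITIALIZE_FLAG_USES_SKELETON);
    }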

NUI Image Data Streams: An Overview

The NUI API provides the means to modify settings for the Kinect sensor array, and it enables you to access image data from the sensor array.

Stream data is delivered as a succession of still-image frames. At NUI initialization, the application identifies the streams it plans to use. It then opens those streams with additional stream-specific details, including stream resolution, image type, and the number of buffers that the runtime should use to store incoming frames. If the runtime fills all the buffers before the application retrieves and releases a frame, the runtime discards the oldest frame and reuses that buffer. As a result, it is possible for frames to be dropped. An application can request up to four buffers; two is adequate for most usage scenarios.
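
In the native API, these stream-specific details are the parameters of NuiImageStreamOpen. The sketch below opens a depth-and-player-index stream with two buffers; the event handle is optional and may be NULL when the polling model (described later) is used:

    #include <windows.h>
    #include "MSR_NuiApi.h"   // header name as in the SDK beta

    // Open a depth-and-player-index stream at 320x240 with two frame buffers
    // (two is adequate for most scenarios, as noted above).
    HRESULT OpenDepthStream(HANDLE hNextFrameEvent, HANDLE *phDepthStream)
    {
        return NuiImageStreamOpen(
            NUI_IMAGE_TYPE_DEPTH_AND_PLAYER_INDEX,  // image type
            NUI_IMAGE_RESOLUTION_320x240,           // stream resolution
            0,                                      // frame flags (unused here)
            2,                                      // number of buffers
            hNextFrameEvent,                        // optional: signaled per frame
            phDepthStream);                         // receives the stream handle
    }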

An application has access to the following kinds of image data from the sensor array:

  • Color data
  • Depth data
  • Player segmentation data 

Color Image Data

Color data is available in the following two formats:

  • RGB color provides 32-bit, linear X8R8G8B8-formatted color bitmaps, in sRGB color space. To work with RGB data, an application should specify a color or color_YUV image type when it opens the stream.
  • YUV color provides 16-bit, gamma-corrected linear UYVY-formatted color bitmaps, where the gamma correction in YUV space is equivalent to sRGB gamma in RGB space. Because the YUV stream uses 16 bits per pixel, this format uses less memory to hold bitmap data and allocates less buffer memory when you open the stream. To work with YUV data, your application should specify the raw YUV image type when it opens the stream. YUV data is available only at the 640x480 resolution and only at 15 FPS. 
Both color formats are computed from the same camera data, so that the YUV data and RGB data represent the same image. Choose the data format that is most convenient given your application's implementation.
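
The format choice is expressed through the image type that is passed to NuiImageStreamOpen when the color stream is opened. A sketch, using the image-type names from the native headers:

    #include <windows.h>
    #include "MSR_NuiApi.h"   // header name as in the SDK beta

    // Open the color stream as either RGB (32-bit X8R8G8B8) or raw YUV
    // (16-bit UYVY). Raw YUV is available only at 640x480, so that
    // resolution is used for both variants here.
    HRESULT OpenColorStream(bool useRawYuv, HANDLE hEvent, HANDLE *phStream)
    {
        NUI_IMAGE_TYPE type = useRawYuv ? NUI_IMAGE_TYPE_COLOR_RAW_YUV
                                        : NUI_IMAGE_TYPE_COLOR;
        return NuiImageStreamOpen(type, NUI_IMAGE_RESOLUTION_640x480,
                                  0, 2, hEvent, phStream);
    }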

The sensor array uses a USB connection to pass data to the PC, and that connection provides a given amount of bandwidth. The Bayer color image data that the sensor returns at 1280x1024 is compressed and converted to RGB before transmission to the runtime. The runtime then decompresses the data before it passes the data to your application. The use of compression makes it possible to return color data at frame rates as high as 30 FPS, but the algorithm that is used leads to some loss of image fidelity.

Depth Data

The depth data stream provides frames in which each pixel represents the Cartesian distance, in millimeters, from the camera plane to the nearest object at that particular x and y coordinate in the depth sensor's field of view. The following depth data streams are available:

  • Frame size of 640×480 pixels
  • Frame size of 320×240 pixels
  • Frame size of 80×60 pixels 
Applications can process data from a depth stream to support various custom features, such as tracking users' motions or identifying background objects to ignore during application play.

The format of the depth data depends on whether the application specifies depth only or depth and player index at NUI initialization, as follows:

  • For depth only, the low-order 12 bits (bits 0 through 11) of each pixel contain depth data and the remaining 4 bits are unused.
  • For depth and player index, the low-order 3 bits (bits 0 through 2) of each pixel contain the player index and the remaining bits contain depth data. 
A depth data value of 0 indicates that no depth data is available at that position, because all the objects were either too close to the camera or too far away from it.
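
The two layouts can be unpacked with simple bit operations on each 16-bit depth pixel, for example:

    #include <cstdint>

    // Depth-only stream: distance occupies the low-order 12 bits.
    inline uint16_t DepthOnlyMillimeters(uint16_t pixel)
    {
        return pixel & 0x0FFF;   // bits 0 through 11: millimeters (0 = unknown)
    }

    // Depth-and-player-index stream: player index in bits 0 through 2,
    // distance in the bits above it.
    inline uint16_t DepthWithIndexMillimeters(uint16_t pixel)
    {
        return pixel >> 3;       // millimeters (0 = unknown)
    }

    inline uint16_t PlayerIndex(uint16_t pixel)
    {
        return pixel & 0x0007;   // 0 = no player at this pixel
    }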

Player Segmentation Data

In the SDK, the Kinect system processes sensor data to identify two human figures in front of the sensor array and then creates the player segmentation map. This map is a bitmap in which the pixel values correspond to the player index of the person in the field of view who is closest to the camera, at that pixel position.

Although the player segmentation data is a separate logical stream, in practice the depth data and player segmentation data are merged into a single frame:

  • The 13 high-order bits of each pixel represent the distance from the depth sensor to the closest object, in millimeters.
  • The 3 low-order bits of each pixel represent the player index of the tracked player who is visible at the pixel's x and y coordinates. These bits are treated as an integer value and are not used as flags in a bit field. 
A player index value of zero indicates that no player was found at that location. Values one and two identify players. Applications commonly use player segmentation data as a mask to isolate specific users or regions of interest from the raw color and depth images.
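
As a sketch of that masking idea, assuming the merged 16-bit layout described above (the function and parameter names are illustrative, not part of the SDK):

    #include <cstdint>
    #include <vector>

    // Build a binary mask that is set wherever the requested player index
    // appears in a merged depth-and-player-index frame.
    std::vector<uint8_t> MakePlayerMask(const uint16_t *depthPixels,
                                        int width, int height,
                                        uint16_t playerIndex)
    {
        std::vector<uint8_t> mask(static_cast<size_t>(width) * height, 0);
        for (size_t i = 0; i < mask.size(); ++i)
        {
            // The low-order 3 bits of each pixel hold the player index.
            if ((depthPixels[i] & 0x0007) == playerIndex)
                mask[i] = 255;
        }
        return mask;
    }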

Retrieving Image Information

Application code gets the latest frame of image data by calling a frame retrieval method and passing a buffer. If the latest frame of data is ready, it is copied into the buffer. If your code requests frames of data faster than new frames are available, you can choose whether to wait for the next frame or to return immediately and try again later. The NUI Image Camera API never provides the same frame of data more than once.

Applications can use either of the following two usage models:

  • Polling Model - The polling model is the simplest option for reading data frames. First, the application code opens the image stream. It then requests a frame, specifying how long to wait for the next frame of data (from 0 milliseconds to an indefinite wait). The request method returns when a new frame of data is ready or when the wait time expires, whichever comes first. Specifying an infinite wait causes the call for frame data to block for as long as necessary until the next frame arrives.

    When the request returns successfully, the new frame is ready for processing. If the time-out value is set to zero, the application code can poll for completion of a new frame while it performs other work on the same thread. A native C++ application calls NuiImageStreamOpen to open a color or depth stream and omits the optional event. Managed code calls ImageStream.Open. To poll for color and depth frames, native C++ applications call NuiImageStreamGetNextFrame and managed code calls ImageStream.GetNextFrame.

  • Event Model - The event model supports the ability to integrate retrieval of image frames into an application engine with more flexibility and more accuracy.

    In this model, C++ application code passes an event handle to NuiImageStreamOpen. When a new frame of image data is ready, the event is signaled. Any waiting thread wakes and gets the frame of image data by calling NuiImageStreamGetNextFrame. During this time, the event is reset by the NUI Image Camera API. (A minimal C++ sketch of this event model appears after this list.)

    Managed code uses the event model by hooking a Runtime.DepthFrameReady or Runtime.ImageFrameReady event to an appropriate event handler. When a new frame of data is ready, the event is signaled and the handler runs and calls ImageStream.GetNextFrame to get the frame.
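
A minimal C++ sketch of the event model for a color stream follows; the beta SDK header name is assumed, the endless loop stands in for an application's frame loop, and error handling is omitted:

    #include <windows.h>
    #include "MSR_NuiApi.h"   // header name as in the SDK beta

    void ColorFrameLoop()
    {
        HANDLE hFrameEvent  = CreateEvent(NULL, TRUE, FALSE, NULL); // manual reset
        HANDLE hColorStream = NULL;

        NuiImageStreamOpen(NUI_IMAGE_TYPE_COLOR, NUI_IMAGE_RESOLUTION_640x480,
                           0, 2, hFrameEvent, &hColorStream);

        for (;;)
        {
            // Block until the runtime signals that a new frame is ready.
            WaitForSingleObject(hFrameEvent, INFINITE);

            const NUI_IMAGE_FRAME *pFrame = NULL;
            if (SUCCEEDED(NuiImageStreamGetNextFrame(hColorStream, 0, &pFrame)))
            {
                // ...lock and process the frame's image buffer here...
                NuiImageStreamReleaseFrame(hColorStream, pFrame);
            }
        }
    }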

NUI Skeleton Tracking

The NUI Skeleton API provides information about the location of up to two players standing in front of the Kinect sensor array, with detailed position and orientation information.

The data is provided to application code as a set of points, called skeleton positions, that compose a skeleton, as shown in Figure 3. This skeleton represents a user’s current position and pose. Applications that use skeleton data must indicate this at NUI initialization and must enable skeleton tracking.

Figure 3. Skeleton joint positions relative to the human body

Retrieving Skeleton Information

Application code gets the latest frame of skeleton data in the same way that it gets a frame of image data: by calling a frame retrieval method and passing a buffer. Applications can use either a polling model or an event model, in the same way as for image frames. An application must choose one model or the other; it cannot use both models simultaneously.

To use the polling model:

  • A native C++ application calls NuiSkeletonGetNextFrame to retrieve a skeleton frame.
  • Managed code calls SkeletonEngine.GetNextFrame.
To use the event model:
  • C++ application code passes an event handle to NuiSkeletonTrackingEnable.

    When a new frame of skeleton data is ready, the event is signaled. Any waiting thread wakes and gets the frame of skeleton data by calling NuiSkeletonGetNextFrame. During this time, the event is reset by the NUI Skeleton API. (A minimal C++ sketch of this model appears after this list.)

  • Managed code uses the event model by hooking a Runtime.SkeletonFrameReady event to an appropriate event handler. When a new frame of skeleton data is ready, the event is signaled and the handler runs and calls SkeletonEngine.GetNextFrame to get the frame.
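
A minimal C++ sketch of the skeleton event model described above (beta SDK header name assumed; the endless loop and omitted error handling are for brevity):

    #include <windows.h>
    #include "MSR_NuiApi.h"   // header name as in the SDK beta

    void SkeletonFrameLoop()
    {
        HANDLE hSkeletonEvent = CreateEvent(NULL, TRUE, FALSE, NULL); // manual reset

        // Enable skeleton tracking and associate the event with new frames.
        NuiSkeletonTrackingEnable(hSkeletonEvent, 0);

        for (;;)
        {
            WaitForSingleObject(hSkeletonEvent, INFINITE);

            NUI_SKELETON_FRAME frame = {};
            if (SUCCEEDED(NuiSkeletonGetNextFrame(0, &frame)))
            {
                // ...inspect frame.SkeletonData here (see the next sections)...
            }
        }
    }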

The skeletal tracking engine processes depth frame data to calculate the floor clipping plane, which is described in “Floor Determination,” later in this document. If the application indicates at initialization that it uses skeleton tracking, the skeletal tracking engine signals a skeleton frame each time it processes the depth data, whether or not a skeleton currently appears in the frame. Applications that use the floor clipping plane values can retrieve the skeleton frame. The returned skeleton frame includes the timestamp of the corresponding depth image so that applications can match skeleton data with depth image data.

Active and Passive Skeletal Tracking

The skeletal tracking engine provides full skeletal tracking for one or two players in the sensor's field of view. When a player is actively tracked, calls to get the next skeleton frame return complete skeleton data for the player. Passive tracking is provided automatically for up to four additional players in the sensor's field of view. When a player is being tracked passively, the skeleton frame contains only limited information about that player's position. By default, the first two skeletons that the skeletal tracking system finds are actively tracked, as shown in Figure 4.

Figure 4. Active tracking for two players

The runtime returns skeleton data in a skeleton frame, which contains an array of skeleton data structures, one for each skeleton that the skeletal tracking system recognized. Not every skeleton frame contains skeleton data. When skeleton tracking is enabled, the runtime signals a skeleton event every time it processes a depth frame, as described in the previous section.

For all returned skeletons, the following data is provided:

  • The current tracking state of the associated skeleton:
    • For skeletons that are passively tracked, this value indicates position-only tracking.
    • For an actively tracked skeleton, the value indicates skeleton-tracked.
  • A unique tracking ID that remains assigned to a single player as that player moves around the screen.

    The tracking ID is guaranteed to remain consistently applied to the same player for as long as he or she remains in the field of view. A given tracking ID is guaranteed to remain at the same index in the skeleton data array for as long as the tracking ID is in use. If the tracking ID of the skeleton at a particular index in the array changes, one of two things has happened: Either the tracked player left the field of view and tracking started on another player in the field of view, or the tracked player left the field of view, then returned, and is now being tracked again.

  • A position (of type Vector4) that indicates the center of mass for that player. This value is the only available positional value for passive players.
  • For the actively tracked players, returned data also includes the current full skeletal data.
  • For the passively tracked players, returned data includes only basic positional and identification data, and no skeletal data. 
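
A short sketch that walks the returned array and distinguishes the two tracking states, using the structure and enumeration names from the native headers:

    #include <windows.h>
    #include "MSR_NuiApi.h"   // header name as in the SDK beta

    void InspectSkeletons(const NUI_SKELETON_FRAME &frame)
    {
        for (int i = 0; i < NUI_SKELETON_COUNT; ++i)
        {
            const NUI_SKELETON_DATA &s = frame.SkeletonData[i];

            if (s.eTrackingState == NUI_SKELETON_TRACKED)
            {
                // Actively tracked: full joint data in s.SkeletonPositions,
                // plus s.dwTrackingID and the center-of-mass s.Position.
            }
            else if (s.eTrackingState == NUI_SKELETON_POSITION_ONLY)
            {
                // Passively tracked: only s.Position and s.dwTrackingID
                // are meaningful; no full skeletal data is returned.
            }
            // Entries with NUI_SKELETON_NOT_TRACKED carry no player data.
        }
    }
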
NUI Transformations

This section provides a brief overview of the various coordinate systems that are used with skeleton tracking and the API support that is provided for transformations between these spaces.

Depth Image Space

Image frames of the depth map are 640×480, 320×240, or 80×60 pixels in size. Each pixel represents the Cartesian distance, in millimeters, from the camera plane to the nearest object at that particular x and y coordinate, as shown in Figure 5. A pixel value of 0 indicates that the sensor did not find any objects within its range at that location.

Figure 5. Depth stream values

The x and y coordinates of the image frame do not represent physical units in the room, but rather pixels on the depth imaging sensor. The interpretation of the x and y coordinates depends on specifics of the optics and imaging sensor. For discussion purposes, this projected space is referred to as the depth image space.

Skeleton Space

Player skeleton positions are expressed in x, y, and z coordinates. Unlike the coordinates of depth image space, these three coordinates are expressed in meters. The x, y, and z axes are the body axes of the depth sensor. This is a right-handed coordinate system that places the sensor array at the origin point with the positive z axis extending in the direction in which the sensor array points. The positive y axis extends upward, and the positive x axis extends to the left (with respect to the sensor array), as shown in Figure 6. For discussion purposes, this expression of coordinates is referred to as the skeleton space.

Figure 6. Skeleton-space coordinate system for the sensor array

Placement of the sensor array affects the images that the camera records. For example, the camera might be placed on a surface that is not level or the sensor array might be vertically pivoted to optimize the sensor's field of view. In these cases, the y-axis of the skeleton space is usually not perpendicular to the floor or parallel with gravity. In the resulting images, people that are standing up straight could appear to be leaning.

Floor Determination

Each skeleton frame also contains a floor clip plane vector, which contains the coefficients of an estimated floor plane equation. The skeleton tracking system updates this estimate for each frame and uses it as a clipping plane for removing the background and segmenting players. The general plane equation is:

Ax + By + Cz + D = 0

where:

  A = vFloorClipPlane.x 
  B = vFloorClipPlane.y 
  C = vFloorClipPlane.z 
  D = vFloorClipPlane.w

The equation is normalized so that the physical interpretation of D is the height of the camera from the floor, in meters. It is worth noting that the floor might not always be visible. In this case, the floor clip plane is a zero vector. The floor clip plane can be found in the vFloorClipPlane member of the NUI_SKELETON_FRAME structure in the native interface and in the SkeletonFrame.FloorClipPlane field in the managed interface.
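
Because the plane is normalized, evaluating the equation at a point in skeleton space yields that point's signed height above the floor, in meters. A small helper along those lines, assuming the Vector4 type from the NUI headers:

    #include <windows.h>
    #include "MSR_NuiApi.h"   // header name as in the SDK beta

    // Returns the signed height, in meters, of a skeleton-space point above
    // the estimated floor plane. 'floorPlane' is the vFloorClipPlane value
    // from the skeleton frame.
    float HeightAboveFloor(const Vector4 &floorPlane, const Vector4 &point)
    {
        // A zero vector means the floor was not visible in this frame.
        if (floorPlane.x == 0.0f && floorPlane.y == 0.0f &&
            floorPlane.z == 0.0f && floorPlane.w == 0.0f)
            return 0.0f;   // no floor estimate available

        return floorPlane.x * point.x + floorPlane.y * point.y +
               floorPlane.z * point.z + floorPlane.w;
    }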

Skeletal Mirroring

By default, the skeleton system always mirrors the user who is being tracked. For applications that use an avatar to represent the user, such mirroring could be desirable if the avatar is shown facing into the screen. However, if the avatar faces the user, mirroring would present the avatar as backwards. Depending on its requirements, an application can create a transformation matrix to mirror the skeleton and then apply the matrix to the points in the array that contains the skeleton positions for that skeleton. The application is responsible for choosing the proper plane for reflection.
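
As a sketch of the simplest such reflection, mirroring across the sensor's x = 0 plane by negating the x coordinate of every skeleton position (the choice of reflection plane remains up to the application):

    #include <windows.h>
    #include "MSR_NuiApi.h"   // header name as in the SDK beta

    // Mirror a skeleton across the sensor's x = 0 plane by negating the
    // x coordinate of each skeleton position in place.
    void MirrorSkeletonPositions(Vector4 positions[], int count)
    {
        for (int i = 0; i < count; ++i)
            positions[i].x = -positions[i].x;
    }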
