Paper Presentations

Tuesday, 5 October

9:30-11:00 CEST UTC+2

11:30-13:00 CEST UTC+2

16:00-17:30 CEST UTC+2

Wednesday, 6 October

9:30-11:00 CEST UTC+2

16:00-17:30 CEST UTC+2

18:00-19:30 CEST UTC+2

Thursday, 7 October

9:30-11:00 CEST UTC+2

11:30-13:00 CEST UTC+2

Paper Session 1: Displays

Tuesday, 5 October
9:30 CEST UTC+2
Track A

Session Chair: Yifan (Evan) Peng
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only): Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track A

Edge-Guided Near-Eye Image Analysis for Head Mounted Displays

Zhimin Wang, Beihang University
Yuxin Zhao, Beihang University
Yunfei Liu, Beihang University
Feng Lu, Beihang University

Conference Paper

Eye tracking provides an effective way for interaction in Augmented Reality (AR) Head Mounted Displays (HMDs). Current eye tracking techniques for AR HMDs require eye segmentation and ellipse fitting under near-infrared illumination. However, due to the low contrast between sclera and iris regions and unpredictable reflections, it is still challenging to accomplish accurate iris/pupil segmentation and the corresponding ellipse fitting tasks. In this paper, inspired by the fact that most essential information is encoded in the edge areas, we propose a novel near-eye image analysis method with edge maps as guidance. Specifically, we first utilize an Edge Extraction Network (E2-Net) to predict high-quality edge maps, which only contain eyelids and iris/pupil contours without other undesired edges. Then we feed the edge maps into an Edge-Guided Segmentation and Fitting Network (ESF-Net) for accurate segmentation and ellipse fitting. Extensive experimental results demonstrate that our method outperforms current state-of-the-art methods in near-eye image segmentation and ellipse fitting tasks, based on which we present applications of eye tracking with AR HMDs.
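
As a rough, hypothetical illustration of the final fitting stage only (not the authors' ESF-Net), the sketch below fits an ellipse to a predicted pupil mask with OpenCV; the mask array and threshold are placeholders.

```python
# Hypothetical sketch: fit an ellipse to a predicted pupil probability mask.
# This only illustrates the classical fitting step, not the paper's ESF-Net.
import cv2
import numpy as np

def fit_pupil_ellipse(prob_mask: np.ndarray, threshold: float = 0.5):
    """prob_mask: HxW float array in [0, 1] from a segmentation network."""
    binary = (prob_mask > threshold).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    if len(largest) < 5:          # cv2.fitEllipse needs at least 5 points
        return None
    (cx, cy), (major, minor), angle = cv2.fitEllipse(largest)
    return (cx, cy), (major, minor), angle
```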

Head-Mounted Display with Increased Downward Field of View Improves Presence and Sense of Self-Location

Kizashi Nakano, Nara Institute of Science and Technology
Naoya Isoyama, Nara Institute of Science and Technology
Diego Vilela Monteiro, Xi’an Jiaotong-Liverpool University
Nobuchika Sakata, Ryukoku University
Kiyoshi Kiyokawa, Nara Institute of Science and Technology
Takuji Narumi, The University of Tokyo

Journal Paper

Common existing head-mounted displays (HMDs) for virtual reality (VR) provide users with high presence and embodiment. However, the field of view (FoV) of a typical HMD for VR is about 90 to 110 [deg] in the diagonal direction and about 70 to 90 [deg] in the vertical direction, which is narrower than that of humans. Specifically, the downward FoV of conventional HMDs is too narrow to present the user avatar’s body and feet. To address this problem, we have developed a novel HMD with a pair of additional display units to increase the downward FoV by approximately 60 (10 + 50) [deg]. We comprehensively investigated the effects of the increased downward FoV on the sense of immersion, which includes presence, sense of self-location (SoSL), sense of agency (SoA), and sense of body ownership (SoBO), during the VR experience, and on patterns of head movements and cybersickness as its secondary effects. The results clarified that the HMD with an increased downward FoV improved presence and SoSL. They also confirmed that users could see objects below with a head movement pattern close to real behavior and did not suffer from cybersickness. Moreover, the effect of the increased downward FoV on SoBO and SoA was limited, since it was easier to perceive the misalignment between the real and virtual bodies.

Blending Shadows: Casting Shadows in Virtual and Real using Occlusion-Capable Augmented Reality Near-Eye Displays

Kiyosato Someya, Tokyo Institute of Technology
Yuta Itoh, The University of Tokyo

Conference Paper

The fundamental goal of augmented reality (AR) is to integrate virtual objects into the user’s perceived reality seamlessly. However, various issues hinder this integration. In particular, Optical See-Through (OST) AR is hampered by the need for light subtraction due to its see-through nature, making some basic rendering harder to realize.
In this paper, we realize mutual shadows between real and virtual objects in OST AR to improve this virtual–real integration. Shadows are a classic problem in computer graphics, virtual reality, and video see-through AR, yet they have not been fully explored in OST AR due to the light subtraction requirement. We build a proof-of-concept system that combines a custom occlusion-capable OST display, global light source estimation, 3D registration, and ray-tracing-based rendering. We demonstrate mutual shadows using a prototype and show its effectiveness by quantitatively evaluating the rendered shadows against the real environment using a perceptual visual metric.

Directionally Decomposing Structured Light for Projector Calibration

Masatoki Sugimoto, Osaka University
Daisuke Iwai, Osaka University
Koki Ishida, Osaka University
Parinya Punpongsanon, Osaka University
Kosuke Sato, Osaka University

Journal Paper

Intrinsic projector calibration is essential in projection mapping (PM) applications, especially in dynamic PM. However, due to the shallow depth-of-field (DOF) of a projector, more work is needed to ensure accurate calibration. We aim to estimate the intrinsic parameters of a projector while avoiding the limitation of shallow DOF. As the core of our technique, we present a practical calibration device that requires a minimal working volume directly in front of the projector lens regardless of the projector’s focusing distance and aperture size. The device consists of a flat-bed scanner and pinhole-array masks. For calibration, a projector projects a series of structured light patterns in the device. The pinholes directionally decompose the structured light, and only the projected rays that pass through the pinholes hit the scanner plane. For each pinhole, we extract a ray passing through the optical center of the projector. Consequently, we regard the projector as a pinhole projector that projects the extracted rays only, and we calibrate the projector by applying the standard camera calibration technique, which assumes a pinhole camera model. Using a proof-of-concept prototype, we demonstrate that our technique can calibrate projectors with different focusing distances and aperture sizes at the same accuracy as a conventional method. Finally, we confirm that our technique can provide intrinsic parameters accurate enough for a dynamic PM application, even when a projector is placed too far from a projection target for a conventional method to calibrate the projector using a fiducial object of reasonable size.
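
Since the final step treats the projector as a pinhole device and applies standard camera calibration to the extracted rays, that step can be loosely illustrated with OpenCV's calibrateCamera; the correspondence arrays and resolution below are placeholders for data from the pinhole-array device, not the authors' code.

```python
# Loose illustration of the final step only: once each pinhole has yielded a
# set of projector-pixel <-> 3D point correspondences, the projector is treated
# as a pinhole camera and calibrated with the standard OpenCV routine.
# The argument contents are placeholders for data from the pinhole-array device.
import cv2

def calibrate_projector(object_points, projector_points, projector_resolution):
    """object_points: list of (N, 3) float32 arrays of 3D points, one per view.
    projector_points: list of (N, 1, 2) float32 arrays of projector pixels.
    projector_resolution: (width, height) of the projected image.
    Note: for non-coplanar 3D points, OpenCV additionally requires an initial
    intrinsic guess (cv2.CALIB_USE_INTRINSIC_GUESS)."""
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        object_points, projector_points, projector_resolution, None, None)
    return rms, K, dist
```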

Multifocal Stereoscopic Projection Mapping

Sorashi Kimura, Osaka University
Daisuke Iwai, Osaka University
Parinya Punpongsanon, Osaka University
Kosuke Sato, Osaka University

Journal Paper

Stereoscopic projection mapping (PM) allows a user to see a three-dimensional (3D) computer-generated (CG) object floating over physical surfaces of arbitrary shapes around us using projected imagery. However, the current stereoscopic PM technology only satisfies binocular cues and is not capable of providing correct focus cues, which causes a vergence–accommodation conflict (VAC). Therefore, we propose a multifocal approach to mitigate VAC in stereoscopic PM. Our primary technical contribution is to attach electrically focus-tunable lenses (ETLs) to active shutter glasses to control both vergence and accommodation. Specifically, we apply fast and periodical focal sweeps to the ETLs, which causes the “virtual image” (as an optical term) of a scene observed through the ETLs to move back and forth during each sweep period. A 3D CG object is projected from a synchronized high-speed projector only when the virtual image of the projected imagery is located at a desired distance. This provides an observer with the correct focus cues required. In this study, we solve three technical issues that are unique to stereoscopic PM: (1) The 3D CG object is displayed on non-planar and even moving surfaces; (2) the physical surfaces need to be shown without the focus modulation; (3) the shutter glasses additionally need to be synchronized with the ETLs and the projector. We also develop a novel compensation technique to deal with the “lens breathing” artifact that varies the retinal size of the virtual image through focal length modulation. Further, using a proof-of-concept prototype, we demonstrate that our technique can present the virtual image of a target 3D CG object at the correct depth. Finally, we validate the advantage provided by our technique by comparing it with conventional stereoscopic PM using a user study on a depth-matching task.

Paper Session 2: Gestures & Hand

Tuesday, 5 October
9:30 CEST UTC+2
Track B

Session Chair: Guofeng Zhang
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only): Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track B

Detection-Guided 3D Hand Tracking for Mobile AR Application

Yunlong Che, Oppo
Yue Qi, BUAA

Conference Paper

Interaction using bare hands is experiencing growing interest in mobile-based Augmented Reality (AR). Existing RGB-based works fail to provide a practical solution to identifying rich details of the hand. In this paper, we present a detection-guided method capable of recovering 3D hand posture with a color camera. The proposed method consists of key-point detectors and a 3D pose optimizer. The detectors first locate the 2D hand bounding box and then apply a lightweight network to the hand region to provide a pixel-wise likelihood of hand joints. The optimizer lifts the 3D pose from the estimated 2D joints in a model-fitting manner. To ensure that the result is plausible, we encode the hand shape into the objective function. The estimated 3D posture allows flexible hand-to-mobile interaction in AR applications. We extensively evaluate the proposed approach on several challenging public datasets. The experimental results indicate the efficiency and effectiveness of the proposed method.
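
As a generic sketch of the model-fitting idea (not the paper's exact objective, which additionally encodes hand shape), lifting 3D pose from 2D joint estimates can be posed as a reprojection-error minimization; the hand model and intrinsics below are placeholders.

```python
# Generic model-fitting sketch (not the paper's exact objective, which also
# encodes hand shape): recover pose parameters by minimising the reprojection
# error of model joints against the detector's 2D joint estimates.
import numpy as np
from scipy.optimize import least_squares

def project(points_3d, fx, fy, cx, cy):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixels."""
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)

def fit_pose(theta0, joints_2d, hand_model, intrinsics):
    """theta0: initial pose parameters; joints_2d: (N, 2) detected joints.
    hand_model: placeholder callable mapping parameters to (N, 3) joints."""
    fx, fy, cx, cy = intrinsics

    def residuals(theta):
        return (project(hand_model(theta), fx, fy, cx, cy) - joints_2d).ravel()

    return least_squares(residuals, theta0).x
```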

SAR: Spatial-Aware Regression for 3D Hand Pose and Mesh Reconstruction from a Monocular RGB Image

Xiaozheng Zheng, Beijing University of Posts and Telecommunications
Pengfei Ren, Beijing University of Posts and Telecommunications
Haifeng Sun, Beijing University of Posts and Telecommunications
Jingyu Wang, Beijing University of Posts and Telecommunications
Qi Qi, Beijing University of Posts and Telecommunications
Jianxin Liao, Beijing University of Posts and Telecommunications

Conference Paper

3D hand reconstruction has been a popular research topic in recent years and has great potential for VR/AR applications. However, due to the limited computational resources of VR/AR equipment, the reconstruction algorithm must balance accuracy and efficiency to give users a good experience. Nevertheless, current methods do not balance accuracy and efficiency well. Therefore, this paper proposes a novel framework that can achieve fast and accurate 3D hand reconstruction. Our framework relies on three essential modules: spatial-aware initial graph building (SAIGB), graph convolutional network (GCN) based belief maps regression (GBBMR), and pose-guided refinement (PGR). At first, given image feature maps extracted by convolutional neural networks, SAIGB builds a spatial-aware and compact initial feature graph. Each node in this graph represents a vertex of the mesh and has vertex-specific spatial information that is helpful for accurate and efficient regression. After that, GBBMR first utilizes an adaptive GCN to introduce interactions between vertices to capture short-range and long-range dependencies between vertices efficiently and flexibly. Then, it maps vertices’ features to belief maps that can model the uncertainty of predictions for more accurate predictions. Finally, we apply PGR to compress the redundant vertices’ belief maps into compact joints’ belief maps under pose guidance and use these joints’ belief maps to refine the previous predictions, obtaining more accurate and robust reconstruction results. Our method achieves state-of-the-art performance on four public benchmarks: FreiHAND, HO-3D, RHD, and STB. Moreover, our method can run at a speed two to three times that of previous state-of-the-art methods. Our code is available at https://github.com/zxz267/SAR.

Two-hand Pose Estimation from the non-cropped RGB Image with Self-Attention Based Network

Zhoutao Sun, Beihang University
Hu Yong, Beihang University
Xukun Shen, Beihang University

Conference Paper

Estimating the pose of two hands is a crucial problem for many human-computer interaction applications. Since most existing works utilize cropped images to predict the hand pose, they either require a hand detection stage before pose estimation or take cropped images as input directly. In this paper, we propose the first real-time one-stage method for pose estimation from a single RGB image without hand tracking. Combining the self-attention mechanism with convolutional layers, the proposed network is able to predict the 2.5D hand joint coordinates while locating the two hand regions. To reduce the extra memory and computational consumption caused by self-attention, we propose a linear attention structure with a spatial-reduction attention block called the SRAN block. We demonstrate the effectiveness of each component in our network through an ablation study, and experiments on public datasets show competitive results with state-of-the-art methods.

STGAE: Spatial-Temporal Graph Auto-Encoder for Hand Motion Denoising

Kanglei Zhou, Beihang University
Zhiyuan Cheng, Beihang University
Hubert Shum, Durham University
Frederick W. B. Li, Durham University
Xiaohui Liang, Beihang University

Conference Paper

Hand-object interaction in mixed reality (MR) relies on the accurate tracking and estimation of human hands, which provide users with a sense of immersion. However, raw captured hand motion data always contain errors such as joint occlusion, dislocation, high-frequency noise, and involuntary jitter. Denoising and obtaining hand motion data consistent with the user’s intention are of the utmost importance for enhancing the interactive experience in MR. To this end, we propose an end-to-end method for hand motion denoising using a spatial-temporal graph auto-encoder (STGAE). The spatial and temporal patterns are recognized simultaneously by constructing the consecutive hand joint sequence as a spatial-temporal graph. Considering the complexity of the articulated hand structure, a simple yet effective partition strategy is proposed to model the physically-connected and symmetry-connected relationships. Graph convolution is applied to extract structural constraints of the hand, and a self-attention mechanism is used to adjust the graph topology dynamically. Combining graph convolution and temporal convolution, a fundamental graph encoder or decoder block is proposed. We finally establish an hourglass residual auto-encoder to learn a manifold projection operation and a corresponding inverse projection by stacking these blocks. The proposed framework has been successfully applied to hand motion data denoising while preserving structural constraints between joints. Extensive quantitative and qualitative experiments show that the proposed method achieves better performance than state-of-the-art approaches.

Classifying In-Place Gestures with End-to-End Point Cloud Learning

Lizhi Zhao, Northwest A&F University
Xuequan Lu, Deakin University
Min Zhao, Northwest A&F University
Meili Wang, Northwest A&F University

Conference Paper

Walking in place for moving through virtual environments has attracted noticeable attention recently. Recent attempts focused on training a classifier to recognize certain patterns of gestures (e.g., standing, walking, etc) with the use of neural networks like CNN or LSTM. Nevertheless, they often consider very few types of gestures and/or induce less desired latency in virtual environments. In this paper, we propose a novel framework for accurate and efficient classification of in-place gestures. Our key idea is to treat several consecutive frames as a “point cloud”. The HMD and two VIVE trackers provide three points in each frame, with each point consisting of 12-dimensional features (i.e., three-dimensional position coordinates, velocity, rotation, angular velocity). We create a dataset consisting of 9 gesture classes for virtual in-place locomotion. In addition to the supervised point-based network, we also take unsupervised domain adaptation into account due to inter-person variations. To this end, we develop an end-to-end joint framework involving both a supervised loss for supervised point learning and an unsupervised loss for unsupervised domain adaptation. Experiments demonstrate that our approach generates very promising outcomes, in terms of high overall classification accuracy (95.0%) and real-time performance (192ms latency). We will release our dataset and source code to the community.
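
The core idea of treating several consecutive frames as a point cloud can be sketched as a simple array reshape; the window length and feature ordering below are assumptions, not the authors' exact layout.

```python
# Assumed layout: per frame, 3 tracked points (HMD + 2 trackers), each with a
# 12-d feature (position, velocity, rotation as Euler angles, angular velocity).
import numpy as np

def frames_to_point_cloud(frames):
    """frames: (T, 3, 12) array of T consecutive frames.
    Returns a (T*3, 12) point cloud fed to the point-based classifier."""
    frames = np.asarray(frames, dtype=np.float32)
    T, n_points, n_feats = frames.shape
    assert (n_points, n_feats) == (3, 12)
    return frames.reshape(T * n_points, n_feats)

window = np.zeros((30, 3, 12), dtype=np.float32)   # e.g. a 30-frame window
cloud = frames_to_point_cloud(window)               # -> (90, 12) points
```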

The object at hand: Automated Editing for Mixed Reality Video Guidance from hand-object interactions

Yao Lu, University of Bristol
Walterio Mayol-Cuevas, University of Bristol

Conference Paper

In this paper, we address the problem of how to automatically extract the steps that compose hand activities. This is a key competency towards processing, monitoring and providing video guidance in Mixed Reality systems. We use egocentric vision to observe hand-object interactions in real-world tasks and automatically decompose a video into its constituent steps. Our approach combines hand-object interaction (HOI) detection, object similarity measurement and a finite state machine (FSM) representation to automatically edit videos into steps. We use a combination of Convolutional Neural Networks (CNNs) and the FSM to discover, edit cuts and merge operations while observing real hand activities. We evaluate our algorithm quantitatively and qualitatively on two datasets: GTEA [19] and a new dataset we introduce for Chinese tea making. Results show our method is able to segment hand-object interaction videos into key step segments with high levels of precision.
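
As a minimal, hypothetical illustration of using an FSM over per-frame HOI detections to cut a video into steps (not the paper's full pipeline, which also uses object similarity), consider:

```python
# Minimal sketch of a finite state machine over per-frame hand-object
# interaction (HOI) detections that cuts a video into steps.
# The two-state machine and the length threshold are illustrative only.
def segment_steps(hoi_active, min_len=15):
    """hoi_active: list of booleans, one per frame (True = HOI detected).
    Returns a list of (start_frame, end_frame) step segments."""
    segments, state, start = [], "IDLE", None
    for i, active in enumerate(hoi_active):
        if state == "IDLE" and active:
            state, start = "INTERACTING", i
        elif state == "INTERACTING" and not active:
            if i - start >= min_len:          # drop spurious short detections
                segments.append((start, i - 1))
            state, start = "IDLE", None
    if state == "INTERACTING" and len(hoi_active) - start >= min_len:
        segments.append((start, len(hoi_active) - 1))
    return segments
```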

Paper Session 3: Input & Interaction

Tuesday, 5 October
11:30 CEST UTC+2
Track A

Session Chair: Christian Holz
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only): Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track A

A Taxonomy of Interaction Techniques for Immersive Augmented Reality based on an Iterative Literature Review

Julia Hertel, Universität Hamburg
Sukran Karaosmanoglu, Universität Hamburg
Susanne Schmidt, Universität Hamburg
Julia Bräker, Universität Hamburg
Martin Semmann, Universität Hamburg
Frank Steinicke, Universität Hamburg

Conference Paper

Developers of interactive systems have a variety of interaction techniques to choose from, each with individual strengths and limitations in terms of the considered task, context, and users. While there are taxonomies for desktop, mobile, and virtual reality applications, augmented reality (AR) taxonomies have not been established yet. However, recent advances in immersive AR technology (i.e., head-worn or projection-based AR), such as the emergence of untethered headsets with integrated gesture and speech sensors, have enabled the inclusion of additional input modalities and, therefore, novel multimodal interaction methods have been introduced. To provide an overview of interaction techniques for current immersive AR systems, we conducted a literature review of publications between 2016 and 2021. Based on 44 relevant papers, we developed a comprehensive taxonomy focusing on two identified dimensions – task and modality. We further present an adaptation of an iterative taxonomy development method to the field of human-computer interaction. Finally, we discuss observed trends and implications for future work.

HPUI: Hand Proximate User Interfaces for One-Handed Interactions on Head Mounted Displays

Shariff AM Faleel, University of Manitoba
Michael Gammon, University of Manitoba
Kevin Fan, Huawei Canada
Da-Yuan Huang, Huawei Canada
Wei Li, Huawei Canada
Pourang Irani, University of Manitoba

Journal Paper

We explore the design of Hand Proximate User Interfaces (HPUIs) for head-mounted displays (HMDs) to facilitate near-body interactions with the display directly projected on, or around the user’s hand. We focus on single-handed input, while taking into consideration the hand anatomy which distorts naturally when the user interacts with the display. Through two user studies, we explore the potential for discrete as well as continuous input. For discrete input, HPUIs favor targets that are directly on the fingers (as opposed to off-finger) as they offer tactile feedback. We demonstrate that continuous interaction is also possible, and is as effective on the fingers as in the off-finger space between the index finger and thumb. We also find that with continuous input, content is more easily controlled when the interaction occurs in the vertical or horizontal axes, and less with diagonal movements. We conclude with applications and recommendations for the design of future HPUIs.

Rotation-constrained optical see-through headset calibration with bare-hand alignment

Xue Hu, Imperial College London
Ferdinando Rodriguez Y Baena, Imperial College London
Fabrizio Cutolo, University of Pisa

Conference Paper

The inaccessibility of user-perceived reality remains an open issue in pursuing the accurate calibration of optical see-through (OST) head-mounted displays (HMDs). Manual user alignment is usually required to collect a set of virtual-to-real correspondences, so that a default or an offline display calibration can be updated to account for the user’s eye position(s). Current alignment-based calibration procedures usually require point-wise alignments between rendered image point(s) and associated physical landmark(s) of a target calibration tool. As each alignment can only provide one or a few correspondences, repeated alignments are required to ensure calibration quality.

This work presents an accurate and tool-less online OST calibration method to update an offline-calibrated eye-display model. The user’s bare hand is markerlessly tracked by a commercial RGBD camera anchored to the OST headset to generate a user-specific cursor for correspondence collection. The required alignment is object-wise, and can provide thousands of unordered corresponding points in tracked space. The collected correspondences are registered by a proposed rotation-constrained iterative closest point (rcICP) method to optimise the viewpoint-related calibration parameters. We implemented such a method for the Microsoft HoloLens 1. The resiliency of the proposed procedure to noisy data was evaluated through simulated tests and real experiments performed with an eye-replacement camera.
According to the simulation tests, the rcICP registration is robust against possible user-induced rotational misalignment. With a single alignment, our method achieves 8.81 arcmin (1.37 mm) positional error and 1.76 degrees rotational error in camera-based tests at arm-reach distance, and 10.79 arcmin (7.71 pixels) reprojection error in user tests.
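
The core alignment step inside any ICP-style registration is a rigid fit of matched correspondences; the sketch below shows a standard Kabsch/SVD fit only and does not reproduce the paper's rotation constraint.

```python
# Sketch of the core alignment step inside an ICP-style registration: a rigid
# Kabsch/SVD fit of matched 3D correspondences. The paper's rcICP additionally
# constrains the rotation; that constraint is NOT reproduced here.
import numpy as np

def rigid_fit(src, dst):
    """src, dst: (N, 3) matched point sets. Returns R (3x3), t (3,) with
    dst ~= src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t
```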

Complex Interaction as Emergent Behaviour: Simulating Mid-Air Text Entry using Reinforcement Learning

Lorenz Hetzel, ETH Zürich
John J Dudley, University of Cambridge
Anna Maria Feit, Saarland University
Per Ola Kristensson, University of Cambridge

Journal Paper

Accurately modelling user behaviour has the potential to significantly improve the quality of human-computer interaction. Traditionally, these models are carefully hand-crafted to approximate specific aspects of well-documented user behaviour. This limits their availability in virtual and augmented reality where user behaviour is often not yet well understood. Recent efforts have demonstrated that reinforcement learning can approximate human behaviour during simple goal-oriented reaching tasks. We build on these efforts and demonstrate that reinforcement learning can also approximate user behaviour in a complex mid-air interaction task: typing on a virtual keyboard. We present the first reinforcement learning-based user model for mid-air and surface-aligned typing on a virtual keyboard. Our model is shown to replicate high-level human typing behaviour. We demonstrate that this approach may be used to augment or replace human testing during the validation and development of virtual keyboards.

A Predictive Performance Model for Immersive Interactions in Mixed Reality

Florent Cabric, IRIT
Emmanuel Dubois, IRIT
Marcos Serrano, IRIT

Conference Paper

The design of immersive interaction for mixed reality based on head-mounted displays (HMDs), hereafter referred to as Mixed Reality (MR), is still a tedious task which can hinder the advent of such devices. Indeed, the effects of the interface design on task performance are difficult to anticipate during the design phase: the spatial layout of virtual objects and the interaction techniques used to select those objects can have an impact on task completion time. Besides, testing such interfaces with users in controlled experiments requires considerable time and effort. To overcome this problem, predictive models, such as the Keystroke-Level Model (KLM), can be used to predict the time required to complete an interactive task at an early stage of the design process. However, so far these models have not been properly extended to address the specific interaction techniques of MR environments. In this paper we propose an extension of the KLM model to interaction performed in MR. First, we propose new operators and experimentally determine the unit times for each of them with a HoloLens v1. Then, we perform experiments based on realistic interaction scenarios to consolidate our model. These experiments confirm the validity of our extension of KLM to predict interaction time in mixed reality environments.
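
KLM-style prediction reduces to summing the unit times of the operators in a task sequence; the operator names and unit times below are hypothetical placeholders, not the MR operators or values measured in the paper.

```python
# Keystroke-Level-Model-style prediction: task time = sum of operator unit
# times. Operator names and unit times here are hypothetical placeholders,
# not the MR operators or values calibrated in the paper.
UNIT_TIMES = {           # seconds, illustrative only
    "point_head": 0.9,   # aim head/gaze cursor at a hologram
    "air_tap":    0.4,   # confirm with a hand gesture
    "mental":     1.2,   # mental preparation (classic KLM "M" operator)
}

def predict_time(sequence):
    return sum(UNIT_TIMES[op] for op in sequence)

# e.g. select two holograms in a row:
print(predict_time(["mental", "point_head", "air_tap",
                    "mental", "point_head", "air_tap"]))   # -> 5.0 s
```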

Paper Session 4: Rendering & Display

Tuesday, 5 October
11:30 CEST UTC+2
Track B

Session Chair: Kaan Aksit
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only): Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track B

AgentDress: Realtime Clothing Synthesis for Virtual Agents using Plausible Deformations

Nannan Wu, Zhejiang University
Qianwen Chao, Xidian University
Yanzhen Chen, State Key Lab of CAD&CG
Weiwei Xu, Zhejiang University
Chen Liu, Zhejiang Linctex Digital Technology Co.
Dinesh Manocha, University of Maryland
Wenxin Sun, Zhejiang University
Yi Han, Zhejiang University
Xinran Yao, Zhejiang University
Xiaogang Jin, Zhejiang University

Journal Paper

We present a CPU-based real-time cloth animation method for dressing virtual humans of various shapes and poses. Our approach formulates the clothing deformation as a high-dimensional function of body shape parameters and pose parameters. In order to accelerate the computation, our formulation factorizes the clothing deformation into two independent components: the deformation introduced by body pose variation (Clothing Pose Model) and the deformation from body shape variation (Clothing Shape Model). Furthermore, we sample and cluster the poses spanning the entire pose space and use those clusters to efficiently calculate the anchoring points. We also introduce a sensitivity-based distance measurement to both find nearby anchoring points and evaluate their contributions to the final animation. Given a query shape and pose of the virtual agent, we synthesize the resulting clothing deformation by blending the Taylor expansion results of nearby anchoring points. Compared to previous methods, our approach is general and able to add the shape dimension to any clothing pose model. Furthermore, we can animate clothing represented with tens of thousands of vertices at 50+ FPS on a CPU. We also conduct a user evaluation  and show that our method can improve a user’s perception of dressed  virtual agents in an immersive virtual environment (IVE) compared to a realtime linear blend skinning method.
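
The blending of first-order Taylor expansions stored at nearby anchoring points can be sketched generically as follows; the anchor data, weights, and dimensions are placeholders rather than the paper's actual clothing model.

```python
# Generic sketch of blending first-order Taylor expansions stored at anchoring
# points: deformation(q) ~= sum_i w_i * (D_i + J_i @ (q - q_i)).
# Anchor data, weights, and dimensions are illustrative placeholders.
import numpy as np

def blend_anchors(q, anchors, weights):
    """q: (P,) query pose/shape parameters.
    anchors: list of dicts with 'q' (P,), 'D' (V*3,), 'J' (V*3, P).
    weights: non-negative blend weights (one per anchor) summing to 1."""
    out = np.zeros_like(anchors[0]["D"])
    for w, a in zip(weights, anchors):
        out += w * (a["D"] + a["J"] @ (q - a["q"]))
    return out          # per-vertex clothing displacement, flattened (V*3,)
```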

Perception-Driven Hybrid Foveated Depth of Field Rendering for Head-Mounted Displays

Jingyu Liu, Technical University of Denmark
Claire Mantel, Technical University of Denmark
Soren Forchhammer, Technical University of Denmark

Conference Paper

In this paper, we present a novel perception-driven hybrid rendering method leveraging the limitation of the human visual system (HVS). Features accounted in our model include: foveation from the visual acuity eccentricity (VAE), depth of field (DOF) from vergence & accommodation, and longitudinal chromatic aberration (LCA) from color vision. To allocate computational workload efficiently, first we apply a gaze-contingent geometry simplification. Then we convert the coordinates from screen space to polar space with a scaling strategy coherent with VAE. Upon that, we apply a stochastic sampling based on DOF. Finally, we post-process the Bokeh for DOF, which can at the same time achieve LCA and anti-aliasing. A virtual reality (VR) experiment on 6 Unity scenes with a head-mounted display (HMD) HTC VIVE Pro Eye yields frame rates range from 25.2 to 48.7 fps. Objective evaluation with FovVideoVDP – a perceptual based visible difference metric – suggests that the proposed method gives satisfactory just-objectionable-difference (JOD) scores across 6 scenes from 7.61 to 8.69 (in a 10 unit scheme). Our method achieves better performance compared with the existing methods while having the same or better level of quality scores.

Long-Range Augmented Reality with Dynamic Occlusion Rendering

Mikhail Sizintsev, SRI International
Niluthpol Chowdhury Mithun, SRI International
Han-Pang Chiu, SRI International
Supun Samarasekera, SRI International
Rakesh Kumar, SRI International

Journal Paper

Proper occlusion-based rendering is very important to achieve realism in all indoor and outdoor Augmented Reality (AR) applications. This paper addresses the problem of fast and accurate dynamic occlusion reasoning by real objects in the scene for large-scale outdoor AR applications. Conceptually, proper occlusion reasoning requires an estimate of depth for every point in the augmented scene, which is technically hard to achieve for outdoor scenarios, especially in the presence of moving objects. We propose a method to detect and automatically infer the depth for real objects in the scene without explicit detailed scene modeling and depth sensing (e.g. without using sensors such as 3D-LiDAR). Specifically, we employ instance segmentation of color image data to detect real dynamic objects in the scene and use either a top-down terrain elevation model or a deep learning based monocular depth estimation model to infer their metric distance from the camera for proper occlusion reasoning in real time. The realized solution is implemented in a low-latency real-time framework for video-see-through AR and is directly extendable to optical-see-through AR. We minimize latency in depth reasoning and occlusion rendering by doing semantic object tracking and prediction in video frames.
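
Once a metric depth is inferred for the real dynamic objects, occlusion reduces to a per-pixel depth comparison; the sketch below shows only that comparison, with placeholder depth maps.

```python
# Simplified per-pixel occlusion test: a virtual fragment is hidden wherever an
# inferred real-object depth is closer to the camera. Depth maps and units are
# placeholders; the paper infers real depth from segmentation plus a terrain
# model or a monocular depth network.
import numpy as np

def occlusion_mask(virtual_depth, real_depth):
    """Both inputs: (H, W) metric depth maps (np.inf where no real object).
    Returns a boolean mask, True where virtual content should be drawn."""
    return virtual_depth < real_depth
```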

Scan&Paint: Image-based Projection Painting

Vanessa Klein, Friedrich-Alexander-University Erlangen-Nuremberg
Markus Leuschner, Friedrich-Alexander-University Erlangen-Nuremberg
Tobias Langen, Friedrich-Alexander-University Erlangen-Nuremberg
Philipp Kurth, Friedrich-Alexander-University Erlangen-Nuremberg
Marc Stamminger, Friedrich-Alexander-University Erlangen-Nuremberg
Frank Bauer, Friedrich-Alexander-University Erlangen-Nuremberg

Conference Paper

We present a pop-up projection painting system that projects onto an unknown three-dimensional surface, while the user creates the projection content on the fly. The digital paint is projected immediately and follows the object if it is moved. If unexplored surface areas are thereby exposed, an automated trigger system issues new depth recordings that expand and refine the surface estimate. By intertwining scanning and projection painting we scan the exposed surface at the appropriate time and only if needed. Like image-based rendering, multiple automatically recorded depth maps are fused in screen space to synthesize novel views of the object, making projection poses independent from the scan positions. Since the user’s digital paint is also stored in images, we eliminate the need to reconstruct and parametrize a single full mesh, which makes geometry and color updates simple and fast.

Gaze-Contingent Retinal Speckle Suppression in Holographic Displays

Praneeth Chakravarthula, UNC Chapel Hill
Zhan Zhang, University of Science and Technology of China
Okan Tarhan Tursun, Università della Svizzera italiana (USI)
Piotr Didyk, University of Lugano
Qi Sun, New York University
Henry Fuchs, University of North Carolina at Chapel Hill

Journal Paper

Computer-generated holographic (CGH) displays show great potential and are emerging as the next-generation displays for augmented and virtual reality, and automotive heads-up displays. One of the critical problems harming the wide adoption of such displays is the presence of speckle noise inherent to holography, which compromises image quality by introducing perceptible artifacts. Although speckle noise suppression has been an active research area, previous works have not considered the perceptual characteristics of the Human Visual System (HVS), which receives the final displayed imagery. However, it is well studied that the sensitivity of the HVS is not uniform across the visual field, which has led to gaze-contingent rendering schemes for maximizing the perceptual quality in various computer-generated imagery. Inspired by this, we present the first method that reduces the “perceived speckle noise” by integrating foveal and peripheral vision characteristics of the HVS, along with the retinal point spread function, into the phase hologram computation. Specifically, we introduce the anatomical and statistical retinal receptor distribution into our computational hologram optimization, which places a higher priority on reducing the perceived foveal speckle noise while being adaptable to any individual’s optical aberration on the retina. Our method demonstrates superior perceptual quality on our emulated holographic display. Our evaluations with objective measurements and subjective studies demonstrate a significant reduction of the human perceived noise.

Paper Session 5: Avatars

Tuesday, 5 October
16:00 CEST UTC+2
Track A

Session Chair: Bobby Bodenheimer
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only): Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track A

The Effects of Virtual Avatar Visibility on Pointing Interpretation by Observers in 3D Environments

Brett Benda, University of Florida
Eric Ragan, University of Florida

Conference Paper

Avatars are often used to provide representations of users in 3D environments, such as desktop games or VR applications. While full-body avatars are often sought to be used in applications, low visibility avatars (i.e., head and hands) are often used in a variety of contexts, either as intentional design choices, for simplicity in contexts where full-body avatars are not needed, or due to external limitations. Avatar style can also vary from more simplistic and abstract to highly realistic depending on application context and user choices. We present the results of two desktop experiments that examine avatar visibility, style, and observer view on accuracy in a pointing interpretation task. Significant effects of visibility were found, with effects varying between horizontal and vertical components of error, and error amounts not always worsening as a result of lowering visibility. Error due to avatar visibility was much smaller than error resulting from avatar style or observer view. Our findings suggest that humans are reasonably able to understand pointing gestures with a limited observable body.

Diegetic Representations for Seamless Cross-Reality Interruptions

Matt Gottsacker, University of Central Florida
Nahal Norouzi, University of Central Florida
Kangsoo Kim, University of Central Florida
Gerd Bruder, University of Central Florida
Greg Welch, University of Central Florida

Conference Paper

The closed design of virtual reality (VR) head-mounted displays substantially limits users’ awareness of their real-world surroundings. This presents challenges when another person in the same physical space needs to interrupt the VR user for a brief conversation. Such interruptions, e.g., tapping a VR user on the shoulder, can cause a disruptive break in presence (BIP), which affects their place and plausibility illusions, and may cause a drop in performance of their virtual activity. Recent findings related to the concept of diegesis, which denotes the internal consistency of an experience/story, suggest potential benefits of integrating registered virtual representations for physical interactors, especially when these appear internally consistent in VR. In this paper, we present a human-subject study we conducted to compare and evaluate five different diegetic and non-diegetic methods to facilitate cross-reality interruptions in a virtual office environment, where a user’s task was briefly interrupted by a physical person. We created a Cross-Reality Interaction Questionnaire (CRIQ) to capture the quality of the interaction from the VR user’s perspective. Our results show that the diegetic representations afforded reasonably high senses of co-presence, the highest quality interactions, the highest place illusions, and caused the least disruption of the participants’ virtual experiences. We discuss our findings as well as implications for practical applications that aim to leverage virtual representations to ease cross-reality interruptions.

Avatars for Teleconsultation: Effects of Avatar Embodiment Techniques on User Perception in 3D Asymmetric Telepresence

Kevin Yu, Technische Universität München
Gleb Gorbachev, Technische Universität München
Ulrich Eck, Technische Universität München
Frieder Pankratz, LMU
Nassir Navab, Technische Universität München
Daniel Roth, Computer Aided Medical Procedures and Augmented Reality

Journal Paper

A 3D Telepresence system allows users to interact with each other in a virtual, mixed, or augmented reality (VR, MR, AR) environment, creating a shared space for collaboration and communication. There are two main methods for representing users within these 3D environments. Users can be represented either as point cloud reconstruction-based avatars that resemble a physical user or as virtual character-based avatars controlled by tracking the users’ body motion. This work compares both techniques to identify the differences between user representations and their fit in the reconstructed environments regarding the perceived presence, uncanny valley factors, and behavior impression. Our study uses an asymmetric VR/AR teleconsultation system that allows a remote user to join a local scene using VR. The local user observes the remote user with an AR head-mounted display, leading to facial occlusions in the 3D reconstruction. Participants perform a warm-up interaction task followed by a goal-directed collaborative puzzle task, pursuing a common goal. The local user was represented either as a point cloud reconstruction or as a virtual character-based avatar, in which case the point cloud reconstruction of the local user was masked. Our results show that the point cloud reconstruction-based avatar was superior to the virtual character avatar regarding perceived co-presence, social presence, behavioral impression, and humanness. Further, we found that the task type partly affected the perception. The point cloud reconstruction-based approach led to higher usability ratings, while objective performance measures showed no significant difference. We conclude that despite partly missing facial information, the point cloud-based reconstruction resulted in better conveyance of the user behavior and a more coherent fit into the simulation context.

AlterEcho: Loose Avatar-Streamer Coupling for Expressive VTubing

Man To Tang, Purdue University
Victor Long Zhu, Purdue University
Voicu Popescu, Purdue University

Conference Paper

VTubers are live streamers who embody computer animation virtual avatars. VTubing is a rapidly rising form of online entertainment in East Asia, most notably in Japan and China, and it has been more recently introduced in the West. However, animating an expressive VTuber avatar remains a challenge due to budget and usability limitations of current solutions, i.e., high-fidelity motion capture is expensive, while keyboard-based VTubing interfaces impose a cognitive burden on the streamer. This paper proposes a novel approach for VTubing animation based on the key principle of loosening the coupling between the VTuber and their avatar, and it describes a first implementation of the approach in the AlterEcho VTubing animation system. AlterEcho generates expressive VTuber avatar animation automatically, without the streamer’s explicit intervention; it breaks the strict tethering of the avatar to the streamer, allowing the avatar’s nonverbal behavior to deviate from that of the streamer. Without the complete independence of a true alter ego, but also without the constraint of mirroring the streamer with the fidelity of an echo, AlterEcho produces avatar animations that have been rated significantly higher by VTubers and viewers (N = 315) compared to animations created using simple motion capture, or using VMagicMirror, a state-of-the-art keyboard-based VTubing system. Our work also opens the door to personalizing the avatar persona for individual viewers.

Varying User Agency Alongside Interaction Opportunities in a Home Mobile Mixed Reality Story

Gideon Raeburn, Queen Mary University of London
Laurissa Tokarchuk, Queen Mary University of London

Conference Paper

New opportunities for immersive storytelling experiences have arrived through the technology in mobile phones, including the ability to overlay or register digital content on a user’s real-world surroundings, to further immerse the user in the world of the story. This raises questions around the methods and the freedom to interact with the digital elements that will lead to a more immersive and engaging experience. To investigate these areas, the Augmented Virtuality (AV) mobile phone application Home Story was developed for iOS devices. It allows a user to move and interact with objects in a virtual environment displayed on their phone by physically moving in the real world, completing particular actions to progress a story. A mixed-methods study with Home Story either guided participants to the next interaction or offered them increased agency to choose which object to interact with next. Virtual objects could be interacted with in one of three ways: imagining the interaction, an embodied interaction using the user’s free hand, or a virtual interaction performed on the phone’s touchscreen. Similar levels of immersion were recorded across both study conditions, suggesting both can be effective, though highlighting different issues in each case. The embodied free-hand interactions proved particularly memorable, though further work is required to improve their implementation, given their novelty and users’ lack of familiarity with them.

Paper Session 6: Navigation & Training

Tuesday, 5 October
16:00 CEST UTC+2
Track B

Session Chair: Qi Sun
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only): Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track B

The Cognitive Loads and Usability of Target-based and Steering-based Travel Techniques

Chengyuan Lai, The University of Texas at Dallas
Xinyu Hu, University of Central Florida
Afham Aiyaz, University of Texas at Dallas
Ann K Segismundo, New York University
Ananya A Phadke, The Hockaday School
Ryan P. McMahan, University of Central Florida

Journal Paper

Target-based and steering-based techniques are two common approaches to travel in consumer VR applications. In this paper, we present two within-subject studies that employ a prior dual-task methodology to evaluate and compare the cognitive loads, travel performances, and simulator sickness of three common target-based travel techniques and three common steering-based travel techniques. We also present visual meta-analyses comparing our results to prior results using the same dual-task methodology. Based on our results and meta-analyses, we present several design suggestions for travel techniques based on various aspects of user experiences.

Understanding, Modeling and Simulating Unintended Positional Drift during Repetitive Steering Navigation Tasks in Virtual Reality

Hugo Brument, INRIA
Gerd Bruder, University of Central Florida
Maud Marchal, INRIA
Anne-Hélène Olivier, University of Rennes
Ferran Argelaguet Sanz, INRIA

Journal Paper

Virtual steering techniques enable users to navigate in larger Virtual Environments (VEs) than the physical workspace available. Even though these techniques do not require physical movement of the users (e.g. using a joystick and the head orientation to steer towards a virtual direction), recent work observed that users might unintentionally move in the physical workspace while navigating, resulting in Unintended Positional Drift (UPD). This phenomenon can be a safety issue since users may unintentionally reach the physical boundaries of the workspace while using a steering technique. In this context, as a necessary first step to improve the design of navigation techniques minimizing the UPD, this paper aims at analyzing and modeling the UPD during a virtual navigation task. In particular, we characterize and analyze the UPD for a dataset containing the positions and orientations of eighteen users performing a virtual slalom task using virtual steering techniques. Participants wore a head-mounted display and had to follow three different sinusoidal-like trajectories (with low, medium and high curvature) using a torso-steering navigation technique. We analyzed the performed motions and proposed two UPD models: the first based on a linear regression analysis and the second based on a Gaussian Mixture Model (GMM) analysis. Then, we assessed both models through a simulation-based evaluation where we reproduced the same navigation task using virtual agents. Our results indicate the feasibility of using simulation-based evaluations to study UPD. The paper concludes with a discussion of potential applications of the results in order to gain a better understanding of UPD during steering and therefore improve the design of navigation techniques by compensating for UPD.
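
As an illustration of the GMM-based modelling step (with an assumed feature layout and component count, not the paper's configuration), a mixture can be fitted to drift samples and then sampled to drive simulated agents:

```python
# Illustrative only: fitting a Gaussian Mixture Model to unintended positional
# drift samples with scikit-learn. The number of components and the feature
# layout (per-trial 2D drift vectors) are assumptions, not the paper's setup.
import numpy as np
from sklearn.mixture import GaussianMixture

drift = np.random.randn(500, 2) * 0.05      # placeholder drift vectors (metres)
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(drift)

simulated_drift, _ = gmm.sample(100)        # draw drift for simulated agents
```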

Redirected Walking in Static and Dynamic Scenes Using Visibility Polygons

Niall L. Williams, University of Maryland
Aniket Bera, University of Maryland
Dinesh Manocha, University of Maryland

Journal Paper

We present a new approach for redirected walking in static and dynamic scenes that uses techniques from robot motion planning to compute the redirection gains that steer the user on collision-free paths in the physical space. Our first contribution is a mathematical framework for redirected walking using concepts from motion planning and configuration spaces. This framework highlights various geometric and perceptual constraints that tend to make collision-free redirected walking difficult. We use our framework to propose an efficient solution to the redirection problem that uses the notion of visibility polygons to compute the free spaces in the physical environment and the virtual environment. The visibility polygon provides a concise representation of the entire space that is visible, and therefore walkable, to the user from their position within an environment. Using this representation of walkable space, we apply redirected walking to steer the user to regions of the visibility polygon in the physical environment that closely match the region that the user occupies in the visibility polygon in the virtual environment. We show that our algorithm is able to steer the user along paths that result in significantly fewer resets than existing state-of-the-art algorithms in both static and dynamic scenes. Our project website is available at https://gamma.umd.edu/vis_poly/.

Using Multi-Level Precueing to Improve Performance in Path-Following Tasks in Virtual Reality

Jen-Shuo Liu, Columbia University
Carmine Elvezio, Columbia University
Barbara Tversky, Columbia University
Steven Feiner, Columbia University

Journal Paper

Work on VR and AR task interaction and visualization paradigms has typically focused on providing information about the current step (a cue) immediately before or during its performance. Some research has also shown benefits to simultaneously providing information about the next step (a precue). We explore whether it would be possible to improve efficiency by precueing information about multiple upcoming steps before completing the current step. To accomplish this, we developed a remote VR user study comparing task completion time and subjective metrics for different levels and styles of precueing in a path-following task. Our visualizations vary the precueing level (number of steps precued in advance) and style (whether the path to a target is communicated through a line to the target, and whether the place of a target is communicated through graphics at the target). Participants in our study performed best when given two to three precues for visualizations using lines to show the path to targets. However, performance degraded when four precues were used. On the other hand, participants performed best with only one precue for visualizations without lines, showing only the places of targets, and performance degraded when a second precue was given. In addition, participants performed better using visualizations with lines than ones without lines.

Personal Identifiability of User Tracking Data During VR Training

Alec G Moore, University of Central Florida
Ryan P. McMahan, University of Central Florida
Hailiang Dong, University of Texas at Dallas
Nicholas Ruozzi, University of Texas at Dallas

Conference Paper

Recent research indicates that user tracking data from virtual reality (VR) experiences can be used to personally identify users with degrees of accuracy as high as 95%. However, these results indicating that VR tracking data should be understood as personally identifying data were based on observing 360° videos. In this paper, we present results based on sessions of user tracking data from an ecologically valid VR training application, which indicate that the prior claims may not be as applicable for identifying users beyond the context of observing 360° videos. Our results indicate that the degree of identification accuracy notably decreases between VR sessions. Furthermore, we present results indicating that user tracking data can be obfuscated by encoding positional data as velocity data, which has been successfully used to predict other user experience outcomes like simulator sickness and knowledge acquisition. These results, which show identification accuracies were reduced by more than half, indicate that velocity-based encoding can be used to reduce identifiability and help protect personal identifying data.
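
The velocity-based encoding idea amounts to replacing absolute positions with frame-to-frame velocities; a minimal sketch, with an assumed sampling rate and array layout:

```python
# Minimal sketch of the velocity-based encoding idea: replace absolute head/
# controller positions with frame-to-frame velocities before any analysis.
# Sampling rate and array layout are assumptions.
import numpy as np

def encode_velocity(positions, hz=90.0):
    """positions: (T, 3) tracked positions in metres at `hz` samples/second.
    Returns (T-1, 3) velocities in metres/second."""
    return np.diff(positions, axis=0) * hz
```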

Paper Session 7: Modeling

Wednesday, 6 October
9:30 CEST UTC+2
Track A

Session Chair: Shohei Mori
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only): Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track A

Mobile3DScanner: An Online 3D Scanner for High-quality Object Reconstruction with a Mobile Device

Xiaojun Xiang, Sensetime Research
Hanqing Jiang, Sensetime Research
Guofeng Zhang, Zhejiang University
Yihao Yu, Sensetime Research
Chenchen Li, Sensetime Research
Xingbin Yang, Sensetime Research
Danpeng Chen, Sensetime Research
Hujun Bao, Zhejiang University

Journal Paper

We present a novel online 3D scanning system for high-quality object reconstruction with a mobile device, called Mobile3DScanner. Using a mobile device equipped with an embedded RGBD camera, our system provides online 3D object reconstruction capability for users to acquire high-quality textured 3D object models. Starting with a simultaneous pose tracking and TSDF fusion module, our system allows users to scan an object with a mobile device to get a 3D model for real-time preview. After the real-time scanning process is completed, the scanned 3D model is globally optimized and mapped with multi-view textures as an efficient post-process to get the final textured 3D model on the mobile device. Unlike most existing state-of-the-art systems which can only scan homeware objects such as toys with small dimensions due to the limited computation and memory resources of mobile platforms, our system can reconstruct objects with large dimensions such as statues. We propose a novel visual-inertial ICP approach to achieve real-time accurate 6DoF pose tracking of each incoming frame on the front end, while maintaining a keyframe pool on the back end where the keyframe poses are optimized by local BA. Simultaneously, the keyframe depth maps are fused into a TSDF model in real time using the optimized poses. In particular, we propose a novel adaptive voxel resizing strategy to solve the out-of-memory problem of large-dimension TSDF fusion on mobile platforms. In the post-process, the keyframe poses are globally optimized and the keyframe depth maps are optimized and fused to obtain a final object model with more accurate geometry. The experiments with quantitative and qualitative evaluation demonstrate the effectiveness of the proposed 3D scanning system based on a mobile device, which can successfully achieve online high-quality 3D reconstruction of natural objects with larger dimensions for efficient AR content creation.
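
For orientation, a highly simplified TSDF fusion update for a single depth frame is sketched below; it omits the system's visual-inertial tracking, keyframe optimization, and adaptive voxel resizing, and the array layout is an assumption.

```python
# Highly simplified TSDF fusion update for one depth frame (illustrative; the
# paper's system adds visual-inertial tracking, keyframe optimisation, and an
# adaptive voxel resizing strategy that are not reproduced here).
import numpy as np

def update_tsdf(tsdf, weights, voxel_centers_cam, depth_at_voxel, trunc=0.02):
    """tsdf, weights: (N,) running TSDF values and fusion weights per voxel.
    voxel_centers_cam: (N, 3) voxel centres in the current camera frame.
    depth_at_voxel: (N,) depth-map value sampled at each voxel's projection."""
    sdf = depth_at_voxel - voxel_centers_cam[:, 2]       # signed distance
    valid = (depth_at_voxel > 0) & (sdf > -trunc)
    d = np.clip(sdf, -trunc, trunc) / trunc              # truncate & normalise
    w_new = weights + valid                              # per-frame weight of 1
    tsdf = np.where(valid, (tsdf * weights + d) / np.maximum(w_new, 1), tsdf)
    return tsdf, w_new
```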

Parametric Model Estimation for 3D Clothed Humans from Point Clouds

Kangkan Wang, Nanjing University of Science and Technology
Huayu Zheng, Nanjing University of Science and Technology
Guofeng Zhang, Zhejiang University
Jian Yang, Nanjing University of Science and Technology

Conference Paper

This paper presents a novel framework to estimate parametric models for 3D clothed humans from partial point clouds. It is a challenging problem due to factors such as arbitrary human shape and pose, large variations in clothing details, and significant missing data. Existing methods mainly focus on estimating the parametric model of undressed bodies or reconstructing the non-parametric 3D shapes from point clouds. In this paper, we propose a hierarchical regression framework to learn the parametric model of detailed human shapes from partial point clouds of a single depth frame. Benefiting from the favorable ability of deep neural networks to model nonlinearity, the proposed framework cascades several successive regression networks to estimate the parameters of detailed 3D human body models in a coarse-to-fine manner. Specifically, the first global regression network extracts global deep features of point clouds to obtain an initial estimation of the undressed human model. Based on the initial estimation, the local regression network then refines the undressed human model by using the local features of neighborhood points of human joints. Finally, the clothing details are inferred as an additive displacement on the refined undressed model using the vertex-level regression network. The experimental results demonstrate that the proposed hierarchical regression approach can accurately predict detailed human shapes from partial point clouds and outperform prior works in the recovery accuracy of 3D human models.

BuildingSketch: Freehand Mid-Air Sketching for Building Modeling

Zhihao Liu, Chinese Academy of Sciences
Fanxing Zhang, Shenzhen Institutes of Advanced Technology
Zhanglin Cheng, Shenzhen Institutes of Advanced Technology

Conference Paper

Advancements in virtual reality (VR) technology enable us to rethink interactive 3D modeling – intuitively creating 3D content directly in 3D space. However, conventional VR-based modeling is laborious and tedious for generating a detailed 3D model in fully manual mode, since users need to carefully draw almost the entire surface. In this paper, we present a freehand mid-air sketching system, aided by deep learning techniques, for modeling structured buildings: the user freely draws a few key strokes in mid-air with their fingers to represent the desired shapes, and our system automatically interprets the strokes using a deep neural network and generates a detailed building model based on a procedural modeling method. After creating several building blocks one by one, the user can freely move, rotate, and combine the blocks to form a complex building model. We demonstrate the ease of use for novice users, effectiveness, and efficiency of our sketching system, BuildingSketch, by presenting a variety of building models.

BDLoc: Global Localization from 2.5D Building Map

Hai Li, Zhejiang University
Tianxing Fan, Zhejiang University
Hongjia Zhai, Zhejiang University
Zhaopeng Cui, Zhejiang University
Hujun Bao, Zhejiang University
Guofeng Zhang, Zhejiang University

Conference Paper

Robust and accurate global 6DoF localization is essential for many applications, e.g., augmented reality and autonomous driving. Most existing 6DoF visual localization approaches need to build a dense texture model in advance, which is computationally expensive and almost infeasible at the global scale. In this work, we propose BDLoc, a hierarchical global localization framework based on a 2.5D building map, which is able to estimate the accurate pose of a query street-view image without using a detailed dense 3D model or texture information. Specifically, we first extract 3D building information from the street-view image and the surrounding 2.5D building map, and then solve a coarse relative pose by local-to-global registration. To improve feature extraction, we propose a novel SPG-Net that is able to capture both local and global features. Finally, an iterative semantic alignment is applied to obtain a finer result using differentiable rendering and a cross-view semantic constraint. Apart from a coarse longitude and latitude from GPS, BDLoc does not need any additional information, such as altitude and orientation, that is necessary for many previous works. We also create a large dataset to explore the performance of the 2.5D map-based localization task. Extensive experiments demonstrate the superior performance of our method.

Distortion-aware room layout estimation from a single fisheye image

Ming Meng, Beihang University
Likai Xiao, Beihang University
Yi Zhou, Beijing BigView Technology Co. Ltd
Zhaoxin Li, Chinese Academy of Sciences
Zhong Zhou, Beihang University

Conference Paper

Omnidirectional images with a 180° or 360° field of view capture the entire visual content around the camera, enabling more sophisticated scene understanding and reasoning and opening broad application prospects for VR/AR/MR. As a result, research on omnidirectional image layout estimation has surged in recent years. However, existing layout estimation methods designed for panoramic images do not perform well on fisheye images, mainly due to the lack of public fisheye datasets as well as significant differences in the position and degree of distortion caused by the different projection models. To fill these gaps, we first reuse released large-scale panorama datasets and convert them to fisheye images via projection conversion, thereby circumventing the challenge of obtaining high-quality fisheye datasets with ground-truth layout annotations. Then, we propose a distortion-aware module based on the distortion of the orthographic projection (i.e., OrthConv) to perform effective feature extraction from fisheye images. Additionally, we exploit a bidirectional LSTM with a two-dimensional step mode for horizontal and vertical prediction, capturing the long-range geometric patterns of objects and yielding globally coherent predictions even in occluded and cluttered scenes. We extensively evaluate our deformable convolution for the room layout estimation task. In comparison with state-of-the-art approaches, our approach produces considerable performance gains on a real-world dataset as well as on a synthetic dataset. This technology provides a high-efficiency, low-cost technical basis for VR house viewing and MR video surveillance. We present an MR-based building video surveillance scene equipped with nine fisheye lenses that achieves an immersive hybrid display experience and can be used for intelligent building management in the future.

Paper Session 8: Redirected Walking & Locomotion

Wednesday, 6 October
9:30 CEST UTC+2
Track B

Session Chair: Ferran Argelaguet
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only)Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track B

OpenRDW: A Redirected Walking Library and Benchmark with Multi-User, Learning-based Functionalities and State-of-the-art Algorithms

Yi-Jun Li, Beihang University
Miao Wang, Beihang University
Frank Steinicke, Universität Hamburg
Qinping Zhao, Beihang University

Conference Paper

Redirected walking (RDW) is a locomotion technique that guides users on virtual paths which may differ from the paths they physically walk in the real world. Thereby, RDW enables users to explore a virtual space that is larger than its physical counterpart with a near-natural walking experience. Several approaches have been proposed and developed, but each uses an individual platform and is evaluated on a custom dataset, making it challenging to compare methods; public toolkits and recognized benchmarks in this field remain scarce. In this paper, we introduce OpenRDW, an open-source library and benchmark for developing, deploying, and evaluating a variety of methods for walking path redirection. The OpenRDW library provides application programming interfaces to access scene attributes, customize RDW controllers, simulate and visualize the navigation process, export results in multiple formats, and evaluate RDW techniques. It also supports the deployment of multi-user real walking, as well as reinforcement learning-based models exported from TensorFlow or PyTorch. The OpenRDW benchmark includes multiple testing conditions, such as walking in tracking spaces of varying size or shape with obstacles, multi-user walking, and more. In addition, procedurally generated paths and walking paths collected from user experiments are provided for comprehensive evaluation. It also contains several classic and state-of-the-art RDW techniques that build on the above-mentioned functionalities.

Redirected Walking using Continuous Curvature Manipulation

Hiroaki Sakono, The University of Tokyo
Keigo Matsumoto, The University of Tokyo
Takuji Narumi, The University of Tokyo
Hideaki Kuzuoka, The University of Tokyo

Journal Paper

In this paper, we propose a novel redirected walking (RDW) technique that applies dynamic bending and curvature gains so that users perceive less discomfort than existing techniques that apply constant gains. 

Humans are less likely to notice continuous changes than sudden ones. Therefore, instead of applying constant bending or curvature gains to users, we propose a dynamic method that continuously changes the gains. We conduct experiments to investigate the effect of dynamic gains in bending and curvature manipulation with regard to discomfort. The experimental results show that the proposed method significantly reduces discomfort by up to 16% and 9% for bending and curvature manipulations, respectively.
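
A minimal sketch of the underlying idea, applying a smoothly varying curvature gain instead of a constant one; the raised-cosine profile and its period are illustrative assumptions, not the paper's exact modulation:

    import math

    def dynamic_curvature_gain(t, g_max, period_s=10.0):
        """Smoothly varying curvature gain instead of a constant one.

        A raised-cosine modulation between 0 and g_max is one simple way to
        avoid sudden gain changes; the paper's actual profile may differ.
        """
        return 0.5 * g_max * (1.0 - math.cos(2.0 * math.pi * t / period_s))

    def redirect_heading(heading_rad, walk_speed_mps, gain_rad_per_m, dt):
        """Apply a curvature gain (injected rotation per metre walked)."""
        return heading_rad + gain_rad_per_m * walk_speed_mps * dt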

A Reinforcement Learning Approach to Redirected Walking with Passive Haptics

Ze-Yin Chen, Beihang University
Yi-Jun Li, Beihang University
Miao Wang, Beihang University
Frank Steinicke, Universität Hamburg
Qinping Zhao, Beihang University

Conference Paper

Various redirected walking (RDW) techniques have been proposed that imperceptibly manipulate the mapping from the user’s physical locomotion to the motion of the virtual camera. Thereby, RDW techniques guide users on physical paths with the goal of keeping them inside a limited tracking area, while users perceive the illusion of being able to walk infinitely in the virtual environment. However, the inconsistency between the user’s virtual and physical location hinders passive haptic feedback when the user interacts with virtual objects that are represented by physical props in the real environment.

In this paper, we present a novel reinforcement learning approach towards RDW with passive haptics. With a novel dense reward function, our method learns to jointly consider physical boundary avoidance and consistency of user-object positioning between virtual and physical spaces. The weights of reward and penalty terms in the reward function are dynamically adjusted to adaptively balance term impacts during the walking process. Experimental results demonstrate the advantages of our technique in comparison to previous approaches. Finally, the code of our technique is provided as an open-source solution.
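
To make the reward structure concrete, here is a minimal sketch of a dense reward that combines boundary avoidance with virtual-physical alignment of the interaction target. The exponential penalty, the distance term, and the weights w_boundary and w_haptic are illustrative assumptions; the paper's exact terms and its dynamic weight schedule are not reproduced here:

    import numpy as np

    def dense_reward(boundary_dist, virtual_obj, physical_prop,
                     w_boundary, w_haptic):
        """Illustrative dense reward: penalise approaching the tracking-space
        boundary, reward virtual/physical alignment of the interaction target."""
        boundary_penalty = -np.exp(-boundary_dist)          # grows near the walls
        haptic_reward = -np.linalg.norm(np.asarray(virtual_obj) -
                                        np.asarray(physical_prop))
        return w_boundary * boundary_penalty + w_haptic * haptic_reward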

Redirected Walking Using Noisy Galvanic Vestibular Stimulation

Keigo Matsumoto, The University of Tokyo
Kazuma Aoyama, The University of Tokyo
Takuji Narumi, The University of Tokyo
Hideaki Kuzuoka, The University of Tokyo

Conference Paper

In this study, considering the characteristics of multisensory integration, we examined a method for improving redirected walking (RDW) by adding noise to the vestibular system to reduce the effect of vestibular inputs on self-motion perception. In RDW, the contradiction between vestibular inputs and visual sensations may make users notice the RDW manipulation, resulting in discomfort throughout the experience. Because humans integrate multisensory information by considering the reliability of each modality, reducing the effect of vestibular inputs on self-motion perception may suppress awareness of and discomfort during RDW manipulation and improve the effectiveness of the manipulation. Therefore, we hypothesized that adding noise to the vestibular inputs would reduce the reliability of vestibular sensations and enhance the effectiveness of RDW by improving the relative reliability of vision. In this study, we used noisy galvanic vestibular stimulation (GVS) to reduce the reliability of vestibular inputs. GVS is a method of stimulating the vestibular organs and nerves by applying small electrical currents to the bilateral mastoids. To reduce the reliability of vestibular inputs, we employed noisy GVS whose current pattern is white noise. We conducted an experiment comparing the threshold of curvature gains between noisy GVS conditions and a control condition.

RNIN-VIO: Robust Neural Inertial Navigation Aided Visual-Inertial Odometry in Challenging Scenes

Danpeng Chen, Zhejiang University
Nan Wang, Sensetime
Runsen Xu, Zhejiang University
Weijian Xie, Zhejiang University
Hujun Bao, Zhejiang University
Guofeng Zhang, Zhejiang University

Conference Paper

In this work, we propose a tightly-coupled EKF framework for visual-inertial odometry aided by neural inertial navigation (NIN). Traditional VIO systems are fragile in challenging scenes with weak or confusing visual information, such as weak/repeated texture, dynamic environments, or fast camera motion with serious motion blur. It is extremely difficult for a vision-based algorithm to handle these problems. We therefore first design a robust deep learning based inertial network (called RNIN), which uses only IMU measurements as input. RNIN is significantly more robust in challenging scenes than traditional VIO systems. In order to take full advantage of vision-based algorithms in AR/VR, we further develop a multi-sensor fusion system, RNIN-VIO, which tightly couples the visual, IMU, and NIN measurements. Our system performs robustly in extremely challenging conditions, with high precision both in trajectories and AR effects. Experimental results on dataset evaluation and an online AR demo demonstrate the superiority of the proposed system in robustness and accuracy.
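
For context, the sketch below shows a textbook EKF measurement update, the kind of step used to tightly couple an additional measurement source (such as a learned inertial-network displacement) into a VIO filter; it is generic filtering code, not the RNIN-VIO implementation:

    import numpy as np

    def ekf_update(x, P, z, h, H, R):
        """Generic EKF measurement update.

        x, P: state vector and covariance; z: measurement; h: predicted
        measurement h(x); H: measurement Jacobian; R: measurement noise.
        """
        y = z - h                                   # innovation
        S = H @ P @ H.T + R                         # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)              # Kalman gain
        x_new = x + K @ y
        P_new = (np.eye(len(x)) - K @ H) @ P
        return x_new, P_new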

PAVAL: Position-Aware Virtual Agent Locomotion for Assisted VR Navigation

Ziming Ye, Beihang University
Junlong Chen, Beihang University
Miao Wang, Beihang University
Yong-Liang Yang, University of Bath

Conference Paper

Virtual agents are typical assistance tools for navigation and interaction in Virtual Reality (VR) tours, training, education, etc. It has been demonstrated that the gaits, gestures, gazes, and positions of virtual agents are major factors that affect the user’s perception and experience in seated and standing VR. In this paper, we present a novel position-aware virtual agent locomotion method, called PAVAL, that can perform virtual agent positioning (position+orientation) in real time for room-scale VR navigation assistance. We first analyze design guidelines for virtual agent locomotion and model the problem using the positions of the user and the surrounding virtual objects. Then we conduct a one-off preliminary study to collect subjective data and present a model for virtual agent positioning prediction with fixed user position. Based on the model, we propose an algorithm to optimize the object of interest, virtual agent position, and virtual agent orientation in sequence for virtual agent locomotion. As a result, during user navigation in a virtual scene, the virtual agent automatically moves in real time and introduces virtual object information to the user. We evaluate PAVAL and two alternative methods via a user study with humanoid virtual agents in various scenes, including a virtual museum, factory, and school gym. The results reveal that our method is superior to the baseline condition.

Paper Session 9: Frameworks & Datasets

Wednesday, 6 October
16:00 CEST UTC+2
Track A

Session Chair: Benjamin Weyers
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only)Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track A

TEyeD: Over 20 million real-world eye images with Pupil, Eyelid, and Iris 2D and 3D Segmentations, 2D and 3D Landmarks, 3D Eyeball, Gaze Vector, and Eye Movement Types

Wolfgang Fuhl, Wilhelm Schickard Institut
Gjergji Kasneci, University of Tübingen
Enkelejda Kasneci, University of Tübingen

Conference Paper

We present TEyeD, the world’s largest unified public dataset of eye images taken with head-mounted devices. TEyeD was acquired with seven different head-mounted eye trackers. Among them, two eye trackers were integrated into virtual reality (VR) or augmented reality (AR) devices. The images in TEyeD were obtained from various tasks, including car rides, simulator rides, outdoor sports activities, and daily indoor activities. The dataset includes 2D and 3D landmarks, semantic segmentation, 3D eyeball annotations, gaze vectors, and eye movement types for all images. Landmarks and semantic segmentation are provided for the pupil, iris, and eyelids. Video lengths vary from a few minutes to several hours. With more than 20 million carefully annotated images, TEyeD provides a unique, coherent resource and a valuable foundation for advancing research in computer vision, eye tracking, and gaze estimation in modern VR and AR applications. Data and code at:

https://unitc-my.sharepoint.com/:f:/g/personal/iitfu01_cloud_uni-tuebingen_de/EvrNPdtigFVHtCMeFKSyLlUBepOcbX0nEkamweeZa0s9SQ

Supporting Iterative Virtual Reality Analytics Design and Evaluation by Systematic Generation of Surrogate Clustered Datasets

Slawomir Konrad Tadeja, University of Cambridge
Patrick Langdon, University of Cambridge
Per Ola Kristensson, University of Cambridge

Conference Paper

Virtual Reality (VR) is a promising technology platform for immersive visual analytics. However, the design space of VR analytics interface design is vast and difficult to explore using traditional A/B comparisons in formal or informal controlled experiments—a fundamental part of an iterative design process. A key factor that complicates such comparisons is the dataset. Exposing participants to the same dataset in all conditions introduces an unavoidable learning effect. On the other hand, using different datasets for all experimental conditions introduces the dataset itself as an uncontrolled variable, which reduces internal validity to an unacceptable degree. In this paper, we propose to rectify this problem by introducing a generative process for synthesizing clustered datasets for VR analytics experiments. This process generates datasets that are distinct while simultaneously allowing systematic comparisons in experiments. A key advantage is that these datasets can then be used in iterative design processes. In a two-part experiment, we show the validity of the generative process and demonstrate how new insights in VR-based visual analytics can be gained using synthetic datasets.
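
A minimal sketch of one way such a generative process could look: sampling Gaussian clusters whose statistics are matched across datasets while the individual points differ. The uniform placement of cluster centres and the isotropic spread are illustrative assumptions, not the paper's exact procedure:

    import numpy as np

    def make_surrogate_dataset(n_clusters=5, points_per_cluster=200, dim=3,
                               spread=0.05, seed=None):
        """Generate one surrogate clustered dataset (illustrative Gaussian
        mixture). Different seeds give datasets that are distinct yet
        statistically comparable, so each experimental condition can use a
        fresh but matched dataset."""
        rng = np.random.default_rng(seed)
        centres = rng.uniform(0.0, 1.0, size=(n_clusters, dim))
        points = np.concatenate([
            rng.normal(loc=c, scale=spread, size=(points_per_cluster, dim))
            for c in centres
        ])
        labels = np.repeat(np.arange(n_clusters), points_per_cluster)
        return points, labels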

ARENA: The Augmented Reality Edge Networking Architecture

Nuno Pereira, Carnegie Mellon University
Anthony Rowe, Carnegie Mellon University
Michael W Farb, Carnegie Mellon University
Ivan Liang, Carnegie Mellon University
Edward Lu, Carnegie Mellon University
Eric Riebling, Carnegie Mellon University

Conference Paper

Many have predicted that the future of the Web will be the integration of Web content with the real world through technologies such as Augmented Reality (AR). This has led to the rise of Extended Reality (XR) Web browsers, used to shorten the long AR application development and deployment cycle of native applications, especially across different platforms. As XR browsers mature, we face new challenges related to collaborative and multi-user applications that span users, devices, and machines. These collaborative XR applications require: (1) networking support for scaling to many users, (2) mechanisms for content access control and application isolation, and (3) the ability to host application logic near clients or data sources to reduce application latency.
In this paper, we present the design and evaluation of the Augmented Reality Edge Networking Architecture (ARENA), a platform that simplifies building and hosting collaborative XR applications on WebXR-capable browsers. ARENA provides a number of critical components, including: a hierarchical geospatial directory service that connects users to nearby servers and content, a token-based authentication system for controlling user access to content, and an application/service runtime supervisor that can dispatch programs across any network-connected device. All of the content within ARENA exists as endpoints in a PubSub scene graph model that is synchronized across all users. We evaluate ARENA in terms of client performance and benchmark end-to-end response time as load on the system scales. We show the ability to horizontally scale the system to Internet scale, with scenes containing hundreds of users and latencies on the order of tens of milliseconds. Finally, we highlight projects built using ARENA and showcase how our approach dramatically simplifies collaborative multi-user XR development compared to monolithic approaches.

TransforMR: Pose-Aware Object Substitution for Composing Alternate Mixed Realities

Mohamed Kari, Porsche AG
Tobias Grosse-Puppendahl, Porsche AG
Luis Falconeri Coelho, Porsche AG
Andreas Rene Fender, ETH Zürich
David Bethge, Porsche AG
Reinhard Schütte, Institute for Computer Science and Business Information Systems
Christian Holz, ETH Zürich

Conference Paper

Despite advances in machine perception, semantic scene understanding is still a limiting factor in mixed reality scene composition. In this paper, we present TransforMR, a video see-through mixed reality system for mobile devices that performs 3D-pose-aware object substitution to create meaningful mixed reality scenes. In real time, and for previously unseen and unprepared real-world environments, TransforMR composes mixed reality scenes so that virtual objects assume the behavioral and environment-contextual properties of the replaced real-world objects. This yields meaningful, coherent, and human-interpretable scenes not yet demonstrated by today’s augmentation techniques. TransforMR creates these experiences through our novel pose-aware object substitution method, which builds on different 3D object pose estimators, instance segmentation, video inpainting, and pose-aware object rendering. TransforMR is designed for use in the real world, supporting the substitution of humans and vehicles in everyday scenes, and runs on mobile devices using just their monocular RGB camera feed as input. We evaluated TransforMR with eight participants in an uncontrolled city environment employing different transformation themes. Applications of TransforMR include real-time character animation analogous to motion capture in professional filmmaking, but without the need to prepare either the scene or the actor, as well as narrative-driven experiences that allow users to explore fictional parallel universes in mixed reality.

Excite-O-Meter: Software Framework to Integrate Bodily Signals in Virtual Reality Experiments

Luis Quintero, Stockholm University
John Edison Muñoz Cardona, University of Waterloo
Jeroen de Mooij, Thefirstfloor.nl
Michael Gaebler, Max Planck Institute for Human Cognitive and Brain Sciences

Conference Paper

Bodily signals can complement subjective and behavioral measures to analyze human factors, such as user engagement or stress, when interacting with virtual reality (VR) environments. Enabling widespread use, and real-time analysis, of bodily signals in VR applications could be a powerful method for designing more user-centric, personalized VR experiences. However, technical and scientific challenges (e.g., the cost of research-grade sensing devices, the coding skills required, and the expert knowledge needed to interpret the data) complicate the integration of bodily data into existing interactive applications. This paper presents the design, development, and evaluation of an open-source software framework named Excite-O-Meter. It allows existing VR applications to integrate, record, analyze, and visualize bodily signals from wearable sensors, using the example of cardiac activity (heart rate and its variability) from the Polar H10 chest strap. Survey responses from 58 potential users determined the design requirements for the framework. Two tests evaluated the framework and setup in terms of data acquisition/analysis and data quality. Finally, we present an example experiment that shows how our tool can be an easy-to-use and scientifically validated means for researchers, hobbyists, or game designers to integrate bodily signals into VR applications.
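
As an example of the kind of cardiac features involved, the sketch below computes mean heart rate and RMSSD (a standard time-domain heart-rate-variability measure) from RR intervals such as those streamed by a Polar H10; it is generic signal processing, not the Excite-O-Meter source code:

    import numpy as np

    def heart_rate_and_rmssd(rr_intervals_ms):
        """Mean heart rate (beats per minute) and RMSSD (ms) from RR intervals."""
        rr = np.asarray(rr_intervals_ms, dtype=float)
        heart_rate_bpm = 60000.0 / rr.mean()
        rmssd_ms = np.sqrt(np.mean(np.diff(rr) ** 2))
        return heart_rate_bpm, rmssd_ms

    # Example: heart_rate_and_rmssd([812, 805, 790, 823, 798])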

Paper Session 10: Applications

Wednesday, 6 October
16:00 CEST UTC+2
Track B

Session Chair: Voicu Popescu
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only)Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track B

Design and Evaluation of Personalized Percutaneous Coronary Intervention Surgery Simulation System

Shuai Li, Beihang University
Jiahao Cui, Beihang University
Aimin Hao, Beihang University
Shuyang Zhang, Peking Union Medical College Hospital
Qinping Zhao, Beihang University

Journal Paper

In recent years, medical simulators have been widely applied to a broad range of surgical training tasks. However, most existing surgery simulators can only provide limited immersive environments with a few pre-processed organ models, while ignoring the instant modeling of diverse personalized clinical cases, which creates substantial differences between training experiences and real surgical situations. To this end, we present a virtual reality (VR) based surgery simulation system for personalized percutaneous coronary intervention (PCI). The simulation system can directly take patient-specific clinical data as input and generate virtual 3D intervention scenarios. Specifically, we introduce a fiber-based patient-specific cardiac dynamic model to simulate the nonlinear deformation among the multiple layers of the cardiac structure, which faithfully captures and correlates the atria, ventricles, and vessels, and thus gives rise to more effective visualization and interaction. Meanwhile, we design tracking and haptic feedback hardware, which enables users to manipulate physical intervention instruments and interact with virtual scenarios. We conduct quantitative analysis of deformation precision and modeling efficiency, and evaluate the simulation system through user studies with 16 cardiologists and 20 intervention trainees, comparing it to traditional desktop intervention simulators. The results confirm that our simulation system provides a better user experience and is a suitable platform for PCI surgery training and rehearsal.

Augmented Reality for Subsurface Utility Engineering, Revisited

Lasse Hedegaard Hansen, Aalborg University
Philipp Fleck, Graz University of Technology
Marco Stranner, Institute for Computer Graphics and Vision
Dieter Schmalstieg, Graz University of Technology
Clemens Arth, AR4 GmbH

Journal Paper

Civil engineering is a primary domain for new augmented reality technologies. In this work, the area of subsurface utility engineering is revisited, and new methods tackling well-known, yet unsolved problems are presented. We describe our solution to the outdoor localization problem, which is deemed one of the most critical issues in outdoor augmented reality, proposing a novel, lightweight hardware platform to generate highly accurate position and orientation estimates in a global context. Furthermore, we present new approaches to drastically improve realism of outdoor data visualizations. First, a novel method to replace physical spray markings by indistinguishable virtual counterparts is described. Second, the visualization of 3D reconstructions of real excavations is presented, fusing seamlessly with the view onto the real environment. We demonstrate the power of these new methods on a set of different outdoor scenarios.

A Compelling Virtual Tour of the Dunhuang Cave With an Immersive Head-Mounted Display

Ping-Hsuan Han, National Taiwan University
Yang-Sheng Chen, National Taiwan University
Iou-Shiuan Liu, National Taiwan University
Yi-Ping Jang, National Taiwan University
Ling Tsai, National Taiwan University
Alvin Chang, National Taiwan University
Yi-Ping Hung, National Taiwan University

Invited CG&A Paper

The Dunhuang Caves are home to the largest Buddhist art sites in the world and are listed as a UNESCO World Heritage Site. Over time, the murals have been damaged by both humans and nature. In this article, we present an immersive virtual reality system for exploring spatial cultural heritage, which utilizes digitized data from the Dunhuang Research Academy to represent the virtual environment of the cave. In this system, interaction techniques that allow users to flexibly experience any of the artifacts or displays contribute to their understanding of the cultural heritage. Additionally, we evaluated the system by conducting a user study to examine the extent of user acquaintance after the entire experience. Our results show what participants learned from the spatial context and augmented information in VR, which can serve as design considerations for developing other spatial heritage experiences.

The Passenger Experience of Mixed Reality Virtual Display Layouts In Airplane Environments

Alexander Ng, University of Glasgow
Daniel Medeiros, University of Glasgow
Mark McGill, University of Glasgow
Julie R. Williamson, University of Glasgow
Stephen Anthony Brewster, University of Glasgow

Conference Paper

Augmented / Mixed Reality headsets will in time see adoption and use in a variety of mobility and transit contexts, allowing users to view and interact with virtual content and displays for productivity and entertainment. However, little is known regarding how multi-display virtual workspaces should be presented in a transit context, nor to what extent the unique affordances of transit environments (e.g., the social presence of others) might influence passenger perception of virtual display layouts. Using a simulated VR passenger airplane environment, we evaluated three different AR-driven virtual display configurations (Horizontal, Vertical, and a Focus main display with smaller secondary windows) at two different depths, exploring their usability, user preferences, and the underlying factors that influenced those preferences. We found that the perception of invading others’ personal space significantly influenced preferred layouts in transit contexts. Based on our findings, we reflect on the unique challenges posed by passenger contexts, provide recommendations regarding virtual display layout in the confined airplane environment, and expand on the significant benefits that AR offers over physical displays in such environments.

FLASH: Video AR Anchors for Live Events

Edward Lu, Carnegie Mellon University
John Miller, Carnegie Mellon University
Nuno Pereira, Carnegie Mellon University
Anthony Rowe, Carnegie Mellon University

Conference Paper

Public spaces like concert stadiums and sporting arenas are ideal venues for AR content delivery to crowds of mobile phone users. Unfortunately, these environments tend to be some of the most challenging in terms of lighting and dynamic staging for vision-based relocalization. In this paper, we introduce FLASH, a system for delivering AR content within challenging lighting environments that uses active tags (i.e. blinking) with detectable features from passive tags (quads) for marking regions of interest and determining pose. This combination allows the tags to be detectable from long distances with significantly less computational overhead per frame, making it possible to embed tags in existing video displays like large jumbotrons. To aid in pose acquisition, we implement a gravity-assisted pose solver that removes the ambiguous solutions that are often encountered when trying to localize using standard passive tags. We show that our technique outperforms similarly sized passive tags in terms of range by 20-30% and is fast enough to run at 30 FPS even within a mobile web browser on a smartphone.
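
A minimal sketch of how blinking regions might be isolated before quad decoding, using per-pixel intensity swing over a short frame window; the thresholding scheme is an illustrative assumption, not FLASH's actual detection pipeline:

    import numpy as np

    def blinking_regions(frames, diff_threshold=40):
        """Candidate active-tag regions from a grayscale frame stack (T, H, W).

        Pixels whose intensity swings strongly over the window are candidate
        active-tag locations, to be confirmed by a passive quad detector.
        """
        stack = np.asarray(frames, dtype=float)
        swing = stack.max(axis=0) - stack.min(axis=0)   # per-pixel intensity swing
        return swing > diff_threshold                   # boolean candidate mask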

Paper Session 11: Tracking & Prediction

Wednesday, 6 October
18:00 CEST UTC+2
Track A

Session Chair: Alain Pagani
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only)Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track A

Simulating Realistic Human Motion Trajectories of Mid-Air Gesture Typing

Junxiao Shen, University of Cambridge
John J Dudley, University of Cambridge
Per Ola Kristensson, University of Cambridge

Conference Paper

The eventual success of many AR and VR intelligent interactive systems relies on the ability to collect user motion data at large scale. Realistic simulation of human motion trajectories is a potential solution to this problem. Simulated user motion data can facilitate prototyping and speed up the design process. There are also potential benefits in augmenting training data for deep learning-based AR/VR applications to improve performance. However, the generation of realistic motion data is nontrivial. In this paper, we examine the specific challenge of simulating index finger movement data to inform mid-air gesture keyboard design. The mid-air gesture keyboard is deployed on an optical see-through display that allows the user to enter text by articulating word gesture patterns with their physical index finger in the vicinity of a visualized keyboard layout. We propose and compare four different approaches to simulating this type of motion data, including a Jerk-Minimization model, a Recurrent Neural Network (RNN)-based generative model, and a Generative Adversarial Network (GAN)-based model with two modes: style transfer and data alteration. We also introduce a procedure for validating the quality of the generated trajectories in terms of realism and diversity. The GAN-based model shows significant potential for generating synthetic motion trajectories to facilitate design and deep learning for advanced gesture keyboards deployed in AR and VR.
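
The Jerk-Minimization baseline builds on the classic minimum-jerk model of point-to-point human movement. A minimal sketch of that textbook formulation (not the paper's full simulation pipeline):

    import numpy as np

    def minimum_jerk(p_start, p_end, duration_s, n_samples=60):
        """Minimum-jerk point-to-point trajectory: position follows
        10*tau^3 - 15*tau^4 + 6*tau^5 between the endpoints."""
        p_start, p_end = np.asarray(p_start, float), np.asarray(p_end, float)
        tau = np.linspace(0.0, 1.0, n_samples)[:, None]   # normalised time
        s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5        # smooth 0 -> 1
        t = tau[:, 0] * duration_s
        return t, p_start + s * (p_end - p_start)

    # Example: simulate an index-finger stroke between two key centres in 3D.
    # times, path = minimum_jerk([0.0, 0.0, 0.4], [0.08, 0.02, 0.4], 0.35)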

Cybersickness Prediction from Integrated HMD's Sensors: A Multimodal Deep Fusion Approach using Eye-tracking and Head-tracking Data

Rifatul Islam, University of Texas at San Antonio
John Quarles, University of Texas at San Antonio
Kevin Desai, The University of Texas at San Antonio

Conference Paper

Cybersickness prediction is one of the significant research challenges for real-time cybersickness reduction. Researchers have proposed different approaches for predicting cybersickness from bio-physiological data (e.g., heart rate, breathing rate, electroencephalogram). However, collecting bio-physiological data often requires external sensors, limiting locomotion and 3D-object manipulation during the virtual reality (VR) experience. Limited research has been done to predict cybersickness from the data readily available from the integrated sensors in head-mounted displays (HMDs) (e.g., head-tracking, eye-tracking, motion features), allowing free locomotion and 3D-object manipulation. This research proposes a novel deep fusion network to predict cybersickness severity from heterogeneous data readily available from the integrated HMD sensors. We extracted 1755 stereoscopic videos, eye-tracking, and head-tracking data along with the corresponding self-reported cybersickness severity collected from 30 participants during their VR gameplay. We applied several deep fusion approaches with the heterogeneous data collected from the participants. Our results suggest that cybersickness can be predicted with an accuracy of 87.77% and a root-mean-square error of 0.51 when using only eye-tracking and head-tracking data. We concluded that eye-tracking and head-tracking data are well suited for a standalone cybersickness prediction framework.
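
As an illustration of the fusion idea, the sketch below late-fuses eye-tracking and head-tracking sequences with two recurrent branches; the layer sizes, feature dimensions, and fusion scheme are hypothetical stand-ins for the paper's deep fusion network:

    import torch
    import torch.nn as nn

    class LateFusionNet(nn.Module):
        """Illustrative late-fusion regressor for cybersickness severity from
        eye-tracking and head-tracking feature sequences."""

        def __init__(self, eye_dim=12, head_dim=9, hidden=64):
            super().__init__()
            self.eye_branch = nn.GRU(eye_dim, hidden, batch_first=True)
            self.head_branch = nn.GRU(head_dim, hidden, batch_first=True)
            self.regressor = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                           nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, eye_seq, head_seq):   # (B, T, eye_dim), (B, T, head_dim)
            _, h_eye = self.eye_branch(eye_seq)
            _, h_head = self.head_branch(head_seq)
            fused = torch.cat([h_eye[-1], h_head[-1]], dim=1)  # concat last states
            return self.regressor(fused).squeeze(1)            # severity score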

A Comparison of the Fatigue Progression of Eye-Tracked and Motion-Controlled Interaction in Immersive Space

Lukas Maximilian Masopust, University of California, Davis
David Bauer, University of California, Davis
Siyuan Yao, University of California, Davis
Kwan-Liu Ma, University of California, Davis

Conference Paper

Eye-tracking enabled virtual reality (VR) headsets have recently become more widely available. This opens up opportunities to incorporate eye gaze interaction methods in VR applications. However, studies on the fatigue-induced performance fluctuations of these new input modalities are scarce and rarely provide a direct comparison with established interaction methods. We conduct a study to compare the selection-interaction performance between commonly used handheld motion control devices and emerging eye interaction technology in VR. We investigate each interaction’s unique fatigue progression pattern in study sessions with ten minutes of continuous engagement. The results support and extend previous findings regarding the progression of fatigue in eye-tracked interaction over prolonged periods. By directly comparing gaze- with motion-controlled interaction, we put the emerging eye-trackers into perspective with the state-of-the-art interaction method for immersive space. We then discuss potential implications for future extended reality (XR) interaction design based on our findings.

DVIO: Depth-Aided Visual Inertial Odometry for RGBD Sensors

Abhishek Tyagi, SOC R&D, Samsung Semiconductor, Inc.
Yangwen Liang, SOC R&D, Samsung Semiconductor, Inc.
Shuangquan Wang, SOC R&D, Samsung Semiconductor, Inc.
Dongwoon Bai, SOC R&D, Samsung Semiconductor, Inc.

Conference Paper

In the past few years, we have observed an increase in the usage of RGBD sensors in mobile devices. These sensors provide a good estimate of the depth map for the camera frame, which can be used in numerous augmented reality applications. This paper presents a new visual-inertial odometry (VIO) system, which uses measurements from an RGBD sensor and an inertial measurement unit (IMU) to estimate the motion state of the mobile device. The resulting system is called the depth-aided VIO (DVIO) system. In this system, we add the depth measurement as part of the nonlinear optimization process. Specifically, we propose methods to use the depth measurement with one-dimensional (1D) feature parameterization as well as three-dimensional (3D) feature parameterization. In addition, we propose to utilize the depth measurement for estimating the time offset between the unsynchronized IMU and RGBD sensors. Last but not least, we propose a novel block-based marginalization approach to speed up the marginalization process and maintain the real-time performance of the overall system. Experimental results validate that the proposed DVIO system outperforms other state-of-the-art VIO systems in terms of trajectory accuracy as well as processing time.

Instant Visual Odometry Initialization for Mobile AR

Alejo Concha Belenguer, Facebook
Jesus Briales, Facebook
Christian Forster, Facebook
Luc Oth, Facebook
Michael Burri, Facebook

Journal Paper

Mobile AR applications benefit from fast initialization to display world-locked effects instantly. However, standard visual odometry or SLAM algorithms require motion parallax to initialize (see Figure 1) and, therefore, suffer from delayed initialization. In this paper, we present a 6-DoF monocular visual odometry that initializes instantly and without motion parallax. Our main contribution is a pose estimator that decouples estimating the 5-DoF relative rotation and translation direction from the 1-DoF translation magnitude. While scale is not observable in a monocular vision-only setting, it is still paramount to estimate a consistent scale over the whole trajectory (even if not physically accurate) to avoid AR effects moving erroneously along depth. In our approach, we leverage the fact that depth errors are not perceivable to the user during rotation-only motion. However, as the user starts translating the device, depth becomes perceivable and so does the capability to estimate consistent scale. Our proposed algorithm naturally transitions between these two modes. Our second contribution is a novel residual in the relative pose problem to further improve the results. The residual combines the Jacobians of the functional and the functional itself and is minimized using a Levenberg–Marquardt optimizer on the 5-DoF manifold. We perform extensive validations of our contributions with both a publicly available dataset and synthetic data. We show that the proposed pose estimator outperforms the classical approaches for 6-DoF pose estimation used in the literature in low-parallax configurations. Likewise, we show our relative pose estimator outperforms state-of-the-art approaches in an odometry pipeline configuration where we can leverage initial guesses. We release a dataset for the relative pose problem using real data to facilitate the comparison with future solutions for the relative pose problem. Our solution is either used as a full odometry or as a pre-SLAM component of any supported SLAM system (ARKit, ARCore) in world-locked AR effects on platforms such as Instagram and Facebook.

Paper Session 12: XR Experiences & Guidance

Wednesday, 6 October
18:00 CEST UTC+2
Track B

Session Chair: John Quarles
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only)Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track B

Virtual Animals as Diegetic Attention Guidance Mechanisms in 360-Degree Experiences

Nahal Norouzi, University of Central Florida
Gerd Bruder, University of Central Florida
Austin Erickson, University of Central Florida
Kangsoo Kim, University of Calgary
Jeremy N. Bailenson, Stanford University
Pamela J. Wisniewski, University of Central Florida
Charles E Hughes, University of Central Florida
Greg Welch, University of Central Florida

Journal Paper

360-degree experiences such as cinematic virtual reality and 360-degree videos are becoming increasingly popular. In most examples, viewers can freely explore the content by changing their orientation. However, in some cases, this increased freedom may lead to viewers missing important events within such experiences. Thus, a recent research thrust has focused on studying mechanisms for guiding viewers’ attention while maintaining their sense of presence and fostering a positive user experience. One approach is the utilization of diegetic mechanisms, characterized by an internal consistency with respect to the narrative and the environment, for attention guidance. While such mechanisms are highly attractive, their uses and potential implementations are still not well understood. Additionally, acknowledging the user in 360-degree experiences has been linked to a higher sense of presence and connection. However, less is known when acknowledging behaviors are carried out by attention guiding mechanisms. To close these gaps, we conducted a within-subjects user study with five conditions of no guide and virtual arrows, birds, dogs, and dogs that acknowledge the user and the environment.  Through our mixed-methods analysis, we found that the diegetic virtual animals resulted in a more positive user experience, all of which were at least as effective as the non-diegetic arrow in guiding users towards target events. The acknowledging dog received the most positive responses from our participants in terms of preference and user experience and significantly improved their sense of presence compared to the non-diegetic arrow. Lastly, three themes emerged from a qualitative analysis of our participants’ feedback, indicating the importance of the guide’s blending in, its acknowledging behavior, and participants’ positive associations as the main factors for our participants’ preferences.

Measuring the Perceived Three-Dimensional Location of Virtual Objects in Optical See-Through Augmented Reality

Farzana Alam Khan, Mississippi State University
Veera Venkata Ram Murali Krishna Rao Muvva, University of Nebraska–Lincoln
Dennis Wu, Mississippi State University
Mohammed Safayet Arefin, Mississippi State University
Nate Phillips, Mississippi State University
J. Edward Swan II, Mississippi State University

Conference Paper

For optical see-through augmented reality (AR), a new method for measuring the perceived three-dimensional location of virtual objects is presented, where participants verbally report a virtual object’s location relative to both a vertical and horizontal grid. The method is tested with a small (1.95 × 1.95 × 1.95 cm) virtual object at distances of 50 to 80 cm, viewed through a Microsoft HoloLens 1st generation AR display. Two experiments examine two different virtual object designs, whether turning in a circle between reported object locations disrupts HoloLens tracking, and whether accuracy errors, including a rightward bias and underestimated depth, might be due to systematic errors that are restricted to a particular display. Turning in a circle did not disrupt HoloLens tracking, and testing with a second display did not suggest systematic errors restricted to a particular display. Instead, the experiments are consistent with the hypothesis that, when looking downwards at a horizontal plane, HoloLens 1st generation displays exhibit a systematic rightward perceptual bias. Precision analysis suggests that the method could measure the perceived location of a virtual object within an accuracy of less than 1 mm.

Mirror Mirror on My Phone: Investigating Dimensions of Self-Face Perception Induced by Augmented Reality Filters

Rebecca Fribourg, Trinity College Dublin
Etienne Peillard, LabSTICC
Rachel McDonnell, Trinity College Dublin

Conference Paper

The main use of Augmented Reality (AR) today for the general public is in smartphone applications. In particular, social network applications offer many AR filters that modify not only users’ environment but also their own image. These AR filters are used increasingly often and can distort users’ facial traits in many ways. Yet, to date, we do not clearly know how users perceive their own faces as augmented by these filters. Face perception has been the focus of a substantial body of research, which has shown that specific traits can be inferred from manipulations of facial features (e.g., eye size was found to influence the perception of dominance and trustworthiness). However, while these studies provided valuable insights into the link between facial features and the perception of human faces, they only address the perception of other people’s faces. To this day, it remains unclear how one perceives appeal, personality traits, intelligence, and emotion in one’s own face under specific facial feature alterations. In this paper, we present a study that evaluates the impact of different filters modifying several facial features, such as the size or position of the eyes, the shape of the face, or the orientation of the eyebrows. These filters are evaluated via a self-evaluation questionnaire asking participants about the emotions and moral traits that their distorted face conveys. Our results show relative effects between the different filters in line with previous results on the perception of others. However, they also reveal specific effects on self-perception, showing, among other things, that facial deformation decreases participants’ credence towards their own image. The findings of this study, covering multiple factors, highlight the impact of face deformation on users’ perception as well as the specificities of such use in AR, paving the way for new work focusing on the psychological impact of such filters.

CrowdXR - Pitfalls and Potentials of Experiments with Remote Participants

Jiayan Zhao, The Pennsylvania State University
Mark Simpson, The Pennsylvania State University
Pejman Sajjadi, The Pennsylvania State University
Jan Oliver Wallgrün, The Pennsylvania State University
Ping Li, The Hong Kong Polytechnic University
Mahda M. Bagher, The Pennsylvania State University
Danielle Oprean, University of Missouri
Lace Padilla, UC Merced
Alexander Klippel, The Pennsylvania State University

Conference Paper

Although the COVID-19 pandemic has made the need for remote data collection more apparent than ever, progress has been slow in the virtual reality (VR) research community, and little is known about the quality of the data acquired from crowdsourced participants who own a head-mounted display (HMD), which we call crowdXR. To investigate this problem, we report on a VR spatial cognition experiment that was conducted both in-lab and out-of-lab. The in-lab study was administered as a traditional experiment with undergraduate students and dedicated VR equipment. The out-of-lab study was carried out remotely by recruiting HMD owners from VR-related research mailing lists, VR subreddits in Reddit, and crowdsourcing platforms. Demographic comparisons show that our out-of-lab sample was older, included more males, and had a higher sense of direction than our in-lab sample. The results of the involved spatial memory tasks indicate that the reliability of the data from out-of-lab participants was as good as or better than their in-lab counterparts. Additionally, the data for testing our research hypotheses were comparable between in- and out-of-lab studies. We conclude that crowdsourcing is a feasible and effective alternative to the use of university participant pools for collecting survey and performance data for VR research, despite potential design issues that may affect the generalizability of study results. We discuss the implications and future directions of running VR studies outside the laboratory and provide a set of practical recommendations.

SceneAR: Scene-based Micro Narratives for Sharing and Remixing in Augmented Reality

Mengyu Chen, University of California Santa Barbara
Andrés Monroy-Hernández, Snap Inc.
Misha Sra, University of California Santa Barbara

Conference Paper

Short-form digital storytelling has become a popular medium for millions of people to express themselves. Traditionally, this medium uses primarily 2D media such as text (e.g., memes), images (e.g., Instagram), gifs (e.g., Giphy), and videos (e.g., TikTok, Snapchat). To expand the modalities from 2D to 3D media, we present SceneAR, a smartphone application for creating sequential scene-based micro narratives in augmented reality (AR). What sets SceneAR apart from prior work is the ability to share the scene-based stories as AR content—no longer limited to sharing images or videos, these narratives can now be experienced in people’s own physical environments. Additionally, SceneAR affords users the ability to remix AR, empowering them to build-upon others’ creations collectively. We asked 18 people to use SceneAR in a 3-day study. Based on user interviews, analysis of screen recordings, and the stories they created, we extracted three themes. From those themes and the study overall, we derived six strategies for designers interested in supporting short-form AR narratives.

Paper Session 13: Rendering

Thursday, 7 October
9:30 CEST UTC+2
Track A

Session Chair: Itaru Kitahara
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only)Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track A

Reconstructing Reflection Maps using a Stacked-CNN for Mixed Reality Rendering

Andrew Chalmers, Victoria University of Wellington
Junhong Zhao, Victoria University of Wellington
Daniel Medeiros, Victoria University of Wellington
Taehyun Rhee, Victoria University of Wellington

Invited TVCG Paper

Corresponding lighting and reflectance between real and virtual objects is important for spatial presence in augmented and mixed reality (AR and MR) applications. We present a method to reconstruct real-world environmental lighting, encoded as a reflection map (RM), from a conventional photograph. To achieve this, we propose a stacked convolutional neural network (SCNN) that predicts high dynamic range (HDR) 360° RMs with varying roughness from a limited field of view, low dynamic range photograph. The SCNN is progressively trained from high to low roughness to predict RMs at varying roughness levels, where each roughness level corresponds to a virtual object’s roughness (from diffuse to glossy) for rendering. The predicted RM provides high-fidelity rendering of virtual objects to match the background photograph. We illustrate the use of our method with indoor and outdoor scenes trained on separate indoor/outdoor SCNNs, showing plausible rendering and composition of virtual objects in AR/MR. We show that our method has improved quality over previous methods with a comparative user study and error metrics.

Adaptive Light Estimation using Dynamic Filtering for Diverse Lighting Conditions

Junhong Zhao, Victoria University of Wellington
Andrew Chalmers, Victoria University of Wellington
Taehyun Rhee, Victoria University of Wellington

Journal Paper

High dynamic range (HDR) panoramic environment maps are widely used to illuminate virtual objects to blend with real-world scenes. However, in common applications for augmented and mixed-reality (AR/MR), capturing 360-degree surroundings to obtain an HDR environment map is often not possible using consumer-level devices. We present a novel light estimation method to predict 360-degree HDR environment maps from a single photograph with a limited field-of-view (FOV). We introduce the Dynamic Lighting network (DLNet), a convolutional neural network that dynamically generates the convolution filters based on the input photograph sample to adaptively learn the lighting cues within each photograph. We propose novel Spherical Multi-Scale Dynamic (SMD) convolutional modules to dynamically generate sample-specific kernels for decoding features in the spherical domain to predict 360-degree environment maps.

Using DLNet and data augmentations with respect to FOV, an exposure multiplier, and color temperature, our model shows the capability of estimating lighting under diverse input variations. Compared with prior work that fixes the network filters once trained, our method maintains lighting consistency across different exposure multipliers and color temperature, and maintains robust light estimation accuracy as FOV increases. The surrounding lighting information estimated by our method ensures coherent illumination of 3D objects blended with the input photograph, enabling high fidelity augmented and mixed reality supporting a wide range of environmental lighting conditions and device sensors.

Neural Cameras: Learning Camera Characteristics for Coherent Mixed Reality Rendering

David Mandl, Graz University of Technology
Peter Mohr, VRVis Research Center
Tobias Langlotz, University of Otago
Christoph Ebner, Graz University of Technology
Shohei Mori, Graz University of Technology
Stefanie Zollmann, University of Otago
Peter Roth, Technical University of Munich
Denis Kalkofen, Graz University of Technology

Conference Paper

Coherent rendering is important for generating plausible Mixed Reality presentations of virtual objects within a user’s real-world environment. Besides photo-realistic rendering and correct lighting, visual coherence requires simulating the imaging system that is used to capture the real environment. While existing approaches either focus on a specific camera or a specific component of the imaging system, we introduce Neural Cameras, the first approach that jointly simulates all major components of an arbitrary modern camera using neural networks. Our system allows for adding new cameras to the framework by learning the visual properties from a database of images that has been captured using the physical camera. We present qualitative and quantitative results and discuss future direction for research that emerge from using Neural Cameras.

Selective Foveated Ray Tracing for Head-Mounted Displays

Youngwook Kim, Sogang University
Yunmin Ko, Snow Corp.
Insung Ihm, Sogang University

Conference Paper

Although ray tracing produces significantly more realistic images than traditional rasterization techniques, it is still considered computationally burdensome when implemented on a head-mounted display (HMD) system that demands both a wide field of view and a high rendering rate. A further challenge is that to present high-quality images on an HMD screen, a sufficient number of ray samples should be taken per pixel for effective antialiasing to reduce visually annoying artifacts. In this paper, we present a novel foveated real-time rendering framework that realizes classic Whitted-style ray tracing on an HMD system. In particular, we propose combining the selective supersampling technique of Jin et al. [8] with a foveated rendering scheme, resulting in perceptually highly efficient pixel sampling suitable for HMD ray tracing. We demonstrate that, further enhanced by foveated temporal antialiasing, our ray tracer renders nontrivial 3D scenes in real time on commodity GPUs at sampling rates as effective as up to 36 samples per pixel (spp) in the foveal area, gradually reducing to at least 1 spp in the periphery.
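
A minimal sketch of an eccentricity-driven sampling budget consistent with the 36-to-1 spp range quoted above; the foveal radius and exponential falloff are illustrative assumptions, not the method's actual allocation:

    import math

    def samples_per_pixel(eccentricity_deg, foveal_spp=36, min_spp=1,
                          foveal_radius_deg=5.0, falloff_deg=25.0):
        """Full supersampling inside the foveal region, smoothly decaying
        towards the periphery."""
        if eccentricity_deg <= foveal_radius_deg:
            return foveal_spp
        t = (eccentricity_deg - foveal_radius_deg) / falloff_deg
        return max(min_spp, int(round(foveal_spp * math.exp(-3.0 * t))))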

Foveated Photon Mapping

Xuehuai Shi, Beihang University
Lili Wang, Beihang University
Xiaoheng Wei, Beihang University
Ling-Qi Yan, University of California, Santa Barbara

Journal Paper

Virtual reality (VR) applications require high-performance rendering algorithms to efficiently render 3D scenes on the VR head-mounted display, to provide users with an immersive and interactive virtual environment. Foveated rendering provides a solution to improve the performance of rendering algorithms by allocating computing resources to different regions based on the human visual acuity, and renders images of different qualities in different regions. Rasterization-based methods and ray tracing methods can be directly applied to foveated rendering, but rasterization-based methods are difficult to estimate global illumination (GI), and ray tracing methods are inefficient for rendering scenes that contain paths with low probability. Photon mapping is an efficient GI rendering method for scenes with different materials. However, since photon mapping cannot dynamically adjust the rendering quality of GI according to the human acuity, it cannot be directly applied to foveated rendering. In this paper, we propose a foveated photon mapping method to render realistic GI effects in the foveal region. We use the foveated photon tracing method to generate photons with high density in the foveal region, and these photons are used to render high-quality images in the foveal region. We further propose a temporal photon management to select and update the valid foveated photons of the previous frame for improving our method’s performance. Our method can render diffuse, specular, glossy and transparent materials to achieve effects specifically related to GI, such as color bleeding, specular reflection, glossy reflection and caustics. Our method supports dynamic scenes and renders high-quality GI in the foveal region at interactive rates.

Paper Session 14: Perception & Experiences

Thursday, 7 October
9:30 CEST UTC+2
Track B

Session Chair: Etienne Peillard
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only): Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track B

Investigation of Size Variations in Optical See-through Tangible Augmented Reality

Denise Kahl, German Research Center for Artificial Intelligence
Marc Ruble, German Research Center for Artificial Intelligence
Antonio Krüger, German Research Center for Artificial Intelligence

Conference Paper

Optical see-through AR headsets are becoming increasingly attractive for many applications. Interaction with the virtual content is usually achieved via hand gestures or with controllers. A more seamless interaction between the real and virtual world can be achieved by using tangible objects to manipulate the virtual content. Instead of interacting with detailed physical replicas, working with abstractions allows a single physical object to represent a variety of virtual objects. These abstractions would differ from their virtual representations in shape, size, texture and material.
This paper investigates, for the first time in optical see-through AR, whether size variations are possible without major losses in performance, usability and immersion. The conducted study shows that size can be varied within a limited range without significantly affecting task completion times, feelings of disturbance, or presence. Stronger size deviations are tolerated when the physical object is smaller than the virtual object than when it is larger.

Virtual extensions improve perception-based instrument alignment using optical see-through devices

Mohamed Benmahdjoub, Erasmus MC
Wiro J. Niessen, Erasmus MC
Eppo B. Wolvius, Erasmus MC
Theo van Walsum, Erasmus MC

Journal Paper

Instrument alignment is a common task in various surgical interventions using navigation. The goal of the task is to position and orient an instrument as planned preoperatively. To this end, surgeons rely on patient-specific data visualized on screens alongside preplanned trajectories. The purpose of this manuscript is to investigate the effect of instrument visualization/non-visualization on alignment tasks, and to compare it with a virtual extensions approach, which augments the realistic representation of the instrument with simple 3D objects. 18 volunteers performed six alignment tasks under each of the following conditions: no visualization of the instrument; realistic visualization of the instrument; realistic visualization extended with virtual elements (Virtual extensions). The first condition represents an egocentric-based alignment, while the two other conditions additionally make use of exocentric depth estimation to perform the alignment. The device used was a see-through device (Microsoft HoloLens 2). The positions of the head and the instrument were acquired during the experiment. Additionally, the users were asked to fill in NASA-TLX and SUS forms for each condition. The results show that instrument visualization is essential for good alignment using see-through devices. Moreover, virtual extensions helped achieve the best performance compared to the other conditions, with median positional and angular errors of 2 mm and 2°, respectively. Furthermore, the virtual extensions decreased the average head velocity while also reducing frustration levels. Therefore, making use of virtual extensions could facilitate alignment tasks in augmented and virtual reality (AR/VR) environments, specifically in AR-navigated surgical procedures using optical see-through devices.

Now I’m Not Afraid: Reducing Fear of Missing Out in 360° Videos on a Head-Mounted Display Using a Panoramic Thumbnail

Shoma Yamaguchi, The University of Tokyo
Nami Ogawa, DMM.com
Takuji Narumi, The University of Tokyo

Conference Paper

Cinematic virtual reality, or 360° video, provides viewers with an immersive experience, allowing them to enjoy a video while moving their head to watch in any direction. However, there is an inevitable problem of feeling fear of missing out (FOMO) when viewing a 360° video, as only a part of the video is visible to the viewer at any given time. To solve this problem, we developed a technique to present a panoramic thumbnail of a full 360° video to users through a head-mounted display. With this technique, the user can grasp the overall view of the video as needed. We conducted an experiment to evaluate the FOMO, presence, and quality of viewing experience while using this technique compared to normal viewing without it. The results of the experiment show that the proposed technique relieved FOMO, the quality of viewing experience was improved, and there was no difference in presence. We also investigated how users interacted with this new interface based on eye tracking and head tracking data during viewing, which suggested that users used the panoramic thumbnail to actively explore outside their field of view.

Understanding the Two-Step Nonvisual Omnidirectional Guidance for Target Acquisition in 3D Spaces

Seung A Chung, Ewha Womans University
Kyungyeon Lee, Ewha Womans University
Uran Oh, Ewha Womans University

Conference Paper

Providing directional guidance is important, especially for exploring unfamiliar environments. However, most studies are limited to two-dimensional guidance, even though many interactions happen in 3D spaces. Moreover, visual feedback that is often used to communicate the 3D position of a particular object may not be available when the target is occluded by other objects or located outside of one’s field of view, or due to visual overload or lighting conditions. Inspired by a prior finding that users tend to scan a 3D space in one direction at a time, we propose two-step nonvisual omnidirectional guidance feedback designs that vary the search order: guidance for the target’s vertical location (the altitude) is offered first, followed by its horizontal direction (the azimuth angle), or vice versa. To investigate their effect, we conducted a user study with 12 blindfolded sighted participants. Findings suggest that our proposed two-step guidance outperforms the default condition with no fixed order in terms of task completion time and travel distance, particularly when the guidance in the horizontal direction is presented first. We plan to extend this work to assist with finding targets in 3D spaces in real-world environments.
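
As a concrete reading of the two-step scheme, the sketch below (Python, with illustrative names and a hypothetical 5° tolerance) computes the target's altitude and azimuth in the head's local frame and guides one axis at a time, switching to the second axis once the first is within tolerance. It is an assumption about one possible implementation, not the authors' feedback design.

```python
import math

def altitude_azimuth(target_local):
    """Angles of a target given in the head's local frame
    (x right, y up, -z forward); a common convention, assumed here."""
    x, y, z = target_local
    horiz = math.hypot(x, z)
    altitude = math.degrees(math.atan2(y, horiz))  # + above eye level, - below
    azimuth = math.degrees(math.atan2(x, -z))      # + to the right, - to the left
    return altitude, azimuth

def two_step_cue(target_local, altitude_first=True, tol_deg=5.0):
    """Guide one axis at a time: switch to the second axis once the first
    is within tol_deg (the threshold is an illustrative assumption).
    Returns the name of the axis to sonify next and its current error."""
    alt, azi = altitude_azimuth(target_local)
    steps = [("altitude", alt), ("azimuth", azi)]
    if not altitude_first:
        steps.reverse()
    (name1, a1), (name2, a2) = steps
    return (name1, a1) if abs(a1) > tol_deg else (name2, a2)
```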

Investigating Textual Sound Effects in a Virtual Environment and their impacts on Object Perception and Sound Perception

Thibault Fabre, The University of Tokyo
Adrien Alexandre Verhulst, Sony Computer Science Laboratories
Alfonso Balandra, The University of Tokyo
Maki Sugimoto, Keio University
Masahiko Inami, The University of Tokyo

Conference Paper

In comics, Textual Sound Effects (TE) can describe sounds, but also actions, events, etc. TE could be used in Virtual Environments to efficiently create an easily recognizable scene and add more information to objects at a relatively low design cost. We investigate the impact of TE in a Virtual Environment on objects’ material perception (category and properties) and on sound perception (volume [dB] and spatial position). Participants (N=13, repeated measures) categorized metallic and wooden spheres, and their reaction times changed significantly depending on the congruence of the TE with the spheres’ material/sound. They then rated a sphere’s properties (i.e., wetness, warmness, softness, smoothness, and dullness), and their ratings changed significantly depending on the TE. When comparing two sound volumes, they perceived a sound associated with a shrinking TE as less loud and a sound associated with a growing TE as louder. When locating an audio source, they placed it significantly closer to the TE.

The Impact of Focus and Context Visualization Techniques on Depth Perception in Optical See-Through Head-Mounted Displays

Alejandro Martin-Gomez, Technical University of Munich
Jakob Weiss, Technical University of Munich
Andreas Keller, Technical University of Munich
Ulrich Eck, Technical University of Munich
Daniel Roth, Technical University of Munich
Nassir Navab, Technical University of Munich

Invited TVCG Paper

Estimating the depth of virtual content has proven to be a challenging task in Augmented Reality (AR) applications. Existing studies have shown that the visual system uses multiple depth cues to infer the distance of objects, occlusion being one of the most important ones. Generating appropriate occlusions becomes particularly important for AR applications that require the visualization of augmented objects placed below a real surface. Examples of these applications are medical scenarios in which anatomical information needs to be observed within the patient’s body. In this regard, existing works have proposed several focus and context (F+C) approaches to aid users in visualizing this content using Video See-Through (VST) Head-Mounted Displays (HMDs). However, the implementation of these approaches in Optical See-Through (OST) HMDs remains an open question due to the additive characteristics of the display technology. In this paper, we, for the first time, design and conduct a user study that compares depth estimation between VST and OST HMDs using existing in-situ visualization methods. Our results show that these visualizations cannot be directly transferred to OST displays without increasing error in depth perception tasks. To tackle this gap, we perform a structured decomposition of the visual properties of AR F+C methods to find best-performing combinations. We propose the use of chromatic shadows and hatching approaches transferred from computer graphics. In a second study, we perform a factorized analysis of these combinations, showing that varying the shading type and using colored shadows can lead to better depth estimation when using OST HMDs.

Paper Session 15: Human Factors & Ethics

Thursday, 7 October
11:30 CEST UTC+2
Track A

Session Chair: Manuela Chessa
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only): Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track A

Safety, Power Imbalances, Ethics and Proxy Sex: Surveying In-The-Wild Interactions Between VR Users and Bystanders

Joseph O’Hagan, University of Glasgow
Julie R. Williamson, University of Glasgow
Mark McGill, University of Glasgow
Mohamed Khamis, University of Glasgow

Conference Paper

VR users and bystanders must sometimes interact, but our understanding of these interactions – their purpose, how they are accomplished, attitudes toward them, and where they break down – is limited. This current gap inhibits research into managing or supporting these interactions, and preventing unwanted or abusive activity. We present the results of the first survey (N=100) that investigates stories of actual emergent in-the-wild interactions between VR users and bystanders. Our analysis indicates VR user and bystander interactions can be categorised into one of three categories: coexisting, demoing, and interrupting. We highlight common interaction patterns and impediments encountered during these interactions. Bystanders play an important role in moderating the VR user’s experience, for example intervening to save the VR user from potential harm. However, our stories also suggest that the occlusive nature of VR introduces the potential for bystanders to exploit the vulnerable state of the VR user; and for the VR user to exploit the bystander for enhanced immersion, introducing significant ethical concerns.

Evaluating the User Experience of a Photorealistic Social VR Movie

Jie Li, Centrum Wiskunde & Informatica
Shishir Subramanyam, Centrum Wiskunde & Informatica
Jack Jansen, Centrum Wiskunde & Informatica
Yanni Mei, Centrum Wiskunde & Informatica
Ignacio Reimat, Centrum Wiskunde & Informatica
Kinga Lawicka, Centrum Wiskunde & Informatica
Pablo Cesar, Centrum Wiskunde & Informatica

Conference Paper

We all enjoy watching movies together. However, this is not always possible if we live apart. While we can remotely share our screens, the experience differs from being together. We present a social Virtual Reality (VR) system that captures, reconstructs, and transmits multiple users’ volumetric representations into a commercially produced 3D virtual movie, so they have the feeling of “being there” together. We conducted a 48-user experiment where we invited users to experience the virtual movie either using a Head Mounted Display (HMD) or using a 2D screen with a game controller. In addition, we invited 14 VR experts to experience both the HMD and the screen version of the movie and discussed their experiences in two focus groups. Our results showed that both end-users and VR experts found that the way they navigated and interacted inside a 3D virtual movie was novel. They also found that the photorealistic volumetric representations enhanced feelings of co-presence. Our study lays the groundwork for future interactive and immersive VR movie co-watching experiences.

Directions for 3D User Interface Research from Consumer VR Games

Anthony Steed, University College London
Tuukka M. Takala, Waseda University
Dan Archer, University College London
Wallace Lages, Virginia Tech
Robert W. Lindeman, University of Canterbury

Journal Paper

With the continuing development of affordable immersive virtual reality (VR) systems, there is now a growing market for consumer content. The current form of consumer systems is not dissimilar to the lab-based VR systems of the past 30 years: the primary input mechanism is a head-tracked display and one or two tracked hands with buttons and joysticks on hand-held controllers. Over those 30 years, a very diverse academic literature has emerged that covers design and ergonomics of 3D user interfaces (3DUIs). However, the growing consumer market has engaged a very broad range of creatives that have built a very diverse set of designs. Sometimes these designs adopt findings from the academic literature, but other times they experiment with completely novel or counter-intuitive mechanisms. In this paper and its online adjunct, we report on novel 3DUI design patterns that are interesting from both design and research perspectives: they are highly novel, potentially broadly re-usable and/or suggest interesting avenues for evaluation. The supplemental material, which is a living document, is a crowd-sourced repository of interesting patterns. This paper is a curated snapshot of those patterns that were considered to be the most fruitful for further elaboration.

Using Trajectory Compression Rate to Predict Changes in Cybersickness in Virtual Reality Games

Diego Vilela Monteiro, Xi’an Jiaotong-Liverpool University
Hai-Ning Liang, Xi’an Jiaotong-Liverpool University
Xiaohang Tang, Xi’an Jiaotong-Liverpool University
Pourang Irani, University of Manitoba

Conference Paper

Identifying cybersickness in virtual reality (VR) applications such as games in a fast, precise, non-intrusive, and non-disruptive way remains challenging. Several factors can cause cybersickness, and their identification will help find its origins and prevent or minimize it. One such factor is virtual movement. Movement, whether physical or virtual, can be represented in different forms. One way to represent and store it is with a temporally annotated point sequence. Because a sequence is memory-consuming, it is often preferable to save it in a compressed form. Compression allows redundant data to be eliminated while still preserving changes in speed and direction. Since changes in direction and velocity in VR can be associated with cybersickness, changes in compression rate can likely indicate changes in cybersickness levels. In this research, we explore whether quantifying changes in virtual movement can be used to estimate variation in cybersickness levels of VR users. We investigate the correlation between changes in the compression rate of movement data in two VR games with changes in players’ cybersickness levels captured during gameplay. Our results show (1) a clear correlation between changes in compression rate and cybersickness, and (2) that a machine learning approach can be used to identify these changes. Finally, results from a second experiment show that our approach is feasible for cybersickness inference in games and other VR applications that involve movement.
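
The abstract does not name a specific compression algorithm; as one plausible stand-in, the following Python sketch applies Ramer-Douglas-Peucker line simplification to the virtual trajectory and reports the fraction of points removed, so that straight, steady movement compresses heavily while frequent direction changes compress poorly (a time-aware variant would also preserve speed changes). Function names and the epsilon tolerance are illustrative assumptions.

```python
import math

def _point_line_dist(p, a, b):
    """Distance from 3D point p to the line through a and b."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    ab_len2 = sum(c * c for c in ab)
    if ab_len2 == 0.0:
        return math.dist(p, a)
    t = sum(ap[i] * ab[i] for i in range(3)) / ab_len2
    proj = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, proj)

def rdp(points, eps):
    """Ramer-Douglas-Peucker simplification of a 3D polyline."""
    if len(points) < 3:
        return list(points)
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_line_dist(points[i], points[0], points[-1])
        if d > dmax:
            idx, dmax = i, d
    if dmax <= eps:
        return [points[0], points[-1]]
    left = rdp(points[: idx + 1], eps)
    right = rdp(points[idx:], eps)
    return left[:-1] + right  # drop the duplicated split point

def compression_rate(points, eps=0.01):
    """Fraction of points removed by simplification: high for straight
    paths, low when the trajectory changes direction often."""
    if len(points) < 2:
        return 0.0
    return 1.0 - len(rdp(points, eps)) / len(points)
```

Tracking this rate over sliding windows of gameplay would give the per-interval signal that can then be related to reported cybersickness changes.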

A Partially-Sorted Concentric Layout for Efficient Label Localization in Augmented Reality

Zijing Zhou, Beihang University
Lili Wang, Beihang University
Voicu Popescu, Purdue University

Journal Paper

A common approach for Augmented Reality labeling is to display the label text on a flag planted into the real world element at a 3D anchor point. When there are more than just a few labels, the efficiency of the interface decreases as the user has to search for a given label sequentially. The search can be accelerated by sorting the labels alphabetically, but sorting all labels results in long and intersecting leader lines from the anchor points to the labels. This paper proposes a partially-sorted concentric label layout that leverages the search efficiency of sorting while avoiding the label display problems of long or intersecting leader lines. The labels are partitioned into a small number of sorted sequences displayed on circles of increasing radii. Since the labels on a circle are sorted, the user can quickly search each circle. A tight upper bound derived from circular permutation theory limits the number of circles and thereby the complexity of the label layout. For example, 12 labels require at most three circles. When the application allows it, the labels are presorted to further reduce the number of circles in the layout. The layout was tested in a user study where it significantly reduced the label searching time compared to a conventional single-circle layout.
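
To illustrate the partitioning step, the greedy Python sketch below walks the labels in the fixed angular order of their anchors and places each one on the innermost ring that still reads as sorted around the circle (at most one descent when read cyclically). The greedy rule and the helper names are assumptions for illustration and do not reproduce the paper's bound-driven algorithm.

```python
def _cyclic_descents(labels):
    """Number of positions where the next label (cyclically) sorts before
    the current one; a ring reads as alphabetically sorted iff this is <= 1."""
    n = len(labels)
    return sum(labels[i] > labels[(i + 1) % n] for i in range(n))

def assign_to_rings(labels_in_angular_order):
    """Greedy sketch (an assumption, not the paper's algorithm): keep the
    anchors' angular order and place each label on the innermost ring that
    remains cyclically sorted after insertion."""
    rings = []
    for label in labels_in_angular_order:
        for ring in rings:
            if _cyclic_descents(ring + [label]) <= 1:
                ring.append(label)
                break
        else:
            rings.append([label])  # open a new, larger ring
    return rings
```

Each returned ring preserves the angular order of its labels and can be read as a sorted list starting just after its single descent, which is what makes per-ring visual search fast.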

Paper Session 16: 3D Manipulation

Thursday, 7 October
11:30 CEST UTC+2
Track B

Session Chair: Hartmut Seichter
YouTube Stream (non-interactive)
Discord Channel for Zoom link and Interactive Q&A Access (registered attendees only): Browser, App
Post-Session Discussion with Authors in Gathertown Room: Q&A Track B

Gaze Comes in Handy: Predicting and Preventing Erroneous Hand Actions in AR-Supported Manual Tasks

Julian Wolf, ETH Zürich
Quentin Lohmeyer, ETH Zürich
Christian Holz, ETH Zürich
Mirko Meboldt, ETH Zürich

Conference Paper

Emerging Augmented Reality headsets incorporate gaze and hand tracking and can, thus, observe the user’s behavior without interfering with ongoing activities. In this paper, we analyze hand-eye coordination in real-time to predict hand actions during target selection and warn users of potential errors before they occur. In our first user study, we recorded 10 participants playing a memory card game, which involves frequent hand-eye coordination with little task-relevant information. We found that participants’ gaze locked onto target cards 350 ms before the hands touched them in 73.3% of all cases, which coincided with the peak velocity of the hand moving to the target. Based on our findings, we then introduce a closed-loop support system that monitors the user’s fingertip position to detect the first card turn and analyzes gaze, hand velocity and trajectory to predict the second card before it is turned by the user. In a second study with 12 participants, our support system correctly displayed color-coded visual alerts in a timely manner with an accuracy of 85.9%. The results indicate the high value of eye and hand tracking features for behavior prediction and provide a first step towards predictive real-time user support.
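
A minimal sketch of how the two reported signals, gaze lock-on roughly 350 ms before touch and the peak of hand velocity, could be combined into a prediction is given below. The dwell test, the distance threshold, and the function names are illustrative assumptions, not the authors' support system.

```python
import math

def _nearest_candidate(point, candidates, radius):
    """Candidate whose anchor lies within `radius` of the gaze point, if any."""
    best, best_d = None, radius
    for name, pos in candidates.items():
        d = math.dist(point, pos)
        if d <= best_d:
            best, best_d = name, d
    return best

def predict_target(gaze_samples, hand_samples, candidates,
                   lock_ms=350, gaze_radius=0.05):
    """Illustrative gaze-informed target prediction.

    gaze_samples / hand_samples: lists of (t_ms, (x, y, z)).
    candidates: {name: (x, y, z)} positions of selectable objects.
    At the moment of peak hand speed, return the candidate the gaze has
    dwelled on for the preceding lock_ms, else None.
    """
    if len(hand_samples) < 2:
        return None
    # 1. Find the time of peak hand speed.
    peak_t, peak_v = None, -1.0
    for (t0, p0), (t1, p1) in zip(hand_samples, hand_samples[1:]):
        v = math.dist(p0, p1) / max(t1 - t0, 1e-6)
        if v > peak_v:
            peak_t, peak_v = t1, v
    # 2. Check which candidate the gaze was locked onto during the window.
    window = [g for t, g in gaze_samples if peak_t - lock_ms <= t <= peak_t]
    hits = [_nearest_candidate(g, candidates, gaze_radius) for g in window]
    if hits and all(h is not None and h == hits[0] for h in hits):
        return hits[0]
    return None
```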

Evaluation of Drop Shadows for Virtual Object Grasping in Augmented Reality

Muadh Al-Kalbani, Birmingham City University
Maite Frutos-Pascual, Birmingham City University
Ian Williams, Birmingham City University

Invited CG&A Paper

This article presents the use of rendered visual cues in the form of drop shadows and their impact on the overall usability and accuracy of grasping interactions for monitor-based exocentric augmented reality (AR). We report on two conditions, grasping with drop shadows and without drop shadows, and analyze a total of 1620 grasps of two virtual object types (cubes and spheres). We report on the accuracy of one grasp type, the Medium Wrap grasp, against Grasp Aperture (GAp), Grasp Displacement (GDisp), completion time, and usability metrics from 30 participants. A comprehensive statistical analysis of the results is presented, comparing the inclusion of drop shadows in AR grasping. Findings showed that the use of drop shadows increases the usability of AR grasping while significantly decreasing task completion times. Furthermore, drop shadows also significantly improve users’ depth estimation of AR object position. However, this study also shows that using drop shadows does not improve users’ object size estimation, which remains a problematic element in the AR grasping interaction literature.

Fine Virtual Manipulation with Hands of Different Sizes

Suzanne Sorli, Universidad Rey Juan Carlos
Dan Casas, Universidad Rey Juan Carlos
Mickeal Verschoor, Universidad Rey Juan Carlos
Ana Tajadura-Jiménez, University College London
Miguel Otaduy, Universidad Rey Juan Carlos

Conference Paper

Natural interaction with virtual objects relies on two major technology components: hand tracking and hand-object physics simulation. There are functional solutions for these two components, but their hand representations may differ in size and skeletal morphology, hence making the connection non-trivial. In this paper, we introduce a pose retargeting strategy to connect the tracked and simulated hand representations, and we have formulated and solved this hand retargeting as an optimization problem. We have also carried out a user study that demonstrates the effectiveness of our approach to enable fine manipulations that are slow and awkward with naïve approaches.
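
The paper formulates retargeting as an optimization problem; as a deliberately oversimplified illustration of that idea, the sketch below solves the smallest possible instance, a single uniform scale between tracked and simulated fingertip offsets, in closed form. It is an assumption for exposition only and ignores the skeletal morphology differences handled by the full method.

```python
def uniform_scale_retarget(tracked_tips, sim_tips):
    """Least-squares scale s minimizing sum_i || s * p_i - q_i ||^2, where
    p_i are tracked fingertip offsets from the wrist and q_i the simulated
    hand's fingertip offsets. Closed form: s = (sum p.q) / (sum p.p).
    A toy stand-in for the paper's pose-retargeting optimization."""
    num = sum(p[k] * q[k] for p, q in zip(tracked_tips, sim_tips) for k in range(3))
    den = sum(p[k] * p[k] for p in tracked_tips for k in range(3))
    return num / den if den else 1.0
```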

VR Collaborative Object Manipulation Based on View Quality

Lili Wang, Beihang University
Xiaolong Liu, Beihang University
Xiangyu Li, Beihang University

Conference Paper

We introduce a collaborative manipulation method to improve the efficiency and accuracy of object manipulation in virtual reality applications with multiple users. When multiple users manipulate an object in collaboration, at any given moment some users have a better viewpoint than others: they can clearly observe the object to be manipulated and the target position, and can therefore manipulate the object more efficiently and accurately. We construct a viewpoint quality function and evaluate the viewpoints of multiple users by computing its three components: the visibility of the object to be manipulated, the visibility of the target, and a combined depth and distance term for the target. By comparing the viewpoint quality of multiple users, the user with the highest viewpoint quality is determined as the dominant manipulator, who manipulates the object at that moment. A temporal filter is proposed to filter the dominance sequence generated by the previous frames and the current frame, which reduces the dominant manipulator jumping back and forth between multiple users within a short time slice, making the determination of the dominant manipulator more stable. We designed a user study and tested our method with three multi-user collaborative manipulation tasks. Compared with two traditional dominant-manipulator determination methods, first-come-first-action and actively switched dominance, our method showed significant improvements in manipulation task completion time and rotation accuracy. Moreover, our method balances the participation time of users and reduces the task load significantly.
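
A compact sketch of the two ingredients described above, a weighted viewpoint-quality score and a temporal filter over recent dominance decisions, is given below. The weights, window size, and class names are illustrative assumptions rather than the paper's exact formulation.

```python
from collections import Counter, deque

def viewpoint_quality(vis_object, vis_target, depth_term, w=(0.4, 0.4, 0.2)):
    """Illustrative weighted sum of the three components named in the
    abstract (visibility of the manipulated object, visibility of the
    target, combined depth/distance term), each assumed in [0, 1].
    The weights are placeholders, not the paper's."""
    return w[0] * vis_object + w[1] * vis_target + w[2] * depth_term

class DominanceFilter:
    """Temporal filter sketch: the dominant manipulator is the user who won
    the most of the last `window` frames, which damps rapid switching."""
    def __init__(self, window=30):
        self.history = deque(maxlen=window)

    def update(self, qualities):
        """qualities: {user_id: viewpoint quality for the current frame}."""
        self.history.append(max(qualities, key=qualities.get))
        return Counter(self.history).most_common(1)[0][0]
```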

Separation, Composition, or Hybrid? – Comparing Collaborative 3D Object Manipulation Techniques for Handheld Augmented Reality

Jonathan Wieland, University of Konstanz
Johannes Zagermann, University of Konstanz
Jens Müller, University of Konstanz
Harald Reiterer, University of Konstanz

Conference Paper

Augmented Reality (AR) supported collaboration is a popular topic in HCI research. Previous work has shown the benefits of collaborative 3D object manipulation and identified two possibilities: either separate or compose users’ inputs. However, an experimental comparison using handheld AR displays is still missing. We therefore conducted an experiment in which we tasked 24 dyads with collaboratively positioning virtual objects in handheld AR using three manipulation techniques: 1) Separation – performing only different manipulation tasks (i.e., translation or rotation) simultaneously, 2) Composition – performing only the same manipulation tasks simultaneously and combining individual inputs using a merge policy, and 3) Hybrid – performing any manipulation tasks simultaneously, enabling dynamic transitions between Separation and Composition. While all techniques were similarly effective, Composition was least efficient, with higher subjective workload and worse user experience. Preferences were polarized between clear work division (Separation) and freedom of action (Hybrid). Based on our findings, we offer research and design implications.
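
For the Composition condition, the abstract mentions a merge policy without specifying it; one simple possibility is sketched below, averaging simultaneous translations and taking a circular mean of yaw rotations. This is an assumption for illustration, not necessarily the policy used in the study.

```python
import math

def merge_inputs(trans_a, trans_b, yaw_a_deg, yaw_b_deg):
    """One possible merge policy: average the two users' simultaneous
    translation vectors and take the circular mean of their yaw inputs so
    the result is well defined across the +/-180 degree wrap."""
    merged_t = tuple((a + b) / 2.0 for a, b in zip(trans_a, trans_b))
    ya, yb = math.radians(yaw_a_deg), math.radians(yaw_b_deg)
    merged_yaw = math.degrees(math.atan2((math.sin(ya) + math.sin(yb)) / 2.0,
                                         (math.cos(ya) + math.cos(yb)) / 2.0))
    return merged_t, merged_yaw
```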

Exploring Head-based Mode-Switching in Virtual Reality

Rongkai Shi, Xi’an Jiaotong-Liverpool University
Nan Zhu, Xi’an Jiaotong-Liverpool University
Hai-Ning Liang, Xi’an Jiaotong-Liverpool University
Shengdong Zhao, National University of Singapore

Conference Paper

Mode-switching supports multilevel operations using a limited number of input methods. In Virtual Reality (VR) head-mounted displays (HMDs), common approaches for mode-switching use buttons, controllers, and users’ hands. However, these are inefficient and challenging to use for tasks that require both hands (e.g., when users need both hands during drawing operations). Using head gestures for mode-switching can be an efficient and cost-effective alternative, allowing for a more continuous and smooth transition between modes. In this paper, we explore the use of head gestures for mode-switching, especially in scenarios where both of the user’s hands are performing tasks. We present a first user study that evaluated eight head gestures that could be suitable for VR HMDs, using a dual-hand line-drawing task. Results show that move forward, move backward, roll left, and roll right led to better performance and were preferred by participants. A second study integrating these four gestures in Tilt Brush, an open-source VR painting application, was conducted to further explore the applicability of these gestures and derive insights. Results show that Tilt Brush with head gestures allowed users to change modes with ease and led to improved interaction and user experience. The paper ends with a discussion of design recommendations for using head-based mode-switching in VR HMDs.
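
As an illustration of how the four favored gestures might be detected from head pose deltas, the threshold-based Python sketch below classifies forward/backward head movement and left/right roll over a short time window. The thresholds and input conventions are illustrative assumptions, not the study's implementation.

```python
def classify_head_gesture(forward_disp_m, roll_delta_deg,
                          move_thresh_m=0.05, roll_thresh_deg=15.0):
    """Threshold-based sketch for the four preferred gestures.

    forward_disp_m: head displacement along its own forward axis over a
    short window (metres, positive = forward).
    roll_delta_deg: change in head roll over the same window
    (degrees, positive = roll right).
    Returns a gesture name or None if no threshold is crossed.
    """
    if forward_disp_m >= move_thresh_m:
        return "move_forward"
    if forward_disp_m <= -move_thresh_m:
        return "move_backward"
    if roll_delta_deg <= -roll_thresh_deg:
        return "roll_left"
    if roll_delta_deg >= roll_thresh_deg:
        return "roll_right"
    return None
```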

Sponsors

Platinum

Gold

Silver

Bronze

SME and Media partner

Partners

IEEE
IEEE Computer Society
IEEE VGTC
ACM In-Cooperation