Monday 26 June 2017

Virtual Society: Collaboration in 3D spaces on the Internet.

Rodger Lea, Yasuaki Honda, Kouichi Matsuda

Sony Computer Science Lab. Tokyo, Japan
Tel: +81 3 5448 4380 email rodger@csl.sony.co.jp

Abstract
The Virtual Society (VS) project is a long term research initiative that is investigating the evolution of the future electronic society. Our vision for this electronic society is a shared 3D virtual world where users, from homes and offices, can explore, interact and work. Our first implementation of an infrastructure to support our investigation is known as CommunityPlace and has been developed to support large-scale shared 3D spaces on the Internet using the Virtual Reality Modeling Language (VRML). Such an ambitious project cuts across many different domains. In this paper we outline the goals of the Virtual Society project, discuss the architecture and implementation of CommunityPlace with particular emphasis on Internet-related technologies such as VRML, and give our view on the role of VRML and the Internet in supporting large-scale shared 3D spaces.
Keywords: Distributed Virtual Environment, Internet, Collaboration, Consistency, VRML

Introduction

The Virtual Society (VS) project is a long term research initiative that is investigating how the future electronic society will evolve. Recent trends in communications, consumer audio-visual technology and computer devices point to a synergy creating a comprehensive electronic network that will be ubiquitous in the home and office. Such a network will allow easy access to media and data from a variety of sources and will deliver this information to users wherever they may be. To a limited extent, the recent growth of the WWW has already initiated this process.
It is our belief that this ubiquitous network will also provide opportunities for far greater interaction than is currently possible in the WWW. In particular, the current WWW is a 'lonely' place. While it is possible for several users to access the same piece of information, they are unaware of each other and in most cases there is little support for interaction between them.
Our goal is to begin to explore the capabilities of existing technology to support social spaces, i.e. electronic locales where people go to interact. As a first step in this investigation, we have chosen to explore the 3D spatial metaphor as a basis for a shared information and interaction space. Our choice of a 3D spatial metaphor is based on our belief that such a metaphor is an attractive 'natural' environment within which users can interact. Rather than strive to find new metaphors to present data, we mimic the world in which we live. While it is clear that not all interaction needs or benefits from a three dimensional setting, we believe that such a setting, providing support for notions such as presence, location, identity and activity[1], will provide a generic basis on which a number of different application types will be constructed.
Thus, our goal has been to build a support infrastructure that will allow many users to participate in a shared, interactive 3D world. Such interaction will include the ability to see each other, talk to each other, visit locales with each other and work with each other. Our system, CommunityPlace (CP), has elements of a computer-supported cooperative work (CSCW) environment, a virtual reality system and an on-line chat forum. Such systems have already been explored in a number of experimental research platforms. However, in the majority of cases the work has been confined to high bandwidth communication networks supporting small numbers of users. Our work differs in that our initial goal has been large-scale systems capable of supporting many geographically dispersed users, interconnected through low bandwidth, high latency communication links.

A simplistic architecture

A naive and basic infrastructure for a shared 3D world is simple; it consists of a database of objects that exist in the world, a set of tools to populate that database and a set of devices that display the contents of the database. The display device doubles as an input device and allows users to navigate through the world and to interact with other users and objects in the world. To achieve this, it requires some form of communication that will allow the display devices to access the database and to propagate user input to the database.
The major components of such a system are:
  • The display device can range from a low-cost consumer electronics device up to a high-end graphics workstation.
  • The communications link is of prime importance to the performance of the user device. In a consumer setting, current technology constrains us to a maximum bit rate of 14.4k bits per second, whereas a modern research lab has access to a Mbit communication link.
  • The server maintains the database of scenery objects that make up the world and users who are navigating through those scenes. It delivers the contents of the database to the display devices as and when needed.
  
Figure 1: A simple architecture

If our goal was to support a limited number of interacting users in a well-defined network setting, then the simple architecture outlined above would suffice. However, because we wish to support many hundreds of users, often with significantly different machine and communications access, we need to ensure that the architecture will scale.
In terms of overall networking, since one of our basic requirements is wide area accessibility, we are naturally forced to use the Internet as communications infrastructure. The Internet is a particularly harsh example of a wide area network. It offers low and variable bandwidth with no guarantees and manifests high and variable latencies. In addition, it is an extremely dynamic network where communication characteristics can change on a packet-to-packet basis.
As a starting point for our development we have assumed the low-end or minimum capabilities for the above categories. In particular, we assume that access devices are home PCs without graphics support, connected through a 14.4kbps Internet access point. These assumptions have significantly influenced our system architecture, resulting in a hybrid client-server/peer-to-peer system model that we believe balances our goals and constraints.
This paper is laid out as follows. In section 2 we introduce the basic CP system architecture and discuss each of the major components. Section 2.1.1 gives a brief introduction to the Virtual Reality Modeling Language (VRML), and the section finishes with a discussion of the scripting mechanism used to support distributed applications. In section 3 we introduce the key problem, consistency, that we face when we try to scale up any shared virtual world, and in section 3.2 we introduce our solution to this problem based on a spatial model. Section 3.3 then discusses further the issues of supporting large-scale shared worlds in the Internet, and discusses our use of distributed server technology to address these issues. In section 4 we introduce a series of shared worlds that we have developed, and discuss our experiences with these worlds. Section 5 relates our work to others both in the academic and Internet communities, and sections 6 and 7 discuss future directions and conclude.

CP system architecture

  The basic system architecture for CP is shown in figure 2. In the following sections we discuss the individual components in detail.
  
Figure 2: CP architecture

Browser

The browser is the term we use for the part of the system that renders the 3D scene and allows users to navigate through it. The browser runs on the users' home PC.
As can be seen from figure 2, the browser works in conjunction with an HTML browser. A typical scenario is the following. The user is browsing a web page with a standard HTML browser. One of the links points to a document that contains 3D information. The user selects the link, and the document is downloaded to the HTML browser using the standard HTTP protocol. The HTML browser recognises that the MIME type of the 3D data format requires the CP 3D browser and therefore starts the CP browser either as a helper application or as a plug-in. The CP browser loads the 3D data file, in the course of which it finds an entry describing the location of the server to be used for this shared 3D scene. The CP browser then contacts the server via the Virtual Society Client Protocol (VSCP) that runs above IP. The server informs the CP browser of any other users in the scene, including their location, and of any other 3D objects not contained in the original scene description downloaded from the web server. The details are discussed below.

Virtual Reality Modeling Language

 
The choice of 3D scene description language has again been influenced by the fact that our primary network target is the Internet and the WWW hypermedia system it supports. The predominant language used for text documents in the WWW is the Hyper-Text Markup Language (HTML). A significant effort has been underway to produce an equivalent 3D description language that works well in the WWW. This language is referred to as the Virtual Reality Modeling Language (VRML).
VRML history
The VRML standardisation effort began in early 1995 with a small working group defining a set of requirements for a scene description language suitable for the WWW. The resulting CFT and the subsequent selection process adopted a proposal from Silicon Graphics Inc. (SGI) for a static scene description language based on the OpenInventor format.
This proposal led to an initial standard referred to as VRML1.0. VRML1.0 is a simple graphics language based on a model of a scene made up from a series of transformation nodes, i.e. spatial position markers. Each node can have any number of sub-nodes forming a tree. A set of geometry nodes, at the level of cube, cone, etc., have a set of properties including colour or material. A scene will consist of a large number of these trees in a forest organisation. The entire forest is often referred to as the scene graph. VRML, like HTML, supports the notions of linking and embedding, and allows scene authors to embed other media types, or to link to other 2D or 3D documents.
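The tree and forest organisation described above can be illustrated with a minimal sketch. The Python below is purely illustrative and is not VRML syntax: the names SceneNode, translation and world_positions are our own assumptions, and only translation (not rotation or scaling) is modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class SceneNode:
    """Hypothetical transformation node: a position marker with sub-nodes."""
    name: str
    translation: Vec3 = (0.0, 0.0, 0.0)
    geometry: Optional[str] = None          # e.g. "cube" or "cone"
    children: List["SceneNode"] = field(default_factory=list)


def world_positions(node: SceneNode, origin: Vec3 = (0.0, 0.0, 0.0)) -> Dict[str, Vec3]:
    """Walk one tree, accumulating translations from the root downwards."""
    here = tuple(o + t for o, t in zip(origin, node.translation))
    positions = {node.name: here}
    for child in node.children:
        positions.update(world_positions(child, here))
    return positions


# A scene is a forest: a list of independent trees.
table = SceneNode("table", (2.0, 0.0, 0.0), children=[
    SceneNode("lamp", (0.0, 1.0, 0.0), geometry="cone"),
])
scene = [table, SceneNode("box", (-1.0, 0.0, 3.0), geometry="cube")]
```

Because a child's position is expressed relative to its parent, moving the "table" node moves the "lamp" with it, which is the main practical benefit of the tree structure.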
While VRML1.0 is an adequate language for static scene descriptions, it does not support our requirements for interactive scenes or for sharing. To address these problems, we extended VRML1.0 with a set of extra nodes that supported sound, video, and more importantly, a mechanism to associate a language script node with a 3D scene object. We then added an event mechanism based on sensors that allowed a scene author to animate the scene by using sensors to generate events, and then using these events to trigger scripts which in turn manipulated the scene graph. These extensions are known as E-VRML [4].
Subsequently these extensions were combined into a joint proposal with SGI and Worldmaker (see footnote) called Moving Worlds, and submitted to the VRML architecture group (VAG) as a proposal for VRML2.0. This proposal was accepted as the basis for VRML2.0 [2] which was officially released in August 1996. VRML2.0 is now being standardised within the ISO framework.
VRML2.0
A full discussion of VRML2.0 is outside the scope of this document. In essence, the proposal retains the node structure of VRML1.0, adds the sensors and event mechanism and supports a routing mechanism that allows events to be generated and routed to parts of the scene graph. The target for such events may be graphics nodes, or more typically, script nodes. Script nodes are used to actually perform processing which is subsequently reflected in the scene graph. This approach offers a very flexible and open model for 3D scene manipulation, allowing scene authors to write scripts in a choice of languages, re-use scripts and dynamically add to, change, or remove nodes - including the scripting nodes themselves - from the scene graph.
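The sensor, route and script chain described above can be sketched as follows. This is an illustrative model only, not the VRML2.0 API: the classes Node, ScriptNode and Router, and the field names used, are assumptions introduced for the example.

```python
from typing import Callable, Dict, List, Tuple


class Node:
    """Hypothetical scene-graph node holding named fields."""
    def __init__(self, name: str, **fields):
        self.name = name
        self.fields = dict(fields)

    def receive(self, field_name: str, value) -> None:
        self.fields[field_name] = value


class ScriptNode(Node):
    """Script node: runs a handler and may emit further events."""
    def __init__(self, name: str, handler: Callable):
        super().__init__(name)
        self.handler = handler
        self.router = None  # set by Router.add_node

    def receive(self, field_name: str, value) -> None:
        for out_field, out_value in self.handler(field_name, value):
            self.router.emit(self.name, out_field, out_value)


class Router:
    """Delivers an event from (node, field) to every routed target."""
    def __init__(self):
        self.nodes: Dict[str, Node] = {}
        self.routes: Dict[Tuple[str, str], List[Tuple[str, str]]] = {}

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node
        if isinstance(node, ScriptNode):
            node.router = self

    def route(self, src: str, src_field: str, dst: str, dst_field: str) -> None:
        self.routes.setdefault((src, src_field), []).append((dst, dst_field))

    def emit(self, src: str, src_field: str, value) -> None:
        for dst, dst_field in self.routes.get((src, src_field), []):
            self.nodes[dst].receive(dst_field, value)


def toggle_colour(field_name, is_active):
    # Script logic: turn the touched object red while it is pressed.
    yield ("colour_changed", "red" if is_active else "grey")


router = Router()
router.add_node(Node("sensor"))
router.add_node(ScriptNode("script", toggle_colour))
router.add_node(Node("box", colour="grey"))
router.route("sensor", "isActive", "script", "clicked")
router.route("script", "colour_changed", "box", "colour")
router.emit("sensor", "isActive", True)   # simulated user click
```

The scene author only declares routes; the script never needs to know which sensor triggered it or which node consumes its output, which is what makes scripts re-usable.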

Local scripting

The CP browser supports the VRML2.0 standard and uses Java as its scripting language. In the usage scenario discussed above, a VRML file is downloaded to the local browser which renders its contents. The script nodes in the VRML scene point either to local Java scripts, or to Java scripts on an HTTP server. In the latter case, CP uses the associated HTML browser to subsequently download these scripts. Scripts are able to manipulate scene graph nodes by generating events that are delivered to the node and change one or more of its properties, for example, its position in the scene, its shape or one of its material attributes. Since the scripts are fully functional Java code, they are not restricted to just changing the scene graph. They can, for example, dynamically generate additional VRML nodes, or locate and add existing VRML to the base scene downloaded in the original VRML file. This may be carried out using a call to an HTTP server or by a request to another network machine. Further, they can also interact with other applications, for example mining data from a database which can subsequently be turned into VRML and added to the shared scene.
In a standalone browser, the execution mechanism of sensors, events and scripts allows animation of a local scene graph. However, to support our goal of shared interactive scenes we allow scripts to communicate events to the scene graphs managed by other browsers.

Browser-server communications

The browser communicates with other browsers using the server (see below) and a protocol called Virtual Society Communications Protocol (VSCP). VSCP has two goals: efficient communication of 3D scene transformations and open-ended support for script specific messages.
The first goal is met by ensuring that VSCP has a very compact representation of 3D transformations. For example, a full 3D rotate requires a 34 byte payload. This efficiency is crucial considering our target of dial-up connections.
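To see why payload size matters on a dial-up link, the following back-of-the-envelope calculation estimates how many 34 byte updates fit into one second of a 14.4kbps connection. The header size is an assumption (the text does not say which transport VSCP uses above IP), and link-layer and modem overheads are ignored.

```python
LINK_BITS_PER_SECOND = 14_400      # dial-up rate assumed in the text
PAYLOAD_BYTES = 34                 # full 3D rotate payload quoted above
ASSUMED_HEADER_BYTES = 28          # assumption: 20-byte IPv4 + 8-byte UDP header


def updates_per_second(payload_bytes: int, header_bytes: int = 0) -> int:
    """Whole updates that fit in one second of link capacity."""
    bits_per_update = (payload_bytes + header_bytes) * 8
    return LINK_BITS_PER_SECOND // bits_per_update


payload_only = updates_per_second(PAYLOAD_BYTES)
with_headers = updates_per_second(PAYLOAD_BYTES, ASSUMED_HEADER_BYTES)
```

Under these assumptions the link carries at most 52 payload-only updates per second, or 29 once the assumed headers are included, and this budget is shared among all remote objects whose updates the browser must receive.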
For the second goal, VSCP has an object-oriented packet definition that allows applications to extend the basic packet format with application specific messages.
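A minimal sketch of what an extensible, object-oriented packet definition might look like is given below. The wire layout (a one-byte type tag followed by a body), the class names and the type numbers are all assumptions made for illustration; they do not describe the actual VSCP format.

```python
import struct
from typing import Dict, Type


class Packet:
    """Hypothetical base packet: a one-byte type tag followed by a body."""
    type_id = 0
    registry: Dict[int, Type["Packet"]] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        Packet.registry[cls.type_id] = cls   # subclasses register themselves

    def body(self) -> bytes:
        return b""

    def encode(self) -> bytes:
        return struct.pack("!B", self.type_id) + self.body()

    @classmethod
    def from_body(cls, data: bytes) -> "Packet":
        return cls()

    @staticmethod
    def decode(data: bytes) -> "Packet":
        packet_cls = Packet.registry[data[0]]
        return packet_cls.from_body(data[1:])


class MovePacket(Packet):
    """Built-in style message: a translation as three 32-bit floats."""
    type_id = 1

    def __init__(self, x: float, y: float, z: float):
        self.xyz = (x, y, z)

    def body(self) -> bytes:
        return struct.pack("!3f", *self.xyz)

    @classmethod
    def from_body(cls, data: bytes) -> "MovePacket":
        return cls(*struct.unpack("!3f", data))


class ScriptMessage(Packet):
    """Application extension: an opaque, script-defined text message."""
    type_id = 2

    def __init__(self, text: str):
        self.text = text

    def body(self) -> bytes:
        return self.text.encode("utf-8")

    @classmethod
    def from_body(cls, data: bytes) -> "ScriptMessage":
        return cls(data.decode("utf-8"))
```

An application adds a new message kind by defining another subclass with an unused type_id; the decoding path does not change.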
This mechanism enables us to send and receive script level messages that allow the browsers to share events and so support shared interaction with the 3D scene. For example, a local user event causes a local script to run, which in turn uses the message sending facility of the CP system to deliver the event to a remote browser sharing the scene. At the remote browser, this network event is transformed into a local event which in turn causes execution of the local script. We discuss this mechanism in more detail in section 2.3.

Server

The server, known as the CP Bureau, acts as a position tracker and message forwarder. Each user's browser, as it navigates through the shared scene, sends position information to the server. The server then uses AOI (area of interest) algorithms (see below) to decide which other browsers need to be aware of these position changes. The server sends out the position to the chosen browsers, which in turn use the information to update the position of the local representative, the avatar, of the remote user. The role of the server is limited to managing state on behalf of connected users. It is generally unaware of the original scene loaded by the browser.
The second role of the server is to carry out a similar function for any script level messages that are generated by a browser as a result of user interaction. Again in a typical scenario, a user event, such as a mouse click, will cause a local script to run. This script will update the local scene graph and then post the event (or the resulting change) to the server. The server then re-distributes this message to other users in the scene so that the scene update is replicated and shared by all users. We refer to this approach to application development as simple shared scripts (SSS).

Application programming models

 
The CP system provides two models for application building: the first is known as the Simple Shared Script (SSS) model, and the second as the Application Object (AO) model. The two share some elements but are targeted at different applications and different authors.

Simple shared scripts

The SSS model is a simple mechanism designed for small shared applications in the 3D world. The model is a replicated script model with each browser downloading the same script and executing it locally. Typically these scripts would be associated with objects that are downloaded in the initial VRML file.
As discussed above, the VSCP protocol supports script message sending allowing a local script to send a message to all other browsers sharing the scene. Using this mechanism, it is possible for scene authors to develop small scale applications that share events by sending those events to other browsers via the server.
In figure 3 we can see the message flows that result from a user selection in the SSS model (left side). A user selection (1) causes a local script to run (2). This in turn converts the event into a message and sends it to the server (3). The server sends the message to all other browsers (4), which then convert the message to an event that causes execution of the local script (5).
The drawbacks of the SSS model concern ownership and persistence. Since all scripts are equal, they need to communicate among themselves to ensure that any issues such as ownership and locking are resolved. Secondly, when all users leave the scene, unless one of the scripts takes responsibility for writing out a new initial VRML file, all changes are lost.
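The five steps of the SSS flow can be sketched as follows. This is a simplified in-process model, not CP code: the Server and Browser classes, the dictionary message format and the "door" object are hypothetical, and network transport is replaced by direct method calls.

```python
from typing import Dict, List


class Server:
    """Hypothetical relay: forwards a script message to every other browser."""
    def __init__(self):
        self.browsers: List["Browser"] = []

    def connect(self, browser: "Browser") -> None:
        self.browsers.append(browser)
        browser.server = self

    def post(self, sender: "Browser", message: Dict) -> None:
        for browser in self.browsers:          # step 4: redistribute
            if browser is not sender:
                browser.on_message(message)


class Browser:
    """Each browser holds its own replica of the scene and the same script."""
    def __init__(self, name: str):
        self.name = name
        self.server = None
        self.scene = {"door": "closed"}

    def script(self, event: Dict) -> None:
        # The replicated script: identical code runs in every browser.
        self.scene[event["target"]] = event["state"]

    def on_user_select(self, target: str, state: str) -> None:
        event = {"target": target, "state": state}   # step 1: user selection
        self.script(event)                           # step 2: local script runs
        self.server.post(self, event)                # step 3: send to server

    def on_message(self, message: Dict) -> None:
        self.script(message)                         # step 5: remote script runs


server = Server()
alice, bob, carol = Browser("alice"), Browser("bob"), Browser("carol")
for b in (alice, bob, carol):
    server.connect(b)
alice.on_user_select("door", "open")
```

Note that nothing in this sketch arbitrates between two browsers that select the same object at the same moment, and the server keeps no copy of the scene; the replicas simply apply messages in the order they arrive.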
We provide a simple set of script objects to help solve these problems, but the burden still rests on the scene authors. As such, we tend to use this mechanism for simple shared applications that do not have sophisticated synchronisation or persistency requirements.
  
Figure 3: SSS versus AO scripting

Application objects

While the SSS approach is suitable for a number of simple shared scene updates, more complicated applications require a more sophisticated mechanism. To support this, CP has a notion of an application object which exists externally to the browser and the server. The application object is an application run time that allows application builders to create 3D objects and to inject them into existing shared scenes. It allows users, via local scripts, to interact with these applications. The applications use the Virtual Society Application Protocol (VSAP) to register their application objects with the server. Registration informs the server about the 3D visual representation, written in VRML, and the spatial positions of the objects. The server then informs the relevant browsers about the existence of these application objects and the VRML file to be downloaded to display them. Lastly, the server forwards application-specific messages between the AO and the browsers. Thus, an AO consists of three parts: the 3D data description that represents the application in the shared scene; the associated scripts that accept user input and communicate back to the AO; and the AO side code that implements the application logic.
The application model presented by the AO is subtly different from the SSS model described above. In particular, the AO defines a master or controller for the application, whereas in the SSS model, the scripts are essentially peer-to-peer. In addition, the AO mechanism, because it registers objects via the server, benefits from the server's use of AOI to reduce communications.
Returning to figure 3, in the AO model (right side) the user event (1) causes a message to be sent to the server (2), which in turn sends the event to the AO managing the selected object (3). The AO carries out internal processing and then typically sends back a message (4) via the server to each browser (5) that runs the local script (6). There are obviously many variations within these models. However, the major difference is that in the AO model, there is a designated owner for an object who has sole control over its update.
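The AO flow can be sketched in the same simplified style. Again this is an in-process illustration under stated assumptions: the class names, the "toggle" request and the dictionary update format are hypothetical, and VSAP registration is omitted.

```python
from typing import Dict, List


class ApplicationObject:
    """Hypothetical AO: the single owner of one shared object's state."""
    def __init__(self):
        self.state = "closed"
        self.server = None

    def on_event(self, user: str, request: str) -> None:
        # Steps 3 and 4: the owner decides, then publishes the result.
        if request == "toggle":
            self.state = "open" if self.state == "closed" else "closed"
        self.server.publish({"target": "door", "state": self.state, "by": user})


class Server:
    """Forwards user events to the AO and AO updates to the browsers."""
    def __init__(self, ao: ApplicationObject):
        self.ao = ao
        ao.server = self
        self.browsers: List["Browser"] = []

    def forward_to_ao(self, user: str, request: str) -> None:
        self.ao.on_event(user, request)

    def publish(self, update: Dict) -> None:
        for browser in self.browsers:            # step 5: to each browser
            browser.on_update(update)


class Browser:
    def __init__(self, name: str, server: Server):
        self.name = name
        self.server = server
        self.scene = {"door": "closed"}
        server.browsers.append(self)

    def on_user_select(self) -> None:
        # Steps 1 and 2: no local change; the event goes to the server.
        self.server.forward_to_ao(self.name, "toggle")

    def on_update(self, update: Dict) -> None:
        # Step 6: the local script applies the owner's decision.
        self.scene[update["target"]] = update["state"]


server = Server(ApplicationObject())
alice, bob = Browser("alice", server), Browser("bob", server)
alice.on_user_select()
bob.on_user_select()
```

Because every request passes through the single owner, concurrent selections are serialised by the AO, and the AO's copy of the state survives when all browsers disconnect.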
A key aspect of the AO model is that it allows dynamic addition of VRML data and associated scripts to an existing scene. The feature allows us to build shared worlds that evolve over time. The basic scene description is set up in a base VRML file and downloaded by browsers. Subsequently, new scene elements can be added by creating AOs to manage the new elements, and by using the server and the VSAP protocol to add the new scene element to the basic model already loaded by browsers. In a commercial environment, this allows service providers to dynamically inject an application into an existing shared scene. For example, a 3D shopping mall would consist of a basic 3D scene which is downloaded initially by the user. Subsequently, service providers can add shops into the scene by creating AOs and connecting to the server. This model allows a decoupling between server managers and service providers, thus providing an open and extensible mechanism for application provision.

Scalability

  In the previous section, we discussed the basic architecture of the CP system and the main components. Our original motivation for the system architecture listed scalability as a key requirement. The architecture above addresses some aspects of scalability in three ways:
  • Static scene data is downloaded initially as part of the VRML file and replicated at all browsers. Dynamic data can be managed using local scripts plus message passing. This reduces the burden on the server because it does not need to manage this scene data.
  • We offload some processing into the client browser using the local scripting facility. This allows us to send events, rather than state changes, and to use local scripts to handle the events. This enables such techniques as dead reckoning[10].
  • Sophisticated applications can be managed by external processes and can use the local script to manage local updates in individual browsers. Again, this approach reduces the role of the server to a message forwarder and the management of the application data is split between the AOs and the browsers.
Although these mechanisms do allow some degree of scaling by reducing the communications between browsers (via the server), they are not sufficient to support our goal of many hundreds of users interacting in a shared space. To achieve such scalability, it is necessary for us to find a way to limit the number of messages needed between browsers to support the shared scene.
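The second item in the list above mentions dead reckoning. A minimal sketch of the idea follows, assuming linear motion and a hypothetical MotionEvent record; neither is taken from the CP implementation. The receiver extrapolates a remote avatar's position from the last event, and the sender transmits a new event only when that prediction has drifted too far from the truth.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class MotionEvent:
    """Hypothetical event: where an avatar was, and how fast it was moving."""
    position: Vec3
    velocity: Vec3      # units per second
    timestamp: float    # seconds


def extrapolate(event: MotionEvent, now: float) -> Vec3:
    """Predict the current position from the last event (linear motion)."""
    dt = now - event.timestamp
    return tuple(p + v * dt for p, v in zip(event.position, event.velocity))


def needs_update(last_sent: MotionEvent, true_position: Vec3,
                 now: float, threshold: float) -> bool:
    """Sender side: send a new event only when the prediction has drifted."""
    predicted = extrapolate(last_sent, now)
    error = sum((a - b) ** 2 for a, b in zip(true_position, predicted)) ** 0.5
    return error > threshold


last = MotionEvent(position=(0.0, 0.0, 0.0), velocity=(1.0, 0.0, 0.0), timestamp=0.0)
```

An avatar walking in a straight line therefore generates one event when it starts and further events only when it turns or stops, rather than one message per frame.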

Consistency

The fundamental model presented by a distributed virtual environment (VE) platform is one of a shared 3D space. Such a space, because it is shared, must be seen 'consistently' by all users of that space. A system can provide different levels of consistency, ranging from a strict interpretation to best effort[13].
In a strict interpretation, any actions that occur in the shared space must be propagated to all participants in that space, and conflicts between user actions are either avoided, or resolved. Furthermore, actions in the space maintain their causal relationship so that a user can make sense of a 'happened before' and 'happens after' relationship. Obviously, maintaining such consistency in a system where there are many participants is a complicated task and one that requires significant exchange of information between the copies. The choice of algorithm is crucial to the amount of message passing needed to reach consistency. Any distributed consistency algorithm has two major concerns:
  • Membership: The membership of the consistency group, i.e. who is taking part in the consistency algorithm, is crucial to performance. Any mechanism that reduces the number of participants in the consistency group directly reduces the number of messages that must be exchanged.
  • Consistency guarantee: Once membership has been decided, the next issue is what model of consistency is used by the consistency algorithms. There has been much work in the research community addressing the issue of distributed consistency in more traditional data applications with a goal of reducing the cost of the algorithms. This work has concentrated on relaxing the degree of consistency either in a temporal domain[21] [22], or in a data value domain[12].
To attack these issues, we rely on a facet of our application domain, 3D space, and exploit a spatial area of interest (AOI) model to reduce the participants in any consistency decision. We then use adaptive techniques and a range of consistency algorithms to reduce the message traffic. In the following section we outline the AOI model we have adopted.
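The effect of group membership on message traffic can be made concrete with a small calculation. The figures are illustrative assumptions: 700 users, each sending one update per round, and an AOI group of 10 members per user.

```python
def deliveries_per_round(users: int, group_size: int) -> int:
    """Messages delivered when every user sends one update to its group peers."""
    return users * (group_size - 1)


USERS = 700                 # illustrative population
ASSUMED_GROUP_SIZE = 10     # assumption: each user shares state with 9 neighbours

all_to_all = deliveries_per_round(USERS, USERS)            # one global group
with_aoi = deliveries_per_round(USERS, ASSUMED_GROUP_SIZE)  # small spatial groups
reduction = all_to_all / with_aoi
```

With one global group the system must deliver 489,300 messages per round; with groups of ten it must deliver 6,300. The global cost grows quadratically with the number of users, whereas the grouped cost grows linearly as long as group size stays bounded.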

Spatial areas of interest

  In previous experiments [9], we have observed that participants form sub-groups where activities occur in clusters or peer-to-peer within the global session. This mimics the way we use the spatial model in the real world. The observation can be exploited to decrease overall message passing if one can deliver packets only to the recipients they are intended for, i.e. those within the sub-group. In this way, the amount of global traffic is limited, and the number of incoming messages to each user is reduced.
Using the three dimensions of space is a well-known approach to partition VEs into several more or less disjoint AOIs. Static geographical regions are used in applications based on natural terrains, such as in DIS based systems[16].
A different approach uses intersecting volumes to model interaction between participants. This notion of a spatial area of interest associated with a user has evolved out of work in the COMIC project [3]. The spatial area, known as an aura, determines a boundary; objects or users outside the boundary cannot be influenced or interacted with. In contrast, all objects within the boundary are candidates for influence or interaction. The COMIC model goes further by defining two notions, focus and nimbus, to represent the degree of interest users have in each other. The focus represents the degree of interest one user brings to bear on another. The nimbus represents the degree of attention one user pays to another. The combination of the focus and nimbus of two interacting users defines their level or degree of interaction.
It is this model that we seek to use to drive our consistency mechanism and to reduce the number of participants in any consistency algorithm.
To achieve this, the server is structured as shown in figure 4. An aura manager is responsible for tracking the spatial location of any user (or AO object) and for determining if two users' auras have collided. If they have, the aura manager causes those two objects to join a consistency group, which is defined as a set of objects that have shared data which must be kept consistent. For example, in figure 4, user 1 and user 2 are in each other's aura, but user 1 is not in user 3's aura. Thus, any updates to user 3, e.g. a position update, will be sent to user 2 but not user 1. The actual replicas are denoted by proxies, i.e. local representatives of the remote object. In the case where the objects are all local to one server, these proxies are generally pointers to the master object.
  
Figure 4: Auras and groups

In essence, the aura manager is responsible for defining groups of spatially co-located objects who need to maintain a degree of consistency. As it decreases the degree of sharing, this mechanism is used to reduce the amount of information that has to be sent out from the server as a result of any state changes.
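The aura manager's role can be sketched as follows, assuming spherical auras and treating a collision as an overlap of two spheres. The class and method names are hypothetical and the brute-force pairwise test is chosen for clarity, not efficiency.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Vec3 = Tuple[float, float, float]


class AuraManager:
    """Hypothetical aura manager using spherical auras."""
    def __init__(self):
        self.positions: Dict[str, Vec3] = {}
        self.radii: Dict[str, float] = {}

    def track(self, obj: str, position: Vec3, radius: float) -> None:
        self.positions[obj] = position
        self.radii[obj] = radius

    def collide(self, a: str, b: str) -> bool:
        """Two auras collide when the spheres overlap."""
        dist_sq = sum((p - q) ** 2 for p, q in zip(self.positions[a], self.positions[b]))
        return dist_sq <= (self.radii[a] + self.radii[b]) ** 2

    def groups(self) -> Dict[str, Set[str]]:
        """For each object, the peers that must receive its updates."""
        result: Dict[str, Set[str]] = {obj: set() for obj in self.positions}
        for a, b in combinations(self.positions, 2):
            if self.collide(a, b):
                result[a].add(b)
                result[b].add(a)
        return result

    def recipients(self, obj: str) -> Set[str]:
        return self.groups()[obj]


am = AuraManager()
am.track("user1", (0.0, 0.0, 0.0), radius=3.0)
am.track("user2", (5.0, 0.0, 0.0), radius=3.0)
am.track("user3", (10.0, 0.0, 0.0), radius=3.0)
```

With these positions, an update to user3 is forwarded to user2 only, while an update to user2 reaches both user1 and user3, matching the situation described for figure 4.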

Distributed architecture

  However, as the number of connected users and applications increases, the server eventually becomes a bottleneck as it manages state on behalf of these users, and forwards messages between users, and between applications and users. To deal with this issue, we are forced to replicate the server and deal with replicated state between servers. By replicating the server, we are able to spread the processing and communication load between several servers and so scale the entire system.
The model described above deals gracefully with server replication. Our proxy objects become real remote proxies and are responsible for updating their replicas. As before, the aura manager tracks objects and informs them of any aura collisions. The replica joins the communication group associated with the remote object and runs the consistency algorithm defined for that object. However, whereas before the group was a structuring technique within a single server, in the distributed server the group maps to a multicast communication group.
The basic architecture of the distributed version of the server is shown in figure 5. The aura manager (not shown) now becomes a replicated entity which we organise in a hierarchy. Each server has an associated aura manager which is responsible for periodically sending updates to a master aura manager (i.e. the next AM in the hierarchy) who then calculates aura collisions between objects managed by different servers. In the case of a collision, the master AM informs the server AMs, who then add the colliding object to the respective consistency groups.
  
Figure 5: Distributed architecture

In the distributed server case, the spatial model is used exactly as in the single server case. It partitions the database into groups of spatially co-located objects who manage their consistency using a group communication model. This allows us to reduce the amount of data that must be replicated at each server to that which is required for the groups associated with users actually connected to that server.
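The two-level aura manager hierarchy can be sketched as below. This is an assumption-laden illustration: every object has the same fixed spherical aura, reports are exchanged by direct method calls rather than over the network, and the names ServerAM and MasterAM are ours.

```python
from typing import Dict, List, Set, Tuple

Vec3 = Tuple[float, float, float]
AURA_RADIUS = 3.0   # assumed fixed spherical aura for every object


class ServerAM:
    """Hypothetical per-server aura manager."""
    def __init__(self, name: str):
        self.name = name
        self.local: Dict[str, Vec3] = {}                # objects mastered here
        self.remote_members: Dict[str, Set[str]] = {}   # local obj -> remote peers

    def report(self) -> Dict[str, Vec3]:
        """Periodic update sent up to the master aura manager."""
        return dict(self.local)

    def add_to_group(self, local_obj: str, remote_obj: str) -> None:
        self.remote_members.setdefault(local_obj, set()).add(remote_obj)


class MasterAM:
    """Detects collisions between objects managed by different servers."""
    def __init__(self, servers: List[ServerAM]):
        self.servers = servers

    def check(self) -> None:
        reports = [(s, s.report()) for s in self.servers]
        for i, (sa, objs_a) in enumerate(reports):
            for sb, objs_b in reports[i + 1:]:
                for a, pa in objs_a.items():
                    for b, pb in objs_b.items():
                        dist_sq = sum((p - q) ** 2 for p, q in zip(pa, pb))
                        if dist_sq <= (2 * AURA_RADIUS) ** 2:
                            sa.add_to_group(a, b)   # inform both server AMs
                            sb.add_to_group(b, a)


server_a, server_b = ServerAM("server_a"), ServerAM("server_b")
server_a.local = {"user1": (0.0, 0.0, 0.0), "user2": (50.0, 0.0, 0.0)}
server_b.local = {"user3": (4.0, 0.0, 0.0)}
MasterAM([server_a, server_b]).check()
```

Only user1 and user3 end up in a cross-server group, so server_b needs a replica of user1 but never learns about user2.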
Our communication mechanism is based on multicast which is used between servers to support the consistency algorithms needed.
Multicast communication allows a single message send to be delivered to a group of receivers. In hardware supported multicast environments, e.g. Ethernet, this allows for very efficient messaging. In the Internet, an experimental multicast layer built above IP is used. This system, known as the MBone, implements a virtual multicast network over the inherently point-to-point mechanism of the Internet. The technique used is based on message encapsulation and tunneling.
We have built a lightweight group layer above our multicast communication package [26] which implements low-cost groups by multiplexing lightweight groups over IP multicast groups. This layer is in turn used by our proxies as the basic communication interface. Each object and its proxies are associated with a lightweight group. The semantics of the message sending are provided by the group interface and allow a proxy group to implement a range of consistency policies.
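The idea of multiplexing many lightweight groups over a small number of IP multicast groups can be sketched as follows. This is not the package cited above: the hash-based mapping, the two-address pool and the FakeNetwork stand-in (which replaces real multicast sockets) are assumptions made so that the example runs without a network.

```python
import zlib
from typing import Callable, Dict, List, Tuple

MULTICAST_ADDRESSES = ["239.0.0.1", "239.0.0.2"]   # assumed, illustrative pool


def address_for(group_id: str) -> str:
    """Map a lightweight group onto one of the few IP multicast groups."""
    index = zlib.crc32(group_id.encode("utf-8")) % len(MULTICAST_ADDRESSES)
    return MULTICAST_ADDRESSES[index]


class FakeNetwork:
    """Stand-in for IP multicast: delivers to every subscriber of an address."""
    def __init__(self):
        self.subscribers: Dict[str, List["GroupLayer"]] = {}

    def join(self, address: str, layer: "GroupLayer") -> None:
        members = self.subscribers.setdefault(address, [])
        if layer not in members:
            members.append(layer)

    def send(self, address: str, frame: Tuple[str, str]) -> None:
        # Like multicast loopback, the sender also receives its own frame.
        for layer in self.subscribers.get(address, []):
            layer.on_frame(frame)


class GroupLayer:
    """Per-server layer that multiplexes lightweight groups over the network."""
    def __init__(self, network: FakeNetwork):
        self.network = network
        self.handlers: Dict[str, Callable[[str], None]] = {}

    def join(self, group_id: str, handler: Callable[[str], None]) -> None:
        self.handlers[group_id] = handler
        self.network.join(address_for(group_id), self)

    def send(self, group_id: str, payload: str) -> None:
        # The frame carries the lightweight group id so receivers can demultiplex.
        self.network.send(address_for(group_id), (group_id, payload))

    def on_frame(self, frame: Tuple[str, str]) -> None:
        group_id, payload = frame
        handler = self.handlers.get(group_id)
        if handler is not None:          # drop traffic for groups not joined here
            handler(payload)
```

Creating or leaving a lightweight group then costs only a table update at each server, at the price of occasionally receiving and discarding frames for groups that happen to share the same multicast address.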
Further details of the aura model and its use of the group communication model can be found in [8] which reports on joint work between our group and the Dive group at SICS.
Our initial investigations into using this multicast group mechanism to support weak and adaptive consistency models are based on previous work with Apertos[6], and are reported in more detail in [7].

Latency

The second issue when addressing scalability is latency. Again, since we are targeting the Internet we need to deal with the high latency between geographically remote parts of the network. Obviously, replicating the server provides a framework to solve this issue. By maintaining replicated state in a set of geographically remote servers, we can reduce access time for a browser to some server state by using cached state at a local server.

Current status and experiences

  The CP system has been in public beta release since Dec'95. The freely downloadable browser and server run on Win'95/NT. Larger servers run on various flavours of UNIX. The system is currently being productised.
We have tested the larger servers at two public sites, one in Japan and one in the USA. Both sites allow users anywhere on the Internet to connect to and share a world. We have made available a set of sample shared worlds, all hosted by these servers and concentrating on social or entertainment spaces.
The servers currently in use implement all of the mechanisms discussed in this paper except for the distribution based on multicast. The multicast server is under development and will be available later this year. The existing server has been tested with up to 700 simulated clients (all running remotely), where each client implements a typical navigation pattern. In actual usage we have never seen more than 40 clients in any of the public worlds at one time.

Experiences

In this section we describe representative examples of shared worlds that have been built, and discuss any observations we have made and feedback we have received.

The basic user interface

Figure 6 shows the basic user interface and interaction model. The browser presents a window on a 3D scene, in this case a model of a circus park. User navigation is effected through one of two methods: either through use of a set of movement buttons at the base of the screen, or through use of the pointing device. In general usage, CP implements a terrain following mode to allow users to climb stairs etc. However, the user can override this and travel both up and down if desired. In addition to the basic 3 degrees of freedom, CP offers the ability to look up and down to allow users to examine space above and below the eye level (or camera position). Two other navigation modes allow users to select a distant object and automatically navigate to that object. Users can also choose a 'fly' mode which moves the user to a location high above the ground plane allowing them to gain an overview of the entire scene.
In the screen shot, the user has selected a textured billboard in the scene which is linked to a 2D HTML page. This page is subsequently fetched and loaded into the HTML browser also seen in the screen shot. As discussed above, the link within the 3D world could equally have been a 3D model which would have been loaded into the 3D browser and either added to the scene or used to replace the current scene. In the latter case, CP caches previous 3D scenes, allowing users to move back and forward between them, using the buttons at the top of the right side of the browser window.
  
Figure 6: VRML driving HTML

User interaction

Figure 7 shows three users (the third is the camera position), again in a circus park, holding a conversation. Interaction consists of text chat, augmented with body actions. The text chat facility allows the usual text messages. These are displayed in the associated window and as text balloons above the originating user's head. The action panel (top of the screen shot) provides a set of scene defined actions that can be used to add emotion or stress to a conversational point, or to replace text. In the screen shot, the female avatar is waving, which is the result of the 'hello' button being selected; and the male is expressing some degree of surprise, having used the 'Wao!' button.
  
Figure 7: Users interacting

We chose to make the actions a scene-dependent feature, and not part of the browser's UI, so that individual scene authors can tailor the actions to the semantics of the scene. The action panel and associated actions are all written in the scripting language and loaded when the VRML file is first loaded.
As described in [19], the limited peripheral vision afforded by using a standard display leads to users having difficulty in their perception of their immediate surroundings. This is particularly true when a conversational group exceeds 3 or 4 people. To ease this problem, we have provided a simple 'glance' feature which allows users to turn their 'heads' 36, 180 or 360 degrees to quickly ascertain the spatial position of other objects in their immediate locale.
We have found that these simple, customisable actions are frequently utilised by users and are considered essential. In addition, the ability to choose among avatar representatives and to personalise the avatar is also key in giving users a sense of identity. It is interesting that in 6 months of usage, our informal observations have shown that only first-time users retain the default avatar settings. Although we have not carried out any rigorous investigation, our observations suggest that the majority of users adopt an identity, and then maintain that identity for subsequent visits to the shared locales. This enables users to easily recognise one another.
A last feature of note is the 'active' button which allows the user to select an active or inactive mode. In inactive mode, the user is still present in the scene, and monitors text and scene updates, but is not able to participate. Use of this mode generally indicates either that a user is not willing to take part in activity, or that they are no longer present at the computer and so unable to take part. The manifestation of this action in the shared world is to place such users in a sitting posture, indicating their attention state to others.

Shared interactions

In figure 8 we can see two browsers showing the same scene from two different camera positions. There are also several other users watching the same demo. In the center is a sea lion whose behaviour is to flip the ball up on his nose, juggle it and then put it down again. This behaviour is activated by user selection. It is also shared, so the user selection causes the select event to be passed to the local script and also, via the server, to all other browsers within the aura of the sea-lion. In the two browsers in the screen shot, it can be seen that the action is happening simultaneously in both browsers.
  
Figure 8: Shared behaviour

In the example shown, the degree of consistency is actually weak. The two browsers and the server co-operate to ensure that both worlds see the change. However, there is no strict synchronisation. This approach works well for the majority of shared behaviours we have built and where actions are self-contained sequences.
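The event path for such a shared behaviour can be sketched as follows. This is a simplified model, not the CP protocol: the aura is reduced to a sphere around the object, browsers are plain in-process objects, and the names `Browser`, `Server` and `select` are hypothetical. It shows the weak consistency described above: each browser runs the behaviour when the event reaches it, with no synchronisation step.

```python
import math


class Browser:
    def __init__(self, name, position):
        self.name = name
        self.position = position
        self.events = []

    def on_event(self, obj_id, event):
        # The behaviour starts whenever the event arrives; there is no global clock.
        self.events.append((obj_id, event))


class Server:
    def __init__(self):
        self.browsers = []
        self.objects = {}   # object id -> (position, aura radius), radius is assumed

    def forward(self, sender, obj_id, event):
        # Pass the event to every other browser inside the object's aura.
        position, radius = self.objects[obj_id]
        notified = []
        for browser in self.browsers:
            if browser is sender:
                continue
            if math.dist(browser.position, position) <= radius:
                browser.on_event(obj_id, event)
                notified.append(browser.name)
        return notified


def select(browser, server, obj_id):
    browser.on_event(obj_id, "select")            # local script first
    return server.forward(browser, obj_id, "select")


server = Server()
server.objects["sea-lion"] = ((0.0, 0.0, 0.0), 10.0)
a = Browser("a", (1.0, 0.0, 2.0))
b = Browser("b", (3.0, 0.0, -4.0))
c = Browser("c", (50.0, 0.0, 0.0))     # outside the aura, so never notified
server.browsers = [a, b, c]
notified = select(a, server, "sea-lion")
```

Because the sea lion's juggling is a self-contained sequence, it does not matter that the browsers start it at slightly different moments; a behaviour whose outcome depended on ordering between users would need a stronger policy.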

Collaborative working

Figure 9 shows a simple example of a collaborative work environment, in this case a simple conference area with a shared whiteboard. In initial experiments with this shared room, we used a 3D whiteboard application whose operation was confined to the 3D world. Users found this device clumsy and hard to use. We subsequently used a 2D version of the shared whiteboard, where selection of a whiteboard object in the 3D scene caused a window to be displayed on the desktop, and user input went into this desktop window.
  
Figure 9: Shared whiteboard

This use of 2D for some tasks initially conflicted with our original goal to use the 3D world as the base context for all activities. On reflection we believe that there are many cases where information presented outside the 3D world is actually preferable. The whiteboard is an example of this, as is the use of an HTML browser. As such, we have engineered the environment to support communication between the 3D world and external applications to allow 3D objects to make use of existing tools such as collaborative editing suites or conference systems. However, it remains an interesting experiment to determine whether a combination of 2D and 3D is preferable, or whether our initial failures with some types of information are caused by the inflexibility of the system.

Commerce applications

In figure 10 we see a simple example of a commerce application, in this case a music shop. The user is free to browse the CDs in the shop and to select and listen to an associated audio clip. In addition, information about the CD, including artist and price, is displayed both in the 3D scene and in the associated HTML browser. Subsequent purchase of the CD would be carried out using the secure communication facilities of the associated HTML browser.
  
Figure 10: CD shop

The CD shop was one of the few example worlds we have built where spatial sound has been explored to any extent. In most of our worlds sound has been used simply as background. In the CD shop we used spatial sound as a way to augment visual data. Each section in the shop had a proximity sensor which registered the users' location. As they moved from one area to another, a representative sound track was played to indicate the type of music found in that area. This effect was surprisingly useful, resulting in the majority of users navigating between different parts of the shop based primarily on sound. We hope to further investigate the use of sound when the more sophisticated sound model in VRML2.0 is fully implemented.
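The per-section sound cue can be sketched as a box test per section, modelled loosely on the `center` and `size` fields of the VRML 2.0 ProximitySensor node. The section names, track names and the `SoundGuide` class are invented for illustration and do not describe the actual CD shop scripts.

```python
class ProximitySensor:
    """Axis-aligned box around one section of the shop."""

    def __init__(self, section, center, size, track):
        self.section = section
        self.center = center
        self.size = size
        self.track = track      # hypothetical name of the representative sound track

    def contains(self, position):
        return all(
            abs(p - c) <= s / 2
            for p, c, s in zip(position, self.center, self.size)
        )


class SoundGuide:
    def __init__(self, sensors):
        self.sensors = sensors
        self.current = None     # section the user is currently in, if any

    def update(self, position):
        """Return the track to start, or None if the section has not changed."""
        for sensor in self.sensors:
            if sensor.contains(position):
                if sensor.section != self.current:
                    self.current = sensor.section
                    return sensor.track
                return None
        self.current = None
        return None


guide = SoundGuide([
    ProximitySensor("jazz", (0, 0, 0), (4, 3, 4), "jazz_sample"),
    ProximitySensor("rock", (6, 0, 0), (4, 3, 4), "rock_sample"),
])
first = guide.update((0, 1, 0))      # user enters the jazz section
second = guide.update((1, 1, 1))     # still in the jazz section
third = guide.update((6, 0, 1))      # user crosses into the rock section
```

A track is triggered only on a change of section, so a user moving within one area hears the sample once rather than on every position update.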

Related work

There is considerable research activity in the area of large scale distributed environments, including projects focusing on collaboration [9], [19], [11], [14], [15] and projects focusing on simulation [20], [17]. In most of this work, the emphasis has been on workstation level devices and high bandwidth communications.
As discussed in the text, the Dive system was the original testbed for most of the work on the spatial model we used. MASSIVE inherited this spatial model and carried out a fuller implementation. Our implementation of the spatial model, particularly the use of the aura collision manager, is based on both Dive and MASSIVE. However, the main use of the aura model in the CP system is to reduce communications to enable scaling, and our use of multicast communications supports this. Our work also differs in the target. Both MASSIVE and, to a lesser extent, Dive have concentrated on collaboration and conferencing and have assumed professional level computer and communication facilities. Our main goal has been large-scale social worlds using low-cost consumer equipment. As such, the eventual architecture we adopted, a hybrid client-server/peer-to-peer model, differs from the more 'pure' approaches of Dive and MASSIVE.
In terms of social shared spaces targeted at the consumer market, there is already a legacy with systems such as Habitat [23] and Worlds Away - a Compuserve service based on Habitat, which, although not full 3D spaces, offer some degree of spatial metaphor. More sophisticated shared spaces have been built by Worlds Inc., including the Worlds Chat and the Alpha world. However, these projects, although using the Internet, have relied on proprietary graphics and browsers.
Recent work targeting the WWW and using full 3D shared spaces has mainly been confined to the VRML community. Within that community there are several projects of note. The CyberGate system from Blacksun Inc. has built and experimented with shared 3D spaces similar to the CP project. However, CyberGate is based on VRML1.0 and so supports only static scenes. Moondo is a similar system to CyberGate in that it currently supports only static scenes. However, Moondo has experimented with a shared object model as a basis for shared consistent objects. In addition, Moondo has added support for audio chat.
The Pueblo project from Chaco Communications has evolved out of earlier work on social MUDs. Recently it has augmented the MUD server with VRML support and provided a VRML1.0 browser that allows MUD authors to build 3D scenes. This approach allows world builders access to the rich mechanism of the MUD database, but again only supports static scenes.
A proposed mechanism for shared collaborative VRML worlds, with the same goals as CP, has been made in [5] by GMD. This proposal also tries to tackle the difficult issue of consistency in distributed VEs. The proposal attempts to design a pure peer-to-peer model based on browsers multicasting modifications, or lock information. However, like the original Dive system, the proposal suffers from the false sharing problem. Since the granularity of sharing is at the 'world' level, all messages are sent to all participants. This leads to a client being forced to receive all information, irrespective of their interest in that information. In contrast, in CP we adopt a hybrid architecture, where a server is used to partition the world data according to a spatial model. CP then uses different multicast groups to support the data partitions and sends information using unicasts to individual browsers.
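The cost of false sharing can be made concrete with a small counting sketch. It assumes that each participant issues one update and compares the number of deliveries when sharing is at the 'world' level with the number when updates are confined to a spatial partition; the region names and user ids are invented for the example.

```python
from collections import Counter


def deliveries_world_level(regions):
    """Every update is received by every other participant in the world."""
    n = len(regions)
    return n * (n - 1)


def deliveries_partitioned(regions):
    """Every update is received only by the other participants in the same region."""
    sizes = Counter(regions.values())
    return sum(k * (k - 1) for k in sizes.values())


# Hypothetical world: six users spread over three spatial partitions.
regions = {
    "u1": "tent", "u2": "tent",
    "u3": "gate", "u4": "gate", "u5": "gate",
    "u6": "stage",
}
world = deliveries_world_level(regions)          # 6 * 5 = 30
partitioned = deliveries_partitioned(regions)    # 2*1 + 3*2 + 1*0 = 8
```

World-level sharing grows quadratically with the total population, whereas the partitioned count depends only on how many users occupy each region.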
An interesting experiment in 3D spaces that supports audio chat is the Traveler system from OnLive! Again, Traveler is VRML1.0 based, and so static, but the primary focus of the group has been on audio interaction. To support this, they have built a low-bandwidth audio codec suitable for the Internet. In addition, they have augmented user avatars with facial motion, and in particular lip synch, that is driven from the audio stream. This technique offers a computationally cheap, but rich and compelling interaction mechanism.

Current and future directions

  We are currently working in three broad areas. Firstly, we are extending the media support within the shared scenes. We have therefore experimented with audio chat facilities. However, the low-bandwidth links to home PCs severely curtail the fidelity of the audio stream. We are also designing streaming mechanisms for various media to allow us to stream audio and video from AOs, via the server, to the browser. Again the principal constraint is bandwidth.
Our second area of interest is augmenting the application mechanism with application libraries that allow simpler creation of complicated applications. To date, authoring consists of using a 3D modeler to build the basic components. An authoring tool, called CP conductor, is then used to assemble objects into the scene, and to subsequently associate behaviours with those objects. The authoring tool provides a set of pre-defined scripts that can be dragged and dropped onto objects in the scene, allowing easy development of simple scenes. However, more complicated behaviours have to be written by the scene author. We aim to provide a set of more sophisticated objects and associated behaviours, and to enable users to move these objects between independent scenes. A particular area of concern is inter-object interactions. This basic facility will allow a far richer space as it will enable users to claim ownership of objects.
Lastly, we are continuing our work on scaling issues in order to support larger numbers of users and more complicated scenes. As discussed in the text, part of this work is concentrating on the issues of consistency in large scale VEs, where we are exploring adaptive techniques to deal with the wide area communication problems.

Conclusion

  One of the major reasons for the success of the WWW is that it has enabled unsophisticated users to participate, both as consumers and, more importantly, producers of information.
However, the WWW remains an essentially 'lonely place'. Although many users may be simultaneously viewing the same information, there is no support to allow them to interact, or even be aware of others.
Our goal has been to enable interactions, so that the WWW moves from being an information space to being a social space. To do that, we have chosen to use the 3D spatial metaphor to build 3D spaces that mimic real world spaces and provide a virtual place for interaction.
Although this is a necessary first step, it is not, we believe, sufficient to cause interaction to take place. It is our belief that this type of large scale social interaction will only happen if users can create spaces to reflect their requirements. As such, the most important goal of the CP project has been to provide an infrastructure that allows easy creation of such spaces, within a familiar framework, the WWW.
While the main focus of our work has been on social spaces rather than on spaces that support more traditional CSCW tasks, we have built simple examples of worlds where collaboration is possible and well used. It is our belief that by providing a platform that is sufficiently rich to support collaboration, but sufficiently open and accessible to allow anybody to author spaces, we will enable far greater use of the Internet for collaboration.

Acknowledgements

We are indebted to our colleagues in the Sony Computer Science Lab. and Sony's Architecture Lab. for their help in the definition of this project. We also wish to thank our colleagues at the Swedish Institute of Computer Science who have contributed indirectly to the CP design as part of our joint research project, Wide area virtual environments (WAVE). Lastly, our thanks, as always, go to Mario Tokoro and Toshi Doi for their continuing support.