SLAM + Machine Learning ushers in the Age of Perception

Dr Pablo Alcantarilla 21 April 2020

Blog

The recent crisis has increased focus on autonomous robots being used for practical benefit. We’ve seen robots cleaning hospitals, delivering food and medicines and even assessing patients. These are all amazing use cases, and clearly illustrate the ways in which robots will play a greater role in our lives from now on.

However, for all their benefits, currently the ability for a robot to autonomously map its surroundings and successfully locate itself is still quite limited. Robots are getting better at doing specific things in planned, consistent environments; but dynamic, untrained situations remain a challenge.

What excites me is the next generation of SLAM that will allow robot designers to create robots much more capable of autonomous operation in a broad range of scenarios. It is already under development and attracting investment and interest across the industry.

We are calling it the “Age of Perception,” and it combines recent advances in machine and deep learning to enhance SLAM. Increasing the richness of maps with semantic scene understanding improves localization, mapping quality and robustness.

‍

Simplifying maps

Currently, most SLAM solutions take raw data from sensors and use probabilistic algorithms to calculate the location and a map of the surroundings of the robot. LIDAR is most commonly used but increasingly lower-cost cameras are providing rich data streams for enhanced maps. Whatever sensors are used the data creates maps made up of millions of 3-dimensional reference points. These allow the robot to calculate its location.

The problem is that these clouds of 3D points have no meaning – they are just a spatial reference for the robot to calculate its position. Constantly processing all of these millions of points is also a heavy load on the robot’s processors and memory. By inserting machine learning into the processing ‘pipeline’ we can both improve the utility of these maps and simplify them.

Panoptic Segmentation techniques use machine learning to categorize collections of pixels from camera feeds into recognizable ‘objects.’ For example, the millions of pixels representing a wall can be categorized as a single object. In addition, we can use machine learning to predict the geometry and the shape of these pixels in the 3D world. So, millions of 3D points representing a wall can be all summarized into a single plane. Millions of 3D points representing a chair can be all summarized into a shape model with a small number of parameters. Breaking scenes down into distinct objects into 2D and 3D lowers the overhead on processors and memory.

Adding understanding

As well as simplification of maps, this approach provides the foundation of greater understanding of the scenes the robot’s sensors capture. With machine learning we are able to categorize individual objects within the scene and then write code that determines how they should be handled.

The first goal of this emerging capability is to be able to remove moving objects, including people, from maps. In order to navigate effectively, robots need to reference static elements of a scene; things that will not move, and so can be used as a reliable locating point. Machine learning can be used to teach autonomous robots which elements of a scene to use for location, and which to disregard as parts of the map or classify them as obstacles to avoid. Combining the panoptic segmentation of objects in a scene with underlying map and location data will soon deliver massive increases in accuracy and capability of robotic SLAM.

Perceiving objects

The next exciting step will be to build on this categorization to add a level of understanding of individual objects. Machine learning, working as part of the SLAM system, will allow a robot to learn to distinguish the walls and floors of a room from the furniture and other objects within it. Storing these elements as individual objects means that adding or removing a chair will not necessitate the complete redrawing of the map.

This combination of benefits is the key to massive advances in the capability of autonomous robots. Robots do not generalize well in untrained situations; changes, particularly rapid movement, disrupt maps and add significant computational load. Machine learning creates a layer of abstraction that improves the stability of maps. The greater efficiency it allows in processing data creates the overhead to add more sensors and more data that can increase the granularity and information that can be included in maps.

Natural interaction

Linking location, mapping and perception will allow robots to understand more about their surroundings and operate in more useful ways. For example, a robot that can perceive the difference between a hall and a kitchen can undertake more complex sets of instructions. Being able to identify and categorize objects such as chairs, desks, cabinets etc will improve this still further. Instructing a robot to go to a specific room to get a specific thing will become much simpler.

The real revolution in robotics will come when robots start interacting more with people in more natural ways. Robots that learn from multiple situations and combine that knowledge into a model that allows them to take on new, un-trained tasks based on maps and objects preserved in memory. Creating those models and abstraction demands complete integration of all three layers of SLAM. Thanks to the efforts of the teams at SLAMcore who are leading the industry in these areas, I believe that the Age of Perception is just around the corner.