[Robotics Trends 2023] Evaluating Motion Planning Performance — Part 1

My name is Yuri Rocha, and I am a Robotics and Machine Learning Research Engineer at MakinaRocks. I am currently working in the OLP team, which aims to automate the robot Offline Programming (OLP) process in the automotive industry. The OLP process consists of distributing approximately three thousand welding spots between a few hundred robots, then generating an efficient trajectory for each robot to reach each assigned goal within its cycle time. They should also avoid collisions with the environment and the other robots. Task and motion planning are paramount for our project, and evaluating the performance of different algorithms is an integral part of our work.
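To give a rough sense of the scale of this assignment problem, here is a purely illustrative sketch (not our actual pipeline; every number and the distance-based cost model are made up) that greedily assigns welding spots to the nearest robot that still has cycle-time budget left.

```python
# Toy illustration only: greedy assignment of welding spots to robots under a
# cycle-time budget. Numbers and the distance-based cost model are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
spots = rng.uniform(0, 50, size=(3000, 3))        # ~3,000 welding spots (x, y, z) [m]
robot_bases = rng.uniform(0, 50, size=(200, 3))   # a few hundred robot base positions [m]
budget = np.full(len(robot_bases), 60.0)          # remaining cycle time per robot [s]
COST_PER_METER = 0.5                              # assumed time to reach a spot [s/m]

assignment = {}
for i, spot in enumerate(spots):
    dists = np.linalg.norm(robot_bases - spot, axis=1)
    for r in np.argsort(dists):                   # try robots from nearest to farthest
        cost = COST_PER_METER * dists[r]
        if budget[r] >= cost:                     # assign only if cycle time allows
            assignment[i] = int(r)
            budget[r] -= cost
            break

print(f"assigned {len(assignment)}/{len(spots)} spots")
```

In the real OLP problem, reachability, collisions with the environment and the other robots, and the actual trajectory time replace this simple distance-based cost.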

IROS 2022 name tag

Last October, the MakinaRocks robotics team and I had the opportunity to attend IROS 2022 in Kyoto, Japan. Among other things, our main goals were:

  • Obtain a fresh view of the robotics field, its current breakthroughs, and its pain points.
  • Survey how the latest algorithms can be used in our robotics projects.
  • Research how Machine Learning is being leveraged in robotics.

On the first day of the conference, I attended the workshop named Evaluating Motion Planning Performance: Metrics, Tools, Datasets, and Experimental Design, organized by a team with members from Rice University and the Australian National University. The workshop focused on two main topics: reproducible experimental design and informative evaluation metrics. There were talks about how to evaluate a motion planner, how to design experiments, which metrics to use, how to evaluate human-robot interaction, etc. They also used the opportunity to share open-source datasets and benchmarks.

I already had high expectations for this workshop, and it surpassed them. Moreover, from what I heard, it was one of the workshops that drew the most interest from attendees (you can see in the picture below that the room was packed, and at times there were even more people!).

I am somewhere in this picture. Source: https://twitter.com/wilthomason
The workshop room with no seats left; there were also online attendees.

In this and the next Medium posts, I will try to summarize the presentations and the main topics discussed in the workshop. The talks covered a wide range of robotics fields, many of which are far from my current area of work (robot motion planning in industrial environments), so I may focus more on the topics I am comfortable with. I also recommend watching the full workshop, which was recorded and is available here.

To help people interested in the topic, I summarized and added links below for the tools and datasets presented in the workshop.

Open-Source Datasets

  • BARN: 2D navigation in cluttered environments.
  • DynaBARN: 2D navigation in cluttered dynamic environments.
  • SCAND: Socially compliant navigation.
  • Med-MPD: Environments for evaluating motion planning for robotic surgery.
  • 2.5D elevation maps: Maps of planetary environments for robotic rovers.
  • TBD Pedestrian Dataset: Labeled top-down and perspective views of pedestrian-rich environments.
  • Crowdbot: Labeled outdoor pedestrian tracking from onboard sensors on a personal mobility robot navigating in crowds.

Open-Source Tools

  • MoveIt Benchmark Suite
  • Planner Developer Tools
  • HyperPlan + Robowflex

Lightning Talks

There were also short papers presented in the workshop, which will not be covered in this Medium post. The papers can be accessed here.

Session 1: Reproducible Experimental Design

TL;DR

The first session focused on how to design good experiments and on the best ways to evaluate motion planners.

If you don’t want to read about each presentation separately, here is a list of my main takeaways from the first session:

  • Benchmarks are important to allow a fair comparison between different planners. Many published works compare a proposed algorithm optimized for a given task against a state-of-the-art (from here onwards, SOTA) algorithm run with off-the-shelf hyperparameters. Having official benchmarks can mitigate this issue. However, academia should avoid making specific benchmarks a requirement for new papers and should encourage the development of new benchmarks as new tasks appear. Work on specialized tasks that does not fit any available benchmark should still be publishable without benchmark validation.
  • Motion planners should not be evaluated in a “vacuum” with perfect perception and actuation. It is important to evaluate how uncertainties in the world representation and inaccurate actuation affect the planner’s performance.
  • Performing evaluations with “humans-in-the-loop” is a complex problem. The presence of a robot interferes with natural human behavior, which invalidates most of the current datasets. There are efforts to create datasets and human simulators to address this issue.

Evaluating Motion Planning “in-the-Loops”

This talk was presented by Professor Xuesu Xiao from George Mason University and Everyday Robots. It contrasted how motion planning algorithms are evaluated in scientific papers with the challenges robots actually face when those algorithms are applied in the real world:

Deployment vs Development Loop. Source: https://youtu.be/c3muaY6j9RA
  • In academia, most of the time, planners are evaluated in a “perfect world” that assumes the perception and action modules work without any errors or noise. Many unexpected issues can arise when applying an algorithm in the real world; for example, the robot might not have enough computing power to run the planner and perception together in real time.
  • When publishing papers, researchers usually focus on the best or average case scenario; however, for real-world applications, the worst-case scenario matters the most.
  • Often, planning time and trajectory time are taken into account separately. However, there is an innate power consumption trade-off with real-world path planning: do we spend more time and power trying to generate a better plan, or do we go with the current imperfect plan?
  • “Optimal” plans might not be the best approach when there are “humans in the loop” (e.g. a robot that follows the shortest path can move dangerously close to humans). Robots should be socially compliant if we want to deploy them in the real world.

Finally, the speaker also shared some open-source datasets: BARN and DynaBARN (2D navigation in cluttered static and dynamic environments, respectively) and SCAND (socially compliant navigation).
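To make the planning-vs-execution trade-off mentioned above more concrete, here is a minimal, hypothetical sketch (not from the talk): if time spent planning and time spent executing draw from the same power budget, an anytime planner should stop refining once additional planning time costs more than the execution time it saves. All numbers are invented for illustration.

```python
# Minimal sketch of the planning-vs-execution trade-off; all numbers are hypothetical.
def total_cost(planning_time_s, execution_time_s, power_w=150.0):
    """Energy spent waiting for a plan and then executing it (both draw power)."""
    return power_w * (planning_time_s + execution_time_s)

# An anytime planner keeps improving the path; at some point the extra
# planning time costs more than the execution time it saves.
candidates = [
    (0.1, 42.0),   # quick, rough plan: (planning_time_s, execution_time_s)
    (1.0, 35.0),   # better plan
    (10.0, 33.5),  # near-optimal plan, but expensive to compute
]
best = min(candidates, key=lambda c: total_cost(*c))
print("best (planning_s, execution_s):", best)
```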

Lessons for Benchmarking from Learning Motion Planners

This talk was presented by Dr. Adithya Murali and Dr. Clemens Eppner, both from NVIDIA. They drew a parallel between benchmarking supervised learning problems and benchmarking traditional motion planners, and distilled some important lessons from the comparison:

Motion Planning: Learning vs Benchmarking. Source: https://youtu.be/E-ZG_HVsVIc
  • The diversity of motion problems matters. Comparing planners using only a few environments can lead to wrong conclusions.
  • To benchmark a planner across many environments, evaluation speed is paramount.
  • Motion planners should be evaluated end-to-end with real environment observations.
  • Prior work generally sampled random goals in the free space; however, when training a policy, goal distributions strongly depend on the task.
  • Without established benchmarks, paper reviews are noisy. Different reviewers can have opposite opinions on the algorithm evaluation method.

Finally, the authors also shared a new motion planning policy and the data used to train it.
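As a toy illustration of the point above about goal distributions, the hedged sketch below (entirely made up, not from the talk) contrasts goals sampled uniformly in the free workspace with goals concentrated around task-relevant geometry, such as grasp poses near a shelf.

```python
# Hypothetical sketch: uniform free-space goals vs. a task-conditioned goal distribution.
import numpy as np

rng = np.random.default_rng(42)

def sample_uniform_goal(low, high):
    """Uniform goal anywhere in the free workspace (common in prior work)."""
    return rng.uniform(low, high)

def sample_task_goal(shelf_center, noise=0.05):
    """Goals clustered around task-relevant geometry (what a policy actually sees)."""
    return shelf_center + rng.normal(scale=noise, size=3)

low, high = np.array([-1.0, -1.0, 0.0]), np.array([1.0, 1.0, 1.5])
shelf_center = np.array([0.6, 0.0, 1.1])

uniform_goals = np.array([sample_uniform_goal(low, high) for _ in range(1000)])
task_goals = np.array([sample_task_goal(shelf_center) for _ in range(1000)])

# The two distributions have very different spreads, so a planner tuned on
# one may be evaluated unfairly on the other.
print("uniform goal std:", uniform_goals.std(axis=0).round(2))
print("task goal std:   ", task_goals.std(axis=0).round(2))
```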

Intro to Experiment Design for Motion Planning

This talk was presented by Professor Anca Dragan from UC Berkeley. Her presentation was about improving experiment design in academia. Her main points were:

What is a good experiment? Source: https://youtu.be/ogbpjY6hVEo
  • Good experiments should avoid confounding variables, i.e., variables whose effect cannot be distinguished from the effect of another independent variable. Independent variables are the variables we manipulate to improve performance, while dependent variables are the ones we measure to quantify this improvement.
  • Confounding variables are particularly problematic when researchers spend months tuning hyperparameters for their proposed algorithm and then compare its performance with “off-the-shelf” SOTA algorithms. This can cloud which part of the proposed algorithm is improving the performance. Having official benchmarks is one way of dealing with those unfair comparisons.
  • Good experiments are factorial, i.e., they isolate the contribution of a single variable and test it with different combinations of the other variables (e.g., ablation studies).
  • Good experiments measure what they are supposed to measure. In motion planning, it is especially common to optimize a cost/reward function that ends up making the robot behave differently than intended.
  • Good experiments should sample not only from random distributions but also from the target population.

Author’s take: Designing good experiments is important not only for academia but also for industry. First, papers with well-designed, reproducible, and trustworthy experiments can help practitioners choose algorithms that perform well in the applications they care about. Moreover, factorial experiments can separate the contribution of each independent variable, which allows combining techniques from different works to build new custom planners tailored to specific industry needs.
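To make the idea of a factorial experiment concrete, here is a minimal, hypothetical sketch: every value of the variable under study (here, the sampler) is crossed with every value of the other factors and with several seeds, so its contribution can be isolated by marginalizing over the rest. The run_trial() function is only a stand-in for a real benchmark call.

```python
# Hypothetical factorial experiment: fully cross the factor under study with the others.
from itertools import product
import random

def run_trial(sampler, optimizer, env, seed):
    # Stand-in for running a real planner; returns fake but deterministic results.
    random.seed(hash((sampler, optimizer, env, seed)) % 2**32)
    return {"success": random.random() > 0.2, "planning_time": random.uniform(0.1, 2.0)}

samplers = ["uniform", "learned"]     # independent variable under study
optimizers = ["none", "shortcut"]     # other factors, fully crossed to avoid confounds
envs = ["shelf", "table", "bin"]
seeds = range(5)

results = {combo: run_trial(*combo) for combo in product(samplers, optimizers, envs, seeds)}

# Marginalize over everything except the sampler to isolate its contribution.
for s in samplers:
    runs = [v for k, v in results.items() if k[0] == s]
    rate = sum(r["success"] for r in runs) / len(runs)
    print(f"sampler={s}: success rate {rate:.2f} over {len(runs)} runs")
```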

Panel on Reproducible Experimental Design

The first panel featured Prof. Dmitry Berenson, Prof. Xuesu Xiao, Dr. Adithya Murali, and Dr. Clemens Eppner. The main topics of discussion and conclusions are summarized below:

Humans-in-the-loop for evaluation metrics

  • It is complex to test algorithms with real humans. Datasets containing human motions can help to a certain extent, but they cannot provide reactions to the robot’s actions.
  • Simulated humans still can’t behave human-like, but they can be used as an indirect signal for evaluation.

Best practices when evaluating long-duration tasks

  • Avoid confounding variables (e.g., keep vision, action, and other modules constant across the experiments).
  • Interaction and communication between different teams are paramount (planning team, vision team, etc.).

Building consensus in the community

Standard Evaluation Metrics:

  • Planning time, success rate, and path quality are the most common metrics. Prof. Berenson also suggested that probability of success should be used when the environment is not fully known and/or we use an imperfect learned dynamics model.
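As a small, hypothetical illustration of these metrics, the sketch below computes the success rate, mean and worst-case planning time, and a simple path-quality proxy (path length) over a batch of made-up planning runs.

```python
# Hypothetical batch of planning runs and the standard metrics computed over it.
import numpy as np

runs = [
    {"solved": True,  "planning_time": 0.12, "path_length": 3.4},
    {"solved": True,  "planning_time": 0.45, "path_length": 2.9},
    {"solved": False, "planning_time": 5.00, "path_length": None},  # timeout
]

success_rate = sum(r["solved"] for r in runs) / len(runs)
planning_times = [r["planning_time"] for r in runs if r["solved"]]
path_lengths = [r["path_length"] for r in runs if r["solved"]]

print(f"success rate     : {success_rate:.2f}")
print(f"planning time [s]: mean {np.mean(planning_times):.2f}, worst {np.max(planning_times):.2f}")
print(f"path quality [m] : mean {np.mean(path_lengths):.2f}")
```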

Standard Simulators:

  • There are several robotics simulators (and new ones keep coming out), but none is perfect; each has its own strengths and weaknesses. It is best to use the simulator that works best for the task at hand.

Standard Benchmarks:

  • Benchmarks should be based on real-world problems that we want to solve. Once a benchmark is considered solved, it can be retired in favor of new, harder ones.
  • Benchmarks enable a fair comparison between different algorithms, because new algorithms compete against the best published SOTA results instead of results obtained by the researchers proposing the new algorithm (which might use sub-optimal parameters).
  • It is important to avoid being locked into specific tasks. The community needs to have some flexibility to introduce benchmarks on new tasks.
  • Standardizing hardware is also important, as a worse algorithm can appear better simply because it runs on faster hardware.

Lessons from developing competitions and benchmarks in the classical planning community:

  • Having a defined representation format (e.g., PDDL, URDF) and defined metrics and optimization objectives is paramount.
  • The main risk of focusing too much on building standard benchmarks is that they can become a requirement for publishing research papers. Currently, it is hard to publish a paper on classical planning without proving the algorithm beats SOTA on dozens of benchmarks created decades ago. This hinders the publication of works that focus on something other than getting the best score on benchmarks.

Parting Words

The main focus of the first session was on how to create meaningful experiments. I believe there is a gap between what counts as a good experiment in academia and what counts as one in industry. Most of the time, research papers focus on isolated results that best showcase the paper’s contribution. Industry, on the other hand, is more interested in the practical results of the algorithm when applied to the real world. Industry-backed benchmarks and competitions can reduce this gap and promote the use of novel algorithms to solve real-world problems.

Using benchmarks to evaluate motion planning research papers, on the other hand, is still a challenge. There is a large variety of applications and robotic platforms: how can we compare the real-world performance of a planner if every laboratory has access to different robots? Standardized simulation environments can help, but not every application can be tested in simulation (e.g., social robots).

Finally, one of the challenges I encountered when I first applied motion planning algorithms was the lack of ways to compare the performance of different planners in my application; we ended up having to implement custom testbeds. Some software suites trying to solve this issue were shared in the workshop (MoveIt Benchmark Suite, Planner Developer Tools, HyperPlan + Robowflex), and we plan to include some of them in our evaluation pipeline.

The second part of this post will focus on Performance and Evaluation Metrics and can be accessed here.

» This article was originally published on our Medium blog and is now part of the MakinaRocks Blog. The original post remains accessible here.

Yuri Rocha
2022-12-14