This is the second part of a two-part summary of the IROS 2022 workshop “Evaluating Motion Planning Performance: Metrics, Tools, Datasets, and Experimental Design.” If you have not read the first part, I suggest starting there before reading this post.
In this second part, I will focus on the topics presented in the afternoon session. As stated in my first post, I will summarize every presentation and add my own opinion on the topics I am more comfortable with.
Session 2: Performance and Evaluation Metrics
TL;DR
The second session focused on defining good metrics, finding appropriate parameters, dealing with uncertainty, and handling specific tasks for which general planners may be unusable.
If you don’t want to read about each presentation separately, the Parting Words section at the end summarizes my main takeaways from the afternoon session.
Hyperparameter Optimization as a Tool for Motion Planning Algorithm Selection
This talk was presented by Dr. Mark Moll from PickNik Robotics. The presentation focused on how to select a suitable motion planning algorithm, and the right parameters for it, for a particular application. However, the search space is complex: there are many planners to choose from, and each exposes its own set of parameters.
He proposes using hyperparameter optimization techniques, which are widely used in machine learning, for this task. Doing so requires a dataset of motion planning problems and a loss function; a good loss function should be fast to compute and able to separate good planners from bad ones. Dr. Moll used BOHB, a combination of HyperBand and Bayesian Optimization, for the hyperparameter search, and his experiments showed that optimized configurations can substantially outperform the default parameters.
The code for HyperPlan is available here.
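For a flavor of what such a search looks like in practice, here is a minimal sketch using Optuna’s TPE sampler with a Hyperband pruner as an accessible stand-in for BOHB; the planner names, parameter ranges, and the `load_problems`/`run_planner` helpers are hypothetical, not HyperPlan’s actual API.

```python
# A minimal sketch of planner hyperparameter search. load_problems and
# run_planner(problem, planner, params) -> (solved, seconds, cost) are
# hypothetical stand-ins for a real benchmark harness.
import optuna

PROBLEMS = load_problems()  # hypothetical dataset of planning problems

def loss(solved, planning_time, path_cost):
    # Cheap to compute and separates good planners from bad ones:
    # failures get a large penalty, successes trade off time and cost.
    if not solved:
        return 1e6
    return planning_time + 0.1 * path_cost

def objective(trial):
    planner = trial.suggest_categorical("planner", ["RRTConnect", "PRM", "BITstar"])
    params = {
        "range": trial.suggest_float("range", 0.05, 5.0, log=True),
        "goal_bias": trial.suggest_float("goal_bias", 0.0, 0.5),
    }
    total = 0.0
    for i, problem in enumerate(PROBLEMS):
        solved, seconds, cost = run_planner(problem, planner, params)
        total += loss(solved, seconds, cost)
        trial.report(total / (i + 1), step=i)  # lets Hyperband prune bad configs early
        if trial.should_prune():
            raise optuna.TrialPruned()
    return total / len(PROBLEMS)

study = optuna.create_study(direction="minimize",
                            pruner=optuna.pruners.HyperbandPruner())
study.optimize(objective, n_trials=200)
print(study.best_params)
```

The loss mirrors the desiderata from the talk: it is fast to compute, and failed queries are penalized heavily so the search quickly discards configurations that cannot solve the dataset.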
Author’s take: According to the results, hyperparameter optimization can drastically improve the performance of motion planners compared with their default values. Robotics applications vary widely, so it is virtually impossible to find one set of parameters that works well in every situation. However, if practitioners can clearly define their problem, this approach seems promising. The code is still under development, but I believe it can become an integral part of MoveIt once a stable release comes out.
Nonparametric Statistical Evaluation of Sampling-Based Motion Planning Algorithms
This talk was presented by Dr. Jonathan Gammell from the Oxford Robotics Institute. The speaker shared best practices for the statistical evaluation of sampling-based planners, whose randomized nature makes single-run comparisons unreliable.
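To make the statistics point concrete, here is a minimal sketch of a nonparametric comparison of two randomized planners over repeated runs of the same query; the timing arrays are made-up data, with `np.nan` marking failed runs.

```python
# A minimal sketch of nonparametric evaluation of two randomized planners.
# The solve-time data below is hypothetical; np.nan marks runs that failed
# to find a solution within the time limit.
import numpy as np
from scipy import stats

times_a = np.array([0.8, 1.1, 0.9, np.nan, 1.0, 0.7, 1.3, np.nan])  # planner A
times_b = np.array([1.4, 1.2, np.nan, 1.6, 1.1, np.nan, np.nan, 1.5])  # planner B

for name, times in [("A", times_a), ("B", times_b)]:
    solved = ~np.isnan(times)
    print(f"planner {name}: success rate {solved.mean():.2f}, "
          f"median solve time {np.nanmedian(times):.2f}s")

# Rank-based test on the successful runs only; it makes no Gaussian
# assumption about the (typically heavy-tailed) solve-time distribution.
u, p = stats.mannwhitneyu(times_a[~np.isnan(times_a)],
                          times_b[~np.isnan(times_b)],
                          alternative="two-sided")
print(f"Mann-Whitney U = {u:.1f}, p = {p:.3f}")
```

Note that running the rank test on successful runs only introduces a survivorship bias, which is one more reason success rate should always be reported alongside solve times.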
Author’s take: I also consider planning success to be the most important metric when evaluating a planner, as a planner with a low success rate cannot be used in critical applications. However, using it as the sole evaluation metric is insufficient. I believe the metrics should vary based on the application. For example, when evaluating the performance of a planned trajectory in an industrial environment, metrics such as clearance from obstacles, maximum joint torque, trajectory jerk, and trajectory duration must also be considered. On the other hand, in the case of Offline Programming, planning time is less critical, as there is no real-time requirement.
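As an illustration of such trajectory-level metrics, here is a minimal sketch that computes duration and jerk for a joint-space trajectory sampled at a fixed rate; the `sampled_trajectory` array, the timestep, and the `min_obstacle_distance` helper are hypothetical.

```python
# A minimal sketch of trajectory-quality metrics for a joint-space
# trajectory sampled at a fixed timestep dt. sampled_trajectory is a
# hypothetical (T samples x N joints) array in radians.
import numpy as np

dt = 0.01                                # 100 Hz sampling, an assumption
traj = np.asarray(sampled_trajectory)    # hypothetical (T, N) array

duration = (len(traj) - 1) * dt

# Finite-difference derivatives: velocity, acceleration, jerk.
vel = np.gradient(traj, dt, axis=0)
acc = np.gradient(vel, dt, axis=0)
jerk = np.gradient(acc, dt, axis=0)

# Scalar summaries useful for ranking candidate trajectories.
max_abs_jerk = np.abs(jerk).max()
rms_jerk = np.sqrt((jerk ** 2).mean())

# Clearance needs a scene-specific distance query; min_obstacle_distance
# is hypothetical (e.g., backed by a collision-checking library).
clearance = min(min_obstacle_distance(q) for q in traj)

print(duration, max_abs_jerk, rms_jerk, clearance)
```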
Comparable Performance Evaluation in Motion Planning under Uncertainty
This talk was presented by Professor Hanna Kurniawati from the Australian National University. The speaker focused on the challenges of evaluating motion planning under uncertainty and on the efforts to overcome them.
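One concrete difficulty is that, under uncertainty, two executions of the same policy can produce very different outcomes, so single-run comparisons say little. Here is a minimal sketch of Monte Carlo evaluation, assuming a hypothetical `simulate_policy` function and `policy` object:

```python
# A minimal sketch of evaluating a policy under uncertainty by Monte Carlo:
# each rollout of the same policy differs because of noisy dynamics and
# sensing, so we report a mean with a confidence interval, not one run.
# simulate_policy(policy, seed) -> total rollout cost is hypothetical.
import numpy as np

N_ROLLOUTS = 1000
costs = np.array([simulate_policy(policy, seed=i) for i in range(N_ROLLOUTS)])

mean = costs.mean()
stderr = costs.std(ddof=1) / np.sqrt(N_ROLLOUTS)
print(f"expected cost ~ {mean:.2f} +/- {1.96 * stderr:.2f} (95% CI)")
```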
Lessons Learned from Motion Planning Benchmarking
This talk was presented by Dr. Andreas Orthey from Realtime Robotics. The speaker’s main point was that tailor-made planning solutions are paramount when dealing with complex tasks.
Author’s take: Using specialized planners is essential for our industrial applications. Offline Planning is a complex but well-defined problem with clear metrics that need to be optimized. Hence, MakinaRocks’ OLP team also uses a custom planner that combines sampling-based planners, optimizing planners, and custom heuristics.
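As an illustration of how these ingredients can fit together, here is a minimal sketch of a two-stage pipeline in which a sampling-based planner produces a feasible path and a shortcutting pass then optimizes it; `plan_feasible_path`, `is_collision_free_segment`, `start`, and `goal` are hypothetical stand-ins, not our actual planner.

```python
# A minimal sketch of a common two-stage pattern: a sampling-based planner
# returns a feasible but jagged path, which a shortcutting pass optimizes.
# plan_feasible_path and is_collision_free_segment are hypothetical
# stand-ins for a real planner and collision checker.
import random

def shortcut(path, n_attempts=200):
    path = list(path)
    for _ in range(n_attempts):
        if len(path) < 3:
            break  # nothing left to shortcut
        i, j = sorted(random.sample(range(len(path)), 2))
        if j - i < 2:
            continue  # adjacent waypoints, no intermediate points to remove
        # Replace the intermediate waypoints with a straight segment
        # whenever that segment stays collision-free.
        if is_collision_free_segment(path[i], path[j]):
            path = path[:i + 1] + path[j:]
    return path

raw_path = plan_feasible_path(start, goal)  # e.g., an RRT-Connect-style planner
smoothed = shortcut(raw_path)
```

Shortcutting is only the simplest example of the optimizing stage; in practice that slot could be filled by a full trajectory optimizer.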
Panel on Performance and Evaluation Metrics
The second panel featured Dr. Jonathan Gammell, Dr. Mark Moll, Prof. Hanna Kurniawati, and Dr. Andreas Orthey. The main topics of discussion and conclusions are summarized below:
Importance of estimating the planner’s performance.
Annual benchmarking and competitions to measure the progress of planners: are they helpful or just a distraction?
Have we reached the limit of general planners, and should we focus on optimizing for specific problems?
Parting Words
The second session touched on some topics that were also raised in the morning session, such as promoting benchmarks and competitions focused on path planning. Most presenters shared the view that indiscriminate use of benchmarks in research can do more harm than good, as it may discourage research on applications the benchmarks do not cover. As stated in the first post of this series, I agree that industry-backed benchmarks and challenges make more sense as a way to bring research breakthroughs into real applications.
They also discussed whether the community should focus on tailor-made algorithms that work well on the applications they are designed for, or on general planners that work well across a variety of applications. The consensus was that both approaches should be explored in parallel and can benefit each other in the long run. However, I believe tailor-made planners are more suitable for industrial applications, where trading flexibility for performance is worth it in most cases. This may change once robots can perform many tasks across varied environments, but there is still a long way to go.
Overall, I enjoyed the workshop, and I am looking forward to the next iterations. Hopefully, it can become an annual workshop where researchers and practitioners can meet and share knowledge and tools to improve the motion planning field. Moreover, I want to express my gratitude to the organizers who worked hard to make the event so enjoyable and to MakinaRocks for sending my teammates and me to IROS 2022, so we could experience firsthand the latest robotics trends.
Finally, I would like to thank you for taking the time to read this post series. Feel free to contact me to discuss the contents of this post, the OLP project, or robotics in general. Don’t forget to check our other posts covering our experience and the latest robotics trends from IROS 2022 (even if you can’t speak Korean, Google Translate can work wonders 😉).
» This article was originally published on our Medium blog and is now part of the MakinaRocks Blog. The original post remains accessible here.