I now turn my attention to finding the root cause of random sporadic failures when the Roboclaw ROS driver makes calls into the Roboclaw API. The most common symptom of these failures is the TypeError I addressed earlier, but avoiding a crash from that TypeError is only a band-aid. There were still other issues. Sometimes Phoebe's movement stuttered. And even more mysteriously, sometimes when Phoebe was supposed to be standing still, it would twitch for a fraction of a second.

The ROS logs show that the node never intended to send a movement command. But it does send a continuous stream of "run at speed X" commands at a rate of ten per second, whether it is trying to move the robot or not. When staying still, that stream continues at the same rate, constantly sending "run at speed zero". The fact that my robot twitches when it's supposed to be standing still tells me that a "run at speed zero" command is occasionally corrupted into a "run at speed X" command, which starts the motor moving for 1/10th of a second until it is stopped by the next non-corrupted "run at speed zero" command.

Any time there is a random sporadic failure, my instinct is to look for places where threading collisions may be taking place. Programming for multi-threaded environments can get tricky. When I wrote code for SGVHAK Rover, the intent was to make that code easy to understand and pick up. In that spirit, I explicitly kept everything on a single thread to avoid multi-threading issues.

But now we're playing in the major leagues and there's no avoiding multiple threads. ROS itself runs across multiple processes and threads, so it can scale to robots that have multiple on-board computers, each of which has multiple processing cores. Fortunately, the ROS implementation takes care of almost everything. Each ROS node just has to make sure it is doing the right thing with its own private data.

There are at least three different ROS topics involved in this Roboclaw ROS node:

  1. Driving commands received via /cmd_vel
  2. Odometry computation broadcast to /odom
  3. Diagnostics information

Each ROS topic is processed in its own thread, so given three topics, we should expect at least three different threads that might all call into the Roboclaw API at the same time to perform their tasks. Armed with this knowledge, I looked for code managing cross-thread access. My plan was to review that code to see if I could find any obvious problems with it.

That plan was changed when I found no code managing cross-thread access. I guess its absence qualifies as an obvious problem and it would certainly explain the kind of behavior I saw.
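To make that failure mode concrete, here is a minimal sketch of what can happen on a shared serial line when nothing coordinates the writers. It is a deterministic illustration, not real threading: the hypothetical `interleave` helper merges two packets one byte at a time, as if each writer were interrupted after every byte. The packet contents are invented readable text, not the actual Roboclaw wire protocol.

```python
from itertools import zip_longest


def interleave(*packets):
    """Merge packets round-robin, one byte at a time.

    This imitates the worst case of several writers sharing one serial
    port with no synchronization: each is interrupted after every byte.
    """
    merged = []
    for column in zip_longest(*packets):
        # zip_longest pads shorter packets with None; skip the padding.
        merged.extend(b for b in column if b is not None)
    return bytes(merged)


# Made-up packets for illustration only (not Roboclaw's real protocol).
stop = b"[speed=000]"   # the "run at speed zero" command
odom = b"[read_enc?]"   # an encoder query from the odometry code

wire = interleave(stop, odom)

# Every byte still arrives, but neither packet survives intact.
print(wire)
print(stop in wire, odom in wire)
```

The receiver sees all the right bytes in the wrong order, so it may reject the data or, worse, decode it as a different command. That would be consistent with a "speed zero" turning into a brief "speed X".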

Not being a Python expert, I cruised StackOverflow for a pattern I could use to implement Python thread synchronization. I decided it was most straightforward to use the with keyword, as described in one of the later replies on this "Semaphores on Python" thread. Using this pattern makes the code change delta very straightforward to read.

There were a few initial calls into the Roboclaw API to set things up, and I left those alone. But as soon as the code starts kicking off events that involve other threads (specifically, when it starts the diagnostics thread), every following call into the Roboclaw API is synchronized by a threading.Lock() object. With this modification, we can guarantee that only one thread will be performing serial communication with the Roboclaw motor controller at any given time, which avoids data corruption from multiple threads trying to talk to the serial port at the same time.
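The pattern looks roughly like the sketch below. This is not the code from the pull request: `FakeSerialPort`, `FakeRoboclaw`, `locked_send`, and the packet contents are all hypothetical stand-ins so the example can run without hardware. Only `threading.Lock()` and the with statement are the real ingredients.

```python
import threading
import time


class FakeSerialPort:
    """Stand-in for the serial link; records every byte written."""

    def __init__(self):
        self.received = []

    def write_byte(self, value):
        self.received.append(value)
        time.sleep(0)  # yield, inviting a thread switch between bytes


class FakeRoboclaw:
    """Hypothetical stand-in for the Roboclaw API (not the real library)."""

    def __init__(self, port):
        self.port = port

    def send_packet(self, packet):
        for value in packet:
            self.port.write_byte(value)


# One lock shared by every thread that talks to the controller.
serial_lock = threading.Lock()


def locked_send(roboclaw, packet):
    # "with" acquires the lock on entry and releases it on exit,
    # even if send_packet raises an exception.
    with serial_lock:
        roboclaw.send_packet(packet)


def worker(roboclaw, packet, count):
    for _ in range(count):
        locked_send(roboclaw, packet)


def run_demo(count=50):
    port = FakeSerialPort()
    roboclaw = FakeRoboclaw(port)
    # Made-up 4-byte packets, one per thread (drive, odometry, diagnostics).
    packets = [b"DRIV", b"ODOM", b"DIAG"]
    threads = [
        threading.Thread(target=worker, args=(roboclaw, p, count))
        for p in packets
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Split the recorded byte stream back into 4-byte chunks.
    data = port.received
    chunks = [bytes(data[i:i + 4]) for i in range(0, len(data), 4)]
    return chunks, packets


if __name__ == "__main__":
    chunks, packets = run_demo()
    print(all(chunk in packets for chunk in chunks))
```

Because each packet is written entirely inside the with block, every 4-byte chunk on the fake wire is one complete packet, no matter how the threads are scheduled. Replacing `locked_send` with a direct call to `send_packet` removes that guarantee.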

Phoebe ran smoothly and reliably after this work. No more stutter in motion, and no more twitching when standing still. I've submitted the fix as a pull request.