R2SNet: Scalable Domain Adaptation for Object Detection in Cloud-Based Robots Ecosystems via Proposal Refinement

All authors are with the Department of Computer Science, University of Milan, Milano, Italy  

Qualitative comparison: detections from the TaskNet alone vs. TaskNet + R2SNet.

Abstract

We introduce a novel approach for scalable domain adaptation in cloud robotics scenarios where robots rely on third-party AI inference services powered by large pre-trained deep neural networks. Our method is based on a downstream proposal-refinement stage running locally on the robots, exploiting a new lightweight DNN architecture, R2SNet. This architecture mitigates the performance degradation caused by domain shifts by adapting the object detection process to the target environment, focusing on relabeling, rescoring, and suppression of bounding-box proposals. Our method allows for local execution on robots, addressing the scalability challenges of domain adaptation without incurring significant computational costs. Real-world results with mobile service robots performing door detection show the effectiveness of the proposed method in achieving scalable domain adaptation.

Scenario

We consider a robotic ecosystem where multiple independent units are deployed across different environments and rely on a cloud-based DNN model (called TaskNet) to perform object detection. We achieve efficient domain adaptation by refining the TaskNet's proposals locally on each robot. To do this, we introduce R2SNet, a novel lightweight DNN architecture that focuses on three types of corrective actions: relabeling, rescoring, and suppression of bounding boxes. A minimal sketch of this cloud + local refinement loop is given below.
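The following Python sketch shows how the local refinement stage slots into the detection loop. The client and model interfaces (`query_tasknet`, the `r2snet` call signature) are hypothetical placeholders for illustration, not the released API.

```python
# Hypothetical sketch of the cloud + local refinement loop; interface names
# (query_tasknet, r2snet) are illustrative placeholders.

def detect(image, tasknet_client, r2snet, keep_threshold=0.5):
    # 1) Remote inference: the cloud TaskNet returns raw bounding-box
    #    proposals, each with a location, size, label, and confidence.
    proposals = tasknet_client.query_tasknet(image)

    # 2) Local refinement: for every proposal, R2SNet predicts a corrected
    #    label, a corrected confidence, and a suppression probability.
    labels, scores, suppress = r2snet(image, proposals)

    # 3) Apply the three corrective actions: relabel, rescore, suppress.
    refined = [
        (p.box, lbl, s)
        for p, lbl, s, supp in zip(proposals, labels, scores, suppress)
        if supp < keep_threshold  # drop proposals flagged for suppression
    ]
    return refined
```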

Method

Architecture

R2SNet is composed of two parallel networks that process, respectively, the proposal descriptors (location, dimensions, label, and confidence) and the features extracted from the corresponding image regions. Each network computes local features with shared MLPs and a global representation of the whole proposal set with a permutation-invariant max operation. These embeddings are aggregated and processed by three heads that address, respectively, the relabeling, rescoring, and suppression of the input proposals (see the sketch after the figure).

R2SNet architecture.
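A minimal PyTorch sketch of this structure follows, assuming a 6-dimensional proposal descriptor (x, y, w, h, label, confidence) and F-dimensional per-proposal image features from BFNet; the layer widths and head designs are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Shared MLP over proposals + permutation-invariant max pooling."""
    def __init__(self, in_dim, d=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, d), nn.ReLU(),
                                 nn.Linear(d, d), nn.ReLU())

    def forward(self, x):                 # x: (N, in_dim), N = #proposals
        local = self.mlp(x)               # per-proposal features, (N, d)
        glob = local.max(dim=0).values    # global set feature, (d,)
        # Concatenate each local feature with the shared global one.
        return torch.cat([local, glob.expand_as(local)], dim=1)  # (N, 2d)

class R2SNetSketch(nn.Module):
    def __init__(self, feat_dim, num_classes, d=128):
        super().__init__()
        self.geom = Branch(6, d)          # proposal descriptor branch
        self.appr = Branch(feat_dim, d)   # BFNet image-feature branch
        fused = 4 * d
        self.relabel = nn.Linear(fused, num_classes)  # corrected class logits
        self.rescore = nn.Linear(fused, 1)            # corrected confidence
        self.suppress = nn.Linear(fused, 1)           # suppression logit

    def forward(self, descriptors, features):
        # descriptors: (N, 6), features: (N, feat_dim)
        z = torch.cat([self.geom(descriptors), self.appr(features)], dim=1)
        return (self.relabel(z),
                torch.sigmoid(self.rescore(z)),
                torch.sigmoid(self.suppress(z)))
```

The max pooling makes the global feature invariant to the ordering of the proposals, so the network can reason about each box in the context of the whole proposal set.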

The Background Feature extractor Network (BFNet) extracts, for each proposal, the features of the corresponding image region. First, BFNet computes an image embedding with a multi-scale CNN backbone with residual connections, where the three scale embeddings are processed with convolutional layers and aggregated top-down, step by step, through upsampling and summation. To speed up inference, the image embedding is mapped to each proposal with a binary mask, generated by four MLPs with fixed weights, that suppresses the features falling outside the bounding box's boundaries (a sketch of this masking step is given after the figure).

BFNet architecture.
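A hedged sketch of the fixed-weight masking idea: instead of cropping per proposal (e.g., via ROI pooling), each box is turned into a soft binary mask over the feature map using closed-form units with fixed weights, so the same image embedding is reused for every proposal. The sigmoid steepness `k` and the grid construction below are assumptions, not the paper's exact formulation.

```python
import torch

def box_mask(boxes, H, W, k=50.0):
    """boxes: (N, 4) normalized (x1, y1, x2, y2); returns (N, H, W) masks."""
    ys = torch.linspace(0, 1, H).view(1, H, 1)   # pixel-center y coordinates
    xs = torch.linspace(0, 1, W).view(1, 1, W)   # pixel-center x coordinates
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    # Four fixed-weight units, one per box edge: each sigmoid approximates a
    # step function, and their product is ~1 inside the box, ~0 outside.
    mx = (torch.sigmoid(k * (xs - x1.view(-1, 1, 1))) *
          torch.sigmoid(k * (x2.view(-1, 1, 1) - xs)))   # (N, 1, W)
    my = (torch.sigmoid(k * (ys - y1.view(-1, 1, 1))) *
          torch.sigmoid(k * (y2.view(-1, 1, 1) - ys)))   # (N, H, 1)
    return mx * my                                       # (N, H, W)

# Per-proposal features: mask the shared (C, H, W) image embedding, then pool:
#   pooled = (feats.unsqueeze(0) * masks.unsqueeze(1)).amax(dim=(2, 3))  # (N, C)
```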

Experiments

To carry out our experiments, we teleoperate a Giraff-X robot to map four real environments (three university facilities and an apartment) while acquiring images at 1 Hz with a low-resolution Orbbec Astra camera. We train different versions of R2SNet using increasing amounts of data and varying numbers of input proposals. We show that R2SNet strongly improves the performance obtained with the TaskNet alone, even with little training data (only 25% of the acquired images) and with a reasonable number of bounding boxes to refine (between 30 and 100).

BibTeX

@misc{antonazzi2024r2snet,
      title={R2SNet: Scalable Domain Adaptation for Object Detection in Cloud-Based Robots Ecosystems via Proposal Refinement},
      author={Michele Antonazzi and Matteo Luperto and N. Alberto Borghese and Nicola Basilico},
      year={2024}
}