Stage 4: Scene Understanding using Vision Language Model #3

@Methasit-Pun

Description

Implement Stage 4 of the grasp pipeline: Scene Understanding using a Vision Language Model.

Requirements:

  • Extract object-object spatial relations using a vision-language model.
  • Generate human-readable summaries of the scene.
  • Use spatial relation templates: left_of, right_of, above, below, near, far_from, in_front_of, behind, touching, overlapping.
  • Integrate the following interface:
from typing import Dict, List

import numpy as np

# Methods of the Stage 4 scene-understanding component (enclosing class omitted):
def __init__(self, model_name: str = "placeholder"):
    self.model_name = model_name
    # Spatial relation templates
    self.spatial_relations = [
        "left_of", "right_of", "above", "below", "near", "far_from",
        "in_front_of", "behind", "touching", "overlapping"
    ]

def understand_scene(self, image: np.ndarray, labels: List[str], boxes: List[List[float]]) -> Dict:
    """
    Extract a scene graph of spatial relations rather than a full-text description.

    Args:
        image: RGB image
        labels: List of object class labels
        boxes: List of bounding boxes [x, y, w, h]

    Returns:
        scene_description: Structured scene understanding with spatial relations
    """
  • The output should be a structured scene graph with spatial relations between detected objects.

This feature will support context-aware grasping by making the spatial relationships between objects in the scene explicit.
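Several of the listed relation templates (left_of, right_of, above, below, near, far_from, overlapping) can be approximated geometrically from the [x, y, w, h] bounding boxes before, or as a fallback to, querying the vision-language model. The sketch below illustrates one way to do this; the helper names `spatial_relation` and `build_scene_graph` are hypothetical and not part of the issue's interface:

```python
import math
from typing import Dict, List


def spatial_relation(box_a: List[float], box_b: List[float]) -> List[str]:
    """Relations of object A relative to object B; boxes are [x, y, w, h]."""
    # Box centers
    ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    relations = []
    if ax < bx:
        relations.append("left_of")
    elif ax > bx:
        relations.append("right_of")
    if ay < by:  # image coordinates: smaller y is higher in the frame
        relations.append("above")
    elif ay > by:
        relations.append("below")
    # Axis-aligned overlap test
    overlap_x = min(box_a[0] + box_a[2], box_b[0] + box_b[2]) - max(box_a[0], box_b[0])
    overlap_y = min(box_a[1] + box_a[3], box_b[1] + box_b[3]) - max(box_a[1], box_b[1])
    if overlap_x > 0 and overlap_y > 0:
        relations.append("overlapping")
    # near/far_from via center distance vs. the larger box extent (heuristic threshold)
    dist = math.hypot(ax - bx, ay - by)
    threshold = max(box_a[2], box_a[3], box_b[2], box_b[3])
    relations.append("near" if dist < threshold else "far_from")
    return relations


def build_scene_graph(labels: List[str], boxes: List[List[float]]) -> Dict:
    """Structured scene graph: objects plus (subject, relation, object) triples."""
    graph = {"objects": list(labels), "relations": []}
    for i, (label_a, box_a) in enumerate(zip(labels, boxes)):
        for j, (label_b, box_b) in enumerate(zip(labels, boxes)):
            if i != j:
                for rel in spatial_relation(box_a, box_b):
                    graph["relations"].append((label_a, rel, label_b))
    return graph
```

Depth-dependent relations (in_front_of, behind) and contact (touching) are harder to recover from 2D boxes alone, which is where the VLM query would fill in the gaps.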

Labels: enhancement (New feature or request)
