Stage 4: Scene Understanding using Vision Language Model #3
Open
Labels: enhancement (New feature or request)
Description
Implement Stage 4 of the grasp pipeline: Scene Understanding using a Vision Language Model.
Requirements:
- Extract object-object spatial relations using a vision-language model.
- Generate human-readable summaries of the scene.
- Use spatial relation templates: left_of, right_of, above, below, near, far_from, in_front_of, behind, touching, overlapping.
- Integrate the following interface:

```python
import numpy as np
from typing import Dict, List


class SceneUnderstanding:  # class name not specified in the issue; illustrative
    def __init__(self, model_name: str = "placeholder"):
        self.model_name = model_name
        # Spatial relation templates
        self.spatial_relations = [
            "left_of", "right_of", "above", "below", "near", "far_from",
            "in_front_of", "behind", "touching", "overlapping"
        ]

    def understand_scene(self, image: np.ndarray, labels: List[str],
                         boxes: List[List[float]]) -> Dict:
        """
        Extract a scene graph / relations rather than a full text description.

        Args:
            image: RGB image
            labels: List of object class labels
            boxes: List of bounding boxes [x, y, w, h]

        Returns:
            scene_description: Structured scene understanding with spatial relations
        """
```

- The output should be a structured scene graph with spatial relations between detected objects.
This feature will help provide context-aware grasping by understanding spatial relationships in the scene.
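As a starting point, the purely 2D relations in the template list (left_of, right_of, above, below, near, far_from) can be approximated from bounding-box geometry alone before a VLM is wired in. The sketch below is a minimal geometric baseline, not the issue's required implementation: the function names and the pixel-based `near_thresh` are illustrative assumptions, and depth-dependent relations (in_front_of, behind) as well as touching/overlapping would still need the VLM or additional cues.

```python
import numpy as np
from typing import Dict, List


def box_center(box: List[float]) -> np.ndarray:
    # box is [x, y, w, h]; center is (x + w/2, y + h/2)
    x, y, w, h = box
    return np.array([x + w / 2.0, y + h / 2.0])


def pairwise_relations(labels: List[str], boxes: List[List[float]],
                       near_thresh: float = 100.0) -> Dict:
    """Derive coarse 2D spatial relations from box geometry alone.

    near_thresh is an arbitrary pixel distance (assumption); image
    coordinates follow the usual convention where +y points down.
    """
    relations = []
    for i in range(len(boxes)):
        for j in range(len(boxes)):
            if i == j:
                continue
            ci, cj = box_center(boxes[i]), box_center(boxes[j])
            dx, dy = (cj - ci).tolist()
            if dx > 0:
                relations.append((labels[i], "left_of", labels[j]))
            elif dx < 0:
                relations.append((labels[i], "right_of", labels[j]))
            if dy > 0:  # j's center is lower in the image, so i is above j
                relations.append((labels[i], "above", labels[j]))
            elif dy < 0:
                relations.append((labels[i], "below", labels[j]))
            dist = float(np.linalg.norm(cj - ci))
            relations.append(
                (labels[i], "near" if dist < near_thresh else "far_from", labels[j])
            )
    return {"objects": labels, "relations": relations}
```

A VLM-backed `understand_scene` could merge these geometric relations with model-predicted ones, using the baseline as a sanity check on the VLM output.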