diff --git a/docs/guides/extensions/curator/metadata_curation.md b/docs/guides/extensions/curator/metadata_curation.md index ddc2b6f2d..5cdb9a78e 100644 --- a/docs/guides/extensions/curator/metadata_curation.md +++ b/docs/guides/extensions/curator/metadata_curation.md @@ -1,497 +1,216 @@ -# How to Create Metadata Curation Workflows +# How to Curate Metadata with the Synapse Python Client -This guide shows you how to set up a metadata curation workflow in Synapse using the curator extension. You'll learn to find appropriate schemas, create curation tasks for your research data. +This guide walks you through programmatic metadata curation in Synapse, from setting up curation tasks to validating data and managing grid sessions. -## What you'll accomplish +## What you'll learn -By following this guide, you will: - -- Find and select the right JSON schema for your data type -- Create a metadata curation workflow with automatic validation -- Set up either file-based or record-based metadata collection -- Configure curation tasks that guide collaborators through metadata entry -- Retrieve and analyze detailed validation results to identify data quality issues +- How to find schemas and create curation tasks +- How to manage grid sessions: import CSV data, download data, and synchronize changes +- How to check validation results **before** committing (pre-commit validation via WebSocket) +- How to check validation results **after** committing (export-based validation) +- How to manage curation task lifecycle (list, update, delete with cleanup) ## Prerequisites +- Python environment with `pip install --upgrade "synapseclient[curator]"` - A Synapse account with project creation permissions -- Python environment with synapseclient and the `curator` extension installed (ie. 
`pip install --upgrade "synapseclient[curator]"`) -- An existing Synapse project and folder where you want to manage metadata -- A JSON Schema registered in Synapse (many schemas are already available for Sage-affiliated projects, or you can register your own by following the [JSON Schema tutorial](../../../tutorials/python/json_schema.md)) - - If you are leveraging the [Curator CSV data model](../../../explanations/curator_data_model.md), you can create JSON schemas by following this [tutorial](../../extensions/curator/schema_operations.md) -- (Optional) An existing Synapse team if you want multiple users to collaborate on the same Grid session. Pass the team's ID as `assignee_principal_id` when creating the curation task. +- A JSON Schema registered in Synapse (see [JSON Schema tutorial](../../../tutorials/python/json_schema.md) or [Schema Operations guide](schema_operations.md)) + +--- -## Step 1: Authenticate and import required functions +## 1. Authentication and setup ```python -from synapseclient.extensions.curator import ( - create_record_based_metadata_task, - create_file_based_metadata_task, - query_schema_registry -) -from synapseclient import Synapse - -syn = Synapse() -syn.login() +{!docs/guides/extensions/curator/scripts/setup_and_create_tasks.py!lines=7-15} ``` -## Step 2: Find the right schema for your data +## 2. Find a schema for your data -Before creating a curation task, identify which JSON schema matches your data type. Many schemas are already registered in Synapse for Sage-affiliated projects. The schema registry contains validated schemas organized by data coordination center (DCC) and data type. - -**If you need to register your own schema**, follow the [JSON Schema tutorial](../../../tutorials/python/json_schema.md) to understand the registration process. +The schema registry contains validated JSON schemas organized by data coordination center (DCC) and data type. 
```python -# Find the latest schema for your specific data type -schema_uri = query_schema_registry( - synapse_client=syn, - dcc="ad", # Your data coordination center, check out the `syn69735275` table if you do not know your code - datatype="IndividualAnimalMetadataTemplate" # Your specific data type -) - -print("Latest schema URI:", schema_uri) +{!docs/guides/extensions/curator/scripts/setup_and_create_tasks.py!lines=17-24} ``` -**When to use this approach:** You know your DCC and data type, you want the most current schema version, and it has already been registered into . +To browse all available versions of a schema: -**Alternative - browse available schemas:** ```python -# Get all versions to see what's available -all_schemas = query_schema_registry( - synapse_client=syn, - dcc="ad", - datatype="IndividualAnimalMetadataTemplate", - return_latest_only=False -) +{!docs/guides/extensions/curator/scripts/setup_and_create_tasks.py!lines=26-31} ``` -## Step 3: Choose your metadata workflow type +## 3. Create a curation task + +A curation task guides collaborators through metadata entry. There are two types: -### Option A: Record-based metadata +### Record-based metadata (structured records in a RecordSet) -Use this when metadata describes individual data files and is stored as annotations directly on each file. +Use this when metadata is stored as tabular records, like a spreadsheet of sample annotations. ```python -record_set, curation_task, data_grid = create_record_based_metadata_task( - synapse_client=syn, - project_id="syn123456789", # Your project ID - folder_id="syn987654321", # Folder where RecordSet Entity will be stored - record_set_name="AnimalMetadata_Records", - record_set_description="Centralized metadata for animal study data", - curation_task_name="AnimalMetadata_Curation", # Must be unique within the project - upsert_keys=["StudyKey"], # Fields that uniquely identify records - instructions="Complete all required fields according to the schema. 
Use StudyKey to link records to your data files.", - schema_uri=schema_uri, # Schema found in Step 2 - bind_schema_to_record_set=True, - assignee_principal_id="123456" # Optional: Assign to a user or team -) - -print(f"Created RecordSet: {record_set.id}") -print(f"Created CurationTask: {curation_task.task_id}") +{!docs/guides/extensions/curator/scripts/setup_and_create_tasks.py!lines=33-51} ``` -**What this creates:** +This creates a RecordSet, a CurationTask, and an initial Grid session for collaborative editing. -- A RecordSet where metadata is stored as structured records (like a spreadsheet) -- A CurationTask that guides users through completing the metadata -- Automatic schema binding for validation -- A data grid interface for easy metadata entry +### File-based metadata (annotations on individual files) -### Option B: File-based metadata (for unique per-file metadata) - -Use this when metadata is normalized in structured records to eliminate duplication and ensure consistency. +Use this when metadata describes individual files in a folder. 
```python -entity_view_id, task_id = create_file_based_metadata_task( - synapse_client=syn, - folder_id="syn987654321", # Folder containing your data files - curation_task_name="FileMetadata_Curation", # Must be unique within the project - instructions="Annotate each file with metadata according to the schema requirements.", - attach_wiki=False, # Creates a wiki in the folder with the entity view (Defaults to False) - entity_view_name="Animal Study Files View", - schema_uri=schema_uri, # Schema found in Step 2 - assignee_principal_id="123456" # Optional: Assign to a user or team -) - -print(f"Created EntityView: {entity_view_id}") -print(f"Created CurationTask: {task_id}") +{!docs/guides/extensions/curator/scripts/setup_and_create_tasks.py!lines=53-63} ``` -**What this creates:** +--- -- An EntityView that displays all files in the folder -- A CurationTask for guided metadata entry -- Automatic schema binding to the folder for validation -- Optional wiki attached to the folder +## 4. Work with Grid sessions -## Complete example script +Grid sessions are the core editing interface for curation. You can create them, import CSV data, download data, check validation, and synchronize changes. 
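Taken together, a typical pass through a session follows a fixed order: create, import, validate or edit, commit, delete. Here is a minimal sketch of that lifecycle as a reusable helper — the `import_csv`, `export_to_record_set`, and `delete` method names are the ones used by the scripts in this guide, but the helper itself is illustrative, not part of the client API:

```python
def run_curation_cycle(grid, csv_path):
    """One pass through a grid session: import, commit, clean up.

    `grid` is assumed to expose the `import_csv`, `export_to_record_set`,
    and `delete` methods shown elsewhere in this guide, with the first two
    returning the updated session object.
    """
    grid = grid.import_csv(path=csv_path)  # load contributor data
    grid = grid.export_to_record_set()     # commit rows back to the source
    grid.delete()                          # release the session
    return grid
```

In practice you would inspect validation results between the import and the export steps; both validation paths are covered later in this guide.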
-Here's the full script that demonstrates both workflow types: +### Create a Grid session ```python -from pprint import pprint -from synapseclient.extensions.curator import ( - create_record_based_metadata_task, - create_file_based_metadata_task, - query_schema_registry -) -from synapseclient import Synapse - -# Step 1: Authenticate -syn = Synapse() -syn.login() - -# Step 2: Find schema -schema_uri = query_schema_registry( - synapse_client=syn, - dcc="ad", - datatype="IndividualAnimalMetadataTemplate" -) -print("Using schema:", schema_uri) - -# Step 3A: Create record-based workflow -record_set, curation_task, data_grid = create_record_based_metadata_task( - synapse_client=syn, - project_id="syn123456789", - folder_id="syn987654321", - record_set_name="AnimalMetadata_Records", - record_set_description="Centralized animal study metadata", - curation_task_name="AnimalMetadata_Curation", - upsert_keys=["StudyKey"], - instructions="Complete metadata for all study animals using StudyKey to link records to data files.", - schema_uri=schema_uri, - bind_schema_to_record_set=True, - assignee_principal_id="123456" # Optional: Assign to a user or team -) - -print(f"Record-based workflow created:") -print(f" RecordSet: {record_set.id}") -print(f" CurationTask: {curation_task.task_id}") - -# Step 3B: Create file-based workflow -entity_view_id, task_id = create_file_based_metadata_task( - synapse_client=syn, - folder_id="syn987654321", - curation_task_name="FileMetadata_Curation", - instructions="Annotate each file with complete metadata according to schema.", - attach_wiki=True, - entity_view_name="Animal Study Files View", - schema_uri=schema_uri, - assignee_principal_id="123456" # Optional: Assign to a user or team -) - -print(f"File-based workflow created:") -print(f" EntityView: {entity_view_id}") -print(f" CurationTask: {task_id}") +{!docs/guides/extensions/curator/scripts/grid_session_operations.py!lines=7-20} ``` -## Step 4: Work with metadata and validate (Record-based 
workflow) - -After creating a record-based metadata task, collaborators can enter metadata through the Grid interface. Once metadata entry is complete, you'll want to validate the data against your schema and identify any issues. +### Import CSV data into a Grid -### The metadata curation workflow +Upload CSV data into an active grid session. You can provide a local file path, a pandas DataFrame, or an existing file handle ID. The CSV must match the grid's column schema. -1. **Data Entry**: Collaborators use the Grid interface (via the curation task link in the Synapse web UI) to enter metadata -2. **Grid Export**: Export the Grid session back to the RecordSet to save changes (this can be done via the web UI or programmatically) -3. **Validation**: Retrieve detailed validation results to identify schema violations -4. **Correction**: Fix any validation errors and repeat as needed +```python +{!docs/guides/extensions/curator/scripts/grid_session_operations.py!lines=22-53} +``` -### Creating and exporting a Grid session +### Download Grid data as CSV -Validation results are only generated when a Grid session is exported back to the RecordSet. This triggers Synapse to validate each row against the bound schema. You have two options: +Export the current grid state to a local CSV file. The downloaded CSV does **not** include validation columns. -**Option A: Via the Synapse web UI (most common)** +```python +{!docs/guides/extensions/curator/scripts/grid_session_operations.py!lines=55-57} +``` -Users can access the curation task through the Synapse web interface, enter/edit data in the Grid, and click the export button. This automatically generates validation results. +### Synchronize Grid with data source -**Option B: Programmatically create and export a Grid session** +Apply grid session changes back to the source entity (table, view, or RecordSet). 
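Because `synchronize()` reports problems through the `synchronize_error_messages` attribute rather than raising, automated pipelines may want to convert those messages into an exception. A small wrapper sketch — the method and attribute names follow the grid session script in this guide, while the wrapper itself is a hypothetical helper:

```python
def synchronize_or_raise(grid):
    """Synchronize a grid session and fail loudly on any reported error."""
    grid = grid.synchronize()
    errors = grid.synchronize_error_messages or []
    if errors:
        raise RuntimeError("grid synchronize failed: " + "; ".join(errors))
    return grid
```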
```python -from synapseclient import Synapse -from synapseclient.models import RecordSet -from synapseclient.models.curation import Grid - -syn = Synapse() -syn.login() - -# Get your RecordSet (must have a schema bound) -record_set = RecordSet(id="syn987654321").get() +{!docs/guides/extensions/curator/scripts/grid_session_operations.py!lines=59-67} +``` -# Create a Grid session from the RecordSet -grid = Grid(record_set_id=record_set.id).create() +### List and delete Grid sessions -# At this point, users can interact with the Grid (either programmatically or via web UI) -# When ready to save changes and generate validation results, export back to RecordSet -grid.export_to_record_set() +```python +{!docs/guides/extensions/curator/scripts/grid_session_operations.py!lines=69-78} +``` -# Clean up the Grid session -grid.delete() +--- -# Re-fetch the RecordSet to get the updated validation_file_handle_id -record_set = RecordSet(id=record_set.id).get() -``` +## 5. Check validation results -**Important**: The `validation_file_handle_id` attribute is only populated after a Grid export operation. Until then, `get_detailed_validation_results()` will return `None`. +There are two ways to check whether metadata passes schema validation: -### Getting detailed validation results +### Option A: Pre-commit validation (WebSocket snapshot) -After exporting from a Grid session with a bound schema, Synapse automatically validates each row against the schema and generates a detailed validation report. Here's how to retrieve and analyze those results: +Get per-row validation results from an active grid session **without committing changes**. This connects via WebSocket, reads the current grid state, and returns validation data. 
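Validation runs asynchronously on the backend, so a snapshot taken right after an import can still contain rows marked `pending`. For pipelines, a polling helper avoids acting on incomplete results. This is a sketch under stated assumptions: it only requires that the snapshot expose a `rows` list whose items carry a `validation_status` attribute, as described in the note in this section; adjust names to match the actual API:

```python
import time


def wait_for_validation(fetch_snapshot, timeout=60.0, poll_interval=2.0):
    """Poll until no row in the snapshot is still pending validation.

    `fetch_snapshot` is a zero-argument callable (e.g. `grid.get_snapshot`)
    returning an object with a `rows` list whose items carry a
    `validation_status` attribute (names assumed from this guide).
    """
    deadline = time.monotonic() + timeout
    while True:
        snapshot = fetch_snapshot()
        if all(
            getattr(row, "validation_status", None) != "pending"
            for row in snapshot.rows
        ):
            return snapshot
        if time.monotonic() >= deadline:
            raise TimeoutError(f"validation still pending after {timeout:.0f}s")
        time.sleep(poll_interval)


# Usage against a live session (not run here):
# snapshot = wait_for_validation(grid.get_snapshot)
```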
```python -from synapseclient import Synapse -from synapseclient.models import RecordSet - -syn = Synapse() -syn.login() - -# After Grid export (either via web UI or programmatically) -# retrieve the updated RecordSet -record_set = RecordSet(id="syn987654321").get() - -# Get detailed validation results as a pandas DataFrame -validation_results = record_set.get_detailed_validation_results() - -if validation_results is not None: - print(f"Total rows validated: {len(validation_results)}") - - # Filter for valid and invalid rows - valid_rows = validation_results[validation_results['is_valid'] == True] - invalid_rows = validation_results[validation_results['is_valid'] == False] - - print(f"Valid rows: {len(valid_rows)}") - print(f"Invalid rows: {len(invalid_rows)}") - - # Display details of any validation errors - if len(invalid_rows) > 0: - print("\nRows with validation errors:") - for idx, row in invalid_rows.iterrows(): - print(f"\nRow {row['row_index']}:") - print(f" Error: {row['validation_error_message']}") - print(f" ValidationError: {row['all_validation_messages']}") -else: - print("No validation results available. The Grid session must be exported to generate validation results.") +{!docs/guides/extensions/curator/scripts/precommit_validation.py!lines=7-29} ``` -### Example: Complete validation workflow for animal study metadata +**When to use:** You want to check validation before committing changes. This is useful for automated pipelines that import data, validate, and only commit if validation passes. + +**Note:** If you call `get_snapshot()` immediately after importing CSV data, some rows may show `validation_status = "pending"` while the backend processes validation. Wait briefly and retry if needed. -This example demonstrates the full workflow from creating a curation task through validating the submitted metadata: +### Option B: Post-commit validation (export to RecordSet) + +Export the grid session back to the RecordSet. 
This commits changes and generates detailed validation results. ```python -from synapseclient import Synapse -from synapseclient.extensions.curator import create_record_based_metadata_task, query_schema_registry -from synapseclient.models import RecordSet -from synapseclient.models.curation import Grid -import pandas as pd -import tempfile -import os -import time - -syn = Synapse() -syn.login() - -# Step 1: Find the schema -schema_uri = query_schema_registry( - synapse_client=syn, - dcc="ad", - datatype="IndividualAnimalMetadataTemplate" -) - -# Step 1.5: Create initial test data with validation examples -# Row 1: VALID - all required fields present and valid -# Row 2: INVALID - missing required field 'genotype' -# Row 3: INVALID - invalid enum value for 'sex' ("other" not in enum) -test_data = pd.DataFrame({ - "individualID": ["ANIMAL001", "ANIMAL002", "ANIMAL003"], - "species": ["Mouse", "Mouse", "Mouse"], - "sex": ["female", "male", "other"], # Row 3: invalid enum - "genotype": ["5XFAD", None, "APOE4KI"], # Row 2: missing required field - "genotypeBackground": ["C57BL/6J", "C57BL/6J", "C57BL/6J"], - "modelSystemName": ["5XFAD", "5XFAD", "APOE4KI"], - "dateBirth": ["2024-01-15", "2024-02-20", "2024-03-10"], - "individualIdSource": ["JAX", "JAX", "JAX"], -}) - -# Create a temporary CSV file with the test data -temp_fd, temp_csv = tempfile.mkstemp(suffix=".csv") -os.close(temp_fd) -test_data.to_csv(temp_csv, index=False) - -# Step 2: Create the curation task (this creates an empty template RecordSet) -record_set, curation_task, data_grid = create_record_based_metadata_task( - synapse_client=syn, - project_id="syn123456789", - folder_id="syn987654321", - record_set_name="AnimalMetadata_Records", - record_set_description="Animal study metadata with validation", - curation_task_name="AnimalMetadata_Validation_Example", - upsert_keys=["individualID"], - instructions="Enter metadata for each animal. 
All required fields must be completed.", - schema_uri=schema_uri, - bind_schema_to_record_set=True, -) - -time.sleep(10) - -print(f"Curation task created with ID: {curation_task.task_id}") -print(f"RecordSet created with ID: {record_set.id}") - -# Step 2.5: Upload the test data to the RecordSet -record_set = RecordSet(id=record_set.id).get(synapse_client=syn) -print("\nUploading test data to RecordSet...") -record_set.path = temp_csv -record_set = record_set.store(synapse_client=syn) -print(f"Test data uploaded to RecordSet {record_set.id}") - -# Step 3: Collaborators enter data via the web UI, OR you can create/export a Grid programmatically -# For demonstration, here's the programmatic approach: -print("\nCreating Grid session for data entry...") -grid = Grid(record_set_id=record_set.id).create() -print("Grid session created. Users can now enter data.") - -# After data entry is complete (either via web UI or programmatically), -# export the Grid to generate validation results -print("\nExporting Grid to RecordSet to generate validation results...") -grid.export_to_record_set() - -# Clean up the Grid session -grid.delete() -print("Grid session exported and deleted.") - -# Step 4: Refresh the RecordSet to get the latest validation results -print("\nRefreshing RecordSet to retrieve validation results...") -record_set = RecordSet(id=record_set.id).get() - -# Step 5: Analyze validation results -validation_df = record_set.get_detailed_validation_results() - -if validation_df is not None: - # Summary statistics - total_rows = len(validation_df) - valid_count = (validation_df['is_valid'] == True).sum() # noqa: E712 - invalid_count = (validation_df['is_valid'] == False).sum() # noqa: E712 - - print("\n=== Validation Summary ===") - print(f"Total records: {total_rows}") - print(f"Valid records: {valid_count} ({valid_count}/{total_rows})") - print(f"Invalid records: {invalid_count} ({invalid_count}/{total_rows})") - - # Group errors by type for better understanding - if 
invalid_count > 0: - invalid_rows = validation_df[validation_df['is_valid'] == False] # noqa: E712 - - # Export detailed error report for review - error_report = invalid_rows[['row_index', 'validation_error_message', 'all_validation_messages']] - error_report_path = "validation_errors_report.csv" - error_report.to_csv(error_report_path, index=False) - print(f"\nDetailed error report saved to: {error_report_path}") - - # Show first few errors as examples - print("\n=== Sample Validation Errors ===") - for idx, row in error_report.head(3).iterrows(): - print(f"\nRow {row['row_index']}:") - print(f" Error: {row['validation_error_message']}") - print(f" ValidationError: {row['all_validation_messages']}") - -# Clean up temporary file -if os.path.exists(temp_csv): - os.unlink(temp_csv) +{!docs/guides/extensions/curator/scripts/postcommit_validation.py!lines=7-34} ``` -In this example you would expect to get results like: +**When to use:** You want committed validation results with full detail. The RecordSet's `get_detailed_validation_results()` returns a pandas DataFrame with row-level error messages. -``` -=== Sample Validation Errors === +--- -Row 0: - Error: expected type: String, found: Long - ValidationError: ["#/dateBirth: expected type: String, found: Long"] +## 6. 
Manage curation tasks -Row 1: - Error: 2 schema violations found - ValidationError: ["#/genotype: expected type: String, found: Null","#/dateBirth: expected type: String, found: Long"] +### List tasks in a project -Row 2: - Error: 2 schema violations found - ValidationError: ["#/dateBirth: expected type: String, found: Long","#/sex: other is not a valid enum value"] +```python +{!docs/guides/extensions/curator/scripts/manage_tasks.py!lines=7-18} ``` -**Key points about validation results:** - -- **Automatic generation**: Validation results are created automatically when you export data from a Grid session with a bound schema -- **Row-level detail**: Each row in your RecordSet gets its own validation status and error messages -- **Multiple violations**: The `all_validation_messages` column contains all schema violations for a row, not just the first one -- **Iterative correction**: Use the validation results to identify issues, make corrections in the Grid, export again, and re-validate +### Update a task -### When validation results are available +```python +{!docs/guides/extensions/curator/scripts/manage_tasks.py!lines=20-22} +``` -Validation results are only available after: -1. A JSON schema has been bound to the RecordSet (set `bind_schema_to_record_set=True` when creating the task) -2. Data has been entered through a Grid session -3. **The Grid session has been exported back to the RecordSet** - This is the critical step that triggers validation and populates the `validation_file_handle_id` attribute +### Delete a task -The export can happen in two ways: -- **Via the Synapse web UI**: Users click the export/save button in the Grid interface -- **Programmatically**: Call `grid.export_to_record_set()` after creating a Grid session +```python +{!docs/guides/extensions/curator/scripts/manage_tasks.py!lines=24-29} +``` -If `get_detailed_validation_results()` returns `None`, the most common reason is that the Grid session hasn't been exported yet. 
Check that `record_set.validation_file_handle_id` is not `None` after exporting. +When `delete_file_view=True`, the task's associated EntityView is also deleted. This only applies to file-based metadata tasks. Record-based tasks do not have an EntityView. -## Additional utilities +--- -### Validate schema binding on folders +## 7. Validate folder annotations -Use this script to verify the schema on a folder against the items contained within that folder: +For file-based workflows, you can validate annotations on files within a folder: ```python -from synapseclient import Synapse -from synapseclient.models import Folder +{!docs/guides/extensions/curator/scripts/validate_folder.py!lines=7-22} +``` -# The Synapse ID of the entity you want to bind the JSON Schema to. This should be the ID of a Folder where you want to enforce the schema. -FOLDER_ID = "" +--- -syn = Synapse() -syn.login() +## Complete example: Programmatic CSV upload and validation -folder = Folder(id=FOLDER_ID).get() -schema_validation = folder.validate_schema() +This example demonstrates the full workflow for power users who work entirely through the Python client without the grid UI: -print(f"Schema validation result for folder {FOLDER_ID}: {schema_validation}") +```python +{!docs/guides/extensions/curator/scripts/full_csv_workflow.py!lines=7-62} ``` -### List existing curation tasks +--- -Use this script to see all curation tasks in a project: +## API reference -```python -from pprint import pprint -from synapseclient import Synapse -from synapseclient.models.curation import CurationTask +### Curation task creation -PROJECT_ID = "" # The Synapse ID of the project to list tasks from +- [create_record_based_metadata_task][synapseclient.extensions.curator.create_record_based_metadata_task] +- [create_file_based_metadata_task][synapseclient.extensions.curator.create_file_based_metadata_task] +- [query_schema_registry][synapseclient.extensions.curator.query_schema_registry] -syn = Synapse() -syn.login() 
+### Grid session management -for curation_task in CurationTask.list( - project_id=PROJECT_ID -): - pprint(curation_task) -``` +- [Grid.create][synapseclient.models.grid.Grid.create] +- [Grid.import_csv][synapseclient.models.grid.Grid.import_csv] +- [Grid.download_csv][synapseclient.models.grid.Grid.download_csv] +- [Grid.synchronize][synapseclient.models.grid.Grid.synchronize] +- [Grid.export_to_record_set][synapseclient.models.grid.Grid.export_to_record_set] +- [Grid.get_snapshot][synapseclient.models.grid.Grid.get_snapshot] +- [Grid.get_validation][synapseclient.models.grid.Grid.get_validation] +- [Grid.delete][synapseclient.models.grid.Grid.delete] +- [Grid.list][synapseclient.models.grid.Grid.list] + +### Curation task management -## References +- [CurationTask.store][synapseclient.models.CurationTask.store] +- [CurationTask.get][synapseclient.models.CurationTask.get] +- [CurationTask.delete][synapseclient.models.CurationTask.delete] +- [CurationTask.list][synapseclient.models.CurationTask.list] -### API Documentation +### Validation -- [query_schema_registry][synapseclient.extensions.curator.query_schema_registry] - Search for schemas in the registry -- [create_record_based_metadata_task][synapseclient.extensions.curator.create_record_based_metadata_task] - Create RecordSet-based curation workflows -- [create_file_based_metadata_task][synapseclient.extensions.curator.create_file_based_metadata_task] - Create EntityView-based curation workflows -- [RecordSet.get_detailed_validation_results][synapseclient.models.RecordSet.get_detailed_validation_results] - Get detailed validation results for RecordSet data -- [Grid.create][synapseclient.models.curation.Grid.create] - Create a Grid session from a RecordSet -- [Grid.export_to_record_set][synapseclient.models.curation.Grid.export_to_record_set] - Export Grid data back to RecordSet and generate validation results -- [Folder.bind_schema][synapseclient.models.Folder.bind_schema] - Bind schemas to folders -- 
[Folder.validate_schema][synapseclient.models.Folder.validate_schema] - Validate folder schema compliance -- [CurationTask.list][synapseclient.models.CurationTask.list] - List curation tasks in a project +- [RecordSet.get_detailed_validation_results][synapseclient.models.RecordSet.get_detailed_validation_results] +- [Folder.get_schema_validation_statistics][synapseclient.models.Folder.get_schema_validation_statistics] +- [Folder.get_invalid_validation][synapseclient.models.Folder.get_invalid_validation] -### Related Documentation +### Related guides -- [JSON Schema Tutorial](../../../tutorials/python/json_schema.md) - Learn how to register schemas -- [Schema Registry](https://synapse.org/Synapse:syn69735275/tables/) - Browse available schemas +- [Schema Operations](schema_operations.md) - Generate and register JSON schemas +- [JSON Schema Tutorial](../../../tutorials/python/json_schema.md) - Learn JSON schema basics +- [Curator Data Model](../../../explanations/curator_data_model.md) - CSV data model format diff --git a/docs/guides/extensions/curator/scripts/full_csv_workflow.py b/docs/guides/extensions/curator/scripts/full_csv_workflow.py new file mode 100644 index 000000000..012112ec4 --- /dev/null +++ b/docs/guides/extensions/curator/scripts/full_csv_workflow.py @@ -0,0 +1,59 @@ +""" +Script: Complete programmatic CSV upload and validation workflow. +Demonstrates the full end-to-end flow for power users who work +entirely through the Python client without the grid UI. +""" + +from synapseclient import Synapse +from synapseclient.extensions.curator import ( + create_record_based_metadata_task, + query_schema_registry, +) +from synapseclient.models import Grid + +syn = Synapse() +syn.login() + +# 1. 
Find schema and create curation task +schema_uri = query_schema_registry( + synapse_client=syn, dcc="ad", datatype="IndividualAnimalMetadataTemplate" +) + +record_set, curation_task, _ = create_record_based_metadata_task( + synapse_client=syn, + project_id="syn123456789", + folder_id="syn987654321", + record_set_name="StudyMetadata", + record_set_description="Animal study metadata", + curation_task_name="StudyMetadata_Curation", + upsert_keys=["individualID"], + instructions="Complete all required fields.", + schema_uri=schema_uri, + bind_schema_to_record_set=True, +) + +# 2. Import CSV data into a grid session +# Column schema is auto-derived from the CSV header and the +# JSON schema bound to the grid. +grid = Grid(record_set_id=record_set.id).create() +grid = grid.import_csv(path="metadata.csv") +print(f"Imported {grid.csv_import_total_count} rows") + +# 3. Check validation before committing +snapshot = grid.get_snapshot() +summary = snapshot.validation_summary +print(f"Validation: {summary['valid']}/{summary['total']} valid") + +if summary["invalid"] > 0: + print("Validation errors found:") + for row in snapshot.rows: + if row.validation and not row.validation.is_valid: + print(f" Row {row.row_id}: " f"{row.validation.validation_error_message}") + # Fix errors and re-import if needed... + +# 4. Commit when ready +grid = grid.export_to_record_set() +print(f"Exported to RecordSet version {grid.record_set_version_number}") + +# 5. Clean up +grid.delete() diff --git a/docs/guides/extensions/curator/scripts/grid_session_operations.py b/docs/guides/extensions/curator/scripts/grid_session_operations.py new file mode 100644 index 000000000..d77b5e81f --- /dev/null +++ b/docs/guides/extensions/curator/scripts/grid_session_operations.py @@ -0,0 +1,80 @@ +""" +Script: Working with Grid sessions. +Covers creating sessions, importing CSV data, downloading data, +synchronizing changes, and listing/deleting sessions. 
+""" + +from synapseclient import Synapse +from synapseclient.models import Grid, Query + +syn = Synapse() +syn.login() + +# Create a Grid session from a RecordSet +grid = Grid(record_set_id="syn987654321") +grid = grid.create() +print(f"Grid session: {grid.session_id}") + +# Or create a Grid session from an EntityView query +grid_from_query = Grid(initial_query=Query(sql="SELECT * FROM syn123456789")) +grid_from_query = grid_from_query.create() + +# Import a CSV from a local file path. +# Column names are read from the CSV header and types are resolved +# from the JSON schema bound to the grid session automatically. +grid = grid.import_csv(path="path/to/metadata.csv") + +print(f"Imported {grid.csv_import_total_count} rows") +print(f" Created: {grid.csv_import_created_count}") +print(f" Updated: {grid.csv_import_updated_count}") + +# Or import directly from a pandas DataFrame +import pandas as pd + +df = pd.DataFrame( + { + "individualID": ["ANIMAL001", "ANIMAL002"], + "species": ["Mouse", "Mouse"], + "sex": ["female", "male"], + "genotype": ["5XFAD", "APOE4KI"], + } +) + +grid = grid.import_csv(dataframe=df) + +# You can also provide an explicit schema to override auto-derivation: +from synapseclient.models import Column, ColumnType + +schema = [ + Column(name="individualID", column_type=ColumnType.STRING), + Column(name="species", column_type=ColumnType.STRING), + Column(name="sex", column_type=ColumnType.STRING), + Column(name="genotype", column_type=ColumnType.STRING), +] + +grid = grid.import_csv(path="path/to/metadata.csv", schema=schema) + +# Download grid data as a local CSV file +file_path = grid.download_csv(download_location="/tmp") +print(f"Downloaded grid data to: {file_path}") + +# Synchronize grid changes with the data source +grid = grid.synchronize() + +if grid.synchronize_error_messages: + print("Synchronization errors:") + for msg in grid.synchronize_error_messages: + print(f" - {msg}") +else: + print("Synchronization successful") + +# List all 
active grid sessions +for session in Grid.list(): + print(f"Session: {session.session_id}, Source: {session.source_entity_id}") + +# List sessions for a specific source +for session in Grid.list(source_id="syn987654321"): + print(f"Session: {session.session_id}") + +# Delete a grid session +grid.delete() diff --git a/docs/guides/extensions/curator/scripts/manage_tasks.py b/docs/guides/extensions/curator/scripts/manage_tasks.py new file mode 100644 index 000000000..c3687e8db --- /dev/null +++ b/docs/guides/extensions/curator/scripts/manage_tasks.py @@ -0,0 +1,30 @@ +""" +Script: Managing curation tasks. +Covers listing, updating, and deleting curation tasks. +""" + +from synapseclient import Synapse +from synapseclient.models import CurationTask + +syn = Synapse() +syn.login() + +# List tasks in a project +for task in CurationTask.list(project_id="syn123456789"): + print(f"Task {task.task_id}: {task.data_type}") + print(f" Instructions: {task.instructions}") + if task.assignee_principal_id: + print(f" Assigned to: {task.assignee_principal_id}") + +# Update a task +task = CurationTask(task_id=42).get() +task.instructions = "Updated instructions for data contributors" +task = task.store() + +# Delete a task (simple) +task = CurationTask(task_id=42) +task.delete() + +# Delete a task and clean up the associated EntityView (file-based only) +task = CurationTask(task_id=42) +task.delete(delete_file_view=True) diff --git a/docs/guides/extensions/curator/scripts/metadata.csv b/docs/guides/extensions/curator/scripts/metadata.csv new file mode 100644 index 000000000..cade57de5 --- /dev/null +++ b/docs/guides/extensions/curator/scripts/metadata.csv @@ -0,0 +1,9 @@ +individualID,species,sex,genotype,modelSystemName,ageDeath,ageDeathUnit,brainWeight,tissueWeight,bedding,waterpH,lightCycle,roomTemperature,roomHumidity +IND-001,Mus musculus,male,5XFAD/WT,5XFAD,6,months,0.42,0.38,corn cob,7.0,12/12,22,50 +IND-002,Mus musculus,female,5XFAD/WT,5XFAD,6,months,0.41,0.37,corn 
cob,7.0,12/12,22,50 +IND-003,Mus musculus,male,WT/WT,wildtype,6,months,0.44,0.40,corn cob,7.0,12/12,22,50 +IND-004,Mus musculus,female,WT/WT,wildtype,6,months,0.43,0.39,corn cob,7.0,12/12,22,50 +IND-005,Mus musculus,male,5XFAD/WT,5XFAD,12,months,0.40,0.36,corn cob,7.0,12/12,22,52 +IND-006,Mus musculus,female,5XFAD/WT,5XFAD,12,months,0.39,0.35,corn cob,7.0,12/12,22,52 +IND-007,Mus musculus,male,WT/WT,wildtype,12,months,0.45,0.41,corn cob,7.0,12/12,22,52 +IND-008,Mus musculus,female,WT/WT,wildtype,12,months,0.44,0.40,corn cob,7.0,12/12,22,52 diff --git a/docs/guides/extensions/curator/scripts/postcommit_validation.py b/docs/guides/extensions/curator/scripts/postcommit_validation.py new file mode 100644 index 000000000..51c23c4a3 --- /dev/null +++ b/docs/guides/extensions/curator/scripts/postcommit_validation.py @@ -0,0 +1,34 @@ +""" +Script: Post-commit validation via RecordSet export. +Exports grid session to RecordSet (commits changes) and retrieves +detailed per-row validation results. 
+""" + +from synapseclient import Synapse +from synapseclient.models import Grid, RecordSet + +syn = Synapse() +syn.login() + +grid = Grid(record_set_id="syn987654321") +grid = grid.create() + +# Export to RecordSet (commits changes + generates validation) +grid = grid.export_to_record_set() + +if grid.validation_summary_statistics: + stats = grid.validation_summary_statistics + print(f"Valid: {stats.number_of_valid_children}") + print(f"Invalid: {stats.number_of_invalid_children}") + +# Clean up the grid session +grid.delete() + +# Get detailed per-row validation from the RecordSet +record_set = RecordSet(id="syn987654321").get() +validation_df = record_set.get_detailed_validation_results() + +if validation_df is not None: + invalid = validation_df[validation_df["is_valid"] == False] # noqa: E712 + for _, row in invalid.iterrows(): + print(f"Row {row['row_index']}: {row['validation_error_message']}") diff --git a/docs/guides/extensions/curator/scripts/precommit_validation.py b/docs/guides/extensions/curator/scripts/precommit_validation.py new file mode 100644 index 000000000..d049b75e7 --- /dev/null +++ b/docs/guides/extensions/curator/scripts/precommit_validation.py @@ -0,0 +1,29 @@ +""" +Script: Pre-commit validation via WebSocket snapshot. +Gets per-row validation results from an active grid session +WITHOUT committing changes. 
+""" + +from synapseclient import Synapse +from synapseclient.models import Grid + +syn = Synapse() +syn.login() + +grid = Grid(record_set_id="syn987654321") +grid = grid.create() + +# (Import data into the grid first — see grid_session_operations.py) + +# Get validation results without committing +snapshot = grid.get_snapshot() + +print(f"Validation summary: {snapshot.validation_summary}") +# Example output: {'total': 100, 'valid': 85, 'invalid': 12, 'pending': 3} + +# Inspect individual row validation +for row in snapshot.rows: + if row.validation and not row.validation.is_valid: + print(f"Row {row.row_id}: {row.validation.validation_error_message}") + for msg in row.validation.all_validation_messages or []: + print(f" - {msg}") diff --git a/docs/guides/extensions/curator/scripts/setup_and_create_tasks.py b/docs/guides/extensions/curator/scripts/setup_and_create_tasks.py new file mode 100644 index 000000000..ef8e6b992 --- /dev/null +++ b/docs/guides/extensions/curator/scripts/setup_and_create_tasks.py @@ -0,0 +1,63 @@ +""" +Script: Setting up curation workflows. +Covers authentication, schema lookup, and creating both +record-based and file-based curation tasks. 
+""" + +from synapseclient import Synapse +from synapseclient.extensions.curator import ( + create_file_based_metadata_task, + create_record_based_metadata_task, + query_schema_registry, +) + +syn = Synapse() +syn.login() + +# Find the latest schema for a specific data type +schema_uri = query_schema_registry( + synapse_client=syn, + dcc="ad", + datatype="IndividualAnimalMetadataTemplate", +) +print(f"Schema URI: {schema_uri}") + +# Browse all available versions of a schema +all_schemas = query_schema_registry( + synapse_client=syn, + dcc="ad", + datatype="IndividualAnimalMetadataTemplate", + return_latest_only=False, +) + +# Create a record-based curation task +record_set, curation_task, data_grid = create_record_based_metadata_task( + synapse_client=syn, + project_id="syn123456789", + folder_id="syn987654321", + record_set_name="AnimalStudy_Records", + record_set_description="Metadata for animal study specimens", + curation_task_name="AnimalStudy_Curation", + upsert_keys=["individualID"], + instructions="Complete all required fields for each animal.", + schema_uri=schema_uri, + bind_schema_to_record_set=True, + assignee_principal_id="123456", # Optional: assign to user or team +) + +print(f"RecordSet: {record_set.id}") +print(f"CurationTask: {curation_task.task_id}") + +# Create a file-based curation task +entity_view_id, task_id = create_file_based_metadata_task( + synapse_client=syn, + folder_id="syn987654321", + curation_task_name="FileAnnotations_Curation", + instructions="Annotate each file according to the schema.", + entity_view_name="Animal Study Files View", + schema_uri=schema_uri, + assignee_principal_id="123456", # Optional +) + +print(f"EntityView: {entity_view_id}") +print(f"CurationTask: {task_id}") diff --git a/docs/guides/extensions/curator/scripts/validate_folder.py b/docs/guides/extensions/curator/scripts/validate_folder.py new file mode 100644 index 000000000..25d90df01 --- /dev/null +++ 
b/docs/guides/extensions/curator/scripts/validate_folder.py @@ -0,0 +1,22 @@ +""" +Script: Validating folder annotations. +For file-based workflows, validates annotations on files +within a schema-bound folder. +""" + +from synapseclient import Synapse +from synapseclient.models import Folder + +syn = Synapse() +syn.login() + +folder = Folder(id="syn987654321").get() + +# Get summary statistics +stats = folder.get_schema_validation_statistics() +print(f"Valid: {stats.number_of_valid_children}") +print(f"Invalid: {stats.number_of_invalid_children}") + +# Get details for invalid files +for result in folder.get_invalid_validation(): + print(f"Entity {result.object_id}: {result.validation_error_message}") diff --git a/docs/reference/experimental/async/curator.md b/docs/reference/experimental/async/curator.md index bf292948b..19f5f0338 100644 --- a/docs/reference/experimental/async/curator.md +++ b/docs/reference/experimental/async/curator.md @@ -56,6 +56,13 @@ at your own risk. members: - create_async - export_to_record_set_async + - import_csv_async + - download_csv_async + - synchronize_async + - get_snapshot_async + - get_validation_async + - delete_async + - list_async --- [](){ #query-reference-async } ::: synapseclient.models.Query diff --git a/docs/reference/experimental/sync/curator.md b/docs/reference/experimental/sync/curator.md index b02244aab..b0c9b2c2a 100644 --- a/docs/reference/experimental/sync/curator.md +++ b/docs/reference/experimental/sync/curator.md @@ -56,6 +56,31 @@ at your own risk. 
members: - create - export_to_record_set + - import_csv + - download_csv + - synchronize + - get_snapshot + - get_validation + - delete + - list +--- +[](){ #grid-snapshot-reference } +::: synapseclient.models.GridSnapshot + options: + inherited_members: true + members: +--- +[](){ #grid-row-reference } +::: synapseclient.models.GridRow + options: + inherited_members: true + members: +--- +[](){ #grid-row-validation-reference } +::: synapseclient.models.GridRowValidation + options: + inherited_members: true + members: --- [](){ #query-reference } ::: synapseclient.models.Query diff --git a/setup.cfg b/setup.cfg index 2eb6645ee..a0121d90c 100644 --- a/setup.cfg +++ b/setup.cfg @@ -62,6 +62,8 @@ install_requires = async-lru~=2.0.4 psutil>=5.9.8 setuptools>=80.10.1 + websockets>=12.0 + cbor2>=5.0 tests_require = pytest~=8.2.0 pytest-mock>=3.0,<4.0 diff --git a/synapseclient/api/__init__.py b/synapseclient/api/__init__.py index 2f9e454ea..ff6b29071 100644 --- a/synapseclient/api/__init__.py +++ b/synapseclient/api/__init__.py @@ -109,6 +109,7 @@ list_form_data, list_form_data_sync, ) +from .grid_services import create_grid_replica, get_grid_presigned_url, get_grid_session from .json_schema_services import ( bind_json_schema_to_entity, create_organization, @@ -318,6 +319,10 @@ "list_curation_tasks", "list_grid_sessions", "update_curation_task", + # grid_services + "create_grid_replica", + "get_grid_presigned_url", + "get_grid_session", # docker_commit_services "get_docker_tag", # docker_services diff --git a/synapseclient/api/grid_services.py b/synapseclient/api/grid_services.py new file mode 100644 index 000000000..015dc7145 --- /dev/null +++ b/synapseclient/api/grid_services.py @@ -0,0 +1,112 @@ +"""API services for Grid session operations. + +This module provides low-level async functions for grid replica +and presigned URL management. 
+ +https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/web/controller/GridController.html +""" + +import json +from typing import TYPE_CHECKING, Any, Dict, Optional + +if TYPE_CHECKING: + from synapseclient import Synapse + + +async def create_grid_replica( + session_id: str, + *, + synapse_client: Optional["Synapse"] = None, +) -> Dict[str, Any]: + """ + Create a new grid replica for a grid session. + + A replica is an in-memory document that represents a 'copy' of the grid. + Each replica is identified by a unique replicaId. + + https://rest-docs.synapse.org/rest/POST/grid/session/sessionId/replica.html + + Arguments: + session_id: The ID of the grid session. + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last + created instance from the Synapse class constructor. + + Returns: + CreateReplicaResponse containing the replica information. + """ + from synapseclient import Synapse + + client = Synapse.get_client(synapse_client=synapse_client) + + request_body = {"gridSessionId": session_id} + + return await client.rest_post_async( + uri=f"/grid/session/{session_id}/replica", + body=json.dumps(request_body), + ) + + +async def get_grid_presigned_url( + session_id: str, + replica_id: int, + *, + synapse_client: Optional["Synapse"] = None, +) -> str: + """ + Create a presigned URL to establish a WebSocket connection with a grid + session. The presigned URL will expire 15 minutes after it is issued. + + https://rest-docs.synapse.org/rest/POST/grid/session/sessionId/presigned/url.html + + Arguments: + session_id: The ID of the grid session. + replica_id: The replica ID that will use this WebSocket connection. + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last + created instance from the Synapse class constructor. + + Returns: + The presigned WebSocket URL string. 
+ """ + from synapseclient import Synapse + + client = Synapse.get_client(synapse_client=synapse_client) + + request_body = { + "gridSessionId": session_id, + "replicaId": replica_id, + } + + response = await client.rest_post_async( + uri=f"/grid/session/{session_id}/presigned/url", + body=json.dumps(request_body), + ) + + return response.get("presignedUrl", "") + + +async def get_grid_session( + session_id: str, + *, + synapse_client: Optional["Synapse"] = None, +) -> Dict[str, Any]: + """ + Get the basic information about an existing grid session. + + https://rest-docs.synapse.org/rest/GET/grid/session/sessionId.html + + Arguments: + session_id: The ID of the grid session. + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last + created instance from the Synapse class constructor. + + Returns: + GridSession information. + """ + from synapseclient import Synapse + + client = Synapse.get_client(synapse_client=synapse_client) + + return await client.rest_get_async(uri=f"/grid/session/{session_id}") diff --git a/synapseclient/core/constants/concrete_types.py b/synapseclient/core/constants/concrete_types.py index fba11dbdb..61a16eb13 100644 --- a/synapseclient/core/constants/concrete_types.py +++ b/synapseclient/core/constants/concrete_types.py @@ -128,3 +128,8 @@ LIST_GRID_SESSIONS_RESPONSE = ( "org.sagebionetworks.repo.model.grid.ListGridSessionsResponse" ) +GRID_CSV_IMPORT_REQUEST = "org.sagebionetworks.repo.model.grid.GridCsvImportRequest" +DOWNLOAD_FROM_GRID_REQUEST = ( + "org.sagebionetworks.repo.model.grid.DownloadFromGridRequest" +) +SYNCHRONIZE_GRID_REQUEST = "org.sagebionetworks.repo.model.grid.SynchronizeGridRequest" diff --git a/synapseclient/core/grid_crdt_decoder.py b/synapseclient/core/grid_crdt_decoder.py new file mode 100644 index 000000000..d9898a254 --- /dev/null +++ b/synapseclient/core/grid_crdt_decoder.py @@ -0,0 +1,623 @@ +""" +Decoder for json-joy indexed binary CRDT 
snapshots. + +This module provides a minimal Python implementation of the json-joy CRDT +snapshot decoder, sufficient to extract grid data and per-row validation +results from a Synapse grid session. + +The decoder handles: +- CBOR outer layer decoding +- Clock table decoding for session ID mapping +- Node type decoding: CON, VAL, OBJ, VEC, ARR +- Tombstone filtering in ARR nodes (deleted rows) + +Reference implementation: + json-joy/lib/json-crdt/codec/indexed/binary/Decoder.js + +This does NOT implement full CRDT semantics (no patch application, no +conflict resolution). It is a read-only snapshot decoder. +""" + +from dataclasses import dataclass +from typing import Any, Dict, List, Optional, Tuple + +import cbor2 + +from synapseclient.models.grid_query import GridRow, GridRowValidation, GridSnapshot + + +def _cbor_item_byte_length(data: bytes, offset: int = 0) -> int: + """Compute the exact byte length of one CBOR item starting at offset. + + This allows us to slice exactly one CBOR item from a buffer containing + multiple items, then decode it with cbor2.loads() without consuming + trailing data. + + Arguments: + data: The buffer containing CBOR data. + offset: Starting position in the buffer. + + Returns: + The number of bytes that the first CBOR item occupies. 
+ """ + if offset >= len(data): + raise ValueError("No CBOR data at offset") + + initial_byte = data[offset] + major_type = initial_byte >> 5 + additional_info = initial_byte & 0x1F + pos = offset + 1 + + # Decode the argument (length/value) from additional info + if additional_info < 24: + argument = additional_info + elif additional_info == 24: + argument = data[pos] + pos += 1 + elif additional_info == 25: + argument = int.from_bytes(data[pos : pos + 2], "big") + pos += 2 + elif additional_info == 26: + argument = int.from_bytes(data[pos : pos + 4], "big") + pos += 4 + elif additional_info == 27: + argument = int.from_bytes(data[pos : pos + 8], "big") + pos += 8 + elif additional_info == 31: + # Indefinite length - scan for break code (0xFF) + if major_type in (2, 3): # byte/text string chunks + while data[pos] != 0xFF: + chunk_len = _cbor_item_byte_length(data, pos) + pos += chunk_len + pos += 1 # skip 0xFF break + return pos - offset + elif major_type in (4, 5): # array/map + while data[pos] != 0xFF: + pos += _cbor_item_byte_length(data, pos) + if major_type == 5: + pos += _cbor_item_byte_length(data, pos) + pos += 1 # skip 0xFF break + return pos - offset + else: + # Simple break or other - just the initial byte + return 1 + else: + # Reserved (28-30) - treat as 0 argument + argument = 0 + + # Major types 0, 1: unsigned/negative int - no payload beyond argument + if major_type in (0, 1): + return pos - offset + + # Major types 2, 3: byte/text string - argument is the string length + if major_type in (2, 3): + return (pos - offset) + argument + + # Major type 4: array - argument is number of items + if major_type == 4: + for _ in range(argument): + pos += _cbor_item_byte_length(data, pos) + return pos - offset + + # Major type 5: map - argument is number of key-value pairs + if major_type == 5: + for _ in range(argument): + pos += _cbor_item_byte_length(data, pos) # key + pos += _cbor_item_byte_length(data, pos) # value + return pos - offset + + # Major type 
6: tag - argument is tag number, followed by one item + if major_type == 6: + pos += _cbor_item_byte_length(data, pos) + return pos - offset + + # Major type 7: simple values and floats + if major_type == 7: + return pos - offset + + return pos - offset + + +# CRDT major type constants (upper 3 bits of node type octet) +CRDT_CON = 0 # Constant (immutable value) +CRDT_VAL = 1 # Value (LWW register) +CRDT_OBJ = 2 # Object (LWW map) +CRDT_VEC = 3 # Vector (fixed-size LWW array) +CRDT_STR = 4 # String (RGA) +CRDT_BIN = 5 # Binary (RGA) +CRDT_ARR = 6 # Array (RGA) + + +@dataclass +class Timestamp: + """A logical CRDT timestamp.""" + + sid: int # Session ID + time: int # Logical clock time + + +@dataclass +class ClockEntry: + """An entry in the clock table.""" + + sid: int + time: int + + +@dataclass +class ArrChunk: + """A chunk in an RGA array, may be a tombstone.""" + + id: Timestamp + length: int + deleted: bool + data: Optional[List[Timestamp]] = None + + +@dataclass +class CrdtNode: + """A decoded CRDT node.""" + + id: Timestamp + node_type: int + value: Any = None + children: Optional[Dict[str, Timestamp]] = None + elements: Optional[List[Optional[Timestamp]]] = None + chunks: Optional[List[ArrChunk]] = None + + +class BinaryReader: + """Reader for binary data with position tracking.""" + + def __init__(self, data: bytes): + self.data = data + self.pos = 0 + + def u8(self) -> int: + """Read unsigned 8-bit integer.""" + val = self.data[self.pos] + self.pos += 1 + return val + + def vu57(self) -> int: + """Read variable-length unsigned integer (up to 57 bits). + + Each byte contributes 7 data bits. MSB indicates continuation. + """ + result = 0 + shift = 0 + while True: + byte = self.u8() + result |= (byte & 0x7F) << shift + if not (byte & 0x80): + break + shift += 7 + return result + + def b1vu56(self) -> Tuple[int, int]: + """Read 1 flag bit + variable-length 56-bit integer. + + Returns (flag, value) where flag is 0 or 1 (MSB of first byte). 
+ """ + byte = self.u8() + flag = 1 if (byte & 0x80) else 0 + # Lower 6 bits of first byte + result = byte & 0x3F + if not (byte & 0x40): + return (flag, result) + # Continue reading with 7-bit encoding + shift = 6 + while True: + byte = self.u8() + result |= (byte & 0x7F) << shift + if not (byte & 0x80): + break + shift += 7 + return (flag, result) + + def id(self) -> Tuple[int, int]: + """Read a compact session-index + time-diff pair. + + Returns (session_index, time_diff). + """ + byte = self.data[self.pos] + if byte <= 0x7F: + self.pos += 1 + return (byte >> 4, byte & 0x0F) + # Fall back to variable-length encoding + flag_and_val = self.b1vu56() + time_diff = self.vu57() + return (flag_and_val[1], time_diff) + + def reset(self, data: bytes) -> None: + """Reset reader to new data.""" + self.data = data + self.pos = 0 + + @property + def remaining(self) -> int: + """Bytes remaining to read.""" + return len(self.data) - self.pos + + +class GridSnapshotDecoder: + """Decoder for json-joy indexed binary CRDT snapshots. + + Decodes the CRDT model to extract grid data: + - columnNames (vec of con strings) + - rows (arr of obj with data vec + metadata obj) + - rowValidation (con containing ValidationResults) + + Handles tombstones in ARR nodes (deleted rows) by filtering + them during view generation, matching json-joy's ArrNode.view(). + """ + + def __init__(self): + self.clock_table: List[ClockEntry] = [] + self.nodes: Dict[str, CrdtNode] = {} # Keyed by "sid_time" + self.reader = BinaryReader(b"") + self.root_id: Optional[Timestamp] = None + + def decode(self, cbor_data: bytes) -> GridSnapshot: + """Decode CBOR-encoded CRDT snapshot to GridSnapshot. + + Arguments: + cbor_data: Raw CBOR bytes of the snapshot. + + Returns: + GridSnapshot with column names, row data, and validation. + """ + # 1. CBOR decode → IndexedFields dict + fields = cbor2.loads(cbor_data) + + # 2. 
Decode clock table + self._decode_clock_table(fields[b"c"] if b"c" in fields else fields["c"]) + + # 3. Decode root if present + root_key = b"r" if b"r" in fields else "r" + if root_key in fields: + self.reader.reset(fields[root_key]) + self.root_id = self._read_ts() + + # 4. Decode all field nodes + for key, value in fields.items(): + key_str = key.decode("utf-8") if isinstance(key, bytes) else key + if key_str in ("c", "r"): + continue + # Parse field name: "${relativeSidBase36}_${timeBase36}" + node_id = self._parse_field_name(key_str) + self.reader.reset(value) + node = self._decode_node(node_id) + node_key = f"{node_id.sid}_{node_id.time}" + self.nodes[node_key] = node + + # 5. Build GridSnapshot from the CRDT tree + return self._build_snapshot() + + def _decode_clock_table(self, data: bytes) -> None: + """Decode the clock table from binary data.""" + self.reader.reset(data) + length = self.reader.vu57() + self.clock_table = [] + for _ in range(length): + sid = self.reader.vu57() + time = self.reader.vu57() + self.clock_table.append(ClockEntry(sid=sid, time=time)) + + def _read_ts(self) -> Timestamp: + """Read a timestamp from the current reader position.""" + session_index, time_diff = self.reader.id() + if session_index < len(self.clock_table): + entry = self.clock_table[session_index] + return Timestamp(sid=entry.sid, time=time_diff) + return Timestamp(sid=session_index, time=time_diff) + + def _parse_field_name(self, field_name: str) -> Timestamp: + """Parse a field name like '2_10' (base-36) to a Timestamp.""" + underscore_idx = field_name.index("_") + relative_sid = int(field_name[:underscore_idx], 36) + time = int(field_name[underscore_idx + 1 :], 36) + if relative_sid < len(self.clock_table): + entry = self.clock_table[relative_sid] + return Timestamp(sid=entry.sid, time=time) + return Timestamp(sid=relative_sid, time=time) + + def _read_cbor_value(self) -> Any: + """Read exactly one CBOR value from the current reader position. 
+ + Computes the exact byte length of the CBOR item before decoding, + then advances the reader position by exactly that amount. + + Returns: + The decoded CBOR value, or None if no data remains. + """ + remaining = self.reader.data[self.reader.pos :] + if not remaining: + return None + try: + item_length = _cbor_item_byte_length(remaining) + item_bytes = remaining[:item_length] + value = cbor2.loads(item_bytes) + self.reader.pos += item_length + return value + except Exception: + return None + + def _decode_node(self, node_id: Timestamp) -> CrdtNode: + """Decode a single CRDT node from the current reader.""" + octet = self.reader.u8() + major = octet >> 5 # Upper 3 bits + minor = octet & 0x1F # Lower 5 bits + + if major == CRDT_CON: + return self._decode_con(node_id, minor) + elif major == CRDT_VAL: + return self._decode_val(node_id) + elif major == CRDT_OBJ: + return self._decode_obj(node_id, minor) + elif major == CRDT_VEC: + return self._decode_vec(node_id, minor) + elif major == CRDT_STR: + return self._decode_str(node_id, minor) + elif major == CRDT_BIN: + return self._decode_bin(node_id, minor) + elif major == CRDT_ARR: + return self._decode_arr(node_id, minor) + else: + return CrdtNode(id=node_id, node_type=major) + + def _decode_con(self, node_id: Timestamp, length: int) -> CrdtNode: + """Decode a CON (constant) node. + + If length == 0: value is a CBOR-encoded constant. + If length > 0: value is a Timestamp reference. 
+ """ + if length == 0: + # Read exactly one CBOR value using streaming decoder + # to track exact byte consumption + value = self._read_cbor_value() + else: + # Timestamp reference + value = self._read_ts() + + return CrdtNode(id=node_id, node_type=CRDT_CON, value=value) + + def _decode_val(self, node_id: Timestamp) -> CrdtNode: + """Decode a VAL (value/register) node - pointer to another node.""" + child_ts = self._read_ts() + return CrdtNode(id=node_id, node_type=CRDT_VAL, value=child_ts) + + def _decode_obj(self, node_id: Timestamp, length: int) -> CrdtNode: + """Decode an OBJ (object/map) node.""" + children: Dict[str, Timestamp] = {} + for _ in range(length): + # Read key using streaming CBOR decoder for exact byte tracking + key = self._read_cbor_value() + if key is None: + break + # Read value timestamp + val_ts = self._read_ts() + children[str(key)] = val_ts + return CrdtNode(id=node_id, node_type=CRDT_OBJ, children=children) + + def _decode_vec(self, node_id: Timestamp, length: int) -> CrdtNode: + """Decode a VEC (vector/fixed-array) node.""" + elements: List[Optional[Timestamp]] = [] + for _ in range(length): + # Check for null/empty slot + if self.reader.remaining > 0: + el_ts = self._read_ts() + elements.append(el_ts) + else: + elements.append(None) + return CrdtNode(id=node_id, node_type=CRDT_VEC, elements=elements) + + def _decode_arr(self, node_id: Timestamp, length: int) -> CrdtNode: + """Decode an ARR (RGA array) node.""" + chunks: List[ArrChunk] = [] + for _ in range(length): + chunk = self._decode_arr_chunk() + chunks.append(chunk) + return CrdtNode(id=node_id, node_type=CRDT_ARR, chunks=chunks) + + def _decode_arr_chunk(self) -> ArrChunk: + """Decode a single ARR chunk (may be a tombstone).""" + chunk_id = self._read_ts() + deleted_flag, chunk_length = self.reader.b1vu56() + if deleted_flag: + return ArrChunk(id=chunk_id, length=chunk_length, deleted=True) + else: + data = [] + for _ in range(chunk_length): + data.append(self._read_ts()) + 
return ArrChunk( + id=chunk_id, + length=chunk_length, + deleted=False, + data=data, + ) + + def _decode_str(self, node_id: Timestamp, length: int) -> CrdtNode: + """Decode a STR (RGA string) node - treated similarly to ARR.""" + # For our purposes, we skip string internals + return CrdtNode(id=node_id, node_type=CRDT_STR) + + def _decode_bin(self, node_id: Timestamp, length: int) -> CrdtNode: + """Decode a BIN (RGA binary) node.""" + return CrdtNode(id=node_id, node_type=CRDT_BIN) + + # --- Snapshot to GridSnapshot conversion --- + + def _resolve_node(self, ts: Timestamp) -> Optional[CrdtNode]: + """Resolve a Timestamp to its CrdtNode.""" + key = f"{ts.sid}_{ts.time}" + return self.nodes.get(key) + + def _resolve_value(self, ts: Timestamp) -> Any: + """Resolve a Timestamp reference to its final value.""" + node = self._resolve_node(ts) + if node is None: + return None + if node.node_type == CRDT_CON: + return node.value + if node.node_type == CRDT_VAL: + if isinstance(node.value, Timestamp): + return self._resolve_value(node.value) + return node.value + return node + + def _build_snapshot(self) -> GridSnapshot: + """Build a GridSnapshot from the decoded CRDT tree.""" + if self.root_id is None: + return GridSnapshot() + + root_node = self._resolve_node(self.root_id) + if root_node is None or root_node.node_type == CRDT_VAL: + # Root is a VAL pointing to the actual document object + if root_node and isinstance(root_node.value, Timestamp): + root_node = self._resolve_node(root_node.value) + if root_node is None: + return GridSnapshot() + + # Extract column names + column_names = self._extract_column_names(root_node) + + # Extract rows with validation + rows = self._extract_rows(root_node, column_names) + + return GridSnapshot(column_names=column_names, rows=rows) + + def _extract_column_names(self, doc_node: CrdtNode) -> List[str]: + """Extract column names from the document's columnNames vec.""" + if doc_node.children is None: + return [] + + col_names_ts = 
doc_node.children.get("columnNames") + if col_names_ts is None: + return [] + + col_names_node = self._resolve_node(col_names_ts) + if col_names_node is None: + return [] + + # VEC of CON strings + if col_names_node.node_type == CRDT_VEC and col_names_node.elements: + names = [] + for el_ts in col_names_node.elements: + if el_ts is not None: + val = self._resolve_value(el_ts) + names.append(str(val) if val is not None else "") + else: + names.append("") + return names + + return [] + + def _extract_rows( + self, doc_node: CrdtNode, column_names: List[str] + ) -> List[GridRow]: + """Extract rows from the document's rows arr.""" + if doc_node.children is None: + return [] + + rows_ts = doc_node.children.get("rows") + if rows_ts is None: + return [] + + rows_node = self._resolve_node(rows_ts) + if rows_node is None: + return [] + + # ARR of row objects - filter tombstones + if rows_node.node_type != CRDT_ARR or not rows_node.chunks: + return [] + + grid_rows: List[GridRow] = [] + for chunk in rows_node.chunks: + if chunk.deleted: + continue # Skip tombstones + if chunk.data: + for row_ts in chunk.data: + row = self._extract_single_row(row_ts, column_names) + if row is not None: + grid_rows.append(row) + + return grid_rows + + def _extract_single_row( + self, row_ts: Timestamp, column_names: List[str] + ) -> Optional[GridRow]: + """Extract a single row's data and validation.""" + row_node = self._resolve_node(row_ts) + if row_node is None: + return None + + # Row is a VAL pointing to an OBJ + if row_node.node_type == CRDT_VAL and isinstance(row_node.value, Timestamp): + row_node = self._resolve_node(row_node.value) + if row_node is None: + return None + + if row_node.node_type != CRDT_OBJ or not row_node.children: + return None + + # Extract row data from 'data' vec + row_data: Dict[str, Any] = {} + data_ts = row_node.children.get("data") + if data_ts: + data_node = self._resolve_node(data_ts) + if data_node and data_node.node_type == CRDT_VEC and data_node.elements: + 
for i, el_ts in enumerate(data_node.elements): + col_name = column_names[i] if i < len(column_names) else f"col_{i}" + if el_ts is not None: + row_data[col_name] = self._resolve_value(el_ts) + else: + row_data[col_name] = None + + # Extract validation from 'metadata.rowValidation' + validation = self._extract_row_validation(row_node) + + row_id = f"{row_ts.sid}.{row_ts.time}" + + return GridRow( + row_id=row_id, + data=row_data, + validation=validation, + ) + + def _extract_row_validation( + self, row_node: CrdtNode + ) -> Optional[GridRowValidation]: + """Extract validation results from a row's metadata.""" + if not row_node.children: + return None + + metadata_ts = row_node.children.get("metadata") + if metadata_ts is None: + return None + + metadata_node = self._resolve_node(metadata_ts) + if ( + metadata_node is None + or metadata_node.node_type != CRDT_OBJ + or not metadata_node.children + ): + return None + + validation_ts = metadata_node.children.get("rowValidation") + if validation_ts is None: + return None + + validation_value = self._resolve_value(validation_ts) + if validation_value is None: + return None + + # validation_value should be a dict-like object from CBOR + if isinstance(validation_value, dict): + return GridRowValidation( + is_valid=validation_value.get("isValid"), + validation_error_message=validation_value.get("validationErrorMessage"), + all_validation_messages=validation_value.get("allValidationMessages"), + ) + + return None diff --git a/synapseclient/core/grid_websocket.py b/synapseclient/core/grid_websocket.py new file mode 100644 index 000000000..d99f36d61 --- /dev/null +++ b/synapseclient/core/grid_websocket.py @@ -0,0 +1,188 @@ +""" +Read-only WebSocket client for Synapse grid sessions. + +Connects to a grid session via presigned WebSocket URL, receives the initial +grid state (snapshot or patches), decodes the CRDT model, and extracts grid +data including per-row validation results. 
+
+Protocol: JSON-Rx (https://jsonjoy.com/specs/json-rx)
+Messages are JSON arrays with type code as first element:
+    [1, reqId, method, payload?] - Request (complete)
+    [4, subId, payload]          - Response (data)
+    [5, subId, payload?]         - Response (complete)
+    [8, method, payload?]        - Notification
+"""
+
+import asyncio
+import json
+import logging
+from typing import Any, List, Optional
+
+import httpx
+import websockets
+
+from synapseclient.core.grid_crdt_decoder import GridSnapshotDecoder
+from synapseclient.models.grid_query import GridSnapshot
+
+logger = logging.getLogger(__name__)
+
+# JSON-Rx message type codes
+JSONRX_REQUEST_COMPLETE = 1
+JSONRX_RESPONSE_DATA = 4
+JSONRX_RESPONSE_COMPLETE = 5
+JSONRX_NOTIFICATION = 8
+
+
+class GridWebSocketClient:
+    """Read-only WebSocket client for grid sessions.
+
+    Connects to a grid session via presigned WebSocket URL,
+    receives the initial snapshot or patches, and extracts grid data
+    including per-row validation results.
+
+    This client is designed for one-shot reads: connect, receive state,
+    extract data, disconnect. It does not participate in collaborative
+    editing or send patches.
+    """
+
+    def __init__(self, connect_timeout: float = 30.0):
+        """
+        Arguments:
+            connect_timeout: Timeout in seconds for the WebSocket connection
+                and initial data reception.
+        """
+        self.connect_timeout = connect_timeout
+
+    async def get_snapshot(
+        self,
+        presigned_url: str,
+        replica_id: int,
+    ) -> GridSnapshot:
+        """Connect to a grid session, receive its state, and return a snapshot.
+
+        Arguments:
+            presigned_url: The presigned WebSocket URL from
+                ``POST /grid/session/{sessionId}/presigned/url``.
+            replica_id: The replica ID for this connection.
+
+        Returns:
+            GridSnapshot with column names, row data, and per-row validation.
+        """
+        snapshot_url: Optional[str] = None
+        patches: List[Any] = []
+
+        async with websockets.connect(
+            presigned_url,
+            close_timeout=10,
+            open_timeout=self.connect_timeout,
+        ) as ws:
+            try:
+                async with asyncio.timeout(self.connect_timeout):
+                    # Wait for initial messages until sync complete
+                    async for raw_message in ws:
+                        message = self._parse_message(raw_message)
+                        if message is None:
+                            continue
+
+                        msg_type = message[0]
+
+                        if msg_type == JSONRX_NOTIFICATION:
+                            method = message[1] if len(message) > 1 else None
+                            if method == "connected":
+                                logger.debug("Grid WebSocket connected")
+                                # Send clock sync with empty clock
+                                sync_msg = json.dumps(
+                                    [
+                                        JSONRX_REQUEST_COMPLETE,
+                                        1,
+                                        "synchronize-clock",
+                                        [],
+                                    ]
+                                )
+                                await ws.send(sync_msg)
+                            elif method == "ping":
+                                pass  # Ignore keep-alive pings
+
+                        elif msg_type == JSONRX_RESPONSE_DATA:
+                            payload = message[2] if len(message) > 2 else None
+                            if isinstance(payload, dict):
+                                payload_type = payload.get("type")
+                                if payload_type == "snapshot":
+                                    snapshot_url = payload.get("body")
+                                    logger.debug("Received snapshot URL")
+                                elif payload_type == "patch":
+                                    patches.append(payload.get("body"))
+                            elif payload is not None:
+                                # Raw patch data
+                                patches.append(payload)
+
+                        elif msg_type == JSONRX_RESPONSE_COMPLETE:
+                            logger.debug("Grid sync complete")
+                            break
+
+                        # Safety: don't loop forever
+                        if len(patches) > 10000:
+                            logger.warning(
+                                "Received >10000 patches without sync "
+                                "complete, stopping"
+                            )
+                            break
+
+            except TimeoutError:
+                logger.warning(
+                    "Grid WebSocket timed out after %.1fs waiting for "
+                    "sync complete signal",
+                    self.connect_timeout,
+                )
+            except websockets.exceptions.ConnectionClosed:
+                logger.debug("WebSocket connection closed during sync")
+
+        # Process the received data
+        if snapshot_url:
+            return await self._process_snapshot(snapshot_url)
+        elif patches:
+            logger.warning(
+                "Received patches but no snapshot URL. "
+                "Patch-based initialization is not yet supported. "
+                "Returning empty snapshot."
+            )
+            return GridSnapshot()
+        else:
+            logger.warning("No snapshot or patches received from grid session")
+            return GridSnapshot()
+
+    async def _process_snapshot(self, snapshot_url: str) -> GridSnapshot:
+        """Fetch and decode a CRDT snapshot from its S3 URL.
+
+        Arguments:
+            snapshot_url: Presigned S3 URL containing the CBOR snapshot.
+
+        Returns:
+            Decoded GridSnapshot.
+        """
+        async with httpx.AsyncClient() as client:
+            response = await client.get(snapshot_url, timeout=30.0)
+            response.raise_for_status()
+            cbor_data = response.content
+
+        logger.debug("Fetched snapshot: %d bytes", len(cbor_data))
+
+        decoder = GridSnapshotDecoder()
+        return decoder.decode(cbor_data)
+
+    def _parse_message(self, raw: Any) -> Optional[list]:
+        """Parse a WebSocket message as JSON-Rx.
+
+        Arguments:
+            raw: Raw WebSocket message (str or bytes).
+
+        Returns:
+            Parsed JSON array, or None if parsing fails.
+        """
+        try:
+            if isinstance(raw, bytes):
+                raw = raw.decode("utf-8")
+            return json.loads(raw)
+        except (json.JSONDecodeError, UnicodeDecodeError) as e:
+            logger.debug("Failed to parse WebSocket message: %s", e)
+            return None
diff --git a/synapseclient/models/__init__.py b/synapseclient/models/__init__.py
index 7a85b6b83..d49c926b7 100644
--- a/synapseclient/models/__init__.py
+++ b/synapseclient/models/__init__.py
@@ -10,7 +10,6 @@
 from synapseclient.models.curation import (
     CurationTask,
     FileBasedMetadataTaskProperties,
-    Grid,
     RecordBasedMetadataTaskProperties,
 )
 from synapseclient.models.dataset import Dataset, DatasetCollection, EntityRef
@@ -20,6 +19,8 @@
 from synapseclient.models.file import File, FileHandle
 from synapseclient.models.folder import Folder
 from synapseclient.models.form import FormData, FormGroup
+from synapseclient.models.grid import Grid
+from synapseclient.models.grid_query import GridRow, GridRowValidation, GridSnapshot
 from synapseclient.models.link import Link
 from synapseclient.models.materializedview import MaterializedView
 from synapseclient.models.mixins.table_components import QueryMixin
@@ -91,6 +92,9 @@
     "FileBasedMetadataTaskProperties",
     "RecordBasedMetadataTaskProperties",
     "Grid",
+    "GridSnapshot",
+    "GridRow",
+    "GridRowValidation",
     "UserProfile",
     "UserPreference",
     "UserGroupHeader",
diff --git a/synapseclient/models/curation.py b/synapseclient/models/curation.py
index 6b3eb5843..98912d165 100644
--- a/synapseclient/models/curation.py
+++ b/synapseclient/models/curation.py
@@ -14,10 +14,8 @@
 from synapseclient.api import (
     create_curation_task,
     delete_curation_task,
-    delete_grid_session,
     get_curation_task,
     list_curation_tasks,
-    list_grid_sessions,
     update_curation_task,
 )
 from synapseclient.core.async_utils import (
@@ -26,17 +24,10 @@
     wrap_async_generator_to_sync_generator,
 )
 from synapseclient.core.constants.concrete_types import (
-    CREATE_GRID_REQUEST,
     FILE_BASED_METADATA_TASK_PROPERTIES,
-    GRID_RECORD_SET_EXPORT_REQUEST,
-    LIST_GRID_SESSIONS_REQUEST,
-    LIST_GRID_SESSIONS_RESPONSE,
     RECORD_BASED_METADATA_TASK_PROPERTIES,
 )
 from synapseclient.core.utils import delete_none_keys, merge_dataclass_entities
-from synapseclient.models.mixins.asynchronous_job import AsynchronousCommunicator
-from synapseclient.models.recordset import ValidationSummary
-from synapseclient.models.table_components import Query
 
 
 @dataclass
@@ -45,7 +36,8 @@ class FileBasedMetadataTaskProperties:
     """
     A CurationTaskProperties for file-based data, describing where data is
     uploaded and a view which contains the annotations.
 
-    Represents a [Synapse FileBasedMetadataTaskProperties](https://rest-docs.synapse.org/org/sagebionetworks/repo/model/curation/metadata/FileBasedMetadataTaskProperties.html).
+    Represents a [Synapse FileBasedMetadataTaskProperties]\
+(https://rest-docs.synapse.org/org/sagebionetworks/repo/model/curation/metadata/FileBasedMetadataTaskProperties.html).
Attributes: upload_folder_id: The synId of the folder where data files of this type are to be uploaded @@ -94,14 +86,15 @@ class RecordBasedMetadataTaskProperties: """ A CurationTaskProperties for record-based metadata. - Represents a [Synapse RecordBasedMetadataTaskProperties](https://rest-docs.synapse.org/org/sagebionetworks/repo/model/curation/metadata/RecordBasedMetadataTaskProperties.html). + Represents a [Synapse RecordBasedMetadataTaskProperties]\ +(https://rest-docs.synapse.org/org/sagebionetworks/repo/model/curation/metadata/RecordBasedMetadataTaskProperties.html). Attributes: record_set_id: The synId of the RecordSet that will contain all record-based metadata """ record_set_id: Optional[str] = None - """The synId of the RecordSet that will contain all record-based metadata""" + """The synId of the RecordSet that will contain all record-based metadata of this type""" def fill_from_dict( self, synapse_response: Union[Dict[str, Any], Any] @@ -135,14 +128,13 @@ def _create_task_properties_from_dict( properties_dict: Dict[str, Any] ) -> Union[FileBasedMetadataTaskProperties, RecordBasedMetadataTaskProperties]: """ - Factory method to create the appropriate FileBasedMetadataTaskProperties/RecordBasedMetadataTaskProperties - based on the concreteType. + Factory method to create the appropriate task properties based on the concreteType. Arguments: properties_dict: Dictionary containing task properties data Returns: - The appropriate FileBasedMetadataTaskProperties/RecordBasedMetadataTaskProperties instance + The appropriate task properties instance """ concrete_type = properties_dict.get("concreteType", "") @@ -182,151 +174,29 @@ async def _get_existing_curation_task_id( class CurationTaskSynchronousProtocol(Protocol): def get(self, *, synapse_client: Optional[Synapse] = None) -> "CurationTask": - """ - Gets a CurationTask from Synapse by ID. 
- - Arguments: - synapse_client: If not passed in and caching was not disabled by - `Synapse.allow_client_caching(False)` this will use the last created - instance from the Synapse class constructor. - - Returns: - CurationTask: The CurationTask object. - - Raises: - ValueError: If the CurationTask object does not have a task_id. - - Example: Get a curation task by ID -   - - ```python - from synapseclient import Synapse - from synapseclient.models import CurationTask - - syn = Synapse() - syn.login() - - task = CurationTask(task_id=123).get() - print(task.data_type) - print(task.instructions) - ``` - """ + """Gets a CurationTask from Synapse by ID.""" return self - def delete(self, *, synapse_client: Optional[Synapse] = None) -> None: + def delete( + self, + delete_file_view: bool = False, + *, + synapse_client: Optional[Synapse] = None, + ) -> None: """ Deletes a CurationTask from Synapse. Arguments: + delete_file_view: If True and the task has FileBasedMetadataTaskProperties, + also delete the associated EntityView. Defaults to False. synapse_client: If not passed in and caching was not disabled by `Synapse.allow_client_caching(False)` this will use the last created instance from the Synapse class constructor. - - Raises: - ValueError: If the CurationTask object does not have a task_id. - - Example: Delete a curation task -   - - ```python - from synapseclient import Synapse - from synapseclient.models import CurationTask - - syn = Synapse() - syn.login() - - task = CurationTask(task_id=123) - task.delete() - ``` """ return None def store(self, *, synapse_client: Optional[Synapse] = None) -> "CurationTask": - """ - Creates a new CurationTask or updates an existing one on Synapse. - - This method implements non-destructive updates. If a CurationTask with the same - project_id and data_type exists and this instance hasn't been retrieved from - Synapse before, it will merge the existing task data with the current instance - before updating. 
- - Arguments: - synapse_client: If not passed in and caching was not disabled by - `Synapse.allow_client_caching(False)` this will use the last created - instance from the Synapse class constructor. - - Returns: - CurationTask: The CurationTask object. - - Example: Create a new file-based curation task -   - - ```python - from synapseclient import Synapse - from synapseclient.models import CurationTask, FileBasedMetadataTaskProperties - - syn = Synapse() - syn.login() - - # Create file-based task properties - file_properties = FileBasedMetadataTaskProperties( - upload_folder_id="syn1234567", - file_view_id="syn2345678" - ) - - # Create the curation task - task = CurationTask( - project_id="syn9876543", - data_type="genomics_data", - instructions="Upload your genomics files to the specified folder", - task_properties=file_properties - ) - task = task.store() - print(f"Created task with ID: {task.task_id}") - ``` - - Example: Create a new record-based curation task -   - - ```python - from synapseclient import Synapse - from synapseclient.models import CurationTask, RecordBasedMetadataTaskProperties - - syn = Synapse() - syn.login() - - # Create record-based task properties - record_properties = RecordBasedMetadataTaskProperties( - record_set_id="syn3456789" - ) - - # Create the curation task - task = CurationTask( - project_id="syn9876543", - data_type="clinical_data", - instructions="Fill out the clinical data form", - task_properties=record_properties - ) - task = task.store() - print(f"Created task with ID: {task.task_id}") - ``` - - Example: Update an existing curation task -   - - ```python - from synapseclient import Synapse - from synapseclient.models import CurationTask - - syn = Synapse() - syn.login() - - # Get existing task and update - task = CurationTask(task_id=123).get() - task.instructions = "Updated instructions for data contributors" - task = task.store() - ``` - """ + """Creates a new CurationTask or updates an existing one on Synapse.""" return 
self @classmethod @@ -336,36 +206,7 @@ def list( *, synapse_client: Optional[Synapse] = None, ) -> Generator["CurationTask", None, None]: - """ - Generator that yields CurationTasks for a project as they become available. - - Arguments: - project_id: The synId of the project. - synapse_client: If not passed in and caching was not disabled by - `Synapse.allow_client_caching(False)` this will use the last created - instance from the Synapse class constructor. - - Yields: - CurationTask objects as they are retrieved from the API. - - Example: List all curation tasks in a project -   - - ```python - from synapseclient import Synapse - from synapseclient.models import CurationTask - - syn = Synapse() - syn.login() - - # List all curation tasks in the project - for task in CurationTask.list(project_id="syn9876543"): - print(f"Task ID: {task.task_id}") - print(f"Data Type: {task.data_type}") - print(f"Instructions: {task.instructions}") - print("---") - ``` - """ + """Generator that yields CurationTasks for a project.""" yield from wrap_async_generator_to_sync_generator( async_gen_func=cls.list_async, project_id=project_id, @@ -380,7 +221,8 @@ class CurationTask(CurationTaskSynchronousProtocol): The CurationTask provides instructions for a Data Contributor on how data or metadata of a specific type should be both added to a project and curated. - Represents a [Synapse CurationTask](https://rest-docs.synapse.org/org/sagebionetworks/repo/model/curation/CurationTask.html). + Represents a [Synapse CurationTask]\ +(https://rest-docs.synapse.org/org/sagebionetworks/repo/model/curation/CurationTask.html). Attributes: task_id: The unique identifier issued to this task when it was created @@ -389,56 +231,19 @@ class CurationTask(CurationTaskSynchronousProtocol): instructions: Instructions to the data contributor task_properties: The properties of a CurationTask. This can be either FileBasedMetadataTaskProperties or RecordBasedMetadataTaskProperties. 
- etag: Synapse employs an Optimistic Concurrency Control (OCC) scheme to handle - concurrent updates. Since the E-Tag changes every time an entity is updated - it is used to detect when a client's current representation of an entity is - out-of-date. + etag: Synapse employs an Optimistic Concurrency Control (OCC) scheme created_on: (Read Only) The date this task was created modified_on: (Read Only) The date this task was last modified created_by: (Read Only) The ID of the user that created this task modified_by: (Read Only) The ID of the user that last modified this task - - Example: Complete curation task workflow -   - - ```python - from synapseclient import Synapse - from synapseclient.models import CurationTask, FileBasedMetadataTaskProperties - - syn = Synapse() - syn.login() - - # Create a new file-based curation task - file_properties = FileBasedMetadataTaskProperties( - upload_folder_id="syn1234567", - file_view_id="syn2345678" - ) - - task = CurationTask( - project_id="syn9876543", - data_type="genomics_data", - instructions="Upload your genomics files and complete metadata", - task_properties=file_properties - ) - task = task.store() - print(f"Created task: {task.task_id}") - - # Later, retrieve and update the task - existing_task = CurationTask(task_id=task.task_id).get() - existing_task.instructions = "Updated instructions with new requirements" - existing_task.store() - - # List all tasks in the project - for project_task in CurationTask.list(project_id="syn9876543"): - print(f"Task: {project_task.data_type} - {project_task.task_id}") - ``` + assignee_principal_id: The principal ID of the user or team assigned """ task_id: Optional[int] = None """The unique identifier issued to this task when it was created""" data_type: Optional[str] = None - """Will match the data type that a contributor plans to contribute. 
The dataType must be unique within a project""" + """Will match the data type that a contributor plans to contribute""" project_id: Optional[str] = None """The synId of the project""" @@ -452,7 +257,7 @@ class CurationTask(CurationTaskSynchronousProtocol): """The properties of a CurationTask""" etag: Optional[str] = None - """Synapse employs an Optimistic Concurrency Control (OCC) scheme to handle concurrent updates. Since the E-Tag changes every time an entity is updated it is used to detect when a client's current representation of an entity is out-of-date""" + """Synapse employs an Optimistic Concurrency Control (OCC) scheme""" created_on: Optional[str] = None """(Read Only) The date this task was created""" @@ -467,26 +272,22 @@ class CurationTask(CurationTaskSynchronousProtocol): """(Read Only) The ID of the user that last modified this task""" assignee_principal_id: Optional[str] = None - """The principal ID of the user or team assigned to this task. Null if unassigned. For metadata - tasks, determines the owner of the grid session. Team members can all join grid sessions - owned by their team, while user-owned grid sessions are restricted to that user only.""" + """The principal ID of the user or team assigned to this task.""" _last_persistent_instance: Optional["CurationTask"] = field( default=None, repr=False, compare=False ) - """The last persistent instance of this object. This is used to determine if the - object has been changed and needs to be updated in Synapse.""" + """The last persistent instance of this object.""" @property def has_changed(self) -> bool: - """Determines if the object has been changed and needs to be updated in Synapse.""" + """Determines if the object has been changed.""" return ( not self._last_persistent_instance or self._last_persistent_instance != self ) def _set_last_persistent_instance(self) -> None: - """Stash the last time this object interacted with Synapse. 
This is used to - determine if the object has been changed and needs to be updated in Synapse.""" + """Stash the last time this object interacted with Synapse.""" del self._last_persistent_instance self._last_persistent_instance = replace(self) @@ -566,34 +367,11 @@ async def get_async( Raises: ValueError: If the CurationTask object does not have a task_id. - - Example: Get a curation task asynchronously -   - - ```python - import asyncio - from synapseclient import Synapse - from synapseclient.models import CurationTask - - syn = Synapse() - syn.login() - - async def main(): - task = await CurationTask(task_id=123).get_async() - print(f"Data type: {task.data_type}") - print(f"Instructions: {task.instructions}") - - asyncio.run(main()) - ``` """ if not self.task_id: raise ValueError("task_id is required to get a CurationTask") - trace.get_current_span().set_attributes( - { - "synapse.task_id": str(self.task_id), - } - ) + trace.get_current_span().set_attributes({"synapse.task_id": str(self.task_id)}) task_result = await get_curation_task( task_id=self.task_id, synapse_client=synapse_client @@ -602,48 +380,45 @@ async def main(): self._set_last_persistent_instance() return self - async def delete_async(self, *, synapse_client: Optional[Synapse] = None) -> None: + async def delete_async( + self, + delete_file_view: bool = False, + *, + synapse_client: Optional[Synapse] = None, + ) -> None: """ Deletes a CurationTask from Synapse. Arguments: + delete_file_view: If True and the task has FileBasedMetadataTaskProperties, + also delete the associated EntityView. Defaults to False. synapse_client: If not passed in and caching was not disabled by `Synapse.allow_client_caching(False)` this will use the last created instance from the Synapse class constructor. Raises: ValueError: If the CurationTask object does not have a task_id. 
- - Example: Delete a curation task asynchronously -   - - ```python - import asyncio - from synapseclient import Synapse - from synapseclient.models import CurationTask - - syn = Synapse() - syn.login() - - async def main(): - task = CurationTask(task_id=123) - await task.delete_async() - print("Task deleted successfully") - - asyncio.run(main()) - ``` """ if not self.task_id: raise ValueError("task_id is required to delete a CurationTask") - trace.get_current_span().set_attributes( - { - "synapse.task_id": str(self.task_id), - } - ) + trace.get_current_span().set_attributes({"synapse.task_id": str(self.task_id)}) + + file_view_id = None + if delete_file_view: + if not self.task_properties and self.task_id: + await self.get_async(synapse_client=synapse_client) + if isinstance(self.task_properties, FileBasedMetadataTaskProperties): + file_view_id = self.task_properties.file_view_id await delete_curation_task(task_id=self.task_id, synapse_client=synapse_client) + if delete_file_view and file_view_id: + from synapseclient.api.entity_services import delete_entity + + client = Synapse.get_client(synapse_client=synapse_client) + await delete_entity(entity_id=file_view_id, synapse_client=client) + async def store_async( self, *, synapse_client: Optional[Synapse] = None ) -> "CurationTask": @@ -662,37 +437,6 @@ async def store_async( Returns: CurationTask: The CurationTask object. 
- - Example: Create a new curation task asynchronously -   - - ```python - import asyncio - from synapseclient import Synapse - from synapseclient.models import CurationTask, FileBasedMetadataTaskProperties - - syn = Synapse() - syn.login() - - async def main(): - # Create file-based task properties - file_properties = FileBasedMetadataTaskProperties( - upload_folder_id="syn1234567", - file_view_id="syn2345678" - ) - - # Create and store the curation task - task = CurationTask( - project_id="syn9876543", - data_type="genomics_data", - instructions="Upload your genomics files to the specified folder", - task_properties=file_properties - ) - task = await task.store_async() - print(f"Created task with ID: {task.task_id}") - - asyncio.run(main()) - ``` """ if not self.project_id: raise ValueError("project_id is required") @@ -771,986 +515,36 @@ async def list_async( Yields: CurationTask objects as they are retrieved from the API. - - Example: List all curation tasks in a project asynchronously -   - - ```python - import asyncio - from synapseclient import Synapse - from synapseclient.models import CurationTask - - syn = Synapse() - syn.login() - - async def main(): - # List all curation tasks in the project - async for task in CurationTask.list_async(project_id="syn9876543"): - print(f"Task ID: {task.task_id}") - print(f"Data Type: {task.data_type}") - print(f"Instructions: {task.instructions}") - print("---") - - asyncio.run(main()) - ``` """ - trace.get_current_span().set_attributes( - { - "synapse.project_id": project_id, - } - ) - async for task_dict in list_curation_tasks( project_id=project_id, synapse_client=synapse_client ): - task = cls().fill_from_dict(synapse_response=task_dict) + task = cls() + task.fill_from_dict(task_dict) + task._set_last_persistent_instance() yield task - -@dataclass -class CreateGridRequest(AsynchronousCommunicator): - """ - Start a job to create a new Grid session. 
- - Represents a [Synapse CreateGridRequest](https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/grid/CreateGridRequest.html). - - Attributes: - concrete_type: The concrete type for the request - record_set_id: When provided, the grid will be initialized using the CSV file - stored for the given record set id - initial_query: Initialize a grid session from an EntityView. - Mutually exclusive with record_set_id. - session_id: The session ID of the created grid (populated from response) - """ - - concrete_type: str = CREATE_GRID_REQUEST - """The concrete type for the request""" - - record_set_id: Optional[str] = None - """When provided, the grid will be initialized using the CSV file stored for - the given record set id. The grid columns will match the header of the CSV. - Optional, if present the initialQuery cannot be included.""" - - initial_query: Optional[Query] = None - """Initialize a grid session from an EntityView. - Mutually exclusive with record_set_id.""" - - session_id: Optional[str] = None - """The session ID of the created grid (populated from response)""" - - _grid_session_data: Optional[Dict[str, Any]] = field(default=None, compare=False) - """Internal storage of the full grid session data from the response for later use.""" - - def fill_from_dict( - self, synapse_response: Union[Dict[str, Any], Any] - ) -> "CreateGridRequest": - """ - Converts a response from the REST API into this dataclass. - - Arguments: - synapse_response: The response from the REST API. - - Returns: - The CreateGridRequest object. - """ - # Extract session ID from the response body - grid_session_data = synapse_response.get("gridSession", {}) - self.session_id = grid_session_data.get("sessionId", None) - - # Store the full grid session data for later use - self._grid_session_data = grid_session_data - - return self - - def fill_grid_session_from_response(self, grid_session: "Grid") -> "Grid": - """ - Fills a GridSession object with data from the stored response. 
- - Arguments: - grid_session: The GridSession object to populate. - - Returns: - The populated GridSession object. - """ - if not hasattr(self, "_grid_session_data"): - return grid_session - - data = self._grid_session_data - - grid_session.session_id = data.get("sessionId", None) - grid_session.started_by = data.get("startedBy", None) - grid_session.started_on = data.get("startedOn", None) - grid_session.etag = data.get("etag", None) - grid_session.modified_on = data.get("modifiedOn", None) - grid_session.last_replica_id_client = data.get("lastReplicaIdClient", None) - grid_session.last_replica_id_service = data.get("lastReplicaIdService", None) - grid_session.grid_json_schema_id = data.get("gridJsonSchema$Id", None) - grid_session.source_entity_id = data.get("sourceEntityId", None) - - return grid_session - - def to_synapse_request(self) -> Dict[str, Any]: - """ - Converts this dataclass to a dictionary suitable for a Synapse REST API request. - - Returns: - A dictionary representation of this object for API requests. - """ - request_dict = {"concreteType": self.concrete_type} - request_dict["recordSetId"] = self.record_set_id - request_dict["initialQuery"] = ( - self.initial_query.to_synapse_request() if self.initial_query else None - ) - delete_none_keys(request_dict) - return request_dict - - -@dataclass -class GridRecordSetExportRequest(AsynchronousCommunicator): - """ - A request to export a grid created from a record set back to the original record set. - A CSV file will be generated and set as a new version of the recordset. - - Represents a [Synapse GridRecordSetExportRequest](https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/grid/GridRecordSetExportRequest.html). 
- - Attributes: - concrete_type: The concrete type for the request - session_id: The grid session ID - response_session_id: The session ID from the export response - response_record_set_id: The record set ID from the export response - record_set_version_number: The version number from the export response - validation_summary_statistics: Summary statistics from the export response - """ - - concrete_type: str = GRID_RECORD_SET_EXPORT_REQUEST - """The concrete type for the request""" - - session_id: Optional[str] = None - """The grid session ID""" - - response_session_id: Optional[str] = None - """The session ID from the export response""" - - response_record_set_id: Optional[str] = None - """The record set ID from the export response""" - - record_set_version_number: Optional[int] = None - """The version number from the export response""" - - validation_summary_statistics: Optional[ValidationSummary] = None - """Summary statistics from the export response""" - - def fill_from_dict( - self, synapse_response: Union[Dict[str, Any], Any] - ) -> "GridRecordSetExportRequest": - """ - Converts a response from the REST API into this dataclass. - - Arguments: - synapse_response: The response from the REST API. - - Returns: - The GridRecordSetExportRequest object. 
- """ - self.response_session_id = synapse_response.get("sessionId", None) - self.response_record_set_id = synapse_response.get("recordSetId", None) - self.record_set_version_number = synapse_response.get( - "recordSetVersionNumber", None - ) - - validation_stats_dict = synapse_response.get( - "validationSummaryStatistics", None - ) - if validation_stats_dict: - self.validation_summary_statistics = ValidationSummary( - container_id=validation_stats_dict.get("containerId", None), - total_number_of_children=validation_stats_dict.get( - "totalNumberOfChildren", None - ), - number_of_valid_children=validation_stats_dict.get( - "numberOfValidChildren", None - ), - number_of_invalid_children=validation_stats_dict.get( - "numberOfInvalidChildren", None - ), - number_of_unknown_children=validation_stats_dict.get( - "numberOfUnknownChildren", None - ), - generated_on=validation_stats_dict.get("generatedOn", None), - ) - - return self - - def to_synapse_request(self) -> Dict[str, Any]: - """ - Converts this dataclass to a dictionary suitable for a Synapse REST API request. - - Returns: - A dictionary representation of this object for API requests. - """ - request_dict = {"concreteType": self.concrete_type} - if self.session_id is not None: - request_dict["sessionId"] = self.session_id - return request_dict - - -@dataclass -class GridSession: - """ - Basic information about a grid session. - - Represents a [Synapse GridSession](https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/grid/GridSession.html). 
- - Attributes: - session_id: The unique sessionId that identifies the grid session - started_by: The user that started this session - started_on: The date-time when the session was started - etag: Changes when the session changes - modified_on: The date-time when the session was last changed - last_replica_id_client: The last replica ID issued to a client - last_replica_id_service: The last replica ID issued to a service - grid_json_schema_id: The $id of the JSON schema used for model validation - source_entity_id: The synId of the table/view/csv that this grid was cloned from - """ - - session_id: Optional[str] = None - """The unique sessionId that identifies the grid session""" - - started_by: Optional[str] = None - """The user that started this session""" - - started_on: Optional[str] = None - """The date-time when the session was started""" - - etag: Optional[str] = None - """Changes when the session changes""" - - modified_on: Optional[str] = None - """The date-time when the session was last changed""" - - last_replica_id_client: Optional[int] = None - """The last replica ID issued to a client. Client replica IDs are incremented.""" - - last_replica_id_service: Optional[int] = None - """The last replica ID issued to a service. Service replica IDs are decremented.""" - - grid_json_schema_id: Optional[str] = None - """The $id of the JSON schema that will be used for model validation in this grid session""" - - source_entity_id: Optional[str] = None - """The synId of the table/view/csv that this grid was cloned from""" - - def fill_from_dict(self, synapse_response: Dict[str, Any]) -> "GridSession": - """ - Converts a response from the REST API into this dataclass. - - Arguments: - synapse_response: The response from the REST API. - - Returns: - The GridSession object. 
- """ - self.session_id = synapse_response.get("sessionId", None) - self.started_by = synapse_response.get("startedBy", None) - self.started_on = synapse_response.get("startedOn", None) - self.etag = synapse_response.get("etag", None) - self.modified_on = synapse_response.get("modifiedOn", None) - self.last_replica_id_client = synapse_response.get("lastReplicaIdClient", None) - self.last_replica_id_service = synapse_response.get( - "lastReplicaIdService", None - ) - self.grid_json_schema_id = synapse_response.get("gridJsonSchema$Id", None) - self.source_entity_id = synapse_response.get("sourceEntityId", None) - return self - - -@dataclass -class ListGridSessionsRequest: - """ - Request to list a user's active grid sessions. - - Represents a [Synapse ListGridSessionsRequest](https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/grid/ListGridSessionsRequest.html). - - Attributes: - concrete_type: The concrete type for the request - source_id: Optional. When provided, only sessions with this synId will be returned - next_page_token: Forward the returned 'nextPageToken' to get the next page of results - """ - - concrete_type: str = LIST_GRID_SESSIONS_REQUEST - """The concrete type for the request""" - - source_id: Optional[str] = None - """Optional. When provided, only sessions with this synId will be returned""" - - next_page_token: Optional[str] = None - """Forward the returned 'nextPageToken' to get the next page of results""" - - def to_synapse_request(self) -> Dict[str, Any]: - """ - Converts this dataclass to a dictionary suitable for a Synapse REST API request. - - Returns: - A dictionary representation of this object for API requests. 
- """ - request_dict = {"concreteType": self.concrete_type} - if self.source_id is not None: - request_dict["sourceId"] = self.source_id - if self.next_page_token is not None: - request_dict["nextPageToken"] = self.next_page_token - delete_none_keys(request_dict) - return request_dict - - -@dataclass -class ListGridSessionsResponse: - """ - Response to a request to list a user's active grid sessions. - - Represents a [Synapse ListGridSessionsResponse](https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/grid/ListGridSessionsResponse.html). - - Attributes: - concrete_type: The concrete type for the response - page: A single page of results that match the request parameters - next_page_token: Forward this token to get the next page of results - """ - - concrete_type: str = LIST_GRID_SESSIONS_RESPONSE - """The concrete type for the response""" - - page: Optional[list[GridSession]] = None - """A single page of results that match the request parameters""" - - next_page_token: Optional[str] = None - """Forward this token to get the next page of results""" - - def fill_from_dict( - self, synapse_response: Dict[str, Any] - ) -> "ListGridSessionsResponse": - """ - Converts a response from the REST API into this dataclass. - - Arguments: - synapse_response: The response from the REST API. - - Returns: - The ListGridSessionsResponse object. - """ - self.next_page_token = synapse_response.get("nextPageToken", None) - page_data = synapse_response.get("page", []) - if page_data: - self.page = [] - for session_dict in page_data: - session = GridSession() - session.fill_from_dict(session_dict) - self.page.append(session) - return self - - -class GridSynchronousProtocol(Protocol): - """ - The protocol for methods that are asynchronous but also - have a synchronous counterpart that may also be called. 
- """ - - def create( - self, - attach_to_previous_session=False, - *, - timeout: int = 120, - synapse_client: Optional[Synapse] = None, - ) -> "Grid": - """ - Creates a new grid session from a `record_set_id` or `initial_query`. - - Arguments: - attach_to_previous_session: If True and using `record_set_id`, will attach - to an existing active session if one exists. Defaults to False. - timeout: The number of seconds to wait for the job to complete or progress - before raising a SynapseTimeoutError. Defaults to 120. - synapse_client: If not passed in and caching was not disabled by - `Synapse.allow_client_caching(False)` this will use the last created - instance from the Synapse class constructor. - - Returns: - GridSession: The GridSession object with populated session_id. - - Raises: - ValueError: If `record_set_id` or `initial_query` is not provided. - - Example: Create a grid session from a record set -   - - ```python - from synapseclient import Synapse - from synapseclient.models import Grid - - syn = Synapse() - syn.login() - - # Create a grid session from a record set - grid = Grid(record_set_id="syn1234567") - grid = grid.create() - print(f"Created grid session: {grid.session_id}") - ``` - - Example: Create a grid session from a query -   - - ```python - from synapseclient import Synapse - from synapseclient.models import Grid - from synapseclient.models.table_components import Query - - syn = Synapse() - syn.login() - - # Create a grid session from an entity view query - query = Query(sql="SELECT * FROM syn1234567") - grid = Grid(initial_query=query) - grid = grid.create() - print(f"Created grid session: {grid.session_id}") - ``` - """ - return self - - def export_to_record_set( - self, *, timeout: int = 120, synapse_client: Optional[Synapse] = None - ) -> "Grid": - """ - Exports the grid session data back to a record set. This will create a new version - of the original record set with the modified data from the grid session. 
- - Arguments: - timeout: The number of seconds to wait for the job to complete or progress - before raising a SynapseTimeoutError. Defaults to 120. - synapse_client: If not passed in and caching was not disabled by - `Synapse.allow_client_caching(False)` this will use the last created - instance from the Synapse class constructor. - - Returns: - GridSession: The GridSession object with export information populated. - - Raises: - ValueError: If session_id is not provided. - - Example: Export grid session data back to record set -   - - ```python - from synapseclient import Synapse - from synapseclient.models import Grid - - syn = Synapse() - syn.login() - - # Export modified grid data back to the record set - grid = Grid(session_id="abc-123-def") - grid = grid.export_to_record_set() - print(f"Exported to record set: {grid.record_set_id}") - print(f"Version number: {grid.record_set_version_number}") - if grid.validation_summary_statistics: - print(f"Valid records: {grid.validation_summary_statistics.number_of_valid_children}") - ``` - """ - return self - - def delete(self, *, synapse_client: Optional[Synapse] = None) -> None: - """ - Delete the grid session. - - Note: Only the user that created a grid session may delete it. - - Arguments: - synapse_client: If not passed in and caching was not disabled by - `Synapse.allow_client_caching(False)` this will use the last created - instance from the Synapse class constructor. - - Returns: - None - - Raises: - ValueError: If session_id is not provided. 
- - Example: Delete a grid session -   - - ```python - from synapseclient import Synapse - from synapseclient.models import Grid - - syn = Synapse() - syn.login() - - # Delete the grid session - grid = Grid(session_id="abc-123-def") - grid.delete() - ``` - """ - return None - - @classmethod - def list( - cls, - source_id: Optional[str] = None, - *, - synapse_client: Optional[Synapse] = None, - ) -> Generator["Grid", None, None]: - """ - Generator to get a list of active grid sessions for the user. - - Arguments: - source_id: Optional. When provided, only sessions with this synId will be returned. - synapse_client: If not passed in and caching was not disabled by - `Synapse.allow_client_caching(False)` this will use the last created - instance from the Synapse class constructor. - - Yields: - Grid objects representing active grid sessions. - - Example: List all active grid sessions -   - - ```python - from synapseclient import Synapse - from synapseclient.models import Grid - - syn = Synapse() - syn.login() - - # List all active grid sessions for the user - for grid in Grid.list(): - print(f"Session ID: {grid.session_id}") - print(f"Source Entity: {grid.source_entity_id}") - print(f"Started: {grid.started_on}") - print("---") - ``` - - Example: List grid sessions for a specific source -   - - ```python - from synapseclient import Synapse - from synapseclient.models import Grid - - syn = Synapse() - syn.login() - - # List grid sessions for a specific record set - for grid in Grid.list(source_id="syn1234567"): - print(f"Session ID: {grid.session_id}") - print(f"Modified: {grid.modified_on}") - ``` - """ - - -@dataclass -@async_to_sync -class Grid(GridSynchronousProtocol): - """ - A GridSession provides functionality to create and manage grid sessions in Synapse. - Grid sessions are used for curation workflows where data can be edited in a grid format - and then exported back to record sets. 
- - Attributes: - record_set_id: The synId of the RecordSet to use for initializing the grid - initial_query: Initialize a grid session from an EntityView. - Mutually exclusive with record_set_id. - session_id: The unique sessionId that identifies the grid session - started_by: The user that started this session - started_on: The date-time when the session was started - etag: Changes when the session changes - modified_on: The date-time when the session was last changed - last_replica_id_client: The last replica ID issued to a client - last_replica_id_service: The last replica ID issued to a service - grid_json_schema_id: The $id of the JSON schema used for model validation - source_entity_id: The synId of the table/view/csv that this grid was cloned from - record_set_version_number: The version number of the exported record set - validation_summary_statistics: Summary statistics for validation results - - Example: Create and manage a grid session workflow -   - - ```python - from synapseclient import Synapse - from synapseclient.models import Grid - - syn = Synapse() - syn.login() - - # Create a new grid session from a record set - grid = Grid(record_set_id="syn1234567") - grid = grid.create() - print(f"Created grid session: {grid.session_id}") - - # Later, export the modified data back to the record set - grid = grid.export_to_record_set() - print(f"Exported to version: {grid.record_set_version_number}") - - # Clean up by deleting the session when done - grid.delete() - ``` - - Example: Working with grid sessions using queries -   - - ```python - from synapseclient import Synapse - from synapseclient.models import Grid - from synapseclient.models.table_components import Query - - syn = Synapse() - syn.login() - - # Create a grid from an entity view query - query = Query(sql="SELECT * FROM syn1234567") - grid = Grid(initial_query=query) - grid = grid.create() - - # Work with the grid session... 
- # Export when ready - grid = grid.export_to_record_set() - ``` - """ - - record_set_id: Optional[str] = None - """The synId of the RecordSet to use for initializing the grid""" - - initial_query: Optional[Query] = None - """Initialize a grid session from an EntityView. - Mutually exclusive with record_set_id.""" - - session_id: Optional[str] = None - """The unique sessionId that identifies the grid session""" - - started_by: Optional[str] = None - """The user that started this session""" - - started_on: Optional[str] = None - """The date-time when the session was started""" - - etag: Optional[str] = None - """Changes when the session changes""" - - modified_on: Optional[str] = None - """The date-time when the session was last changed""" - - last_replica_id_client: Optional[int] = None - """The last replica ID issued to a client. Client replica IDs are incremented.""" - - last_replica_id_service: Optional[int] = None - """The last replica ID issued to a service. Service replica IDs are decremented.""" - - grid_json_schema_id: Optional[str] = None - """The $id of the JSON schema that will be used for model validation in this grid session""" - - source_entity_id: Optional[str] = None - """The synId of the table/view/csv that this grid was cloned from""" - - record_set_version_number: Optional[int] = None - """The version number of the exported record set""" - - validation_summary_statistics: Optional[ValidationSummary] = None - """Summary statistics for validation results""" - - async def create_async( - self, - attach_to_previous_session=False, - *, - timeout: int = 120, - synapse_client: Optional[Synapse] = None, - ) -> "Grid": - """ - Creates a new grid session from a `record_set_id` or `initial_query`. - - When using `record_set_id`, first checks for existing active sessions that match - the record set before creating a new one. When using `initial_query`, always - creates a new session due to the complexity of matching query parameters. 
- - Arguments: - attach_to_previous_session: If True and using `record_set_id`, will attach - to an existing active session if one exists. Defaults to False. - timeout: The number of seconds to wait for the job to complete or progress - before raising a SynapseTimeoutError. Defaults to 120. - synapse_client: If not passed in and caching was not disabled by - `Synapse.allow_client_caching(False)` this will use the last created - instance from the Synapse class constructor. - - Returns: - GridSession: The GridSession object with populated session_id. - - Raises: - ValueError: If `record_set_id` or `initial_query` is not provided. - - Example: Create a grid session asynchronously -   - - ```python - import asyncio - from synapseclient import Synapse - from synapseclient.models import Grid - - syn = Synapse() - syn.login() - - async def main(): - # Create a grid session from a record set - grid = Grid(record_set_id="syn1234567") - grid = await grid.create_async() - print(f"Created grid session: {grid.session_id}") - - asyncio.run(main()) - ``` - """ - if not self.record_set_id and not self.initial_query: - raise ValueError( - "record_set_id or initial_query is required to create a GridSession" - ) - - trace.get_current_span().set_attributes( - { - "synapse.record_set_id": self.record_set_id or "", - "synapse.session_id": self.session_id or "", - } - ) - - # Check for existing active sessions only when using record_set_id - # For initial_query, always create a new session due to complexity of matching - if self.record_set_id and attach_to_previous_session: - # Look for existing active sessions for this record set - async for existing_session in self.list_async( - source_id=self.record_set_id, synapse_client=synapse_client - ): - # Found an existing session, populate this object with its data and return - self.session_id = existing_session.session_id - self.started_by = existing_session.started_by - self.started_on = existing_session.started_on - self.etag = 
existing_session.etag - self.modified_on = existing_session.modified_on - self.last_replica_id_client = existing_session.last_replica_id_client - self.last_replica_id_service = existing_session.last_replica_id_service - self.grid_json_schema_id = existing_session.grid_json_schema_id - self.source_entity_id = existing_session.source_entity_id - return self - - # No existing session found, create a new one - create_request = CreateGridRequest( - record_set_id=self.record_set_id, initial_query=self.initial_query - ) - result = await create_request.send_job_and_wait_async( - timeout=timeout, synapse_client=synapse_client - ) - - # Fill this GridSession with the grid session data from the async job response - result.fill_grid_session_from_response(self) - - return self - - async def export_to_record_set_async( - self, *, timeout: int = 120, synapse_client: Optional[Synapse] = None - ) -> "Grid": - """ - Exports the grid session data back to a record set. This will create a new version - of the original record set with the modified data from the grid session. - - Arguments: - timeout: The number of seconds to wait for the job to complete or progress - before raising a SynapseTimeoutError. Defaults to 120. - synapse_client: If not passed in and caching was not disabled by - `Synapse.allow_client_caching(False)` this will use the last created - instance from the Synapse class constructor. - - Returns: - GridSession: The GridSession object with export information populated. - - Raises: - ValueError: If session_id is not provided. 
- - Example: Export grid session data back to record set asynchronously -   - - ```python - import asyncio - from synapseclient import Synapse - from synapseclient.models import Grid - - syn = Synapse() - syn.login() - - async def main(): - # Export modified grid data back to the record set - grid = Grid(session_id="abc-123-def") - grid = await grid.export_to_record_set_async() - print(f"Exported to record set: {grid.record_set_id}") - print(f"Version number: {grid.record_set_version_number}") - if grid.validation_summary_statistics: - print(f"Valid records: {grid.validation_summary_statistics.number_of_valid_children}") - - asyncio.run(main()) - ``` - """ - if not self.session_id: - raise ValueError("session_id is required to export a GridSession") - - trace.get_current_span().set_attributes( - { - "synapse.session_id": self.session_id or "", - } - ) - - # Create and send the export request - export_request = GridRecordSetExportRequest(session_id=self.session_id) - result = await export_request.send_job_and_wait_async( - timeout=timeout, synapse_client=synapse_client - ) - - self.record_set_id = result.response_record_set_id - self.record_set_version_number = result.record_set_version_number - self.validation_summary_statistics = result.validation_summary_statistics - - return self - - def fill_from_dict(self, synapse_response: Dict[str, Any]) -> "Grid": - """Converts a response from the REST API into this dataclass.""" - self.session_id = synapse_response.get("sessionId", None) - self.started_by = synapse_response.get("startedBy", None) - self.started_on = synapse_response.get("startedOn", None) - self.etag = synapse_response.get("etag", None) - self.modified_on = synapse_response.get("modifiedOn", None) - self.last_replica_id_client = synapse_response.get("lastReplicaIdClient", None) - self.last_replica_id_service = synapse_response.get( - "lastReplicaIdService", None - ) - self.grid_json_schema_id = synapse_response.get("gridJsonSchema$Id", None) - 
self.source_entity_id = synapse_response.get("sourceEntityId", None) - return self - - @skip_async_to_sync - @classmethod - async def list_async( - cls, - source_id: Optional[str] = None, - *, - synapse_client: Optional[Synapse] = None, - ) -> AsyncGenerator["Grid", None]: - """ - Generator to get a list of active grid sessions for the user. - - Arguments: - source_id: Optional. When provided, only sessions with this synId will be returned. - synapse_client: If not passed in and caching was not disabled by - `Synapse.allow_client_caching(False)` this will use the last created - instance from the Synapse class constructor. - - Yields: - Grid objects representing active grid sessions. - - Example: List all active grid sessions asynchronously -   - - ```python - import asyncio - from synapseclient import Synapse - from synapseclient.models import Grid - - syn = Synapse() - syn.login() - - async def main(): - # List all active grid sessions for the user - async for grid in Grid.list_async(): - print(f"Session ID: {grid.session_id}") - print(f"Source Entity: {grid.source_entity_id}") - print(f"Started: {grid.started_on}") - print("---") - - # List grid sessions for a specific source - async for grid in Grid.list_async(source_id="syn1234567"): - print(f"Session ID: {grid.session_id}") - print(f"Modified: {grid.modified_on}") - - asyncio.run(main()) - ``` - """ - async for session_dict in list_grid_sessions( - source_id=source_id, synapse_client=synapse_client - ): - # Convert the dictionary to a Grid object - grid = cls() - grid.fill_from_dict(session_dict) - yield grid - @classmethod def list( cls, - source_id: Optional[str] = None, + project_id: str, *, synapse_client: Optional[Synapse] = None, - ) -> Generator["Grid", None, None]: + ) -> Generator["CurationTask", None, None]: """ - Generator to get a list of active grid sessions for the user. + Generator that yields CurationTasks for a project. Arguments: - source_id: Optional. 
When provided, only sessions with this synId will be returned. + project_id: The synId of the project. synapse_client: If not passed in and caching was not disabled by `Synapse.allow_client_caching(False)` this will use the last created instance from the Synapse class constructor. Yields: - Grid objects representing active grid sessions. + CurationTask objects as they are retrieved from the API. """ return wrap_async_generator_to_sync_generator( async_gen_func=cls.list_async, - source_id=source_id, + project_id=project_id, synapse_client=synapse_client, ) - - async def delete_async(self, *, synapse_client: Optional[Synapse] = None) -> None: - """ - Delete the grid session. - - Note: Only the user that created a grid session may delete it. - - Arguments: - synapse_client: If not passed in and caching was not disabled by - `Synapse.allow_client_caching(False)` this will use the last created - instance from the Synapse class constructor. - - Returns: - None - - Raises: - ValueError: If session_id is not provided. - - Example: Delete a grid session asynchronously -   - - ```python - import asyncio - from synapseclient import Synapse - from synapseclient.models import Grid - - syn = Synapse() - syn.login() - - async def main(): - # Delete the grid session - grid = Grid(session_id="abc-123-def") - await grid.delete_async() - print("Grid session deleted successfully") - - asyncio.run(main()) - ``` - """ - if not self.session_id: - raise ValueError("session_id is required to delete a GridSession") - - trace.get_current_span().set_attributes( - { - "synapse.session_id": self.session_id or "", - } - ) - - await delete_grid_session( - session_id=self.session_id, synapse_client=synapse_client - ) diff --git a/synapseclient/models/grid.py b/synapseclient/models/grid.py new file mode 100644 index 000000000..484db9c3d --- /dev/null +++ b/synapseclient/models/grid.py @@ -0,0 +1,1380 @@ +""" +Grid session dataclasses for managing Grid sessions in Synapse. 
+ +Grid sessions are used for curation workflows where data can be edited in a grid +format and then exported back to record sets or synchronized with data sources. +""" + +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import ( + TYPE_CHECKING, + Any, + AsyncGenerator, + Dict, + Generator, + List, + Optional, + Protocol, + Union, +) + +if TYPE_CHECKING: + from synapseclient.models.grid_query import GridSnapshot + +from opentelemetry import trace + +from synapseclient import Synapse +from synapseclient.api import delete_grid_session, list_grid_sessions +from synapseclient.core.async_utils import ( + async_to_sync, + skip_async_to_sync, + wrap_async_generator_to_sync_generator, +) +from synapseclient.core.constants.concrete_types import ( + CREATE_GRID_REQUEST, + DOWNLOAD_FROM_GRID_REQUEST, + GRID_CSV_IMPORT_REQUEST, + GRID_RECORD_SET_EXPORT_REQUEST, + LIST_GRID_SESSIONS_REQUEST, + LIST_GRID_SESSIONS_RESPONSE, + SYNCHRONIZE_GRID_REQUEST, +) +from synapseclient.core.utils import delete_none_keys +from synapseclient.models.mixins.asynchronous_job import AsynchronousCommunicator +from synapseclient.models.recordset import ValidationSummary +from synapseclient.models.table_components import Column, CsvTableDescriptor, Query + + +@dataclass +class CreateGridRequest(AsynchronousCommunicator): + """ + A request to create a new grid session. + + Represents a + [Synapse CreateGridRequest]\ +(https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/grid/CreateGridRequest.html). + + Attributes: + concrete_type: The concrete type for the request + record_set_id: The synId of the RecordSet to use for initializing the grid + initial_query: Initialize a grid session from an EntityView. + Mutually exclusive with record_set_id. 
+ session_id: The session ID of the created grid (populated from response) + """ + + concrete_type: str = CREATE_GRID_REQUEST + """The concrete type for the request""" + + record_set_id: Optional[str] = None + """When provided, the grid will be initialized using the CSV file stored for + the given record set id. The grid columns will match the header of the CSV. + Optional, if present the initialQuery cannot be included.""" + + initial_query: Optional[Query] = None + """Initialize a grid session from an EntityView. + Mutually exclusive with record_set_id.""" + + session_id: Optional[str] = None + """The session ID of the created grid (populated from response)""" + + _grid_session_data: Optional[Dict[str, Any]] = field(default=None, compare=False) + """Internal storage of the full grid session data from the response.""" + + def fill_from_dict( + self, synapse_response: Union[Dict[str, Any], Any] + ) -> "CreateGridRequest": + """ + Converts a response from the REST API into this dataclass. + + Arguments: + synapse_response: The response from the REST API. + + Returns: + The CreateGridRequest object. + """ + grid_session_data = synapse_response.get("gridSession", {}) + self.session_id = grid_session_data.get("sessionId", None) + self._grid_session_data = grid_session_data + return self + + def fill_grid_session_from_response(self, grid_session: "Grid") -> "Grid": + """ + Fills a Grid object with data from the stored response. + + Arguments: + grid_session: The Grid object to populate. + + Returns: + The populated Grid object. 
+        """
+        # A dataclass field with a default always exists on the instance, so a
+        # hasattr() check can never fail here; test for None instead so the
+        # guard actually fires when fill_from_dict() has not populated the data.
+        if self._grid_session_data is None:
+            return grid_session
+
+        data = self._grid_session_data
+
+        grid_session.session_id = data.get("sessionId", None)
+        grid_session.started_by = data.get("startedBy", None)
+        grid_session.started_on = data.get("startedOn", None)
+        grid_session.etag = data.get("etag", None)
+        grid_session.modified_on = data.get("modifiedOn", None)
+        grid_session.last_replica_id_client = data.get("lastReplicaIdClient", None)
+        grid_session.last_replica_id_service = data.get("lastReplicaIdService", None)
+        grid_session.grid_json_schema_id = data.get("gridJsonSchema$Id", None)
+        grid_session.source_entity_id = data.get("sourceEntityId", None)
+
+        return grid_session
+
+    def to_synapse_request(self) -> Dict[str, Any]:
+        """
+        Converts this dataclass to a dictionary suitable for a Synapse REST API
+        request.
+
+        Returns:
+            A dictionary representation of this object for API requests.
+        """
+        request_dict = {"concreteType": self.concrete_type}
+        request_dict["recordSetId"] = self.record_set_id
+        request_dict["initialQuery"] = (
+            self.initial_query.to_synapse_request() if self.initial_query else None
+        )
+        delete_none_keys(request_dict)
+        return request_dict
+
+
+@dataclass
+class GridRecordSetExportRequest(AsynchronousCommunicator):
+    """
+    A request to export a grid created from a record set back to the original
+    record set. A CSV file will be generated and set as a new version of the
+    recordset.
+
+    Represents a
+    [Synapse GridRecordSetExportRequest]\
+(https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/grid/GridRecordSetExportRequest.html).
+ + Attributes: + concrete_type: The concrete type for the request + session_id: The grid session ID + response_session_id: The session ID from the export response + response_record_set_id: The record set ID from the export response + record_set_version_number: The version number from the export response + validation_summary_statistics: Summary statistics from the export + """ + + concrete_type: str = GRID_RECORD_SET_EXPORT_REQUEST + """The concrete type for the request""" + + session_id: Optional[str] = None + """The grid session ID""" + + response_session_id: Optional[str] = None + """The session ID from the export response""" + + response_record_set_id: Optional[str] = None + """The record set ID from the export response""" + + record_set_version_number: Optional[int] = None + """The version number from the export response""" + + validation_summary_statistics: Optional[ValidationSummary] = None + """Summary statistics from the export response""" + + def fill_from_dict( + self, synapse_response: Union[Dict[str, Any], Any] + ) -> "GridRecordSetExportRequest": + """ + Converts a response from the REST API into this dataclass. + + Arguments: + synapse_response: The response from the REST API. + + Returns: + The GridRecordSetExportRequest object. 
+ """ + self.response_session_id = synapse_response.get("sessionId", None) + self.response_record_set_id = synapse_response.get("recordSetId", None) + self.record_set_version_number = synapse_response.get( + "recordSetVersionNumber", None + ) + + validation_stats_dict = synapse_response.get( + "validationSummaryStatistics", None + ) + if validation_stats_dict: + self.validation_summary_statistics = ValidationSummary( + container_id=validation_stats_dict.get("containerId", None), + total_number_of_children=validation_stats_dict.get( + "totalNumberOfChildren", None + ), + number_of_valid_children=validation_stats_dict.get( + "numberOfValidChildren", None + ), + number_of_invalid_children=validation_stats_dict.get( + "numberOfInvalidChildren", None + ), + number_of_unknown_children=validation_stats_dict.get( + "numberOfUnknownChildren", None + ), + generated_on=validation_stats_dict.get("generatedOn", None), + ) + + return self + + def to_synapse_request(self) -> Dict[str, Any]: + """ + Converts this dataclass to a dictionary suitable for a Synapse REST API + request. + + Returns: + A dictionary representation of this object for API requests. + """ + request_dict = {"concreteType": self.concrete_type} + if self.session_id is not None: + request_dict["sessionId"] = self.session_id + return request_dict + + +@dataclass +class GridCsvImportRequest(AsynchronousCommunicator): + """ + A request to import a CSV file into an existing grid session. + Currently supports only grids started using a RecordSet. 
+ + Attributes: + concrete_type: The concrete type for the request + session_id: The grid session ID + file_handle_id: The ID of the file handle containing the CSV data + csv_descriptor: The description of the CSV for upload + schema: The list of ColumnModel objects describing the CSV file + response_session_id: The session ID from the import response + total_count: Total number of processed rows + created_count: Number of newly created rows in the grid + updated_count: Number of updated rows in the grid + """ + + concrete_type: str = GRID_CSV_IMPORT_REQUEST + """The concrete type for the request""" + + session_id: Optional[str] = None + """The grid session ID""" + + file_handle_id: Optional[str] = None + """The ID of the file handle containing the CSV data""" + + csv_descriptor: Optional[CsvTableDescriptor] = None + """The description of the CSV for upload""" + + schema: Optional[List[Column]] = None + """The list of Column objects describing the CSV file (required). + Each Column must have at least ``name`` and ``column_type`` set, + and the order must match the CSV header columns exactly.""" + + # Response fields + response_session_id: Optional[str] = None + """The session ID from the import response""" + + total_count: Optional[int] = None + """Total number of processed rows""" + + created_count: Optional[int] = None + """Number of newly created rows in the grid""" + + updated_count: Optional[int] = None + """Number of updated rows in the grid""" + + def fill_from_dict( + self, synapse_response: Union[Dict[str, Any], Any] + ) -> "GridCsvImportRequest": + """ + Converts a response from the REST API into this dataclass. + + Arguments: + synapse_response: The response from the REST API. + + Returns: + The GridCsvImportRequest object. 
+ """ + self.response_session_id = synapse_response.get("sessionId", None) + self.total_count = synapse_response.get("totalCount", None) + self.created_count = synapse_response.get("createdCount", None) + self.updated_count = synapse_response.get("updatedCount", None) + return self + + def to_synapse_request(self) -> Dict[str, Any]: + """ + Converts this dataclass to a dictionary suitable for a Synapse REST API + request. + + Returns: + A dictionary representation of this object for API requests. + """ + request_dict = {"concreteType": self.concrete_type} + if self.session_id is not None: + request_dict["sessionId"] = self.session_id + if self.file_handle_id is not None: + request_dict["fileHandleId"] = self.file_handle_id + csv_desc = ( + self.csv_descriptor + if self.csv_descriptor is not None + else CsvTableDescriptor() + ) + request_dict["csvDescriptor"] = csv_desc.to_synapse_request() + if self.schema is not None: + request_dict["schema"] = [col.to_synapse_request() for col in self.schema] + return request_dict + + +@dataclass +class DownloadFromGridRequest(AsynchronousCommunicator): + """ + A request to download grid data as a CSV file. + + Note: The downloaded CSV does NOT include validation columns. 
+ + Attributes: + concrete_type: The concrete type for the request + session_id: The grid session ID + write_header: Whether to include column names as header + include_row_id_and_row_version: Whether to include row ID and version + include_etag: Whether to include row etag + csv_table_descriptor: The description of the CSV for download + file_name: Optional name for the downloaded file + response_session_id: The session ID from the download response + results_file_handle_id: The file handle ID for the resulting CSV + """ + + concrete_type: str = DOWNLOAD_FROM_GRID_REQUEST + """The concrete type for the request""" + + session_id: Optional[str] = None + """The grid session ID""" + + write_header: Optional[bool] = True + """Whether to include column names as header. Defaults to True.""" + + include_row_id_and_row_version: Optional[bool] = True + """Whether to include row ID and version columns. Defaults to True.""" + + include_etag: Optional[bool] = True + """Whether to include row etag column. Defaults to True.""" + + csv_table_descriptor: Optional[CsvTableDescriptor] = None + """The description of the CSV for download""" + + file_name: Optional[str] = None + """Optional name for the downloaded file""" + + # Response fields + response_session_id: Optional[str] = None + """The session ID from the download response""" + + results_file_handle_id: Optional[str] = None + """The file handle ID for the resulting CSV""" + + def fill_from_dict( + self, synapse_response: Union[Dict[str, Any], Any] + ) -> "DownloadFromGridRequest": + """ + Converts a response from the REST API into this dataclass. + + Arguments: + synapse_response: The response from the REST API. + + Returns: + The DownloadFromGridRequest object. 
+ """ + self.response_session_id = synapse_response.get("sessionId", None) + self.results_file_handle_id = synapse_response.get("resultsFileHandleId", None) + return self + + def to_synapse_request(self) -> Dict[str, Any]: + """ + Converts this dataclass to a dictionary suitable for a Synapse REST API + request. + + Returns: + A dictionary representation of this object for API requests. + """ + request_dict = {"concreteType": self.concrete_type} + if self.session_id is not None: + request_dict["sessionId"] = self.session_id + if self.write_header is not None: + request_dict["writeHeader"] = self.write_header + if self.include_row_id_and_row_version is not None: + request_dict[ + "includeRowIdAndRowVersion" + ] = self.include_row_id_and_row_version + if self.include_etag is not None: + request_dict["includeEtag"] = self.include_etag + if self.csv_table_descriptor is not None: + request_dict[ + "csvTableDescriptor" + ] = self.csv_table_descriptor.to_synapse_request() + if self.file_name is not None: + request_dict["fileName"] = self.file_name + return request_dict + + +@dataclass +class SynchronizeGridRequest(AsynchronousCommunicator): + """ + A request to synchronize a grid session with its data source. + Synchronization is a two-phase process that ensures consistency between + the grid and its source. 
+ + Attributes: + concrete_type: The concrete type for the request + grid_session_id: The ID of the grid session to synchronize + response_grid_session_id: The grid session ID from the response + error_messages: Any error messages from the synchronization + """ + + concrete_type: str = SYNCHRONIZE_GRID_REQUEST + """The concrete type for the request""" + + grid_session_id: Optional[str] = None + """The ID of the grid session to synchronize""" + + # Response fields + response_grid_session_id: Optional[str] = None + """The grid session ID from the response""" + + error_messages: Optional[List[str]] = None + """Any error messages generated during synchronization""" + + def fill_from_dict( + self, synapse_response: Union[Dict[str, Any], Any] + ) -> "SynchronizeGridRequest": + """ + Converts a response from the REST API into this dataclass. + + Arguments: + synapse_response: The response from the REST API. + + Returns: + The SynchronizeGridRequest object. + """ + self.response_grid_session_id = synapse_response.get("gridSessionId", None) + self.error_messages = synapse_response.get("errorMessages", None) + return self + + def to_synapse_request(self) -> Dict[str, Any]: + """ + Converts this dataclass to a dictionary suitable for a Synapse REST API + request. + + Returns: + A dictionary representation of this object for API requests. + """ + request_dict = {"concreteType": self.concrete_type} + if self.grid_session_id is not None: + request_dict["gridSessionId"] = self.grid_session_id + return request_dict + + +@dataclass +class GridSession: + """ + Basic information about a grid session. + + Represents a + [Synapse GridSession](https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/grid/GridSession.html). 
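+
+    Example: Reading session info from a REST response
+
+        A minimal sketch of how `fill_from_dict` maps a raw REST payload onto
+        this dataclass (payload values below are illustrative):
+
+        ```python
+        session = GridSession().fill_from_dict(
+            {"sessionId": "12345", "sourceEntityId": "syn123"}
+        )
+        print(session.session_id)  # "12345"
+        print(session.source_entity_id)  # "syn123"
+        ```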
+ + Attributes: + session_id: The unique sessionId that identifies the grid session + started_by: The user that started this session + started_on: The date-time when the session was started + etag: Changes when the session changes + modified_on: The date-time when the session was last changed + last_replica_id_client: The last replica ID issued to a client + last_replica_id_service: The last replica ID issued to a service + grid_json_schema_id: The $id of the JSON schema used for validation + source_entity_id: The synId of the source table/view/csv + """ + + session_id: Optional[str] = None + """The unique sessionId that identifies the grid session""" + + started_by: Optional[str] = None + """The user that started this session""" + + started_on: Optional[str] = None + """The date-time when the session was started""" + + etag: Optional[str] = None + """Changes when the session changes""" + + modified_on: Optional[str] = None + """The date-time when the session was last changed""" + + last_replica_id_client: Optional[int] = None + """The last replica ID issued to a client.""" + + last_replica_id_service: Optional[int] = None + """The last replica ID issued to a service.""" + + grid_json_schema_id: Optional[str] = None + """The $id of the JSON schema used for model validation""" + + source_entity_id: Optional[str] = None + """The synId of the table/view/csv that this grid was cloned from""" + + def fill_from_dict(self, synapse_response: Dict[str, Any]) -> "GridSession": + """ + Converts a response from the REST API into this dataclass. + + Arguments: + synapse_response: The response from the REST API. + + Returns: + The GridSession object. 
+ """ + self.session_id = synapse_response.get("sessionId", None) + self.started_by = synapse_response.get("startedBy", None) + self.started_on = synapse_response.get("startedOn", None) + self.etag = synapse_response.get("etag", None) + self.modified_on = synapse_response.get("modifiedOn", None) + self.last_replica_id_client = synapse_response.get("lastReplicaIdClient", None) + self.last_replica_id_service = synapse_response.get( + "lastReplicaIdService", None + ) + self.grid_json_schema_id = synapse_response.get("gridJsonSchema$Id", None) + self.source_entity_id = synapse_response.get("sourceEntityId", None) + return self + + +@dataclass +class ListGridSessionsRequest: + """ + Request to list a user's active grid sessions. + + Attributes: + concrete_type: The concrete type for the request + source_id: Optional filter by source entity synId + next_page_token: Pagination token + """ + + concrete_type: str = LIST_GRID_SESSIONS_REQUEST + """The concrete type for the request""" + + source_id: Optional[str] = None + """Optional. When provided, only sessions with this synId are returned""" + + next_page_token: Optional[str] = None + """Forward the returned 'nextPageToken' to get the next page""" + + def to_synapse_request(self) -> Dict[str, Any]: + """ + Converts this dataclass to a dictionary suitable for a Synapse REST API + request. + + Returns: + A dictionary representation of this object for API requests. + """ + request_dict = {"concreteType": self.concrete_type} + if self.source_id is not None: + request_dict["sourceId"] = self.source_id + if self.next_page_token is not None: + request_dict["nextPageToken"] = self.next_page_token + delete_none_keys(request_dict) + return request_dict + + +@dataclass +class ListGridSessionsResponse: + """ + Response to a request to list a user's active grid sessions. 
+ + Attributes: + concrete_type: The concrete type for the response + page: A single page of results + next_page_token: Forward this token to get the next page + """ + + concrete_type: str = LIST_GRID_SESSIONS_RESPONSE + """The concrete type for the response""" + + page: Optional[list[GridSession]] = None + """A single page of results""" + + next_page_token: Optional[str] = None + """Forward this token to get the next page of results""" + + def fill_from_dict( + self, synapse_response: Dict[str, Any] + ) -> "ListGridSessionsResponse": + """ + Converts a response from the REST API into this dataclass. + + Arguments: + synapse_response: The response from the REST API. + + Returns: + The ListGridSessionsResponse object. + """ + self.next_page_token = synapse_response.get("nextPageToken", None) + page_data = synapse_response.get("page", []) + if page_data: + self.page = [] + for session_dict in page_data: + session = GridSession() + session.fill_from_dict(session_dict) + self.page.append(session) + return self + + +class GridSynchronousProtocol(Protocol): + """ + The protocol for methods that are asynchronous but also + have a synchronous counterpart that may also be called. + """ + + def create( + self, + attach_to_previous_session=False, + *, + timeout: int = 120, + synapse_client: Optional[Synapse] = None, + ) -> "Grid": + """Creates a new grid session from a `record_set_id` or + `initial_query`.""" + return self + + def export_to_record_set( + self, *, timeout: int = 120, synapse_client: Optional[Synapse] = None + ) -> "Grid": + """Exports the grid session data back to a record set.""" + return self + + def import_csv( + self, + file_handle_id: Optional[str] = None, + path: Optional[str] = None, + dataframe: Optional[Any] = None, + schema: Optional[List[Column]] = None, + csv_descriptor: Optional[CsvTableDescriptor] = None, + *, + timeout: int = 120, + synapse_client: Optional[Synapse] = None, + ) -> "Grid": + """Imports CSV data into the grid session. 
Provide a file path, + DataFrame, or file handle ID. Schema is auto-derived when + omitted (requires path or dataframe).""" + return self + + def download_csv( + self, + download_location: Optional[str] = None, + write_header: bool = True, + include_row_id_and_row_version: bool = True, + include_etag: bool = True, + csv_table_descriptor: Optional[CsvTableDescriptor] = None, + file_name: Optional[str] = None, + *, + timeout: int = 120, + synapse_client: Optional[Synapse] = None, + ) -> str: + """Downloads grid data as a CSV file. Returns local file path.""" + return "" + + def synchronize( + self, *, timeout: int = 120, synapse_client: Optional[Synapse] = None + ) -> "Grid": + """Synchronizes the grid session with its data source.""" + return self + + def get_snapshot( + self, + *, + connect_timeout: float = 30.0, + synapse_client: Optional[Synapse] = None, + ) -> GridSnapshot: + """Get a read-only snapshot of the grid session's current state.""" + return GridSnapshot() + + def get_validation( + self, + *, + connect_timeout: float = 30.0, + synapse_client: Optional[Synapse] = None, + ) -> GridSnapshot: + """Get per-row validation results from the grid session.""" + return GridSnapshot() + + def delete(self, *, synapse_client: Optional[Synapse] = None) -> None: + """Delete the grid session.""" + return None + + @classmethod + def list( + cls, + source_id: Optional[str] = None, + *, + synapse_client: Optional[Synapse] = None, + ) -> Generator["Grid", None, None]: + """Generator to get a list of active grid sessions for the user.""" + yield from [] + + +@dataclass +@async_to_sync +class Grid(GridSynchronousProtocol): + """ + A Grid provides functionality to create and manage grid sessions in Synapse. + Grid sessions are used for curation workflows where data can be edited in a + grid format and then exported back to record sets. + + Attributes: + record_set_id: The synId of the RecordSet to initialize the grid + initial_query: Initialize from an EntityView query. 
+ Mutually exclusive with record_set_id. + session_id: The unique sessionId for this grid session + started_by: The user that started this session + started_on: The date-time when the session was started + etag: Changes when the session changes + modified_on: The date-time when the session was last changed + last_replica_id_client: The last replica ID issued to a client + last_replica_id_service: The last replica ID issued to a service + grid_json_schema_id: The $id of the JSON schema for validation + source_entity_id: The synId of the source table/view/csv + record_set_version_number: The version number of the exported record set + validation_summary_statistics: Summary statistics for validation results + csv_import_total_count: Total rows processed in last CSV import + csv_import_created_count: Rows created in last CSV import + csv_import_updated_count: Rows updated in last CSV import + synchronize_error_messages: Error messages from last synchronization + """ + + record_set_id: Optional[str] = None + """The synId of the RecordSet to use for initializing the grid""" + + initial_query: Optional[Query] = None + """Initialize a grid session from an EntityView. 
+ Mutually exclusive with record_set_id.""" + + session_id: Optional[str] = None + """The unique sessionId that identifies the grid session""" + + started_by: Optional[str] = None + """The user that started this session""" + + started_on: Optional[str] = None + """The date-time when the session was started""" + + etag: Optional[str] = None + """Changes when the session changes""" + + modified_on: Optional[str] = None + """The date-time when the session was last changed""" + + last_replica_id_client: Optional[int] = None + """The last replica ID issued to a client.""" + + last_replica_id_service: Optional[int] = None + """The last replica ID issued to a service.""" + + grid_json_schema_id: Optional[str] = None + """The $id of the JSON schema for model validation in this grid session""" + + source_entity_id: Optional[str] = None + """The synId of the table/view/csv that this grid was cloned from""" + + record_set_version_number: Optional[int] = None + """The version number of the exported record set""" + + validation_summary_statistics: Optional[ValidationSummary] = None + """Summary statistics for validation results""" + + csv_import_total_count: Optional[int] = None + """Total rows processed in last CSV import""" + + csv_import_created_count: Optional[int] = None + """Rows created in last CSV import""" + + csv_import_updated_count: Optional[int] = None + """Rows updated in last CSV import""" + + synchronize_error_messages: Optional[List[str]] = None + """Error messages from last synchronization""" + + async def create_async( + self, + attach_to_previous_session=False, + *, + timeout: int = 120, + synapse_client: Optional[Synapse] = None, + ) -> "Grid": + """ + Creates a new grid session from a `record_set_id` or `initial_query`. + + Arguments: + attach_to_previous_session: If True and using `record_set_id`, + will attach to an existing active session if one exists. + timeout: Seconds to wait for the job to complete. Defaults to 120. 
+ synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last + created instance from the Synapse class constructor. + + Returns: + Grid: The Grid object with populated session_id. + + Raises: + ValueError: If `record_set_id` or `initial_query` is not provided. + """ + if not self.record_set_id and not self.initial_query: + raise ValueError( + "record_set_id or initial_query is required to create a " "GridSession" + ) + + trace.get_current_span().set_attributes( + { + "synapse.record_set_id": self.record_set_id or "", + "synapse.session_id": self.session_id or "", + } + ) + + if self.record_set_id and attach_to_previous_session: + async for existing_session in self.list_async( + source_id=self.record_set_id, synapse_client=synapse_client + ): + self.session_id = existing_session.session_id + self.started_by = existing_session.started_by + self.started_on = existing_session.started_on + self.etag = existing_session.etag + self.modified_on = existing_session.modified_on + self.last_replica_id_client = existing_session.last_replica_id_client + self.last_replica_id_service = existing_session.last_replica_id_service + self.grid_json_schema_id = existing_session.grid_json_schema_id + self.source_entity_id = existing_session.source_entity_id + return self + + create_request = CreateGridRequest( + record_set_id=self.record_set_id, initial_query=self.initial_query + ) + result = await create_request.send_job_and_wait_async( + timeout=timeout, synapse_client=synapse_client + ) + + result.fill_grid_session_from_response(self) + + return self + + async def export_to_record_set_async( + self, *, timeout: int = 120, synapse_client: Optional[Synapse] = None + ) -> "Grid": + """ + Exports the grid session data back to a record set. This will create + a new version of the original record set with the modified data. + + Arguments: + timeout: Seconds to wait for the job to complete. Defaults to 120. 
+ synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last + created instance from the Synapse class constructor. + + Returns: + Grid: The Grid object with export information populated. + + Raises: + ValueError: If session_id is not provided. + """ + if not self.session_id: + raise ValueError("session_id is required to export a GridSession") + + trace.get_current_span().set_attributes( + {"synapse.session_id": self.session_id or ""} + ) + + export_request = GridRecordSetExportRequest(session_id=self.session_id) + result = await export_request.send_job_and_wait_async( + timeout=timeout, synapse_client=synapse_client + ) + + self.record_set_id = result.response_record_set_id + self.record_set_version_number = result.record_set_version_number + self.validation_summary_statistics = result.validation_summary_statistics + + return self + + async def _derive_schema_async( + self, + path: Optional[str] = None, + dataframe: Optional[Any] = None, + *, + synapse_client: Optional[Synapse] = None, + ) -> List[Column]: + """Derive the column schema from a CSV file or DataFrame. + + Column names come from the CSV header or DataFrame columns. + Column types are resolved from the JSON schema bound to the + grid session. Columns not found in the schema default to STRING. + """ + import csv as csv_module + + from synapseclient.api.json_schema_services import get_json_schema_body + from synapseclient.extensions.curator.file_based_metadata_task import ( + _get_column_type_from_js_property, + ) + from synapseclient.models.table_components import ColumnType + + # 1. Get column names from the data source + if path is not None: + with open(path, newline="") as f: + reader = csv_module.reader(f) + column_names = next(reader) + elif dataframe is not None: + column_names = list(dataframe.columns) + else: + raise ValueError( + "Either path or dataframe must be provided to " "derive the schema" + ) + + # 2. 
Get column types from the bound JSON schema + type_map: Dict[str, ColumnType] = {} + + # Ensure we have grid session info with the schema ID + if self.grid_json_schema_id is None and self.session_id: + from synapseclient.api import get_grid_session + + session_info = await get_grid_session( + session_id=self.session_id, + synapse_client=synapse_client, + ) + self.grid_json_schema_id = session_info.get("gridJsonSchema$Id", None) + + if self.grid_json_schema_id: + try: + schema_body = await get_json_schema_body( + json_schema_uri=self.grid_json_schema_id, + synapse_client=synapse_client, + ) + properties = schema_body.get("properties", {}) + for prop_name, prop_def in properties.items(): + type_map[prop_name] = _get_column_type_from_js_property(prop_def) + except Exception: + pass # Fall back to STRING for all columns + + # 3. Build Column list + return [ + Column( + name=name, + column_type=type_map.get(name, ColumnType.STRING), + ) + for name in column_names + ] + + async def import_csv_async( + self, + file_handle_id: Optional[str] = None, + path: Optional[str] = None, + dataframe: Optional[Any] = None, + schema: Optional[List[Column]] = None, + csv_descriptor: Optional[CsvTableDescriptor] = None, + *, + timeout: int = 120, + synapse_client: Optional[Synapse] = None, + ) -> "Grid": + """ + Imports CSV data into the grid session. Currently supports only + grids started using a RecordSet. + + Provide exactly one of ``file_handle_id``, ``path``, or + ``dataframe``. When a local file path or DataFrame is provided, + it is uploaded automatically to obtain a file handle. + + When ``schema`` is omitted the column schema is derived + automatically: column names come from the CSV header (or + DataFrame columns) and column types are resolved from the + JSON schema bound to the grid session. If no JSON schema is + bound, all columns default to ``ColumnType.STRING``. + + Arguments: + file_handle_id: The ID of an already-uploaded file handle + containing the CSV data. 
When using this option, + ``schema`` is required. + path: Local file path to a CSV file. The file will be + uploaded automatically. + dataframe: A pandas DataFrame. It will be written as CSV + and uploaded automatically. + schema: List of Column objects describing the CSV columns. + Optional when ``path`` or ``dataframe`` is provided; + required when using ``file_handle_id``. + csv_descriptor: Optional description of the CSV format. + timeout: Seconds to wait for the job to complete. + Defaults to 120. + synapse_client: If not passed in and caching was not disabled + by ``Synapse.allow_client_caching(False)`` this will use + the last created instance from the Synapse class + constructor. + + Returns: + Grid: The Grid object with import counts populated. + + Raises: + ValueError: If session_id is not provided or if the + source arguments are invalid. + """ + from synapseclient.core.upload.multipart_upload_async import ( + multipart_upload_dataframe_async, + multipart_upload_file_async, + ) + + if not self.session_id: + raise ValueError("session_id is required to import CSV into a GridSession") + + sources = sum(x is not None for x in (file_handle_id, path, dataframe)) + if sources != 1: + raise ValueError( + "Provide exactly one of file_handle_id, path, " "or dataframe" + ) + + if file_handle_id is not None and schema is None: + raise ValueError( + "schema is required when using file_handle_id " + "directly (column names cannot be read from the " + "file). Provide a path or dataframe instead to " + "auto-derive the schema." 
+ ) + + client = Synapse.get_client(synapse_client=synapse_client) + + trace.get_current_span().set_attributes( + {"synapse.session_id": self.session_id or ""} + ) + + # Auto-derive schema when not provided + if schema is None: + schema = await self._derive_schema_async( + path=path, + dataframe=dataframe, + synapse_client=client, + ) + + if path is not None: + file_handle_id = await multipart_upload_file_async( + syn=client, + file_path=path, + content_type="text/csv", + ) + elif dataframe is not None: + file_handle_id = await multipart_upload_dataframe_async( + syn=client, + df=dataframe, + content_type="text/csv", + ) + + import_request = GridCsvImportRequest( + session_id=self.session_id, + file_handle_id=file_handle_id, + schema=schema, + csv_descriptor=csv_descriptor, + ) + result = await import_request.send_job_and_wait_async( + timeout=timeout, synapse_client=synapse_client + ) + + self.csv_import_total_count = result.total_count + self.csv_import_created_count = result.created_count + self.csv_import_updated_count = result.updated_count + + return self + + async def download_csv_async( + self, + download_location: Optional[str] = None, + write_header: bool = True, + include_row_id_and_row_version: bool = True, + include_etag: bool = True, + csv_table_descriptor: Optional[CsvTableDescriptor] = None, + file_name: Optional[str] = None, + *, + timeout: int = 120, + synapse_client: Optional[Synapse] = None, + ) -> str: + """ + Downloads grid data as a CSV file. + + Note: The downloaded CSV does NOT include validation columns. + + Arguments: + download_location: Directory to download the CSV file to. + Defaults to Synapse cache directory. + write_header: Include column names as header. Defaults to True. + include_row_id_and_row_version: Include row ID and version + columns. Defaults to True. + include_etag: Include row etag column. Defaults to True. + csv_table_descriptor: Optional CSV format description. + file_name: Optional name for the downloaded file. 
+ timeout: Seconds to wait for the job to complete. Defaults to 120. + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last + created instance from the Synapse class constructor. + + Returns: + str: The local file path of the downloaded CSV. + + Raises: + ValueError: If session_id is not provided. + """ + import os + + from synapseclient.core.download.download_async import download_by_file_handle + + if not self.session_id: + raise ValueError( + "session_id is required to download CSV from a GridSession" + ) + + trace.get_current_span().set_attributes( + {"synapse.session_id": self.session_id or ""} + ) + + download_request = DownloadFromGridRequest( + session_id=self.session_id, + write_header=write_header, + include_row_id_and_row_version=include_row_id_and_row_version, + include_etag=include_etag, + csv_table_descriptor=csv_table_descriptor, + file_name=file_name, + ) + result = await download_request.send_job_and_wait_async( + timeout=timeout, synapse_client=synapse_client + ) + + client = Synapse.get_client(synapse_client=synapse_client) + + if download_location is None: + download_location = client.cache.get_cache_dir(0) + + actual_file_name = file_name or "grid_download.csv" + destination = os.path.join(download_location, actual_file_name) + + path = await download_by_file_handle( + file_handle_id=result.results_file_handle_id, + synapse_id=self.source_entity_id or self.session_id, + entity_type="TableEntity", + destination=destination, + synapse_client=client, + ) + + return path + + async def synchronize_async( + self, *, timeout: int = 120, synapse_client: Optional[Synapse] = None + ) -> "Grid": + """ + Synchronizes the grid session with its data source. This is a + two-phase process that ensures consistency between the grid and source. + + Arguments: + timeout: Seconds to wait for the job to complete. Defaults to 120. 
+ synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last + created instance from the Synapse class constructor. + + Returns: + Grid: The Grid object with synchronization results. + + Raises: + ValueError: If session_id is not provided. + """ + if not self.session_id: + raise ValueError("session_id is required to synchronize a GridSession") + + trace.get_current_span().set_attributes( + {"synapse.session_id": self.session_id or ""} + ) + + sync_request = SynchronizeGridRequest(grid_session_id=self.session_id) + result = await sync_request.send_job_and_wait_async( + timeout=timeout, synapse_client=synapse_client + ) + + self.synchronize_error_messages = result.error_messages + + return self + + def fill_from_dict(self, synapse_response: Dict[str, Any]) -> "Grid": + """Converts a response from the REST API into this dataclass.""" + self.session_id = synapse_response.get("sessionId", None) + self.started_by = synapse_response.get("startedBy", None) + self.started_on = synapse_response.get("startedOn", None) + self.etag = synapse_response.get("etag", None) + self.modified_on = synapse_response.get("modifiedOn", None) + self.last_replica_id_client = synapse_response.get("lastReplicaIdClient", None) + self.last_replica_id_service = synapse_response.get( + "lastReplicaIdService", None + ) + self.grid_json_schema_id = synapse_response.get("gridJsonSchema$Id", None) + self.source_entity_id = synapse_response.get("sourceEntityId", None) + return self + + @skip_async_to_sync + @classmethod + async def list_async( + cls, + source_id: Optional[str] = None, + *, + synapse_client: Optional[Synapse] = None, + ) -> AsyncGenerator["Grid", None]: + """ + Generator to get a list of active grid sessions for the user. + + Arguments: + source_id: Optional. When provided, only sessions with this synId + will be returned. 
+ synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last + created instance from the Synapse class constructor. + + Yields: + Grid objects representing active grid sessions. + """ + async for session_dict in list_grid_sessions( + source_id=source_id, synapse_client=synapse_client + ): + grid = cls() + grid.fill_from_dict(session_dict) + yield grid + + @classmethod + def list( + cls, + source_id: Optional[str] = None, + *, + synapse_client: Optional[Synapse] = None, + ) -> Generator["Grid", None, None]: + """ + Generator to get a list of active grid sessions for the user. + + Arguments: + source_id: Optional. When provided, only sessions with this synId + will be returned. + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last + created instance from the Synapse class constructor. + + Yields: + Grid objects representing active grid sessions. + """ + return wrap_async_generator_to_sync_generator( + async_gen_func=cls.list_async, + source_id=source_id, + synapse_client=synapse_client, + ) + + async def get_snapshot_async( + self, + *, + connect_timeout: float = 30.0, + synapse_client: Optional[Synapse] = None, + ) -> "GridSnapshot": + """Get a read-only snapshot of the grid's current state including + per-row validation results. Does NOT commit changes. + + This connects via WebSocket, receives the current grid state, + extracts row data and validation results, and disconnects. + + Arguments: + connect_timeout: Timeout in seconds for the WebSocket connection. + Defaults to 30.0. + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last + created instance from the Synapse class constructor. + + Returns: + GridSnapshot with column names, row data, and per-row validation. + + Raises: + ValueError: If session_id is not provided. 
+ """ + from synapseclient.api import create_grid_replica, get_grid_presigned_url + from synapseclient.core.grid_websocket import GridWebSocketClient + + if not self.session_id: + raise ValueError("session_id is required to get a grid snapshot") + + trace.get_current_span().set_attributes( + {"synapse.session_id": self.session_id or ""} + ) + + # 1. Create a replica for this read-only connection + replica_response = await create_grid_replica( + session_id=self.session_id, + synapse_client=synapse_client, + ) + replica = replica_response.get("replica", {}) + replica_id = replica.get("replicaId") + + if replica_id is None: + raise ValueError("Failed to create grid replica - no replicaId returned") + + # 2. Get a presigned WebSocket URL + presigned_url = await get_grid_presigned_url( + session_id=self.session_id, + replica_id=replica_id, + synapse_client=synapse_client, + ) + + if not presigned_url: + raise ValueError("Failed to get presigned WebSocket URL for grid session") + + # 3. Connect, receive snapshot, extract data + ws_client = GridWebSocketClient(connect_timeout=connect_timeout) + snapshot = await ws_client.get_snapshot( + presigned_url=presigned_url, + replica_id=replica_id, + ) + + return snapshot + + async def get_validation_async( + self, + *, + connect_timeout: float = 30.0, + synapse_client: Optional[Synapse] = None, + ) -> "GridSnapshot": + """Get per-row validation results from the grid session. + + Convenience alias for get_snapshot_async. + + Arguments: + connect_timeout: Timeout in seconds. Defaults to 30.0. + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last + created instance from the Synapse class constructor. + + Returns: + GridSnapshot with per-row validation data. 
+ """ + return await self.get_snapshot_async( + connect_timeout=connect_timeout, + synapse_client=synapse_client, + ) + + async def delete_async(self, *, synapse_client: Optional[Synapse] = None) -> None: + """ + Delete the grid session. + + Note: Only the user that created a grid session may delete it. + + Arguments: + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last + created instance from the Synapse class constructor. + + Raises: + ValueError: If session_id is not provided. + """ + if not self.session_id: + raise ValueError("session_id is required to delete a GridSession") + + trace.get_current_span().set_attributes( + {"synapse.session_id": self.session_id or ""} + ) + + await delete_grid_session( + session_id=self.session_id, synapse_client=synapse_client + ) diff --git a/synapseclient/models/grid_query.py b/synapseclient/models/grid_query.py new file mode 100644 index 000000000..11b62affb --- /dev/null +++ b/synapseclient/models/grid_query.py @@ -0,0 +1,121 @@ +""" +Data models for grid session snapshots and per-row validation results. + +These models represent the read-only view of a grid session's current state, +including row data and validation results extracted via WebSocket connection. +""" + +from dataclasses import dataclass, field +from typing import Any, Dict, List, Optional + + +@dataclass +class GridRowValidation: + """Per-row validation results from an active grid session. + + Attributes: + is_valid: True if the row passes schema validation, False if invalid, + None if validation has not been computed yet. + validation_error_message: Summary error message if invalid. + all_validation_messages: Detailed list of all validation errors + (one per sub-schema violation). + validation_status: Computed status: 'valid', 'invalid', or 'pending' + (when data has been modified after the last validation). 
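+
+        Example: Reacting to a row's validation state
+            Assuming `row` is a GridRow taken from a GridSnapshot
+            (the variable name is illustrative):
+
+            ```python
+            if row.validation and row.validation.validation_status == "invalid":
+                for message in row.validation.all_validation_messages or []:
+                    print(message)
+            ```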
+ """ + + is_valid: Optional[bool] = None + """True if valid, False if invalid, None if not yet validated.""" + + validation_error_message: Optional[str] = None + """Summary error message if the row is invalid.""" + + all_validation_messages: Optional[List[str]] = None + """Detailed list of all validation errors.""" + + validation_status: Optional[str] = None + """Computed status: 'valid', 'invalid', or 'pending'.""" + + +@dataclass +class GridRow: + """A single row from a grid session with data and validation. + + Attributes: + row_id: The logical row identifier in format 'replicaId.sequenceNumber'. + data: The row's cell values as a dict mapping column name to value. + validation: Per-row validation results, if available. + """ + + row_id: Optional[str] = None + """The logical row identifier.""" + + data: Optional[Dict[str, Any]] = None + """The row's cell values as {column_name: value}.""" + + validation: Optional[GridRowValidation] = None + """Per-row validation results.""" + + +@dataclass +class GridSnapshot: + """Read-only snapshot of a grid session's current state. + + Contains the column names, row data, and per-row validation results + extracted from the grid session via WebSocket connection. + + Attributes: + column_names: Ordered list of column names in the grid. + rows: List of GridRow objects with data and validation. 
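+
+        Example: Summarizing validation across a snapshot
+            Assuming `snapshot` is a GridSnapshot obtained from
+            `Grid.get_snapshot_async` (the variable name is illustrative):
+
+            ```python
+            summary = snapshot.validation_summary
+            print(f"{summary['valid']}/{summary['total']} rows valid")
+            invalid_rows = [
+                row
+                for row in snapshot.rows
+                if row.validation and row.validation.is_valid is False
+            ]
+            ```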
+ """ + + column_names: List[str] = field(default_factory=list) + """Ordered list of column names in the grid.""" + + rows: List[GridRow] = field(default_factory=list) + """List of rows with data and validation results.""" + + @property + def total_rows(self) -> int: + """Total number of rows in the grid.""" + return len(self.rows) + + @property + def valid_rows(self) -> int: + """Number of rows that pass validation.""" + return sum( + 1 for row in self.rows if row.validation and row.validation.is_valid is True + ) + + @property + def invalid_rows(self) -> int: + """Number of rows that fail validation.""" + return sum( + 1 + for row in self.rows + if row.validation and row.validation.is_valid is False + ) + + @property + def pending_rows(self) -> int: + """Number of rows where validation is pending or not yet computed.""" + return sum( + 1 + for row in self.rows + if not row.validation + or row.validation.is_valid is None + or row.validation.validation_status == "pending" + ) + + @property + def validation_summary(self) -> Dict[str, int]: + """Returns a summary of validation counts. + + Returns: + Dict with keys: total, valid, invalid, pending. 
+ """ + return { + "total": self.total_rows, + "valid": self.valid_rows, + "invalid": self.invalid_rows, + "pending": self.pending_rows, + } diff --git a/synapseclient/models/mixins/asynchronous_job.py b/synapseclient/models/mixins/asynchronous_job.py index fd3649bc1..658fe75d9 100644 --- a/synapseclient/models/mixins/asynchronous_job.py +++ b/synapseclient/models/mixins/asynchronous_job.py @@ -14,10 +14,13 @@ AGENT_CHAT_REQUEST, CREATE_GRID_REQUEST, CREATE_SCHEMA_REQUEST, + DOWNLOAD_FROM_GRID_REQUEST, GET_VALIDATION_SCHEMA_REQUEST, + GRID_CSV_IMPORT_REQUEST, GRID_RECORD_SET_EXPORT_REQUEST, QUERY_BUNDLE_REQUEST, QUERY_TABLE_CSV_REQUEST, + SYNCHRONIZE_GRID_REQUEST, TABLE_UPDATE_TRANSACTION_REQUEST, ) from synapseclient.core.exceptions import ( @@ -30,6 +33,9 @@ AGENT_CHAT_REQUEST: "/agent/chat/async", CREATE_GRID_REQUEST: "/grid/session/async", GRID_RECORD_SET_EXPORT_REQUEST: "/grid/export/recordset/async", + GRID_CSV_IMPORT_REQUEST: "/grid/import/csv/async", + DOWNLOAD_FROM_GRID_REQUEST: "/grid/download/csv/async", + SYNCHRONIZE_GRID_REQUEST: "/grid/synchronize/async", TABLE_UPDATE_TRANSACTION_REQUEST: "/entity/{entityId}/table/transaction/async", GET_VALIDATION_SCHEMA_REQUEST: "/schema/type/validation/async", CREATE_SCHEMA_REQUEST: "/schema/type/create/async", diff --git a/tests/integration/synapseclient/models/async/test_recordset_async.py b/tests/integration/synapseclient/models/async/test_recordset_async.py index 191ff360b..770bb2728 100644 --- a/tests/integration/synapseclient/models/async/test_recordset_async.py +++ b/tests/integration/synapseclient/models/async/test_recordset_async.py @@ -20,7 +20,7 @@ UsedEntity, UsedURL, ) -from synapseclient.models.curation import Grid +from synapseclient.models.grid import Grid from synapseclient.services.json_schema import JsonSchemaOrganization diff --git a/tests/unit/synapseclient/models/async/unit_test_curation_async.py b/tests/unit/synapseclient/models/async/unit_test_curation_async.py index 53649445b..4dd0b94ce 
100644 --- a/tests/unit/synapseclient/models/async/unit_test_curation_async.py +++ b/tests/unit/synapseclient/models/async/unit_test_curation_async.py @@ -10,14 +10,16 @@ RECORD_BASED_METADATA_TASK_PROPERTIES, ) from synapseclient.models.curation import ( - CreateGridRequest, CurationTask, FileBasedMetadataTaskProperties, - Grid, - GridRecordSetExportRequest, RecordBasedMetadataTaskProperties, _create_task_properties_from_dict, ) +from synapseclient.models.grid import ( + CreateGridRequest, + Grid, + GridRecordSetExportRequest, +) from synapseclient.models.recordset import ValidationSummary TASK_ID = 42 @@ -725,7 +727,7 @@ async def test_delete_async(self) -> None: # WHEN I call delete_async with patch( - "synapseclient.models.curation.delete_grid_session", + "synapseclient.models.grid.delete_grid_session", new_callable=AsyncMock, return_value=None, ) as mock_delete: @@ -766,7 +768,7 @@ async def mock_list(*args, **kwargs): # WHEN I call list_async with patch( - "synapseclient.models.curation.list_grid_sessions", + "synapseclient.models.grid.list_grid_sessions", return_value=mock_list(), ): results = [] @@ -789,7 +791,7 @@ async def mock_list(*args, **kwargs): # WHEN I call list_async with a source_id with patch( - "synapseclient.models.curation.list_grid_sessions", + "synapseclient.models.grid.list_grid_sessions", return_value=mock_list(), ): results = []