Skip to content

Conversation

@SandeepTuniki
Copy link
Contributor

@SandeepTuniki SandeepTuniki commented Nov 6, 2025

(Note: Re-raised #1679 due to some merge issues there)

Adds a utility list_dataflows for data discovery to the SDMX client and CLI.

Includes unit tests and sample scripts for OECD and Eurostat.

Note: This PR originally had more utils methods for dataflow discovery (search_dataflows, get_dataflow_details , get_dataflow_series). Based on the feedback, I removed them since their functionality could be achieved through other utils.

Adds a data discovery feature to the SDMX client and CLI, allowing users to list, search, and view details of dataflows. Includes unit tests and sample scripts for OECD and Eurostat.
Addresses review comments by adding a null check in the sdmx client and replacing print statements with logging in the cli.
…actored `get_dataflow_details` to robustly handle both DataFrame and Series return types from `sdmx.to_pandas` and to gracefully catch `HTTPError` for non-existent dataflows. Updated `handle_discover_dataflows` to print DataFrame output to stdout. Adjusted `sdmx_client_test.py` to reflect these changes with accurate mocks and imports.
This change introduces a new 'discover-series' command to the SDMX CLI, allowing users to discover the available data series within a specific dataflow.

The implementation includes:
- A 'get_dataflow_series' method in the 'SdmxClient'.
- A corresponding 'discover-series' command in the 'sdmx_cli'.
- Unit tests to validate the series discovery logic.
- New sample scripts for both Eurostat and OECD to demonstrate usage.

Bug fixes identified during development, including handling of 'None' vs '{}' for keys and params, have also been addressed.
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @SandeepTuniki, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the data discovery capabilities of the SDMX client and its command-line interface. It allows users to more easily explore and understand the available data from SDMX endpoints by providing structured ways to find dataflows, inspect their details, and identify specific data series. This enhancement aims to improve the overall usability and accessibility of SDMX data within the system.

Highlights

  • New Data Discovery Features: Introduced capabilities to list, search, and view details of SDMX dataflows, and to discover specific data series within them.
  • SDMX Client Enhancements: The SdmxClient now includes methods for list_dataflows, search_dataflows, get_dataflow_details, and get_dataflow_series.
  • CLI Integration: The sdmx_cli.py tool has been updated with new commands (discover-dataflows, discover-series) and a --search flag to leverage the new discovery functionalities.
  • Sample Scripts and Tests: Added sample scripts for OECD and Eurostat to demonstrate the new discovery features, along with comprehensive unit tests for the SdmxClient.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces valuable data discovery capabilities to the SDMX client and command-line interface. The new features for listing, searching, and inspecting dataflows, as well as discovering data series, are well-implemented and accompanied by sample scripts and unit tests. The code is generally clean and follows good practices. I've identified a key issue in the get_dataflow_series method where it incorrectly assumes the name of the observation value column, which would cause it to fail. I also found a couple of related minor issues in the new unit tests. My review includes suggestions to correct these points, which should ensure the new features function as expected.

This commit addresses the review comments from the gemini-code-assist bot:

- The observation value column is now correctly identified as 'value' instead of 'OBS_VALUE'.
- The unit test mock data has been updated to use 'value'.
- The unit test assertion for 'params' now correctly checks for an empty dictionary.
@SandeepTuniki
Copy link
Contributor Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces valuable data discovery capabilities to the SDMX client and CLI. The changes are well-organized, adding new methods to SdmxClient, corresponding commands to the CLI, comprehensive unit tests, and illustrative sample scripts. The overall implementation is solid. I've provided a couple of specific comments on the SdmxClient to enhance robustness and maintainability.

This commit addresses the review comments from the gemini-code-assist bot:

- The observation value column is now correctly identified as 'value' instead of 'OBS_VALUE'.
- The unit test mock data has been updated to use 'value'.
- The unit test assertion for 'params' now correctly checks for an empty dictionary.
- The code comment for the observation value column has been updated to be consistent with the code.
This commit addresses the review comments from rohitkumarbhagat and improves the overall structure of the sdmx_import tool.

- The SdmxClient now returns a `Dataflow` dataclass instead of a dictionary, providing a more structured and typed API.
- The `Dataflow` dataclass has been moved to its own `sdmx_models.py` file to improve separation of concerns.
- The CLI has been updated to handle the new `Dataflow` dataclass.
- A TODO has been added to investigate support for sub-agency dataflows.
- The SdmxClient docstring has been updated to be more descriptive.
- The CLI now robustly handles Python path issues by modifying `sys.path`.
This commit refactors the list_dataflows method to use a more concise and idiomatic list comprehension.
This commit renames the `download_metadata` method to `get_dataflow_metadata` in the SdmxClient and updates all call sites in the CLI and sample scripts. This is the first step in a larger refactoring to consolidate metadata handling.
This commit removes the `get_dataflow_details` method from the SdmxClient, as well as its corresponding unit tests and usages in sample scripts.

This change is part of a larger refactoring to simplify the public API, as the functionality is being consolidated into the `get_dataflow_metadata` method.
This commit introduces a new `StructureMessage` dataclass in `sdmx_models.py` to serve as a consistent container for structured metadata.

- Modifies `SdmxClient.list_dataflows` and `SdmxClient.search_dataflows` to return `StructureMessage` objects, encapsulating lists of `Dataflow` objects.
- Updates all relevant call sites in `sdmx_cli.py` and sample scripts (`discover_eurostat_dataflows.py`, `discover_oecd_dataflows.py`) to correctly handle the new `StructureMessage` return type.
- Updates `sdmx_client_test.py` to reflect the new return types in assertions.

This change enhances the stability and extensibility of the public API by providing a unified and structured response format.
Removes the `search_dataflows` method from the `SdmxClient` and the corresponding `--search` flag from the CLI. This change aligns with the refactoring plan to simplify the public API and streamline functionality.


@dataclass
class Dataflow:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this SdmxDataFlow so that it doesn't conflict with other dataflows like GCP.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with either approach.

@rohitkumarbhagat what do you think of this suggestion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be fine for disambiguation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per our offline discussion, I updated both the dataclasses and added "Sdmx" prefix to them .

Removes the `get_dataflow_series` method from the `SdmxClient` and the corresponding `discover-series` subcommand from the CLI. This change aligns with the refactoring plan to simplify the public API. The functionality was redundant, as series information can be derived from the data downloaded via `download_data_as_csv`.

The following files have been updated to remove all references to the series discovery utility:
- `sdmx_client.py`
- `sdmx_cli.py`
- `sdmx_client_test.py`
- `samples/discover_oecd_series.py` (deleted)
- `samples/discover_eurostat_series.py` (deleted)
@SandeepTuniki SandeepTuniki changed the title feat(sdmx): Add utilities to discover SDMX data feat(sdmx): Add utility for data discovery in SDMX Nov 13, 2025
@SandeepTuniki SandeepTuniki requested a review from ajaits November 13, 2025 09:22

# Convert to list of dicts for DataFrame creation and print
df = pd.DataFrame([asdict(df) for df in dataflows_msg.dataflows])
print(df)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use log instead of print?

return

# Convert to list of dicts for DataFrame creation and print
df = pd.DataFrame([asdict(df) for df in dataflows_msg.dataflows])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is dataframe required? should we just print df or dict?

},
'discover-dataflows': {
'handler': lambda: handle_discover_dataflows(),
'description': 'Discover available dataflows with optional search'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional search - did not find any search flags. Should we just call list-dataflows for compatability with sdmx_client?

id=df_id,
name=str(df.name),
description=str(df.description) if df.description else '',
agency_id=str(df.maintainer.id),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unconditionally reads df.maintainer.id; several SDMX dataflows lack maintainer, so discovery will crash with AttributeError, breaking the new CLI command and sample scripts for those sources.

Suggestion: use empty if not present, to replicate api response. the user can decide the fallback behavior (use agencyId as maintainer)

# See the License for the specific language governing permissions and
# limitations under the License.
"""
sdmx_models.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can be skipped

from typing import List


@dataclass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can dataclasses be marked frozen?

class SdmxClientTest(unittest.TestCase):

@patch('sdmx.Client')
def setUp(self, mock_sdmx_client):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to reset mock_sdmx_client after each test?

sys.path.append(
os.path.abspath(os.path.join(os.path.dirname(__file__), '..', '..', '..')))

import logging
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be top level import

Comment on lines +54 to +59
# List all dataflows
all_dataflows_msg = client.list_dataflows()
logging.info(f"Found {len(all_dataflows_msg.dataflows)} dataflows.")
for df in all_dataflows_msg.dataflows:
logging.info(f" - {df.id}: {df.name}")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks duplicate of 54-58?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants