ORION 2.0 #409
Draft
EvanDietzMorris wants to merge 39 commits into
- remove graph spec env vars
- always load all included graph specs
- option for user to specify another spec in commands
…lt single source graphs
…allow using either
Summary
This PR reworks the high-level behavior of how ORION resolves data source and sub-graph dependencies for knowledge graphs, implements new interfaces for defining graphs and configuring graph specs, implements semantic versioning for graph releases, refactors how graph dependencies are referenced and passed around, and updates the helm charts and README accordingly.
Graph Dependency Resolution
For each knowledge source or sub-graph to be included in a KG, ORION now prefers to reuse an already-built sub-graph: it checks local storage first, then a centralized graph registry, and only runs the ingest/parser pipeline from scratch when no prebuilt graph is available locally or remotely. This also changes how individual knowledge sources are used in graphs. Instead of using artifacts directly from the ingest pipeline (in ORION_STORAGE), ORION generates releases of single data source KGs (in ORION_GRAPHS). This creates a clear separation between fresh ingests and graph releases, so single-source graphs no longer need to be treated differently from other kinds of sub-graph dependencies. This PR also adds a widespread capability to use gzipped or uncompressed jsonl files automatically.
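A rough sketch of the resolution order and the gzip-or-plain jsonl handling described above, not the actual graph_pipeline.py code. The function names, the `<root>/<graph_id>/<version>` directory layout, and the two callbacks are all hypothetical stand-ins for whatever ORION does internally.

```python
import gzip
import json
from pathlib import Path
from typing import Callable, Iterator


def resolve_subgraph(
    graph_id: str,
    version: str,
    local_root: Path,
    fetch_from_registry: Callable[[str, str, Path], bool],
    build_from_scratch: Callable[[str, str, Path], None],
) -> str:
    """Return which step supplied the graph: 'local', 'registry', or 'built'.

    Hypothetical layout: <local_root>/<graph_id>/<version>/.
    """
    target = local_root / graph_id / version
    # 1. Reuse a graph that is already present in local storage.
    if target.is_dir() and any(target.iterdir()):
        return "local"
    # 2. Otherwise try to download a prebuilt graph from the registry.
    #    The callback returns True only if it placed files in `target`.
    if fetch_from_registry(graph_id, version, target):
        return "registry"
    # 3. Fall back to running the ingest/parser pipeline from scratch.
    build_from_scratch(graph_id, version, target)
    return "built"


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one object per line from a gzipped or uncompressed jsonl file."""
    # Detect gzip by its two magic bytes rather than trusting the extension.
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Passing the registry and build steps in as callbacks keeps the ordering logic testable without network access or a real ingest run.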
Graph Spec Usage Rework
Reworked how Graph Specs are provided and used. Previously, the ORION_GRAPH_SPEC or ORION_GRAPH_SPEC_URL environment variable had to be set to the name of a single Graph Spec in the ORION codebase or to a URL. These env vars are now removed in favor of more flexible options. By default, the included Graph Specs are all loaded automatically, so a user can build any graph in any of these Specs without configuring a Graph Spec. The included graph specs were renamed and reorganized for clarity: a base set for robokop/automat graphs is always included, and other optional project-specific graphs are put into a sub-directory that isn't automatically loaded.
New options for specifying graphs were added: graphs can be given as paths to Graph Spec files, either on the command line or in helm chart configurations, or simple graphs can be declared entirely on the command line with parameters (--sources A,B,C --output_format neo4j).
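To illustrate the command-line form, here is a minimal argparse sketch that turns `--sources A,B,C --output_format neo4j` into a simple in-memory graph definition. Only the two flag names come from this PR description; the parser setup, the helper names, and the shape of the returned dict are assumptions and may not match ORION's actual CLI.

```python
import argparse
from typing import List, Optional, Sequence


def parse_sources(value: str) -> List[str]:
    """Split a comma-separated list of source ids, ignoring stray spaces."""
    sources = [item.strip() for item in value.split(",") if item.strip()]
    if not sources:
        raise argparse.ArgumentTypeError("expected at least one source id")
    return sources


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Declare a simple graph on the command line (sketch)."
    )
    parser.add_argument("--sources", type=parse_sources, required=True,
                        help="comma-separated source ids, e.g. A,B,C")
    parser.add_argument("--output_format", default=None,
                        help="optional output format, e.g. neo4j")
    return parser


def graph_from_args(argv: Optional[Sequence[str]] = None) -> dict:
    """Build a hypothetical graph definition dict from CLI arguments."""
    args = build_parser().parse_args(argv)
    # The keys below are illustrative, not ORION's Graph Spec schema.
    return {
        "sources": [{"source_id": source} for source in args.sources],
        "output_format": args.output_format,
    }
```

For example, `graph_from_args(["--sources", "A,B,C", "--output_format", "neo4j"])` yields three source entries and the output format `neo4j`.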
Semantic Versioning
Previously, versioning of graphs was based entirely on a deterministic hash of input sources and their versions. That is nice functionality, but not human readable. This PR completely refactors the implementation and terminology surrounding graph versioning, making a clear separation between the deterministic build_version, used for dependency/pipeline management, and a new release_version, which provides user-facing, human-readable versioning of graph releases using semantic versioning. Both versions are tracked and provided in metadata, but the release_version is the one used in file paths, URLs, etc. The release_version is incremented based on previous builds, and a graph spec "base_release_version" can be used to bump the major version of a graph.
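A small sketch of how the two versions could relate, under stated assumptions. The hash inputs, the 12-character truncation, and especially the bump rule (start at base_release_version, otherwise increment the minor component of the latest release) are guesses for illustration; this description does not say which semver component ORION increments.

```python
import hashlib
import json
from typing import Dict, Iterable, Tuple

Version = Tuple[int, int, int]


def build_version(source_versions: Dict[str, str]) -> str:
    """Deterministic hash of the input sources and their versions."""
    # Sorting keys makes the hash independent of dict insertion order.
    canonical = json.dumps(source_versions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def parse_semver(text: str) -> Version:
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release tags)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def next_release_version(base_release_version: str, previous: Iterable[str]) -> str:
    """Pick the next human-readable release version (assumed rule)."""
    base = parse_semver(base_release_version)
    released = sorted(parse_semver(version) for version in previous)
    # No earlier release, or the spec's base was raised past the latest
    # release (for example a major bump): start from the base version.
    if not released or released[-1] < base:
        return ".".join(str(part) for part in base)
    # Otherwise increment the minor component of the latest release.
    major, minor, _ = released[-1]
    return f"{major}.{minor + 1}.0"
```

With this rule, raising base_release_version from 1.0.0 to 2.0.0 in a graph spec makes the next release 2.0.0 regardless of how many 1.x releases exist.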
Graph Registry Client
ORION now has the capability to query a graph registry to find and download prebuilt graphs as dependencies as needed.
Graph Pipeline Refactor
To accomplish the above and to pay off tech debt, build_manager.py was renamed to graph_pipeline.py and heavily refactored to include the source resolution behavior.
Default Workspace
Previous PRs did a lot to remove the reliance on environment variables. However, I was not satisfied with the options for creating and using default directories for ORION outputs, so specifying the ORION_GRAPHS and ORION_STORAGE output directories was still required. This is complicated significantly by the need to support both users importing ORION as a package from PyPI and users who clone the repo to an arbitrary location. In the end, to simplify installation instructions and make a quick start example as quick and simple as possible, I changed my mind and implemented a default workspace, which is created in the user's home directory (~) when the env vars are not set. Configuring env vars is still preferred and expected for most users.
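The fallback can be sketched roughly as follows. ORION_GRAPHS and ORION_STORAGE are the real env var names; the workspace directory name, the sub-directory names, and the function itself are hypothetical, since this description does not state them.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical name for the default workspace directory under the home dir.
DEFAULT_WORKSPACE_NAME = "orion_workspace"


def resolve_dir(
    env_var: str,
    default_subdir: str,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Path:
    """Use the env var when set, else a directory in the default workspace."""
    env = os.environ if env is None else env
    configured = env.get(env_var)
    if configured:
        # An explicit setting always wins and is left untouched.
        return Path(configured).expanduser()
    home = Path.home() if home is None else home
    path = home / DEFAULT_WORKSPACE_NAME / default_subdir
    # Only the default workspace is created automatically.
    path.mkdir(parents=True, exist_ok=True)
    return path
```

A caller might use `resolve_dir("ORION_GRAPHS", "graphs")` and `resolve_dir("ORION_STORAGE", "storage")`; the `env` and `home` parameters exist only so the behavior can be tested without touching the real environment.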
Helm Chart Updates
To support the new Graph Spec declaration options, the helm chart now allows a custom Graph Spec to be provided via a URL, as a local file, or inlined into the helm chart.
Misc Parser Updates