Modern Table Formats (Delta, Hudi, Paimon)

Pangolin supports cataloging external table formats like Delta Lake, Apache Hudi, and Apache Paimon as Generic Assets. This allows you to maintain a unified discovery layer across your entire data estate, even if some tables are managed by different engines (e.g., Databricks, EMR, Flink).

Overview

Unlike Iceberg tables, which Pangolin manages natively (handling commits, snapshots, and manifest files), other modern table formats are treated as governed pointers.

What Pangolin Does:

  • Discovery: Tables appear in search results with their specific type (e.g., DeltaTable).
  • Governance: Apply RBAC policies and tags to these tables.
  • Lineage: Track them as sources or sinks in your data pipelines.
  • Metadata: Store custom properties (e.g., managed_by: databricks, last_compacted: 2023-10-01).

What Pangolin Does NOT Do:

  • Transaction Management: Pangolin does not process Delta/Hudi commits. The query engine (Spark/Flink/Trino) handles the ACID guarantees.
  • Schema Evolution: You cannot change the schema of a Delta table via Pangolin's API.

🏗️ Supported Formats

1. Delta Lake (DeltaTable)

Developed by Databricks, widely used in Spark ecosystems.

  • Root Path: The directory containing the _delta_log folder.
  • Best Practice: Point Pangolin at the table root, not at the _delta_log folder itself.
    • Correct: s3://bucket/warehouse/db/my_table
    • Incorrect: s3://bucket/warehouse/db/my_table/_delta_log

2. Apache Hudi (HudiTable)

Popular for streaming and upsert-heavy workloads.

  • Root Path: The directory containing the .hoodie folder.
  • Best Practice: Use properties to note the table type (Copy-on-Write vs Merge-on-Read).
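
For example, a Merge-on-Read Hudi table might be registered with properties like the following (the property keys here are illustrative conventions, not a schema Pangolin enforces):

```json
{
  "name": "orders_rt",
  "kind": "HudiTable",
  "location": "s3://bucket/warehouse/db/orders_rt",
  "properties": {
    "hudi.table.type": "MERGE_ON_READ",
    "managed_by": "flink-streaming-job"
  }
}
```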

3. Apache Paimon (ApachePaimon)

A streaming data lake format (formerly Flink Table Store).

  • Root Path: The root directory of the Paimon table.
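
Before registering, it can help to sanity-check which format a root path actually contains. A minimal sketch for local paths (the Delta and Hudi marker directories come from the sections above; using a `snapshot/` directory as the Paimon marker is an assumption about its layout):

```shell
# Guess a table's format from its root directory layout (local paths only).
detect_table_format() {
  root="$1"
  if [ -d "$root/_delta_log" ]; then
    echo "DeltaTable"       # Delta keeps its transaction log here
  elif [ -d "$root/.hoodie" ]; then
    echo "HudiTable"        # Hudi keeps table metadata here
  elif [ -d "$root/snapshot" ]; then
    echo "ApachePaimon"     # assumption: Paimon roots contain a snapshot/ dir
  else
    echo "Unknown"
  fi
}
```

Usage: `detect_table_format /data/warehouse/db/my_table`. For object-store paths you would list the prefix with your cloud CLI instead of testing local directories.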

🚀 How to Catalog These Tables

Use Case: The "Unmanaged" Catalog

You have existing pipelines writing Delta tables using Databricks. You want these tables to be discoverable by analysts in Pangolin alongside your native Iceberg tables.

Option 1: Via Management UI (Data Explorer)

  1. Navigate to the Data Explorer.
  2. Drill down into the Catalog and Namespace where you want to register the table.
  3. Click the "Register Asset" button (top-right).
  4. Fill in the details:
    • Name: customer_churn_prediction
    • Type: Select Delta Lake Table (or Hudi/Paimon).
    • Location: s3://finance-data/delta/churn_preds/
  5. (Optional) Add Properties:
    • owner: data-science-team
    • update_frequency: daily
  6. Click Register.

Option 2: Via REST API

```shell
# Register a Delta Table
curl -X POST http://localhost:8080/api/v1/catalogs/analytics/namespaces/gold/assets \
  -H "Authorization: Bearer <token>" \
  -H "X-Pangolin-Tenant: <tenant-id>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "customer_churn_prediction",
    "kind": "DeltaTable",
    "location": "s3://finance-data/delta/churn_preds/",
    "properties": {
      "provider": "delta",
      "managed_by": "databricks-job-123"
    }
  }'
```
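
When scripting many registrations, it can help to build the JSON body separately and syntax-check it before POSTing. A minimal sketch (the fields mirror the call above; `python3 -m json.tool` is used purely as a local JSON validator):

```shell
# Build the registration payload and fail fast on malformed JSON.
name="customer_churn_prediction"
kind="DeltaTable"
location="s3://finance-data/delta/churn_preds/"

payload=$(printf '{"name":"%s","kind":"%s","location":"%s","properties":{"provider":"delta"}}' \
  "$name" "$kind" "$location")

# Validate locally before making the API call; exits non-zero on bad JSON.
echo "$payload" | python3 -m json.tool > /dev/null || exit 1
```

Pass the result to curl with `-d "$payload"` in place of the inline body.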

🔌 Interoperability Patterns

XTable (formerly OneTable)

If you use tools like Apache XTable to translate metadata between formats (e.g., Delta -> Iceberg), you can register the same data location twice with different semantics:

  1. Primary: Register as DeltaTable (the source of truth).
  2. Read-Replica: Register as IcebergTable (using the XTable-generated metadata) if you want Pangolin to serve it to Iceberg clients.
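
Since the two registrations differ only in their kind, a script can generate both payloads from one definition. A sketch (the table name and path are illustrative assumptions; in practice the IcebergTable entry would reference the XTable-generated metadata):

```shell
# Build both registration payloads for one XTable-translated dataset:
# the Delta source of truth and the Iceberg read-replica.
base="s3://bucket/warehouse/db/events"   # illustrative path
payloads=$(for kind in DeltaTable IcebergTable; do
  printf '{"name":"events","kind":"%s","location":"%s/"}\n' "$kind" "$base"
done)
echo "$payloads"
```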

UniForm (Delta Lake 3.0)

If your Delta tables use UniForm (Universal Format) to generate Iceberg metadata:

  • For full Iceberg client compatibility, register the table as an IcebergTable, pointing Pangolin at the Iceberg metadata directory that UniForm generates.
  • Registering it as a DeltaTable is still useful for discovery, but Pangolin will not attempt to serve Iceberg metadata for it.

❓ FAQ

Q: Can I run SQL queries on these tables via Pangolin?
A: No. Pangolin is a catalog, not a query engine. Use an engine such as Dremio, Trino, Spark, or StarRocks to query the tables. Pangolin ensures you can find them and know where they live.

Q: Do these tables show up in Iceberg clients?
A: No. Generic assets are filtered out of the standard Iceberg REST API responses (loadTable) because they don't have valid Iceberg metadata. They only appear in Pangolin's asset search and discovery APIs.