
Hive Metastore on multiple workspaces may point to the same assets. We need to dedupe upgrades. #335

@nfx


Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

We need to handle duplication of credentials and prefixes across different workspaces:

  • Prefixes that show up on more than one workspace.
  • Prefixes that show up on more than one workspace with different credentials

Proposed Solution

  1. Addressing prefix conflicts/duplications requires special processing. We have the following options:
  • Prefixes that show up on more than one workspace.
    • If already upgraded, ignore
    • If not, warn and upgrade later
  • Prefixes that show up on more than one workspace with different credentials
    • Prompt, confirm choice of credentials
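
As a minimal sketch of the prefix handling above: the snippet groups prefix usages by prefix and classifies every prefix seen on more than one workspace as `ignore`, `warn`, or `prompt`. The `PrefixUse` record, its fields, and the `classify_prefixes` helper are hypothetical names for illustration and are not part of UCX. Treating a prefix as "already upgraded" when any workspace has upgraded it is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefixUse:
    """One storage prefix as seen from one workspace (hypothetical record)."""
    workspace: str
    prefix: str
    credential: str
    upgraded: bool = False


def classify_prefixes(uses):
    """Return {prefix: action} for prefixes that appear on 2+ workspaces."""
    by_prefix = defaultdict(list)
    for use in uses:
        # Assumption: a trailing slash does not distinguish two prefixes.
        by_prefix[use.prefix.rstrip("/")].append(use)

    actions = {}
    for prefix, group in by_prefix.items():
        if len({u.workspace for u in group}) < 2:
            continue  # not shared across workspaces, nothing to dedupe
        if len({u.credential for u in group}) > 1:
            actions[prefix] = "prompt"   # confirm the choice of credentials
        elif any(u.upgraded for u in group):
            actions[prefix] = "ignore"   # already upgraded elsewhere
        else:
            actions[prefix] = "warn"     # warn now, upgrade later
    return actions
```

A real implementation would still have to decide how to compare credentials (by name, by underlying identity, or by scope), which this sketch leaves open.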

Additional Context

Requires:

#910

  1. Create an exception list at the account level. The list should contain:
    1. Tables that show up on more than one workspace (pointing to the same cloud storage location)
    2. Tables that show up on more than one workspace with different metadata
    3. Tables that show up on more than one workspace with different ACLs
  2. Addressing table conflicts/duplications requires special processing. We have the following options:
    1. Define a "master" and create derivative objects as views
    2. Flag and skip the dupes
    3. Duplicate the data and create dupes
  3. Consider upgrading a workspace at a time. Highlight the conflict with prior upgrades.
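
One possible shape for the account-level exception list in item 1, as a sketch only: tables are grouped by cloud storage location, and every location seen from more than one workspace is recorded together with the reasons it needs attention. `TableRecord`, its fields, and `build_exception_list` are hypothetical names. Grouping by location rather than by table name is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRecord:
    """A table as crawled from one workspace (hypothetical record)."""
    workspace: str
    full_name: str                 # e.g. "db1.tbl1"
    location: str                  # cloud storage location
    columns: tuple = ()            # (("col", "type"), ...)
    acls: frozenset = frozenset()  # {("principal", "privilege"), ...}


def build_exception_list(records):
    """List locations shared by 2+ workspaces, with the reasons they conflict."""
    by_location = defaultdict(list)
    for record in records:
        by_location[record.location.rstrip("/")].append(record)

    exceptions = []
    for location, group in sorted(by_location.items()):
        workspaces = sorted({r.workspace for r in group})
        if len(workspaces) < 2:
            continue
        reasons = ["same_location"]
        if len({r.columns for r in group}) > 1:
            reasons.append("different_metadata")
        if len({r.acls for r in group}) > 1:
            reasons.append("different_acls")
        exceptions.append(
            {"location": location, "workspaces": workspaces, "reasons": reasons}
        )
    return exceptions
```

Each entry could then be resolved with one of the options in item 2 (define a "master", flag and skip, or duplicate the data).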

For tables, there also needs to be a report on table/database inconsistencies between workspaces, for example where workspaces A and B both have db1 but with different tables:
A: db1.tbl1, db1.tbl3
B: db1.tbl2
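
A rough sketch of how such a report could be computed, assuming each workspace's inventory is available as a set of `db.table` names. The `database_inconsistencies` function and its input shape are illustrative, not an existing UCX API.

```python
from collections import defaultdict


def database_inconsistencies(inventory):
    """inventory: {workspace: {"db.table", ...}}.

    Returns {db: {workspace: [tables not present in every workspace that has db]}}.
    """
    per_db = defaultdict(dict)
    for workspace, tables in inventory.items():
        for full_name in tables:
            db, table = full_name.split(".", 1)
            per_db[db].setdefault(workspace, set()).add(table)

    report = {}
    for db, per_workspace in per_db.items():
        if len(per_workspace) < 2:
            continue  # database exists on a single workspace only
        common = set.intersection(*per_workspace.values())
        diff = {
            ws: sorted(tables - common)
            for ws, tables in per_workspace.items()
            if tables - common
        }
        if diff:
            report[db] = diff
    return report


report = database_inconsistencies(
    {"A": {"db1.tbl1", "db1.tbl3"}, "B": {"db1.tbl2"}}
)
print(report)  # {'db1': {'A': ['tbl1', 'tbl3'], 'B': ['tbl2']}}
```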

The team(s) driving the UC migration within the account would then make a decision after reviewing this report (for example, as an Excel spreadsheet). We can also split the UCX installation across different Azure subscriptions, so that every installation just focuses on defining the target catalog mapping per database. But it is still unanswered how to handle these scenarios:

  • Two workspaces, same dbs, all different tables and columns (all managed tables, effectively)
  • Two workspaces, same dbs, 90% same tables, 10% different tables
  • Two workspaces, two different dbs

We can technically support both db_to_catalog and workspace_to_catalog, even at the same time, but db_to_catalog will override workspace_to_catalog. We also need default_catalog_for_workspace if workspace_to_catalog is set (the default catalog for all workspaces is set per metastore).
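
The override order described here could look roughly like the following. This is a sketch, not a settled design: `resolve_catalog` is a hypothetical helper, the mappings are plain dictionaries, and falling back to a single default catalog when neither mapping matches is an assumption about how default_catalog_for_workspace would behave.

```python
def resolve_catalog(workspace, database, db_to_catalog,
                    workspace_to_catalog, default_catalog):
    """Pick the target catalog for a database found on a workspace.

    Precedence: db_to_catalog, then workspace_to_catalog, then the default.
    """
    if database in db_to_catalog:
        return db_to_catalog[database]          # most specific mapping wins
    if workspace in workspace_to_catalog:
        return workspace_to_catalog[workspace]
    return default_catalog                      # assumed fallback


db_to_catalog = {"sales": "finance"}
workspace_to_catalog = {"ws-a": "team_a"}

print(resolve_catalog("ws-a", "sales", db_to_catalog, workspace_to_catalog, "main"))  # finance
print(resolve_catalog("ws-a", "hr", db_to_catalog, workspace_to_catalog, "main"))     # team_a
print(resolve_catalog("ws-b", "hr", db_to_catalog, workspace_to_catalog, "main"))     # main
```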

We can also do another override for tables, but we have unanswered questions:

  • What if it is the same db, same workspace, and same table, but with different columns/order/types? Ignore it and keep it in the Hive Metastore? And then rerun the scan for tables and grants?
  • What if, during migration, a catalog/database/table was deleted from HMS and/or UC?

Speaking of metastores, in the beginning there needs to be a workspace_to_metastore mapping with default_metastore_for_workspace. Can we come up with a good default mapping here? Coarse-grained or fine-grained? Select between the two? Ask for inline input? How many conflicts do we expect, and do they justify the need to create and support a custom mapping?

The last, very important question is what future-proof configuration format we might need for this mapping.

Metadata

Assignees

Labels

  • cloud/azure: issues related to Azure
  • feat/account-level: cross-workspace installations
  • feat/cli: CLI commands
  • feat/migration-index: mapping of databases to catalog or potentially other databases
  • migrate/managed: go/uc/upgrade Upgrade Managed Tables and Jobs
  • step/assign metastore: go/uc/upgrade Assign Metastore

Projects

Status

Blocked/Hold

Milestone

No milestone
