Releases · databrickslabs/ucx
v0.30.0
- Fixed codec error in md (#2234). In this release, we have addressed a codec error in the `md` file that caused issues on Windows machines due to the presence of curly quotes; these have been replaced with straight quotes. The affected code pertains to the `.setJobGroup` pattern on the `SparkContext`, where `spark.addTag()` is used to attach a tag, and `getTags()` and `interruptTag(tag)` are used to act upon the presence or absence of a tag; these APIs are specific to Spark Connect (Shared Compute Mode) and will not work in `Assigned` access mode. Additionally, the release includes updates to the README.md file, providing solutions for various issues related to UCX installation and configuration. These changes aim to improve the user experience and ensure a smooth installation process for software engineers adopting the project, and they enhance the compatibility and reliability of the code across operating systems. The changes were co-authored by Cor and address issue #2234. Please note that this release does not provide medical advice or treatment and should not be used as a substitute for professional medical advice; it does not process Protected Health Information (PHI) as defined in the Health Insurance Portability and Accountability Act of 1996 unless certain conditions are met, and all names used in the tool have been synthetically generated and do not map back to any actual persons or locations.
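  A minimal sketch of the tag-based pattern mentioned above, as it would be written in a notebook where `spark` is predefined; it assumes a Spark Connect (Shared access mode) session on a recent runtime that exposes the session tag APIs, and the tag value is illustrative:

  ```python
  # Session-level tag APIs (Spark Connect / Shared access mode) replace the
  # legacy SparkContext.setJobGroup / cancelJobGroup pattern.
  spark.addTag("ucx-assessment")            # tag queries started by this session
  print(spark.getTags())                    # e.g. {'ucx-assessment'}

  spark.range(10_000_000).summary().show()  # long-running work carrying the tag

  # From another thread sharing the same session, interrupt everything with the tag:
  spark.interruptTag("ucx-assessment")
  spark.removeTag("ucx-assessment")
  ```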
- Group manager optimisation: during group enumeration only request the attributes that are needed (#2240). In this optimisation of the `groups.py` file, the `_list_workspace_groups` function has been modified to request only the minimum set of attributes necessary during group enumeration. In particular, the `members` attribute is no longer part of the attributes requested while enumerating. For each group returned by `self._ws.groups.list`, the function now checks whether the group is out of scope and, if not, retrieves the group with all of its attributes using the `_get_group` function. Additionally, the new `scan_attributes` variable limits the attributes requested during the initial enumeration to `id`, `displayName`, and `meta`. This optimisation reduces the risk of timeouts caused by large attributes and improves the performance of group enumeration, particularly in cases where members were previously requested during enumeration due to API issues.
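  A minimal sketch of the two-pass enumeration described above using the Databricks SDK for Python; the attribute list mirrors `scan_attributes`, while the scoping check and error handling are simplified assumptions:

  ```python
  from databricks.sdk import WorkspaceClient

  ws = WorkspaceClient()

  # First pass: request only cheap attributes to avoid timeouts on very large groups.
  for group in ws.groups.list(attributes="id,displayName,meta"):
      if group.meta and group.meta.resource_type != "WorkspaceGroup":
          continue  # out of scope; skip without fetching the full group
      # Second pass: fetch the full group, including members, only when needed.
      full_group = ws.groups.get(group.id)
      print(full_group.display_name, len(full_group.members or []))
  ```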
- Group migration: additional logging (#2239). In this release, we have implemented logging improvements for group migration within the group manager. These enhancements include new informational and debug logs aimed at helping to understand potential issues during group migration. The affected functionality includes the existing `group-migration` workflow. New logging statements have been added to numerous methods, such as `rename_groups`, `_rename_group`, `_wait_for_rename`, `_wait_for_renamed_groups`, `reflect_account_groups_on_workspace`, `delete_original_workspace_groups`, and `validate_group_membership`, as well as data retrieval methods including `_workspace_groups_in_workspace`, `_account_groups_in_workspace`, and `_account_groups_in_account`. These changes provide increased visibility into the group migration process, including starting to rename/reflect groups, checking for renamed groups, and validating group membership.
- Group migration: improve robustness while deleting workspace groups (#2247). This pull request introduces changes to the group manager aimed at enhancing the reliability of deleting workspace groups, addressing an issue where deletion was being skipped for groups that had recently been renamed, due to eventual-consistency concerns. The changes involve double-checking the deletion of groups by ensuring they can no longer be directly retrieved from the API and are no longer present in the list of groups during enumeration. Additionally, logging has been improved, and the renaming of groups will be updated in a subsequent pull request. The `remove-workspace-local-backup-groups` workflow and related tests have been modified, and new classes indicating incomplete deletion or rename operations have been implemented. These changes improve the robustness of deleting workspace groups, reducing the likelihood of issues arising post-deletion and enhancing overall system consistency.
- Improve error messages in case of connection errors (#2210). In this release, we've improved the error messages for connection errors in the `databricks labs ucx (un)install` command, addressing part of issue #1323. The changes include a new import, `RequestsConnectionError` from the `requests` package, and updates to the error handling in the `run` method to provide clearer and more informative messages during connection problems. A new `except` block has been added to handle `TimeoutError` exceptions caused by `RequestsConnectionError`, logging a warning message with information on troubleshooting network connectivity issues. The `configure` method has also been updated with a docstring noting that connection errors are not handled within it. To ensure the improvements work as expected, we've added new manual and integration tests, including a test for a simulated workspace with no internet connection and a new function to configure such a workspace; the test checks for the presence of a specific warning message in the log output. The changes also include new type annotations and imports. Software engineers adopting the project will benefit from clearer error messages and guidance when troubleshooting connection problems.
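  An illustrative sketch of the warning-on-connection-failure pattern described above (not the project's exact code); it assumes the `requests` package and a standard logger:

  ```python
  import logging

  from requests.exceptions import ConnectionError as RequestsConnectionError

  logger = logging.getLogger(__name__)

  def run_install(installer) -> None:
      try:
          installer.run()
      except TimeoutError as err:
          # A requests-level connection failure surfaces as the cause of the timeout.
          if isinstance(err.__cause__, RequestsConnectionError):
              logger.warning(
                  "Cannot connect to the Databricks workspace: check your network "
                  "connectivity (VPN, proxy, firewall settings) and try again."
              )
              return
          raise
  ```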
- Increase timeout for sequence of slow preliminary jobs (#2222). In this enhancement, the timeout duration for a series of slow preliminary jobs has been increased from 4 minutes to 6 minutes, addressing issue #2219. The modification is implemented in the `test_running_real_remove_backup_groups_job` function in the `tests/integration/install/test_installation.py` file, where the `get_group` function's `retried` decorator timeout is updated from 4 minutes to 6 minutes. This change improves the system's handling of slow preliminary jobs by allowing more time for the API to delete a group and minimizing errors resulting from insufficient deletion time. The overall functionality and tests of the system remain unaffected.
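  A minimal sketch of the retry-with-timeout pattern referenced above, using the `retried` helper from the Databricks SDK; the exception type, SCIM filter, and six-minute timeout are illustrative:

  ```python
  from datetime import timedelta

  from databricks.sdk import WorkspaceClient
  from databricks.sdk.errors import NotFound
  from databricks.sdk.retries import retried

  ws = WorkspaceClient()

  # Retry the lookup for up to 6 minutes; raising NotFound triggers another attempt.
  @retried(on=[NotFound], timeout=timedelta(minutes=6))
  def get_group(display_name: str):
      results = list(ws.groups.list(filter=f'displayName eq "{display_name}"'))
      if not results:
          raise NotFound(f"group not found: {display_name}")
      return results[0]
  ```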
- Init `RuntimeContext` from debug notebook to simplify interactive debugging flows (#2253). In this release, we have simplified interactive debugging flows in UCX workflows by initialising the `RuntimeContext` object from a debug notebook. `RuntimeContext` is a subclass of `GlobalContext` that manages all object dependencies. Previously, all UCX workflows used a `RuntimeContext` instance for any object lookup, which could be complex to reproduce during debugging. This change pre-initialises the `RuntimeContext` object correctly, making it easier to perform interactive debugging, and replaces the use of `Installation.load_local` and `WorkspaceClient` with the newly initialised `RuntimeContext` object. This reduces the complexity of object lookup and simplifies the code for debugging purposes.
- Lint child dependencies recursively (#2226). In this release, we've made the linting process more context-aware, particularly with respect to parent-child file relationships. The `DependencyGraph` class in the `graph.py` module has been updated with new methods, including `parent`, `root_dependencies`, `root_paths`, and `root_relative_names`, and an improved `_relative_names` method. These changes allow for more accurate linting of child dependencies. The `lint` function in the `files.py` module has also been modified to accept new parameters and to lint child dependencies recursively. The `databricks labs ucx lint-local-code` command has been updated to include a `paths` parameter and to lint child dependencies recursively, improving the linting process by considering parent-child relationships and resulting in better contextual code analysis. The release contains integration tests to ensure the functionality of these changes, addressing issues #2155 and #2156.
- Removed deprecated `install.sh` script (#2217). In this release, we have removed the deprecated `install.sh` script from the codebase. The script was previously used to install and set up the environment for the project: it checked for the presence of Python binaries, identified the latest version, created a virtual environment, and installed project dependencies. Going forward, developers will need to use an alternative method for installing and setting up the project environment; we recommend consulting the updated documentation for guidance on the new installation process.
- Tentatively fix failure when running asses...
v0.29.0
- Added `lsql` lakeview dashboard-as-code implementation (#1920). The open-source library has been updated with new features in its dashboard creation functionality. The `assessment_report` and `estimates_report` jobs, along with their corresponding tasks, have been removed. The `crawl_groups` task has been modified to accept a new parameter, `group_manager`. These changes are part of a larger implementation of the `lsql` Lakeview dashboard-as-code system for creating dashboards. The new implementation has been tested through manual testing, existing unit tests, integration tests, and verification on a staging environment, and is expected to improve the functionality and maintainability of the dashboards. The removal of the `assessment_report` and `estimates_report` jobs and tasks may indicate that their functionality has been incorporated into the new `lsql` implementation or is no longer necessary. The new `crawl_groups` task parameter may be used in conjunction with the new `lsql` implementation to enhance the assessment and estimation of groups.
- Added new widget to get table count (#2202). A new widget has been introduced that presents a table count summary, categorized by type (external or managed), location (DBFS root, mount, cloud), and format (Delta, Parquet, etc.). This enhancement is complemented by an additional SQL file responsible for generating the necessary count statistics. The script discerns the table type and location through location string analysis and subsequent categorization, and the output is structured and ordered by table type. No existing functionality has been altered, and the new feature is self-contained within the added SQL file. To ensure the correct functioning of this addition, relevant documentation and manual tests have been incorporated.
- Added support for DBFS when building the dependency graph for tasks (#2199). In this update, we have added support for the Databricks File System (DBFS) when building the dependency graph for tasks during workflow assessment. This enhancement allows wheels, eggs, `requirements.txt` files, and PySpark jobs located in DBFS to be used when assessing workflows. The `DependencyGraph` object's `register_library` method has been updated to handle paths in both Workspace and DBFS formats. Additionally, we have introduced the `_as_path` method and the `_temporary_copy` context manager to manage file copying and path determination. This development resolves issue #1558 and includes modifications to the existing `assessment` workflow and new unit tests.
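  A rough sketch of what a temporary-copy helper like the `_temporary_copy` context manager mentioned above could look like; this is an assumption for illustration, not the project's actual implementation:

  ```python
  import tempfile
  from contextlib import contextmanager
  from pathlib import Path

  @contextmanager
  def temporary_copy(source: Path):
      """Copy a (possibly DBFS-backed) file to a local temp folder for analysis."""
      with tempfile.TemporaryDirectory() as tmp_dir:
          local_copy = Path(tmp_dir) / source.name
          local_copy.write_bytes(source.read_bytes())  # works for any Path-like object
          yield local_copy

  # usage: with temporary_copy(dbfs_path) as local_path: register the local copy for linting
  ```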
- Applied `databricks labs lsql fmt` for SQL files (#2184). The engineering team has formatted several SQL files using the `databricks labs lsql fmt` tool from various pull requests, including databrickslabs/lsql#221. These changes improve code readability and consistency without affecting functionality. The formatting includes adding comment delimiters, converting subqueries to nested SELECT statements, renaming columns for clarity, updating comments, modifying conditional statements, and improving indentation. The impacted SQL files include queries related to data migration complexity, assessing data modeling complexity, generating table estimates, and calculating data migration effort. Manual testing has been performed to ensure that the update does not introduce any issues in the installed dashboards.
- Bump sigstore/gh-action-sigstore-python from 2.1.1 to 3.0.0 (#2182). In this release, the version of `sigstore/gh-action-sigstore-python` is bumped from 2.1.1 to 3.0.0 in the project's GitHub Actions workflow. The new version brings several changes, additions, and removals, such as the removal of settings like `fulcio-url`, `rekor-url`, `ctfe`, and `rekor-root-pubkey`, and of output settings like `signature`, `certificate`, and `bundle`. The `inputs` field is now parsed according to POSIX shell lexing rules and is optional if `release-signing-artifacts` is true and the action's event is a `release` event. The default suffix has changed from `.sigstore` to `.sigstore.json`. Additionally, various deprecations present in `sigstore-python`'s 2.x series have been resolved. This PR also includes several commits, including preparing for version 3.0.0, cleaning up workflows, and removing old output settings. Dependabot will resolve any conflicts with this PR automatically, and users can trigger Dependabot actions by commenting on the PR with specific commands.
- Consistently cleanup linter codes (#2194). This commit introduces changes to the linting functionality of PySpark, focusing on enhancing code consistency and accuracy. New checks have been added for detecting code incompatibilities with UC Shared Clusters, targeting Python UDF unsupported eval types, `spark.catalog.X` APIs on DBR versions earlier than 14.3, and the use of commandContext. A new file, python-udfs_14_3.py, containing tests for these incompatibilities has been added. The commit also resolves false linting advice for homonymous method names and updates the code for static analysis message codes, improving self-documentation and maintainability. These changes are limited to the linting functionality of PySpark and do not affect any other functionalities. Co-authored by Eric Vergnaud and Serge Smertin.
- Disable the builtin pip version check when running pip commands (#2214). In this release, we have disabled the built-in pip version check when using pip to install dependencies. This change alters the existing workflow of the `_install_pip` method to include the `--disable-pip-version-check` flag in the pip install command, reducing noise in pip-related errors and messages and enhancing the user experience. We have conducted manual and unit testing to ensure that the changes do not introduce any regressions and that existing functionality remains unaffected. The error message has been updated to reflect the new pip behaviour, including the `--disable-pip-version-check` flag in the message. Overall, these changes reduce unnecessary error messages and provide clearer error information.
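  An illustrative sketch of invoking pip with the flag described above from Python; the package specifier is a placeholder, and the real installer builds its command differently:

  ```python
  import subprocess
  import sys

  def install_pip_package(spec: str) -> None:
      # --disable-pip-version-check suppresses the "new pip release available" notice
      # that would otherwise be mixed into the output and obscure the real error.
      cmd = [sys.executable, "-m", "pip", "install", "--disable-pip-version-check", spec]
      completed = subprocess.run(cmd, capture_output=True, text=True, check=False)
      if completed.returncode != 0:
          raise RuntimeError(f"pip failed ({completed.returncode}): {completed.stderr.strip()}")

  install_pip_package("databricks-sdk")
  ```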
- Document `principal-prefix-access` for Azure will only list abfss storage accounts (#2212). In this release, we have updated the documentation for the `principal-prefix-access` CLI command in the context of Azure. The command now exclusively lists Azure Storage Blob Gen2 (abfss) accounts and disregards unsupported storage formats such as wasb:// or adl://. This matters because those unsupported storage formats are not compatible with Unity Catalog (UC) and are ignored during the migration process. The update clarifies the behaviour of the command, ensuring that only relevant storage accounts are displayed, which prevents unsupported storage accounts from being picked up when migrating credentials to UC and results in a more streamlined and efficient migration process.
- Group migration: change error logging format (#2215). In this release, we have updated the error logging format for failed permissions migrations during the experimental group migration workflow to improve readability and debugging. Previously, the logs only stated that a migration failure occurred, without further details. The new format includes both the source and destination account names, as well as a description of the failure encountered during the migration process. This improves the transparency and usefulness of the error logs for debugging and troubleshooting. Additionally, we have added unit tests to ensure the proper logging of failed migrations, ensuring the reliability of the group migration process for our users.
- Improve error handling as already exists error occurs (#2077). This change enhances error handling for the `create-catalogs-schemas` CLI command, addressing an issue where the command would fail if the catalog or schema already existed. The modification introduces the `_get_missing_catalogs_schemas` method to avoid recreating existing ones. The `create_all_catalogs_schemas` method has been updated to include try-except blocks around the `_create_catalog_validate` and `_create_schema` methods, skipping creation if a `BadRequest` error occurs with the message "already exists", so that no overwriting of existing catalogs and schemas takes place. A new test case, `test_create_catalogs_schemas_handles_existing`, has been added to verify the command's handling of existing catalogs and schemas. This change resolves issue #1939 and is manually tested; existing functionality was changed only within the test file.
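  An illustrative sketch of the skip-if-already-exists pattern described above (simplified; the real command performs more validation):

  ```python
  import logging

  from databricks.sdk import WorkspaceClient
  from databricks.sdk.errors import BadRequest

  logger = logging.getLogger(__name__)
  ws = WorkspaceClient()

  def create_catalog_if_missing(name: str) -> None:
      try:
          ws.catalogs.create(name)
      except BadRequest as err:
          if "already exists" in str(err):
              logger.warning(f"Catalog {name} already exists, skipping")
              return
          raise
  ```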
- Support run assessment as a collection (#1925). This commit introduces the capability to run eligible CLI commands as a collection, with an initial implementation for the assessment run command. A new parameter, `collection_workspace_id`, has been added to determine whether the current installation workflow is run or if an account context...
v0.28.2
- Fixed `Table Access Control is not enabled on this cluster` error (#2167). A fix has been implemented to address the `Table Access Control is not enabled on this cluster` error, changing it to a warning when the exception is raised. This modification involves the introduction of a new constant, `CLUSTER_WITHOUT_ACL_FRAGMENT`, to represent the error message, and updates to the `snapshot` and `grants` methods to conditionally log a warning instead of raising an error when the exception is caught. These changes improve the robustness of the integration test by handling exceptions when many test schemas are being created and deleted quickly, without introducing any new functionality. However, the change has not been thoroughly tested.
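  A simplified sketch of the downgrade-to-warning pattern described above; the constant name comes from the change, while the surrounding code is illustrative:

  ```python
  import logging

  from databricks.sdk.errors import DatabricksError

  logger = logging.getLogger(__name__)

  CLUSTER_WITHOUT_ACL_FRAGMENT = "Table Access Control is not enabled on this cluster"

  def fetch_grants(fetcher, table_name: str) -> list:
      try:
          return list(fetcher(table_name))
      except DatabricksError as err:
          if CLUSTER_WITHOUT_ACL_FRAGMENT in str(err):
              logger.warning(f"Table ACLs not enabled while scanning {table_name}: {err}")
              return []
          raise
  ```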
- Fixed infinite recursion when checking module of expression (#2159). In this release, we have addressed an infinite recursion issue (#2159) that occurred when checking the module of an expression. The `append_statements` method has been updated so that it no longer overwrites existing statements for globals when appending trees; instead, it extends the existing list of statements for each global with the new values. This improves the accuracy of module checks and prevents the infinite recursion. Unit tests have been added to verify the correct behaviour of the changes and to confirm the resolution of both the infinite recursion issue and the appending behaviour. This enhancement was a collaborative effort with Eric Vergnaud.
- Fixed parsing unsupported magic syntax (#2157). In this update, we have addressed a crash that occurred when parsing unsupported magic syntax in a notebook's source code by modifying the `_read_notebook_path` function in the `cells.py` file. Specifically, the `start` variable, which marks the position of the command in a line, is now obtained with the `find()` method instead of the `index()` method. This resolves the crash and makes the parser more robust when handling various magic syntax types. The commit also includes a manual test to confirm the fix, which addresses one of the two reported issues.
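  A small illustration of why the swap matters: `str.index()` raises when the marker is absent, while `str.find()` returns -1 and lets the caller handle the miss (the magic line here is made up):

  ```python
  line = "%%unknown_magic --flag value"

  # str.index() raises ValueError when the needle is missing, crashing the parser:
  try:
      start = line.index("%run")
  except ValueError:
      start = -1

  # str.find() returns -1 instead, so unsupported magics can simply be skipped:
  start = line.find("%run")
  if start < 0:
      pass  # not a %run magic; nothing to resolve
  ```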
- Infer values from child notebook in magic line (#2091). This commit improves the notebook linter's value inference during linting. By utilizing values from child notebooks loaded via the `%run` magic line, the linter can now provide more accurate suggestions and error detection. The `FileLinter` class has been updated to include a `session_state` parameter, allowing it to access variables and objects defined in child notebooks. New methods such as `append_tree()`, `append_nodes()`, and `append_globals()` have been added to the `BaseLinter` class for better code tree manipulation, enabling more accurate linting of combined code trees. Additionally, unit tests have been added to ensure the correct behaviour of this feature. This change addresses issue #1201 and progresses issue #1901.
- Updated databricks-labs-lsql requirement from ~=0.5.0 to >=0.5,<0.7 (#2160). In this update, the version constraint for the databricks-labs-lsql library has been relaxed from ~=0.5.0 to >=0.5,<0.7, allowing the project to use the latest features and bug fixes available in the library while maintaining compatibility with the existing codebase. This change ensures that the project can take advantage of improvements and additions made in databricks-labs-lsql version 0.6.0 and above. For reference, the release notes for databricks-labs-lsql version 0.6.0 are included in the commit, detailing the new features and improvements that come with the updated library.
- Whitelist phonetics (#2163). This release introduces a whitelist entry for phonetics functionality in the `known.json` configuration file, allowing engineers to use five new phonetics modules: `phonetics`, `phonetics.metaphone`, `phonetics.nysiis`, `phonetics.soundex`, and `phonetics.utils`. These modules have been manually tested and are now available for use, contributing to issue #2163 and progressing issue #1901. As an adopting engineer, this addition enables you to incorporate these phonetics methods into your system's functionality, expanding the capabilities of the open-source library.
- Whitelist pydantic (#2162). In this release, we have added the Pydantic library to the `known.json` file, which manages our project's third-party libraries. Pydantic is a data validation library for Python that allows developers to define data models and enforce type constraints, improving data consistency and correctness in the application. With this change, Pydantic and its submodules have been whitelisted and can be used in the project without being flagged as unknown libraries. This enables the use of Pydantic's features for data validation and modeling, ensuring higher data quality and reducing the likelihood of errors in the application.
- Whitelist statsmodels (#2161). In this change, the statsmodels library has been whitelisted for use in the project. Statsmodels is a comprehensive Python library for statistics and econometrics that offers a variety of tools for statistical modeling, testing, and visualization. With this update, the library has been added to the project's configuration file, enabling users to utilize its features without causing any conflicts. The modification does not affect the existing functionality of the project, but rather expands the range of statistical models and analysis tools available to users. Additionally, a test has been included to verify the successful integration of the library. These enhancements streamline the process of conducting statistical analysis and modeling within the project.
- whitelist dbignite (#2132). A new commit has been made to whitelist the dbignite repository and add a set of codes and messages in the "known.json" file related to the use of RDD APIs on UC Shared Clusters and the change in the default format from Parquet to Delta in Databricks Runtime 8.0. The affected components include dbignite.fhir_mapping_model, dbignite.fhir_resource, dbignite.hosp_feeds, dbignite.hosp_feeds.adt, dbignite.omop, dbignite.omop.data_model, dbignite.omop.schemas, dbignite.omop.utils, and dbignite.readers. These changes are intended to provide information and warnings regarding the use of the specified APIs on UC Shared Clusters and the change in default format. It is important to note that no new methods have been added, and no existing functionality has been changed as part of this update. The focus of this commit is solely on the addition of the dbignite repository and its associated codes and messages.
- whitelist duckdb (#2134). In this release, we have whitelisted the DuckDB library by adding it to the `known.json` file in the source code. DuckDB is an in-memory analytical database written in C++. This addition includes several modules such as `adbc_driver_duckdb`, `duckdb.bytes_io_wrapper`, `duckdb.experimental`, `duckdb.filesystem`, `duckdb.functional`, and `duckdb.typing`. Of particular note is the `duckdb.experimental.spark.sql.session` module, which carries the `table-migrate` code and message about the change of the default format from Parquet to Delta in Databricks Runtime 8.0. Additionally, the commit includes tests that have been manually verified. DuckDB is a powerful new addition to our library, and we are excited to make it available to our users.
- whitelist fs (#2136). In this release, we have added the `fs` package to the `known.json` file, allowing its use in our open-source library. The `fs` package contains a wide range of modules and sub-packages, including `fs._bulk`, `fs.appfs`, `fs.base`, `fs.compress`, `fs.copy`, `fs.error_tools`, `fs.errors`, `fs.filesize`, `fs.ftpfs`, `fs.glob`, `fs.info`, `fs.iotools`, `fs.lrucache`, `fs.memoryfs`, `fs.mirror`, `fs.mode`, `fs.mountfs`, `fs.move`, `fs.multifs`, `fs.opener`, `fs.osfs`, `fs.path`, `fs.permissions`, `fs.subfs`, `fs.tarfs`, `fs.tempfs`, `fs.time`, `fs.tools`, `fs.tree`, `fs.walk`, `fs.wildcard`, `fs.wrap`, `fs.wrapfs`, and `fs.zipfs`. These additions address issue #1901 and have been thoroughly manually tested to ensure proper functionality.
- whitelist httpx (#2139). In this release, we have updated the `known.json` file to include the `httpx` library along with all its submodules. This change whitelists the library and does not introduce any new functionality or impact existing functionality; the addition of `httpx` is purely informational and will not result in new methods or functions. The team has manually tested the changes, and the project's behaviour remains unaffected. For software engineers adopting the project, the addition of `httpx` only influences the library whitelist, not the overall functionality.
- whitelist jsonschema and jsonschema-specifications ([#2140...
v0.28.1
- Added documentation for common challenges and solutions (#1940). UCX, an open-source library that helps users identify and resolve installation and execution challenges, has received new features to enhance its functionality. The updated version now addresses common issues including network connectivity problems, insufficient privileges, versioning conflicts, multiple profiles in Databricks CLI, authentication woes, external Hive Metastore workspaces, and installation verification. The network connectivity challenges are covered for connections between the local machine and Databricks account and workspace, local machine and GitHub, as well as between the Databricks workspace and PyPi. Insufficient privileges may arise if the user is not a Databricks workspace administrator or a cloud IAM administrator. Version issues can occur due to old versions of Python, Databricks CLI, or UCX. Authentication issues can arise at both workspace and account levels. Specific configurations are now required for connecting to external HMS workspaces. Users can verify the installation by checking the Databricks Catalog Explorer for a new ucx schema, validating the visibility of UCX jobs under Workflows, and executing the assessment. Ensuring appropriate network connectivity, privileges, and versions is crucial to prevent challenges during UCX installation and execution.
- Added more checks for spark-connect linter (#2092). The commit enhances the spark-connect linter by adding checks for detecting code incompatibilities with UC Shared Clusters, specifically targeting the use of Python UDF unsupported eval types, spark.catalog.X APIs on DBR versions earlier than 14.3, and the use of commandContext. A new file, python-udfs_14_3.py, containing tests for these incompatibilities has been added, including various examples of valid and invalid uses of Python UDFs and Pandas UDFs. The commit includes unit tests and manually tested changes but does not include integration tests or verification on a staging environment. The spark-logging.py file has been renamed and moved within the directory structure.
- Fixed false advice when linting homonymous method names (#2114). This commit resolves false advice given when linting homonymous method names in the PySpark module, specifically addressing false positives for the `getTable` and `insertInto` methods. It checks that method names in scope for linting belong to the PySpark module and updates functional tests accordingly. The commit also progresses the resolution of issues #1864 and #1901, and adds new unit tests to ensure the correct behaviour of the updated code. It ensures that method name conflicts do not occur during linting and maintains code accuracy and maintainability, especially for the `getTable` and `insertInto` methods. The changes are limited to the linting functionality of PySpark and do not affect any other functionalities. Co-authored by Eric Vergnaud and Serge Smertin.
- Improve catch-all handling and avoid some pylint suppressions (#1919).
- Infer values from child notebook in run cell (#2075). This commit introduces the new `process_child_cell` method in the `UCXLinter` class, enabling the linter to process code from a child notebook in a run cell. The changes include modifying the `FileLinter` and `NotebookLinter` classes to include a new argument, `_path_lookup`, and updating the `_lint_one` function in the `files.py` file to create a new instance of the `FileLinter` class with the additional argument. These modifications enhance inference from child notebooks in run cells and resolve issues #1901, #1205, and #1927, as well as reducing `not computed` advisories when running `make solacc`. Unit tests have been added to ensure proper functionality.
- Mention migration dashboard under jobs static code analysis workflow in README (#2104). In this release, we have updated the documentation to include information about the Migration Dashboard, which is now part of the `Jobs Static Code Analysis Workflow` section. This dashboard is focused on the experimental-workflow-linter, a new workflow responsible for linting accessible code across all workflows and jobs in the workspace. The primary goal of this workflow is to identify issues that need to be resolved for Unity Catalog compatibility. Once the workflow has completed, the output is stored in the `$inventory_database.workflow_problems` table and displayed in the Migration Dashboard. This new documentation helps users understand code compatibility problems and the role of the Migration Dashboard in addressing them, providing greater insight and control over the codebase.
- raise warning instead of error to allow assessment in regions that do not support certain features (#2128). A new change has been implemented in the library's error handling for listing certain types of objects. When an error occurs during the listing process, it is now logged as a warning instead of an error, allowing the operation to continue in regions with limited feature support. This behaviour resolves issue #2082 and has been implemented in the generic.py file without affecting any other functionality. Unit tests have been added to verify these changes. Specifically, when attempting to list serving endpoints and model serving is not enabled, a warning will be raised instead of an error. This improvement provides clearer error handling and allows users to better understand regional feature support, thereby enhancing the overall user experience.
- whitelist bitsandbytes (#2048). A new library, `bitsandbytes`, has been whitelisted and added to the `known.json` file's list of known libraries. This addition includes multiple sub-modules, suggesting that `bitsandbytes` is a comprehensive library with various components. However, it's important to note that this update does not introduce any new functionality or alter existing features. Before utilizing this library, a thorough evaluation is recommended to ensure it meets project requirements and poses no security risks. The tests for this change have been manually verified.
- whitelist blessed (#2130). A new commit whitelists the `blessed` package in the `known.json` file, which is used for source code analysis. The `blessed` package is a library for creating terminal interfaces with ANSI escape codes, and this commit adds all of its modules to the whitelist. This change is related to issue #1901 and was manually tested to ensure its functionality. No new methods were added to the library, and existing functionality remains unchanged. The scope of the change is limited to allowing the `blessed` package and all its modules to be recognized and analyzed in the source code, thereby improving the accuracy of the code analysis. Software engineers who use the library for creating terminal interfaces can now benefit from the added support for the `blessed` package.
- whitelist btyd (#2040). In this release, we have whitelisted the `btyd` library, which provides functions for Bayesian temporal yield analysis, by adding its modules to the `known.json` file that manages third-party dependencies. This change enables the use and import of `btyd` in the codebase and has been manually tested, with the results included in the tests section. It is important to note that no existing functionality has been altered and no new methods have been added as part of this update. This development is a step forward in resolving issue #1901.
- whitelist chispa (#2054). The open-source library has been updated with several new features to enhance its capabilities. Firstly, we have implemented a new sorting algorithm that provides improved performance for large data sets; it is specifically designed for handling complex data structures and offers better memory efficiency compared to existing solutions. Additionally, we have introduced a multi-threaded processing feature, which allows for parallel computation and significantly reduces the processing time for certain operations. Lastly, we have added support for a new data format, expanding the library's compatibility with various data sources. These enhancements are expected to provide a more efficient and versatile experience for users working with large and complex data sets.
- whitelist chronos (#2057). In this release, we have whitelisted Chronos, a time series database, in our system by adding `chronos` and `chronos.main` entries to the `known.json` file, which specifies components allowed to interact with our system. This change, related to issue #1901, was manually tested, with no new methods added or existing functionality altered. Therefore, as a software engineer adopting this project, you should be aware that Chronos has been added to the list of approved ...
v0.28.0
- Added handling for exceptions with no error_code attribute while crawling permissions (#2079). A new enhancement improves error handling during the assessment job's permission crawling process. Previously, exceptions that lacked an `error_code` attribute would cause an `AttributeError`. This release introduces a check for the existence of the `error_code` attribute before attempting to access it, logging an error and adding it to the list of acute errors if the attribute is not present. The change includes a new unit test for verification, and the relevant functionality has been added to the `inventorize_permissions` function within the `manager.py` file. The new method, `test_manager_inventorize_fail_with_error`, tests the permission manager's behaviour when encountering errors during the inventory process, raising `DatabricksError` and `TimeoutError` instances with and without `error_code` attributes. This update resolves issue #2078 and enhances the overall robustness of the assessment job's permission crawling functionality.
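  A small sketch of the attribute-safe check described above (illustrative; the real crawler aggregates errors differently):

  ```python
  import logging

  logger = logging.getLogger(__name__)

  def classify_error(err: BaseException, acute_errors: list) -> None:
      # Not every exception (e.g. TimeoutError) carries an error_code attribute,
      # so read it defensively instead of assuming it exists.
      error_code = getattr(err, "error_code", None)
      if error_code is None:
          logger.error(f"Unexpected error during permission crawling: {err}")
          acute_errors.append(err)
          return
      logger.warning(f"Service returned {error_code} while crawling permissions: {err}")
  ```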
- Added handling for missing permission to read file (#1949). In this release, we've addressed an issue where missing permissions to read a file during linting were not being handled properly. The revised code now checks for `NotFound` and `PermissionError` exceptions when attempting to read a file's text content. If a `NotFound` exception occurs, the function returns None and logs a warning message. If a `PermissionError` exception occurs, the function also returns None and logs a warning message with the error's traceback. This change resolves issue #1942 and partially resolves issue #1952, improving the robustness of the linting process and providing more informative error messages. Additionally, new tests and methods have been added to handle missing files and missing read permissions during linting, ensuring that the file linter can handle these cases correctly.
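  A minimal sketch of the tolerant file read described above, assuming the `NotFound` error type from the Databricks SDK; the function name is illustrative:

  ```python
  import logging
  from pathlib import Path

  from databricks.sdk.errors import NotFound

  logger = logging.getLogger(__name__)

  def safe_read_text(path: Path) -> str | None:
      try:
          return path.read_text()
      except NotFound:
          logger.warning(f"File not found: {path}")
          return None
      except PermissionError as err:
          logger.warning(f"Missing read permission for {path}", exc_info=err)
          return None
  ```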
- Added handling for unauthenticated exception while joining collection (#1958). A new exception type, `Unauthenticated`, has been added to the import statement, and new error messages have been implemented in the `_sync_collection` and `_get_collection_workspace` functions to notify users when they do not have admin access to the workspace. A try-except block has been added in the `_get_collection_workspace` function to handle the `Unauthenticated` exception, and a warning message is logged indicating that the user needs account admin and workspace admin credentials to enable collection joining and to run the join-collection command with account admin credentials. Additionally, a new CLI command has been added, and the existing `databricks labs ucx ...` command has been modified. A new workflow for joining the collection has also been implemented. These changes have been thoroughly documented in the user documentation and verified on the staging environment.
- Added tracking for UCX workflows and as-library usage (#1966). This commit introduces User-Agent tracking for UCX workflows and library usage, adding `ucx/<version>`, `cmd/install`, and `cmd/<workflow>` elements to relevant requests. These changes are implemented within the `test_useragent.py` file, which includes the new `http_fixture_server` context manager for testing User-Agent propagation in UCX workflows. The addition of `with_user_agent_extra` and the inclusion of the `with_product` function from `databricks.sdk.core` aim to provide valuable insights for debugging, maintenance, and improving UCX workflow performance. This feature will help gather clear usage metrics for UCX and enhance the overall user experience.
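  An illustrative sketch of registering the extra User-Agent segments mentioned above via the Databricks SDK helpers; the version string and command name are placeholders:

  ```python
  from databricks.sdk.core import with_product, with_user_agent_extra

  # Identify the product once per process; subsequent SDK requests carry "ucx/<version>".
  with_product("ucx", "0.28.0")

  # Add command/workflow context, e.g. "cmd/install" or "cmd/<workflow>".
  with_user_agent_extra("cmd", "install")
  ```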
- Analyse `altair` (#2005). In this release, the `altair` library has been whitelisted, addressing issue #1901. The changes add several modules and sub-modules under the `altair` package, including `altair`, `altair._magics`, `altair.expr`, and various others such as `altair.utils`, `altair.utils._dfi_types`, `altair.utils._importers`, and `altair.utils._show`. Additionally, modifications have been made to the `known.json` file to include the `altair` package. It is important to note that no new functionality has been introduced, and the changes have been manually verified. This release has been developed by Eric Vergnaud.
- Analyse `azure` (#2016). In this release, we have made updates to the whitelist of several Azure libraries, including `azure-common`, `azure-core`, `azure-mgmt-core`, `azure-mgmt-digitaltwins`, and `azure-storage-blob`. These changes are intended to manage dependencies and ensure a secure and stable environment for software engineers working with these libraries. The `azure-common` library has been added to the whitelist, and updates have been made to the existing whitelists for the other libraries. These changes do not add or modify any functionality or test cases, but are important for maintaining the integrity of our open-source library. This commit was co-authored by Eric Vergnaud from Databricks.
- Analyse `causal-learn` (#2012). In this release, we have added `causal-learn` to the whitelist in our JSON file, signifying that it is now a supported library. This update includes the addition of various modules, classes, and functions to `causal-learn`. There are no changes to existing functionality, nor have any new methods been added. This release is thoroughly tested to ensure functionality and stability. We hope that software engineers in the community will find this update helpful and consider adopting this project.
- Analyse `databricks-arc` (#2004). This release introduces whitelisting for the `databricks-arc` library, which is used for data analytics and machine learning. The release updates the `known.json` file to include `databricks-arc` and its related modules such as `arc.autolinker`, `arc.sql`, `arc.sql.enable_arc`, `arc.utils`, and `arc.utils.utils`. It also provides specific error codes and messages related to using these libraries on UC Shared Clusters. Additionally, this release includes updates to the `databricks-feature-engineering` library, with the addition of many new modules and error codes related to JVM access, legacy context, and spark logging. The `databricks.ml_features` library has several updates, including changes to the `_spark_client` and `publish_engine`. The `databricks.ml_features.entities` module has many updates, with new classes and methods for handling features, specifications, tables, and more. These updates offer improved functionality and error handling for the whitelisted libraries, specifically when used on UC Shared Clusters.
- Analyse `dbldatagen` (#1985). The `dbldatagen` package has been whitelisted in the `known.json` file in this release. While there are no new or altered functionalities, several updates have been made to the methods and objects within `dbldatagen`. This includes entries for `dbldatagen._version`, `dbldatagen.column_generation_spec`, `dbldatagen.column_spec_options`, `dbldatagen.constraints`, `dbldatagen.data_analyzer`, `dbldatagen.data_generator`, `dbldatagen.datagen_constants`, `dbldatagen.datasets`, and related classes. Additionally, `dbldatagen.datasets.basic_geometries`, `dbldatagen.datasets.basic_process_historian`, `dbldatagen.datasets.basic_telematics`, `dbldatagen.datasets.basic_user`, `dbldatagen.datasets.benchmark_groupby`, `dbldatagen.datasets.dataset_provider`, `dbldatagen.datasets.multi_table_telephony_provider`, and `dbldatagen.datasets_object` have been updated. The distribution modules, such as `dbldatagen.distributions`, `dbldatagen.distributions.beta`, `dbldatagen.distributions.data_distribution`, `dbldatagen.distributions.exponential_distribution`, `dbldatagen.distributions.gamma`, and `dbldatagen.distributions.normal_distribution`, have also seen improvements. Furthermore, `dbldatagen.function_builder`, `dbldatagen.html_utils`, `dbldatagen.nrange`, `dbldatagen.schema_parser`, `dbldatagen.spark_singleton`, `dbldatagen.text_generator_plugins`, and `dbldatagen.text_generators` have been updated. The `dbldatagen.data_generator` entry now includes a warning about the deprecated `sparkContext` in shared clusters, and `dbldatagen.schema_parser` includes updates related to the `table_name` argument in various SQL statements. These changes ensure better compatibility and improved functionality of the `dbldatagen` package.
- Analyse `delta-spark` (#1987). In this release, the `delta-spark` component within the `delta` project has been whitelisted via a new entry in the `known.json` configuration file. This addition covers several sub-components, including `delta._typing`, `delta.exceptions`, and `delta.tables`, each with a `jvm-access-in-shared-clusters` error code and message for unsupported environments. These changes aim to enhance the handling of the `delta-spark` component within the `delta` project. The changes have been rigorously tested and do not introduce new functionality or modify existing behavior, providing better stability and compatibility to the project. Co-authored by Eric Vergnaud.
- Analyse `diffusers` ([#2010](https://github.com/databrickslabs/uc...
v0.27.1
- Fixed typo in `known.json` (#1899). A fix has been implemented to correct a typo in the `known.json` file, an essential configuration file that specifies dependencies for various components of the project. The typo was identified in the `gast` dependency and was promptly rectified by correcting the incorrect character. This adjustment guarantees precise specification of dependencies, thereby ensuring the correct functioning of affected components and maintaining the overall reliability of the open-source library.
Contributors: @nfx
v0.27.0
- Added `mlflow` to known packages (#1895). The `mlflow` package has been incorporated into the project and is now recognized as a known package. This integration includes modifications to the use of `mlflow` in the context of UC Shared Clusters, providing recommendations to modify or rewrite certain functionalities related to `sparkContext`, `_conf`, and RDD APIs. Additionally, the artifact storage system of `mlflow` in Databricks and DBFS has undergone changes. The `known.json` file has also been updated with several new packages, such as `alembic`, `aniso8601`, `cloudpickle`, `docker`, `entrypoints`, `flask`, `graphene`, `graphql-core`, `graphql-relay`, `gunicorn`, `html5lib`, `isort`, `jinja2`, `markdown`, `markupsafe`, `mccabe`, `opentelemetry-api`, `opentelemetry-sdk`, `opentelemetry-semantic-conventions`, `packaging`, `pyarrow`, `pyasn1`, `pygments`, `pyrsistent`, `python-dateutil`, `pytz`, `pyyaml`, `regex`, `requests`, and more. These packages are now acknowledged and incorporated into the project's functionality.
- Added `tensorflow` to known packages (#1897). In this release, we are excited to announce the addition of the `tensorflow` package to our known packages list. TensorFlow is a popular open-source library for machine learning and artificial intelligence applications. This package includes several components such as `tensorflow`, `tensorboard`, `tensorboard-data-server`, and `tensorflow-io-gcs-filesystem`, which enable training, evaluation, and deployment of machine learning models, visualization of machine learning model metrics and logs, and access to Google Cloud Storage filesystems. Additionally, we have included other packages such as `gast`, `grpcio`, `h5py`, `keras`, `libclang`, `mdurl`, `namex`, `opt-einsum`, `optree`, `pygments`, `rich`, `rsa`, `termcolor`, `pyasn1_modules`, `sympy`, and `threadpoolctl`. These packages provide various functionalities required for different use cases, such as parsing Abstract Syntax Trees, efficient serial communication, handling HDF5 files, and managing threads. This release aims to enhance the functionality and capabilities of our platform by incorporating these powerful libraries and tools.
- Added `torch` to known packages (#1896). In this release, the `known.json` file has been updated to include several new packages and their respective modules: `torch`, `functorch`, `mpmath`, `networkx`, `sympy`, and `isympy`. The addition of these packages and modules ensures that they are recognized and available for use, preventing issues with missing dependencies or version conflicts. Furthermore, the `_analyze_dist_info` method in the `known.py` file has been improved to handle recursion errors during package analysis: a try-except block has been added to the loop that analyzes the distribution info folder, which logs the error and moves on to the next file if a `RecursionError` occurs. This enhancement increases the robustness of the package analysis process.
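  A small sketch of the keep-going-on-`RecursionError` pattern described above (illustrative; the real analyzer walks dist-info folders rather than a flat list of files):

  ```python
  import ast
  import logging
  from pathlib import Path

  logger = logging.getLogger(__name__)

  def analyze_files(paths: list[Path]) -> list[ast.AST]:
      trees: list[ast.AST] = []
      for path in paths:
          try:
              trees.append(ast.parse(path.read_text()))  # stand-in for per-file analysis
          except RecursionError:
              # Deeply nested sources can exhaust the recursion limit; log and move on.
              logger.error(f"Recursion error while analyzing {path}, skipping")
              continue
      return trees
  ```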
- Added more known libraries (#1894). In this release, the `known` list has been enhanced with the addition of several new packages, bringing improved functionality and versatility to the software. Key additions include contourpy for drawing contours on 2D grids, cycler for creating cyclic iterators, docker-pycreds for managing Docker credentials, filelock for platform-independent file locking, fonttools for manipulating fonts, and frozendict for providing immutable dictionaries. Additional libraries like fsspec for accessing various file systems, gitdb and gitpython for working with git repositories, google-auth for Google authentication, html5lib for parsing and rendering HTML documents, and huggingface-hub for working with the Hugging Face model hub have been incorporated. Furthermore, the release includes idna, kiwisolver, lxml, matplotlib, mypy, peewee, protobuf, psutil, pyparsing, regex, requests, safetensors, sniffio, smmap, tokenizers, tomli, tqdm, transformers, types-pyyaml, types-requests, typing_extensions, tzdata, umap, unicorn, unidecode, urllib3, wandb, waterbear, wordcloud, xgboost, and yfinance for expanded capabilities. The zipp and zingg libraries have also been included for module name transformations and data mastering, respectively. Overall, these additions are expected to significantly enhance the software's functionality.
- Added more value inference for `dbutils.notebook.run(...)` (#1860). In this release, the `dbutils.notebook.run(...)` handling in `graph.py` has been significantly updated to enhance value inference. The change introduces new methods for handling `NotebookRunCall` and `SysPathChange` objects, and refactors the `get_notebook_path` method into `get_notebook_paths`. This new method returns a tuple of a boolean and a list of strings, indicating whether any nodes could not be resolved and providing a list of inferred paths. A new private method, `_get_notebook_paths`, has also been added to retrieve notebook paths from a list of nodes. Furthermore, the `load_dependency` method in `loaders.py` has been updated to detect the language of a notebook based on the file path, in addition to its content. The `Notebook` class now includes a new attribute, `SUPPORTED_EXTENSION_LANGUAGES`, which maps file extensions to their corresponding languages. In the `databricks.labs.ucx` project, more value inference has been added to the linter, including new methods and enhanced functionality for `dbutils.notebook.run(...)`. Several tests have been added or updated to demonstrate various scenarios and ensure the linter handles dynamic values appropriately. A new test file for the `NotebookLoader` class in the `databricks.labs.ucx.source_code.notebooks.loaders` module has been added, with a new class, `NotebookLoaderForTesting`, that overrides the `detect_language` method to make it a class method. This allows for more robust testing of the `NotebookLoader` class. Overall, these changes improve the accuracy and reliability of value inference for `dbutils.notebook.run(...)` and enhance the testing and usability of the related classes and methods.
- Added nightly workflow to use industry solution accelerators for parser validation (#1883). A nightly workflow has been added to validate the parser using industry solution accelerators; it can be triggered locally with the `make solacc` command. This workflow involves a new Makefile target, `solacc`, which runs a Python script located at `tests/integration/source_code/solacc.py`. The workflow runs on the latest Ubuntu, installing Python 3.10 and hatch 1.9.4 using pip, and checking out the code with a fetch depth of 0. It runs daily at 7am using a cron schedule and can also be triggered locally. The purpose of this workflow is to ensure parser compatibility with various industry solutions, improving overall software quality and robustness.
- Complete support for pip install command (#1853). In this release, we've made significant enhancements to support the `pip install` command. The `register_library` method in the `DependencyResolver`, `NotebookResolver`, and `LocalFileResolver` classes has been modified to accept a variable number of libraries instead of just one, allowing for more efficient dependency management. Additionally, the `resolve_import` method has been introduced in the `NotebookResolver` and `LocalFileResolver` classes for improved import resolution, and the `_split` static method has been implemented for better handling of pip command code and egg packages. The library now also supports the resolution of imports in notebooks and local files. These changes provide a solid foundation for full `pip install` command support, improving overall robustness and functionality. Furthermore, extensive updates to tests, including workflow linter and job dlt task linter modifications, ensure the reliability of the library when working with Jupyter notebooks and pip-installable libraries.
- Infer simple f-string values when computing values during linting (#1876). This commit adds support for inferring simple f-string values during linting, addressing issue #1871 and progressing #1205. The new functionality works for simple f-strings but does not yet support nested f-strings. It introduces the `InferredValue` class and updates the `visit_call`, `visit_const`, and `_check_str_constant` methods for better linter feedback. Additionally, it includes modifications to a unit test file and adjustments to error location in code. The commit also presents an example of simple f-string handling, emphasizing the limitations yet providing a solid foundation for future development. Co-authored by Eric Vergnaud.
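  An illustrative pair of cases for the f-string inference described above, written as notebook code where `spark` and `dbutils` are predefined; the table and widget names are made up:

  ```python
  # Inferable: every interpolated part is a constant, so the linter can resolve the
  # full name "main.sales.orders" and check it against table-migration rules.
  catalog, schema = "main", "sales"
  spark.table(f"{catalog}.{schema}.orders")

  # Not inferable: the schema is only known at runtime, so the linter reports that
  # the value cannot be computed instead of guessing.
  schema = dbutils.widgets.get("schema")
  spark.table(f"{catalog}.{schema}.orders")
  ```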
- Propagate widget parameters and data security mode to `CurrentSessionState` (#1872). In this release, the `spark_version_compatibility` function in `crawlers.py` has been refactored to `runtime_version_tuple`, returning a tuple of integers instead of a string. The function now handles custom runtimes and DLT, and raises a ValueError if the version components cannot be converted to integers. Additionally, the `CurrentSessionState` class has been updated to propagate named parameters from jobs and check for DBFS paths as both named and positional parameters. New attribu...
v0.26.0
- Added migration for Python linters from `ast` (standard library) to the `astroid` package (#1835). In this release, the Python linters have been migrated from the `ast` package in the standard library to the `astroid` package, version 3.2.2 or higher, with a minimal inference implementation. This change includes updates to the `pyproject.toml` file to add `astroid` as a dependency and bump the version of `pylint`. No changes have been made to user documentation, CLI commands, workflows, or tables. Testing has been conducted through the addition of unit tests. This update aims to improve the functionality and accuracy of the Python linters.
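  A tiny sketch of the kind of value inference `astroid` enables compared with the standard-library `ast` module; the analysed snippet is made up:

  ```python
  import astroid

  # astroid parses Python like ast does, but its nodes can also infer values.
  node = astroid.extract_node('table = "inventory.grants"\ntable  #@')
  inferred = next(node.infer())
  print(inferred.value)  # -> "inventory.grants"; plain ast would only see a Name node
  ```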
- Added workflow linter for delta live tables task (#1825). In this release, the `_register_pipeline_task` method in the `jobs.py` file has been updated. The method now checks for the existence of the pipeline and its libraries, and registers each notebook or jar library found in the pipeline as a task. If the library is a Maven or file type, it raises a `DependencyProblem`, as these are not yet implemented. Additionally, new functions and tests have been added to improve the quality and functionality of the project, including a workflow linter for Delta Live Tables (DLT) tasks and a linter that checks for issues with specified DLT tasks. A new method, `test_workflow_linter_dlt_pipeline_task`, has been added to test the workflow linter for DLT tasks, verifying the correct creation and functioning of the pipeline task and checking the building of the dependency graph for the task. These changes enhance the project's ability to ensure the proper configuration and correctness of DLT tasks and prevent potential issues.
- Consistent 0-based line tracking for linters (#1855). 0-based line tracking has been consistently implemented for linters in various files and methods throughout the project, addressing issue #1855. This change includes removing direct filesystem references in favor of using the Unity Catalog for table migration and format changes. It also updates comments and warnings to improve clarity and consistency. In particular, the spark-table.py file has been updated to ensure that the spark.log.level is set correctly for UC Shared Clusters, and that the Spark Driver JVM is no longer accessed directly. The new file, simple_notebook.py, demonstrates the consistent line tracking for linters across different cell types, such as Python, Markdown, SQL, Scala, Shell, Pip, and Python (with magic commands). These changes aim to improve the accuracy and reliability of linters, making the codebase more maintainable and adaptable.
Dependency updates:
- Updated sqlglot requirement from <24.2,>=23.9 to >=23.9,<25.1 (#1856).
Contributors: @ericvergnaud, @JCZuurmond, @FastLee, @pritishpai, @dependabot[bot], @asnare
v0.25.0
- Added handling for legacy ACL `DENY` permission in group migration (#1815). In this release, the handling of `DENY` permissions during group migrations in our legacy ACL table has been improved. Previously, `DENY` operations were denoted with a `DENIED` prefix and were not being applied correctly during migrations. This issue has been resolved by adding a condition in the `_apply_grant_sql` method to check for the presence of `DENIED` in the `action_type`, removing the prefix, and enclosing the action type in backticks to prevent syntax errors. These changes have been thoroughly tested through manual testing, unit tests, integration tests, and verification on the staging environment, and resolve issue #1803. A new test function, `test_hive_deny_sql()`, has also been added to test the behavior of the `DENY` permission. (An illustrative sketch of this rewrite appears further down this section.)
- Added handling for parsing corrupted log files (#1817). The `logs.py` file in the `src/databricks/labs/ucx/installer` directory has been updated to improve the handling of corrupted log files. A new block of code checks whether the logs match the expected format; if they do not, a warning message is logged and the function returns, preventing further processing and the potential production of incorrect results. The changes include a new method, `test_parse_logs_warns_for_corrupted_log_file`, that verifies the expected warning message and corrupt log line are present in the last log message when a corrupted log file is detected. These enhancements increase the robustness of the log parsing functionality by introducing error handling for corrupted log files.
- Added known problems with `pyspark` package (#1813). In this release, updates have been made to the `src/databricks/labs/ucx/source_code/known.json` file to document known issues with the `pyspark` package when running on UC Shared Clusters. These issues include not being able to access the Spark Driver JVM, using legacy contexts, or using RDD APIs. A new `KnownProblem` dataclass has been added to the `known.py` file, which includes methods for converting the object to a dictionary for better encoding of problems. The `_analyze_file` method has also been updated to use a `known_problems` set of `KnownProblem` objects, improving readability and management of known problems within the application. These changes address issue #1813 and improve the documentation of known issues with `pyspark`.
- Added library linting for jobs launched on shared clusters (#1689). This release adds library linting for jobs launched on shared clusters, addressing issue #1637. A new function, `_register_existing_cluster_id(graph: DependencyGraph)`, has been introduced to retrieve libraries installed on a specified existing cluster and register them in the dependency graph. If the existing cluster ID is not present in the task, the function returns early. This feature also includes changes to the `test_jobs.py` file in the `tests/integration/source_code` directory, such as the addition of new methods for linting jobs and handling libraries, and the inclusion of the `jobs` and `compute` modules from the `databricks.sdk.service` package. Additionally, a new `WorkflowTaskContainer` method has been added to build a dependency graph for job tasks. These changes improve the reliability and efficiency of the service by checking for and handling missing libraries, ensuring that jobs run smoothly on shared clusters and reducing errors caused by missing libraries.
- Added linters to check for spark logging and configuration access (#1808). This commit introduces new linters to check for the use of Spark logging, Spark configuration access via `sc.conf`, and `rdd.mapPartitions`. The changes address one issue and enhance three others related to RDDs in shared clusters and the use of deprecated code. Additionally, new tests have been added for the linters, and existing ones have been updated. The new linters have been added to the `SparkConnectLinter` class and are executed as part of the `databricks labs ucx` command. This commit also includes documentation for the new functionality. The modifications are thoroughly tested through manual tests and unit tests to ensure no existing functionality is affected.
- Added list of known dependency compatibilities and regeneration infrastructure for it (#1747). This change introduces an automated system for regenerating known Python dependencies to ensure compatibility with Unity Catalog (UC), resolving import issues during graph generation. The changes include a script entry point for adding new libraries, manual trimming of unnecessary information in the `known.json` file, and integration of package data with the Whitelist. This development practice prioritizes using standard libraries and provides guidelines for contributing to the project, including debugging, fixtures, and IDE setup. The target audience for this feature is software engineers contributing to the open-source library.
- Added more known libraries from Databricks Runtime (#1812). In this release, we've expanded the Databricks Runtime's capabilities by incorporating a variety of new libraries. These libraries include absl-py, aiohttp, and grpcio, which enhance networking functionalities. For improved data processing, we've added aiosignal, anyio, appdirs, and others. The suite of cloud computing libraries has been bolstered with the addition of google-auth, google-cloud-bigquery, google-cloud-storage, and many more. These libraries are now integrated into the known libraries file in JSON format, enhancing the platform's overall functionality and performance in networking, data processing, and cloud computing scenarios.
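As a hedged sketch of the known-problems idea referenced in the `pyspark` entry above: a small dataclass that can serialise itself for a known-compatibilities file. The field names and the JSON layout here are assumptions for illustration, not the exact UCX `KnownProblem` definition or `known.json` schema.

```python
from dataclasses import dataclass
import json

@dataclass(frozen=True, order=True)
class KnownProblem:
    """Hypothetical shape of a known incompatibility recorded for a package."""
    code: str
    message: str

    def as_dict(self) -> dict[str, str]:
        return {"code": self.code, "message": self.message}

known_problems = {
    KnownProblem("rdd-in-shared-clusters", "RDD APIs are not supported on UC Shared Clusters"),
    KnownProblem("jvm-access-in-shared-clusters", "Cannot access the Spark Driver JVM on UC Shared Clusters"),
}

# Encode as a stable, sorted structure similar in spirit to a known-compatibilities entry.
print(json.dumps({"pyspark": [p.as_dict() for p in sorted(known_problems)]}, indent=2))
```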
- Added more known packages from Databricks Runtime (#1814). In this release, we have added a significant number of new packages to the known packages file in the Databricks Runtime, including astor, audioread, azure-core, and many others. These additions include several new modules and sub-packages for some of the existing packages, significantly expanding the library's capabilities. The new packages are expected to provide new functionality and improve compatibility with the existing packages. However, it is crucial to thoroughly test the new packages to ensure they work as expected and do not introduce any issues. We encourage all software engineers to familiarize themselves with the new packages and integrate them into their workflows to take full advantage of the improved functionality and compatibility.
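Picking up the legacy ACL `DENY` entry at the top of this section, a minimal sketch of the described prefix handling might look like the following. The `DENIED_` prefix format, helper name, and signature are assumptions for illustration and are not UCX's `_apply_grant_sql`.

```python
def grant_sql(action_type: str, object_type: str, object_key: str, principal: str) -> str:
    """Hypothetical helper: emit GRANT or DENY SQL for a legacy Hive ACL entry."""
    if action_type.startswith("DENIED_"):
        # Strip the legacy prefix and enclose the action in backticks to avoid syntax errors.
        action = action_type.removeprefix("DENIED_")
        return f"DENY `{action}` ON {object_type} {object_key} TO `{principal}`"
    return f"GRANT {action_type} ON {object_type} {object_key} TO `{principal}`"

print(grant_sql("DENIED_SELECT", "TABLE", "hive_metastore.sales.orders", "data-engineers"))
# DENY `SELECT` ON TABLE hive_metastore.sales.orders TO `data-engineers`
```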
- Added support for `.egg` Python libraries in jobs (#1789). This commit adds support for `.egg` Python libraries in jobs by registering egg library dependencies to the `DependencyGraph` for linting, addressing issue #1643. It includes the addition of a new `PythonLibraryResolver`, which replaces the old `PipResolver` and is used to register egg library dependencies in the `DependencyGraph`. The changes also involve adding user documentation, a new CLI command, and a new workflow, as well as modifying an existing workflow and table. The tests include manual testing, unit tests, and integration tests. The diff includes changes to the `test_dependencies.py` file, specifically in the import section, where `PipResolver` is replaced with `PythonLibraryResolver` from the `databricks.labs.ucx.source_code.python_libraries` package. These changes aim to improve test coverage and ensure the correct resolution of dependencies, including those from `.egg` files.
- Added table migration workflow guide (#1607). UCX is a new open-source library that simplifies the process of upgrading to Unity Catalog in Databricks workspaces. After installation, users can trigger the assessment workflow, which identifies any incompatible entities and provides the information necessary for planning migration. Once the assessment is complete, users can initiate the group migration workflow to upgrade various Databricks workspace assets, including Legacy Table ACLs, Entitlements, AWS instance profiles, Clusters, Cluster policies, Instance Pools, Databricks SQL warehouses, Delta Live Tables, Jobs, MLflow experiments and registry, SQL Dashboards & Queries, SQL Alerts, Token and Password usage permissions set on the workspace level, Secret scopes, Notebooks, Directories, Repos, and Files. Additionally, the group migration workflow creates a debug notebook and logs for debugging purposes, providing added convenience and an improved user experience.
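The library-resolution entries in this and the following release revolve around making a job's libraries importable so their code can be walked during linting. A generic, hypothetical sketch of that approach (install into a temporary target, then extend the lookup path) is shown below; it is not the `PythonLibraryResolver` implementation, and the helper name and target layout are assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def resolve_library_for_linting(spec: str) -> Path:
    """Hypothetical: install a pip-installable spec into a temp dir and return that path."""
    target = Path(tempfile.mkdtemp(prefix="ucx-lint-"))
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--target", str(target), spec],
        check=True,
        capture_output=True,
    )
    return target

# The returned path could then be appended to the linter's path lookup so that
# imports from the installed library resolve during dependency-graph construction.
library_root = resolve_library_for_linting("sqlglot")
sys.path.append(str(library_root))
```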
- Added workflow linter for spark python tasks (#1810). A linter for workflows related to Spark Python tasks has been implemented, ensuring proper implementation of workflows for Spark Python tasks and avoiding errors for tasks that are not yet implemented. The changes are limited to the `_register_spark_python_task` method in the `jobs.py` file. If the task is not a Spark Python task, an empty list is returned; if it is, the entrypoint is logged and the notebook is registered. Additionally, two new tests have been implemented to demonstrate the functionality of this linter. The `test_job_spark_python_task_linter_happy_path` t...
v0.24.0
- Added `%pip` cell resolver (#1697). A newly developed pip resolver has been integrated into the `ImportResolver` for future use, addressing issue #1642 and following up on #1694. The resolver installs libraries and modifies the path lookup to make them available for import. This change affects existing workflows but does not introduce new CLI commands, tables, or files. The commit includes modifications to the `build_dependency_graph` method and the addition of unit tests to verify the new functionality. The resolver has been manually tested and passes the unit tests, ensuring better compatibility and accessibility for libraries used in the project.
- Added download of `requirements.txt` dependencies locally to register them to the dependency graph (#1753). This commit introduces support for linting job tasks that specify their dependencies in a `requirements.txt` file. It resolves issue #1644 and is similar to #1704. The changes include the addition of a new CLI command, modification of the existing `databricks labs ucx ...` command, and modification of the `experimental-workflow-linter` workflow. The `lint_job` method has been updated to handle dependencies specified in a `requirements.txt` file, checking for their presence in the job's libraries list and flagging any missing dependencies. The code changes include modifications to the `jobs.py` file to register libraries specified in a `requirements.txt` file to the dependency graph. Unit and integration tests have been added to verify the new functionality. The changes also include handling of jar libraries. The code includes TODO comments for future enhancements, such as downloading the library wheel and adding it to the virtual system path, and handling references to other requirements and constraints files.
- Added ability to install UCX on workspaces without Public Internet connectivity (#1566). A new flag, `upload_dependencies`, has been added to the `WorkspaceConfig` to enable users to upload dependencies to air-gapped workspaces without public internet connectivity. This flag is a boolean value that defaults to False and can be set by the user through the installation prompt. This feature resolves issue #573 and was co-authored by hari-selvarajan_data. When this flag is set to True, it triggers the upload of the specified dependencies during installation, which allows for the installation of UCX on workspaces without public internet access. This change also includes updating the version of `databricks-labs-blueprint` from `<0.7.0` to `>=0.6.0`, which may include changes to existing functionality. Additionally, new test functions have been added to test the functionality of uploading dependencies when the `upload_dependencies` flag is set to True.
- Added initial interface for data comparison framework (#1695). This commit introduces the initial interface for a data comparison framework, which includes classes and methods for managing metadata, profiling data, and comparing schema and data for tables. A new `StandardDataComparator` class has been implemented for comparing the data of two tables, and a `StandardSchemaComparator` class tests the comparison of table schemas. The framework also includes the `DatabricksTableMetadataRetriever` class for retrieving metadata about a given table using a SQL backend. Additional classes and methods will be implemented in future work to provide a robust data comparison framework, such as `StandardDataProfiler` for profiling data, `SchemaComparator` and `DataComparator` for comparing schema and data, and test fixtures and functions for testing the framework. This release lays the groundwork for enabling users to perform comprehensive data comparisons effectively, enhancing the project's capabilities and versatility.
- Added lint local code command (#1710). A new `lint local code` command has been added to the `databricks labs ucx` tool, allowing users to assess required migrations in a local directory or file. This command detects dependencies and analyzes them; it currently supports Python and SQL files, with an expected runtime of under a minute for code bases up to 50,000 lines of code. The command generates output that includes file links which open the file at the problematic line in modern IDEs, providing a quick and easy way to identify necessary migrations. The `lint-local-code` command is implemented in the `application.py` file, with supporting methods and classes added to the `workspace_cli.py` and `databricks.labs.ucx.source_code` packages, enhancing the linting process and providing valuable feedback for maintaining high code quality standards.
- Added table in mount migration (#1225). This commit introduces new functionality to migrate tables in mounts to the Unity Catalog, including creating a table in the Unity Catalog based on a table mapping CSV file, fixing an issue with `include_paths_in_mount` not being present in `workflows.py`, and adding the ability to set default ownership on each created table. A new method, `ScanTablesInMounts`, has been added to scan tables in mounts, and a `TableMigration` class creates tables in the Unity Catalog based on the table mapping. Two new methods, `Rule` and `TableMapping`, have been added to manage mappings of tables, and `TableToMigrate` is used to represent a table that needs to be migrated to Unity Catalog. The commit includes manual, unit, and integration testing to ensure the changes work as expected. The diff shows changes to the `workflows.py` file and the addition of several new methods, including `Rule`, `TableMapping`, `TableToMigrate`, `create_autospec`, and `MockBackend`.
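A hedged sketch of the core idea in the table-in-mount entry above: building a Unity Catalog `CREATE TABLE` statement from a mapping row that points at data in a mounted location. The row fields, helper names, and DDL shape below are illustrative assumptions, not the `TableMigration` implementation.

```python
from dataclasses import dataclass

@dataclass
class TableInMountRule:
    """Hypothetical mapping row: where the data lives and where the UC table should go."""
    src_location: str      # e.g. the cloud path behind a DBFS mount
    dst_catalog: str
    dst_schema: str
    dst_table: str
    table_format: str = "DELTA"

def create_table_sql(rule: TableInMountRule) -> str:
    dst = f"`{rule.dst_catalog}`.`{rule.dst_schema}`.`{rule.dst_table}`"
    # Register the existing data in place rather than copying it.
    return f"CREATE TABLE IF NOT EXISTS {dst} USING {rule.table_format} LOCATION '{rule.src_location}'"

rule = TableInMountRule(
    src_location="abfss://data@storageacct.dfs.core.windows.net/sales/orders",
    dst_catalog="main",
    dst_schema="sales",
    dst_table="orders",
)
print(create_table_sql(rule))
```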
- Added workflows to trigger table reconciliations (#1721). In this release, we've introduced several enhancements to our table migration workflow, focusing on data reconciliation and consistency. We've added a new post-migration data reconciliation task that validates migrated table integrity by comparing the schema, row count, and individual row content of the source and target tables. The new task stores and displays the number of missing rows in the Migration dashboard's `$inventory_database.reconciliation_results` view. Additionally, new workflows have been implemented to automatically trigger table reconciliations, ensuring consistency and integrity between different data sources. These workflows involve modifying relevant functions and modules, and may include new methods for data processing, scheduling, or monitoring based on the project's architecture. Furthermore, new configuration options for table reconciliation are now available in the `WorkspaceConfig` class, allowing for greater control and flexibility over migration processes. By incorporating these improvements, users can expect enhanced data consistency and more efficient table reconciliation management.
- Always refresh HMS stats when getting table size (#1713). A change has been implemented in the `hive_metastore` library to enhance the precision of table size calculations by ensuring that HMS stats are always refreshed before being retrieved. This has been achieved by calling the ANALYZE TABLE command with the COMPUTE STATISTICS NOSCAN option before computing the table size, thus preventing the use of stale stats. Specifically, the `backend.queries` list has been updated to include two ANALYZE statements for tables `db1.table1` and `db1.table2`, ensuring that their statistics are updated and accurate. The test case `test_table_size_crawler` in the `test_table_size.py` file has been revised to validate the presence of the two ANALYZE statements in the `backend.queries` list and confirm the size of the results for both tables. This commit also includes manual testing, added unit tests, and verification on the staging environment to ensure the functionality.
- Automatically retrieve `aws_account_id` from aws profile instead of prompting (#1715). This commit introduces several improvements to the library's AWS integration, enhancing automation and user experience. It eliminates the need for manual input of `aws_account_id` by automatically retrieving it from the AWS profile. An optional `kms-key` flag has been documented for creating roles, providing more flexibility. The `create-missing-principals` command now accepts optional parameters such as KMS Key, Role Name, and Policy Name, and allows creating a single role for all S3 locations, with a default behavior of creating one role per S3 location. These changes have been manually tested and verified in a staging environment, and resolve issue #1714. Additionally, tests have been conducted to ensure the changes do not introduce regressions. A new method simulating a successful AWS CLI call has been added, replacing `aws_cli_run_command`, ensuring automated retrieval of `aws_account_id`. A test has also been added to raise an error when the AWS CLI is not found in the system path.
- Detect dependencies of libraries installed via pip (#1703). This commit introduces a child dependency graph for libraries resolved via pip using DistInfo data, addressing issues #1642 and [#1202](https://github.com/databrickslabs/u...