Releases: databrickslabs/ucx

v0.30.0

26 Jul 19:03
@nfx
3c783f7
  • Fixed codec error in md (#2234). In this release, we have addressed a codec error in a markdown file that caused issues on Windows machines due to the presence of curly quotes; these have been replaced with straight quotes. The affected documentation covers the .setJobGroup pattern in the SparkContext, where spark.addTag() is used to attach a tag and getTags() and interruptTag(tag) are used to act upon the presence or absence of a tag. These APIs are specific to Spark Connect (Shared Compute Mode) and will not work in Assigned access mode (a short illustrative sketch follows this release's notes). The release also updates README.md with solutions for various UCX installation and configuration issues, improving the installation experience for software engineers adopting the project across operating systems. The entry also includes the standard disclaimers that the tool does not provide medical advice or treatment and should not be used as a substitute for professional medical advice, that it does not process Protected Health Information (PHI) as defined in the Health Insurance Portability and Accountability Act of 1996 unless certain conditions are met, and that all names used in the tool have been synthetically generated and do not map back to any actual persons or locations. The changes were co-authored by Cor.
  • Group manager optimisation: during group enumeration only request the attributes that are needed (#2240). In this optimization to the groups.py file, the _list_workspace_groups function has been modified to request only the minimum set of attributes during group enumeration. A new scan_attributes variable limits the initial enumeration request to "id", "displayName", and "meta", so large attributes such as members are no longer fetched for every group up front. For each group returned by self._ws.groups.list, the function checks whether the group is out of scope and, if not, retrieves the group with all its attributes using the _get_group function. This reduces the risk of timeouts caused by large attribute payloads and improves the performance of group enumeration, which previously could stall when members were requested during enumeration due to API issues.
  • Group migration: additional logging (#2239). In this release, we have implemented logging improvements for group migration within the group manager. These enhancements include the addition of new informational and debug logs aimed at helping to understand potential issues during group migration. The affected functionality includes the existing workflow group-migration. New logging statements have been added to numerous methods, such as rename_groups, _rename_group, _wait_for_rename, _wait_for_renamed_groups, reflect_account_groups_on_workspace, delete_original_workspace_groups, and validate_group_membership, as well as data retrieval methods including _workspace_groups_in_workspace, _account_groups_in_workspace, and _account_groups_in_account. These changes will provide increased visibility into the group migration process, including starting to rename/reflect groups, checking for renamed groups, and validating group membership.
  • Group migration: improve robustness while deleting workspace groups (#2247). This pull request introduces changes to the group manager aimed at enhancing the reliability of deleting workspace groups, addressing an issue where deletion was being skipped for groups that had recently been renamed due to eventual consistency concerns. The changes involve double-checking the deletion of groups by ensuring they can no longer be directly retrieved from the API and are no longer present in the list of groups during enumeration. Additionally, logging has been improved, and the renaming of groups will be updated in a subsequent pull request. The remove-workspace-local-backup-groups workflow and related tests have been modified, and new classes indicating incomplete deletion or rename operations have been implemented. These changes improve the robustness of deleting workspace groups, reducing the likelihood of issues arising post-deletion and enhancing overall system consistency.
  • Improve error messages in case of connection errors (#2210). In this release, we've made significant improvements to error messages for connection errors in the databricks labs ucx (un)install command, addressing part of issue #1323. The changes include the addition of a new import, RequestsConnectionError from the requests package, and updates to the error handling in the run method to provide clearer and more informative messages during connection problems. A new except block has been added to handle TimeoutError exceptions caused by RequestsConnectionError, logging a warning message with information on troubleshooting network connectivity issues. The configure method has also been updated with a docstring noting that connection errors are not handled within it. To ensure the improvements work as expected, we've added new manual and integration tests, including a test for a simulated workspace with no internet connection, and a new function to configure such a workspace. The test checks for the presence of a specific warning message in the log output. The changes also include new type annotations and imports. The target audience for this update includes software engineers adopting the project, who will benefit from clearer error messages and guidance when troubleshooting connection problems.
  • Increase timeout for sequence of slow preliminary jobs (#2222). In this enhancement, the timeout duration for a series of slow preliminary jobs has been increased from 4 minutes to 6 minutes, addressing issue #2219. The modification is implemented in the test_running_real_remove_backup_groups_job function in the tests/integration/install/test_installation.py file, where the get_group function's retried decorator timeout is updated from 4 minutes to 6 minutes. This change improves the system's handling of slow preliminary jobs by allowing more time for the API to delete a group and minimizing errors resulting from insufficient deletion time. The overall functionality and tests of the system remain unaffected.
  • Init RuntimeContext from debug notebook to simplify interactive debugging flows (#2253). In this release, we have implemented a change to simplify interactive debugging flows in UCX workflows. We have introduced a new feature that initializes the RuntimeContext object from a debug notebook. The RuntimeContext is a subclass of GlobalContext that manages all object dependencies. Previously, all UCX workflows used a RuntimeContext instance for any object lookup, which could be complex during debugging. This change pre-initializes the RuntimeContext object correctly, making it easier to perform interactive debugging. Additionally, we have replaced the use of Installation.load_local and WorkspaceClient with the newly initialized RuntimeContext object. This reduces the complexity of object lookup and simplifies the code for debugging purposes. Overall, this change will make it easier to debug UCX workflows by pre-initializing the RuntimeContext object with the necessary configurations.
  • Lint child dependencies recursively (#2226). In this release, we've implemented significant changes to our linting process for enhanced context awareness, particularly in the context of parent-child file relationships. The DependencyGraph class in the graph.py module has been updated with new methods, including parent, root_dependencies, root_paths, and root_relative_names, and an improved _relative_names method. These changes allow for more accurate linting of child dependencies. The lint function in the files.py module has also been modified to accept new parameters and utilize a recursive linting approach for child dependencies. The databricks labs ucx lint-local-code command has been updated to include a paths parameter and lint child dependencies recursively, improving the linting process by considering parent-child relationships and resulting in better contextual code analysis. The release contains integration tests to ensure the functionality of these changes, addressing issues #2155 and #2156.
  • Removed deprecated install.sh script (#2217). In this release, we have removed the deprecated install.sh script from the codebase, which was previously used to install and set up the environment for the project. This script would check for the presence of Python binaries, identify the latest version, create a virtual environment, and install project dependencies. Going forward, developers will need to utilize an alternative method for installing and setting up the project environment, as the use of this script is now obsolete. We recommend consulting the updated documentation for guidance on the new installation process.
  • Tentatively fix failure when running asses...
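
    The tag-based pattern mentioned in the first item above can be sketched as follows. This is a minimal, hypothetical example assuming a Spark Connect (Shared Compute Mode) session on a runtime where SparkSession.addTag, getTags and interruptTag are available; it is not code from the UCX repository.

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Instead of sc.setJobGroup(...), attach a tag to work issued by this session.
    spark.addTag("ucx-example-job")

    # Inspect the tags currently attached to the session.
    if "ucx-example-job" in spark.getTags():
        # Later (for example from a monitoring thread), cancel all running
        # execution that carries the tag.
        spark.interruptTag("ucx-example-job")
    ```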

v0.29.0

19 Jul 16:09
@nfx
4c9c7a8
  • Added lsql lakeview dashboard-as-code implementation (#1920). The open-source library has been updated with new features in its dashboard creation functionality. The assessment_report and estimates_report jobs, along with their corresponding tasks, have been removed. The crawl_groups task has been modified to accept a new parameter, group_manager. These changes are part of a larger implementation of the lsql Lakeview dashboard-as-code system for creating dashboards. The new implementation has been tested through manual testing, existing unit tests, integration tests, and verification on a staging environment, and is expected to improve the functionality and maintainability of the dashboards. The removal of the assessment_report and estimates_report jobs and tasks may indicate that their functionality has been incorporated into the new lsql implementation or is no longer necessary. The new crawl_groups task parameter may be used in conjunction with the new lsql implementation to enhance the assessment and estimation of groups.
  • Added new widget to get table count (#2202). A new widget has been introduced that presents a table count summary, categorized by type (external or managed), location (DBFS root, mount, cloud), and format (delta, parquet, etc.). This enhancement is complemented by an additional SQL file, responsible for generating necessary count statistics. The script discerns the table type and location through location string analysis and subsequent categorization. The output is structured and ordered by table type. It's important to note that no existing functionality has been altered, and the new feature is self-contained within the added SQL file. To ensure the correct functioning of this addition, relevant documentation and manual tests have been incorporated.
  • Added support for DBFS when building the dependency graph for tasks (#2199). In this update, we have added support for the Databricks File System (DBFS) when building the dependency graph for tasks during workflow assessment. This enhancement allows for the use of wheels, eggs, requirements.txt files, and PySpark jobs located in DBFS when assessing workflows. The DependencyGraph object's register_library method has been updated to handle paths in both Workspace and DBFS formats. Additionally, we have introduced the _as_path method and the _temporary_copy context manager to manage file copying and path determination. This development resolves issue #1558 and includes modifications to the existing assessment workflow and new unit tests.
  • Applied databricks labs lsql fmt for SQL files (#2184). The engineering team has developed and applied formatting to several SQL files using the databricks labs lsql fmt tool from various pull requests, including databrickslabs/lsql#221. These changes improve code readability and consistency without affecting functionality. The formatting includes adding comment delimiters, converting subqueries to nested SELECT statements, renaming columns for clarity, updating comments, modifying conditional statements, and improving indentation. The impacted SQL files include queries related to data migration complexity, assessing data modeling complexity, generating table estimates, and calculating data migration effort. Manual testing has been performed to ensure that the update does not introduce any issues in the installed dashboards.
  • Bump sigstore/gh-action-sigstore-python from 2.1.1 to 3.0.0 (#2182). In this release, the version of sigstore/gh-action-sigstore-python is bumped from 2.1.1 to 3.0.0 in the project's GitHub Actions workflow. This new version brings several changes, additions, and removals, such as the removal of certain settings like fulcio-url, rekor-url, ctfe, and rekor-root-pubkey, and of output settings like signature, certificate, and bundle. The inputs field is now parsed according to POSIX shell lexing rules and is optional if release-signing-artifacts is true and the action's event is a release event. The default suffix has changed from .sigstore to .sigstore.json. Additionally, various deprecations present in sigstore-python's 2.x series have been resolved. The PR consists of several commits, including preparing for version 3.0.0, cleaning up workflows, and removing old output settings. Dependabot will resolve any merge conflicts with this PR automatically, and Dependabot actions can be triggered by commenting on the PR with the relevant commands.
  • Consistently cleanup linter codes (#2194). This commit introduces changes to the linting functionality of PySpark, focusing on enhancing code consistency and accuracy. New checks have been added for detecting code incompatibilities with UC Shared Clusters, targeting Python UDF unsupported eval types, spark.catalog.X APIs on DBR versions earlier than 14.3, and the use of commandContext. A new file, python-udfs_14_3.py, containing tests for these incompatibilities has been added. The commit also resolves false linting advice for homonymous method names and updates the code for static analysis message codes, improving self-documentation and maintainability. These changes are limited to the linting functionality of PySpark and do not affect any other functionalities. Co-authored by Eric Vergnaud and Serge Smertin.
  • Disable the builtin pip version check when running pip commands (#2214). In this release, we have introduced a modification to disable the built-in pip version check when using pip to install dependencies. This change involves altering the existing workflow of the _install_pip method to include the --disable-pip-version-check flag in the pip install command, reducing noise in pip-related errors and messages, and enhancing user experience. We have conducted manual and unit testing to ensure that the changes do not introduce any regressions and that existing functionalities remain unaffected. The error message has been updated to reflect the new pip behavior, including the --disable-pip-version-check flag in the message. Overall, these changes improve the user experience by reducing unnecessary error messages and providing clearer error information.
  • Document principal-prefix-access for azure will only list abfss storage accounts (#2212). In this release, we have updated the documentation for the principal-prefix-access CLI command in the context of Azure. The command now exclusively lists ADLS Gen2 (abfss://) storage accounts and disregards unsupported storage formats such as wasb:// or adl://. This matters because those formats are not compatible with Unity Catalog (UC) and are disregarded during the migration process. The update clarifies the command's behaviour, ensuring that only relevant storage accounts are displayed and preventing unsupported storage accounts from being picked up when migrating credentials to UC, resulting in a more streamlined and efficient migration.
  • Group migration: change error logging format (#2215). In this release, we have updated the error logging format for failed permissions migrations during the experimental group migration workflow to enhance readability and debugging capabilities. Previously, the logs only stated that a migration failure occurred without further details. Now, the new format includes both the source and destination account names, as well as a description of the simulated failure during the migration process. This improves the transparency and usefulness of the error logs for debugging and troubleshooting purposes. Additionally, we have added unit tests to ensure the proper logging of failed migrations, ensuring the reliability of the group migration process for our users. This update demonstrates our commitment to providing clear and informative error messages to make the software engineering experience better.
  • Improve error handling as already exists error occurs (#2077). The recent change enhances error handling for the create-catalogs-schemas CLI command, addressing an issue where the command would fail if the catalog or schema already existed. A new _get_missing_catalogs_schemas method avoids recreating existing objects, and the create_all_catalogs_schemas method now wraps _create_catalog_validate and _create_schema in try-except blocks, skipping creation when a BadRequest error is raised with the message "already exists" so that existing catalogs and schemas are never overwritten (a short illustrative sketch follows this release's notes). A new test case, "test_create_catalogs_schemas_handles_existing", verifies the command's handling of existing catalogs and schemas. This change resolves issue #1939 and has been manually tested; apart from the new helper, existing functionality was changed only within the test file.
  • Support run assessment as a collection (#1925). This commit introduces the capability to run eligible CLI commands as a collection, with an initial implementation for the assessment run command. A new parameter collection_workspace_id has been added to determine whether the current installation workflow is run or if an account context...
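
    A minimal sketch of the "already exists" handling described in the create-catalogs-schemas item above. The helper name below is hypothetical; only the BadRequest / message check mirrors the described change.

    ```python
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.errors import BadRequest


    def create_catalog_if_missing(ws: WorkspaceClient, name: str) -> None:
        try:
            ws.catalogs.create(name)
        except BadRequest as e:
            if "already exists" in str(e):
                # Catalog was created earlier (or concurrently): skip instead of failing.
                return
            raise
    ```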

v0.28.2

12 Jul 17:18
@nfx
85df593
  • Fixed Table Access Control is not enabled on this cluster error (#2167). A fix has been implemented to address the Table Access Control is not enabled on this cluster error, changing it to a warning when the exception is raised. This modification involves the introduction of a new constant CLUSTER_WITHOUT_ACL_FRAGMENT to represent the error message and updates to the snapshot and grants methods to conditionally log a warning instead of raising an error when the exception is caught. These changes improve the robustness of the integration test by handling exceptions when many test schemas are being created and deleted quickly, without introducing any new functionality. However, the change has not been thoroughly tested.
  • Fixed infinite recursion when checking module of expression (#2159). In this release, we have addressed an infinite recursion issue (#2159) that occurred when checking the module of an expression. The append_statements method has been updated to no longer overwrite existing statements for globals when appending trees, instead extending the existing list of statements for the global with new values. This modification ensures that the accuracy of module checks is improved and prevents the infinite recursion issue. Additionally, unit tests have been added to verify the correct behavior of the changes and confirm the resolution of both the infinite recursion issue and the appending behavior. This enhancement was a collaborative effort with Eric Vergnaud.
  • Fixed parsing unsupported magic syntax (#2157). In this update, we have addressed a crash that occurred when parsing unsupported magic syntax in a notebook's source code by modifying the _read_notebook_path function in the cells.py file. The start variable, which marks the position of the command in a line, is now obtained with the find() method instead of the index() method, so an absent marker no longer raises an exception (see the sketch after this release's notes). This resolves the crash and makes the parser more robust when handling various magic syntax types. The commit also includes a manual test confirming the fix, which addresses one of the two reported issues.
  • Infer values from child notebook in magic line (#2091). This commit introduces improvements to the notebook linter for enhanced value inference during linting. By utilizing values from child notebooks loaded via the %run magic line, the linter can now provide more accurate suggestions and error detection. The FileLinter class has been updated to include a session_state parameter, allowing it to access variables and objects defined in child notebooks. New methods such as append_tree(), append_nodes(), and append_globals() have been added to the BaseLinter class for better code tree manipulation, enabling more accurate linting of combined code trees. Additionally, unit tests have been added to ensure the correct behavior of this feature. This change addresses issue #1201 and progresses issue #1901.
  • Updated databricks-labs-lsql requirement from ~=0.5.0 to >=0.5,<0.7 (#2160). In this update, the version constraint for the databricks-labs-lsql library has been updated from ~=0.5.0 to >=0.5,<0.7, allowing the project to utilize the latest features and bug fixes available in the library while maintaining compatibility with the existing codebase. This change ensures that the project can take advantage of any improvements or additions made to databricks-labs-lsql version 0.6.0 and above. For reference, the release notes for databricks-labs-lsql version 0.6.0 have been included in the commit, detailing the new features and improvements that come with the updated library.
  • Whitelist phonetics (#2163). This release adds the phonetics package to the known.json whitelist, covering five modules: phonetics, phonetics.metaphone, phonetics.nysiis, phonetics.soundex, and phonetics.utils. These modules have been manually tested and are now recognised during source-code analysis, progressing issue #1901. As an adopting engineer, this addition enables you to incorporate these phonetics modules into your codebase without them being flagged as unknown libraries.
  • Whitelist pydantic (#2162). In this release, we have added the Pydantic library to the known.json file, which manages our project's third-party libraries. Pydantic is a data validation library for Python that allows developers to define data models and enforce type constraints, improving data consistency and correctness in the application. With this change, Pydantic and its submodules have been whitelisted and can be used in the project without being flagged as unknown libraries. This improvement enables us to utilize Pydantic's features for data validation and modeling, ensuring higher data quality and reducing the likelihood of errors in our application.
  • Whitelist statsmodels (#2161). In this change, the statsmodels library has been whitelisted for use in the project. Statsmodels is a comprehensive Python library for statistics and econometrics that offers a variety of tools for statistical modeling, testing, and visualization. With this update, the library has been added to the project's configuration file, enabling users to utilize its features without causing any conflicts. The modification does not affect the existing functionality of the project, but rather expands the range of statistical models and analysis tools available to users. Additionally, a test has been included to verify the successful integration of the library. These enhancements streamline the process of conducting statistical analysis and modeling within the project.
  • whitelist dbignite (#2132). A new commit has been made to whitelist the dbignite repository and add a set of codes and messages in the "known.json" file related to the use of RDD APIs on UC Shared Clusters and the change in the default format from Parquet to Delta in Databricks Runtime 8.0. The affected components include dbignite.fhir_mapping_model, dbignite.fhir_resource, dbignite.hosp_feeds, dbignite.hosp_feeds.adt, dbignite.omop, dbignite.omop.data_model, dbignite.omop.schemas, dbignite.omop.utils, and dbignite.readers. These changes are intended to provide information and warnings regarding the use of the specified APIs on UC Shared Clusters and the change in default format. It is important to note that no new methods have been added, and no existing functionality has been changed as part of this update. The focus of this commit is solely on the addition of the dbignite repository and its associated codes and messages.
  • whitelist duckdb (#2134). In this release, we have whitelisted the DuckDB library by adding it to the "known.json" file in the source code. DuckDB is an in-memory analytical database written in C++. The addition covers several modules such as adbc_driver_duckdb, duckdb.bytes_io_wrapper, duckdb.experimental, duckdb.filesystem, duckdb.functional, and duckdb.typing. Of particular note, the duckdb.experimental.spark.sql.session module is annotated with the table-migrate code and message noting that Databricks Runtime 8.0 changed the default table format from Parquet to Delta. The commit includes tests that have been manually verified.
  • whitelist fs (#2136). In this release, we have added the fs package to the known.json file, allowing its use in our open-source library. The fs package contains a wide range of modules and sub-packages, including fs._bulk, fs.appfs, fs.base, fs.compress, fs.copy, fs.error_tools, fs.errors, fs.filesize, fs.ftpfs, fs.glob, fs.info, fs.iotools, fs.lrucache, fs.memoryfs, fs.mirror, fs.mode, fs.mountfs, fs.move, fs.multifs, fs.opener, fs.osfs, fs.path, fs.permissions, fs.subfs, fs.tarfs, fs.tempfs, fs.time, fs.tools, fs.tree, fs.walk, fs.wildcard, fs.wrap, fs.wrapfs, and fs.zipfs. These additions address issue #1901 and have been thoroughly manually tested to ensure proper functionality.
  • whitelist httpx (#2139). In this release, we have updated the "known.json" file to include the httpx library along with all its submodules. This change whitelists the library; it does not introduce any new functionality or impact existing functionality, and no new methods or functions are added. The changes have been manually tested, and the project's behaviour remains unaffected by anything other than the updated library whitelist.
  • whitelist jsonschema and jsonschema-specifications ([#2140...
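
    The index()-to-find() change in the magic-syntax fix above boils down to the difference illustrated below (purely illustrative, not the cells.py code): index() raises ValueError when the marker is absent, while find() returns -1 and lets the parser continue.

    ```python
    line = "print('a regular code line with no magic command')"

    try:
        start = line.index("%run")   # raises ValueError if "%run" is not present
    except ValueError:
        start = -1

    start = line.find("%run")        # returns -1 instead of raising, so no crash
    print(start)                     # -> -1
    ```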

v0.28.1

10 Jul 15:57
@nfx
e5d1bed
  • Added documentation for common challenges and solutions (#1940). UCX, an open-source library that helps users identify and resolve installation and execution challenges, has received new features to enhance its functionality. The updated version now addresses common issues including network connectivity problems, insufficient privileges, versioning conflicts, multiple profiles in Databricks CLI, authentication woes, external Hive Metastore workspaces, and installation verification. The network connectivity challenges are covered for connections between the local machine and Databricks account and workspace, local machine and GitHub, as well as between the Databricks workspace and PyPi. Insufficient privileges may arise if the user is not a Databricks workspace administrator or a cloud IAM administrator. Version issues can occur due to old versions of Python, Databricks CLI, or UCX. Authentication issues can arise at both workspace and account levels. Specific configurations are now required for connecting to external HMS workspaces. Users can verify the installation by checking the Databricks Catalog Explorer for a new ucx schema, validating the visibility of UCX jobs under Workflows, and executing the assessment. Ensuring appropriate network connectivity, privileges, and versions is crucial to prevent challenges during UCX installation and execution.
  • Added more checks for spark-connect linter (#2092). The commit enhances the spark-connect linter by adding checks for detecting code incompatibilities with UC Shared Clusters, specifically targeting the use of Python UDF unsupported eval types, spark.catalog.X APIs on DBR versions earlier than 14.3, and the use of commandContext. A new file, python-udfs_14_3.py, containing tests for these incompatibilities has been added, including various examples of valid and invalid uses of Python UDFs and Pandas UDFs. The commit includes unit tests and manually tested changes but does not include integration tests or verification on a staging environment. The spark-logging.py file has been renamed and moved within the directory structure.
  • Fixed false advice when linting homonymous method names (#2114). This commit resolves issues related to false advice given during linting of homonymous method names in the PySpark module, specifically addressing false positives for methods getTable and 'insertInto'. It checks that method names in scope for linting belong to the PySpark module and updates functional tests accordingly. The commit also progresses the resolution of issues #1864 and #1901, and adds new unit tests to ensure the correct behavior of the updated code. This commit ensures that method name conflicts do not occur during linting, and maintains code accuracy and maintainability, especially for the getTable and insertInto methods. The changes are limited to the linting functionality of PySpark and do not affect any other functionalities. Co-authored by Eric Vergnaud and Serge Smertin.
  • Improve catch-all handling and avoid some pylint suppressions (#1919).
  • Infer values from child notebook in run cell (#2075). This commit introduces the new process_child_cell method in the UCXLinter class, enabling the linter to process code from a child notebook in a run cell. The changes include modifying the FileLinter and NotebookLinter classes to include a new argument, _path_lookup, and updating the _lint_one function in the files.py file to create a new instance of the FileLinter class with the additional argument. These modifications enhance inference from child notebooks in run cells and resolve issues #1901, #1205, and #1927, as well as reducing not computed advisories when running make solacc. Unit tests have been added to ensure proper functionality.
  • Mention migration dashboard under jobs static code analysis workflow in README (#2104). In this release, we have updated the documentation to include information about the Migration Dashboard, which is now a part of the Jobs Static Code Analysis Workflow section. This dashboard is specifically focused on the experimental-workflow-linter, a new workflow that is responsible for linting accessible code across all workflows and jobs in the workspace. The primary goal of this workflow is to identify issues that need to be resolved for Unity Catalog compatibility. Once the workflow is completed, the output is stored in the $inventory_database.workflow_problems table and displayed in the Migration Dashboard. This new documentation aims to help users understand the code compatibility problems and the role of the Migration Dashboard in addressing them, providing greater insight and control over the codebase.
  • raise warning instead of error to allow assessment in regions that do not support certain features (#2128). A new change has been implemented in the library's error handling for listing certain types of objects: when an error occurs during listing, it is now logged as a warning instead of an error, allowing the operation to continue in regions with limited feature support (a short illustrative sketch follows this release's notes). This behaviour resolves issue #2082 and has been implemented in the generic.py file without affecting any other functionality, and unit tests have been added to verify the changes. Specifically, when attempting to list serving endpoints and model serving is not enabled, a warning is raised instead of an error. This provides clearer error handling, helps users better understand regional feature support, and improves the overall assessment experience.
  • whitelist bitsandbytes (#2048). A new library, "bitsandbytes," has been whitelisted and added to the "known.json" file's list of known libraries. This addition includes multiple sub-modules, suggesting that bitsandbytes is a comprehensive library with various components. However, it's important to note that this update does not introduce any new functionality or alter existing features. Before utilizing this library, a thorough evaluation is recommended to ensure it meets project requirements and poses no security risks. The tests for this change have been manually verified.
  • whitelist blessed (#2130). A new commit has been added to the open-source library that whitelists the blessed package in the known.json file, which is used for source code analysis. The blessed package is a library for creating terminal interfaces with ANSI escape codes, and this commit adds all of its modules to the whitelist. This change is related to issue #1901 and was manually tested to ensure its functionality. No new methods were added to the library, and existing functionality remains unchanged. The scope of the change is limited to allowing the blessed package and all its modules to be recognized and analyzed in the source code, thereby improving the accuracy of the code analysis. Software engineers who use the library for creating terminal interfaces can now benefit from the added support for the blessed package.
  • whitelist btyd (#2040). In this release, we have whitelisted the btyd library, which implements "Buy Till You Die" probabilistic models for customer lifetime value analysis, by adding its modules to the known.json file that manages third-party dependencies. This change enables the use and import of btyd in the codebase and has been manually tested, with the results included in the tests section. No existing functionality has been altered and no new methods have been added as part of this update. This development is a step forward in resolving issue #1901.
  • whitelist chispa (#2054). This release adds the chispa library to the known.json whitelist. chispa is a PySpark testing helper that provides DataFrame and column comparison utilities; whitelisting it allows code that imports it to be recognised during source-code analysis rather than flagged as an unknown library. Beyond the whitelist entry, no functional changes are introduced.
  • whitelist chronos (#2057). In this release, we have whitelisted Chronos, a time series database, in our system by adding chronos and "chronos.main" entries to the known.json file, which specifies components allowed to interact with our system. This change, related to issue #1901, was manually tested with no new methods added or existing functionality altered. Therefore, as a software engineer adopting this project, you should be aware that Chronos has been added to the list of approved ...
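
    A minimal sketch of the "warn instead of fail" listing behaviour described above. The function name is hypothetical and this is not the actual generic.py code; it only illustrates the pattern of downgrading the listing failure to a warning.

    ```python
    import logging

    from databricks.sdk import WorkspaceClient
    from databricks.sdk.errors import DatabricksError

    logger = logging.getLogger(__name__)


    def list_serving_endpoints_safely(ws: WorkspaceClient) -> list:
        try:
            return list(ws.serving_endpoints.list())
        except DatabricksError as err:
            # In regions or workspaces without model serving, log a warning and
            # move on so the rest of the assessment can continue.
            logger.warning(f"Skipping serving endpoints: {err}")
            return []
    ```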

v0.28.0

05 Jul 10:48
@nfx
0276f34
  • Added handling for exceptions with no error_code attribute while crawling permissions (#2079). A new enhancement improves error handling during the assessment job's permission crawling process. Previously, exceptions that lacked an error_code attribute would cause an AttributeError. This release introduces a check for the existence of the error_code attribute before attempting to access it, logging an error and adding the exception to the list of acute errors when the attribute is not present (a short illustrative sketch follows this release's notes). The change includes a new unit test for verification, and the relevant functionality has been added to the inventorize_permissions function within the manager.py file. The new method, test_manager_inventorize_fail_with_error, tests the permission manager's behaviour when encountering errors during the inventory process, raising DatabricksError and TimeoutError instances with and without error_code attributes. This update resolves issue #2078 and enhances the overall robustness of the assessment job's permission crawling functionality.
  • Added handling for missing permission to read file (#1949). In this release, we've addressed an issue where missing permissions to read a file during linting were not being handled properly. The revised code now checks for NotFound and PermissionError exceptions when attempting to read a file's text content. If a NotFound exception occurs, the function returns None and logs a warning message. If a PermissionError exception occurs, the function also returns None and logs a warning message with the error's traceback. This change resolves issue #1942 and partially resolves issue #1952, improving the robustness of the linting process and providing more informative error messages. Additionally, new tests and methods have been added to handle missing files and missing read permissions during linting, ensuring that the file linter can handle these cases correctly.
  • Added handling for unauthenticated exception while joining collection (#1958). A new exception type, Unauthenticated, has been added to the import statement, and new error messages have been implemented in the _sync_collection and _get_collection_workspace functions to notify users when they do not have admin access to the workspace. A try-except block has been added in the _get_collection_workspace function to handle the Unauthenticated exception, and a warning message is logged indicating that the user needs account admin and workspace admin credentials to enable collection joining and to run the join-collection command with account admin credentials. Additionally, a new CLI command has been added, and the existing databricks labs ucx ... command has been modified. A new workflow for joining the collection has also been implemented. These changes have been thoroughly documented in the user documentation and verified on the staging environment.
  • Added tracking for UCX workflows and as-library usage (#1966). This commit introduces User-Agent tracking for UCX workflows and library usage, adding ucx/<version>, cmd/install, and cmd/<workflow> elements to relevant requests. These changes are implemented within the test_useragent.py file, which includes the new http_fixture_server context manager for testing User-Agent propagation in UCX workflows. The addition of with_user_agent_extra and the inclusion of with_product functions from databricks.sdk.core aim to provide valuable insights for debugging, maintenance, and improving UCX workflow performance. This feature will help gather clear usage metrics for UCX and enhance the overall user experience.
  • Analyse altair (#2005). In this release, the altair library has been whitelisted, addressing issue #1901. The changes add several modules and sub-modules under the altair package, including altair, altair._magics, altair.expr, and various others such as altair.utils, altair.utils._dfi_types, altair.utils._importers, and altair.utils._show, and the known.json file has been modified to include the altair package. No new functionality has been introduced, and the changes have been manually verified. This change was developed by Eric Vergnaud.
  • Analyse azure (#2016). In this release, we have made updates to the whitelist of several Azure libraries, including 'azure-common', 'azure-core', 'azure-mgmt-core', 'azure-mgmt-digitaltwins', and 'azure-storage-blob'. These changes are intended to manage dependencies and ensure a secure and stable environment for software engineers working with these libraries. The azure-common library has been added to the whitelist, and updates have been made to the existing whitelists for the other libraries. These changes do not add or modify any functionality or test cases, but are important for maintaining the integrity of our open-source library. This commit was co-authored by Eric Vergnaud from Databricks.
  • Analyse causal-learn (#2012). In this release, we have added causal-learn to the whitelist in our JSON file, signifying that it is now a supported library. This update includes the addition of various modules, classes, and functions to 'causal-learn'. We would like to emphasize that there are no changes to existing functionality, nor have any new methods been added. This release is thoroughly tested to ensure functionality and stability. We hope that software engineers in the community will find this update helpful and consider adopting this project.
  • Analyse databricks-arc (#2004). This release introduces whitelisting for the databricks-arc library, which is used for data analytics and machine learning. The release updates the known.json file to include databricks-arc and its related modules such as arc.autolinker, arc.sql, arc.sql.enable_arc, arc.utils, and arc.utils.utils. It also provides specific error codes and messages related to using these libraries on UC Shared Clusters. Additionally, this release includes updates to the databricks-feature-engineering library, with the addition of many new modules and error codes related to JVM access, legacy context, and spark logging. The databricks.ml_features library has several updates, including changes to the _spark_client and publish_engine. The databricks.ml_features.entities module has many updates, with new classes and methods for handling features, specifications, tables, and more. These updates offer improved functionality and error handling for the whitelisted libraries, specifically when used on UC Shared Clusters.
  • Analyse dbldatagen (#1985). The dbldatagen package has been whitelisted in the known.json file in this release. While there are no new or altered functionalities, several updates have been made to the methods and objects within dbldatagen. This includes enhancements to dbldatagen._version, dbldatagen.column_generation_spec, dbldatagen.column_spec_options, dbldatagen.constraints, dbldatagen.data_analyzer, dbldatagen.data_generator, dbldatagen.datagen_constants, dbldatagen.datasets, and related classes. Additionally, dbldatagen.datasets.basic_geometries, dbldatagen.datasets.basic_process_historian, dbldatagen.datasets.basic_telematics, dbldatagen.datasets.basic_user, dbldatagen.datasets.benchmark_groupby, dbldatagen.datasets.dataset_provider, dbldatagen.datasets.multi_table_telephony_provider, and dbldatagen.datasets_object have been updated. The distribution methods, such as dbldatagen.distributions, dbldatagen.distributions.beta, dbldatagen.distributions.data_distribution, dbldatagen.distributions.exponential_distribution, dbldatagen.distributions.gamma, and dbldatagen.distributions.normal_distribution, have also seen improvements. Furthermore, dbldatagen.function_builder, dbldatagen.html_utils, dbldatagen.nrange, dbldatagen.schema_parser, dbldatagen.spark_singleton, dbldatagen.text_generator_plugins, and dbldatagen.text_generators have been updated. The dbldatagen.data_generator method now includes a warning about the deprecated sparkContext in shared clusters, and dbldatagen.schema_parser includes updates related to the table_name argument in various SQL statements. These changes ensure better compatibility and improved functionality of the dbldatagen package.
  • Analyse delta-spark (#1987). In this release, the delta-spark component within the delta project has been whitelisted with the inclusion of a new entry in the known.json configuration file. This addition brings in several sub-components, including delta._typing, delta.exceptions, and delta.tables, each with a jvm-access-in-shared-clusters error code and message for unsupported environments. These changes aim to enhance the handling of the delta-spark component within the delta project. The changes have been rigorously tested and do not introduce new functionality or modify existing behavior. This update improves the stability and compatibility of the project. Co-authored by Eric Vergnaud.
  • Analyse diffusers ([#2010](https://github.com/databrickslabs/uc...
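
    The error_code guard described in the first item of this release can be illustrated as below. The surrounding names are hypothetical stand-ins, not the manager.py implementation; only the defensive getattr check mirrors the described change.

    ```python
    from databricks.sdk.errors import DatabricksError

    acute_errors: list[Exception] = []


    def crawl_one(item: str) -> None:
        raise TimeoutError("simulated timeout")  # carries no error_code attribute


    try:
        crawl_one("some-object")
    except (DatabricksError, TimeoutError) as err:
        # Not every exception carries error_code; read it defensively instead of
        # letting an AttributeError abort the whole permission crawl.
        error_code = getattr(err, "error_code", None)
        if error_code is None:
            acute_errors.append(err)
    ```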

v0.27.1

12 Jun 23:35
@nfx
9e70b60
  • Fixed typo in known.json (#1899). A fix has been implemented to correct a typo in the known.json file, an essential configuration file that specifies dependencies for various components of the project. The typo was identified in the gast dependency, which was promptly rectified by modifying an incorrect character. This adjustment guarantees precise specification of dependencies, thereby ensuring the correct functioning of affected components and maintaining the overall reliability of the open-source library.

Contributors: @nfx

v0.27.0

12 Jun 23:10
@nfx
520f886
  • Added mlflow to known packages (#1895). The mlflow package has been incorporated into the project and is now recognized as a known package. This integration includes modifications to the use of mlflow in the context of UC Shared Clusters, providing recommendations to modify or rewrite certain functionalities related to sparkContext, _conf, and RDD APIs. Additionally, the artifact storage system of mlflow in Databricks and DBFS has undergone changes. The known.json file has also been updated with several new packages, such as alembic, aniso8601, cloudpickle, docker, entrypoints, flask, graphene, graphql-core, graphql-relay, gunicorn, html5lib, isort, jinja2, markdown, markupsafe, mccabe, opentelemetry-api, opentelemetry-sdk, opentelemetry-semantic-conventions, packaging, pyarrow, pyasn1, pygments, pyrsistent, python-dateutil, pytz, pyyaml, regex, requests, and more. These packages are now acknowledged and incorporated into the project's functionality.
  • Added tensorflow to known packages (#1897). In this release, we are excited to announce the addition of the tensorflow package to our known packages list. Tensorflow is a popular open-source library for machine learning and artificial intelligence applications. This package includes several components such as tensorflow, tensorboard, tensorboard-data-server, and tensorflow-io-gcs-filesystem, which enable training, evaluation, and deployment of machine learning models, visualization of machine learning model metrics and logs, and access to Google Cloud Storage filesystems. Additionally, we have included other packages such as gast, grpcio, h5py, keras, libclang, mdurl, namex, opt-einsum, optree, pygments, rich, rsa, termcolor, pyasn1_modules, sympy, and threadpoolctl. These packages provide various functionalities required for different use cases, such as parsing Abstract Syntax Trees, efficient serial communication, handling HDF5 files, and managing threads. This release aims to enhance the functionality and capabilities of our platform by incorporating these powerful libraries and tools.
  • Added torch to known packages (#1896). In this release, the "known.json" file has been updated to include several new packages and their respective modules for a specific project or environment. These packages include "torch", "functorch", "mpmath", "networkx", "sympy", "isympy". The addition of these packages and modules ensures that they are recognized and available for use, preventing issues with missing dependencies or version conflicts. Furthermore, the _analyze_dist_info method in the known.py file has been improved to handle recursion errors during package analysis. A try-except block has been added to the loop that analyzes the distribution info folder, which logs the error and moves on to the next file if a RecursionError occurs. This enhancement increases the robustness of the package analysis process.
  • Added more known libraries (#1894). In this release, the known library has been enhanced with the addition of several new packages, bringing improved functionality and versatility to the software. Key additions include contourpy for drawing contours on 2D grids, cycler for creating cyclic iterators, docker-pycreds for managing Docker credentials, filelock for platform-independent file locking, fonttools for manipulating fonts, and frozendict for providing immutable dictionaries. Additional libraries like fsspec for accessing various file systems, gitdb and gitpython for working with git repositories, google-auth for Google authentication, html5lib for parsing and rendering HTML documents, and huggingface-hub for working with the Hugging Face model hub have been incorporated. Furthermore, the release includes idna, kiwisolver, lxml, matplotlib, mypy, peewee, protobuf, psutil, pyparsing, regex, requests, safetensors, sniffio, smmap, tokenizers, tomli, tqdm, transformers, types-pyyaml, types-requests, typing_extensions, tzdata, umap, unicorn, unidecode, urllib3, wandb, waterbear, wordcloud, xgboost, and yfinance for expanded capabilities. The zipp and zingg libraries have also been included for module name transformations and data mastering, respectively. Overall, these additions are expected to significantly enhance the software's functionality.
  • Added more value inference for dbutils.notebook.run(...) (#1860). In this release, the dbutils.notebook.run(...) functionality in graph.py has been significantly updated to enhance value inference. The change includes the introduction of new methods for handling NotebookRunCall and SysPathChange objects, as well as the refactoring of the get_notebook_path method into get_notebook_paths. This new method now returns a tuple of a boolean and a list of strings, indicating whether any nodes could not be resolved and providing a list of inferred paths. A new private method, _get_notebook_paths, has also been added to retrieve notebook paths from a list of nodes. Furthermore, the load_dependency method in loaders.py has been updated to detect the language of a notebook based on the file path, in addition to its content. The Notebook class now includes a new parameter, SUPPORTED_EXTENSION_LANGUAGES, which maps file extensions to their corresponding languages. In the databricks.labs.ucx project, more value inference has been added to the linter, including new methods and enhanced functionality for dbutils.notebook.run(...). Several tests have been added or updated to demonstrate various scenarios and ensure the linter handles dynamic values appropriately. A new test file for the NotebookLoader class in the databricks.labs.ucx.source_code.notebooks.loaders module has been added, with a new class, NotebookLoaderForTesting, that overrides the detect_language method to make it a class method. This allows for more robust testing of the NotebookLoader class. Overall, these changes improve the accuracy and reliability of value inference for dbutils.notebook.run(...) and enhance the testing and usability of the related classes and methods.
  • Added nightly workflow to use industry solution accelerators for parser validation (#1883). A nightly workflow has been added to validate the parser using industry solution accelerators, which can be triggered locally with the make solacc command. This workflow involves a new Makefile target, 'solacc', which runs a Python script located at 'tests/integration/source_code/solacc.py'. The workflow is designed to run on the latest Ubuntu, installing Python 3.10 and hatch 1.9.4 using pip, and checking out the code with a fetch depth of 0. It runs on a daily basis at 7am using a cron schedule, and can also be triggered locally. The purpose of this workflow is to ensure parser compatibility with various industry solutions, improving overall software quality and robustness.
  • Complete support for pip install command (#1853). In this release, we've made significant enhancements to support the pip install command in our open-source library. The register_library method in the DependencyResolver, NotebookResolver, and LocalFileResolver classes has been modified to accept variable numbers of libraries instead of just one, allowing for more efficient dependency management. Additionally, the resolve_import method has been introduced in the NotebookResolver and LocalFileResolver classes for improved import resolution. Moreover, the _split static method has been implemented for better handling of pip command code and egg packages. The library now also supports the resolution of imports in notebooks and local files. These changes provide a solid foundation for full pip install command support, improving overall robustness and functionality. Furthermore, extensive updates to tests, including workflow linter and job dlt task linter modifications, ensure the reliability of the library when working with Jupyter notebooks and pip-installable libraries.
  • Infer simple f-string values when computing values during linting (#1876). This commit enhances the library by adding support for inferring simple f-string values during linting, addressing issue #1871 and progressing #1205. The new functionality works for simple f-strings but does not yet support nested f-strings (see the sketch after this release's notes). It introduces the InferredValue class and updates the visit_call, visit_const, and _check_str_constant methods for better linter feedback. Additionally, it includes modifications to a unit test file and adjustments to error locations in code. The commit also presents an example of simple f-string handling, emphasising the current limitations while providing a solid foundation for future development. Co-authored by Eric Vergnaud.
  • Propagate widget parameters and data security mode to CurrentSessionState (#1872). In this release, the spark_version_compatibility function in crawlers.py has been refactored to runtime_version_tuple, returning a tuple of integers instead of a string. The function now handles custom runtimes and DLT, and raises a ValueError if the version components cannot be converted to integers. Additionally, the CurrentSessionState class has been updated to propagate named parameters from jobs and check for DBFS paths as both named and positional parameters. New attribu...
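
    An illustrative example of the simple f-string case the linter can now infer (the paths and names below are made up): when every interpolated part is a constant, the full string value can be computed during linting, while nested f-strings are noted above as not yet supported.

    ```python
    schema = "bronze"
    table = "events"

    # Simple f-string: every part is a constant, so the value
    # "/mnt/landing/bronze/events" can be inferred during linting.
    path = f"/mnt/landing/{schema}/{table}"

    # Nested f-string: not yet supported by the inference added in this release.
    nested = f"/mnt/landing/{f'{schema}/{table}'}"

    print(path, nested)
    ```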

v0.26.0

07 Jun 23:38
@nfx
b19c848
  • Added migration for Python linters from ast (standard library) to astroid package (#1835). In this release, the Python linters have been migrated from the standard-library ast package to the astroid package, version 3.2.2 or higher, with a minimal inference implementation (see the sketch after this list). The pyproject.toml file has been updated to include astroid as a dependency and to bump the pylint version. No changes have been made to user documentation, CLI commands, workflows, or tables, and testing has been conducted through added unit tests. This update aims to improve the functionality and accuracy of the Python linters.
  • Added workflow linter for delta live tables task (#1825). In this release, there are updates to the _register_pipeline_task method in the jobs.py file. The method now checks for the existence of the pipeline and its libraries, and registers each notebook or jar library found in the pipeline as a task. If the library is a Maven or file type, it will raise a DependencyProblem as it is not yet implemented. Additionally, new functions and tests have been added to improve the quality and functionality of the project, including a workflow linter for Delta Live Tables (DLT) tasks and a linter that checks for issues with specified DLT tasks. A new method, test_workflow_linter_dlt_pipeline_task, has been added to test the workflow linter for DLT tasks, verifying the correct creation and functioning of the pipeline task and checking the building of the dependency graph for the task. These changes enhance the project's ability to ensure the proper configuration and correctness of DLT tasks and prevent potential issues.
  • Consistent 0-based line tracking for linters (#1855). 0-based line tracking has been consistently implemented for linters in various files and methods throughout the project, addressing issue #1855. This change includes removing direct filesystem references in favor of using the Unity Catalog for table migration and format changes. It also updates comments and warnings to improve clarity and consistency. In particular, the spark-table.py file has been updated to ensure that the spark.log.level is set correctly for UC Shared Clusters, and that the Spark Driver JVM is no longer accessed directly. The new file, simple_notebook.py, demonstrates the consistent line tracking for linters across different cell types, such as Python, Markdown, SQL, Scala, Shell, Pip, and Python (with magic commands). These changes aim to improve the accuracy and reliability of linters, making the codebase more maintainable and adaptable.
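
To make the pipeline-task registration in #1825 more concrete, the sketch below walks the libraries attached to a Delta Live Tables pipeline with the Databricks SDK and collects the notebook and jar references; the function name and the print-based problem reporting are illustrative assumptions, not UCX's DependencyProblem mechanism.

```python
from databricks.sdk import WorkspaceClient


def pipeline_code_references(ws: WorkspaceClient, pipeline_id: str) -> list[str]:
    """Collect notebook paths and jar locations referenced by a DLT pipeline."""
    pipeline = ws.pipelines.get(pipeline_id)
    libraries = pipeline.spec.libraries if pipeline.spec else None
    references: list[str] = []
    for library in libraries or []:
        if library.notebook and library.notebook.path:
            references.append(library.notebook.path)
        elif library.jar:
            references.append(library.jar)
        elif library.maven or library.file:
            # UCX raises a DependencyProblem for these; here we only report them.
            print(f"unsupported library type in pipeline {pipeline_id}: {library}")
    return references
```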

Dependency updates:

  • Updated sqlglot requirement from <24.2,>=23.9 to >=23.9,<25.1 (#1856).

Contributors: @ericvergnaud, @JCZuurmond, @FastLee, @pritishpai, @dependabot[bot], @asnare

v0.25.0

04 Jun 18:26
@nfx nfx
a9f874d
Compare
Choose a tag to compare
  • Added handling for legacy ACL DENY permission in group migration (#1815). In this release, the handling of DENY permissions during group migrations in our legacy ACL table has been improved. Previously, DENY operations were denoted with a DENIED prefix and were not being applied correctly during migrations. This issue has been resolved by adding a condition in the _apply_grant_sql method to check for the presence of DENIED in the action_type, removing the prefix, and enclosing the action type in backticks to prevent syntax errors; a hedged sketch of this transformation is shown after this release's notes. These changes have been thoroughly tested through manual testing, unit tests, integration tests, and verification on the staging environment, and resolve issue #1803. A new test function, test_hive_deny_sql(), has also been added to test the behavior of the DENY permission.
  • Added handling for parsing corrupted log files (#1817). The logs.py file in the src/databricks/labs/ucx/installer directory has been updated to improve the handling of corrupted log files. A new block of code checks whether the logs match the expected format; if they don't, a warning message is logged and the function returns, preventing further processing and the potential production of incorrect results (an illustrative sketch of this guard follows this release's notes). The changes include a new method test_parse_logs_warns_for_corrupted_log_file that verifies the expected warning message and corrupt log line are present in the last log message when a corrupted log file is detected. These enhancements increase the robustness of the log parsing functionality by introducing error handling for corrupted log files.
  • Added known problems with pyspark package (#1813). In this release, updates have been made to the src/databricks/labs/ucx/source_code/known.json file to document known issues with the pyspark package when running on UC Shared Clusters. These issues include not being able to access the Spark Driver JVM, using legacy contexts, or using RDD APIs. A new KnownProblem dataclass has been added to the known.py file, which includes methods for converting the object to a dictionary for better encoding of problems. The _analyze_file method has also been updated to use a known_problems set of KnownProblem objects, improving readability and management of known problems within the application. These changes address issue #1813 and improve the documentation of known issues with pyspark.
  • Added library linting for jobs launched on shared clusters (#1689). This release includes an update to add library linting for jobs launched on shared clusters, addressing issue #1637. A new function, _register_existing_cluster_id(graph: DependencyGraph), has been introduced to retrieve libraries installed on a specified existing cluster and register them in the dependency graph. If the existing cluster ID is not present in the task, the function returns early. This feature also includes changes to the test_jobs.py file in the tests/integration/source_code directory, such as the addition of new methods for linting jobs and handling libraries, and the inclusion of the jobs and compute modules from the databricks.sdk.service package. Additionally, a new WorkflowTaskContainer method has been added to build a dependency graph for job tasks. These changes improve the reliability and efficiency of the service by checking for and handling missing libraries so that jobs run smoothly on shared clusters. Software engineers will benefit from these improvements as they reduce the occurrence of errors due to missing libraries on shared clusters.
  • Added linters to check for spark logging and configuration access (#1808). This commit introduces new linters to check for the use of Spark logging, Spark configuration access via sc.conf, and rdd.mapPartitions. The changes address one issue and enhance three others related to RDDs in shared clusters and the use of deprecated code. Additionally, new tests have been added for the linters and updates have been made to existing ones. The new linters have been added to the SparkConnectLinter class and are executed as part of the databricks labs ucx command. This commit also includes documentation for the new functionality. The modifications are thoroughly tested through manual tests and unit tests to ensure no existing functionality is affected.
  • Added list of known dependency compatibilities and regeneration infrastructure for it (#1747). This change introduces an automated system for regenerating known Python dependencies to ensure compatibility with Unity Catalog (UC), resolving import issues during graph generation. The changes include a script entry point for adding new libraries, manual trimming of unnecessary information in the known.json file, and integration of package data with the Whitelist. This development practice prioritizes using standard libraries and provides guidelines for contributing to the project, including debugging, fixtures, and IDE setup. The target audience for this feature is software engineers contributing to the open-source library.
  • Added more known libraries from Databricks Runtime (#1812). In this release, we've expanded the Databricks Runtime's capabilities by incorporating a variety of new libraries. These libraries include absl-py, aiohttp, and grpcio, which enhance networking functionalities. For improved data processing, we've added aiosignal, anyio, appdirs, and others. The suite of cloud computing libraries has been bolstered with the addition of google-auth, google-cloud-bigquery, google-cloud-storage, and many more. These libraries are now integrated in the known libraries file in the JSON format, enhancing the platform's overall functionality and performance in networking, data processing, and cloud computing scenarios.
  • Added more known packages from Databricks Runtime (#1814). In this release, we have added a significant number of new packages to the known packages file in the Databricks Runtime, including astor, audioread, azure-core, and many others. These additions include several new modules and sub-packages for some of the existing packages, significantly expanding the library's capabilities. The new packages are expected to provide new functionality and improve compatibility with the existing packages. However, it is crucial to thoroughly test the new packages to ensure they work as expected and do not introduce any issues. We encourage all software engineers to familiarize themselves with the new packages and integrate them into their workflows to take full advantage of the improved functionality and compatibility.
  • Added support for .egg Python libraries in jobs (#1789). This commit adds support for .egg Python libraries in jobs by registering egg library dependencies to DependencyGraph for linting, addressing issue #1643. It includes the addition of a new method, PythonLibraryResolver, which replaces the old PipResolver, and is used to register egg library dependencies in the DependencyGraph. The changes also involve adding user documentation, a new CLI command, and a new workflow, as well as modifying an existing workflow and table. The tests include manual testing, unit tests, and integration tests. The diff includes changes to the 'test_dependencies.py' file, specifically in the import section where PipResolver is replaced with PythonLibraryResolver from the 'databricks.labs.ucx.source_code.python_libraries' package. These changes aim to improve test coverage and ensure the correct resolution of dependencies, including those from .egg files.
  • Added table migration workflow guide (#1607). UCX is a new open-source library that simplifies the process of upgrading to Unity Catalog in Databricks workspaces. After installation, users can trigger the assessment workflow, which identifies any incompatible entities and provides information necessary for planning migration. Once the assessment is complete, users can initiate the group migration workflow to upgrade various Databricks workspace assets, including Legacy Table ACLs, Entitlements, AWS instance profiles, Clusters, Cluster policies, Instance Pools, Databricks SQL warehouses, Delta Live Tables, Jobs, MLflow experiments and registry, SQL Dashboards & Queries, SQL Alerts, and Token and Password usage permissions set on the workspace level, Secret scopes, Notebooks, Directories, Repos, and Files. Additionally, the group migration workflow creates a debug notebook and logs for debugging purposes, providing added convenience and improved user experience.
  • Added workflow linter for spark python tasks (#1810). A linter for workflows related to Spark Python tasks has been implemented, ensuring proper implementation of workflows for Spark Python tasks and avoiding errors for tasks that are not yet implemented. The changes are limited to the _register_spark_python_task method in the jobs.py file. If the task is not a Spark Python task, an empty list is returned, and if it is, the entrypoint is logged and the notebook is registered. Additionally, two new tests have been implemented to demonstrate the functionality of this linter. The test_job_spark_python_task_linter_happy_path t...
Read more
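
As a hedged illustration of the DENY handling in #1815: the helper below only mimics the general idea of stripping the DENIED marker and backticking the action type before emitting a DENY statement; the exact prefix format and SQL shape used by _apply_grant_sql may differ.

```python
def grant_sql(action_type: str, object_type: str, object_key: str, principal: str) -> str:
    """Build a GRANT or DENY statement from a legacy ACL action type (illustrative only)."""
    if action_type.startswith("DENIED"):
        # Legacy ACLs mark denies with a DENIED prefix (assumed here to be "DENIED_");
        # strip it and backtick the action type to avoid SQL syntax errors.
        action = action_type.removeprefix("DENIED").lstrip("_ ")
        return f"DENY `{action}` ON {object_type} {object_key} TO `{principal}`"
    return f"GRANT {action_type} ON {object_type} {object_key} TO `{principal}`"


print(grant_sql("DENIED_SELECT", "TABLE", "hive_metastore.db.table1", "analysts"))
```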
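
A tiny sketch of the corrupted-log guard from #1817 follows; the log-line pattern below is invented purely for illustration, and the real format UCX expects is different.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Hypothetical log-line shape, used only for this illustration.
LOG_LINE = re.compile(r"^\d{2}:\d{2}:\d{2}\s+(DEBUG|INFO|WARNING|ERROR|CRITICAL)\s+\[[^\]]+\]")


def parse_logs(lines: list[str]) -> list[str]:
    """Return well-formed log lines; warn and bail out if the file looks corrupted."""
    if lines and not LOG_LINE.match(lines[0]):
        logger.warning("Logs do not match the expected format, skipping file: %s", lines[0])
        return []
    return [line for line in lines if LOG_LINE.match(line)]
```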

v0.24.0

27 May 12:26
@nfx nfx
9b83666
Compare
Choose a tag to compare
  • Added %pip cell resolver (#1697). A newly developed pip resolver has been integrated into the ImportResolver for future use, addressing issue #1642 and following up on #1694. The resolver installs libraries and modifies the path lookup to make them available for import. This change affects existing workflows but does not introduce new CLI commands, tables, or files. The commit includes modifications to the build_dependency_graph method and the addition of unit tests to verify the new functionality. The resolver has been manually tested and passes the unit tests, ensuring better compatibility and accessibility for libraries used in the project.
  • Added local download of requirements.txt dependencies to register them in the dependency graph (#1753). This commit introduces support for linting job tasks that specify their dependencies in a 'requirements.txt' file. It resolves issue #1644 and is similar to #1704. The changes include the addition of a new CLI command, modification of the existing 'databricks labs ucx ...' command, and modification of the experimental-workflow-linter workflow. The lint_job method has been updated to handle dependencies specified in a 'requirements.txt' file, checking for their presence in the job's libraries list and flagging any missing dependencies. The code changes include modifications to the 'jobs.py' file to register libraries specified in a 'requirements.txt' file to the dependency graph; a small sketch of reading such a file appears after this release's notes. Unit and integration tests have been added to verify the new functionality. The changes also include handling of jar libraries. The code includes TODO comments for future enhancements such as downloading the library wheel and adding it to the virtual system path, and handling references to other requirements files and constraints files.
  • Added ability to install UCX on workspaces without Public Internet connectivity (#1566). A new flag, upload_dependencies, has been added to the WorkspaceConfig to enable users to upload dependencies to air-gapped workspaces without public internet connectivity. This flag is a boolean value that is set to False by default and can be set by the user through the installation prompt. This feature resolves issue #573 and was co-authored by hari-selvarajan_data. When this flag is set to True, it triggers the upload of specified dependencies during installation, which allows for the installation of UCX on workspaces without public internet access. This change also includes updating the version of databricks-labs-blueprint from <0.7.0 to >=0.6.0, which may include changes to existing functionality. Additionally, new test functions have been added to test the functionality of uploading dependencies when the upload_dependencies flag is set to True.
  • Added initial interface for data comparison framework (#1695). This commit introduces the initial interface for a data comparison framework, which includes classes and methods for managing metadata, profiling data, and comparing schema and data for tables. A new StandardDataComparator class has been implemented for comparing the data of two tables, and a StandardSchemaComparator class tests the comparison of table schemas. The framework also includes the DatabricksTableMetadataRetriever class for retrieving metadata about a given table using a SQL backend. Additional classes and methods will be implemented in future work to provide a robust data comparison framework, such as StandardDataProfiler for profiling data, SchemaComparator and DataComparator for comparing schema and data, and test fixtures and functions for testing the framework. This release lays the groundwork for enabling users to perform comprehensive data comparisons effectively, enhancing the project's capabilities and versatility.
  • Added lint local code command (#1710). A new lint local code command has been added to the databricks labs ucx tool, allowing users to assess required migrations in a local directory or file. This command detects dependencies and analyzes them, currently supporting Python and SQL files, with an expected runtime of under a minute for code bases up to 50,000 lines of code. The command generates output that includes file links opening the file at the problematic line in modern IDEs, providing a quick and easy way to identify necessary migrations. The lint-local-code command is implemented in the application.py file, with supporting methods and classes added to the workspace_cli.py and databricks.labs.ucx.source_code packages, enhancing the linting process and providing valuable feedback for maintaining high code quality standards.
  • Added table in mount migration (#1225). This commit introduces new functionality to migrate tables in mounts to the Unity Catalog, including creating a table in the Unity Catalog based on a table mapping CSV file, fixing an issue with include_paths_in_mount not being present in workflows.py, and adding the ability to set default ownership on each created table. A new method ScanTablesInMounts has been added to scan tables in mounts, and a TableMigration class creates tables in the Unity Catalog based on the table mapping. Two new methods, Rule and TableMapping, have been added to manage mappings of tables, and TableToMigrate is used to represent a table that needs to be migrated to Unity Catalog. The commit includes manual, unit, and integration testing to ensure the changes work as expected. The diff shows changes to the workflows.py file and the addition of several new methods, including Rule, TableMapping, TableToMigrate, create_autospec, and MockBackend.
  • Added workflows to trigger table reconciliations (#1721). In this release, we've introduced several enhancements to our table migration workflow, focusing on data reconciliation and consistency. We've added a new post-migration data reconciliation task that validates migrated table integrity by comparing the schema, row count, and individual row content of the source and target tables. The new task stores and displays the number of missing rows in the Migration dashboard's $inventory_database.reconciliation_results view. Additionally, new workflows have been implemented to automatically trigger table reconciliations, ensuring consistency and integrity between different data sources. These workflows involve modifying relevant functions and modules, and may include new methods for data processing, scheduling, or monitoring based on the project's architecture. Furthermore, new configuration options for table reconciliation are now available in the WorkspaceConfig class, allowing for greater control and flexibility over migration processes. By incorporating these improvements, users can expect enhanced data consistency and more efficient table reconciliation management.
  • Always refresh HMS stats when getting table size (#1713). A change has been implemented in the hive_metastore library to enhance the precision of table size calculations by ensuring that HMS stats are always refreshed before being retrieved. This is achieved by calling the ANALYZE TABLE command with the COMPUTE STATISTICS NOSCAN option before computing the table size, thus preventing the use of stale stats; a minimal sketch of this ordering is shown after this release's notes. Specifically, the "backend.queries" list has been updated to include two ANALYZE statements for tables "db1.table1" and "db1.table2", ensuring that their statistics are updated and accurate. The test case test_table_size_crawler in the "test_table_size.py" file has been revised to validate the presence of the two ANALYZE statements in the "backend.queries" list and confirm the size of the results for both tables. This commit also includes manual testing, added unit tests, and verification on the staging environment to ensure the functionality works as expected.
  • Automatically retrieve aws_account_id from aws profile instead of prompting (#1715). This commit introduces several improvements to the library's AWS integration, enhancing automation and user experience. It eliminates the need for manual input of aws_account_id by automatically retrieving it from the AWS profile; a small sketch of this lookup is shown after this release's notes. An optional kms-key flag has been documented for creating roles, providing more flexibility. The create-missing-principals command now accepts optional parameters such as KMS Key, Role Name, and Policy Name, and allows creating a single role for all S3 locations, with a default behavior of creating one role per S3 location. These changes have been manually tested and verified in a staging environment, and resolve issue #1714. Additionally, tests have been conducted to ensure the changes do not introduce regressions. A new method simulating a successful AWS CLI call has been added, replacing aws_cli_run_command, ensuring automated retrieval of aws_account_id. A test has also been added to raise an error when AWS CLI is not found in the system path.
  • Detect dependencies of libraries installed via pip (#1703). This commit introduces a child dependency graph for libraries resolved via pip using DistInfo data, addressing issues #1642 and [#1202](https://github.com/databrickslabs/u...
Read more
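
A minimal sketch of the requirements.txt handling described in #1753 above; UCX registers each entry in the dependency graph, while this illustration only collects the specifiers and deliberately skips nested includes and options.

```python
from pathlib import Path


def requirements_from_file(path: Path) -> list[str]:
    """Collect dependency specifiers from a requirements.txt file (illustrative sketch)."""
    specs: list[str] = []
    for raw_line in path.read_text().splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):      # skip -r/-c includes and other options
            continue
        specs.append(line)
    return specs
```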
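
The stats refresh from #1713 boils down to running ANALYZE TABLE before reading the size; a minimal sketch of that ordering follows (shown here for a Delta table via DESCRIBE DETAIL, which is an assumption; UCX reads the size differently).

```python
from pyspark.sql import SparkSession


def refreshed_table_size(spark: SparkSession, full_table_name: str) -> int:
    """Refresh Hive metastore statistics before reading a table's size in bytes."""
    # Refresh statistics first so a stale size is never returned.
    spark.sql(f"ANALYZE TABLE {full_table_name} COMPUTE STATISTICS NOSCAN")
    detail = spark.sql(f"DESCRIBE DETAIL {full_table_name}").head()
    return int(detail["sizeInBytes"])
```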
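
Finally, the account-id lookup from #1715 amounts to asking the AWS CLI who the configured profile is; the sketch below shows that idea with subprocess and is not UCX's aws_cli_run_command wrapper.

```python
import json
import shutil
import subprocess


def aws_account_id(profile: str) -> str:
    """Resolve the AWS account id for a named profile via `aws sts get-caller-identity`."""
    if shutil.which("aws") is None:
        raise RuntimeError("AWS CLI not found in the system path")
    result = subprocess.run(
        ["aws", "sts", "get-caller-identity", "--profile", profile, "--output", "json"],
        capture_output=True, check=True, text=True,
    )
    return json.loads(result.stdout)["Account"]
```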