Cloudera Data Catalog on premises 1.5.5 SP2

Review the features, fixes, and known issues in the Cloudera Data Catalog 1.5.5 Service Pack 2 release.

What's new in Cloudera Data Catalog 1.5.5 SP2

This section lists major features and updates for the Cloudera Data Catalog service.

This release of Cloudera Data Catalog on Cloudera Data Services on premises 1.5.5 service pack 2 includes the following features and enhancements:
  • Redesigned user interface providing the same look and profiler functionality (Compute Cluster based) as Cloudera Data Catalog on cloud.
  • Bug and CVE fixes

Containerized architecture for profilers

Cloudera Data Catalog introduces a new containerized architecture for Profilers, providing a scalable environment:
  • Only the required number of Kubernetes pods is launched, based on the size of the database to be profiled.
  • The deployment of the containerized profiler architecture is more streamlined and quicker than the previous architecture.
  • Moreover, the containerized nature of the architecture means that later upgrades can be carried out more easily, without the need for multiple dependencies.
  • Profilers now also support Avro, ORC, and text files.
    • Custom delimiters are supported for Hive tables when they are defined in the Hive Metastore. Cloudera Data Catalog now supports LazySimpleSerDe for Hive tables. For more information, see the Apache Hive Developer Guide.
  • Hive Column Profiler and Cluster Sensitivity Profilers also support profiling Iceberg Tables, including with On-Demand Profilers.
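As an illustration of the custom-delimiter support described above, the following sketch shows how a row of a delimited text file is split into column values, in the style of Hive's LazySimpleSerDe. The delimiter, sample row, and function name are hypothetical examples, not values taken from this release:

```python
# Illustrative sketch only: parse one row of a delimited text file the
# way LazySimpleSerDe would for a Hive table declared with a custom
# field delimiter (e.g. FIELDS TERMINATED BY '|').
# LazySimpleSerDe treats the special token "\N" as NULL.

def parse_delimited_row(line: str, delimiter: str = "|"):
    """Split one text-file row into column values, mapping \\N to None."""
    return [None if field == "\\N" else field
            for field in line.rstrip("\n").split(delimiter)]

print(parse_delimited_row("42|alice|\\N", delimiter="|"))
# ['42', 'alice', None]
```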

Redesigned profiler setup

  • Settings for instance sizing and autoscaling are introduced.

Improved profiler UI

The improved profilers present a more user-friendly UI and several extended capabilities.

  • New names for profilers in Compute Cluster enabled environments:
    • The Cluster Sensitivity Profiler is now called the Data Compliance Profiler.
    • The Hive Column Profiler is now called the Statistics Collector Profiler.
    • The Ranger Audit Profiler is now called the Activity Profiler.
  • New driver settings in profiler Configuration:
    Number of driver cores
    Sets the available processor cores for the Spark driver controlling the executors and scheduling their tasks. More cores allow more effective coordination when profiling multiple assets concurrently.
    Maximum driver memory in GBs
    Sets the maximum available memory for the Spark driver controlling the executors and scheduling their tasks. Increase it when profiling a large number of tables or when getting out-of-memory errors.
  • Redesigned Profilers menu for easier access to jobs, configurations and their history, asset filtering and tag rules:
    • The individual profilers show new metrics
      • Number of profiled assets of the last job
      • Job duration of the last job
      • The profilers menu also shows the next job's start time and the number of completions
    • The CRON expression based scheduler is supplemented with a natural language based scheduler
    • The Asset Filtering Rules view is expanded with the list of assets affected by your rule set
    • You can now access the Configuration History of a profiler, where you can review your changes in sequential order
    • The Job Summary page introduces new metrics:
      • Workers details:
        • Worker Memory limit
        • Threads per worker
        • Number of workers
      • Last run check details
    • The Job Summary page provides the list of profiled assets.
  • You can now approve tags recommended by the Data Compliance Profiler before they are applied to your assets and synced to Apache Atlas. This means you can review tag suggestions and correct mistakenly applied tags, which could otherwise lead to unexpected changes in tag-based Apache Ranger policies.
  • The Cluster Sensitivity Profiler and the Statistics Collector Profiler support incremental profiling to reduce required time and compute resources during repeated profiling jobs in Compute Cluster enabled environments.

  • The new Asset Filtering Rules tab in Job Summary shows the relevant Allow and Deny list rules for each Data Compliance and Statistics Collector Profiler job.
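The driver settings introduced above correspond to standard Spark driver tuning. As a hedged sketch, the mapping below shows how such UI values would typically translate into Spark configuration properties; the property names (`spark.driver.cores`, `spark.driver.memory`) are standard Spark settings, but whether Cloudera Data Catalog wires its profiler jobs exactly this way is an assumption:

```python
# Hedged sketch: translate the two profiler Configuration settings
# ("Number of driver cores", "Maximum driver memory in GBs") into
# standard Spark driver properties. The exact wiring inside Data
# Catalog's profiler jobs is an assumption, not documented behavior.

def driver_spark_conf(driver_cores: int, max_driver_memory_gb: int) -> dict:
    """Build the Spark configuration entries for the driver settings."""
    return {
        "spark.driver.cores": str(driver_cores),
        # Spark expects a size suffix, so whole GBs become e.g. "8g".
        "spark.driver.memory": f"{max_driver_memory_gb}g",
    }

print(driver_spark_conf(4, 8))
# {'spark.driver.cores': '4', 'spark.driver.memory': '8g'}
```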

Redesigned and expanded Tag Rules for Compute Cluster enabled environments

  • Profiling table names is now supported in addition to column values and column names.
  • Atlas classifications (Cloudera Data Catalog tags) can be used in a more granular way thanks to the distinction between parent and child tags.
  • The new Tag Rules tab offers filters to allow for faster searching and displays:
    • List of applied parent and child tags
    • Tag rule status (Can be used to filter for tag rules not yet validated by Dry Run)
    • Rule types
    • You can filter for tag rules that apply child tags
  • The initial loading time of rules has been decreased.
  • You can upload regex patterns in CSV files for easier handling.
  • You can now specify the weightage for column value based matching (previously fixed at 85%). The column value weightage and the column name weightage add up to 100%.
  • When profiling column values, you can upload a sample set of column values instead of defining a regex pattern.
  • You can review your configuration before finalizing your tag rule.
  • Dry Run: Before deploying your tag rules, you have to test them with actual table data.
  • New API calls are available.
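To make the weightage arithmetic above concrete, here is a minimal sketch of how two complementary weights that sum to 100% combine match scores. The blending formula is my own illustration, not the product's actual matching algorithm:

```python
# Minimal sketch, not the product's algorithm: blend a column-value
# match score and a column-name match score using complementary
# weightages that sum to 100%.

def combined_match_score(value_score: float, name_score: float,
                         value_weight_pct: float) -> float:
    """Weighted blend; the column name weightage is the remainder."""
    name_weight_pct = 100 - value_weight_pct
    return (value_score * value_weight_pct
            + name_score * name_weight_pct) / 100

# With the previously fixed 85% column value weightage:
score = combined_match_score(0.9, 0.5, 85)  # ≈ 0.84
```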

Known issues in Cloudera Data Catalog 1.5.5 SP2

This section lists known issues in this release of the Cloudera Data Catalog service.

CDPDSS-4372: Profiler pods are stuck in init state
Upgrading to Cloudera Data Catalog on premises 1.5.5 SP2 while profilers are launched results in the following error after the upgrade:
The auto-scaling node group is unavailable at the moment but profilers seem to have been launched. This has caused an inconsistency on Cloudera Data Catalog. Please delete all profilers and relaunch them to continue profiling your assets.
Before upgrading to Cloudera Data Catalog on premises 1.5.5 SP2, delete all launched profilers.

Fixed issues in Cloudera Data Catalog 1.5.5 SP2

This section lists issues fixed in this release of the Cloudera Data Catalog service.

CDPDSS-4375: Kerberos authentication failure with commented lines in krb5.conf
Kerberos authentication previously failed in Cloudera Data Catalog if the krb5.conf file contained commented-out lines. This occurred because the parser incorrectly processed commented lines as active configuration. This issue is resolved. The parser now correctly identifies and ignores commented-out realm configurations.
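The pattern behind this fix can be illustrated with a small sketch that skips commented-out lines when reading a krb5.conf-style file, so they are never treated as active configuration. This is an illustration of the fix pattern, not Cloudera Data Catalog's actual parser:

```python
# Illustrative sketch of the fix pattern: ignore commented-out lines
# (which in krb5.conf begin with '#') so they are not parsed as
# active configuration. Not Data Catalog's actual parser.

def active_lines(conf_text: str):
    """Yield only non-empty, non-comment lines of a krb5.conf-style file."""
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            yield stripped

sample = """\
[libdefaults]
# default_realm = OLD.EXAMPLE.COM
default_realm = EXAMPLE.COM
"""
print(list(active_lines(sample)))
# ['[libdefaults]', 'default_realm = EXAMPLE.COM']
```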
CDPDSS-3057: Failed profiler job because of missing columns in log entries
When the logs to be profiled by the Ranger Audit Profiler have missing columns, the profiling job no longer fails.