What's new in Cloudera Data Warehouse on premises 1.5.5 SP2
Review the features introduced in Cloudera Data Warehouse on premises 1.5.5 Service Pack 2 release.
Cloudera Data Warehouse on premises
- Custom Kerberos support for Cloudera Data Warehouse
- Cloudera Data Warehouse on premises now allows administrators to specify a custom hostname for the Kerberos service principal during environment activation. This ensures compatibility with enterprise DNS and security standards, addressing scenarios in which default internal hostnames are inaccessible or non-existent outside the Kubernetes cluster. This option consolidates Kerberos identity requirements into a single, user-defined principal aligned with enterprise security policies. For more information, see Custom Kerberos principal.
- Secure Ozone S3A configuration
- Cloudera Data Warehouse on premises now supports a Kubernetes-native security framework for authenticating Hive and Impala Virtual Warehouses with Apache Ozone. This enables the Virtual Warehouses to read and write to Ozone through the S3A protocol. This feature replaces traditional plain-text credential storage with an encrypted Java KeyStore (JCEKS) vault, which is managed as a Kubernetes secret and automatically injected into Virtual Warehouse pods using a mutating webhook. For more details, contact your Cloudera Account Representative to access this feature.
- Enforced minimum timeout for Virtual Warehouse scale-down configuration
- A minimum timeout of 120 seconds is now enforced for the Trigger Shutdown Delay (Impala) and Auto-suspend Timeout (Seconds) (Hive) configurations in Virtual Warehouses.
With this new feature, Virtual Warehouses with timeout values less than 120 seconds are automatically adjusted during upgrades to meet the minimum requirement, ensuring consistent behavior and preventing validation failures.
- Diagnostic bundle support for Cloudera Data Visualization
- Cloudera Data Warehouse on premises now supports generating diagnostic bundles for Cloudera Data Visualization workloads.
What's new in Hive on Cloudera Data Warehouse on premises
- Upgrading Calcite
- Hive has been upgraded to Calcite version 1.33. This upgrade introduces various query optimizations that can improve query performance.
- Hive on ARM Architecture
- Hive is now fully supported on ARM architecture instances, including AWS Graviton and Azure ARM. This enables you to run your Hive workloads on more cost-effective and energy-efficient hardware.
What's new in Impala on Cloudera Data Warehouse on premises
- Impala AES encryption and decryption support
- Impala now supports AES (Advanced Encryption Standard) encryption and decryption for better interoperability with other systems. AES-GCM is the default mode for strong security, but you can also use other modes such as CTR, CFB, and ECB for different needs. This feature works with both 128-bit and 256-bit keys and includes checks to keep your data safe and confidential. For more information, see AES encryption and decryption support.
Apache Jira: IMPALA-13039
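The following is a minimal sketch of how such functions are typically invoked; the built-in names, argument order, and implicit mode selection shown here are assumptions, so check the AES encryption and decryption support reference for the exact signatures in your release.

  -- Assumed built-ins: aes_encrypt()/aes_decrypt(); a 16-byte key selects AES-128.
  SELECT aes_encrypt('sensitive value', '0123456789abcdef') AS ciphertext;
  SELECT aes_decrypt(aes_encrypt('sensitive value', '0123456789abcdef'),
                     '0123456789abcdef') AS plaintext;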
- Query cancellation supported during analysis and planning
- This new feature allows you to cancel Impala queries even while they are in the Frontend
stage, which includes analysis and planning. Previously, you could not cancel a query while it
was waiting for operations like loading metadata from the Catalog Server. With this update,
Impala now registers the planning process and can interrupt it to cancel the query.
Apache Jira: IMPALA-915
- Improved memory estimation and control for large queries
- Impala now uses a more realistic approach to memory estimation for large operations like SORT, AGGREGATION, and HASH JOIN.
- Expose query cancellation status to UDF interface
- Impala now exposes the query cancellation status to the User-Defined Function (UDF) interface. This new feature allows complex or time-consuming UDFs to periodically check if the query has been cancelled by the user. If cancellation is detected, the UDF can stop its work and fail fast.
- Impala now supports Hive’s legacy timestamp conversion to ensure consistent interpretation of historical timestamps
- When reading Parquet or Avro files written by Hive using legacy timestamp conversion, Impala's timezone calculation for UTC timestamps could be incorrect, particularly for historical dates and timezones like Asia/Kuala_Lumpur or Singapore before 1982. This meant the timestamps displayed in Impala were different from those in Hive.
- Impala-shell now shows row count and elapsed time for most statements in HiveServer2 mode
- When running Impala queries, some commands over the HiveServer2 protocol (like REFRESH or INVALIDATE) did not show the Fetched X row(s) in Ys output in Impala-shell, even though the Beeswax protocol showed them.
- Support for arbitrary encodings in text and sequence files
- Impala now supports reading from and writing to Text and Sequence files that use arbitrary character encodings, such as GBK, beyond the default UTF-8.
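As a hedged sketch, assuming the encoding is declared through the serialization.encoding SerDe property (the mechanism used by Hive's text SerDe); verify the exact property name for your release.

  -- Hypothetical table; data on disk is stored in GBK, queries see UTF-8.
  CREATE TABLE gbk_events (id INT, msg STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;
  ALTER TABLE gbk_events SET SERDEPROPERTIES ('serialization.encoding'='GBK');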
- Expanded compression levels for ZSTD and ZLIB
- Impala has extended the configurable range of compression levels for the ZSTD and ZLIB (GZIP/DEFLATE) codecs. This enhancement allows for better optimization of the trade-off between compression ratio and write throughput.
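A minimal sketch using the COMPRESSION_CODEC query option, which accepts a codec:level form; the supported level ranges are release-specific, so treat the values below as placeholders.

  SET COMPRESSION_CODEC=ZSTD:12;  -- higher level: better ratio, slower writes
  INSERT INTO sales_parquet SELECT * FROM sales_staging;
  SET COMPRESSION_CODEC=GZIP:9;   -- ZLIB/GZIP levels can be tuned the same way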
- Constant folding is now supported for non-ASCII and binary strings
- Previously, the query planner could not apply the optimization known as constant folding if the resulting value contained non-ASCII characters or was a non-UTF8 binary string. This failure meant that important query filters could not be simplified, which prevented key performance optimizations like predicate pushdown to the storage engine (e.g., Iceberg or Parquet stat filtering).
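As an illustration (the table and column are hypothetical), a predicate whose constant expression folds to a non-ASCII literal can now be simplified at plan time, enabling pushdown to the storage layer.

  -- The planner can now fold concat('价', '格') into the literal '价格'
  -- and push the resulting equality predicate down for stat filtering.
  SELECT * FROM products WHERE label = concat('价', '格');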
- Catalogd and Event Processor Improvements
- Faster Inserts for Partitioned Tables (IMPALA-14051): Inserting data into very large partitioned tables is now much faster. Previously, Impala communicated with the Hive Metastore (HMS) one partition at a time, which was a major slowdown. Impala now uses the batch insert API to send all insert information to the HMS in one highly efficient call, significantly boosting the performance of your INSERT statements into transactional tables.
- Quicker Table Administration (IMPALA-13599): Administrative tasks, such as running DROP STATS or changing the CACHED status of a table, are now much faster on tables with many partitions. Impala previously made thousands of individual calls to the HMS for these operations. The system now batches these updates, making far fewer calls to the HMS and speeding up these essential administrative commands.
- Reliable Table Renames (IMPALA-13989): The ALTER TABLE RENAME command no longer fails when an INVALIDATE METADATA command runs at the same time. Previously, this caused the rename to succeed in the Hive Metastore but fail in Impala's Catalog Server. Impala now includes automatic error handling that instantly runs an internal metadata refresh if the rename is interrupted, ensuring the rename completes successfully without requiring any manual user steps.
- Efficient Partition Refreshes (IMPALA-13453): Running REFRESH <table> PARTITION <partition> is now much more efficient. Previously, this command always fully reloaded the partition's metadata and column statistics, even if the partition was unchanged. Impala now checks if the partition data has changed before reloading, avoiding the unnecessary drop-add sequence and significantly improving the efficiency of partition metadata updates.
- Reduced Partition API Calls (IMPALA-13599): Impala has reduced unnecessary API interactions with the HMS during table-level operations. Commands like ALTER TABLE ... SET CACHED/UNCACHED or DROP STATS on large tables previously generated thousands of single alter_partition() calls. Impala now utilizes the HMS's bulk-update functionality, batching these partition updates to drastically reduce the total number of required API calls.
- REFRESH on multiple partitions (IMPALA-14089): Impala now supports using the REFRESH statement on multiple partitions within a single command, which significantly speeds up metadata updates by processing partitions in parallel, reduces lock contention in the Catalog service, and avoids unnecessary increases to the table version. A sketch of the syntax follows this list. See Impala REFRESH Statement.
- Impala cluster responsiveness during table renames (IMPALA-13631): This improvement ensures that the critical internal lock is no longer held during long-running external calls initiated by ALTER TABLE RENAME operations. This prevents the entire Impala cluster from being blocked, allowing other queries and catalog operations to proceed without interruption.
Apache Jira: IMPALA-14051, IMPALA-13599, IMPALA-13989, IMPALA-13453, IMPALA-14089, IMPALA-13631
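A sketch of the multi-partition REFRESH form described above; the table and partition names are hypothetical, so see the Impala REFRESH Statement documentation for the exact grammar.

  -- Refresh two partitions in a single statement; Impala can process them in parallel.
  REFRESH sales PARTITION (year=2024, month=1) PARTITION (year=2024, month=2);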
- New query options for reliable metadata synchronization
- Impala now offers new query options that give you a reliable way to ensure your queries run against the latest table data after the relevant HMS modifications are complete.
What's new in Trino on Cloudera Data Warehouse on premises
- General Availability (GA) of Trino in Cloudera Data Warehouse
- Trino is a distributed SQL query engine designed to efficiently query large datasets across
one or more heterogeneous data sources. This integration enables users to leverage Trino's
powerful capabilities directly within Cloudera Data Warehouse.
The GA release of Trino in Cloudera Data Warehouse introduces several key capabilities:
- Trino Virtual Warehouses — Offers full support for creating and managing Trino Virtual Warehouses, enabling efficient querying across diverse, large datasets. For information about creating a Trino Virtual Warehouse, see Adding a new Virtual Warehouse.
- Federation and Connectivity — Seamless connection and management of various remote data sources is possible through Trino Federation Connectors, including the new Teradata custom connector. A dedicated connector management UI and backend facilitates the creation and configuration of these connectors. For more information, see Trino Federation Connectors.
- Security and Governance — Governance is enforced by default through Apache Ranger using the cm_trino authorization service. You can create or update Ranger policies for specific resources and assign permissions to Trino users, groups, or roles. When a user submits a query to Trino, the system verifies the defined policies to ensure that the user has the necessary permissions to run queries. For more information, see Ranger authorization for Trino Virtual Warehouses.
- Performance Optimization — Built-in capabilities for auto-suspend and auto-scaling are supported. These configurations help optimize resource utilization and ensure the provisioning of a high-performance and scalable Trino Virtual Warehouse.
- Support for Ozone file system — You can now configure the Hive Metastore (HMS) in
Database Catalogs to use the Ozone filesystem for Trino Virtual Warehouses. By default, HMS
points to HDFS, but this feature allows you to set Ozone as the default storage system,
enabling efficient and scalable data management. For more information, see Configuring Ozone file system access.
Trino supports secure interactions with the Ozone file system by utilizing the Hive user. To enable authenticated read and write operations on Ozone data volumes and buckets, administrators must configure Ranger policies to authorize the Hive user. This ensures secure and controlled access to Ozone resources. For more information, see Authorizing the Hive user for Trino access to Ozone.
- Support for Teradata connector (Technical Preview) — Cloudera Data Warehouse now introduces support for a read-only Trino-Teradata connector. This feature is designed to facilitate SELECT operations on Teradata sources, operating in ANSI mode and optimizing performance by pushing down filters and aggregates.
- Cloudera Data Visualization Integration — The Trino connector is available for use in Cloudera Data Visualization, enabling interactive dashboarding and analytics.
- Connection pooling for JDBC-based connectors — You can now configure connection pooling capabilities for JDBC-based Trino connectors, such as MySQL, PostgreSQL, MariaDB, Teradata, and Oracle. Connection pooling improves performance, resource utilization, and stability when querying different data sources through Trino. For more information, see Connection pooling for JDBC-based connectors.
What's new in Iceberg on Cloudera Data Warehouse on premises
- Integrate Iceberg scan metrics into Impala query profiles
- Iceberg scan metrics are now integrated into the Frontend section of Impala query profiles, providing deeper insight into query planning performance for Iceberg tables.
- Delete orphan files for Iceberg tables
- You can now use the following syntax to remove orphan files for Iceberg tables:
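The statement below is a sketch in the Hive dialect's ALTER TABLE ... EXECUTE form; the table name and timestamp are placeholders, and the exact grammar may differ in your release.

  -- Removes files under the table location that no snapshot references
  -- and that are older than the given timestamp.
  ALTER TABLE ice_tbl EXECUTE DELETE ORPHAN-FILES OLDER THAN ('2025-01-01 00:00:00');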
- Allow forced predicate pushdown to Iceberg
- Since IMPALA-11591, Impala has optimized query planning by avoiding predicate pushdown to Iceberg unless it is strictly necessary. While this default behavior makes planning faster, it can miss opportunities to prune files early based on Iceberg's file-level statistics. You can now force predicate pushdown to Iceberg to take advantage of that file-level pruning.
- UPDATE operations now skip rows that already have the desired value
- The UPDATE statement for Iceberg and Kudu tables is optimized to reduce unnecessary writes.
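For example (hypothetical table), rows that already hold the target value are no longer rewritten.

  -- Rows whose status is already 'closed' are skipped rather than rewritten,
  -- reducing write amplification in the resulting delete/data files.
  UPDATE orders SET status = 'closed' WHERE order_date < DATE '2024-01-01';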
