Known issues in 1.5.5 SP2
You might run into some known issues while using Cloudera AI on premises 1.5.5 SP2.
The existing known issues from Cloudera AI on premises 1.5.5 and 1.5.5 SP1 carry over into Cloudera AI on premises 1.5.5 SP2.
Upgrade management
- DSE-50441: Unable to register or deploy Model with the 1.5.4_h30 → 1.5.5 SP1 → 1.5.5 SP2 upgrade path and DSE-50487: Add workaround for upgrading AI registry from 1.5.4 - 1.5.5
-
During the Cloudera AI Registry upgrade process, a new Cloudera AI Registry instance is provisioned, and the database is assigned a fresh Persistent Volume (PV). The upgrade workflow is designed to replace this fresh PV with the old Cloudera AI Registry PV to retain historical data.
A race condition occurs between the application startup and storage orchestration. The database schema migration logic might execute on the fresh PV before the volume swap sequence is completed. As a result, when the legacy PV is eventually attached, it lacks the necessary schema updates, leading to query failures caused by schema mismatches.
Workaround
Restart the Cloudera AI Registry v1 pod. This action forces the application to reinitialize, detect the correctly attached legacy PV, and apply the pending schema migrations.
- DSE-50712: On premises 1.5.5 SP2 Cloudera AI Registry on OpenShift Container Service: 1.5.4 SP2 to 1.5.5 SP2 upgrade fails with knox Init:ImagePullBackOff failure
-
Upgrading the on premises Cloudera AI Registry from 1.5.4 SP2 directly to 1.5.5 SP2 fails with an ImagePullBackOff error during Knox initialization.
Workaround
Upgrading directly from 1.5.4 SP2 to 1.5.5 SP2 is not a supported upgrade path. Upgrade first to 1.5.5 GA or 1.5.5 SP1, and then proceed to 1.5.5 SP2.
Log management
- DSE-49031: Add fluent bit sidecar for model endpoints in serving-default namespace and DSE-49032: Add fluent bit sidecar for knative pods
-
The diagnostic bundle includes live cluster logs from all namespaces; however, the archived logs are limited to the cml-serving infra namespace.
Workaround: None.
Quota management
- DSE-50530: Error observed when configuring custom user or team quota
-
When setting custom CPU or memory quotas for a user or team without specifying a GPU quota, the system displays the Request must contain at least one accelerator user quota error message.
Workaround
To resolve this issue, always include a GPU quota value when updating CPU or memory quotas for a user or team.
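As a minimal sketch, the workaround means every quota update should carry a GPU value alongside CPU and memory. The function and field names below are illustrative assumptions about the request shape, not the documented Cloudera AI quota API:

```python
# Hypothetical quota-update payload builder. The workaround is to always
# send a GPU value with CPU/memory, even when only CPU or memory changes.
def build_quota_update(cpu_cores, memory_gb, gpu_count):
    """Return a quota payload that always includes an accelerator quota.

    Field names here are illustrative; consult the Cloudera AI quota API
    documentation for the real request schema.
    """
    if gpu_count is None:
        # Omitting the accelerator quota is what triggers the
        # "Request must contain at least one accelerator user quota" error.
        raise ValueError("Include a GPU quota when updating CPU or memory quotas.")
    return {"cpu": cpu_cores, "memory": f"{memory_gb}Gi", "gpu": gpu_count}
```

Keeping the GPU field mandatory in your own tooling avoids hitting the error at request time.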
- DSE-49793: Different default quota in heterogeneous GPU for users and teams
-
In configurations with heterogeneous GPUs, the default GPU quota is shared by users and teams and cannot be configured separately for each.
Cloudera recommends setting a single default GPU quota that is suitable for both users and teams.
Workbench management
- DSE-49514: Review of the user resources and workbench resources does not display the right data
-
GPU resources are discovered by the workbench even if no GPUs were added during the creation or provisioning of the workbench. The workbench automatically discovers all GPUs available in the cluster, regardless of whether they were allocated to the workbench. As a result, the User Resources section inaccurately displays the total number of GPUs in the cluster instead of the GPUs assigned to the workbench.
Workaround: None.
Model training
- DSE-48824: Cloudera AI Registry does not work with Google Chrome browser version 142 or higher
- Cloudera AI Registry does not function properly when accessed using Google Chrome version 142 or higher. The Model Endpoints page fails to load and displays the following error message:
Error occurred while communicating with Cloudera AI Registry in environment '<ENVIRONMENT_NAME>'.
Workaround
To resolve this issue, either downgrade Google Chrome to a version lower than 142 or use an alternative browser.
Site Administration
- DSE-36561: Updating Cloudera AI applications fails if the Allow users to use ML runtime addons option is disabled
-
The Allow users to use ML runtime addons option is enabled by default in Cloudera AI on premises. However, if this option is disabled for any configuration activity, updating a Cloudera AI application can fail with the Whoops, there was an unexpected error HTTP 400 error message.
Workaround
To address this issue, enable the Allow users to use ML runtime addons option.
Model serving
- DSE-50375: Remove provisioning to update S3 bucket field in update storage configuration for Cloudera AI Inference service instance
-
The Cloudera AI Inference service does not use the S3 bucket option; consequently, the secret does not contain S3 fields either.
Workaround: None.
- DSE-48823: Cloudera AI Inference does not work with Google Chrome browser version 142 or higher
-
Cloudera AI Inference service does not function properly when accessed using Google Chrome version 142 or higher. The Model Endpoints page fails to load and displays the following error message:
Error occurred while communicating with Cloudera AI Inference service in environment '<ENVIRONMENT_NAME>'.
Workaround
To resolve this issue, either downgrade Google Chrome to a version lower than 142 or use an alternative browser.
- Cloudera AI Inference service known issues
-
- Updating the description after a model has been added to a model endpoint leads to a UI mismatch in the model builder between the models listed by the model builder and the deployed models.
- Embedding models function in two modes: query or passage. The mode must be specified when interacting with the models in one of the following ways:
  - Suffixing the model ID in the payload with either -query or -passage
  - Specifying the input_type parameter in the request payload.
  For more information, see the NVIDIA documentation.
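The two ways of selecting the mode can be sketched as JSON-style request payloads. This is a minimal illustration only; the model ID and the exact payload shape are assumptions drawn from typical NVIDIA embedding endpoints, not values from this release note:

```python
# Two equivalent ways to request "query"-mode embeddings from an
# embedding model (the model ID below is a hypothetical example).

# Option 1: suffix the model ID with -query (or -passage for documents).
payload_suffix = {
    "model": "example/embedding-model-query",  # hypothetical model ID
    "input": ["What is Cloudera AI Inference?"],
}

# Option 2: keep the plain model ID and pass input_type explicitly.
payload_param = {
    "model": "example/embedding-model",  # hypothetical model ID
    "input": ["What is Cloudera AI Inference?"],
    "input_type": "query",  # or "passage" when embedding documents
}

# Either payload would be sent as the JSON body of the embeddings request.
```

Use -passage or input_type "passage" when embedding documents for retrieval, and the query form when embedding the search query itself.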
- Embedding models only accept strings as input. Token stream input is currently not supported.
- Llama 3.2 Vision models are not supported on AWS on A10G and L40S GPUs.