Introduction

Resource Monitoring is a core component of Alauda AI's Monitoring & Ops module, designed specifically for tracking and analyzing resource utilization metrics of inference services. As part of the full-stack MLOps platform, it provides real-time visibility into infrastructure resource consumption, enabling users to optimize model deployment, prevent resource bottlenecks, and ensure stable operation of AI workloads. Integrated with Alauda AI's unified monitoring ecosystem, Resource Monitoring eliminates the need for fragmented tooling by delivering actionable insights directly within your MLOps workflow.

TOC

Usage Limitations

When using Resource Monitoring, note the following constraints:

  • Data Collection Intervals
    • Minimum metric scraping interval: 60 seconds
    • Historical data retention: 7 days by default
  • Dependency Requirements
    • Requires a Prometheus/VictoriaMetrics monitoring stack deployed in the target clusters
    • Node exporter must be running on all worker nodes
    • DCGM exporter must be running on GPU nodes
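
The interval and exporter requirements above can be illustrated with a minimal Prometheus scrape configuration. This is a sketch only: the job names, target hostnames, and ports are illustrative assumptions, not Alauda AI defaults (9100 and 9400 are the conventional node-exporter and DCGM-exporter ports).

```yaml
# prometheus.yml — minimal sketch; targets and job names are illustrative.
global:
  scrape_interval: 60s        # matches the 60-second minimum scraping interval

scrape_configs:
  - job_name: node-exporter   # must run on all worker nodes
    static_configs:
      - targets: ["node-1:9100", "node-2:9100"]

  - job_name: dcgm-exporter   # must run on GPU nodes
    static_configs:
      - targets: ["gpu-node-1:9400"]
```

Note that in Prometheus the retention window is a server flag rather than a config-file setting, e.g. `--storage.tsdb.retention.time=7d` for the 7-day default noted above.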