Learn how to monitor and automate software RAID health checks using mdadm and smartmontools on Linux to prevent data loss and ensure system stability.
Key points to consider
- RAID is not a backup solution. While RAID improves redundancy and fault tolerance, it does not replace proper backups. Multiple disk failures or user errors can still lead to data loss. Always maintain separate backups.
- Monitoring your RAID array helps identify potential failures early, ensuring data integrity and system stability.
- Regular checks using tools like `mdadm` and `smartmontools` provide insights into disk health, performance, and potential failures.
- Proactively monitoring RAID arrays helps prevent unexpected downtime and time-consuming data recovery procedures.
- Keeping RAID arrays healthy ensures optimal performance and extends the lifespan of your server infrastructure.
Step-by-step instructions
Part A: Identifying your RAID array
Before monitoring your RAID array, it is essential to identify its configuration. Use the following commands to determine your RAID setup. Identifying your RAID setup helps you understand the type of redundancy and performance improvements it provides.
- Check active RAID devices
Open the terminal and run the following command to check active RAID devices (see Fig. 1):
$ cat /proc/mdstat
This command displays active RAID arrays and their status, helping you to identify any degraded or failed arrays.
Example RAID status output:
Fig. 1. Example output of the cat /proc/mdstat command, showing an active RAID 1 array with two healthy disks.
Explanation of the output:
- Personalities: Lists the available RAID types supported on the system. In this case, the system supports `RAID1`, `RAID0`, `RAID6`, `RAID5`, `RAID4`, and `RAID10`.
- md0: Indicates the active RAID array. In this case, `md0` is configured as a `RAID 1` (mirroring) array.
- Devices: The array consists of two NVMe drive partitions: `nvme1n1p2` and `nvme0n1p2`. The numbers inside the square brackets, `[1]` and `[0]`, indicate their order in the array.
- Blocks and version: The RAID array contains `249916416` data blocks and uses the `super 1.2` metadata format.
- [2/2] [UU]: This section shows the RAID member count and their status. `[2/2]` indicates that both disks are active, and `[UU]` means both disks are functioning correctly. If one disk fails, it will show `[U_]` or `[_U]`, indicating which disk is degraded.
- Bitmap: The `bitmap` helps track changes to the RAID set, speeding up re-synchronization by reducing unnecessary data copying. In this example, the bitmap size is `8KB`, with a chunk size of `65536KB`.
- Unused devices: Indicates that no additional devices are currently unused within the RAID setup.
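The degraded-state markers described above lend themselves to automated checks. The following is a minimal sketch of a shell function that scans mdstat-formatted text for an underscore inside a member-status field; the sample input is modeled on the Fig. 1 values explained above (the `1/2 pages` bitmap detail is illustrative, not from the original output).

```shell
#!/bin/sh
# Flag any array whose member-status field (e.g. [UU]) contains an
# underscore, which marks a failed or missing member ([U_], [_U]).
check_mdstat() {
    if grep -Eq '\[[U_]*_[U_]*\]' "$1"; then
        echo "DEGRADED"
    else
        echo "OK"
    fi
}

# Sample input modeled on the Fig. 1 output described above.
cat > /tmp/mdstat.sample <<'EOF'
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      249916416 blocks super 1.2 [2/2] [UU]
      bitmap: 1/2 pages [8KB], 65536KB chunk

unused devices: <none>
EOF

check_mdstat /tmp/mdstat.sample    # prints OK
# On a live system: check_mdstat /proc/mdstat
```

The regex deliberately ignores bracketed fields such as `[raid1]` or `[2/2]`, matching only status fields built from `U` and `_` that contain at least one `_`.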
- Identify RAID partitions
To identify RAID partitions and their layout, run (see Fig. 2):
$ lsblk
This command visualizes your disk layout, showing RAID devices, partitions, and how storage is allocated.
Example output:
Fig. 2. Example output of the lsblk command, illustrating the RAID 1 configuration and partitions.
Explanation of the output:
- NAME: Lists devices and their partitions. Here, `nvme0n1` and `nvme1n1` are NVMe drives, each with a partition (`nvme0n1p2` and `nvme1n1p2`) forming the RAID array `md0`.
- SIZE: Displays device capacity. Both disks are `238.5G`, and `md0` reflects the usable size of the mirrored array (not the sum of both disks, since RAID 1 duplicates data).
- TYPE: Identifies the device type - `disk` for physical drives, `part` for partitions, and `raid1` for the RAID array.
- MOUNTPOINTS: Shows where devices are mounted. The RAID array `md0` is mounted at `/`.
- Gather detailed RAID information
To gather detailed information about a specific RAID array, run (see Fig. 3):
$ sudo mdadm --detail /dev/md0
Replace `/dev/md0` with your actual RAID device. This retrieves crucial information such as RAID level, disk health, and recovery status.
Example output:
Fig. 3. Example output of the mdadm --detail command, providing detailed information about a healthy RAID 1 array.
Explanation of the output:
- Version: The RAID metadata version, here `1.2`, which defines the format used to store RAID information.
- Creation Time: Indicates when the RAID array was created.
- RAID Level: Specifies the type of RAID configuration; in this case, `RAID 1` (mirroring).
- Array Size: Displays the total capacity of the RAID array, here `238.34 GiB`.
- Used Dev Size: Shows the storage utilized by each device.
- Raid Devices / Total Devices: Number of active and total devices in the RAID setup.
- Persistence: Confirms that the RAID superblock is persistent, meaning it retains configuration across reboots.
- State: Displays the current status of the array; `clean` indicates no issues.
- Active / Working / Failed Devices: Counts of functioning, operational, and failed devices, respectively.
- Consistency Policy: Indicates that a bitmap is used to track changes and speed up rebuilds.
- Device list: Shows associated storage devices with their respective RAID roles.
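For scripting, the fields above can be pulled out of the `mdadm --detail` report with standard text tools. The sketch below parses a saved sample modeled on the healthy array described above; on a live system you would pipe `sudo mdadm --detail /dev/md0` into the same `awk` filter instead of reading a file.

```shell
#!/bin/sh
# Sample report modeled on the healthy RAID 1 array described above.
cat > /tmp/md0.detail <<'EOF'
/dev/md0:
           Version : 1.2
        Raid Level : raid1
        Array Size : 249916416 (238.34 GiB 255.91 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
Consistency Policy : bitmap
EOF

# Print only the fields a monitoring script usually cares about.
awk -F' : ' '/^ *(State|Failed Devices) / {
    gsub(/^ +/, "", $1)
    print $1 "=" $2
}' /tmp/md0.detail
# Output:
#   State=clean
#   Failed Devices=0
```

A cron job could compare `State` against `clean` and `Failed Devices` against `0`, logging or alerting only when either deviates.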
- Verify the RAID configuration file
To verify the RAID configurations stored on your system, run (see Fig. 4):
$ sudo cat /etc/mdadm/mdadm.conf
This file stores RAID array definitions and should be kept up to date so arrays are auto-assembled on boot.
Example output:
Fig. 4. Example of an mdadm.conf configuration file showing RAID array details and alert email settings.
Explanation of the output:
- ARRAY /dev/md0: Specifies the RAID array device managed by `mdadm`. In this case, the array is identified as `/dev/md0`.
- metadata=1.2: Indicates the metadata version used to store RAID configuration details. The metadata helps the system recognize and rebuild the RAID array upon reboots.
- name=246013:0: This field assigns a unique name to the RAID array, which can help track and manage multiple RAID arrays.
- UUID=fd3e2b9a:da14efcd:73e749f8:50e44710: The unique identifier assigned to the RAID array. This UUID identifies the correct array even if the device name changes.
- MAILADDR alerts@internal-mx.cherryservers.com: Defines the email address where notifications and alerts regarding RAID events (such as failures or degradations) will be sent.
The `mdadm.conf` file ensures that the RAID array is assembled automatically during system boot, and the `MAILADDR` setting allows system administrators to receive critical RAID alerts proactively, helping to prevent data loss.
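Putting the fields above together, a minimal `mdadm.conf` along these lines (reconstructed from the values explained above, not a file to copy verbatim) looks like:

```
ARRAY /dev/md0 metadata=1.2 name=246013:0 UUID=fd3e2b9a:da14efcd:73e749f8:50e44710
MAILADDR alerts@internal-mx.cherryservers.com
```

On Debian-based systems the file lives at `/etc/mdadm/mdadm.conf`; on RHEL-based systems it is usually `/etc/mdadm.conf`.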
For more details on creating and managing RAID arrays, refer to creating different types of RAID arrays.
Part B: Monitoring and automating RAID monitoring
Once you have identified your RAID setup, the next step is to monitor it continuously to ensure optimal performance and prevent unexpected failures. This section will guide you through monitoring RAID arrays using available tools and automating the process to stay informed about potential issues.
Install monitoring tools
To monitor RAID health, install the necessary tools using the package manager for your Linux distribution:
- Debian/Ubuntu-based distributions:
$ sudo apt update && sudo apt install mdadm smartmontools -y
- RHEL/CentOS-based distributions:
$ sudo dnf install mdadm smartmontools -y
For older CentOS versions:
$ sudo yum install mdadm smartmontools -y
- Arch Linux:
$ sudo pacman -S mdadm smartmontools --noconfirm
- openSUSE:
$ sudo zypper install mdadm smartmontools
Monitoring RAID status
Once the tools are installed, you can check the status and health of your RAID array using the following methods:
- Checking RAID sync and failures:
To detect any degraded or syncing issues in the RAID array, run (see Fig. 1):
$ cat /proc/mdstat
This command provides real-time monitoring of the RAID sync process and disk health.
- Checking disk health with `smartmontools`:
The `smartctl` utility provides detailed health reports for individual RAID disks:
$ sudo smartctl -a /dev/sda
Note: Replace `/dev/sda` with the appropriate disk identifier for your system (e.g., `/dev/nvme0n1` or `/dev/sdb`). You can identify your drives using the `lsblk` command.
Key information provided:
- Overall health status (e.g., `PASSED` or `FAILED`)
- Disk temperature and SMART attributes
- Reallocated sectors and potential failure indicators
Other useful `smartctl` options:
- `-H` – Quick health check of the disk.
- `-i` – View basic disk information (model, serial, firmware).
- `-t short|long` – Run a self-test to detect errors.
- `-l error` – Display recent error logs.
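Scripts usually only need the overall-health verdict from that report. Below is a minimal sketch that extracts it from saved `smartctl -H`-style output; the sample line and the commented live invocation (including the device name) are assumptions for illustration.

```shell
#!/bin/sh
# Sample of the health line that `smartctl -H` prints for a device.
cat > /tmp/smart.sample <<'EOF'
SMART overall-health self-assessment test result: PASSED
EOF

# Extract just PASSED or FAILED from the report.
parse_smart_health() {
    awk -F': ' '/overall-health self-assessment/ { print $2 }' "$1"
}

parse_smart_health /tmp/smart.sample    # prints PASSED

# On a live system (device name assumed):
#   sudo smartctl -H /dev/nvme0n1 > /tmp/smart.sample
#   parse_smart_health /tmp/smart.sample
```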
Automating monitoring with cron jobs
To ensure regular monitoring, automate checks using cron jobs:
$ crontab -e
If it's your first time using crontab, you will be prompted to select an editor (see Fig. 5).
Example output:
Fig. 5. Selecting a text editor for crontab when configuring scheduled tasks for the first time.
Add the following entry to check RAID health daily at 3 AM and log it (see Fig. 6):
0 3 * * * /usr/sbin/mdadm --detail /dev/md0 >> /var/log/raid_status.log
- `0 3 * * *` – The schedule for running the command: `0` is the minute (0 minutes past the hour), `3` is the hour (3 AM), and `* * *` means every day of the month, every month, and every day of the week.
- `/usr/sbin/mdadm --detail /dev/md0` – Checks the detailed status of the RAID array.
- `>> /var/log/raid_status.log` – Appends the output to the specified log file for later review.
Customization:
- Change RAID device: Replace `/dev/md0` with your actual RAID array (e.g., `/dev/md127`).
- Change log location: Modify `/var/log/raid_status.log` to any preferred path (e.g., `/home/user/raid_log.txt`).
Example configuration:
Fig. 6. Example of a crontab entry scheduling a daily RAID health check at 3 AM, logging the output to /var/log/raid_status.log.
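Since the cron entry appends report after report, a small wrapper script that timestamps each run makes the log much easier to read. The script name, log path, and commented `mdadm` line below are assumptions for the sketch:

```shell
#!/bin/sh
# raid_check.sh - append a timestamped RAID report to a log file.
# Log path is an example; in production use /var/log/raid_status.log.
LOG=/tmp/raid_status.log

{
    echo "=== RAID check $(date -u '+%Y-%m-%d %H:%M:%SZ') ==="
    # Requires root and an existing array; uncomment on a live system:
    # /usr/sbin/mdadm --detail /dev/md0
    cat /proc/mdstat 2>/dev/null || echo "(no /proc/mdstat on this host)"
} >> "$LOG"
```

The crontab entry then points at the script (installed at a path of your choosing, e.g. `/usr/local/sbin/raid_check.sh`) instead of calling `mdadm` directly.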
Setting up email alerts
To receive automatic alerts in case of RAID issues, configure email notifications in the `mdadm.conf` file:
- Edit the configuration file (see Fig. 4):
$ sudo nano /etc/mdadm/mdadm.conf
- Add or modify the following line to specify an email address for alerts:
MAILADDR alerts@yourdomain.com
- Save and update the RAID configuration:
$ sudo mdadm --detail --scan >> /etc/mdadm/mdadm.conf
$ sudo update-initramfs -u
Note: appending with `>>` can leave duplicate `ARRAY` lines if the array is already listed, so review the file afterward. The `update-initramfs -u` step applies to Debian/Ubuntu; on RHEL-based systems, rebuild the initramfs with `dracut -f` instead. You can verify alert delivery with `sudo mdadm --monitor --scan --oneshot --test`, which sends a test message for each array.
By implementing these monitoring solutions and automation methods, you can ensure that your RAID arrays remain healthy and perform optimally. For further guidance on replacing failed disks, refer to removing, replacing, and resyncing a disk.
Summary
In this tutorial, we have covered key aspects of monitoring software RAID arrays on Linux. We began by identifying RAID configurations using commands such as `cat /proc/mdstat` and `lsblk`. Then, we examined detailed array information with `mdadm --detail` and verified the RAID configuration file with `cat /etc/mdadm/mdadm.conf`.
Next, we explored monitoring methods, including the installation of essential tools like `mdadm` and `smartmontools`. Using `smartctl`, we demonstrated how to check disk health, temperature, and potential failure indicators. Additionally, we explored ways to automate RAID monitoring using cron jobs, enabling regular checks and logging for proactive maintenance.
By following this guide, you can effectively monitor your RAID arrays, automate regular health checks, and receive timely alerts to prevent data loss and maintain system stability.