Learn how to monitor and automate software RAID health checks using mdadm and smartmontools on Linux to prevent data loss and ensure system stability.
Key points to consider
- RAID is not a backup solution. While RAID improves redundancy and fault tolerance, it does not replace proper backups. Multiple disk failures or user errors can still lead to data loss. Always maintain separate backups.
- Monitoring your RAID array helps identify potential failures early, ensuring data integrity and system stability.
- Regular checks using tools like `mdadm` and `smartmontools` provide insights into disk health, performance, and potential failures.
- Proactively monitoring RAID arrays helps prevent unexpected downtime and time-consuming data recovery procedures.
- Keeping RAID arrays healthy ensures optimal performance and extends the lifespan of your server infrastructure.
Step-by-step instructions
Part A: Identifying your RAID array
Before monitoring your RAID array, it is essential to identify its configuration. Use the following commands to determine your RAID setup. Identifying your RAID setup helps you understand the type of redundancy and performance improvements it provides.
- Check active RAID devices
Open the terminal and run the following command to check active RAID devices (see Fig. 1):
$ cat /proc/mdstat
This command displays active RAID arrays and their status, helping you to identify any degraded or failed arrays.
Example RAID status output:
Fig. 1. Example output of the cat /proc/mdstat command, showing an active RAID 1 array with two healthy disks.
Explanation of the output:
- Personalities: Lists the available RAID types supported on the system. In this case, the system supports `RAID1`, `RAID0`, `RAID6`, `RAID5`, `RAID4`, and `RAID10`.
- md0: Indicates the active RAID array. In this case, `md0` is configured as a `RAID 1` (mirroring) array.
- Devices: The array consists of two NVMe drive partitions: `nvme1n1p2` and `nvme0n1p2`. The numbers inside the square brackets, `[1]` and `[0]`, indicate their order in the array.
- Blocks and version: The RAID array contains `249916416` data blocks and uses the `super 1.2` metadata format.
- [2/2] [UU]: This section shows the RAID member count and their status. `[2/2]` indicates that both disks are active, and `[UU]` means both disks are functioning correctly. If one disk fails, it will show `[U_]` or `[_U]`, indicating which disk is degraded.
- Bitmap: The `bitmap` helps track changes to the RAID set, speeding up re-synchronization by reducing unnecessary data copying. In this example, the bitmap size is `8KB`, with a chunk size of `65536KB`.
- Unused devices: Indicates that no additional devices are currently unused within the RAID setup.
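The degraded-state markers described above lend themselves to automated checks. The following is a minimal sketch of a shell function that scans mdstat-formatted text for an underscore inside a member-status field; the sample input is modeled on the Fig. 1 values explained above (the `1/2 pages` bitmap detail is illustrative, not from the original output).

```shell
#!/bin/sh
# Flag any array whose member-status field (e.g. [UU]) contains an
# underscore, which marks a failed or missing member ([U_], [_U]).
check_mdstat() {
    if grep -Eq '\[[U_]*_[U_]*\]' "$1"; then
        echo "DEGRADED"
    else
        echo "OK"
    fi
}

# Sample input modeled on the Fig. 1 output described above.
cat > /tmp/mdstat.sample <<'EOF'
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      249916416 blocks super 1.2 [2/2] [UU]
      bitmap: 1/2 pages [8KB], 65536KB chunk

unused devices: <none>
EOF

check_mdstat /tmp/mdstat.sample    # prints OK
# On a live system: check_mdstat /proc/mdstat
```

The regex deliberately ignores bracketed fields such as `[raid1]` or `[2/2]`, matching only status fields built from `U` and `_` that contain at least one `_`.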
- Identify RAID partitions
To identify RAID partitions and their layout, run (see Fig. 2):
$ lsblk
This command visualizes your disk layout, showing RAID devices, partitions, and how storage is allocated.
Example output:
Fig. 2. Example output of the lsblk command, illustrating the RAID 1 configuration and partitions.
Explanation of the output:
- NAME: Lists devices and their partitions. Here, `nvme0n1` and `nvme1n1` are NVMe drives, each with a partition (`nvme0n1p2` and `nvme1n1p2`) forming the RAID array `md0`.
- SIZE: Displays device capacity. Both disks are `238.5G`, and `md0` reflects the usable size of the mirrored array (not the sum of both disks, since RAID 1 duplicates data).
- TYPE: Identifies the device type - `disk` for physical drives, `part` for partitions, and `raid1` for the RAID array.
- MOUNTPOINTS: Shows where devices are mounted. The RAID array `md0` is mounted at `/`.
- Gather detailed RAID information
To gather detailed information about a specific RAID array, run (see Fig. 3):
$ sudo mdadm --detail /dev/md0
Replace `/dev/md0` with your actual RAID device. This retrieves crucial information such as RAID level, disk health, and recovery status.
Example output:
Fig. 3. Example output of the mdadm --detail command, providing detailed information about a healthy RAID 1 array.
Explanation of the output:
- Version: The RAID metadata version, here `1.2`, which defines the format used to store RAID information.
- Creation Time: Indicates when the RAID array was created.
- RAID Level: Specifies the type of RAID configuration; in this case, `RAID 1` (mirroring).
- Array Size: Displays the total capacity of the RAID array, here `238.34 GiB`.
- Used Dev Size: Shows the storage utilized by each device.
- Raid Devices / Total Devices: Number of active and total devices in the RAID setup.
- Persistence: Confirms that the RAID superblock is persistent, meaning it retains configuration across reboots.
- State: Displays the current status of the array; `clean` indicates no issues.
- Active / Working / Failed Devices: Counts of functioning, operational, and failed devices, respectively.
- Consistency Policy: Indicates that a bitmap is used to track changes and speed up rebuilds.
- Device list: Shows associated storage devices with their respective RAID roles.
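For scripting, the fields above can be pulled out of the `mdadm --detail` report with standard text tools. The sketch below parses a saved sample modeled on the healthy array described above; on a live system you would pipe `sudo mdadm --detail /dev/md0` into the same `awk` filter instead of reading a file.

```shell
#!/bin/sh
# Sample report modeled on the healthy RAID 1 array described above.
cat > /tmp/md0.detail <<'EOF'
/dev/md0:
           Version : 1.2
        Raid Level : raid1
        Array Size : 249916416 (238.34 GiB 255.91 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
Consistency Policy : bitmap
EOF

# Print only the fields a monitoring script usually cares about.
awk -F' : ' '/^ *(State|Failed Devices) / {
    gsub(/^ +/, "", $1)
    print $1 "=" $2
}' /tmp/md0.detail
# Output:
#   State=clean
#   Failed Devices=0
```

A cron job could compare `State` against `clean` and `Failed Devices` against `0`, logging or alerting only when either deviates.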
- Verify the RAID configuration file
To verify the RAID configurations stored on your system, run (see Fig. 4):
$ sudo cat /etc/mdadm/mdadm.conf
This file stores RAID array definitions and should be kept up to date so arrays are auto-assembled on boot.
Example output:
Fig. 4. Example of an mdadm.conf configuration file showing RAID array details and alert email settings.
Explanation of the output:
- ARRAY /dev/md0: Specifies the RAID array device managed by `mdadm`. In this case, the array is identified as `/dev/md0`.
- metadata=1.2: Indicates the metadata version used to store RAID configuration details. The metadata helps the system recognize and rebuild the RAID array upon reboots.
- name=246013:0: This field assigns a unique name to the RAID array, which can help track and manage multiple RAID arrays.
- UUID=fd3e2b9a:da14efcd:73e749f8:50e44710: The unique identifier assigned to the RAID array. This UUID identifies the correct array even if the device name changes.
- MAILADDR alerts@internal-mx.cherryservers.com: Defines the email address where notifications and alerts regarding RAID events (such as failures or degradations) will be sent.
The `mdadm.conf` file ensures that the RAID array is assembled automatically during system boot, and the `MAILADDR` setting allows system administrators to receive critical RAID alerts proactively, helping to prevent data loss.
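Putting the fields above together, a minimal `mdadm.conf` along these lines (reconstructed from the values explained above, not a file to copy verbatim) looks like:

```
ARRAY /dev/md0 metadata=1.2 name=246013:0 UUID=fd3e2b9a:da14efcd:73e749f8:50e44710
MAILADDR alerts@internal-mx.cherryservers.com
```

On Debian-based systems the file lives at `/etc/mdadm/mdadm.conf`; on RHEL-based systems it is usually `/etc/mdadm.conf`.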
For more details on creating and managing RAID arrays, refer to creating different types of RAID arrays.
Part B: Monitoring and automating RAID monitoring
Once you have identified your RAID setup, the next step is to monitor it continuously to ensure optimal performance and prevent unexpected failures. This section will guide you through monitoring RAID arrays using available tools and automating the process to stay informed about potential issues.
Install monitoring tools
To monitor RAID health, install the necessary tools using the package manager for your Linux distribution:
- Debian/Ubuntu-based distributions:
$ sudo apt update && sudo apt install mdadm smartmontools -y
- RHEL/CentOS-based distributions:
$ sudo dnf install mdadm smartmontools -y
For older CentOS versions:
$ sudo yum install mdadm smartmontools -y
- Arch Linux:
$ sudo pacman -S mdadm smartmontools --noconfirm
- openSUSE:
$ sudo zypper install mdadm smartmontools
Monitoring RAID status
Once the tools are installed, you can check the status and health of your RAID array using the following methods:
- Checking RAID sync and failures:
To detect any degraded or syncing issues in the RAID array, run (see Fig. 1):
$ cat /proc/mdstat
This command provides real-time monitoring of the RAID sync process and disk health.
- Checking disk health with `smartmontools`:
The `smartctl` utility provides detailed health reports for individual RAID disks:
$ sudo smartctl -a /dev/sda
Note: Replace `/dev/sda` with the appropriate disk identifier for your system (e.g., `/dev/nvme0n1` or `/dev/sdb`). You can identify your drives using the `lsblk` command.
Key information provided:
- Overall health status (e.g., `PASSED` or `FAILED`)
- Disk temperature and SMART attributes
- Reallocated sectors and potential failure indicators
Other useful `smartctl` options:
- `-H` – Quick health check of the disk.
- `-i` – View basic disk information (model, serial, firmware).
- `-t short|long` – Run a self-test to detect errors.
- `-l error` – Display recent error logs.
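Scripts usually only need the overall-health verdict from that report. Below is a minimal sketch that extracts it from saved `smartctl -H`-style output; the sample line and the commented live invocation (including the device name) are assumptions for illustration.

```shell
#!/bin/sh
# Sample of the health line that `smartctl -H` prints for a device.
cat > /tmp/smart.sample <<'EOF'
SMART overall-health self-assessment test result: PASSED
EOF

# Extract just PASSED or FAILED from the report.
parse_smart_health() {
    awk -F': ' '/overall-health self-assessment/ { print $2 }' "$1"
}

parse_smart_health /tmp/smart.sample    # prints PASSED

# On a live system (device name assumed):
#   sudo smartctl -H /dev/nvme0n1 > /tmp/smart.sample
#   parse_smart_health /tmp/smart.sample
```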
Automating monitoring with cron jobs
To ensure regular monitoring, automate checks using cron jobs:
$ crontab -e
If it's your first time using crontab, you will be prompted to select an editor (see Fig. 5).
Example output:
Fig. 5. Selecting a text editor for crontab when configuring scheduled tasks for the first time.
Add the following entry to check RAID health daily at 3 AM and log it (see Fig. 6):
0 3 * * * /usr/sbin/mdadm --detail /dev/md0 >> /var/log/raid_status.log
- `0 3 * * *` – The schedule for running the command: `0` is the minute (0 minutes past the hour), `3` is the hour (3 AM), and `* * *` means every day of the month, every month, and every day of the week.
- `/usr/sbin/mdadm --detail /dev/md0` – Checks the detailed status of the RAID array.
- `>> /var/log/raid_status.log` – Appends the output to the specified log file for later review.
Customization:
- Change RAID device: Replace `/dev/md0` with your actual RAID array (e.g., `/dev/md127`).
- Change log location: Modify `/var/log/raid_status.log` to any preferred path (e.g., `/home/user/raid_log.txt`).
Example configuration:
Fig. 6. Example of a crontab entry scheduling a daily RAID health check at 3 AM, logging the output to /var/log/raid_status.log.
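Since the cron entry appends report after report, a small wrapper script that timestamps each run makes the log much easier to read. The script name, log path, and commented `mdadm` line below are assumptions for the sketch:

```shell
#!/bin/sh
# raid_check.sh - append a timestamped RAID report to a log file.
# Log path is an example; in production use /var/log/raid_status.log.
LOG=/tmp/raid_status.log

{
    echo "=== RAID check $(date -u '+%Y-%m-%d %H:%M:%SZ') ==="
    # Requires root and an existing array; uncomment on a live system:
    # /usr/sbin/mdadm --detail /dev/md0
    cat /proc/mdstat 2>/dev/null || echo "(no /proc/mdstat on this host)"
} >> "$LOG"
```

The crontab entry then points at the script (installed at a path of your choosing, e.g. `/usr/local/sbin/raid_check.sh`) instead of calling `mdadm` directly.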
Setting up email alerts
To receive automatic alerts in case of RAID issues, configure email notifications in the `mdadm.conf` file:
- Edit the configuration file (see Fig. 4):
$ sudo nano /etc/mdadm/mdadm.conf
- Add or modify the following line to specify an email address for alerts:
MAILADDR alerts@yourdomain.com
- Save and update the RAID configuration:
$ sudo mdadm --detail --scan >> /etc/mdadm/mdadm.conf
$ sudo update-initramfs -u
Note: appending with `>>` can leave duplicate `ARRAY` lines if the array is already listed, so review the file afterward. The `update-initramfs -u` step applies to Debian/Ubuntu; on RHEL-based systems, rebuild the initramfs with `dracut -f` instead. You can verify alert delivery with `sudo mdadm --monitor --scan --oneshot --test`, which sends a test message for each array.
By implementing these monitoring solutions and automation methods, you can ensure that your RAID arrays remain healthy and perform optimally. For further guidance on replacing failed disks, refer to removing, replacing, and resyncing a disk.
Summary
In this tutorial, we have covered key aspects of monitoring software RAID arrays on Linux. We began by identifying RAID configurations using commands such as `cat /proc/mdstat` and `lsblk`. Then, we examined detailed array information with `mdadm --detail` and verified the RAID configuration file with `cat /etc/mdadm/mdadm.conf`.
Next, we explored monitoring methods, including the installation of essential tools like `mdadm` and `smartmontools`. Using `smartctl`, we demonstrated how to check disk health, temperature, and potential failure indicators. Additionally, we explored ways to automate RAID monitoring using cron jobs, enabling regular checks and logging for proactive maintenance.
By following this guide, you can effectively monitor your RAID arrays, automate regular health checks, and receive timely alerts to prevent data loss and maintain system stability.