SBC Maintenance Routine
Summary
A small, consistent maintenance routine prevents most surprises on SBCs: disk-full incidents, unattended update breakage, and silent storage wear. This guide provides practical maintenance schedules for single-board computers running services 24/7. The goal is not constant tuning—it's predictable checks you can finish in minutes that catch problems before they cause downtime.
Most SBC failures come from preventable causes: full disks from log accumulation, failing SD cards showing early warning signs, and kernel updates that break boot without supervision. A lightweight maintenance routine addresses these systematically.
Who this is for
Anyone running an SBC as a service host (DNS, VPN, monitoring, home automation, web apps), especially systems you don't physically access often. If your board runs continuously and provides services other devices depend on, these maintenance procedures are essential.
What you'll do
- Establish weekly health checks (5-10 minutes) to catch common issues early.
- Perform monthly maintenance (15-30 minutes) including supervised reboots and backup verification.
- Monitor storage health and replace failing media before catastrophic failure.
- Manage system updates and kernel changes safely with rollback plans.
- Set up automated monitoring and alerting for critical failures.
- Document your system configuration for rapid recovery.
Weekly routine (5-10 minutes)
Run these checks once per week, preferably at the same time. Schedule them during low-usage periods if your services have predictable traffic patterns.
Check disk space
df -h
# Look for filesystems above 80% usage
# Check specific high-growth directories:
du -sh /var/log /var/tmp /tmp ~/.cache
# Find largest files in /var/log:
du -ah /var/log | sort -rh | head -n 20
Action thresholds:
- Above 80%: Investigate what is consuming space. Check logs, cache directories, and temporary files.
- Above 90%: Urgent action required. Clean up immediately or expand storage.
- Above 95%: Critical. System may become unstable or enter read-only mode.
Common causes: Unrotated logs (check /var/log/journal/, /var/log/syslog*), package manager cache (/var/cache/apt/), core dumps (/var/crash/), application data growth.
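To see at a glance how much of that growth is reclaimable, the checks above can be wrapped in a small read-only report. This is a sketch: the paths and the 200M journal cap mentioned in the comments are examples, not fixed values, and nothing here deletes anything.

```shell
# Report reclaimable space without deleting anything.
report_reclaimable() {
  journalctl --disk-usage 2>/dev/null || true   # shrink with: sudo journalctl --vacuum-size=200M
  du -sh /var/cache/apt 2>/dev/null || true     # clear with: sudo apt clean
  du -sh /var/crash 2>/dev/null || true         # remove old crash dumps once reviewed
}
report_reclaimable
```

Run it during the weekly check; if the journal or apt cache dominates, the commands in the comments reclaim the space safely.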
Check for failed services
systemctl --failed
# If any services are listed, investigate why
# Check service status:
systemctl status SERVICE_NAME
# View recent logs:
journalctl -u SERVICE_NAME -n 50
Failed services often indicate configuration errors, missing dependencies, or resource exhaustion. Address failures promptly—cascading failures can occur if dependent services cannot start.
Review error logs
journalctl -b -p err | tail -n 50
# Focus on recurring errors, not one-time events
# Check kernel messages:
dmesg --level=err,crit,alert,emerg | tail -n 30
# Check authentication failures:
sudo journalctl _SYSTEMD_UNIT=ssh.service | grep "Failed password" | tail -n 20
Pay attention to patterns:
- I/O errors: Failing storage (replace immediately).
- Out-of-memory errors: Insufficient RAM or memory leak (add swap, investigate processes).
- Network errors: Connectivity issues or DNS problems.
- Authentication failures: Brute-force attempts (check fail2ban status).
Monitor memory and swap usage
free -h
# Check if swap is being used heavily
# Identify top memory consumers:
ps aux --sort=-%mem | head -n 10
# Check for OOM (out-of-memory) kills:
journalctl -b | grep -i "out of memory"
Persistent high swap usage (above 50% of swap capacity) indicates insufficient RAM. If OOM killer is active, add swap space or reduce memory-hungry services.
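Adding swap on a Debian-family system is a short sequence of commands. The sketch below uses a small local file so it can be tried safely; on a real system set SWAPFILE=/swapfile and SIZE_MB=1024 (or similar), and run the commented steps as root.

```shell
# Minimal swap-file sketch. On a real system use SWAPFILE=/swapfile,
# SIZE_MB=1024, and run the commented swapon/fstab steps as root.
SWAPFILE="${SWAPFILE:-./swapfile.demo}"
SIZE_MB="${SIZE_MB:-16}"
dd if=/dev/zero of="$SWAPFILE" bs=1M count="$SIZE_MB" status=none
chmod 600 "$SWAPFILE"      # swap files must not be world-readable
mkswap "$SWAPFILE"         # write the swap signature
# sudo swapon "$SWAPFILE"                                      # enable now
# echo "$SWAPFILE none swap sw 0 0" | sudo tee -a /etc/fstab   # persist across reboots
```

On SD-card-only systems, keep swap modest and monitor write load: heavy swapping accelerates card wear.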
Check temperature
cat /sys/class/thermal/thermal_zone0/temp
# Output is in millidegrees Celsius (divide by 1000)
# Typical idle: 40000-50000 (40-50°C)
# Typical load: 60000-75000 (60-75°C)
# Throttling threshold: 80000-85000 (80-85°C)
If temperature consistently exceeds 75°C under normal load, improve cooling (add heatsink, improve airflow, reduce ambient temperature).
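To avoid doing the millidegree division by hand, the reading can be formatted with awk. thermal_zone0 is the usual zone, but some boards expose several; the fallback branch covers containers or VMs where no zone exists.

```shell
# Print the CPU temperature in degrees Celsius (sysfs reports millidegrees).
zone=/sys/class/thermal/thermal_zone0/temp
if [ -r "$zone" ]; then
  awk '{ printf "CPU temp: %.1f C\n", $1 / 1000 }' "$zone"
else
  echo "no thermal zone exposed"   # e.g. inside a container or VM
fi
```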
Monthly routine (15-30 minutes)
Supervised reboot after updates
After applying kernel or bootloader updates, reboot once under supervision to verify the system comes back cleanly:
sudo apt update
sudo apt -y full-upgrade
# Note if kernel was upgraded
sudo reboot
Monitor the reboot:
- Watch boot messages via serial console or HDMI if available.
- Verify SSH access returns within 2 minutes.
- Check all critical services started:
systemctl --failed
- Verify network connectivity:
ping -c 3 8.8.8.8
If reboot fails or services don't start, you have an immediate opportunity to troubleshoot rather than discovering the problem weeks later during an unplanned outage.
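The post-reboot checks can be bundled into one script you run as soon as SSH returns. This is a sketch: the unit names in the loop are placeholders, so substitute your own critical services.

```shell
# Post-reboot sanity check. The service names below are examples;
# replace them with the units your infrastructure depends on.
check_after_reboot() {
  for s in ssh systemd-timesyncd; do
    if systemctl is-active --quiet "$s" 2>/dev/null; then
      echo "OK   $s"
    else
      echo "FAIL $s"    # investigate with: journalctl -u $s -n 50
    fi
  done
  if ping -c 1 -W 2 8.8.8.8 >/dev/null 2>&1; then
    echo "OK   network"
  else
    echo "FAIL network"
  fi
}
check_after_reboot
```

Any FAIL line is your cue to troubleshoot now, while you are already watching the system.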
Verify backups are current and restorable
ls -lh /path/to/backups/
# Check latest backup timestamp
# Test restore process (on a test system or VM):
tar -tzf backup.tar.gz | head -n 20
# Verify critical files are present
Backups you haven't tested are not backups—they are wishes. Periodically restore a configuration file or directory to verify the backup is valid and you remember the restore procedure.
What to back up:
- /etc/ — System configuration
- /home/ — User data and SSH keys
- /var/www/ or /srv/ — Web application data
- /root/.ssh/ — Root SSH keys (if used)
- Application-specific data directories
- Database dumps (if running databases)
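A minimal backup can be a dated tarball of the paths above, followed by an immediate listing as a sanity check. In this sketch DEST and PATHS are placeholders kept deliberately small; on a real system include /etc, /home, and your application data, and point DEST at your NAS or external storage.

```shell
# Minimal backup sketch: archive a list of paths to a dated tarball,
# then list the contents to confirm the archive is readable.
DEST="${DEST:-/tmp}"
PATHS="${PATHS:-/etc/hostname}"   # unquoted below so multiple paths split
STAMP=$(date +%Y-%m-%d)
tar -czf "$DEST/config-$STAMP.tar.gz" $PATHS 2>/dev/null
tar -tzf "$DEST/config-$STAMP.tar.gz" | head -n 20
```

Schedule it from cron and keep several dated archives so a bad change does not overwrite your only good copy.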
Check storage health
sudo smartctl -a /dev/sda
# Works for SATA/USB drives; SD cards in the native mmc slot do not expose SMART
# Check for I/O errors in dmesg:
dmesg | grep -i "mmcblk\|sd[a-z]" | grep -i error
# Check filesystem errors:
sudo journalctl -b | grep -i "ext4\|xfs\|btrfs" | grep -i error
Early warning signs of storage failure:
- Occasional I/O errors in dmesg (even if system still boots)
- Filesystem remounted read-only unexpectedly
- Increasing number of reallocated sectors (SMART attribute)
- Boot time increasing (filesystem checks taking longer)
- Files disappearing or showing corruption
Replace storage immediately if you see these signs. Failing storage accelerates—it may work 90% of the time until it suddenly stops working entirely, often without warning.
Update system documentation
Maintain a simple text file with essential system information:
hostname: homeserver
IP: 192.168.1.50 (DHCP reservation on router)
Board: Banana Pi
OS: Armbian 24.2.1 (Debian Bookworm)
Kernel: 6.6.16-current-sunxi
Storage: SanDisk 32GB microSD (purchased 2025-06)
Services:
- Pi-hole (DNS)
- Nginx (reverse proxy)
- Home Assistant
Backup location: /mnt/nas/backups/homeserver/
Last backup: 2026-01-10
Last full test restore: 2025-12-15
Notes:
- Serial console: 115200 baud, /dev/ttyUSB0
- Root login disabled; user account: jamie
- SSH keys deployed; password auth disabled
Store this documentation off-board (in your password manager, network share, or printed). Update it when you make significant changes.
Optional: Automated monitoring
For critical systems, set up automated alerts for common failure conditions:
Email alerts for disk space
Create a script /usr/local/bin/check-disk-space.sh:
#!/bin/bash
# Alert when any real filesystem exceeds THRESHOLD percent usage.
THRESHOLD=85
df -P | grep -vE '^Filesystem|tmpfs|cdrom' | awk '{ print $5 " " $1 }' | while read -r usage partition; do
  usage=${usage%\%}   # strip the trailing %
  if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "WARNING: $partition is ${usage}% full" | mail -s "Disk Space Alert on $(hostname)" your-email@example.com
  fi
done
Schedule with cron: 0 */6 * * * /usr/local/bin/check-disk-space.sh (every 6 hours).
Service monitoring with systemd
Configure systemd to notify you when a service fails. systemd has no global mail-on-failure setting; instead, install postfix or msmtp for mail delivery, then attach an OnFailure= handler unit to each critical service, for example via a drop-in file under /etc/systemd/system/SERVICE_NAME.service.d/.
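One common pattern is a templated handler unit that any service can reference. This is a sketch: the notify-failure.sh script path is an assumption, and you would write your own mail or push-notification command there.

```ini
# /etc/systemd/system/notify-failure@.service (sketch; script path is assumed)
[Unit]
Description=Failure notification for %i

[Service]
Type=oneshot
ExecStart=/usr/local/bin/notify-failure.sh %i

# Attach it to a critical service with a drop-in, e.g.
# /etc/systemd/system/nginx.service.d/onfailure.conf:
#
# [Unit]
# OnFailure=notify-failure@%n.service
```

After creating the files, run sudo systemctl daemon-reload so systemd picks up the drop-in.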
Troubleshooting common maintenance issues
Disk full: emergency recovery
If the system is at 100% disk usage and won't boot properly:
- Boot into single-user mode or mount the SD card on another Linux system.
- Delete oldest rotated logs:
rm /var/log/*.gz /var/log/*.[1-9]
- Clear package cache:
rm /var/cache/apt/archives/*.deb
- Truncate large log files:
truncate -s 0 /var/log/syslog
- Remove core dumps:
rm /var/crash/*
System won't boot after update
If you have a serial console or HDMI output:
- Watch boot messages for kernel panic or filesystem errors.
- Try the previous kernel from GRUB/U-Boot menu if available.
- Boot from a rescue SD card, mount the root filesystem, check journalctl logs.
- Restore /boot/ from backup if the bootloader or kernel was corrupted.
Services won't restart after maintenance
- Check dependencies: systemctl list-dependencies SERVICE_NAME
- Verify configuration files: most services have a test mode (nginx -t, apache2ctl -t).
- Check permissions: systemd services often run as specific users; ensure they can read config files and write to data directories.
- Review full logs: journalctl -u SERVICE_NAME --no-pager
Frequently asked questions
How much swap should I configure?
For 1GB RAM SBCs: 512MB-1GB swap on fast storage (SATA SSD if available). More swap allows better handling of memory spikes but slows down if heavily used. Monitor swap usage weekly—if consistently above 50%, add more RAM or reduce services.
Should I enable automatic updates?
For home/lab systems: Use unattended-upgrades for security updates only, not all packages. For production: Manual updates with change windows allow testing before applying. Never auto-update kernel or bootloader without supervision.
How long do SD cards last in 24/7 SBC use?
Highly variable: 6 months to 3+ years depending on write load, card quality, and luck. High-endurance cards (SanDisk MAX Endurance, Samsung PRO Endurance) last longer. Monitor for early failure signs and replace proactively.
What services should I monitor most closely?
SSH (critical for remote access), DNS (if running Pi-hole/dnsmasq), network services your infrastructure depends on (VPN, reverse proxy, monitoring agents). Use systemctl is-active SERVICE in monitoring scripts.
How do I reduce log growth?
Configure log rotation in /etc/logrotate.conf and /etc/logrotate.d/. For systemd journal, set SystemMaxUse=500M in /etc/systemd/journald.conf. Reduce logging verbosity for chatty applications.
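For the journal specifically, the cap goes in journald.conf. The values below are examples, not recommendations for every system; tune them to your storage size.

```ini
# /etc/systemd/journald.conf — cap journal growth (example values)
[Journal]
SystemMaxUse=500M
MaxRetentionSec=1month
```

Apply the change with sudo systemctl restart systemd-journald.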
Should I use RAID or redundant storage on an SBC?
RAID is not practical on most SBCs (limited I/O bandwidth, CPU overhead). Instead: frequent backups to separate storage (NAS, cloud, external USB), keep spare SD cards with recent images, use network boot if supported (reduces SD card writes).
What temperature is too hot?
Above 80°C: Add cooling. Sustained operation above 85°C reduces component lifespan. Above 90°C: Thermal throttling or shutdown occurs. Measure temperature under your typical load and ensure it stays below 75°C.
How do I test my backup restore procedure?
Use a second SD card or a VM running the same OS version. Restore /etc/, verify a critical config file (SSH, network settings), then restore application data. Time the process—you need to know how long recovery takes during an outage.
Related guides
- Basic Security Defaults for ARM Linux
- Boot and Storage Notes for SBCs
- Quick Start Guide
- ARM Linux: Reliable Setup Checklist
Author: LeMaker Documentation Team
Last updated: 2026-01-11