Self-healing, zero-touch infrastructure monitoring built for banking-grade production environments.
MagicMonitor is a production-grade, self-healing infrastructure monitoring system deployed across 20+ critical Windows Ec2 instance in a banking environment. It monitors servers 24/7, automatically fixes common issues, and sends real-time alerts β all with zero manual intervention.
| Feature | Detail |
|---|---|
| π Zero-Touch Updates | New deployments reach all 20+ servers in 2β3 hours automatically |
| π¨ Real-Time Alerts | Detection to email notification in under 30 seconds |
| π€ Self-Healing | Auto-restarts failed processes, auto-expands storage, auto-scales databases |
| π Full Visibility | CPU, RAM, Disk, Processes, Patches tracked 24/7/365 |
| π¦ Compliance Ready | RBI audit-ready with 90-day log retention and weekly patch reports |
| ποΈ Scalable Design | 3-tier architecture proven for 20+ servers, built for 100+ |
3-Tier Distributed Design:
graph TD
subgraph tier3["π’ TIER 3: MASTER SERVER (On Premises)"]
M["π₯οΈ MASTER<br/>13.2**.**.***<br/>Alert Hub & Update Source"]
end
subgraph tier2["βοΈ TIER 2: RELAY SERVERS (AWS)"]
R1["π RELAY 1<br/>AWS Account A<br/>Aggregates & Caches"]
R2["π RELAY 2<br/>AWS Account B<br/>Aggregates & Caches"]
end
subgraph tier1["π TIER 1: MONITOR SERVICES (20+ Servers)"]
Mon1["π’ Monitor<br/>Server 1"]
Mon2["π’ Monitor<br/>Server 2"]
Mon3["π’ Monitor<br/>Server 3"]
Mon4["π’ Monitor<br/>Server 4"]
Mon5["π’ Monitor<br/>Server 5+"]
end
%% Alert Flow (UP) - Solid arrows
Mon1 -->|"5050: Alert UP"| R1
Mon2 -->|"5050: Alert UP"| R1
Mon3 -->|"5050: Alert UP"| R1
Mon4 -->|"5050: Alert UP"| R2
Mon5 -->|"5050: Alert UP"| R2
R1 -->|"6060: Forward Alert"| M
R2 -->|"6060: Forward Alert"| M
%% Update Flow (DOWN) - Dashed arrows
M -.->|"9090: Serve Update"| R1
M -.->|"9090: Serve Update"| R2
R1 -.->|"9999: Serve Update"| Mon1
R1 -.->|"9999: Serve Update"| Mon2
R1 -.->|"9999: Serve Update"| Mon3
R2 -.->|"9999: Serve Update"| Mon4
R2 -.->|"9999: Serve Update"| Mon5
style M fill:#9b59b6,color:#fff,stroke:#7d3c98,stroke-width:3px
style R1 fill:#e67e22,color:#fff,stroke:#ca6f1e,stroke-width:2px
style R2 fill:#e67e22,color:#fff,stroke:#ca6f1e,stroke-width:2px
style Mon1 fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:1px
style Mon2 fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:1px
style Mon3 fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:1px
style Mon4 fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:1px
style Mon5 fill:#27ae60,color:#fff,stroke:#1e8449,stroke-width:1px
Flow Legend:
- Solid arrows (β): Alert flow β Data travels UP (Monitor β Relay β Master)
- Dashed arrows (-.->): Update flow β Data travels DOWN (Master β Relay β Monitor)
| Tier | Component | Count | Location | Role |
|---|---|---|---|---|
| Tier 3 | Master Server | 1 | On Premises | Central alert hub, email distribution, update source |
| Tier 2 | Relay Servers | 1 | AWS (one per account) | Alert aggregation, update caching |
| Tier 1 | Monitor Services | 20+ | Production servers | Real-time monitoring, auto-remediation |
| Port | Direction | Purpose |
|---|---|---|
| 5050 | Monitor β Relay | Alerts & heartbeats sent upstream |
| 9999 | Relay β Monitor | Updates downloaded by monitors |
| 6060 | Relay β Master | Alerts forwarded to central hub |
| 9090 | Master β Relay | Updates downloaded by relays |
- β Fault Isolation β AWS account boundaries contain failures
- β Network Efficiency β Relay caching reduces bandwidth by 20x
- β Scalability β Add servers without changing master config
- β Centralized Control β One master manages all 20+ servers ... (20+ monitor services total) TIER 1: MONITOR SERVICES (One per server)
CPU, RAM, Disk, Process monitoring Auto-restart failed processes Local Excel reporting Heartbeat every 20 seconds
Data Flow:
- Alerts Flow UP: Monitor β Relay (Port 5050) β Master (Port 6060) β Email
- Updates Flow DOWN: Master (Port 9090) β Relay β Relay Cache β Monitor (Port 9999)
- Heartbeat: Monitor β Relay β Master (Every 20 seconds)
| Layer | Technology |
|---|---|
| Backend | C# .NET 6.0, Windows Services |
| Cloud | AWS EC2, RDS, VPC (Multi-Account), ap-south-1 (Mumbai) |
| On-Premises | Master Server at on Premises |
| Automation | Windows Task Scheduler, PowerShell Scripting |
| Alerting | SMTP Integration, HTML Email Templates |
| Reporting | Automated Excel Generation |
| Architecture | Distributed 3-Tier (Master β Relay β Monitor) |
| Metric | Achievement | Impact |
|---|---|---|
| π» Downtime Reduction | 40-60% | Fewer service outages affecting customers |
| π€ Auto-Remediation | 70% | Issues fixed without human intervention |
| πΎ Storage Outages | 100% prevented | Zero incidents due to disk exhaustion |
| β‘ Alert Speed | <30 seconds | Detection to email notification |
| π System Uptime | 99%+ | Across all monitored servers |
| π₯ Manual Effort | 80%+ reduction | Time saved on health checks |
| π₯οΈ Coverage | 20+ servers | Production infrastructure monitored 24/7 |
| π Update Speed | 2-3 Minute | Full fleet deployment time |
| Metric | Check Frequency | Threshold | Action |
|---|---|---|---|
| CPU | Every 40 seconds | >80% | Alert + Excel log |
| RAM | Every 50 seconds | >85% | Alert + Excel log |
| Disk Space | Interval (Hourly) | Critical: <5 GB Warning: <10 GB |
Alert + possible auto-resize |
| Process Health | Every 1 minute | Process stopped | Auto-restart + alert |
| Heartbeat | Every 20 seconds | 2 missed (40 sec) | Server down alert |
| Patches | Weekly (Saturdays 11:30 PM IST) | N/A | Compliance report |
Process Auto-Restart:
- Monitors configured "important processes" every 60 seconds
- Auto-restarts within 1-2 minutes if process fails
- Respects
DoNotStart.flagfor manual override during maintenance - Sends alerts whether restart succeeds or fails
RDS Storage Autoscaling:
- Runs daily at 6:30 AM IST (off-peak hours)
- Auto-expands storage by 10 GiB when approaching capacity
- Respects configured maximum storage limits
- Zero downtime during expansion
EC2 Disk Auto-Resize:
- Triggers on critical disk space condition (<2 GB)
- Expands EBS volumes automatically via AWS API
- Filesystem extended without server reboot
Manual Override:
- Create
DoNotStart_ProcessName.flagto disable auto-restart - Used during maintenance, upgrades, or troubleshooting
- Reminder alerts sent every 24 hours while flag exists
Stage 1: Master Preparation (T+0)
Administrator β Copies new build to C:\Deploy\Updater\Monitor Master Service β Auto-creates update.zip + version.json Master β Serves on port 9090
Stage 2: Relay Caching (T+0 to T+20 min)
Relay β Checks Master port 9090 every 60 minutes Relay β Downloads update.zip if newer version available Relay β Caches in cached_updates\monitor Relay β Serves to Monitors on port 9999
Stage 3: Monitor Installation (T+0 to T+30 min)
Monitor β Checks Relay port 9999 every 60 minutes Monitor β Downloads update.zip to update_staging Monitor β Triggers scheduled task "ApplyMonitorUpdate" Scheduled Task β Stops service β Copies files β Restarts service
Configuration Protection:
appsettings.jsonβ Never overwritten (server-specific settings)apply_update.cmdβ Never overwritten (custom update logic)Logs\folder β Never touched during updates- Protected via
exclude.txtused by xcopy command
UAC Bypass:
- Scheduled tasks run as
NT AUTHORITY\SYSTEM - "Run with highest privileges" enabled
- No user interaction required
- Works even when no user logged in
Deployment Timeline: 2-3 hours total for all 20+ servers (natural staggering via 60-min intervals)
Automated Excel Reports:
- CPU Logs: Minutely with timestamp, CPU%, top 10 processes
- RAM Logs: Minutely with timestamp, RAM%, top 10 processes
- Patch Reports: Weekly with patch name, install date, KB number
- Storage: Local at
C:\Deploy\Monitor\Reports\ - Retention: 90 days recommended
Email Reports:
- Daily Summary: 6:00 PM IST β All server health, day's alerts, critical issues
- Monthly Summary: 1st of month β Trends, uptime stats, capacity planning data
- Ad-hoc Alerts: <30 seconds from detection to email
Compliance Features:
- 90-day log retention at all tiers (Monitor, Relay, Master)
- Weekly patch compliance tracking (Windows OS + RDS)
- Complete audit trail of all alerts and automated actions
Challenge: Windows services can't restart themselves or modify their own files while running.
Solution: Windows scheduled tasks with NT AUTHORITY\SYSTEM privileges and "Run with highest privileges" flag.
Challenge: Each of 20+ servers has unique config (relay IP, thresholds, process lists).
Solution: Exclude appsettings.json from update.zip and use exclude.txt during file copy.
Challenge: 20+ monitors downloading 50MB updates from on-premises master = 1GB+ bandwidth.
Solution: Relay-level caching β relay downloads once, serves 20+ monitors locally.
Challenge: Monitors in different AWS accounts need to reach same master.
Solution: One relay per AWS account that will connect to master server via internet.
Challenge: Can't take all 20+ servers offline simultaneously.
Solution: Staggered deployment via 60-minute check intervals β natural rollout over 2-3 hours.
Challenge: Continuous high CPU could spam hundreds of duplicate alerts.
Solution: Smart suppression β alert once when threshold exceeded, alert once when normalized.
| Port | Direction | Data Flow | Purpose |
|---|---|---|---|
| 5050 | Monitor β Relay | Alerts travel UP | Monitors send alerts & heartbeats to relay |
| 9999 | Relay β Monitor | Updates travel DOWN | Monitors download updates from relay cache |
| 6060 | Relay β Master | Alerts travel UP | Relays forward aggregated alerts to master |
| 9090 | Master β Relay | Updates travel DOWN | Relays download updates from master |
AWS Security Groups:
- Monitors: Outbound 5050, 9999 to Relay Security Group
- Relays: Inbound 5050, 9999 from Monitor SG; Outbound 6060, 9090 to Master IP
- Master: Inbound 6060, 9090 from Relay IPs; Outbound 25/587 for SMTP
Windows Firewall:
- All components: Allow inbound on listening ports
- All components: Allow outbound on client ports
Network Topology:
- VPC Peering between AWS accounts and on-premises (or VPN)
- Private IPs within AWS, public IP for master (firewall-restricted)
- No cross-account communication at monitor level (only via relays)
β
Banking-Grade Production System β Live in banking environment.
β
Truly Zero-Touch β From deployment to updates to remediation
β
Resource-Efficient β Only ~150 MB RAM and <5% CPU per server
β
Hybrid Cloud β Spans AWS ap-south-1 (Mumbai) and on-premises seamlessly
β
Enterprise Security β Least-privilege access, firewall-controlled ports, full audit logging
β
Self-Healing β 70% of issues resolved without human intervention
β
Proven Scalability β Currently 20+ servers, designed for 100+
Backend Development:
C# .NET 6 Windows Services Multithreading Async/Await Logging
Cloud & Infrastructure:
AWS EC2 AWS RDS AWS VPC Security Groups Hybrid Cloud
DevOps & Automation:
PowerShell Windows Task Scheduler Service Management Automation
System Design:
Distributed Systems 3-Tier Architecture Event-Driven Design Fault Tolerance Scalability
Monitoring & Observability:
Real-Time Monitoring Alerting Reporting Metrics Collection Log Aggregation
Compliance & Security:
RBI Compliance Audit Trails Access Control Network Security Data Retention
- Windows Server 2016/2019/2022
- .NET 6.0 Runtime
- Administrator access
- Network ports 5050, 6060, 9090, 9999 open
1. Master Server (AWS Premises)
# Install service
sc.exe create "MasterMonitorService" binPath= "C:\Deploy\Master\MasterService.exe" start= auto
sc.exe start "MasterMonitorService"
# Configure SMTP in appsettings.json
# Open firewall ports 6060, 90902. Relay Server (Per AWS Account)
# Install service
sc.exe create "RelayMonitorService" binPath= "C:\Deploy\Relay\RelayService.exe" start= auto
sc.exe start "RelayMonitorService"
# Create scheduled task for self-update
$Action = New-ScheduledTaskAction -Execute "C:\Deploy\Relay\apply_update.cmd"
$Principal = New-ScheduledTaskPrincipal -UserId "NT AUTHORITY\SYSTEM" -RunLevel Highest
Register-ScheduledTask -TaskName "ApplyRelayUpdate" -Action $Action -Principal $Principal
# Configure Master IP in appsettings.json
# Open firewall ports 5050, 99993. Monitor Service (Every Server)
# Install service
sc.exe create "MonitorService" binPath= "C:\Deploy\Monitor\MonitorService.exe" start= auto
sc.exe start "MonitorService"
# Create scheduled task for auto-update
$Action = New-ScheduledTaskAction -Execute "C:\Deploy\Monitor\apply_update.cmd"
$Principal = New-ScheduledTaskPrincipal -UserId "NT AUTHORITY\SYSTEM" -RunLevel Highest
Register-ScheduledTask -TaskName "ApplyMonitorUpdate" -Action $Action -Principal $Principal
# Configure Relay IP, thresholds, important processes in appsettings.jsonMonitor appsettings.json:
{
"ServerName": "PROD-APP-01",
"RelayServerIP": "10.*.1**",
"RelayAlertPort": 5050,
"RelayUpdatePort": 9999,
"Thresholds": {
"CpuPercent": 80,
"RamPercent": 85,
"DiskCriticalGB": 2,
"DiskWarningGB": 10
},
"ImportantProcesses": [
"MyBankingApp",
"PaymentService",
"APIGateway"
],
"Schedules": {
"DiskCheckTimes": [ "Interval" ],
"PatchCheckDay": "Saturday",
"PatchCheckTime": "23:30",
"DailyReportTime": "10:00",
"RdsAutoscaleTime": "06:30"
}
}Service won't start:
# Check .NET Runtime installed
dotnet --list-runtimes
# Check Windows Event Log
Get-EventLog -LogName Application -Source "MonitorService" -Newest 10
# Validate JSON config
Get-Content "C:\Deploy\Monitor\appsettings.json" | ConvertFrom-JsonNo alerts received:
# Test network connectivity
Test-NetConnection -ComputerName RelayIP -Port 5050
Test-NetConnection -ComputerName MasterIP -Port 6060
# Check service logs
Get-Content "C:\Deploy\Monitor\Logs\log-*.txt" -Tail 50Updates not applying:
# Verify scheduled task exists
Get-ScheduledTask -TaskName "ApplyMonitorUpdate"
# Test task manually
Start-ScheduledTask -TaskName "ApplyMonitorUpdate"
# Check staging folder
Get-ChildItem "C:\Deploy\Monitor\update_staging\"Bhushan Koli β Cloud Engineer
π Pune, India
Project Maintainer: Bhushan Koli
Role: Cloud Engineer
Location: Pune, India
Built for reliability. Designed for scale. Proven in production. π
- β 3-Tier Architecture with fault isolation and efficient caching
- β Sub-30 second alert latency from detection to email
- β 70% auto-remediation rate β most issues fixed without humans
- β Zero-touch updates across 20+ servers in 2-3 hours
- β Banking-grade compliance with RBI audit-ready documentation
- β 99%+ uptime with <5% CPU and ~150MB RAM overhead per server