Two independent brokers are not a high-availability solution. It is two single points of failure deployed in parallel. True ActiveMQ high availability requires a coordinated mechanism: one broker inheriting the other’s persistent state, clients automatically reconnecting without data loss, and the entire transition happening without manual intervention.
This guide covers every supported HA topology for Apache ActiveMQ and Artemis, with production-grade configurations, honest failure-mode analysis, and the operational knowledge that determines whether your HA architecture actually works when it matters.
Our post on Apache ActiveMQ vs Apache Artemis covered the architectural differences between the two brokers. Those differences matter here: your HA options are not identical across Apache ActiveMQ and Artemis, and choosing wrong costs you either unnecessary complexity or inadequate protection.
The Three HA Models: A Decision Matrix
Before walking through configurations, match your infrastructure to the right model. The wrong choice creates operational problems that no amount of subsequent tuning will fix.
| Criteria | Shared File System | JDBC Master/Slave | Artemis Replication |
| --- | --- | --- | --- |
| Broker | Apache ActiveMQ | Apache ActiveMQ | Apache Artemis |
| Shared storage required | Yes (SAN / NFSv4) | Yes (RDBMS) | No |
| Failover speed | Fast (lock acquisition) | Moderate (lock expiry delay) | Fast (network sync check) |
| Split-brain risk | Low (file lock) | Low (DB row lock) | Moderate (network partition) |
| Backup ready immediately? | Yes | Yes | No (warmup sync required) |
| Works without SAN/NAS | No | No | Yes |
| Operational complexity | Low | Moderate | High |
| Cloud / Kubernetes fit | Difficult | Moderate | Native |
Model 1: Shared File System Master/Slave
How It Works
On startup, every broker instance attempts to acquire an exclusive java.nio.channels.FileLock on the KahaDB data directory. The first broker to succeed becomes the master: it starts its transport connectors and begins serving clients.
All other brokers become slaves: they keep the lock acquisition loop open and wait. The moment the master releases or loses its lock (on shutdown or crash), one slave immediately acquires it, promotes itself to master, and starts its transport connectors.
Clients using the Failover Transport detect the master’s disappearance and automatically reconnect to the new master. The coordination mechanism is beautifully simple: the broker holding the file lock is the master, always. No external quorum service, no heartbeat protocol, no ZooKeeper dependency.
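The election rule described above can be sketched with plain JDK NIO. This is a simplified stand-in, not broker code: ActiveMQ's actual locker adds retry intervals and diagnostics, but the core mechanism is the same exclusive FileLock. (Within a single JVM the JDK reports a held lock with OverlappingFileLockException rather than blocking, which the sketch uses to simulate two brokers.)

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch of master election: whoever holds the exclusive
// FileLock on the shared data directory's lock file is the master.
public class MasterElection {

    // Try once to acquire the exclusive lock; null means another broker holds it.
    static FileLock tryAcquire(FileChannel channel) throws IOException {
        try {
            return channel.tryLock(); // exclusive lock on the whole file
        } catch (OverlappingFileLockException e) {
            return null; // lock already held (same-JVM simulation of a second broker)
        }
    }

    public static void main(String[] args) throws Exception {
        Path lockFile = Files.createTempFile("kahadb", ".lock");
        FileChannel master = FileChannel.open(lockFile, StandardOpenOption.WRITE);
        FileChannel slave  = FileChannel.open(lockFile, StandardOpenOption.WRITE);

        FileLock masterLock = tryAcquire(master); // first broker wins
        FileLock slaveLock  = tryAcquire(slave);  // second broker must wait

        System.out.println("master holds lock: " + (masterLock != null));
        System.out.println("slave holds lock:  " + (slaveLock != null));

        masterLock.release();                  // master shuts down or crashes
        FileLock promoted = tryAcquire(slave); // slave promotes immediately
        System.out.println("slave promoted:    " + (promoted != null));

        master.close();
        slave.close();
        Files.deleteIfExists(lockFile);
    }
}
```

Running this prints true / false / true: only one channel holds the lock at a time, and the waiting channel acquires it the instant the holder releases.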
Production Configuration
Both the master and slave nodes use identical configurations. Role is determined entirely by lock acquisition order at startup, as there is no <master/> or <slave/> tag in the XML.
```xml
<!-- activemq.xml : identical on ALL nodes in the HA group -->
<broker xmlns="http://activemq.apache.org/schema/core"
        brokerName="ha-broker-prod"
        useJmx="true"
        schedulerSupport="true"
        dataDirectory="/mnt/shared-san/activemq">

  <persistenceAdapter>
    <kahaDB directory="/mnt/shared-san/activemq/kahadb"
            journalMaxFileLength="64mb"
            enableJournalDiskSyncs="true"
            concurrentStoreAndDispatchQueues="true"/>
  </persistenceAdapter>

  <!-- CRITICAL: scheduler storage must live on the shared path. -->
  <!-- Omitting this causes scheduled messages to vanish after failover. -->
  <!-- The broker element's dataDirectory attribute sets the base path -->
  <!-- for scheduler storage when schedulerSupport="true". -->

  <systemUsage>
    <systemUsage>
      <memoryUsage><memoryUsage percentOfJvmHeap="20"/></memoryUsage>
      <storeUsage><storeUsage limit="500gb"/></storeUsage>
      <tempUsage><tempUsage limit="50gb"/></tempUsage>
    </systemUsage>
  </systemUsage>

  <transportConnectors>
    <transportConnector name="nio" uri="nio://0.0.0.0:61616?maximumConnections=2000"/>
  </transportConnectors>
</broker>
```
The Scheduler Directory Footgun
Apache ActiveMQ stores scheduled message state in a separate structure from the KahaDB journal. When schedulerSupport=”true” is set, the broker defaults to storing this data in a path relative to the broker’s dataDirectory attribute — but if that attribute is not explicitly set to your shared mount, scheduler data lands on the local disk of whichever broker is currently the master.
On failover, the new master starts with no knowledge of the previous master’s scheduled messages. They vanish silently — no error, no warning, no DLQ entry.
The fix: always explicitly set dataDirectory on the <broker> element to your shared path when using schedulerSupport=”true”. The configuration above does this correctly. This is one of the most consistent gaps we see in enterprise Apache ActiveMQ HA deployments.
Filesystem Compatibility: The NFSv3 Risk
The file lock that coordinates HA is a POSIX java.nio.channels.FileLock. Its behavior depends on the filesystem implementing it correctly:
- NFSv4: Supported. If the master terminates abnormally, the lock is released after a built-in 30-second timeout, allowing the slave to promote.
- NFSv3: Production risk. If the master crashes, the NFSv3 server does not time out the lock. The slave cannot acquire it and cannot promote. The only recovery is rebooting all ActiveMQ instances. Never use NFSv3 for production HA.
- SAN (ext4/xfs): Generally reliable. Verify your SAN driver exports proper POSIX lock semantics before going to production — test with a deliberate kill -9 of the master process, not a clean shutdown.
- OCFS2: Not supported. OCFS2 supports fcntl locking only, not lockf or flock. Java’s FileLock is incompatible. Both brokers will simultaneously believe they are the master. GFS2 is a supported alternative for Linux cluster filesystems.
Model 2: JDBC Master/Slave
How It Works
The coordination mechanism shifts from a file lock to a database row lock. On startup, each broker opens a long-running JDBC transaction against the ACTIVEMQ_LOCK table. The broker that successfully holds that transaction becomes the master. All others are slaves. If the master loses its database connection, the transaction is rolled back, the lock is released, and a slave acquires it and promotes.
This model is the right choice when shared block storage is unavailable or impractical, but a highly available relational database is already part of the infrastructure.
Production Configuration
```xml
<!-- activemq.xml : identical on ALL nodes -->
<broker xmlns="http://activemq.apache.org/schema/core"
        brokerName="ha-broker-prod"
        useJmx="true">

  <!-- CRITICAL: Use the direct jdbcPersistenceAdapter, NOT journaledJDBC. -->
  <!-- The journaled variant's local journal is NOT replicated: -->
  <!-- it breaks the HA guarantee silently. -->
  <persistenceAdapter>
    <jdbcPersistenceAdapter dataSource="#pg-ds" createTablesOnStartup="true"/>
  </persistenceAdapter>

  <transportConnectors>
    <transportConnector name="nio" uri="nio://0.0.0.0:61616?maximumConnections=2000"/>
  </transportConnectors>
</broker>

<!-- Connection pool : health checks prevent stale lock hangs -->
<bean id="pg-ds" class="org.apache.commons.dbcp2.BasicDataSource" destroy-method="close">
  <property name="driverClassName" value="org.postgresql.Driver"/>
  <property name="url"
            value="jdbc:postgresql://db-primary:5432/activemq?connectTimeout=5&amp;socketTimeout=30"/>
  <property name="username" value="activemq"/>
  <property name="password" value="changeme"/>
  <property name="maxTotal" value="10"/>
  <!-- Validate the connection on every borrow : prevents silent stale locks -->
  <property name="testOnBorrow" value="true"/>
  <property name="validationQuery" value="SELECT 1"/>
  <property name="timeBetweenEvictionRunsMillis" value="30000"/>
  <property name="minEvictableIdleTimeMillis" value="60000"/>
</bean>
```
The JDBC Lock Timeout Problem
The most operationally painful aspect of JDBC HA: when the master crashes or loses its database connection, the long-running JDBC transaction is held open by the TCP half-open socket until the OS TCP timeout fires. On many Linux configurations, this can be 30–60 seconds. During this window, no slave can promote and all producers are blocked.
The mitigation: configure socketTimeout=30 in the JDBC URL (as shown above) and use a connection pool with testOnBorrow=true. This forces the pool to validate the connection before using it, which triggers an immediate failure rather than a silent hang. Also configure TCP keepalive at the OS level on the database server to ensure half-open connections are detected and closed faster than the OS default timeout.
One critical configuration detail: you must use jdbcPersistenceAdapter directly, never journaledJDBC. The journaled adapter maintains a local journal that is not replicated to the database. A slave starting after a master crash cannot reconstruct the journal-buffered state, silently breaking the HA guarantee on every failover.
Model 3: Artemis Replication (No Shared Storage)
How It Works
In Artemis replication, the live server and its backup do not share any data directory. All data synchronization happens over the network: every persistent write on the live server is simultaneously replicated to the backup. When the live server fails, the backup activates and takes over client connections.
This is the most sophisticated HA model and the most appropriate for cloud-native environments, Kubernetes deployments, and architectures where shared block storage is unavailable or impractical.
Production Configuration
Live server (broker.xml):

```xml
<configuration xmlns="urn:activemq"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <core xmlns="urn:activemq:core">
    <name>live-broker-01</name>

    <ha-policy>
      <replication>
        <master>
          <!-- Prevents the old live from re-activating if the backup already took over -->
          <check-for-live-server>true</check-for-live-server>
        </master>
      </replication>
    </ha-policy>

    <cluster-connections>
      <cluster-connection name="ha-cluster">
        <connector-ref>live-connector</connector-ref>
        <retry-interval>500</retry-interval>
        <use-duplicate-detection>true</use-duplicate-detection>
        <message-load-balancing>ON_DEMAND</message-load-balancing>
        <max-hops>1</max-hops>
        <static-connectors>
          <connector-ref>backup-connector</connector-ref>
        </static-connectors>
      </cluster-connection>
    </cluster-connections>

    <connectors>
      <connector name="live-connector">tcp://live-host:61616</connector>
      <connector name="backup-connector">tcp://backup-host:61616</connector>
    </connectors>
  </core>
</configuration>
```
Backup server (broker.xml):

```xml
<ha-policy>
  <replication>
    <slave>
      <!-- How many times the backup can restart after taking over -->
      <!-- before requiring manual intervention -->
      <max-saved-replicated-journals-size>2</max-saved-replicated-journals-size>
      <!-- Allow the old live to re-take the live role after restart -->
      <allow-failback>true</allow-failback>
      <failback-delay>5000</failback-delay>
    </slave>
  </replication>
</ha-policy>
```
The Backup Warmup Window
Unlike shared-storage backup (which is immediately ready), a replicating backup must synchronize all existing journal data from the live server before it can serve as a failover target. For a broker with a large persistent backlog, this synchronization can take minutes. During this window, if the live server fails, there is no HA protection.
Operationally: before declaring an Artemis HA pair production-ready, verify that backup-sync-complete is true via JMX or the Artemis management console. MeshIQ Console surfaces this sync status in the broker topology view — gating your deployment pipeline on backup sync completion prevents a dangerous window where HA is assumed but not yet real.
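That gating check can be scripted against JMX with only the JDK. The MBean, ObjectName, and attribute below are illustrative stand-ins so the sketch runs without a broker; against a real Artemis instance you would connect to the broker's MBean server and read the backup-sync attribute for your broker name and version.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Sketch of gating a deployment pipeline on the backup sync flag via JMX.
public class SyncGate {

    // Stand-in MBean so the sketch is runnable without a broker.
    public interface SyncStatusMBean { boolean getBackupSyncComplete(); }
    public static class SyncStatus implements SyncStatusMBean {
        private final boolean synced;
        public SyncStatus(boolean synced) { this.synced = synced; }
        public boolean getBackupSyncComplete() { return synced; }
    }

    // Generic check: read the boolean attribute; refuse to proceed until true.
    static boolean isBackupSynced(MBeanServer server, ObjectName name) throws Exception {
        return (Boolean) server.getAttribute(name, "BackupSyncComplete");
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Hypothetical ObjectName; a real Artemis broker exposes its own domain/keys.
        ObjectName name = new ObjectName("demo.ha:type=SyncStatus");
        server.registerMBean(new SyncStatus(true), name);

        System.out.println("backup sync complete: " + isBackupSynced(server, name));
    }
}
```

In a pipeline, the check would poll until the attribute flips to true (with a timeout), and fail the deployment otherwise.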
Split-Brain: The Replication Risk
Replication introduces a split-brain risk that Classic's lock models avoid. The backup promotes itself when it loses its connection to the live server. But connection loss can also result from a transient network partition, in which case both the live and the backup may believe they are active.
Artemis addresses this with a quorum check: when a backup loses its connection to the live server, it polls the other cluster members. If it can connect to more than half the cluster, it promotes; if not, it waits and retries, preventing false promotion.
This is why deploying a single live/backup pair is insufficient for production replication HA: with only two nodes, there is no quorum to consult. Always deploy three or more live/backup pairs for Artemis replication.
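The quorum rule reduces to a strict-majority test. A minimal model of it (the function name is ours, not an Artemis API) shows why a single pair cannot protect itself:

```java
// Simplified model of the quorum decision a backup makes before promoting.
public class QuorumVote {

    // Promote only when a strict majority of cluster members is reachable.
    static boolean mayPromote(int reachableMembers, int clusterSize) {
        return reachableMembers > clusterSize / 2;
    }

    public static void main(String[] args) {
        // Three live/backup pairs: the backup reaches 2 of 3 lives -> promote.
        System.out.println(mayPromote(2, 3)); // true
        // Single pair: the lone backup has nobody else to poll -> never a majority.
        System.out.println(mayPromote(0, 1)); // false
    }
}
```

With only one pair, the backup's view of "I lost the live" is indistinguishable from "the network partitioned," so the majority test can never pass safely.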
The Failover Transport: Your Client-Side HA Layer
Every HA model above depends on clients using the Failover Transport. Without it, a broker restart means applications throw connection exceptions until manually reconnected. The Failover Transport layers automatic reconnection logic on top of any underlying transport.
Production Failover Transport URI:

```
failover:(tcp://broker1:61616,tcp://broker2:61616,tcp://broker3:61616)
  ?randomize=false
  &initialReconnectDelay=100
  &maxReconnectDelay=30000
  &maxReconnectAttempts=-1
  &timeout=3000
  &trackMessages=true
```
Critical Parameter Reference
| Parameter | Recommended | Rationale |
| --- | --- | --- |
| randomize | false | Connect to the first URI (the primary) before falling back; not random load balancing |
| initialReconnectDelay | 100ms | Short initial wait; avoids thundering herd after broker restart |
| maxReconnectDelay | 30000ms | Caps exponential backoff; prevents indefinite stalls |
| maxReconnectAttempts | -1 | Infinite retries; let the HA layer determine recovery |
| timeout | 3000ms | Fail fast on a blocked send() rather than waiting indefinitely |
| trackMessages | true | Replay in-flight messages after reconnection |
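If you template these URIs across many services, a small helper keeps hosts and parameters consistent. buildUri is a hypothetical utility of ours, not an ActiveMQ API:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Assemble a Failover Transport URI from a broker list and parameter map.
public class FailoverUri {

    static String buildUri(List<String> brokers, Map<String, String> params) {
        String hosts = String.join(",", brokers);
        String query = params.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
        return "failover:(" + hosts + ")?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves parameter order for readable, diff-friendly URIs.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("randomize", "false");
        params.put("maxReconnectAttempts", "-1");
        params.put("timeout", "3000");
        System.out.println(buildUri(
                List.of("tcp://broker1:61616", "tcp://broker2:61616"), params));
        // -> failover:(tcp://broker1:61616,tcp://broker2:61616)?randomize=false&maxReconnectAttempts=-1&timeout=3000
    }
}
```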
One critical production gotcha: never use nio:// in Failover Transport URLs on the client side. NIO transport is server-side only, used to configure broker acceptors. Client applications using nio:// in the Failover URL combined with priorityBackup=true can cause immediate CPU spikes to 100% — a bug that has affected production deployments. Client URLs must always use tcp://.
Priority Failover for Geographic Redundancy
For deployments spanning multiple datacenters or availability zones, priorityBackup=true ensures clients prefer the local broker and automatically rebalance back to it when it recovers:
```
failover:(tcp://local-broker:61616,tcp://remote-broker:61616)
  ?randomize=false
  &priorityBackup=true
  &priorityURIs=tcp://local-broker:61616
```
Without priorityBackup=true, a client that failed over to the remote broker during a local broker outage remains on the remote broker permanently after the local broker recovers. Every client on the remote broker represents cross-datacenter network latency on every message operation. With priorityBackup=true, the client automatically reconnects to the local broker once it is available, no manual intervention required.
Handling In-Doubt Transactions After Failover
This is the area where HA implementations most commonly silently lose correctness. The Failover Transport replays in-flight transactions after reconnection, but transactions that were in-flight at the exact moment of failover are in doubt: the broker may have received the commit, but the reply was lost, or the commit may never have arrived.
ActiveMQ 5.3.1+ handles in-doubt transactions by rolling them back and throwing a javax.jms.TransactionRolledBackException. Your application must catch this exception and implement an idempotent retry. If your application assumes a commit either fully succeeds or was never attempted, you will silently lose messages or produce duplicates under HA failover conditions.
This is not an edge case. It is the normal failure mode for any transacted producer or consumer during failover.
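The retry pattern can be illustrated without a broker. The exception and broker below are simulated stand-ins (a real client catches javax.jms.TransactionRolledBackException from commit()); the point is that retrying with the same message ID keeps a duplicate-detecting store idempotent:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of idempotent retry for in-doubt transactions after failover.
public class IdempotentRetry {

    static class TransactionRolledBack extends Exception {} // stand-in exception

    // Simulated broker whose first commit is "in doubt": the message was
    // stored, but the client sees a rollback and must retry.
    static class FlakyBroker {
        final Set<String> stored = new HashSet<>(); // Set dedups by message ID
        boolean failNextCommit = true;

        void send(String messageId) throws TransactionRolledBack {
            stored.add(messageId); // duplicate sends are no-ops
            if (failNextCommit) {
                failNextCommit = false;
                throw new TransactionRolledBack(); // commit reply lost in failover
            }
        }
    }

    // On rollback, resend the SAME message ID so dedup can discard duplicates.
    static int sendWithRetry(FlakyBroker broker, String messageId, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                broker.send(messageId);
                return attempt;
            } catch (TransactionRolledBack e) {
                // fall through and retry with the same ID
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        FlakyBroker broker = new FlakyBroker();
        int attempts = sendWithRetry(broker, "order-42", 5);
        System.out.println("delivered after " + attempts + " attempts, stored once: "
                + (broker.stored.size() == 1));
    }
}
```

The same shape applies to consumers: keep a durable record of processed message IDs, and treat a redelivered ID as already done rather than reprocessing it.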
The Five HA Failure Modes You Need to Prepare For
These are the production incidents we see most frequently — all of which are preventable with correct configuration.
Failure Mode 1: The NFSv3 Lock Zombie
Symptom: Master crashes; slave never promotes. Both instances idle. All producers blocked indefinitely.
Cause: NFSv3 does not release file locks on abnormal client termination. The lock is held indefinitely.
Fix: Mount the shared data directory with NFSv4 (nfsvers=4). Verify with mount | grep nfs that the mount is using version 4, not 3.
Failure Mode 2: The JDBC Lock Timeout Window
Symptom: After master crash, slave waits 30–60 seconds before promoting. All producers block during the window.
Cause: TCP half-open connection holds the database transaction lock until OS-level TCP timeout fires.
Fix: Set socketTimeout=30 in the JDBC URL. Enable connection pool health checks with testOnBorrow=true. Configure OS-level TCP keepalive on the database host.
Failure Mode 3: Artemis Replication Split-Brain
Symptom: After a network partition, both live and backup believe they are active. Writes go to two independent journals that cannot be merged.
Cause: Single live/backup pair with no quorum. Backup promotes unilaterally.
Fix: Deploy a minimum of three live/backup pairs. Never run Artemis replication HA with a single pair in production.
Failure Mode 4: The Vanishing Scheduled Messages
Symptom: After failover, all scheduled messages are missing.
Cause: schedulerSupport=”true” without dataDirectory explicitly set to the shared path. Scheduler data stored on the local disk of the master.
Fix: Set dataDirectory on the <broker> element to your shared mount path whenever schedulerSupport=”true”.
Failure Mode 5: Silent HA Degradation (The Most Dangerous)
Symptom: The HA cluster appears healthy — both brokers running, no alerts — but the slave is not actually tracking the master’s state. On failover, the slave promotes but has stale or empty persistent state.
Cause: Artemis backup sync never completed (check backup-sync-complete via JMX). Or: shared storage HA with a slave that restarted but is silently failing to acquire the lock.
Fix: Continuous monitoring of master lock holder identity, slave sync status (Artemis), and Failover Transport reconnection event rate. These three metrics together give you full HA health visibility.
HA Model Selection Framework
Choose Shared File System HA if:
- You are running Apache ActiveMQ
- You have a SAN or NFSv4 shared storage infrastructure already in place
- You prioritize operational simplicity — no quorum, no external services
- Your KahaDB data fits within the shared storage budget
Choose JDBC Master/Slave if:
- You are running Apache ActiveMQ and have no SAN
- You have a highly available relational database (PostgreSQL with streaming replication, Oracle RAC)
- Your team has stronger DBA skills than storage engineering
Choose Artemis Replication if:
- You are running or migrating to Artemis
- You are deploying on Kubernetes or cloud without shared block storage
- You need geographic HA with no shared storage dependency
- You can absorb the added complexity of quorum management and backup warmup monitoring
Do not use any HA model in isolation without:
- The Failover Transport configured on all clients
- randomize=false set to prevent load-balancing across master/slave
- Idempotent retry logic handling TransactionRolledBackException
- Continuous monitoring of HA state — not just broker uptime
Network of Brokers (NoB) is not an HA mechanism; it is a horizontal scaling and routing topology. For true resilience, each broker node in a NoB should have its own master/slave HA pair.
Monitoring HA State: Where Manual JMX Falls Short
Once the HA configuration is deployed, the next operational challenge is knowing whether it is actually working, not just whether the brokers are running.
The most dangerous HA state is silent degradation: the cluster looks healthy, but the slave has lost synchronization with the master and will not promote correctly on failover. This is nearly invisible with basic JMX polling because the relevant state spans both brokers and requires correlation across them.
Key metrics for continuous HA health monitoring:
- Master lock holder identity: which broker currently holds the exclusive lock
- Slave connection state: is the slave actually connected and waiting (not crashed or network-partitioned)
- Artemis backup sync status: backup-sync-complete must be true before HA is considered operational
- Failover Transport reconnection rate: unexpected reconnection spikes indicate broker instability before a full failover occurs
- Time since last clean HA state check: composite health metric correlating all the above
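The composite metric at the end of that list amounts to ANDing the individual signals with a churn threshold. A toy version, with the names and the threshold as our own assumptions:

```java
// Composite HA health: "green" only when the lock holder is known, the
// standby is attached and (for Artemis) fully synced, and reconnect churn
// is below an alert threshold.
public class HaHealth {

    static boolean haHealthy(boolean masterLockHeld,
                             boolean slaveConnected,
                             boolean backupSyncComplete,
                             int reconnectsLastHour) {
        return masterLockHeld && slaveConnected && backupSyncComplete
                && reconnectsLastHour < 10; // alert threshold is a tunable assumption
    }

    public static void main(String[] args) {
        // Both brokers "up", but the backup never finished syncing: degraded, not healthy.
        System.out.println(haHealthy(true, true, false, 0)); // false
        System.out.println(haHealthy(true, true, true, 0));  // true
    }
}
```

The point of the AND is that broker uptime alone never satisfies it: a single stale signal turns the whole cluster status red.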
MeshIQ Console provides a unified HA topology view that surfaces all of these indicators across your entire broker fleet in a single dashboard, without requiring JMX scripting or per-broker manual checks.
The Console’s alerting engine can notify your on-call team the moment HA state transitions from healthy to degraded, not after a production failover exposes the problem.
Your HA Configuration Is Only as Good as Your Monitoring
The most common post-incident finding in ActiveMQ HA failures is not that the configuration was wrong; it is that a configuration correct at deployment silently degraded over time, and nobody noticed until the first real failover exposed it.
Backup sync drift, slave disconnects, scheduler directory misplacement, and JDBC lock timeout behaviors are all invisible to basic uptime monitoring. They only surface when a failover is triggered at the worst possible moment.
MeshIQ Console gives your operations team continuous visibility into HA state across all broker models, so silent degradation is caught before it becomes a production incident.
See how MeshIQ Console monitors your ActiveMQ HA topology → Request a Demo
Frequently Asked Questions
How does ActiveMQ high availability work?
ActiveMQ HA uses a master/slave architecture where one broker serves clients and one or more standby brokers wait. When the master fails, a slave acquires the coordination lock (file system or database row) and promotes itself to master. Clients using the Failover Transport automatically reconnect. Classic supports shared file system and JDBC HA. Artemis additionally supports network replication.
What is the difference between shared file system and JDBC master/slave HA?
Both produce identical active/passive behavior. The shared file system uses a file lock on a shared NFS/SAN mount; JDBC uses a database row lock. The shared file system is simpler but requires reliable shared storage with proper POSIX locking (NFSv4 or SAN). JDBC works without shared storage but depends on database lock expiry; a crashed master's lock may delay slave promotion by 30–60 seconds.
How should I configure the Failover Transport for HA?
Use failover:(tcp://broker1:61616,tcp://broker2:61616)?randomize=false&maxReconnectDelay=30000&maxReconnectAttempts=-1&timeout=3000. Set randomize=false to connect to the primary first. Set maxReconnectAttempts=-1 for infinite retries. Handle TransactionRolledBackException in your application: in-doubt transactions are rolled back, not replayed, during failover.
What is split-brain, and how do I prevent it?
Split-brain is when both master and standby believe they are active simultaneously. Classic's lock models largely prevent this: only one broker can hold the exclusive lock. Artemis replication has split-brain risk during network partitions. Prevention requires three or more live/backup pairs so quorum voting determines which side activates.
Can messages be lost during failover?
With persistent messaging and synchronous sends, messages acknowledged before failover are not lost: they are in the persistent store, available to the new master. In-flight transactions at the moment of failover result in a TransactionRolledBackException that the application must retry. Non-persistent messages are always lost on failover regardless of HA model.