Two independent brokers are not a high-availability solution. It is two single points of failure deployed in parallel. True ActiveMQ high availability requires a coordinated mechanism: one broker inheriting the other’s persistent state, clients automatically reconnecting without data loss, and the entire transition happening without manual intervention.
This guide covers every supported HA topology for Apache ActiveMQ and Artemis, with production-grade configurations, honest failure-mode analysis, and the operational knowledge that determines whether your HA architecture actually works when it matters.
Our post on Apache ActiveMQ vs Apache Artemis covered the architectural differences between the two brokers. Those differences matter here: your HA options are not identical across Apache ActiveMQ and Artemis, and choosing wrong costs you either unnecessary complexity or inadequate protection.
The Three HA Models: A Decision Matrix
Before walking through configurations, match your infrastructure to the right model. The wrong choice creates operational problems that no amount of subsequent tuning will fix.
| Criteria | Shared File System | JDBC Master/Slave | Artemis Replication |
| --- | --- | --- | --- |
| Broker | Apache ActiveMQ | Apache ActiveMQ | Apache Artemis |
| Shared storage required | Yes (SAN / NFSv4) | Yes (RDBMS) | No |
| Failover speed | Fast (lock acquisition) | Moderate (lock expiry delay) | Fast (network sync check) |
| Split-brain risk | Low (file lock) | Low (DB row lock) | Moderate (network partition) |
| Backup ready immediately? | Yes | Yes | No (warmup sync required) |
| Works without SAN/NAS | No | No | Yes |
| Operational complexity | Low | Moderate | High |
| Cloud / Kubernetes fit | Difficult | Moderate | Native |
Model 1: Shared File System Master/Slave
How It Works
On startup, every broker instance attempts to acquire an exclusive java.nio.channels.FileLock on the KahaDB data directory. The first broker to succeed becomes the master: it starts its transport connectors and begins serving clients.
All other brokers become slaves: they keep the lock acquisition loop open and wait. The moment the master releases or loses its lock (on shutdown or crash), one slave immediately acquires it, promotes itself to master, and starts its transport connectors.
Clients using the Failover Transport detect the master’s disappearance and automatically reconnect to the new master. The coordination mechanism is beautifully simple: the broker holding the file lock is the master, always. No external quorum service, no heartbeat protocol, no ZooKeeper dependency.
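The election rule described above can be sketched with plain JDK NIO. This is a simplified stand-in, not broker code: ActiveMQ's actual locker adds retry intervals and diagnostics, but the core mechanism is the same exclusive FileLock. (Within a single JVM the JDK reports a held lock with OverlappingFileLockException rather than blocking, which the sketch uses to simulate two brokers.)

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch of master election: whoever holds the exclusive
// FileLock on the shared data directory's lock file is the master.
public class MasterElection {

    // Try once to acquire the exclusive lock; null means another broker holds it.
    static FileLock tryAcquire(FileChannel channel) throws IOException {
        try {
            return channel.tryLock(); // exclusive lock on the whole file
        } catch (OverlappingFileLockException e) {
            return null; // lock already held (same-JVM simulation of a second broker)
        }
    }

    public static void main(String[] args) throws Exception {
        Path lockFile = Files.createTempFile("kahadb", ".lock");
        FileChannel master = FileChannel.open(lockFile, StandardOpenOption.WRITE);
        FileChannel slave  = FileChannel.open(lockFile, StandardOpenOption.WRITE);

        FileLock masterLock = tryAcquire(master); // first broker wins
        FileLock slaveLock  = tryAcquire(slave);  // second broker must wait

        System.out.println("master holds lock: " + (masterLock != null));
        System.out.println("slave holds lock:  " + (slaveLock != null));

        masterLock.release();                  // master shuts down or crashes
        FileLock promoted = tryAcquire(slave); // slave promotes immediately
        System.out.println("slave promoted:    " + (promoted != null));

        master.close();
        slave.close();
        Files.deleteIfExists(lockFile);
    }
}
```

Running this prints true / false / true: only one channel holds the lock at a time, and the waiting channel acquires it the instant the holder releases.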
Production Configuration
Both the master and slave nodes use identical configurations. Role is determined entirely by lock acquisition order at startup, as there is no <master/> or <slave/> tag in the XML.
```xml
<!-- activemq.xml : identical on ALL nodes in the HA group -->
<broker xmlns="http://activemq.apache.org/schema/core"
        brokerName="ha-broker-prod"
        useJmx="true"
        schedulerSupport="true"
        dataDirectory="/mnt/shared-san/activemq">

  <persistenceAdapter>
    <kahaDB directory="/mnt/shared-san/activemq/kahadb"
            journalMaxFileLength="64mb"
            enableJournalDiskSyncs="true"
            concurrentStoreAndDispatchQueues="true"/>
  </persistenceAdapter>

  <!-- CRITICAL: scheduler storage must live on the shared path. -->
  <!-- Omitting this causes scheduled messages to vanish after failover. -->
  <!-- The broker element's dataDirectory attribute sets the base path -->
  <!-- for scheduler storage when schedulerSupport="true". -->

  <systemUsage>
    <systemUsage>
      <memoryUsage><memoryUsage percentOfJvmHeap="20"/></memoryUsage>
      <storeUsage><storeUsage limit="500gb"/></storeUsage>
      <tempUsage><tempUsage limit="50gb"/></tempUsage>
    </systemUsage>
  </systemUsage>

  <transportConnectors>
    <transportConnector name="nio" uri="nio://0.0.0.0:61616?maximumConnections=2000"/>
  </transportConnectors>
</broker>
```
The Scheduler Directory Footgun
Apache ActiveMQ stores scheduled message state in a separate structure from the KahaDB journal. When schedulerSupport=”true” is set, the broker defaults to storing this data in a path relative to the broker’s dataDirectory attribute — but if that attribute is not explicitly set to your shared mount, scheduler data lands on the local disk of whichever broker is currently the master.
On failover, the new master starts with no knowledge of the previous master’s scheduled messages. They vanish silently — no error, no warning, no DLQ entry.
The fix: always explicitly set dataDirectory on the <broker> element to your shared path when using schedulerSupport=”true”. The configuration above does this correctly. This is one of the most consistent gaps we see in enterprise Apache ActiveMQ HA deployments.
Filesystem Compatibility: The NFSv3 Risk
The file lock that coordinates HA is a POSIX java.nio.channels.FileLock. Its behavior depends on the filesystem implementing it correctly:
- NFSv4: Supported. If the master terminates abnormally, the lock is released after a built-in 30-second timeout, allowing the slave to promote.
- NFSv3: Production risk. If the master crashes, the NFSv3 server does not time out the lock. The slave cannot acquire it and cannot promote. The only recovery is rebooting all ActiveMQ instances. Never use NFSv3 for production HA.
- SAN (ext4/xfs): Generally reliable. Verify your SAN driver exports proper POSIX lock semantics before going to production — test with a deliberate kill -9 of the master process, not a clean shutdown.
- OCFS2: Not supported. OCFS2 supports fcntl locking only, not lockf or flock. Java’s FileLock is incompatible. Both brokers will simultaneously believe they are the master. GFS2 is a supported alternative for Linux cluster filesystems.
Model 2: JDBC Master/Slave
How It Works
The coordination mechanism shifts from a file lock to a database row lock. On startup, each broker opens a long-running JDBC transaction against the ACTIVEMQ_LOCK table. The broker that successfully holds that transaction becomes the master. All others are slaves. If the master loses its database connection, the transaction is rolled back, the lock is released, and a slave acquires it and promotes.
This model is the right choice when shared block storage is unavailable or impractical, but a highly available relational database is already part of the infrastructure.
Production Configuration
```xml
<!-- activemq.xml : identical on ALL nodes -->
<broker xmlns="http://activemq.apache.org/schema/core"
        brokerName="ha-broker-prod"
        useJmx="true">

  <!-- CRITICAL: Use the direct jdbcPersistenceAdapter, NOT journaledJDBC. -->
  <!-- The journaled variant's local journal is NOT replicated: -->
  <!-- it breaks the HA guarantee silently. -->
  <persistenceAdapter>
    <jdbcPersistenceAdapter dataSource="#pg-ds" createTablesOnStartup="true"/>
  </persistenceAdapter>

  <transportConnectors>
    <transportConnector name="nio" uri="nio://0.0.0.0:61616?maximumConnections=2000"/>
  </transportConnectors>
</broker>

<!-- Connection pool : health checks prevent stale lock hangs -->
<bean id="pg-ds" class="org.apache.commons.dbcp2.BasicDataSource" destroy-method="close">
  <property name="driverClassName" value="org.postgresql.Driver"/>
  <property name="url"
            value="jdbc:postgresql://db-primary:5432/activemq?connectTimeout=5&amp;socketTimeout=30"/>
  <property name="username" value="activemq"/>
  <property name="password" value="changeme"/>
  <property name="maxTotal" value="10"/>
  <!-- Validate the connection on every borrow : prevents silent stale locks -->
  <property name="testOnBorrow" value="true"/>
  <property name="validationQuery" value="SELECT 1"/>
  <property name="timeBetweenEvictionRunsMillis" value="30000"/>
  <property name="minEvictableIdleTimeMillis" value="60000"/>
</bean>
```
The JDBC Lock Timeout Problem
The most operationally painful aspect of JDBC HA: when the master crashes or loses its database connection, the long-running JDBC transaction is held open by the TCP half-open socket until the OS TCP timeout fires. On many Linux configurations, this can be 30–60 seconds. During this window, no slave can promote and all producers are blocked.
The mitigation: configure socketTimeout=30 in the JDBC URL (as shown above) and use a connection pool with testOnBorrow=true. This forces the pool to validate the connection before using it, which triggers an immediate failure rather than a silent hang. Also configure TCP keepalive at the OS level on the database server to ensure half-open connections are detected and closed faster than the OS default timeout.
One critical configuration detail: you must use jdbcPersistenceAdapter directly, never journaledJDBC. The journaled adapter maintains a local journal that is not replicated to the database. A slave starting after a master crash cannot reconstruct the journal-buffered state, silently breaking the HA guarantee on every failover.
Model 3: Artemis Replication (No Shared Storage)
How It Works
In Artemis replication, the live server and its backup do not share any data directory. All data synchronization happens over the network: every persistent write on the live server is simultaneously replicated to the backup. When the live server fails, the backup activates and takes over client connections.
This is the most sophisticated HA model and the most appropriate for cloud-native environments, Kubernetes deployments, and architectures where shared block storage is unavailable or impractical.
Production Configuration
Live server (broker.xml):

```xml
<configuration xmlns="urn:activemq"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <core xmlns="urn:activemq:core">
    <name>live-broker-01</name>

    <ha-policy>
      <replication>
        <master>
          <!-- Prevents the old live from re-activating if the backup already took over -->
          <check-for-live-server>true</check-for-live-server>
        </master>
      </replication>
    </ha-policy>

    <cluster-connections>
      <cluster-connection name="ha-cluster">
        <connector-ref>live-connector</connector-ref>
        <retry-interval>500</retry-interval>
        <use-duplicate-detection>true</use-duplicate-detection>
        <message-load-balancing>ON_DEMAND</message-load-balancing>
        <max-hops>1</max-hops>
        <static-connectors>
          <connector-ref>backup-connector</connector-ref>
        </static-connectors>
      </cluster-connection>
    </cluster-connections>

    <connectors>
      <connector name="live-connector">tcp://live-host:61616</connector>
      <connector name="backup-connector">tcp://backup-host:61616</connector>
    </connectors>
  </core>
</configuration>
```
Backup server (broker.xml):

```xml
<ha-policy>
  <replication>
    <slave>
      <!-- How many times the backup can restart after taking over -->
      <!-- before requiring manual intervention -->
      <max-saved-replicated-journals-size>2</max-saved-replicated-journals-size>
      <!-- Allow the old live to re-take the live role after restart -->
      <allow-failback>true</allow-failback>
      <failback-delay>5000</failback-delay>
    </slave>
  </replication>
</ha-policy>
```
The Backup Warmup Window
Unlike shared-storage backup (which is immediately ready), a replicating backup must synchronize all existing journal data from the live server before it can serve as a failover target. For a broker with a large persistent backlog, this synchronization can take minutes. During this window, if the live server fails, there is no HA protection.
Operationally: before declaring an Artemis HA pair production-ready, verify that backup-sync-complete is true via JMX or the Artemis management console. MeshIQ Console surfaces this sync status in the broker topology view — gating your deployment pipeline on backup sync completion prevents a dangerous window where HA is assumed but not yet real.
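That gating check can be scripted against JMX with only the JDK. The MBean, ObjectName, and attribute below are illustrative stand-ins so the sketch runs without a broker; against a real Artemis instance you would connect to the broker's MBean server and read the backup-sync attribute for your broker name and version.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Sketch of gating a deployment pipeline on the backup sync flag via JMX.
public class SyncGate {

    // Stand-in MBean so the sketch is runnable without a broker.
    public interface SyncStatusMBean { boolean getBackupSyncComplete(); }
    public static class SyncStatus implements SyncStatusMBean {
        private final boolean synced;
        public SyncStatus(boolean synced) { this.synced = synced; }
        public boolean getBackupSyncComplete() { return synced; }
    }

    // Generic check: read the boolean attribute; refuse to proceed until true.
    static boolean isBackupSynced(MBeanServer server, ObjectName name) throws Exception {
        return (Boolean) server.getAttribute(name, "BackupSyncComplete");
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Hypothetical ObjectName; a real Artemis broker exposes its own domain/keys.
        ObjectName name = new ObjectName("demo.ha:type=SyncStatus");
        server.registerMBean(new SyncStatus(true), name);

        System.out.println("backup sync complete: " + isBackupSynced(server, name));
    }
}
```

In a pipeline, the check would poll until the attribute flips to true (with a timeout), and fail the deployment otherwise.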
Split-Brain: The Replication Risk
Replication introduces a split-brain risk that Classic's lock models avoid. The backup promotes itself when it loses its connection to the live server. But connection loss can also result from a transient network partition, in which case both the live and the backup may believe they are active.
Artemis addresses this with a quorum check: when a backup loses its connection to the live server, it polls the other cluster members. If it can connect to more than half the cluster, it promotes; if not, it waits and retries, preventing false promotion.
This is why deploying a single live/backup pair is insufficient for production replication HA: with only two nodes, there is no quorum to consult. Always deploy three or more live/backup pairs for Artemis replication.
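The quorum rule reduces to a strict-majority test. A minimal model of it (the function name is ours, not an Artemis API) shows why a single pair cannot protect itself:

```java
// Simplified model of the quorum decision a backup makes before promoting.
public class QuorumVote {

    // Promote only when a strict majority of cluster members is reachable.
    static boolean mayPromote(int reachableMembers, int clusterSize) {
        return reachableMembers > clusterSize / 2;
    }

    public static void main(String[] args) {
        // Three live/backup pairs: the backup reaches 2 of 3 lives -> promote.
        System.out.println(mayPromote(2, 3)); // true
        // Single pair: the lone backup has nobody else to poll -> never a majority.
        System.out.println(mayPromote(0, 1)); // false
    }
}
```

With only one pair, the backup's view of "I lost the live" is indistinguishable from "the network partitioned," so the majority test can never pass safely.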
The Failover Transport: Your Client-Side HA Layer
Every HA model above depends on clients using the Failover Transport. Without it, a broker restart means applications throw connection exceptions until manually reconnected. The Failover Transport layers automatic reconnection logic on top of any underlying transport.
Production Failover Transport URI:

```
failover:(tcp://broker1:61616,tcp://broker2:61616,tcp://broker3:61616)
  ?randomize=false
  &initialReconnectDelay=100
  &maxReconnectDelay=30000
  &maxReconnectAttempts=-1
  &timeout=3000
  &trackMessages=true
```
Critical Parameter Reference
| Parameter | Recommended | Rationale |
| --- | --- | --- |
| randomize | false | Connect to the first URI (the primary) before falling back; not random load balancing |
| initialReconnectDelay | 100ms | Short initial wait; avoids thundering herd after broker restart |
| maxReconnectDelay | 30000ms | Caps exponential backoff; prevents indefinite stalls |
| maxReconnectAttempts | -1 | Infinite retries; let the HA layer determine recovery |
| timeout | 3000ms | Fail fast on a blocked send() rather than waiting indefinitely |
| trackMessages | true | Replay in-flight messages after reconnection |
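If you template these URIs across many services, a small helper keeps hosts and parameters consistent. buildUri is a hypothetical utility of ours, not an ActiveMQ API:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Assemble a Failover Transport URI from a broker list and parameter map.
public class FailoverUri {

    static String buildUri(List<String> brokers, Map<String, String> params) {
        String hosts = String.join(",", brokers);
        String query = params.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
        return "failover:(" + hosts + ")?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves parameter order for readable, diff-friendly URIs.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("randomize", "false");
        params.put("maxReconnectAttempts", "-1");
        params.put("timeout", "3000");
        System.out.println(buildUri(
                List.of("tcp://broker1:61616", "tcp://broker2:61616"), params));
        // -> failover:(tcp://broker1:61616,tcp://broker2:61616)?randomize=false&maxReconnectAttempts=-1&timeout=3000
    }
}
```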
One critical production gotcha: never use nio:// in Failover Transport URLs on the client side. NIO transport is server-side only, used to configure broker acceptors. Client applications using nio:// in the Failover URL combined with priorityBackup=true can cause immediate CPU spikes to 100% — a bug that has affected production deployments. Client URLs must always use tcp://.
Priority Failover for Geographic Redundancy
For deployments spanning multiple datacenters or availability zones, priorityBackup=true ensures clients prefer the local broker and automatically rebalance back to it when it recovers:
```
failover:(tcp://local-broker:61616,tcp://remote-broker:61616)
  ?randomize=false
  &priorityBackup=true
  &priorityURIs=tcp://local-broker:61616
```
Without priorityBackup=true, a client that failed over to the remote broker during a local broker outage remains on the remote broker permanently after the local broker recovers. Every client on the remote broker represents cross-datacenter network latency on every message operation. With priorityBackup=true, the client automatically reconnects to the local broker once it is available, no manual intervention required.
Handling In-Doubt Transactions After Failover
This is the area where HA implementations most commonly silently lose correctness. The Failover Transport replays in-flight transactions after reconnection, but transactions that were in-flight at the exact moment of failover are in doubt: the broker may have received the commit, but the reply was lost, or the commit may never have arrived.
ActiveMQ 5.3.1+ handles in-doubt transactions by rolling them back and throwing a javax.jms.TransactionRolledBackException. Your application must catch this exception and implement an idempotent retry. If your application assumes a commit either fully succeeds or was never attempted, you will silently lose messages or produce duplicates under HA failover conditions.
This is not an edge case. It is the normal failure mode for any transacted producer or consumer during failover.
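The retry pattern can be illustrated without a broker. The exception and broker below are simulated stand-ins (a real client catches javax.jms.TransactionRolledBackException from commit()); the point is that retrying with the same message ID keeps a duplicate-detecting store idempotent:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of idempotent retry for in-doubt transactions after failover.
public class IdempotentRetry {

    static class TransactionRolledBack extends Exception {} // stand-in exception

    // Simulated broker whose first commit is "in doubt": the message was
    // stored, but the client sees a rollback and must retry.
    static class FlakyBroker {
        final Set<String> stored = new HashSet<>(); // Set dedups by message ID
        boolean failNextCommit = true;

        void send(String messageId) throws TransactionRolledBack {
            stored.add(messageId); // duplicate sends are no-ops
            if (failNextCommit) {
                failNextCommit = false;
                throw new TransactionRolledBack(); // commit reply lost in failover
            }
        }
    }

    // On rollback, resend the SAME message ID so dedup can discard duplicates.
    static int sendWithRetry(FlakyBroker broker, String messageId, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                broker.send(messageId);
                return attempt;
            } catch (TransactionRolledBack e) {
                // fall through and retry with the same ID
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        FlakyBroker broker = new FlakyBroker();
        int attempts = sendWithRetry(broker, "order-42", 5);
        System.out.println("delivered after " + attempts + " attempts, stored once: "
                + (broker.stored.size() == 1));
    }
}
```

The same shape applies to consumers: keep a durable record of processed message IDs, and treat a redelivered ID as already done rather than reprocessing it.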
The Five HA Failure Modes You Need to Prepare For
These are the production incidents we see most frequently — all of which are preventable with correct configuration.
Failure Mode 1: The NFSv3 Lock Zombie
Symptom: Master crashes; slave never promotes. Both instances idle. All producers blocked indefinitely.
Cause: NFSv3 does not release file locks on abnormal client termination. The lock is held indefinitely.
Fix: Mount the shared data directory with NFSv4 (nfsvers=4). Verify with mount | grep nfs that the mount is using version 4, not 3.
Failure Mode 2: The JDBC Lock Timeout Window
Symptom: After master crash, slave waits 30–60 seconds before promoting. All producers block during the window.
Cause: TCP half-open connection holds the database transaction lock until OS-level TCP timeout fires.
Fix: Set socketTimeout=30 in the JDBC URL. Enable connection pool health checks with testOnBorrow=true. Configure OS-level TCP keepalive on the database host.
Failure Mode 3: Artemis Replication Split-Brain
Symptom: After a network partition, both live and backup believe they are active. Writes go to two independent journals that cannot be merged.
Cause: Single live/backup pair with no quorum. Backup promotes unilaterally.
Fix: Deploy a minimum of three live/backup pairs. Never run Artemis replication HA with a single pair in production.
Failure Mode 4: The Vanishing Scheduled Messages
Symptom: After failover, all scheduled messages are missing.
Cause: schedulerSupport=”true” without dataDirectory explicitly set to the shared path. Scheduler data stored on the local disk of the master.
Fix: Set dataDirectory on the <broker> element to your shared mount path whenever schedulerSupport=”true”.
Failure Mode 5: Silent HA Degradation (The Most Dangerous)
Symptom: The HA cluster appears healthy — both brokers running, no alerts — but the slave is not actually tracking the master’s state. On failover, the slave promotes but has stale or empty persistent state.
Cause: Artemis backup sync never completed (check backup-sync-complete via JMX). Or: shared storage HA with a slave that restarted but is silently failing to acquire the lock.
Fix: Continuous monitoring of master lock holder identity, slave sync status (Artemis), and Failover Transport reconnection event rate. These three metrics together give you full HA health visibility.
HA Model Selection Framework
Choose Shared File System HA if:
- You are running Apache ActiveMQ
- You have a SAN or NFSv4 shared storage infrastructure already in place
- You prioritize operational simplicity — no quorum, no external services
- Your KahaDB data fits within the shared storage budget
Choose JDBC Master/Slave if:
- You are running Apache ActiveMQ and have no SAN
- You have a highly available relational database (PostgreSQL with streaming replication, Oracle RAC)
- Your team has stronger DBA skills than storage engineering
Choose Artemis Replication if:
- You are running or migrating to Artemis
- You are deploying on Kubernetes or cloud without shared block storage
- You need geographic HA with no shared storage dependency
- You can absorb the added complexity of quorum management and backup warmup monitoring
Do not use any HA model in isolation without:
- The Failover Transport configured on all clients
- randomize=false set to prevent load-balancing across master/slave
- Idempotent retry logic handling TransactionRolledBackException
- Continuous monitoring of HA state — not just broker uptime
Network of Brokers (NoB) is not an HA mechanism; it is a horizontal scaling and routing topology. For true resilience, each broker node in a NoB should have its own master/slave HA pair.
Monitoring HA State: Where Manual JMX Falls Short
Once the HA configuration is deployed, the next operational challenge is knowing whether it is actually working, not just whether the brokers are running.
The most dangerous HA state is silent degradation: the cluster looks healthy, but the slave has lost synchronization with the master and will not promote correctly on failover. This is nearly invisible with basic JMX polling because the relevant state spans both brokers and requires correlation across them.
Key metrics for continuous HA health monitoring:
- Master lock holder identity: which broker currently holds the exclusive lock
- Slave connection state: is the slave actually connected and waiting (not crashed or network-partitioned)
- Artemis backup sync status: backup-sync-complete must be true before HA is considered operational
- Failover Transport reconnection rate: unexpected reconnection spikes indicate broker instability before a full failover occurs
- Time since last clean HA state check: composite health metric correlating all the above
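The composite metric at the end of that list amounts to ANDing the individual signals with a churn threshold. A toy version, with the names and the threshold as our own assumptions:

```java
// Composite HA health: "green" only when the lock holder is known, the
// standby is attached and (for Artemis) fully synced, and reconnect churn
// is below an alert threshold.
public class HaHealth {

    static boolean haHealthy(boolean masterLockHeld,
                             boolean slaveConnected,
                             boolean backupSyncComplete,
                             int reconnectsLastHour) {
        return masterLockHeld && slaveConnected && backupSyncComplete
                && reconnectsLastHour < 10; // alert threshold is a tunable assumption
    }

    public static void main(String[] args) {
        // Both brokers "up", but the backup never finished syncing: degraded, not healthy.
        System.out.println(haHealthy(true, true, false, 0)); // false
        System.out.println(haHealthy(true, true, true, 0));  // true
    }
}
```

The point of the AND is that broker uptime alone never satisfies it: a single stale signal turns the whole cluster status red.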
MeshIQ Console provides a unified HA topology view that surfaces all of these indicators across your entire broker fleet in a single dashboard, without requiring JMX scripting or per-broker manual checks.
The Console’s alerting engine can notify your on-call team the moment HA state transitions from healthy to degraded, not after a production failover exposes the problem.
Your HA Configuration Is Only as Good as Your Monitoring
The most common post-incident finding in ActiveMQ HA failures is not that the configuration was wrong; it is that a configuration correct at deployment silently degraded over time, and nobody noticed until the first real failover exposed it.
Backup sync drift, slave disconnects, scheduler directory misplacement, and JDBC lock timeout behaviors are all invisible to basic uptime monitoring. They only surface when a failover is triggered at the worst possible moment.
MeshIQ Console gives your operations team continuous visibility into HA state across all broker models, so silent degradation is caught before it becomes a production incident.
See how MeshIQ Console monitors your ActiveMQ HA topology → Request a Demo
Frequently Asked Questions
How does ActiveMQ high availability work?
ActiveMQ HA uses a master/slave architecture where one broker serves clients and one or more standby brokers wait. When the master fails, a slave acquires the coordination lock (file system or database row) and promotes itself to master. Clients using the Failover Transport automatically reconnect. Classic supports shared file system and JDBC HA. Artemis additionally supports network replication.
What is the difference between shared file system and JDBC master/slave HA?
Both produce identical active/passive behavior. The shared file system uses a file lock on a shared NFS/SAN mount; JDBC uses a database row lock. The shared file system is simpler but requires reliable shared storage with proper POSIX locking (NFSv4 or SAN). JDBC works without shared storage but depends on database lock expiry; a crashed master's lock may delay slave promotion by 30–60 seconds.
How should I configure the Failover Transport for HA?
Use failover:(tcp://broker1:61616,tcp://broker2:61616)?randomize=false&maxReconnectDelay=30000&maxReconnectAttempts=-1&timeout=3000. Set randomize=false to connect to the primary first. Set maxReconnectAttempts=-1 for infinite retries. Handle TransactionRolledBackException in your application: in-doubt transactions are rolled back, not replayed, during failover.
What is split-brain, and how do I prevent it?
Split-brain is when both master and standby believe they are active simultaneously. Classic's lock models largely prevent this: only one broker can hold the exclusive lock. Artemis replication has split-brain risk during network partitions. Prevention requires three or more live/backup pairs so quorum voting determines which side activates.
Can messages be lost during failover?
With persistent messaging and synchronous sends, messages acknowledged before failover are not lost: they are in the persistent store, available to the new master. In-flight transactions at the moment of failover result in a TransactionRolledBackException that the application must retry. Non-persistent messages are always lost on failover regardless of HA model.