Thursday, March 19, 2015

IBM BPM Event Manager

The main function of the Event Manager is to guarantee scheduled execution of code. The thing to remember here is that the Event Manager does not execute the code itself; it schedules it with the corresponding Process Server. Any work scheduled by a specific Event Manager is run on the local Process Server. The Event Manager scheduler is used any time an undercover agent (UCA) is invoked, but it is also used for processing business process definition (BPD) notifications, executing BPD system lane activities, and executing BPD timer events. It is not specific to TWEvents or to undercover agents.

All date-times in the process database are written by the database and use the database server's clock. All timer-based executions (BPD timers, scheduled UCAs, task due dates) are triggered and alerted by the process server. The clocks of the TeamWorks server and the database server need to be in sync, or events will not be processed properly.

Queues ::
First, we need to understand queues in order to understand the Event Manager.
IBM BPM has two types of queue: asynchronous (async) and synchronous (sync). The Event Manager treats sync and async queues differently.

1.)    Sync Queues ::
Sync queues are executed serially. If you have multiple tasks set to run on one sync queue, they will execute one after the other in the order that they were put into the sync queue.
Each task in a sync queue must be executed serially. To prevent problems in a cluster, an Event Manager claims ownership of one or more sync queues when it starts up. The ownership is stored in the LSW_UCA_SYNC_QUEUE table, where the QUEUE_OWNER column is linked to OWNER_ID in LSW_EM_INSTANCE. This is not a permanent assignment. The LSW_EM_INSTANCE table keeps track of the status of all of the Event Managers, and that status is checked every 15 seconds. If the owner of a sync queue is no longer available, another Event Manager takes ownership of that sync queue.
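The ownership hand-off can be pictured with a small in-memory model. This is a minimal sketch, not IBM BPM code: the `EmInstance` class, the `reassign_sync_queues` function and the queue names are hypothetical, the dictionaries only stand in for the QUEUE_OWNER / OWNER_ID link described above, and "no longer available" is assumed to mean "heartbeat expiration time is in the past". In a real cluster whichever Event Manager notices first claims the queue; here the claimant is simply passed in.

```python
from dataclasses import dataclass


@dataclass
class EmInstance:
    """Hypothetical, simplified stand-in for one LSW_EM_INSTANCE row."""
    owner_id: int
    expires_at_ms: int  # heartbeat expiration, measured on the database clock


def reassign_sync_queues(queue_owner, instances, db_now_ms, claimant_id):
    """Return a new {queue name: owner id} map (a model of QUEUE_OWNER).

    Queues whose owner is still alive keep that owner; queues whose owner is
    missing or whose heartbeat has expired are claimed by claimant_id.
    """
    alive = {i.owner_id for i in instances if i.expires_at_ms >= db_now_ms}
    return {
        queue: owner if owner in alive else claimant_id
        for queue, owner in queue_owner.items()
    }


# Usage: Event Manager 2 stopped heartbeating, so Event Manager 1 takes over its queue.
instances = [EmInstance(owner_id=1, expires_at_ms=100_000),
             EmInstance(owner_id=2, expires_at_ms=40_000)]
owners = {"SyncQueue1": 1, "SyncQueue2": 2}
print(reassign_sync_queues(owners, instances, db_now_ms=50_000, claimant_id=1))
# → {'SyncQueue1': 1, 'SyncQueue2': 1}
```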

2.)    Async Queues ::
Async queues are executed as soon as possible with no guaranteed order.
Async tasks are picked up by each Event Manager whenever there is room in its async queue for more tasks.
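The difference in ordering guarantees can be sketched with plain Python threads. This is only an analogy under stated assumptions: `run_sync_queue` and `run_async_queue` are hypothetical helpers, and a `ThreadPoolExecutor` stands in for the Event Manager's worker threads. The sync version never starts a task before the previous one returns; the async version runs tasks concurrently, so completion order depends on timing.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_sync_queue(tasks):
    """Run (name, fn) pairs strictly one after another, in queue order."""
    completed = []
    for name, fn in tasks:
        fn()                      # the next task cannot start until this returns
        completed.append(name)
    return completed


def run_async_queue(tasks, workers=4):
    """Run (name, fn) pairs concurrently; completion order is not guaranteed."""
    completed = []
    lock = threading.Lock()

    def run_one(name, fn):
        fn()
        with lock:                # several worker threads append concurrently
            completed.append(name)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_one, name, fn) for name, fn in tasks]
        for future in futures:
            future.result()       # re-raise any task failure in the caller
    return completed


def sleeper(seconds):
    return lambda: time.sleep(seconds)


tasks = [("a", sleeper(0.03)), ("b", sleeper(0.01)), ("c", sleeper(0.02))]
print(run_sync_queue(tasks))   # always ['a', 'b', 'c']
print(run_async_queue(tasks))  # some order of a, b, c; no order is promised
```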

80EventManager.xml ::
Each process server has its own running Event Manager. The Event Manager is configured by each process server's copy of the 80EventManager.xml file:

Configuring the Event Manager using the 80EventManager.xml file
<enable>true</enable>
If this parameter is set to true, the Event Manager is turned on for this process server instance. If you set it to false, this process server has no Event Manager. Setting this parameter to false also disables the business process definition engine for this instance. This approach allows you to allocate process server instances to different duties.

<start-paused>false</start-paused>
If this parameter is set to true, the scheduler for the Event Manager is started in a paused state. The Event Manager scheduler resumes if you specifically tell it to resume from the Event Manager Monitor console page, or if you click Resume All on that page.
Note: Pause/Resume always uses the Java Message Service (JMS) to send the request to the scheduler, even if you are pausing or resuming the web server to which you are connected. Pause/Resume is the only piece of the scheduler infrastructure that uses JMS; all other communication is done through the database.

<name>machine name</name>
By default this parameter is commented out, and the host name is used instead. The value populates the LSW_EM_INSTANCE table and names the Event Manager as shown in the Event Manager Monitor. If your host name is not descriptive enough, you can uncomment this parameter and use a name of your choosing.

<heartbeat-period>15000</heartbeat-period>
<heartbeat-expiration>60000</heartbeat-expiration>
These parameters are used to determine which Event Manager instances are up and running. These parameters should not need to be changed.
The heartbeat is a separate thread that constantly updates the lsw_em_instance database table to tell other schedulers that its scheduler is alive. The heartbeat runs even if the scheduler itself is paused. The lsw_em_instance table drives the content in the top section of the Event Manager Monitor console. A scheduler whose expiration time is in the past is treated as disconnected; the other schedulers then assume that it is dead and pick up any additional work as necessary. The heartbeat of a non-disabled scheduler updates lsw_em_instance every <heartbeat-period> milliseconds (15 seconds by default) and sets its expiration to <heartbeat-expiration> milliseconds in the future (60 seconds by default). This means that if a process server machine is completely unplugged, it can take up to 60 seconds before the other schedulers recognize it as disconnected.
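The arithmetic behind "up to 60 seconds" can be worked through in a few lines. This sketch assumes heartbeats land exactly at 0, P, 2P, ... (P = heartbeat-period) and that a heartbeat due at the very instant of the unplug was still written; it also ignores how often the surviving schedulers read lsw_em_instance. The function names are hypothetical. Under these assumptions the delay ranges from just over 45 seconds (unplugged right before a beat) to 60 seconds (unplugged right after one).

```python
HEARTBEAT_PERIOD_MS = 15_000      # <heartbeat-period> default
HEARTBEAT_EXPIRATION_MS = 60_000  # <heartbeat-expiration> default


def is_disconnected(expires_at_ms, db_now_ms):
    """An instance whose expiration time is in the past is treated as dead."""
    return expires_at_ms < db_now_ms


def detection_delay_ms(unplug_at_ms,
                       period_ms=HEARTBEAT_PERIOD_MS,
                       expiration_ms=HEARTBEAT_EXPIRATION_MS):
    """Time from an unplug until the last written expiration time arrives.

    Assumes heartbeats at 0, period, 2*period, ... and that a heartbeat due
    at exactly the unplug instant was still written.
    """
    last_beat_ms = (unplug_at_ms // period_ms) * period_ms
    expires_at_ms = last_beat_ms + expiration_ms
    return expires_at_ms - unplug_at_ms


print(detection_delay_ms(30_000))  # unplugged right after a beat → 60000
print(detection_delay_ms(44_999))  # unplugged just before the next beat → 45001
```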

<loader-long-period>15000</loader-long-period>
For every loader long period, the Event Manager looks at each queue (sync and async) that it has access to and fills it to capacity. This sweep is sometimes referred to as a major tick.

<loader-short-period>2000</loader-short-period>
For every loader short period, the Event Manager looks through each of its queues and tries to fill them to capacity. Think of the loader long period as the sweep that fills the queues and the short period as the sweep that fills any space left over in between. This sweep is sometimes referred to as a minor tick.
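With the default settings, the two sweeps interleave roughly as follows. This is an illustrative timeline only: `loader_ticks` is a hypothetical helper, and it assumes both timers start together at 0 and that a minor tick falling on the same instant as a major tick is absorbed by the major one. The product's actual timer alignment may differ.

```python
def loader_ticks(duration_ms, long_period_ms=15_000, short_period_ms=2_000):
    """List (time_ms, kind) loader sweeps in [0, duration_ms), in time order.

    Assumes both timers start at 0 and that a minor tick on the same instant
    as a major tick is absorbed by the major one.
    """
    major = set(range(0, duration_ms, long_period_ms))
    minor = set(range(0, duration_ms, short_period_ms)) - major
    ticks = [(t, "major") for t in major] + [(t, "minor") for t in minor]
    return sorted(ticks)


ticks = loader_ticks(16_000)
print(ticks[:3])   # [(0, 'major'), (2000, 'minor'), (4000, 'minor')]
print(ticks[-2:])  # [(14000, 'minor'), (15000, 'major')]
```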

<loader-advance-window>60000</loader-advance-window>
For scheduled tasks, this parameter specifies how far in advance the Event Manager looks for tasks.
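The look-ahead can be read as a simple filter on due times. In this sketch `ScheduledTask` and `tasks_in_window` are hypothetical; including overdue tasks and ordering by due time are assumptions about sensible loader behaviour, not documented product internals. Times are on the database clock, in milliseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduledTask:
    task_id: int
    scheduled_ms: int  # due time, on the database clock


def tasks_in_window(tasks, db_now_ms, advance_window_ms=60_000):
    """Return tasks due no later than now + window, earliest first.

    Overdue tasks (due before now) are included; tasks due further out are
    left for a later sweep.
    """
    horizon_ms = db_now_ms + advance_window_ms
    due = [t for t in tasks if t.scheduled_ms <= horizon_ms]
    return sorted(due, key=lambda t: t.scheduled_ms)


tasks = [ScheduledTask(1, 90_000), ScheduledTask(2, -5_000),
         ScheduledTask(3, 30_000), ScheduledTask(4, 60_000)]
print([t.task_id for t in tasks_in_window(tasks, db_now_ms=0)])  # [2, 3, 4]
```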

<sync-queue-capacity>10</sync-queue-capacity>
This parameter specifies the number of tasks to fill for each sync queue that the Event Manager has acquired.

<async-queue-capacity>10</async-queue-capacity>
This parameter specifies the number of tasks to fill for each async queue that the Event Manager has acquired.
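"Fill to capacity" means loading only as many tasks as there is free room for. The sketch below is a hypothetical model (`fill_queue` is not a product API) of what one sweep might do for a single queue, with the capacity playing the role of sync-queue-capacity or async-queue-capacity. A full queue takes nothing more until some of its in-flight tasks finish.

```python
def fill_queue(loaded, backlog, capacity):
    """Move tasks from the front of backlog into loaded until capacity is reached.

    Returns how many tasks were loaded on this sweep.
    """
    room = max(capacity - len(loaded), 0)
    batch = backlog[:room]
    loaded.extend(batch)
    del backlog[:room]
    return len(batch)


loaded = ["t1", "t2", "t3"]                 # three tasks still in flight
backlog = [f"t{n}" for n in range(4, 20)]   # sixteen waiting tasks
print(fill_queue(loaded, backlog, capacity=10))  # 7
print(fill_queue(loaded, backlog, capacity=10))  # 0, the queue is already full
```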

<bpd-queue-capacity>20</bpd-queue-capacity>
Business process definitions execute in their own async queue. The business process definition queue is used for timers firing, delivering messages to business process definition instances, and executing system lane tasks. This parameter is the queue depth for that queue.

<system-queue-capacity>10</system-queue-capacity>
The Event Manager has its own internal queue. This parameter is barely used and should not need to be changed.

<min-thread-pool-size>5</min-thread-pool-size>
This parameter specifies the minimum number of threads that the Event Manager should use.

<max-thread-pool-size>50</max-thread-pool-size>
This parameter specifies the maximum number of threads that the Event Manager can use.

The thread pool is not per queue; it is the total number of threads for that Event Manager instance in that particular Process Server Java virtual machine (JVM).
Note: The total number of database connections available in the application server connection pool should be at least twice max-thread-pool-size. The number of connections allowed on the actual database server needs to be at least the sum of the maximum connection pool sizes of all nodes in the cluster.
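The two sizing rules above are simple arithmetic and can be checked mechanically. This is a hypothetical helper, not a product tool; the node names are made up, and it assumes every node runs an Event Manager with the same max-thread-pool-size. With the default of 50 threads, each node's pool needs at least 100 connections, and two nodes with pools of 100 and 80 need at least 180 connections on the database server.

```python
def check_connection_sizing(max_thread_pool_size, node_pool_max, db_max_connections):
    """Return violations of the two connection sizing rules of thumb.

    node_pool_max maps a (hypothetical) node name to its application server
    connection pool maximum; every node is assumed to use the same
    max-thread-pool-size for its Event Manager.
    """
    problems = []
    required_per_node = 2 * max_thread_pool_size
    for node, pool_max in node_pool_max.items():
        if pool_max < required_per_node:
            problems.append(f"{node}: pool max {pool_max} < {required_per_node}")
    required_on_db = sum(node_pool_max.values())
    if db_max_connections < required_on_db:
        problems.append(f"database: {db_max_connections} connections < {required_on_db}")
    return problems


print(check_connection_sizing(50, {"node1": 100, "node2": 80}, db_max_connections=150))
# → ['node2: pool max 80 < 100', 'database: 150 connections < 180']
```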

<re-execute-limit>5</re-execute-limit>
This parameter specifies the maximum number of times to retry a failed task.

<kick-on-schedule>true</kick-on-schedule>
When this parameter is set to true, a newly scheduled task forces the Event Manager into an immediate poll of lsw_em_task, reducing the time between when a new task is scheduled and when it is executed. This helps latency (a newly scheduled "right now" task is executed almost immediately) but hurts overall throughput, because the TaskLoader ends up being more active than it would be otherwise. If kick-on-schedule is false, newly scheduled tasks are not picked up until the next time the Event Manager polls lsw_em_task (up to the loader-long-period), which increases latency. However, it also increases overall throughput by reducing chatter and contention on the lsw_em_task table. For a system with a heavily loaded Event Manager, this parameter should be set to false.
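The latency/throughput trade-off can be put into numbers with a deliberately simplified model. Both functions are hypothetical; they assume periodic polls at 0, P, 2P, ... (P = loader-long-period), that each kick costs exactly one extra poll, and they ignore minor ticks. Under that model, a task scheduled at 16 s waits until 30 s without kicks, while 100 scheduled tasks in a minute turn 4 polls into 104 with kicks.

```python
def pickup_time_ms(scheduled_at_ms, kick_on_schedule, loader_long_period_ms=15_000):
    """When a 'run now' task is first seen by the loader, in this simplified model.

    With kick-on-schedule the scheduling call triggers an immediate poll;
    without it the task waits for the next periodic poll, assumed to happen
    at 0, P, 2P, ... where P is loader-long-period.
    """
    if kick_on_schedule:
        return scheduled_at_ms
    periods = -(-scheduled_at_ms // loader_long_period_ms)  # integer ceiling
    return periods * loader_long_period_ms


def polls_in_window(tasks_scheduled, kick_on_schedule, window_ms, loader_long_period_ms=15_000):
    """Count lsw_em_task polls: the periodic ones plus one extra per kick."""
    periodic = len(range(0, window_ms, loader_long_period_ms))
    return periodic + (tasks_scheduled if kick_on_schedule else 0)


print(pickup_time_ms(16_000, kick_on_schedule=True))   # 16000
print(pickup_time_ms(16_000, kick_on_schedule=False))  # 30000
print(polls_in_window(100, True, window_ms=60_000))    # 104
print(polls_in_window(100, False, window_ms=60_000))   # 4
```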

<event-retry-interval>5000</event-retry-interval>
This parameter specifies the time between retries for failed tasks.
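Together, re-execute-limit and event-retry-interval describe a bounded retry loop. The sketch below is a hypothetical illustration, not the Event Manager's implementation; it assumes re-execute-limit counts retries after the first run (so the defaults allow up to 6 attempts in total), and the `sleep` parameter is injectable only so the behaviour can be tested without waiting.

```python
import time


def run_with_retries(task, re_execute_limit=5, retry_interval_ms=5_000, sleep=time.sleep):
    """Run task, retrying up to re_execute_limit times after failures.

    Returns (result, attempts). The final failure is re-raised once the
    retry budget is spent.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > re_execute_limit:   # 1 initial run + re_execute_limit retries
                raise
            sleep(retry_interval_ms / 1000)   # event-retry-interval is in milliseconds


failures_left = [3]


def flaky():
    """Hypothetical task that fails a few times before succeeding."""
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise RuntimeError("transient failure")
    return "done"


print(run_with_retries(flaky, sleep=lambda seconds: None))  # ('done', 4)
```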

<task-execution-listener>com.lombardisoftware.server.scheduler.DbTaskExecutionListener</task-execution-listener>
This parameter is disabled by default. If the parameter is enabled, task history is maintained in the lsw_em_task_history table. You can then query this table to get the history of your tasks. Note: The product does not provide a way to display or clean up this data.

<sync-queue-controller-interval>5000</sync-queue-controller-interval>
This parameter specifies the interval (in milliseconds) at which the Sync Queue Controller wakes up and checks for sync queue jobs that need to be executed.

Removing a stuck task -
Because sync queues only advance when a task completes, a poorly designed task can cause a sync queue to stop. If you need to clear such a task, you can delete it from the lsw_em_task table, and then stop and start the Event Manager from the console to get things moving again.
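Head-of-line blocking, and why deleting the stuck task unblocks the queue, can be shown with a toy model. Everything here is hypothetical: `advance_sync_queue` and the task names are illustrative, and `popleft()` merely stands in for deleting the row from lsw_em_task. The second queue also illustrates why spreading work over several sync queues limits the damage of one bad task.

```python
from collections import deque


def advance_sync_queue(queue, is_finished):
    """Pop completed tasks off the head of a sync queue.

    Only the head task may run; the queue moves on only when it finishes,
    so one task that never finishes holds back everything behind it.
    Returns the names of tasks completed on this pass.
    """
    done = []
    while queue and is_finished(queue[0]):
        done.append(queue.popleft())
    return done


STUCK = {"hung-task"}  # hypothetical task that never completes


def finished(name):
    return name not in STUCK


queue_a = deque(["hung-task", "a2", "a3"])
queue_b = deque(["b1", "b2"])
print(advance_sync_queue(queue_a, finished))  # [] because a2 and a3 are blocked
print(advance_sync_queue(queue_b, finished))  # ['b1', 'b2'] since a separate queue is unaffected

queue_a.popleft()  # stands in for deleting the stuck task's row
print(advance_sync_queue(queue_a, finished))  # ['a2', 'a3']
```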

Points to remember -
·         The Event Manager is quick and efficient. Usually it is the tasks it executes that slow it down, not the Event Manager itself.
·         If you want to throttle the Event Manager, do not decrease the thread pool. Instead, decrease the queue capacity.
·         A sync queue can get stuck because it will not advance until the task completes. To help make this less of a problem, create multiple sync queues. You can manage sync queues in the Teamworks console.
·         All the time stamps used by the Event Manager scheduler (the heartbeat expirations and the task scheduled times) are interpreted relative to the system clock of the database machine. Thus, the scheduler itself does not require the process server system clocks to be kept in sync. Keeping system clocks in sync is still a good idea, however, for date-based tracking data, log analysis, and so on.

·         All time settings mentioned in 80EventManager.xml are in milliseconds.