
Section 2.1.5 Big Bro Watchdog Alarm System:

Big Bro watchdog alarms can have four different scopes:

Involves the whole Gateway
Involves the Gateway Relay
Involves an IP Address
Involves a DS0 group

Gateway alarms are 0, 2, 4, 10 and 12. Gateway Relay alarms are numbers 12 and 100. The rest involve only a single IP address or a DS0 group.

Policy on Redundant Alarms

The condition that triggers an alarm may persist after the first alarm is issued, so some policy is needed on how to repeat alarms. Issuing an alarm only once is not good enough: a single page can be lost or missed, and a repeated alarm carries information of its own, since it tells you that the condition persists. On the other hand, once you know about a problem, you don't need to be reminded every second or even every minute. The policy is the following: once an alarm is issued, it won't be issued again until two minutes have gone by; a third alarm must wait 4 minutes, the next one 8, and so on, doubling each time.
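The doubling policy can be sketched in a few lines of Python. This is only an illustration, not the GW_Relay implementation: the class and method names are hypothetical, and it assumes that each wait is measured from the previous alarm and that the history is cleared when the condition goes away.

```python
from dataclasses import dataclass
from typing import Optional

BASE_WAIT_S = 120  # two minutes before the first repeat, as stated above


@dataclass
class RepeatPolicy:
    """Tracks when a persisting condition may raise its alarm again."""
    last_issued: Optional[float] = None  # time of the most recent alarm, in seconds
    issued: int = 0                      # alarms sent so far for this condition

    def should_issue(self, now: float) -> bool:
        """Return True (and record the alarm) if an alarm may be sent at `now`."""
        if self.last_issued is not None:
            # Wait 2 minutes after the 1st alarm, 4 after the 2nd, 8 after the 3rd...
            wait = BASE_WAIT_S * 2 ** (self.issued - 1)
            if now - self.last_issued < wait:
                return False
        self.last_issued = now
        self.issued += 1
        return True

    def clear(self) -> None:
        """Forget the history once the condition goes away (an assumption)."""
        self.last_issued = None
        self.issued = 0
```

If the condition persists and is checked once a minute for fifteen minutes, this sketch sends alarms at 0, 120, 360 and 840 seconds.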

Sample

Any variable that expresses a rate involves a sample, and the size of this sample is the result of a tradeoff between response time and statistical significance. The smaller the sample, the faster the response to a problem, but the more prone it becomes to false alarms. The ideal sample size is a function of the traffic. By default the sample size is 50, because experience has shown that this value suits most systems. For more than 100 attempts per hour you may increase the size with the ini file parameter WD_SAMPLE_SIZE, and for fewer than 50 attempts per hour a smaller sample size might work better.

Sample collection starts when GW_Relay is started. Until the call count for an IP or DS0 group reaches the sample size, the sample used for rate calculations is simply that call count. For an alarm to be issued, this count must be greater than 5.
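The way the sample grows after startup can be pictured as a bounded window per IP address or DS0 group. The following Python sketch is an illustration only; the class name and its methods are hypothetical, and the constants simply restate the default sample size and the minimum count given above.

```python
from collections import deque

WD_SAMPLE_SIZE = 50   # default sample size from the text
MIN_CALLS = 5         # the call count must be greater than this


class CallSample:
    """Keeps the most recent calls for one IP address or DS0 group."""

    def __init__(self, size: int = WD_SAMPLE_SIZE) -> None:
        # A bounded deque drops the oldest call once the window is full.
        self.calls = deque(maxlen=size)

    def add(self, call) -> None:
        self.calls.append(call)

    def effective_size(self) -> int:
        """The call count until the window fills, then the configured size."""
        return len(self.calls)

    def can_alarm(self) -> bool:
        """Rate alarms are only considered with more than MIN_CALLS calls."""
        return self.effective_size() > MIN_CALLS
```

With the defaults, a group that has seen 5 calls cannot raise a rate alarm yet, the sixth call makes it eligible, and after 50 calls the window stops growing.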

Short Calls

Answer Supervision, the signal that the call is connected and that the caller is already talking to the called party, is not always reliable. Many circuits return this signal as a mere acknowledgment of call setup, and in these cases you may see suspicious 100% completion rates. To get a better picture of reality, you may specify, with the ini file parameter WD_SHORT_CALL_DEF, a minimum length for a call to be considered a "good" one. This setting affects not only Watchdog alarms but also the 24 hour completion rate summaries.
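As a rough illustration of the effect of a minimum call length, the Python sketch below recomputes a completion rate with and without it. The function names are hypothetical, and it assumes that the length is expressed in seconds and that a call exactly as long as the minimum counts as good; neither is stated in this section.

```python
def is_good_call(connected: bool, duration_s: float, short_call_def_s: float) -> bool:
    """A call is good if it connected and lasted at least the minimum length."""
    return connected and duration_s >= short_call_def_s


def completion_rate(calls, short_call_def_s: float = 0.0) -> float:
    """Percentage of good calls; `calls` holds (connected, duration_s) pairs."""
    if not calls:
        return 0.0
    good = sum(1 for connected, duration_s in calls
               if is_good_call(connected, duration_s, short_call_def_s))
    return 100.0 * good / len(calls)
```

For ten calls that all returned answer supervision, seven of them lasting 2 seconds and three lasting 45 seconds, the rate is 100% with no minimum length and 30% with a minimum of 10 seconds.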

0 - Span down / up alarm

This alarm requires no further explanation: it is issued whenever a T1 or E1 span loses sync, signal, etc., and again when it gets it back.

1 - Low Completion Rate

If fewer than the specified percentage (WD_MIN_CR) of the calls to an IP or a DS0 group are connecting, in other words, if its completion rate falls under that percentage, this alarm is issued. The default minimum completion rate is 20%; below this, it is almost certain that something is wrong, but with very good links 40% may already mean trouble.
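The decision itself is a single comparison. The sketch below is a hypothetical Python rendering of it, using the 20% default and the "more than 5 calls" rule from the Sample section; the function and argument names are assumptions.

```python
def low_completion_rate(good_calls: int, total_calls: int,
                        wd_min_cr: float = 20.0, min_calls: int = 5) -> bool:
    """True when the completion rate of the sample is under WD_MIN_CR percent."""
    if total_calls <= min_calls:
        return False  # too few calls in the sample to judge
    rate = 100.0 * good_calls / total_calls
    return rate < wd_min_cr
```

With a full sample of 50 calls, 9 good calls (18%) trigger the alarm at the default setting, while 10 good calls (20%) do not. Raising the threshold to 40 for a very good link makes 15 good calls out of 50 (30%) an alarm condition.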

2 - Test page

This is not an actual alarm, just an alarm message requested from a GW_Monitor to test the paging system.

3 - Average Gap over max specified

The average is taken over the "Good" calls in the sample, that is, as explained above, among the last 50 calls to or from that IP or DS0 group. A "Good" call is one that connected and has a duration greater than 0 seconds. If there are more than 5 good calls in the sample and the average Gap/Play ratio is more than 5 times that specified in GW_Relay.ini (PLAY/GAP_RATIO_MIN), the alarm is triggered. You may find it odd that a 5% Gap/Play ratio is specified as 20, its inverse. The reason is that an integer parameter is preferred over a floating point one: 20 is better than 0.05. Even if you multiply by 100 to make it a percentage, a Play/Gap ratio of 22 is still a Gap/Play of 5%; the typical values are such that the inverse ratio describes them more accurately.
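The arithmetic behind the inverse parameter is easy to get backwards, so here is a small Python sketch of the check. It is an illustration under assumptions: the function name and the (gap_ms, play_ms) pairs are hypothetical, and it averages the per-call Gap/Play ratios, which this section does not spell out.

```python
def gap_alarm(good_calls, play_gap_ratio_min: int,
              factor: float = 5.0, min_good: int = 5) -> bool:
    """`good_calls` holds (gap_ms, play_ms) pairs for the good calls in the sample."""
    ratios = [gap / play for gap, play in good_calls if play > 0]
    if len(ratios) <= min_good:
        return False  # needs more than 5 good calls
    average = sum(ratios) / len(ratios)
    allowed = 1.0 / play_gap_ratio_min  # e.g. 20 -> 0.05, a 5% Gap/Play
    return average > factor * allowed
```

With a setting of 20, the allowed Gap/Play is 0.05 and the alarm threshold is 0.25, so six good calls averaging a Gap/Play of 0.3 trigger the alarm while six averaging 0.2 do not.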

4 - Gateway not responding

This alarm is issued when the GW_Relay loses contact with the Gateway. This can happen if the network link goes down, if the gateway is reset and its password changed and, of course, if the gateway goes down. When this happens, the GW_Relay keeps attempting to start a Telnet session every ALIVE_TO, and after every ten attempts the app is reset (just in case).

5 - Average Latency over max specified

The average is taken over the "Good" calls in the sample, that is, as explained above, among the last 50 calls in that IP or DS0 group. A "Good" call is one that connected and has a duration greater than 0 seconds. The Round Trip Time (RTT), reported by the gateway on every call, is used as the criterion for latency. If there are more than 5 good calls in the sample and the average latency is three times that specified in GW_Relay.ini (RTT_MAX), the alarm is triggered.
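A minimal Python sketch of the latency check follows. The function and argument names are hypothetical, and it assumes that RTT values are in milliseconds and that an average of exactly three times RTT_MAX already counts; the section does not state either.

```python
def latency_alarm(rtts_ms, rtt_max_ms: float,
                  factor: float = 3.0, min_good: int = 5) -> bool:
    """`rtts_ms` holds the RTT reported for each good call in the sample."""
    if len(rtts_ms) <= min_good:
        return False  # needs more than 5 good calls
    average = sum(rtts_ms) / len(rtts_ms)
    return average >= factor * rtt_max_ms
```

With RTT_MAX set to 300, six good calls averaging 900 ms trigger the alarm, six averaging 899 ms do not, and five calls are never enough regardless of their latency.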

6 - Too many bad calls in a row

This triggers an earlier alarm than the completion rate one: if there are more than a specified number (WD_BAD_IN_A_ROW) of bad calls in a row to an IP or DS0 group, the alarm is triggered. The completion rate alarm may not respond as fast as this. Besides, this alarm indicates total failure, while the completion rate alarm may indicate only a partial one. For instance, traffic congestion may trigger a completion rate alarm but seldom this one. On the other hand, a link going down will definitely trigger this one, and sooner than the completion rate alarm.
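Counting consecutive bad calls needs only one counter per IP address or DS0 group. The Python sketch below is hypothetical (the class and method names are not taken from the product) and assumes that any good call resets the run.

```python
class BadRunCounter:
    """Counts consecutive bad calls for one IP address or DS0 group."""

    def __init__(self, wd_bad_in_a_row: int) -> None:
        self.limit = wd_bad_in_a_row
        self.run = 0

    def record(self, good: bool) -> bool:
        """Record one call; return True when the bad run exceeds the limit."""
        self.run = 0 if good else self.run + 1
        return self.run > self.limit
```

With a limit of 3, the fourth bad call in a row is the first to report an alarm condition, and a single good call starts the count over. This is why a link that goes down trips this check after a handful of calls, long before a 50-call completion rate sample has filled with failures.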

8 - Low traffic/ Traffic resumed

A human operator can detect this condition easily, but it is tricky for an automatic watchdog. To automate something, you need to define it in precise terms. So what is low traffic? Should we define it in terms of the active call count? Then what if some node in the network got stuck and the active calls we see are just hung trunks? Traffic is best described by activity: calls being dropped and set up. Activity seems like a good criterion, but now we need a parameter to describe it. A good one is the average time between calls: a small average time shows high activity and a larger one shows lower activity. We can calculate this time as the difference between the oldest call in the sample and the most recent one, divided by the sample size. If the time since the last call becomes too many times greater than that average, something must have happened to the traffic, and a low traffic alarm (number 8) is due.

How many is too many? You specify this with the ini file parameter WD_AVG_TIME_TOLERANCE; by default the tolerance is 7. With this value the algorithm copes with most daily traffic behaviors. If your traffic is fairly smooth throughout the day, a smaller tolerance will respond faster to an interruption; on the other hand, if it drops sharply at a certain time, you can expect false alarms unless you increase the tolerance.

This sequence of alarms follows the same repeat policy as all the rest. A traffic resumed alarm is issued as soon as normal traffic activity is detected again.
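The low traffic test described in this section can be written out as a short Python sketch. The function and argument names are hypothetical; timestamps are assumed to be in seconds, and the span is divided by the number of calls in the sample, as the text describes.

```python
def low_traffic(call_times, now: float, tolerance: float = 7.0) -> bool:
    """`call_times` are the timestamps of the calls in the sample, oldest first."""
    if len(call_times) < 2:
        return False  # no meaningful average with fewer than two calls
    # Average time between calls: sample span divided by the sample size.
    average_gap = (call_times[-1] - call_times[0]) / len(call_times)
    idle = now - call_times[-1]  # time since the most recent call
    return idle > tolerance * average_gap
```

For a full sample of 50 calls arriving 10 seconds apart, the average is 9.8 seconds, so with the default tolerance of 7 the alarm becomes due once about 69 seconds pass without a call. Raising the tolerance to 10 stretches that to 98 seconds, which is the adjustment suggested above for traffic that drops sharply at certain hours.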