Special Features:
The Watchdog
One of the problems long distance providers fear most is the so-called hung trunk: a call that has finished but, for some reason, is still holding one or both of the channels involved. The cause may be equipment failure or human error (the caller failed to hang up), but the result is equally disastrous in either case. Not only does the hung trunk use up valuable resources, it also produces huge bills that upset clients and generate distrust.
OmniBox includes a feature that keeps testing every channel in use for inconsistencies, for example an outbound channel routed to an inbound channel that is disconnected or already routed to a different outbound channel. Inconsistencies shouldn't happen, of course, but a switch scenario is very complex, states can follow a myriad of different paths, and one cannot be too careful. If an outbound channel passes the inconsistency test but has been connected for more than a specified time (StartAnalysis in table dia_InCh), the real hung trunk analysis starts.
OmniBox will spin a thread for each outbound channel to be analyzed for a hung trunk. The process goes like this: a voice resource is requested to listen to the channel and determine what is going on. If voice or fax is heard, the call is OK; the resource is returned and the thread waits 15 seconds before testing again. But if:
· busy or "off hook" tones are detected for more than 8 seconds (nobody listens to a busy tone for that long), the call is truncated;
· ringing is heard for more than 30 seconds, it is considered too much and the call is also truncated;
· SIT tones are heard, the inbound channel is checked and, if it has been silent for more than 10 seconds, the process is repeated after 15 seconds; if the same thing happens, the call is truncated;
· silence for more than 10 seconds is encountered, the inbound channel is checked and, if it is also silent, the process is repeated after 15 seconds; if both channels are still silent, the call is truncated.
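The decision rules listed above can be sketched in a few lines of Python. This is only an illustration of the logic as described, not OmniBox code: the `Observation` type, the `analyze` function and the result strings are hypothetical names, and the tone detection itself (done by the voice resource) is assumed to have already happened.

```python
from dataclasses import dataclass
from typing import Optional

BUSY_MAX = 8       # seconds of busy / off-hook tone tolerated
RING_MAX = 30      # seconds of ringing tolerated
SILENCE_MAX = 10   # seconds of silence tolerated


@dataclass
class Observation:
    kind: str                     # "voice", "fax", "busy", "offhook", "ring", "sit" or "silence"
    seconds: float                # how long it lasted on the outbound channel
    inbound_silence: float = 0.0  # seconds of silence measured on the inbound side


def _looks_dead(obs: Observation) -> bool:
    # SIT tones or long silence outbound, combined with long silence inbound.
    outbound_dead = obs.kind == "sit" or (obs.kind == "silence" and obs.seconds > SILENCE_MAX)
    return outbound_dead and obs.inbound_silence > SILENCE_MAX


def analyze(first: Observation, retest: Optional[Observation] = None) -> str:
    """Return "keep", "retest" or "truncate" for one analysis pass."""
    if first.kind in ("busy", "offhook") and first.seconds > BUSY_MAX:
        return "truncate"
    if first.kind == "ring" and first.seconds > RING_MAX:
        return "truncate"
    if _looks_dead(first):
        if retest is None:
            return "retest"       # caller waits 15 s and listens again
        return "truncate" if _looks_dead(retest) else analyze(retest)
    return "keep"                 # voice, fax, or nothing conclusive yet
```

A caller would invoke `analyze` once, and only if it answers "retest" take a second observation 15 seconds later and pass both in.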
There are cases where a caller's disconnect does not generate the signal needed to terminate the call; the hung trunk detector can then work as a "hang-up" detector as well. The only difference is that for hung trunk detection the analysis may start after 5-10 minutes, while for hang-up detection this must be brought down to 1-2 minutes. The only cost is that hang-up detection uses more voice resources and CPU time than hung trunk detection.
Because this feature is rather sophisticated, it frequently becomes the prime suspect in the "calls are being cut" trouble ticket. The OmniBox suite provides a means of proving its innocence: an action in the Monitor menu disables the feature. If the problem is still there, and believe me, it usually is, then all the "serial call killer" charges must be dropped.
Channels doing IVR functions always own a voice resource while they are active. This voice resource is used for prompting and getting digits. Even after prompting is over and a conversation is in progress, the voice resource is still waiting for a digit (## terminates a call and returns the caller to the menu). If the hung trunk detector were to request a resource for analysis, the resource administrator would say “You already got one, use it!”, but using it would truncate whatever was being played back or any get-digit operation in progress. So the hung trunk detector cannot be allowed to analyze channels doing IVR functions.
Hung trunk analysis hits a similar problem with channels that have attached resources, such as analog lines or E1 time slots using R2MF. Their attached resources cannot be borrowed because they are needed all the time, which is why they are attached in the first place.
The rule is simple: if the channel already owns a resource, skip the analysis. This means that the hung trunk method is limited to resource-less channels.
However, OmniBox has developed a defense against IVR hung trunks. Since prompt playback and digit collection operations are subject to specified termination conditions, OmniBox uses these as an alternative hung trunk detector. The terminating conditions specified by the OmniBox IVR are:
· If a busy tone, as defined in the CDP file (see chapter 3, dia_V_Boards), is detected while waiting for a digit, the call is dropped. This is especially useful when calls originate in an FXO channel bank that generates no signaling upon disconnect but uses a busy cadence as a disconnection tone instead.
· A fax tone produces an early return without a digit; this can trigger special functions in the IVR to deal with faxes.
· A DTMF detection terminates the get-digit or playback operation and returns with a digit that produces the corresponding state change in the IVR state machine.
· If the channel has remained silent for MAX_TIME - 0.1 s, a “strike” counter is incremented. The same counter is incremented by the hung trunk detector if it finds silence (or SIT tones) in the outbound channel. If the hung trunk detector finds this counter over a hard-coded internal threshold, it truncates the call.
· If none of the above is detected, not even silence, the function returns normally after MAX_TIME (or at the end of playback) with no digits and no increment to the “strike” counter.
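As a rough illustration of how these termination conditions and the strike counter fit together, here is a minimal Python sketch. The names `IvrChannel`, `on_termination` and `should_truncate` are hypothetical, and `STRIKE_THRESHOLD = 3` is an assumed value: the text only says the real threshold is hard coded inside OmniBox.

```python
from dataclasses import dataclass
from typing import Optional

STRIKE_THRESHOLD = 3  # assumed value; the real threshold is internal to OmniBox


@dataclass
class IvrChannel:
    strikes: int = 0


def on_termination(ch: IvrChannel, event: str, digit: Optional[str] = None) -> str:
    """Map the reason a get-digit or playback operation ended to an IVR action."""
    if event == "busy":
        return "drop_call"            # busy cadence used as a disconnect tone
    if event == "fax":
        return "fax_handler"          # early return without a digit
    if event == "dtmf":
        return "digit:" + str(digit)  # drives the IVR state machine
    if event == "silence":
        ch.strikes += 1               # silent for MAX_TIME - 0.1 s
    return "no_digit"                 # timeout or end of playback


def should_truncate(ch: IvrChannel) -> bool:
    # Checked by the hung trunk detector, which also adds strikes of its own.
    return ch.strikes > STRIKE_THRESHOLD
```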
The flow diagram for the hung trunk detector:
There is no such thing as a failure-proof communication system. So much can go wrong in a complex system that the best you can do is “be the first to know”, and know it fast. That is what the Watchdog feature is all about: it pages everyone on a list when something abnormal happens.
(The Watchdog feature in version 3+ of the OmniMonitor makes the approach in this article somewhat obsolete; please check E-mail Paging in chapter 6.)
The pages can be sent through a modem whose COM port is specified in the environment variable WD_COMPORT. A positive value, like 1, 2, etc., is interpreted as an actual modem connected to that COM port on the computer where the Watchdog is running; a negative value means the analog board in the OMNIBOX is used for paging. Saving a modem is the only advantage of the latter approach, though, since it is more reliable to have the Watchdog watch from a different computer in the network, just in case the OMNIBOX computer goes down.
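The sign convention for WD_COMPORT can be illustrated with a short Python sketch. The function name `paging_device` and the returned strings are made up for the example, and the handling of a missing or zero value is an assumption, since the text only covers positive and negative values.

```python
import os


def paging_device(env=None) -> str:
    """Interpret WD_COMPORT the way the text describes."""
    env = os.environ if env is None else env
    port = int(env.get("WD_COMPORT", "0"))
    if port > 0:
        return "modem on COM%d" % port   # modem attached to the Watchdog computer
    if port < 0:
        return "OMNIBOX analog board"    # page through the switch itself
    return "unspecified"                 # zero or unset: not covered by the text
```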
The Watchdog is an independent application running on a computer, preferably a different one from the computer running OMNIBOX. OMNIBOX notifies the Watchdog of any abnormality through UDP over Windows Sockets.
The following page types are currently defined:
Description | Num Message | Parameters
T1 alarm | 0*B*C*E | B = Board Number; C = Alarm code; E = WD Engine ID (for all messages)
Low Completion Rate | 1*Cr*E | Cr = Completion Rate %
Test page | 2*E | Ch = Channel Number
NT event Log had an entry | 3*Ev*E | Ev = Event ID
OmniBox is dead | 4*E |
Too many bad calls in outbound Ch. | 5*Ch*E | Ch = Channel Number
Too many bad calls in an outbound range | 6*Rng*E | Rng = Range ID
Too many short calls | 7*Sh | Sh = % rate of how many of the good calls are short
Low traffic | 8*Exp*E | Exp = Expected seconds to the next call
System unloaded | 9*E |
Exception in the Watchdog app | 10*Asc | Asc = ASCII code for alarm ID that caused the exception
Too many non routed Calls | 11*D*E | D = Domain ID
More than 20 replication failures | 12*E |
Seized more than 3 times being excluded | 13*Ch*E | Ch = Inbound Ch
The first number identifies the message type; the rest are parameters whose meaning depends on the message type. Channel and board numbers are 1-based. The ‘*’ may show as a dash or a space on some pagers. For example, a lost sync T1 alarm (code 10) at board 3 in engine 0 will show on the pager like:
0-03-10-0
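For readers who want to decode pages by program rather than by eye, the following Python sketch splits a numeric message on any of the separators a pager might show and labels the fields using the page type table. The `PAGE_TYPES` dictionary and the `parse_page` function are illustrative names, not part of the OmniBox suite.

```python
import re

# Message type -> (description, parameter names), transcribed from the page type table.
PAGE_TYPES = {
    0: ("T1 alarm", ("B", "C", "E")),
    1: ("Low completion rate", ("Cr", "E")),
    2: ("Test page", ("E",)),
    3: ("NT event log had an entry", ("Ev", "E")),
    4: ("OmniBox is dead", ("E",)),
    5: ("Too many bad calls in outbound channel", ("Ch", "E")),
    6: ("Too many bad calls in an outbound range", ("Rng", "E")),
    7: ("Too many short calls", ("Sh",)),
    8: ("Low traffic", ("Exp", "E")),
    9: ("System unloaded", ("E",)),
    10: ("Exception in the Watchdog app", ("Asc",)),
    11: ("Too many non routed calls", ("D", "E")),
    12: ("More than 20 replication failures", ("E",)),
    13: ("Seized more than 3 times being excluded", ("Ch", "E")),
}


def parse_page(text: str) -> dict:
    """Split a page such as '0*3*10*0' or '0-03-10-0' into named fields."""
    parts = [int(p) for p in re.split(r"[*\- ]+", text.strip()) if p]
    description, names = PAGE_TYPES[parts[0]]
    fields = {"type": description}
    fields.update(zip(names, parts[1:]))
    return fields
```

With this sketch, `parse_page("0-03-10-0")` yields a T1 alarm with B = 3, C = 10 and E = 0.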
The Watchdog has an engine ID, normally the ID of the engine it is watching plus 100. The criteria for lows, highs, maximums and minimums are read by OmniBox from dia_InCh and dia_OutCh. Port numbers are read from the sys_Parameter table in a database pointed to by the WatchdogSrc ODBC source. Normally it is a local MS JET engine file named Watchdog.mdb. Here is an example for the Watchdog of Engine 10:
ParamID | ParamNumValue | ParamStringValue | Description | EngID
8 | 1028 | | Socket port number | 110
9 | 1026 | | Engine socket number for WD. | 110
The people to be paged are stored in the WD_Pagees table:
PagerNumber | ExcludedMsgs | PageeName
99999999 | 11*5* | Controller1
88888888 | | Administrator
Not every page type has to reach every pagee. Any numeric message whose prefix is listed under ExcludedMsgs will not be sent to that pagee; the prefixes are separated by “*” or any other suitable separator such as a space, comma, colon, etc. In the example above, Controller1 won’t get news on too many non-routed calls (11*) or too many bad calls in an outbound channel (5*). Also, the Watchdog won’t send the same numeric message twice in a ten-minute span, or issue more than 3 pages in the same time span.
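The exclusion list and the two throttling rules can be sketched in Python as follows. This is an illustration under stated assumptions, not the Watchdog's actual code: `is_excluded` and `PageThrottle` are hypothetical names, the exclusion match is assumed to be on the message type (the first field), and the ten-minute span is treated as a sliding window.

```python
import re
from collections import deque

WINDOW = 600    # ten minutes, in seconds
MAX_PAGES = 3   # pages allowed per window


def is_excluded(message: str, excluded_msgs: str) -> bool:
    """True if the message type appears in the pagee's ExcludedMsgs field."""
    prefixes = [p for p in re.split(r"[*\s,:;]+", excluded_msgs) if p]
    return message.split("*")[0] in prefixes


class PageThrottle:
    def __init__(self, window: float = WINDOW, max_pages: int = MAX_PAGES):
        self.window = window
        self.max_pages = max_pages
        self.sent = deque()  # (timestamp, message) of recent pages

    def allow(self, message: str, now: float) -> bool:
        # Forget pages that have left the window.
        while self.sent and now - self.sent[0][0] >= self.window:
            self.sent.popleft()
        if len(self.sent) >= self.max_pages:
            return False     # already 3 pages in the last ten minutes
        if any(m == message for _, m in self.sent):
            return False     # same numeric message sent recently
        self.sent.append((now, message))
        return True
```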
The 0 Page
The Alarm Codes will be:
0x00 | 0 | Out of frame error; count saturation.
0x01 | 1 | Initial loss of signal detection.
0x02 | 2 | Driver performance monitor.
0x03 | 3 | Bipolar violation count saturation.
0x04 | 4 | Error count saturation.
0x05 | 5 | Receive yellow alarm.
0x06 | 6 | Receive carrier loss.
0x07 | 7 | Frame bit error.
0x08 | 8 | Bipolar eight zero substitution dtct.
0x09 | 9 | Receive blue alarm.
0x0A | 10 | Receive loss of sync.
0x0B | 11 | Got a red alarm condition.
If the condition is restored, 16 (0x10) is added to the code, that is:
0x10 | 16 | Restored out of frame error; count saturation.
0x11 | 17 | Restored initial loss of signal detection.
0x12 | 18 | Restored driver performance monitor.
0x13 | 19 | Restored bipolar violation count saturation.
0x14 | 20 | Restored error count saturation.
0x15 | 21 | Restored receive yellow alarm.
0x16 | 22 | Restored receive carrier loss.
0x17 | 23 | Restored frame bit error.
0x18 | 24 | Restored bipolar eight zero substitution dtct.
0x19 | 25 | Restored receive blue alarm.
0x1A | 26 | Restored receive loss of sync.
0x1B | 27 | Restored got a red alarm condition.
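Because the restored codes are simply the alarm codes plus 16, a single lookup table is enough to decode the C parameter of a 0 page. The sketch below shows one way to do it in Python; `ALARMS` and `decode_alarm` are illustrative names.

```python
# Alarm codes transcribed from the table of the 0 page.
ALARMS = {
    0x00: "Out of frame error; count saturation",
    0x01: "Initial loss of signal detection",
    0x02: "Driver performance monitor",
    0x03: "Bipolar violation count saturation",
    0x04: "Error count saturation",
    0x05: "Receive yellow alarm",
    0x06: "Receive carrier loss",
    0x07: "Frame bit error",
    0x08: "Bipolar eight zero substitution dtct",
    0x09: "Receive blue alarm",
    0x0A: "Receive loss of sync",
    0x0B: "Got a red alarm condition",
}


def decode_alarm(code: int):
    """Return (description, restored) for the C parameter of a 0 page."""
    restored = code >= 0x10          # 16 is added when the condition clears
    base = code - 0x10 if restored else code
    return ALARMS[base], restored
```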
The 1 Page
To calculate the completion rate, the Watchdog counts good calls and bad calls (busy, no answer, no ring back or no dial tone). If the ratio of good calls to total calls falls below the value specified in the sys_Parameters table (parameter 9), the page procedure is fired. If the counters kept counting forever, the method would become insensitive to an abnormal situation, so both counters are reset to half their count when a maximum is reached. This keeps the sample small enough to be sensitive but large enough to be statistically significant. The compromise depends on traffic, so the maximum must be set for each range in the WD_SampleSize field of table dia_OutCh.
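The halving trick can be illustrated with a small Python sketch. The class and function names are hypothetical, and details such as whether the real Watchdog halves before or after evaluating the rate are not stated in the text, so treat this as one plausible reading.

```python
class CompletionCounter:
    """Good/bad call counters that are halved when the total reaches max_count."""

    def __init__(self, max_count: int):
        self.max_count = max_count   # plays the role of WD_SampleSize
        self.good = 0
        self.bad = 0

    def record(self, good: bool) -> None:
        if good:
            self.good += 1
        else:
            self.bad += 1
        if self.good + self.bad >= self.max_count:
            # Keep the ratio but shrink the history so new calls weigh more.
            self.good //= 2
            self.bad //= 2

    def rate_pct(self) -> float:
        total = self.good + self.bad
        return 100.0 * self.good / total if total else 100.0


def should_page(counter: CompletionCounter, threshold_pct: float) -> bool:
    return counter.rate_pct() < threshold_pct
```

Halving preserves the current ratio while letting recent calls move it faster than they would against an ever-growing total.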
The 2 Page
OmniBox will issue a test page upon a command from the Monitor.
The 3 Page
Throughout the OMNIBOX code there are quite a few traps for abnormal situations, which are logged to the NT Event Log. Whenever something is logged, the Watchdog is notified and pages are issued. The Event ID numbers follow this convention:
App Level events 0 - 9
Data Interface 10 - 99
Pool level events 100 - 999
Ch events X000 + Ch#(0 - MaxCh)
DTI events 30000 - 30999
VOX events 31000 - 35999
Specifics 36000 - 37999
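The numbering convention above lends itself to a simple range lookup. The Python sketch below is an illustration only: `classify_event` and `split_channel_event` are hypothetical helpers, the channel range is taken to be 1000 to 29999 (everything between the pool events and the DTI events), and the offset is assumed to be smaller than 1000.

```python
# (first ID, last ID, category), following the convention listed above.
RANGES = [
    (0, 9, "App level"),
    (10, 99, "Data interface"),
    (100, 999, "Pool level"),
    (1000, 29999, "Channel"),      # X000 + Ch#
    (30000, 30999, "DTI"),
    (31000, 35999, "VOX"),
    (36000, 37999, "Specifics"),
]


def classify_event(event_id: int) -> str:
    for first, last, name in RANGES:
        if first <= event_id <= last:
            return name
    return "Unlisted"              # outside the ranges given in the convention


def split_channel_event(event_id: int):
    """Split an X000 + Ch event into (base, channel or board offset)."""
    return (event_id // 1000) * 1000, event_id % 1000
```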
The actual table follows:
Event ID | Description
0 | System unloaded
1 | System loaded
3 | Could not read sys_Parameters, defaults in effect
4 | Exception at InitRecChannels AswSup not loaded
5 | No database connection
10 | No link to database could be established
11 | Exception opening database
12 | Database error (error message logged)
13 | Access violation in database thread
14 | Error in select query
15 | Timeout in select query
16 | Error opening cursor
17 | Time out opening cursor
18 | Error in get record
19 | Timeout in GetRecord
20 | Error in action query
21 | Timeout in action query
100 | Could not register any tones!
101 | Socket on Receive error
103 | SetSockOpt socket failed
110 | RAS Socket created
200+ev | Unknown alarm ev
400 | Exception in Analog Thread.
430+B | Exception doing whole T number B.
460+B | Exception in alarm event handler on board B.
500 | T1 alarm on board b
1000+Ch | Channel Ch init failed!
2000+Ch | Time slot for Ch open fail!
3000+Ch | Error in dt_setsigmode
7000+b | T1 alarm mask could not be set on board b
8000+Ch | Signal event mask could not be set on Channel Ch
9000+b | Board b open failed;
13000+Ch | Channel Ch failed to set hook state
14000+Ch | Event mask with Winkwait could not be set on Ch
15000+Ch | Event mask with Winkwait could not be reset on Ch
16000+Ch | Wink failed on Ch
30000+Ch | Receiving thread Exception
31000+v | Voice resource v failed init.
31300+Ch | Voice resource routing to Ch failed
31600+v | Voice resource v open failed!
31900+v | blddt failed on voice resource v
32200+v | Add double tone to voice resource v failed
32500+v | bldst failed on voice resource v
32800+v | Add single tone to voice resource v failed
33100+v | Init call Perfect failed on v
33380+Th | Exception in playtone thread # Th
34000+rs | Voice resources exhausted of type rs
34010+rs | Exception while getting voice resource of type rs
34020+rs | Timeout waiting for voice resource type rs at the mutex
34030+rs | Timeout while getting voice resource of type rs
34299 | Exception in PostDial delay thread
36000+Ch | Channel Ch tested good and is back in service
36001+Ch | Call ID for Channel Ch returned < 0!
36300+Ch | Channel Ch tested bad and has set out of service
36800+Ch | Call on channel Ch could not get logged
36900+Ch | Unknown event received on channel Ch
37200+Ch | Failed to set Hook State to Ch at event handler
37500+Ch | Exception in event handler
37800+Ch | Exception in ReceiveProc for channel Ch
38000+Ch | While preparing resume Call on Ch
38999 | Commit Db Changes was hit %d (outB), %d (InB)
39000+Ch | Exception in MarkAsDead
*There are two entries that are excluded from paging:
500 – T1 alarms are reported directly to the Watchdog with numeric information absent in the Event ID.
0 – System Unload is also sent directly, to avoid interfering with the unloading process.
The events in bold and italics are the most frequently encountered.
The 4 Page
The Watchdog queries the OMNIBOX every minute. If the application is hung it won’t respond, and the Watchdog will then fire the 4 page procedure.
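The liveness check amounts to "ask, and page if nobody answers". A minimal Python sketch, assuming the actual UDP query is wrapped in a caller-supplied function (the wire format of the query is not described here, so none is invented):

```python
from typing import Callable, Optional


def check_once(query: Callable[[], bool], engine_id: int) -> Optional[str]:
    """Run one liveness query; return the page to send, or None if OMNIBOX answered.

    `query` is a hypothetical callable that returns True when a reply arrived
    and may raise OSError (which includes socket timeouts) when it did not.
    """
    try:
        answered = query()
    except OSError:
        answered = False
    return None if answered else "4*%d" % engine_id
```

A scheduler would call `check_once` once a minute and hand any returned message to the paging procedure.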
The 5 Page
Each outbound channel has a counter of bad calls (busy, no answer, no ring back or no dial tone) that is reset on every good call. If the count goes over the WD_MaxChBadInARow field in dia_OutCh, the 5 page procedure is triggered.
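A consecutive-failure counter of this kind is easy to sketch in Python. The class name is hypothetical, and "over" is read here as strictly greater than WD_MaxChBadInARow.

```python
class BadCallCounter:
    """Counts bad calls in a row on one outbound channel."""

    def __init__(self, max_bad_in_a_row: int):
        self.max_bad = max_bad_in_a_row   # WD_MaxChBadInARow from dia_OutCh
        self.count = 0

    def record(self, good: bool) -> bool:
        """Record one call; return True when the 5 page should be triggered."""
        if good:
            self.count = 0                # any good call clears the streak
            return False
        self.count += 1
        return self.count > self.max_bad
```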
The 8 Page
When there is no incoming call for more than n times the ETBC (Expected Time Between Calls), the page is triggered. The number of times, n, is read from WD_Tolerance in dia_InCh. The ETBC is the average time between calls over the last m calls, where m is WD_SampleSize in the dia_InCh table. The ETBC is also corrected by the trend, the trend being the change in this average over the last 2*m calls. The higher m is, the more statistically stable the estimate and the less likely a false alarm on a fluctuation; but if m is too high the method may become insensitive. The number should be set as high as the traffic volume allows.
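The text does not give the exact formula for the trend correction, so the Python sketch below makes an explicit assumption: the trend is the difference between the average gap over the most recent m gaps and the average over the m gaps before them, and the corrected ETBC is the recent average plus that trend. The function names are hypothetical.

```python
def etbc(call_times, m: int) -> float:
    """Expected time between calls from ascending call timestamps (seconds).

    Assumption: corrected ETBC = recent average + (recent average - older average).
    Needs at least two timestamps; the trend is ignored until 2*m gaps exist.
    """
    gaps = [b - a for a, b in zip(call_times, call_times[1:])]
    recent = gaps[-m:]
    avg_recent = sum(recent) / len(recent)
    older = gaps[-2 * m:-m]
    trend = avg_recent - sum(older) / m if len(older) == m else 0.0
    return avg_recent + trend


def low_traffic(call_times, now: float, m: int, tolerance: float) -> bool:
    """True when the silence since the last call exceeds tolerance * ETBC."""
    return now - call_times[-1] > tolerance * etbc(call_times, m)
```

Here `m` stands for WD_SampleSize and `tolerance` for WD_Tolerance, both from dia_InCh.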
The Log_Pages Table
A log entry for each page issued is stored in the Log_Pages table. This is useful to find the cause when something goes wrong with the pages, to check whether pages of a certain type were issued upon an event that is known to have happened, or to count how many pages of some type were issued in a given time interval.
An example of the table follows:
PageID | PageTime | PageeID | PageMsg | PageResult
464 | 1/3/00 7:27:02 AM | 3077208 | 8*220*0 | VCOM
463 | 8/9/99 8:19:02 AM | 3077208 | 8*111*0 | VCOM
462 | 8/6/99 3:01:33 PM | 3077208 | 3*1*0 | VCOM
461 | 8/6/99 1:57:41 PM | 3077208 | 8*1257*0 | VCOM
460 | 8/6/99 1:37:01 PM | 3077208 | 3*1*0 | NO DIALTONE
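As an example of the kind of question the log can answer, the Python sketch below counts pages of a given type within a time interval, using the sample rows above. It works on rows already fetched from the table; the `count_pages` helper is hypothetical, and the PageTime strings are assumed to be in US month/day/year order.

```python
from datetime import datetime

FMT = "%m/%d/%y %I:%M:%S %p"   # assumed month/day/year layout of PageTime

# (PageID, PageTime, PageeID, PageMsg, PageResult), copied from the example table.
ROWS = [
    (464, "1/3/00 7:27:02 AM", 3077208, "8*220*0", "VCOM"),
    (463, "8/9/99 8:19:02 AM", 3077208, "8*111*0", "VCOM"),
    (462, "8/6/99 3:01:33 PM", 3077208, "3*1*0", "VCOM"),
    (461, "8/6/99 1:57:41 PM", 3077208, "8*1257*0", "VCOM"),
    (460, "8/6/99 1:37:01 PM", 3077208, "3*1*0", "NO DIALTONE"),
]


def count_pages(rows, page_type: int, start: datetime, end: datetime) -> int:
    """Count pages whose message type is page_type and start <= PageTime < end."""
    total = 0
    for _page_id, when, _pagee, msg, _result in rows:
        stamp = datetime.strptime(when, FMT)
        if msg.split("*")[0] == str(page_type) and start <= stamp < end:
            total += 1
    return total
```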
If Dialogic analog boards (D/41SC or D/160SC) are included in the system, the following features can be enabled for one or more of the analog lines:
1. Listen into a specified channel
2. Setup a call to a specified outbound channel
3. Setup a call into a route
4. Setup a call into an inbound trunk group (Domain)
These features allow the following tasks to be performed.
Tasks:
· If you want to listen to calls to test voice quality or check what is going on, use feature 1.
· To check why calls are failing or short, use feature 1.
· If you want to test a particular channel, use feature 2.
· If you want to make a phone call to actually talk to somebody, use feature 3.
· If you need to test digit processing settings, use feature 4.
First prompt:
Enter 1 to Listen to a Ch, 2 for Placing a call in a Ch, 3 to place a call in a route, 4 to place a call as a Domain, hit star to go back to previous menu
If you select 1:
Enter channel number followed by the '#' sign
As soon as you do that, you will start listening to whatever is going on in that channel. Every time you hit ‘*’, the listening is switched to the other side of the conversation (you can’t listen to both sides at the same time). If you enter a number that is below 100 and also less than the number of spans in your system, it is interpreted as a request to listen to the last call received or made in that span. A voice message tells you the channel number before connection. If you dial zero, it is interpreted as the previously specified span; if there is no previous one, you will listen to the last call made or received in the Box. If you dial a number that can’t be interpreted as a span number, you get an “Invalid” message.
If you select 2:
Enter channel number followed by the '#' sign
And then
Enter the number to be dialed followed by the '#' sign + Dial Tone
After you enter the number + #…
If a number+# is entered
Connecting
Call progress + Conversation
If just # is entered (Robbed bit or Analog)
The selected channel is seized and routed full duplex; you must then do your own dialing or whatever else is needed.
If you select 3:
Enter route index followed by the '#' sign
Given the name of the destination, the route index can be found in the IdxLookUp table; if you know the Range (or outbound trunk group) ID, you may use the RouteTable instead. The Range ID also shows in the Stats window of the Monitor as, for example, G 4.
And then
Enter the number to be dialed followed by the '#' sign + Dial Tone
After you enter the number + #…
Connecting
Call progress + Conversation
If you select 4:
Enter domain ID followed by the '#' sign
The Domain ID can be found in the dia_InCh table.
And then
Enter the number to be dialed followed by the '#' sign + Dial Tone
After you enter the number + #…
Connecting
Call progress + Conversation
If the call attempt fails, you will get a spoken message with the detailed result.