Runtime State out of sync trending issues

Hey guys,

I’m running a redundant Ignition 7.5.5 setup, and I’ve noticed that a couple of times a day this series of errors appears in the logs:

INFO | jvm 1 | 2013/02/18 14:56:17 | INFO [MasterTCPChannel ] [14:56:17,676]: Error occured while communicating with backup client. Client has likely become unavailable.
INFO | jvm 1 | 2013/02/18 14:56:17 | java.io.EOFException
INFO | jvm 1 | 2013/02/18 14:56:17 | at com.inductiveautomation.ignition.gateway.redundancy.channels.tcp.AbstractTCPChannel.readMessage(AbstractTCPChannel.java:427)
INFO | jvm 1 | 2013/02/18 14:56:17 | at com.inductiveautomation.ignition.gateway.redundancy.channels.tcp.AbstractTCPChannel.expectMessage(AbstractTCPChannel.java:482)
INFO | jvm 1 | 2013/02/18 14:56:17 | at com.inductiveautomation.ignition.gateway.redundancy.channels.tcp.TCPMasterStateChannel.runProtocol(TCPMasterStateChannel.java:161)
INFO | jvm 1 | 2013/02/18 14:56:17 | at com.inductiveautomation.ignition.gateway.redundancy.channels.tcp.AbstractTCPChannel.run(AbstractTCPChannel.java:327)
INFO | jvm 1 | 2013/02/18 14:56:17 | INFO [MasterTCPChannel ] [14:56:17,680]: (id=64) Shutting down channel.
INFO | jvm 1 | 2013/02/18 14:56:23 | INFO [MasterTCPChannel ] [14:56:23,857]: (id=65) Connected to address '/10.202.1.226'
INFO | jvm 1 | 2013/02/18 14:56:23 | INFO [MasterTCPChannel ] [14:56:23,862]: Starting redundancy channel id 65
INFO | jvm 1 | 2013/02/18 14:56:23 | INFO [MasterStateManager ] [14:56:23,863]: Successfully registered new connection to redundant master from '/10.202.1.226:60033'
INFO | jvm 1 | 2013/02/18 14:56:25 | INFO [MasterTCPChannel ] [14:56:25,585]: Reporting master start time of Sat Jan 12 06:40:10 EST 2013
INFO | jvm 1 | 2013/02/18 14:56:26 | INFO [MasterTCPChannel ] [14:56:26,088]: Peer node information has been updated: ClusterNode(address=10.202.1.226, httpAddresses=[10.202.1.226:8088/main], sessionCount=0, loadFactor=1, activityLevel=Active)
INFO | jvm 1 | 2013/02/18 14:56:29 | INFO [MasterTCPChannel ] [14:56:29,007]: Peer node information has been updated: ClusterNode(address=10.202.1.226, httpAddresses=[10.202.1.226:8088/main], sessionCount=0, loadFactor=0, activityLevel=Cold)
INFO | jvm 1 | 2013/02/18 14:56:30 | INFO [MasterTCPChannel ] [14:56:30,008]: The other node's runtime state is out of sync, and the necessary updates aren't available. A full state transfer will occur.

Perhaps this is normal, but whenever I get this runtime state out of sync error it will also put gaps and drops in almost all of the trends:


This doesn’t seem normal to me. Should it be doing this, or is there a problem here?

Hi,

From that log snippet, we can see that the backup redundancy node is disconnecting from the master for about 5 seconds, and on reconnect it was running as active (in other words, it thought the master went down… all that really means is that your failover time is probably less than 5 seconds).

I also see that the backup’s standby mode is Cold, since that’s what it goes back to. I suspect that what is happening is that the backup gets disconnected, becomes active, scan classes are enabled (including historical), and initial historical values are logged with bad quality before the device subscriptions are initialized. In your charts, bad quality creates gaps.

Don’t worry about the “runtime sync” message; that always gets reset when the backup is disconnected. The real question is why the backup is occasionally failing over.

It could be something on the network, something with system load, etc. You could try simply increasing the failover time a bit. If the system gets disconnected, it will try to reconnect, and if it can before the failover time expires, it won’t take over.

The other option, perhaps just in the short term to test this theory, is to set the backup’s historical mode to “partial”. This causes the backup to only cache data until it can confirm that the master has been down (by reconnecting, and seeing that the master’s start time is more recent than the failover). In this case, since the master isn’t going down, the cached (presumably bad) data points will be discarded on reconnect.

Hope this helps,

Also, if you look directly at the data in the db (the sqlt_data_* tables), you should see rows with dataquality!=192 for these times. If you carefully delete those rows, you can fix the gaps in the chart.
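For example, something along these lines would do it. This is just a sketch: I’m assuming the usual t_stamp column holding epoch milliseconds, and you’d substitute your actual partition table for sqlt_data_1_2013_02 and the millisecond bounds of the gap for the ones shown (those correspond roughly to 14:50–15:00 EST on 2013/02/18, the window from your log snippet):

-- Inspect the bad-quality rows around the event first
SELECT * FROM sqlt_data_1_2013_02
WHERE dataquality != 192
AND t_stamp BETWEEN 1361217000000 AND 1361217600000;

-- Once you have confirmed these are the bogus points, remove them
DELETE FROM sqlt_data_1_2013_02
WHERE dataquality != 192
AND t_stamp BETWEEN 1361217000000 AND 1361217600000;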

Regards,

Ok, I’ll try changing/checking some of those settings. I’m confused why this is happening at all though. First of all, I have the failover time set for 10 seconds, not 5 seconds:

Secondly, these machines are virtual servers at our cloud service provider’s facility. These boxes are connected to each other with 1 Gb/s fiber, from what I’m told, and I’ve been running constant pings on both of these machines; the highest ping response on either machine after one of these events is always less than 10 ms:

There must be something else going on that I’m not seeing but I don’t know what it could be…

Hi,

Thanks for the additional info. The “5 second” time frame doesn’t mean much; that’s just how long it took the backup to reconnect after the disconnection, so it might not be related to anything.

Are there any messages on the backup node during these events? Perhaps something over there can give some clue. No matter what, for some reason the backup isn’t able to communicate with the master for at least 10 seconds. If there aren’t any useful messages on the backup, perhaps I can instruct you on how to set up a separate side logger for the redundancy stuff, so we can capture all of the messages sent between both nodes. Then we can see if it is perhaps related to certain operations (for example, if the backup is asking for some update, and the master takes a long time to respond).

Regards,

I don’t see any errors that look suspicious on the Backup, but I don’t really know what I’m looking at either… I’ll include a portion of the backup log file. There was an instance this morning around 1:15 AM.

BackupLog.txt (90 KB)

Hi,

That helps establish the timing. And in the process, I’ve noticed one small oversight: the setting for how often to attempt reconnection is not currently settable through the gateway (you could edit it manually in an xml file, if you wanted).

Anyhow, here’s what we have:

  1. Periodically, the backup isn’t getting a response from the master within 3 seconds (=3*ping rate). So it disconnects.
  2. The backup will then wait the “reconnect time” before trying again. This setting defaults to 10 seconds.
  3. … but 7 seconds later the failover timer expires (the timing is broken down just after this list)
  4. So the backup takes over.
  5. About the same time that the backup takes over, the reconnect timer expires (it was working off the same “last attempt” time as the failover timer), and the backup reconnects.
  6. The backup is told to go to cold mode.
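
Putting numbers on steps 2–4 (assuming both timers are measured from the same “last attempt” point, as step 5 suggests): communication stalls at T+0; the read timeout fires at T+3 seconds (3 × the 1-second ping rate) and the backup disconnects; the 10-second failover timer, counting from T+0, expires at T+10, which is 7 seconds after the disconnect; and the 10-second reconnect timer, counting from the same point, expires at roughly the same moment, which is why the takeover and the reconnect land together.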

At the heart of “why is this happening”, the question is: why is the 3 second socket timeout happening? Likely either load or locking on the master is preventing a timely response. The backup log shows the last message sent, so I can take a look at what the master should be doing and see if anything jumps out.

The next step would be to make the reconnect time settable from the gateway, as it’s silly to have it be equal to the failover time. I would say that it should probably be set to something more like 1 or 2 seconds in your case. As I mentioned, you can try setting this manually on the backup machine by editing “redundancy.xml” in the data directory. Just look for the “redundancy.backupreconnectperiod” property. You’ll only need to do this on the backup, and you’ll have to restart the backup for it to take effect.
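
As a rough sketch of what that edit looks like (I’m going from memory here, so treat the exact format as an assumption and just match the existing entries in your redundancy.xml; I’m also assuming the value is stored in milliseconds, so 1000 would be 1 second):

<entry key="redundancy.backupreconnectperiod">1000</entry>

If the neighboring values clearly look like they’re in seconds instead, just use 1 or 2 there.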

Regards,

Ok, that makes sense. I have adjusted the setting manually in the xml file and I’ll monitor it to see if that helps the issue. Thanks Colby, as always you have all of the answers! :thumb_right:

Hey Colby,

After making the settings changes it did get better and I get fewer trending issues; however, I’m still getting a lot of communication drops between the two machines for some reason, even with the reconnect period set to 1 second. It’s also happening at random times during the day, including times when the system is very lightly loaded, like at night or very early morning.

This morning it had a real fit for some reason and filled the log with communication timeout errors. I’ll include the logs from both the Primary machine and the Backup machine. Is there a way to track down what is causing these communication errors? With the partial history setting on the backup it’s not having a big effect on trends, usually, but the instance this morning did. If possible I’d like to pin down what’s going on and get it fixed. Let me know if there is anything we can do. Thanks!

PrimaryTimeout20130228.txt (17.4 KB)
BackupTimeout20130228.txt (52.9 KB)

Hi,

There are a number of interesting things there, I’ll take a close look at it this morning and see what I can come up with.

Regards,

Ok, I don’t exactly have a lot of answers, but I do understand more about what’s going on in those logs.

First off, the heart of the problem is a mysterious gap of time (a block, pause, something) on the master, of about 17 seconds. This could be due to an operation being performed by Ignition, but I can’t really see how. Instead, it could be something in the Java VM, or in the system itself.

As for the connections “going crazy”, we can improve this a bit, but it’s more or less a result of the problem, not the cause.

Here’s what seems to happen:
7:06:01 - Backup gets a “read timeout”. This means it hasn’t been able to read from the socket in 3 seconds, indicating that the master stopped/blocked/paused at 7:05:58.
7:06:01-7:06:29 - The backup behaves correctly, attempting to reconnect every second. However, each connection runs into a read timeout, indicating that the master isn’t actually responding.
7:06:14 - The FIRST connection is reported by the master. Several connections are accepted right away, but they’ve already been closed.

7:06:29 - Things catch up/get sorted out and everything is fine.

The point: connections from the backup are queuing up for some reason from 7:05:58 until 7:06:14. After that, some slightly incorrect connection management on our side causes additional backlog, but the real question is: why can’t the master respond for 17 seconds?

This could be really tricky to track down. It could be in Ignition, in Java (the garbage collector, the networking implementation) or in the system’s network stack (the java networking stuff quickly becomes OS specific). Ideally we would start with a thread dump while it was blocked. Obviously, you’re not going to be able to do that.

I suppose we could start by trying to confirm or eliminate one potential element that could lead to this, the java garbage collector. It’s a long shot, but there have been some reports of “pauses”, and it could cause this behavior. Frankly, if this were the cause, I would probably expect other messages as well in the console, but who knows.
There is some logging that you can turn on. Unfortunately it has to be done in ignition.conf (in the data directory), and will require restarting the master. You can add the following lines in the area for “Java Additional Parameters”:

wrapper.java.additional.8=-XX:+PrintGCDetails
wrapper.java.additional.9=-XX:+PrintGCTimeStamps
Depending on what you already have there, you may need to modify the numbers 8 & 9.
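
In context, the end of that section would look roughly like this (the comment stands in for whatever wrapper.java.additional.N entries you already have; keep those as-is and just continue the numbering after the highest one):

# Java Additional Parameters
# ...existing wrapper.java.additional.N entries stay as they are...
wrapper.java.additional.8=-XX:+PrintGCDetails
wrapper.java.additional.9=-XX:+PrintGCTimeStamps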

That will print out details about when garbage collection occurs. When this problem happens again, we can see if anything is reported nearby.

Also, and I’m sure you’ve laid this out in a different thread, so sorry… but could you perhaps describe the setup of the master? OS version, whether it’s a virtual machine, etc.

Regards,

Thanks for looking into this, Colby.

You may be onto something with the Java garbage collection. I trend the CPU and memory usage, and when I went back and looked at that time frame, something does seem to have happened then. I never thought to try to correlate those two things, but next time I see it in the logs I’ll check the trend as well and see if there is a pattern there:

I will check/change the ignition.conf file to do the additional logging, but I probably won’t be able to restart the gateway until the weekend.

Both servers are virtual machines. I’m not sure of the actual physical hardware, as they are hosted by our cloud server provider off-site, but I’ve been told that the whole facility is top of the line. They are running Windows Server 2008 R2 Standard, Service Pack 1, 64-bit. I’ll just include a screenshot from the system display:

Ah, that’s very interesting!

Yes, if we can get some conclusive evidence that it’s the garbage collector, we can look into tuning it, or switching it over to use the newer (and less pause-prone) “G1” collector.
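
For reference, switching collectors is just another JVM argument in ignition.conf. A minimal, untested sketch (the index depends on what’s already in your file, and this assumes a Java version where G1 is actually available):

# Use the G1 garbage collector
wrapper.java.additional.10=-XX:+UseG1GC
# On Java 6, where G1 is still experimental, you would also need:
# wrapper.java.additional.11=-XX:+UnlockExperimentalVMOptions

But let’s confirm the garbage collector is actually the culprit first.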

Regards,

I think a pattern is starting to emerge…

INFO | jvm 1 | 2013/03/01 05:50:33 | INFO [Projects$ProjectChangeMonitor ] [05:50:33,809]: Starting up client project monitor. project=Plant_Monitor, uuid=62067afd-454a-e284-6278-3af7586f3453, editCount=2006, scope=4, version=Published
INFO | jvm 1 | 2013/03/01 06:32:06 | INFO [MasterTCPChannel ] [06:32:05,871]: Error occured while communicating with backup client. Client has likely become unavailable.
INFO | jvm 1 | 2013/03/01 06:32:06 | java.io.EOFException
INFO | jvm 1 | 2013/03/01 06:32:06 | at com.inductiveautomation.ignition.gateway.redundancy.channels.tcp.AbstractTCPChannel.readMessage(AbstractTCPChannel.java:427)
INFO | jvm 1 | 2013/03/01 06:32:06 | at com.inductiveautomation.ignition.gateway.redundancy.channels.tcp.AbstractTCPChannel.expectMessage(AbstractTCPChannel.java:482)
INFO | jvm 1 | 2013/03/01 06:32:06 | at com.inductiveautomation.ignition.gateway.redundancy.channels.tcp.TCPMasterStateChannel.runProtocol(TCPMasterStateChannel.java:161)
INFO | jvm 1 | 2013/03/01 06:32:06 | at com.inductiveautomation.ignition.gateway.redundancy.channels.tcp.AbstractTCPChannel.run(AbstractTCPChannel.java:327)
INFO | jvm 1 | 2013/03/01 06:32:06 | INFO [MasterTCPChannel ] [06:32:06,392]: (id=94) Shutting down channel.
INFO | jvm 1 | 2013/03/01 06:32:06 | INFO [MasterTCPChannel ] [06:32:06,573]: (id=95) Connected to address '/10.202.1.226'
INFO | jvm 1 | 2013/03/01 06:32:06 | INFO [MasterStateManager ] [06:32:06,593]: Successfully registered new connection to redundant master from '/10.202.1.226:52857'
INFO | jvm 1 | 2013/03/01 06:32:06 | INFO [MasterTCPChannel ] [06:32:06,594]: (id=96) Connected to address '/10.202.1.226'
INFO | jvm 1 | 2013/03/01 06:32:06 | INFO [MasterTCPChannel ] [06:32:06,598]: Starting redundancy channel id 95
INFO | jvm 1 | 2013/03/01 06:32:06 | INFO [MasterStateManager ] [06:32:06,601]: Successfully registered new connection to redundant master from '/10.202.1.226:52862'
INFO | jvm 1 | 2013/03/01 06:32:06 | INFO [MasterTCPChannel ] [06:32:06,601]: (id=97) Connected to address '/10.202.1.226'
INFO | jvm 1 | 2013/03/01 06:32:06 | INFO [MasterStateManager ] [06:32:06,604]: Successfully registered new connection to redundant master from '/10.202.1.226:52867'
INFO | jvm 1 | 2013/03/01 06:32:06 | INFO [MasterTCPChannel ] [06:32:06,605]: Starting redundancy channel id 96
INFO | jvm 1 | 2013/03/01 06:32:06 | INFO [MasterTCPChannel ] [06:32:06,607]: Starting redundancy channel id 97
INFO | jvm 1 | 2013/03/01 06:32:06 | INFO [MasterTCPChannel ] [06:32:06,609]: Error occured while communicating with backup client.


I will make sure to set up the garbage collection logging this weekend, I have to restart the server for maintenance anyway so I’ll get that done and monitor the logs next week as well to make sure this is definitely the issue, but it’s looking like it. Good catch, Colby!

Hey,

I’ll have to do a bit of research, but perhaps you can also switch the garbage collector over this weekend. What version of Java is the gateway running on? I believe they started including the new GC with Java 7, but I’ll have to check; it might be available in 6.

Regards,

Right now the servers are running Java 6, Update 26. I don’t have a problem updating the Java version if it will help with this issue. I do have the mobile module installed, and I have read in other threads that it doesn’t work well (or at all) with Java 7, but I don’t currently use the mobile module for anything important; I was just playing with it so far. If it will help fix this issue, I will sacrifice it for now.

Weren’t there problems with update 26? I seem to remember something about it not playing well.

(off to search the forums…)

I believe the big problem was with 6u29… where it broke database connections.

First off, yes, the mobile module doesn’t currently work with Java 7. Java 6 update 26 apparently includes an experimental version of the new garbage collector, but it wasn’t considered “fully supported” until Java 7u4.

There are a number of things that could be done:

  1. Install Java 7, and set up ignition.conf to use the G1 collector.
  2. Leave Java 6 and try to get it to use G1.
  3. Leave Java 6, and try to track down what’s going wrong in the existing collector (CMS), and tune it.

The CMS collector has various tuning options, and there are many articles out there about reducing pause times. Since it’s Friday and we have a ton of things going on here at the office, I don’t really feel like I have time to properly write up (and test beforehand) a configuration of the G1 collector. What I’d recommend for this weekend is that you simply put in the parameters for logging the collector operations, and we can take a closer look at what exactly is happening and go from there.

Regards,

No problem. I’ll make sure to get that done this weekend and we can revisit it next week. Thanks, Colby!

Got an example in the logs. Look for timestamps of 23:12. Looks like it paused the threads for 11+ seconds if I’m reading it correctly. Definitely the garbage collector.

GCTimeout20130305.txt (29.3 KB)

As an aside, I’ve been getting these messages spamming the console log every 30 seconds since I upgraded the servers to 7.5.6 over the weekend. Not sure what these mean either:

INFO   | jvm 1    | 2013/03/04 23:12:22 | WARN  [SQLTags$ClientSubscriptionProxy] [23:12:22,556]: Client subscription proxy got a tag change event after it was shut down.