Network Corruption?


Network Corruption? 23/11/2017 at 01:18 #103198
Ar88
310 posts
Client reporting that they got the "OK" for 6E68


The Welsh contingent. Aron, or Ar to mates. Also known as 88E or ThatManCalledAr.
Last edited: 23/11/2017 at 01:21 by Ar88
Network Corruption? 23/11/2017 at 03:09 #103199
headshot119
4869 posts
I had a chat with GeoffM about this earlier. He's going to look into it, though it's possibly related to poor connections.
"Passengers for New Lane, should be seated in the rear coach of the train " - Opinions are my own and not those of my employer
The following user said thank you: Ar88
Network Corruption? 23/11/2017 at 03:39 #103201
Ar88
310 posts
Let us know the outcome please fella.
Network Corruption? 23/11/2017 at 18:51 #103229
GeoffM
6274 posts
We have long suspected that messages were getting lost somehow. TCP is supposed to be a "deliver or die" kind of protocol, rather than silently losing messages. Losing messages shouldn't happen with a reasonable network connection anyway, as the SimSig networking protocol is minuscule compared to, say, Netflix streaming. Of course, it doesn't matter if you drop a packet in Netflix - but it does in SimSig.

So, recently a sequence number check was added to spot missing packets. Lo and behold, a few users do indeed seem to be losing packets. The reasons for this are unknown at the moment, but if anybody gets any such errors, please also mention anything that might be relevant, like being on a slow connection, others in the house downloading or streaming something, or anything at all that might have affected your connection.
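
For readers unfamiliar with the idea, a sequence number check can be sketched roughly as follows. This is only an illustration, not SimSig's actual code: the `SequenceChecker` class and its method names are hypothetical, and it assumes each message carries an integer counter that the sender increments by one.

```python
class SequenceChecker:
    """Tracks the next expected sequence number and records any gaps."""

    def __init__(self, first_expected=0):
        self.expected = first_expected
        self.gaps = []  # list of (first_missing, last_missing) tuples

    def check(self, seq):
        """Return True if seq is the one expected; record a gap if numbers were skipped."""
        in_order = seq == self.expected
        if seq > self.expected:
            # Everything from `expected` up to `seq - 1` never arrived.
            self.gaps.append((self.expected, seq - 1))
        if seq >= self.expected:
            self.expected = seq + 1
        return in_order


checker = SequenceChecker()
# Messages 3 and 4 are missing from this hypothetical stream.
results = [checker.check(seq) for seq in (0, 1, 2, 5, 6)]
print(results)
print(checker.gaps)
```

A check like this only detects that something went missing; it says nothing about where in the chain the loss happened.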

We could add retries but that really should not be needed if the TCP/IP stack (outside our control) is implemented properly.

SimSig Boss
Network Corruption? 23/11/2017 at 19:04 #103238
Chromatix
190 posts
I've just mentioned an event of this type in the "F8 Simplifier" thread. I'm presently stuck on a 512Kbps connection, but that should be perfectly adequate to run the protocol. As you say, TCP is a "reliable transport", so any lost messages must indicate a fault in the application.

I had only one NSE during an hour-long session, which was triggered when I opened a simplifier (which, of course, requires an unusually large message or sequence thereof to be sent). Other players reported having several, less severe NSEs than mine.

Presumably there is a queue of messages, which gets exercised when a large message needs to be sent to one particular client. If that queue isn't big enough to cover the time required to send that message, old messages will be lost before that particular client can be sent them - a classic race-condition. That's where I would start looking for this bug.

Network Corruption? 23/11/2017 at 19:19 #103242
GeoffM
6274 posts
Chromatix in post 103238 said:
I've just mentioned an event of this type in the "F8 Simplifier" thread. I'm presently stuck on a 512Kbps connection, but that should be perfectly adequate to run the protocol.
<sucks in breath like a mechanic looking at your car>

It should be... but we don't know what else is also going on. Lots of apps phone home these days without you really knowing.

Chromatix in post 103238 said:
As you say, TCP is a "reliable transport", so any lost messages must indicate a fault in the application.
No. You can't just point the finger with no evidence! However, if we can assume the Windows level and lower stuff is good, there is still another layer between that and SimSig.

Chromatix in post 103238 said:
Presumably there is a queue of messages, which gets exercised when a large message needs to be sent to one particular client. If that queue isn't big enough to cover the time required to send that message, old messages will be lost before that particular client can be sent them - a classic race-condition. That's where I would start looking for this bug.
No. As I already said, there is no race condition because that's not how it works. Each client is handled independently of the others, so if one client goes AWOL, the others are unaffected. There is no queuing. Data is streamed into the TCP connection almost immediately, clients read almost immediately, and packets are small, so there should be no buffer overruns (which, if they did occur, ought to cause a TCP failure anyway). Once received by SimSig and yet to be processed (since one TCP packet could contain multiple messages), the data is stored in dynamic lists anyway, not fixed arrays or buffers.
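
To illustrate the point that one TCP read can contain several messages (or only part of one), here is a minimal sketch of a receiver that accumulates bytes in a growable buffer and splits out complete messages. The 2-byte length prefix, the `MessageSplitter` class and the `frame` helper are assumptions made for the example; SimSig's real wire format is not described in this thread.

```python
import struct


class MessageSplitter:
    """Accumulates received bytes and yields complete length-prefixed messages."""

    def __init__(self):
        self._buffer = bytearray()  # grows as needed, no fixed size

    def feed(self, data):
        """Add newly received bytes; return the list of complete messages."""
        self._buffer.extend(data)
        messages = []
        while len(self._buffer) >= 2:
            (length,) = struct.unpack_from(">H", self._buffer, 0)
            if len(self._buffer) < 2 + length:
                break  # the rest of this message has not arrived yet
            messages.append(bytes(self._buffer[2:2 + length]))
            del self._buffer[:2 + length]
        return messages


def frame(payload):
    """Prefix a payload with its length as a 2-byte big-endian integer."""
    return struct.pack(">H", len(payload)) + payload


splitter = MessageSplitter()
stream = frame(b"signal") + frame(b"points") + frame(b"train")
# Simulate the stream arriving in two arbitrary chunks.
first = splitter.feed(stream[:11])
second = splitter.feed(stream[11:])
print(first, second)
```

TCP delivers a byte stream rather than discrete messages, so the chunk boundaries seen by the receiver need not match the boundaries of the sender's writes.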

Network Corruption? 23/11/2017 at 19:24 #103243
Chromatix
190 posts
I see what you're saying, but what if the TCP send buffer for a particular client is full when a message needs to be broadcast? Does the host block on that, or pass over it?
Network Corruption? 23/11/2017 at 19:37 #103250
GeoffM
6274 posts
Chromatix in post 103243 said:
I see what you're saying, but what if the TCP send buffer for a particular client is full when a message needs to be broadcast? Does the host block on that, or pass over it?
The stack (outside of our control) should declare it as a failure to the SimSig code if that happens. If you have a somewhat permanent failure, like pulling out your network cable (router, fibre, whatever) then after a few seconds it does indeed do this.

I may experiment with an extremely simple pair of applications to just pump tons of data between them to see if I can make it fall over without it being reported which would point to a stack bug somewhere.
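
An experiment of that kind might look something like the sketch below: one thread receives over a loopback TCP connection while the main thread pumps numbered records into it, and the receiver then checks that every record arrived in order. Everything here (the `RECORDS` count, the function names, the use of loopback) is an assumption for illustration, not a description of the test actually run.

```python
import socket
import struct
import threading

RECORDS = 20000  # number of 4-byte sequence numbers to pump through


def receive_all(server_sock, result):
    """Accept one connection, read until it closes, then verify the data."""
    conn, _ = server_sock.accept()
    with conn:
        buffer = bytearray()
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # peer closed the connection
                break
            buffer.extend(chunk)
    count = len(buffer) // 4
    values = struct.unpack(f">{count}I", bytes(buffer[:count * 4]))
    result["received"] = count
    result["in_order"] = values == tuple(range(count))


def run_experiment():
    result = {}
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_sock:
        server_sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server_sock.listen(1)
        port = server_sock.getsockname()[1]
        receiver = threading.Thread(target=receive_all, args=(server_sock, result))
        receiver.start()
        with socket.create_connection(("127.0.0.1", port)) as client:
            for seq in range(RECORDS):
                client.sendall(struct.pack(">I", seq))
        receiver.join()
    return result


outcome = run_experiment()
print(outcome)
```

A loopback connection will not reproduce a slow or lossy link, so a fuller version of the experiment would need to run between two machines, ideally over a deliberately constrained connection.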

Network Corruption? 23/11/2017 at 20:10 #103254
Chromatix
190 posts
In the standard "BSD Sockets" API, there are two modes for handing data to the TCP send buffer - blocking and non-blocking. The default mode is blocking, since that's easiest for simple applications to work with.

In blocking mode, the send() call will not return until the message has been completely accepted into the buffer (not necessarily sent yet). If the buffer is full, there could be a significant delay until this occurs, during which a multithreaded application might miss events occurring in other threads. Failure return codes (negative) indicate fatal errors such as disconnection.

In non-blocking mode, a partly full buffer is indicated by the return code (the number of bytes actually sent) being less than the number provided. If it is positive *and* less than the requested size, the application needs to retain the portion of data not yet sent in its own buffer, and retry later (e.g. in response to a successful select() call, or after a short wait). If the buffer is completely full, send() returns a negative code with the error set to EWOULDBLOCK/EAGAIN (WSAEWOULDBLOCK on Windows), which is not fatal and also just means "retry later". Other negative return codes still indicate fatal errors such as disconnection.
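
The non-blocking behaviour described above can be seen in a short sketch. It uses Python's `socket` module, where a completely full buffer surfaces as a `BlockingIOError` exception rather than a return code. The `send_nonblocking` helper is hypothetical, and `socket.socketpair()` stands in for a real client connection whose far end has stopped reading.

```python
import socket


def send_nonblocking(sock, data):
    """Try to send data; return the bytes that were NOT accepted."""
    try:
        sent = sock.send(data)
    except BlockingIOError:
        # Send buffer completely full: nothing accepted, but not a fatal error.
        sent = 0
    return data[sent:]


left, right = socket.socketpair()
left.setblocking(False)

payload = b"x" * 65536
accepted = 0
leftover = b""
# Nobody reads from `right`, so the send buffer must eventually fill up.
while not leftover:
    leftover = send_nonblocking(left, payload)
    accepted += len(payload) - len(leftover)

print(f"accepted {accepted} bytes before the buffer filled")
print(f"{len(leftover)} bytes must be kept and retried later")
left.close()
right.close()
```

If the application discards `leftover` instead of retrying it, the receiver sees a truncated or missing message even though TCP itself reported no error.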

Last time I checked, BSD Sockets was the networking interface used in Windows (as well as in essentially all UNIXes, including Linux and OSX). If you're using something else, the above might not apply.

Hence there are two possible failure modes indicated by the discussion so far:

1: You're using blocking mode, so a client-handling thread on the host is getting blocked while sending a large message, and thereby missing some messages which are generated meanwhile. This could be fixed by introducing a message buffer on the send side (similar to the one already on the receive side), which I assumed you already had.

2: You're using non-blocking mode, and not handling the buffer-full condition correctly in all possible cases. This could result in truncated messages as well as missing ones.
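
A send-side buffer of the kind suggested for both cases could be sketched as below. This is a generic pattern under stated assumptions, not SimSig code: `OutgoingQueue` is a hypothetical per-client queue, and `limited_send` is a stand-in for a socket that accepts at most 5 bytes per call.

```python
from collections import deque


class OutgoingQueue:
    """Per-client queue that keeps unsent bytes until the socket accepts them."""

    def __init__(self, send):
        self._send = send  # callable(bytes) -> number of bytes accepted
        self._pending = deque()

    def enqueue(self, message):
        self._pending.append(message)

    def flush(self):
        """Send as much as possible; return True once the queue is empty."""
        while self._pending:
            message = self._pending[0]
            sent = self._send(message)
            if sent < len(message):
                # Keep the unsent tail at the front so ordering is preserved.
                self._pending[0] = message[sent:]
                return False
            self._pending.popleft()
        return True


# Hypothetical stand-in for a socket that accepts at most 5 bytes per call.
wire = bytearray()


def limited_send(data):
    chunk = data[:5]
    wire.extend(chunk)
    return len(chunk)


queue = OutgoingQueue(limited_send)
queue.enqueue(b"large simplifier")
queue.enqueue(b"tick")
# In a real program each retry would wait for select() to report writability.
while not queue.flush():
    pass
print(bytes(wire))
```

Because new messages are only ever appended behind the unsent remainder, a slow client receives everything in order, just later, instead of losing whatever was generated while a large message was in flight.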

Last edited: 23/11/2017 at 20:12 by Chromatix
Network Corruption? 23/11/2017 at 20:32 #103256
GeoffM
6274 posts
Thanks, but I've spent 20 years writing safety-related applications for the real railway using networking, which are externally audited, so I am aware of how it works - or is supposed to.
The following user said thank you: Ar88
Network Corruption? 23/11/2017 at 20:45 #103257
Ar88
310 posts
Geoff - we must remember that "The customer is always right"!


The following users said thank you: GeoffM, SamTDS
Network Corruption? 24/11/2017 at 12:33 #103278
clive
2736 posts
GeoffM in post 103256 said:
Thanks, but I've spent 20 years writing safety related applications for the real railway using networking, and which is externally audited, so am aware of how it works - or is supposed to.
And I used to be managing director of the fourth largest ISP in the country at the time. And a technical expert at the largest (motto: "nothing is simple when you multiply it by a quarter of a million").
