Friday, August 17, 2007

Julian Cain's explanation on the Skype Outage (from GigaOm)

This is a great explanation of the Skype Out(age) by someone named Julian Cain (no idea who he is, but he obviously has experience with Distributed Hash Tables (DHT) and p2p programming). He posted this explanation in the comments in this thread @ GigaOm.

Number of Skype Authentication servers:

Count = 50 // Clustered

Number of potential Skype clients:

Count = 220,000,000 // Mostly decentralized

Number of SuperNode clients to maintain network connectivity:

Count = N / 300 at any one time.

• If there are 3.0 million users online then the ratio is
3,000,000 / 300 = 10,000 Supernodes available
• Supernodes are bootstraps into the network for normal first run clients
("and handle routing of children calls").
• Supernodes maintain the network overlay via a DHT ("Distributed Hash Table")
"type" method. // This is normally very slow and done over UDP
• If a client cannot find a Supernode, regardless of authentication
via the central server, then it is NOT allowed on the Skype network.
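The numbers above can be sketched in a few lines of Python. This is only an illustration of the arithmetic and the "no Supernode, no network" rule as described in the comment; the function names and the constant are made up for this example and are not taken from Skype's code.

```python
# Ratio quoted in the comment: roughly one Supernode per 300 online users.
USERS_PER_SUPERNODE = 300  # assumption taken from the figures above


def supernodes_available(users_online: int,
                         users_per_supernode: int = USERS_PER_SUPERNODE) -> int:
    """Estimate how many Supernodes exist for a given number of online users."""
    return users_online // users_per_supernode


def can_join_network(authenticated: bool, reachable_supernodes: int) -> bool:
    """A client needs BOTH a successful login and at least one reachable Supernode."""
    return authenticated and reachable_supernodes > 0


if __name__ == "__main__":
    print(supernodes_available(3_000_000))
    print(can_join_network(authenticated=True, reachable_supernodes=0))
```

The second function is the important one: a successful login against the central servers is not enough by itself.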

Lack of Supernodes means lack of network connectivity regardless of successful login via “central server”.

You CAN be a Supernode but not have full network connectivity because you have only a portion of the “Distributed Index Data aka DHT”.

MOST people that become Supernodes will bail out if they cannot keep a clear route (aka calls bail out, the client restarts and aborts Supernode status), thus booting its 300 - 500 Children and putting them into a “Connecting” mode.

Children that are trying to “Connect” are unable to do anything unless they have a “Supernode” as a parent. // No calls, No IM….
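To make the cascade concrete, here is a toy Python model of what happens when a Supernode drops out. It is a sketch under my own assumptions: the `Supernode` class, the helper names and the capacity limit are hypothetical and only mirror the behaviour described above, not the real client.

```python
from dataclasses import dataclass, field


@dataclass
class Supernode:
    name: str
    children: list = field(default_factory=list)


def drop_supernode(supernodes: dict, name: str) -> list:
    """Remove a Supernode; its children are now orphans stuck in 'Connecting'."""
    node = supernodes.pop(name)
    return node.children


def reattach(orphans: list, supernodes: dict, capacity: int = 500) -> list:
    """Try to give each orphan a new parent; return those still 'Connecting'."""
    still_connecting = []
    for child in orphans:
        # Pick the first remaining Supernode with spare capacity, if any.
        target = next(
            (s for s in supernodes.values() if len(s.children) < capacity),
            None,
        )
        if target is None:
            still_connecting.append(child)  # no parent: no calls, no IM
        else:
            target.children.append(child)
    return still_connecting
```

With few Supernodes left and each one capped at a few hundred children, most orphans end up in the `still_connecting` list, which is the state described above.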

The overview of this is as follows:

Skype introduced a flaw into the network that dealt with “routing” and “fucked” the “decentralized data store aka DHT”; this in turn sent clients on a RANDOM search for Supernodes, which at this point were well booted off of the network.

In the End:
It is a huge cycle: no matter how many bugs they “fix” in the “central servers”, it will take many days for N nodes to become Supernodes so they can route X data from peer A to peer B. This is NOT minor: a fix to the centralized server code base to relay data to N Supernodes does not help when there is a lack thereof, resulting in a very segregated network. Right now there are approximately 10,000 sub Skype networks instead of 1 single “in sync” network. When this “data store” (see DHT) is in sync globally, then the Skype network will again be STABLE.
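One way to picture "10,000 sub networks instead of 1" is to count the connected pieces of the overlay. The Python sketch below uses a plain union-find for that; the node names and the `links` list are invented for the example, and nothing here reflects how Skype actually tracks its overlay.

```python
def count_partitions(nodes, links):
    """Count disconnected groups of Supernodes, given pairs that can reach each other."""
    parent = {n: n for n in nodes}

    def find(n):
        # Walk up to the group representative, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in links:
        parent[find(a)] = find(b)  # merge the two groups

    return len({find(n) for n in nodes})


if __name__ == "__main__":
    supernodes = ["s1", "s2", "s3", "s4"]
    # No working routes between Supernodes: every one is its own island.
    print(count_partitions(supernodes, []))
    # Routes restored along a chain: a single "in sync" network again.
    print(count_partitions(supernodes, [("s1", "s2"), ("s2", "s3"), ("s3", "s4")]))
```

In this picture the network is healthy only when the count comes back as 1.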

I know this is very broad, but unless magically all of said nodes can recreate the “single overlay (DHT)”, nothing will be in sync. You will see delayed messages, and delayed or incorrect profiles and presence.

My take, in the end is give it 48 more hours and it may be semi-stable, but hey this is what you get with using end users as your own redundancy…


Updated** - had a quick chat with Julian Cain, he was a core programmer at Kazaa - and knows his stuff. He is working on some very interesting stuff, which I will post about at a later date.


Anonymous said...

Thanks for relaying my data. However I am "Julian Cain" not "Julian Cane". ;-)

Andrew said...

Duly noted and corrected!