Unravelling the Mystery of Network Latency in China
What started as a desk-side conversation on why so many requests to a new API from China were traveling thousands of miles to the USA, unravelled a mystery of what life is like for a network request in mainland China.
Join us for story time with Maersk's Front-End Engineering leader Steve Workman, and learn what happens inside the great firewall, and how you can go faster in China.
Steve Workman
Steve works for Maersk, a logistics supplier that moves 16% of the whole planet's goods. He is the leader of the front-end community at Maersk and looks after maersk.com, one of the largest e-commerce sites in the world. Many, many years ago, Steve helped to organise London Web Standards, and can't wait to see the team again.
Links
Video Permalink
Transcript
I've been a web developer for 20 years.
That's actually something I was just thinking about while Charlie was speaking: 20 years.
All the names of the books that you had up on the slide there.
I've been doing this since I was an organiser of this meetup group.
Having lived through all of those books, knowing a lot of those people who wrote those things, it's amazing to see firstly their work be shared and that there is a course that teaches really good web design.
So what can you do with 20 years of knowledge? I think that's the topic.
So this talk is a bit different from the other ones that you've heard tonight.
I think that puts it lightly.
My name is Steve.
I am from the north, though apologies, I don't really have an accent anymore.
I work for Maersk.
Now, if you don't know who Maersk is, that's OK.
They are the second largest provider of container logistics worldwide.
Around 16% of everything that you buy at one point has been inside a Maersk container.
And that really means 16% of everything, including everything that is in this room.
Maersk's network capacity, or how much it can hold at any one time, is about 4.3 million TEUs, or 20-foot equivalents.
For context, 20 foot is about the width of this screen, so 4.3 million screens' worth of cargo around the world at any one time.
Maersk is actually more of a technology company than a logistics company these days.
We have to build systems that work in 130 different countries, including some that are under severe restrictions, like humanitarian work in the Sudan.
And we deal with ocean transport, which is what we're known for, trucks, trains, warehouses, as well as the fleet, and terminal technology as well.
We have 3,500 software engineers, which is a lot.
And my job is that I run the Maersk design system.
But I also lead the front-end engineering and web development communities within the company.
This is my dog.
His name is Gizmo.
He is a good boy.
He probably featured in the very first talk that I gave here 10 years ago, because that's about how old he is.
And throughout my career, web performance was my thing.
So content warning-- this talk contains something that I didn't know before and that I definitely know now.
My last one of these was on HTTPS, which I didn't know much of back in 2014, and I definitely do now.
But that was before Let's Encrypt.
So there was all sorts going on then.
This is about China and network latency.
But the good thing is I already know about web performance.
So there will be graphs, medians, networks, and I'll try and explain it.
If you don't understand it, that's OK.
I will try and explain in the pub.
But we'll see.
Anyway, let me set the scene.
It is a warm day in the summer of 2023.
Unlike this summer, it was nice at that time of year.
And actually, a little cool breeze really helped.
It was a pleasant thing, and you kind of longed for the beach.
Unfortunately-- because I have a clicker-- you're stuck here.
That's my office.
That's Maidenhead.
Thankfully, there is air conditioning, and this is better for many reasons, including that there's very little chance of getting sand under the keyboard.
But mostly, it's having people around you working on a similar problem together.
The people in question are the external API development team.
They're busy working on replacement services for one of our legendary platforms.
Legacy is a terrible word, because these things definitely worked at one point.
So legendary.
It's much, much lovelier.
I had just finished running an internal event, and I was looking for a project to test out some new ideas and dependencies.
And the API engineers were looking for someone to upgrade some shit.
So enter maersk.com/schedules.
Now, schedules is the ninth most visited part of the website.
There are plenty of others. maersk.com is one of the largest e-commerce websites in the world.
You may have heard in the news that container rates or the amount it costs to ship things from one place to another are quite high at the moment.
During the pandemic, it was even more.
At one point during 2022, maersk.com was taking $5.5 million per hour.
So this is the ninth most visited part of the website.
This is somewhere quite safe for me to play.
It shows up-to-date vessel schedules to the public using data that is actually publicly available through our developer APIs.
It uses lots of our different core systems, and it's quite nice and easy to play with.
It's pretty simple.
It's three tabs.
This is what the API team wanted me to help with: making sure that our public-facing apps use the same APIs that are available to our customers, so that they actually work as intended.
So we go through the list of all the APIs that need to be replaced, and we start on some simple ones like locations.
You'd think that locations are quite a simple thing.
However, at Maersk, we deal with every country and every city on the planet.
So it's quite a popular API.
Let's put it that way.
It is extremely heavily cached by our CDN and in browsers around the world.
It is quite a unique API because it has a very, very large number of URL possibilities that you can have.
This is a term called cardinality.
Anyone ever heard the term cardinality before?
One, two, three, maybe?
I learned it last year when the observability platform team said, Steve, please reduce the cardinality of the data you're giving me.
It's breaking things.
Anyway, cardinality is defined as the number of elements in a set or grouping as a property of that grouping.
Number of unique combinations of something.
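To make that concrete, here's a rough back-of-the-envelope sketch. The numbers are made up for illustration, not the real figures for our locations API: every independent parameter multiplies the number of distinct URLs a cache has to treat as separate entries.

```typescript
// Rough sketch of URL cardinality with made-up numbers -- not the real
// figures for the locations API. Each independent parameter multiplies the
// number of distinct URLs a cache has to store as separate entries.
const cities = 10_000;   // hypothetical number of city codes
const countries = 200;   // hypothetical number of country codes
const pageSizes = 4;     // e.g. 10, 25, 50, 100 results per page

const urlCardinality = cities * countries * pageSizes;
console.log(`Distinct cacheable URLs: ${urlCardinality.toLocaleString()}`); // 8,000,000
```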
Anyway, being careful not to degrade any functionality on one of our key applications, we actively monitor performance from the UI and from our API gateways.
It all launched fine, tracking the different performance percentiles around the world, ensuring that all our layers of cache were working as they were meant to.
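As an aside, here's a minimal sketch of what that kind of UI-side monitoring can look like, using the browser's Resource Timing API. The '/telemetry' endpoint and the 'api.' hostname filter are hypothetical placeholders, not our actual code, and cross-origin timing detail needs a Timing-Allow-Origin header on the API responses.

```typescript
// Minimal sketch of UI-side API monitoring via the Resource Timing API.
// '/telemetry' and the 'api.' filter are hypothetical placeholders.
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries() as PerformanceResourceTiming[]) {
    if (!entry.name.includes('api.')) continue; // only report API calls
    navigator.sendBeacon('/telemetry', JSON.stringify({
      url: entry.name,
      durationMs: entry.duration,                        // full request duration
      ttfbMs: entry.responseStart - entry.requestStart,  // time to first byte
    }));
  }
});
observer.observe({ type: 'resource', buffered: true });
```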
Once we were looking at the performance stats, we saw an anomaly.
And one of the engineers asked me, Steve, why is this traffic from China going to the USA?
All the traffic from outside China was going to its closest data center.
But in China, more than half of it is going to US East.
Now, apologies for the projection on this map.
It's very north heavy.
This is not how the world actually looks.
But it's east-west that we're looking for here.
So as you can see, US East is-- that's as far away as you can get.
So why is the traffic going there?
The theoretical ping time from Shanghai to Washington, DC is over 200 milliseconds.
So that's basically the speed of light around the planet.
You'd expect the traffic to visit the Asia data center, which is in Tokyo.
And this would be a ping time of roughly 80 milliseconds.
In theory.
From Western China, Europe is actually almost as close as Tokyo.
So you'd expect a bit of data to go there as well.
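For a rough sense of where those numbers come from, here's a back-of-the-envelope sketch: light in fibre travels at roughly two-thirds the speed of light in a vacuum, and the distances below are approximate great-circle figures, so these are physical floors rather than the real ping times quoted above.

```typescript
// Back-of-the-envelope minimum round-trip times. Assumes light in fibre at
// ~2/3 c and straight-line distances, so real pings (200 ms and 80 ms in
// the talk) are higher once routing and queueing are added.
const SPEED_OF_LIGHT_KM_S = 299_792;
const FIBRE_FACTOR = 2 / 3;

function minRttMs(distanceKm: number): number {
  return ((2 * distanceKm) / (SPEED_OF_LIGHT_KM_S * FIBRE_FACTOR)) * 1000;
}

console.log(minRttMs(12_000).toFixed(0)); // Shanghai -> Washington, DC: ~120 ms floor
console.log(minRttMs(1_800).toFixed(0));  // Shanghai -> Tokyo: ~18 ms floor
```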
Now, networks are optimized for performance and for finding the closest working server.
This is something that, certainly as web developers, we normally never have to think about.
So why would it be making such an illogical jump?
To describe what's going on here, we need to tell you a little bit about Maersk, about China, and the internet in China.
Now, there's someone from China here in the audience, which makes this quite interesting, because what I'm describing is the average, aggregated view of how the internet works in China, based on the stats that I have.
Everyone's personal experience is actually quite different.
As the world's largest exporter of goods, China accounts for more than 10% of all traffic on maersk.com.
We support both simplified and traditional Chinese script on the site to accommodate four of the top five container terminals on the planet.
The top 10-- actually, this is the top 10, with Hong Kong, Shenzhen, and Shanghai-- excuse me, my pronunciations-- Ningbo, Shanghai, Qingdao, and Tianjin at the very top.
But every red dot is a container terminal, every single one of them.
And every black dot is an inland terminal.
There is no greater concentration of exported goods on the planet.
To run a business in China, certainly as a foreign company, you need a few local licenses, called an ICP-- I don't actually know how to pronounce this-- B-E-I-A-N.
How do you pronounce that?
Bean?
I've always said bean.
I don't know.
An ICP Beian, an ICP license, and a PSB filing-- one of each of those-- along with having registered offices and business contacts within the People's Republic of China.
This means we then have access to a Chinese domain name, a .com.cn domain.
This is important, and we'll come back to it later.
You may have heard that the internet access is heavily regulated in China, with many major Western companies being blocked and homegrown companies replacing them.
For example, WhatsApp is WeChat, Amazon is AliExpress, Google is Baidu, and so on.
The internet is also censored on a variety of keywords, though you wouldn't know it unless you went searching for it.
These restrictions are known as the Great Firewall of China, and it is run by the Cyberspace Administration of China and enforced by a state-owned ISP called China Telecommunications Corporation.
But it's also known as ChinaNet.
Officially, this whole project is called the Golden Shield.
All sorts of names for these things, which means everything is actually scanned as it's trying to leave the country.
So this is the official description.
Launched in 1995 by China Telecom, ChinaNet, or AS4134-- note that-- or 163Net, is not only China's national internet backbone.
It is also widely considered an important part of the global internet.
ChinaNet boasts the most subscribers and websites, the widest coverage, and the richest infrastructure resources of any public internet network in China.
For customers looking for the highest-quality experience on ChinaNet, China Telecom offers multiple ChinaNet access options that bypass congested gateways with optimal on-net routing.
ChinaNet has an effective monopoly as an ISP within China.
You'll see some graphs later of this.
But unless you're on a VPN, you will go through ChinaNet and you will go through the Great Firewall.
So the Great Firewall causes a lot of issues with connecting to websites outside of China, because every packet is scanned before it is allowed to proceed.
If a packet is rejected, more packets are rejected from the same source.
Encryption, such as HTTPS, helps, but many sites are still extremely slow to load, with assets on some domains that will basically never load, or even connect, because they've been blacklisted.
Because of this, many people in China use a VPN that terminates outside of China, such as in Hong Kong.
This avoids the Great Firewall entirely and is reasonably quick, though since only government-approved VPNs are allowed, it doesn't get around all of the controls.
For websites, there is a way to circumvent the Great Firewall, and that is being a Chinese business.
Traffic that stays inside China is not scanned in the same way that traffic that leaves the country is.
And so if your service can be positioned within the country, then it won't take as much time at the firewall, or actually it won't ever hit the firewall.
Now, Maersk has both the ICP and PSB licenses, so we run maersk.com.cn, which is the website that you see here, and which uses a content delivery network within China.
I say network; I really mean ChinaNetCenter, also known as Wangsu Technology Company Limited, which has nothing to do with the actual provider of our CDN.
But it's those things that have to happen.
You have to have a Chinese business in order to do all of these things.
There are a lot of hoops to jump through in order to do this thing.
Maersk does this because it's an awful lot of our business.
The good thing is that the China CDN can be managed from both inside and outside the country, though it can only be set up from within China.
Now, maersk.com.cn runs reasonably well.
There are no large wait times for payloads.
The pages run quickly.
Sometimes third-party services fail, though it's not as common as you'd think.
Now, that kind of lays the land for where we are.
Why is traffic going to US East?
Yeah, I still don't know; that diagram is still relevant, because US East is still not the closest place.
Now, apologies, there's some AI-generated images in here.
That is the only use of AI that has been on this talk.
But as I said at the start, web performance kind of was my thing, so I started to get into detective mode and see what's happening under the hood, to try and unravel this mystery.
We're trying to send traffic to a place called api.maersk.com.
No prizes for saying that that's where our APIs are.
It currently, from China, mostly resolves to US East.
How did it get there?
What did it do to actually get to that point?
And how did we even find that information out?
Well, let me introduce you to two amazing tools, Wireshark and WebPageTest.
We're going to start with WebPageTest-- though, well, the ordering.
If you're unfamiliar with these tools, Wireshark is a network connection inspection utility.
With the right data, it can show you where every packet went over the network.
It's open source and essential for forensic-level data analysis.
WebPageTest, meanwhile, is a service that lets you test your websites from around the world.
It provides you with Lighthouse metrics, Core Web Vitals, and full waterfalls of how everything loads.
It is amazing to get a different view, or someone else's view, on someone else's computer, of how your website actually loads.
Now, its owners, Catchpoint, supply lots of locations around the world where you can perform your in-depth tests on pages using real browsers and kind of go deep into the metrics.
It just so happens that WebPageTest has nodes in Beijing and Shanghai, inside the Great Firewall.
From there, we can capture full TCP logs, as well as the HTTPS certificates, and we can decode them in Wireshark, which is awesome, because otherwise I'd have to fly to China to do all of this work myself.
So having, effectively, a remote computer that is entirely set up for web performance work is incredible.
WebPageTest has saved my bacon so many times, and you should check it out for anything you're doing, to be honest.
So this means we have our ideal test service with all the tools we need to perform the analysis.
Now, this is the trace within Wireshark.
When I said that I learned a lot from this: I called our good friend Andy-- Mr. Andy Davies of Speedcurve.
And he helped me with this bit, because I still do not understand what all of this says, but that's OK.
What I do know is the black is bad.
The black lines are where packets are not being accepted-- but it's not just that they're not being accepted.
They're just not going anywhere.
The connections have been interrupted, and that is the firewall.
What we found confirmed the analysis from our external telemetry: traffic from maersk.com.cn to api.maersk.com was bouncing to Hong Kong or other regional data centers outside of China, and then going to the EU or US data centers rather than to Asia.
So traffic got out of the country and then decided what it was going to do.
But all that did was actually confirm what we already knew, that the data was behaving badly.
And unfortunately, there wasn't a huge amount more we could actually learn from this.
So we had to go to our network team and our CDN provider and get them to look into the traffic.
And I will show you some of their analysis.
But in short, it's, yeah, that's right.
That's what it does.
This is a two-hour sample of traffic from July 2023.
And they saw that about 50% of clients from China are mapped to the USA.
It appears that requests from those last hop countries were making it to the correct destination.
And so issues on our side were ruled out.
They checked the performance of each edge geography and saw that, unsurprisingly, Japan, JP, has a better performance than the USA.
They then broke down the China client traffic by client ASNUM, which is the chart on the screen.
An ASNUM, or AS number, effectively identifies a block of IP addresses run by a single network.
You heard me say AS4134 earlier.
That is ChinaNet.
And that is the one on the left-hand side.
It dominates in terms of request count.
It also has the worst performance among any of these different Chinese client ASNUMs.
It has more than three times the next nearest traffic and double the latency.
Well done, China.
They acknowledged that traffic going outside China "has its peculiarities"-- I'm quoting from an email-- specifically traffic on AS4134.
They go on to say, "such as peering with abroad getting controlled in unexpected and non-optimal ways.
We observed that it's not a matter of mapping going to the US, but rather performance issues on the Chinese side."
Checking the stability of this performance from AS4134-- so this is all traffic from AS4134-- to various different locations.
You can see some two-letter country codes down the left-hand side.
You can probably just about see the US at the top, then Japan next, and Hong Kong after that, and France after that.
Breaking it down, you see a lot of instability in the Japanese one.
It's not actually a very stable connection.
Some ASes have good performance wherever they go.
These are not ChinaNet.
But others vary quite a lot, and overall the results are not great.
And it's very much dependent upon the ISP, the AS number itself.
So, given the above findings, we don't expect that forcing the mapping to somewhere more local would give any performance gain.
While mapping to further locations is not intuitive, it appears that the mapping system is actually doing an optimal job, as we would expect.
The point here is that the traffic has chosen the US because it is the most stable connection out of China.
Japan, even though it's closer, is a bit of a mess.
And they can't guarantee how good the performance is.
And so they route to the US instead, which is kind of mind-blowing to me because that's like, how does network engineering work?
I'm a web developer.
I do CSS.
What the fuck?
So I was seriously impressed with the detail our supplier went into here.
There's a lot of detail to unpick, but fundamentally, it boils down to that huge variation.
There's no reliable way to determine where the traffic is going to go.
But they do their best to route the traffic to the optimal location to the user, even if that seems unintuitive.
Now, none of that answers my question as to why.
And it feels I'm starting to go around in circles.
But thankfully, they did have a recommendation for us.
You know that Chinese domain name that you're using for maersk.com?
Use that for the APIs as well.
OK.
So if we were to use that for our APIs, the edge of the API nodes would be within China.
And then even though it's then going off to data centers outside China, it could still benefit from the Chinese domains and skip the firewall.
Now, we didn't set this up by default.
When I joined Maersk six years ago, we did already have a Chinese domain, which was great.
We never had an api.maersk.com.cn.
No one's ever thought of this as a problem before.
So we needed to set it up.
And again, the idea is that we're using this service within China.
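As a hypothetical sketch of what "using this service within China" can mean in the front end: pick the in-China API host when the visitor is on the Chinese site. The hostnames are the ones from this talk, but the selection logic and the '/locations' path are illustrative, not our actual code.

```typescript
// Hypothetical sketch: route visitors on the Chinese site to the in-China
// API edge. Hostnames are from the talk; the logic and path are illustrative.
function apiBaseUrl(): string {
  const onChinaSite = window.location.hostname.endsWith('.com.cn');
  return onChinaSite
    ? 'https://api.maersk.com.cn' // edge inside China, so requests avoid the firewall
    : 'https://api.maersk.com';   // regular global edge
}

async function searchLocations(query: string): Promise<unknown> {
  const res = await fetch(`${apiBaseUrl()}/locations?q=${encodeURIComponent(query)}`);
  return res.json();
}
```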
Now, the process of actually doing this, because we already have all the licenses and people in China, it took a couple of months for me to send a lot of emails and messages to get this done.
But eventually, we did.
And we flipped the switch.
And we were now expecting more traffic to go to Tokyo than to the US.
We tried a canary test and rolled it out to 1% of users of maersk.com/schedules.
And immediately, we got errors.
But not just any errors.
HTTP status code zero.
What?
Now, if you're unfamiliar with HTTP status codes: zero is not among them.
This means the request failed without a response, which is pretty vague.
It can happen for a few reasons under normal circumstances, such as a timeout or loss of network.
But since this wasn't happening when we tested it from Europe, all we had was some telemetry that was telling us that it's broken.
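For anyone wondering how a "status code zero" ends up in telemetry at all, here's a minimal sketch: with fetch, a network-level failure rejects with an error and there's no response object, so the wrapper records it as status 0, just as XMLHttpRequest would. The '/telemetry' endpoint is hypothetical.

```typescript
// Minimal sketch of how "status 0" appears in telemetry. A network-level
// failure (timeout, DNS, TLS, firewall reset) gives no HTTP status at all,
// so it gets recorded as 0. '/telemetry' is a hypothetical endpoint.
async function trackedFetch(url: string): Promise<Response | undefined> {
  try {
    const res = await fetch(url);
    navigator.sendBeacon('/telemetry', JSON.stringify({ url, status: res.status }));
    return res;
  } catch (err) {
    navigator.sendBeacon('/telemetry', JSON.stringify({ url, status: 0, error: String(err) }));
    return undefined;
  }
}
```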
I even managed to call up some of my colleagues in China.
And they were using VPNs, so it worked for them.
And eventually, we did get through to some people when they turned the VPNs off on their phone.
And they could see that, yes, no, it was indeed broken.
So, back to WebPageTest and Wireshark, and back to Andy.
And I got these two very unhelpful messages.
The first: a CNAME from api.maersk.com.cn to an edgekey domain.
So it's resolving to the right place.
It's not a DNS problem.
The second: TLSv1.2 Record, Alert (Level: Fatal, Description: Internal Error).
And that's exactly what you want from a network stack-- "internal error."
So I looked up the IP address.
It's not a DNS issue.
Then it said hello and immediately errored.
It's not much to go on.
But it does say TLS.
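If you want to reproduce those two checks yourself without WebPageTest, here's a rough Node.js sketch (run as an ES module): confirm the CNAME resolves, then see whether the TLS handshake completes or gets killed with a fatal alert. The hostname is the one from the talk; everything else is illustrative.

```typescript
// Rough Node.js sketch of the two checks: DNS resolution and TLS handshake.
// Run as an ES module. The hostname is from the talk.
import { promises as dns } from 'node:dns';
import tls from 'node:tls';

const host = 'api.maersk.com.cn';

const cnames = await dns.resolveCname(host).catch(() => []);
console.log('CNAME chain:', cnames); // resolving fine means it's not a DNS problem

const socket = tls.connect({ host, port: 443, servername: host }, () => {
  // Handshake completed: print the negotiated protocol and certificate subject.
  console.log('TLS OK:', socket.getProtocol(), socket.getPeerCertificate().subject);
  socket.end();
});
socket.on('error', (err) => console.error('TLS handshake failed:', err.message));
```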
And so we went back to the network team.
And we looked it up.
Have the HTTPS certificates been installed correctly?
Because one of the things that the China network does is it installs its own security certificates on top of all the ones that you've got.
And they said, yes, it is.
And we said, look at this data.
And they said, oh, no, no, it's not.
[LAUGHTER] Two weeks later, it's working.
Hooray.
And, confident that it works, we take the rollout up to 15%.
Now, actually, at 15%, we didn't see a lot of uptake.
And I'm going to cut a long story short here and say it's because of the firewall again.
Our feature flag provider's CDN also isn't in China.
And they use EventSource connections, or server-sent events.
Strangely enough, those don't work about 30% of the time.
And normally, that's never a problem.
You always have defaults.
But we don't have their China solution rolled out.
It's on my list.
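To show what falling back to defaults looks like when the flag service can't be reached, here's an illustrative sketch-- not our actual feature flag provider's SDK, and the URL and flag name are made up.

```typescript
// Illustrative sketch only -- not the real feature-flag SDK. A percentage
// rollout that falls back to a safe default when the flag service (or its
// server-sent-events stream) is unreachable from inside China.
type Flags = { chinaApiDomainPercent: number };
const DEFAULTS: Flags = { chinaApiDomainPercent: 0 }; // safe default: old domain

async function loadFlags(): Promise<Flags> {
  try {
    const res = await fetch('https://flags.example.com/flags.json'); // hypothetical URL
    return (await res.json()) as Flags;
  } catch {
    return DEFAULTS; // flag CDN blocked or unreachable: don't break, use defaults
  }
}

function inRollout(userId: string, percent: number): boolean {
  // Stable bucket per user: hash the id into 0-99 and compare to the rollout %.
  let hash = 0;
  for (const ch of userId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % 100 < percent;
}
```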
So we just pushed the numbers up.
So we went up to 33%.
And we started seeing some traffic.
OK, now, this is the graph time.
We're getting into graph time, guys, because we've solved the problem, right?
We have definitely solved this problem.
Definitely solved this problem.
So with two graphs, these are histograms.
If you've never seen web performance histograms, this is what you've got.
On the y-axis, you have the number of requests.
On the x-axis, you have time in milliseconds, though this is actually in seconds.
So sadly, these scales don't match.
The bottom one goes to two seconds.
The top one is 1.2.
Please put that into your brain when trying to parse these things, because unfortunately, that's what you have to deal with sometimes.
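For anyone who hasn't built one of these before, here's a minimal sketch of how such a histogram is put together: bucket request durations into fixed-width bins and count the requests in each.

```typescript
// Minimal sketch of building a response-time histogram: fixed-width bins,
// with everything beyond the last bin clamped into it (the long tail).
function histogram(durationsMs: number[], binWidthMs = 50, maxMs = 2000): number[] {
  const bins = new Array(Math.ceil(maxMs / binWidthMs)).fill(0);
  for (const d of durationsMs) {
    const i = Math.min(Math.floor(d / binWidthMs), bins.length - 1);
    bins[i] += 1;
  }
  return bins; // y-axis = bins[i], x-axis = i * binWidthMs
}
```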
Anyway, our initial results for the API response times were OK.
But they were not any faster than the existing API domain.
We focused our analysis on traffic from the schedules application within China, using only traffic that came from the Chinese website as the host, so we can do a direct comparison of API response times.
Thankfully, we're A/B testing this.
We can do direct comparisons.
This showed quite a different profile to the regular ones.
Now, the original API domain shows a browser cache.
So the way you read these graphs is that each hill or at least bump is basically a new area of cache.
The first one right at the start is your browser.
That's when things take no time.
Hooray, that's what you want.
The next series is your edge cache.
This first hill here is good.
That's high.
What you want is tall.
What you want is more to the left.
That's literally the whole point of web performance.
Just shift it left.
So you have two hills there.
And the third one is your origin, so where it goes and hits the origin.
The rest of it is called the long tail.
And it's, well-- networks are going to network.
People are going to go into train tunnels.
People have low signal.
It just happens.
But we have two quite different profiles.
The new domain has a better edge cache.
But overall, it's actually slower.
With the old one, things go to the origin more often than they hit the edge.
So the China CDN is working, but it's still not faster.
So does that mean we're now closer to the data center?
Because also, the bottom one, the origin hit is closer to the edge.
Does that mean we're now hitting US East?
Well, this chart shows where all the traffic was going.
And again, apologies.
There's three different colors here.
Most of it's headed to the blue series, which is US East.
But what we wanted to do was hit the green one.
That's Asia.
And the yellow one's Europe.
At 33%, nothing really happened.
So we increased it to 75%.
And nothing happened.
Frustrated, but confident that this was still the right thing to do-- because we'd been told it was, and, literally, what else could it be?-- we took a closer look at the numbers.
So a little bit of detail here.
This, again, is maersk.com/schedules.
There are various different APIs.
There are eight of them in total on schedules that are being targeted-- there's one of them I missed.
The web APIs are used in different places.
So here, you have active vessels, vessels, and then vessel schedules.
All of these change at different rates, have different levels of cardinality.
And so you have various different things on different pages and different profiles.
This is port calls.
So pick a terminal anywhere in the world, and pick a date.
And it'll tell you what ships there are.
Really interesting.
This screen is similar.
The interesting one here is that there is an API called deadlines, which is triple-keyed on the vessel code, a voyage number, and a port-- for every port, vessel, and voyage in the world.
So it's quite high cardinality.
But port schedules has relatively low cardinality.
So we've got a whole spectrum of stuff to test here and go, well, what actually happened?
Let me give you some numbers.
For each of these charts, resource duration-- time in milliseconds-- is on the x-axis.
On the y-axis are the different percentiles: P50 (the median), P75, P90, and P95.
I don't really care about anything less than 50, because half of everyone was faster than that.
That's great.
What I care about is the extremes.
Especially for caching, I want to make sure that 90% and 95% of people have a great experience.
Dark blue is the new China CDN.
Light blue is the regular CDN.
Smaller is better.
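As a quick refresher on what those percentiles mean, here's a minimal sketch using the nearest-rank method; the exact method the real telemetry pipeline uses may differ slightly.

```typescript
// Minimal sketch of percentile calculation (nearest-rank method). The exact
// method used by the real telemetry pipeline may differ slightly.
function percentile(durationsMs: number[], p: number): number {
  if (durationsMs.length === 0) return NaN;
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// percentile(samples, 50) is the median; percentile(samples, 95) is the P95.
```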
So up first are two very similar APIs, active vessels and active ports.
Active vessels shows a small 4% gain at P50 and 90% and 23% gain at P95.
Active ports shows a large regression at P50, but a huge gain at P90 and P95.
This is sending me crazy just even reading this again.
These APIs are both really static in nature.
The list of active vessels changes once a day.
And it is just a list.
There is no cardinality to it.
Active ports changes even less.
We don't shut down terminals around the world every day.
So these should be very highly cacheable.
And as I said, there's low cardinality.
There's no query parameters.
Now, you can see that in-- there's no laser on this thing, is there?
There is.
Ooh, lasers.
The median here for active ports is 19 milliseconds.
So that means half of all requests were served from browser cache, which is awesome.
That's brilliant.
I'm not going to beat that.
So the new API domain has not yet built up its browser cache-- or not all of it has.
So that's fine.
That does take time.
Fine.
But the gigantic one at the top here, 52% duration improvement at P95, is showing that traffic at the extremes is now getting data from much closer in.
It's still quite close for P75, but when your network's limited, this is still having quite a good effect.
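As a hypothetical illustration-- not our actual configuration-- of what makes a rarely-changing, low-cardinality endpoint like active ports so cacheable in the first place, here's the sort of caching header a server might send for it.

```typescript
// Hypothetical sketch, not the actual setup: caching headers that make a
// low-cardinality, rarely-changing endpoint a browser- and edge-cache hit.
import http from 'node:http';

http.createServer((req, res) => {
  if (req.url === '/active-ports') {
    res.writeHead(200, {
      'Content-Type': 'application/json',
      // an hour in the browser, a day at the CDN edge, and serve stale
      // responses while the edge revalidates in the background
      'Cache-Control': 'public, max-age=3600, s-maxage=86400, stale-while-revalidate=600',
    });
    res.end(JSON.stringify({ ports: ['CNSHA', 'CNNGB', 'CNSZX'] })); // placeholder data
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(3000);
```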
These next APIs, vessel schedules and port calls, are high cardinality with a low cache hit rate, as they change a lot and have multiple query parameters.
These change every few minutes, depending on where the data gets to, because things get delayed and things get rerouted.
And across 800 vessels around the world, that adds up to a lot of change.
For the highest-cardinality of these, vessel schedules, we get a modest 4% to 6% improvement at P75.
But it's interesting that this API is not explicitly browser cached.
And so it will hit the origin every time.
But it's only slightly slower on the median for the China CDN.
But weirdly, even though these are very similar APIs, port calls gets something like a 50% improvement, which brings the API into line with the timings for vessel schedules.
So anyone want to guess what this is?
Because I don't know the answer for sure.
Would anyone like to guess why an API called port calls was twice as slow as a very similar one called vessel schedules?
Yeah?
Alcohol's a good one, maybe.
I didn't think of alcohol.
I probably should have done.
Yeah?
Censorship?
Could be on ports, maybe.
Yeah.
Sorry.
Anyone else?
I mean, censorship is what I thought the answer was as well.
I honestly thought: port calls.
OK, someone's looking for "ports."
But they mean ports as in IP ports and TCP ports.
And so maybe someone's trying to hack the system-- that's literally my guess, and a best guess at that.
But I do not have any other explanation for why something this weird happens.
Welcome to China.
If anyone does have any more theories, I'd really like to hear them.
Finally, the last part of maersk.com/schedules is actually the form page, called point-to-point schedules.
This has the list of location searches.
As I said, every city on the planet.
There are too many Newcastles.
And then the rest of it.
I'm going to hurry up, Dave.
Don't worry.
So now you won't see deadlines here because I coded a bug.
And please do remember that no matter how long you've been coding for, you will still code bugs.
It just happens.
So don't worry about that.
So one of the APIs is just totally missing from this.
The interesting thing about this is that location search is actually much slower in China.
What we think of it is firstly, it's high cardinality.
It's not cached very much.
And then it's also heavily in-browser cache from other places because we use this API a lot.
Vessels is much, much the same.
So and vessel flags-- vessel flags is static.
Why is that slow?
Strangely, worse across the board and worse by 50%.
Really don't know why.
And so why is this happening?
We went back to the traffic graph and thought, well, surely everything is still going to US East.
And it was.
But then on February 2, it stopped going to US East and started going to Japan.
And genuinely, we did nothing.
We did nothing at that point.
Suddenly, it goes from blue to green.
And it stayed there ever since.
What changed?
It's hard to tell without going and asking the network team to go and do the forensic analysis again.
But as an educated guess, I'd say the traffic on China reached a tipping point where there was just enough cache and data in there.
And the stability to Japan reached a certain tipping point and went, right, Japan now.
And off it went.
And doing so massively, again, increased the stability and decreased the latency of all the different APIs.
So we ramped it up to 100%.
And these are the final results as performance percentage differentials on the different APIs.
So taking measurements for latency of the different APIs over time, what's the difference between the China CDN and the regular CDN?
Well, it does vary a lot by API.
But overall, it's enormously positive.
We see consistent double-digit performance improvements, sometimes 50% or better-- a range of -11% to +26% at the median and -27% to +55% at P95.
This can be over a second better performance on a p90 on APIs that are called thousands and thousands and thousands of times per day.
There are clearly data points that produce a negative effect.
In particular, the Vessels API gets worse over time.
But there's a weird statistical anomaly because at the start, we weren't collecting enough data.
And maersk.com's API has actually gotten slower over time as the Chinese one has gotten faster.
Finally, also, on this graph here, I've cut out the median for active ports because it's minus 1,000% thanks to the 19-millisecond cache, which would be down here.
But at a very high level, doing this saved us 40 milliseconds at the median and just under a second at P99, for over 1.5 million API calls in a two-week period.
Now, those two graphs that you saw at the start, they look a bit like this.
We've still not actually caught up with the browser cache on the left.
We have now, actually.
And it's very fast.
But that edge cache spike is now huge.
And that's exactly what you want-- not much going to origin, not much going in the long tail, but a huge edge cache spike.
So that CDN is working.
And nothing's interfering with the traffic.
And also, a nice side effect is that we've reduced the number of HTTP status code zero errors by half.
It's now 0.09%, which is like a 50% drop, which is amazing.
So to wrap up, what have we learned?
Traffic leaving mainland China is inspected before it goes out of the country.
And that can mess up traffic routing as well.
Using a China-based CDN solved this problem and drastically improved routing and performance at the same time.
This is not like Europe or anywhere else in the world where a nearby data center will just do.
This is the mitigation that you need for China, because all traffic is inspected and some of it is rejected.
This leads to all sorts of strange behavior that can be really difficult to diagnose, including being bounced around the planet to places that make no logical sense.
I feel like the headline improvement number should be very large.
But your mileage may vary, as improvements are highly variable, depending on the type of traffic and the cacheability and cardinality of the resources involved.
The typical range is more like 8% to 35%.
But thank you for staying with me on this really deep, geeky dive into web performance.
And I hope you learned a little bit about China.
Maybe not, but thank you very much for listening. (laughs)