12 August

CoE Development Update: August 2023

By Caspian

Greetings, Elyrians!

Despite this being a shorter update cycle than usual, I've made massive progress on the networking layer and microservice architecture since last we spoke. At a high level, I've been focusing on three main areas over the previous few weeks:

  1. Integrating a new logging solution
  2. Diagnosing the bottlenecks we were experiencing a few weeks ago when I wrote the July Update
  3. Implementing performance improvements and getting a head start on the missing Server Platform elements.

I've been neck-deep in networking and serialization code that, while interesting to me, isn't hugely exciting to others, so I'll keep this update as brief as possible. As usual, feel free to click the links below to jump right to the section that interests you the most, or keep reading and hit them all.

  1. Progress Update
  2. The New Logging System
  3. The Platform Bottlenecks
  4. Serialization & Protocol Woes
  5. New Development
  6. What's Up Next?

Progress Update

To get a sense of where we stand on development and to get an abbreviated view of the work done over the last few weeks, let's take another look at the CoE Scope Outline. As a reminder, when all the Engineering boxes on this outline are checked off, we'll have everything we need to host Chronicles of Elyria, Kingdoms of Elyria, or any other MMOG. That is, when all the Engineering boxes are checked off, we'll have a fully functional, distributed, scalable MMO back end.

I want to take a brief moment to let that sink in. Without pretty graphics to look at, it's easy to overlook the monumental nature of that. But there are only a dozen or so western game companies that have such a back end, and even fewer have built one in such a way as to make it capable of hosting an evolving online world.

CoE Scope

As you can see in the above image, I've successfully integrated the new logging library, and work is in progress on both the Pub/Sub and the Routing & Load Balancers. I'll talk about those in greater detail below.

As another reminder, here's the high-level scope document for the completion of CoE. Of course, long before that happens, a subset of these boxes will signify the completion of Kingdoms of Elyria and subsequent release to our backers.

CoE Scope

The New Logging System

A good logging solution is vital for diagnosing or triaging problems that arise both during development and production. If you look back at the previous couple of updates, you'll see I was "logging" messages to the console window so I could watch what was happening in real time. That's great for development but will only work in the short term.

With the performance bottlenecks I encountered last month, I needed a more robust logging solution that would allow me to easily enable/disable various levels of verbosity and one that could be written to disk, database, or even a sophisticated monitoring server. My first task over the last few weeks was integrating the Serilog logging framework. As you can see from the image below, that work has been completed.
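As a rough illustration of what "levels of verbosity" and multiple output targets mean in practice, here is a minimal sketch using Python's standard `logging` module. It is an analogy only, not Serilog and not the engine's actual configuration; the `build_logger` helper and the logger name are hypothetical.

```python
import io
import logging


def build_logger(name, console_stream, detail_stream, console_level=logging.INFO):
    """Create a logger with two outputs that filter at different verbosity levels."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # accept everything; each handler filters for itself
    logger.propagate = False
    logger.handlers.clear()

    fmt = logging.Formatter("%(levelname)s %(message)s")

    # Quiet output: only INFO and above by default (e.g. a console window).
    console = logging.StreamHandler(console_stream)
    console.setLevel(console_level)
    console.setFormatter(fmt)

    # Verbose output: everything, including DEBUG (e.g. a file on disk).
    detail = logging.StreamHandler(detail_stream)
    detail.setLevel(logging.DEBUG)
    detail.setFormatter(fmt)

    logger.addHandler(console)
    logger.addHandler(detail)
    return logger


if __name__ == "__main__":
    console_out, detail_out = io.StringIO(), io.StringIO()
    log = build_logger("gateway", console_out, detail_out)
    log.debug("incoming queue depth: 12")  # kept only by the verbose output
    log.warning("server node dropped")     # kept by both outputs
    print(console_out.getvalue(), end="")
    print(detail_out.getvalue(), end="")
```

The point of the split is that the noisy per-message detail can be switched on or redirected without touching the code that emits it.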

CoE Scope

After integrating the logging solution, I went on to triaging and diagnosing the performance bottleneck issue I saw at the end of July.

The Platform Bottlenecks

If you recall from the July development update, while I could run 30k entities per process with little frame-rate drop and should theoretically be able to scale that up to millions, once I spread out to 8 different processes running on different physical machines, I started seeing server nodes drop and then ultimately the client would freeze. Using the new logging infrastructure, I tracked the problem down to a few different issues. To get a sense of what the first two problems were, take a look at the following image:

CoE Scope

In the above image, the I-column shows the number of messages in the incoming queue. The O-column is the number of messages in the outgoing queue. Finally, the S-column displays the number of messages sent over the last second. Two problems immediately stand out.

First, that's a lot of messages in the incoming queue! Like, a stupid number of messages. I was trying to minimize the computations in the Medium client, so I sent position and collision messages for every entity in the world 60 times a second from the servers to the Gateway! The result was, at times, over a million messages per second being processed by the Gateway.

The second, perhaps less obvious, problem is that the numbers in the incoming column are going up! When the messages accumulate, they arrive faster than the Gateway can process them, and the result is the client may receive notifications quite some time after the server sends them. That gives the appearance of the server or client freezing or becoming unresponsive.
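The arithmetic behind a growing queue can be sketched in a few lines. The arrival rate below is the "over a million messages per second" figure mentioned above; the processing rate of 800,000 per second is purely an assumed number for illustration, and both helper functions are hypothetical.

```python
def backlog_after(seconds, arrivals_per_sec, processed_per_sec, start=0):
    """Queue depth after `seconds`, given constant arrival and processing rates."""
    depth = start
    for _ in range(seconds):
        # The queue can never go below empty, even if processing outpaces arrivals.
        depth = max(0, depth + arrivals_per_sec - processed_per_sec)
    return depth


def delivery_delay(depth, processed_per_sec):
    """Seconds a newly queued message waits behind `depth` earlier messages."""
    return depth / processed_per_sec


if __name__ == "__main__":
    arrivals = 1_000_000   # from the text: over a million messages per second
    processed = 800_000    # assumed processing capacity, for illustration only
    depth = backlog_after(10, arrivals, processed)
    print(depth, delivery_delay(depth, processed))
```

Whenever arrivals exceed processing capacity, the backlog grows without bound, and so does the delay before the client sees a message, which is what reads as a "freeze".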

My first attempt at fixing the issue was to adjust the application protocol to support batching of messages. Rather than sending a message for every single entity, I buffered the messages from each server process between game updates and sent them as a collection. You can see the results of that here:
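A minimal sketch of the batching idea, assuming a hypothetical `PositionUpdate` message and a `send` callback standing in for the real transport; none of these names come from the engine itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PositionUpdate:
    entity_id: int
    x: float
    y: float
    z: float


@dataclass
class BatchingSender:
    """Buffers per-entity updates and sends them as one collection per game update."""
    send: Callable[[List[PositionUpdate]], None]  # hypothetical transport callback
    pending: List[PositionUpdate] = field(default_factory=list)

    def queue(self, update: PositionUpdate) -> None:
        self.pending.append(update)

    def flush(self) -> None:
        """Call once at the end of each game update."""
        if not self.pending:
            return  # nothing changed this update, so send nothing
        batch, self.pending = self.pending, []
        self.send(batch)


if __name__ == "__main__":
    sent = []
    sender = BatchingSender(send=sent.append)
    for entity_id in range(1000):
        sender.queue(PositionUpdate(entity_id, 0.0, 0.0, 0.0))
    sender.flush()
    print(len(sent), "message(s) carrying", len(sent[0]), "updates")
```

Batching cuts the per-message overhead, but the receiver still has to process every update inside the batch, which is consistent with the bottleneck moving rather than disappearing.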

CoE Scope

In the above image, the client log is on the left, and the Gateway is on the right. You can see significantly fewer messages being sent to/from the Gateway, but the problem we saw on the Gateway before has now moved to the client. The client receives messages faster than it can process them in a single game update. Admittedly, we'd also begin seeing it on the Gateway again if I increased the number of entities a bit more.

The root problem is that processing the incoming messages takes too long. That can happen due to inefficiencies in three possible places:

  1. The underlying transport protocol, which is responsible for sending and resending messages as necessary to ensure they're received on the other side.
  2. The application protocol that is responsible for serializing/deserializing the messages and doing any additional processing required.
  3. The application, which is required to process any messages in a timely fashion.

The last point is a non-issue because I'd previously made the DECAS asynchronous and thread-safe. Messages are cleanly put into an Actor queue when they arrive, which is then processed by the application in a separate thread. So the culprit(s) lie in the previous two areas: The Application Protocol and the Transport Protocol.
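The "queue in, separate thread out" pattern described above can be sketched roughly as follows. This is a generic actor mailbox, not the DECAS implementation; the `Actor` class and its methods are hypothetical.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker thread to exit


class Actor:
    """Receives messages on any thread and processes them one at a time on its own thread."""

    def __init__(self, handler):
        self._mailbox = queue.Queue()  # thread-safe FIFO
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post(self, message):
        """Called by the network layer; returns immediately without processing."""
        self._mailbox.put(message)

    def _run(self):
        while True:
            message = self._mailbox.get()  # blocks until a message is available
            if message is _STOP:
                break
            self._handler(message)

    def stop(self):
        """Process everything already queued, then shut the worker down."""
        self._mailbox.put(_STOP)
        self._thread.join()


if __name__ == "__main__":
    seen = []
    actor = Actor(seen.append)
    for n in range(5):
        actor.post(n)
    actor.stop()
    print(seen)
```

Because `post` only enqueues, the code receiving from the network never waits on game logic, which is why the application itself is ruled out as the slow step.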

Serialization & Protocol Woes

I apologize in advance for this section. It's going to get technical. But, as this is where I've spent most of my time these last couple of weeks, if I say, "I worked on network stuff," it dramatically oversimplifies the amount of work done and the complexity of the problems. If your eyes start to glaze over here, move on to the next section and be content in knowing I've likely identified the sources of the performance bottleneck and am in the process of resolving them.

Having realized the bottlenecks lie in the application and transport protocols, I spent the last two weeks of this development cycle doing two things.

First, I ripped apart the application protocol and replaced the previous third-party serialization library I used with another one. In particular, I previously used the freely available and widely used MessagePack library. Having seen the performance issues I was having, I've since replaced that with the MemoryPack Library. Here are some publicly available numbers to give you a sense of the expected performance improvements from that:

MemoryPack vs. MessagePack

While the image above shows the arrow moving from a JSON serializer to the MemoryPack serializer, I'm only moving from MessagePack to MemoryPack. Even so, for tightly packed structures such as Vectors - used extensively in 3D games - that's still a 100x performance improvement!

Moving from MessagePack to MemoryPack also forced me to revisit my defined application protocol. Or really, my lack thereof. MessagePack allows you to specify "typeless" serialization, where you don't have to define your message headers, etc. That's because when you use typeless serialization, all the "type" information about what kind of message you're sending is embedded in the stream. And it's not small. On average, the typeless serializer of MessagePack was adding 250 or so bytes to each message buffer. That might not sound like a lot, but when the typical "maximum transmission unit" (MTU) of an Ethernet-based network is 1,500 bytes, that's a significant amount of data going towards type information.

I wasn't planning to stay with a "typeless" serialization long-term. However, MemoryPack doesn't allow for typeless serialization, so the transition from MessagePack to MemoryPack forced my hand. I had to immediately replace the lack of a defined protocol with an explicit implementation. Like the previous fix, that resulted in messages being about 100x smaller.
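To make the size difference concrete, here is a small sketch comparing an explicit fixed header with a message that carries its type name in the stream. Both layouts are invented for illustration: this is neither MessagePack's typeless wire format nor the engine's actual protocol, and the type id and type name are hypothetical.

```python
import struct

POSITION_UPDATE = 1        # hypothetical numeric message type id
_COMPACT = "<HIfff"        # little-endian: u16 type id, u32 entity id, three f32
MTU_BYTES = 1500           # typical Ethernet MTU, as discussed above


def encode_position(entity_id, x, y, z):
    """Explicit protocol: a 2-byte type id followed by a fixed-layout body."""
    return struct.pack(_COMPACT, POSITION_UPDATE, entity_id, x, y, z)


def decode_position(buf):
    msg_type, entity_id, x, y, z = struct.unpack(_COMPACT, buf)
    if msg_type != POSITION_UPDATE:
        raise ValueError(f"unexpected message type {msg_type}")
    return entity_id, x, y, z


def encode_self_describing(type_name, entity_id, x, y, z):
    """Self-describing variant: the full type name travels with every message."""
    name = type_name.encode("utf-8")
    return struct.pack("<H", len(name)) + name + struct.pack("<Ifff", entity_id, x, y, z)


if __name__ == "__main__":
    compact = encode_position(7, 1.0, 2.0, 3.0)
    verbose = encode_self_describing(
        "Game.Messages.PositionUpdate, Game.Shared", 7, 1.0, 2.0, 3.0
    )
    print(len(compact), len(verbose))
    # Rough upper bound on compact messages per packet, ignoring IP/UDP headers.
    print(MTU_BYTES // len(compact))
```

With an explicit protocol, both sides agree in advance on what type id 1 means, so the only per-message cost is the id itself.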

The second thing I've done over this last week is revisit the transport protocol. This part of the development update is where every network engineer and network game programmer out there will cringe at me, and for good reason.

When I first started developing the network layer for the Soulborn Engine, I was using UDP. UDP was an excellent choice because, in my decision to resend an entity's updated position 60 times a second, it didn't matter whether or not UDP lost any network messages. The entity positions were going to be resent 1/60th of a second later anyways.

However, as I started rapidly scaling up the number of entities, I noticed that some messages I was sending that needed reliability weren't arriving. Again, I knew vanilla UDP wouldn't work in a production environment, but it turns out that UDP can start dropping packets as various buffers fill up, even on a closed LAN.

When I discovered this last month, I tried various reliable UDP protocols out there, and all of them resulted in a pronounced decrease in performance. So out of curiosity, I tested TCP as a baseline. Voilà! Things sped up again! Not 100%, but better. So for last month's "rope bridge" implementation, I was using TCP. Boo! Hiss! You suck, Jeromy! Yeah, I know.

At any rate, one of TCP's known problems is head-of-line blocking. It arises because TCP delivers data to the application reliably and strictly in order, so anything that arrives after a missing piece has to wait for it. But sometimes - especially in video games - you want to process other incoming messages immediately.

As an example, let's say I send 1000 entity updates. But for whatever reason, they arrive in reverse order. Now, this is a contrived, worst-case scenario, but it serves to illustrate a point. The last message, which has already arrived, can't be processed until every message before it arrives. The same is true for every message before that. So, you could wait with 999 messages in a network queue before even processing one. And, of course, if the buffer spills over and packets start getting resent, the problem worsens.
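The worst case described above can be simulated with a simplified, message-level model of in-order delivery. Real TCP works on a byte stream of segments rather than discrete messages, so treat this as a sketch of the principle only; `in_order_delivery` is a hypothetical helper.

```python
def in_order_delivery(arrival_order):
    """For each arriving sequence number, list what can be handed to the application.

    A message is released only once every earlier sequence number has arrived.
    """
    buffered = set()
    next_expected = 0
    released_per_arrival = []
    for seq in arrival_order:
        buffered.add(seq)
        released = []
        # Release the longest contiguous run starting at the next expected number.
        while next_expected in buffered:
            buffered.remove(next_expected)
            released.append(next_expected)
            next_expected += 1
        released_per_arrival.append(released)
    return released_per_arrival


if __name__ == "__main__":
    reverse = list(range(999, -1, -1))  # 1000 messages arriving in reverse order
    result = in_order_delivery(reverse)
    stalled = sum(1 for released in result if not released)
    print(stalled, "arrivals released nothing;",
          len(result[-1]), "messages released on the final arrival")
```

For position updates that will be superseded a frame later anyway, waiting like this buys nothing, which is the motivation for a transport that can hand such messages over as they arrive.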

All that said, TCP is not a viable protocol for the Soulborn Engine. So this last week, I've been writing the architecture for a new transport protocol that seamlessly integrates with my application protocol and the larger concept of a "Distributed Entity Component Actor System." I've been calling it the Spirit Protocol. The result will be dramatically faster message handling that, combined with the work I did on serialization, should finally allow us to scale to the millions of entities per second we need.

You'll notice I said "will" and "should" because, despite my best efforts, I couldn't complete all this work in the week and a half I've been on it. So while there's been a ton of progress the last couple of weeks, there isn't anything "demonstrable." I know people like to see pretty pictures, but the platform is in a broken state for now, and you'll have to wait until the next update to see the fruits of my labor.

New Development

The last thing I worked on these last couple of weeks was more of a byproduct and happy accident than anything.

Last month I implemented a bespoke Gateway (Edge/Portal Service) process, which was responsible for accepting client connections and sending/receiving messages between the client and the internal server. However, due to the work I needed to do on the serialization and application protocol, I took the time to refactor the Gateway into a more generic application bridge.

The idea is that the bridge accepts connections from two sides. Each side can have different rules on how to authenticate & authorize incoming messages. It can also have different rules on whether or not a message gets sent to the other side and, if so, to which endpoints. That has several different uses.

First, if the bridge authenticates one side using user credentials and the other using a shared secret, it acts as a one-way authenticator. If messages from the "internal" side are filtered to the "external" side through the client Id, then the bridge can continue to be used as a Gateway.

On the other hand, if the bridge receives messages from one side and forwards them to connections on the other side by filtering on the Service identifier, then it's a router. If it filters messages by looking at the health and CPU load of the endpoints on the other side, it can act as a Load Balancer.

Finally, if it receives messages from one side, sends them to the other using the same connections as the first side, and filters based on a mailbox identifier, it can act as a Pub/Sub.

All that said, by refactoring the Gateway into a more generic solution, I'm now 90% of the way toward having a functioning router, load balancer, and pub/sub broker. That's why I've marked those as in progress in the scope document.
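One way to picture the generic bridge is as a single forwarding loop with a pluggable selection rule. The sketch below is an assumption-laden illustration, not the actual Gateway code: every class, field, and function name is hypothetical, and authentication/authorization is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    name: str
    service: str = ""
    cpu_load: float = 0.0
    subscriptions: set = field(default_factory=set)
    inbox: list = field(default_factory=list)


@dataclass
class Message:
    payload: str
    client_id: str = ""
    service: str = ""
    mailbox: str = ""


def by_client_id(msg, endpoints):
    """Gateway behavior: deliver only to the client the message is addressed to."""
    return [e for e in endpoints if e.name == msg.client_id]


def by_service(msg, endpoints):
    """Router behavior: deliver to every endpoint offering the named service."""
    return [e for e in endpoints if e.service == msg.service]


def least_loaded(msg, endpoints):
    """Load balancer behavior: pick the least busy endpoint for the service."""
    candidates = by_service(msg, endpoints)
    return [min(candidates, key=lambda e: e.cpu_load)] if candidates else []


def by_mailbox(msg, endpoints):
    """Pub/sub behavior: deliver to every endpoint subscribed to the mailbox."""
    return [e for e in endpoints if msg.mailbox in e.subscriptions]


class Bridge:
    """Forwards messages to the far side according to a pluggable selection rule."""

    def __init__(self, far_side, select):
        self.far_side = far_side
        self.select = select

    def forward(self, msg):
        targets = self.select(msg, self.far_side)
        for endpoint in targets:
            endpoint.inbox.append(msg)
        return [endpoint.name for endpoint in targets]


if __name__ == "__main__":
    servers = [
        Endpoint("s1", service="physics", cpu_load=0.9),
        Endpoint("s2", service="physics", cpu_load=0.2),
    ]
    print(Bridge(servers, by_service).forward(Message("tick", service="physics")))
    print(Bridge(servers, least_loaded).forward(Message("tick", service="physics")))
```

Swapping the `select` function is the only difference between the four roles here, which mirrors the claim that one refactored component covers most of the router, load balancer, and pub/sub work.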

What's Up Next?

Alright, folks, that wraps up this Development Update. This one was more on the technical side, but I am, at my core, a computer geek, and this stuff excites me. I also want to ensure that the absence of demonstrable gameplay footage, etc., isn't confused with a lack of progress. There's real, meaningful, challenging work getting done here.

With that out of the way, what's up next?

First, the summer is almost over, and my kids return to school in a couple of weeks. I've spent most of their summer this year, save for some time over the 4th of July when family visited, sitting behind my keyboard. So I'll be taking some time off between now and when my kids go back to school to get in some quality time with them.

After my break, I'll be returning to the server platform work to get the above stuff finished up. While it's "ok" that there's nothing demonstrable in this update, especially given that it was a short development cycle, I'm keenly aware that it can't continue. So I plan to have this stuff done and doing something sexy by the September development update.

Then, once this stuff is out of the way, I can move on to finishing up the gameplay support systems like AI, pathfinding, and animation/physics triggers. And, of course, then it's on to working on gameplay mechanics full-time until KoE is ready to be put into the hands of the Alpha 1 and 2 backers.

Finally, in the spirit of taking some risks in the interest of our mission statement, I've got some surprises in store for all of you that will be announced in the October Development Update.

That's all for now! Thanks again to all of our backers and supporters who continue to send words of encouragement as I continue down this long, winding road. I know it seems unending, but if you look carefully, the scenery is changing around us, so we're making progress!

Pledged to the Continued Development of the Soulborn Engine and the Chronicles of Elyria,