Dream of scalable & enriched graphql-subscriptions
Originally posted on Medium
Stylized photo of Jagala juga (Estonia), original photo by Aleksandr Abrosimov, Wikimedia Commons
Last time, I wrote about the 5-year-long journey of having GraphQL in Pipedrive. Now, let me tell you about a 10-year-long journey of delivering websocket events to the frontend.. and maybe help you out with it too.
The product need for asynchronous events comes up any time a user needs to be notified about something by the server:
- you’ve got an instant message
- an uploaded image has finished resizing
- your colleague has changed their avatar
- your bulk change of 10k deals has progressed to 99%
So, async events enrich the UX with detail & interactivity.
But most importantly, they solve the problem of inconsistent data displayed or stored on the browser side. Without such updates, other users won’t see Pipedrive deals being renamed; another browser tab won’t receive activity deletions. Even within the same browser tab, your views may not talk well with each other, lacking a single storage and falling out of sync.
Historical dive
Luckily, Pipedrive “solved” this issue 10 years ago. In 2012, Andris, Kapp & Tajur developed the socketqueue service, which to this day transports API events to the front-end using the sockjs library.
Tajur’s talk in 2016 was one of the reasons I re-created a similar setup, got puzzled over how to manage user connections with socket.io & scale it beyond one server, and ultimately joined Pipedrive to find out. I was fascinated by event streaming and the possibilities it brings.
As you can see, his talk is more about RabbitMQ — a message broker that stands between the PHP monolith and socketqueue.
Pros and cons
In-memory queues have allowed Pipedrive to withstand bursts of events triggered by external integrations or in-house features like deal import from an xls file or a bulk edit.
The API event that is pushed to RabbitMQ and on to the frontend is denormalized, embedding several related entities. That’s good, because recipients get a consistent snapshot of multiple things at the same time and don’t need to re-query anything…
But it is also hugely inefficient, as all browser tabs receive all possible changes, whether they want them or not — traffic has sometimes reached 80GB per day for a single customer.. while the web app on the user’s laptop does all of the heavy filtering on its CPU.
So we can’t keep infinitely bloating the event schema, making it even heavier.
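To make the trade-off concrete, here is a hedged sketch of the two event shapes — the field names and entities are illustrative, not Pipedrive’s actual schema:

```typescript
// Hypothetical denormalized API event: a consistent snapshot where
// related entities are embedded, so clients never need to re-query…
const denormalizedEvent = {
  event: "deal.updated",
  data: {
    deal: { id: 42, title: "Big deal", stage_id: 3, value: 9000 },
    person: { id: 7, name: "Mari", email: "mari@example.com" },
    organization: { id: 5, name: "ACME OÜ", address: "Tallinn" },
    pipeline: { id: 1, name: "Default", stages: [1, 2, 3] },
  },
};

// Hypothetical small, normalized domain event: only the change itself;
// interested clients fetch related data on demand.
const domainEvent = {
  type: "deal.updated",
  dealId: 42,
  changed: { stage_id: 3 },
};

// The snapshot is several times larger on the wire — and every tab
// receives it whether it cares about the change or not.
const size = (e: object) => JSON.stringify(e).length;
console.log(size(denormalizedEvent), size(domainEvent));
```

Multiply that size difference by every event, every tab, and every user, and the 80GB/day figure stops being surprising.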
Cell logic as number in a web-socket URL
The second problem is the noisy neighbour. Scalability, as I mentioned, is solved with “cell logic” (in a good scenario — aka tenant isolation), which means sharding socketqueue application servers by company ID. This also means we must have a fixed number of containers and a fixed number of queues to ensure predictable routing.
So a problem arises whenever one company generates thousands of events — they are not distributed among all servers; instead they spike the CPU of a single node to the point of health-check failure. There isn’t any room to grow a single-core CPU vertically, only to double the number of servers, which wastes infrastructure resources.
noisy neighbour CPU spike of particular pod
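The cell routing described above can be sketched as follows — the modulo mapping and the queue count are assumptions, since the post only says servers are sharded by company ID:

```typescript
// "Cell logic" sketch: a fixed number of queues/servers, with each
// company pinned to one by its ID. Routing is predictable, but one
// noisy company can never spread its load beyond its cell.
const QUEUE_COUNT = 8; // fixed: scaling means doubling, not adding one

function cellFor(companyId: number): number {
  return companyId % QUEUE_COUNT;
}

// Every event for company 12345 lands on the same cell…
console.log(cellFor(12345)); // → 1
// …so a burst of thousands of events from one company hits a single
// node's CPU while the other QUEUE_COUNT - 1 cells stay idle.
```

This determinism is exactly what makes the routing predictable — and exactly why one hot tenant becomes one hot pod.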
The third issue is visibility & tracing. As with the REST API, without proper documentation (whether it is swagger or type definitions), it is very hard to understand what can be inside the event schema and, most importantly, who uses what in case you want to make a breaking change. This leads to a vicious cycle of nobody removing anything from the event & bloating it even more.
Finally, if the web-socket connection is down because the user went into a tunnel or just closed their laptop — they may lose events. Could that cause a lost deal and lost profits for the user?
Over the years we’ve optimized the hell out of socketqueue, so it:
- is multi-threaded, separating queue listening from WS connection handling
- compresses traffic
- checks permissions & visibility
but what if we could do better..? 🤔
How
The idea behind graphql subscriptions is pretty simple — you declare only what you want to receive, given some filtering arguments, and it is the server’s job to intelligently filter events.
I can’t do a better job of explaining the basic setup than Ben does here 😁. What he shows is a single server instance where pubsub is just in-memory event routing directly from a mutation. But in our case there are no mutations — the real changes are generated by the PHP legacy far, far away.
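The single-server setup can be boiled down to something like this — a simplified in-memory stand-in mimicking the shape of the graphql-subscriptions PubSub class, not the real library:

```typescript
// Minimal in-memory pub/sub: mutations publish, subscriptions receive,
// no broker in between — everything lives in one process.
type Handler = (payload: unknown) => void;

class InMemoryPubSub {
  private topics = new Map<string, Set<Handler>>();

  subscribe(topic: string, handler: Handler): () => void {
    const set = this.topics.get(topic) ?? new Set<Handler>();
    set.add(handler);
    this.topics.set(topic, set);
    return () => { set.delete(handler); }; // unsubscribe function
  }

  publish(topic: string, payload: unknown): void {
    this.topics.get(topic)?.forEach((h) => h(payload));
  }
}

// A "mutation" publishing directly to subscribers in the same process —
// which is exactly what breaks down when the real changes originate in
// a remote PHP monolith instead of a local resolver.
const pubsub = new InMemoryPubSub();
const received: unknown[] = [];
pubsub.subscribe("DEAL_UPDATED", (p) => received.push(p));
pubsub.publish("DEAL_UPDATED", { dealId: 42, title: "renamed" });
console.log(received.length); // → 1
```

The moment the publisher and the subscriber no longer share a process, this in-memory map has to be replaced by an external transport — which is where the broker questions below come from.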
Mission scope
So what I wanted to achieve was:
- demo graphql subscriptions in production, with an explicit schema
- test horizontal pod scaling
- increase reliability by using Kafka instead of RabbitMQ
- use small, normalized domain events to keep things simple
- keep event delivery latency below 2 sec [to not be slower than the socketqueue we’re trying to replace]
Risks
What I didn’t know..
- how do we authenticate?
- what protocol should we use? SSE? WS? what’s this Mercure thing.. and do any of these fall back to polling?
- will WS/TCP connections be bound to same pod or not?
- can we do multiple subscriptions at a time?
- what storage / transfer medium do we use? DB? redis streams? kafka? redis pubsub? KTable?
- can we rewind events in time in case a user got disconnected? would we need to store a cursor ID per entity? also, how do live queries work?
- how do we filter events by company, user, session, entity? is there a standard for subscription filter fields?
- can we re-use the federated graphql schema? can we enrich events? what should the QoS / retry logic be if the gateway is down? Lee Byron went into this in depth in his talk over 5 years ago.
- how many items can we subscribe to?
- how do we unsubscribe? or disconnect users when they log out?
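On the filtering question above, one possible shape of server-side filtering looks like this — the filter fields named here (companyId, userId, entityType) are illustrative guesses, not a standard:

```typescript
// A domain event as it might arrive from the broker.
interface DomainEvent {
  companyId: number;
  userId?: number;
  entityType: string;
  entityId: number;
}

// Filter arguments a client could pass when subscribing.
interface SubscriptionFilter {
  companyId: number;     // tenant isolation: always required
  userId?: number;       // optionally only one user's events
  entityType?: string;   // optionally only one entity type
  entityIds?: number[];  // optionally only specific items
}

// The server drops events the subscription didn't ask for,
// instead of shipping everything to the browser.
function matches(event: DomainEvent, f: SubscriptionFilter): boolean {
  if (event.companyId !== f.companyId) return false;
  if (f.userId !== undefined && event.userId !== f.userId) return false;
  if (f.entityType !== undefined && event.entityType !== f.entityType) return false;
  if (f.entityIds !== undefined && !f.entityIds.includes(event.entityId)) return false;
  return true;
}

const e: DomainEvent = { companyId: 1, userId: 9, entityType: "deal", entityId: 42 };
console.log(matches(e, { companyId: 1, entityType: "deal" })); // → true
console.log(matches(e, { companyId: 2 })); // → false: different tenant
```

Whether such fields should be standardized across all subscriptions was exactly one of the open questions.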
Proof of concept
Before the mission, I made a simple node service that connected to Kafka and proxied events without any filtering. This was already promising. But for the mission I wanted to try out go, to involve all CPUs and maximize efficiency, while this PoC served as a fallback (and proved itself very useful).
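The essence of that PoC fits in a few lines — consume events and fan them out to every connected client, with no filtering. The Kafka consumer is stubbed as a plain array here, and the socket type is simplified; the real service would use an actual Kafka client and WS connections:

```typescript
// Anything with a send() — stands in for a websocket connection.
type Socket = { send: (msg: string) => void };

// Unfiltered proxy: every consumed event goes to every client,
// exactly like the existing socketqueue behaviour.
function proxy(consumedEvents: string[], sockets: Socket[]): void {
  for (const event of consumedEvents) {
    sockets.forEach((s) => s.send(event));
  }
}

const inbox: string[] = [];
const client: Socket = { send: (m) => inbox.push(m) };
proxy(["deal.updated", "person.added", "deal.deleted"], [client, client]);
console.log(inbox.length); // → 6: 3 events × 2 clients
```

No filtering means no CPU spent per-subscription — which is why it was "already promising" as a baseline, and why the Go version aimed to add intelligent filtering without losing that throughput.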
Liftoff
Four of us (myself, Pavel, Kristjan and Hiro) set off for the entire summer to do just that.. explore the unknown 🛸 and bring value back to the launchpad.