Article / 11th Jul 2017

Towards Channels 2.0

It's been around three years since I came up with the current Channels design - that of pushing everything over a networked "channel layer" and strictly separating protocol handling and business logic - and while it's generally working well for people, I have this feeling it can be improved, and I've been thinking about how for the past few months.

This was brought into sharp focus by recent discussion about writing more of a standard interface for asyncio in particular, but also by Tom Christie's recent work on the same subject, so I've written up where my current thinking is to both help people understand where I'm going and to be more transparent about how I think about problems like this.

So, let's start by looking at the issues with the current Channels design:

You're forced to run everything (and I mean everything) over a networked channel layer. There are good reasons for this - I'll cover them later - but it's a lot to force on people and arguably the biggest dent to smooth scaling of a Channels-based site (and to working towards being a viable WSGI replacement).
There's no standard application interface like WSGI you can plug into; you have to listen on channels and do your own event loop. This is the sort of problem frameworks should solve - things that are hard to get right and which most people need.
Persisting data for the lifetime of a socket is done using Django session backends. This was one of those hacks that made it into production, and while it works surprisingly well (session backends were made for something close to this access pattern), it sits uneasily with me, especially as it's another hard issue when it comes to scaling.
Channel names are often used and passed around directly, which makes it hard to do things like multiplexing and to have the multiplexed consumer written the same as a simplexed consumer (there's no channel name to send to for a multiplexed connection).

The overall picture means you end up with a deployment strategy that necessitates smaller clusters of protocol and worker servers, sharing their session store and channel layer internal to the cluster and having a loadbalancer balance across clusters. This isn't necessarily bad, but I think there's definitely room for improvement here and, crucially, room to massively improve the user experience (especially for small projects) along the way.

That said, the choices that led to these results were deliberate. Channels has a single design goal that led to all of these decisions:

You need to be able to trigger a send to a channel from anywhere

Events can happen in other processes (for example, a WebSocket terminated on another machine sends a message that the user connected to your machine needs to see), and you need to be able to react to those. Channels is not just a WebSocket termination framework - Autobahn is good at that by itself - it's a framework to build systems that handle WebSockets in a useful way, and solve the hard problems that come with them, like cross-socket (and thus cross-process) communication.

Hopefully you can see, then, how the system of pushing everything through a channel layer achieves this goal - by putting every process on an equal footing, we let you send to channels from anywhere with ease. Groups build upon this more by building in a broadcast abstraction to save you tracking channels yourself.

But, looking at it critically (and, crucially, with three years of hindsight), there are some natural flaws with this approach. Sure, everywhere can send to channels, but that means any special send-encoding (like the multiplexing mentioned earlier, or anything else you want to add to outgoing messages) has to be implemented in every place that might send a message. This is a pretty bad abstraction; that code should ideally all be in one place.

How do we resolve these two issues, then, and still keep that core design principle?

Moving the Line

Well, these two issues are not necessarily at odds. Yes, we need a cross-process communication layer to pass events around, and yes, we need a central place to put socket-interaction code and per-socket variable storage, but we can have both of them.

The model even already exists in Channels - the protocol server. Protocol servers, like Daphne, are code that ties directly to a socket API in order to turn socket activity into a series of events - low-level events, sure, but a protocol server still has "custom code" (the stuff that encodes to and from ASGI) and per-socket session storage (to track things like what its channel name is).

What I am proposing, then, is to move the line between the protocol server and the worker server:

Users write socket/HTTP handling code that runs against a direct API in the same process, like WSGI. This code tracks its own per-socket variables, does any special send/receive encoding and decoding, and other related things you want to happen on a per-socket basis.
Messages on the channel layer are turned from direct send-to-channel to more abstract events; users send messages with their own formatting (e.g. {"room": "general-discussion", "user": "andrew", "text": "Hi all!"}), and the code running directly against the socket can receive these and interpret them as needed.

Not only does this move quite a lot of the handshaking/request/response traffic out of the channel layer, it also provides for much nicer code. No more would you need @channel_session or @enforce_ordering; your code runs directly against the socket in-process, but can still talk to other processes if it needs to.
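
To make the idea concrete, here is a minimal sketch of socket-side code receiving an abstract event and doing the socket-specific encoding in one place. The ChatSocketHandler class, its handle_event method and the output format are hypothetical names chosen for illustration; they are not part of Channels.

```python
class ChatSocketHandler:
    """Hypothetical per-socket handler; the names here are illustrative only."""

    def __init__(self, username, send):
        self.username = username  # per-socket state lives in plain attributes
        self.send = send          # callable that writes a frame to this socket

    def handle_event(self, event):
        # An abstract, user-formatted event arriving from another process.
        # All socket-specific encoding happens here, in a single place.
        if event["user"] == self.username:
            return  # don't echo a user's own message back to them
        self.send("[%s] %s: %s" % (event["room"], event["user"], event["text"]))


sent = []  # stand-in for the real socket
handler = ChatSocketHandler("bob", sent.append)
handler.handle_event({"room": "general-discussion", "user": "andrew", "text": "Hi all!"})
handler.handle_event({"room": "general-discussion", "user": "bob", "text": "Hello"})
```

The sender only describes what happened; whether and how that reaches a given client is decided by the code that owns the socket.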

Rethinking the API

If something like this is to change, then, it means that Channels' API must also change. There needs to be a "protocol handler" abstraction, very much in the Twisted vein of dealing with different types of incoming events, and there needs to still be a good "channel layer" abstraction, allowing you to send and receive messages between processes in a standardised way.

The breaking down of the HTTP and WebSocket protocols as currently defined in Channels still works, though - we could keep the formatting and the rough naming, and just change them from being received on channels to being (for example) callable attributes:

class ChatHandler:

    def connect(self, message):
        return {"accept": True}

    def receive(self, message):
        return {"text": "You said: %s" % message['text']}

Eagle-eyed readers will have spotted the flaw with this sketched API, though, which is that there's no way to send more than one message; in fact, this is another one of the Channels design goals made manifest. The reason that the protocol server and the workers run separately in Channels right now is that they run in different modes; protocol servers are async, serving hundreds of sockets at a time, while worker servers are synchronous, serving just the one request - because Django is, by the virtue of history, built as a synchronous framework.

Async APIs are not backwards-compatible, so if the solution was to make Django async we would have to either go all-in or have two parallel APIs - not to mention the mammoth task involved and the number of developer-hours it would take. I would rather build a basic framework upon which both sync and async apps can run, letting us move components over to async as needed (or if it's needed at all; sync code is often easier to write and debug).

So, what are we to do? We need an interface that lets us send and receive messages outside the scope of a synchronous request-response cycle, yet that still allows us to use synchronous code.

Fortunately, the design of Channels consumers helps us here. They are designed to be pretty much non-blocking: they receive a message, do something, and then return. They're not instantaneous; database access and other things slow them down, but the idea is that they live nowhere near as long as the connection does.

This means we can keep the same model going if we run that code in the same process as the protocol termination, though we're probably going to be using threadpools to get decent performance out of synchronous code. This is not that much of a change from the old model in performance terms, though; current Channels workers just run one consumer at a time, synchronously and serially, the only difference being that you could individually scale their capacity compared to the protocol servers.
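
As a rough illustration of the threadpool approach, the sketch below runs a blocking consumer function on pool threads from an asyncio event loop. It uses only the standard library; sync_consumer and serve are made-up names, and a real protocol server would be considerably more involved.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def sync_consumer(message):
    # Stand-in for blocking work such as a database query.
    time.sleep(0.01)
    return {"text": "You said: %s" % message["text"]}


async def serve(messages, pool):
    loop = asyncio.get_running_loop()
    # Each synchronous call runs on a pool thread, so the event loop
    # stays free to service other sockets in the meantime.
    futures = [loop.run_in_executor(pool, sync_consumer, m) for m in messages]
    return await asyncio.gather(*futures)


with ThreadPoolExecutor(max_workers=4) as pool:
    replies = asyncio.run(serve([{"text": "a"}, {"text": "b"}], pool))
```

The pool size caps how many consumers can run at once in a process, which is the throughput trade-off discussed here.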

Now we're combining those two things, deployment changes a little. Before, Channels effectively had a built-in load balancer for WebSockets; protocol servers did very little work, and offloaded all the processing to whatever worker was free at the time. Now, WebSockets are stuck with the process they connect to having to do the work; if one process has a hundred very busy sockets, and one process has a hundred very quiet sockets, the people on the first instance are going to see much, much worse performance than those on the other one.

The solution to this is to build for failure (one of the main things I strive for in software design). It should be OK to close WebSockets at random; people use sites on bad connections, or suspend and resume their machines, and the client-side code should cope with this. Once we have this axiom of design, we can then just say that, if a server gets overloaded with sockets, it should just close a few of them and let the clients reconnect (likely to a different server, if our load-balancer is doing its job).
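
One possible shape for that load shedding is sketched below. The shed_load function and the simple "maximum socket count" threshold are assumptions made for the example; a real server might measure load quite differently.

```python
import random


def shed_load(open_sockets, max_sockets, rng=random):
    """Pick sockets to close so that at most max_sockets remain open."""
    excess = len(open_sockets) - max_sockets
    if excess <= 0:
        return []
    # Closing at random is only acceptable because clients are
    # expected to cope with disconnects and reconnect elsewhere.
    return rng.sample(sorted(open_sockets), excess)


sockets = {"sock-%d" % i for i in range(10)}
to_close = shed_load(sockets, max_sockets=7, rng=random.Random(0))
remaining = sockets - set(to_close)
```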

That lets us continue with a design that might not have particularly high throughput on individual processes when combined with a synchronous framework like Django. If and when fully-developed async components and frameworks emerge - be it from Django or some other part of the Python ecosystem - they should be able to run in a proper async mode, rather than a threadpool, and benefit from the speed increase that would bring.

Communicating Across Processes

So that's most of the worries about the main protocol-terminating code dealt with; if we were just writing an application that served individual users, and didn't do any "real-time" (in the Web sense) communication between them, we could stop there.

However, I care about that stuff; think of the most basic WebSocket or HTTP long-polling examples, and they all rely on live updates - liveblogs, chat, polls, and so on. The very nature of them compared to HTTP is the lower latency, and the ability for a server to push something down to a client when an event happens rather than waiting for another poll.

Channels currently solves this using a combination of channel names for each open socket, so you can send to them from any process, and Groups, which wrap the idea of sending the same message to a lot of sockets. As we discussed above, both of these approaches have issues if you want to add extra information (especially in the Group case, where it's very unlikely you actually want to send identical messages to everyone connected).

So, we need to move the target of these events from the sockets themselves to the code handling them. Luckily for us, that code is already structured to receive incoming events in the form of protocol events, so we can continue the same abstraction to user-generated events too.

What remains is to work out the addressing/routing, where there are two questions:

How do we send to a specific socket-handling code instance?
How do we broadcast to a whole set of them at once?

The reply_channel abstraction from current Channels continues to work well for the first case, I think; with the recent performance improvements to process-local channels, they're really quite efficient, and we know our handling code is only in one process.

I'm a little less sold on the current design of Groups. The need to handle failure (specifically, of the code that removes channels from groups on disconnect) results in some compromises, to the point where most users could improve upon the design with a database.

Offering something that tries to walk a middle line and ends up being compromised is a bad choice, in my opinion; software should be designed to work well for a certain set of cases, and show its users how to build into those abstractions and design patterns it supports - and crucially, tell them when they should look elsewhere. I often tell people that they should not use Channels or Django if it's clear what they're building won't fit, for example.

The current design of Groups, where the sending code works out a list of destination channels, is not the only kind of broadcast. We could invert things and have consumers listen to central points of broadcast instead; this is the sort of pattern you often see in things like Redis' PUBLISH/SUBSCRIBE or PostgreSQL's LISTEN/NOTIFY.

However, there are some issues bringing that to the rest of the system as designed; not only do these mechanisms make it easier to miss messages if you're interleaving them with list reading, but they also ask for an entirely different kind of message transport from the channel layers.

Instead, I think the right move here is to narrow and focus what Groups are for, and not only discourage people from using them for the wrong things but actively make it difficult, and hopefully foster third-party solutions to things like presence and connection count.

Async

There's one last elephant in the room - async support. Newer Python versions support features like async def and await keywords, and HTTP long-polling and WebSocket handling code is a natural fit for the flow of Python async code.

With the move to having consumers run in a single process per socket, that also gives us the freedom to run consumers as a single async function if we want. There's a problem, though - async and sync APIs must necessarily be different, and we still want to allow sync code (not only from earlier Python versions, which is less of a concern as time goes on, but also as it is sometimes simpler to write and maintain).

Channels' design has always been tailored to help avoid deadlock, too - something that's far too common to run into in asynchronous systems. Making it impossible for a consumer to wait on a specific type of event is a deliberate choice which forces you to write code that can deal with events coming in any order, and I want to preserve this design property.

Still, we should allow for async APIs and consumers, if only so that it's possible to do operations like database queries or sending channel messages in a non-blocking way and get better performance out of a single Python process. Exactly how I want to do that is covered below.

Bringing it all together

So, after all this, what is the result? What do I think the future of Channels looks like?

Let me run through each of the main changes with some examples.

Channel Layers

The basic channel layer interface - send, receive, group_add - will remain the same. There's no real need to change this, and several different layers have reached maturity now with the design (plus, it's proven useful even outside the context of HTTP/WebSockets; I've implemented SOA service communication with it, for example).

However, we need to address the async question. As I mentioned above, async APIs must be different methods than sync ones, and so currently we have receive_async and nothing else. This also pushes down to inside the module; implementing a channel layer that services both synchronous requests and async requests gets very difficult, as you can't rely on the event loop always being around, so everything must be written synchronously.

To that end, I'm proposing that channel layers come in two flavours - synchronous and asynchronous. One package would likely provide both, and they would share a common base class, but have different receive loops/connection management.

Servers would require one or the other to run against, depending on their internal structure and event loop; an "async" class attribute would be used as an easy way to determine what flavour you've been passed without needing to introspect the method signatures.
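
A toy sketch of the two flavours sharing a base class might look like the following. All class names are invented, and the storage is a single-process dict rather than anything networked. One deliberate deviation: since Python 3.7 `async` is a reserved keyword, so an attribute literally named async cannot be read with dot syntax; the sketch uses a hypothetical is_async flag instead.

```python
import asyncio
from collections import defaultdict, deque


class BaseInMemoryLayer:
    """Shared storage for both flavours (a toy, single-process stand-in)."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def _push(self, channel, message):
        self._queues[channel].append(message)

    def _pop(self, channel):
        queue = self._queues[channel]
        return queue.popleft() if queue else None


class SyncLayer(BaseInMemoryLayer):
    is_async = False  # hypothetical flag name, see the note above

    def send(self, channel, message):
        self._push(channel, message)

    def receive(self, channel):
        return self._pop(channel)


class AsyncLayer(BaseInMemoryLayer):
    is_async = True

    async def send(self, channel, message):
        self._push(channel, message)

    async def receive(self, channel):
        return self._pop(channel)


def check_layer(layer, server_is_async):
    # A server inspects the flag instead of introspecting method signatures.
    if layer.is_async != server_is_async:
        raise TypeError("channel layer flavour does not match the server")
    return layer


sync_layer = check_layer(SyncLayer(), server_is_async=False)
sync_layer.send("tasks", {"type": "demo"})
sync_result = sync_layer.receive("tasks")


async def demo():
    layer = check_layer(AsyncLayer(), server_is_async=True)
    await layer.send("tasks", {"type": "demo"})
    return await layer.receive("tasks")


async_result = asyncio.run(demo())
```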

The Consumer interface

Rather than having to listen to specific channels and do your own event loop, as happens now, user/framework code will instead just need to provide a low-level consumer interface. My current proposed interface for this would look like:

class Consumer:

    def __init__(self, type, channel_layer, consumer_channel, send):
        pass

    def __call__(self, message):
        pass

Well, that's the class-based version. The actual contract would be that you pass a callable which returns a callable; it doesn't matter if this is a class constructor, factory function, or something which dispatches to one of several classes depending on the type, which would be something like http, http2 or websocket.

The other arguments to the constructor are:

channel_layer, pretty much the same as it is now. How you send to other parts of the system, or add/remove yourself from groups.
consumer_channel, the replacement for reply_channel for the rest of the system. Anything sent to this channel will end up being passed into the consumer's callable, just like protocol messages.
send, a callable which takes a message in the per-protocol format and sends it down to the client. This is where you would send a HTTP response chunk, or a WebSocket frame.

The subsequent (__call__) callable is the replacement for the current consumer abstraction in Channels; the source channel no longer matters, and instead a type field will be required in messages (in any direction; this was already needed in current Channels anyway). reply_channel is also gone from messages; it's replaced by the send argument to the constructor.
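
To illustrate the "callable which returns a callable" contract, here is a small sketch with a factory function that dispatches on type, driven by a few lines standing in for a server. The consumer classes, the application function, the channel name and every message type other than websocket.send are assumptions made for the example rather than a fixed specification.

```python
class EchoWebSocketConsumer:
    def __init__(self, type, channel_layer, consumer_channel, send):
        self.send = send

    def __call__(self, message):
        # Message type names are assumed, modelled on the existing formats.
        if message["type"] == "websocket.receive":
            self.send({"type": "websocket.send", "text": message["text"]})


class HelloHttpConsumer:
    def __init__(self, type, channel_layer, consumer_channel, send):
        self.send = send

    def __call__(self, message):
        if message["type"] == "http.request":
            self.send({"type": "http.response", "status": 200, "content": b"Hello"})


def application(type, channel_layer, consumer_channel, send):
    # A plain factory function also satisfies the contract:
    # it is a callable that returns the per-connection callable.
    consumers = {"websocket": EchoWebSocketConsumer, "http": HelloHttpConsumer}
    return consumers[type](type, channel_layer, consumer_channel, send)


# Stand-in for a protocol server: one instance per connection,
# then one call per incoming event.
outbox = []
consumer = application("websocket", None, "consumer.abc123", outbox.append)
consumer({"type": "websocket.connect"})
consumer({"type": "websocket.receive", "text": "ping"})
```

Because send is handed to the constructor, a consumer can call it as many times as it likes per event, or not at all.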

Async

Async will initially be an all-or-nothing deal - either the server you are running inside uses async and so you need an async consumer, or it's synchronous and your code must be as well (the channel layer type also follows, of course).

This seems the best way to keep a clean API, though it does have the unfortunate side-effect of making two separate ecosystems (but then, this is true of sync and async code in Python in general). I have some hope that we can find a way to adapt async channel layers and APIs to have automatic synchronous versions, but that needs more research (if you have any tips on this, please get in touch).

As for the consumer API, it will be the same for both, except that the consumer's callable (__call__) will be expected to be async in an async consumer. The constructor will not be, as it's not possible (to my knowledge) to have class constructors be async.
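
A sketch of what the async variant could look like: the constructor stays a normal method while __call__ becomes a coroutine. Treating send as awaitable in async mode is an assumption of this example, as are the class name and message types.

```python
import asyncio


class AsyncEchoConsumer:
    def __init__(self, type, channel_layer, consumer_channel, send):
        # The constructor stays synchronous; only __call__ is a coroutine.
        self.send = send

    async def __call__(self, message):
        if message["type"] == "websocket.receive":
            # Assumption: send is awaitable when running under an async server.
            await self.send({"type": "websocket.send", "text": message["text"]})


async def run_demo():
    frames = []

    async def send(message):
        frames.append(message)

    consumer = AsyncEchoConsumer("websocket", None, "consumer.abc123", send)
    await consumer({"type": "websocket.receive", "text": "hello"})
    return frames


frames = asyncio.run(run_demo())
```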

Groups

Group membership will remain largely the same under the hood; rather than using channel_layer.group_add and channel_layer.group_discard with the reply_channel, they will instead be used with the consumer_channel.

However, the reframing I want to do here is in how it's presented to the developer. Because groups now feed into the Consumer class, rather than direct to the client, they are more useful as general signal broadcasting, with per-client customisation and handling possible thanks to the consumer class.

I also want to suppress any idea that groups have a membership or list of channels internally, and remove the group_members endpoint that's currently on the channel layer but heavily discouraged. They'll instead be presented as pure broadcast, and people who want connection or status tracking can implement that in the consumer code with a separate datastore instead. I'll also look at removing more of the delivery restrictions in the specification so backends can better implement them.
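
As an example of tracking presence in consumer code rather than through group membership, the sketch below records connected users in a separate datastore. A plain Python set stands in for the datastore, and PresenceConsumer and the message type names are hypothetical.

```python
class PresenceConsumer:
    """Hypothetical consumer that records presence in a separate datastore."""

    def __init__(self, username, presence_store):
        self.username = username
        self.store = presence_store  # a database table or similar in practice

    def __call__(self, message):
        if message["type"] == "websocket.connect":
            self.store.add(self.username)
        elif message["type"] == "websocket.disconnect":
            self.store.discard(self.username)


online = set()  # stand-in datastore shared by all consumers
alice = PresenceConsumer("alice", online)
bob = PresenceConsumer("bob", online)
alice({"type": "websocket.connect"})
bob({"type": "websocket.connect"})
alice({"type": "websocket.disconnect"})
```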

Cross-process send

This will remain largely the same. Sending to a specific consumer is now channel_layer.send(consumer_channel, message), and with the inclusion of the type key in a message it's easier to deal with a variety of possible incoming messages in the consumer.

Sending to a group remains exactly the same - channel_layer.send_group(group, message) - but the message, again, now goes to consumers rather than directly to the client.
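
The difference in destination can be sketched with a toy in-memory layer that delivers to consumer callables. ToyChannelLayer is an illustrative stand-in that delivers immediately within one process, which a real channel layer would not do; the chat.message type and the channel names are made up.

```python
from collections import defaultdict


class ToyChannelLayer:
    """In-memory stand-in; a real layer would cross process boundaries."""

    def __init__(self):
        self.consumers = {}             # consumer_channel -> consumer callable
        self.groups = defaultdict(set)  # group name -> consumer channels

    def group_add(self, group, channel):
        self.groups[group].add(channel)

    def group_discard(self, group, channel):
        self.groups[group].discard(channel)

    def send(self, channel, message):
        self.consumers[channel](message)

    def send_group(self, group, message):
        for channel in sorted(self.groups[group]):
            self.send(channel, message)


def make_consumer(name, log):
    def consumer(message):
        # Each consumer decides how (or whether) to forward the event
        # to its own client, based on the message type.
        if message["type"] == "chat.message":
            log.append("%s sees: %s" % (name, message["text"]))
    return consumer


log = []
layer = ToyChannelLayer()
for name in ("a", "b"):
    channel = "consumer.%s" % name
    layer.consumers[channel] = make_consumer(name, log)
    layer.group_add("room-general", channel)

layer.send("consumer.a", {"type": "chat.message", "text": "direct"})
layer.send_group("room-general", {"type": "chat.message", "text": "broadcast"})
```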

Background tasks

Channels has always presented the idea of running "background tasks", but without much in the way of advice as to what these are. People often think they're a replacement for Celery - which they are not directly, as the guarantees are different (at-most-once versus at-least-once) along with the APIs (no task status in Channels, for example).

In the same vein as re-framing Groups, I want to make it clear what the design and guarantees are, that you can have named channels to send tasks to, and also add a small bit of API to make it easier to have a loop that listens on a channel and sends responses.
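
The kind of loop described above might look roughly like this. It is only a sketch: run_worker, the reply_to key and the task names are invented, and a deque stands in for a named channel on a real channel layer.

```python
from collections import deque


def run_worker(queue, handlers, send):
    """Drain a named channel's queue, dispatching on the message type."""
    while queue:
        message = queue.popleft()  # at-most-once: the message is gone once taken
        handler = handlers.get(message["type"])
        if handler is None:
            continue  # unknown types are dropped, not retried
        send(message["reply_to"], handler(message))


def make_thumbnail(message):
    return {"type": "thumbnail.done", "path": message["path"] + ".thumb"}


replies = []
tasks = deque([
    {"type": "thumbnail.make", "path": "/img/cat.png", "reply_to": "consumer.abc123"},
    {"type": "unknown.task", "reply_to": "consumer.abc123"},
])
run_worker(
    tasks,
    {"thumbnail.make": make_thumbnail},
    lambda channel, message: replies.append((channel, message)),
)
```

If the worker dies after taking a message, that task is simply lost, which is the at-most-once guarantee that separates this from a Celery-style queue.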

This is a relatively minor bit of Channels, however, and at some point I expect people to use ASGI channel layers more directly if they want a more advanced non-response-based flow (as we do at Eventbrite for SOA transport).

Remote Consumers

Of course, there are valid reasons to run consumers remotely from your protocol termination server, and even with this new interface it will be possible to have a pair of interfaces (a consumer-to-channel-layer bridge, and a channel-layer-to-consumer worker) that let you run Channels in basically the same layout it is now. This won't be a priority, but I like a design that lets you plug in components like this.

Summary

So, after all that, what am I proposing to change? Here's a short example of what a Django consumer might look like (with a consumer superclass that handles the basics for us, and dispatches based on message type):

# Sketch of the proposed API; DjangoConsumer and Rooms are hypothetical.
class ChatConsumer(DjangoConsumer):

    type = "websocket"

    # Room name drawn from URL pattern: /chat/room/foobar/
    def websocket_connect(self, message, room_name):
        self.room = Rooms.objects.get(slug=room_name)
        self.room.announce("User %s has joined" % self.user)
        # group_add takes (group, channel); the group name is illustrative
        self.group = "chat-%s" % room_name
        self.channel_layer.group_add(self.group, self.consumer_channel)
        self.send({"type": "websocket.send", "text": "OK"})

    def websocket_receive(self, message):
        self.room.message(message['text'])

    # This method is called by the group when it sends messages of type
    # "group.message" to us.
    def group_message(self, message):
        self.send({"type": "websocket.send", "text": message['text']})

    def websocket_disconnect(self, message):
        self.channel_layer.group_discard(self.group, self.consumer_channel)

Here's a quick summary of the main changes:

Splitting the ASGI spec up into three parts: "servers", "channel layers", and "protocols", so that people could use just the server and protocol parts if they want to only build in-process stuff.
Consumers turn from a Django-only implementation of Channels to the primary interface between a server and the user's code (getting more generic in the process) - and that code now runs in-process with the protocol server.
Because consumers now run in the same process where the socket is terminated, there's no need to use Django sessions to store data; they can use normal variables instead.
This also means that reply_channel, @enforce_ordering and @channel_session are all gone.
You send cross-process messages to consumers, rather than directly to the sockets they are coupled to, so that you can implement send formats for the socket in a single place.
Channel layers have a separate asynchronous implementation allowing full async code to run against them, and protocol servers can choose what mode(s) to implement.
Messages have an explicit type key so you don't need to guess what they are based on their schema.
Deployment will change to be fatter Daphne (or other protocol server) instances with the code embedded, and no worker server processes (unless you want to do background tasks of some kind).

And what stays the same?

The protocol formats for HTTP and WebSocket stay near-identical; the main thing will be to drop the path keys in the receive and disconnect messages as this is now provided with the consumer class instance.
The channel layer send/receive/group APIs, barring optional extension to provide parallel async versions
Routing inside Django will remain very similar to the end user, though we'll move around a few internal pieces to make it a direct new-style consumer. We might swap channel names out for type names and make you write all consumers as classes, but I'm not sure yet.
Databinding will likely remain very similar, but you'll be able to handle outgoing events individually in consumers rather than formatting for a single send directly to a socket.
Client-side code and interactions won't change at all.

Obviously, this isn't a final specification document - we'll develop that as we go along, changing things slightly as the code is implemented and realities hit, but I wanted to get this post out to give everyone a better idea of what the aim is.

Of course, we'll do our best to maintain backwards compatibility, but there may be some cases where we'll have to provide helpful error messages for old APIs or upgrade guides instead.

I'd love to hear any feedback (positive, or negative with suggestions for alternatives); you can find my email address, IRC handle and Twitter handle on my about page.

This is quite a big change for Channels, but I think it's for the best, and, based on talking to people developing with it over the last few years, it should fix a lot of the idiosyncrasies that people run into. It's impossible to know what the best fit is until we get there, and even then nothing will be perfect, but I'm still determined to end up with a good solution for async-capable Python web interfaces, be it directly or by inspiring competition.