Engineering flexible permissions for Zulip open-source team chat

How we seamlessly transitioned thousands of organizations to a group-based permissions system in a performance-sensitive application.

Tim Abbott 16 min read

Zulip is an open-source team chat application designed for remote and hybrid work. With conversations organized by topic, Zulip is ideal for both live and asynchronous communication.

Over the past few months, we’ve rolled out an incredibly flexible system for managing permissions in Zulip: Permissions can now be granted to any combination of roles, groups, and individual users. This applies to permissions for managing channels, groups, and the organization as a whole.

In this post, we explain how this new system was engineered, and how we transitioned smoothly from the previous, much simpler permissions model to the new one.

Out of Zulip’s ~800 database migrations in the last 10 years, 115 were part of this project.

Rebuilding Zulip’s permissions system is the most complex transition that we’ve ever done. Prior to the transition, Zulip offered user roles (e.g., administrator, moderator, member, or guest) for convenient permissions management. Each permission setting was a simple dropdown menu, where administrators could pick the minimum role required to do the action.

The transition project had four major design goals:

  1. A smooth incremental migration path from the legacy role-based permissions system.
  2. Offering the best possible experience for users who administer permissions. Organizations should be able to simply create groups that align with their teams and functions, and use those groups to configure permissions, without any additional busywork or annoying limits.
  3. Keeping Zulip’s performance as snappy as ever.
  4. Leaving Zulip’s core design more elegant and maintainable than it was before.

Zulip is 100% open-source, so you can browse the code if you’re curious about any details mentioned below. Check out our companion blog post if you’d like to learn more about what the system does for Zulip’s users.

Design goal: A smooth incremental migration path

Zulip is available as a cloud service or a self-hosted solution, and is used by thousands of organizations around the world. So from a practical standpoint, our most important constraint was to be able to smoothly migrate thousands of existing Zulip installations to the new system.

We were keen to avoid the following classes of issues that can make a complex transition like this fail:

  1. Migrations that do not exactly translate permissions from the previous system to the new one. Merging a commit that changes what permissions existing users have in any way comes with significant communications and release-management overhead. Each such change needs to be coordinated with carefully written outreach to administrators, to help them understand the changes and what settings they may need to review or update in conjunction with the upgrade. It also creates extra work for organization administrators. We ended up planning the project so that no unpleasant announcements of this form were required.
  2. Migrations that throw an error and thus prevent upgrading the Zulip server to the new version. This class of failure would leave a self-hosted Zulip server in a half-upgraded state that could be challenging for the system administrator to resolve. We really care about making upgrades as painless as possible for self-hosters, so it was important to ensure each individual database transition had a simple enough specification to be carefully reviewed and tested.
  3. Non-deployable changes in main or enormous unmerged branches. Non-deployable changes merged to main are a nightmare for release management. At the same time, accumulating an enormous branch to merge months later comes with major costs:
    • Merge conflicts waste time as long as a branch stays unmerged.
    • Careful code review becomes difficult or impossible.
    • There’s no user-facing benefit until the entire project is complete.

To solve these problems, we decided to split what is conceptually a single huge transition into a long series of small, safe transitions that could be continuously merged to the main branch. This means the old system and the new system needed to co-exist.

There are many ways to do that, many of them ugly. We decided to represent the old role-based permissions inside the new permissions system. We did this by creating a system group corresponding to each role. Those groups function exactly like other groups in the new permissions system.

The members of each system group are guaranteed transactionally to be the set of users with that role. For example, the transaction for changing a user’s role automatically moves that user to the appropriate system group.
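
As a rough illustration of that invariant (not Zulip's actual code), here is a minimal sketch using SQLite from Python's standard library. The table names, the `change_role` helper, and every group name other than role:admins are assumptions made up for this example. It also simplifies by treating each system group as a flat list of users with that role.

```python
import sqlite3

# Hypothetical mapping from role to system group name. Only "role:admins"
# appears in this post; the other names are invented for illustration.
ROLE_TO_SYSTEM_GROUP = {
    "administrator": "role:admins",
    "moderator": "role:moderators",
    "member": "role:members",
    "guest": "role:guests",
}


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, role TEXT NOT NULL);
        CREATE TABLE system_group_members (
            group_name TEXT NOT NULL,
            user_id INTEGER NOT NULL REFERENCES users(id),
            PRIMARY KEY (group_name, user_id)
        );
        """
    )
    return conn


def create_user(conn: sqlite3.Connection, user_id: int, role: str) -> None:
    group_name = ROLE_TO_SYSTEM_GROUP[role]
    with conn:  # One transaction: commit on success, roll back on error.
        conn.execute("INSERT INTO users (id, role) VALUES (?, ?)", (user_id, role))
        conn.execute(
            "INSERT INTO system_group_members (group_name, user_id) VALUES (?, ?)",
            (group_name, user_id),
        )


def change_role(conn: sqlite3.Connection, user_id: int, new_role: str) -> None:
    """Update the role and the system group membership atomically."""
    new_group = ROLE_TO_SYSTEM_GROUP[new_role]  # Fails before any write.
    with conn:
        conn.execute("UPDATE users SET role = ? WHERE id = ?", (new_role, user_id))
        conn.execute("DELETE FROM system_group_members WHERE user_id = ?", (user_id,))
        conn.execute(
            "INSERT INTO system_group_members (group_name, user_id) VALUES (?, ?)",
            (new_group, user_id),
        )


def members_of(conn: sqlite3.Connection, group_name: str) -> set[int]:
    rows = conn.execute(
        "SELECT user_id FROM system_group_members WHERE group_name = ?",
        (group_name,),
    )
    return {row[0] for row in rows}
```

Because both writes happen inside one transaction, no reader can observe a user whose role and system group disagree.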

System groups allowed us to:

  • Preserve the current set of configured permissions for all existing users.
  • Keep roles as a convenient way to manage permissions when an organization’s requirements are simple.
  • Incrementally migrate permission settings from the legacy role-based API format to the new groups infrastructure separately from implementing a new permission settings UI. This greatly simplified the technical transition process, and allowed it to proceed without interrupting regular SaaS deployments to Zulip Cloud or release engineering.
  • Prototype the server and API parts of the groups implementation when adding new permissions settings. All new permissions settings added between August 2021 and the Zulip 9.0 release in July 2024 presented a dropdown list of roles in the UI, just like older permissions settings. But under the hood, each option in that dropdown was translated to a system group in the API (e.g., role:admins for “Administrators”). This prototyping work allowed us to discover several important refactors to internals of the Zulip server that were required in order to migrate dozens of permissions to the new system.

Design goal: Offer the best possible experience for users who administer permissions

We wanted Zulip’s permission system to be easy and intuitive to use, without the kinds of awkward limitations that are a hallmark of permission systems in business software.

This required two major properties:

Allowing groups to be nested arbitrarily

Being able to nest groups arbitrarily is important if you want to represent the structure of any large organization in a groups system. At the same time, it is quite challenging technically. Even major tech companies often don’t support this. For example:

“Each child team only has one parent team.”

GitHub Docs

“Subgroups can belong to one immediate parent group.”

GitLab Docs

This means that on GitHub and GitLab, a “Project X Designers” group could not be a subgroup of both the “Project X” group and the “All Designers” group — it could only be a subgroup of one or the other. Limitations like these are extremely common, and usually result from technical constraints coming from the system design.

In Zulip’s design, a group’s membership is any combination of subgroups and individual users. Only direct subgroups are stored as members of the group; code to check if a user is a member of a group walks the subgroup graph. (If that sounds complex, see below for how we keep it efficient.) The subgroup graph is acyclic, enforced with careful use of PostgreSQL locking, and there are no other limitations on how groups can be nested.
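
To make the model concrete, here is a minimal in-memory sketch of the idea, not Zulip's implementation. The `Group` class and the helper names are hypothetical, and the sketch ignores concurrency entirely, whereas the real system relies on PostgreSQL locking to keep the graph acyclic.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical group record: only direct members and subgroups are stored."""

    direct_member_ids: set[int] = field(default_factory=set)
    direct_subgroup_ids: set[int] = field(default_factory=set)


def descendant_ids(groups: dict[int, Group], group_id: int) -> set[int]:
    """Return group_id plus every group reachable through subgroup edges."""
    seen: set[int] = set()
    stack = [group_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(groups[current].direct_subgroup_ids)
    return seen


def is_member(groups: dict[int, Group], user_id: int, group_id: int) -> bool:
    """Check direct or indirect membership by walking the subgroup graph."""
    return any(
        user_id in groups[gid].direct_member_ids
        for gid in descendant_ids(groups, group_id)
    )


def add_subgroup(groups: dict[int, Group], parent_id: int, child_id: int) -> None:
    """Add an edge, rejecting it if the parent is already reachable from the child."""
    if parent_id in descendant_ids(groups, child_id):
        raise ValueError("adding this subgroup would create a cycle")
    groups[parent_id].direct_subgroup_ids.add(child_id)
```

With this shape, a “Project X Designers” group can be added as a subgroup of both “Project X” and “All Designers”, and its members count as members of both parents.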

Allowing permissions to be assigned to any combination of roles, groups and users

From a technical perspective, it’s simplest for each permission to be assigned to a single group. However, we felt that it’s important not to force users to create one-off groups like “administrators-and-managers-and-Bob” or “people-who-can-send-DMs”.

In the resulting design, from the user’s perspective, each permission can be assigned to any combination of roles, groups and users. Under the hood, permissions are stored in the database as a single group ID defining who has that permission. This ID may refer to any of:

  • A regular user-created group, with a name, description, and settings.
  • A system group corresponding to a role.
  • A special system group like role:nobody, which is guaranteed to contain no users, or role:everyone, which is guaranteed to contain everyone with an account on the server. These have proven quite useful for various optimizations.
  • A special anonymous group, which is an unnamed container for a list of user IDs and a list of subgroup IDs.

Anonymous groups were a key insight. In theory, you don’t need them: You could just require the value of every permission setting to be a single group. Organization administrators would need to create a group for each distinct collection of people that they want to assign a permission to.

But that model is awkward in practice, even if you layer on a helpful UI. For example, when you create a new channel, it’s natural to want to give yourself the permission to administer it. You surely don’t want to create a group that just includes yourself (with its own name, description, and permission settings) in order to finish creating your new channel.

At first, we considered solving that specific case by creating a system group for every user who was assigned a permission. But that solution falls apart the moment someone wants to add a second channel administrator. Anonymous groups proved to be an elegant generalization of that initial idea: A way to store configured permissions that is consistent, compact, and easy to combine with named groups in SQL queries.
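
The sketch below shows one way the "single group ID" storage could work. It is a simplified model under stated assumptions: `GroupTable`, `setting_group_id`, and the rule for when to reuse a named group are all hypothetical, chosen only to illustrate how an anonymous group avoids one-off named groups.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class AnonymousGroup:
    """Unnamed container: a list of user IDs plus a list of subgroup IDs."""

    direct_member_ids: frozenset[int]
    direct_subgroup_ids: frozenset[int]


@dataclass
class GroupTable:
    """Hypothetical stand-in for the groups table; every group has one ID."""

    named_groups: dict[int, str] = field(default_factory=dict)
    anonymous_groups: dict[int, AnonymousGroup] = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def add_named(self, name: str) -> int:
        group_id = next(self._ids)
        self.named_groups[group_id] = name
        return group_id

    def setting_group_id(self, user_ids: set[int], group_ids: set[int]) -> int:
        """Return the single group ID to store for a permission setting."""
        unknown = group_ids - set(self.named_groups)
        if unknown:
            raise KeyError(f"unknown group IDs: {sorted(unknown)}")
        # Assumed rule: exactly one named group and no individual users
        # can be stored directly, with no anonymous group needed.
        if not user_ids and len(group_ids) == 1:
            return next(iter(group_ids))
        # Any other combination gets an anonymous group of its own.
        group_id = next(self._ids)
        self.anonymous_groups[group_id] = AnonymousGroup(
            frozenset(user_ids), frozenset(group_ids)
        )
        return group_id
```

In this model, the channel creator case is just `setting_group_id({creator_id}, set())`, and adding a second administrator means storing a two-member anonymous group rather than inventing a named one.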

Design goal: Keeping Zulip’s performance snappy

Application performance can be a major challenge when introducing a permission group system that allows subgroups.

For example, suppose the application needs to answer a simple question: “Does user X have permission to send a message to channel Y?” The naive implementation would recursively walk the graph of subgroups, doing a database query to get the subgroups of each group in the graph. It’s easy to imagine configurations where this would end up doing dozens of database queries for a single check!

In Zulip, it was crucial for us to limit the number of database queries for latency-sensitive actions, like sending a message. Even the simplest database query can add between 0.3ms and 2ms of latency to a request, depending on how the Zulip server is deployed. (Simple installations run the database on the same host as the application server, with latency at the low end of that range; those with a separate database host may have latency toward the high end.)

For example, Zulip currently spends 25-50ms processing and delivering a brief direct message, which requires 7 database queries. Some of those queries are required in order to check two permissions configurations for sending direct messages in Zulip. If we had instead used a naive implementation, we would have had to weigh whether the performance cost of those permissions features justified including them in the product at all.

A more challenging concern is that Zulip often needs to do permission calculations in bulk. For example, if a channel’s name is edited, the Zulip server must send an event to live-update all clients that have access to the metadata for the channel. This means it must be reasonably efficient to ask questions like “Who are all the users who have access to this channel’s metadata?”

The following techniques were important for keeping the number of database queries to a minimum in our implementation:

  • Using PostgreSQL Common Table Expression (CTE) queries, which allow the server to consistently check in a single optimized database query whether the acting user (1) has a given permission, or (2) is a (possibly indirect) member of a group that does.

  • Carefully designing bulk-query helpers on top of the CTE-based framework in order to keep the number of queries down to O(1) in situations where a naive operation would do a linear or even quadratic number of database queries.

    For example, an administrator might bulk-subscribe an employee to a dozen or more channels associated with a new role. A naive implementation would involve a loop over those channels, doing a handful of database queries for each channel to:

    1. Check whether the acting user has the permission to subscribe someone to that channel.
    2. Add the new subscriber.

    One can easily see this leading to dozens of database queries.

    The Zulip server does all the permissions checks for this operation in just 2 queries:

    • One query using CTEs fetches the full recursive set of group IDs that the acting user is a member of, including anonymous groups.
    • Another query fetches the can_add_subscribers_group values for all the channels involved.

    And in fact we combine the latter query with fetching other channel details that the API request needs — so the group-based permissions add a total of 1 query beyond what was needed anyway! The server then just checks that the second set is a subset of the first, i.e., that for each channel, the can_add_subscribers_group for that channel is among the groups that the user is a member of.

  • Prefetching data that allows us to skip doing any queries at all for common scenarios. Many permission settings have a default or common configuration of role:nobody or role:everyone, where just the identity of the system group is sufficient to answer the question of whether the user has that permission.

    For example, when processing a request to send a message to a channel, the Zulip server will prefetch the name of can_send_message_group as part of the database query to fetch the channel object. That way, the code that needs to check that permission can look like this:

can_send_message_group = channel.can_send_message_group
# Check the name using prefetched data, saving work in common scenarios.
if hasattr(can_send_message_group, "named_user_group"):
    if can_send_message_group.named_user_group.name == SystemGroups.EVERYONE:
        return
    if can_send_message_group.named_user_group.name == SystemGroups.NOBODY:
        raise JsonableError(_("You are not allowed to post in this channel."))

# Otherwise, we pay the cost of checking the permission using the database.
if not user_has_permission_for_group_setting(
    channel.can_send_message_group,
    sender,
    Channel.channel_permission_group_settings["can_send_message_group"],
    direct_member_only=False,
):
    raise JsonableError(_("You are not allowed to post in this channel."))
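
The two-query bulk check described earlier in this section can be sketched with a recursive CTE. This example is an illustration only: it uses SQLite (which also supports `WITH RECURSIVE`) instead of PostgreSQL so that it runs standalone, and the schema, sample data, and function names are assumptions rather than Zulip's real ones. Only the `can_add_subscribers_group` setting name comes from the description above.

```python
import sqlite3

# Recursive CTE: start from the groups the user belongs to directly, then
# repeatedly add every group that has one of those groups as a subgroup.
RECURSIVE_GROUPS_SQL = """
WITH RECURSIVE user_groups(group_id) AS (
    SELECT group_id FROM group_members WHERE user_id = ?
    UNION
    SELECT gs.supergroup_id
    FROM group_subgroups AS gs
    JOIN user_groups AS ug ON gs.subgroup_id = ug.group_id
)
SELECT group_id FROM user_groups
"""


def make_db() -> sqlite3.Connection:
    """Build a tiny hypothetical schema with sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE group_members (group_id INTEGER, user_id INTEGER);
        CREATE TABLE group_subgroups (supergroup_id INTEGER, subgroup_id INTEGER);
        CREATE TABLE channels (
            id INTEGER PRIMARY KEY,
            can_add_subscribers_group_id INTEGER
        );
        -- User 10 is a direct member of group 1; group 1 is a subgroup of group 2.
        INSERT INTO group_members VALUES (1, 10);
        INSERT INTO group_subgroups VALUES (2, 1);
        INSERT INTO channels VALUES (100, 2), (101, 1), (102, 3);
        """
    )
    return conn


def groups_for_user(conn: sqlite3.Connection, user_id: int) -> set[int]:
    """Query 1: every group the user is a direct or indirect member of."""
    return {row[0] for row in conn.execute(RECURSIVE_GROUPS_SQL, (user_id,))}


def can_subscribe_to_all(
    conn: sqlite3.Connection, user_id: int, channel_ids: list[int]
) -> bool:
    """Check the permission for many channels using two queries in total."""
    if not channel_ids:
        return True
    user_groups = groups_for_user(conn, user_id)
    placeholders = ",".join("?" * len(channel_ids))
    # Query 2: the permission group configured for each requested channel.
    rows = conn.execute(
        "SELECT id, can_add_subscribers_group_id FROM channels "
        f"WHERE id IN ({placeholders})",
        channel_ids,
    ).fetchall()
    if len(rows) != len(set(channel_ids)):
        return False  # At least one requested channel does not exist.
    required_groups = {group_id for _, group_id in rows}
    return required_groups <= user_groups  # Subset check done in Python.
```

The number of queries stays at two no matter how many channels are passed in or how deeply the groups are nested, which is the property the naive per-channel loop lacks.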

Design goal: Leave Zulip’s core design more elegant and maintainable

In my experience, large codebases tend to suffer from accelerating technical debt. A growing system naturally becomes harder to understand over time. Refactoring, thoughtfully extending, and cleaning up software that you don’t understand is a pain! So with time, developers tend to invest less time into doing those things, and gain less benefit from the time they do invest. The result is a downward spiral.

Part of Zulip’s strategy for avoiding this fate is to make sure that big changes to the project not only make the product better, but also make the system easier to reason about.

You can’t avoid complexity, but you can contain complexity in small blocks of well-written code, and make the abstractions that are used by the rest of the system intuitive. Some things I really like about the new system are:

  • Clean abstractions, with most of the complex logic inside functions whose name clearly describes what they do. The server codebase makes extensive use of Python’s modern keyword-only parameters, so you can generally understand the effect of a function call without inspecting its definition.

  • Every permissions setting works the same way in the database: it’s encoded as a single group ID, making it easy to work with in SQL and in Python.

  • But permissions settings in the API use two formats: Either an integer ID of a named group, or a simple object with direct_member_ids and direct_subgroup_ids lists, defining the membership of an anonymous group. This approach of passing anonymous groups by value, without referring to their database IDs in the API, helped a great deal in keeping client implementations simple. In particular, our web client UI can just keep track of user pills and group pills, and diff those to see if a setting has been changed. All the logic related to destruction/creation of anonymous groups is contained in the server.

  • API requests to change a permissions setting will fail if multiple users try to change the same permission at the same time. This works by having clients send both the desired new value and the expected previous value of the setting, allowing the server to detect such races by comparing the expected previous value with what is currently in the database.

    This prevents a situation where an administrator’s attempt to remove a permission from Janice could silently fail because another administrator granted that permission to Lee at the same time. Because this situation could cause a security issue, we believe it’s important to protect users from it ever happening, even if it would be quite rare in practice.

    We’ve been quite happy with this technique, and I expect we’ll be using it in many other places in the future.

As a result, it feels straightforward to add new permissions settings to the system. Usually, the only decisions we need to make are around naming and semantics, a.k.a., the product design.
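
The race detection and the two API value formats from the list above can be sketched together. This is a minimal in-memory model, not the real endpoint: `parse_setting_value`, `update_setting`, `StaleValueError`, and the `new`/`old` parameter names are hypothetical, and a real server would do the comparison and the write inside a single database transaction.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class AnonymousGroupValue:
    """An anonymous group passed by value, with no database ID involved."""

    direct_member_ids: frozenset[int]
    direct_subgroup_ids: frozenset[int]


SettingValue = Union[int, AnonymousGroupValue]


class StaleValueError(Exception):
    """The client's expected previous value no longer matches the stored one."""


def parse_setting_value(raw: int | dict) -> SettingValue:
    """Accept either a named group ID or an anonymous group description."""
    if isinstance(raw, int):
        return raw
    return AnonymousGroupValue(
        frozenset(raw["direct_member_ids"]),
        frozenset(raw["direct_subgroup_ids"]),
    )


def update_setting(
    settings: dict[str, SettingValue], name: str, *, new: int | dict, old: int | dict
) -> None:
    """Apply the change only if the stored value is what the client expected."""
    if settings[name] != parse_setting_value(old):
        raise StaleValueError(f"{name} was changed by someone else; reload and retry")
    settings[name] = parse_setting_value(new)
```

If one administrator grants the permission to Lee while another is still looking at the old value, the second administrator's request to remove Janice is rejected instead of silently overwriting or losing a change.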

Reasons you might want a different design

While I think Zulip’s groups design would work well for a lot of software, several aspects of how Zulip works contribute to it being a good technical direction for us, and they may not hold for your project.

  • Zulip, like most business software, is naturally sharded by organization, and supporting millions of groups in a single organization is not a design goal. As a result, we can use standard SQL locking and transactions to prevent cyclic groups and other invariant failures that could be highly problematic for a production system. I suspect that the challenges fundamental to large distributed systems, especially those with a global social network component, may be part of the reason that GitHub’s and GitLab’s groups systems are more limited than Zulip’s.

  • Zulip is a performance-intensive application, where users care a lot about latency: A chat app that takes even a couple of seconds to load each screen will frustrate its users immensely. One of our differentiators is that Zulip is faster than the competition.

    The types of optimizations we made may be unnecessary in an application that’s used less intensely than chat (e.g., apps for HR or expense reimbursements). Many such applications are successful despite remarkably bad performance.

  • Our team has deep algorithms expertise: Zulip’s technical leadership includes several former grad students in theoretical computer science. While this system doesn’t use any particularly fancy algorithms, our expertise allowed us to confidently solve various unexpected performance bottlenecks that we encountered along the way.

Take a look for yourself

I hope that sharing these engineering design challenges and solutions will help others design the systems they are building.

If you’re curious to learn more about our groups system, take a look at Zulip’s 100% open-source codebase! The server implementation, including all the permissions settings for accessing groups, is only about 2400 lines of code, plus ~3600 lines of tests. Running git ls-files '*group*' in a clone of zulip/zulip will find most of the groups code.


Check out our other blog posts published today: the Zulip 10.0 release announcement, and a blog post describing what the group permissions system does for Zulip’s users.