Month: March 2024

  • Why not matrix?

    Archived from https://telegra.ph/why-not-matrix-08-07 March 22, 2024
    Additional links here: https://benharri.org/matrix-sucks/


    haru august 7, 2023

    at this point it seems like most of the tech community is familiar with matrix, the “open network for decentralized communication”. lots of projects and communities have migrated from a host of other platforms, including irc, discord and slack with the promise that their new spaces will be free forever. i first discovered matrix in 2021 and have dedicated a lot of time trying to understand exactly how it works, as well as trawling through github issues to try and understand whether we should consider matrix safe. i’ve also evaluated conduit as well as dendrite and tried a number of clients.

    first of all, a quick primer on what matrix actually is. even though element markets matrix as the foundation of a chat app, it’s far more complicated than that. under the hood, matrix is actually a distributed partially-replicated graph database. a matrix “room” is actually a directed acyclic graph (“dag”) made up of events which can contain things like messages, user membership states and bans and other things. servers participating in a room are eventually consistent, that is, they should replicate enough state so that everyone sees roughly the same thing.
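
    as a rough illustration of the room graph described above, here is a toy python sketch. the `Event` and `Room` names, their fields and the example server names are made up for this example and are not the real matrix event format; the only point is that events reference their parents, so two servers appending at the same time create a fork that a later event merges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # hypothetical, heavily simplified event: real events carry far more fields
    event_id: str
    sender: str
    prev_events: tuple = ()  # ids of the parent events in the graph
    body: str = ""


class Room:
    def __init__(self):
        self.events = {}

    def append(self, event):
        # refuse events whose parents are unknown; a real server would
        # try to fetch the missing history instead of giving up
        missing = [p for p in event.prev_events if p not in self.events]
        if missing:
            raise KeyError(f"unknown parents: {missing}")
        self.events[event.event_id] = event

    def tips(self):
        # events that nothing else points at yet (the current ends of the graph)
        referenced = {p for e in self.events.values() for p in e.prev_events}
        return sorted(eid for eid in self.events if eid not in referenced)


room = Room()
room.append(Event("$create", "@a:one.example"))
# two servers append concurrently, both on top of the same parent
room.append(Event("$m1", "@a:one.example", ("$create",), "hello"))
room.append(Event("$m2", "@b:two.example", ("$create",), "hi"))
fork_tips = room.tips()
# a later event names both ends as parents and joins the fork back together
room.append(Event("$m3", "@a:one.example", ("$m1", "$m2"), "merged"))
merged_tips = room.tips()
```

    note that nothing in this structure says whether `$m1` or `$m2` came first, which is where the ordering problems mentioned later come from.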

    in order to “send” something into a room, a server that is participating in a room can also try to append events onto the graph. since every participating server needs to be able to verify that what they are receiving from another server actually makes sense, a set of auth rules is defined in the spec to determine when to allow or to drop certain events. events are cryptographically signed by the sending server so that other servers can check those origins too. every server participating in a room needs to perform these checks, in the hope that if all servers agree, newly created events will be happily accepted by all participating servers.
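
    to make the idea of auth rules a little more concrete, here is a minimal python sketch of a receiving server deciding which events to keep. the rule set is an assumption made for illustration (sender must be joined and must have enough power for the event type); the real rules in the spec are much longer, and signature checking is left out entirely. the `state` layout and the `is_allowed` helper are hypothetical.

```python
def is_allowed(event, state):
    """toy auth check: the sender must be joined and have enough power."""
    sender = event["sender"]
    if state["membership"].get(sender, "leave") != "join":
        return False
    required = state["required_power"].get(event["type"], 0)
    return state["power"].get(sender, 0) >= required


# hypothetical snapshot of room state as one server currently sees it
state = {
    "membership": {
        "@admin:one.example": "join",
        "@user:two.example": "join",
        "@gone:three.example": "ban",
    },
    "power": {"@admin:one.example": 100},
    "required_power": {"m.room.name": 50},
}

incoming = [
    {"id": "$1", "sender": "@admin:one.example", "type": "m.room.name"},
    {"id": "$2", "sender": "@user:two.example", "type": "m.room.message"},
    {"id": "$3", "sender": "@user:two.example", "type": "m.room.name"},
    {"id": "$4", "sender": "@gone:three.example", "type": "m.room.message"},
]

# every server runs the same check on its own copy of the state
accepted = [e["id"] for e in incoming if is_allowed(e, state)]
dropped = [e["id"] for e in incoming if not is_allowed(e, state)]
```

    the outcome depends on the `state` each server holds, so two servers with different views of the room can reach different verdicts on the same event.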

    however, over time, i have collected together quite a list of issues that i consider to be either unsolved or dangerous. without further ado, here is my list of things to consider and reasons why you might not want to use matrix:

    1. the graph is append-only by design and events that get sent into the room can not ever be guaranteed to be deleted, so a matrix room can potentially end up accumulating history endlessly. for some applications this is desirable, but for chat and other applications it removes any possibility of deniability. deleting history is a very hard problem in matrix because dropping whole events can create gaps in the graph which servers either need to complete event auth or may struggle to skip over when backfilling history.
    2. if you do want to delete something, you can send a redaction event which asks other servers very nicely to delete the content of the event, but redactions are advisory and a badly behaving server participating in a room could simply ignore the redaction request, instead holding onto the entire history. it’s basically impossible to know at the point of asking for a deletion whether this has happened or not.
    3. however, servers that choose to ignore redactions, or fail to process them for some other reason, can leak supposedly-deleted data to other servers later on. if a new server joins the room for the first time or wants to backfill history, it can ask any server in the room for help doing so, including the badly-behaved server. nothing requires a server to return the redacted skeleton event, it can opt to return the full unredacted event if it wishes. in fact this might even happen purely by accident.
    4. certain events, like membership changes, bans or pretty much any event that exercises some control over another user can’t be deleted ever as they become woven into the “auth chain” of future events, and in order for a server to independently verify each event, they need to be able to prove how the event was allowed to happen, iteratively, back to the beginning of the room. every time a user joins, leaves, changes their name or avatar, gets kicked, banned, kicks or bans someone else or changes critical room settings, traces of these actions become burned into the room history permanently.
    5. as with most places on the internet, spam is inevitable, so another fun way to attack a room is just to join hundreds or thousands of bots to the room, making the room graph very complex and difficult to compute, which in turn means that servers waste lots of cpu time and clients have lots more work to do, especially in encrypted rooms. the only way to discard all of this spam complexity is to recreate the room.
    6. even room history is a best-effort endeavor. while the room graph itself provides some causal ordering, as some events need to follow other authenticating events, it’s exceptionally hard to linearize history if you only partially know the history of the room. tiebreaks used in the server also include fields like depth and the origin timestamp, which can both be forged. so unsurprisingly, different servers can see messages arriving in a different order to each other. most of the time there is nothing you can do about this.
    7. speaking of forging things, it is also somewhat possible to insert messages into history by crafting events in the graph that refer to older ancestor events and, as long as it looks vaguely plausible (like with a sane depth and/or timestamp value), other servers will accept these events without much question and users may not be able to tell the difference if these events eventually get backfilled onto their own server’s copy of the room history.
    8. another thing that is worth noting is that end-to-end encryption in matrix is completely optional. this is pretty much required as public rooms would be extremely heavy and probably completely unusable otherwise, but the only things trying to ensure that your dms are encrypted are clients and even they don’t have to if they don’t want to. anything in a federated room that isn’t end-to-end encrypted can end up replicated in plain-text and also end up as part of a semi-permanent history.
    9. the end-to-end encryption is also annoyingly fragile as it depends on device list updates to be delivered between servers reliably. a failure to sync these correctly will result in broken encryption in rooms where others cannot decrypt your messages and this seems to be a constant source of encryption bugs.
    10. sometimes these device list updates also leak information about your device, like which matrix client you are using or which operating system/platform you are on. some homeservers seem to have started trying to filter this information recently, but as always, there’s no guarantees.
    11. the entire matrix api surface is http and json. for clients and client developers this makes things nice and simple, but the federation api also uses http and json and also tries to be cryptographically secure. for signing events and requests to work, matrix expects the json to be in canonical form, except the spec doesn’t actually define what the canonical json form is strictly, so there’s every possibility different implementations will end up generating different signatures for the same event.
    12. oh wait, that actually happened. turns out that matrix homeservers written in different languages have json interoperability issues, so you might just get random events or requests that fail signature checks between different types of matrix servers because no one knows exactly how to get it right. synapse, the flagship implementation, simply relies on python’s sort_keys and calls it a day, to hilarious effect, and even other matrix devs working for element haven’t been able to figure out how to match this behavior in other implementations like dendrite.
    13. those signature checks also rely on server signing keys which can be expired and replaced, except the signing key expiry is completely arbitrary so you can simply set an expiry date for your server signing key in the past and watch as new servers are now totally unwilling to authenticate any events from your server, resulting in split-brained rooms. feeding certain key expiry information to certain servers and not others might even make for a novel eclipse attack.
    14. split-brained rooms are actually a common occurrence too, as the distributed nature of matrix requires a kind of consensus algorithm (known as “state resolution”) in order to work out what to do when a server ends up with more than one conflicting set of state. the algorithm is not watertight and relies on some preconditions (like event signatures being accurately checked across servers) so when certain edge-cases occur in this algorithm, which they often do, a “state reset” can occur which can end up kicking users out of rooms or reverting state changes. at this point state resets are an on-going meme with no solution in sight.
    15. worse than that, state resets happen quite a bit more often when servers written in different languages interoperate because of the json interoperability issues, and it’s also possible to force them to happen in some cases by joining a room, making some power actions and then setting your signing key to expire before they happened. some servers will believe the events if they passed the signature check at the time, other servers will drop the events if they learn about the key expiry first, and then chaos ensues as servers try to request state snapshots from each other, eventually breaking the room to the point that those servers may never agree again!
    16. this has happened to amusing effect more than once, as room admins and moderators have lost their powers over public rooms many times due to state resets, leaving them unable to defend themselves and their communities against spam, harassment and other types of attacks. in some cases, the core matrix team have had to try to forcefully shut down rooms on their own instances and recreate them on more than one occasion, hoping people will rejoin. the big bad Matrix HQ room has been the subject of attacks multiple times now and so have other community rooms.
    17. speaking of shutting down rooms, you can’t actually force a room to be shut down across the federation. if there are three servers in a room and one of them wants to shut down the room, there’s no way to stop the other two servers from continuing along just as they were. this is good news if you care about internet freedom, sure. this is bad news if people start to abuse instances with unsavory rooms or content. in certain jurisdictions this could present a serious problem.
    18. speaking of moderation, this is notoriously difficult too, as moderation relies entirely on the functioning of the event auth system and breaks down if state resets happen or if someone abuses their power. this has seemingly resulted in problems with trying to moderate or clean up attacked rooms before. these changes can be extremely hard to revert, especially if there are other servers participating in the room, as the room will eventually try to converge on what the state resolution algorithm thinks the best state is, which isn’t necessarily the state that you want.
    19. pretty much any user on your homeserver can upload media to your server’s media repository at pretty much any time, and media downloads are unauthenticated by default. as a result, you can often just generate an HTTP link to the media. someone might be using your media repository as their personal sharing box at your expense and you may not even realize.
    20. you can also ask someone else’s homeserver to replicate media by asking that homeserver to return a copy of it. based on the content ID and the origin server name in the URL, the homeserver will download the media from the origin and likely cache it locally in the process. that could end up being a fun denial-of-service.
    21. in addition to that, media uploads are unverified by default. if you are running your own matrix instance, there’s a good chance that nothing is scanning that media for unsavory content (like csam or viruses) unless you have also installed a separate content scanner. this is a separate package and is not a core part of the synapse distribution, nor any other homeserver that is available today.
    22. actually the eager replication of media could end up potentially being quite a massive headache, since all it takes is for one of your users to request media from an undesirable room for your homeserver to also serve up copies of it, at which point you could become liable for hosting copies of illegal media like csam or copyrighted material without necessarily knowing it.
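
    to illustrate items 11 and 12, here is a small python sketch of how two "canonical" json encoders can disagree. the first uses python's `sort_keys`, which orders keys by unicode code point. the second is a hypothetical implementation that orders keys by utf-16 code units instead, as string comparison does in some other languages. this is only one assumed way for implementations to diverge, not necessarily the mismatch seen between real homeservers, and the sha-256 digest stands in for whatever hash or signature is computed over the bytes.

```python
import hashlib
import json


def canonical_codepoint(obj):
    # python's sort_keys compares keys by unicode code point
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_utf16(obj):
    # hypothetical other implementation: sorts keys by utf-16 code units.
    # only handles a flat dict, which is enough for the demonstration.
    items = sorted(obj.items(), key=lambda kv: kv[0].encode("utf-16-be"))
    body = ",".join(
        json.dumps(k, ensure_ascii=False) + ":" + json.dumps(v, ensure_ascii=False)
        for k, v in items
    )
    return ("{" + body + "}").encode("utf-8")


# plain ascii keys: both encoders produce identical bytes
plain = {"b": 1, "a": 2}
plain_agrees = canonical_codepoint(plain) == canonical_utf16(plain)

# one key just below U+10000 and one above it: the two orderings differ
tricky = {"\ufffd": 1, "\U0001F600": 2}
bytes_a = canonical_codepoint(tricky)
bytes_b = canonical_utf16(tricky)
tricky_agrees = bytes_a == bytes_b

# anything computed over the bytes (hash, signature) now differs too
digest_a = hashlib.sha256(bytes_a).hexdigest()
digest_b = hashlib.sha256(bytes_b).hexdigest()
```

    both encoders hold the same data and both believe their output is sorted, yet a check computed over one byte string will not match the other.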

    i want to note that this list is not completely exhaustive and i may follow up with more items at a later date. however, i have found what i believe to be quite a number of reasons to be skeptical of matrix and i would struggle very much to want to recommend it to a new community or enterprise looking for a communication platform. most of the issues here i’ve either witnessed first hand or have been following through a series of github issues on various matrix repositories, issues which the matrix devs seem to have been largely ignoring for literal years now.