Recommendation math has fingerprints

While I was working at Memgraph, Bluesky still felt young enough that the edges of the system were visible. The app mattered less to me than the firehose: something Twitter had long since locked away, arriving through a WebSocket where a developer could actually touch it.

Not a trickle. Not a curated stream. Roughly 500 events per second of raw social activity: posts, likes, follows, reposts, unfollows, the small nervous system of a social network made available as data.

Drinking from it was the unreasonable part. The experiment was to turn the stream into a feed I would actually want to read.

AT Protocol basics

Bluesky is built on the AT Protocol, which separates the what from the where. Identity is not tied to a server; it is represented through Decentralized Identifiers (DIDs). Data lives in a Personal Data Server (PDS) that a user could, in theory, host themselves. Relays aggregate activity across the network and expose the stream.

For a graph database person, the shape is tempting and dangerous. A social network is people connected to people, people connected to posts, posts connected to replies and quotes. The data wants to become a graph before the product question has even been asked.

Conveniently, I happened to work for a company that made a graph database.

From that collision came BlueJ, originally “Home+”. It became both an obsession and a practical education in algorithmic feed generation because every nice theory about feeds had to survive contact with the graph.

Feeds have a trust problem

Chronological feeds are democratic to a fault: every post from everyone you follow, in the order they posted it. Simple, pure, and overwhelming once the follow list gets past fifty people.

Algorithmic feeds became a dirty word for earned reasons. We have all experienced platforms that show whatever keeps us scrolling, regardless of what it does to attention, relationships, or basic faith in humanity.

The useful middle sat between those failures: a feed that widened the room without taking over the room. Keep direct follows central, let trusted second-degree signals widen discovery, and use community structure carefully enough that recommendation did not become coercion with nicer math. A feed works for the reader or it works on the reader.

The first version was a set of Cypher queries and a wait for the idea to become wrong in a specific way.

Three streams

BlueJ’s feed algorithm starts with a simple constraint: do not rely on a single signal. Blend multiple sources of content using a weighted distribution.

The algorithm runs three queries in parallel:

let queryResults = await parallelQueries(requesterDid, maxNodeId, {
    follow: { query: followQuery, limit: 300 },
    likedByFollow: { query: likedByFollowQuery, limit: 100 },
    community: { query: communityQuery, limit: 100 }
})

Direct follows — Posts from people you explicitly chose. This gets the largest allocation, 300 posts, because direct follows remain the core of the feed.

Liked by your follows — What are the people you trust finding worth passing along? Second-degree signals are social proof filtered through your own curation. This stream is limited to 100 posts because it widens a feed without colonizing it.

Community posts — For users assigned to community clusters through graph analysis, posts from others in the same community. A way to discover people you might want to follow, based on the structure of the social graph rather than explicit choices.
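The post does not show what parallelQueries does internally. A minimal sketch of the idea, assuming a query runner that returns a promise of rows, could look like the following. The names runStreamsInParallel, QueryRunner, StreamSpec, and FeedRow are hypothetical, as are the parameter names passed to each query.

```typescript
// Hypothetical shapes: the post does not show these types or the query runner.
type StreamSpec = { query: string; limit: number };
type FeedRow = { id: number; uri: string };
type QueryRunner = (query: string, params: Record<string, unknown>) => Promise<FeedRow[]>;

// Runs every stream's query concurrently and returns the rows keyed by stream name.
async function runStreamsInParallel(
    runQuery: QueryRunner,
    requesterDid: string,
    maxNodeId: number,
    streams: Record<string, StreamSpec>
): Promise<Record<string, FeedRow[]>> {
    const streamNames = Object.keys(streams);
    // Every query is started by the map() call; Promise.all then waits for all of them.
    const rowsPerStream = await Promise.all(
        streamNames.map((streamName) =>
            runQuery(streams[streamName].query, {
                did: requesterDid,
                maxNodeId,
                limit: streams[streamName].limit,
            })
        )
    );
    const byStream: Record<string, FeedRow[]> = {};
    streamNames.forEach((streamName, index) => {
        // Defensive cap in case a query ignores its limit parameter.
        byStream[streamName] = rowsPerStream[index].slice(0, streams[streamName].limit);
    });
    return byStream;
}
```

Because the streams are independent, the total wait is roughly the slowest query rather than the sum of all three.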

Scoring with time decay

Scoring within each stream borrows from the Hacker News family of decay formulas:

WITH (ceil(likes) / ceil(1 + (hour_age * hour_age * hour_age * hour_age))) as score

In plain terms, ignoring the rounding: score = likes / (1 + hour_age^4).

Fourth power on the age denominator is aggressive. A post that is 4 hours old has a denominator of 1 + 4^4 = 257, so it needs roughly 256 times as many likes as a brand-new post to achieve the same score. At 24 hours? Roughly 331,776 times as many likes. Freshness is not a preference in that formula. It is the demand.

Fourth-power decay prevents the “greatest hits” problem where the same viral posts linger at the top of your feed forever. New posts get their moment; if they do not earn engagement quickly, they fade away. Harsh, but legible.
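To make the decay concrete, here is a small sketch that mirrors the Cypher expression in TypeScript and computes how many likes an older post needs to tie a brand-new one. The function names decayScore and likesNeededToMatch are illustrative, not taken from BlueJ.

```typescript
// Mirrors the Cypher expression above, including both ceil() calls.
function decayScore(likes: number, hourAge: number): number {
    return Math.ceil(likes) / Math.ceil(1 + hourAge ** 4);
}

// Likes an aged post needs to tie a brand-new post (age 0) that has `freshLikes`.
function likesNeededToMatch(freshLikes: number, hourAge: number): number {
    return freshLikes * Math.ceil(1 + hourAge ** 4);
}

console.log(decayScore(10, 0));         // 10
console.log(likesNeededToMatch(1, 4));  // 257
console.log(likesNeededToMatch(1, 24)); // 331777
```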

Blending the streams

Ranking was not the next problem. Mixing was. Three arrays of posts, each ranked by its own criteria, still have to become one feed without letting one source dominate the page.

My solution was a weighted round-robin algorithm that uses array length as the weight:

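export function weightedRoundRobin(...arrays: Element[][]): Element[] {
    const result: Element[] = [];
    // The longest array decides how many rounds are needed to drain every array.
    const arraysLengths = arrays.map((arr) => arr.length);
    const maxCount = Math.max(...arraysLengths);

    // Round i appends the i-th element of every array that still has one.
    for (let i = 0; i < maxCount; i++) {
        for (let arrIndex = 0; arrIndex < arrays.length; arrIndex++) {
            const arr = arrays[arrIndex];
            if (i < arr.length) {
                result.push(arr[i]);
            }
        }
    }
    return result;
}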

Interleaving carries the small elegance. With a 300/100/100 distribution, you get roughly three posts from follows, then one from liked-by-follows, then one from community, repeating. The relative proportions are preserved, but the content is mixed throughout the feed rather than segregated into sections.
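The excerpt does not show how the per-round proportions are enforced. One way to make the three-one-one rhythm explicit, assuming each array's share per round is its length relative to the shortest non-empty array, is sketched below. proportionalInterleave is a hypothetical helper and not BlueJ's code.

```typescript
// Hypothetical helper, not taken from BlueJ: interleaves arrays so that each
// contributes in proportion to its length.
function proportionalInterleave<T>(...arrays: T[][]): T[] {
    const result: T[] = [];
    const nonEmptyLengths = arrays.map((arr) => arr.length).filter((len) => len > 0);
    if (nonEmptyLengths.length === 0) return result;

    // The shortest non-empty array contributes one item per round, so it sets the round count.
    const rounds = Math.min(...nonEmptyLengths);
    const cursors = arrays.map(() => 0);

    for (let round = 1; round <= rounds; round++) {
        arrays.forEach((arr, arrIndex) => {
            // Position this array should have reached by the end of this round.
            const target = Math.round((round * arr.length) / rounds);
            while (cursors[arrIndex] < target) {
                result.push(arr[cursors[arrIndex]]);
                cursors[arrIndex] += 1;
            }
        });
    }
    return result;
}

const follows = ["f1", "f2", "f3", "f4", "f5", "f6"];
const likedByFollows = ["l1", "l2"];
const community = ["c1", "c2"];
console.log(proportionalInterleave(follows, likedByFollows, community).join(" "));
// f1 f2 f3 l1 c1 f4 f5 f6 l2 c2
```

The same 3:1:1 ratio as 300/100/100 is used here at a smaller scale so the pattern is easy to read.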

Inspection was the virtue. If the feed feels off, the weights are visible, the streams are visible, and the mistake has somewhere to live.

Figure: Nodes, relays, follows, and high-volume event streams.

Detecting refresh requests

One piece I still like, with the appropriate suspicion around a developer’s favorite cleverness, is the freshness detection system. It came from a user expectation rather than a graph theory preference.

Problem: when a user pulls to refresh, they want to see new posts at the top. When they are paginating through a feed, they want consistency: the same posts in the same order, just further down.

Fix: track the highest post ID each user has seen, and use the request parameters to infer intent.

// A larger limit and no cursor indicates that this was not a new-post probe,
// but a full feed refresh, so we mark the highest Node.ID as the last seen post
if (limit >= 10 && cursor === undefined) {
    didLastSeen[requesterDid] = {
        maxNodeId: maxNodeId,
        timestamp: Date.now()
    }
}

When a refresh is detected, posts newer than the user’s last seen ID bubble to the top:

if (didLastSeen[requesterDid] !== undefined) {
    const newResults = results.filter((result) => result.id > didLastSeen[requesterDid].maxNodeId);
    const seenResults = results.filter((result) => result.id <= didLastSeen[requesterDid].maxNodeId);
    results = [...newResults, ...seenResults]
}

Users never notice this when it works. They would notice immediately if it did not.
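The two snippets above can be folded into a pair of small functions and exercised with sample data. This is a sketch under the same assumptions as the originals (an in-memory map keyed by DID, numeric post IDs); the names lastSeenByDid, recordRefresh, and floatUnseenPosts are illustrative.

```typescript
type SeenMarker = { maxNodeId: number; timestamp: number };
type RankedPost = { id: number; uri: string };

// In-memory map keyed by requester DID, as in the snippets above.
const lastSeenByDid: Record<string, SeenMarker> = {};

// Records the marker only for a full refresh: no cursor and a limit of at least 10.
function recordRefresh(
    requesterDid: string,
    limit: number,
    cursor: string | undefined,
    maxNodeId: number
): void {
    if (limit >= 10 && cursor === undefined) {
        lastSeenByDid[requesterDid] = { maxNodeId, timestamp: Date.now() };
    }
}

// Moves posts newer than the stored marker to the front, keeping ranked order inside each group.
function floatUnseenPosts(requesterDid: string, ranked: RankedPost[]): RankedPost[] {
    const marker = lastSeenByDid[requesterDid];
    if (marker === undefined) return ranked;
    const unseen = ranked.filter((post) => post.id > marker.maxNodeId);
    const seen = ranked.filter((post) => post.id <= marker.maxNodeId);
    return [...unseen, ...seen];
}

recordRefresh("did:plc:example", 30, undefined, 100);
const reordered = floatUnseenPosts("did:plc:example", [
    { id: 90, uri: "a" },
    { id: 120, uri: "b" },
    { id: 95, uri: "c" },
    { id: 110, uri: "d" },
]);
console.log(reordered.map((post) => post.id)); // [ 120, 110, 90, 95 ]
```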

Reading the firehose

The feed only became inspectable after the firehose reached the graph. The AT Protocol firehose is a WebSocket stream of repository commit events: each frame is encoded as DAG-CBOR, and the records a commit touches travel inside it as CAR (Content Addressable aRchive) data.
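Once a commit is decoded, it carries a list of operations, each with an action and a path of the form collection/rkey. A minimal sketch of sorting those operations by collection, assuming only the four record types this post cares about, might look like this. The RepoOp shape follows the subscribeRepos commit schema; GraphEventKind, KIND_BY_COLLECTION, and classifyOp are hypothetical names.

```typescript
// Minimal shape of one decoded repository operation. Only the fields used here are listed.
type RepoOp = { action: "create" | "update" | "delete"; path: string };

// Hypothetical labels for the graph updates an operation can imply.
type GraphEventKind = "post" | "like" | "repost" | "follow" | "ignored";

const KIND_BY_COLLECTION: Record<string, GraphEventKind> = {
    "app.bsky.feed.post": "post",
    "app.bsky.feed.like": "like",
    "app.bsky.feed.repost": "repost",
    "app.bsky.graph.follow": "follow",
};

// Splits "collection/rkey" and maps the collection to the kind of graph update it implies.
function classifyOp(op: RepoOp): { kind: GraphEventKind; rkey: string } {
    const [collection, rkey = ""] = op.path.split("/");
    return { kind: KIND_BY_COLLECTION[collection] ?? "ignored", rkey };
}

console.log(classifyOp({ action: "create", path: "app.bsky.graph.follow/3kexample" }));
// { kind: 'follow', rkey: '3kexample' }
```

An unfollow arrives as a delete operation on the same app.bsky.graph.follow collection, so the action has to be checked alongside the kind.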

Libraries from @atproto handle most of the decoding. My job was to transform each event into a Cypher query and execute it against Memgraph:

// For new posts
await this.executeQuery(
    "CREATE (post:Post {uri: $uri, cid: $cid, author: $author, text: $text, " +
    "createdAt: $createdAt, indexedAt: LocalDateTime()}) " +
    "MERGE (person:Person {did: $author}) " +
    "MERGE (person)-[:AUTHOR_OF {weight: 0}]->(post)",
    { uri, cid, author, text, createdAt }
)

// For follows
await this.executeQuery(
    "MERGE (p1:Person {did: $authorDid}) " +
    "MERGE (p2:Person {did: $subjectDid}) " +
    "MERGE (p1)-[:FOLLOW {weight: 2, uri: $uri}]->(p2)",
    { authorDid, subjectDid, uri }
)

MERGE is crucial here: it creates the node if it does not exist, or matches the existing one if it does. That handles the out-of-order nature of the firehose. Someone might like a post before I have seen the post itself; MERGE creates placeholder nodes that can be enriched later.
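The like query is not shown in this post. A sketch of what a MERGE-based version could look like follows, with the LIKE relationship label and property names chosen purely for illustration. The second builder is also an assumption: the post query shown earlier uses CREATE, and a variant that MERGEs on uri is one way to let a placeholder be enriched rather than duplicated.

```typescript
// Hypothetical query builders: the LIKE label and the property names are
// assumptions for illustration, not BlueJ's actual schema.
type CypherCall = { query: string; params: Record<string, string> };

// A like may arrive before its post, so the post is MERGEd by uri as a placeholder.
function buildLikeCall(authorDid: string, subjectUri: string, likeUri: string): CypherCall {
    return {
        query:
            "MERGE (person:Person {did: $authorDid}) " +
            "MERGE (post:Post {uri: $subjectUri}) " +
            "MERGE (person)-[:LIKE {uri: $likeUri}]->(post)",
        params: { authorDid, subjectUri, likeUri },
    };
}

// When the post itself shows up, MERGE matches the placeholder and SET fills in the rest.
function buildPostUpsertCall(uri: string, author: string, text: string): CypherCall {
    return {
        query:
            "MERGE (post:Post {uri: $uri}) " +
            "SET post.author = $author, post.text = $text",
        params: { uri, author, text },
    };
}
```

Both calls key the post on uri alone, so whichever event arrives first creates the node and the other one finds it.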

For resilience, failed queries go into a retry queue processed every 5 seconds:

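private queryQueue: RetryableQuery[] = [];

// `session` is assumed to be an open Bolt driver session created elsewhere.
async executeQuery(query: string, params: object, retryCount: number = 10) {
    try {
        return await session.run(query, params);
    } catch (error) {
        if (this.isRetryableError(error) && retryCount > 0) {
            // Park the query with one fewer attempt left; the queue is drained on a timer.
            this.queryQueue.push({ query, params, retryCount: retryCount - 1 })
        }
        // Non-retryable errors and exhausted retries are dropped here.
    }
}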

Not sophisticated, but effective. The occasional database hiccup does not get a vote on whether the graph remembers what happened.
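The draining side of the queue is not shown. A self-contained sketch of it, assuming an injected query runner and a fixed interval, could look like the following. RetryQueueSketch, RunFn, and QueuedQuery are hypothetical names.

```typescript
// Hypothetical sketch: the runner, the error test, and the class name are placeholders.
type QueuedQuery = { query: string; params: object; retryCount: number };
type RunFn = (query: string, params: object) => Promise<unknown>;

class RetryQueueSketch {
    private pending: QueuedQuery[] = [];

    constructor(
        private run: RunFn,
        private isRetryable: (error: unknown) => boolean
    ) {}

    enqueue(item: QueuedQuery): void {
        this.pending.push(item);
    }

    size(): number {
        return this.pending.length;
    }

    // Retries everything queued at the time of the call; failures with retries left are re-queued.
    async drainOnce(): Promise<void> {
        const batch = this.pending;
        this.pending = [];
        for (const item of batch) {
            try {
                await this.run(item.query, item.params);
            } catch (error) {
                if (this.isRetryable(error) && item.retryCount > 0) {
                    this.pending.push({ ...item, retryCount: item.retryCount - 1 });
                }
            }
        }
    }

    // Drains on a fixed interval; 5000 ms matches the cadence described above.
    start(intervalMs: number = 5000): ReturnType<typeof setInterval> {
        return setInterval(() => void this.drainOnce(), intervalMs);
    }
}
```

Swapping the pending list for an empty one before the loop means a query that fails again waits for the next tick instead of being retried in a tight loop.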

Real-time visualization

Building a custom feed generator proves one claim; watching the graph move proves another. Infrastructure that only manifests as a slightly improved list of posts is hard to feel. The structure needed to be visible before the intuition could be trusted.

A real-time visualization layer followed.

A direct stack was enough: React on the frontend with react-force-graph for the 3D/2D rendering, Socket.io for real-time updates, and Memgraph triggers firing HTTP callbacks whenever nodes or edges change.

The visualization backend subscribes to a user’s network and streams updates:

socket.on('interest', async (handle) => {
    // Find the user and their follows
    // Return initial graph state
    // Subscribe to updates for all relevant DIDs
})

Memgraph triggers detect changes and push them through a C++ query module to the visualization service, which broadcasts to interested clients. The result made the system visible: nodes appearing and connecting, edges forming and dissolving, a social graph becoming less abstract because it kept moving.
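The bookkeeping behind "interested clients" is not shown in the post. One way to sketch it, assuming each client is identified by a socket ID and each change names the DIDs it touches, is a small in-memory registry. InterestRegistry and its method names are hypothetical.

```typescript
// Hypothetical in-memory registry: maps each DID to the client ids watching it.
class InterestRegistry {
    private watchersByDid = new Map<string, Set<string>>();

    // Registers a client's interest in every DID of the network it asked about.
    subscribe(clientId: string, dids: string[]): void {
        dids.forEach((did) => {
            const watchers = this.watchersByDid.get(did) ?? new Set<string>();
            watchers.add(clientId);
            this.watchersByDid.set(did, watchers);
        });
    }

    // Removes a client everywhere, for example when its socket disconnects.
    unsubscribe(clientId: string): void {
        this.watchersByDid.forEach((watchers, did) => {
            watchers.delete(clientId);
            if (watchers.size === 0) this.watchersByDid.delete(did);
        });
    }

    // Clients that should receive a change touching any of the given DIDs.
    recipientsFor(changedDids: string[]): string[] {
        const recipients = new Set<string>();
        changedDids.forEach((did) => {
            this.watchersByDid.get(did)?.forEach((clientId) => recipients.add(clientId));
        });
        return Array.from(recipients);
    }
}

const registry = new InterestRegistry();
registry.subscribe("socket-a", ["did:plc:alice", "did:plc:bob"]);
registry.subscribe("socket-b", ["did:plc:bob"]);
console.log(registry.recipientsFor(["did:plc:bob"])); // [ 'socket-a', 'socket-b' ]
```

A trigger callback would look up the recipients for the changed node or edge and emit only to those sockets, which keeps a busy firehose from flooding every open browser tab.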

You can watch someone gain a follower in real time. See a post accumulate likes as little edges snake toward it. Witness the graph reorganize itself as the force-directed layout algorithm seeks equilibrium.

Completely impractical for most real purposes. Still worth doing, because it made the data feel less like rows and more like behavior.

Graphs made the question human

Building BlueJ taught me that social graphs are hard in the semantic sense. Memgraph can handle the computation. The difficult questions are human: what does a follow mean, how much weight does a like carry versus a reply, and when does recommendation become manipulation?

AT Protocol sharpened those questions because the data was reachable. Identity was separated from hosting, content was cryptographically verifiable, and the firehose was open enough for a side project to touch the live network. Whether Bluesky itself succeeds matters less than whether the protocol proves that social data can be less captive.

Out of that came something more restrained than “algorithms are defensible, actually.” Algorithmic feeds are not inherently hostile. Engagement-maximizing feeds are hostile because they optimize for the wrong target with great discipline. A small feed built around direct follows, second-degree trust, community structure, and visible weights can still feel like a tool.

Archived, still instructive

BlueJ is no longer actively maintained, and I no longer work at Memgraph. The codebase remains as a proof of concept for what becomes possible when a graph database meets an open social firehose.

Bluesky, meanwhile, has exploded. What was a million users when I built this is now 40 million and growing. The firehose is busier than ever. The protocol continues to evolve.

If you are curious about graph algorithms and social protocols, the AT Protocol is still worth exploring. The firehose is open, the graph is there, and the product questions are still unresolved in ways that repay inspection.

Do not be surprised if 2 AM arrives with a force-directed graph still moving and a Cypher query open.

BlueJ’s codebase is available for the curious.

Originally built at Memgraph and archived as a record of the experiment.

Chris Chabot · April 2023
