Technical Aspects

The posts are converted to markdown and rendered into HTML when you visit them here. The comment count has been snapshotted when first building the DB (august 2024), and comment count for new posts is saved when the post is one month old. The estimated reading time is actually calibrated to out-loud reading speed (since that was the original use case) and is currently calibrated to 130 words per minute.

The robots.txt file is set to disallow crawling of the posts here, so Google should still direct people to the main blogs when looking for specific articles.

The website is done in Laravel with Livewire - I'm a long time Laravel dev and this was my first exposure to Livewire, which is a joy to work with. It's hosted on the cheapest tier of Hetzner VPS which costs around $6/month, backups included. It's hosted in the EU.

Text Embeddings in Search

The general search uses text embeddings to find posts similar to your query, using their text-embedding-3-small model. Text embeddings are really cool: they take a string and turn it into a vector of numbers in a high-dimensional space (1536 dimensions for this model) that's tied to the meaning of the string. This means that texts that are semantically similar will be close to each other in this space. Each post has been split into strings of 500 words, each split turned into an embedding and saved to the postgres DB with its extension pgvector (of course there's a postgres extension for that). The model is quite cheap, at around $0.1 per 1M tokens, so embedding the whole corpus costs less than $1.

When you make a search, an API call is made to OpenAI to turn your query into an embedding with the same model, and then the cosine distance between your query and each post split is calculated in the postgres DB. The posts are sorted on the minimum distance found for their splits. After experimenting a bit, I set the cutoff distance (where we consider the distance too great and therefore filter the post as not relevant) at 0.74.

I'm sure some improvements could be made: trying different sizes to split the posts (I tried 1000 words and did some very rough testing where 500 seemed to work slightly better, but more thorough testing and more sizes could be tried), maybe finding a way to increase the ranking if many splits in a post are close to the query versus a single split, maybe splitting on sentences instead of words to avoid cutting some sentences in half, better (or dynamic?) cutoff value...

Implementing Semantic Search

The text embeddings approach described above works great but sometimes misses posts that should be returned. For example, searching for "bicameral" doesn't return the post called "Book Review: Origin Of Consciousness In The Breakdown Of The Bicameral Mind", or searching for "untitled" doesn't return the post with that title.

To improve this, I needed to combine approaches. So the current setup:

  1. Starts by using Postgres Full-Text Search to get the post titles that match the query. FTS is indexable and fast, and allows us to stem words (for example, searching "pills" should return posts with "pill" in them, even without the "s"). Those results match the query closely and should be returned first.
  2. Then we do the text embedding search, but return only posts that match with a cosine distance lower than 0.7 (instead of the cut-off point mentioned above of 0.75). Having this lower distance means they are pretty closely semantically related to the query, so they are probably good matches.
  3. Then we do Full-Text Search on the content of the posts - for that step we only return posts that a content where all (stemmed) keywords of the query are present, no partial matches. We rank those using the Postgres ts_rank_cd function and only consider posts with a rank higher than 0.1. (Chosen somewhat arbitrarily.)
  4. Finally we return the posts that have a cosine distance in their text embeddings between 0.7 and 0.74. These are loosely related to query and sometimes the link is quite tenuous and can feel like hit or miss, so we put them at the bottom in case they're useful but not as relevant.

This filtering and ranking is all done in one big SQL query. I explored implementing TF-IDF search and got it working but it was more complex to maintain and gave worse results than the simpler use of FTS, so I dropped it.

It's been working pretty well for me but if you find inconsistencies or bad results, please let me know at contact~at~readscottalexander.com.

Summarizing and Tagging with AI

The posts are summarized and tagged with Claude Sonnet 3.5 and the tags are categorized by the same AI, all in one go using a long prompt. There is one call to the API for each post - I tried having Claude summarize multiple posts in a single call but it was mixing posts in the summary and tags. The prompt needs to include all existing tags so Claude knows to reuse them.

There are around five million words in the codex, which is around 6-7 million tokens for Claude. There are 1500+ posts and each post adds 2-4 new tags (that are 4-5 tokens each - I ask Claude to create tags with <tag_name>:<category>, for example democracy:politics), so the repeated prompt actually adds up to a lot of tokens. (Order of magnitude is 5 millions, plus ~3 millions for the static part.) Luckily, Anthropic launched prompt caching just when I started putting more posts into the pipeline, which reduces the cost of the repeated prompt. At the moment of writing, Sonnet 3.5's cost is $3 per million input tokens, $15 for 1M output tokens. So I estimate running the analysis on all posts to cost around $30-40.

Claude is quite good at summarizing, with a few quirks (eg I try to make it not spoil the conclusion, but sometimes it still does). It's ok at tagging: the tags themselves are good, but it's a bit hard to get it to always include some tags with consistency. There are lots of edge-cases included in the prompt, and some light post-processing done as well (eg detecting book reviews, comment highlights, etc). I tried asking for extra tags that wouldn't be included in the front end to have extra meta-data, but it started to put things I'd want in the main tags in the extra tags, so I had to stop.

I had trouble making Claude consistently reply with only JSON, I could have cleaned up the output but the problem disappeared when I added a { as the first character of its reply.

Here is the current prompt used:

# Instructions

You will be given a blog posts in markdown format, written by Scott Alexander. Your role is to:
1. Create a short summary of the post in 1 single sentence, in the key `short_summary`.
2. Create a longer summary of the post in 3 to 5 sentences.
3. Give a list of main topics covered in each post, to tag and reference it. You should give 3 to 5 main topics.
4. Give a list of secondary topics, again to tag and reference it. You should give 3 to 5 secondary topics.
5. (optional) Sometimes a post is a follow up on another previous post. If that's the case, you should mention the url of that other post in the `previous_post` key. It should always be a url either from https://astralcodexten.com or https://slatestarcodex.com, otherwise don't include it. Leave that key null if there is no previous post.

Here are guidelines you must follow when creating summaries:
- The summary should give an idea of the topic and the overall structure of the article
- The summary must be careful not to spoil the conclusion or the ending. If there is a twist ending, it is very important you do not say what it is. For example you can say "the post ends with a surprising twist", but you should NOT say "in a final twist, we learn they knew it all along", thereby revealing the twist.
- It can give an idea of the tone of the post, if the tone is unusual or a big part of the post. For example if the post is fictional or satirical, you should mention it.
- If there's a reason why Scott has written the post, you should mention it.
- It can be precise, but should use language that is easy to understand.
- You can call Scott Alexander by his first name ("Scott") or his full name ("Scott Alexander").
- In the longer summary, if the post is lengthy you should give an overview of the structure of the post and its different parts.

Here are guidelines you must follow when creating new topics:
- To avoid duplicates, one or two lists of existing topics will be provided. Use the existing topics in priority, and only create new ones if no existing one matches.
- You can include specific topics (like "ai safety" or "online privacy") only if you also include the more generic concept (like "ai" or "privacy").
- Some of the posts are pretty critical and it is ok to have that show up in the topics.
- Do not include "Scott Siskind" as a topic.
- A few recurring themes in the posts are criticizing scientific studies, replies to other authors to criticize or debate them, exposing fallacies, "woke" culture, feminism. You can word these themes in other ways in the topics, but it should show up in the topics one way or another if the theme is present in the article.
- For each topic, you should include the category of the topic by suffixing it with `:`. For example, if the topic is a person's name, you should suffix it with ":person" (for example, "Robin Hanson:person"), or if it's a medical drug you should suffix it with ":drug" (for example, "aspirin:drug"), or if it's related to science you should suffix it with ":science" (for example, "study critique:science").
- If the post has a precise topic that is included in an existing more generic one, you should include both. For example, if the post talks about "COVID-19 origins" and there is already a "COVID-19" topic, you should include both "COVID-19" and "COVID-19 origins".

Here are rules for specific important topics you should include. These rules are very important and you must make sure you apply all of them and do not miss any of those topics.
- If the post is a reply to another blog or scientist, you must the name of the person in one of the topics.
- If the post discusses one or more study you must include "study" in the topics (this is on top of what you would naturally include, for example if "study critique" is a relevant topic you must include both "study" and "study critique").
- If the post contains debunks studies or statistics (whether because they are misleading, badly done, or very wrong), you must include the topic "bad science".
- If the post is a book review, you must include the topic "book review".
- If the post is an entry in a book review contest and not written by Scott Alexander, you must include both the topic "book review" and "book review contest".
- If the post is a fictional story (whether satire, creative writing or something else), you must include the topic "fiction".
- If the post is creative writing (whether satire, written in verse or in mimicking a style, or a fiction), you must include the topic "creative writing".
- If the post is debating another person (either because it is a response to another article, a "Contra" article, or a reply to someone who wrote about Scott), you must include the topic "debate".
- If the topics include satire, humor or poetry, you must also include the topic "creative writing".
- If one of the post main topics is a person, or something written by a person, you must include the name of that person in the topics. (For example, if the post is a reply to an article or a book written by someone.)
- If the post talks about Adversarial Collaboration, you must include the topic "adversarial collaboration".

Here are rules that you must follow for setting categories to topics:
- You should try to re-use existing topic categories and only create new ones when necessary.
- You must not use the category "concept" as it too vague, even if it already exists. Instead, try to find a more specific category.

The goal of all this is to make the corpus easier to search and navigate, and be able to find posts on topics the user is interested in.

Here is a list of already existing topics with their category, formatted in an array:
```
[$LIST_OF_TAGS]
```

And here is a list of already existing topic categories for you convenience, formatted in an array:
```
[$LIST_OF_TAG_CATEGORIES]
```

# Input

You will receive one message from the user, containing the blog post written in markdown.

# Output

Your output should be a JSON object having the following keys: `summary`, `short_summary`, `previous_post`, `main_topics`, `secondary_topics`.

Your output should be entirely JSON, with no text outside of of the JSON. It's very important you only output JSON, as the output will be parsed by a computer.

DO NO INCLUDE ANYTHING OTHER THAN JSON IN YOUR OUTPUT. (No introduction text, no comments, no explanation, no error messages, no logs, no other output than the JSON response.)

Here's an example output enclosed in triple backticks (the topics of this example are made up, you can use others):

```
{
    "summary": "This post explores the use of the term \"racism\" and its limitations in the current culture, showing the use of the word hides any deeper drive and reduces people to one dimensional monsters. The main point is that we need to get beyond this if we want to go towards a harmonious society. Scott starts by comparing \"racism\" to the imagined \"murderism\", then shows how this relates to the current culture, and ends with his proposed fix in the last part. He has a pretty ironic tone, though the overall post ends up quite technical.",
    "short_summary": "Scott compares the concept of \"racism\" to the imagined \"murderism\" to show its limitations, then analyzes different possible uses of the words and proposes a fix.",
    "previous_post": "https://astralcodexten.com/p/part-1-of-the-post",
    "main_topics": ["culture war:society", "racism:society", "philosophy:philosophy"],
    "secondary_topics": ["community building:community", "woke:society", "Eliezer Yudkowsky:person", "study critique:science"]
}
```