Argand Crawler
Why you’re seeing it, what it’s doing, and how to tell it “no thanks” (but hear me out first?)
Hewwo UwU!
Sorry. Hi. I'm Nic, and this is my personal site. If you're checking the link attached to my Argand Crawler, you might be wondering who I am and what I'm doing. And if you just want to stop me, feel free to skip ahead to "How to opt out or set boundaries".
If you’re here, you probably saw a request from the Argand crawler in your logs and went:
“Who is this, why are they indexing my site, and are they going to be a problem?”
Short version:
- Argand is a new, indie, privacy-first search engine.
- It’s built in Rust by Nic Weyand, as part of the broader political / social project around the Lilac Party.
- The crawler is designed to be:
- Polite (limited concurrency, pacing between requests)
- Transparent (this page is the opposite of “mystery bot with no explanation”)
- Non-extractive (no ad-tech tracking, no data brokering)
If you ever decide you don’t want Argand touching your site, there will always be clear, simple ways to opt out.
Who’s behind this?
Name: Nic Weyand
Project: Argand Search Engine
Politics: Progressive leftist + socialist / Lilac Party
I’m building Argand because I don’t think search should be:
- A surveillance product
- A black box controlled by two or three megacorps
- Optimized purely for ad revenue instead of human curiosity
Lilac Party’s broader vibe: build public-spirited infrastructure, not walled gardens. Argand is one of those pieces of infrastructure. Simply put, we need a search engine that just works. Argand is trying to be the greatest search engine of all time!
Why is Argand crawling your site?
There are a few goals:
- Independent search index
- To build an actually good search engine, I need my own index.
- That means discovering and storing text from the web in a structured way, not renting it from an ad company.
- Lift “smaller” sites
- Corporate search often buries small blogs, personal sites, and weird niche projects.
- Argand wants to actively find those and give them a better chance to be seen.
- Research & ranking quality
- Crawled pages help train ranking logic: “what looks like a high-quality page?”, “what is spam?”, etc.
- This is about relevance, not ad targeting. I don’t care who visits a site, just whether a page is helpful for a given query.
- Building a non-surveillance alternative
- If you’re frustrated with how search has gone, Argand is an attempt to build something saner and better, from the ground up!
How the Argand crawler behaves (in human terms)
I want to have this readable for programmers and laypeeps alike! So I will explain things more simply than most programmers will like, sorry in advance! Just want to make sure I explain to everyone equally. I do not want people to distrust me, or my projects. Simply calling my work "the algorithm" hides what is actually happening, and I want everyone to understand! With that, even without diving deeply into code, here’s the design philosophy:
- Polite by default
- The indexer has a configuration struct that sets things like:
- batch_size – how many URLs it works through in one “chunk”
- max_concurrent – how many requests can be in flight at once
- Those exist specifically to avoid “I just accidentally DDoS’d a shared hosting plan” energy.
- Paced, not hammering
- The indexer uses async Rust (via tokio) and its time utilities (sleep, Duration) to pace work.
- That means it intentionally pauses between certain operations, instead of slamming a host with as many connections as possible.
- Unique IDs per URL
- URLs are turned into a stable numeric ID using a hash function. Think of it as:
doc_id = hash(url)
- That lets the indexer:
  - De-duplicate documents
  - Update content cleanly when a page changes
  - Store metadata without keeping messy strings as primary keys everywhere
- Separation of concerns
- There’s an Indexer that knows:
  - How fast to crawl
  - How many tasks to run at once
- There’s a crawler coordinator that knows:
  - Which URLs to fetch next
  - How to schedule work
- There’s a storage layer that:
  - Writes the processed content into a data directory
  - Keeps things ready for the search engine to query
- Async, non-blocking design
- The main binary uses #[tokio::main]:
  - Multiple crawl operations can run concurrently without blocking each other.
  - That keeps Argand efficient with its own resources and lets it keep concurrency limits per host.
A gentle tour of the Rust code (for devs & non-devs, in plain-ish English!)
You’ll see a couple of key pieces in the code:
1. The Indexer config
pub struct IndexerConfig {
pub batch_size: usize,
pub max_concurrent: usize,
}
In English:
- batch_size = “How many documents do we try to process at a time before pausing or doing bookkeeping?”
- max_concurrent = “How many things can we be doing at once?”
These are the knobs that control how “aggressive” the crawler is allowed to be. The defaults are deliberately nice, not “let’s melt every server we see.” I was spurred by watching OpenAI, Anthropic, and other AI crawlers decimate websites with reckless scraping, and decided to design my search engine's crawler to be lightweight, gentle, ethical, and respectful, while still crawling where it's welcome (more on how we respect robots.txt in a sec!)
2. The Indexer itself
pub struct Indexer {
config: IndexerConfig,
}
impl Indexer {
pub fn new(config: IndexerConfig) -> Self {
Self { config }
}
}
This is the high-level “manager”:
- It doesn’t know how HTTP works by itself.
  - (HTTP = Hypertext Transfer Protocol, basically how you connect to any website.)
- It does know:
- “Here’s how many jobs we’re allowed to run at once.”
- “Here’s the policy for how we chew through the URL list.”
Think of it like a manager on-site, not the worker themselves.
3. The async main and shared components
Some details you’ll see:
- Arc<RwLock<SearchEngine>> – The search engine lives behind a read-write lock so multiple async tasks can read from it safely while others update it.
- Arc<storage::Storage> – Central place where processed documents are stored.
- crawler::coordinator::CrawlCoordinator – Handles URL scheduling and (internally) how to spread work out.
The pattern is: set up the shared state once, then spawn async work that coordinates crawling without clobbering each other.
4. URL hashing
fn hash_url(url: &str) -> u64 {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    let mut hasher = DefaultHasher::new();
    url.hash(&mut hasher);
    hasher.finish()
}
This uses Rust’s standard hashing (DefaultHasher) to turn a URL string into a u64.
That makes it easy to:
- Use the hash as a document ID
- Store and look up content quickly
- Avoid giant string keys in certain data structures
If you’re not a dev: you can think of this as turning:
https://example.com/posts/my-article
into:
4938572304958723049
so the system can store and track it efficiently.
Ethics & the Lilac Party angle
This crawler isn’t just a tech project; it’s tied to a political one.
My political project, the Lilac Party, cares about:
- Digital privacy & anti-surveillance
- Public-minded infrastructure
- Giving smaller creators a way to be found that isn’t mediated by ad budgets
So, Argand’s crawler is built around:
- Respect over extraction
- The goal is not to mine your site to build shadow profiles of your users.
- The goal is to understand your content well enough to:
- Let people discover it
- Rank it fairly compared to spam and junk
- Transparent intent
- You know who I am (Nic Weyand).
- You know this is tied to the Lilac Party’s broader political project.
- You’re invited to scrutinize, yell (plz don't!), or collaborate, not just silently tolerate a mysterious bot.
- Control for webmasters
- If there’s a part of your site you don’t want indexed—now or ever—that should be easy to communicate and enforce.
- If you have specific needs (rate limits, time windows, path exclusions), I want Argand to be able to respect those.
The web is a shared commons, not a mined resource. Argand should act like a good neighbor, not a strip miner, or worse: an AI company. Ugh.
How to opt out or set boundaries
Depending on how you prefer to manage crawlers:
- You can block the Argand user agent in your server config (e.g., via HTTP rules / firewall).
- The user-agent name will always remain "Argand", so a block you set today keeps working.
- You can set up whatever policies you normally use for bots; Argand is intended to respect clear “no” signals.
- Robots.txt remains your best bet! On my end, here is the code that checks robots.txt, and respects it with no workarounds or attempts to ignore your wishes:
// Check robots.txt
if !crawler.check_robots_txt(url).await? {
    return Err("Disallowed by robots.txt".into());
}
- If you’re ever unsure, you can reach out and say:
- “Please do not crawl this domain.”
- “You can crawl, but only at night / only these paths / not behind this route.”
If something Argand is doing causes you trouble, that’s a bug, not a feature.
If you’ve read this far
Thank you. Seriously.
If after reading this, you’re comfortable with Argand crawling your site, then you’re helping:
- Build an independent search index
- Support a privacy-respecting, progressive alternative
- Contribute (indirectly) to a larger project around Lilac Party’s vision of public-minded tech
And if you’re not comfortable, that’s fine too! You’re entitled to full control over your infrastructure, and Argand should fit into that, not fight it.