Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

TIL: Downloading archived Git repositories from archive.softwareheritage.org

Back in February I blogged about a neat Python library called sqlite-s3vfs for accessing SQLite databases hosted in an S3 bucket, released as MIT licensed open source by the UK government's Department for Business and Trade.

I went looking for it today and found that the github.com/uktrade/sqlite-s3vfs repository is now a 404.

Since this is taxpayer-funded open source software I saw it as my moral duty to try and restore access! It turns out a full copy had been captured by the Software Heritage archive, so I was able to restore the repository from there. My copy is now archived at simonw/sqlite-s3vfs.

The process for retrieving an archive was non-obvious, so I've written up a TIL and also published a new Software Heritage Repository Retriever tool which takes advantage of the CORS-enabled APIs provided by Software Heritage. Here's the Claude Code transcript from building that.

Via Hacker News comment

Tags: archives, git, github, open-source, tools, ai, til, generative-ai, llms, ai-assisted-programming, claude-code

shot-scraper 1.9

New release of my shot-scraper CLI tool for taking screenshots and scraping websites with JavaScript from the terminal.

  • The shot-scraper har command has a new -x/--extract option which extracts all of the resources loaded by the page out to a set of files. This location can be controlled by the -o dir/ option. #184
  • Fixed the shot-scraper accessibility command for compatibility with the latest Playwright. #185

The new shot-scraper har -x https://simonwillison.net/ command is really neat. The inspiration was the digital forensics expedition I went on to figure out why Rob Pike got spammed. You can now perform a version of that investigation like this:

cd /tmp
shot-scraper har --wait 10000 'https://theaidigest.org/village?day=265' -x

Then dig around in the resulting JSON files in the /tmp/theaidigest-org-village folder.

Tags: projects, annotated-release-notes, shot-scraper

My Open Software

All of my software is hosted on GitHub, mostly under the permissive Apache-2.0 license. It is free for commercial and non-commercial use, modification, and distribution.

Major Projects

  • USearch - a universal search engine powering many databases, AI labs, and experiments in Natural Sciences. Compact C++ core with 10+ language bindings — 10–100× faster than Meta FAISS for vector search and far beyond Apache Lucene.
  • StringZilla - SIMD, SWAR, and CUDA-accelerated string algorithms for search, matching, hashing, and sorting at Web Scale and Bioinformatics scale. Hundreds of hand-tuned kernels with manual multi-versioning, exposed to C, C++, Rust, Python, Swift, and JavaScript, up to 10× faster on CPUs and 100× faster on GPUs.
  • SimSIMD - an extensive collection of mixed-precision vector math kernels for C, Python, Rust, and JavaScript. Designed for linear algebra, scientific computing, statistics, information retrieval, and image processing, delivering consistent SIMD speedups over BLAS and NumPy on both x86 and ARM architectures.
  • UCall - a kernel-bypass web server backend for C and Python built on io_uring. Achieves 70× higher throughput and 50× lower latency than FastAPI for real-time workloads, including serving compact AI models.
  • UForm - tiny multimodal AI models with state-of-the-art parameter and data efficiency. Compatible with Python, JS, and Swift, serving as a lightweight alternative to OpenAI CLIP for on-device and server inference.
  • ForkUnion - ultra-low-latency parallelism library for Rust and C++. Avoids allocations, mutexes, and even Compare-And-Swap atomics — achieving up to 10× speedups over Rayon and TaskFlow.

Some of those are used in open-source databases, such as ClickHouse, DuckDB, TiDB, ScyllaDB, yugabyteDB, DragonflyDB, MemGraph, Vald, and Turso; in LLM toolchains, such as LangChain, LlamaIndex, Microsoft SemanticKernel, Nomic AI GPT4All, and Surf; and in many other less “open” systems, such as the backend infrastructure of major AI labs, government intelligence agencies, hyper-scale cloud companies, Fortune 500 companies, and iOS and Android apps with 100M-1B MAU.

Don’t be so eager to rewrite your code

I used to always want to rewrite my code. Maybe even use another programming language. « If only I could rewrite my code, it would be so much better now. »

If you maintain software projects, you see it all the time. Someone new comes along and they want to start rewriting everything. They always have subjective arguments: it is going to be more maintainable or safer or just more elegant.

If your code is battle-tested… then the correct instinct is to be conservative and keep your current code. Sometimes you need to rewrite your code: you made a mistake or must change your architecture. But most of the time, the old code is fine, and investing time in updating your current code is better than starting anew.

The great intellectual Robin Hanson argues that software ages. One of his arguments is that software engineers say that it does. That’s what engineers feel but whether it is true is another matter.

« Before Borland’s new spreadsheet for Windows shipped, Philippe Kahn, the colorful founder of Borland, was quoted a lot in the press bragging about how Quattro Pro would be much better than Microsoft Excel, because it was written from scratch. All new source code! As if source code rusted. The idea that new code is better than old is patently absurd. Old code has been used. It has been tested. Lots of bugs have been found, and they’ve been fixed. There’s nothing wrong with it. It doesn’t acquire bugs just by sitting around on your hard drive. Au contraire, baby! Is software supposed to be like an old Dodge Dart, that rusts just sitting in the garage? Is software like a teddy bear that’s kind of gross if it’s not made out of all new material? » (Joel Spolsky)

By how much does your memory allocator overallocate?

How much virtual memory does the following C++ expression allocate on the heap?

new char[4096]

The answer is at least 4 kibibytes but surely more.

Firstly, each heap memory allocation requires some memory to keep track of what has been allocated. You are likely using 8 bytes or so of overhead that your program cannot access.

Secondly, the memory allocator may allocate a bit more than the 4096 bytes you requested. On a Linux machine, I found that it would allocate 4104 bytes, so 8 extra bytes that are usable by your program. You can check this value by calling malloc_usable_size under Linux.

Thus, overall, you may end up with an extra 16 bytes allocated when you requested 4096 bytes. It is an overhead of about 0.4%. You are basically wasting a byte for every 256 bytes that you allocate.

But that is not the worst possible case. On macOS, let us consider the following line of code.

new char[3585]

The system reports an allocation of 4096 bytes: a 14% overhead. What is happening is that macOS rounds the allocation up to the next 512-byte boundary for moderately small allocations. If you try allocating even larger memory blocks, it starts rounding up even more.

Enable Modern Run dialog box in Windows 11

In this article, learn how to enable the modern Run dialog box in Windows 11, using either Settings or the registry.