Show HN: Kage – Shadow any website to a single binary for offline viewing (github.com)
Turns out it's using another project by the same author: https://github.com/tamnd/ascii-gif
The script used for the demo is at https://github.com/tamnd/kage/blob/01e75b87ecc893bbba7943c63... and has a comment showing how to run it:
Looks like it's an opinionated wrapper around https://github.com/charmbracelet/vhs
https://www.cockos.com/licecap/
Cool!
It would be especially cool to have a version that didn't require the separate serving process - even though it's nifty you can package up a whole site as a single binary.
Maybe a single HTML entrypoint shim with a bit of javascript that could index into an archive (potentially embedded) of the site's content?
Also, in my mind, I already have a script/program to convert HTML to Markdown, so it could actually store everything on disk as a folder of Markdown files, and then commit them to a Git repo.
Basically I'm looking for something like the old-school .chm files on Windows, where you could pack a bunch of HTML documents into a single archive and open it without needing to embed a full browser engine.
This would have the advantage of keeping the file sizes really small. And you don't have to worry about the browser engine becoming outdated and potentially becoming an attack vector.
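As a rough illustration of the "bunch of HTML documents in a single archive" idea, here is a minimal Python sketch that packs pages into one ZIP and looks them up by URL path without unpacking. The function names `pack_site` and `read_page` are made up for this example; this is not how .chm files or Kage work internally.

```python
import io
import mimetypes
import zipfile


def pack_site(pages: dict[str, bytes]) -> bytes:
    """Pack a mapping of site-relative paths to file contents into one ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path, body in pages.items():
            zf.writestr(path, body)
    return buf.getvalue()


def read_page(archive: bytes, url_path: str) -> tuple[str, bytes]:
    """Look up a URL path in the archive and return (content type, body)."""
    # Map "/" and directory-style paths onto index.html, like a static server would.
    name = url_path.lstrip("/") or "index.html"
    if name.endswith("/"):
        name += "index.html"
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        body = zf.read(name)  # raises KeyError if the page is not in the archive
    content_type = mimetypes.guess_type(name)[0] or "application/octet-stream"
    return content_type, body
```

A viewer (or a tiny local server) would only need `read_page` plus something to render HTML, so the archive itself carries no browser engine.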
For the younger generation https://en.wikipedia.org/wiki/Microsoft_Compiled_HTML_Help
Epub would also be a great target.
So something like SingleFileZ https://github.com/gildas-lormeau/SingleFileZ or Gwtar https://gwern.net/gwtar ?
In a green field world, I have a personal requirement that technical documentation systems are capable of bulk exporting to a human-readable format on disk. I’m pretty flexible on what that is, though. Markdown is preferred, but I’m also fine with static, dependency-free HTML and I could accept PDFs if the rest of it is super nice.
It’s an integral part of DR, and most places want their docs on-premise, so DR effectively requires offline documentation. Everywhere I’ve worked either a) writes documentation in something that works offline (eg git repo with tarballs somewhere), or b) has invested a bunch of time in trying to scrape their own wiki into something legible during DR.
I guess it’s a long-winded way of saying “that’s using a tool to fix a self-inflicted problem that shouldn’t exist”.
If the result is static why does it need a server? Isn't it possible to make it so that it can simply be opened by the browser? Like:
$ firefox $HOME/data/kage/paulgraham.com
Then the result would be usable on machines without kage installed.
Actually, Kage has two parts: a crawler that crawls pages and converts them to clean HTML by capturing the DOM after rendering in Chrome/Chromium, and a pack/serve component that packages the result as either a ZIM file for Kiwix or an executable file.
Related WHATWG discussion: https://github.com/whatwg/html/issues/3099
I was thinking "of course it works, how else would people get started creating websites otherwise?" then I remembered what the most common approaches in the frontend ecosystem are nowadays.
Back in the days of yore, every tutorial/book started with "First we create an index.html file which you open in your browser ...", even a JavaScript resource would start with this of course :)
The protection mechanism was introduced so that malicious saved pages can't just grab things from your Downloads folder and send them to an attacker's server. But the method turned out to be a bit more refined than I had imagined: you can display an image but can't grab the pixels, run a script but not inspect its source code, fetch() will be unavailable, etc.
To see it work, click "Download self contained .html" from the menu.
Here's the source file that handles this part: https://github.com/tomtheisen/mutraction/blob/master/mutract...
The idea is to use <script type="inline-module" name="foo">...</script> to define modules. That's something I just made up. For each such script, provision a blob URL. The main blocker is usually the same origin policy. Crucially, these blob URLs count as the same origin. So then you need to rewrite the imports from the named modules to the blob URLs. I used some regex rather than a proper parser, but it was more than good enough for me.
It seems quite doable to make some proper bundling tools around this concept.
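The import-rewriting step described above can be sketched roughly like this. It is written in Python purely for illustration (the original ran in the browser), and the name `rewrite_imports` and the example blob URL are assumptions. In a browser, each URL would come from `URL.createObjectURL(new Blob([source], {type: "text/javascript"}))`. Like the regex mentioned above, this is not a real parser and can be fooled by unusual code.

```python
import re

# Matches the specifier in `import ... from "name"`, `export ... from "name"`,
# side-effect imports such as `import "name"`, and dynamic `import("name")`.
_IMPORT_RE = re.compile(r"""(\b(?:import|export)\b[^'";]*?)(['"])([^'"]+)\2""")


def rewrite_imports(source: str, module_urls: dict[str, str]) -> str:
    """Replace named module specifiers with the URLs provisioned for them."""

    def swap(match: re.Match) -> str:
        prefix, quote, name = match.groups()
        url = module_urls.get(name)
        if url is None:
            # Not one of the inline modules: leave the import untouched.
            return match.group(0)
        return f"{prefix}{quote}{url}{quote}"

    return _IMPORT_RE.sub(swap, source)
```

Modules would have to be processed leaves first, since a module's blob can only be created after its own imports have been rewritten.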
It strips out all the JavaScript too, but also packs everything into a single HTML file that is easy to transfer. Binary assets (like web fonts and images) are packed as base64 strings.
They also offer a CLI powered by Puppeteer. [1]
[0]: https://github.com/gildas-lormeau/singlefile
[1]: https://github.com/gildas-lormeau/single-file-cli
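For readers unfamiliar with the base64 packing mentioned above, here is a minimal Python sketch of the general technique (data URIs). It is not SingleFile's actual code; `to_data_uri` and `inline_assets` are hypothetical helpers, and the naive string replace only handles double-quoted `src` attributes.

```python
import base64
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode a binary asset as a data: URI, guessing the MIME type from the name."""
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_assets(html: str, assets: dict[str, bytes]) -> str:
    """Replace src="path" references with embedded data URIs."""
    for path, data in assets.items():
        uri = to_data_uri(data, path)
        html = html.replace(f'src="{path}"', f'src="{uri}"')
    return html
```

Base64 inflates each asset by about a third, which is the trade-off for getting a single transferable file.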
What I'm implementing here is mirroring a whole website, with all its subpages, so you can browse it all offline. For example, all essays from paulgraham.com.
I think the misunderstanding stems from the browser's "Save As" reference in the description. It is misleading. You use "Save As" to save a single page, not an entire website.
Also, the description lacks a clear explanation of the project's purpose. It would be helpful to include a sentence explaining that the program downloads an entire website, not just a single page.
I highly recommend reading the singlefile source or https://archiveweb.page/ to see how they handle closed shadow DOMs, cross-origin iframes, websockets, media urls, deduping large assets, etc.
Not the same thing, but I made a clone of pg’s website which can be used for exactly that: https://github.com/shawwn/pg
https://shawwn.github.io/pg/
If you want to read all essays, just clone the repo and open any of the .html files. Or any of the .page files which generated them.
That said, Kage looks promising if OP can combine SingleFile reproduction quality with the HTTrack spidering approach. SPAs are kinda tricky to archive, and I do wonder how well Kage would handle that.
For some reason it displays better in IE, but I don't recall seeing this option in Chrome or Firefox recently.
That way, the page is self-contained as it is, but requires no bundled binary code to serve the site. It is actually safer security-wise.
The vendored script can be as simple as this:
Let's say you have a site that fetches content from a database. If you Save As, then at best you'll get a local copy of an HTML page with JS that loads the content from the same remote database. It might not work (since the local copy has a different origin), or if it does, it requires you to be online, which defeats half of the purpose.
What this project, and SingleFile, both do is save a snapshot of what the rendered page actually looks like at that moment in time. The scripts are stripped out so it runs locally and has no external dependencies.
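To make "the scripts are stripped out" concrete, here is a small Python sketch using the standard `html.parser` module. It is only an illustration under simplifying assumptions, not what Kage or SingleFile actually do: it drops `<script>` elements and `on*` attributes, discards comments, and does not handle `javascript:` URLs.

```python
from html import escape
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML while dropping <script> elements and on* event attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def _emit_tag(self, tag, attrs, closing=">"):
        # Crude filter: any attribute starting with "on" is treated as a handler.
        kept = [(k, v) for k, v in attrs if not k.startswith("on")]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        self.out.append(f"<{tag}{rendered}{closing}")

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        if tag != "script" and not self.in_script:
            self._emit_tag(tag, attrs, closing=" />")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_scripts(html: str) -> str:
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Note that this only makes sense on the rendered DOM snapshot: run it on the raw HTML of a script-driven site and you would strip the code that produces the content.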
Won't comment on a project (though idea seems interesting) but this in README is a tell for me ;)
https://wiki.openzim.org/wiki/Build_your_ZIM_file
EDIT: https://get.kiwix.org/en/solutions/applications/kiwix-reader...
The executable file is mostly for people who don't have Kiwix installed yet, or just want to run the archive directly.
In any case, cool stuff :)
https://github.com/tamnd/kage/blob/main/Dockerfile
Btw, let me think of a way to only enable this when running inside Docker.
Thanks for the nice trick.
But a compromise still lands on the host's kernel; Docker doesn't provide kernel isolation (well, it does on macOS because it runs in a Docker machine, but that's a side effect).
I wonder if a better solution would be to play with seccomp or Linux capabilities so that Chrome is sandboxed even in Docker. Not sure how this would work tbh.
Answering here to get ideas, I saw your fix on Git and request for feedback (will try to review and give it some thought once I find some time)
It's one of the reasons I've become a bigger fan of RSS over time. A feed from 10-ish years ago is often more usable today than a carefully preserved (application) website.
Compared to that is there anything kage does better?
https://github.com/jart/cosmopolitan
https://justine.lol/cosmopolitan/index.html
https://redbean.dev
(Certificates just expired for justine's website, just ignore the warning.)
I did something like that a very long time ago (Of course, I have forgotten)
I'd rather have platform specific minimal binaries than a single binary with hacks.
Installing packages is a solved problem
It's fine if you don't personally find it useful for your workflow, but I think it's mad cool, especially since you can zip together multiple binaries into one, along with data.
I would recommend an add-on or new feature to detect and remove cookie banners / annoying popups that open on load (eg. sign up to my mailing list).
Listing a few examples from fastText could help you.
You might also have the opposite problem though: some websites have content in the base html (so it's searchable by Google and they get views) and remove it on load (so you have to pay).
Capturing the initial html and comparing it to the final version could give you some hints and allow you to repair the removed content.
Best of luck with the project!
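The "compare the initial HTML to the final version" idea above could be sketched like this in Python. The names `text_blocks` and `removed_on_load` are made up for the example, and the comparison is deliberately naive (exact match on whitespace-normalized text blocks), so treat it as a starting point rather than a working paywall detector.

```python
from html.parser import HTMLParser


class TextBlocks(HTMLParser):
    """Collect visible text blocks, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not self._skip:
            self.blocks.append(text)


def text_blocks(html: str) -> list[str]:
    parser = TextBlocks()
    parser.feed(html)
    parser.close()
    return parser.blocks


def removed_on_load(initial_html: str, rendered_html: str) -> list[str]:
    """Return text present in the served HTML but missing after rendering."""
    rendered = set(text_blocks(rendered_html))
    return [block for block in text_blocks(initial_html) if block not in rendered]
```

Any blocks this returns are candidates for content the page removed on load, which the crawler could then restore from the initial HTML.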
But will look into this now, see if we can swap some stuff out. We’ve really liked the idea of an offline mirror, makes a lot of collaboration use cases simpler
By converting it to Markdown, we save a lot of space, but it is for a different purpose and a different project: https://github.com/tamnd/ccrawl-cli
For my own custom data format, I have a lot of private code that I plan to release soon. It is optimized for compression, fast lookups, and more. I have been working on it for two years. This is part of a larger, ambitious umbrella project: I am building Google from scratch (all open source), something that anyone can host, including the crawler, indexer, storage, and serving layers. Stay tuned!
Sounds awesome. There is a lot of untapped potential with respect to efficiently archiving and indexing websites. I saw the impressive things Marginalia Search is doing in this area (the blog is great when it gets technical). There are also a lot of very complete archives of websites out there which are not being indexed at all, and I would love to make them available for researchers. In any case, I'm interested in your project!
Is the code also AI slop?
For an entire website of many pages, though, I can see this being useful.
For video downloading, I suggest wrapping around yt-dlp. It's an awesome tool.
I previously downloaded the Snowflake docs, and it was something like tens or even hundreds of thousands of pages, I do not remember exactly. The output ended up being very large.
By the way, I forgot to add zstd compression support to my ZIM reader/writer. I will implement that in the next version.
```bash
bin/kage clone https://developer.apple.com/documentation/ \
  --scope-prefix /documentation/ \
  --out /Users/apple/data/apple-docs \
  --chrome "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
  --max-pages 0 --max-depth 0 \
  --workers 3 --browser-pages 3 --asset-workers 6 \
  --render-timeout 60s --settle 2s --timeout 30s \
  2>&1 | tee -a /Users/apple/apple-docs.log
```
Adjust it to your needs :)
I smoke-tested it, and all the content and CSS work, but I stripped all the JS, so the sidebar won't work.
If you run into any problems, feel free to create new issues in the repo. It helps me prioritize and know what should be fixed.
Have you even read the first line of the readme of the project you're commenting on?
So I don't quite get what the point of kage is. What does it do that print-to-PDF won't already do? The resulting PDFs contain all the content, and also include the original URL and creation date, etc. How is kage an improvement?