        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   ArchiveBox is evolving: the future of self-hosted internet archives
       
       
        joeross wrote 2 hours 11 min ago:
        I have no programming skill at all and I don’t know a ton about
        ArchiveBox except I set it up and ran it for myself for a while, so
        I’m asking as an innocent, ignorant and curious geek, but is this
        something that could be adapted to peer to peer distribution or some
        other means of making it simultaneously as private and local as you
         want it and as distributed and bulletproof, uptime-wise, as possible?
       
        chillfox wrote 3 hours 40 min ago:
         Awesome, I am really looking forward to the new API and plugins.
        
        I have been running an instance for almost 2 years now that I use for
        archiving articles that I reference in my notes.
       
        newman314 wrote 4 hours 0 min ago:
        @nikisweeting Is abx-dl already available or is it coming? I took a
        quick dive and didn't see a repo under the org.
        
        I'm happy to help package this up once it is available.
       
        pabs3 wrote 5 hours 21 min ago:
        Unfortunately ArchiveBox uses wget, so it produces non-standard WARC
        files. Sadly there are lots of things like this in the WARC ecosystem.
        
  HTML  [1]: https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem
       
          nikisweeting wrote 5 hours 14 min ago:
          Yes, this is true currently. If you need nice WARCs I recommend
          Browsertrix by our friends at Webrecorder instead.
          
           It's on my roadmap to improve this eventually, but currently I'm
          focused on saving raw files to a filesystem, because it's more
          accessible to most users, and easier to pipe into other tools.
          
          I encourage people to use ZFS to do deduping and compression at the
          filesystem layer.
       
            TheTechRobo wrote 3 hours 53 min ago:
            Browsertrix (and Webrecorder tools in general) also violate the
            standard by modifying response data. It's supposed to be the raw
            bytes as they are sent over the network (minus TLS).
            
            The entire WARC ecosystem is kind of a mess.
       
              ikreymer wrote 19 min ago:
               This isn't really true; our tools do not just modify response
               data for no reason!
              
              Our tools do the best that we can with an old format that is in
              use by many institutions. The WARC format does not account for
              H2/H3 data, which is used by most sites nowadays.
              
               The goal of our (Webrecorder) tools is to preserve interactive
               web content with as much fidelity as possible and make it
               accessible/viewable in the browser. That means stripping TLS and
               H2/H3, sometimes forcing a certain video resolution, etc., while
               preserving the authenticity and interactivity of the site. It can
               be a tricky balance.
              
              If the goal is to preserve 'raw bytes sent over the network' you
              can use Wireshark / packet capture, but your archive won't
              necessarily be useful to a human.
       
        bityard wrote 5 hours 26 min ago:
        So, after reading through the comments and website, I just realized I
        used ArchiveBox a month or two ago for a very specific purpose.
        
        You see, I inherited a boat.
        
        This boat belonged to my father. He was not materialistic but he took
        very good care of the things he cared about, and he cared about this
        boat. It's an old 18' aluminum fishing/cruising boat built in the early
        1960's. It's not particularly valuable as a collectible but it is
        fairly rare and has some unique modifications. I spent a lot of time
        trying to dig up all of the info that I could on it, but this is one of
        those situations where most of the companies involved have been gone
         for decades, and most everyone who was around when these were made is
         either dead or not really on the Internet.
        
        It's a shame that I waited so long to start my research because 10 or
        20 years ago, there were quite a few active web forums containing
        informational/tutorial threads from the proud owners of these old
        boats. I know because I have seen references to them. Some of the URLs
        are in archive.org, some are not. But the forums are gone, so a large
        chunk of knowledge on these boats is too, probably forever.
        
        I did manage to dig up some interesting articles, pictures, and forum
        threads and needed a way to save them so that they didn't disappear
        from the web as well. There is probably an easier way to go about it,
        but in the end I ran ArchiveBox via Docker and set it to fetching what
        I could find and then downloaded the resulting pages as self-contained
        HTML pages.
       
          shiroiushi wrote 3 hours 16 min ago:
          >because 10 or 20 years ago, there were quite a few active web forums
          containing informational/tutorial threads from the proud owners of
          these old boats. ... But the forums are gone, so a large chunk of
          knowledge on these boats is too, probably forever.
          
          These days, that kind of info would be locked up in a closed Discord
          chat somewhere, so you can forget about people 20 years from now ever
          seeing it.
       
            stavros wrote 2 hours 31 min ago:
            Or people today ever discovering it.
       
        A4ET8a8uTh0 wrote 5 hours 47 min ago:
         Those additions are welcome, but if I could request one -- and one
         that is very consistently requested -- feature:
        
        - backing up an entire page
        
         Yes, it is hard. Yes, for non-pure HTML pages it is extra painful, but
         that would honestly take ArchiveBox from nice-to-have to... yes, I have
         an actual archive I can use when stuff goes down.
       
          nikisweeting wrote 4 hours 56 min ago:
          Do you mean backing up an entire domain? Like example.com/*
          
           If so, that's starting to roll out in v0.8.5rc50; check out the
           archivebox/crawls/ folder.
          
           If you mean archiving a single page more thoroughly, what do you find
           is missing in ArchiveBox? Are you able to get singlefile/chrome/wget
           HTML when archiving?
       
            A4ET8a8uTh0 wrote 4 hours 48 min ago:
            Edit: The first option. ( previous stuff removed )
            
            Lemme check my current version ( edit: 0.7.2 -- ty, I will update
            and test soon :D)
       
              nikisweeting wrote 4 hours 24 min ago:
               Ah ok. One caveat: it's only available via the 'archivebox shell'
               / Python API currently; the CLI & web UIs for full-depth crawling
               will come later.
               
               You can play around with the models and tasks, but I would wait a
               few weeks for it to stabilize and check again; it's still under
               heavy active development.
              
              Check archivebox/archivebox:dev periodically
       
                A4ET8a8uTh0 wrote 3 hours 48 min ago:
                No worries. I can do that.
                
                 You guys probably hear it all the time, but you are doing the
                 Lord's work. If I thought I could be of use in that project, I
                 would be trying to contribute myself (in fact, let me see if
                 there is a way I can participate in a useful manner).
       
        dark-star wrote 6 hours 37 min ago:
        Some time ago I installed ArchiveBox on a RaspberryPi 4 running k3s (a
        lightweight Kubernetes distro).
        
         I have documented that here: [1] Note that this was a rather old
         version and some things have probably changed since then, so YMMV, but
         it might still provide a good reference for those who want to try it.
        
  HTML  [1]: https://darkstar.github.io/2022/02/07/k3s-on-raspberrypi-at-ho...
       
          nikisweeting wrote 6 hours 7 min ago:
          Thanks for making that tutorial!
          
          Happy to report that most of the quirks you cover have been improved:
          
           - uid 999 is no longer enforced; you can pass any PUID:PGID now
           (like Linuxserver.io containers)
          
          - it now accepts ADMIN_USERNAME + ADMIN_PASSWORD env vars to create
          an initial admin user on first start without having to exec
          
          - archivebox/archivebox:latest is 0.7.2 (yearly stable release) and
           :dev is the 0.8.x pre-release updated daily. All images are amd64 &
           arm64 compatible.
          
          - singlefile and sonic are now included in all images & available on
          all platforms amd64/arm64
       
            dark-star wrote 5 hours 46 min ago:
            yeah I really need to update that guide. Since I published it I
            have updated ArchiveBox locally to a newer version but never
            bothered to update the guide :)
       
        petertodd wrote 8 hours 0 min ago:
        You really should add timestamping to ArchiveBox. The easiest way to do
         that would be via my OpenTimestamps protocol. [1] It's open source and
        free to use, and uses Bitcoin for the actual timestamps. Users of it do
        not need to make Bitcoin transactions themselves as a set of community
        calendar servers do that for you. You also don't need a Bitcoin node to
        create an OTS timestamp, and you can validate an OTS timestamp without
        a Bitcoin node as well by trusting someone else to do that for you.
        
        The big thing that ArchiveBox can't do, and the Internet Archive can,
        is attest to the accuracy of the archive. Being at least able to prove
         that the archive was created in the past, prior to there being a reason
         to tamper with it, is the best we can realistically do with current
        cryptography. So it'd be really good if support for timestamping was
        added.
        
        IIUC ArchiveBox is written in Python; OTS has a Python library that
        should work fine for you:
        
  HTML  [1]: https://opentimestamps.org
  HTML  [2]: https://github.com/opentimestamps/python-opentimestamps
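        
         For reference, a rough sketch of driving the `ots` CLI from Python to
         stamp an archived snapshot (a hedged illustration, not ArchiveBox
         code; it assumes opentimestamps-client is installed and uses a
         hypothetical snapshot folder name):
        
           import subprocess
           from pathlib import Path
        
           def timestamp_snapshot(snapshot_dir: str) -> None:
               """Stamp every file in an archived snapshot with OpenTimestamps.
        
               `ots stamp` writes a detached <file>.ots proof next to each
               file, which can later be checked with `ots verify <file>.ots`.
               """
               for path in Path(snapshot_dir).rglob("*"):
                   if path.is_file() and path.suffix != ".ots":
                       subprocess.run(["ots", "stamp", str(path)], check=True)
        
           timestamp_snapshot("archive/1726000000.0")  # hypothetical folder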
       
          jasonfarnon wrote 7 hours 46 min ago:
          I always wonder about this when someone gets in hot water based on
          something on the wayback machine and the person says the archive was
           tampered with. Can you elaborate on "prove that the archive was
           created in the past, prior to there being a reason to tamper with it"?
          What exactly does opentimestamps certify?
       
            nikisweeting wrote 7 hours 30 min ago:
            OpenTimestamps alone can not currently prove anything because TLS
            session keys are symmetric. The client can forge anything and
            attest to it falsely. Unless you 100% trust the archiver (in which
            case you can trust their timestamps), you need TLSNotary or another
            reputable third party in the loop as a bare minimum.
            
            But more critically: currently the legal standard for evidence
            is... screenshots. We have a lot of educating work to do before the
            public understands the value of attestation and signing.
       
              petertodd wrote 3 hours 9 min ago:
              > OpenTimestamps alone can not currently prove anything because
              TLS session keys are symmetric.
              
              Timestamps can prove that the data existed prior to there being a
              known reason to modify it. While that's not as good as direct
              signing, that's often still enough to be very useful. The
              statement that OTS "can not currently prove anything" is
              incorrect.
              
              A really good example of this is the Hunter Biden email
              verification. I used OpenTimestamps to prove that the DKIM key
              that signed the email was in fact used by Google at the time, by
              providing a Google-signed email that had been timestamped years
              ago: [1] That's convincing evidence, because it's highly
              implausible that I would have been working to fake Hunter's
              emails years before they even came up as an election issue.
              
  HTML        [1]: https://github.com/robertdavidgraham/hunter-dkim/tree/ma...
       
          nikisweeting wrote 7 hours 55 min ago:
          We're going to add TLSNotary support for real cryptographic signing,
          see my comments below :)
          
          Timestamping is also on my roadmap, definitely as a plugin (and
          likely paid) as it's more corporate users that really need it. We
          need to keep some of the really advanced attestation features paid to
          be able to support the rest of the business.
       
            petertodd wrote 3 hours 5 min ago:
            > We're going to add TLSNotary support for real cryptographic
            signing, see my comments below :)
            
            Last I checked TLSNotary requires a trusted third party. I would
            strongly suggest timestamping TLSNotary evidence, to be able to
            prove that evidence was created prior to any of these trusted third
            parties being compromised.
       
            mikae1 wrote 7 hours 51 min ago:
            Thanks for the box!
            
            Any examples of other possible really advanced features that might
            go for-pay?
            
            Is there any chance you will make current free features for-pay?
            That'd be rather off-putting for me as a home user.
       
              nikisweeting wrote 7 hours 34 min ago:
              No, everything currently free will stay free.
              
              The paid stuff currently is:
              
              - per-user permissions & groups
              
              - audit logging
              
              - auto CAPTCHA solving
              
              - burner credential management for FB/Insta/Twitter/etc. w/ auto
              phone based account verification ability
              
              - custom JS scripts for expanding comments, hiding pop ups, etc.
              
              - managed hosting + support
              
               Some of this stuff ^ is going to become free in upcoming
               releases, and some will stay paid. What I decide to make free is
               mostly based on abuse potential and legal ramifications; I'd
               rather have a say in how the risky stuff is used so that it
               doesn't become a tool weaponized for botting.
       
        orblivion wrote 8 hours 2 min ago:
        Have you (and I wonder the same about archive.org) considered making a
        Merkle tree of the data that gets archived? Since data (including
        photos and videos) are getting easier to fake, it may be nice to have a
        provable record that at least a certain version of the data existed at
        a certain time. It would be most useful in case of some sort of
        oppressive regime down the line that wants to edit history. You'd want
        to publish the tip somewhere that records the time, and a blockchain
        seems to make the most sense to me but maybe you don't like
        blockchains.
       
          beefnugs wrote 6 hours 41 min ago:
           Not just all that nonsense, but it also makes a lot of sense to share
           just the parts of a website that matter, like a single video, without
           having to download an entire archive or the rest of the site.
       
            nikisweeting wrote 6 hours 26 min ago:
            $ archivebox add --extractor=media,readability [1] ...
            
             We try to make that easy by allowing people to select one or more
             specific ArchiveBox extractors when adding, so you don't have to
             archive everything every time.
            
            Makes it more useful for scraping in a pipeline with some other
            tools.
            
  HTML      [1]: https://
       
          nikisweeting wrote 7 hours 52 min ago:
           Yup, already doing that in the betas. That's what I'm referring to as
           the beginnings of a "content addressable store" in the article.
          
          In the closed source fork we currently store a merkle tree summary of
          each dir in a dotfile containing the sha256 and blake3 hash of all
          entries / subdirs. When a result is "sealed" the summary is
          generated, and the final salted hash can be submitted to Solana or
          ETH or some other network to attest to the time of capture and the
          content. (That part is coming via a plugin later)
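           
           Conceptually, a per-directory summary along those lines could look
           roughly like this (an illustrative sketch only, not the actual
           closed-source format; it assumes the third-party blake3 package
           and a hypothetical snapshot path / dotfile name):
           
             import hashlib
             import json
             from pathlib import Path
             from blake3 import blake3  # third-party package (assumed)
           
             def summarize_dir(dirpath: Path) -> dict:
                 """Hash each entry and roll child summaries up Merkle-style."""
                 entries = {}
                 for p in sorted(dirpath.iterdir()):
                     if p.is_file():
                         data = p.read_bytes()
                         entries[p.name] = {
                             "sha256": hashlib.sha256(data).hexdigest(),
                             "blake3": blake3(data).hexdigest(),
                         }
                     elif p.is_dir():
                         entries[p.name] = summarize_dir(p)  # subtree summary
                 # dir hash covers its children's hashes, not raw bytes
                 root = hashlib.sha256(
                     json.dumps(entries, sort_keys=True).encode()
                 ).hexdigest()
                 return {"root_sha256": root, "entries": entries}
           
             snap = Path("archive/1726000000.0")  # hypothetical snapshot dir
             (snap / ".summary.json").write_text(
                 json.dumps(summarize_dir(snap), indent=2)
             )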
       
            orblivion wrote 7 hours 45 min ago:
            Wow that's great!
       
        favorited wrote 8 hours 10 min ago:
         As someone who was archiving a doomed website earlier today using wget,
         I was reminded that I really need to get ArchiveBox working...
        
        I used to rely on my Pinboard subscription, but apparently archive
        exports haven't worked for years, so those days are over.
       
          pronoiac wrote 2 hours 37 min ago:
          Oh, writing my own Pinboard archive exporter is somewhere on my
          too-long to-do list. I should find out what would be good for
          importing into Archivebox. (WARC?)
       
          VTimofeenko wrote 6 hours 7 min ago:
           I recently found omnivore.app through HN comments -- works great for
           sharing a reading list across machines. I am exporting articles
           through Obsidian, but there is an API option. I don't think it
           supports outbound RSS, but they have inbound RSS (i.e. Omnivore as an
           RSS reader) in beta.
       
          nikisweeting wrote 7 hours 46 min ago:
          Pocket also doesn't offer archived page exports (or even RSS export).
          I feel like both are really dropping the ball in this area!
       
        rcarmo wrote 8 hours 48 min ago:
        This is nice. I'm actually much more excited about the REST API (which
        will let me do searches and pull information out, I hope) than the
        plugin ecosystem, since the last thing I need is for another tool to
        have a half-baked LLM integration -- I prefer to do that myself and
        have full control.
        
        Being able to do RAG on my ArchiveBox is something that I have very
        much wanted to do for over a year now, and it might finally be within
        reach without my going and hacking at the archived content tree...
        
        Edit: Just looked at the API schema at [1] .
        
         No dedicated search endpoint? This looks like a HUGE missed
         opportunity. I was hoping to be able to query an FTS index on the
         SQLite database... Have I missed something?
        
  HTML  [1]: https://demo.archivebox.io/api/v1/docs
       
          nikisweeting wrote 8 hours 37 min ago:
           The /cli/list endpoint is the search endpoint you're looking for. It
           provides FTS, but I can make that clearer in the docs; thanks for the
           tip.
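           
           Roughly, calling it from a script looks something like this (the
           path and query parameters shown are illustrative; check the schema
           at /api/v1/docs for the real names):
           
             import json
             import urllib.request
           
             # hypothetical params mirroring the CLI's filter flags;
             # the actual API may name them differently
             url = ("https://demo.archivebox.io/api/v1/cli/list"
                    "?filter_patterns=example.com&filter_type=substring")
             with urllib.request.urlopen(url) as resp:
                 print(json.load(resp))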
          
           As for the AI stuff, don't worry: none of it is touching core, it's
           all in an optional community plugin only for those who want it.
           
           I'm not personally a huge AI person, but I have clients who are
           already using it and getting massive value from it, so it's worth
           mentioning. (They're doing some automated QA on thousands of
           collected captures and feeding the results into spreadsheets.)
       
            sunshine-o wrote 6 hours 48 min ago:
            I have been using ArchiveBox recently and love it.
            
             About search, one thing I haven't yet figured out how to do easily
             is to plug it into my SearXNG instance, as it only seems to support
             Elasticsearch, Meilisearch or Solr [0]
             
             So this new plugin architecture will allow for a Meilisearch plugin
             I guess (with relevancy ranking).
            
            - [0]
            
  HTML      [1]: https://docs.searxng.org/dev/engines/offline/search-indexe...
       
              nikisweeting wrote 6 hours 24 min ago:
              Definitely doable! Search plugins are one of the first that I
              implemented.
              
              We already provide Sonic, ripgrep, and SQLiteFTS as plugins, so
              adding something like Solr should be straightforward.
              
              Check out the existing plugins to see how it's done: [1]
              archivebox/plugins_search/sonic/*
              
  HTML        [1]: https://github.com/ArchiveBox/ArchiveBox/pull/1534/files...
       
            rcarmo wrote 8 hours 13 min ago:
            Thanks, I'll have a look.
            
            My use for this is very different--I want to be able to use a
            specific subset of my archived pages (which is mostly reference
            documentation) to "chat" with, providing different LLM prompts
            depending on subset and fetching plaintext chunks as reference info
            for the LLM to summarize (and point me back to the archived pages
            if I need more info).
       
              nikisweeting wrote 7 hours 44 min ago:
               Ok, that makes sense. I think ArchiveBox works as the first step
               in a pipeline there, with some other tool doing the LLM analysis
               and query stuff.
       
        rodolphoarruda wrote 8 hours 48 min ago:
        > "In an era where fear of public scrutiny is very tangible, people are
        afraid of archiving things for eternity. As a result, people choose not
        to archive at all, effectively erasing that history forever."
        
        Really? I don't get that feeling at all. I use Evernote to archive
        anything I consider worth keeping. I wonder where such "fear of
        archiving" comes from.
       
          nikisweeting wrote 8 hours 32 min ago:
          A lot of people are retreating off public free-for-all platforms like
          Twitter to more siloed spaces like Discord, for many reasons, not
          just fear of archiving.
          
          It all has the same effect of making it harder to archive though.
       
        FiniteField wrote 9 hours 2 min ago:
        Disappointing that a project that should ostensibly care about
        preserving the open, non-centralised internet takes the time to
        namedrop and talk about making "compromises" against preserving a
        well-known, medium-sized clearnet forum legally operated from a
        US-based LLC. Still-living independent forum sites in this day and age
        have unrivalled SNR of actual human-to-human communication, there
        should be no better candidate for archival. It's sad that a self-hosted
        archival tool has to apologise for any "evil" content it might be used
        for in the first place. Tape recorders do not require a disclaimer
        about people saying "hate speech" into them.
       
          nikisweeting wrote 8 hours 24 min ago:
          Sorry which medium sized forum are you referring to?
          
           I love forums and want them to continue; I'm not sure where you got
           the idea that I dislike them as a medium. I was just pointing out
           that public sites in general have started to see some attrition
           lately for a variety of reasons, and the tooling needs to keep up
           with new mediums as they appear.
          
           I also make no apology for the content; in fact, ArchiveBox is
           explicitly designed to archive the most vile stuff for lawyers and
          governments to use for long term storage or evidence collection. One
          of our first prospective clients was the UN wanting to use it to
          document Syrian war crimes. The point there was that we can save
          stuff without amplifying it, and that's sometimes useful in niche
          scenarios.
          
          Lawyers/LE especially don't want to broadcast to the world (or tip
          off their suspect) that they are investigating or endorsing a
          particular person, so the ability to capture without publicly
          announcing/mirroring every capture is vital.
       
            dark-star wrote 5 hours 39 min ago:
            I guess he's talking about K_wi F_rms which was mentioned in one of
            the screenshots...
       
              nikisweeting wrote 5 hours 5 min ago:
              Ahh that makes sense. Well all I can say to that is that it's not
              up to me what's evil. The point I was trying to make is:
              sometimes you want to archive something that you don't endorse /
              don't want to be publicly linked.
              
              You might not want to amplify and broadcast the fact that you're
              archiving it to the world.
       
        Acrobatic_Road wrote 9 hours 24 min ago:
        The subline mentions "Auto-login", but the article never elaborates on
        this. Does this mean we will be able to more easily archive non-public
        websites?
        
        Also, how do you plan to ensure data authenticity across a distributed
        archive? For example, if I archive someone's blog, what is stopping me
        from inserting inflammatory posts that they never wrote, and passing
        them off as the real deal? Slight update: I see you're using TLS
        Notary! That's exactly what I would have suggested!
       
          nikisweeting wrote 9 hours 4 min ago:
           Auto-login is currently a service I provide for paying clients, and
           you can do it in the open source version manually with some extra
           config.
          
          Working hard on making it more accessible in the future, and plugins
          should help!
       
        404mm wrote 9 hours 56 min ago:
         Somewhat similar topic: does anyone have recommendations for a
         self-hosted website change monitoring system? I’ve been running Huginn
         for many years and it works well; however, I have a feeling the project
         is on its last leg. Also, it’s based on text scraping
         (XPath/CSS/HTML) and RSS, but it struggles with newer JS-based sites.
       
          arminiusreturns wrote 9 hours 16 min ago:
          Why do you feel like Huginn is on its last leg?
          It's been in my list of things to play with for years now, but I
          never got around to it...
       
            404mm wrote 8 hours 43 min ago:
            It looks like it’s being maintained by a single remaining
            developer. No new features are being added, just some basic
            maintenance. The product as a whole still works well, so unless you
            find something better, I do recommend it. I run it in k3s and the
            image is probably the easiest way of maintaining it.
       
          nikisweeting wrote 9 hours 28 min ago:
          Changedetection.io
       
            404mm wrote 7 hours 6 min ago:
            Thank you! That looks great!
       
        bravura wrote 10 hours 8 min ago:
        @nikisweeting ArchiveBox is awesome and we'd really love it to be more
        awesome. And sustainable!
        
        I've posted issues and PRs for showstopper issues that took months to
        get merged in: [1] [2] You have the opportunity for the community to
        lean in on ArchiveBox. I understand it's hard to do everything as a
         solo dev; we've seen many cases in the community where solo devs get
         burned out or have personal challenges that take priority, etc.
        
         It's hard for us users to lean in on ArchiveBox when, after a happy
         month of archiving, things start to break and you're left maintaining
         a branch of your own fixes that aren't in main. Meanwhile, your
         solution of soliciting one-time donations just makes the whole project
        feel more rickety and fly-by-night. How about thinking bigger?
        
        We NEED ArchiveBox to be a real thing. Decentralized tooling for
        archiving is SO IMPORTANT. I care about it and I suspect many people
        do. I'm posting this so other people who care about it can also comment
        and chime in and suggest how it can become something we can rely on.
        Because archiving isn't just about the past, it's about the future.
        
         Maybe it needs to be a dev org of three committed part-time
         maintainers, funded through grants from a small foundation that people
         recurrently support? IDK. I'm not an expert at how to make open source
        resilient. There have been discussions about this in the past, but I
        think it's worth a serious look because ArchiveBox is IMPORTANT and I
        want it to work any month I decide to re-activate my interest in it. I
        invite people to discuss ways to make this valuable project more
         sustainable and resilient.
        
  HTML  [1]: https://github.com/ArchiveBox/ArchiveBox/issues/991
  HTML  [2]: https://github.com/ArchiveBox/ArchiveBox/pull/1026
       
          nikisweeting wrote 9 hours 31 min ago:
           Let's chat more. I'm almost ready to raise some seed money, hire a
          second staff dev or find a cofounder, and I'm looking for people that
          care deeply about the space.
          
          It's only been during the last few months that I decided to go all in
          on the project, so this is still just the first few pages of a new
          chapter in the project's history.
          
          (I should also mention that if you're a commercial entity relying on
          ArchiveBox, you can hire us for dedicated support and uptime
          guarantees. We have a closed source fork that has a much better test
          suite and lots of other goodies)
       
            bigiain wrote 5 hours 59 min ago:
            "I too would like commit access to your promising looking project's
             git repo and CI/CD pipeline. Thanks, Jia Tan"
       
              msephton wrote 2 hours 44 min ago:
              lololol
       
            manofmanysmiles wrote 6 hours 24 min ago:
            I love this project. I "independently" "invented" it in my head the
            other day, and happy to see it already exists!
            
             I'd love to see blockchain proof/notary support. The ability to say
             "content matching this hash existed at this time."
            
            I'm exceptionally busy now but that being said, I may choose to
            contribute nonetheless.
            
            I'd love to connect directly, and will connect to the Zulip
            instance later.
            
            If we align on values, I may be able to connect you with some cash.
            People often call me an "anarchist" or "libertarian", though I'm
             just me, no labels necessary.
       
              nophunphil wrote 1 hour 50 min ago:
              Can you please explain what you mean by “blockchain
              proof/notary support”?
       
                manofmanysmiles wrote 1 hour 9 min ago:
                Motivation: Have evidence that some content existed at a
                particular time. For example, let's say a major website
                publishes an article, and later they remove it, and there is no
                 record of it ever existing. If I host an ArchiveBox, I can look
                 at it and see "Oh here is that article. Looks like it was
                 published after all." However, why should you believe that I
                 didn't just make it up?
                
                 If, when I initially archived it, I computed a cryptographic
                 hash of the content and posted that on a blockchain, then at a
                 future date I can at least claim "As of block N, approximately
                 corresponding to this time UTC, content that hashes to this
                 hash existed."
                
                If multiple unrelated parties also make the same claim, it is
                stronger evidence.
                
                Is this sufficient explanation? I can expand on this more
                later.
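                 
                 Concretely, the thing being anchored is just a digest of the
                 archived bytes, e.g. (file path here is hypothetical):
                 
                   import hashlib
                 
                   data = open("archive/1726000000.0/singlefile.html", "rb").read()
                   digest = hashlib.sha256(data).hexdigest()
                   print(digest)  # this hex string is what would go on-chain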
       
                  jazzyjackson wrote 44 min ago:
                  There's no reason to believe that the hashed and timestamped
                  content was hosted at a particular domain, however (unless
                  the content was signed by the author of course, then there's
                   no blockchain necessary). Sure, multiple peers could make some
                   attestation that they saw it at that URL, but then you're
                   back at square one of the reputation problem.
                  
                   The Internet Archive as an institution with a reputation that
                   holds up to a judge is actually more valuable than a
                   cryptographic proof that x bytes existed at y time.
       
            nyx wrote 6 hours 55 min ago:
            It looks like you're doing great work here, thanks a bunch; looking
            forward to seeing this project develop.
            
            Selling custom integrations, managed instances, white-glove support
            with an SLA, and so on seems like a reasonable funding model for a
            project based on an open-source, self-hostable platform. But I'm a
            little disheartened to read that you're maintaining a closed fork
            with "goodies" in it.
            
            How do you decide which features (better test suite?) end up in the
            non-libre, payware fork of your software? If someone contributed a
            feature to the open-source version that already exists in the
            payware version, would you allow it to be merged or would you
            refuse the pull request?
       
              nikisweeting wrote 5 hours 25 min ago:
              The idea with the plugin system is that plugins are just git
              repos containing /__init__.py, and you can add any set of git
              repo plugins you want to your instance.
              
              The marketplace will work by showing all git repos tagged with
              the "archivebox" tag on github.
              
              My approval is only needed for PRs to the archivebox core engine.
              
              More info on free vs paid + reasoning why it's not all open
              source:
              
  HTML        [1]: https://news.ycombinator.com/item?id=41863539
       
            giancarlostoro wrote 7 hours 4 min ago:
            Do you guys have a Discord by chance? I have a close friend who is
            insanely passionate about archiving, he has a personal instance of
            archivebox, and is working on a Video Downloading project as well.
             He has used it almost every day and archived thousands of news
             articles over the years. He's aware of a lot of the nuances.
       
              nikisweeting wrote 6 hours 53 min ago:
               We have a Zulip, which is similar to Discord (but self-hosted and
               with better threading):
              
  HTML        [1]: https://zulip.archivebox.io
       
        millvalleydev wrote 10 hours 14 min ago:
         For devs like us, ArchiveBox or browsertrix-crawler? For scraping
         entire sites for our own uses -- maybe to keep content that's behind
         paywalls while we have subscriptions, or to feed it to local LLMs to
         query?
       
          nikisweeting wrote 8 hours 22 min ago:
           For scraping entire sites, Browsertrix is currently more suited until
           we add full-depth recursive crawling in v0.9. For feeding to LLMs,
           ArchiveBox MIGHT BE better (imho) because we extract the raw content
           and you likely don't need the whole WARC.
       
        sagz wrote 10 hours 14 min ago:
        Do y'all support archiving pages that are behind logins? Like using
        browser cookies?
       
          markerz wrote 9 hours 54 min ago:
           Yes, but there are security concerns where you might accidentally leak
           your credentials / cookies if you publish your archive to the public.
           [1] PS. I'm an ArchiveBox user, not a dev or maintainer.
          
  HTML    [1]: https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overv...
       
            nikisweeting wrote 9 hours 27 min ago:
             Yes, this is correct, with plans to make this easier in the near
             future via a setup wizard that guides you through creating dedicated
             credentials for archiving.
       
        wongarsu wrote 10 hours 49 min ago:
        Does this mean it's now possible to write plugins that dismiss cookie
        popups, solve captchas, scroll web pages etc.?
       
          nikisweeting wrote 10 hours 40 min ago:
          I have a private plugin with puppeteer support for stuff like this,
          currently charging clients money to use it to fund the open source
          development. The clients are people who are already legally allowed
           to evade CAPTCHAs (e.g. governments, NGOs doing research, lawyers
          collecting evidence, etc.)
          
           Unfortunately I can't open source the CAPTCHA solving stuff myself,
           because it opens me up to liability, but if someone wants to
           contribute a plugin to the ecosystem I can't stop them ;).
       
            0x1ch wrote 9 hours 17 min ago:
            Legally allowed to evade CAPTCHAs? LOL.
            
            What world do we live in where evading a captcha is an illegal
            offense?
       
              nikisweeting wrote 9 hours 5 min ago:
               It doesn't matter whether or not it's actually legal; what
               matters is that the big platforms will sue you for trying, so you
               need a big bankroll to stand your ground.
              
              At the very least they can bar you from accessing their sites as
              you're violating ToS that you accept upon signup.
       
        nfriedly wrote 10 hours 56 min ago:
        I've been using an instance of [1] for personal archives of web pages
        and I really like it, but I might try out ArchiveBox at some point too.
        
        I also run an instance of ArchiveTeam Warrior which is constantly
        uploading things to archive.org, and I like the direction ArchiveBox is
        heading with the distributed/federated archiving on the roadmap, so I
        may end up setting up an instance like that even if I don't use it for
        personal content.
        
  HTML  [1]: https://readeck.org/
       
          venusenvy47 wrote 9 hours 45 min ago:
           I've been using the Single File extension to save self-contained HTML
           files of pages I want to keep for posterity. I like it because any
           browser can open the files it creates. Is it easy to view the
           archive files from Readeck? I haven't looked at fancier alternatives
           to my existing solution.
          
  HTML    [1]: https://addons.mozilla.org/en-US/firefox/addon/single-file/
       
            nikisweeting wrote 9 hours 29 min ago:
             SingleFile is excellent, and Gildas is a great developer. ArchiveBox
             has had SingleFile as one of its extractors built in for years :)
       
              gildas wrote 47 min ago:
              Thank you so much Niki :). The P2P sharing is a great idea. I
              really hope this feature will get things moving in the archiving
              field.
       
            nfriedly wrote 9 hours 37 min ago:
            I haven't looked at the on-disk format, I just use the browser
            interface. (It's fairly common for me to save something from my
            phone that I'll want to review on a computer later.)
            
            Here's an example of an Amazon "review" I recently archived that
            has instructions for using a USB tester I have: [1] And, for
            comparison, here's the original: [2] It'd be nice if I could edit
            out the extra junk near the top, but the important bits are all
            there.
            
  HTML      [1]: https://readeck.home.nfriedly.com/@b/tCngVjkSFOrCbwb9DnY2y...
  HTML      [2]: https://www.amazon.com/gp/customer-reviews/R3EF0QW6MAJ0VP
       
              ashildr wrote 9 hours 17 min ago:
               I was about to post a link to the same URL but archived using
               SingleFile, which looks like the original at Amazon. I didn't
               because I realized that I have absolutely no idea what additional
               information would be hidden in the file. In the worst case, any
               component sent by Amazon and archived into the file may contain
               PII, even if I am "logged out".
               
               I'm not saying that SingleFile is bad in any way, I'm using
               it a lot on multiple devices, but I'm not sure whether sharing
               archives is a good idea™.
       
                nikisweeting wrote 9 hours 7 min ago:
                100%, this is the challenge of archiving logged in content.
                
                It becomes un-shareable unless we use fake burner accounts for
                capture, or have really good sanitizing methods.
       
                  ashildr wrote 8 hours 48 min ago:
                   Even when I'm logged out I expect at least information on
                  my geographical location to seep into the archive via URLs
                  addressing specific CDN endpoints or similar mechanisms.
       
                    nikisweeting wrote 8 hours 33 min ago:
                    Yup, this is why the ArchiveBox browser extension sends
                    URLs to a separate server for archiving with an isolated
                    burner profile.
                    
                    I should write a full article on the security implications
                    at some point, there aren't many good top-down explanations
                    of why this is a hard problem.
       
                      ashildr wrote 2 hours 21 min ago:
                       I know it’s a lot of work, but this would be great and
                       it may give readers a deeper understanding of security
                       in general.
       
          nikisweeting wrote 10 hours 52 min ago:
           I love ArchiveTeam Warrior, it's such a good idea! We run several
           instances ourselves, and it's part of our Good Karma Kit for
           computers with spare capacity: [1] There are a bunch of other
           alternatives like Readeck listed on our wiki too; we encourage
           people to check it out!
          
  HTML    [1]: https://github.com/ArchiveBox/good-karma-kit
  HTML    [2]: https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...
       
        treyd wrote 11 hours 11 min ago:
        Is this a project that could be developed to support a distributed
        mirror of archive.org similar to how Anna's Archive works?
       
          nikisweeting wrote 11 hours 6 min ago:
           Yeah, that's what we're aiming for eventually, but with the addition
           of fine-grained permissions controls so you don't have to share
           everything 100% publicly; you can choose a subset.
          
  HTML    [1]: https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap
       
        toomuchtodo wrote 11 hours 12 min ago:
         [1] might be helpful. I'm a fan of the ability to create WARC archives
         from a target, upload the WARC files to object storage (whether that
        is IA, S3, Backblaze B2, etc), and then keep them in cold storage or
        serve them up via HTTPS or a torrent (mutable, preferred). The Internet
        Archive serves a torrent file for every item they host; one can do the
        same with WARC archives to enable a distributed archive. CDX indexes
        can be used for rapidly querying the underlying WARC archives.
        
        You might support cryptographically signing WARC archives; Wayback is
        particular about archive provenance and integrity, for example. [2]
        ("CDX Internet Archive Index File") [3] ("WARC, Web ARChive file
        format") [4] ("Wayback CDX Server API - BETA")
        
  HTML  [1]: https://github.com/ArchiveTeam/grab-site
  HTML  [2]: https://www.loc.gov/preservation/digital/formats/fdd/fdd000590...
  HTML  [3]: https://www.loc.gov/preservation/digital/formats/fdd/fdd000236...
  HTML  [4]: https://github.com/internetarchive/wayback/tree/master/wayback...
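        
         To illustrate the CDX lookup pattern, the public Wayback CDX API can
         be queried like this (the same general idea applies to CDX indexes
         generated over your own WARC files, only the endpoint differs):
        
           import json
           import urllib.request
        
           # ask the Wayback CDX server for captures of a URL (first 5 rows)
           url = ("https://web.archive.org/cdx/search/cdx"
                  "?url=example.com&output=json&limit=5")
           with urllib.request.urlopen(url) as resp:
               rows = json.load(resp)
        
           header, captures = rows[0], rows[1:]  # first row holds field names
           for row in captures:
               record = dict(zip(header, row))
               print(record["timestamp"], record["original"])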
       
          0cf8612b2e1e wrote 9 hours 55 min ago:
           > The Internet Archive serves a torrent file for every item they host
          
          I had no idea. I have found the IA serving speed to be pretty
          terrible. Are the torrents any better? Presumably the only ones
          seeding the files are IA themselves.
       
            pabs3 wrote 5 hours 25 min ago:
            The torrents have better speeds because they have WebSeeds for
            multiple IA servers, so you can download from multiple servers at
            once.
       
            toomuchtodo wrote 9 hours 22 min ago:
            The benefit is not in seeding speed directly from IA, but the
            potential for distributed access and seeding of the item. Think of
            it as a filename of a zip file in a flat distributed filesystem,
             with the ability to cherry-pick files that make up the item via
             traditional BitTorrent mechanisms. Anyone can consume each item via
            torrent, continue to seed, and then also access the underlying
            data. IA acts as the storage system of last resort (and the
            metadata index).
       
          pzmarzly wrote 10 hours 18 min ago:
          Can you recommend some tools to manage mutable torrents? I.e. create
          them, edit them, download them and keep them downloaded up to date.
          
          BTW I recently tried using IPFS for a mutable public storage bucket
          and that didn't go well - downloads were very slow compared to
          torrents, and IPNS update propagation took ages. Perhaps torrents
          will do the job.
       
            nikisweeting wrote 9 hours 22 min ago:
            My plan is to use a separate control plane for the
            discovery/announcements of changes, and torrents just for the data
            transfer. The specifics are still being actively discussed, and
            it's a few releases away anyway.
       
            Apocryphon wrote 9 hours 55 min ago:
             Man, looks like the first posts about IPFS cropped up on HN a
             decade ago. I remember seeing Neocities' announcement of support
             for it. I wonder if that protocol has gotten anywhere since then.
       
              jazzyjackson wrote 38 min ago:
               There has been a large effort extended by the Internet Archive to
               adopt IPFS through their partnership with Filecoin, but IME the
              basic problems of the protocol remain - slow egress, slow
              discovery, someone still has to serve the file over a gateway to
              normie HTTP users...
       
          nikisweeting wrote 11 hours 7 min ago:
          I recommend Browsertrix for WARC creation, I think they are the best
          currently available for WARC/WACZ.
          
           ArchiveBox is also gearing up to support real cryptographic signing
           of archives using [1] in an upcoming plugin (in a way that actually
           solves the TLS non-repudiation issue, which traditional "signing a
           WARC" does not; more info: [2] )
          
  HTML    [1]: https://tlsnotary.org/
  HTML    [2]: https://www.ndss-symposium.org/wp-content/uploads/2018/02/nd...
       
            digitaldragon wrote 4 hours 39 min ago:
            Unfortunately, Browsertrix relies on the Chrome Devtools Protocol,
            which strips transfer encoding (and possibly transforms the data in
            other ways). This results in Browsertrix writing noncompliant WARC
            files, because the spec requires that the original transfer
            encoding be preserved.
       
              ikreymer wrote 7 min ago:
              Unfortunately, there is not much we can do about
              transfer-encoding, but the data is otherwise exactly as is
              returned from the browser. Browsertrix uses the browser to create
              web archives, so users get an accurate representation of what
              they see in their browser, which is generally what people want
              from archives.
              
              We do the best we can with a limited standard that is difficult
              to modify. Archiving is always lossy, we try to reduce that as
              much as possible, but there are limits. People create web
              archives because they care about not losing their stuff online,
              not because they need an accurate record of transfer-encoding
              property in an HTTP connection. If storing the transfer-encoding
              is the most important thing, then yes, there are better tools for
              that.
       
            toomuchtodo wrote 10 hours 52 min ago:
             Keep in mind, what signing methodology you use is a function of who
             accepts it. If I can confirm "ArchiveTeam ripped this", that is
             superior to whatever TLSNotary is doing with MPC, blockchain,
             distributed ledger, whatever (in my use case). You have to trust
             someone at the end of the day. ArchiveTeam's Warrior doesn't use
             TLSNotary, for example, and rips entire sites just fine.
       
              nikisweeting wrote 10 hours 42 min ago:
              The idea with TLSNotary is that you can have several universities
               or central agencies running signing servers, but you don't have to
              share the cleartext content of your archives with them to get it
              signed.
              
              This dramatically changes what is possible with signing because
              previously to get ArchiveTeam's signature of approval, they would
              have to see the content themselves to archive it. With TLSNotary
              they can sign without needing to see the content/access the
              cookies/etc.
       
                viraptor wrote 8 hours 38 min ago:
                Isn't that already possible with any kind of notary by giving
                them a sha256 of the content only? Or am I missing some
                distinction?
       
                  nikisweeting wrote 7 hours 50 min ago:
                  You can do that but it proves nothing because TLS session
                  keys are symmetric, so the archiver can forge server
                  responses and falsely attest that the server sent them.
                  
                  Look up "TLS non repudiation"
                  
                   A real solution like TLSNotary involves a neutral, reputable
                   third party that can't see the cleartext attesting to the
                   ciphertext using a ZK proof.
                  
                  The neutral third party doing attestation can't see the
                  content so they can't easily tamper with it, and attempts to
                  tamper indiscriminately would be easily detected and ding
                  their reputation.
       
        the_gorilla wrote 11 hours 14 min ago:
        I don't know how anyone manages to use archivebox. I've tried it twice
        in the last 3 years and its site compatibility is bad, it quietly leaks
        everything you archive to archive.org by default, and whenever it fails
        on a download it stops archiving anything even after deleting and
        resubmitting all the jobs.
        
        I'm sure it works for some people, but not me.
       
          nikisweeting wrote 11 hours 8 min ago:
           These are legitimate gripes that have plagued specific past releases;
           I hear your frustration. Please keep in mind this was a solo effort
          of a single developer, only worked on in my spare time over the last
          7 years (up until very recently).
          
          The new v0.8 adds a BG queue specifically to deal with the issue of
          stalling when some sites fail. There was a system to do this in the
          past, but it was imperfect and mostly optimized for the docker setup
          where a scheduler is running `archivebox update` every few hours to
          retry failed URLs.
          
           Site compatibility is much improved with the new BETA, but it's a
           perpetual cat and mouse game to fix specific sites, which is why we
           think the new plugin system is the way forward. It's just not
           sustainable for a single company (really just me right now) to
           maintain hundreds of workarounds for each individual site. I'm also
           discussing with the Webrecorder and Archive.org teams how we can
           share these site-specific workarounds as cross-compatible plugins
           (aka "behaviors") between our various software.
          
          > it quietly leaks everything you archive to archive.org by default
          
          It's prominently mentioned many times (at least 4) on our homepage
          that this is the default, and archiving public-only sites (which are
          already fair game for Archive.org) is a default for good reason.
          Archiving private content requires several important changes and
          security considerations. More context:
          
  HTML    [1]: https://news.ycombinator.com/item?id=26866689
       
            the_gorilla wrote 10 hours 51 min ago:
             I can accept the other issues, but ArchiveBox needs to be private
             and secure by default.
            
             Sending everything to archive.org is a bad default value and it
             erodes a certain level of trust in the project. Requiring "several
             important changes and security considerations" just makes it a
             non-starter. The default settings should be "safe" for the default
             user, because as you mentioned in that post, 90% of users are never
             going to change them. Users should be able to run it locally and
             archive data without worrying about security issues, unless you
             only want experts to be able to use your software.
            
             There's also a contradiction between your statement and your blog
             post: someone saving their photos isn't going to want to worry about
             whether they configured your tool correctly, or about leaking all
             the group logs or grandma's photos.
            
            >It's prominently mentioned many times (at least 4) on our homepage
            that this is the default, and archiving public-only sites (which
            are already fair game for Archive.org) is a default for good
            reason. Archiving private content requires several important
            changes and security considerations. More context
            
            > Who cares about saving stuff?
            
            > All of us have content that we care about, that we want to see
            preserved, but privately:
            
            > families might want to preserve their photo albums off Facebook,
            Flickr, Instagram
            
            > individuals might want to save their bookmarks, social feeds, or
            chats from Signal/Discord
            
            > companies might want to save their internal documents, old sites,
            competitor analyses, etc.
            
            I want the project to do well but it really needs to be secure by
            default.
       
              Apocryphon wrote 9 hours 51 min ago:
              Perhaps this data is "private" as in "personal property" and not
              "private" as in "confidential."
       
                nikisweeting wrote 9 hours 24 min ago:
                It's intended for both but it currently requires extra setup to
                do "confidential" because there are security risks.
       
              hobs wrote 10 hours 48 min ago:
              As a custom tool built to archive stuff for archive.org, why
              would you expect that it can also do a completely opposite task,
              saving information privately?
              
              I can see why you would want such a tool, but it seems like a
              direct divergence from the core goal of the existing codebase.
       
                the_gorilla wrote 10 hours 42 min ago:
                [flagged]
       
                  dang wrote 7 hours 19 min ago:
                  We've banned this account for breaking the site guidelines.
                  Please don't create accounts to break HN's rules with.
                  
  HTML            [1]: https://news.ycombinator.com/newsguidelines.html
       
              nikisweeting wrote 10 hours 48 min ago:
              > The default settings should be "safe" for the default user,
              
               I 100% agree, but because private archiving is doable but NOT
               100% safe yet, I can't make that mode the default. The difficult
               reality currently is that archiving anything non-public is not
               simple to make safe.
              
               Every capture will contain reflected session cookies, usernames,
               PII, and other sensitive content. People don't understand
              that this means if they share a snapshot of one page they're
              potentially leaking their login credentials for an entire site.
              
               It is possible to do safely, and we provide ways to achieve that
               which I'm constantly working on improving, but until it's easy and
               straightforward and doesn't require any user education on
               security implications, I can't make it the default.
              
              The goal is to get it to the point where it CAN be the default,
              but I'm still at least 6mo away from that point. Check out the
              archivebox/sessions dir in the source code for a look at the
              development happening here.
              
               Until then, it requires some user education and setting up a
               dedicated Chrome profile + cookies + tweaking config to do (as
               an intentional barrier to entry for private archiving).
       
                bigiain wrote 5 hours 41 min ago:
                That's a really good response, thanks.
                
                I've been very impressed by all of your responses in here, but
                that one in particular shows empathy, compassion, and a deep
                deep subject matter expertise.
       
                  nikisweeting wrote 4 hours 28 min ago:
                  Thank you. And thank you for taking the time to read all of
                  it, there's a lot of great questions being asked.
       
            freedomben wrote 10 hours 59 min ago:
            Yeah, I'm not sure whether archive.org should be defaulted to on or
            off (I see both sides of that one), but its existence is definitely
            surfaced.
            
             I love ArchiveBox btw, thank you for your effort! It's filling a
             very important need.
       
        grinch5751 wrote 11 hours 15 min ago:
        This looks like a really wonderful set of developments. Already making
         plans to use an old laptop of mine as an ArchiveBox machine.
       
       