How to Create a Web Archive With Archivebox

Archivebox is an archival program that allows you to create an accurate snapshot of any website. This can be helpful for archivists and users who want to preserve information online. It is also flexible: you can run the program both as a command line tool and as a web app that you can access from anywhere.

Why Should You Archive Websites?

Over the years, the World Wide Web enabled individuals across the globe to easily share and communicate information with each other. One issue with the Web, however, is that websites do not hold up over time.

Install Archivebox Linux 02 Old Geocities Website
Image source: web.archive.org

Most websites only stay active for around two to five years. After that, they either go offline completely or are replaced by a different website altogether. For example, few websites from the 1990s are still online today.

Install Archivebox Linux 03 Old Website Sample
Image source: cameronsworld.net

Tip: if you do not want to host your own archive, you can also use the Wayback Machine to archive websites, with no installation required.

Archivebox’s Requirements

Before you can install Archivebox, you need to make sure that you have the following resources:

  • A machine that you can access from outside your home network. This can either be a machine at home that you can port-forward or a rented remote VPS.
  • Your machine needs to have an adequate amount of storage space. In most cases, a 1TB disk should be able to store between 100,000 and 1,000,000 individual webpages.
  • Your machine’s filesystem needs to either be EXT4 or ZFS for Archivebox to work properly.
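The filesystem and storage requirements above can be checked before you install anything. The following is a minimal sketch, assuming the GNU version of df that ships with Ubuntu; the helper names fs_is_recommended and report_target are hypothetical and are not part of Archivebox.

```shell
# Hypothetical helper: succeeds only for the filesystems named in the list above.
fs_is_recommended() {
    case "$1" in
        ext4|zfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print the filesystem type and free space of a directory.
# Assumption: df supports the GNU --output option.
report_target() {
    fstype=$(df --output=fstype "$1" | tail -n 1 | tr -d ' ')
    avail=$(df -h --output=avail "$1" | tail -n 1 | tr -d ' ')
    if fs_is_recommended "$fstype"; then
        echo "$1: $fstype, $avail free"
    else
        echo "$1: $fstype is not EXT4 or ZFS, $avail free"
    fi
}

# Example call (uncomment to run): report_target "$HOME"
```

Point report_target at the disk or directory where you plan to keep the archive rather than at your home directory if the two differ.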

Note: this tutorial focuses on installing and configuring Archivebox on a local Ubuntu 22.04 LTS machine.

Installing Archivebox

First, install the program’s dependencies. Open a terminal and type the following commands:

sudo apt install python3 nodejs python3-pip nginx npm
npm install --no-audit --no-fund 'git+https://github.com/gildas-lormeau/SingleFile.git'
npm install --no-audit --no-fund 'git+https://github.com/ArchiveBox/readability-extractor.git'
npm install --no-audit --no-fund '@postlight/mercury-parser'

Install Archivebox through Python PIP:

pip3 install archivebox
PATH=$PATH:/home/$USER/.local/bin
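The PATH assignment above only lasts for the current shell session, and running it more than once appends the same directory again. A small sketch that adds the directory only when it is missing, assuming pip placed the archivebox binary in ~/.local/bin:

```shell
# Add ~/.local/bin to PATH only if it is not already listed.
case ":$PATH:" in
    *":$HOME/.local/bin:"*) ;;                   # already present, nothing to do
    *) export PATH="$PATH:$HOME/.local/bin" ;;   # append for this session
esac
```

Placing the same lines in your ~/.profile is one way to keep the change across logins, although that depends on how your login shell is set up.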

Next, create a folder where Archivebox will save all of its data. In my case, I am creating an “abox-data” directory inside my home directory, “/home/archivebox”:

mkdir /home/$USER/abox-data && cd /home/$USER/abox-data

Finish setting up your Archivebox instance by running the following command from inside that folder. It initializes the data folder and downloads and configures the dependencies that the program needs to run on your machine.

archivebox init --setup
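The initialization command acts on whatever directory you are currently in, so it is worth guarding against running it in the wrong place or a second time. The sketch below is one way to do that; it assumes that an initialized data folder contains an index.sqlite3 file, and the function name abox_init_once is hypothetical.

```shell
# Hypothetical helper: initialize a data folder only if it has not been set up yet.
# Assumption: an initialized folder contains an index.sqlite3 file.
abox_init_once() {
    dir=$1
    mkdir -p "$dir" || return 1
    if [ -f "$dir/index.sqlite3" ]; then
        echo "already initialized: $dir"
        return 0
    fi
    if ! command -v archivebox >/dev/null 2>&1; then
        echo "archivebox not found on PATH" >&2
        return 127
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    (cd "$dir" && archivebox init --setup)
}
```

For the folder used in this tutorial, the call would be abox_init_once "/home/$USER/abox-data".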

You will be asked for the details of the first user.

Check whether you have installed Archivebox properly by running:

archivebox --version
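If the version check fails, it can help to see which of the installed tools your shell can actually find. The following is a rough sketch with a hypothetical helper name; the command names passed to it in the usage note are assumptions about what the packages installed earlier provide.

```shell
# Hypothetical helper: report whether each named command is on PATH.
report_commands() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found:   $cmd"
        else
            echo "missing: $cmd"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}
```

A call such as report_commands python3 pip3 node npm nginx archivebox lists each tool on its own line. A "missing: archivebox" line usually points back to the PATH step above.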

Preparing the Web GUI

While Archivebox is perfectly usable as a command line utility, it is also possible to access the program through a web interface. This is useful if you want to either share Archivebox with other users or access the program outside your server.

To host a web GUI, you need to create an Nginx reverse proxy that forwards incoming web traffic to the Archivebox daemon.

Create a new Nginx configuration file:

sudo nano /etc/nginx/sites-available/archivebox

Copy and paste the following code, changing server_name to your own domain name:

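server {
    # Listen for plain HTTP on IPv4 and IPv6.
    listen 80;
    listen [::]:80;

    root /home/archivebox/abox-data;

    server_name yetanotherarchivebox.xyz www.yetanotherarchivebox.xyz;

    # Forward every request to the Archivebox daemon on port 8000.
    location / {
        proxy_pass http://127.0.0.1:8000;
    }
}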

Enable the Archivebox configuration:

sudo ln -s /etc/nginx/sites-available/archivebox /etc/nginx/sites-enabled/
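A syntax error in the new file will keep Nginx from starting again. One cautious approach is to test the configuration first; the sketch below uses nginx -t for that and only reloads the service when the test passes. The function name is hypothetical.

```shell
# Hypothetical helper: reload Nginx only if its configuration passes the syntax test.
check_and_reload_nginx() {
    if ! command -v nginx >/dev/null 2>&1; then
        echo "nginx not found on PATH" >&2
        return 127
    fi
    # nginx -t parses the configuration files and reports errors without applying them.
    sudo nginx -t && sudo systemctl reload nginx
}
```

If the test reports an error, the message names the file and line to fix, and the running Nginx instance is left untouched.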

Restart Nginx, then start the Archivebox daemon from inside your data directory:

sudo systemctl restart nginx
archivebox server 0.0.0.0:8000
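The daemon keeps running in the terminal where you started it. From a second terminal, you can check whether it answers before testing it through your domain. This is a minimal sketch that assumes curl is installed, which the steps above do not cover; the function name abox_is_up is hypothetical.

```shell
# Hypothetical helper: succeed if an HTTP server answers at the given URL.
abox_is_up() {
    # -s silences progress output, -o discards the body, --max-time bounds the wait.
    curl -s -o /dev/null --max-time 5 "$1"
}
```

For example, abox_is_up http://127.0.0.1:8000/ && echo "daemon is answering" checks the daemon directly, while pointing it at your domain name checks the path through Nginx.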

Archiving Your First Website

Open your web browser and access the Archivebox instance through your domain name. In my case, I am going to “yetanotherarchivebox.xyz.”

Click the “LOG IN” button in the webpage’s upper-right corner.

Enter your user credentials to log in to the utility.

Archive your first website by pressing the “Add” button on the page’s upper bar.

This will load a large dialog box, where you can add a list of web links that you would like to archive. In my case, I am adding “https://maketecheasier.com.”

Next, you can choose a variety of options to archive your website. For example, you can provide a set of tags for your links to sort them properly.

Further, you can tell Archivebox to also save every page that the page you are archiving links to directly. This is useful in cases where you want to preserve the context of a website.

Click the “Add URLs and Archive” button to start the archiving process. In most cases, this should only take between one and two minutes.

Archiving a Website Using the Command Line

To archive a webpage from the command line, run the following commands. The --depth=1 option also archives every page that the webpage links to directly:

cd /home/$USER/abox-data
archivebox add --depth=1 https://maketecheasier.com

You can also use the add subcommand to archive a list of web links. For example, running the following command will tell Archivebox to save every link in my “bookmarks.txt” file:

archivebox add < /home/$USER/bookmarks.txt
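Bookmark files collected over time often contain notes, blank lines and repeated links. As an optional step, the sketch below keeps only lines that start with http:// or https:// and drops duplicates before the list reaches Archivebox. The helper name clean_url_list is hypothetical.

```shell
# Hypothetical helper: keep only http(s) lines from a file and drop duplicates.
clean_url_list() {
    grep -E '^https?://' "$1" | sort -u
}
```

Since the add subcommand reads links from standard input, the cleaned list can be piped straight into it: clean_url_list /home/$USER/bookmarks.txt | archivebox add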

Lastly, it is also possible to create a self-contained archive of a single webpage. To do this, run the following command:

archivebox oneshot https://maketecheasier.com

Customizing Archivebox

You can also customize how Archivebox obtains the pages that it saves. For example, it is possible to save only a screenshot of every web page that you archive.

This is helpful for users who want to save disk space while storing websites. To disable the following formats, run these commands from inside your data directory:

archivebox config --set SAVE_WGET=False
archivebox config --set SAVE_WARC=False
archivebox config --set SAVE_PDF=False
archivebox config --set SAVE_SINGLEFILE=False
archivebox config --set SAVE_READABILITY=False
archivebox config --set SAVE_MERCURY=False
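The six commands above differ only in the setting name, so they can be expressed as a loop. This is a sketch with hypothetical helper names; it uses only the setting names shown above and needs to be run from inside the data directory.

```shell
# Hypothetical helper: print one KEY=False line per output format disabled above.
screenshot_only_settings() {
    for key in SAVE_WGET SAVE_WARC SAVE_PDF SAVE_SINGLEFILE SAVE_READABILITY SAVE_MERCURY; do
        echo "$key=False"
    done
}

# Hypothetical helper: apply each setting with the same command used above.
apply_screenshot_only() {
    for setting in $(screenshot_only_settings); do
        archivebox config --set "$setting"
    done
}
```

Running screenshot_only_settings on its own prints the list without changing anything, which makes it easy to review before calling apply_screenshot_only.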

Adding a New User in Archivebox

To add a new user, go back to the web GUI and click the “ADMIN” button on the page’s upper bar.

Once inside the Admin Panel, go to the “Authentication and Authorization” category and select “Users.”

This will list all the active users in the system. Select the “Add User +” button in the page’s upper-right corner.

The user creation form in Archivebox can look complicated. Despite that, a new user only requires three things to function properly: a username, a password and a set of user permissions.

To create a new user, first provide a password.

After that, select the user permissions for that particular user. In most cases, you only need to toggle the following options for a regular user:

core | archive result | Can add archive result
core | archive result | Can change archive result
core | archive result | Can view archive result
core | snapshot | Can add snapshot
core | snapshot | Can change snapshot
core | snapshot | Can view snapshot
core | tag | Can add Tag
core | tag | Can change Tag
core | tag | Can view Tag
sessions | session | Can add session
sessions | session | Can change session
sessions | session | Can view session

Provide a username for the new user account. In my case, I am using the name “alice.”

Lastly, select the “SAVE” button in the page’s lower-right corner to apply your changes.

Frequently Asked Questions

How can I solve a "Failed to install Python packages" error?

This happens due to a bug in Archivebox that prevents it from finding the binaries it is looking for. Despite that, this error only affects a minor part of the program and will not damage the integrity of your archive.

One way to mitigate this issue is by making sure that your installation is always up to date. Do that by running pip3 install --upgrade archivebox.

How can I fix the "HTTPSConnectionPool" error whenever I save a website?

This error happens whenever a website does not have a valid HTTPS version. Fix this issue by forcing Archivebox to archive through HTTP. For example, running archivebox add http://insecurewebsite.com will force the program to use HTTP.
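When the failing link comes from a script or a list, the scheme can be rewritten before the link is passed to Archivebox. A minimal sketch, with a hypothetical helper name:

```shell
# Hypothetical helper: rewrite a leading https:// to http:// and leave other URLs alone.
to_http() {
    printf '%s\n' "$1" | sed 's#^https://#http://#'
}
```

For example, archivebox add "$(to_http https://insecurewebsite.com)" passes the HTTP version of the link to the add subcommand.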

What can I do when my new user account cannot archive a website?

This issue is most likely due to a missing permission setting on your new user account. One way to quickly fix this issue is by making sure that your new user account has the core | snapshot | Can add snapshot permission.

Image credit: Unsplash. All alterations and screenshots by Ramces Red.

Ramces Red - Staff Writer

Ramces is a technology writer who has lived with computers all his life. A prolific reader and a student of Anthropology, he is an eccentric character who writes articles about Linux and anything *nix.