Preserving the History of the Present: Lessons from National Taiwan University Web Archiving System (NTUWAS)

Written by the team at NTUWAS.

Image credit: archive-01 by Geir Friestad/Flickr, license CC BY-NC-ND 2.0

When the 921 earthquake struck Taiwan in 1999, the government and civilians established many disaster relief websites to spread information or to carry out specific projects. As the recovery effort slowed after 2010, many of these organisations closed down and took their websites offline, taking with them the tales of community efforts, shared knowledge, and collective action. These websites are no longer accessible on the live web, but they are all accessible on Taiwan’s largest web archive, the National Taiwan University Web Archiving System (NTUWAS).

Websites will be an essential resource for historians of the 21st century. Still, it is only recently that there has been a concerted effort to preserve websites for posterity and to track and record an ever-changing digital landscape. The typical lifespan of a website is short, and important texts are easily lost for good if no one works to protect them. Because of this, web archive projects began to appear in the mid-1990s. The first major international initiative, the Internet Archive, was launched in 1996. To date, it has preserved more than 20 years of web history and collected 330 billion web pages, accessible through the Wayback Machine, which holds the world’s most extensive collection of preserved websites. In parallel with the establishment of the Internet Archive, several national web archiving projects were initiated. These include Australia’s Web Archive (PANDORA) and the UK’s Government Web Archive (UKWA), both established in 1996.

Aside from national institutions, many university libraries have also developed their own web archiving databases. Those projects mainly focus on collecting web pages related to each university, whether the university’s own website or pages on research topics that have academic significance for its faculty. NTUWAS was first conceived in 2006 by National Taiwan University Library. After a year of trials and interface adjustments, NTUWAS opened to the public at the beginning of 2008. Since its inception, the NTUWAS team has archived more than 10,000 websites and harvested more than 30,000 versions of them.

Because internet content is effectively inexhaustible, web archives have to be selective about what they record. Indeed, unlike traditional archives, the challenge for web archives is how to select precisely what material is worth keeping and how to find it amid the deep sea of content. NTUWAS has three components: an open-source web crawler, HTTrack, which crawls target websites; a back-end interface for downloading web pages and checking the downloaded results; and a front-end interface for users. The front-end interface offers display, classification, website search and full-text search features. It also has a recommendation feature that lets users suggest websites that need to be archived, making the archive more comprehensive.

The only material archived in full is the set of websites belonging to National Taiwan University. Beyond these sites, we try to select material that will be important to preserve. For example, we have harvested 143 versions of the “Office of the President” website, dating from 2007 to the present. The websites of various electoral campaigns can also be found in our collection. We have collected the websites of overseas Taiwanese associations and Taiwan’s new immigrant organisations, as well as non-profit organisations and charities. Our collection is organised around key themes (such as culture or politics), issues (such as biological diversity or digital research) and events (such as elections, referendums or the Olympic Games).

Although we believe that web archives can be an important tool for research in the digital humanities and beyond, we also think our collection can provide a new and fun way for people to connect to the recent past. For example, our tool “Map Navigation” displays sites of special geographic significance. It offers seven categories: Taiwanese Indigenous Township, Countries with Diplomatic Relations with Taiwan, Taiwan National Park and Scenic Area, Cross-Strait Economic and Trade Development, the August 8, 2009, Southern Taiwan floods, the 921 Chi-Chi Earthquake, and Potential World Heritage Sites in Taiwan. Through “Map Navigation,” you can also explore Taiwan’s distinctive and beautiful aboriginal culture across 55 indigenous town websites. In addition, our tool “Features” shows a random selection of interesting pictures from archived websites, giving an eye-catching visual entry point into the collection. If you click a picture, NTUWAS will direct you to the archived website. The top of the page shows the name, the URL, and the archive date of the website.

The 21st century is, of course, very much still ongoing. We have worked to preserve documents from defining events such as the 2003 SARS outbreak and the flooding that occurred in Southern Taiwan in 2009. Recently, we have been working on recording the digital history of the ongoing pandemic and have created a new event section titled “COVID-19”. The collection is still expanding: we have archived 79 websites so far and hope to locate more in the future. When the pandemic finally comes to an end, many sites will likely go offline. Still, they, and countless other websites, will remain alive in our archive and will be accessible to researchers and the general public for generations to come.

NTUWAS, or the National Taiwan University Web Archiving System, is an internet archive based out of National Taiwan University Library that has been running since 2008. This article is part of a special issue on digital humanities in Taiwan.

