But the twentieth century is in danger because of copyright. Can you talk about the materiality of scanning and the physical labour of those who scan? I have heard you say that those who love books are better at scanning them. Is there an inherent argument about loving and knowing what books are and the actual process of translating them into a new medium?
People who love books want to share them. We thought people would scan books for a few months and move to other jobs, but we were wrong. Many scanners have been working for us for over five years. Most of our scanners are college graduates — they just love books and want to see them live on.
[Figure: Lan Zhu, book scanner.]
[Figure: Terracotta archivists at the Internet Archive. This artwork is based on the terracotta army of Qin Shi Huang and is the Internet Archive's way of thanking archivists who have served at the organization for at least three years.]
Could scanning resolution become a class system of knowledge? I am afraid that our digitization will be selective, that only dominant languages, dominant cultures, and dominant points of view will be represented in the digital future.
If we are biased in our selection of what we bring to the next generation, then we are committing a crime that will never be forgiven. Fortunately, it only costs ten cents a page to digitize a book, so for a 300-page book, it would cost thirty dollars. We have only digitized so far 2. So we need a few hundred million dollars to really complete the job with books. It could be the legacy of a set of people who say this is the priority. It is a major opportunity for our generation.
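A quick sketch of the arithmetic behind these figures: the ten-cents-a-page rate and the thirty-dollar book are quoted above, while the ten-million-book collection size is purely an illustrative assumption, not a figure from the interview.

```python
# Back-of-the-envelope digitization costs using the figures quoted above.
# The per-page rate is from the interview; the total number of books is a
# purely illustrative assumption.

COST_PER_PAGE = 0.10              # ten cents a page
PAGES_PER_BOOK = 300              # gives the thirty dollars per book quoted above
ASSUMED_TOTAL_BOOKS = 10_000_000  # hypothetical target collection size

cost_per_book = COST_PER_PAGE * PAGES_PER_BOOK
total_cost = cost_per_book * ASSUMED_TOTAL_BOOKS

print(f"per book:   ${cost_per_book:,.2f}")   # $30.00
print(f"collection: ${total_cost:,.0f}")      # $300,000,000, i.e. 'a few hundred million'
```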
Is this discussed when deciding which copies to scan? Do holding institutions have a say? We turn to librarians and collectors to direct, and fund, what is scanned. Personally, I made sure the books written by my grandfather were scanned.
Currently, we are funding-limited. But given the funding, at ten cents a page, I believe we can get access to the collections in the great libraries. But then we must also preserve the physical versions. Fortunately that is getting cheaper as well — if most of the access is to the digital versions, then the physical versions can be stored more densely and safely.
We do not need to be deaccessioning as much as we have been because it is less expensive to save materials. The Internet Archive now has two large physical archives where we are storing over one million books, and tens of thousands of films, records, and microfilm reels. This can and should be done by everyone. But if others cannot store them, we hope they will think of us.
Unfortunately, the current technologies for charging are making for large central organizations that are dominating and massively limiting access. Comparing the open access journals and PLOS [Public Library of Science] with the commercial publishers makes me think that charging has the perverse effect of crippling access. We need new systems but, in the meantime, the best way to be read is to not charge each reader for access. What economic models do you think will be needed or can be implemented to sustain nineteenth-century digital archives and the labour underpinning them?
If it is important that such resources be open to all, who should pay? From a UK perspective, one wonders about the economy of this knowledge. Why do you think the US is so far ahead in terms of the digital archiving of our shared culture? Europe does not have a strong tradition of independent non-profit charities like the US does.
So many cultural entities are government entities in Europe, and governments are increasingly driven by corporate interests rather than by what could serve the general population.
Access drives preservation. If materials are not digitized then they will be largely forgotten and therefore physical preservation will be underfunded. You have described the Internet Archive project as the Library of Alexandria 2.0.
Is this really plausible? Is this really the ultimate aim of the Internet Archive, more so than the preservation of that knowledge? As this unfolds, we would like to preserve this knowledge and make sure it is accessible for centuries to come.
The early manuscripts at the Library of Alexandria were burned, much of early printing was not saved, and many early films were recycled for their silver content. The history of early materials of each medium is one of loss and eventual partial reconstruction through fragments. A group of entrepreneurs and engineers has determined not to let this happen to the early Internet. Even though the documents on the Internet are the easy documents to collect and archive, the average lifetime of a document is 75 days and then it is gone.
While the changing nature of the Internet brings a freshness and vitality, it also creates problems for historians and users alike. Where libraries serve this role for books and periodicals that are no longer sold or easily accessible, no such equivalent yet exists for digital information.
With the rise of the importance of digital information to the running of our society and culture, accompanied by the drop in costs for digital storage and access, these new digital libraries will soon take shape. The Internet Archive is one such new organization, collecting the public materials on the Internet to construct a digital library. The first step is to preserve the contents of this new medium. This collection will include all publicly accessible World Wide Web pages, the Gopher hierarchy, the Netnews bulletin board system, and downloadable software.
If the example of paper libraries is a guide, this new resource will offer insights into human endeavor and lead to the creation of new services. Never before has this rich a cultural artifact been so easily available for research. Where historians have only scattered club newsletters and fliers, physical diaries and letters from past epochs, the World Wide Web offers a substantial collection that is easy to gather, store, and sift through when compared to its paper antecedents.
Apart from historical and scholarly research uses, these digital archives might be able to help with some common infrastructure complaints. Where we can read the books printed by Gutenberg more than five hundred years ago, it is often difficult to read a 15-year-old computer disk. The Commission for Preservation and Access in Washington DC has been researching the thorny problems faced in trying to ensure the usability of digital data over a period of decades.
Where the Internet Archive will move the data to new media and new operating systems every 10 years, this only addresses part of the problem of preservation. Using the saved files in the future may require conversion to new file formats. Text, images, audio, and video are undergoing changes at different rates. Since the World Wide Web currently has most of its textual and image content in only a few formats, we hope that this content will be worth translating in the future, whereas we expect that short-lived or seldom-used formats will not be worth the future investment.
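As a rough sketch of the kind of format-migration judgement described here, assuming an illustrative list of common formats (the format names and record layout are not from the text):

```python
# A minimal sketch of the format-migration judgement described above:
# content in a handful of common formats is slated for translation to future
# formats, while rare or short-lived formats are kept only as raw bytes.
# The format list and record layout are illustrative assumptions.

COMMON_FORMATS = {"text/html", "text/plain", "image/gif", "image/jpeg"}

def migration_plan(stored_files):
    """stored_files: iterable of (path, mime_type) pairs."""
    translate, keep_raw = [], []
    for path, mime in stored_files:
        (translate if mime in COMMON_FORMATS else keep_raw).append(path)
    return translate, keep_raw

to_translate, as_is = migration_plan([
    ("site/index.html", "text/html"),
    ("site/world.vrml", "model/vrml"),   # a rarer format, preserved as raw bytes
])
print(to_translate, as_is)
```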
Saving the software to read discarded formats often poses problems of preserving or simulating the machines that they ran on. The physical security of the data must also be considered. Natural and political forces can destroy the data collected. Political ideologies change over time, making what was once legal become illegal. We are looking for partners in other geographic and national locations to provide a robust archive system over time.
To give some level of security from commercial forces that might want exclusive access to this archive, the data is donated to a special non-profit trust for long-term caretaking. This non-profit organization is endowed with enough money to perform the necessary maintenance on the storage media over the years. Packaging enough metadata about the information is necessary to inform future users.
Since we do not know what future researchers will be interested in, we are documenting the methods of collection and attempting to be complete in those collections.
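A minimal sketch of what such collection metadata might look like, assuming illustrative field names rather than an actual Internet Archive schema:

```python
# A minimal sketch of the collection metadata described above: recording how
# and when a document was gathered, so future researchers can judge what a
# snapshot does and does not contain. The field names are illustrative
# assumptions, not an Internet Archive schema.

capture_record = {
    "url": "http://example.com/index.html",
    "retrieved_at": "1996-10-02T14:30:00Z",   # when the crawler fetched it
    "crawler": "archive-crawler/0.1",          # which program gathered it
    "content_type": "text/html",
    "content_length": 18432,
    "collection_method": "breadth-first crawl from a seed list of known hosts",
}
print(capture_record["url"], capture_record["retrieved_at"])
```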
As researchers start to use these data, the methods and data recorded can be refined. Building the Internet Archive involves gathering, storing, and serving the terabytes of information that at some point were publicly accessible on the Internet. Gathering these distributed files requires computers to constantly probe the servers looking for new or updated files.
New systems for three-dimensional environments, chat facilities, and distributed software require new efforts to gather these files. Each of these systems requires special programs to probe and download appropriate files. Estimating the current size, turnover, and growth of the public Internet has proven tricky because of the dynamic nature of the systems being probed.
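A minimal sketch of such a gathering program, under the assumption of a simple breadth-first crawl; a real crawler adds politeness delays, per-format probes, and far better error handling:

```python
# A minimal sketch of the gathering loop described above: fetch a page, keep
# its contents, extract its links, and queue anything not yet seen.

import re
import urllib.request
from collections import deque

LINK_RE = re.compile(r'href="(http[^"]+)"', re.IGNORECASE)

def crawl(seed_urls, max_pages=100):
    seen = set(seed_urls)
    queue = deque(seed_urls)
    saved = {}
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                body = response.read()
        except OSError:
            continue                              # unreachable pages are skipped
        saved[url] = body                         # in practice, written to bulk storage
        for link in LINK_RE.findall(body.decode("utf-8", errors="replace")):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return saved
```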
The World Wide Web is vast, growing rapidly, and filled with transient information. Estimated at 50 million pages with the average page online for only 75 days, the turnover is considerable. Furthermore, the number of pages is reported to be doubling every year. Using the average web page size of 30 kilobytes, including graphics, brings the current size of the Web to 1.5 terabytes. This crawling is the technique that search engines, such as AltaVista, use to create their indices to the World Wide Web. The Internet Archive currently holds GB of information of all types.
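Worked through, the figures quoted above give the estimate directly:

```python
# The size estimate above, worked out: 50 million pages at an average of
# 30 kilobytes per page, both figures quoted in the text.

pages = 50_000_000
average_page_bytes = 30 * 1000           # 30 kilobytes, including graphics

total_bytes = pages * average_page_bytes
print(total_bytes / 1e12, "terabytes")   # 1.5 terabytes
```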
In time we will have collected a snapshot of the documents and images. Much of the data is restricted by the publisher, or stored in databases that are accessible through the World Wide Web but are not available to simple crawlers.
Other documents might have been inappropriate to collect in the first place, so authors can mark files or sites to indicate that crawlers are not welcome.
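This marking corresponds to the robots-exclusion convention; a crawler might honour it along the following lines, with placeholder URLs and user-agent name:

```python
# An archiving crawler can honour the "crawlers are not welcome" marking
# described above by checking a site's robots.txt before fetching.
# The URLs and user-agent name here are placeholders.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()                                         # fetch and parse the exclusion rules

if rp.can_fetch("archive-crawler", "http://example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("the site asked crawlers to stay out")
```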
Thus the collected Web will be able to give a feel of what the Web looked like at a particular time, but will not simulate the full online environment. While the current sizes are large, the Internet is continuing to grow rapidly. At some point we will have to become more selective about what data will be of the most value in the future, but currently we can afford to gather it all. Crucial to archiving the Internet, and digital libraries in general, is the cost-effective storage of terabytes of data while still allowing timely access.
Since the cost of storage has been dropping rapidly, the cost of archiving is dropping as well. The flip side, of course, is that people are making more information available. To stay ahead of this onslaught of text, images, and soon video information, we believe we have to store the information for much less money than the original producers paid for their storage.
It would be impractical to spend as much on our storage as everyone else combined. With these prices, we chose hard disk storage for the small amount of frequently accessed data, combined with tape jukeboxes for the rest. In most applications we expect a small amount of information to be accessed much more frequently than the rest, which leverages the faster disk technology rather than the tape jukebox.
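A minimal sketch of that two-tier arrangement, with in-memory dictionaries standing in for real disk and tape back ends:

```python
# A sketch of the two-tier storage described above: a small, hot working set
# on disk, the bulk on slower tape jukeboxes. The Tier class is a stand-in
# for real storage back ends.

class Tier:
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def put(self, key, value):
        self.data[key] = value

class TieredStore:
    def __init__(self):
        self.disk = Tier()   # fast, small: frequently accessed items
        self.tape = Tier()   # slow, large: everything else

    def read(self, key):
        value = self.disk.get(key)         # try the fast tier first
        if value is None:
            value = self.tape.get(key)     # fall back to the jukebox
            if value is not None:
                self.disk.put(key, value)  # promote hot items to disk
        return value
```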
After gathering and storing the public contents of the Internet, what services would then be of greatest value with such a repository? While it is impossible to be certain, digital versions of paper services might prove useful. This is similar to one of the roles of a library.

Most notably, perhaps, Kahle is founder and director of the Internet Archive, a free service which steadfastly archives World Wide Web documents, even as they are plowed over by breakneck trends in commerce, culture and politics.
On his Wayback Machine, you can view pages as they appeared in web antiquity -- say, Yahoo! As a member of the Board of Directors of the Electronic Frontier Foundation, he works to keep such information free and reachable. He's been pushing protocols for the benefit of humanity.