Monday, June 28th, 2010 08:38 am
A lot of the time, we link to a website or an ongoing discussion rather than copying and pasting the info over onto Fanlore. But once a website or a link is dead, that data is lost and your Fanlore entry may lack context or key info.

Your best shot is to head over to the Internet Archive (Wayback Machine) and see if the website has been archived. But since the Wayback Machine crawls and archives randomly, you won't know if your citation can be resurrected until it is too late.
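If you'd rather script that check than do it by hand, the Internet Archive offers a simple availability lookup. Here is a minimal Python sketch, assuming the JSON shape the archive.org/wayback/available endpoint returns today (the endpoint and response format are the Archive's, not something described in this post):

    import json
    import urllib.parse
    import urllib.request

    def wayback_snapshot(url):
        """Return the closest Wayback Machine snapshot URL for `url`, or None."""
        query = urllib.parse.urlencode({"url": url})
        with urllib.request.urlopen("https://archive.org/wayback/available?" + query) as resp:
            data = json.load(resp)
        # The API returns {"archived_snapshots": {}} when nothing has been crawled.
        closest = data.get("archived_snapshots", {}).get("closest")
        return closest["url"] if closest and closest.get("available") else None

    print(wayback_snapshot("http://example.com/") or "not archived - cite it while you still can")

If it comes back empty, the page was never crawled, and a user-driven service is your only option.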

Enter: WebCite. A service designed for scholars to create a static snapshot of a website so that you can cite it (and the page contents) over a longer period. It is user driven - you have to submit the website link before the website goes down (when you're creating your Fanlore entry). It comes with a few caveats: it won't create a snapshot of pages that carry the 'no robots' code, it won't grab locked content, and if you're grabbing a page from an adult Livejournal community, all you may see is the 'Adults only' warning. And it is intended to be used in addition to the direct link to the website, not in place of it.
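To make the 'no robots' caveat concrete: if a site's robots.txt disallows crawlers, an archiving service will generally refuse to snapshot it, and you can test that yourself before bothering to submit. A rough Python sketch using only the standard library (the "WebCite" user-agent token here is a guess for illustration, not the service's actual crawler name):

    import urllib.parse
    import urllib.robotparser

    def robots_allow(url, agent="WebCite"):
        """Check whether the site's robots.txt lets `agent` fetch `url`.
        The agent token is an assumption; each archiver honours its own."""
        parts = urllib.parse.urlparse(url)
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
        rp.read()  # fetches and parses the site's live robots.txt
        return rp.can_fetch(agent, url)

    print(robots_allow("http://example.com/fannish/meta-post.html"))

Pages blocked this way (and locked or adult-gated content) simply won't produce a usable snapshot, so check before you rely on the citation.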

I've tested WebCite on the Professionals Fandom Timeline, which pulls the bulk of its content from a few key LJ threads. We have already laboriously copied the data over to Fanlore (with permission), but it seemed like a good test candidate.

I also used WebCite to create links to a Stargate Award website that is not currently in the Wayback Machine.

If you have used this service before, or know anything more about it, please drop a note. I think it will be particularly useful for blogs and forum posts, which are prone to vanish quickly. It comes with an easy-to-use bookmarklet that lets you cite a webpage with one click.

edited: I had a brief discussion with someone about WebCite in which they expressed discomfort with the use of this tool (and about whether aspects of the Fanlore project in general could be seen as a breach of fannish community mores/trust). So I'll toss out this narrower question: how does using WebCite differ from using the Wayback Machine/Internet Archive or Google as our citation sources? Both WebCite and the Wayback Machine use the same caching process, and both store the website snapshot on their servers. What I like about WebCite is that it is much more limited - it cites only the one page and does not scrape and archive the entire website (as the Wayback Machine does). This gives us better control over what we're citing to, makes certain we give proper credit to the source of the info, and grabs the smallest portion of material. In other words, it seems (to me) to be a better form of 'fair use'.

Thoughts? Input? Other ways of looking at the 'what to link, what to quote, what to cite' question? Is any use of any tool that caches a website (e.g. Google, Wayback Machine, LJ Seek, etc.) something to avoid? I realize there may not be a single or uniform opinion, but like Fanlore, I think that plural POVs are good.

edited to add: I have to keep in mind that Fandom - and Fanlore - is not operating in isolation. Scholars, other wikis, libraries, and historians are running into the same questions and evaluating the same tools. In fact, Wired had a recent article about the US and UK digital archives and their reliance on the Oakland Archive Policy of 2001. More here.

And... a recent Library Science article discussing yet another 'caching' service: Memento Web

And... links to legal articles on digital preservation and caching below.

Tuesday, June 29th, 2010 04:27 pm (UTC)
I'm somewhat bothered by the fact that the wiki on WebCite states this:
Rather than relying on a web crawler which archives pages in a "random" fashion, WebCite users who want to cite web pages in a scholarly article can initiate the archiving process.
whereas the reasoning of Field v. Google expressly found:
But when a user requests a Web page contained in the Google cache by clicking on a "Cached" link, it is the user, not Google, who creates and downloads a copy of the cached Web page. Google is passive in this process. Google's computers respond automatically to the user's request. Without the user's request, the copy would not be created and sent to the user, and the alleged infringement at issue in this case would not occur. The automated, non-volitional conduct by Google in response to a user's request does not constitute direct infringement under the Copyright Act.


The two operations are almost complete opposites of each other. Furthermore, the Google result also depends on an implied licence agreement, which depended on Field knowing that Google was going to crawl over his material and cache it. As I understand it, this isn't the case with WebCite; people make a choice whether or not to cache a page, and that choice is made on an item-by-item basis. Finally, the Court held that Google was entitled to the safe harbor provisions of the DMCA; again, I'm not sure these would be available to WebCite.

I'm not arguing that WebCite's activities couldn't qualify as fair use, but I don't think Field v. Google goes nearly as far to justify them as they seem to think it does.
Friday, July 2nd, 2010 07:19 am (UTC)
Wow, that's the weirdest view of the word "cache" ever. I am certain that Google stores the full text of indexed documents both in the inverted (searchable) index and in the original order, in order to serve snippets as well as 'view as HTML'. I suppose it could be argued that it's a potential document until rendered, but then pretty much all web pages fit that criterion.

Luckily, Google has enough money to buy fleets of lawyers, and in this they seem to be on the side of public access. Unlike the Google Libraries digitization project, where they want to charge for access...
Friday, July 2nd, 2010 07:39 am (UTC)
I think the reasoning in Field was flaky in the extreme; basically the judge seems to have seen a clear public policy reason to allow Google caching and did whatever he needed to reach that result. Which is why I suspect WebCite are putting themselves out on a limb by relying on it for a different business model.
Friday, July 2nd, 2010 01:02 pm (UTC)
I'm not a lawyer, but I think WebCite is illegal under German Copyright law, and the Wayback Machine as well.
Friday, July 2nd, 2010 05:33 pm (UTC)
The ruling says that publishing a website (and optimizing it for search engines) without a robots.txt implies consenting to being indexed in the Google Image Search. You can take back your consent, and then a search engine isn't allowed to show your work anymore. There's a passage in the ruling where they point out that images that are deleted from the indexed website disappear from the Google Search as soon as possible, too.

But WebCite (and the Archive) preserve those sites for when they get deleted, and I think that deleting a website probably counts as withdrawing consent.

Disclaimer: What I've found wrt WebCite's legality in Germany predates the BGH ruling (and is all in German. :() Generally most rulings wrt copyright used to stress the necessity of opt-in procedures instead of this opt-out, so we'll see how it goes.