ScriptWritingPeople

(OP == AngelaRayner)
Dear ScriptWritingPeople on the Wiki,

Several of the academic institutions that I belong or have belonged to stock copies of journals in their libraries. However, I'm frequently away from the institutions, and not all of them have online subscriptions. Since the information would be available were I in the right geographical location, it's somewhat strange that I can't get hold of it online. I argue that it wouldn't constitute theft to obtain it, since it's already been obtained by my institution. One could counter-argue that it would constitute theft, because I'm trying to get hold of online material, and what the institution pays the journal publisher for is technical site upkeep, rather than content. One of my previous institutions was more computer literate than one of my current ones, and thus subscribed to more online journals, so I'm more stymied at the moment. I'm also not sure that, if you receive the paper copy, it /necessarily/ costs anything to subscribe online, although that depends on the publisher.

Now, due to frequent and obsess<ahem> googling, I've discovered that if you search for a combination of words, Google will sometimes find that combination inside an article. For example, I googled for: vanhoozer fish hauerwas. I wanted to find articles that mentioned the fight between Vanhoozer and Fish, possibly with comment from Hauerwas. This threw up this link:
http://www.blackwell-synergy.com/links/doi/10.1111/1468-0025.00125/abs/

Now, if you type the search combo into Google, you will be presented with this link, along with some of the text that I was googling for. In this instance, I received the lines "about at one point (p. 38) —that Hauerwas’ vitally important ... find aspects of Derrida, Rorty or Fish compelling will find that Vanhoozer inade- quately", which confirmed to me that there is indeed some debate out there in an academic journal. The problem is that the journal is - you guessed it - accessible in the library, but not from home. It's a real pest to have to wander to the library every time I have a vague wondering about some authors. If you click on the link, because the journal is inaccessible, the page does not show any of the text in which the search terms were found.

So, I had an idea: take a few words from the end of the snippet, Google for them, and see whether the result shows a little more of the sentence. It does:

Googling for "will find that Vanhoozer" gave me: "I suspect that those who find aspects of Derrida, Rorty or Fish compelling will find that Vanhoozer inade- quately represents the views of their favorite ...".

Googling for "quately represents the views of their favorite" gave me: "those who find aspects of Derrida, Rorty or Fish compelling will find that Vanhoozer inade- quately represents the views of their favorite, tending instead ...".

Googling for "views of their favorite tending" gave me: "aspects of Derrida, Rorty or Fish compelling will find that Vanhoozer inade- quately represents the views of their favorite, tending instead toward a melange ...".

Googling for "instead toward a melange" gave me: "... of Derrida, Rorty or Fish compelling will find that Vanhoozer inade- quately represents the views of their favorite, tending instead toward a melange of stock ...".

Googling for "a melange of stock" gave me: "Fish compelling will find that Vanhoozer inade- quately represents the views of their favorite, tending instead toward a melange of stock characterizations of ...".

Googling for "a melange of stock characterizations of" gave me: "... will find that Vanhoozer inade- quately represents the views of their favorite, tending instead toward a melange of stock characterizations of “postmodernism ...".

And so the search goes on. I slowly receive the sentence a bit at a time. Now, I suppose that I could work out a better optimum for the phrase length, but fewer words tend to produce Google results that have nothing to do with the Blackwell journal, so grabbing a longer phrase is better. I wonder whether the longer phrase gives me less "new" text, but I've not tried that much - it's a frustrating way of getting to the end of a sentence. Now then, I have several questions.

Questions

1. Is this procedure theft? I have offline access to the material, but not online access. Further, the material is freely available online; it's just time-consuming to obtain.

: Ethically - you're the theologian ;) Legally - grey area. The whole business of internet publishing is fraught with uncertainty about precisely what implicit licence you grant others to copy your work for the purpose of viewing it.

2. If we conclude it does not constitute theft, is it possible for somebody to write a script that grabs the "new words", combines them with three of the old words, and produces the entire article (at least from where I found the search words) for me, with minimal hassle on my part?
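: The "grab the new words" step can be sketched without touching Google at all. What follows is a minimal Python sketch, assuming the search snippet has already been fetched as a plain string; the function names `words` and `new_words` and the default three-word overlap are illustrative choices made here, not part of any search API.

```python
import re

def words(text):
    """Split text into alphanumeric words, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

def new_words(known, snippet, overlap=3):
    """Return the words of `snippet` that follow the last `overlap` words of `known`.

    Matching ignores case. An empty list means the tail of `known` was not
    found in the snippet, or that nothing follows it.
    """
    tail = [w.lower() for w in words(known)[-overlap:]]
    snip = words(snippet)
    lowered = [w.lower() for w in snip]
    if not tail:
        return []
    for i in range(len(lowered) - len(tail) + 1):
        if lowered[i:i + len(tail)] == tail:
            return snip[i + len(tail):]
    return []

# The first step from the page: three "old" words reveal eight new ones.
known = "will find that Vanhoozer"
snippet = ("I suspect that those who find aspects of Derrida, Rorty or Fish "
           "compelling will find that Vanhoozer inade- quately represents "
           "the views of their favorite ...")
print(" ".join(new_words(known, snippet)))
```

A full script would loop: append the new words to the known text, search for its last few words as a quoted phrase, and stop when `new_words` comes back empty.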

There's also the interesting issue that, since I've turned this into a Wiki question, some people from the journal may happen upon it and remove the option to locate words within an article you don't own. I suspect they won't do this, because they rely on searches throwing up their articles, and then want people to pay for the article per copy (usually expensive, though it depends on the journal). Basically, they rely on the fact that users can obtain some of the article, because it might lead the reader to follow up. Just an interesting point... --AR

Update: I also can't get beyond the word "postmodernism". It's just not giving me any of the next sentence. Maybe they have built in thwart points to stop people like me :) On the other hand, it might just be the end of the article.

Have you tried looking at Google's cached copy? --TheInquisitor

Sorry, ignore that - I hadn't realised it wasn't the top hit. I am rather curious as to how their spider had a subscription to the journal, though. --TI

(PeterTaylor) It's quite common to allow Google privileges which ordinary people don't have. This leads me to wonder whether the article might be obtained by setting the User-Agent header to Googlebot. My attempts to use telnet to get their webpage are unsuccessful, though - maybe they don't like HTTP/1.0, and I can't be bothered to work out a complete HTTP/1.1 request.
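: For reference, a complete HTTP/1.1 request need not be long: the one header HTTP/1.1 requires is Host. Here is a Python sketch that only builds the request text, so it can be pasted into telnet or sent over a socket. `build_request` is a name made up for this example, and whether the journal site would answer such a request any differently is untested.

```python
def build_request(host, path="/", user_agent=None):
    """Build a minimal HTTP/1.1 GET request as bytes.

    Host is mandatory in HTTP/1.1. "Connection: close" asks the server to
    close the connection after the response, which suits one-off requests.
    """
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
    ]
    if user_agent:
        lines.append(f"User-Agent: {user_agent}")
    # Each header line ends with CRLF; an empty line ends the header block.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

req = build_request(
    "www.blackwell-synergy.com",
    "/links/doi/10.1111/1468-0025.00125/abs/",
    user_agent="Googlebot/2.1 (+http://www.google.com/bot.html)",
)
print(req.decode("ascii"))
```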

That would be theft, surely - as you're obtaining it by illicit means ( deception )... -- Xarak

: (PeterTaylor) It's certainly not [theft]. I'm not sure whether or not there's deception involved. It's well-known that some browsers allow the user to fake their User-Agent in order to view pages which wouldn't otherwise display (e.g. because the author thinks that IE is the only browser that can display them).

I would imagine they check for the alleged bot coming from Google's IP space. I know *I* would. - MoonShadow
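: A sketch of how a site might make such a check, in Python, assuming the reverse-then-forward DNS lookup that Google describes for verifying its crawler: look up the hostname for the IP, require a googlebot.com or google.com name, then confirm that the name resolves back to the same IP. The function name, the injectable `reverse` and `forward` resolvers and the lookup tables are illustrative; the tables are made-up data so that the example runs without live DNS.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_crawler(ip, reverse=None, forward=None):
    """Return True if `ip` reverse-resolves to a Google name that resolves back to `ip`."""
    # Default to real DNS lookups; callers may inject their own resolvers.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda name: socket.gethostbyname_ex(name)[2])
    try:
        host = reverse(ip).lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup defeats anyone who controls their own reverse DNS.
        return ip in forward(host)
    except OSError:
        # Lookup failures (socket.herror, socket.gaierror) count as "not Google".
        return False

# Illustrative lookup tables standing in for DNS.
rdns = {
    "66.249.66.1": "crawl-66-249-66-1.googlebot.com",
    "203.0.113.9": "host.example.net",
    "198.51.100.7": "crawl-66-249-66-1.googlebot.com",  # spoofed reverse record
}
dns = {"crawl-66-249-66-1.googlebot.com": ["66.249.66.1"]}

for ip in rdns:
    print(ip, is_google_crawler(ip, rdns.__getitem__, dns.__getitem__))
```

A faked User-Agent header passes none of this, since the check looks only at where the connection comes from.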

: (Untried) information I have from locked lj posts suggests that in the general case they don't... -- Senji

Sounds like a DNA sequencing algorithm!
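: To make the analogy concrete: sequence assembly stitches overlapping fragments into one long sequence, and the same greedy idea works on overlapping search snippets. This is a toy Python sketch, assuming fragments are whitespace-separated words with punctuation already stripped; `overlap` and `assemble` are names invented here, and real assemblers are far more careful about repeats and errors.

```python
def overlap(a, b):
    """Length of the longest suffix of word list `a` that is a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def assemble(fragments, min_overlap=2):
    """Greedily merge the best-overlapping pair until no pair overlaps enough."""
    frags = [f.split() for f in fragments]
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n < min_overlap:
            break
        # Join the two fragments, keeping the shared words only once.
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return [" ".join(f) for f in frags]

# Three snippet fragments from the page, deliberately out of order.
pieces = [
    "toward a melange of stock characterizations of postmodernism",
    "will find that Vanhoozer inade quately represents the views",
    "the views of their favorite tending instead toward a melange",
]
print(assemble(pieces))
```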

Since you can only get a small amount of additional text, this doesn't seem like a significant information leak. In fact, it's almost as if you're just getting a slightly larger abstract that's no longer available, but that Google still has fragments of in their database if not their cache. -- airmail.net

: "Since you can only get a small amount of additional text" - is this true? The impression I got was that the entire original was in the database, and could be retrieved by automated piecing together of search results. Hm. I think a proof-of-concept might be in order.. - MoonShadow

Script below (uses perl, links) takes an initial sentence on the command line. Sample session:

sl236@debian:~/work/perl$ perl retrieve.pl 'of Derrida Rorty or Fish compelling will'
of Derrida Rorty or Fish compelling will find that Vanhoozer inade quately represents the views of their favorite tending instead toward a melange of stock characterizations of postmodernism 
sl236@debian:~/work/perl$

Searching for the last few words confirms that no further text is revealed. The poster from airmail.net is quite right, then ^^;

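#! /usr/local/bin/perl -w
use strict;

# Avoid leaving zombie 'links' processes behind.
$SIG{CHLD}='IGNORE';
my $max = 7;   # search on at most $max + 1 words
my $lq = '';   # all the words retrieved so far, joined with '+'

# Search Google for the last few words of the text so far, as a quoted
# phrase, and return whatever follows $q in the result snippet.
# Returns undef when the snippet reveals nothing new.
sub fetch
{
  my($q) = @_;
  $q=~s/[^a-zA-Z0-9_]/ /g;    # punctuation to spaces
  my $oq = $q;                # plain-text copy, matched against the results
  $q=~s/ +/+/g;               # query form: words joined with '+'
  $q = $lq . $q;
  $lq = $q;
  $oq =~ s/ +/ /g;

  {
   # Drop empty fields, then keep only the last $max + 1 words.
   my @q = grep { length } split(/\+/,$q);
   if($#q > $max)
   {
     splice @q, 0, $#q - $max;
   }
   $q = join('+',@q);
  }

  # %22 is a URL-encoded double quote, making this a phrase search.
  open(my $kid, "-|", "links", '-dump',
       'http://www.google.co.uk/search?q=%22' . $q . '%22')
    or die "cannot start links: $!";
  my $result = join ('', <$kid>);
  close $kid;

  # Normalise the page like the query: hyphens to spaces, each "..." to
  # a single '-' marking a snippet boundary, and everything else that is
  # not alphanumeric to spaces.
  $result=~s/-/ /g;
  $result=~s/\.\.\./-/g;
  $result=~s/[^a-zA-Z0-9_-]/ /g;
  $result=~s/ +/ /g;

  # Capture the text between the known words and the next "...".
  if($result =~ /\-.+?\Q$oq\E(.+?)\-/)
  {
    return $1;
  }
  return undef;
}


my $z = $ARGV[0];
die "usage: $0 'initial words'\n" unless defined $z;
print $z;
# Keep extending the text until a search reveals nothing new.
while(
     defined( $z = fetch ($z) )
  )
{
  print $z;
};
print "\n";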

Poster from arcor-ip.net writes:

Very interesting story. I can't help very much in getting the original article from Blackwell-Synergy, though I tried several ways to get the PDF document that Google saw.

But here is my bit:
I googled for 'google user-agent googlebot' and found in the result the link to http://www.webmasterworld.com/forum3/9213.htm
When I clicked on it, the site asked me to log in to get the article, but I don't want to subscribe to this service.

Then I fired up my good old friend wget:

wget http://www.webmasterworld.com/robots.txt

which got me a "403: Forbidden"!

But pretending to be Google worked:

wget -U "Googlebot/2.1 (+http://www.google.com/bot.html)" "http://www.webmasterworld.com/robots.txt"

Then I tried to get the article the same way:

wget -U "Googlebot/2.1 (+http://www.google.com/bot.html)" "http://www.webmasterworld.com/forum3/9213.htm"

And that trick worked. BTW this FAQ has some interesting information about the Google crawler.

CategoryComputing CopyrightMatters; see also Google