
Re: [Xvoice-sphinx] Usenet corpus status

To: "Jessica P. Hekman" <http://www.arborius.net/~jphekman-dated-1049297697.648976>
Subject: Re: [Xvoice-sphinx] Usenet corpus status
From: robert b <http://dummy.us.eu.org/robert>
Date: Sat, 22 Mar 2003 13:38:17 -0800 (PST)

--- "Jessica P. Hekman" <http://www.arborius.net/~jphekman> wrote:
> Started work on a Usenet corpus building tool. I got as far as trying to 
> download and parse a page of Google groups, and discovered that when you 
> use NekoHTML (or, for that matter, wget) to try to access one of their 
> pages, you get a 403. Huh!
> 
> I found their license here:
> 
>  http://groups.google.com/googlegroups/posting_terms.html
> 
> I sent them mail asking what the deal was. Since you don't get a 403 using 
> a browser on the exact same page, I assume they're checking something in 
> the HTTP headers. I suppose it's possible to get a tool to work around 
> this, but I don't know how. Input welcome. But I'm also interested in 
> their response; if they say they won't let us use their archive for 
> xvoice-sphinx, I think we should go elsewhere. It would be easy enough to 
> circumvent them, but I'd hate to always feel nervous that we could have 
> legal action taken against us if they noticed us.

I'd expect them to say yes if you ask.

The Google API doesn't let you download articles in bulk, does it?

Regardless, once they understand the project, I'd expect them to be amenable. If they aren't, I
could pull Usenet articles from my regular NNTP feed instead.
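
For what it's worth, here is a rough sketch of what pulling article bodies over NNTP might look like. It speaks the protocol directly over a socket using the standard GROUP and BODY commands. The host and group names in the usage note are placeholders, the function names are made up for illustration, and authentication and error handling are left out, so treat it as a starting point rather than a finished tool.

```python
import socket


def parse_group_response(line):
    """Parse the reply to an NNTP GROUP command: '211 count first last name'."""
    parts = line.split()
    if len(parts) < 5 or parts[0] != "211":
        raise ValueError("unexpected GROUP reply: " + line)
    return int(parts[1]), int(parts[2]), int(parts[3]), parts[4]


def decode_multiline(lines):
    """Collect a multi-line reply: stop at the lone '.', undo dot-stuffing."""
    body = []
    for line in lines:
        if line == ".":
            break
        # A line that starts with '.' is sent with an extra leading '.'.
        body.append(line[1:] if line.startswith("..") else line)
    return body


def fetch_bodies(host, group, limit=10, port=119):
    """Fetch up to `limit` article bodies from `group` (assumed names)."""
    with socket.create_connection((host, port), timeout=30) as sock:
        rfile = sock.makefile("rb")

        def read_lines():
            while True:
                raw = rfile.readline()
                if not raw:
                    raise ConnectionError("server closed the connection")
                yield raw.decode("latin-1").rstrip("\r\n")

        def send(command):
            sock.sendall((command + "\r\n").encode("ascii"))

        reader = read_lines()
        next(reader)  # server greeting
        send("GROUP " + group)
        _count, first, last, _name = parse_group_response(next(reader))

        bodies = []
        for number in range(first, min(last, first + limit - 1) + 1):
            send("BODY %d" % number)
            if not next(reader).startswith("222"):
                continue  # article missing or expired
            bodies.append("\n".join(decode_multiline(reader)))
        send("QUIT")
        return bodies
```

Something like fetch_bodies("news.example.org", "misc.test", limit=5) would then return a list of body strings to feed into the corpus tool, assuming the server allows unauthenticated reading.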

> j
