I am trying to extract articles from The New York Times using the python-goose extractor.
I have tried the standard URL retrieval approach:
g.extract(url=url)
However, this yields an empty string, so I tried the following approach recommended in the documentation:
import urllib2
import goose
url = "http://www.nytimes.com/reuters/2015/12/21/world/africa/21reuters-kenya-attacks-somalia.html?_r=0"
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open(url)
raw_html = response.read()
g = goose.Goose()
a = g.extract(raw_html=raw_html)
a.cleaned_text
Again, an empty string is returned for cleaned_text, even though the HTML is retrieved from the website. I have also tried using requests, with the same result.
I presume this is a python-goose problem: it is not able to extract the article body from the raw HTML being returned. I have searched prior questions but can't find anything that solves my problem.
It looks like goose has traditionally had problems with The New York Times because (1) they redirect users through another page to add/check cookies (see the curl output below) and (2) they don't actually load the article text on page load; it is loaded asynchronously after the ad display code runs first. A sketch for observing the redirect from Python follows the curl output.
~ curl -I "http://www.nytimes.com/reuters/2015/12/21/world/africa/21reuters-kenya-attacks-somalia.html"
HTTP/1.1 303 See Other
Server: Varnish
Location: http://www.nytimes.com/glogin?URI=http%3A%2F%2Fwww.nytimes.com%2Freuters%2F2015%2F12%2F21%2Fworld%2Fafrica%2F21reuters-kenya-attacks-somalia.html%3F_r%3D0
Accept-Ranges: bytes
Date: Tue, 22 Dec 2015 15:46:55 GMT
X-Varnish: 1338962331
Age: 0
Via: 1.1 varnish
X-API-Version: 5-0
X-PageType: article
Connection: close
X-Frame-Options: DENY
Set-Cookie: RMID=007f01017a275679706f0004;Path=/; Domain=.nytimes.com;Expires=Wed, 21 Dec 2016 15:46:55 UTC
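For reference, a minimal sketch (assuming the requests library is installed) that shows the cookie redirect from Python; it deliberately does not follow redirects so the 303 to /glogin stays visible:
import requests

url = ("http://www.nytimes.com/reuters/2015/12/21/world/africa/"
       "21reuters-kenya-attacks-somalia.html?_r=0")

# Don't follow redirects so the 303 cookie-check hop is visible.
resp = requests.get(url, allow_redirects=False)
print(resp.status_code)               # 303, per the curl output above
print(resp.headers.get("Location"))   # the /glogin cookie-check URL
Even after the cookie check passes, the article body itself is injected asynchronously, so goose still ends up with an essentially empty page.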
I wanted to create a download manager that can download multiple files automatically. However, I had a problem extracting the name of the downloaded file from the URL. I tried an answer to How to extract a filename from a URL and append a word to it?, more specifically:
import os
from urllib.parse import urlparse

a = urlparse(URL)
file = os.path.basename(a.path)
but all of them, including the one shown, break when you have a URL such as
URL = "https://calibre-ebook.com/dist/win64"
Downloading it in Microsoft Edge gives you a file named calibre-64bit-6.5.0.msi, but downloading it with Python and using the method from the other question to extract the file name gives you win64 instead of the intended file name.
The URL https://calibre-ebook.com/dist/win64 is an HTTP 302 redirect to another URL, https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi. You can see this by running a HEAD request, for example in a macOS/Linux terminal (note the 302 status and the location header):
$ curl --head https://calibre-ebook.com/dist/win64
HTTP/2 302
server: nginx
date: Wed, 21 Sep 2022 16:54:49 GMT
content-type: text/html
content-length: 138
location: https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi
The browser follows the HTTP redirect and downloads the file, naming it based on the final URL. If you'd like to do the same in Python, you also need to get to the final URL and use that as the file name. Note that requests.head() does not follow redirects by default, so explicitly pass allow_redirects=True.
With requests==2.28.1 this code returns the last URL:
import requests
requests.head('https://calibre-ebook.com/dist/win64', allow_redirects=True).url
# 'https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi'
If you'd like to solve it with built-in modules, so you don't need to install external libraries like requests, you can achieve the same with urllib:
import urllib.request
opener = urllib.request.build_opener()
opener.open('https://calibre-ebook.com/dist/win64').geturl()
# 'https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi'
Then you can split the last URL by / and take the last section as the file name, for example:
import urllib.request
opener = urllib.request.build_opener()
url = opener.open('https://calibre-ebook.com/dist/win64').geturl()
url.split('/')[-1]
# 'calibre-64bit-6.5.0.msi'
I was using urllib3==1.26.12, requests==2.28.1, and Python 3.8.9 in the examples; if you are using much older versions, they might behave differently and might need extra flags to ensure redirects are followed.
The URL results in a 302 redirect, so you don't have enough information from the URL alone to get that basename. You have to get the final URL from the 302 response.
import requests
resp = requests.head("https://calibre-ebook.com/dist/win64")
print(resp.status_code, resp.headers['location'])
>>> 302 https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi
You'd obviously want more intelligent handling in case the response isn't a 302, and you'd want to loop in case the new URL results in another redirect.
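A minimal sketch of such a loop, assuming requests is installed (resolve_url is just a hypothetical helper name):
import requests
from urllib.parse import urljoin

def resolve_url(url, max_hops=10):
    # Follow redirects manually, one hop at a time.
    for _ in range(max_hops):
        resp = requests.head(url, allow_redirects=False)
        location = resp.headers.get("location")
        if resp.status_code in (301, 302, 303, 307, 308) and location:
            url = urljoin(url, location)  # Location may be relative
        else:
            return url
    raise RuntimeError("too many redirects")

print(resolve_url("https://calibre-ebook.com/dist/win64").split("/")[-1])
# calibre-64bit-6.5.0.msi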
I have a long list of incomplete website addresses, some missing a prefix like "http://www.", etc.:
pewresearch.org
narod.ru
intel.com
xda-developers.com
oecd.org
I tried:
import requests
from lxml.html import fromstring
to_check = [
    "pewresearch.org",
    "narod.ru",
    "intel.com",
    "xda-developers.com",
    "oecd.org"]

for each in to_check:
    r = requests.get("http://www." + each)
    tree = fromstring(r.content)
    title = tree.findtext('.//title')
    print(title)
They returned:
Pew Research Center | Pew Research Center
Лучшие конструкторы сайтов | Народный рейтинг конструкторов для создания сайтов
Intel | Data Center Solutions, IoT, and PC Innovation
XDA Portal & Forums
Home page - OECD
It seems they all start with "http://www.", but that is not always correct; for example, the right address for the first one is "https://www.pewresearch.org/".
What's the quickest way, using an online tool or Python, to find their complete and correct addresses instead of keying them one by one into a web browser? (Some might be http, some https.)
Write a script / short program to send a HEAD request to each site. The server should respond with a redirect (e.g. to HTTPS). Follow each redirect until no further redirects are received.
The C# HttpClient can follow redirects automatically.
For Python, see jterrace's answer here, which uses the requests library; the code snippet is below, and a sketch applying it to the original list follows it:
>>> import requests
>>> r = requests.head('http://github.com', allow_redirects=True)
>>> r
<Response [200]>
>>> r.history
[<Response [301]>]
>>> r.url
u'https://github.com/'
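Applied to the list of sites from the question above, a minimal sketch (assuming requests is installed and each site at least answers on plain http):
import requests

to_check = ["pewresearch.org", "narod.ru", "intel.com",
            "xda-developers.com", "oecd.org"]

for site in to_check:
    r = requests.head("http://" + site, allow_redirects=True, timeout=10)
    print(site, "->", r.url)  # final URL after following all redirects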
I am making GET/POST requests to a URL and getting an HTML page in response. I only want the response headers, not the response body.
I have already used the HEAD method, but it does not work in every situation.
Getting the complete HTML page in the response increases bandwidth usage.
I also need a solution that works for both HTTPS and HTTP requests.
For example:
import urllib2
urllib2.urlopen('http://www.google.com')
If I send a request to this URL using urllib2 or requests, I get both the response body and the headers from the server. The request takes 14.08 KB in total; broken down, the response headers take 775 bytes and the response body takes 13.32 KB. So if I fetch only the response headers, I save 13.32 KB.
What you want to do is a so-called HEAD request. See this question on how to do it.
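Since the question uses urllib2, here is a minimal sketch of a HEAD request with it (Python 2); the server then sends headers only and an empty body:
import urllib2

class HeadRequest(urllib2.Request):
    # Override the HTTP verb so urlopen() sends HEAD instead of GET.
    def get_method(self):
        return 'HEAD'

resp = urllib2.urlopen(HeadRequest('http://www.google.com'))
print(resp.info())       # headers only
print(len(resp.read()))  # 0 -- no body is transferred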
Is this what you are looking for?
import urllib2
l = urllib2.urlopen('http://www.google.com')
print(l.headers)
#Date: Thu, 11 Oct 2018 09:07:20 GMT
#Expires: -1
#...
EDIT
This seems to do what you are looking for:
import requests
a = requests.head('https://www.google.com')
a.headers
#{'X-XSS-Protection': '1; mode=block', 'Content-Encoding':...
a.text
#u''
I get information from Google programmatically. I know that I will be blocked after a few requests, which is why I tried to go through proxies. For the proxies I use
ProxyBroker from this link:
The Link
However, if I use proxies, Google returns a 503. If I click on the error, Google shows me my IP and not the proxy IP.
Here is what I've tried with:
usedProxy = self.getProxy()
if usedProxy is not None:
proxies = {"http": "http://%s" % usedProxy[0]}
headers = {'User-agent': 'Mozilla/5.0'}
proxy_support = urlrequest.ProxyHandler(proxies)
opener = urlrequest.build_opener(proxy_support, urlrequest.HTTPHandler(debuglevel=1))
urlrequest.install_opener(opener)
req = urlrequest.Request(search_url, None, headers)
with contextlib.closing(urlrequest.urlopen(req)) as url:
htmltext = url.read()
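As an aside, one way to check whether the traffic actually goes out through the proxy is to request an IP-echo service such as httpbin.org/ip with the same proxy settings; a minimal sketch with requests (PROXY is a placeholder):
import requests

PROXY = "1.2.3.4:8080"  # placeholder proxy address
proxies = {"http": "http://" + PROXY, "https": "http://" + PROXY}

# httpbin echoes the IP it sees; if this prints your own IP,
# the request is not actually going through the proxy.
print(requests.get("http://httpbin.org/ip", proxies=proxies, timeout=10).text)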
I tried with both http and https.
Even when the request goes through, I get a 503 with the following message:
send: b'GET http://www.google.co.in/search?q=Test/ HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.co.in\r\nUser-Agent: Mozilla/5.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Date header: Server header: Location header: Pragma header: Expires header: Cache-Control header: Content-Type header: Content-Length header: X-XSS-Protection header: X-Frame-Options header: Connection
send: b'GET http://ipv4.google.com/sorry/index?continue=http://www.google.co.in/search%3Fq%3DTest/&q=EgTCDs9XGMbOgNAFIhkA8aeDS0dE8uXKu31DEbfj5mCVdhpUO598MgFy HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: ipv4.google.com\r\nUser-Agent: Mozilla/5.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
If the above error doesn't happen, I eventually get the following error:
[Errno 54] Connection reset by peer
My questions are:
Is the IP in the error link always my IP rather than the proxy IP?
Google Error Link
And if it is always the host IP that is shown in Google's error message, and the problem comes from the proxies, how can I bypass the error?
It seems that Google knows I am going through a proxy, because it uses HTTPS and the HTTPS proxies don't seem to work. So the HTTP proxies are detected, which is why I get blocked directly after 50-60 queries.
My solution:
I tried all the solutions found on Stack Overflow, such as sleeping for 10 seconds, but they didn't work well. Then I found an article describing the same problem, and the solution turned out to be "quite" easy. First I installed the fake-useragent library for Python, which provides a ton of useful user agents.
I randomly select a user agent from this list for each request. I also restrict it to common user agents, because otherwise the page returns different HTML that does not fit my read method.
After installing fake-useragent and selecting a user agent randomly, I add a sleep of between 15 and 90 seconds, because the article's author tried different timespans and got blocked at 30 seconds. With these two simple changes (sketched below) my program has now been running successfully for 10 hours without trouble.
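A minimal sketch of that approach, assuming the fake-useragent and requests packages are installed (SEARCH_URL is a placeholder):
import random
import time

import requests
from fake_useragent import UserAgent

ua = UserAgent()
SEARCH_URL = "http://www.google.co.in/search?q=Test"  # placeholder query URL

for _ in range(5):
    headers = {"User-Agent": ua.random}   # random common user agent per request
    resp = requests.get(SEARCH_URL, headers=headers)
    print(resp.status_code)
    time.sleep(random.uniform(15, 90))    # random pause of 15-90 seconds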
I hope this helps you too, because it cost me a bunch of time to figure out when Google blocks you. It still detects the scraping every time, but lets you through with this configuration.
Have fun, and happy crawling!
EDIT:
The program gets through roughly 1,000 requests before it gets banned.
I want to know how to get the direct link to an embedded video (the link to the .flv/.mp4 or whatever file) from just the embed link.
For example, http://www.kumby.com/ano-hana-episode-1/ has
<embed src="http://www.4shared.com/embed/571660264/396a46be"></embed>
, though the link to the video seems to be
"http://dc436.4shared.com/img/571660264/396a46be/dlink__2Fdownload_2FM2b0O5Rr_3Ftsid_3D20120514-093834-29c48ef9/preview.flv"
How does the browser know where to load the video from? How can I write code that converts the embed link to a direct link?
UPDATE:
Thanks for the quick answer, Quentin.
However, I don't seem to receive a 'Location' header when connecting to "http://www.4shared.com/embed/571660264/396a46be".
import urllib2
r = urllib2.urlopen('http://www.4shared.com/embed/571660264/396a46be')
gives me the following headers:
'content-length', 'via', 'x-cache', 'accept-ranges', 'server', 'x-cache-lookup', 'last-modified', 'connection', 'etag', 'date', 'content-type', 'x-jsl'
from urllib2 import Request
r = Request('http://www.4shared.com/embed/571660264/396a46be')
gives me no headers at all.
The server issues a 302 HTTP status code and a Location header.
$ curl -I http://www.4shared.com/embed/571660264/396a46be
HTTP/1.1 302 Moved Temporarily
Server: Apache-Coyote/1.1
(snip cookies)
Location: http://static.4shared.com/flash/player/5.6/player.swf?file=http://dc436.4shared.com/img/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.flv&provider=image&image=http://dc436.4shared.com/img/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.flv&displayclick=link&link=http://www.4shared.com/video/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.html&controlbar=none
Content-Length: 0
Date: Mon, 14 May 2012 10:01:59 GMT
See How do I prevent Python's urllib(2) from following a redirect if you want to get information about the redirect response instead of following the redirect automatically.
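If you want to read that Location header from Python without following the redirect, one minimal sketch (using Python 2's httplib, to match the question's environment) is:
import httplib

conn = httplib.HTTPConnection('www.4shared.com')
conn.request('HEAD', '/embed/571660264/396a46be')
resp = conn.getresponse()
print(resp.status)                 # 302
print(resp.getheader('Location'))  # the player URL containing the .flv link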