Scrape full articles from NYT archives in Python?

I am trying to scrape the full text of articles from a New York Times archives search for an NLP task (search here: http://query.nytimes.com/search/sitesearch/). I have legal access to all of the articles and can view them if I search the archives manually.
However, when I use urllib2, mechanize or requests to pull the HTML from the search results page, they do not pull the relevant parts of the code (links to the articles, number of hits), so I cannot scrape the full articles. I am not getting an error message; the relevant sections, which are clearly visible in Inspect Element, are simply missing from the HTML that is pulled.
Because some of the articles are accessible to subscribers only, it occurred to me that this may be the problem, so I supplied my user credentials through Mechanize with the request; however, this makes no difference in the code that is pulled.
There is an NYT API, but it does not give access to the full text of the articles, so it is useless for my purposes.
I assume that NYT has intentionally made scraping the page difficult, but I have a legal right to view all of these articles, so I would appreciate any help with strategies for getting around the hurdles they have put up. I am new to web scraping and am not sure where to start in figuring out this problem.
I tried pulling the HTML with all of the following, and got the same incomplete results each time:
url = 'http://query.nytimes.com/search/sitesearch/#/India+%22united+states%22/from19810101to20150228/allresults/1/allauthors/relevance/Opinion/'
#trying urllib
import urllib
opener = urllib.FancyURLopener()
print opener.open(url).read()
#trying urllib2
import urllib2
request = urllib2.Request(url)
response = urllib2.urlopen(request)
print response.read()
#trying requests
import requests
print requests.get(url).text
#trying mechanize (impersonating browser)
import mechanize
import cookielib
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
r = br.open(url)
print r.read()

Why don't you use a framework like Scrapy? This will give you a lot of power out of the box. For example, you will be able to retrieve those parts of the page you are interested in and discard the rest. I wrote a little example for dealing with Scrapy and ajax pages here: http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python/
Maybe it can help you to get an idea of how Scrapy works.
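As a rough starting point, a minimal spider might look like the sketch below. The selectors and start URL are placeholders rather than the real NYT markup, and since the search results on this page are loaded with Javascript, you will most likely have to point the spider at the underlying JSON endpoint (visible in the browser's network tab), as described in the post above.
import scrapy

class NYTSearchSpider(scrapy.Spider):
    name = "nyt_search"
    # placeholder: in practice, point this at the endpoint the search page
    # calls behind the scenes, not at the Javascript-rendered page itself
    start_urls = ["http://query.nytimes.com/search/sitesearch/"]

    def parse(self, response):
        # placeholder selector; adjust to the markup you actually see
        for href in response.css("div.searchResults a::attr(href)").extract():
            yield scrapy.Request(href, callback=self.parse_article)

    def parse_article(self, response):
        # placeholder selector for the article body
        paragraphs = response.css("p.story-body-text::text").extract()
        yield {"url": response.url, "text": " ".join(paragraphs)}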

You could try using a tool like kimonolabs.com to scrape the articles. If you're having trouble using authentication, kimono has a built-in framework that allows you to enter and store your credentials, so that might help where you're otherwise limited with the NYT API. I made this NYT API using kimono that you can clone and use if you make a kimono account: https://www.kimonolabs.com/apis/c8i366xe
Here's a help center article for how to make an API behind a login: https://help.kimonolabs.com/hc/en-us/articles/203275010-Fetch-data-from-behind-a-log-in-auth-API-
This article walks you through how to go through links to get the detailed page information: https://help.kimonolabs.com/hc/en-us/articles/203438300-Source-URLs-to-crawl-from-another-kimono-API


Python3 - Scraping data from website that requires log in - Can I use the user-agent of a browser that is currently logged in? [duplicate]

I googled my user agent and put it into my program, but no luck:
import requests
from bs4 import BeautifulSoup

URL = 'Servicenow blah blah'
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"
}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Very simple code so far.
Ultimately I am trying to log into this website (or even circumvent that by using the user-agent of a browser that is already logged in, if that is possible; this is my main question here) and then parse the HTML for a certain element to monitor for changes.
Or, if there is a better, simpler tool for this, I would LOVE to know.
In the HTML that is printed I'm seeing "Your session has expired etc. etc."
Firstly, a user-agent is not typically how session data is tracked; it just tells the website what browser and version you are using. Session data is typically kept in your cookies.
For the login issue, it sounds like you just need to perform the login request and keep track of the cookies it provides. However, since you said "monitor for changes", I suspect there might be some Javascript involved down the line ;) I recommend looking into Selenium for this. It's a browser driver, which means it interacts with a normal browser and will take care of all the Javascript execution and cookie tracking for you!
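For the cookie side of things, a requests.Session is usually enough. Here is a minimal sketch, assuming a hypothetical login endpoint and form field names; you would need to read the real ones out of the browser's network tab when you submit the login form:
import requests

LOGIN_URL = "https://instance.service-now.com/login.do"    # hypothetical
TARGET_URL = "https://instance.service-now.com/some_page"  # hypothetical

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"
})

# the Session stores whatever cookies the server sets during login,
# so the follow-up request is made with the authenticated session
session.post(LOGIN_URL, data={"user_name": "me", "user_password": "secret"})  # field names are guesses
page = session.get(TARGET_URL)
print(page.status_code)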

Scraping with drop down menu + button using Python

I'm trying to scrape data from Mexico's Central Bank website but have hit a wall. In terms of actions, I need to first access a link within an initial URL. Once the link has been accessed, I need to select two dropdown values and then activate a submit button. If all goes well, I will be taken to a new URL where a set of links to PDFs is available.
The original url is:
"http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html"
The nested URL (the one with the dropdowns) is:
"http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX"
The inputs (arbitrary) are, say: '07/03/2019' and '14/03/2019'.
Using BeautifulSoup and requests I feel like I got as far as filling in the values in the dropdowns, but failed to click the button and reach the final URL with the list of links.
My code follows below :
from bs4 import BeautifulSoup
import requests
pagem=requests.get("http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html")
soupm = BeautifulSoup(pagem.content,"lxml")
lst=soupm.find_all('a', href=True)
url=lst[-1]['href']
page = requests.get(url)
soup = BeautifulSoup(page.content,"lxml")
xin= soup.find("select",{"id":"_id0:selectOneFechaIni"})
xfn= soup.find("select",{"id":"_id0:selectOneFechaFin"})
ino=list(xin.stripped_strings)
fino=list(xfn.stripped_strings)
headers = {'Referer': url}
data = {'_id0:selectOneFechaIni':'07/03/2019', '_id0:selectOneFechaFin':'14/03/2019',"_id0:accion":"_id0:accion"}
respo=requests.post(url,data,headers=headers)
print(respo.url)
In the code, respo.url ends up equal to url, so the code fails. Can anybody please help me identify where the problem is? I'm a newbie to scraping, so it might be obvious; apologies in advance for that. I'd appreciate any help. Thanks!
Last time I checked, you cannot submit a form via clicking buttons with BeautifulSoup and Python. There are typically two approaches I often see:
Reverse engineer the form
If the form makes AJAX calls (e.g. makes a request behind the scenes, common for SPAs written in React or Angular), then the best approach is to use the network requests tab in Chrome or another browser to understand what the endpoint is and what the payload is. Once you have those answers, you can make a POST request with the requests library to that endpoint with data=your_payload_dictionary (e.g. manually do what the form is doing behind the scenes). Read this post for a more elaborate tutorial.
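In its simplest form that amounts to something like the sketch below, where the endpoint and payload are hypothetical stand-ins for whatever you find in the network tab:
import requests

# hypothetical endpoint and payload, copied from the browser's network tab
endpoint = "https://example.com/some/form/endpoint"
payload = {"fechaIni": "07/03/2019", "fechaFin": "14/03/2019"}

response = requests.post(endpoint, data=payload)
print(response.status_code)
print(response.text[:500])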
Use a headless browser
If the website is written in something like ASP.NET or a similar MVC framework, then the best approach is to use a headless browser to fill out a form and click submit. A popular framework for this is Selenium. This simulates a normal browser. Read this post for a more elaborate tutorial.
Judging by a cursory look at the page you're working on, I recommend approach #2.
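For this particular page, a rough sketch of approach #2 could look like this. The select ids and the button name are taken from the code and payload in the question; selecting the dates by their visible text is an assumption, so inspect the rendered dropdowns to confirm:
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get("http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX")

# pick the start and end dates in the two dropdowns
Select(driver.find_element_by_id("_id0:selectOneFechaIni")).select_by_visible_text("07/03/2019")
Select(driver.find_element_by_id("_id0:selectOneFechaFin")).select_by_visible_text("14/03/2019")

# click the submit button; its name comes from the payload in the question
driver.find_element_by_name("_id0:accion").click()

# the page with the PDF links is now rendered and can be scraped
pdf_links = [a.get_attribute("href") for a in driver.find_elements_by_tag_name("a")]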
The page you have to scrape is:
http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces
Add the date to consult and the JSESSIONID from the cookies to the payload, and Referer, User-Agent and all the usual good stuff to the request headers.
Example:
import requests
import pandas as pd

cl = requests.Session()
url = "http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces"
payload = {
    "JSESSIONID": "cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000",
    "fechaAConsultar": "21/03/2019"
}
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded",
    "Referer": "http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000"
}
response = cl.post(url, data=payload, headers=headers)
tables = pd.read_html(response.text)
When just clicking through the pages it looks like there's some sort of cookie/session stuff going on that might be difficult to take into account when using requests.
(Example: http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=8AkD5D0IDxiiwQzX6KqkB2WIYRjIQb2TIERO1lbP35ClUgzmBNkc!-1120047000)
It might be easier to code this up using selenium since that will automate the browser (and take care of all the headers and whatnot). You'll still have access to the html to be able to scrape what you need. You can probably reuse a lot of what you're doing as well in selenium.

How to scrape aspx pages with python

I am trying to scrape a site, https://www.searchiqs.com/nybro/ (you have to click "Log In as Guest" to get to the search form). If I search for a Party 1 term like, say, "Andrew", the results have pagination. Also, the request type is POST, so the URL does not change, and the sessions time out very quickly; so quickly that if I wait ten minutes and refresh the search URL page it gives me a timeout error.
I got started with scraping recently, so I have mostly been doing GET requests where I can decipher the URL. So far I have realized that I will have to look at the DOM. Using Chrome's dev tools, I have found the headers. From the Network tab, I have also found the following form data that is passed on from the search page to the results page:
__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:/wEPaA8FDzhkM2IyZjUwNzg...(i have truncated this for length)
__VIEWSTATEGENERATOR:F92D01D0
__EVENTVALIDATION:/wEdAJ8BsTLFDUkTVU3pxZz92BxwMddqUSAXqb... (i have truncated this for length)
BrowserWidth:1243
BrowserHeight:705
ctl00$ContentPlaceHolder1$scrollPos:0
ctl00$ContentPlaceHolder1$txtName:david
ctl00$ContentPlaceHolder1$chkIgnorePartyType:on
ctl00$ContentPlaceHolder1$txtFromDate:
ctl00$ContentPlaceHolder1$txtThruDate:
ctl00$ContentPlaceHolder1$cboDocGroup:(ALL)
ctl00$ContentPlaceHolder1$cboDocType:(ALL)
ctl00$ContentPlaceHolder1$cboTown:(ALL)
ctl00$ContentPlaceHolder1$txtPinNum:
ctl00$ContentPlaceHolder1$txtBook:
ctl00$ContentPlaceHolder1$txtPage:
ctl00$ContentPlaceHolder1$txtUDFNum:
ctl00$ContentPlaceHolder1$txtCaseNum:
ctl00$ContentPlaceHolder1$cmdSearch:Search
All the ones in caps are hidden. I have also managed to figure out the results structure.
My script thus far is really pathetic, as I am completely blank on what to do next. I still have to do the form submission, analyze the pagination and scrape the results, but I have absolutely no idea how to proceed.
import re
import urlparse
import mechanize
from bs4 import BeautifulSoup

class DocumentFinderScraper(object):
    def __init__(self):
        self.url = "https://www.searchiqs.com/nybro/SearchResultsMP.aspx"
        self.br = mechanize.Browser()
        self.br.addheaders = [('User-agent',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7')]

    ##TO DO
    ##submit form
    #get return URL
    #scrape results
    #analyze pagination

if __name__ == '__main__':
    scraper = DocumentFinderScraper()
    scraper.scrape()
Any help would be dearly appreciated
I disabled Javascript and visited https://www.searchiqs.com/nybro/. Without Javascript, the Log In and Log In as Guest buttons are disabled. This makes it impossible for Mechanize to work, because it cannot process Javascript and you won't be able to submit the form.
For this kind of problem you can use Selenium, which simulates a full browser, with the disadvantage of being slower than Mechanize.
This code should log you in using Selenium:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
usr = ""
pwd = ""
driver = webdriver.Firefox()
driver.get("https://www.searchiqs.com/nybro/")
assert "IQS" in driver.title
elem = driver.find_element_by_id("txtUserID")
elem.send_keys(usr)
elem = driver.find_element_by_id("txtPassword")
elem.send_keys(pwd)
elem.send_keys(Keys.RETURN)
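Once logged in, you can keep driving the same browser session to fill in the search form and run the search. A rough sketch, assuming the names listed in your captured form data are the name attributes of the inputs:
from bs4 import BeautifulSoup

# fill the Party 1 search box and run the search; the control names come
# from the form data captured in the question
elem = driver.find_element_by_name("ctl00$ContentPlaceHolder1$txtName")
elem.send_keys("david")
driver.find_element_by_name("ctl00$ContentPlaceHolder1$cmdSearch").click()

# the rendered results page (ASP.NET postback state included) can now be parsed
soup = BeautifulSoup(driver.page_source, "html.parser")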

Fetching cookie enabled page in python

I want to download a webpage using Python for a web scraping task. The problem is that the website requires cookies to be enabled; otherwise it serves a different version of the page. I did implement a solution that solves the problem, but it is inefficient in my opinion. Need your help to improve it!
This is how I handle it now:
import requests
import cookielib
cj = cookielib.CookieJar()
user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
#first request to get the cookies
requests.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&',headers=user_agent, timeout=2, cookies = cj)
# second request reusing cookies served first time
r = requests.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&',headers=user_agent, timeout=2, cookies = cj)
html_text = r.text
Basically, I create a CookieJar object and then send two consecutive requests for the same URL. The first time it serves me the bad page, but as compensation it sets cookies. The second request reuses the cookies and I get the right page.
The question is: Is it possible to just use one request and still get the right cookie enabled version of a page?
I tried to send a HEAD request the first time instead of a GET to minimize traffic, but in that case no cookies are served. Googling didn't give me the answer either.
So, it would be interesting to understand how to do this efficiently! Any ideas?
You need to make the request to get the cookie, so no, you cannot obtain the cookie and reuse it without making two separate requests. If by "cookie-enabled" you mean the version that recognizes your script as having cookies, then it all depends on the server and you could try:
hardcoding the cookies before making the first request,
requesting the smallest possible page (one with the smallest possible response that still sets the cookies) to obtain the first cookie,
trying to find some workaround (maybe adding some GET argument will fool the site into believing you have cookies, but you would need to find it for this specific site).
I think the winner here might be to use requests's session framework, which takes care of the cookies for you.
That would look something like this:
import requests

user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}

s = requests.Session()
s.headers.update(user_agent)
r = s.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&', timeout=2)
html_text = r.text
Try that and see if that works?

python mechanize.browser submit() related problem

I'm making a script with the mechanize.browser module.
One problem: everything else is OK, but when I submit() the form it does not work, so I looked for the suspicious part of the source.
In the HTML source I found the following:
<form method="post" onsubmit="return loginCheck(this)" name="FRMLOGIN"/>
I think loginCheck(this) is causing the problem when the form is submitted.
But how can I handle this kind of Javascript function with the mechanize module, so that I can successfully submit the form and receive the result?
The following is my current script source.
If anyone can help me, much appreciated!!
# -*- coding: cp949-*-
import sys,os
import mechanize, urllib
import cookielib
from BeautifulSoup import BeautifulSoup,BeautifulStoneSoup,Tag
import datetime, time, socket
import re,sys,os,mechanize,urllib,time
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# Want debugging messages?
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6')]
br.open('http://user.buddybuddy.co.kr/Login/LoginForm.asp?URL=')
html = br.response().read()
print html
br.select_form(name='FRMLOGIN')
print br.viewing_html()
br.form['ID']='zero1zero2'
br.form['PWD']='012045'
br.submit()
print br.response().read()
mechanize doesn't support Javascript at all. If you absolutely have to run that Javascript, look into Selenium. It offers python bindings to control a real, running browser like Firefox or IE.
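A minimal sketch of that route for this login page, assuming the control names ID and PWD from your mechanize code are the name attributes of the inputs (the credentials here are placeholders):
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get('http://user.buddybuddy.co.kr/Login/LoginForm.asp?URL=')

driver.find_element_by_name('ID').send_keys('your_id')
pwd = driver.find_element_by_name('PWD')
pwd.send_keys('your_password')
# pressing Enter submits the form through the browser, so the onsubmit
# handler (loginCheck) runs just as it would for a real user
pwd.send_keys(Keys.RETURN)
print(driver.page_source)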
You will either need to make use of the unmaintained module DOMForm and Spidermonkey (http://pypi.python.org/pypi/python-spidermonkey) to process the Javascript, or you figure out what loginCheck() is doing and perform its work prior to form submission in Python. If loginCheck() just checks the login data for obvious validity, that should be pretty easy.
Please note that the action attribute of the form tag shown above is missing; it is probably set in the Javascript.
Depending on what you intend it might be easier to work with urllib2 only. You might assume a static appearance of that web page and just post data with urllib2's methods and get the results with it also.
onsubmit is just ignored by mechanize; no Javascript interpretation is done.
You need to check what loginCheck() does; in some limited cases (validation) you can do programmatically in Python what the Javascript does.
