Python post request for USPTO site scraping - python

I'm trying to scrape data from the http://portal.uspto.gov/EmployeeSearch/ website.
I open the site in a browser, click the Search button in the Search by Organisation part of the page, and look at the request that is sent to the server.
When I post the same request from my program using the Python requests library, I don't get the result page I expect; I get the same Search page back, with no employee data on it.
I've tried several variants of the request and none of them works.
My question is: what URL should I use in my request, and do I need to specify headers (I tried that too, copying the headers shown in the Firefox developer tools for the request) or something else?
Below is the code that sends the request:
import requests
from bs4 import BeautifulSoup

def scrape_employees():
    URL = 'http://portal.uspto.gov/EmployeeSearch/searchEm.do;jsessionid=98BC24BA630AA0AEB87F8109E2F95638.prod_portaljboss4_jvm1?action=displayResultPageByOrgShortNm&currentPage=1'
    response = requests.post(URL)
    site_data = response.content
    soup = BeautifulSoup(site_data, "html.parser")
    print(soup.prettify())

if __name__ == '__main__':
    scrape_employees()

All the data you need is in the form tag:
action is the URL to which the POST request is sent.
The input elements hold the data you need to post to the server, as {name: value} pairs.
import re
import urllib.parse
import bs4
import requests

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_form(soup):
    # pick the form whose action mentions the search by organisation
    form = soup.find(name='form', action=re.compile(r'OrgShortNm'))
    return form

def get_action(form, base_url):
    action = form['action']
    # action is a relative URL, convert it to an absolute URL
    abs_action = urllib.parse.urljoin(base_url, action)
    return abs_action

def get_form_data(form, org_code):
    data = {}
    for inp in form('input'):
        if not inp.get('name'):
            continue  # inputs without a name are not submitted
        # if the value is missing or empty, put the org_code into this field
        data[inp['name']] = inp.get('value') or org_code
    return data

if __name__ == '__main__':
    url = 'http://portal.uspto.gov/EmployeeSearch/'
    soup = make_soup(url)
    form = get_form(soup)
    action = get_action(form, url)
    data = get_form_data(form, '1634')
    # make the request to the action using data
    r = requests.post(action, data=data)
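The same idea can be tried offline, without bs4 or a network connection. The sketch below uses only the standard library. The HTML snippet, the field names (orgShortNm, currentPage) and the FormParser and build_post names are made up for illustration; they are not taken from the real USPTO page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical markup standing in for a real search page.
SAMPLE_HTML = """
<form action="searchEm.do?action=displayResultPageByOrgShortNm" method="post">
  <input type="hidden" name="currentPage" value="1">
  <input type="text" name="orgShortNm">
  <input type="submit" value="Search">
</form>
"""


class FormParser(HTMLParser):
    """Collect the action and the named input fields of the first form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Inputs without a value attribute are stored as None.
            self.fields[attrs["name"]] = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_post(html, base_url, fill_value):
    """Return (absolute action URL, payload) for the first form in html."""
    parser = FormParser()
    parser.feed(html)
    action = urljoin(base_url, parser.action or "")
    # Empty or missing values are filled with fill_value, as in get_form_data.
    data = {name: value or fill_value for name, value in parser.fields.items()}
    return action, data


action, data = build_post(SAMPLE_HTML, "http://portal.uspto.gov/EmployeeSearch/", "1634")
print(action)
print(data)
```

The resulting pair can be passed to requests.post(action, data=data) in the same way as in the answer.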

Related

Beautifulsoup Facebook Login

I am trying to use BeautifulSoup to scrape the data of a post with the code below,
but I found that the login fails, so the scraper returns the text of all the posts, including the header message (the text that asks you to log in).
How can I modify the code so that it returns the info for the specific post with that ID instead of the info for all posts? Thanks!
import requests
from bs4 import BeautifulSoup

class faceBookBot():
    login_basic_url = "https://mbasic.facebook.com/login"
    login_mobile_url = 'https://m.facebook.com/login'
    payload = {
        'email': 'XXXX@gmail.com',
        'pass': "XXXX"
    }
    post_ID = ""

    # log in to facebook and redirect to the link with the specific post
    # I guess something goes wrong in the function below
    def parse_html(self, request_url):
        with requests.Session() as session:
            post = session.post(self.login_basic_url, data=self.payload)
            parsed_html = session.get(request_url)
        return parsed_html

    # scrape all <p> elements of the post, which hold the paragraph/content part
    def post_content(self):
        REQUEST_URL = f'https://m.facebook.com/story.php?story_fbid={self.post_ID}&id=7724542745'
        soup = BeautifulSoup(self.parse_html(REQUEST_URL).content, "html.parser")
        content = soup.find_all('p')
        post_content = []
        for lines in content:
            post_content.append(lines.text)
        post_content = ' '.join(post_content)
        return post_content

bot = faceBookBot()
bot.post_ID = "10158200911252746"
You can't. Facebook encrypts the password and you don't have the encryption it uses, so the server will never accept it. Save your time and find another way.
@AnsonChan yes, you could open the page with selenium, log in, and then copy its cookies to requests:
from selenium import webdriver
import requests
driver = webdriver.Chrome()
driver.get('http://facebook.com')
# login manually, or automate it.
# when logged in:
session = requests.session()
# copy every browser cookie into the requests session
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])
driver.quit()
# get the page you want with requests
response = session.get('https://m.facebook.com/story.php?story_fbid=123456789')
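A browser usually holds cookies for several domains, so it can help to copy only those that apply to the host you are about to request. The sketch below assumes the cookie dictionaries have the 'name', 'value' and 'domain' keys that driver.get_cookies() returns. The helper name cookies_for_host and the sample cookies are made up for illustration.

```python
def cookies_for_host(selenium_cookies, host):
    """Return {name: value} for the cookies whose domain covers host."""
    host = host.lower()
    selected = {}
    for cookie in selenium_cookies:
        domain = cookie.get("domain", "").lstrip(".").lower()
        if not domain:
            continue
        # A cookie set for example.com also applies to sub.example.com.
        if host == domain or host.endswith("." + domain):
            selected[cookie["name"]] = cookie["value"]
    return selected


# Made-up cookies in the shape returned by driver.get_cookies().
sample = [
    {"name": "session_id", "value": "123", "domain": ".facebook.com", "path": "/"},
    {"name": "locale", "value": "en_GB", "domain": "m.facebook.com", "path": "/"},
    {"name": "tracker", "value": "zzz", "domain": ".example.net", "path": "/"},
]
picked = cookies_for_host(sample, "m.facebook.com")
print(picked)
```

The result is a plain dictionary, so it can be handed to session.cookies.update(...) before calling session.get(...).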

Unable to access webpage with request in python

After some discussion of my problem in "Unable to print links using beautifulsoup while automating through selenium",
I realized that the main problem is the URL, which requests is not able to fetch. The URL of the page is actually https://society6.com/discover, but I am using selenium to log into my account, so the URL becomes https://society6.com/society?show=2
However, I can't use the second URL with requests since it shows an error. How do I scrape information from a URL like this?
You need to log in first!
To do that, you can use a requests.Session to post the login form and bs4.BeautifulSoup to read the CSRF token from the login page.
Here is an implementation that I have used:
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://society6.com/"

def log_in_and_get_session():
    """
    Log in and return the session object that holds the login cookies.
    :return: requests.Session on success, None on failure
    """
    ss = requests.Session()
    ss.verify = False  # optional: skips TLS certificate checks, only for sites with invalid certificates
    text = ss.get(f"{BASE_URL}login").text
    # the value of the first <input> on the login page is used as the CSRF token
    csrf_token = BeautifulSoup(text, "html.parser").input["value"]
    data = {"username": "your_username", "password": "your_password", "csrfmiddlewaretoken": csrf_token}
    results = ss.post(f"{BASE_URL}login", data=data)
    if results.ok:
        print("Login success", results.status_code)
        return ss
    print("Can't login", results.status_code)
    return None
The login itself is done with the session's 'post' method.
Hope this helps you!
Edit
Added the beginning of the function.
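One caveat about checking results.ok: it only says that the HTTP status code is below 400, and many sites answer a failed login with a normal 200 page that shows the login form again. A rough heuristic, sketched below, is to also look at the final URL and at whether the response still contains a password field. The function name and the "/login" default are assumptions for illustration, and the check may need adjusting for a particular site.

```python
from urllib.parse import urlparse


def login_probably_succeeded(status_code, final_url, body, login_path="/login"):
    """Heuristic check, not a guarantee, that a login POST worked."""
    if status_code >= 400:
        return False
    # Being sent back to the login page suggests the credentials were rejected.
    if urlparse(final_url).path.rstrip("/") == login_path.rstrip("/"):
        return False
    # A password field in the response suggests the login form was shown again.
    return 'type="password"' not in body.lower()


print(login_probably_succeeded(200, "https://society6.com/login", '<input type="password">'))
print(login_probably_succeeded(200, "https://society6.com/society", "<h1>Welcome</h1>"))
```

With requests, the three arguments would come from results.status_code, results.url and results.text.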

Login to aspx website using python requests

I'm trying to log into my school website that utilizes aspx with requests in order to scrape some data. My problem is similar to the one described here:
Log in to ASP website using Python's Requests module
However, my form also requires SubmitButton.x and SubmitButton.y, and I don't know where to get them from. I tried passing in the values that worked in a manual login, but it didn't work.
Here's the page
form data from successful manual login
from bs4 import BeautifulSoup
import requests

# login page URL, used for both the initial GET and the login POST
LOGIN_URL = 'https://adfslight.resman.pl/LoginPage.aspx?ReturnUrl=%2f%3fwa%3dwsignin1.0%26wtrealm%3dhttps%253a%252f%252fcufs.resman.pl%253a443%252frzeszow%252fAccount%252fLogOn%26wctx%3drm%253d0%2526id%253dADFS%2526ru%253d%25252frzeszow%25252fFS%25252fLS%25253fwa%25253dwsignin1.0%252526wtrealm%25253dhttps%2525253a%2525252f%2525252fuonetplus.resman.pl%2525252frzeszow%2525252fLoginEndpoint.aspx%252526wctx%25253dhttps%2525253a%2525252f%2525252fuonetplus.resman.pl%2525252frzeszow%2525252fLoginEndpoint.aspx%26wct%3d2018-02-04T18%253a08%253a18Z&wa=wsignin1.0&wtrealm=https%3a%2f%2fcufs.resman.pl%3a443%2frzeszow%2fAccount%2fLogOn&wctx=rm%3d0%26id%3dADFS%26ru%3d%252frzeszow%252fFS%252fLS%253fwa%253dwsignin1.0%2526wtrealm%253dhttps%25253a%25252f%25252fuonetplus.resman.pl%25252frzeszow%25252fLoginEndpoint.aspx%2526wctx%253dhttps%25253a%25252f%25252fuonetplus.resman.pl%25252frzeszow%25252fLoginEndpoint.aspx&wct=2018-02-04T18%3a08%3a18Z'
data = {}
with requests.Session() as s:
    page = s.get(LOGIN_URL).content
    soup = BeautifulSoup(page, "lxml")
    data["__EVENTTARGET"] = ""
    data["__EVENTARGUMENT"] = ""
    # the field name has two leading underscores, matching the element id
    data["__VIEWSTATE"] = soup.select_one("#__VIEWSTATE")["value"]
    data["__VIEWSTATEGENERATOR"] = soup.select_one("#__VIEWSTATEGENERATOR")["value"]
    data["__EVENTVALIDATION"] = soup.select_one("#__EVENTVALIDATION")["value"]
    data["UsernameTextBox"] = "myusername"
    data["PasswordTextBox"] = "mypassword"
    data["SubmitButton.x"] = "49"
    data["SubmitButton.y"] = "1"
    s.post(LOGIN_URL, data=data)
    open_page = s.get("https://uonetplus.resman.pl/rzeszow/Start.mvc/Index")
    print(open_page.text)
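Two notes that may help when debugging a login like this. A field pair such as SubmitButton.x and SubmitButton.y is what a browser sends for an image button (an input of type "image"): the values are the click coordinates, and servers commonly do not depend on the exact numbers. Also, a misspelled hidden field name, for example ___VIEWSTATE with three underscores, is easy to overlook. The sketch below compares the hidden field names found on a page with the keys of the payload. It uses only the standard library, and the sample HTML, the sample payload and the helper names are made up for illustration.

```python
from html.parser import HTMLParser


class HiddenNames(HTMLParser):
    """Collect the names of all hidden inputs in a page."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.names.add(attrs["name"])


def missing_hidden_fields(html, payload):
    """Return the hidden field names on the page that the payload lacks."""
    parser = HiddenNames()
    parser.feed(html)
    return sorted(parser.names - set(payload))


# Made-up page and payload; the payload key has one underscore too many.
SAMPLE_PAGE = """
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc">
<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="def">
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="ghi">
<input type="text" name="UsernameTextBox">
"""
sample_payload = {
    "___VIEWSTATE": "abc",
    "__VIEWSTATEGENERATOR": "def",
    "__EVENTVALIDATION": "ghi",
    "UsernameTextBox": "myusername",
}
missing = missing_hidden_fields(SAMPLE_PAGE, sample_payload)
print(missing)
```

An empty list means every hidden field on the page has a matching key in the payload.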

Using "requests" (in ipython) to download pdf files

I want to download all the PDF documents corresponding to a list of "API#" values from http://imaging.occeweb.com/imaging/UIC1012_1075.aspx
So far I have managed to post the "API#" request, but I'm not sure what to do next.
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'http://imaging.occeweb.com/imaging/UIC1012_1075.aspx'
API = '15335187'
payload = {'txtIndex7':'1','txtIndex2': API}
session = requests.Session()
res = session.post(url,headers=headers,data=payload)
It is a bit more complicated than that: the form contains additional hidden input fields (for event validation, among others) that you need to take into account. You first need to GET the page, collect all the hidden values, set the value for the API, make the POST request, and then parse the HTML response.
Fortunately, there is a tool called MechanicalSoup that can fill in these hidden fields automatically when it submits the form. Here is a complete solution, including sample code for parsing the resulting table:
import mechanicalsoup

url = 'http://imaging.occeweb.com/imaging/UIC1012_1075.aspx'
API = '15335187'
browser = mechanicalsoup.StatefulBrowser(
    user_agent='Mozilla/5.0'
)
browser.open(url)
# Fill in the search form
browser.select_form('form#Form1')
browser["txtIndex2"] = API
browser.submit_selected("Button1")
# Display the results
for tr in browser.get_current_page().select('table#DataGrid1 tr'):
    print([td.get_text() for td in tr.find_all("td")])
# Variant that also fills in the form number and saves each linked PDF
import urllib.request
import mechanicalsoup

url = 'http://imaging.occeweb.com/imaging/UIC1012_1075.aspx'
Form = '1012'
API = '15335187'
browser = mechanicalsoup.StatefulBrowser(
    user_agent='Mozilla/5.0'
)
browser.open(url)
# Fill in the search form
browser.select_form('form#Form1')
browser["txtIndex7"] = Form
browser["txtIndex2"] = API
browser.submit_selected("Button1")
# Save the PDF linked from the first cell of each result row
for tr in browser.get_current_page().select('table#DataGrid1 tr')[2:]:
    try:
        pdf_url = tr.select('td')[0].find('a').get('href')
    except (IndexError, AttributeError):
        # the row has no cells, or its first cell has no link
        print('Pdf not found')
    else:
        pdf_id = tr.select('td')[0].text
        # Python 3; on Python 2.7 use urllib.urlopen() instead
        response = urllib.request.urlopen(pdf_url)
        pdf_str = "C:\\Data\\" + pdf_id + ".pdf"
        with open(pdf_str, 'wb') as file:
            file.write(response.read())
        print('Pdf ' + pdf_id + ' saved')
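Two details of the download step may need care. The href taken from the table could be relative to the page, and the cell text used as the file name could contain spaces or characters that are not valid in a path. The sketch below handles both with the standard library. The helper name pdf_target, the sample href and the "downloads" directory are assumptions for illustration; whether the site actually uses relative links is not known from the code above.

```python
import re
from pathlib import Path
from urllib.parse import urljoin


def pdf_target(page_url, href, pdf_id, out_dir):
    """Return (absolute PDF URL, local path) for one result row."""
    # urljoin leaves absolute hrefs untouched and resolves relative ones.
    absolute_url = urljoin(page_url, href)
    # Replace anything outside a conservative character set with "_".
    safe_id = re.sub(r"[^A-Za-z0-9._-]+", "_", pdf_id.strip()).strip("_") or "unnamed"
    return absolute_url, Path(out_dir) / f"{safe_id}.pdf"


pdf_url, pdf_path = pdf_target(
    "http://imaging.occeweb.com/imaging/UIC1012_1075.aspx",
    "docs/15335187.pdf",   # hypothetical relative link
    " 15335187 / 1 ",      # hypothetical cell text
    "downloads",
)
print(pdf_url)
print(pdf_path.name)
```

The returned URL can be passed to urllib.request.urlopen and the path to open(..., 'wb') in the loop above.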

Website form login using Python urllib2

I've been trying to learn to use the urllib2 package in Python. I tried to log in as a student (the left form) to a signup page for maths students: http://reg.maths.lth.se/. I have inspected the code (using Firebug), and the left form is evidently submitted using POST with a key called pnr whose value should be a string 10 characters long (the length perhaps cannot be seen from the HTML code, but it is basically my social security number, so I know how long it should be). Note that the action in the header for the appropriate POST method is another URL, namely http://reg.maths.lth.se/login/student.
I tried the following (with a fake pnr in the example below; I used my real number in my own code).
import urllib
import urllib2
url = 'http://reg.maths.lth.se/'
values = dict(pnr='0000000000')
data = urllib.urlencode(values)
req = urllib2.Request(url,data)
resp = urllib2.urlopen(req)
page = resp.read()
print page
While this executes, the print is the source code of the original page http://reg.maths.lth.se/, so it doesn't seem like I logged in. Also, I could add any key/value pairs to the values dictionary and it doesn't produce any error, which seems strange to me.
Also, if I go to the page http://reg.maths.lth.se/login/student, there is clearly no POST method for submitting data.
Any suggestions?
If you inspect the request that is sent to the server when you enter the number and submit the form, you will notice that it is a POST request with pnr and _token parameters.
You are missing the _token parameter, which you need to extract from the HTML source of the page. It is a hidden input element:
<input name="_token" type="hidden" value="WRbJ5x05vvDlzMgzQydFxkUfcFSjSLDhknMHtU6m">
I suggest looking into tools like Mechanize, MechanicalSoup or RoboBrowser, which make form submission easier. You may also parse the HTML yourself with an HTML parser such as BeautifulSoup, extract the token, and send it via urllib2 or requests:
import requests
from bs4 import BeautifulSoup

PNR = "0000000000"  # fake 10-character number, as in the question
url = "http://reg.maths.lth.se/"
login_url = "http://reg.maths.lth.se/login/student"
with requests.Session() as session:
    # extract token
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    token = soup.find("input", {"name": "_token"})["value"]
    # submit form
    session.post(login_url, data={
        "_token": token,
        "pnr": PNR
    })
    # navigate to the main page again (should be logged in)
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.title)
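For readers who would rather stay close to the urllib approach from the question, here is a minimal Python 3 sketch of how the POST request with both parameters could be built. It only constructs the request and does not send it. The function name build_login_request and the token value are placeholders; a real token has to be read from the page first.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(login_url, token, pnr):
    """Build (but do not send) the form POST with both parameters."""
    body = urlencode({"_token": token, "pnr": pnr}).encode("ascii")
    # Passing data makes this a POST; method="POST" states it explicitly.
    return Request(login_url, data=body, method="POST")


req = build_login_request(
    "http://reg.maths.lth.se/login/student",
    "example-token",   # placeholder, not a real token
    "0000000000",      # fake pnr, as in the question
)
print(req.get_method(), req.full_url)
print(req.data)
```

The token is probably tied to the session cookie set by the first GET, so to send this with urllib the same opener, built with urllib.request.build_opener(urllib.request.HTTPCookieProcessor()), would have to be used for both the GET and the POST. requests.Session handles that automatically.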
