I am a novice in web-scraping and web-things in general (but pretty much used to Python), and I'd like to understand how it works to integrate a website search in a bioinformatics research tool.
Goal: retrieve the output of the form on http://www.lovd.nl/3.0/search
import mechanicalsoup
# Connect to LOVD
browser = mechanicalsoup.StatefulBrowser()
browser.open("http://www.lovd.nl/3.0/search")
# Fill-in the search form
browser.select_form('#websitevariantsearch')
browser["variant"] = "chr15:g.40699840C>T"
browser.submit_selected()
# Display the results
print(browser.get_current_page())
In the output I get the very same page ( http://www.lovd.nl/3.0/search). I tried with standard requests but I get another kind of error:
from requests import get, Session
url="http://www.lovd.nl/3.0/search"
formurl = "http://www.lovd.nl/3.0/ajax/search_variant.php"
client = Session()
#get the csrf
soup = BeautifulSoup(client.get(url).text, "html.parser")
csrf = soup.select('form input[name="csrf_token"]')[0]['value']
form_data = {
"search": "",
"csrf_token": csrf,
"build": "hg19",
"variant": "chr15:g.40699840C>T"
}
response = get(formurl, data=form_data)
html=response.content
return html
...and this returns only an
alert("Error while sending data.");
The form_data fields were took from the XHR request (from developer -> network tab).
I can see that the data is sent asynchronously via ajax but I do not understand the practical implications of this information.
Need some guidance
MechanicalSoup does not do JavaScript. The website you are trying to browse has:
<form id="websitevariantsearch"
action=""
onsubmit="if ...">
There's no action in the sense of traditional HTML forms, but there's a piece of JavaScript executed on submission. MechanicalSoup won't help here. Selenium may work: http://mechanicalsoup.readthedocs.io/en/stable/faq.html#how-does-mechanicalsoup-compare-to-the-alternatives
Related
I understand there are similar questions out there, however, I couldn't make this code to work out. Does anyone know how to login and scrape the data from this website?
from bs4 import BeautifulSoup
import requests
# Start the session
session = requests.Session()
# Create the payload
payload = {'login':<USERNAME>,
'password':<PASSWORD>
}
# Post the payload to the site to log in
s = session.post("https://www.beeradvocate.com/community/login", data=payload)
# Navigate to the next page and scrape the data
s = session.get('https://www.beeradvocate.com/place/list/?c_id=AR&s_id=0&brewery=Y')
soup = BeautifulSoup(s.text, 'html.parser')
soup.find('div', class_='titleBar')
print(soup)
The process is different for almost each site, the best way to know how to do it is to use your browser's request inspector (firefox) and look at how the site behaves when you try to login.
For your website, when you click the login button a post request is sent to https://www.beeradvocate.com/community/login/login, with a little bit of trial and error your should be able to replicate it.
Make sure you match the content-type and request headers (specifically cookies in case you need auth tokens).
Scraping data of mortgage from official mortgage registry. The problem is that I can't extract the html of particular document. Everything happens on POST behalf - I have all of the data required to precise the POST request, but still when i'm printing the request.url it shows me the welcome screen page. It should retrieve html from particular document. All data like number of mortgage or current page are listed in dev tools > netowrk > Form Data, so I bet it must be possible. I'm quite new in web python so I will apprecaite any help.
My code:
import requests
data = {
'kodWydzialu':'PT1R',
'nrKw':'00037314',
'cyfraK':'9',
}
r = requests.post('https://przegladarka-ekw.ms.gov.pl/eukw_prz/KsiegiWieczyste/wyszukiwanieKW', data=data)
print(r.url), print(r.content)
You are getting the welcome screen because you aren't sending all the requests required to view the next page.
Go to Chrome > Network tabs, and you will see that when you click the submit/search button, a bunch of other GET requests are being sent to different URLs after that first POST request.
You need to replicate that in your script. Depending upon the website it can be tough to get the response, so you should consider using Selenium
That said, it's not impossible to do this with requests:
session = requests.Session()
You need to send the POST request, and all other GET requests that follow in the same session.
data = {
'kodWydzialu':'PT1R',
'nrKw':'00037314',
'cyfraK':'9',
}
session.post(URL, headers=headers, params=data)
# Start sending the GET requests
session.get(URL_1, headers=headers)
session.get(URL_2, headers=headers)
I am attempting to scrape a website using the following code
import re
import requests
def get_csrf(page):
matchme = r'name="csrfToken" value="(.*)" /'
csrf = re.search(matchme, str(page))
csrf = csrf.group(1)
return csrf
def login():
login_url = 'https://www.edline.net/InterstitialLogin.page'
with requests.Session() as s:
login_page = s.get(login_url)
csrf = get_csrf(login_page.text)
username = 'USER'
password = 'PASS'
login = {'screenName': username,
'kclq': password,
'csrfToken': csrf,
'TCNK':'authenticationEntryComponent',
'submitEvent':'1',
'enterClicked':'true',
'ajaxSupported':'yes'}
page = s.post(login_url, data=login)
r = s.get("https://www.edline.net/UserDocList.page?")
print(r.text)
login()
Where I log into https://www.edline.net/InterstitialLogin.page, which is successful, but the problem I have is when I try to do
r = s.get("https://www.edline.net/UserDocList.page?")
print(r.text)
It doesn't print the expected page, instead it throws an error. Upon further testing I discovered that it throws this error even if you try to go directly to the page from a browser. So when I investigated the page source I found that the button used to link to the page I'm trying to scrape uses the following code
Private Reports
So essentially I am looking for a way to trigger the above javascript code in python in order to scrape the resulting page.
It is impossible to answer this question without having more context than this single link.
However, the first thing you want to check, in the case of javaScript driven content generation, are the requests made by your web page when clicking on that link.
To do this, take a look at the network-panel in the console of your browser. Record the requests being made, look especially for XHR-requests. Then, you can try to replicate this e.g. with the requests library.
content = requests.get('xhr-url')
I have the following script:
import requests
import cookielib
jar = cookielib.CookieJar()
login_url = 'http://www.whispernumber.com/signIn.jsp?source=calendar.jsp'
acc_pwd = {'USERNAME':'myusername',
'PASSWORD':'mypassword'
}
r = requests.get(login_url, cookies=jar)
r = requests.post(login_url, cookies=jar, data=acc_pwd)
page = requests.get('http://www.whispernumber.com/calendar.jsp?day=20150129', cookies=jar)
print page.text
But the print page.text is showing that the site is trying to forward me back to the login page:
<script>location.replace('signIn.jsp?source=calendar.jsp');</script>
I have a feeling this is because of the jsp, and am not sure how to login to a java script page? Thanks for the help!
Firstly you're posting to the wrong page. If you view the HTML from your link you'll see the form is as follows:
<form action="ValidatePassword.jsp" method="post">
Assuming you're correctly authenticated you will probably get a cookie back that you can use for subsequent page requests. (You seem to be thinking along the right lines.)
Requests isn't a web browser, it is an http client, it simply grabs the raw text from the page. You are going to want to use something like Selenium or another headless browser to programatically login to a site.
So I am trying to write a script that that submits a form that contains two fields for a username and password in a POST request, but the site responds with:
"This system requires the use of HTTP cookies to verify authorization information. Our system has detected that your browser has disabled HTTP cookies, or does not support them."
*EDIT: So I believe with the new modified code below that I can successfully login to the page. The only thing is that when I print out the page's html text to the terminal it only displays an html element and a head element that contains the url of the page; however, ive inspected the actual html of page when i log in and there is a lot missing, anyone know why this might be?
import requests
url = "https://someurl"
payload = {
'username': 'myname',
'password': '1234'
}
headers = {
'User-Agent': 'Mozilla/5.0'
}
session = requests.Session()
page = session.post(url, data=payload)
Without the precise URL it is very hard to give you an answer.
Many Web pages are dynamically built through JavaScript calls. The execution of the JavaScript will create a DOM that is rendered. If it's the case for the site you are looking at, you will get only the raw HTML response with Python but not the rendered DOM. You need something which actually executes the JS to get the final DOM. For example, SlimerJS