Here is my code:
response = requests.get('URL HERE', headers=header, params=param, cookies=cookie)
print(response.content)
I also tried this:
print(response.text)
But both return this:
'<script>window.location="URL HERE";</script>'
All I want is to get the HTML of the page.
Any ideas?
EDIT:
I don't know exactly how I did it, but I got the header and the cookie from the website again, plugged them in, and it worked. I think it might be because I use Firefox and it had just updated, so maybe the header changed.
That is the HTML of the whole page. Unfortunately, the page seems to require JavaScript, which your approach does not support. I don't know much about JS, but the snippet you get back is a script that redirects the browser to a different page.
You could use an approach based on Selenium if the website only works when JS is enabled.
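For instance, a minimal Selenium sketch (assuming the selenium package and a Chrome driver are installed; 'URL HERE' is the same placeholder as above):
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('URL HERE')
print(driver.page_source)  # the HTML after JavaScript (including the redirect) has run
driver.quit()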
The website seems to redirect people; it redirected me when I went to the link, so when you get the URL, make sure redirects are allowed. (requests already follows HTTP redirects by default for GET requests, but you can make it explicit.)
Do this:
response = requests.get('URL HERE', headers=header, params=param, cookies=cookie, allow_redirects=True)
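If you want to see which redirects requests actually followed, inspect the response history (a small sketch, assuming response is the object from the call above):
for hop in response.history:
    print(hop.status_code, hop.url)  # each intermediate HTTP redirect
print(response.status_code, response.url)  # the final destination
Keep in mind this only covers HTTP redirects; a JavaScript redirect like the one above will never show up here.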
Related
I understand there are similar questions out there; however, I couldn't get this code to work. Does anyone know how to log in and scrape the data from this website?
from bs4 import BeautifulSoup
import requests
# Start the session
session = requests.Session()
# Create the payload
payload = {'login': <USERNAME>,
           'password': <PASSWORD>}
# Post the payload to the site to log in
s = session.post("https://www.beeradvocate.com/community/login", data=payload)
# Navigate to the next page and scrape the data
s = session.get('https://www.beeradvocate.com/place/list/?c_id=AR&s_id=0&brewery=Y')
soup = BeautifulSoup(s.text, 'html.parser')
soup.find('div', class_='titleBar')
print(soup)
The process is different for almost every site; the best way to figure it out is to use your browser's request inspector (in Firefox, the Network tab of the developer tools) and watch how the site behaves when you try to log in.
For your website, clicking the login button sends a POST request to https://www.beeradvocate.com/community/login/login; with a little trial and error you should be able to replicate it.
Make sure you match the Content-Type and the request headers (specifically cookies, in case you need auth tokens).
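For instance, a rough sketch of replicating that request with a session (the 'login' and 'password' field names are taken from the question's payload and are assumptions; copy the exact field names, including any hidden token fields, from the request your browser records):
import requests

session = requests.Session()
payload = {'login': <USERNAME>,
           'password': <PASSWORD>}
headers = {'User-Agent': 'Mozilla/5.0'}  # mirror the headers your browser sends
s = session.post('https://www.beeradvocate.com/community/login/login', data=payload, headers=headers)
s = session.get('https://www.beeradvocate.com/place/list/?c_id=AR&s_id=0&brewery=Y')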
I'm trying to make a web scraper using Python. The website has a login form though and I've been trying to log in for a few days with no results. The code looks like this:
session_requests = requests.Session()
r = session_requests.get(login_url, headers=dict(referer=login_url))
print(r.content)
tree = html.fromstring(r.text)
authenticity_token = list(set(tree.xpath('//input[@name="_csrf_token"]/@value')))[0]
payload = {"_csrf_token": authenticity_token, "_username": "-username-", "_password": "-password-",}
r = session_requests.post(login_url, data=payload, headers=dict(referer=login_url))
print(r.content)
You can see I print out r.content both before and after posting to the login page, and in theory I should get different outputs (because the second one should be the content of the actual web page after the login), but unfortunately I get the exact same output.
(Screenshot of the login form's required fields omitted.)
Also, I know for sure that the _csrf_token is correct because I have tested it a few times, so no doubts about that part.
Another thing that might be useful: I don't think I really need to include the headers because the outputs are exactly the same with or without them (I include them just because). Thanks in advance.
Edit: the URL is https://nuvola.madisoft.it/login
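One thing worth checking, sketched below: the form may post to a different URL than the login page itself, so it can help to read the form's action attribute and post there (the //form/@action selector is an assumption about the page's markup):
from lxml import html
import requests

session_requests = requests.Session()
r = session_requests.get('https://nuvola.madisoft.it/login')
tree = html.fromstring(r.text)
token = tree.xpath('//input[@name="_csrf_token"]/@value')[0]
action = tree.xpath('//form/@action')[0]  # the form's real target
payload = {'_csrf_token': token, '_username': '-username-', '_password': '-password-'}
r = session_requests.post(requests.compat.urljoin(r.url, action), data=payload)
print(r.url, r.status_code)  # a changed URL usually means the login worked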
Here's some more useful stuff:
I'm trying to crawl a website using the requests library. However, the particular website I am trying to access (http://www.vi.nl/matchcenter/vandaag.shtml) has a very intrusive cookie statement.
I am trying to access the website as follows:
from bs4 import BeautifulSoup as soup
import requests
website = r"http://www.vi.nl/matchcenter/vandaag.shtml"
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
htmlsoup = soup(html.text, "html.parser")
This returns a web page that consists of just the cookie statement with a big button to accept. If you try accessing this page in a browser, you find that pressing the button redirects you to the requested page. How can I do this using requests?
I considered using mechanize.Browser but that seems a pretty roundabout way of doing it.
Try setting:
cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies)
This will bypass the cookie consent page and land you straight on the page you requested.
Note: you can find the cookie above by analyzing the JavaScript that runs on the cookie consent page; it is a bit obfuscated, but it should not be difficult to follow. If you run into the same kind of problem again, take a look at what cookies the JavaScript executed by the relevant event handler sets.
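A quick way to tell the two kinds apart in requests: only cookies sent by the server in Set-Cookie headers show up on the response, so anything your browser has that is missing there must have been set client-side by JavaScript. A small sketch:
import requests
r = requests.get('http://www.vi.nl/matchcenter/vandaag.shtml', headers={'User-Agent': 'Mozilla/5.0'})
print(r.cookies.get_dict())  # only server-set (Set-Cookie) cookies appear here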
I have found this SO question which asks how to send cookies in a post using requests. The accepted answer states that the latest build of Requests will build CookieJars for you from simple dictionaries. Below is the POC code included in the original answer.
import requests
cookie = {'enwiki_session': '17ab96bd8ffbe8ca58a78657a918558'}
r = requests.post('http://wikipedia.org', cookies=cookie)
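The same idea works with a Session, which then sends the cookie on every request it makes (a sketch reusing the value from above):
import requests
s = requests.Session()
s.cookies.set('enwiki_session', '17ab96bd8ffbe8ca58a78657a918558')
r = s.post('http://wikipedia.org')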
I have the following script:
import requests
import cookielib
jar = cookielib.CookieJar()
login_url = 'http://www.whispernumber.com/signIn.jsp?source=calendar.jsp'
acc_pwd = {'USERNAME':'myusername',
'PASSWORD':'mypassword'
}
r = requests.get(login_url, cookies=jar)
r = requests.post(login_url, cookies=jar, data=acc_pwd)
page = requests.get('http://www.whispernumber.com/calendar.jsp?day=20150129', cookies=jar)
print page.text
But the print page.text is showing that the site is trying to forward me back to the login page:
<script>location.replace('signIn.jsp?source=calendar.jsp');</script>
I have a feeling this is because of the JSP, and I'm not sure how to log in to a JavaScript page. Thanks for the help!
Firstly, you're posting to the wrong page. If you view the HTML at your link, you'll see the form is as follows:
<form action="ValidatePassword.jsp" method="post">
Assuming you're correctly authenticated you will probably get a cookie back that you can use for subsequent page requests. (You seem to be thinking along the right lines.)
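For instance, a sketch that posts to the form's real target and lets a Session carry the cookie (the USERNAME/PASSWORD field names are taken from the question's payload; the action URL comes from the form above):
import requests

session = requests.Session()  # the session stores any cookie the login sets
login_url = 'http://www.whispernumber.com/ValidatePassword.jsp'
acc_pwd = {'USERNAME': 'myusername', 'PASSWORD': 'mypassword'}
session.post(login_url, data=acc_pwd)
page = session.get('http://www.whispernumber.com/calendar.jsp?day=20150129')
print(page.text)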
Requests isn't a web browser; it is an HTTP client, and it simply grabs the raw text of the page. You are going to want to use something like Selenium or another headless browser to programmatically log in to a site.
Noob here. Let's say I want to download a .mp3 file from a website like youtube.com or hypem.com. How do I go about it? I know how to open a web page (with requests) and how to parse it (with Beautiful Soup). But after these steps, I really don't know what to do. How do you find the SOURCE of the file?
Let's take, for example, this script: https://github.com/fzakaria/HypeScript/blob/master/hypeme.py
I understand most of it except this part:
serve_url = "http://hypem.com/serve/source/{}/{}".format(id, key)
request = urllib2.Request(serve_url, "" , {'Content-Type': 'application/json'})
request.add_header('cookie', cookie)
response = urllib2.urlopen(request)
song_data_json = response.read()
response.close()
song_data = json.loads(song_data_json)
url = song_data[u"url"]
First, how did he find out that this URL would serve the song?
"http://hypem.com/serve/source/{}/{}".format(id, key)
Then there is this line, and I have no idea what it is for:
request = urllib2.Request(serve_url, "" , {'Content-Type': 'application/json'})
So my question: where do you find the link or the information needed to download a file that isn't meant to be downloaded (e.g. on YouTube)? How do you find the SOURCE of the file?
To answer your first question: web scraping involves a lot of reverse engineering. I'm guessing whoever wrote the script studied the site they were scraping and figured out what the URLs for the songs look like.
As for your second question: a Request object is being built before opening the URL in order to add a custom header (Content-Type) to the request.
Some general, unasked-for advice: have a look at the requests library. It is MUCH simpler to use than urllib2. The above code using requests would become:
import requests
serve_url = "http://hypem.com/serve/source/{}/{}".format(id, key)
# cookies is a simple key/value dictionary
response = requests.get(serve_url, headers={'Content-Type': 'application/json'}, cookies=cookies)
song_data = response.json()
url = song_data["url"]
Much cleaner and simpler to understand IMHO.