I am writing a program that should read certain data (from a table) on a website and output only that data. However, I ran into a problem. My program logs into the website, but from there I have to navigate to another page and then open the document that contains the data. Unfortunately, I have no idea how to move to that next page, open the document, and read the data.
Does anyone have an idea how I could get there?
from bs4 import BeautifulSoup
import requests

User = ''
Pass = ''
LOGIN_URL = ''
LOGIN_API_URL = ''

def main():
    session_requests = requests.session()
    result = session_requests.get(LOGIN_URL)
    cookies = result.cookies
    soup = BeautifulSoup(result.content, "html.parser")
    auth_token = soup.find("input", {'name': 'logintoken'}).get('value')
    payload = {'username': User, 'password': Pass, 'logintoken': auth_token}
    result = session_requests.post(
        LOGIN_API_URL,
        data=payload,
        cookies=cookies
    )
    # Report successful login
    print("Login succeeded: ", result.ok)
    print("Status code:", result.status_code)
    print(result.text)
    # Get Data
    # Close Session (close the session we used, not a fresh one)
    session_requests.close()
    print('Session closed')

# Entry point
if __name__ == '__main__':
    main()
You should look into Selenium with Python. Since you gave no specific URL or login details (which you shouldn't post here anyway), it would be quite hard for any of us to create a working example.
Try Selenium, and if you have any questions or run into any issues from there, come back and ask that specific question.
BS4 and requests can be powerful, but Selenium drives a real web browser and lets you move through websites like a "human" would. Start there.
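If the pages do not depend on JavaScript, you may not need Selenium for the second half of the problem: a requests session keeps its cookies, so the same logged-in session can usually fetch the next page as well. Once you have that page's HTML, reading the table is a parsing step. Below is a rough sketch of that step using only the standard library. The names TableParser, parse_table and sample_html are made up for illustration, and it assumes a plain table of tr/td/th elements with no nesting or colspan tricks.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being built, or None
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html):
    """Return all table rows found in html as lists of cell strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical HTML standing in for the page behind the login
sample_html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>12</td></tr>
  <tr><td>Bob</td><td>7</td></tr>
</table>
"""
rows = parse_table(sample_html)
print(rows)
```

In the question's code you would pass the text of the page fetched with session_requests (for example session_requests.get(next_url).text, where next_url is whatever page holds the table) instead of sample_html.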
I am trying to use BeautifulSoup to scrape a post's data with the code below,
but the login fails, so the scraper returns the text of all posts, including the header message (the text that asks you to log in).
How can I modify the code so that it returns the info for the specific post with that ID rather than for all posts? Thanks!
import requests
from bs4 import BeautifulSoup

class faceBookBot():
    login_basic_url = "https://mbasic.facebook.com/login"
    login_mobile_url = 'https://m.facebook.com/login'
    payload = {
        'email': 'XXXX@gmail.com',
        'pass': "XXXX"
    }
    post_ID = ""

    # login to facebook and redirect to the link with specific post
    # I guess something wrong happen in below function
    def parse_html(self, request_url):
        with requests.Session() as session:
            post = session.post(self.login_basic_url, data=self.payload)
            parsed_html = session.get(request_url)
        return parsed_html

    # scrape the post all <p> which is the paragraph/content part
    def post_content(self):
        REQUEST_URL = f'https://m.facebook.com/story.php?story_fbid={self.post_ID}&id=7724542745'
        soup = BeautifulSoup(self.parse_html(REQUEST_URL).content, "html.parser")
        content = soup.find_all('p')
        post_content = []
        for lines in content:
            post_content.append(lines.text)
        post_content = ' '.join(post_content)
        return post_content

bot = faceBookBot()
bot.post_ID = "10158200911252746"
You can't: Facebook encrypts the password, and since you don't have the encryption they use, the server will never accept it. Save your time and find another way.
@AnsonChan yes, you could open the page with Selenium, log in, and then copy its cookies to requests:
from selenium import webdriver
import requests

driver = webdriver.Chrome()
driver.get('http://facebook.com')
# log in manually, or automate it.
# when logged in:
session = requests.session()
# copy the browser's cookies into the requests session
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])
driver.quit()
# get the page you want with requests
response = session.get('https://m.facebook.com/story.php?story_fbid=123456789')
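A browser session usually holds cookies for several domains, so you may want to copy only the ones that belong to the site you are scraping. Selenium's driver.get_cookies() returns a list of dicts with keys such as name, value and domain, so the filtering can be a small helper. This is only a sketch: cookies_for_domain and the fake_cookies list are made up for illustration, and the list stands in for real browser output.

```python
def cookies_for_domain(selenium_cookies, domain_suffix):
    """Return {name: value} for cookies whose domain matches domain_suffix."""
    selected = {}
    for cookie in selenium_cookies:
        # Cookie domains often carry a leading dot, e.g. ".example.com"
        domain = cookie.get("domain", "").lstrip(".")
        if domain == domain_suffix or domain.endswith("." + domain_suffix):
            selected[cookie["name"]] = cookie["value"]
    return selected


# Hand-written stand-in for what driver.get_cookies() returns
fake_cookies = [
    {"name": "sid", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "lang", "value": "en", "domain": "www.example.com", "path": "/"},
    {"name": "tracker", "value": "zzz", "domain": ".ads.test", "path": "/"},
]
selected = cookies_for_domain(fake_cookies, "example.com")
print(selected)
```

The resulting dict can then be handed to the requests session with session.cookies.update(selected).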
I am trying to log in to a website using a requests POST before I scrape some data.
The login does not seem to work: the data I get before and after I log in does not differ. However, if I log in manually using my browser, the data before and after login is different; for example, I can see my profile on the main page. I have also put the login in a try-except block to see if it raises any exceptions, but with no luck.
I have checked and made sure I am inputting all the 'form data' requested by login on the page.
Any suggestions would be greatly appreciated.
My code is below:
import urllib
import requests
from bs4 import BeautifulSoup as soup

POST_LOGIN_URL = 'https://wex.nz/login'
REQUEST_URL = 'https://wex.nz'
payload = {'email': 'testemail@gmail.com', 'password': 'testpassword'}

with requests.Session() as session:
    try:
        post = session.post(POST_LOGIN_URL, data=payload, headers={"Referer": "https://wex.nz"})
    # only network-level errors raise here; a rejected login does not
    except requests.RequestException:
        print('login failed')
    r = session.get(REQUEST_URL)
    page_html = r.content
    page_soup = soup(page_html, "html.parser")
    profile_container = page_soup.findAll("div", {"class": "profile"})
    print(profile_container)
I've been trying to learn to use the urllib2 package in Python. I tried to log in as a student (the left form) to a signup page for maths students: http://reg.maths.lth.se/. I have inspected the code (using Firebug) and the left form should obviously be submitted using POST with a key called pnr whose value should be a string 10 characters long (the last part can perhaps not be seen from the HTML code, but it is basically my social security number so I know how long it should be). Note that the action in the header for the appropriate POST method is another URL, namely http://reg.maths.lth.se/login/student.
I tried (with a fake pnr in the example below, but I used my real number in my own code).
import urllib
import urllib2
url = 'http://reg.maths.lth.se/'
values = dict(pnr='0000000000')
data = urllib.urlencode(values)
req = urllib2.Request(url,data)
resp = urllib2.urlopen(req)
page = resp.read()
print page
While this executes, the print is the source code of the original page http://reg.maths.lth.se/, so it doesn't seem like I logged in. Also, I could add any key/value pairs to the values dictionary and it doesn't produce any error, which seems strange to me.
Also, if I go to the page http://reg.maths.lth.se/login/student, there is clearly no POST method for submitting data.
Any suggestions?
If you inspect the request that is sent to the server when you enter the number and submit the form, you will notice that it is a POST request with pnr and _token parameters.
You are missing the _token parameter which you need to extract from the HTML source of the page. It is a hidden input element:
<input name="_token" type="hidden" value="WRbJ5x05vvDlzMgzQydFxkUfcFSjSLDhknMHtU6m">
I suggest looking into tools like Mechanize, MechanicalSoup or RoboBrowser that would ease the form submission. You may also parse the HTML with an HTML parser, like BeautifulSoup yourself, extract the token and send via urllib2 or requests:
import requests
from bs4 import BeautifulSoup

PNR = "00000000"
url = "http://reg.maths.lth.se/"
login_url = "http://reg.maths.lth.se/login/student"

with requests.Session() as session:
    # extract token
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    token = soup.find("input", {"name": "_token"})["value"]
    # submit form
    session.post(login_url, data={
        "_token": token,
        "pnr": PNR
    })
    # navigate to the main page again (should be logged in)
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.title)
If I want to scrape a website that requires login with password first, how can I start scraping it with python using beautifulsoup4 library? Below is what I do for websites that do not require login.
from bs4 import BeautifulSoup
import urllib2
url = urllib2.urlopen("http://www.python.org")
content = url.read()
soup = BeautifulSoup(content)
How should the code be changed to accommodate login? Assume that the website I want to scrape is a forum that requires login. An example is http://forum.arduino.cc/index.php
You can use mechanize:
import mechanize
from bs4 import BeautifulSoup
import urllib2
import cookielib ## http.cookiejar in python3
cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("https://id.arduino.cc/auth/login/")
br.select_form(nr=0)
br.form['username'] = 'username'
br.form['password'] = 'password.'
br.submit()
print br.response().read()
Or urllib - Login to website using urllib2
There is a simpler way, from my pov, that gets you there without Selenium, mechanize, or other third-party tools, although it is only semi-automated.
Basically, when you log in to a site the normal way, you identify yourself with your credentials, and that identity is then stored in cookies and headers and reused for every later interaction, for a limited period of time.
What you need to do is send the same cookies and headers with your HTTP requests, and you'll be in.
To replicate that, follow these steps:
1. In your browser, open the developer tools.
2. Go to the site and log in.
3. After the login, go to the Network tab and refresh the page.
4. At this point you should see a list of requests, the top one being the actual site. That one is our focus, because it carries the identity data we can reuse from Python and BeautifulSoup.
5. Right-click the site request (the top one), hover over "Copy", and choose "Copy as cURL".
6. Go to this site, which converts cURL into Python requests code: https://curl.trillworks.com/
7. Take the Python code and use the generated cookies and headers to proceed with the scraping.
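To give an idea of what the final step boils down to, here is a small sketch using only the standard library. It assumes you copied a Cookie header value from the browser. The cookie names and values, the URL and the helper cookie_header_to_dict are all invented for illustration, and SimpleCookie may reject some unusual cookie values, so treat it as a starting point rather than a finished tool.

```python
from http.cookies import SimpleCookie
import urllib.request


def cookie_header_to_dict(header_value):
    """Split a copied 'Cookie:' header value into a {name: value} dict."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    return {name: morsel.value for name, morsel in cookie.items()}


# Placeholder values standing in for what you copied from the Network tab
copied_cookie_header = "sessionid=abc123; csrftoken=xyz789"
copied_headers = {
    "User-Agent": "Mozilla/5.0",
    "Cookie": copied_cookie_header,
}

cookies = cookie_header_to_dict(copied_cookie_header)
print(cookies)

# Build a request that carries the same identity as the browser session.
# Nothing is sent until you call urllib.request.urlopen(request).
request = urllib.request.Request("https://example.com/members", headers=copied_headers)
print(request.get_header("Cookie"))
```

With requests, the same dict can be passed as cookies=cookies to session.get. Keep in mind that the copied cookies expire, so you have to repeat the copy step from time to time.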
If you go for selenium, then you can do something like below:
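from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait

# If you want to open Chrome
driver = webdriver.Chrome()
# If you want to open Firefox instead
# driver = webdriver.Firefox()

# open the login page first (placeholder URL, replace with the real one)
driver.get("http://example.com/login")

# the find_element_by_* helpers were removed in recent Selenium 4 releases;
# find_element(By.ID, ...) is the current form
username = driver.find_element(By.ID, "username")
password = driver.find_element(By.ID, "password")
username.send_keys("YourUsername")
password.send_keys("YourPassword")
driver.find_element(By.ID, "submit_btn").click()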
However, if you're adamant that you're only going to use BeautifulSoup, you can do that with a library like requests or urllib. Basically all you have to do is POST the data as a payload with the URL.
import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'
data = {
    'username': 'your_username',
    'password': 'your_password'
}

with requests.Session() as s:
    response = s.post(login_url, data)
    print(response.text)
    index_page = s.get('http://example.com')
    soup = BeautifulSoup(index_page.text, 'html.parser')
    print(soup.title)
You can use selenium to log in and retrieve the page source, which you can then pass to Beautiful Soup to extract the data you want.
Since the Python version wasn't specified, here is my take on it for Python 3, done without any external libraries. After login, use BeautifulSoup as usual, or any other kind of scraping.
The script is also on my GitHub.
The whole script is replicated below, as per StackOverflow guidelines:
# Login to website using just Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar

def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes URL that's found inside form action='.....'
    # adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm a login at end
    check_string = 'Logout'

    ####### rest of the script is logic
    # but you will need to tweak couple things maybe regarding "token" logic
    # (can be _token or token or _token_ or secret ... etc)

    # big thing! you need a referer for most pages! and correct headers are the key
    headers={"Content-Type":"application/x-www-form-urlencoded",
             "User-agent":"Mozilla/5.0 Chrome/81.0.4044.92", # Chrome 80+ as per web search
             "Host":base_url,
             "Origin":https_base_url,
             "Referer":https_base_url}

    # initiate the cookie jar (using : http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # first a simple request, just to get login page and parse out the token
    # (using : urllib.request)
    request = urllib.request.Request(https_base_url)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # parse the page, we look for token eg. on my page it was something like this:
    # <input type="hidden" name="_token" value="random1234567890qwertzstring">
    # this can probably be done better with regex and similar
    # but I'm newb, so bear with me
    html = contents.decode("utf-8")
    # text just before start and just after end of your token string
    mark_start = '<input type="hidden" name="_token" value="'
    mark_end = '">'
    # index of those two points
    start_index = html.find(mark_start) + len(mark_start)
    end_index = html.find(mark_end, start_index)
    # and text between them is our token, store it for second step of actual login
    token = html[start_index:end_index]

    # here we craft our payload, it's all the form fields, including HIDDEN fields!
    # that includes token we scraped earlier, as that's usually in hidden fields
    # make sure left side is from "name" attributes of the form,
    # and right side is what you want to post as "value"
    # and for hidden fields make sure you replicate the expected answer,
    # eg. "token" or "yes I agree" checkboxes and such
    payload = {
        '_token':token,
        # 'name':'value', # make sure this is the format of all additional fields !
        'login':username,
        'password':password
    }

    # now we prepare all we need for login
    # data - with our payload (user/pass/token) urlencoded and encoded as bytes
    data = urllib.parse.urlencode(payload)
    binary_data = data.encode('UTF-8')
    # and put the URL + encoded data + correct headers into our POST request
    # btw, despite what I thought it is automatically treated as POST
    # I guess because of byte encoded data field you don't need to say it like this:
    # urllib.request.Request(authentication_url, binary_data, headers, method='POST')
    request = urllib.request.Request(authentication_url, binary_data, headers)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # just for kicks, we confirm some element in the page that's secure behind the login
    # we use a particular string we know only occurs after login,
    # like "logout" or "welcome" or "member", etc. I found "Logout" is pretty safe so far
    contents = contents.decode("utf-8")
    index = contents.find(check_string)
    # if we find it
    if index != -1:
        print(f"We found '{check_string}' at index position : {index}")
    else:
        print(f"String '{check_string}' was not found! Maybe we did not login ?!")

scraper_login()
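The string search for the token in the script above only works when the input tag is written with its attributes in exactly that order. A somewhat sturdier alternative that still needs nothing outside the standard library is to let html.parser collect the hidden inputs. This is a sketch, not a drop-in replacement: HiddenInputParser, hidden_fields and the sample form are made up for illustration.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        # Attribute values can be None when an attribute has no value
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html):
    """Return a dict of every hidden form field found in html."""
    parser = HiddenInputParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Made-up login form; note the attribute order differs from mark_start above
sample = '''<form action="/login" method="post">
<input value="random1234567890qwertzstring" name="_token" type="hidden">
<input type="text" name="login">
<input type="password" name="password">
</form>'''
fields = hidden_fields(sample)
print(fields["_token"])
```

In the script, hidden_fields(html) could be used to build the start of the payload dict, and the login and password entries would be added on top of it.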