I'm trying to scrape an email address from a webpage. When an email address is available on a page like this, an email icon is shown. However, I can't fetch it using the script below. What I get instead is this link: https://www.yell.com/customerneeds/sendenquiry/sendtoone/100040736756000120.
I've tried with:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
base = "https://www.yell.com"
link = "https://www.yell.com/biz/east-london-only-london-901717573/"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}
r = requests.get(link,headers=headers)
soup = BeautifulSoup(r.text,"lxml")
email = urljoin(base,soup.select_one("a[data-tracking='ENQUIRY:SEND']")["href"])
print(email)
How can I fetch the email address from that page?
There is no email address on that page. This is a typical way of making contact possible without exposing an email address to the public.
What happens when you press the "Send enquiry" button is that your browser sends an HTTP POST request to some address* on a web server, which then handles your enquiry. The web server might forward your enquiry to some email address, but it might just as well not. For example, it might simply add an entry to a database, and a user might later read your enquiry through a web interface.
* You can check this yourself by opening the browser developer tools and watching the Network tab while pressing the "Send enquiry" button. I did not want to send junk data to them just to see where it goes.
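If you still want to check whether a page exposes an address directly, a minimal sketch (assuming the site would use a plain mailto: link whenever an address is public) could look like this:
import requests
from bs4 import BeautifulSoup

link = "https://www.yell.com/biz/east-london-only-london-901717573/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}

r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.text, "lxml")

# Look for a plain mailto: link; if the site only offers an enquiry form,
# this finds nothing and we fall back to the message below.
mailto = soup.select_one("a[href^='mailto:']")
if mailto:
    print(mailto["href"].replace("mailto:", ""))
else:
    print("No email address exposed on this page.")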
I am trying to log in to a site with a POST request, then navigate to another page and scrape the HTML data from that second page.
However, the website is not accepting the payload I am sending, and the script returns the data for a non-member landing page instead of the member page I want.
Below is the current code, which does not work:
#Import Packages
import requests
from bs4 import BeautifulSoup
# Login Data
url = "https://WEBSITE.com/ajax/ajax.login.php"
data = {'username': 'NAME%40MAIL.com', 'password': 'PASSWORD%23', 'token': 'ea83a09716ffea1a3a34a1a2195af96d2e91f4a32e4b00544db3962c33d82c40'}
# note that '@' has been URL-encoded as '%40', i.e. NAME%40MAIL.com
# note that '#' has been URL-encoded as '%23'
headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36"}
# Post data package and log in
session = requests.Session()
r = session.post(url, headers=headers, data=data)
# Navigate to the next page and scrape the data
s = session.get('https://WEBSITE.com/page/93/EXAMPLE-PAGE')
soup = BeautifulSoup(s.text, 'html.parser')
print(soup)
I have inspected the elements on the login page. The AJAX URL for the login action is correct, and there are three form fields that need to be filled, as seen in the image below. I pulled the hidden token value from the inspect-element panel and passed it along with the username/email and password:
Inspect Element Panel
I really have no clue what the issue might be, but there is a boolean variable IS_GUEST returning TRUE in the returned HTML, which tells me I have done something wrong and the script has not been granted access.
This is also puzzling to troubleshoot since there is a redirect landing page and no server error codes to analyze or give me a hint.
I am using a different User-Agent header than my actual machine's, but that has never stopped me before with simpler logins.
I have encoded the '@' in the login email as '%40' and the special character required in the password ('#') as '%23' (i.e. NAME@MAIL.COM becomes 'NAME%40MAIL.COM' and PASSWORD# becomes 'PASSWORD%23'). Whenever I change the email to use the plain '@' I get a garbage response, and I tried putting the plain '#' back in the password, but that changed nothing either.
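For reference, here is the un-encoded variant I also tried; as far as I understand, requests URL-encodes form fields passed via data= on its own, so the raw characters should normally be fine (the credentials and token below are placeholders):
import requests

url = "https://WEBSITE.com/ajax/ajax.login.php"
headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36"}

# Pass the raw '@' and '#'; requests form-encodes the values itself,
# so pre-encoding them would double-encode the credentials.
data = {'username': 'NAME@MAIL.com',
        'password': 'PASSWORD#',
        'token': 'ea83a09716ffea1a3a34a1a2195af96d2e91f4a32e4b00544db3962c33d82c40'}

with requests.Session() as session:
    r = session.post(url, headers=headers, data=data)
    print(r.status_code)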
I'm trying to parse articles from 'https://financialpost.com/'; an example link is provided below. To parse it, I need to log in to their website.
I do successfully post my credentials; however, it still does not parse the entire webpage, just the beginning.
How do I crawl everything?
import requests
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
link = 'https://financialpost.com/sign-in/'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    # collect all named form inputs (including hidden ones) as the login payload
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['email'] = 'email@email.com'
    payload['password'] = 'my_password'
    s.post(link, data=payload)
url = 'https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass'
content_url = Request(url)
article_content = urlopen(content_url).read()
article_soup = BeautifulSoup(article_content, 'html.parser')
article_table = article_soup.findAll('section',attrs={'class':'article-content__content-group'})
for x in article_table:
    print(x.find('p').text)
Using just requests
It's a bit complicated using just requests, but possible. You would first have to authenticate to get an authentication token, then request the article with that token so the site knows you are authenticated and returns the full article. To find out which API endpoints are used for authentication and for loading the content, you can use something like Chrome DevTools or Fiddler (they record all HTTP requests, so you can manually find the interesting ones).
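A rough sketch of that flow; the endpoint path, field names and token handling below are assumptions and would have to be taken from the recorded requests:
import requests

# Hypothetical login endpoint -- read the real one from the Network tab
# while logging in through the browser.
login_endpoint = "https://financialpost.com/api/login"
article_url = "https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass"

with requests.Session() as s:
    resp = s.post(login_endpoint, json={"email": "email@email.com",
                                        "password": "my_password"})
    token = resp.json().get("token")  # field name is an assumption

    # Send the token along so the server returns the full article HTML.
    article = s.get(article_url, headers={"Authorization": f"Bearer {token}"})
    print(article.text[:500])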
Using just selenium
The easier way would be to just use Selenium. It drives a real browser from code, so you can open the login page, authenticate, request the article, and the site will treat you like a human visitor.
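A minimal sketch with Selenium; the login-form locators are assumptions and need to be checked against the actual sign-in page:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://financialpost.com/sign-in/")

# The field locators below are guesses; inspect the sign-in form for the real ones.
driver.find_element(By.NAME, "email").send_keys("email@email.com")
driver.find_element(By.NAME, "password").send_keys("my_password")
driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
time.sleep(5)  # crude wait for the login to complete

driver.get("https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass")
soup = BeautifulSoup(driver.page_source, "html.parser")
for section in soup.find_all("section", attrs={"class": "article-content__content-group"}):
    print(section.find("p").text)
driver.quit()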
I work in an environment where we occasionally have to bulk-configure TP-Link ADSL routers. As one can understand, this causes productivity issues. I solved the problem with Python, in particular by using requests.Session(). It worked tremendously well, especially for older TP-Link models such as the TP-Link Archer D5.
Reference: How to control a TPLINK router with a python script
The method I used was to do the configuration via the browser, capture the traffic with Wireshark, and replicate it with Python. The Archer VR600 introduces a new method: when configuration is started from the browser, the main page asks for a new password. Once that is set, the browser generates a long random string (KEY) which is sent to the router. This key is random and unique, and based on it a JSESSIONID is generated and used throughout the session.
AC1600 IP Address: 192.168.1.1
PC IP Address: 192.168.1.100
KEY and SESSIONID when configured via Browser.
KEY and SESSIONID when configured via Python Script.
As you can see, I am trying to replicate those steps in a script, but I am failing because I am not able to create a unique key that the router will accept, and therefore failing to obtain a SESSIONID and carry out the rest of the configuration.
Code:
import base64
import requests

def configure_tplink_archer_vr600():
    user = 'admin'
    salt = '%3D'
    default_password = 'admin:admin'
    password = "admin"
    base_url = 'http://192.168.1.1'
    setPwd_url = 'http://192.168.1.1/cgi/setPwd?pwd='
    login_url = "http://192.168.1.1/cgi/login?UserName=0f98175e8bd1c9297fc22ec6a47fa4824bfb3c8c73141acd7b46db283557d229c9783f409690c9af5e87055608b358ab4d1dfc45f17e6261daabd3e042d7aee92aa1d8829a8d5a69eb641dcc103b17c4f443a96800c8c523b911589cf7e6164dbc1001194"
    get_busy_url = "http://192.168.1.1/cgi/getBusy"

    # Basic-auth value built from the factory default credentials (admin:admin).
    authorization = base64.b64encode(default_password.encode()).decode('ascii')

    # The new admin password is base64-encoded and URL-escaped ('=' becomes '%3D').
    salted_password = base64.b64encode(password.encode()).decode('ascii')
    salted_password = salted_password.replace("=", "%3D")
    print("Salted Password: " + salted_password)
    setPwd_url = setPwd_url + salted_password

    rs = requests.session()
    rs.headers['Cookie'] = 'Authorization=Basic ' + authorization
    rs.headers['Referer'] = base_url
    rs.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
    print("This is the authorization string: " + authorization)

    # Set the admin password, poll the busy endpoint, then attempt the login
    # with the (currently hard-coded) encrypted KEY in the URL.
    response = rs.post(setPwd_url)
    print(response)
    print(response.text.encode("utf-8"))

    response = rs.post(get_busy_url)
    print(response)
    print(response.text.encode("utf-8"))

    response = rs.post(login_url)
    print(response)
    print(response.text.encode("utf-8"))
Use the Python requests library to log in to the router; this removes the need for any manual work:
Go to the login page and right click + Inspect Element.
Navigate to the Network tab; here you can see HTTP requests as they happen.
Log in with some username and password and you should see the corresponding GET/POST request in the Network tab.
Click on it and find the payload it sends to the router. This is usually in JSON format; you'll need to build it in your Python script and send it to the page (luckily there are many tutorials for this out there).
Note that sometimes part of the payload is actually generated by some JavaScript, but in most cases it's just some string crammed into the HTML source. If you see a payload value you don't understand, search for it in the page source; then you'll have to extract it with something like a regex and add it to your payload.
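A rough sketch of that approach; the endpoint, field names and token pattern below are all assumptions and have to be taken from the captured requests and the page source of your router:
import re
import requests

session = requests.Session()
base_url = "http://192.168.1.1"

# Pull the login page and extract a token-like value embedded in its HTML.
# The regex is a placeholder -- search the real page source for the value
# you saw in the captured payload and adapt the pattern accordingly.
page = session.get(base_url)
match = re.search(r'var\s+token\s*=\s*"([^"]+)"', page.text)
token = match.group(1) if match else ""

# Rebuild the payload exactly as it appeared in the captured request.
payload = {"username": "admin", "password": "admin", "token": token}
response = session.post(base_url + "/cgi/login", json=payload)
print(response.status_code, response.text[:200])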
I'm currently working on scraping some HTML pages from an electronic medical system that I use for work. I already have a Python bot that logs into the system and can download and send faxes for me, but there are some pages I want the bot to quickly grab as soon as it is logged in, before it starts sending faxes. These pages are basic HTML with extremely predictable URLs, and I have verified that I can call them manually from my browser, so once my session is established it should be easy work.
The website is: https://kinnser.net/
Login URL: https://kinnser.net/login.cfm
second URL: https://kinnser.net/AM/Message/inbox.cfm
import requests
import json
import logging
from requests.auth import HTTPBasicAuth
from lxml import html
#This URL will be the URL that your login form points to with the "action" tag.
POST_LOGIN_URL = 'https://kinnser.net/loginlogic.cfm'
#This URL is the page you actually want to pull down with requests.
REQUEST_URL = 'https://kinnser.net/AM/Message/inbox.cfm'
#username-input-name is the "name" tag associated with the username input field of the login form.
#password-input-name is the "name" tag associated with the password input field of the login form.
payload = {
    'username': 'XXXXXXXX',
    'password': 'XXXXXXXXX'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'}
with requests.Session() as session:
    post = session.post(POST_LOGIN_URL, data=payload, headers=headers)
    print(post)
    r = session.get(REQUEST_URL)
    print(r.text)  # or whatever else you want to do with the request data!
I played around with the username and password fields by setting them equal to the inputs' name/ID, but that didn't work. So I tried this script on the old EMR we used, just to confirm the script itself wasn't broken, and it worked perfectly there. Then I began to play around with the headers in my request, still with no luck. I'm not sure if my login is simply failing or if they are detecting me as a bot and serving me the login page over and over again, but I have spent about 10 hours researching a solution and have hit a wall with this project.
If anyone sees any mistakes in my code or has a workable solution, please feel free to suggest it. Thanks for the help, and hopefully I'll soon grow to understand more about RESTful web services.
Could the HTML you're after actually be in post.text?
edit:
try the request with these headers:
...
user_agent_str = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " \
    + "AppleWebKit/537.36 (KHTML, like Gecko) " \
    + "Chrome/78.0.3904.97 " \
    + "Safari/537.36"
content_type_str = "application/json"
headers = {
    "user-agent": user_agent_str,
    "content-type": content_type_str
}
...
Another edit:
I'm not sure if requests already handles this for you, but the payload dict isn't being sent as JSON as written; if the endpoint expects a JSON body, you may need to post it with json=payload instead of data=payload.
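For example, assuming the login endpoint really does expect a JSON body, the difference is just which keyword argument you use:
import requests

POST_LOGIN_URL = 'https://kinnser.net/loginlogic.cfm'
payload = {'username': 'XXXXXXXX', 'password': 'XXXXXXXXX'}

with requests.Session() as session:
    # data= form-encodes the dict: username=...&password=...
    form_response = session.post(POST_LOGIN_URL, data=payload)

    # json= serializes the dict to a JSON body and sets a JSON Content-Type
    json_response = session.post(POST_LOGIN_URL, json=payload)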
I would suggest trying out these two things.
From the network calls it looks like kinnser.net/loginlogic.cfm is the POST URL.
Change 'Username' to 'username' and 'Password' to 'password' and try.
Since I don't have access to a username and password I cannot verify this, but these two things might be causing the problem.
I am trying to log in to my university website with Python and the requests library using the following code, but I am not able to.
import requests
payloads = {"User_ID": <username>,
            "Password": <password>,
            "option": "credential",
            "Log in": "Log in"
            }
with requests.Session() as session:
    session.post('', data=payloads)
    get = session.get("")
    print(get.text)
Does anyone have any idea on what I am doing wrong?
In order to log in you will need to post all the information requested by the <input> tags. In your case you will also have to provide the hidden inputs. You can do this by scraping those values and posting them along with your credentials. You might also need to send some headers to simulate browser behaviour.
from lxml import html
import requests

s = requests.Session()
login_url = "https://intranet.cardiff.ac.uk/students/applications"
session_url = "https://login.cardiff.ac.uk/nidp/idff/sso?sid=1&sid=1"

# Fetch the login page and collect every hidden input of the form.
to_get = s.get(login_url)
tree = html.fromstring(to_get.text)
hidden_inputs = tree.xpath(r'//form//input[@type="hidden"]')
payloads = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}

# Add the visible credentials on top of the hidden fields.
payloads["Ecom_User_ID"] = "<username>"
payloads["Ecom_Password"] = "<password>"

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
result = s.post(session_url, data=payloads, headers=headers)
Hope this works
In order to log in to a website like this with Python, you may have to use a more involved method than the requests library alone, because you need to simulate a browser in your code and have it make the requests that log in to the university's servers. The server then thinks it is getting the request from a browser and returns the contents of the resulting page, which you can render and scrape. A great way to do this is with the Selenium module in Python.
I would recommend googling around to learn more about selenium. This blog post is a good example of using selenium to log into a web page with detailed explanations of what each line of code is doing. This SO answer on using selenium to login to a website is also good as an entry point into doing this.
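As a starting point, here is a minimal Selenium sketch; the field names are borrowed from the answer above and are assumptions that need to be checked against the actual login form:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://intranet.cardiff.ac.uk/students/applications")

# Field names are assumptions taken from the hidden-input answer above.
driver.find_element(By.NAME, "Ecom_User_ID").send_keys("<username>")
driver.find_element(By.NAME, "Ecom_Password").send_keys("<password>")
driver.find_element(By.CSS_SELECTOR, "button[type='submit'], input[type='submit']").click()
time.sleep(5)  # crude wait for the redirect after login

# Once logged in, the rendered page source can be scraped as usual.
print(driver.page_source[:500])
driver.quit()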