I am trying to get the number of followers of a Facebook page, e.g. https://web.facebook.com/marlenaband, using the Python requests library. When I view the page source in the browser, the text "142 people follow this" appears inside a commented-out section of the page, but I do not see it in the response text I get with requests and BeautifulSoup. Could someone please help me get this? Thanks.
Here is the code I am using:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://web.facebook.com/marlenaband'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36',
}
res = requests.get(url, headers=headers)
print(res.content)
I actually got it using requests by modifying the headers to this:
headers = {
    'accept-language': 'en-US,en;q=0.8',
}
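Since the question says the follower text sits inside a commented-out part of the markup, here is a minimal sketch of pulling a count out of HTML comments with the standard library only. The sample markup, the class name in it, and the helper name find_follower_count are assumptions for illustration; the real page structure will differ, and the English phrase presumably only appears when the response is served in English.

```python
import re
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the raw text of every HTML comment in a document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


def find_follower_count(html_text):
    """Return the follower count found in any HTML comment, or None."""
    collector = CommentCollector()
    collector.feed(html_text)
    collector.close()
    pattern = re.compile(r"([\d,]+)\s+people follow this")
    for comment in collector.comments:
        match = pattern.search(comment)
        if match:
            return int(match.group(1).replace(",", ""))
    return None


# Hypothetical stand-in for res.text; the real page markup will differ.
sample_html = """
<html><body>
<div class="hidden_elem"><!-- <div><span>142 people follow this</span></div> --></div>
</body></html>
"""

print(find_follower_count(sample_html))
```

In practice you would pass res.text from the request above instead of sample_html.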
Related
I am trying to write a small app that parses this page: https://apps.microsoft.com/store/category/Business
I cannot get the full HTML; the body tag in the response is incomplete.
import requests
def get_data(url):
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    }
    req = requests.get(url, headers=headers)
    with open("index.html", "w") as file:
        file.write(req.text)
get_data("https://apps.microsoft.com/store/category/Business")
You cannot parse this page from the raw response alone, because it is rendered client-side by JavaScript.
You need a tool that runs a browser, such as:
pyppeteer
Selenium
Alternatively, try to reverse engineer the page and call its APIs directly
(or check whether Microsoft has a public API that returns the information you want).
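Before reaching for a browser tool, it can help to check whether the HTML you received is just a JavaScript shell. The following is a rough heuristic sketch using only the standard library, not a definitive test: the function name looks_client_rendered and the 200-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Count text outside <script>/<style> and count <script> tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.text_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would actually see.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_client_rendered(html_text, min_text_chars=200):
    """Guess whether a page is a JS shell: scripts present, little visible text.

    min_text_chars is an arbitrary threshold picked for this sketch.
    """
    counter = VisibleTextCounter()
    counter.feed(html_text)
    counter.close()
    return counter.script_tags > 0 and counter.text_chars < min_text_chars


# Hypothetical examples standing in for req.text.
shell_page = (
    '<html><head><title>Store</title><script src="/app.js"></script></head>'
    '<body><div id="root"></div></body></html>'
)
static_page = (
    "<html><body><h1>Business apps</h1><p>"
    + "Lorem ipsum dolor sit amet. " * 20
    + "</p></body></html>"
)

print(looks_client_rendered(shell_page))
print(looks_client_rendered(static_page))
```

If the check reports a shell for the saved index.html, the data is most likely fetched by JavaScript after load, which is when pyppeteer, Selenium, or calling the underlying API becomes necessary.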
I'm not sure if there's an API for this but I'm trying to scrape the price of certain products from Wayfair.
I wrote some Python code using Beautiful Soup and requests, but the HTML I get back says "Our systems have detected unusual traffic from your computer network."
Is there anything I can do to make this work?
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'
}
def fetch_wayfair_price(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content)
    print(soup)
fetch_wayfair_price('https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
The Wayfair site loads a lot of data after the initial page, so fetching the HTML from the URL you provided probably won't give you everything you need. That said, I was able to get the request to work.
Start by using a session from the requests library; this keeps track of cookies across requests. I also added upgrade-insecure-requests: '1' to the headers, which browsers send to signal that they prefer HTTPS, so the request looks more like a browser's. The request now returns the same response as the browser request does, but the information you want is probably loaded in subsequent requests made by the browser's JavaScript.
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'dnt': '1'
}
def fetch_wayfair_price(url):
    # A Session keeps cookies between the requests made through it.
    with requests.Session() as session:
        response = session.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        print(soup)
fetch_wayfair_price(
    'https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
Note: The session that the requests library creates should really persist outside of the fetch_wayfair_price function, so that cookies carry over between calls. I have kept it inside the function to match your example.
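As a sketch of what keeping the session outside the function could look like, the version below creates one Session up front and passes it into the fetch helper. The names make_session and fetch_html and the 10-second timeout are assumptions for illustration; the header values are the ones from the answer.

```python
import requests

# Header values taken from the answer; adjust as needed.
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36"
    ),
    "upgrade-insecure-requests": "1",
    "dnt": "1",
}


def make_session():
    """Create one Session that sends the default headers on every request."""
    session = requests.Session()
    session.headers.update(DEFAULT_HEADERS)
    return session


def fetch_html(session, url, timeout=10):
    """Fetch a page through the shared session and return its HTML text."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


# Usage (performs real network requests, so it is left commented out):
# session = make_session()
# first = fetch_html(session, "https://www.wayfair.ca/")
# second = fetch_html(session, product_url)  # reuses cookies set by the first call
```

Because the same Session object is reused, cookies set by an earlier response are sent with later requests, which is the point of the note above.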
I am trying to get the HTML content of a URL using requests.get in Python,
but I am getting an incomplete response.
import requests
from lxml import html
url = "https://www.expedia.com/Hotel-Search?destination=Maldives&latLong=3.480528%2C73.192127&regionId=109&startDate=04%2F20%2F2018&endDate=04%2F21%2F2018&rooms=1&_xpid=11905%7C1&adults=2"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Content-Type': 'text/html',
}
response = requests.get(url, headers=headers)
print(response.content)
Can anyone suggest what to change to get the complete response?
NB: Using Selenium I am able to get the complete response, but that is not the recommended way.
If you need content that is generated dynamically by JavaScript and you don't want to use Selenium, you can try the requests-html library, which supports JavaScript rendering:
from requests_html import HTMLSession
session = HTMLSession()
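url = "https://www.expedia.com/Hotel-Search?destination=Maldives&latLong=3.480528%2C73.192127&regionId=109&startDate=04%2F20%2F2018&endDate=04%2F21%2F2018&rooms=1&_xpid=11905%7C1&adults=2"
r = session.get(url)
# render() runs the page's JavaScript in headless Chromium and updates r.html
r.html.render()
# r.content still holds the original response; the rendered markup is in r.html.html
print(r.html.html)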
I am trying to log in to a Twitter account using the requests library, but the response status code is 400. I included all the required payload parameters and headers, but I still cannot figure out how to log in.
import requests
from bs4 import BeautifulSoup
payload={
"session[username_or_email]":"***************",
"session[password]":"*************",
"authenticity_token":"*************",
"ui_metrics":'{"rf":{"a4f2d7a8e3d9736f0815ae7b34692191bca9f114a7d5602c7758a3e6087b6b30":0,"ad92fc8b83fb5dec3f720f89a7f0fb415a26130516362f230b02251edd96a54a":0,"a011babb5c5df598f93bcc4a38dfad0276f69df36faff48eea95bac67cefeffe":0,"a75214752b7e90fd50725fce21cc26761ef3613173b0f8764d52c8b53f136bbf":0},"s":"mTArUSdNtTOm6WaGwNeRjMAU3EhNA3VGbFeCIZeEkjjLTAbccFDTJjcTEB2tQ9iuNJUzniFKyvhZNOGdH1LIwmi1YSMcFTOHu2Wi49yKvONv0obfg1dW27znR_C2n-ev2zMvN5166j1ccsxWKIheiWw-eHM7oXA54U40cWHvdCrunJJKj2INkTrcVph-y2fccu1m3hp31vngqBiL-XmeLWYiyZ-NYOmV8f5iXW9WWMvISTcSwzz9vd_n9-tLSKociT-1ap5ZVFWNUWIycSflj8WcOmmRFzq4kwa-NsS0FRp-DQ2FOkozhhhQi9HDvSODUlGsdQWBPkGKKtDWbtnj9gAAAWEty4Xv"}',
"scribe_log":"",
"redirect_after_login":"",
"authenticity_token":"******************",
"return_to_ssl":"",
"remember_me":"1",
"lang":""
}
headers={
"accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-encoding":"gzip, deflate, br",
"accept-language":"en-US,en;q=0.9",
"cache-control":"max-age=0",
"cookie":'moments_profile_moments_nav_tooltip_self=true; syndication_guest_id=v1%3A150345116906281638; eu_cn=1; kdt=QErLcBT9OjM5gjEznmsRcHlMTK6biDyAw4gfI5ro; remember_checked_on=1; _ga=GA1.2.1923324433.1496571570; tfw_exp=0; _gid=GA1.2.106381927.1516638134; __utma=43838368.1923324433.1496571570.1516764481.1516764481.1; __utmz=43838368.1516764481.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); lang=en; ct0=7ceea26f7fd3d186152512d26365cddf; _twitter_sess=BAh7CiIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCL8wyy1hAToMY3NyZl9p%250AZCIlNjJjODQ1MjZiZWQzOGUyODZlOWUxNmNkMWJhZTZjYjc6B2lkIiU4MmZm%250AYWQ3Mzc1OGFhNmJjOTIxZjlmOGEyMzk3MjE1NToJdXNlcmwrCQAAVbhKiEIN--32d967262e1de8852d20ace15ec93d87b9a902a8; personalization_id="v1_snKt6bqCONQsnFuE8EOZDA=="; guest_id=v1%3A151689245583269291; _gat=1; ads_prefs="HBERAAA="; twid="u=955475925457502208"; auth_token=50decb38f16f3c264f480b0cd1cc30a9bcce9f08',
"referer":"https://twitter.com/login",
"upgrade-insecure-requests":"1",
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
res = requests.get("https://twitter.com/login",data=payload,headers=headers)
soup = BeautifulSoup(res.text,"html.parser")
print(res.status_code)
print(res.url)
for item in soup.find_all(class_="title"):
    print(item.text)
How do I log in to Twitter? Which parameters am I missing?
Note: I am not using the API or a Selenium driver; I want to do this with the requests library. Thanks in advance.
You're using the GET method to access an auth endpoint. Login forms are usually submitted with POST, so try requests.post instead of requests.get.
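To see what a POST with form data actually puts on the wire, you can build the request without sending it. This is a small sketch under assumptions: the URL is a placeholder rather than the real form action, and the credentials are dummies; a real login would also need the current authenticity token from the login page.

```python
import requests
from urllib.parse import parse_qs

# Placeholder values for illustration only.
payload = {
    "session[username_or_email]": "user@example.com",
    "session[password]": "not-a-real-password",
}

# Build the request without sending it, so it can be inspected offline.
request = requests.Request("POST", "https://example.com/login", data=payload)
prepared = request.prepare()

print(prepared.method)
print(prepared.headers["Content-Type"])
print(parse_qs(prepared.body))  # the form fields decoded back from the body
```

Once the prepared request looks right, the equivalent live call is requests.post(url, data=payload, headers=headers), ideally made through a requests.Session so that cookies from the login page are kept.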
I would like to get store information from this website: http://www.hilife.com.tw/storeInquiry_street.aspx
According to Chrome, the request method is POST.
With the code below, I still cannot get the data.
Could someone give me a hint?
import requests
from bs4 import BeautifulSoup
head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
}
payload = {
'__EVENTTARGET':'AREA',
'__EVENTARGUMENT':'',
'__LASTFOCUS':'',
'__VIEWSTATE':'/wEPDwULLTE0NjI2MjI3MjMPZBYCAgcPZBYMAgEPZBYCAgEPFgIeBFRleHQFLiQoJyNzdG9yZUlucXVpcnlfc3RyZWV0JykuYXR0cignY2xhc3MnLCdzZWwnKTtkAgMPEA8WBh4NRGF0YVRleHRGaWVsZAUJY2l0eV9uYW1lHg5EYXRhVmFsdWVGaWVsZAUJY2l0eV9uYW1lHgtfIURhdGFCb3VuZGdkEBUSCeWPsOWMl+W4ggnln7rpmobluIIJ5paw5YyX5biCCeWunOiYree4ownmlrDnq7nnuKMJ5qGD5ZyS5biCCeiLl+agl+e4ownlj7DkuK3luIIJ5b2w5YyW57ijCeWNl+aKlee4ownlmInnvqnnuKMJ6Zuy5p6X57ijCeWPsOWNl+W4ggnpq5jpm4TluIIJ5bGP5p2x57ijCemHkemWgOe4ownmlrDnq7nluIIJ5ZiJ576p5biCFRIJ5Y+w5YyX5biCCeWfuumahuW4ggnmlrDljJfluIIJ5a6c6Jit57ijCeaWsOeruee4ownmoYPlnJLluIIJ6IuX5qCX57ijCeWPsOS4reW4ggnlvbDljJbnuKMJ5Y2X5oqV57ijCeWYiee+qee4ownpm7LmnpfnuKMJ5Y+w5Y2X5biCCemrmOmbhOW4ggnlsY/mnbHnuKMJ6YeR6ZaA57ijCeaWsOerueW4ggnlmInnvqnluIIUKwMSZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnFgECB2QCBQ8QDxYGHwEFCXRvd25fbmFtZR8CBQl0b3duX25hbWUfA2dkEBUWBuS4reWNgAbmnbHljYAG5Y2X5Y2ABuilv+WNgAbljJfljYAJ5YyX5bGv5Y2ACeilv+Wxr+WNgAnljZflsa/ljYAJ5aSq5bmz5Y2ACeWkp+mHjOWNgAnpnKfls7DljYAJ54OP5pel5Y2ACeixkOWOn+WNgAnlkI7ph4zljYAJ5r2t5a2Q5Y2ACeWkp+mbheWNgAnnpZ7lsqHljYAJ5aSn6IKa5Y2ACeaymem5v+WNgAnmoqfmo7LljYAJ5riF5rC05Y2ACeWkp+eUsuWNgBUWBuS4reWNgAbmnbHljYAG5Y2X5Y2ABuilv+WNgAbljJfljYAJ5YyX5bGv5Y2ACeilv+Wxr+WNgAnljZflsa/ljYAJ5aSq5bmz5Y2ACeWkp+mHjOWNgAnpnKfls7DljYAJ54OP5pel5Y2ACeixkOWOn+WNgAnlkI7ph4zljYAJ5r2t5a2Q5Y2ACeWkp+mbheWNgAnnpZ7lsqHljYAJ5aSn6IKa5Y2ACeaymem5v+WNgAnmoqfmo7LljYAJ5riF5rC05Y2ACeWkp+eUsuWNgBQrAxZnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnFgECBGQCBw8PFgIfAAUJ5Y+w5Lit5biCZGQCCQ8PFgIfAAUG5YyX5Y2AZGQCCw8WAh4LXyFJdGVtQ291bnQCAhYEZg9kFgJmDxUFBEg2NDYP5Y+w5Lit5aSq5bmz5bqXIOWPsOS4reW4guWMl+WNgDQwNOWkquW5s+i3rzcy6JmfCzA0LTIyMjkwOTI4Azg3MGQCAQ9kFgJmDxUFBDQzMTgP5Y+w5Lit5rC45aSq5bqXM+WPsOS4reW4guWMl+WNgDQwNOWkquWOn+i3r+S6jOautTI0MOiZn+S4gOaok+WFqOmDqAswNC0yMzY5MDA1NwM4NzFkZFHxmtQaBu2Yr9cvskfEZMWn57JLRfjPYBFYDy+tHr6X',
'__VIEWSTATEGENERATOR':'B77476FC',
'__EVENTVALIDATION':'/wEdACtWrrgS52/ojbuYEYvRDXHZ2ryV+Ed5kWYedGp5zjHs3Neeeo/9TTvNTdElW+hiVA25mZnLEQUYPOZFLnuVu9jOT+Zq1/xceVgC7GxWRM+A8tOS3xZBjlhgzlx5UN3H3D0UrdtoyeScvRqxFL8L3gGKRyCJu029oItLX7X6c7SW7C7IVzuAeZ6t9kFMeOQus7MtrV7YeOXrlOP8inI96UkaJEU7Ro3FtK29+B+NamR2j4qInKVwJ4+JD3cjWm5buZdnOhT/ISzrljaf+F9GnVjm4dGchVglf1PxMMHl7EEoLjs20TZ856RDCGXvzK/6J+tEFp7zDvFTYGoeHtuHy+YF/IoR/CRFBAaEkys48FIAUCSUKnxACPyW6Ar2guIADjOqYue7v4fhV1jIq65P/lwanoaJpIsboCbjakbTYnqK8BLngMayrRehyT58dmj3SbzY1mOtzSNnakdpUxaC0EpOJ7rhB52A2FKsxy5EbP0PwHHuHNMa9dit0AxPMfYUP1/LWuYPWMX0W8tyEMKxoUcYsCb+qJLF9yXPgM6c8sIQTRxcBokm1PGzFN4M6vnSF8OfFSC+c0frLZ4GH6l497B/5oDIjq7Bz4/cPeGCavvh9NUqPcmzJIr8Abx9vjtMGpZSwBdVY3bR/ARswIDrmWLt1qMD4jcRvGPxBa8nsRR8HNdVINbR+iOSFLwVhBCg+s+mV5NeTdOKvAeggfOsJHmJKL0ApQSCyjY5kEiOvo2JAI07C08ENIFF7HpDTaGCi93i2WnmdDrYoaoLZi96dRTlk4xoWV9tc7rd9X/wE6QoKHxFtADSz9WkgtbUn88lAhY2++OiqWCaQZobh7K26ndH1z34JXVB7C/AiOEV+CCb97oVyooxWullV44iFQ0isVBjYC1XWS3eGf1PwMS++A+EjQTkl9VJhIRDoS6sg2mD7mikimBjQGvZX/lcYtKSrjY=',
'CITY':'台中市',
'AREA':'北區'
}
res = requests.post('http://www.hilife.com.tw/storeInquiry_street.aspx', data=payload, headers=head)
res.encoding = 'utf-8'
print(res.text)
It looks like the Content-Type: application/x-www-form-urlencoded header is missing. You need to send that header and also send the data in x-www-form-urlencoded format. I recommend testing the request in Postman before writing the code. Also make sure you have the relevant permission before crawling a third-party website. Happy crawling.
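For reference, this is roughly what x-www-form-urlencoded data looks like, sketched with the standard library. Only three fields from the question are used and the long ASP.NET state fields are left out, so this is an illustration of the encoding, not a working request. Note that when data is passed to requests as a dict, requests applies this encoding and sets the Content-Type header by itself, so the header may already be present in the original request.

```python
from urllib.parse import parse_qs, urlencode

# Three of the form fields from the question; the long state fields are omitted.
payload = {
    "__EVENTTARGET": "AREA",
    "CITY": "台中市",
    "AREA": "北區",
}

# Non-ASCII text is percent-encoded as UTF-8 by default.
body = urlencode(payload)

# Needed only when sending a pre-encoded string such as `body`;
# requests adds this header on its own when data is a dict.
headers = {"Content-Type": "application/x-www-form-urlencoded"}

print(body)
print(parse_qs(body))  # decodes back to the original field values
```

To send the pre-encoded form, pass it as requests.post(url, data=body, headers=headers).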