I'm not sure if there's an API for this but I'm trying to scrape the price of certain products from Wayfair.
I wrote some Python code using Beautiful Soup and requests, but the HTML I get back says "Our systems have detected unusual traffic from your computer network."
Is there anything I can do to make this work?
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'
}

def fetch_wayfair_price(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content)
    print(soup)

fetch_wayfair_price('https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
The Wayfair site loads a lot of data after the initial page load, so the HTML returned for the URL you provided probably won't contain everything you need. That being said, I was able to get the request to work.
Start by using a session from the requests library; it keeps cookies across requests. I also added upgrade-insecure-requests: '1' to the headers, a header that browsers send to signal that they prefer HTTPS, so the request looks more like one from a browser. The request now returns the same response as the browser's initial request, but the information you want is probably loaded by subsequent requests that the browser's JavaScript makes.
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'dnt': '1'
}

def fetch_wayfair_price(url):
    with requests.Session() as session:
        post = session.get(url, headers=headers)
        soup = BeautifulSoup(post.text, 'html.parser')
        print(soup)

fetch_wayfair_price(
    'https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
Note: The session that the requests library creates should really persist outside of the fetch_wayfair_price function so that it can be reused across calls. I have kept it inside the function to match your example.
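As a rough sketch of what the note above suggests, the session can be created once and passed into the fetching function. The names make_session and fetch_html are made up for this illustration, and the timeout value is an assumption; nothing here is specific to Wayfair.

```python
import requests

# Headers copied from the answer above.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'dnt': '1',
}


def make_session():
    """Create one Session that sends HEADERS with every request (hypothetical helper)."""
    session = requests.Session()
    session.headers.update(HEADERS)
    return session


def fetch_html(session, url):
    """Fetch a page through the shared session and return its HTML text."""
    response = session.get(url, timeout=30)  # timeout chosen arbitrarily
    response.raise_for_status()
    return response.text
```

You would then call session = make_session() once and reuse it for every fetch_html(session, url) call, so cookies set by the first response are sent with the later requests.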
The code below downloads the page https://www.nasdaq.com/market-activity/stocks/mrtn/earnings . I am interested in the data in its tables, for example the "Quarterly Earnings Surprise Amount" table. In Chrome's developer tools, I can see that the data is in tags such as:
<td class="earnings-forecast__cell">1.13</td>
But when I download the page with the code below, the number inside the tag disappears and I only get <td class="earnings-forecast__cell"> </td>
Can you please help me fix this? Thanks, HHC
import requests
from bs4 import BeautifulSoup as soup
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'referer': 'https://www.nasdaq.com/market-activity/stocks/mrtn/earnings'
}
# Send a get request to server:
url = 'https://www.nasdaq.com/market-activity/stocks/mrtn/earnings'
html = requests.get(url=url,headers=header)
# check if request is received
html.status_code #Successful responses (200–299)
data=soup(html.content,'lxml')
print(type(data))
# print(data)
If you look at the page source, you can see that the table you are interested in doesn't contain any values. This indicates that the data in the table is rendered via JavaScript.
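To make that check concrete, here is a small sketch using only the standard library's HTMLParser. It collects the text of the cells with the class from the question and shows that the downloaded markup has nothing in them. The CellCollector class and cell_texts function are names invented for this example, and the two HTML snippets are the ones quoted in the question.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text inside <td> elements that carry a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.cells = []        # raw text of each matching cell
        self._inside = False   # True while we are inside a matching <td>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.css_class in classes:
            self._inside = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.cells[-1] += data


def cell_texts(html_text, css_class="earnings-forecast__cell"):
    """Return the stripped text of every matching cell in html_text."""
    parser = CellCollector(css_class)
    parser.feed(html_text)
    parser.close()
    return [cell.strip() for cell in parser.cells]


rendered = '<td class="earnings-forecast__cell">1.13</td>'   # as seen in developer tools
downloaded = '<td class="earnings-forecast__cell"> </td>'     # as returned by requests

print(cell_texts(rendered))
print(cell_texts(downloaded))
```

If every cell comes back as an empty string, the values are most likely filled in after the page loads, which is the situation described here.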
Looking at the requests sent from the browser in the "Network" tab, we can see that a script sends an XHR request and receives the data you are looking for in the reply. The endpoint that the script calls is: https://api.nasdaq.com/api/company/MRTN/earnings-surprise.
Try this:
import requests
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'referer': 'https://www.nasdaq.com/market-activity/stocks/mrtn/earnings'
}
url = 'https://api.nasdaq.com/api/company/MRTN/earnings-surprise'
response = requests.get(url=url, headers=header)
if response.status_code == 200:
    print(response.json())
else:
    print("Failed", response.status_code)
If you use Chrome, filter requests to "Fetch/XHR", and you should be able to view the request. (Refresh the page once with the "Network" tab open)
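If you want the same data for another ticker, one guess is that the endpoint and the referer page follow the same pattern with a different symbol. The answer only confirms this for MRTN, so treat the sketch below as an assumption to verify in the "Network" tab; the function names are invented for this example.

```python
from urllib.parse import quote

API_ROOT = 'https://api.nasdaq.com/api/company'
PAGE_ROOT = 'https://www.nasdaq.com/market-activity/stocks'


def earnings_surprise_url(symbol):
    """Build the API URL, assuming other symbols use the same path as MRTN."""
    return f"{API_ROOT}/{quote(symbol.upper(), safe='')}/earnings-surprise"


def earnings_page_url(symbol):
    """Build the matching page URL, usable as the 'referer' header."""
    return f"{PAGE_ROOT}/{quote(symbol.lower(), safe='')}/earnings"


print(earnings_surprise_url('mrtn'))
print(earnings_page_url('MRTN'))
```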
Happy coding!
I'm using the script I always use to scrape data from the web, but it isn't working this time.
I would like to get the data from the table on the website:
https://www.rad.cvm.gov.br/ENET/frmConsultaExternaCVM.aspx
I'm using the following code for scraping:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.rad.cvm.gov.br/ENET/frmConsultaExternaCVM.aspx"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
bs = BeautifulSoup(html, 'lxml')
print(bs)
Currently I only receive JavaScript from the site, not the data from the table itself.
Send an HTTP POST to https://www.rad.cvm.gov.br/ENET/frmConsultaExternaCVM.aspx/PopulaComboEmpresas
This returns the table data as JSON.
In the browser, press F12 and open Network --> Fetch/XHR to see more details, such as the HTTP headers and the POST body.
You can do this with requests alone: the page gets its data from an API call that returns JSON, so you can send the same POST request yourself.
Here is the working code:
import requests
import json
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
body = {'tipoEmpresa': '0'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'x-dtpc': '33$511511524_409h2vHHVRBIAIGILPJNCRGRCECUBIACWCBUEE-0e37',
    'X-Requested-With': 'XMLHttpRequest',
    'Content-Type': 'application/json'
}
url = 'https://www.rad.cvm.gov.br/ENET/frmConsultaExternaCVM.aspx/PopulaComboEmpresas'
r = requests.post(url, data=json.dumps(body), headers=headers, verify=False)
res = r.json()['d']
print(res)
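As a side note, requests can serialize the body for you through its json= argument, which also sets the Content-Type header. The sketch below builds the request without sending it, so you can inspect what would go over the wire; whether this particular server accepts the request without the other headers is not something this example checks.

```python
import json

import requests

url = 'https://www.rad.cvm.gov.br/ENET/frmConsultaExternaCVM.aspx/PopulaComboEmpresas'
body = {'tipoEmpresa': '0'}

# Prepare the request only; nothing is sent to the server here.
prepared = requests.Request('POST', url, json=body).prepare()

print(prepared.method)
print(prepared.headers['Content-Type'])
print(json.loads(prepared.body))  # the serialized body, decoded back into a dict
```

With a real call, this corresponds to requests.post(url, json=body, headers=headers) instead of data=json.dumps(body).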
I am trying to get the number of followers of a Facebook page, e.g. https://web.facebook.com/marlenaband. I am using the Python requests library. When I view the page source in the browser, the text "142 people follow this" appears in a commented-out section of the page. However, I don't see it in the response text when using requests and BeautifulSoup. Could someone please help me get this? Thanks
Here is the code I am using:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://web.facebook.com/marlenaband'
headers = {
'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/7.0.185.1002 Safari/537.36',
}
res = requests.get(url, headers=headers)
print(res.content)
I actually got it using requests by modifying the headers to this:
headers = {
'accept-language':'en-US,en;q=0.8',
}
Is there any way to scrape only certain links from Google results, namely those that contain specific words in the link,
using BeautifulSoup or Selenium?
import requests
from bs4 import BeautifulSoup
import csv
URL = "https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
I want to extract the links that point to groups.
Not sure what you want to do, but if you want to extract facebook links from the returned content, you can just check whether facebook.com is within the URL:
import requests
from bs4 import BeautifulSoup
import csv
URL = "https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups"
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html5lib')
for link in soup.find_all('a', href=True):
    if 'facebook.com' in link.get('href'):
        print(link.get('href'))
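One thing to watch for: depending on the response Google sends, the href values may be redirect links of the form /url?q=<target>&... rather than the target URL itself. That is an assumption about the markup, so check it against what you actually receive. A minimal sketch for unwrapping such links and checking the host, using only the standard library (both function names and the sample URLs are made up for this example):

```python
from urllib.parse import parse_qs, urlparse


def unwrap_google_href(href):
    """Return the target of a '/url?q=...' redirect link, or href unchanged."""
    parsed = urlparse(href)
    if parsed.path == '/url':
        targets = parse_qs(parsed.query).get('q')
        if targets:
            return targets[0]
    return href


def is_on_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ''
    return host == domain or host.endswith('.' + domain)


# Hypothetical href values for illustration only.
wrapped = '/url?q=https://www.facebook.com/groups/example/&sa=U'
target = unwrap_google_href(wrapped)
print(target)
print(is_on_domain(target, 'facebook.com'))
```

Checking the host this way avoids matching links that merely mention facebook.com somewhere in their query string.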
Update:
There is another workaround: set a legitimate user-agent by adding headers that emulate a browser:
# This is a standard user-agent of Chrome browser running on Windows 10
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
Example:
from bs4 import BeautifulSoup
import requests
URL = 'https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups'
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get(URL, headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
for link in soup.find_all('a', href=True):
    if 'facebook.com' in link.get('href'):
        print(link.get('href'))
Additionally, you can send a fuller set of headers so that the request looks more like one from a legitimate browser. Add some more headers like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip',
    'DNT': '1',  # Do Not Track request header
    'Connection': 'close'
}
As I understand it, you need to get all the links from the Google search results that contain specific words in the link. I assume you are talking about this query: site:facebook.com friends groups.
For site:facebook.com you don't need a special check to see whether the expression is present in the link, because you already used the advanced operator site: in the search query, so Google returns results only from that site.
For friends and groups, however, a special check is needed. Let's see how it can be implemented.
To get these links, you need the selector that contains them. In our case, this is the .yuRUbf a selector. Let's use the select() method, which returns a list of all the links we need.
To go over all the links, we can use a for loop that iterates over the list of matched elements returned by select(). Use get('href') or ['href'] to extract the attribute, which is the URL in this case.
In each iteration of the loop, check whether any of the specific words is present in the URL:
for result in soup.select(".yuRUbf a"):
    href = result["href"]
    # check each word separately; any() is true if at least one is in the URL
    if any(word in href.lower() for word in ("groups", "friends")):
        print(href)
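One way to keep the word check readable is to move it into a small helper that tests each keyword on its own. The contains_any name is made up for this sketch; the first two sample URLs come from the output further down and the third is a hypothetical non-match.

```python
def contains_any(url, keywords):
    """Return True if at least one keyword occurs in the URL (case-insensitive)."""
    lowered = url.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


links = [
    "https://www.facebook.com/groups/funwithfriendsknoxville/",
    "https://www.facebook.com/WordsWithFriends/groups",
    "https://www.facebook.com/help/",  # hypothetical link without either word
]

for link in links:
    print(link, contains_any(link, ("groups", "friends")))
```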
Also, make sure you send a user-agent request header so that the request looks like a "real" user visit. The updated workaround in 0xInfection's answer worked because the default requests user-agent is python-requests, and websites understand that such a request most likely comes from a script. Check what your user-agent is.
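To see what requests sends when you don't set the header yourself, you can print its default user-agent locally, without contacting any website. A short sketch:

```python
import requests

# The string requests uses when no User-Agent header is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A fresh Session starts out with that same default header.
session = requests.Session()
print(session.headers['User-Agent'] == default_ua)
session.close()
```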
To minimize blocks from Google, I decided to add a basic example of using proxies via requests.
Code and full example in online IDE:
from bs4 import BeautifulSoup
import requests, lxml

session = requests.Session()
# Placeholder proxy address; replace it with a proxy you can actually use.
session.proxies = {
    'http': 'http://10.10.10.10:8000',
    'https': 'http://10.10.10.10:8000',
}

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "site:facebook.com friends groups",
    "hl": "en",  # language
    "gl": "us"   # country of the search, US -> USA
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

# Send the request through the session so that the proxy settings are applied.
html = session.get("https://www.google.co.in/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

for result in soup.select(".yuRUbf a"):
    href = result["href"]
    # check each word separately; any() is true if at least one is in the URL
    if any(word in href.lower() for word in ("groups", "friends")):
        print(href)
Output:
https://www.facebook.com/groups/funwithfriendsknoxville/
https://www.facebook.com/FWFNYC/groups
https://www.facebook.com/groups/americansandfriendsPT/about/
https://www.facebook.com/funfriendsgroups/
https://www.facebook.com/groups/317688158367767/about/
https://m.facebook.com/funfriendsgroups/photos/
https://www.facebook.com/WordsWithFriends/groups
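If you are wondering how the params dictionary relates to the hand-written URL in the question, the standard library's urlencode shows roughly the encoding that ends up in the query string. This sketch only builds the string and does not contact Google.

```python
from urllib.parse import urlencode

params = {
    "q": "site:facebook.com friends groups",
    "hl": "en",
    "gl": "us",
}

# Spaces become '+', and ':' is percent-encoded as %3A.
query = urlencode(params)
print(query)
print("https://www.google.co.in/search?" + query)
```

The q= part matches the one in the original URL, which is why passing a dictionary is usually easier than encoding the query by hand.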
Or you can use Google Organic Results API from SerpApi. It will bypass blocks from search engines and you don't have to create the parser from scratch and maintain it.
Code example:
from serpapi import GoogleSearch
import os
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),  # your serpapi api key
    "engine": "google",               # search engine
    "q": "site:facebook.com friends groups"  # search query
    # other parameters
}
search = GoogleSearch(params)    # where data extraction happens on the SerpApi backend
result_dict = search.get_dict()  # JSON -> Python dict
for result in result_dict['organic_results']:
    link = result['link']
    # check each word separately; any() is true if at least one is in the URL
    if any(word in link.lower() for word in ("groups", "friends")):
        print(link)
Output:
https://www.facebook.com/groups/126440730781222/
https://www.facebook.com/FWFNYC/groups
https://m.facebook.com/FS1786/groups
https://www.facebook.com/pages/category/AIDS-Resource-Center/The-Big-Groups-159912964020164/
https://www.facebook.com/groups/889671771094194
https://www.facebook.com/groups/480003906466800/about/
https://www.facebook.com/funfriendsgroups/
I am trying to get the HTML content of a URL using requests.get in Python,
but I am getting an incomplete response.
import requests
from lxml import html
url = "https://www.expedia.com/Hotel-Search?destination=Maldives&latLong=3.480528%2C73.192127&regionId=109&startDate=04%2F20%2F2018&endDate=04%2F21%2F2018&rooms=1&_xpid=11905%7C1&adults=2"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Content-Type': 'text/html',
}
response = requests.get(url, headers=headers)
print(response.content)
Can anyone suggest what changes are needed to get the complete response?
NB: Using Selenium I am able to get the complete response, but that is not the recommended way.
If you need to get content that is generated dynamically by JavaScript and you don't want to use Selenium, you can try the requests-html library, which can render JavaScript:
from requests_html import HTMLSession
session = HTMLSession()
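url = "https://www.expedia.com/Hotel-Search?destination=Maldives&latLong=3.480528%2C73.192127&regionId=109&startDate=04%2F20%2F2018&endDate=04%2F21%2F2018&rooms=1&_xpid=11905%7C1&adults=2"
r = session.get(url)
# render() runs the page's JavaScript in a headless Chromium (downloaded on first use)
r.html.render()
# r.html.html holds the rendered markup; r.content is still the original response
print(r.html.html)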
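Separately from the rendering question, a long query string like this one is easy to get wrong by hand. A small sketch that rebuilds the same URL from a dictionary with the standard library, assuming the parameter after latLong is regionId:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.expedia.com/Hotel-Search"

# Values written unencoded; urlencode takes care of ',', '/' and '|'.
params = {
    "destination": "Maldives",
    "latLong": "3.480528,73.192127",
    "regionId": "109",  # assumed name of this parameter
    "startDate": "04/20/2018",
    "endDate": "04/21/2018",
    "rooms": "1",
    "_xpid": "11905|1",
    "adults": "2",
}

url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```

The same dictionary can also be passed as params=params to session.get(BASE_URL, ...), which lets the library do the encoding.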