Cannot get data from nasdaq site - python

My code below downloads the page https://www.nasdaq.com/market-activity/stocks/mrtn/earnings . I am interested in the data in the tables, say the "Quarterly Earnings Surprise Amount" table. In the Chrome developer tools I can see the data in tags such as:
<td class="earnings-forecast__cell">1.13</td>
But when I download the page with the code below, the number in the tag disappears. I only get <td class="earnings-forecast__cell"> </td>
Can you please help me fix this? Thanks, HHC
import requests
from bs4 import BeautifulSoup as soup

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'referer': 'https://www.nasdaq.com/market-activity/stocks/mrtn/earnings'
}

# Send a GET request to the server:
url = 'https://www.nasdaq.com/market-activity/stocks/mrtn/earnings'
html = requests.get(url=url, headers=header)

# Check if the request was received; successful responses are 200-299
print(html.status_code)

data = soup(html.content, 'lxml')
print(type(data))
# print(data)

If you look at the page source, you can see that the table you are interested in doesn't have any values. This indicates that the data in the table is rendered via JavaScript.
On checking the sources and the requests sent from the browser's "Network" tab, we can see that an XHR request is sent by a JS script and the response contains the data you are looking for. The endpoint the script requests is: https://api.nasdaq.com/api/company/MRTN/earnings-surprise.
Try this:

import requests

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'referer': 'https://www.nasdaq.com/market-activity/stocks/mrtn/earnings'
}

url = 'https://api.nasdaq.com/api/company/MRTN/earnings-surprise'
response = requests.get(url=url, headers=header)

if response.status_code == 200:
    print(response.json())
else:
    print("Failed", response.status_code)
If you use Chrome, filter the requests to "Fetch/XHR" and you should be able to see the request. (Refresh the page once with the "Network" tab open.)
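If you then want to pull the table values out of the JSON, here is a minimal sketch. The key names (data, earningsSurpriseTable, rows) are assumptions based on how similar Nasdaq API endpoints are shaped, so print the raw payload first to confirm the actual structure:

import requests

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'referer': 'https://www.nasdaq.com/market-activity/stocks/mrtn/earnings'
}

url = 'https://api.nasdaq.com/api/company/MRTN/earnings-surprise'
payload = requests.get(url=url, headers=header).json()

# 'earningsSurpriseTable' and 'rows' are assumed key names; inspect the
# raw payload to confirm them before relying on this
rows = payload.get('data', {}).get('earningsSurpriseTable', {}).get('rows', [])
for row in rows:
    print(row)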
Happy coding!

Related

I cannot parse a JavaScript rendered webpage with Python Requests

I tried to write a little app to parse this page: https://apps.microsoft.com/store/category/Business
I cannot get the full HTML code; the body tag is not complete.
import requests

def get_data(url):
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    }
    req = requests.get(url, headers=headers)
    with open("index.html", "w") as file:
        file.write(req.text)

get_data("https://apps.microsoft.com/store/category/Business")
You cannot parse this page with requests alone because it is rendered client-side through JavaScript.
You need to use a tool like:
pyppeteer
Selenium
Or try to reverse engineer the page and call the APIs it uses directly.
(Or see if Microsoft has a public API you can call to get the info you want.)
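For instance, here is a minimal Selenium sketch that renders the page before saving it; the fixed sleep is a crude placeholder (a WebDriverWait on a specific element would be more robust):

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')  # render without opening a window

driver = webdriver.Chrome(options=options)
driver.get('https://apps.microsoft.com/store/category/Business')
time.sleep(5)  # crude wait to give the JavaScript time to render

# page_source now contains the rendered HTML, not just the empty shell
with open('index.html', 'w', encoding='utf-8') as file:
    file.write(driver.page_source)

driver.quit()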

Web Scraping - Cloudflare Issues

I am trying to scrape https://www.carsireland.ie/search#q?%20scraper%20python=&toggle%5Bpoa%5D=false&page=1 (I had built a scraper, but then they did a total overhaul of their website). The new website has a new format and uses Cloudflare for the usual security. The code below returns a 403 error which references this page:
"https://www.cloudflare.com/5xx-error-landing"
The code which I have built so far is as follows:
from requests_html import HTMLSession

session = HTMLSession()

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'referer': 'https://www.google.com/'
}

# url of the search page
url = 'https://www.carsireland.ie/search#q?sortBy=vehicles_prod%2Fsort%2Fpoa%3Aasc%2Cupdated%3Adesc&page=1'

# request the url within the session
r = session.get(url, headers=header)

# render the url
data = r.html.render(sleep=1, timeout=20)

# check the response
print(r.text)
I would really appreciate any help which could be provided to correct the Cloudflare issues which I am having.
This problem can be fixed by simply changing the referer property in the header to the link you are going to scrape, as sketched below.
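In other words, something like this (a minimal sketch of the suggested fix; whether it still gets past Cloudflare depends on the site's current rules):

from requests_html import HTMLSession

session = HTMLSession()

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    # referer now points at the site being scraped instead of google.com
    'referer': 'https://www.carsireland.ie/'
}

url = 'https://www.carsireland.ie/search#q?sortBy=vehicles_prod%2Fsort%2Fpoa%3Aasc%2Cupdated%3Adesc&page=1'
r = session.get(url, headers=header)
print(r.status_code)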

Wayfair product prices

I'm not sure if there's an API for this, but I'm trying to scrape the price of certain products from Wayfair.
I wrote some Python code using Beautiful Soup and requests, but I'm getting back HTML which says "Our systems have detected unusual traffic from your computer network."
Is there anything I can do to make this work?
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'
}

def fetch_wayfair_price(url):
    response = requests.get(url, headers=headers)
    # an explicit parser avoids BeautifulSoup's "no parser specified" warning
    soup = BeautifulSoup(response.content, 'html.parser')
    print(soup)

fetch_wayfair_price('https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
The Wayfair site loads a lot of data in after the initial page, so just getting the HTML from the URL you provided probably won't be all you need. That being said, I was able to get the request to work.
Start by using a Session from the requests library; this will track cookies and session state. I also added upgrade-insecure-requests: '1' to the headers so that you can avoid some of the issues that HTTPS introduces to scraping. The request now returns the same response as the browser request does, but the information you want is probably loaded in subsequent requests the browser/JS makes.
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'dnt': '1'
}

def fetch_wayfair_price(url):
    with requests.Session() as session:
        post = session.get(url, headers=headers)
        soup = BeautifulSoup(post.text, 'html.parser')
        print(soup)

fetch_wayfair_price(
    'https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
Note: the Session that the requests library creates really should persist outside of the fetch_wayfair_price method. I have contained it in the method to match your example; a sketch of the module-level version follows.
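For instance, one way to let the session (and its cookies) persist across calls:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'dnt': '1'
}

# module-level session: cookies persist across every call below
session = requests.Session()
session.headers.update(headers)

def fetch_wayfair_price(url):
    response = session.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup)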

How do I get location info from http response?

I understand what Location does in HTTP headers.
When I access a site with Chrome, I can see Location in the response headers.
However, accessing it with Python requests does not get that info.
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'accept': '*/*',
    'accept-language': 'en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7,uk;q=0.6,en-GB;q=0.5',
}

response = requests.get('https://ec.ef.com.cn/partner/englishcenters', headers=headers)
print(response.headers)
Does it matter for Scrapy? How do I get that info? I guess it might be a flag the site could use for anti-scraping.
What you see in your screenshot is a response with HTTP status code 302, which usually makes clients (including Python Requests) automatically redirect to another URL, specified in the Location header.
If you enter the URL you shared (https://ec.ef.com.cn/partner/englishcenters) in your browser, you'll see that you get redirected to some other URL. The same behaviour can be observed in your Python code if you print out response.url, which should return the URL you've been redirected to.
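To see the Location header yourself, you can tell Requests not to follow the redirect:

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
}

# allow_redirects=False keeps the original 302 response instead of following it
response = requests.get('https://ec.ef.com.cn/partner/englishcenters',
                        headers=headers, allow_redirects=False)

print(response.status_code)               # 302
print(response.headers.get('Location'))   # the redirect target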

Access denied [403] when accessing site with BeautifulSoup python

I want to scrape https://www.jdsports.it/ using BeautifulSoup, but I get access denied.
On my PC I don't have any problem accessing the site, and I'm using the same user agent as in the Python program, but from the program the result is different; you can see the output below.
EDIT:
I think I need cookies to gain access to the site. How can I get them and use them to access the site with the Python program to scrape it?
(The script works if I use https://www.jdsports.com, which is the same site but for a different region.)
Thanks!
import time
import requests
from bs4 import BeautifulSoup
import smtplib

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

url = 'https://www.jdsports.it/'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# soup.findAll.get_text() raises AttributeError; calling get_text() on the
# soup itself returns the page text
status = soup.get_text()
print(status)
The output is:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.jdsports.it/" on this server.<p>
Reference #18.35657b5c.1589627513.36921df8
</p></body>
</html>
I suspected HTTP/2 at first, but wasn't able to get that working either. Perhaps you'll have more luck; here's an HTTP/2 starting point:
import asyncio
import httpx
import logging

logging.basicConfig(format='%(message)s', level=logging.DEBUG)

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}

url = 'https://www.jdsports.it/'

async def f():
    # http2=True needs the optional dependency: pip install httpx[http2]
    client = httpx.AsyncClient(http2=True)
    r = await client.get(url, allow_redirects=True, headers=headers)
    print(r.text)

asyncio.run(f())
(Tested both on Windows and Linux.) Could this have something to do with TLS 1.2? That's where I'd look next, since curl works.
