I tried to write a little app for parsing this page: https://apps.microsoft.com/store/category/Business
I cannot get the full HTML: the body tag comes back incomplete.
import requests

def get_data(url):
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    }
    req = requests.get(url, headers=headers)
    # Write the raw response to disk for inspection
    with open("index.html", "w", encoding="utf-8") as file:
        file.write(req.text)

get_data("https://apps.microsoft.com/store/category/Business")
You cannot parse this page with requests alone, because it is rendered client side by JavaScript: the server only returns a shell, and the body is filled in by scripts running in the browser.
You need to use a tool that executes the JavaScript, such as:
pyppeteer
Selenium
Or maybe try to reverse engineer the page and directly call the APIs.
(Or maybe see if Microsoft has a public API you can call to get the info you want).
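For example, here is a minimal Selenium sketch (assumptions: Selenium 4 with Chrome installed; the wait condition is a placeholder you should tune to the content you actually need):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")  # render without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://apps.microsoft.com/store/category/Business")
    # Wait until client-side rendering has put at least one link into the DOM.
    # Swap this placeholder condition for a selector matching what you need.
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.TAG_NAME, "a"))
    )
    with open("index.html", "w", encoding="utf-8") as file:
        file.write(driver.page_source)  # now contains the rendered body
finally:
    driver.quit()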
My code below downloads the website https://www.nasdaq.com/market-activity/stocks/mrtn/earnings. I am interested in the data in the tables, say the "Quarterly Earnings Surprise Amount" table. From the developer tools in Chrome, I can see the data sits in tags such as:
<td class="earnings-forecast__cell">1.13</td>
But when I download the page with the code below, the number inside the tag disappears; I only get <td class="earnings-forecast__cell"> </td>.
Can you please help me fix this? Thanks, HHC
import requests
from bs4 import BeautifulSoup as soup

header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
          'referer': 'https://www.nasdaq.com/market-activity/stocks/mrtn/earnings'
          }

# Send a GET request to the server:
url = 'https://www.nasdaq.com/market-activity/stocks/mrtn/earnings'
html = requests.get(url=url, headers=header)

# Check that the request succeeded (2xx status codes):
print(html.status_code)

data = soup(html.content, 'lxml')
print(type(data))
# print(data)
If you look at the page source, you can see that the table you are interested in doesn't have any values. This indicates that the data in the table is rendered via JavaScript.
Checking the requests sent from the browser's "Network" tab, we can see that a JS script sends an XHR request that comes back with the data you are looking for. The endpoint the script calls is: https://api.nasdaq.com/api/company/MRTN/earnings-surprise
Try this,
import requests

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'referer': 'https://www.nasdaq.com/market-activity/stocks/mrtn/earnings'
}
url = 'https://api.nasdaq.com/api/company/MRTN/earnings-surprise'
response = requests.get(url=url, headers=header)
if response.status_code == 200:
    print(response.json())
else:
    print("Failed", response.status_code)
In Chrome, filter the requests to "Fetch/XHR" and you should be able to see this request. (Refresh the page once with the "Network" tab open.)
Happy coding!
Goal:
I am trying to scrape the HTML from this page: https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d=.
(note - I will eventually want to paginate and scrape all job listings from this page)
My issue:
I get a 503 error when I try to scrape the page using Python and Requests. I am working out of Google Colab.
Initial Code:
import requests
url = 'https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d='
response = requests.get(url)
print(response)
Attempted solutions:
Using 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
Implementing this code I found in another thread:
import requests

def getUrl(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    }
    res = requests.get(url, headers=headers)
    res.raise_for_status()

getUrl('https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d=')
I am able to access the website via my browser.
Is there anything else I can try?
Thank you
That page is protected by Cloudflare. There are a few options for trying to get past it; using cloudscraper seems to work:
import cloudscraper
scraper = cloudscraper.create_scraper()
url = 'https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d='
response = scraper.get(url).text
print(response)
In order to use it, you'll need to install it:
pip install cloudscraper
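From there you can hand the HTML to BeautifulSoup as usual. A small sketch; the generic <a> loop is a placeholder, so inspect the page for the real job-listing markup (and the pagination links you'll need later):

import cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper()
url = 'https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d='
soup = BeautifulSoup(scraper.get(url).text, 'html.parser')

# Placeholder: print every link on the page; swap in the real listing selector.
for link in soup.find_all('a'):
    print(link.get('href'))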
I'm getting a 404 with Python Requests, but I can access the page with no problem through my browser. I can access other pages that are formatted exactly the same as this page, and they load with no problem.
I have already tried changing headers, with no luck.
My Code:
string_page = str(page)
with requests.Session() as s:
resp = s.get('https://bscscan.com/token/generic-tokentxns2?m=normal&contractAddress=0x470862af0cf8d27ebfe0ff77b0649779c29186db&a=&sid=f58c1cdefacc680b799412c7645ed7f7&p='+string_page)
page_info = str(resp.text)
print(page_info)
I have also tried with urllib and the same thing happens
I'm not sure if this will fix it, but try adding this User-Agent to the headers; it might work:
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
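For example, wired into the session from the question (page is a placeholder for whichever page number you are iterating over):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}
page = 1  # placeholder page number

with requests.Session() as s:
    s.headers.update(headers)  # every request in this session now sends the User-Agent
    resp = s.get('https://bscscan.com/token/generic-tokentxns2?m=normal&contractAddress=0x470862af0cf8d27ebfe0ff77b0649779c29186db&a=&sid=f58c1cdefacc680b799412c7645ed7f7&p=' + str(page))
    print(resp.status_code)
    print(resp.text)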
I want to scrape https://www.jdsports.it/ using BeautifulSoup, but I get "Access Denied".
On my PC I can access the site without any problem, and I'm using the same user agent in the Python program, yet the program gets a different result; you can see the output below.
EDIT:
I think I need cookies to gain access to the site. How can I get them and use them so that the Python program can access the site and scrape it?
(The script works if I use "https://www.jdsports.com", which is the same site but for a different region.)
Thanks!
import time
import requests
from bs4 import BeautifulSoup
import smtplib

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
url = 'https://www.jdsports.it/'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
status = soup.get_text()  # note: the original soup.findAll.get_text() raises, since findAll is a method
print(status)
The output is:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.jdsports.it/" on this server.<p>
Reference #18.35657b5c.1589627513.36921df8
</p></body>
</html>
I suspected HTTP/2 at first, but wasn't able to get that working either. Perhaps you'll have more luck; here's an HTTP/2 starting point:
import asyncio
import httpx
import logging

logging.basicConfig(format='%(message)s', level=logging.DEBUG)

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}
url = 'https://www.jdsports.it/'

async def f():
    # Use the client as a context manager so the connection pool is closed.
    # Note: newer httpx versions use follow_redirects rather than allow_redirects.
    async with httpx.AsyncClient(http2=True) as client:
        r = await client.get(url, headers=headers, follow_redirects=True)
        print(r.text)

asyncio.run(f())
(Tested both on Windows and Linux.) Could this have something to do with TLS 1.2? That's where I'd look next, as curl works.
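As for the cookie question in the EDIT: one low-tech option is to copy the site's cookies out of your browser's DevTools (Application -> Cookies) and replay them with requests. A hedged sketch; the cookie name and value below are placeholders, not the cookies the site really sets:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
cookies = {'cookie_name': 'value-copied-from-devtools'}  # placeholders

response = requests.get('https://www.jdsports.it/', headers=headers, cookies=cookies)
print(response.status_code)

Keep in mind browser cookies expire, so this is a stopgap rather than a durable fix.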
I have tried the code below; it doesn't work.
The PDF is readable in the browser.
I want to GET the PDF file from the GET URL and POST it to another server.
import requests
response = requests.get(url='some url')
requests.post(url='my_url', files={'file':response.content})
Link: (Expired)
It is caused by a missing header, specifically the User-Agent; it looks like the site checks for it.
Without it, the call returns an HTTP 406 (response.status_code). With the header, an HTTP 200 is returned.
Try this:
import requests
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"}
response = requests.get(url='some url', headers=header)
requests.post(url='my_url', files={'file':response.content})
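Optionally, if the receiving server expects a filename and content type (an assumption about that server, not something the question states), requests accepts a (filename, data, content_type) tuple:

# Same POST, but with an explicit placeholder filename and MIME type attached:
requests.post(
    url='my_url',
    files={'file': ('document.pdf', response.content, 'application/pdf')},
)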