I would like to scrape the results of this booking flow.
By looking at the network tab I've found out that the data is retrieved with an AJAX GET at this URL:
https://shop.caremar.it/main_acquista_1_corse_00_ajax.asp?l=it&data=24/02/2019&portoP=3&portoA=5&form_url=ticket_s1_2
I've built the URL by passing the parameters as follows:
import urllib.parse

params = urllib.parse.urlencode({
    'data': '24/02/2019',
    'portoP': '3',
    'portoA': '5',
    'form_url': 'ticket_s1_2',
})
and made the request:
import requests
from bs4 import BeautifulSoup

caremar_timetable_url = "https://shop.caremar.it/main_acquista_1_corse_00_ajax.asp?l=it&"
print(f"https://shop.caremar.it/main_acquista_1_corse_00_ajax.asp?l=it&{params}")
headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.3'}
res = requests.get(caremar_timetable_url, headers=headers, params=params)
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.text)
Output
https://shop.caremar.it/main_acquista_1_corse_00_ajax.asp?l=it&data=24%2F02%2F2019&portoP=7&portoA=1&form_url=ticket_s1_2
Non è stato possibile procedere con l'acquisto del biglietto online. Si prega di riprovare
The response is an error message from the site saying it can't complete the booking. If I copy and paste the URL I created into the browser, I get an unstyled HTML page with the data I need.
Why is this and how can I overcome it?
The data seems to come back fine with requests:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://shop.caremar.it/main_acquista_1_corse_00_ajax.asp?l=it&data=27/02/2019&portoP=1&portoA=4&form_url=ticket_s1_2'
res = requests.get(url)
soup = bs(res.content, 'lxml')
print(soup.select_one('html'))
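A note on the difference: urllib.parse.urlencode (and requests' params argument) percent-encodes the slashes in the date, turning 24/02/2019 into 24%2F02%2F2019, while the URL that works in the browser keeps them literal. A minimal sketch that builds the query string by hand, assuming the server rejects the encoded date:

import requests
from bs4 import BeautifulSoup

# Build the query string manually so the date keeps its literal slashes
# (urlencode would turn 27/02/2019 into 27%2F02%2F2019).
base = "https://shop.caremar.it/main_acquista_1_corse_00_ajax.asp"
query = "l=it&data=27/02/2019&portoP=1&portoA=4&form_url=ticket_s1_2"
res = requests.get(f"{base}?{query}")
soup = BeautifulSoup(res.content, 'lxml')
print(soup.get_text())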
So I tried to get a specific piece of text from a website, but it only gives me this error:
floor = soup.find('span', {'class': 'text-white fs-14px text-truncate attribute-value'}).text
AttributeError: 'NoneType' object has no attribute 'text'
I specifically want to get the 'Floor Price' text.
My code:
import requests
from bs4 import BeautifulSoup

# target url
url = "https://magiceden.io/marketplace/solsamo"
# act like a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
# parse the downloaded page
soup = BeautifulSoup(response.content, 'lxml')
floor = soup.find('span', {'class': 'text-white fs-14px text-truncate attribute-value'}).text
print(floor)
The data you need is not in the HTML you receive after:
response = requests.get('https://magiceden.io/marketplace/solsamo')
You can make sure of this by looking at page source code:
view-source:https://magiceden.io/marketplace/solsamo
You should use Selenium instead of requests to get your data, or you can examine the XHR requests on this page; maybe you can get the data using requests by following another link.
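A minimal Selenium sketch, assuming Chrome and a matching chromedriver are installed; the CSS selector comes from the question and may change as the site evolves:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://magiceden.io/marketplace/solsamo")
    # Wait for the JavaScript-rendered price element to appear in the DOM
    floor = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, "span.text-white.fs-14px.text-truncate.attribute-value")
        )
    )
    print(floor.text)
finally:
    driver.quit()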
I'm new to web scraping. I was trying to make a script that gets data from a balance sheet (here the site: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm). The problem is getting the data: when I look at the source code in my browser, I'm able to find the tag and the correct value. But once I write a script with bs4, I don't get anything.
I'm trying to get information from the balance sheet: Products, Services, Cost of sales... and the data contained in table 1 (I'm sorry, but I can't post the image; anyway, it's the first table you see scrolling down).
Here's my code.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
read_data = urlopen(req).read()
soup_data = BeautifulSoup(read_data,"lxml")
names = soup_data.find_all("td")
for name in names:
    print(name)
Thanks for your time.
Try this URL instead (the document itself, without the ix?doc= viewer wrapper), and include the headers to get the data:
import requests
from bs4 import BeautifulSoup
url = "https://www.sec.gov/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm"
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
req = requests.get(url, headers=headers)
soup_data = BeautifulSoup(req.text,"lxml")
You will be able to find the data you need.
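Continuing from the code above, a minimal sketch of pulling the cell text out of the tables; the row labels (Products, Services, Cost of sales, ...) appear in td elements:

# Print the text of every table cell; filter down to the rows you need.
for td in soup_data.find_all("td"):
    text = td.get_text(strip=True)
    if text:
        print(text)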
I am trying to perform web scraping using Python, BeautifulSoup and requests. I first need to log into the page and then request another page, from which I would like to scrape the data.
I can say that I log in successfully, as the status code is 200. However, when I request the next page after logging in, I do not get the whole content.
Specifically, I get this line instead of multiple nested divs.
<div id="app"></div>
The actual content is a full page of nested divs. My code is the following. I would like to ask whether I'm missing anything needed to get all the nested divs.
import requests
from bs4 import BeautifulSoup
import html5lib
headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
login_data = {
    'username': 'username',
    'password': 'password',
    'sp-login': 'false'
}

with requests.Session() as s:
    url = "https://api.private.zscaler.com/base/api/zpa/signin"
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    r = s.post(url, data=login_data, headers=headers)
    print(r.content)
    print(r.ok)
    print(r.status_code)
    r2 = requests.get("https://admin.private.zscaler.com/#dashboard/usersDashboard")
    print(r2.text)
The web app you are trying to scrape is most likely an SPA (Single Page Application) built with something like React, Vue, or Angular.
BeautifulSoup won't work in this case, because JavaScript needs to run on the page to build the DOM.
You would have to use something like Selenium to accomplish this.
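A minimal Selenium sketch, assuming Chrome is available; the login URL and dashboard URL come from the question, while the form-field locators are hypothetical placeholders you would replace after inspecting the real login page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://admin.private.zscaler.com/")
    # Hypothetical locators: inspect the real login form to find them.
    driver.find_element(By.NAME, "username").send_keys("username")
    driver.find_element(By.NAME, "password").send_keys("password")
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
    # Wait until the SPA has rendered content inside the #app div.
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#app div"))
    )
    driver.get("https://admin.private.zscaler.com/#dashboard/usersDashboard")
    print(driver.page_source)
finally:
    driver.quit()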
I'm writing a Python program to extract and store metadata from interesting online tech articles: "og:title", "og:description", "og:image", "og:url", and "og:site_name".
This is the code I'm using...
import urllib3
from bs4 import BeautifulSoup

# Setup Headers
headers = {}
headers['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
headers['Accept-Charset'] = 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
headers['Accept-Encoding'] = 'none'
headers['Accept-Language'] = "en-US,en;q=0.8"
headers['Connection'] = 'keep-alive'
headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"
# Create the Request
http = urllib3.PoolManager()
# Create the Response
response = http.request('GET', url, headers=headers)
# BeautifulSoup - Construct
soup = BeautifulSoup(response.data, 'html.parser')
# Scrape <meta property="og:title" content="x x x">
title = ""
for tag in soup.find_all("meta"):
    if tag.get("property", None) == "og:title":
        if len(tag.get("content", None)) > len(title):
            title = tag.get("content", None)
The program runs fine on all but one site. On "forbes.com", I can't get to the articles using Python:
url=
https://www.forbes.com/consent/?toURL=https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/#72c3b4e21086
I can't bypass this consent page, which seems to be the "Cookie Consent Manager" solution from TrustArc. In a browser, you provide your consent once... and on each subsequent visit you're able to access the articles.
If I reference the "toURL" url:
https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/#72c3b4e21086
and try to bypass the "https://www.forbes.com/consent/" page, I'm redirected back to it.
I've tried to see if there is a cookie I could set in the header, but couldn't find the magic key.
Can anyone help me?
There is a required cookie, notice_gdpr_prefs, that needs to be sent to view the data:
import requests
from bs4 import BeautifulSoup
src = requests.get(
    "https://www.forbes.com/sites/shermanlee/2018/07/31/privacy-revolution-how-blockchain-is-reshaping-our-economy/",
    headers={
        "cookie": "notice_gdpr_prefs"
    })
soup = BeautifulSoup(src.content, 'html.parser')
title = soup.find("meta", property="og:title")
print(title["content"])
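Note that the cookie is sent with just its name and no value; presumably the consent check only tests for the cookie's presence rather than its contents.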
I am scraping some information from this url: https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon#description-tab
Everything was fine until I tried to scrape the description. I have tried repeatedly, but so far I have failed; it seems I can't reach that information. Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon")
tree = BeautifulSoup(html, "lxml")
description = tree.find('div', {'id': 'description_section', 'class': 'description-section'})
Does anyone have a suggestion?
You would need to make an additional request to get the description. Here is a complete working example using requests + BeautifulSoup:
import requests
from bs4 import BeautifulSoup
url = "https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon/"
with requests.Session() as session:
    session.headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
    }

    # get the token
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    token = soup.find("meta", {"name": "csrf-token"})["content"]

    # get the description
    description_url = url + "description"
    response = session.get(description_url, headers={"X-CSRF-Token": token, "X-Requested-With": "XMLHttpRequest"})
    soup = BeautifulSoup(response.content, "html.parser")
    description = soup.find('div', {'id': 'description_section', 'class': 'description-section'})
    print(description.get_text(strip=True))
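The X-Requested-With: XMLHttpRequest header is what marks the request as AJAX; presumably the server only returns the description fragment for AJAX requests, and the CSRF token scraped from the landing page goes along with it.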
I use the XML package for web scraping, and I can't get the description section as you described with BeautifulSoup.
However, if you just want to scrape this particular page, you can download it first. Then:
page = htmlTreeParse("Lunar Lion - the first ever university-led mission to the Moon _ RocketHub.html",
                     useInternal = TRUE, encoding = "utf8")
unlist(xpathApply(page, '//div[@id="description_section"]', xmlValue))
I tried the R code below to download the page, and I can't find the description_section either.
url <- "https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon"
download.file(url, "page.html", mode = "w")
Maybe we have to add some options to download.file. I hope some HTML experts can help.
I found out how to scrape it with R:
library("rvest")
url="https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon/description"
url %>%
html() %>%
html_nodes(xpath='//div[#id="description_section"]', xmlValue) %>%
html_text()