Real page content isn't what I get with Requests and BeautifulSoup


As sometimes happens to me, I can't access everything with requests that I can see on the page in the browser, and I would like to know why. On these pages, I am particularly interested in the comments. Does anyone have an idea how to access those comments? Thanks!
import requests
from bs4 import BeautifulSoup
import re
url='https://aukro.cz/uzivatel/paluska_2009?tab=allReceived&type=all&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
searched = soup.find_all('td', class_='col1')
print(searched)

Worth knowing: you can get the scoring info for the user as JSON with a POST request. Handle the JSON as you require.
import requests
import pandas as pd
headers = {
    'Content-Type': 'application/json',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
}
url = 'https://aukro.cz/backend/api/users/profile?username=paluska_2009'
# POST with an empty body, as in the request the browser makes
response = requests.post(url, headers=headers, data="")
response.raise_for_status()
# pd.json_normalize replaces the deprecated pandas.io.json.json_normalize
df = pd.json_normalize(response.json())
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8', index=False)

I ran your code and analyzed the content you get in page.
It seems aukro.cz is built with Angular, since it uses ng-app, so the content is rendered dynamically by JavaScript and requests alone can't load it. You could try Selenium in headless mode to scrape the part of the content you are looking for.
Let me know if you need instructions for it.
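A quick way to confirm this diagnosis is to count the elements you are after in the raw HTML before reaching for Selenium. The following is only a minimal sketch: both HTML strings are made-up stand-ins (not the real aukro.cz markup), and it uses the standard library parser instead of BeautifulSoup so it runs without a network connection.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags that have a given tag name and CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_elements(html, tag, css_class):
    counter = TagCounter(tag, css_class)
    counter.feed(html)
    return counter.count


# Hypothetical shell, similar to what a JavaScript-driven site returns to requests
RAW_HTML = '<html><body ng-app="app"><div id="root"></div><script src="main.js"></script></body></html>'

# Hypothetical markup after the browser has run the scripts
RENDERED_HTML = (
    '<table><tr><td class="col1">Great seller</td></tr>'
    '<tr><td class="col1">Fast delivery</td></tr></table>'
)

print(count_elements(RAW_HTML, "td", "col1"))       # nothing to find in the shell
print(count_elements(RENDERED_HTML, "td", "col1"))  # elements exist only after rendering
```

If the count is zero for page.text but the elements are visible in the browser, the content is most likely loaded by JavaScript after the initial response.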

To address your curiosity about QHarr's answer:
If you load the URL in Chrome and trace the network calls, you will find a POST request to https://aukro.cz/backend/api/users/profile?username=paluska_2009 whose response is JSON containing the information you want.
This is a simple way of scraping data. On most sites you'll find that part of the page is loaded through separate API calls. To find the URL and POST parameters for such a request, Chrome's Network tools are handy.
Let me know if you need any further details.
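Once the call has been spotted in the Network tab, replaying it is mostly a matter of copying the method, URL, and headers. Here is a rough sketch using only the standard library; the function names are made up for illustration, the shortened User-Agent is a placeholder, and the endpoint is the one quoted in this thread (it may have changed since).

```python
import json
import urllib.request
from urllib.parse import urlencode

# Endpoint as quoted in the answers above; treat it as an assumption
API_BASE = "https://aukro.cz/backend/api/users/profile"


def build_profile_request(username):
    """Build (but do not send) the POST request seen in the Network tab."""
    url = f"{API_BASE}?{urlencode({'username': username})}"
    headers = {
        "Content-Type": "application/json",
        "User-Agent": "Mozilla/5.0",  # placeholder; copy your browser's value
    }
    # Empty body, mirroring data="" in the requests version
    return urllib.request.Request(url, data=b"", headers=headers, method="POST")


def fetch_profile(username, timeout=10):
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(build_profile_request(username), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_profile_request("paluska_2009")
print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes it easy to compare against what the browser sent before you hit the server.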

Related

Can't scrape information from a static webpage using requests module

I'm trying to fetch a product title and its description from a webpage using the requests module. The title and description appear to be static, as both are present in the page source. However, I failed to grab them with the following attempt. The script currently throws an AttributeError.
import requests
from bs4 import BeautifulSoup
link = 'https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}
with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    product_title = soup.select_one("h1[itemProp='name']").text
    product_desc = soup.select_one("#product-page-selling-statement").text
    print(product_title,product_desc)
How can I scrape the title and description from the above page using the requests module?

The page is dynamic. Go after the data from the API source:
import requests
import pandas as pd
api = 'https://www.nordstrom.com/api/ng-looks/styleId/6638030?customerId=f36cf526cfe94a72bfb710e5e155f9ba&limit=7'
jsonData = requests.get(api).json()
df = pd.json_normalize(jsonData['products'].values())
print(df.iloc[0])
Output:
id 6638030-400
name ANINE BING Women's Plaid Shirt
styleId 6638030
styleNumber
colorCode 400
colorName BLUE
brandLabelName ANINE BING
hasFlatShot True
imageUrl https://n.nordstrommedia.com/id/sr3/6d000f40-8...
price $149.00
pathAlias anine-bing-womens-plaid-shirt/6638030?origin=c...
originalPrice $149.00
productTypeLvl1 12
productTypeLvl2 216
isUmap False
Name: 0, dtype: object

When testing requests like these you should output the response to see what you're getting back. Best to use something like Postman (I think VSCode has a similar function to it now) to set up URLs, headers, methods, and parameters, and to also see the full response with headers. When you have everything working right, just convert it to python code. Postman even has some 'export to code' functions for common languages.
Anyways...
I tried your request on Postman and got this response:
Requests made from Python and from a browser are the same thing: if the headers, URLs, and parameters are identical, they should receive identical responses. So the next step is to compare your request with the request made by the browser.
So one or more of the headers sent by the browser is needed to get a good response from the server; User-Agent alone is not enough.
I would try to identify which headers, but unfortunately Nordstrom detected some 'unusual activity' and seems to have blocked my IP :(
Probably due to sending an obviously handmade request. I think it's my IP that's blocked, since I can't access the site from any browser, even after clearing my cache.
So double-check that the same hasn't happened to you while working with your scraper.
Best of luck!
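To narrow down which headers matter, it can help to list what the browser sends that the script does not. A minimal sketch, assuming you have copied both sets of headers by hand; the header values below are placeholders, not the ones Nordstrom actually requires.

```python
def missing_headers(browser_headers, script_headers):
    """Return the browser headers the script does not send.

    Header names are compared case-insensitively, as HTTP requires.
    """
    sent = {name.lower() for name in script_headers}
    return {
        name: value
        for name, value in browser_headers.items()
        if name.lower() not in sent
    }


# Placeholder values; copy the real ones from the browser's Network tab
browser_headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.5",
}
script_headers = {"user-agent": "Mozilla/5.0"}

for name in missing_headers(browser_headers, script_headers):
    print(name)
```

Adding the missing headers back one at a time shows which of them the server actually checks.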

BeautifulSoup and MechanicalSoup won't read website

I am working with BeautifulSoup and also trying MechanicalSoup. I have got both to load other websites, but when I request this website it takes a long time and never really gets the page. Any ideas would be super helpful.
Here is the BeautifulSoup code that I am writing:
import urllib3
from bs4 import BeautifulSoup as soup
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/?bb=hy89sjv-mN24znkgE'
http = urllib3.PoolManager()
r = http.request('GET', url)
Here is the MechanicalSoup code:
import mechanicalsoup
browser = mechanicalsoup.Browser()
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
page = browser.get(url)
page
What I am trying to do is gather data on different cities and apartments, so the URL will change to 2-bedrooms and then 3-bedrooms, then move to a different city and do the same thing there, so I really need this part to work.
Any help would be appreciated.

You see the same thing if you use curl or wget to fetch the page. My guess is they are using browser detection to try to prevent people from stealing their copyrighted information, as you are attempting to do. You can search for the User-Agent header to see how to pretend to be another browser.

import urllib3
import requests
from bs4 import BeautifulSoup as soup
headers = requests.utils.default_headers()
headers.update({
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
})
url = 'https://www.apartments.com/apartments/saratoga-springs-ut/1-bedrooms/'
r = requests.get(url, headers=headers)
rContent = soup(r.content, 'lxml')
rContent
Just as Tim said, I needed to add headers to my code so that the request would not be identified as coming from a bot.
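For the follow-up goal in the question (stepping through bedroom counts and then cities), the URLs can be generated up front. This is just a sketch that assumes the URL pattern stays the same as in the question; "lehi-ut" is a made-up second city slug for illustration and may not match the site's real naming.

```python
from itertools import product

BASE = "https://www.apartments.com/apartments"


def build_search_urls(city_slugs, bedroom_counts):
    """Return one search URL per (city, bedroom count) pair."""
    return [
        f"{BASE}/{city}/{beds}-bedrooms/"
        for city, beds in product(city_slugs, bedroom_counts)
    ]


# "lehi-ut" is a hypothetical slug, used only to show the loop
urls = build_search_urls(["saratoga-springs-ut", "lehi-ut"], [1, 2, 3])
for url in urls:
    print(url)
```

Each URL can then be passed to requests.get with the same headers as above, ideally with a pause between requests.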

How to download web table using Python or R script

I am trying to download data from the link below.
Link: https://dataminer2.pjm.com/feed/act_sch_interchange
I used the Python code below, but it is not working.
import urllib.request
from pprint import pprint
from html_table_parser import HTMLTableParser
import pandas as pd
def url_get_contents(url):
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    return f.read()
xhtml = url_get_contents('https://dataminer2.pjm.com/feed/act_sch_interchange')
p = HTMLTableParser()
p.feed(xhtml.decode('utf-8'))  # feed() expects str, not bytes
pprint(p.tables[1])
print("\n\nPANDAS DATAFRAME\n")
print(pd.DataFrame(p.tables[1]))
I am a beginner in coding, so please let me know if I did anything wrong in the code.
In addition to downloading the data, I want to download the table after changing the dates and text boxes.
Is this possible? Any help is appreciated; thanks in advance.

Downloading the page from said website will unfortunately not give you the data.
What the page contains (simplified) is some very basic HTML and JS code that loads the data in the background.
One way to see what gets loaded by the website is to use Chrome's developer tools (F12, Network tab). Having had a quick look (no, just putting the URL in the browser will not work), it seems that to load the JSON containing the data you will need to construct HTTP requests with the correct headers (e.g. Ocp-Apim-Subscription-Key) and query the API directly.

You can try this:
import requests
import pandas as pd
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0",
"Ocp-Apim-Subscription-Key": "d408630449804e23b07148259c96b24a",
"Connection": "keep-alive",
}
r = requests.get("https://api.pjm.com/api/v1/act_sch_interchange", headers=headers)
df = pd.DataFrame(r.json()["items"])
If you want to change the date range, change the URL:
url = "https://api.pjm.com/api/v1/act_sch_interchange?rowCount=25&sort=datetime_beginning_utc&order=Asc&startRow=1&isActiveMetadata=true&fields=actual_flow,datetime_beginning_ept,datetime_beginning_utc,datetime_ending_ept,datetime_ending_utc,inadv_flow,sched_flow,tie_line&datetime_beginning_ept=1/27/2021 00:00to2/1/2021 23:59"
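Rather than editing that long string by hand, the query can be assembled from a dict so the dates are easy to swap. A sketch under assumptions: the parameter names and the "M/D/YYYY HH:MMtoM/D/YYYY HH:MM" range format are taken from the URL above, only a subset of the parameters is included, and build_feed_url is a made-up helper name.

```python
from datetime import date
from urllib.parse import quote, urlencode

API_URL = "https://api.pjm.com/api/v1/act_sch_interchange"


def build_feed_url(start, end, row_count=25):
    """Build the API URL for a date range, percent-encoding the query string."""
    # Same range format as the hand-written URL: no zero padding on month/day
    date_range = (
        f"{start.month}/{start.day}/{start.year} 00:00"
        f"to{end.month}/{end.day}/{end.year} 23:59"
    )
    params = {
        "rowCount": row_count,
        "sort": "datetime_beginning_utc",
        "order": "Asc",
        "startRow": 1,
        "datetime_beginning_ept": date_range,
    }
    # quote_via=quote encodes spaces as %20 instead of "+"
    return f"{API_URL}?{urlencode(params, quote_via=quote)}"


url = build_feed_url(date(2021, 1, 27), date(2021, 2, 1))
print(url)
```

The resulting URL can be passed to requests.get together with the headers from the snippet above. With requests you could also pass the dict as params= and let the library do the encoding.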

Crawling a site with iframe

I am trying to crawl data from this site. It uses multiple iframes for different components.
When I try to open one of the iframe URLs in the browser, it opens in that particular session, but in another incognito/private session it doesn't. The same happens when I try this via requests or wget.
I have tried using requests with a session, but that doesn't work either. Here is my code snippet:
import requests
s = requests.Session()
s.get('https://www.epc.shell.com/')
r = s.get('https://www.epc.shell.com/welcome.asp')
r.text
The last line only returns JavaScript text with an error saying that the URL is invalid.
I know Selenium can solve this problem, but I consider it a last resort.
Is it possible to crawl this URL with requests (or without using JavaScript)? If yes, any help would be appreciated. If no, is there an alternative lightweight JavaScript library in Python that can achieve this?

Your issue can be solved by adding custom headers to your requests. All in all, your code should look like this:
import requests
s = requests.Session()
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Language": "en-US,en;q=0.5"}
s.get('https://www.epc.shell.com/', headers = headers)
r = s.get('https://www.epc.shell.com/welcome.asp', headers = headers)
print(r.text)
(Do note that it is almost always recommended to use headers when sending requests).
I hope this helps!

Scraping Data from website with a login page

I am trying to log in to my university website using Python and the requests library with the following code, but I am not able to.
import requests
payloads = {"User_ID": "<username>",
            "Password": "<password>",
            "option": "credential",
            "Log in": "Log in"
            }
with requests.Session() as session:
    session.post('', data=payloads)
    get = session.get("")
    print(get.text)
Does anyone have any idea on what I am doing wrong?

In order to log in you will need to post all the information requested by the <input> tags. In your case you will also have to provide the hidden inputs. You can do this by scraping those values and then posting them. You might also need to send some headers to simulate browser behaviour.
from lxml import html
import requests
s = requests.Session()
login_url = "https://intranet.cardiff.ac.uk/students/applications"
session_url = "https://login.cardiff.ac.uk/nidp/idff/sso?sid=1&sid=1"
to_get = s.get(login_url)
tree = html.fromstring(to_get.text)
hidden_inputs = tree.xpath(r'//form//input[@type="hidden"]')
payloads = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}
payloads["Ecom_User_ID"] = "<username>"
payloads["Ecom_Password"] = "<password>"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
result = s.post(session_url, data=payloads, headers = headers)
Hope this works
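To see what the hidden-input step produces without touching the real login page, here is a small offline sketch. The sample form is invented for illustration (the real form will have different hidden fields), and it uses the standard library parser instead of lxml so it runs anywhere; the Ecom_User_ID and Ecom_Password field names are the ones used in the code above.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and (attributes.get("type") or "").lower() == "hidden":
            name = attributes.get("name")
            if name:
                # A hidden input without a value attribute is posted as ""
                self.fields[name] = attributes.get("value") or ""


def build_login_payload(form_html, username, password):
    """Merge the hidden form fields with the credentials."""
    collector = HiddenInputCollector()
    collector.feed(form_html)
    payload = dict(collector.fields)
    payload["Ecom_User_ID"] = username
    payload["Ecom_Password"] = password
    return payload


# Hypothetical login form; the real page will differ
SAMPLE_FORM = """
<form method="post">
  <input type="hidden" name="option" value="credential">
  <input type="hidden" name="target" value="">
  <input type="text" name="Ecom_User_ID">
  <input type="password" name="Ecom_Password">
</form>
"""

payload = build_login_payload(SAMPLE_FORM, "<username>", "<password>")
print(payload)
```

Printing the payload before posting it is a cheap way to check that no required field is missing.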

To log in to a website with Python, you may have to use a more involved method than the requests library, because some sites require you to simulate the browser in your code and have it make the requests that log in to the school's servers. The server needs to think it is getting the request from a browser; it then returns the contents of the resulting page, and those contents have to be rendered before you can scrape them. Luckily, a great way to do this is with the selenium module in Python.
I would recommend googling around to learn more about selenium. This blog post is a good example of using selenium to log in to a web page, with detailed explanations of what each line of code is doing. This SO answer on using selenium to log in to a website is also a good entry point.
