I am using the following URL to extract the JSON file for the price history:
https://steamcommunity.com/market/pricehistory/?appid=730&market_hash_name=P90%20|%20Blind%20Spot%20(Field-Tested)
The Python code I am using:
import requests, json

item = requests.get(URL, cookies={'steamLogin': steamid})  # get item data
print(str(currRun), ' out of ', str(len(allItemNames)) + ' code: ' + str(item.status_code))
item = json.loads(item.content)  # parse the response body as JSON
I have gone through almost all the solutions posted in this community, but I am still getting status code 400 and Items as [].
When I copy-paste the URL and open it in a browser I can see the JSON file with the required data, but somehow the Jupyter notebook is unable to retrieve the content.
I also tried BeautifulSoup to read the content with the following code:
import requests
from bs4 import BeautifulSoup

r = requests.get(url)
# The code below parses the whole HTML of the above URL
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find_all('pre')  # look for <pre> tags holding the JSON
print(table)
Output: []
You are getting [] because you are not authorized, so you receive an empty JSON array. You can check this by opening the link in incognito mode (Ctrl+Shift+N).
To authorize, you need to set the Cookie header on your request, so your code becomes:
import requests
url = "https://steamcommunity.com/market/pricehistory/?appid=730&market_hash_name=P90%20%7C%20Blind%20Spot%20(Field-Tested)"
headers = {
"Cookie": "Your cookie"
}
data = requests.get(url, headers=headers).text  # "data" rather than "json", to avoid shadowing the json module
...
How to find the Cookie (Chrome):
Go to the link with the JSON.
Press F12 to open Chrome DevTools.
Open the Network tab.
Reload the page.
Double-click on the first sent request.
Open the Headers subtab.
Scroll to Request Headers.
Find the Cookie header.
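Putting it all together, here is a minimal sketch (the cookie value is a placeholder you paste from DevTools, and the "prices" key is an assumption about what this endpoint returns when authorized):

import requests

url = "https://steamcommunity.com/market/pricehistory/?appid=730&market_hash_name=P90%20%7C%20Blind%20Spot%20(Field-Tested)"
headers = {"Cookie": "PASTE_YOUR_COOKIE_HERE"}  # placeholder: full Cookie header value from DevTools

resp = requests.get(url, headers=headers)
print(resp.status_code)  # 400 here usually means the cookie is missing or stale
data = resp.json()
print(data.get("prices", [])[:5])  # assumption: authorized responses carry a "prices" list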
Related
I tried downloading PDF files from a website, where they are contained in a table with pagination. I can download the PDFs from the first page, but not from all 4000+ pages. When I tried to understand the logic by observing the URL request, it appeared static, with no additional value appended to it during pagination, and I couldn't figure out a way to fetch all the PDFs from the table using BeautifulSoup.
Attached is the code I am using to download the PDF files from the table on the website:
# Import libraries
import re
import requests
from bs4 import BeautifulSoup

# URL from which PDFs are to be downloaded
url = "https://loksabha.nic.in/Questions/Qtextsearch.aspx"

# Request the URL and get a response object
response = requests.get(url)

# Parse the HTML obtained
soup = BeautifulSoup(response.content, 'html.parser')
span = soup.find("span", id="ContentPlaceHolder1_lblfrom")
Total_pages = re.findall(r'\d+', span.text)
print(Total_pages[0])

# Find the results table holding the PDF hyperlinks
# (assumption: the first <table> is the results table; adjust if needed)
table1 = soup.find('table')
i = 0

# From all links, check for PDF links and download the files
for link in table1.find_all('a'):
    if '.pdf' in link.get('href', ''):
        list2 = re.findall('CalenderUploading', link.get('href', ''))
        if len(list2) == 0:
            print(link.get('href', ''))
            i += 1
            # Get response object for the link
            response = requests.get(link.get('href'))
            # Write content to a PDF file
            pdf = open("pdf" + str(i) + ".pdf", 'wb')
            pdf.write(response.content)
            pdf.close()
            print("File ", i, " downloaded")
print("All PDF files downloaded")
Firstly:
You need to establish a session the first time you make a call, so that cookie values are stored:
sess = requests.session()
Then use sess.get subsequently instead of requests.get.
Secondly:
It's not static... subsequent pages are not fetched with a GET request.
They use a POST request with ctl00$ContentPlaceHolder1$txtpage="2" for page 2.
Make a session with requests, then capture the view-state parameters from the first response using BeautifulSoup: the values of __VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION, etc. are in a <div class="aspNetHidden"> when you request the page for the first time. For subsequent pages you have to pass these parameters, along with the page number, as POST parameters (e.g. ctl00$ContentPlaceHolder1$txtpage="2"), using "POST" and not "GET". This is what is sent by the POST request for, e.g., page 4001 on the Lok Sabha site.
Work out the other parts... don't expect a complete solution here :-)
import requests
from bs4 import BeautifulSoup as bs

sess = requests.session()
resp = sess.get('https://loksabha.nic.in/Questions/Qtextsearch.aspx')
soup = bs(resp.content, 'html.parser')
vstat = soup.find('input', {'name': '__VIEWSTATE'})['value']
vstatgen = soup.find('input', {'name': '__VIEWSTATEGENERATOR'})['value']
vstatenc = soup.find('input', {'name': '__VIEWSTATEENCRYPTED'})['value']
eventval = soup.find('input', {'name': '__EVENTVALIDATION'})['value']

for pagenum in range(4000):  # change as per your old code
    postback = {'__EVENTTARGET': 'ctl00$ContentPlaceHolder1$cmdNext',
                '__EVENTARGUMENT': '',
                '__VIEWSTATE': vstat,
                '__VIEWSTATEGENERATOR': vstatgen,
                '__VIEWSTATEENCRYPTED': vstatenc,
                '__EVENTVALIDATION': eventval,
                'ctl00$txtSearchGlobal': '',
                'ctl00$ContentPlaceHolder1$ddlfile': '.pdf',
                'ctl00$ContentPlaceHolder1$TextBox1': '',
                'ctl00$ContentPlaceHolder1$btn': 'allwordbtn',
                'ctl00$ContentPlaceHolder1$btn1': 'titlebtn',
                'ctl00$ContentPlaceHolder1$txtpage': str(pagenum)}
    resp = sess.post('https://loksabha.nic.in/Questions/Qtextsearch.aspx', data=postback)
    soup = bs(resp.content, 'html.parser')
    # Refresh the view-state tokens for the next request
    vstat = soup.find('input', {'name': '__VIEWSTATE'})['value']
    vstatgen = soup.find('input', {'name': '__VIEWSTATEGENERATOR'})['value']
    vstatenc = soup.find('input', {'name': '__VIEWSTATEENCRYPTED'})['value']
    eventval = soup.find('input', {'name': '__EVENTVALIDATION'})['value']
    ### process next page... extract PDFs here
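To fill in the "process next page" part, here is a hedged sketch that reuses the PDF-filtering logic from the question; it belongs inside the for loop above, and assumes each page's table exposes absolute .pdf hrefs like the first page does:

    # Inside the for loop: download every non-CalenderUploading PDF on this page
    for link in soup.find_all('a'):
        href = link.get('href', '')
        if '.pdf' in href and 'CalenderUploading' not in href:
            pdf_resp = sess.get(href)
            with open(href.rsplit('/', 1)[-1], 'wb') as f:
                f.write(pdf_resp.content)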
I have a simple script where I want to scrape a menu from a URL:
https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822
When I inspect the page using dev tools, I can see that the menu is contained in the section <div class="menu-area" id="section_1026228">.
So my script is fairly simple, as follows:
import requests
from bs4 import BeautifulSoup
venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
response = requests.get(venue_url, headers={'User-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')
menu = soup.find('div', {'class': 'menu-area'})
print(menu.text)
I have tried this on a locally saved copy of the page and it works. But when I run it against the live URL using the requests library, it does not work: it cannot find the div. It throws this error:
print(menu.text)
AttributeError: 'NoneType' object has no attribute 'text'
which basically means it cannot find the div. Does anyone know why this is happening and how to fix it?
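For what it's worth, a minimal guard (using the same variable names as above) makes the failure explicit instead of raising AttributeError:

menu = soup.find('div', {'class': 'menu-area'})
if menu is None:
    # The div is absent from the fetched HTML, e.g. the server returned
    # a login or interstitial page instead of the menu
    print(response.status_code, response.url)
else:
    print(menu.text)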
I just logged out from my browser and it showed me a different page. However, my script has no login part at all; I'm not even sure how that would work.
[It doesn't work with all sites, but it seems to be enough for this site so far.] You can log in with requests.Session:
# import requests
sess = requests.Session()
headers = {'user-agent': 'Mozilla/5.0'}
data = {'username': 'YOUR_EMAIL/USERNAME', 'password': 'YOUR_PASSWORD'}
loginResp = sess.post('https://untappd.com/login', headers=headers, data=data)
print(loginResp.status_code, loginResp.reason, 'from', loginResp.url) ## should print 200 OK...
response = sess.get(venue_url, headers=headers)
## CAN CONTINUE AS BEFORE ##
I've edited my solution to one of your previous questions about this site to include cookies so that the site will treat you as logged in. For example:
# venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
gloryMenu = scrape_untappd_menu(venue_url, cookies=sess.cookies)
will collect the following data:
Note: they have a captcha when logging in, so I was worried it would be too hard to automate. If it becomes an issue, you can [probably] still log in in your browser before going to the page, and then paste the request from your network log into curlconverter to get the cookies as a dictionary. Of course, the process is then no longer fully automated, since you'll have to repeat this manual login every time the cookies expire (which could be as soon as a few hours). If you wanted to automate the login at that point, you might have to use some kind of browser automation, like Selenium.
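As a sketch of that manual route (the cookie name and value below are hypothetical placeholders; curlconverter will give you the real ones from your logged-in session):

import requests

# Hypothetical cookies dict as produced by curlconverter
cookies = {'untappd_session': 'PASTE_VALUE_FROM_DEVTOOLS'}
venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
response = requests.get(venue_url, headers={'User-agent': 'Mozilla/5.0'}, cookies=cookies)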
I have been trying for six hours to extract my grades from my school website.
My code is working perfectly, except for one problem...
I cannot get connected through the Microsoft login page.
The only way to get connected is by grabbing my cookies and pasting them into the code.
This is not what I want. I want to submit my email and my password and get connected.
This is my code:
from bs4 import BeautifulSoup
import requests
import csv

'''
publishedGrade[x]
1 = ID
3 = grade
'''

cookies = {
    'SERVERID': 'XXXXXXXXXXX',
    'JSESSIONID': 'XXXXXXXXXXX',
}

# Start the session
session = requests.Session()

# Create the payload
payload = {'loginfmt': 'XXXXXXXX',
           'passwd': 'XXXXXXXX'}

# Post the payload to the site to log in
s = session.post("https://login.microsoftonline.com/08983daf-5aca-4f44-bc65-c23ce32d46ec/oauth2/authorize?response_type=code&client_id=dcd45342-6b22-4490-aaa6-39e6199c2bf6&scope=openid&redirect_uri=https%3A%2F%2Faurion.ieseg.fr%2Fopenid_connect_login&nonce=XXXXXX&state=XXXXXXXX&sso_reload=true", data=payload, cookies=cookies)

# Navigate to the next page and scrape the data
soup = session.get('https://aurion.ieseg.fr/faces/LearnerNotationListPage.xhtml', cookies=cookies)
soup = BeautifulSoup(soup.text, 'html.parser')
results = soup.find(id="form:dataTableFavori_data")
elements = results.find_all("tr", role="row")

allIDs = []
allGrades = []

# Getting each published grade
for element in elements:
    publishedGrade = element.find_all("td", role="gridcell")
    # Extracting only the numeric course ID, e.g. XXX_XXX_XXX_ID_XXX
    ID = publishedGrade[1].text.split("_")
    # Appending the course ID and the grade
    allIDs.append(ID[-2])
    allGrades.append(publishedGrade[3].text)

# Saving everything to a CSV file
with open('text.csv', 'w', encoding='UTF-8', newline='') as f:
    writer = csv.writer(f)
    for w in range(len(elements)):
        writer.writerow([allIDs[w], allGrades[w]])
In this code I have just censored the cookies, the password, the email, and the end of the URL of the Microsoft login page.
I tried using Selenium, but it is not optimal, because I plan to integrate the code into an HTML page with Django.
I saved the HTML page shown when I tried to connect with my email. When I open it in a browser, it redirects me to https://login.microsoftonline.com/cookiesdisabled.
This is what I see
How do I get access to this API?
import requests
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
print(requests.get(url))
I'm trying to retrieve data from this site via the API. I found the URL above and I can see its data; however, I can't seem to get it right because I'm running into a 403 code.
This is the website url:
https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos
I'm trying to retrieve the item categories; they are visible to me, but I'm unable to fetch them.
Later I'll use these categories to iterate over the products API.
API Category
Note: please be gentle, it's my first post here =]
To get the data shown in your image, the following headers and endpoint are needed:
import requests

headers = {
    'sm-token': '{"IdLoja":2691,"IdRede":884}',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos',
}
params = {
    'id_loja': '2691',
}
r = requests.get('https://www.nagumo.com.br/api/b2c/page/menu', params=params, headers=headers)
r.json()
Not sure exactly what your issue is here. But if you want to see the content of the response, and not just the 200/400 response codes, you need to add .content to your print.
E.g.:
import requests

# Create a session
s = requests.Session()

# Example connection variables; probably not required for your use case
setCookieUrl = 'https://www...'
HeadersJson = {'Accept-Language': 'en-us'}
bodyJson = {"__type": "xxx", "applicationName": "xxx", "userID": "User01", "password": "password2021"}

# GET request (using the example variables defined above)
p = s.get(setCookieUrl, json=bodyJson, headers=HeadersJson)
print(p)          # prints just the response status, e.g. <Response [200]>
#print(p.headers)
print(p.content)  # prints the content of the response
#print(s.cookies)
I'm also new here, haha, but besides the requests library, you'll also need another one like BeautifulSoup for what you're trying to do.
bs4 installation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Once you install it and import it, it's just continuing what you were doing to actively get your data.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
This gets the entire HTML content of the page, and so you can get your data from it based on CSS selectors, like this:
site_data = soup.select('selector')
site_data is a list of all elements matching that selector, so a simple for loop and a list to collect your items would suffice (as an example, getting the links for each book on a bookstore site).
For example, if I were trying to get links from a site:
import requests
from bs4 import BeautifulSoup

sites = []
URL = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
response = requests.get(URL)  # note: URL, not the undefined lowercase url
soup = BeautifulSoup(response.text, "html.parser")
links = soup.select("a")  # list of all items with this selector
for link in links:
    sites.append(link)
Also, a helpful tip: when you inspect the page (right-click and press 'Inspect' at the bottom), you can see the page's code. Go to the HTML, find the data you want, right-click it, and select Copy -> Copy selector. This makes it really easy to get the data you want on that site.
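For instance (the selector below is a hypothetical example of what "Copy selector" might give you):

# Hypothetical selector pasted from DevTools' "Copy selector"
site_data = soup.select('#content > div.book-list > a.title')
for item in site_data:
    print(item.get_text(strip=True), item.get('href'))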
helpful sites:
https://oxylabs.io/blog/python-web-scraping
https://realpython.com/beautiful-soup-web-scraper-python/
I'm trying to create a program that grabs my school grades from a website every day, then stores the values and creates a graph of my grades. But when I try to scrape the page, the HTML that I receive is different than the HTML that I get with Inspect Element.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://ames.usoe-dcs.org/Students/2567")
bsObj = BeautifulSoup(html.read(), 'lxml')
print(bsObj)
Inspect Element gives me: http://pastebin.com/BakmpqUM
while Python gives me: http://pastebin.com/7gPY1WgB
I figure this is because the URL to my grades (https://ames.usoe-dcs.org/Students/2567) is private, so when I type it into the browser it redirects me here: https://ames.usoe-dcs.org/Login/?DestinationURL=%2FStudents%2F2566
Is there a way to use Python to automatically sign me in?
The URL isn't necessarily private, but requesting it without the cookies verifying your status as a logged-in user won't get you to the information you see when logged in.
I would recommend opening Inspect Element on the Network tab and reloading the page with your grades on it (while signed in). Then right-click on the first request (it should be a GET request answered with HTML, code 200), hover over Copy, and click Copy as cURL (bash). Then paste it into curlconverter and copy the Python it produces. This gives you the proper request for the page, including the cookies and verification parameters you used to access it in the browser. From there you can parse the HTML response for your grade.
You should have something like this to receive and parse your HTML from the request:
import requests
from bs4 import BeautifulSoup

cookies = {
    ...stuff...
}
headers = {
    ...stuff...
}
r = requests.get("https://ames.usoe-dcs.org/Students/2567", headers=headers, cookies=cookies)
soup = BeautifulSoup(r.text, "lxml")
grade = soup.find("h1", {"class": "grade"}).contents  # customize to find your grade
print(grade)
The cookies and headers dictionaries come from the cURL to Python output.