Getting the entire body element of an html


https://www.apple.com/covid19/mobility
import requests
from bs4 import BeautifulSoup
source = requests.get("https://www.apple.com/covid19/mobility")
soup = BeautifulSoup(source.text, "lxml")
I'm trying to get the URL behind the "All Data CSV" button, which I can see when I inspect the element in the browser. requests.get doesn't seem to return the full body with all of its elements.

The link is presumably added by JavaScript after the page loads, so it is not in the HTML that requests receives. Use the following endpoint instead, which returns JSON:
https://covid19-static.cdn-apple.com/covid19-mobility-data/current/v1/index.json
To get the URL, look up the relevant keys in the parsed response.
Code:
import requests
url = 'https://covid19-static.cdn-apple.com/covid19-mobility-data/current/v1/index.json'
data = requests.get(url).json()
print("https://covid19-static.cdn-apple.com" + data['basePath'] + data['regions']['en-us']['csvPath'])
Output:
https://covid19-static.cdn-apple.com/covid19-mobility-data/2006HotfixDev17/v1/en-us/applemobilitytrends-2020-04-25.csv
To get the CSV data in JSON format, try this URL:
url = 'https://covid19-static.cdn-apple.com/covid19-mobility-data/2006HotfixDev17/v1/en-us/applemobilitytrends.json'
data = requests.get(url).json()
print(data)
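The URL assembly above can be wrapped in a small helper. This is only a minimal sketch: the name build_csv_url and the sample dictionary are illustrative, and the way the path is split between basePath and csvPath is inferred from the output shown above, not taken from a live response.

```python
HOST = "https://covid19-static.cdn-apple.com"

def build_csv_url(index, locale="en-us", host=HOST):
    """Join the host, basePath and the locale's csvPath from an index.json-style dict."""
    return host + index["basePath"] + index["regions"][locale]["csvPath"]

# Sample data shaped like the index.json response. The values are assumptions
# reconstructed from the output above, not a live download.
sample_index = {
    "basePath": "/covid19-mobility-data/2006HotfixDev17/v1",
    "regions": {"en-us": {"csvPath": "/en-us/applemobilitytrends-2020-04-25.csv"}},
}

csv_url = build_csv_url(sample_index)
print(csv_url)
```

With a live response you would pass requests.get(url).json() in place of sample_index.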

Related

Output of Requests for API in JSON/Dictionary format

I am very new to APIs. I am trying to get the response of a requests.post call as a JSON file or a dictionary. I get a status_code of 200, so I know the request succeeded, but response.text gives me everything back as a single string. I have read parts of the Requests Quickstart guide, but it only seems to use .text to extract the data. Ideally, the output for this particular API would be a JSON file or a dictionary I can work with.
What I have so far (I get this is not a full reproducible example, but I think it gets the point across, otherwise refer to here for some examples):
import pandas as pd
import requests
response = requests.post(
    url=request_url,
    headers=headers,
    json=body,
)
response.text  # returns a string
response.json  # returns a method
pd.json_normalize(response.text)  # throws an error saying pandas does not have this attribute (which it should, I don't know why not)
pd.read_json(response.text)  # gives a somewhat workable dataframe
pd.read_json() gets me somewhere, but the result is an object inside a single dataframe cell, which doesn't feel like the right route to go down.

Based on John Gordon's comment above, call the json method instead of just referencing it:
data = response.json()
Then, with from pandas.io.json import json_normalize (in pandas 1.0 and later the same function is available as pd.json_normalize), you can also do
df = json_normalize(data)
This converts the parsed response into a pandas dataframe.
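To see what the normalisation step does to a nested response, here is a rough standard-library sketch. The flatten helper and the sample body are hypothetical. The helper only imitates the dotted column names that json_normalize produces for nested dictionaries and is not a replacement for it.

```python
import json

def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Stand-in for response.text; a real API body will look different.
text = '{"id": 7, "user": {"name": "ann", "address": {"city": "Oslo"}}}'

data = json.loads(text)  # parsing the string yields a dict, as response.json() does
row = flatten(data)
print(row)
```

The point is that the parsed dictionary, not the raw response.text string, is what the normalisation step expects as input.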

Unable to modify page number which is within dictionary

I've written a script in Python that uses POST requests to fetch JSON content from a webpage. The script works fine as long as I stick to the default page. However, I want to loop over a few different pages and collect their content. The one problem I can't solve is how to use the page keyword inside the payload so that the loop covers three different pages. Treat my faulty approach below as a placeholder.
How can I use format inside a dict to change the page number?
Working script (if I get rid of the pagination loop):
import requests
link = 'https://nsv3auess7-3.algolianet.com/1/indexes/idealist7-production/query?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.30.0&x-algolia-application-id=NSV3AUESS7&x-algolia-api-key=c2730ea10ab82787f2f3cc961e8c1e06'
for page in range(0,3):
    payload = {"params":"getRankingInfo=true&clickAnalytics=true&facets=*&hitsPerPage=20&page={}&attributesToSnippet=%5B%22description%3A20%22%5D&attributesToRetrieve=objectID%2Ctype%2Cpublished%2Cname%2Ccity%2Cstate%2Ccountry%2Curl%2CorgID%2CorgUrl%2CorgName%2CorgType%2CgroupID%2CgroupUrl%2CgroupName%2CisFullTime%2CremoteOk%2Cpaid%2ClocalizedStarts%2ClocalizedEnds%2C_geoloc&filters=(orgType%3A'NONPROFIT')%20AND%20type%3A'JOB'&aroundLatLng=40.7127837%2C%20-74.0059413&aroundPrecision=15000&minimumAroundRadius=16000&query="}
    res = requests.post(link,json=payload.format(page)).json()
    for item in res['hits']:
        print(item['name'])
I get an error when I run the script as it is:
res = requests.post(link,json=payload.format(page)).json()
AttributeError: 'dict' object has no attribute 'format'

format is a string method. You should apply it to the string value of your payload instead:
payload = {"params":"getRankingInfo=true&clickAnalytics=true&facets=*&hitsPerPage=20&page={}&attributesToSnippet=%5B%22description%3A20%22%5D&attributesToRetrieve=objectID%2Ctype%2Cpublished%2Cname%2Ccity%2Cstate%2Ccountry%2Curl%2CorgID%2CorgUrl%2CorgName%2CorgType%2CgroupID%2CgroupUrl%2CgroupName%2CisFullTime%2CremoteOk%2Cpaid%2ClocalizedStarts%2ClocalizedEnds%2C_geoloc&filters=(orgType%3A'NONPROFIT')%20AND%20type%3A'JOB'&aroundLatLng=40.7127837%2C%20-74.0059413&aroundPrecision=15000&minimumAroundRadius=16000&query=".format(page)}
res = requests.post(link,json=payload).json()
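A sketch of the pagination loop with this fix applied, kept offline so that it runs without the network. The shortened PARAMS_TEMPLATE and the build_payload helper are illustrative assumptions. In the real script the template would be the long params string above, and each payload would be passed to requests.post(link, json=payload).

```python
# Shortened stand-in for the long "params" string; {} marks the page number.
PARAMS_TEMPLATE = "getRankingInfo=true&hitsPerPage=20&page={}&query="

def build_payload(page, template=PARAMS_TEMPLATE):
    """Return a fresh payload dict with the page number formatted into the string."""
    return {"params": template.format(page)}

payloads = [build_payload(page) for page in range(0, 3)]
for payload in payloads:
    # In the real script: res = requests.post(link, json=payload).json()
    print(payload["params"])
```

Formatting the string before it goes into the dict means every iteration gets its own payload, so nothing is left over from the previous page.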

How to search for required data in JSON object obtained in python session

I tried to scrape Twitter followers using the requests library. I saved the response for the required page as JSON and then tried to search it for the parts I need. How do I find the required elements in the JSON object?
My code is:
import json
import requests
from bs4 import BeautifulSoup
s = requests.session()
res = s.post("https://twitter.com/sessions",data=payload,headers=headers)
r = s.get("https://twitter.com/akhiltaker619/following/users?include_available_features=1&include_entities=1&max_position=1590310744326457266&reset_error_state=false")
dp = r.text
dp1 = json.loads(dp)
x = json.dumps(dp1)
print(res.status_code)
soup = BeautifulSoup(x,"html.parser")
x1 = soup.find_all("b",{"class":"u-linkComplex-target"})
for i in x1:
    print(i.text)
The parsing part at the end is wrong, because I am trying to scrape a JSON object as if it were HTML, which is not possible. When I print the JSON object, I get this:
The attached link contains the output of the JSON object.
From this object, I want the elements with class "u-linkComplex-target" that appear in the "item_html" of the JSON object. How can I get them? Or is there a way to get the same content without going through the JSON object? (The content is the followers list page on Twitter.) I used JSON in order to load the dynamic content of the page.

The Beautiful Soup library is for parsing HTML and similar tagged languages, not JSON.
If your requests return JSON responses then you should call the r.json() method. This will return a dictionary of the JSON structure. Suppose you used
j = r.json()
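then you probably want j['item_html'] or a similarly named key. If that value is itself a string of HTML, as the question suggests, pass that string to Beautiful Soup and search it for the u-linkComplex-target class. If you explore the dictionary interactively you will probably find what you want.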
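The two steps, parsing the JSON first and then searching the HTML string stored inside it, can be sketched with the standard library alone. The key name item_html comes from the question and is an assumption about the real response. The sample body and the TargetTextParser class are made up for illustration, and Beautiful Soup's find_all would do the same job on the extracted string.

```python
import json
from html.parser import HTMLParser

class TargetTextParser(HTMLParser):
    """Collect the text of every <b class="u-linkComplex-target"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "b" and "u-linkComplex-target" in classes:
            self._inside = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "b":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.names[-1] += data

# Stand-in for r.text; the real response is assumed to hold HTML under "item_html".
sample_text = json.dumps({
    "item_html": '<div><b class="u-linkComplex-target">alice</b>'
                 '<b class="u-linkComplex-target">bob</b></div>'
})

j = json.loads(sample_text)   # with a live response this would be r.json()
parser = TargetTextParser()
parser.feed(j["item_html"])   # parse the HTML string, not the JSON text
print(parser.names)
```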

python scrape webpage and parse the content

I want to scrape the data on this link
http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json
I am not sure what kind of content this link returns: HTML, JSON, or something else. Sorry for my limited web knowledge. I tried the following code to scrape it:
import requests
url='http://www.realclearpolitics.com/epolls/json/5491_historical.js?1453388629140&callback=return_json'
source=requests.get(url).text
The type of source is unicode. I also tried urllib2:
source2=urllib2.urlopen(url).read()
The type of source2 is string. I am not sure which method is better, because the link is not a normal webpage with the usual tags. If I want to clean the scraped data and turn it into a dataframe (such as a pandas dataframe), what method or process should I follow?
Thanks.

The returned response is text with valid JSON data inside it. You can validate it yourself with a service like http://jsonlint.com/ if you want. To do so, copy only the content inside the parentheses:
return_json("JSON code to copy")
To make use of that data, you just need to parse it in your program. Here is an example: https://docs.python.org/2/library/json.html

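The response is text. It does contain JSON; you just need to extract it:
import json
import requests
# Drop the leading "return_json(" and the last two characters (assumed to be ");")
strip_len = len("return_json(")
source = requests.get(url).text[strip_len:-2]
source = json.loads(source)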
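Slicing by fixed lengths depends on the exact callback name and on the response ending in ");". A slightly more tolerant sketch, assuming only that the JSON sits between the first "(" and the last ")", is shown below. The strip_jsonp name and the sample string are made up for illustration.

```python
import json

def strip_jsonp(text):
    """Return the parsed JSON found between the first '(' and the last ')'."""
    start = text.index("(") + 1
    end = text.rindex(")")
    return json.loads(text[start:end])

# Stand-in for requests.get(url).text; the real payload is much larger.
sample = 'return_json({"poll": {"id": 5491, "rcp_avg": []}});'

parsed = strip_jsonp(sample)
print(parsed["poll"]["id"])
```

The resulting dictionary can then be cleaned up and handed to pandas to build a dataframe.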

After parsing url as json, can't return a specific element from list- only whole list

All,
I wrote the following code:
import requests,bs4
res=requests.get('http://itunes.apple.com/lookup?id=551798799')
res.raise_for_status()
wwe=bs4.BeautifulSoup(res.text)
print wwe.select('p averageUserRating')
If I only do print wwe.select('p'), the code works, but it prints everything in the list. However, when I run the code above, it throws an error saying the selector is unsupported.
I am basically only trying to return the averageUserRating value (which is 4.0).
Thanks for the help!

The content of that response isn't HTML, which is what BeautifulSoup is designed to read; it's a different data format called JSON. Thankfully, the requests library makes it really easy to parse JSON: if you call .json() on a response, it parses the body into a dictionary. You need averageUserRating, which is inside the first element of the results list, so you can use this to get it:
>>> data = res.json()
>>> data["results"][0]["averageUserRating"]
4.0
To modify your existing code:
import requests
res = requests.get('http://itunes.apple.com/lookup?id=551798799')
res.raise_for_status()
data = res.json()
print(data["results"][0]["averageUserRating"])
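Indexing results[0] raises an error if the lookup returns no results or the field is missing. A defensive variant might look like the sketch below. The first_rating helper and the trimmed sample body are assumptions for illustration, not the full iTunes response.

```python
import json

def first_rating(payload):
    """Return averageUserRating of the first result, or None if it is absent."""
    results = payload.get("results", [])
    if not results:
        return None
    return results[0].get("averageUserRating")

# Trimmed stand-in for res.text from the lookup request.
sample_text = '{"resultCount": 1, "results": [{"trackId": 551798799, "averageUserRating": 4.0}]}'

data = json.loads(sample_text)   # with a live response this would be res.json()
print(first_rating(data))
print(first_rating({"resultCount": 0, "results": []}))
```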
