Is it possible to create a DataFrame from JSON that is formatted as text rather than as a Python object?
With a Python object, I could, for example, do:
import pandas as pd
import requests
response = requests.get(url, params).json()
df = pd.json_normalize(response)
But I want to achieve the same with response = requests.get(url, params).text (flattening is not required, though).
If response = requests.get(url, params).text is guaranteed to give you a valid JSON string, then all you need to do is the following:
import json
import pandas as pd
import requests
response = requests.get(url, params).text
df = pd.json_normalize(json.loads(response))
Here we use json.loads from the standard library to convert the JSON string to a Python object before passing it to pd.json_normalize.
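To make the text-versus-object distinction concrete, here is a minimal sketch using only the standard library. The sample payload and its field names are made up for illustration and do not come from any real API; the resulting list of dicts is the kind of object that could then be passed to json_normalize.

```python
import json

# Hypothetical JSON text, standing in for requests.get(url, params).text
response_text = '[{"id": 1, "user": {"name": "ann"}}, {"id": 2, "user": {"name": "bob"}}]'

# json.loads parses the string into ordinary Python lists and dicts
records = json.loads(response_text)

print(type(response_text).__name__)  # str
print(type(records).__name__)        # list
print(records[0]["user"]["name"])    # ann
```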
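I usually create a DataFrame from JSON using read_json:
import pandas as pd
import requests
from io import StringIO
data = requests.get(url, params).text
# recent pandas versions deprecate passing a literal JSON string or bytes, so wrap the text in StringIO
df = pd.read_json(StringIO(data))
df.head()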
Related
I want to scrape multiple pages of a website using Python, but I'm getting a "Remote Connection closed" error.
Here is my code:
import pandas as pd
url_link = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p={}&selectedItem=viewAllAwardedContracts.do'
LIST = []
for number in range(1,5379):
    url = url_link.format(number)
    dframe = pd.read_html(url, header=None)[0]
    LIST.append(dframe)
Result_df = pd.concat(LIST)
Result_df.to_csv('Taneps_contracts.csv')
Any idea how to solve it?
For me, fetching the HTML with requests before passing it to read_html gets the data. I edited your code to:
import pandas as pd
import requests
from io import StringIO
url_link = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p={}&selectedItem=viewAllAwardedContracts.do'
LIST = []
for number in range(1,5379):
    url = url_link.format(number)
    r = requests.get(url) # getting page -> html in r.text
    # wrap the HTML in StringIO, since recent pandas versions deprecate passing literal HTML strings
    dframe = pd.read_html(StringIO(r.text), header=None)[0]
    LIST.append(dframe)
Result_df = pd.concat(LIST)
Result_df.to_csv('Taneps_contracts.csv')
I didn't even have to add headers, but if this isn't enough for you (i.e., if the program breaks or if you don't end up with 53770+ rows), try adding convincing headers or using something like HTMLSession instead of directly calling requests.get...
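If the connection is still dropped partway through several thousand pages, one option not mentioned above is to retry each failed page after a short pause. The sketch below is an illustration under assumptions rather than a tested fix for this site: fetch_with_retries, its parameters, and the flaky stand-in fetcher are hypothetical names introduced here. In the real loop, fetch could be something like lambda u: requests.get(u).text.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying on OSError up to `retries` times.

    `fetch` is any callable that returns the page text. This helper and its
    parameter names are hypothetical, not part of requests or pandas.
    OSError is caught because both the built-in connection errors and the
    exceptions raised by requests derive from it.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(delay * attempt)  # wait a little longer each time


# Stand-in fetcher that fails twice before succeeding, to exercise the helper
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Remote end closed connection")
    return "<table><tr><td>ok</td></tr></table>"


html = fetch_with_retries(flaky_fetch, "https://example.invalid/page/1", delay=0.01)
print(calls["count"], html)
```

The returned text can then be passed to read_html exactly as r.text is in the loop above.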
Beginner here,
I scraped a table using requests from nhl.com, and I'd like to send it to Excel.
import requests
url = 'https://api.nhle.com/stats/rest/en/team/powerplay?isAggregate=false&isGame=false&sort=%5B%7B%22property%22:%22powerPlayPct%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=50&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameTypeId=2%20and%20homeRoad=%22H%22%20and%20seasonId%3C=20212022%20and%20seasonId%3E=20212022'
data = requests.get(url).json()
for i in data['data']:
    print('{:<30} {:.1f}'.format(i['teamFullName'], i['powerPlayPct']*100))
I used requests instead of pandas because the tables on nhl.com are loaded dynamically, but as far as I can tell this does not give me a DataFrame (as pandas would) that I could export using df.to_excel.
How could I do that?
Try using pd.json_normalize and pass 'data' as the record_path parameter:
import requests
import pandas as pd
url = 'https://api.nhle.com/stats/rest/en/team/powerplay?isAggregate=false&isGame=false&sort=%5B%7B%22property%22:%22powerPlayPct%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=50&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameTypeId=2%20and%20homeRoad=%22H%22%20and%20seasonId%3C=20212022%20and%20seasonId%3E=20212022'
data = requests.get(url).json()
df = pd.json_normalize(data, record_path='data')
# Do whatever math you want here
df.to_excel('nhl_data.xlsx', index=False)
I was asked to create a DataFrame from JSON using only loops, passing specific headers (headers = {'User-agent': 'CS6400'}) to requests.get(). In addition, the DataFrame should have 5 rows (one for each of the top 25 posts).
Since I can only use loops, I am not able to use read_json or json_normalize. I have been stuck for several hours; any help is much appreciated.
Here is my buggy code (error message: ERROR! Session/line number was not unique in database. History logging moved to new session 122):
import urllib, json
import requests
import gzip
import json
url = "http://www.reddit.com/r/popular/top.json"
headers = {'User-agent': 'DS6001'}
response = requests.get(url,headers=headers)
data = response.read().decode("utf-8")
data = json.dumps(response)
dataframe = pd.DataFrame.from_dict(data, orient="index")
dataframe
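The following is only a sketch of the loop-only approach, under the assumption that the response follows Reddit's usual listing layout, with the posts under data -> children and each post's fields under its own data key. The sample dict, the chosen fields, and the listing_to_rows helper are made up for illustration; in the real case the dict would come from requests.get(url, headers=headers).json().

```python
def listing_to_rows(listing, fields, limit=None):
    """Collect one dict per post by looping over the listing's children."""
    rows = []
    for child in listing["data"]["children"]:
        if limit is not None and len(rows) >= limit:
            break
        post = child["data"]
        row = {}
        for field in fields:
            row[field] = post.get(field)  # None when a post lacks the field
        rows.append(row)
    return rows


# Made-up sample in the assumed shape, standing in for response.json()
sample = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "First", "score": 10, "subreddit": "a"}},
            {"kind": "t3", "data": {"title": "Second", "score": 7, "subreddit": "b"}},
            {"kind": "t3", "data": {"title": "Third", "score": 3}},
        ]
    },
}

rows = listing_to_rows(sample, ["title", "score", "subreddit"], limit=2)
print(rows)
```

The resulting list of dicts can be handed to pd.DataFrame(rows) to get one row per post.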
I am trying to download specific data as part of my work;
the data is located at the URL used in the code below.
The source indicates how to download it through the GET method, but when I make my request:
import requests
import pandas as pd
url="https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/2015-01/2019-01"
r=pd.to_csv(url)
it doesn't read the data as it should (open the link in a browser to compare).
When I try
s=requests.get(url,verify=False) # you can set verify=True
df=pd.DataFrame(s)
the data is not right either.
What else can I do? It is supposed to download the data as CSV, so that I don't have to clean it.
To get the content as CSV, you can replace all HTML line breaks with newline characters.
Please let me know if this works for you:
import requests
import pandas as pd
from io import StringIO
url = "https://estadisticas.bcrp.gob.pe/estadisticas/series/api/PN01210PM/csv/2015-01/2019-01"
content = requests.get(url,verify=False).text.replace("<br>","\n").strip()
csv = StringIO(content)
r = pd.read_csv(csv)
print(r)
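To see why the replacement helps, here is a small standard-library sketch on a made-up response body. The sample text and column names are assumptions for illustration only, not the actual output of the BCRP endpoint; the point is that once each <br> becomes a newline, any CSV parser can split the rows.

```python
import csv
from io import StringIO

# Made-up body where rows are separated by HTML line breaks instead of newlines
raw = 'period,value<br>"Jan.2015","3.01"<br>"Feb.2015","3.08"<br>'

# Turn each <br> into a newline and drop the trailing blank line
text = raw.replace("<br>", "\n").strip()

# StringIO makes the string look like an open file to the CSV parser
rows = list(csv.reader(StringIO(text)))
header, records = rows[0], rows[1:]

print(header)   # ['period', 'value']
print(records)  # [['Jan.2015', '3.01'], ['Feb.2015', '3.08']]
```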
I am calling a third-party API, and the requirement is to pass a JSON dictionary in the URL as it is. See the example URL below:
https://pguat.paytm.com/oltp/HANDLER_INTERNAL/getTxnStatus?JsonData={"MID":"MID117185435","ORDERID":"ORDR4o22310421111",
"CHECKSUMHASH":
"NU9wPEWxmbOTFL2%2FUKr3lk6fScfnLy8wORc3YRylsyEsr2MLRPn%2F3DRePtFEK55ZcfdTj7mY9vS2qh%2Bsm7oTRx%2Fx4BDlvZBj%2F8Sxw6s%2F9io%3D"}
The query parameter name is "JsonData", and the data should be inside {} braces.
import requests
import json
from urllib.parse import urlencode, quote_plus
import urllib.request
import urllib
data = '{"MID":"MID117185435","ORDERID":"ORDR4o22310421111","CHECKSUMHASH":"omcrIRuqDP0v%2Fa2DXTlVI4XtzvmuIW56jlXtGEp3S%2B2b1h9nU9cfJx5ZO2Hp%2FAN%2F%2FyF%2F01DxmoV1VHJk%2B0ZKHrYxqvDMJa9IOcldrfZY1VI%3D"}'
jsonData = data
uri = 'https://pguat.paytm.com/oltp/HANDLER_INTERNAL/getTxnStatus?jsonData='+str(quote_plus(data))
r = requests.get(uri)
print(r.url)
print(r.json)
print(r.json())
Output of print(r.url) on the console:
https://pguat.paytm.com/oltp/HANDLER_INTERNAL/getTxnStatus?jsonData=%7B%22MID%22%3A%22MEDTPA37902117185435%22%2C%22ORDERID%22%3A%22medipta1521537969o72718962111%22%2C%22CHECKSUMHASH%22%3A%22omcrIRuqDP0v%252Fa2DXTlVI4XtzvmuIW56jlXtGEp3S%252B2b1h9nU9cfJx5ZO2Hp%252FAN%252F%252FyF%252F01DxmoV1VHJk%252B0ZKHrYxqvDMJa9IOcldrfZY1VI%253D%22%7D
It converts { to %7B and } to %7D, but I want {} as it is.
Please help.
You need to undo quote_plus by importing and using unquote_plus.
I didn’t test against your url, just against your string.
When I print your uri string I get this as my output:
https://pguat.paytm.com/oltp/HANDLER_INTERNAL/getTxnStatus?jsonData=%7B%22MID%22%3A%22MID117185435%22%2C%22ORDERID%22%3A%22ORDR4o22310421111%22%2C%22CHECKSUMHASH%22%3A%22omcrIRuqDP0v%252Fa2DXTlVI4XtzvmuIW56jlXtGEp3S%252B2b1h9nU9cfJx5ZO2Hp%252FAN%252F%252FyF%252F01DxmoV1VHJk%252B0ZKHrYxqvDMJa9IOcldrfZY1VI%253D%22%7D
If I surround it like this:
print(str(unquote_plus(uri)))
I get this as output:
https://pguat.paytm.com/oltp/HANDLER_INTERNAL/getTxnStatus?jsonData={"MID":"MID117185435","ORDERID":"ORDR4o22310421111","CHECKSUMHASH":"omcrIRuqDP0v%2Fa2DXTlVI4XtzvmuIW56jlXtGEp3S%2B2b1h9nU9cfJx5ZO2Hp%2FAN%2F%2FyF%2F01DxmoV1VHJk%2B0ZKHrYxqvDMJa9IOcldrfZY1VI%3D"}
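The behaviour described above can be reproduced offline with urllib.parse alone. This sketch uses a shortened, made-up payload (the MID and CHECKSUMHASH values are placeholders) to show that quote_plus re-encodes the % signs of a value that is already percent-encoded, and that unquote_plus reverses exactly one round of encoding. Whether the server accepts literal braces is not something this example can confirm.

```python
from urllib.parse import quote_plus, unquote_plus

# Placeholder payload; the hash value is already percent-encoded (%2F is "/")
data = '{"MID":"MID1","CHECKSUMHASH":"ab%2Fcd%3D"}'

encoded = quote_plus(data)
print(encoded)
# The braces become %7B / %7D, and each existing "%" becomes %25,
# so "%2F" turns into "%252F" (double encoding).

decoded = unquote_plus(encoded)
print(decoded == data)  # one round of unquoting restores the original string

# Unquoting the hash first avoids the double encoding when quoting later
clean = unquote_plus("ab%2Fcd%3D")
print(clean, quote_plus(clean))
```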