I am trying to access data.gov.au datasets through their CKAN data API.
Unfortunately, the data API instructions are slightly outdated and do not seem to work. Instructions found here.
So far, I've worked out that I am meant to query the dataset using urllib.request.
import urllib.request
req = urllib.request.Request('https://data.sa.gov.au/data/api/3/action/datastore_search?resource_id=86d35483-feff-42b5-ac05-ad3186ac39de')
with urllib.request.urlopen(req) as response:
    data = response.read()
This produces an object of type bytes that looks like a dictionary, with the dataset apparently stored under the "records" key.
I'm wondering how I can convert the data records into a pandas DataFrame. I've tried converting the bytes object into a string and parsing that as JSON, but the output is wrong.
# code that did not work
from io import StringIO
import pandas as pd

result = str(data, 'utf-8')
rdata = StringIO(result)
df = pd.read_json(rdata)
df
The output I would like to return is a DataFrame of those records, one row per record.
Thanks!
Here is a solution that works:
import pandas as pd
import requests
import json

url = "https://data.sa.gov.au/data/api/3/action/datastore_search?resource_id=86d35483-feff-42b5-ac05-ad3186ac39de"
JSONContent = requests.get(url).json()
content = json.dumps(JSONContent, indent=4, sort_keys=True)
print(content)
df = pd.read_json(content)
df.to_csv("output.csv")
df = pd.json_normalize(df['result']['records'])
You were actually close to the solution. The only thing you were missing was the last step, df = pd.json_normalize(df['result']['records']).
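For reference, the round trip through json.dumps and read_json can be skipped by normalizing the records directly. A minimal sketch with a stand-in payload shaped like a CKAN datastore_search response (the real response has more keys):

```python
import pandas as pd

# stand-in for requests.get(url).json() — simplified structure for illustration
payload = {"result": {"records": [{"_id": 1, "name": "a"}, {"_id": 2, "name": "b"}]}}

df = pd.json_normalize(payload["result"]["records"])
print(df.shape)  # (2, 2)
```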
Related
I am not an expert in Python, but I used it to call data from an API. I got a 200 status code and printed part of the data, but I am not able to export/save/write the output to CSV. Can anyone assist?
This is my code:
import requests
headers = {
'Accept': 'text/csv',
'Authorization': 'Bearer ...'
}
response = requests.get('https://feeds.preqin.com/api/investor/pe', headers=headers)
print(response)
output = response.content
And here is what the data (it should be CSV, correct?) looks like: a block of pipe-separated text.
I managed to save it as a .txt file, but the output is not usable/importable (e.g. into Excel). I used the following code:
text_file = open("output.txt", "wb")  # "wb" because response.content is bytes
n = text_file.write(output)
text_file.close()
Thank you and best regards,
A
Your content uses pipes (|) as a separator. CSVs use commas (that's why they're called Comma-Separated Values).
You can simply replace your data's pipes with commas. However, this may be problematic if the data itself contains commas.
output = response.text.replace("|", ",")  # use .text so we replace on a str, not bytes
As comments have suggested, you could also use pandas:
import pandas as pd
from io import StringIO  # Python 3 location; in Python 2 it was: from StringIO import StringIO

# Get your output normally...
output = response.text
df = pd.read_csv(StringIO(output), sep="|")
# Saving to .CSV
df.to_csv(r"C:\output.csv")
# Saving to .XLSX
df.to_excel(r"C:\output.xlsx")
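To see the sep="|" approach end to end without hitting the API, here is a sketch with a made-up pipe-separated payload (the column names are hypothetical):

```python
import io
import pandas as pd

# hypothetical pipe-separated response body
sample = "FirmID|FirmName|Country\n1|Acme|US\n2|Beta|UK\n"

df = pd.read_csv(io.StringIO(sample), sep="|")
print(df.shape)  # (2, 3)
```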
I am trying to read JSON that I get from the Python package 'yahoofinancials' (it pulls the data from Yahoo Finance):
import numpy as np
import pandas as pd
from yahoofinancials import YahooFinancials
yahoo_financials = YahooFinancials(ticker)
cash_statements = yahoo_financials.get_financial_stmts('annual', 'income')
cash_statements
pd.read_json(str(cash_statements).replace("'", '"'), orient='records')
However I get the error:
Unexpected character found when decoding 'NaN'
The problem is this command: str(cash_statements).replace("'", '"').
You tried to "convert" a Python dictionary to a JSON string by replacing single quotes with double quotes, which does not work reliably.
Use the json.dumps(cash_statements) function to convert your dictionary into a JSON string.
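A quick illustration of why the quote replacement is fragile while json.dumps is not (the sample dict is made up):

```python
import json

d = {"price": None, "note": "it's fine"}

# naive replacement produces invalid JSON: None is not a JSON token,
# and the apostrophe inside the string turns into a stray double quote
broken = str(d).replace("'", '"')

# json.dumps handles both correctly
valid = json.dumps(d)  # {"price": null, "note": "it's fine"}
print(valid)
```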
Updated Code:
import numpy as np
import pandas as pd
from yahoofinancials import YahooFinancials
# ADJUSTMENT 1 - import json
import json
# just some sample data for testing
ticker = ['AAPL', 'MSFT', 'INTC']
yahoo_financials = YahooFinancials(ticker)
cash_statements = yahoo_financials.get_financial_stmts('annual', 'income')
# ADJUSTMENT 2 - dict to json
cash_statements_json = json.dumps(cash_statements)
pd.read_json(cash_statements_json, orient='records')
Check whether the file exists and the file name is correct. I got the same error while reading a .json file that was not in that folder but located somewhere else.
Hello guys, I know there are lots of similar questions here, but I have code that executes properly and returns five records. My question: how can I avoid reading the entire file just to get those rows? Suppose the CSV file is gigabytes in size; I don't want to pull the whole file from S3 only to return 5 records. Please tell me how to do that, and if my code is not good, please explain why.
code:
import boto3
from botocore.client import Config
import pandas as pd
ACCESS_KEY_ID = 'something'
ACCESS_SECRET_KEY = 'something'
BUCKET_NAME = 'something'
Filename='dataRepository/source/MergedSeedData(Parts_skills_Durations).csv'
client = boto3.client("s3",
aws_access_key_id=ACCESS_KEY_ID,
aws_secret_access_key=ACCESS_SECRET_KEY)
obj = client.get_object(Bucket=BUCKET_NAME, Key=Filename)
Data = pd.read_csv(obj['Body'])
# data1 = Data.columns
# return data1
Data=Data.head(5)
print(Data)
This is my code, which runs fine and gets the 5 records from the S3 bucket, but as explained above it is not what I'm looking for. For any other questions feel free to ask. Thanks in advance!
You can use the pandas capability of reading a file in chunks, just loading as much data as you need.
data_iter = pd.read_csv(obj['Body'], chunksize = 5)
data = data_iter.get_chunk()
print(data)
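The same idea is runnable against an in-memory buffer instead of the S3 body (the column names here are made up):

```python
import io
import pandas as pd

# stand-in for obj['Body'] — any file-like object works with read_csv
body = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n11,12\n")

data_iter = pd.read_csv(body, chunksize=5)
first_rows = data_iter.get_chunk()  # only the first chunk of 5 rows is parsed
print(len(first_rows))  # 5
```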
You can use an HTTP Range: header (see RFC 2616), which takes a byte-range argument. The S3 API has a provision for this, and it will help you NOT read/download the whole S3 file.
Sample code:
import boto3
obj = boto3.resource('s3').Object('bucket101', 'my.csv')
record_stream = obj.get(Range='bytes=0-1000')['Body']
print(record_stream.read())
This will return only the bytes of the range given in the header.
You will still need to convert the result into a DataFrame yourself — for example, decode the bytes, handle the delimiters and \n line breaks coming from the .csv file, and discard any row that was cut off at the range boundary.
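A sketch of that last step — dropping the possibly truncated final line before handing the bytes to pandas (the sample bytes are made up; a real range fetch would come from obj.get(Range=...)):

```python
import io
import pandas as pd

# pretend this came back from a Range request and was cut off mid-row
partial = b"col1,col2\n1,2\n3,4\n5,6\n7,"

text = partial.decode("utf-8")
complete = text[: text.rfind("\n")]  # keep only whole lines

df = pd.read_csv(io.StringIO(complete))
print(df)
```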
I am currently pulling data via an API and attempting to write the data into a CSV in order to run calculations in SQL. I am able to pull the data and open the CSV; however, an error occurs when the data is being written into the CSV: each individual character is separated by a comma.
I am new to working with JSON data so I am curious if I need to perform an intermediary step between pulling the JSON data and inserting it into a CSV. Any help would be greatly appreciated as I am completely stuck on this (even the data provider does not seem to know how to get around this).
Please see the code below:
import requests
import time
import pyodbc
import csv
import json
headers = {'Authorization': 'Token'}
Metric1 = ['Website1', 'Website2']
Metric2 = ['users', 'hours', 'responses', 'visits']
Metric3 = ['Country1', 'Country2', 'Country3']
obs_list = []
obs_file = r'TEST.csv'

with open(obs_file, 'w') as csvfile:
    f = csv.writer(csvfile)
    for elem1 in Metric1:
        for elem2 in Metric2:
            for elem3 in Metric3:
                URL = "www.data.com"
                r = requests.get(URL, headers=headers, verify=False)
                for elem in r:
                    f.writerow(elem)
Edit: When I print the data instead of writing it to a CSV, the data appears in the command window in the following format:
[timestamp, metric], [timestamp, metric], [timestamp, metric] ...
Timestamp = 12 digit character
Metric = decimal value
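The per-character commas happen because csv.writerow treats a string as a sequence of one-character fields; passing a list gives one field per element. A minimal demonstration:

```python
import csv
import io

buf = io.StringIO()
w = csv.writer(buf)

w.writerow("abc")        # a string is iterated character by character
w.writerow(["abc", 1])   # a list yields one field per element

print(buf.getvalue())
# a,b,c
# abc,1
```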
I'm currently using Yahoo Pipes, which provides me with a JSON file from a URL.
I would like to be able to fetch it and convert it into a CSV file, and I have no idea where to begin (I'm a complete beginner in Python).
How can I fetch the JSON data from the URL?
How can I transform it to CSV?
Thank you
import urllib.request
import json
import csv

def getRows(data):
    # ?? this totally depends on what's in your data
    return []

url = "http://www.yahoo.com/something"
data = urllib.request.urlopen(url).read()
data = json.loads(data)

fname = "mydata.csv"
with open(fname, 'w', newline='') as outf:
    outcsv = csv.writer(outf)
    outcsv.writerows(getRows(data))
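getRows has to be filled in for your actual payload. A hypothetical example, assuming the JSON has an "items" list with "title" and "link" keys (those names are made up for illustration):

```python
import csv
import io
import json

# stand-in for the JSON fetched from the URL; structure is assumed
data = json.loads('{"items": [{"title": "A", "link": "x"}, {"title": "B", "link": "y"}]}')

def getRows(data):
    # one CSV row per item
    return [[item["title"], item["link"]] for item in data["items"]]

buf = io.StringIO()
csv.writer(buf).writerows(getRows(data))
print(buf.getvalue())
# A,x
# B,y
```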