I'm fairly new to Python and I need to build a nested JSON structure from an online zipped CSV file using only the standard library, specifically in Python 2.7. I've figured out accessing and unzipping the file, but I'm having some trouble with the parsing. Basically, I need to produce JSON output that contains three high-level elements for each primary key:
The primary key (which is made up of columns 0, 2, 3, and 4)
A dictionary that is a time series of the observed values for that PK (i.e. date: observed value)
A dictionary of metadata (the product, flow type, units, and ideally a nested time series of the quality for each observed point)
from StringIO import StringIO
from urllib import urlopen
from zipfile import ZipFile
from datetime import datetime
import itertools as it
import csv
import sys
url = urlopen("https://www.jodidata.org/_resources/files/downloads/gas-data/jodi_gas_csv_beta.zip")
myzip = ZipFile(StringIO(url.read()))
with myzip.open('jodi_gas_beta.csv', 'r') as myCSV:
    # Read the data
    reader = csv.DictReader(myCSV)
    # Sort the data by PK + time for the time series
    reader = sorted(reader, key=lambda row: (row['REF_AREA'], row['ENERGY_PRODUCT'],
                                             row['FLOW_BREAKDOWN'], row['UNIT_MEASURE'],
                                             row['TIME_PERIOD']))
# Initialize containers for output
myData = []
keys = []
groups = []
# Limiting to first 200 rows for testing ONLY
for k, g in it.groupby(it.islice(reader, 200),
                       key=lambda row: (row['REF_AREA'], row['ENERGY_PRODUCT'],
                                        row['FLOW_BREAKDOWN'], row['UNIT_MEASURE'])):
    keys.append(k)
    rows = list(g)
    groups.append(rows)
    myData.append({'myPK': ''.join(k),  # captures the PK
                   'TimeSeries': dict((e['TIME_PERIOD'], e['OBS_VALUE']) for e in rows)})
    # TODO: Dictionary of metadata here (with nested time series, if possible)
# TODO: Output as a JSON string
So, the result should look something like this:
{
"myPK": "AENATGASEXPLNGM3",
"TimeSeries":[
["2015-01", 756],
["2015-02", 572],
["2015-03", 654]
],
"Metadata":{
"Country":"AE",
"Product":"NATGAS",
"Flow":"EXPLNG",
"Unit":"M3",
"Quality": [
["2015-01", 3],
["2015-02", 3],
["2015-03", 3]
]
}
}
Although you don't appear to have put much effort into solving the problem yourself, here's something I think does what you want. It makes use of the operator.itemgetter() function to simplify retrieving a series of different items from the various containers (such as lists and dicts).
I also modified the code to more closely follow the PEP 8 - Style Guide for Python Code.
import datetime
import csv
from operator import itemgetter
import itertools as it
import json
from StringIO import StringIO
import sys
from urllib import urlopen
from zipfile import ZipFile
# Utility.
def typed_itemgetter(items, callables):
""" Like operator.itemgetter() but also applies corresponding callable to
each retrieved value if it's not None. Creates and returns a function.
"""
return lambda row: [f(value) if f else value
for value, f in zip(itemgetter(*items)(row), callables)]
url = urlopen("https://www.jodidata.org/_resources/files/downloads/gas-data/jodi_gas_csv_beta.zip")
myzip = ZipFile(StringIO(url.read()))
with myzip.open('jodi_gas_beta.csv', 'r') as myCSV:
reader = csv.DictReader(myCSV)
primary_key = itemgetter('REF_AREA', 'ENERGY_PRODUCT', 'FLOW_BREAKDOWN', 'UNIT_MEASURE',
'TIME_PERIOD')
reader = sorted(reader, key=primary_key)
# Limit to first 200 rows for TESTING.
reader = [row for row in it.islice(reader, 200)]
# Group the data by designated keys (aka "primary key").
keys, groups = [], []
keyfunc = itemgetter('REF_AREA', 'ENERGY_PRODUCT', 'FLOW_BREAKDOWN', 'UNIT_MEASURE')
for k, g in it.groupby(reader, key=keyfunc):
keys.append(k)
groups.append(list(g))
# Create corresponding JSON-like Python data-structure.
myData = []
for i, group in enumerate(groups):
result = {'myPK': ''.join(keys[i]),
'TimeSeries': [
typed_itemgetter(('TIME_PERIOD', 'OBS_VALUE'),
(None, lambda x: int(float(x))))(row)
for row in group]
}
metadata = dict(zip(("Country", "Product", "Flow", "Unit"), keys[i]))
metadata['Quality'] = [typed_itemgetter(
('TIME_PERIOD', 'ASSESSMENT_CODE'), (None, int))(row)
for row in group]
result['Metadata'] = metadata
myData.append(result)
# Display the data to be turned into JSON.
from pprint import pprint
print('myData:')
pprint(myData)
# To create JSON format output (json was imported above), use something like:
with open('myData.json', 'w') as fp:
json.dump(myData, fp, indent=2)
Beginning portion of the output printed:
myData:
[{'Metadata': {'Country': 'AE',
'Flow': 'EXPLNG',
'Product': 'NATGAS',
'Quality': [['2015-01', 3],
['2015-02', 3],
['2015-03', 3],
['2015-04', 3],
['2015-05', 3],
['2015-06', 3],
['2015-07', 3],
['2015-08', 3],
['2015-09', 3],
['2015-10', 3],
['2015-11', 3],
['2015-12', 3],
['2016-01', 3],
['2016-02', 3],
['2016-04', 3],
['2016-05', 3]],
'Unit': 'M3'},
'TimeSeries': [['2015-01', 756],
['2015-02', 572],
['2015-03', 654],
['2015-04', 431],
['2015-05', 681],
['2015-06', 683],
['2015-07', 751],
['2015-08', 716],
['2015-09', 830],
['2015-10', 580],
['2015-11', 659],
['2015-12', 659],
['2016-01', 742],
['2016-02', 746],
['2016-04', 0],
['2016-05', 0]],
'myPK': 'AENATGASEXPLNGM3'},
{'Metadata': {'Country': 'AE',
'Flow': 'EXPPIP',
'Product': 'NATGAS',
'Quality': [['2015-01', 3],
['2015-02', 3],
['2015-03', 3],
['2015-04', 3],
['2015-05', 3],
['2015-06', 3],
['2015-07', 3],
['2015-08', 3],
['2015-09', 3],
['2015-10', 3],
['2015-11', 3],
['2015-12', 3],
['2016-01', 3],
['2016-02', 3],
['2016-03', 3],
['2016-04', 3],
# etc, etc...
]
Related
I have a pyarrow.dataset.ParquetFileFragment object like this:
<pyarrow.dataset.ParquetFileFragment path=pq-test/Location=US-California/Industry=HT-SoftWare/dce9900c46f94ec3a8dca094cf62bd34-0.parquet partition=[Industry=HT-SoftWare, Location=US-California]>
I can get the path using .path, but the .partition attribute does not give me the partition list. Is there any way to grab it?
There is a PR open that would expose ds.get_partition_keys publicly: https://github.com/apache/arrow/pull/33862/files, and that would help you get a nice dict from the partition_expression attribute of a ds.ParquetFileFragment.
Note that you have to pass the partitioning parameter when you read the dataset in order to get a valid expression:
>>> import pyarrow as pa
>>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
... 'n_legs': [2, 2, 4, 4, 5, 100],
... 'animal': ["Flamingo", "Parrot", "Dog", "Horse",
... "Brittle stars", "Centipede"]})
>>> import pyarrow.dataset as ds
>>> ds.write_dataset(table, "dataset_name_fragments", format="parquet",
... partitioning=["year"], partitioning_flavor="hive")
>>> dataset = ds.dataset('dataset_name_fragments/', format="parquet", partitioning="hive")
>>> fragments = dataset.get_fragments()
>>> fragment = next(fragments)
>>> fragment.partition_expression
<pyarrow.compute.Expression (year == 2019)>
It would also be great to have an attribute that gives you the partition list directly; that will be added to the mentioned PR as well.
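In the meantime, one workaround is to parse the keys out of fragment.path directly. This is a sketch under the assumption that the dataset uses hive-style key=value directory names (as in the path shown in the question); the helper function name is mine, not part of the pyarrow API:

```python
# Hypothetical helper (not part of the public pyarrow API): parse
# hive-style "key=value" directory segments out of a fragment's file path.
def partition_keys_from_path(path):
    keys = {}
    for segment in path.split("/")[:-1]:   # skip the file name itself
        if "=" in segment:
            key, _, value = segment.partition("=")
            keys[key] = value
    return keys

print(partition_keys_from_path(
    "pq-test/Location=US-California/Industry=HT-SoftWare/part-0.parquet"))
# {'Location': 'US-California', 'Industry': 'HT-SoftWare'}
```

Note this only recovers the raw string values; the partition_expression route will give you properly typed values once exposed.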
I'm fairly new to dealing with .txt files that have a dictionary within them. I'm trying to use pd.read_csv to create a DataFrame in pandas, but I get the error Error tokenizing data. C error: Expected 4 fields in line 2, saw 11. I believe I found the root problem: the file is difficult to read because each row contains a dict whose key-value pairs are separated by commas, which in this case is also the delimiter.
Data (store.txt)
id,name,storeid,report
11,JohnSmith,3221-123-555,{"Source":"online","FileFormat":0,"Isonline":true,"comment":"NAN","itemtrack":"110", "info": {"haircolor":"black", "age":53}, "itemsboughtid":[],"stolenitem":[{"item":"candy","code":1},{"item":"candy","code":1}]}
35,BillyDan,3221-123-555,{"Source":"letter","FileFormat":0,"Isonline":false,"comment":"this is the best store, hands down and i will surely be back...","itemtrack":"110", "info": {"haircolor":"black", "age":21},"itemsboughtid":[1,42,465,5],"stolenitem":[{"item":"shoe","code":2}]}
64,NickWalker,3221-123-555, {"Source":"letter","FileFormat":0,"Isonline":false, "comment":"we need this area to be fixed, so much stuff is everywhere and i do not like this one bit at all, never again...","itemtrack":"110", "info": {"haircolor":"red", "age":22},"itemsboughtid":[1,2],"stolenitem":[{"item":"sweater","code":11},{"item":"mask","code":221},{"item":"jack,jill","code":001}]}
How would I read this CSV file and create new columns based on the key-value pairs? In addition, what if there are more key-value pairs in other data, for example more than 11 keys within the dictionary?
Is there an efficient way to create a df from the example above?
My code when trying to read as CSV:
df = pd.read_csv('store.txt', header=None)
I tried to import json and use a converter, but it did not work and converted all the commas to a |
import json
df = pd.read_csv('store.txt', converters={'report': json.loads}, header=0, sep="|")
In addition I also tried to use:
import pandas as pd
import json
df=pd.read_csv('store.txt', converters={'report':json.loads}, header=0, quotechar="'")
I was also thinking of adding a quote at the beginning and end of each dictionary to make it a string, but thought it would be too tedious to find the closing brackets.
I think adding quotes around the dictionaries is the right approach. You can use regex to do so and use a different quote character than " (I used § in my example):
import json
import re
from io import StringIO

import pandas as pd

with open("store.txt", "r") as f:
    csv_content = re.sub(r"(\{.*})", r"§\1§", f.read())

df = pd.read_csv(StringIO(csv_content), skipinitialspace=True, quotechar="§", engine="python")
df_out = pd.concat([
df[["id", "name", "storeid"]],
pd.DataFrame(df["report"].apply(lambda x: json.loads(x)).values.tolist())
], axis=1)
print(df_out)
Note: the very last value in your CSV isn't valid JSON: "code":001. It should be either "code":"001" or "code":1.
Output:
id name storeid Source ... itemtrack info itemsboughtid stolenitem
0 11 JohnSmith 3221-123-555 online ... 110 {'haircolor': 'black', 'age': 53} [] [{'item': 'candy', 'code': 1}, {'item': 'candy...
1 35 BillyDan 3221-123-555 letter ... 110 {'haircolor': 'black', 'age': 21} [1, 42, 465, 5] [{'item': 'shoe', 'code': 2}]
2 64 NickWalker 3221-123-555 letter ... 110 {'haircolor': 'red', 'age': 22} [1, 2] [{'item': 'sweater', 'code': 11}, {'item': 'ma...
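An alternative sketch that avoids the regex pass: the first three fields contain no commas, so splitting each data line on the first three commas isolates the JSON dictionary, which json.loads can parse directly. Shown here on a shortened inline sample of the same layout; with the real file you would iterate over open("store.txt") instead (and fix the invalid "code":001 value first):

```python
import json

# Shortened inline sample in the same layout as store.txt (an assumption
# for demonstration; read the real file line by line in practice).
sample = '''id,name,storeid,report
11,JohnSmith,3221-123-555,{"Source":"online","FileFormat":0,"Isonline":true,"comment":"NAN","info":{"haircolor":"black","age":53}}'''

lines = sample.splitlines()
header = lines[0].split(",")          # ['id', 'name', 'storeid', 'report']
rows = []
for line in lines[1:]:
    # split(",", 3) leaves the commas inside the JSON part untouched
    id_, name, storeid, report = line.split(",", 3)
    row = {"id": int(id_), "name": name, "storeid": storeid}
    row.update(json.loads(report))    # flatten the JSON keys into columns
    rows.append(row)

print(rows[0]["Source"])              # online
```

From rows you can build the DataFrame with pd.DataFrame(rows), and extra keys in some rows simply become columns with NaN elsewhere.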
I would like to convert the following sheet of an Excel file containing coordinates into a JSON file that looks exactly like the one below. I need it in that form in order to run a clustering algorithm.
Thanks
{ "X" : [[1.32, 2.23], [2.01, 2.223], [4.196, 4.04], [4.09, 3.96], [2.01, 3.01],
[8.01, 7.01], [8.01, 8.01], [1.01, 8.01], [1.01, 1.10], [0.10, 7.81], [0.10, 7.91],
[0.1, 7.91], [0.01, 7.8], [0.1, 7.8], [6.875, 1.43], [6.99, 1.54], [6.71, 1.37],
[7.98, 1.1], [7.33, 1.53], [6.43, 1.3], [6.99, 1.3], [4.11, 4.11]]
}
Can you try this solution and adapt it to your dataframe:
Code:
import pandas as pd
# Creating Dataframe
df = pd.DataFrame([[1, 2],
[3, 4],
[5, 6],
[7, 8]
],
columns=['x', 'y'])
# Convert DataFrame to JSON
data = df.to_json(orient='values')
output = '{ "X" :'+data+'}'
print(output)
Output:
{ "X" :[[1,2],[3,4],[5,6],[7,8]]}
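A variant that avoids the string concatenation and guarantees valid JSON: build a real Python dict first, then let json.dumps do the serializing. This is a sketch assuming the coordinates sit in two columns of the DataFrame, as above:

```python
import json

import pandas as pd

# Assumed two-column coordinate layout, as in the question's sample
df = pd.DataFrame([[1.32, 2.23], [2.01, 2.223], [4.196, 4.04]],
                  columns=['x', 'y'])

# Wrap the list of [x, y] pairs in a dict, then serialize the whole thing
output = json.dumps({"X": df.values.tolist()})
print(output)   # {"X": [[1.32, 2.23], [2.01, 2.223], [4.196, 4.04]]}
```

This way quoting and escaping are handled for you, and json.loads can round-trip the result.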
I have a dataframe in the below format.
I want to send each row separately, as below:
{ 'timestamp': 'A'
'tags': {
'columnA': '1',
'columnB': '11',
'columnC': '21'
.
.
.
.}}
The columns vary and I cannot hard-code them. I then need to send it to a Firestore collection,
then the second row in the above format to the Firestore collection, and so on.
How can I do this?
Please don't mark the question as a duplicate without comparing the questions.
I am not clear on the Firebase part, but I think this might be what you want:
import json
import pandas as pd

# Data frame to work with
x = pd.DataFrame(data={'timestamp': 'A', 'ca': 1, 'cb': 2, 'cc': 3}, index=[0])
x = pd.concat([x, x], ignore_index=True)
# Rearranging so that 'timestamp' comes first
x = x[['timestamp', 'ca', 'cb', 'cc']]

def new_json(row):
    return json.dumps(
        dict(timestamp=row['timestamp'],
             tag=dict(zip(row.index[1:], row[row.index[1:]].values.tolist()))))

print(x.apply(new_json, axis=1))
Output
The output is a pandas Series in which each entry is a str in the JSON format needed:
0    {"timestamp": "A", "tag": {"ca": 1, "cb": 2, "cc": 3}}
1    {"timestamp": "A", "tag": {"ca": 1, "cb": 2, "cc": 3}}
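For the Firestore side, note that the client accepts plain dicts, so you don't actually need JSON strings. A minimal sketch, assuming the google-cloud-firestore client library and a collection name of my own choosing (the write itself is shown commented out because it needs credentials):

```python
import pandas as pd

x = pd.DataFrame({'timestamp': ['A', 'B'],
                  'ca': [1, 4], 'cb': [2, 5], 'cc': [3, 6]})

def row_to_doc(row):
    # First column is the timestamp; every remaining column goes under 'tags'
    return {'timestamp': row['timestamp'],
            'tags': {col: str(row[col]) for col in row.index[1:]}}

docs = [row_to_doc(row) for _, row in x.iterrows()]
print(docs[0])

# Hypothetical Firestore write (requires google-cloud-firestore and
# credentials; the collection name 'my_collection' is an assumption):
# from google.cloud import firestore
# client = firestore.Client()
# for doc in docs:
#     client.collection('my_collection').add(doc)
```

Because the column names are read from row.index, nothing here is hard-coded to specific columns.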
I am working with SQLAlchemy and want to fetch data from the database and convert it into JSON format.
I have below code :
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base

db_string = "postgres://user:pwd#10.**.**.***:####/demo_db"
Base = declarative_base()
db = create_engine(db_string)
result = []
record = db.execute("SELECT name, columndata, gridname, ownerid, issystem, ispublic, isactive FROM col.layout WHERE (ispublic=1 AND isactive=1) OR (isactive=1 AND ispublic=1 AND ownerid=ownerid);")
for row in record:
    result.append(row)
print(result)
Data is coming in this format:
[('layout-1', {'theme': 'blue', 'sorting': 'price_down', 'filtering': ['Sub Strategye', 'PM Strategy']}, 'RealTimeGrid', 1, 0, 1, 1), ('layout-2', {'theme': 'orange', 'sorting': 'price_up', 'filtering': ['FX Rate', 'Start Price']}, 'RealBalancing Grid', 2, 0, 1, 1), ('layout-3', {'theme': 'red', 'sorting': 'mv_price', 'filtering': ['Sub Strategye', 'PM Strategy']}, 'RT', 3, 0, 1, 1)]
But I am facing a lot of issues converting the above result into JSON format. Please suggest an approach.
Your data is basically a list of tuples.
For example, one of the tuples looks like this:
('layout-3',
{'filtering': ['Sub Strategye', 'PM Strategy'],
'sorting': 'mv_price',
'theme': 'red'},
'RT',
3,
0,
1,
1)
If you want to convert the whole data as-is to JSON, you can use the json module's dumps function:
import json
jsn_data = json.dumps(data)
Your list of tuples is converted to JSON:
[["layout-1", {"theme": "blue", "sorting": "price_down", "filtering": ["Sub Strategye", "PM Strategy"]}, "RealTimeGrid", 1, 0, 1, 1], ["layout-2", {"theme": "orange", "sorting": "price_up", "filtering": ["FX Rate", "Start Price"]}, "RealBalancing Grid", 2, 0, 1, 1], ["layout-3", {"theme": "red", "sorting": "mv_price", "filtering": ["Sub Strategye", "PM Strategy"]}, "RT", 3, 0, 1, 1]]
But if you need the JSON formatted as key-value pairs, first convert the result into a Python dictionary, then use json.dumps(dictionary_var).
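A sketch of that dictionary conversion: zip the column names from the query against each result tuple (the names below are taken from the SELECT in the question; the data is one sample tuple from the output shown):

```python
import json

# Column names in the same order as the SELECT in the question
columns = ("name", "columndata", "gridname", "ownerid",
           "issystem", "ispublic", "isactive")
data = [("layout-1",
         {"theme": "blue", "sorting": "price_down",
          "filtering": ["Sub Strategye", "PM Strategy"]},
         "RealTimeGrid", 1, 0, 1, 1)]

# Zip each tuple against the column names to get key-value pairs
records = [dict(zip(columns, row)) for row in data]
json_data = json.dumps(records, indent=2)
print(json_data)
```

With SQLAlchemy's result rows you can often skip the manual column list and use dict(row) or row._asdict(), depending on the version, but zipping is the version-independent fallback.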
What you want to accomplish is called "serialization".
You can follow Sudhanshu Patel's answer if you just want to dump json into response.
However, if you intend to produce a more sophisticated application, consider using a serialization library. You'll be able to load input data from a request into the db, check that the input data is in the right format, and send responses in a standardised format.
Check these libraries:
Marshmallow
Python's own Pickle