I would like to convert the following sheet of an Excel file containing coordinates into a JSON file that looks exactly like the one below. I need it in that format in order to run a clustering algorithm.
Thanks
{ "X" : [[1.32, 2.23], [2.01, 2.223], [4.196, 4.04], [4.09, 3.96], [2.01, 3.01],
[8.01, 7.01], [8.01, 8.01], [1.01, 8.01], [1.01, 1.10], [0.10, 7.81], [0.10, 7.91],
[0.1, 7.91], [0.01, 7.8], [0.1, 7.8], [6.875, 1.43], [6.99, 1.54], [6.71, 1.37],
[7.98, 1.1], [7.33, 1.53], [6.43, 1.3], [6.99, 1.3], [4.11, 4.11]]
}
Can you try this solution and adapt it to your dataframe:
Code:
import pandas as pd

# Create the DataFrame
df = pd.DataFrame([[1, 2],
                   [3, 4],
                   [5, 6],
                   [7, 8]],
                  columns=['x', 'y'])

# Convert the DataFrame to JSON
data = df.to_json(orient='values')
output = '{ "X" :' + data + '}'
print(output)
Output:
{ "X" :[[1,2],[3,4],[5,6],[7,8]]}
I have already asked a similar question here: Color code a column based on values in another column in Excel using pandas, but then I realised it was too simplified for my case.
I want to create an Excel file with a table whose cells are color-coded based on certain conditions. The condition is that the value in a cell is between a lower and an upper limit, and these limits are given for two different types of data, aa and bb. Therefore, I have a table summarising the limits:
In another table, I have the values that I need to compare to the respective limits to understand whether they are within them or not. I don't know in advance how many values I will have, but they can be of type aa or type bb, and this is given in their name:
How can I get a final table, to be then written to Excel, where my values are color-coded based on whether they are within the limits or not?
Here is the code to reproduce the example:
import pandas as pd

df_limits_1 = pd.DataFrame({"Measure": ["A", "B", "C"],
                            "lower limit": [0.1, 1, 10],
                            "upper limit": [1.2, 3.4, 100]})
df_limits_1 = df_limits_1.set_index("Measure")

df_limits_2 = pd.DataFrame({"Measure": ["A", "B", "C"],
                            "lower limit": [0.3, 2, 15],
                            "upper limit": [1.1, 5, 28]})
df_limits_2 = df_limits_2.set_index("Measure")

df_limits_1.columns = pd.MultiIndex.from_product([['aa'], df_limits_1.columns])
df_limits_2.columns = pd.MultiIndex.from_product([['bb'], df_limits_2.columns])
df_limits = pd.concat([df_limits_1, df_limits_2], axis=1)

df_values = pd.DataFrame({"Measure": ["A", "B", "C"],
                          "value1_aa": [1, 5, 34],
                          "value1_bb": [0.2, 3, 21],
                          "value2_aa": [0.3, 2, 23],
                          "value2_bb": [1, 0.9, 12]})
df_values = df_values.set_index("Measure")
Use a lambda function that extracts the last value after _ from the column name, select the matching limits with DataFrame.xs, and then apply the original solution:
def color(x):
    # 'value1_aa' -> 'aa': pick the matching limit group
    g = x.name.split('_')[-1]
    df1 = df_limits.xs(g, axis=1, level=0)
    return (x.between(df1['lower limit'], df1['upper limit'])
             .map({True: 'background-color: yellow', False: ''}))

df_values.style.apply(color, axis=0)
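To actually get the color-coded table into an Excel file, as the question asks, the resulting Styler can be written out directly — a sketch assuming openpyxl is installed (the output file name is made up):
# Styler.to_excel keeps the background colors in the written file.
styled = df_values.style.apply(color, axis=0)
styled.to_excel('values_checked.xlsx', engine='openpyxl')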
I have two DataFrames:
One (F_frame, with a column MultiIndex) of size (1113, 7987), containing values for different countries and sectors in the columns and different IDs in the rows, for example:
F_frame:
        AT               BE              ...
        Food    Energy   Food    Energy  ...
ID1     ...
ID2     ...
...
In another dataframe (CF_LO) I have factor values with a corresponding country and ID that I would like to match with the former dataframe (F_frame), so that I multiply the values in F_frame by the factor values in CF_LO if they match by country and ID. If they do not match, I put a zero.
The code I have so far seems to work, but it runs very slowly. Is there a smarter way to match the tables based on the index/header names?
(The code loops over 49 countries and multiplies by the same factor for each of the 163 sectors within the country, i.e. 49 x 163 = 7987 columns.)
import numpy as np
import pandas as pd

LO_impacts = pd.DataFrame(np.zeros((1113, 7987)))
for i in range(0, len(F_frame)):
    for j in range(0, 49):
        for k in range(0, len(CF_LO)):
            # Match on ID (row index level 1) and country (column level 0)
            if (F_frame.index.get_level_values(1)[i] == CF_LO.iloc[k, 1] and
                    F_frame.columns.get_level_values(0)[j*163] == CF_LO.iloc[k, 2]):
                LO_impacts.iloc[i, (j*163):((j+1)*163)] = (
                    F_frame.iloc[i, (j*163):((j+1)*163)] * CF_LO.iloc[k, 4])
            else:
                LO_impacts.iloc[i, (j*163):((j+1)*163)] = 0
I made two DataFrames, then I set a new index for the second DataFrame as below:
Then I used the assign() function to create a new column for df2:
df2 = df2.assign(gre_multiply=lambda x: x.gre * df1.gre)
Don't forget the df2 = part; I forgot it in the picture.
And I got the following DataFrame:
Of course it aligns on the index (you can check using a calculator). It returns the values as float; it is easy to convert them to int later with df2.gre_multiply.astype(int), but before that you need to fillna, because if the indexes of the two DataFrames don't match it will return NaN:
df2.gre_multiply = df2.gre_multiply.fillna(0).astype(int)
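Since the screenshots aren't reproduced here, a self-contained sketch with made-up data (the values and indexes below are illustrative only):
import pandas as pd

df1 = pd.DataFrame({'gre': [300, 320, 310]}, index=[0, 1, 2])
df2 = pd.DataFrame({'gre': [2.0, 1.5, 3.0]}, index=[1, 2, 3])

# Rows are matched by index; index 3 has no partner in df1 -> NaN -> 0.
df2 = df2.assign(gre_multiply=lambda x: x.gre * df1.gre)
df2.gre_multiply = df2.gre_multiply.fillna(0).astype(int)
print(df2)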
import pandas as pd

# Column MultiIndex: country on the first level, sector on the second
mindex = pd.MultiIndex.from_product([["AT", "DK"], ["Food", "Energy"]])

# Creating dummy data
data = pd.DataFrame([[2.0, 1.1, 6.7, 4.5],
                     [4.3, 5.7, 8.6, 9.0],
                     [5.5, 6.8, 9.0, 4.7],
                     [5.5, 6.8, 9.0, 4.7]],
                    index=["S1", "S1", "S2", "S2"], columns=mindex)

mul_factor = pd.DataFrame({"Country": ['AT', 'DK', 'AT', 'DK'],
                           "Value": [1.0, 0.8, 0.9, 0.6]},
                          index=['S1', 'S1', 'S2', 'S2'])

# Flatten the columns to the country level only
new_data = data.copy()
new_data.columns = data.columns.to_frame()[0].to_list()

# Reshaping the second DataFrame: one row per ID, one column per country
mat = mul_factor.reset_index().pivot(index='Country', columns='index')
mat.index.name = None
mat = mat.T.reset_index(0, drop=True)
mat.index.name = None

new_data.multiply(mat)  # Required result

Please let me know if I've misunderstood your question. You might have to modify the code a bit to accommodate missing country values.
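For that missing-country caveat, one possible approach (my assumption, not part of the answer above) is to reindex the factor matrix to the data's labels and zero-fill before multiplying:
# Countries or IDs absent from mul_factor get a factor of 0 instead of NaN.
mat_full = mat.reindex(index=new_data.index.unique(),
                       columns=new_data.columns.unique()).fillna(0)
new_data.multiply(mat_full)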
We query a df with the code below:
json.loads(df.reset_index(drop=True).to_json(orient='table'))
The output is:
{"index": [ 0, 1 ,2, 3, 4],
"col1": [ "250"],
"col2": [ "1"],
"col3": [ "223"],
"col4": [ "2020-06-12 14:55"]
}
We need the output to be like this instead:
["250", "1", "223", "2020-06-12 14:55"], [.....], [.....]
json.loads(df.reset_index(drop=True).to_json(orient='values'))
Changing table into values solved my problem.
What you call a "json" (there is no such data type) is a Python dictionary. Extract the values for the keys of interest using a list comprehension:
x = .... # Your dictionary
[x[col][0] for col in x if col.startswith("col")]
#['250', '1', '223', '2020-06-12 14:55']
We convert the JSON to a DataFrame and remove the column names (pd.DataFrame has no header argument, so reset the columns instead):
df = pd.DataFrame(json_source)
df.columns = range(df.shape[1])
Then we convert it to JSON format:
df.to_json(orient='table')
I'm fairly new to Python, and I need to make a nested JSON out of an online zipped CSV file using standard libraries only, specifically in Python 2.7. I've figured out accessing and unzipping the file, but I'm having some trouble with the parsing. Basically, I need to make a JSON output that contains three high-level elements for each primary key:
The primary key (which is made up of columns 0, 2, 3 & 4)
A dictionary that is a time series of the observed values for that PK (i.e. date: observed value)
A dictionary of metadata (the product, flow type, units, and ideally a nested time series of the quality for each observed point)
from StringIO import StringIO
from urllib import urlopen
from zipfile import ZipFile
from datetime import datetime
import itertools as it
import csv
import sys
url = urlopen("https://www.jodidata.org/_resources/files/downloads/gas-data/jodi_gas_csv_beta.zip")
myzip = ZipFile(StringIO(url.read()))
with myzip.open('jodi_gas_beta.csv', 'r') as myCSV:
    # Read the data
    reader = csv.DictReader(myCSV)
    # Sort the data by PK + time for the time series
    reader = sorted(reader, key=lambda row: (row['REF_AREA'], row['ENERGY_PRODUCT'],
                                             row['FLOW_BREAKDOWN'], row['UNIT_MEASURE'],
                                             row['TIME_PERIOD']))
    # Initialize containers for the output
    myData = []
    keys = []
    groups = []
    # Limiting to first 200 rows for testing ONLY
    for k, g in it.groupby(it.islice(reader, 200),
                           key=lambda row: (row['REF_AREA'], row['ENERGY_PRODUCT'],
                                            row['FLOW_BREAKDOWN'], row['UNIT_MEASURE'])):
        g = list(g)
        keys.append(k)
        groups.append(g)
        myData.append({'MyPK': ''.join(k),  # captures the PKs
                       # Not working properly, want a time series dictionary here
                       'TimeSeries': [dict(zip(e['TIME_PERIOD'], e['OBS_VALUE'])) for e in g]})
        # TODO: Dictionary of metadata here (with nested time series, if possible)
# TODO: Output as a JSON string
So, the result should look something like this:
{
  "myPK": "AENATGASEXPLNGM3",
  "TimeSeries": [
    ["2015-01", 756],
    ["2015-02", 572],
    ["2015-03", 654]
  ],
  "Metadata": {
    "Country": "AE",
    "Product": "NATGAS",
    "Flow": "EXPLNG",
    "Unit": "M3",
    "Quality": [
      ["2015-01", 3],
      ["2015-02", 3],
      ["2015-03", 3]
    ]
  }
}
Although you don't appear to have put much effort into solving the problem yourself, here's something I think does what you want. It makes use of the operator.itemgetter() function to simplify retrieving a series of different items from the various containers (such as lists and dicts).
I also modified the code to more closely follow the PEP 8 - Style Guide for Python Code.
import datetime
import csv
from operator import itemgetter
import itertools as it
import json
from StringIO import StringIO
import sys
from urllib import urlopen
from zipfile import ZipFile
# Utility.
def typed_itemgetter(items, callables):
    """ Like operator.itemgetter() but also applies the corresponding callable
    to each retrieved value if it's not None. Creates and returns a function.
    """
    return lambda row: [f(value) if f else value
                        for value, f in zip(itemgetter(*items)(row), callables)]

url = urlopen("https://www.jodidata.org/_resources/files/downloads/gas-data/jodi_gas_csv_beta.zip")
myzip = ZipFile(StringIO(url.read()))
with myzip.open('jodi_gas_beta.csv', 'r') as myCSV:
    reader = csv.DictReader(myCSV)
    primary_key = itemgetter('REF_AREA', 'ENERGY_PRODUCT', 'FLOW_BREAKDOWN',
                             'UNIT_MEASURE', 'TIME_PERIOD')
    reader = sorted(reader, key=primary_key)
    # Limit to first 200 rows for TESTING.
    reader = [row for row in it.islice(reader, 200)]

# Group the data by designated keys (aka "primary key").
keys, groups = [], []
keyfunc = itemgetter('REF_AREA', 'ENERGY_PRODUCT', 'FLOW_BREAKDOWN', 'UNIT_MEASURE')
for k, g in it.groupby(reader, key=keyfunc):
    keys.append(k)
    groups.append(list(g))

# Create corresponding JSON-like Python data structure.
myData = []
for i, group in enumerate(groups):
    result = {'myPK': ''.join(keys[i]),
              'TimeSeries': [typed_itemgetter(('TIME_PERIOD', 'OBS_VALUE'),
                                              (None, lambda x: int(float(x))))(row)
                             for row in group]}
    metadata = dict(zip(("Country", "Product", "Flow", "Unit"), keys[i]))
    metadata['Quality'] = [typed_itemgetter(('TIME_PERIOD', 'ASSESSMENT_CODE'),
                                            (None, int))(row)
                           for row in group]
    result['Metadata'] = metadata
    myData.append(result)

# Display the data to be turned into JSON.
from pprint import pprint
print('myData:')
pprint(myData)

# To create JSON format output, use something like:
with open('myData.json', 'w') as fp:
    json.dump(myData, fp, indent=2)
Beginning portion of the output printed:
myData:
[{'Metadata': {'Country': 'AE',
'Flow': 'EXPLNG',
'Product': 'NATGAS',
'Quality': [['2015-01', 3],
['2015-02', 3],
['2015-03', 3],
['2015-04', 3],
['2015-05', 3],
['2015-06', 3],
['2015-07', 3],
['2015-08', 3],
['2015-09', 3],
['2015-10', 3],
['2015-11', 3],
['2015-12', 3],
['2016-01', 3],
['2016-02', 3],
['2016-04', 3],
['2016-05', 3]],
'Unit': 'M3'},
'TimeSeries': [['2015-01', 756],
['2015-02', 572],
['2015-03', 654],
['2015-04', 431],
['2015-05', 681],
['2015-06', 683],
['2015-07', 751],
['2015-08', 716],
['2015-09', 830],
['2015-10', 580],
['2015-11', 659],
['2015-12', 659],
['2016-01', 742],
['2016-02', 746],
['2016-04', 0],
['2016-05', 0]],
'myPK': 'AENATGASEXPLNGM3'},
{'Metadata': {'Country': 'AE',
'Flow': 'EXPPIP',
'Product': 'NATGAS',
'Quality': [['2015-01', 3],
['2015-02', 3],
['2015-03', 3],
['2015-04', 3],
['2015-05', 3],
['2015-06', 3],
['2015-07', 3],
['2015-08', 3],
['2015-09', 3],
['2015-10', 3],
['2015-11', 3],
['2015-12', 3],
['2016-01', 3],
['2016-02', 3],
['2016-03', 3],
['2016-04', 3],
# etc, etc...
]
I want to change the JSON structure to generate the expected output. I don't want to achieve it by post-processing with Python and pandas; any idea how to change the JSON format so that I can get the output from pd.read_json(JSON_STR) directly?
Thanks
Current output of dataframe
json_str='''{
"2013-03-20_change_in_real_gdp":{
"2013":{
"upper_end_of_central_tendency":"2.8",
"lower_end_of_range":"2.0"
},
"2014":{
"upper_end_of_central_tendency":"3.4",
"lower_end_of_range":"2.6"
}
},
"2012-04-25_change_in_real_gdp":{
"2013":{
"upper_end_of_central_tendency":"7.7",
"lower_end_of_range":"7.0"
},
"2014":{
"upper_end_of_central_tendency":"7.4",
"lower_end_of_range":"6.3"
}
}
}'''
pd.read_json(json_str)
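For reference (the "current output" screenshot isn't reproduced here), with the default orient='columns' pd.read_json presumably makes the two outer keys the columns and leaves each inner dict as a single cell value, which is the problem. A quick way to see this:
import pandas as pd

df_current = pd.read_json(json_str)
# Each cell holds a whole inner dict rather than separate measure rows, e.g.:
print(df_current.iloc[0, 0])
# {'upper_end_of_central_tendency': '2.8', 'lower_end_of_range': '2.0'}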
This is the expected output from the dataframe:
Working backwards, you can create your dataframe and then convert it to JSON to see the expected format. When you try to convert it to JSON, you'll get an error because the measure index values are not unique. After resetting the index, you'll get the following JSON.
import pandas as pd
df = pd.DataFrame({2013: [2.8, 2, 7.7, 7],
2014: [3.4, 2.6, 7.4, 6.3],
'source': ['2013-03-20_change_in_real_gdp',
'2013-03-20_change_in_real_gdp',
'2012-04-25_change_in_real_gdp',
'2012-04-25_change_in_real_gdp']},
index=['upper_end_of_central_tendency',
'lower_end_of_range',
'upper_end_of_central_tendency',
'lower_end_of_range'])
df.index.name = 'measure'
>>> df.reset_index().to_json()
'{"measure": {"0":"upper_end_of_central_tendency",
"1":"lower_end_of_range",
"2":"upper_end_of_central_tendency",
"3":"lower_end_of_range"},
"2013": {"0":2.8,"1":2.0,"2":7.7,"3":7.0},
"2014": {"0":3.4,"1":2.6,"2":7.4,"3":6.3},
"source": {"0":"2013-03-20_change_in_real_gdp",
"1":"2013-03-20_change_in_real_gdp",
"2":"2012-04-25_change_in_real_gdp",
"3":"2012-04-25_change_in_real_gdp"}}'"""