Converting a pandas dataframe to JSON file - python

I am trying to convert a pandas DataFrame to a JSON file. The following image shows my data:
[Screenshot of the dataset from MS Excel]
I am using the following code:
import os
import pandas as pd

os.chdir("G:\\My Drive\\LEC dashboard\\EnergyPlus simulation files\\DEC\\Ahmedabad\\Adaptive set point\\CSV")
df = pd.read_csv('Adap_40-_0_0.1_1.5_0.6.csv')
df2 = df.filter(like='[C](Hourly)', axis=1)
df3 = df.filter(like='[C](Hourly:ON)', axis=1)
df4 = df.filter(like='[%](Hourly)', axis=1)
df5 = df.filter(like='[%](Hourly:ON)', axis=1)
df6 = pd.concat([df2, df3, df4, df5], axis=1)
df6.to_json("123.json", orient='columns')
In the output, I am getting a dictionary as each column's value. However, I need a list as the value.
The output I am getting: [screenshot of the JSON produced by the code above]
The output that is desired: [screenshot of the desired JSON]
I have tried different orientations of json but nothing works.

There might be other ways of doing this, but one way is this:
import json
import pandas as pd

test = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6]})
with open('test.json', 'w') as f:
    json.dump(test.to_dict(orient='list'), f)
The resulting file will look like this: '{"a": [1, 2, 3, 4, 5, 6]}'
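Applied to the question's df6 (a sketch; df6 is the concatenated frame built in the code above), the same idea would be:
import json

# to_dict(orient='list') maps each column name to a plain Python list of its
# values, so json.dump writes {"column": [v0, v1, ...]} instead of nested dicts.
with open("123.json", "w") as f:
    json.dump(df6.to_dict(orient="list"), f)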

pandas has a built-in method for this, DataFrame.to_json:
df.to_json(r'Path_to_file\file_name.json')
Take a look at the documentation if you need more specifics: https://pandas.pydata.org/pandas-docs/version/0.24/reference/api/pandas.DataFrame.to_json.html

Pandas dataframe throwing error when appending to CSV

import pandas as pd
df = pd.read_csv("stack.csv")
sector_select = "Col2"
df[sector_select] = ["100"]
df.to_csv("stack.csv", index=False, mode='a', header=False)
stack.csv has no data other than a header: Col1,Col2,Col3,Col4,Col5
ValueError: Length of values (1) does not match length of index (2)
I'm just trying to make a program where I can select a header and append data to the column under that header.
You can only run it twice before it gives an error!
You can use this:
df = df.append({"Col2": 100}, ignore_index=True)
That code runs for me.
But I assume that you would like to run something like this:
import pandas as pd
df = pd.read_csv("stack.csv")
sector_select = "Col2"
df.at[len(df), sector_select] = "100"
df.to_csv("stack.csv", index=False)

xlsx file to list for dataset

Here is a copy of my code. When I run it, the list output is in a weird format that I can't figure out how to change.
import pandas as pd

inputdata = []
data = pd.read_excel('testfile.xlsx')
df = pd.DataFrame(data, columns=['black_toner_percentage'])
for i in df.values:
    inputdata.append(i)
print(inputdata)
array([100], dtype=int64), array([100], dtype=int64)
This is the format my list is going into, and I'd like it to be [100, 100, etc.].
What am I doing wrong here?
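For reference, a likely fix (a sketch, assuming the column name in the file matches the code above) is to take the column as a Series and convert it to a plain list, instead of iterating over df.values, which yields one NumPy array per row:
import pandas as pd

data = pd.read_excel('testfile.xlsx')
# Selecting the column gives a Series; tolist() returns plain Python numbers.
inputdata = data['black_toner_percentage'].tolist()
print(inputdata)  # e.g. [100, 100, ...]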

How to read list of parquets with partially overlapping set of columns in dask?

Consider this code:
import dask
import dask.dataframe as dd
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [11, 12]})
df1.to_parquet("df1.parquet")
df2 = pd.DataFrame({'A': [3, 4], 'C': [13, 14]})
df2.to_parquet("df2.parquet")

all_files = ["df1.parquet", "df2.parquet"]
full_df = dd.read_parquet(all_files)
# dask.compute(full_df)  # KeyError: "['B'] not in index"

def normalize(df):
    # Add any missing column as NaN so every partition has the same schema.
    df_cols = set(df.columns)
    for c in ['A', 'B', 'C']:
        if c not in df_cols:
            df[c] = np.nan
    df = df[sorted(df.columns)]
    return df

normal_df = full_df.map_partitions(normalize)
dask.compute(normal_df)  # Still gives a KeyError
I was hoping that after the normalization using map_partitions I wouldn't get the KeyError, but read_parquet probably fails before reaching the map_partitions step.
I could have created the DataFrame from a list of delayed objects, each of which would read one file and normalize the columns, but I want to avoid using delayed objects for this reason.
The other option, suggested by SultanOrazbayev, is to use dask dataframes like this:
def normal_ddf(path):
    df = dd.read_parquet(path)
    return normalize(df)  # normalize should work with both pandas and dask

full_df = dd.concat([normal_ddf(path) for path in all_files])
The problem with this is that when all_files contains a large number of files (10k), it takes a long time to create the dataframe, since all those dd.read_parquet calls happen sequentially. Although dd.read_parquet doesn't need to load the whole file, it still has to read some metadata to get the column info, and doing that sequentially on 10k files adds up.
So, what is the proper/efficient way to read a bunch of parquet files that don't all have the same set of columns?
dd.concat should take care of your normalization.
Consider this example:
import pandas as pd
import dask.dataframe as dd
import numpy as np
import string

N = 100_000
all_files = []
for col in string.ascii_uppercase[1:]:
    # Every file shares column "A" but adds one unique column.
    df = pd.DataFrame({
        "A": np.random.normal(size=N),
        col: (np.random.normal(size=N) ** 2) * 50,
    })
    fname = f"df_{col}.parquet"
    all_files.append(fname)
    df.to_parquet(fname)

full_df = dd.concat([dd.read_parquet(path) for path in all_files]).compute()
And I get this on my task stream dashboard: [screenshot of the task stream]
Another option, which was not mentioned in the comments by @Michael Delgado, is to load each parquet into a separate dask dataframe and then stitch them together. Here's the rough pseudocode:
def normal_ddf(path):
    df = dd.read_parquet(path)
    return normalize(df)  # normalize should work with both pandas and dask

full_df = dd.concat([normal_ddf(path) for path in all_files])
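As a quick sanity check on the question's original two files (a minimal sketch), dd.concat alone aligns the differing column sets and fills the gaps with NaN, so no explicit normalize step is needed:
import dask.dataframe as dd

# df1.parquet has columns A, B; df2.parquet has columns A, C.
full_df = dd.concat([dd.read_parquet(p) for p in ["df1.parquet", "df2.parquet"]])
print(full_df.compute())
# Expect columns A, B and C, with NaN where a file did not carry the column.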

Handle variable as file with pandas dataframe

I would like to create a pandas dataframe out of a list variable.
With pd.DataFrame() I am not able to declare a delimiter, which leads to just one column per list entry.
If I use pd.read_csv() instead, I of course receive the following error:
ValueError: Invalid file path or buffer object type: <class 'list'>
Is there a way to use pd.read_csv() with my list without first saving the list to a csv and reading the csv file in a second step?
I also tried pd.read_table(), which also needs a file or buffer object.
Example data (separated by tab stops):
Col1 Col2 Col3
12 Info1 34.1
15 Info4 674.1
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
Current workaround:
with open(f'{filepath}tmp.csv', 'w', encoding='UTF8') as f:
    [f.write(line + "\n") for line in consolidated_file]
df = pd.read_csv(f'{filepath}tmp.csv', sep='\t', index_col=1)
import pandas as pd
df = pd.DataFrame([x.split('\t') for x in test])
print(df)
If you want the first row as the header, then:
df.columns = df.iloc[0]
df = df[1:]
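Note that building the frame from split strings leaves every column with object (string) dtype. If the numeric columns should actually be numbers, a small follow-up conversion (a sketch, using the column names from the example data above) could be:
# Col1 and Col3 hold numeric-looking strings after the split.
df["Col1"] = pd.to_numeric(df["Col1"])
df["Col3"] = pd.to_numeric(df["Col3"])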
It seems simpler to convert it to a nested list, as in the other answer:
import pandas as pd
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
data = [line.split('\t') for line in test]
df = pd.DataFrame(data[1:], columns=data[0])
but you can also convert it back to a single string (or get it directly from a file or from the network as a single string) and then use io.BytesIO or io.StringIO to simulate a file in memory.
import pandas as pd
import io
test = ["Col1\tCol2\tCol3", "12\tInfo1\t34.1","15\tInfo4\t674.1"]
single_string = "\n".join(test)
file_like_object = io.StringIO(single_string)
df = pd.read_csv(file_like_object, sep='\t')
or shorter
df = pd.read_csv(io.StringIO("\n".join(test)), sep='\t')
This method is popular when you get the data from a network source (socket, web API) as a single string.

save a list of different Dataframes to json

I have different pandas dataframes, which I put in a list.
I want to save this list in json (or any other format) which can be read by R.
import pandas as pd

def create_df_predictions(extra_periods):
    """
    Make an empty df for predictions.
    params: extra_periods = how many predictions into the future the user wants
    """
    df = pd.DataFrame({'model': ['a'], 'name_id': ['a']})
    for col in range(1, extra_periods + 1):
        name_col = 'forecast' + str(col)
        df[name_col] = 0
    return df
df1 = create_df_predictions(9)
df2 = create_df_predictions(12)
list_df = [df1, df2]
The question is: how to save list_df in a format readable by R? Note that df1 and df2 have a different number of columns!
I don't know pandas DataFrames in detail, so maybe this won't work, but if they behave like a traditional dict you should be able to use the json module.
df1 = create_df_predictions(9)
df2 = create_df_predictions(12)
list_df = [df1, df2]
You can write it to a file using json.dumps(list_df), which will convert your list of dicts to a valid JSON representation.
import json

with open("my_file", 'w') as outfile:
    outfile.write(json.dumps(list_df))
Edit: as commented by DaveR, DataFrames aren't JSON serializable. You can convert them to dicts and then dump the list to json.
import json

with open("my_file", 'w') as outfile:
    outfile.write(json.dumps([df.to_dict() for df in list_df]))
Alternatively, pd.DataFrame and pd.Series have a to_json() method; maybe have a look at those as well.
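A minimal sketch of that approach (assuming the list_df built above): writing each frame with to_json(orient='records') produces one list of row-dicts per DataFrame, which R packages such as jsonlite can typically read back as a list of data frames, even though the frames have different columns:
import json

# Each DataFrame becomes a list of row-dicts; the outer list keeps the two
# frames separate despite their different column sets.
payload = [json.loads(df.to_json(orient="records")) for df in list_df]

with open("list_df.json", "w") as outfile:
    json.dump(payload, outfile)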
To export the list of DataFrames to a single json file, you should convert the list into a DataFrame and then use the to_json() function as shown below:
df_to_export = pd.DataFrame(list_df)
json_output = df_to_export.to_json()

with open("output.txt", 'w') as outfile:
    outfile.write(json_output)
This will export the full dataset as a single JSON string and write it to a file.
