Create dictionaries from a pandas DataFrame based on column values

What is the best way to create multiple dictionaries from a pandas DataFrame based on column values?
My DataFrame has this format:
    evtnum pcode  energy
1        1     a    20.0
2        1     a    30.0
3        1     b    29.0
4        1     a    34.0
5        2     c    20.0
6        2     a    15.0
7        3     a     3.0
8        3     b     2.0
9        3     c    25.0
10       4     h    28.0
11       5     a    43.6
12       5     c    20.3
evtnum takes values from 1 to 5000, and pcode takes one of 25 different letters. I have a set with these letters:
pcode_set = [a,b,c,d,h,...]
So I want one dictionary per evtnum, each of length len(pcode_set), holding for each letter the number of times it occurs in that event and the mean energy of that letter in that event. Something like this:
dict_1 = {a : [times that "a" appears in evtnum1,
               energy mean value of a in evtnum1],
          b : [times that "b" appears in evtnum1,
               energy mean value of b in evtnum1]
          ...
         }
dict_2 = {a : [times that "a" appears in evtnum2,
               energy mean value of a in evtnum2],
          b : [times that "b" appears in evtnum2,
               energy mean value of b in evtnum2]
          ...
         }
...
dict_5000 = {a : [times that "a" appears in evtnum5000,
                  energy mean value of a in evtnum5000],
             b : [times that "b" appears in evtnum5000,
                  energy mean value of b in evtnum5000]
             ...
            }
Please don't tell me how to count the letters' occurrences or how to calculate the mean value; those were just examples.
I only want to know how I can create a large number of dictionaries and fill them based on the column values of the DataFrame.

Using your example, this script should do the trick:
import sys
import pandas as pd

thismodule = sys.modules[__name__]
# Count the rows and average the energy for every (evtnum, pcode) pair.
df1 = (df.groupby(['evtnum', 'pcode'])
         .agg(num_pcode=('energy', 'size'), mean_energy=('energy', 'mean'))
         .reset_index())
for evt in df1.evtnum.unique():
    name = 'dict_' + str(evt)
    # Keep only this event's rows and key the resulting dict by pcode.
    df_ = df1[df1.evtnum == evt].drop(columns='evtnum').set_index('pcode').to_dict('index')
    setattr(thismodule, name, df_)
# Assumes the event numbers run from 1 to max without gaps.
for number in range(df1.evtnum.max()):
    print(number + 1)
    print(eval('dict_' + str(number + 1)))
Prints this:
1
{'a': {'num_pcode': 3, 'mean_energy': 28.0}, 'b': {'num_pcode': 1, 'mean_energy': 29.0}}
2
{'a': {'num_pcode': 1, 'mean_energy': 15.0}, 'c': {'num_pcode': 1, 'mean_energy': 20.0}}
3
{'a': {'num_pcode': 1, 'mean_energy': 3.0}, 'b': {'num_pcode': 1, 'mean_energy': 2.0}, 'c': {'num_pcode': 1, 'mean_energy': 25.0}}
4
{'h': {'num_pcode': 1, 'mean_energy': 28.0}}
5
{'a': {'num_pcode': 1, 'mean_energy': 43.6}, 'c': {'num_pcode': 1, 'mean_energy': 20.3}}
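Creating thousands of module-level names with setattr and reading them back with eval works, but a single dictionary keyed by event number is usually easier to handle. The following is only a sketch of that alternative, not part of the answer above: the name event_dicts is made up for illustration, and the small df is a cut-down copy of the question's data.

```python
import pandas as pd

# Cut-down copy of the question's data (events 1 and 2 only).
df = pd.DataFrame({
    'evtnum': [1, 1, 1, 1, 2, 2],
    'pcode': ['a', 'a', 'b', 'a', 'c', 'a'],
    'energy': [20.0, 30.0, 29.0, 34.0, 20.0, 15.0],
})

# One row per (evtnum, pcode) pair with the count and the mean energy.
stats = (df.groupby(['evtnum', 'pcode'])['energy']
           .agg(num_pcode='size', mean_energy='mean'))

# event_dicts is a hypothetical name: {evtnum: {pcode: {...stats...}}}.
event_dicts = {
    int(evt): group.droplevel('evtnum').to_dict('index')
    for evt, group in stats.groupby(level='evtnum')
}

print(event_dicts[1])
print(event_dicts[2])
```

With this layout, event_dicts[evt] plays the role of dict_<evt>, and looping over all events needs neither eval nor the assumption that event numbers are contiguous.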

Related

How to efficiently convert a dictionary to a DataFrame [duplicate]

How can I convert a list of dictionaries into a DataFrame? Given:
[{'points': 50, 'time': '5:00', 'year': 2010},
{'points': 25, 'time': '6:00', 'month': "february"},
{'points':90, 'time': '9:00', 'month': 'january'},
{'points_h1':20, 'month': 'june'}]
I want to turn the above into a DataFrame:
      month  points  points_h1  time  year
0       NaN      50        NaN  5:00  2010
1  february      25        NaN  6:00   NaN
2   january      90        NaN  9:00   NaN
3      june     NaN         20   NaN   NaN
Note: Order of the columns does not matter.

If ds is a list of dicts:
df = pd.DataFrame(ds)
Note: this does not work with nested data.
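To illustrate the note about nested data, here is a small sketch. The two records are invented for illustration (they reuse names from the json_normalize example elsewhere in this thread). With pd.DataFrame the inner dictionaries stay as Python dicts inside a single column, whereas pd.json_normalize, available at the top level since pandas 1.0, expands them into dotted column names.

```python
import pandas as pd

# Illustrative records with one nested dictionary each.
records = [
    {'state': 'Florida', 'info': {'governor': 'Rick Scott'}},
    {'state': 'Ohio', 'info': {'governor': 'John Kasich'}},
]

# The constructor keeps each inner dict as a single cell value.
flat_attempt = pd.DataFrame(records)
print(flat_attempt.columns.tolist())

# json_normalize expands nested keys into 'outer.inner' columns.
flattened = pd.json_normalize(records)
print(flattened.columns.tolist())
```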

How do I convert a list of dictionaries to a pandas DataFrame?
The other answers are correct, but not much has been explained in terms of advantages and limitations of these methods. The aim of this post will be to show examples of these methods under different situations, discuss when to use (and when not to use), and suggest alternatives.
DataFrame(), DataFrame.from_records(), and .from_dict()
Depending on the structure and format of your data, there are situations where either all three methods work, or some work better than others, or some don't work at all.
Consider a very contrived example.
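import numpy as np
import pandas as pd

np.random.seed(0)
# 'records' is the full orient name; the short alias 'r' is no longer accepted by recent pandas.
data = pd.DataFrame(
    np.random.choice(10, (3, 4)), columns=list('ABCD')).to_dict('records')
print(data)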
[{'A': 5, 'B': 0, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'C': 3, 'D': 5},
{'A': 2, 'B': 4, 'C': 7, 'D': 6}]
This list consists of "records" with every key present. This is the simplest case you could encounter.
# The following methods all produce the same output.
pd.DataFrame(data)
pd.DataFrame.from_dict(data)
pd.DataFrame.from_records(data)
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
Word on Dictionary Orientations: orient='index'/'columns'
Before continuing, it is important to distinguish between the different dictionary orientations and how pandas supports each of them. There are two primary types: "columns" and "index".
orient='columns'
Dictionaries with the "columns" orientation will have their keys correspond to columns in the equivalent DataFrame.
For example, data above is in the "columns" orient.
data_c = [
{'A': 5, 'B': 0, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'C': 3, 'D': 5},
{'A': 2, 'B': 4, 'C': 7, 'D': 6}]
pd.DataFrame.from_dict(data_c, orient='columns')
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
Note: If you are using pd.DataFrame.from_records, the orientation is assumed to be "columns" (you cannot specify otherwise), and the dictionaries will be loaded accordingly.
orient='index'
With this orient, keys are assumed to correspond to index values. This kind of data is best suited for pd.DataFrame.from_dict.
data_i ={
0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
2: {'A': 2, 'B': 4, 'C': 7, 'D': 6}}
pd.DataFrame.from_dict(data_i, orient='index')
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
This case is not considered in the OP, but is still useful to know.
Setting Custom Index
If you need a custom index on the resultant DataFrame, you can set it using the index=... argument.
pd.DataFrame(data, index=['a', 'b', 'c'])
# pd.DataFrame.from_records(data, index=['a', 'b', 'c'])
A B C D
a 5 0 3 3
b 7 9 3 5
c 2 4 7 6
This is not supported by pd.DataFrame.from_dict.
Dealing with Missing Keys/Columns
All methods work out-of-the-box when handling dictionaries with missing keys/column values. For example,
data2 = [
{'A': 5, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'F': 5},
{'B': 4, 'C': 7, 'E': 6}]
# The methods below all produce the same output.
pd.DataFrame(data2)
pd.DataFrame.from_dict(data2)
pd.DataFrame.from_records(data2)
A B C D E F
0 5.0 NaN 3.0 3.0 NaN NaN
1 7.0 9.0 NaN NaN NaN 5.0
2 NaN 4.0 7.0 NaN 6.0 NaN
Reading Subset of Columns
"What if I don't want to read in every single column"? You can easily specify this using the columns=... parameter.
For example, from the example list data2 above, if you wanted to read only columns 'A', 'D', and 'F', you can do so by passing a list:
pd.DataFrame(data2, columns=['A', 'D', 'F'])
# pd.DataFrame.from_records(data2, columns=['A', 'D', 'F'])
A D F
0 5.0 3.0 NaN
1 7.0 NaN 5.0
2 NaN NaN NaN
This is not supported by pd.DataFrame.from_dict with the default orient "columns".
pd.DataFrame.from_dict(data2, orient='columns', columns=['A', 'B'])
ValueError: cannot use columns parameter with orient='columns'
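By contrast, from_dict does accept columns= when orient='index', where it names the positions of list-like row values. The sketch below follows the pattern shown in the pandas documentation for from_dict; the row labels r0 and r1 are made up for illustration.

```python
import pandas as pd

# Keys are row labels; each value is a list of row values.
rows = {'r0': [5, 0, 3, 3], 'r1': [7, 9, 3, 5]}

# With orient='index', columns= supplies the column labels.
df = pd.DataFrame.from_dict(rows, orient='index', columns=['A', 'B', 'C', 'D'])
print(df)
```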
Reading Subset of Rows
Not supported by any of these methods directly. You will have to iterate over your data and perform a reverse delete in-place as you iterate. For example, to extract only the 0th and 2nd rows from data2 above, you can use:
rows_to_select = {0, 2}
for i in reversed(range(len(data2))):
    if i not in rows_to_select:
        del data2[i]
pd.DataFrame(data2)
# pd.DataFrame.from_dict(data2)
# pd.DataFrame.from_records(data2)
A B C D E
0 5.0 NaN 3 3.0 NaN
1 NaN 4.0 7 NaN 6.0
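The in-place deletion above modifies data2 itself. If the original list should stay intact, one possible alternative (a sketch, not part of the original answer) is to build a new list with a comprehension and pass that to the constructor. The name selected is introduced here for illustration.

```python
import pandas as pd

data2 = [
    {'A': 5, 'C': 3, 'D': 3},
    {'A': 7, 'B': 9, 'F': 5},
    {'B': 4, 'C': 7, 'E': 6}]

rows_to_select = {0, 2}

# Build a filtered copy; data2 keeps all three records.
selected = [row for i, row in enumerate(data2) if i in rows_to_select]
df = pd.DataFrame(selected)
print(df)
```

Another option is to load everything and slice afterwards with pd.DataFrame(data2).iloc[[0, 2]], although that keeps columns (here F) that only the dropped rows populate.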
The Panacea: json_normalize for Nested Data
A strong, robust alternative to the methods outlined above is the json_normalize function which works with lists of dictionaries (records), and in addition can also handle nested dictionaries.
pd.json_normalize(data)
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
pd.json_normalize(data2)
A B C D E
0 5.0 NaN 3 3.0 NaN
1 NaN 4.0 7 NaN 6.0
Again, keep in mind that the data passed to json_normalize needs to be in the list-of-dictionaries (records) format.
As mentioned, json_normalize can also handle nested dictionaries. Here's an example taken from the documentation.
data_nested = [
    {'counties': [{'name': 'Dade', 'population': 12345},
                  {'name': 'Broward', 'population': 40000},
                  {'name': 'Palm Beach', 'population': 60000}],
     'info': {'governor': 'Rick Scott'},
     'shortname': 'FL',
     'state': 'Florida'},
    {'counties': [{'name': 'Summit', 'population': 1234},
                  {'name': 'Cuyahoga', 'population': 1337}],
     'info': {'governor': 'John Kasich'},
     'shortname': 'OH',
     'state': 'Ohio'}
]
pd.json_normalize(data_nested,
                  record_path='counties',
                  meta=['state', 'shortname', ['info', 'governor']])
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
For more information on the meta and record_path arguments, check out the documentation.
Summarising
Here's a table of all the methods discussed above, along with supported features/functionality.
* Use orient='columns' and then transpose to get the same effect as orient='index'.
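The footnote about transposing can be checked with a short sketch. It reuses the data_i dictionary from the orient='index' example. The variable names by_index and by_columns_t are introduced here for illustration.

```python
import pandas as pd

data_i = {
    0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
    1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
    2: {'A': 2, 'B': 4, 'C': 7, 'D': 6}}

# Outer keys become the index directly.
by_index = pd.DataFrame.from_dict(data_i, orient='index')

# Outer keys become columns first, then the frame is transposed.
by_columns_t = pd.DataFrame.from_dict(data_i, orient='columns').T

print(by_index)
print(by_columns_t)
```

Both frames should show the same labels and values. Note that transposing a frame with mixed column types generally yields object columns, so the two routes are only interchangeable when the data is homogeneous, as it is here.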

In pandas 0.16.2, I had to use pd.DataFrame.from_records(d) to get this to work.

You can also use pd.DataFrame.from_dict(d) as follows:
In [8]: d = [{'points': 50, 'time': '5:00', 'year': 2010},
...: {'points': 25, 'time': '6:00', 'month': "february"},
...: {'points':90, 'time': '9:00', 'month': 'january'},
...: {'points_h1':20, 'month': 'june'}]
In [12]: pd.DataFrame.from_dict(d)
Out[12]:
month points points_h1 time year
0 NaN 50.0 NaN 5:00 2010.0
1 february 25.0 NaN 6:00 NaN
2 january 90.0 NaN 9:00 NaN
3 june NaN 20.0 NaN NaN

Python 3:
Most of the solutions listed previously work. However, there are cases where the DataFrame's row number is not required and each row (record) has to be written individually.
The following method is useful in that case.
import csv

# A raw string keeps the backslashes in the Windows path literal.
myfile = r'C:\Users\John\Desktop\export_dataframe.csv'
records_to_save = data2  # used as in the thread.
colnames = list(records_to_save[0].keys())
# colnames holds the keys of the first record only. Values are written in that
# key order, and "None" is written where a record lacks a key.
with open(myfile, 'w', newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(colnames)
    for d in records_to_save:
        writer.writerow([d.get(r, "None") for r in colnames])
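Because the header above is taken from records_to_save[0], keys that first appear in later records are never written. One way around that, sketched here under the assumption that every key should become a column, is to collect the union of all keys and let csv.DictWriter fill the gaps through its restval argument. The sketch writes to an in-memory buffer instead of a file so that it is self-contained.

```python
import csv
import io

records = [
    {'A': 5, 'C': 3, 'D': 3},
    {'A': 7, 'B': 9, 'F': 5},
    {'B': 4, 'C': 7, 'E': 6}]

# Union of the keys of all records, sorted for a stable column order.
fieldnames = sorted({key for record in records for key in record})

buffer = io.StringIO(newline='')
# restval is written wherever a record lacks one of the fieldnames.
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='None')
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, replace the buffer with a file opened using newline="" as in the snippet above.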

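The easiest way I have found to do it is like this:
import pandas as pd

# DataFrame.append was removed in pandas 2.0, so each row is added with pd.concat.
df = pd.DataFrame(dict_list[0], index=[0])
for i in range(1, len(dict_list)):  # runs through the last dict as well
    df = pd.concat([df, pd.DataFrame(dict_list[i], index=[0])], ignore_index=True)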

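I have the following list of dicts with datetime keys and int values:
records = [{datetime.date(2022, 2, 10): 7}, {datetime.date(2022, 2, 11): 1}, {datetime.date(2022, 2, 11): 1}]
I had trouble converting it to a DataFrame with the methods above, because they created a DataFrame whose columns were the dates.
My solution:
# The list is named records here so that it does not shadow the built-in list.
# Build one single-row frame per dict (the date becomes the index), then stack them.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
frames = [pd.DataFrame.from_dict(d, orient='index') for d in records]
df = pd.concat(frames)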
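For this kind of data, where each dictionary holds a single date/count pair, another possible approach is to flatten the dictionaries into (key, value) tuples first and build the frame in one call. This is only a sketch. The column names date and value are assumptions chosen for illustration.

```python
import datetime

import pandas as pd

records = [
    {datetime.date(2022, 2, 10): 7},
    {datetime.date(2022, 2, 11): 1},
    {datetime.date(2022, 2, 11): 1},
]

# One (date, count) tuple per dictionary entry; repeated dates are kept.
pairs = [(day, count) for record in records for day, count in record.items()]

# 'date' and 'value' are hypothetical column names.
df = pd.DataFrame(pairs, columns=['date', 'value'])
print(df)
```

This keeps the dates as an ordinary column, so repeated dates do not produce a duplicated index.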

Compare headers of dataframes and add the columns to the delta table

There are two dataframes:
df1 ---> OrgID, location, Address, State -- which is the Delta table
df2 ---> OrgID, location, Address, State, Active, Seq_Number -- which is createOrReplaceTempViewtable
df2 has two additional columns, Active and Seq_Number.
How do I get the data types of the new columns?
How do I add the new columns to the existing Delta table and update the values?
I tried the following: I converted the dataframes to pandas DataFrames and used this, which returns the new columns as an Index object.
# Columns that are present in df2 but missing from df1
df_new_columns = df2.columns.difference(df1.columns)
new = []
if len(df_new_columns.tolist()) != 0:
    for column in df_new_columns.tolist():
        column_name = column
        new.append(column)

Try this:
delta_table = pd.concat([delta_table, createOrReplaceTempViewtable[['Active', 'Seq_Number']]])

If you do not want to name the duplicated columns explicitly when adding the missing ones to your original table, you can do the following:
import pandas as pd
a = [{'A': 3, 'B': 5, 'C': 3, 'D': 2},{'A': 2, 'B': 4, 'C': 3, 'D': 9}]
df1 = pd.DataFrame(a)
b = [{'F': 0, 'M': 4, 'B': 2, 'C': 8 },{'F': 2, 'M': 4, 'B': 3, 'C': 9}]
df2 = pd.DataFrame(b)
df_delta = df2.columns.difference(df1.columns)
print(pd.concat([df1,df2]).T.drop_duplicates().T)
print(df_delta)
which gives:
A B C D F M
0 3.0 5.0 3.0 2.0 NaN NaN
1 2.0 4.0 3.0 9.0 NaN NaN
0 NaN 2.0 8.0 NaN 0.0 4.0
1 NaN 3.0 9.0 NaN 2.0 4.0
and
Index(['F', 'M'], dtype='object')
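On the pandas side only, the two questions (data types of the new columns, and adding them to the existing table) might be approached as in the sketch below. The sample values are invented, and OrgID is assumed to be a unique key shared by both frames. Writing the result back to a Delta table is outside the scope of this sketch and is not shown.

```python
import pandas as pd

# Invented sample data using the column names from the question.
df1 = pd.DataFrame({
    'OrgID': [1, 2],
    'location': ['X', 'Y'],
    'Address': ['a st', 'b st'],
    'State': ['CA', 'NY'],
})
df2 = df1.assign(Active=[True, False], Seq_Number=[10, 20])

# Columns present in df2 but missing from df1.
new_cols = df2.columns.difference(df1.columns)

# Data type of each new column, as a {name: dtype} mapping.
new_dtypes = df2[new_cols].dtypes.to_dict()
print(new_dtypes)

# Bring the new columns into df1, matching rows on the assumed key OrgID.
updated = df1.merge(df2[['OrgID', *new_cols]], on='OrgID', how='left')
print(updated)
```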

Assigning column names while creating dataframe results in nan values

I have a list of dicts that I am converting to a DataFrame. When I pass the columns argument, the output values are all NaN.
# This code does not result in desired output
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
pd.DataFrame(l, columns=['c', 'd'])
c d
0 NaN NaN
1 NaN NaN
# This code does result in desired output
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
df = pd.DataFrame(l)
df.columns = ['c', 'd']
df
c d
0 1 2
1 3 4
Why is this happening?

Because when you pass a list of dictionaries to the DataFrame constructor, the dictionary keys become the column names:
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
print (pd.DataFrame(l))
a b
0 1 2
1 3 4
If you pass the columns parameter, it selects and orders columns by those names: names that match dictionary keys keep their values, and names that match no key produce columns filled with missing values, in the order given in the list:
# Reordering works because 'a' and 'b' each appear as a key in at least one dictionary
print (pd.DataFrame(l, columns=['b', 'a']))
b a
0 2 1
1 4 3
# 'a' is kept; 'd' is filled with missing values because no dictionary has that key
print (pd.DataFrame(l, columns=['a', 'd']))
a d
0 1 NaN
1 3 NaN
# 'b' is kept; 'c' is filled with missing values because no dictionary has that key
print (pd.DataFrame(l, columns=['c', 'b']))
c b
0 NaN 2
1 NaN 4
# 'a' and 'b' are kept; 'c' and 'd' are filled with missing values because no dictionary has those keys
print (pd.DataFrame(l, columns=['c', 'd','a','b']))
c d a b
0 NaN NaN 1 2
1 NaN NaN 3 4
So if you want different column names, you need to rename the columns or assign new ones, as in your second snippet.
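Renaming can be done in the same expression that builds the frame. A minimal sketch using DataFrame.rename with an explicit old-to-new mapping follows. The variable name renamed is introduced here for illustration.

```python
import pandas as pd

l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]

# Map each existing column name to the name it should have.
renamed = pd.DataFrame(l).rename(columns={'a': 'c', 'b': 'd'})
print(renamed)
```

Unlike assigning to df.columns, the mapping does not depend on the order in which the columns happen to appear.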
