Fill Pandas cells based on a dictionary - python

My Pandas DataFrame is like:
import pandas as pd
import numpy as np

data = {'dates': ['2/16/2023', '2/17/2023', '2/18/2023', '2/19/2023', '2/20/2023', '2/21/2023', '2/22/2023', '2/23/2023', '2/24/2023', '2/25/2023', '2/26/2023', '2/27/2023', '2/28/2023', '3/1/2023', '3/2/2023', '3/3/2023', '3/4/2023', '3/5/2023', '3/6/2023', '3/7/2023', '3/8/2023', '3/9/2023', '3/10/2023'],
        'Name': ['', '', '', '', 'A', '', '', '', '', 'B', '', '', '', '', '', 'A', '', '', '', 'D', '', '', ''],
        'Filter': [np.nan, np.nan, np.nan, np.nan, 0.0, np.nan, np.nan, np.nan, np.nan, 1.0, np.nan, np.nan, np.nan, np.nan, np.nan, 2.0, np.nan, np.nan, np.nan, 3.0, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data)
I want to fill the data frame using the dictionary below:
mapper = {0: {'begin': -3, 'length': 3},
          1: {'begin': -2, 'length': 2},
          2: {'begin': -3, 'length': 2},
          3: {'begin': -1, 'length': 2}}
So, for the first 'A', the Filter value is 0, which according to the mapper dict means the filled block should begin 3 rows earlier (begin: -3) and span 3 rows in total.
Expected output:
import pandas as pd
import numpy as np
data = {'dates': ['2/16/2023', '2/17/2023', '2/18/2023', '2/19/2023', '2/20/2023', '2/21/2023', '2/22/2023', '2/23/2023', '2/24/2023', '2/25/2023', '2/26/2023', '2/27/2023', '2/28/2023', '3/1/2023', '3/2/2023', '3/3/2023', '3/4/2023', '3/5/2023', '3/6/2023', '3/7/2023', '3/8/2023', '3/9/2023', '3/10/2023'],
'Name': ['', 'A', 'A', 'A', '', '', '', 'B', 'B', '', '', '', 'A', 'A', '', '', '', '', 'D', 'D', '', '', ''],
'Filter': [np.nan, 0.0, 0.0, 0.0, np.nan, np.nan, np.nan, 1.0, 1.0, np.nan, np.nan, np.nan, 2.0, 2.0, np.nan, np.nan, np.nan, np.nan, 3.0, 3.0, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data)
I can use bfill and ffill, but I am not able to apply these conditions.

Use a simple loop:
tmp = df.copy()
df[['Name', 'Filter']] = ['', np.nan]

for k, d in tmp['Filter'].dropna().map(mapper).items():
    # .loc slicing is inclusive on both ends, hence the -1
    idx = slice(k + d['begin'], k + d['begin'] + d['length'] - 1)
    df.loc[idx, 'Name'] = tmp.loc[k, 'Name']
    df.loc[idx, 'Filter'] = tmp.loc[k, 'Filter']
Output:
dates Name Filter
0 2/16/2023 NaN
1 2/17/2023 A 0.0
2 2/18/2023 A 0.0
3 2/19/2023 A 0.0
4 2/20/2023 NaN
5 2/21/2023 NaN
6 2/22/2023 NaN
7 2/23/2023 B 1.0
8 2/24/2023 B 1.0
9 2/25/2023 NaN
10 2/26/2023 NaN
11 2/27/2023 NaN
12 2/28/2023 A 2.0
13 3/1/2023 A 2.0
14 3/2/2023 NaN
15 3/3/2023 NaN
16 3/4/2023 NaN
17 3/5/2023 NaN
18 3/6/2023 D 3.0
19 3/7/2023 D 3.0
20 3/8/2023 NaN
21 3/9/2023 NaN
22 3/10/2023 NaN
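To see the slice arithmetic for the first match: the first 'A' sits at position k = 4 with Filter 0, and mapper[0] gives begin = -3, length = 3, so the block covers rows 1 through 3 (.loc slices are inclusive on both ends):

```python
k = 4                             # position of the first 'A'
d = {'begin': -3, 'length': 3}    # mapper[0]

start = k + d['begin']            # 4 - 3 = 1
stop = start + d['length'] - 1    # 1 + 3 - 1 = 3, inclusive with .loc
```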

You can store the names and values in helper variables, remove missing values before and after mapping with Series.dropna, overwrite the original values with an empty string and NaN, and then set the new values wherever the dictionary matched (i.e. v contains no missing value):
for k, v in df['Filter'].dropna().map(mapper).dropna().items():
    s = k + v['begin']
    e = s + v['length'] - 1
    name = df.loc[k, 'Name']
    val = df.loc[k, 'Filter']
    df.loc[k, ['Name', 'Filter']] = ['', np.nan]
    df.loc[s:e, ['Name', 'Filter']] = [name, val]
print(df)
dates Name Filter
0 2/16/2023 NaN
1 2/17/2023 A 0.0
2 2/18/2023 A 0.0
3 2/19/2023 A 0.0
4 2/20/2023 NaN
5 2/21/2023 NaN
6 2/22/2023 NaN
7 2/23/2023 B 1.0
8 2/24/2023 B 1.0
9 2/25/2023 NaN
10 2/26/2023 NaN
11 2/27/2023 NaN
12 2/28/2023 A 2.0
13 3/1/2023 A 2.0
14 3/2/2023 NaN
15 3/3/2023 NaN
16 3/4/2023 NaN
17 3/5/2023 NaN
18 3/6/2023 D 3.0
19 3/7/2023 D 3.0
20 3/8/2023 NaN
21 3/9/2023 NaN
22 3/10/2023 NaN
Tested with a value that has no match in the dictionary:
data = {'dates': ['2/16/2023', '2/17/2023', '2/18/2023', '2/19/2023', '2/20/2023', '2/21/2023', '2/22/2023', '2/23/2023', '2/24/2023', '2/25/2023', '2/26/2023', '2/27/2023', '2/28/2023', '3/1/2023', '3/2/2023', '3/3/2023', '3/4/2023', '3/5/2023', '3/6/2023', '3/7/2023', '3/8/2023', '3/9/2023', '3/10/2023'],
'Name': ['', '', '', '', 'A', '', '', '', '', 'B', '', '', '', '', '', 'A', '', '', '', 'D', '', '', ''],
'Filter': [np.nan, np.nan, np.nan, np.nan, 4.0, np.nan, np.nan, np.nan, np.nan, 1.0, np.nan, np.nan, np.nan, np.nan, np.nan, 2.0, np.nan, np.nan, np.nan, 3.0, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data)
mapper = {0: {'begin': -3, 'length': 3},
          1: {'begin': -2, 'length': 2},
          2: {'begin': -3, 'length': 2},
          3: {'begin': -1, 'length': 2}}

for k, v in df['Filter'].dropna().map(mapper).dropna().items():
    s = k + v['begin']
    e = s + v['length'] - 1
    name = df.loc[k, 'Name']
    val = df.loc[k, 'Filter']
    df.loc[k, ['Name', 'Filter']] = ['', np.nan]
    df.loc[s:e, ['Name', 'Filter']] = [name, val]
print(df)
dates Name Filter
0 2/16/2023 NaN
1 2/17/2023 NaN
2 2/18/2023 NaN
3 2/19/2023 NaN
4 2/20/2023 A 4.0 <- 4 has no match in the dictionary, so unchanged
5 2/21/2023 NaN
6 2/22/2023 NaN
7 2/23/2023 B 1.0
8 2/24/2023 B 1.0
9 2/25/2023 NaN
10 2/26/2023 NaN
11 2/27/2023 NaN
12 2/28/2023 A 2.0
13 3/1/2023 A 2.0
14 3/2/2023 NaN
15 3/3/2023 NaN
16 3/4/2023 NaN
17 3/5/2023 NaN
18 3/6/2023 D 3.0
19 3/7/2023 D 3.0
20 3/8/2023 NaN
21 3/9/2023 NaN
22 3/10/2023 NaN

Related

How to efficiently convert a dictionary to a dataframe [duplicate]

How can I convert a list of dictionaries into a DataFrame? Given:
[{'points': 50, 'time': '5:00', 'year': 2010},
{'points': 25, 'time': '6:00', 'month': "february"},
{'points':90, 'time': '9:00', 'month': 'january'},
{'points_h1':20, 'month': 'june'}]
I want to turn the above into a DataFrame:
month points points_h1 time year
0 NaN 50 NaN 5:00 2010
1 february 25 NaN 6:00 NaN
2 january 90 NaN 9:00 NaN
3 june NaN 20 NaN NaN
Note: Order of the columns does not matter.
If ds is a list of dicts:
df = pd.DataFrame(ds)
Note: this does not work with nested data.
How do I convert a list of dictionaries to a pandas DataFrame?
The other answers are correct, but not much has been explained in terms of advantages and limitations of these methods. The aim of this post will be to show examples of these methods under different situations, discuss when to use (and when not to use), and suggest alternatives.
DataFrame(), DataFrame.from_records(), and .from_dict()
Depending on the structure and format of your data, there are situations where either all three methods work, or some work better than others, or some don't work at all.
Consider a very contrived example.
np.random.seed(0)
data = pd.DataFrame(
    np.random.choice(10, (3, 4)), columns=list('ABCD')).to_dict('records')
print(data)
[{'A': 5, 'B': 0, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'C': 3, 'D': 5},
{'A': 2, 'B': 4, 'C': 7, 'D': 6}]
This list consists of "records" with every key present. This is the simplest case you could encounter.
# The following methods all produce the same output.
pd.DataFrame(data)
pd.DataFrame.from_dict(data)
pd.DataFrame.from_records(data)
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
Word on Dictionary Orientations: orient='index'/'columns'
Before continuing, it is important to make the distinction between the different types of dictionary orientations, and support with pandas. There are two primary types: "columns", and "index".
orient='columns'
Dictionaries with the "columns" orientation will have their keys correspond to columns in the equivalent DataFrame.
For example, data above is in the "columns" orient.
data_c = [
{'A': 5, 'B': 0, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'C': 3, 'D': 5},
{'A': 2, 'B': 4, 'C': 7, 'D': 6}]
pd.DataFrame.from_dict(data_c, orient='columns')
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
Note: If you are using pd.DataFrame.from_records, the orientation is assumed to be "columns" (you cannot specify otherwise), and the dictionaries will be loaded accordingly.
orient='index'
With this orient, keys are assumed to correspond to index values. This kind of data is best suited for pd.DataFrame.from_dict.
data_i ={
0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
2: {'A': 2, 'B': 4, 'C': 7, 'D': 6}}
pd.DataFrame.from_dict(data_i, orient='index')
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
This case is not considered in the OP, but is still useful to know.
Setting Custom Index
If you need a custom index on the resultant DataFrame, you can set it using the index=... argument.
pd.DataFrame(data, index=['a', 'b', 'c'])
# pd.DataFrame.from_records(data, index=['a', 'b', 'c'])
A B C D
a 5 0 3 3
b 7 9 3 5
c 2 4 7 6
This is not supported by pd.DataFrame.from_dict.
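If you want from_dict specifically, one workaround (a sketch, not an official index= equivalent) is to assign the index after construction:

```python
import pandas as pd

data = [{'A': 5, 'B': 0, 'C': 3, 'D': 3},
        {'A': 7, 'B': 9, 'C': 3, 'D': 5},
        {'A': 2, 'B': 4, 'C': 7, 'D': 6}]

# from_dict has no index= parameter; set the index afterwards instead
df = pd.DataFrame.from_dict(data)
df.index = ['a', 'b', 'c']
```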
Dealing with Missing Keys/Columns
All methods work out-of-the-box when handling dictionaries with missing keys/column values. For example,
data2 = [
{'A': 5, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'F': 5},
{'B': 4, 'C': 7, 'E': 6}]
# The methods below all produce the same output.
pd.DataFrame(data2)
pd.DataFrame.from_dict(data2)
pd.DataFrame.from_records(data2)
A B C D E F
0 5.0 NaN 3.0 3.0 NaN NaN
1 7.0 9.0 NaN NaN NaN 5.0
2 NaN 4.0 7.0 NaN 6.0 NaN
Reading Subset of Columns
"What if I don't want to read in every single column?" You can easily specify this using the columns=... parameter.
For example, from the example dictionary of data2 above, if you wanted to read only columns 'A', 'D', and 'F', you can do so by passing a list:
pd.DataFrame(data2, columns=['A', 'D', 'F'])
# pd.DataFrame.from_records(data2, columns=['A', 'D', 'F'])
A D F
0 5.0 3.0 NaN
1 7.0 NaN 5.0
2 NaN NaN NaN
This is not supported by pd.DataFrame.from_dict with the default orient "columns".
pd.DataFrame.from_dict(data2, orient='columns', columns=['A', 'B'])
ValueError: cannot use columns parameter with orient='columns'
Reading Subset of Rows
Not supported by any of these methods directly. You will have to iterate over your data and perform a reverse delete in-place as you iterate. For example, to extract only the 0th and 2nd rows from data2 above, you can use:
rows_to_select = {0, 2}
for i in reversed(range(len(data2))):
    if i not in rows_to_select:
        del data2[i]
pd.DataFrame(data2)
# pd.DataFrame.from_dict(data2)
# pd.DataFrame.from_records(data2)
A B C D E
0 5.0 NaN 3 3.0 NaN
1 NaN 4.0 7 NaN 6.0
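If deleting from the source list in place is undesirable, the same row subset can be taken non-destructively (a sketch using a plain list comprehension):

```python
import pandas as pd

data2 = [{'A': 5, 'C': 3, 'D': 3},
         {'A': 7, 'B': 9, 'F': 5},
         {'B': 4, 'C': 7, 'E': 6}]

rows_to_select = {0, 2}
# filter the records first, leaving data2 untouched
df = pd.DataFrame([d for i, d in enumerate(data2) if i in rows_to_select])
```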
The Panacea: json_normalize for Nested Data
A strong, robust alternative to the methods outlined above is the json_normalize function which works with lists of dictionaries (records), and in addition can also handle nested dictionaries.
pd.json_normalize(data)
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
pd.json_normalize(data2)
A B C D E
0 5.0 NaN 3 3.0 NaN
1 NaN 4.0 7 NaN 6.0
Again, keep in mind that the data passed to json_normalize needs to be in the list-of-dictionaries (records) format.
As mentioned, json_normalize can also handle nested dictionaries. Here's an example taken from the documentation.
data_nested = [
{'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}],
'info': {'governor': 'Rick Scott'},
'shortname': 'FL',
'state': 'Florida'},
{'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}],
'info': {'governor': 'John Kasich'},
'shortname': 'OH',
'state': 'Ohio'}
]
pd.json_normalize(data_nested,
record_path='counties',
meta=['state', 'shortname', ['info', 'governor']])
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
For more information on the meta and record_path arguments, check out the documentation.
Summarising
Here's a table of the methods discussed above, along with supported features/functionality:

Method                       Custom index (index=)   Column subset (columns=)      Nested data
pd.DataFrame                 yes                     yes                           no
pd.DataFrame.from_records    yes                     yes                           no
pd.DataFrame.from_dict *     no                      no (with orient='columns')    no
pd.json_normalize            no                      no                            yes

* Use orient='columns' and then transpose to get the same effect as orient='index'.
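The footnote can be checked directly with data_i from earlier:

```python
import pandas as pd

data_i = {0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
          1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
          2: {'A': 2, 'B': 4, 'C': 7, 'D': 6}}

via_index = pd.DataFrame.from_dict(data_i, orient='index')
# orient='columns' treats the outer keys as columns; transposing recovers
# the same frame as orient='index'
via_transpose = pd.DataFrame.from_dict(data_i, orient='columns').T
```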
In pandas 0.16.2, I had to use pd.DataFrame.from_records(d) to get this to work.
You can also use pd.DataFrame.from_dict(d):
In [8]: d = [{'points': 50, 'time': '5:00', 'year': 2010},
...: {'points': 25, 'time': '6:00', 'month': "february"},
...: {'points':90, 'time': '9:00', 'month': 'january'},
...: {'points_h1':20, 'month': 'june'}]
In [12]: pd.DataFrame.from_dict(d)
Out[12]:
month points points_h1 time year
0 NaN 50.0 NaN 5:00 2010.0
1 february 25.0 NaN 6:00 NaN
2 january 90.0 NaN 9:00 NaN
3 june NaN 20.0 NaN NaN
Python 3:
Most of the solutions listed previously work. However, there are instances when the row number of the dataframe is not required and each row (record) has to be written individually.
The following method is useful in that case.
import csv

myfile = r'C:\Users\John\Desktop\export_dataframe.csv'
records_to_save = data2  # used as in the thread
# colnames is a list of all keys from the first record. Values are written
# in key order, and "None" is written in case of a missing value.
colnames = list(records_to_save[0].keys())

with open(myfile, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(colnames)
    for d in records_to_save:
        writer.writerow([d.get(r, "None") for r in colnames])
The easiest way I have found to do it is like this:
dict_count = len(dict_list)
df = pd.DataFrame(dict_list[0], index=[0])
for i in range(1, dict_count):  # not dict_count - 1, which would skip the last dict
    df = df.append(dict_list[i], ignore_index=True)
I have the following list of dicts with datetime keys and int values:
import datetime
records = [{datetime.date(2022, 2, 10): 7}, {datetime.date(2022, 2, 11): 1}, {datetime.date(2022, 2, 11): 1}]
I had a problem converting it to a DataFrame with the methods above, as they created a DataFrame with the dates as columns...
My solution:
df = pd.DataFrame()
for rec in records:
    temp_df = pd.DataFrame.from_dict(rec, orient='index')
    df = df.append(temp_df)
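Note that the append-based loops above stopped working in pandas 2.0, where DataFrame.append was removed; a sketch of the same idea with pd.concat, which also avoids growing the frame row by row:

```python
import pandas as pd

dict_list = [{'points': 50, 'time': '5:00', 'year': 2010},
             {'points': 25, 'time': '6:00', 'month': 'february'}]

# build one single-row frame per dict, then concatenate once
df = pd.concat([pd.DataFrame(d, index=[0]) for d in dict_list],
               ignore_index=True)
```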

Compare two data frames side by side for same columns but different rows

I have two data frames with same column labels like below:
df1 = pd.DataFrame({'key_1': {0: 'F', 1: 'H', 2: 'E'},
                    'key_2': {0: 'F', 1: 'G', 2: 'E'},
                    'min': {0: -158, 1: -881, 2: -674},
                    'count': {0: 58, 1: 24, 2: 13}})
df2 = pd.DataFrame({'key_1': {0: 'C', 1: 'L', 2: 'F', 3: 'K'},
                    'key_2': {0: 'C', 1: 'D', 2: 'F', 3: 'K'},
                    'min': {0: -452, 1: -153, 2: -181, 3: -120},
                    'count': {0: 7470, 1: 1262, 2: 171, 3: 86}})
pandas.DataFrame.compare is useful for a side-by-side comparison of each column, but it does not work for comparing data frames with different rows:
df1.compare(df2, keep_shape=True, keep_equal=True)
ValueError: Can only compare identically-labeled DataFrame objects
Can we achieve the same functionality using pandas.merge?
I tried the below, but it is NOT giving a side-by-side comparison for each corresponding column:
pd.merge(df1,df2, on=['key_1','key_2'], suffixes=['_df1','_df2'], how='outer')
key_1 key_2 min_df1 count_df1 min_df2 count_df2
0 F F -158.0 58.0 -181.0 171.0
1 H G -881.0 24.0 NaN NaN
2 E E -674.0 13.0 NaN NaN
3 C C NaN NaN -452.0 7470.0
4 L D NaN NaN -153.0 1262.0
5 K K NaN NaN -120.0 86.0
Use concat after converting ['key_1', 'key_2'] to a MultiIndex:
df = (pd.concat([df1.set_index(['key_1', 'key_2']),
                 df2.set_index(['key_1', 'key_2'])], keys=['df1', 'df2'], axis=1)
        .sort_index(level=1, axis=1))
print(df)
df1 df2 df1 df2
count count min min
key_1 key_2
C C NaN 7470.0 NaN -452.0
E E 13.0 NaN -674.0 NaN
F F 58.0 171.0 -158.0 -181.0
H G 24.0 NaN -881.0 NaN
K K NaN 86.0 NaN -120.0
L D NaN 1262.0 NaN -153.0
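Once the frames are aligned like this, the keys present in both inputs can be pulled out, e.g. to inspect the actual differences (a sketch using the concat approach above; the data here is trimmed from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'key_1': ['F', 'H'], 'key_2': ['F', 'G'],
                    'min': [-158, -881], 'count': [58, 24]})
df2 = pd.DataFrame({'key_1': ['F', 'K'], 'key_2': ['F', 'K'],
                    'min': [-181, -120], 'count': [171, 86]})

df = (pd.concat([df1.set_index(['key_1', 'key_2']),
                 df2.set_index(['key_1', 'key_2'])], keys=['df1', 'df2'], axis=1)
        .sort_index(level=1, axis=1))

# keep only keys that exist in both frames (no NaN in any column)
both = df.dropna()
```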
After the merge, you can re-order the columns alphabetically so that corresponding columns sit side by side:
first_columns = ['key_1', 'key_2']
merged_df = pd.merge(df1, df2, on=['key_1', 'key_2'], suffixes=['_df1', '_df2'], how='outer')
merged_df = merged_df[first_columns + sorted(col for col in merged_df.columns if col not in first_columns)]
One way:
merged_df = (pd.merge(df1, df2, on=['key_1', 'key_2'],
                      suffixes=['_df1', '_df2'], how='outer')
             .set_index(['key_1', 'key_2']))
merged_df.columns = merged_df.columns.str.split('_', expand=True)
merged_df = merged_df.sort_index(level=0, axis=1)

Pandas replace value in multiindex row

So, I have a MultiIndex DataFrame and I cannot figure out how to modify a row index value.
In this example, I would like to set c = 1 where the "a" index is 4:
import pandas as pd
import numpy as np
df = pd.DataFrame({('colA', 'x1'): {(1, np.nan, 0): np.nan, (4, np.nan, 0): np.nan},
('colA', 'x2'): {(1, np.nan, 0): np.nan, (4, np.nan, 0): np.nan},
('colA', 'x3'): {(1, np.nan, 0): np.nan, (4, np.nan, 0): np.nan},
('colA', 'x4'): {(1, np.nan, 0): np.nan, (4, np.nan, 0): np.nan}})
df.index.set_names(['a', 'b', 'c'], inplace=True)
print(df)
colA
x1 x2 x3 x4
a b c
1 NaN 0 NaN NaN NaN NaN
4 NaN 0 NaN NaN NaN NaN
Desired output:
colA
x1 x2 x3 x4
a b c
1 NaN 0 NaN NaN NaN NaN
4 NaN 1 NaN NaN NaN NaN
Any help is appreciated.
Assuming we start with df.
x = df.reset_index()
x.loc[x[x.a == 4].index, 'c'] = 1
x = x.set_index(['a', 'b', 'c'])
print(x)
colA
x1 x2 x3 x4
a b c
1 NaN 0 NaN NaN NaN NaN
4 NaN 1 NaN NaN NaN NaN
Solution
Separate the index, process it, and put it back together with the data.
Logic
1. Separate the index and process it as a dataframe
2. Prepare a MultiIndex
3. Either of the following two options:
   - combine data and MultiIndex together (Method 1)
   - update the index of the original dataframe (Method 2)
Code
# separate the index and process it
names = ['a', 'b', 'c'] # same as df.index.names
#dfd = pd.DataFrame(df.to_records())
dfd = df.index.to_frame().reset_index(drop=True)
dfd.loc[dfd['a']==4, ['c']] = 1
# prepare index for original dataframe: df
index = pd.MultiIndex.from_tuples([tuple(x) for x in dfd.loc[:, names].values], names=names)
## Method-1
# create a new dataframe with the updated index
dfn = pd.DataFrame(df.values, index=index, columns=df.columns)
# dfn --> new dataframe

## Method-2
# replace the index of the original dataframe df (set_index returns a copy)
df = df.set_index(index)
Output:
colA
x1 x2 x3 x4
a b c
1.0 NaN 0.0 NaN NaN NaN NaN
4.0 NaN 1.0 NaN NaN NaN NaN
Dummy Data
import pandas as pd
import numpy as np
df = pd.DataFrame({('colA', 'x1'): {(1, np.nan, 0): np.nan, (4, np.nan, 0): np.nan},
('colA', 'x2'): {(1, np.nan, 0): np.nan, (4, np.nan, 0): np.nan},
('colA', 'x3'): {(1, np.nan, 0): np.nan, (4, np.nan, 0): np.nan},
('colA', 'x4'): {(1, np.nan, 0): np.nan, (4, np.nan, 0): np.nan}})
df.index.set_names(['a', 'b', 'c'], inplace=True)
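A more compact variant of the same idea, assuming pandas >= 0.24 for MultiIndex.from_frame: edit the index as a frame, then assign it back.

```python
import numpy as np
import pandas as pd

# a small frame with the same ('a', 'b', 'c') index structure as the question
idx = pd.MultiIndex.from_tuples([(1, np.nan, 0), (4, np.nan, 0)],
                                names=['a', 'b', 'c'])
df = pd.DataFrame({('colA', 'x1'): [np.nan, np.nan]}, index=idx)

frame = df.index.to_frame()
frame.loc[frame['a'] == 4, 'c'] = 1      # set c = 1 where a == 4
df.index = pd.MultiIndex.from_frame(frame)
```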
