Add new column from a list of dicts - python

I have a DataFrame with a column composed of dicts, and I want to extract all the keys and values and turn them into two new columns.
a b
0 1 {'a': 1, 'b': 2}
1 2 {'k': 4, 'v': 6}
2 3 {'z': 3}
The expected output would be:
a k v
0 1 a 1
1 1 b 2
2 2 k 4
3 2 v 6
4 3 z 3

Use a list comprehension to flatten the values into a list of tuples, and pass it to the DataFrame constructor:
L = [(x, k, v) for x, y in df[['a','b']].values for k, v in y.items()]
df = pd.DataFrame(L, columns=['a','k','v'])
print (df)
a k v
0 1 a 1
1 1 b 2
2 2 k 4
3 2 v 6
4 3 z 3
EDIT: For a general solution that works with any unique index, use DataFrame.pop to extract the b column, build a helper frame with the original index values in an idx column, convert that column to the index, and finally use DataFrame.join:
L = [(x, k, v) for x, y in df.pop('b').items() for k, v in y.items()]
df1 = pd.DataFrame(L, columns=['idx','k','v']).set_index('idx').rename_axis(None)
df = df.join(df1).reset_index(drop=True)
print (df)
a k v
0 1 a 1
1 1 b 2
2 2 k 4
3 2 v 6
4 3 z 3
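As a sanity check, the join-based variant also works when the frame has a non-default (but unique) index; a quick sketch with made-up index labels:

```python
import pandas as pd

# same data as above, but with a non-default index to show the alignment
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [{'a': 1, 'b': 2}, {'k': 4, 'v': 6}, {'z': 3}]},
                  index=[10, 20, 30])

# flatten each dict, remembering which index label it came from
L = [(x, k, v) for x, y in df.pop('b').items() for k, v in y.items()]
df1 = pd.DataFrame(L, columns=['idx', 'k', 'v']).set_index('idx').rename_axis(None)
out = df.join(df1).reset_index(drop=True)
print(out)
```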

You can try groupby, apply, rename_axis and reset_index:
>>> (df.groupby('a').apply(lambda x: pd.Series(x.b.iloc[0], name='v'))
...     .rename_axis(['a', 'k']).reset_index())
a k v
0 1 a 1
1 1 b 2
2 2 k 4
3 2 v 6
4 3 z 3

First, I created the original data frame:
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [{'a': 1, 'b': 2},
                         {'k': 4, 'v': 6},
                         {'z': 3}]
                   })
Then, iterate over the rows of the data frame, and over the dictionary in column b:
ts = list()
for row in df.itertuples():
    for key, value in row.b.items():
        t = (row.Index, row.a, key, value)
        ts.append(t)
print(pd.DataFrame(data=ts, columns=['Index', 'a', 'k', 'v']).set_index('Index'))
a k v
Index
0 1 a 1
0 1 b 2
1 2 k 4
1 2 v 6
2 3 z 3

You could expand the dictionary, explode the column and apply the pd.Series function to get your result:
df = pd.DataFrame({"a": [1, 2, 3], "b": [{"a": 1, "b": 2}, {"k": 4, "v": 6}, {"z": 3}]})
divider = df.columns.get_loc("b")
# expand dictionary within the `b` column
df["b"] = [tuple(entry.items()) for entry in df.b]
# merge dataframe before `b`, the exploded `b` column, and the dataframe after `b`
merger = (
    df.iloc[:, :divider],
    df.b.explode().apply(pd.Series).set_axis(["k", "v"], axis=1),
    df.iloc[:, divider + 1:],
)
pd.concat(merger, axis=1)
a k v
0 1 a 1
0 1 b 2
1 2 k 4
1 2 v 6
2 3 z 3

Related

How to update multiple entries of a certain column of a Pandas dataframe in certain order?

Let’s say I have the following Pandas dataframe, where the 'key' column only contains unique strings:
import pandas as pd
df = pd.DataFrame({'key':['b','d','c','a','e','f'], 'value': [0,0,0,0,0,0]})
df
key value
0 b 0
1 d 0
2 c 0
3 a 0
4 e 0
5 f 0
Now I have a list of unique keys and a list of corresponding values:
keys = ['a', 'b', 'c', 'd']
values = [1, 2, 3, 4]
I want to update the 'value' column in the same order as the lists, so that each row has a matched 'key' and 'value' ('a' to 1, 'b' to 2, 'c' to 3, 'd' to 4). I am using the following code, but the dataframe seems to assign the values from top to bottom regardless of the keys, which I don't quite understand:
df.loc[df['key'].isin(keys),'value'] = values
df
key value
0 b 1
1 d 2
2 c 3
3 a 4
4 e 0
5 f 0
To be clear, I am expecting to get
key value
0 b 2
1 d 4
2 c 3
3 a 1
4 e 0
5 f 0
Any suggestions?
Use map:
dd = dict(zip(keys, values))
df['value'] = df['key'].map(dd).fillna(df['value'])
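Run end-to-end on the question's data (the astype(int) at the end is my addition, since fillna after map leaves a float column):

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'd', 'c', 'a', 'e', 'f'],
                   'value': [0, 0, 0, 0, 0, 0]})
keys = ['a', 'b', 'c', 'd']
values = [1, 2, 3, 4]

dd = dict(zip(keys, values))
# map aligns by key, not by position; keys absent from dd become NaN,
# so fillna restores the original value there
df['value'] = df['key'].map(dd).fillna(df['value']).astype(int)
print(df)
```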
keys = ['a', 'b', 'c', 'd']
values = [1, 2, 3, 4]
# form a dictionary from the keys and values lists
d = dict(zip(keys, values))
# update the value where a mapping exists, using loc and map
df.loc[df['key'].map(d).notna(), 'value'] = df['key'].map(d)
df
key value
0 b 2
1 d 4
2 c 3
3 a 1
4 e 0
5 f 0
with a temporary dataframe:
import numpy as np

temp_df = df.set_index('key')
temp_df.loc[keys] = np.array(values).reshape(-1, 1)
df = temp_df.reset_index()
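For reference, a complete run of that approach on the question's data (numpy and pandas imports added):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['b', 'd', 'c', 'a', 'e', 'f'],
                   'value': [0, 0, 0, 0, 0, 0]})
keys = ['a', 'b', 'c', 'd']
values = [1, 2, 3, 4]

# index by key so .loc can assign label-wise, in the order of `keys`
temp_df = df.set_index('key')
temp_df.loc[keys] = np.array(values).reshape(-1, 1)
df = temp_df.reset_index()
print(df)
```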

Pandas - appending dictionary to existing row

For each row, I am computing values and storing them in a dictionary. I want to be able to take the dictionary and add it to the row where the keys are columns.
For example:
Dataframe
A B C
1 2 3
Dictionary:
{
    'D': 4,
    'E': 5
}
Result:
A B C D E
1 2 3 4 5
There will be more than one row in the dataframe, and for each row I'm computing a dictionary that might not necessarily have the same exact keys.
I ended up doing this to get it to work:
applied_df = df.apply(lambda row: func(row['a']), axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')

def func(a):
    ...
    return pd.Series(dictionary)
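A self-contained sketch of that pattern, with a toy func standing in for the real per-row computation (in practice the returned keys may differ from row to row, and missing keys become NaN after the expand):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 11], 'B': [2, 12], 'C': [3, 13]})

def func(a):
    # toy computation standing in for the real per-row logic
    return pd.Series({'D': a * 2, 'E': a * 3})

# each returned Series becomes a row; its index entries become new columns
applied_df = df.apply(lambda row: func(row['A']), axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')
print(df)
```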
If you want the dict values to appear in each row of the original dataframe, use:
d = {
    'D': 4,
    'E': 5
}
df_result = df.join(df.apply(lambda x: pd.Series(d), axis=1))
Demo
Data Input:
df
A B C
0 1 2 3
1 11 12 13
Output:
df_result = df.join(df.apply(lambda x: pd.Series(d), axis=1))
A B C D E
0 1 2 3 4 5
1 11 12 13 4 5
If you just want the dict to appear in the first row of the original dataframe, use:
d = {
    'D': 4,
    'E': 5
}
df_result = df.join(pd.Series(d).to_frame().T)
A B C D E
0 1 2 3 4.0 5.0
1 11 12 13 NaN NaN
Simply use a for loop over your dictionary and assign the values.
df = pd.DataFrame(columns=['A', 'B', 'C'], data=[[1,2,3]])
# You can test with df = pd.DataFrame(columns=['A', 'B', 'C'], data=[[1,2,3], [8,0,33]]), too.
d = {
    'D': 4,
    'E': 5
}
for k, v in d.items():
    df[k] = v
print(df)
Output:
   A  B  C  D  E
0  1  2  3  4  5

creating a pandas dataframe from dictionary of lists

I have a dictionary of connected components of a graph, for example:
d = {'A':[1,5,7],'B':[2,4], 'C':[3,6]}
and I want to create a dataframe of the form:
cc node
0 A 1
1 A 5
2 A 7
3 B 2
4 B 4
5 C 3
6 C 6
I can do this by creating a DataFrame for each key in the dictionary and then concatenating them, but I am looking for a better solution.
Check with explode
pd.Series(d).explode()
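Spelled out, the one-liner above can be turned into the requested two-column frame by naming the index and resetting it:

```python
import pandas as pd

d = {'A': [1, 5, 7], 'B': [2, 4], 'C': [3, 6]}

# explode gives each list element its own row, keeping the key as the index;
# rename_axis/reset_index then turn that index into the `cc` column
df = pd.Series(d).explode().rename_axis('cc').reset_index(name='node')
print(df)
```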
d = {'A':[1,5,7],'B':[2,4], 'C':[3,6]}
df = []
for k, v in d.items():
    for i in v:
        df.append([k, i])
for l in df:
    print(l)
['A', 1]
['A', 5]
['A', 7]
['B', 2]
['B', 4]
['C', 3]
['C', 6]
Maybe something like this:
import pandas as pd
d = {'A':[1,5,7],'B':[2,4], 'C':[3,6]}
temp = [[key, n] for key, val in d.items() for n in val]
df = pd.DataFrame(temp, columns=['cc', 'node'])
print(df)
Output:
cc node
0 A 1
1 A 5
2 A 7
3 B 2
4 B 4
5 C 3
6 C 6
IIUC,
cc, node = [], []
for key, value in d.items():
    cc += [key] * len(value)
    node += value
df = pd.DataFrame({'cc': cc, 'node': node})
print(df)
Output
cc node
0 A 1
1 A 5
2 A 7
3 B 2
4 B 4
5 C 3
6 C 6

Pandas long format DataFrame from multiple lists of different length

Consider I have multiple lists
A = [1, 2, 3]
B = [1, 4]
and I want to generate a Pandas DataFrame in long format as follows:
type | value
------------
A | 1
A | 2
A | 3
B | 1
B | 4
What is the easiest way to achieve this? Going through the wide format and melt does not seem possible(?) because the lists may have different lengths.
Create a dictionary for the types and build a list of tuples with a list comprehension:
A = [1, 2, 3]
B = [1, 4]
d = {'A':A,'B':B}
print ([(k, y) for k, v in d.items() for y in v])
[('A', 1), ('A', 2), ('A', 3), ('B', 1), ('B', 4)]
df = pd.DataFrame([(k, y) for k, v in d.items() for y in v], columns=['type','value'])
print (df)
type value
0 A 1
1 A 2
2 A 3
3 B 1
4 B 4
Another solution, if input is list of lists and types should be integers:
L = [A,B]
df = pd.DataFrame([(k, y) for k, v in enumerate(L) for y in v], columns=['type','value'])
print (df)
type value
0 0 1
1 0 2
2 0 3
3 1 1
4 1 4
Here's a NumPy-based solution using a dictionary input:
import numpy as np

d = {'A': [1, 2, 3],
     'B': [1, 4]}
keys, values = zip(*d.items())
res = pd.DataFrame({'type': np.repeat(keys, list(map(len, values))),
                    'value': np.concatenate(values)})
print(res)
type value
0 A 1
1 A 2
2 A 3
3 B 1
4 B 4
This borrows the gather idea from dplyr and tidyr, third-party libraries of the R programming language. The following code is just a demo, so I created two frames, df1 and df2; you can create the frames dynamically and concat them:
import pandas as pd
def gather(df, key, value, cols):
    id_vars = [col for col in df.columns if col not in cols]
    id_values = cols
    var_name = key
    value_name = value
    return pd.melt(df, id_vars, id_values, var_name, value_name)
df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'B': [1, 4]})
df_messy = pd.concat([df1, df2], axis=1)
print(df_messy)
df_tidy = gather(df_messy, 'type', 'value', df_messy.columns).dropna()
print(df_tidy)
And you get this output for df_messy:
A B
0 1 1.0
1 2 4.0
2 3 NaN
Output for df_tidy:
type value
0 A 1.0
1 A 2.0
2 A 3.0
3 B 1.0
4 B 4.0
PS: Remember to convert the values from float back to int type; I wrote this down quickly for a demo and didn't pay much attention to the details.
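Once the NaN rows are dropped, that conversion is a single astype call; a sketch with the df_tidy values inlined:

```python
import pandas as pd

# the tidy result from above, after dropna has removed the padding rows
df_tidy = pd.DataFrame({'type': ['A', 'A', 'A', 'B', 'B'],
                        'value': [1.0, 2.0, 3.0, 1.0, 4.0]})
# cast back to int now that no NaN remains in the column
df_tidy['value'] = df_tidy['value'].astype(int)
print(df_tidy)
```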

DataFrame from dictionary with nested lists

I have a python dictionary with nested lists, that I would like to turn into a pandas DataFrame
a = {'A': [1,2,3], 'B':['a','b','c'],'C':[[1,2],[3,4],[5,6]]}
I would like the final DataFrame to look like this:
> A B C
> 1 a 1
> 1 a 2
> 2 b 3
> 2 b 4
> 3 c 5
> 3 c 6
When I use the DataFrame command it looks like this:
pd.DataFrame(a)
> A B C
>0 1 a [1, 2]
>1 2 b [3, 4]
>2 3 c [5, 6]
Is there any way I can make the data long by the elements of C?
This is what I came up with:
In [53]: df
Out[53]:
A B C
0 1 a [1, 2]
1 2 b [3, 4]
2 3 c [5, 6]
In [58]: s = df.C.apply(pd.Series).unstack().reset_index(level=0, drop=True)
In [59]: s.name = 'C2'
In [61]: df.drop('C', axis = 1).join(s)
Out[61]:
A B C2
0 1 a 1
0 1 a 2
1 2 b 3
1 2 b 4
2 3 c 5
2 3 c 6
apply(pd.Series) gives me a DataFrame with two columns. To join them into one column while keeping the original index, I use unstack. reset_index removes the first level of the index, which basically holds the position of the value in the original list in C. Then I join it back into the df.
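Note that newer pandas (0.25+) can do the same in one step with DataFrame.explode:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['a', 'b', 'c'],
                   'C': [[1, 2], [3, 4], [5, 6]]})

# explode repeats the other columns for every element of the list column
out = df.explode('C').reset_index(drop=True)
print(out)
```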
Yes, one way is to deal with your dictionary first (I assume your dictionary values contain either plain lists of values or lists of nested lists, but not a mix of both).
Step by step:
from functools import reduce

def f(x, y): return x + y
res = {k: reduce(f, v) if any(isinstance(i, list) for i in v) else v for k, v in a.items()}
will give you:
{'A': [1, 2, 3], 'C': [1, 2, 3, 4, 5, 6], 'B': ['a', 'b', 'c']}
Now you need to extend the lists in your dictionary:
m = max(len(v) for v in res.values())
res1 = {k: reduce(f, [(m // len(v)) * [i] for i in v]) for k, v in res.items()}
And finally:
pd.DataFrame(res1)
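Putting the steps together in one runnable block (under Python 3, reduce must be imported from functools and the repeat count needs integer division):

```python
import pandas as pd
from functools import reduce

a = {'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [[1, 2], [3, 4], [5, 6]]}

def f(x, y):
    return x + y

# flatten the nested lists, leave plain lists alone
res = {k: reduce(f, v) if any(isinstance(i, list) for i in v) else v
       for k, v in a.items()}
# repeat each short list's elements up to the common length
m = max(len(v) for v in res.values())
res1 = {k: reduce(f, [(m // len(v)) * [i] for i in v]) for k, v in res.items()}
df = pd.DataFrame(res1)
print(df)
```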
