Related
I have the following dataframe (sample):
import pandas as pd
min_id = 1
max_id = 10
data = [['A', 2], ['A', 3], ['A', 1], ['A', 4], ['A', 4], ['A', 2],
['B', 4], ['B', 5], ['B', 7], ['B', 4], ['B', 2],
['C', 1], ['C', 3], ['C', 2], ['C', 1], ['C', 5], ['C', 2] ,['C', 1],
['D', 1], ['D', 1], ['D', 1], ['D', 1]]
df = pd.DataFrame(data = data, columns = ['group', 'val'])
group val
0 A 2
1 A 3
2 A 1
3 A 4
4 A 4
5 A 2
6 B 4
7 B 5
8 B 7
9 B 4
10 B 2
11 C 1
12 C 3
13 C 2
14 C 1
15 C 5
16 C 2
17 C 1
18 D 1
19 D 1
20 D 1
21 D 1
I would like to create a column called "id" which shows the id with a min value of 1 (min_id) and a max value of 10 (max_id) per group. So the values between min and max depend on the number of rows per group. Here you can see the desired output:
data = [['A', 2, 1], ['A', 3, 2.8], ['A', 1, 4.6], ['A', 4, 6.4], ['A', 4, 8.2], ['A', 2, 10],
['B', 4, 1], ['B', 5, 3.25], ['B', 7, 5.5], ['B', 4, 7.75], ['B', 2, 10],
['C', 1, 1], ['C', 3, 2.5], ['C', 2, 4], ['C', 1, 5.5], ['C', 5, 7], ['C', 2, 8.5] ,['C', 1, 10],
['D', 1, 1], ['D', 1, 4], ['D', 1, 7], ['D', 1, 10]]
df_desired = pd.DataFrame(data = data, columns = ['group', 'val', 'id'])
group val id
0 A 2 1.00
1 A 3 2.80
2 A 1 4.60
3 A 4 6.40
4 A 4 8.20
5 A 2 10.00
6 B 4 1.00
7 B 5 3.25
8 B 7 5.50
9 B 4 7.75
10 B 2 10.00
11 C 1 1.00
12 C 3 2.50
13 C 2 4.00
14 C 1 5.50
15 C 5 7.00
16 C 2 8.50
17 C 1 10.00
18 D 1 1.00
19 D 1 4.00
20 D 1 7.00
21 D 1 10.00
So I was wondering if anyone knows how to automatically create the column "id" using pandas? Please note that the number of rows could be way more then in the sample dataframe.
Use GroupBy.transform with numpy.linspace:
df['ID']=df.groupby('group')['group'].transform(lambda x: np.linspace(min_id,max_id,len(x)))
print (df)
group val ID
0 A 2 1.00
1 A 3 2.80
2 A 1 4.60
3 A 4 6.40
4 A 4 8.20
5 A 2 10.00
6 B 4 1.00
7 B 5 3.25
8 B 7 5.50
9 B 4 7.75
10 B 2 10.00
11 C 1 1.00
12 C 3 2.50
13 C 2 4.00
14 C 1 5.50
15 C 5 7.00
16 C 2 8.50
17 C 1 10.00
18 D 1 1.00
19 D 1 4.00
20 D 1 7.00
21 D 1 10.00
I have one initial dataframe df1:
df1 = pd.DataFrame(np.array([[1, 'B', 'C', 'D', 'E'], [2, 'B', 'C', 'D', 'E'], [3, 'B', 'C', 'D', 'E'], [4, 'B', 'C', 'D', 'E'], [5, 'B', 'C', 'D', 'E']]), columns=['a', 'b', 'c', 'd', 'e'])
a b c d e
0 1 B C D E
1 2 B C D E
2 3 B C D E
3 4 B C D E
4 5 B C D E
Then I compute some new parameters based on df1 column values, create a new df2 and merge with df1 on column name "a".
df2 = pd.DataFrame(np.array([[1, 'F', 'G'], [2, 'F', 'G']]), columns=['a', 'f', 'g'])
a f g
0 1 F G
1 2 F G
df1 = pd.merge(df1, df2, how='left', left_on=['a'], right_on = ['a'])
a b c d e f g
0 1 B C D E F G
1 2 B C D E F G
2 3 B C D E NaN NaN
3 4 B C D E NaN NaN
4 5 B C D E NaN NaN
This works perfectly fine, but in another loop event, I create a df3 with same columns as df2 but merge in this case does not work, it doesn't take into account that the same columns are already in df1.
IMPORTANT REMARK: This is for illustration purpose only, there are thousands of new dataframes to be added, one per loop step.
df3 = pd.DataFrame(np.array([[3, 'F', 'G']]), columns=['a', 'f', 'g'])
a f g
0 3 F G
df1 = pd.merge(df1, df3, how='left', left_on=['a'], right_on = ['a'])
a b c d e f_x g_x f_y g_y
0 1 B C D E F G NaN NaN
1 2 B C D E F G NaN NaN
2 3 B C D E NaN NaN F G
3 4 B C D E NaN NaN NaN NaN
4 5 B C D E NaN NaN NaN NaN
I just one to fill missing gaps using the already existing columns. This approach creates new columns (f_x, g_x, f_y, g_y).
Append and contact also does not work as they repeats information (repeated rows on "a").
Any advice on how to solve this? Final result after merging df1 with df2, and after with df3 should be:
a b c d e f g
0 1 B C D E F G
1 2 B C D E F G
2 3 B C D E F G
3 4 B C D E NaN NaN
4 5 B C D E NaN NaN
Eventually all the columns will be filled during the loop, so the first added (df2) will add new columns, and from df3 onwards just new data to fill all NaN. The loop looks like this:
df1 = pd.DataFrame(np.array([[1, 'B', 'C', 'D', 'E'], [2, 'B', 'C', 'D', 'E'], [3, 'B', 'C', 'D', 'E'], [4, 'B', 'C', 'D', 'E'], [5, 'B', 'C', 'D', 'E']]), columns=['a', 'b', 'c', 'd', 'e'])
for num, item in enumerate(df1['a']):
#compute df[num] (based on values on df1)
df1 = pd.merge(df1, df[num], how='left', left_on=['a'], right_on = ['a'])
One possible solution is concat all small DataFrames and then only once merge:
df4 = pd.concat([df2, df3])
print (df4)
a f g
0 1 F G
1 2 F G
0 3 F G
df1 = pd.merge(df1, df4, how='left', on = 'a')
print (df1)
a b c d e f g
0 1 B C D E F G
1 2 B C D E F G
2 3 B C D E F G
3 4 B C D E NaN NaN
4 5 B C D E NaN NaN
Another possible solution is use DataFrame.combine_first with DataFrame.set_index:
df1 = (df1.set_index('a')
.combine_first(df2.set_index('a'))
.combine_first(df3.set_index('a')))
print (df1)
b c d e f g
a
1 B C D E F G
2 B C D E F G
3 B C D E F G
4 B C D E NaN NaN
5 B C D E NaN NaN
Another way is too use fillna then drop the extra columns you dont need anymore:
# Fill NaN with the extra columns value
df1.f_x.fillna(df1.f_y, inplace=True)
df1.g_x.fillna(df1.g_y, inplace=True)
a b c d e f_x g_x f_y g_y
0 1 B C D E F G NaN NaN
1 2 B C D E F G NaN NaN
2 3 B C D E F G F G
3 4 B C D E NaN NaN NaN NaN
4 5 B C D E NaN NaN NaN NaN
# Slice of the last two columns
df1 = df1.iloc[:, :-2]
# Rename the columns correctly
df1.columns = df1.columns.str.replace('_x', '')
Output
a b c d e f g
0 1 B C D E F G
1 2 B C D E F G
2 3 B C D E F G
3 4 B C D E NaN NaN
4 5 B C D E NaN NaN
I would just use a subset of df1 in the merge with df3, or alternatively I would keep a copy of the original df1.
subset:
df1.fillna(pd.merge(df1.loc(1)['a':'e'], df3, how='left',
left_on=['a'], right_on = ['a']),
inplace=True)
copy of original data
df1_orig = df1 # before merging with df2
...
df1.fillna(pd.merge(df1_orig, df3, how='left',
left_on=['a'], right_on = ['a']),
inplace=True)
Why does Example 1 give back NaN, while Example 2 doesn't?
Example 1:
data=DataFrame(np.arange(0,16).reshape(4,4),
index=[list('abcd')],
columns=[list('retz')])
data[data['t'] > 5]
r e t z
a NaN NaN NaN NaN
b NaN NaN 6.0 NaN
c NaN NaN 10.0 NaN
d NaN NaN 14.0 NaN
Example2:
data2 = DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data2[data2['three'] > 5]
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
Your first dataframe has a multiindex
data.axes
> [MultiIndex(levels=[['a', 'b', 'c', 'd']],
labels=[[0, 1, 2, 3]]), MultiIndex(levels=[['e', 'r', 't', 'z']],
labels=[[1, 0, 2, 3]])]
Whereas your second doesn't:
data2.axes
> [Index(['Ohio', 'Colorado', 'Utah', 'New York'], dtype='object'),
Index(['one', 'two', 'three', 'four'], dtype='object')]
It's because you've wrapped list('retz') in another list, so it's interpreted as [['e', 'r', 't', 'z']]. If you want to have just a single index, you would just get rid of the brackets.
data=DataFrame(np.arange(0,16).reshape(4,4),
index=list('abcd'),
columns=list('retz'))
data[data['t'] > 5]
> r e t z
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
I tried to answer this question by a group-level merging. The below is a slightly modified version of the same question, but I need the output by a group-level merging.
Here are the input dataframes:
df = pd.DataFrame({ "group":[1,1,1 ,2,2],
"cat": ['a', 'b', 'c', 'a', 'c'] ,
"value": range(5),
"value2": np.array(range(5))* 2})
df
cat group value value2
a 1 0 0
b 1 1 2
c 1 2 4
a 2 3 6
c 2 4 8
categories = ['a', 'b', 'c', 'd']
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
print(categories)
cat
0 a
1 b
2 c
3 d
Here's the expected output:
cat group value value2
a 1 0 0
b 1 1 2
c 1 2 4
d NA NA NA
a 2 3 6
c 2 4 8
b NA NA NA
d NA NA NA
Question:
I can achieve what I want by a for loop. Is there a pandas way to do that though?
(I need to perform an outer join between categories and each group of the groupby result of df.groupby('group'))
grouped = df.groupby('group')
merged_list = []
for g in grouped:
merged = pd.merge(categories, g[1], how = 'outer', on='cat')
merged_list.append(merged)
out = pd.concat(merged_list)
I think groupby + merge here is only overcomplicated way for this.
So faster is use reindex by MultiIndex:
mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
df = df.set_index(['group','cat']).reindex(mux).swaplevel(0,1).reset_index()
#add missing values to group column
df['group'] = df['group'].mask(df['value'].isnull())
print (df)
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
4 a 2.0 3.0 6.0
5 b NaN NaN NaN
6 c 2.0 4.0 8.0
7 d NaN NaN NaN
Possible solution:
df = df.groupby('group', group_keys=False)
.apply(lambda x: pd.merge(categories, x, how = 'outer', on='cat'))
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
0 a 2.0 3.0 6.0
1 b NaN NaN NaN
2 c 2.0 4.0 8.0
3 d NaN NaN NaN
Timings:
np.random.seed(123)
N = 1000000
L = list('abcd') #235,94.1,156ms
df = pd.DataFrame({'cat': np.random.choice(L, N, p=(0.002,0.002,0.005, 0.991)),
'group':np.random.randint(10000,size=N),
'value':np.random.randint(1000,size=N),
'value2':np.random.randint(5000,size=N)})
df = df.sort_values(['group','cat']).drop_duplicates(['group','cat']).reset_index(drop=True)
print (df.head(10))
categories = ['a', 'b', 'c', 'd']
def jez1(df):
mux = pd.MultiIndex.from_product([df['group'].unique(), categories], names=('group','cat'))
df = df.set_index(['group','cat']).reindex(mux, fill_value=0).swaplevel(0,1).reset_index()
df['group'] = df['group'].mask(df['value'].isnull())
return df
def jez2(df):
grouped = df.groupby('group')
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
return grouped.apply(lambda x: pd.merge(categories, x, how = 'outer', on='cat'))
def coldspeed(df):
grouped = df.groupby('group')
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
return pd.concat([g[1].merge(categories, how='outer', on='cat') for g in grouped])
def akilat90(df):
grouped = df.groupby('group')
categories = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['cat'])
merged_list = []
for g in grouped:
merged = pd.merge(categories, g[1], how = 'outer', on='cat')
merged['group'].fillna(merged['group'].mode()[0],inplace=True) # replace the `group` column's `NA`s by mode
merged.fillna(0, inplace=True)
merged_list.append(merged)
return pd.concat(merged_list)
In [471]: %timeit jez1(df)
100 loops, best of 3: 12 ms per loop
In [472]: %timeit jez2(df)
1 loop, best of 3: 14.5 s per loop
In [473]: %timeit coldspeed(df)
1 loop, best of 3: 19.4 s per loop
In [474]: %timeit akilat90(df)
1 loop, best of 3: 22.3 s per loop
To actually answer your question, no - you can only merge 2 dataframes at a time (I'm not aware of multi-way merges in pandas). You cannot avoid the loop, but you certainly can make your code a little neater.
pd.concat([g[1].merge(categories, how='outer', on='cat') for g in grouped])
cat group value value2
0 a 1.0 0.0 0.0
1 b 1.0 1.0 2.0
2 c 1.0 2.0 4.0
3 d NaN NaN NaN
0 a 2.0 3.0 6.0
1 c 2.0 4.0 8.0
2 b NaN NaN NaN
3 d NaN NaN NaN
I have a (example-) dataframe with 4 columns:
data = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
'D': [np.nan, np.nan, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
A B C D
0 a 42.0 NaN NaN
1 b 52.0 NaN NaN
2 c NaN 31.0 NaN
3 d NaN 2.0 NaN
4 e NaN NaN 62.0
5 f NaN NaN 70.0
I would now like to merge/combine columns B, C, and D to a new column E like in this example:
data2 = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'E': [42, 52, 31, 2, 62, 70]}
df2 = pd.DataFrame(data2, columns = ['A', 'E'])
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
I found a quite similar question here but this adds the merged colums B, C, and D at the end of column A:
0 a
1 b
2 c
3 d
4 e
5 f
6 42
7 52
8 31
9 2
10 62
11 70
dtype: object
Thanks for help.
Option 1
Using assign and drop
In [644]: cols = ['B', 'C', 'D']
In [645]: df.assign(E=df[cols].sum(1)).drop(cols, 1)
Out[645]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
Option 2
Using assignment and drop
In [648]: df['E'] = df[cols].sum(1)
In [649]: df = df.drop(cols, 1)
In [650]: df
Out[650]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
Option 3 Lately, I like the 3rd option.
Using groupby
In [660]: df.groupby(np.where(df.columns == 'A', 'A', 'E'), axis=1).first() #or sum max min
Out[660]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
In [661]: df.columns == 'A'
Out[661]: array([ True, False, False, False], dtype=bool)
In [662]: np.where(df.columns == 'A', 'A', 'E')
Out[662]:
array(['A', 'E', 'E', 'E'],
dtype='|S1')
The question as written asks for merge/combine as opposed to sum, so posting this to help folks who find this answer looking for help on coalescing with combine_first, which can be a bit tricky.
df2 = pd.concat([df["A"],
df["B"].combine_first(df["C"]).combine_first(df["D"])],
axis=1)
df2.rename(columns={"B":"E"}, inplace=True)
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
What's so tricky about that? in this case there's no problem - but let's say you were pulling the B, C and D values from different dataframes, in which the a,b,c,d,e,f labels were present, but not necessarily in the same order. combine_first() aligns on the index, so you'd need to tack a set_index() on to each of your df references.
df2 = pd.concat([df.set_index("A", drop=False)["A"],
df.set_index("A")["B"]\
.combine_first(df.set_index("A")["C"])\
.combine_first(df.set_index("A")["D"]).astype(int)],
axis=1).reset_index(drop=True)
df2.rename(columns={"B":"E"}, inplace=True)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
Use difference for columns names without A and then get sum or max:
cols = df.columns.difference(['A'])
df['E'] = df[cols].sum(axis=1).astype(int)
# df['E'] = df[cols].max(axis=1).astype(int)
df = df.drop(cols, axis=1)
print (df)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
If multiple values per rows:
data = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
'D': [10, np.nan, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
print (df)
A B C D
0 a 42.0 NaN 10.0
1 b 52.0 NaN NaN
2 c NaN 31.0 NaN
3 d NaN 2.0 NaN
4 e NaN NaN 62.0
5 f NaN NaN 70.0
cols = df.columns.difference(['A'])
df['E'] = df[cols].apply(lambda x: ', '.join(x.dropna().astype(int).astype(str)), 1)
df = df.drop(cols, axis=1)
print (df)
A E
0 a 42, 10
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
You can also use ffill with iloc:
df['E'] = df.iloc[:, 1:].ffill(1).iloc[:, -1].astype(int)
df = df.iloc[:, [0, -1]]
print(df)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
Zero's third option using groupby requires a numpy import and only handles one column outside the set of columns to collapse, while jpp's answer using ffill requires you know how columns are ordered. Here's a solution that has no extra dependencies, takes an arbitrary input dataframe, and only collapses columns if all rows in those columns are single-valued:
import pandas as pd
data = [{'A':'a', 'B':42, 'messy':'z'},
{'A':'b', 'B':52, 'messy':'y'},
{'A':'c', 'C':31},
{'A':'d', 'C':2, 'messy':'w'},
{'A':'e', 'D':62, 'messy':'v'},
{'A':'f', 'D':70, 'messy':['z']}]
df = pd.DataFrame(data)
cols = ['B', 'C', 'D']
new_col = 'E'
if df[cols].apply(lambda x: len(x.notna().value_counts()) == 1, axis=1).all():
df[new_col] = df[cols].ffill(axis=1).dropna(axis=1)
df2 = df.drop(columns=cols)
print(df, '\n\n', df2)
Output:
A B messy C D
0 a 42.0 z NaN NaN
1 b 52.0 y NaN NaN
2 c NaN NaN 31.0 NaN
3 d NaN w 2.0 NaN
4 e NaN v NaN 62.0
5 f NaN [z] NaN 70.0
A messy E
0 a z 42.0
1 b y 52.0
2 c NaN 31.0
3 d w 2.0
4 e v 62.0
5 f [z] 70.0