Combine duplicated columns within a DataFrame - python

If I have a dataframe whose columns include repeated names, is there a way to combine the columns that share a name using some function (e.g. sum)?
For instance with:
In [186]:
df["NY-WEB01"].head()
Out[186]:
NY-WEB01 NY-WEB01
DateTime
2012-10-18 16:00:00 5.6 2.8
2012-10-18 17:00:00 18.6 12.0
2012-10-18 18:00:00 18.4 12.0
2012-10-18 19:00:00 18.2 12.0
2012-10-18 20:00:00 19.2 12.0
How might I collapse the NY-WEB01 columns (there are a bunch of duplicate columns, not just NY-WEB01) by summing each row where the column name is the same?

I believe this does what you are after:
df.groupby(lambda x:x, axis=1).sum()
Alternatively, between 3% and 15% faster depending on the length of the df:
df.groupby(df.columns, axis=1).sum()
EDIT: To extend this beyond sums, use .agg() (short for .aggregate()):
df.groupby(df.columns, axis=1).agg(numpy.max)
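A note for newer pandas: as far as I know, groupby(..., axis=1) emits a deprecation warning from pandas 2.1 onward. A minimal sketch of a version-independent alternative is to transpose, group on the index, and transpose back. The small frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical frame with duplicated column names (values are invented).
df = pd.DataFrame(
    [[5, 2, 1], [18, 12, 2]],
    columns=["NY-WEB01", "NY-WEB01", "NY-WEB02"],
)

# Transpose so the duplicated names become index labels, group on them,
# aggregate, then transpose back to the original orientation.
summed = df.T.groupby(level=0).sum().T
maxed = df.T.groupby(level=0).max().T

print(summed)
print(maxed)
```

The double transpose copies the data, so on very wide or very large frames this may be slower than the axis=1 form.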

pandas >= 0.20: df.groupby(level=0, axis=1)
You don't need a lambda here, nor do you explicitly have to query df.columns; groupby accepts a level argument you can specify in conjunction with the axis argument. This is cleaner, IMO.
# Setup
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
df
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
df.groupby(level=0, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
Handling MultiIndex columns
Another case to consider is when dealing with MultiIndex columns. Consider
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
df
one two
A A B B B
0 44 47 0 3 3
1 39 9 19 21 36
2 23 6 24 24 12
3 1 38 39 23 46
4 24 17 37 25 13
To perform aggregation across the upper levels, use
df.groupby(level=1, axis=1).sum()
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
or, if aggregating per upper level only, use
df.groupby(level=[0, 1], axis=1).sum()
one two
A B B
0 91 0 6
1 48 19 57
2 29 24 36
3 39 39 69
4 41 37 38
Alternate Interpretation: Dropping Duplicate Columns
If you came here looking to find out how to simply drop duplicate columns (without performing any aggregation), use Index.duplicated:
df.loc[:,~df.columns.duplicated()]
A B
0 44 0
1 39 19
2 23 24
3 1 39
4 24 37
Or, to keep the last ones, specify keep='last' (default is 'first'),
df.loc[:,~df.columns.duplicated(keep='last')]
A B
0 47 3
1 9 36
2 6 12
3 38 46
4 17 13
The groupby alternatives for the two solutions above are df.groupby(level=0, axis=1).first(), and ... .last(), respectively.

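Here is a possibly simpler solution for common aggregation functions such as sum, mean, median, max, min, and std: pass axis=1 (to work across columns) together with level. Note that the level argument of these reductions was deprecated in pandas 1.3 and removed in pandas 2.0, so this only works on older versions:
# coldspeed's sample data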
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
print (df)
print (df.sum(axis=1, level=0))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
print (df.sum(axis=1, level=1))
A B
0 91 6
1 48 76
2 29 60
3 39 108
4 41 75
print (df.sum(axis=1, level=[0,1]))
one two
A B B
0 91 0 6
1 48 19 57
2 29 24 36
3 39 39 69
4 41 37 38
It works similarly for the index; just use axis=0 instead of axis=1:
np.random.seed(0)
df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('ABCDE'), index=list('aabbc'))
print (df)
A B C D E
a 44 47 0 3 3
a 39 9 19 21 36
b 23 6 24 24 12
b 1 38 39 23 46
c 24 17 37 25 13
print (df.min(axis=0, level=0))
A B C D E
a 39 9 0 3 3
b 1 6 24 23 12
c 24 17 37 25 13
df.index = pd.MultiIndex.from_arrays([['bar']*3 + ['foo']*2, df.index])
print (df.mean(axis=0, level=1))
A B C D E
a 41.5 28.0 9.5 12.0 19.5
b 12.0 22.0 31.5 23.5 29.0
c 24.0 17.0 37.0 25.0 13.0
print (df.max(axis=0, level=[0,1]))
A B C D E
bar a 44 47 19 21 36
b 23 6 24 24 12
foo b 1 38 39 23 46
c 24 17 37 25 13
If you need other functions such as first, last, size, or count, use coldspeed's answer instead.
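On pandas versions where reductions no longer accept level, the row-wise examples above can be expressed with groupby(level=...) instead. A minimal sketch, reusing the values printed above and assuming nothing beyond plain pandas:

```python
import pandas as pd

# Same values as the index example above, typed in by hand.
df = pd.DataFrame(
    [[44, 47, 0, 3, 3],
     [39, 9, 19, 21, 36],
     [23, 6, 24, 24, 12],
     [1, 38, 39, 23, 46],
     [24, 17, 37, 25, 13]],
    columns=list("ABCDE"),
    index=list("aabbc"),
)

# Counterpart of df.min(axis=0, level=0): group rows by their index label.
row_min = df.groupby(level=0).min()
print(row_min)
```

The same pattern should carry over to the other reductions (sum, mean, max, and so on) by swapping the final method call.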

Related

Pandas drop multiple in range value using isin

Given a df
a
0 1
1 2
2 1
3 7
4 10
5 11
6 21
7 22
8 26
9 51
10 56
11 83
12 82
13 85
14 90
I would like to drop rows if the value in column a is not within one of these ranges:
(10-15), (25-30), (50-55), (80-85). These ranges are built from lbot and ltop:
lbot =[10, 25, 50, 80]
ltop=[15, 30, 55, 85]
I think this can be achieved with pandas isin:
df[df['a'].isin(list(zip(lbot,ltop)))]
But it returns an empty df instead.
The expected output is
a
10
11
26
51
83
82
85
You can use numpy broadcasting to create a boolean mask that is True for each row whose value falls within any of the ranges, and filter df with it:
out = df[((df[['a']].to_numpy() >=lbot) & (df[['a']].to_numpy() <=ltop)).any(axis=1)]
Output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
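For anyone who wants to run the broadcasting idea end to end, here is a self-contained sketch using the data from the question. The intermediate names vals and mask are mine, not from the answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 7, 10, 11, 21, 22, 26, 51, 56, 83, 82, 85, 90]})
lbot = [10, 25, 50, 80]
ltop = [15, 30, 55, 85]

# Column vector of shape (15, 1) broadcast against bounds of shape (4,)
# gives a (15, 4) boolean table: one column per range.
vals = df["a"].to_numpy()[:, None]
mask = ((vals >= np.array(lbot)) & (vals <= np.array(ltop))).any(axis=1)

out = df[mask]
print(out)
```

The intermediate table has one cell per (row, range) pair, so memory grows with the number of ranges.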
Build a flattened list of every value in the ranges with a list comprehension and pass it to isin (this assumes integer values):
df = df[df['a'].isin([z for x, y in zip(lbot,ltop) for z in range(x, y+1)])]
print (df)
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
Or use np.concatenate to flatten the list of ranges:
df = df[df['a'].isin(np.concatenate([range(x, y+1) for x, y in zip(lbot,ltop)]))]
A method that uses between():
df[pd.concat([df['a'].between(x, y) for x,y in zip(lbot, ltop)], axis=1).any(axis=1)]
output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
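One difference between the isin and between approaches is worth seeing on a tiny example: enumerating range(x, y+1) only covers whole numbers, while between() compares against the bounds. The Series below is made up to include a non-integer value:

```python
import pandas as pd

lbot = [10, 25, 50, 80]
ltop = [15, 30, 55, 85]
s = pd.Series([10, 12.5, 21, 26, 85.0])  # invented values, one non-integer

# Enumerating integers only matches whole numbers inside each range.
allowed = [z for lo, hi in zip(lbot, ltop) for z in range(lo, hi + 1)]
mask_isin = s.isin(allowed)

# between() checks lo <= value <= hi, so 12.5 is kept as well.
mask_between = pd.concat(
    [s.between(lo, hi) for lo, hi in zip(lbot, ltop)], axis=1
).any(axis=1)

print(mask_isin.tolist())
print(mask_between.tolist())
```

For the integer column in the question both masks agree. The choice only matters if the column can hold floats.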
If the values in the two lists are sorted, a method that doesn't require any loop is to use pandas.cut and check that you obtain the same group when cutting on the two lists:
# group based on lower bound
id1 = pd.cut(df['a'], bins=lbot+[float('inf')], labels=range(len(lbot)),
right=False) # include lower bound
# group based on upper bound
id2 = pd.cut(df['a'], bins=[0]+ltop, labels=range(len(ltop)))
# ensure groups are identical
df[id1.eq(id2)]
output:
a
4 10
5 11
8 26
9 51
11 83
12 82
13 85
intermediate groups:
a id1 id2
0 1 NaN 0
1 2 NaN 0
2 1 NaN 0
3 7 NaN 0
4 10 0 0
5 11 0 0
6 21 0 1
7 22 0 1
8 26 1 1
9 51 2 2
10 56 2 3
11 83 3 3
12 82 3 3
13 85 3 3
14 90 3 NaN

Add/Update/Merge original DataFrame into a grouped DataFrame

How can I merge, update, join, concat, or filter the original DF correctly so that I can have the complete 78 columns?
I have a DataFrame with 22 rows and 78 columns. An internet-friendly version of the file can be found here. This is a sample:
item_no code group gross_weight net_weight value ... ... +70 columns more
1 7417.85.24.25 0 18 17 13018.74
2 1414.19.00.62 1 35 33 0.11
3 7815.80.99.96 0 49 48 1.86
4 1414.19.00.62 1 30 27 2.7
5 5867.21.36.92 1 31 24 94
6 9227.71.84.12 1 24 17 56.4
7 1414.19.00.62 0 42 35 0.56
8 4465.58.84.31 0 50 42 0.94
9 1596.09.32.64 1 20 13 0.75
10 2194.64.27.41 1 38 33 1.13
11 1596.09.32.64 1 53 46 1.9
12 1596.09.32.64 1 18 15 10.44
13 1596.09.32.64 1 35 33 15.36
14 4835.09.81.44 1 55 47 10.44
15 5698.44.72.13 1 51 49 15.36
16 5698.44.72.13 1 49 45 2.15
17 5698.44.72.13 0 41 33 16
18 3815.79.80.69 1 25 21 4
19 3815.79.80.69 1 35 30 2.4
20 4853.40.53.94 1 53 46 3.12
21 4853.40.53.94 1 50 47 3.98
22 4853.40.53.94 1 16 13 6.53
The column group gives me the instruction that I should group all similar values in the code column and add the values in the columns: 'gross_weight', 'net_weight', 'value', and 'item_quantity'. Additionally, I have to modify 2 additional columns as shown below:
#Group DF
grouped_df = df.groupby(['group', 'code'], as_index=False).agg({'item_quantity':'sum', 'gross_weight':'sum','net_weight':'sum', 'value':'sum'}).copy()
#Total items should be equal to the length of the DF
grouped_df['total_items'] = len(grouped_df)
#Item No.
grouped_df['item_no'] = [x+1 for x in range(len(grouped_df))]
This is the result:
group code item_quantity gross_weight net_weight value total_items item_no
0 0 1414.19.00.62 75.0 42 35 0.56 14 1
1 0 4465.58.84.31 125.0 50 42 0.94 14 2
2 0 5698.44.72.13 200.0 41 33 16.0 14 3
3 0 7417.85.24.25 1940.2 18 17 13018.74 14 4
4 0 7815.80.99.96 200.0 49 48 1.86 14 5
5 1 1414.19.00.62 275.0 65 60 2.81 14 6
6 1 1596.09.32.64 515.0 126 107 28.45 14 7
7 1 2194.64.27.41 151.0 38 33 1.13 14 8
8 1 3815.79.80.69 400.0 60 51 6.4 14 9
9 1 4835.09.81.44 87.0 55 47 10.44 14 10
10 1 4853.40.53.94 406.0 119 106 13.63 14 11
11 1 5698.44.72.13 328.0 100 94 17.51 14 12
12 1 5867.21.36.92 1000.0 31 24 94.0 14 13
13 1 9227.71.84.12 600.0 24 17 56.4 14 14
All of the columns in the grouped DF exist in the original DF but some have different values.
The objective DataFrame is the grouped DF.
The columns in the original DF that already exists in the Grouped DF should be omitted.
I should be able to take the first value of the columns in the original DF that aren't in the Grouped DF.
The column code does not have unique values.
The column part_number in the complete file does not have unique values.
I tried:
pd.merge(how='left') after creating a unique ID; it duplicates existing columns instead of updating or overwriting their values.
join, concat, update: these do not yield the expected results.
.agg({lambda x: x.iloc[0]}) adds all the columns, but I don't know how to combine it with the current .agg({'item_quantity':'sum', 'gross_weight':'sum','net_weight':'sum', 'value':'sum'}).
I know that .agg({'column_name':'first'}) returns the first value, but I don't know how to make it work for over 70 columns automatically.
You can achieve this by building the aggregation dictionary dynamically with a dict comprehension, like this:
df.groupby(['group', 'code'], as_index=False).agg({col : 'sum' for col in df.columns[3:]})
If item_no is your index, then change df.columns[3:] to df.columns[2:].
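Since the question asks for sums on a few columns and the first value on everything else, the same dict-comprehension idea can be sketched with a per-column choice of function. The frame below is a cut-down, invented stand-in for the 78-column file, and description is a hypothetical column representing the other ~70:

```python
import pandas as pd

# Invented miniature of the real data; "description" stands in for the
# many columns that should simply keep their first value per group.
df = pd.DataFrame({
    "item_no": [1, 2, 3, 4],
    "code": ["x", "y", "x", "y"],
    "group": [1, 1, 1, 0],
    "gross_weight": [35, 20, 30, 42],
    "value": [0.5, 0.75, 2.5, 0.25],
    "description": ["bolt", "nut", "bolt-b", "nut-b"],
})

keys = ["group", "code"]
sum_cols = ["gross_weight", "value"]  # columns to add up

# 'sum' for the listed columns, 'first' for every other non-key column.
agg_spec = {
    col: ("sum" if col in sum_cols else "first")
    for col in df.columns
    if col not in keys
}
grouped = df.groupby(keys, as_index=False).agg(agg_spec)
print(grouped)
```

Columns such as total_items and item_no can then be overwritten on grouped afterwards, as in the question.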

Pandas.DataFrame: How to sort rows by the largest value in each row

I have a dataframe as in the figure (result of a word2vec analysis). I need to sort the rows in descending order by the largest value in each row, so that the order of the rows after sorting is as indicated by the red numbers in the image.
Thanks
Michael
Find the max along axis=1 and sort this series of maxes, then reindex using the resulting index.
Sample df
A B C D E F
0 95 86 29 38 79 18
1 15 8 34 46 71 50
2 29 9 78 97 83 45
3 88 25 17 83 78 77
4 40 82 3 0 78 38
df_final = df.reindex(df.max(1).sort_values(ascending=False).index)
Out[675]:
A B C D E F
2 29 9 78 97 83 45
0 95 86 29 38 79 18
3 88 25 17 83 78 77
4 40 82 3 0 78 38
1 15 8 34 46 71 50
You can use .max(axis=1) to find the row-wise max and then NumPy's .argsort() to get the integer positions that would sort those values. Finally, use .iloc to arrange the rows in the desired sequence (positions work with .iloc regardless of the index labels):
df.iloc[df.max(axis=1).to_numpy().argsort()[::-1]]
([::-1] reverses the order to descending. Remove it for ascending order.)
Input:
1 2 3 4
0 0.32 -1.09 -0.040000 0.600062
1 -0.32 1.19 3.287113 0.620000
2 2.04 1.23 1.010000 1.320000
Output:
1 2 3 4
1 -0.32 1.19 3.287113 0.620000
2 2.04 1.23 1.010000 1.320000
0 0.32 -1.09 -0.040000 0.600062
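To check that sorting by the row-wise max does not depend on a default integer index, here is a small sketch using the sample values from the first answer with made-up string labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [95, 15, 29, 88, 40], "B": [86, 8, 9, 25, 82], "C": [29, 34, 78, 17, 3],
     "D": [38, 46, 97, 83, 0], "E": [79, 71, 83, 78, 78], "F": [18, 50, 45, 77, 38]},
    index=list("vwxyz"),  # hypothetical non-default labels
)

# Row-wise max, sorted descending; its index gives the desired row order.
order = df.max(axis=1).sort_values(ascending=False).index
df_sorted = df.loc[order]
print(df_sorted)
```

Because order holds labels rather than positions, .loc is the matching accessor here.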

Find the difference between the max value and 2nd highest value within a subset of pandas columns

I have a fairly large dataframe:
A B C D
0 17 36 45 54
1 18 23 17 17
2 74 47 8 46
3 48 38 96 83
I am trying to create a new column that is ((max value of the columns) - (2nd highest value)) / (2nd highest value).
In this example it would look something like:
A B C D Diff
0 17 36 45 54 .20
1 18 23 17 17 .28
2 74 47 8 46 .57
3 48 38 96 83 .16
I've tried df['diff'] = df.loc[:, 'A': 'D'].max(axis=1) - df.iloc[:df.index.get_loc(df.loc[:, 'A': 'D'].idxmax(axis=1))] / ...
but even that part of the formula returns an error, never mind including the final division. I'm sure there must be an easier way of going about this.
Edit: Additionally, I am also trying to get the difference between the max value and the column that immediately precedes the max value. I know this is a somewhat different question, but I would appreciate any insight. Thank you!
One way using pandas.Series.nlargest with pct_change:
df["Diff"] = df.apply(lambda x: x.nlargest(2).pct_change(-1).iloc[0], axis=1)
Output:
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627
One way is to apply a udf:
def get_pct(x):
    xmax2, xmax = x.sort_values().tail(2)
    return (xmax-xmax2)/xmax2
df['Diff'] = df.apply(get_pct, axis=1)
Output:
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627
We can also make use of numpy sort and np.diff :
arr = np.sort(df,axis=1)[:,-2:]
df['Diff'] = np.diff(arr,axis=1)[:,0]/arr[:,0]
print(df)
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627
Let us try getting the second max value with mask:
Max = df.max(1)
secMax = df.mask(df.eq(Max,0)).max(1)
df['Diff'] = (Max - secMax)/secMax
df
Out[69]:
A B C D Diff
0 17 36 45 54 0.200000
1 18 23 17 17 0.277778
2 74 47 8 46 0.574468
3 48 38 96 83 0.156627
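Because the question is about a subset of columns, it may help to select that subset explicitly so that other columns (or a previously added Diff) never enter the calculation. A minimal sketch of the sort-based approach, where cols and top2 are names I introduced:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [17, 18, 74, 48], "B": [36, 23, 47, 38],
                   "C": [45, 17, 8, 96], "D": [54, 17, 46, 83]})

cols = ["A", "B", "C", "D"]  # the subset to compare

# Sort each row ascending and keep the last two values:
# top2[:, 0] is the second highest, top2[:, 1] is the max.
top2 = np.sort(df[cols].to_numpy(), axis=1)[:, -2:]
df["Diff"] = (top2[:, 1] - top2[:, 0]) / top2[:, 0]
print(df)
```

With this version, a row whose max appears twice gets a Diff of 0, since the second-highest entry then equals the max.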

How to create a rolling window in pandas with another condition

I have a data frame with 2 columns
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('AB'))
A B
0 11 10
1 61 30
2 24 54
3 47 52
4 72 42
... ... ...
95 61 2
96 67 41
97 95 30
98 29 66
99 49 22
100 rows × 2 columns
Now I want to create a third column, which is a rolling window max of col 'A', BUT the max has to be lower than the corresponding value in col 'B'. In other words, of the 4 values in column 'A' (using a window size of 4), I want the one closest to the value in col 'B', yet smaller than B.
So for example in row
3 47 52
the new value I am looking for, is not 61 but 47, because it is the highest value of the 4 that is not higher than 52
pseudo code
df['C'] = df['A'].rolling(window=4).max() where < df['B']
You can use concat + shift to create a wide DataFrame with the previous values, which makes complicated rolling calculations a bit easier.
Sample Data
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list('AB'))
Code
N = 4
# End slice ensures same default min_periods behavior to `.rolling`
df1 = pd.concat([df['A'].shift(i).rename(i) for i in range(N)], axis=1).iloc[N-1:]
# Remove values larger than B, then find the max of remaining.
df['C'] = df1.where(df1.lt(df.B, axis=0)).max(1)
print(df.head(15))
A B C
0 51 92 NaN # Missing b/c min_periods
1 14 71 NaN # Missing b/c min_periods
2 60 20 NaN # Missing b/c min_periods
3 82 86 82.0
4 74 74 60.0
5 87 99 87.0
6 23 2 NaN # Missing b/c 82, 74, 87, 23 all > 2
7 21 52 23.0 # Max of 21, 23, 87, 74 which is < 52
8 1 87 23.0
9 29 37 29.0
10 1 63 29.0
11 59 20 1.0
12 32 75 59.0
13 57 21 1.0
14 88 48 32.0
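The same steps can be checked by hand on the five rows printed in the question, where row 3 is expected to give 47. This is only a sketch, and lagged is my name for the wide frame:

```python
import pandas as pd

# First five rows from the question.
df = pd.DataFrame({"A": [11, 61, 24, 47, 72], "B": [10, 30, 54, 52, 42]})

N = 4
# One column per lag: column i holds A shifted down by i rows.
# Dropping the first N-1 rows mimics rolling's default min_periods.
lagged = pd.concat(
    [df["A"].shift(i).rename(i) for i in range(N)], axis=1
).iloc[N - 1:]

# Blank out window values that are not strictly below B, then take the max.
df["C"] = lagged.where(lagged.lt(df["B"], axis=0)).max(axis=1)
print(df)
```

Row 3 sees the window 11, 61, 24, 47 with B = 52, so 61 is discarded and 47 remains. Row 4 sees 61, 24, 47, 72 with B = 42, leaving only 24.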
You can use a custom function to .apply to the rolling window. In this case, you can use a default argument to pass in the B column.
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('AB'))
def rollup(a, B=df.B):
    # a is the current window of A; its largest index label is the current row
    ix = a.index.max()
    b = B[ix]
    return a[a<b].max()
df['C'] = df.A.rolling(4).apply(rollup)
df
# returns:
A B C
0 8 17 NaN
1 23 84 NaN
2 75 84 NaN
3 86 24 23.0
4 52 83 75.0
.. .. .. ...
95 38 22 NaN
96 53 48 38.0
97 45 4 NaN
98 3 92 53.0
99 91 86 53.0
The NaN values occur when no number in the window of A is less than B or at the start of the series when the window is too big for the first few rows.
You can use where to replace values that don't fulfill the condition with np.nan and then use rolling(window=4, min_periods=1):
In [37]: df['C'] = df['A'].where(df['A'] < df['B'], np.nan).rolling(window=4, min_periods=1).max()
In [38]: df
Out[38]:
A B C
0 0 1 0.0
1 1 2 1.0
2 2 3 2.0
3 10 4 2.0
4 4 5 4.0
5 5 6 5.0
6 10 7 5.0
7 10 8 5.0
8 10 9 5.0
9 10 10 NaN
