Pandas Collapse and Stack Multi-level columns - python

I want to break down multi level columns and have them as a column value.
Original data (an Excel sheet with merged header cells), as read into a DataFrame:
Company Name Company code 2017-01-01 00:00:00 Unnamed: 3 Unnamed: 4 Unnamed: 5 2017-02-01 00:00:00 Unnamed: 7 Unnamed: 8 Unnamed: 9 2017-03-01 00:00:00 Unnamed: 11 Unnamed: 12 Unnamed: 13
0 NaN NaN Product A Product B Product C Product D Product A Product B Product C Product D Product A Product B Product C Product D
1 Company A #123 1 5 3 5 0 2 3 4 0 1 2 3
2 Company B #124 600 208 30 20 600 213 30 15 600 232 30 12
3 Company C #125 520 112 47 15 520 110 47 10 520 111 47 15
4 Company D #126 420 165 120 31 420 195 120 30 420 182 120 58
Intended data frame (the dates moved down into a column, one row per company per month):
I have tried stack() and unstack(), and also swaplevel(), but I couldn't get the date columns to drop down into the rows. It looks like merged cells in Excel produce NaN in the DataFrame, and when it's the columns that are merged I get unnamed columns. How do I work around this? Am I missing something really simple here?

Using stack, assuming the columns were read in as a two-level MultiIndex (dates on the first level):
df.stack(level=0).reset_index(level=1)
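If the frame still looks like the one above (dates in the header row, product names in row 0, and Unnamed: columns left over from the merged cells), stack has nothing to work with yet. One workaround is to re-read the file with both header rows so that the merged date cells and the product row become a two-level column MultiIndex. A minimal sketch, assuming the file is named data.xlsx and the first two columns hold the company name and code:
import pandas as pd

# Hypothetical file name; header=[0, 1] turns the two header rows into a
# column MultiIndex, with the merged date cells filled across their span.
df = pd.read_excel('data.xlsx', header=[0, 1], index_col=[0, 1])

# Move the date level of the columns down into the rows.
out = (df.stack(level=0)
         .rename_axis(['Company Name', 'Company code', 'Date'])
         .reset_index())
print(out)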

Related

Find cumulative maximum of a pandas series, but reset the cumulative maximum when 0 is encountered

Consider the series below:
100
102
101
103
0
12
123
14
I want the result to be as follows:
100
102
102
103
0
12
123
123
Let d be the variable containing your series. Group by the cumulative sum of d == 0 and take the cumulative maximum within each group:
d.groupby(d.eq(0).cumsum()).cummax()
Out[37]:
0 100
1 102
2 102
3 103
4 0
5 12
6 123
7 123
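To see why this works: d.eq(0).cumsum() increases by one at every zero, so each zero opens a new group, and cummax then restarts within each group. A small demonstration with the series from the question:
import pandas as pd

d = pd.Series([100, 102, 101, 103, 0, 12, 123, 14])
groups = d.eq(0).cumsum()  # 0 0 0 0 1 1 1 1 - each zero starts a new group
print(d.groupby(groups).cummax().tolist())
# [100, 102, 102, 103, 0, 12, 123, 123]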

Compare two values from different dataframes and add values from one to other dataframe [Pandas] [duplicate]

This question already has answers here: Pandas Merging 101 (8 answers). Closed 11 months ago.
I have two data frames in Pandas with following structure:
pr_df:
id prMerged idRepository avgTime
0 1 1 2 63.93
1 2 0 3 41.11
2 3 0 3 36.03
3 4 1 4 77.28
...
98 99 1 20 54.78
99 100 0 20 42.12
repo_df
id stars forks
0 1 1245 45
1 2 3689 78
2 3 458 15
3 4 954 75
...
19 20 1947 102
I would like to combine pr_df with repo_df by comparing idRepository (from pr_df) and id (from repo_df) with each other and add two columns to pr_df: stars and forks. As a result, I would like to achieve:
pr_df:
id prMerged idRepository avgTime stars forks
0 1 1 2 63.93 3689 78
1 2 0 3 41.11 458 15
2 3 0 3 36.03 458 15
3 4 1 4 77.28 954 75
...
98 99 1 20 54.78 1947 102
99 100 0 20 42.12 1947 102
How can I do it using Pandas? How can I compare idRepository with id and add new columns to pr_df based on that?
You can use the merge function; supply the column to join on from each frame:
pr_df.merge(repo_df, left_on='idRepository', right_on='id')
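Note that both frames carry an id column, so the plain merge above leaves an id_x/id_y pair behind. A sketch of one way to keep pr_df's own id and drop the duplicate (the _repo suffix is an arbitrary choice):
pr_df = (pr_df.merge(repo_df, left_on='idRepository', right_on='id',
                     suffixes=('', '_repo'))
              .drop(columns='id_repo'))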

Transpose/Pivot DataFrame but not all columns in the same row

I have a DataFrame and I want to transpose it.
import pandas as pd
df = pd.DataFrame({'ID': [111, 111, 222, 222, 333, 333],
                   'Month': ['Jan', 'Feb', 'Jan', 'Feb', 'Jan', 'Feb'],
                   'Employees': [2, 3, 1, 5, 7, 1],
                   'Subsidy': [20, 30, 10, 15, 40, 5]})
print(df)
ID Month Employees Subsidy
0 111 Jan 2 20
1 111 Feb 3 30
2 222 Jan 1 10
3 222 Feb 5 15
4 333 Jan 7 40
5 333 Feb 1 5
Desired output:
ID Var Jan Feb
0 111 Employees 2 3
1 111 Subsidy 20 30
0 222 Employees 1 5
1 222 Subsidy 10 15
0 333 Employees 7 1
1 333 Subsidy 40 5
My attempt: I tried using pivot_table(), but Employees & Subsidy naturally appear in the same row, whereas I want them on separate rows.
df.pivot_table(index=['ID'],columns='Month',values=['Employees','Subsidy'])
Employees Subsidy
Month Feb Jan Feb Jan
ID
111 3 2 30 20
222 5 1 15 10
333 1 7 5 40
I tried using transpose(), but it transposes the entire DataFrame; there seems to be no way to transpose while keeping one column fixed. Any suggestions?
You can use DataFrame.rename_axis to set a new name (Var) for the first column level after pivoting, and None to drop the Month name so it does not appear in the final DataFrame. Then reshape with DataFrame.stack on the first level, and finally convert the resulting MultiIndex to columns with DataFrame.reset_index:
df2 = (df.pivot_table(index='ID',
                      columns='Month',
                      values=['Employees', 'Subsidy'])
         .rename_axis(['Var', None], axis=1)
         .stack(level=0)
         .reset_index())
print(df2)
ID Var Feb Jan
0 111 Employees 3 2
1 111 Subsidy 30 20
2 222 Employees 5 1
3 222 Subsidy 15 10
4 333 Employees 1 7
5 333 Subsidy 5 40
You were on point with your pivot_table approach. The only thing missing is stack and reset_index:
df.pivot_table(index=['ID'],columns='Month',values=['Employees','Subsidy']).stack(0).reset_index()
Out[42]:
Month ID level_1 Feb Jan
0 111 Employees 3 2
1 111 Subsidy 30 20
2 222 Employees 5 1
3 222 Subsidy 15 10
4 333 Employees 1 7
5 333 Subsidy 5 40
You can rename the level_1 column to Var afterwards if needed.
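For example, reproducing the desired output exactly (the Var name and the Jan-before-Feb ordering are taken from the question):
out = (df.pivot_table(index=['ID'], columns='Month',
                      values=['Employees', 'Subsidy'])
         .stack(0)
         .reset_index()
         .rename(columns={'level_1': 'Var'})
         [['ID', 'Var', 'Jan', 'Feb']])
print(out)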

Subtract/Add existing values if contents of one dataframe is present in another using pandas

Here are 2 dataframes
df1:
Index Number Name Amount
0 123 John 31
1 124 Alle 33
2 312 Amy 33
3 314 Holly 35
df2:
Index Number Name Amount
0 312 Amy 13
1 124 Alle 35
2 317 Jack 53
The resulting dataframe should look like this:
result_df:
Index Number Name Amount Curr_amount
0 123 John 31 31
1 124 Alle 33 68
2 312 Amy 33 46
3 314 Holly 35 35
4 317 Jack NaN 53
I have tried pandas isin, but it only tells me whether each Number is present, as a boolean. Is there a way to do this efficiently?
Use merge with an outer join, then combine the two amount columns with Series.add (or Series.sub if you need subtraction):
df = df1.merge(df2, on=['Number','Name'], how='outer', suffixes=('','_curr'))
df['Amount_curr'] = df['Amount_curr'].add(df['Amount'], fill_value=0)
print(df)
Number Name Amount Amount_curr
0 123 John 31.0 31.0
1 124 Alle 33.0 68.0
2 312 Amy 33.0 46.0
3 314 Holly 35.0 35.0
4 317 Jack NaN 53.0
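The outer join introduces NaN for rows that exist in only one frame, which upcasts the amounts to float. If you want whole numbers back, a nullable integer dtype keeps the missing value; this assumes a reasonably recent pandas:
df[['Amount', 'Amount_curr']] = df[['Amount', 'Amount_curr']].astype('Int64')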

Mark every Nth row per group using pandas

I have a DataFrame with customer info and their purchase details. I am trying to add a new column that flags every 3rd purchase by the same customer.
Given below is the Dataframe
customer_name,bill_no,date
Mark,101,2018-10-01
Scott,102,2018-10-01
Pete,103,2018-10-02
Mark,104,2018-10-02
Mark,105,2018-10-04
Scott,106,2018-10-21
Julie,107,2018-10-03
Kevin,108,2018-10-07
Steve,109,2018-10-02
Mark,110,2018-10-06
Mark,111,2018-10-02
Mark,112,2018-10-05
Mark,113,2018-10-05
I want to flag every 3rd purchase by the same customer. In this case, the flag should be set for the bills below:
Mark,105,2018-10-04
Mark,112,2018-10-05
Basically, every bill whose running count for that customer is a multiple of 3.
Using groupby.cumcount:
n = 3
df['flag'] = df.groupby('customer_name').cumcount() + 1
df['flag'] = ((df['flag'] % n) == 0).astype(int)
print(df)
customer_name bill_no date flag
0 Mark 101 2018-10-01 0
1 Scott 102 2018-10-01 0
2 Pete 103 2018-10-02 0
3 Mark 104 2018-10-02 0
4 Mark 105 2018-10-04 1
5 Scott 106 2018-10-21 0
6 Julie 107 2018-10-03 0
7 Kevin 108 2018-10-07 0
8 Steve 109 2018-10-02 0
9 Mark 110 2018-10-06 0
10 Mark 111 2018-10-02 0
11 Mark 112 2018-10-05 1
12 Mark 113 2018-10-05 0
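The two steps above can also be collapsed into a single expression (same result, just more compact):
n = 3
df['flag'] = df.groupby('customer_name').cumcount().add(1).mod(n).eq(0).astype(int)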
If actually getting the indices is important, you can use groupby + apply with slicing on the index:
n = 3
idx = df.groupby('customer_name', group_keys=False).apply(
    lambda x: x.index[n-1::n].to_series())
# So you can query these rows easily.
df.loc[idx]
customer_name bill_no date
4 Mark 105 2018-10-04
11 Mark 112 2018-10-05
Now, mark them using the indices:
df['flag'] = 0
df.loc[idx, 'flag'] = 1
df
customer_name bill_no date flag
0 Mark 101 2018-10-01 0
1 Scott 102 2018-10-01 0
2 Pete 103 2018-10-02 0
3 Mark 104 2018-10-02 0
4 Mark 105 2018-10-04 1
5 Scott 106 2018-10-21 0
6 Julie 107 2018-10-03 0
7 Kevin 108 2018-10-07 0
8 Steve 109 2018-10-02 0
9 Mark 110 2018-10-06 0
10 Mark 111 2018-10-02 0
11 Mark 112 2018-10-05 1
12 Mark 113 2018-10-05 0
If performance is important, use the groupby.cumcount solution above instead.
