My current dataframe's column names are tuples holding two levels. I'm looking to add a third level by splitting those tuples apart. See example:
Original DF:
Category
(A,Cat) (B,Dog) (B,Bird) (B,Frog) (HH,Lion) (HH,Tiger)
48 28 585 4 233 44
11 434 23 854 32 10
Desired DF: "Category" is the top level, the letter (A, B, HH) is the second level, and the animal is the bottom level of the dataframe:
Category
A B B B HH HH
Cat Dog Bird Frog Lion Tiger
48 28 585 4 233 44
11 434 23 854 32 10
I don't have much experience working with MultiIndex columns in dataframes. Any suggestions are appreciated.
First, starting with what you have (it would've been nice if you had provided this code yourself):
import pandas as pd
df = pd.DataFrame(
data=[[48, 28, 585, 4, 233, 44], [11, 434, 23, 854, 32, 10]],
columns=[("A", "Cat"), ("B", "Dog"), ("B", "Bird"), ("B", "Frog"), ("HH", "Lion"), ("HH", "Tiger")],
)
df
(A, Cat) (B, Dog) (B, Bird) (B, Frog) (HH, Lion) (HH, Tiger)
0 48 28 585 4 233 44
1 11 434 23 854 32 10
Now break down the tuples into multi-level columns and then prepend another level:
# Create multi-index from existing column tuples
# https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.from_tuples.html
df.columns = pd.MultiIndex.from_tuples(df.columns)
# Add another level for the 'Category'
# From https://stackoverflow.com/a/42094658/1256347
pd.concat([df], keys=['Category'], axis=1)
Category
A B HH
Cat Dog Bird Frog Lion Tiger
0 48 28 585 4 233 44
1 11 434 23 854 32 10
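As a hedged alternative (my own sketch, not part of the answer above), you can build all three levels in one pass by prepending the constant top level to each existing tuple before calling from_tuples:
import pandas as pd

df = pd.DataFrame(
    data=[[48, 28, 585, 4, 233, 44], [11, 434, 23, 854, 32, 10]],
    columns=[("A", "Cat"), ("B", "Dog"), ("B", "Bird"), ("B", "Frog"), ("HH", "Lion"), ("HH", "Tiger")],
)
# prepend 'Category' to every tuple, then build a 3-level MultiIndex directly
df.columns = pd.MultiIndex.from_tuples([("Category", letter, animal) for letter, animal in df.columns])
This yields the same three-level columns without the extra concat step.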
I have the following problem
import pandas as pd
data = {
    "ID": [420, 380, 390, 540, 520, 50, 22],
    "duration": [50, 40, 45, 33, 19, 1, 3],
    "next": ["390;50", "880;222", "520;50", "380;111", "810;111", "22;888", "11"],
}
# load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
As you can see, I have:
ID duration next
0 420 50 390;50
1 380 40 880;222
2 390 45 520;50
3 540 33 380;111
4 520 19 810;111
5 50 1 22;888
6 22 3 11
Things to notice:
ID is of type int
next is a string of numbers separated by ; when there is more than one
I would like to keep only the rows for which none of the next values appear in the ID column.
For example, in this case:
420 has a follow-up in both 390 and 50, so it is dropped
380 has 880 and 222 as next, neither of which is in ID, so this row stays
540 has 380 and 111 as next; while 111 is not in ID, 380 is, so this row is dropped
the same goes for 50
In the end I want to get
1 380 40 880;222
4 520 19 810;111
6 22 3 11
When next held only a single value I used print(df[~df.next.astype(int).isin(df.ID)]), but here isin cannot be applied so simply.
How can I do this?
Try split, then explode, then an isin check:
# split each 'next' string and explode to one value per row (original index is kept)
s = df.next.str.split(';').explode().astype(int)
# a row is dropped if any of its exploded values is found in ID
out = df[~s.isin(df['ID']).groupby(level=0).any()]
Out[420]:
ID duration next
1 380 40 880;222
4 520 19 810;111
6 22 3 11
Use a regex with word boundaries for efficiency:
# alternation of all IDs, with word boundaries so e.g. 38 cannot match inside 380
pattern = '|'.join(df['ID'].astype(str))
out = df[~df['next'].str.contains(fr'\b(?:{pattern})\b')]
Output:
ID duration next
1 380 40 880;222
4 520 19 810;111
6 22 3 11
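If you prefer to avoid both the explode and the regex, a plain set lookup per row also works (a hedged sketch of my own, not from the answers above):
# build a set of IDs as strings, then keep rows where no split token is in the set
ids = set(df['ID'].astype(str))
out = df[df['next'].map(lambda s: not any(tok in ids for tok in s.split(';')))]
print(out)
This trades vectorization for a simple Python-level pass over the rows.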
I have two Pandas DataFrames with different numbers of columns.
df1 is a single-row DataFrame:
a X0 b Y0 c
0 233 100 56 shark -23
df2, instead, is a multi-row DataFrame:
d X0 e f Y0 g h
0 snow 201 32 36 cat 58 336
1 rain 176 99 15 tiger 63 845
2 sun 193 81 42 dog 48 557
3 storm 100 74 18 shark 39 673 # <-- This row
4 cloud 214 56 27 wolf 66 406
I would like to verify whether df1's row is in df2, considering the X0 AND Y0 columns only and ignoring all the other columns.
In this example df1's row matches the df2 row at index 3, which has 100 in X0 and 'shark' in Y0.
The output for this example is:
True
Note: True/False as output is enough for me; I don't care about the index of the matched row.
I found similar questions but all of them check the entire row...
Use df.merge with an if condition check on len:
In [219]: if len(df1[['X0', 'Y0']].merge(df2)):
...: print(True)
...:
True
OR:
In [225]: not (df1[['X0', 'Y0']].merge(df2)).empty
Out[225]: True
Try this:
df2[(df2.X0.isin(df1.X0))&(df2.Y0.isin(df1.Y0))]
Output:
d X0 e f Y0 g h
3 storm 100 74 18 shark 39 673
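Note that the isin approach checks X0 and Y0 independently, which is fine here because df1 has a single row; with several rows in df1 it could match an X0 from one row against a Y0 from another. A pair-wise check avoids that (a hedged sketch of my own):
# compare (X0, Y0) as tuples so both values must come from the same df1 row
pairs = set(zip(df1['X0'], df1['Y0']))
print(df2[['X0', 'Y0']].apply(tuple, axis=1).isin(pairs).any())  # True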
duplicated
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
pd.concat([df2, df1]).duplicated(['X0', 'Y0']).iat[-1]
True
Save a tad bit of time by concatenating only the two key columns:
pd.concat([df2[['X0', 'Y0']], df1[['X0', 'Y0']]]).duplicated().iat[-1]
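For completeness, here is a hedged sketch of my own (not from the answers above) using merge with indicator=True, which also tells you which df2 rows matched:
# rows marked 'both' have an (X0, Y0) pair that also occurs in df1
merged = df2.merge(df1[['X0', 'Y0']], on=['X0', 'Y0'], how='left', indicator=True)
print((merged['_merge'] == 'both').any())  # True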
I have this data frame (first two rows shown; the real one is huge):
df
p__Actinobacteriota 25 555
p__Bacteroidota 31 752
I would like to transform this data frame into the following one:
dft
p__Actinobacteriota 25 A
p__Actinobacteriota 555 B
p__Bacteroidota 31 A
p__Bacteroidota 752 B
What is the simplest way to do that?
I will assume that your dataframe is:
pd.DataFrame([['p__Actinobacteriota', 25, 555], ['p__Bacteroidota', 31, 752]])
which prints as:
0 1 2
0 p__Actinobacteriota 25 555
1 p__Bacteroidota 31 752
It is easy to stack it:
df.rename(columns={1:'A', 2:'B'}).set_index([0]).stack().rename('val').reset_index()
which gives:
0 level_1 val
0 p__Actinobacteriota A 25
1 p__Actinobacteriota B 555
2 p__Bacteroidota A 31
3 p__Bacteroidota B 752
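An alternative worth knowing (a hedged sketch of my own, same assumed input as above) is melt, which needs an extra sort because it orders by variable before row:
out = (df.rename(columns={1: 'A', 2: 'B'})
         .melt(id_vars=[0], var_name='label', value_name='val')
         .sort_values([0, 'label'])
         .reset_index(drop=True))
stack remains the more direct fit here, since the A/B labels come straight from the renamed column names.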
I have the following dataframe, for which I want to create a column named 'Value' that must refer to the previous row's value in the same column, computed with numpy for speed.
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"Product": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
"Inbound": [115, 220, 200, 402, 313, 434, 321, 343, 120],
"Outbound": [10, 20, 24, 52, 40, 12, 43, 23, 16],
"Is First?": ["Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No"],
}
)
The desired output is:
Product Inbound Outbound Is First? Value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350
The formula for the Value column in pseudocode is:
if [Is First?] == 'Yes' then [Value] = [Inbound] + [Outbound]
else [Value] = [Previous Value] - [Outbound]
The obvious way of creating the Value column is a for loop that refers to the previous row (which I am somehow not able to make work with shift). But since I will be applying this to a giant dataset, I want a numpy vectorized method instead.
for i in range(len(df)):
    if df.loc[i, "Is First?"] == "Yes":
        df.loc[i, "Value"] = df.loc[i, "Inbound"] + df.loc[i, "Outbound"]
    else:
        # previous row's Value minus this row's Outbound
        df.loc[i, "Value"] = df.loc[i - 1, "Value"] - df.loc[i, "Outbound"]
One way:
You may use np.subtract.accumulate with transform
# group id that increments at every 'Yes'
s = df['Is First?'].eq('Yes').cumsum()
# seed each group with Inbound + Outbound; later rows contribute their Outbound,
# which subtract.accumulate turns into a running subtraction within the group
df['value'] = ((df.Inbound + df.Outbound).where(df['Is First?'].eq('Yes'), df.Outbound)
               .groupby(s)
               .transform(np.subtract.accumulate))
Out[1749]:
Product Inbound Outbound Is First? value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350
Another way:
Assign the value for the 'Yes' rows. Create a group id s to use with groupby. Group and shift Outbound to calculate the cumulative sum, and subtract it from each group's 'Yes' value. Finally, use the result to fillna.
df['value'] = (df.Inbound + df.Outbound).where(df['Is First?'].eq('Yes'))
s = df['Is First?'].eq('Yes').cumsum()
s1 = df.value.ffill() - df.Outbound.shift(-1).groupby(s).cumsum().shift()
df['value'] = df.value.fillna(s1)
Out[1671]:
Product Inbound Outbound Is First? value
0 A 115 10 Yes 125.0
1 A 220 20 No 105.0
2 A 200 24 No 81.0
3 A 402 52 No 29.0
4 B 313 40 Yes 353.0
5 B 434 12 No 341.0
6 B 321 43 No 298.0
7 C 343 23 Yes 366.0
8 C 120 16 No 350.0
This is not a trivial task; the difficulty lies in the consecutive 'No' rows, which must be grouped together with the 'Yes' row that precedes them. The code below should do it:
col_sum = df.Inbound+df.Outbound
mask_no = df['Is First?'].eq('No')
mask_yes = df['Is First?'].eq('Yes')
consec_no = mask_yes.cumsum()
result = col_sum.groupby(consec_no).transform('first')-df['Outbound'].where(mask_no,0).groupby(consec_no).cumsum()
Use:
m = df['Is First?'].eq('Yes')
df.loc[m, 'Value'] = df['Inbound'] + df['Outbound']
# the running subtraction must restart at every 'Yes', hence the group-wise cumsum
df['Value'] = df['Value'].ffill() - df['Outbound'].mask(m, 0).groupby(m.cumsum()).cumsum()
Annotated numpy code:
## 1. line up values to sum
ob = -df["Outbound"].values
# get yes indices
fi, = np.where(df["Is First?"].values == "Yes")
# insert yes formula at yes positions
ob[fi] = df["Inbound"].values[fi] - ob[fi]
## 2. calculate block sums and subtract each from the
## first element of the **next** block
ob[fi[1:]] -= np.add.reduceat(ob,fi)[:-1]
# now simply taking the cumsum will reset after each block
df["Value"] = ob.cumsum()
Result:
Product Inbound Outbound Is First? Value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350
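Whichever vectorized answer you adopt, it is cheap to sanity-check it against a plain loop on a small frame. A hedged sketch using the column names from the question (assuming df['Value'] has already been filled by one of the answers above):
def loop_value(df):
    vals = []
    for i in range(len(df)):
        if df.loc[i, "Is First?"] == "Yes":
            vals.append(df.loc[i, "Inbound"] + df.loc[i, "Outbound"])
        else:
            # previous computed value minus this row's Outbound
            vals.append(vals[-1] - df.loc[i, "Outbound"])
    return pd.Series(vals, index=df.index, name="Value")

assert (df["Value"] == loop_value(df)).all()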
I have a dataframe, grouped, with multiindex columns as below:
import random

import numpy as np
import pandas as pd

codes = ["one", "two", "three"]
colours = ["black", "white"]
textures = ["soft", "hard"]
N = 100  # length of the dataframe
df = pd.DataFrame({ 'id' : range(1,N+1),
'weeks_elapsed' : [random.choice(range(1,25)) for i in range(1,N+1)],
'code' : [random.choice(codes) for i in range(1,N+1)],
'colour': [random.choice(colours) for i in range(1,N+1)],
'texture': [random.choice(textures) for i in range(1,N+1)],
'size': [random.randint(1,100) for i in range(1,N+1)],
'scaled_size': [random.randint(100,1000) for i in range(1,N+1)]
}, columns= ['id', 'weeks_elapsed', 'code','colour', 'texture', 'size', 'scaled_size'])
grouped = df.groupby(['code', 'colour']).agg(
    {'size': [np.sum, np.average, np.size, pd.Series.idxmax],
     'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]}
).reset_index()
>> grouped
code colour size scaled_size
sum average size idxmax sum average size idxmax
0 one black 1031 60.647059 17 81 185.153944 10.891408 17 47
1 one white 481 37.000000 13 53 204.139249 15.703019 13 53
2 three black 822 48.352941 17 6 123.269405 7.251141 17 31
3 three white 1614 57.642857 28 50 285.638337 10.201369 28 37
4 two black 523 58.111111 9 85 80.908912 8.989879 9 88
5 two white 669 41.812500 16 78 82.098870 5.131179 16 78
[6 rows x 10 columns]
How can I flatten/merge the column index levels as "Level1|Level2", e.g. size|sum, scaled_size|sum, etc.? If this is not possible, is there a way to groupby() as I did above without creating MultiIndex columns?
There are several pythonic ways to flatten MultiIndex columns.
1. Use map and join with string column headers:
# join the two levels with '|', then strip the trailing '|' left by empty second levels
grouped.columns = grouped.columns.map('|'.join).str.strip('|')
print(grouped)
Output:
code colour size|sum size|average size|size size|idxmax \
0 one black 862 53.875000 16 14
1 one white 554 46.166667 12 18
2 three black 842 49.529412 17 90
3 three white 740 56.923077 13 97
4 two black 1541 61.640000 25 50
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 6980 436.250000 16 77
1 6101 508.416667 12 13
2 7889 464.058824 17 64
3 6329 486.846154 13 73
4 12809 512.360000 25 23
2. Use map with format, which also works when some column levels are not strings ('|'.join would fail on numeric levels):
grouped.columns = grouped.columns.map('{0[0]}|{0[1]}'.format)
Output:
code| colour| size|sum size|average size|size size|idxmax \
0 one black 734 52.428571 14 30
1 one white 1110 65.294118 17 88
2 three black 930 51.666667 18 3
3 three white 1140 51.818182 22 20
4 two black 656 38.588235 17 77
5 two white 704 58.666667 12 17
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 8229 587.785714 14 57
1 8781 516.529412 17 73
2 10743 596.833333 18 21
3 10240 465.454545 22 26
4 9982 587.176471 17 16
5 6537 544.750000 12 49
3. Use list comprehension with f-string for Python 3.6+:
grouped.columns = [f'{i}|{j}' if j != '' else f'{i}' for i,j in grouped.columns]
Output:
code colour size|sum size|average size|size size|idxmax \
0 one black 1003 43.608696 23 76
1 one white 1255 59.761905 21 66
2 three black 777 45.705882 17 39
3 three white 630 52.500000 12 23
4 two black 823 54.866667 15 33
5 two white 491 40.916667 12 64
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 12532 544.869565 23 27
1 13223 629.666667 21 13
2 8615 506.764706 17 92
3 6101 508.416667 12 43
4 7661 510.733333 15 42
5 6143 511.916667 12 49
You could always change the columns directly:
grouped.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in grouped.columns]
Based on Scott Boston's answer, a little update (it will work for columns with 2 or more levels):
grouped.columns.map(lambda x: '|'.join([str(i) for i in x]))
Thank you, Boston!
Full credit to suraj's concise answer: https://stackoverflow.com/a/72616083/317797
df.columns = df.columns.map('_'.join)
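Note that the one-liner assumes every level is already a string and none is empty. A slightly more defensive variant (a sketch of my own, using Index.to_flat_index, available since pandas 0.24):
# stringify each level and skip empty ones, so ('code', '') becomes 'code' rather than 'code_'
df.columns = ['_'.join(str(level) for level in tup if str(level) != '')
              for tup in df.columns.to_flat_index()]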