I have the following dataframe df:
A B Var Value
0 A1 B1 T1name T1
1 A2 B2 T1name T1
2 A1 B1 T2name T2
3 A2 B2 T2name T2
4 A1 B1 T1res 1
5 A2 B2 T1res 1
6 A1 B1 T2res 2
7 A2 B2 T2res 2
I now want to 'halve' my dataframe, because Var contains variables that should not go under the same column. My intended outcome is:
A B Name Value
0 A1 B1 T1 1
1 A2 B2 T1 1
2 A1 B1 T2 2
3 A2 B2 T2 2
What should I use to unpivot this correctly?
Just filter the rows where the string contains 'res' and assign a new column built from the first two characters of the Var column:
df[df['Var'].str.contains('res')].assign(Name=df['Var'].str[:2]).drop(columns='Var')
A B Value Name
4 A1 B1 1 T1
5 A2 B2 1 T1
6 A1 B1 2 T2
7 A2 B2 2 T2
Note that this creates a slice of the original DataFrame, not a copy.
Alternatively, drop the name rows with isin:
df = df[~df['Var'].isin(['T1name','T2name'])]
Output:
A B Var Value
4 A1 B1 T1res 1
5 A2 B2 T1res 1
6 A1 B1 T2res 2
7 A2 B2 T2res 2
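To land on the desired Name/Value layout from there, you still need to strip the res suffix and rename; a minimal follow-up sketch, assuming Var always starts with the T1/T2 prefix:

out = df[~df['Var'].isin(['T1name', 'T2name'])].copy()
out['Name'] = out['Var'].str[:2]  # keep the T1/T2 prefix
out = out.drop(columns='Var')[['A', 'B', 'Name', 'Value']]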
Looking at the df, there are different options available; regex seems to be at the top of the list. If regex doesn't work, consider redefining the problem:
Filter Value by dtype, strip the unwanted characters, and rename the column. Code below:
df[df['Value'].str.isnumeric()].replace(regex=r'res$', value='').rename(columns={'Var':'Name'})
A B Name Value
4 A1 B1 T1 1
5 A2 B2 T1 1
6 A1 B1 T2 2
7 A2 B2 T2 2
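Another way to look at it, if you are on a newer pandas (1.1+, which accepts a list for the pivot index): split Var into a measure prefix and a kind suffix, then pivot so that name and res land in separate columns. A sketch, assuming each (A, B, prefix) combination has exactly one name row and one res row:

tmp = df.assign(Name=df['Var'].str[:2], kind=df['Var'].str[2:])
out = (tmp.pivot(index=['A', 'B', 'Name'], columns='kind', values='Value')
          .reset_index()
          .rename(columns={'res': 'Value'})[['A', 'B', 'Name', 'Value']])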
I have a dataset:
name val
a a1
a a2
b b1
b b2
b b3
c c1
I want to make all possible pairs of "names" that are not the same. So the desired result is:
name1 val1 name2 val2
a a1 b b1
a a1 b b2
a a1 b b3
a a2 b b1
a a2 b b2
a a2 b b3
a a1 c c1
a a2 c c1
b b1 c c1
b b2 c c1
b b3 c c1
How to do that? I'd like to write a function that performs the same operation on a bigger table with the same structure.
I would like it to be efficient, since the original data has several thousand rows.
Easiest is to cross merge and query, if you have enough memory for a few million rows, which is not too bad (name1 < name2 keeps each unordered pair exactly once):
df.merge(df, how='cross', suffixes=['1','2']).query('name1 < name2')
Output:
name1 val1 name2 val2
2 a a1 b b1
3 a a1 b b2
4 a a1 b b3
5 a a1 c c1
8 a a2 b b1
9 a a2 b b2
10 a a2 b b3
11 a a2 c c1
17 b b1 c c1
23 b b2 c c1
29 b b3 c c1
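If the full cross product is too big to hold in memory at once, you can build the pairs group by group instead; a rough sketch of that idea, assuming name is the key as above:

from itertools import combinations
import pandas as pd

def pair_groups(df):
    groups = dict(tuple(df.groupby('name')))  # one sub-frame per name
    parts = [groups[n1].merge(groups[n2], how='cross', suffixes=['1', '2'])
             for n1, n2 in combinations(sorted(groups), 2)]
    return pd.concat(parts, ignore_index=True)

Peak memory is then bounded by the largest pair of groups rather than by the whole cross join, although the final result is the same size either way.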
Df1
A B C1 C2 D E
a1 b1 2 4 d1 e1
a2 b2 1 2 d2 e2
Df2
A B C D E
a1 b1 2 d1 e1
a1 b1 3 d1 e1
a1 b1 4 d1 e1
a2 b2 1 d2 e2
a2 b2 2 d2 e2
How can I make Df2 from Df1 in the fastest possible way?
I tried using groupby, then within a for loop used np.arange to fill Df2.C, and then used pd.concat to build the final Df2. But this approach is very slow and doesn't seem very elegant or pythonic either. Can somebody please help with this problem?
Try this:
(df1.assign(C=[np.arange(s, e + 1) for s, e in zip(df1['C1'], df1['C2'])])
    .explode('C'))
Output:
A B C1 C2 D E C
0 a1 b1 2 4 d1 e1 2
0 a1 b1 2 4 d1 e1 3
0 a1 b1 2 4 d1 e1 4
1 a2 b2 1 2 d2 e2 1
1 a2 b2 1 2 d2 e2 2
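To reproduce Df2 exactly, drop the range bounds and restore a clean index at the end; for example (assuming C1 <= C2 on every row):

df2 = (df1.assign(C=[np.arange(s, e + 1) for s, e in zip(df1['C1'], df1['C2'])])
          .explode('C')
          .drop(columns=['C1', 'C2'])
          .reset_index(drop=True)[['A', 'B', 'C', 'D', 'E']])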
One way is to melt df1, use groupby.apply to add ranges; then explode for the final output:
cols = ['A','B','D','E']
out = (df1.melt(cols, value_name='C')
          .groupby(cols)['C']
          .apply(lambda x: range(x.min(), x.max() + 1))
          .explode()
          .reset_index(name='C'))
Output:
A B D E C
0 a1 b1 d1 e1 2
1 a1 b1 d1 e1 3
2 a1 b1 d1 e1 4
3 a2 b2 d2 e2 1
4 a2 b2 d2 e2 2
I have a DataFrame that I need to group by A, then show a count of instances of B separated into B1 and B2, and finally the percentage of those instances that are > 0.1. So I did this to get the first two:
A B C
id
118 a1 B1 0
119 a1 B1 0
120 a1 B1 101.1
121 a1 B1 106.67
122 a1 B2 103.33
237 a1 B2 100
df = pd.DataFrame(df.groupby(['A', 'B'])['B']
                    .aggregate('count')).unstack(level=1)
which gets the first part right:
B
B B1 B2
A
a1 4 2
a2 7 9
a3 9 17
a4 8 8
a5 7 8
But then, when I try to get the percentage of the count that is > 0:
prcnt_complete = df[['A', 'B', 'C']]
prcnt_complete['passed'] = prcnt_complete['C'].apply(lambda x: (float(x) > 1))
prcnt_complete = prcnt_complete.groupby(['A', 'B', 'passed']).count()
I get weird values that make no sense; sometimes the counts for True and False don't even add up. I'm trying to understand what I'm doing wrong in the order of operations so that I can make sense of it.
The result I'm looking for is something like this:
B passed
B B1 B2 B1 B2
A
a1 4 2 2 2
a2 7 9 7 6
a3 9 17 9 5
You can do:
(df['C'].gt(1).groupby([df['A'],df['B']])
.agg(['size','sum'])
.rename(columns={'size':'B','sum':'passed'})
.unstack('B')
)
Output (from sample data):
B passed
B B1 B2 B1 B2
A
a1 4 2 2 2
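If you want the percentage itself rather than the raw counts, you can divide the two aggregates before unstacking; a small extension of the same idea:

res = (df['C'].gt(1).groupby([df['A'], df['B']])
       .agg(['size', 'sum']))
res['pct'] = res['sum'] / res['size'] * 100  # share of rows above the threshold
res = res.rename(columns={'size': 'B', 'sum': 'passed'}).unstack('B')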
While working on your problem, I also wanted to see if I could get the average percentage for B (ignoring zeros). I was able to accomplish this as well, alongside the counts.
DataFrame for this exercise:
A B C
0 a1 B1 0.00
1 a1 B1 0.00
2 a1 B1 98.87
3 a1 B1 101.10
4 a1 B2 106.67
5 a1 B2 103.00
6 a2 B1 0.00
7 a2 B1 0.00
8 a2 B1 33.00
9 a2 B1 100.00
10 a2 B2 80.00
11 a3 B1 90.00
12 a3 B2 99.00
Average while excluding the zeros
For this, I had to add .replace(0, np.nan) before the groupby call.
A = ['a1','a1','a1','a1','a1','a1','a2','a2','a2','a2','a2','a3','a3']
B = ['B1','B1','B1','B1','B2','B2','B1','B1','B1','B1','B2','B1','B2']
C = [0,0,98.87,101.1,106.67,103,0,0,33,100,80,90,99]
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':A,'B':B,'C':C})
df = pd.DataFrame(df.replace(0, np.nan)
                    .groupby(['A', 'B'])
                    .agg({'B': 'size', 'C': ['count', 'mean']})
                    .rename(columns={'size': 'Count', 'count': 'Passed',
                                     'mean': 'Avg Score'})).unstack(level=1)
df.columns = df.columns.droplevel(0)
Count Passed Avg Score
B B1 B2 B1 B2 B1 B2
A
a1 4 2 2 2 99.985 104.835
a2 4 1 2 1 66.500 80.000
a3 1 1 1 1 90.000 99.000
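A possibly shorter route to the same table is named aggregation on the single C column, relying on the fact that size counts NaNs while count does not; a sketch using the same replace trick:

out = (df.replace(0, np.nan)
         .groupby(['A', 'B'])['C']
         .agg(Count='size', Passed='count', **{'Avg Score': 'mean'})
         .unstack(level=1))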
Still new to pandas, but is there a way to sort a df by the subtotal of each group?
Area Unit Count
A A1 5
A A2 2
B B1 10
B B2 1
B B3 3
C C1 10
So I want to sort them by the subtotal of each Area, which gives A subtotal = 7, B subtotal = 14, C subtotal = 10.
The sort should be like
Area Unit Count
B B1 10
B B2 1
B B3 3
C C1 10
A A1 5
A A2 2
*Note that even though B3's count is greater than B2's, the rows within each Area should keep their original order and not be re-sorted by Count.
Create a helper column 'sorter', which is the group-wise sum of Count, and sort your dataframe with it:
df['sorter'] = df.groupby("Area").Count.transform('sum')
df.sort_values('sorter',ascending=False).reset_index(drop=True).drop('sorter',axis=1)
Area Unit Count
0 B B1 10
1 B B2 1
2 B B3 3
3 C C1 10
4 A A1 5
5 A A2 2
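If two Areas could tie on the subtotal, a stable sort with Area as a tiebreaker keeps the output deterministic while still preserving the original row order within each group; a variant of the same helper-column trick:

df['sorter'] = df.groupby('Area')['Count'].transform('sum')
out = (df.sort_values(['sorter', 'Area'], ascending=[False, True], kind='stable')
         .drop(columns='sorter')
         .reset_index(drop=True))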
I would like to add to the dataframe df a column that enumerates the rows sharing the same index value.
df = pd.DataFrame(data=[['a1','b1'],['a2','b2'],['a3','b3'],['a4','b4']],columns=['a','b'],index=[9,9,12,14])
df
Out[13]:
a b
9 a1 b1
9 a2 b2
12 a3 b3
14 a4 b4
In practice I would like to add a column 'day' so that
df = pd.DataFrame(data=[[1,'a1','b1'],[2,'a2','b2'],[1,'a3','b3'],[1,'a4','b4']],columns=['day','a','b'],index=[9,9,12,14])
df
Out[15]:
day a b
9 1 a1 b1
9 2 a2 b2
12 1 a3 b3
14 1 a4 b4
Use cumcount; and if you need the new column in the first position, add it with insert:
df['day'] = df.groupby(level=0).cumcount() + 1
print (df)
a b day
9 a1 b1 1
9 a2 b2 2
12 a3 b3 1
14 a4 b4 1
df.insert(0, 'day', df.groupby(level=0).cumcount() + 1)
print (df)
day a b
9 1 a1 b1
9 2 a2 b2
12 1 a3 b3
14 1 a4 b4
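The same trick works when the duplicated key is a regular column rather than the index; a sketch with a hypothetical key column:

df2 = df.rename_axis('key').reset_index()  # move the index into a 'key' column
df2['day'] = df2.groupby('key').cumcount() + 1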