Get values and column names - Python

I have a pandas data frame that looks something like this:
import pandas as pd

data = {'1': [0, 2, 0, 0], '2': [5, 0, 0, 2], '3': [2, 0, 0, 0], '4': [0, 7, 0, 0]}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
df
   1  2  3  4
a  0  5  2  0
b  2  0  0  7
c  0  0  0  0
d  0  2  0  0
I know I can get the maximum value and the corresponding column name for each row by doing (respectively):
df.max(1)
df.idxmax(1)
How can I get the values and the column name for every cell that is not zero?
So in this case, I'd want 2 tables, one giving me each value != 0 for each row:
a 5
a 2
b 2
b 7
d 2
And one giving me the column names for those values:
a 2
a 3
b 1
b 4
d 2
Thanks!

You can use stack to reshape the DataFrame into a Series, filter out the zeros with boolean indexing, then use rename_axis and reset_index. Finally, drop the unwanted column or select the columns you need:
s = df.stack()
df1 = s[s != 0].rename_axis(['a', 'b']).reset_index(name='c')
print(df1)
   a  b  c
0  a  2  5
1  a  3  2
2  b  1  2
3  b  4  7
4  d  2  2
df2 = df1.drop('b', axis=1)
print(df2)
   a  c
0  a  5
1  a  2
2  b  2
3  b  7
4  d  2
df3 = df1.drop('c', axis=1)
print(df3)
   a  b
0  a  2
1  a  3
2  b  1
3  b  4
4  d  2
df3 = df1[['a', 'c']]
print(df3)
   a  c
0  a  5
1  a  2
2  b  2
3  b  7
4  d  2
df3 = df1[['a', 'b']]
print(df3)
   a  b
0  a  2
1  a  3
2  b  1
3  b  4
4  d  2
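If you want the two tables keyed by the original row labels, exactly as in the question, here is a minimal sketch built on the same stacked Series (one assumption: pandas >= 0.24, for Series.droplevel):
s = df.stack()
nz = s[s != 0]
# values per row label: a 5, a 2, b 2, b 7, d 2
values = nz.droplevel(1)
# column names per row label: a 2, a 3, b 1, b 4, d 2
columns = pd.Series(nz.index.get_level_values(1), index=nz.index.get_level_values(0))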

Related

What's the most efficient way to iterate by rows for each group of rows?

I'm wondering how to efficiently loop through rows by groups. As the following sample dataset shows, it includes three students with their pass records over three months.
import pandas as pd
import numpy as np

df = pd.DataFrame({'student': 'A A A B B B C C C'.split(),
                   'month': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'pass': [0, 1, 0, 0, 0, 0, 1, 0, 0]})
print(df)
  student  month  pass
0       A      1     0
1       A      2     1
2       A      3     0
3       B      1     0
4       B      2     0
5       B      3     0
6       C      1     1
7       C      2     0
8       C      3     0
I'd like to have a new column "pass_patch", which should be equal to "pass" at first; but once a student has "pass" equal to 1, all of their "pass_patch" values in the following months should be 1, like the following:
df = pd.DataFrame({'student': 'A A A B B B C C C'.split(),
                   'month': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'pass': [0, 1, 0, 0, 0, 0, 1, 0, 0],
                   'pass_patch': [0, 1, 1, 0, 0, 0, 1, 1, 1]})
print(df)
  student  month  pass  pass_patch
0       A      1     0           0
1       A      2     1           1
2       A      3     0           1
3       B      1     0           0
4       B      2     0           0
5       B      3     0           0
6       C      1     1           1
7       C      2     0           1
8       C      3     0           1
I did some searches and found iterrows might work, but was concerned it would be too slow to run over the whole dataset (around a million records). Would there be a more efficient way to do this?
Any suggestions would be greatly appreciated.
Try with cummax, which carries the running maximum forward within each group, so once a 1 appears it propagates to all later rows:
df['new'] = df.groupby('student')['pass'].cummax()
df
Out[78]:
  student  month  pass  new
0       A      1     0    0
1       A      2     1    1
2       A      3     0    1
3       B      1     0    0
4       B      2     0    0
5       B      3     0    0
6       C      1     1    1
7       C      2     0    1
8       C      3     0    1
What's the most efficient way to iterate by rows for each group of rows?

DON'T ITERATE MANUALLY
Manual iteration should always be your last resort; there is almost always a better way to perform the required operation than iterating row by row.
You can group by student, call cumsum to accumulate the values, convert the result to boolean, then convert it back to int:
df['pass_patch'] = df.groupby('student')['pass'].cumsum().astype(bool).astype(int)
OUTPUT:
  student  month  pass  pass_patch
0       A      1     0           0
1       A      2     1           1
2       A      3     0           1
3       B      1     0           0
4       B      2     0           0
5       B      3     0           0
6       C      1     1           1
7       C      2     0           1
8       C      3     0           1
PS: In the solution above, you can skip the .astype(bool).astype(int) part if there is at most one 1 in pass per group. You may also need to sort the dataframe by month within each student if it is not already sorted; I have not added that step since the sample data is already in order, but a sketch follows below.
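A minimal sketch of that sorting step, in case the months are not already ordered within each student (a hypothetical pre-processing step, not part of the original answer):
# order rows by month within each student before the cumulative sum
df = df.sort_values(['student', 'month']).reset_index(drop=True)
df['pass_patch'] = df.groupby('student')['pass'].cumsum().astype(bool).astype(int)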
import pandas as pd
import numpy as np

df = pd.DataFrame({'student': 'A A A B B B C C C'.split(),
                   'month': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'pass': [0, 1, 0, 0, 0, 0, 1, 0, 0]})
First, we find the month of the first pass for each student who passed at least once.
grp = df[df["pass"].eq(1)]\
    .sort_values(["student", "month"])\
    .groupby("student").head(1)
where grp looks like
  student  month  pass
1       A      2     1
6       C      1     1
Then we merge the dataframes
df = pd.merge(df,
              grp,
              on=["student"],
              how="left",
              suffixes=(None, '_y'))
and df looks like
  student  month  pass  month_y  pass_y
0       A      1     0      2.0     1.0
1       A      2     1      2.0     1.0
2       A      3     0      2.0     1.0
3       B      1     0      NaN     NaN
4       B      2     0      NaN     NaN
5       B      3     0      NaN     NaN
6       C      1     1      1.0     1.0
7       C      2     0      1.0     1.0
8       C      3     0      1.0     1.0
Finally, we set pass_patch to 1 for all months greater than or equal to month_y and to 0 otherwise.
df["pass_patch"] = np.where(
    df["month"].ge(df["month_y"]),
    1,
    0)
and we drop the columns we don't need anymore
df = df.drop(columns=["month_y", "pass_y"])
which returns
  student  month  pass  pass_patch
0       A      1     0           0
1       A      2     1           1
2       A      3     0           1
3       B      1     0           0
4       B      2     0           0
5       B      3     0           0
6       C      1     1           1
7       C      2     0           1
8       C      3     0           1
You can replace 0 with pd.NA, forward-fill within each student group, and then fill the remaining null values back with 0:
df['pass_patch'] = df['pass'].replace(0, pd.NA)
df['pass_patch'] = df.groupby('student')['pass_patch']\
    .transform(lambda x: x.ffill())\
    .fillna(0)\
    .astype(int)
Output:
  student  month  pass  pass_patch
0       A      1     0           0
1       A      2     1           1
2       A      3     0           1
3       B      1     0           0
4       B      2     0           0
5       B      3     0           0
6       C      1     1           1
7       C      2     0           1
8       C      3     0           1
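As a side note, GroupBy objects expose ffill directly, so the lambda can likely be dropped; a sketch under the same assumptions as above:
# forward-fill within each student, without a lambda
df['pass_patch'] = (df['pass'].replace(0, pd.NA)
                    .groupby(df['student']).ffill()
                    .fillna(0)
                    .astype(int))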

Pandas cumulative count on new value

I have a data frame like the one below.
df = pd.DataFrame()
df['col_1'] = [1, 1, 1, 2, 2, 2, 3, 3, 3]
df['col_2'] = ['A', 'B', 'B', 'A', 'B', 'C', 'A', 'A', 'B']
df
   col_1 col_2
0      1     A
1      1     B
2      1     B
3      2     A
4      2     B
5      2     C
6      3     A
7      3     A
8      3     B
I need to group by col_1 and, within each group, increment a cumulative count whenever a new value appears in col_2, producing something like the data frame below.
   col_1 col_2  col_3
0      1     A      1
1      1     B      2
2      1     B      2
3      2     A      1
4      2     B      2
5      2     C      3
6      3     A      1
7      3     A      1
8      3     B      2
I could do this using lists and a dictionary, but I couldn't find a way using pandas built-in functions.
Use factorize with a lambda function in GroupBy.transform:
df['col_3'] = df.groupby('col_1')['col_2'].transform(lambda x: pd.factorize(x)[0] + 1)
print(df)
   col_1 col_2  col_3
0      1     A      1
1      1     B      2
2      1     B      2
3      2     A      1
4      2     B      2
5      2     C      3
6      3     A      1
7      3     A      1
8      3     B      2
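If you prefer to avoid the lambda, here is a lambda-free sketch of the same idea (my alternative, not from the answer above): number the first occurrence of each (col_1, col_2) pair with cumcount, then merge those numbers back onto the full frame.
# keep only the first occurrence of each pair, in order of appearance
first = df.drop_duplicates(['col_1', 'col_2']).copy()
first['col_3'] = first.groupby('col_1').cumcount() + 1
# broadcast the numbering back to every row
df = df.merge(first[['col_1', 'col_2', 'col_3']], on=['col_1', 'col_2'], how='left')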

Pandas Dataframe groupby: apply several lambda functions at once

I group the following pandas dataframe by 'name' and then apply several lambda functions on 'value' to generate additional columns.
Is it possible to apply these lambda functions at once, to increase efficiency?
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, 3, 1, 2, 3, 1, 2, 3, 3]})
df['Diff'] = df.groupby('name')['value'].transform(lambda x: x - x.iloc[0])
df['Count'] = df.groupby('name')['value'].transform(lambda x: x.count())
df['Index'] = df.groupby('name')['value'].transform(lambda x: x.index - x.index[0] + 1)
print(df)
Output:
  name  value  Diff  Count  Index
0    A      1     0      2      1
1    A      3     2      2      2
2    B      1     0      4      1
3    B      2     1      4      2
4    B      3     2      4      3
5    B      1     0      4      4
6    C      2     0      3      1
7    C      3     1      3      2
8    C      3     1      3      3
It is possible to use GroupBy.apply with one function, though I am not sure the performance is better:
def f(x):
    a = x - x.iloc[0]
    b = x.count()
    c = x.index - x.index[0] + 1
    return pd.DataFrame({'Diff': a, 'Count': b, 'Index': c})

df = df.join(df.groupby('name')['value'].apply(f))
print(df)
  name  value  Diff  Count  Index
0    A      1     0      2      1
1    A      3     2      2      2
2    B      1     0      4      1
3    B      2     1      4      2
4    B      3     2      4      3
5    B      1     0      4      4
6    C      2     0      3      1
7    C      3     1      3      2
8    C      3     1      3      3
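For comparison, here is a sketch of the same three columns via vectorized GroupBy.transform calls, with no lambdas (this assumes each group's rows are contiguous, as in the sample, so that cumcount reproduces the position-based Index column):
g = df.groupby('name')['value']
# difference from the group's first value
df['Diff'] = df['value'] - g.transform('first')
# group size broadcast to each row
df['Count'] = g.transform('count')
# 1-based position within the group
df['Index'] = df.groupby('name').cumcount() + 1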

Make a table from 2 columns

I'm fairly new to Python.
I have 2 columns in a dataframe that look something like this:
db = pd.read_excel('path_to_file/file.xlsx')
db = db.loc[:, ['col1', 'col2']]
col1  col2
C     4
C     5
A     1
B     6
B     1
A     2
C     4
I need them to be like this:
   1  2  3  4  5  6
A  1  1  0  0  0  0
B  1  0  0  0  0  1
C  0  0  0  2  1  0
so that col1 values act as rows, col2 values act as columns, and each cell counts the number of coincidences.
Say your columns are called cat and val:
In [26]: df = pd.DataFrame({'cat': ['C', 'C', 'A', 'B', 'B', 'A', 'C'], 'val': [4, 5, 1, 6, 1, 2, 4]})

In [27]: df
Out[27]:
  cat  val
0   C    4
1   C    5
2   A    1
3   B    6
4   B    1
5   A    2
6   C    4
Then you can group the table hierarchically and unstack it:
In [28]: df.val.groupby([df.cat, df.val]).sum().unstack().fillna(0).astype(int)
Out[28]:
val  1  2  4  5  6
cat
A    1  2  0  0  0
B    1  0  0  0  6
C    0  0  8  5  0
Edit
As IanS pointed out, 3 is missing here (thanks!). If there's a range of columns you must have, then you can use
r = df.val.groupby([df.cat, df.val]).sum().unstack().fillna(0).astype(int)
for c in set(range(1, 7)) - set(df.val.unique()):
    r[c] = 0
I think you need to aggregate by size and add the missing columns with reindex:
print(df)
   a  b
0  C  4
1  C  5
2  A  1
3  B  6
4  B  1
5  A  2
6  C  4
df1 = (df.b.groupby([df.a, df.b])
           .size()
           .unstack()
           .reindex(columns=range(1, df.b.max() + 1))
           .fillna(0)
           .astype(int))
df1.index.name = None
df1.columns.name = None
print(df1)
   1  2  3  4  5  6
A  1  1  0  0  0  0
B  1  0  0  0  0  1
C  0  0  0  2  1  0
Instead of size you can use count; size counts NaN values, while count does not.
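For reference, the same count table can likely be built in one step with pd.crosstab; a sketch, assuming the columns are named a and b as above:
# crosstab counts co-occurrences directly; reindex adds the missing column 3
df1 = (pd.crosstab(df.a, df.b)
         .reindex(columns=range(1, df.b.max() + 1), fill_value=0))
df1.index.name = None
df1.columns.name = None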

Pandas number rows within group in increasing order

Given the following data frame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': ['a', 'a', 'b', 'a', 'a', 'a']})
df
   A  B
0  A  a
1  A  a
2  A  b
3  B  a
4  B  a
5  B  a
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
   A  B  C
0  A  a  1
1  A  a  2
2  A  b  1
3  B  a  1
4  B  a  2
5  B  a  3
I've tried this so far:
df['C'] = df.groupby(['A','B'])['B'].transform('rank')
...but it doesn't work!
Use groupby/cumcount:
In [25]: df['C'] = df.groupby(['A','B']).cumcount() + 1; df
Out[25]:
   A  B  C
0  A  a  1
1  A  a  2
2  A  b  1
3  B  a  1
4  B  a  2
5  B  a  3
Use the groupby.rank function.
Here is a working example.
df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
  C1  C2
0  a   1
1  a   2
2  a   3
3  b   4
4  b   5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
  C1  C2  RANK
0  a   1   1.0
1  a   2   2.0
2  a   3   3.0
3  b   4   1.0
4  b   5   2.0
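One caveat: groupby rank returns floats, which is why RANK displays as 1.0, 2.0, and so on; cast the result if integer labels are wanted (assuming C2 has no NaNs):
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True).astype(int)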
