Total count of strings within range in dataframe - python

I have a dataframe where I want to count the total number of occurrences of the word Yes within ranges of rows delimited by Dir, and then add that count as a new column.
Type,Problem
Parent,
Dir,Yes
File,
Opp,Yes
Dir,
Metadata,
Subfolder,Yes
Dir,
Opp,Yes
So whenever the word Yes appears in the Problem column between two Dir rows, I need a count to then appear next to the Dir at the beginning of the range.
Expected output would be:
Type Problem yes_count
Parent
Dir Yes 2
File
Opp Yes
Dir 1
Metadata
Subfolder Yes
Dir 1
Opp Yes
I could do something like yes_count = df['Problem'].str.count('Yes').sum() to get part of the way there. But how do I also account for the range?

Use:
# is the row a "Yes"?
m1 = df['Problem'].eq('Yes')
# is the row a "Dir"?
m2 = df['Type'].eq('Dir')
# form groups starting on each "Dir"
g = m1.groupby(m2.cumsum())
# count the number of "Yes" per group
# assign only on "Dir"
df['yes_count'] = g.transform('sum').where(m2)
Output:
Type Problem yes_count
0 Parent NaN NaN
1 Dir Yes 2.0
2 File NaN NaN
3 Opp Yes NaN
4 Dir NaN 1.0
5 Metadata NaN NaN
6 Subfolder Yes NaN
7 Dir NaN 1.0
8 Opp Yes NaN
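For reference, a minimal sketch that reproduces this end to end, assuming the sample data from the question:
import pandas as pd

df = pd.DataFrame({
    'Type': ['Parent', 'Dir', 'File', 'Opp', 'Dir', 'Metadata', 'Subfolder', 'Dir', 'Opp'],
    'Problem': [None, 'Yes', None, 'Yes', None, None, 'Yes', None, 'Yes'],
})

m1 = df['Problem'].eq('Yes')   # rows that are a "Yes"
m2 = df['Type'].eq('Dir')      # rows that open a new "Dir" range
df['yes_count'] = m1.groupby(m2.cumsum()).transform('sum').where(m2)
print(df)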

Related

Perform operations on a dataframe from groupings by ID

I have the following dataframe in Python:
ID  maths  value
0   add    12
1   sub    30
0   add    10
2   mult   3
0   sub    10
1   add    11
3   sub    40
2   add    21
My idea is to perform the following operations to get the result I want:
First step: Group the rows of the dataframe by ID. The order of the groups shall be indicated by the order of the original dataframe.
ID  maths  value
0   add    12
0   add    10
0   sub    10
1   sub    30
1   add    11
2   mult   3
2   add    21
3   sub    40
Second step: Within each group, create a new column 'result' by applying the operation named in the previous row's 'maths' column to the previous row's value and the current row's value. If there is no previous row in the group, this column gets the value NaN.
ID  maths  value  result
0   add    12     NaN
0   add    10     22
0   sub    10     20
1   sub    30     NaN
1   add    11     19
2   mult   3      NaN
2   add    21     63
3   sub    40     NaN
Third step: Return the resulting dataframe.
I have tried to implement this using the pandas groupby method, but I have trouble iterating with conditions over each row and each group, and I don't know how to create the new 'result' column on a groupby object.
grouped_df = testing.groupby('ID')
for key, item in grouped_df:
    print(grouped_df.get_group(key))
I don't know whether to use orderby or groupby or some other method that works for what I want to do. If you can help me with a better idea, I'd appreciate it.
import pandas as pd

# sample data, already in grouped order
ID = list("00011223")
maths = ["add", "add", "sub", "sub", "add", "mult", "add", "sub"]
value = [12, 10, 10, 30, 11, 3, 21, 40]
df = pd.DataFrame(list(zip(ID, maths, value)), columns=["ID", "Maths", "Value"])

# shift the operation and the value down one row within each ID group
df["Maths"] = df.groupby(["ID"]).pipe(lambda g: g.Maths.shift(1)).fillna("add")
df["Value1"] = df.groupby(["ID"]).pipe(lambda g: g.Value.shift(1))

# apply each operation to its own sub-group, then stitch the pieces back together
# (note: Series.append was removed in pandas 2.0; pd.concat is the modern equivalent)
df["result"] = df.groupby(["Maths"]).pipe(
    lambda g: (g.get_group("add")["Value1"] + g.get_group("add")["Value"]).append(
        g.get_group("sub")["Value1"] - g.get_group("sub")["Value"]).append(
        g.get_group("mult")["Value1"] * g.get_group("mult")["Value"])).sort_index()
Here is the Output:
df
Out[168]:
ID Maths Value Value1 result
0 0 add 12 NaN NaN
1 0 add 10 12.0 22.0
2 0 add 10 10.0 20.0
3 1 add 30 NaN NaN
4 1 sub 11 30.0 19.0
5 2 add 3 NaN NaN
6 2 mult 21 3.0 63.0
7 3 add 40 NaN NaN
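For comparison, here is a possibly simpler sketch of the same per-group logic using shift plus numpy.select; it assumes the original lowercase column names ID, maths and value from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 0, 2, 0, 1, 3, 2],
                   'maths': ['add', 'sub', 'add', 'mult', 'sub', 'add', 'sub', 'add'],
                   'value': [12, 30, 10, 3, 10, 11, 40, 21]})

# group rows by ID while keeping the original order inside each group (mergesort is stable)
df = df.sort_values('ID', kind='mergesort', ignore_index=True)

# previous row's operation and value within each group
prev_op = df.groupby('ID')['maths'].shift()
prev_val = df.groupby('ID')['value'].shift()

# apply the previous row's operation to (previous value, current value); NaN if no previous row
df['result'] = np.select(
    [prev_op.eq('add'), prev_op.eq('sub'), prev_op.eq('mult')],
    [prev_val + df['value'], prev_val - df['value'], prev_val * df['value']],
    default=np.nan)
print(df)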

Collapse a dataframe's column into its distinct values and create new columns based on another's frequency

I would like to take a dataframe such as:
USER PACKAGE
0 1 1
1 1 1
2 1 2
3 1 1
4 1 2
5 1 3
6 2 ...
And select the distinct USERs, then add new columns based on the frequency of the different packages, i.e. the highest-frequency package, the second highest, and so on.
User First Second Third
0 1 1 2 3
1 2 ...
I can implement this with for loops, but that's obviously bad with dataframes. I need to run this on millions of records and can't quite find a vectorized way of doing it.
Cheers
On SO you're supposed to attempt it and post your own code. Here are some hints for implementing the solution (a sketch of these steps follows after the hints):
Do .groupby('USER'), then .value_counts()
(you don't need to .sort(), since .value_counts() sorts by default)
take the .head(3)
then pivot into a table; in that same pivot command there's an option to add the column names 'First', 'Second', 'Third'
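A rough, hedged sketch of those hints (with one deviation: the 'First', 'Second', 'Third' names are attached here with a separate rename rather than inside the pivot call):
import pandas as pd

df = pd.DataFrame({'USER': [1, 1, 1, 1, 1, 1, 2],
                   'PACKAGE': [1, 1, 2, 1, 2, 3, 3]})

# counts per (USER, PACKAGE), sorted within each user by descending frequency
counts = df.groupby('USER')['PACKAGE'].value_counts()

# keep the three most frequent packages per user
top3 = counts.groupby(level='USER').head(3).reset_index(name='n')

# rank them 0, 1, 2 within each user and pivot the ranks into columns
top3['rank'] = top3.groupby('USER').cumcount()
out = (top3.pivot(index='USER', columns='rank', values='PACKAGE')
           .rename(columns=dict(enumerate(['First', 'Second', 'Third'])))
           .reset_index())
print(out)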
You can use SeriesGroupBy.value_counts with its default sorting, take the first 3 index values, convert them to a Series, reshape with Series.unstack, rename the columns, and finally convert the index to a column:
print (df)
USER PACKAGE
0 1 1
1 1 1
2 1 2
3 1 1
4 1 2
5 1 3
6 2 3
df = (df.groupby('USER')['PACKAGE']
        .apply(lambda x: pd.Series(x.value_counts().index[:3]))
        .unstack()
        .rename(columns=dict(enumerate(['First','Second','Third'])))
        .reset_index())
print (df)
USER First Second Third
0 1 1.0 2.0 3.0
1 2 3.0 NaN NaN
If you need all of the ranked values (not just the top three):
df = (df.groupby('USER')['PACKAGE']
        .apply(lambda x: pd.Series(x.value_counts().index))
        .unstack())
print (df)
0 1 2
USER
1 1.0 2.0 3.0
2 3.0 NaN NaN
EDIT: Another idea, hopefully faster:
s = (df.groupby('USER')['PACKAGE']
       .apply(lambda x: x.value_counts().index[:3]))
df = (pd.DataFrame(s.tolist(), index=s.index, columns=['First','Second','Third'])
        .reset_index())
print (df)
USER First Second Third
0 1 1 2.0 3.0
1 2 3 NaN NaN
I assumed you want to count the number of occurrences of each user and package combination:
USER =[1,1,1,1,1,1,2]
PACKAGE=[1,1,2,1,2,3,3]
df=pd.DataFrame({'user':USER,'package':PACKAGE})
results=df.groupby(['user','package']).size()
results=results.sort_values(ascending=False)
results=results.unstack(level='package').fillna(0)
results=results.rename(columns={1:'First',2:'Second',3:'Third'})
print(results)
output:
package First Second Third
user
1 3.0 2.0 1.0
2 0.0 0.0 1.0
For user 1, the highest-frequency package is type 1, the second highest is type 2, and the third highest is type 3. The highest rank for user 2 is type 3. You can do a lookup on the results to produce that output.
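For example, a hedged sketch of such a lookup, building the count table and reading the ranked packages off each row:
import pandas as pd

USER = [1, 1, 1, 1, 1, 1, 2]
PACKAGE = [1, 1, 2, 1, 2, 3, 3]
df = pd.DataFrame({'user': USER, 'package': PACKAGE})

# user x package count table
counts = df.groupby(['user', 'package']).size().unstack(fill_value=0)

# for each user, list the packages ordered by descending frequency
ranked = counts.apply(lambda row: row[row > 0].sort_values(ascending=False).index.tolist(), axis=1)
print(ranked)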
Try using Groupby:
df.groupby(['X']).get_group('A')

pandas replace is not replacing value even with inplace=True

My data looks like this. I would like to replace marital_status 'Missing' with 'Married' if 'no_of_children' is not NaN.
>cust_data_df[['marital_status','no_of_children']]
>
marital_status no_of_children
0 Married NaN
1 Married NaN
2 Missing 1
3 Missing 2
4 Single NaN
5 Single NaN
6 Married NaN
7 Single NaN
8 Married NaN
9 Married NaN
10 Single NaN
This is what I tried:
cust_data_df.loc[cust_data_df['no_of_children'].notna()==True, 'marital_status'].replace({'Missing':'Married'},inplace=True)
But this is not doing anything.
Assign the replaced values back to avoid chained assignment:
m = cust_data_df['no_of_children'].notna()
d = {'Missing':'Married'}
cust_data_df.loc[m, 'marital_status'] = cust_data_df.loc[m, 'marital_status'].replace(d)
If you need to set all the matched values directly:
cust_data_df.loc[m, 'marital_status'] = 'Married'
EDIT:
Thanks @Quickbeam2k1 for the explanation:
cust_data_df.loc[cust_data_df['no_of_children'].notna()==True, 'marital_status'] is just a new object with no reference back to the original. Replacing values there leaves the original object unchanged.
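As an alternative sketch, assuming the same column names, the whole update can also be written as a single vectorized assignment with numpy.where:
import numpy as np

# set 'Married' where children are recorded and the status is 'Missing'; keep everything else
cust_data_df['marital_status'] = np.where(
    cust_data_df['no_of_children'].notna() & cust_data_df['marital_status'].eq('Missing'),
    'Married',
    cust_data_df['marital_status'])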

Df.mean returns imaginary numbers

I have a dataframe with around 50 columns and around 3000 rows. Most cells are empty but not all of them. I am trying to add a new row at the end of the dataframe, with the mean value of each column and I need it to ignore the empty cells.
I am using df.mean(axis=0), which somehow turns all values of the dataframe into imaginary numbers. All values stay the same, but a +0j is added. I have no idea why.
Turbine.loc['Mean_Values'] = Turbine.mean(axis=0)
I couldn't find a solution for this. Is it because of the empty cells?
Based on this, df.mean() will automatically skip NaN/null values, since the skipna parameter defaults to True. Example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'value': [1, 2, 3, np.nan, 5, 6, np.nan]})
df = df.append({'value': df.mean(numeric_only=True).value}, ignore_index=True)
print(df)
Output:
value
0 1.0
1 2.0
2 3.0
3 NaN
4 5.0
5 6.0
6 NaN
7 3.4
But if there is a complex number in any cell, the result of df.mean() will be cast to a complex number. Example:
df=pd.DataFrame({'value':[1,2,3,np.nan,5,6,np.nan, complex(1,0)]})
print(df)
print('\n')
df=df.append({'value':df.mean(numeric_only=True).value}, ignore_index=True,)
print(df)
Output with a complex value in a cell:
value
0 (1+0j)
1 (2+0j)
2 (3+0j)
3 NaN
4 (5+0j)
5 (6+0j)
6 NaN
7 (1+0j)
value
0 (1+0j)
1 (2+0j)
2 (3+0j)
3 NaN
4 (5+0j)
5 (6+0j)
6 NaN
7 (1+0j)
8 (3+0j)
Hope this can help you :)
Some cells had information about directions (north, west, ...) in them, which were being interpreted as imaginary numbers.
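If stray non-numeric entries like that are the cause, one hedged fix is to coerce them to NaN before averaging (the column name and values below are made up for illustration):
import pandas as pd

Turbine = pd.DataFrame({'speed': [1.0, 2.0, 'north', 4.0]})  # hypothetical mixed column

# coerce anything non-numeric to NaN; mean() then skips it because skipna defaults to True
clean = Turbine.apply(pd.to_numeric, errors='coerce')
Turbine.loc['Mean_Values'] = clean.mean(axis=0)
print(Turbine)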

Cumulative count when two values match pandas

I am trying to create new columns that display a cumulative count based on values in two separate columns.
For the code below, I'm trying to create two new columns based on the Cause and Answer columns: for each value in the Answer column, if 'In' appears in the Cause column on that row, I want a cumulative count of that value in a new column.
import pandas as pd
d = ({
'Cause' : ['In','','','In','','In','In'],
'Answer' : ['Yes','No','Maybe','No','Yes','No','Yes'],
})
df = pd.DataFrame(d)
Output:
Answer Cause
0 Yes In
1 No
2 Maybe
3 No In
4 Yes
5 No In
6 Yes In
Intended Output:
Answer Cause Count_No Count_Yes
0 Yes In 1
1 No
2 Maybe
3 No In 1
4 Yes
5 No In 2
6 Yes In 2
I have tried the following but get an error.
df['cumsum'] = df.groupby(['Answer'])['Cause'].cumsum()
Here is one way -
for val in ['Yes', 'No']:
    cond = df.Answer.eq(val) & df.Cause.eq('In')
    df.loc[cond, 'Count_' + val] = cond[cond].cumsum()
df
# Cause Answer Count_Yes Count_No
#0 In Yes 1.0 NaN
#1 No NaN NaN
#2 Maybe NaN NaN
#3 In No NaN 1.0
#4 Yes NaN NaN
#5 In No NaN 2.0
#6 In Yes 2.0 NaN
Without a for loop :-)
s=df.loc[df.Cause=='In'].Answer.str.get_dummies()
pd.concat([df,s.cumsum().mask(s!=1,'')],axis=1).fillna('')
Out[62]:
Answer Cause No Yes
0 Yes In 1
1 No
2 Maybe
3 No In 1
4 Yes
5 No In 2
6 Yes In 2
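For what it's worth, a hedged sketch of the same idea with groupby plus cumcount, assuming the column names from the question:
# cumulatively number the 'In' rows within each Answer value
mask = df['Cause'].eq('In')
counts = df[mask].groupby('Answer').cumcount().add(1)
for val in ['Yes', 'No']:
    df['Count_' + val] = counts.where(df.loc[mask, 'Answer'].eq(val))
print(df)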
