Pandas countifs value from column and other column not null - python

I am trying to perform the equivalent to an Excel COUNTIFS formula in pandas, where the first range is a dataframe column, and the search criteria is each value in that column. The second search range is a different column and the criteria is non null values in that column.
Written as an Excel formula, it would look like: COUNTIFS(A:A,A2,B:B,"<>")
Here is some sample data:
data = {'ADJL':['BCF-364/BTS-1091/ADJL-4', 'BCF-130/BTS-389/ADJL-1', 'BCF-130/BTS-389/ADJL-1', 'BCF-130/BTS-389/ADJL-1', 'BCF-130/BTS-389/ADJL-1', 'BCF-130/BTS-389/ADJL-1', 'BCF-581/BTS-1742/ADJL-1', 'BCF-581/BTS-1742/ADJL-1'],
'LNCEL':['LNBTS-55/LNCEL-63', '', 'LNBTS-801/LNCEL-62', '', 'LNBTS-801/LNCEL-61', '', '', '']}
df = pd.DataFrame(data)
I need to add two columns to this. The first is a count of the value of each "ADJL". I found this solution for that column:
df['Count_of_ADJL'] = df.groupby('ADJL')['ADJL'].transform('count')
What I am stuck on is the next one, shown below in Excel. I need to calculate how many times the value in ADJL occurs throughout the entire ADJL column AND the LNCEL column is not empty.
I removed many other columns to simplify my question, so a solution where I can just add another column is ideal.
Many thanks in advance.

Use groupby.transform with np.count_nonzero (an empty string counts as zero, so blank LNCEL values are skipped):
import numpy as np

df['Count_of_ADJL'] = df.groupby('ADJL')['ADJL'].transform('count')
df['Count_of_ADJL & LNCEL not null'] = df.groupby('ADJL')['LNCEL'].transform(np.count_nonzero)
# or, if the blanks are NaN rather than empty strings:
df['Count_of_ADJL & LNCEL not null'] = df.groupby('ADJL')['LNCEL'].transform('count')
print(df)
                      ADJL               LNCEL  Count_of_ADJL  Count_of_ADJL & LNCEL not null
0  BCF-364/BTS-1091/ADJL-4   LNBTS-55/LNCEL-63              1                               1
1   BCF-130/BTS-389/ADJL-1                                  5                               2
2   BCF-130/BTS-389/ADJL-1  LNBTS-801/LNCEL-62              5                               2
3   BCF-130/BTS-389/ADJL-1                                  5                               2
4   BCF-130/BTS-389/ADJL-1  LNBTS-801/LNCEL-61              5                               2
5   BCF-130/BTS-389/ADJL-1                                  5                               2
6  BCF-581/BTS-1742/ADJL-1                                  2                               0
7  BCF-581/BTS-1742/ADJL-1                                  2                               0
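For reference, both columns can also be computed with a boolean mask instead of np.count_nonzero; a minimal self-contained sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'ADJL': ['BCF-364/BTS-1091/ADJL-4', 'BCF-130/BTS-389/ADJL-1',
             'BCF-130/BTS-389/ADJL-1', 'BCF-130/BTS-389/ADJL-1',
             'BCF-130/BTS-389/ADJL-1', 'BCF-130/BTS-389/ADJL-1',
             'BCF-581/BTS-1742/ADJL-1', 'BCF-581/BTS-1742/ADJL-1'],
    'LNCEL': ['LNBTS-55/LNCEL-63', '', 'LNBTS-801/LNCEL-62', '',
              'LNBTS-801/LNCEL-61', '', '', ''],
})

# COUNTIF(A:A, A2): size of each ADJL group
df['Count_of_ADJL'] = df.groupby('ADJL')['ADJL'].transform('size')

# COUNTIFS(A:A, A2, B:B, "<>"): per group, sum a "LNCEL is non-blank" mask
df['Count_of_ADJL & LNCEL not null'] = (
    df['LNCEL'].ne('').groupby(df['ADJL']).transform('sum')
)

print(df)
```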

Related

Rename x rows after value in DataFrame

I have a pandas DataFrame containing a time series of data.
Every full second the row contains a string with the name of a point; the next 4 rows contain random point ids. I want to rename those 4 rows after the name row to the name itself with an added suffix.
time ID
12:00:00,00 pointname1
12:00:00,20 12345
12:00:00,40 45645
12:00:00,60 78963
12:00:00,80 23432
12:00:01,00 pointname2
12:00:01,20 53454
12:00:01,40 24324
12:00:01,60 24324
12:00:01,80 42435
I want to transform this into:
time ID
12:00:00,00 pointname1
12:00:00,20 pointname1_1
12:00:00,40 pointname1_2
12:00:00,60 pointname1_3
12:00:00,80 pointname1_4
12:00:01,00 pointname2
12:00:01,20 pointname2_1
12:00:01,40 pointname2_2
12:00:01,60 pointname2_3
12:00:01,80 pointname2_4
I have a working solution by iterating over the entire DataFrame, detecting the 'pointname' rows and renaming the 4 rows after that. However, that takes a very long time with the 1.3million rows the data contains. Is there a more clever and efficient way of doing this?
Use Series.str.startswith with Series.where to set the non-matching values to missing, forward-fill them, then append a counter built with GroupBy.cumcount to every value except the first of each group:
df['ID'] = df['ID'].where(df['ID'].str.startswith('pointname')).ffill()
df['ID'] += df.groupby('ID').cumcount().astype(str).radd('_').replace('_0','')
print (df)
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4
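For reference, here is the approach above as a self-contained script on the sample data (the ID column is assumed to hold strings):

```python
import pandas as pd

df = pd.DataFrame({
    'time': ['12:00:00,00', '12:00:00,20', '12:00:00,40', '12:00:00,60', '12:00:00,80',
             '12:00:01,00', '12:00:01,20', '12:00:01,40', '12:00:01,60', '12:00:01,80'],
    'ID': ['pointname1', '12345', '45645', '78963', '23432',
           'pointname2', '53454', '24324', '24324', '42435'],
})

# keep only the 'pointname' rows, forward-fill over the id rows
df['ID'] = df['ID'].where(df['ID'].str.startswith('pointname')).ffill()
# append a per-group counter ('_1', '_2', ...) to every row but the first
df['ID'] += df.groupby('ID').cumcount().astype(str).radd('_').replace('_0', '')

print(df)
```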
You can use to_numeric (or str.startswith if your identifier is literal; the only important point is to get True for the rows to use as reference) to identify the name rows, then for all other rows use ffill and groupby.cumcount to build the new identifier:
# find rows with string identifier (could use other methods)
m = pd.to_numeric(df['ID'], errors='coerce').isna()
# or if "pointname" is literal
# m = df['ID'].str.startswith('pointname')
# for non matching rows, use previous value
# and add group counter
df.loc[~m, 'ID'] = (df['ID'].where(m).ffill()
                    + '_'
                    + df.groupby(m.cumsum()).cumcount().astype(str)
                    )
output:
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4
You can groupby the whole-second part of the time column and transform the ID column to add a suffix to the first value in each group.
df['ID'] = (df.groupby(pd.to_datetime(df['time']).dt.strftime('%H:%M:%S'))['ID']
              .transform(lambda col: [f'{col.iloc[0]}_{s}'
                                      for s in [''] + list(range(1, len(col)))])
              .str.rstrip('_'))
# or
df['ID'] = (df.groupby(pd.to_datetime(df['time']).dt.strftime('%H:%M:%S'))['ID']
              .transform(lambda col: [col.iloc[0]]
                                     + [f'{col.iloc[0]}_{s}' for s in range(1, len(col))]))
print(df)
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4

nunique compare two Pandas dataframe with duplicates and pivot them

My input:
df1 = pd.DataFrame({'frame':[ 1,1,1,2,3,0,1,2,2,2,3,4,4,5,5,5,8,9,9,10,],
'label':['GO','PL','ICV','CL','AO','AO','AO','ICV','PL','TI','PL','TI','PL','CL','CL','AO','TI','PL','ICV','ICV'],
'user': ['user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1','user1']})
df2 = pd.DataFrame({'frame':[ 1, 1, 2, 3, 4,0,1,2,2,2,4,4,5,6,6,7,8,9,10,11],
'label':['ICV','GO', 'CL','TI','PI','AO','GO','ICV','TI','PL','ICV','TI','PL','CL','CL','CL','AO','AO','PL','ICV'],
'user': ['user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2','user2']})
df_c = pd.concat([df1,df2])
I am trying to compare the two dfs frame by frame, checking whether each label in df1 exists in the same frame in df2, and then do some calculation with the result (a pivot, for example).
Here is my code:
m_df = df1.merge(df2,on=['frame'],how='outer' )
m_df['cross'] = m_df.apply(lambda row: 'Matched'
                           if row['label_x'] == row['label_y']
                           else 'Mismatched', axis='columns')
pv_m_unq = pd.pivot_table(m_df,
                          columns='cross',
                          index='label_x',
                          values='frame',
                          aggfunc=pd.Series.nunique, fill_value=0, margins=True)
pv_mc = pd.pivot_table(m_df,
                       columns='cross',
                       index='label_x',
                       values='frame',
                       aggfunc=pd.Series.count, fill_value=0, margins=True)
but this creates some problems:
first, I cannot calculate a "simple" total (the All column) of matched and mismatched as depicted in the picture; it comes out either duplicated, as for AO in pv_m, or as a wrong number, as for CL in pv_m_unq
and second, I think the way I use merge is not clever, because whenever frame+label is repeated in a df (which happens often), the merged df gets (number of matching rows in df1) × (number of matching rows in df2) rows for that frame+label
Maybe there is a smarter way to compare the dfs and pivot them?
You got the unexpected result in the margin total because the margin is computed with the same function passed to aggfunc (i.e. pd.Series.nunique in this case), and the Matched and Mismatched values in those rows are both equal to 1, hence only one unique value of 1. (You are currently getting the unique count of frame ids.)
You can probably achieve more or less what you want by taking the count (including the margin, Matched and Mismatched) instead of the unique count of frame ids, by using pd.Series.count in the last line of code:
pv_m = pd.pivot_table(m_df,columns='cross',index='label_x',values='frame', aggfunc=pd.Series.count, margins=True, fill_value=0)
Result
cross Matched Mismatched All
label_x
AO 0 1 1
CL 1 0 1
GO 1 1 2
ICV 1 1 2
PL 0 2 2
All 3 5 8
Edit
If all you need is to have the All column being the sum of Matched and Mismatched, you can do it as follows:
Change your code of generating pv_m_unq without building margin:
pv_m_unq = pd.pivot_table(m_df,
                          columns='cross',
                          index='label_x',
                          values='frame',
                          aggfunc=pd.Series.nunique, fill_value=0)
Then, we create the column All as the sum of Matched and Mismatched for each row, as follows:
pv_m_unq['All'] = pv_m_unq['Matched'] + pv_m_unq['Mismatched']
Finally, create the row All as the sum of Matched and Mismatched for each column and append it as the last row, as follows:
row_All = pd.Series({'Matched': pv_m_unq['Matched'].sum(),
                     'Mismatched': pv_m_unq['Mismatched'].sum(),
                     'All': pv_m_unq['All'].sum()},
                    name='All')
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
pv_m_unq = pd.concat([pv_m_unq, row_All.to_frame().T])
Result:
print(pv_m_unq)
Matched Mismatched All
label_x
AO 1 3 4
CL 1 2 3
GO 1 1 2
ICV 2 4 6
PL 1 5 6
TI 2 3 5
All 8 18 26
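As a side note, a count-based table like pv_m (margins included) can also be built with pd.crosstab; a small sketch with made-up two-frame data:

```python
import pandas as pd

df1 = pd.DataFrame({'frame': [1, 1, 2], 'label': ['GO', 'PL', 'CL']})
df2 = pd.DataFrame({'frame': [1, 1, 2], 'label': ['ICV', 'GO', 'CL']})

# cross-join within each frame, then flag matching label pairs
m_df = df1.merge(df2, on='frame', how='outer', suffixes=('_x', '_y'))
m_df['cross'] = (m_df['label_x'] == m_df['label_y']).map(
    {True: 'Matched', False: 'Mismatched'})

# counts per label with row/column totals, like a count-based pivot_table
pv = pd.crosstab(m_df['label_x'], m_df['cross'], margins=True)
print(pv)
```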
You can use the isin() function like this (note that this compares labels globally, not per frame):
df3 =df1[df1.label.isin(df2.label)]
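If the comparison should be per frame rather than global, merge with indicator=True flags which (frame, label) pairs of df1 also occur in df2; a sketch with made-up data:

```python
import pandas as pd

df1 = pd.DataFrame({'frame': [1, 1, 2], 'label': ['GO', 'PL', 'CL']})
df2 = pd.DataFrame({'frame': [1, 2, 2], 'label': ['GO', 'CL', 'TI']})

# '_merge' is 'both' when the (frame, label) pair exists in df2 as well
m = df1.merge(df2, on=['frame', 'label'], how='left', indicator=True)
m['cross'] = m['_merge'].eq('both').map({True: 'Matched', False: 'Mismatched'})
print(m[['frame', 'label', 'cross']])
```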

changing index of 1 row in pandas

I have the below df built from a pivot of a larger df. In this table 'week' is the index (dtype = object) and I need to show week 53 as the first row instead of the last.
Can someone advise please? I tried reindex and custom sorting but can't find the way.
Thanks!
Here is the table:
Since you can't insert the row and push the others back directly, a clever trick is to create a new ordering:
# adds a new column, "new" with the original order
df['new'] = range(1, len(df) + 1)
# sets value that has index 53 with 0 on the new column
# note that this comparison requires you to match index type
# so if weeks are object, you should compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0
# sorts values by the new column and drops it
df = df.sort_values("new").drop('new', axis=1)
Before:
numbers
weeks
1 181519.23
2 18507.58
3 11342.63
4 6064.06
53 4597.90
After:
numbers
weeks
53 4597.90
1 181519.23
2 18507.58
3 11342.63
4 6064.06
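Alternatively, the same reordering can be done without a helper column using sort_index with a key; a minimal sketch assuming an integer week index (compare against '53' instead if the index holds strings):

```python
import pandas as pd

df = pd.DataFrame({'numbers': [181519.23, 18507.58, 11342.63, 6064.06, 4597.90]},
                  index=pd.Index([1, 2, 3, 4, 53], name='weeks'))

# the key maps week 53 below every other week; the rest keep their order
df = df.sort_index(key=lambda idx: idx.where(idx != 53, -1))
print(df)
```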
One way of doing this would be:
import pandas as pd
df = pd.DataFrame(range(10))
new_df = df.loc[[df.index[-1]]+list(df.index[:-1])].reset_index(drop=True)
output:
   0
0  9
1  0
2  1
3  2
4  3
5  4
6  5
7  6
8  7
9  8
Alternate method (using the asker's week column, here named "Year week"):
new_df = pd.concat([df[df["Year week"]==53], df[~(df["Year week"]==53)]])

python panda new column with order of values

I would like to make a new column with the order of the numbers in the 'list' column. My code gives 3,1,0,4,2,5 (the indices of the numbers from lowest to highest), but I would like 2,1,4,0,3,5, so that each row tells me where its number ranks in the whole list. What am I doing wrong?
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df.sort_values(by='list').index
print(df)
What you're looking for is the rank:
import pandas as pd
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df['list'].rank().sub(1).astype(int)
Result:
list order
0 4 2
1 3 1
2 6 4
3 1 0
4 5 3
5 9 5
You can use the method parameter to control how to resolve ties.
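For instance, with tied values (a quick sketch):

```python
import pandas as pd

s = pd.Series([4, 3, 3, 1])
# ties share the lowest of the tied ranks
print(s.rank(method='min').tolist())    # [4.0, 2.0, 2.0, 1.0]
# ties broken by order of appearance
print(s.rank(method='first').tolist())  # [4.0, 2.0, 3.0, 1.0]
```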

Dropping filtered rows in pandas

I need to filter phone numbers and drop the filtered rows. I couldn't combine the conditions into one filter, so I made two filters. When I drop with the first filter everything works fine, but when I drop with the second filter I get a warning (even though the resulting data frame is filtered correctly).
What do I need to change to avoid the warning?
import pandas as pd
df = pd.DataFrame({"Phone" :['+77013655566','87014324366','7014324366','11111','999999','43434343','+77015452313','7012334212','87010956612', '7777777', '8888888']})
print(df)
Phone
0 +77013655566
1 87014324366
2 7014324366
3 11111
4 999999
5 43434343
6 +77015452313
7 7012334212
8 87010956612
9 7777777
10 8888888
phone_filter = ((df['Phone'].map(str) == '8888888') |
(df['Phone'].map(str) == '7777777'))
phone_filter2 = ((df['Phone'].map(str).str[0] != '8') &
(df['Phone'].map(str).str[0] != '7') &
(df['Phone'].map(str).str[0] != '+'))
df.drop(df[phone_filter].index, inplace = True)
df.drop(df[phone_filter2].index, inplace = True)
<ipython-input-83-80183cb110d3>:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
Expected output:
print(df)
Phone
0 +77013655566
1 87014324366
2 7014324366
6 +77015452313
7 7012334212
8 87010956612
Use a single combined mask; the warning appears because phone_filter2 was built against the original index, which no longer matches the dataframe after the first drop:
invalid_numbers = ['8888888', '7777777']
df[(~df.Phone.isin(invalid_numbers)) & (df.Phone.str[0].isin(['8','7','+']))]
Output:
Phone
0 +77013655566
1 87014324366
2 7014324366
6 +77015452313
7 7012334212
8 87010956612
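The same result in one mask, using the fact that str.startswith accepts a tuple of prefixes; a self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({'Phone': ['+77013655566', '87014324366', '7014324366',
                             '11111', '999999', '43434343', '+77015452313',
                             '7012334212', '87010956612', '7777777', '8888888']})

# keep numbers starting with 7, 8 or '+', excluding the two dummy values
keep = (df['Phone'].str.startswith(('7', '8', '+'))
        & ~df['Phone'].isin(['7777777', '8888888']))
df = df[keep]
print(df)
```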
