Finding index of a data frame comparing with another data frame - python

I have two data frames, df and df1. Both have a column called des (which may not be unique). I want to get the index of df where the des value matches the des value in df1.
df
Name des
0 xyz1 abc
1 xyz2 bcd
2 xyz3 nna
3 xyz4 mmm
4 xyz5 man
df1
des
0 abc
1 nna
2 bcd
3 man
Output required:
df1
des index_df
0 abc 0
1 nna 2
2 bcd 1
3 man 4

This is possible with the .loc accessor and reset_index to elevate the index to a column:
res = df.loc[df['des'].isin(set(df1['des'])), 'des'].reset_index()
# index des
# 0 0 abc
# 1 1 bcd
# 2 2 nna
# 3 4 man

Use map with a Series created from column des, with the index and values swapped:
s = pd.Series(df.index, index=df['des'])
df1['index_df'] = df1['des'].map(s)
print (df1)
des index_df
0 abc 0
1 nna 2
2 bcd 1
3 man 4
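If des is not unique in df (which the question allows), the map approach above will raise because the mapping Series has a duplicated index. A minimal sketch of a merge-based alternative, assuming one output row per match is acceptable (the column name index_df follows the requested output):
import pandas as pd

df = pd.DataFrame({'Name': ['xyz1', 'xyz2', 'xyz3', 'xyz4', 'xyz5'],
                   'des': ['abc', 'bcd', 'nna', 'mmm', 'man']})
df1 = pd.DataFrame({'des': ['abc', 'nna', 'bcd', 'man']})

# elevate df's index to a column, then left-merge on des;
# duplicated des values in df simply produce one output row per match
res = (df1.merge(df.reset_index()[['index', 'des']], on='des', how='left')
          .rename(columns={'index': 'index_df'}))
print(res)
#    des  index_df
# 0  abc         0
# 1  nna         2
# 2  bcd         1
# 3  man         4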

Sort column names using wildcard in pandas

I have a big dataframe with more than 100 columns. I am sharing a miniature version of my real dataframe below
ID rev_Q1 rev_Q5 rev_Q4 rev_Q3 rev_Q2 tx_Q3 tx_Q5 tx_Q2 tx_Q1 tx_Q4
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
I would like to do the below:
a) sort the column names based on quarters (ex: Q1,Q2,Q3,Q4,Q5..Q100..Q1000) for each column pattern
b) by column pattern, I mean the keyword before the underscore, which is rev and tx.
So, I tried the below, but it doesn't work and it also shifts the ID column to the back:
df = df.reindex(sorted(df.columns), axis=1)
I expect my output to be as below. In my real data, there are more than 100 columns with more than 30 patterns like rev, tx, etc. I want my ID column to stay in the first position, as shown below.
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
For the provided example, df.sort_index(axis=1) should work fine.
If you have Q values higher than 9, use natural sorting with natsort:
from natsort import natsort_key
out = df.sort_index(axis=1, key=natsort_key)
Or use manual sorting with np.lexsort:
import numpy as np

idx = df.columns.str.split('_Q', expand=True, n=1)
order = np.lexsort([idx.get_level_values(1).astype(float), idx.get_level_values(0)])
out = df.iloc[:, order]
Something like:
new_order = list(df.columns)
new_order.remove("ID")              # remove() works in place and returns None
new_order = ['ID'] + sorted(new_order)
df = df[new_order]
We manually put "ID" in front and then sort what remains.
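One caveat: plain sorted() compares strings, so rev_Q10 would land before rev_Q2. A minimal sketch of a numeric-aware key, assuming every non-ID column follows the <prefix>_Q<number> pattern:
def quarter_key(col):
    prefix, q = col.split('_Q')    # e.g. "rev_Q10" -> ("rev", "10")
    return (prefix, int(q))        # sort by prefix, then numeric quarter

cols = [c for c in df.columns if c != 'ID']
df = df[['ID'] + sorted(cols, key=quarter_key)]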
The idea is to create a dataframe from the column names, with two columns: one for the variable and another for the quarter number. Finally, sort this dataframe by values, then extract the index.
idx = (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
.fillna(0).astype({'Q': int})
.sort_values(by=['V', 'Q']).index)
df = df.iloc[:, idx]
Output:
>>> df
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
0 1 1 1 1 1 1 1 1 1 1 1
1 2 1 1 1 1 1 1 1 1 1 1
>>> (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
.fillna(0).astype({'Q': int})
.sort_values(by=['V', 'Q']))
V Q
0 0 0
1 rev 1
5 rev 2
4 rev 3
3 rev 4
2 rev 5
9 tx 1
8 tx 2
6 tx 3
10 tx 4
7 tx 5

How to filter a dataframe by removing NULL values from selected rows in python?

import pandas as pd

df = pd.DataFrame({'dept':['dept1','dept2','dept3','dept4','dept5'],
                   'room1':['0','1','1','NA','1'],
                   'room2':['1','0','NA','1','1'],
                   'room3':['0','0','1','NA','1'],
                   'room4':['1','NA','1','1','1'],
                   'count':['4','3','3','2','4']})
dept room1 room2 room3 room4 count
0 dept1 0 1 0 1 4
1 dept2 1 0 0 NA 3
2 dept3 1 NA 1 1 3
3 dept4 NA 1 NA 1 2
4 dept5 1 1 1 1 4
I have a selectbox where the user can filter the required data and display records based on their selection.
expected result:
if user select dept2 :
dept room1 room2 room3
0 dept2 1 0 0
if user select dept4:
dept room2 room4
0 dept4 1 1
code:
option_dept = df["dept"].unique().tolist()
selected_dept = st.multiselect("search by departement", option_dept)
if selected_dept:
    df = df[df["dept"].isin(selected_dept)]
    st.write(df)
The problem is that with this code all the columns are displayed.
How can I remove the columns that include NA or null values in each selected row?
Select only the dept and room columns, replace the possible 'NA' strings with NaN and remove the columns with missing values:
import numpy as np

df = df[df["dept"].isin(selected_dept)].filter(regex='room|dept').replace('NA', np.nan).dropna(axis=1)
Or:
df = df[df["dept"].isin(selected_dept)].drop('count', axis=1).replace('NA', np.nan).dropna(axis=1)
Suppose the user selects dept2; then this code would give you the desired output (note that this assumes the NA values are actual NaN, not the string 'NA'):
pd.DataFrame(df.loc[1, :].dropna()).T
Output -
    dept room1 room2 room3 count
1  dept2     1     0     0     3
For other dept values, just change the row label in the loc call. You can even set the index of the dataframe to the dept value using df.set_index("dept") and then use df.loc["dept2", :] to get the data for that row.
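A minimal sketch of that set_index idea, reusing df from the question (so the 'NA' strings are first converted to real NaN) and dropping the count column to match the expected output:
import numpy as np

# df as defined in the question; turn the 'NA' strings into real NaN and index by dept
dept_df = df.replace('NA', np.nan).set_index('dept').drop(columns='count')

selected = 'dept2'                            # e.g. the value picked in the selectbox
row = dept_df.loc[[selected]].dropna(axis=1)  # keep only the non-missing columns for that row
print(row.reset_index())
#     dept room1 room2 room3
# 0  dept2     1     0     0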

Need to add a column in one dataset from another dataset while keeping values intact

I have two dataframes, df and df1, where
df:
sr. name sector
===================================
0 newyork2510 2
1 boston76w2 1
2 chicago785dw 1
3 san891dwn39210114 1
4 f2391rpg 1
and then in df1 I have
code class
=========================
bo2510 2
on76w2 1
bio5dw 1
tin018 1
retm11 1
I want to add the column "code" into df so that it looks exactly like the values in df1:
sr. name sector code
===================================
0 newyork2510 2 bo2510
1 boston76w2 1 on76w2
2 chicago785dw 1 bio5dw
3 san891dwn39210114 1 tin018
4 f2391rpg 1 retm11
I am trying this, but without success:
df['code'] = df1['code'].values
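For what it's worth, the line above assigns positionally, so it should already work when both frames have the same number of rows; if the lengths differ or the rows don't line up, you would need a join key, which the shown data doesn't provide. A minimal sketch, assuming the rows of df1 correspond one-to-one, in order, to the rows of df:
# positional assignment: ignores both indexes, requires equal length and matching row order
df['code'] = df1['code'].to_numpy()

# equivalent, aligning on a freshly reset index
df = df.reset_index(drop=True)
df['code'] = df1['code'].reset_index(drop=True)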

Group by multiple columns and pivot and count values from other column in pandas

I have a dataframe
city skills priority acknowledge id_count acknowledge_count
ABC XXX High Yes 11 2
ABC XXX High No 10 3
ABC XXX Med Yes 5 1
ABC YYY Low No 1 5
I want to group by city and skills and get total_id_count from the column id_count, split into three separate columns from priority as High, Med, Low.
Similarly, for total_acknowledge_count, take the acknowledge column (Yes, No).
output required:
             total_id_count   total_acknowledge_count
city,skills  High  Med  Low   Yes  No
ABC,XXX        21    5    0     3   3    # 21 = 11+10, 3 = 2+1
ABC,YYY         0    0    1     0   5
I am trying different methods like pivot_table and groupby & stack, but it seems very difficult.
Is there any way to achieve this result?
You'll need to pivot separately for the total_id_count and the total_acknowledge_count here, since you have two separate column/value schemes for the aggregation:
piv1 = df.pivot_table(index=['city', 'skills'], columns='priority',
                      values='id_count', aggfunc='sum', fill_value=0)
piv2 = df.pivot_table(index=['city', 'skills'], columns='acknowledge',
                      values='acknowledge_count', aggfunc='sum', fill_value=0)
piv1.columns = pd.MultiIndex.from_product([['id_count'], piv1.columns])
piv2.columns = pd.MultiIndex.from_product([['acknowledge_count'], piv2.columns])
output = pd.concat([piv1, piv2], axis=1)
print(output)
id_count acknowledge_count
High Low Med No Yes
city skills
ABC XXX 21 0 5 3 3
YYY 0 1 0 5 0
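A possible follow-up, if the top-level column names should literally read total_id_count and total_acknowledge_count as in the question (a small sketch on top of the output above):
# rename only level 0 of the MultiIndex columns to match the requested names
output = output.rename(columns={'id_count': 'total_id_count',
                                'acknowledge_count': 'total_acknowledge_count'},
                       level=0)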

How to count particular column values in python pandas?

I have a dataframe like below:
df1_data = {'sym' :{0:'AAA',1:'BBB',2:'CCC',3:'AAA',4:'CCC',5:'DDD',6:'EEE',7:'EEE',8:'FFF'},
'identity' :{0:'AD',1:'AD',2:'AU',3:'AU',4:'AU',5:'AZ',6:'AU',7:'AZ',8:'AZ'}}
I want to check the sym column in my dataframe. My intention is to generate two different files: one containing the same two columns (sym, identity), and a second file containing the columns sym,sym_count,AD_count,AU_count,neglected_count.
Edit 1 -
I want to ignore identities other than AD & AU. In both output files I don't want results for identities other than AD & AU; the neglected_count column is optional.
Expected Result-
result.csv
sym,identity
AAA,AD
AAA,AU
BBB,AD
CCC,AU
CCC,AU
EEE,AU
result_count.csv
sym,sym_count,AD_count,AU_count,neglected_count
AAA,2,1,1,0
BBB,1,1,0,0
CCC,2,0,2,0
EEE,2,0,1,1
How can I perform this type of calculation in pandas?
I think you need crosstab, with insert to add the sum column at the first position and add_suffix for the column names.
Last, write with to_csv.
df1_data = {'sym' :{0:'AAA',1:'BBB',2:'CCC',3:'AAA',4:'CCC',5:'DDD',6:'EEE',7:'EEE',8:'FFF'},
'identity' :{0:'AD',1:'AD',2:'AU',3:'AU',4:'AU',5:'AZ',6:'AU',7:'AZ',8:'AZ'}}
df = pd.DataFrame(df1_data, columns=['sym','identity'])
print (df)
sym identity
0 AAA AD
1 BBB AD
2 CCC AU
3 AAA AU
4 CCC AU
5 DDD AZ
6 EEE AU
7 EEE AZ
8 FFF AZ
#write to csv
df.to_csv('result.csv', index=False)
# keep only these vals in identity
vals = ['AD','AU']
# replace all other identity values with 'neglected'
neglected = df.loc[~df.identity.isin(vals), 'identity'].unique().tolist()
neglected = {x:'neglected' for x in neglected}
print (neglected)
{'AZ': 'neglected'}
df.identity = df.identity.replace(neglected)
df1 = pd.crosstab(df['sym'], df['identity'])
df1.insert(0, 'sym', df1.sum(axis=1))
df2 = df1.add_suffix('_count').reset_index()
# keep only rows with a nonzero value in at least one of the vals columns
mask = ~df2.filter(regex='|'.join(vals)).eq(0).all(axis=1)
print (mask)
0 True
1 True
2 True
3 False
4 True
5 False
dtype: bool
#boolean indexing
df2 = df2[mask]
print (df2)
identity sym sym_count AD_count AU_count neglected_count
0 AAA 2 1 1 0
1 BBB 1 1 0 0
2 CCC 2 0 2 0
4 EEE 2 0 1 1
df2.to_csv('result_count.csv', index=False)
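One hedged note: the question's edit asks result.csv to contain only AD/AU rows, while the code above writes the full frame. A small adjustment would be to filter on vals (defined above) before writing:
# keep only rows whose identity is in vals, then write
df[df['identity'].isin(vals)].to_csv('result.csv', index=False)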
