why is pd.crosstab not giving the expected output in python pandas?


I have two DataFrames, which I am calling df1 and df2.
df1 has the columns KPI and Context, and it looks like this:
KPI Context
0 Does the company have a policy in place to man... Anti-Bribery Policy\nBroadridge does not toler...
1 Does the company have a supplier code of conduct? Vendor Code of Conduct Our vendors play an imp...
2 Does the company have a grievance/complaint ha... If you ever have a question or wish to report ...
3 Does the company have a human rights policy ? Human Rights Statement of Commitment Broadridg...
4 Does the company have a policies consistent wi... Anti-Bribery Policy\nBroadridge does not toler...
df2 has a single column, 'Keyword'.
df2:
Keyword
0 1.5 degree
1 1.5°
2 2 degree
3 2°
4 accident
I want to create another DataFrame from these two: for each value in the 'Keyword' column of df2, count how many times it appears in the 'Context' of each row of df1.
I used pd.crosstab() for this, but I suspect it is not giving me the expected output.
Here is what I have tried so far:
new_df = df1.explode('Context')
new_df1 = df2.explode('Keyword')
new_df = pd.crosstab(new_df['KPI'], new_df1['Keyword'], values=new_df['Context'], aggfunc='count').reset_index().rename_axis(columns=None)
print(new_df.head())
new_df looks like this:
KPI 1.5 degree 1.5° \
0 Does the Supplier code of conduct cover one or... NaN NaN
1 Does the companies have sites/operations locat... NaN NaN
2 Does the company have a due diligence process ... NaN NaN
3 Does the company have a grievance/complaint ha... NaN NaN
4 Does the company have a grievance/complaint ha... NaN NaN
2 degree 2° accident
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 1.0 NaN NaN
4 NaN NaN NaN
The output I expect is something like this:
0 KPI 1.5 degree 1.5° 2 degree 2° accident
1 Does the company have a policy in place to man 44 2 3 5 9
What exactly am I missing? Please let me know, thanks!

There are multiple problems. First, explode works on list-like values, not on plain strings. Second, to extract the keywords from Context you need Series.str.findall. Third, crosstab should use columns from the same DataFrame, not from two different ones:
import re
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['Keyword'])
df1['new'] = df1['Context'].str.findall(pat, flags=re.I)
new_df = df1.explode('new')
out = pd.crosstab(new_df['KPI'], new_df['new'])
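To see the findall, explode, and crosstab steps working end to end, here is a minimal sketch on made-up data. The two small frames and their contents are assumptions for illustration, not the asker's real df1 and df2; the reset_index and reindex calls are additions to the approach above.

```python
import re
import pandas as pd

# Hypothetical miniature stand-ins for df1 and df2.
df1 = pd.DataFrame({
    "KPI": ["Policy A", "Policy B"],
    "Context": [
        "An accident report: one accident at 2 degree warming.",
        "No keywords here.",
    ],
})
df2 = pd.DataFrame({"Keyword": ["2 degree", "accident"]})

# One alternation pattern built from the escaped keywords.
pat = "|".join(r"\b{}\b".format(re.escape(k)) for k in df2["Keyword"])

# Each row gets the list of keywords found in its Context.
df1["new"] = df1["Context"].str.findall(pat, flags=re.I)

# One row per match; reset_index gives unique row labels afterwards.
new_df = df1.explode("new").reset_index(drop=True)

# Count matches per KPI and keyword.
out = pd.crosstab(new_df["KPI"], new_df["new"])

# Rows without any match vanish from the crosstab; bring them back as zeros.
out = out.reindex(df1["KPI"].unique(), fill_value=0)
print(out)
```

Two caveats, as far as I can tell: with flags=re.I the matches keep the casing found in the text, so "Accident" and "accident" would land in separate columns unless the matches are lowercased first; and a keyword ending in a non-word character such as "1.5°" only satisfies the trailing \b when a letter or digit follows it directly.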

Related

Converting unique values in a particular column in Excel to new columns and assigning values for those columns from another existing column using Python

I am trying to convert an Excel sheet from one format to another.
Current Format
enter image description here
Expected Format
enter image description here
The Map Criteria list is exhaustive and values may not be present in all cases.
Though I am able to do it in Excel itself, I need to schedule it to run on a recurring basis. That's why I am trying to solve it using Python.
I tried the DataFrame aggregate and melt functions, which are not giving the intended result.

You're looking for:
df.pivot(index=['Map ID','Map Name'], columns=['Map Criteria'], values='Map Values')
Output:
Brand Country Department Gender Product
Map ID Map Name
1 AAA Brand A United KIngdom Marketing Male Laptop
2 BBB Brand B NaN Finance NaN NaN
3 CCCC NaN United Kindgom NaN NaN NaN
4 DDD NaN NaN NaN NaN Mobile
5 DDD NaN NaN NaN Female NaN
Input data:
df = pd.DataFrame({
'Map ID': [1,1,1,1,1, 2,2, 3,4,5],
'Map Name': [*['AAA']*5,*['BBB']*2,'CCCC',*['DDD']*2],
'Map Criteria': ['Brand','Department','Country','Product','Gender']*2,
'Map Values': ['Brand A','Marketing','United KIngdom','Laptop','Male','Brand B','Finance','United Kindgom','Mobile','Female']
})
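One thing to watch for: pivot raises a ValueError when the same (Map ID, Map Name, Map Criteria) combination occurs more than once. If that can happen in the real sheet, a possible variant is pivot_table with aggfunc='first'. The sketch below uses a small made-up frame with a deliberate duplicate; keeping the first value is an assumption about what is wanted.

```python
import pandas as pd

# Hypothetical data with a duplicated (Map ID, Map Criteria) pair for ID 1.
df = pd.DataFrame({
    "Map ID": [1, 1, 1, 2],
    "Map Name": ["AAA", "AAA", "AAA", "BBB"],
    "Map Criteria": ["Brand", "Brand", "Country", "Brand"],
    "Map Values": ["Brand A", "Brand A2", "UK", "Brand B"],
})

# pivot_table aggregates duplicates; 'first' keeps the first value seen.
wide = df.pivot_table(
    index=["Map ID", "Map Name"],
    columns="Map Criteria",
    values="Map Values",
    aggfunc="first",
).reset_index()

# Drop the leftover "Map Criteria" label on the column axis.
wide.columns.name = None
print(wide)
```

The reset_index call turns Map ID and Map Name back into ordinary columns, which tends to be more convenient when writing the result back to Excel.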

Pandas - Grouping data by index

This is not my actual data, just a representation of a larger set.
I have a dataframe (df) that looks like this:
id text_field text_value
1 date 2021-07-01
1 hour 07:04
2 available yes
2 sold no
Due to project demands I need to manipulate this data up to a certain point. The main part is this one:
df.set_index(['id','text_field'], append=True).unstack().droplevel(0,1).droplevel(0)
Leaving me with something like this:
text_field date hour available sold
id
1 2021-07-01 NaN NaN NaN
1 NaN 07:04 NaN NaN
2 NaN NaN yes NaN
2 NaN NaN NaN no
That is very close to what I need, but I'm failing to achieve the next step. I need to group this data by id, leaving only one row per id.
Something like this:
text_field date hour available sold
id
1 2021-07-01 07:04 NaN NaN
2 NaN NaN yes no
Can somebody help me?

As mentioned by @Nk03 in the comments, you could use the pivot feature of pandas:
import pandas as pd
# Creating example dataframe
data = {
'id': [1, 1, 2, 2],
'text_field': ['date', 'hour', 'available', 'sold'],
'text_value': ['2021-07-01', '07:04', 'yes', 'no']
}
df = pd.DataFrame(data)
# Pivoting on dataframe
df_pivot = df.pivot(index='id', columns='text_field')
print(df_pivot)
Console output:
text_value
text_field available date hour sold
id
1 NaN 2021-07-01 07:04 NaN
2 yes NaN NaN no
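The output above has a two-level column header because no values column was named. A small variation, assuming the single-level layout from the question is what is wanted, passes values='text_value' and then reorders the columns:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "text_field": ["date", "hour", "available", "sold"],
    "text_value": ["2021-07-01", "07:04", "yes", "no"],
})

# Naming the values column yields a single-level column index.
flat = df.pivot(index="id", columns="text_field", values="text_value")

# pivot sorts the new columns alphabetically; restore the original order.
flat = flat[["date", "hour", "available", "sold"]]
print(flat)
```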

How to use two rows to create a new row in python pandas

Here is a sample of a df I am working with. I am particularly interested in these two columns rusher and receiver.
rusher receiver
0 A.Ekeler NaN
1 NaN S.Barkley
2 C.Carson NaN
3 J.Jacobs NaN
4 NaN K.Drake
I want to run a groupby that considers all of these names in both columns (because the same name can show up in both columns).
My idea is to create a new column player, and then I can just groupby player, if that makes sense. Here is what I want my output to look like
rusher receiver player
0 A.Ekeler NaN A.Ekeler
1 NaN S.Barkley S.Barkley
2 C.Carson NaN C.Carson
3 J.Jacobs NaN J.Jacobs
4 NaN K.Drake K.Drake
I would like to take the name from whichever column it is listed under in that particular row and place it into the player column, so I can then run a groupby.
I have tried various string methods but I don't know how to work around the NaNs

Check with fillna
df['player'] = df['rusher'].fillna(df['receiver'])
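A short sketch of how the new column then feeds the groupby. The yards column and its numbers are made up for illustration; the question does not say what is being aggregated.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rusher": ["A.Ekeler", np.nan, "C.Carson", np.nan],
    "receiver": [np.nan, "S.Barkley", np.nan, "A.Ekeler"],
    "yards": [5, 12, 3, 8],  # hypothetical stat column
})

# Take the rusher where present, otherwise fall back to the receiver.
df["player"] = df["rusher"].fillna(df["receiver"])

# The same name from either column now ends up in one group.
totals = df.groupby("player")["yards"].sum()
print(totals)
```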

how to remove rows in python data frame with condition?

I have the following data:
df =
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 Rocky 10 Casual kkkk 22.4
2 jenifer 50 Emergency 2500.6 '51.6'
3 Tom 10 sick Nan 46.2
4 Harry nn Casual 1800.1 '58.3'
5 Julie 22 sick 3600.2 'unknown'
6 Sam 5 Casual Nan 47.2
7 Mady 6 sick unknown Nan
Output:
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 jenifer 50 Emergency 2500.6 51.6
2 Tom 10 sick Nan 46.2
3 Sam 5 Casual Nan 47.2
4 Mady 6 sick unknown Nan
I want to delete records where there is a datatype error in the numerical columns (Leaves, Salary, Performance).
If a numerical column contains a string, that row should be deleted from the data frame. I tried:
df[['Leaves','Salary','Performance']].apply(pd.to_numeric, errors = 'coerce')
but this converts the values to NaN.

Let's start from a note concerning your sample data:
It contains Nan strings, which are not among strings automatically
recognized as NaNs.
To treat them as NaN, I read the source text with read_fwf,
passing na_values=['Nan'].
And now get down to the main task:
Define a function to check whether a cell is acceptable:
def isAcceptable(cell):
    if pd.isna(cell) or cell == 'unknown':
        return True
    return all(c.isdigit() or c == '.' for c in cell)
I noticed that you accept NaN values.
You also accept a cell if it contains only the string unknown, but you don't
accept a cell if that word is enclosed in e.g. quotes.
If you change your mind about what is / is not acceptable, change the
above function accordingly.
Then, to leave only rows with all acceptable values in all 3 mentioned
columns, run:
df[df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)]
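For a self-contained run of the same idea, here is a sketch on a hand-built subset of the sample rows. It swaps the character-by-character check for pd.to_numeric on each cell (a substitution, not the check from the answer above), and it uses a per-column Series.map because applymap is deprecated in recent pandas releases in favour of DataFrame.map.

```python
import numpy as np
import pandas as pd

# A hand-built subset of the sample rows; all cells are strings or NaN.
df = pd.DataFrame({
    "Emp_Name": ["Christy", "Rocky", "Tom", "Harry", "Mady"],
    "Leaves": ["20", "10", "10", "nn", "6"],
    "Salary": ["3000.0", "kkkk", np.nan, "1800.1", "unknown"],
    "Performance": ["56.6", "22.4", "46.2", "58.3", np.nan],
})

def is_acceptable(cell):
    # NaN and the literal 'unknown' are tolerated, as in the expected output.
    if pd.isna(cell) or cell == "unknown":
        return True
    # Anything else must parse as a number.
    return pd.notna(pd.to_numeric(cell, errors="coerce"))

cols = ["Leaves", "Salary", "Performance"]

# Check every cell column by column, then keep rows where all three pass.
mask = df[cols].apply(lambda col: col.map(is_acceptable)).all(axis=1)
clean = df[mask]
print(clean)
```

Rocky is dropped because of kkkk in Salary and Harry because of nn in Leaves; the other three rows stay.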

How to identify non-empty columns in pandas dataframe?

I'm processing a dataset with around 2000 columns and I noticed many of them are empty. I want to know specifically how many of them are empty and how many are not. I use the following code:
df.isnull().sum()
This gives the number of empty rows in each column. However, given that I'm investigating about 2000 columns with 7593 rows, the output in IPython looks like the following:
FEMALE_DEATH_YR3_RT 7593
FEMALE_COMP_ORIG_YR3_RT 7593
PELL_COMP_4YR_TRANS_YR3_RT 7593
PELL_COMP_2YR_TRANS_YR3_RT 7593
...
FIRSTGEN_YR4_N 7593
NOT1STGEN_YR4_N 7593
It doesn't show all the columns because there are too many. This makes it very difficult to tell how many columns are entirely empty and how many are not.
Is there any way to identify the non-empty columns quickly? Thanks!

To find the number of empty (all-NaN) columns, subtract the number of columns that survive dropna from the total:
len(df.columns) - len(df.dropna(axis=1,how='all').columns)
3
df
Country movie name rating year Something
0 thg John 3 NaN NaN NaN
1 thg Jan 4 NaN NaN NaN
2 mol Graham lob NaN NaN NaN
df=df.dropna(axis=1,how='all')
Country movie name
0 thg John 3
1 thg Jan 4
2 mol Graham lob
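The counts and the column names can also be read off a boolean mask without printing the whole frame. A minimal sketch, rebuilding the small example frame by hand:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Country": ["thg", "thg", "mol"],
    "movie": ["John", "Jan", "Graham"],
    "name": [3, 4, "lob"],
    "rating": [np.nan] * 3,
    "year": [np.nan] * 3,
    "Something": [np.nan] * 3,
})

# True for columns in which every value is missing.
all_empty = df.isna().all()

n_empty = int(all_empty.sum())
n_non_empty = int((~all_empty).sum())

# Names of the columns that hold at least one value.
non_empty_cols = df.columns[~all_empty].tolist()

print(n_empty, n_non_empty)
print(non_empty_cols)
```

With about 2000 columns, non_empty_cols is a plain Python list, so it is not truncated by the display settings the way the Series output is.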
