I have a dataframe that looks like the one below. I want to find the total number of "MISSING" values occurring in the dataframe, but also to group the data by columns A, B and C.
A B C D E F total_missing
0 Miami Heat FL Basketball 21 MISSING MISSING 11
1 Miami Heat FL Basketball 17 MISSING MISSING 11
2 Miami Heat FL Basketball MISSING 12 23 11
3 Orlando Magic FL Basketball MISSING 5 MISSING 11
4 Orlando Magic FL Basketball 10 MISSING MISSING 11
5 Orlando Magic FL Basketball 5 MISSING MISSING 11
The code below only gives back the total count of the word MISSING, which is 11, but it appears repeated in every row. I just want the total of 11 to appear in one cell, and I want to be able to group it by columns A, B and C, which I am not sure how to do with the isin function. Any help would be appreciated.
import pandas as pd
df = pd.read_excel(r'C:\ds_test\basketball.xlsx')
df['total_missing'] = df.isin(["MISSING"]).sum().sum()
print(df['total_missing'])
If I understand you correctly, you want:
dfn = (
    df.assign(Total_missing=df.eq('MISSING').sum(axis=1))
      .groupby(['A', 'B', 'C'])['Total_missing'].sum()
      .reset_index()
)
A B C Total_missing
0 Miami Heat FL Basketball 5
1 Orlando Magic FL Basketball 6
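If you also want the overall total in a single cell rather than repeated on every row, you can compute it separately as a scalar (a minimal sketch, using the same eq comparison):

total = df.eq('MISSING').sum().sum()   # 11 for the sample data
print(total)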
I have already sorted the two dataframes:
city_future:
City Future_50
7 Atlanta 1
9 Bal Harbour 1
1 Chicago 8
6 Coalinga 1
independents_future:
City independents_100
14 Amarillo 1
10 Atlanta 2
18 Atlantic City 1
20 Austin 1
This is what I got so far:
city_future = (future.loc[:, "City"].value_counts()
               .rename_axis('City').reset_index(name='Future_50')
               .sort_values('City'))
city_independents = (independents.loc[:, "City"].value_counts()
                     .rename_axis('City').reset_index(name='independents_100')
                     .sort_values('City'))
hot_cities = pd.merge(city_independents, city_future)
hot_cities
I need to show all the cities in both dataframes, which have different lengths, and mark the cities not present in the other dataframe with 0.
I have no idea why my current output only shows 20 rows. It is in the form of:
City independents_100 Future_50
0 Atlanta 2 1
1 Bal Harbour 1 1
2 Chicago 15 8
Thank you for helping!
I believe you can do this with the merge method, without creating the two helper dataframes. Your current result shows only 20 rows because pd.merge defaults to an inner join, which keeps just the cities present in both dataframes.
Setting how='outer' keeps every city from either dataframe, and setting indicator=True will create a new column in the result that tells you whether each row appears in the left dataframe only (city_future), the right dataframe only (independents_future), or both:
merged_df = city_future.merge(right=independents_future,
                              left_on='City',
                              right_on='City',
                              how='outer',
                              indicator=True)
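Since you also want the cities missing from one dataframe marked with 0, you can follow the outer merge with fillna (a sketch, assuming the count columns keep the names shown above):

merged_df[['Future_50', 'independents_100']] = (
    merged_df[['Future_50', 'independents_100']].fillna(0).astype(int)
)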
Here is the pandas.DataFrame.merge reference page.
hope this helps :)
I have a sample dataframe:
name
0 Newyork
1 Los Angeles
2 Ohio
3 Washington DC
4 Kentucky
I also have a second dataframe:
name ratio
0 Newyork 1:2
1 Kentucky 3:7
2 Florida 1:5
3 SF 2:9
How can I replace the data in the name column of df2 with Not Available if the name is present in df1?
Desired result:
name ratio
0 Not Available 1:2
1 Not Available 3:7
2 Florida 1:5
3 SF 2:9
Use numpy.where:
import numpy as np

df2['name'] = np.where(df2['name'].isin(df1['name']), 'Not Available', df2['name'])
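If you'd rather stay within pandas and skip the numpy import, an equivalent sketch uses a boolean mask with .loc:

df2.loc[df2['name'].isin(df1['name']), 'name'] = 'Not Available'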
I have two dataframes.
The first dataframe df1 looks like this:
variable value
0 plastic 5774
2 glass 42
4 ferrous metal 642
6 non-ferrous metal 14000
8 paper 4000
Here is the head of the second dataframe df2:
waste_type total_waste_recycled_tonne year energy_saved
non-ferrous metal 160400.0 2015 NaN
glass 14600.0 2015 NaN
ferrous metal 15200 2015 NaN
plastic 766800 2015 NaN
I want to update energy_saved in the second dataframe df2 by multiplying total_waste_recycled_tonne in df2 by the matching value from df1, storing the result in the energy_saved column of df2.
For example, for plastic: 5774 will be multiplied with total_waste_recycled_tonne for every row whose waste_type is plastic, i.e.:
energy_saved = 5774 * 766800
Here is what I tried:
df2["energy_saved"] = df1[df1["variable"]=="plastic"]["value"].values[0] * df2["total_waste_recycled_tonne"][df2["waste_type"]=="plastic"]
However, the problem is that when I do the other categories, the earlier results change back to NaN. I need a better approach to handle this.
Use map:
df2['energy_saved'] = (df2['waste_type'].map(df1.set_index('variable')['value'])
                       .mul(df2['total_waste_recycled_tonne']))
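map builds a lookup Series keyed by variable, so every row of df2 (including repeated waste types such as plastic) picks up its matching factor in one vectorized step, instead of assigning one category at a time and leaving the other rows as NaN.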
Try via merge() and pass how='right':
df = df1[['variable', 'value']].merge(df2[['waste_type', 'total_waste_recycled_tonne']],
                                      left_on='variable', right_on='waste_type', how='right')
Finally:
df2["energy_saved"]=df['value'].mul(df['total_waste_recycled_tonne'])
Output of df2:
waste_type total_waste_recycled_tonne year energy_saved
0 non-ferrous metal 160400.0 2015 2.245600e+09
1 glass 14600.0 2015 6.132000e+05
2 ferrous metal 15200.0 2015 9.758400e+06
3 plastic 766800.0 2015 4.427503e+09
4 plastic 762700.0 2015 4.403830e+09
A set_index + reindex option:
df2['energy_saved'] = (
    df1.set_index('variable').reindex(df2['waste_type'])['value'] *
    df2.set_index('waste_type')['total_waste_recycled_tonne']
).values
df2:
waste_type total_waste_recycled_tonne year energy_saved
0 non-ferrous metal 160400.0 2015 2.245600e+09
1 glass 14600.0 2015 6.132000e+05
2 ferrous metal 15200.0 2015 9.758400e+06
3 plastic 766800.0 2015 4.427503e+09
4 plastic 762700.0 2015 4.403830e+09
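The trailing .values matters here: it strips the waste_type index off the result, so the assignment to df2['energy_saved'] aligns positionally with df2's rows rather than by index.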
I have the following data:
df =
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 Rocky 10 Casual kkkk 22.4
2 jenifer 50 Emergency 2500.6 '51.6'
3 Tom 10 sick Nan 46.2
4 Harry nn Casual 1800.1 '58.3'
5 Julie 22 sick 3600.2 'unknown'
6 Sam 5 Casual Nan 47.2
7 Mady 6 sick unknown Nan
Output:
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 jenifer 50 Emergency 2500.6 51.6
2 Tom 10 sick Nan 46.2
3 Sam 5 Casual Nan 47.2
4 Mady 6 sick unknown Nan
I want to delete records where there is a datatype error in the numerical columns (Leaves, Salary, Performance).
If a numerical column contains a string, that row should be deleted from the dataframe.
I tried:
df[['Leaves', 'Salary', 'Performance']].apply(pd.to_numeric, errors='coerce')
but this only converts the bad values to NaN.
Let's start with a note concerning your sample data: it contains 'Nan' strings, which are not among the strings automatically recognized as NaN. To treat them as NaN, I read the source text with read_fwf, passing na_values=['Nan'].
Now let's get down to the main task:
Define a function to check whether a cell is acceptable:
def isAcceptable(cell):
    if pd.isna(cell) or cell == 'unknown':
        return True
    return all(c.isdigit() or c == '.' for c in cell)
I noticed that you accept NaN values. You also accept a cell if it contains only the word unknown, but you don't accept a cell where that word is enclosed in quotes.
If you change your mind about what is / is not acceptable, change the
above function accordingly.
Then, to keep only the rows with acceptable values in all 3 mentioned columns, run:
df[df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)]
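If you'd rather build on the pd.to_numeric attempt from the question, the coerced NaNs can serve as a row filter instead of overwriting the data. A sketch under similar acceptance rules (real NaNs and the bare word unknown are kept; note that to_numeric is a bit more permissive than isAcceptable, e.g. it also accepts negative numbers):

cols = ['Leaves', 'Salary', 'Performance']
converted = df[cols].apply(pd.to_numeric, errors='coerce')
# keep a row if every cell is numeric, a real NaN, or exactly 'unknown'
acceptable = converted.notna() | df[cols].isna() | df[cols].eq('unknown')
df = df[acceptable.all(axis=1)]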
I am very new to Python, and here I have a question I don't know how to solve; please help.
Here is the thing: I have a dataframe, and I want to extract values from a column that meet two different conditions.
The dataframe is as follows:
state gender year name births
13299 AK F 2013 Emma 57
13300 AK F 2013 Sophia 50
13301 AK F 2013 Abigail 39
13302 AK F 2013 Isabella 38
13303 AK F 2013 Olivia 36
13304 AK F 2013 Charlotte 34
13305 AK F 2013 Harper 34
13306 AK F 2013 Emily 33
13307 AK F 2013 Ava 31
13308 AK F 2013 Avery 30
5742631 WY M 2013 Emmett 5
5742632 WY M 2013 Jesse 5
5742633 WY M 2013 Jonah 5
5742634 WY M 2013 Jude 5
5742635 WY M 2013 Kaden 5
5742636 WY M 2013 Kaleb 5
5742637 WY M 2013 Kasen 5
5742638 WY M 2013 Kellan 5
There are about 90K rows in this dataframe. I want to return the values of 'name' whose births are as evenly distributed between 'M' and 'F' as possible.
Or in other words: I want to return the values of 'name' under the condition that the 'births' column contains the same number for 'M' and 'F'.
Sorry, I am new to Python, and I got stuck on this for quite a while.
I tried splitting the dataframe into two different dataframes and doing it that way, but I found that was kind of impossible.
Any suggestion would be appreciated.
Pivot table in pandas works fine here:
pvt = pd.pivot_table(df,values='births',columns='gender',index='name',aggfunc='sum')
pvt[pvt['M'] == pvt['F']]
This returns a dataframe with name as the index and M, F as the columns. It's unlikely that unisex names will be exactly equal, though, so you can instead use a compound condition like:
pvt[(pvt['M'] + 10 > pvt['F']) & (pvt['M'] - 10 < pvt['F'])]
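The same tolerance can also be written as an absolute difference, which is equivalent to the compound condition above:

pvt[(pvt['M'] - pvt['F']).abs() < 10]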
I've defined df1 to process further. I set the index to ['name', 'gender'] (appending to the existing index), then unstack to move 'gender' into the columns, and .births to focus on births. Then I divide the min by the max to avoid dividing by zero.
df1 = df.set_index(['name', 'gender'], append=True).unstack().births.fillna(0)
df1.min(1).astype(float).div(df1.max(1)).sort_values(ascending=False)
This should give you the names sorted by how close their ratio is to 1.
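One caveat, stated as an assumption about your data: set_index(..., append=True) keeps the original row numbers in the index, so a name whose 'M' and 'F' births sit in different rows (for example, in different states) will not line up within a single row. If that is your situation, a sketch that aggregates the births per name and gender first:

counts = df.groupby(['name', 'gender'])['births'].sum().unstack(fill_value=0)
ratio = counts.min(axis=1).div(counts.max(axis=1)).sort_values(ascending=False)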