I have a dataframe that looks like the one below. I want to find the total number of "MISSING" values occurring in the dataframe, but also to group the data by columns A, B and C.
A B C D E F total_missing
0 Miami Heat FL Basketball 21 MISSING MISSING 11
1 Miami Heat FL Basketball 17 MISSING MISSING 11
2 Miami Heat FL Basketball MISSING 12 23 11
3 Orlando Magic FL Basketball MISSING 5 MISSING 11
4 Orlando Magic FL Basketball 10 MISSING MISSING 11
5 Orlando Magic FL Basketball 5 MISSING MISSING 11
The code below only gives back the total count of the word MISSING, which is 11, but it appears repeated in every row. I just want the total of 11 to appear in one cell, and I want to be able to group it by columns A, B and C, which I am not sure how to do with the isin function. Any help would be appreciated.
import pandas as pd
df = pd.read_excel(r'C:\ds_test\basketball.xlsx')
df['total_missing'] = df.isin(["MISSING"]).sum().sum()
print(df['total_missing'])
If I understand you correctly, you want:
dfn = (
    df.assign(Total_missing=df.eq('MISSING').sum(axis=1))
      .groupby(['A', 'B', 'C'])['Total_missing'].sum()
      .reset_index()
)
A B C Total_missing
0 Miami Heat FL Basketball 5
1 Orlando Magic FL Basketball 6
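If you also want the overall total in a single cell rather than repeated on every row, you can compute it separately as a scalar (a minimal sketch, using the same eq comparison):

total = df.eq('MISSING').sum().sum()   # 11 for the sample data
print(total)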
I have already sorted the two dataframes:
city_future:
City Future_50
7 Atlanta 1
9 Bal Harbour 1
1 Chicago 8
6 Coalinga 1
independents_future:
City independents_100
14 Amarillo 1
10 Atlanta 2
18 Atlantic City 1
20 Austin 1
This is what I got so far:
city_future = (future.loc[:, "City"].value_counts()
               .rename_axis('City').reset_index(name='Future_50')
               .sort_values('City'))
city_independents = (independents.loc[:, "City"].value_counts()
                     .rename_axis('City').reset_index(name='independents_100')
                     .sort_values('City'))
hot_cities = pd.merge(city_independents, city_future)
hot_cities
I need to show all the cities in both dataframes, which have different lengths, and mark the cities not present in the other dataframe with 0.
I have no idea why my current output only shows 20 rows. It is in the form of:
City independents_100 Future_50
0 Atlanta 2 1
1 Bal Harbour 1 1
2 Chicago 15 8
Thank you for helping!
I believe you can do this with the merge method, without creating the two helper dataframes. Your current result shows only 20 rows because pd.merge defaults to an inner join, which keeps just the cities present in both dataframes.
Setting how='outer' keeps every city from either dataframe, and setting indicator=True will create a new column in the result that tells you whether each row appears in the left dataframe only (city_future), the right dataframe only (independents_future), or both:
merged_df = city_future.merge(right=independents_future,
                              left_on='City',
                              right_on='City',
                              how='outer',
                              indicator=True)
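Since you also want the cities missing from one dataframe marked with 0, you can follow the outer merge with fillna (a sketch, assuming the count columns keep the names shown above):

merged_df[['Future_50', 'independents_100']] = (
    merged_df[['Future_50', 'independents_100']].fillna(0).astype(int)
)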
Here is the pandas.DataFrame.merge reference page.
hope this helps :)
I have a sample dataframe:
name
0 Newyork
1 Los Angeles
2 Ohio
3 Washington DC
4 Kentucky
I also have a second dataframe:
name ratio
0 Newyork 1:2
1 Kentucky 3:7
2 Florida 1:5
3 SF 2:9
How can I replace the data in the name column of df2 with Not Available if the name is present in df1?
Desired result:
name ratio
0 Not Available 1:2
1 Not Available 3:7
2 Florida 1:5
3 SF 2:9
Use numpy.where:
import numpy as np

df2['name'] = np.where(df2['name'].isin(df1['name']), 'Not Available', df2['name'])
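If you'd rather stay within pandas and skip the numpy import, an equivalent sketch uses a boolean mask with .loc:

df2.loc[df2['name'].isin(df1['name']), 'name'] = 'Not Available'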
I have two dataframes.
The first dataframe df1 looks like this:
variable value
0 plastic 5774
2 glass 42
4 ferrous metal 642
6 non-ferrous metal 14000
8 paper 4000
Here is the head of the second dataframe df2:
waste_type total_waste_recycled_tonne year energy_saved
non-ferrous metal 160400.0 2015 NaN
glass 14600.0 2015 NaN
ferrous metal 15200 2015 NaN
plastic 766800 2015 NaN
I want to update energy_saved in the second dataframe df2 by multiplying total_waste_recycled_tonne in df2 by the matching value from df1, storing the result in the energy_saved column of df2.
For example, for plastic: 5774 will be multiplied with total_waste_recycled_tonne for every row whose waste_type is plastic, i.e.:
energy_saved = 5774 * 766800
Here is what I tried:
df2["energy_saved"] = df1[df1["variable"]=="plastic"]["value"].values[0] * df2["total_waste_recycled_tonne"][df2["waste_type"]=="plastic"]
However, the problem is that when I do the other categories, the earlier results change back to NaN. I need a better approach to handle this.
Use map:
df2['energy_saved'] = (df2['waste_type'].map(df1.set_index('variable')['value'])
                       .mul(df2['total_waste_recycled_tonne']))
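map builds a lookup Series keyed by variable, so every row of df2 (including repeated waste types such as plastic) picks up its matching factor in one vectorized step, instead of assigning one category at a time and leaving the other rows as NaN.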
Try via merge() and pass how='right':
df = df1[['variable', 'value']].merge(df2[['waste_type', 'total_waste_recycled_tonne']],
                                      left_on='variable', right_on='waste_type', how='right')
Finally:
df2["energy_saved"]=df['value'].mul(df['total_waste_recycled_tonne'])
Output of df2:
waste_type total_waste_recycled_tonne year energy_saved
0 non-ferrous metal 160400.0 2015 2.245600e+09
1 glass 14600.0 2015 6.132000e+05
2 ferrous metal 15200.0 2015 9.758400e+06
3 plastic 766800.0 2015 4.427503e+09
4 plastic 762700.0 2015 4.403830e+09
A set_index + reindex option:
df2['energy_saved'] = (
    df1.set_index('variable').reindex(df2['waste_type'])['value'] *
    df2.set_index('waste_type')['total_waste_recycled_tonne']
).values
df2:
waste_type total_waste_recycled_tonne year energy_saved
0 non-ferrous metal 160400.0 2015 2.245600e+09
1 glass 14600.0 2015 6.132000e+05
2 ferrous metal 15200.0 2015 9.758400e+06
3 plastic 766800.0 2015 4.427503e+09
4 plastic 762700.0 2015 4.403830e+09
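The trailing .values matters here: it strips the waste_type index off the result, so the assignment to df2['energy_saved'] aligns positionally with df2's rows rather than by index.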
I have the following data:
df =
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 Rocky 10 Casual kkkk 22.4
2 jenifer 50 Emergency 2500.6 '51.6'
3 Tom 10 sick Nan 46.2
4 Harry nn Casual 1800.1 '58.3'
5 Julie 22 sick 3600.2 'unknown'
6 Sam 5 Casual Nan 47.2
7 Mady 6 sick unknown Nan
Output:
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 jenifer 50 Emergency 2500.6 51.6
2 Tom 10 sick Nan 46.2
3 Sam 5 Casual Nan 47.2
4 Mady 6 sick unknown Nan
I want to delete records where there is a datatype error in the numerical columns (Leaves, Salary, Performance).
If a numerical column contains a string, that row should be deleted from the dataframe.
I tried:
df[['Leaves', 'Salary', 'Performance']].apply(pd.to_numeric, errors='coerce')
but this only converts the bad values to NaN.
Let's start with a note concerning your sample data: it contains 'Nan' strings, which are not among the strings automatically recognized as NaN. To treat them as NaN, I read the source text with read_fwf, passing na_values=['Nan'].
Now let's get down to the main task:
Define a function to check whether a cell is acceptable:
def isAcceptable(cell):
    if pd.isna(cell) or cell == 'unknown':
        return True
    return all(c.isdigit() or c == '.' for c in cell)
I noticed that you accept NaN values. You also accept a cell if it contains only the word unknown, but you don't accept a cell where that word is enclosed in quotes.
If you change your mind about what is / is not acceptable, change the
above function accordingly.
Then, to keep only the rows with acceptable values in all 3 mentioned columns, run:
df[df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)]
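If you'd rather build on the pd.to_numeric attempt from the question, the coerced NaNs can serve as a row filter instead of overwriting the data. A sketch under similar acceptance rules (real NaNs and the bare word unknown are kept; note that to_numeric is a bit more permissive than isAcceptable, e.g. it also accepts negative numbers):

cols = ['Leaves', 'Salary', 'Performance']
converted = df[cols].apply(pd.to_numeric, errors='coerce')
# keep a row if every cell is numeric, a real NaN, or exactly 'unknown'
acceptable = converted.notna() | df[cols].isna() | df[cols].eq('unknown')
df = df[acceptable.all(axis=1)]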
I am very new to Python, and here I have a question I don't know how to solve; please help.
Here is the thing: I have a dataframe, and I want to extract values from a column that meet two different conditions.
The dataframe is as follows:
state gender year name births
13299 AK F 2013 Emma 57
13300 AK F 2013 Sophia 50
13301 AK F 2013 Abigail 39
13302 AK F 2013 Isabella 38
13303 AK F 2013 Olivia 36
13304 AK F 2013 Charlotte 34
13305 AK F 2013 Harper 34
13306 AK F 2013 Emily 33
13307 AK F 2013 Ava 31
13308 AK F 2013 Avery 30
5742631 WY M 2013 Emmett 5
5742632 WY M 2013 Jesse 5
5742633 WY M 2013 Jonah 5
5742634 WY M 2013 Jude 5
5742635 WY M 2013 Kaden 5
5742636 WY M 2013 Kaleb 5
5742637 WY M 2013 Kasen 5
5742638 WY M 2013 Kellan 5
There are about 90K rows in this dataframe. I want to return the values of 'name' whose births are as evenly distributed between 'M' and 'F' as possible.
Or in other words: I want to return the values of 'name' under the condition that the 'births' column contains the same number for 'M' and 'F'.
Sorry, I am new to Python, and I got stuck on this for quite a while.
I tried splitting the dataframe into two different dataframes and doing it that way, but I found that was kind of impossible.
Any suggestion would be appreciated.
Pivot table in pandas works fine here:
pvt = pd.pivot_table(df,values='births',columns='gender',index='name',aggfunc='sum')
pvt[pvt['M'] == pvt['F']]
This returns a dataframe with name as the index and M, F as the columns. It's unlikely that unisex names will be exactly equal, though, so you can instead use a compound condition like:
pvt[(pvt['M'] + 10 > pvt['F']) & (pvt['M'] - 10 < pvt['F'])]
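The same tolerance can also be written as an absolute difference, which is equivalent to the compound condition above:

pvt[(pvt['M'] - pvt['F']).abs() < 10]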
I've defined df1 to process further. I set the index to ['name', 'gender'] (appending to the existing index), then unstack to move 'gender' into the columns, and .births to focus on births. Then I divide the min by the max to avoid dividing by zero.
df1 = df.set_index(['name', 'gender'], append=True).unstack().births.fillna(0)
df1.min(1).astype(float).div(df1.max(1)).sort_values(ascending=False)
This should give you the names sorted by how close their ratio is to 1.
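One caveat, stated as an assumption about your data: set_index(..., append=True) keeps the original row numbers in the index, so a name whose 'M' and 'F' births sit in different rows (for example, in different states) will not line up within a single row. If that is your situation, a sketch that aggregates the births per name and gender first:

counts = df.groupby(['name', 'gender'])['births'].sum().unstack(fill_value=0)
ratio = counts.min(axis=1).div(counts.max(axis=1)).sort_values(ascending=False)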