I have a large dataframe with over 100 columns. One of the columns is "Cow". For each value of "Cow" I would like to determine the number of missing values in each of the other columns.
Using code from "Get proportion of missing values per Country",
I am able to tabulate the number of missing values for one column at a time. By repeating the code for each column and then merging the dataframes, I can build a dataframe that holds the proportion of missing values for each cow in each column. The problem is that I have over 100 columns.
The following creates a short example dataset:
import pandas as pd
import numpy as np
mast_model_data = [[1152, '1', '10', '23'], [1154, '1', '4', '43'],
                   [1155, 'NA', '3', '76'], [1152, '1', '10', 'NA'],
                   [1155, '2', '10', '65'], [1152, '1', '4', 'NA']]
df = pd.DataFrame(mast_model_data, columns=['Cow', 'Lact', 'Procedure', 'Height'])
df.loc[:,'Lact'] = df['Lact'].replace('NA', np.nan)
df.loc[:,'Procedure'] = df['Procedure'].replace('NA', np.nan)
df.loc[:,'Height'] = df['Height'].replace('NA', np.nan)
df
The data is presented below
Cow Lact Procedure Height
0 1152 1 10 23
1 1154 1 4 43
2 1155 NaN 3 76
3 1152 1 10 NaN
4 1155 2 10 65
5 1152 1 4 NaN
The code that I am using to tabulate missing data is as follows
df1 = (df.groupby('Cow')['Lact']
         .apply(lambda x: np.mean(x.isna().to_numpy(), axis=None))
         .reset_index(name='Lact'))
df2 = (df.groupby('Cow')['Procedure']
         .apply(lambda x: np.mean(x.isna().to_numpy(), axis=None))
         .reset_index(name='Procedure'))
df3 = (df.groupby('Cow')['Height']
         .apply(lambda x: np.mean(x.isna().to_numpy(), axis=None))
         .reset_index(name='Height'))
missing = df1.merge(df2, on=['Cow'], how="left")
missing = missing.merge(df3, on=['Cow'], how="left")
missing
The output of the code above is
Cow Lact Procedure Height
0 1152 0.0 0.0 0.666667
1 1154 0.0 0.0 0.000000
2 1155 0.5 0.0 0.000000
The actual dataframe has more cows and columns, so completing the table this way would require a lot of repetition.
I anticipate there is a more refined approach that avoids the repetition of my current method.
I would appreciate advice on how to streamline the code.
Try as follows:
missing = (df.set_index('Cow').isna().groupby(level=0).mean()
             .reset_index(drop=False))
print(missing)
Cow Lact Procedure Height
0 1152 0.0 0.0 0.666667
1 1154 0.0 0.0 0.000000
2 1155 0.5 0.0 0.000000
Explanation
Set column Cow as the index, and apply df.isna to get a mask of bool values with True for NaN values.
Now, chain df.groupby on the index (i.e. level=0), retrieve the mean, and reset the index again.
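For completeness, here is an equivalent sketch that groups the boolean mask by the Cow column directly, without touching the index (a minimal sketch, assuming the same df built above):
missing = (df.drop(columns='Cow')   # the 100+ data columns
             .isna()                # True where a value is missing
             .groupby(df['Cow'])    # group the mask by the cow id
             .mean()                # proportion of missing per cow per column
             .reset_index())
print(missing)
This gives the same table as the approach above.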
Related
data['family_income'].value_counts()
>=35,000 2517
<27,500, >=25,000 1227
<30,000, >=27,500 994
<25,000, >=22,500 833
<20,000, >=17,500 683
<12,500, >=10,000 677
<17,500, >=15,000 634
<15,000, >=12,500 629
<22,500, >=20,000 590
<10,000, >= 8,000 563
< 8,000, >= 4,000 402
< 4,000 278
Unknown 128
I want the data column to be shown as a mean value instead of a range of values:
data['family_income']
0 <17,500, >=15,000
1 <27,500, >=25,000
2 <30,000, >=27,500
3 <15,000, >=12,500
4 <30,000, >=27,500
...
10150 <30,000, >=27,500
10151 <25,000, >=22,500
10152 >=35,000
10153 <10,000, >= 8,000
10154 <27,500, >=25,000
Name: family_income, Length: 10155, dtype: object
Expected output (mean-imputed values):
0 16250
1 26250
3 28750
...
10152 35000
10153 9000
10154 26500
data['family_income']=data['family_income'].str.replace(',', ' ').str.replace('<',' ')
data[['income1','income2']] = data['family_income'].apply(lambda x: pd.Series(str(x).split(">=")))
data['income1']=pd.to_numeric(data['income1'], errors='coerce')
data['income1']
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
10150 NaN
10151 NaN
10152 NaN
10153 NaN
10154 NaN
Name: income1, Length: 10155, dtype: float64
In this case, the conversion from object to numeric doesn't seem to work, since all the values are returned as NaN. So, how can I convert to a numeric data type and find the mean-imputed values?
You can use the following snippet:
# Importing Dependencies
import pandas as pd
import string
# Replicating Your Data
data = ['<17,500, >=15,000', '<27,500, >=25,000', '< 4,000 ', '>=35,000']
df = pd.DataFrame(data, columns = ['family_income'])
# Removing punctuation from family_income column and stripping stray spaces
df['family_income'] = (df['family_income']
                       .apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
                       .str.strip())
# Splitting ranges into two columns A and B
df[['A', 'B']] = df['family_income'].str.split(' ', n=1, expand=True)
# Converting cols A and B to float
df[['A', 'B']] = df[['A', 'B']].apply(pd.to_numeric)
# Creating mean column from A and B
df['mean'] = df[['A', 'B']].mean(axis=1)
# Input DataFrame
family_income
0 <17,500, >=15,000
1 <27,500, >=25,000
2 < 4,000
3 >=35,000
# Result DataFrame
mean
0 16250.0
1 26250.0
2 4000.0
3 35000.0
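A possible alternative sketch (not part of the snippet above): pull every number out of each range with a regular expression and average them row-wise, which avoids the punctuation removal and splitting steps:
import pandas as pd

data = ['<17,500, >=15,000', '<27,500, >=25,000', '< 4,000 ', '>=35,000']
df = pd.DataFrame(data, columns=['family_income'])
# Drop the thousands separators, extract all numbers per row,
# then take the row-wise mean of the one or two bounds found.
nums = (df['family_income']
        .str.replace(',', '', regex=False)
        .str.extractall(r'(\d+)')[0]
        .astype(float))
df['mean'] = nums.groupby(level=0).mean()
print(df[['family_income', 'mean']])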
I have a df called df_world with the following structure:
Cases Death Delta_Cases Delta_Death
Country/Region Date
Brazil 2020-01-22 0.0 0 NaN NaN
2020-01-23 0.0 0 0.0 0.0
2020-01-24 0.0 0 0.0 0.0
2020-01-25 0.0 0 0.0 0.0
2020-01-26 0.0 0 0.0 0.0
... ... ... ...
World 2020-05-12 4261747.0 291942 84245.0 5612.0
2020-05-13 4347018.0 297197 85271.0 5255.0
2020-05-14 4442163.0 302418 95145.0 5221.0
2020-05-15 4542347.0 307666 100184.0 5248.0
2020-05-16 4634068.0 311781 91721.0 4115.0
I'd like to sort the country index by the value of the 'Cases' column on the last recorded date, i.e. compare the case counts on 2020-05-16 for all countries and return the sorted country list.
I thought about creating another df with only the 2020-05-16 values and then using the df.sort_values() method, but I am sure there has to be a more efficient way.
While I'm at it, I've also tried to select only the countries whose number of cases on 2020-05-16 is above a certain value, and the only way I found to do it was to iterate over the country index:
for a_country in df_world.index.levels[0]:
    if df_world.loc[(a_country, last_date), 'Cases'] < cut_off_val:
        df_world = df_world.drop(index=a_country)
But it's quite a poor way to do it.
If anyone has an idea on how to improve the efficiency of this code, I'd be very happy.
Thank you :)
You can first group the dataset by "Country/Region", then sort each group by "Date", take the last one, and sort again by "Cases".
Faking some data myself (data types are different but you see my point):
df = pd.DataFrame([['a', 1, 100],
                   ['a', 2, 10],
                   ['b', 2, 55],
                   ['b', 3, 15],
                   ['c', 1, 22],
                   ['c', 3, 80]])
df.columns = ['country', 'date', 'cases']
df = df.set_index(['country', 'date'])
print(df)
# cases
# country date
# a 1 100
# 2 10
# b 2 55
# 3 15
# c 1 22
# 3 80
Then,
# group them by country
grp_by_country = df.groupby(by='country')
# for each group, sort by date and take the last row (latest date)
latest_per_grp = grp_by_country.apply(lambda g: g.sort_values(by='date').iloc[-1])
# sort again by cases
sorted_by_cases = latest_per_grp.sort_values(by='cases')
print(sorted_by_cases)
# cases
# country
# a 10
# b 15
# c 80
Stay safe!
last_recs = df_world.reset_index().groupby('Country/Region', as_index=False).last()
sorted_countries = last_recs.sort_values('Cases')['Country/Region']
As I don't have your raw data, I can't test it, but this should do what you need. All the methods are self-explanatory, I believe.
You may need to sort df_world by date first if it isn't already sorted.
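For the cut-off part of the question, a loop-free sketch along the same lines (assuming the dates are already sorted within each country, and reusing cut_off_val from the question):
# Last recorded 'Cases' per country, then keep only the countries at or above the cut-off.
last_cases = df_world.groupby(level='Country/Region')['Cases'].last()
keep = last_cases[last_cases >= cut_off_val].index
df_world = df_world[df_world.index.get_level_values('Country/Region').isin(keep)]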
Let's say I have this dataframe containing the difference in number of active cases from previous value in each country:
[in]
import pandas as pd
import numpy as np
active_cases = {'Day(s) since outbreak': ['0', '1', '2', '3', '4', '5'],
                'Australia': [np.nan, 10, 10, -10, -20, -20],
                'Albania': [np.nan, 20, 0, 15, 0, -20],
                'Algeria': [np.nan, 25, 10, -10, 20, -20]}
df = pd.DataFrame(active_cases)
df
[out]
Day(s) since outbreak Australia Albania Algeria
0 0 NaN NaN NaN
1 1 10.0 20.0 25.0
2 2 10.0 0.0 10.0
3 3 -10.0 15.0 -10.0
4 4 -20.0 0.0 20.0
5 5 -20.0 -20.0 -20.0
I need to find the average length of days for a local outbreak to peak in this COVID-19 dataframe.
My solution is to find the nth row with the first negative value in each column (e.g., nth row of first negative value in 'Australia': 3, nth row of first negative value in 'Albania': 5) and average it.
However, I have no idea how to do this in pandas/Python.
Is there a way to perform this task with a few simple lines of pandas/Python code?
You can set_index the column Day(s) since outbreak, then use iloc to select all rows except the first one and check where the values are less than (lt) 0. Use idxmax to get the first row where the value is less than 0, and take the mean. With your input, it gives:
print(df.set_index('Day(s) since outbreak')
        .iloc[1:, :].lt(0).idxmax().astype(float).mean())
3.6666666666666665
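An equivalent sketch of the same idea that masks only the country columns; it relies on the row position matching the day number in this example and assumes every country has at least one negative value:
# First day each country goes negative, then the average across countries.
countries = df.drop(columns='Day(s) since outbreak')
first_negative = countries.lt(0).idxmax()   # index of the first True per column
print(first_negative.mean())                # 3.6666666666666665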
IIUC
Using df.where, keep the negative values, replace everything else with np.nan, and then calculate the mean:
cols = ['Australia', 'Albania', 'Algeria']
df.set_index('Day(s) since outbreak', inplace=True)
m = df < 0
df2 = df.where(m, np.nan)
#df2 = df2.replace(0, np.nan)
df2.mean()
Result
I am not able to read my dataset.csv file due to the following parser error:
Error tokenizing data. C error: Expected 1 fields in line 8, saw 4
The CSV file is generated by another program.
Basically, I want to skip the text rows that repeat at certain intervals and keep only the integer and float values in my dataset.
I tried this:
df = pd.read_csv('Dataset.csv')
I also tried this, but I am only getting the bad lines as output; I want to skip all these bad lines and only show the remaining values in my dataset.
df = pd.read_csv('Dataset.csv',error_bad_lines=False, engine='python')
Dataset:
The pch2csv utility program
This file contains the pch2csv
$TITLE =
$SUBTITLE=
$LABEL = FX
1,0.000000E+00,3.792830E-06,-1.063093E-06
2,0.000000E+00,-1.441319E-06,4.711234E-06
3,0.000000E+00,2.950290E-06,-5.669502E-07
4,0.000000E+00,3.706791E-06,-1.094726E-06
5,0.000000E+00,3.689831E-06,-1.107476E-06
$TITLE =
$SUBTITLE=
$LABEL = FY
1,0.000000E+00,-5.878803E-06,1.127179E-06
2,0.000000E+00,2.782207E-06,-8.840886E-06
3,0.000000E+00,-1.574296E-06,3.867732E-07
4,0.000000E+00,-6.227912E-06,1.864081E-06
5,0.000000E+00,-3.113227E-05,9.339538E-06
Expected dataset:
*Even the blank rows may be deleted if possible
The 1st column should be set as the index, and the final dataset must contain the 1st and 3rd columns only, as shown. The column label must be set as '1'.
You can add the parameter names to read_csv for new column names; this then yields some rows with missing values, so DataFrame.dropna is added:
import pandas as pd
from io import StringIO
temp="""The pch2csv utility program
This file contains the pch2csv
$TITLE =
$SUBTITLE=
$LABEL = FX
1,0.000000E+00,3.792830E-06,-1.063093E-06
2,0.000000E+00,-1.441319E-06,4.711234E-06
3,0.000000E+00,2.950290E-06,-5.669502E-07
4,0.000000E+00,3.706791E-06,-1.094726E-06
5,0.000000E+00,3.689831E-06,-1.107476E-06
$TITLE =
$SUBTITLE=
$LABEL = FY
1,0.000000E+00,-5.878803E-06,1.127179E-06
2,0.000000E+00,2.782207E-06,-8.840886E-06
3,0.000000E+00,-1.574296E-06,3.867732E-07
4,0.000000E+00,-6.227912E-06,1.864081E-06
5,0.000000E+00,-3.113227E-05,9.339538E-06"""
#after testing, replace StringIO(temp) with 'Dataset.csv'
df = pd.read_csv(StringIO(temp),
error_bad_lines=False,
engine='python',
names=['a','b','c','d'])
df = df.dropna(subset=['b','c','d'])
print (df)
a b c d
0 1 0.0 0.000004 -1.063093e-06
1 2 0.0 -0.000001 4.711234e-06
2 3 0.0 0.000003 -5.669502e-07
3 4 0.0 0.000004 -1.094726e-06
4 5 0.0 0.000004 -1.107476e-06
8 1 0.0 -0.000006 1.127179e-06
9 2 0.0 0.000003 -8.840886e-06
10 3 0.0 -0.000002 3.867732e-07
11 4 0.0 -0.000006 1.864081e-06
12 5 0.0 -0.000031 9.339538e-06
EDIT:
To set the first column as the index and name the other columns:
#after testing, replace StringIO(temp) with 'Dataset.csv'
df = pd.read_csv(StringIO(temp),
error_bad_lines=False,
engine='python',
index_col=[0],
names=['idx','col1','col2','col3'])
#check all columns; the first column is now the index, so it is not checked
df = df.dropna()
#if you need to drop only rows where all values are NaN
#df = df.dropna(how='all')
print (df)
col1 col2 col3
idx
1 0.0 0.000004 -1.063093e-06
2 0.0 -0.000001 4.711234e-06
3 0.0 0.000003 -5.669502e-07
4 0.0 0.000004 -1.094726e-06
5 0.0 0.000004 -1.107476e-06
1 0.0 -0.000006 1.127179e-06
2 0.0 0.000003 -8.840886e-06
3 0.0 -0.000002 3.867732e-07
4 0.0 -0.000006 1.864081e-06
5 0.0 -0.000031 9.339538e-06
EDIT 1:
If you need to remove the columns that contain only zeros:
df = df.loc[:, df.ne(0).any()]
print (df)
col2 col3
idx
1 0.000004 -1.063093e-06
2 -0.000001 4.711234e-06
3 0.000003 -5.669502e-07
4 0.000004 -1.094726e-06
5 0.000004 -1.107476e-06
1 -0.000006 1.127179e-06
2 0.000003 -8.840886e-06
3 -0.000002 3.867732e-07
4 -0.000006 1.864081e-06
5 -0.000031 9.339538e-06
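A possible alternative sketch (not part of the answer above): since every metadata line in the sample starts with '$' and the file opens with two free-text lines, the comment and skiprows parameters of read_csv can drop them at parse time. The 2-line skip is an assumption based on the sample shown:
import pandas as pd
from io import StringIO

# Reusing `temp` from above; swap in 'Dataset.csv' for the real file.
df = pd.read_csv(StringIO(temp),
                 skiprows=2,                  # the two description lines at the top
                 comment='$',                 # ignore $TITLE / $SUBTITLE / $LABEL lines
                 header=None,
                 names=['idx', 'col1', 'col2', 'col3'],
                 index_col='idx')
print(df)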
I have a dataframe with two month-value columns, 'month1' and 'month2'. If the value in the 'month1' column is not 'NA', then sum the corresponding 'amount' values grouped by 'month1'. If the value in the 'month1' column is 'NA', then sum the corresponding 'amount' values grouped by 'month2'.
import pandas as pd
df = pd.DataFrame({'month1': [1, 2, 'NA', 1, 4, 'NA', 'NA'],
                   'month2': ['NA', 5, 1, 2, 'NA', 1, 3],
                   'amount': [10, 20, 40, 50, 60, 70, 100]})
The input and output dataframes are as follows:
Input dataframe
month1 month2 amount
0 1.0 NaN 10
1 2.0 5.0 20
2 NaN 1.0 40
3 1.0 2.0 50
4 4.0 NaN 60
5 NaN 1.0 70
6 NaN 3.0 100
Output dataframe
Since your NA values are strings, you can simply groupby on the two columns:
# ignore month2 if month1 is NA
df.loc[df.month1.ne('NA'), 'month2'] = 'NA'
# groupby and sum
df.groupby(['month1','month2']).amount.transform('sum')
If you don't want to alter your data, you can do:
s = np.where(df.month1.ne('NA'), 'NA', df['month2'])
df.groupby(['month1', s]).amount.transform('sum')
Output:
0 60
1 20
2 110
3 60
4 60
5 110
6 100
Name: amount, dtype: int64
You can use:
c = df.month1.eq('NA')
np.select([c, ~c],
          [df.groupby('month2')['amount'].transform('sum'),
           df.groupby('month1')['amount'].transform('sum')],
          default='NA')  # assign to new column
array(['60', '20', '110', '60', '60', '110', '100'], dtype='<U21')
Edit: as @rafael pointed out, your data may be a mix of numbers and strings, so converting them all to numeric before processing is needed.
A simple way is to groupby and transform month1 and month2 separately, then fillna the result of month1 with month2:
df = df.apply(pd.to_numeric, errors='coerce')
m1 = df.groupby('month1').amount.transform('sum')
m2 = df.groupby('month2').amount.transform('sum')
m1.fillna(m2)
Out[406]:
0 60.0
1 20.0
2 110.0
3 60.0
4 60.0
5 110.0
6 100.0
Name: amount, dtype: float64
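If the summed amounts are needed back on the original frame, a minimal follow-up sketch (reusing m1 and m2 from above; 'amount_sum' is just a hypothetical column name):
df['amount_sum'] = m1.fillna(m2)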