fillna() not allowing floating values

I'm testing a simple imputation method on the side using a copy of my dataset. I'm essentially trying to impute missing values with categorical means grouped by the target variable.
df_test_2 = train_df.loc[:, ['Survived', 'Age']].copy()  # copy of dataset for testing
# creating impute function
def impute(df, variable):
    if 'Survived' == 0:
        df[variable] = df[variable].fillna(30.7)
    else:
        df[variable] = df[variable].fillna(28.3)
# imputing
impute(df_test_2, 'Age')
The output is that the imputation is successful, but the values added are 30 and 28 instead of 30.7 and 28.3.
'Age' is float64.
Thank you
Edit: I had pasted an old version of the function call here and have now corrected it. That wasn't the issue in my original code; the problem persists.

Have a look at this to see what may be going on.
To test it, I set up a simple case:
import pandas as pd
import numpy as np
data = {'Survived' : [0,1,1,0,0,1], 'Age' :[12.2,45.4,np.nan,np.nan,64.3,44.3]}
df = pd.DataFrame(data)
df
This gave the following data set:
   Survived   Age
0         0  12.2
1         1  45.4
2         1   NaN
3         0   NaN
4         0  64.3
5         1  44.3
I ran your function exactly as written:
def impute(df, variable):
    if 'Survived' == 0:
        df[variable] = df[variable].fillna(30.7)
    else:
        df[variable] = df[variable].fillna(28.3)
and it yielded this result:
   Survived   Age
0         0  12.2
1         1  45.4
2         1  28.3
3         0  28.3
4         0  64.3
5         1  44.3
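As you can see, the missing age in the row at index 3 (where Survived is 0) got filled with the wrong value: 28.3 instead of 30.7. The problem is the condition 'Survived'==0, which is always False. It compares the string literal 'Survived' with the integer 0 instead of looking at the values in the column, so the else branch always runs.
What you may want is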
df2 = df[df['Survived'] == 0].fillna(30.7)
df3 = df[df['Survived'] == 1].fillna(28.3)
dfout = pd.concat([df2, df3])  # DataFrame.append was removed in pandas 2.0
and the output is
   Survived   Age
0         0  12.2
3         0  30.7
4         0  64.3
1         1  45.4
2         1  28.3
5         1  44.3
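If the original row order matters, one possible variant is to fill the missing ages in place rather than splitting and re-joining the frame. The sketch below reuses the toy data and the two constants from the question; the name fill_by_class is only illustrative, and the commented-out line assumes the constants were meant to be per-class means.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 0, 1],
    'Age': [12.2, 45.4, np.nan, np.nan, 64.3, 44.3],
})

# Fill value per class, taken from the question (30.7 for 0, 28.3 for 1).
fill_by_class = {0: 30.7, 1: 28.3}

# Map each row's class to its fill value, then fill only the missing ages.
# fillna aligns on the index, so existing ages are left untouched.
df['Age'] = df['Age'].fillna(df['Survived'].map(fill_by_class))

# Alternative (assumption: the constants are meant to be per-class means):
# df['Age'] = df['Age'].fillna(df.groupby('Survived')['Age'].transform('mean'))
print(df)
```

Because nothing is split or concatenated, the index stays in its original order 0 to 5.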

Anish
I think it is better to use the apply() method available in pandas. This method applies a custom function along the rows or the columns of a DataFrame.
Here is a related post: Stack Question
pandas documentation: Doc Apply df
Regards,
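A minimal sketch of what the apply() suggestion could look like for this question, assuming the same toy data and constants as above. The helper name impute_age is hypothetical. Row-wise apply is usually slower than a vectorised fillna, so this is mainly useful when the rule per row is more involved.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 0, 1],
    'Age': [12.2, 45.4, np.nan, np.nan, 64.3, 44.3],
})

def impute_age(row):
    """Return the row's age, or a class-specific default when it is missing."""
    if pd.isna(row['Age']):
        return 30.7 if row['Survived'] == 0 else 28.3
    return row['Age']

# axis=1 passes each row to the function as a Series.
df['Age'] = df.apply(impute_age, axis=1)
print(df)
```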

Related

split Python DataFrame into k parts with index and iterate over them in a loop

I suppose that someone might have asked this already, but for the life of me I cannot find what I need after some looking, possibly my level of Py is too low.
I saw several questions with answers using globals() and exec() with comments that it's a bad idea, other answers suggest using dictionaries or lists. At this point I got a bit loopy about what to use here and any help would be very welcome.
What I need is roughly this:
I have a Python DataFrame, say called dftest
I'd like to split dftest into say 6 parts of similar size
then I'd like to iterate over them (or possibly parallelise?) and run some steps calling some spatial functions that use parameters (param0, param1, ... param5) over each of the rows of each df to add more columns, preferably exporting each result to a csv (as it takes a long time to complete one part, I wouldn't want to lose the result of each iteration)
And then I'd like to put them back together into one DataFrame, say dfresult (possibly with concat), and continue doing the next thing with it
To keep it simple, this is what a toy dftest looks like (the original df has more rows and columns):
print(dftest)
#     rowid        type    lon  year
# 1       1        Tomt    NaN  2021
# 2       2    Lägenhet  12.72  2022
# 3       3    Lägenhet    NaN  2017
# 4       4       Villa  17.95  2016
# 5       5      Radhus  17.95  2021
# 6       6       Villa  17.95  2016
# 7       7  Fritidshus  18.64  2020
# 8       8       Villa  18.64  2019
# 9       9       Villa  18.63  2021
# 10     10       Villa  18.63  2019
# 11     11       Villa  17.66  2017
# 12     12      Radhus  17.66  2022
So here is what I tried:
dfs = np.array_split(dftest, 6)
for j in range(0, 6):
    print(f'dfs[{j}] has', len(dfs[j].index), 'obs ', min(dfs[j].index), 'to ', max(dfs[j].index))
where I get output:
# dfs[0] has 2 obs 1 to 2
# dfs[1] has 2 obs 3 to 4
# dfs[2] has 2 obs 5 to 6
# dfs[3] has 2 obs 7 to 8
# dfs[4] has 2 obs 9 to 10
# dfs[5] has 2 obs 11 to 12
So now I'd like to iterate over each df and create more columns. I tried a hardcoded test, one by one something like this:
for row in tqdm(dfs[0].itertuples()):
    x = row.type
    y = foo.bar(x, param="param0")
    i = row[0]  # first element of the tuple is the index label
    dfs[0].loc[i, 'anotherColumn'] = baz(y)
    #... some more functions ...
dfs[0].to_csv("/projectPath/dfs0.csv")
I suppose this should be possible to automate or even run in parallel (how?)
And in the end I'll try putting them together (no clue if this would work), possibly something like this:
pd.concat([dfs[0],dfs[1],dfs[2],dfs[3],dfs[4],dfs[5]])
If I had 100 parts, perhaps something like dfs[0]:dfs[5] would work... but I'm still stuck on the previous step.
PS. I'm using a Jupyter notebook on localhost with Python3.
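The sequential version of this workflow can be sketched as a plain loop over a list of parts. Everything specific here is an assumption for illustration: the toy frame only carries two of the question's columns, process() is a hypothetical stand-in for the real spatial functions, and the output folder is a temporary directory. Row positions are split instead of the DataFrame itself, which avoids relying on how np.array_split treats DataFrames.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy frame standing in for dftest (a subset of the columns in the question).
dftest = pd.DataFrame(
    {'rowid': range(1, 13),
     'year': [2021, 2022, 2017, 2016, 2021, 2016,
              2020, 2019, 2021, 2019, 2017, 2022]},
    index=range(1, 13),
)

def process(part, param):
    """Hypothetical per-part step: replace with the real spatial functions."""
    part = part.copy()
    part['anotherColumn'] = part['year'].astype(str) + '_' + param
    return part

out_dir = Path(tempfile.mkdtemp())  # assumption: any writable folder works

# Split row positions into k nearly equal groups, then slice with iloc.
k = 6
positions = np.array_split(np.arange(len(dftest)), k)
results = []
for j, pos in enumerate(positions):
    part = process(dftest.iloc[pos], param=f'param{j}')
    part.to_csv(out_dir / f'dfs{j}.csv')  # save each part as soon as it is done
    results.append(part)

# pd.concat accepts the whole list, so the number of parts does not matter.
dfresult = pd.concat(results)
print(dfresult.head())
```

Since pd.concat takes the list directly, the same code works for 6 or 100 parts without spelling out dfs[0], dfs[1], and so on.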

As far as I understand, you can use the chunk_apply function of the parallel-pandas library. This function splits the dataframe into chunks, applies a custom function to each chunk, and then concatenates the results. Everything runs in parallel. Toy example:
#pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas
#initialize parallel-pandas
# n_cpu - count of cores and split chunks
ParallelPandas.initialize(n_cpu=8)
def foo(df):
    # do something with df
    df['new_col'] = df.sum(axis=1)
    return df
if __name__ == '__main__':
    ROW = 10000
    COL = 10
    df = pd.DataFrame(np.random.random((ROW, COL)))
    res = df.chunk_apply(foo, axis=0)
    print(res.head())
Out:
0 1 2 ... 8 9 new_col
0 0.735248 0.393912 0.966608 ... 0.261675 0.207216 6.276589
1 0.256962 0.461601 0.341175 ... 0.688134 0.607418 5.297881
2 0.335974 0.093897 0.622115 ... 0.442783 0.115127 3.102827
3 0.488585 0.709927 0.209429 ... 0.942065 0.126600 4.367873
4 0.619996 0.704085 0.685806 ... 0.626539 0.145320 4.901926

Calculate average temperature in CSV

Below is my data set.
I want to calculate the average temperature of each station. Ideally, I would also like to remove the zero (noise) readings.
How can I do it?
I have no idea how to start.

I think you can use the standard csv reader to iterate over the rows, collect all temperatures into a list t, and then compute something like sum(t) / len(t).
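A minimal sketch of that idea with only the standard library, extended to one list per station. The column names ID and temp and the sample rows are assumptions (the original data set was posted as an image); an in-memory string stands in for the real file.

```python
import csv
import io
from collections import defaultdict

# Assumed layout: one row per reading, with a station ID and a temperature.
raw = """ID,stuff,temp
1,3,20
1,6,20.1
1,7,21.4
2,1,30.2
2,3,0
2,2,34
3,7,0
3,6,0
3,2,14.4
"""

temps = defaultdict(list)
for row in csv.DictReader(io.StringIO(raw)):
    t = float(row['temp'])
    if t != 0:  # skip zero readings, treated as noise
        temps[row['ID']].append(t)

# Average per station: sum of the readings divided by their count.
averages = {station: sum(t) / len(t) for station, t in temps.items()}
for station, avg in averages.items():
    print(station, round(avg, 2))
```

For a real file, replacing io.StringIO(raw) with an open file handle (opened with newline='') should be enough.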

Just turn the data into a pandas.DataFrame and use the pandas.DataFrame.groupby method:
import pandas as pd
file = 'your_file.csv'  # placeholder: path to your csv file
df = pd.read_csv(file)
df = df.drop(df[df['temp'] == 0].index)  # drop the zero (noise) readings
print(df.groupby('ID')[['temp']].mean())
Gives:
    temp
ID
1   20.5
2   32.1
3   14.4
Note: the file I used looks like...
ID,stuff,temp
1,3,20
1,6,20.1
1,7,21.4
2,1,30.2
2,3,0
2,2,34
3,7,0
3,6,0
3,2,14.4
If you want to add those averages to the DataFrame as a column, create a dictionary from the group means and use it to map each ID to its mean in a new column (replace is applied to a copy of the ID column, so ID itself is not changed):
mean = df.groupby('ID')[['temp']].mean()  # Store this in a variable
groups = {}
for i in mean.itertuples():  # Iterate over the (ID, mean) pairs
    groups[i[0]] = i[1]
df['avg_temp'] = df['ID'].replace(groups)  # Create a new column
print(df)
Gives:
   ID  stuff  temp  avg_temp
0   1      3  20.0      20.5
1   1      6  20.1      20.5
2   1      7  21.4      20.5
3   2      1  30.2      32.1
5   2      2  34.0      32.1
8   3      2  14.4      14.4
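As a possible shortcut for the per-row average column, groupby(...).transform('mean') returns a result aligned with the original rows, so no dictionary is needed. This sketch reads the sample file contents from an in-memory string instead of a file on disk.

```python
import io

import pandas as pd

raw = """ID,stuff,temp
1,3,20
1,6,20.1
1,7,21.4
2,1,30.2
2,3,0
2,2,34
3,7,0
3,6,0
3,2,14.4
"""

df = pd.read_csv(io.StringIO(raw))
df = df[df['temp'] != 0]  # keep only the non-zero readings

# transform broadcasts each group's mean back onto the rows of that group.
df['avg_temp'] = df.groupby('ID')['temp'].transform('mean')
print(df)
```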

how to apply filter condition in percentage string column using pandas?

I am working on the df below but am unable to apply a filter on the percentage field, although the same filter works normally in Excel.
I need to apply the filter condition > 100.00% to that field using pandas.
I tried reading the data from HTML, CSV and Excel in pandas, but I am unable to use the condition.
It requires a float conversion, but that does not work with the given data.

I am assuming that the values you have are read as strings in pandas:
import pandas as pd
data = ['4,700.00%', '3,900.00%', '1,500.00%', '1,400.00%', '1,200.00%', '0.15%', '0.13%', '0.12%', '0.10%', '0.08%', '0.07%']
df = pd.DataFrame(data)
df.columns = ['data']
printing the df:
data
0 4,700.00%
1 3,900.00%
2 1,500.00%
3 1,400.00%
4 1,200.00%
5 0.15%
6 0.13%
7 0.12%
8 0.10%
9 0.08%
10 0.07%
then:
df['data'] = df['data'].str.rstrip('%').str.replace(',','').astype('float')
df_filtered = df[df['data'] > 100]
Results:
data
0 4700.0
1 3900.0
2 1500.0
3 1400.0
4 1200.0
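If the real column may also contain entries that are not valid percentages, astype('float') raises an error. One possible variant, sketched below, uses pd.to_numeric with errors='coerce' so such entries become NaN and drop out of the filter. The 'n/a' value and the pct column name are hypothetical additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'data': ['4,700.00%', '1,200.00%', '0.15%', 'n/a']})

# Strip the percent sign and the thousands separators as plain strings.
cleaned = (
    df['data']
    .str.rstrip('%')
    .str.replace(',', '', regex=False)
)

# Anything that still is not a number becomes NaN instead of raising.
df['pct'] = pd.to_numeric(cleaned, errors='coerce')

df_filtered = df[df['pct'] > 100]
print(df_filtered)
```

Keeping the parsed numbers in a separate column leaves the original strings available for display.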

I used this code as well: .str.rstrip('%') and .str.replace(',','').astype('float'). It works fine.

Python Pandas Running Totals with Resets

I would like to perform the following task. Given two columns (good and bad), I would like to replace rows in the two columns with a running total. Here is an example of the current dataframe along with the desired dataframe.
EDIT: I should have added what my intentions are. I am trying to create an equally binned variable (in this case 20 bins) using a continuous variable as the input. I know the pandas cut and qcut functions are available; however, the returned results will have zeros for the good/bad rate (needed to compute the weight of evidence and information value). Zeros in either the numerator or the denominator will not allow the mathematical calculations to work.
import pandas as pd
d = {'AAA': range(0, 20),
     'good': [3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
     'bad': [0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df = pd.DataFrame(data=d)
print(df)
Here is an explanation of what I need to do to the above dataframe.
Roughly speaking, anytime I encounter a zero for either column, I need to use a running total for the column which is not zero to the next row which has a non-zero value for the column that contained zeros.
Here is the desired output:
dd = {'AAA': range(0, 16),
      'good': [19,20,60,59,72,64,52,38,24,17,19,12,5,7,6,2],
      'bad': [1,1,1,6,8,10,6,6,10,5,8,2,2,1,3,2]}
desired_df = pd.DataFrame(data=dd)
print(desired_df)

The basic idea of my solution is to create a grouping column from a cumsum over the non-zero values, so that each run of zero values lands in the same group as the next non-zero value. Then you can use groupby + sum to get the desired values.
two_good = df.groupby((df['bad']!=0).cumsum().shift(1).fillna(0))['good'].sum()
two_bad = df.groupby((df['good']!=0).cumsum().shift(1).fillna(0))['bad'].sum()
two_good = two_good.loc[two_good!=0].reset_index(drop=True)
two_bad = two_bad.loc[two_bad!=0].reset_index(drop=True)
new_df = pd.concat([two_bad, two_good], axis=1).dropna()
print(new_df)
bad good
0 1 19.0
1 1 20.0
2 1 28.0
3 6 91.0
4 8 72.0
5 10 64.0
6 6 52.0
7 6 38.0
8 10 24.0
9 5 17.0
10 8 19.0
11 2 12.0
12 2 5.0
13 1 7.0
14 3 6.0
15 1 2.0
This code treats your edge case of trailing zeros differently from your desired output: it simply cuts it off. You'd have to add some extra code to catch that case with a different logic.
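To make the grouping key easier to follow, here is a small sketch that applies the same expression to the first seven rows of the question's data and prints the intermediate key before summing.

```python
import pandas as pd

# First seven rows of the question's data.
df = pd.DataFrame({'good': [3, 3, 13, 20, 28, 32, 59],
                   'bad':  [0, 0, 1, 1, 1, 0, 6]})

# The counter increases at every non-zero 'bad'; shifting it down by one row
# makes a run of zeros share its key with the non-zero row that ends the run.
key = (df['bad'] != 0).cumsum().shift(1).fillna(0)
print(key.tolist())           # [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 3.0]

grouped_good = df.groupby(key)['good'].sum()
print(grouped_good.tolist())  # [19, 20, 28, 91]
```

Rows 0 to 2 share key 0, so their good values add up to 19, and the zero in row 5 is merged forward into row 6 (32 + 59 = 91), matching the first rows of the output above.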

P.Tillmann, I appreciate your assistance with this. I assume more advanced readers will find this code appalling, as I do. I would be more than happy to take any recommendation that makes it more streamlined.
import pandas as pd
d = {'AAA': range(0, 20),
     'good': [3,3,13,20,28,32,59,72,64,52,38,24,17,19,12,5,7,6,2,0],
     'bad': [0,0,1,1,1,0,6,8,10,6,6,10,5,8,2,2,1,3,1,1]}
df = pd.DataFrame(data=d)
print(df)
row_good = 0
row_bad = 0
row_bad_zero_count = 0
row_good_zero_count = 0
row_out = 'NO'
rows_out = []  # collected output rows (DataFrame.append was removed in pandas 2.0)
for index, row in df.iterrows():
    if row['good'] == 0 or row['bad'] == 0:
        # current row has a zero: keep accumulating
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count += 1
        row_good_zero_count += 1
        output_ind = '1'
        row_out = 'NO'
    elif index+1 < len(df) and (df.loc[index+1,'good'] == 0 or df.loc[index+1,'bad'] == 0):
        # next row has a zero: start a new running total here
        row_bad = row['bad']
        row_good = row['good']
        output_ind = '2'
        row_out = 'NO'
    elif (row_bad_zero_count > 1 or row_good_zero_count > 1) and row['good'] != 0 and row['bad'] != 0:
        # first non-zero row after a run of zeros: close the running total
        row_bad += row['bad']
        row_good += row['good']
        row_bad_zero_count = 0
        row_good_zero_count = 0
        row_out = 'YES'
        output_ind = '3'
    else:
        # ordinary row: output it unchanged
        row_bad = row['bad']
        row_good = row['good']
        row_bad_zero_count = 0
        row_good_zero_count = 0
        row_out = 'YES'
        output_ind = '4'
    if ((row['good'] == 0 or row['bad'] == 0)
            and (index > 0 and (df.loc[index-1,'good'] != 0 or df.loc[index-1,'bad'] != 0))
            and row_good != 0 and row_bad != 0):
        row_out = 'YES'
    if row_out == 'YES':
        temp_dict = {'AAA': row['AAA'],
                     'good': row_good,
                     'bad': row_bad}
        rows_out.append(temp_dict)
    # debug trace for every input row
    print(str(row['AAA']), '-',
          str(row['good']), '-',
          str(row['bad']), '-',
          str(row_good), '-',
          str(row_bad), '-',
          str(row_good_zero_count), '-',
          str(row_bad_zero_count), '-',
          row_out, '-',
          output_ind)
crappy_fix = pd.DataFrame(rows_out)
print(crappy_fix)

rounding series of pandas dataframes

I am trying to solve one of Coursera's homework assignments for beginners.
I have read the data and tried to convert it as shown in the code below. I am looking for the frequency distribution of the variables under consideration, and for this reason I am trying to round the values. I tried several methods, but none of them gives me what I expect (see below).
import pandas as pd
import numpy as np
# loading the database file
data = pd.read_csv('gapminder-2.csv', low_memory=False)
# number of observations (rows)
print(len(data))
# number of variables (columns)
print(len(data.columns))
# pd.to_numeric replaces convert_objects, which was removed from pandas
sub1 = pd.DataFrame({'income': pd.to_numeric(data['incomeperperson'], errors='coerce'),
                     'alcohol': pd.to_numeric(data['alcconsumption'], errors='coerce'),
                     'suicide': pd.to_numeric(data['suicideper100th'], errors='coerce')})
sub1.apply(pd.Series.round)
income = sub1['income'].value_counts(sort=False)
print(income)
However, I got
285.224449 1
2712.517199 1
21943.339898 1
1036.830725 1
557.947513 1
What I expect:
285 1
2712 1
21943 1
1036 1
557 1

You can use Series.round():
ser = pd.Series([1.1,2.1,3.1,5.1])
print(ser)
0 1.1
1 2.1
2 3.1
3 5.1
dtype: float64
From here you can use .round(); the number of decimals defaults to 0, per the docs.
print(ser.round())
0    1.0
1    2.0
2    3.0
3    5.0
dtype: float64
round() returns a new Series and leaves the original unchanged, so to keep the result you need to reassign it: ser = ser.round().
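Applied to the question's DataFrame, the same point means that the result of rounding has to be stored before calling value_counts. The sketch below uses a hypothetical stand-in for sub1 built from three income values shown in the question. It also shows np.floor, on the assumption that the expected value 2712 (from 2712.517199) means truncation rather than rounding.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sub1, using three income values from the question.
sub1 = pd.DataFrame({'income': [285.224449, 2712.517199, 21943.339898]})

rounded = sub1.round()    # returns a new DataFrame; sub1 itself is unchanged
floored = np.floor(sub1)  # drops the fractional part instead of rounding

print(rounded['income'].value_counts(sort=False))
print(floored['income'].tolist())
```

Note that rounding turns 2712.517199 into 2713.0, while np.floor gives 2712.0.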
