Replacing missing values with mean

Replacing missing values with mean - python

I am exploring pandas library, and I'd find this dataset. My task is to fill ? with mean of by group of column 'num-of-doors'. When I used dataframe.groupby('num-of-doors').mean() pandas was unable to find mean of these columns:
'peak-rpm', 'price', 'bore', 'stroke', 'normalized-losses', 'horsepower'
So, I tried with my own dataset to know why it is not working. I created a file with the following contents
c0,c1,type
1,2,0
2,3,0
2,4,0
1,?,1
1,3,1
and I wrote the following script:
data = pd.read_csv("data.csv")
data = data.replace('?',np.nan)
print(data)
print(data.groupby('type').mean())
this is what I'm getting as output:
c0 c1 type
0 1 2 0
1 2 3 0
2 2 4 0
3 1 NaN 1
4 1 3 1
c0
type
0 1.666667
1 1.000000
Can you please explain what is going on here? Why I'm not getting mean for column c1? Even I tried some Stackoverflow's answers, but still got nothing. Any suggestions?
Really appreciate your help.

The problem is that c1, is not of type numeric, do:
data = data.replace('?',np.nan)
data['c1'] = data['c1'].astype(float)
print(data.groupby('type').mean())
Output
c0 c1
type
0 1.666667 3.0
1 1.000000 3.0
When you read the original data DataFrame, as it has a ? the column is of dtype object (using dtypes to verify):
c0 int64
c1 object
type int64
dtype: object
If you want to replace the nan, with the mean of the group use transform + fillna:
data = data.replace('?',np.nan)
data['c1'] = data['c1'].astype(float)
res = data.groupby('type').transform('mean')
print(data.fillna(res))
Output
c0 c1 type
0 1 2.0 0
1 2 3.0 0
2 2 4.0 0
3 1 3.0 1
4 1 3.0 1
As a last advise you could read the csv as:
data = pd.read_csv("data.csv", na_values='?')
print(data)
Output
c0 c1 type
0 1 2.0 0
1 2 3.0 0
2 2 4.0 0
3 1 NaN 1
4 1 3.0 1
This will save you the need of converting the columns to numeric.

df['c1']=df['c1'].str.replace('[?]','NaN').astype(float)
df.groupby('type').apply(lambda x: x.fillna(x.mean()))

Related

If function: if column A==1 AND 1 column B is in list X and column C is not null, 1. else, 0

I have a dataset that looks something like this:
index Ind. Code Code_2
1 1 NaN x
2 0 7 NaN
3 1 9 z
4 1 NaN a
5 0 11 NaN
6 1 4 NaN
I also created a list to indicate values in the column Code, something like this:
Code_List=['7', '9', '11']
I would like to create a new column for the indicator that is 1 so long as Ind. = 1, Code is in the above list, and Code 2 is not null
I would like to create a function containing an if statement. I tried this and am not sure if its a syntax issue, but i keep getting attribute errors such as the following:
def New_Indicator(x):
if x['Ind.'] == 1 and (x['Code'].isin[Code_List]) or (x['Code_2'].notnull()):
return 1
else:
return 0
df['NewIndColumn'] = df.apply(lambda x: New_Indicator(x), axis=1)
("'str' object has no attribute 'isin'", 'occurred at index 259')
("'float' object has no attribute 'notnull'", 'occurred at index
259')

The problem is that in your function, x['Code'] is a string, not a Series. I suggest you use numpy.where:
ind1 = df['Ind.'].eq(1)
codes = df.Code.isin(code_list)
code2NotNull = df.Code_2.notnull()
mask = ind1 & codes & code2NotNull
df['indicator'] = np.where(mask, 1, 0)
print(df)
Output
index Ind. Code Code_2 indicator
0 1 1 NaN x 0
1 2 0 7.0 NaN 0
2 3 1 9.0 z 1
3 4 1 NaN a 0
4 5 0 11.0 NaN 0
5 6 1 4.0 NaN 0
Update (as suggested by #splash58):
df['indicator'] = mask.astype(int)

Query errors in a panda Dataframe

I am facing an issue with my code using queries on simple pandas DataFrame, I am sure I am missing a tiny detail. Can you guys help me with this?
I don't understand why I've only NAN values.

You can change [['date']] to [date] for select Series instead one column DataFrame.
Sample:
df = pd.DataFrame({'A':[1,2,3]})
print (df['A'])
0 1
1 2
2 3
Name: A, dtype: int64
print (df[['A']])
A
0 1
1 2
2 3
print (df[df['A'] == 1])
A
0 1
print (df[df[['A']] == 1])
A
0 1.0
1 NaN
2 NaN

Error while replacing '?' with mean value in dataframe in Python

I have a car dataset where I want to replace the '?' values in the column normalized-values to the mean of the remaining numerical values. The code I have used is:
mean = df["normalized-losses"].mean()
df["normalized-losses"].replace("?",mean)
However, this produces the error:
ValueError: could not convert string to float: '???164164?158?158?192192188188??121988111811811814814814814811014513713710110110111078106106858585107????145??104104104113113150150150150129115129115?115118?93939393?142???161161161161153153???125125125137128128128122103128128122103168106106128108108194194231161161??161161??16116116111911915415415474?186??????1501041501041501048383831021021021021028989858587877477819191919191919191168168168168134134134134134134656565656519719790?1221229494949494?256???1037410374103749595959595'
Can anyone help with the way in which I can convert the '?' values to the mean values. Also, this is the first time I am working with the Pandas package so if I have made any silly mistakes, please forgive me.

Use to_numeric for convert non numeric values to NaNs and then fillna with mean:
vals = pd.to_numeric(df["normalized-losses"], errors='coerce')
df["normalized-losses"] = vals.fillna(vals.mean())
#data from jpp
print (df)
normalized-losses
0 1.0
1 2.0
2 3.0
3 3.4
4 5.0
5 6.0
6 3.4
Details:
print (vals)
0 1.0
1 2.0
2 3.0
3 NaN
4 5.0
5 6.0
6 NaN
Name: normalized-losses, dtype: float64
print (vals.mean())
3.4

Use replace() followed byfillna():
df['normalized-losses'] = df['normalized-losses'].replace('?',np.NaN)
df['normalized-losses'].fillna(df['normalized-losses'].mean())

The mean of a series of mixed types is not defined. Convert to numeric and then use replace:
df = pd.DataFrame({'A': [1, 2, 3, '?', 5, 6, '??']})
mean = pd.to_numeric(df['A'], errors='coerce').mean()
df['B'] = df['A'].replace('?', mean)
print(df)
A B
0 1 1
1 2 2
2 3 3
3 ? 3.4
4 5 5
5 6 6
6 ?? ??
If you need to replace all non-numeric values, then use fillna:
nums = pd.to_numeric(df['A'], errors='coerce')
df['B'] = nums.fillna(nums.mean())
print(df)
A B
0 1 1.0
1 2 2.0
2 3 3.0
3 ? 3.4
4 5 5.0
5 6 6.0
6 ?? 3.4

How to change the string in data frame to integer ID with pandas fast?

I have a data set including the user ID, item ID(both string) and rating like that:
A12VH45Q3H5R5I B000NWJTKW 5.0
A3J8AQWNNI3WSN B000NWJTKW 4.0
A1XOBWIL4MILVM B000NWJTKW 1.0
I'd like to change the IDs to integer like:
1 1 5.0
2 1 4.0
3 1 1.0
I have tried a traditional way, creating a big dictionary and mark each string ID with an integer one. But it took extremely long time. So could you please tell me how to finish it in a more fast way? Thanks in advance.

You can apply factorize:
In [244]:
df[[0,1]] = df[[0,1]].apply(lambda x: pd.factorize(x)[0] + 1)
df
Out[244]:
0 1 2
0 1 1 5
1 2 1 4
2 3 1 1

You could also encode the column as a categorical and then get the codes.
df['User_ID_code'] = df.User_ID.astype('category').cat.codes
>>> df
User_ID Item_ID Rating User_ID_code
0 A12VH45Q3H5R5I B000NWJTKW 5 0
1 A3J8AQWNNI3WSN B000NWJTKW 4 2
2 A1XOBWIL4MILVM B000NWJTKW 1 1

Pandas data frame mean by variables

I have a data frame
a = pd.DataFrame({'a':[1,2,3,4], 'b':[1,1,2,2], 'c':[1,1,1,2]})
>>> a
a b c
0 1 1 1
1 2 1 1
2 3 2 1
3 4 2 2
I would like to compute the mean of a once that it has been grouped according to the value of b an c.
So i should split the data in 3 groups:
b=1,c=1
b=1,c=2
b=2,c=2
and then compute the mean of a in each group.
How can I do that?
I suspect that I have to use groupby but I do not understand how.

You can groupby multiple columns by passing a list of the column names, then it's just a simple case of calling mean on the gorupby object:
In [4]:
a.groupby(['b','c']).mean()
Out[4]:
a
b c
1 1 1.5
2 1 3.0
2 4.0
If you want to restore the columns that were grouped by back as columns, just call reset_index():
In [5]:
a.groupby(['b','c']).mean().reset_index()
Out[5]:
b c a
0 1 1 1.5
1 2 1 3.0
2 2 2 4.0

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Replacing missing values with mean - python

df['c1']=df['c1'].str.replace('[?]','NaN').astype(float) df.groupby('type').apply(lambda x: x.fillna(x.mean()))

Related

If function: if column A==1 AND 1 column B is in list X and column C is not null, 1. else, 0

Query errors in a panda Dataframe

Error while replacing '?' with mean value in dataframe in Python

How to change the string in data frame to integer ID with pandas fast?

Pandas data frame mean by variables

Categories

Resources