I am exploring pandas library, and I'd find this dataset. My task is to fill ? with mean of by group of column 'num-of-doors'. When I used dataframe.groupby('num-of-doors').mean() pandas was unable to find mean of these columns:
'peak-rpm', 'price', 'bore', 'stroke', 'normalized-losses', 'horsepower'
So, I tried with my own dataset to know why it is not working. I created a file with the following contents
c0,c1,type
1,2,0
2,3,0
2,4,0
1,?,1
1,3,1
and I wrote the following script:
data = pd.read_csv("data.csv")
data = data.replace('?',np.nan)
print(data)
print(data.groupby('type').mean())
this is what I'm getting as output:
c0 c1 type
0 1 2 0
1 2 3 0
2 2 4 0
3 1 NaN 1
4 1 3 1
c0
type
0 1.666667
1 1.000000
Can you please explain what is going on here? Why I'm not getting mean for column c1? Even I tried some Stackoverflow's answers, but still got nothing. Any suggestions?
Really appreciate your help.
The problem is that c1, is not of type numeric, do:
data = data.replace('?',np.nan)
data['c1'] = data['c1'].astype(float)
print(data.groupby('type').mean())
Output
c0 c1
type
0 1.666667 3.0
1 1.000000 3.0
When you read the original data DataFrame, as it has a ? the column is of dtype object (using dtypes to verify):
c0 int64
c1 object
type int64
dtype: object
If you want to replace the nan, with the mean of the group use transform + fillna:
data = data.replace('?',np.nan)
data['c1'] = data['c1'].astype(float)
res = data.groupby('type').transform('mean')
print(data.fillna(res))
Output
c0 c1 type
0 1 2.0 0
1 2 3.0 0
2 2 4.0 0
3 1 3.0 1
4 1 3.0 1
As a last advise you could read the csv as:
data = pd.read_csv("data.csv", na_values='?')
print(data)
Output
c0 c1 type
0 1 2.0 0
1 2 3.0 0
2 2 4.0 0
3 1 NaN 1
4 1 3.0 1
This will save you the need of converting the columns to numeric.
df['c1']=df['c1'].str.replace('[?]','NaN').astype(float)
df.groupby('type').apply(lambda x: x.fillna(x.mean()))
Related
I have a dataset that looks something like this:
index Ind. Code Code_2
1 1 NaN x
2 0 7 NaN
3 1 9 z
4 1 NaN a
5 0 11 NaN
6 1 4 NaN
I also created a list to indicate values in the column Code, something like this:
Code_List=['7', '9', '11']
I would like to create a new column for the indicator that is 1 so long as Ind. = 1, Code is in the above list, and Code 2 is not null
I would like to create a function containing an if statement. I tried this and am not sure if its a syntax issue, but i keep getting attribute errors such as the following:
def New_Indicator(x):
if x['Ind.'] == 1 and (x['Code'].isin[Code_List]) or (x['Code_2'].notnull()):
return 1
else:
return 0
df['NewIndColumn'] = df.apply(lambda x: New_Indicator(x), axis=1)
("'str' object has no attribute 'isin'", 'occurred at index 259')
("'float' object has no attribute 'notnull'", 'occurred at index
259')
The problem is that in your function, x['Code'] is a string, not a Series. I suggest you use numpy.where:
ind1 = df['Ind.'].eq(1)
codes = df.Code.isin(code_list)
code2NotNull = df.Code_2.notnull()
mask = ind1 & codes & code2NotNull
df['indicator'] = np.where(mask, 1, 0)
print(df)
Output
index Ind. Code Code_2 indicator
0 1 1 NaN x 0
1 2 0 7.0 NaN 0
2 3 1 9.0 z 1
3 4 1 NaN a 0
4 5 0 11.0 NaN 0
5 6 1 4.0 NaN 0
Update (as suggested by #splash58):
df['indicator'] = mask.astype(int)
I am facing an issue with my code using queries on simple pandas DataFrame, I am sure I am missing a tiny detail. Can you guys help me with this?
I don't understand why I've only NAN values.
You can change [['date']] to [date] for select Series instead one column DataFrame.
Sample:
df = pd.DataFrame({'A':[1,2,3]})
print (df['A'])
0 1
1 2
2 3
Name: A, dtype: int64
print (df[['A']])
A
0 1
1 2
2 3
print (df[df['A'] == 1])
A
0 1
print (df[df[['A']] == 1])
A
0 1.0
1 NaN
2 NaN
I have a car dataset where I want to replace the '?' values in the column normalized-values to the mean of the remaining numerical values. The code I have used is:
mean = df["normalized-losses"].mean()
df["normalized-losses"].replace("?",mean)
However, this produces the error:
ValueError: could not convert string to float: '???164164?158?158?192192188188??121988111811811814814814814811014513713710110110111078106106858585107????145??104104104113113150150150150129115129115?115118?93939393?142???161161161161153153???125125125137128128128122103128128122103168106106128108108194194231161161??161161??16116116111911915415415474?186??????1501041501041501048383831021021021021028989858587877477819191919191919191168168168168134134134134134134656565656519719790?1221229494949494?256???1037410374103749595959595'
Can anyone help with the way in which I can convert the '?' values to the mean values. Also, this is the first time I am working with the Pandas package so if I have made any silly mistakes, please forgive me.
Use to_numeric for convert non numeric values to NaNs and then fillna with mean:
vals = pd.to_numeric(df["normalized-losses"], errors='coerce')
df["normalized-losses"] = vals.fillna(vals.mean())
#data from jpp
print (df)
normalized-losses
0 1.0
1 2.0
2 3.0
3 3.4
4 5.0
5 6.0
6 3.4
Details:
print (vals)
0 1.0
1 2.0
2 3.0
3 NaN
4 5.0
5 6.0
6 NaN
Name: normalized-losses, dtype: float64
print (vals.mean())
3.4
Use replace() followed byfillna():
df['normalized-losses'] = df['normalized-losses'].replace('?',np.NaN)
df['normalized-losses'].fillna(df['normalized-losses'].mean())
The mean of a series of mixed types is not defined. Convert to numeric and then use replace:
df = pd.DataFrame({'A': [1, 2, 3, '?', 5, 6, '??']})
mean = pd.to_numeric(df['A'], errors='coerce').mean()
df['B'] = df['A'].replace('?', mean)
print(df)
A B
0 1 1
1 2 2
2 3 3
3 ? 3.4
4 5 5
5 6 6
6 ?? ??
If you need to replace all non-numeric values, then use fillna:
nums = pd.to_numeric(df['A'], errors='coerce')
df['B'] = nums.fillna(nums.mean())
print(df)
A B
0 1 1.0
1 2 2.0
2 3 3.0
3 ? 3.4
4 5 5.0
5 6 6.0
6 ?? 3.4
I have a data set including the user ID, item ID(both string) and rating like that:
A12VH45Q3H5R5I B000NWJTKW 5.0
A3J8AQWNNI3WSN B000NWJTKW 4.0
A1XOBWIL4MILVM B000NWJTKW 1.0
I'd like to change the IDs to integer like:
1 1 5.0
2 1 4.0
3 1 1.0
I have tried a traditional way, creating a big dictionary and mark each string ID with an integer one. But it took extremely long time. So could you please tell me how to finish it in a more fast way? Thanks in advance.
You can apply factorize:
In [244]:
df[[0,1]] = df[[0,1]].apply(lambda x: pd.factorize(x)[0] + 1)
df
Out[244]:
0 1 2
0 1 1 5
1 2 1 4
2 3 1 1
You could also encode the column as a categorical and then get the codes.
df['User_ID_code'] = df.User_ID.astype('category').cat.codes
>>> df
User_ID Item_ID Rating User_ID_code
0 A12VH45Q3H5R5I B000NWJTKW 5 0
1 A3J8AQWNNI3WSN B000NWJTKW 4 2
2 A1XOBWIL4MILVM B000NWJTKW 1 1
I have a data frame
a = pd.DataFrame({'a':[1,2,3,4], 'b':[1,1,2,2], 'c':[1,1,1,2]})
>>> a
a b c
0 1 1 1
1 2 1 1
2 3 2 1
3 4 2 2
I would like to compute the mean of a once that it has been grouped according to the value of b an c.
So i should split the data in 3 groups:
b=1,c=1
b=1,c=2
b=2,c=2
and then compute the mean of a in each group.
How can I do that?
I suspect that I have to use groupby but I do not understand how.
You can groupby multiple columns by passing a list of the column names, then it's just a simple case of calling mean on the gorupby object:
In [4]:
a.groupby(['b','c']).mean()
Out[4]:
a
b c
1 1 1.5
2 1 3.0
2 4.0
If you want to restore the columns that were grouped by back as columns, just call reset_index():
In [5]:
a.groupby(['b','c']).mean().reset_index()
Out[5]:
b c a
0 1 1 1.5
1 2 1 3.0
2 2 2 4.0