I have two different dataframes that share one column. I am trying to apply a conditional comparison to the following data.
df
a b
1 5
2 4
3 5.5
4 4.2
5 3.1
df1
a c
1 9
2 3
3 5.1
4 4.8
5 3
I wrote the code below:
df.loc['comparison'] = df['b'] > df1['c']
and get the following error:
can only compare identically-labeled Series objects.
Please advise how I can fix this issue.
Your dataframe indices (not displayed in your question) are not aligned. In addition, you are attempting to add a column incorrectly: pd.DataFrame.loc with a single indexer refers to a row label rather than a column.
To overcome these issues, you can reindex one of your Series and use df[col] = ... to create a new column:
df['comparison'] = df['b'] > df1['c'].reindex(df.index)
See Indexing and Selecting Data to understand how to index data in a dataframe.
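A minimal sketch of the fix, assuming (since the indices weren't shown) that df uses the default index while df1's index is shifted by one, which is enough to trigger the original error:

```python
import pandas as pd

# Hypothetical reproduction: same data as the question, but df1 carries a
# shifted index, which is what raises "Can only compare identically-labeled
# Series objects" on a direct comparison.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [5, 4, 5.5, 4.2, 3.1]})
df1 = pd.DataFrame({"a": [1, 2, 3, 4, 5], "c": [9, 3, 5.1, 4.8, 3]},
                   index=[1, 2, 3, 4, 5])

# Align df1['c'] to df's index before comparing; unmatched labels become NaN,
# and a comparison against NaN evaluates to False.
df["comparison"] = df["b"] > df1["c"].reindex(df.index)
print(df["comparison"].tolist())  # [False, False, True, False, False]
```

If the rows correspond purely by position rather than by label, comparing against df1['c'].to_numpy() would be an alternative.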
Related
When we make a new column in a dataset in pandas, such as
df["Max"] = df.iloc[:, 5:7].sum(axis=1)
if we are only getting the columns from index 5 to index 7, why do we need to pass : as well?
pandas.DataFrame.iloc is used purely for integer-location based indexing, i.e. selection by position (read here for documentation). The : means all rows in the selected columns, here column indices 5 and 6 (iloc is not inclusive of the stop index).
You are using .iloc to take a slice out of the dataframe and apply an aggregate function across the columns of that slice.
Consider an example:
df = pd.DataFrame({"a":[0,1,2],"b":[2,3,4],"c":[4,5,6]})
df
would produce the following dataframe
a b c
0 0 2 4
1 1 3 5
2 2 4 6
You are using iloc to avoid dealing with named columns, so that
df.iloc[:,1:3]
would look as follows
b c
0 2 4
1 3 5
2 4 6
Now a slight modification of your code gets you the sums across those columns:
df.iloc[:,1:3].sum(axis=1)
0 6
1 8
2 10
Alternatively you could use function application:
df.apply(lambda x: x.iloc[1:3].sum(), axis=1)
0 6
1 8
2 10
Here you explicitly tell pandas to apply sum across the columns. However, your original syntax is more succinct and preferable to explicit function application; the result is the same, as one would expect.
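Mirroring the question's pattern on this small frame, the same slice-and-sum can be assigned straight to a new column (column positions 1 and 2 here stand in for the question's 5 and 6):

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": [2, 3, 4], "c": [4, 5, 6]})

# ':' selects every row; 1:3 selects the columns at positions 1 and 2
# (the stop position 3 is exclusive). sum(axis=1) then sums across those
# columns for each row.
df["Max"] = df.iloc[:, 1:3].sum(axis=1)
print(df["Max"].tolist())  # [6, 8, 10]
```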
I am using a dataframe (df1) inside a loop to store information that I read from another dataframe (df2). df1 can have a different number of rows in every iteration. I store the data row by row using df1.loc[row_number]. This could be an example:
a b c
0 9 2 3
1 8 5 6
2 3 8 9
Then I need to read the value in the first column of the first row, which I do as follows:
df1['a'].iloc[0]
9
The problem arises when df1 is a one row dataframe:
a 9
b 2
c 3
Name: 0, dtype: int64
It seems that with only one row, pandas stores the data as a pandas Series object. Trying to access the value in the same way (df1['a'].iloc[0]), I get the error:
AttributeError: 'numpy.int64' object has no attribute 'iloc'
Is there a way to solve this in a general case, with no need to handle the 1-row dataframe separately?
df1['a'] refers to the column 'a' of a dataframe; in the error case df1 is a Series, so there is no column named 'a' and df1['a'] returns the scalar value itself. Try using df1.iloc[0] directly.
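A small sketch of the difference, using made-up data shaped like the question's: on a DataFrame, ['a'] selects a column Series, so the extra .iloc[0] is needed; on a Series, ['a'] already returns the scalar, and .iloc[0] on the object itself works instead:

```python
import pandas as pd

frame = pd.DataFrame({"a": [9, 8, 3], "b": [2, 5, 8], "c": [3, 6, 9]})
row = frame.iloc[0]  # a single row comes back as a Series, as in the question

# DataFrame: ['a'] yields a column Series, so .iloc[0] extracts the value.
print(frame["a"].iloc[0])  # 9

# Series: ['a'] already yields the scalar; chaining .iloc onto it would raise
# AttributeError. Positional indexing on the object itself still works:
print(row.iloc[0])  # 9
```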
I have a dataframe with columns of different datatypes (floats and ints). Two rows are in the wrong order and I need to swap them, but copying a row onto another does not work.
import pandas as pd
df = pd.DataFrame([
{"a":2.5, "b":10},
{"a":2.7, "b":12},
{"a":2.8, "b":16},
{"a":3.1, "b":18}
])
This does copy the values, but afterwards all rows are of type 'float' (a Series can only hold a single datatype):
df.iloc[1] = df.iloc[2].copy() # changes datatype of b to float
Copying the rows by using slices sets the whole row to NaN:
df.iloc[1:2] = df.iloc[2:3].copy() # sets row 1 to NaN,NaN
a b
0 2.5 10.0
1 NaN NaN
2 2.8 16.0
3 3.1 18.0
Two questions:
What's happening in the second case? Where do the NaNs come from?
How do I copy a row onto another row while keeping the datatypes?
What's happening in the second case? Where do the NaNs come from?
The problem is that the sliced DataFrames have different index values: pandas cannot align the rows, so NaNs are created:
print (df.iloc[1:2])
a b
1 2.7 12
print (df.iloc[2:3])
a b
2 2.8 16
How do I copy a row onto another row while keeping the datatypes?
One solution is to create a one-row DataFrame and change its index label so the rows align:
df.iloc[[1]] = df.iloc[[2]].rename(index={2:1}).copy()
More generally, if you need to look up the index values:
df.iloc[[1]] = df.iloc[[2]].rename(index={df.index[2]: df.index[1]}).copy()
print (df)
     a   b
0  2.5  10
1  2.8  16
2  2.8  16
3  3.1  18
Converting to a NumPy array is possible, but then the dtypes change.
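Putting the rename trick together on the question's data, a sketch that swaps rows 1 and 2 while keeping b an integer column (both one-row frames are captured before assigning, so the second copy doesn't read an already-overwritten row):

```python
import pandas as pd

df = pd.DataFrame([
    {"a": 2.5, "b": 10},
    {"a": 2.7, "b": 12},
    {"a": 2.8, "b": 16},
    {"a": 3.1, "b": 18},
])

# Rename each one-row frame's index to the destination label so the
# assignment aligns instead of producing NaN.
row1 = df.iloc[[1]].rename(index={1: 2})
row2 = df.iloc[[2]].rename(index={2: 1})
df.iloc[[1]] = row2
df.iloc[[2]] = row1

print(df["b"].tolist())  # [10, 16, 12, 18]
print(df["b"].dtype)     # int64
```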
I have created a function that replaces the NaNs in a pandas dataframe with the means of the respective columns. I tested the function with a small dataframe and it worked. When I applied it to a much larger dataframe (30,000 rows, 9 columns), though, I got the error message: IndexError: index out of bounds
The function is the following:
# The 'update' function will replace all the NaNs in a dataframe with the mean of the respective columns
def update(df):  # the function takes one argument, the dataframe that will be updated
    ncol = df.shape[1]  # number of columns in the dataframe
    for i in range(0, ncol):  # loops over all the columns
        # subsets the column with the isnull() mask, selecting the positions
        # in each column where the values are NaN
        df.iloc[:, i][df.isnull().iloc[:, i]] = df.mean()[i]
    return(df)
The small dataframe I used to test the function is the following:
0 1 2 3
0 NaN NaN 3 4
1 NaN NaN 7 8
2 9.0 10.0 11 12
Could you explain the error? Your advice will be appreciated.
I would use DataFrame.fillna() method in conjunction with DataFrame.mean() method:
In [130]: df.fillna(df.mean())
Out[130]:
0 1 2 3
0 9.0 10.0 3 4
1 9.0 10.0 7 8
2 9.0 10.0 11 12
Mean values:
In [138]: df.mean()
Out[138]:
0 9.0
1 10.0
2 7.0
3 8.0
dtype: float64
The reason you are getting "index out of bounds" is that you are assigning the value df.mean()[i], where i is meant to be an ordinal position. df.mean() is a Series whose index holds the columns of df, so df.mean()[something] implies that something had better be a column name. Your positions aren't column names, and that's why you get the error.
Your code, fixed:
def update(df):  # the function takes one argument, the dataframe that will be updated
    ncol = df.shape[1]  # number of columns in the dataframe
    for i in range(0, ncol):  # loops over all the columns
        # subsets the column with the isnull() mask, selecting the positions
        # in each column where the values are NaN
        df.iloc[:, i][df.isnull().iloc[:, i]] = df.mean().iloc[i]
    return(df)
Also, your function alters df in place. You may want to be careful; I'm not sure that's what you intended.
All that said, I'd recommend another approach:
def update(df):
return df.where(df.notnull(), df.mean(), axis=1)
You could use any number of methods to fill missing values with the mean. I'd suggest using #MaxU's answer.
df.where
takes values from df where the first argument is True, otherwise from the second argument
df.where(df.notnull(), df.mean(), axis=1)
df.combine_first with awkward pandas broadcasting
df.combine_first(pd.DataFrame([df.mean()], df.index))
np.where
pd.DataFrame(
np.where(
df.notnull(), df.values,
np.nanmean(df.values, 0, keepdims=1)),
df.index, df.columns)
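A runnable check on the question's small frame, confirming the column-mean fill shown above (df.mean() skips NaN by default, so each hole gets its own column's mean):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, np.nan, 3, 4],
                   [np.nan, np.nan, 7, 8],
                   [9.0, 10.0, 11, 12]])

# Column means ignore NaN: [9.0, 10.0, 7.0, 8.0]. fillna aligns this Series
# of means on the column labels and fills each column's NaNs accordingly.
filled = df.fillna(df.mean())
print(filled[0].tolist())  # [9.0, 9.0, 9.0]
print(filled[1].tolist())  # [10.0, 10.0, 10.0]
```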
I have a pandas DataFrame and I would like to drop some columns based on the values of their means. For example, I have:
column 1 column 2 column 3
1 1 3
1 2 3
2 1 4
I used this solution, which works fine on Windows. For example, from the DataFrame df I would like to drop the columns that have a mean greater than 2.5, so I wrote:
m=df.mean(axis=0)
df.loc[:,m<=2.5]
That works perfectly on Windows, as column 3 is dropped. But when I try it on Linux, I get the following error:
IndexingError: Unalignable boolean Series key provided.
What could be the problem?
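For reference, a minimal sketch of the intended selection on the sample data. The mask produced by m <= 2.5 is a Series indexed by column name, and this IndexingError typically appears when that index does not exactly match df.columns (for example, when m was computed from a different frame), so checking that the two line up in the failing environment is a reasonable first step:

```python
import pandas as pd

df = pd.DataFrame({"column 1": [1, 1, 2],
                   "column 2": [1, 2, 1],
                   "column 3": [3, 3, 4]})

m = df.mean(axis=0)         # Series indexed by column name
kept = df.loc[:, m <= 2.5]  # boolean mask must share df's column labels
print(list(kept.columns))   # ['column 1', 'column 2']
```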