IndexingError: Unalignable boolean Series key provided in Linux - python

I have a pandas DataFrame and I would like to drop some columns based on the values of their mean. For example I have:
column 1  column 2  column 3
       1         1         3
       1         2         3
       2         1         4
I used this solution, which works fine on Windows. For example, from the DataFrame df, I would like to drop columns that have a mean greater than 2.5, so I wrote:
m=df.mean(axis=0)
df.loc[:,m<=2.5]
That works perfectly on Windows: column 3 is dropped. But when I try it on Linux, I get the following error:
IndexingError: Unalignable boolean Series key provided
What could be the problem?
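For reference, this is reproducible in a few lines. A hedged workaround, assuming the error comes from the boolean mask not aligning with the columns (e.g. a different pandas version on the Linux machine), is to reindex the mask against df.columns explicitly:

```python
import pandas as pd

# Reconstruction of the example DataFrame from the question.
df = pd.DataFrame({"column 1": [1, 1, 2],
                   "column 2": [1, 2, 1],
                   "column 3": [3, 3, 4]})

# Per-column means: a Series indexed by column name.
m = df.mean(axis=0)

# Reindexing the boolean mask against df.columns guarantees alignment,
# which avoids "Unalignable boolean Series key" on pandas versions that
# refuse a mask whose index differs from the axis being indexed.
result = df.loc[:, (m <= 2.5).reindex(df.columns, fill_value=False)]
print(list(result.columns))  # column 3 (mean 10/3 > 2.5) is dropped
```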

Related

Why do we need to add : when defining a new column using .iloc function

When we make a new column in a dataset in pandas
df["Max"] = df.iloc[:, 5:7].sum(axis=1)
If we are only getting the columns from index 5 to index 7, why do we need to pass : as well?
pandas.DataFrame.iloc is used purely for integer-location based indexing, i.e. selection by position (read here for documentation). The : means all rows in the selected columns, here column indices 5 and 6 (iloc is not inclusive of the last index).
You are using .iloc to take a slice out of the dataframe and apply an aggregate function across the columns of that slice.
Consider an example:
df = pd.DataFrame({"a":[0,1,2],"b":[2,3,4],"c":[4,5,6]})
df
would produce the following dataframe
   a  b  c
0  0  2  4
1  1  3  5
2  2  4  6
You are using iloc to avoid dealing with named columns, so that
df.iloc[:,1:3]
would look as follows
   b  c
0  2  4
1  3  5
2  4  6
Now a slight modification of your code gets you the row-wise sums you would assign to a new column:
df.iloc[:,1:3].sum(axis=1)
0     6
1     8
2    10
Alternatively you could use function application:
df.apply(lambda x: x.iloc[1:3].sum(), axis=1)
0     6
1     8
2    10
Thus you explicitly tell pandas to apply sum across the columns of each row. However, your original syntax is more succinct and preferable to explicit function application; the result is the same either way.
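Tying this back to the question's pattern, the slice-and-sum can be assigned straight to a new column (a minimal sketch using the example frame above):

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": [2, 3, 4], "c": [4, 5, 6]})

# : selects every row; 1:3 selects the columns at positions 1 and 2 (b, c).
df["Max"] = df.iloc[:, 1:3].sum(axis=1)
print(df["Max"].tolist())  # [6, 8, 10]
```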

Pandas dataframe select row by index and column by name

Is there any way to select the row by index (i.e. integer) and column by column name in a pandas data frame?
I tried using loc but it returns an error, and I understand iloc only works with indexes.
Here are the first rows of the data frame df. I want to select the first row of the column named 'Volume', and I tried using df.loc[0,'Volume']
Use the get_loc method of Index to get the integer location of a column name.
Suppose this dataframe:
>>> df
    A  B  C
10  1  2  3
11  4  5  6
12  7  8  9
You can use .iloc like this:
>>> df.iloc[1, df.columns.get_loc('B')]
5
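Applied to the original 'Volume' question (a sketch with made-up data; df.index[0] is used so it also works when the index does not start at 0):

```python
import pandas as pd

df = pd.DataFrame({"Volume": [100, 200, 300]}, index=[10, 11, 12])

# Option 1: translate the column name to a position and stay with .iloc.
v1 = df.iloc[0, df.columns.get_loc("Volume")]

# Option 2: translate the row position to a label and stay with .loc.
v2 = df.loc[df.index[0], "Volume"]

print(v1, v2)  # 100 100
```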

Conditionally copy values from dataframe column to another dataframe with different length

I would like to copy values from a column in one dataframe to another dataframe when the values in two other columns match.
example df1:
identifier  price
         1    nan
         1    nan
         3    nan
         3    nan
and so on. There are several rows for every identifier.
In my df2, there is only one value for each identifier in "price"
example df2:
Identifier  price
         1      3
         3      5
I just would like to copy the "price" values in df2 to "price" in df1. It does not matter to me if the values are copied to each column where the identifiers match or just to the first, since I will alter all but the first entry for each identifier in df1["price"] anyways.
Expected output would be still df1 because there are other columns I still need:
identifier  price
         1      3
         1    nan
         3      5
         3    nan
OR:
identifier  price
         1      3
         1      3
         3      5
         3      5
I could work with both.
I tried np.where, but the different lengths of the dataframes cause problems. I also tried using loc, but I got stuck when defining the value that should be inserted in the cell if the condition holds.
Any help is much appreciated, thank you in advance!
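One way to get the second expected output (a sketch, assuming the column really is capitalized 'Identifier' in df2 as shown) is to turn df2 into a lookup Series and map it onto df1:

```python
import pandas as pd

df1 = pd.DataFrame({"identifier": [1, 1, 3, 3],
                    "price": [float("nan")] * 4})
df2 = pd.DataFrame({"Identifier": [1, 3], "price": [3, 5]})

# Build an identifier -> price lookup from df2, then map it onto df1;
# every df1 row with a matching identifier receives that price.
lookup = df2.set_index("Identifier")["price"]
df1["price"] = df1["identifier"].map(lookup)
print(df1["price"].tolist())  # [3, 3, 5, 5]
```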

how to group data in a column based on indices

I am a newbie, slowly learning... I have a unique dataframe as shown below:
       time
index
1      8:51 am
1      8:51 am
1      8:51 am
2      8:52 am
2      8:52 am
3      8:53 am
3      8:53 am
3      8:53 am
I want to combine the dataframe so that each index appears in one row only, as shown below:
       time
index
1      8:51 am
2      8:52 am
3      8:53 am
Try with
df = df.groupby(level=0).head(1)
Nothing looks unique there, that just seems to be whole duplicate rows (unless timestamps can be different for same index number)
The df.drop_duplicates function is what you're looking for.
You can also use it when the timestamps differ for the same index number, by running it over a selected column (the subset argument); keep="first" or keep="last" will keep the first or last of those timestamps.
data.drop_duplicates(subset="time", keep="first", inplace=True)
This keeps the first row for each distinct value in the subset column and drops the rest. (Note that keep=False would instead drop every duplicated row entirely, leaving none of them.)
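Both answers can be sketched on the example data (assuming, as the printout suggests, that the index column is the DataFrame's index):

```python
import pandas as pd

df = pd.DataFrame(
    {"time": ["8:51 am"] * 3 + ["8:52 am"] * 2 + ["8:53 am"] * 3},
    index=pd.Index([1, 1, 1, 2, 2, 3, 3, 3], name="index"),
)

# Approach 1: keep the first row of each group of the index level.
first_per_group = df.groupby(level=0).head(1)

# Approach 2: drop duplicate (index, time) rows, keeping the first occurrence.
deduped = df.reset_index().drop_duplicates(keep="first").set_index("index")

print(first_per_group["time"].tolist())  # ['8:51 am', '8:52 am', '8:53 am']
```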

Comparing two dataframe and getting an error

I have two different dataframes with one similar column. I am trying to apply a conditional statement to the following data.
df
a    b
1    5
2    4
3    5.5
4    4.2
5    3.1
df1
a    c
1    9
2    3
3    5.1
4    4.8
5    3
I wrote the code below:
df.loc['comparison'] = df['b'] > df1['c']
and get the following error:
can only compare identically-labeled Series objects.
Please advise how I can fix this issue.
Your dataframe indices (not displayed in your question) are not aligned. In addition, you are attempting to add a column incorrectly: pd.DataFrame.loc with one indexer refers to a row index rather than a column.
To overcome these issues, you can reindex one of your series and use df[col] to create a new series:
df['comparison'] = df['b'] > df1['c'].reindex(df.index)
See Indexing and Selecting Data to understand how to index data in a dataframe.
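A runnable sketch of the fix on the example values above (here the two indices happen to match, so reindex is a no-op, but it makes the alignment explicit):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [5, 4, 5.5, 4.2, 3.1]})
df1 = pd.DataFrame({"a": [1, 2, 3, 4, 5], "c": [9, 3, 5.1, 4.8, 3]})

# df[...] (not df.loc['comparison'], which targets a *row* label) creates
# a column; reindex aligns df1['c'] to df's index before comparing.
df["comparison"] = df["b"] > df1["c"].reindex(df.index)
print(df["comparison"].tolist())  # [False, True, True, False, True]
```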
