Replace pandas Series with Series of different length but same indices - python

Suppose two pandas Series A and B:
A:
1 4
2 4
3 4
4 1
5 3
B:
3 4
4 4
5 2
A is longer than B; B's index is a subset of A's, with different values at those positions. I'm trying to replace the values of A with those of B wherever the indices match.
A.replace(to_replace=B) seems obvious but does not work. What am I missing here?

I think you can use combine_first:
C = B.combine_first(A).astype(int)
print(C)
1 4
2 4
3 4
4 4
5 2
dtype: int32

An alternative solution with more basic pandas operations:
A.loc[B.index] = B.values
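Both answers can be checked with a short, self-contained sketch (the Series below are reconstructed from the question's printouts):

```python
import pandas as pd

# the two Series from the question: A covers indices 1-5, B only 3-5
A = pd.Series([4, 4, 4, 1, 3], index=[1, 2, 3, 4, 5])
B = pd.Series([4, 4, 2], index=[3, 4, 5])

# combine_first takes B's values wherever B's index overlaps A's,
# and falls back to A's values everywhere else
C = B.combine_first(A).astype(int)

# the .loc alternative updates A in place, aligning on B's index
A.loc[B.index] = B
print(C.tolist())  # [4, 4, 4, 4, 2]
```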

groupby a column to get a minimum value and corresponding value in another column [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
Had a similar situation but with a more complex column heading (e.g. "B val"), in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with pd.DataFrame.head:
data.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2], 'B': [4, 5, 2, 7, 4, 6], 'C': [3, 4, 10, 2, 4, 6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then we merge this Series back onto the original data frame:
data = data.merge(min_value, on='A', suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B == data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can use sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
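As a quick self-contained check of this shortcut (note that with tied minima it keeps only whichever row happens to sort first):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})

# after sorting by B, the first row seen for each A value is its minimum,
# and drop_duplicates keeps exactly that first row; sort_index restores order
out = df.sort_values('B').drop_duplicates('A').sort_index()
print(out.values.tolist())  # [[1, 2, 10], [2, 4, 4]]
```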
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
But if you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B, so I used dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to keep the rows where column B equals the group minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
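For reference, the two main approaches above can be cross-checked on the question's data; they agree here, though the transform mask would keep all tied minima while idxmin keeps only the first:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})

# idxmin: index label of the first minimum per group, then .loc to fetch rows
via_idxmin = df.loc[df.groupby('A')['B'].idxmin()].reset_index(drop=True)

# transform: broadcast each group's minimum back to its rows, keep matches
via_mask = df[df['B'] == df.groupby('A')['B'].transform('min')].reset_index(drop=True)

print(via_idxmin.values.tolist())  # [[1, 2, 10], [2, 4, 4]]
```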

Is there a way to filter out rows from a table with an unnamed column

I'm currently trying to do an analysis of rolling correlations of a dataset with four compared values, but I only need the output rows containing 'a'.
I got my data frame by using the command newdf = df.rolling(3).corr()
Sample input (random numbers)
a b c d
1 a
1 b
1 c
1 d
2 a
2 b
2 c
2 d
3 a
3 b 5 6 3
3 c 4 3 1
3 d 3 4 2
4 a 1 3 5 6
4 b 6 2 4 1
4 c 8 6 6 7
4 d 2 5 4 6
5 a 2 5 4 1
5 b 1 4 6 3
5 c 2 6 3 7
5 d 3 6 3 7
and need the output
a b c d
1 a 1 3 5 6
2 a 2 5 4 1
I've tried filtering it by doing adf = newdf.filter(['a'], axis=0), but that gets rid of everything, and doing it on the other axis filters by column. Unfortunately, the column containing the values a, b, c, d is unnamed, so I can't filter on that column individually. This wouldn't be an issue, however, if it's possible to flip the rows and columns, with the values listed by index, to get the desired output.
Try using loc: set the column of repeating a, b, c, d values as the index and then just use loc:
df.loc['a']
The actual source of the problem in your case is that your DataFrame has a MultiIndex. So when you attempt to execute newdf.filter(['a'], axis=0), you want to keep rows whose index is exactly the string "a". But since your DataFrame has a MultiIndex, each row with "a" at level 1 also contains some number at level 0.
To get your intended result, run:
newdf.filter(like='a', axis=0)
maybe followed by .dropna().
An alternative solution is:
newdf.xs('a', level=1, drop_level=False)
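A small sketch of the xs approach, using a toy frame with the same index shape that df.rolling(3).corr() produces (level 0 holds the original row label, level 1 the column names):

```python
import pandas as pd

# toy stand-in for the rolling-correlation output described in the question
idx = pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b']])
newdf = pd.DataFrame({'a': [1.0, 2, 3, 4, 5, 6],
                      'b': [6.0, 5, 4, 3, 2, 1]}, index=idx)

# cross-section: keep only rows whose level-1 label is exactly 'a'
rows_a = newdf.xs('a', level=1, drop_level=False)
print(list(rows_a.index))  # [(1, 'a'), (2, 'a'), (3, 'a')]
```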

issue with np.where() for creating new column in Pandas (possibly NaN issue?)

I have a dataframe with 2 columns, and I want to create a 3rd column based on a comparison between the 2 columns.
So the logic is:
column 1 val = 3, column 2 val = 4, so the new column value is nothing
column 1 val = 3, column 2 val = 2, so the new column is 3
It's a very similar problem to one previously asked, but the answer there, using np.where(), isn't working for me.
Here's what I tried:
FinalDF['c'] = np.where(FinalDF['a']>FinalDF['b'],[FinalDF['a'],""])
and after that failed I tried to see if maybe it doesn't like the [x,y] I gave it, so I tried:
FinalDF['c'] = np.where(FinalDF['a']>FinalDF['b'],[1,0])
the result is always:
ValueError: either both or neither of x and y should be given
Edit: I also removed the [x,y], to see what happens, since the documentation says it is optional. But I still get an error:
ValueError: Length of values does not match length of index
Which is odd because they are sitting in the same dataframe, although one column does have some NaN values.
I don't think I can use np.select because I only have one condition here. I've linked to the previous questions so readers can reference them.
Thanks for any help.
I think that this should work:
FinalDF['c'] = np.where(FinalDF['a']>FinalDF['b'], FinalDF['a'],"")
Example:
FinalDF = pd.DataFrame({'a': [4, 2, 4, 5, 5, 4],
                        'b': [4, 3, 2, 2, 2, 4]})
print(FinalDF)
a b
0 4 4
1 2 3
2 4 2
3 5 2
4 5 2
5 4 4
Output:
a b c
0 4 4
1 2 3
2 4 2 4
3 5 2 5
4 5 2 5
5 4 4
or, if column c should instead be filled when column b is greater than column a, use this:
FinalDF['c'] = np.where(FinalDF['a']<FinalDF['b'], FinalDF['b'],"")
Output:
a b c
0 4 4
1 2 3 3
2 4 2
3 5 2
4 5 2
5 4 4
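Put together as a runnable sketch (note that using "" as the fallback makes numpy cast the whole column to strings; np.nan keeps it numeric):

```python
import numpy as np
import pandas as pd

FinalDF = pd.DataFrame({'a': [4, 2, 4, 5, 5, 4],
                        'b': [4, 3, 2, 2, 2, 4]})

# x and y must be two separate arguments, not a single [x, y] list
FinalDF['c'] = np.where(FinalDF['a'] > FinalDF['b'], FinalDF['a'], "")

# np.nan as the fallback keeps the new column numeric instead of string
FinalDF['c_num'] = np.where(FinalDF['a'] > FinalDF['b'], FinalDF['a'], np.nan)
print(FinalDF)
```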

Python Dataframe conditionally drop_duplicates

I want to drop duplicated rows for a dataframe, based on the type of values of a column. For example, my dataframe is:
A B
3 4
3 4
3 5
yes 8
no 8
yes 8
If df['A'] is a number, I want to drop_duplicates().
If df['A'] is a string, I want to keep the duplicates.
So the desired result would be:
A B
3 4
3 5
yes 8
no 8
yes 8
Besides using for loops, is there a Pythonic way to do this? Thanks!
Create a new column C: if column A is numeric, assign a common value in C; otherwise, assign a unique value in C.
After that, just drop_duplicates as normal.
Note: there is a nice str.isnumeric() method for testing whether a cell holds a number-like string.
In [47]:
df['C'] = np.where(df.A.str.isnumeric(), 1, df.index)
print(df)
A B C
0 3 4 1
1 3 4 1
2 3 5 1
3 yes 8 3
4 no 8 4
5 yes 8 5
In [48]:
print(df.drop_duplicates()[['A', 'B']])  # reset index if needed
A B
0 3 4
2 3 5
3 yes 8
4 no 8
5 yes 8
This solution is more verbose, but might be more flexible for more involved tests:
def true_if_number(x):
    try:
        int(x)
        return True
    except ValueError:
        return False

rows_numeric = df['A'].apply(true_if_number)
pd.concat([df[rows_numeric].drop_duplicates(), df[~rows_numeric]])
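The idea can be sanity-checked end to end on the question's data (assuming column A holds strings, which str.isnumeric requires; Series.append is gone in pandas 2.x, so pd.concat stitches the pieces back together):

```python
import pandas as pd

df = pd.DataFrame({'A': ['3', '3', '3', 'yes', 'no', 'yes'],
                   'B': [4, 4, 5, 8, 8, 8]})

# dedupe only the number-like rows as whole rows; keep string rows untouched
is_num = df['A'].str.isnumeric()
result = pd.concat([df[is_num].drop_duplicates(), df[~is_num]])
print(result['B'].tolist())  # [4, 5, 8, 8, 8]
```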

how to find return minimal values in each column in python dataframe

Hi, I have a dataframe that looks like:
a b c d
2 4 2 7
4 2 3 8
5 3 2 9
I want to return 2 2 2 7.
I would like to know if there is a function to do this, or the most efficient way to do it. Thanks!
df.min(axis=0)
The axis keyword switches the reduction between columns (axis=0, one minimum per column) and rows (axis=1, one minimum per row).
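A minimal check of that behaviour on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 4, 5], 'b': [4, 2, 3],
                   'c': [2, 3, 2], 'd': [7, 8, 9]})

# axis=0 (the default) reduces down each column: one minimum per column
col_mins = df.min(axis=0)
print(col_mins.tolist())  # [2, 2, 2, 7]
```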
