Replace part of df column with values defined in Series/dictionary - python

I have a column in a DataFrame whose index often has repeated labels. Some index labels are exceptions and need to be changed based on another Series I've made, while the rest of the rows are fine as is. The Series index is unique.
Here are a couple of variables to illustrate:
df = pd.DataFrame(data={'hi': [1, 2, 3, 4, 5, 6, 7]}, index=[1, 1, 1, 2, 2, 3, 4])
Out[52]:
   hi
1   1
1   2
1   3
2   4
2   5
3   6
4   7
exceptions = pd.Series(data=[90, 95], index=[2, 4])
Out[36]:
2    90
4    95
I would like to set the df to ...
   hi
1   1
1   2
1   3
2  90
2  90
3   6
4  95
What's a clean way to do this? I'm a bit new to Pandas; my first thought was to loop, but I don't think that's the proper way to solve this.

Assuming that the index of exceptions is guaranteed to be a subset of df's index, we can use loc and the Series' index to assign the values:
df.loc[exceptions.index, 'hi'] = exceptions
We can use Index.intersection if exceptions contains extra labels that do not (or should not) align with df:
exceptions = pd.Series(data=[90, 95, 100], index=[2, 4, 5])
df.loc[exceptions.index.intersection(df.index, sort=False), 'hi'] = exceptions
df:
   hi
1   1
1   2
1   3
2  90
2  90
3   6
4  95
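Putting both ideas together, a minimal runnable sketch using the example data above (the label 5 in exceptions is deliberately absent from df to exercise the intersection):

```python
import pandas as pd

df = pd.DataFrame({'hi': [1, 2, 3, 4, 5, 6, 7]}, index=[1, 1, 1, 2, 2, 3, 4])

# exceptions contains a label (5) that is not present in df
exceptions = pd.Series([90, 95, 100], index=[2, 4, 5])

# restrict to labels that actually exist in df before assigning;
# both rows labeled 2 receive the value 90
common = exceptions.index.intersection(df.index, sort=False)
df.loc[common, 'hi'] = exceptions

print(df['hi'].tolist())  # [1, 2, 3, 90, 90, 6, 95]
```

Note that assignment via .loc aligns on labels, so every duplicate row sharing a label in exceptions gets the same replacement value.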

Related

Find the rows in which the (absolute) value is bigger than

I have a pandas dataframe like:
   one  two  three
0    1    3      4
1    2    4      6
2    1    3      4
3   10    3      4
4    2    4      5
5    0    3      4
6  -10    3      4
Now, looking at the first column (labeled 'one'), I would like to find the rows where the value is bigger than, say, 9 (in this case it would be the fourth row).
Ideally, I would also like to find the rows where the absolute value is bigger than, say, 9 (so that would be the fourth and seventh rows).
How can I do this? (So far I have only converted the columns into Series, and even into Series of True/False values, but my DataFrame is huge and I cannot visually inspect it. I need to get the row numbers automatically.)
You can apply abs, compare, and filter with loc:
out = df.loc[df['one'].abs() > 9]
output:
   one  two  three
3   10    3      4
6  -10    3      4
You could use abs() (see pandas.Series.abs):
df = pd.DataFrame({
    'a': [1, 4, -6, 3, 7],
    'b': [2, 3, 5, 3, 1],
    'c': [4, 2, 7, 1, 3]
})
df[df.a.abs() > 5]
which returns the two rows with index 2 and 4.
If you need the matching row labels for every column, you can collect them in a dict:
rows = {}
for column in df:
    # row labels where the absolute value meets the threshold
    index = df.loc[df[column].abs() >= 9].index
    rows[column] = list(index)
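Since the question asks for row numbers rather than index labels, np.flatnonzero on the boolean mask is one option; a small sketch using the example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 1, 10, 2, 0, -10],
                   'two': [3, 4, 3, 3, 4, 3, 3],
                   'three': [4, 6, 4, 4, 5, 4, 4]})

# 0-based positions of the rows where |one| > 9,
# independent of whatever index labels the frame carries
positions = np.flatnonzero(df['one'].abs() > 9)
print(positions.tolist())  # [3, 6]
```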

Efficient way in Pandas to count occurrences of Series of values by row

I have a large dataframe for which I want to count, by row, the number of occurrences of a series of specific values (given by an external function). For reproducibility, let's assume the following simplified dataframe:
data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4], 'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame.from_dict(data)
df
   A  B  C  D  E
0  3  4  1  1  4
1  2  3  2  1  4
2  1  2  3  2  4
3  0  1  4  2  4
How can I count the number of occurrences of specific values (given by a series with the same size) by row?
Again for simplicity, let's assume this values_series is given by the max of each row (with this data, column E is always 4, so every row's max is 4).
values_series = df.max(axis=1)
0    4
1    4
2    4
3    4
dtype: int64
The solution I came up with doesn't seem very pythonic (e.g. I'm using iterrows(), which is slow):
max_count = []
for index, row in df.iterrows():
    max_count.append(row.value_counts()[values_series.loc[index]])
df_counts = pd.Series(max_count)
Is there any more efficient way to do this?
We can compare the transposed df.T directly to the df.max series, thanks to broadcasting:
(df.T == df.max(axis=1)).sum()
# result
0 2
1 1
2 1
3 2
dtype: int64
(Transposing also has the added benefit that we can use sum without specifying the axis, i.e. with the default axis=0.)
You can try
df.eq(df.max(1), axis=0).sum(1)
Out[361]:
0 2
1 1
2 1
3 2
dtype: int64
The perfect job for numpy broadcasting:
a = df.to_numpy()
b = values_series.to_numpy()[:, None]
(a == b).sum(axis=1)
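As a self-contained sketch, all three approaches agree on the example data (they differ mainly in style and in whether the result is a Series or a plain numpy array):

```python
import numpy as np
import pandas as pd

data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4],
        'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame(data)
values_series = df.max(axis=1)

# 1. transpose, then broadcast against the row-wise max
counts_t = (df.T == values_series).sum()

# 2. eq with axis=0 aligns the Series to rows
counts_eq = df.eq(values_series, axis=0).sum(axis=1)

# 3. raw numpy broadcasting (fastest, returns an ndarray)
counts_np = (df.to_numpy() == values_series.to_numpy()[:, None]).sum(axis=1)

print(counts_np.tolist())  # [2, 1, 1, 2]
```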

How to get the (relative) place of values in a dataframe when sorted using Python?

How can I create a Pandas DataFrame that shows the relative position of each value, when those values are sorted from low to high for each column?
So in this case, how can you transform 'df' into 'dfOut'?
import pandas as pd
import numpy as np

# create DataFrame
df = pd.DataFrame({'A': [12, 18, 9, 21, 24, 15],
                   'B': [18, 22, 19, 14, 14, 11],
                   'C': [5, 7, 7, 9, 12, 9]})

# How to assign a value to the order in the column, when sorted from low to high?
dfOut = pd.DataFrame({'A': [2, 4, 1, 5, 6, 3],
                      'B': [3, 5, 4, 2, 2, 1],
                      'C': [1, 2, 2, 3, 4, 3]})
If you need equal values to map to the same output, try using the rank method of a DataFrame, like this:
>> dfOut = df.rank(method="dense").astype(int) # Type transformation added to match your output
>> dfOut
A B C
0 2 3 1
1 4 5 2
2 1 4 2
3 5 2 3
4 6 2 4
5 3 1 3
The rank method computes the rank for each column following a specific criteria. According to the Pandas documentation, the "dense" method ensures that "rank always increases by 1 between groups", and that might match your use case.
Original answer: In case that repeated numbers are not required to map to the same out value, np.argsort could be applied on each column to retrieve the position of each value that would sort the column. Combine this with the apply method of a DataFrame to apply the function on each column and you have this:
>> dfOut = df.apply(lambda column: np.argsort(column.values))
>> dfOut
A B C
0 2 5 0
1 0 3 1
2 5 4 2
3 1 0 3
4 3 2 5
5 4 1 4
Here is my attempt using some functions:
def sorted_idx(l, num):
    x = sorted(list(set(l)))
    for i in range(len(x)):
        if x[i] == num:
            return i + 1

def output_list(l):
    ret = [sorted_idx(l, elem) for elem in l]
    return ret

dfOut = df.apply(lambda column: output_list(column))
print(dfOut)
I reduce the original list to its unique values and then sort them. Finally, I return the index+1 at which each element of the original list appears in this unique, sorted list, which gives the values in your expected output.
Output:
A B C
0 2 3 1
1 4 5 2
2 1 4 2
3 5 2 3
4 6 2 4
5 3 1 3
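For reference, a small sketch contrasting two of pandas' rank methods on the example data; "dense" (used in the accepted answer) leaves no gaps after ties, while "min" does:

```python
import pandas as pd

df = pd.DataFrame({'A': [12, 18, 9, 21, 24, 15],
                   'B': [18, 22, 19, 14, 14, 11],
                   'C': [5, 7, 7, 9, 12, 9]})

# "dense": tied values share a rank and the next distinct
# value gets the immediately following rank
dense = df.rank(method="dense").astype(int)

# "min": tied values share the lowest of their positional
# ranks, leaving gaps after ties
low = df.rank(method="min").astype(int)

print(dense['C'].tolist())  # [1, 2, 2, 3, 4, 3]
print(low['C'].tolist())    # [1, 2, 2, 4, 6, 4]
```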

Python use dataframe column value in iloc (or shift)

Although my previous question was answered here: Python dataframe new column with value based on value in other row, I still want to know how to use a column value in iloc (or shift or rolling, etc.).
I have a dataframe with two columns A and B, how do I use the value of column B in iloc? Or shift()?
d = {'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3], 'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]}
df = pd.DataFrame(data=d)
Using iloc I get this error.
df['C'] = df['A'] * df['A'].iloc[df['B']]
ValueError: cannot reindex from a duplicate axis
Using shift() another one.
df['C'] = df['A'] * df['A'].shift(df['B'])
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Is it possible what I want to do? If yes, how? If no, why not?
Use numpy indexing:
print (df['A'].to_numpy()[df['B'].to_numpy()])
[4 3 6 4 8 5 5 4 3 3]
df['C'] = df['A'] * df['A'].to_numpy()[df['B'].to_numpy()]
print (df)
A B C
0 8 2 32
1 2 -1 6
2 4 4 24
3 5 5 20
4 6 0 48
5 4 -3 20
6 3 8 15
7 5 2 20
8 5 6 15
9 3 -1 9
Numpy indexing is the fastest way, I agree, but you can use a list comprehension + iloc too:
d = {'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3], 'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]}
df = pd.DataFrame(data=d)
df['C'] = df['A'] * [df['A'].iloc[i] for i in df['B']]
A B C
0 8 2 32
1 2 -1 6
2 4 4 24
3 5 5 20
4 6 0 48
5 4 -3 20
6 3 8 15
7 5 2 20
8 5 6 15
9 3 -1 9
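As a self-contained sketch of the numpy-indexing approach (note that negative values in B wrap around to the end of A, which may or may not be the intended semantics for your data):

```python
import pandas as pd

d = {'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3],
     'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]}
df = pd.DataFrame(d)

# positional lookup: A taken at the positions listed in B;
# e.g. B[1] == -1 picks the last element of A
picked = df['A'].to_numpy()[df['B'].to_numpy()]
df['C'] = df['A'] * picked

print(df['C'].tolist())  # [32, 6, 24, 20, 48, 20, 15, 20, 15, 9]
```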

Pandas replacing values on specific columns

I am aware of these two similar questions:
Pandas replace values
Pandas: Replacing column values in dataframe
I used a different approach to substitute the values, which I think should be the cleanest one. But it does not work. I know how to work around it, but I would like to understand why it does not work:
In [108]: df=pd.DataFrame([[1, 2, 8],[3, 4, 8], [5, 1, 8]], columns=['A', 'B', 'C'])
In [109]: df
Out[109]:
A B C
0 1 2 8
1 3 4 8
2 5 1 8
In [110]: df.loc[:, ['A', 'B']].replace([1, 3, 2], [3, 6, 7], inplace=True)
In [111]: df
Out[111]:
A B C
0 1 2 8
1 3 4 8
2 5 1 8
In [112]: df.loc[:, 'A'].replace([1, 3, 2], [3, 6, 7], inplace=True)
In [113]: df
Out[113]:
A B C
0 3 2 8
1 6 4 8
2 5 1 8
If I slice only one column (In [112]), it works differently from slicing several columns (In [110]). As I understand it, the .loc method returns a view and not a copy. By that logic, making an inplace change on the slice should change the whole DataFrame. This is what happens at line In [112].
Here is the answer by one of the developers: https://github.com/pydata/pandas/issues/11984
This should ideally show a SettingWithCopyWarning, but I think this is
quite difficult to detect.
You should NEVER do this type of chained inplace setting. It is simply
bad practice.
idiomatic is:
In [7]: df[['A','B']] = df[['A','B']].replace([1, 3, 2], [3, 6, 7])
In [8]: df
Out[8]:
A B C
0 3 7 8
1 6 4 8
2 5 3 8
(You can do it with df.loc[:, ['A', 'B']] as well, but the above is clearer.)
Another option is to build a replacement dict and target the columns directly:
to_rep = dict(zip([1, 3, 2], [3, 6, 7]))
df.replace({'A': to_rep, 'B': to_rep}, inplace=True)
This will return:
A B C
0 3 7 8
1 6 4 8
2 5 3 8
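A minimal runnable sketch of the idiomatic (non-inplace) pattern from the quoted answer, using the same example data:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 8], [3, 4, 8], [5, 1, 8]], columns=['A', 'B', 'C'])

# assign back instead of chaining inplace=True on a slice;
# replace maps 1 -> 3, 3 -> 6, 2 -> 7 in columns A and B only
df[['A', 'B']] = df[['A', 'B']].replace([1, 3, 2], [3, 6, 7])

print(df.values.tolist())  # [[3, 7, 8], [6, 4, 8], [5, 3, 8]]
```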
