Pad selection range in Pandas Dataframe? - python

If I slice a dataframe with something like
>>> df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
>>> df.loc[df['A'] == 1]
# or
>>> df[df['A'] == 1]
A
0 1
4 1
7 1
8 1
how could I pad my selections by a buffer of 1 and get each of the indices 0, 1, 3, 4, 5, 6, 7, 8, 9? I want to select all rows for which the value in column 'A' is 1, but also the row before and the row after any such row.
edit I'm hoping to figure out a solution that works for arbitrary pad sizes, rather than just for a pad size of 1.
edit 2 here's another example illustrating what I'm going for
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
and we're looking for pad == 2. In this case I'd be trying to fetch rows 0, 1, 2, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16.

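You can use shift with bitwise or (|). Passing fill_value=False keeps the shifted masks boolean instead of introducing NaN at the edges:
c = df['A'] == 1
df[c | c.shift(fill_value=False) | c.shift(-1, fill_value=False)]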
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
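The same shift idea can be stretched to a larger buffer by OR-ing together one shifted copy of the mask per offset. The following is only a sketch: the helper name pad_mask is made up for illustration, and it assumes the rows are already in the order you want to pad along.

```python
import pandas as pd

def pad_mask(mask, pad):
    """Return a mask that is True within `pad` rows of any True in `mask`."""
    out = mask.copy()
    for k in range(1, pad + 1):
        # fill_value=False keeps the shifted mask boolean at the edges
        out |= mask.shift(k, fill_value=False) | mask.shift(-k, fill_value=False)
    return out

df = pd.DataFrame({'A': [1, 2, 3, 5, 1, 3, 2, 1, 1, 4, 5, 6]})
padded = df[pad_mask(df['A'] == 1, pad=1)]
print(padded.index.tolist())  # [0, 1, 3, 4, 5, 6, 7, 8, 9]
```

The loop runs pad times, so this is best suited to small pad sizes.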

For arbitrary pad sizes, you can combine where, interpolate, and notna to create the mask. Apply where to the column (a Series), so that the resulting mask filters rows:
n = 2
c = df['A'].where(df['A'] == 1)
m = c.interpolate(limit=n, limit_direction='both').notna()
df[m]
Out[61]:
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4

Here is an approach that allows for multiple pad levels. Use ffill and bfill on the boolean mask (df['A'] == 1), after converting the False values to np.nan:
import numpy as np
pad = 2
df[(df['A'] == 1).replace(False, np.nan).ffill(limit=pad).bfill(limit=pad).replace(np.nan,False).astype(bool)]
Here it is in action:
def padsearch(df, column, value, pad):
    return df[(df[column] == value).replace(False, np.nan).ffill(limit=pad).bfill(limit=pad).replace(np.nan,False).astype(bool)]
# your first example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=1))
# your other example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=2))
Result:
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
Granted, the command is far less nice, and it's a little clunky to convert the False values to and from null. But it still uses only Pandas builtins, so it is fairly quick.
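If the round trip through NaN feels awkward, a centered rolling window is one possible alternative: a row is kept when any row within pad positions of it matches. This is a sketch rather than a drop-in replacement, and the name padsearch_rolling is hypothetical.

```python
import pandas as pd

def padsearch_rolling(df, column, value, pad):
    # 1 where the row matches, 0 elsewhere
    hits = (df[column] == value).astype(int)
    # a centered window of 2*pad + 1 rows covers `pad` rows on each side;
    # min_periods=1 lets the window shrink at the edges of the frame
    near_hit = hits.rolling(2 * pad + 1, center=True, min_periods=1).max().astype(bool)
    return df[near_hit]

df = pd.DataFrame({'A': [1, 2, 3, 5, 3, 2, 1, 1, 4, 5, 6, 0, 0, 3, 1, 2, 4, 5]})
result = padsearch_rolling(df, column='A', value=1, pad=2)
print(result.index.tolist())
# [0, 1, 2, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16]
```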

I found another solution, though it's not nearly as slick as some of the ones already posted.
# setup
df = ...
pad = 2
# determine the set of row positions (assumes the default RangeIndex);
# positions that would fall outside the dataframe are dropped
indices = sorted({
    x + y
    for x in df[df['A'] == 1].index
    for y in range(-pad, pad + 1)
    if 0 <= x + y < len(df)
})
# fetch rows
df.iloc[indices]
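The same index arithmetic can also be written with NumPy broadcasting. The sketch below works on row positions rather than index labels, so it does not depend on the index being 0..n-1, and it drops candidate positions that fall outside the frame. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 5, 1, 3, 2, 1, 1, 4, 5, 6]})
pad = 1

# positions of the matching rows
hits = np.flatnonzero(df['A'].to_numpy() == 1)
# every hit combined with every offset in [-pad, pad]
offsets = np.arange(-pad, pad + 1)
candidates = (hits[:, None] + offsets).ravel()
# keep only valid positions, sorted and de-duplicated
in_range = (candidates >= 0) & (candidates < len(df))
positions = np.unique(candidates[in_range])

padded = df.iloc[positions]
print(positions.tolist())  # [0, 1, 3, 4, 5, 6, 7, 8, 9]
```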

Related

Replace specific values in a data frame with column mean

I have a dataframe and I want to replace each 7 with the rounded mean of its column, where the 7s in that column do not count toward the mean. Here is a simple example:
import pandas as pd
df = pd.DataFrame()
df['a'] = [1, 2, 3]
df['b'] =[3, 0, -1]
df['c'] = [4, 7, 6]
df['d'] = [7, 7, 6]
a b c d
0 1 3 4 7
1 2 0 7 7
2 3 -1 6 6
And here is the output I want:
a b c d
0 1 3 4 2
1 2 0 3 2
2 3 -1 6 6
For example, in row 1, the mean of column c is 3.33, which rounds to 3, and the mean of column d is 2 (since we do not consider the other 7 in that column).
Can you please help me with that?
Here is one way to do it:
import numpy as np
# replace 7 with np.nan
df.replace(7, np.nan, inplace=True)
# fill NaN values with the column mean, computed with NaN counted as 0
(df.fillna(df.apply(lambda x: x.replace(np.nan, 0)
                              .mean(skipna=False)))
   .round(0)
   .astype(int))
a b c d
0 1 3 4 2
1 2 0 3 2
2 3 -1 6 6
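# alternative: compute the column means with every 7 counted as 0, then substitute them
# (note that astype(int) truncates the mean rather than rounding it)
temp = df.replace(to_replace=7, value=0)
df.replace(to_replace=7, value=temp.mean().astype(int), inplace=True)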
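A variant that leaves the original frame untouched is to build a boolean mask of the 7s once and reuse it, both for the per-column means and for the substitution. This is a sketch on the question's sample data; is_seven, col_means and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 0, -1], 'c': [4, 7, 6], 'd': [7, 7, 6]})

is_seven = df.eq(7)
# column means with every 7 counted as 0, rounded to whole numbers
col_means = df.mask(is_seven, 0).mean().round().astype(int)
# keep the original value where it is not 7, otherwise use that column's mean
result = df.where(~is_seven, col_means, axis=1).astype(int)
print(result)
```

Passing axis=1 tells where to align col_means with the column labels, so each column is filled with its own mean.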

How to change several values of pandas DataFrame at once?

Let's consider very simple data frame:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
A B
0 0 3
1 1 4
2 2 5
3 3 0
4 2 2
5 5 7
I want to do two things with this dataframe:
All numbers below 3 have to be changed to 0
All numbers equal to 0 have to be changed to 10
The problem is that when we apply:
df[df < 3] = 0
df[df == 0] = 10
we also change numbers which were not 0 initially, obtaining:
A B
0 10 3
1 10 4
2 10 5
3 3 10
4 10 10
5 5 7
which is not the desired output. That should look like this:
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
My question is: is there a way to change both of those things at the same time? That is, I want to change numbers smaller than 3 to 0 and numbers equal to 0 to 10, independently of each other.
Note! This example is created just to outline the problem. An obvious solution is to change the order of replacement: first change 0 to 10, and then numbers smaller than 3 to 0. But I'm struggling with a much more complex problem, and I want to know if it is possible to change both of those at once.
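Use applymap() to apply a function to each element in the DataFrame (since pandas 2.1 this method is deprecated in favor of DataFrame.map(), which is called the same way):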
df.applymap(lambda x: 10 if x == 0 else (0 if x < 3 else x))
results in
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
I would do it the following way:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
df_orig = df.copy()
df[df_orig < 3] = 0
df[df_orig == 0] = 10
print(df)
output
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
Explanation: I use the .copy method to get a copy of the DataFrame, which is stored in the variable df_orig. That copy is never altered while the program runs, so I use it to select the places to put 0 and 10.
You can create the masks first, then change the values:
m1 = df < 3
m2 = df == 0
df[m1] = 0
df[m2] = 10
print(df)
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
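When the number of rules grows, numpy.select is another way to evaluate every condition against the original values in a single step. This is a sketch on the question's sample data; where conditions overlap, the first matching one wins, so the 0 -> 10 rule is listed before the < 3 rule.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 2, 5], 'B': [3, 4, 5, 0, 2, 7]})

values = df.to_numpy()
# all conditions are computed from the untouched original values
conditions = [values == 0, values < 3]
choices = [10, 0]
# cells matching no condition keep their original value
result = pd.DataFrame(
    np.select(conditions, choices, default=values),
    index=df.index,
    columns=df.columns,
)
print(result)
```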

Alter columns for each row in dataframe basing on other column

I have a dataframe that has many columns.
I want to apply a function to each row that alters some of the columns based on the value of a different column.
def mark(row):
    columns = get_columns_to_alter(row['Text'])
    for c in columns:
        row[c] = True
And I was trying to use the apply function
df.apply(mark, axis=1)
But it does not alter these columns. What am I doing wrong? The function I gave is pseudocode, but it gets the names of the columns to change based on the "Text" column.
OK,
This is a bit confusing, to be honest.
Several issues I see:
First, a note on the call: with axis=1, DataFrame.apply already passes each row to the function, so df.apply(mark, axis=1) is equivalent to the more explicit
df.apply(lambda x: mark(x), axis=1)
and either form loops through each row.
Second, DataFrame.apply creates a copy Series for each row (in your case); thus, the changes are applied to that new row value and not to df. If you want to change df, you need to (a) make sure that mark returns something and (b) assign the result to something:
def mark(row):
    columns = get_columns_to_alter(row['Text'])
    if len(columns) > 0:
        row[columns] = True
    return row
new_df = df.apply(lambda x: mark(x), axis=1)
Something like this should do what you expect.
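To make this concrete, here is a small runnable sketch. The question does not show get_columns_to_alter, so the version below is purely hypothetical: it assumes the Text column holds comma-separated column names. The sample frame is invented for illustration as well.

```python
import pandas as pd

def get_columns_to_alter(text):
    # hypothetical stand-in: treat the text as comma-separated column names
    return [name for name in text.split(',') if name]

def mark(row):
    row = row.copy()  # work on a copy so the input row is never modified
    columns = get_columns_to_alter(row['Text'])
    if columns:
        row[columns] = True
    return row  # apply builds the new frame from the returned rows

df = pd.DataFrame({
    'Text': ['x', 'x,y', ''],
    'x': [False, False, False],
    'y': [False, False, False],
})
new_df = df.apply(mark, axis=1)
print(new_df)
```

The original df is left as it was; only new_df carries the marked columns.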
Here is one solution via numpy and itertools.chain. As far as possible, it's a good idea to remove loops.
from itertools import chain
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 9, (10, 10)))
# one array of three random column positions per row
df['cols'] = [np.random.randint(0, 9, 3) for _ in range(len(df))]
def calc_cols(s):
    arr = s.values.tolist()
    # apply function on arr here, e.g.
    # arr = list(map(f, arr))
    # row position repeated once per column position in that row
    idx = np.repeat(list(range(len(arr))), list(map(len, arr)))
    return idx, list(chain(*arr))
A, cols = df.values, calc_cols(df['cols'])
# set all selected (row, column) cells in one vectorized assignment
A[cols[0], cols[1]] = -1
df_res = pd.DataFrame(A, columns=df.columns)
# 0 1 2 3 4 5 6 7 8 9 cols
# 0 2 4 -1 4 -1 2 6 6 8 1 [4, 4, 2]
# 1 4 -1 -1 3 4 4 -1 5 6 7 [2, 1, 6]
# 2 -1 1 7 1 2 -1 2 2 -1 0 [8, 0, 5]
# 3 2 4 -1 6 -1 8 6 -1 0 3 [7, 2, 4]
# 4 -1 5 5 2 8 2 -1 8 -1 6 [8, 6, 0]
# 5 5 6 0 3 5 -1 -1 5 3 7 [6, 5, 6]
# 6 -1 0 7 1 4 -1 -1 6 1 8 [5, 6, 0]
# 7 2 6 4 6 -1 6 -1 5 7 6 [6, 4, 6]
# 8 -1 8 1 -1 0 7 8 -1 2 3 [3, 0, 7]
# 9 2 4 6 6 -1 -1 0 2 -1 0 [4, 8, 5]

Vectorized calculation of a column's value based on a previous value of the same column?

I have a pandas dataframe with two columns A,B as below.
I want a vectorized solution for creating a new column C where C[i] = C[i-1] - A[i] + B[i].
df = pd.DataFrame(data={'A': [10, 2, 3, 4, 5, 6], 'B': [0, 1, 2, 3, 4, 5]})
>>> df
A B
0 10 0
1 2 1
2 3 2
3 4 3
4 5 4
5 6 5
Here is the solution using for-loops:
df['C'] = df['A']
for i in range(1, len(df)):
    df.loc[i, 'C'] = df.loc[i-1, 'C'] - df.loc[i, 'A'] + df.loc[i, 'B']
>>> df
A B C
0 10 0 10
1 2 1 9
2 3 2 8
3 4 3 7
4 5 4 6
5 6 5 5
... which does the job.
But since loops are slow in comparison to vectorized calculations, I want a vectorized solution for this in pandas:
I tried to use the shift() method like this:
df['C'] = df['C'].shift(1).fillna(df['A']) - df['A'] + df['B']
but it didn't help since the shifted C column isn't updated with the calculation. It keeps its original values:
>>> df['C'].shift(1).fillna(df['A'])
0 10
1 10
2 2
3 3
4 4
5 5
and that produces a wrong result.
This can be vectorized, since
delta[i] = C[i] - C[i-1] = -A[i] + B[i].
You can get delta from A and B first, then calculate the cumulative sum of delta (plus C[0]) to get the full C.
Code as follows:
delta = df['B'] - df['A']
delta[0] = 0
df['C'] = df.loc[0, 'A'] + delta.cumsum()
print(df)
A B C
0 10 0 10
1 2 1 9
2 3 2 8
3 4 3 7
4 5 4 6
5 6 5 5
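As a sanity check, the cumulative-sum form can be compared against the explicit recurrence C[i] = C[i-1] - A[i] + B[i] with C[0] = A[0]. A minimal sketch on the question's data (the names vectorized and looped are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 2, 3, 4, 5, 6], 'B': [0, 1, 2, 3, 4, 5]})

# vectorized: C[i] = C[0] + sum of (B[k] - A[k]) for k = 1..i
delta = df['B'] - df['A']
delta.iloc[0] = 0  # no step is taken into the first row
vectorized = df['A'].iloc[0] + delta.cumsum()

# reference: the plain recurrence, evaluated row by row
looped = [int(df['A'].iloc[0])]
for i in range(1, len(df)):
    looped.append(looped[-1] - int(df['A'].iloc[i]) + int(df['B'].iloc[i]))

print(vectorized.tolist())  # [10, 9, 8, 7, 6, 5]
print(looped)               # [10, 9, 8, 7, 6, 5]
```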

Pandas indexing by both boolean `loc` and subsequent `iloc`

I want to index a Pandas dataframe using a boolean mask, then set a value in a subset of the filtered dataframe based on an integer index, and have this value reflected in the dataframe. That is, I would be happy if this worked on a view of the dataframe.
Example:
In [293]:
df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7],
'b': [5, 5, 2, 2, 5, 5, 2, 2],
'c': [0, 0, 0, 0, 0, 0, 0, 0]})
mask = (df['a'] < 7) & (df['b'] == 2)
df.loc[mask, 'c']
Out[293]:
2 0
3 0
6 0
Name: c, dtype: int64
Now I would like to set the values of the first two elements returned in the filtered dataframe. Chaining an iloc onto the loc call above works to index:
In [294]:
df.loc[mask, 'c'].iloc[0: 2]
Out[294]:
2 0
3 0
Name: c, dtype: int64
But not to assign:
In [295]:
df.loc[mask, 'c'].iloc[0: 2] = 1
print(df)
a b c
0 0 5 0
1 1 5 0
2 2 2 0
3 3 2 0
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
Making the assign value the same length as the slice (i.e. = [1, 1]) also doesn't work. Is there a way to assign these values?
This does work but is a little ugly. Basically, we use the index generated from the mask and make an additional call to loc:
In [57]:
df.loc[df.loc[mask,'c'].iloc[0:2].index, 'c'] = 1
df
Out[57]:
a b c
0 0 5 0
1 1 5 0
2 2 2 1
3 3 2 1
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
So breaking the above down:
In [60]:
# take the index from the mask and iloc
df.loc[mask, 'c'].iloc[0: 2]
Out[60]:
2 0
3 0
Name: c, dtype: int64
In [61]:
# call loc using this index, we can now use this to select column 'c' and set the value
df.loc[df.loc[mask,'c'].iloc[0:2].index]
Out[61]:
a b c
2 2 2 0
3 3 2 0
How about:
ix = df.index[mask][:2]
df.loc[ix, 'c'] = 1
Same idea as EdChum but more elegant as suggested in the comment.
EDIT: You have to be a little bit careful with this one, as it may give unwanted results with a non-unique index, since there could be multiple rows indexed by either of the labels in ix above. If the index is non-unique and you only want the first 2 (or n) rows that satisfy the boolean key, it would be safer to use .iloc with integer positions. Note that .iloc needs the integer position of the column as well, not its label:
import numpy as np
ix = np.where(mask)[0][:2]
df.iloc[ix, df.columns.get_loc('c')] = 1
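The positional route can be tried on a small frame with repeated index labels. The data below is invented for illustration; with labels 'x' and 'y' each appearing twice, selecting by label could touch more rows than intended, whereas positions always identify exactly one row each.

```python
import numpy as np
import pandas as pd

# hypothetical frame whose index labels repeat
df = pd.DataFrame({'b': [2, 5, 2, 2], 'c': [0, 0, 0, 0]},
                  index=['x', 'x', 'y', 'y'])
mask = df['b'] == 2

# integer positions of the first two rows that satisfy the mask
row_pos = np.flatnonzero(mask.to_numpy())[:2]
# .iloc is purely positional, so the column is given by position too
col_pos = df.columns.get_loc('c')
df.iloc[row_pos, col_pos] = 1

print(df['c'].tolist())  # [1, 0, 1, 0]
```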
I don't know if this is any more elegant, but it's a little different:
mask = mask & (mask.cumsum() < 3)
df.loc[mask, 'c'] = 1
a b c
0 0 5 0
1 1 5 0
2 2 2 1
3 3 2 1
4 4 5 0
5 5 5 0
6 6 2 0
7 7 2 0
