Is there any way to compare values within the same column of a pandas DataFrame?
The task at hand is something like this:
import pandas as pd
data = pd.DataFrame({"A": [0,-5,2,3,-3,-4,-4,-2,-1,5,6,7,3,-1]})
I need to find the maximum number of consecutive rows with the same sign (equivalently, consecutive equal Boolean values, since the sign can be encoded as True/False). The data above should yield 5 because there are 5 consecutive negative integers: [-3,-4,-4,-2,-1].
If possible, I was hoping to avoid a loop, because the number of data points in the column may well run into the millions.
I've tried data.A.rolling() and its variants, but can't seem to figure out any way to do this in a vectorized fashion.
Any suggestions?
Here's a NumPy approach that computes the max interval lengths for the positive and negative values -
import numpy as np

def max_interval_lens(arr):
    # Store mask of positive values
    pos_mask = arr >= 0
    # Get indices of shifts
    idx = np.r_[0, np.flatnonzero(pos_mask[1:] != pos_mask[:-1]) + 1, arr.size]
    # Lengths of the intervals between shifts
    lens = np.diff(idx)
    s = int(pos_mask[0])
    maxs = [0, 0]
    if len(lens) == 1:
        maxs[1 - s] = lens[0]
    else:
        maxs = lens[1 - s::2].max(), lens[s::2].max()
    return maxs  # Positive, negative max lens
Sample run -
In [227]: data
Out[227]:
A
0 0
1 -5
2 2
3 3
4 -3
5 -4
6 -4
7 -2
8 -1
9 5
10 6
11 7
12 3
13 -1
In [228]: max_interval_lens(data['A'].values)
Out[228]: (4, 5)
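If you prefer to stay in pandas, a similar result can be had by grouping on sign changes. This is a minimal sketch, treating zero as positive just like the NumPy version above:
sign = data['A'].ge(0)
groups = sign.ne(sign.shift()).cumsum()
run_lengths = data['A'].groupby(groups).size()
# longest negative run (5 for the sample data); drop the `~` for the positive runs
run_lengths[~sign.groupby(groups).first()].max()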
I have a dataframe (df) like the one below (there are more rows in reality).
   number
0      21
1      35
2     467
3     965
4    2754
5     34r
6    5743
7     841
8    8934
9     275
I want to insert 6 rows between each pair of consecutive rows. For example, I want to get 6 random values within the range of the values at index 0 and index 1 and add these 6 rows between index 0 and 1.
The same goes for index 1 and 2, 2 and 3, and so forth until the end.
np.linspace(df["number"][0], df["number"][1], 8)
Is there a function or any other method to generate 6 additional rows in each of the 9 gaps between the 10 existing rows, so that the final number of rows is not 10 but 64 (after adding 54 rows)?
You could try the following:
import pandas as pd
from random import uniform

def rng_numbers(row):
    left, right = row.iat[0], row.iat[1]
    n = left
    if pd.isna(right):
        return [n]
    if right < left:
        left, right = right, left
    return [n] + [uniform(left, right) for _ in range(6)]

df["number"] = (
    pd.concat([df["number"], df["number"].shift(-1)], axis=1)
    .apply(rng_numbers, axis=1)
)
df = df.explode("number", ignore_index=True)
First create a dataframe with 2 columns that form the interval boundaries: the number column and the number column shifted by -1, so each row is paired with the next row's value.
Then .apply the function rng_numbers to the rows of the new dataframe: rng_numbers first sorts the interval boundaries and then returns a list that starts with the respective item from column number, followed by 6 random numbers from the interval. In the last row the right boundary is NaN (due to the .shift(-1)): in this case the function returns the list without the random numbers.
Then .explode df on the new column number.
You could do something similar with NumPy, which is probably faster:
import numpy as np

rng = np.random.default_rng()
limits = pd.concat([df["number"], df["number"].shift(-1)], axis=1)
left = limits.min(axis=1).values.reshape(-1, 1)
right = limits.max(axis=1).values.reshape(-1, 1)
df["number"] = (
    pd.Series(df["number"].values.reshape(len(df), 1).tolist())
    + pd.Series(rng.uniform(left, right, size=(len(df), 6)).tolist())
)
df["number"].iat[-1] = df["number"].iat[-1][:1]
df = df.explode("number", ignore_index=True)
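Either way, a quick sanity check of the row count (assuming the number column is numeric and the original frame has the 10 rows shown above):
# 10 original rows + 9 gaps * 6 inserted rows = 64
print(len(df))  # 64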
I am trying to check three consecutive values in a column and, if they are all positive, create a new column with a string value in the third row. My index is a date index.
I want a new column in my data frame. For each row I want to check whether the three consecutive values ending there are all positive (return the string 'increasing'), all negative (return 'decreasing'), or neither (return 'none'). This value should go into the new column, in the row that is the last of the three values checked.
I have tried the code below, but whatever variation I use, it is not working.
df['num_change'] = df.num.diff()
result = []
for i in range(len(df)):
    if np.all(df['num_change'].values[i:i+3]) < 0:
        result.loc[[i+3], 'Trend'] = ('decreasing')
    elif np.all(df['num_change'].values[i:i+3]) > 0:
        result.loc[[i+3], 'Trend'] = ('increasing')
    else:
        result.loc[[i+3], 'Trend'] = ('none')
df["new_col"] = result
I am unfortunately not able to insert an image here; I hope someone is patient enough to help me anyway.
This can be achieved with a custom rolling aggregation, without an (explicit) loop.
First we define the aggregation (it has to return a numeric value):
def trend(s):
    if (s < 0).all():
        return -1
    if (s > 0).all():
        return 1
    return 0
Now apply it and map to a label:
df['trend'] = (
    df['col'].rolling(3, min_periods=1)
    .apply(trend)
    .map({1: 'Increasing', -1: 'Decreasing', 0: 'none'})
)
Output:
col trend
0 1 Increasing
1 2 Increasing
2 3 Increasing
3 -4 none
4 -5 none
5 -6 Decreasing
6 7 none
7 8 none
8 9 Increasing
Note that we set min_periods to 1 here, which has the effect of filling the first two rows based on the sub-series of 1 or 2 elements. If you don't want that, you can drop the min_periods argument, as shown below.
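For reference, a sketch of the same call without min_periods (rows 0 and 1 then have no full 3-row window and come out as NaN):
df['trend'] = (
    df['col'].rolling(3)
    .apply(trend)
    .map({1: 'Increasing', -1: 'Decreasing', 0: 'none'})
)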
You could do this as follows:
import pandas as pd
import numpy as np

df = pd.DataFrame({'col': [1, 2, 3, -4, -5, -6, 7, 8, 9]})

start = 0
end = 3
result = [None] * 2  # because the trend can only start at the third value
while end <= len(df.col):
    if np.all(df.col[start:end] > 0):
        result.append("Increasing")
    elif np.all(df.col[start:end] < 0):
        result.append("Decreasing")
    else:
        result.append(None)
    start += 1
    end += 1
df["new_col"] = result
In this solution, the while loop runs as long as the current window of the column still has 3 values, i.e. end is less than or equal to the length of df.col. Inside it, the three elements from start to end are checked. If all of them are greater than 0, "Increasing" is appended to the result; if all of them are less than 0, "Decreasing" is appended; otherwise, None is appended.
The first two elements of result are None because no complete window of 3 values ends at the first two rows. start and end are 0 and 3 respectively and are each incremented by 1 after every iteration. The output is shown below:
>>> df
col new_col
0 1 None
1 2 None
2 3 Increasing
3 -4 None
4 -5 None
5 -6 Decreasing
6 7 None
7 8 None
8 9 Increasing
I'd like to calculate a rolling sum of elements, the way R's rollapply does:
s = pd.Series([1,2,3,4,5,6])
As a result I'd like to receive a new series with the sums of elements over non-overlapping intervals (window size is 2):
3
7
11
Pandas' Series.rolling works differently, producing sums over overlapping intervals. Please tell me how to do what I want.
You can try
s.groupby(s.index//2).sum()
0 3
1 7
2 11
dtype: int64
Here is a solution that does exactly what I wanted:
import numpy as np

s = pd.Series([1,2,3,4,5,6])
pd.Series([np.sum(s[x:x + 2]) for x in range(0, len(s), 2)])
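If the length of the series is a multiple of the window size, a fully vectorized alternative (a sketch under that assumption, not from the original answers) is to reshape the underlying array and sum along the rows:
window = 2
pd.Series(s.values.reshape(-1, window).sum(axis=1))
# 0     3
# 1     7
# 2    11
# dtype: int64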
Below is a dataframe showing coordinate values from and to, each row having a corresponding value column.
I want to join consecutive coordinate ranges so that, where possible, the summed value column doesn't exceed 5. Below is the dataframe input.
import pandas as pd
From = [10, 20, 30, 40, 50, 60, 70]
to = [20, 30, 40, 50, 60, 70, 80]
value = [2, 3, 5, 6, 1, 3, 1]
df = pd.DataFrame({'from': From, 'to': to, 'value': value})
print(df)
Hence I want to convert the table above into the following outcome:
   from  to  value
0    10  30      5
1    30  40      5
2    40  50      6
3    50  80      5
Further explanation:
Coordinates from 10 to 30 are joined and the value column becomes 5, as that is the sum of the values from 10 to 30 (not exceeding 5)
Coordinates 30 to 40 keep a value of 5
Coordinates 40 to 50 keep a value of 6 (more than 5; however, this row is included as it cannot be divided further)
The remaining coordinates sum up to a value of 5
What code is required to achieve the above?
We can do a groupby on cumsum:
s = df['value'].ge(5)
(df.groupby([~s, s.cumsum()], as_index=False, sort=False)
   .agg({'from': 'min', 'to': 'max', 'value': 'sum'})
)
Output:
from to value
0 10 30 5
1 30 40 5
2 40 50 6
3 50 80 5
Update: It looks like you want to accumulate the values so the new groups do not exceed 5. There are several threads on SO saying that this can only be done with a for loop. So we can do something like this:
thresh = 5
groups, partial, curr_grp = [], thresh, 0
for v in df['value']:
    if partial + v > thresh:
        curr_grp += 1
        partial = v
    else:
        partial += v
    groups.append(curr_grp)

df.groupby(groups).agg({'from': 'min', 'to': 'max', 'value': 'sum'})
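For the sample frame above, this loop produces the group labels [1, 1, 2, 3, 4, 4, 4], and the aggregation then gives the same ranges as before (the index values are the group labels):
   from  to  value
1    10  30      5
2    30  40      5
3    40  50      6
4    50  80      5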
I am looking for a fast method to determine the cross-matching indices of two arrays, defined as follows.
I have two very large (>1e7 elements) structured arrays, one called members, and another called groups. Both arrays have a groupID column. The groupID entries of the groups array are unique, the groupID entries of the members array are not.
The groups array has a column called mass. The members array has a (currently empty) column called groupmass. I want to assign the correct groupmass to those elements of members with a groupID that matches one of the groups. This would be accomplished via:
members['groupmass'][idx_matched_members] = groups['mass'][idx_matched_groups]
So what I need is a fast routine to compute the two index arrays idx_matched_members and idx_matched_groups. This sort of task is so common that a package like numpy or pandas very likely has an optimized solution. Does anyone know of one, professionally developed, homebrewed, or otherwise?
This can be done in pandas using map, which maps the data of one column onto the data of another. Here's an example with sample data:
import numpy as np
import pandas

members = pandas.DataFrame({
    'id': np.arange(10),
    'groupID': np.arange(10) % 3,
    'groupmass': np.zeros(10)
})
groups = pandas.DataFrame({
    'groupID': np.arange(3),
    'mass': np.random.randint(1, 10, 3)
})
This gives you this data:
>>> members
groupID groupmass id
0 0 0 0
1 1 0 1
2 2 0 2
3 0 0 3
4 1 0 4
5 2 0 5
6 0 0 6
7 1 0 7
8 2 0 8
9 0 0 9
>>> groups
groupID mass
0 0 3
1 1 7
2 2 4
Then:
>>> members['groupmass'] = members.groupID.map(groups.set_index('groupID').mass)
>>> members
groupID groupmass id
0 0 3 0
1 1 7 1
2 2 4 2
3 0 3 3
4 1 7 4
5 2 4 5
6 0 3 6
7 1 7 7
8 2 4 8
9 0 3 9
If you will often want to use the groupID as the index into groups, you can set it that way permanently so you won't have to use set_index every time you do this.
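For example (a minimal sketch of that, not from the original answer):
groups = groups.set_index('groupID')
members['groupmass'] = members['groupID'].map(groups['mass'])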
Here's an example of setting the mass with just numpy. It does use iteration, so for large arrays it won't be fast.
For just 10 rows, this is much faster than the pandas equivalent. But as the data set becomes larger (e.g. M=10000), pandas is much better. The setup time for pandas is larger, but the per-row iteration time is much lower.
Generate test arrays:
import numpy as np

dt_members = np.dtype({'names': ['groupID', 'groupmass'], 'formats': [int, float]})
dt_groups = np.dtype({'names': ['groupID', 'mass'], 'formats': [int, float]})

N, M = 5, 10
members = np.zeros((M,), dtype=dt_members)
groups = np.zeros((N,), dtype=dt_groups)
members['groupID'] = np.random.randint(101, 101 + N, M)
groups['groupID'] = np.arange(101, 101 + N)
groups['mass'] = np.arange(1, N + 1)

def getgroup(id):
    # Return the (single) group row whose groupID matches
    idx = id == groups['groupID']
    return groups[idx]

members['groupmass'][:] = [getgroup(id)['mass'][0] for id in members['groupID']]
In python2 the iteration could use map:
members['groupmass'] = map(lambda x: getgroup(x)['mass'], members['groupID'])
I can improve the speed by about 2x by minimizing the repeated subscripting, e.g.:
def setmass(members, groups):
    gmass = groups['mass']
    gid = groups['groupID']
    mass = [gmass[id == gid][0] for id in members['groupID']]
    members['groupmass'][:] = mass
But if groups['groupID'] can be mapped onto arange(N), then we can get a big jump in speed. By applying the same mapping to members['groupID'], it becomes a simple array indexing problem.
In my sample arrays, groups['groupID'] is just arange(N)+101. So the mapping just subtracts that minimum.
def setmass1(members, groups):
    members['groupmass'][:] = groups['mass'][members['groupID'] - groups['groupID'].min()]
This is 300x faster than my earlier code, and 8x better than the pandas solution (for 10000,500 arrays).
I suspect pandas does something like this. pgroups.set_index('groupID').mass is the mass Series, with an added .index attribute. (I could test this with a more general array)
In a more general case, it might help to sort groups, and if necessary, fill in some indexing gaps.
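As a rough sketch of that more general case (my addition, not from the original answer; it assumes every members['groupID'] actually occurs in groups['groupID']), np.searchsorted on a sorted copy of groups gives the same kind of direct indexing:
def setmass_sorted(members, groups):
    # Sort groups by groupID so each member's group can be located by binary search
    order = np.argsort(groups['groupID'])
    sorted_ids = groups['groupID'][order]
    sorted_mass = groups['mass'][order]
    idx = np.searchsorted(sorted_ids, members['groupID'])
    members['groupmass'][:] = sorted_mass[idx]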
Here's a 'vectorized' solution - no iteration. But it has to calculate a very large matrix (length of groups by length of members), so does not gain much speed (np.where is the slowest step).
def setmass2(members, groups):
    idx = np.where(members['groupID'] == groups['groupID'][:, None])
    members['groupmass'][idx[1]] = groups['mass'][idx[0]]
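A quick way to check that the vectorized versions agree with the list-comprehension version on the small test arrays generated above (a usage sketch):
expected = np.array([getgroup(i)['mass'][0] for i in members['groupID']])
setmass2(members, groups)
assert np.allclose(members['groupmass'], expected)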