Divide a dataframe based on the last occurrence of a condition - python

Let's say I have this data ordered by id:
id | count
1 1
2 2
3 0
4 4
5 3
6 2
7 0
8 10
9 1
10 2
I always want to obtain the rows that come after the last zero in the count column. Based on the data above, I would want to get:
id | count
8 10
9 1
10 2
Does anyone know how to do this?

pandas
df.loc[df['count'].ne(0).iloc[::-1].cumprod().astype(bool)]
id count
7 8 10
8 9 1
9 10 2
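To unpack how that one-liner works, here is the same mask built step by step (just a sketch, using the df from the question):
mask = df['count'].ne(0)          # True where count is non-zero
mask = mask.iloc[::-1].cumprod()  # walking from the bottom up, stays 1 until the first zero is hit
mask = mask.astype(bool)          # 1/0 -> True/False
df.loc[mask]                      # .loc aligns on the index, so the reversed order doesn't matter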
numpy
df[(df['count'].values[::-1] != 0).cumprod()[::-1].astype(bool)]
id count
7 8 10
8 9 1
9 10 2
with other conditions
df[(df['count'].values[::-1] < 3).cumprod()[::-1].astype(bool)]
# df.loc[df['count'].lt(3).iloc[::-1].cumprod().astype(bool)]
id count
8 9 1
9 10 2
debugging
You should be able to copy and paste this and reproduce my results. If you can't, then something else is wrong. Try resetting your kernel.
import pandas as pd
df = pd.DataFrame({
    'count': [1, 2, 0, 4, 3, 2, 0, 10, 1, 2],
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
df[(df['count'].values[::-1] < 3).cumprod()[::-1].astype(bool)]
Should produce
count id
8 1 9
9 2 10

Related

How to change several values of pandas DataFrame at once?

Let's consider a very simple data frame:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
A B
0 0 3
1 1 4
2 2 5
3 3 0
4 2 2
5 5 7
I want to do two things with this dataframe:
All numbers below 3 have to be changed to 0
All numbers equal to 0 have to be changed to 10
The problem is that when we apply:
df[df < 3] = 0
df[df == 0] = 10
we are also going to change numbers which were initially not 0, obtaining:
A B
0 10 3
1 10 4
2 10 5
3 3 10
4 10 10
5 5 7
which is not the desired output, which should look like this:
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
My question is: is there any way to change both of those things at the same time? I.e., I want to change numbers smaller than 3 to 0 and numbers equal to 0 to 10, independently of each other.
Note! This example is created just to outline the problem. An obvious solution is to change the order of replacement: first change 0 to 10, and then change numbers smaller than 3 to 0. But I'm struggling with a much more complex problem, and I want to know if it is possible to change both of those at once.
Use applymap() to apply a function to each element in the DataFrame:
df.applymap(lambda x: 10 if x == 0 else (0 if x < 3 else x))
results in
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
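Side note: on newer pandas (2.1 and later, as far as I remember) applymap is deprecated in favor of the element-wise DataFrame.map, which takes the same function:
# equivalent on pandas >= 2.1, where applymap was renamed to DataFrame.map
df.map(lambda x: 10 if x == 0 else (0 if x < 3 else x))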
I would do it the following way:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
df_orig = df.copy()
df[df_orig < 3] = 0
df[df_orig == 0] = 10
print(df)
output
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
Explanation: I use the .copy method to get a copy of the DataFrame, which is placed in the variable df_orig, then use that DataFrame, which is not altered during the run of the program, to select where to put 0 and 10.
You can create the masks first and then change the values:
m1 = df < 3
m2 = df == 0
df[m1] = 0
df[m2] = 10
print(df)
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
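For completeness, a compact variant (a sketch, not from the original answers) that evaluates both conditions against the original values in a single call using numpy.select; the == 0 condition is listed first so it takes precedence over < 3:
import numpy as np
# both masks are computed from the original df before anything is overwritten
df[:] = np.select([df == 0, df < 3], [10, 0], default=df)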

Vectorizing for-loop

I have a very large dataframe (~10^8 rows) where I need to change some values. The algorithm I use is complex so I tried to break down the issue into a simple example below. I mostly programmed in C++, so I keep thinking in for-loops. I know I should vectorize but I am new to python and very new to pandas and cannot come up with a better solution. Any solutions which increase performance are welcome.
#!/usr/bin/python3
import numpy as np
import pandas as pd
data = {'eventID': [1, 1, 1, 2, 2, 3, 4, 5, 6, 6, 6, 6, 7, 8],
        'types': [0, -1, -1, -1, 1, 0, 0, 0, -1, -1, -1, 1, -1, -1]}
mydf = pd.DataFrame(data, columns=['eventID', 'types'])
print(mydf)
MyIntegerCodes = np.array([0, 1])
eventIDs = np.unique(mydf.eventID.values) # can be up to 10^8 values
for val in eventIDs:
    currentTypes = mydf[mydf.eventID == val].types.values
    if (0 in currentTypes) & ~(1 in currentTypes):
        mydf.loc[mydf.eventID == val, 'types'] = 0
    if ~(0 in currentTypes) & (1 in currentTypes):
        mydf.loc[mydf.eventID == val, 'types'] = 1
print(mydf)
Any ideas?
EDIT: I was asked to explain what I do with my for-loops.
For every eventID I want to know whether the corresponding types contain a 1, a 0, or both. If they contain a 1 (and no 0), all values equal to -1 should be changed to 1. If they contain a 0 (and no 1), all values equal to -1 should be changed to 0. My problem is to do this efficiently for each eventID independently. There can be one or multiple entries per eventID.
Input of example:
eventID types
0 1 0
1 1 -1
2 1 -1
3 2 -1
4 2 1
5 3 0
6 4 0
7 5 0
8 6 -1
9 6 -1
10 6 -1
11 6 1
12 7 -1
13 8 -1
Output of example:
eventID types
0 1 0
1 1 0
2 1 0
3 2 1
4 2 1
5 3 0
6 4 0
7 5 0
8 6 1
9 6 1
10 6 1
11 6 1
12 7 -1
13 8 -1
First we create boolean masks m1 and m2 using Series.eq, then group each mask by eventID with groupby and transform with any, and finally use np.select to choose 1 or 0 depending on whether condition m1 or m2 holds:
m1 = mydf['types'].eq(1).groupby(mydf['eventID']).transform('any')
m2 = mydf['types'].eq(0).groupby(mydf['eventID']).transform('any')
mydf['types'] = np.select([m1 , m2], [1, 0], mydf['types'])
Result:
# print(mydf)
eventID types
0 1 0
1 1 0
2 1 0
3 2 1
4 2 1
5 3 0
6 4 0
7 5 0
8 6 1
9 6 1
10 6 1
11 6 1
12 7 -1
13 8 -1
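A close variant (just a sketch, not from the original answer) that only rewrites the -1 entries and, like the loop in the question, leaves events containing both a 0 and a 1 untouched:
is_missing = mydf['types'].eq(-1)
has1 = mydf['types'].eq(1).groupby(mydf['eventID']).transform('any')
has0 = mydf['types'].eq(0).groupby(mydf['eventID']).transform('any')
mydf.loc[is_missing & has1 & ~has0, 'types'] = 1
mydf.loc[is_missing & has0 & ~has1, 'types'] = 0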

Creating a derived column using pandas operations

I'm trying to create a column which contains a cumulative sum of the number of entries, tid, grouped according to unique values of (rid, tid). The cumulative sum should increment by the number of entries in the grouping, as shown in the df3 dataframe below, rather than one at a time.
import pandas as pd
df1 = pd.DataFrame({
    'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
    'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3]})
rid tid
0 1 1
1 1 2
2 1 2
3 2 1
4 2 1
5 2 3
6 3 1
7 3 4
8 4 5
9 5 1
10 5 1
11 5 1
12 5 3
Giving after the required operation:
df3 = pd.DataFrame({
    'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
    'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3],
    'groupentries': [1, 2, 2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 1],
    'cumulativeentries': [1, 2, 2, 3, 3, 1, 4, 1, 1, 7, 7, 7, 2]})
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
The derived column that I'm after is the cumulativeentries column although I've only figured out how to generate the intermediate column groupentries using pandas:
df1.groupby(["rid", "tid"]).size()
Values in cumulativeentries are actually a kind of running count.
The task is to count occurrences of the current tid in the "source area" of the tid column:
from the beginning of the DataFrame,
up to (including) the end of the current group.
To compute both required values for each group, I defined
the following function:
def fn(grp):
    lastRow = grp.iloc[-1]  # last row of the current group
    lastId = lastRow.name   # index of this row
    tids = df1.truncate(after=lastId).tid
    return [grp.index.size, tids[tids == lastRow.tid].size]
To get the "source area" mentioned above, I used the truncate function.
In my opinion it is a very intuitive solution, based on the notion of the
"source area".
The function returns a list containing both required values:
the size of the current group,
how many tids equal to the current tid are in the
truncated tid column.
To apply this function, run:
df2 = df1.groupby(['rid', 'tid']).apply(fn).apply(pd.Series)\
.rename(columns={0: 'groupentries', 1: 'cumulativeentries'})
Details:
apply(fn) generates a Series containing 2-element lists.
apply(pd.Series) converts it to a DataFrame (with default column names).
rename sets the target column names.
And the last thing to do is to join this table to df1:
df1.join(df2, on=['rid', 'tid'])
For the first column, use GroupBy.transform with size; for the second, use a custom function that takes all values of the tid column up to the group's last index, compares them with the group's last value, and counts the matches with sum:
f = lambda x: (df1['tid'].iloc[:x.index[-1]+1] == x.iat[-1]).sum()
df1['groupentries'] = df1.groupby(["rid", "tid"])['rid'].transform('size')
df1['cumulativeentries'] = df1.groupby(["rid", "tid"])['tid'].transform(f)
print (df1)
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
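A rough alternative sketch (not from the original answers): build a running count per tid with cumcount and propagate it to the whole (rid, tid) group with transform('max'), which reproduces cumulativeentries for this data:
df1['groupentries'] = df1.groupby(['rid', 'tid'])['tid'].transform('size')
running = df1.groupby('tid').cumcount() + 1  # running count of each tid so far
df1['cumulativeentries'] = running.groupby([df1['rid'], df1['tid']]).transform('max')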

Count Total number of sequences that meet condition, without for-loop

I have the following Dataframe as input:
import pandas as pd
l = [2,2,2,5,5,5,3,3,2,2,4,4,6,5,5,3,5]
df = pd.DataFrame(l)
print(df)
0
0 2
1 2
2 2
3 5
4 5
5 5
6 3
7 3
8 2
9 2
10 4
11 4
12 6
13 5
14 5
15 3
16 5
As output, I would like to have a final count of the total number of sequences that meet a certain condition. For example, in this case, I want the number of sequences in which the values are greater than 3.
So, the output is 3.
1st Sequence = [555]
2nd Sequence = [44655]
3rd Sequence = [5]
Is there a way to calculate this without a for-loop in pandas?
I have already implemented a solution using a for-loop, and I wonder if there is a better approach using pandas in O(N) time.
Thanks very much!
Related to this question: How to count the number of time intervals that meet a boolean condition within a pandas dataframe?
You can use:
m = df[0] > 3
df[1] = (~m).cumsum()
df = df[m]
print (df)
0 1
3 5 3
4 5 3
5 5 3
10 4 7
11 4 7
12 6 7
13 5 7
14 5 7
16 5 8
#create tuples
df = df.groupby(1)[0].apply(tuple).value_counts()
print (df)
(5, 5, 5) 1
(4, 4, 6, 5, 5) 1
(5,) 1
Name: 0, dtype: int64
#alternatively create strings
df = df.astype(str).groupby(1)[0].apply(''.join).value_counts()
print (df)
5 1
44655 1
555 1
Name: 0, dtype: int64
If you need the output as a list:
print (df.astype(str).groupby(1)[0].apply(''.join).tolist())
['555', '44655', '5']
Detail:
print (df.astype(str).groupby(1)[0].apply(''.join))
3 555
7 44655
8 5
Name: 0, dtype: object
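If only the final count of sequences is needed (3 for this data), counting the distinct group labels under the mask gives it directly (a sketch, starting again from the original df from the question):
m = df[0] > 3
print((~m).cumsum()[m].nunique())
#3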
If you don't need pandas this will suit your needs:
l = [2,2,2,5,5,5,3,3,2,2,4,4,6,5,5,3,5]
def consecutive(array, value):
    result = []
    sub = []
    for item in array:
        if item > value:
            sub.append(item)
        else:
            if sub:
                result.append(sub)
            sub = []
    if sub:
        result.append(sub)
    return result
print(consecutive(l,3))
#[[5, 5, 5], [4, 4, 6, 5, 5], [5]]
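And if only the count is wanted, the length of that result is the answer:
print(len(consecutive(l, 3)))
#3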

How to selectively import only lines of words with genfromtxt into Python?

I have a data text file which has the following format:
1 2 2 3 4 5 6
1 5 8 9 3 4 2
1 2 3 5 1 2 3
Timestamp 1
5 4 8 9 8 7 2
1 5 9 6 3 1 2
Timestamp 2
...
I wish to import the data in such a way that:
I can ignore the timestamps first and process the data.
AND I can also handle the timestamps later.
I have achieved 1 by
myData = np.genfromtxt('data.txt', comments='T')
By doing so, I have the following in Python
1 2 2 3 4 5 6
1 5 8 9 3 4 2
1 2 3 5 1 2 3
5 4 8 9 8 7 2
1 5 9 6 3 1 2
However, by doing this I simply have discarded all the timestamps.
But I need to process them later as well.
How may I import the timestamps into another list like below in Python?
Timestamp 1
Timestamp 2
...
How about this?
I assume that the numbers before a Timestamp are the ones that belong to it:
This snippet also converts the numbers to integers.
CODE:
with open('source.txt', 'r') as f:
    data = {}
    numbers = []
    for line in f:
        ln = line.strip()
        if 'Timestamp' in ln:
            data[ln] = numbers
            numbers = []
        elif ln:
            numbers.append([int(n) for n in ln.split()])
print(data)
OUTPUT:
{
'Timestamp 2':
[
[5, 4, 8, 9, 8, 7, 2],
[1, 5, 9, 6, 3, 1, 2]
],
'Timestamp 1':
[
[1, 2, 2, 3, 4, 5, 6],
[1, 5, 8, 9, 3, 4, 2],
[1, 2, 3, 5, 1, 2, 3]
]
}
@PeterVaro has a nice solution that keeps the timestamps linked to the data, but if all you want to do is import the numbers and timestamps into separate lists, you can do this:
with open('data.txt') as dataFile:
    numbers = []
    timestamps = []
    for line in dataFile:
        # the if statement makes sure it's not a blank line with only a newline character in it
        if len(line) > 1:
            if 'Timestamp' in line:
                timestamps.append(line.rstrip())
            else:
                numbers.append([int(x) for x in line.split()])

for line in numbers:
    for number in line:
        print(number, end=" ")
    print()
print()

for timestamp in timestamps:
    print(timestamp)
output:
1 2 2 3 4 5 6
1 5 8 9 3 4 2
1 2 3 5 1 2 3
5 4 8 9 8 7 2
1 5 9 6 3 1 2
Timestamp 1
Timestamp 2
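If you want to stay with genfromtxt for the numeric part (as in the question) and only need the timestamps as a separate list, a minimal sketch along the same lines:
import numpy as np

myData = np.genfromtxt('data.txt', comments='T')  # numbers only, Timestamp lines are skipped as comments
with open('data.txt') as f:
    # assumes every timestamp line literally starts with 'Timestamp'
    timestamps = [line.strip() for line in f if line.startswith('Timestamp')]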
