I've got a dataframe that looks something like this:
user  current_date  prior_date  points_scored
1     2021-01-01    2020-10-01  5
2     2021-01-01    2020-10-01  4
2     2021-01-21    2020-10-21  4
2     2021-05-01    2021-02-01  4
The prior_date column is simply current_date minus 3 months, and points_scored is the number of points scored on current_date. I'd like to identify which rows have sum(points_scored) >= 8, where the rows considered for a given user are those whose current_date falls between that row's prior_date and current_date. It is guaranteed that no single row has points_scored >= 8.
For example, in the example above, I'd like something like this returned:
user  current_date  prior_date  points_scored  flag
1     2021-01-01    2020-10-01  5              0
2     2021-01-01    2020-10-01  4              0
2     2021-01-21    2020-10-21  4              1
2     2021-05-01    2021-02-01  4              0
The third row gets flag=1 because, for row 3's current_date=2021-01-21 and prior_date=2020-10-21, the rows to consider are rows 2 and 3. Row 2 counts because its current_date=2021-01-01 falls between row 3's prior_date and current_date, so the window sum is 4 + 4 = 8.
Ultimately, I'd like to end up with a data structure that shows each distinct user and its flag. It could be a dataframe or a dictionary; anything easily referenceable.
user  flag
1     0
2     1
To do this, I'm doing something like this:
flags = {}
# only consider users with more than 2 rows
ids = list(df['user'].value_counts()[df['user'].value_counts() > 2].index)
for id in ids:
    temp_df = df[df['user'] == id]
    for idx, row in temp_df.iterrows():
        cur_date = row['current_date']
        prior_date = row['prior_date']
        # sum this user's points over rows whose current_date falls in the window
        temp_total = temp_df[(temp_df['current_date'] <= cur_date) & (temp_df['current_date'] >= prior_date)]['points_scored'].sum()
        if temp_total >= 8:
            flags[id] = 1
            break
The code above works, but just takes way too long to actually execute.
You are right, performing loops on large data can be quite time consuming. This is where the power of numpy comes into full play. I am still not sure exactly what you want, but I can help address the speed.
numpy.select can perform your if/else logic efficiently.
import pandas as pd
import numpy as np
condition = [df['points_scored'] == 5, df['points_scored'] == 4, df['points_scored'] == 3]  # <-- put your conditions here
choices = ['okay', 'hmmm!', 'yes']  # <-- what you want returned (the order is important)
np.select(condition, choices, default='default value')
Also, you might want to state more succinctly what you want; meanwhile, you can refactor your loops with np.select().
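For the rolling-window flag in the question itself, here is a minimal sketch of one vectorised approach, assuming current_date is converted to datetime and that a 90-day time-based window is an acceptable stand-in for the exact 3-month span (the names window_sum and flags are only illustrative):
import pandas as pd

df['current_date'] = pd.to_datetime(df['current_date'])
df = df.sort_values(['user', 'current_date'])

# rolling 90-day sum of points_scored within each user, in the sorted row order
window_sum = (df.set_index('current_date')
                .groupby('user')['points_scored']
                .rolling('90D')
                .sum()
                .to_numpy())

df['flag'] = (window_sum >= 8).astype(int)

# one flag per user: 1 if any of that user's windows reached 8
flags = df.groupby('user')['flag'].max().to_dict()
For the sample data above this yields {1: 0, 2: 1}.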
Related
I have a list of transactions that lists the matter, the date, and the amount. People entering the data often make mistakes and have to reverse out costs by entering a new cost with a negative amount to offset the error. I'm trying to identify both reversal entries and the entry being reversed by grouping my data according to matter number and work date and then comparing Amounts.
The data looks something like this:
MatterNum  WorkDate   Amount
1          1/02/2022  10
1          1/02/2022  15
1          1/02/2022  -10
2          1/04/2022  15
2          1/05/2022  -5
2          1/05/2022  5
So my output table would look like this:
|MatterNum|WorkDate|Amount|Reversal?|
|---------|--------|------|---------|
|1|1/02/2022|10|yes|
|1|1/02/2022|15|no|
|1|1/02/2022|-10|yes|
|2|1/04/2022|15|no|
|2|1/05/2022|-5|yes|
|2|1/05/2022|5|yes|
Right now, I'm using the following code to check each row:
import pandas as pd

data = [
    [1, '1/2/2022', 10],
    [1, '1/2/2022', 15],
    [1, '1/2/2022', -10],
    [2, '1/4/2022', 12],
    [2, '1/5/2022', -5],
    [2, '1/5/2022', 5]
]
df = pd.DataFrame(data, columns=['MatterNum', 'WorkDate', 'Amount'])

def rev_check(MatterNum, workDate, WorkAmt, df):
    # look at all rows for the same matter and work date
    funcDF = df.loc[(df['MatterNum'] == MatterNum) & (df['WorkDate'] == workDate)]
    listCheck = funcDF['Amount'].tolist()
    # flag the row if its exact negative appears in that group
    if WorkAmt * -1 in listCheck:
        return 'yes'
    return 'no'

df['reversal?'] = df.apply(lambda row: rev_check(row.MatterNum, row.WorkDate, row.Amount, df), axis=1)
This seems to work, but it is pretty slow. I need to check millions of rows of data. Is there a better way I can approach this that would be more efficient?
If I assume that a "reversal" is when this row's amount is less than the previous row's amount, then pandas can do this with diff:
import pandas as pd

data = [
    [1, '1/2/2022', 10],
    [1, '1/2/2022', 15],
    [1, '1/2/2022', -10],
    [1, '1/2/2022', 12]
]
df = pd.DataFrame(data, columns=['MatterNum', 'WorkDate', 'Amount'])
print(df)

# a row counts as a "reversal" when its amount is lower than the previous row's
df['Reversal'] = df['Amount'].diff() < 0
print(df)
Output:
MatterNum WorkDate Amount
0 1 1/2/2022 10
1 1 1/2/2022 15
2 1 1/2/2022 -10
3 1 1/2/2022 12
MatterNum WorkDate Amount Reversal
0 1 1/2/2022 10 False
1 1 1/2/2022 15 False
2 1 1/2/2022 -10 True
3 1 1/2/2022 12 False
The first row has to be special-cased, since there's nothing to compare against.
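If you need the exact pair-matching described in the question (a row and its exact negative within the same MatterNum and WorkDate) rather than the diff heuristic above, here is a minimal vectorised sketch; it assumes an exact offsetting amount anywhere in the same group marks both rows:
import pandas as pd

data = [
    [1, '1/02/2022', 10],
    [1, '1/02/2022', 15],
    [1, '1/02/2022', -10],
    [2, '1/04/2022', 15],
    [2, '1/05/2022', -5],
    [2, '1/05/2022', 5],
]
df = pd.DataFrame(data, columns=['MatterNum', 'WorkDate', 'Amount'])

# negate every amount once; a row is part of a reversal pair when its
# negated amount exists in the same MatterNum/WorkDate group
neg = df.assign(Amount=-df['Amount']).drop_duplicates()
matched = df.merge(neg, on=['MatterNum', 'WorkDate', 'Amount'],
                   how='left', indicator=True)['_merge'].eq('both')
df['Reversal?'] = matched.map({True: 'yes', False: 'no'}).to_numpy()
print(df)
The single merge replaces the per-row apply, which is what lets it scale to millions of rows.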
I really want to speed up my code.
My already working code loops through a DataFrame and gets the start and end year. I add them to lists, and after the loop I build a DataFrame from them.
import pandas as pd

rows = range(3560)

# initiate lists and dataframe
start_year = []
end_year = []
for i in rows:
    start_year.append(i)
    end_year.append(i)

df = pd.DataFrame({'Start date': start_year, 'End date': end_year})
I get what I expect, but very slowly:
   Start date  End date
0           0         0
1           1         1
2           2         2
3           3         3
Yes, it can be made faster. The trick is to avoid list.append (or, worse, pd.DataFrame.append) in a loop. You can use list(range(3560)), but you may find np.arange even more efficient. Here you can assign the same array to multiple columns via dict.fromkeys:
import numpy as np
import pandas as pd

df = pd.DataFrame(dict.fromkeys(['Start date', 'End date'], np.arange(3560)))
print(df.shape)
# (3560, 2)
print(df.head())
# Start date End date
# 0 0 0
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
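For comparison, the list(range(...)) alternative mentioned above produces the same frame with plain Python lists instead of a NumPy array (a sketch; the name years is only illustrative):
years = list(range(3560))
df = pd.DataFrame({'Start date': years, 'End date': years})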
In the process of writing out a script to automate the compilation of a report, I'm trying to create a column of Timestamps based on a conditional using np.where(). The logic is as follows:
df['StartMonth'] = np.where(
chng['Count'] == 1, pd.Timestamp(
int(year), chng['Month'].astype(int), 1), str('')
)
The DataFrame is a list of employees who are either additions or deletions, where chng['Count'] is a flag: +1 marks an addition and -1 a deletion. So wherever an employee is being added, StartMonth should be a timestamp built from the fixed year variable, the row's Month, and day 1 (both year and chng['Month'] are strings, hence casting them to integers in the conditional). The output comes up as the following for each True row:
Month Count StartMonth
0 1 1 1970-01-01 00-00-01.000002+00019:00:01
1 1 1 1970-01-01 00-00-01.000002+00019:00:01
2 4 1 1970-01-01 00-00-01.000002+00019:00:01
3 5 1 1970-01-01 00-00-01.000002+00019:00:01
4 10 1 1970-01-01 00-00-01.000002+00019:00:01
I've tried this with year and chng['Month'] already cast as integers prior to the conditional and it's been the same result. The only time it "works" is when chng['Month'] is replaced with any other arbitrary number, leading me to believe that is the issue. I have done plenty of other conditionals with np.where() that use values from another Series in the DataFrame (though not as the base for a Timestamp creation) without any problem, so I'm not sure what is causing this.
There are a few issues:
You should use pd.to_datetime for vectorised conversion, rather than pd.Timestamp.
numpy.where returns a NumPy array, which is not the same as a Pandas datetime series. But you can feed an array to pd.to_datetime.
You should avoid combining strings with datetime values in a single series. Choose one. Here, instead of '' use pd.NaT to ensure your series remains datetime.
Here's an example solution:
year = 2018
s = str(year) + '-' + df['Month'].astype(str)
df['StartMonth'] = pd.to_datetime(np.where(df['Count'] == 1, s, pd.NaT))
print(df)
Month Count StartMonth
0 1 1 2018-01-01
1 1 1 2018-01-01
2 4 1 2018-04-01
3 5 1 2018-05-01
4 10 1 2018-10-01
I have two data sets from different pulse oximeters and plot them with pyplot as displayed below. As you may see, the green data set has a lot of outliers (vertical drops). In my work I've defined these outliers as non-valid for my statistical analysis; they are most certainly not real measurements, so I argue that I can simply remove them.
The characteristic of these rogue values is that they're single-value (or at most two-value) outliers (see df below). The "real" sample values are either the same as the previous value or ±1 away. In e.g. Java (pseudocode) I would do something like:
for (i; i < df.length; i++)
    if (df[i+1|-1].spo2 - df[i].spo2 > 1|-1)
        df[i].drop
What would be the pandas (numpy?) equivalent of what I'm trying to do: remove values that differ by more than 1 from the previous/next value?
df:
time, spo2
1900-01-01 18:18:41.194 98.0
1900-01-01 18:18:41.376 98.0
1900-01-01 18:18:41.559 78.0
1900-01-01 18:18:41.741 98.0
1900-01-01 18:18:41.923 98.0
1900-01-01 18:18:42.105 90.0
1900-01-01 18:18:42.288 97.0
1900-01-01 18:18:42.470 97.0
1900-01-01 18:18:42.652 98.0
Have a look at pandas.DataFrame.shift. It shifts a column's values up or down by a given number of rows, so you can line each row up with its neighbour's value in a new column:
# original df
   x1
0   0
1   1
2   2
3   3
4   4

# shift down
df['x2'] = df['x1'].shift(1)

   x1   x2
0   0  NaN   # Beware
1   1  0.0
2   2  1.0
3   3  2.0
4   4  3.0

# shift up
df['x2'] = df['x1'].shift(-1)

   x1   x2
0   0  1.0
1   1  2.0
2   2  3.0
3   3  4.0
4   4  NaN   # Beware
You can use this to move spo2 of timestamp n+1 next to spo2 in the timestamp n row. Then, filter based on conditions applied to that one row.
df['spo2_Next'] = df['spo2'].shift(-1)

# replace the trailing NaN so the float comparison below works
df['spo2_Next'] = df['spo2_Next'].fillna(df['spo2'])

# apply your row-wise condition to create a filter column
df.loc[((df.spo2_Next - df.spo2) > 1) | ((df.spo2_Next - df.spo2) < -1), 'Outlier'] = True

# filter
df_clean = df[df.Outlier != True]

# remove the filter column
del df_clean['Outlier']
When you filter a pandas dataframe like:
df[(df.column1 == 2) & (df.column2 < 3)], you are:
comparing a numeric series to a scalar value and generating a boolean series
obtaining two boolean series and combining them with a logical and
then using the resulting boolean series to filter the data frame (rows where it is False will not appear in the new data frame)
So you just need to create an iterative algorithm over the data frame to produce such a boolean array, and use it to filter the dataframe, as in:
import pandas as pd

data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df[[True, False, True]]
You can also create a closure to filter the data frame (using df.apply), keeping previous observations inside the closure to detect abrupt changes, but that would be overly complicated. I would go for the straightforward imperative solution.
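For the original spo2 question (keep a value only when it is within 1 of its previous or its next value), here is a minimal vectorised sketch of the boolean-mask idea, assuming df and the spo2 column are as shown in the question:
spo2 = df['spo2']

# a value is plausible if it is within 1 of at least one neighbour
near_prev = (spo2 - spo2.shift(1)).abs() <= 1
near_next = (spo2 - spo2.shift(-1)).abs() <= 1

df_clean = df[near_prev | near_next]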
I have an unbalanced panel that I'm trying to aggregate up to a regular, weekly time series. The panel looks as follows:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
To give a better sense of what I'm looking for, I'm including an intermediate step, which I'd love to skip if possible. Basically some data needs to be filled in so that it can be aggregated. As you can see, missing weeks in between observations are interpolated. All other values are set equal to zero.
Group Date value
A 1/1/2000 5
A 1/8/2000 5
A 1/15/2000 10
A 1/22/2000 0
B 1/1/2000 0
B 1/8/2000 3
B 1/15/2000 3
B 1/22/2000 7
C 1/1/2000 0
C 1/8/2000 0
C 1/15/2000 0
C 1/22/2000 20
The final result that I'm looking for is as follows:
Date value
1/1/2000 5 = 5 + 0 + 0
1/8/2000 8 = 5 + 3 + 0
1/15/2000 13 = 10 + 3 + 0
1/22/2000 27 = 0 + 7 + 20
I haven't gotten very far; I managed to create a panel:
panel = df.set_index(['Group','week']).to_panel()
Unfortunately, if I try to resample, I get an error
panel.resample('W')
TypeError: Only valid with DatetimeIndex or PeriodIndex
Assuming df is your second dataframe with weeks, you can try the following:
df.groupby('week').sum()['value']
See the pandas documentation for groupby(); it's similar to GROUP BY in SQL.
To obtain the second dataframe from the first one, try the following:
Firstly, prepare a function to map the day to week
def d2w_map(day):
    if day <= 7:
        return 1
    elif day <= 14:
        return 2
    elif day <= 21:
        return 3
    else:
        return 4
In the method above, days from 29 to 31 are considered in week 4. But you get the idea. You can modify it as needed.
Secondly, convert the days in the first dataframe to weeks and drop the day column:
df['Week'] = df['Day'].apply(d2w_map)
del df['Day']
Thirdly, initialize your second dataframe with only the 'Group' and 'Week' columns, leaving 'value' out. Assuming your initialized new dataframe is called result, you can now do a join:
result = result.join(df.set_index(['Group', 'Week']), on=['Group', 'Week'])
Last, fill the NaN values in the 'value' column from the nearby elements; the NaN entries are what you need to interpolate. Since I am not sure exactly how you want the interpolation to work, I will leave that to you; one possible sketch follows.
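A minimal sketch of that fill step, assuming result is the joined frame from the previous step and that gaps should carry each group's last observed value forward, with remaining leading gaps set to 0 as in the question's intermediate table:
result = result.sort_values(['Group', 'Week'])
# carry each group's last seen value forward; weeks before a group's
# first observation fall back to 0 (adjust this rule as needed)
result['value'] = result.groupby('Group')['value'].ffill().fillna(0)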
Here is how you can change d2w_map to convert a date string to an integer day of the week:
from datetime import datetime

def d2w_map(day_str):
    return datetime.strptime(day_str, '%m/%d/%Y').weekday()
Returned value of 0 means Monday, 1 means Tuesday and so on.
If you have the package dateutil installed, the function can be more robust:
from dateutil.parser import parse

def d2w_map(day_str):
    return parse(day_str).weekday()
Sometimes, things you want are already implemented by magic :)
Turns out the key is to resample a groupby object like so:
df_temp = (df.set_index('Date')
             .groupby('Group')
             .resample('W', how='sum', fill_method='ffill'))

ts = (df_temp.reset_index()
             .groupby('Date')
             .sum()['value'])
Used this tab delimited test.txt:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
You can skip the intermediate dataframe as follows. I don't have time to polish it now; just play around with it to get it right.
import pandas as pd
import datetime

time_format = '%m/%d/%Y'

Y = pd.read_csv('test.txt', sep='\t')
dates = Y['Date']
dates_right_format = [datetime.datetime.strptime(s, time_format) for s in dates]
values = Y['value']

X = pd.DataFrame(values)
X.index = dates_right_format
print(X)

X = X.sort_index()
print(X)

print(X.resample('W', closed='right', label='right').sum(min_count=1))
Output of the last print:
value
2000-01-02 5
2000-01-09 3
2000-01-16 NaN
2000-01-23 37