Find intersection in Dataframe with Start and Enddate in Python - python

I have a dataframe of elements with a start and end datetime. What is the best option to find intersections of the dates? My naive approach right now consists of two nested loops cross-comparing the elements, which obviously is super slow. What would be a better way to achieve that?
dict = {}
start = "start_time"
end = "end_time"
for index1, rowLoop1 in df[{start, end}].head(500).iterrows():
matches = []
dict[(index1, rowLoop1[start])] = 0
for index2, rowLoop2 in df[{start,end}].head(500).iterrows():
if index1 != index2:
if date_intersection(rowLoop1[start], rowLoop1[end], rowLoop2[start], rowLoop2[end]):
dict[(index1, rowLoop1[start])] += 1
Code for date_intersection:
def date_intersection(t1start, t1end, t2start, t2end):
if (t1start <= t2start <= t2end <= t1end): return True
elif (t1start <= t2start <= t1end):return True
elif (t1start <= t2end <= t1end):return True
elif (t2start <= t1start <= t1end <= t2end):return True
else: return False
Sample data:
id,start_date,end_date
41234132,2021-01-10 10:00:05,2021-01-10 10:30:27
64564512,2021-01-10 10:10:00,2021-01-11 10:28:00
21135765,2021-01-12 12:30:00,2021-01-12 12:38:00
87643252,2021-01-12 12:17:00,2021-01-12 12:42:00
87641234,2021-01-12 12:58:00,2021-01-12 13:17:00

You can do something like merging your dataframe with itself to get the cartesian product and comparing columns.
df = df.merge(df, how='cross', suffixes=('','_2'))
df['date_intersection'] = (((df['start_date'].le(df['start_date_2']) & df['start_date_2'].le(df['end_date'])) | # start 2 within start/end
(df['start_date'].le(df['end_date_2']) & df['end_date_2'].le(df['end_date'])) | # end 2 within start/end
(df['start_date_2'].le(df['start_date']) & df['start_date'].le(df['end_date_2'])) | # start within start 2/end 2
(df['start_date_2'].le(df['end_date']) & df['end_date'].le(df['end_date_2']))) & # end within start 2/end 2
df['id'].ne(df['id_2'])) # id not compared to itself
and then to return the ids and if they have a date intersection...
df.groupby('id')['date_intersection'].any()
id
21135765 True
41234132 True
64564512 True
87641234 False
87643252 True
or if you need the ids that were intersected
df.loc[df['date_intersection'], :].groupby(['id'])['id_2'].agg(list).to_frame('intersected_ids')
intersected_ids
id
21135765 [87643252]
41234132 [64564512]
64564512 [41234132]
87643252 [21135765]

Related

Optimizing pandas operations

I'm cleaning a data set with 6 columns and just under 9k rows. As part of the clean up I have to find zero/negative, repetitive, interpolated, and outlier values defined as:
repetitive values - 3 subsequent values are equivalent up to 6 decimal places, flag the first one
interpolated values - take a = row1_val - row2_val, b = row2_val-row3_val, c = row3_val - row4_val, etc. If a=b or b=c, etc. flag
outlier values - 1.1peak < MW < 0.1peak
Right now I am using for loops on the data frame to do the row comparisons and flag the values, put them into a new data frame, and replace them with 999999 but it takes FOREVER. I used the following code to find and replace the zero/negative values, but I cant seem to make it work for the multi row functions used in the for loop. Can anyone show me how this works?
zero/negative values:
df = (df.drop(data_columns, axis=1).join(df[data_columns].apply(pd.to_numeric, errors='coerce')))
Missing_Vals_df = df.loc[(df['A KW'].isnull()) | (df['A KVAR'].isnull()) | (df['B KW'].isnull()) | (df['B KVAR'].isnull()) | (df['C KW'].isnull()) | (df['C KVAR'].isnull())]
df = df.fillna(999999)
Loops:
for x in range(len(df)-2):
for thing in data_columns:
if df.loc[x][thing] <= 0:
df = df.replace(to_replace = df.loc[x][thing], value=999999)
elif (round(df.loc[x][thing], 6) == round(df.loc[x+1][thing], 6) == round(df.loc[x+2][thing], 6)) & (df.loc[x][thing] != 999999):
if x not in duplicate_loc:
duplicate_loc.append(x)
duplicate_df = duplicate_df.append(df.loc[(x)])
df = df.replace(to_replace = df.iloc[x][thing], value=999999)
elif (round((df.loc[x+1][thing] - df.loc[x][thing]), 3) == round((df.loc[x+2][thing] - df.loc[x+1][thing]), 3)) & (df.loc[x][thing] != 999999):
if x not in interpolated_loc:
interpolated_loc.append(x)
interpolated_df = interpolated_df.append(df.loc[(x)])
df = df.replace(to_replace = df.iloc[x][thing], value=999999)
elif ((df.loc[x][thing] > 1.1*df_peak.loc[0]['Value']) | (df.loc[x][thing] > 1.1*df_peak.loc[0]['Value']) | (df.loc[x][thing] > 1.1*df_peak.loc[0]['Value'])) & (df.loc[x][thing] != 999999):
if x not in outlier_loc:
outlier_loc.append(x)
outlier_df = outlier_df.append(df.loc[(x)])
df = df.replace(to_replace = df.iloc[x][thing], value=999999)

Comparing values in Python data frame efficiently

I'm trading daily on Cryptocurrencies and would like to find which are the most desirable Cryptos for trading.
I have CSV file for every Crypto with the following fields:
Date Sell Buy
43051.23918 1925.16 1929.83
43051.23919 1925.12 1929.79
43051.23922 1925.12 1929.79
43051.23924 1926.16 1930.83
43051.23925 1926.12 1930.79
43051.23926 1926.12 1930.79
43051.23927 1950.96 1987.56
43051.23928 1190.90 1911.56
43051.23929 1926.12 1930.79
I would like to check:
How many quotes will end with profit:
for Buy positions - if one of the following Sells > current Buy.
for Sell positions - if one of the following Buys < current Sell.
How much time it would take to a theoretical position to become profitable.
What can be the profit potential.
I'm using the following code:
#converting from OLE to datetime
OLE_TIME_ZERO = dt.datetime(1899, 12, 30, 0, 0, 0)
def ole(oledt):
return OLE_TIME_ZERO + dt.timedelta(days=float(oledt))
#variables initialization
buy_time = ole(43031.57567) - ole(43031.57567)
sell_time = ole(43031.57567) - ole(43031.57567)
profit_buy_counter = 0
no_profit_buy_counter = 0
profit_sell_counter = 0
no_profit_sell_counter = 0
max_profit_buy_positions = 0
max_profit_buy_counter = 0
max_profit_sell_positions = 0
max_profit_sell_counter = 0
df = pd.read_csv("C:/P/Crypto/bitcoin_test_normal_276k.csv")
#comparing to max
for index, row in df.iterrows():
a = index + 1
df_slice = df[a:]
if df_slice["Sell"].max() - row["Buy"] > 0:
max_profit_buy_positions += df_slice["Sell"].max() - row["Buy"]
max_profit_buy_counter += 1
for index1, row1 in df_slice.iterrows():
if row["Buy"] < row1["Sell"] :
buy_time += ole(row1["Date"])- ole(row["Date"])
profit_buy_counter += 1
break
else:
no_profit_buy_counter += 1
#comparing to sell
for index, row in df.iterrows():
a = index + 1
df_slice = df[a:]
if row["Sell"] - df_slice["Buy"].min() > 0:
max_profit_sell_positions += row["Sell"] - df_slice["Buy"].min()
max_profit_sell_counter += 1
for index2, row2 in df_slice.iterrows():
if row["Sell"] > row2["Buy"] :
sell_time += ole(row2["Date"])- ole(row["Date"])
profit_sell_counter += 1
break
else:
no_profit_sell_counter += 1
num_rows = len(df.index)
buy_avg_time = buy_time/num_rows
sell_avg_time = sell_time/num_rows
if max_profit_buy_counter == 0:
avg_max_profit_buy = "There is no profitable buy positions"
else:
avg_max_profit_buy = max_profit_buy_positions/max_profit_buy_counter
if max_profit_sell_counter == 0:
avg_max_profit_sell = "There is no profitable sell positions"
else:
avg_max_profit_sell = max_profit_sell_positions/max_profit_sell_counter
The code works fine for 10K-20K lines but for a larger amount (276K) it take a long time (more than 10 hrs)
What can I do in order to improve it?
Is there any "Pythonic" way to compare each value in a data frame to all following values?
note - the dates in the CSV are in OLE so I need to convert it to Datetime.
File for testing:
Thanks for your comment.
Here you can find the file that I used:
First, I'd want to create the cumulative maximum/minimum values for Sell and Buy per row, so it's easy to compare to. pandas has cummax and cummin, but they go the wrong way. So we'll do:
df['Max Sell'] = df[::-1]['Sell'].cummax()[::-1]
df['Min Buy'] = df[::-1]['Buy'].cummin()[::-1]
Now, we can just compare each row:
df['Buy Profit'] = df['Max Sell'] - df['Buy']
df['Sell Profit'] = df['Sell'] - df['Min Buy']
I'm positive this isn't exactly what you want as I don't perfectly understand what you're trying to do, but hopefully it leads you in the right direction.
After comparing your function and mine, there is a slight difference, as your a is offset one off the index. Removing that offset, you'll see that my method produces the same results as yours, only in vastly shorter time:
for index, row in df.iterrows():
a = index
df_slice = df[a:]
assert (df_slice["Sell"].max() - row["Buy"]) == df['Max Sell'][a] - df['Buy'][a]
else:
print("All assertions passed!")
Note this will still take the very long time required by your function. Note that this can be fixed with shift, but I don't want to run your function for long enough to figure out what way to shift it.

Compare two date by month & date Python

I have two columns of dates need to be compared, date1 is a list of certain dates, date2 is random date (dob). I need to compare month and day by some conditon to make a flag. sample like:
df_sample = DataFrame({'date1':('2015-01-15','2015-01-15','2015-03-15','2015-04-15','2015-05-15'),
'dob':('1999-01-25','1987-12-12','1965-03-02','2000-08-02','1992-05-15')}
I create a function based on condition below
def eligible(date1,dob):
if date1.month - dob.month==0 and date1.day <= dob.day:
return 'Y'
elif date1.month - dob.month==1 and date1.day > dob.day:
return 'Y'
else:
return 'N'
I want to apply this function to orginal df which has more than 5M rows, hence for loop is not efficiency, is there any way to achieve this?
Datatype is date, not datetime
I think you need numpy.where with conditions chained by | (or):
df_sample['date1'] = pd.to_datetime(df_sample['date1'])
df_sample['dob'] = pd.to_datetime(df_sample['dob'])
months_diff = df_sample.date1.dt.month - df_sample.dob.dt.month
days_date1 = df_sample.date1.dt.day
days_dob = df_sample.dob.dt.day
m1 = (months_diff==0) & (days_date1 <= days_dob)
m2 = (months_diff==1) & (days_date1 > days_dob)
df_sample['out'] = np.where(m1 | m2 ,'Y','N')
print (df_sample)
date1 dob out
0 2015-01-15 1999-01-25 Y
1 2015-01-15 1987-12-12 N
2 2015-03-15 1965-03-02 N
3 2015-04-15 2000-08-02 N
4 2015-05-15 1992-05-15 Y
Using datetime is certainly beneficial:
df_sample['dob'] = pd.to_datetime(df_sample['dob'])
df_sample['date1'] = pd.to_datetime(df_sample['date1'])
Once you have it, your formula can be literally applied to all rows:
df_sample['eligible'] =
( (df_sample.date1.dt.month == df_sample.dob.dt.month)\
& (df_sample.date1.dt.day <= df_sample.dob.dt.day)) |\
( (df_sample.date1.dt.month - df_sample.dob.dt.month == 1)\
& (df_sample.date1.dt.day > df_sample.dob.dt.day))
The result is boolean (True/False), but you can easily convert it to "Y"/"N", if you want.

pandas filtering consecutive rows

I got a Dataframe with a Matrix colum like this
11034-A
11034-B
1120-A
1121-A
112570-A
113-A
113.558
113.787-A
113.787-B
114-A
11691-A
11691-B
117-A RRS
12 X R
12-476-AT-A
12-476-AT-B
I'd like to filter only matrix that ends with A or B only when they are consecutive, so in the example above 11034-A and 11034-B, 113.787-A and 113.787-B, 11691-A and 11691-B, 12-476-AT-A and 12-476-AT-B
I wrote the function that will compare those 2 strings and return True or False, the problem is I fail to see how to apply / applymap to the consecutive rows:
def isAB(stringA, stringB):
if stringA.endswith('A') and stringB.endswith('B') and stringA[:-1] == stringB[:-1]:
return True
else:
return False
I tried df['result'] = isAB(df['Matrix'].str, df['Matrix'].shift().str) to no-avail
I seem to lack something in the way I designed this
edit :
I think this works, looks like I overcomplicated at 1st :
df['t'] = (df['Matrix'].str.endswith('A') & df['Matrix'].shift(-1).str.endswith('B')) | (df['Matrix'].str.endswith('B') & df['Matrix'].shift(1).str.endswith('A'))
df['p'] = (df['Matrix'].str[:-1] == df['Matrix'].shift(-1).str[:-1]) | (df['Matrix'].str[:-1] == df['Matrix'].shift(1).str[:-1])
df['e'] = df['p'] & df['t']
final = df[df['e']]
Here is how I would do it.
df['ShiftUp'] = df['matrix'].shift(-1)
df['ShiftDown'] = df['matrix'].shift()
def check_matrix(x):
if pd.isnull(x.ShiftUp) == False and x.matrix[:-1] == x.ShiftUp[:-1]:
return True
elif pd.isnull(x.ShiftDown) == False and x.matrix[:-1] == x.ShiftDown[:-1]:
return True
else:
return False
df['new'] = df.apply(check_matrix, axis=1)
df = df.drop(['ShiftUp', 'ShiftDown'], axis=1)
print df
prints
matrix new
0 11034-A True
1 11034-B True
2 1120-A False
3 1121-A False
4 112570-A False
5 113-A False
6 113.558 False
7 113.787-A True
8 113.787-B True
9 114-A False
10 11691-A True
11 11691-B True
12 117-A RRS False
13 12 X R False
14 12-476-AT-A True
15 12-476-AT-B True
Here's my solution, it requires a bit of work.
The strategy is the following: obtain a new column that has the same values as the current column but shifted one position.
Then, it's just a matter to check whether one column is A or B and the other one B or A.
Say your matrix colum is called "column_name".
Then:
myl = ['11034-A',
'11034-B',
'1120-A',
'1121-A',
'112570-A',
'113-A',
'113.558',
'113.787-A',
'113.787-B',
'114-A',
'11691-A',
'11691-B',
'117-A RRS',
'12 X R',
'12-476-AT-A',
'12-476-AT-B']
#toy data frame
mydf = pd.DataFrame.from_dict({'column_name':myl})
#get a new series which is the same one as the original
#but the first entry contains "nothing"
new_series = pd.Series( ['nothing'] +
mydf['column_name'][:-1].values.tolist() )
#add it to the original dataframe
mydf['new_col'] = new_series
You then define a simple function:
def do_i_want_this_row(x,y):
left_char = x[-1]
right_char = y[-1]
return ((left_char == 'A') & (right_char == 'B')) or ((left_char == 'B') & (right_char=='A'))
and voila:
print mydf[mydf.apply(lambda x: do_i_want_this_row( x.column_name, x.new_col), axis=1)]
column_name new_col
1 11034-B 11034-A
2 1120-A 11034-B
8 113.787-B 113.787-A
9 114-A 113.787-B
11 11691-B 11691-A
15 12-476-AT-B 12-476-AT-A
There is still the question of the last element, but I'm sure you can think of what to do with it if you decide to follow this strategy ;)
You can delete rows from a DataFrame using DataFrame.drop(labels, axis). To get a list of labels to delete, I would first get a list of pairs that match your criterion. With the labels from above in a list labels and your isAB function,
pairs = zip(labels[:-1], labels[1:])
delete_pairs = filter(isAB, pairs)
delete_labels = []
for a,b in delete_pairs:
delete_labels.append(a)
delete_labels.append(b)
Examinedelete_labels to make sure you've put it together correctly,
print(delete_labels)
And finally, delete the rows. With the DataFrame in question as x,
x.drop(delete_labels) # or x.drop(delete_labels, axis) if appropriate

python, operation on big pandas Dataframe

I have a pandas DataFrame named Joined with 5 fields:
product | price | percentil_25 | percentil_50 | percentile_75
for each row I want to class the price like this:
if the price is below percentil_25 I'm giving to this product the class 1, and so on
So what I did is:
classe_final = OrderedDict()
classe_final['sku'] = []
classe_final['class'] = []
for index in range(len(joined)):
classe_final['sku'].append(joined.values[index][0])
if(float(joined.values[index][1]) <= float(joined.values[index][2])):
classe_final['class'].append(1)
elif(float(joined.values[index][2]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][3])):
classe_final['class'].append(2)
elif(float(joined.values[index][3]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][4])):
classe_final['class'].append(3)
else:
classe_final['class'].append(4)
But as my DataFrame is quite big it's taking forever.
Do you have any idea how I could do this quicker?
# build an empty df
df = pd.DataFrame()
# get a list of the unique products, could skip this perhaps
df['Product'] = other_df['Sku'].unique()
2 ways, define a func and call apply
def class(x):
if x.price < x.percentil_25:
return 1
elif x.price >= x.percentil_25 and x.price < x.percentil_50:
return 2:
elif x.price >= x.percentil_50 and x.price < x.percentil_75:
return 2:
elif x.price >= x.percentil_75:
return 4
df['class'] = other_df.apply(lambda row: class(row'), axis=1)
another way which I think is better and will be much faster is we could add the 'class' column to your existing df and use loc and then just take a view of the 2 columns of interest:
joined.loc[joined['price'] < joined['percentil_25'], 'class'] =1
joined.loc[(joined['price'] >= joined['percentil_25']) & (joined['price'] < joined['percentil_50']), 'class'] =2
joined.loc[(joined['price'] >= joined['percentil_50']) & (joined['price'] < joined['percentil_75']), 'class'] =3
joined.loc[joined['price'] >= joined['percentil_75'], 'class'] =4
classe_final = joined[['cku', 'class']]
Just for kicks you could use a load of np.where conditions:
classe_final['class'] = np.where(joined['price'] > joined['percentil_75'], 4, np.where( joined['price'] > joined['percentil_50'], 3, np.where( joined['price'] > joined['percentil_25'], 2, 1 ) ) )
this evaluates whether the price is greater than percentil_75, if so then class 4 otherwise it evaluates another conditiona and so on, may be worth timing this compared to loc but it is a lot less readable
Another solution, if someone asked me to bet which one is the fastest I'd go for this:
joined.set_index("product").eval(
"1 * (price >= percentil_25)"
" + (price >= percentil_50)"
" + (price >= percentil_75)"
)

Categories