Pandas faster implementation of date calculation on every row - python

I have a dataframe with multiple columns, including analysis_date (datetime), and forecast_hour (int). I want to add a new column called total_hours, which is the sum of the hour component of analysis_date plus the corresponding forecast_hour in that row. Here's a visual example:
original dataframe:
analysis_date | forecast_hour
12-2-19-05 | 3
12-2-19-06 | 3
12-2-19-07 | 3
12-2-19-08 | 3
dataframe after calculation:
analysis_date | forecast_hour | total_hours
12-2-19-05 | 3 | 8
12-2-19-06 | 3 | 9
12-2-19-07 | 3 | 10
12-2-19-08 | 3 | 11
Here is the current logic that does what I want:
df['total_hours'] = df.apply(lambda row: row.analysis_date.hour + row.forecast_hour, axis=1)
Unfortunately, this is too slow for my application: it takes around 15 seconds for a dataframe with a few hundred thousand rows. I have tried the swifter library, but it took about as long as (if not longer than) my current implementation.

apply with axis=1 is slow because it loops over the rows in Python instead of using vectorized operations. This should do what you want (assuming df['analysis_date'] is a datetime64 column):
df['total_hours'] = df['analysis_date'].dt.hour + df['forecast_hour']
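A minimal, self-contained sketch of the same idea on synthetic data; if analysis_date is still stored as strings, pd.to_datetime converts it first (the example values below are made up to mirror the question's table):
import pandas as pd
# hypothetical data mirroring the question's example (dates on 2019-12-02, hours 05-08)
df = pd.DataFrame({
    'analysis_date': pd.to_datetime(['2019-12-02 05:00', '2019-12-02 06:00',
                                     '2019-12-02 07:00', '2019-12-02 08:00']),
    'forecast_hour': [3, 3, 3, 3],
})
# vectorized: .dt.hour extracts the hour for the whole column at once
df['total_hours'] = df['analysis_date'].dt.hour + df['forecast_hour']
print(df['total_hours'].tolist())  # [8, 9, 10, 11]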

Related

How can I copy values from one dataframe column to another based on the difference between the values

I have two csv mirror files generated by two different servers. Both files have the same number of lines and should have the exact same unix timestamp column. However, due to some clock issues, some records in one file might have a small difference of a nanosecond from their counterpart records in the other csv file; see the example below, where the difference is always 1:
dataframe_A:
|   | ts_ns |
| - | ------------------- |
| 1 | 1661773636777407794 |
| 2 | 1661773636786474677 |
| 3 | 1661773636787956823 |
| 4 | 1661773636794333099 |
dataframe_B:
|   | ts_ns |
| - | ------------------- |
| 1 | 1661773636777407793 |
| 2 | 1661773636786474677 |
| 3 | 1661773636787956823 |
| 4 | 1661773636794333100 |
Since these are huge files with millions of lines, I use pandas and dask to process them, but before I process them, I need to ensure they have the same timestamp column.
I need to check the difference between column ts_ns in A and B, and if the difference is 1 or -1, replace the value in B with the corresponding ts_ns value from A, so that both files end up with the same ts_ns value for corresponding records.
How can I do this in a decent way using pandas/dask?
If you're sure that the timestamps should be identical, why don't you simply use the timestamp column from dataframe A and overwrite the timestamp column in dataframe B with it?
Why even check whether the difference is there or not?
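A minimal sketch of that suggestion, assuming the two frames have the same length and the rows line up by position:
# copy A's timestamps straight into B, position by position
df_b['ts_ns'] = df_a['ts_ns'].to_numpy()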
You can use the pandas merge_asof function for this, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html . Its tolerance parameter accepts an int or a timedelta, which you would set to 1 for your example, with direction set to 'nearest'.
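A sketch of that approach (untested against the real files; it assumes both frames are already sorted by ts_ns and that every row of B has a counterpart in A within 1 ns, so the merged column stays integer):
import pandas as pd

df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677,
                               1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677,
                               1661773636787956823, 1661773636794333100]})

# match each B timestamp to the nearest A timestamp no more than 1 ns away
merged = pd.merge_asof(
    df_b, df_a.rename(columns={'ts_ns': 'ts_ns_a'}),
    left_on='ts_ns', right_on='ts_ns_a',
    direction='nearest', tolerance=1,
)

# take A's value for every matched row
# (caveat: if any row had no match, ts_ns_a would become float64 and nanosecond
#  precision would be lost -- hence the assumption above)
df_b['ts_ns'] = merged['ts_ns_a'].astype('int64').to_numpy()
print(df_b)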
Assuming your files are identical except for your ts_ns column, you can perform a .merge on indices.
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677, 1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677, 1661773636787956823, 1661773636794333100]})

df_b = (df_b
    .merge(df_a, how='left', left_index=True, right_index=True, suffixes=('', '_a'))
    .assign(
        ts_ns=lambda df_: np.where(abs(df_.ts_ns - df_.ts_ns_a) <= 1, df_.ts_ns_a, df_.ts_ns)
    )
    .loc[:, ['ts_ns']]
)
But I agree with @ManEngel, just overwrite all the values if you know they are identical.

Display how long it has been since something doubled

I read an interesting stat that, since last year, the stock market has gone up 100% (i.e., doubled) in the shortest time on record, and I am looking to test/replicate this claim.
The data below is from FRED (the Federal Reserve data depository) and it's for the WILL5000 index, which goes back to 1970; the S&P 500 series there only goes back to 2011.
| DATE | WILL5000 | 50% |
| ------------------- | -------- | ------- |
| 1970-12-31 00:00:00 | 1 | 0.5 |
| 1971-01-01 00:00:00 | nan | nan |
| 1971-01-04 00:00:00 | nan | nan |
| 1971-01-05 00:00:00 | nan | nan |
| 1971-01-06 00:00:00 | nan | nan |
| ... | ... | ... |
| 2021-07-21 00:00:00 | 216.54 | 108.27 |
| 2021-07-22 00:00:00 | 216.68 | 108.34 |
| 2021-07-23 00:00:00 | 218.84 | 109.42 |
| 2021-07-26 00:00:00 | 219.32 | 109.66 |
| 2021-07-27 00:00:00 | 218.07 | 109.035 |
One way I thought of was to add a column with half the value of the WILL5000 index, then use code to search for a value below this level (which would mark a 100% move) and record how many days it has been since.
I cannot seem to find how to do this anywhere - and would love to hear of any other ways to achieve it.
This problem has O(n^2) steps for n points in your series.
For the i-th point in your sequence, you need to check whether w_j >= 2*w_i for each j > i, stopping at the first j (if any) that satisfies the condition. In other words, hold one date fixed as a baseline and then look for the doubling condition across all future dates; do this for all possible baseline dates.
In Pandas, this means you'll have to (i) cross-merge the dataframe with itself and filter it to the "upper triangular" (i.e., j > i) portion, (ii) find the first time to double for each group on i.
Here's Python+Pandas code that will get the job done:
import numpy as np
import pandas as pd
# load your data --> construct synthetic df for this example
np.random.seed(52)
date_axis = pd.date_range('1970-01-01', '2021-01-01', freq='M')
n = len(date_axis)
raw_df = pd.DataFrame(data={'date': date_axis, 'ticker_value': 300.0 * np.random.rand(n)})
# create n^2 df
df = pd.merge(raw_df, raw_df, how='cross').sort_values(by=['date_x', 'date_y'])
# restrict to upper triangle
df = df.loc[df.date_y > df.date_x, :]
# add a column to check if doubling condition is met
df['is_at_least_double'] = (df.ticker_value_y >= 2.0 * df.ticker_value_x)
# throw away values that don't meet the condition
df = df.loc[df.is_at_least_double, :].drop(columns=['is_at_least_double'])
# pick up the first value that satisfies the condition -- this is why we did the sort
df = df.groupby('date_x').first().reset_index()
# find intervals
df['interval'] = df.date_y - df.date_x
# find the smallest interval; tie-breaker is the one with the earliest base date
df.sort_values(by=['interval', 'date_x'], inplace=True)
solution = df.iloc[0]
print(solution)
The comments explain the steps in the code. I recommend running it line-by-line in the console and inspecting the intermediate results to understand what's going on.

How would i iterate through a list of lists and perform computational filtering?

This one is a bit of a doozy.
At a high level, I'm trying to figure out how to run a nested for loop. I'm essentially trying to iterate through columns and rows and perform a computational check to make sure the outcome matches a specified requirement: if it does, the loop continues to the next row; if not, the pair is kicked out and the loop moves on to the next user.
Specifically, I want to perform a T-Test between a control/treatment group of users, and make sure the result is less than a pre-determined value.
Example:
I have my table of values, "DF", with 7 columns. The user_id column is the user's unique identifier. The user_type column is a binary classifier; users are either type T (treatment) or C (control). The 3 "hour" columns are dummy number columns whose values I'll perform computation on. The mon column is the month, and tval is the number that the computation result has to be less than to be accepted.
In this case, the month is all January data. Each month can have a different tval.
DF
| user_id | user_type | hour1 | hour2 | hour3 | mon | tval |
|---------|-----------|-------|-------|-------|-----|------|
| 4 | T | 1 | 10 | 100 | 1 | 2.08 |
| 5 | C | 2 | 20 | 200 | 1 | 2.08 |
| 6 | C | 3 | 30 | 300 | 1 | 2.08 |
| 7 | T | 4 | 40 | 400 | 1 | 2.08 |
| 8 | T | 5 | 50 | 500 | 1 | 2.08 |
My goal is to iterate through each T user - and for each, loop through each C user. For each "Pair", I want to perform computation (t-test) between their hour 1 values. If the value is less than the tval, move to hour2 values, etc. If not, it gets kicked out and the loop moves to the next C user without completing that C user's loop. If it passes all value checks, the user_ids of each would be appended to a list or something external.
The output would hopefully be a table of pairs: the T user and C user that successfully passed all hour columns, plus the month that passed (each set of users has data for all 12 months).
Output:
| t_userid | c_userid | month |
|--------- |-----------|-------|
| 4 | 5 | 1 |
| 8 | 6 | 1 |
To sum it all up:
for each T user:
    for each C user:
        if the t-test on t.hour1 and c.hour1 is less than X (passing the test):
            move to the next hour (hour2) and repeat
            if all hours pass, add the pair (t_user_id, c_user_id) to a separate list/series/df, etc.
        else:
            skip the remaining hours and move to the next C user
I'm wondering if my data format is also incorrect. Would this be easier if I unpivoted my hourly data and iterated over each row? Any help is greatly appreciated. Thanks, and let me know if any clarification is necessary.
EDIT:
So far I've split the data between Treat and Control groups, and calculated the average and standard deviation of each user's monthly data (which is normally broken down by day) and added them as columns, hour1_avg and hour1_stdev. I've attempted another for loop, but am getting a ValueError.
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I know this is because I can't compare a pandas Series to a float, int, str, etc. I will make another post addressing this question.
Here's what I have so far:
for i in treatment.user_id:
    for j in control.user_id:
        if np.absolute((treatment['hour1_avg'] - control['hour1_avg']) / np.sqrt((treatment['hour1_stdev']**2 / 31) + (control['hour1_stdev']**2 / 31))) > treatment.tval:
            "End loop, move to next control user"
        else:
            "copy paste the if statement above, but for hour2, etc etc"
1. Split the dataframe into control and treatment groups.
2. Make a join of the resulting dataframes on a constant field (this will create all pairs).
3. Use a combination of apply and any to make the decision.
4. Filter the join using the decision vector.
Code to illustrate the idea:
# assuming the input is in df
control = df[df['user_type'] == 'C']
treatment = df[df['user_type'] == 'T']

# part 2: pairs are created month-wise.
# If you want all vs all instead, create a temp field, e.g. control['_temp'] = 1, and merge on it
pairs = treatment.merge(control, on='mon', suffixes=('_x', '_y'))

# part 3
def test(row):
    # all() stops evaluating at the first False
    return all(
        row['hour%d_x' % i] - row['hour%d_y' % i] < row['tval_x']
        for i in range(1, 4)
    )

# all_less is a series of bool
all_less = pairs.apply(test, axis=1)

# part 4
output = pairs.loc[all_less, ['user_id_x', 'user_id_y', 'mon']].rename(
    columns={'user_id_x': 't_user_id', 'user_id_y': 'c_user_id'})
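If you want the actual t-statistic from the question's EDIT rather than a plain difference, here is a sketch of an alternative test function. The hourN_avg_x / hourN_stdev_x (treatment) and hourN_avg_y / hourN_stdev_y (control) column names are assumptions based on the asker's hour1_avg and hour1_stdev columns surviving the merge, and 31 daily observations per month are assumed:
import numpy as np

def t_test(row, n_days=31):
    # per-hour statistic following the formula in the question's EDIT;
    # the column names here are assumptions, not guaranteed to match your frame
    for i in (1, 2, 3):
        t_stat = abs(row['hour%d_avg_x' % i] - row['hour%d_avg_y' % i]) / np.sqrt(
            row['hour%d_stdev_x' % i] ** 2 / n_days + row['hour%d_stdev_y' % i] ** 2 / n_days
        )
        if t_stat >= row['tval_x']:
            return False  # this hour fails, so reject the pair
    return True  # every hour passed

# usage (hypothetical): all_pass = pairs.apply(t_test, axis=1)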

How do i create a heatmap from two columns plus the value of those two in python

Hello, and thank you for helping!
I would like to generate a heatmap in python, from the data df.
(i am using pandas, seaborn, numpy, and matplotlib in my project)
The dataframe df looks like:
| index | a | b | c | year | month |
|-------|---|---|---|------|-------|
| 0 | | | | 2013 | 1 |
| 1 | | | | 2015 | 4 |
| 2 | | | | 2016 | 10 |
| 3 | | | | 2017 | 1 |
In the dataset, each row is a ticket.
The dataset is big (51 columns and 100k+ rows), so a, b, and c are just stand-ins for some of the columns. (For month, 1 = Jan, 2 = Feb, ...)
For the heatmap:
x-axis = year,
y-axis = month,
value = the count of rows, i.e. the number of tickets issued in that year and month.
The result I imagine should look something like this example from the seaborn documentation:
https://seaborn.pydata.org/_images/seaborn-heatmap-4.png
I am new to coding and have tried a lot of random things I found on the internet, but have not been able to make it work.
Thank you for helping!
This should do it (with generated data):
import pandas as pd
import seaborn as sns
import random
# generate some example year/month pairs
y = [random.randint(2013, 2017) for n in range(2000)]
m = [random.randint(1, 12) for n in range(2000)]
df = pd.DataFrame([y, m]).T
df.columns = ['y', 'm']
# count rows per (year, month) combination
df['count'] = 1
df2 = df.groupby(['y', 'm'], as_index=False).count()
# pivot into a month-by-year grid and plot it
df_p = pd.pivot_table(df2, values='count', index='m', columns='y')
sns.heatmap(df_p)
You probably won't need the column count but I added it because I needed an extra column for the groupby to work.
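With the question's own year and month columns, a shorter route is pd.crosstab, which counts and pivots in one step. This is only a sketch and assumes df exists with columns literally named year and month, as described:
import pandas as pd
import seaborn as sns
# count tickets per (month, year) cell and draw the heatmap
counts = pd.crosstab(df['month'], df['year'])
sns.heatmap(counts, annot=True, fmt='d')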

Apply function with string and integer from multiple columns not working

I want to create a combined string based on two columns, one is an integer and the other is a string. I need to combine them to create a string.
I've already tried using the solution from this answer here (Apply function to create string with multiple columns as argument), but it doesn't give the required output.
I have two columns: prod_no which is an integer and PROD which is a string. So something like
| prod_no | PROD | out |
|---------|-------|---------------|
| 1 | PRODA | #Item=1=PRODA |
| 2 | PRODB | #Item=2=PRODB |
| 3 | PRODC | #Item=3=PRODC |
To get the last column, I used the following code:
prod_list['out'] = prod_list.apply(lambda x: "#ITEM={}=={}"
.format(prod_list.prod_no.astype(str), prod_list.PROD), axis=1)
I'm trying to produce the column "out", but the result of that code is weird. The output is #Item=0 1 22 3... very odd. I'm specifically trying to implement this using apply and lambda. However, I am biased toward efficient implementations, since I am trying to learn how to write optimized code. Please help :)
This works. Your original lambda formatted the entire prod_no and PROD Series on every row instead of that row's values; referencing x["prod_no"] and x["PROD"] inside the lambda fixes it.
import pandas as pd
df= pd.DataFrame({"prod_no": [1,2,3], "PROD": [ "PRODA", "PRODB", "PRODC" ]})
df["out"] = df.apply(lambda x: "#ITEM={}=={}".format(x["prod_no"], x["PROD"]), axis=1)
print(df)
Output:
PROD prod_no out
0 PRODA 1 #ITEM=1==PRODA
1 PRODB 2 #ITEM=2==PRODB
2 PRODC 3 #ITEM=3==PRODC
You can also try it with zip:
df=df.assign(out=['#ITEM={}=={}'.format(a,b) for a,b in zip(df.prod_no,df.PROD)])
#or directly : df.assign(out='#Item='+df.prod_no.astype(str)+'=='+df.PROD)
prod_no PROD out
0 1 PRODA #ITEM=1==PRODA
1 2 PRODB #ITEM=2==PRODB
2 3 PRODC #ITEM=3==PRODC
