Pandas dataframe conditional change - python

I'm working with CSV time series data that shows a step count per time frame. Once the step count exceeds 65535, the counter rolls over and starts from 0 again. However, since the counter doesn't always hit exactly 65535 before resetting (it may go from 65530 straight to 5 if several steps were made within one time frame), I can't find a good way to handle it so that every value after a rollover is offset by 65536, and so on for further rollovers.
step realstep
65531 65531
65533 65533
65534 65534
2 65538
4 65540
I'm trying to compute the real step count so I can take differences between rows (e.g. steps per minute).

Find where the counter resets (where the diff is negative) and add the counter's range (65536, since you count from 0) to all rows beyond that point. This also handles multiple resets (I added some extra data to demonstrate):
df['real_step'] = df.step + df.step.diff(1).lt(0).cumsum()*65536
step real_step
0 65531 65531
1 65533 65533
2 65534 65534
3 2 65538
4 4 65540
5 65434 130970
6 2 131074
7 4 131076
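For reference, a minimal self-contained version of that one-liner, using the sample data from the question plus the extra reset rows from the answer:

import pandas as pd

df = pd.DataFrame({'step': [65531, 65533, 65534, 2, 4, 65434, 2, 4]})

# Each negative diff marks a counter rollover; the cumulative number of
# rollovers times 65536 is the offset to add back to the raw counter.
df['real_step'] = df['step'] + df['step'].diff(1).lt(0).cumsum() * 65536

print(df)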


Rolling mean and standard deviation without zeros

I have a data frame in which one of the columns represents how much corn was produced at each timestamp, for example:
timestamp corns_produced another_column
1 5 4
2 0 1
3 0 3
4 3 4
The dataframe is big, 100,000+ rows.
I want to calculate a moving average and std over windows of 1000 timestamps of corns_produced.
Luckily it is pretty easy using rolling:
my_df.rolling(1000).mean()
my_df.rolling(1000).std()
But the problem is that I want to ignore the zeros, meaning that if in the last 1000 timestamps there are only 5 instances in which corn was produced, I want to take the mean and std over those 5 elements only.
How do I ignore the zeros?
Just to clarify, I don't want to do x = my_df[my_df['corns_produced'] != 0] and then roll over x, because that drops the timestamps and doesn't give me the result I need.
You can use Rolling.apply:
print (my_df.rolling(1000).apply(lambda x: x[x != 0].mean()))
print (my_df.rolling(1000).apply(lambda x: x[x != 0].std()))
A faster solution: first set all zeros to np.nan, then take a rolling mean. If you are dealing with large data, it will be much faster.
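A sketch of that NaN-based variant, with my_df and corns_produced as in the question. Rolling aggregations skip NaN values, but min_periods has to be lowered, otherwise a window with fewer than 1000 non-NaN entries would return NaN:

import numpy as np

masked = my_df['corns_produced'].replace(0, np.nan)

# Zeros become NaN and are skipped by the rolling statistics;
# min_periods=1 keeps windows valid even when most entries are NaN.
rolling_mean = masked.rolling(1000, min_periods=1).mean()
rolling_std = masked.rolling(1000, min_periods=1).std()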

I need faster methods to optimize my loop

So, I'm a Python newbie looking for someone with an idea on how to optimize my code. I'm working with a spreadsheet with over 6,000 rows, and this portion of my code seems really inefficient.
for x in range(0, len(df)):
    if df.at[x, 'Streak_currency'] != str(df.at[x, 'Currency']):
        df.at[x, 'Martingale'] = df.at[x-1, 'Martingale'] + (df.at[x-1, 'Martingale'])/64
        x += 1
    if df.at[x, 'Streak_currency'] == str(df.at[x, 'Currency']):
        x += 1
It can take upwards of 8 minutes run.
With my limited knowledge, I only managed to change my df.loc to df.at, and it helped a lot. But I still need it to run faster.
UPDATE
In this section of the code, I'm trying to apply a function based on a previous value until a certain condition is met, in this case,
df.at[x,'Streak_currency'] != str(df.at[x,'Currency']):
I really don't know why this iteration is taking so long. In theory, it should only look at a previous value and apply the function. Here is a sample of the output:
Periodo Currency ... Agrupamento Martingale
0 1 GBPUSD 1 1.583720 <--- starts applying a function over and over.
1 1 GBPUSD 1 1.608466
2 1 GBPUSD 1 1.633598
3 1 GBPUSD 1 1.659123
4 1 GBPUSD 1 1.685047
5 1 GBPUSD 1 1.711376 <- stops applying, since Currency changed
6 1 EURCHF 2 1.256550
7 1 USDCAD 3 1.008720 <- starts applying again until currency changes
8 1 USDCAD 3 1.024481
9 1 USDCAD 3 1.040489
10 1 GBPAUD 4 1.603080
Pandas lookups like df.at[x,'Streak_currency'] are not efficient. For each evaluation of this kind of expression (multiple times per loop iteration), pandas fetches the column by its name and then fetches the value from it.
You can avoid this cost by storing the columns in variables before the loop. Additionally, you can put each column in a numpy array so the values can be fetched more efficiently (assuming all the values have the same type).
Finally, using string conversions and string comparisons on integers is not efficient. They can be avoided here (assuming the integers are not unreasonably big).
Here is an example:
import numpy as np

streakCurrency = np.array(df['Streak_currency'], dtype=np.int64)
currency = np.array(df['Currency'], dtype=np.int64)
martingale = np.array(df['Martingale'], dtype=np.float64)

for x in range(len(df)):
    if streakCurrency[x] != currency[x]:
        martingale[x] = martingale[x-1] * (65./64.)
        x += 1
    if streakCurrency[x] == currency[x]:
        x += 1

# Update the pandas dataframe
df['Martingale'] = martingale
This should be at least an order of magnitude faster.
Please note that the second condition is useless, since the compared values cannot be equal and different at the same time (this may be a bug in your code)...

separating data from a txt file using pandas

I have data in a txt file and need to separate the data. Apologies, but I am really finding this hard (and maybe hard to explain as well). Below are the top few lines of the txt file (there are 1000 lines). I need all the data between the first * in row 0 and the last *, which is in row 700. I don't want to select by row number, since the numbers can change; I want code that will select the data between the *.
Secondly, the data is NOT separated into columns; it is one big row. I want a second piece of code which can separate the data into columns, i.e. Latter REPORT, Calculation Date, Index Code are columns (I can't split on spaces because that splits Calculation and Date into separate columns when they should be one column). Please can someone help me, and thank you!
0
0 *
1 #124 Latter REPORT D51D ...
2 # 1 Calculation Date calc_da...
3 # 2 Index Code modes2_in...
4 # 3 Index Name index_n...
120 #120 5 Years ADPS Growth Rate 5_years...
121 #121 1 Year ADPS Growth Rate 1_year_...
122 #122 Payout Ratio payout_...
123 #123 Reserved 26 reserve...
124 #124 Reserved 27 reserve...
125 *
Assuming the dataframe is called dat, for the first part to find the asterisks:
asterisk_location = dat[0] == '*'
asterisk_location = asterisk_location[asterisk_location]
start, finish = asterisk_location.index
dat = dat.iloc[start+1:finish]
This also assumes you want to get the region between the first two asterisks. If there's more, you'll have to adjust a bit.
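A minimal sketch putting the above together, under the assumption that each line of the txt file is read into column 0 of the frame (the file name report.txt is made up for illustration):

import pandas as pd

# Hypothetical file name; every line of the file becomes one row in column 0.
with open("report.txt") as f:
    dat = pd.DataFrame({0: [line.strip() for line in f]})

# Keep only the rows strictly between the two asterisk lines.
asterisk_location = dat[0] == '*'
asterisk_location = asterisk_location[asterisk_location]
start, finish = asterisk_location.index
dat = dat.iloc[start+1:finish]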

Date manipulation periods

I have this problem for work. So I have this dataset as follows:
Client Date Transaction Num
A 7/20/2017 1
A 7/26/2017 1
A 7/31/2017 1
A 8/23/2017 2
A 8/31/2017 2
A 9/11/2017 2
A 9/19/2017 3
A 9/27/2017 3
A 10/4/2017 3
B 6/1/2017 1
B 6/29/2017 1
B 7/6/2017 2
B 8/27/2017 3
B 9/28/2017 4
B 10/16/2017 4
B 11/30/2017 5
What I need to do is generate the transaction num based on the date for each client as follows:
For the starting date (for client A, that is 7/20/17), I need to assign a starting transaction number = 1. Then for every 30 days from this starting date, the transaction number increments by one. So 30 days from 7/20/17 is 8/19/17, and all dates falling within this range get transaction num = 1; once they exceed it, the transaction number increments by one for every further 30 days from the starting date. This pattern goes on: 30 days from 8/19/17 is 9/18/17, so dates within that range get transaction num = 2, dates after 9/18/17 get transaction num = 3, and so on.
I need to do this for a large excel. Any help would be appreciated. If it easier in python, please let me know as well.
Thanks,
Sammy
Interesting question, with possibly multiple solutions, but I came up with the one below:
So in C1 enter this formula:
=FLOOR(ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30,1)+1
Confirm with CTRL+SHIFT+ENTER, and drag your formula down.
Note: Sorry for the difference in the layout of dates, I have to deal with the Dutch version of Excel :)
EDIT: Explanation
Step 1 - Get minimum date corresponding to Cell A1:
=MIN(IF($A$1:$A$17=A1,$B$1:$B$17))
Step 2 - Get the difference between cell B1 and that minimum and round it off. It doesn't matter if it's one or 0 decimals:
=ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)
Step 3 - Divide the difference by 30 days:
=ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30
Step 4 - Round this outcome down with the FLOOR function to the closest multiple you want to round to; in this case that is 1.
=FLOOR(ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30,1)
Step 5 - Now we just need to add 1 to this outcome so the numbering doesn't start at 0:
=FLOOR(ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30,1)+1
Confirm all through CTRL+SHIFT+ENTER
If the dates are in order, you could just do a VLOOKUP to get the first one and subtract, but @JvdV's answer is more general:
=INT((B2-VLOOKUP(A2,A:B,2,FALSE))/30)+1
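Since the question mentions Python as an option, here is a hedged pandas sketch of the same logic, using the Client and Date column names from the sample (the file name transactions.xlsx is made up for illustration):

import pandas as pd

df = pd.read_excel("transactions.xlsx")   # hypothetical file name
df['Date'] = pd.to_datetime(df['Date'])

# Days elapsed since each client's first date, bucketed into 30-day periods.
first_date = df.groupby('Client')['Date'].transform('min')
df['Transaction Num'] = (df['Date'] - first_date).dt.days // 30 + 1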

Selecting, slicing, and aggregating temporal data with Pandas

I'm trying to handle temporal data with pandas and I'm having a hard time...
Here is a sample of the DataFrame :
index ip app dev os channel click_time
0 29540 3 1 42 489 2017-11-08 03:57:46
1 26777 11 1 25 319 2017-11-09 11:02:14
2 140926 12 1 13 140 2017-11-07 04:36:14
3 69375 2 1 19 377 2017-11-09 13:17:20
4 119166 9 2 15 445 2017-11-07 12:11:37
This is a click prediction problem, so I want to create a time window aggregating the past behaviour of a specific ip (for a given ip, how many clicks in the last 4 hours, or 8 hours?).
I tried creating one new column, which was simply:
df['minus_8']=df['click_time']-timedelta(hours=8)
I wanted to use this so that for each row I have a specific 8-hour window over which to aggregate my data.
I have also tried resampling with little success; my understanding of the function isn't optimal, let's say.
Can anyone help?
If you just need to select a particular 8 hours, you can do as follows:
import datetime

start_time = datetime.datetime(2017, 11, 9, 11, 2, 14)
df[(df['click_time'] >= start_time)
   & (df['click_time'] <= start_time + datetime.timedelta(hours=8))]
Otherwise I really think you need to look more at resample. Mind you, if you want resample to have your data divided into 8 hour chunks that are always consistent (e.g. from 00:00-08:00, 08:00-16:00, 16:00-00:00), then you will probably want to crop your data to a certain start time.
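For illustration, a minimal resample sketch, assuming click_time is already a datetime column; this counts clicks in fixed 8-hour bins rather than per-row trailing windows:

# Fixed 8-hour bins anchored on the clock (00:00-08:00, 08:00-16:00, ...).
clicks_per_8h = df.set_index('click_time').resample('8H').size()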
Using parts of the solution given by Martin, I was able to create this function that outputs what I wanted :
def window_filter_clicks(df, h):
    df['nb_clicks_{}h'.format(h)] = 0
    ip_array = df.ip.unique()
    for ip in ip_array:
        df_ip = df[df['ip'] == ip]
        for row, i in zip(df_ip['click_time'], df_ip['click_time'].index):
            df_window = df_ip[(df_ip['click_time'] >= row - timedelta(hours=h)) & (df_ip['click_time'] <= row)]
            nb_clicks_4h = len(df_window)
            df['nb_clicks_{}h'.format(h)].iloc[i] = nb_clicks_4h
    return df
h allows me to select the size of the window on which to iterate.
Now this works fine, but it is very slow and I am working with a lot of rows.
Does anyone know how to improve the speed of such a function? (Or is there anything similar built-in?)
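One way to avoid the Python-level loops is pandas' time-based rolling combined with groupby. A sketch, assuming click_time is a datetime column and each (ip, click_time) pair is unique (otherwise the merge at the end will duplicate rows):

import pandas as pd

def window_count_clicks(df, h):
    # Time-based rolling requires a monotonic DatetimeIndex within each group.
    out = df.sort_values(['ip', 'click_time'])
    rolled = (out.set_index('click_time')
                 .groupby('ip')['app']        # any always-present column works for counting
                 .rolling('{}h'.format(h))
                 .count()
                 .reset_index(name='nb_clicks_{}h'.format(h)))
    return out.merge(rolled, on=['ip', 'click_time'], how='left')

For example, window_count_clicks(df, 8) adds an nb_clicks_8h column counting, for each row, the clicks from the same ip in the preceding 8 hours (inclusive of the current click, like the loop above).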
