Date manipulation periods - python

I have this problem for work. So I have this dataset as follows:
Client Date Transaction Num
A 7/20/2017 1
A 7/26/2017 1
A 7/31/2017 1
A 8/23/2017 2
A 8/31/2017 2
A 9/11/2017 2
A 9/19/2017 3
A 9/27/2017 3
A 10/4/2017 3
B 6/1/2017 1
B 6/29/2017 1
B 7/6/2017 2
B 8/27/2017 3
B 9/28/2017 4
B 10/16/2017 4
B 11/30/2017 5
What I need to do is generate the transaction number from the date for each client, as follows:
For the starting date (for client A, it is 7/20/17), I need to assign a starting transaction number of 1. Then, for every 30 days from this starting date, I need to increment the transaction number by one. 30 days from 7/20/17 is 8/19/17, so all dates falling within this range get transaction num = 1; dates beyond it get the next number, which keeps incrementing for every further 30 days from the starting date. So 30 days from 8/19/17 is 9/18/17, dates within this range get transaction num = 2, dates after 9/18/17 get transaction num = 3, and so on.
I need to do this for a large Excel file. Any help would be appreciated. If it is easier in Python, please let me know as well.
Thanks,
Sammy

Interesting question, possibly with multiple solutions, but I came up with the one below:
So in C1 enter this formula:
=FLOOR(ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30,1)+1
Confirm with CTRL+SHIFT+ENTER, and drag your formula down.
Note: Sorry for the difference in layout of dates, I have to deal with Dutch version of Excel :)
EDIT: Explanation
Step 1 - Get the minimum date for the client in cell A1:
=MIN(IF($A$1:$A$17=A1,$B$1:$B$17))
Step 2 - Get the difference between cell B1 and that minimum, and round it. It doesn't matter whether you round to one or zero decimals:
=ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)
Step 3 - Divide the difference by 30 days:
=ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30
Step 4 - Round this outcome down with the FLOOR function to the nearest multiple you want. In this case the multiple is 1.
=FLOOR(ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30,1)
Step 5 - Now we just need to add 1 to this outcome so that the numbering does not start at 0:
=FLOOR(ROUND(B1-MIN(IF($A$1:$A$17=A1,$B$1:$B$17)),1)/30,1)+1
Confirm each of these with CTRL+SHIFT+ENTER, since they are array formulas.

If the dates are in order, you could just do a VLOOKUP to get the first one and subtract, but @JvdV's answer is more general:
=INT((B2-VLOOKUP(A2,A:B,2,FALSE))/30)+1
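Since the question also asks about Python, here is a minimal pandas sketch of the same rule the formulas above implement: take each client's earliest date, count the whole days since then, and integer-divide by 30. The column names follow the sample table; the variable names and the explicit date format are assumptions for illustration.

```python
import pandas as pd

# Client A from the question, plus the first rows of client B.
df = pd.DataFrame({
    "Client": ["A"] * 9 + ["B"] * 3,
    "Date": ["7/20/2017", "7/26/2017", "7/31/2017", "8/23/2017", "8/31/2017",
             "9/11/2017", "9/19/2017", "9/27/2017", "10/4/2017",
             "6/1/2017", "6/29/2017", "7/6/2017"],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Earliest date per client, broadcast back to every row of that client.
first_date = df.groupby("Client")["Date"].transform("min")

# Whole days since the client's first date, in 30-day buckets, numbered from 1.
df["Transaction Num"] = (df["Date"] - first_date).dt.days // 30 + 1

print(df)
```

The rows do not need to be sorted, because the minimum is computed per client. This follows the rule as written, with fixed 30-day periods counted from the first date, and it reproduces the sample numbers for client A. For client B's later rows the written rule and the sample table disagree (10/16/2017 is 137 days after 6/1/2017, which falls in period 5 rather than 4), so it is worth checking which behaviour is actually needed.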


Printing the whole row of my data from a max value in a column

I am trying to select the highest value from this data, but I also need the month it comes from, i.e. I want to print the whole row. Currently I'm using df.max(), which just pulls the highest value. Does anyone know how to do this in pandas?
#current code
accidents["month"] = accidents.Date.apply(lambda s: int(s.split("/")[1]))
temp = accidents.groupby('month').size().rename('Accidents')
#selecting the highest value from the dataframe
temp.max()
answer given = 10937
answer i need should look like this (month and no of accidents): 11 10937
temp dataframe;
month
1 9371
2 8838
3 9427
4 8899
5 9758
6 9942
7 10325
8 9534
9 10222
10 10311
11 10937
12 9972
Name: Accidents, dtype: int64
It would also be good to rename the accidents column to accidents, if anyone can help with that too. Thanks
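If the value is unique (in your case it is), you can simply take a subset with a boolean mask. Note that temp is a Series (the result of groupby(...).size()), so it has no second column to select with iloc; compare the Series to its own maximum instead:
temp[temp == temp.max()]
The comparison marks every row as True or False, and indexing with that mask keeps only the row holding the maximum, together with its month label.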
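As an alternative sketch, idxmax returns the index label of the largest value, which here is the month. The data below is copied from the question; the names top_month, top_row, as_frame and top_frame_row are only illustrative.

```python
import pandas as pd

# The monthly counts from the question, as a Series indexed by month.
temp = pd.Series(
    [9371, 8838, 9427, 8899, 9758, 9942, 10325, 9534, 10222, 10311, 10937, 9972],
    index=pd.Index(range(1, 13), name="month"),
    name="Accidents",
)

# Index label (month) of the largest value.
top_month = temp.idxmax()

# Passing a list of labels keeps the result as a one-row Series,
# so the month is printed next to the count.
top_row = temp.loc[[top_month]]
print(top_row)

# Renaming the values column and turning the Series into a DataFrame.
as_frame = temp.rename("accidents").reset_index()
top_frame_row = as_frame.loc[[as_frame["accidents"].idxmax()]]
print(top_frame_row)
```

If several months tie for the maximum, idxmax returns only the first one, whereas the boolean mask keeps all of them.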

I need faster methods to optimize my loop

So, I'm a Python newbie looking for someone with an idea on how to optimize my code. I'm working with a spreadsheet with over 6000 rows, and this portion of my code seems really inefficient.
for x in range(0,len(df)):
    if df.at[x,'Streak_currency'] != str(df.at[x,'Currency']):
        df.at[x, 'Martingale'] = df.at[x-1, 'Martingale'] + (df.at[x-1, 'Martingale'] )/64
        x+=1
    if df.at[x,'Streak_currency'] == str(df.at[x,'Currency']):
        x+=1
It can take upwards of 8 minutes to run.
With my limited knowledge, I only manage to change my df.loc for df.at, and it helped a lot. But I st
UPDATE
In this section of the code, I'm trying to apply a function based on a previous value until a certain condition is met, in this case,
df.at[x,'Streak_currency'] != str(df.at[x,'Currency']):
I really don't know why this iteration is taking so long. In theory, it should only look at a previous value and apply the function. Here is a sample of the output:
Periodo Currency ... Agrupamento Martingale
0 1 GBPUSD 1 1.583720 <--- starts applying a function over and over.
1 1 GBPUSD 1 1.608466
2 1 GBPUSD 1 1.633598
3 1 GBPUSD 1 1.659123
4 1 GBPUSD 1 1.685047
5 1 GBPUSD 1 1.711376 <- stops applying, since Currency changed
6 1 EURCHF 2 1.256550
7 1 USDCAD 3 1.008720 <- starts applying again until currency changes
8 1 USDCAD 3 1.024481
9 1 USDCAD 3 1.040489
10 1 GBPAUD 4 1.603080
Pandas lookups like df.at[x,'Streak_currency'] are not efficient. For each evaluation of this kind of expression (several times per loop iteration), pandas first looks up the column by its name and then fetches the value from it.
You can avoid this cost by storing the columns in variables before the loop. Additionally, you can put each column in a numpy array so that values can be fetched more efficiently (assuming all the values have the same type).
Finally, string conversions and string comparisons on integers are not efficient. They can be avoided here (assuming the integers are not unreasonably big).
Here is an example:
import numpy as np
streakCurrency = np.array(df['Streak_currency'], dtype=np.int64)
currency = np.array(df['Currency'], dtype=np.int64)
martingale = np.array(df['Martingale'], dtype=np.float64)
for x in range(len(df)):
    if streakCurrency[x] != currency[x]:
        martingale[x] = martingale[x-1] * (65./64.)
        x+=1
    if streakCurrency[x] == currency[x]:
        x+=1
# Update the pandas dataframe
df['Martingale'] = martingale
This should be at least an order of magnitude faster.
Please note that the second condition is useless, since the compared values cannot be equal and different at the same time (this may be a bug in your code)...
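The loop can also be removed entirely, because within a run each value is just the run's starting value multiplied by 65/64 once per step. The following is only a sketch under stated assumptions: compound_runs, base and apply_mask are hypothetical names, apply_mask is a boolean array that is True on the rows where the update should be applied, and the first row is never updated.

```python
import numpy as np

def compound_runs(base, apply_mask, factor=65.0 / 64.0):
    """Vectorized form of: if apply_mask[x]: m[x] = m[x-1] * factor."""
    base = np.asarray(base, dtype=np.float64)
    apply_mask = np.asarray(apply_mask, dtype=bool)
    if base.shape != apply_mask.shape:
        raise ValueError("base and apply_mask must have the same shape")
    if len(base) == 0:
        return base.copy()
    if apply_mask[0]:
        raise ValueError("the first row has no previous value to build on")
    positions = np.arange(len(base))
    # For every row, the index of the most recent row that keeps its own value.
    start = np.maximum.accumulate(np.where(apply_mask, 0, positions))
    # Starting value of the run, compounded once per step since that start.
    return base[start] * factor ** (positions - start)

# Hypothetical data: rows 1-2 build on row 0, row 4 builds on row 3.
values = compound_runs([1.0, 9.0, 9.0, 2.0, 9.0],
                       [False, True, True, False, True])
print(values)
```

With the column names from the question, the mask could be built from the comparison of the two columns, for example apply_mask = (df['Streak_currency'] != df['Currency'].astype(str)).to_numpy(), provided that comparison really expresses the intended condition.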

Pandas DataFrame conditional change

I'm working on CSV time series data that records a step count per time frame. Once the step count exceeds 65535, it starts counting from 0 again. However, not every dataset reaches exactly 65535 (some go from 65530 straight to 5 if several steps were made within the time frame), so I can't find a good way to handle it such that a 0 after 6553x becomes 65536, and so on.
step realstep
65531 65531
65533 65533
65534 65534
2 65538
4 65540
I'm trying to count the real step in order to get their difference (e.g step/minute).
Find where it resets with diff being negative and add the max counter value (65536 since you count from 0) to all rows beyond that. This will be flexible if it resets multiple times (I added some extra data)
df['real_step'] = df.step + df.step.diff(1).lt(0).cumsum()*65536
step real_step
0 65531 65531
1 65533 65533
2 65534 65534
3 2 65538
4 4 65540
5 65434 130970
6 2 131074
7 4 131076
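To make the one-liner easier to follow, here is the same idea as a small self-contained sketch that reproduces the table above and then takes the per-row difference the question is after. The helper name unwrap_counter and the steps_taken column are illustrative, not part of the original answer.

```python
import pandas as pd

def unwrap_counter(step, modulus=65536):
    """Undo wrap-arounds of a counter that restarts at 0 after modulus - 1."""
    # A negative difference to the previous row marks a reset;
    # the running sum counts how many resets happened so far.
    wraps = step.diff().lt(0).cumsum()
    return step + wraps * modulus

df = pd.DataFrame({"step": [65531, 65533, 65534, 2, 4, 65434, 2, 4]})
df["real_step"] = unwrap_counter(df["step"])

# Steps taken between consecutive rows (NaN for the first row).
df["steps_taken"] = df["real_step"].diff()
print(df)
```

This assumes the counter wraps at most once between two consecutive rows. A wrap that ends on a value higher than the previous reading cannot be detected from the data alone.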

Selecting, slicing, and aggregating temporal data with Pandas

I'm trying to handle temporal data with pandas and I'm having a hard time...
Here is a sample of the DataFrame :
index ip app dev os channel click_time
0 29540 3 1 42 489 2017-11-08 03:57:46
1 26777 11 1 25 319 2017-11-09 11:02:14
2 140926 12 1 13 140 2017-11-07 04:36:14
3 69375 2 1 19 377 2017-11-09 13:17:20
4 119166 9 2 15 445 2017-11-07 12:11:37
This is a click prediction problem, so I want to create a time window aggregating the past behaviour of a specific ip ( for a given ip, how many clicks in the last 4 hours, 8 hours ? ).
I tried creating one new column which was simply :
df['minus_8']=df['click_time']-timedelta(hours=8)
I wanted to use this so that for each row I have a specific 8 hours window on which to aggregate my data.
I have also tried resampling with little success, my understanding of the function isn't optimal let's say.
Can anyone help ?
If you just need to select a particular 8 hours, you can do as follows:
import datetime
start_time = datetime.datetime(2017, 11, 9, 11, 2, 14)
df[(df['click_time'] >= start_time)
   & (df['click_time'] <= start_time + datetime.timedelta(hours=8))]
Otherwise I really think you need to look more at resample. Mind you, if you want resample to have your data divided into 8 hour chunks that are always consistent (e.g. from 00:00-08:00, 08:00-16:00, 16:00-00:00), then you will probably want to crop your data to a certain start time.
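For the fixed 8-hour chunks mentioned above, one possible sketch groups by ip and by an 8-hour pd.Grouper on click_time instead of calling resample directly. The sample rows are made up for illustration, and clicks_per_bin is a hypothetical name.

```python
import pandas as pd

# Made-up clicks for two ips; only the columns needed for counting are kept.
df = pd.DataFrame({
    "ip": [1, 1, 1, 2],
    "click_time": pd.to_datetime([
        "2017-11-08 03:57:46",
        "2017-11-08 05:00:00",
        "2017-11-08 09:00:00",
        "2017-11-08 04:36:14",
    ]),
})

# Count clicks per ip in consecutive 8-hour bins. With the default settings
# the bins are anchored at midnight of the first day: 00:00, 08:00, 16:00.
clicks_per_bin = df.groupby(
    ["ip", pd.Grouper(key="click_time", freq="8h")]
).size()
print(clicks_per_bin)
```

These are fixed calendar bins, so they answer "how many clicks in this 8-hour block", not "how many clicks in the 8 hours before this particular click".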
Using parts of the solution given by Martin, I was able to create this function that outputs what I wanted :
def window_filter_clicks(df, h):
    df['nb_clicks_{}h'.format(h)]=0
    ip_array = df.ip.unique()
    for ip in ip_array:
        df_ip=df[df['ip']==ip]
        for row, i in zip(df_ip['click_time'],df_ip['click_time'].index):
            df_window = df_ip[(df_ip['click_time']>= row-timedelta(hours=h)) & (df_ip['click_time']<= row) ]
            nb_clicks_4h = len(df_window)
            df['nb_clicks_{}h'.format(h)].iloc[i]= nb_clicks_4h
    return df
h allows me to select the size of the window on which to iterate.
Now this works fine, but it is very slow and I am working with a lot of rows.
Does anyone know how to improve the speed of such a function ? ( Or if there is anything similar built-in ? )
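One way to speed this up, sketched here under assumptions rather than as a definitive answer, is to sort each ip's timestamps once and locate both window edges with a binary search, which replaces the inner row-by-row filtering. The function name count_clicks_in_window is hypothetical, the index of df is assumed to be unique, and click_time is assumed to be a timezone-naive datetime column.

```python
import numpy as np
import pandas as pd

def count_clicks_in_window(df, hours):
    """For each row, count clicks by the same ip in [click_time - hours, click_time]."""
    window = np.timedelta64(hours, "h")
    counts = pd.Series(0, index=df.index, dtype="int64")
    for _, times in df.groupby("ip")["click_time"]:
        ordered = times.sort_values()
        values = ordered.to_numpy()
        # First position whose time is >= click_time - window.
        left = np.searchsorted(values, values - window, side="left")
        # One past the last position whose time is <= click_time.
        right = np.searchsorted(values, values, side="right")
        counts.loc[ordered.index] = right - left
    out = df.copy()
    out["nb_clicks_{}h".format(hours)] = counts
    return out

# Made-up clicks: ip 1 clicks at 00:00, 03:00 and 09:00, ip 2 clicks once.
clicks = pd.DataFrame({
    "ip": [1, 1, 1, 2],
    "click_time": pd.to_datetime([
        "2017-11-08 00:00:00",
        "2017-11-08 03:00:00",
        "2017-11-08 09:00:00",
        "2017-11-08 01:00:00",
    ]),
})
result = count_clicks_in_window(clicks, 4)
print(result)
```

As in the original function, the count includes the click itself and both ends of the window. The remaining Python loop runs once per ip rather than once per row, so the gain depends on how many clicks each ip has.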

Comparison between one element and all the others of a DataFrame column

I have a list of tuples which I turned into a DataFrame with thousands of rows, like this:
frag mass prot_position
0 TFDEHNAPNSNSNK 1573.675712 2
1 EPGANAIGMVAFK 1303.659458 29
2 GTIK 417.258734 2
3 SPWPSMAR 930.438172 44
4 LPAK 427.279469 29
5 NEDSFVVWEQIINSLSALK 2191.116099 17
...
and I have the following rule:
def are_dif(m1, m2, ppm=10):
    if abs((m1 - m2) / m1) < ppm * 0.000001:
        v = False
    else:
        v = True
    return v
So, I only want the "frag"s whose mass differs from the mass of every other fragment. How can I achieve that "selection"?
Then, I have a list named "pinfo" that contains:
d = {'id':id, 'seq':seq_code, "1HW_fit":hits_fit}
# one for each protein
# each dictionary as the position of the protein that it describes.
So, I want to add 1 to the "hits_fit" value in the dictionary corresponding to the protein.
If I'm understanding correctly (not sure if I am), you can accomplish quite a bit by just sorting. First though, let me adjust the data to have a mix of close and far values for mass:
Unnamed: 0 frag mass prot_position
0 0 TFDEHNAPNSNSNK 1573.675712 2
1 1 EPGANAIGMVAFK 1573.675700 29
2 2 GTIK 417.258734 2
3 3 SPWPSMAR 417.258700 44
4 4 LPAK 427.279469 29
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17
Then I think you can do something like the following to select the "good" ones. First, create 'pdiff' (percent difference) to see how close mass is to the nearest neighbors:
ppm = .00001
df = df.sort_values('mass')
df['pdiff'] = (df.mass-df.mass.shift()) / df.mass
Unnamed: 0 frag mass prot_position pdiff
3 3 SPWPSMAR 417.258700 44 NaN
2 2 GTIK 417.258734 2 8.148421e-08
4 4 LPAK 427.279469 29 2.345241e-02
1 1 EPGANAIGMVAFK 1573.675700 29 7.284831e-01
0 0 TFDEHNAPNSNSNK 1573.675712 2 7.625459e-09
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 2.817926e-01
The first and last data lines make this a little tricky so this next line backfills the first line and repeats the last line so that the following mask works correctly. This works for the example here, but might need to be tweaked for other cases (but only as far as the first and last lines of data are concerned).
df = df.iloc[list(range(len(df))) + [-1]].bfill()
df[ (df['pdiff'] > ppm) & (df['pdiff'].shift(-1) > ppm) ]
Results:
Unnamed: 0 frag mass prot_position pdiff
4 4 LPAK 427.279469 29 0.023452
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 0.281793
Sorry, I don't understand the second part of the question at all.
Edit to add: As mentioned in a comment to @AmiTavory's answer, I think possibly the sorting approach and groupby approach could be combined for a simpler answer than this. I might try at a later time, but everyone should feel free to give this a shot themselves if interested.
Here's something that's slightly different from what you asked, but it is very simple, and I think gives a similar effect.
Using numpy.round, you can create a new column
import numpy as np
df['roundedMass'] = np.round(df.mass, 6)
Following that, you can do a groupby of the frags on the rounded mass, and use nunique to count the numbers in the group. Filter for the groups of size 1.
So, the number of frags per bin is:
df.frag.groupby(np.round(df.mass, 6)).nunique()
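A minimal sketch of the "filter for the groups of size 1" step, using the adjusted sample masses from the other answer. The choice of 4 decimals is an assumption made only so that the two close pairs in this sample land in the same bin; the names rounded_mass, frags_in_bin and unique_frags are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "frag": ["TFDEHNAPNSNSNK", "EPGANAIGMVAFK", "GTIK",
             "SPWPSMAR", "LPAK", "NEDSFVVWEQIINSLSALK"],
    "mass": [1573.675712, 1573.675700, 417.258734,
             417.258700, 427.279469, 2191.116099],
})

# Bin the masses by rounding (assumed precision: 4 decimals).
rounded_mass = df["mass"].round(4)

# Number of distinct fragments sharing each row's bin, aligned with df.
frags_in_bin = df.groupby(rounded_mass)["frag"].transform("nunique")

# Keep only fragments that are alone in their bin.
unique_frags = df[frags_in_bin == 1]
print(unique_frags)
```

Rounding applies an absolute tolerance rather than the relative ppm rule from the question, and two masses that are very close can still fall on opposite sides of a bin edge, so this is an approximation of the original rule.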
Another solution is to create a copy of your list (if you need to preserve the original for further processing later), iterate over it, and remove every element that does not satisfy your rule (m1 and m2).
You will get a new list with all the unique masses.
Just don't forget that if you do need to use the original list later, you will need to use deepcopy.
