I am running this code on a large CSV file (1.5 million rows). Is there a way to optimise it?
df is a pandas DataFrame.
I take a row and want to know which happens first in the 1000 following rows:
I find my value + 0.0004, or I find my value - 0.0004.
result = []
for row in range(len(df) - 1000):
    start = df.get_value(row, 'A')   # note: get_value is deprecated in newer pandas; df.at[row, 'A'] is the replacement
    win = start + 0.0004
    lose = start - 0.0004
    for n in range(1000):
        ref = df.get_value(row + n, 'B')
        if ref > win:
            result.append(1)
            break
        elif ref <= lose:
            result.append(-1)
            break
        elif n == 999:
            result.append(0)
The DataFrame looks like this:
timestamp A B
0 20190401 00:00:00.127 1.12230 1.12236
1 20190401 00:00:00.395 1.12230 1.12237
2 20190401 00:00:00.533 1.12229 1.12234
3 20190401 00:00:00.631 1.12228 1.12233
4 20190401 00:00:01.019 1.12230 1.12234
5 20190401 00:00:01.169 1.12231 1.12236
The result is: result = [0, 0, 1, 0, 0, 1, -1, 1, …]
This works, but it takes a long time to process on such large files.
To generate values for the "first outlier", define the following function:
def firstOutlier(row, dltRow=4, dltVal=0.1):
    '''Find the value for the first "outlier". Parameters:
    row    - the current row
    dltRow - number of rows to check, starting from the current one
    dltVal - delta in value of "B", compared to "A" in the current row
    '''
    rowInd = row.name                            # index of the current row
    df2 = df.iloc[rowInd : rowInd + dltRow]      # "dltRow" rows starting from the current one
    outliers = df2[abs(df2.B - row.A) >= dltVal]
    if outliers.index.size == 0:                 # no outliers within the range of rows
        return 0
    return int(np.sign(outliers.iloc[0].B - row.A))
Then apply it to each row:
df.apply(firstOutlier, axis=1)
This function relies on the fact that the DataFrame has an index consisting of consecutive numbers starting from 0, so that given ind, the index of any row, we can access that row with df.iloc[ind] and a slice of n rows starting from it with df.iloc[ind : ind + n].
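If your DataFrame's index is not already a plain 0, 1, 2, … range, one way to satisfy that assumption (a minimal sketch; it assumes the old index does not need to be kept) is to reset it before applying the function:

# Replace the existing index with 0..n-1 so that row.name lines up with
# the positional slicing done via df.iloc inside firstOutlier.
df = df.reset_index(drop=True)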
For my test, I set the default parameter values to:
dltRow = 4 - look at 4 rows, starting from the current one,
dltVal = 0.1 - look for rows whose B column differs from A in the current row by 0.1 or more.
My test DataFrame was:
A B
0 1.00 1.00
1 0.99 1.00
2 1.00 0.80
3 1.00 1.05
4 1.00 1.20
5 1.00 1.00
6 1.00 0.80
7 1.00 1.00
8 1.00 1.00
The result (for my data and default values of parameters) was:
0 -1
1 -1
2 -1
3 1
4 1
5 -1
6 -1
7 0
8 0
dtype: int64
For your needs, change the default parameter values to 1000 and 0.0004 respectively.
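For reference, the call with those parameters would look something like the sketch below (the lambda simply overrides the test defaults shown above):

# Look 1000 rows ahead and use a 0.0004 band, as in the question.
result = df.apply(lambda row: firstOutlier(row, dltRow=1000, dltVal=0.0004), axis=1)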
The idea is to loop through A and B while maintaining a sorted list of A values. Then, for each B, find the highest A that loses and the lowest A that wins; since the list is sorted, each search is O(log n). Only A values whose row index lies within the last 1000 rows are used when setting the result vector. After that, the A values that are no longer waiting for a B are removed from the sorted list to keep it small.
import numpy as np
import bisect
import time

N = 10
M = 3
# N = int(1e6)
# M = int(1e3)
thresh = 0.4

A = np.random.rand(N)
B = np.random.rand(N)
result = np.zeros(N)

l = []  # sorted list of (A value, row index) pairs still waiting for an outcome
t_start = time.time()
for i in range(N):
    a = (A[i], i)
    bisect.insort(l, a)
    b = B[i]
    firstLoseInd = bisect.bisect_left(l, (b + thresh, -1))
    lastWinInd = bisect.bisect_right(l, (b - thresh, -1))
    for j in range(lastWinInd):                # A values low enough that this B counts as a win
        curInd = l[j][1]
        if curInd > i - M:                     # only rows within the last M rows
            result[curInd] = 1
    for j in range(firstLoseInd, len(l)):      # A values high enough that this B counts as a loss
        curInd = l[j][1]
        if curInd > i - M:
            result[curInd] = -1
    del l[firstLoseInd:]                       # resolved entries are dropped to keep the list small
    del l[:lastWinInd]
t_done = time.time()

print(A)
print(B)
print(result)
print(t_done - t_start)
This is a sample output:
[ 0.22643589 0.96092354 0.30098532 0.15569044 0.88474775 0.25458535
0.78248271 0.07530432 0.3460113 0.0785128 ]
[ 0.83610433 0.33384085 0.51055061 0.54209458 0.13556121 0.61257179
0.51273686 0.54850825 0.24302884 0.68037965]
[ 1. -1. 0. 1. -1. 0. -1. 1. 0. 1.]
For N = int(1e6) and M = int(1e3) it took about 3.4 seconds on my computer.
I am new to Python and am trying to write code that creates a new DataFrame based on conditions from an old DataFrame, together with the result in the cell above in the new DataFrame.
Here is an example of what I am trying to do (the raw data table was shown as an image here):
I need to create a new DataFrame where, if the corresponding position in the raw data is 0, the result is 0; if it is greater than 0, the result is 1 plus the value in the row above.
Then I need to remove any instances where the number of consecutive intervals doesn't reach at least 3.
The way I think about the code is shown below, but being new to Python I am struggling.
From Raw data to Dataframe 2:
if (1,1)=0 then (1a, 1a)= 0: # line 1
else (1a,1a)=1;
if (2,1)=0 then (2a,1a)=0; # line 2
else (2a,1a)= (1a,1a)+1 = 2;
if (3,1)=0 then (3a,1a)=0; # line 3
From Dataframe 2 to 3:
If any of the last 3 rows is greater than 3, return that cell's value; else return 0.
I am not sure how to make any of these work. If there is an easier way to do or think about this than what I am doing, please let me know. Any help is appreciated!
Based on your question, the output I was able to generate was:
Earlier, the DataFrame looked like so:
A B C
0.05 5 0 0
0.10 7 0 1
0.15 0 0 12
0.20 0 4 3
0.25 1 0 5
0.30 21 5 0
0.35 6 0 9
0.40 15 0 0
Now, the DataFrame looks like so:
A B C
0.05 0 0 0
0.10 0 0 1
0.15 0 0 2
0.20 0 0 3
0.25 1 0 4
0.30 2 0 0
0.35 3 0 0
0.40 4 0 0
The code I used for this is given below; just copy the following code into a new file, say code.py, and run it:
import re
import pandas as pd


def get_continous_runs(ext_list, threshold):
    # Turn the column into a 0/1 string and find runs of at least `threshold` ones.
    mylist = list(ext_list)
    for i in range(len(mylist)):
        if mylist[i] != 0:
            mylist[i] = 1
    samp = "".join(map(str, mylist))
    finder = re.finditer(r"1{%s,}" % threshold, samp)
    ranges = [x.span() for x in finder]
    return ranges


def build_column(ranges, max_len):
    # For each qualifying run, write 1, 2, 3, ... into the output column.
    answer = [0] * max_len
    for r in ranges:
        start = r[0]
        run_len = r[1] - start
        for i in range(run_len):
            answer[start + i] = i + 1
    return answer


def main(df):
    print("Earlier, the DataFrame looked like so:")
    print(df)
    ndf = df.copy()
    for col_name, col_data in df.iteritems():   # note: iteritems() is deprecated in newer pandas; use df.items()
        ranges = get_continous_runs(col_data.values, 4)
        column_len = len(col_data.values)
        new_column = build_column(ranges, column_len)
        ndf[col_name] = new_column
    print("\nNow, the DataFrame looks like so:")
    print(ndf)
    return


if __name__ == '__main__':
    raw_data = [
        (5, 0, 0), (7, 0, 1), (0, 0, 12), (0, 4, 3),
        (1, 0, 5), (21, 5, 0), (6, 0, 9), (15, 0, 0),
    ]
    df = pd.DataFrame(
        raw_data,
        columns=list("ABC"),
        index=[0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
    )
    main(df)
You can adjust the threshold passed to get_continous_runs() in main() to require a number of consecutive intervals other than 4 (i.e. more than 3).
As always, start by reading the main() function to understand how everything works. I have tried to use good variable names to aid understanding. My method might seem a little contrived because I am using regex, but I didn't want to overwhelm a beginner with a custom run-length counter.
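If you later want a version without the regex round-trip, a vectorized pandas sketch of the same run-counting idea could look like the following (run_counts and min_run are names introduced here for illustration; it assumes numeric columns and the same minimum run length of 4):

import pandas as pd

def run_counts(col, min_run=4):
    """Count consecutive non-zero values in a column, resetting at zeros,
    and zero out runs shorter than min_run."""
    nonzero = col.ne(0)
    run_id = (nonzero != nonzero.shift()).cumsum()        # label each run of equal values
    counts = nonzero.groupby(run_id).cumcount() + 1       # 1, 2, 3, ... within each run
    run_len = nonzero.groupby(run_id).transform('size')   # length of the run each row belongs to
    return counts.where(nonzero & (run_len >= min_run), 0)

# ndf = df.apply(run_counts)  # column-wise; produces the same shape as df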
So I understand we can use a pandas DataFrame to do vector operations on cells, like
df = pd.DataFrame([a, b, c])
df*3
would equal something like :
0 a*3
1 b*3
2 c*3
but could we use a pandas DataFrame to, say, calculate the Fibonacci sequence?
I am asking this because for the Fibonacci sequence the next number depends on the previous two numbers (F_n = F_(n-1) + F_(n-2)). I am not exactly interested in the Fibonacci sequence itself, and more interested in knowing whether we can do something like:
df = pd.DataFrame([a,b,c])
df.apply( some_func )
0 x1 a
1 x2 b
2 x3 c
where x1 would be calculated from a, b, c (I know this is possible), x2 would be calculated from x1, and x3 would be calculated from x2.
The Fibonacci example would just be something like:
df = pd.DataFrame()
df.apply(fib(n, df))
0 0
1 1
2 1
3 2
4 3
5 5
.
.
.
n-1 F(n-1) + F(n-2)
You need to iterate through the rows and access previous rows' data via DataFrame.loc. For example, with n = 6:
df = pd.DataFrame()
for i in range(0, 6):
    df.loc[i, 'f'] = i if i in [0, 1] else df.loc[i - 1, 'f'] + df.loc[i - 2, 'f']
df
f
0 0.0
1 1.0
2 1.0
3 2.0
4 3.0
5 5.0
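If the .loc-based loop becomes slow for larger n, one possible variant (a sketch, not the canonical way) is to build the sequence in a plain Python list and wrap it in a DataFrame once at the end:

import pandas as pd

n = 6
fib = [0, 1]
for i in range(2, n):
    fib.append(fib[-1] + fib[-2])   # F(n) = F(n-1) + F(n-2)
df = pd.DataFrame({'f': fib[:n]})
print(df)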
My DataFrame contains 10,000,000 rows! After the groupby, ~9,000,000 sub-frames remain to loop through.
The code is:
import pandas as pd

data = pd.read_csv('big.csv')
for id, new_df in data.groupby(level=0):  # look at each mini df and do some analysis
    pass  # some code for each of the small data frames
This is super inefficient, and the code has been running for 10+ hours now.
Is there a way to speed it up?
Full Code:
d = pd.DataFrame()  # new df to populate
print('Start of the loop')
for id, new_df in data.groupby(level=0):
    c = [new_df.iloc[i:] for i in range(len(new_df.index))]
    x = pd.concat(c, keys=new_df.index).reset_index(level=(2, 3), drop=True).reset_index()
    x = x.set_index(['level_0', 'level_1', x.groupby(['level_0', 'level_1']).cumcount()])
    d = pd.concat([d, x])
To get the data:
data = pd.read_csv('https://raw.githubusercontent.com/skiler07/data/master/so_data.csv', index_col=0).set_index(['id','date'])
Note:
Most ids will only have 1 date, which indicates only 1 visit. For ids with more visits, I would like to structure them in a 3D format, e.g. store all of their visits in the 2nd dimension out of 3. The output is (id, visits, features).
Here is one way to speed that up. It adds the desired new rows in code that processes the rows directly, which saves the overhead of constantly constructing small DataFrames. Your sample of 100,000 rows runs in a couple of seconds on my machine, while your code takes > 100 seconds with only 10,000 rows of your sample data. This represents a couple of orders of magnitude of improvement.
Code:
import pandas as pd


def make_3d(csv_filename):
    def make_3d_lines(a_df):
        a_df['depth'] = 0
        depth = 0
        prev = None
        accum = []
        for row in a_df.values.tolist():
            row[0] = 0
            key = row[1]
            if key == prev:
                depth += 1
                accum.append(row)
            else:
                if depth == 0:
                    yield row
                else:
                    depth = 0
                    to_emit = []
                    for i in range(len(accum)):
                        date = accum[i][2]
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j
                            to_emit[-1][2] = date
                    for r in to_emit[1:]:
                        yield r
                accum = [row]
                prev = key

    df_data = pd.read_csv(csv_filename)   # read the file passed in (originally hard-coded as 'big-data.csv')
    df_data.columns = ['depth'] + list(df_data.columns)[1:]
    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split())),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())

    return new_df.set_index('id date'.split())
Test Code:
start_time = time.time()
df = make_3d('big-data.csv')
print(time.time() - start_time)
df = df.drop(columns=['feature%d' % i for i in range(3, 25)])
print(df[df['depth'] != 0].head(10))
Results:
1.7390995025634766
depth feature0 feature1 feature2
id date
207555809644681 20180104 1 0.03125 0.038623 0.008130
247833985674646 20180106 1 0.03125 0.004378 0.004065
252945024181083 20180107 1 0.03125 0.062836 0.065041
20180107 2 0.00000 0.001870 0.008130
20180109 1 0.00000 0.001870 0.008130
329567241731951 20180117 1 0.00000 0.041952 0.004065
20180117 2 0.03125 0.003101 0.004065
20180117 3 0.00000 0.030780 0.004065
20180118 1 0.03125 0.003101 0.004065
20180118 2 0.00000 0.030780 0.004065
I believe your approach for feature engineering could be done better, but I will stick to answering your question.
In Python, iterating over a dictionary is much faster than iterating over a DataFrame.
Here is how I managed to process a huge pandas DataFrame (~100,000,000 rows):
# reset the Dataframe index to get level 0 back as a column in your dataset
df = data.reset_index() # the index will be (id, date)
# split the DataFrame based on id
# and store the splits as Dataframes in a dictionary using id as key
d = dict(tuple(df.groupby('id')))
# iterate over the Dictionary and process the values
for key, value in d.items():
    pass  # each value is a DataFrame
# concat the values and get the original (processed) Dataframe back
df2 = pd.concat(d.values(), ignore_index=True)
Modified @Stephen's code:
def make_3d(dataset):
    def make_3d_lines(a_df):
        a_df['depth'] = 0                        # sets all depth from (1 to n) to 0
        depth = 1                                # initiate from 1, so that the first loop is correct
        prev = None
        accum = []                               # accumulates blocks of data belonging to a given user
        for row in a_df.values.tolist():         # for each row in our dataset
            row[0] = 0                           # NOT SURE (this resets the 'depth' column to 0)
            key = row[1]                         # this is the id of the row
            if key == prev:                      # if this row's id matches the previous row's id, append together
                depth += 1
                accum.append(row)
            else:                                # else this id is new, so the previous block is completed -> process it
                if depth == 0:                   # previous id appeared only once -> get that row from accum
                    yield accum[0]               # also remember that depth = 0
                else:                            # process the block and emit each row
                    depth = 0
                    to_emit = []                 # prepare to emit the list
                    for i in range(len(accum)):  # for each unique day in the accumulated list
                        date = accum[i][2]       # define date to be the first date it sees
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j     # define the depth
                            to_emit[-1][2] = date  # define the date
                    for r in to_emit[0:]:
                        yield r
                accum = [row]
                prev = key

    df_data = dataset.reset_index()
    df_data.columns = ['depth'] + list(df_data.columns)[1:]
    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split(), ascending=[True, False])),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())

    return new_df.set_index('id date'.split())
Testing:
t = pd.DataFrame(data={'id':[1,1,1,1,2,2,3,3,4,5], 'date':[20180311,20180310,20180210,20170505,20180312,20180311,20180312,20180311,20170501,20180304], 'feature':[10,20,45,1,14,15,20,20,13,11],'result':[1,1,0,0,0,0,1,0,1,1]})
t = t.reindex(columns=['id','date','feature','result'])
print(t)
id date feature result
0 1 20180311 10 1
1 1 20180310 20 1
2 1 20180210 45 0
3 1 20170505 1 0
4 2 20180312 14 0
5 2 20180311 15 0
6 3 20180312 20 1
7 3 20180311 20 0
8 4 20170501 13 1
9 5 20180304 11 1
Output
depth feature result
id date
1 20180311 0 10 1
20180311 1 20 1
20180311 2 45 0
20180311 3 1 0
20180310 0 20 1
20180310 1 45 0
20180310 2 1 0
20180210 0 45 0
20180210 1 1 0
20170505 0 1 0
2 20180312 0 14 0
20180312 1 15 0
20180311 0 15 0
3 20180312 0 20 1
20180312 1 20 0
20180311 0 20 0
4 20170501 0 13 1
I need to write code that creates a new column B containing the cumulative sum of the first column A.
When the cumulative sum drops below 0, the value in B should be 0.
Then the cumulative sum starts again until the next drop below 0.
I searched for similar answers but was not able to find one fitting my case. Thanks for your help.
A B
1 1
3 4
5 9
7 16
-6 10
-8 2
-10 *0*
6 6
-15 *0*
11 11
Set up a loop over A and keep a running total. If the total is less than 0, set it to 0, then append the new total to B.
You start with A = [ 1, ...], total = 0, B = []:
total = 0
B = []
for i in range(len(A)):
    # process the sum
    total += A[i]
    if total < 0:
        total = 0
    B.append(total)
Here is a non-Pandas answer that iteratively loops through the values in column A and creates column B by never going below 0.
result = []
cur_res = 0
for i in df.A:
    cur_res = max(cur_res + i, 0)
    result.append(cur_res)
df['B'] = result
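For completeness, the same running total with a floor at 0 can also be written compactly with itertools.accumulate (a sketch; the initial argument requires Python 3.8+):

from itertools import accumulate

# Seed the running total with 0 so that a negative first value is also clamped,
# then drop the seed element to get exactly one value per row of A.
totals = list(accumulate(df['A'], lambda total, x: max(total + x, 0), initial=0))
df['B'] = totals[1:]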
I have a dataframe of 6M+ observations, where 20 of the columns are weights that will be applied to a single score column. I.e., Wgt1 * Wgt2 * Wgt3...* Score. In addition, not each weight is applicable to every observation, so I have created 20 columns that represent a weight mask. I.e., (Wgt1*Msk1) * (Wgt2*Msk2) * (Wgt3*Msk3) ... Score. When the mask is 0, the weight is not applicable; when the mask is 1, it is applicable.
For each row in the dataframe, I want to:
1. Check 2 qualifying conditions that indicate the row should be processed.
2. Find the product of the weights, subject to the presence of the corresponding mask (ttl_wgt).
3. Multiply this product by the score (prob) to create a final weighted score.
To do this, I have created a user-defined function:
import functools
import operator
import time
import numpy as np

def mymult(a):
    ttl_wgt = float('NaN')                 # initialize to NaN
    if ~np.isnan(a['ID']):                 # condition 1: only process if an ID is present
        if a['prob'] > -1.0:               # condition 2: only process if our unweighted score is NOT -1.0
            b = np.where(a[msks] == 1)[0]  # indices of the masks that are 1
            ttl_wgt = functools.reduce(operator.mul, a[np.asarray(wgt_nms)[b]], 1)
    return ttl_wgt
I ran out of memory during development, so I decided to chunk it up into 500000 rows at a time. I use a lambda function to apply to the chunk:
msks = ['Msk1','Msk2','Msk3','Msk4',...,'Msk20']
wgt_nms = ['Wgt1','Wgt2','Wgt3','Wgt4',...,'Wgt20']
print('Determining final weights...')
chunksize = 500000 #we'll operate on this many rows at a time
start_time = time.time()
ttl_wgts = [] #initialize list to hold weight products
for i in range(0, len(df), chunksize):
    ttl_wgts.extend(df[i:(i+chunksize)].apply(lambda x: mymult(x), axis=1))
print("--- %s seconds ---" % (time.time() - start_time)) #Expect between 30 and 40 minutes
print('Done!')
Then I assign the ttl_wgts list as a new column in the DataFrame and compute the final product of weight * initial score.
# Initialize the fields
# Might not be necessary or even useful
df['ttl_wgt'] = float('NaN')
df['wgt_prob'] = float('NaN')
df['ttl_wgt'] = ttl_wgts
df['wgt_prob'] = df['ttl_wgt'] * df['prob']
I checked out a prior post on multiplying elements in a list. It was great food for thought, but I wasn't able to turn it into anything more efficient for my 6M+ observations. Are there other approaches I should be considering?
Adding an example df, as suggested
A sample of the DataFrame might look something like this, with only 3 masks/weights:
df = pd.DataFrame({'id': [999999999,136550,80010170,80010177,90002408,90002664,16207501,62992,np.nan,80010152],
'prob': [-1,0.180274382,0.448361456,0.000945058,0.005060279,0.009893078,0.169686288,0.109541453,0.117907763,0.266242921],
'Msk1': [0,1,1,1,0,0,1,0,0,0],
'Msk2': [0,0,1,0,0,0,0,1,0,0],
'Msk3': [1,0,0,0,1,1,0,0,1,1],
'Wgt1': [np.nan,0.919921875,1.08984375,1.049804688,np.nan,np.nan,np.nan,0.91015625,np.nan,0.810058594],
'Wgt2': [np.nan,1.129882813,1.120117188,0.970214844,np.nan,np.nan,np.nan,1.0703125,np.nan,0.859863281],
'Wgt3': [np.nan,1.209960938,1.23046875,1,np.nan,np.nan,np.nan,1.150390625,np.nan,0.649902344]
})
In the first observation, the prob field is -1, so the row would not be processed. In the second observation, Msk1 is turned on while Msk2 and Msk3 are turned off, so the final weight would be the value of Wgt1 = 0.919922. In the 3rd row, Msk1 and Msk2 are on while Msk3 is off, so the final weight would be Wgt1*Wgt2 = 1.089844*1.120117 = 1.220752.
IIUC:
You want to replace the weights whose mask is 0 with 1. Then you can multiply them all together with no impact from the masked-out ones. That's the trick; you'll have to apply it as needed.
create msk
msk = df.filter(like='Msk')
print(msk)
Msk1 Msk2 Msk3
0 0 0 1
1 1 0 0
2 1 1 0
3 1 0 0
4 0 0 1
5 0 0 1
6 1 0 0
7 0 1 0
8 0 0 1
9 0 0 1
create wgt
wgt = df.filter(like='Wgt')
print(wgt)
Wgt1 Wgt2 Wgt3
0 NaN NaN NaN
1 0.919922 1.129883 1.209961
2 1.089844 1.120117 1.230469
3 1.049805 0.970215 1.000000
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 0.910156 1.070312 1.150391
8 NaN NaN NaN
9 0.810059 0.859863 0.649902
create new_wgt
new_wgt = np.where(msk, wgt, 1)
print(new_wgt)
[[ 1. 1. nan]
[ 0.91992188 1. 1. ]
[ 1.08984375 1.12011719 1. ]
[ 1.04980469 1. 1. ]
[ 1. 1. nan]
[ 1. 1. nan]
[ nan 1. 1. ]
[ 1. 1.0703125 1. ]
[ 1. 1. nan]
[ 1. 1. 0.64990234]]
final prod_wgt
prod_wgt = pd.Series(new_wgt.prod(1), wgt.index)
print(prod_wgt)
0 NaN
1 0.919922
2 1.220753
3 1.049805
4 NaN
5 NaN
6 NaN
7 1.070312
8 NaN
9 0.649902
dtype: float64
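For reference, the same pipeline can be collapsed into a couple of lines, and the final weighted score from the question follows directly (a sketch; it reuses the msk and wgt frames built above):

# Replace masked-out weights with 1, take the row-wise product (NaN propagates,
# matching the step-by-step result above), then apply the unweighted score.
prod_wgt = pd.Series(np.where(msk, wgt, 1).prod(axis=1), index=df.index)
df['wgt_prob'] = prod_wgt * df['prob']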