Python Pandas -- Random sampling of time series

New to Pandas, looking for the most efficient way to do this.
I have a Series of DataFrames. Each DataFrame has the same columns but a different index, and each is indexed by date. The Series itself is indexed by ticker symbol, so each item in the Series represents the time series of a single stock's performance.
I need to randomly generate a list of n dataframes, where each dataframe is a subset of some random assortment of the available stocks' histories. Overlap is fine, as long as the start and end dates differ.
The following code does it, but it's really slow, and I'm wondering if there's a better way to go about it:
Code
def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if type(data) != pd.Series:
        return None
    if subset=='validate':
        offset = 0
    elif subset=='test':
        offset = 200
    elif subset=='train':
        offset = 400
    tickers = np.random.randint(0, len(data), size=len(data))
    ret_data = []
    while len(ret_data) != batch_size:
        for t in tickers:
            data_t = data[t]
            max_len = len(data_t)-timesteps-1
            if len(ret_data)==batch_size: break
            if max_len-offset < 0: continue
            index = np.random.randint(offset, max_len)
            d = data_t[index:index+timesteps]
            if len(d)==timesteps: ret_data.append(d)
    return ret_data
Profile output:
Timer unit: 1e-06 s
File: finance.py
Function: random_sample at line 137
Total time: 0.016142 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
137 #profile
138 def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
139 1 5 5.0 0.0 if type(data) != pd.Series:
140 return None
141
142 1 1 1.0 0.0 if subset=='validate':
143 offset = 0
144 1 1 1.0 0.0 elif subset=='test':
145 offset = 200
146 1 0 0.0 0.0 elif subset=='train':
147 1 1 1.0 0.0 offset = 400
148
149 1 1835 1835.0 11.4 tickers = np.random.randint(0, len(data), size=len(data))
150
151 1 2 2.0 0.0 ret_data = []
152 2 3 1.5 0.0 while len(ret_data) != batch_size:
153 116 148 1.3 0.9 for t in tickers:
154 116 2497 21.5 15.5 data_t = data[t]
155 116 317 2.7 2.0 max_len = len(data_t)-timesteps-1
156 116 80 0.7 0.5 if len(ret_data)==batch_size: break
157 115 69 0.6 0.4 if max_len-offset < 0: continue
158
159 100 101 1.0 0.6 index = np.random.randint(offset, max_len)
160 100 10840 108.4 67.2 d = data_t[index:index+timesteps]
161 100 241 2.4 1.5 if len(d)==timesteps: ret_data.append(d)
162
163 1 1 1.0 0.0 return ret_data

Are you sure you need to find a faster method? Your current method isn't that slow. The following changes might simplify your code, but won't necessarily make it any faster:
Step 1: Take a random sample (with replacement) from the list of dataframes
rand_stocks = np.random.randint(0, len(data), size=batch_size)
You can treat this array rand_stocks as a list of indices to be applied to your Series of dataframes. The size is already batch_size, so that eliminates the need for the while loop and the comparison on line 156.
That is, you can iterate over rand_stocks and access the stock like so:
for idx in rand_stocks:
    stock = data.ix[idx]
    # Get a sample from this stock.
Step 2: Get a random date range for each stock you have randomly selected.
start_idx = np.random.randint(offset, len(stock)-timesteps)
d = stock[start_idx:start_idx+timesteps]
I don't have your data, but here's how I put it together:
def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if subset=='train': offset = 0  # you can obviously change this back
    rand_stocks = np.random.randint(0, len(data), size=batch_size)
    ret_data = []
    for idx in rand_stocks:
        stock = data[idx]
        start_idx = np.random.randint(offset, len(stock)-timesteps)
        d = stock[start_idx:start_idx+timesteps]
        ret_data.append(d)
    return ret_data
Creating a dataset:
In [22]: import numpy as np
In [23]: import pandas as pd
In [24]: rndrange = pd.DateRange('1/1/2012', periods=72, freq='H')
In [25]: rndseries = pd.Series(np.random.randn(len(rndrange)), index=rndrange)
In [26]: rndseries.head()
Out[26]:
2012-01-02 2.025795
2012-01-03 1.731667
2012-01-04 0.092725
2012-01-05 -0.489804
2012-01-06 -0.090041
In [27]: data = [rndseries,rndseries,rndseries,rndseries,rndseries,rndseries]
Testing the function:
In [42]: random_sample(data, timesteps=2, batch_size = 2)
Out[42]:
[2012-01-23 1.464576
2012-01-24 -1.052048,
2012-01-23 1.464576
2012-01-24 -1.052048]
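Note that pd.DateRange has long since been removed from pandas. As a minimal sketch only, an equivalent toy dataset and test can be built with the current API (pd.date_range; a plain list of identical Series stands in for the Series of per-ticker DataFrames), reusing the random_sample function from the answer above:
import numpy as np
import pandas as pd

# Hourly index with 72 periods, analogous to the old pd.DateRange call above
rndrange = pd.date_range('1/1/2012', periods=72, freq='h')
rndseries = pd.Series(np.random.randn(len(rndrange)), index=rndrange)

# Six identical series stand in for the Series of per-ticker DataFrames
data = [rndseries] * 6

sample = random_sample(data, timesteps=2, batch_size=2)
print(sample)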

Related

How can I improve the execution time of my code that parses XML to dataframes using the requests library

I have a URL mask and dynamic values from a dictionary that I substitute into the mask to generate a URL. Each URL points to an XML file. I fetch the XML, build a dataframe, and fill one column with one of the values from that dictionary. In the end I produce a list of dataframes to work with further.
My code executes pretty slowly; I imagine that is because something in my iteration could be refactored. Is there any way I can make it faster, or is it limited by the GET requests?
This is my current approach. At first I tried saving the XML files locally and only then parsing them into dataframes, but that obviously takes longer. I also tried breaking the functions into smaller ones, with the same effect.
FILE_FORMAT = 'xml'
CURRENT_DIR = os.path.abspath('')
SAVE_DIR = os.path.join(CURRENT_DIR, 'report')
REPORT = 'oo1'
YEAR = '2022-2023'
BASE_URL = 'sensetive_link'

def create_source(file_name):
    df = pd.read_excel(f'{file_name}.xlsx', dtype=object)
    columns = df.columns.tolist()
    result = {
        school: df[item].dropna(how='all').tolist() for school, item in zip(
            columns, df
        )
    }
    return result

def download_xml_to_df_list(source_dict):
    df_list = []
    fillers = {
        'base_url': BASE_URL,
        'year': YEAR,
        'report': REPORT,
        'file_format': FILE_FORMAT,
    }
    count = 0
    length = sum([len(i) for i in source.values()])
    for mouo, school in source_dict.items():
        for num, i in enumerate(range(len(source_dict[mouo])), 1):
            try:
                url = (
                    '{base_url}/{year}ob/{report}/61/{mouo}/oo1_{school}.{file_format}'
                    .format(**fillers, mouo=mouo, school=source_dict[mouo][i])
                )
                df = pd.read_xml(requests.get(url).text, xpath='//item')
                df['value'] = df['value'].astype('float64')
                df.index = [source_dict[mouo][i]] * len(df)
                df_list.append(df)
                count += 1
                message = f'parsed {count} out of {length}'
                print(message, end='\r')
            except Exception as error:
                print(f"{url} doesn't exist")
    print('\ndone')
    return df_list
I was using the time library to measure execution time, and it says
excecuted in 131.20987153053284
I'm using a Jupyter notebook, but from what I've read that doesn't affect execution time.
EDIT:
As per @Paul H's advice I did line profiling. Results:
Function: download_xml_to_df_list at line 39
Line # Hits Time Per Hit % Time Line Contents
39 #profile
40 def download_xml_to_df_list(source_dict):
41 1 6.0 6.0 0.0 df_list = []
42 fillers = {
43 1 5.0 5.0 0.0 'base_url': BASE_URL,
44 1 4.0 4.0 0.0 'year': YEAR,
45 1 4.0 4.0 0.0 'report': REPORT,
46 1 11.0 11.0 0.0 'file_format': FILE_FORMAT,
47 }
48 1 4.0 4.0 0.0 count = 0
49 1 108.0 108.0 0.0 length = sum([len(i) for i in source.values()])
50 17 143.0 8.4 0.0 for mouo, school in source_dict.items():
51 173 1941.0 11.2 0.0 for num, i in enumerate(range(len(source_dict[mouo])), 1):
52 173 700.0 4.0 0.0 try:
53 url = (
54 173 1765.0 10.2 0.0 '{base_url}/{year}ob/{report}/61/{mouo}/oo1_{school}.{file_format}'
55 173 13534.0 78.2 0.0 .format(**fillers, mouo=mouo, school=source_dict[mouo][i])
56 )
57 167 1892702357.0 11333547.0 99.9 df = pd.read_xml(requests.get(url).text, xpath='//item')
58 167 563079.0 3371.7 0.0 print(f'fetched url {url}')
59 167 1660282.0 9941.8 0.1 df['value'] = df['value'].astype('float64')
60 167 511147.0 3060.8 0.0 df.index = [source_dict[mouo][i]] * len(df)
61 167 1869.0 11.2 0.0 df_list.append(df)
62 167 1108.0 6.6 0.0 count += 1
63 167 3864.0 23.1 0.0 message = f'parsed {count} out of {length}'
64 167 41371.0 247.7 0.0 print(message, end='\r')
65 6 94.0 15.7 0.0 except Exception as error:
66 6 5053.0 842.2 0.0 print(f"{url} doesn't exist")
67 print_stats()
68 return df_list
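The profile makes the bottleneck clear: essentially all of the time (99.9%) is spent inside requests.get, i.e. waiting on the network, so refactoring the dataframe handling will not change much. Since the work is I/O-bound, one direction worth trying (a sketch only, not from the original post; fetch_one and download_all are hypothetical helpers standing in for the URL-building loop above) is to issue the requests concurrently:
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests

def fetch_one(url):
    # Hypothetical helper: download one XML file and parse it into a dataframe.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    df = pd.read_xml(response.text, xpath='//item')
    df['value'] = df['value'].astype('float64')
    return df

def download_all(urls, max_workers=8):
    # Issue the GET requests in parallel; parsing still happens per response.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, urls))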

Python multithreading not getting desired performance

I have a bunch of pandas dataframes I would like to write out to some format (csv, json, etc.), and I would like to preserve the order, based on the order in which the dataframes were read. Unfortunately .to_csv() can take some time, sometimes 2x longer than just reading the dataframe.
Let's take the image as an example:
Here you can see the task run linearly: read a dataframe, write it out, then repeat for the remaining dataframes. This can take about 3x longer than just reading the dataframes. Theoretically, if we could push the writing (to_csv()) to separate threads (2 threads, plus the main thread reading), total execution could drop to almost a third of the linear (synchronous) version. Of course, with just 3 reads it only looks about half as fast, but the more dataframes you read, the bigger the gain should be (theoretically).
Unfortunately, it does not actually work out that way. I am getting only a very small gain in performance, and the read time actually gets longer. This might be because to_csv() is CPU-intensive and the threads all share the same process resources, so there is not much to gain.
So my question is: how can I improve the code to get performance closer to the theoretical numbers? I tried using multiprocessing but failed to get working code. How could I do this with multiprocessing? Are there other ways I could improve the total execution time of such a task?
Here's my sample code using multithreads:
import pandas as pd
import datetime
import os
from threading import Thread
import queue
from io import StringIO
from line_profiler import LineProfiler

NUMS = 500
DEVNULL = open(os.devnull, 'w')
HEADERS = ",a,b,c,d,e,f,g\n"
SAMPLE_CSV = HEADERS + "\n".join([f"{x},{x},{x},{x},{x},{x},{x},{x}" for x in range(4000)])

def linear_test():
    print("------Linear Test-------")
    main_start = datetime.datetime.now()
    total_read_time = datetime.timedelta(0)
    total_add_task = datetime.timedelta(0)
    total_to_csv_time = datetime.timedelta(0)
    total_to_print = datetime.timedelta(0)
    for x in range(NUMS):
        start = datetime.datetime.now()
        df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
        total_read_time += datetime.datetime.now() - start

        start = datetime.datetime.now()
        #
        total_add_task += datetime.datetime.now() - start

        start = datetime.datetime.now()
        data = df.to_csv()
        total_to_csv_time += datetime.datetime.now() - start

        start = datetime.datetime.now()
        print(data, file=DEVNULL)
        total_to_print += datetime.datetime.now() - start

    print("total_read_time: {}".format(total_read_time))
    print("total_add_task: {}".format(total_add_task))
    print("total_to_csv_time: {}".format(total_to_csv_time))
    print("total_to_print: {}".format(total_to_print))
    print("total: {}".format(datetime.datetime.now() - main_start))

class Handler():
    def __init__(self, num_workers=1):
        self.num_workers = num_workers
        self.total_num_jobs = 0
        self.jobs_completed = 0
        self.answers_sent = 0
        self.jobs = queue.Queue()
        self.results = queue.Queue()
        self.start_workers()

    def add_task(self, task, *args, **kwargs):
        args = args or ()
        kwargs = kwargs or {}
        self.total_num_jobs += 1
        self.jobs.put((task, args, kwargs))

    def start_workers(self):
        for i in range(self.num_workers):
            t = Thread(target=self.worker)
            t.daemon = True
            t.start()

    def worker(self):
        while True:
            item, args, kwargs = self.jobs.get()
            item(*args, **kwargs)
            self.jobs_completed += 1
            self.jobs.task_done()

    def get_answers(self):
        while self.answers_sent < self.total_num_jobs or self.jobs_completed == 0:
            yield self.results.get()
            self.answers_sent += 1
            self.results.task_done()

def task(task_num, df, q):
    ans = df.to_csv()
    q.put((task_num, ans))

def parallel_test():
    print("------Parallel Test-------")
    main_start = datetime.datetime.now()
    total_read_time = datetime.timedelta(0)
    total_add_task = datetime.timedelta(0)
    total_to_csv_time = datetime.timedelta(0)
    total_to_print = datetime.timedelta(0)
    h = Handler(num_workers=2)
    q = h.results
    answers = {}
    curr_task = 1
    t = 1
    for x in range(NUMS):
        start = datetime.datetime.now()
        df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
        total_read_time += datetime.datetime.now() - start

        start = datetime.datetime.now()
        h.add_task(task, t, df, q)
        t += 1
        total_add_task += datetime.datetime.now() - start

        start = datetime.datetime.now()
        #data = df.to_csv()
        total_to_csv_time += datetime.datetime.now() - start

        start = datetime.datetime.now()
        #print(data, file=DEVNULL)
        total_to_print += datetime.datetime.now() - start

    print("total_read_time: {}".format(total_read_time))
    print("total_add_task: {}".format(total_add_task))
    print("total_to_csv_time: {}".format(total_to_csv_time))
    print("total_to_print: {}".format(total_to_print))
    for task_num, ans in h.get_answers():
        #print("got back: {}".format(task_num, ans))
        answers[task_num] = ans
        if curr_task in answers:
            print(answers[curr_task], file=DEVNULL)
            del answers[curr_task]
            curr_task += 1
    # In case others are left out
    for k, v in answers.items():
        print(k)
    h.jobs.join()  # block until all tasks are done
    print("total: {}".format(datetime.datetime.now() - main_start))

if __name__ == "__main__":
    # linear_test()
    # parallel_test()
    lp = LineProfiler()
    lp_wrapper = lp(linear_test)
    lp_wrapper()
    lp.print_stats()

    lp = LineProfiler()
    lp_wrapper = lp(parallel_test)
    lp_wrapper()
    lp.print_stats()
The output is below. You can see that in the linear test, reading the dataframes only took 4.6 seconds (42% of the total execution), while reading the dataframes in the parallel test took 9.7 seconds (93% of the total execution):
------Linear Test-------
total_read_time: 0:00:04.672765
total_add_task: 0:00:00.001000
total_to_csv_time: 0:00:05.582663
total_to_print: 0:00:00.668319
total: 0:00:10.935723
Timer unit: 1e-07 s
Total time: 10.9309 s
File: ./test.py
Function: linear_test at line 33
Line # Hits Time Per Hit % Time Line Contents
==============================================================
33 def linear_test():
34 1 225.0 225.0 0.0 print("------Linear Test-------")
35 1 76.0 76.0 0.0 main_start = datetime.datetime.now()
36 1 32.0 32.0 0.0 total_read_time = datetime.timedelta(0)
37 1 11.0 11.0 0.0 total_add_task = datetime.timedelta(0)
38 1 9.0 9.0 0.0 total_to_csv_time = datetime.timedelta(0)
39 1 9.0 9.0 0.0 total_to_print = datetime.timedelta(0)
40
41 501 3374.0 6.7 0.0 for x in range(NUMS):
42
43 500 5806.0 11.6 0.0 start = datetime.datetime.now()
44 500 46728029.0 93456.1 42.7 df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
45 500 40199.0 80.4 0.0 total_read_time += datetime.datetime.now() - start
46
47 500 6821.0 13.6 0.0 start = datetime.datetime.now()
48 #
49 500 6916.0 13.8 0.0 total_add_task += datetime.datetime.now() - start
50
51 500 5794.0 11.6 0.0 start = datetime.datetime.now()
52 500 55843605.0 111687.2 51.1 data = df.to_csv()
53 500 53640.0 107.3 0.0 total_to_csv_time += datetime.datetime.now() - start
54
55 500 6798.0 13.6 0.0 start = datetime.datetime.now()
56 500 6589129.0 13178.3 6.0 print(data, file=DEVNULL)
57 500 18258.0 36.5 0.0 total_to_print += datetime.datetime.now() - start
58
59 1 221.0 221.0 0.0 print("total_read_time: {}".format(total_read_time))
60 1 95.0 95.0 0.0 print("total_add_task: {}".format(total_add_task))
61 1 87.0 87.0 0.0 print("total_to_csv_time: {}".format(total_to_csv_time))
62 1 85.0 85.0 0.0 print("total_to_print: {}".format(total_to_print))
63 1 112.0 112.0 0.0 print("total: {}".format(datetime.datetime.now() - main_start))
------Parallel Test-------
total_read_time: 0:00:09.779954
total_add_task: 0:00:00.016984
total_to_csv_time: 0:00:00.003000
total_to_print: 0:00:00.001001
total: 0:00:10.488563
Timer unit: 1e-07 s
Total time: 10.4803 s
File: ./test.py
Function: parallel_test at line 106
Line # Hits Time Per Hit % Time Line Contents
==============================================================
106 def parallel_test():
107 1 100.0 100.0 0.0 print("------Parallel Test-------")
108 1 33.0 33.0 0.0 main_start = datetime.datetime.now()
109 1 24.0 24.0 0.0 total_read_time = datetime.timedelta(0)
110 1 10.0 10.0 0.0 total_add_task = datetime.timedelta(0)
111 1 10.0 10.0 0.0 total_to_csv_time = datetime.timedelta(0)
112 1 10.0 10.0 0.0 total_to_print = datetime.timedelta(0)
113 1 13550.0 13550.0 0.0 h = Handler(num_workers=2)
114 1 15.0 15.0 0.0 q = h.results
115 1 9.0 9.0 0.0 answers = {}
116 1 7.0 7.0 0.0 curr_task = 1
117 1 7.0 7.0 0.0 t = 1
118
119 501 5017.0 10.0 0.0 for x in range(NUMS):
120 500 6545.0 13.1 0.0 start = datetime.datetime.now()
121 500 97761876.0 195523.8 93.3 df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
122 500 45702.0 91.4 0.0 total_read_time += datetime.datetime.now() - start
123
124 500 8259.0 16.5 0.0 start = datetime.datetime.now()
125 500 167269.0 334.5 0.2 h.add_task(task, t, df, q)
126 500 5009.0 10.0 0.0 t += 1
127 500 11865.0 23.7 0.0 total_add_task += datetime.datetime.now() - start
128
129 500 6949.0 13.9 0.0 start = datetime.datetime.now()
130 #data = df.to_csv()
131 500 7921.0 15.8 0.0 total_to_csv_time += datetime.datetime.now() - start
132
133 500 6498.0 13.0 0.0 start = datetime.datetime.now()
134 #print(data, file=DEVNULL)
135 500 8084.0 16.2 0.0 total_to_print += datetime.datetime.now() - start
136
137 1 3321.0 3321.0 0.0 print("total_read_time: {}".format(total_read_time))
138 1 4669.0 4669.0 0.0 print("total_add_task: {}".format(total_add_task))
139 1 1995.0 1995.0 0.0 print("total_to_csv_time: {}".format(total_to_csv_time))
140 1 113037.0 113037.0 0.1 print("total_to_print: {}".format(total_to_print))
141
142 501 176106.0 351.5 0.2 for task_num, ans in h.get_answers():
143 #print("got back: {}".format(task_num, ans))
144 500 5169.0 10.3 0.0 answers[task_num] = ans
145 500 4160.0 8.3 0.0 if curr_task in answers:
146 500 6429159.0 12858.3 6.1 print(answers[curr_task], file=DEVNULL)
147 500 5646.0 11.3 0.0 del answers[curr_task]
148 500 4144.0 8.3 0.0 curr_task += 1
149
150 # In case others are left out
151 1 24.0 24.0 0.0 for k, v in answers.items():
152 print(k)
153
154 1 61.0 61.0 0.0 h.jobs.join() # block until all tasks are done
155
156 1 328.0 328.0 0.0 print("total: {}".format(datetime.datetime.now() - main_start))
Rather than roll your own solution you may want to look at Dask - particularly Dask's Distributed DataFrame if you want to read multiple CSV files into one "virtual" big DataFrame, or Delayed to run functions, as in your example, in parallel across multiple cores. See the light examples here if you scroll down: https://docs.dask.org/en/latest/
Your other lightweight choice is to use Joblib's Parallel interface; this looks a lot like Delayed but with much less functionality. I tend to go for Joblib if I want a lightweight solution, then upgrade to Dask if I need more: https://joblib.readthedocs.io/en/latest/parallel.html
For both tools, if you go down the delayed route, write a function that works in a for loop in series (you have this already), then wrap it in the respective delayed syntax and "it should just work". In both cases it will use all the cores on your machine by default.
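To make the joblib route concrete, here is a minimal sketch under this example's assumptions (the write_one helper and n_jobs=4 are choices of the sketch, not part of the answer above); since to_csv() is CPU-bound, joblib's default process-based loky backend sidesteps the thread contention seen in the profile:
from io import StringIO

import pandas as pd
from joblib import Parallel, delayed

HEADERS = ",a,b,c,d,e,f,g\n"
SAMPLE_CSV = HEADERS + "\n".join(f"{x},{x},{x},{x},{x},{x},{x},{x}" for x in range(4000))

def write_one(csv_text):
    # Hypothetical helper: read one dataframe and render it back to CSV.
    df = pd.read_csv(StringIO(csv_text), header=0, index_col=0)
    return df.to_csv()

# Results come back in submission order, which preserves the required ordering.
results = Parallel(n_jobs=4)(delayed(write_one)(SAMPLE_CSV) for _ in range(20))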

How to merge pandas dataset onto itself based on a condition and a groupby

To best illustrate, consider the following SQL illustration:
Table StockPrices: BarSeqId is a sequential number, where each increment represents information from the next minute of trading.
The goal in a pandas dataframe is to transform this data:
StockPrice BarSeqId LongProfitTarget
105 0 109
100 1 105
103 2 107
103 3 108
104 4 110
105 5 113
into this data:
StockPrice BarSeqId LongProfitTarget TargetHitBarSeqId
106 0 109 Nan
100 1 105 3
103 2 107 5
105 3 108 Nan
104 4 110 Nan
107 5 113 Nan
that is, to create a new column describing the earliest future time-frame (BarSeqId) at which the price target will be hit, relative to the current time-frame.
Here is how it could be achieved in SQL:
SELECT S1.StockPrice, S1.BarSeqId, S1.LongProfitTarget,
min(S2.BarSeqId) as TargetHitBarSeqId
FROM StockPrices S1
left outer join StockPrices S2 on S1.BarSeqId<s2.BarSeqId and
S2.StockPrice>=S1.LongProfitTarget
GROUP BY S1.StockPrice, S1.BarSeqId, S1.LongProfitTarget
I would like the answer to be as follows:
someDataFrame['TargetHitBarSeqId'] = (pandas expression here ...)
assume that someDataFrame already has columns: StockPrice, BarSeqId, LongProfitTarget
The data was edited to illustrate the case:
so in the second row the result should be
100 1 105 3
and NOT
100 1 105 0
since 3, and not 0, occurs after 1.
It is important that the BarSeqId in question occurs in the future (greater than the current BarSeqId).
df = pd.DataFrame({'StockPrice':[105,100,103,105,104,107],'BarSeqId':[0,1,2,3,4,5],
                   'LongProfitTarget':[109,105,107,108,110,113]})

def get_barseqid(longProfitTarget):
    try:
        idx = df.StockPrice[df.StockPrice >= longProfitTarget].index[0]
        return df.iloc[idx].BarSeqId
    except:
        return np.nan

df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget']), axis=1)
Here's one solution:
import pandas as pd
import numpy as np

df = <your input data frame>

def get_barseqid(longProfitTarget):
    try:
        idx = df.StockPrice[df.StockPrice >= longProfitTarget].index[0]
        return df.iloc[idx].BarSeqId
    except:
        return np.nan

df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget']), axis=1)
Output:
StockPrice BarSeqId LongProfitTarget TargetHitBarSeqId
0 100 1 105 3.0
1 103 2 107 5.0
2 105 3 108 NaN
3 104 4 110 NaN
4 107 5 113 NaN
from pathlib import Path
import pandas as pd
from itertools import islice
import numpy as np

df = pd.DataFrame({'StockPrice':[105,100,103,105,104,107],'BarSeqId':[0,1,2,3,4,5],
                   'LongProfitTarget':[109,105,107,108,110,113]})

def get_barseqid(longProfitTarget, barseq):
    try:
        idx = df[(df.StockPrice >= longProfitTarget) & (df.BarSeqId>barseq)].index[0]
        return df.iloc[idx].BarSeqId
    except:
        return np.nan

df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget'], row['BarSeqId']), axis=1)
df
The key misunderstanding for me was the need to use the & operator instead of the plain Python boolean keywords.
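For anyone unfamiliar with why that matters: Python's plain boolean keywords try to evaluate a whole Series as a single True/False, which raises a ValueError, whereas & combines boolean Series element-wise. A small illustration using the question's data:
import pandas as pd

df = pd.DataFrame({'StockPrice': [105, 100, 103, 105, 104, 107],
                   'BarSeqId': [0, 1, 2, 3, 4, 5],
                   'LongProfitTarget': [109, 105, 107, 108, 110, 113]})

mask = (df.StockPrice >= 105) & (df.BarSeqId > 1)   # element-wise AND: works
print(df[mask])

try:
    (df.StockPrice >= 105) and (df.BarSeqId > 1)    # plain keyword: raises
except ValueError as err:
    print(err)  # "The truth value of a Series is ambiguous. ..."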
Assuming data is manageable, consider a cross join followed by filter and groupby, which would replicate the SQL query:
cdf = pd.merge(df.assign(key=1), df.assign(key=1), on='key', suffixes=['','_'])\
        .query('(BarSeqId < BarSeqId_) & (LongProfitTarget <= StockPrice_)')\
        .groupby(['StockPrice', 'BarSeqId', 'LongProfitTarget'])['BarSeqId_'].min()
print(cdf)
# StockPrice BarSeqId LongProfitTarget
# 100 1 105 3
# 103 2 107 5
# Name: BarSeqId_, dtype: int64
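The groupby result cdf is a Series indexed by the three key columns; to attach it back to the original frame as the requested TargetHitBarSeqId column, one possibility (a sketch, reusing the names from the answer above) is a left merge on those keys:
# df and cdf as constructed in the answer above
out = df.merge(cdf.rename('TargetHitBarSeqId').reset_index(),
               on=['StockPrice', 'BarSeqId', 'LongProfitTarget'],
               how='left')
print(out)  # rows whose target is never hit get NaN in TargetHitBarSeqId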

How can I add points to a numpy array more efficiently?

I'm fairly new to python, so I don't know all the tips and tricks quite yet. But I'm trying to read in data line by line from a file, then into a numpy array. I have to read it in line by line in this manner, but I have freedom when it comes to moving that data into the array. Here is the relevant code:
xyzi_point_array = np.zeros((0,4))
x_list = []
y_list = []
z_list = []
i_list = []
points_read = 0

while True:  # FOR EVERY LINE DO:
    line = decryptLine(inFile.readline())  # grabs the next line of data
    if not line: break
    .
    .
    .
    index = 0
    for entry in line:  # FOR EVERY VALUE IN THE LINE
        x_list.append(X)
        y_list.append(Y)
        z_list.append(z_catalog[index])
        i_list.append(entry)
        index += 1
        points_read += 1

xyzi_point_array = np.zeros((points_read,4))
xyzi_point_array[:,0] = x_list
xyzi_point_array[:,1] = y_list
xyzi_point_array[:,2] = z_list
xyzi_point_array[:,3] = i_list
Where X and Y are scalars which are different for each line, and where z_catalog is a 1D numpy array.
For smaller data sets, the embedded for loop is the biggest draw, with the xyzi_point_array[points_read,:] = line assignment taking the majority of the processor time. However, with larger data sets, working with tempArr to expand xyzi_point_array becomes the worst part, so I'll need to optimize both.
Any ideas? General tips on how to better handle numpy arrays are also welcome; I come from a C++ background and am probably not handling these arrays in the most Pythonic way.
For reference, here is the lineprofiler readout for this bit of the code:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
138 150 233 1.6 0.0 index = 0
139 489600 468293 1.0 11.6 for entry in line: #FOR EVERY VALUE IN THE LINE
140 489450 457227 0.9 11.4 x_list.append(lineX)
141 489450 441687 0.9 11.0 y_list.append(lineY)
142 489450 541891 1.1 13.5 z_list.append(z_catalog[index])
143 489450 450191 0.9 11.2 i_list.append(entry)
144 489450 421573 0.9 10.5 index += 1
145 489450 408764 0.8 10.2 points_read += 1
146
149 1 78 78.0 0.0 xyzi_point_array = np.zeros((points_read,4))
150 1 39539 39539.0 1.0 xyzi_point_array[:,0] = x_list
151 1 33876 33876.0 0.8 xyzi_point_array[:,1] = y_list
152 1 48619 48619.0 1.2 xyzi_point_array[:,2] = z_list
153 1 47219 47219.0 1.2 xyzi_point_array[:,3] = i_list
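No answer is quoted here, but a commonly suggested pattern for this kind of loop is to avoid per-value appends entirely: build one small array per line and stack everything once at the end. The following is a rough, self-contained sketch of that idea with synthetic stand-ins for the question's decryptLine/inFile, X, Y and z_catalog:
import numpy as np

rng = np.random.default_rng(0)
z_catalog = rng.normal(size=8)                 # stand-in for the question's z_catalog

rows = []
for line_no in range(3):                       # stand-in for the file-reading while loop
    line = rng.normal(size=8)                  # stand-in for decryptLine(inFile.readline())
    X, Y = float(line_no), 2.0 * line_no       # per-line scalars, as in the question
    n = len(line)
    # Build all four columns for this line at once instead of appending value by value.
    rows.append(np.column_stack((np.full(n, X),
                                 np.full(n, Y),
                                 z_catalog[:n],
                                 line)))

xyzi_point_array = np.vstack(rows) if rows else np.zeros((0, 4))
print(xyzi_point_array.shape)                  # (24, 4)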

Speed up numpy.where for extracting integer segments?

I'm trying to work out how to speed up a Python function which uses numpy. The output I have received from lineprofiler is below, and this shows that the vast majority of the time is spent on the line ind_y, ind_x = np.where(seg_image == i).
seg_image is an integer array which is the result of segmenting an image, thus finding the pixels where seg_image == i extracts a specific segmented object. I am looping through lots of these objects (in the code below I'm just looping through 5 for testing, but I'll actually be looping through over 20,000), and it takes a long time to run!
Is there any way in which the np.where call can be sped up? Or, alternatively, can the penultimate line (which also takes a good proportion of the time) be sped up?
The ideal solution would be to run the code on the whole array at once, rather than looping, but I don't think this is possible as there are side-effects to some of the functions I need to run (for example, dilating a segmented object can make it 'collide' with the next region and thus give incorrect results later on).
Does anyone have any ideas?
Line # Hits Time Per Hit % Time Line Contents
==============================================================
5 def correct_hot(hot_image, seg_image):
6 1 239810 239810.0 2.3 new_hot = hot_image.copy()
7 1 572966 572966.0 5.5 sign = np.zeros_like(hot_image) + 1
8 1 67565 67565.0 0.6 sign[:,:] = 1
9 1 1257867 1257867.0 12.1 sign[hot_image > 0] = -1
10
11 1 150 150.0 0.0 s_elem = np.ones((3, 3))
12
13 #for i in xrange(1,seg_image.max()+1):
14 6 57 9.5 0.0 for i in range(1,6):
15 5 6092775 1218555.0 58.5 ind_y, ind_x = np.where(seg_image == i)
16
17 # Get the average HOT value of the object (really simple!)
18 5 2408 481.6 0.0 obj_avg = hot_image[ind_y, ind_x].mean()
19
20 5 333 66.6 0.0 miny = np.min(ind_y)
21
22 5 162 32.4 0.0 minx = np.min(ind_x)
23
24
25 5 369 73.8 0.0 new_ind_x = ind_x - minx + 3
26 5 113 22.6 0.0 new_ind_y = ind_y - miny + 3
27
28 5 211 42.2 0.0 maxy = np.max(new_ind_y)
29 5 143 28.6 0.0 maxx = np.max(new_ind_x)
30
31 # 7 is + 1 to deal with the zero-based indexing, + 2 * 3 to deal with the 3 cell padding above
32 5 217 43.4 0.0 obj = np.zeros( (maxy+7, maxx+7) )
33
34 5 158 31.6 0.0 obj[new_ind_y, new_ind_x] = 1
35
36 5 2482 496.4 0.0 dilated = ndimage.binary_dilation(obj, s_elem)
37 5 1370 274.0 0.0 border = mahotas.borders(dilated)
38
39 5 122 24.4 0.0 border = np.logical_and(border, dilated)
40
41 5 355 71.0 0.0 border_ind_y, border_ind_x = np.where(border == 1)
42 5 136 27.2 0.0 border_ind_y = border_ind_y + miny - 3
43 5 123 24.6 0.0 border_ind_x = border_ind_x + minx - 3
44
45 5 645 129.0 0.0 border_avg = hot_image[border_ind_y, border_ind_x].mean()
46
47 5 2167729 433545.8 20.8 new_hot[seg_image == i] = (new_hot[ind_y, ind_x] + (sign[ind_y, ind_x] * np.abs(obj_avg - border_avg)))
48 5 10179 2035.8 0.1 print obj_avg, border_avg
49
50 1 4 4.0 0.0 return new_hot
EDIT I have left my original answer at the bottom for the record, but I have actually looked into your code in more detail over lunch, and I think that using np.where is a big mistake:
In [63]: a = np.random.randint(100, size=(1000, 1000))
In [64]: %timeit a == 42
1000 loops, best of 3: 950 us per loop
In [65]: %timeit np.where(a == 42)
100 loops, best of 3: 7.55 ms per loop
You could get a boolean array (that you can use for indexing) in 1/8 of the time you need to get the actual coordinates of the points!!!
There is of course the cropping of the features that you do, but ndimage has a find_objects function that returns enclosing slices, and appears to be very fast:
In [66]: %timeit ndimage.find_objects(a)
100 loops, best of 3: 11.5 ms per loop
This returns a list of tuples of slices enclosing all of your objects, in 50% more time than it takes to find the indices of one single object.
It may not work out of the box as I cannot test it right now, but I would restructure your code into something like the following:
def correct_hot_bis(hot_image, seg_image):
    # Need this to not index out of bounds when computing border_avg
    hot_image_padded = np.pad(hot_image, 3, mode='constant',
                              constant_values=0)
    new_hot = hot_image.copy()
    sign = np.ones_like(hot_image, dtype=np.int8)
    sign[hot_image > 0] = -1
    s_elem = np.ones((3, 3))

    for j, slice_ in enumerate(ndimage.find_objects(seg_image)):
        hot_image_view = hot_image[slice_]
        seg_image_view = seg_image[slice_]
        new_shape = tuple(dim+6 for dim in hot_image_view.shape)
        new_slice = tuple(slice(dim.start,
                                dim.stop+6,
                                None) for dim in slice_)
        indices = seg_image_view == j+1

        obj_avg = hot_image_view[indices].mean()

        obj = np.zeros(new_shape)
        obj[3:-3, 3:-3][indices] = True
        dilated = ndimage.binary_dilation(obj, s_elem)
        border = mahotas.borders(dilated)
        border &= dilated

        border_avg = hot_image_padded[new_slice][border == 1].mean()

        new_hot[slice_][indices] += (sign[slice_][indices] *
                                     np.abs(obj_avg - border_avg))
    return new_hot
You would still need to figure out the collisions, but you could get about a 2x speed-up by computing all the indices simultaneously using a np.unique based approach:
a = np.random.randint(100, size=(1000, 1000))

def get_pos(arr):
    pos = []
    for j in xrange(100):
        pos.append(np.where(arr == j))
    return pos

def get_pos_bis(arr):
    unq, flat_idx = np.unique(arr, return_inverse=True)
    pos = np.argsort(flat_idx)
    counts = np.bincount(flat_idx)
    cum_counts = np.cumsum(counts)
    multi_dim_idx = np.unravel_index(pos, arr.shape)
    return zip(*(np.split(coords, cum_counts) for coords in multi_dim_idx))
In [33]: %timeit get_pos(a)
1 loops, best of 3: 766 ms per loop
In [34]: %timeit get_pos_bis(a)
1 loops, best of 3: 388 ms per loop
Note that the pixels for each object are returned in a different order, so you can't simply compare the returns of both functions to assess equality. But they should both return the same.
One thing you could do to save a little bit of time is to cache the result of seg_image == i so that you don't need to compute it twice: you're computing it on lines 15 and 47, so you could add seg_mask = seg_image == i and then reuse that result. (It might also be good to separate out that piece for profiling purposes.)
While there are some other minor things you could do to eke out a little bit of performance, the root issue is that you're using an O(M * N) algorithm, where M is the number of segments and N is the size of your image. It's not obvious to me from your code whether there is a faster algorithm to accomplish the same thing, but that's the first place I'd look for a speedup.
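As a small illustration of that last suggestion only (the update expression is simplified here; the point is just computing the mask once and reusing it for both the coordinate lookup and the final assignment):
import numpy as np

rng = np.random.default_rng(0)
seg_image = rng.integers(0, 6, size=(512, 512))
hot_image = rng.normal(size=(512, 512))
new_hot = hot_image.copy()

for i in range(1, seg_image.max() + 1):
    seg_mask = seg_image == i                  # computed once...
    ind_y, ind_x = np.where(seg_mask)          # ...reused for the coordinate lookup
    obj_avg = hot_image[ind_y, ind_x].mean()
    new_hot[seg_mask] = hot_image[seg_mask] + obj_avg   # ...and reused for the assignment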
