I have a bunch of pandas DataFrames I would like to write out to some format (CSV, JSON, etc.), preserving the order in which the DataFrames were read. Unfortunately .to_csv() can take some time, sometimes 2x longer than just reading the DataFrame.
Let's take the image as an example:
Here you can see the task running linearly: read a DataFrame, write it out, then repeat for the remaining DataFrames. This can take about 3x longer than just reading the DataFrames. Theoretically, if we could push the writing (to_csv()) to separate threads (2 threads, plus the main thread reading), the total execution time could approach a third of the linear (synchronous) version. Of course, with just 3 reads it only looks about half as fast, but the more DataFrames you read, the bigger the speedup should be (theoretically).
Unfortunately, in practice it does not work like that. I am getting a very small gain in performance, and the read time actually gets longer. This might be because to_csv() is CPU intensive and uses all the resources in the process; since everything is multithreaded, it all shares the same resources, so there is not much to gain.
So my question is: how can I improve the code to get performance closer to the theoretical numbers? I tried using multiprocessing but failed to get working code. How can I do this with multiprocessing? Are there other ways I could improve the total execution time of such a task?
Here's my sample code using multiple threads:
import pandas as pd
import datetime
import os
from threading import Thread
import queue
from io import StringIO
from line_profiler import LineProfiler
NUMS = 500
DEVNULL = open(os.devnull, 'w')
HEADERS = ",a,b,c,d,e,f,g\n"
SAMPLE_CSV = HEADERS + "\n".join([f"{x},{x},{x},{x},{x},{x},{x},{x}" for x in range(4000)])
def linear_test():
    print("------Linear Test-------")
    main_start = datetime.datetime.now()
    total_read_time = datetime.timedelta(0)
    total_add_task = datetime.timedelta(0)
    total_to_csv_time = datetime.timedelta(0)
    total_to_print = datetime.timedelta(0)

    for x in range(NUMS):
        start = datetime.datetime.now()
        df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
        total_read_time += datetime.datetime.now() - start

        start = datetime.datetime.now()
        #
        total_add_task += datetime.datetime.now() - start

        start = datetime.datetime.now()
        data = df.to_csv()
        total_to_csv_time += datetime.datetime.now() - start

        start = datetime.datetime.now()
        print(data, file=DEVNULL)
        total_to_print += datetime.datetime.now() - start

    print("total_read_time: {}".format(total_read_time))
    print("total_add_task: {}".format(total_add_task))
    print("total_to_csv_time: {}".format(total_to_csv_time))
    print("total_to_print: {}".format(total_to_print))
    print("total: {}".format(datetime.datetime.now() - main_start))
class Handler():
    def __init__(self, num_workers=1):
        self.num_workers = num_workers
        self.total_num_jobs = 0
        self.jobs_completed = 0
        self.answers_sent = 0
        self.jobs = queue.Queue()
        self.results = queue.Queue()
        self.start_workers()

    def add_task(self, task, *args, **kwargs):
        args = args or ()
        kwargs = kwargs or {}
        self.total_num_jobs += 1
        self.jobs.put((task, args, kwargs))

    def start_workers(self):
        for i in range(self.num_workers):
            t = Thread(target=self.worker)
            t.daemon = True
            t.start()

    def worker(self):
        while True:
            item, args, kwargs = self.jobs.get()
            item(*args, **kwargs)
            self.jobs_completed += 1
            self.jobs.task_done()

    def get_answers(self):
        while self.answers_sent < self.total_num_jobs or self.jobs_completed == 0:
            yield self.results.get()
            self.answers_sent += 1
            self.results.task_done()

def task(task_num, df, q):
    ans = df.to_csv()
    q.put((task_num, ans))
def parallel_test():
    print("------Parallel Test-------")
    main_start = datetime.datetime.now()
    total_read_time = datetime.timedelta(0)
    total_add_task = datetime.timedelta(0)
    total_to_csv_time = datetime.timedelta(0)
    total_to_print = datetime.timedelta(0)
    h = Handler(num_workers=2)
    q = h.results
    answers = {}
    curr_task = 1
    t = 1

    for x in range(NUMS):
        start = datetime.datetime.now()
        df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
        total_read_time += datetime.datetime.now() - start

        start = datetime.datetime.now()
        h.add_task(task, t, df, q)
        t += 1
        total_add_task += datetime.datetime.now() - start

        start = datetime.datetime.now()
        #data = df.to_csv()
        total_to_csv_time += datetime.datetime.now() - start

        start = datetime.datetime.now()
        #print(data, file=DEVNULL)
        total_to_print += datetime.datetime.now() - start

    print("total_read_time: {}".format(total_read_time))
    print("total_add_task: {}".format(total_add_task))
    print("total_to_csv_time: {}".format(total_to_csv_time))
    print("total_to_print: {}".format(total_to_print))

    for task_num, ans in h.get_answers():
        #print("got back: {}".format(task_num, ans))
        answers[task_num] = ans
        if curr_task in answers:
            print(answers[curr_task], file=DEVNULL)
            del answers[curr_task]
            curr_task += 1

    # In case others are left out
    for k, v in answers.items():
        print(k)

    h.jobs.join()  # block until all tasks are done

    print("total: {}".format(datetime.datetime.now() - main_start))
if __name__ == "__main__":
    # linear_test()
    # parallel_test()

    lp = LineProfiler()
    lp_wrapper = lp(linear_test)
    lp_wrapper()
    lp.print_stats()

    lp = LineProfiler()
    lp_wrapper = lp(parallel_test)
    lp_wrapper()
    lp.print_stats()
The output is below. You can see that in the linear test, reading the DataFrames only took 4.6 seconds (42% of the total execution), but reading the DataFrames in the parallel test took 9.7 seconds (93% of the total execution):
------Linear Test-------
total_read_time: 0:00:04.672765
total_add_task: 0:00:00.001000
total_to_csv_time: 0:00:05.582663
total_to_print: 0:00:00.668319
total: 0:00:10.935723
Timer unit: 1e-07 s
Total time: 10.9309 s
File: ./test.py
Function: linear_test at line 33
Line # Hits Time Per Hit % Time Line Contents
==============================================================
33 def linear_test():
34 1 225.0 225.0 0.0 print("------Linear Test-------")
35 1 76.0 76.0 0.0 main_start = datetime.datetime.now()
36 1 32.0 32.0 0.0 total_read_time = datetime.timedelta(0)
37 1 11.0 11.0 0.0 total_add_task = datetime.timedelta(0)
38 1 9.0 9.0 0.0 total_to_csv_time = datetime.timedelta(0)
39 1 9.0 9.0 0.0 total_to_print = datetime.timedelta(0)
40
41 501 3374.0 6.7 0.0 for x in range(NUMS):
42
43 500 5806.0 11.6 0.0 start = datetime.datetime.now()
44 500 46728029.0 93456.1 42.7 df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
45 500 40199.0 80.4 0.0 total_read_time += datetime.datetime.now() - start
46
47 500 6821.0 13.6 0.0 start = datetime.datetime.now()
48 #
49 500 6916.0 13.8 0.0 total_add_task += datetime.datetime.now() - start
50
51 500 5794.0 11.6 0.0 start = datetime.datetime.now()
52 500 55843605.0 111687.2 51.1 data = df.to_csv()
53 500 53640.0 107.3 0.0 total_to_csv_time += datetime.datetime.now() - start
54
55 500 6798.0 13.6 0.0 start = datetime.datetime.now()
56 500 6589129.0 13178.3 6.0 print(data, file=DEVNULL)
57 500 18258.0 36.5 0.0 total_to_print += datetime.datetime.now() - start
58
59 1 221.0 221.0 0.0 print("total_read_time: {}".format(total_read_time))
60 1 95.0 95.0 0.0 print("total_add_task: {}".format(total_add_task))
61 1 87.0 87.0 0.0 print("total_to_csv_time: {}".format(total_to_csv_time))
62 1 85.0 85.0 0.0 print("total_to_print: {}".format(total_to_print))
63 1 112.0 112.0 0.0 print("total: {}".format(datetime.datetime.now() - main_start))
------Parallel Test-------
total_read_time: 0:00:09.779954
total_add_task: 0:00:00.016984
total_to_csv_time: 0:00:00.003000
total_to_print: 0:00:00.001001
total: 0:00:10.488563
Timer unit: 1e-07 s
Total time: 10.4803 s
File: ./test.py
Function: parallel_test at line 106
Line # Hits Time Per Hit % Time Line Contents
==============================================================
106 def parallel_test():
107 1 100.0 100.0 0.0 print("------Parallel Test-------")
108 1 33.0 33.0 0.0 main_start = datetime.datetime.now()
109 1 24.0 24.0 0.0 total_read_time = datetime.timedelta(0)
110 1 10.0 10.0 0.0 total_add_task = datetime.timedelta(0)
111 1 10.0 10.0 0.0 total_to_csv_time = datetime.timedelta(0)
112 1 10.0 10.0 0.0 total_to_print = datetime.timedelta(0)
113 1 13550.0 13550.0 0.0 h = Handler(num_workers=2)
114 1 15.0 15.0 0.0 q = h.results
115 1 9.0 9.0 0.0 answers = {}
116 1 7.0 7.0 0.0 curr_task = 1
117 1 7.0 7.0 0.0 t = 1
118
119 501 5017.0 10.0 0.0 for x in range(NUMS):
120 500 6545.0 13.1 0.0 start = datetime.datetime.now()
121 500 97761876.0 195523.8 93.3 df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
122 500 45702.0 91.4 0.0 total_read_time += datetime.datetime.now() - start
123
124 500 8259.0 16.5 0.0 start = datetime.datetime.now()
125 500 167269.0 334.5 0.2 h.add_task(task, t, df, q)
126 500 5009.0 10.0 0.0 t += 1
127 500 11865.0 23.7 0.0 total_add_task += datetime.datetime.now() - start
128
129 500 6949.0 13.9 0.0 start = datetime.datetime.now()
130 #data = df.to_csv()
131 500 7921.0 15.8 0.0 total_to_csv_time += datetime.datetime.now() - start
132
133 500 6498.0 13.0 0.0 start = datetime.datetime.now()
134 #print(data, file=DEVNULL)
135 500 8084.0 16.2 0.0 total_to_print += datetime.datetime.now() - start
136
137 1 3321.0 3321.0 0.0 print("total_read_time: {}".format(total_read_time))
138 1 4669.0 4669.0 0.0 print("total_add_task: {}".format(total_add_task))
139 1 1995.0 1995.0 0.0 print("total_to_csv_time: {}".format(total_to_csv_time))
140 1 113037.0 113037.0 0.1 print("total_to_print: {}".format(total_to_print))
141
142 501 176106.0 351.5 0.2 for task_num, ans in h.get_answers():
143 #print("got back: {}".format(task_num, ans))
144 500 5169.0 10.3 0.0 answers[task_num] = ans
145 500 4160.0 8.3 0.0 if curr_task in answers:
146 500 6429159.0 12858.3 6.1 print(answers[curr_task], file=DEVNULL)
147 500 5646.0 11.3 0.0 del answers[curr_task]
148 500 4144.0 8.3 0.0 curr_task += 1
149
150 # In case others are left out
151 1 24.0 24.0 0.0 for k, v in answers.items():
152 print(k)
153
154 1 61.0 61.0 0.0 h.jobs.join() # block until all tasks are done
155
156 1 328.0 328.0 0.0 print("total: {}".format(datetime.datetime.now() - main_start))
Rather than rolling your own solution, you may want to look at Dask - particularly Dask's Distributed DataFrame if you want to read multiple CSV files into one "virtual" big DataFrame, or Delayed to run functions, as in your example, in parallel across multiple cores. See light examples here if you scroll down: https://docs.dask.org/en/latest/
Your other lightweight choice is to use Joblib's Parallel interface; this looks exactly like Delayed but with much less functionality. I tend to go for Joblib if I want a lightweight solution, then upgrade to Dask if I need more: https://joblib.readthedocs.io/en/latest/parallel.html
For both tools if you go down the delayed route - write a function that works in a for loop in series (you have this already), then wrap it in the respective delayed syntax and "it should just work". In both cases by default it'll use all the cores on your machine.
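For the question above, a minimal joblib sketch could look like the following. This is only an illustration, not the library's prescribed pattern: the function name read_and_serialize is made up, and it assumes the same SAMPLE_CSV setup as the sample code.

import pandas as pd
from io import StringIO
from joblib import Parallel, delayed

NUMS = 500
HEADERS = ",a,b,c,d,e,f,g\n"
SAMPLE_CSV = HEADERS + "\n".join(f"{x},{x},{x},{x},{x},{x},{x},{x}" for x in range(4000))

def read_and_serialize(_):
    # read one DataFrame and serialize it; both the read and the
    # CPU-heavy to_csv() run inside a worker process, so the GIL
    # of the main process is not a bottleneck
    df = pd.read_csv(StringIO(SAMPLE_CSV), header=0, index_col=0)
    return df.to_csv()

if __name__ == "__main__":
    # n_jobs=-1 uses all cores; joblib returns results in submission
    # order, which preserves the original ordering of the DataFrames
    results = Parallel(n_jobs=-1)(delayed(read_and_serialize)(i) for i in range(NUMS))

The same loop body wrapped in dask.delayed should work analogously.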
Related
I have a URL mask and dynamic values from a dictionary that I pass into that mask to generate a URL. There is an XML file at each URL. I fetch that XML, make a DataFrame, and fill one column with one of the values from said dictionary. In the end I generate a list of DataFrames to work with further.
My code executes pretty slowly; I imagine that is because I have something in my iteration that could be refactored. Is there any way I can make it faster, or is it limited by the GET requests?
This is my go-to algorithm. At first I tried to save the XML files locally and only then parse them into DataFrames, but that obviously takes longer. I also tried breaking the functions into smaller ones, with the same effect.
import os

import pandas as pd
import requests

FILE_FORMAT = 'xml'
CURRENT_DIR = os.path.abspath('')
SAVE_DIR = os.path.join(CURRENT_DIR, 'report')
REPORT = 'oo1'
YEAR = '2022-2023'
BASE_URL = 'sensetive_link'
def create_source(file_name):
    df = pd.read_excel(f'{file_name}.xlsx', dtype=object)
    columns = df.columns.tolist()
    result = {
        school: df[item].dropna(how='all').tolist() for school, item in zip(
            columns, df
        )
    }
    return result
def download_xml_to_df_list(source_dict):
    df_list = []
    fillers = {
        'base_url': BASE_URL,
        'year': YEAR,
        'report': REPORT,
        'file_format': FILE_FORMAT,
    }
    count = 0
    length = sum([len(i) for i in source_dict.values()])
    for mouo, school in source_dict.items():
        for num, i in enumerate(range(len(source_dict[mouo])), 1):
            try:
                url = (
                    '{base_url}/{year}ob/{report}/61/{mouo}/oo1_{school}.{file_format}'
                    .format(**fillers, mouo=mouo, school=source_dict[mouo][i])
                )
                df = pd.read_xml(requests.get(url).text, xpath='//item')
                df['value'] = df['value'].astype('float64')
                df.index = [source_dict[mouo][i]] * len(df)
                df_list.append(df)
                count += 1
                message = f'parsed {count} out of {length}'
                print(message, end='\r')
            except Exception as error:
                print(f"{url} doesn't exist")
    print('\ndone')
    return df_list
I was using the time library to measure execution time, and it says
excecuted in 131.20987153053284
I'm using a Jupyter notebook, but from what I've read that doesn't affect execution time.
EDIT:
As per Paul H's advice I did line profiling. Results:
Function: download_xml_to_df_list at line 39
Line # Hits Time Per Hit % Time Line Contents
39 #profile
40 def download_xml_to_df_list(source_dict):
41 1 6.0 6.0 0.0 df_list = []
42 fillers = {
43 1 5.0 5.0 0.0 'base_url': BASE_URL,
44 1 4.0 4.0 0.0 'year': YEAR,
45 1 4.0 4.0 0.0 'report': REPORT,
46 1 11.0 11.0 0.0 'file_format': FILE_FORMAT,
47 }
48 1 4.0 4.0 0.0 count = 0
49 1 108.0 108.0 0.0 length = sum([len(i) for i in source.values()])
50 17 143.0 8.4 0.0 for mouo, school in source_dict.items():
51 173 1941.0 11.2 0.0 for num, i in enumerate(range(len(source_dict[mouo])), 1):
52 173 700.0 4.0 0.0 try:
53 url = (
54 173 1765.0 10.2 0.0 '{base_url}/{year}ob/{report}/61/{mouo}/oo1_{school}.{file_format}'
55 173 13534.0 78.2 0.0 .format(**fillers, mouo=mouo, school=source_dict[mouo][i])
56 )
57 167 1892702357.0 11333547.0 99.9 df = pd.read_xml(requests.get(url).text, xpath='//item')
58 167 563079.0 3371.7 0.0 print(f'fetched url {url}')
59 167 1660282.0 9941.8 0.1 df['value'] = df['value'].astype('float64')
60 167 511147.0 3060.8 0.0 df.index = [source_dict[mouo][i]] * len(df)
61 167 1869.0 11.2 0.0 df_list.append(df)
62 167 1108.0 6.6 0.0 count += 1
63 167 3864.0 23.1 0.0 message = f'parsed {count} out of {length}'
64 167 41371.0 247.7 0.0 print(message, end='\r')
65 6 94.0 15.7 0.0 except Exception as error:
66 6 5053.0 842.2 0.0 print(f"{url} doesn't exist")
67 print_stats()
68 return df_list
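Since the profile shows that essentially all of the time (99.9%) is spent waiting on requests.get, one way to speed this up is to overlap the downloads with a thread pool. Below is only a rough sketch, assuming the server tolerates concurrent requests; url_label_pairs is a hypothetical list of (url, school) tuples built the same way the loop above builds them.

from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests

def fetch_one(url, label):
    # the network wait dominates, so threads can overlap many GETs
    df = pd.read_xml(requests.get(url).text, xpath='//item')
    df['value'] = df['value'].astype('float64')
    df.index = [label] * len(df)
    return df

def download_parallel(url_label_pairs, max_workers=16):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_one, url, label) for url, label in url_label_pairs]
        results = []
        for future in futures:
            try:
                results.append(future.result())  # keeps submission order
            except Exception:
                pass  # URL didn't exist or the XML failed to parse
        return results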
Which month has the highest median maximum gust speed out of all the available records? Also find the respective value.
The data set looks like below
Day Average temperature (°F) Average humidity (%) Average dewpoint (°F) Average barometer (in) Average windspeed (mph) Average gustspeed (mph) Average direction (°deg) Rainfall for month (in) Rainfall for year (in) Maximum rain per minute Maximum temperature (°F) Minimum temperature (°F) Maximum humidity (%) Minimum humidity (%) Maximum pressure Minimum pressure Maximum windspeed (mph) Maximum gust speed (mph) Maximum heat index (°F)
0 1/01/2009 37.8 35 12.7 29.7 26.4 36.8 274 0.0 0.0 0.0 40.1 34.5 44 27 29.762 29.596 41.4 59.0 40.1
1 2/01/2009 43.2 32 14.7 29.5 12.8 18.0 240 0.0 0.0 0.0 52.8 37.5 43 16 29.669 29.268 35.7 51.0 52.8
2 3/01/2009 25.7 60 12.7 29.7 8.3 12.2 290 0.0 0.0 0.0 41.2 6.7 89 35 30.232 29.260 25.3 38.0 41.2
3 4/01/2009 9.3 67 0.1 30.4 2.9 4.5 47 0.0 0.0 0.0 19.4 -0.0 79 35 30.566 30.227 12.7 20.0 32.0
4 5/01/2009 23.5 30 -5.3 29.9 16.7 23.1 265 0.0 0.0 0.0 30.3 15.1 56 13 30.233 29.568 38.0 53.0 32.0
The code I have written is below; however, the test case fails.
Code :
data1= data[data['Maximum gust speed (mph)']!= 0.0]
#print(data1.count())
#print(data.count())
#print(data.median())
#print(data1.median())
max_gust_value_median = data1.groupby(pd.DatetimeIndex(data1['Day']).month).agg({'Maximum gust speed (mph)':pd.Series.median})
#print(max_gust_value_median)
max_gust_month = "max_gust_month = " + str(max_gust_value_median.idxmax()[0])
max_gust_value = "max_gust_value = " + format((max_gust_value_median.max()[0]),'.2f')
print(max_gust_value)
print(max_gust_month)
Output :
max_gust_value = 32.20
max_gust_month = 11
Error :
=================================== FAILURES ===================================
_____________________________ test_max_gust_month ______________________________
def test_max_gust_month():
assert hash_dict["max_gust_month"] == answer_dict["max_gust_month"]
E AssertionError: assert 'd1aecb72eff6...7412c2a651d81' == 'e6e3cedb0dc6...798711404a6c8'
E - e6e3cedb0dc67a96317798711404a6c8
E + d1aecb72eff64d1169f7412c2a651d81
test.py:52: AssertionError
_____________________________ test_max_gust_value ______________________________
def test_max_gust_value():
assert hash_dict["max_gust_value"] == answer_dict["max_gust_value"]
E AssertionError: assert '6879064548a1...2361f91ecd7b0' == '5818ebe448c4...471e93c92d545'
E - 5818ebe448c43f2dfed471e93c92d545
E + 6879064548a136da2f22361f91ecd7b0
test.py:55: AssertionError
=========================== short test summary info ============================
FAILED test.py::test_max_gust_month - AssertionError: assert 'd1aecb72eff6......
FAILED test.py::test_max_gust_value - AssertionError: assert '6879064548a1......
========================= 2 failed, 9 passed in 0.13s ==========================
The following code works:
data['Month'] = pd.to_datetime(data['Day'], dayfirst=True).dt.strftime('%B')
month_list = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
month_grp = data.groupby(['Month'])
month_name_value_all = []
max_value = []
for i in month_list:
    month_name_value = []
    value = month_grp.get_group(i).median().loc['Maximum gust speed (mph)']
    month_name_value.append(i)
    max_value.append(value)
    month_name_value.append(value)
    month_name_value_all.append(month_name_value)
max_gust_value = format(max(max_value), '.2f')
for j in month_name_value_all:
    month_max_find = []
    month_max_find.append(j)
    if max_gust_value in j:
        break
max_gust_month = month_max_find[0][0]
print("max_gust_value = ", max_gust_value)
print("max_gust_month = ", max_gust_month)
You can try this way:
#Convert Day column values to datetime
df['Date'] = pd.to_datetime(df['Day'], format='%d/%m/%Y')
#Create a new month_index column
df['month_index'] = df['Date'].dt.month
#Group the dataframe by month, then take the median of the maximum gust speed
max_gust_median_month = df.groupby(['month_index'])['Maximum gust speed (mph)'].median()
#Find the largest monthly median
max_gust_value = max_gust_median_month.max()
max_gust_value
#Find the month (index) of that largest median
max_gust_month = max_gust_median_month.idxmax()
max_gust_month
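A more compact variant of the same idea, as a sketch: it assumes the DataFrame is called data as in the question and that the Day column is day-first, as the sample shows.

import pandas as pd

# median of the daily maximum gust speed per calendar month,
# then the month with the largest median and that median itself
monthly_median = (
    data.assign(month=pd.to_datetime(data['Day'], dayfirst=True).dt.month)
        .groupby('month')['Maximum gust speed (mph)']
        .median()
)
max_gust_month = monthly_median.idxmax()
max_gust_value = round(float(monthly_median.max()), 2)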
Here is an example of the data we want to process:
df_size = 1000000
df_random = pd.DataFrame({'boat_id' : np.random.choice(range(300),df_size),
'X' :np.random.random_integers(0,1000,df_size),
'target_Y' :np.random.random_integers(0,10,df_size)})
X boat_id target_Y
0 482 275 6
1 705 245 4
2 328 102 6
3 631 227 6
4 234 236 8
...
I want to obtain an output like this:
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 target_Y boat_id
40055 684.0 692.0 950.0 572.0 442.0 850.0 75.0 140.0 382.0 576.0 0.0 1
40056 178.0 949.0 490.0 777.0 335.0 559.0 397.0 729.0 701.0 44.0 4.0 1
40057 21.0 818.0 341.0 577.0 612.0 57.0 303.0 183.0 519.0 357.0 0.0 1
40058 501.0 1000.0 999.0 532.0 765.0 913.0 964.0 922.0 772.0 534.0 1.0 2
40059 305.0 906.0 724.0 996.0 237.0 197.0 414.0 171.0 369.0 299.0 8.0 2
40060 408.0 796.0 815.0 638.0 691.0 598.0 913.0 579.0 650.0 955.0 2.0 3
40061 298.0 512.0 247.0 824.0 764.0 414.0 71.0 440.0 135.0 707.0 9.0 4
40062 535.0 687.0 945.0 859.0 718.0 580.0 427.0 284.0 122.0 777.0 2.0 4
40063 352.0 115.0 228.0 69.0 497.0 387.0 552.0 473.0 574.0 759.0 3.0 4
40064 179.0 870.0 862.0 186.0 25.0 125.0 925.0 310.0 335.0 739.0 7.0 4
...
I wrote the following code, but it is way too slow.
It groups by boat_id, cuts with enumerate, transposes, then merges the results into one pandas DataFrame:
start_time = time.time()
N = 10
col_names = map(lambda x: 'X'+str(x), range(N))
compil = pd.DataFrame(columns=col_names)
i = 0

# I group by boat ID
for boat_id, df_boat in df_random.groupby('boat_id'):
    # then I cut every 50 lines
    for (line_number, (index, row)) in enumerate(df_boat.iterrows()):
        if line_number % 5 == 0:
            compil_new_line_X = list(df_boat.iloc[line_number-N:line_number, :]["X"])
            # filter to avoid issues at the start and end of the columns
            if len(compil_new_line_X) == N:
                compil.loc[i, col_names] = compil_new_line_X
                compil.loc[i, 'target_Y'] = row['target_Y']
                compil.loc[i, 'boat_id'] = row['boat_id']
                i += 1
print("Total %s seconds" % (time.time() - start_time))
Total 232.947000027 seconds
My questions are:
How do I do something every "x number of lines" and then merge the results?
Is there a way to vectorize that kind of operation?
Here is a solution that improves calculation time by 35%.
It uses a groupby on 'boat_id', then groupby.apply to divide the groups into small chunks.
Then a final apply creates the new lines. We can probably still improve it.
df_size = 1000000
df_random = pd.DataFrame({'boat_id' : np.random.choice(range(300), df_size),
                          'X' : np.random.random_integers(0, 1000, df_size),
                          'target_Y' : np.random.random_integers(0, 10, df_size)})

start_time = time.time()
len_of_chunks = 10
col_names = map(lambda x: 'X'+str(x), range(len_of_chunks)) + ['boat_id', 'target_Y']

def prepare_data(group):
    # this function creates the new line we will put in 'compil'
    info_we_want_to_keep = ['boat_id', 'target_Y']
    info_and_target = group.tail(1)[info_we_want_to_keep].values
    k = group["X"]
    return np.hstack([k.values, info_and_target[0]])

# we group by ID (boat)
# we divide into chunks of len "len_of_chunks"
# we apply prepare_data to each chunk
groups = df_random.groupby('boat_id').apply(lambda x: x.groupby(np.arange(len(x)) // len_of_chunks).apply(prepare_data))

# we reset the index
# we take the '0' column containing the valuable info
# we put the info in a new 'compil' dataframe
# we drop incomplete lines (generated by chunks shorter than len_of_chunks)
compil = pd.DataFrame(groups.reset_index()[0].values.tolist(), columns=col_names).dropna()

print("Total %s seconds" % (time.time() - start_time))
Total 153.781999826 seconds
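If the chunks are strictly consecutive rows, the nested apply can be skipped entirely by reshaping each group's X values with NumPy. This is only a sketch under that assumption; it reuses df_random from above, the helper name chunk_group is made up, and it needs a reasonably recent pandas for .to_numpy().

import numpy as np
import pandas as pd

def chunk_group(group, n=10):
    # drop the trailing partial chunk, then reshape X into rows of n consecutive values
    usable = (len(group) // n) * n
    out = pd.DataFrame(group['X'].to_numpy()[:usable].reshape(-1, n),
                       columns=['X' + str(i) for i in range(n)])
    # take target_Y and boat_id from the last row of each chunk,
    # matching what prepare_data does with group.tail(1)
    tail = group.iloc[n - 1:usable:n]
    out['target_Y'] = tail['target_Y'].to_numpy()
    out['boat_id'] = tail['boat_id'].to_numpy()
    return out

compil = pd.concat([chunk_group(g) for _, g in df_random.groupby('boat_id')],
                   ignore_index=True)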
I want to see the change in memory for every step in my function.
I have written code for interpolation search, and even with an input list as large as 10000 elements there is still no change in memory.
The code is:
import time
from memory_profiler import profile

@profile()
def interpolation_search(numbers, value):
    low = 0
    high = len(numbers) - 1
    mid = 0

    while numbers[low] <= value and numbers[high] >= value:
        mid = low + ((value - numbers[low]) * (high - low)) / (numbers[high] - numbers[low])

        if numbers[mid] < value:
            low = mid + 1
        elif numbers[mid] > value:
            high = mid - 1
        else:
            return mid

    if numbers[low] == value:
        return low
    else:
        return -1

if __name__ == "__main__":
    # Pre-sorted numbers
    numbers = [-100, -6, 0, 1, 5, 14, 15, 26, 28, 29, 30, 31, 35, 37, 39, 40, 41, 42]
    num = []
    for i in range(100000):
        num.append(i)
    value = 15

    # Print numbers to search
    print 'Numbers:'
    print ' '.join([str(i) for i in numbers])

    # Find the index of 'value'
    start_time1 = time.time()
    index = interpolation_search(numbers, value)

    # Print the index where 'value' is located
    print '\nNumber %d is at index %d' % (value, index)
    print("--- Run Time %s seconds---" % (time.time() - start_time1))
The output that I am getting is:
Numbers:
-100 -6 0 1 5 14 15 26 28 29 30 31 35 37 39 40 41 42
Filename: C:/Users/Admin/PycharmProjects/timenspace/Interpolation.py
Line # Mem usage Increment Line Contents
================================================
4 21.5 MiB 0.0 MiB #profile()
5 def interpolation_search(numbers, value):
6 21.5 MiB 0.0 MiB low = 0
7 21.5 MiB 0.0 MiB high = len(numbers) - 1
8 21.5 MiB 0.0 MiB mid = 0
9
10 21.5 MiB 0.0 MiB while numbers[low] <= value and numbers[high] >= value:
11 21.5 MiB 0.0 MiB mid = low + ((value - numbers[low]) * (high - low)) / (numbers[high] - numbers[low])
12
13 21.5 MiB 0.0 MiB if numbers[mid] < value:
14 low = mid + 1
15
16 21.5 MiB 0.0 MiB elif numbers[mid] > value:
17 21.5 MiB 0.0 MiB high = mid - 1
18 else:
19 21.5 MiB 0.0 MiB return mid
20
21 if numbers[low] == value:
22 return low
23 else:
24 return -1
Number 15 is at index 6
--- Run Time 0.0429999828339 seconds---
As you can see, my memory remains constant at 21.5 MiB in all steps.
Please help me with this. Thank you.
Why do you expect it to increase? I don't see any memory allocations, i.e., the list numbers does not grow in size.
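To make the Increment column move, the function has to allocate something new. Here is a small, made-up example (memory_profiler must be installed, and the function name is purely illustrative):

from memory_profiler import profile

@profile
def allocate_example(n=1000000):
    # unlike interpolation_search above, this builds new objects,
    # so memory_profiler reports a visible increment on these lines
    big = list(range(n))            # roughly tens of MiB for a million ints
    squares = [x * x for x in big]  # a second allocation of similar size
    return len(squares)

if __name__ == "__main__":
    allocate_example()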
New to Pandas, looking for the most efficient way to do this.
I have a Series of DataFrames. Each DataFrame has the same columns but different indexes, and they are indexed by date. The Series is indexed by ticker symbol. So each item in the Series represents a single time series of an individual stock's performance.
I need to randomly generate a list of n DataFrames, where each DataFrame is a subset of some random assortment of the available stocks' histories. It's OK if there is overlap, so long as the start and end dates are different.
This following code does it, but it's really slow, and I'm wondering if there's a better way to go about it:
Code
def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if type(data) != pd.Series:
        return None

    if subset == 'validate':
        offset = 0
    elif subset == 'test':
        offset = 200
    elif subset == 'train':
        offset = 400

    tickers = np.random.randint(0, len(data), size=len(data))

    ret_data = []
    while len(ret_data) != batch_size:
        for t in tickers:
            data_t = data[t]
            max_len = len(data_t)-timesteps-1
            if len(ret_data) == batch_size: break
            if max_len-offset < 0: continue

            index = np.random.randint(offset, max_len)
            d = data_t[index:index+timesteps]
            if len(d) == timesteps: ret_data.append(d)

    return ret_data
Profile output:
Timer unit: 1e-06 s
File: finance.py
Function: random_sample at line 137
Total time: 0.016142 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
137 #profile
138 def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
139 1 5 5.0 0.0 if type(data) != pd.Series:
140 return None
141
142 1 1 1.0 0.0 if subset=='validate':
143 offset = 0
144 1 1 1.0 0.0 elif subset=='test':
145 offset = 200
146 1 0 0.0 0.0 elif subset=='train':
147 1 1 1.0 0.0 offset = 400
148
149 1 1835 1835.0 11.4 tickers = np.random.randint(0, len(data), size=len(data))
150
151 1 2 2.0 0.0 ret_data = []
152 2 3 1.5 0.0 while len(ret_data) != batch_size:
153 116 148 1.3 0.9 for t in tickers:
154 116 2497 21.5 15.5 data_t = data[t]
155 116 317 2.7 2.0 max_len = len(data_t)-timesteps-1
156 116 80 0.7 0.5 if len(ret_data)==batch_size: break
157 115 69 0.6 0.4 if max_len-offset < 0: continue
158
159 100 101 1.0 0.6 index = np.random.randint(offset, max_len)
160 100 10840 108.4 67.2 d = data_t[index:index+timesteps]
161 100 241 2.4 1.5 if len(d)==timesteps: ret_data.append(d)
162
163 1 1 1.0 0.0 return ret_data
Are you sure you need to find a faster method? Your current method isn't that slow. The following changes might simplify, but won't necessarily be any faster:
Step 1: Take a random sample (with replacement) from the list of dataframes
rand_stocks = np.random.randint(0, len(data), size=batch_size)
You can treat this array rand_stocks as a list of indices to be applied to your Series of dataframes. The size is already batch_size, so that eliminates the need for the while loop and your comparison on line 156.
That is, you can iterate over rand_stocks and access the stock like so:
for idx in rand_stocks:
    stock = data.ix[idx]
    # Get a sample from this stock.
Step 2: Get a random date range for each stock you have randomly selected.
start_idx = np.random.randint(offset, len(stock)-timesteps)
d = stock[start_idx:start_idx+timesteps]
I don't have your data, but here's how I put it together:
def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if subset == 'train': offset = 0  # you can obviously change this back
    rand_stocks = np.random.randint(0, len(data), size=batch_size)
    ret_data = []
    for idx in rand_stocks:
        stock = data[idx]
        start_idx = np.random.randint(offset, len(stock)-timesteps)
        d = stock[start_idx:start_idx+timesteps]
        ret_data.append(d)
    return ret_data
Creating a dataset:
In [22]: import numpy as np
In [23]: import pandas as pd
In [24]: rndrange = pd.DateRange('1/1/2012', periods=72, freq='H')
In [25]: rndseries = pd.Series(np.random.randn(len(rndrange)), index=rndrange)
In [26]: rndseries.head()
Out[26]:
2012-01-02 2.025795
2012-01-03 1.731667
2012-01-04 0.092725
2012-01-05 -0.489804
2012-01-06 -0.090041
In [27]: data = [rndseries,rndseries,rndseries,rndseries,rndseries,rndseries]
Testing the function:
In [42]: random_sample(data, timesteps=2, batch_size = 2)
Out[42]:
[2012-01-23 1.464576
2012-01-24 -1.052048,
2012-01-23 1.464576
2012-01-24 -1.052048]
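Note that pd.DateRange has since been removed from pandas; with a current version the same toy dataset can be built roughly like this (a sketch, not the original answer's code, reusing the simplified random_sample above):

import numpy as np
import pandas as pd

# modern equivalent of the deprecated pd.DateRange used above
rndrange = pd.date_range('1/1/2012', periods=72, freq='h')
rndseries = pd.Series(np.random.randn(len(rndrange)), index=rndrange)
data = pd.Series([rndseries] * 6)

sample = random_sample(data, timesteps=2, batch_size=2)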