Speeding Up Datafile Reading Program For School Project - python

I am in a lower-level coding class (Python) and have a major project due in three days. One of our grading criteria is program speed. My program runs in about 30 seconds; ideally it would execute in 15 seconds or less. Here is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import time

start_time = time.time()  # for printing execution time

# function for appending any number of files to a dataframe
def read_data_files(pre, start, end):  # reading in the data
    data = pd.DataFrame()  # dataframe with all the data from files
    x = start
    while x <= end:
        filename = pre + str(x) + ".csv"  # string manipulation
        dpath = pd.read_csv("C:\\Users\\jacks\\Downloads\\BMEN 207 Project 1 Data\\" + filename)
        for y in dpath:
            dpath = dpath.rename(columns={y: y})
        data = data.append(dpath)
        x += 1
    return data

data = read_data_files("Data_", 5, 163)  # start, end, prefix...
# converting to human time and adding to new column in dataframe
human_time = []
for i in data[' Time']:
    i = int(i)
    j = datetime.utcfromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
    human_time.append(j)
human_timen = np.array(human_time)  # had issues here for some reason, so i created another array to replace the time column in the dataframe
data[' Time'] = human_timen

hours = []  # for use as x-axis in plot
stdlist = []  # for use as y-axis in plot
histlist = []  # for storing magnitudes of most active hour

def magfind(row):  # separate function to calculate the magnitude of each row in each dataframe
    return (row[' Acc X'] ** 2 + row[' Acc Y'] ** 2 + row[' Acc Z'] ** 2) ** .5
def filterfunction(intro1, intro2, first, last):  # two different intros to deal with the issue of '00:' versus '10:' timestamps
    k = first
    meanmax = 0
    active = 0
    while k <= last:
        if 0 <= k < 7:  # data from hours 0 to 6, none after
            hr = intro1 + str(k) + ':'
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis=1)  # creates magnitude column using prior function, column has magnitudes for every row of every file
            p = acc.loc[:, 'magnitude'].std()  # finds std dev for the column and appends to a list for graphing
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        elif k == 12 or 20 < k <= last:  # data at 12 and beyond hour 20
            hr = intro2 + str(k) + ":"
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis=1)
            p = acc.loc[:, 'magnitude'].std()
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        else:  # in the case that we are dealing with an hour that has no data
            p = 0
            m = 0
            stdlist.append(p)  # need this so that the hours with no data still get graphed
        if m > meanmax:  # for determining which hour was the most active, and appending those magnitudes to a list for histogramming
            meanmax = m
            active = k  # most active hour
            for i in acc['magnitude']:
                histlist.append(i)  # adding all the magnitudes for histogramming
        hours.append(k)
        k += 1
    print("mean magnitude", meanmax)
    print("most active hour", active)
    return hours, stdlist, histlist

filterfunction(' 0', ' ', 0, 23)
The slow speed stems from the filterfunction function. What this program does is read data from over 100 files; this function then filters the combined dataframe by time (each individual hour) and runs the calculations on all rows for that hour. I believe it could be sped up by changing the way the data is filtered to search by hour, but I am not sure how. The reason I have statements to exclude certain k-values is that some hours have no data to manipulate, which would otherwise throw off the list of standard deviation calculations as well as the plot this data will feed. Any tips or ideas for speeding this up would be greatly appreciated!
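The kind of restructuring I have in mind would look roughly like the sketch below (untested; it reuses the column names from my code and assumes ' Time' still holds the raw UNIX timestamps):
# rough, untested sketch: compute magnitude once for every row, then group by hour
def by_hour(df):
    df = df.copy()
    df['magnitude'] = np.sqrt(df[' Acc X'] ** 2 + df[' Acc Y'] ** 2 + df[' Acc Z'] ** 2)
    df['hour'] = pd.to_datetime(df[' Time'].astype(int), unit='s').dt.hour
    grouped = df.groupby('hour')['magnitude']
    stds = grouped.std().reindex(range(24), fill_value=0)  # hours with no data stay 0
    means = grouped.mean()
    active = means.idxmax()  # most active hour
    return stds, means, active
Would something along those lines be the right direction?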

One suggestion to speed it up a bit is to remove this line since it is not being used anywhere in the program:
import matplotlib.pyplot as plt
matplotlib is a big library, so dropping the unused import should at least shave some time off startup.
Also, I think you could get rid of numpy, since it is only used once; consider assigning the plain list (or a tuple) instead.
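For example (untested), the time-conversion block could assign the list directly instead of going through np.array:
# untested sketch: skip the intermediate numpy array and assign the list straight to the column
human_time = []
for i in data[' Time']:
    human_time.append(datetime.utcfromtimestamp(int(i)).strftime('%Y-%m-%d %H:%M:%S'))
data[' Time'] = human_time  # pandas accepts a plain list of the right length here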

I was not able to test this because I am on mobile right now. My main idea is not to make the code itself better or leaner; I changed how the work is executed: I integrated the multiprocessing library into your code, counted the system's CPU cores, and divided the work between them.
Detailed multiprocessing documentation: https://docs.python.org/2/library/multiprocessing.html
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import psutil
from datetime import datetime
from multiprocessing import Pool

cores = psutil.cpu_count()
start_time = time.time()  # for printing execution time

# function for appending any number of files to a dataframe
def read_data_files(pre, start, end):  # reading in the data
    data = pd.DataFrame()  # dataframe with all the data from files
    x = start
    while x <= end:
        filename = pre + str(x) + ".csv"  # string manipulation
        dpath = pd.read_csv("C:\\Users\\jacks\\Downloads\\BMEN 207 Project 1 Data\\" + filename)
        for y in dpath:
            dpath = dpath.rename(columns={y: y})
        data = data.append(dpath)
        x += 1
    return data

data = read_data_files("Data_", 5, 163)  # start, end, prefix...

# converting to human time and adding to new column in dataframe
human_time = []
for i in data[' Time']:
    i = int(i)
    j = datetime.utcfromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
    human_time.append(j)
human_timen = np.array(human_time)  # had issues here for some reason, so i created another array to replace the time column in the dataframe
data[' Time'] = human_timen

hours = []  # for use as x-axis in plot
stdlist = []  # for use as y-axis in plot
histlist = []  # for storing magnitudes of most active hour

def magfind(row):  # separate function to calculate the magnitude of each row in each dataframe
    return (row[' Acc X'] ** 2 + row[' Acc Y'] ** 2 + row[' Acc Z'] ** 2) ** .5

def filterfunction(intro1, intro2, first, last):  # two different intros to deal with the issue of '00:' versus '10:' timestamps
    k = first
    meanmax = 0
    active = 0
    while k <= last:
        if 0 <= k < 7:  # data from hours 0 to 6, none after
            hr = intro1 + str(k) + ':'
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis=1)  # creates magnitude column using prior function, column has magnitudes for every row of every file
            p = acc.loc[:, 'magnitude'].std()  # finds std dev for the column and appends to a list for graphing
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        elif k == 12 or 20 < k <= last:  # data at 12 and beyond hour 20
            hr = intro2 + str(k) + ":"
            tfilter = data[' Time'].str.contains(hr)
            acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
            acc['magnitude'] = acc.apply(magfind, axis=1)
            p = acc.loc[:, 'magnitude'].std()
            m = acc.loc[:, 'magnitude'].mean()
            stdlist.append(p)
        else:  # in the case that we are dealing with an hour that has no data
            p = 0
            m = 0
            stdlist.append(p)  # need this so that the hours with no data still get graphed
        if m > meanmax:  # for determining which hour was the most active, and appending those magnitudes to a list for histogramming
            meanmax = m
            active = k  # most active hour
            for i in acc['magnitude']:
                histlist.append(i)  # adding all the magnitudes for histogramming
        hours.append(k)
        k += 1
    print("mean magnitude", meanmax)
    print("most active hour", active)
    return hours, stdlist, histlist
# Run filterfunction in a pool of worker processes, one per CPU core
agents = cores
chunksize = len(data) // cores  # integer chunk size (not actually passed to the pool here)
with Pool(processes=agents) as pool:
    # starmap unpacks the tuple so filterfunction receives all four arguments
    pool.starmap(filterfunction, [(' 0', ' ', 0, 23)])

Don't use apply; it is not vectorized. Instead, use vectorized operations whenever you can. In this case, instead of doing df.apply(magfind, 1), do:
def add_magnitude(df):
    df['magnitude'] = (df[' Acc X'] ** 2 + df[' Acc Y'] ** 2 + df[' Acc Z'] ** 2) ** .5
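Inside filterfunction, the call would then look something like this (a sketch, keeping the rest of the logic unchanged):
# sketch: replace acc.apply(magfind, axis=1) with the vectorized helper
acc = data.loc[tfilter, [' Acc X', ' Acc Y', ' Acc Z']]
add_magnitude(acc)  # adds the 'magnitude' column in one vectorized pass
p = acc['magnitude'].std()
m = acc['magnitude'].mean()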

Related

How to display the average price of bitcoin in Pandas and not through a loop?

y = []
n = 0
days = 1
for i in btc['Adj Close']:
    averagePrice = (i + n) / days
    n += i
    days += 1
    y.append(averagePrice)
btc['TopAverage'] = y
If btc is a pandas data frame (as it appears to be), then:
btc.loc[:, 'Days'] = list(range(1, btc.shape[0] + 1))
btc.loc[:, 'n'] = btc['Adj Close'].cumsum().shift(periods=1, fill_value=0.)
btc.loc[:, 'TopAverage'] = (btc['Adj Close'] + btc['n']) / btc['Days']
reflects the logic in your code. This will add the columns 'Days' and 'n' to the data frame as well.
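As a quick sanity check, here is a small self-contained example (the prices are made up) showing that this reproduces the loop's running average:
import pandas as pd

# made-up prices purely for illustration
btc = pd.DataFrame({'Adj Close': [10.0, 12.0, 11.0, 15.0]})
btc.loc[:, 'Days'] = list(range(1, btc.shape[0] + 1))
btc.loc[:, 'n'] = btc['Adj Close'].cumsum().shift(periods=1, fill_value=0.)
btc.loc[:, 'TopAverage'] = (btc['Adj Close'] + btc['n']) / btc['Days']
print(btc['TopAverage'].tolist())  # [10.0, 11.0, 11.0, 12.0], same as the loop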

Selecting columns using [[]] is very inefficient especially as the size of the dataset increases in python using pandas

Created sample data using the function below:
import random
import time

import pandas as pd
import matplotlib.pyplot as plt

def create_sample(num_of_rows=1000):
    num_of_rows = num_of_rows  # number of records to generate.
    data = {
        'var1': [random.uniform(0.0, 1.0) for x in range(num_of_rows)],
        'other': [random.uniform(0.0, 1.0) for x in range(num_of_rows)]
    }
    df = pd.DataFrame(data)
    print("Shape : {}".format(df.shape))
    print("Type : \n{}".format(df.dtypes))
    return df
df = create_sample()
times = []
for i in range(1, 300):
    start = time.time()
    # Make the dataframe 1 column bigger
    df['var' + str(i + 1)] = df['var' + str(i)]
    # Select two columns from the dataframe using double square brackets
    ####################################################
    temp = df[['var' + str(i + 1), 'var' + str(i)]]
    ####################################################
    end = time.time()
    times.append(end - start)
    start = end
plt.plot(times)
print(sum(times))
The resulting graph is roughly linear (plot not included here).
When I instead used pd.concat to select the columns, the graph shows peaks at every 100 iterations. Why is this so?
df = create_sample()
times = []
for i in range(1, 300):
    start = time.time()
    # Make the dataframe 1 column bigger
    df['var' + str(i + 1)] = df['var' + str(i)]
    # Select two columns from the dataframe using pd.concat
    ####################################################
    temp = pd.concat([df['var' + str(i + 1)], df['var' + str(i)]], axis=1)
    ####################################################
    end = time.time()
    times.append(end - start)
    start = end
plt.plot(times)
print(sum(times))
From the above we can see that the time taken to select columns using [[]] increases linearly with the size of the dataset. However, using pd.concat the time does not increase materially; it only spikes at every 100th iteration. Why does this happen only at every 100 records? It is not obvious to me.

How to make this for loop faster?

I know that python loops themselves are relatively slow when compared to other languages but when the correct functions are used they become much faster.
I have a pandas dataframe called "acoustics" which contains over 10 million rows:
print(acoustics)
timestamp c0 rowIndex
0 2016-01-01T00:00:12.000Z 13931.500000 8158791
1 2016-01-01T00:00:30.000Z 14084.099609 8158792
2 2016-01-01T00:00:48.000Z 13603.400391 8158793
3 2016-01-01T00:01:06.000Z 13977.299805 8158794
4 2016-01-01T00:01:24.000Z 13611.000000 8158795
5 2016-01-01T00:02:18.000Z 13695.000000 8158796
6 2016-01-01T00:02:36.000Z 13809.400391 8158797
7 2016-01-01T00:02:54.000Z 13756.000000 8158798
and there is the code I wrote:
import numpy as np
import pandas as pd

acoustics = pd.read_csv("AccousticSandDetector.csv", skiprows=[1])
weights = [1/9, 1/18, 1/27, 1/36, 1/54]
sumWeights = np.sum(weights)
deltaAc = []
for i in range(5, len(acoustics)):
    time = acoustics.iloc[i]['timestamp']
    sum = 0
    for c in range(5):
        sum += (weights[c]/sumWeights)*(acoustics.iloc[i]['c0']-acoustics.iloc[i-c]['c0'])
    print("Row " + str(i) + " of " + str(len(acoustics)) + " is iterated")
    deltaAc.append([time, sum])
deltaAc = pd.DataFrame(deltaAc)
It takes a huge amount of time, how can I make it faster?
You can use diff from pandas to create all the differences for each row in an array, then multiply by your weights and finally sum over axis 1, such as:
deltaAc = pd.DataFrame({'timestamp': acoustics.loc[5:, 'timestamp'],
                        'summation': (np.array([acoustics.c0.diff(i) for i in range(5)]).T[5:]
                                      * np.array(weights)).sum(1) / sumWeights})
and you get the same values as I get with your code:
print (deltaAc)
timestamp summation
5 2016-01-01T00:02:18.000Z -41.799986
6 2016-01-01T00:02:36.000Z 51.418728
7 2016-01-01T00:02:54.000Z -3.111184
First optimization: weights[c]/sumWeights could be computed outside the loop.
weights_array = np.array([1/9, 1/18, 1/27, 1/36, 1/54])
sumWeights = np.sum(weights_array)
tmp = weights_array / sumWeights
...
sum += tmp[c]*...
I'm not familiar with pandas, but if you could extract your columns as 1D numpy arrays, it would help a lot. It might look something like:
# next lines to be tested, or find the correct way of extracting the columns
c0_column = acoustics['c0'].values
time_column = acoustics['timestamp'].values
...
sum = np.zeros(shape=(len(acoustics) - 5,))
delta_ac = []
for c in range(5):
    sum += tmp[c] * (c0_column[5:] - c0_column[5 - c:len(acoustics) - c])
for i in range(len(acoustics) - 5):
    delta_ac.append([time_column[5 + i], sum[i]])
Dataframes have a great method, rolling, for constructing and applying windowing transformations, so you don't need loops at all:
# df is your data frame
window_size = 5
weights = pd.np.array([1/9, 1/18, 1/27, 1/36, 1/54])
weights /= weights.sum()
df.loc[:,'deltaAc'] = df.loc[:, 'c0'].rolling(window_size).apply(lambda x: ((x[-1] - x)*weights).sum())
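One thing to keep in mind: the first window_size - 1 rows of 'deltaAc' come out as NaN because the rolling window is not full yet, so you may want to drop or fill them before plotting, for example:
# the first rows have an incomplete window, so rolling(...).apply(...) leaves them as NaN
print(df['deltaAc'].head(window_size))
df = df.dropna(subset=['deltaAc'])  # or fillna(0), depending on what the downstream code expects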

handling real time data in python , rolling window

I want to create a function that will read a series of time values from a file (with gaps in the sampling rate, that's the problem), read exactly 200 days, and let me move through the entire data length, say 10,000 days, as a sort of rolling window.
I am not sure how to code it. Can I add a statement that calculates the difference between two values of the time variable (the x axis) up until it is exactly 200 days?
Or can I somehow write a function that would find the starting value, say t0, and then find the element of the array that is closest to t0 + 200 days (the interval)?
What I have so far is:
f = open(reading the file from directory)
lines = f.readlines()
print(len(lines))
tx = np.array([])  # times
y = np.array([])
interval = 200  # days
for li in lines:
    col = li.split()
    t0 = np.array([])
    t1 = np.array([])
    tx = np.append(tx, float(col[0]))
    y = np.append(y, float(col[1]))
    t0 = np.append(t0, np.max(tx))
    t1 = np.append(t1, tx[np.argmin(tx)])
print(t0, t1)
days = [t1 + dt.timedelta(days=float(x)) for x in days]
#y = np.random.randn(len(days))
# use pandas for convenient rolling function:
df = pd.DataFrame({"day": tx, "value": y}).set_index("day")
def closest_value(s):
    if s.shape[0] < 2:
        return np.nan
    X = np.empty((s.shape[0] - 1, 2))
    X[:, 0] = s[:-1]
    X[:, 1] = np.fabs(s[:-1] - s[-1])
    min_diff = np.min(X[:, 1])
    return X[X[:, 1] == min_diff, 0][0]
df['closest_value'] = df.rolling(window=dt.timedelta(days=200))['value'].apply(closest_value, raw=True)
print(df.tail(5))
Output error:
TypeError: float() argument must be a string or a number, not 'datetime.datetime'
Additionally, here are the first 10 tx and y values, respectively:
0 0.003372722575018
0.015239999629557 0.003366515509113
0.045829999726266 0.003385171061055
0.075369999743998 0.003385171061055
0.993219999596477 0.003366515509113
1.022699999623 0.003378941085299
1.05217999964952 0.003369617612836
1.08166999975219 0.003397665493594
3.0025899996981 0.003378941085299
3.04120999993756 0.003394537568711
import numpy as np
import pandas as pd
import datetime as dt

# load data in days and y arrays
# ... or generate them:
N = 1000  # number of days
day_min = dt.datetime.strptime('2000-01-01', '%Y-%m-%d')
day_max = 2000
days = np.sort(np.unique(np.random.uniform(low=0, high=day_max, size=N).astype(int)))
days = [day_min + dt.timedelta(days=int(x)) for x in days]
y = np.random.randn(len(days))

# use pandas for convenient rolling function:
df = pd.DataFrame({"day": days, "value": y}).set_index("day")

def closest_value(s):
    if s.shape[0] < 2:
        return np.nan
    X = np.empty((s.shape[0] - 1, 2))
    X[:, 0] = s[:-1]
    X[:, 1] = np.fabs(s[:-1] - s[-1])
    min_diff = np.min(X[:, 1])
    return X[X[:, 1] == min_diff, 0][0]

df['closest_value'] = df.rolling(window=dt.timedelta(days=200))['value'].apply(closest_value, raw=True)
print(df.tail(5))
Output:
value closest_value
day
2005-06-15 1.668638 1.591505
2005-06-16 0.316645 0.304382
2005-06-17 0.458580 0.445592
2005-06-18 -0.846174 -0.847854
2005-06-22 -0.151687 -0.166404
You could use pandas, set a datetime range and create a while loop to process the data in batches.
import pandas as pd
from datetime import datetime, timedelta

# Load data into pandas dataframe
df = pd.read_csv(filepath)
# Name columns
df.columns = ['dates', 'num_value']
# Convert strings to datetime
df.dates = pd.to_datetime(df['dates'], format='%d/%m/%Y')
# Print dates within a 200 day interval and move on to the next interval
i = 0
while i < len(df.dates):
    start = df.dates[i]
    end = start + timedelta(days=200)
    print(df.dates[(df.dates >= start) & (df.dates < end)])
    i += 200
If the file starts with a header line like the sample below, read it with skiprows=1; if the columns don't have headers, you should omit skiprows:
dates num_value
2004-7-1 1
2004-7-2 5
2004-7-4 8
2004-7-5 11
2004-7-6 17
df = pd.read_table(filepath, sep="\s+", skiprows=1)
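For a completely headerless file you would also tell pandas not to look for a header row, roughly like this (a sketch; the column names are just the ones used above):
# sketch: headerless whitespace-separated file, naming the columns explicitly
df = pd.read_table(filepath, sep=r"\s+", header=None, names=['dates', 'num_value'])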

Getting data arrays from CSV with loops

I have a CSV that looks like this:
0.500187550,CPU1,7.93
0.500187550,CPU2,1.62
0.500187550,CPU3,7.93
0.500187550,CPU4,1.62
1.000445359,CPU1,9.96
1.000445359,CPU2,1.61
1.000445359,CPU3,9.96
1.000445359,CPU4,1.61
1.500674877,CPU1,9.94
1.500674877,CPU2,1.61
1.500674877,CPU3,9.94
1.500674877,CPU4,1.61
The first column is time, the second the CPU used and the third is energy.
As a final result I would like to have these arrays:
Time:
[0.500187550, 1.000445359, 1.500674877]
Energy (per CPU): e.g. CPU1
[7.93, 9.96, 9.94]
For parsing the CSV I'm using:
query = csv.reader(csvfile, delimiter=',', skipinitialspace=True)
#Arrays global time and power:
for row in query:
    x = row[0]
    x = float(x)
    x_array.append(x)  # column 0 to array
    y = row[2]
    y = float(y)
    y_array.append(y)  # column 2 to array
print x_array
print y_array
This way I get all the data for time and energy into two arrays: x_array and y_array.
Then I order the arrays:
energy_core_ord_array = []
time_ord_array = []
#Dividing array into energy and time per core:
for i in range(number_cores[0]):
    e = 0 + i
    for j in range(len(x_array)/(int(number_cores[0]))):
        time_ord = x_array[e]
        time_ord_array.append(time_ord)
        energy_core_ord = y_array[e]
        energy_core_ord_array.append(energy_core_ord)
        e = e + int(number_cores[0])
And lastly, I cut the time array into the length it should have:
final_time_ord_array = []
for i in range(len(x_array)/(int(number_cores[0]))):
    final_time_ord = time_ord_array[i]
    final_time_ord_array.append(final_time_ord)
Up to this point, although the code is not elegant, it works.
The problem comes when I try to get the array for each core.
I get it for the first core, but when I try to iterate to the next one, I don't know how to do it, nor how to store each array in a variable with its own name, for example.
final_energy_core_ord_array = []
#Trunk energy core array:
for i in range(len(x_array)/(int(number_cores[0]))):
    final_energy_core_ord = energy_core_ord_array[i]
    final_energy_core_ord_array.append(final_energy_core_ord)
So using Pandas (library to handle dataframes in Python) you can do something like this, which is much quicker than trying to process the CSV manually like you're doing:
import pandas as pd

csvfile = "C:/Users/Simon/Desktop/test.csv"
data = pd.read_csv(csvfile, header=None, names=['time', 'cpu', 'energy'])
times = list(pd.unique(data.time.ravel()))
print times
cpuList = data.groupby(['cpu'])
cpuEnergy = {}
for i in range(len(cpuList)):
    curCPU = 'CPU' + str(i+1)
    cpuEnergy[curCPU] = list(cpuList.get_group('CPU' + str(i+1))['energy'])
for k, v in cpuEnergy.items():
    print k, v
that will give the following as output:
[0.50018755000000004, 1.000445359, 1.5006748769999998]
CPU4 [1.6200000000000001, 1.6100000000000001, 1.6100000000000001]
CPU2 [1.6200000000000001, 1.6100000000000001, 1.6100000000000001]
CPU3 [7.9299999999999997, 9.9600000000000009, 9.9399999999999995]
CPU1 [7.9299999999999997, 9.9600000000000009, 9.9399999999999995]
Finally I got the answer using globals... not a great idea, but it works. I'll leave it here in case someone finds it useful.
final_energy_core_ord_array = []
#Trunk energy core array:
a = 0
for j in range(number_cores[0]):
    for i in range(len(x_array)/(int(number_cores[0]))):
        final_energy_core_ord = energy_core_ord_array[a + i]
        final_energy_core_ord_array.append(final_energy_core_ord)
    globals()['core%s' % j] = final_energy_core_ord_array
    final_energy_core_ord_array = []
    a = a + 12
print 'Final time and cores:'
print final_time_ord_array
for j in range(number_cores[0]):
    print globals()['core%s' % j]
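A cleaner variant of the same idea (an untested sketch, reusing the variables above) is to keep the per-core lists in a dict instead of globals():
# sketch: same trunking, but stored in a dict keyed by core index instead of globals()
cores_energy = {}
rows_per_core = len(x_array) / int(number_cores[0])  # integer division in Python 2
a = 0
for j in range(number_cores[0]):
    cores_energy[j] = energy_core_ord_array[a:a + rows_per_core]
    a += rows_per_core

print 'Final time and cores:'
print final_time_ord_array
for j in range(number_cores[0]):
    print cores_energy[j]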
