Python, parsing data 24 hours at a time out of 263 days - python

I have an Excel file (to be converted to CSV).
The data has 8 columns. The first two are day of the year and time, respectively, while the two before the last are minimum temperature and maximum temperature. For each day I need to find the maximum and minimum of the day, subtract them, and save the value for that day.
I ran into two problems: how do I parse 24 lines at a time (there are no missing data lines!), and how do I find the maximum or minimum in each batch?
I have 6312 lines = 24 hr * 263 days.
So, to iterate through the lines:
import numpy as np
up_air_min=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(5))
up_air_max=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(6))
day_year=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(0))
min_d, max_d = [], []
for i in range(0, 6312, 1):
    # I get stuck here: how do I limit the iteration to 24 lines at a time?
    # If I can do that, I think I can get the rest done.
    min_d.append(up_air_min[i])
    max_d.append(up_air_max[i])
    # move to the next batch of 24 lines....

Use NumPy, Luke, and avoid for-loops.
Once you have the up_air_min and up_air_max NumPy arrays, you can easily do what you want with NumPy's element-wise functions.
First, create a 2D array with 263 rows (one per day) and 24 columns, like this:
min_matrix = up_air_min.reshape((263, 24))
max_matrix = up_air_max.reshape((263, 24))
Then use the np.min and np.max functions along axis 1:
min_temperature = np.min(min_matrix, axis=1)
max_temperature = np.max(max_matrix, axis=1)
And find the difference:
dt = max_temperature - min_temperature
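dt is an array with the needed values. Let's save it to foo.csv, taking one day_year value per day (every 24th entry) so that both columns have 263 rows:
np.savetxt('foo.csv', np.swapaxes([day_year[::24], dt], 0, 1), delimiter=',')
And the final code looks like this: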
import numpy as np
# This I got from your question.
up_air_min=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(5))
up_air_max=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(6))
day_year=np.genfromtxt(input_temps,skip_header=1, dtype=float, delimiter=',',usecols=(0))
# Split the arrays into matrices with 263 rows (days) and 24 values in every row.
min_matrix = up_air_min.reshape((263, 24))
max_matrix = up_air_max.reshape((263, 24))
# Find min temperature for every day. min_temperature is an array with 263 values.
min_temperature = np.min(min_matrix, axis=1)
# The same for max temperature.
max_temperature = np.max(max_matrix, axis=1)
# Subtract min temperature from max.
dt = max_temperature - min_temperature
# Save result in csv, with one day_year value per day (every 24th entry).
np.savetxt('foo.csv', np.swapaxes([day_year[::24], dt], 0, 1), delimiter=',')
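As a small variation on the reshape approach, here is a sketch that does not hard-code the number of days. It assumes, as in the question, that every day has exactly 24 rows; the function name daily_range and the synthetic input arrays are made up for illustration.

```python
import numpy as np

HOURS_PER_DAY = 24

def daily_range(day_of_year, hourly_min, hourly_max, hours=HOURS_PER_DAY):
    """Return a (n_days, 2) array of [day, daily max - daily min]."""
    if len(hourly_min) % hours != 0:
        raise ValueError("number of rows is not a multiple of the batch size")
    # -1 lets NumPy work out the number of days from the array length.
    daily_min = hourly_min.reshape(-1, hours).min(axis=1)
    daily_max = hourly_max.reshape(-1, hours).max(axis=1)
    days = day_of_year[::hours]  # one day label per batch of 24 rows
    return np.column_stack((days, daily_max - daily_min))

# Synthetic stand-in for the CSV columns (3 days of hourly values).
n_days = 3
day_of_year = np.repeat(np.arange(1, n_days + 1), HOURS_PER_DAY).astype(float)
hourly_min = np.tile(np.arange(HOURS_PER_DAY, dtype=float), n_days)
hourly_max = hourly_min + 5.0

result = daily_range(day_of_year, hourly_min, hourly_max)
```

The result can be passed straight to np.savetxt with delimiter=','.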

A reasonably pythonic way to do this would be to have a function that loops over the rows, gathering them up and spitting out the gathered rows using yield when the day changes. This gives you a generator that generates 263 lists each holding 24 values, which is a bit easier to process.
If you've definitely not got any missing values, you could use a trivial doubly-nested loop without batching up the elements first. That's a bit more fragile, but it sounds like you might not be planning to re-use the code anyway.
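A minimal sketch of the generator idea described above, using only the standard library. The column layout of the sample (day, time, tmin, tmax) and the name daily_batches are assumptions for illustration, not the layout of the real file.

```python
import csv
import io

def daily_batches(rows):
    """Yield (day, rows_for_that_day) each time the day column changes."""
    batch, current_day = [], None
    for row in rows:
        if batch and row[0] != current_day:
            yield current_day, batch
            batch = []
        current_day = row[0]
        batch.append(row)
    if batch:  # do not lose the last day
        yield current_day, batch

# Hypothetical sample data with two rows per day instead of 24.
sample = io.StringIO(
    "day,time,tmin,tmax\n"
    "1,0,2.0,5.0\n"
    "1,1,1.5,6.5\n"
    "2,0,3.0,4.0\n"
    "2,1,2.5,7.0\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header line

daily_ranges = []
for day, rows in daily_batches(reader):
    tmin = min(float(r[2]) for r in rows)
    tmax = max(float(r[3]) for r in rows)
    daily_ranges.append((int(day), tmax - tmin))
```

Because the batches are keyed on the day column rather than on a fixed count of 24, this version keeps working if a few hourly rows are missing.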

Here's a somewhat contrived example of how you could chunk things 24 lines at a time.
from io import StringIO
from random import random as r
import numpy as np
s = StringIO()
for x in range(0, 10000):
    s.write('%f,%f,%f\n' % (r(), r()*10, r()*100))
s.seek(0)  # rewind so genfromtxt reads from the start of the buffer
data = np.genfromtxt(s, dtype=None, names=['pitch','yaw','thrust'], delimiter=',')
for x in range(0, len(data), 24):
    print('Acting on hours %d through %d' % (x, x+24))
    one_day = data[x:x+24]
    minimum_yaw = min(one_day['yaw'])
    max_yaw = max(one_day['yaw'])
    print('min', minimum_yaw, 'max', max_yaw, 'one_day', one_day['yaw'])


How do I optimize a for loop for faster results in Python

I've written a piece of code to extract data from an HDF5 file and save it into a dataframe that I can export as .csv later. The final dataframe has about 2.5 million rows, and the code takes a long time to execute.
Is there any way I can optimize this code so that it runs faster?
The current runtime is 7.98 minutes!
Ideally I would want to run this program for 48 files like this one, so I need a faster run time.
import h5py
import numpy as np
import pandas as pd
#import geopandas as gpd
f = h5py.File('mer.h5', 'r')
for key in f.keys():
    print(key) #Names of the root level objects in the HDF5 file - can be groups or datasets.
    #print(type(f[key])) # get the object type: usually group or dataset
ls = list(f.keys())
#Get the HDF5 group; key needs to be a group name from above
key = 'DHI'
#group = f['OBSERVATION_TIME']
#for key in ls:
    #data = f.get(key)
    #dataset1 = np.array(data)
data = f.get(key)
dataset1 = np.array(data)
X = f.get('X')
X_1 = pd.DataFrame(X)
Y = f.get('Y')
Y_1 = pd.DataFrame(Y)
data_df = pd.DataFrame(index=range(len(Y_1)), columns=range(len(X_1)))
for i in data_df.index:
    data_df.iloc[i] = dataset1[0][i]
final = pd.DataFrame(index=range(1616*1616), columns=['X', 'Y', 'GHI'])
k = 0  # running row index into final
for y in range(len(Y_1)):
    for x in range(len(X_1[:-2])): #X and Y ranges are not same
        final.loc[k,'X'] = X_1[0][x]
        final.loc[k,'Y'] = Y_1[0][y]
        final.loc[k,'GHI'] = data_df.iloc[y,x]
        k += 1
        # print(k)
We can optimize loops by vectorizing the operations, which is often one or two orders of magnitude faster than the pure-Python equivalent (especially in numerical computations). Vectorization is something we get with NumPy, a library with efficient data structures designed to hold matrix data.
Could you please try the following (file.h5 being your file):
import pandas as pd
import h5py
with h5py.File("file.h5", "r") as file:
    df_X = pd.DataFrame(file.get("X")[:-2], columns=["X"])
    df_Y = pd.DataFrame(file.get("Y"), columns=["Y"])
    DHI = file.get("DHI")[0][:, :-2].reshape(-1)
final = df_Y.merge(df_X, how="cross").assign(DHI=DHI)[["X", "Y", "DHI"]]
Some explanations:
First, read the data with key X into a dataframe df_X with one column X, leaving out the last 2 data points.
Then read the full data with key Y into a dataframe df_Y with one column Y.
Then get the data with key DHI and take the first element [0] (there are no more). The result is a 2-dimensional NumPy array, a matrix. Now remove the last two columns ([:, :-2]) and reshape the matrix into a 1-dimensional array in the order you are looking for (order="C" is the default). The result is the column DHI of your final dataframe.
Finally, take the cross product of df_Y and df_X (y is the outer dimension in your loop) via .merge with how="cross", add the DHI column, and rearrange the columns in the order you want.
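To see why the flattened matrix lines up with the nested loop, here is a small self-contained sketch on made-up arrays (no HDF5 file needed). It uses np.tile and np.repeat instead of a cross merge, which is an alternative way to build the same X/Y columns; the array values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the HDF5 datasets "X", "Y" and "DHI".
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])        # two more entries than are used
y = np.array([1.0, 2.0, 3.0])
grid = np.arange(15, dtype=float).reshape(1, 3, 5)  # shape (1, len(y), len(x))

x_used = x[:-2]
values = grid[0][:, :-2]  # drop the last two columns, shape (3, 3)

final = pd.DataFrame({
    "X": np.tile(x_used, len(y)),      # x varies fastest (inner loop)
    "Y": np.repeat(y, len(x_used)),    # y varies slowest (outer loop)
    "DHI": values.reshape(-1),         # row-major (C order) flattening
})

# Reference result from the nested loop in the question, for comparison.
expected = [
    (x_used[j], y[i], grid[0][i, j])
    for i in range(len(y))
    for j in range(len(x_used))
]
```

Row-major flattening walks each row of the matrix from left to right, which is exactly the order in which the loop visits (y, x) pairs.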

Avoid for loop in Python DataFrame

Problem 1.
Suppose I have n years of annual returns r and my initial wealth is 100. Every year I have a fixed expense of 6. I want to create the yearly wealth path. I can do it in a for loop, but for my purpose that is too time consuming. How do I do it with a DataFrame?
wealth = pd.Series(index=range(n+1), dtype=float)
wealth[0] = 100
for i in range(n):
    wealth.iloc[i+1] = wealth.iloc[i]*(1+r.iloc[i]) - 6
Initially I thought
wealth = ((1 + r - 0.06).cumprod()).multiply(other = 100)
to be the solution. But it is not. Expenses are not 6%. They are fixed. It is 6.
Problem 2.
I want to do the above N times. In each case I generate r by sampling n returns with replacement.
r = returnY.sample(n,replace=True).reset_index(drop=True)
Then, for that return, create the wealth path I described above, and build an n*N dataframe of wealth paths. I can do this in a for loop, but for big N and n it takes a long time to run. Is there an efficient and elegant way to do this?
Problem 3.
Suppose allWealth is the DF with all wealth paths. I want to check the percentage of columns in each row that are less than 0. This is how I resolved it.
yy = allWealth.copy()
yy[yy>0] = 1
yy[yy<=0] = 0
yy.sum(axis = 1)/N
Any better, more elegant solution?
Problem 1: It looks like you want to apply the "reduce" pattern. You can use the reduce function from functools.
import numpy as np
from functools import reduce
rs = np.random.random(50)*0.3 #sequence of annual returns
result = reduce(lambda w,r: w*(1+r)-6, rs, 100)
If you want to keep all the intermediate values, use itertools.accumulate() instead (the initial argument requires Python 3.8 or later). For example, replace the last line with the following:
import itertools
ts_iter = itertools.accumulate(rs, lambda w,r: w*(1+r)-6, initial=100)
ts = list(ts_iter) #itertools.accumulate returns an iterator
Problem 2: You can first generate a random n x N matrix by sampling with replacement. Then you can apply the "apply_along_axis" function to each column.
import numpy as np
from functools import reduce
rm = np.random.random((n,N))
def sim(rs):
    return reduce(lambda w,r: w * (1+r) - 6, rs, 100)
result = np.apply_along_axis(sim, 0, rm)
Problem 3: You don't need to assign ones and zeros to your original dataframe. A mask dataframe of True and False values implicitly acts as a dataframe of ones and zeros in this case.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((50,30)))
mask = df < 0.5
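As a sketch of how such a mask can be turned into the per-row share asked for in Problem 3: the mean of a boolean row is the fraction of True values. The small allWealth frame below is made up, with rows as years and columns as simulated paths.

```python
import pandas as pd

# Hypothetical wealth paths: one row per year, one column per simulated path.
allWealth = pd.DataFrame(
    [[100.0, 100.0, 100.0, 100.0],
     [50.0, -5.0, 20.0, -1.0],
     [-10.0, -20.0, 5.0, -3.0]]
)

# Fraction of paths in each row with positive wealth (what the question's code computes).
share_positive = (allWealth > 0).mean(axis=1)

# Fraction of paths in each row that are at or below zero.
share_non_positive = (allWealth <= 0).mean(axis=1)
```

Using mean(axis=1) avoids both the copy and the explicit division by N.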
I used @chi's solution with a small edit.
import numpy as np
import itertools
rm = np.random.random((n,N)) #sequence of annual returns
rm0 = np.insert(rm, 0, 100, axis=1)
def wealth(rs):
    return list(itertools.accumulate(rs, lambda w,r: w*(1+r)-6))
result = np.apply_along_axis(wealth, 1, rm0)
My version of itertools.accumulate does not recognize initial (the parameter was added in Python 3.8), hence I inserted the initial wealth at the front of the return array.
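The reduce and accumulate versions still run a Python-level loop per path. As an alternative sketch (not taken from the answers above), the recurrence w[t] = w[t-1]*(1+r[t]) - 6 can be unrolled: with P[t] the cumulative product of (1+r), w[t] = P[t] * (100 - 6 * sum over k <= t of 1/P[k]). This assumes no return equals -1 exactly, so P never hits zero. The names wealth_paths and historical are made up; historical stands in for returnY.

```python
import numpy as np

def wealth_paths(returns, initial=100.0, expense=6.0):
    """returns has shape (n_years, n_paths); result has shape (n_years + 1, n_paths)."""
    returns = np.asarray(returns, dtype=float)
    growth = np.cumprod(1.0 + returns, axis=0)               # P[t]
    discounted = np.cumsum(expense / growth, axis=0)         # sum of expense / P[k]
    wealth = growth * (initial - discounted)
    first_row = np.full((1,) + returns.shape[1:], initial)   # wealth at year 0
    return np.concatenate([first_row, wealth], axis=0)

rng = np.random.default_rng(0)
historical = rng.normal(0.05, 0.1, size=40)   # hypothetical stand-in for returnY
n, N = 10, 1000
# Sample n returns with replacement for each of the N paths in one call.
sampled = rng.choice(historical, size=(n, N), replace=True)
paths = wealth_paths(sampled)
```

Each column of paths is one simulated wealth path, so the whole n x N simulation is done without a Python loop.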

Is there a way to loop through a matrix/array/df every 30 rows to return scipy.stats.describe

I want a loop that goes over every 30th row of a (1095, 10000) array, computes scipy.stats.describe for that row, and writes the results to a list.
I have done it manually and it works, but I'm trying to optimise my code:
stats150 = scipy.stats.describe(matrix[150])
list_for_stats += ['150:', stats150]
stats180 = scipy.stats.describe(matrix[180])
list_for_stats += ['180:', stats180]
statsOut = open("myOutputStatsFile.txt", "w")
for line in list_for_stats:
    # write line to output file
    statsOut.write(str(line) + "\n")
statsOut.close()
I am looking for a for loop that is more intuitive than what I already have.
Assuming your matrix is a NumPy array, this loop goes through every 30th row of a (1095, 10000) matrix of ones and stores the scipy.stats.describe results, along with the row number as a string, in a list:
import numpy as np
import scipy.stats
matrix = np.ones(shape=(1095,10000))
list_for_stats = []
for i in range(0, matrix.shape[0], 30):
    list_for_stats += [str(i)+':', scipy.stats.describe(matrix[i])]
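If the per-row loop itself becomes a bottleneck, scipy.stats.describe also takes an axis argument, so all selected rows can be summarised in one call. A rough sketch on random data (with fewer columns than in the question, just to keep it quick):

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
matrix = rng.normal(size=(1095, 200))  # made-up data; the question has 10000 columns

row_numbers = np.arange(0, matrix.shape[0], 30)       # 0, 30, 60, ...
summary = scipy.stats.describe(matrix[::30], axis=1)  # one result per selected row

# Rebuild the "label, stats" list from the question out of the array-valued result.
list_for_stats = []
for pos, i in enumerate(row_numbers):
    list_for_stats += [f"{i}:", {
        "min": summary.minmax[0][pos],
        "max": summary.minmax[1][pos],
        "mean": summary.mean[pos],
        "variance": summary.variance[pos],
    }]
```

Here summary.mean, summary.variance and the two arrays in summary.minmax each hold one value per selected row.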

Data Cleaning(Flagging) Dead Sensor

I have a large time series (a pandas DataFrame) of wind speed (10-minute averages) that contains erroneous data from a dead sensor. How can it be flagged automatically? I was trying a moving average.
An approach other than a moving average would be much appreciated. I have attached a sample data image below.
There are several ways to deal with this problem. I would first switch to differences:
import pandas as pd
import numpy as np
n = 200
y = np.cumsum(np.random.randn(n))
y[100:120] = 2
y[150:160] = 0
ts = pd.Series(y)
The next step is to find how long the streaks of consecutive zeros are.
def getZeroStrikeLen(x):
    """ Accept a boolean array only
    """
    res = np.diff(np.where(np.concatenate(([x[0]],
                                           x[:-1] != x[1:],
                                           [True])))[0])[::2]
    return res
vec = ts.diff().values == 0
out = getZeroStrikeLen(vec)
Now if len(out) > 0 you can conclude that there is a problem. If you want to go one step further, you can have a look at this. It is in R, but it's not that hard to replicate in Python.
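Building on the same idea of differences, here is a pandas-only sketch that returns a flag per sample instead of just the streak lengths. The function name flag_flat_runs and the threshold min_len=6 (one hour of 10-minute averages) are assumptions to tune for the real sensor.

```python
import numpy as np
import pandas as pd

def flag_flat_runs(ts, min_len=6):
    """Flag samples that repeat the previous value for at least min_len steps in a row."""
    same = ts.diff().eq(0)                                    # True where value == previous value
    run_id = same.ne(same.shift(fill_value=False)).cumsum()   # label consecutive runs
    run_len = run_id.map(run_id.value_counts())               # length of the run each sample is in
    return same & (run_len >= min_len)

# Same kind of synthetic series as above, with two dead-sensor stretches.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))
y[100:120] = 2
y[150:160] = 0
flags = flag_flat_runs(pd.Series(y))
```

The first sample of each flat stretch is not flagged, because its difference to the previous (live) reading is non-zero. If the dead sensor reports slightly noisy values rather than exact repeats, the eq(0) test would need a tolerance instead.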

Spatial temporal query in python with many records

I have a dataframe of 600,000 x/y points with date-time information, along with another field, 'status', that holds extra descriptive information.
My objective is, for each record:
sum the column 'status' over the records that fall within a certain spatio-temporal buffer
the specific buffer is within t - 8 hours and < 100 meters
Currently I have the data in a pandas dataframe.
I could loop through the rows and, for each record, subset the dates of interest, then calculate the distances and restrict the selection further. However, that would still be quite slow with so many records.
THIS TAKES 4.4 hours to run.
I can see that I could create a 3-dimensional k-d tree with x, y, and date as epoch time. However, I am not certain how to restrict the distances properly when combining dates and geographic distances.
Here is some reproducible code for you to test on:
import numpy.random as npr
import numpy as np
import pandas as pd
from pandas import DataFrame, date_range
from datetime import datetime, timedelta
Create data
Function to generate test data
def CreateDataSet(Number=1):
    Output = []
    for i in range(Number):
        # Create a date range with hour frequency
        date = date_range(start='10/1/2012', end='10/31/2012', freq='H')
        # Create long lat data
        laty = npr.normal(4815862, 5000, size=len(date))
        longx = npr.normal(687993, 5000, size=len(date))
        # status of interest
        status = [0,1]
        # Make a random list of statuses
        random_status = [status[npr.randint(low=0,high=len(status))] for i in range(len(date))]
        # user pool
        user = ['sally','derik','james','bob','ryan','chris']
        # Make a random list of users
        random_user = [user[npr.randint(low=0,high=len(user))] for i in range(len(date))]
        Output.extend(zip(random_user, random_status, date, longx, laty))
    return pd.DataFrame(Output, columns = ['user', 'status', 'date', 'long', 'lat'])
#Create data
data = CreateDataSet(3)
#some time deltas
before = timedelta(hours = 8)
after = timedelta(minutes = 1)
Function to speed up
def work(df):
    output = []
    #loop through data index's
    for i in range(0, len(df)):
        l = []
        #first we will filter out the data by date to have a smaller list to compute distances for
        #create a mask to query all dates between range for date i
        date_mask = (df['date'] >= df['date'].iloc[i]-before) & (df['date'] <= df['date'].iloc[i]+after)
        #create a mask to query all users who are not user i (themselves)
        user_mask = df['user']!=df['user'].iloc[i]
        #apply masks
        dists_to_check = df[date_mask & user_mask]
        #for point i, create coordinate to calculate distances from
        a = np.array((df['long'].iloc[i], df['lat'].iloc[i]))
        #create array of distances to check on the masked data
        b = np.array((dists_to_check['long'].values, dists_to_check['lat'].values))
        #for j in the date queried data
        for j in range(1, len(dists_to_check)):
            #compute the euclidean distance between point a and each point of b (the date masked data)
            x = np.linalg.norm(a-np.array((b[0][j], b[1][j])))
            #if the distance is within our range of interest append the index to a list
            if x <= 100:
                l.append(j)
        try:
            #use the list of desired index's 'l' to query a final subset of the data
            data = dists_to_check.iloc[l]
            #summarize the column of interest then append to output list
            output.append(data['status'].sum())
        except IndexError as e:
            #print("There were no data to add")
            output.append(0)
    return pd.DataFrame(output)
Run code and time it
start = datetime.now()
out = work(data)
print(datetime.now() - start)
Is there a way to do this query in a vectorized way? Or should I be chasing another technique.
Here is what at least somewhat solves my problem. Since the loop can operate on different parts of the data independently, parallelization makes sense here.
Using IPython...
from IPython.parallel import Client
cli = Client()
dview = cli[:]
with dview.sync_imports():
    import numpy as np
    import os
    from datetime import timedelta
    import pandas as pd
#We also need to add the time deltas and output list into the function as
#local variables as well as add the Ipython.parallel decorator
def work(df):
    before = timedelta(hours = 8)
    after = timedelta(minutes = 1)
    output = []
Final time: 1:17:54.910206, about 1/4 of the original time.
I would still be very interested for anyone to suggest small speed improvements within the body of the function.
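One possible way to act on the k-d tree idea raised in the question is to keep the tree purely spatial and apply the time and user conditions afterwards, which sidesteps the problem of mixing metres and seconds in one distance. The sketch below assumes the coordinates are projected (so Euclidean distance is in metres) and uses scipy.spatial.cKDTree; the function name buffered_status_sum and the four-row demo frame are made up for illustration. Unlike the loop in the question, it checks every candidate, including the first.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def buffered_status_sum(df, radius=100.0,
                        before=np.timedelta64(8, "h"),
                        after=np.timedelta64(1, "m")):
    """For each record, sum 'status' of other users' records within radius and the time window."""
    xy = df[["long", "lat"]].to_numpy(dtype=float)
    t = df["date"].to_numpy(dtype="datetime64[ns]")
    user = df["user"].to_numpy()
    status = df["status"].to_numpy()

    tree = cKDTree(xy)
    # One list of neighbour indices per record, all within `radius` in x/y.
    neighbours = tree.query_ball_point(xy, r=radius)

    out = np.zeros(len(df), dtype=np.int64)
    for i, idx in enumerate(neighbours):
        idx = np.asarray(idx, dtype=np.intp)
        keep = ((t[idx] >= t[i] - before) & (t[idx] <= t[i] + after)
                & (user[idx] != user[i]))
        out[i] = status[idx[keep]].sum()
    return out

# Tiny hypothetical data set with the same columns as CreateDataSet produces.
demo = pd.DataFrame({
    "user": ["a", "b", "c", "b"],
    "status": [1, 1, 1, 1],
    "date": pd.to_datetime(["2012-10-01 10:00", "2012-10-01 11:00",
                            "2012-10-01 10:00", "2012-10-01 00:00"]),
    "long": [0.0, 50.0, 500.0, 0.0],
    "lat": [0.0, 0.0, 0.0, 10.0],
})
sums = buffered_status_sum(demo)
```

There is still a Python loop over records, but each iteration only touches the handful of points within 100 m rather than every record in the 8-hour window, so the distance computation is no longer the dominant cost.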
