pandas read_csv - Skipping every other row starting from a certain row - python

I have a huge .csv file (over 1 million rows) that I am trying to parse using the pandas read_csv function. The file is very large because it is measurement data from a sensor with a very high sampling rate, and I want to take downsampled segments from it. I tried implementing this with a lambda function and the skiprows and nrows parameters. What my code currently does is just read the same segment over and over again.
segment_amt = 20  # How many segments we want from an individual measurement file
segment_length = 5  # Segment length in seconds
segment_length_idx = fs * segment_length  # Segment length in indices
segment_skip_length = 10  # How many seconds between segments
segment_skip_idx = fs * segment_skip_length  # The amount of indices to skip between each segment
downsampling = 2  # Factor of downsampling

idx = start_idx
for i in range(segment_amt):
    cond = lambda x: (x + idx) % downsampling != 0
    data = pd.read_csv(filename, skiprows=cond, nrows=segment_length_idx / downsampling,
                       usecols=[z_component_idx], names=["z"], engine='python')
    M1_df = M1_df.append(data.T)
    idx += segment_skip_idx
This results in something like this. I assume the behaviour is due to the lambda function, but I don't know how to fix it, so that it changes the starting row each time based on idx (this is what I thought it would do currently).

Your cond lambda is wrong. You want to skip rows if x < idx or x % downsampling != 0. Just write it that way:
cond = lambda x: x < idx or x % downsampling != 0
But you should also consider passing header=None so the first line of each segment is not treated as a header.
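As a rough sketch of the full corrected loop (untested; it assumes fs, start_idx, filename and z_component_idx from the question are defined, and it collects the segments with pd.concat instead of the deprecated DataFrame.append):

import pandas as pd

segments = []
idx = start_idx
for i in range(segment_amt):
    # Skip everything before idx, then keep only every `downsampling`-th row
    cond = lambda x: x < idx or x % downsampling != 0
    data = pd.read_csv(filename,
                       skiprows=cond,
                       nrows=segment_length_idx // downsampling,
                       usecols=[z_component_idx],
                       names=["z"],
                       header=None,
                       engine='python')
    segments.append(data.T)
    idx += segment_skip_idx

M1_df = pd.concat(segments)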

Python - Multiprocessing very large parquet file

I have stumbled across similar questions but all of them used either .csv or .txt files and a solution was to read in the data line by line or in chunks. I am not aware of that being possible with parquet files as they were designed to be read by columns and not rows.
My first attempt below works well with a smaller subset/test dataset created from the original full dataset.
def process_single_group(group_df):
    # Super simple version of my function
    group_df['E'] = group_df['C'] + group_df['D']
    group_df['F'] = group_df['C'] - group_df['D']
    group_df['G'] = group_df['C'] * group_df['D']
    return group_df

def group_of_groups_loop(group_of_group_df):
    df_agg = pd.DataFrame()
    for i, group_df in enumerate(group_of_group_df):
        t = process_single_group(group_df)
        df_agg = pd.concat([df_agg, t])
    return df_agg

num_processes = os.cpu_count()
pool = Pool(processes=num_processes)

df = pd.read_parquet('path/to/dataset.parquet')
# With small dataset, there are about 300 groups
# With full dataset, there are about ~800k groups
groupby = df.groupby(by=['A', 'B'])
# Split the groups into chunks of groups
groupby_split = np.array_split(groupby, num_processes)
# Create a list where each element is a chunk of groups
# The starmap expects this sort of input
input_list = [[gb] for gb in groupby_split]

x = pd.concat(pool.starmap(group_of_groups_loop, input_list))
pool.join()
pool.close()

x.to_parquet('path/to/save/file.parquet')
but when I switch to the full parquet file, I get the error:
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
which I expected.
My solution to this was to break the very large number of groups into smaller chunks of groups (size similar to the subset) and loop over each one like with the subset earlier.
EDIT: I forgot to add the second np.array_split within the loop.
def process_single_group(group_df):
    # Super simple version of my function
    group_df['E'] = group_df['C'] + group_df['D']
    group_df['F'] = group_df['C'] - group_df['D']
    group_df['G'] = group_df['C'] * group_df['D']
    return group_df

def group_of_groups_loop(group_of_group_df):
    df_agg = pd.DataFrame()
    for i, group_df in enumerate(group_of_group_df):
        t = process_single_group(group_df)
        df_agg = pd.concat([df_agg, t])
    return df_agg

num_processes = os.cpu_count()
pool = Pool(processes=num_processes)

df = pd.read_parquet('path/to/dataset.parquet')
# With small dataset, there are about 300 groups
# With full dataset, there are about ~800k groups
groupby = df.groupby(by=['A', 'B'])
# Split the large number of groups into smaller chunks
N = 10
groupby_split = np.array_split(groupby, N)

final_agg_df = pd.DataFrame()
# Iterate over each of the smaller chunks
for groupby_group in groupby_split:
    groupby_split_split = np.array_split(groupby_group, num_processes)
    # Create a list to use as argument for starmap
    input_list = [[gb] for gb in groupby_split_split]
    x = pd.concat(pool.starmap(group_of_groups_loop, input_list))
    pool.join()
    pool.close()
    final_agg_df = pd.concat([final_agg_df, x])

final_agg_df.to_parquet('path/to/save/file.parquet')
But this is still giving me the same error...
I thought that since the pool was created prior to reading in the large parquet file (a solution I read earlier mentioned doing this), each process would only be given a small chunk of groups.
Is there something I missed? And is there a better way of doing this in general (a queue? dask? different function logic?)
Thanks in advance!
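Not from the original post, but one hedged way to keep each pickled payload small is to hand individual group DataFrames to the workers (rather than large chunks of the groupby object) and let imap_unordered batch them. A rough sketch, assuming the same columns and paths as above:

import os
import pandas as pd
from multiprocessing import Pool

def process_single_group(group_df):
    # Same super simple function as in the question
    group_df['E'] = group_df['C'] + group_df['D']
    group_df['F'] = group_df['C'] - group_df['D']
    group_df['G'] = group_df['C'] * group_df['D']
    return group_df

if __name__ == '__main__':
    df = pd.read_parquet('path/to/dataset.parquet')
    # Lazy generator of small per-group frames instead of one huge pickled chunk
    groups = (g for _, g in df.groupby(by=['A', 'B']))
    with Pool(processes=os.cpu_count()) as pool:
        # chunksize batches many small pickles per IPC round-trip
        results = pool.imap_unordered(process_single_group, groups, chunksize=256)
        final_agg_df = pd.concat(results)
    final_agg_df.to_parquet('path/to/save/file.parquet')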

How do I add a matrix constraint `Ax=b` to a Pyomo model efficiently?

I want to add the constraints Ax=b to a Pyomo model with my numpy arrays A and b as efficiently as possible. Unfortunately, the performance is currently very bad. For the following example
import time
import numpy as np
import pyomo.environ as pyo
start = time.time()
rows = 287
cols = 2765
A = np.random.rand(rows, cols)
b = np.random.rand(rows)
mdl = pyo.ConcreteModel()
mdl.rows = range(rows)
mdl.cols = range(cols)
mdl.A = A
mdl.b = b
mdl.x_var = pyo.Var(mdl.cols, bounds=(0.0, None))
mdl.constraints = pyo.ConstraintList()
[mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row]) for row in mdl.rows]
mdl.obj = pyo.Objective(expr=sum(mdl.x_var[col] for col in mdl.cols), sense=pyo.minimize)
end = time.time()
print(end - start)
it takes almost 30 seconds because of the add statement and the huge number of columns. Is it possible to pass A, x, and b directly and quickly instead of adding the constraints row by row?
The main thing that is slowing down your construction above is the fact that you are building the constraint list elements within a list comprehension, which is unnecessary and causes a lot of bloat.
This line:
[mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row]) for row in mdl.rows]
Constructs a list of the captured results of each ConstraintList.add() call, which is a "rich" return. That list is an unnecessary byproduct of the loop you want to run over the add() function. Just change the construction to a plain for loop so those results are not captured:
for row in mdl.rows:
    mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row])
And the model construction time drops noticeably.
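For comparison, here is a sketch (not from the original answer) of the same rows written as an indexed Constraint with a rule function. This mainly tidies the model rather than guaranteeing a speedup; the constraint name ax_leq_b and the rule function are illustrative, and A and b stay as plain NumPy arrays outside the model:

import numpy as np
import pyomo.environ as pyo

rows, cols = 287, 2765
A = np.random.rand(rows, cols)
b = np.random.rand(rows)

mdl = pyo.ConcreteModel()
mdl.x_var = pyo.Var(range(cols), bounds=(0.0, None))

def ax_leq_b_rule(m, r):
    # One row of A x <= b, built from the numpy data held outside the model
    return sum(A[r, c] * m.x_var[c] for c in range(cols)) <= b[r]

mdl.ax_leq_b = pyo.Constraint(range(rows), rule=ax_leq_b_rule)
mdl.obj = pyo.Objective(expr=sum(mdl.x_var[c] for c in range(cols)), sense=pyo.minimize)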

To find the maximum correlation value from a csv file of 20000 rows efficiently in python

I have a CSV file with 12 columns and 20000 rows, and 12 images of shape 100 * 100, i.e., 10000 pixels each. I need to run each pixel across the 20000 data points in the CSV file to find the maximum correlation. Here is my function:
def Corrlate(pixels):
    max_value = -1
    max_roc = 0
    max_val = 0
    if(len(pixels[pixels != 0]) == 0):
        max_soc = 0
    else:
        for index, row in data.iterrows():
            val = [row['B2'], row['B3'], row['B4'],
                   row['B5'], row['B6'], row['B7'],
                   row['B8'], row['B11'], row['B12']]
            corr = np.corrcoef(pixels, val)
            if (corr > max_value).any():
                max_value = corr
                max_soc = row['SOC']
                max_val = val
    return max_soc

pixel = [0.1459019176, 0.209071098, 0.2940336262, 0.3246242805, 0.349758679, 0.375791541, 0.3990873849, 0.5312103156, 0.4791704195]
data = pd.read_csv("test.csv")
Corrlate(pixel)
test.csv
Or.,B2,B3,B4,B5,B6,B7,B8,B11,B12,SOC
0,0.09985147853,0.1279325334,0.1735545485,0.1963891543,0.2143978866,0.2315615778,0.2477941219,0.3175400435,0.3072681177,91.1
1,0.1353946488,0.1938304482,0.2661696295,0.2920155645,0.3128841044,0.3351216611,0.3539684059,0.4850393799,0.4505173283,21.4
2,0.1307552092,0.2112897844,0.3084664959,0.3367929562,0.3613391345,0.3852476516,0.4031711988,0.5193408686,0.4661771688,15.6
...
20000,0.1307552092,0.2112897844,0.3084664959,0.3367929562,0.3613391345,0.3852476516,0.4031711988,0.5193408686,0.4661771688,15855.6
The above function needs to be run 10000 times for an image of shape 100*100.
On my machine it was taking 2.5 hours to complete the process. Is there an efficient way to run this process in less time?
I think you could use apply instead of iterating over your rows; iterating is always pretty inefficient.
Since you are already using pandas, I would also recommend the pandarallel library to distribute the apply.
def func_to_apply(row):
    val = [row['B2'], row['B3'], row['B4'],
           row['B5'], row['B6'], row['B7'],
           row['B8'], row['B11'], row['B12']]
    # np.corrcoef returns a 2x2 matrix; [0, 1] is the correlation of pixel vs. val
    return np.corrcoef(pixel, val)[0, 1]

data["corr"] = data.apply(func_to_apply, axis=1)
data[data["corr"].max() == data["corr"]]["SOC"]
That would be the non-distributed way.
Using pandarallel:
from pandarallel import pandarallel
pandarallel.initialize()
data["corr"] = data.parallel_apply(func_to_apply, axis=1)
data[data["corr"].max() == data["corr"]]["SOC"]
I wrote the code without testing it, but it should work; let me know if something is wrong or if it helped.
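As a further (untested) idea beyond apply, the whole search can be vectorized with NumPy: compute the Pearson correlation of the pixel vector against every row at once and pick the SOC of the best row. This is a sketch under the assumption that the nine band columns are as in test.csv; best_soc is a hypothetical helper name:

import numpy as np
import pandas as pd

BANDS = ['B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B11', 'B12']

def best_soc(pixel, data):
    """Return the SOC of the row whose band values correlate best with `pixel`."""
    X = data[BANDS].to_numpy(dtype=float)   # shape (n_rows, 9)
    p = np.asarray(pixel, dtype=float)      # shape (9,)
    Xc = X - X.mean(axis=1, keepdims=True)  # center each row
    pc = p - p.mean()
    # Pearson correlation of `pixel` against every row in one shot
    corr = (Xc @ pc) / (np.linalg.norm(Xc, axis=1) * np.linalg.norm(pc))
    return data['SOC'].to_numpy()[np.nanargmax(corr)]

data = pd.read_csv("test.csv")
print(best_soc(pixel, data))  # pixel as defined in the question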

Ensuring performance of sketching/streaming algorithm (countSketch)

I have implemented what is known as a CountSketch in Python (page 17: https://arxiv.org/pdf/1411.4357.pdf), but my implementation is currently lacking in performance. The algorithm computes the product SA, where A is an n x d matrix and S is an m x n matrix defined as follows: for every column of S, uniformly at random select a row (hash bucket) from the m rows, and for that given row uniformly at random select +1 or -1. So S is a matrix with exactly one nonzero in every column and zeros everywhere else.
My intention is to compute SA in a streaming fashion by reading the entries of A. The idea for my implementation is as follows: observe a sequence of triples (i,j,A_ij) and return a sequence (h(i), j, s(i)A_ij) where:
- h(i) is a hash bucket (a row of S chosen uniformly at random from the m possible rows)
- s(i) is the random sign function as described above.
I have assumed that the matrix is in row order so that the first row of A arrives in its entirety before the next row of A arrives because this limits the number of calls I need to select a random bucket or the need to use a hash library. I have also assumed that the number of nonzero entries (or the length of the input stream) is known so that I can efficiently encode the iteration.
My problem is that the sketch should satisfy (1-error)*||Ax||^2 <= ||SAx||^2 <= (1+error)*||Ax||^2 and also keep the difference in Frobenius norm between A^T S^T S A and A^T A small. However, while the first condition seems to hold for my implementation, the latter quantity is consistently too small. I was wondering if there is an obvious reason for this that I am missing, because it seems to be underestimating that quantity.
I am open to feedback on changing the code if there are obvious improvements to be made. The single call to npr.choice is made to avoid looking through a (potentially large) array or hash table containing the row hashes for each row; because the matrix is in row order we can just keep the hash for that row until a new row is seen.
N.B. If you don't want to use numba, just comment out the import and the function decorator and it will run with standard numpy/scipy.
import numpy as np
import numpy.random as npr
import scipy.sparse as sparse
from scipy.sparse import coo_matrix
import numba
from numba import jit
@jit(nopython=True)  # comment this out if you just want numpy
def countSketch(input_rows, input_data,
                input_nnz,
                sketch_size, seed=None):
    '''
    input_rows: row indices for data (can be repeats)
    input_data: values seen in row location
    input_nnz : number of nonzeros in the data (can replace with
                len(input_data) but avoided here for speed)
    sketch_size: int
    seed=None : random seed
    '''
    hashed_rows = np.empty(input_rows.shape, dtype=np.int32)
    current_row = 0
    hash_val = npr.choice(sketch_size)
    sign_val = npr.choice(np.array([-1.0, 1.0]))
    #print(hash_val)
    hashed_rows[0] = hash_val
    #print(hash_val)
    for idx in np.arange(input_nnz):
        print(idx)
        row_id = input_rows[idx]
        data_val = input_data[idx]
        if row_id == current_row:
            hashed_rows[idx] = hash_val
            input_data[idx] = sign_val * data_val
        else:
            # make new hashes
            hash_val = npr.choice(sketch_size)
            sign_val = npr.choice(np.array([-1.0, 1.0]))
            hashed_rows[idx] = hash_val
            input_data[idx] = sign_val * data_val
    return hashed_rows, input_data

def sort_row_order(input_data):
    sorted_row_column = np.array((input_data.row,
                                  input_data.col,
                                  input_data.data))
    idx = np.argsort(sorted_row_column[0])
    sorted_rows = np.array(sorted_row_column[0, idx], dtype=np.int32)
    sorted_cols = np.array(sorted_row_column[1, idx], dtype=np.int32)
    sorted_data = np.array(sorted_row_column[2, idx], dtype=np.float64)
    return sorted_rows, sorted_cols, sorted_data

if __name__ == "__main__":
    import time
    from tabulate import tabulate

    matrix = sparse.random(1000, 50, 0.1)
    x = np.random.randn(matrix.shape[1])
    true_norm = np.linalg.norm(matrix @ x, ord=2)**2
    tidy_data = sort_row_order(matrix)

    sketch_size = 300
    start = time.time()
    hashed_rows, sketched_data = countSketch(tidy_data[0],
                                             tidy_data[2], matrix.nnz, sketch_size)
    duration_slow = time.time() - start
    S_A = sparse.coo_matrix((sketched_data, (hashed_rows, matrix.col)))
    approx_norm_slow = np.linalg.norm(S_A @ x, ord=2)**2
    rel_error_slow = approx_norm_slow / true_norm
    #print("Sketch time: {}".format(duration_slow))

    start = time.time()
    hashed_rows, sketched_data = countSketch(tidy_data[0],
                                             tidy_data[2], matrix.nnz, sketch_size)
    duration = time.time() - start
    #print("Sketch time: {}".format(duration))
    S_A = sparse.coo_matrix((sketched_data, (hashed_rows, matrix.col)))
    approx_norm = np.linalg.norm(S_A @ x, ord=2)**2
    rel_error = approx_norm / true_norm
    #print("Relative norms: {}".format(approx_norm/true_norm))

    print(tabulate([[duration_slow, rel_error_slow, 'Yes'],
                    [duration, rel_error, 'No']],
                   headers=['Sketch Time', 'Relative Error', 'Dry Run'],
                   tablefmt='orgtbl'))
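For reference, a non-streaming cross-check (not part of the original post): build S explicitly as a scipy sparse matrix and compute S @ A directly, which can be compared against the streaming result. The helper name build_count_sketch and the sizes below are illustrative.

import numpy as np
import numpy.random as npr
import scipy.sparse as sparse

def build_count_sketch(m, n, seed=None):
    """Return an m x n CountSketch matrix: one random +/-1 entry per column."""
    rng = npr.default_rng(seed)
    rows = rng.integers(0, m, size=n)        # h(i): hash bucket per column
    signs = rng.choice([-1.0, 1.0], size=n)  # s(i): random sign per column
    return sparse.coo_matrix((signs, (rows, np.arange(n))), shape=(m, n)).tocsr()

A = sparse.random(1000, 50, 0.1).tocsr()
S = build_count_sketch(300, A.shape[0], seed=0)
SA = S @ A  # the same object the streaming countSketch approximates

x = np.random.randn(A.shape[1])
print(np.linalg.norm(SA @ x)**2 / np.linalg.norm(A @ x)**2)  # should be close to 1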

How to speed up iterating through an array / matrix? Tried both pandas and numpy arrays

I would like to do some feature enrichment through a large 2 dimensional array (15,100m).
Working on a sample set with 100'000 records showed that I need to get this faster.
Edit (data model info)
To simplify, let's say we have only two relevant columns:
IP (identifier)
Unix (timestamp in seconds since 1970)
I would like to add a 3rd column, counting how many times this IP has shown up in the past 12 hours.
End edit
My first attempt was using pandas, because it was comfortable working with named dimensions, but too slow:
for index, row in tqdm_notebook(myData.iterrows(), desc='iterrows'):
    # how many times was the IP address (and specific device) around in the prior 12 hours?
    hours = 12
    seen = myData[(myData['ip'] == row['ip'])
                  & (myData['device'] == row['device'])
                  & (myData['os'] == row['os'])
                  & (myData['unix'] < row['unix'])
                  & (myData['unix'] > (row['unix'] - (60 * 60 * hours)))].shape[0]
    ip_seen = myData[(myData['ip'] == row['ip'])
                     & (myData['unix'] < row['unix'])
                     & (myData['unix'] > (row['unix'] - (60 * 60 * hours)))].shape[0]
    myData.loc[index, 'seen'] = seen
    myData.loc[index, 'ip_seen'] = ip_seen
Then I switched to numpy arrays and hoped for a better result, but it is still too slow to run against the full dataset:
# speed test numpy arrays
for i in np.arange(myArray.shape[0]):
    hours = 12
    ip, device, os, ts = myArray[i, [0, 3, 4, 12]]
    ip_seen = myArray[(np.where((myArray[:, 0] == ip)
                                & (myArray[:, 12] < ts)
                                & (myArray[:, 12] > (ts - 60 * 60 * hours))))].shape[0]
    device_seen = myArray[(np.where((myArray[:, 0] == ip)
                                    & (myArray[:, 2] == device)
                                    & (myArray[:, 3] == os)
                                    & (myArray[:, 12] < ts)
                                    & (myArray[:, 12] > (ts - 60 * 60 * hours))))].shape[0]
    myArray[i, 13] = ip_seen
    myArray[i, 14] = device_seen
My next idea would be to iterate only once, and maintain a growing dictionary of the current count, instead of looking backwards in every iteration.
But that would have some other drawbacks (e.g. how to keep track when to reduce count for observations falling out of the 12h window).
How would you approach this problem?
Could it even be an option to use low-level TensorFlow functions to involve a GPU?
Thanks
The only way to speed things up is to avoid looping. In your case you can try using rolling with a window of the time span that you want, with the Unix timestamp as a datetime index (assuming the records are sorted by timestamp; otherwise you would need to sort first). This should work fine for ip_seen:
ip = myData['ip']
ip.index = pd.to_datetime(myData['unix'], unit='s')
myData['ip_seen'] = ip.rolling('5h') \
    .agg(lambda w: np.count_nonzero(w[:-1] == w[-1])) \
    .values.astype(np.int32)
However, when the aggregation involves multiple columns, like in the seen column, it gets more complicated. Currently (see Pandas issue #15095) rolling functions do not support aggregations spanning two dimensions. A workaround could be merging the columns of interest into a single new series, for example a tuple (which may work better if values are numbers) or a string (which may be better if values are already strings). For example:
criteria = myData['ip'] + '|' + myData['device'] + '|' + myData['os']
criteria.index = pd.to_datetime(myData['unix'], unit='s')
myData['seen'] = criteria.rolling('5h') \
    .agg(lambda w: np.count_nonzero(w[:-1] == w[-1])) \
    .values.astype(np.int32)
EDIT
Apparently rolling only works with numeric types, which leaves us with two options:
Manipulate the data to use numeric types. For the IP this is easy, since it actually represents a 32-bit number (or 64 bits for IPv6, I guess). For device and OS, assuming they are strings now, it gets more complicated: you would have to map each possible value to an integer and then merge it with the IP into a long value, e.g. putting these in the higher bits or something like that (maybe even impossible with IPv6, since the biggest integers NumPy supports right now are 64 bits). A minimal encoding sketch is shown after this answer's code below.
Roll over the index of myData (which should now not be a datetime index, because rolling cannot work with that either) and use the index window to get the necessary data and operate on it:
# Use sequential integer index
idx_orig = myData.index
myData.reset_index(drop=True, inplace=True)
# Index to roll
idx = pd.Series(myData.index)
idx.index = pd.to_datetime(myData['unix'], unit='s')

# Roll aggregation function
def agg_seen(w, data, fields):
    # Use slice for faster data frame slicing
    slc = slice(int(w[0]), int(w[-2])) if len(w) > 1 else []
    match = data.loc[slc, fields] == data.loc[int(w[-1]), fields]
    return np.count_nonzero(np.all(match, axis=1))

# Do rolling
myData['ip_seen'] = idx.rolling('5h') \
    .agg(lambda w: agg_seen(w, myData, ['ip'])) \
    .values.astype(np.int32)
myData['seen'] = idx.rolling('5h') \
    .agg(lambda w: agg_seen(w, myData, ['ip', 'device', 'os'])) \
    .values.astype(np.int32)
# Put index back
myData.index = idx_orig
This is not how rolling is meant to be used, though, and I'm not sure if this gives much better performance than just looping.
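As a minimal sketch of option 1 above (not part of the original answer; it assumes IPv4 addresses stored as strings and low-cardinality device/os columns), the three columns could be packed into one 64-bit integer that a numeric rolling window can handle:

import ipaddress
import pandas as pd

# IPv4 address string -> 32-bit integer
ip_num = myData['ip'].map(lambda s: int(ipaddress.ip_address(s)))
# Map each distinct device/OS string to a small integer code
device_num = pd.factorize(myData['device'])[0].astype('int64')
os_num = pd.factorize(myData['os'])[0].astype('int64')
# Pack device and OS codes into the bits above the IPv4 value
myData['key'] = (device_num << 48) | (os_num << 40) | ip_num.astype('int64')

The packed key could then replace the string criteria series in the rolling aggregation above.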
As mentioned in the comment to @jdehesa, I took another approach which allows me to iterate only once through the entire dataset and pull the (decaying) weight from an index.
decay_window = 60 * 60 * 12  # every 12 hours
decay = 0.5  # fall by 50% every window

ip_idx = pd.DataFrame(myData.ip.unique())
ip_idx['ts_seen'] = 0
ip_idx['ip_seen'] = 0
ip_idx.columns = ['ip', 'ts_seen', 'ip_seen']
ip_idx.set_index('ip', inplace=True)

for index, row in myData.iterrows():  # all
    # How often was this IP seen?
    prior_ip_seen = ip_idx.loc[(row['ip'], 'ip_seen')]
    prior_ts_seen = ip_idx.loc[(row['ip'], 'ts_seen')]
    delay_since_count = row['unix'] - ip_idx.loc[(row['ip'], 'ts_seen')]
    new_ip_seen = prior_ip_seen * decay**(delay_since_count / decay_window) + 1
    ip_idx.loc[(row['ip'], 'ip_seen')] = new_ip_seen
    ip_idx.loc[(row['ip'], 'ts_seen')] = row['unix']
    myData.iloc[index, 14] = new_ip_seen - 1
That way the result is not the fixed time window as requested initially, but prior observations "fade out" over time, giving frequent recent observations a higher weight.
This feature carries more information than the simplified (and, as it turned out, more expensive) approach initially planned.
Thanks for your input!
Edit
In the meantime I switched to numpy arrays for the same operation, which now only takes a fraction of the time (loop with 200m updates in <2h).
Just in case somebody looks for a starting point:
%%time
import sys

## temporary lookup
ip_seen_ts = [0] * 365000
ip_seen_count = [0] * 365000
cnt = 0
window = 60 * 60 * 12  # 12h
decay = 0.5
counter = 0
chunksize = 10000000

store = pd.HDFStore('store.h5')
t = time.process_time()
try:
    store.remove('myCount')
except:
    print("myData not present.")

for myHdfData in store.select_as_multiple(['myData', 'myFeatures'], columns=['ip', 'unix', 'ip_seen'], chunksize=chunksize):
    print(counter, time.process_time() - t)
    #display(myHdfData.head(5))
    counter += chunksize
    t = time.process_time()
    sys.stdout.flush()
    keep_index = myHdfData.index.values
    myArray = myHdfData.as_matrix()
    for row in myArray[:, :]:
        i = (row[0].astype('uint32'))  # IP as identifier
        u = (row[1].astype('uint32'))  # timestamp
        try:
            delay = u - ip_seen_ts[i]
        except:
            delay = 0
        ip_seen_ts[i] = u
        try:
            ip_seen_count[i] = ip_seen_count[i] * decay**(delay / window) + 1
        except:
            ip_seen_count[i] = 1
        row[3] = np.tanh(ip_seen_count[i] - 1)  # tanh to normalize between 0 and 1
    myArrayAsDF = pd.DataFrame(myArray, columns=['c_ip', 'c_unix', 'c_ip2', 'ip_seen'])
    myArrayAsDF.set_index(keep_index, inplace=True)
    store.append('myCount', myArrayAsDF)

store.close()
