Dask Concatenate a Series of Dataframes - python

I have a Dask Series of pandas DataFrames. I would like to use dask.dataframe.multi.concat to convert this into a Dask DataFrame. However, dask.dataframe.multi.concat always requires a list of DataFrames.
I could call compute on the Dask Series of pandas DataFrames to get a pandas Series of DataFrames, at which point I could turn that into a list. But I think it would be better not to call compute and instead acquire the Dask DataFrame directly from the Dask Series of pandas DataFrames.
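For concreteness, the compute-based route I'd like to avoid looks roughly like this (a sketch only; series_of_dfs stands in for the Dask Series produced by the code below):
# Sketch of the approach I'd rather avoid: materialise the Dask Series first,
# then concatenate the resulting pandas DataFrames eagerly.
# `series_of_dfs` is a placeholder name for the Dask Series built further down.
pdf_series = series_of_dfs.compute()        # pandas Series of DataFrames
combined = pd.concat(pdf_series.tolist())   # a single pandas DataFrame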
What would be the best way to do this? Here's my code that produces the series of DataFrames:
import pandas as pd
import dask.dataframe as dd
import operator
import numpy as np
import math
import itertools
def apportion_pcts(pcts, total):
    """Apportion an integer by percentages
    Uses the largest remainder method
    """
    if sum(pcts) != 100:
        raise ValueError('Percentages must add up to 100')
    proportions = [total * (pct / 100) for pct in pcts]
    apportions = [math.floor(p) for p in proportions]
    remainder = total - sum(apportions)
    remainders = [(i, p - math.floor(p)) for (i, p) in enumerate(proportions)]
    remainders.sort(key=operator.itemgetter(1), reverse=True)
    for (i, _) in itertools.cycle(remainders):
        if remainder == 0:
            break
        else:
            apportions[i] += 1
            remainder -= 1
    return apportions
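# (worked example, as a sanity check on the logic above: apportion_pcts([80, 20], 6)
#  computes floors [4, 1] and hands the single leftover unit to the slot with the
#  largest fractional remainder, returning [5, 1])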
# images_df = dd.read_csv('./tests/data/classification/images.csv')
images_df = pd.DataFrame({"image_id": [0,1,2,3,4,5], "image_class_id": [0,1,1,3,3,5]})
images_df = dd.from_pandas(images_df, npartitions=1)
output_ratio = [80, 20]
def partition_class(partition):
    size = len(partition)
    proportions = apportion_pcts(output_ratio, size)
    slices = []
    start = 0
    for proportion in proportions:
        s = slice(start, start + proportion)
        slices.append(partition.iloc[s, :])
        start = start + proportion
    slicess = pd.Series(slices)
    return slicess

partitioned_schema = dd.utils.make_meta(
    [(0, object), (1, object)], pd.Index([], name='image_class_id'))
partitioned_df = images_df.groupby('image_class_id')
partitioned_df = partitioned_df.apply(partition_class, meta=partitioned_schema)
In partitioned_df, we can use partitioned_df[0] or partitioned_df[1] to get a Series of DataFrame objects.
Here is an example of the CSV file:
image_id,image_width,image_height,image_path,image_class_id
0,224,224,tmp/data/image_matrices/0.npy,5
1,224,224,tmp/data/image_matrices/1.npy,0
2,224,224,tmp/data/image_matrices/2.npy,4
3,224,224,tmp/data/image_matrices/3.npy,1
4,224,224,tmp/data/image_matrices/4.npy,9
5,224,224,tmp/data/image_matrices/5.npy,2
6,224,224,tmp/data/image_matrices/6.npy,1
7,224,224,tmp/data/image_matrices/7.npy,3
8,224,224,tmp/data/image_matrices/8.npy,1
9,224,224,tmp/data/image_matrices/9.npy,4
I tried to do a reduction afterwards, but this doesn't quite work because of a proxy 'foo' string:
def zip_partitions(s):
    r = []
    for c in s.columns:
        l = s[c].tolist()
        r.append(pd.concat(l))
    return pd.Series(r)

output_df = partitioned_df.reduction(
    chunk=zip_partitions
)
The proxy list that I'm attempting to concat is ['foo', 'foo']. What is this phase for? Is it how dask discovers how to perform the task? Certain operations don't work during it, though. I'm wondering whether I'm getting these strings because I'm operating over object columns.
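As far as I can tell (an assumption based on how dask infers schemas), the 'foo' values come from the metadata pass: dask runs the chunk function on a small placeholder version of the meta in which object columns are filled with dummy strings.
from dask.dataframe.utils import make_meta, meta_nonempty

# Sketch of where the 'foo' placeholders appear to come from (assumption:
# dask's schema-inference pass); object columns in the sample frame are
# filled with dummy strings, and that sample is what the chunk function
# sees while the graph is being built.
meta = make_meta([(0, object), (1, object)])
print(meta_nonempty(meta))   # object columns show placeholder values such as 'foo'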

I figured out an answer by applying the reduction at the very end to "zip" up each dataframe into a series of dataframes.
import pandas as pd
import dask.dataframe as dd
import operator
import numpy as np
import math
import itertools
def apportion_pcts(pcts, total):
    """Apportion an integer by percentages
    Uses the largest remainder method
    """
    if sum(pcts) != 100:
        raise ValueError('Percentages must add up to 100')
    proportions = [total * (pct / 100) for pct in pcts]
    apportions = [math.floor(p) for p in proportions]
    remainder = total - sum(apportions)
    remainders = [(i, p - math.floor(p)) for (i, p) in enumerate(proportions)]
    remainders.sort(key=operator.itemgetter(1), reverse=True)
    for (i, _) in itertools.cycle(remainders):
        if remainder == 0:
            break
        else:
            apportions[i] += 1
            remainder -= 1
    return apportions
images_df = dd.read_csv('./tests/data/classification/images.csv', blocksize=1024)
output_ratio = [80, 20]
def partition_class(group_df, ratio):
    proportions = apportion_pcts(ratio, len(group_df))
    partitions = []
    start = 0
    for proportion in proportions:
        s = slice(start, start + proportion)
        partitions.append(group_df.iloc[s, :])
        start += proportion
    return pd.Series(partitions)

partitioned_schema = dd.utils.make_meta(
    [(i, object) for i in range(len(output_ratio))],
    pd.Index([], name='image_class_id'))
partitioned_df = images_df.groupby('image_class_id')
partitioned_df = partitioned_df.apply(
    partition_class, meta=partitioned_schema, ratio=output_ratio)

def zip_partitions(partitions_df):
    partitions = []
    for i in partitions_df.columns:
        partitions.append(pd.concat(partitions_df[i].tolist()))
    return pd.Series(partitions)

zipped_schema = dd.utils.make_meta((None, object))
partitioned_ds = partitioned_df.reduction(
    chunk=zip_partitions, meta=zipped_schema)
I think it should be possible to combine the apply and the reduction into a single custom aggregation to represent a map-reduce operation.
However, I could not figure out how to do that with dask's custom aggregation, since it operates on a SeriesGroupBy.
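For reference, the general shape of dask's custom aggregation API looks like this (a minimal, unrelated sketch; the chunk/agg callables receive grouped Series, which is why returning whole DataFrames per group doesn't fit it neatly):
import dask.dataframe as dd

# Minimal sketch of dd.Aggregation (not a solution to the problem above):
# each callable receives a SeriesGroupBy, so it aggregates one column per group.
custom_sum = dd.Aggregation(
    name='custom_sum',
    chunk=lambda grouped: grouped.sum(),   # per-partition group sums
    agg=lambda chunks: chunks.sum(),       # combine the partition results
)
# usage: images_df.groupby('image_class_id')['image_id'].agg(custom_sum)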

Related

Compare rows with conditions and generate a new dataframe in Pandas

I have a very big dataframe with this structure:
Timestamp Val1
Here you can see a real sample:
Timestamp Temp
0 1622471518.92911 36.443
1 1622471525.034114 36.445
2 1622471531.148139 37.447
3 1622471537.284337 36.449
4 1622471543.622588 43.345
5 1622471549.734765 36.451
6 1622471556.2518 36.454
7 1622471562.361368 41.461
8 1622471568.472718 42.468
9 1622471574.826475 36.470
What I want to do is compare the Temp column with itself: if one value is higher than another by "X" (for example 4) and the time between them is lower than "Y" (for example 180 min), then I save some data from both rows.
Right now I'm using two nested for loops, but this takes too much time, and pandas usually has a way to avoid this.
This is my code:
import datetime as dt

cap_time, maxim = 180, 4
cap_time = cap_time * 60
temps = df['Temperature'].values
times = df['Timestamp'].values
results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        print(i, j, len(temps))
        if float(temps[j]) > float(temps[i])*maxim:
            timeIn = dt.datetime.fromtimestamp(float(times[i]))
            timeOut = dt.datetime.fromtimestamp(float(times[j]))
            diff = timeOut - timeIn
            tdiff = diff.total_seconds()
            if tdiff > cap_time:
                break
            else:
                res = [temps[i], temps[j], times[i], times[j], tdiff/60, cap_time/60, maxim]
                results.append(res)
                break
# Then I save it in a dataframe and do other actions
Can pandas help me achieve my goal and reduce the execution time? I found DataFrame.diff(), but I'm not sure it's what I want (or I don't know how to use it).
Thank you very much.
Short of avoiding the nested for loops, you can already speed things up by avoiding all unnecessary calculations and conversions within the loops. In particular, you can use NumPy broadcasting to define a Boolean array beforehand, in which you can look up whether the condition is met:
import numpy as np
temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]
condition = np.logical_and(temps_diff > maxim,
                           times_diff < cap_time)
results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        if condition[i, j]:
            results.append([temps[i], temps[j],
                            times[i], times[j],
                            times_diff[i, j]])
results
[[36.443, 43.345, 1622471518.92911, 1622471543.622588, 24.693477869033813],
...
[36.454, 42.468, 1622471556.2518, 1622471568.472718, 12.22091794013977]]
To avoid the loops altogether, you could define a 3-dimensional full results array and then use the condition array as a Boolean mask to filter out the results you want:
import numpy as np
n = len(temps)
temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]
condition = np.logical_and(temps_diff > maxim,
                           times_diff < cap_time)
results_full = np.stack([np.repeat(temps[:, None], n, axis=1),
                         np.tile(temps, (n, 1)),
                         np.repeat(times[:, None], n, axis=1),
                         np.tile(times, (n, 1)),
                         times_diff])
results = results_full[np.stack(results_full.shape[0] * [condition])]
results.reshape((5, -1)).T
array([[ 3.64430000e+01, 4.33450000e+01, 1.62247152e+09,
1.62247154e+09, 2.46934779e+01],
...
[ 3.64540000e+01, 4.24680000e+01, 1.62247156e+09,
1.62247157e+09, 1.22209179e+01],
...
])
As you can see, the resulting numbers are the same as above, although this time the results array will contain more rows, because we didn't use the shortcut of starting the inner loop at i+1.
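If the j > i restriction from the loops is wanted here as well, one way (a sketch that builds on the condition and results_full arrays above) is to mask out the diagonal and lower triangle before selecting:
# Sketch: keep only pairs with j > i, mirroring the i+1 start of the inner loop.
upper = np.triu(np.ones_like(condition, dtype=bool), k=1)
condition_upper = condition & upper
results = results_full[np.stack(results_full.shape[0] * [condition_upper])]
results.reshape((5, -1)).T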

How to group and sum certain columns of an array based on their classification (eg to group cities by country)

The issue
I have arrays which track certain items over time. The items belong to certain categories. I want to calculate the sum by time and category, e.g. to go from a table by time and city to one by time and country.
I have found a couple of ways, but they seem clunky - there must be a better way! Surely I'm not the first one with this issue? Maybe using np.where?
More specifically:
I have a number of numpy arrays of shape (p x i), where p is the period and i is the item I am tracking over time.
I then have a separate array of shape i which classifies the items into categories (red, green, yellow, etc.).
What I want to do is calculate an array of shape (p x number of unique categories) which sums the values of the big array by time and category.
I'd need the code to be as efficient as possible, as I need to do this multiple times on arrays which can be up to 400 x 1,000,000.
What I have tried:
This question covers a number of ways to groupby without resorting to pandas. I like the scipy.ndimage approach, but AFAIK it works on one dimension only.
I have tried a solution with pandas:
I create a dataframe of shape periods x items
I unpivot it with pd.melt(), join the categories and do a crosstab period/categories
I have also tried a set of loops, optimised with numba:
A first loop creates an array which converts the categories into integers, i.e. the first category in alphabetical order becomes 0, the 2nd 1, etc
A second loop iterates through all the items, then for each item it iterates through all the periods and sums by category
My findings
for small arrays, pandas is faster
for large arrays, numba is better, but it's better to set parallel = False in the numba decorator
for very large arrays, numba with parallel = True shines
parallel = True makes use of numba's parallelisation by using numba.prange on the outer loops.
PS I am aware of the pitfalls of premature optimisation etc etc - I am only looking into this because a significant amount of time is spent doing precisely this
The code
import numpy as np
import pandas as pd
import time
import numba
periods = 300
n = int(2000)
categories = np.tile(['red','green','yellow','brown'],n)
my_array = np.random.randint(low = 0, high = 10, size = (periods, len(categories) ))
# my_arrays will have shape (periods x (n * number of categories))
#---- pandas
start = time.time()
df_categories = pd.DataFrame(data = categories).reset_index().rename(columns ={'index':'item',0:'category'})
df = pd.DataFrame(data = my_array)
unpiv = pd.melt(df.reset_index(), id_vars ='index', var_name ='item', value_name ='value').rename( columns = {'index':'time'})
unpiv = pd.merge(unpiv, df_categories, on='item' )
crosstab = pd.crosstab( unpiv['time'], unpiv['category'], values = unpiv['value'], aggfunc='sum' )
print("panda crosstab in:")
print(time.time() - start)
# yep, I know that timeit.timer would have been better, but I was in a hurry :)
print("")
#---- numba
@numba.jit(nopython=True, parallel=True, nogil=True)
def numba_classify(x, categories):
    cat_uniq = np.unique(categories)
    num_categories = len(cat_uniq)
    num_items = x.shape[1]
    periods = x.shape[0]
    categories_converted = np.zeros(len(categories), dtype=np.int32)
    out = np.zeros((periods, num_categories))
    # before running the actual classification, I must convert the categories, which can be strings, to
    # the corresponding number in cat_uniq, e.g. if brown is the first category by alphabetical sorting, then
    # brown --> 0, etc
    for i in numba.prange(num_items):
        for c in range(num_categories):
            if categories[i] == cat_uniq[c]:
                categories_converted[i] = c
    for i in numba.prange(num_items):
        for p in range(periods):
            out[p, categories_converted[i]] += x[p, i]
    return out
start = time.time()
numba_out = numba_classify(my_array, categories)
print("numba done in:")
print(time.time() - start)
You can use df.groupby(categories, axis=1).sum() for a substantial speedup.
import numpy as np
import pandas as pd
import time
def make_data(periods, n):
    categories = np.tile(['red','green','yellow','brown'], n)
    my_array = np.random.randint(low=0, high=10, size=(periods, len(categories)))
    return categories, pd.DataFrame(my_array)

for n in (200, 2000, 20000):
    categories, df = make_data(300, n)
    true_n = n * 4
    start = time.time()
    tabulation = df.groupby(categories, axis=1).sum()
    elapsed = time.time() - start
    print(f"300 x {true_n:5}: {elapsed:.3f} seconds")
# prints:
300 x 800: 0.005 seconds
300 x 8000: 0.021 seconds
300 x 80000: 0.673 seconds
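As a side note (an assumption about newer pandas versions, where axis=1 in groupby is deprecated), the same tabulation can be expressed by transposing first:
# Equivalent tabulation without axis=1 (sketch for newer pandas):
tabulation = df.T.groupby(categories).sum().T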

Find when the values of a pandas.Series change by at least x

I have a time series s stored as a pandas.Series and I need to find when the value tracked by the time series changes by at least x.
In pseudocode:
print s(0)
s* = s(0)
for all t in ]0, t_max]:
    if |s(t) - s*| > x:
        s* = s(t)
        print s*
Naively, this can be coded in Python as follows:
import pandas as pd
def find_changes(s, x):
    changes = []
    s_last = None
    for index, value in s.iteritems():
        if s_last is None:
            s_last = value
        if value - s_last > x or s_last - value > x:
            changes += [index, value]
            s_last = value
    return changes
My data set is large, so I can't just use the method above. Moreover, I cannot use Cython or Numba due to limitations of the framework I will run this on. I can (and plan to) use pandas and NumPy.
I'm looking for some guidance on what NumPy vectorized/optimized methods to use and how.
Thanks!
EDIT: Changed code to match pseudocode.
I don't know if I am understanding you correctly, but here is how I interpreted the problem:
import pandas as pd
import numpy as np
# Our series of data.
data = pd.DataFrame(np.random.rand(10), columns = ['value'])
# The threshold.
threshold = .33
# For each point t, grab t - 1.
data['value_shifted'] = data['value'].shift(1)
# Absolute difference of t and t - 1.
data['abs_change'] = abs(data['value'] - data['value_shifted'])
# Test against the threshold.
data['change_exceeds_threshold'] = np.where(data['abs_change'] > threshold, 1, 0)
print(data)
Giving:
value value_shifted abs_change change_exceeds_threshold
0 0.005382 NaN NaN 0
1 0.060954 0.005382 0.055573 0
2 0.090456 0.060954 0.029502 0
3 0.603118 0.090456 0.512661 1
4 0.178681 0.603118 0.424436 1
5 0.597814 0.178681 0.419133 1
6 0.976092 0.597814 0.378278 1
7 0.660010 0.976092 0.316082 0
8 0.805768 0.660010 0.145758 0
9 0.698369 0.805768 0.107400 0
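If only the flagged rows are needed, a boolean filter on that indicator column gives them directly (a small usage sketch):
# Sketch: keep just the rows whose change exceeded the threshold.
flagged = data[data['change_exceeds_threshold'] == 1]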
I don't think the pseudocode can be vectorized, because the next state of s* depends on the previous state. There's a pure-Python solution (a single pass):
import random
import pandas as pd

s = [random.randint(0, 100) for _ in range(100)]
res = []  # record changes
thres = 20
ss = s[0]
for i in range(len(s)):
    if abs(s[i] - ss) > thres:
        ss = s[i]
        res.append([i, s[i]])
df = pd.DataFrame(res, columns=['index', 'value'])
I think there's no way to run faster than O(N) in this case.
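The same single pass also works directly on the pandas Series from the question, without any extra libraries (a sketch; find_changes_single_pass is an illustrative name):
import numpy as np

def find_changes_single_pass(s, x):
    # Same O(N) logic as above, applied to a pandas Series `s`;
    # pulling the values into a NumPy array keeps the loop body cheap.
    values = np.asarray(s)
    ref = values[0]
    changes = []
    for idx, v in zip(s.index, values):
        if abs(v - ref) > x:
            ref = v
            changes.append((idx, v))
    return changes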

How to efficiently join a list of values to a list of intervals?

I have a data frame which can be constructed as follows:
import numpy as np
import pandas as pd
import scipy.stats

df = pd.DataFrame({'value': scipy.stats.norm.rvs(0, 1, size=1000),
                   'start': np.abs(scipy.stats.norm.rvs(0, 20, size=1000))})
df['end'] = df['start'] + np.abs(scipy.stats.norm.rvs(5, 5, size=1000))
df[:10]
start value end
0 9.521781 -0.570097 17.708335
1 3.929711 -0.927318 15.065047
2 3.990466 0.756413 4.841934
3 20.676291 -1.418172 28.284301
4 13.084246 1.280723 14.121626
5 29.784740 0.236915 32.791751
6 21.626625 1.144663 28.739413
7 18.524309 0.101871 27.271344
8 21.288152 -0.727120 27.049582
9 13.556664 0.713141 22.136275
Each row represents a value assigned to an interval (start, end)
Now, I would like to get a list of the best values occurring at times 10, 13, 15, ..., 70. (It is similar to the geometric index in SQL, if you are familiar with that.)
Below is my first attempt in Python with pandas; it takes 18.5 ms. Can anyone help improve it? (This procedure would be called 1M or more times with different data frames in my program.)
def get_values(data):
    data.sort_index(by='value', ascending=False, inplace=True)  # this takes 0.2ms
    # can we get rid of it? since we don't really need sort...
    # all we need is the max value for each interval.
    # But if we have to keep it for simplicity it is ok.
    ret = []
    #data = data[(data['end'] >= 10) & (data['start'] <= 71)]
    for t in range(10, 71, 2):
        interval = data[(data['end'] >= t) & (data['start'] <= t)]
        if not interval.empty:
            ret.append(interval['value'].values[0])
        else:
            for i in range(t, 71, 2):
                ret.append(None)
            break
    return ret
#%prun -l 10 print get_values(df)
%timeit get_values(df)
The 2nd attempt decomposes the pandas operations into NumPy as much as possible, and it takes around 0.7 ms:
def get_values(data):
    data.sort_index(by='value', ascending=False, inplace=True)
    ret = []
    df_end = data['end'].values
    df_start = data['start'].values
    df_value = data['value'].values
    for t in range(10, 71, 2):
        values = df_value[(df_end >= t) & (df_start <= t)]
        if len(values) != 0:
            ret.append(values[0])
        else:
            for i in range(t, 71, 2):
                ret.append(None)
            break
    return ret
#%prun -l 10 print get_values(df)
%timeit get_values(df)
Can we improve this further? I guess the next step is the algorithmic level; both of the above are just naive implementations of the logic.
I don't understand the handling of the empty case in your code, but here is a faster version if we ignore that empty handling:
import scipy.stats as stats
import pandas as pd
import numpy as np
df = pd.DataFrame({'value': stats.norm.rvs(0, 1, size=1000),
                   'start': np.abs(stats.norm.rvs(0, 20, size=1000))})
df['end'] = df['start'] + np.abs(stats.norm.rvs(5, 5, size=1000))

def get_value(df, target):
    value = df["value"].values
    idx = np.argsort(value)[::-1]
    start = df["start"].values[idx]
    end = df["end"].values[idx]
    value = value[idx]
    mask = (target[:, None] >= start[None, :]) & (target[:, None] <= end[None, :])
    index = np.argmax(mask, axis=1)
    flags = mask[np.arange(len(target)), index]
    result = value[index]
    result[~flags] = np.nan
    return result
get_value(df, np.arange(10, 71, 2))

optimizing indexing and retrieval of elements in numpy arrays in Python?

I'm trying to optimize the following code, potentially by rewriting it in Cython: it simply takes a low-dimensional but relatively long numpy array, looks into one of its columns for 0 values, and marks those rows as -1 in a result array. The code is:
import numpy as np
def get_data():
    data = np.array([[1, 5, 1]] * 5000 + [[1, 0, 5]] * 5000 + [[0, 0, 0]] * 5000)
    return data

def get_cols(K):
    cols = np.array([2] * K)
    return cols

def test_nonzero(data):
    K = len(data)
    result = np.array([1] * K)
    # Index into columns of data
    cols = get_cols(K)
    # Mark zero points with -1
    idx = np.nonzero(data[np.arange(K), cols] == 0)[0]
    result[idx] = -1

import time

t_start = time.time()
data = get_data()
for n in range(5000):
    test_nonzero(data)
t_end = time.time()
print(t_end - t_start)
data is the data. cols is the array of column indices of data in which to look for zero values (for simplicity, I made it the same column everywhere). The goal is to compute a numpy array, result, which has the value 1 for each row where the column of interest is non-zero, and -1 for the rows where the corresponding column of interest is zero.
Running this function 5000 times on a not-so-large array of 15,000 rows by 3 columns takes about 20 seconds. Is there a way this can be sped up? It appears that most of the work goes into finding the nonzero elements and retrieving them with indices (the call to nonzero and subsequent use of its index.) Can this be optimized or is this the best that can be done?
How could a Cython implementation gain speed on this?
cols = np.array([2] * K)
That's going to be really slow. It creates a very large Python list and then converts it into a numpy array. Instead, do something like:
cols = np.ones(K, int)*2
That'll be way faster.
result = np.array([1] * K)
Here you should do:
result = np.ones(K, int)
That will produce the numpy array directly.
idx = np.nonzero(data[np.arange(K), cols] == 0)[0]
result[idx] = -1
Here cols is an array, but you can just pass 2. Furthermore, using nonzero adds an extra step:
idx = data[np.arange(K), 2] == 0
result[idx] = -1
Should have the same effect.
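Putting those suggestions together, the whole function collapses to something like this (a sketch; test_nonzero_fast is just an illustrative name):
import numpy as np

def test_nonzero_fast(data):
    # 1 where column 2 is non-zero, -1 where it is zero
    return np.where(data[:, 2] == 0, -1, 1)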
