Grouping floating point numbers - python

I have an application where I need to block average a list of data (currently in a pandas.DataFrame) according to a timestamp, which may be a floating point value. For example, I may need to average the following df into groups of 0.3 secs:
+------+------+         +------+------+
| secs |  A   |         | secs |  A   |
+------+------+         +------+------+
| 0.1  |  ..  |         | 0.3  |  ..  |   <-- avg of 0.1, 0.2, 0.3
| 0.2  |  ..  |   -->   | 0.6  |  ..  |   <-- avg of 0.4, 0.5, 0.6
| 0.3  |  ..  |         | ...  | ...  |   <-- etc
| 0.4  |  ..  |         +------+------+
| 0.5  |  ..  |
| 0.6  |  ..  |
| ...  | ...  |
+------+------+
Currently I am using the following (minimal) solution:
import pandas as pd
import numpy as np

def block_avg(df: pd.DataFrame, duration: float) -> pd.DataFrame:
    grouping = (df['secs'] - df['secs'][0]) // duration
    df = df.groupby(grouping, as_index=False).mean()
    df['secs'] = duration * np.arange(1, 1 + len(df))
    return df
which works just fine for integer durations, but floating point values at the edges of blocks can fall on the wrong side. A simple test that the blocks are being created properly is to average by the same duration that the data is already in (0.1 in this example). This should return the input, but often doesn't (e.g. x = .1*np.arange(1,20); (x - x[0]) // .1).
I found that the error with this method is usually that the LSB is 1 low, so a tentative fix is to add np.spacing(df['secs']) to the numerator in the grouping. (That is, x=.1*np.arange(1,20); all( (x-x[0]+np.spacing(x)) // .1 == np.arange(19) ) returns True.)
However, I am concerned that this is not a robust solution. Is there a better or preferred way to group floats which passes the above test?
I have had similar issues with a (perhaps more straightforward) algorithm which groups using x[ (duration*i < x) & (x <= duration*(i+1)) ] and looping i over an appropriate range.
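For reference, a minimal sketch of the round-trip test described above, using the block_avg function from this question (data and column names assumed as in the example); with the naive floor division it frequently fails:
import numpy as np
import pandas as pd

# Averaging with the same duration the data already has should return the input.
df = pd.DataFrame({'secs': 0.1 * np.arange(1, 20), 'A': np.random.rand(19)})
out = block_avg(df, 0.1)
print(len(out) == len(df) and np.allclose(out['A'], df['A']))  # often False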

To be extra careful (of float inaccuracy) I'd round early before doing the groupby:
In [11]: np.round(300 + df.secs * 1000).astype(int) // 300
Out[11]:
0    1
1    1
2    1
3    2
4    2
5    2
Name: secs, dtype: int64
In [12]: (np.round(300 + df.secs * 1000).astype(int) // 300) * 0.3
Out[12]:
0    0.3
1    0.3
2    0.3
3    0.6
4    0.6
5    0.6
Name: secs, dtype: float64
In [13]: df.groupby(by=(np.round(300 + df.secs * 1000).astype(int) // 300) * 0.3)["A"].sum()
Out[13]:
secs
0.3 1.753843
0.6 2.687098
Name: A, dtype: float64
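Since the question ultimately wants block means rather than sums, the same rounded integer grouping can be folded back into the question's block_avg; a sketch, assuming the timestamps are exact multiples of a millisecond (scale by a finer factor otherwise):
import numpy as np
import pandas as pd

def block_avg(df: pd.DataFrame, duration: float) -> pd.DataFrame:
    # Same structure as the question's version, but the binning is done on
    # integer milliseconds so a value on a block edge cannot fall on the wrong side.
    ms = np.round(df['secs'] * 1000).astype(int)
    step_ms = int(round(duration * 1000))
    grouping = (ms - ms.iloc[0]) // step_ms
    out = df.groupby(grouping, as_index=False).mean()
    out['secs'] = duration * np.arange(1, 1 + len(out))
    return out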
I would prefer to use a timedelta:
In [21]: s = pd.to_timedelta(np.round(df["secs"], 1), unit="S")
In [22]: df["secs"] = pd.to_timedelta(np.round(df["secs"], 1), unit="S")
In [23]: df.groupby(pd.Grouper(key="secs", freq="0.3S")).sum()
Out[23]:
A
secs
00:00:00 1.753843
00:00:00.300000 2.687098
or with a resample:
In [24]: res = df.set_index("secs").resample("300ms").sum()
In [25]: res
Out[25]:
A
secs
00:00:00 1.753843
00:00:00.300000 2.687098
You can shift the index to correct the labelling*:
In [26]: res.index += np.timedelta64(300, "ms")
In [27]: res
Out[27]:
A
secs
00:00:00.300000 1.753843
00:00:00.600000 2.687098
* There ought to be a way to set that through a resample argument, but I couldn't get one to work...
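For what it's worth, resample also accepts closed and label arguments, which may remove the need for shifting the index by hand; a hedged sketch (standard pandas arguments, but behaviour can vary by version):
# label each 300 ms bin by its right edge and close the bins on the right,
# so e.g. 0.1-0.3 s land in the bin labelled 0.3 s
res = df.set_index("secs").resample("300ms", closed="right", label="right").sum()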

python datatable read expression from csv

I am using python's datatable.
I have 2 csv files.
CSV 1
A,B
1,2
3,4
5,6
CSV 2
NAME,EXPR
A_GREATER_THAN_B, A>B
A_GREATER_THAN_10, A>10
B_GREATER_THAN_5, B>5
Expected Output
A,B,A_GREATER_THAN_B,A_GREATER_THAN_10,B_GREATER_THAN_5
1,2,0,0,0
3,4,0,0,0
5,6,0,0,1
Code
import datatable as dt

exprdt = dt.fread("csv_2.csv")
exprdict = dict(exprdt.to_tuples())
dt1[:, dt.update(**exprdict)]   # dt1 holds the frame read from CSV 1
print(dt1)
Current output
   |     A      B      C  A_G_B          A_G_1     B_G_4
   | int32  int32  int32  str32          str32     str32
-- + -----  -----  -----  -------------  --------  --------
 0 |     0      1      1  dt.f.A>dt.f.B  dt.f.A>1  dt.f.B>4
 1 |     1      5      6  dt.f.A>dt.f.B  dt.f.A>1  dt.f.B>4
I am trying to use the extend functionality to process the first datatable using the expressions from the second datatable. When I use fread to read the csv files, the expressions are read as plain strings rather than evaluated as expressions.
How do I use the second datatable (csv) to update the first datatable using its NAME and EXPR columns?
You can do what you want, but just because you can do something doesn't mean it's a good idea. Any solution that requires eval() is probably more complicated than it needs to be, and introduces great risks if you don't have complete control over the data going in.
Having said that, this script shows a naive approach without fancy expressions from a table, as well as the approach you suggest, which I strongly recommend against; try to find a better way to achieve what you need:
from io import StringIO
import re
import datatable as dt

csv1 = """A,B
1,2
3,4
5,6"""

csv2 = """NAME,EXPR
A_GREATER_THAN_B, A>B
A_GREATER_THAN_10, A>10
B_GREATER_THAN_5, B>5"""


def naive():
    # naive approach
    d = dt.fread(StringIO(csv1))
    d['A_GREATER_THAN_B'] = d[:, dt.f.A > dt.f.B]
    d['A_GREATER_THAN_10'] = d[:, dt.f.A > 10]
    d['B_GREATER_THAN_5'] = d[:, dt.f.B > 5]
    print(d)


def update_with_expressions(d, expressions):
    for n in range(expressions.nrows):
        col = expressions[n, :][0, 'NAME']
        expr = re.sub('([A-Za-z]+)', r'dt.f.\1', expressions[n, :][0, 'EXPR'])
        # here's hoping that expression is trustworthy...
        d[col] = d[:, eval(expr)]


def fancy():
    # fancy, risky approach
    d = dt.fread(StringIO(csv1))
    update_with_expressions(d, dt.fread(StringIO(csv2)))
    print(d)


if __name__ == '__main__':
    naive()
    fancy()
Result (showing you get the same result from either approach):
   |     A      B  A_GREATER_THAN_B  A_GREATER_THAN_10  B_GREATER_THAN_5
   | int32  int32             bool8              bool8             bool8
-- + -----  -----  ----------------  -----------------  ----------------
 0 |     1      2                 0                  0                 0
 1 |     3      4                 0                  0                 0
 2 |     5      6                 0                  0                 1
[3 rows x 5 columns]

   |     A      B  A_GREATER_THAN_B  A_GREATER_THAN_10  B_GREATER_THAN_5
   | int32  int32             bool8              bool8             bool8
-- + -----  -----  ----------------  -----------------  ----------------
 0 |     1      2                 0                  0                 0
 1 |     3      4                 0                  0                 0
 2 |     5      6                 0                  0                 1
[3 rows x 5 columns]
Note: if someone knows of a nicer way to iterate over rows in a datatable.Frame, please leave a comment, because I'm not a fan of this part:
for n in range(expressions.nrows):
    col = expressions[n, :][0, 'NAME']
Note that StringIO is only imported to embed the .csv files in the script; you wouldn't need it when reading real files.
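One possibly nicer way to iterate, reusing the to_tuples() call from the question's own code (and assuming the columns come back in NAME, EXPR order), would be something like:
def update_with_expressions(d, expressions):
    # iterate over (NAME, EXPR) tuples instead of indexing row by row
    for col, expr in expressions.to_tuples():
        expr = re.sub('([A-Za-z]+)', r'dt.f.\1', expr)
        d[col] = d[:, eval(expr)]  # same caveat as above: eval() trusts the input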

Matplotlib dot plot with two categorical variables

I would like to produce a specific type of visualization, consisting of a rather simple dot plot but with a twist: both of the axes are categorical variables (i.e. ordinal or non-numerical values). And this complicates matters instead of making it easier.
To illustrate this question, I will be using a small example dataset that is a modification of seaborn.load_dataset("tips"), defined as such:
import pandas
from six import StringIO
df = """total_bill | tip | sex | smoker | day | time | size
16.99 | 1.01 | Male | No | Mon | Dinner | 2
10.34 | 1.66 | Male | No | Sun | Dinner | 3
21.01 | 3.50 | Male | No | Sun | Dinner | 3
23.68 | 3.31 | Male | No | Sun | Dinner | 2
24.59 | 3.61 | Female | No | Sun | Dinner | 4
25.29 | 4.71 | Female | No | Mon | Lunch | 4
8.77 | 2.00 | Female | No | Tue | Lunch | 2
26.88 | 3.12 | Male | No | Wed | Lunch | 4
15.04 | 3.96 | Male | No | Sat | Lunch | 2
14.78 | 3.23 | Male | No | Sun | Lunch | 2"""
df = pandas.read_csv(StringIO(df.replace(' ','')), sep="|", header=0)
My first approach to produce my graph was to try a call to seaborn as such:
import seaborn
axes = seaborn.pointplot(x="time", y="sex", data=df)
This fails with:
ValueError: Neither the `x` nor `y` variable appears to be numeric.
So do the equivalent seaborn.stripplot and seaborn.swarmplot calls. It does work, however, if one of the variables is categorical and the other one is numerical. Indeed seaborn.pointplot(x="total_bill", y="sex", data=df) works, but is not what I want.
I also attempted a scatterplot like such:
axes = seaborn.scatterplot(x="time", y="sex", size="day", data=df,
                           x_jitter=True, y_jitter=True)
This produces the following graph which does not contain any jitter and has all the dots overlapping, making it useless:
Do you know of any elegant approach or library that could solve my problem?
I started writing something myself, which I will include below, but this implementation is suboptimal and limited by the number of points that can overlap at the same spot (currently it fails if more than 4 points overlap).
# Modules #
import seaborn, pandas, matplotlib
from six import StringIO

################################################################################
def amount_to_offets(amount):
    """A function that takes an amount of overlapping points (e.g. 3)
    and returns a list of offsets (jittered) coordinates for each of the
    points.

    It follows the logic that two points are displayed side by side:

        2 ->  *  *

    Three points are organized in a triangle:

        3 ->   *
              * *

    Four points are sorted into a square, and so on:

        4 ->  * *
              * *
    """
    assert isinstance(amount, int)
    solutions = {
        1: [( 0.0,  0.0)],
        2: [(-0.5,  0.0), ( 0.5,  0.0)],
        3: [(-0.5, -0.5), ( 0.0,  0.5), ( 0.5, -0.5)],
        4: [(-0.5, -0.5), ( 0.5,  0.5), ( 0.5, -0.5), (-0.5,  0.5)],
    }
    return solutions[amount]

################################################################################
class JitterDotplot(object):

    def __init__(self, data, x_col='time', y_col='sex', z_col='tip'):
        self.data = data
        self.x_col = x_col
        self.y_col = y_col
        self.z_col = z_col

    def plot(self, **kwargs):
        # Load data #
        self.df = self.data.copy()
        # Assign numerical values to the categorical data #
        # So that ['Dinner', 'Lunch'] becomes [0, 1] etc. #
        self.x_values = self.df[self.x_col].unique()
        self.y_values = self.df[self.y_col].unique()
        self.x_mapping = dict(zip(self.x_values, range(len(self.x_values))))
        self.y_mapping = dict(zip(self.y_values, range(len(self.y_values))))
        self.df = self.df.replace({self.x_col: self.x_mapping, self.y_col: self.y_mapping})
        # Offset points that are overlapping in the same location #
        # So that (2.0, 3.0) becomes (2.05, 2.95) for instance #
        cols = [self.x_col, self.y_col]
        scaling_factor = 0.05
        for values, df_view in self.df.groupby(cols):
            offsets = amount_to_offets(len(df_view))
            offsets = pandas.DataFrame(offsets, index=df_view.index, columns=cols)
            offsets *= scaling_factor
            self.df.loc[offsets.index, cols] += offsets
        # Plot a standard scatter plot #
        g = seaborn.scatterplot(x=self.x_col, y=self.y_col, size=self.z_col, data=self.df, **kwargs)
        # Force integer ticks on the x and y axes #
        locator = matplotlib.ticker.MaxNLocator(integer=True)
        g.xaxis.set_major_locator(locator)
        g.yaxis.set_major_locator(locator)
        g.grid(False)
        # Expand the axis limits for x and y #
        margin = 0.4
        xmin, xmax, ymin, ymax = g.get_xlim() + g.get_ylim()
        g.set_xlim(xmin - margin, xmax + margin)
        g.set_ylim(ymin - margin, ymax + margin)
        # Replace ticks with the original categorical names #
        g.set_xticklabels([''] + list(self.x_mapping.keys()))
        g.set_yticklabels([''] + list(self.y_mapping.keys()))
        # Return for display in notebooks for instance #
        return g

################################################################################
# Graph #
graph = JitterDotplot(data=df)
axes = graph.plot()
axes.figure.savefig('jitter_dotplot.png')
You could first convert time and sex to categorical dtype and tweak it a little bit:
import numpy as np
import pandas as pd
import seaborn as sns

df.sex = pd.Categorical(df.sex)
df.time = pd.Categorical(df.time)

axes = sns.scatterplot(x=df.time.cat.codes + np.random.uniform(-0.1, 0.1, len(df)),
                       y=df.sex.cat.codes + np.random.uniform(-0.1, 0.1, len(df)),
                       size=df.tip)
Output:
With that idea, you can replace the random offsets (np.random) in the code above with deterministic offsets based on each point's position within its group. For example:
# grouping
groups = df.groupby(['time', 'sex'])

# compute the number of samples per group
num_samples = groups.tip.transform('size')

# enumerate the samples within each group and spread them around a small circle
sample_ranks = groups.cumcount() * (2 * np.pi) / num_samples

# compute the offsets (single points stay where they are)
x_offsets = np.where(num_samples.eq(1), 0, np.cos(sample_ranks) * 0.03)
y_offsets = np.where(num_samples.eq(1), 0, np.sin(sample_ranks) * 0.03)

# plot
axes = sns.scatterplot(x=df.time.cat.codes + x_offsets,
                       y=df.sex.cat.codes + y_offsets,
                       size=df.tip)
Output:

Nearest neighbors in a given range

I faced the problem of quickly finding the nearest neighbors in a given range.
Example of dataset:
id  | string | float
0   | AA     | 0.1
12  | BB     | 0.5
2   | CC     | 0.3
102 | AA     | 1.1
33  | AA     | 2.8
17  | AA     | 0.5
For each line, print the number of lines satisfying the following conditions:
string field is equal to the current string
float field lies within del below the current float (current float - del <= float < current float)
For this example with del = 1.5:
id  | count
0   | 0
12  | 0
2   | 0
102 | 2    (rows 0, 33 and 17 share string AA, but only rows 0 and 17 qualify: 1.1 - 1.5 <= 0.1 < 1.1 and 1.1 - 1.5 <= 0.5 < 1.1)
33  | 0    (rows 0, 102 and 17 share string AA, but 2.8 - 1.5 = 1.3 is larger than each of 0.1, 1.1 and 0.5)
17  | 1    (only row 0 qualifies: 0.5 - 1.5 <= 0.1 < 0.5)
To solve this problem, I used the BallTree class with a custom metric, but it runs for a very long time on a large dataset because of the reverse tree walk.
Can someone suggest another solution, or a way to bring a custom metric up to the speed of the built-in metrics from sklearn.neighbors.DistanceMetric?
My code:
from sklearn.neighbors import BallTree

def distance(x, y):
    if x[0] == y[0] and x[1] > y[1]:
        return x[1] - y[1]
    else:
        return x[1] + y[1]

tree2 = BallTree(X, leaf_size=X.shape[0], metric=distance)
# 'del' is a Python keyword, so the radius parameter is written as delta here
mas = tree2.query_radius(X, r=delta, count_only=True)
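If the custom metric is the bottleneck, one alternative (a sketch, not from the original post; column names assumed as in the example) is to skip the tree entirely: sort the floats within each string group and count with np.searchsorted. It reproduces the expected counts above, with ties at exactly the current value excluded, matching the example:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':     [0, 12, 2, 102, 33, 17],
    'string': ['AA', 'BB', 'CC', 'AA', 'AA', 'AA'],
    'float':  [0.1, 0.5, 0.3, 1.1, 2.8, 0.5],
})
delta = 1.5

def count_in_range(group):
    vals = np.sort(group['float'].to_numpy())
    lo = np.searchsorted(vals, group['float'].to_numpy() - delta, side='left')
    hi = np.searchsorted(vals, group['float'].to_numpy(), side='left')
    # number of same-string rows with float in [current - delta, current)
    return pd.Series(hi - lo, index=group.index)

df['count'] = df.groupby('string', group_keys=False).apply(count_in_range)
print(df[['id', 'count']])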

How to pass condition into lambda?

I have a dictionary like this:
Dict={'A':0.0697,'B':0.1136,'C':0.2227,'D':0.2725,'E':0.4555}
I want my output like this:
Return A,B,C,D,E if the value in my dataframe is LESS THAN 0.0697,0.1136,0.2227,0.2725,0.4555 respectively; else return F
I tried:
TrainTest['saga1'] = TrainTest['saga'].apply(lambda x,v: Dict[x] if x<=v else 'F')
But it returns an error:
TypeError: <lambda>() takes exactly 2 arguments (1 given)
Let's make some test data:
saga = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.9])
Next, recognize that Dict is a dict and has no ordering, so let's get that sorted by the numbers in reverse order:
thresh = sorted(Dict.items(), key=lambda t: t[1], reverse=True)
Finally, solve the problem by looping not over saga but over thresh, because loops (and apply()) in Python/Pandas are slow and we assume saga is much longer than thresh:
result = pd.Series('F', saga.index)  # all F's to start
for name, value in thresh:
    result[saga < value] = name
Now result is a series of values A,B,C,D,E,F as appropriate--we loop in reverse order because e.g. 0 is smaller than all the values and should be labeled A, not E.
Regarding run-times:
In [160]: %%timeit
# loop over the smaller thresh, not saga (len(thresh) << len(saga))
for name, value in thresh:
    result[saga < value] = name
100 loops, best of 3: 2.59 ms per loop
Here are pandas run-times:
saga1 = pd.DataFrame([0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9], columns=['c1'])

def mapF(s):
    # thresh is in descending order
    curr = 'F'
    for name, value in thresh:
        if s < value:
            curr = name
    return curr
Using map/apply:
In [149]: %%timeit
saga1['result'] = saga1['c1'].map(lambda x: mapF(x) )
1000 loops, best of 3: 311 µs per loop
Using vectorization:
In [166]:%%timeit
import numpy as np
saga1['result'] = np.vectorize(mapF)(saga1['c1'])
1000 loops, best of 3: 244 µs per loop
Resulting saga1:
+---+------+--------+
| | c1 | result |
+---+------+--------+
| 0 | 0.05 | A |
| 1 | 0.1 | B |
| 2 | 0.2 | C |
| 3 | 0.3 | E |
| 4 | 0.4 | E |
| 5 | 0.5 | F |
| 6 | 0.9 | F |
+---+------+--------+
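For completeness, a vectorized alternative (not from the original answer) using pd.cut, which bins each value against the sorted thresholds in one call; anything above the last threshold gets 'F':
import numpy as np
import pandas as pd

Dict = {'A': 0.0697, 'B': 0.1136, 'C': 0.2227, 'D': 0.2725, 'E': 0.4555}
saga = pd.Series([0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9])

# sort thresholds ascending and build bin edges; right=False keeps the
# "strictly less than" rule from the question
names, edges = zip(*sorted(Dict.items(), key=lambda t: t[1]))
result = pd.cut(saga, bins=[-np.inf, *edges, np.inf],
                labels=[*names, 'F'], right=False)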

Matrix multiplication with SFrame and SArray with Graphlab and/or Numpy

Given a graphlab.SArray named coef:
+-------------+----------------+
| name | value |
+-------------+----------------+
| (intercept) | 87910.0724924 |
| sqft_living | 315.403440552 |
| bedrooms | -65080.2155528 |
| bathrooms | 6944.02019265 |
+-------------+----------------+
[4 rows x 2 columns]
And a graphlab.SFrame (shown below first 10) named x:
+-------------+----------+-----------+-------------+
| sqft_living | bedrooms | bathrooms | (intercept) |
+-------------+----------+-----------+-------------+
| 1430.0 | 3.0 | 1.0 | 1 |
| 2950.0 | 4.0 | 3.0 | 1 |
| 1710.0 | 3.0 | 2.0 | 1 |
| 2320.0 | 3.0 | 2.5 | 1 |
| 1090.0 | 3.0 | 1.0 | 1 |
| 2620.0 | 4.0 | 2.5 | 1 |
| 4220.0 | 4.0 | 2.25 | 1 |
| 2250.0 | 4.0 | 2.5 | 1 |
| 1260.0 | 3.0 | 1.75 | 1 |
| 2750.0 | 4.0 | 2.0 | 1 |
+-------------+----------+-----------+-------------+
[1000 rows x 4 columns]
How do I manipulate the SArray and SFrame such that the multiplication returns a single vector SArray whose first row is computed as below?
87910.0724924 * 1
+ 315.403440552 * 1430.0
+ -65080.2155528 * 3.0
+ 6944.02019265 * 1.0
= 350640.36601600994
I'm currently doing silly things: converting the SFrame / SArray into lists and then into numpy arrays to do np.multiply. Even after converting into numpy arrays, it's not giving the right matrix-vector multiplication. My current attempt:
import numpy as np
coef  # the SArray shown above
x     # the SFrame shown above
intercept = list(x['(intercept)'])
sqftliving = list(x['sqft_living'])
bedrooms = list(x['bedrooms'])
bathrooms = list(x['bathrooms'])
x_new = np.column_stack((intercept, sqftliving, bedrooms, bathrooms))
coef_new = np.array(list(coef['value']))
np.multiply(coef_new, x_new)
(wrong) [out]:
[[ 87910.07249236 451026.91998949 -195240.64665846 6944.02019265]
[ 87910.07249236 930440.14962867 -260320.86221128 20832.06057795]
[ 87910.07249236 539339.88334408 -195240.64665846 13888.0403853 ]
...,
[ 87910.07249236 794816.67019127 -260320.86221128 17360.05048162]
[ 87910.07249236 728581.94767533 -260320.86221128 17360.05048162]
[ 87910.07249236 321711.50936313 -130160.43110564 5208.01514449]]
The output of my attempt is wrong too; it should return a single vector of scalar values. There must be an easier way to do it.
And with numpy arrays, how should one perform the matrix-vector multiplication?
I think your best bet is to convert both the SFrame and SArray to numpy arrays and use the numpy dot method.
import graphlab
sf = graphlab.SFrame({'a': [1., 2.], 'b': [3., 5.], 'c': [7., 11]})
sa = graphlab.SArray([1., 2., 3.])
X = sf.to_dataframe().values
y = sa.to_numpy()
ans = X.dot(y)
I'm using simpler data here than what you have, but this should work for you as well. The only complication I can see is that you have to make sure the values in your SArray are in the same order as the columns in your SFrame (in your example they aren't).
I think this can be done with an SFrame apply as well, but unless you have a lot of data, the dot product route is probably simpler.
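A hedged sketch of that reordering, using only calls already shown in this thread and the column names from the question:
# select the SFrame columns in the same order as the coefficient names,
# then convert and take the dot product
names = list(coef['name'])          # ['(intercept)', 'sqft_living', 'bedrooms', 'bathrooms']
X = x[names].to_dataframe().values  # columns now follow the coefficient order
w = coef['value'].to_numpy()
predictions = X.dot(w)              # first entry should be ~350640.366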
To perform linear algebra operations on an SArray and SFrame, you first need to convert them to numpy arrays. Make sure that you get the right dimensions and order of columns.
(Here coef is the SArray and features is the SFrame, which is exactly your x.)
In [15]: coef = coef.to_numpy()
In [17]: features = features.to_numpy()
Now coef and features are both Numpy arrays. So now multiplying them is as easy as:
In [23]: prod = numpy.dot(features, coef)
In [24]: print prod
[ 350640.36601601 778861.42048755 445897.34956322 641765.45839626
243403.19622833 671306.27500907 1174215.7748441 554607.00200482
302229.79626666 708836.7121845 ]
In [25]: prod.shape
Out[25]: (10,)
In numpy, multiply() and * perform element-wise multiplication, but dot() performs matrix multiplication, which is exactly what you need.
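A tiny illustration of the difference (made-up numbers):
import numpy as np

w = np.array([1., 2., 3.])
X = np.array([[1., 10., 100.],
              [2., 20., 200.]])

print(X * w)      # element-wise: [[  1.  20. 300.], [  2.  40. 600.]]
print(X.dot(w))   # matrix-vector product: [321. 642.]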
Besides, your output
[[ 87910.07249236 451026.91998949 -195240.64665846 6944.02019265]
[ 87910.07249236 930440.14962867 -260320.86221128 20832.06057795]
[ 87910.07249236 539339.88334408 -195240.64665846 13888.0403853 ]
...,
[ 87910.07249236 794816.67019127 -260320.86221128 17360.05048162]
[ 87910.07249236 728581.94767533 -260320.86221128 17360.05048162]
[ 87910.07249236 321711.50936313 -130160.43110564 5208.01514449]]
is half wrong. If you now sum the values in each row, you will get the first element of your vector:
In [26]: 87910.07249236 + 451026.91998949 + (-195240.64665846) + 6944.02019265
Out[26]: 350640.3660160399
But dot() does all this for you, so you don't need to worry.
P.S. Are you in Machine Learning Specialization? Me too, that's why I know this :-)
