Given a graphlab.SArray named coef:
+-------------+----------------+
| name | value |
+-------------+----------------+
| (intercept) | 87910.0724924 |
| sqft_living | 315.403440552 |
| bedrooms | -65080.2155528 |
| bathrooms | 6944.02019265 |
+-------------+----------------+
[4 rows x 2 columns]
And a graphlab.SFrame named x (first 10 rows shown below):
+-------------+----------+-----------+-------------+
| sqft_living | bedrooms | bathrooms | (intercept) |
+-------------+----------+-----------+-------------+
| 1430.0 | 3.0 | 1.0 | 1 |
| 2950.0 | 4.0 | 3.0 | 1 |
| 1710.0 | 3.0 | 2.0 | 1 |
| 2320.0 | 3.0 | 2.5 | 1 |
| 1090.0 | 3.0 | 1.0 | 1 |
| 2620.0 | 4.0 | 2.5 | 1 |
| 4220.0 | 4.0 | 2.25 | 1 |
| 2250.0 | 4.0 | 2.5 | 1 |
| 1260.0 | 3.0 | 1.75 | 1 |
| 2750.0 | 4.0 | 2.0 | 1 |
+-------------+----------+-----------+-------------+
[1000 rows x 4 columns]
How do I manipulate the SArray and SFrame so that the multiplication returns a single vector SArray whose first entry is computed as below?:
87910.0724924 * 1
+ 315.403440552 * 1430.0
+ -65080.2155528 * 3.0
+ 6944.02019265 * 1.0
= 350640.36601600994
I'm currently doing silly things: converting the SFrame / SArray into lists and then into numpy arrays to use np.multiply. Even after converting to numpy arrays, it doesn't give the right matrix-vector multiplication. My current attempt:
import numpy as np
coef  # the SArray shown above
x     # the SFrame shown above
intercept = list(x['(intercept)'])
sqftliving = list(x['sqft_living'])
bedrooms = list(x['bedrooms'])
bathrooms = list(x['bathrooms'])
x_new = np.column_stack((intercept, sqftliving, bedrooms, bathrooms))
coef_new = np.array(list(coef['value']))
np.multiply(coef_new, x_new)
(wrong) [out]:
[[ 87910.07249236 451026.91998949 -195240.64665846 6944.02019265]
[ 87910.07249236 930440.14962867 -260320.86221128 20832.06057795]
[ 87910.07249236 539339.88334408 -195240.64665846 13888.0403853 ]
...,
[ 87910.07249236 794816.67019127 -260320.86221128 17360.05048162]
[ 87910.07249236 728581.94767533 -260320.86221128 17360.05048162]
[ 87910.07249236 321711.50936313 -130160.43110564 5208.01514449]]
The output of my attempt is wrong too; it should return a single vector of scalar values. There must be an easier way to do this.
To restate: how do I manipulate the SArray and SFrame so that the multiplication returns a single vector SArray whose first entry is computed as above? And with numpy arrays / pandas DataFrames, how should one perform the matrix-vector multiplication?
I think your best bet is to convert both the SFrame and SArray to numpy arrays and use the numpy dot method.
import graphlab
sf = graphlab.SFrame({'a': [1., 2.], 'b': [3., 5.], 'c': [7., 11]})
sa = graphlab.SArray([1., 2., 3.])
X = sf.to_dataframe().values
y = sa.to_numpy()
ans = X.dot(y)
I'm using simpler data here than what you have, but this should work for you as well. The only complication I can see is that you have to make sure the values in your SArray are in the same order as the columns in your SFrame (in your example they aren't).
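For example, one way to line them up, reusing the column names from the question (a sketch only, and it assumes coef really is the two-column name/value table shown there):
cols = list(coef['name'])              # ['(intercept)', 'sqft_living', 'bedrooms', 'bathrooms']
X = x[cols].to_dataframe().values      # select/reorder the SFrame columns to match coef's order
y = coef['value'].to_numpy()
ans = X.dot(y)                         # ans[0] should come out near 350640.366 for the first row shown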
I think this can be done with an SFrame apply as well, but unless you have a lot of data, the dot product route is probably simpler.
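A rough sketch of that apply route, under the same assumption that coef is the two-column table shown (x.apply calls the function once per row, passed as a dict):
weights = dict(zip(coef['name'], coef['value']))
predictions = x.apply(lambda row: sum(weights[name] * row[name] for name in weights))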
To manipulate an SArray and SFrame for linear algebra operations, you first need to convert them to NumPy arrays. Make sure you get the right dimensions and order of columns.
(I have a coef SArray and a features SFrame, which is exactly your x.)
In [15]: coef = coef.to_numpy()
In [17]: features = features.to_numpy()
Now coef and features are both Numpy arrays. So now multiplying them is as easy as:
In [23]: prod = numpy.dot(features, coef)
In [24]: print prod
[ 350640.36601601 778861.42048755 445897.34956322 641765.45839626
243403.19622833 671306.27500907 1174215.7748441 554607.00200482
302229.79626666 708836.7121845 ]
In [25]: prod.shape
Out[25]: (10,)
In NumPy, multiply() and * perform element-wise multiplication, while dot() performs matrix multiplication, which is exactly what you need.
Besides, your output
[[ 87910.07249236 451026.91998949 -195240.64665846 6944.02019265]
[ 87910.07249236 930440.14962867 -260320.86221128 20832.06057795]
[ 87910.07249236 539339.88334408 -195240.64665846 13888.0403853 ]
...,
[ 87910.07249236 794816.67019127 -260320.86221128 17360.05048162]
[ 87910.07249236 728581.94767533 -260320.86221128 17360.05048162]
[ 87910.07249236 321711.50936313 -130160.43110564 5208.01514449]]
is only half wrong: if you now sum the values in each row, you get the first element of your vector:
In [26]: 87910.07249236 + 451026.91998949 + (-195240.64665846) + 6944.02019265
Out[26]: 350640.3660160399
But dot() does all this for you, so you don't need to worry.
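To see the difference on a tiny toy example (made-up numbers, not your data):
import numpy as np
A = np.array([[1., 2.],
              [3., 4.]])
v = np.array([10., 100.])
np.multiply(A, v)   # element-wise with broadcasting -> [[ 10., 200.], [ 30., 400.]]
A.dot(v)            # matrix-vector product (row sums of the above) -> [ 210., 430.]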
P.S. Are you in the Machine Learning Specialization? Me too, that's why I know this :-)
Related
Let's say I have a list of x,y coordinates like this:
coordinate_list = [(4,6),(2,5),(0,4),(-2,-2),(0,2),(0,0),(8,8),(8,11),(8,14)]
I want to find the average y-value associated with each x-value. So for instance, there's only one "2" x-value in the dataset, so the average y-value would be "5". However, there are three 8's and the average y-value would be 11 [ (8+11+14) / 3 ].
What would be the most efficient way to do this?
# Collect all y values for each x, then average each list.
y_values_by_x = {}
for x, y in coordinate_list:
    y_values_by_x.setdefault(x, []).append(y)
average_y_by_x = {k: sum(v) / len(v) for k, v in y_values_by_x.items()}
# e.g. {4: 6.0, 2: 5.0, 0: 2.0, -2: -2.0, 8: 11.0}
You can use pandas
coordinate_list = [(4,6),(2,5),(0,4),(-2,-2),(0,2),(0,0),(8,8),(8,11),(8,14)]
import pandas as pd
df = pd.DataFrame(coordinate_list)
df
df.groupby([0]).mean()
| 0 | 1 |
| --- | --- |
| -2 | -2 |
| 0 | 2 |
| 2 | 5 |
| 4 | 6 |
| 8 | 11 |
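If you prefer the result as a plain dict keyed by x, the same groupby can be finished off like this (a small follow-up sketch):
df.groupby(0)[1].mean().to_dict()
# {-2: -2.0, 0: 2.0, 2: 5.0, 4: 6.0, 8: 11.0}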
Try the mean() function from the statistics module with a list comprehension:
from statistics import mean
x0_filter_value = 0 # can be any value of your choice for finding average
result = mean([x[1] for x in coordinate_list if x[0] == x0_filter_value])
print(result)
And to print the means for all x[0] values:
for i in set([x[0] for x in coordinate_list]):
    print(i, mean([x[1] for x in coordinate_list if x[0] == i]))
I faced the problem of quickly finding the nearest neighbors in a given range.
Example of dataset:
id | string | float
0 | AA | 0.1
12 | BB | 0.5
2 | CC | 0.3
102| AA | 1.1
33 | AA | 2.8
17 | AA | 0.5
For each line, print the number of lines satisfying the following conditions:
its string field is equal to the current row's string field
its float field lies within del below the current row's float, i.e. current float - del <= float field <= current float
For this example with del = 1.5:
id | count
0 | 0
12 | 0
2 | 0
102| 2    (string matches rows id=0, 33, 17, but only id=0 and id=17 satisfy the float condition: 1.1 - 1.5 <= 0.1 and 1.1 - 1.5 <= 0.5)
33 | 0    (string matches rows id=0, 102, 17, but 2.8 - 1.5 is greater than 0.1, 1.1 and 0.5)
17 | 1
To solve this problem I used the BallTree class from sklearn with a custom metric, but on a large dataset it runs for a very long time because of the reverse tree walk.
Can someone suggest other solutions, or a way to make a custom metric as fast as the built-in metrics from sklearn.neighbors.DistanceMetric?
My code:
from sklearn.neighbors import BallTree

# Small distance only when the string fields match and the query point's float is larger.
def distance(x, y):
    if x[0] == y[0] and x[1] > y[1]:
        return x[1] - y[1]
    else:
        return x[1] + y[1]

# 'del' is a reserved word in Python, so the radius is called del_ here.
tree2 = BallTree(X, leaf_size=X.shape[0], metric=distance)
mas = tree2.query_radius(X, r=del_, count_only=True)
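For reference, a tree-free alternative that avoids the custom metric entirely: group by the string field, sort the floats within each group, and count with searchsorted. This is only a sketch, and it assumes the data sits in a pandas DataFrame named df with columns 'string' and 'float', with the threshold called delta:
import numpy as np
import pandas as pd

def counts_within(df, delta):
    out = pd.Series(0, index=df.index)
    for _, grp in df.groupby('string'):
        vals = np.sort(grp['float'].to_numpy())
        cur = grp['float'].to_numpy()
        lo = np.searchsorted(vals, cur - delta, side='left')   # first value >= current - delta
        hi = np.searchsorted(vals, cur, side='left')           # values strictly below the current one
        out.loc[grp.index] = hi - lo
    return out

# df['count'] = counts_within(df, 1.5)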
I have an application where I need to block average a list of data (currently in a pandas.DataFrame) according to a timestamp, which may be a floating point value. For example, I may need to average the following df into groups of 0.3 secs:
+------+------+ +------+------+
| secs | A | | secs | A |
+------+------+ +------+------+
| 0.1 | .. | | 0.3 | .. | <-- avg of 0.1, 0.2, 0.3
| 0.2 | .. | --> | 0.6 | .. | <-- avg of 0.4, 0.5, 0.6
| 0.3 | .. | | ... | ... | <-- etc
| 0.4 | .. | +------+------+
| 0.5 | .. |
| 0.6 | .. |
| ... | ... |
+------+------+
Currently I am using the following (minimal) solution:
import pandas as pd
import numpy as np
def block_avg(df: pd.DataFrame, duration: float) -> pd.DataFrame:
    grouping = (df['secs'] - df['secs'][0]) // duration
    df = df.groupby(grouping, as_index=False).mean()
    df['secs'] = duration * np.arange(1, 1 + len(df))
    return df
which works just fine for integer durations, but floating-point values at the edges of blocks can fall on the wrong side. A simple test that the blocks are being created properly is to average by the same duration the data is already in (0.1 in this example). This should return the input, but often doesn't (e.g. x = .1*np.arange(1,20); (x - x[0]) // .1).
I found that the error with this method is usually that the LSB is 1 low, so a tentative fix is to add np.spacing(df['secs']) to the numerator in the grouping. (That is, x=.1*np.arange(1,20); all( (x-x[0]+np.spacing(x)) // .1 == np.arange(19) ) returns True.)
However, I am concerned that this is not a robust solution. Is there a better or preferred way to group floats which passes the above test?
I have had similar issues with a (perhaps more straightforward) algorithm which groups using x[ (duration*i < x) & (x <= duration*(i+1)) ] and looping i over an appropriate range.
To be extra careful (of float inaccuracy) I'd round early before doing the groupby:
In [11]: np.round(300 + df.secs * 1000).astype(int) // 300
Out[11]:
0 1
1 1
2 1
3 2
4 2
5 2
Name: secs, dtype: int64
In [12]: (np.round(300 + df.secs * 1000).astype(int) // 300) * 0.3
Out[12]:
0 0.3
1 0.3
2 0.3
3 0.6
4 0.6
5 0.6
Name: secs, dtype: float64
In [13]: df.groupby(by=(np.round(300 + df.secs * 1000).astype(int) // 300) * 0.3)["A"].sum()
Out[13]:
secs
0.3 1.753843
0.6 2.687098
Name: A, dtype: float64
I would prefer to use a timedelta:
In [21]: s = pd.to_timedelta(np.round(df["secs"], 1), unit="S")
In [22]: df["secs"] = pd.to_timedelta(np.round(df["secs"], 1), unit="S")
In [23]: df.groupby(pd.Grouper(key="secs", freq="0.3S")).sum()
Out[23]:
A
secs
00:00:00 1.753843
00:00:00.300000 2.687098
or with a resample:
In [24]: res = df.set_index("secs").resample("300ms").sum()
In [25]: res
Out[25]:
A
secs
00:00:00 1.753843
00:00:00.300000 2.687098
You can shift the index to correct the labelling*:
In [26]: res.index += np.timedelta64(300, "ms")
In [27]: res
Out[27]:
A
secs
00:00:00.300000 1.753843
00:00:00.600000 2.687098
* There ought to be a way to set that through a resample argument, but they don't seem to work...
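(For what it's worth, resample does take label and closed arguments in recent pandas versions, which may give the right-edge labels directly; a sketch only, not verified against the pandas version used above:)
res = df.set_index("secs").resample("300ms", closed="right", label="right").sum()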
I have an example csv file named 'r2.csv':
Factory | Product_Number | Date      | mu  | cs  | co
--------+----------------+-----------+-----+-----+----
A       | 1              | 01APR2017 | 5.6 | 125 | 275
A       | 1              | 02APR2017 | 4.5 | 200 | 300
A       | 1              | 03APR2017 | 6.6 | 150 | 250
A       | 1              | 04APR2017 | 7.5 | 175 | 325
I would like to add one more column named 'Order_Number', computed with the following formula:
Order_Number = np.ceil(poisson.ppf(co/(cs+co), mu))
Here is the code I have so far:
import numpy as np
from scipy.stats import poisson, norm
import csv
# Read Data
with open('r2.csv', 'r') as infile:
    reader = csv.DictReader(infile)
    data = {}
    for row in reader:
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]

# To create a list for each of the following parameters
mu = data['mu']
cs = data['cs']
co = data['co']

# Obtain Order_Number
Order_Number = np.ceil(poisson.ppf(co/(cs+co), mu))
Everything works fine until the 'Order_Number' line, which raises the following error:
TypeError: unsupported operand type(s) for /: 'list' and 'list'
How could I change my code in order to obtain the following table as output:
Factory | Product_Number | Date      | mu  | cs  | co  | Order_Number
--------+----------------+-----------+-----+-----+-----+-------------
A       | 1              | 01APR2017 | 5.6 | 125 | 275 | ?
A       | 1              | 02APR2017 | 4.5 | 200 | 300 | ?
A       | 1              | 03APR2017 | 6.6 | 150 | 250 | ?
A       | 1              | 04APR2017 | 7.5 | 175 | 325 | ?
It looks like mu, cs and co are lists of strings.
First convert them to floats:
mu = map(float,mu)
cs = map(float,cs)
co = map(float,co)
Then, since you have lists of values, you need to map your np.ceil(poisson.ppf(co/(cs+co), mu)) expression over each element of these lists (on Python 3, wrap the map() calls in list() if you need actual lists):
Order_Number =map(lambda mu_,cs_,co_:np.ceil(poisson.ppf(co_/(cs_+co_),mu_)),mu,cs,co)
The result is as follows:
>>> map(lambda mu_,cs_,co_:np.ceil(poisson.ppf(co_/(cs_+co_), mu_)),mu,cs,co)
[7.0, 5.0, 7.0, 8.0]
Hope this helps.
EDIT-1
Code to add the data to a csv file. You may want to look at reading your csv into an OrderedDict so that you don't need to write each column header manually; you can just call data.keys().
# Convert string elements of the lists to float
mu = map(float, mu)
cs = map(float, cs)
co = map(float, co)

# Obtain Order_Number
Order_Number = map(lambda mu_, cs_, co_: np.ceil(poisson.ppf(co_/(cs_+co_), mu_)), mu, cs, co)

# Add Order_Number to the data dict
data['Order_Number'] = Order_Number

header = 'Factory', 'Product_Number', 'Date', 'mu', 'cs', 'co', 'Order_Number'

# Write data to csv
with open("output.csv", 'wb') as resultFile:
    wr = csv.writer(resultFile, quoting=csv.QUOTE_ALL)
    wr.writerow(header)
    z = zip(data['Factory'], data['Product_Number'], data['Date'], data['mu'], data['cs'], data['co'], data['Order_Number'])
    for i in z:
        wr.writerow(i)
As created,
mu = data['mu']
cs = data['cs']
co = data['co']
are lists of strings. Look at them, or at least a subset, e.g. mu[:10]. You can't do array math with lists:
co/(cs+co)
cs+co will concatenate the 2 lists (that's what + means for lists), but / is not defined for lists at all.
mu = np.array(data['mu'], dtype=float)
cs = np.array(data['cs'], dtype=float)
co = np.array(data['co'], dtype=float)
might do the trick, converting the lists into 1d numpy arrays.
An alternative is to use np.genfromtxt with dtype=None and names=True to load the data into a structured array. But then I'd have to explain how to access the named fields. And unfortunately adding a new field to this array (the calc results) isn't trivial. And writing a new csv from a structured array requires some extra knowledge.
Try the list to array conversion.
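For completeness, a rough sketch of that genfromtxt route (assuming r2.csv is comma-separated, with field names taken from the header row):
import numpy as np
from scipy.stats import poisson

arr = np.genfromtxt('r2.csv', delimiter=',', names=True, dtype=None, encoding=None)
mu = arr['mu'].astype(float)
cs = arr['cs'].astype(float)
co = arr['co'].astype(float)
Order_Number = np.ceil(poisson.ppf(co / (cs + co), mu))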
In numpy, we can do this:
x = np.random.random((10,10))
a = np.random.randint(0,10,5)
b = np.random.randint(0,10,5)
x[a,b] # gives 5 entries from x, indexed according to the corresponding entries in a and b
When I try something equivalent in TensorFlow:
xt = tf.constant(x)
at = tf.constant(a)
bt = tf.constant(b)
xt[at,bt]
The last line gives a "Bad slice index tensor" exception. It seems TensorFlow doesn't support indexing like numpy or Theano.
Does anybody know if there is a TensorFlow way of doing this (indexing a tensor by arbitrary index arrays)? I've seen the tf.nn.embedding functions, but I'm not sure they can be used for this, and even if they can, it's a huge workaround for something this straightforward.
(Right now, I'm feeding the data from x as an input and doing the indexing in numpy but I hoped to put x inside TensorFlow to get higher efficiency)
You can actually do that now with tf.gather_nd. Let's say you have a matrix m like the following:
| 1 2 3 4 |
| 5 6 7 8 |
And you want to build a matrix r of size, let's say, 4x2, built from elements of m, like this:
| 3 6 |
| 2 7 |
| 5 3 |
| 1 1 |
Each element of r corresponds to a row and column of m, and you can have matrices rows and cols with these indices (zero-based, since we are programming, not doing math!):
rows = | 0 1 |      cols = | 2 1 |
       | 0 1 |             | 1 2 |
       | 1 0 |             | 0 2 |
       | 0 0 |             | 0 0 |
Which you can stack into a 3-dimensional tensor like this:
| | 0 2 |  | 1 1 | |
| | 0 1 |  | 1 2 | |
| | 1 0 |  | 0 2 | |
| | 0 0 |  | 0 0 | |
This way, you can get from m to r through rows and cols as follows:
import numpy as np
import tensorflow as tf
m = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
rows = np.array([[0, 1], [0, 1], [1, 0], [0, 0]])
cols = np.array([[2, 1], [1, 2], [0, 2], [0, 0]])
x = tf.placeholder('float32', (None, None))
idx1 = tf.placeholder('int32', (None, None))
idx2 = tf.placeholder('int32', (None, None))
result = tf.gather_nd(x, tf.stack((idx1, idx2), -1))
with tf.Session() as sess:
r = sess.run(result, feed_dict={
x: m,
idx1: rows,
idx2: cols,
})
print(r)
Output:
[[ 3. 6.]
[ 2. 7.]
[ 5. 3.]
[ 1. 1.]]
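For the 1-D index vectors in the original question (picking individual entries rather than building a matrix), the same idea works with the constants directly; a sketch using the xt, at, bt tensors defined in the question:
picked = tf.gather_nd(xt, tf.stack([at, bt], axis=1))   # analogous to x[a, b] in numpy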
LDGN's comment is correct. This is not possible at the moment, and it is a requested feature. If you follow issue #206 on GitHub you'll be updated if/when this is available. Many people would like this feature.
As of TensorFlow 0.11, basic indexing has been implemented. More advanced indexing (like boolean indexing) is still missing, but apparently it is planned for future versions.
Advanced indexing can be tracked with https://github.com/tensorflow/tensorflow/issues/4638