How to separate the 2 output arrays of sklearn kneighbors() in Python?

I am a beginner in Python. I am using NearestNeighbors from sklearn with this call:
print(neigh.kneighbors([[0.00015217, 0.00050968, 0.00044049, 0.00014538,
0.00077339, 0.0020284 , 0.00047572]]))
And the output is:
(array([[1.01980586e-08, 7.73354596e-05, 7.73354596e-05, 1.20134585e-04,
1.39792434e-04, 1.48002389e-04, 1.98794609e-04, 4.63512739e-04,
5.31436554e-04, 5.36960418e-04, 5.72679303e-04, 6.28187320e-04,
6.67923141e-04, 7.51928163e-04, 8.97313642e-04, 1.00023442e-03,
1.06114362e-03, 1.11943158e-03, 1.12626043e-03, 1.20185118e-03,
1.51073901e-03, 1.71592746e-03, 1.73362257e-03]]),
array([[ 0, 16, 15, 19, 1, 23, 5, 8, 20, 9, 6, 10, 17, 3, 21, 22, 14, 2, 13, 7, 11, 12, 18]], dtype=int64))
I would like to export this data to CSV, and I need both arrays in the file. How can I separate these arrays?

hh = neigh.kneighbors([[0.00015217, 0.00050968, 0.00044049, 0.00014538,
0.00077339, 0.0020284 , 0.00047572]])
first_array = hh[0]   # distances to the neighbors
second_array = hh[1]  # indices of the neighbors
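Since the question also asks about getting both arrays into a CSV, here is a minimal sketch of that step. It assumes only what the output above shows, namely that kneighbors() returns a (distances, indices) tuple. The `result` tuple is a hand-built stand-in with a few values from the question, and `write_neighbors_csv` and the column names are made up for illustration.

```python
import csv
import io

import numpy as np

# Stand-in for the (distances, indices) tuple that kneighbors() returns;
# the values are a shortened copy of the output in the question.
result = (
    np.array([[1.01980586e-08, 7.73354596e-05, 1.20134585e-04]]),
    np.array([[0, 16, 19]]),
)

# Tuple unpacking is equivalent to result[0] and result[1].
distances, indices = result


def write_neighbors_csv(distances, indices, fileobj):
    """Write one CSV row per neighbor of the first (only) query point."""
    writer = csv.writer(fileobj)
    writer.writerow(["neighbor_index", "distance"])
    for idx, dist in zip(indices[0], distances[0]):
        writer.writerow([int(idx), float(dist)])


# An in-memory buffer keeps the sketch self-contained; pass an open file
# (for example open("neighbors.csv", "w", newline="")) to write to disk.
buffer = io.StringIO()
write_neighbors_csv(distances, indices, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing one row per neighbor keeps each index next to its distance, which is usually easier to read back than two separate rows of numbers.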

Related

Interpolation between two datetimes

I have a time series dataset and a set of events, each recorded when my system reports a specific error. I want to plot the time series and place a marker for each event on the graph, so I have to interpolate between two timestamps to get the exact y value. My problem occurs when I run the following code:
import numpy as np
test = np.interp(event, [timestamp_timeseries[k-1], timestamp_timeseries[k]],
                 [y_value[k-1], y_value[k]])
with types:
timestamp_timeseries: datetime.datetime
y_value: int
event: datetime.datetime (timestamp at which the system reports an error)
Thanks for your help.
Example:
test=np.interp(
datetime.datetime(2022, 10, 11, 12, 24, 5, 922000),
[datetime.datetime(2022, 10, 11, 12, 6, 40, 480000),
datetime.datetime(2022, 10, 11, 12, 52, 51, 481000)],
[335872, 336896])
My result is:
TypeError: float() argument must be a string or a number, not 'datetime.datetime'
np.interp only accepts numbers, so use a numeric representation of the datetimes, e.g. the Unix time you get from .timestamp(). Example:
from datetime import datetime
import numpy as np
test=np.interp(
datetime(2022, 10, 11, 12, 24, 5, 922000).timestamp(),
[datetime(2022, 10, 11, 12, 6, 40, 480000).timestamp(), datetime(2022, 10, 11, 12, 52, 51, 481000).timestamp()],
[335872, 336896])
test
# 336258.3342553605
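A variation on the same idea, as a sketch: instead of .timestamp(), convert every datetime to seconds elapsed since the first timestamp. The helper name `interp_datetime` is made up for illustration. The assumption is that all datetimes are naive or share a time zone, and that the timestamps are in increasing order, which np.interp needs.

```python
from datetime import datetime

import numpy as np


def interp_datetime(event, timestamps, values):
    """Linearly interpolate `values` at the datetime `event`.

    Datetimes are converted to seconds elapsed since timestamps[0], so the
    result does not depend on the local time zone of the machine.
    """
    origin = timestamps[0]
    xp = [(t - origin).total_seconds() for t in timestamps]
    x = (event - origin).total_seconds()
    return float(np.interp(x, xp, values))


y_at_event = interp_datetime(
    datetime(2022, 10, 11, 12, 24, 5, 922000),
    [datetime(2022, 10, 11, 12, 6, 40, 480000),
     datetime(2022, 10, 11, 12, 52, 51, 481000)],
    [335872, 336896],
)
print(y_at_event)
```

The same helper also works with the full timestamp_timeseries and y_value lists, so you do not have to look up the index k yourself.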

Plotting a histogram from a database using matplot and python

I'm trying to plot a histogram from a database using the matplotlib library in Python.
The query is:
import sqlite3
import pandas as pd
cnx = sqlite3.connect('practice.db')
sql = pd.read_sql_query('''
SELECT CAST((deliverydistance/1)as int)*1 as bin, count(*)
FROM orders
group by 1
order by 1;
''',cnx)
which outputs a table of bins and counts (shown in a linked image in the original post).
From the sql table, I try to extract the columns using a for loop and place them in arrays.
distance =[]
counts = []
for x, y in sql.iterrows():
    y = y["count(*)"]
    counts.append(y)
    distance.append(x)
print(distance)
print(counts)
OUTPUT:
distance = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418, 4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]
When I plot a histogram
plt.hist(counts,bins=distance)
I get this output (shown in a linked image in the original post).
My question is, how do I make it so that the count is on the Y axis and the distance is on the X axis? It doesn't seem to allow me to put it there.
You could also skip the for loop and plot directly from your pandas DataFrame using
sql.bin.plot(kind='hist', weights=sql['count(*)'])
or, with the for loop:
import matplotlib.pyplot as plt
import pandas as pd
distance =[]
counts = []
for x, y in sql.iterrows():
    y = y["count(*)"]
    counts.append(y)
    distance.append(x)
plt.hist(distance, bins=distance, weights=counts)
You can skip the middle section where you count the instances of each distance. Check out this example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'distance':np.round(20 * np.random.random(100))})
df['distance'].hist(bins = np.arange(0,21,1))
Pandas has a built-in histogram plot which counts, then plots the occurrences of each distance. You can specify the bins (in this case 0-20 with a width of 1).
If you are not looking for a bar chart and are looking for a horizontal histogram, then you are looking to pass orientation='horizontal':
distance = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
# plt.style.use('dark_background')
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418, 4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]
plt.hist(counts,bins=distance, orientation='horizontal')
Use:
plt.bar(distance,counts)
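To see why plt.hist(counts, bins=distance) gives an unexpected plot, it may help to look at the binning on its own. The sketch below uses np.histogram, which as far as I know applies the same binning that plt.hist relies on; the variable names `binned_counts`, `edges` and `weighted` are only for illustration.

```python
import numpy as np

distance = list(range(19))
counts = [57136, 4711, 6569, 7268, 6755, 5757, 7643, 6175, 7954, 9418,
          4945, 4178, 2844, 2104, 1829, 9, 4, 1, 3]

# Binning the counts themselves: only the values 9, 4, 1 and 3 fall
# between 0 and 18, so almost all of the data is ignored.
binned_counts, _ = np.histogram(counts, bins=distance)

# Binning the distances and weighting each one by its count.
# Edges 0..19 give every distance its own bin.
edges = np.arange(len(distance) + 1)
weighted, _ = np.histogram(distance, bins=edges, weights=counts)

print(binned_counts.sum())
print(weighted.tolist() == counts)
```

Because the data is already aggregated, plt.bar(distance, counts) draws these totals directly, while a histogram needs the weights argument to recover them.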

How to convert this numpy to tf.function compatible code?

I'm trying to convert numpy to tensorflow equivalent code to be compatible with tf.function ...
Given a (32, 6) numpy array target_values that looks like this:
array([[-0.01656106, 0.04762066, 0.05735449, -0.0284767 , -0.02237438,
-0.00042562],
[-0.01420249, 0.0477839 , 0.0563598 , -0.02971786, -0.02367548,
0.00001262],
[-0.01695916, 0.04826669, 0.05893629, -0.03067053, -0.02261235,
0.00345904],
[-0.01953977, 0.04540274, 0.05829531, -0.02759781, -0.02390759,
-0.00487727],
[-0.01708016, 0.04894669, 0.0606699 , -0.02576046, -0.02461138,
-0.00068538],
[-0.01604217, 0.04770135, 0.05761468, -0.02858265, -0.02624938,
-0.00084356],
[-0.01527106, 0.04699571, 0.05959677, -0.02956396, -0.02510098,
-0.00223234],
[-0.01448676, 0.04620824, 0.05775366, -0.03008122, -0.02655901,
-0.00159649],
[-0.0172577 , 0.04814827, 0.05807308, -0.02916523, -0.02367857,
-0.00100602],
[-0.01690523, 0.0484785 , 0.05807881, -0.02960616, -0.02560546,
-0.00065042],
[-0.0166171 , 0.0488232 , 0.05776291, -0.03231864, -0.02132723,
-0.00033605],
[-0.01541627, 0.04840397, 0.0580376 , -0.02927143, -0.02461101,
0.00121263],
[-0.01685588, 0.047661 , 0.05873172, -0.02989979, -0.02574112,
-0.00126612],
[-0.01333553, 0.05043796, 0.05915743, -0.02990219, -0.02657976,
-0.0007656 ],
[-0.01531163, 0.04781894, 0.05637252, -0.02968849, -0.02225551,
-0.00151382],
[-0.01357749, 0.04807179, 0.05955081, -0.02748637, -0.02498721,
-0.00040934],
[-0.01606943, 0.04768877, 0.05455931, -0.03136749, -0.02475093,
0.00245846],
[-0.01609829, 0.04687681, 0.05982678, -0.02886578, -0.02608151,
0.00015348],
[-0.01503662, 0.04740106, 0.05958583, -0.03141545, -0.02522127,
-0.00063602],
[-0.01697148, 0.04910276, 0.05744712, -0.02858391, -0.02481578,
-0.00072039],
[-0.01503395, 0.04843756, 0.05773868, -0.03061879, -0.02586869,
-0.00025573],
[-0.0152991 , 0.04847359, 0.05739099, -0.0299796 , -0.02552593,
-0.00334571],
[-0.01324895, 0.04529134, 0.05534273, -0.03109139, -0.02304241,
-0.00143186],
[-0.01280282, 0.05004944, 0.05856398, -0.0314032 , -0.02394999,
-0.00030306],
[-0.01677033, 0.04876196, 0.05794405, -0.02888608, -0.02658239,
-0.00015171],
[-0.01572544, 0.04779808, 0.05939355, -0.03048976, -0.02896303,
-0.00090334],
[-0.01542805, 0.04709881, 0.05839922, -0.02894112, -0.02240603,
-0.00188624],
[-0.01493233, 0.0476524 , 0.0581631 , -0.0297201 , -0.02485022,
-0.00087418],
[-0.01804641, 0.04739738, 0.06070606, -0.02981704, -0.02543145,
-0.00115484],
[-0.01518638, 0.04843838, 0.05744548, -0.02980216, -0.02420005,
0.00036349],
[-0.01442349, 0.04673778, 0.05804737, -0.03062913, -0.02476445,
-0.00066772],
[-0.01598305, 0.04622466, 0.0588723 , -0.03096713, -0.02364032,
-0.00005574]])
Given another (32,) array of indices, actions, with values from 0 to 5 inclusive:
array([0, 2, 5, 5, 1, 1, 3, 4, 0, 5, 4, 3, 4, 5, 1, 0, 3, 0, 0, 2, 2, 2,
0, 1, 4, 1, 4, 4, 0, 4, 1, 0])
I'm expecting this result:
array([-0.01656106, 0.0563598 , 0.00345904, -0.00487727, 0.04894669,
0.04770135, -0.02956396, -0.02655901, -0.0172577 , -0.00065042,
-0.02132723, -0.02927143, -0.02574112, -0.0007656 , 0.04781894,
-0.01357749, -0.03136749, -0.01609829, -0.01503662, 0.05744712,
0.05773868, 0.05739099, -0.01324895, 0.05004944, -0.02658239,
0.04779808, -0.02240603, -0.02485022, -0.01804641, -0.02420005,
0.04673778, -0.01598305], dtype=float32)
For self.batch_size == 32, I'm able to achieve what I need in numpy using:
state_action_values = target_values[np.arange(self.batch_size), actions]
For target_value_update being another (32,) array of new values, I will need to assign the new values to this slice using:
target_values[np.arange(self.batch_size), actions] = target_value_update
However in tensorflow under tf.function, this is not possible and I get the following error:
TypeError: Only integers, slices (`:`), ellipsis (`...`), tf.newaxis (`None`) and scalar tf.int32/tf.int64 tensors are valid indices, got array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])
So I try:
target_values = tf.Variable(target_values)
state_action_values = tf.gather(target_values, actions, axis=1)
However, state_action_values then has shape (32, 32) instead of the expected (32,):
Tensor("GatherV2:0", shape=(32, 32), dtype=float32)
Use gather_nd():
a = tf.range(32)[:, tf.newaxis]
a = tf.concat((a, actions[:, tf.newaxis]), -1)
output = tf.gather_nd(target_values, a)
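To make it clearer what the index tensor `a` contains, here is the same construction as a NumPy-only sketch on a small 4 x 3 stand-in for target_values. The names `index_pairs`, `picked` and `expected` are made up for illustration, and the loop only mimics what a gather over (row, column) pairs does.

```python
import numpy as np

# Small stand-ins for target_values (batch, n_actions) and actions (batch,).
target_values = np.arange(12, dtype=np.float32).reshape(4, 3)
actions = np.array([0, 2, 1, 2])

# One (row, column) pair per batch entry, the layout gather_nd expects.
index_pairs = np.stack([np.arange(len(actions)), actions], axis=-1)

# Picking target_values[row, column] for every pair ...
picked = np.array([target_values[r, c] for r, c in index_pairs])
# ... matches the fancy-indexing expression from the question.
expected = target_values[np.arange(len(actions)), actions]

print(index_pairs)
print(picked)
```

For the assignment part of the question, tf.tensor_scatter_nd_update(target_values, a, target_value_update) takes the same kind of index tensor and returns a new tensor. I have not tested that here, so check it against the TensorFlow documentation.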

What's the best way to re-order rows in a large binary file?

I have some large data files (32 x VERY BIG) that I would like to concatenate.
However, the data were collected in the wrong order, so I need to reorder the rows as well.
So far, what I am doing is:
# Assume FILE_1 and FILE_2 are paths to the appropriate files.
# FILE_1 is a matrix of size 32 x SIZE_1
# FILE_2 is a matrix of size 32 x SIZE_2
data_1 = np.memmap(FILE_1, mode='r', dtype='<i2', order='F', shape=(32, SIZE_1))
data_2 = np.memmap(FILE_2, mode='r', dtype='<i2', order='F', shape=(32, SIZE_2))
data_out = np.memmap('output', mode='w+', dtype='<i2', order='F', shape=(32, SIZE_1 + SIZE_2))
channel_mapping = [15, 14, 13, 12, 11, 10, 9, 8, 0, 1, 2, 3, 4, 5, 6, 7,
24, 25, 26, 27, 28, 29, 30, 31, 23, 22, 21, 20, 19, 18, 17, 16]
data_out[:, :SIZE_1] = data_1[channel_mapping, :]
data_out[:, SIZE_1:SIZE_1 + SIZE_2] = data_2[channel_mapping, :]
I actually do this in a for loop with more than 2 files, but you get the idea.
Is this the most efficient way to do this? I am afraid that the application of channel_mapping will write the data to memory and slow the whole process down. As it is, this is much slower than simply concatenating the files.
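One possible approach, offered as a sketch rather than a benchmarked answer: indexing a memmap with a list such as channel_mapping produces an in-memory copy of the selected data, so copying in blocks of columns bounds how much is held in memory at once. The function name `reorder_and_concat`, the `chunk` parameter and its default are assumptions for illustration. The demonstration uses 4 channels and tiny temporary files in place of the real 32-channel data.

```python
import os
import tempfile

import numpy as np


def reorder_and_concat(paths, sizes, out_path, channel_mapping,
                       n_channels=32, chunk=1_000_000):
    """Copy each input into one output file, reordering rows chunk by chunk."""
    total = sum(sizes)
    out = np.memmap(out_path, mode="w+", dtype="<i2", order="F",
                    shape=(n_channels, total))
    offset = 0
    for path, size in zip(paths, sizes):
        src = np.memmap(path, mode="r", dtype="<i2", order="F",
                        shape=(n_channels, size))
        for start in range(0, size, chunk):
            stop = min(start + chunk, size)
            # Fancy indexing copies only this block of columns into memory.
            out[:, offset + start:offset + stop] = src[channel_mapping, start:stop]
        offset += size
    out.flush()
    return out


# Tiny demonstration with 4 channels instead of 32.
a = np.arange(20, dtype="<i2").reshape(4, 5)
b = np.arange(100, 112, dtype="<i2").reshape(4, 3)
mapping = [3, 2, 1, 0]
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.bin")
    path_b = os.path.join(tmp, "b.bin")
    a.ravel(order="F").tofile(path_a)  # column-major layout on disk
    b.ravel(order="F").tofile(path_b)
    out = reorder_and_concat([path_a, path_b], [5, 3],
                             os.path.join(tmp, "out.bin"), mapping,
                             n_channels=4, chunk=2)
    result = np.array(out)  # copy into memory before the files are removed
    del out
```

With order='F' each block of columns is contiguous on disk, so every chunk is read and written as one contiguous region. Whether this is actually faster than the one-shot assignment depends on the chunk size and the disk, so it is worth timing on the real files.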

Merge numeric and text features for category classification

I'm trying to classify product items in order to predict their category based on the product title and their base price.
An example(product title, price, category):
['notebook sony vaio vgn-z770td dockstation', 3000.0, u'MLA54559']
Previously I was only using product title for the prediction task but I'd like to include the price to see if the accuracy improves.
The problem with my code is that I can't merge the text and numeric features. I've been reading some questions here on SO, and this is my code excerpt:
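import numpy as np
from scipy import sparse
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# extracting features from text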
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform([e[0] for e in training_set])
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
#extracting numerical features
X_train_price = np.array([e[1] for e in training_set])
X = sparse.hstack([X_train_tfidf, X_train_price]) #this is where the problem begins
clf = svm.LinearSVC().fit(X, [e[2] for e in training_set])
I try to merge the data types with sparse.hstack but I get the following error:
ValueError: blocks[0,:] has incompatible row dimensions
I guess the problem lies in X_train_price (a list of prices), but I don't know how to format it for the sparse function to work successfully.
These are the shapes of both arrays:
>>> X_train_tfidf.shape
(65845, 23136)
>>> X_train_price.shape
(65845,)
It looks to me like this should be as simple as stacking the arrays. If scikit-learn follows the conventions I'm familiar with, then each row in X_train_tfidf is a training datapoint, and there are a total of 65845 points. So you just have to do an hstack -- as you said you tried to do.
However, you need to make sure the dimensions are compatible! In vanilla numpy you get this error otherwise:
>>> a = numpy.arange(15).reshape(5, 3)
>>> b = numpy.arange(15, 20)
>>> numpy.hstack((a, b))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/
Extras/lib/python/numpy/core/shape_base.py", line 270, in hstack
return _nx.concatenate(map(atleast_1d,tup),1)
ValueError: arrays must have same number of dimensions
Reshape b to have the correct dimensions -- noting that a 1-d array of shape (5,) is totally different from a 2-d array of shape (5, 1).
>>> b
array([15, 16, 17, 18, 19])
>>> b.reshape(5, 1)
array([[15],
[16],
[17],
[18],
[19]])
>>> numpy.hstack((a, b.reshape(5, 1)))
array([[ 0, 1, 2, 15],
[ 3, 4, 5, 16],
[ 6, 7, 8, 17],
[ 9, 10, 11, 18],
[12, 13, 14, 19]])
So in your case, you want an array of shape (65845, 1) instead of (65845,). I might be missing something because you are using sparse arrays. Nonetheless, the principle ought to be the same. I have no idea what sparse format you're using based on the above code, so I just picked one to test:
>>> a = scipy.sparse.lil_matrix(numpy.arange(15).reshape(5, 3))
>>> scipy.sparse.hstack((a, b.reshape(5, 1))).toarray()
array([[ 0, 1, 2, 15],
[ 3, 4, 5, 16],
[ 6, 7, 8, 17],
[ 9, 10, 11, 18],
[12, 13, 14, 19]])
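Applied to the variable names from the question, the reshape might look like the sketch below. The small arrays are stand-ins for the real (65845, 23136) and (65845,) data, the CSR format is an assumption, and `price_column` is a name made up for illustration.

```python
import numpy as np
from scipy import sparse

# Stand-ins for the question's arrays: 3 samples, 4 text features.
X_train_tfidf = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0, 0.5],
                                            [1.0, 0.0, 0.0, 0.0],
                                            [0.0, 0.0, 0.7, 0.3]]))
X_train_price = np.array([3000.0, 150.0, 42.5])  # shape (3,)

# reshape(-1, 1) turns the 1-D price array into a single column.
price_column = X_train_price.reshape(-1, 1)      # shape (3, 1)

# Both blocks now have 3 rows, so hstack can place them side by side.
X = sparse.hstack([X_train_tfidf, price_column]).tocsr()
print(X.shape)
```

Using -1 in the reshape lets NumPy infer the number of rows, so the same line works for the full 65845-row array.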
