Python: Remap and reduce the range of numbers

I have some large numbers that serve as some sort of device identifiers:
clusteringOutput[:,1]
Out[140]:
array([1.54744609e+12, 1.54744946e+12, 1.54744133e+12, ...,
1.54744569e+12, 1.54744570e+12, 1.54744571e+12])
Even though the numbers are large, there are only a handful of distinct values, which repeat across the entries.
I would like to remap them onto a smaller range of integers. So if there are only 100 distinct values among these numbers, I would like to map them onto the range 1-100, with a mapping table that allows me to find and inspect those mappings.
The remapping functions I find on the internet typically rescale, and I do not want to rescale. I want concrete integer numbers that map my long ids to values that are easier on the eyes.
Any ideas on how I can implement that? I can use pandas data frames if it helps.
Thanks a lot
Alex

Use numpy.unique with return_inverse=True:
import numpy as np
arr = np.array([1.54744609e+12,
                1.54744946e+12,
                1.54744133e+12,
                1.54744133e+12,
                1.54744569e+12,
                1.54744570e+12,
                1.54744571e+12])
mapper, ind = np.unique(arr, return_inverse=True)
Output of ind:
array([4, 5, 0, 0, 1, 2, 3])
Remapping using mapper:
mapper[ind]
# array([1.54744609e+12, 1.54744946e+12, 1.54744133e+12, 1.54744133e+12,
# 1.54744569e+12, 1.54744570e+12, 1.54744571e+12])
Validation:
all(arr == mapper[ind])
# True
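If you also want the explicit mapping table the question mentions, here is a minimal sketch using pandas (assuming the arr, mapper and ind variables from above; the column names original_id and small_id are made up for illustration):
import numpy as np
import pandas as pd

arr = np.array([1.54744609e+12, 1.54744946e+12, 1.54744133e+12,
                1.54744133e+12, 1.54744569e+12, 1.54744570e+12,
                1.54744571e+12])
mapper, ind = np.unique(arr, return_inverse=True)

# small ids starting at 1, as in the 1-100 example from the question
small_ids = ind + 1

# lookup table: one row per distinct original id
mapping_table = pd.DataFrame({'original_id': mapper,
                              'small_id': np.arange(1, len(mapper) + 1)})
print(mapping_table)

# per-entry view: each original value next to its small id
print(pd.DataFrame({'original_id': arr, 'small_id': small_ids}))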

Related

How to calculate standard deviation of count-value pairs

In numpy, the function for calculating the standard deviation expects a list of values like [1, 2, 1, 1] and calculates the standard deviation from those. In my case I have a nested list of values and counts like [[1, 2], [3, 1]], where the first list contains the values and the second contains the count of how often the corresponding values appear.
I am looking for a clean way of calculating the standard deviation for a given list like above, clean meaning
an already existing function in numpy, scipy, pandas etc.
a more pythonic approach to the problem
a more concise and nicely readable solution
I already have a working solution that converts the nested value-count list into a flattened list of values and calculates the standard deviation with the function above, but I find it not that pleasing and would rather have another option.
A minimal working example of my workaround is
import numpy as np
# The usual way
values = [1,2,1,1]
deviation = np.std(values)
print(deviation)
# My workaround for the problem
value_counts = [[1, 2], [3, 1]]
values, counts = value_counts
flattened = []
for value, count in zip(values, counts):
    # append the current value count times
    flattened = flattened + [value]*count
deviation = np.std(flattened)
print(deviation)
The output is
0.4330127018922193
0.4330127018922193
Thanks for any ideas or suggestions :)
You are simply looking for numpy.repeat.
numpy.std(numpy.repeat(value_counts[0], value_counts[1]))
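For reference, a runnable version of this answer, plus a weighted alternative that avoids materializing the repeated values (the weighted variant is a sketch that is not part of the original answer):
import numpy as np

value_counts = [[1, 2], [3, 1]]
values, counts = value_counts

# the repeat-based answer
print(np.std(np.repeat(values, counts)))    # 0.4330127018922193

# weighted alternative: mean and variance computed directly from the counts
mean = np.average(values, weights=counts)
var = np.average((np.asarray(values) - mean) ** 2, weights=counts)
print(np.sqrt(var))                         # 0.4330127018922193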

How to make a sparse matrix in python from a data frame having column names as string

I need to convert a data frame to a sparse matrix. The data frame has an 'id' column and a 'names' column (the actual data is way too big: approximately 500,000 rows and 1,000 columns).
I need to convert it into a matrix such that the rows of the matrix are 'id' and the columns are 'names', and it should hold only the finite values. No NaNs should be stored (to reduce memory usage). When I tried pd.pivot_table, it took a long time to build the matrix for my big data.
In R there is a method called 'dMcast' for this purpose. I explored but could not find an alternative to it in Python. I'm new to Python.
First I would convert the categorical names column to indices. Maybe pandas has this functionality already?
names = list('PQRSPSS')
name_ids_map = {n:i for i, n in enumerate(set(names))}
name_ids = [name_ids_map[n] for n in names]
Then I would use scipy.sparse.coo and then maybe convert that to another sparse format.
import scipy.sparse

ids = [1, 1, 1, 1, 2, 2, 3]
rating = [2, 4, 1, 4, 2, 2, 1]
sp = scipy.sparse.coo_matrix((rating, (ids, name_ids)))
print(sp)
sp.tocsc()
I am not aware of a sparse matrix library that can index a dimension with categorical data like 'R', 'S', etc.
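To answer the "maybe pandas has this functionality already?" question above: pd.factorize maps each distinct label to an integer code, and those codes can be used directly as sparse row/column indices. A sketch under the assumption that the frame has 'id', 'names' and 'rating' columns (the small example values are made up):
import pandas as pd
import scipy.sparse

df = pd.DataFrame({
    'id':     [1, 1, 1, 1, 2, 2, 3],
    'names':  list('PQRSPSS'),
    'rating': [2, 4, 1, 4, 2, 2, 1],
})

row_codes, row_labels = pd.factorize(df['id'])
col_codes, col_labels = pd.factorize(df['names'])

sp = scipy.sparse.coo_matrix(
    (df['rating'].to_numpy(), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)))
print(sp.toarray())   # only for this tiny example; avoid toarray() on 500,000 x 1,000 data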

Intelligent averaging of time series data with python

I have the following (time-series) data:
t = [5.13, 5.27, 5.40, 5.46, 190.99, 191.13, 191.267, 368.70, 368.83, 368.90, 368.93]
y = [17.17, 17.18, 17.014, 17.104, 16.981, 16.96, 16.85, 17.27, 17.66, 17.76, 18.01]
So: groups of data points in short time intervals, separated cleanly by long time gaps.
I'm looking for a simple method that will intelligently average these together; sort of a 'Bayesian blocks' but for non-histogram data.
One could do a simple moving average, or numpy convolution, but I'm looking for something a bit smarter that will generalize to larger, similar, but not identical datasets.
It's easy with Pandas. First, construct a DataFrame:
import pandas as pd

df = pd.DataFrame({'t': t, 'y': y})
Then label the groups according to a time threshold:
groups = (df.t.diff() > 10).cumsum()
That gives you [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2], because cumsum() on a boolean array increments wherever the input is true.
Finally, use groupby():
df.groupby(groups).mean()
It gives you:
             t          y
t
0        5.315  17.117000
1      191.129  16.930333
2      368.840  17.675000
If you need plain NumPy arrays at the end, just tack on .t.values and .y.values.
If you don't know a priori what time threshold to use, I'm sure you can come up with some heuristic, perhaps involving simple statistics on df.t and df.t.diff().
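For completeness, a short sketch of getting plain arrays back, assuming the df and groups objects from above:
means = df.groupby(groups).mean()
t_means = means.t.values   # approx. [  5.315, 191.129, 368.840]
y_means = means.y.values   # approx. [ 17.117,  16.930,  17.675]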

finding the max of a column in an array

def maxvalues():
    for n in range(1,15):
        dummy=[]
        for k in range(len(MotionsAndMoorings)):
            dummy.append(MotionsAndMoorings[k][n])
        max(dummy)
        L = [x + [max(dummy)]] ## to be corrected (adding columns with value max(dummy))
        ## suggest code to add new row to L and for next function call, it should save values here.
I have an array of size (k x n) and I need to pick the max value of each column in that array. Please suggest if there is a simpler way than what I tried. My main aim is to append the results to L in columns rather than rows; if I just append, values are added at the end. I would like this to go into the columns of row 0 in L, because I will call this function again, add a new row to L and do the same. Please suggest.
General suggestions for your code
First of all, it's not very handy to access globals inside a function. It works, but it's not considered good style. So instead of using:
def maxvalues():
    do_something_with(MotionsAndMoorings)
you should do it with an argument:
def maxvalues(array):
    do_something_with(array)
MotionsAndMoorings = something
maxvalues(MotionsAndMoorings) # pass it to the function.
The next strange thing is that you seem to exclude the first column of your array:
for n in range(1,15):
I think that's unintended. The first element of a list has the index 0 and not 1. So I guess you wanted to write:
for n in range(0,15):
or even better, for arbitrary lengths:
for n in range(len(array[0])):  # the length of the first row, i.e. the number of columns
Alternatives to your iterations
But this would not be very intuitive anyway, because the max function already supports a very nice keyword argument (key), so you don't need to iterate over the whole array yourself:
import operator
column = 2
max(array, key=operator.itemgetter(column))[column]
This will return the row whose i-th element is maximal (you just pick your wanted column as that element). But max returns the whole row, so you need to extract just the i-th element from it.
So to get a list of all your maximums for each column you could do:
[max(array, key=operator.itemgetter(column))[column] for column in range(len(array[0]))]
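For illustration, a small runnable sketch using a made-up sample list rather than the asker's data:
import operator

array = [[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]

# maximum of column 2
print(max(array, key=operator.itemgetter(2))[2])   # 10

# maximum of every column
print([max(array, key=operator.itemgetter(column))[column]
       for column in range(len(array[0]))])        # [4, 5, 10]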
As for your L, I'm not sure what it is supposed to be, but you should probably also pass it as an argument to the function:
def maxvalues(array, L): # another argument here
but since I don't know what x and L are supposed to be, I won't go further into that. It looks like you want to turn the columns of MotionsAndMoorings into rows and the rows into columns. If so, you can do it with:
dummy = [[MotionsAndMoorings[j][i] for j in range(len(MotionsAndMoorings))] for i in range(len(MotionsAndMoorings[0]))]
that's a list comprehension that converts a list like:
[[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]
to an "inverted" column/row list:
[[1, 4, 0, 0], [2, 5, 2, 2], [3, 6, 10, 10]]
Alternative packages
But as roadrunner66 already said, sometimes it's easiest to use a library like numpy or pandas that already has very advanced and fast functions that do exactly what you want and are very easy to use.
For example, you convert a Python list to a numpy array simply with:
import numpy as np
Motions_numpy = np.array(MotionsAndMoorings)
you get the maximum of the columns by using:
maximums_columns = np.max(Motions_numpy, axis=0)
you don't even need to convert it to an np.array to use np.max, or to transpose it (turn rows into columns and columns into rows):
transposed = np.transpose(MotionsAndMoorings)
I hope this answer is not too unstructured. Some parts are suggestions for your function and some are alternatives. You should pick the parts that you need, and if you have any trouble with it, just leave a comment or ask another question. :-)
An example with a random input array, showing that you can take the max along either axis easily with one command.
import numpy as np

aa = np.random.random([4, 3])
print(aa)
print()
print(np.max(aa, axis=0))
print()
print(np.max(aa, axis=1))
Output:
[[ 0.51972266 0.35930957 0.60381998]
[ 0.34577217 0.27908173 0.52146593]
[ 0.12101346 0.52268843 0.41704152]
[ 0.24181773 0.40747905 0.14980534]]
[ 0.51972266 0.52268843 0.60381998]
[ 0.60381998 0.52146593 0.52268843 0.40747905]

How to efficiently create a sparse vector in python?

I have a dictionary of keys where each value should be a sparse vector of huge size (~700000 elements, maybe more). How do I efficiently grow/build this data structure?
Right now my implementation works only for smaller sizes.
from collections import defaultdict

myvec = defaultdict(list)
for id in id_data:
    for item in item_data:
        if item in item_data[id]:
            myvec[id].append(item * 0.5)
        else:
            myvec[id].append(0)
The above code, when used with huge files, quickly eats up all the available memory. I tried removing the myvec[id].append(0) branch and storing only non-zero values, because the length of each myvec[id] list is constant. That worked on my huge test file with decent memory consumption, but I'd rather find a better way to do it.
I know that there are different types of sparse arrays/matrices for this purpose, but I have no intuition which one is better. I tried to use lil_matrix from the scipy.sparse package instead of the myvec dict, but it turned out to be much slower than the above code.
So the problem basically boils down to the following two questions:
Is it possible to create a sparse data structure on the fly in Python?
How can one create such a sparse data structure with decent speed?
Appending to a list (or lists) will always be faster than appending to a numpy.array or to a sparse matrix (which stores data in several numpy arrays). lil is supposed to be the fastest when you have to grow the matrix incrementally, but it will still be slower than working directly with lists.
Numpy arrays have a fixed size. So the np.append function actually creates a new array by concatenating the old with the new data.
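A tiny illustration of that point (a sketch, not part of the original answer):
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)   # allocates and returns a brand-new array
print(a)              # [1 2 3]  -- the original is unchanged
print(b)              # [1 2 3 4]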
Your example code would be more useful if you gave us some data, so we could cut, paste and run it.
For simplicity, let's define
data_dict=dict(one=[1,0,2,3,0,0,4,5,0,0,6])
Sparse matrices can be created directly from this with:
from scipy import sparse

sparse.coo_matrix(data_dict['one'])
whose attributes are:
data: array([1, 2, 3, 4, 5, 6])
row: array([0, 0, 0, 0, 0, 0], dtype=int32)
col: array([ 0, 2, 3, 6, 7, 10], dtype=int32)
or
sparse.lil_matrix(data_dict['one'])
data: array([[1, 2, 3, 4, 5, 6]], dtype=object)
rows: array([[0, 2, 3, 6, 7, 10]], dtype=object)
The coo version is a lot faster when timed.
The sparse matrix only saves the nonzero data, but it also has to save an index. There is also a dictionary format, which uses a tuple (row,col) as the key.
An example of incremental construction is:
llm = sparse.lil_matrix((1,11), dtype=int)
for i in range(11):
    llm[0,i] = data_dict['one'][i]
For this small case this incremental approach is faster.
I get even better speed by only adding the nonzero terms to the sparse matrix:
llm = sparse.lil_matrix((1,11), dtype=int)
for i in range(11):
    if data_dict['one'][i] != 0:
        llm[0,i] = data_dict['one'][i]
I can imagine adapting this to your defaultdict example. Instead of myvec[id].append(0), you keep a record of where you appended the item * 0.5 values (whether in a separate list, or via a lil_matrix). It would take some experimenting to adapt this idea to a defaultdict.
So basically the goal is to create 2 lists:
data = [1, 2, 3, 4, 5, 6]
cols = [ 0, 2, 3, 6, 7, 10]
Whether you create a sparse matrix from these or not depends on what else you need to do with the data.
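A sketch of how the two-lists idea could be wired into the question's loop. The id_data and items structures below are hypothetical stand-ins, since the real data is not shown:
from scipy import sparse

items = [10, 20, 30, 40, 50]                  # full item "vocabulary"
id_data = {'a': {10, 30}, 'b': {20, 30, 50}}  # items present for each id

item_col = {item: j for j, item in enumerate(items)}
row_of_id = {id_: i for i, id_ in enumerate(id_data)}

rows, cols, data = [], [], []
for id_, present in id_data.items():
    for item in present:
        rows.append(row_of_id[id_])
        cols.append(item_col[item])
        data.append(item * 0.5)   # record only the nonzero entries, as suggested above

mat = sparse.coo_matrix((data, (rows, cols)),
                        shape=(len(id_data), len(items))).tocsr()
print(mat.toarray())
# [[ 5.  0. 15.  0.  0.]
#  [ 0. 10. 15.  0. 25.]]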
