Split numpy array based on column value in list - python

I'm new to numpy and I want to split a 2D array based on whether the values of one column appear in another list.
I converted a pandas dataframe to a 2D numpy array, and I also have a list. I want to split my numpy array into two arrays: the first contains the rows whose second-column value is in the list, and the second contains the rest of the rows. I also want the rest of my list (all the values that don't exist in my numpy array).
numpy_data = np.array([
[1, 'p1', 2],
[11, 'p2', 8],
[1, 'p8', 21],
[13, 'p10', 2] ])
list_value = ['p1', 'p3', 'p8']
The expected output :
data_in_list = [
[1, 'p1', 2],
[1, 'p8', 21]]
list_val_in_numpy = ['p1', 'p8'] # intersection of second column with my list
rest_data = [
[11, 'p2', 8],
[13, 'p10', 2]]
rest_list_value = ['p3']
In my code I found how to get the first output:
first_output = numpy_data[np.isin(numpy_data[:,1], list_value)]
But I couldn't get the rest of my numpy array. I also tried iterating over my list, looking up each value in the second column of the array, and then deleting the matching row. In this case I don't need the first output (which I called data_in_list, because I already do what I need with it); I need the other outputs.
for val in list_value:
    row = numpy_data[np.where(numpy_data[:,1] == val)]
    if row.size != 0:
        # My custom code
        # then remove this row from my numpy, I couldn't do it
        pass
Thanks in advance

Use python's invert ~ operator over the result of the np.isin:
rest = numpy_data[~np.isin(numpy_data[:,1], list_value)]
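Putting the mask and its inverse together, all four requested outputs can be sketched as below. This is a minimal sketch using the data from the question; the names mask and present are introduced here for illustration. Note that np.array turns the mixed rows into strings, so the numbers come back as '1', '11', and so on.

```python
import numpy as np

numpy_data = np.array([
    [1, 'p1', 2],
    [11, 'p2', 8],
    [1, 'p8', 21],
    [13, 'p10', 2],
])
list_value = ['p1', 'p3', 'p8']

# Boolean mask: True where the second column's value is in the list
mask = np.isin(numpy_data[:, 1], list_value)

data_in_list = numpy_data[mask]    # rows whose second column is in the list
rest_data = numpy_data[~mask]      # all remaining rows

# Plain Python set of the second-column values, for fast membership tests
present = set(numpy_data[:, 1].tolist())
list_val_in_numpy = [v for v in list_value if v in present]
rest_list_value = [v for v in list_value if v not in present]

print(data_in_list)
print(rest_data)
print(list_val_in_numpy)
print(rest_list_value)
```

Computing the mask once and reusing it avoids calling np.isin twice.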

There are multiple ways of doing this. I would prefer a vectorized approach or a list comprehension, but for the sake of clarity here is a loop that does the same thing.
data_in_list = []
list_val_in_numpy = []
rest_data = []
for row in numpy_data:
    # Only the second column is compared against the list
    if row[1] in list_value:
        data_in_list.append(row)
        list_val_in_numpy.append(row[1])
    else:
        rest_data.append(row)
# Values from the list that never appear in the second column
rest_list_value = [v for v in list_value if v not in list_val_in_numpy]
This gives you all the lists you were looking for.

A list comprehension will solve it, I guess:
numpy_data = [
[1, 'p1', 2],
[11, 'p2', 8],
[1, 'p8', 21],
[13, 'p10', 2],
]
list_value = ['p1', 'p3', 'p8']
output_list = [item for item in numpy_data if item[1] in list_value]
print(output_list)
output:
[[1, 'p1', 2], [1, 'p8', 21]]

Related

Comparing rows in dataset based on a specific column to find min/max

So I have a dataset that contains the history of a specific tag from a start date to an end date. I am trying to compare rows based on a date column: if they match by month, day and year, I add the value from the next column to a temporary list. Once I have all the items for one date, I take that list, find the min and max values, subtract them, append the result to another list, and empty the temporary list to start over.
For the sake of time and simplicity, I am just presenting an example as a 2D list. Here's my example data:
dataset = [[1,5],[1,6],[1,10],[1,23],[2,4],[2,8],[2,12],[3,10],[3,20],[3,40],[4,50],[4,500]]
where the first column acts as the date and the second as the value.
The issue I am having is:
I can't seem to compare every row based on its first column and collect the value from the second column into the temp list to perform the min/max operations.
Based on the above 2D list I would expect to get [18,8,30,450] but the result is [5,4,10]
dataset = [[1,5],[1,6],[1,10],[1,23],[2,4],[2,8],[2,12],[3,10],[3,20],[3,40],[4,50],[4,500]]
temp_list = []
daily_total = []
for i in range(len(dataset)-1):
    if dataset[i][0] == dataset[i+1][0]:
        temp_list.append(dataset[i][1])
    else:
        max_ = max(temp_list)
        min_ = min(temp_list)
        total = max_ - min_
        daily_total.append(total)
        temp_list = []
print([x for x in daily_total])
Try:
tmp = {}
for d, v in dataset:
    tmp.setdefault(d, []).append(v)
out = [max(v) - min(v) for v in tmp.values()]
print(out)
Prints:
[18, 8, 30, 450]
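The loop in the question compares each row with the next one, so the last value of every date is never appended and the final date is never flushed, which is why it prints [5,4,10]. If the rows are already grouped by date, as in the example, itertools.groupby is another way to collect the values per date. A minimal sketch, assuming rows with equal dates are adjacent (the name ranges is introduced here for illustration):

```python
from itertools import groupby
from operator import itemgetter

dataset = [[1, 5], [1, 6], [1, 10], [1, 23], [2, 4], [2, 8], [2, 12],
           [3, 10], [3, 20], [3, 40], [4, 50], [4, 500]]

ranges = []
# groupby yields one group per run of consecutive rows sharing the same date
for _, rows in groupby(dataset, key=itemgetter(0)):
    values = [value for _, value in rows]
    ranges.append(max(values) - min(values))

print(ranges)  # [18, 8, 30, 450]
```

If equal dates may be scattered through the data, sort by the first column before grouping, or use the dictionary approach.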
Here is a solution using pandas:
import pandas as pd
dataset = [
[1, 5],
[1, 6],
[1, 10],
[1, 23],
[2, 4],
[2, 8],
[2, 12],
[3, 10],
[3, 20],
[3, 40],
[4, 50],
[4, 500],
]
df = pd.DataFrame(dataset)
df.columns = ["date", "value"]
df = df.groupby("date").agg(min_value=("value", "min"), max_value=("value", "max"))
df["res"] = df["max_value"] - df["min_value"]
df["res"].to_list()
Output:
[18, 8, 30, 450]

How can I create a label encoder utilizing only numpy (and not sklearn LabelEncoder)?

I am trying to recreate something similar to the
sklearn.preprocessing.LabelEncoder
However I do not want to use sklearn or pandas. I would like to only use numpy and the Python standard library. Here's what I would like to achieve:
import numpy as np
input = np.array([['hi', 'there'],
['scott', 'james'],
['hi', 'scott'],
['please', 'there']])
# Output would look like
np.array([[0, 0],
          [1, 1],
          [0, 2],
          [2, 0]])
It would also be great to be able to map it back as well, so a result would then look exactly like the input again.
Here's a simple comprehension, using the return_inverse result from np.unique
arr = np.array([['hi', 'there'], ['scott', 'james'],
['hi', 'scott'], ['please', 'there']])
np.column_stack([np.unique(arr[:, i], return_inverse=True)[1] for i in range(arr.shape[1])])
array([[0, 2],
[2, 0],
[0, 1],
[1, 2]], dtype=int64)
Or applying along the axis:
np.column_stack(np.apply_along_axis(np.unique, 0, arr, return_inverse=True)[1])
I was talking to @Scott Stoltzmann and we spitballed a way to reverse the accepted answer.
One can either carry the original arr along throughout the program or record the mappings for each column. If you do the latter, here's some simple non-performant code to do so (arr2 is the encoded array produced by the accepted answer):
l = []
for real_column, encoded_column in zip(np.column_stack(arr), np.column_stack(arr2)):
    d = {}
    for real_element, encoded_element in zip(real_column, encoded_column):
        d[encoded_element] = real_element
    l.append(d)
print(l)
Doing this with the above yields:
[{0: 'hi', 2: 'scott', 1: 'please'}, {2: 'there', 0: 'james', 1: 'scott'}]
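Another way to make the encoding reversible is to keep the sorted unique values that np.unique returns for each column, since indexing them with the codes gives back the original labels. A minimal sketch, where encode_columns and decode_columns are hypothetical helper names, not part of numpy:

```python
import numpy as np

def encode_columns(arr):
    """Encode each column separately; return the codes and one vocabulary per column."""
    codes, vocabularies = [], []
    for i in range(arr.shape[1]):
        uniques, inverse = np.unique(arr[:, i], return_inverse=True)
        vocabularies.append(uniques)
        codes.append(inverse.reshape(-1))
    return np.column_stack(codes), vocabularies

def decode_columns(encoded, vocabularies):
    """Map the codes of each column back to labels by indexing its vocabulary."""
    return np.column_stack(
        [vocab[encoded[:, i]] for i, vocab in enumerate(vocabularies)]
    )

arr = np.array([['hi', 'there'], ['scott', 'james'],
                ['hi', 'scott'], ['please', 'there']])
encoded, vocabularies = encode_columns(arr)
decoded = decode_columns(encoded, vocabularies)
print(encoded)
print(decoded)
```

The codes follow the sorted order of the labels in each column, as in the accepted answer, not the order of first appearance shown in the question.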
Try this method, which is both beautiful (almost) and optimal:
labels = np.array([['hi', 'there'], ['scott', 'james'],
['hi', 'scott'], ['please', 'there']])
indexes = {val: idx for idx, val in enumerate(np.unique(labels))}
encoded = np.array([indexes[val] for val in labels.flatten()]).reshape(labels.shape)
print(f'Indexes: {indexes}')
print(f'Encoded labels: {encoded}')
The output:
Indexes: {'hi': 0, 'james': 1, 'please': 2, 'scott': 3, 'there': 4}
Encoded labels: [[0 4]
[3 1]
[0 3]
[2 4]]
Enjoy the labels encoder ;)

Replacing numbers in numpy array with the ones in the list

I have a 2D numpy array, and I'm looking to replace its contents with the numbers of a list by index.
Here's a code snippet to describe it more clearly:
import numpy as np
x = np.array([
[2, 'something'],
[2, 'more'],
[6, 'and more'],
[11, 'and so on'],
[11, 'etc..']
])
y = [1, 2, 3]
I tried to do it with the following code, but got an error and couldn't figure out why it occurs.
k = x[:, 0]
z = [2, 6, 11]
j = 0
for i in range(z[0], z[-1] + 1):
    k = np.where(i in k, y[j])
    j+=1
Error while running the above code:
Traceback (most recent call last):
File "<ipython-input-10-c48814c42718>", line 4, in <module>
k = np.where(i in k, y[j])
ValueError: either both or neither of x and y should be given
Output array I want to have:
# The output array which I intend to get
output = [
[1, 'something'],
[1, 'more'],
[2, 'and more'],
[3, 'and so on'],
[3, 'etc..']
]
If I understand correctly, this is one way you can do that:
import numpy as np
x = np.array([
[2, 'something'],
[2, 'more'],
[6, 'and more'],
[11, 'and so on'],
[11, 'etc..']
])
y = np.array([1, 2, 3])
# Find places where value changes, do cumsum and add a 0 at the beginning, then index y
x[:, 0] = y[np.r_[0, np.cumsum(np.diff(x[:, 0].astype(np.int32)) != 0)]]
# [['1' 'something']
# ['1' 'more']
# ['2' 'and more']
# ['3' 'and so on']
# ['3' 'etc..']]
Note here the result is strings because that is the type of the input array (NumPy will coerce to string unless dtype=object is specified). In any case, if you want to have mixed-type arrays, you should consider using a structured array.
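A structured array keeps the numeric column numeric, so no string conversion is needed. A minimal sketch, assuming the field names 'id' and 'label' and a maximum label length of 16 characters (all three are choices made here for illustration):

```python
import numpy as np

rows = [(2, 'something'), (2, 'more'), (6, 'and more'),
        (11, 'and so on'), (11, 'etc..')]

# One integer field and one fixed-width string field per row
x = np.array(rows, dtype=[('id', np.int64), ('label', 'U16')])
y = np.array([1, 2, 3])

# 'id' is a real integer column, so np.unique sorts it numerically
_, inverse = np.unique(x['id'], return_inverse=True)
x['id'] = y[inverse]

print(x['id'])
print(x['label'])
```

Each field is accessed by name, and x['id'] stays an integer array after the replacement.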
numpy.unique + return_inverse=True
You can create a mapping from differing elements in your column, and use basic numpy indexing to map those values to your input list.
y = np.array([1, 2, 3])
# Cast to int first: x holds strings, and sorting strings would put '11' before '2'
_, inv = np.unique(x[:, 0].astype(int), return_inverse=True)
x[:, 0] = y[inv]
array([['1', 'something'],
['1', 'more'],
['2', 'and more'],
['3', 'and so on'],
['3', 'etc..']], dtype='<U11')
The one caveat to this answer is that if another 2 appears later in the array, it will replace it with 1, not with a new value, but you will need to clarify your question if that is an issue.
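The difference between the two behaviours shows up on a column where a value reappears. A minimal sketch, using a made-up column ids that is not from the question:

```python
import numpy as np

ids = np.array([2, 2, 6, 2])  # hypothetical column where 2 reappears
y = np.array([1, 2, 3])

# Run-based: a new label starts whenever the value changes
run_index = np.r_[0, np.cumsum(np.diff(ids) != 0)]
by_run = y[run_index]

# Value-based: every occurrence of the same value gets the same label
_, inverse = np.unique(ids, return_inverse=True)
by_value = y[inverse]

print(by_run)    # [1 1 2 3]
print(by_value)  # [1 1 2 1]
```

The run-based version needs one entry in y per run, while the value-based version needs one entry per distinct value.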
Based on the size of your replacement list, this seems to be the desired behavior.
You can do this by getting unique values, ordering them in a brute force manner, and using a for loop to map. You would need to make sure your mapping list (y) is also ordered least to greatest.
ind = list(x[i][0] for i in range(len(x)))
lookup = set()
ind = [x for x in ind if x not in lookup and lookup.add(x) is None]
for i in range(len(x)):
    c = ind.index(x[i][0])
    x[i][0] = y[c]
print(x)
Output:
array([['1', 'something'],
['1', 'more'],
['2', 'and more'],
['3', 'and so on'],
['3', 'etc..']], dtype='<U11')
If you want to continue using a for loop like you currently have and make use of the y list you could do something like this:
import numpy as np
x = np.array([[2, 'something'], [2, 'more'], [6, 'and more'],
[11, 'and so on'], [11, 'etc..']])
y = [1, 2, 3]
y_index = 0
for i in range(0, x.shape[0] - 1):
    if x[i+1][0] != x[i][0]:
        x[i][0] = y[y_index]
        y_index += 1
    else:
        x[i][0] = y[y_index]
x[-1][0] = y[y_index] # Set last index
print(x)
Output:
[['1' 'something']
['1' 'more']
['2' 'and more']
['3' 'and so on']
['3' 'etc..']]

Merge and add duplicate integers from a multidimensional array

I have a multidimensional list where the first item is a date and the second is a date-time object; the second items need to be added together for each date. For example (I've left the second as an integer for simplicity):
[[01/01/2019, 10], [01/01/2019, 3], [02/01/2019, 4], [03/01/2019, 2]]
The resulting array should be:
[[01/01/2019, 13], [02/01/2019, 4], [03/01/2019, 2]]
Does someone have a short way of doing this?
The background to this is vehicle tracking, I have a list of trips performed by vehicle and I want to have a summary by day with a count of total time driven per day.
You should change your data 01/01/2019 to '01/01/2019'.
@naivepredictor suggested a good sample; anyway, if you don't want to import pandas, use this.
my_list = [['01/01/2019', 10], ['01/01/2019', 3], ['02/01/2019', 4], ['03/01/2019', 2]]
result_d = {}
for i in my_list:
    # Start from 0 the first time a date is seen, then keep adding
    result_d[i[0]] = result_d.get(i[0], 0) + i[1]
print(result_d) #{'01/01/2019': 13, '02/01/2019': 4, '03/01/2019': 2}
print([list(d) for d in result_d.items()]) #[['01/01/2019', 13], ['02/01/2019', 4], ['03/01/2019', 2]]
import pandas as pd
# create dataframe out of the given input
df = pd.DataFrame(data=[['01/01/2019', 10], ['01/01/2019', 3], ['02/01/2019', 4]], columns=['date', 'trip_len'])
# groupby date and sum values for each day
df = df.groupby('date').sum().reset_index()
# output result as list of lists
result = df.values.tolist()
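If the second item is a real duration instead of an integer, the same accumulation works with datetime.timedelta. A minimal sketch, assuming the trips are stored as (date, timedelta) pairs and that the example dates mean 1, 2 and 3 January 2019; the names trips, totals and summary are introduced here for illustration:

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical trip data: (day, time driven)
trips = [
    (date(2019, 1, 1), timedelta(minutes=10)),
    (date(2019, 1, 1), timedelta(minutes=3)),
    (date(2019, 1, 2), timedelta(minutes=4)),
    (date(2019, 1, 3), timedelta(minutes=2)),
]

# timedelta() is a zero duration, so it serves as the starting total
totals = defaultdict(timedelta)
for day, duration in trips:
    totals[day] += duration

summary = [[day, total] for day, total in totals.items()]
for day, total in summary:
    print(day, total)
```

Dictionaries keep insertion order, so the summary lists the days in the order they first appear.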

Get Maximum Value across rows and columns of a python Matrix

Consider the question:
The grid is:
[ [3, 0, 8, 4],
[2, 4, 5, 7],
[9, 2, 6, 3],
[0, 3, 1, 0] ]
The max viewed from top (i.e. max across columns) is: [9, 4, 8, 7]
The max viewed from left (i.e. max across rows) is: [8, 7, 9, 3]
I know how to define a grid in Python:
maximums = [[0 for x in range(len(grid[0]))] for x in range(len(grid))]
Getting maximum across rows looks easy:
max_left = [max(x) for x in grid]
But how to get maximum across columns?
Further, I need to find a way to do so in linear space O(M+N) where MxN is the size of the Matrix.
Use zip:
result = [max(i) for i in zip(*grid)]
In Python, * is not a pointer; it is used either to unpack an iterable into a function's positional arguments or to declare that a function accepts a variable number of arguments. For instance:
def f(*args):
    print(args)
f(434, 424, "val", 233, "another val")
Output:
(434, 424, 'val', 233, 'another val')
Or, given an iterable, each item can be passed as its corresponding function parameter:
def f(*args):
    print(args)
f(*["val", "val3", 23, 23])
Output:
('val', 'val3', 23, 23)
zip "transposes" a listing of data i.e each row becomes a column, and vice versa.
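Applied to the grid from the question, the transposition can be seen directly. A minimal sketch; the name columns is introduced here for illustration:

```python
grid = [[3, 0, 8, 4],
        [2, 4, 5, 7],
        [9, 2, 6, 3],
        [0, 3, 1, 0]]

# zip(*grid) is the same as zip(grid[0], grid[1], grid[2], grid[3]):
# the i-th tuple collects the i-th element of every row, i.e. column i
columns = list(zip(*grid))
print(columns[0])  # (3, 2, 9, 0)

max_top = [max(col) for col in columns]  # max of each column
max_left = [max(row) for row in grid]    # max of each row
print(max_top)   # [9, 4, 8, 7]
print(max_left)  # [8, 7, 9, 3]
```

Wrapping zip in list stores every column at once; iterating over zip(*grid) directly builds only one column tuple at a time.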
You could use numpy:
import numpy as np
x = np.array([ [3, 0, 8, 4],
[2, 4, 5, 7],
[9, 2, 6, 3],
[0, 3, 1, 0] ])
print(x.max(axis=0))
Output:
[9 4 8 7]
You said that you need to do this in O(m+n) space (not using numpy), so here's a solution that doesn't recreate the matrix:
col_max = list(grid[0])  # copy the first row so the grid itself is not modified
for row in grid:
    for j, value in enumerate(row):
        if value > col_max[j]:
            col_max[j] = value
print(col_max)
Output:
[9, 4, 8, 7]
I figured out a shortcut too:
transpose the matrix and then just take the maximum over its rows:
grid_transposed = [[grid[j][i] for j in range(len(grid))] for i in range(len(grid[0]))]
max_top = [max(x) for x in grid_transposed]
But then again this takes O(M*N) space, since I have to build a transposed copy of the matrix.
I don't want to use numpy as external libraries are not allowed in any assignments.
Easiest way is to use numpy's array max:
array.max(0)
Something like this works both ways and is quite easy to read:
# 1.
maxLR, maxTB = [], []
maxlr, maxtb = 0, 0
# max across rows (starting from 0 assumes the grid has no negative values)
for i, x in enumerate(grid):
    maxlr = 0
    for j, y in enumerate(grid[0]):
        maxlr = max(maxlr, grid[i][j])
    maxLR.append(maxlr)
# max across columns
for j, y in enumerate(grid[0]):
    maxtb = 0
    for i, x in enumerate(grid):
        maxtb = max(maxtb, grid[i][j])
    maxTB.append(maxtb)
# 2.
row_maxes = [max(row) for row in grid]
col_maxes = [max(col) for col in zip(*grid)]
