Group list elements using pandas in python [duplicate]

This question already has answers here:
Group Python List Elements
(4 answers)
Closed 6 years ago.
I have a python list as follows:
my_list = [[25, 1, 0.65],
           [25, 3, 0.63],
           [25, 2, 0.62],
           [50, 3, 0.65],
           [50, 2, 0.63],
           [50, 1, 0.62]]
I want to order them according to this rule:
1 --> [0.65, 0.62] <--25, 50
2 --> [0.62, 0.63] <--25, 50
3 --> [0.63, 0.65] <--25, 50
So the expected result is as follows:
Result = [[0.65, 0.62],[0.62, 0.63],[0.63, 0.65]]
I tried as follows:
import pandas as pd
df = pd.DataFrame(my_list,columns=['a','b','c'])
res = df.groupby(['b', 'c']).get_group('c')
print res
ValueError: must supply a tuple to get_group with multiple grouping keys
How to do it guys?

Here is a pandas solution: sort the list by the first column, group by the second column, and convert the third column to a list. If you prefer the result to be a list of lists, use the tolist() method afterwards:
df = pd.DataFrame(my_list, columns=list('ABC'))
s = df.sort_values('A').groupby('B').C.apply(list)
#B
#1 [0.65, 0.62]
#2 [0.62, 0.63]
#3 [0.63, 0.65]
#Name: C, dtype: object
The above returns a pandas Series.
To get a list of lists:
s.tolist()
# [[0.65000000000000002, 0.62], [0.62, 0.63], [0.63, 0.65000000000000002]]
To get a numpy array of lists:
s.values
# array([[0.65000000000000002, 0.62], [0.62, 0.63],
# [0.63, 0.65000000000000002]], dtype=object)
s.values[0]
# [0.65000000000000002, 0.62] # here each element in the array is still a list
To get a 2D array or a matrix, you can transform the data frame in a different way, i.e. pivot your original data frame to wide format and then convert it to a 2D array:
df.pivot(index='B', columns='A', values='C').to_numpy()
# note: the original .as_matrix() was removed in later pandas; .to_numpy() is the current equivalent
# array([[ 0.65, 0.62],
# [ 0.62, 0.63],
# [ 0.63, 0.65]])
Or:
np.array(s.tolist())
# array([[ 0.65, 0.62],
# [ 0.62, 0.63],
# [ 0.63, 0.65]])

Here is another way, since it seems from your question that you were trying to use get_group():
g = [1, 2, 3]
result = []
for i in g:
    lst = df.groupby('b')['c'].get_group(i).tolist()
    result.append(lst)
print(result)
[[0.65, 0.62], [0.62, 0.63], [0.63, 0.65]]
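If you'd rather not hardcode g = [1, 2, 3], a small variation (a sketch, assuming the same df as above, and relying on my_list already being ordered by column 'a' within each group) derives the keys from the data:
result = []
for key in sorted(df['b'].unique()):
    result.append(df.groupby('b')['c'].get_group(key).tolist())
print(result)
# [[0.65, 0.62], [0.62, 0.63], [0.63, 0.65]]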

Related

Structure arrays for broadcasting numpy python

I have a dataframe which is in long format:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1], 'col2': [10]})
ratio = pd.Series([0.1, 0.70, 0.2])
# Expected Output
df_multiplied = pd.DataFrame({'col1': [0.1, 0.7, 0.2], 'col2': [1, 7, 2]})
My attempt was to convert it into numpy arrays and use np.tile:
np.tile(df.T.values, len(ratio)) * np.array(ratio).T
Is there any better way to do this?
Thank you!
Repeat the row n times, where n is the length of the ratio series, then multiply along the row axis by the ratio series:
>>> pd.concat([df]*ratio.shape[0], ignore_index=True).mul(ratio, axis='rows')
   col1  col2
0   0.1   1.0
1   0.7   7.0
2   0.2   2.0
Or, you can implement similar logic with numpy: repeat the array n times, then multiply by the ratio values with an expanded dimension:
>>> np.repeat([df.values], ratio.shape[0], axis=1)*ratio.values[:,None]
array([[[0.1, 1. ],
[0.7, 7. ],
[0.2, 2. ]]])
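Note that the repeated-array result above keeps a leading singleton dimension (shape (1, 3, 2)). If you want a plain 2-D result, or a DataFrame matching the expected output, here is a sketch assuming the same df and ratio as above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1], 'col2': [10]})
ratio = pd.Series([0.1, 0.70, 0.2])

# repeat the single row along axis 0, then scale each row by its ratio value
out = np.repeat(df.values, ratio.shape[0], axis=0) * ratio.values[:, None]
df_multiplied = pd.DataFrame(out, columns=df.columns)
#    col1  col2
# 0   0.1   1.0
# 1   0.7   7.0
# 2   0.2   2.0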

Add list of numbers to an empty list using for loop in Python

I recently hit a wall with a fairly simple thing, but no matter what I try, I am unable to solve it.
I created a small function that calculates some values and returns a list as its output:
def calc(file):
    # some calculation based on file
    return degradation  # a list
for example, for file "data1.txt"
degradation = [1,0.9,0.8,0.5]
and for file "data2.txt"
degradation = [1,0.8,0.6,0.2]
Since I have several files to which I want to apply calc(), I want to join the returned lists sideways into an array with len(degradation) rows and as many columns as there are files. I was planning to do it with a for loop.
For this specific case, something like:
output = 1  , 1
         0.9, 0.8
         0.8, 0.6
         0.5, 0.2
I tried with pandas as well, but without success.
import numpy as np
arr2d = np.array([[1, 2, 3, 4]])
arr2d = np.append(arr2d, [[9, 8, 7, 6]], axis=0).T
I expect an output something like this:
array([[1, 9],
[2, 8],
[3, 7],
[4, 6]])
You can use numpy.hstack() to achieve this.
Imagine you have data from the first two files from the first two iterations of the for loop.
data1.txt gives you
degradation1 = [1,0.9,0.8,0.5]
and data2.txt gives you
degradation2 = [1,0.8,0.6,0.2]
First, you have to convert both lists into lists of lists.
degradation1 = [[i] for i in degradation1]
degradation2 = [[i] for i in degradation2]
This gives the outputs,
print(degradation1)
print(degradation2)
[[1], [0.9], [0.8], [0.5]]
[[1], [0.8], [0.6], [0.2]]
Now you can stack the data using numpy.hstack(), which takes a tuple of arrays:
import numpy
stacked = numpy.hstack((degradation1, degradation2))
This gives the output
array([[1. , 1. ],
[0.9, 0.8],
[0.8, 0.6],
[0.5, 0.2]])
Imagine you have the file data3.txt during the 3rd iteration of the for loop and it gives
degradation3 = [1,0.3,0.6,0.4]
You can follow the same steps as above: convert it to a list of lists, then stack it with stacked.
degradation3 = [[i] for i in degradation3]
stacked = numpy.hstack((stacked, degradation3))
This gives you the output
array([[1. , 1. , 1. ],
[0.9, 0.8, 0.3],
[0.8, 0.6, 0.6],
[0.5, 0.2, 0.4]])
You can continue this for the whole loop.
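Putting it together, the whole loop might look like this (a sketch; the files list is hypothetical and calc() is the question's function):
import numpy as np

files = ['data1.txt', 'data2.txt', 'data3.txt']  # hypothetical input files
stacked = None
for f in files:
    degradation = calc(f)                 # returns a list, as in the question
    column = [[v] for v in degradation]   # convert to a column (list of lists)
    stacked = np.array(column) if stacked is None else np.hstack((stacked, column))
# stacked now has len(degradation) rows and one column per file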
Assume my_lists is a list of your lists.
my_lists = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400]]
result = []
for _ in my_lists[0]:
    result.append([])
for l in my_lists:
    for i in range(len(result)):
        result[i].append(l[i])
for line in result:
    print(line)
The output would be
[1, 10, 100]
[2, 20, 200]
[3, 30, 300]
[4, 40, 400]
As you seem to want to work with lists:
## degradations as list
degradation1 = [1,0.8,0.6,0.2]
degradation2 = [1,0.9,0.8,0.5]
degradation3 = [0.7,0.9,0.8,0.5]
degradations = [degradation1, degradation2, degradation3]
## CORE OF THE ANSWER ##
degradationstransposed = [list(i) for i in zip(*degradations)]
print(degradationstransposed)
[[1, 1, 0.7], [0.8, 0.9, 0.9], [0.6, 0.8, 0.8], [0.2, 0.5, 0.5]]
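Since the question also mentions arrays, the same transpose is a one-liner in numpy (a sketch, assuming all the lists have equal length):
import numpy as np

arr = np.array(degradations).T  # len(degradation) rows, one column per file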

Finding nearest element of an array in a particular column of another array

I want to map one numpy array to another one. My first array has two columns and thousands of rows:
arr_1 = [[20, 0.5],
[30, 0.75],
[40, 1.0],
[50, 1.25],
[60, 1.5],
[70, 1.75],
...]
The second array can have a different number of rows and columns:
arr_2 = [[1, 0.45],
[2, 0.57],
[4, 0.58],
[1, 1.69],
[1, 1.51],
[1, 0.95],
...]
I want to compare the values of the second column of arr_2 with the second column of arr_1 to find, for each row of arr_2, the row of arr_1 with the nearest value. Then I want to copy the first column of arr_1 from that nearest row into arr_2.
For example, 0.45 in arr_2 is closest to 0.5, i.e. first row in arr_1. After finding that, I want to copy the first column of that row (which is 20) into arr_2. The final result would look something like:
arr_2_final = [[1, 0.45, 20],
[2, 0.57, 20],
[4, 0.58, 20],
[1, 1.69, 70],
[1, 1.51, 60],
[1, 0.95, 40],
...]
Looking up lots of items in an array is easiest done when it is sorted. You can delegate most of the work to np.searchsorted. Since we want to find elements in arr_1, it is the only array that needs to be sorted. I suspect that having a sorted arr_2 will speed things up by reducing the size of the search space for every successive element.
First, find the insertion points where arr_2 would end up in arr_1:
indices = np.searchsorted(arr_1[:, 1], arr_2[:, 1])
Now all you have to do is check for cases where the prior element is closer than the current one. There are two corner cases: when index is 0, you have to accept it, and when it is arr_1.size, you have to take the prior.
indices[indices == arr_1.shape[0]] = arr_1.shape[0] - 1
indices[(indices != 0) & (arr_1[indices, 1] - arr_2[:, 1] > arr_2[:, 1] - arr_1[indices - 1, 1])] -= 1
Doing it in this order saves you the trouble of messing with temporary arrays. The first line ensures that the index arr_1[indices, 1] is always valid. Since index -1 is valid, the second line succeeds as well.
The final result is then
np.concatenate((arr_2, arr_1[indices, 0:1]), axis=1)
If arr_1 is not already sorted, you can do the following:
arr_1 = arr_1[np.argsort(arr_1[:, 1]), :]
A quick benchmark shows that on my very moderately powered machine, this approach takes ~300ms for arr_1.shape = (500000, 2) and arr_2.shape = (300000, 2).
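For reference, a sketch consolidating the steps above into a single function (assuming both inputs are 2-D numpy arrays of floats):
import numpy as np

def map_nearest(arr_1, arr_2):
    # sort arr_1 by its second column so searchsorted applies
    arr_1 = arr_1[np.argsort(arr_1[:, 1]), :]
    indices = np.searchsorted(arr_1[:, 1], arr_2[:, 1])
    # clamp insertion points that fall past the last element
    indices[indices == arr_1.shape[0]] = arr_1.shape[0] - 1
    # step back wherever the prior element is closer
    step_back = (indices != 0) & (arr_1[indices, 1] - arr_2[:, 1] > arr_2[:, 1] - arr_1[indices - 1, 1])
    indices[step_back] -= 1
    return np.concatenate((arr_2, arr_1[indices, 0:1]), axis=1)

arr_2_final = map_nearest(np.asarray(arr_1, float), np.asarray(arr_2, float))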
I would probably do it this way: for each row of arr_2, find the index of the nearest value in the second column of arr_1, then append the corresponding first-column value of arr_1 (note the lookup has to run over arr_1, so each arr_2 row gets exactly one new element):
import numpy as np
arr_1 = [[20, 0.5], [30, 0.75], [40, 1], [50, 1.25], [60, 1.5], [70, 1.75]]
arr_2 = [[1, 0.45], [2, 0.57], [4, 0.58], [1, 1.69], [1, 1.51], [1, 0.95]]
arr_1_np = np.array(arr_1)[:, 1]
for row in arr_2:
    idx = np.argmin(np.abs(arr_1_np - row[1]))
    row.append(arr_1[idx][0])
print(arr_2)
# [[1, 0.45, 20], [2, 0.57, 20], [4, 0.58, 20], [1, 1.69, 70], [1, 1.51, 60], [1, 0.95, 40]]

Adding a column to a pandas dataframe based on other columns

Problem description
Introductory remark: see the code below.
Let's say we have a pandas dataframe consisting of 3 columns and 3 rows.
I'd like to add a 4th column called 'Max_LF' that will consist of a list. The value of each cell is derived by looking at the column 'Max_WD'. For the first row that would be 0.35, which is then compared to the values in the column 'WD', where 0.35 can be found at the third position. Therefore, the third value of the column 'LF' should be written into the column 'Max_LF'. If the value of 'Max_WD' occurs multiple times in 'WD', then all corresponding items of 'LF' should be written into 'Max_LF'.
Failed attempt
So far I have made various attempts, first at retrieving the index of the 'Max_WD' item in 'WD'. After retrieving the index, the idea was to then get the items of 'LF' via their index:
df4['temp_indices'] = [i for i, x in enumerate(df4['WD']) if x == df4['Max_WD']]
However, a ValueError occurred:
raise ValueError('Lengths must match to compare')
ValueError: Lengths must match to compare
This is what the example dataframe looks like:
df = pd.DataFrame(data={'LF': [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] , 'WD': [[0.28, 0.34, 0.35, 0.18], [0.42, 0.45, 0.45, 0.18], [0.31, 0.21, 0.41, 0.41]], 'Max_WD': [0.35, 0.45, 0.41]})
The expected outcome should look like
df=pd.DataFrame(data={'LF': [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] , 'WD': [[0.28, 0.34, 0.35, 0.18], [0.42, 0.45, 0.45, 0.18], [0.31, 0.21, 0.41, 0.41]], 'Max_WD': [0.35, 0.45, 0.41], 'Max_LF': [[3] ,[2,3], [3,4]]})
You could get it by simply using a lambda with apply, as follows:
df['Max_LF'] = df.apply(lambda x : [i + 1 for i, e in enumerate(x['WD']) if e == x['Max_WD']], axis=1)
The output is
             LF  Max_WD                        WD  Max_LF
0  [1, 2, 3, 4]    0.35  [0.28, 0.34, 0.35, 0.18]     [3]
1  [1, 2, 3, 4]    0.45  [0.42, 0.45, 0.45, 0.18]  [2, 3]
2  [1, 2, 3, 4]    0.41  [0.31, 0.21, 0.41, 0.41]  [3, 4]
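Note that i + 1 yields positions, which coincide with the expected Max_LF only because LF happens to be [1, 2, 3, 4]. To pull the values out of LF itself, a small variation of the same lambda (a sketch):
df['Max_LF'] = df.apply(lambda x: [x['LF'][i] for i, e in enumerate(x['WD']) if e == x['Max_WD']], axis=1)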
Thanks guys! With your help I was able to solve my problem.
As Prince Francis suggested, I first did
df['temp'] = df.apply(lambda x: [i for i, e in enumerate(x['WD']) if e == x['Max_WD']], axis=1)
to get the indices of the matching 'WD' values. In a second step I could then add the actual column 'Max_LF' by doing
df['Max_LF'] = df.apply(lambda x: [x['LF'][e] for e in x['temp']], axis=1)
Thanks a lot!
You can achieve it by applying a function over axis 1.
For this, I recommend first converting the WD list into a pd.Series (or a numpy.ndarray) and then comparing all the values at once.
Assuming that you want a list of all the values higher than the threshold, you could use this:
>>> def get_max_wd(x):
...     wd = pd.Series(x.WD)
...     return list(wd[wd >= x.Max_WD])
...
>>> df.apply(get_max_wd, axis=1)
0 [0.35]
1 [0.45, 0.45]
2 [0.41, 0.41]
dtype: object
The result of the apply can then be assigned as a new column into the dataframe:
df['Max_LF'] = df.apply(get_max_wd, axis=1)
If what you are after is only the maximum value (see my comment above), you can use the max() method within the function.
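A sketch of that variant, returning a scalar per row instead of a list:
>>> def get_max_wd_scalar(x):
...     wd = pd.Series(x.WD)
...     return wd[wd >= x.Max_WD].max()
...
>>> df.apply(get_max_wd_scalar, axis=1)
0    0.35
1    0.45
2    0.41
dtype: float64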

Group Python List Elements

I have a python list as follows:
my_list = [[25, 1, 0.65],
           [25, 3, 0.63],
           [25, 2, 0.62],
           [50, 3, 0.65],
           [50, 2, 0.63],
           [50, 1, 0.62]]
I want to order them according to this rule:
1 --> [0.65, 0.62] <--25, 50
2 --> [0.62, 0.63] <--25, 50
3 --> [0.63, 0.65] <--25, 50
So the expected result is as follows:
Result = [[0.65, 0.62],[0.62, 0.63],[0.63, 0.65]]
How to do it guys?
I tried as follows:
df = pd.DataFrame(my_list,columns=['a','b','c'])
res = df.groupby(['b', 'c']).get_group('c')
print res
ValueError: must supply a tuple to get_group with multiple grouping keys
You can sort your list with native Python, but I find it easiest to get your required list using numpy. Since you were going to use pandas anyway, I consider this an acceptable solution:
from operator import itemgetter
import numpy as np
# or just use pandas.np if you have that already imported
my_list = [[25, 1, 0.65],
           [25, 3, 0.63],
           [25, 2, 0.62],
           [50, 3, 0.65],
           [50, 2, 0.63],
           [50, 1, 0.62]]
sorted_list = sorted(my_list, key=itemgetter(1, 0))  # sort by second, then first column
sliced_array = np.array(sorted_list)[:,-1].reshape(-1,2)
final_list = sliced_array.tolist() # to get a list
The main point is to use itemgetter to sort your list on two columns, one after the other. The resulting sorted list contains the required elements in its third column, which I extract with numpy. It could be done with native Python (see the sketch below), but if you're already using numpy/pandas, this should feel natural.
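For completeness, here is that native-Python version as a sketch, using itertools.groupby on the sorted list:
from itertools import groupby
from operator import itemgetter

sorted_list = sorted(my_list, key=itemgetter(1, 0))
final_list = [[row[2] for row in grp] for _, grp in groupby(sorted_list, key=itemgetter(1))]
# [[0.65, 0.62], [0.62, 0.63], [0.63, 0.65]]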
Use the following:
my_list = [[25, 1, 0.65], [25, 3, 0.63], [25, 2, 0.62], [50, 3, 0.65], [50, 2, 0.63], [50, 1, 0.62]]
list_25 = sorted([item for item in my_list if item[0] == 25], key=lambda item: item[1])
list_50 = sorted([item for item in my_list if item[0] == 50], key=lambda item: item[1])
res = [[i[2], j[2]] for i,j in zip(list_25, list_50)]
Output:
>>> res
[[0.65, 0.62], [0.62, 0.63], [0.63, 0.65]]
A way to do this with pandas is to extract each group, pull out 'c', convert it to a list and append it to the result list:
>>> z = []
>>> for g in df.groupby('b'):
...     z.append(g[1]['c'].tolist())
...
>>> z
[[0.65, 0.62], [0.62, 0.63], [0.63, 0.65]]
You could do this as a list comprehension:
>>> res = [g[1]['c'].tolist() for g in df.groupby('b')]
>>> res
[[0.65, 0.62], [0.62, 0.63], [0.63, 0.65]]
Another way would be to apply list directly to df.groupby('b')['c']; this gives you the object you need. Then call the .tolist() method to return a list of lists:
>>> df.groupby('b')['c'].apply(list).tolist()
[[0.65000000000000002, 0.62], [0.62, 0.63], [0.63, 0.65000000000000002]]
The numpy_indexed package (disclaimer: I am its author) has a one-liner for this kind of problem:
import numpy as np
import numpy_indexed as npi
my_list = np.asarray(my_list)
keys, table = npi.Table(my_list[:, 1], my_list[:, 0]).mean(my_list[:, 2])
Note that if duplicate values are present in the list, the mean is reported in the table.
EDIT: I added some improvements to the master branch of numpy_indexed that allow more control over the way you convert to a table; for instance, there is Table.unique, which asserts that each item in the table occurs once in the list, and Table.sum, and eventually all other reductions supported by the numpy_indexed package that make sense. Hopefully I can do a new release for that tonight.
