Counting uneven bins in Pandas - python

df = pd.DataFrame({'email': ["a@gmail.com", "b@gmail.com", "c@gmail.com", "d@gmail.com", "e@gmail.com"],
                   'one': [88, 99, 11, 44, 33],
                   'two': [80, 80, 85, 80, 70],
                   'three': [50, 60, 70, 80, 20]})
Given this DataFrame, I would like to compute, for each of the columns one, two and three, how many values fall in certain ranges.
The ranges are for example: 0-70, 71-80, 81-90, 91-100
So the result would be:
out = pd.DataFrame({'colname': ["one", "two", "three"],
                    'b0to70': [3, 1, 4],
                    'b71to80': [0, 3, 1],
                    'b81to90': [1, 1, 0],
                    'b91to100': [1, 0, 0]})
What would be a nice idiomatic way to do this?

This would do it:
out = pd.DataFrame()
for name in ['one', 'two', 'three']:
    # bin each column into (0, 70], (70, 80], (80, 90], (90, 100] and count
    out[name] = pd.cut(df[name], bins=[0, 70, 80, 90, 100]).value_counts()
out.sort_index(inplace=True)
Returns:
           one  two  three
(0, 70]      3    1      4
(70, 80]     0    3      1
(80, 90]     1    1      0
(90, 100]    1    0      0
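If you want the exact layout from the question (a colname column plus b0to70-style headers), a sketch along these lines should work; the bin labels are taken from the desired output:
# Label the bins, count per column, then transpose so each
# source column becomes a row of the result.
bins = [0, 70, 80, 90, 100]
labels = ['b0to70', 'b71to80', 'b81to90', 'b91to100']
counts = df[['one', 'two', 'three']].apply(
    lambda s: pd.cut(s, bins=bins, labels=labels).value_counts().sort_index())
out = counts.T.reset_index().rename(columns={'index': 'colname'})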

Related

Counting the cells in a row (across multiple columns) that are within x value of first column in Pandas

I have the following dataframe and I'm trying to determine how many of the column values in each row are within 12 of the max value found in the first four columns.
import pandas as pd
df = pd.DataFrame({
    't1': [0, 0, 40, 37, 143],
    't2': [0, 38, 149, 145, 151],
    't3': [0, 140, 100, 37, 150],
    't4': [0, 0, 23, 0, 19],
    'other': ['str1', 'str2', 'str3', 'str4', 'NaN'],
    'age': [21, 29, 57, 48, 37],
    'new_max': [0, 140, 149, 145, 151]})
I want to check columns 't1' through 't4' to see if they are within 12 of the maximum value contained in those four columns for that row.
The desired output adds a 'w12_count' column for each row, like this:
df = pd.DataFrame({
    't1': [0, 0, 40, 37, 143],
    't2': [0, 38, 149, 145, 151],
    't3': [0, 140, 100, 37, 150],
    't4': [0, 0, 23, 0, 19],
    'other': ['str1', 'str2', 'str3', 'str4', 'NaN'],
    'age': [21, 29, 57, 48, 37],
    'new_max': [0, 140, 149, 145, 151],
    'w12_count': [4, 1, 1, 1, 3]})
I know I could use .loc to create a new 0-or-1 column for each column I'm checking and then sum those new columns to get the count. But my data actually has a lot of columns, so I'm trying to find syntax that totals the number of columns within 12 directly and assigns a new column with the count.
We can filter the t-like columns, take the max along axis=1 of these columns, subtract that max from the columns to get the difference, compare the absolute value of the difference with 12 to create a boolean mask, and finally sum along axis=1 to get the counts:
t = df.filter(regex=r't\d+')  # just the t1..t4 columns
df['w12_count'] = t.sub(t.max(1), axis=0).abs().le(12).sum(1)
    t1   t2   t3  t4 other  age  new_max  w12_count
0    0    0    0   0  str1   21        0          4
1    0   38  140   0  str2   29      140          1
2   40  149  100  23  str3   57      149          1
3   37  145   37   0  str4   48      145          1
4  143  151  150  19   NaN   37      151          3
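For very wide frames, a roughly equivalent pure-NumPy version may be a bit faster. This is a sketch, assuming the t columns are all numeric; abs() is unnecessary here because no value exceeds its own row maximum:
import numpy as np

vals = df.filter(regex=r't\d+').to_numpy()   # shape (n_rows, n_t_columns)
row_max = vals.max(axis=1, keepdims=True)    # per-row maximum
df['w12_count'] = ((row_max - vals) <= 12).sum(axis=1)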

Avoid for-loop to split array into multiple arrays by index values using numpy

Input: There are two input arrays:
value_array = [56, 10, 65, 37, 29, 14, 97, 46]
index_array = [ 0, 0, 1, 0, 3, 0, 1, 1]
Output: I want to split value_array using index_array without using for-loop. So the output array will be:
split_array = [[56, 10, 37, 14],  # index 0
               [65, 97, 46],      # index 1
               [],                # index 2
               [29]]              # index 3
Is there any way to do that using numpy without any for-loop? I have looked at numpy.where but cannot figure out how to do it.
For-loop: Here is the way to do that using for-loop. I want to avoid for-loop.
split_array = []
for i in range(max(index_array) + 1):
    split_array.append([])
for i in range(len(value_array)):
    split_array[index_array[i]].append(value_array[i])
Does this suffice?
Solution 1 (note: the for loop runs over the range of index values, not over the entire array)
import numpy as np
value_array = np.array([56, 10, 65, 37, 29, 14, 97, 46])
index_array = np.array([ 0, 0, 1, 0, 3, 0, 1, 1])
max_idx = np.max(index_array)
split_array = []
for idx in range(max_idx + 1):
    split_array.append([])
    split_array[-1].extend(list(value_array[np.where(index_array == idx)]))
print(split_array)
[[56, 10, 37, 14], [65, 97, 46], [], [29]]
Solution 2 (note: empty indices are dropped, since splits only occur where the sorted index array changes, so index 2 yields no group)
import numpy as np
value_array = np.array([56, 10, 65, 37, 29, 14, 97, 46])
index_array = np.array([ 0, 0, 1, 0, 3, 0, 1, 1])
value_array = value_array[index_array.argsort()]
split_idxs = np.squeeze(np.argwhere(np.diff(np.sort(index_array)) != 0) + 1)
print(np.array_split(value_array, split_idxs))
[array([56, 10, 37, 14]), array([65, 97, 46]), array([29])]
Indeed, you can do it with NumPy arrays:
import numpy as np
value_array = np.array(value_array)
index_array = np.array(index_array)
split_array = [value_array[np.where(index_array == j)[0]] for j in set(index_array)]
Note that indices absent from index_array (like 2 here) won't get an empty list with this approach.
You could do:
import numpy as np
value_array = np.array([56, 10, 65, 37, 29, 14, 97, 46])
index_array = np.array([ 0, 0, 1, 0, 3, 0, 1, 1])
# find the unique values in index array and the corresponding counts
unique, counts = np.unique(index_array, return_counts=True)
# create an array with 0 for the missing indices
zeros = np.zeros(index_array.max() + 1, dtype=np.int32)
zeros[unique] = counts # zeros = [4 3 0 1] 0 -> 4, 1 -> 3, 2 -> 0, 3 -> 1
# group by index array
so = value_array[np.argsort(index_array)] # so = [56 10 37 14 65 97 46 29]
# finally split using the counts
res = np.split(so, zeros.cumsum()[:-1])
print(res)
Output
[array([56, 10, 37, 14]), array([65, 97, 46]), array([], dtype=int64), array([29])]
The time complexity of this approach is O(N log N).
Additionally if you don't care about the missing indices, you could use the following:
_, counts = np.unique(index_array, return_counts=True)
res = np.split(value_array[np.argsort(index_array)], counts.cumsum()[:-1])
print(res)
Output
[array([56, 10, 37, 14]), array([65, 97, 46]), array([29])]
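One caveat with the argsort-based solutions above: NumPy's default quicksort is not stable, so the original order of values within each group is not strictly guaranteed. Passing kind='stable' pins it down; a sketch reusing the variables from the last snippet:
order = np.argsort(index_array, kind='stable')  # stable sort keeps within-group order
res = np.split(value_array[order], counts.cumsum()[:-1])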

Python - How to split an array based on the first column?

I have the fake data below. After reading it into an array it will have shape (8, 3). Now I want to split the data based on the first column (ID) and return a list of arrays whose shapes will be [(3, 3), (2, 3), (3, 3)]. I think np.split could do the job by assigning a 1-D array to the indices_or_sections argument, but is there any more convenient way to do this?
1 700 35
1 700 35
1 700 35
2 680 25
2 680 25
3 750 40
3 750 40
3 750 40
You can achieve this by using a combination of np.split, sort, np.unique and np.cumsum.
>>> a = [[1, 700, 35],
...      [1, 700, 35],
...      [1, 700, 35],
...      [2, 680, 25],
...      [2, 680, 25],
...      [3, 750, 40],
...      [3, 750, 40],
...      [3, 750, 40]]
>>> a = np.array(a)
>>> # sort the array by the first column
>>> a = a[a[:, 0].argsort()]
>>> np.split(a, np.cumsum(np.unique(a[:, 0], return_counts=True)[1])[:-1])
[array([[  1, 700,  35],
        [  1, 700,  35],
        [  1, 700,  35]]),
 array([[  2, 680,  25],
        [  2, 680,  25]]),
 array([[  3, 750,  40],
        [  3, 750,  40],
        [  3, 750,  40]])]
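If the data is already sorted by the first column (as it is here), a slightly shorter sketch uses the position of each ID's first occurrence instead of the counts:
# return_index gives the first occurrence of each unique ID;
# splitting at all but the first of those positions yields the groups.
_, first_idx = np.unique(a[:, 0], return_index=True)
parts = np.split(a, first_idx[1:])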

How to merge two lists into dictionary without using nested for loop

I have two lists:
a = [0, 0, 0, 1, 1, 1, 1, 1, .... 99999]
b = [24, 53, 88, 32, 45, 24, 88, 53, ...... 1]
I want to merge those two lists into a dictionary like:
{
    0: [24, 53, 88],
    1: [32, 45, 24, 88, 53],
    ......
    99999: [1]
}
A solution might be using a for loop, which does not look good or elegant, like:
d = {}
for i in range(len(list_a)):
    if list_a[i] in d:
        d[list_a[i]].append(list_b[i])
    else:
        d[list_a[i]] = [list_b[i]]
Though this works, it is inefficient and would take too much time when the lists are extremely large. I want to know more elegant ways to construct such a dictionary.
Thanks in advance!
You can use a defaultdict:
from collections import defaultdict

d = defaultdict(list)
list_a = [0, 0, 0, 1, 1, 1, 1, 1, 9999]
list_b = [24, 53, 88, 32, 45, 24, 88, 53, 1]
for a, b in zip(list_a, list_b):
    d[a].append(b)
print(dict(d))
Output:
{0: [24, 53, 88], 1: [32, 45, 24, 88, 53], 9999: [1]}
Alternative itertools.groupby() solution:
import itertools
a = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3]
b = [24, 53, 88, 32, 45, 24, 88, 53, 11, 22, 33, 44, 55, 66, 77]
result = {k: [i[1] for i in g]
          for k, g in itertools.groupby(sorted(zip(a, b)), key=lambda x: x[0])}
print(result)
print(result)
The output:
{0: [24, 53, 88], 1: [24, 32, 45, 53, 88], 2: [11, 22, 33, 44, 55, 66], 3: [77]}
Note that sorted(zip(a, b)) also sorts the values within each group (1 maps to [24, 32, 45, 53, 88] rather than the original order); sort on the key alone with sorted(zip(a, b), key=lambda x: x[0]) to preserve the original value order.
No fancy structures, just a plain ol' dictionary.
d = {}
for x, y in zip(a, b):
    d.setdefault(x, []).append(y)
setdefault returns the existing list for x, or inserts and returns a new empty one, so the append always has a target; it avoids the import at the cost of building a throwaway list on every iteration.
You can do this with a dict comprehension:
list_a = [0, 0, 0, 1, 1, 1, 1, 1]
list_b = [24, 53, 88, 32, 45, 24, 88, 53]
my_dict = {key: [] for key in set(list_a)}  # my_dict = {0: [], 1: []}
for a, b in zip(list_a, list_b):
    my_dict[a].append(b)
# {0: [24, 53, 88], 1: [32, 45, 24, 88, 53]}
Oddly enough, you cannot make this work using dict.fromkeys(set(list_a), []), because fromkeys binds the same list object to every key:
my_dict = dict.fromkeys(set(list_a), [])  # my_dict = {0: [], 1: []}
my_dict[0].append(1)                      # my_dict = {0: [1], 1: [1]}
A pandas solution:
Setup:
import numpy as np
import pandas as pd
a = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4]
b = np.random.randint(0, 100, len(a)).tolist()  # pd.np is deprecated, so use numpy directly
>>> b
Out[]: [28, 68, 71, 25, 25, 79, 30, 50, 17, 1, 35, 23, 52, 87, 21]
df = pd.DataFrame(columns=['Group', 'Value'], data=list(zip(a, b)))  # create a DataFrame
>>> df
Out[]:
    Group  Value
0       0     28
1       0     68
2       0     71
3       1     25
4       1     25
5       1     79
6       1     30
7       1     50
8       2     17
9       2      1
10      2     35
11      3     23
12      4     52
13      4     87
14      4     21
Solution:
>>> df.groupby('Group').Value.apply(list).to_dict()
Out[]:
{0: [28, 68, 71],
1: [25, 25, 79, 30, 50],
2: [17, 1, 35],
3: [23],
4: [52, 87, 21]}
Walkthrough:
create a pd.DataFrame from the input lists; a is called Group and b is called Value
df.groupby('Group') creates groups based on a
.Value.apply(list) gets the values for each group and casts them to a list
.to_dict() converts the resulting Series to a dict
Timing:
To get an idea of timings for a test set of 1,000,000 values in 100,000 groups:
a = sorted(np.random.randint(0, 100000, 1000000).tolist())
b = np.random.randint(0, 100, len(a)).tolist()
df = pd.DataFrame(columns=['Group', 'Value'], data=list(zip(a, b)))
>>> df.shape
Out[]: (1000000, 2)
%timeit df.groupby('Group').Value.apply(list).to_dict()
4.13 s ± 9.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But to be honest it is likely less efficient than itertools.groupby suggested by @RomanPerekhrest, or defaultdict suggested by @Ajax1234.
Maybe I'm missing the point, but at least I will try to help. If you have two lists and want to put them in a dict, do the following:
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
lists = [a, b]  # or directly -> lists = [[1, 2, 3, 4], [5, 6, 7, 8]]
new_dict = {}
for idx, sublist in enumerate([a, b]):  # or enumerate(lists)
    new_dict[idx] = sublist
Hope it helps.
Or build the dictionary with a comprehension beforehand: since every key is then present with an empty list as its value, you can iterate through the zip of the two lists and append each value from the second list under the key given by the first list's value. No try/except clause (or if statements) is needed to check whether a key exists, because of the upfront comprehension (using list_a and list_b as defined above):
d = {k: [] for k in list_a}
for x, y in zip(list_a, list_b):
    d[x].append(y)
Now:
print(d)
Is:
{0: [24, 53, 88], 1: [32, 45, 24, 88, 53], 9999: [1]}

loop for computing average of selected data in dataframe using pandas

I have a 3 row x 96 column dataframe. I'm trying to compute the average of the two rows beneath the header (rows 0 and 1) for every 12 data points. Here is my dataframe:
Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 \
0 1461274.92 1458079.44 1456807.1 1459216.08 1458643.24 1457145.19
1 478167.44 479528.72 480316.08 475569.52 472989.01 476054.89
2 ------ ------ ------ ------ ------ ------
Run 7 Run 8 Run 9 Run 10 ... Run 87 \
0 1458117.08 1455184.82 1455768.69 1454738.07 ... 1441822.45
1 473630.89 476282.93 475530.87 474200.22 ... 468525.2
2 ------ ------ ------ ------ ... ------
Run 88 Run 89 Run 90 Run 91 Run 92 Run 93 \
0 1445339.53 1461050.97 1446849.43 1438870.43 1431275.76 1430781.28
1 460076.8 473263.06 455885.07 475245.64 483875.35 487065.25
2 ------ ------ ------ ------ ------ ------
Run 94 Run 95 Run 96
0 1436007.32 1435238.23 1444300.51
1 474328.87 475789.12 458681.11
2 ------ ------ ------
[3 rows x 96 columns]
Currently I am trying to use df.irow(0) to select all the data in row index 0.
Something along the lines of:
selection = np.arange(0, 13)
for i in selection:
    new_df = pd.DataFrame()
    data = df.irow(0)
    ........
then I get lost.
I just don't know how to link this range with the dataframe in order to compute the mean for every 12 data points in each column.
To summarize, I want the average for every 12 runs in each column, so I should end up with a separate dataframe of 2 x 8 average values (96/12).
any ideas?
thanks.
You can do a groupby on axis=1 (using some dummy data I made up):
>>> h = df.iloc[:2].astype(float)
>>> h.groupby(np.arange(len(h.columns))//12, axis=1).mean()
0 1 2 3 4 5 6 7
0 0.609643 0.452047 0.536786 0.377845 0.544321 0.214615 0.541185 0.544462
1 0.382945 0.596034 0.659157 0.437576 0.490161 0.435382 0.476376 0.423039
First we extract the data and coerce it to float (the presence of the ------ row means you've probably got an object dtype, which will make mean unhappy).
Then we make an array saying what groups we want to put the different columns in:
>>> np.arange(len(df.columns))//12
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7], dtype=int32)
which we feed as an argument to groupby. .mean() handles the rest.
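Note that groupby(..., axis=1) is deprecated in recent pandas versions. An equivalent NumPy reshape produces the same consecutive groups of 12; a sketch, assuming the column count is an exact multiple of 12:
vals = h.to_numpy()                                 # shape (2, 96)
means = vals.reshape(len(h), -1, 12).mean(axis=2)   # shape (2, 8): one mean per 12-column block
result = pd.DataFrame(means, index=h.index)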
It's always best to try to use pandas methods when you can, rather than iterating over the rows. The DataFrame's iloc method is useful for extracting any number of rows.
The following example shows you how to do what you want in a two-column DataFrame. The same technique will work independent of the number of columns:
In [14]: df = pd.DataFrame({"x": [1, 2, "-"], "y": [3, 4, "-"]})
In [15]: df
Out[15]:
x y
0 1 3
1 2 4
2 - -
In [16]: df.iloc[2] = df.iloc[0:2].sum()
In [17]: df
Out[17]:
x y
0 1 3
1 2 4
2 3 7
However, in your case you want to sum groups of cells in df.iloc[2], so you might be better off simply taking the result of the summing expression with the statement
ds = df.iloc[0:2].sum()
which with your data will have the form
col1      0
col2      1
col3      2
col4      3
...
col93    92
col94    93
col95    94
col96    95
(These numbers are placeholders; you will obviously see your own column sums.) You can then turn this into a 12x8 matrix with
ds.values.reshape(12, 8)
whose value is
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47],
[48, 49, 50, 51, 52, 53, 54, 55],
[56, 57, 58, 59, 60, 61, 62, 63],
[64, 65, 66, 67, 68, 69, 70, 71],
[72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83, 84, 85, 86, 87],
[88, 89, 90, 91, 92, 93, 94, 95]])
but summing this array will give you the sum of all elements, so instead create another DataFrame with
rs = pd.DataFrame(ds.values.reshape(12, 8))
and then sum that:
rs.sum()
giving
0 528
1 540
2 552
3 564
4 576
5 588
6 600
7 612
dtype: int64
You may find in practice that it is easier to simply create two 12x8 matrices in the first place, which you can add together before creating a dataframe which you can then sum. Much depends on how you are reading your data.
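One thing to note: reshape(12, 8) followed by rs.sum() interleaves the columns, so the first of the eight sums covers col1, col9, col17, and so on. If each sum should instead cover twelve consecutive columns, a sketch reshaping to (8, 12) and summing along axis=1:
rs = pd.DataFrame(ds.values.reshape(8, 12))
consecutive_sums = rs.sum(axis=1)  # eight sums, each over 12 consecutive columns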
