Average using grouping value in another vector (numpy / Python)

I'd like to take the average of one vector based on grouping information in another vector. The two vectors are the same length. I've created a minimal example below based on averaging predictions for each user. How do I do that in NumPy?
>>> pred
[ 0.99 0.23 0.11 0.64 0.45 0.55 0.76 0.72 0.97 ]
>>> users
['User2' 'User3' 'User2' 'User3' 'User0' 'User1' 'User4' 'User4' 'User4']

A 'pure numpy' solution might use a combination of np.unique and np.bincount:
import numpy as np
pred = [0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97]
users = ['User2', 'User3', 'User2', 'User3', 'User0', 'User1', 'User4',
         'User4', 'User4']
# assign integer indices to each unique user name, and get the total
# number of occurrences for each name
unames, idx, counts = np.unique(users, return_inverse=True, return_counts=True)
# now sum the values of pred corresponding to each index value
sum_pred = np.bincount(idx, weights=pred)
# finally, divide by the number of occurrences for each user name
mean_pred = sum_pred / counts
print(unames)
# ['User0' 'User1' 'User2' 'User3' 'User4']
print(mean_pred)
# [ 0.45        0.55        0.55        0.435       0.81666667]
If you have pandas installed, DataFrames have some very nice methods for grouping and summarizing data:
import pandas as pd
df = pd.DataFrame({'name':users, 'pred':pred})
print(df.groupby('name').mean())
# pred
# name
# User0 0.450000
# User1 0.550000
# User2 0.550000
# User3 0.435000
# User4 0.816667
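If you need the labels and means back as plain arrays for further NumPy work, the grouped result converts easily (a small usage sketch using standard pandas methods):
grouped = df.groupby('name')['pred'].mean()
unames = grouped.index.to_numpy()  # unique user names
mean_pred = grouped.to_numpy()     # per-user means, aligned with unames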

If you want to stick to numpy, the simplest is to use np.unique and np.bincount:
>>> pred = np.array([0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97])
>>> users = np.array(['User2', 'User3', 'User2', 'User3', 'User0', 'User1',
... 'User4', 'User4', 'User4'])
>>> unq, idx, cnt = np.unique(users, return_inverse=True, return_counts=True)
>>> avg = np.bincount(idx, weights=pred) / cnt
>>> unq
array(['User0', 'User1', 'User2', 'User3', 'User4'],
      dtype='|S5')
>>> avg
array([ 0.45      ,  0.55      ,  0.55      ,  0.435     ,  0.81666667])

A compact solution is to use numpy_indexed (disclaimer: I am its author), which implements a solution similar to the vectorized one proposed by Jaime, but with a cleaner interface and more tests:
import numpy_indexed as npi
npi.group_by(users).mean(pred)
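As far as I know, the group_by reduction returns the unique keys together with the per-group values, so both can be unpacked in one call (a sketch based on the numpy_indexed README; double-check against your installed version):
# returns the unique user names and the mean of pred within each group
unique_users, mean_pred = npi.group_by(users).mean(pred)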

seaborn: barplot of a dataframe by group

I am having difficulty with this. I have the results from my initial model (`Unfiltered`), which I plot like so:
df = pd.DataFrame(
    {'class': ['foot', 'bike', 'bus', 'car', 'metro'],
     'Precision': [0.7, 0.66, 0.41, 0.61, 0.11],
     'Recall': [0.58, 0.35, 0.13, 0.89, 0.02],
     'F1-score': [0.64, 0.45, 0.2, 0.72, 0.04]}
)
groups = df.melt(id_vars=['class'], var_name=['Metric'])
sns.barplot(data=groups, x='class', y='value', hue='Metric')
To produce this nice plot:
Now I have obtained a second set of results from my improved model (Filtered), so I added a column (status) to my df to indicate which model each result comes from, like this:
df2 = pd.DataFrame(
    {'class': ['foot', 'foot', 'bike', 'bike', 'bus', 'bus',
               'car', 'car', 'metro', 'metro'],
     'Precision': [0.7, 0.62, 0.66, 0.96, 0.41, 0.42, 0.61, 0.75, 0.11, 0.3],
     'Recall': [0.58, 0.93, 0.35, 0.4, 0.13, 0.1, 0.89, 0.86, 0.02, 0.01],
     'F1-score': [0.64, 0.74, 0.45, 0.56, 0.2, 0.17, 0.72, 0.8, 0.04, 0.01],
     'status': ['Unfiltered', 'Filtered', 'Unfiltered', 'Filtered', 'Unfiltered',
                'Filtered', 'Unfiltered', 'Filtered', 'Unfiltered', 'Filtered']}
)
df2.head()
  class  Precision  Recall  F1-score      status
0  foot       0.70    0.58      0.64  Unfiltered
1  foot       0.62    0.93      0.74    Filtered
2  bike       0.66    0.35      0.45  Unfiltered
3  bike       0.96    0.40      0.56    Filtered
4   bus       0.41    0.13      0.20  Unfiltered
And I want to plot this with the same grouping as above (i.e. foot, bike, bus, car, metro), but with the two models' values side by side within each metric. Take the foot group, for example: I would have two bars for Precision [Unfiltered, Filtered], then two bars for Recall [Unfiltered, Filtered], and two bars for F1-score [Unfiltered, Filtered]. Likewise for all the other groups.
My attempt:
group2 = df2.melt(id_vars=['class', 'status'], var_name=['Metric'])
sns.barplot(data=group2, x='class', y='value', hue='Metric')
Totally not what I want.
You can pass any sequence to hue, as long as it has the same length as your data, and assign colours through it.
So you could try with
group2 = df2.melt(id_vars=['class', 'status'], var_name=['Metric'])
sns.barplot(data=group2, x='class', y='value', hue=group2[['Metric','status']].agg(tuple, axis=1))
plt.legend(fontsize=7)
But the result is a bit hard to read:
Seaborn grouped barplots don't allow for multiple grouping variables. One workaround is to recode the two grouping variables (Metric and status) as one variable with 6 levels. Another possibility is to use facets (a sketch follows the plotnine example below). If you are open to another plotting package, I might recommend plotnine, which allows multiple grouping variables as follows:
import plotnine as p9
fig = (
    p9.ggplot(group2)
    + p9.geom_col(
        p9.aes(x="class", y="value", fill="Metric", color="Metric", alpha="status"),
        position=p9.position_dodge(1),
        size=1,
        width=0.5,
    )
    + p9.scale_color_manual(("red", "blue", "green"))
    + p9.scale_fill_manual(("red", "blue", "green"))
)
fig.draw()
This generates the following image:
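For the facet route mentioned above, a minimal sketch using seaborn's figure-level catplot (same group2 frame as before; legend placement and panel sizing may need tweaking):
# one panel per status; within each panel, bars grouped by class and coloured by Metric
g = sns.catplot(data=group2, x='class', y='value',
                hue='Metric', col='status', kind='bar')
This keeps each model's metrics readable, at the cost of splitting the two models into separate panels.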

Plotting arrays with different lengths in seaborn

I have a dataframe that I would like to make a strip plot out of; it consists of the following:
   Symbol  Avg.Sentiment  Weighted  Mentions                                          Sentiment
0     AMC           0.14      0.80       557  [-0.38, -0.48, -0.27, -0.42, 0.8, -0.8, 0.13, ...
2     GME           0.15      0.26       175  [-0.27, 0.13, -0.53, 0.65, -0.91, 0.66, 0.67, ...
1      BB           0.23      0.29       126  [-0.27, 0.34, 0.8, -0.14, -0.39, 0.4, 0.34, -0...
11    SPY          -0.06     -0.03        43  [0.32, -0.38, -0.54, 0.36, -0.18, 0.18, -0.33,...
4    SPCE           0.26      0.09        35  [0.65, 0.57, 0.74, 0.48, -0.54, -0.15, -0.3, -...
13     AH           0.06      0.02        33  [0.62, 0.66, -0.18, -0.62, 0.12, -0.42, -0.59,...
12   PLTR           0.16      0.05        29  [0.66, 0.36, 0.64, 0.59, -0.42, 0.65, 0.15, -0...
15   TSLA           0.13      0.03        24  [0.1, 0.38, 0.64, 0.42, -0.32, 0.32, 0.44, -0....
and so on. The number of elements in each 'Sentiment' list is the same as the number of mentions. I would like to make a strip plot with Symbol as the x axis and Sentiment as the y axis. I believe the problem I'm encountering is caused by the different list lengths; the actual error I'm getting is
ValueError: setting an array element with a sequence.
The code that I'm trying to use to create the strip plot is this:
def symbolSentimentVisualization(dataset):
    sns.stripplot(x='Symbol', y='Sentiment', data=dataset.loc[:9])
    plt.show()
The other part of my issue, I would guess, has something to do with numpy trying to build a multidimensional array out of lists with different lengths before the data reaches seaborn, but I'm not 100% sure about that. If the solution is to plot one row at a time and then merge the plots, that would definitely work, but I'm not sure exactly what I should call to do that, because trying it out with the following doesn't seem to work either:
def symbolSentimentVisualization(dataset):
    sns.stripplot(x=dataset['Symbol'][0], y=dataset['Sentiment'][0], data=dataset.loc[:9])
    plt.show()
IIUC, explode 'Sentiment' first, then plot:
df = df.explode('Sentiment')
ax = sns.stripplot(x="Symbol", y="Sentiment", data=df)
Sample Data:
np.random.seed(5)
df = pd.DataFrame({
    'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
    'Mentions': [557, 175, 126, 43, 35]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
  Symbol  Mentions                                          Sentiment
0    AMC       557  [-0.556013657820521, 0.7414646123547528, -0.58...
1    GME       175  [-0.5673003921341209, -0.6504850189478857, 0.1...
2     BB       126  [0.7771316020052821, 0.26579994709269994, -0.4...
3    SPY        43  [-0.5966607678089173, -0.4473484233894889, 0.7...
4   SPCE        35  [0.7934741289205556, 0.17613102678923398, 0.58...
Resulting Graph:
Complete Working Example with Sample Data:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
np.random.seed(5)
df = pd.DataFrame({
    'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
    'Mentions': [557, 175, 126, 43, 35]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
df = df.explode('Sentiment')
ax = sns.stripplot(x="Symbol", y="Sentiment", data=df)
plt.show()
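One caveat (worth checking against your pandas and seaborn versions, so treat this as an assumption): explode leaves the 'Sentiment' column with object dtype, and some seaborn versions will not treat it as numeric. An explicit cast is a cheap safeguard:
df = df.explode('Sentiment')
df['Sentiment'] = df['Sentiment'].astype(float)  # ensure a numeric dtype for plotting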

How to set precision on column names made by np.arange()?

I made a dataframe and set its column names using np.arange(). However, instead of exact numbers it (sometimes) sets them to numbers like 0.30000000000000004.
I tried both rounding the entire dataframe and using np.around() on the np.arange() output, but neither seems to work.
I also tried to add these at the top:
np.set_printoptions(suppress=True)
np.set_printoptions(precision=3)
Here is the return statement of my function:
stepT = 0.1
# net is some numpy array
return pd.DataFrame(net, columns=np.arange(0, 1+stepT, stepT),
                    index=np.around(np.arange(0, 1+stepS, stepS), decimals=3)).round(3)
Is there any function that will allow me to have these names as numbers with only one digit after the decimal point?
The apparent imprecision of floating point numbers comes up often.
In [689]: np.arange(0,1+stepT, stepT)
Out[689]: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
In [690]: _.tolist()
Out[690]:
[0.0,
 0.1,
 0.2,
 0.30000000000000004,
 0.4,
 0.5,
 0.6000000000000001,
 0.7000000000000001,
 0.8,
 0.9,
 1.0]
In [691]: _689[3]
Out[691]: 0.30000000000000004
The numpy print options control how arrays are displayed, but they have no effect when individual values are printed.
When I make a dataframe with this column specification I get a nice display. (_689 is ipython shorthand for the Out[689] array.) It is using the array formatting:
In [699]: df = pd.DataFrame(np.arange(11)[None,:], columns=_689)
In [700]: df
Out[700]:
   0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
0    0    1    2    3    4    5    6    7    8    9   10
In [701]: df.columns
Out[701]:
Float64Index([                0.0,                 0.1,                 0.2,
              0.30000000000000004,                 0.4,                 0.5,
               0.6000000000000001,  0.7000000000000001,                 0.8,
                              0.9,                 1.0],
             dtype='float64')
But selecting columns with floats like this is tricky. Some work, some don't.
In [705]: df[0.4]
Out[705]:
0 4
Name: 0.4, dtype: int64
In [707]: df[0.3]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Looks like it's doing some sort of dictionary lookup. Floats don't work well for that, because of their inherent imprecision.
Doing an equality test on the arange:
In [710]: _689[3]==0.3
Out[710]: False
In [711]: _689[4]==0.4
Out[711]: True
I think you should create a list of properly formatted strings from the arange, and use that as column headers, not the floats themselves.
For example:
In [714]: alist = ['%.3f'%i for i in _689]
In [715]: alist
Out[715]:
['0.000',
 '0.100',
 '0.200',
 '0.300',
 '0.400',
 '0.500',
 '0.600',
 '0.700',
 '0.800',
 '0.900',
 '1.000']
In [716]: df = pd.DataFrame(np.arange(11)[None,:], columns=alist)
In [717]: df
Out[717]:
   0.000  0.100  0.200  0.300  0.400  0.500  0.600  0.700  0.800  0.900  1.000
0      0      1      2      3      4      5      6      7      8      9     10
In [718]: df.columns
Out[718]:
Index(['0.000', '0.100', '0.200', '0.300', '0.400', '0.500', '0.600', '0.700',
       '0.800', '0.900', '1.000'],
      dtype='object')
In [719]: df['0.300']
Out[719]:
0 3
Name: 0.300, dtype: int64
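A hedged aside: if you later need the numbers back from the string labels (for arithmetic rather than lookup), the string Index converts cleanly back to floats:
vals = df.columns.astype(float)  # back to a float index for computation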

Regrid numpy array based on cell area

import numpy as np
from skimage.measure import block_reduce
arr = np.random.random((6, 6))
area_cell = np.random.random((6, 6))
block_reduce(arr, block_size=(2, 2), func=np.ma.mean)
I would like to regrid a numpy array arr from 6 x 6 to 3 x 3, and I am using the skimage function block_reduce for this.
However, block_reduce assumes each grid cell has the same size. How can I solve this problem when each grid cell has a different size? In this case the size of each grid cell is given by the numpy array area_cell.
-- EDIT:
An example:
arr
0.25 0.58 0.69 0.74
0.49 0.11 0.10 0.41
0.43 0.76 0.65 0.79
0.72 0.97 0.92 0.09
If all elements of area_cell were 1, and we were to convert 4 x 4 arr into 2 x 2, result would be:
0.36 0.48
0.72 0.61
However, if area_cell is as follows:
0.00 1.00 1.00 0.00
0.00 1.00 0.00 0.50
0.20 1.00 0.80 0.80
0.00 0.00 1.00 1.00
Then, result becomes:
0.17 0.22
0.21 0.54
It seems you are still reducing by blocks, but after scaling arr with area_cell. So, you just need to perform element-wise multiplication between these two arrays and use the same block_reduce code on that product array, like so -
block_reduce(arr*area_cell, block_size=(2, 2), func=np.ma.mean)
Alternatively, we can simply use np.mean after reshaping to a 4D version of the product array, like so -
m,n = arr.shape
out = (arr*area_cell).reshape(m//2,2,n//2,2).mean(axis=(1,3))
Sample run -
In [21]: arr
Out[21]:
array([[ 0.25,  0.58,  0.69,  0.74],
       [ 0.49,  0.11,  0.1 ,  0.41],
       [ 0.43,  0.76,  0.65,  0.79],
       [ 0.72,  0.97,  0.92,  0.09]])
In [22]: area_cell
Out[22]:
array([[ 0. ,  1. ,  1. ,  0. ],
       [ 0. ,  1. ,  0. ,  0.5],
       [ 0.2,  1. ,  0.8,  0.8],
       [ 0. ,  0. ,  1. ,  1. ]])
In [23]: block_reduce(arr*area_cell, block_size=(2, 2), func=np.ma.mean)
Out[23]:
array([[ 0.1725 ,  0.22375],
       [ 0.2115 ,  0.5405 ]])
In [24]: m,n = arr.shape
In [25]: (arr*area_cell).reshape(m//2,2,n//2,2).mean(axis=(1,3))
Out[25]:
array([[ 0.1725 ,  0.22375],
       [ 0.2115 ,  0.5405 ]])
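A side note, in case the actual goal is an area-weighted average rather than the block mean of arr*area_cell (an assumption about intent; the question's expected output matches the product-mean above): you would divide by the total area in each block instead of the block size, roughly like so -
# area-weighted block average: sum(arr*area) / sum(area) per 2x2 block
num = block_reduce(arr * area_cell, block_size=(2, 2), func=np.sum)
den = block_reduce(area_cell, block_size=(2, 2), func=np.sum)
weighted = num / den  # blocks whose total area is zero would need guarding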

Convert string with spaced out numbers into array

I'm working in Python and currently I have a list which looks like
['001 2.4600 0.46 2.36E+003 86.66 16.77 0.33 1.32E+003 74.41 17.61 0.40 2.21E+003 87.39 22.07',
'002 10.310 0.38 2.95E+002 76.88 4.53 0000 000000000 00000 0000 0.34 2.62E+002 97.36 4.41',
'003 74.840 0.63 5.07E+002 64.63 4.03 0.57 4.15E+002 61.96 3.99 0.63 5.43E+002 64.67 5.16',
...
and so on, with quite a few more elements. Each element of the list is a string containing various figures separated by spaces, i.e., as above, the first element has 001, 2.4600, 0.46 and so on.
The point is that I want to turn each element of the list into a row of an array. The aim is to have a large array giving me all the information that is currently just numbers separated by spaces inside strings in a list.
I'm sure I can use the built-in array module to do this but I just can't figure out how.
Any ideas? Hope the question is clear.
Assuming you want floats in the final list of lists, try this:
>>> data = ['001 2.4600 0.46 2.36E+003 86.66 16.77 0.33 1.32E+003 74.41 17.61 0.40 2.21E+003 87.39 22.07', '002 10.310 0.38 2.95E+002 76.88 4.53 0000 000000000 00000 0000 0.34 2.62E+002 97.36 4.41', '003 74.840 0.63 5.07E+002 64.63 4.03 0.57 4.15E+002 61.96 3.99 0.63 5.43E+002 64.67 5.16']
>>> [list(map(float, row.split())) for row in data]
[[1.0, 2.46, 0.46, 2360.0, 86.66, 16.77, 0.33, 1320.0, 74.41, 17.61, 0.4, 2210.0, 87.39, 22.07], [2.0, 10.31, 0.38, 295.0, 76.88, 4.53, 0.0, 0.0, 0.0, 0.0, 0.34, 262.0, 97.36, 4.41], [3.0, 74.84, 0.63, 507.0, 64.63, 4.03, 0.57, 415.0, 61.96, 3.99, 0.63, 543.0, 64.67, 5.16]]
map just says 'do this function (float()) on everything in this list (the result of split(), which is a list of strings)'. In Python 3 it returns an iterator, so we have to ask for the list() of it. It's often better to use a for loop or list comprehension instead of map, but in this case it's handy.
Your idea of using the array module is probably bogus, as an array.array object is, essentially, a list with constrained data type. You cannot use vectorized operations on them. Further, an array.array is a 1D object.
That said, you possibly want to use the numpy module, whose array object is a multidimensional array on which you can operate at your will.
# idiomatic manner of importing numpy
import numpy as np
data = ['1 2 3.', '4. 5 8']
arraydata = np.array([[float(n) for n in row.split()] for row in data])
print(arraydata)
# [[ 1.  2.  3.]
#  [ 4.  5.  8.]]
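As an aside (a sketch that assumes every row has the same number of fields): np.loadtxt accepts any iterable of strings, treating each one as a line of a text file, so it can parse such a list directly:
arr = np.loadtxt(data)  # same 2D float array as the comprehension above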
Hopefully I understood correctly
res = []
for row in my_list:
    res.append(list(map(float, row.split())))
This gives you a matrix of float values (split produces strings, so I added the conversion).
Assuming your data is stored in a list called data, you could use
data = [[float(el) for el in string.split()] for string in data]
(float rather than int, since the figures include decimals and scientific notation, and split() with no argument so runs of spaces don't produce empty fields).
