How to find averages for each bin of a column - python

I have a data frame below:
import pandas as pd
df = {'A':[1.06, 1.01, 0.99, 0.98, 1.05, 0.96], 'B':[2, 7, 22, 7, 15, 16]}
df = pd.DataFrame(df)
I want to find the mean of column B for each bin of column A.
For example, if I want to create bins of 0.02 starting from the minimum value in column A, the bins will look like this (inclusive):
1) 0.96-0.98
2) 0.99-1.01
3) 1.02-1.04
4) 1.05-1.07
The average of each bin will be
1) (16+7)/2 = 11.5
2) (7+22)/2 = 14.5
3) 0
4) (2+15)/2 = 8.5
Thus, the outcome will look like:
df = {'A':[1.06, 1.01, 0.99, 0.98, 1.05, 0.96], 'B':[2, 7, 22, 7, 15, 16], 'Avg':[8.5, 14.5, 14.5, 11.5, 8.5, 11.5]}
df = pd.DataFrame(df)

LOGIC
Create the bins, group the data by bin, and then apply transform on the groupby, which returns a result for each row rather than one row per group.
PS: pd.cut does the binning; this assumes you already have a way to derive the desired bin edges.
SOLUTION
import pandas as pd
df = {"A": [1.06, 1.01, 0.99, 0.98, 1.05, 0.96], "B": [2, 7, 22, 7, 15, 16]}
df = pd.DataFrame(df)
df["cat"] = pd.cut(df["A"], [0.95, 0.98, 1.01, 1.04, 1.07], right=True)
print(df)
OUTPUT with Bins
A B cat
0 1.06 2 (1.04, 1.07]
1 1.01 7 (0.98, 1.01]
2 0.99 22 (0.98, 1.01]
3 0.98 7 (0.95, 0.98]
4 1.05 15 (1.04, 1.07]
5 0.96 16 (0.95, 0.98]
df["Avg"] = df.groupby("cat")["B"].transform("mean")
# you can directly groupby pd.cut without making new column
print(df)
FINAL OUTPUT with average
A B cat Avg
0 1.06 2 (1.04, 1.07] 8.5
1 1.01 7 (0.98, 1.01] 14.5
2 0.99 22 (0.98, 1.01] 14.5
3 0.98 7 (0.95, 0.98] 11.5
4 1.05 15 (1.04, 1.07] 8.5
5 0.96 16 (0.95, 0.98] 11.5
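The answer above hard-codes the bin edges. As a minimal sketch, the edges could also be derived from the data, assuming the rule is "right-closed bins of width 0.03 whose first edge sits 0.01 below the minimum of A" (that rule is my reading of the example bins, not something the question states). It also groups directly on the pd.cut result, as the comment in the solution suggests:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.06, 1.01, 0.99, 0.98, 1.05, 0.96],
                   "B": [2, 7, 22, 7, 15, 16]})

# Assumed rule: right-closed bins of width 0.03, first edge 0.01 below min(A)
width = 0.03
start = df["A"].min() - 0.01
n_bins = int(np.ceil((df["A"].max() - start) / width))

# Round the edges so floating-point drift does not move 0.98 or 1.01 into the wrong bin
edges = np.round(start + width * np.arange(n_bins + 1), 2)

# Group directly on the pd.cut result; transform broadcasts each bin mean back to its rows
bins = pd.cut(df["A"], edges)
df["Avg"] = df.groupby(bins, observed=True)["B"].transform("mean")
print(df)
```

Because transform returns one value per row, empty bins such as (1.01, 1.04] never show up in the result; if they are needed, an aggregation per bin would be the place to look instead.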

Related

separating dataset into 2 groups (group 1: ID starting with u and group 2: ID starting with s)

ID A1 A2 Exam
0 u123456 10.00 0.00 21
1 s123457 6.80 9.40 30
2 u123458 13.35 20.00 25
3 u123459 0.00 10.15 24
4 u123460 4.50 8.09 21
5 u123461 5.50 13.30 14
6 u123462 20.00 12.75 16
7 s123463 20.00 17.50 22
8 u123464 11.75 17.30 31
9 s123465 0.00 12.65 15
The above is a sample of my dataset. I'm confused about how to make two datasets based on whether the ID starts with 'u' or 's'. I am new to coding, so sorry if this is a silly question.
You can group the DataFrame using a function that takes the first letter into account.
import pandas as pd
df = pd.DataFrame(
[
['s123457', 6.80, 9.40, 30],
['u123458', 13.35, 20.00, 25],
['u123459', 0.00, 10.15, 24],
['u123460', 4.50, 8.09, 21],
['u123461', 5.50, 13.30, 14],
['u123462', 20.00, 12.75, 16],
['s123463', 20.00, 17.50, 22],
['u123464', 11.75, 17.30, 31],
['s123465', 0.00, 12.65, 15]
],
columns=['ID', 'A1', 'A2', 'Exam']
)
# Group by the first letter of the ID column.
grouped = df.groupby(lambda index: df['ID'].loc[index][0])
# Output key and associated group, with the index of the group being reset.
for key, group in grouped:
    print(key)
    print(group.reset_index(drop=True))
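An alternative sketch that avoids the lambda: the `.str` accessor can test the first character directly, which gives either a boolean mask for two explicit frames or a grouping key. The variable names `group_u`, `group_s`, and `groups` are only illustrative, and only two columns of the sample are reproduced here:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["u123456", "s123457", "u123458", "u123459", "u123460",
           "u123461", "u123462", "s123463", "u123464", "s123465"],
    "Exam": [21, 30, 25, 24, 21, 14, 16, 22, 31, 15],
})

# Boolean mask: True where the ID starts with 'u'
is_u = df["ID"].str.startswith("u")
group_u = df[is_u].reset_index(drop=True)
group_s = df[~is_u].reset_index(drop=True)

# Equivalent grouping on the first character, collected into a dict of frames
groups = {key: g.reset_index(drop=True) for key, g in df.groupby(df["ID"].str[0])}

print(group_u)
print(group_s)
```

The mask version is convenient when there are exactly two prefixes; the dict version scales to any number of first letters.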

Plotting arrays with different lengths in seaborn

I have a dataframe that I would like to make a strip plot from. It contains the following:
Symbol Avg.Sentiment Weighted Mentions Sentiment
0 AMC 0.14 0.80 557 [-0.38, -0.48, -0.27, -0.42, 0.8, -0.8, 0.13, ...
2 GME 0.15 0.26 175 [-0.27, 0.13, -0.53, 0.65, -0.91, 0.66, 0.67, ...
1 BB 0.23 0.29 126 [-0.27, 0.34, 0.8, -0.14, -0.39, 0.4, 0.34, -0...
11 SPY -0.06 -0.03 43 [0.32, -0.38, -0.54, 0.36, -0.18, 0.18, -0.33,...
4 SPCE 0.26 0.09 35 [0.65, 0.57, 0.74, 0.48, -0.54, -0.15, -0.3, -...
13 AH 0.06 0.02 33 [0.62, 0.66, -0.18, -0.62, 0.12, -0.42, -0.59,...
12 PLTR 0.16 0.05 29 [0.66, 0.36, 0.64, 0.59, -0.42, 0.65, 0.15, -0...
15 TSLA 0.13 0.03 24 [0.1, 0.38, 0.64, 0.42, -0.32, 0.32, 0.44, -0....
and so on. The number of elements in each 'Sentiment' list equals the number of mentions. I would like to make a strip plot with Symbol on the x axis and sentiment on the y axis. I believe the problem I'm encountering is caused by the lists having different lengths. The actual error I'm getting is
ValueError: setting an array element with a sequence.
The code I'm using to create the strip plot is this:
def symbolSentimentVisualization(dataset):
    sns.stripplot(x='Symbol',y='Sentiment',data=dataset.loc[:9])
    plt.show()
I would guess the other part of my issue has something to do with numpy trying to build multidimensional arrays from lists of different lengths before they reach the seaborn plot, but I'm not 100% sure. If the solution is to plot one row at a time and then merge the plots, that would definitely work, but I'm not sure what exactly I should call to do that, because the following attempt doesn't work either.
def symbolSentimentVisualization(dataset):
    sns.stripplot(x=dataset['Symbol'][0],y=dataset['Sentiment'][0],data=dataset.loc[:9])
    plt.show()
IIUC explode 'Sentiment' first then plot:
df = df.explode('Sentiment')
ax = sns.stripplot(x="Symbol", y="Sentiment", data=df)
Sample Data:
np.random.seed(5)
df = pd.DataFrame({
'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
'Mentions': [557, 175, 126, 43, 35]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
Symbol Mentions Sentiment
0 AMC 557 [-0.556013657820521, 0.7414646123547528, -0.58...
1 GME 175 [-0.5673003921341209, -0.6504850189478857, 0.1...
2 BB 126 [0.7771316020052821, 0.26579994709269994, -0.4...
3 SPY 43 [-0.5966607678089173, -0.4473484233894889, 0.7...
4 SPCE 35 [0.7934741289205556, 0.17613102678923398, 0.58...
Resulting Graph: [figure: strip plot with Symbol on the x axis and Sentiment on the y axis]
Complete Working Example with Sample Data:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
np.random.seed(5)
df = pd.DataFrame({
'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
'Mentions': [557, 175, 126, 43, 35]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
df = df.explode('Sentiment')
ax = sns.stripplot(x="Symbol", y="Sentiment", data=df)
plt.show()
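To see what explode does to lists of different lengths without plotting anything, here is a small sketch with made-up sentiment values (not the asker's data). It also casts the exploded column to float, since explode leaves it with object dtype; whether that cast is required depends on the seaborn version, so treat it as a precaution rather than a necessity:

```python
import pandas as pd

# Hypothetical miniature of the question's frame: lists of unequal length
df = pd.DataFrame({
    "Symbol": ["AMC", "GME"],
    "Sentiment": [[-0.38, 0.8, 0.13], [-0.27, 0.65]],
})

# One row per list element; the Symbol value is repeated for each element
long_df = df.explode("Sentiment", ignore_index=True)

# explode keeps object dtype, so convert to a numeric column before plotting
long_df["Sentiment"] = long_df["Sentiment"].astype(float)

print(long_df)
print(long_df.dtypes)
```

The resulting long-format frame has one (Symbol, Sentiment) pair per row, which is the shape stripplot expects.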

Python: pandas.cut labels are ignored

I want to cut one column in my pandas.DataFrame using pandas.cut(), but the labels I put into labels argument are not applied. Let me show you an example.
I have got the following data frame:
>>> import pandas as pd
>>> df = pd.DataFrame({'x': [-0.009, 0.089, 0.095, 0.096, 0.198]})
>>> print(df)
x
0 -0.009
1 0.089
2 0.095
3 0.096
4 0.198
And I cut x column like this:
>>> bins = pd.IntervalIndex.from_tuples([(-0.2, -0.1), (-0.1, 0.0), (0.0, 0.1), (0.1, 0.2)])
>>> labels = [100, 200, 300, 400]
>>> df['x_cut'] = pd.cut(df['x'], bins, labels=labels)
>>> print(df)
x x_cut
0 -0.009 (-0.1, 0.0]
1 0.089 (0.0, 0.1]
2 0.095 (0.0, 0.1]
3 0.096 (0.0, 0.1]
4 0.198 (0.1, 0.2]
However, I expected the data frame to look like this:
id x x_cut
0 6 0.089 200
1 6 0.089 300
2 6 0.095 300
3 6 0.096 300
4 6 0.098 400
What am I missing? How can I get the data frame with correct labels?
This is pandas issue 21233: the labels argument is ignored when bins is an IntervalIndex.
What works for me, as #anky_91 commented, is mapping through a dictionary created by zip:
df['x_cut'] = pd.cut(df['x'], bins).map(dict(zip(bins, labels)))
print(df)
x x_cut
0 -0.009 200
1 0.089 300
2 0.095 300
3 0.096 300
4 0.198 400
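Another option, sketched here under the assumption that the intervals are contiguous (as they are in this question), is to pass plain bin edges instead of an IntervalIndex. With numeric edges, pd.cut does apply the labels:

```python
import pandas as pd

df = pd.DataFrame({"x": [-0.009, 0.089, 0.095, 0.096, 0.198]})

# Same four right-closed intervals as in the question, written as edges
edges = [-0.2, -0.1, 0.0, 0.1, 0.2]
labels = [100, 200, 300, 400]

# labels is honored here because bins is a list of edges, not an IntervalIndex
df["x_cut"] = pd.cut(df["x"], bins=edges, labels=labels)
print(df)
```

The result is a categorical column; if the intervals had gaps or overlaps, the IntervalIndex plus map approach above would still be needed.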

How to plot two colors in the one line by other columns value?

I have a dataframe like this:
df=pd.DataFrame([[1.65, -0.05, 0],
[1.68, -0.01, 0],
[1.70, 0.01, 1],
[1.67, -0.02, 1],
[1.73 , 0.05, 1],
[1.67 , 0.01, 1],
[ 1.67, -0.02, 1],
[1.70 , 0.03, 0],
[ 1.66, -0.01, 0],
[ 1.69 ,-0.01 , 0]
])
df.rename(columns={1: "diff", 2: "label"},inplace=True)
df['label']=df['label'].astype(str)
print(df)
0 diff label
0 1.65 -0.05 0
1 1.68 -0.01 0
2 1.70 0.01 1
3 1.67 -0.02 1
4 1.73 0.05 1
5 1.67 0.01 1
6 1.67 -0.02 1
7 1.70 0.03 0
8 1.66 -0.01 0
9 1.69 -0.01 0
I want to plot the first column and color it according to the 'label' column:
label=1 blue, label=0 red
That is, there are two colors in one line.
I use the following code to plot:
df.iloc[0:2,0].plot(y=df.columns[0],color='r', )
df.iloc[1:7,0].plot(y=df.columns[0],color='b' )
df.iloc[6:10,0].plot(y=df.columns[0],color='r' )
Is there a better way to plot this?
In fact, the real data has 10000 rows.
Essentially, you are trying to plot the values in column 0 against the index, with each point connected to the previous value in the dataset.
My proposed solution is to plot each datapoint individually:
# First, create a new column for color ('label' was cast to str above, so the keys are strings)
df['color'] = df['label'].map({'0': 'red', '1': 'blue'})
# Next, import & set up subplot
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1, figsize=(10,4))
# Iterate through rows
for idx, row in df[[0,'color']].iterrows():
    v, c = row
    # If you want a scatter plot
    ax.scatter(idx, v, color=c)
    if idx>0:
        # If you want a line plot
        ax.plot([idx-1,idx], [prev_v, v], color=c)
    # Set the previous value
    prev_v = v
# Add a legend
red_patch = mpatches.Patch(color='red', label='Losses')
blue_patch = mpatches.Patch(color='blue', label='Gains')
ax.legend(handles=[red_patch,blue_patch])
plt.show()
You can probably simplify it but as a general solution the following lines should help you grab all of the rows where label is 1 or 0:
# label == 1
df.iloc[df['label'].where(df['label'].astype(int) == 1).dropna().index].plot(y=df.columns[0], color='b')
# label == 0
df.iloc[df['label'].where(df['label'].astype(int) == 0).dropna().index].plot(y=df.columns[0], color='r')
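Since the real data has 10000 rows, one draw call per row may get slow. A possible sketch using matplotlib's LineCollection draws all segments in one call. It keeps the convention of the loop above (each segment takes the color of the row it ends at); the helper name `colored_segments` and the column name `value` are hypothetical:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

df = pd.DataFrame({
    "value": [1.65, 1.68, 1.70, 1.67, 1.73, 1.67, 1.67, 1.70, 1.66, 1.69],
    "label": ["0", "0", "1", "1", "1", "1", "1", "0", "0", "0"],
})

def colored_segments(values, labels, palette):
    """Return one segment per consecutive pair of points and its color.

    Each segment is colored by the label of the point it ends at.
    """
    x = np.arange(len(values))
    points = np.column_stack([x, values])                    # shape (n, 2)
    segments = np.stack([points[:-1], points[1:]], axis=1)   # shape (n-1, 2, 2)
    colors = [palette[label] for label in labels[1:]]
    return segments, colors

segments, colors = colored_segments(
    df["value"].to_numpy(), df["label"].tolist(), {"0": "red", "1": "blue"}
)

fig, ax = plt.subplots(figsize=(10, 4))
ax.add_collection(LineCollection(segments, colors=colors))
ax.autoscale()  # collections do not update the axis limits on their own
# Call plt.show() to display the figure when running this as a script
```

The segment array is built with vectorized numpy operations, so the cost of preparing 10000 rows is small compared with looping over iterrows.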

Frequency mean calculation for an arbitrary distribution in pandas

I have a large dataset with values ranging from 1 to 25 with a resolution of 0.1. The distribution is arbitrary in nature, with a mode value of 1. A sample of the dataset looks like this:
1,
1,
23.05,
19.57,
1,
1.56,
1,
23.53,
19.74,
7.07,
1,
22.85,
1,
1,
7.78,
16.89,
12.75,
15.32,
7.7,
14.26,
15.41,
1,
16.34,
8.57,
15,
14.97,
1.18,
14.15,
1.94,
14.61,
1,
15.49,
1,
9.18,
1.71,
1,
10.4,
How can I count the values in different ranges (0-0.5, 0.5-1, etc.) and find their frequency mean in pandas, Python?
For the values
1, 2.2, 2.8, 3.7, 5.5, 5.8, 4.3, 2.7, 3.5, 1.8, 5.9
the expected output can be
ranges(f)       occurrence(n)  f*n
1-2             2              3
2-3             3              7.5
3-4             2              7
4-5             1              4.5
5-6             3              16.5
sum             11             38.5
frequency mean                 3.5
You need cut for binning, then convert the CategoricalIndex to an IntervalIndex to get each bin's mid value, multiply the columns with mul, sum, and finally divide the two scalars:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col':[1,2.2,2.8,3.7,5.5,5.8,4.3,2.7,3.5,1.8,5.9]})
print (df)
col
0 1.0
1 2.2
2 2.8
3 3.7
4 5.5
5 5.8
6 4.3
7 2.7
8 3.5
9 1.8
10 5.9
binned = pd.cut(df['col'], np.arange(1, 7), include_lowest=True)
df1 = df.groupby(binned).size().reset_index(name='val')
df1['mid'] = pd.IntervalIndex(df1['col']).mid
df1['mul'] = df1['val'].mul(df1['mid'])
print (df1)
col val mid mul
0 (0.999, 2.0] 2 1.4995 2.999
1 (2.0, 3.0] 3 2.5000 7.500
2 (3.0, 4.0] 2 3.5000 7.000
3 (4.0, 5.0] 1 4.5000 4.500
4 (5.0, 6.0] 3 5.5000 16.500
# Sum only the numeric columns; the interval column 'col' cannot be summed
a = df1[['val', 'mid', 'mul']].sum()
print (a)
val 11.0000
mid 17.4995
mul 38.4990
dtype: float64
b = a['mul'] / a['val']
print (b)
3.49990909091
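The result above is 3.4999 rather than 3.5 because include_lowest=True widens the first bin to (0.999, 2.0], which shifts its midpoint to 1.4995. As a sketch of an alternative, the same weighted mean can be computed with np.histogram and np.average using exact midpoints. Note that np.histogram uses left-closed bins (only the last bin includes its right edge), whereas pd.cut defaults to right-closed bins; for this sample no value sits on an interior edge, so the counts agree:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2.2, 2.8, 3.7, 5.5, 5.8, 4.3, 2.7, 3.5, 1.8, 5.9])

edges = np.arange(1, 7)                   # bin edges 1, 2, ..., 6
counts, _ = np.histogram(s, bins=edges)   # occurrences per bin
mids = (edges[:-1] + edges[1:]) / 2       # bin midpoints 1.5, 2.5, ..., 5.5

# Frequency mean = sum(mid * count) / sum(count)
freq_mean = np.average(mids, weights=counts)

print(counts)
print(freq_mean)
```

For the full dataset, only `edges` would need to change, for example to np.arange(0, 25.5, 0.5) for the 0.5-wide ranges mentioned in the question.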
