Random Data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.DataFrame(np.random.normal(size=(20,4)))
# Rename the columns (set_axis no longer accepts inplace=True in pandas 2.0+)
data = data.set_axis(['Column A', 'Column B', 'Column C', 'Column D'], axis=1)
data
Column A Column B Column C Column D
0 -2.421964 1.053834 -0.522569 -0.820083
1 -0.334253 1.275719 0.590327 -1.100766
2 -3.410730 -0.738087 -1.619469 1.300860
3 1.808026 -0.490364 -0.433812 0.574514
4 -0.628401 -1.098690 -0.537222 0.601859
5 -0.953888 1.118034 -2.304954 -1.723802
6 0.856820 0.204850 -0.464042 -2.653982
7 -1.328367 1.057808 1.083722 0.120027
8 -1.150053 1.457841 -1.592256 -0.362547
9 -0.449330 0.714787 -0.134940 0.098445
10 -1.793606 0.858645 0.800018 0.191496
11 -0.967488 1.504187 -1.376536 0.251128
12 1.231656 0.984247 -0.975960 0.619155
13 -0.930387 -1.144732 -1.761642 1.983434
14 0.857593 0.580386 -0.119221 -0.513108
15 0.985186 -0.992795 2.154594 0.458575
16 -0.937518 0.548841 -0.536350 -0.925943
17 0.626230 -0.339900 0.027100 -0.209365
18 -0.538961 1.036132 -0.451085 -0.865581
19 0.078272 -0.773970 0.010077 -1.766130
data.boxplot(vert=False, figsize=(15,10), patch_artist=True)
I want to implement the following additions to my box plot (see the example picture below):
To the right of the box plot, show a particular value (4.10% and 3.52% in the picture) taken from the corresponding column, for example the last non-NaN value in each column.
To the left of the box plot, show the percentile (within its column) of that value. For example, within the first column ("12M EU HY Corp") 4.10% is the 86th percentile.
How can I recreate something like this for my plot? I'm mostly lost on how to insert the values to the right of the box plot; for the percentiles, I figured I could just concatenate a string representation of the percentile to the name of each column.
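One way to approach this, as a rough sketch using the random data above (the text offsets and label format are my own assumptions, and scipy.stats.percentileofscore is only one way to get the percentile): draw the box plot, write each column's last non-NaN value to the right of the axes with ax.text, and fold the percentile into the y tick labels, as you suggested.
from scipy import stats

ax = data.boxplot(vert=False, figsize=(15, 10), patch_artist=True)

labels = []
for i, col in enumerate(data.columns, start=1):
    value = data[col].dropna().iloc[-1]                       # last non-NaN value in the column
    pct = stats.percentileofscore(data[col].dropna(), value)  # its percentile within the column
    # value to the right of the plot area (x in axes coordinates, y in data coordinates)
    ax.text(1.01, i, f'{value:.2f}', transform=ax.get_yaxis_transform(), va='center')
    # fold the percentile into the column label shown on the left
    labels.append(f'{pct:.0f}th pct - {col}')

ax.set_yticklabels(labels)
plt.show()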
I have a dataframe with the weight and the number of measures of each user. The df looks like:
id_user  weight  number_of_measures
1        92.16   4
2        80.34   5
3        71.89   11
4        81.11   7
5        77.23   8
6        92.37   2
7        88.18   3
I would like to see a histogram with one attribute of the table on the x-axis (weight, but I want to do it for both attributes) and the frequency on the y-axis.
Does anyone know how to do this with matplotlib?
Ok, it seems to be quite easy:
import pandas as pd
import matplotlib.pyplot as plt
# one histogram subplot per numeric column
hist = df.hist(bins=50)
plt.show()
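If you want one attribute at a time with labeled axes, here is a minimal sketch (the column names are taken from the table above):
# histogram of a single attribute with labeled axes
ax = df['weight'].plot.hist(bins=50)
ax.set_xlabel('weight')
ax.set_ylabel('frequency')
plt.show()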
I have the following pandas groupby object, and I'd like to turn the result into a new dataframe.
Following is the code to get the conditional probability:
bin_probs = data.groupby('season')['bin'].value_counts()/data.groupby('season')['bin'].count()
I've tried the following code, but the result isn't what I want. I'd like the season to fill in each row. How can I do that?
a = pd.DataFrame(data_5.groupby('season')['bin'].value_counts()/data_5.groupby('season')['bin'].count())
a is a DataFrame, but with a 2-level index, so my interpretation is you want a dataframe without a multi-level index.
The index can't be reset when a name in the index is the same as a column name.
Use pandas.Series.reset_index and set name='normalized_bin' to rename the bin column.
This would not work with the implementation in the OP, because that is a dataframe.
This works with the following implementation, because a pandas.Series is created with .groupby.
The correct way to normalize the column is to use the normalize=True parameter in .value_counts.
import pandas as pd
import random # for test data
import numpy as np # for test data
# setup a dataframe with test data
np.random.seed(365)
random.seed(365)
rows = 1100
data = {'bin': np.random.randint(10, size=(rows)),
'season': [random.choice(['fall', 'winter', 'summer', 'spring']) for _ in range(rows)]}
df = pd.DataFrame(data)
# display(df.head())
bin season
0 2 summer
1 4 winter
2 1 summer
3 5 winter
4 2 spring
# groupby, normalize, and reset the Series index
a = df.groupby(['season'])['bin'].value_counts(normalize=True).reset_index(name='normalized_bin')
# display(a.head(15))
season bin normalized_bin
0 fall 2 0.15600
1 fall 9 0.11600
2 fall 3 0.10800
3 fall 4 0.10400
4 fall 6 0.10000
5 fall 0 0.09600
6 fall 8 0.09600
7 fall 5 0.08400
8 fall 7 0.08000
9 fall 1 0.06000
10 spring 0 0.11524
11 spring 8 0.11524
12 spring 9 0.11524
13 spring 3 0.11152
14 spring 1 0.10037
Using the OP's code for a
As already noted above, use normalize=True to get normalized values.
The solution in the OP creates a DataFrame, because the .groupby is wrapped with the DataFrame constructor, pandas.DataFrame.
To reset the index, first rename the bin column with pandas.DataFrame.rename, and then use pandas.DataFrame.reset_index.
a = (pd.DataFrame(df.groupby('season')['bin'].value_counts() / df.groupby('season')['bin'].count())
       .rename(columns={'bin': 'normalized_bin'})
       .reset_index())
Other Resources
See "Pandas unable to reset index because name exist" for how to reset by a level.
Plotting
It is easier to plot from the multi-index Series by using pandas.Series.unstack, and then pandas.DataFrame.plot.bar.
For side-by-side bars, set stacked=False.
Each stacked bar sums to 1, because the data is normalized within each season.
import matplotlib.pyplot as plt

# unstack the multi-index Series into a season x bin table
s = df.groupby(['season'])['bin'].value_counts(normalize=True).unstack()

# plot a stacked bar
s.plot.bar(stacked=True, figsize=(8, 6))
plt.legend(title='bin', bbox_to_anchor=(1.05, 1), loc='upper left')
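For the side-by-side variant mentioned above, only the stacked flag changes (a minimal sketch reusing s from the block above):
# grouped (side-by-side) bars instead of stacked
s.plot.bar(stacked=False, figsize=(8, 6))
plt.legend(title='bin', bbox_to_anchor=(1.05, 1), loc='upper left')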
You are looking for the normalize parameter:
bin_probs = data.groupby('season')['bin'].value_counts(normalize=True)
Read more about it in the pandas.Series.value_counts documentation.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(30,3))
df.head()
which gives:
0 1 2
0 0.741955 0.913681 0.110109
1 0.079039 0.662438 0.510414
2 0.469055 0.201658 0.259958
3 0.371357 0.018394 0.485339
4 0.850254 0.808264 0.469885
Say I want to add another column that holds the averages of consecutive values in column 2: between indices (0, 1), (1, 2), ..., (28, 29).
I imagine this is a common task, as column 2 contains the x-axis positions and I want the categorical labels on a plot to appear midway between two points on the x-axis.
So I was wondering if there is a pandas way to do this:
averages = []
for index, item in enumerate(df[2]):
    if index < df[2].shape[0] - 1:
        averages.append((item + df[2].iloc[index + 1]) / 2)
df["averages"] = pd.Series(averages)
df.head()
which gives:
0 1 2 averages
0 0.997044 0.965708 0.211980 0.318781
1 0.716349 0.724811 0.425583 0.378653
2 0.729991 0.985072 0.331723 0.333138
3 0.996487 0.272300 0.334554 0.586686
as you can see 0.31 is an average of 0.21 and 0.42.
Thanks!
I think that you can do this with pandas.DataFrame.rolling. Using your dataframe head as an example:
df['averages'] = df[2].rolling(2).mean().shift(-1)
returns:
>>> df
0 1 2 averages
0 0.997044 0.965708 0.211980 0.318781
1 0.716349 0.724811 0.425583 0.378653
2 0.729991 0.985072 0.331723 0.333139
3 0.996487 0.272300 0.334554 NaN
The NaN at the end is there because there is no row indexed 4; in your full dataframe it would go on until the second-to-last row (the average of the values at indices 28 and 29, i.e. your 29th and 30th values). I just wanted to show that this gives the same values as your desired output, which is why I used the exact data you provided. (For future reference, if you want to give us a reproducible dataframe built from random numbers, set and show a random seed such as np.random.seed(42) before creating the df; that way we'll all have the same one.)
Breaking it down:
df[2] is there because you're interested in column 2; .rolling(2) is there because you want the mean of 2 values (if you wanted the mean of 3 values, use .rolling(3), and so on); .mean() is whatever function you want to apply (in your case, the mean); finally, .shift(-1) makes sure the new column lands in the proper place, i.e. each row shows the mean of its own value in column 2 and the value below it, whereas the default would pair it with the value above.
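For comparison, the same pairwise average can be written without rolling by using shift directly; this is just an alternative sketch, and it produces the same column, including the trailing NaN:
# mean of each value in column 2 and the value in the row below it
df['averages'] = (df[2] + df[2].shift(-1)) / 2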
This is one way, though slightly loopy. But #sacul's solution is better. I leave this here for reference only.
import pandas as pd
import numpy as np
from itertools import zip_longest
df = pd.DataFrame(np.random.rand(30, 3))
v = df.values[:, -1]
df = df.join(pd.DataFrame(
    np.array([np.mean([i, j], axis=0)
              for i, j in zip_longest(v, v[1:], fillvalue=v[-1])]),
    columns=['2_pair_avg']))
# 0 1 2 2_pair_avg
# 0 0.382656 0.228837 0.053199 0.373678
# 1 0.812690 0.255277 0.694156 0.697738
# 2 0.040521 0.211511 0.701320 0.491044
# 3 0.558739 0.697916 0.280768 0.615398
# 4 0.262771 0.912669 0.950029 0.489550
# 5 0.217489 0.405125 0.029071 0.101794
# 6 0.577929 0.933565 0.174517 0.214530
# 7 0.067030 0.452027 0.254544 0.613225
# 8 0.580869 0.556112 0.971907 0.582547
# 9 0.483528 0.951537 0.193188 0.175215
# 10 0.481141 0.589833 0.157242 0.159363
# 11 0.087057 0.823691 0.161485 0.108634
# 12 0.319516 0.161386 0.055784 0.285276
# 13 0.901529 0.365992 0.514768 0.386599
# 14 0.270118 0.454583 0.258430 0.245463
# 15 0.379739 0.299569 0.232497 0.214943
# 16 0.017621 0.182647 0.197389 0.538386
# 17 0.720688 0.147093 0.879383 0.732239
# 18 0.859594 0.538390 0.585096 0.503846
# 19 0.360718 0.571567 0.422596 0.287384
# 20 0.874800 0.391535 0.152171 0.239078
# 21 0.935150 0.379871 0.325984 0.294485
# 22 0.269607 0.891331 0.262986 0.212050
# 23 0.140976 0.414547 0.161115 0.542682
# 24 0.851434 0.059209 0.924250 0.801210
# 25 0.389025 0.774885 0.678170 0.388856
# 26 0.679247 0.982517 0.099542 0.372649
# 27 0.670354 0.279138 0.645756 0.336031
# 28 0.393414 0.970737 0.026307 0.343947
# 29 0.479611 0.349401 0.661587 0.661587
I have a dataframe with some columns that I have been adding myself. There is one specific column that gathers the max and min tide levels.
Pandas Column mostly empty but with some reference values
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3,4],'b':[np.nan,np.nan,3,4]},columns=['a','b'])
df
The problem is that the column is mostly empty, because it only shows those peak values and not the intermediate ones. I would like to fill the missing values with a function similar to the one in the image shown below.
I want to fill it with a function of this kind
Thank you in advance.
Since you didn't specify which datetime format your pandas dataframe uses, here is an example that works from the index positions. You can use them if they are evenly spaced and don't have gaps.
import pandas as pd
import numpy as np
from scipy.optimize import curve_fit
tide = np.asarray([-1.2,np.nan,np.nan,3.4,np.nan,np.nan,-1.6,np.nan,np.nan,3.7,np.nan,np.nan,-1.4,])
tide_time = np.arange(len(tide))
df = pd.DataFrame({'a':tide_time,'b':tide})
#define your fit function with amplitude, frequency, phase and offset
def fit_func(x, ampl, freq, phase, offset):
    return ampl * np.sin(freq * x + phase) + offset
#extract rows that contain your values
df_nona = df.dropna()
#perform the least-squares fit and get the coefficients for your data
coeff, _mat = curve_fit(fit_func, df_nona["a"], df_nona["b"])
print(coeff)
#append a column with fit data
df["fitted_b"] = fit_func(df["a"], *coeff)
Output for my sample data
#amplitude frequency phase offset
[ 2.63098177 1.12805625 -2.17037976 1.0127173 ]
a b fitted_b
0 0 -1.2 -1.159344
1 1 NaN -1.259341
2 2 NaN 1.238002
3 3 3.4 3.477807
4 4 NaN 2.899605
5 5 NaN 0.164376
6 6 -1.6 -1.601058
7 7 NaN -0.378513
8 8 NaN 2.434439
9 9 3.7 3.622127
10 10 NaN 1.826826
11 11 NaN -0.899136
12 12 -1.4 -1.439532
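If the goal is to fill only the missing rows while keeping the measured peaks untouched, one option is to combine the two columns (a minimal sketch reusing the fitted_b column from above; the b_filled name is just an example):
#keep the original measurements and use the fitted curve only where b is NaN
df['b_filled'] = df['b'].fillna(df['fitted_b'])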
Suppose I have a table of data-
No. 200 400 600 800
1 13 14 17 18
2 16 18 20 21
3 20 15 18 19
and so on...
where each column represents a y-value for a given x-value. The first line contains the x-values and the first column is the number of each dataset.
How can I read in and plot each row separately?
For an idea of how I would like the results to look for the table quoted above, see the following images. I have plotted each dataset individually.
http://postimg.org/image/yw46zw7er/92d01c08/
http://postimg.org/image/c1kf2nqwp/29a8b1c8/
Matplotlib plots a 2D array by drawing one line per column, so here you just need to transpose your data. Assuming the table is in a text file called data.csv, skip the header line and drop the leading "No." column when loading:
import numpy as np
import matplotlib.pyplot as plt

# skip the header line and drop the dataset-number column
data = np.loadtxt('data.csv', skiprows=1)[:, 1:]
x = [200, 400, 600, 800]

# data.T has one column per dataset, so each dataset becomes one line
plt.plot(x, data.T)
plt.legend(['1', '2', '3'])
plt.show()
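If you would rather have one figure per dataset, as in the linked images, here is a small sketch building on the same data and x (titles and axis labels are just placeholders):
# one figure per dataset (one row of the loaded array)
for i, row in enumerate(data, start=1):
    plt.figure()
    plt.plot(x, row)
    plt.title(f'Dataset {i}')
    plt.xlabel('x')
    plt.ylabel('y')
plt.show()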