I have the following dataframe [1], which contains music-listening information. I would like to draw a line graph like the one in [2] (which I produced by entering the data manually), relating slotID to the average BPM, without writing the values by hand. Each segment must be one unit long and sit at the height of the corresponding average BPM.
[1]
slotID NUn NTot MeanBPM
2 2 3 13 107.987769
9 11 3 30 133.772100
10 12 3 40 122.354025
13 15 4 44 123.221659
14 16 4 30 129.083900
15 17 9 66 123.274409
16 18 4 25 131.323480
18 20 5 40 124.782625
19 21 6 30 127.664467
20 22 6 19 120.483579
The code I used to obtain the plot is the following:
import numpy as np
import pylab as pl
from matplotlib import collections as mc
lines = [ [(2, 107), (3,107)], [(11,133),(12,133)], [(12,122),(13,122)], ]
c = np.array([(1, 0, 0, 1), (0, 1, 0, 1), (0, 0, 1, 1)])
lc = mc.LineCollection(lines, colors=c, linewidths=2)
fig, ax = pl.subplots()
ax.add_collection(lc)
ax.autoscale()
ax.margins(0.1)
To obtain data:
import numpy as np
import pandas as pd
dfLunedi = pd.read_csv( "5.sab.csv", encoding = "ISO-8859-1", sep = ';')
dfSlotMean = dfLunedi.groupby('slotID', as_index=False).agg( NSabUn=('date', 'nunique'),NSabTot = ('date', 'count'), MeanBPM=('tempo', 'mean') )
df = pd.DataFrame(dfSlotMean)
df.to_csv('sil.csv', sep = ';', index=False)
df.drop(df[df.NSabUn < 3].index, inplace=True)
You can loop through the rows and plot each segment like this:
import matplotlib.pyplot as plt
for _, r in df.iterrows():
    plt.plot([r['slotID'], r['slotID']+1], [r['MeanBPM']]*2)
Output:
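As an alternative to the loop, the segments can also be drawn in a single call. This is only a minimal sketch, assuming the columns are named slotID and MeanBPM as in [1]. The three rows are copied from the question, and the Agg backend line is an assumption added so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption); drop this line to get a window
import matplotlib.pyplot as plt
import pandas as pd

# First three rows of the dataframe shown in [1]
df = pd.DataFrame({"slotID": [2, 11, 12],
                   "MeanBPM": [107.987769, 133.772100, 122.354025]})

fig, ax = plt.subplots()
# One horizontal segment per row, from slotID to slotID + 1 at height MeanBPM
segments = ax.hlines(y=df["MeanBPM"], xmin=df["slotID"], xmax=df["slotID"] + 1,
                     linewidth=2)
ax.set_xlabel("slotID")
ax.set_ylabel("MeanBPM")
ax.margins(0.1)
```

With an interactive backend, calling plt.show() afterwards displays the figure. Unlike the loop, hlines draws all segments in one color unless a colors argument is passed.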
I have a df with 70000 ages. I want to group them into age ranges like this:
18-30
30-50
50-99
and compare the groups against another column, which holds the revenue.
If you have a dataframe like this one:
N = 1000
df = pd.DataFrame({'age': np.random.randint(18, 99, N),
                   'revenue': 20 + 200*np.abs(np.random.randn(N))})
age revenue
0 69 56.776670
1 32 40.019089
2 89 38.045533
3 78 176.214654
4 38 527.738220
5 92 124.790533
6 92 137.617365
7 41 46.680172
8 20 234.199293
9 39 136.560120
You can cut the dataframe in age groups with pandas.cut:
df['group'] = pd.cut(df['age'], bins = [18, 30, 50, 99], include_lowest = True, labels = ['18-30', '30-50', '50-99'])
age revenue group
0 69 56.776670 50-99
1 32 40.019089 30-50
2 89 38.045533 50-99
3 78 176.214654 50-99
4 38 527.738220 30-50
5 92 124.790533 50-99
6 92 137.617365 50-99
7 41 46.680172 30-50
8 20 234.199293 18-30
9 39 136.560120 30-50
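To make the bin edges concrete, here is a small sketch of how pandas.cut assigns boundary ages with the bins used above. The six ages are made-up values chosen to sit on or next to the edges.

```python
import pandas as pd

# Hypothetical ages placed on and around the bin edges
ages = pd.Series([18, 30, 31, 50, 51, 99])
labels = ['18-30', '30-50', '50-99']

# Intervals are closed on the right by default, so 30 falls in '18-30'
# and 50 falls in '30-50'; include_lowest=True keeps 18 in the first bin.
groups = pd.cut(ages, bins=[18, 30, 50, 99], include_lowest=True, labels=labels)
print(groups.tolist())
# ['18-30', '18-30', '30-50', '30-50', '50-99', '50-99']

# Without include_lowest, the lowest edge (18) is excluded and becomes NaN
default = pd.cut(ages, bins=[18, 30, 50, 99], labels=labels)
print(default.isna().tolist())
# [True, False, False, False, False, False]
```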
Then you can group the age groups with pandas.DataFrame.groupby:
df = df.groupby(by = 'group').mean()
age revenue
group
18-30 23.534091 184.895077
30-50 40.529183 185.348380
50-99 73.902998 170.889141
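If more than the mean is needed per group, the same groupby can return several statistics at once. This is a sketch on five made-up rows, not the random data above. Passing observed explicitly is an assumption that helps avoid the warning that some recent pandas versions emit when grouping on a categorical column.

```python
import pandas as pd

# Hypothetical data: five customers with age and revenue
df = pd.DataFrame({"age": [20, 25, 35, 45, 60],
                   "revenue": [100.0, 200.0, 50.0, 150.0, 300.0]})
df["group"] = pd.cut(df["age"], bins=[18, 30, 50, 99], include_lowest=True,
                     labels=["18-30", "30-50", "50-99"])

# Mean revenue and number of rows per age group
stats = df.groupby("group", observed=False)["revenue"].agg(["mean", "count"])
print(stats)
```

The count column is useful for spotting groups that are too small for their mean to be meaningful.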
Now, finally, you are ready to plot the data:
fig, ax = plt.subplots()
ax.bar(x = df.index, height = df['revenue'])
plt.show()
Complete Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
N = 1000
df = pd.DataFrame({'age': np.random.randint(18, 99, N),
                   'revenue': 20 + 200*np.abs(np.random.randn(N))})
df['group'] = pd.cut(df['age'], bins = [18, 30, 50, 99], include_lowest = True, labels = ['18-30', '30-50', '50-99'])
df = df.groupby(by = 'group').mean()
fig, ax = plt.subplots()
ax.bar(x = df.index, height = df['revenue'])
plt.show()
I want to plot a 95% confidence interval of a data frame using Python. The graph will be a line plot where the x-axis indicates the column name/number and the y-axis indicates the column values. I searched a lot but couldn't find the solution I was looking for. Here is an example of my data frame.
Ph1 Ph2 Ph3 ph4 Ph5 Ph6
-0.152511 -0.039428 0.131173 -0.002039 0.008101 -0.002039
-0.068273 0.152013 -0.315244 0.005247 0.014775 -0.045268
0.425363 -0.043105 0.071670 -0.045124 -0.036135 -0.037250
-0.019332 0.139712 -0.026001 -0.021844 -0.040854 -0.050648
0.077907 0.341410 -0.113731 -0.065799 -0.027229 -0.077948
0.145185 0.112060 0.093898 0.028815 -0.032327 0.004239
I have also attached an example graph that shows how the x-axis and y-axis of the desired plot should look.
Answer
You can use seaborn.lineplot to do that, since seaborn uses 95% CI by default, but firstly you need to reshape your data through pandas.melt.
If you start from data in a dataframe df like the one you provided, you can reshape it with:
df = pd.melt(frame = df,
             var_name = 'column',
             value_name = 'value')
output:
column value
0 Ph1 -0.152511
1 Ph1 -0.068273
2 Ph1 0.425363
3 Ph1 -0.019332
4 Ph1 0.077907
5 Ph1 0.145185
6 Ph2 -0.039428
7 Ph2 0.152013
8 Ph2 -0.043105
9 Ph2 0.139712
10 Ph2 0.341410
11 Ph2 0.112060
12 Ph3 0.131173
13 Ph3 -0.315244
14 Ph3 0.071670
15 Ph3 -0.026001
16 Ph3 -0.113731
17 Ph3 0.093898
18 ph4 -0.002039
19 ph4 0.005247
20 ph4 -0.045124
21 ph4 -0.021844
22 ph4 -0.065799
23 ph4 0.028815
24 Ph5 0.008101
25 Ph5 0.014775
26 Ph5 -0.036135
27 Ph5 -0.040854
28 Ph5 -0.027229
29 Ph5 -0.032327
30 Ph6 -0.002039
31 Ph6 -0.045268
32 Ph6 -0.037250
33 Ph6 -0.050648
34 Ph6 -0.077948
35 Ph6 0.004239
Then you can plot this df with:
fig, ax = plt.subplots()
sns.lineplot(ax = ax,
             data = df,
             x = 'column',
             y = 'value',
             sort = False)
plt.show()
Complete Code
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df = pd.read_csv('data.csv')
df = pd.melt(frame = df,
             var_name = 'column',
             value_name = 'value')
fig, ax = plt.subplots()
# sort = False keeps the columns in their original order on the x-axis
sns.lineplot(ax = ax,
             data = df,
             x = 'column',
             y = 'value',
             sort = False)
plt.show()
plt.show()
Plot
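For comparison, the band can also be computed by hand. The sketch below uses a normal approximation (mean ± 1.96 standard errors) on the six rows from the question. This is an assumption and not what seaborn does: seaborn bootstraps its interval, so the two bands will differ, and with only six rows a t-based multiplier would give a wider band. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption); drop this line to get a window
import matplotlib.pyplot as plt
import pandas as pd

# The six rows from the question
df = pd.DataFrame({
    "Ph1": [-0.152511, -0.068273, 0.425363, -0.019332, 0.077907, 0.145185],
    "Ph2": [-0.039428, 0.152013, -0.043105, 0.139712, 0.341410, 0.112060],
    "Ph3": [0.131173, -0.315244, 0.071670, -0.026001, -0.113731, 0.093898],
    "ph4": [-0.002039, 0.005247, -0.045124, -0.021844, -0.065799, 0.028815],
    "Ph5": [0.008101, 0.014775, -0.036135, -0.040854, -0.027229, -0.032327],
    "Ph6": [-0.002039, -0.045268, -0.037250, -0.050648, -0.077948, 0.004239],
})

mean = df.mean()              # per-column mean
half_width = 1.96 * df.sem()  # normal-approximation half width of the 95% band

positions = list(range(len(df.columns)))
fig, ax = plt.subplots()
ax.plot(positions, mean.values)
ax.fill_between(positions,
                (mean - half_width).values,
                (mean + half_width).values,
                alpha=0.3)
ax.set_xticks(positions)
ax.set_xticklabels(df.columns)
```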
EDIT 2
I fixed one part of the code that was wrong. With this line, I attach the category to every data point (X axis):
y = joy(cat, EveryTest[i].GPS)
After adding that line, the graph improved, but something is still failing. The graph starts at the 4th category (12:40:00), but it should start at the first (12:10:00). What am I doing wrong?
EDIT 1:
I updated Bokeh to 0.12.13, which fixed the label problem.
Now my problem is:
I expected the loop (for i, cat in enumerate(reversed(cats)):) to put every chart on its own label, but that does not happen. The charts are stuck at the 5th or 6th label (12:30:00 or 12:50:00).
- Start of question -
I am trying to reproduce the joyplot example, but I have trouble when I try to plot my own data. I don't want to plot a histogram; I want to plot one list on X and another list on Y. I do not understand what I am doing wrong.
The code (fixed):
from numpy import linspace
from scipy.stats.kde import gaussian_kde
from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, FixedTicker, PrintfTickFormatter
from bokeh.plotting import figure
#from bokeh.sampledata.perceptions import probly
import bokeh
bokeh.BOKEH_RESOURCES='inline'
import colorcet as cc
output_file("joyplot.html")
def joy(category, data, scale=20):
    return list(zip([category]*len(data),data))
#Elements = 7
cats = ListOfTime # list(reversed(probly.keys())) #list(['Pos_1','Pos_2']) #
print len(cats),' lengh of times'
palette = [cc.rainbow[i*15] for i in range(16)]
palette += palette
print len(palette),'lengh palette'
x = X # linspace(-20,110, 500) #Test.X #
print len(x),' lengh X'
source = ColumnDataSource(data=dict(x=x))
p = figure(y_range=cats, plot_width=900, x_range=(0, 1500), toolbar_location=None)
for i, cat in enumerate(reversed(cats)):
    y = joy(cat, EveryTest[i].GPS)
    #print cat
    source.add(y, cat)
    p.patch('x', cat, color=palette[i], alpha=0.6, line_color="black", source=source)
    #break
print source
p.outline_line_color = None
p.background_fill_color = "#efefef"
p.xaxis.ticker = FixedTicker(ticks=list(range(0, 1500, 100)))
#p.xaxis.formatter = PrintfTickFormatter(format="%d%%")
p.ygrid.grid_line_color = None
p.xgrid.grid_line_color = "#dddddd"
p.xgrid.ticker = p.xaxis[0].ticker
p.axis.minor_tick_line_color = None
p.axis.major_tick_line_color = None
p.axis.axis_line_color = None
#p.y_range.range_padding = 0.12
#p
show(p)
the variables are:
print X, type(X)
[ 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75
78 81 84 87 90 93 96 99] <type 'numpy.ndarray'>
and
print EveryTest[0].GPS, type(EveryTest[i].GPS)
0 2
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2
9 2
10 2
11 2
12 2
13 2
14 2
15 2
16 2
17 2
18 2
19 2
20 2
21 2
22 2
23 2
24 2
25 2
26 2
27 2
28 2
29 2
30 2
31 2
32 2
Name: GPS, dtype: int64 <class 'pandas.core.series.Series'>
Following the example, the data types look OK, but I get the following image:
And I expected something like this:
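To make the loop easier to reason about, here is a small Bokeh-free sketch of what joy returns and how the loop pairs categories with datasets. The names cats and tests and their values are hypothetical stand-ins for ListOfTime and EveryTest[i].GPS. The sketch is in Python 3, unlike the Python 2 code above.

```python
def joy(category, data):
    # Pair every value with its category, as in the question's helper
    return list(zip([category] * len(data), data))

cats = ["12:10:00", "12:20:00", "12:30:00"]  # hypothetical time labels
tests = [[2, 2], [3, 3], [4, 4]]             # hypothetical stand-ins for EveryTest[i].GPS

print(joy(cats[0], tests[0]))
# [('12:10:00', 2), ('12:10:00', 2)]

# The loop walks the categories in reverse while indexing the data forwards,
# so the last category is paired with the first dataset.
pairs = [(cat, tests[i]) for i, cat in enumerate(reversed(cats))]
print(pairs[0])
# ('12:30:00', [2, 2])
```

If EveryTest is stored in the same order as the time labels, this reversed pairing may be one thing to check. Whether it explains the misplaced charts is not clear from the question alone.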
Here is what my dataframe looks like:
year item_id sales_quantity
2014 1 10
2014 1 4
... ... ...
2015 1 7
2015 1 10
... ... ...
2014 2 1
2014 2 8
... ... ...
2015 2 17
2015 2 30
... ... ...
2014 3 9
2014 3 18
... ... ...
For each item_id, I want to plot a boxplot showing the distribution for each year.
Here is what I tried:
data = pd.DataFrame.from_csv('electronics.csv')
grouped = data.groupby(['year'])
ncols=4
nrows = int(np.ceil(grouped.ngroups/ncols))
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(35,45),
                         sharey=False)
for (key, ax) in zip(grouped.groups.keys(), axes.flatten()):
    grouped.get_group(key).boxplot(x='year', y='sales_quantity',
                                   ax=ax, label=key)
I get the error boxplot() got multiple values for argument 'x'. Can someone please tell me how to do this right?
If I have only a single item, then the following works
sns.boxplot(data.sales_quantity, groupby = data.year). How could I extend it for multiple items?
Link to csv
Please check the comments in the code.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('electronics_157_3cols.csv')
print(df)
# One subplot per unique item ID, sharing the y-axis
fig, axes = plt.subplots(1, len(df['item_id_copy'].unique()), sharey=True)
for n, i in enumerate(df['item_id_copy'].unique()):
    # Select this item's rows and spread sales_quantity into one column per year
    idf = df[df['item_id_copy'] == int('{}'.format(i))][['year', 'sales_quantity']].pivot(columns='year')
    print(idf)
    idf.plot.box(ax=axes[n])
    axes[n].set_title('ID {}'.format(i))
    axes[n].set_xticklabels([e[1] for e in idf.columns], rotation=45)
    axes[n].set_ylim(0, 1)  # Remove this line to show the outliers properly (it is kept here only to produce a readable example graph)
plt.show()
A variant that lays the box plots out on a 2x5 grid:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('electronics_157_3cols.csv')
print(df)
fig, axes = plt.subplots(2, 5, sharey=True)
item_ids = iter(df['item_id_copy'].unique())
for r in range(2):
    for c in range(5):
        i = next(item_ids)  # raises StopIteration if there are fewer than 10 items
        idf = df[df['item_id_copy'] == int('{}'.format(i))][['year', 'sales_quantity']].pivot(columns='year')
        print(idf)
        idf.plot.box(ax=axes[r][c])
        axes[r][c].set_title('ID {}'.format(i))
        axes[r][c].set_xticklabels([e[1] for e in idf.columns], rotation=0)
        axes[r][c].set_ylim(0, 1)
plt.show()
I will leave this simple version for others...
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_table('sample.txt', delimiter=r'\s+')  # raw string avoids an invalid-escape warning
fig, axes = plt.subplots(1, 3, sharey=True)
for n, i in enumerate(df['item_id'].unique()):
    # Select this item's rows and spread sales_quantity into one column per year
    idf = df[df['item_id'] == int('{}'.format(i))][['year', 'sales_quantity']].pivot(columns='year')
    print(idf)
    idf.plot.box(ax=axes[n])
    axes[n].set_title('Item ID {}'.format(i))
    axes[n].set_xticklabels([e[1] for e in idf.columns])
plt.show()
sample.txt
year item_id sales_quantity
2014 1 10
2014 1 4
2015 1 7
2015 1 10
2014 2 1
2014 2 8
2015 2 17
2015 2 30
2014 3 9
2014 3 18
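As for the original error, DataFrame.boxplot selects the data with column and by rather than x and y. The sketch below is one possible alternative to the pivot approach, not the answerer's method. It groups by item_id and calls boxplot with by='year' on each group, using the sample.txt rows above embedded as a string. The Agg backend line is an assumption so it runs without a display.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption); drop this line to get a window
import matplotlib.pyplot as plt
import pandas as pd

# Same rows as sample.txt, embedded so the sketch is self-contained
sample = """year item_id sales_quantity
2014 1 10
2014 1 4
2015 1 7
2015 1 10
2014 2 1
2014 2 8
2015 2 17
2015 2 30
2014 3 9
2014 3 18
"""
df = pd.read_csv(io.StringIO(sample), sep=r"\s+")

fig, axes = plt.subplots(1, df["item_id"].nunique(), sharey=True)
for ax, (item_id, group) in zip(axes, df.groupby("item_id")):
    # One box per year for this item
    group.boxplot(column="sales_quantity", by="year", ax=ax)
    ax.set_title("Item ID {}".format(item_id))
    ax.set_xlabel("year")
fig.suptitle("")  # remove the automatic "Boxplot grouped by year" heading
```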
I have a dataframe that I want to make a scatter plot of.
The dataframe looks like:
year length Animation
0 1971 121 1
1 1939 71 1
2 1941 7 0
3 1996 70 1
4 1975 71 0
I want the points in my scatter plot to have a different color depending on the value in the Animation column.
So animation = 1 = yellow
animation = 0 = black
or something similar.
I tried doing the following:
dfScat = df[['year','length', 'Animation']]
dfScat = dfScat.loc[dfScat.length < 200]
axScat = dfScat.plot(kind='scatter', x=0, y=1, alpha=1/15, c=2)
This results in a slider-like colorbar, which makes it hard to tell the two values apart.
You can also assign discrete colors to the points by passing an array to c=, like this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
d = {"year" : (1971, 1939, 1941, 1996, 1975),
"length" : ( 121, 71, 7, 70, 71),
"Animation" : ( 1, 1, 0, 1, 0)}
df = pd.DataFrame(d)
print(df)
colors = np.where(df["Animation"]==1,'y','k')
df.plot.scatter(x="year",y="length",c=colors)
plt.show()
This gives:
Animation length year
0 1 121 1971
1 1 71 1939
2 0 7 1941
3 1 70 1996
4 0 71 1975
Use the c parameter in scatter
df.plot.scatter('year', 'length', c='Animation', colormap='jet')
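If the exact colors matter (yellow for 1, black for 0, as asked), an explicit mapping is another option. This is a minimal sketch on the five rows from the question. The name color_by_value is made up for illustration, and the Agg backend line is an assumption so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption); drop this line to get a window
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"year": [1971, 1939, 1941, 1996, 1975],
                   "length": [121, 71, 7, 70, 71],
                   "Animation": [1, 1, 0, 1, 0]})

# Hypothetical mapping from Animation value to a named color
color_by_value = {1: "yellow", 0: "black"}
colors = df["Animation"].map(color_by_value).tolist()

# A list of color names gives discrete colors and no colorbar
ax = df.plot.scatter(x="year", y="length", c=colors)
```

A dictionary scales more easily than np.where when the column has more than two distinct values.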