I need some guidance in working out how to plot a block of histograms from grouped data in a pandas dataframe. Here's an example to illustrate my question:
from pandas import DataFrame
import numpy as np
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
grouped = df.groupby('Letter')
In my ignorance I tried this code command:
df.groupby('Letter').hist()
which failed with the error message "TypeError: cannot concatenate 'str' and 'float' objects"
Any help most appreciated.
I'm on a roll, just found an even simpler way to do it using the by keyword in the hist method:
df['N'].hist(by=df['Letter'])
That's a very handy little shortcut for quickly scanning your grouped data!
For future visitors, the product of this call is the following chart:
One solution is to use matplotlib's histogram function directly on each group. Iterating over the groupby object yields one (name, dataframe) pair per group, so you can create a histogram for each one.
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
grouped = df.groupby('Letter')
# Iterating over a groupby yields (group name, sub-dataframe) pairs
for name, group in grouped:
    plt.figure()
    plt.hist(group['N'])
    plt.title(name)
plt.show()
Your function is failing because the groupby dataframe you end up with has a hierarchical index and two columns (Letter and N), so when you call .hist() it tries to make a histogram of both columns, hence the str error.
This is the default behavior of pandas plotting functions (one plot per column), so if you reshape your data frame so that each letter is a column, you will get exactly what you want.
df.reset_index().pivot(index='index', columns='Letter', values='N').hist()
The reset_index() call just moves the current index into a column called index. pivot then takes your data frame, collects all of the N values for each Letter and makes them a column. The resulting data frame has 1000 rows, one per original row, and three columns (A, B, C); each row holds a value in its own letter's column and NaN in the other two. hist() drops the NaN values and produces one histogram per column, and you can then format the plots as needed.
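To make the reshaping step concrete, here is a minimal sketch that rebuilds the example frame and inspects the pivoted result. The seeded generator and the name `wide` are assumptions added for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same layout as the question's frame, with a seeded generator for repeatability
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Letter": ["A"] * 300 + ["B"] * 400 + ["C"] * 300,
    "N": rng.standard_normal(1000),
})

# Keyword arguments are used because recent pandas versions require them for pivot
wide = df.reset_index().pivot(index="index", columns="Letter", values="N")

print(wide.shape)    # (rows, columns) of the reshaped frame
print(wide.count())  # number of non-NaN values in each letter column
```

Calling `wide.hist()` on this frame then draws one histogram per letter column.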
With a recent version of pandas, you can do
df.N.hist(by=df.Letter)
Just like with the solutions above, the axes will be different for each subplot. I have not solved that one yet.
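One possible way to get matching axes, sketched under the assumption that a common scale across all subplots is what is wanted: `DataFrame.hist` accepts `sharex` and `sharey` together with `column` and `by`. The seeded data and the name `visible_axes` are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove for interactive use
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Letter": ["A"] * 300 + ["B"] * 400 + ["C"] * 300,
    "N": rng.standard_normal(1000),
})

# One subplot per letter; sharex/sharey force a common scale across subplots
axes = df.hist(column="N", by="Letter", sharex=True, sharey=True, bins=20)

# The call returns an array of Axes; unused grid cells are hidden
visible_axes = [ax for ax in np.ravel(axes) if ax.get_visible()]
```

With shared axes the three histograms can be compared directly, at the cost of some empty space in the smaller groups.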
I am writing this answer because I was looking for a way to plot the histograms of different groups together. What follows is not very smart, but it works fine for me. I use NumPy to compute the histogram and Bokeh for plotting. I think it is self-explanatory, but feel free to ask for clarifications and I'll be happy to add details (and write it better).
import numpy as np
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show
# df_trips is a dataframe with 'locality', 'means' and 'speed' columns
figures = {
    'Transit': figure(title='Transit', x_axis_label='speed [km/h]', y_axis_label='frequency'),
    'Driving': figure(title='Driving', x_axis_label='speed [km/h]', y_axis_label='frequency')
}
cols = {'Vienna': 'red', 'Turin': 'blue', 'Rome': 'orange'}
for (locality, means), group in df_trips.groupby(['locality', 'means']):
    fig = figures[means]
    h, b = np.histogram(group['speed'].values)
    # Centre each bar on its bin (midpoint of the left and right edges)
    centers = (b[:-1] + b[1:]) / 2
    fig.vbar(x=centers, top=h, width=(b[1]-b[0]), legend_label=locality, fill_color=cols[locality], alpha=0.5)
show(gridplot([
    [figures['Transit']],
    [figures['Driving']],
]))
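The same idea can be sketched with matplotlib instead of Bokeh, which is a swapped-in plotting library rather than the method used above. The frame `trips` is a hypothetical stand-in for `df_trips`, and the bin edges are computed once over all rows so that bars from different groups line up, which is an assumption about what is wanted here.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical stand-in for df_trips
trips = pd.DataFrame({
    "locality": ["Vienna"] * 200 + ["Turin"] * 200,
    "speed": np.concatenate([rng.normal(30, 5, 200), rng.normal(40, 8, 200)]),
})

# Shared bin edges so that bars from different groups are comparable
edges = np.histogram_bin_edges(trips["speed"], bins=15)
centers = (edges[:-1] + edges[1:]) / 2

fig, ax = plt.subplots()
counts_by_city = {}
for locality, group in trips.groupby("locality"):
    counts, _ = np.histogram(group["speed"], bins=edges)
    counts_by_city[locality] = counts
    ax.bar(centers, counts, width=np.diff(edges), alpha=0.5, label=locality)
ax.set_xlabel("speed [km/h]")
ax.set_ylabel("frequency")
ax.legend()
```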
I find this even easier and faster. Note that it plots the distribution of the per-group row counts, not one histogram per group.
data_df.groupby('Letter').count()['N'].hist(bins=100)
I am looking to remove the upper outliers of some columns in a DataFrame (specifically the 'vehicle_age' and 'odometer' columns) in order to then build a histogram.
I have been able to successfully build the initial histograms like so:
crankshaft_ads['odometer'].plot(kind='hist', bins=25, range= (0, 1000000))
And I would like to build histograms without the upper outliers, as a comparison. Here is what I tried so far:
q1_age = crankshaft_ads['vehicle_age'].quantile(0.25)
q1_odometer = crankshaft_ads['odometer'].quantile(0.25)
q3_age = crankshaft_ads['vehicle_age'].quantile(0.75)
q3_odometer = crankshaft_ads['odometer'].quantile(0.75)
iqr_age = q3_age - q1_age
iqr_odometer = q3_odometer - q1_odometer
upper_limit_age = q3_age + (1.5 * iqr_age)
upper_limit_odometer = q3_odometer + (1.5 * iqr_odometer)
crankshaft_ads['upper_limit_age'] = upper_limit_age
crankshaft_ads['upper_limit_odometer'] = upper_limit_odometer
(crankshaft_ads
.query('vehicle_age < upper_limit_age')
.plot(kind='hist', bins=10)
)
(crankshaft_ads
.query('odometer < upper_limit_odometer')
.plot(kind='hist', bins=25)
)
I would need help with the .query() elements. I get the following error (it happens when running the .plot line it seems):
ValueError: view limit minimum -49500.0 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units
There is one column in the DataFrame that has datetime datatype, but what I'm trying to do is build a histogram for the 2 columns mentioned above, with the upper outliers filtered out. Is this the wrong approach?
Thanks for your help.
It seems that you have not selected the columns you want to plot in your plotting functions. The queries you have written select a subset of the whole dataframe, not only the column mentioned in each query. So both plotting functions are attempting to plot a histogram for each column in a single figure, including the datetime column.
Here are three ways you could solve this problem, taking your first plotting function as an example:
# Solution 1: apply query to whole dataframe then select column in plotting function
crankshaft_ads.query('vehicle_age < @upper_limit_age').plot.hist(y='vehicle_age', bins=10)
# Solution 2: first select column then select values to plot in histogram
crankshaft_ads['vehicle_age'][crankshaft_ads['vehicle_age'] < upper_limit_age].plot.hist(bins=10)
# Solution 3: first select all dataframe rows meeting condition then select column in plotting function
crankshaft_ads[crankshaft_ads['vehicle_age'] < upper_limit_age].plot.hist(y='vehicle_age', bins=10)
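For a self-contained check of the whole approach, here is a minimal sketch on made-up data. The frame `ads`, its values and the helper `upper_fence` are hypothetical; only the Q3 + 1.5 * IQR rule and the column names come from the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove for interactive use
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical stand-in for crankshaft_ads, including a datetime column
ads = pd.DataFrame({
    "vehicle_age": rng.integers(0, 20, size=500),
    "odometer": rng.gamma(shape=2.0, scale=60000.0, size=500),
    "date_posted": pd.date_range("2018-01-01", periods=500, freq="D"),
})

def upper_fence(series, k=1.5):
    """Upper outlier limit: Q3 + k * IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return q3 + k * (q3 - q1)

upper_limit_odometer = upper_fence(ads["odometer"])

# @ refers to the local Python variable; selecting the column afterwards
# keeps the datetime column out of the histogram
trimmed = ads.query("odometer < @upper_limit_odometer")["odometer"]
ax = trimmed.plot(kind="hist", bins=25)
```

Because the limit is a plain variable here, there is no need to store it as an extra column in the frame.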
I have a data visualisation-based question. I basically want to create a heatmap from a pandas DataFrame, where I have the x,y coordinates and the corresponding z value. The data can be created with the following code -
data = ([[0.2,0.2,24],[0.2,0.6,8],[0.2,2.4,26],[0.28,0.2,28],[0.28,0.6,48],[0.28,2.4,55],[0.36,0.2,34],[0.36,0.6,46],[0.36,2.4,55]])
data=np.array(data)
df=pd.DataFrame(data,columns=['X','Y','Z'])
Please note that I have converted an array into a DataFrame just so that I can give an example of an array. My actual data set is quite large and I import into python as a DataFrame. After processing the DataFrame, I have it available as the format given above.
I have seen the other questions based on the same problem, but they do not seem to be working for my particular problem. Or maybe I am not applying them correctly. I want my results to be similar to what is given here https://plot.ly/python/v3/ipython-notebooks/cufflinks/#heatmaps
Any help would be welcome.
Thank you!
Found one way of doing this -
Using Seaborn.
import numpy as np
import pandas as pd
import seaborn as sns
data = ([[0.2,0.2,24],[0.2,0.6,8],[0.2,2.4,26],[0.28,0.2,28],[0.28,0.6,48],[0.28,2.4,55],[0.36,0.2,34],[0.36,0.6,46],[0.36,2.4,55]])
data=np.array(data)
df=pd.DataFrame(data,columns=['X','Y','Z'])
# pivot needs keyword arguments in recent pandas versions
df=df.pivot(index='X', columns='Y', values='Z')
display_df = sns.heatmap(df)
Returns the following image -
Sorry for creating another question.
Also, thank you for the link to a related post.
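If Seaborn is not available, the same pivot-then-plot idea can be sketched with plain matplotlib. This is an alternative rather than the approach from the answer above; the names `grid` and `image` are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

data = [[0.2, 0.2, 24], [0.2, 0.6, 8], [0.2, 2.4, 26],
        [0.28, 0.2, 28], [0.28, 0.6, 48], [0.28, 2.4, 55],
        [0.36, 0.2, 34], [0.36, 0.6, 46], [0.36, 2.4, 55]]
df = pd.DataFrame(data, columns=["X", "Y", "Z"])

# Rows become X values, columns become Y values, cells hold Z
grid = df.pivot(index="X", columns="Y", values="Z")

fig, ax = plt.subplots()
image = ax.imshow(grid.to_numpy(), aspect="auto", origin="lower")
# Label the cells with the actual coordinate values
ax.set_xticks(range(len(grid.columns)))
ax.set_xticklabels(grid.columns)
ax.set_yticks(range(len(grid.index)))
ax.set_yticklabels(grid.index)
ax.set_xlabel("Y")
ax.set_ylabel("X")
fig.colorbar(image, ax=ax, label="Z")
```

Note that `imshow` draws equally sized cells, so the uneven spacing of the Y values (0.2, 0.6, 2.4) is not reflected in the cell widths.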
How about using plotnine, a Grammar of Graphics for Python?
data
data = ([[0.2,0.2,24],[0.2,0.6,8],[0.2,2.4,26],[0.28,0.2,28],[0.28,0.6,48],[0.28,2.4,55],[0.36,0.2,34],[0.36,0.6,46],[0.36,2.4,55]])
data=np.array(data)
df=pd.DataFrame(data,columns=['X','Y','Z'])
Prepare data
df['rows'] = ['row' + str(n) for n in range(0,len(df.index))]
dfMelt = pd.melt(df, id_vars = 'rows')
Make heatmap
from plotnine import ggplot, aes, geom_tile, theme, element_blank, labs
(ggplot(dfMelt, aes('variable', 'rows', fill='value'))
 + geom_tile(aes(width=1, height=1))
 + theme(axis_ticks=element_blank(),
         panel_background=element_blank())
 + labs(x='', y='', fill=''))
I'm using Python modules such as pandas and Matplotlib to make charts for a university project.
I have a problem ordering the result in the pivot table.
This is the body of a function that takes three lists as input ([2017-03-03, ...], ['Username1', 'Username2',...], [1012020,103024,...]), analyzes the data and makes a chart from it.
data = [date_list,username,field]
username_no_dup = list(set(username))
rows = zip(data[0], data[1], data[2])
headers = ['Date', 'Username', 'Value']
df = pd.DataFrame(rows, columns=headers)
df = df.sort_values('Value', ascending=False)
#*sort_values works but it is not sorting when converting to Pivot Table*
pivot_df = pd.pivot_table(df ,index='Date', columns='Username', values='Value')
pivot_df.loc[:,username_no_dup].plot(kind='bar', stacked=True, color=color_list, figsize=(15,7))
I would like to order by value, with the greatest values nearest the x-axis of the chart. Has anyone solved this problem? Thank you.
Here are the top rows of df sorted by value:
[['2017-03-15','SSL1_APP',1515091]
['2017-03-16','SSL1_APP',1373827]
['2017-03-18','SSL1_APP',1136483]
['2017-03-21','SSL1_APP',601810]
['2017-03-17','SSL1_APP',325561]
['2017-03-15','KE77_APP',284971]
['2017-03-16','AF77_APP',222588]
['2017-03-16','MI77_APP',222148]
['2017-03-15','AF77_APP',202224]
['2017-03-15','MI77_APP',191791]
['2017-03-17','AF77_APP',187709]
['2017-03-16','PC77_APP',185766]
['2017-03-15','NE77_APP',177475]
['2017-03-18','FBW2_APP',175156]
['2017-03-16','NE77_APP',174570]
['2017-03-17','BFD1_APP',164238]
['2017-03-15','BFD1_APP',162931]
['2017-03-20','AF77_APP',152186]
['2017-03-17','PC77_APP',148727]
['2017-03-18','MI77_APP',147460]
['2017-03-16','BFD1_APP',145815]
['2017-03-20','BFD1_APP',145449]
['2017-03-15','PC77_APP',144959]
['2017-03-20','SSL1_APP',141719]]
The first pic is the plot I have created. The second one is the result I want, plotted with Excel:
Note: This is a Python answer on the subject of sorting your input.
One way of doing this would be to use a two-dimensional list (a list of lists) and then sort it.
This is how you've been using it:
data = [date0,username0,randint0,date1,username1, ....
Try a two-dimensional list instead:
data = [[date0,username0,randint0], [date1,username1,randint1]...
Use the .sort() method and change the syntax to look like this:
data.sort() # Sorts in place; ascending by default, pass reverse=True for descending.
rows = data # Each element is already a [date, username, value] row.
The standard .sort() method has its limitations (floats for one), so if it doesn't return the desired output, try using the .sort() parameters (key and reverse). Here is some insight on the subject: How to use .sort()
If you are having trouble with floats, check an answer that will help you here.
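On the pandas side of the question, a stacked bar chart stacks columns in the order they appear in the frame, with the first column at the bottom. A minimal sketch, assuming that ordering the users by their total value is acceptable (ordering separately within each date would need a different approach), using a few of the rows listed in the question:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove for interactive use
import pandas as pd

rows = [
    ("2017-03-15", "SSL1_APP", 1515091),
    ("2017-03-16", "SSL1_APP", 1373827),
    ("2017-03-15", "KE77_APP", 284971),
    ("2017-03-16", "AF77_APP", 222588),
    ("2017-03-16", "MI77_APP", 222148),
    ("2017-03-15", "AF77_APP", 202224),
    ("2017-03-15", "MI77_APP", 191791),
]
df = pd.DataFrame(rows, columns=["Date", "Username", "Value"])
pivot_df = pd.pivot_table(df, index="Date", columns="Username", values="Value").fillna(0)

# Largest total first, so that user ends up nearest the x-axis
order = pivot_df.sum().sort_values(ascending=False).index
ax = pivot_df[order].plot(kind="bar", stacked=True, figsize=(15, 7))
```

Sorting `df` before pivoting has no effect on the chart because `pivot_table` sorts the column labels itself; reordering the pivoted columns is what changes the stacking.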
I have a dataframe where I want to plot the histograms of each column.
df_play = pd.DataFrame({'a':['cat','dog','cat'],'b':['apple','orange','orange']})
df_play['a'] = df_play['a'].astype('category')
df_play['b'] = df_play['b'].astype('category')
df_play
df_play.hist(layout = (12,10))
However, I'm getting ValueError: num must be 1 <= num <= 0, not 1
When I tried with integers instead of categories as the values, it worked fine, but I really want the names of the unique strings to be on the x-axis.
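You can just apply value_counts across columns and plot. (The top-level pd.value_counts is deprecated in recent pandas versions, so the Series method is used here.)
>>> df_play.apply(pd.Series.value_counts).T.stack().plot(kind='bar')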
If you want proper subplots or something more intricate, I'd suggest you just iterate with value_counts and create the subplots yourself.
Since there is no natural parameter for binning, perhaps what you want rather than histograms are bar plots of the value counts for each Series? If so, you can achieve that through
df_play['a'].value_counts().plot(kind='bar')
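As a quick illustration of what gets plotted, here is a small sketch that prints the counts before drawing them. The name `counts_a` is introduced only for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove for interactive use
import pandas as pd

df_play = pd.DataFrame({"a": ["cat", "dog", "cat"], "b": ["apple", "orange", "orange"]})
df_play["a"] = df_play["a"].astype("category")

# One count per unique string; these become the bar heights
counts_a = df_play["a"].value_counts()
print(counts_a.to_dict())

# The category names end up as the x-axis tick labels
ax = counts_a.plot(kind="bar")
```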
I realized that one way to do this is to first create the fig and axs, then loop through the column names of the dataframe whose value counts we want to plot.
import matplotlib.pyplot as plt
fig, axs = plt.subplots(1, len(df_play.columns), figsize=(10,6))
for i, x in enumerate(df_play.columns):
    df_play[x].value_counts().plot(kind='bar', ax=axs[i])