Convert pd.Grouper to human-readable format on plt.plot - python

I'm using Pandas and Matplotlib to plot some data from an SQL database.
Here are my steps:
fetch the data from the DB into a pd.DataFrame
group them using a Grouper('MS')
aggregate to count how many items are there in each group
draw the chart
df = df.groupby(pd.Grouper(key='published_at', freq='MS'))['id'].count()
ax = df.plot.bar(position=0.5, width=0.4, label="Items")
This is what my plot looks like:
I'd like to show the months as "2019-04" (year-month), but I can't figure out how to do it.
As I'm totally new to Python, any help would be greatly appreciated. Thank you!

The following works with your sample data, but may fail with a lot of dates:
tmp_df = df.resample('MS',on='published_at').id.count()
plt.figure(figsize=(10,6))
plt.bar(tmp_df.index.strftime("%Y-%m"), tmp_df)
plt.show()

Seems you just want to reformat the datetime column.
It also looks like you have already converted this column to a proper datetime type. If not, start from the conversion step; otherwise, skip straight to the strftime step.
# Convert to datetime
df['published_at'] = pd.to_datetime(df['published_at'])
# You can start from here, if you have already converted your column
df['published_at_YM'] = df['published_at'].dt.strftime('%Y-%m')
# The new column holds plain strings, so group on it directly (pd.Grouper with freq needs datetimes)
df = df.groupby('published_at_YM')['id'].count()
ax = df.plot.bar(position=0.5, width=0.4, label="Items")
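If you would rather keep the pandas bar plot from the question, another option is to leave the grouping untouched and only relabel the ticks. The sketch below assumes a small made-up DataFrame with the question's id and published_at columns; the sample dates are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data standing in for the rows fetched from the database
df = pd.DataFrame({
    "id": range(6),
    "published_at": pd.to_datetime([
        "2019-02-03", "2019-02-20", "2019-03-15",
        "2019-04-01", "2019-04-10", "2019-04-28",
    ]),
})

# Count items per month; the resulting index holds month-start timestamps
counts = df.groupby(pd.Grouper(key="published_at", freq="MS"))["id"].count()

# A pandas bar plot places one tick per bar, so labels can be replaced one-to-one
labels = list(counts.index.strftime("%Y-%m"))
ax = counts.plot.bar(width=0.4, label="Items")
ax.set_xticklabels(labels, rotation=45)
ax.legend()
plt.tight_layout()
```

Because the grouping still uses pd.Grouper, months without any items keep a bar of height zero, which grouping on a string column would silently drop.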

Related

Pandas - plotting multiple histograms [duplicate]

I need some guidance in working out how to plot a block of histograms from grouped data in a pandas dataframe. Here's an example to illustrate my question:
from pandas import DataFrame
import numpy as np
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
grouped = df.groupby('Letter')
In my ignorance I tried this code command:
df.groupby('Letter').hist()
which failed with the error message "TypeError: cannot concatenate 'str' and 'float' objects"
Any help most appreciated.
I'm on a roll, just found an even simpler way to do it using the by keyword in the hist method:
df['N'].hist(by=df['Letter'])
That's a very handy little shortcut for quickly scanning your grouped data!
For future visitors, the product of this call is the following chart:
One solution is to use the matplotlib histogram directly on each group. Iterating over the groupby object yields (name, sub-DataFrame) pairs, so you can create one histogram per group.
from pandas import DataFrame
import numpy as np
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
grouped = df.groupby('Letter')
import matplotlib.pyplot as plt
for name, group in grouped:
    # One figure per letter
    plt.figure()
    plt.hist(group['N'])
    plt.title(name)
    plt.show()
Your function is failing because the groupby dataframe you end up with has a hierarchical index and two columns (Letter and N) so when you do .hist() it's trying to make a histogram of both columns hence the str error.
This is the default behavior of pandas plotting functions (one plot per column) so if you reshape your data frame so that each letter is a column you will get exactly what you want.
df.reset_index().pivot(index='index', columns='Letter', values='N').hist()
The reset_index() call just moves the current index into a column called index. Then pivot takes your data frame, collects all of the values of N for each Letter and makes them a column. The resulting data frame has 1000 rows (one per original row, with NaN in the columns of the other letters) and three columns (A, B, C). hist() will then produce one histogram per column, and you can format the plots as needed.
With a recent version of pandas, you can do
df.N.hist(by=df.Letter)
Just like with the solutions above, the axes will be different for each subplot. I have not solved that one yet.
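For the differing axes, one thing that may help is the sharex and sharey arguments of DataFrame.hist. This is only a minimal sketch reusing the question's sample data; the fixed random seed and the bins value are additions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Same shape of data as in the question, with a fixed seed for reproducibility
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Letter": ["A"] * 300 + ["B"] * 400 + ["C"] * 300,
    "N": rng.standard_normal(1000),
})

# One subplot per letter; sharex/sharey ask pandas to link the axis limits
axes = df.hist(column="N", by="Letter", sharex=True, sharey=True, bins=20)
```

With linked axes the three panels can be compared at a glance, at the cost of some empty space in the panels whose data cover a narrower range.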
I write this answer because I was looking for a way to plot together the histograms of different groups. What follows is not very smart, but it works fine for me. I use Numpy to compute the histogram and Bokeh for plotting. I think it is self-explanatory, but feel free to ask for clarifications and I'll be happy to add details (and write it better).
import numpy as np
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show

# df_trips is assumed to have 'locality', 'means' and 'speed' columns
figures = {
    'Transit': figure(title='Transit', x_axis_label='speed [km/h]', y_axis_label='frequency'),
    'Driving': figure(title='Driving', x_axis_label='speed [km/h]', y_axis_label='frequency')
}
cols = {'Vienna': 'red', 'Turin': 'blue', 'Rome': 'Orange'}
# Each group key is a (locality, means) tuple
for (locality, means), group in df_trips.groupby(['locality', 'means']):
    fig = figures[means]
    # h holds the bin counts, b the bin edges
    h, b = np.histogram(group['speed'].values)
    fig.vbar(x=b[1:], top=h, width=(b[1]-b[0]), legend_label=locality, fill_color=cols[locality], alpha=0.5)
show(gridplot([
    [figures['Transit']],
    [figures['Driving']],
]))
I find this even easier and faster. Note that it plots a histogram of the per-group row counts, not of the N values themselves.
data_df.groupby('Letter').count()['N'].hist(bins=100)

Heat map from pandas DataFrame - 2D array

I have a data visualisation-based question. I basically want to create a heatmap from a pandas DataFrame, where I have the x,y coordinates and the corresponding z value. The data can be created with the following code -
data = ([[0.2,0.2,24],[0.2,0.6,8],[0.2,2.4,26],[0.28,0.2,28],[0.28,0.6,48],[0.28,2.4,55],[0.36,0.2,34],[0.36,0.6,46],[0.36,2.4,55]])
data=np.array(data)
df=pd.DataFrame(data,columns=['X','Y','Z'])
Please note that I have converted an array into a DataFrame just so that I can give an example of an array. My actual data set is quite large and I import into python as a DataFrame. After processing the DataFrame, I have it available as the format given above.
I have seen the other questions based on the same problem, but they do not seem to be working for my particular problem. Or maybe I am not applying them correctly. I want my results to be similar to what is given here https://plot.ly/python/v3/ipython-notebooks/cufflinks/#heatmaps
Any help would be welcome.
Thank you!
I found one way of doing this, using seaborn:
import numpy as np
import pandas as pd
import seaborn as sns
data = ([[0.2,0.2,24],[0.2,0.6,8],[0.2,2.4,26],[0.28,0.2,28],[0.28,0.6,48],[0.28,2.4,55],[0.36,0.2,34],[0.36,0.6,46],[0.36,2.4,55]])
data=np.array(data)
df=pd.DataFrame(data,columns=['X','Y','Z'])
# Keyword arguments are required by recent pandas versions
df=df.pivot(index='X', columns='Y', values='Z')
ax = sns.heatmap(df)
Returns the following image -
Sorry for creating another question.
Also, thank you for the link to a related post.
How about using plotnine, a Grammar of Graphics implementation for Python?
data
data = ([[0.2,0.2,24],[0.2,0.6,8],[0.2,2.4,26],[0.28,0.2,28],[0.28,0.6,48],[0.28,2.4,55],[0.36,0.2,34],[0.36,0.6,46],[0.36,2.4,55]])
data=np.array(data)
df=pd.DataFrame(data,columns=['X','Y','Z'])
Prepare data
df['rows'] = ['row' + str(n) for n in range(0,len(df.index))]
dfMelt = pd.melt(df, id_vars = 'rows')
Make heatmap
from plotnine import ggplot, aes, geom_tile, theme, element_blank, labs
ggplot(dfMelt, aes('variable', 'rows', fill='value')) + \
    geom_tile(aes(width=1, height=1)) + \
    theme(axis_ticks=element_blank(),
          panel_background = element_blank()) + \
    labs(x = '', y = '', fill = '')
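If neither seaborn nor plotnine is available, plain Matplotlib may be enough. The sketch below pivots the question's sample data into a grid and draws it with imshow; treating X as rows and Y as columns is an assumption about the desired layout.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = np.array([[0.2, 0.2, 24], [0.2, 0.6, 8], [0.2, 2.4, 26],
                 [0.28, 0.2, 28], [0.28, 0.6, 48], [0.28, 2.4, 55],
                 [0.36, 0.2, 34], [0.36, 0.6, 46], [0.36, 2.4, 55]])
df = pd.DataFrame(data, columns=["X", "Y", "Z"])

# Rows are the unique X values, columns the unique Y values, cells hold Z
grid = df.pivot(index="X", columns="Y", values="Z")

fig, ax = plt.subplots()
im = ax.imshow(grid.to_numpy(), aspect="auto", origin="lower")

# imshow draws equally sized cells, so label them with the real coordinates
ax.set_xticks(range(len(grid.columns)))
ax.set_xticklabels([str(v) for v in grid.columns])
ax.set_yticks(range(len(grid.index)))
ax.set_yticklabels([str(v) for v in grid.index])
ax.set_xlabel("Y")
ax.set_ylabel("X")
fig.colorbar(im, ax=ax, label="Z")
```

Note that imshow ignores the uneven spacing of the Y values (0.2, 0.6, 2.4); if the true spacing matters, ax.pcolormesh with explicit cell edges would be the thing to look at.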

Seaborn stripplot of datetime objects not working

Following the first example from URL:
http://seaborn.pydata.org/tutorial/categorical.html
I am able to load the dataset called 'tips' and reproduce the stripplot shown. However, this plot is not shown when applied to my pandas dataframe (called df), which consists of datetime objects. My df consists of 19300 rows and 7 columns, of which 2 columns are in the form of datetime objects (dates and times respectively). I would like to use the Python Seaborn package's stripplot function to visualize these two df columns together. My code reads as follows:
sns.stripplot(x=df['DATE'], y=df['TIME'], data=df);
And the output error reads as follows:
TypeError: float() argument must be a string or a number
I have made sure to remove the header from the data columns before applying the plotting command.
Other failed attempts include (but are not limited to):
sns.stripplot(x=df['DATE'], y=df['TIME']);
My guess is that this error is due to the datetime object nature of the column data types, and that this type must somehow be changed into either strings or integer values. Is this correct? And how might one proceed to accomplish this task?
To illustrate the df data, here is a working code which uses matplotlib.pyplot (as plt)
ax1.plot(x, y, 'o', label='Events')
Any help is much appreciated.
One can also try to convert dates/times into seconds to plot them as numeric values:
dates = df.DATE
times = df.TIME
start_date = dates.min()
dates_as_seconds = dates.map(lambda d: (d - start_date).total_seconds())
times_as_seconds = times.map(lambda t: t.second + t.minute*60 + t.hour*3600)
ax = sns.stripplot(x=dates_as_seconds, y=times_as_seconds)
ax.set_xticklabels(dates)
ax.set_yticklabels(times)
Of course, the data frame should be sorted by dates and times so that the ticks match the values.
After applying the following code to the previous script:
x = df['DATE']
data = df['TIME']
y = data[1:len(x)]
x = x[1:len(x)]
# Turn each time such as 12:34:56 into the integer 123456
s = []
for time in y:
    a = int(str(time).replace(':',''))
    s.append(a)
# Turn each date into a plain string
k = []
for date in x:
    a = str(date)
    k.append(a)
x = k
y = s
stripplot worked:
sns.stripplot(x=x, y=y)
You just need to pass the column names as x and y, not the data themselves. For example:
sns.stripplot(x="value", y="measurement", hue="species",
              data=iris, dodge=True, alpha=.25, zorder=1)
https://seaborn.pydata.org/examples/jitter_stripplot.html
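Applied to the question's DATE and TIME columns, that could look roughly like the sketch below. The three sample rows are made up, and the helper column names day and hour are hypothetical; the idea is simply to hand seaborn a string category for x and a plain number for y.

```python
import datetime as dt

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the asker's df (dates and times as datetime objects)
df = pd.DataFrame({
    "DATE": pd.to_datetime(["2017-01-01", "2017-01-01", "2017-01-02"]),
    "TIME": [dt.time(8, 30), dt.time(14, 15), dt.time(23, 0)],
})

# Dates become string categories, times become fractional hours of the day
plot_df = pd.DataFrame({
    "day": df["DATE"].dt.strftime("%Y-%m-%d"),
    "hour": df["TIME"].map(lambda t: t.hour + t.minute / 60 + t.second / 3600),
})

# Column names go to x and y, the frame itself goes to data
ax = sns.stripplot(x="day", y="hour", data=plot_df)
```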

Plot dates vs values in Python

I have a very standard dataset with 2 columns, 1 for dates and 1 for values. I've put them into two arrays:
dates
['1/1/2014', '1/2/2014', '1/3/2014', ...]
values
[1423, 4321, 1234, ...]
How can I create a simple line graph with "values" on the y-axis and "dates" on the x-axis?
What I've tried:
I can do a "Hello world" line plot with only "values" in just 1 line (awesome):
import numpy as np
import matplotlib.pyplot as plt
plt.plot(values)
Next, let's add "dates" as the x-axis. At this point I'm stuck. How can I transform my "dates" array, which is strings, into an array that is plottable?
Looking at examples, I believe we are supposed to cast the strings into Python Date objects. Let's import those libraries:
import datetime
import matplotlib.dates
Ok, so now I can transform a string into a date with time.strptime(dates[1], '%m/%d/%Y'). How can I transform the entire array? I could write a loop, but I assume there is a better way.
I'm not 100% sure I'm even on the right path to making something usable for "Dates" vs "values". If you know the code to make this graph (I'm assuming it's very basic once you know Python + the libraries), please let me know.
Ok, knew it was a 1 liner. Here is how to do it:
So you start with your values and dates (as strings) arrays:
dates = ['1/1/2014', '1/2/2014', '1/3/2014', ...]
values = [1423, 4321, 1234, ...]
Then turn your dates, which are strings, into date objects:
from datetime import datetime
date_objects = [datetime.strptime(date, '%m/%d/%Y').date() for date in dates]
Then just plot.
plt.plot(date_objects, values)
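If pandas is available (an assumption, since the question only mentions NumPy and Matplotlib), the conversion can also be done without a list comprehension. A minimal sketch using only the three sample entries shown in the question:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Only the three entries visible in the question; the rest of the data is elided there
dates = ['1/1/2014', '1/2/2014', '1/3/2014']
values = [1423, 4321, 1234]

# Parse all strings in one call; the explicit format avoids day/month ambiguity
series = pd.Series(values, index=pd.to_datetime(dates, format="%m/%d/%Y"))

fig, ax = plt.subplots()
ax.plot(series.index, series.values)
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```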
Remember you can use xticks and put whatever you want in place of the values. So, you can do:
import matplotlib.pyplot as plt
xVals = range(len(values))
plt.plot(xVals, values)
plt.xticks(xVals, dates)
This is ok if your dates are evenly spaced, as yours are. Otherwise, you can always compute a number that quantifies each date (for example the number of days from the first day; see the datetime module) and use that for xVals.
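So here's my code:
import time
from datetime import datetime
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
# dates and values are the two lists from the question
date = []
for time_val in dates:
    # Parse the string and keep only (year, month, day)
    date.append(datetime(*time.strptime(time_val, "%m/%d/%Y")[:3]))
fig, ax = plt.subplots()
# Plot the points as markers against real datetime objects
ax.plot(date, values, 'o')
date_format = mdates.DateFormatter("%m/%d/%Y")
ax.xaxis.set_major_formatter(date_format)
ax.autoscale_view()
title_name = 'plot based on time'
ax.set_title(title_name)
y_label = 'values'
ax.set_ylabel(y_label)
ax.grid(True)
fig.autofmt_xdate()
plt.show()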
