Scatter plotting data from two different data frames in python - python

I have two different data frames in the following format.
dfclean
Out[1]:
obj
0 682
1 101
2 33
dfmalicious
Out[2]:
obj
0 17
1 43
2 8
3 9
4 211
My use case is to plot a single scatter graph that distinctly shows the obj values from both dataframes. I am using Python for this. I looked at a few examples where two columns of the same dataframe were plotted, but couldn't replicate them for my use case. Any help is greatly appreciated.
How to plot two DataFrame on same graph for comparison

To plot multiple column groups on a single axes, repeat the plot method, passing the target ax each time.
Option 1]
In [2391]: ax = dfclean.reset_index().plot(kind='scatter', x='index', y='obj',
color='Red', label='G1')
In [2392]: dfmalicious.reset_index().plot(kind='scatter', x='index', y='obj',
color='Blue', label='G2', ax=ax)
Out[2392]: <matplotlib.axes._subplots.AxesSubplot at 0x2284e7b8>
Option 2]
In [2399]: dff = dfmalicious.merge(dfclean, right_index=True, left_index=True,
how='outer').reset_index()
In [2406]: dff
Out[2406]:
index obj_x obj_y
0 0 17 682.0
1 1 43 101.0
2 2 8 33.0
3 3 9 NaN
4 4 211 NaN
In [2400]: ax = dff.plot(kind='scatter', x='index', y='obj_x', color='Red', label='G1')
In [2401]: dff.plot(kind='scatter', x='index', y='obj_y', color='Blue', label='G2', ax=ax)
Out[2401]: <matplotlib.axes._subplots.AxesSubplot at 0x11dbe1d0>
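The same idea can also be written against matplotlib directly. The following is a minimal sketch, not the answerer's code: it rebuilds the two frames from the question and calls ax.scatter twice on one axes. The label names "clean" and "malicious" and the Agg backend line are my own choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Data copied from the question.
dfclean = pd.DataFrame({"obj": [682, 101, 33]})
dfmalicious = pd.DataFrame({"obj": [17, 43, 8, 9, 211]})

fig, ax = plt.subplots()
# Each call adds one point collection to the same axes.
ax.scatter(dfclean.index, dfclean["obj"], color="red", label="clean")
ax.scatter(dfmalicious.index, dfmalicious["obj"], color="blue", label="malicious")
ax.set_xlabel("index")
ax.set_ylabel("obj")
ax.legend()
```

Because the frames have different lengths, no merge is needed here; each frame simply contributes its own points. Call plt.show() or fig.savefig(...) to view the result.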

Related

Bar plot not appearing normally using df.plot.bar()

I have the following code. I am trying to loop through variables (dataframe columns) and create bar plots. I have attached below an example of a graph for the column newerdf['age'].
I believe this should produce 3 bars (one for each option - male (value = 1), female (value = 2), other(value = 3)).
However, the graph below does not seem to show this.
I would be so grateful for a helping hand as to where I am going wrong!
listedvariables = ['age','gender-quantised','hours_of_sleep','frequency_of_alarm_usage','nap_duration_mins','frequency_of_naps','takes_naps_yes/no','highest_education_level_acheived','hours_exercise_per_week_in_last_6_months','drink_alcohol_yes/no','drink_caffeine_yes/no','hours_exercise_per_week','hours_of_phone_use_per_week','video_game_phone/tablet_hours_per_week','video_game_all_devices_hours_per_week']
for i in range(0,len(listedvariables)):
    fig = newerdf[[listedvariables[i]]].plot.bar(figsize=(30,20))
    fig.tick_params(axis='x',labelsize=40)
    fig.tick_params(axis='y',labelsize=40)
    plt.tight_layout()
newerdf['age']
age
0 2
1 2
2 4
3 3
5 2
... ...
911 2
912 1
913 2
914 3
915 2
The data are not grouped into categories yet, so a value count is needed before calling the plotting method:
for var in listedvariables:
    ax = newerdf[var].value_counts().plot.bar(figsize=(30,20))
    ax.tick_params(axis='x', labelsize=40)
    ax.tick_params(axis='y', labelsize=40)
    plt.tight_layout()
    plt.show()
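To see what value_counts contributes, here is a small sketch on made-up data. The Series below is a hypothetical stand-in for one coded column of newerdf, and the sort_index call is my addition so that the bars appear in category order rather than by frequency.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for one coded column of newerdf.
gender = pd.Series([1, 2, 2, 3, 2, 1, 2], name="gender-quantised")

# One row per category, ordered by category code instead of by frequency.
counts = gender.value_counts().sort_index()
ax = counts.plot.bar()
ax.set_xlabel(gender.name)
ax.set_ylabel("count")
```

Without the value_counts step, plot.bar draws one bar per row of the frame, which is why the original plot did not show three bars.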

Colour by Category in scatterplot

My dataframe looks like this:
date index count weekday_num max_temperature_C
0 2019-04-01 0 1379 0 18
1 2019-04-02 1 1395 1 21
2 2019-04-03 2 1155 2 19
3 2019-04-04 3 342 3 18
4 2019-04-05 4 216 4 14
I would like to plot count vs max_temperature_C and colour by weekday_num
I have tried the below:
#create the scatter plot of trips vs Temp
plt.scatter(comb2['count'], comb2['max_temperature_C'], c=comb2['weekday_num'])
# Label the axis
plt.xlabel('Daily Trip count')
plt.ylabel('Max Temp c')
plt.legend(['weekday_num'])
# Show it!
plt.show()
However I am not sure quite how to get the legend to display all of the colours which correspond to each of the 'weekday_num' ?
Thanks
You can use the automated legend creation like this:
fig, ax = plt.subplots()
scatter = ax.scatter(comb2['count'], comb2['max_temperature_C'], c=comb2['weekday_num'])
# produce a legend with the unique colors from the scatter
legend = ax.legend(*scatter.legend_elements(),
                   loc="upper right", title="Weekday num")
ax.add_artist(legend)
plt.show()
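A self-contained version of this approach, using the five rows shown in the question, might look like the sketch below. It assumes a matplotlib version that provides legend_elements on scatter collections (3.1 or later, as far as I know).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# The five rows from the question.
comb2 = pd.DataFrame({
    "count": [1379, 1395, 1155, 342, 216],
    "weekday_num": [0, 1, 2, 3, 4],
    "max_temperature_C": [18, 21, 19, 18, 14],
})

fig, ax = plt.subplots()
scatter = ax.scatter(comb2["count"], comb2["max_temperature_C"],
                     c=comb2["weekday_num"])
# legend_elements returns one handle and one label per unique colour value.
handles, labels = scatter.legend_elements()
ax.legend(handles, labels, loc="upper right", title="Weekday num")
```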

legends not print fully when multiple plots are plotted on same figure

I have the code as below to plot multiple plots on the same figure
fig, ax = plt.subplots(figsize=(25, 10))
def wl_ratioplot(wavelength1,wavelength2, dataframe, x1=0.1,x2=1.5,y1=-500,y2=25000):
    a=dataframe[['asphalt_index','layer_thickness',wavelength1,wavelength2]].copy()
    sns.scatterplot(x=a[wavelength1]/a[wavelength2],y=a['layer_thickness'],data=a)
    ax.set_xlim(x1,x2)
    ax.set_ylim(y1,y2)
    leg = "{} vs {}".format(wavelength1,wavelength2)
    print(leg) #this line is only to see the variable legend has the proper content
    ax.legend(leg)
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
I get the plot as the below pic where the legend seems to be first 5 letters separately even though the variable legend has the right content
There was another similar question & the solution was to put a square bracket to the variable legend. I tried this with the code as below.
fig, ax = plt.subplots(figsize=(25, 10))
def wl_ratioplot(wavelength1,wavelength2, dataframe, x1=0.1,x2=1.5,y1=-500,y2=25000):
    a=dataframe[['asphalt_index','layer_thickness',wavelength1,wavelength2]].copy()
    sns.scatterplot(x=a[wavelength1]/a[wavelength2],y=a['layer_thickness'],data=a)
    ax.set_xlim(x1,x2)
    ax.set_ylim(y1,y2)
    leg = "{} vs {}".format(wavelength1,wavelength2)
    print(leg)#this line is only to see the variable legend has the proper content
    ax.legend([leg])
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
Now I get the full legend but only the first legend is shown as the pic below
Can someone let me know how to get the full legend for all the plots? Thanks.
dummy data (the plot in pic will NOT match)
14nm 15nm 16nm 17nm 18nm 19nm layer_thickness
1 2 3 4 5 6 0
1 2 3 4 5 6 0
3 5 7 9 11 13 5700
1 2 3 4 5 6 0
3 5 7 9 11 13 8600
1 2 3 4 5 6 0
3 5 7 9 11 13 5000
1 2 3 4 5 6 0
45 55 65 75 85 95 100
1 2 3 4 5 6 0
8 15 22 29 36 43 16600
wave_lengths=['15nm','16nm','14nm','18nm']
Answer Update
Based on answer from Quang Hoang. The output pics using scatter plot from matplotlib & sns.scatterplot
With plt it is pretty natural:
def wl_ratioplot(wavelength1,wavelength2, dataframe,
                 x1=0.1,x2=1.5,y1=-500,y2=25000,
                 ax=None):
    leg = "{} vs {}".format(wavelength1,wavelength2)
    # set the label here, and let plt deal with it
    # also, you don't need to copy the dataframe:
    ax.scatter(x=dataframe[wavelength1]/dataframe[wavelength2],
               y=dataframe['layer_thickness'],label=leg)
    ax.set_xlim(x1,x2)
    ax.set_ylim(y1,y2)
fig, ax = plt.subplots(figsize=(25, 10))
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
ax.legend()
Output:
Every time you call the function wl_ratioplot, the legend is reset, so only the label from the last call remains. Store all the labels in a list instead and draw the legend once at the end.
ax.legend([leg]) # this resets the legend on every call
Keep a list outside the function:
legends = []
and append each label inside the function:
legends.append(leg)
After all the function calls, draw the legend once:
ax.legend(legends)
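As a rough illustration of the list-based approach, the sketch below collects one label per call and builds the legend once. The function ratio_plot and its data are hypothetical simplifications of wl_ratioplot, not the asker's code; note that the label string itself is appended, not a one-element list.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop for interactive use
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
legends = []  # filled by every call, read once at the end

def ratio_plot(x, y, name):
    # Hypothetical simplified stand-in for wl_ratioplot.
    ax.scatter(x, y)
    legends.append(name)

ratio_plot([1, 2, 3], [0, 5700, 8600], "14nm vs 15nm")
ratio_plot([2, 3, 4], [0, 5000, 100], "15nm vs 16nm")

# One label per scatter, matched to the plotted collections in call order.
legend = ax.legend(legends)
```

Passing label= to each plotting call and then calling ax.legend() with no arguments achieves the same result without the shared list, which is usually less error-prone.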

How to make a date-based color bar based on df.idxmax series?

Python beginner/first poster here.
I'm running into trouble adding color bars to scatter plots. I have two types of plot: one that shows all the data color-coded by date, and one that shows just the maximum values of my data color-coded by date. In the first case, I can use the df.index (which is datetime) to make my color bar, but in the second case, I am using df2['col'].idxmax to generate the colors because my df2 is a df.groupby object which I'm using to generate the daily maximums in my data, and it does not have an accessible index.
For the first type of plot, I have succeeded in generating a date-based color bar with the code below, cobbled together from online examples:
fig, ax = plt.subplots(1,1, figsize=(20,20))
smap=plt.scatter(df.col1, df.col2, s=140,
c=[date2num(i.date()) for i in df.index],
marker='.')
cb = fig.colorbar(smap, orientation='vertical',
format=DateFormatter('%d %b %y'))
However for the second type of plot, where I am trying to use df2['col'].idxmax to create the date series instead of df.index, the following does not work:
for n in cols1:
    for m in cols2:
        fig, ax = plt.subplots(1,1, figsize=(15,15))
        maxTimes=df2[n].idxmax()
        PlottableTimes=maxTimes.dropna() #some NaNs in the
        #.idxmax series were giving date2num trouble
        smap2=plt.scatter(df2[n].max(), df2[m].max(),
                          s=160, c=[date2num(i.date()) for i in PlottableTimes],
                          marker='.')
        cb2 = fig.colorbar(smap2, orientation='vertical',
                           format=DateFormatter('%d %b %y'))
        plt.show()
The error is: 'length of rgba sequence should be either 3 or 4'
Because the error was complaining of the color argument, I separately checked the output of the color (that is, c=) arguments in the respective plotting commands, and both look similar to me, so I can't figure out why one color argument works and the other doesn't:
one that works:
[736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
...]
one that doesn't work:
[736845.0,
736846.0,
736847.0,
736848.0,
736849.0,
736850.0,
736851.0,
736852.0,
736853.0,
736854.0,
...]
Any suggestions or explanations? I'm running python 3.5.2. Thank you in advance for helping me understand this.
Edit 1: I made the following example for others to explore, and in the process realized the crux of the issue is different than my first question. The code below works the way I want it to:
df=pd.DataFrame(np.random.randint(low=0, high=10, size=(169, 8)),
columns=['a', 'b', 'c', 'd', 'e','f','g','h']) #make sample data
date_rng = pd.date_range(start='1/1/2018', end='1/8/2018', freq='H')
df['i']=date_rng
df = df.set_index('i') #get a datetime index
df['ts']=date_rng #get a datetime column to group by
from pandas import Grouper
df2=df.groupby(Grouper(key='ts', freq='D'))
for n in ['a','b','c','d']: #now make some plots
    for m in ['e','f','g','h']:
        print(m)
        print(n)
        fig, ax = plt.subplots(1,1, figsize=(5,5))
        maxTimes=df2[n].idxmax()
        PlottableTimes=maxTimes.dropna()
        smap=plt.scatter(df2[n].max(), df2[m].max(), s=160,
                         c=[date2num(i.date()) for i in PlottableTimes],
                         marker='.')
        cb = fig.colorbar(smap, orientation='vertical',
                          format=DateFormatter('%d %b %y'))
        plt.show()
The only difference between my real data and this example is that my real data has many NaNs scattered throughout. So, I think what is going wrong is that the 'c=' argument isn't long enough for the plotting command to interpret it as covering the whole date range...? For example, if I manually put in the output of the c= command, I get the following code which also works:
for n in ['a','b','c','d']:
    for m in ['e','f','g','h']:
        print(m)
        print(n)
        fig, ax = plt.subplots(1,1, figsize=(5,5))
        maxTimes=df2[n].idxmax()
        PlottableTimes=maxTimes.dropna()
        smap=plt.scatter(df2[n].max(), df2[m].max(), s=160,
                         c=[736809.0, 736810.0, 736811.0, 736812.0, 736813.0, 736814.0, 736815.0, 736816.0],
                         marker='.')
        cb = fig.colorbar(smap, orientation='vertical',
                          format=DateFormatter('%d %b %y'))
        plt.show()
But, if I shorten the c= array by some amount, to emulate what is happening in my code when NaNs are being dropped from idxmax, it gives the same error I am seeing:
for n in ['a','b','c','d']:
    for m in ['e','f','g','h']:
        print(m)
        print(n)
        fig, ax = plt.subplots(1,1, figsize=(5,5))
        maxTimes=df2[n].idxmax()
        PlottableTimes=maxTimes.dropna()
        smap=plt.scatter(df2[n].max(), df2[m].max(), s=160,
                         c=[736809.0, 736810.0, 736811.0, 736812.0, 736813.0, 736814.0],
                         marker='.')
        cb = fig.colorbar(smap, orientation='vertical',
                          format=DateFormatter('%d %b %y'))
        plt.show()
So this means the real question is: how can I grab the grouper column after grouping from the groupby object, when none of the columns appear to be grab-able with df2.col? I would like to be able to grab 'ts' from the following and use it to be the color data, instead of using idxmax:
df2['a'].max()
ts
2018-01-01 9
2018-01-02 9
2018-01-03 9
2018-01-04 9
2018-01-05 9
2018-01-06 9
2018-01-07 9
2018-01-08 8
Freq: D, Name: a, dtype: int64
Essentially, your Grouper call is similar to setting your datetime column as the index and calling pandas.DataFrame.resample with the aggregate function:
df.set_index('ts').resample('D').max()
# a b c d e f g h
# ts
# 2018-01-01 9 9 8 9 9 9 9 9
# 2018-01-02 9 9 9 9 9 9 9 9
# 2018-01-03 9 9 9 9 9 9 9 9
# 2018-01-04 9 9 9 9 9 9 9 9
# 2018-01-05 9 9 9 9 9 9 9 9
# 2018-01-06 9 9 9 8 9 9 9 9
# 2018-01-07 9 9 9 9 9 9 9 9
# 2018-01-08 2 8 6 3 1 3 2 7
Therefore, df2['a'].max() returns a pandas Series indexed by the daily dates, so it carries the index property, which you can use for the color bar specification:
df2['a'].max().index
# DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
# '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
# dtype='datetime64[ns]', name='ts', freq='D')
From there you can pass into date2num without list comprehension:
date2num(df2['a'].max().index)
# array([736695., 736696., 736697., 736698., 736699., 736700., 736701., 736702.])
Altogether, use the above inside the loop; maxTimes and PlottableTimes are no longer needed:
fig, ax = plt.subplots(1, 1, figsize = (5,5))
smap = plt.scatter(df2[n].max(), df2[m].max(), s = 160,
                   c = date2num(df2[n].max().index),
                   marker = '.')
cb = fig.colorbar(smap, orientation = 'vertical',
                  format = DateFormatter('%d %b %y'))
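The key point is that the colour array is derived from the same aggregated Series as the plotted values, so its length always matches the number of points. A minimal sketch on made-up data follows; the column names and the three-day range are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from matplotlib.dates import date2num

# Three days of hourly sample data (made up for illustration).
ts = pd.date_range("2018-01-01", periods=72, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"ts": ts, "a": np.arange(72) % 10, "e": np.arange(72) % 7})

df2 = df.groupby(pd.Grouper(key="ts", freq="D"))
daily_max_a = df2["a"].max()  # Series indexed by day
daily_max_e = df2["e"].max()

# One colour value per plotted point, taken from the index of the plotted data.
colors = date2num(daily_max_a.index)
```

The resulting colors array can be passed as c= to plt.scatter(daily_max_a, daily_max_e, ...), as in the answer above.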

Matplotlib showing x-tick labels overlapping

Have a look at the graph below:
It's a subplot of this larger figure:
I see two problems with it. First, the x-axis labels overlap with one another (this is my major issue). Second, the location of the x-axis minor gridlines seems a bit wonky. On the left of the graph, they look properly spaced. But on the right, they seem to be crowding the major gridlines...as if the major gridline locations aren't proper multiples of the minor tick locations.
My setup is that I have a DataFrame called df which has a DatetimeIndex on the rows and a column called value which contains floats. I can provide an example of the df contents in a gist if necessary. A dozen or so lines of df are at the bottom of this post for reference.
Here's the code that produces the figure:
now = dt.datetime.now()
fig, axes = plt.subplots(2, 2, figsize=(15, 8), dpi=200)
for i, d in enumerate([360, 30, 7, 1]):
    ax = axes.flatten()[i]
    earlycut = now - relativedelta(days=d)
    data = df.loc[df.index>=earlycut, :]
    ax.plot(data.index, data['value'])
    ax.xaxis_date()
    ax.get_xaxis().set_minor_locator(mpl.ticker.AutoMinorLocator())
    ax.get_yaxis().set_minor_locator(mpl.ticker.AutoMinorLocator())
    ax.grid(b=True, which='major', color='w', linewidth=1.5)
    ax.grid(b=True, which='minor', color='w', linewidth=0.75)
What is my best option here to get the x-axis labels to stop overlapping each other (in each of the four subplots)? Also, separately (but less urgently), what's up with the minor tick issue in the top-left subplot?
I am on Pandas 0.13.1, numpy 1.8.0, and matplotlib 1.4.x.
Here's a small snippet of df for reference:
id scale tempseries_id value
timestamp
2014-11-02 14:45:10.302204+00:00 7564 F 1 68.0000
2014-11-02 14:25:13.532391+00:00 7563 F 1 68.5616
2014-11-02 14:15:12.102229+00:00 7562 F 1 68.9000
2014-11-02 14:05:13.252371+00:00 7561 F 1 69.0116
2014-11-02 13:55:11.792191+00:00 7560 F 1 68.7866
2014-11-02 13:45:10.782227+00:00 7559 F 1 68.6750
2014-11-02 13:35:10.972248+00:00 7558 F 1 68.4500
2014-11-02 13:25:10.362213+00:00 7557 F 1 68.1116
2014-11-02 13:15:10.822247+00:00 7556 F 1 68.2250
2014-11-02 13:05:10.102200+00:00 7555 F 1 68.5616
2014-11-02 12:55:10.292217+00:00 7554 F 1 69.0116
2014-11-02 12:45:10.382226+00:00 7553 F 1 69.3500
2014-11-02 12:35:10.642245+00:00 7552 F 1 69.2366
2014-11-02 12:25:12.642255+00:00 7551 F 1 69.1250
2014-11-02 12:15:11.122382+00:00 7550 F 1 68.7866
2014-11-02 12:05:11.332224+00:00 7549 F 1 68.5616
2014-11-02 11:55:11.662311+00:00 7548 F 1 68.2250
2014-11-02 11:45:11.122193+00:00 7547 F 1 68.4500
2014-11-02 11:35:11.162271+00:00 7546 F 1 68.7866
2014-11-02 11:25:12.102211+00:00 7545 F 1 69.2366
2014-11-02 11:15:10.422226+00:00 7544 F 1 69.4616
2014-11-02 11:05:11.412216+00:00 7543 F 1 69.3500
2014-11-02 10:55:10.772212+00:00 7542 F 1 69.1250
2014-11-02 10:45:11.332220+00:00 7541 F 1 68.7866
2014-11-02 10:35:11.332232+00:00 7540 F 1 68.5616
2014-11-02 10:25:11.202411+00:00 7539 F 1 68.2250
2014-11-02 10:15:11.932326+00:00 7538 F 1 68.5616
2014-11-02 10:05:10.922229+00:00 7537 F 1 68.9000
2014-11-02 09:55:11.602357+00:00 7536 F 1 69.3500
Edit: Trying fig.autofmt_xdate():
I don't think this is going to do the trick. It seems to use the same x-tick labels for both graphs on the left and also for both graphs on the right, which is not correct given my data. Please see the problematic output below:
Ok, finally got it working. The trick was to use plt.setp to manually rotate the tick labels. Using fig.autofmt_xdate() did not work as it does some unexpected things when you have multiple subplots in your figure. Here's the working code with its output:
for i, d in enumerate([360, 30, 7, 1]):
    ax = axes.flatten()[i]
    earlycut = now - relativedelta(days=d)
    data = df.loc[df.index>=earlycut, :]
    ax.plot(data.index, data['value'])
    ax.get_xaxis().set_minor_locator(mpl.ticker.AutoMinorLocator())
    ax.get_yaxis().set_minor_locator(mpl.ticker.AutoMinorLocator())
    ax.grid(b=True, which='major', color='w', linewidth=1.5)
    ax.grid(b=True, which='minor', color='w', linewidth=0.75)
    plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')
fig.tight_layout()
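For readers without the original data, the per-subplot rotation can be tried with the following sketch. The hourly series is made up, and the window lengths differ from the original code; only the plt.setp call is taken from the answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up hourly series standing in for df['value'].
idx = pd.date_range("2014-10-01", periods=24 * 30, freq=pd.Timedelta(hours=1))
values = pd.Series(np.linspace(68.0, 70.0, idx.size), index=idx)

fig, axes = plt.subplots(2, 2, figsize=(15, 8))
for ax, days in zip(axes.flatten(), [30, 14, 7, 1]):
    data = values[values.index >= idx[-1] - pd.Timedelta(days=days)]
    ax.plot(data.index, data.values)
    # Rotate this subplot's own labels; every subplot keeps its own x-axis.
    plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment="right")
fig.tight_layout()
```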
By the way, the comment earlier about some matplotlib things taking forever is very interesting here. I'm using a raspberry pi to act as a weather station at a remote location. It's collecting the data and serving the results via the web. And boy oh boy, it's really wheezing trying to put out these graphics.
Due to the way text rendering is handled in matplotlib, auto-detecting overlapping text really slows things down. (The space that text takes up can't be accurately calculated until after it's been drawn.) For that reason, matplotlib doesn't try to do this automatically.
Therefore, it's best to rotate long tick labels. Because dates most commonly have this problem, there's a figure method fig.autofmt_xdate() that will (among other things) rotate the tick labels to make them a bit more readable. (Note: If you're using a pandas plot method, it returns an axes object, so you'll need to use ax.figure.autofmt_xdate().)
As a quick example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
time = pd.date_range('01/01/2014', '4/01/2014', freq='H')
values = np.random.normal(0, 1, time.size).cumsum()
fig, ax = plt.subplots()
ax.plot_date(time, values, marker='', linestyle='-')
fig.autofmt_xdate()
plt.show()
If we were to leave fig.autofmt_xdate() out:
And if we use fig.autofmt_xdate():
For cases where the x-axis values are strings rather than dates, you can insert a \n character into the labels so they don't overlap. Here is an example:
The data frame is
somecol value
category 1 of column 16
category 2 of column 13
category 3 of column 21
category 4 of column 20
category 5 of column 11
category 6 of column 22
category 7 of column 19
category 8 of column 14
category 9 of column 18
category 10 of column 23
category 11 of column 10
category 12 of column 24
category 13 of column 17
category 14 of column 15
category 15 of column 12
I need to plot value on y axis and somecol on x axis, which will normally be plotted like this -
As you can see, there is a lot of overlap. Now introduce \n character in somecol column.
somecol = df['somecol'].values.tolist()
for i in range(len(somecol)):
    x = somecol[i].split(' ')
    # insert \n before 'of'
    x.insert(x.index('of'),'\n')
    somecol[i] = ' '.join(x)
Now if you plot, it will look like this:
plt.plot(somecol, df['value'])
This method works well if you don't want to rotate your labels.
The only drawback I have found so far is that you may need to tweak the labels three or four times, trying different line breaks, before the plot looks right.
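The label rewriting can also be packaged as a small helper. This is only a sketch: break_before is a hypothetical name, and it uses str.replace rather than the split and insert steps above, so the break carries no stray spaces around it.

```python
def break_before(label, word="of"):
    """Return label with a newline inserted before the first standalone `word`."""
    return label.replace(f" {word} ", f"\n{word} ", 1)

# Labels in the same shape as the somecol column above.
labels = [f"category {i} of column" for i in range(1, 16)]
wrapped = [break_before(label) for label in labels]
```

The wrapped list can then be used as the x values in the plotting call in place of the original strings. Labels that do not contain the word are returned unchanged.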
