Seaborn | Matplotlib: Scatter plot: 3rd variable is encoded by both size and colour - python

I want to create a plot that shows the geographical distribution of nightly prices using longitude and lattitude as coordinates, and the price encoded both by color and size of the circles. I curently have no idea on how to encode the plots by the price in both colour and size. I come to you in search of help~ I dont understand the documentation for seaborn in this scenario.
3 columns of interest:
longtitude lattitude Price
50.1235156 4.1236436 160
52.3697862 4.8935462 300
52.3640489 4.8895343 8000
52.3729765 4.8931707 1300
52.3657530 4.8796741 5000
52.2957663 4.3058365 60
52.6709324 4.6028347 100
In my scenario: each column is of equal length, but I only want to include prices that are >150
Im stuck with this filter in play, as the column with the applied filter is half the size as longitude and latitude.
My clueless attempt:
plt.scatter(df.longitude, df.latitude, s=(df.price)>150, c= (df.price)>150)
The way I understand it is that the latitude and longitude create the space/plane, and then apply the price data. But implementing it seems to work differently?

First of all, you need to filter the dataframe before plotting. If you do what you're doing (which won't work anyway), your Series of x and y coordinates will be the entire length of the dataframe but series responsible for color-coding and size will be shorter because you're trying to filter out values under 150 with this: s=(df.price)>150.
Secondly, you can't plot like this using matplotlib. With matplotlib to color-code points you need to create a dictionary, so I'd suggest using seaborn for simplicity.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
df_plot = df.loc[df.price > 150]
fig = sns.scatterplot(data=df_plot, x='longitude', y='latitude', size='price', hue='price')
plt.show()

Related

Python Visualisation Not Plotting Full Range of Data Points

I'm just starting out on using Python and I'm using it to plot some points through Power BI. I use Power BI as part of my work anyway and this is for an article I'm writing alongside learning. I'm aware Power BI isn't the ideal place to be using Python :)
I have a dataset of average banana prices since 1995 (https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1132096/bananas-30jan23.csv)
I've managed to turn that into a nice line chart which plots the average for each month but only shows the yearly labels. The chart is really nice and I'm happy with it other than the fact that it isn't plotting anything before 1997 or after 2020 despite the date range being outside that. Earlier visualisations without the x-axis labelling grouping led to all points being plot but with this it's now no longer working.
ChatGPT got me going in circles that never resolved the issue so I suspect my issue may lie in my understand of Python. If anyone could help me understand the issue that would be brilliant, I can provide more information if that helps:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Convert the 'Date' column to a datetime format
dataset['Date'] = pd.to_datetime(dataset['Date'])
# Group the dataframe by month and calculate the average price for each month
monthly_average = dataset.groupby(dataset['Date'].dt.strftime('%B-%Y'))['Price'].mean()
# Plot the monthly average price against the month using seaborn
ax = sns.lineplot(x=monthly_average.index, y=monthly_average.values)
# Find the unique years in the dataset
unique_years = np.unique(dataset['Date'].dt.year)
# Set the x-axis tick labels to only be the unique years
ax.xaxis.set_ticklabels(unique_years)
ax.xaxis.set_major_locator(plt.MaxNLocator(len(unique_years)))
# Show the plot
plt.show()
Resulting Chart

How do I plot weather data from two data sets on one bar graph using python?

Python newbie here. I'm looking at some daily weather data for a couple of cities over the course of a year. Each city has its own csv file. I'm interested in comparing the count of daily average temperatures between two cities in a bar graph, so I can see (for example) how often the average temperature in Seattle was 75 degrees (or 30 or 100) compared to Phoenix.
I'd like a bar graph with side-by-side bars with temperature on the x-axis and count on the y-axis. I've been able to get a bar graph of each city separately with this data, but don't know how to get both cities on the same bar chart with with a different color for each city. Seems like it should be pretty simple, but my hours of search haven't gotten me a good answer yet.
Suggestions please, oh wise stackoverflow mentors?
Here's what I've got so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("KSEA.csv")
df2 = pd.read_csv("KPHX.csv")
df["actual_mean_temp"].value_counts(sort=False).plot(kind ="bar")
df2["actual_mean_temp"].value_counts(sort = False).plot(kind = 'bar')
You can concat DataFrames, assigning city as a column, and then use histplot in seaborn:
import seaborn as sns
z = pd.concat([
df[['actual_mean_temp']].assign(city='KSEA'),
df2[['actual_mean_temp']].assign(city='KPHX'),
])
ax = sns.histplot(data=z, x='actual_mean_temp', hue='city',
multiple='dodge', binwidth=1)
Output:

A problem of python plot for large number of data

I am new to python and trying to plot a color magnitude diagram(CMD) for a selected cluster by matplotlib, there are 3400000 stars that I need to plot, the data for each star would be color on x axis and magnitude on y axis, However, my code should read two columns in a csv file and plot. The problem is when I using a part of the data (3000 stars), I can plot a CMD succesfully but when I use all the data, the plot is very mess(see figure below) and it seems that points are ploted by their positions in the column instead of its value. For example, a point has data (0.92,20.64) should be close to the y-axis, but is actually located at the far right of the plot just becasue it placed at last few columns of the dataset. So I wanna know how can I plot the entire dataset and show a plot like the first figure.Thanks for yout time. These are my codes:
import matplotlib.pyplot as plt
import pandas as pd
import csv
data = pd.read_csv(r'C:\Users\Peter\Desktop\F275W test.csv', low_memory=False)
# Generate some test data
x = data['F275W-F336W']
y = data['F275W']
#remove the axis
plt.axis('off')
plt.plot(x,y, ',')
plt.show()
This is the plot I got for 3000 stars it's a CMD
This is the plot I got for entire dataset, which is very mess

Plotting points extracted from a .txt file in python

I am trying to create a plot extracting points from a .txt file. The points are separated by 'tab' space only. Also, there are too many points to be accommodated in only one column, so they have been spread over 3 columns. However, when I plot in matplotlib, I am a little suspicious I am not seeing all the numbers plotted. It may be the case the data is plotted only over the first column and is ignoring the other two columns.
Here is the sample example of such data file: https://www.dropbox.com/s/th6uwrk2xdnmhyi/n1l2m2.txt?dl=0
I also attached the simple code I am using to plot:
import matplotlib.pyplot as plt
%matplotlib inline
import sys
import os
import numpy
from pylab import *
exp_sum = '/home/trina/Downloads/n1l2m2.txt'
a= numpy.loadtxt(exp_sum, unpack =True)
plt.plot(a)
show()
and here is the output image:
I am interested to know if this plot covers all the points in my data file. Your suggestion is very appreciated.
By doing plt.plot(a), you are passing a 3 dimensional data set to be plotted onto a 2 dimensional graph.
From the matplotlib docs for plot
If x and/or y is 2-dimensional, then the corresponding columns will be
plotted.
So, your graph output is:
column 0 values at x = 0
column 1 values at x = 1
column 2 values at x = 2
Adding the following to the code:
for i in range(0,len(a)):
print('a'+str(i),max(a[i]),min(a[i]))
Outputs the following:
stats max min
a0 0.9999 0.0
a1 0.9856736 0.3736717
a2 -0.003469009 -0.08896232
Using the mouseover position readout with matplotlib, this looks correct.
On a general graphs point, I'd recommend using histograms, boxplots or violin plots if you want to visualise the frequency (and other stats) of data sets. See the matplotlib examples for histograms, boxplots and violin plots.
Edit: from the shading on the graph you have, it also looks like it does contain all the points, as your data columns are long tails when plotted individually. The long tail graphs correlate to the shading on the graph you have.

How to connect boxplot median values

It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.
I'm using this syntax to do the boxplot so that it automatically generate the box plot for Y vs. X device without having to do external manipulation of the data frame:
df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)
One way I thought of doing is to just do a line plot by getting the mean values from the boxplot, but I'm not sure how to extract that information from the plot.
You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.
First let's generate some sample data:
import pandas as pd
import numpy as np
import seaborn as sns
N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})
print(df.head())
group value
0 A 0.816847
1 A 0.468465
2 C 0.871975
3 B 0.933708
4 A 0.480170
...
Next, make the boxplot and save the axis object:
ax = df.boxplot(column='value', by='group', showfliers=True,
positions=range(df.group.unique().shape[0]))
Note: There's a curious positions argument in Pyplot/Pandas boxplot(), which can cause off-by-one errors. See more in this discussion, including the workaround I've employed here.
Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:
sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)
Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
You can get the value of the medians by using the .get_data() property of the matplotlib.lines.Line2D objects that draw them, without having to use seaborn.
Let bp be your boxplot created as bp=plt.boxplot(data). Then, bp is a dict containing the medians key, among others. That key contains a list of matplotlib.lines.Line2D, from which you can extract the (x,y) position as follows:
bp=plt.boxplot(data)
X=[]
Y=[]
for m in bp['medians']:
[[x0, x1],[y0,y1]] = m.get_data()
X.append(np.mean((x0,x1)))
Y.append(np.mean((y0,y1)))
plt.plot(X,Y,c='C1')
For an arbitrary dataset (data), this script generates this figure. Hope it helps!

Categories