I am trying to implement a machine-learning algorithm to predict house prices in New York City.
Now, when I try to plot (using Seaborn) the relationship between two columns of my house-prices dataset, 'gross_sqft_thousands' (the gross area of the property in thousands of square feet) and the target column 'sale_price_millions', I get a weird plot like this one:
Code used to plot:
sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df);
When I try to plot the number of commercial units (the commercial_units column) versus sale_price_millions, I also get a weird plot, like this one:
I get these weird plots even though, in the correlation matrix, sale_price correlates well with both variables (gross_sqft_thousands and commercial_units).
What am I doing wrong, and what should I do to get a great plot, with fewer points and a clear fit, like this one:
Here is a part of my dataset:
Your housing price dataset is much larger than the tips dataset shown in that Seaborn example plot, so scatter plots made with default settings will be massively overcrowded.
The second plot looks "weird" because it plots a (practically) continuous variable, sale price, against an integer-valued variable, commercial_units.
The following solutions come to mind:
Downsample the dataset with something like sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df[::10]). The [::10] part selects every 10th row of clean_df. You could also try clean_df.sample(frac=0.1, random_state=12345), which randomly samples 10% of all rows without replacement (the random seed makes the sample reproducible).
Reduce the alpha (opacity) and/or size of the scatterplot points with sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df, scatter_kws={"alpha": 0.1, "s": 1}).
For plot 2, add a bit of "jitter" (random noise) to the y-axis variable with sns.regplot(..., y_jitter=0.05). These options can also be combined, as in the sketch below.
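A minimal sketch combining these options, assuming clean_df is your cleaned DataFrame:
import seaborn as sns
import matplotlib.pyplot as plt

# random 10% sample plus faint, small points; same axes as in the question
sample = clean_df.sample(frac=0.1, random_state=12345)
sns.regplot(x="sale_price_millions", y="gross_sqft_thousands",
            data=sample, scatter_kws={"alpha": 0.1, "s": 5})
plt.show()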
For more, check out the Seaborn documentation on regplot: https://seaborn.pydata.org/generated/seaborn.regplot.html
I am working on an IPL cricket dataset, which has over-by-over batting stats for all the teams.
I want to visualise how different cricket grounds affect the total score of the batting team. I tried a simple scatter plot, but the stadium names are too long and do not display clearly.
Do I have to convert the 35 stadium values into numeric values? Nothing is printed when I try to find their correlation with the target variable.
The data set:
The problem with reading the plot (the x-axis):
You can change the size of the font and/or rotate it: https://matplotlib.org/api/matplotlib_configuration_api.html#matplotlib.rc
You can make your plot bigger by setting figsize
(add this as the first line):
plt.figure(figsize=(14, 8))
and then rotate the xticks (at the end):
plt.xticks(rotation=90)
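Putting it together as a minimal sketch (df, 'venue', and 'total_score' are hypothetical names standing in for your IPL DataFrame and columns):
import matplotlib.pyplot as plt

plt.figure(figsize=(14, 8))                   # a larger canvas leaves room for long labels
plt.scatter(df['venue'], df['total_score'])   # hypothetical column names
plt.xticks(rotation=90)                       # rotate stadium names so they don't overlap
plt.show()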
import pandas.plotting as pp   # pp is assumed to be pandas.plotting
import matplotlib.pyplot as plt

# one scatter matrix per attribute, each paired with quality
for attribute in ['alcohol', 'chlorides', 'density']:
    compare = wine_data[["quality", attribute]]
    plot = pp.scatter_matrix(compare)
    plt.show()
I found the following graph. Quality is an integer in the range 0-10; 'alcohol', 'chlorides', and 'density' are continuous. The correlations of ['alcohol','chlorides','density'] with quality are 0.432733, -0.305599, and -0.207202, respectively. How do I interpret the three graphs below? Is there a better way to visualize the correlation of discrete data?
I prefer Seaborn's regplot function, which will draw the same scatterplot you see here with a regression line fitted on top of it. The regression line helps you see whether the correlation is positive or negative (upward/downward sloping), and it comes with a shaded error band around the line.
https://seaborn.pydata.org/generated/seaborn.regplot.html
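A minimal sketch, assuming wine_data is the DataFrame from the question:
import seaborn as sns
import matplotlib.pyplot as plt

# scatter plus regression line; jitter spreads the integer-valued quality scores
sns.regplot(x="alcohol", y="quality", data=wine_data,
            y_jitter=0.2, scatter_kws={"alpha": 0.3})
plt.show()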
I am trying to create an interactive graph using HoloViews on a large dataset. Below is a sample of the data file called trackData.csv:
Event Time ID Venue
Javeline 11:25:21:012345 JVL Dome
Shot pot 11:25:22:778929 SPT Dome
4x4 11:25:21:993831 FOR Track
4x4 11:25:22:874293 FOR Track
Shot pot 11:25:21:087822 SPT Dome
Javeline 11:25:23:878792 JVL Dome
Long Jump 11:25:21:892902 LJP Aquatic
Long Jump 11:25:22:799422 LJP Aquatic
This is how I read the data and plot a scatter plot.
import pandas as pd
import holoviews as hv

trackData = pd.read_csv('trackData.csv')
scatter = hv.Scatter(trackData, 'Time', 'ID')
scatter
Because this dataset is quite large, zooming in and out of the scatter plot is very slow, and I would like to speed this up.
I researched and found HoloViews' decimate, which is recommended for large datasets, but I don't know how to use it in the above code. Most of what I tried throws an error. Also, is there a way to make sure the Time column is parsed with microsecond precision? Thanks in advance for the help.
Datashader indeed does not handle categorical axes as used here, but that's not so much a limitation of the software as of my imagination -- what should it be doing with them? A Datashader scatterplot (Canvas.points) is meant for a very large number of points located on a continuously indexed 2D plane. Such a plot approximates a 2D probability distribution function, accumulating points per pixel to show the density in that region, and revealing spatial patterns across pixels.
A categorical axis doesn't have the same properties that a continuous numerical axis does, because there's no spatial relationship between adjacent values. Specifically in this case, there's no apparent meaning to an ordering of the ID field (it appears to be a letter code for a sporting event type), so I can't see any meaning to accumulating across ID values per pixel the way Datashader is designed to do. Even if you convert IDs to numbers, you'll either just get random-looking noise (if there are more ID values than vertical pixels), or a series of spotty lines (if there are fewer ID values than pixels).
Here, maybe there are only a few dozen or so unique ID values, but many, many time measurements? In that case most people would use a box, violin, histogram, or ridge plot per ID, to see the distribution of values for each ID value. A Datashader points plot is a 2D histogram, but if one axis is categorical you're really dealing with a set of 1D histograms, not a single combined 2D histogram, so just use histograms if that's what you're after.
If you really do want to try plotting all the points per ID as raw points, you could do that using vertical spike events as in https://examples.pyviz.org/iex_trading/IEX_stocks.html . You can also add some vertical jitter and then use Datashader, but that's not something directly supported right now, and it doesn't have the clear mathematical interpretation that a normal Datashader plot does (in terms of approximating a density function).
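As a minimal sketch of the per-ID histogram idea (assuming df holds the question's data with Time already parsed to datetime, and using seaborn just for illustration):
import seaborn as sns

# helper column to make time numeric: seconds since the first recorded event
df['t_sec'] = (df['Time'] - df['Time'].min()).dt.total_seconds()
# one 1D distribution per ID, drawn as overlaid step histograms
sns.histplot(data=df, x='t_sec', hue='ID', element='step', stat='density')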
The disadvantage of decimate() is that it downsamples your datapoints.
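For reference, applying it to the question's scatter would look something like this sketch (max_samples caps how many points are drawn in the current viewport):
import holoviews as hv
from holoviews.operation import decimate

# thins the visible points to at most max_samples, re-sampling as you zoom
decimate(hv.Scatter(trackData, 'Time', 'ID'), max_samples=5000)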
I think you need datashader() here, but datashader doesn't like that ID is a categorical variable instead of a numerical value.
So a solution could be to convert your categorical variable to a numerical code.
See the code example below for both hvPlot (which I prefer) and HoloViews:
import io
import pandas as pd
import hvplot.pandas
import holoviews as hv
# dynspread is for making point sizes larger when using datashade
from holoviews.operation.datashader import datashade, dynspread
# sample data
text = """
Event Time ID Venue
Javeline 11:25:21:012345 JVL Dome
Shot pot 11:25:22:778929 SPT Dome
4x4 11:25:21:993831 FOR Track
4x4 11:25:22:874293 FOR Track
Shot pot 11:25:21:087822 SPT Dome
Javeline 11:25:23:878792 JVL Dome
Long Jump 11:25:21:892902 LJP Aquatic
Long Jump 11:25:22:799422 LJP Aquatic
"""
# create dataframe and parse time
df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}', engine='python')
# the trailing %f field is parsed as microseconds, answering the question about micros
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S:%f')
df = df.set_index('Time').sort_index()
# get a column that converts categorical id's to numerical id's
df['ID'] = pd.Categorical(df['ID'])
df['ID_code'] = df['ID'].cat.codes
# use this to overwrite numerical yticks with categorical yticks
yticks=[(0, 'FOR'), (1, 'JVL'), (2, 'LJP'), (3, 'SPT')]
# this is the hvplot solution: set datashader=True
df.hvplot.scatter(
x='Time',
y='ID_code',
datashade=True,
dynspread=True,
padding=0.05,
).opts(yticks=yticks)
# this is the holoviews solution
scatter = hv.Scatter(df, kdims=['Time'], vdims=['ID_code'])
dynspread(datashade(scatter)).opts(yticks=yticks, padding=0.05)
More info on datashader and decimate:
http://holoviews.org/user_guide/Large_Data.html
Resulting plot:
I plotted a scatter plot on my dataframe which looks like this:
with this code:
from scipy import stats
import pandas as pd
import seaborn as sns

df = pd.read_csv('/content/drive/My Drive/df.csv', sep=',')
subset = df.iloc[:, 1:10080]  # .iloc is needed for positional slicing
df['mean'] = subset.mean(axis=1)
df.plot(x='mean', y='Result', kind='scatter')
sns.lmplot(x='mean', y='Result', data=df, order=1)
I wanted to find the slope of the regression in the graph using this code:
stats.mstats.linregress(Result, average)
but from the output it seems like the slope magnitude is too small:
LinregressResult(slope=-0.0001320534706614152, intercept=27.887336813241845, rvalue=-0.16776138446214162, pvalue=3.0450456899520655e-07, stderr=2.55977061451773e-05)
If I switch the Result and average positions,
stats.mstats.linregress(average, Result)
it still doesn't look right, as the intercept seems too large:
LinregressResult(slope=-213.12489536011773, intercept=7138.48783135982, rvalue=-0.16776138446214162, pvalue=3.0450456899520655e-07, stderr=41.31287437069993)
Why is this happening? Do these output values need to be rescaled?
The signature for scipy.stats.mstats.linregress is linregress(x,y) so your second ordering, linregress(average, Result) is the one that is consistent with the way your graph is drawn. And on that graph, an intercept of 7138 doesn't seem unreasonable—are you getting confused by the fact that the x-axis limits you're showing don't go down to 0, where the intercept would actually happen?
In any case, your data really don't look like they follow a linear law, so the slope (or any parameter from a completely-misspecified model) will not actually tell you much. Are the x and y values all strictly positive? And is there a particular reason why x can never logically go below 25? The data-points certainly seem to be piling up against that vertical asymptote. If so, I would probably subtract 25 from x, then fit a linear model to logged data. In other words, do your plot and your linregress with x=numpy.log(average-25) and y=numpy.log(Result). EDIT: since you say x is temperature there’s no logical reason why x can’t go below 25 (it is meaningful to want to extrapolate below 25, for example—and even below 0). Therefore don’t subtract 25, and don’t log x. Just log y.
In your comments you talk about rescaling the slope, and eventually the suspicion emerges that you think this will give you a correlation coefficient. These are different things. The correlation coefficient is about the spread of the points around the line as well as slope. If what you want is correlation, look up the relevant tools using that keyword.
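A minimal sketch of the suggested fit, assuming average and Result are the arrays plotted above and that Result is strictly positive:
import numpy as np
from scipy import stats

# per the EDIT above: log only y, leave x (temperature) alone
res = stats.linregress(average, np.log(Result))
print(res.slope, res.intercept, res.rvalue)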
I'm new to heatmaps and seaborn. I've been trying this for days but haven't been able to find a solution or any related threads. I think I'm setting up the problem incorrectly but would like to know if what I'm trying to do, is possible with seaborn heatmaps... or if a heatmap isn't the right graphic representation for what I want to show.
I have a .csv file of scores. It looks something like:
Genus,Pariaconus,Trioza,Non-members
-40,-80,-90,-300
-40.15,-80,-100,-320
-40.17,-86,-101,-470
,-86.2,-130,-488
,,-132,-489
,,,-500
...
As I try to show above, the columns are different lengths. Let's say the length of (the number of values in) Genus is 10, Pariaconus is 15, Trioza is 20, and Non-members is 18,000.
In addition, the columns and rows are not related to each other. Each score is individual and just falls under the column group. What I want to show with the heatmap is the range of scores that occur in each column.
I would ideally like to represent the data using a heatmap, where:
X-axis is "Genus", "Pariaconus", "Trioza", "Non-members".
Y-axis is
range of scores that occur in the dataset. In the example above,
Y-axis values would go from -40 to -500.
Colorbar is the
normalized population of the columns that get that score in
the Y-axis. For example, if 100% of the Genus column scores around
-40, that area in Y-axis would be colored red (for 1.0). The remainder of the y-axis for Genus would be colored blue (for 0.0),
because no scores for Genus are in the range -50 to -500. For the
purposes of my project, I'd like to show that the majority of scores
of "Genus" fall in a certain range, "Pariaconus" in another range,
"Non-members" in another range, and so on.
The reason I want to represent this with a heatmap and not, say, a line graph is that a line graph would suggest a trend between rows in the same column. In the example above (the Genus column), a line/scatter graph would make it seem that there's a relationship between the scores -40, -41, and -45 as you move along the axis. In contrast, I just want to show the range of scores in each column.
With the data in the .csv format above, right now I have the following heatmap: https://imgur.com/a/VwgQwfQ
I get this with the line of code:
sns.heatmap(df, cmap="coolwarm")
In this heatmap, the values of the Y-axis are automatically set as the row indices from the .csv file, and the colormap values are the scores (values of the rows).
If I could just figure out how to swap the colormap and the Y-axis, then I hope I could move on to figuring out how to normalize the populations of each column instead of using the raw indices 0 to 18000. But I've been trying for days and haven't come close to what I want.
Ideally, I would want something like this: https://imgur.com/a/3A0eaOD. Of course, in the heatmap, there would be gradients instead of solid colors.
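To make the intent concrete, here is a rough sketch of the transformation I think I need: bin each column independently and normalize (the bin edges below are made up for illustration; df is the DataFrame read from the .csv above):
import numpy as np
import pandas as pd
import seaborn as sns

# shared score bins spanning the full range of the data
bins = np.linspace(-500, -40, 24)
# per column: the fraction of its (non-NaN) scores that falls into each bin
heat = pd.DataFrame({
    col: pd.cut(df[col].dropna(), bins).value_counts(normalize=True).sort_index()
    for col in df.columns
})
sns.heatmap(heat, cmap="coolwarm")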
If anyone can answer, I had these questions:
Is what I'm trying to do achievable with a heatmap, or should I be using a different representation?
Is this possibly a problem with how my input data is represented? If so, what is the correct representation when building a heatmap like this?
Any other guidance would be super appreciated.