How to deal with categorical data that has 35 unique values? - python

I am working on an IPL cricket dataset that has over-by-over batting stats for all the teams.
I want to visualise how different cricket grounds affect the total score of the batting team. I tried to plot a simple scatter plot, but the stadium names are too long and the plot does not show them clearly.
Do I have to convert the 35 values into numeric values? Nothing is printed when I try to find the correlation with the target variable.
The data set:
The problem with reading the plot (the x-axis):

You can change the font size and/or rotate the labels: https://matplotlib.org/api/matplotlib_configuration_api.html#matplotlib.rc

You can make your plot bigger by setting figsize.
(add this as the first line):
plt.figure(figsize=(14, 8))
and then rotate the xticks (add this at the end):
plt.xticks(rotation=90)
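Putting both suggestions together, a minimal sketch (the venue names and run totals below are made up, since the original DataFrame isn't shown in the question):
import matplotlib.pyplot as plt
import pandas as pd
# hypothetical stand-in for the IPL data; replace with your own DataFrame
matches = pd.DataFrame({
    'venue': ['M. Chinnaswamy Stadium', 'Eden Gardens', 'Wankhede Stadium'] * 5,
    'total_runs': [180, 165, 172, 150, 140, 190, 200, 155, 160, 175, 145, 185, 170, 158, 166],
})
plt.figure(figsize=(14, 8))          # bigger canvas so all the venue labels fit
plt.scatter(matches['venue'], matches['total_runs'])
plt.xticks(rotation=90, fontsize=8)  # rotate and shrink the long stadium names
plt.ylabel('total runs')
plt.tight_layout()                   # keep the rotated labels inside the figure
plt.show()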

Related

How to create an interactive graph on a large data set?

I am trying to create an interactive graph using HoloViews on a large data set. Below is a sample of the data file, called trackData.csv:
Event      Time             ID   Venue
Javeline   11:25:21:012345  JVL  Dome
Shot pot   11:25:22:778929  SPT  Dome
4x4        11:25:21:993831  FOR  Track
4x4        11:25:22:874293  FOR  Track
Shot pot   11:25:21:087822  SPT  Dome
Javeline   11:25:23:878792  JVL  Dome
Long Jump  11:25:21:892902  LJP  Aquatic
Long Jump  11:25:22:799422  LJP  Aquatic
This is how I read the data and plot a scatter plot.
import pandas as pd
import holoviews as hv

trackData = pd.read_csv('trackData.csv')
scatter = hv.Scatter(trackData, 'Time', 'ID')
scatter
Because this data set is quite large, zooming in and out of the scatter plot is very slow, and I would like to speed this up.
I researched and found HoloViews' decimate operation, which is recommended for large datasets, but I don't know how to use it in the above code.
Most things I tried seem to throw an error. Also, is there a way to make sure the Time column is converted to microseconds? Thanks in advance for the help.
Datashader indeed does not handle categorical axes as used here, but that's not so much a limitation of the software as of my imagination -- what should it be doing with them? A Datashader scatterplot (Canvas.points) is meant for a very large number of points located on a continuously indexed 2D plane. Such a plot approximates a 2D probability distribution function, accumulating points per pixel to show the density in that region, and revealing spatial patterns across pixels.
A categorical axis doesn't have the same properties that a continuous numerical axis does, because there's no spatial relationship between adjacent values. Specifically in this case, there's no apparent meaning to an ordering of the ID field (it appears to be a letter code for a sporting event type), so I can't see any meaning to accumulating across ID values per pixel the way Datashader is designed to do. Even if you convert IDs to numbers, you'll either just get random-looking noise (if there are more ID values than vertical pixels), or a series of spotty lines (if there are fewer ID values than pixels).
Here, maybe there are only a few dozen or so unique ID values, but many, many time measurements? In that case most people would use a box, violin, histogram, or ridge plot per ID, to see the distribution of values for each ID value. A Datashader points plot is a 2D histogram, but if one axis is categorical you're really dealing with a set of 1D histograms, not a single combined 2D histogram, so just use histograms if that's what you're after.
If you really do want to try plotting all the points per ID as raw points, you could do that using vertical spike events as in https://examples.pyviz.org/iex_trading/IEX_stocks.html . You can also add some vertical jitter and then use Datashader, but that's not something directly supported right now, and it doesn't have the clear mathematical interpretation that a normal Datashader plot does (in terms of approximating a density function).
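As a rough sketch of that per-ID distribution idea (assuming trackData.csv loads with the plain read_csv call from the question, and converting Time to microseconds since the first event so the value axis is numeric):
import pandas as pd
import holoviews as hv
hv.extension('bokeh')
# assumes trackData.csv loads with the same read_csv call used in the question
trackData = pd.read_csv('trackData.csv')
trackData['Time'] = pd.to_datetime(trackData['Time'], format='%H:%M:%S:%f')
# microseconds since the first event, so each distribution is over a numeric axis
trackData['Time_us'] = (trackData['Time'] - trackData['Time'].min()).dt.total_seconds() * 1e6
# one distribution of times per ID, instead of a categorical scatter
hv.BoxWhisker(trackData, kdims='ID', vdims='Time_us')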
The disadvantage of decimate() is that it downsamples your datapoints.
I think you need datashader() here, but datashader doesn't like that ID is a categorical variable instead of a numerical value.
So a solution could be to convert your categorical variable to a numerical code.
See the code example below for both hvPlot (which I prefer) and HoloViews:
import io
import pandas as pd
import hvplot.pandas
import holoviews as hv
# dynspread is for making point sizes larger when using datashade
from holoviews.operation.datashader import datashade, dynspread
# sample data
text = """
Event      Time             ID   Venue
Javeline   11:25:21:012345  JVL  Dome
Shot pot   11:25:22:778929  SPT  Dome
4x4        11:25:21:993831  FOR  Track
4x4        11:25:22:874293  FOR  Track
Shot pot   11:25:21:087822  SPT  Dome
Javeline   11:25:23:878792  JVL  Dome
Long Jump  11:25:21:892902  LJP  Aquatic
Long Jump  11:25:22:799422  LJP  Aquatic
"""
# create dataframe and parse time
df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}', engine='python')
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S:%f')
df = df.set_index('Time').sort_index()
# get a column that converts categorical id's to numerical id's
df['ID'] = pd.Categorical(df['ID'])
df['ID_code'] = df['ID'].cat.codes
# use this to overwrite numerical yticks with categorical yticks
yticks=[(0, 'FOR'), (1, 'JVL'), (2, 'LJP'), (3, 'SPT')]
# this is the hvplot solution: set datashade=True
df.hvplot.scatter(
    x='Time',
    y='ID_code',
    datashade=True,
    dynspread=True,
    padding=0.05,
).opts(yticks=yticks)
# this is the holoviews solution
scatter = hv.Scatter(df, kdims=['Time'], vdims=['ID_code'])
dynspread(datashade(scatter)).opts(yticks=yticks, padding=0.05)
More info on datashader and decimate:
http://holoviews.org/user_guide/Large_Data.html
Resulting plot:

Cannot get the appropriate histogram

I'm working on credit card data available on Kaggle. I want to plot the histogram for a column named 'Amount' from the data file named 'credit'. I want the plot to cover the whole range of Amount, but I am not getting it. The range of Amount is [0, 25691.16], but the range shown in the plot is only max_value/num_bins. What change in the code is required to get a plot over the full range mentioned above?
In the example code below, the plot shows a single bar of width 2569.116 (range/num_bins). What I need is 10 bars covering the entire range:
plt.hist(credit['Amount'], 10, density=True, range=(0, 25691.16), facecolor='red', alpha=0.5)
Your code is right. I think it is the nature of your data: you cannot see the other bars because the density in the first bin is very high. In other words, almost all values of 'Amount' are in (0, 2569.116) and only a few are in the intervals (2569.116, 5138.232), ..., (23122.044, 25691.16).
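One quick way to confirm this, assuming the same credit DataFrame from the question, is to put the counts on a logarithmic scale so the nearly empty bins become visible:
import matplotlib.pyplot as plt
# same call as in the question, but with a logarithmic y-axis
plt.hist(credit['Amount'], 10, density=True, range=(0, 25691.16),
         facecolor='red', alpha=0.5, log=True)
plt.show()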

How to plot linear regression between two continuous values?

I am trying to implement a machine-learning algorithm to predict house prices in New York City.
Now, when I try to plot (using Seaborn) the relationship between two columns of my house-prices dataset, 'gross_sqft_thousands' (the gross area of the property in thousands of square feet) and the target column 'sale_price_millions', I get a weird plot like this one:
Code used to plot:
sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df);
When I try to plot the number of commercial units (commercial_units column) versus the sale_price_millions, I also get a weird plot like this one:
I get these weird plots even though, in the correlation matrix, sale_price correlates very well with both variables (gross_sqft_thousands and commercial_units).
What am I doing wrong, and what should I do to get a good plot, with fewer points and a clear fit, like this one:
Here is a part of my dataset:
Your housing price dataset is much larger than the tips dataset shown in that Seaborn example plot, so scatter plots made with default settings will be massively overcrowded.
The second plot looks "weird" because it plots a (practically) continuous variable, sales price, against an integer-valued variable, total_units.
The following solutions come to mind:
Downsample the dataset with something like sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df[::10]). The [::10] part selects every 10th line from clean_df. You could also try clean_df.sample(frac=0.1, random_state=12345), which randomly samples 10% of all rows without replacement (using a random seed for reproducibility).
Reduce the alpha (opacity) and/or size of the scatterplot points with sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df, scatter_kws={"alpha": 0.1, "s": 1}).
For plot 2, add a bit of "jitter" (random noise) to the y-axis variable with sns.regplot(..., y_jitter=0.05).
For more, check out the Seaborn documentation on regplot: https://seaborn.pydata.org/generated/seaborn.regplot.html
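Combining the three ideas above into one call, as a sketch (assuming the same clean_df from the question):
import seaborn as sns
# random 10% sample, small translucent points, and a little vertical jitter
sns.regplot(
    x="sale_price_millions",
    y="gross_sqft_thousands",
    data=clean_df.sample(frac=0.1, random_state=12345),
    scatter_kws={"alpha": 0.1, "s": 1},
    y_jitter=0.05,
)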

Any way to swap my seaborn heatmap's colormap values with its y-axis values?

I'm new to heatmaps and seaborn. I've been trying this for days but haven't been able to find a solution or any related threads. I think I'm setting up the problem incorrectly but would like to know if what I'm trying to do, is possible with seaborn heatmaps... or if a heatmap isn't the right graphic representation for what I want to show.
I have a .csv file of scores. It looks something like:
Genus,Pariaconus,Trioza,Non-members
-40,-80,-90,-300
-40.15,-80,-100,-320
,-40.17,-86,-101,-470
,-86.2,-130,-488
,,-132,-489
,,,-500
...
As I try to show above, the columns have different lengths. Let's say the length of (the number of values in) Genus is 10, Pariaconus is 15, Trioza is 20, and Non-members is 18,000.
In addition, the columns and rows are not related to each other. Each score is individual and just falls under the column group. What I want to show with the heatmap is the range of scores that occur in each column.
I would ideally like to represent the data using a heatmap, where:
X-axis is "Genus", "Pariaconus", "Trioza", "Non-members".
Y-axis is the range of scores that occur in the dataset. In the example above, Y-axis values would go from -40 to -500.
Colorbar is the normalized population of each column that gets that score on the Y-axis. For example, if 100% of the Genus column scores around -40, that area of the Y-axis would be colored red (for 1.0) and the remainder of the Y-axis for Genus would be colored blue (for 0.0), because no scores for Genus are in the range -50 to -500. For the purposes of my project, I'd like to show that the majority of scores of "Genus" fall in a certain range, "Pariaconus" in another range, "Non-members" in another range, and so on.
The reason I want to represent this with a heatmap and not, say, a line graph is because line graphs would suggest that there is a trend between rows in the same column. In the example above (Genus column), a line/scatter graph would make it seem that there's a relationship between the score -40, -41, and -45 as you move down the X-axis. In contrast, I just want to show the range of scores in each column.
With the data in the .csv format above, right now I have the following heatmap: https://imgur.com/a/VwgQwfQ
I get this with the line of code:
sns.heatmap(df, cmap="coolwarm")
In this heatmap, the values of the Y-axis are automatically set as the row indices from the .csv file, and the colormap values are the scores (values of the rows).
If I could just figure out how to swap the colormap and the Y-axis, then I hope that I could then move on to figuring out how to normalize the populations of each column instead of having it as the raw indices: 0 to 18000. But I've been trying to do this for days and haven't come close to what I want.
Ideally, I would want something like this: https://imgur.com/a/3A0eaOD. Of course, in the heatmap, there would be gradients instead of solid colors.
If anyone can answer, I had these questions:
Is what I'm trying to do achievable, and is this something I can do with a heatmap? Or should I be using a different representation?
Is this possibly a problem with how my input data is represented? If so, what is the correct representation when building a heatmap like this?
Any other guidance would be super appreciated.
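A minimal sketch of the reshaping described above (binning each column's scores, normalizing per column, and passing the result to sns.heatmap), using the few sample scores shown and hypothetical bin edges:
import numpy as np
import pandas as pd
import seaborn as sns
# hypothetical stand-in for the ragged score columns described above
scores = {
    'Genus':       [-40, -40.15, -40.17],
    'Pariaconus':  [-80, -80, -86, -86.2],
    'Trioza':      [-90, -100, -101, -130, -132],
    'Non-members': [-300, -320, -470, -488, -489, -500],
}
bins = np.linspace(-500, -40, 24)  # shared score bins for the y-axis
heat = pd.DataFrame({
    col: np.histogram(vals, bins=bins)[0] / len(vals)  # fraction of the column per bin
    for col, vals in scores.items()
}, index=bins[:-1].round(1))
sns.heatmap(heat.iloc[::-1], cmap="coolwarm")  # highest scores at the top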

matplotlib plot values not in order

I'm encountering some matplotlib behaviour that I do not understand.
I have a dataframe:
august.head()
                        value
config name
low max velocity    -0.000145
medium max velocity -0.000165
reference           -0.000198
high max velocity   -0.000192
When I plot this dataframe using
plt.plot(august)
I get the following plot:
My data seems plotted chaotically and the blue line 'comes back to a previous x value' (sorry, that's the best I can do for a description of my problem)
I would like to see my data plotted with plt.plot(august) just as when I plot it using
august.plot()
Which gives me a good, ordered graph:
Any ideas?
Thanks
Maybe the config names were ordered alphabetically?
In that case, you could associate an integer with each config name, as shown here:
plot-with-custom-text-for-x-axis-points
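A small sketch of that idea (assuming august is the single-column DataFrame shown above), plotting against an integer position and putting the config names back as tick labels:
import matplotlib.pyplot as plt
# one integer per config name, in the row order of the DataFrame
x = range(len(august))
plt.plot(x, august['value'])
plt.xticks(x, august.index, rotation=45)  # show the config names at those positions
plt.tight_layout()
plt.show()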
