Large DF to plot - python

I have a large df to plot (a couple of million rows, 8 columns, obtained by concatenating several files).
I want to plot several graphs using facets, in order to have a complete view of the data:
rp = sns.relplot(data=df,
                 x='zscore',
                 y='%',
                 col='Nr',
                 row='Support',
                 style='Metal',
                 kind='line')
I tried both Seaborn and Plotly Express, but the time to build these graphs is just too long: more than one hour on my laptop.
What can I improve or optimize in order to speed up graph creation?
Thank you!
PS: I am a newbie in Python and programming ;)
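
One likely bottleneck, assuming the default settings in the snippet above: seaborn's line plots bootstrap a confidence interval around every x position, and that computation can dominate the runtime on millions of rows. A minimal sketch of two ways around it; the column names simply mirror the facet/style columns from the question:

import seaborn as sns

# 1) Skip the bootstrapped confidence intervals entirely
rp = sns.relplot(data=df,
                 x='zscore', y='%',
                 col='Nr', row='Support', style='Metal',
                 kind='line',
                 errorbar=None)  # seaborn >= 0.12; use ci=None on older versions

# 2) Or pre-aggregate so seaborn only has to draw one point per x value
agg = (df.groupby(['Nr', 'Support', 'Metal', 'zscore'], as_index=False)['%']
         .mean())
rp = sns.relplot(data=agg,
                 x='zscore', y='%',
                 col='Nr', row='Support', style='Metal',
                 kind='line')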

Related

Identifying Plot Name or Visualization Implementation

I'm working on a dataset of SMS records [datetime_entry, sms_sent] and I was looking to copy a really effective trend visual from a well-cited electricity demand study. Does anyone know the name of this plot, or how to implement something similar in Python (as I'm not sure it was done in Python)?
I know how to subplot the 4 charts after splitting the data by quarter; I'm just stumped on the plot type and stylization.
This is what matplotlib calls an eventplot.
Essentially each vertical line represents an occurrence of an MWh demand during that specific hour. So each row in the plot should have as many vertical lines as there are days in that quarter.
While it works in this plot for these data, relying on the combination of alpha level and data density can be slightly unreliable as the data change, since the number of overlapping points is not readily visible. So you can also create a similar visualization using hist2d, where you manually specify your bins.
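
A minimal sketch of the eventplot idea, with synthetic stand-in data since the actual SMS records aren't shown:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in: for each hour of the day, the day-of-quarter
# positions at which an event occurred
rng = np.random.default_rng(0)
events_per_hour = [rng.uniform(0, 90, size=rng.integers(30, 90))
                   for _ in range(24)]

fig, ax = plt.subplots(figsize=(8, 6))
# One row of vertical ticks per hour; alpha lets dense regions darken
ax.eventplot(events_per_hour, lineoffsets=range(24),
             linelengths=0.8, colors='k', alpha=0.3)
ax.set_xlabel('day of quarter')
ax.set_ylabel('hour of day')
plt.show()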

Faster plotting in matplotlib or better options

I am trying to visualize data from a log of close to 25,000 data points. When I run this with matplotlib.pyplot in Python, it takes a really long time to render simple line graphs, and sometimes I've had to simply give up after 10+ minutes. This data log was made for sampling purposes, and real data logs can be much larger (some files are several gigabytes).
With this in mind, is there any way to plot data this big in matplotlib without extremely slow execution? Or is there another framework that can do this much better in Python? I understand that it can still take a while to render at that size, but for practical purposes, taking 10+ minutes for each plot is just not useful. Any help or guides are appreciated.
Here is a sample of my code:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('sample.txt', low_memory=False)  # 25k lines of data
df = df.iloc[:-2, :]  # dropping the last two rows since we don't need them

# 'some_column' and 'another_column_name' are placeholder names;
# both columns are 25k values long
x = df['another_column_name']
y = df['some_column']
x = x.iloc[1:]  # removing an unnecessary first value, ignore this
y = y.iloc[1:]  # removing an unnecessary first value, ignore this

fig, ax = plt.subplots()
ax.plot(x, y)  # plot x against y
ax.set_title('Sample Graph')
plt.show()
Here I am basically trying to plot one column of the pandas DataFrame against another, to produce a simple line graph. The columns consist of some integers but mostly decimal values. Even this sample takes a really long time, and real files are much bigger, as mentioned. The goal is to be able to accomplish this with any file that is input.
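
25,000 points is normally well within matplotlib's comfort zone, so one thing worth checking (an assumption, since the file isn't shown) is whether the columns were read in as strings: plotting an object-dtype column makes matplotlib build a categorical axis with one tick per distinct value, which is extremely slow. A sketch:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('sample.txt', low_memory=False)
print(df.dtypes)  # 'object' here means the column was parsed as text

# Coerce to numbers; unparseable entries become NaN instead of strings
x = pd.to_numeric(df['another_column_name'], errors='coerce')
y = pd.to_numeric(df['some_column'], errors='coerce')

fig, ax = plt.subplots()
ax.plot(x, y)  # with numeric dtypes this should render in seconds
plt.show()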

How to plot specific rows of qualitative data using matplotlib in Python?

I have a large spreadsheet of data that for privacy reasons I cannot show, but there is a column called 'origin' where there are hundreds of rows for particular company names. For example, 500 rows of information have been entered for 500 people working at "Sony". I want to be able to make graphs from the information gathered for each institution, but I am having trouble plotting only specific rows. The goal is to make a dashboard for each institution.
A way of putting this would be:
fig = px.scatter(df, x='gender'['female'], y='race',
                 color='origin'['Sony'])
fig.update_traces(mode='markers+lines')
fig.show()
I want to focus on particular categories when plotting.
Any help is appreciated!
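
The snippet above isn't valid Python; a minimal sketch of one way to express that intent, assuming the column names from the question: filter the DataFrame with a boolean mask first, then hand the subset to Plotly Express.

import plotly.express as px

# Keep only the rows for one institution (and, optionally, one gender)
sony = df[(df['origin'] == 'Sony') & (df['gender'] == 'female')]
fig = px.scatter(sony, x='gender', y='race', color='origin')
fig.update_traces(mode='markers+lines')
fig.show()

# For a dashboard per institution, loop over the groups:
for name, group in df.groupby('origin'):
    px.scatter(group, x='gender', y='race', title=name).show()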

Group Boxplots with multiple dataframes

I have tried to create grouped boxplots with a scatter overlay, like the ones in the following links:
matplotlib: Group boxplots
https://cmdlinetips.com/2019/03/how-to-make-grouped-boxplots-in-python-with-seaborn/
how to make a grouped boxplot graph in matplotlib
However, the data I want to use comes in a format like this:
5y_spreads
7y_spreads
10y_spreads
(each of these comes from a different worksheet in the same workbook)
I need to reshape the data in Python to make it ready for seaborn, and that is what is difficult for me.
It is not structured like the examples in the links, and I understand this requires mastering dataframes (something I am still learning).
I also need to show the latest value to see where the bonds are trading now, compared to the range.
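
A sketch of the reshaping step, assuming each worksheet holds one tenor with bonds as columns and dates as rows (the file name and sheet layout are guesses): read the sheets, melt each to long form, tag it with its tenor, and concatenate. Seaborn can then group the boxplots, and a stripplot of the last observation of each series marks where the bonds trade now.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sheets = pd.read_excel('spreads.xlsx',
                       sheet_name=['5y_spreads', '7y_spreads', '10y_spreads'])

frames = []
for name, wide in sheets.items():
    long = wide.melt(var_name='bond', value_name='spread')
    long['tenor'] = name.replace('_spreads', '')  # '5y', '7y', '10y'
    frames.append(long)
tidy = pd.concat(frames, ignore_index=True)

ax = sns.boxplot(data=tidy, x='bond', y='spread', hue='tenor')

# Overlay the most recent observation of each bond/tenor series
latest = tidy.groupby(['bond', 'tenor'], as_index=False).last()
sns.stripplot(data=latest, x='bond', y='spread', hue='tenor',
              dodge=True, marker='D', color='k', ax=ax,
              legend=False)  # legend=False needs seaborn >= 0.12
plt.show()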

How to plot a dataframe that contains values spread over a large spectrum of values?

I have the following dataframe, which resulted from running a grid search over several regression models; the values are reproduced as CSV in the edit below.
As can be noticed, many values are grouped around 0.0009, but several are a few orders of magnitude larger in absolute value (-1.6, -2.3, etc.).
I would like to plot these results, but I can't seem to find a way to get a readable plot. I have tried a bar plot, but the result is unreadable.
How can I make this bar plot more readable? Or what other kind of plot would be more suitable to visualize such data?
Edit: Here is the dataframe, exported as CSV:
,a,b,c,d
LinearRegression,0.000858399508896,-4.11609208874e+20,0.000952538859738,0.000952538859733
RandomForestRegressor,-1.62264355718,-2.30218457629,0.0008957696846039999,0.0008990722465239999
ElasticNet,0.000883257900658,0.0008525502791760002,0.000884706195921,0.000929498696126
Lasso,7.92193516085e-05,-1.84086765436e-05,7.92193516085e-05,-1.84086765436e-05
ExtraTreesRegressor,-6.320170496909999,-6.30420308033,,
Ridge,0.0008584791396339999,0.0008601028734780001,,
SGDRegressor,-4.62522968756,,,
You could give the graph a log scale, which is often used for plotting data with a very large range. This muddies the interpretation slightly, since equal distances on the axis now correspond to equal orders of magnitude rather than equal differences. You can read about log scales here:
https://en.wikipedia.org/wiki/Logarithmic_scale
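
Since the table mixes small positive scores with large negative ones, a plain log scale can't be applied to the raw values; matplotlib's 'symlog' scale (logarithmic on both sides of zero, with a linear band around it) is one way to follow this suggestion. A sketch, assuming the CSV above is saved as scores.csv:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('scores.csv', index_col=0)

ax = df.plot.bar()
# symlog: log-scaled above and below zero, linear inside |y| < linthresh,
# so the tiny near-zero scores stay visible next to -4.1e+20
ax.set_yscale('symlog', linthresh=1e-5)
ax.set_ylabel('score')
plt.tight_layout()
plt.show()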
