I'm trying to make a program that draws a graph from points in a CSV file, where each row contains 4 strings (point number, x position, y position, color). It takes a ridiculously long time to run, so I'm looking for ideas to make it faster.
from matplotlib import pyplot as plt
from matplotlib import style
import csv
style.use('ggplot')
s = 0.5
with open('total.csv') as f:
    f_reader = csv.reader(f, delimiter=',')
    for row in f_reader:
        plt.scatter(str(row[1]), str(row[2]), color=str(row[3]), s=s)
plt.savefig("graph.png", dpi=1000)
The first step would be to call scatter once instead of once per point. Without adding a dependency on numpy or pandas, it could look like:
from matplotlib import pyplot as plt
from matplotlib import style
import csv
style.use("ggplot")
s = 0.5
x = []
y = []
c = []
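with open("total.csv") as f:
    f_reader = csv.reader(f, delimiter=",")
    for row in f_reader:
        # Convert to float so the axes are numeric rather than categorical
        # (assumes the file has no header row).
        x.append(float(row[1]))
        y.append(float(row[2]))
        c.append(row[3])
plt.scatter(x, y, color=c, s=s)
plt.savefig("graph.png", dpi=1000)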
Then maybe try pandas.read_csv, which gives you a pandas DataFrame and lets you access the columns of your CSV without a for loop, which would probably be faster.
Each time you try a variation, measure the time it takes (possibly on a smaller file) to know what helps and what doesn't. In other words, don't try to improve performance blindly.
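A minimal way to take such measurements with only the standard library could look like the sketch below. The helper `time_call` and the two stand-in functions are hypothetical; they only illustrate the pattern of timing two variants on the same smaller input:

```python
import time

def time_call(func, *args, repeat=3):
    """Return the best wall-clock time (in seconds) over `repeat` calls to func."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical stand-ins for "one step per point" versus "one step for all points".
def per_item(values):
    out = []
    for v in values:
        out.append(float(v))
    return out

def batched(values):
    return list(map(float, values))

sample = [str(i) for i in range(10_000)]  # a smaller input, as suggested above
timings = {f.__name__: time_call(f, sample) for f in (per_item, batched)}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

Taking the best of a few repeats reduces noise from other processes; the absolute numbers will differ from machine to machine.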
Using pandas it would look like:
from matplotlib import pyplot as plt
from matplotlib import style
import pandas as pd
style.use("ggplot")
# Assumes the CSV has a header row naming the columns x, y and color.
total = pd.read_csv("total.csv")
plt.scatter(total.x, total.y, color=total.color, s=0.5)
plt.savefig("graph.png", dpi=1000)
If you want to learn more on pandas good practices for performance, I like the No more sad pandas talk, take a look at it.
Related
I have the following .csv data:
Simulation Run,[urea] (μM),[NO3-] (μM),[NH4+] (μM),[NO2-] (μM),[O2] (μM),[HCO3-] (μM),[OH-] (μM),[H+] (μM),[H2O] (μM)
/Run_01,1124.3139186264032,49.79709670397852,128.31458304321205,0.0,4.0,140000.0,0.1,0.1,55000000.0
/Run_02,1.0017668367460492e-159,2426.7395169966485,3.1544859186304598e-09,1.975005700484566e-10,4.0,140000.0,0.1,0.1,55000000.0
/Run_03,9.905001536507822e-160,2426.739516996945,2.861369463189477e-09,1.7910618538551373e-10,4.0,140000.0,0.1,0.1,55000000.0
/Run_04,1123.3362048916795,49.7956932352008,130.27141398143655,0.0,4.0,140000.0,0.1,0.1,55000000.0
/Run_05,1101.9594005273052,49.792379912298884,173.02833603309404,0.0,4.0,140000.0,0.1,0.1,55000000.0
I would like to plot it in a series of scatterplot matrices to look at the relationships between the different variables. Much like how it is done here. NOTE: In the linked example the person is asking how to accomplish this in altair. I want to do this in Matplotlib.
Using the above code as reference, here is the code I'm working with:
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
from math import ceil
def graph_data(f: str):
    """
    Represents the data
    as a series of scatter-plot matrices.
    """
    df = pd.read_csv(f)
    NROWS = ceil((len(df.columns) - 1) / 3)
    # Although the number of variables could vary,
    # I would like no more than 3 charts per row.
    NCOLS = 3
    fname = f[:-4] + '.pdf'
    with PdfPages(fname) as pdf:
        scatter_matrix(df, alpha=0.2, figsize=(NROWS, NCOLS), diagonal='kde')
        pdf.savefig(bbox_inches='tight')
        plt.close()
When I try to run this, here is the error I get:
[LOTS OF TRACEBACK]...numpy.linalg.LinAlgError: singular matrix
Is this happening because the number of variables isn't a perfect square number (thereby not yielding a square matrix)? Is there a way to avoid this?
EDIT:
I forgot to specify my import statements so I have those in now.
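One likely cause, judging from the sample rows: several columns ([O2], [HCO3-], [OH-], [H+], [H2O]) hold the same value in every run, and a kernel density estimate of a zero-variance column has a singular covariance. That is unrelated to how many variables there are. A minimal sketch of filtering such columns out before plotting, assuming that is indeed the cause; `drop_constant_columns` and the shortened column names are hypothetical:

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only numeric columns that take more than one distinct value."""
    numeric = df.select_dtypes(include="number")
    return numeric.loc[:, numeric.nunique() > 1]

# Tiny stand-in for the CSV above, with shortened (hypothetical) column names.
df = pd.DataFrame({
    "run": ["/Run_01", "/Run_02", "/Run_03"],
    "urea": [1124.3, 1.0e-159, 9.9e-160],
    "NO3": [49.8, 2426.7, 2426.7],
    "O2": [4.0, 4.0, 4.0],        # constant: no density can be estimated
    "H2O": [55000000.0] * 3,      # constant as well
})

plottable = drop_constant_columns(df)
print(list(plottable.columns))
# The filtered frame can then be passed to scatter_matrix(plottable, diagonal="kde").
```

Switching to diagonal='hist' would be another way to sidestep the density estimate, at the cost of still plotting columns that carry no information.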
I want to plot a tendency line on top of a data plot. This must be simple but I have not been able to figure out how to get to it.
Let us say I have the following:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), columns=list('A'))
ax = sns.lineplot(data=df)
ax.set(xlabel="Index",
       ylabel="Variable",
       title="Sample")
plt.show()
The resulting plot is:
What I would like to add is a tendency line. Something like the red line in the following:
I thank you for any feedback.
A moving average is one method (my first thought, and already suggested).
Another method is to use a polynomial fit. Since you had 100 points in your original data, I picked a 10th order fit (square root of data length) in the example below. With some modification of your original code:
idx = [i for i in range(100)]
rnd = np.random.randint(0,100,size=100)
ser = pd.Series(rnd, idx)
fit = np.polyfit(idx, rnd, 10)
pf = np.poly1d(fit)
plt.plot(idx, rnd, 'b', idx, pf(idx), 'r')
This code provides a plot like this:
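If a single straight trend line is enough, the same np.polyfit call with degree 1 gives a least-squares line. A small sketch, using synthetic and exactly linear data rather than the random series above so that the result is easy to check:

```python
import numpy as np

x = np.arange(100)
y = 2.0 * x + 1.0  # synthetic, exactly linear data

slope, intercept = np.polyfit(x, y, 1)   # coefficients, highest degree first
trend = np.poly1d([slope, intercept])    # callable: trend(x) = slope * x + intercept

print(round(slope, 6), round(intercept, 6))
# To draw it over the data: plt.plot(x, y, 'b', x, trend(x), 'r')
```

A low degree like this shows the overall direction only; higher orders follow local wiggles more closely but can become numerically ill-conditioned.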
You can do something like this using Rolling Average:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
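data = np.random.randint(0,100,size=(100, 1))
df = pd.DataFrame(data, columns=["A"])
# 7-point rolling mean, shifted back 3 steps so the window is centered
df["rolling_avg"] = df.A.rolling(7).mean().shift(-3)
sns.lineplot(data=df)
plt.show()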
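The shift by -3 re-centers the 7-point window on each sample. pandas' rolling also accepts center=True, which should give the same result for an odd window size; a quick check on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.integers(0, 100, size=100)})

shifted = df["A"].rolling(7).mean().shift(-3)       # trailing window, moved back 3 steps
centered = df["A"].rolling(7, center=True).mean()   # window centered on each sample

same = np.allclose(shifted, centered, equal_nan=True)
print(same)
```

Either way the first and last three positions are NaN, because a full 7-point window does not fit there.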
You could also use a regression plot to see how well a polynomial fits the data:
ax = sns.regplot(x=df.index, y="A",
                 data=df,
                 scatter_kws={"s": 10},
                 order=10,
                 ci=None)
I am trying to create a heatmap with dendrograms in Python using Seaborn. I have a CSV file with about 900 rows, which I import as a pandas DataFrame and try to plot, but a large number of the rows are not represented in the heatmap. What am I doing wrong?
This is the code I have right now. But the heatmap only represents about 49 rows.
Here is an image of the clustermap I've obtained but it is not displaying all of my data.
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
# Data set
df = pd.read_csv('diff_exp_gene.csv', index_col = 0)
# Default plot
sns.clustermap(df, cmap = 'RdBu', row_cluster=True, col_cluster=True)
plt.show()
Thank you.
An alternative approach would be to use imshow in matplotlib. I'm not exactly sure what your question is, but here is a way to plot points on a plane from a CSV file:
import numpy as np
import matplotlib.pyplot as plt
import csv
# Assumes columns named TYPE, X and Y, and that `types` has already been
# set to the TYPE value you want to count.
with open('diff_exp_gene.csv') as in_file:
    data = csv.DictReader(in_file)
    temp = np.zeros((128,128), dtype = int)
    for row in data:
        if row['TYPE'] == types:
            temp[int(row['Y']), int(row['X'])] += 1
plt.imshow(temp, cmap = 'hot', origin = 'lower')
plt.show()
As far as I know, keywords that apply to seaborn heatmaps also apply to clustermap, since sns.clustermap passes them on to sns.heatmap. In that case, all you need to do in your example is pass yticklabels=True as a keyword argument to sns.clustermap(). That will make the labels of all 900 rows appear.
By default, it is set to "auto", which thins out the labels to avoid overlap. The same applies to xticklabels. See more here: https://seaborn.pydata.org/generated/seaborn.heatmap.html
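A short sketch of that keyword on synthetic data, since the real diff_exp_gene.csv is not available here. The tall figsize is my own assumption, added so that the labels of many rows have room:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use

import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Synthetic stand-in: 60 "genes" x 4 samples.
df = pd.DataFrame(rng.normal(size=(60, 4)),
                  index=[f"gene_{i}" for i in range(60)],
                  columns=list("ABCD"))

# yticklabels=True asks for one label per row instead of the thinned "auto" set.
g = sns.clustermap(df, cmap="RdBu", yticklabels=True, figsize=(6, 14))
n_labels = len(g.ax_heatmap.get_yticklabels())
print(n_labels, "row labels for", len(df), "rows")
```

With about 900 rows the labels will still be tiny unless the figure is made considerably taller or the font size is reduced.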
I have a CSV with 12 columns of data. I'm focusing on these 4 columns
Right now I've plotted "Pass def" and "Rush def". I want to be able to highlight specific points on the scatter plot. For example, I want to highlight 1995 DAL point on the plot and change that point to a color of yellow.
I've started with a for loop but I'm not sure where to go. Any help would be great.
Here is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import csv
import random
df = pd.read_csv('teamdef.csv')
x = df["Pass Def."]
y = df["Rush Def."]
z = df["Season"]
points = []
for point in df["Season"]:
    if point == 2015.0:
        print(point)
plt.figure(figsize=(19,10))
plt.scatter(x,y,facecolors='black',alpha=.55, s=100)
plt.xlim(-.6,.55)
plt.ylim(-.4,.25)
plt.xlabel("Pass DVOA")
plt.ylabel("Rush DVOA")
plt.title("Pass v. Rush DVOA")
plt.show()
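You can layer multiple scatters, so the easiest way is probably to draw all points first and then draw the selected rows again on top in a different color:
highlight = df["Season"] == 2015.0
plt.scatter(x,y,facecolors='black',alpha=.55, s=100)
plt.scatter(x[highlight], y[highlight], color="yellow", s=100)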
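To pick out one specific point such as 1995 DAL, a boolean mask on the DataFrame can select the rows drawn in the second layer. A minimal sketch on made-up data; the `Team` column name and all values are assumptions, since the real CSV is not shown:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; "Team" is a hypothetical column name.
df = pd.DataFrame({
    "Season": [1995, 1995, 2015, 2015],
    "Team": ["DAL", "SF", "DAL", "DEN"],
    "Pass Def.": [-0.10, -0.20, 0.05, -0.25],
    "Rush Def.": [-0.05, -0.15, 0.02, -0.20],
})

# Combine conditions with & (each one in parentheses) to narrow to one row.
mask = (df["Season"] == 1995) & (df["Team"] == "DAL")
selected = df[mask]

fig, ax = plt.subplots()
ax.scatter(df["Pass Def."], df["Rush Def."], facecolors="black", alpha=0.55, s=100)
# Second layer: only the selected rows, drawn on top in yellow.
ax.scatter(selected["Pass Def."], selected["Rush Def."], color="yellow", s=100)
print(len(selected), "point(s) highlighted")
```

Because the second scatter is drawn after the first, the yellow marker covers the black one at the same position.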
I have a csv file (excel spreadsheet) of a column of roughly a million numbers. I want to make a histogram of this data with the frequency of the numbers on the y-axis and the number quantities on the x-axis. I know matplotlib can plot a histogram, but my main problem is converting the csv file from string to float since a string can't be graphed. This is what I have:
import matplotlib.pyplot as plt
import csv
with open('D1.csv', newline='') as data:  # Python 3: csv needs text mode, not 'rb'
    rows = csv.reader(data, quoting = csv.QUOTE_NONNUMERIC)
    floats = [[item for number, item in enumerate(row) if item and (1 <= number <= 12)] for row in rows]
plt.hist(floats, bins=50)
plt.title("histogram")
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()
You can do it in one line with pandas:
import pandas as pd
pd.read_csv('D1.csv', quoting=2)['column_you_want'].hist(bins=50)
Okay I finally got something to work with headings, titles, etc.
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('D1.csv', quoting=2)
data.hist(bins=50)
plt.xlim([0,115000])
plt.title("Data")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
My first problem was that matplotlib is needed to actually show the graph. Also, I needed to assign the result of
pd.read_csv('D1.csv', quoting=2)
to a variable, data, so that I could plot its histogram with
data.hist
Thank you all for the help.
pandas' read_csv is very powerful, but if your CSV file is simple (no headers, NaNs or comments) you do not need pandas, as you can use NumPy:
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('D1.csv')
plt.hist(data, density=True, bins='auto')  # density replaces the removed normed argument
(In fact loadtxt can deal with some headers and comments, but read_csv is more versatile)
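A small sketch of the NumPy route on in-memory data, with io.StringIO standing in for D1.csv (the values are made up). np.loadtxt already returns floats, so no manual string conversion is needed, and np.histogram computes the counts that plt.hist would draw for the same bins:

```python
import io
import numpy as np

# Stand-in for a one-column CSV file; the values are made up.
fake_csv = io.StringIO("1.5\n2.0\n2.5\n10.0\n12.5\n")

data = np.loadtxt(fake_csv)                 # parsed straight into a float array
counts, edges = np.histogram(data, bins=5)  # counts per bin and the bin edges

print(data.dtype, counts.sum())
# To draw it: plt.hist(data, bins=5); plt.show()
```

Checking the counts this way is handy on a machine without a display, or before plotting a file with a million rows.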