I have the following .csv data:
Simulation Run,[urea] (μM),[NO3-] (μM),[NH4+] (μM),[NO2-] (μM),[O2] (μM),[HCO3-] (μM),[OH-] (μM),[H+] (μM),[H2O] (μM)
/Run_01,1124.3139186264032,49.79709670397852,128.31458304321205,0.0,4.0,140000.0,0.1,0.1,55000000.0
/Run_02,1.0017668367460492e-159,2426.7395169966485,3.1544859186304598e-09,1.975005700484566e-10,4.0,140000.0,0.1,0.1,55000000.0
/Run_03,9.905001536507822e-160,2426.739516996945,2.861369463189477e-09,1.7910618538551373e-10,4.0,140000.0,0.1,0.1,55000000.0
/Run_04,1123.3362048916795,49.7956932352008,130.27141398143655,0.0,4.0,140000.0,0.1,0.1,55000000.0
/Run_05,1101.9594005273052,49.792379912298884,173.02833603309404,0.0,4.0,140000.0,0.1,0.1,55000000.0
I would like to plot it in a series of scatterplot matrices to look at the relationships between the different variables. Much like how it is done here. NOTE: In the linked example the person is asking how to accomplish this in altair. I want to do this in Matplotlib.
Using the above code as reference, here is the code I'm working with:
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
from math import ceil
def graph_data(f: str):
    """
    Represents the data
    as a series of scatter-plot matrices.
    """
    df = pd.read_csv(f)
    NROWS = ceil((len(df.columns) - 1) / 3)
    # Although the number of variables could vary,
    # I would like no more than 3 charts per row.
    NCOLS = 3
    fname = f[:-4] + '.pdf'
    with PdfPages(fname) as pdf:
        scatter_matrix(df, alpha=0.2, figsize=(NROWS, NCOLS), diagonal='kde')
        pdf.savefig(bbox_inches='tight')
        plt.close()
When I try to run this, here is the error I get:
[LOTS OF TRACEBACK]...numpy.linalg.LinAlgError: singular matrix
Is this happening because the number of variables isn't a perfect square number (thereby not yielding a square matrix)? Is there a way to avoid this?
EDIT:
I forgot to specify my import statements so I have those in now.
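A possible explanation, offered as a sketch rather than a confirmed diagnosis: the error most likely has nothing to do with the number of variables. With diagonal='kde', scatter_matrix fits a kernel density estimate to each numeric column, and columns such as [O2] (μM) hold the same value in every run, so their variance is zero and the covariance cannot be inverted. The helper varying_numeric_columns below is hypothetical (it is not part of pandas) and simply drops such columns; the inline CSV is an abbreviated, rounded copy of three of the rows above.

```python
import io
import pandas as pd

# Abbreviated, rounded copy of the data from the question (assumption: the
# real file has the same layout, with several constant columns).
CSV = """Simulation Run,[urea] (μM),[NO3-] (μM),[O2] (μM)
/Run_01,1124.31,49.797,4.0
/Run_02,1.0e-159,2426.74,4.0
/Run_04,1123.34,49.796,4.0
"""

def varying_numeric_columns(df):
    """Hypothetical helper: keep numeric columns with more than one distinct value."""
    numeric = df.select_dtypes(include="number")
    return numeric.loc[:, numeric.nunique() > 1]

df = pd.read_csv(io.StringIO(CSV))
plot_df = varying_numeric_columns(df)
print(list(plot_df.columns))
```

plot_df can then be passed to scatter_matrix(plot_df, alpha=0.2, diagonal='kde'). Note also that scatter_matrix always draws an n-by-n grid and that figsize is given in inches, so NROWS and NCOLS do not control the layout.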
I would like to determine the intersection of two Matplotlib plots.
The input data for the first plot is stored in a CSV file that looks like this:
Time;Channel A;Channel B;Channel C;Channel D
(s);(mV);(mV);(mV);(mV)
0,00000000;-16,28006000;2,31961900;13,29508000;-0,98889020
0,00010000;-16,28006000;1,37345900;12,59309000;-1,34293700
0,00020000;-16,16408000;1,49554400;12,47711000;-1,92894600
0,00030000;-17,10414000;1,25747800;28,77549000;-1,57489900
0,00040000;-16,98205000;1,72750600;6,73299900;0,54327920
0,00050000;-16,28006000;2,31961900;12,47711000;-0,51886220
0,00060000;-16,39604000;2,31961900;12,47711000;0,54327920
0,00070000;-16,39604000;2,19753400;12,00708000;-0,04883409
0,00080000;-17,33610000;7,74020200;16,57917000;-0,28079600
0,00090000;-16,98205000;2,31961900;9,66304500;1,48333500
This is the shortened CSV file. The Original has a lot more Data.
I got this code so far to get the FFT of Channel D:
import matplotlib.pyplot as plt
import pandas as pd
from numpy.fft import rfft, rfftfreq
a=pd.read_csv('20210629-0007.csv', sep = ';', skiprows=[1,2],usecols = [4],dtype=float, decimal=',')
dt = 1/10000
#print(a.head())
n=len(a)
#time increment in each data
acc=a.values.flatten() #to convert DataFrame to 1D array
#acc value must be in numpy array format for half way mirror calculation
fft=rfft(acc)*dt
freq=rfftfreq(n,d=dt)
FFT=abs(fft)
plt.plot(freq,FFT)
plt.axvline(x=150, color = 'red')
plt.show()
Does anybody know how to get the intersection of those two plots (the red line and the blue line at the same frequency)?
I would be very grateful for any help!
manually
This is not really a programming question, rather basic mathematics.
Here is your plot:
Let's call (x1,y1) and (x2,y2) the first two points of your blue line and (x,y) the coordinates of the intersection.
You have this relationship between the points: (x-x1)/(x2-x1) = (y-y1)/(y2-y1)
Thus: y=y1+(x-x1)*(y2-y1)/(x2-x1)
Which gives FFT[0]+(150-0)*(FFT[1]-FFT[0])/(freq[1]-freq[0])
Coordinates of the intersection are (150, 0.000189)
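The same formula can be checked numerically. This is a minimal sketch, assuming only the ten Channel D values from the shortened CSV in the question; with ten samples at dt = 1/10000 the first two frequency bins are 0 Hz and 1000 Hz, so x = 150 lies between them. np.interp does the same linear interpolation but finds the bracketing pair of points itself, so it also works when the line is not in the first segment.

```python
import numpy as np
from numpy.fft import rfft, rfftfreq

dt = 1 / 10000
# Channel D values copied from the shortened CSV in the question
acc = np.array([-0.98889020, -1.34293700, -1.92894600, -1.57489900,
                0.54327920, -0.51886220, 0.54327920, -0.04883409,
                -0.28079600, 1.48333500])

freq = rfftfreq(len(acc), d=dt)
FFT = np.abs(rfft(acc) * dt)

x = 150.0
# Manual formula from the answer, using the first two points of the curve
y_manual = FFT[0] + (x - freq[0]) * (FFT[1] - FFT[0]) / (freq[1] - freq[0])
# Library equivalent: piecewise-linear interpolation over the whole curve
y_interp = np.interp(x, freq, FFT)
print(x, y_manual, y_interp)
```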
programmatically
You can use the pd.Series.interpolate method
import numpy as np
import pandas as pd
np.random.seed(0)
# Example series with an irregular integer index
s = pd.Series(np.random.randint(0,100,20),
              index=sorted(np.random.choice(range(100), 20))).sort_index()
ax = s.plot()
ax.axvline(35, color='r')
# Add x=35 as a missing value, then fill it by index-based linear interpolation
s.loc[35] = np.nan
ax.plot(35, s.sort_index().interpolate(method='index').loc[35], marker='o')
This is a very straightforward question. I have an x axis of years and a y axis of numbers increasing linearly by 100. When plotting this with pandas and matplotlib, I get a graph that does not represent the data at all. I need some help figuring this out, because it is such a small amount of code:
The CSV is as follows:
A,B
2012,100
2013,200
2014,300
2015,400
2016,500
2017,600
2018,700
2012,800
2013,900
2014,1000
2015,1100
2016,1200
2017,1300
2018,1400
The Code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.read_csv("CSV/DSNY.csv")
data.set_index("A", inplace=True)
data.plot()
plt.show()
The graph this yields is:
It is clearly very inconsistent with the data - any suggestions?
The default behaviour of matplotlib/pandas is to draw a line between successive data points, and not to mark each data point with a symbol. Because the years in column A run from 2012 to 2018 twice, the line doubles back on itself.
Fix: change data.plot() to data.plot(style='o'), or data.plot(marker='o', linewidth=0).
Result:
All you need to do is sort by A before plotting.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# No reset_index() here: it would add an extra 'index' column that gets plotted as a second line
data = pd.read_csv("CSV/DSNY.csv")
data = data.sort_values('A')
data.set_index("A", inplace=True)
data.plot()
plt.show()
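Both answers come down to the order of the x values, which can be checked without plotting. A minimal sketch, assuming the CSV from the question is inlined as a string instead of being read from CSV/DSNY.csv:

```python
import io
import pandas as pd

# The CSV from the question, inlined so the sketch is self-contained
CSV = """A,B
2012,100
2013,200
2014,300
2015,400
2016,500
2017,600
2018,700
2012,800
2013,900
2014,1000
2015,1100
2016,1200
2017,1300
2018,1400
"""

data = pd.read_csv(io.StringIO(CSV)).set_index("A")
# The years restart at 2012 halfway through, so the index is not ordered
before = data.index.is_monotonic_increasing
after = data.sort_index().index.is_monotonic_increasing
print(before, after)
```

When the index is not monotonic, a line plot has to jump back along the x axis, which is what makes the graph look wrong.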
I am trying to create a heatmap with dendrograms in Python using Seaborn, and I have a CSV file with about 900 rows. I'm importing the file as a pandas dataframe and attempting to plot it, but a large number of the rows are not being represented in the heatmap. What am I doing wrong?
This is the code I have right now. But the heatmap only represents about 49 rows.
Here is an image of the clustermap I've obtained but it is not displaying all of my data.
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
# Data set
df = pd.read_csv('diff_exp_gene.csv', index_col = 0)
# Default plot
sns.clustermap(df, cmap = 'RdBu', row_cluster=True, col_cluster=True)
plt.show()
Thank you.
An alternative approach would be to use imshow in matplotlib. I'm not exactly sure what your question is, but here is a way to plot points on a plane from a CSV file, assuming it has integer 'X' and 'Y' columns (0 to 127) and a 'TYPE' column:
import csv
import numpy as np
import matplotlib.pyplot as plt
types = 'A'  # placeholder: the TYPE value to count; replace with a value from your file
temp = np.zeros((128,128), dtype = int)
with open('diff_exp_gene.csv') as infile:
    reader = csv.DictReader(infile)
    for row in reader:
        # Count how many matching rows fall on each (Y, X) cell
        if row['TYPE'] == types:
            temp[int(row['Y'])][int(row['X'])] += 1
plt.imshow(temp, cmap = 'hot', origin = 'lower')
plt.show()
As far as I know, keywords that apply to seaborn heatmaps also apply to clustermap, because sns.clustermap passes them on to sns.heatmap. In that case, all you need to do in your example is to set yticklabels=True as a keyword argument in sns.clustermap(). That will make the labels of all 900 rows appear; the rows themselves are already drawn, but only a subset of them is labelled.
By default, it is set to "auto", which thins the labels to avoid overlap. The same applies to xticklabels. See more here: https://seaborn.pydata.org/generated/seaborn.heatmap.html
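A minimal sketch of that suggestion, assuming synthetic data in place of diff_exp_gene.csv (the gene_ names, the 60-row size and the figsize scaling are all made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
n_rows = 60  # stand-in for the ~900 rows in the question
df = pd.DataFrame(rng.normal(size=(n_rows, 4)),
                  index=[f"gene_{i}" for i in range(n_rows)],
                  columns=list("ABCD"))

# yticklabels=True asks for one label per row; a taller figure keeps them legible
g = sns.clustermap(df, cmap="RdBu", yticklabels=True,
                   figsize=(6, 0.2 * n_rows))
n_labels = len(g.ax_heatmap.get_yticklabels())
print(n_labels)
```

With 900 rows the figure has to be made quite tall (or the font quite small) for the labels to stay readable.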
Here were my 3 assigned data analysis tasks:
Display the complaint type and city together.
Plot a bar graph of count vs. complaint types.
Display the major complaint types and their count.
Here is my code:
import pandas as pd
import numpy as np
work_on = data[['Complaint Type','City']]
import matplotlib.pyplot as plt
from matplotlib import style
%matplotlib inline
koten = work_on['Complaint Type'].value_counts().head(10).plot(kind='bar')
koten
-- bar graph that was obtained
-- This displays a bar graph, but when I use the following code:
style.use('ggplot')
plt.plot(work_on['Complaint Type'].value_counts().head(10))
plt.xlabel('Values')
plt.ylabel('Names')
plt.title('first')
plt.show()
-- this throws an error:
ValueError: could not convert string to float: 'Traffic Signal Condition'
My question: the pandas .plot(kind=...) method only works for me with kind='bar', which produced the graph I shared. When I use the matplotlib method instead, I get errors such as: ValueError: could not convert string to float: 'Traffic Signal Condition'. Is there any other good method in Python to display such non-numerical data?
Here is a glimpse of my data:
Two columns to be worked on
It's not clear from the question what the desired plot should actually show. If it is a line plot across the different categories, you would need to provide the indices of the categories to the plt.plot function.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame([])
df["ComplaintType"] = np.random.choice(list("ABCDEFGHIJ"), size=50)
counts = df["ComplaintType"].value_counts()
plt.plot(range(len(counts)), counts)
plt.xticks(range(len(counts)), counts.index)
plt.show()
It is questionable how much sense it makes to connect different categories with a line, though.
I have a series of data which consists of values from several experiments (1-40, in the MWE it is 1-5). The overall amount of entries in my original data is ~4.000.000, which I try to smooth in order to display it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import spline
from statsmodels.nonparametric.smoothers_lowess import lowess
df = pd.DataFrame()
df["values"] = np.random.randint(100000, 200000, 1000)
df["id"] = [1,2,3,4,5] * 200
plt.figure(1, figsize=(11.69,8.27))
# Both fail for my amount of data:
plt.plot(spline(df["values"], df["id"], range(100)), "r-")
plt.plot(lowess(df["values"], df["id"]), "r-")
Both scipy.interpolate and statsmodels.nonparametric.smoothers_lowess.lowess throw out-of-memory exceptions for my data. Is there an efficient way to solve this, as in, e.g., GNU R using ggplot2 and geom_smooth()?
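I can't quite tell what you're getting at with all the dimensions of your data, but one very simple thing you can try is to plot only every nth point. Note that the 'markevery' kwarg only thins the markers (the line is still drawn through every point), so slicing the arrays is what actually reduces the data:
import numpy as np
import matplotlib.pyplot as plt
# np.linspace needs an integer sample count; a float such as 1E7 fails on current NumPy
x=np.linspace(1,100,10**7)
y=x**2
plt.figure(1, figsize=(11.69,8.27))
# Keep every 100th sample
plt.plot(x[::100],y[::100])
plt.show()
This plots only every nth point (n=100 here).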
If that doesn't help then you may want to try just a simple numpy interpolation with fewer samples like so:
x_large=np.linspace(1,100,10**7)
y_large=x_large**2
x_small=np.linspace(1,100,10**3)
y_small=np.interp(x_small,x_large,y_large)
plt.plot(x_small,y_small)
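If plain subsampling or interpolation is too noisy, another cheap option is to average consecutive blocks of samples before plotting. This is a different technique from lowess or geom_smooth(), shown only as a rough sketch; block_mean is a hypothetical helper, and the random values stand in for the real measurements.

```python
import numpy as np

def block_mean(y, block):
    """Hypothetical helper: mean of consecutive blocks of `block` samples.

    A trailing remainder shorter than `block` is dropped.
    """
    y = np.asarray(y, dtype=float)
    n_blocks = len(y) // block
    return y[:n_blocks * block].reshape(n_blocks, block).mean(axis=1)

rng = np.random.default_rng(0)
# Synthetic stand-in for the measured values
values = rng.integers(100000, 200000, size=1_000_000)
smoothed = block_mean(values, 1000)
print(smoothed.shape)
```

The reshape does not copy the data and the mean is a single vectorised pass, so memory use stays close to the size of the input array.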