I have a data frame from a .csv import that contains n trials. Trials are arranged by column, each with a header (wavelength1 for trial 1, wavelength2 for trial 2, etc.). We're tracking the absorption of a solution over time during a chemical reaction; you can see a screenshot of the Excel file in the link below. Trials are grouped into threes (grams of sugar is the independent variable, absorbance in nm the dependent variable). For each trial:
I need to determine the maximum and minimum values. This can of course be done using max() and min(), but when we are sampling every 0.25 seconds the data can be noisy, which means I have to smooth it out first. I have already built a function to do that (a generic sketch of such a smoother follows this list). We're also probably just going to sample every second, as it's much smoother anyway.
Each group of three trials needs to be plotted on the same graph for comparison, so n trials will produce n/3 graphs.
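For context, a minimal sketch of the kind of smoothing I mean (my real function differs; a centered rolling mean is just an illustration):

import pandas as pd

def smooth(series, window=9):
    # Illustrative only: a centered rolling mean; min_periods=1 keeps
    # the ends of the trace defined.
    return series.rolling(window, center=True, min_periods=1).mean()

# e.g. max/min of one smoothed trial:
# smooth(data['wavelength1']).max(), smooth(data['wavelength1']).min()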
I'm coming from an intermediate background in MATLAB. This is not something I was ever able to figure out in there, either.
What have I done so far?
I have attempted to make a list out of the headers for each trial, and then use a for loop to move through the data via df.column_name attribute access:
data = pd.read_csv('data.csv')
col_name = data.columns.values
print(col_name)

for i in col_name:
    print(data.col_name[i])
The code works up to the 4th line; the last line raises the error AttributeError: 'DataFrame' object has no attribute 'col_name'. This is where I would like to make a series (or whatever it's called here) out of all the values from the wavelength1 trial, to plot/manipulate/etc. It's worth noting that I have gotten the multiple plots and multiple lines to work manually, but I want to automate it, as that's of course the point of coding. Here's one of the four graphs from the 'manual' version:
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('data.csv')
plt.close("all")

n_rows = 2
n_columns = 2

# initialize figure
figure_size = (30, 15)
font_size = 13
f, ([plt1, plt2], [plt3, plt4]) = plt.subplots(n_rows, n_columns, figsize=figure_size)

# plot first three runs
x = data.time1
y = data.wavelength1
plt1.plot(x, y, label='Trial 1')

x = data.time2
y = data.wavelength2
plt1.plot(x, y, label='Trial 2')

x = data.time3
y = data.wavelength3
plt1.plot(x, y, label='Trial 3')

plt1.set_title('0.3g Glucose', fontweight="bold", size=font_size)
plt1.set_ylabel('Wavelength (nm)', fontsize=font_size)
plt1.set_xlabel('Time (s)', fontsize=font_size)
plt1.legend(fontsize=font_size)
My first thought was just to do:
for i in range(0, num_col):
    plot(time, data.wavelength(i))
But this does not work. I'm sure it's something quite simple but it is escaping me.
Example data:
https://ufile.io/ac226vma
Thanks in advance!
[1]: https://i.stack.imgur.com/gMtBN.png
Analysis
I need to determine what the maximum and minimum values are.
Since you want the largest and smallest values within each trial, and each trial is represented by one column, you can use DataFrame.max() and DataFrame.min() to get the extreme value in each column. If you also want to know the index where each extreme occurs, throw in idxmax() and idxmin().
df = pd.read_csv("data.csv")
# Get max and min values
print("ANALYSIS OF MIN AND MAX VALUES")
analysis_df = pd.DataFrame()
analysis_df["min"] = df.min()
analysis_df["min_idx"] = df.idxmin()
analysis_df["max"] = df.max()
analysis_df["max_idx"] = df.idxmax()
print(analysis_df)
produces:
ANALYSIS OF MIN AND MAX VALUES
min min_idx max max_idx
wavelength1 801.0 120 888.0 4
wavelength2 809.0 85 888.0 1
wavelength3 728.0 96 837.0 1
wavelength4 762.0 114 864.0 3
wavelength5 785.0 115 878.0 2
wavelength6 747.0 118 866.0 1
wavelength7 748.0 119 851.0 3
wavelength8 776.0 113 880.0 0
wavelength9 812.0 112 900.0 0
wavelength10 770.0 110 863.0 1
wavelength11 759.0 100 858.0 0
wavelength12 787.0 91 876.0 0
wavelength13 756.0 66 862.0 2
wavelength14 809.0 70 877.0 1
wavelength15 828.0 62 866.0 0
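One aside: if your CSV also stores time columns (time1, time2, ...) as in the manual script above, those would be swept into this min/max table too. A hedged sketch of restricting the analysis to the trial columns first:

wavelengths = df.filter(like="wavelength")  # keep only the trial columns
print(wavelengths.min())
print(wavelengths.max())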
Plotting
Each group of three trials needs to be plotted on the same graph for comparison. n number of trials will create n/3 graphs.
This is easier if you break it up into a few smaller subproblems.
First, you want to take the list of all of your columns and break it up into groups of three. The grouper function below is the standard recipe from the itertools documentation.
def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.zip_longest(fillvalue=fillvalue, *args)
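For example, with your column names:

>>> list(grouper(3, ["wavelength1", "wavelength2", "wavelength3", "wavelength4"]))
[('wavelength1', 'wavelength2', 'wavelength3'), ('wavelength4', None, None)]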
Now, once we have a group of three column names, we need to get the values in the dataframe associated with those columns. Also, since your data file contains unequal numbers of observations per trial, we need to get rid of the NaNs at the end of the file.
def get_trials(df, column_group_names):
    """Get columns from dataframe, dropping missing values."""
    column_group = df[list(column_group_names)]
    column_group = column_group.dropna(how='all')
    return column_group
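For example, pulling the first group of three trials:

first_group = get_trials(df, ("wavelength1", "wavelength2", "wavelength3"))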
Now, let's combine those two functions:
col_iterator = grouper(3, df.columns)
[...]
for column_group_names in col_iterator:
    column_group = get_trials(df, column_group_names)
    [...]
This lets us loop over the columns in groups of three and plot each group individually. Since we've filtered down to the data we're interested in, we can use DataFrame.plot to draw it onto a matplotlib axes.
Next, we need to loop over the subplots. This is a little annoying to do while also looping over groups, so I like to define an iterator.
def subplot_axes_iterator(n_rows, n_columns):
    for i in range(n_rows):
        for j in range(n_columns):
            yield i, j
Example of it in use:
>>> list(subplot_axes_iterator(2, 2))
[(0, 0), (0, 1), (1, 0), (1, 1)]
Now, combine those pieces:
# Plot data
n_rows = 2
n_columns = 3
figure_size = (15, 10)
font_size = 13

fig, axes = plt.subplots(n_rows, n_columns, figsize=figure_size)
col_iterator = grouper(3, df.columns)
axes_iterator = subplot_axes_iterator(n_rows, n_columns)
plot_names = [
    "Group 1",
    "Group 2",
    "Group 3",
    "Group 4",
    "Group 5",
]

for column_group_names, axes_position, plot_name in \
        zip(col_iterator, axes_iterator, plot_names):
    print(f"plotting {column_group_names} at {axes_position}")
    column_group = get_trials(df, column_group_names)
    column_group.plot(ax=axes[axes_position])
    axes[axes_position].set_title(plot_name, fontweight="bold", size=font_size)
    axes[axes_position].set_xlabel("Time (s)", fontsize=font_size)
    axes[axes_position].set_ylabel("Wavelength (nm)", fontsize=font_size)

plt.tight_layout()
plt.show()
(By the way, you said that you want 4 graphs, but the dataset posted has fifteen trials, so I made 5 graphs.)
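If you'd rather not hard-code the group names (or the grid shape) for a different number of trials, both can be derived from the column count; a small sketch:

import math

n_groups = math.ceil(len(df.columns) / 3)  # 15 trials -> 5 groups
plot_names = [f"Group {i + 1}" for i in range(n_groups)]
n_columns = 3
n_rows = math.ceil(n_groups / n_columns)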
Final script
(Included for easy copy/paste.)
import itertools

import pandas as pd
import matplotlib.pyplot as plt


def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.zip_longest(fillvalue=fillvalue, *args)


def get_trials(df, column_group_names):
    """Get columns from dataframe, dropping missing values."""
    column_group = df[list(column_group_names)]
    column_group = column_group.dropna(how='all')
    return column_group


def subplot_axes_iterator(n_rows, n_columns):
    for i in range(n_rows):
        for j in range(n_columns):
            yield i, j


df = pd.read_csv("data.csv")

# Get max and min values
print("ANALYSIS OF MIN AND MAX VALUES")
analysis_df = pd.DataFrame()
analysis_df["min"] = df.min()
analysis_df["min_idx"] = df.idxmin()
analysis_df["max"] = df.max()
analysis_df["max_idx"] = df.idxmax()
print(analysis_df)

# Plot data
n_rows = 2
n_columns = 3
figure_size = (15, 10)
font_size = 13

fig, axes = plt.subplots(n_rows, n_columns, figsize=figure_size)
col_iterator = grouper(3, df.columns)
axes_iterator = subplot_axes_iterator(n_rows, n_columns)
plot_names = [
    "Group 1",
    "Group 2",
    "Group 3",
    "Group 4",
    "Group 5",
]

for column_group_names, axes_position, plot_name in \
        zip(col_iterator, axes_iterator, plot_names):
    print(f"plotting {column_group_names} at {axes_position}")
    column_group = get_trials(df, column_group_names)
    column_group.plot(ax=axes[axes_position])
    axes[axes_position].set_title(plot_name, fontweight="bold", size=font_size)
    axes[axes_position].set_xlabel("Time (s)", fontsize=font_size)
    axes[axes_position].set_ylabel("Wavelength (nm)", fontsize=font_size)

plt.tight_layout()
plt.show()
Related
I have a .csv file containing x y data from transects (.csv file here).
The file can contain a few dozen transects (the example has only 4).
I want to calculate the elevation change from each transect and then select the transect with the highest elevation change.
x y lines
0 3.444 1
0.009 3.445 1
0.180 3.449 1
0.027 3.449 1
...
0 2.115 2
0.008 2.115 2
0.017 2.115 2
0.027 2.116 2
I've tried to calculate the change with pandas.DataFrame.diff, but I'm unable to select the highest elevation change from that.
UPDATE: I found a way to calculate the height difference for 1 transect. The goal is now to loop this script through the different other transects and let it select the transect with the highest difference. Not sure how to create a loop from this...
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import savgol_filter, find_peaks

df = pd.read_csv('transect4.csv', delimiter=',', header=None, names=['x', 'y', 'lines'])
df_1 = df['lines'] == 1
df1 = df[df_1]
plt.plot(df1['x'], df1['y'], label='Original Topography')

# apply a Savitzky-Golay filter
smooth = savgol_filter(df1.y.values, window_length=351, polyorder=5)

# find the maxima
peaks_idx_max, _ = find_peaks(smooth, prominence=0.01)

# reciprocal, so mins become maxima
smooth_rec = 1 / smooth

# find the mins now
peaks_idx_mins, _ = find_peaks(smooth_rec, prominence=0.01)

plt.xlabel('Distance')
plt.ylabel('Height')
plt.plot(df1['x'], smooth, label='Smoothed Topography')

# plot them
plt.scatter(df1.x.values[peaks_idx_max], smooth[peaks_idx_max], s=55,
            c='green', label='Local Max Cusp')
plt.scatter(df1.x.values[peaks_idx_mins], smooth[peaks_idx_mins], s=55,
            c='black', label='Local Min Cusp')
plt.legend(loc='upper left')
plt.show()

# Export to csv
df['Cusp_max'] = False
df['Cusp_min'] = False
df.loc[df1.x[peaks_idx_max].index, 'Cusp_max'] = True
df.loc[df1.x[peaks_idx_mins].index, 'Cusp_min'] = True
data = df[df['Cusp_max'] | df['Cusp_min']]
data.to_csv(r'Cusp_total.csv')

# Calculate height difference
my_data = pd.read_csv('Cusp_total.csv', delimiter=',', header=0, names=['ID', 'x', 'y', 'lines'])
df1_diff = pd.DataFrame(my_data)
df1_diff['Diff_Cusps'] = df1_diff['y'].diff(-1)

# Only use positive numbers for the average
df1_pos = df1_diff[df1_diff['Diff_Cusps'] > 0]
print("Average Height Difference:", df1_pos['Diff_Cusps'].mean(), "m")
Ideally, the script would select the transect with the highest elevation change from an unknown number of transects in the .csv file, which will then be exported to a new .csv file.
You need to group by the lines column.
I'm not sure if this is what you mean by elevation change, but this gives the difference of elevations (max(y) - min(y)) for each group, where a group is formed by all rows sharing the same value of lines. It should help you with the piece missing from your logic (sorry, I can't put more time in).
frame = pd.read_csv('transect4.csv', header=None, names=['x', 'y', 'lines'])
groups = frame.groupby('lines')
groups['y'].max() - groups['y'].min()  # elevation change (max - min) per group
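From there, a hedged sketch of selecting the transect with the largest change and exporting its rows to a new .csv, as the question asks (the output filename is made up):

elevation_change = groups['y'].max() - groups['y'].min()
best_line = elevation_change.idxmax()  # 'lines' value with the largest change
frame[frame['lines'] == best_line].to_csv('highest_transect.csv', index=False)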
I have a dataframe that looks like this:
0 1 2 ... 147 148 149
Columns 0 190.2 190.5 189.9 ... 146.7 146.4 146.1
Values 0 -49.3892 -47.0297 -39.528 ... -30.7926 -30.7561 -30.719
Columns 1 190.2 190.5 189.9 ... 146.7 146.4 146.1
Values 1 -49.3892 -47.0297 -39.528 ... -30.7926 -30.7561 -30.719
Columns 2 190.2 190.5 189.9 ... 146.7 146.4 146.1
I want to create a curve for every pair Columns # and Value #. The dataframe has 3478 rows so 1738 pairs of data.
I have tried a for loop that looks like:
import matplotlib.pyplot as plt

nline2 = len(df2.index)
for i in range(0, nline2 - 2, 2):
    x_data = df2.values[[i]]
    y_data = df2.values[[i + 1]]
    plt.plot(x_data, y_data)
But I get an error message: TypeError: unhashable type: 'numpy.ndarray'
Note that I am trying to plot only to see what I get; the ultimate goal is to calculate the area under the curve for each pair and sum it over all pairs. Hence the for loop.
UPDATE
I think I found the source of the problem: the rows called Columns # are used as headers, not scalar values. I tried df2.iat[index, column] but without success.
To change the values of the Columns # rows, I used:
df2 = df2.apply(pd.to_numeric)
Then, for the purpose of plotting only the first pair:
i = 0
x_data = df2.values[[i]]
y_data = df2.values[[i + 1]]
#line = plt.plot(x_data, y_data)
#plt.setp(line, color='r', linewidth=2.0)
line = plt.Line2D(x_data, y_data, linewidth=2)
plt.show()
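For what it's worth, once the frame is numeric, a sketch of the full pair loop, accumulating the area under each curve with numpy's trapezoidal rule (this assumes rows strictly alternate Columns #/Values # as shown; abs() because the x values run high to low):

import numpy as np
import matplotlib.pyplot as plt

total_area = 0.0
for i in range(0, len(df2.index) - 1, 2):
    x_data = df2.iloc[i].to_numpy(dtype=float)      # "Columns i" row
    y_data = df2.iloc[i + 1].to_numpy(dtype=float)  # "Values i" row
    plt.plot(x_data, y_data)
    total_area += abs(np.trapz(y_data, x_data))     # area under this pair
plt.show()
print(total_area)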
I have multiple Dataframes (up to 30) which all contain timestamps with associated values. The timestamp in the DataFrames do not necessarily overlap and the recorded values can only stay the same or increase. A DataFrame may look like this:
time coverage
0 0.000000 32.111748
1 0.875050 32.482579
2 1.850576 32.784133
3 3.693440 34.205134
...
I uploaded a couple of csv files with data here 1, 2, 3, 4.
So what I am trying to do is to plot the increase of the mean and median coverage values over time for all recordings, as follows:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# data is a list of dataframes
keys = ["Run " + str(i) for i in range(len(data))]
glued = pd.concat(data, keys=keys).reset_index(level=0).rename(columns={'level_0': 'Run'})
glued["roundtime"] = glued["time"] / 60
glued["roundtime"] = glued["roundtime"].round(0)  # round to whole minutes
f, (ax1, ax2) = plt.subplots(2)
my_dpi = 96
stepsize = 5
start = 0
end = 60
ax1.set_title("Mean")
ax2.set_title("Median")
f.set_size_inches(1980 / my_dpi, 1080 / my_dpi)
ax1 = sns.lineplot(x="roundtime", y="coverage", ci="sd", estimator="mean", data=glued, ax=ax1)
ax1.set(xlabel="Time", ylabel="Coverage in percent")
ax1.xaxis.set_ticks(np.arange(start, end, stepsize))
ax1.set_xlim(0, 70)
ax2 = sns.lineplot(x="roundtime", y="coverage", ci="sd", estimator='median', data=glued, ax=ax2)
ax2.set(xlabel="Time", ylabel="Coverage in percent")
ax2.xaxis.set_ticks(np.arange(start, end, stepsize))
ax2.set_xlim(0, 70)
plt.show()
The result looks like this.
However, the curve should never decrease as the "coverage" values can never decrease either. The reason for this, I suspect, is that at certain points in time I only have recordings of some DataFrames with lower values and therefore the mean/median is also lower.
I tried to fix this by aligning the indices of all the DataFrames and filling missing values with previous recordings, before doing any of the previous code. Like this:
#create a common index
index = None
for df in data:
df.set_index("time", inplace=True, drop=False)
if index is not None:
index = index.union(df.index)
else:
index = df.index
# reindex all dataframes and fill missing values
new_data = []
for df in data:
print(df)
new_df = df.reindex(index, fill_value=np.NaN)
new_df = new_df.fillna(method="ffill")
new_data.append(new_df)
data = new_data
The result, however, does not change much and still decreases at certain times. It looks like this:
Is this approach wrong or am I simply missing something?
Update 5/22/18: answer by @aorr below the original question.
I am trying to collect each ID and the data for that ID for thousands of inputs.
I am trying to collect each row for an individual ID, sort by date, then plot each ID's data and export the chart for each ID.
Edited
Sample data:
Col names: Id Date O G Company Date2
aab72ffd-4d0b-4c62-b6fe-4c55b98be9a0 3/1/1999 180.66 673 A 1/1/1996
aab72ffd-4d0b-4c62-b6fe-4c55b98be9a0 3/1/1995 173.9 651 A 1/1/1996
a15961bc-0263-4c66-a825-1deb69bda8be 12/1/2010 55.14 542 C 1/1/2011
a15961bc-0263-4c66-a825-1deb69bda8be 5/1/2012 49.24 577 C 1/1/2011
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 12/1/2000 48.14 290 D 3/1/2002
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 3/1/2003 69.03 282.5 D 3/1/2002
Desired output arrays/charts, but sorted by date.
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 7/1/2005 28.24 327
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 10/1/1998 45.11 335
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 7/1/2001 28.22 348
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 7/1/1997 44.53 350.5
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 8/1/2001 28.4 333.5
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 10/1/2005 41.72 314
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 12/1/2001 29.53 313.5
10a1d17b-1f5c-4a4d-8186-e4dbf62e3bf2 8/1/2002 43.24 319
The code I have typed so far successfully creates an indexed array of the different data types. Now I am just trying to iterate over all rows and organize the data so that it prints out/writes individual arrays/charts based on IDs.
Here is what I have so far:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#import data
mydataset = pd.read_csv('input_test.csv', dtype=None)
x = mydataset.iloc[:,:].values
y = mydataset.iloc[:,:].values
#Id
b = np.array((x[:,0]), dtype=str)
#Date
c = np.array((x[:,1]), dtype=str)
# O Var
d = np.array((x[:,2]), dtype=int)
# G var
e = np.array((x[:,3]), dtype=int)
#Stack
f = np.vstack((b,c,d,e))
#Transpose array
g = f.T
#Plot data
plt.figure()
plt.plot(x[:,2], y[:,3], label ='Rate over time')
plt.xlabel('m')
plt.ylabel('r/m')
#plt.legend()
Update based on @aorr's answer:
Thanks for helping us noobs.
This plots both O and G on the Y axis with Date on the X axis for each Id. And everything is sorted based on date. Great starting point to expand with this data. More to follow based on updates.
for Id in data['Id'].unique():
    fig, ax = plt.subplots(figsize=(5, 3))
    plot_data = data.query("Id == @Id").sort_values('Date')
    _ = plot_data.plot(x='Date', y='O', ax=ax)
    _ = plot_data.plot(x='Date', y='G', ax=ax)
    # Plot Company name in each chart
    for Company in plot_data['Company']:
        _ = plt.title(Company)
    # Plot Date2 event onto the x-axis
    for Date2 in plot_data['Date2']:
        _ = plt.axvline(Date2)
Have you tried solving this with pandas? I don't think you need to create numpy arrays for every element, pandas already stores them as ndarrays internally.
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('input_test.csv', parse_dates=['date'])
for id in data['id'].unique():
    fig, ax = plt.subplots(figsize=(5, 3))
    plot_data = data.query("id == @id").sort_values('date')
    _ = plot_data.plot(x='O', y='G', ax=ax)
that should get you nearly all the way there. The pandas visualization docs here have a bunch of other really helpful options for exploring data quickly, but if you're picky about the look of the figure then you'll want to use straight matplotlib for the figure and axes layouts.
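Since the question also mentions exporting a chart per ID, a hedged extension of that loop (the filename scheme is an assumption):

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('input_test.csv', parse_dates=['date'])
for id in data['id'].unique():
    fig, ax = plt.subplots(figsize=(5, 3))
    plot_data = data.query("id == @id").sort_values('date')
    plot_data.plot(x='O', y='G', ax=ax)
    fig.savefig(f"{id}.png", dpi=150)  # one chart file per ID (made-up naming)
    plt.close(fig)                     # free memory when there are thousands of IDs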
I wrote this function and I would like it to accept more than one DF, so that the final plot has multiple plotted lines for the predictions and coef_DF gets completed with the rest of the coefficients.
The function extracts the needed feature and target from a much larger dataset, fits a linear regression model, makes predictions, plots the fitted line over the dataset, and returns a df with all the coefficients.
(This is just an exercise.)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

def prep_model_and_predict(feature, target, dataset, degree):
    # part 1: make a df with the relevant format and features
    # degree >= 1
    poly_df = pd.DataFrame()
    poly_df[str(target)] = dataset[str(target)]
    poly_df['power_1'] = dataset[str(feature)]

    # check if degree > 1
    if degree > 1:
        for power in range(2, degree + 1):  # loop over remaining degrees
            name = 'power_' + str(power)
            poly_df[name] = poly_df['power_1'].apply(lambda x: x**power)

    # part 2: make model and predictions
    features = list(poly_df.columns[1:])
    X = poly_df[features]
    y = poly_df[str(target)]
    model = LinearRegression().fit(X, y)
    predictions = model.predict(X)

    # part 3: put weights in a nice df
    coef_df = pd.DataFrame()
    coef_df = coef_df.append({'Name': 'Intercept', 'Value': model.intercept_}, ignore_index=True)
    coef_df = coef_df.append({'Name': 'Power_1', 'Value': model.coef_[0]}, ignore_index=True)
    if degree > 1:
        for power in range(2, degree + 1):  # renamed from 'degree' to avoid shadowing
            name = 'Power_' + str(power)
            coef_df = coef_df.append({'Name': name,
                                      'Value': '{:.3e}'.format(model.coef_[power - 1])},
                                     ignore_index=True)

    # part 4: plot it
    fig, ax = plt.subplots()
    ax.plot(poly_df['power_1'], poly_df[str(target)], '.',
            poly_df['power_1'], predictions, '-')
    ax.set_xlabel('Square footage, living area')
    ax.set_ylabel('Price per Sqft')
    ax.ticklabel_format(axis='y', style='sci', scilimits=(-2, 2))
    return coef_df, ax
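A hypothetical call (the feature/target names and the sales dataframe are placeholders, not from the original post; degree=15 matches the coefficient table below):

# Hypothetical usage; 'sqft_living', 'price' and sales are placeholders.
coef_df, ax = prep_model_and_predict('sqft_living', 'price', sales, degree=15)
print(coef_df)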
and this is the result:
Name Value
0 Intercept 506738
1 Power_1 2.71336e-77
2 Power_2 7.335e-39
3 Power_3 -1.850e-44
4 Power_4 8.437e-50
5 Power_5 0.000e+00
6 Power_6 0.000e+00
7 Power_7 3.645e-55
8 Power_8 1.504e-51
9 Power_9 5.760e-48
10 Power_10 1.958e-44
11 Power_11 5.394e-41
12 Power_12 9.404e-38
13 Power_13 -3.635e-41
14 Power_14 4.655e-45
15 Power_15 -1.972e-49
much appreciated!
I am not sure what exactly you are asking for, but a suggestion: next time, try to ask a question that is easily reproducible and runnable by other people here on SO.
I have tried to answer your questions. Correct me if I misunderstand your question.
Pass arbitrary number of DataFrame to your function and plot it:
I have created three random dataframes for use:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('AB'))
df2 = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('AB'))
df3 = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('AB'))
The function that plots them:
def plot_me(*dfs):  # these are positional dataframes, so *dfs rather than *kwargs
    plt.figure(figsize=(13, 9))
    for lab_ind, df in enumerate(dfs):
        plt.plot(df['A'], df['B'], label=lab_ind)
    plt.legend()
    plt.show()
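Called with the three dataframes above:

plot_me(df1, df2, df3)  # each dataframe becomes one labelled line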
The result plot you get:
Put the results of your model into a DataFrame
Regarding your second question, I am not going to concentrate too much on your exact details - for example the name of the columns of your dataframe, etc.
For this particular example I have generated two random arrays:
X = np.random.randint(0, 50, size=(50, 2))
y = np.random.randint(0, 2, size=(50, 1))
Then fit a LinearRegression model on this data.
model=LinearRegression().fit(X,y)
predictions=model.predict(X)
And then add it to a DataFrame:
res_df = pd.DataFrame(predictions,columns = ['Value'])
And if you print res_df
Value
0 0.420395
1 0.459389
2 0.369648
3 0.416058
4 0.644088
5 0.362072
6 0.363157
7 0.468943
. .
. .