So I have a set of X and Y data that I read in from an output file and then truncate. I'll give a sample of it below in case anybody wants it as a reference to test my problem (apologies that it is so long).
0 16442
4 15222
8 14222
12 12934
16 11837
20 10706
24 9689
28 8844
32 7999
36 7128
40 6547
44 5890
48 5378
52 4838
56 4308
60 4005
64 3587
68 3228
72 2933
76 2610
80 2434
84 2184
88 1951
92 1755
96 1632
100 1441
104 1362
108 1150
112 1095
116 1051
120 991
124 859
128 775
132 727
136 678
140 635
144 610
148 535
152 560
156 510
160 460
164 431
168 407
172 387
176 391
180 362
184 368
188 317
192 317
196 302
200 289
204 259
208 307
212 263
216 262
220 264
224 218
228 220
232 242
236 224
240 198
244 207
248 192
252 207
256 194
260 172
264 167
268 192
272 148
276 187
280 166
284 159
288 143
292 150
296 155
300 160
304 159
308 144
312 128
316 133
320 105
324 120
328 134
332 129
336 117
340 132
344 118
348 137
352 134
356 119
360 121
364 99
368 111
372 95
376 106
380 89
384 104
388 113
392 117
396 114
400 88
404 82
408 78
412 77
416 79
420 84
424 85
428 75
432 76
436 74
440 96
444 65
448 90
452 72
456 74
460 68
464 66
468 76
472 66
476 69
480 63
484 61
488 51
492 60
496 67
500 71
504 54
508 55
512 61
516 49
520 47
524 42
528 48
532 44
536 47
540 43
544 54
548 42
552 39
556 40
560 44
564 41
568 53
572 50
576 43
580 36
584 49
588 35
592 40
596 34
This data shows time and tallies, and it follows an exponential-decay type of trend. All of the records are similar, but a single coefficient changes from record to record, so I'm trying to develop code to find out what that coefficient is. The equation I'm using as a fit is:
Y*((exp(-TMA*(log(2.)/HL110))) + (X*exp(-TMA*(log(2.)/HL108)))) + b
The only variable changing here is Y; everything else is known, and Y is the parameter I want to fit. From some work in Excel I can say it is in the high 9000s for this record (going off memory); other cases are in the 4000s and 7000s. Because it varies, I need code to do the fit, otherwise I have to do it manually every time, and we have thousands of records to analyze. I wrote some code, but the fit flatlines and doesn't really follow the data. I'll supply it below; it also contains all the constants mentioned above, which aren't subject to change.
### Section 1 ###
from scipy import *
from matplotlib import pyplot
from scipy.optimize import minimize_scalar
from scipy.optimize import curve_fit
import numpy as np
### Section 2 ###
data = np.loadtxt('Ag - Near_7_2026.txt') ### LOAD FILE DATA HERE ###
data_trunc = data[25:len(data)] ### TRUNCATED DATA UP TO 104 SEC ###
TM = data_trunc[:,0] ### TIME MARK ###
TMA = TM + 4 ### CORRECTED TIME ARRAY, ELAPSED TIME ###
Counts = data_trunc[:,1]
Sigma = sqrt([Counts])
### DEFINE PROBLEM CONSTANTS ###
HL110 = 24.6 ### ENDF ACCEPTED HL ###
HL108 = 142.92 ### ENDF ACCEPTED HL ###
b = 1.3333
X = 0.02955 ### FROM MCNP MODEL ###
### Function Handle ###
def func(TMA, Y, X, HL110, HL108, b):
    return Y*((exp(-TMA*(log(2.)/HL110))) + (X*exp(-TMA*(log(2.)/HL108)))) + b ### MODEL FUNCTION ###
f = func(TMA, 5000, X, HL110, HL108, b) ### CALLABLE NEEDED FOR CURVE_FIT ###
# Data plotting ###
pyplot.plot(TMA, f, '.b', label = 'data')
pyplot.legend(fontsize = 'large')
### Curve Fitting and plotting ###
popt, pcov = curve_fit(func, TMA, f)
pyplot.plot(TMA, func(TMA, *popt), 'r-', label = 'fit')
pyplot.tick_params(labelsize='large')
pyplot.legend(fontsize='large')
pyplot.xlabel('Adjusted Time')
pyplot.ylabel('Counts')
pyplot.show()
I've done my best to comment most of the code here to help anyone assisting me understand what is what. When writing this, I used https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html as my reference. (I'm not associated with them or anything; I'm just trying to show my thought process in case it points to where I'm going wrong.)
Any help is really appreciated, and I'm available to provide any clarifying information requested!
This seems to work with a few changes. I made a text file from your data, and in the code itself I do not pass the constants to the function.
### Section 1 ###
from scipy import *
from matplotlib import pyplot
from scipy.optimize import minimize_scalar
from scipy.optimize import curve_fit
import numpy as np
### Section 2 ###
#data = np.loadtxt('Ag - Near_7_2026.txt') ### LOAD FILE DATA HERE ###
data = np.loadtxt('temp.dat')
data_trunc = data[25:len(data)] ### TRUNCATED DATA UP TO 104 SEC ###
TM = data_trunc[:,0] ### TIME MARK ###
TMA = TM + 4 ### CORRECTED TIME ARRAY, ELAPSED TIME ###
Counts = data_trunc[:,1]
Sigma = sqrt([Counts])
### DEFINE PROBLEM CONSTANTS ###
HL110 = 24.6 ### ENDF ACCEPTED HL ###
HL108 = 142.92 ### ENDF ACCEPTED HL ###
b = 1.3333
X = 0.02955 ### FROM MCNP MODEL ###
### Function Handle ###
def func(TMA, Y): # no need to pass constants
    return Y*((exp(-TMA*(log(2.)/HL110))) + (X*exp(-TMA*(log(2.)/HL108)))) + b ### MODEL FUNCTION ###
f = func(TMA, 5000) ### SYNTHETIC DATA GENERATED WITH Y = 5000, FIT AGAINST BELOW ###
# Data plotting ###
pyplot.plot(TMA, f, '.b', label = 'data')
pyplot.legend(fontsize = 'large')
### Curve Fitting and plotting ###
popt, pcov = curve_fit(func, TMA, f)
print('Fitted parameters:', popt)
pyplot.plot(TMA, func(TMA, *popt), 'r-', label = 'fit')
pyplot.tick_params(labelsize='large')
pyplot.legend(fontsize='large')
pyplot.xlabel('Adjusted Time')
pyplot.ylabel('Counts')
pyplot.show()
EDIT -- code to solve for Y
### Section 1 ###
from scipy import *
from matplotlib import pyplot
from scipy.optimize import minimize_scalar
from scipy.optimize import curve_fit
import numpy as np
### Section 2 ###
#data = np.loadtxt('Ag - Near_7_2026.txt') ### LOAD FILE DATA HERE ###
data = np.loadtxt('temp.dat')
data_trunc = data[25:len(data)] ### TRUNCATED DATA UP TO 104 SEC ###
TM = data_trunc[:,0] ### TIME MARK ###
TMA = TM + 4 ### CORRECTED TIME ARRAY, ELAPSED TIME ###
Counts = data_trunc[:,1]
Sigma = sqrt(Counts)
### DEFINE PROBLEM CONSTANTS ###
HL110 = 24.6 ### ENDF ACCEPTED HL ###
HL108 = 142.92 ### ENDF ACCEPTED HL ###
b = 1.3333
X = 0.02955 ### FROM MCNP MODEL ###
### Function Handle ###
def func(TMA, Y): # no need to pass constants
    return Y*((exp(-TMA*(log(2.)/HL110))) + (X*exp(-TMA*(log(2.)/HL108)))) + b ### MODEL FUNCTION ###
#f = func(TMA, 5000) ### CALLABLE NEEDED FOR CURVE_FIT ###
# Data plotting ###
pyplot.plot(TMA, Counts, '.b', label = 'data')
pyplot.legend(fontsize = 'large')
### Curve Fitting and plotting ###
popt, pcov = curve_fit(func, TMA, Counts)
print('Fitted parameters:', popt)
pyplot.plot(TMA, func(TMA, *popt), 'r-', label = 'fit')
pyplot.tick_params(labelsize='large')
pyplot.legend(fontsize='large')
pyplot.xlabel('Adjusted Time')
pyplot.ylabel('Counts')
pyplot.show()
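Since the Poisson uncertainties Sigma = sqrt(Counts) are computed above but never used, an optional refinement is a weighted fit; a minimal sketch (sigma and absolute_sigma are standard curve_fit keyword arguments):
popt_w, pcov_w = curve_fit(func, TMA, Counts, sigma=Sigma, absolute_sigma=True)
perr_w = np.sqrt(np.diag(pcov_w))   # 1-sigma uncertainty on the fitted Y
print('Weighted fit: Y =', popt_w[0], '+/-', perr_w[0])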
Related
I would like to use principal component analysis (PCA) in Python to understand which data are most important to my machine learning model, so I can get rid of the data that have less influence on my prediction.
To do this, I started with a simple example and will apply it to my real data later. The following example consists of ten columns (wt1-wt5 and ko1-ko5, i.e., ten features or variables) and 100 rows (i.e., 100 samples).
My dataset is:
wt1 wt2 wt3 wt4 wt5 ko1 ko2 ko3 ko4 ko5
gene1 485 474 475 478 471 149 132 136 146 165
gene2 134 129 170 133 129 53 46 45 44 43
gene3 850 894 925 832 815 485 545 503 475 568
gene4 709 728 706 728 722 106 119 138 144 147
gene5 593 548 546 606 587 648 627 584 641 607
... ... ... ... ... ... ... ... ... ...
gene96 454 404 413 462 420 293 312 327 297 332
gene97 746 691 799 716 762 557 527 511 560 517
gene98 736 782 744 821 737 856 860 840 866 853
gene99 565 513 568 529 565 218 255 224 217 223
gene100 494 457 482 435 468 586 598 562 573 550
The features are wt1 through ko5, so I would like PCA to tell me which wt or ko columns I can remove without affecting the accuracy of my model.
Here is my code:
import pandas as pd
import numpy as np
import random as rd
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt
genes = ['gene' + str(i) for i in range(1,101)]
wt = ['wt' + str(i) for i in range(1,6)]
ko = ['ko' + str(i) for i in range(1,6)]
data = pd.DataFrame(columns=[*wt, *ko], index=genes)
# for each gene in the index(i.e. gene1, gene2,.. gene100), we create 5 values for the "wt" samples and 5 values for the "ko"..
# The mean can vary between 10 and 1000
for gene in data.index:
    data.loc[gene,'wt1':'wt5'] = np.random.poisson(lam=rd.randrange(10,1000), size=5) # size = 5 ---> because we have wt1, wt2,... wt5
    data.loc[gene,'ko1':'ko5'] = np.random.poisson(lam=rd.randrange(10,1000), size=5) # size = 5 ---> because we have ko1, ko2,... ko5
#print(data.head()) # only the first five rows
#print(data)
## Before we do PCA, we have to center and scale the data..
## After centering, the average value for each gene will be 0,
## After scaling, the standard deviation for the values for each gene will be 1
## Notice that we are passing in the transpose of our data, the scale function expects the samples to be rows instead of columns
scaled_data = preprocessing.scale(data.T) ## or StandardScaler().fit_transform(data.T)
# Variation is calculated in sklearn as: sum((measurements - mean)**2) / (number of measurements)
# Variation is calculated in R as: sum((measurements - mean)**2) / (number of measurements - 1)
# For a reasonably large number of measurements the difference between the two is negligible.
pca = PCA() ## PCA here is an object
## Now we call the fit method on the scaled data
pca.fit(scaled_data) ## This is where we do all of the PCA math (i.e. calculate loading scores and the variation each principal component accounts for..)
pca_data = pca.transform(scaled_data) ## this is where we generate coordinates for a PCA graph based on the loading scores and the scaled data..
## We'll start with a scree plot to see how many principal components should go into the final plot..
# The first thing we do is calculate the percentage of variation that each principal component accounts for..
per_var = np.round(pca.explained_variance_ratio_* 100, decimals=1)
labels = ['PC' + str(x) for x in range(1, len(per_var)+1)]
plt.bar(x=range(1,len(per_var)+1), height = per_var, tick_label =labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.show()
## Almost all the variation is along the first PC, so a 2-D graph, using PC1 and PC2, should do a good job representing the original data.
pca_df = pd.DataFrame(pca_data, index=[*wt, *ko], columns = labels) ## This is to organize the new data created by pca.transform(scaled_data) into a matrix
plt.scatter(pca_df.PC1, pca_df.PC2)
plt.title('My PCA Graph')
plt.xlabel('PC1 - {0}%'.format(per_var[0]))
plt.ylabel('PC2 - {0}%'.format(per_var[1]))
for sample in pca_df.index:
    plt.annotate(sample, (pca_df.PC1.loc[sample], pca_df.PC2.loc[sample]))
plt.show()
loading_scores = pd.Series(pca.components_[0], index=genes) # We'll start by creating a pandas "Series" object with the loading scores in PC1
sorted_loading_scores = loading_scores.abs().sort_values(ascending = False) #Sorting the loading scores based on their magnitude (absolute value)
top_10_genes = sorted_loading_scores[0:10].index.values ## Here we are just getting the names of the top 10 indexes (which are the gene names)
print(loading_scores[top_10_genes]) ## Printing out the top 10 gene names and their corresponding loading scores
The output of the code is a scree plot and a PCA scatter plot (figures not shown here).
As we can see, PC1 accounts for 89.5% of the variation and PC2 accounts for 2.8%.
So I can represent the original data using only PC1 and PC2.
My question is:
Is there a way to correlate PC1 and PC2 with the original data so I can understand which are the least important features in the original data?
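For reference, one way to see how the original features relate to the first two components is to tabulate the loading scores for PC1 and PC2 together; a minimal sketch, reusing the pca object and genes list defined above:
loadings = pd.DataFrame(pca.components_[:2].T, index=genes, columns=['PC1', 'PC2'])  # loadings of PC1 and PC2 for every gene (the features in this PCA)
print(loadings.reindex(loadings['PC1'].abs().sort_values().index).head(10))  # genes with the smallest absolute PC1 loadings contribute least to PC1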
I have a txt file with x, y, z coordinates as follows:
x y z another value
129.000000 -51.000000 3.192000 166 166 166
133.000000 -21.000000 6.982500 171 169 170
134.000000 -51.000000 8.379000 172 170 171
135.000000 -45.000000 8.379000 167 165 166
136.000000 -81.000000 8.578500 160 158 159
137.000000 -51.000000 9.376500 159 157 158
138.000000 -51.000000 9.576000 169 168 167
How do I read the value of z when x=20, y=33?
I tried using data = numpy.genfromtxt(yourFileName) but it did not work for me.
import pandas as pd
from io import StringIO
x = '''x y z v1 v2 v3
129.000000 -51.000000 3.192000 166 166 166
133.000000 -21.000000 6.982500 171 169 170
134.000000 -51.000000 8.379000 172 170 171
135.000000 -45.000000 8.379000 167 165 166
136.000000 -81.000000 8.578500 160 158 159
137.000000 -51.000000 9.376500 159 157 158
138.000000 -51.000000 9.576000 169 168 167'''
out = StringIO(x)
df = pd.read_csv(out, sep=r"\s+")
print(df.query("x==138 and y==-51").z.values)
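If you prefer to stay with numpy, genfromtxt also works once the header row is skipped; a minimal sketch, assuming the file is called coords.txt (a hypothetical name):
import numpy as np
data = np.genfromtxt('coords.txt', skip_header=1)   # skip the "x y z another value" header row
mask = (data[:, 0] == 138) & (data[:, 1] == -51)    # example query values taken from the data shown
print(data[mask, 2])                                # z value(s) for that x, y pair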
So I have multiple data frames, and all of them need the same kind of formula applied to certain sets within each data frame. I have the locations of the sets inside the df, but I don't know how to access those sets.
This is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # might need it later to check the output
df = pd.read_csv('Dalfsen.csv')
l = []
x = []
y = []
# the formula (trendline)
def rechtzetten(x,y):
    a = (len(x)*sum(x*y)- sum(x)*sum(y))/(len(x)*sum(x**2)-sum(x)**2)
    b = (sum(y)-a*sum(x))/len(x)
    y1 = x*a+b
    print(y1)
METING = df.ID.str.contains("<METING>") #locating the sets
indicatie = np.where(METING == False)[0] #and saving them somewhere
if n in df[n] != indicatie & n+1 != indicatie: #attempt to add parts of the set in l
    append.l
elif n in df[n] != indicatie & n+1 == indicatie: #attempt defining the end of the set and using the formula for the set
    append.l
    rechtzetten(l.x, l.y)
else: #emptying the storage for the new set
    l = []
indicatie has the following numbers:
0 12 13 26 27 40 41 53 54 66 67 80 81 94 95 108 109 121
122 137 138 149 150 162 163 177 178 190 191 204 205 217 218 229 230 242
243 255 256 268 269 291 292 312 313 340 341 373 374 401 402 410 411 420
421 430 431 449 450 468 469 487 488 504 505 521 522 538 539 558 559 575
576 590 591 604 605 619 620 633 634 647
Because my df looks like this:
ID,NUM,x,y,nap,abs,end
<PROFIEL>not used data
<METING>data</METING>
<METING>data</METING>
...
<METING>data</METING>
<METING>data</METING>
</PROFIEL>,,,,,,
<PROFIEL>not used data
...
</PROFIEL>,,,,,,
tl;dr I'm trying to use a formula in each profile as shown above. I want to edit the data between 2 numbers of the list indicatie.
For example:
run the function rechtzetten(x,y) on df.x[1:11] and df.y[1:11] (because [0] and [12] are in the list indicatie), and then the same for [14:25], etc.
What I'm trying to avoid is typing the following hundreds of times manually:
x_#=df.x[1:11]
y_#=df.y[1:11]
rechtzetten(x_#,y_#)
I can't understand your question clearly, but if you want to replace a specific column of your pandas dataframe with a numpy array, you can simply assign it:
df['Column'] = numpy_array
Can you be more clear?
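As far as I understand the goal, rechtzetten should run on every block of rows that lies between two consecutive entries of indicatie. A minimal sketch of that loop, assuming df has numeric x and y columns and indicatie is the array built in the question:
for start, stop in zip(indicatie[:-1], indicatie[1:]):
    seg = df.iloc[start + 1:stop]              # rows strictly between two marker rows
    if len(seg) > 1:                           # need at least two points for a trendline
        rechtzetten(seg['x'].astype(float).to_numpy(),
                    seg['y'].astype(float).to_numpy())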
How do I normalize data (min-max, for data mining) from a CSV in Python 3 with a library?
This is an example of my data:
RT NK NB SU SK P TNI IK IB TARGET
84876 902 1192 2098 3623 169 39 133 1063 94095
79194 902 1050 2109 3606 153 39 133 806 87992
75836 902 1060 1905 3166 161 39 133 785 83987
75571 902 112 1878 3190 158 39 133 635 82618
83797 1156 134 1900 3518 218 39 133 709 91604
91648 1291 127 2225 3596 249 39 133 659 99967
79063 1346 107 1844 3428 247 39 133 591 86798
84357 1018 122 2152 3456 168 39 133 628 92073
90045 954 110 2044 3638 174 39 133 734 97871
83318 885 198 1872 3691 173 39 133 778 91087
93300 1044 181 2077 4014 216 39 133 635 101639
88370 1831 415 2074 4323 301 39 133 502 97988
91560 1955 377 2015 4153 349 39 223 686 101357
85746 1791 314 1931 3878 297 39 215 449 94660
93855 1891 344 2064 3947 287 39 162 869 103458
97403 1946 382 1937 4029 289 39 122 1164 107311
The min-max formula is:
(data - min) / (max - min) * 0.8 + 0.1
I have code, but it does not normalize the data per column.
I know how to calculate it manually, like this:
(first value of RT - min of column RT) / (max of column RT - min of column RT) * 0.8 + 0.1, etc.
and likewise for the next column:
(first value of NK - min of column NK) / (max of column NK - min of column NK) * 0.8 + 0.1
and so on.
Please help me.
This is my code, but I don't understand it:
from sklearn.preprocessing import Normalizer
from pandas import read_csv
from numpy import set_printoptions
import pandas as pd
#df1=pd.read_csv("dataset.csv")
#print(df1)
namaFile = 'dataset.csv'
nama = ['rt', 'niagak', 'niagab', 'sosum', 'soskhus', 'p', 'tni', 'ik', 'ib', 'TARGET']
dataFrame = read_csv(namaFile, names=nama)
array = dataFrame.values
#split the array
X = array[:,0:10]
Y = array[:,9]
skala = Normalizer().fit(X)
normalisasiX = skala.transform(X)
#result data
print('Normalisasi Data')
set_printoptions(precision = 3)
print(normalisasiX[0:5,:])
The results of the manual calculation and of the code are very different.
We can use the pandas Python library.
import pandas as pd
df = pd.read_csv("filename")
norm = (df - df.min()) / (df.max() - df.min() )*0.8 + 0.1
norm will have the normalised dataframe
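To check this against the manual calculation described in the question, the first RT value can be compared directly (a quick sketch, assuming the CSV has the header row shown and using the same df and norm as above):
rt = df['RT']
manual = (rt.iloc[0] - rt.min()) / (rt.max() - rt.min()) * 0.8 + 0.1
print(manual, norm['RT'].iloc[0])   # the two numbers should agree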
By using MinMaxScaler from sklearn you can solve your problem.
from pandas import read_csv
from sklearn.preprocessing import MinMaxScaler
df = read_csv("your-csv-file")
data = df.values
scaler = MinMaxScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)
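Note that MinMaxScaler defaults to the range [0, 1]; to reproduce the (data - min)/(max - min)*0.8 + 0.1 formula from the question exactly, set the feature_range argument:
scaler = MinMaxScaler(feature_range=(0.1, 0.9))   # per column: (x - min)/(max - min)*0.8 + 0.1
scaled_data = scaler.fit_transform(data)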
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from pandas import DataFrame
from sklearn.preprocessing import MinMaxScaler
data = pd.read_csv('Q4dataset.csv')
#print(data)
df = DataFrame(data,columns=['X','Y'])
scaler = MinMaxScaler()
scaler.fit(df)
#print(scaler.transform(df))
minmaxdf = scaler.transform(df)
kmeans = KMeans(n_clusters=2).fit(minmaxdf)
centroids = kmeans.cluster_centers_
plt.scatter(df['X'], df['Y'], c= kmeans.labels_.astype(float), s=30, alpha=1)
plt.show()
You can use the code I wrote above. I performed min-max normalization on two-dimensional data and then applied the K-means clustering algorithm. Be sure to include your own data set in .csv format.
Is there a way to save a custom matplotlib colourmap (matplotlib.cm) as a file (e.g. a Color Palette Table file (.cpt), as used in MATLAB) so it can be shared and then used later in other programs (e.g. Panoply, MATLAB...)?
Example
Below a new LinearSegmentedColormap is made by modifying an existing colormap (by truncation, as shown in another question linked here).
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
# Get an existing colormap
cb = 'CMRmap'
cmap = plt.get_cmap( cb )
# Variables to modify (truncate) the colormap with
minval = 0.15
maxval = 0.95
npoints = 100
# Now modify (truncate) the colormap
cmap = matplotlib.colors.LinearSegmentedColormap.from_list(
'trunc({n},{a:.2f},{b:.2f})'.format(n=cmap.name, a=minval,
b=maxval), cmap(np.linspace(minval, maxval, npoints)))
# Now the data can be extracted as a dictionary
cdict = cmap._segmentdata
# e.g. variables ('blue', 'alpha', 'green', 'red')
print( cdict.keys() )
# Now, is it possible to save to this as a .cpt?
More detail
I am aware of ways of loading external colormaps in matplotlib (e.g. shown here and here).
From NASA GISS's Panoply documentation:
Color Palette Table (CPT) indicates a color palette format used by the
Generic Mapping Tools program. The format defines a number of solid
color and/or gradient bands between the colorbar extrema rather than a
finite number of distinct colors.
The following is a function that takes a colormap, some limits (vmin and vmax) and the number of colors as input and creates a cpt file from it.
import matplotlib.pyplot as plt
import numpy as np
def export_cmap_to_cpt(cmap, vmin=0, vmax=1, N=255, filename="test.cpt", **kwargs):
    # create string for upper, lower colors
    b = np.array(kwargs.get("B", cmap(0.)))
    f = np.array(kwargs.get("F", cmap(1.)))
    na = np.array(kwargs.get("N", (0,0,0))).astype(float)
    ext = (np.c_[b[:3], f[:3], na[:3]].T*255).astype(int)
    extstr = "B {:3d} {:3d} {:3d}\nF {:3d} {:3d} {:3d}\nN {:3d} {:3d} {:3d}"
    ex = extstr.format(*list(ext.flatten()))
    # create colormap
    cols = (cmap(np.linspace(0., 1., N))[:,:3]*255).astype(int)
    vals = np.linspace(vmin, vmax, N)
    arr = np.c_[vals[:-1], cols[:-1], vals[1:], cols[1:]]
    # save to file
    fmt = "%e %3d %3d %3d %e %3d %3d %3d"
    np.savetxt(filename, arr, fmt=fmt,
               header="# COLOR_MODEL = RGB",
               footer=ex, comments="")
# test case: create cpt file from RdYlBu colormap
cmap = plt.get_cmap("RdYlBu",255)
# you may create your colormap differently, as in the question
export_cmap_to_cpt(cmap, vmin=0,vmax=1,N=20)
The resulting file looks like
# COLOR_MODEL = RGB
0.000000e+00 165 0 38 5.263158e-02 190 24 38
5.263158e-02 190 24 38 1.052632e-01 215 49 39
1.052632e-01 215 49 39 1.578947e-01 231 83 55
1.578947e-01 231 83 55 2.105263e-01 244 114 69
2.105263e-01 244 114 69 2.631579e-01 249 150 86
2.631579e-01 249 150 86 3.157895e-01 253 181 104
3.157895e-01 253 181 104 3.684211e-01 253 207 128
3.684211e-01 253 207 128 4.210526e-01 254 230 153
4.210526e-01 254 230 153 4.736842e-01 254 246 178
4.736842e-01 254 246 178 5.263158e-01 246 251 206
5.263158e-01 246 251 206 5.789474e-01 230 245 235
5.789474e-01 230 245 235 6.315789e-01 206 234 242
6.315789e-01 206 234 242 6.842105e-01 178 220 235
6.842105e-01 178 220 235 7.368421e-01 151 201 224
7.368421e-01 151 201 224 7.894737e-01 120 176 211
7.894737e-01 120 176 211 8.421053e-01 96 149 196
8.421053e-01 96 149 196 8.947368e-01 70 118 180
8.947368e-01 70 118 180 9.473684e-01 59 86 164
9.473684e-01 59 86 164 1.000000e+00 49 54 149
B 165 0 38
F 49 54 149
N 0 0 0
and would be in the required format.
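As a quick sanity check, the file can be read back into a matplotlib colormap; a minimal sketch that keeps only the 8-column data rows (the header and the B/F/N footer lines are skipped automatically):
import numpy as np
import matplotlib.colors as mcolors

def load_cpt(filename):
    rows = []
    with open(filename) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) == 8:                      # keep only "z r g b z r g b" rows
                rows.append([float(p) for p in parts])
    arr = np.array(rows)
    z = np.append(arr[:, 0], arr[-1, 4])             # segment boundaries
    rgb = np.vstack([arr[:, 1:4], arr[-1, 5:8]]) / 255.0
    pos = (z - z.min()) / (z.max() - z.min())         # normalise positions to 0..1
    return mcolors.LinearSegmentedColormap.from_list("from_cpt", list(zip(pos, rgb)))

cmap_back = load_cpt("test.cpt")                      # the file written above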