How to normalize data (min-max) in Python with a library - python

How do I do min-max normalization of data read from a CSV file in Python 3 using a library?
This is an example of my data:
RT NK NB SU SK P TNI IK IB TARGET
84876 902 1192 2098 3623 169 39 133 1063 94095
79194 902 1050 2109 3606 153 39 133 806 87992
75836 902 1060 1905 3166 161 39 133 785 83987
75571 902 112 1878 3190 158 39 133 635 82618
83797 1156 134 1900 3518 218 39 133 709 91604
91648 1291 127 2225 3596 249 39 133 659 99967
79063 1346 107 1844 3428 247 39 133 591 86798
84357 1018 122 2152 3456 168 39 133 628 92073
90045 954 110 2044 3638 174 39 133 734 97871
83318 885 198 1872 3691 173 39 133 778 91087
93300 1044 181 2077 4014 216 39 133 635 101639
88370 1831 415 2074 4323 301 39 133 502 97988
91560 1955 377 2015 4153 349 39 223 686 101357
85746 1791 314 1931 3878 297 39 215 449 94660
93855 1891 344 2064 3947 287 39 162 869 103458
97403 1946 382 1937 4029 289 39 122 1164 107311
The min-max formula is
(data - min) / (max - min) * 0.8 + 0.1
I have code, but it does not normalize each column separately.
I know how to compute it by hand, like this:
(first value of RT - min of column RT) / (max of column RT - min of column RT) * 0.8 + 0.1, and so on,
and the same for the next column:
(first value of NK - min of column NK) / (max of column NK - min of column NK) * 0.8 + 0.1
and so on.
Please help me.
This is my code, but I don't understand it:
from sklearn.preprocessing import Normalizer
from pandas import read_csv
from numpy import set_printoptions
import pandas as pd
#df1=pd.read_csv("dataset.csv")
#print(df1)
namaFile = 'dataset.csv'
nama = ['rt', 'niagak', 'niagab', 'sosum', 'soskhus', 'p', 'tni', 'ik', 'ib', 'TARGET']
dataFrame = read_csv(namaFile, names=nama)
array = dataFrame.values
#split the array
X = array[:,0:10]
Y = array[:,9]
skala = Normalizer().fit(X)
normalisasiX = skala.transform(X)
#resulting data
print('Normalisasi Data')
set_printoptions(precision = 3)
print(normalisasiX[0:5,:])
The results of the manual calculation and of the code are very different.

We can use the pandas library. (Note that Normalizer in your code rescales each row to unit norm instead of scaling each column, which is why its output differs from your manual calculation.) With pandas you can apply the min-max formula to every column:
import pandas as pd
df = pd.read_csv("filename")
norm = (df - df.min()) / (df.max() - df.min() )*0.8 + 0.1
norm will hold the normalized dataframe

You can solve this by using MinMaxScaler from sklearn.
from pandas import read_csv
from sklearn.preprocessing import MinMaxScaler
df = read_csv("your-csv-file")
data = df.values
scaler = MinMaxScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)
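Note that MinMaxScaler defaults to the range [0, 1]. If you want to match the (data - min) / (max - min) * 0.8 + 0.1 formula from the question exactly, you can pass feature_range=(0.1, 0.9); a minimal sketch (the CSV path is a placeholder):
from pandas import read_csv
from sklearn.preprocessing import MinMaxScaler
df = read_csv("your-csv-file")  # placeholder path, same layout as the question's data
scaler = MinMaxScaler(feature_range=(0.1, 0.9))  # equivalent to (x - min) / (max - min) * 0.8 + 0.1
scaled_data = scaler.fit_transform(df.values)  # each column is scaled independently
print(scaled_data[:5])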

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from pandas import DataFrame
from sklearn.preprocessing import MinMaxScaler
data = pd.read_csv('Q4dataset.csv')
#print(data)
df = DataFrame(data,columns=['X','Y'])
scaler = MinMaxScaler()
scaler.fit(df)
#print(scaler.transform(df))
minmaxdf = scaler.transform(df)
kmeans = KMeans(n_clusters=2).fit(minmaxdf)
centroids = kmeans.cluster_centers_
plt.scatter(df['X'], df['Y'], c= kmeans.labels_.astype(float), s=30, alpha=1)
You can use the code I wrote above. I performed min-max normalization on two-dimensional data and then applied the K-means clustering algorithm. Be sure to point it at your own data set in .csv format.
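If you also want to mark the cluster centres on that plot, keep in mind that kmeans.cluster_centers_ (stored in centroids above) live in the min-max-scaled space, while the scatter uses the original coordinates. One way, as a sketch building on the scaler and df already defined, is to map them back first:
# map the centroids from the scaled space back to the original X/Y units (sketch)
centroids_orig = scaler.inverse_transform(centroids)
plt.scatter(centroids_orig[:, 0], centroids_orig[:, 1], c='red', marker='x', s=80)
plt.show()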

Related

Using principal component analysis to understand which data we can remove

I would like to use principal component analysis (PCA) in Python to understand which data are most important to my machine learning model, so I can get rid of the data that have less influence on my prediction.
To do this, I started with a simple example, and I will apply it later to my real data. The following example consists of 5 columns (i.e., five features or variables) and 100 rows (i.e., 100 samples).
my datasets are:
wt1 wt2 wt3 wt4 wt5 ko1 ko2 ko3 ko4 ko5
gene1 485 474 475 478 471 149 132 136 146 165
gene2 134 129 170 133 129 53 46 45 44 43
gene3 850 894 925 832 815 485 545 503 475 568
gene4 709 728 706 728 722 106 119 138 144 147
gene5 593 548 546 606 587 648 627 584 641 607
... ... ... ... ... ... ... ... ... ...
gene96 454 404 413 462 420 293 312 327 297 332
gene97 746 691 799 716 762 557 527 511 560 517
gene98 736 782 744 821 737 856 860 840 866 853
gene99 565 513 568 529 565 218 255 224 217 223
gene100 494 457 482 435 468 586 598 562 573 550
The features are wt1 through ko5, so I would like PCA to tell me which wt or ko columns I can remove without affecting the accuracy of my model.
Here is my code:
import pandas as pd
import numpy as np
import random as rd
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt
genes = ['gene' + str(i) for i in range(1,101)]
wt = ['wt' + str(i) for i in range(1,6)]
ko = ['ko' + str(i) for i in range(1,6)]
data = pd.DataFrame(columns=[*wt, *ko], index=genes)
# for each gene in the index(i.e. gene1, gene2,.. gene100), we create 5 values for the "wt" samples and 5 values for the "ko"..
# The mean can vary between 10 and 1000
for gene in data.index:
    data.loc[gene,'wt1':'wt5'] = np.random.poisson(lam=rd.randrange(10,1000), size=5) # size = 5 ---> because we have wt1, wt2,... wt5
    data.loc[gene,'ko1':'ko5'] = np.random.poisson(lam=rd.randrange(10,1000), size=5) # size = 5 ---> because we have ko1, ko2,... ko5
#print(data.head()) # only the first five rows
#print(data)
## Before we do PCA, we have to center and scale the data..
## After centering, the average value for each gene will be 0,
## After scaling, the standard deviation for the values for each gene will be 1
## Notice that we are passing in the transpose of our data, the scale function expects the samples to be rows instead of columns
scaled_data = preprocessing.scale(data.T) ## or StandardScaler().fit_transform(data.T)
# Variation is calculated in sklearn as: sum((measurements - mean)**2) / the number of measurements
# Variation is calculated in R as: sum((measurements - mean)**2) / (the number of measurements - 1)
# For this purpose the difference between the two is negligible.
pca = PCA() ## PCA here is an object
## Now we call the fit method on the scaled data
pca.fit(scaled_data) ## This is where we do all of the PCA math (i.e. calculate loading scores and the variation each principal component accounts for..)
pca_data = pca.transform(scaled_data) ## this is where we generate coordinates for a PCA graph based on the loading scores and the scaled data..
## We'll start with a scree plot to see how many principal components should go into the final plot..
# The first thing we do is calculate the percentage of variation that each principal component accounts for..
per_var = np.round(pca.explained_variance_ratio_* 100, decimals=1)
labels = ['PC' + str(x) for x in range(1, len(per_var)+1)]
plt.bar(x=range(1,len(per_var)+1), height = per_var, tick_label =labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.show()
## Almost all the variation is along the first PC, so a 2-D graph using PC1 and PC2 should do a good job of representing the original data.
pca_df = pd.DataFrame(pca_data, index=[*wt, *ko], columns = labels) ## This is to organize the new data created by pca.transform(scaled_data) into a matrix
plt.scatter(pca_df.PC1, pca_df.PC2)
plt.title('My PCA Graph')
plt.xlabel('PC1 - {0}%'.format(per_var[0]))
plt.ylabel('PC2 - {0}%'.format(per_var[1]))
for sample in pca_df.index:
    plt.annotate(sample, (pca_df.PC1.loc[sample], pca_df.PC2.loc[sample]))
plt.show()
loading_scores = pd.Series(pca.components_[0], index=genes) # We'll start by creating a pandas "Series" object with the loading scores in PC1
sorted_loading_scores = loading_scores.abs().sort_values(ascending = False) #Sorting the loading scores based on their magnitude (absolute value)
top_10_genes = sorted_loading_scores[0:10].index.values ## Here we are just getting the names of the top 10 indexes (which are the gene names)
print(loading_scores[top_10_genes]) ## Printing out the top 10 gene names and their corresponding loading scores
The code produces the scree plot and the PCA scatter plot described above.
As we can see, PC1 accounts for 89.5% of the variation and PC2 accounts for 2.8%.
So I can represent the original data using only PC1 and PC2.
My question is:
Is there a way to correlate PC1 and PC2 with the original data so I can understand which features in the original data are least important?

Sampling from static data set to create dataframe, ignore index in Python

I am trying to create some random samples (of a given size) from a static dataframe. The goal is to create multiple columns for each sample (and each sample drawn is the same size). I'm expecting to see multiple columns of the same length (i.e. sample size) in the fully sampled dataframe, but maybe append isn't the right way to go. Here is the code:
# create sample dataframe
target_df = pd.DataFrame(np.arange(1000))
target_df.columns=['pl']
# create the sampler:
sample_num = 5
sample_len = 10
df_max_row = len(target_df) - sample_len
for i in range(sample_num):
    rndm_start = np.random.choice(df_max_row, 1)[0]
    rndm_end = rndm_start + sample_len
    slicer = target_df.iloc[rndm_start:rndm_end]['pl']
    sampled_df = sampled_df.append(slicer, ignore_index=True)
sampled_df = sampled_df.T
The output of this is shown in the pic below - the red line marks the index I want to remove.
The desired output is shown below that. How do I make this happen?
Thanks!
I would create a new column using
sampled_df[i] = slicer.reset_index(drop=True)
I would also use str(i) for the column name, because later it is simpler to select a column using a string than a number.
import pandas as pd
import random
target_df = pd.DataFrame({'pl': range(1000)})
# create the sampler:
sample_num = 5
sample_len = 10
df_max_row = len(target_df) - sample_len
sampled_df = pd.DataFrame()
for i in range(1, sample_num+1):
    start = random.randint(0, df_max_row)
    end = start + sample_len
    slicer = target_df[start:end]['pl']
    sampled_df[str(i)] = slicer.reset_index(drop=True)
sampled_df.index += 1
print(sampled_df)
Result:
1 2 3 4 5
1 735 396 646 534 769
2 736 397 647 535 770
3 737 398 648 536 771
4 738 399 649 537 772
5 739 400 650 538 773
6 740 401 651 539 774
7 741 402 652 540 775
8 742 403 653 541 776
9 743 404 654 542 777
10 744 405 655 543 778
But to create really random samples I would first shuffle the values
np.random.shuffle(target_df['pl'])
and then I don't have to use random to select the start.
shuffle changes the original column in place, so there is nothing to assign to a new variable.
This way values are not repeated across samples.
import pandas as pd
#import numpy as np
import random
target_df = pd.DataFrame({'pl': range(1000)})
# create the sampler:
sample_num = 5
sample_len = 10
sampled_df = pd.DataFrame()
#np.random.shuffle(target_df['pl'])
random.shuffle(target_df['pl'])
for i in range(1, sample_num+1):
    start = i * sample_len
    end = start + sample_len
    slicer = target_df[start:end]['pl']
    sampled_df[str(i)] = slicer.reset_index(drop=True)
sampled_df.index += 1
print(sampled_df)
Result:
1 2 3 4 5
1 638 331 171 989 170
2 22 643 47 136 764
3 969 455 211 763 194
4 859 384 174 552 566
5 221 829 62 926 414
6 4 895 951 967 381
7 758 688 594 876 873
8 757 691 825 693 707
9 235 353 34 699 121
10 447 81 36 682 251
If values can repeat then you could use
sampled_df[str(i)] = target_df['pl'].sample(n=sample_len, ignore_index=True)
import pandas as pd
target_df = pd.DataFrame({'pl': range(1000)})
# create the sampler:
sample_num = 5
sample_len = 10
sampled_df = pd.DataFrame()
for i in range(1, sample_num+1):
    sampled_df[str(i)] = target_df['pl'].sample(n=sample_len, ignore_index=True)
sampled_df.index += 1
print(sampled_df)
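If your pandas version is older and Series.sample does not accept ignore_index, a reset_index fallback inside the same loop should give the same result:
    # fallback for older pandas without ignore_index in sample()
    sampled_df[str(i)] = target_df['pl'].sample(n=sample_len).reset_index(drop=True)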
EDIT
You may also get the shuffled values as a numpy array and use reshape, then convert back to a DataFrame with many columns and keep only the columns you need.
import pandas as pd
import random
target_df = pd.DataFrame({'pl': range(1000)})
# create the sampler:
sample_num = 5
sample_len = 10
random.shuffle(target_df['pl'])
sampled_df = pd.DataFrame(target_df['pl'].values.reshape([sample_len,-1]))
sampled_df = sampled_df.iloc[:, 0:sample_num]
sampled_df.index += 1
print(sampled_df)

How to use certain rows of a dataframe in a formula

So I have multiple data frames, and all of them need the same kind of formula applied to certain sets of rows. I have the locations of the sets inside the df, but I don't know how to access those sets.
This is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt #might used/need it later to check the output
df = pd.read_csv('Dalfsen.csv')
l = []
x = []
y = []
#the formula(trendline)
def rechtzetten(x,y):
    a = (len(x)*sum(x*y)- sum(x)*sum(y))/(len(x)*sum(x**2)-sum(x)**2)
    b = (sum(y)-a*sum(x))/len(x)
    y1 = x*a+b
    print(y1)
METING = df.ID.str.contains("<METING>") #locating the sets
indicatie = np.where(METING == False)[0] #and saving them somewhere
if n in df[n] != indicatie & n+1 != indicatie: #attempt to add parts of the set in l
    append.l
elif n in df[n] != indicatie & n+1 == indicatie: #attempt defining the end of the set and using the formula for the set
    append.l
    rechtzetten(l.x, l.y)
else: #emptying the storage for the new set
    l = []
indicatie has the following numbers:
0 12 13 26 27 40 41 53 54 66 67 80 81 94 95 108 109 121
122 137 138 149 150 162 163 177 178 190 191 204 205 217 218 229 230 242
243 255 256 268 269 291 292 312 313 340 341 373 374 401 402 410 411 420
421 430 431 449 450 468 469 487 488 504 505 521 522 538 539 558 559 575
576 590 591 604 605 619 620 633 634 647
Because my df looks like this:
ID,NUM,x,y,nap,abs,end
<PROFIEL>not used data
<METING>data</METING>
<METING>data</METING>
...
<METING>data</METING>
<METING>data</METING>
</PROFIEL>,,,,,,
<PROFIEL>not usde data
...
</PROFIEL>,,,,,,
tl;dr I'm trying to apply a formula within each profile as shown above. I want to process the data between two consecutive numbers of the list indicatie.
For example:
apply the function rechtzetten(x, y) to df.x and df.y for rows [1:11] (because [0] and [12] are in the list indicatie), and then the same for [14:25], etc.
What I try to avoid is typing the following hundreds of times manually:
x_#=df.x[1:11]
y_#=df.y[1:11]
rechtzetten(x_#,y_#)
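For illustration, a rough sketch of the loop I am after (assuming the x and y values inside each <METING> block can be converted to numbers) would be something like:
# loop over consecutive positions of the non-<METING> rows and
# run the trendline formula on each block of <METING> rows in between (sketch)
for start, end in zip(indicatie, indicatie[1:]):
    if end - start > 1:  # at least one <METING> row between the two markers
        x = df.x.iloc[start + 1:end].astype(float)
        y = df.y.iloc[start + 1:end].astype(float)
        rechtzetten(x, y)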
I can't understand your question clearly, but if you want to replace a specific column of your pandas dataframe with a numpy array, you can simply assign it:
df['Column'] = numpy_array
Can you be more specific?

Scipy Curvefit for a Single Variable in an Exponential Function

So I have a set of X and Y data that I read in from an output file and then truncate. I'll give a sample of it below just in case anybody wants it as a reference to potentially test my problem (apologies that it is so much)
0 16442
4 15222
8 14222
12 12934
16 11837
20 10706
24 9689
28 8844
32 7999
36 7128
40 6547
44 5890
48 5378
52 4838
56 4308
60 4005
64 3587
68 3228
72 2933
76 2610
80 2434
84 2184
88 1951
92 1755
96 1632
100 1441
104 1362
108 1150
112 1095
116 1051
120 991
124 859
128 775
132 727
136 678
140 635
144 610
148 535
152 560
156 510
160 460
164 431
168 407
172 387
176 391
180 362
184 368
188 317
192 317
196 302
200 289
204 259
208 307
212 263
216 262
220 264
224 218
228 220
232 242
236 224
240 198
244 207
248 192
252 207
256 194
260 172
264 167
268 192
272 148
276 187
280 166
284 159
288 143
292 150
296 155
300 160
304 159
308 144
312 128
316 133
320 105
324 120
328 134
332 129
336 117
340 132
344 118
348 137
352 134
356 119
360 121
364 99
368 111
372 95
376 106
380 89
384 104
388 113
392 117
396 114
400 88
404 82
408 78
412 77
416 79
420 84
424 85
428 75
432 76
436 74
440 96
444 65
448 90
452 72
456 74
460 68
464 66
468 76
472 66
476 69
480 63
484 61
488 51
492 60
496 67
500 71
504 54
508 55
512 61
516 49
520 47
524 42
528 48
532 44
536 47
540 43
544 54
548 42
552 39
556 40
560 44
564 41
568 53
572 50
576 43
580 36
584 49
588 35
592 40
596 34
This data shows time and tallies, and it follows an exponential-decay type of trend. All of the records are similar, but a single coefficient changes for each record taken, so I'm trying to develop code to find out what that coefficient is. The equation I'm using as a fit is:
Y*((exp(-TMA*(log(2.)/HL110))) + (X*exp(-TMA*(log(2.)/HL108)))) + b
The variable that changes here is Y. Everything else is known. This is the variable I want to fit. I've done some work in Excel and can say that it is in the high 9000s (this is just going off memory). Other cases are in the 4000s and 7000s. So it ranges, and that's why I need code to do it; otherwise I have to do it manually every time, and we have thousands of records to analyze. I wrote code, but it flatlines and doesn't really produce a fit. I'll supply it below. It also contains all the constants mentioned above, which aren't subject to change.
### Section 1 ###
from scipy import *
from matplotlib import pyplot
from scipy.optimize import minimize_scalar
from scipy.optimize import curve_fit
import numpy as np
### Section 2 ###
data = np.loadtxt('Ag - Near_7_2026.txt') ### LOAD FILE DATA HERE ###
data_trunc = data[25:len(data)] ### TRUNCATED DATA UP TO 104 SEC ###
TM = data_trunc[:,0] ### TIME MARK ###
TMA = TM + 4 ### CORRECTED TIME ARRAY, ELAPSED TIME ###
Counts = data_trunc[:,1]
Sigma = sqrt([Counts])
### DEFINE PROBLEM CONSTANTS ###
HL110 = 24.6 ### ENDF ACCEPTED HL ###
HL108 = 142.92 ### ENDF ACCEPTED HL ###
b = 1.3333
X = 0.02955 ### FROM MCNP MODEL ###
### Function Handel ###
def func(TMA, Y, X, HL110, HL108, b):
    return Y*((exp(-TMA*(log(2.)/HL110))) + (X*exp(-TMA*(log(2.)/HL108)))) + b ### MODEL FUNCTION ###
f = func(TMA, 5000, X, HL110, HL108, b) ### CALLABLE NEEDED FOR CURVE_FIT ###
# Data plotting ###
pyplot.plot(TMA, f, '.b', label = 'data')
pyplot.legend(fontsize = 'large')
### Curve Fitting and plotting ###
popt, pcov = curve_fit(func, TMA, f)
pyplot.plot(TMA, func(TMA, *popt), 'r-', label = 'fit')
pyplot.tick_params(labelsize='large')
pyplot.legend(fontsize='large')
pyplot.xlabel('Adjusted Time')
pyplot.ylabel('Counts')
pyplot.show()
I've done my best to comment the code here to help anyone assisting me understand what is what. When I was doing this, I used https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html as my reference. (I'm not associated with them or anything; I'm just trying to show my thought process in case it lends any extra information on where I'm going wrong.)
Any help is really appreciated, and I'm available to provide any clarifying information requested!
This seems to work with a few changes. I made a text file from your data, and in the code itself I do not pass the constants to the function.
### Section 1 ###
from scipy import *
from matplotlib import pyplot
from scipy.optimize import minimize_scalar
from scipy.optimize import curve_fit
import numpy as np
### Section 2 ###
#data = np.loadtxt('Ag - Near_7_2026.txt') ### LOAD FILE DATA HERE ###
data = np.loadtxt('temp.dat')
data_trunc = data[25:len(data)] ### TRUNCATED DATA UP TO 104 SEC ###
TM = data_trunc[:,0] ### TIME MARK ###
TMA = TM + 4 ### CORRECTED TIME ARRAY, ELAPSED TIME ###
Counts = data_trunc[:,1]
Sigma = sqrt([Counts])
### DEFINE PROBLEM CONSTANTS ###
HL110 = 24.6 ### ENDF ACCEPTED HL ###
HL108 = 142.92 ### ENDF ACCEPTED HL ###
b = 1.3333
X = 0.02955 ### FROM MCNP MODEL ###
### Function Handel ###
def func(TMA, Y): # no need to pass constants
    return Y*((exp(-TMA*(log(2.)/HL110))) + (X*exp(-TMA*(log(2.)/HL108)))) + b ### MODEL FUNCTION ###
# # no need to pass constants
f = func(TMA, 5000) ### CALLABLE NEEDED FOR CURVE_FIT ###
# Data plotting ###
pyplot.plot(TMA, f, '.b', label = 'data')
pyplot.legend(fontsize = 'large')
### Curve Fitting and plotting ###
popt, pcov = curve_fit(func, TMA, f)
print('Fitted parameters:', popt)
pyplot.plot(TMA, func(TMA, *popt), 'r-', label = 'fit')
pyplot.tick_params(labelsize='large')
pyplot.legend(fontsize='large')
pyplot.xlabel('Adjusted Time')
pyplot.ylabel('Counts')
pyplot.show()
EDIT -- code to solve for Y
### Section 1 ###
from scipy import *
from matplotlib import pyplot
from scipy.optimize import minimize_scalar
from scipy.optimize import curve_fit
import numpy as np
### Section 2 ###
#data = np.loadtxt('Ag - Near_7_2026.txt') ### LOAD FILE DATA HERE ###
data = np.loadtxt('temp.dat')
data_trunc = data[25:len(data)] ### TRUNCATED DATA UP TO 104 SEC ###
TM = data_trunc[:,0] ### TIME MARK ###
TMA = TM + 4 ### CORRECTED TIME ARRAY, ELAPSED TIME ###
Counts = data_trunc[:,1]
Sigma = sqrt(Counts)
### DEFINE PROBLEM CONSTANTS ###
HL110 = 24.6 ### ENDF ACCEPTED HL ###
HL108 = 142.92 ### ENDF ACCEPTED HL ###
b = 1.3333
X = 0.02955 ### FROM MCNP MODEL ###
### Function Handel ###
def func(TMA, Y): # no need to pass constants
    return Y*((exp(-TMA*(log(2.)/HL110))) + (X*exp(-TMA*(log(2.)/HL108)))) + b ### MODEL FUNCTION ###
# # no need to pass constants
#f = func(TMA, 5000) ### CALLABLE NEEDED FOR CURVE_FIT ###
# Data plotting ###
pyplot.plot(TMA, Counts, '.b', label = 'data')
pyplot.legend(fontsize = 'large')
### Curve Fitting and plotting ###
popt, pcov = curve_fit(func, TMA, Counts)
print('Fitted parameters:', popt)
pyplot.plot(TMA, func(TMA, *popt), 'r-', label = 'fit')
pyplot.tick_params(labelsize='large')
pyplot.legend(fontsize='large')
pyplot.xlabel('Adjusted Time')
pyplot.ylabel('Counts')
pyplot.show()
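If the fit struggles on a real record, curve_fit also accepts an initial guess and per-point uncertainties, which may help; a sketch (the starting value 9000 is only a ballpark taken from the question):
# optional: give curve_fit a starting guess for Y and weight points by their counting error
popt, pcov = curve_fit(func, TMA, Counts, p0=[9000], sigma=Sigma, absolute_sigma=True)
print('Fitted Y:', popt[0])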

Sorting and arranging a list using pandas

I have an input file, shown below, that needs to be arranged so that the key values are in ascending order, while keys that are not present are printed last.
I am getting the data arranged in the required format, but the ordering is wrong.
I have tried using the sort() method, but it raises "list has no attribute sort".
Please suggest a solution, and also any modifications that are required.
Input file:
3=1388|4=1388|5=IBM|8=157.75|9=88929|1021=1500|854=n|388=157.75|394=157.75|474=157.75|1584=88929|444=20160713|459=93000546718000|461=7|55=93000552181000|22=89020|400=157.75|361=0.73|981=0|16=1468416600.6006|18=1468416600.6006|362=0.46
3=1388|4=1388|5=IBM|8=157.73|9=100|1021=0|854=p|394=157.73|474=157.749977558|1584=89029|444=20160713|459=93001362639104|461=26142|55=93001362849000|22=89120|361=0.71|981=0|16=1468416601.372|18=1468416601.372|362=0.45
3=1388|4=1388|5=IBM|8=157.69|9=100|1021=600|854=p|394=157.69|474=157.749910415|1584=89129|444=20160713|459=93004178882560|461=27052|55=93004179085000|22=89328|361=0.67|981=1|16=1468416604.1916|18=1468416604.1916|362=0.43
Code i tried:
import pandas as pd
import numpy as np
df = pd.read_csv('inputfile', index_col=None, names=['text'])
s = df.text.str.split('|')
ds = [dict(w.split('=', 1) for w in x) for x in s]
p = pd.DataFrame.from_records(ds)
p1 = p.replace(np.nan,'n/a', regex=True)
st = p1.stack(level=0,dropna=False)
dfs = [g for i,g in st.groupby(level=0)]
#print st
i = 0
while i < len(dfs):
    #index of each column
    print ('\nindex[%d]'%i)
    for (_,k),v in dfs[i].iteritems():
        print k,'\t',v
    i = i + 1
Output I am getting:
index[0]
1021 1500
1584 88929
16 1468416600.6006
18 1468416600.6006
22 89020
3 1388
361 0.73
362 0.46
388 157.75
394 157.75
4 1388
400 157.75
444 20160713
459 93000546718000
461 7
474 157.75
5 IBM
55 93000552181000
8 157.75
854 n
9 88929
981 0
index[1]
1021 0
1584 89029
16 1468416601.372
18 1468416601.372
22 89120
3 1388
361 0.71
362 0.45
388 n/a
394 157.73
4 1388
400 n/a
444 20160713
459 93001362639104
461 26142
474 157.749977558
5 IBM
55 93001362849000
8 157.73
854 p
9 100
981 0
Expected output:
index[0]
3 1388
4 1388
5 IBM
8 157.75
9 88929
16 1468416600.6006
18 1468416600.6006
22 89020
55 93000552181000
361 0.73
362 0.46
388 157.75
394 157.75
400 157.75
444 20160713
459 93000546718000
461 7
474 157.75
854 n
981 0
1021 1500
1584 88929
index[1]
3 1388
4 1388
5 IBM
8 157.75
9 88929
16 1468416600.6006
18 1468416600.6006
22 89020
55 93000552181000
361 0.73
362 0.46
394 157.75
444 20160713
459 93000546718000
461 7
474 157.75
854 n
981 0
1021 1500
1584 88929
388 n/a
400 n/a
Replace your ds line with
ds = [{int(pair[0]): pair[1] for pair in [w.split('=', 1) for w in x]} for x in s]
to convert the keys to integers so they will be sorted numerically.
To output the n/a values at the end, you could use the pandas selection to output the nonnull values first, then the null values, e.g:
for (ix, series) in p.iterrows():
    print('\nindex[%d]' % ix)
    output_series(ix, series[pd.notnull])
    output_series(ix, series[pd.isnull].fillna('n/a'))
btw, you can also simplify your stack, groupby, print to:
for (ix, series) in p1.iterrows():
    print('\nindex[%d]' % ix)
    for tag, value in series.iteritems():
        print(tag, '\t', value)
So the whole script becomes:
import pandas as pd

def output_series(ix, series):
    for tag, value in series.iteritems():
        print(tag, '\t', value)

df = pd.read_csv('inputfile', index_col=None, names=['text'])
s = df.text.str.split('|')
ds = [{int(pair[0]): pair[1] for pair in [w.split('=', 1) for w in x]} for x in s]
p = pd.DataFrame.from_records(ds)

for (ix, series) in p.iterrows():
    print('\nindex[%d]' % ix)
    output_series(ix, series[pd.notnull])
    output_series(ix, series[pd.isnull].fillna('n/a'))
Here:
import pandas as pd
import numpy as np
df = pd.read_csv('inputfile', index_col=None, names=['text'])
s = df.text.str.split('|')
ds = [dict(w.split('=', 1) for w in x) for x in s]
p1 = pd.DataFrame.from_records(ds).fillna('n/a')
st = p1.stack(level=0,dropna=False)
for k, v in st.groupby(level=0):
    print(k, v.sort_index())
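Note that here the tags are still strings, so sort_index() orders them lexicographically ('1021' before '16' before '3'). To get the numeric order shown in the expected output, convert the keys to int first, as in the previous answer:
# build the dicts with integer keys so sort_index() sorts numerically
ds = [{int(pair[0]): pair[1] for pair in [w.split('=', 1) for w in x]} for x in s]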
