Why does this code:
print(np.delete(MatrixAnalytics(Cmp),[0],1))
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print("SecondPrint")
print(MyNewMatrix)
return:
[[ 2. 2. 2. 2. 2.]
[ 1. 2. 2. 2. 2.]
[ 1. 2. 0. 2. 2.]
[ 2. 2. 2. 2. 2.]
[ 2. 2. 2. 0. 0.]
[ 1. 2. 2. 0. 2.]
[ 1. 2. 2. 2. 2.]
[ 1. 2. 2. 2. nan]
[ 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. nan]]
SecondPrint
[[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. -1. 0. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. -1. 0.]
[-1. 0. 0. -1. 0.]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. nan]
[-1. 0. 0. 0. 0.]
[-1. 0. 0. 0. nan]]
This is weird, and I can't figure it out. Why would the values change when there is no line of code between the prints?
def MatrixAnalytics(DataMatrix):
    AnalyzedMatrix = DataMatrix
    for i in range(len(AnalyzedMatrix)):            # browse each row
        for j in range(len(AnalyzedMatrix[i])):     # browse each column
            if j > 0:
                if AnalyzedMatrix[i][j] > 50:
                    if AnalyzedMatrix[i][j] > AnalyzedMatrix[i][j-1]:
                        AnalyzedMatrix[i][j] = 2
                    else:
                        AnalyzedMatrix[i][j] = 1
                else:
                    if AnalyzedMatrix[i][j] < 50:
                        if AnalyzedMatrix[i][j] > AnalyzedMatrix[i][j-1]:
                            AnalyzedMatrix[i][j] = 0
                        else:
                            AnalyzedMatrix[i][j] = -1
    return AnalyzedMatrix
The input array is:
[[55. 57.6 57.2 57. 51.1 55.9]
[55.3 54.7 56.1 55.8 52.7 55.5]
[55.5 52. 52.2 49.9 53.8 55.6]
[54.9 57.8 57.6 53.6 54.2 59.9]
[47.9 50.7 53.3 52.5 49.9 45.8]
[57. 56.2 58.3 55.4 47.9 56.5]
[56.6 54.2 57.6 54.7 50.1 53.6]
[54.7 53.4 52. 52. 50.9 nan]
[51.4 51.5 51.2 53. 50.1 50.1]
[55.3 58.7 59.2 56.4 53. nan]]
It seems that the function MatrixAnalytics is called again, but I don't understand why.
Doing this works:
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print(MyNewMatrix)
MyNewMatrix = np.delete(MatrixAnalytics(Cmp),[0],1)
print("SecondPrint")
print(MyNewMatrix)
I think I got the issue.
In this code:
def MatrixAnalytics(DataMatrix):
    AnalyzedMatrix = DataMatrix
    ...
    ...
    return AnalyzedMatrix
AnalyzedMatrix is not a copy of DataMatrix; it references the same object in memory!
So on the first call of MatrixAnalytics, you are actually modifying the object behind the reference passed as an argument (because NumPy arrays are mutable).
On the second call, you are passing the same reference, so the array behind it has already been modified.
Note: the return AnalyzedMatrix statement just returns a reference to the object referenced by the DataMatrix argument (not a copy).
Try replacing this line:
AnalyzedMatrix = DataMatrix
with this one (in your definition of MatrixAnalytics):
AnalyzedMatrix = np.copy(DataMatrix)
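A quick sketch of the difference, using a small stand-in array rather than your actual Cmp data:

import numpy as np

a = np.array([[60.0, 70.0], [40.0, 30.0]])

alias = a                 # same object: no data is copied
alias[0, 0] = 2.0
print(a[0, 0])            # 2.0 -- the "input" array changed too

b = np.array([[60.0, 70.0], [40.0, 30.0]])
independent = np.copy(b)  # independent object: b stays untouched
independent[0, 0] = 2.0
print(b[0, 0])            # 60.0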
For more info:
mutable vs immutable
numpy.delete()
numpy.copy()
I believe you want the same output in both cases.
The catch is that MatrixAnalytics modifies the array you pass it in place, so the very first call, np.delete(MatrixAnalytics(Cmp),[0],1) inside print, already overwrites the contents of Cmp.
The second call then re-analyzes the already-transformed values, which is why the second print differs. So never call this function inside a print statement expecting the original data to survive: call it once, assign the result, and print the stored result.
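For example, a minimal call-once sketch built from the lines in the question:

# Analyze and trim once; print the stored result as often as needed.
MyNewMatrix = np.delete(MatrixAnalytics(Cmp), [0], 1)
print(MyNewMatrix)
print("SecondPrint")
print(MyNewMatrix)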
I want to reduce the size of the sparse matrix of the tf-idf vectorizer outputs, since I am using it with cosine similarity and it takes a long time to go through each vector. I have about 44,000 sentences, so the vocabulary is also very large.
I was wondering if there was a way to combine a group of words to mean one word, for example teal, navy and turquoise would all mean blue and have the same tf-idf value.
I am dealing with a dataset of clothing items, so things like colour and similar clothing articles (shirts, t-shirts and sweatshirts) are what I want to group.
I know I can use stop words to exclude certain words entirely, but is it possible to group words so they share the same value?
Here is my code
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
dataset_2 = "/dataset_files/styles_2.csv"
df = pd.read_csv(dataset_2)
df = df.drop(['gender', 'masterCategory', 'subCategory', 'articleType', 'baseColour', 'season', 'year', 'usage'], axis = 1)
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['ProductDisplayName'])
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
Unfortunately, we can't use the vocabulary optional argument to TfidfVectorizer to signal synonyms; I tried and got ValueError: Vocabulary contains repeated indices.
Instead, you can run the TfidfVectorizer once, then manually merge the columns that correspond to synonyms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
## DATA
corpus = ['The grey cat eats the navy mouse.',
          'The ashen cat drives the red car.',
          'There is a mouse on the brown banquette of the crimson car.',
          'The teal car drove over the poor cat and tarnished its beautiful silver fur with scarlet blood.',
          'I bought a turquoise sapphire shaped like a cat and mounted on a rose gold ring.',
          'Mice and cats alike are drowning in the deep blue sea.']

synonym_groups = [['grey', 'gray', 'ashen', 'silver'],
                  ['red', 'crimson', 'rose', 'scarlet'],
                  ['blue', 'navy', 'sapphire', 'teal', 'turquoise']]
## VECTORIZING FIRST TIME TO GET vectorizer0.vocabulary_
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
## MERGING SYNONYM COLUMNS
vocab = vectorizer.vocabulary_
synonym_representants = { group[0] for group in synonym_groups }
redundant_synonyms = { word: group[0] for group in synonym_groups for word in group[1:] }
syns_dict = {group[0]: group for group in synonym_groups}
# syns_dict = {next(word for word in group if word in vocab): group for group in synonym_groups} ## SHOULD BE MORE ROBUST
nonredundant_columns = sorted( v for k, v in vocab.items() if k not in redundant_synonyms )
for rep in synonym_representants:
    X[:, vocab[rep]] = X[:, [vocab[syn] for syn in syns_dict[rep] if syn in vocab]].sum(axis=1)
Y = X[:, nonredundant_columns]
new_vocab = [w for w in sorted(vocab, key=vocab.get) if w not in redundant_synonyms]
## COSINE SIMILARITY
cos_sim = cosine_similarity(Y, Y)
## RESULTS
print(' ', ''.join('{:11.11}'.format(word) for word in new_vocab))
print(Y.toarray())
print()
print('Cosine similarity')
print(cos_sim)
Output:
alike banquette beautiful blood blue bought brown car cat cats deep drives drove drowning eats fur gold grey like mice mounted mouse poor red ring sea shaped tarnished
[[0. 0. 0. 0. 0.49848319 0. 0. 0. 0.29572971 0. 0. 0. 0. 0. 0.49848319 0. 0. 0.49848319 0. 0. 0. 0.40876335 0. 0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. 0. 0. 0.35369727 0.30309169 0. 0. 0.51089257 0. 0. 0. 0. 0. 0.51089257 0. 0. 0. 0. 0. 0.51089257 0. 0. 0. 0. ]
[0. 0.490779 0. 0. 0. 0. 0.490779 0.3397724 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.4024458 0. 0.490779 0. 0. 0. 0. ]
[0. 0. 0.31893014 0.31893014 0.31893014 0. 0. 0.2207993 0.18920822 0. 0. 0. 0.31893014 0. 0. 0.31893014 0. 0.31893014 0. 0. 0. 0. 0.31893014 0.31893014 0. 0. 0. 0.31893014]
[0. 0. 0. 0. 0.65400152 0.32700076 0. 0. 0.19399619 0. 0. 0. 0. 0. 0. 0. 0.32700076 0. 0.32700076 0. 0.32700076 0. 0. 0.32700076 0.32700076 0. 0.32700076 0. ]
[0.37796447 0. 0. 0. 0.37796447 0. 0. 0. 0. 0.37796447 0.37796447 0. 0. 0.37796447 0. 0. 0. 0. 0. 0.37796447 0. 0. 0. 0. 0. 0.37796447 0. 0. ]]
Cosine similarity
[[1. 0.34430458 0.16450509 0.37391712 0.3479721 0.18840894]
[0.34430458 1. 0.37091192 0.46132163 0.20500145 0. ]
[0.16450509 0.37091192 1. 0.23154573 0.14566346 0. ]
[0.37391712 0.46132163 0.23154573 1. 0.3172916 0.12054426]
[0.3479721 0.20500145 0.14566346 0.3172916 1. 0.2243601 ]
[0.18840894 0. 0. 0.12054426 0.2243601 1. ]]
I'm trying to run a number of classification models, but all of them keep throwing the reshape error. I think it has to do with the call to model.score or model.predict; I've tried running some reshape commands (on X_valid and Y_valid) with no success.
Code:
X = train.drop("Survived", axis=1) # features
Y = train["Survived"] # target
X_test = test # test set, containing no target
# run a split of train data and later predict on x_test
X_train, X_valid, Y_train, Y_valid = train_test_split(X, Y, random_state=42, test_size=0.20, stratify=Y)
# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_valid)
acc_random_forest = round(random_forest.score(Y_valid, Y_pred) * 100, 2)
Error and traceback:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<timed exec> in <module>
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\base.py in score(self, X, y, sample_weight)
498 """
499 from .metrics import accuracy_score
--> 500 return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
501
502 def _more_tags(self):
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in predict(self, X)
628 The predicted classes.
629 """
--> 630 proba = self.predict_proba(X)
631
632 if self.n_outputs_ == 1:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in predict_proba(self, X)
672 check_is_fitted(self)
673 # Check data
--> 674 X = self._validate_X_predict(X)
675
676 # Assign chunk of trees to jobs
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in _validate_X_predict(self, X)
420 check_is_fitted(self)
421
--> 422 return self.estimators_[0]._validate_X_predict(X, check_input=True)
423
424 @property
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\tree\_classes.py in _validate_X_predict(self, X, check_input)
400 """Validate the training data on predict (probabilities)."""
401 if check_input:
--> 402 X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr",
403 reset=False)
404 if issparse(X) and (X.indices.dtype != np.intc or
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
419 out = X
420 elif isinstance(y, str) and y == 'no_validation':
--> 421 X = check_array(X, **check_params)
422 out = X
423 else:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
635 # If input is 1D raise error
636 if array.ndim == 1:
--> 637 raise ValueError(
638 "Expected 2D array, got 1D array instead:\narray={}.\n"
639 "Reshape your data either using array.reshape(-1, 1) if "
ValueError: Expected 2D array, got 1D array instead:
array=[0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1.
0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0.
1. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0.
0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 1. 0. 0. 1.
1. 0. 0. 1. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0.
0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 0.
0. 1. 0. 0. 1. 1. 1. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 0.
0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Y_valid seems to be the one causing the problem. I tried reshaping as follows:
Y_valid2 = Y_valid.values.reshape(-1,1)
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_valid)
acc_random_forest = round(random_forest.score(Y_valid2, Y_pred) * 100, 2)
But now a different error occurs:
ValueError: X has 1 features, but DecisionTreeClassifier is expecting 10 features as input.
I've tried viewing some other similar questions, but I can't find a fix that works for my version of the issue. Help!
The problem is in how score is called, not in X and Y. As the traceback shows, the signature is score(self, X, y, ...): it takes the validation features and the true labels, predicts on X internally, and compares the result with y. Passing Y_valid (a 1D series of labels) where X is expected is what triggers the reshape error, and reshaping it to (-1, 1) just turns it into a single-feature matrix, hence the follow-up "expecting 10 features" error. Score against the features instead:
acc_random_forest = round(random_forest.score(X_valid, Y_valid) * 100, 2)
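Alternatively, since Y_pred has already been computed, the same accuracy can come from sklearn's accuracy_score directly (a small sketch reusing the names above):

from sklearn.metrics import accuracy_score

# Compare the true validation labels with the predictions made on X_valid
acc_random_forest = round(accuracy_score(Y_valid, Y_pred) * 100, 2)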
I have a sample DataFrame as below:
The first column contains 2 years; for each year, 2 tracks exist, and each track consists of pairs of latitude and longitude coordinates. How can I extract every track for each year separately, to obtain an array of tracks with lat and long?
df = pd.DataFrame(
{'year':[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1],
'track_number':[0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1],
'lat': [11.7,11.8,11.9,11.9,12.0,12.1,12.2,12.2,12.3,12.3,12.4,12.5,12.6,12.6,12.7,12.8],
'long':[-83.68,-83.69,-83.70,-83.71,-83.71,-83.73,-83.74,-83.75,-83.76,-83.77,-83.78,-83.79,-83.80,-83.81,-83.82,-83.83]})
You can group by year and then extract a NumPy array from each resulting dataframe with .to_numpy().
>>> years = []
>>> for _, df2 in df.groupby(["year"]):
...     years.append(df2.to_numpy()[:, 1:])
>>> years[0]
array([[ 0. , 11.7 , -83.68],
[ 0. , 11.8 , -83.69],
[ 0. , 11.9 , -83.7 ],
[ 0. , 11.9 , -83.71],
[ 1. , 12. , -83.71],
[ 1. , 12.1 , -83.73],
[ 1. , 12.2 , -83.74],
[ 1. , 12.2 , -83.75]])
>>> years[1]
array([[ 0. , 12.3 , -83.76],
[ 0. , 12.3 , -83.77],
[ 0. , 12.4 , -83.78],
[ 0. , 12.5 , -83.79],
[ 1. , 12.6 , -83.8 ],
[ 1. , 12.6 , -83.81],
[ 1. , 12.7 , -83.82],
[ 1. , 12.8 , -83.83]])
Here years[0] holds the desired information for year 0, and so on. Inside each array, the column order of the original dataframe is preserved: the first element is the track number; the second, the latitude; and the third, the longitude.
If you wish to do the same for the track, i.e, have an array of only latitude and longitude, you can groupby(["year", "track_number"]) as well.
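A minimal sketch of that variant, assuming the df defined above; the key of each group is the (year, track_number) tuple:

# One lat/long array per (year, track) pair, keyed by the group labels.
tracks = {key: g[['lat', 'long']].to_numpy()
          for key, g in df.groupby(['year', 'track_number'])}

print(tracks[(0, 1)])  # the four lat/long points of track 1 in year 0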
While I am trying to use metrics.roc_auc_score, I am getting ValueError: multiclass format is not supported.
import lightgbm as lgb
from sklearn import metrics
def train_model(train, valid):
    dtrain = lgb.Dataset(train, label=y_train)
    dvalid = lgb.Dataset(valid, label=y_valid)
    param = {'num_leaves': 64, 'objective': 'binary',
             'metric': 'auc', 'seed': 7}
    print("Training model!")
    bst = lgb.train(param, dtrain, num_boost_round=1000, valid_sets=[dvalid],
                    early_stopping_rounds=10, verbose_eval=False)
    valid_pred = bst.predict(valid)
    print('Valid_pred: ')
    print(valid_pred)
    print('y_valid:')
    print(y_valid)
    valid_score = metrics.roc_auc_score(y_valid, valid_pred)
    print(f"Validation AUC score: {valid_score:.4f}")
    return bst
bst = train_model(X_train_final, X_valid_final)
valid_pred and y_valid are:
Training model!
Valid_pred:
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1.]
y_valid:
Id
530 200624
492 133000
460 110000
280 192000
656 88000
...
327 324000
441 555000
1388 136000
1324 82500
62 101000
Name: SalePrice, Length: 292, dtype: int64
Error:
ValueError Traceback (most recent call last)
<ipython-input-80-df034caf8c9b> in <module>
----> 1 bst = train_model(X_train_final, X_valid_final)
<ipython-input-79-483a6fb5ab9b> in train_model(train, valid)
17 print('y_valid:')
18 print(y_valid)
---> 19 valid_score = metrics.roc_auc_score(y_valid, valid_pred)
20 print(f"Validation AUC score: {valid_score:.4f}")
21 return bst
/opt/conda/lib/python3.6/site-packages/sklearn/metrics/ranking.py in roc_auc_score(y_true, y_score, average, sample_weight, max_fpr)
353 return _average_binary_score(
354 _binary_roc_auc_score, y_true, y_score, average,
--> 355 sample_weight=sample_weight)
356
357
/opt/conda/lib/python3.6/site-packages/sklearn/metrics/base.py in _average_binary_score(binary_metric, y_true, y_score, average, sample_weight)
71 y_type = type_of_target(y_true)
72 if y_type not in ("binary", "multilabel-indicator"):
---> 73 raise ValueError("{0} format is not supported".format(y_type))
74
75 if y_type == "binary":
ValueError: multiclass format is not supported
I tried:
valid_pred = pd.Series(bst.predict(valid)).astype(np.int64)
I also removed 'objective': 'binary' and tried again, but with no success.
I still can't figure out what the issue is.
It seems the task you are trying to solve is regression: predicting the price. However, you are training a classification model, which assigns a class to every input.
The ROC-AUC score is meant for classification problems, where the output is the probability of the input belonging to a class. If you do a multi-class classification, you can compute the score for each class independently.
Moreover, the predict method returns a discrete class, not a probability. Imagine a binary classification with a single example whose true class is False: if your classifier yields a probability of 0.7, the ROC-AUC value is 1.0 - 0.7 = 0.3; if you use the predict method instead, it becomes 1.0 - 1.0 = 0.0, which won't tell you much.
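If the goal is indeed to predict the price, here is a minimal sketch of the regression route, reusing the variable names from the question (the parameter values are illustrative, not tuned):

import numpy as np
import lightgbm as lgb
from sklearn import metrics

# A regression objective and metric replace 'binary' and ROC-AUC,
# which only make sense for class labels.
param = {'num_leaves': 64, 'objective': 'regression', 'metric': 'rmse', 'seed': 7}
dtrain = lgb.Dataset(X_train_final, label=y_train)
dvalid = lgb.Dataset(X_valid_final, label=y_valid)
bst = lgb.train(param, dtrain, num_boost_round=1000, valid_sets=[dvalid])

valid_pred = bst.predict(X_valid_final)
rmse = np.sqrt(metrics.mean_squared_error(y_valid, valid_pred))
print(f"Validation RMSE: {rmse:.2f}")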
I hope I can get some help with this problem:
I have two text files of about 10,000 rows each (let's say File1 and File2) coming from an FEM analysis. The structure of the files is:
File1
....
Element Facet Node CNORMF.Magnitude CNORMF.CNF1 CNORMF.CNF2 CNORMF.CNF3 CPRESS CSHEAR1 CSHEAR2 CSHEARF.Magnitude CSHEARF.CSF1 CSHEARF.CSF2 CSHEARF.CSF3
881 3 6619 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
881 3 6648 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
881 3 6653 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
930 3 6452 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
930 3 6483 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
930 3 6488 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
1244 2 7722 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
1244 2 7724 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
1244 2 7754 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
2380 2 3757 304.326E-06 -123.097E-06 -203.689E-06 -189.663E-06 564.697E-06 -281.448E-06 22.5357E-06 152.710E-06 144.843E-06 -26.7177E-06 -40.3387E-06
2380 2 3826 226.603E-06 -85.9859E-06 -161.270E-06 -133.967E-06 270.594E-06 -134.865E-06 10.7988E-06 117.700E-06 116.217E-06 -4.67318E-06 -18.0298E-06
2380 2 3848 10.4740E-03 -2.01174E-03 -6.63900E-03 -7.84743E-03 771.739E-06 -384.638E-06 30.7983E-06 5.24148E-03 5.12795E-03 -541.446E-06 -940.251E-06
2894 2 8253 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
2894 2 8255 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
2894 2 8270 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
3372 2 5920 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
3372 2 5961 52.7705E-03 12.2948E-03 -40.8019E-03 -31.1251E-03 7.36309E-03 -2.56505E-03 -502.055E-06 18.8167E-03 17.9038E-03 2.12060E-03 5.38774E-03
3372 2 5996 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
3936 3 6782 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
3936 3 6852 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
3936 3 6857 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
3937 4 6410 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
3937 4 6452 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
3937 4 6488 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
3955 2 6940 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
3955 2 6941 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
3955 2 6993 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
4024 2 8027 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
4024 2 8050 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
....
File2
....
Node COORD.Magnitude COORD.COOR1 COORD.COOR2 COORD.COOR3 U.Magnitude U.U1 U.U2 U.U3
1 131.691 14.5010 -92.2190 -92.8868 1.93638 188.252E-03 -1.64949 -996.662E-03
2 131.336 10.9038 -92.2281 -92.8663 1.93341 188.250E-03 -1.64672 -995.468E-03
3 132.130 18.7534 -92.4681 -92.5002 1.93968 188.190E-03 -1.65258 -997.959E-03
4 130.769 1.97638 -92.5186 -92.3953 1.92580 188.179E-03 -1.63965 -992.387E-03
5 130.560 -4.04517 -93.1433 -91.3993 1.92030 188.026E-03 -1.63459 -990.122E-03
6 132.422 24.0768 -93.9662 -90.1454 1.94282 187.819E-03 -1.65564 -999.062E-03
7 130.377 -8.39503 -94.1640 -89.7827 1.91586 187.774E-03 -1.63054 -988.235E-03
8 126.321 13.6556 -88.0641 -89.5278 1.93579 192.554E-03 -1.64736 -998.202E-03
9 125.963 4.31065 -88.6558 -89.3771 1.92786 192.145E-03 -1.64012 -994.852E-03
10 130.037 3.02359 -94.4877 -89.2894 1.92501 187.692E-03 -1.63909 -991.871E-03
11 126.692 18.5888 -88.1164 -89.1107 1.93970 192.653E-03 -1.65097 -999.810E-03
12 125.751 -1.96189 -89.1238 -88.6928 1.92231 192.010E-03 -1.63500 -992.572E-03
13 125.719 -3.46723 -89.2798 -88.4437 1.92094 191.971E-03 -1.63373 -992.005E-03
14 130.026 7.42596 -95.0372 -88.4289 1.92818 187.556E-03 -1.64210 -993.086E-03
15 130.736 16.3557 -95.3755 -87.9092 1.93527 187.472E-03 -1.64873 -995.891E-03
16 130.251 -12.8122 -95.5572 -87.5783 1.91105 187.430E-03 -1.62618 -986.163E-03
17 130.250 12.8770 -95.6602 -87.4548 1.93216 187.401E-03 -1.64586 -994.616E-03
18 125.609 -7.73838 -90.1949 -87.0785 1.91668 191.718E-03 -1.62985 -990.191E-03
19 124.466 -6.21492 -88.8834 -86.9075 1.91827 192.783E-03 -1.63095 -991.270E-03
20 126.958 23.9470 -89.5421 -86.7584 1.94289 192.337E-03 -1.65406 -1.00096
21 121.210 6.64491 -84.7929 -86.3587 1.92993 196.112E-03 -1.64059 -997.316E-03
22 121.369 12.5781 -84.3620 -86.3434 1.93495 196.450E-03 -1.64514 -999.468E-03
....
I want to do the following steps:
remove the first two columns from File1
compare the node labels of the two files
write an output text file in "rpt" format containing the rows having the same node label side by side
Here is the code I have used. It works for small files, but for large files it takes a huge amount of time.
nodEl = open("P:/File1.rpt", "r")
uniNod = open("P:/File2.rpt", "r")
row_nodEl = nodEl.readlines()
row_uniNod = uniNod.readlines()
nodEl.close()
uniNod.close()

output = open("P:/output.rpt", "w")
for index, line in enumerate(row_nodEl):
    if index > 23081 and index < 40572 and index != 23083 and index != 23084:
        temp = line.strip()
        temp2 = " ".join(temp.split())
        var = temp2.split(" ", 3)
        for index2, line2 in enumerate(row_uniNod):
            if index2 > 11412 and index2 < 21258 and index2 != 11414 and index2 != 11415:
                temp3 = line2.strip()
                temp4 = " ".join(temp3.split())
                var2 = temp4.split(" ", 1)
                if var[2] == var2[0]:
                    output.write(var[2] + " " + var[3] + " " + var2[1])
Any suggestion is more than welcome!
You are comparing each line of one file (with m lines) to each line of another file (with n lines). This leads to a time complexity O(m*n). What this means is that two files, each having 10,000 lines, will produce 100,000,000 comparisons.
You could speed up your code by changing how you read the values. Consider reading each file into a dictionary instead of a list. Each key in the dictionary would be a node number, and each value would be the complete line.
Using this approach, you could do the following:
Load the first file into a dictionary
Load the second file into a dictionary
For each node from the first dictionary, find the corresponding node in the second dictionary
Using Python, it would look similar to this:
file_contents_1 = load_file("P:/File1.rpt")
file_contents_2 = load_file("P:/File2.rpt")
for node_label in file_contents_1:
    # Skip nodes which don't have corresponding values in the second file
    if node_label not in file_contents_2:
        continue
    # Do something
The benefit of this approach is that you load the files separately, so the time complexity becomes linear, O(m+n). Looking up a corresponding node in the second file then takes constant time because of the way dictionaries are implemented (hash tables).
This should make your code a lot faster.
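The load_file helper above is not defined in the snippet; here is a minimal sketch of one possible implementation, assuming the node label is the third whitespace-separated field in File1 and the first in File2 (the keep predicate is a hypothetical stand-in for the hard-coded index ranges in the original loops):

def load_file(path, node_column, keep):
    # Map node label -> normalized line, using the given column as the key.
    contents = {}
    with open(path, "r") as f:
        for index, line in enumerate(f):
            fields = line.split()
            if not keep(index) or len(fields) <= node_column:
                continue
            contents[fields[node_column]] = " ".join(fields)
    return contents

file_contents_1 = load_file("P:/File1.rpt", node_column=2,
                            keep=lambda i: 23081 < i < 40572 and i not in (23083, 23084))
file_contents_2 = load_file("P:/File2.rpt", node_column=0,
                            keep=lambda i: 11412 < i < 21258 and i not in (11414, 11415))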