I am new to data analysis and Python. I have mostly been following the logistic regression on Titanic survivors demonstrated in this link:
http://hamelg.blogspot.ca/2015/11/python-for-data-analysis-part-28.html
I am using my own, non-Titanic dataset, though. I am at the end of the example, where I want to export the results to a CSV file. However, I made a small modification that is getting me stuck: in addition to the prediction, I also explicitly generated the predicted probabilities, which I would also like to export to the CSV file.
test_probs=log_model.predict_proba(X=test_features)
print(test_probs)
# Create a submission for Kaggle
submission = pd.DataFrame({"AccountNumber":titanic_test["AccountNumber"],
"PolarPredict":test_preds,**"probabilities":test_probs** })
This is the message I get:
Exception: Data must be 1-dimensional
here is the original code from the tutorial:
# Make test set predictions
test_preds = log_model.predict(X=test_features)
# Create a submission for Kaggle
submission = pd.DataFrame({"PassengerId":titanic_test["PassengerId"],
"Survived":test_preds})
# Save submission to CSV
submission.to_csv("tutorial_logreg_submission.csv", index=False)
How can I export the prediction, the probabilities, and the "ID" into a CSV file?
You can try converting them into lists. predict_proba returns a 2-D array with one column per class, which is why pandas raises the "Data must be 1-dimensional" error; wrapping the arrays in list() makes each cell hold that row's class probabilities:
submission = pd.DataFrame({
    "AccountNumber": list(titanic_test["AccountNumber"]),
    "PolarPredict": list(test_preds),
    "probabilities": list(test_probs)
})
submission.to_csv("tutorial_logreg_submission.csv", index=False)
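If you only need the probability of one class rather than a pair of values per row, another option (a minimal sketch, assuming a binary classifier so that column 1 of predict_proba is the positive class, and a made-up column name "ProbPositive") is to slice a single column out of the probability array:

# predict_proba returns shape (n_samples, n_classes); column 1 is the
# positive-class probability for a binary model
test_probs = log_model.predict_proba(X=test_features)

submission = pd.DataFrame({
    "AccountNumber": titanic_test["AccountNumber"],
    "PolarPredict": test_preds,
    "ProbPositive": test_probs[:, 1]   # 1-D, so pandas accepts it directly
})
submission.to_csv("tutorial_logreg_submission.csv", index=False)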
I am currently data wrangling on a very new project, and it is proving a challenge.
I have EEG data that has been preprocessed in EEGLAB in MATLAB, and I would like to load it into Python to use it to train a classifier. I also have a .csv file with each individual's subject ID, along with a number (1, 2, or 3) corresponding to which third of the sample they are in.
Currently, I have the data saved as .mat files, one for each individual (104 in total), each containing an array shaped 64x2000x700 (64 channels, 2000 data points per 2 second segment (sampling frequency of 1000Hz), 700 segments). I would like to load each participant's data into the dataframe alongside their subject ID and classification score.
I tried this:
all_files = glob.glob(os.path.join(path, "*.mat"))
lang_class = pd.read_csv("TestLangLabels.csv")
df_dict = {}
for file in all_files:
    file_name = os.path.splitext(os.path.basename(file))[0]
    df_dict[file_name] = loadmat(file, appendmat=False)
    # Setting the file name (without extension) as the index name
    df_dict[file_name].index.name = file_name
But the files are so large that this maxes out my memory and doesn't complete.
Then I attempted to loop over the files with pandas, as follows:
main_dataframe = pd.DataFrame(loadmat(all_files[0]))
for i in range(1, len(all_files)):
    data = loadmat(all_files[i])
    df = pd.DataFrame(data)
    main_dataframe = pd.concat([main_dataframe, df], axis=1)
At which point I got the error:
ValueError: Data must be 1-dimensional
Is there a way of doing this that I am overlooking, or will downsampling be inevitable?
| subjectID | Data        | Class |
|-----------|-------------|-------|
| AA123     | 64x2000x700 | 2     |
I believe that something like this could then be used as a test/train dataset for my model, but welcome any and all advice!
Thank you in advance.
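For what it's worth, one pattern that sidesteps both the memory blow-up and the "Data must be 1-dimensional" error is to keep the raw arrays out of the DataFrame and store only the subject ID, the class label, and the path to each .mat file, loading a subject's 64x2000x700 array only when it is actually needed. This is just a sketch under assumptions: the subject ID is taken from the file name, the label CSV has columns called subjectID and Class, and the EEG array lives in the .mat file under a key called "data" (all of those names are guesses):

import glob
import os
import pandas as pd
from scipy.io import loadmat

all_files = glob.glob(os.path.join(path, "*.mat"))
labels = pd.read_csv("TestLangLabels.csv")   # assumed columns: subjectID, Class

# One row per subject: ID, label and file path; no EEG arrays held in memory yet
index_df = pd.DataFrame({
    "subjectID": [os.path.splitext(os.path.basename(f))[0] for f in all_files],
    "path": all_files,
})
index_df = index_df.merge(labels, on="subjectID", how="inner")

def load_subject(row):
    # Load a single subject's 64x2000x700 array on demand
    mat = loadmat(row["path"], appendmat=False)
    return mat["data"]                        # "data" is an assumed variable name

# Example: fetch the first subject's EEG array together with its label
first = index_df.iloc[0]
eeg = load_subject(first)
print(first["subjectID"], first["Class"], eeg.shape)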
Is there a reason you have such a high sampling rate? I don't believe I've heard a compelling reason to go over 512 Hz, and I normally take it down to 256 Hz. I don't know if it matters for ML, but most other approaches really don't need that. Going from 1000 Hz to 500 Hz or even 250 Hz might help.
I am trying to make predictions from my custom model on Vertex AI but am getting errors.
I have deployed the model to an endpoint, and I request a prediction using this line:
gcloud beta ai endpoints predict 1234 --region=europe-west4 --json-request=request2.json
and I get this response:
Using endpoint [https://europe-west4-prediction-aiplatform.googleapis.com/]
ERROR: (gcloud.beta.ai.endpoints.predict) FAILED_PRECONDITION: "Prediction failed: Exception during xgboost prediction: feature_names mismatch:
I have created the data with this code (and later renamed the file to request2.json):
test3 = {}
test3["instances"] = test_set2.head(20).values.tolist()

with open('05_GCP/test_v3.jsonl', 'w') as outfile:
    json.dump(test3, outfile, indent=2)
This generates a file which looks like this.
The error tells me that it is expecting column names per value instead of nothing, which is then interpreted as f0, f1, etc.
My challenge is that I don't know how to generate data that looks like this (also from the help file).
Though the result with the mismatched column names suggests I need a different format.
I tried:
import json

test4 = X_test.head(20).to_json(orient='records', lines=True)

with open('05_GCP/test_v4.json', 'w') as outfile:
    json.dump(test4, outfile, indent=2)
which gives me data that looks like:
i.e. with many line breaks in it, and Cloud Shell tells me this isn't JSON.
I also replicated this format and was informed it is not a JSON file.
So, two questions:
How do I create a JSON file that has the appropriate format for a live prediction?
How do I create a JSONL file so I can run batch jobs? This is actually what I am trying to get to. I have used CSV but it returns errors:
('Post request fails. Cannot get predictions. Error: Predictions are not in the response. Got: {"error": "Prediction failed: Could not initialize DMatrix from inputs: ('Expecting 2 dimensional numpy.ndarray, got: ', (38,))"}.', 2)
This CSV data is exactly the same data I use to measure the model error during training (I know that's not good practice, but this is just a test run).
UPDATE
Following Raj's suggestion, I tried creating two extra models: one where I changed my training code to use X_train.values, and another where I replaced all the column names with F0:F426, since the response from the endpoint said the column names didn't match.
These both SUCCEEDED with the test3 JSON file above when the endpoints were deployed.
However, I want this to return batch predictions, and there it returns the same errors. This is therefore clearly a formatting error, but I have no clue how to fix it.
It should be pointed out that batch predictions need to be a JSONL file, not a JSON file. Trying to pass a JSON file doesn't work. All I have done is change the extension on the JSON file when I uploaded it so that it appears to be a JSONL file. I have not found anything that helps me create one properly. Tips are welcome.
I noticed that my data was in double brackets, so I created a version with only one bracket and ran it against one of the models, but it also returned errors. That was also only one prediction, per the other comment.
Appreciate the help,
James
This worked for me:
If you train the model with df.values as per Raj's answer, you can then pass the instances you want batch predictions for in a JSONL file ("input.jsonl", for example), with the following format for each instance/row:
[3.0,1.0,30.0,1.0,0.0,16.1]
The file would look something like this for 5 rows to predict:
[3.0,1.0,30.0,1.0,0.0,16.1]
[3.0,0.0,22.0,0.0,0.0,9.8375]
[2.0,0.0,45.0,1.0,1.0,26.25]
[1.0,0.0,21.0,0.0,0.0,26.55]
[3.0,1.0,16.0,4.0,1.0,39.6875]
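If it helps, that file can be generated straight from a DataFrame. This is a minimal sketch assuming the rows to score are in X_test (the name from the question) and that the model was trained on values only, so no column names are written:

import json

# Write one JSON array per line (JSONL), one instance per row
with open("input.jsonl", "w") as outfile:
    for row in X_test.head(20).values.tolist():
        outfile.write(json.dumps(row) + "\n")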
I'm trying to build a recommendation system using TensorFlow Recommenders (https://www.tensorflow.org/recommenders/examples/quickstart).
In their quickstart, they load the data like this:
ratings = tfds.load('movielens/100k-ratings', split="train")
I have a .csv file; how do I put it in the same format as the data they're passing in?
Also, I would like to use .map on the tf.data dataset, e.g.:
ratings = ratings.map(lambda x: {"itemId": x["itemId"], "userId": x["userId"]})
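Not an authoritative answer, but one common way to get a CSV into the same shape (a tf.data.Dataset of feature dictionaries) is to go through pandas and tf.data.Dataset.from_tensor_slices. This sketch assumes the CSV has itemId and userId columns, and the file name ratings.csv is made up:

import pandas as pd
import tensorflow as tf

df = pd.read_csv("ratings.csv")   # assumed columns: userId, itemId, ...

# from_tensor_slices on a dict of columns yields one {"userId": ..., "itemId": ...}
# dict per row, the same structure tfds.load("movielens/100k-ratings") produces
ratings = tf.data.Dataset.from_tensor_slices(dict(df))

ratings = ratings.map(lambda x: {"itemId": x["itemId"], "userId": x["userId"]})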
I have a .csv term-document matrix, and I want to perform some latent Dirichlet allocation using gensim in Python. However, I'm not particularly familiar with Python or LDA.
I posted in the gensim... forum? I don't know if that's what it's called. The guy who wrote the package responded and had this to say:
how big is your term-document CSV matrix?
If it's small enough = fits in RAM, you could:
1) use numpy.loadtxt() to load your CSV into an in-memory matrix
2) convert the matrix to a corpus with gensim.matutils.Dense2Corpus(). Check out its documents_columns flag, it lets you switch between document-term and term-document transposition easily.
3) use that corpus to train your LDA model.
So that leads me to believe that the answer to this question isn't correct.
It seems like a dictionary is a necessary input to an LDA model; is this not correct? Here's what I have, which I think successfully sticks the .csv into a corpus:
file = np.genfromtxt(fname=fPathName, dtype="int", delimiter=",", skip_header=True, missing_values="", filling_values=0)
corpus = gensim.matutils.Dense2Corpus(file, documents_columns=False)
Any help would be appreciated.
Edit: it turns out that a Gensim Dictionary and a Python dictionary are not exactly the same thing.
So, from the Gensim documentation I took this snippet of code:
from gensim.models import LdaModel
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=10)
The file you want to analyse is a CSV, so to open it you can use pandas:
import pandas as pd
df = pd.read_csv(filename) # add header=None if the file has no column names
Once you import the file, you have everything loaded into a data frame. You need to combine all the text into a single list (see the first comment of the gensim code snippet) that should look like this:
["text one..", "text 2..", "text 3..."]
You can do that by iterating through the data frame and iteratively appending the text to an empty list. Before doing that, you also need to check which column of your CSV file contains the text to analyse. Note that doc2bow expects each document to be a list of tokens, so split each text into words as you go:
common_texts = []  # initialise empty list
for ind, row in df.iterrows():
    text = row[name_column_with_text]
    # doc2bow expects a list of tokens, so split each document into words
    common_texts.append(text.split())
Once you have your list of texts, you can simply apply the code from the gensim documentation.
Of course, you might get memory problems; it depends on the size of your csv file.
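Since the original CSV is already a term-document matrix rather than raw text, the route the gensim author described can also be sketched end to end. This is only a sketch under assumptions: the CSV's header row holds the vocabulary terms, rows are documents (matching the documents_columns=False flag used in the question), and fPathName is the file path variable from the question:

import numpy as np
import gensim
from gensim.models import LdaModel

# Read the header row to recover the vocabulary (assumed to be the column names)
with open(fPathName) as f:
    terms = f.readline().strip().split(",")
id2word = {i: term for i, term in enumerate(terms)}

# Load the counts and wrap them as a gensim corpus (rows = documents here)
counts = np.genfromtxt(fname=fPathName, dtype="int", delimiter=",",
                       skip_header=True, missing_values="", filling_values=0)
corpus = gensim.matutils.Dense2Corpus(counts, documents_columns=False)

# Train LDA directly on the corpus; id2word makes the topics readable
lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=10)
print(lda.print_topics(num_topics=5, num_words=5))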
I created a RandomForestClassifier model using scikit-learn, with 10 different text features and a training set of 10,000. Then I pickled the model (76 MB) in hopes of using it for prediction.
However, in order to produce the random forest, I used LabelEncoder and OneHotEncoder for best results on the categorical/string data.
Now I'd like to load the pickled model and get a classification prediction on one instance. However, I'm not sure how to encode the text on that one instance without loading the entire training & test dataset CSV again and going through the whole encoding process.
It seems quite laborious to load the CSV files every time. I'd like this to run 1000x per hour, so that doesn't seem right to me.
Is there a way to quickly encode 1 row of data given the pickle or other variable/setting? Does encoding always require ALL the data?
If loading all the training data is required to encode a single row, would it be advantageous to encode the text data myself in a database, where each feature is assigned to a table, auto-incremented with a numeric ID and with a UNIQUE key on the text/categorical field, and then pass that ID to the RandomForestClassifier? Obviously I would need to refit and pickle this new model, but then I would know exactly the (encoded) numeric representation of a new row and could simply request a prediction on those values.
It's highly likely that I'm missing a feature or misunderstanding scikit-learn or Python; I only started with both three days ago. Please excuse my naivety.
Using pickle (or joblib) you should save your LabelEncoder and OneHotEncoder alongside the model. You can then load them each time and easily transform new instances. For example,
import pickle
import joblib
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
train_x = [0, 1, 2, 6, 'true', 'false']
le.fit_transform(train_x)

# Save your encoding
joblib.dump(le, '/path/to/save/model')
# OR
pickle.dump(le, open('/path/to/model', "wb"))

# Load those encodings
le = joblib.load('/path/to/save/model')
# OR
le = pickle.load(open('/path/to/model', "rb"))

# Then use as normal
new_x = [0, 0, 0, 2, 2, 2, 'false']
le.transform(new_x)  # returns the integer codes for the new values
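Since the question also involves a OneHotEncoder and a pickled RandomForestClassifier, the same idea extends to persisting everything needed for a one-row prediction. This is only a sketch with made-up feature data, assuming the categorical columns are handled by a single OneHotEncoder fitted at training time:

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Training time: fit the encoder and the model, then persist both together
train = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "L"]})
y = [0, 1, 0]

enc = OneHotEncoder(handle_unknown="ignore")
X_train = enc.fit_transform(train)

clf = RandomForestClassifier(n_estimators=100).fit(X_train, y)
joblib.dump((enc, clf), "rf_with_encoder.joblib")

# Prediction time: load once, encode a single new row, predict
enc, clf = joblib.load("rf_with_encoder.joblib")
new_row = pd.DataFrame({"color": ["blue"], "size": ["S"]})
print(clf.predict(enc.transform(new_row)))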