Load Pandas data frames into Keras - Python

I'm trying to build a recommendation system using TensorFlow Recommenders (https://www.tensorflow.org/recommenders/examples/quickstart).
In their quickstart they load the data like this:
ratings = tfds.load('movielens/100k-ratings', split="train")
I have a .csv file; how can I put it in the same format as the data they're passing in?
I would also like to use .map on the tf.data dataset, e.g.:
ratings = ratings.map(lambda x: {"itemId": x["itemId"], "userId": x["userId"]})
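One way to get there (a minimal sketch; the file and column names are assumptions based on your .map example): build the dataset from a dict of columns with tf.data.Dataset.from_tensor_slices, which yields dict elements just like tfds.load does.
import pandas as pd
import tensorflow as tf
# Hypothetical file name; adjust to your CSV.
df = pd.read_csv("ratings.csv")
# Slicing a dict of columns gives a dataset whose elements are dicts,
# matching the structure the quickstart's tfds.load call returns.
ratings = tf.data.Dataset.from_tensor_slices(
    {"itemId": df["itemId"].values, "userId": df["userId"].values}
)
# .map then works exactly as in your example.
ratings = ratings.map(lambda x: {"itemId": x["itemId"], "userId": x["userId"]})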

Related

Huggingface datasets ValueError

I am trying to load a dataset from a Hugging Face organization, but I am getting the following error:
ValueError: Couldn't cast string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 686
to
{'text': Value(dtype='string', id=None)}
because column names don't match
I used the following lines of code to load the dataset:
from datasets import load_dataset
dataset = load_dataset("datasetFile", use_auth_token=True)
Please note the datasets version is 2.0.0; I downgraded it to 1.18.2, but that did not work.
Is there any way to fix this error?
I solved this error by streaming the dataset.
from datasets import load_dataset
dataset = load_dataset("datasetFile", use_auth_token=True, streaming=True)
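Note that with streaming=True the result is an iterable dataset, so you consume it lazily rather than indexing it; a quick check (assuming the dataset has a train split):
# A streamed dataset is consumed lazily; grab the first example.
first_example = next(iter(dataset["train"]))
print(first_example)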
According to https://github.com/huggingface/datasets/issues/3700#issuecomment-1035400186, you actually want to use load_from_disk:
from datasets import load_from_disk
dataset = load_from_disk("datasetFile")

Kernel crashes while loading data using tf.data.Dataset.take() from a CSV file

I wanted to load a CSV file with a target column and 25 feature columns. I loaded it via pd.read_csv() as a pandas.DataFrame:
import pandas as pd
import tensorflow as tf
data = pd.read_csv("./data.csv")
data = data.astype('float64')
data.shape #returns (6500, 26)
y_train = data.pop('target')
y_train.shape #returns (6500,)
Then I used the standard TensorFlow 2.0 procedure to read values from the pandas.DataFrame:
dataset = tf.data.Dataset.from_tensor_slices((data.values, y_train.values))
As the docs say, I have to load features and targets separately from the TensorSliceDataset. But as soon as I run the for loop, it freezes for a couple of seconds and then the kernel dies without any specific reason.
for feat, targ in dataset.take(1):
    print('Features: {}, Target: {}'.format(feat, targ))
I have tried running the code without the for loop, but the same thing happens with:
tf.constant(data['feature-1'])
I don't know what is causing this problem. I have also tried reinstalling pandas.

Trying to make use of a library to conduct some topic modeling, but it's not going well

I have a .csv term-document matrix, and I want to perform some latent Dirichlet allocation using gensim in Python. However, I'm not particularly familiar with Python or LDA.
I posted in the gensim support forum (I'm not sure that's what it's called), and the author of the package responded with this:
How big is your term-document CSV matrix?
If it's small enough (fits in RAM), you could:
1) use numpy.loadtxt() to load your CSV into an in-memory matrix
2) convert the matrix to a corpus with gensim.matutils.Dense2Corpus(). Check out its documents_columns flag; it lets you switch easily between document-term and term-document orientation.
3) use that corpus to train your LDA model.
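Put together, that advice looks something like this (a minimal sketch, assuming a purely numeric CSV; the file name is hypothetical):
import numpy as np
from gensim import matutils
from gensim.models import LdaModel
# 1) Load the CSV into an in-memory matrix.
matrix = np.loadtxt("term_document.csv", delimiter=",")
# 2) Convert the matrix to a corpus; documents_columns=True because in a
#    term-document layout each column is a document.
corpus = matutils.Dense2Corpus(matrix, documents_columns=True)
# 3) Train the LDA model on that corpus.
lda = LdaModel(corpus, num_topics=10)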
So that leads me to believe that the answer to this question isn't correct.
It seems like a dictionary is a necessary input to an LDA model; is this not correct? Here's what I have, which I think successfully sticks the .csv into a corpus:
import numpy as np
import gensim

file = np.genfromtxt(fname=fPathName, dtype="int", delimiter=",", skip_header=True, missing_values="", filling_values=0)
corpus = gensim.matutils.Dense2Corpus(file, documents_columns=False)
Any help would be appreciated.
Edit: it turns out that a gensim Dictionary and a Python dictionary are not the same thing.
So, from Gensim documentation I took this snip of code:
from gensim.models import LdaModel
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=10)
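To inspect the trained model (a usage note: since no id2word mapping is passed above, the topics are shown with integer term ids):
# Show the highest-weighted terms for each topic.
for topic_id, terms in lda.print_topics(num_topics=10):
    print(topic_id, terms)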
The file you want to analyse is a CSV, so you can open it with pandas:
import pandas as pd
df = pd.read_csv(filename) # add header=None if the file has no column names
Once you import the file, everything is loaded into a data frame. You then need to combine all the text into a single list (see the first comment of the gensim code snippet) that looks like this:
["text one..", "text 2..", "text 3..."]
You can do that by iterating through the data frame and adding each text to an empty list. Before doing that, check which column of your CSV file contains the text to analyse.
common_texts = []  # initialise an empty list
for ind, row in df.iterrows():
    text = row[name_column_with_text]
    common_texts.append(text.split())  # Dictionary expects tokenized documents, so split each text into words
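Equivalently, pandas can build that tokenized list in one line (same assumption about the text column):
common_texts = [str(t).split() for t in df[name_column_with_text]]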
Once you have your list of texts, you can simply apply the code from the gensim documentation.
Of course you might run into memory problems, depending on the size of your CSV file.

Can I train NER in spaCy using annotations from a wordpad or text document

Can I train NER in spaCy using annotations from a WordPad or text document? Training with a single sentence or paragraph doesn't meet my requirements. Thanks.
Yes, you can.
The python library spacy-annotator is your friend here.
It uses ipywidgets to provide users with a user-friendly UI to annotate data.
First, install the annotator:
pip install spacy-annotator
Second, import the data from your txt document into a pandas dataframe:
import pandas as pd
df = pd.read_csv('insert_text_file.txt', sep=" ", header=None)
Third, use spacy-annotator to label your data:
import re
from spacy_annotator.pandas_annotations import annotate as pd_annotate

# Annotations
pd_dd = pd_annotate(
    df,
    col_text='full_text',        # Column in pandas dataframe containing text to be labelled
    labels=['GPE', 'PERSON'],    # List of labels you want to get from text
    sample_size=1,               # Size of the sample to be labelled
    delimiter=',',               # Delimiter to separate entities in the UI
    model=None,                  # spaCy model for noisy pre-labelling
    regex_flags=re.IGNORECASE    # One (or more) regex flags applied when searching for entities in text
)
The cool thing is that spacy-annotator (i) returns labels in a format that spaCy likes, (ii) works with both pandas and Python lists, and (iii) lets users do noisy pre-labelling (i.e. if you already have a spaCy model, you can use it to get suggestions about which entities to annotate).
If you prefer not to use pandas, you can always read each line of your text file into a Python list and use the spacy_annotator.list_annotations module to do your annotations.
It works in a similar way.
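For the list-based route, reading the file into a list is just (a minimal sketch; the file name is hypothetical, and the annotate call itself follows the library's list_annotations docs):
# Read each non-empty line of the text file into a python list.
with open("insert_text_file.txt", encoding="utf-8") as f:
    texts = [line.strip() for line in f if line.strip()]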

python - logistic regression, save predicted probabilities & predictions to csv

I am new to data analysis and Python. I have mostly been following the logistic regression example on Titanic survivors demonstrated in this link:
http://hamelg.blogspot.ca/2015/11/python-for-data-analysis-part-28.html
I am using my own non-Titanic dataset, though. I am at the end of the example, where I want to export the results to a CSV file, but I made a small modification that is getting me stuck: in addition to the predictions, I also explicitly generated the predicted probabilities, which I would like to export to the CSV file as well.
test_probs=log_model.predict_proba(X=test_features)
print(test_probs)
# Create a submission for Kaggle
submission = pd.DataFrame({"AccountNumber":titanic_test["AccountNumber"],
"PolarPredict":test_preds,**"probabilities":test_probs** })
This is the message I get: Exception: Data must be 1-dimensional
here is the original code from the tutorial:
# Make test set predictions
test_preds = log_model.predict(X=test_features)
# Create a submission for Kaggle
submission = pd.DataFrame({"PassengerId":titanic_test["PassengerId"],
"Survived":test_preds})
# Save submission to CSV
submission.to_csv("tutorial_logreg_submission.csv",
index=False)
How can I export the predictions, the probabilities, and the "ID" into a CSV file?
You can try converting them into lists:
submission = pd.DataFrame({
"AccountNumber":list(titanic_test["AccountNumber"]),
"PolarPredict":list(test_preds),
"probabilities":list(test_probs)
})
submission.to_csv("tutorial_logreg_submission.csv",
index=False)
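That works because list() turns the 2-D probability array into a list of per-row arrays stored in a single column. A cleaner fix (a sketch, assuming a binary classifier) is to keep just the positive-class column, since predict_proba returns an (n_samples, n_classes) array and each DataFrame column must be 1-dimensional:
submission = pd.DataFrame({
    "AccountNumber": titanic_test["AccountNumber"],
    "PolarPredict": test_preds,
    "probabilities": test_probs[:, 1]  # probability of the positive class
})
submission.to_csv("tutorial_logreg_submission.csv", index=False)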
