Huggingface datasets ValueError - python

I am trying to load a dataset from a Hugging Face organization, but I am getting the following error:
ValueError: Couldn't cast string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 686
to
{'text': Value(dtype='string', id=None)}
because column names don't match
I used the following lines of code to load the dataset:
from datasets import load_dataset
dataset = load_dataset("datasetFile", use_auth_token=True)
Please note the datasets version is 2.0.0; I downgraded it to 1.18.2 but it did not work.
Is there any way to fix this error?

I solved this error by streaming the dataset.
from datasets import load_dataset
dataset = load_dataset("datasetFile", use_auth_token=True, streaming=True)
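With streaming=True, load_dataset returns an IterableDataset, so records are read lazily instead of being materialized up front. A minimal usage sketch (the "train" split name is an assumption about your dataset):
from itertools import islice
# peek at the first few examples of the streamed split
for example in islice(dataset["train"], 3):
    print(example)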

According to https://github.com/huggingface/datasets/issues/3700#issuecomment-1035400186, if the dataset was saved with save_to_disk you actually want to load it with load_from_disk:
from datasets import load_from_disk
dataset = load_from_disk("datasetFile")

Related

Python Statsmodel Logistic Regression iterate through Parquet file

I am trying to run a logistic regression model on a very large dataset with 2.3 billion observations in Python. I need a standard regression output. Statsmodels with parquet seemed promising:
https://www.statsmodels.org/v0.13.0/large_data.html
import pyarrow as pa
import pyarrow.parquet as pq
import statsmodels.formula.api as smf

class DataSet(dict):
    def __init__(self, path):
        self.parquet = pq.ParquetFile(path)

    def __getitem__(self, key):
        try:
            return self.parquet.read([key]).to_pandas()[key]
        except:
            raise KeyError

LargeData = DataSet('LargeData.parquet')
res = smf.ols('Profit ~ Sugar + Power + Women', data=LargeData).fit()
However, the page says: "Additionally, you can add code to this example DataSet object to return only a subset of the rows until you have built a good model. Then, you can refit your final model on more data."
This is what I tried all day and could not get to work. I am not very familiar with Python classes or with how to iterate row-group-wise through a Parquet file.
I am sure it is only a few lines of code; could anyone help me out?
P.S.: Ideally, of course, I need the combination of the distributed model and the subsetted data. But I would already be happy to get the subsetting to work without running out of memory. Thanks!
You can read a "subset" of your parquet file by using row groups:
import random
row_group = random.randrange(self.parquet.num_row_groups)
return self.parquet.read_row_group(row_group).to_pandas()
I'm not sure how exactly you'd do that in your case, maybe by selecting an arbitrary / random set of available row groups.
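Putting that together, a minimal sketch of a wrapper that feeds statsmodels only one row group at a time; the class name RowGroupDataSet is mine, and the column names come from the question's formula:
import random
import pyarrow.parquet as pq
import statsmodels.formula.api as smf

class RowGroupDataSet(dict):
    """Serve columns from a single, randomly chosen row group so the
    model is fit on a subset of rows that fits in memory."""
    def __init__(self, path):
        self.parquet = pq.ParquetFile(path)
        # pick one row group for this modelling pass
        self.row_group = random.randrange(self.parquet.num_row_groups)

    def __getitem__(self, key):
        try:
            table = self.parquet.read_row_group(self.row_group, columns=[key])
            return table.to_pandas()[key]
        except Exception:
            raise KeyError(key)

subset = RowGroupDataSet('LargeData.parquet')
res = smf.ols('Profit ~ Sugar + Power + Women', data=subset).fit()
print(res.summary())
Refitting in a loop over row groups, or on progressively larger subsets, would then approximate the workflow the statsmodels page describes.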

JSON parse error when trying to load my own SQuAD dataset using Huggingface Transformers

I'm trying to follow this notebook but I get stuck at loading my SQuAD dataset.
import transformers
from datasets import load_dataset, load_metric
dataset = load_dataset('json', data_files={'train': 'squad/nl_squad_train_clean.json', 'test': 'squad/nl_squad_train_clean.json'}, field='data')
This gives the following error: ArrowInvalid: JSON parse error: Column(/paragraphs/[]/qas/[]/answers/[]) changed from object to array in row 0.
Does anyone know how to fix this? If needed I can post the complete stack trace.
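One workaround, sketched under the assumption that the files follow the standard SQuAD layout (data → paragraphs → qas → answers), is to flatten the nested JSON into one record per question yourself and build the dataset from that instead of relying on the json loader:
import json
from datasets import Dataset

def squad_to_records(path):
    # flatten SQuAD-style JSON into one flat dict per question
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    records = []
    for article in raw["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                records.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    "answers": {
                        "text": [a["text"] for a in qa["answers"]],
                        "answer_start": [a["answer_start"] for a in qa["answers"]],
                    },
                })
    return records

train = Dataset.from_list(squad_to_records("squad/nl_squad_train_clean.json"))
Dataset.from_list is available in recent datasets releases; on older versions, Dataset.from_pandas(pd.DataFrame(records)) achieves the same thing.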

Load Pandas data frames Keras

I'm trying to build a recommendation system using TensorFlow recommenders (https://www.tensorflow.org/recommenders/examples/quickstart)
In their quickstart they load the data like this:
ratings = tfds.load('movielens/100k-ratings', split="train")
I have a .csv file; how do I put it into the same format as the data they are passing in?
I would also like to use .map on the tf data, e.g.:
ratings = ratings.map(lambda x: {"itemId": x["itemId"], "userId": x["userId"]})
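A minimal sketch, assuming a hypothetical ratings.csv whose columns include userId and itemId (adjust names to your file): from_tensor_slices on a dict of columns yields one feature dictionary per row, which is the same element structure the movielens example produces, so .map then works exactly as in the question:
import pandas as pd
import tensorflow as tf

df = pd.read_csv("ratings.csv")  # hypothetical path
# a dict of column name -> column values becomes a dataset of per-row dicts
ratings = tf.data.Dataset.from_tensor_slices(dict(df))
ratings = ratings.map(lambda x: {"itemId": x["itemId"], "userId": x["userId"]})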

why does pickle file of fbprophet model need so much memory on hard drive?

I created a simple fbprophet model with the airpassengers data:
import pandas as pd
import pickle
from fbprophet import Prophet
import sys
df = pd.read_csv("airline-passengers.csv")
# preprocess columns as fbprophet expects it
df.rename(columns={"Month": "ds", "Passengers": "y"}, inplace=True)
df["ds"] = pd.to_datetime(df["ds"])
m = Prophet()
m.fit(df)
However, when I save the object m:
with open("p_model", "wb") as f:
    pickle.dump(m, f)
it takes up more than 1 MB on my hard drive. The object m itself seems to be rather small, as sys.getsizeof(m) returns 56.
Why is the pickle file so large? Is there a suitable alternative for saving the object for later reuse? Thanks in advance.
I think that it pickles training data also, so try not to save model.history and it should be fine.
Here is nice discussion: https://github.com/facebook/prophet/issues/1159
Thanks to the link from @Kohelet, I found the solution: it was the stan_backend attribute:
m.stan_backend = None
This reduced the filesize on hard drive to around 18 KB.
I am still wondering why this is not visible when invoking sys.getsizeof(m).
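Putting both suggestions together, a sketch that drops the heavy attributes from the fitted model m before pickling (only do this if you no longer need them, e.g. for further fitting or prediction on the stored history):
import pickle

m.stan_backend = None   # compiled Stan backend, the bulk of the file size
m.history = None        # copy of the training data kept on the model

with open("p_model_small", "wb") as f:
    pickle.dump(m, f)
As for sys.getsizeof(m): it only measures the shallow size of the Prophet object itself, not the objects its attributes reference, which is why it stays at 56 bytes regardless of what the pickle drags along.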

Kernel Crashes while Loading data using tf.data.Dataset.take() from a CSV file

I wanted to load a CSV file with a target column and 25 feature columns. I have loaded it via pd.read_csv() as a pandas.DataFrame:
import pandas as pd
import tensorflow as tf
data = pd.read_csv("./data.csv")
data = data.astype('float64')
data.shape #returns (6500, 26)
y_train = data.pop('target')
y_train.shape #returns (6500,)
Then I used the standard Tensorflow 2.0 procedure to read values from pandas.Dataframe:
dataset = tf.data.Dataset.from_tensor_slices((data.values, y_train.values))
As the docs say, I have to load features and targets separately from the TensorSliceDataset. But as soon as I run the for loop, it freezes for a couple of seconds and suddenly the kernel dies without any specific reason.
for feat, targ in dataset.take(1):
    print('Features: {}, Target: {}'.format(feat, targ))
I have tried to run the code without the for loop but the same thing happens with:
tf.constant(data['feature-1'])
I don't know what is causing this problem. I have also re-installed pandas.
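One way to narrow this down (a diagnostic sketch, not a fix; the synthetic arrays simply mirror the question's shapes and the float32 cast is an assumption) is to run the same slicing on NumPy data that bypasses pandas entirely:
import numpy as np
import tensorflow as tf

# synthetic stand-in with the same shape as the question's data
features = np.random.rand(6500, 25).astype("float32")
targets = np.random.rand(6500).astype("float32")

dataset = tf.data.Dataset.from_tensor_slices((features, targets))
for feat, targ in dataset.take(1):
    print('Features: {}, Target: {}'.format(feat, targ))
If this synthetic version also kills the kernel, the problem lies in the TensorFlow installation or environment rather than in the CSV data.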
