exporting dataframe to arff file python

I am trying to export a pandas DataFrame to an .arff file to use it in Weka. I have seen that the module liac-arff can be used for that purpose. Going by the documentation here, it seems I have to use
arff.dump(obj, fp). However, I am struggling with obj (a dictionary), which I'm guessing I have to create myself. How do you suggest I do that properly for a big dataset (3,000,000 rows and 95 columns)? Is there any example you can provide of exporting from a pandas DataFrame to an .arff file using Python (v2.7)?

First, install the package:
$ pip install arff
Then use it in Python:
import arff
arff.dump('filename.arff', df.values, relation='relation name', names=df.columns)
Where df is of type pandas.DataFrame. Voila.
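For a concrete end-to-end run, here is a minimal sketch that builds a tiny DataFrame and dumps it, assuming the arff package's dump call shown above (the file, relation, and column names are placeholders):
import arff
import pandas as pd

# Toy stand-in for the real 3,000,000-row DataFrame
df = pd.DataFrame({'height': [1.70, 1.82], 'weight': [65.0, 80.5]})

# Write the rows, relation name, and attribute names to an ARFF file
arff.dump('example.arff', df.values, relation='people', names=list(df.columns))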

This is how I did it recently using the package liac-arff. Even if the arff package is easier to use, it doesn't allow the definition of column types and values of categorical attributes.
import arff
import pandas as pd

df = pd.DataFrame(...)
t = df.columns[-1]  # the last column is the categorical target
attributes = [(c, 'NUMERIC') for c in df.columns.values[:-1]]
attributes += [('target', df[t].unique().astype(str).tolist())]
data = [df.loc[i].values[:-1].tolist() + [df[t].loc[i]] for i in range(df.shape[0])]
arff_dic = {
    'attributes': attributes,
    'data': data,
    'relation': 'myRel',
    'description': ''
}
with open("myfile.arff", "w", encoding="utf8") as f:
    arff.dump(arff_dic, f)
Values of categorical attributes such as target must be of type str, even if they are numbers.
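If the target column holds numbers in the DataFrame, one way to satisfy that (a small sketch reusing the variables defined above) is to cast it before building data:
# Cast the categorical target column to str so liac-arff writes its values as nominals
df[t] = df[t].astype(str)
data = [df.loc[i].values[:-1].tolist() + [df[t].loc[i]] for i in range(df.shape[0])]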

Inspired by the answer of @M. Franklin, which was not working very well, but the idea was there.
import arff

input = ...  # your DataFrame
attributes = [
    (j, 'NUMERIC') if input[j].dtypes in ['int64', 'float64']
    else (j, input[j].unique().astype(str).tolist())
    for j in input
]
arff_dic = {
    'attributes': attributes,
    'data': input.values,
    'relation': 'myRel',
    'description': ''
}
with open("myfile.arff", "w", encoding="utf8") as f:
    arff.dump(arff_dic, f)
Following the snippet above, it outputs an .arff file in the desired format. Good luck out there!

Related

pyspark - dynamically type cast and alias using metadata defined in json variable

Using Apache Spark 3.0.1
I have an incoming data frame df_src like the following.
df_src = spark.createDataFrame(
    [
        ('1', 'foo', '2020-02-03', '2020-03-24T09:21:20+00:00'),
        ('2', 'bar', '2019-01-29', '2020-01-15T17:00:20+00:00'),
    ],
    ['a_col', 'b_col', 'c_col', 'd_col']
)
I also have metadata (as JSON objects within an array) that needs to be applied; I would like to apply this metadata to df_src.
I need to do 3 things:
1. select only the needed columns (projection)
2. apply the data types
3. apply the aliases
I have tried the following and got steps 1 and 2 working.
import json
from pyspark.sql.functions import col

metadata_json = """
[
 {"source_field":"a_col", "alias":"x_col", "datatype":"string"}
,{"source_field":"b_col", "alias":"y_col", "datatype":"timestamp"}
,{"source_field":"c_col", "alias":"z_col", "datatype":"string"}
]
"""
# Transform json input to python objects
metadata_dict = json.loads(metadata_json)
# Filter python objects with list comprehensions
source_fields = [x['source_field'] for x in metadata_dict]
print(source_fields)
# 1) projection - DONE, i.e. my incoming data frame had d_col which I do not need, so doing a select here based on metadata_json
df = df_src.select(source_fields)
# 2) apply data type - DONE, i.e. from metadata_json I am selecting the fields that need to be timestamp and casting them
for column in metadata_dict:
    if column['datatype'] == 'timestamp':
        df_dest = df.withColumn(column['source_field'], col(column['source_field']).cast("timestamp"))
# 3) apply alias ?
After step 2 I have my destination data frame df_dest.
Now, how do I apply the alias dynamically based on metadata_json above using pyspark? (Also, please suggest if there is an elegant way to do all 3 steps; I cannot change metadata_json.)
Given your input object (and straightforward strings), consider something like this:
import pyspark.sql.functions as F

# string backticks to protect the names against "." and other characters
input_df.select(
    *[
        F.col(f"`{x['source_field']}`").cast(x["datatype"]).alias(x["alias"])
        for x in metadata_dict
    ]
)
If your strings become a little bit more complex, a simple cast() may not hack it. If that's the case, consider wrapping the entire F.col().cast().alias() statement in a helper that implements a simple strategy pattern (or an if ... elif ... else switch) that can handle more complex logic.
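A minimal sketch of that wrapper idea (the helper name build_column and the to_date branch are illustrative assumptions, not part of the original answer):
import pyspark.sql.functions as F

def build_column(meta):
    # Hypothetical strategy switch: pick a casting rule based on the metadata entry
    source = F.col(f"`{meta['source_field']}`")
    if meta['datatype'] == 'timestamp':
        casted = source.cast('timestamp')
    elif meta['datatype'] == 'date':
        casted = F.to_date(source, 'yyyy-MM-dd')  # example of more complex parsing
    else:
        casted = source.cast(meta['datatype'])
    return casted.alias(meta['alias'])

df_dest = df_src.select(*[build_column(x) for x in metadata_dict])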

How to save a pandas DataFrame with custom types using pyarrow and parquet

I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds).
Throughout the examples we use:
import pandas as pd
import pyarrow as pa
from bson import ObjectId
Here's a minimal example to show the situation:
df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': ObjectId('5e9992543bfddb58073803e7')},
        {'name': 'bob', 'oid': ObjectId('5e9992543bfddb58073803e8')},
    ]
)
df.to_parquet('some_path')
And we get:
ArrowInvalid: ('Could not convert 5e9992543bfddb58073803e7 with type ObjectId: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column oid with type object')
I tried to follow this reference: https://arrow.apache.org/docs/python/extending_types.html
Thus I wrote the following type extension:
class ObjectIdType(pa.ExtensionType):

    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(12), "my_package.objectid")

    def __arrow_ext_serialize__(self):
        # since we don't have a parametrized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized
        # metadata.
        return cls()
And I was able to get a working pyarrow array for my oid column:
values = df['oid']
storage_array = pa.array(values.map(lambda oid: oid.binary), type=pa.binary(12))
objectid_type = ObjectIdType()  # instantiate the extension type defined above
pa.ExtensionArray.from_storage(objectid_type, storage_array)
Now where I'm stuck, and cannot find any good solution on the internet, is how to save my df to parquet while letting it determine which column needs which Extension. I might change columns in the future, and I have several different types that need this treatment.
How can I simply create parquet files from dataframes and restore them while transparently converting the types?
I tried to create a pyarrow.Table object and append columns to it after preprocessing, but it doesn't work, as table.append_column takes binary columns and not pyarrow.Arrays; plus the whole isinstance thing looks like a terrible solution.
table = pa.Table.from_pandas(pd.DataFrame())
for col, values in test_df.iteritems():
    if isinstance(values.iloc[0], ObjectId):
        arr = pa.array(
            values.map(lambda oid: oid.binary), type=pa.binary(12)
        )
    elif isinstance(values.iloc[0], ...):
        ...
    else:
        arr = pa.array(values)
    table.append_column(arr, col)  # FAILS (wrong type)
Pseudocode of the ideal solution:
parquetize(df, path, my_custom_types_conversions)
# ...
new_df = unparquetize(path, my_custom_types_conversions)
assert df.equals(new_df) # types have been correctly restored
I'm getting lost in pyarrow's docs on whether I should use ExtensionType, serialization, or other things to write these functions. Any pointer would be appreciated.
Side note: I do not need parquet by all means; the main issue is being able to save and restore dataframes with custom types quickly and space-efficiently. I tried a solution based on jsonifying and gzipping the dataframe, but it was too slow.
I think it is probably because ObjectId is not a defined keyword in Python, hence it throws this exception during type conversion.
I tried the example you provided, cast the oid values to string type during dataframe creation, and it worked.
Check the steps below:
df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': "ObjectId('5e9992543bfddb58073803e7')"},
        {'name': 'bob', 'oid': "ObjectId('5e9992543bfddb58073803e8')"},
    ]
)
df.to_parquet('parquet_file.parquet')
df1 = pd.read_parquet('parquet_file.parquet', engine='pyarrow')
df1
output:
    name                                   oid
0  alice  ObjectId('5e9992543bfddb58073803e7')
1    bob  ObjectId('5e9992543bfddb58073803e8')
You could write a method that reads the column names and types and outputs a new DataFrame with the columns converted to compatible types, using a switch-case pattern to choose which type to convert each column to (or whether to leave it as-is), as sketched below.
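A minimal sketch of that idea (the helper names parquetize and _convert_column, and the ObjectId-to-string rule, are assumptions rather than a definitive implementation):
from bson import ObjectId
import pandas as pd

def _convert_column(values: pd.Series) -> pd.Series:
    # Hypothetical switch-case: map unsupported values to parquet-friendly types
    if values.map(lambda v: isinstance(v, ObjectId)).any():
        return values.map(str)  # store ObjectIds as plain strings
    return values  # leave already-supported types as-is

def parquetize(df: pd.DataFrame, path: str) -> None:
    # Convert column by column, then write with the default pyarrow engine
    df.apply(_convert_column).to_parquet(path)
Restoring the original types on read would need a matching reverse mapping (e.g. str back to ObjectId), which this sketch does not cover.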

Error when trying to save hdf5 row where one column is a string and the other is an array of floats

I have two columns: one is a string, and the other is a numpy array of floats.
import numpy as np

a = 'this is string'
b = np.array([-2.355, 1.957, 1.266, -6.913])
I would like to store them in a row as separate columns in an hdf5 file. For that I am using pandas:
import pandas as pd

hdf_key = 'hdf_key'
store5 = pd.HDFStore('file.h5')
z = pd.DataFrame(
    {
        'string': [a],
        'array': [b]
    })
store5.append(hdf_key, z, index=False)
store5.close()
However, I get this error
TypeError: Cannot serialize the column [array] because
its data contents are [mixed] object dtype
Is there a way to store this to h5? If so, how? If not, what's the best way to store this sort of data?
I can't help you with pandas, but can show you how to do this with PyTables.
Basically you create a table referencing either a numpy recarray or a dtype that defines the mixed datatypes.
Below is a super simple example to show how to create a table with 1 string and 4 floats. Then it adds rows of data to the table.
It shows 2 different methods to add data:
1. A list of tuples (1 tuple for each row) - see append_list
2. A numpy recarray (with dtype matching the table definition) - see simple_recarr in the for loop
To get the rest of the arguments for create_table(), read the PyTables documentation. It's very helpful, and should answer additional questions. Link below:
PyTables User's Guide
import tables as tb
import numpy as np

with tb.open_file('SO_55943319.h5', 'w') as h5f:
    my_dtype = np.dtype([('A', 'S16'), ('b', float), ('c', float), ('d', float), ('e', float)])
    dset = h5f.create_table(h5f.root, 'table_data', description=my_dtype)
    # Append one row using a list:
    append_list = [('test string', -2.355, 1.957, 1.266, -6.913)]
    dset.append(append_list)
    simple_recarr = np.recarray((1,), dtype=my_dtype)
    for i in range(5):
        simple_recarr['A'] = 'string_' + str(i)
        simple_recarr['b'] = 2.0 * i
        simple_recarr['c'] = 3.0 * i
        simple_recarr['d'] = 4.0 * i
        simple_recarr['e'] = 5.0 * i
        dset.append(simple_recarr)
print('done')
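If you later want the rows back in pandas, one option is to read the PyTables table into a structured array and build a DataFrame from it (a sketch, assuming the file and table names used above):
import pandas as pd
import tables as tb

with tb.open_file('SO_55943319.h5', 'r') as h5f:
    # Table.read() returns a numpy structured array matching my_dtype
    records = h5f.root.table_data.read()

df = pd.DataFrame.from_records(records)
# The 'A' column comes back as fixed-width bytes (S16); decode it if you want str
df['A'] = df['A'].str.decode('utf-8')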

pandas read_json: "If using all scalar values, you must pass an index"

I have some difficulty in importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the University of Washington's machine learning course on Coursera. You can find the file here.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key value pairs where values are scalars. You can convert it to a dataframe with ser.to_frame('count').
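Putting the two steps together, a short usage sketch (the 'count' column name is just the label suggested above):
import pandas as pd

ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
df = ser.to_frame('count')  # the JSON values become the 'count' column, indexed by word
print(df.head())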
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
    data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
You can do as @ayhan mentions, which will give you a column-based format.
Or you can enclose the object in [ ] (source), as shown below, to give you a row format that will be convenient if you are loading multiple values and planning on using a matrix for your machine learning models.
df = pd.DataFrame([data])
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of as JSON:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, it wants you to pass an index; you have to convert the string to a dict, which is exactly what the other response is doing.
The best way is to do a json.loads on the string to convert it to a dict and load it into pandas:
import json
with open('people_wiki_map_index_to_word.json') as f:
    myfile = f.read()
jsonData = json.loads(myfile)
df = pd.DataFrame([jsonData])  # wrap in a list so the scalar values form a single row
Given a values.json file like
{
    "biennials": 522004,
    "lb915": 116290
}
running
df = pd.read_json('values.json')
fails, because pd.read_json expects a list for each key, e.g.
{
    "biennials": [522004],
    "lb915": [116290]
}
With only scalar values it returns an error saying
If using all scalar values, you must pass an index.
So you can resolve this by specifying the 'typ' arg in pd.read_json:
map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='series')
For newer pandas, 0.19.0 and later, use the lines parameter and set it to True.
The file is then read as one JSON object per line.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
It fixed the following errors I encountered, especially when some of the json files have only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
For example
cat values.json
{
    name: "Snow",
    age: "31"
}
df = pd.read_json('values.json')
Chances are you might end up with this
Error: if using all scalar values, you must pass an index
Pandas looks for a list or dictionary in the value, something like
cat values.json
{
    name: ["Snow"],
    age: ["31"]
}
So try doing this. Later on, to convert to HTML, use to_html():
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting it into an array like so
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]

python pandas HDF5Store append new dataframe with long string columns fails

For a given path, I process many gigabytes of files inside it and yield a dataframe for every processed file.
For every dataframe that is yielded, which includes two string columns of varying size, I want to dump it to disk using the very efficient HDF5 format. The error is raised when the HDFStore.append procedure is called, on the 4th or 5th iteration.
I use the following routine (simplified) to build the dataframes:
import os
import re
import zipfile
from zipfile import ZipFile as zf  # aliases zf and df are assumed from the usage in the original snippet
from pandas import DataFrame as df

def build_data_frames(path):
    data = df({'headline': [],
               'content': [],
               'publication': [],
               'file_ref': []},
              columns=['publication', 'file_ref', 'headline', 'content'])
    for curdir, subdirs, filenames in os.walk(path):
        for file in filenames:
            if zipfile.is_zipfile(os.path.join(curdir, file)):
                with zf(os.path.join(curdir, file), 'r') as arch:
                    for arch_file_name in arch.namelist():
                        if re.search(r'A[r|d]\d+.xml', arch_file_name) is not None:
                            xml_file_ref = arch.open(arch_file_name, 'r')
                            xml_file = xml_file_ref.read()
                            metadata = XML2MetaData(xml_file)
                            headlineTokens, contentTokens = XML2TokensParser(xml_file)
                            rows = [{'headline': " ".join(headlineTokens),
                                     'content': " ".join(contentTokens)}]
                            rows[0].update(metadata)
                            data = data.append(df(rows,
                                                  columns=['publication',
                                                           'file_ref',
                                                           'headline',
                                                           'content']),
                                               ignore_index=True)
                    arch.close()
                yield data
Then I use the following method to write these dataframes to disk:
from pandas import HDFStore

def extract_data(path):
    hdf_fname = extract_name(path)
    hdf_fname += ".h5"
    data_store = HDFStore(hdf_fname)
    for dataframe in build_data_frames(path):
        data_store.append('df', dataframe, data_columns=True)
        ## passing min_itemsize doesn't work either
        ## data_store.append('df', dataframe, min_itemsize=8000)
        ## trying the "alternative" command didn't help
        ## dataframe.to_hdf(hdf_fname, 'df', format='table', append=True,
        ##                  min_itemsize=80000)
    data_store.close()
Then I run:
%time load_data(publications_path)
And the ValueError I get is:
...
ValueError: Trying to store a string with len [5761] in [values_block_0]
column but this column has a limit of [4430]!
Consider using min_itemsize to preset the sizes on these columns
I tried all the options, went through all the documentation necessary for this task, and tried all the tricks I saw on the Internet. Yet I have no idea why it happens.
I use pandas ver: 0.17.0
Appreciate your help very much!
Have you seen this post? stackoverflow
data_store.append('df',dataframe,min_itemsize={ 'string' : 5761 })
Change 'string' to your column name.
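Applied to the two text columns in the question, the call could look like this (a sketch; the sizes are arbitrary placeholders, not recommendations):
# Preset generous maximum string widths for the variable-length text columns
data_store.append('df', dataframe, data_columns=True,
                  min_itemsize={'headline': 5000, 'content': 80000})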
