Input:
source_dataframe = spark.createDataFrame(
    [
        (1, "1", "2020-01-01", 10),
        (1, "2", "2020-01-01", 20),
        (1, "2", "2020-02-01", 30)
    ],
    ("country_code", "cust_id", "day", "value")
)
Config:
input_config = """
[
  {
    "source": "source_dataframe",
    "operation": "max",
    "group": ["country_code", "cust_id"]
  }
]
"""
import json
config_dict = json.loads(input_config)
print(config_dict)
Read from the config and apply the operation on the input dataframe. Here I have hardcoded the dataframe (source_dataframe) and the operation (max); this works fine:
for each in config_dict:
    result = source_dataframe.groupBy(["country_code", "cust_id"]).agg(max("value"))
    result.show()
However, instead of hardcoding, if I try to read the dataframe from the config dynamically and apply the operation on the input, I run into different errors. This could be because, while reading, they are converted to strings. How do I convert the string objects so that they work?
Error: 'str' object has no attribute 'groupBy'
result = each['source'].groupBy(["country_code", "cust_id"]).agg(max("value"))
Error: TypeError: 'str' object is not callable
result = source_dataframe.groupBy(["country_code", "cust_id"]).agg(each['operation']("value"))
This part, where I read the groupBy columns dynamically, works fine:
result = source_dataframe.groupBy(each["group"]).agg(max("value"))
I tried looking at other posts but could not figure out a solution. Can anyone please help?
Maybe you should evaluate the string, which would grant you access to the underlying dataframe.
result = eval(each['source']).groupBy(["country_code", "cust_id"]).agg(max("value"))
I can't verify since I got an error from your first part.
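If you'd rather avoid eval, a dictionary lookup works as well. A minimal sketch, assuming every source name in the config is registered in a plain dict and that the operation name matches a function in pyspark.sql.functions (both assumptions on my part):
from pyspark.sql import functions as F

# hypothetical registry mapping the config's source names to actual DataFrames
sources = {"source_dataframe": source_dataframe}

for each in config_dict:
    df = sources[each["source"]]              # resolve the string to a DataFrame
    agg_func = getattr(F, each["operation"])  # resolve "max" to pyspark.sql.functions.max
    result = df.groupBy(each["group"]).agg(agg_func("value"))
    result.show()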
I'm trying to change the Struct type to a Map type. I found this solution:
from pyspark.sql.functions import col, lit, create_map

df = df.withColumn("propertiesMap", create_map(
    lit("salary"), col("properties.salary"),
    lit("location"), col("properties.location")
)).drop("properties")
but in my case I would need to write a lot of lit calls, which I tried to avoid with a for loop. That does not seem to work: when I try to print the schema, I get the error 'list' object has no attribute 'printSchema'.
my code is here:
field_names = [field.name for field in next(field for field in df4.schema.fields if field.name == "values").dataType.fields]
df5 = [df4.withColumn("valuesMap",
                      create_map(lit(field_name), col(f"values.{field_name}")))
       for field_name in field_names]
I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds).
Throughout the examples we use:
import pandas as pd
import pyarrow as pa
from bson import ObjectId
Here's a minimal example to show the situation:
df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': ObjectId('5e9992543bfddb58073803e7')},
        {'name': 'bob', 'oid': ObjectId('5e9992543bfddb58073803e8')},
    ]
)
df.to_parquet('some_path')
And we get:
ArrowInvalid: ('Could not convert 5e9992543bfddb58073803e7 with type ObjectId: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column oid with type object')
I tried to follow this reference: https://arrow.apache.org/docs/python/extending_types.html
Thus I wrote the following type extension:
class ObjectIdType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(12), "my_package.objectid")

    def __arrow_ext_serialize__(self):
        # since we don't have a parametrized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized
        # metadata.
        return ObjectIdType()
And I was able to get a working pyarrow array for my oid column:
values = df['oid']
storage_array = pa.array(values.map(lambda oid: oid.binary), type=pa.binary(12))
objectid_type = ObjectIdType()  # instance of the extension type defined above
pa.ExtensionArray.from_storage(objectid_type, storage_array)
Now where I’m stuck, and cannot find any good solution on the internet, is how to save my df to parquet, letting it interpret which column needs which Extension. I might change columns in the future, and I have several different types that need this treatment.
How can I simply create parquet files from dataframes and restore them while transparently converting the types?
I tried to create a pyarrow.Table object, and append columns to it after preprocessing, but it doesn’t work as table.append_column takes binary columns and not pyarrow.Arrays, plus the whole isinstance thing looks like a terrible solution.
table = pa.Table.from_pandas(pd.DataFrame())
for col, values in test_df.iteritems():
    if isinstance(values.iloc[0], ObjectId):
        arr = pa.array(
            values.map(lambda oid: oid.binary), type=pa.binary(12)
        )
    elif isinstance(values.iloc[0], ...):
        ...
    else:
        arr = pa.array(values)
    table.append_column(arr, col)  # FAILS (wrong type)
Pseudocode of the ideal solution:
parquetize(df, path, my_custom_types_conversions)
# ...
new_df = unparquetize(path, my_custom_types_conversions)
assert df.equals(new_df) # types have been correctly restored
I'm getting lost in pyarrow's docs on whether I should use ExtensionType, serialization, or something else to write these functions. Any pointer would be appreciated.
Side note: I do not need parquet by all means; the main issue is being able to save and restore dataframes with custom types quickly and space-efficiently. I tried a solution based on jsonifying and gzipping the dataframe, but it was too slow.
I think it is probably because ObjectId is not a type that Arrow knows how to convert, hence the exception during type inference.
I tried your example and, by casting the oid values to strings during dataframe creation, it worked.
Check the steps below:
df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': "ObjectId('5e9992543bfddb58073803e7')"},
        {'name': 'bob', 'oid': "ObjectId('5e9992543bfddb58073803e8')"},
    ]
)
df.to_parquet('parquet_file.parquet')
df1 = pd.read_parquet('parquet_file.parquet', engine='pyarrow')
df1
output:
name oid
0 alice ObjectId('5e9992543bfddb58073803e7')
1 bob ObjectId('5e9992543bfddb58073803e8')
You could write a method that reads the column names and types and outputs a new DF with the columns converted to compatible types, using a switch-case pattern to choose what type to convert each column to (or whether to leave it as is).
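A minimal sketch of that idea, assuming bson is installed and that storing an ObjectId as its 12-byte binary payload is acceptable (the reverse conversion would be needed after reading the file back):
import pandas as pd
from bson import ObjectId

def to_parquet_compatible(df):
    # return a copy with unsupported column types converted to parquet-friendly ones
    out = df.copy()
    for col_name in out.columns:
        sample = out[col_name].dropna()
        if sample.empty:
            continue
        first = sample.iloc[0]
        if isinstance(first, ObjectId):
            # store the raw 12 bytes; use str(oid) instead if readable strings are preferred
            out[col_name] = out[col_name].map(lambda oid: oid.binary)
        # elif isinstance(first, SomeOtherCustomType): ... add more cases here
    return out

to_parquet_compatible(df).to_parquet('some_path')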
I've written the function (tested and working) below:
import pandas as pd

def ConvertStrDateToWeekId(strDate):
    dateformat = '2016-7-15 22:44:09'  # example of the expected input format (unused)
    aDate = pd.to_datetime(strDate)
    wk = aDate.isocalendar()[1]  # ISO week number
    yr = aDate.isocalendar()[0]  # ISO year
    Format_4_5_4_date = str(yr) + str(wk)
    return Format_4_5_4_date
and from what I have seen on line I should be able to use it this way:
ml_poLines = result.value.select('PURCHASEORDERNUMBER', 'ITEMNUMBER', 'PRODUCTCOLORID', 'RECEIVINGWAREHOUSEID', ConvertStrDateToWeekId('CONFIRMEDDELIVERYDATE'))
However, when I "show" my dataframe, the "CONFIRMEDDELIVERYDATE" column still contains the original datetime string! No errors are given.
I've also tried this:
ml_poLines['WeekId'] = (ConvertStrDateToWeekId(ml_poLines['CONFIRMEDDELIVERYDATE']))
and get the following error:
"ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions." which makes no sense to me.
I've also tried this with no success.
x = ml_poLines.toPandas();
x['testDates'] = ConvertStrDateToWeekId(x['CONFIRMEDDELIVERYDATE'])
ml_poLines2 = spark.createDataFrame(x)
ml_poLines2.show()
The above generates the following error:
AttributeError: 'Series' object has no attribute 'isocalendar'
What have I done wrong?
Your function ConvertStrDateToWeekId takes a string. But in the following line the argument of the function call is a series of strings:
x['testDates'] = ConvertStrDateToWeekId(x['CONFIRMEDDELIVERYDATE'])
A possible workaround for this error is to use the apply-function of pandas:
x['testDates'] = x['CONFIRMEDDELIVERYDATE'].apply(ConvertStrDateToWeekId)
But without more information about the kind of data you are processing it is hard to provide further help.
This was the work-around that I got to work:
# convert the CONFIRMEDDELIVERYDATE to a WeekId
x = ml_poLines.toPandas()
x['WeekId'] = x[['ITEMNUMBER', 'CONFIRMEDDELIVERYDATE']].apply(lambda y: ConvertStrDateToWeekId(y[1]), axis=1)
ml_poLines = spark.createDataFrame(x)
ml_poLines.show()
Not quite as clean as I would like.
Maybe someone else can propose a cleaner solution.
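If it helps, here is a sketch of a Spark-native alternative that avoids the round trip through pandas, assuming CONFIRMEDDELIVERYDATE is a timestamp-parseable string (note that year() returns the calendar year, which can differ from the ISO year around the new year):
from pyspark.sql import functions as F

# derive the week id directly on the Spark DataFrame
ts = F.to_timestamp('CONFIRMEDDELIVERYDATE')
ml_poLines = ml_poLines.withColumn(
    'WeekId',
    F.concat(F.year(ts).cast('string'), F.weekofyear(ts).cast('string'))
)
ml_poLines.show()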
I am trying to create a dataframe with Python, which works fine with the following command:
df_test2 = DataFrame(index = idx, data=(["-54350","2016-06-25T10:29:57.340Z","2016-06-25T10:29:57.340Z"]))
but when I try to get the data from a variable instead of hard-coding it into the data argument, e.g.:
r6 = ["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]
df_test2 = DataFrame(index = idx, data=(r6))
I expected this to be the same and to work, but instead I get:
ValueError: DataFrame constructor not properly called!
Reason for the error:
The DataFrame constructor cannot work with a string representation of the data; it needs the actual Python object.
Fix/Solution:
import ast
# convert the string representation back to the actual Python object
data = ast.literal_eval(r6)
# and use it as the input
df_test2 = DataFrame(index=idx, data=data)
which will solve the error.
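For illustration, a minimal sketch of the case this fixes, assuming r6 arrives as a string representation of a list (for example, read from a file) rather than an actual list:
import ast
from pandas import DataFrame

idx = range(3)  # placeholder index standing in for the idx from the question
r6 = '["-54350", "2016-06-25T10:29:57.340Z", "2016-06-25T10:29:57.340Z"]'
data = ast.literal_eval(r6)  # -> a real Python list
df_test2 = DataFrame(index=idx, data=data)
print(df_test2)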
I have some difficulty importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the machine learning course of University of Washington on Coursera. You can find the file here.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key value pairs where values are scalars. You can convert it to a dataframe with ser.to_frame('count').
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
You can do as @ayhan mentioned, which will give you a column-based format.
Or you can enclose the object in [ ] (source) as shown below to give you a row format, which is convenient if you are loading multiple values and planning on using a matrix for your machine learning models.
df = pd.DataFrame([data])
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of a json
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, pandas raises that error; you have to convert it to a dict, which is exactly what the other responses are doing.
The best way is to do a json.loads on the string to convert it to a dict and load it into pandas:
import json
import pandas as pd

with open('people_wiki_map_index_to_word.json', 'r') as f:
    myfile = f.read()
jsonData = json.loads(myfile)
# wrap the dict in a list so the scalar values become a single row
df = pd.DataFrame([jsonData])
Given a values.json containing only scalar values,
{
    "biennials": 522004,
    "lb915": 116290
}
the call
df = pd.read_json('values.json')
returns an error saying
If using all scalar values, you must pass an index.
because pd.read_json expects a list for each key, such as
{
    "biennials": [522004],
    "lb915": [116290]
}
You can resolve this by specifying the 'typ' arg in pd.read_json:
map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='series')
For newer pandas, 0.19.0 and later, use the lines parameter and set it to True.
The file is read as a json object per line.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
It fixed the following errors I encountered, especially when some of the json files have only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
For example
cat values.json
{
    "name": "Snow",
    "age": "31"
}
df = pd.read_json('values.json')
Chances are you might end up with this error:
ValueError: If using all scalar values, you must pass an index
Pandas expects a list or dictionary as the values, something like:
cat values.json
{
    "name": ["Snow"],
    "age": ["31"]
}
So try doing this. Later on, to convert to HTML, use to_html():
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting it into an array, like so:
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]