Use RDD to map dataframe rows into custom objects in PySpark (Python)

I want to convert each row of my dataframe into a Python class object called Fruit.
I have a dataframe df with the following columns: Identifier, Name, Quantity
I also have a dictionary fruit_color that looks like this:
fruit_color = {"Apple":"Red", "Lemon": "yellow", ...}
The Fruit class looks like this:
class Fruit(name: str, quantity: int, color: str, entryPointer: DataFrameEntry)
I also have an object called DataFrameEntry that takes as parameters a dataframe and an identifier.
class DataFrameEntry(df: DataFrame, index: int)
Now I am trying to convert each row of the dataframe "df" to this object using RDDs, and ultimately get a list of all fruits, through this piece of code:
df.rdd.map(lambda x: Fruit(
    x.__getitem__('Name'),
    x.__getitem__('Quantity'),
    fruit_color[x.__getitem__('Name')],
    LogEntryPointer(original_df, trigger.__getitem__('StartMarker_Index')))).collect()
However, I keep getting this error:
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o55.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
Maybe my approach is wrong? How can I generally convert each row of a dataframe to a specific object in PySpark?
Thank you a lot in advance!!

You need to make sure that all objects and classes you're using inside map are defined inside map, or are otherwise picklable. To be clearer: RDD's map will distribute the workload across multiple workers (different machines), and those machines don't know what Fruit is. In particular, your lambda also captures original_df, and a DataFrame is backed by a JVM object that cannot be pickled and shipped to the workers, which is what the Py4JError about __getstate__ is complaining about.
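For illustration, a minimal sketch of that idea (the sample data and the Fruit stand-in below are assumptions based on the question): keep only plain, picklable values inside map and build the Fruit objects on the driver after collect(), so the DataFrame itself is never captured by the lambda.

from pyspark.sql import SparkSession

# Hypothetical stand-in for the question's Fruit class, defined at module
# level on the driver.
class Fruit:
    def __init__(self, name, quantity, color, identifier):
        self.name = name
        self.quantity = quantity
        self.color = color
        self.identifier = identifier

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "Apple", 3), (2, "Lemon", 5)],
    ["Identifier", "Name", "Quantity"],
)
fruit_color = {"Apple": "Red", "Lemon": "Yellow"}

# The lambda only captures picklable values (strings, ints, a plain dict);
# it does not capture the DataFrame, which is backed by a JVM object and
# cannot be pickled.
rows = df.rdd.map(
    lambda r: (r["Name"], r["Quantity"], fruit_color.get(r["Name"]), r["Identifier"])
).collect()

# Build the Fruit objects on the driver from the collected plain tuples.
fruits = [Fruit(*row) for row in rows]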

Related

AttributeError: 'Pandas' object has no attribute 'to_dict'

I am trying to convert each tuple of a Pandas DataFrame into a dictionary, because I need the dicts to call an API later. I have an entire DataFrame, which I iterate over with a for loop to get all the data inside it. Here is the code:
df = ....DataFrame definition and retrieval
for item in df.itertuples():
print(item.to_dict)
But the following error appears: AttributeError: 'Pandas' object has no attribute 'to_dict'
I have also tried using the dict() constructor to convert the tuple, but there is the following error: cannot convert dictionary update sequence element #0 to a sequence
I know that I could do almost everything manually, but it would take two for loops and it would take forever. Is there a way to convert the structure I have into a dict, also based on the columns? Thank you so much
df.to_dict() is a method that you call, and depending on the arguments you can get:
‘list’ : dict like {column -> [values]}
‘series’ : dict like {column -> Series(values)}
‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
See the docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html
And if you have only one dataframe, you can use this method without itertuples.
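For illustration, a small sketch with made-up data showing the default orientation and the 'records' orientation, which gives one dict per row (usually what you want for an API payload):

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["apple", "lemon"]})

# Default orientation: {column -> {index -> value}}
print(df.to_dict())
# {'id': {0: 1, 1: 2}, 'name': {0: 'apple', 1: 'lemon'}}

# One dict per row
print(df.to_dict(orient="records"))
# [{'id': 1, 'name': 'apple'}, {'id': 2, 'name': 'lemon'}]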

pySpark list to dataframe

My code below creates a dataframe from lists of columns from other dataframes. I'm getting an error when calling list() on a set. How can I handle that set of columns in order to add them to my dataframe?
The error is produced by +list(matchedList):
# extract columns that need to be conformed
datasetMatched = dataset.select(selectedColumns +list(matchedList))
#display(datasetMatched)
TypeError: 'list' object is not callable
This probably happens due to shadowing the built-in list function. Make sure you didn't define any variable named list in your code.
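A minimal illustration of the problem and the fix (the variable names here are made up):

# Shadowing the builtin: after this assignment, list(...) is no longer callable.
list = ["col_a", "col_b"]

try:
    list({"col_c", "col_d"})
except TypeError as exc:
    print(exc)  # 'list' object is not callable

# Fix: rename the variable (or, in an interactive session, `del list` to
# restore the builtin).
del list
print(list({"col_c", "col_d"}))  # works again (set order is arbitrary)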

Issue trying converting object to numeric

I'm new to Python, so apologies for any dumb questions in advance.
I'm trying to import a CSV into a pandas DataFrame, do some df.groupby (based on mean), and merge it with other DataFrames.
The problem is that the values for the mean are read in as object:
Plant object
Component int64
PerUnitPrice object >> that's what I'm talking about
dtype: object
I did try to convert using .astype(float) and got an error:
Traceback (most recent call last):
and with:
price_['PerUnitPrice'] = pd.to_numeric(price_['PerUnitPrice'], errors='coerce')
That worked partially: it set all the values bigger than 999 to NaN, or at least that's what I think it did.
Here are some lines from the CSV that I'm importing:
csv lines
It looks like a problem with reading the data in from the .csv. Looking at the image of the .csv you shared, there are thousand separators (commas) in the PerUnitPrice column, which are causing the column to be parsed incorrectly.
In your pd.read_csv call, set thousands=',' (the parameter takes the separator character), and specify the dtype with a dictionary, as in the documentation for read_csv. In your case, I think you'll want {'Plant': str, 'Component': str, 'PerUnitPrice': float}.
A tip for debugging this sort of issue is to use the info() method of a pd.DataFrame to check the dtypes, size, and non-null values of a DataFrame.
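A minimal sketch of that fix with made-up data:

import io
import pandas as pd

# Made-up sample resembling the question's CSV, with commas as thousand
# separators in PerUnitPrice.
csv_text = """Plant,Component,PerUnitPrice
PlantA,1001,"1,250.50"
PlantB,1002,975.00
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    thousands=",",                           # strip the thousand separators
    dtype={"Plant": str, "Component": str},  # PerUnitPrice now parses as float64
)

df.info()  # confirm the dtypes before grouping
print(df.groupby("Plant")["PerUnitPrice"].mean())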

Problem with pandas 'to_csv' of 'DataFrameGroupBy' objects

I want to output a Pandas groupby dataframe to CSV. Tried various StackOverflow solutions but they have not worked.
Python 3.7
This is my dataframe
This is my code
groups = clustering_df.groupby(clustering_df['Family Number'])
groups.apply(lambda clustering_df: clustering_df.sort_values(by=['Family Number']))
groups.to_csv('grouped.csv')
Error Message
(AttributeError: Cannot access callable attribute 'to_csv' of 'DataFrameGroupBy' objects, try using the 'apply' method)
You just need to do this:
groups = clustering_df.groupby(clustering_df['Family Number'])
groups = groups.apply(lambda clustering_df: clustering_df.sort_values(by=['Family Number']))
groups.to_csv('grouped.csv')
What you have done is not saved the result of the groupby-apply. It would still get applied and might show output, depending on what IDE/notebook you use, but to save it to a file you have to apply the function on the groupby object, store the result in a variable, and then write that variable to the file.
Chaining works as well:
groups = clustering_df.groupby(clustering_df['Family Number']).apply(lambda clustering_df: clustering_df.sort_values(by=['Family Number']))
groups.to_csv("grouped.csv")

TypeError: 'Zipcode' object is not subscriptable

I'm using Python3 and have a pandas df that looks like
zip
0 07105
1 00000
2 07030
3 07032
4 07032
I would like to add state and city using the python package uszipcode
import uszipcode
search = SearchEngine(simple_zipcode=False)
def zco(x):
    print(search.by_zipcode(x)['City'])
df['City'] = df[['zip']].fillna(0).astype(int).apply(zco)
However, I get the following error
TypeError: 'Zipcode' object is not subscriptable
Can someone help with the error? Thank you in advance.
The call search.by_zipcode(x) returns a ZipCode() instance, not a dictionary, so applying ['City'] to that object fails.
Instead, use either the .major_city attribute or the shorter alias, the .city attribute; you want to return that value, not print it:
def zco(x):
    return search.by_zipcode(x).city
If all you are going to use the uszipcode project for is mapping zip codes to state and city names, you don’t need to use the full database (a 450MB download). Just stick with the ‘simple’ version, which is only 9MB, by leaving out the simple_zipcode=False argument to SearchEngine().
Next, this is going to be really, really slow. .apply() uses a simple loop under the hood, and for each row the .by_zipcode() method will query a SQLite database using SQLAlchemy, create a single result object with all the columns from the matching row, and then return that object, just so you can get a single attribute from it.
You'd be much better off querying the database directly with the Pandas SQL methods. The uszipcode package is still useful here, as it handles downloading the database and creating a SQLAlchemy session for you; the SearchEngine.ses attribute gives you direct access to that session. From there I'd just do:
from uszipcode import SearchEngine, SimpleZipcode
search = SearchEngine()
query = (
    search.ses.query(
        SimpleZipcode.zipcode.label('zip'),
        SimpleZipcode.major_city.label('city'),
        SimpleZipcode.state.label('state'),
    ).filter(
        SimpleZipcode.zipcode.in_(df['zip'].dropna().unique())
    )
).selectable
zipcode_df = pd.read_sql_query(query, search.ses.connection(), index_col='zip')
to create a Pandas Dataframe with all your unique zipcodes mapped to city and state columns. You can then join your dataframe with the zipcode dataframe:
df = pd.merge(df, zipcode_df, how='left', left_on='zip', right_index=True)
This adds city and state columns to your original dataframe. If you need to pull in more columns, add them to the search.ses.query(...) portion, using .label() to give them a suitable column name in the output dataframe (without a .label() they'll be prefixed with simple_zipcode_ or zipcode_, depending on the class you are using). Pick from the documented model attributes, but take into account that if you need access to the full Zipcode model attributes you have to use SearchEngine(simple_zipcode=False) to ensure you have the full 450MB dataset at your disposal, and then use Zipcode.<column>.label(...) instead of SimpleZipcode.<column>.label(...) in the query.
With the zipcodes as the index in the zipcode_df dataframe, that's going to be a lot faster (zippier :-)) than using SQLAlchemy on each row individually.
