PySpark list to dataframe - Python

My code below creates a dataframe from lists of columns from other dataframes. I'm getting an error when calling list() on a set. How can I handle that set of columns in order to add them to my dataframe?
Error produced by +list(matchedList):
# extract columns that need to be conformed
datasetMatched = dataset.select(selectedColumns + list(matchedList))
# display(datasetMatched)
TypeError: 'list' object is not callable

This probably happens because the builtin list function is being shadowed. Make sure you didn't define a variable named list anywhere in your code.
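For illustration, a minimal sketch of how the shadowing reproduces this exact error (the list = ... line is hypothetical; dataset, selectedColumns and matchedList are assumed to exist as in the question):

selectedColumns = ["id", "name"]
matchedList = {"price", "qty"}  # a set, as in the question

list = ["oops"]  # hypothetical line that shadows the builtin list
# dataset.select(selectedColumns + list(matchedList))
# -> TypeError: 'list' object is not callable

del list  # restores the builtin; better yet, never reuse the name
datasetMatched = dataset.select(selectedColumns + list(matchedList))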

Use RDD to map dataframe rows into custom objects in PySpark

I want to convert each row of my dataframe into a Python object of a class called Fruit.
I have a dataframe df with the following columns: Identifier, Name, Quantity
I also have a dictionary fruit_color that looks like this:
fruit_color = {"Apple":"Red", "Lemon": "yellow", ...}
class Fruit(name: str, quantity: int, color: str, entryPointer: DataFrameEntry)
I also have an object called DataFrameEntry that takes as parameters a dataframe and an identifier.
class DataFrameEntry(df: DataFrame, index: int)
Now I am trying to convert each row of the dataframe "df" to this object using RDDs, and ultimately get a list of all fruits, through this piece of code:
df.rdd.map(lambda x: Fruit(
    x.__getitem__('Name'),
    x.__getitem__('Quantity'),
    fruit_color[x.__getitem__('Name')],
    LogEntryPointer(original_df, trigger.__getitem__('StartMarker_Index')))).collect()
However, I keep getting this error:
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o55.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
Maybe my approach is wrong? How can I generally convert each row of a dataframe to a specific object in PySpark?
Thank you a lot in advance!!
You need to make sure that every object and class you use inside map is defined inside map (or otherwise shippable to the workers). To be clear, RDD's map distributes the workload across multiple workers (different machines), and those machines don't know what Fruit is.
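A minimal sketch of one way to make this work, assuming Fruit only needs plain row values. Note that capturing original_df in the closure forces Spark to pickle a JVM-backed DataFrame, which is what raises the Py4J error, so the entry_index field below is a hypothetical stand-in for the DataFrameEntry reference:

from dataclasses import dataclass

@dataclass
class Fruit:  # plain Python container, so Spark's cloudpickle can ship it to the workers
    name: str
    quantity: int
    color: str
    entry_index: int  # hypothetical stand-in for DataFrameEntry

def row_to_fruit(row):
    # only plain Python values cross the driver/worker boundary here;
    # fruit_color is a dict of strings, which pickles fine
    return Fruit(row['Name'], row['Quantity'],
                 fruit_color[row['Name']], row['Identifier'])

fruits = df.rdd.map(row_to_fruit).collect()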

PySpark DF groupBy giving error - TypeError: 'NoneType' object is not iterable

I have a PySpark dataframe. Its last 4 columns - genres_value, production_companies_values, production_countries_values and spoken_languages_values - are derived columns, added to the original dataframe after parsing JSON strings.
I am trying to run groupBy as df2.groupBy("production_countries_values").count().show(), but it throws the error 'NoneType' object is not iterable.
I tried 'select' and 'filter' on the column, and those commands return without any error, whereas groupBy on any of the four new columns added after parsing throws the same error - 'NoneType' object is not iterable. groupBy works on the other columns of the DF.
The command df2.where(col('production_countries_values')=='unknown').show() also throws the error 'NoneType' object is not iterable.
It seems the production_countries_values column has null values, and you can't group the null entries together. You can use a when condition to replace the null values with some default value, and then the group-by will work.
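A minimal sketch of that fix, assuming df2 from the question (the 'unknown' default mirrors the value used in the where command above):

from pyspark.sql import functions as F

# replace nulls with a placeholder before grouping
df2 = df2.withColumn(
    'production_countries_values',
    F.when(F.col('production_countries_values').isNull(), F.lit('unknown'))
     .otherwise(F.col('production_countries_values')))

df2.groupBy('production_countries_values').count().show()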

create dynamic column names in pandas

I am trying to create multiple dataframes inside a for loop using the below code:
for i in range(len(columns)):
    f'df_v{i+1}' = df.pivot(index="no", columns=list1[i], values=list2[i])
But I get the error "Cannot assign to literal". Is there a way to create the dataframes dynamically in pandas?
This syntax
f'df_v{i+1}' = df.pivot(index="no", columns=list1[i], values=list2[i])
means that you are trying to assign a DataFrame to a string literal, which is not possible. You might try using a dictionary instead:
my_dfs = {}
for i in range(len(columns)):
    my_dfs[f'df_v{i+1}'] = df.pivot(index="no", columns=list1[i], values=list2[i])
A dictionary allows the use of named keys, which seems to be what you want. This way you can access your dataframes using my_dfs['df_v1'], for example.

TypeError: unhashable type: 'list' when trying to slice a dataframe

My course taught us that the way to choose a specific value in a pandas dataframe is by typing:
df.loc([row,column])
or
df.loc([[row],[column]])
but when I try to do it, I get the following error message:
"TypeError: unhashable type: 'list'"
What's wrong?
It's hard to say without a clear example, but I think that where you have:
file.loc([row,column])
# and
file.loc([[row],[column]])
You probably want:
file.loc[row,column]
# and
file.loc[[row],[column]]
I.e. lose the parentheses.
No, the correct syntax for slicing pandas dataframes is:
df.loc[row,column]
WRONG:
df.loc([row,column])      # no parentheses () around the [...] expression
df.loc([[row],[column]])  # no second pair of [] brackets, and no parentheses
Assuming that's what you're trying to access here. CSV is only a file format, not a pandas object: df = pd.read_csv(...) reads in a CSV file and assigns the result to a pandas dataframe called df. A dataframe is called a dataframe, not a "variable that contains csv", and by convention we usually give them variable names like df, df2, df_b...
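To illustrate with a toy dataframe (the commented-out call reproduces the question's error, because calling .loc with a list argument ends up using the list where a hashable value is expected):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])

df.loc['x', 'a']      # scalar value: 1
df.loc[['x'], ['a']]  # 1x1 DataFrame
# df.loc(['x', 'a'])  # TypeError: unhashable type: 'list'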

Pandas to_csv and read_csv with column containing lists?

Could have sworn that I was able to serialize and read back a dataframe with a column containing lists before.
When calling df.to_csv(path), the column containing lists is written out as expected; when calling pd.read_csv(path), the column that previously contained lists now contains strings, but it needs to be lists again.
I've written a converter argument to handle it, but I'd like to know if there is a better way. I've tried astype() with np.ndarray, list and 'O' with no luck.
Anyone know of a 'built-in' way of handling this?
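For reference, a minimal sketch of the converter approach mentioned above, using ast.literal_eval to parse the stringified lists back on read (the file name and the tags column are hypothetical):

import ast
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'tags': [['a', 'b'], ['c']]})
df.to_csv('out.csv', index=False)  # lists are written as their repr, e.g. "['a', 'b']"

# parse the stringified lists back into real lists on read
df2 = pd.read_csv('out.csv', converters={'tags': ast.literal_eval})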
