I am trying to convert each row tuple of a Pandas DataFrame into a dictionary, because I need the dicts to call an API later. I iterate over the entire DataFrame with a for loop to get all the data inside it. Here is the code:
df = ...  # DataFrame definition and retrieval
for item in df.itertuples():
    print(item.to_dict)
But the following error appears: AttributeError: 'Pandas' object has no attribute 'to_dict'
I have also tried wrapping the tuple in dict(), but that raises another error: cannot convert dictionary update sequence element #0 to a sequence
I know that I could do almost everything manually, but that would mean two nested for loops and would take forever. Is there a way to convert the structure I have into a dict, keyed by the columns? Thank you so much.
df.to_dict()
is a method that you call on the DataFrame itself; depending on the orient argument you pass, you can get:
‘list’ : dict like {column -> [values]}
‘series’ : dict like {column -> Series(values)}
‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
See the [docs][1].
And since you have the whole DataFrame, you can use this method directly, without itertuples.
[1]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html
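For example, with a small toy frame (the 'records' orient, also listed in the docs, is the one that gives you one dict per row, matching the itertuples use case):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print(df.to_dict("list"))     # {'a': [1, 2], 'b': [3, 4]}
print(df.to_dict("split"))    # {'index': [0, 1], 'columns': ['a', 'b'], 'data': [[1, 3], [2, 4]]}
print(df.to_dict("records"))  # [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
```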
Related
I want to convert each row of my dataframe into a Python class object called Fruit.
I have a dataframe df with the following columns: Identifier, Name, Quantity
I also have a dictionary fruit_color that looks like this:
fruit_color = {"Apple":"Red", "Lemon": "yellow", ...}
class Fruit(name: str, quantity: int, color: str, entryPointer: DataFrameEntry)
I also have an object called DataFrameEntry that takes as parameters a dataframe and an identifier.
class DataFrameEntry(df: DataFrame, index: int)
Now I am trying to convert each row of the dataframe "df" to this object using RDDs, and ultimately to get a list of all fruits, through this piece of code:
df.rdd.map(lambda x: Fruit(
    x.__getitem__('Name'),
    x.__getitem__('Quantity'),
    fruit_color[x.__getitem__('Name')],
    LogEntryPointer(original_df, trigger.__getitem__('StartMarker_Index')))).collect()
However, I keep getting this error:
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o55.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
Maybe my approach is wrong? How can I generally convert each row of a dataframe to a specific object in pyspark?
Thank you a lot in advance!
You need to make sure that all objects and classes you use inside map are defined inside map. To be more clear: RDD's map distributes the workload across multiple workers (different machines), and those machines don't know what Fruit is.
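A minimal sketch of that advice, assuming a df with Name and Quantity columns. The DataFrameEntry pointer is dropped here, since it holds a reference to the DataFrame itself, which cannot be pickled and shipped to the workers, and the mapped function returns plain tuples so the collected result also pickles cleanly:

```python
def row_to_fruit(row):
    # Everything the workers need is defined inside the mapped
    # function, so it travels with the function when it is pickled.
    class Fruit:
        def __init__(self, name, quantity, color):
            self.name = name
            self.quantity = quantity
            self.color = color

    fruit_color = {"Apple": "Red", "Lemon": "yellow"}
    f = Fruit(row["Name"], row["Quantity"], fruit_color[row["Name"]])
    # Return plain data so the result pickles back to the driver.
    return (f.name, f.quantity, f.color)

fruits = df.rdd.map(row_to_fruit).collect()
```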
I want to create a pandas DataFrame from a query_set.QuerySet object, but when I run:
df = pd.DataFrame.from_dict(res)
res being the query set, I get the following error:
TypeError: Object of type DataFrame is not JSON serializable.
What can I do to fix it? Thanks in advance!
Assuming it's Django's QuerySet you can use the following code:
df = pd.DataFrame.from_dict(res.values())
queryset.values() returns a QuerySet of dicts, one per row, which pandas can consume.
You can find more information at this link.
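A minimal sketch, using a hypothetical Django model named Book:

```python
import pandas as pd
from myapp.models import Book  # hypothetical app and model

res = Book.objects.all()
df = pd.DataFrame.from_dict(res.values())  # .values() -> one dict per row
print(df.head())
```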
My code below creates a dataframe from lists of columns from other dataframes. I'm getting an error when calling list() on a set. How can I handle that set so I can add those columns to my dataframe? The error is produced by +list(matchedList):
#extract columns that need to be conform
datasetMatched = dataset.select(selectedColumns +list(matchedList))
#display(datasetMatched)
TypeError: 'list' object is not callable
It probably happens due to shadowing the builtin list function. Make sure you didn't define any variable named list in your code.
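A minimal reproduction of the problem and its fix:

```python
list = ["a", "b"]            # a variable named list shadows the builtin
try:
    cols = list({"x", "y"})  # TypeError: 'list' object is not callable
except TypeError as e:
    print(e)

del list                     # remove the shadowing name
cols = list({"x", "y"})      # the builtin is callable again
```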
print(df["date"].str.replace("2016","16"))
The code above works fine. What I really want is to make this replacement in just a small part of the DataFrame. Something like:
df.loc[2:4,["date"]].str.replace("2016","16")
However here I get an error:
AttributeError: 'DataFrame' object has no attribute 'str'
What about df['date'].loc[2:4].str.replace('2016', '16')?
By selecting ['date'] first you know you are dealing with a Series, which does have the .str accessor; df.loc[2:4, ["date"]] passes a list of columns and therefore returns a DataFrame, which doesn't.
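A quick sketch with toy data, including how to write the result back via .loc:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2016-01", "2016-02", "2016-03",
                            "2016-04", "2016-05"]})

# Column first -> Series -> .str is available:
print(df["date"].loc[2:4].str.replace("2016", "16"))

# To store the result, assign back through .loc:
df.loc[2:4, "date"] = df["date"].loc[2:4].str.replace("2016", "16")
print(df)
```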
I have a list that is generated by a function. When I execute print on my list:
print(preds_labels)
I obtain:
[(0.,8.),(0.,13.),(0.,19.),(0.,19.),(0.,19.),(0.,19.),(0.,19.),(0.,20.),(0.,21.),(0.,23.)]
but when I want to create a DataFrame with this command:
df = sqlContext.createDataFrame(preds_labels, ["prediction", "label"])
I get an error message:
not supported type: type 'numpy.float64'
If I create the list manually, I have no problem. Do you have an idea?
pyspark uses its own type system and unfortunately it doesn't deal with numpy well. It works with plain Python types, though, so you can manually convert each numpy.float64 to float, like:
df = sqlContext.createDataFrame(
    [(float(tup[0]), float(tup[1])) for tup in preds_labels],
    ["prediction", "label"]
)
Note that pyspark will then take them as pyspark.sql.types.DoubleType.
To anyone arriving here with the error:
TypeError: not supported type: <class 'numpy.str_'>
The same is true for strings: if you created your list of strings using numpy, convert them to plain Python str.
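A minimal sketch of that conversion (assuming the same sqlContext as above):

```python
import numpy as np

labels = np.array(["cat", "dog"])   # elements are numpy.str_
rows = [(str(s),) for s in labels]  # convert to plain Python str
df = sqlContext.createDataFrame(rows, ["label"])
```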