Problem with pandas 'to_csv' of 'DataFrameGroupBy' objects - python

I want to output a Pandas groupby dataframe to CSV. Tried various StackOverflow solutions but they have not worked.
Python 3.7
This is my dataframe
This is my code
groups = clustering_df.groupby(clustering_df['Family Number'])
groups.apply(lambda clustering_df: clustering_df.sort_values(by=['Family Number']))
groups.to_csv('grouped.csv')
Error Message
(AttributeError: Cannot access callable attribute 'to_csv' of 'DataFrameGroupBy' objects, try using the 'apply' method)

You just need to do this:
groups = clustering_df.groupby(clustering_df['Family Number'])
groups = groups.apply(lambda clustering_df: clustering_df.sort_values(by=['Family Number']))
groups.to_csv('grouped.csv')
The problem is that you never stored the result of the groupby-apply. apply returns a new DataFrame and does not modify the groupby object, so depending on your IDE or notebook the result may just be displayed and then discarded. To save it to a file, apply the function to the groupby object, assign the result to a variable, and call to_csv on that variable.
Chaining works as well:
groups = clustering_df.groupby(clustering_df['Family Number']).apply(lambda clustering_df: clustering_df.sort_values(by=['Family Number']))
groups.to_csv("grouped.csv")
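As a minimal sketch of why the original call fails, assuming a small made-up clustering_df (the Name column and all values are hypothetical): the groupby result is a DataFrameGroupBy, which has no to_csv, while sorting returns an ordinary DataFrame. If the only goal is a CSV with rows ordered by family, a plain sort_values may be enough. This is an alternative to the apply shown above, not the answer's method.

```python
import io

import pandas as pd

# Hypothetical sample data; only 'Family Number' comes from the question.
clustering_df = pd.DataFrame(
    {"Family Number": [2, 1, 2, 1], "Name": ["a", "b", "c", "d"]}
)

groups = clustering_df.groupby("Family Number")
# A DataFrameGroupBy is not a DataFrame, so it has no to_csv method.
groupby_has_to_csv = hasattr(groups, "to_csv")

# Sorting returns a regular DataFrame; mergesort is stable, so the original
# row order is kept inside each family.
sorted_df = clustering_df.sort_values(by="Family Number", kind="mergesort")

# Written to an in-memory buffer here; pass a path such as 'grouped.csv' instead.
buffer = io.StringIO()
sorted_df.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
```

Passing index=False leaves the row index out of the file, which is usually what is wanted when the index carries no meaning.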

Related

'DataFrame' object has no attribute 'merge'

I am new to PySpark and I am trying to merge a DataFrame into the one stored at a Delta location using the merge function.
DEV_Delta.alias("t").merge(df_from_pbl.alias("s"),condition_dev)\
.whenMatchedUpdateAll() \
.whenNotMatchedInsertAll()\
.execute()
Both DataFrames have the same number of columns, but when I run this command in my notebook I get the following error:
'DataFrame' object has no attribute 'merge'
I couldn't find a solution for this particular task, hence this new question. Could you please help me figure out this issue?
Thanks,
Afras Khan
You need to have an instance of the DeltaTable class, but you're passing the DataFrame instead. For this you need to create it using the DeltaTable.forPath (pointing to a specific path) or DeltaTable.forName (for a named table), like this:
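from delta.tables import DeltaTable
# Load the target as a DeltaTable (not a DataFrame) so that merge is available
DEV_Delta = DeltaTable.forPath(spark, 'some path')
DEV_Delta.alias("t").merge(df_from_pbl.alias("s"),condition_dev)\
.whenMatchedUpdateAll() \
.whenNotMatchedInsertAll()\
.execute()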
If the target data exists only as a DataFrame, you need to write it out as a Delta table first.
See documentation for more details.

AttributeError: 'DataFrame' object has no attribute 'dtype' error in pyspark

I have categoryDf, which is a Spark DataFrame, and it prints successfully:
categoryDf.limit(10).toPandas()
I want to join this to another Spark DataFrame, so I tried this:
df1=spark.read.parquet("D:\\source\\202204121920-seller_central_opportunity_explorer_niche_summary.parquet")
#df1.limit(5).toPandas()
df2=df1.join(categoryDf,df1["category_id"] == categoryDf["cat_id"])
df2.show()
When I use df2.show() then I see the output as:
The join succeeds. But when I change it to df2.limit(10).toPandas(), I see the error:
AttributeError: 'DataFrame' object has no attribute 'dtype'
I want to see how the data looks after the join, which is why I tried df2.limit(10).toPandas(). Is there any other way to inspect the data, given that the join itself succeeds?
My python version is:3.7.7
Spark version is:2.4.4
I faced the same problem, in my case it was because I had duplicate column names after the join.
I see you have report_date and marketplaceid in both dataframes. For each duplicated pair, you need to either drop one or both, or rename one of them.
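A plausible mechanism, sketched here in plain pandas rather than Spark: when two columns share a name, selecting that name returns a DataFrame instead of a Series, and a DataFrame has dtypes but no dtype. The frames and values below are hypothetical stand-ins for the two sides of the join, and the new name cat_report_date is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the two sides of the join.
left = pd.DataFrame(
    {"category_id": [1, 2], "report_date": ["2022-04-01", "2022-04-02"]}
)
right = pd.DataFrame(
    {"cat_id": [1, 2], "report_date": ["2022-04-01", "2022-04-02"]}
)

# Placing the frames side by side keeps both 'report_date' columns.
joined = pd.concat([left, right], axis=1)
duplicate_names = joined.columns[joined.columns.duplicated()].tolist()

# With a duplicated name, column selection returns a DataFrame, not a Series.
selected = joined["report_date"]
selected_is_frame = isinstance(selected, pd.DataFrame)
selected_has_dtype = hasattr(selected, "dtype")

# Renaming one side before combining keeps every column name unique.
right_renamed = right.rename(columns={"report_date": "cat_report_date"})
fixed = pd.concat([left, right_renamed], axis=1)
fixed_is_series = isinstance(fixed["report_date"], pd.Series)
```

On the Spark side, the corresponding step would be to call withColumnRenamed or drop on one of the DataFrames before the join, so that the joined result has unique column names when toPandas runs.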

Cannot access callable attribute 'set_index' of 'DataFrameGroupBy' objects, try using the 'apply' method

I have a dataframe that contains population of countries by year. I want to make the country column the index of the dataframe. I tried this:
df = df.set_index('Country')
Gives me an error:
Cannot access callable attribute 'set_index' of 'DataFrameGroupBy'
objects, try using the 'apply' method.
My data frame looks like this:
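Try setting the index in place. With inplace=True the method returns None, so do not assign the result back to df: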
df.set_index('Country', inplace=True)
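The error message itself suggests that df was a DataFrameGroupBy at the time of the call, for instance because the result of an earlier groupby had been assigned back to df. The sketch below reproduces that situation under this assumption; the table, its column names other than Country, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical population table; only 'Country' comes from the question.
df = pd.DataFrame(
    {
        "Country": ["A", "B", "A"],
        "Year": [2000, 2000, 2001],
        "Population": [10, 20, 11],
    }
)

# A groupby result is not a DataFrame, so set_index is not available on it.
grouped = df.groupby("Country")
try:
    grouped.set_index("Country")
    error_type = None
except AttributeError as exc:
    error_type = type(exc).__name__

# Calling set_index on the DataFrame itself works as expected.
indexed = df.set_index("Country")
```

Checking type(df) just before the failing line is a quick way to confirm whether this is what happened.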

'function' object has no attribute 'str' in pandas

I am using the code below to read a CSV file and split its strings, which are separated by /
DATA IS
SRC_PATH TGT_PATH
/users/sn/Retail /users/am/am
/users/sn/Retail Reports/abc /users/am/am
/users/sn/Automation /users/am/am
/users/sn/Nidh /users/am/xzy
import pandas as pd
df = pd.read_csv(r'E:\RCTemplate.csv',index_col=None, header=0)
s1 = df.SRC_PATH.str.split('/', expand=True)
I get the correctly split data in s1, but when I try the same operation on a single row it throws the error "'function' object has no attribute 'str'".
The error is thrown by the code below:
df2= [(df.SRC_PATH.iloc[0])]
df4=pd.DataFrame([(df.SRC_PATH.iloc[0])],columns = ['first'])
newvar = df4.first.str.split('/', expand=True)
pandas thinks you are trying to access the method DataFrame.first() rather than the column named 'first'.
This is why it is best practice to access DataFrame columns with square brackets rather than attribute (.column) access:
df4['first'].str.split('/', expand=True) instead of df4.first.str.split('/', expand=True)
Note that attribute access causes similar problems elsewhere, such as a column called 'name' clashing with the name attribute, and a host of others.
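A small sketch of the bracket-access fix, using the first path from the question's sample data. The second frame, with a column named count, is a hypothetical example added to show the same kind of clash with another DataFrame method.

```python
import pandas as pd

# Single-row frame built the same way as df4 in the question.
df4 = pd.DataFrame(["/users/sn/Retail"], columns=["first"])

# Bracket access always refers to the column, whatever its name is.
parts = df4["first"].str.split("/", expand=True)
first_row = parts.iloc[0].tolist()

# Hypothetical second example: a column named 'count' is shadowed by the
# DataFrame.count method under attribute access.
clash = pd.DataFrame({"count": [1, 2]})
attribute_is_method = callable(clash.count)
column_values = clash["count"].tolist()
```

The leading slash in the path produces an empty string as the first split element, so the useful parts start at position 1.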

AttributeError: 'list' object has no attribute 'rename'

df.rename(columns={'nan': 'RK', 'PP': 'PLAYER','SH':'TEAM','nan':'GP','nan':'G','nan':'A','nan':'PTS','nan':'+/-','nan':'PIM','nan':'PTS/G','nan':'SOG','nan':'PCT','nan':'GWG','nan':'PPG','nan':'PPA','nan':'SHG','nan':'SHA'}, inplace=True)
This is my code to rename the columns according to http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2
I want both the tables to have same column names. I am using python2 in spyder IDE.
When I run the code above, it gives me this error:
AttributeError: 'list' object has no attribute 'rename'
The original question was posted a long time ago, but I just came across the same issue and found the solution here: pd.read_html() imports a list rather than a dataframe
When you do pd.read_html you are creating a list of dataframes since the website may have more than 1 table. Add one more line of code before you try your rename:
dfs = pd.read_html(url, header=0)
and then df = dfs[0]. The df variable is now a DataFrame, which lets you run the df.rename command from the original question.
This should also fix it, where df is your dataset:
df.columns=['a','b','c','d','e','f']
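Both suggestions can be sketched together. The list below is a hand-built stand-in for what pd.read_html returns (one DataFrame per table found on the page), with made-up rows, so the example runs without network access. It also shows why the rename mapping in the question cannot work as written: a dict literal with repeated 'nan' keys keeps only the last one, so assigning column names by position is the safer route.

```python
import pandas as pd

# Stand-in for the return value of pd.read_html: a list of DataFrames.
tables = [
    pd.DataFrame([[1, "Player One", "AAA"], [2, "Player Two", "BBB"]]),
]

# A list has no rename method, which is the source of the AttributeError.
list_has_rename = hasattr(tables, "rename")

# Pick the table you need, then work with it as a normal DataFrame.
df = tables[0]

# Repeated keys in a dict literal collapse to a single entry (the last one).
mapping = {"nan": "RK", "nan": "GP"}

# Assigning names by position avoids relying on the old column labels.
df.columns = ["RK", "PLAYER", "TEAM"]
```

The list assigned to df.columns must have exactly as many names as the table has columns, otherwise pandas raises a ValueError.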
