Assign dataframe head results to another dataframe - python

Here is the scenario: there are millions of records at the source, and I'd like to sample only the first 10k of them:
df = record.head(10000)
Then I search those 10k records for the first one whose specific column is not null:
df[~df['af_ad_id'].isnull()].head(1)
This returns an error: AttributeError: 'NoneType' object has no attribute 'isnull'.

You're getting that error because df['af_ad_id'] is None.

Try df[~df['af_ad_id'].isnull()].iloc[0]
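As a runnable sketch of the suggestion above (the column name af_ad_id comes from the question; the sample data is invented for illustration), the first row whose column is not null can be pulled out with notnull() and iloc:

```python
import pandas as pd

# Hypothetical sample standing in for the 10k-record slice
df = pd.DataFrame({"af_ad_id": [None, None, "ad-123", "ad-456"],
                   "clicks": [1, 2, 3, 4]})

# Keep only the rows where af_ad_id is not null, then take the first one
first = df[df["af_ad_id"].notnull()].iloc[0]
print(first["af_ad_id"])  # ad-123
```

Note that df[...][0] would look for a *column* named 0 and raise a KeyError; .iloc[0] is the positional way to get the first row.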

Related

'DataFrame' object has no attribute 'merge'

I am new to PySpark and I am trying to merge a DataFrame into the one present in the Delta location using the merge function.
DEV_Delta.alias("t").merge(df_from_pbl.alias("s"),condition_dev)\
.whenMatchedUpdateAll() \
.whenNotMatchedInsertAll()\
.execute()
Both DataFrames have an equal number of columns, but when I run this particular command in my notebook I get the following error:
'DataFrame' object has no attribute 'merge'
I couldn't find solutions for this particular task and hence am raising a new question. Could you please help me figure out this issue?
Thanks,
Afras Khan
You need to have an instance of the DeltaTable class, but you're passing the DataFrame instead. For this you need to create it using the DeltaTable.forPath (pointing to a specific path) or DeltaTable.forName (for a named table), like this:
DEV_Delta = DeltaTable.forPath(spark, 'some path')
DEV_Delta.alias("t").merge(df_from_pbl.alias("s"),condition_dev)\
.whenMatchedUpdateAll() \
.whenNotMatchedInsertAll()\
.execute()
If you have the data as a DataFrame only, you need to write it out as a Delta table first.
See documentation for more details.

AttributeError: 'DataFrame' object has no attribute 'dtype' error in pyspark

I have categoryDf, which is a Spark DataFrame, and it prints successfully:
categoryDf.limit(10).toPandas()
I want to join it to another Spark DataFrame, so I tried this:
df1=spark.read.parquet("D:\\source\\202204121920-seller_central_opportunity_explorer_niche_summary.parquet")
#df1.limit(5).toPandas()
df2=df1.join(categoryDf,df1["category_id"] == categoryDf["cat_id"])
df2.show()
When I run df2.show() I see the expected output, so the join is happening successfully. But when I change it to df2.limit(10).toPandas(), I get the error:
AttributeError: 'DataFrame' object has no attribute 'dtype' error in pyspark
I want to see how the data looks after the join, which is why I tried df2.limit(10).toPandas(). Is there any other way to inspect the data, given that the join itself succeeds?
My Python version is 3.7.7, and the Spark version is 2.4.4.
I faced the same problem; in my case it was because I had duplicate column names after the join.
I see you have report_date and marketplaceid in both DataFrames. For each duplicated pair, you need to either drop one or both of the columns, or rename one of them (e.g. with withColumnRenamed) before the join.
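A minimal plain-pandas illustration of why duplicate column names cause trouble downstream (the column name report_date is taken from the answer above; the data is made up): once two columns share a name, selecting that name returns a DataFrame rather than a Series, which breaks any code that expects a single column with a single dtype — the same ambiguity toPandas() runs into.

```python
import pandas as pd

# Two frames that both carry a report_date column, glued side by side
left = pd.DataFrame({"report_date": ["2022-04-12"], "category_id": [1]})
right = pd.DataFrame({"report_date": ["2022-04-12"], "cat_id": [1]})
joined = pd.concat([left, right], axis=1)  # report_date now appears twice

# Selecting the duplicated name yields a 2-column DataFrame, not a Series
print(type(joined["report_date"]).__name__)  # DataFrame

# Renaming one copy before combining removes the ambiguity
right = right.rename(columns={"report_date": "cat_report_date"})
joined = pd.concat([left, right], axis=1)
print(type(joined["report_date"]).__name__)  # Series
```

The Spark-side equivalent of the rename is withColumnRenamed on one of the two DataFrames before the join.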

Python Conditional on a date time Object

for index, row in HDFC_Nifty.iterrows():
    if HDFC_Nifty.iat(index, 0).dt.year == 2016:
        TE_Nifty_2016.append(row['TE'])
    else:
        TE_Nifty_2017.append(row['TE'])
Hello,
I am trying to iterate over the DataFrame, more specifically to apply a conditional to the Date column, which is formatted as a datetime object.
I keep getting the error below:
TypeError: '_iAtIndexer' object is not callable
I do not know how to proceed. I have tried various loops but have been largely unsuccessful, and I am not able to understand what I am doing wrong.
Thanks for the help!
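A sketch of a fix, assuming the first column of HDFC_Nifty holds the dates (the data below is invented): iat is an indexer, so it takes square brackets rather than parentheses, and the scalar Timestamp it returns exposes .year directly — the .dt accessor only exists on a Series.

```python
import pandas as pd

# Hypothetical data standing in for HDFC_Nifty: a date column plus a 'TE' column
HDFC_Nifty = pd.DataFrame({
    "Date": pd.to_datetime(["2016-03-01", "2017-05-02", "2016-11-30"]),
    "TE": [0.10, 0.20, 0.30],
})

TE_Nifty_2016, TE_Nifty_2017 = [], []
for i, row in HDFC_Nifty.iterrows():
    # iat[...] takes integer positions in brackets; the result is a
    # Timestamp scalar, so .year is used without .dt
    if HDFC_Nifty.iat[i, 0].year == 2016:
        TE_Nifty_2016.append(row["TE"])
    else:
        TE_Nifty_2017.append(row["TE"])

print(TE_Nifty_2016)  # [0.1, 0.3]
print(TE_Nifty_2017)  # [0.2]
```

A vectorized alternative avoids the loop entirely: HDFC_Nifty.loc[HDFC_Nifty["Date"].dt.year == 2016, "TE"].tolist(). (This assumes a default integer index; with a non-default index, iat positions and the iterrows index would differ.)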

Replace string in one part pandas dataframe

print(df["date"].str.replace("2016","16"))
The code above works fine. What I really want is to make this replacement in just a small part of the DataFrame, something like:
df.loc[2:4,["date"]].str.replace("2016","16")
However, here I get an error:
AttributeError: 'DataFrame' object has no attribute 'str'
What about df['date'].loc[2:4].str.replace('2016', '16')?
By selecting ['date'] first, you know you are dealing with a Series, which does have the .str accessor.
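A runnable sketch of the idea (the data is invented): the expression above only returns the modified slice; to actually keep the change in the DataFrame, assign it back through .loc. Note that .loc[2:4] is label-based and inclusive, so it covers rows 2, 3, and 4.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2016-01-01", "2016-02-01", "2016-03-01",
                            "2016-04-01", "2016-05-01"]})

# Read-only: replace in rows 2..4 of the 'date' column
print(df["date"].loc[2:4].str.replace("2016", "16").tolist())
# ['16-03-01', '16-04-01', '16-05-01']

# To persist the change, assign the replaced slice back with .loc
df.loc[2:4, "date"] = df.loc[2:4, "date"].str.replace("2016", "16")
print(df["date"].tolist())
# ['2016-01-01', '2016-02-01', '16-03-01', '16-04-01', '16-05-01']
```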

How can I fill the NAs using groupby function in Pandas (using Python) when I have more than one column?

I am trying to fill the NAs in my DataFrame using the following code and got an error. Can anyone help? Why is it not working? I need to group by more than one column (gender and age). With only one column it works, but beyond one column I get an error.
Here is the code:
df['NewCol'].fillna(df.groupby(['gender','age'])['grade'].transform('mean'),inplace=True)
The error message is:
TypeError: 'NoneType' object is not subscriptable
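A runnable sketch of the intended fill (column names from the question, data invented): the multi-column groupby itself is fine. A common cause of this exact TypeError is that df is None at the point of the call — for example, because an earlier inplace=True operation's return value (which is None) was assigned back to df — so df['NewCol'] fails before fillna ever runs.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "age":    [20, 20, 30, 30],
    "grade":  [80.0, 90.0, 60.0, 70.0],
    "NewCol": [np.nan, 85.0, 65.0, np.nan],
})

# Fill each NA with the mean grade of its (gender, age) group;
# assigning back avoids inplace=True entirely
df["NewCol"] = df["NewCol"].fillna(
    df.groupby(["gender", "age"])["grade"].transform("mean"))
print(df["NewCol"].tolist())  # [85.0, 85.0, 65.0, 65.0]
```

Here the (F, 20) group's mean grade is 85.0 and the (M, 30) group's is 65.0, so the two NAs are filled with those values.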
