I am new to PySpark and I am trying to merge a DataFrame into the one present in a Delta location using the merge function.
DEV_Delta.alias("t").merge(df_from_pbl.alias("s"),condition_dev)\
.whenMatchedUpdateAll() \
.whenNotMatchedInsertAll()\
.execute()
Both DataFrames have an equal number of columns, but when I run this particular command in my notebook I get the following error:
'DataFrame' object has no attribute 'merge'
I couldn't find solutions for this particular task and hence am raising a new question. Could you please help me figure out this issue?
Thanks,
Afras Khan
You need an instance of the DeltaTable class, but you're passing a DataFrame instead. To get one, create it using DeltaTable.forPath (pointing to a specific path) or DeltaTable.forName (for a named table), like this:
from delta.tables import DeltaTable

DEV_Delta = DeltaTable.forPath(spark, 'some path')
DEV_Delta.alias("t").merge(df_from_pbl.alias("s"), condition_dev) \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
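If the target is instead registered as a named table in the metastore, DeltaTable.forName works the same way (the table name here is only an illustration):
DEV_Delta = DeltaTable.forName(spark, 'dev_db.dev_table')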
If you have the data only as a DataFrame, you need to write it out in Delta format first.
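For example, a minimal sketch, assuming dev_df (a hypothetical name) holds the existing target data and reusing the same path:
dev_df.write.format('delta').mode('overwrite').save('some path')  # creates the Delta table
DEV_Delta = DeltaTable.forPath(spark, 'some path')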
See documentation for more details.
I have categoryDf, which is a Spark DataFrame, and it is being printed successfully:
categoryDf.limit(10).toPandas()
I want to join this to another Spark DataFrame. So, I tried this:
df1=spark.read.parquet("D:\\source\\202204121920-seller_central_opportunity_explorer_niche_summary.parquet")
#df1.limit(5).toPandas()
df2=df1.join(categoryDf,df1["category_id"] == categoryDf["cat_id"])
df2.show()
When I use df2.show() I can see the output, and the join is happening successfully. But when I change it to df2.limit(10).toPandas(), I see the error:
AttributeError: 'DataFrame' object has no attribute 'dtype' error in pyspark
I want to see how the data looks after the join, so I tried to use df2.limit(10).toPandas(). Is there any other method to see the data, given that my join is happening successfully?
My Python version is 3.7.7.
My Spark version is 2.4.4.
I faced the same problem; in my case it was because I had duplicate column names after the join.
I see you have report_date and marketplaceid in both dataframes. For each duplicated pair, you need to either drop one or both, or rename one of them.
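For example, a sketch of dropping the duplicated columns on the categoryDf side before converting (column names are taken from the question's description):
df2 = df1.join(categoryDf, df1["category_id"] == categoryDf["cat_id"]) \
    .drop(categoryDf["report_date"]) \
    .drop(categoryDf["marketplaceid"])
df2.limit(10).toPandas()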
Here is the scenario.
Millions of records from the source.
I'd like to get a sample of only 10k records.
df = record.head(10000)
Search the 10k records and get the first record whose specific column is not null.
df[~df['af_ad_id'].isnull()].head(1)
This is returning an error -- AttributeError: 'NoneType' object has no attribute 'isnull'.
You're getting that error because df['af_ad_id'] is None.
Try df[~df['af_ad_id'].isnull()].iloc[0] to get the first matching row (plain [0] indexing would look up a column named 0, not the first row).
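A minimal runnable sketch of the pattern, using toy data in place of the real source (the column values are assumptions):
import pandas as pd

record = pd.DataFrame({'af_ad_id': [None, None, 'abc123', 'def456']})
df = record.head(10000)                       # sample at most the first 10k rows
first = df[~df['af_ad_id'].isnull()].iloc[0]  # first row whose af_ad_id is not null
print(first['af_ad_id'])                      # -> abc123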
I want to output a Pandas groupby DataFrame to CSV. I tried various StackOverflow solutions, but they have not worked.
Python 3.7
This is my dataframe
This is my code
groups = clustering_df.groupby(clustering_df['Family Number'])
groups.apply(lambda clustering_df: clustering_df.sort_values(by=['Family Number']))
groups.to_csv('grouped.csv')
Error Message
(AttributeError: Cannot access callable attribute 'to_csv' of 'DataFrameGroupBy' objects, try using the 'apply' method)
You just need to do this:
groups = clustering_df.groupby(clustering_df['Family Number'])
groups = groups.apply(lambda clustering_df: clustering_df.sort_values(by=['Family Number']))
groups.to_csv('grouped.csv')
What you have done is not save the result of the groupby-apply. The function still gets applied and may display output, depending on which IDE/notebook you use, but to save it to a file you have to apply the function on the groupby object, store the result in a variable, and then save that variable to a file.
Chaining works as well:
groups = clustering_df.groupby(clustering_df['Family Number']).apply(lambda clustering_df: clustering_df.sort_values(by=['Family Number']))
groups.to_csv("grouped.csv")
I'm using Python 3 and have a Pandas df that looks like
zip
0 07105
1 00000
2 07030
3 07032
4 07032
I would like to add state and city using the Python package uszipcode:
from uszipcode import SearchEngine

search = SearchEngine(simple_zipcode=False)

def zco(x):
    print(search.by_zipcode(x)['City'])

df['City'] = df[['zip']].fillna(0).astype(int).apply(zco)
However, I get the following error
TypeError: 'Zipcode' object is not subscriptable
Can someone help with the error? Thank you in advance.
The call search.by_zipcode(x) returns a Zipcode instance, not a dictionary, so applying ['City'] to that object fails.
Instead, use either the .major_city attribute or its shorter alias, the .city attribute; and you want to return that value, not print it:
def zco(x):
    return search.by_zipcode(x).city
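Then apply it element-wise over the column as a Series, so zco receives one zip code at a time (df[['zip']].apply(zco) would pass a whole column to the function instead):
df['City'] = df['zip'].apply(zco)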
If all you are going to use the uszipcode project for is mapping zip codes to state and city names, you don’t need to use the full database (a 450MB download). Just stick with the ‘simple’ version, which is only 9MB, by leaving out the simple_zipcode=False argument to SearchEngine().
Next, this is going to be really, really slow. .apply() uses a simple loop under the hood, and for each row the .by_zipcode() method queries a SQLite database using SQLAlchemy, creates a single result object with all the columns from the matching row, and returns that object, just so you can get a single attribute from it.
You'd be much better off querying the database directly with the Pandas SQL methods. The uszipcode package is still useful here, as it handles downloading the database for you and creating a SQLAlchemy session; the SearchEngine.ses attribute gives you direct access to that session. From there I'd just do:
import pandas as pd
from uszipcode import SearchEngine, SimpleZipcode

search = SearchEngine()
query = (
    search.ses.query(
        SimpleZipcode.zipcode.label('zip'),
        SimpleZipcode.major_city.label('city'),
        SimpleZipcode.state.label('state'),
    ).filter(
        SimpleZipcode.zipcode.in_(df['zip'].dropna().unique())
    )
).selectable
zipcode_df = pd.read_sql_query(query, search.ses.connection(), index_col='zip')
to create a Pandas DataFrame with all your unique zip codes mapped to city and state columns. You can then join your DataFrame with the zipcode DataFrame:
df = pd.merge(df, zipcode_df, how='left', left_on='zip', right_index=True)
This adds city and state columns to your original DataFrame. If you need to pull in more columns, add them to the search.ses.query(...) portion, using .label() to give them a suitable column name in the output DataFrame (without a .label(), they'll be prefixed with simple_zipcode_ or zipcode_, depending on the class you are using). Pick from the documented model attributes, but take into account that if you need access to the full Zipcode model attributes, you need to use SearchEngine(simple_zipcode=False) to ensure you have the full 450MB dataset at your disposal, and then use Zipcode.<column>.label(...) instead of SimpleZipcode.<column>.label(...) in the query.
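For example, a sketch adding one more labelled column, assuming population is among the documented SimpleZipcode attributes:
query = (
    search.ses.query(
        SimpleZipcode.zipcode.label('zip'),
        SimpleZipcode.major_city.label('city'),
        SimpleZipcode.state.label('state'),
        SimpleZipcode.population.label('population'),
    ).filter(
        SimpleZipcode.zipcode.in_(df['zip'].dropna().unique())
    )
).selectable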
With the zipcodes as the index in the zipcode_df dataframe, that's going to be a lot faster (zippier :-)) than using SQLAlchemy on each row individually.
I'm new to Pandas, so this is a basic question. I created a DataFrame by concatenating two previous DataFrames. I used
todo_pd = pd.concat([rabia_pd, capitan_pd], keys=['Rabia','Capitan'])
thinking that in the future I could separate them easily and save each one to a different location. Right now I'm unable to do this separation using the keys I defined with the concat function.
I've tried simple things like
half_dataframe = todo_pd['Rabia']
but it throws an error saying that there is a problem with the key.
I've also tried other options I've found on SO, like using the _get_values('Rabia') or the .index._get_level_values('Rabia') features, but they all throw different errors, either that a string is not recognized as a way to access the information, or that a positional argument 'level' is required.
The whole DataFrame contains about 22 columns, and I just want to retrieve from the "big" DataFrame the part indexed as 'Rabia' and the part indexed as 'Capitan'.
I'm sure it has a simple solution that I'm not getting for my lack of practice with Pandas.
Thanks a lot,
Use DataFrame.xs; by default it selects on the first level of the MultiIndex, which is where the keys from concat went:
df1 = todo_pd.xs('Rabia')
df2 = todo_pd.xs('Capitan')
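From there each part can be written out on its own, e.g. (the file names are only illustrations):
df1.to_csv('rabia.csv')
df2.to_csv('capitan.csv')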