For loop error in PySpark - Python

I am facing the following problem:
I have a list that I need to compare against the values of a column (acc_name) in a DataFrame. I am using the following loop, but it only returns 1 record when it should return 30.
Using PySpark:
bs_list = ['AC_E11','AC_E12','AC_E13','AC_E135','AC_E14','AC_E15','AC_E155','AC_E157',
           'AC_E16','AC_E163','AC_E165','AC_E17','AC_E175','AC_E180','AC_E185','AC_E215','AC_E22','AC_E225','AC_E23','AC_E23112','AC_E235','AC_E245','AC_E258','AC_E25','AC_E26','AC_E265','AC_E27','AC_E275','AC_E31','AC_E39','AC_E29']

for i in bs_list:
    bs_acc1 = (acc
        .filter(i == acc.acc_name)
        .select(acc.acc_name, acc.acc_description)
    )
The bs_list elements are a subset of the acc_name column. I am trying to create a new DataFrame with two columns, acc_name and acc_description, containing only the rows whose acc_name appears in bs_list.
Please let me know where I am going wrong.

That's because each time through the loop the filter on i creates a brand-new DataFrame and assigns it to bs_acc1, overwriting the previous one. So you only see the single row for the last value in bs_list, i.e. the row for 'AC_E29'.
One way to fix it is to union each result with the accumulated DataFrame, so the previous results are kept:
# create an empty DataFrame with a schema appropriate to your data
bs_acc1 = sqlContext.createDataFrame(sc.emptyRDD(), schema)

for i in bs_list:
    bs_acc1 = bs_acc1.union(
        acc
        .filter(i == acc.acc_name)
        .select(acc.acc_name, acc.acc_description)
    )
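If you prefer not to seed an empty DataFrame, a minimal sketch of the same idea (assuming acc and bs_list are defined as above) is to build the filtered DataFrames first and fold them together with functools.reduce:
from functools import reduce

# one filtered DataFrame per value in bs_list
frames = [
    acc.filter(acc.acc_name == i).select(acc.acc_name, acc.acc_description)
    for i in bs_list
]
# union them all into a single DataFrame
bs_acc1 = reduce(lambda left, right: left.union(right), frames)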
The better way is to not loop at all:
bs_acc1 = acc.where(acc.acc_name.isin(bs_list))
(No extra import is needed here; isin is a method on the Column itself.)

You can also transform bs_list into a DataFrame with a single acc_name column and then just join it to the acc DataFrame:
from pyspark.sql import Row

bs_rdd = spark.sparkContext.parallelize(bs_list)
bs_df = bs_rdd.map(lambda x: Row(acc_name=x)).toDF()
bs_join_df = bs_df.join(acc, on='acc_name')
bs_join_df.show()
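A slightly shorter sketch, assuming you have a SparkSession named spark, builds the single-column DataFrame directly from the Python list without going through an RDD:
# create the lookup DataFrame straight from the list
bs_df = spark.createDataFrame([(x,) for x in bs_list], ['acc_name'])
bs_join_df = bs_df.join(acc, on='acc_name')
bs_join_df.show()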

Related

How to rename a column while merging in pandas

I am using a for loop to merge many different dataframes. Each dataframe contains values from a specific time period, so the column in each df is named "balance". In order to avoid creating multiple balance_x, balance_y, ... columns, I want to name each merged column after its df.
So far, I have the following:
top = topaccount_2021_12
top = top.rename(columns={"balance": "topaccount_2021_12"})
for i in [topaccount_2021_09, topaccount_2021_06, topaccount_2021_03,
topaccount_2020_12, topaccount_2020_09, topaccount_2020_06, topaccount_2020_03,
topaccount_2019_12, topaccount_2019_09, topaccount_2019_06, topaccount_2019_03,
topaccount_2018_12, topaccount_2018_09, topaccount_2018_06, topaccount_2018_03,
topaccount_2017_12, topaccount_2017_09, topaccount_2017_06, topaccount_2017_03,
topaccount_2016_12, topaccount_2016_09, topaccount_2016_06, topaccount_2016_03,
topaccount_2015_12, topaccount_2015_09]:
    top = top.merge(i, on='address', how='left')
    top = top.rename(columns={'balance': i})
But I get the error message:
TypeError: Cannot convert bool to numpy.ndarray
Any idea how to solve this? Thanks!
I assume topaccount_* is a dataframe. I'm a bit confused by top = top.rename(columns={'balance': i}): what are you trying to achieve here? rename takes a mapping from the original column name to the new column name, both strings, but instead of a string you are passing a whole dataframe as the new name.
Edit
# store the dataframes in a dictionary, keyed by the name you want the column to get
dictOfDf = {
    'topaccount_2021_09': topaccount_2021_09,
    'topaccount_2021_06': topaccount_2021_06,
    ...
    'topaccount_2015_09': topaccount_2015_09,
}

# pick the first entry to start the merged dataframe
keys = list(dictOfDf.keys())
top = dictOfDf[keys[0]]
top = top.rename(columns={"balance": keys[0]})

# iterate through the remaining keys
for i in keys[1:]:
    top = top.merge(dictOfDf[i], on='address', how='left')
    top = top.rename(columns={'balance': i})
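A sketch of an equivalent, loop-free variant (assuming the same dictOfDf dictionary and that every frame has 'address' and 'balance' columns) renames each frame up front and folds the merges with functools.reduce:
from functools import reduce

# rename each frame's 'balance' column to that frame's name
frames = [df.rename(columns={'balance': name}) for name, df in dictOfDf.items()]
# merge them all on 'address', left-joining one after another
top = reduce(lambda left, right: left.merge(right, on='address', how='left'), frames)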

Pandas Apply function referencing column name

I'm trying to create a new column that contains all of the assortments (Asst 1 - 50) that a SKU may belong to. A SKU belongs to an assortment if it is indicated by an "x" in the corresponding column.
The script will need to be able to iterate over the rows in the SKU column and check for that 'x' in any of the ASST columns. If it finds one, copy the name of that assortment column into the newly created "all assortments" column.
I have been attempting this using the df.apply method but I cannot seem to get it right.
def assortment_crunch(row):
    if row == 'x':
        ...

df['Asst #1'].apply(assortment_crunch)
My attempt doesn't really account for the need to iterate over all of the "Asst" columns, or for how to assign the result to the newly created column.
Here's a super fast ("vectorized") one-liner:
asst_cols = df.filter(like='Asst #')
df['All Assortment'] = [', '.join(asst_cols.columns[mask]) for mask in asst_cols.eq('x').to_numpy()]
Explanation:
df.filter(like='Asst #') - returns all the columns that contain Asst # in their name
.eq('x') - exactly the same as == 'x', it's just easier for chaining functions like this because of the parentheses mess that would occur otherwise
.to_numpy() - converts the mask dataframe into a NumPy array, so iterating over it yields one Boolean mask per row
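A minimal, self-contained example of the one-liner above, using made-up data (the SKUs and the 'Asst #1'..'Asst #3' columns are just for illustration):
import pandas as pd

df = pd.DataFrame({
    'SKU': ['A', 'B', 'C'],
    'Asst #1': ['x', '', 'x'],
    'Asst #2': ['', 'x', 'x'],
    'Asst #3': ['', '', ''],
})

asst_cols = df.filter(like='Asst #')
df['All Assortment'] = [', '.join(asst_cols.columns[mask]) for mask in asst_cols.eq('x').to_numpy()]
print(df['All Assortment'].tolist())  # ['Asst #1', 'Asst #2', 'Asst #1, Asst #2']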
I'm not sure if this is the most efficient way, but you can try this.
Instead of applying to the column, apply to the whole DF to get access to the row. Then you can iterate through each column and build up the value for the final column:
def make_all_assortments_cell(row):
    assortments_in_row = []
    for i in range(1, 51):
        column_name = f'Asst #{i}'
        if row[column_name] == 'x':
            assortments_in_row.append(column_name)
    return ", ".join(assortments_in_row)

df["All Assortments"] = df.apply(make_all_assortments_cell, axis=1)
I think this will work though I haven't tested it.

Creating a new column in Dataframe based on multiple lists

I'm trying to create a new column 'BroadCategory' within a dataframe based on whether values within another column called 'Venue Category' within the data occur in specific lists. I have 5 lists that I am using to fill in the values in the new column
For example:
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Bar),'Bar','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Museum_ArtGallery),'Museum/Art Gallery','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Public_Transport),'Public Transport','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Restaurant_FoodVenue),'Restaurant/Food Venue','Other')
I ultimately want the values in VenueCategory column occurring in the list Bar to be labeled 'Bar' and those occurring in the list Museum_ArtGallery to be labeled 'Museum_ArtGallery', etc. My code above doesn't accomplish this.
I tried this in order to keep the values I had previously filled but it's still overwriting the values I had filled in based on my previous conditions:
df['BroadCategory'] = np.where(df[df.VenueCategory!='Other'].isin(Entertainment_Venue),'Entertainment Venue','Other')
How can I fill the column BroadCategory with the specific values based on whether the values in the VenueCategory column occur in the specified lists Bar, Restaurant_FoodVenue, Public_Transport, Museum_ArtGallery, etc.?
Suppose your data is like this:
df=pd.DataFrame({'VenueCategory':['drink','wine','MOMA','MTA','sushi','Hudson']})
Bar=['drink','wine','alcohol']
Museum_ArtGallery=['MOMA','MCM']
Public_Transport=['MTA','MBTA']
Restaurant_FoodVenue=['sushi','chicken']
prepare a dictionary:
from collections import defaultdict
d=defaultdict(lambda:'other')
d.update({x:'Bar' for x in Bar})
d.update({x:'Museum_ArtGallery' for x in Museum_ArtGallery})
d.update({x:'Public_Transport' for x in Public_Transport})
d.update({x:'Restaurant_FoodVenue' for x in Restaurant_FoodVenue})
build new column and print result:
df['BroadCategory']=df['VenueCategory'].apply(lambda x:d[x])
df
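With the sample data above, the result should look something like this:
  VenueCategory         BroadCategory
0         drink                   Bar
1          wine                   Bar
2          MOMA     Museum_ArtGallery
3           MTA      Public_Transport
4         sushi  Restaurant_FoodVenue
5        Hudson                 other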
Another option is to build a lookup table that maps every venue to its broad category and then left-join it onto your data:
venue_list = [['Bar', Bar],
              ['Museum_ArtGallery', Museum_ArtGallery],
              # etc.
             ]
venue_lookup = pd.concat([
    pd.DataFrame({
        'BroadCategory': venue[0],
        'VenueCategory': venue[1]}) for venue in venue_list]
)
pd.merge(df, venue_lookup, how='left', on='VenueCategory')
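Venues that don't appear in any of the lists will come out of the left join as NaN; if you want them labeled 'Other' instead, a small follow-up (assuming the merge above) is:
result = pd.merge(df, venue_lookup, how='left', on='VenueCategory')
result['BroadCategory'] = result['BroadCategory'].fillna('Other')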
Your solution is already close. In order not to overwrite previously assigned values, you should select only the relevant subset of rows and set new values on that subset.
To do that, first initialize the new column BroadCategory to 'Other'. Then, for each category, assign to the rows selected with a Boolean mask built from the .isin() check you are already using (assigning through .loc avoids chained-assignment warnings):
df['BroadCategory'] = 'Other'
df.loc[df['VenueCategory'].isin(Bar), 'BroadCategory'] = 'Bar'
df.loc[df['VenueCategory'].isin(Museum_ArtGallery), 'BroadCategory'] = 'Museum/Art Gallery'
df.loc[df['VenueCategory'].isin(Public_Transport), 'BroadCategory'] = 'Public Transport'
df.loc[df['VenueCategory'].isin(Restaurant_FoodVenue), 'BroadCategory'] = 'Restaurant/Food Venue'
df.loc[df['VenueCategory'].isin(Entertainment_Venue), 'BroadCategory'] = 'Entertainment Venue'
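As a compact alternative (a sketch using numpy's np.select rather than the approach above), all the conditions can be expressed at once:
import numpy as np

conditions = [
    df['VenueCategory'].isin(Bar),
    df['VenueCategory'].isin(Museum_ArtGallery),
    df['VenueCategory'].isin(Public_Transport),
    df['VenueCategory'].isin(Restaurant_FoodVenue),
]
choices = ['Bar', 'Museum/Art Gallery', 'Public Transport', 'Restaurant/Food Venue']
# rows matching none of the conditions fall back to 'Other'
df['BroadCategory'] = np.select(conditions, choices, default='Other')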

How to drop a series of rows from dataframe in a faster way

I have a data set and I want to drop some rows in a faster way; I tried the following code but it took a long time.
I want to drop every user who makes fewer than 3 operations.
Every operation is stored in its own row, and user_id is not the index of my data.
undesirable_users = []
for i in range(len(operations_per_user)):
    if operations_per_user.get_value(operations_per_user.index[i]) <= 3:
        undesirable_users.append(operations_per_user.index[i])

for i in range(len(undesirable_users)):
    data = data.drop(data[data.user_id == undesirable_users[i]].index)
data is a dataframe and operation_per_user is a series created by: operation_per_user = data['user_id'].value_counts().
Why not just filter them? You don't need to loop at all.
You can get the indexes to drop with:
operations_per_user.index[operations_per_user <= 3]
And then you can filter those user ids out of the df, making the solution:
data = data[~data['user_id'].isin(operations_per_user.index[operations_per_user <= 3])]
EDIT
My understanding is that you want to remove any user that occurs less than 3 times in the data. You won't need to create a value_counts list for that, you could do a groupby and find the counts and then filter on that basis.
filtered_user_ids = data.groupby('user_id').filter(lambda x: len(x) <= 3)['user_id'].tolist()
data = data[~data['user_id'].isin(filtered_user_ids)]
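A more direct variant of the same groupby idea (a sketch, assuming data has a user_id column) keeps the qualifying rows in a single step with transform:
# keep only rows whose user_id occurs more than 3 times
data = data[data.groupby('user_id')['user_id'].transform('size') > 3]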
If data is a pandas DataFrame, and it contains both user_id and operations_per_user as columns, you should perform the drop with:
data = data.drop(data.loc[data['operations_per_user'] <= 3].index)
Edit
Instead of creating a separate series, you could add an operations_per_user column to data (mapping each user_id to its count so the values align with data's index):
data['operations_per_user'] = data['user_id'].map(data['user_id'].value_counts())
You could either perform the drop as above or perform the selection with the inverse logical condition:
data = data.loc[data['operations_per_user'] > 3]
Original
It would be preferable if you could supply some more information about the variables used in your code.
If operations_per_user is a pandas Series, your first loop could be improved with:
undesirable_users = []
for i in operations_per_user.index:
    if operations_per_user.loc[i] <= 3:
        undesirable_users.append(i)
The function get_value() is deprecated, use loc or iloc instead. This is a good summary of loc and iloc, and here is a great pandas cheatsheet to reference.
You can use python lists as iterators; for your second loop:
for user in undesirable_users:
    data = data.drop(data.loc[data['user_id'] == user].index)
Rather than dropping, you can simply select the rows you want to keep by inverting the logical condition.
First, select the user to keep only.
Then get a boolean list, length equal to data rows.
Finally, select the rows to keep.
keepusers = operation_per_user.loc[operation_per_user > 3]
tokeep = [uid in keepusers for uid in data['user_id']]
newdata = data.loc[tokeep]
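The list comprehension can be slow on large frames; since keepusers is indexed by user_id, a vectorized equivalent (a sketch under the same assumptions) is:
newdata = data[data['user_id'].isin(keepusers.index)]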

Drop Pandas DataFrame rows according to a GroupBy property

I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping in [my_df1,my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        df = Sample[0].loc[Sample[0]['Group'] == n]  # This is inelegant, but trying to work
                                                     # with Sample[1] in the for doesn't work
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor does it modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion from the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Values' in your input data meet the specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df1_Group = pd.DataFrame([[1,57],[2,63]], columns=['Group','Group_Value'])

grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
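If the goal is then to discard the offending groups, a small follow-up (assuming the frames above) is:
# keep only the groups flagged as good, and drop the helper column
my_df1_Group = my_df1_Group[my_df1_Group['Bad_Row'] == 'Good Row'].drop(columns='Bad_Row')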
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
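A loop body without the row-wise apply is also possible; a sketch of the same logic (assuming SampleList is built as above) maps each group's maximum Value onto the aggregated frame and drops rows above the threshold in place:
my_max = 16
for Sample in SampleList:
    # maximum 'Value' per group in the element-level frame
    max_per_group = Sample[0].groupby('Group')['Value'].max()
    # drop aggregated rows whose group contains any Value above the threshold
    bad = Sample[1]['Group'].map(max_per_group) > my_max
    Sample[1].drop(Sample[1][bad].index, inplace=True)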
