How to rename a column while merging in pandas - python

I am using a for loop to merge many different dataframes. Each dataframe contains values from a specific time period, so the relevant column in each df is named "balance". To avoid ending up with multiple balance_x, balance_y, ... columns, I want to name each merged column after the df it came from.
So far, I have the following:
top = topaccount_2021_12
top = top.rename(columns={"balance": "topaccount_2021_12"})

for i in [topaccount_2021_09, topaccount_2021_06, topaccount_2021_03,
          topaccount_2020_12, topaccount_2020_09, topaccount_2020_06, topaccount_2020_03,
          topaccount_2019_12, topaccount_2019_09, topaccount_2019_06, topaccount_2019_03,
          topaccount_2018_12, topaccount_2018_09, topaccount_2018_06, topaccount_2018_03,
          topaccount_2017_12, topaccount_2017_09, topaccount_2017_06, topaccount_2017_03,
          topaccount_2016_12, topaccount_2016_09, topaccount_2016_06, topaccount_2016_03,
          topaccount_2015_12, topaccount_2015_09]:
    top = top.merge(i, on='address', how='left')
    top = top.rename(columns={'balance': i})
But I get the error message:
TypeError: Cannot convert bool to numpy.ndarray
Any idea how to solve this? Thanks!

I assume topaccount_* is a dataframe. I'm a bit confused by top = top.rename(columns={'balance': i}): what do you want to achieve here? rename expects a mapping whose key is the original column name and whose value is the new column name, both strings, but instead of a string you are passing a dataframe as the new name.
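For example, a plain rename just maps one column-name string to another (toy data here purely for illustration):

import pandas as pd

df = pd.DataFrame({"address": ["a", "b"], "balance": [1, 2]})
# both the key and the value in the mapping must be strings (column names)
df = df.rename(columns={"balance": "topaccount_2021_12"})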
Edit
# store the dataframes in a dictionary, keyed by the name you want for its balance column
dictOfDf = {
    'topaccount_2021_09': topaccount_2021_09,
    'topaccount_2021_06': topaccount_2021_06,
    ...
    'topaccount_2015_09': topaccount_2015_09,
}

# pick the first key to initialise the result dataframe
keys = list(dictOfDf.keys())
top = dictOfDf[keys[0]]
top = top.rename(columns={"balance": keys[0]})

# iterate through the remaining keys, merging each dataframe and renaming its balance column
for i in keys[1:]:
    top = top.merge(dictOfDf[i], on='address', how='left')
    top = top.rename(columns={'balance': i})
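If you prefer to avoid the explicit loop, here is a minimal equivalent sketch, assuming every dataframe (including topaccount_2021_12) is placed in dictOfDf and has exactly the columns address and balance: rename first, then let functools.reduce chain the merges.

from functools import reduce

# rename each balance column after its dictionary key, then merge everything on address
renamed = [df.rename(columns={'balance': name}) for name, df in dictOfDf.items()]
top = reduce(lambda left, right: left.merge(right, on='address', how='left'), renamed)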

Related

Pandas Apply function referencing column name

I'm trying to create a new column that contains all of the assortments (Asst 1 - 50) that a SKU may belong to. A SKU belongs to an assortment if it is indicated by an "x" in the corresponding column.
The script will need to be able to iterate over the rows in the SKU column and check for that 'x' in any of the ASST columns. If it finds one, copy the name of that assortment column into the newly created "all assortments" column.
I have been attempting this using the df.apply method but I cannot seem to get it right.
def assortment_crunch(row):
    if row == 'x':
        ...

df['Asst #1'].apply(assortment_crunch)
My attempt doesn't really account for the need to iterate over all of the "Asst" columns, or for how to assign that column's name to the newly created one.
Here's a super fast ("vectorized") one-liner:
asst_cols = df.filter(like='Asst #')
df['All Assortment'] = [', '.join(asst_cols.columns[mask]) for mask in asst_cols.eq('x').to_numpy()]
Explanation:
df.filter(like='Asst #') - returns all the columns that contain Asst # in their name
.eq('x') - exactly the same as == 'x', it's just easier for chaining functions like this because of the parentheses mess that would occur otherwise
.to_numpy() - converts the boolean mask dataframe into a NumPy array, giving one boolean mask per original row to iterate over
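A minimal sketch on made-up data (the column names and values here are just for illustration):

import pandas as pd

df = pd.DataFrame({
    'SKU': ['A', 'B'],
    'Asst #1': ['x', ''],
    'Asst #2': ['x', 'x'],
})

asst_cols = df.filter(like='Asst #')
df['All Assortment'] = [', '.join(asst_cols.columns[mask]) for mask in asst_cols.eq('x').to_numpy()]
print(df['All Assortment'].tolist())  # ['Asst #1, Asst #2', 'Asst #2']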
I'm not sure if this is the most efficient way, but you can try this.
Instead of applying to the column, apply to the whole DF to get access to the row. Then you can iterate through each column and build up the value for the final column:
def make_all_assortments_cell(row):
    assortments_in_row = []
    for i in range(1, 51):
        column_name = f'Asst #{i}'
        if row[column_name] == 'x':
            # record the assortment column's name, not the 'x' marker itself
            assortments_in_row.append(column_name)
    return ", ".join(assortments_in_row)

# axis=1 passes each row (a Series) to the function instead of each column
df["All Assortments"] = df.apply(make_all_assortments_cell, axis=1)
I think this will work though I haven't tested it.

Creating a new column in Dataframe based on multiple lists

I'm trying to create a new column 'BroadCategory' in a dataframe, based on whether the values of another column, 'VenueCategory', occur in specific lists. I have 5 lists that I am using to fill in the values in the new column.
For example:
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Bar),'Bar','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Museum_ArtGallery),'Museum/Art Gallery','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Public_Transport),'Public Transport','Other')
df['BroadCategory'] = np.where(df['VenueCategory'].isin(Restaurant_FoodVenue),'Restaurant/Food Venue','Other')
I ultimately want the values in VenueCategory column occurring in the list Bar to be labeled 'Bar' and those occurring in the list Museum_ArtGallery to be labeled 'Museum_ArtGallery', etc. My code above doesn't accomplish this.
I tried this in order to keep the values I had previously filled but it's still overwriting the values I had filled in based on my previous conditions:
df['BroadCategory'] = np.where(df[df.VenueCategory!='Other'].isin(Entertainment_Venue),'Entertainment Venue','Other')
How can I fill the column BroadCategory with the specific values based on whether the values in the VenueCategory column occur in the specified lists Bar, Restaurant_FoodVenue, Public_Transport, Museum_ArtGallery, etc.?
Suppose your data is like this:
df=pd.DataFrame({'VenueCategory':['drink','wine','MOMA','MTA','sushi','Hudson']})
Bar=['drink','wine','alcohol']
Museum_ArtGallery=['MOMA','MCM']
Public_Transport=['MTA','MBTA']
Restaurant_FoodVenue=['sushi','chicken']
prepare a dictionary:
from collections import defaultdict
d=defaultdict(lambda:'other')
d.update({x:'Bar' for x in Bar})
d.update({x:'Museum_ArtGallery' for x in Museum_ArtGallery})
d.update({x:'Public_Transport' for x in Public_Transport})
d.update({x:'Restaurant_FoodVenue' for x in Restaurant_FoodVenue})
build new column and print result:
df['BroadCategory']=df['VenueCategory'].apply(lambda x:d[x])
df
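With the sample data above, the result should look roughly like this:

  VenueCategory         BroadCategory
0         drink                   Bar
1          wine                   Bar
2          MOMA     Museum_ArtGallery
3           MTA      Public_Transport
4         sushi  Restaurant_FoodVenue
5        Hudson                 other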
Another option is to build a lookup table that maps each venue to its broad category, then left-join it onto your dataframe:

venue_list = [['Bar', Bar],
              ['Museum_ArtGallery', Museum_ArtGallery],
              # etc
              ]

venue_lookup = pd.concat([
    pd.DataFrame({
        'BroadCategory': venue[0],
        'VenueCategory': venue[1]}) for venue in venue_list]
)

pd.merge(df, venue_lookup, how='left', on='VenueCategory')
Your solution is already close. In order not to overwrite previously set values, you should select a subset of the rows and only set new values on that subset.
To do that, first initialize the new column BroadCategory to 'Other'. Then select the rows of each category by subscripting the new column with a Boolean mask built with the .isin() function, like you are using now. The code looks like this:
df['BroadCategory'] = 'Other'
df['BroadCategory'][df['VenueCategory'].isin(Bar)] = 'Bar'
df['BroadCategory'][df['VenueCategory'].isin(Museum_ArtGallery)] = 'Museum/Art Gallery'
df['BroadCategory'][df['VenueCategory'].isin(Public_Transport)] = 'Public Transport'
df['BroadCategory'][df['VenueCategory'].isin(Restaurant_FoodVenue)] = 'Restaurant/Food Venue'
df['BroadCategory'][df['VenueCategory'].isin(Entertainment_Venue)] = 'Entertainment Venue'
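Note that chained indexing like df['BroadCategory'][mask] = ... can raise SettingWithCopyWarning and may not assign at all on newer pandas versions; the same logic written with .loc is safer:

df['BroadCategory'] = 'Other'
df.loc[df['VenueCategory'].isin(Bar), 'BroadCategory'] = 'Bar'
df.loc[df['VenueCategory'].isin(Museum_ArtGallery), 'BroadCategory'] = 'Museum/Art Gallery'
df.loc[df['VenueCategory'].isin(Public_Transport), 'BroadCategory'] = 'Public Transport'
df.loc[df['VenueCategory'].isin(Restaurant_FoodVenue), 'BroadCategory'] = 'Restaurant/Food Venue'
df.loc[df['VenueCategory'].isin(Entertainment_Venue), 'BroadCategory'] = 'Entertainment Venue'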

Can't Create pandas DataFrame in python (Wrong Shape)

I'm trying to create the following data frame
new_df = pd.DataFrame(data=percentage_default,
                      columns=df['purpose'].unique())
The variables I'm using are as follows
percentage_default = [0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748]
df['purpose'].unique = array(['debt_consolidation', 'credit_card', 'all_other',
'home_improvement', 'small_business', 'major_purchase',
'educational'], dtype=object)
When I try to create this data frame I get the following error:
Shape of passed values is (1, 7), indices imply (7, 7)
To me it seemed like the shape of the values and indices were the same. Could someone explain what I'm missing here?
Thanks!
You're creating a dataframe from a list. Calling pd.DataFrame(your_list) where your_list is a simple homogeneous (flat) list will create a single-column dataframe with one row for every element in that list. For your input:
percentage_default = [0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748]
pandas will create a dataframe like this:
Column
0.15238817285822592
0.11568938193343899
0.16602316602316602
0.17011128775834658
0.2778675282714055
0.11212814645308924
0.20116618075801748
Because of this, your dataframe only has one column. You're trying to pass multiple column names, which is confusing pandas.
If you wish to create a dataframe from a list with multiple columns, you need to nest more lists or tuples inside your original list. Each nested tuple/list will become a row in the dataframe, and each element in the nested tuple/list will become a new column. See this:
percentage_default = [(0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748)] # nested tuple
We have one nested tuple in this list, so our dataframe will have 1 row with n columns, where n is the number of elements in the nested tuple (7). We can then pass your 7 column names:
percentage_default = [(0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748)]
col_names = ['debt_consolidation', 'credit_card', 'all_other',
'home_improvement', 'small_business', 'major_purchase',
'educational']
new_df = pd.DataFrame(percentage_default, columns = col_names)
print(new_df)
debt_consolidation credit_card all_other home_improvement \
0 0.152388 0.115689 0.166023 0.170111
small_business major_purchase educational
0 0.277868 0.112128 0.201166
Try rewriting your data in the following way:
percentage_default = {
'debt_consolidation': 0.15238817285822592,
'credit_card': 0.11568938193343899,
...
}
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
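Note that a dict of scalars cannot be passed to pd.DataFrame on its own; you either supply an index or wrap the dict in a list. A minimal sketch, assuming the dict above is filled in:

new_df = pd.DataFrame(percentage_default, index=[0])   # dict of scalars needs an index
# or wrap the dict in a list so the whole dict becomes one row
new_df = pd.DataFrame([percentage_default])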

For Looping error in pyspark

I am facing the following problem:
I have a list which I need to compare with the elements of a column in a dataframe (acc_name). I am using the following looping function, but it only returns 1 record when it should return 30.
Using Pyspark
bs_list =
['AC_E11','AC_E12','AC_E13','AC_E135','AC_E14','AC_E15','AC_E155','AC_E157',
'AC_E16','AC_E163','AC_E165','AC_E17','AC_E175','AC_E180','AC_E185', 'AC_E215','AC_E22','AC_E225','AC_E23','AC_E23112','AC_E235','AC_E245','AC_E258','AC_E25','AC_E26','AC_E265','AC_E27','AC_E275','AC_E31','AC_E39','AC_E29']
for i in bs_list:
    bs_acc1 = (acc
               .filter(i == acc.acc_name)
               .select(acc.acc_name, acc.acc_description)
               )
The bs_list elements are a subset of the acc_name column. I am trying to create a new DF with the two columns acc_name and acc_description, containing only the rows whose acc_name appears in bs_list.
Please let me know where I am going wrong.
That's because every time you filter on i inside the loop, you create a brand-new dataframe and assign it to bs_acc1, discarding the previous one. So it ends up showing only the single row belonging to the last value in bs_list, i.e. the row for 'AC_E29'.
One way to fix it is to repeatedly union the dataframe with itself, so previous results also remain in the dataframe:
# create an empty dataframe; give a schema that matches your data
bs_acc1 = sqlContext.createDataFrame(sc.emptyRDD(), schema)

for i in bs_list:
    bs_acc1 = bs_acc1.union(
        acc
        .filter(i == acc.acc_name)
        .select(acc.acc_name, acc.acc_description)
    )
A better way is to not loop at all:
from pyspark.sql.functions import *
bs_acc1 = acc.where(acc.acc_name.isin(bs_list))
You can also transform bs_list into a dataframe with a column acc_name and then just join it to the acc dataframe.
from pyspark.sql import Row

bs_rdd = spark.sparkContext.parallelize(bs_list)
bs_df = bs_rdd.map(lambda x: Row(**{'acc_name': x})).toDF()
bs_join_df = bs_df.join(acc, on='acc_name')
bs_join_df.show()
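A sketch of the same join without going through an RDD (assuming a SparkSession named spark is available):

bs_df = spark.createDataFrame([(x,) for x in bs_list], ['acc_name'])
bs_join_df = bs_df.join(acc, on='acc_name')
bs_join_df.show()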

Speed up vlookup like operation using pandas in python

I have written some code to essentially do an Excel-style vlookup on two pandas dataframes, and I want to speed it up.
The structure of the data frames is as follows:
dbase1_df.columns:
'VALUE', 'COUNT', 'GRID', 'SGO10GEO'
merged_df.columns:
'GRID', 'ST0', 'ST1', 'ST2', 'ST3', 'ST4', 'ST5', 'ST6', 'ST7', 'ST8', 'ST9', 'ST10'
sgo_df.columns:
'mkey', 'type'
To combine them, I do the following:
1. For each row in dbase1_df, find the row where its 'SGO10GEO' value matches the 'mkey' value of sgo_df. Obtain the 'type' from that row in sgo_df.
2. 'type' contains an integer ranging from 0 to 10. Create a column name by appending 'ST' to type.
3. Find the value in merged_df where its 'GRID' value matches the 'GRID' value in dbase1_df and the column name is the one we obtained in step 2. Output this value into a csv file.
# Read in dbase1 dbf into data frame
dbase1_df = pandas.DataFrame.from_csv(dbase1_file, index_col=False)
merged_df = pandas.DataFrame.from_csv('merged.csv', index_col=False)

lup_out.writerow(["VALUE", "TYPE", EXTRACT_VAR.upper()])

# For each unique value in dbase1 data frame:
for index, row in dbase1_df.iterrows():
    # 1. Find the soil type corresponding to the mkey
    tmp = sgo_df.type.values[sgo_df['mkey'] == int(row['SGO10GEO'])]
    if tmp.size > 0:
        s_type = 'ST' + tmp[0]
        val = int(row['VALUE'])
        # 2. Obtain hmu value
        tmp_val = merged_df[s_type].values[merged_df['GRID'] == int(row['GRID'])]
        if tmp_val.size > 0:
            hmu_val = tmp_val[0]
            # 3. Output into data frame: VALUE, hmu value
            lup_out.writerow([val, s_type, hmu_val])
        else:
            err_out.writerow([merged_df['GRID'], type, row['GRID']])
Is there anything here that might be a speed bottleneck? Currently it takes me around 20 minutes for around ~500,000 rows in dbase1_df; ~1,000 rows in merged_df and ~500,000 rows in sgo_df.
thanks!
You need to use the merge operation in pandas to get better performance. I'm not able to test the code below since I don't have the data, but at minimum it should help you get the idea:
import pandas as pd
dbase1_df = pd.DataFrame.from_csv('dbase1_file.csv',index_col=False)
sgo_df = pd.DataFrame.from_csv('sgo_df.csv',index_col=False)
merged_df = pd.DataFrame.from_csv('merged_df.csv',index_col=False)
# you need to use the same column names for common columns to be able to do the merge operation in pandas, so we change the column name to mkey
dbase1_df.columns = [u'VALUE', u'COUNT', u'GRID', u'mkey']
#Below operation merges the two dataframes
Step1_Merge = pd.merge(dbase1_df,sgo_df)
#We need to add a new column to concatenate ST and type
Step1_Merge['type_2'] = Step1_Merge['type'].map(lambda x: 'ST'+str(x))
# We need to change the shape of merged_df and move columns to rows to be able to do another merge operation
id = merged_df.ix[:,['GRID']]
a = pd.merge(merged_df.stack(0).reset_index(1), id, left_index=True, right_index=True)
# We also need to change the automatically generated name to type_2 to be able to do the next merge operation
a.columns = [u'type_2', 0, u'GRID']
result = pd.merge(Step1_Merge,a,on=[u'type_2',u'GRID'])
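To reproduce the three fields the original loop wrote out (VALUE, soil type, hmu value), a rough final step might look like the sketch below; the looked-up value sits in the column named 0 that stack() produced, and the HMU_VALUE column name and 'lookup_out.csv' filename are just placeholders:

result = result.rename(columns={0: 'HMU_VALUE'})
result[['VALUE', 'type_2', 'HMU_VALUE']].to_csv('lookup_out.csv', index=False)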
