Avoid having to repeat the same dataframe column names when modifying them - python

I have a dataframe with over 30 columns. I am doing various modifications on specific columns and would like to find a way to avoid having to always list the specific columns. Is there a shortcut?
For example:
matrix_bus_filled.loc[matrix_bus_filled['FNR'] == 'AB1122', ["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]] = matrix_bus_filled[matrix_bus_filled['FNR'] == 'AB1120'][["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]].values
Could I simply define the term "SpecificColumns" once and then paste it here?
matrix_bus_filled.loc[matrix_bus_filled['FNR'] == 'AB1122', ["SpecificColumns"]] = matrix_bus_filled[matrix_bus_filled['Flight Number'] == 'AB1120'][["SpecificColumns"]].values
And here
matrix_bus_filled[["SpecificColumns"]] = matrix_bus_filled[["SpecificColumns"]].apply(scale, axis=1)

Just define a list and use it to select the columns.
specific_columns = ["Ice", "Tartlet", "Pain", "Fruit", "Club", "Focaccia", "SW of Month", "Salad + Dressing", "Planchette + bread", "Muffin"]
matrix_bus_filled[specific_columns] = matrix_bus_filled[specific_columns].apply(scale, axis=1)
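The same list can then be reused in the .loc assignment from your question, for example (a sketch using the column names and filters from your snippet):
# boolean masks for the source and destination rows
src = matrix_bus_filled['FNR'] == 'AB1120'
dst = matrix_bus_filled['FNR'] == 'AB1122'
matrix_bus_filled.loc[dst, specific_columns] = matrix_bus_filled.loc[src, specific_columns].values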

Related

Trying to make a new column from existing columns that have 'int' and 'string'

data = pd.read_csv(r'RE_absentee_one.csv')
data['New_addy'] = str(data['Prop-House Number']) + data['Prop-Street Name'] + data['Prop-Mode'] + str(data['Prop-Apt Unit Number'])
df = pd.DataFrame(data, columns = ['Name','New_addy'])
So this is the code. As you can see, Prop-House Number and Prop-Apt Unit Number are both int and the rest are strings. I am trying to combine all of these so that the full address is under one column labeled 'New_addy'.
Convert each column to a string using map before concatenating, as shown below:
data = pd.read_csv(r'RE_absentee_one.csv')
data['New_addy'] = data['Prop-House Number'].map(str) + data['Prop-Street Name'].map(str) + data['Prop-Mode'].map(str) + data['Prop-Apt Unit Number'].map(str)
#select the desired columns for further work
data = data[['Name','New_addy']]
One way is using a list comprehension:
data['New_addy'] = [str(n) + street + mode + str(apt_n) for n, street, mode, apt_n in zip(
    data['Prop-House Number'], data['Prop-Street Name'], data['Prop-Mode'], data['Prop-Apt Unit Number'])]

Efficient solution to create multiple columns with a formula (pandas/python)

I'm trying to create multiple columns (a couple of hundred) using values within the same df. Is there a more efficient way for me to create multiple columns in batches? Below is an example where I have to manually attach the new column names jwrl2_rank.r1, jwrl2_rank.1r1, jwrl2_rank.2r1, etc. to the formula.
i0, i1, i2 are the original column names
and rn is the value within the column.
i0='jwrl2_rank'
i1='jwrl2_rank.1'
i2='jwrl2_rank.2'
i3='jwrl2_rank.3'
i4='jwrl2_rank.4'
i5='jwrl2_rank.5'
i6='jwrl2_rank.6'
i7='jwrl2_rank.7'
rn=1
df['jwrl2_rank.r1']=((df.loc[(df[i0]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i0]==rn),i0].count()))-1
df['jwrl2_rank.1r1']=((df.loc[(df[i1]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i1]==rn),i1].count()))-1
df['jwrl2_rank.2r1']=((df.loc[(df[i2]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i2]==rn),i2].count()))-1
df['jwrl2_rank.3r1']=((df.loc[(df[i3]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i3]==rn),i3].count()))-1
df['jwrl2_rank.4r1']=((df.loc[(df[i4]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i4]==rn),i4].count()))-1
df['jwrl2_rank.5r1']=((df.loc[(df[i5]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i5]==rn),i5].count()))-1
df['jwrl2_rank.6r1']=((df.loc[(df[i6]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i6]==rn),i6].count()))-1
df['jwrl2_rank.7r1']=((df.loc[(df[i7]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i7]==rn),i7].count()))-1
Many thanks. Regards.
Using a for loop should work.
Incrementing string value
By using string interpolation you could solve your problem. See here for a quick introduction. I am using f-strings in the example below.
base_name = 'jwrl2_rank'
MAX_NUMBER = 3
for i in range(1, MAX_NUMBER + 1):
    new_name = f"{base_name}.{i}"
    print(new_name)
>>>
jwrl2_rank.1
jwrl2_rank.2
jwrl2_rank.3
Example of for loop
base_name = 'jwrl2_rank'
MAX_NUMBER = 3
for i in range(MAX_NUMBER + 1):
    current_iN = f"{base_name}.{i}"
    new_col_name = f"{base_name}.{i}r1"
    if i == 0:  # compensate for the missing zero in the column name
        current_iN = base_name
        new_col_name = f"{base_name}.r1"
    df[new_col_name] = ((df.loc[(df[current_iN] == rn) & (df['result'] == 1), 'timing'].sum()) / (df.loc[(df[current_iN] == rn), current_iN].count())) - 1

Removing doubles while iterating a dataframe

I'm trying to remove doubles from a dataframe.
Basically, the dataframe contains two (or more) occurrences of a document.
The doubles can be found by comparing the description of the document.
In my logic, I find which rows are duplicates, copy their data, and drop them from both the dataframe and the iterated dataframe.
But it appears there are still doubles; I think it is because of the drop, but I don't know how to fix it.
The description is what identifies the duplicates; I need to drop one of the two rows and fuse the rest of their data.
For example:
URL1 + URL2|Explorimmo + Bien_ici|Apartment|Description
Unfortunately, I can't link the dataset.
file = pd.ExcelFile(mc.file_path)
df = pd.read_excel(file)
description_duplicate = df.loc[df.duplicated(['DESCRIPTION']) == True]
for idx1, clean in description_duplicate.iterrows():
    for idx2, dirty in description_duplicate.iterrows():
        if idx1 != idx2:
            if clean['DESCRIPTION'] == dirty['DESCRIPTION']:
                clean['CRAWL_SOURCE'] = clean['CRAWL_SOURCE'] + " / " + dirty['CRAWL_SOURCE']
                clean['URL'] = clean['URL'] + " / " + dirty['URL']
                description_duplicate = description_duplicate.drop(idx2)
                df = df.drop(idx2)
    df[idx1] = clean
You only need to remove duplicates with the pandas.DataFrame.drop_duplicates() function:
df.drop_duplicates(subset='DESCRIPTION', inplace=True)
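If you also need to fuse the CRAWL_SOURCE and URL values of the duplicates before keeping a single row, one possible approach is a groupby aggregation on DESCRIPTION (a sketch, assuming those are the only columns that need merging; other columns could be kept with 'first' in the same dict):
merged = df.groupby('DESCRIPTION', as_index=False).agg({
    'CRAWL_SOURCE': ' / '.join,  # fuse the sources of all rows sharing a description
    'URL': ' / '.join,           # fuse the URLs the same way
})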

How to access a cell in a new dataframe?

I created a sub dataframe (drama_df) based on a criterion in the original dataframe (df). However, I can't access a cell using the typical drama_df['summary'][0]. Instead I get a KeyError: 0. I'm confused, since type(drama_df) is a DataFrame. What do I do? Note that df['summary'][0] does indeed return a string.
drama_df = df[df['drama'] > 0]
#Now we generate a lump of text from the summaries
drama_txt = ""
i = 0
while (i < len(drama_df)):
    drama_txt = drama_txt + " " + drama_df['summary'][i]
    i += 1
This will solve it for you:
drama_df['summary'].iloc[0]
When you created the sub-DataFrame, the filtered rows kept their original index labels, so there is probably no label 0 left. You therefore need iloc to get the element by position rather than by index label (0).
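Alternatively, you could reset the index when building the filtered frame, so positional and label-based access line up again (a small sketch):
drama_df = df[df['drama'] > 0].reset_index(drop=True)
drama_df['summary'][0]  # works again, since the index now starts at 0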
You can also use .iterrows() or .itertuples() to do this routine:
Itertuples is a lot faster, but it is a bit more work to handle if you have a lot of columns
for index, row in drama_df.iterrows():  # iterrows yields (index, Series) pairs
    drama_txt = drama_txt + " " + row['summary']
To go faster:
for index, summary in drama_df[['summary']].itertuples():
    drama_txt = drama_txt + " " + summary
Wait a moment here. You are looking for the str.join() operation.
Simply do this:
drama_txt = ' '.join(drama_df['summary'])
Or:
drama_txt = drama_df['summary'].str.cat(sep=' ')
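One caveat (assuming your summary column might contain missing values): ' '.join will raise a TypeError on NaN, so drop them first:
drama_txt = ' '.join(drama_df['summary'].dropna())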

Pandas - Working on multiple columns seems slow

I have some trouble processing a big csv with Pandas. The csv consists of an index and about 450 other columns in groups of 3, something like this:
cola1 colb1 colc1 cola2 colb2 colc2 cola3 colb3 colc3
1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1
2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2
3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3
For each trio of columns I would like to analyze column B (it's a sort of "CONTROL" field) and, depending on its value, return a value by processing columns A and C.
Finally, I need to return a concatenation of all the resulting columns, from 150 down to 1.
I already tried with apply but it seems too slow (10 min to process 50k rows).
df['Path'] = df.apply(lambda x: getFullPath(x), axis=1)
with an example function you can find here:
https://pastebin.com/S9QWTGGV
I tried extracting a list of the unique combinations of cola, colb, colc, preprocessing that list, and applying map to generate the results, and it speeds things up a little:
for i in range(1, 151):
    df['Concat' + str(i)] = df['cola' + str(i)] + '|' + df['colb' + str(i)] + '|' + df['colc' + str(i)]
concats = []
for i in range(1, 151):
    concats.append('Concat' + str(i))
ret = df[concats].values.ravel()
uniq = list(set(ret))
list = {}
for member in ret:
    list[member] = getPath2(member)
for i in range(1, MAX_COLS + 1):
    df['Res' + str(i)] = df['Concat' + str(i)].map(list)
df['Path'] = df.apply(getFullPath2, axis=1)
The functions getPath2 and getFullPath2 are defined as examples here:
https://pastebin.com/zpFF2wXD
But it still seems a little slow (6 min to process everything).
Do you have any suggestion on how I could speed up csv processing?
I don't even know if the way I am "concatenating" columns could be better :). I tried Series.str.cat but I didn't figure out how to chain only some of the columns rather than the full df.
Thanks very much!
Mic
Amended answer: I see from your criteria that you actually have multiple controls in each column. I think what works is to split these into three dataframes and apply your mapping as follows:
import pandas as pd

series = {
    'cola1': pd.Series(['D_1','C_1','E_1'], index=[1,2,3]),
    'colb1': pd.Series(['ret1','ret1','ret2'], index=[1,2,3]),
    'colc1': pd.Series(['B_1','C_2','B_3'], index=[1,2,3]),
    'cola2': pd.Series(['D_1','C_1','E_1'], index=[1,2,3]),
    'colb2': pd.Series(['ret3','ret1','ret2'], index=[1,2,3]),
    'colc2': pd.Series(['B_2','A_1','A_3'], index=[1,2,3]),
    'cola3': pd.Series(['D_1','C_1','E_1'], index=[1,2,3]),
    'colb3': pd.Series(['ret2','ret2','ret1'], index=[1,2,3]),
    'colc3': pd.Series(['A_1','B_2','C_3'], index=[1,2,3]),
}
your_df = pd.DataFrame(series, index=[1,2,3], columns=['cola1','colb1','colc1','cola2','colb2','colc2','cola3','colb3','colc3'])

# Split your dataframe into three frames, one per column type
bframes = your_df[[col for col in your_df.columns if 'colb' in col]]
aframes = your_df[[col for col in your_df.columns if 'cola' in col]]
cframes = your_df[[col for col in your_df.columns if 'colc' in col]]
for df in [bframes, aframes, cframes]:
    df.columns = ['col1','col2','col3']

# Mapping criteria
def map_colb(c):
    if c == 'ret1':
        return 'A'
    elif c == 'ret2':
        return None
    else:
        return 'F'

def map_cola(a):
    if a.startswith('D_'):
        return 'D'
    else:
        return 'E'

def map_colc(c):
    if c.startswith('B_'):
        return 'B'
    elif c.startswith('C_'):
        return 'C'
    elif c.startswith('A_'):
        return None
    else:
        return 'F'

# Use it on each frame
aframes = aframes.applymap(map_cola)
bframes = bframes.applymap(map_colb)
cframes = cframes.applymap(map_colc)

# The trick here is filling 'None's from left to right in order of precedence
final = bframes.fillna(cframes.fillna(aframes))

# Then just combine them using whatever delimiter you like
# final.values.tolist() turns each row into a list
pathlist = ['|'.join(item) for item in final.values.tolist()]
This gives a result of:
In[70]: pathlist
Out[71]: ['A|F|D', 'A|A|B', 'B|E|A']
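As for the side question about Series.str.cat: you can chain only some columns by passing a sub-DataFrame as others, for example (a sketch reusing the final frame built above; it produces the same strings as pathlist):
subset = ['col1', 'col2', 'col3']  # whichever columns you want to chain
path = final[subset[0]].str.cat(final[subset[1:]], sep='|')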
