Pandas - Working on multiple columns seems slow - python

I have some trouble processing a big csv with Pandas. Csv consists of an index and about other 450 columns in groups of 3, something like this:
cola1 colb1 colc1 cola2 colb2 colc2 cola3 colb3 colc3
1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1
2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2
3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3
For each trio of columns I would like to "analyze B column (it's a sort of "CONTROL field" and depending on its value I should then return a value by processing col A and C.
Finally I need to return a concatenation of all resulting columns starting from 150 to 1.
I already tried with apply but it seems too slow (10 min to process 50k rows).
df['Path'] = df.apply(lambda x: getFullPath(x), axis=1)
with an example function you can find here:
https://pastebin.com/S9QWTGGV
I tried extracting a list of unique combinations of cola,colb,colc - preprocessing the list - and applying map to generate results and it speeds up a little:
for i in range(1,151):
df['Concat' + str(i)] = df['cola' + str(i)] + '|' + df['colb' + str(i)] + '|' + df['colc' + str(i)]
concats = []
for i in range(1,151):
concats.append('Concat' + str(i))
ret = df[concats].values.ravel()
uniq = list(set(ret))
list = {}
for member in ret:
list[member] = getPath2(member)
for i in range(1,MAX_COLS + 1):
df['Res' + str(i)] = df['Concat' + str(i)].map(list)
df['Path'] = df.apply(getFullPath2,axis=1)
function getPath and getFullPath2 are defined as example here:
https://pastebin.com/zpFF2wXD
But it seems still a little bit slow (6 min for processing everything)
Do you have any suggestion on how I could speed up csv processing?
I don't even know if the way I using to "concatenate" columns could be better :), tried with Series.cat but I didn't get how to chain only some columns and not the full df
Thanks very much!
Mic

Amended answer: I see from your criteria, you actually have multiple controls on each column. I think what works is to split these into 3 dataframes, applying your mapping as follows:
import pandas as pd
series = {
'cola1': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb1': pd.Series(['ret1','ret1','ret2'],index=[1,2,3]),
'colc1': pd.Series(['B_1','C_2','B_3'],index=[1,2,3]),
'cola2': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb2': pd.Series(['ret3','ret1','ret2'],index=[1,2,3]),
'colc2': pd.Series(['B_2','A_1','A_3'],index=[1,2,3]),
'cola3': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb3': pd.Series(['ret2','ret2','ret1'],index=[1,2,3]),
'colc3': pd.Series(['A_1','B_2','C_3'],index=[1,2,3]),
}
your_df = pd.DataFrame(series, index=[1,2,3], columns=['cola1','colb1','colc1','cola2','colb2','colc2','cola3','colb3','colc3'])
# Split your dataframe into three frames for each column type
bframes = your_df[[col for col in your_df.columns if 'colb' in col]]
aframes = your_df[[col for col in your_df.columns if 'cola' in col]]
cframes = your_df[[col for col in your_df.columns if 'colc' in col]]
for df in [bframes, aframes, cframes]:
df.columns = ['col1','col2','col3']
# Mapping criteria
def map_colb(c):
if c == 'ret1':
return 'A'
elif c == 'ret2':
return None
else:
return 'F'
def map_cola(a):
if a.startswith('D_'):
return 'D'
else:
return 'E'
def map_colc(c):
if c.startswith('B_'):
return 'B'
elif c.startswith('C_'):
return 'C'
elif c.startswith('A_'):
return None
else:
return 'F'
# Use it on each frame
aframes = aframes.applymap(map_cola)
bframes = bframes.applymap(map_colb)
cframes = cframes.applymap(map_colc)
# The trick here is filling 'None's from the left to right in order of precedence
final = bframes.fillna(cframes.fillna(aframes))
# Then just combine them using whatever delimiter you like
# df.values.tolist() turns a row into a list
pathlist = ['|'.join(item) for item in final.values.tolist()]
This gives a result of:
In[70]: pathlist
Out[71]: ['A|F|D', 'A|A|B', 'B|E|A']

Related

pandas query isnt recognizing f string or .format

All Im trying to do is pass a variable to a pandas .query function. I keep getting empty rows returned when I use a python string variable (even when its formatted).
This works
a = '1736_4_A1'
df = metaData.query("array_id == #a")
print(df)
output:
array_id wafer_id slide position array_no sample_id
0 1736_4_A1 1736 4 A1 1 Rat 2nd
But this does not work! I dont understand why
array = str(waferid) + '_' + str(slideid) + '_' + str(position)
a = f'{array}'
a = "{}_{}_{}".format(waferid, slideid, position)
print(a)
df = metaData.query("array_id == #a")
print(df)
output:
1736_4_a1
Empty DataFrame
Columns: [array_id, wafer_id, slide, position, array_no, sample_id]
Index: []
I've spent too many hours on this. I feel like this should be simple! What am I doing wrong here?

Python pandas if column value is list then create new column(s) with individual list value

I'm using pandas to create a dataframe from a SaaS REST API json response and hitting a minor blocker to cleanse the data for visualization and analysis.
I need to tweak the python script by adding a conditional function to say if the value is in a list then remove the brackets, separate the values into new columns and name the new columns as [original column name + value list order].
In the similar questions posted the function is performed on a specified column whereas I need the check to be run on all 1,400+ columns in the dataframe. Basically, excel text to columns and the column header name is [original column name + value list order]
Current
Need
Here's the dataframe creation script from the .json response
def get_tap_dashboard():
use_fields = ''
for index, value in enumerate(list(WORKFLOW_FIELDS.keys())):
if index != len(list(WORKFLOW_FIELDS.keys())) - 1:
use_fields = use_fields + value + ','
else:
use_fields = use_fields + value
dashboard_head = {'Authorization': 'Bearer {}'.format(get_tap_token()), 'Content-Type': 'application/json'}
dashboard_url = \
TAP_URL + "api/v1/workflows/all?pageSize={}&page=1".format(SIZE) \
+ "&advancedFilter=__WorkflowDescription__~eq~'{}'".format(WORKFLOW_NAME) \
+ "&configurationId={}".format("1128443a-f7a7-4a90-953d-c095752a97a2")
dashboard = json.loads(requests.get(url=dashboard_url, headers=dashboard_head).text)
all_columns = []
for col in dashboard['Items'][0]['Columns']:
all_columns.append(col['Name'])
all_columns = ['ResultSetId'] + all_columns
pd_dashboard = pd.DataFrame(columns=all_columns)
for row in dashboard['Items']:
add_row_values = [row['ResultSetId']]
for col in row['Columns']:
if col['Value'] == '-- Select One --': # dtype issue
add_row_values.append([''])
else:
add_row_values.append(col['Value'])
add_row_df = pd.DataFrame([add_row_values], columns=all_columns)
pd_dashboard = pd_dashboard.append(add_row_df)
tap_dashboard = pd_dashboard
return tap_dashboard.rename(columns=WORKFLOW_FIELDS).reset_index(drop=True)
df = get_tap_dashboard()
Any help would be much appreciated thanks all!
PS - I have a Tableau creator license if it makes more sense to do it in Tableau/Tableau prep builder
Is this could be what you need?
from collections import defaultdict
output = defaultdict(lambda : [])
def count(x):
if isinstance(x,list):
if len(x) > 1:
for i,item in enumerate(x):
output[f'{item}_{i}'].append(item)
elif len(x) == 1:
output[f'{x[0]}_0'].append(x[0])
df['df_column_name'].apply(count)
print(pd.DataFrame.from_dict(output, orient='index').T)

Removing doubles while iterating a dataframe

I'm trying to remove doubles from a dataframe.
Basically, the dataframe contains two (or more) occurence of a document.
The doubles can be found by comparing the description of the document.
In my logic, I had to find who the duplicates are, copy the data and drop them from both the dataframe and the iterated dataframe.
But it appears there are still doubles, I do think it is because of the drop but don't know how to fix it.
So what is in green is the description, I need to drop one of the two, and fuse all that there is in black.
For example:
URL1 + URL2|Explorimmo + Bien_ici|Apartment|Description
Unfortunately, I can't link the dataset.
file = pd.ExcelFile(mc.file_path)
df = pd.read_excel(file)
description_duplicate = df.loc[df.duplicated(['DESCRIPTION']) == True]
for idx1, clean in description_duplicate.iterrows():
for idx2, dirty in description_duplicate.iterrows():
if idx1 != idx2:
if clean['DESCRIPTION'] == dirty['DESCRIPTION']:
clean['CRAWL_SOURCE'] = clean['CRAWL_SOURCE'] + " / " +dirty['CRAWL_SOURCE']
clean['URL'] = clean['URL'] + " / " + dirty['URL']
description_duplicate = description_duplicate.drop(idx2)
df = df.drop(idx2)
df[idx1] = clean
You only need to remove duplicates with the pandas.DataFrame.drop_duplicates() function:
df.drop_duplicates(subset='DESCRIPTION', inplace=True)

Avoid having to repeat the same dataframe column names when modifying them

I have a dataframe with over 30 columns. I am doing various modifications on specific columns and would like to find a way to avoid having to always list the specifc columns. Is there a shortcut?
For example:
matrix_bus_filled.loc[matrix_bus_filled['FNR'] == 'AB1122', ["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]] = matrix_bus_filled[matrix_bus_filled['FNR'] == 'AB1120'][["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]].values
Could I simply once define the term "SpecificColumns" and then paste it here?
matrix_bus_filled.loc[matrix_bus_filled['FNR'] == 'AB1122', ["SpecificColumns"]] = matrix_bus_filled[matrix_bus_filled['Flight Number'] == 'AB1120'][["SpecificColumns]].values
And here
matrix_bus_filled [["SpecificColumns"]] = matrix_bus_filled [["SpecificColumns"]].apply(scale, axis=1)
Just define a list and use that to call the columns.
specific_columns = ["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]
matrix_bus_filled[specific_columns] = matrix_bus_filled[specific_columns].apply(scale, axis=1)

Split and Join Series in Pandas

I have two series in the dataframe below. The first is a string which will appear in the second, which will be a url string. What I want to do is change the first series by concatenating on extra characters, and have that change applied onto the second string.
import pandas as pd
#import urlparse
d = {'OrigWord' : ['bunny', 'bear', 'bull'], 'WordinUrl' : ['http://www.animal.com/bunny/ear.html', 'http://www.animal.com/bear/ear.html', 'http://www.animal.com/bull/ear.html'] }
df = pd.DataFrame(d)
def trial(source_col, dest_col):
splitter = dest_col.str.split(str(source_col))
print type(splitter)
print splitter
res = 'angry_' + str(source_col).join(splitter)
return res
df['Final'] = df.applymap(trial(df.OrigWord, df.WordinUrl))
I'm trying to find the string from the source_col, then split on that string in the dest_col, then effect that change on the string in dest_col. Here I have it as a new series called Final but I would rather inplace. I think the main issue are the splitter variable, which isn't working and the application of the function.
Here's how result should look:
OrigWord WordinUrl
angry_bunny http://www.animal.com/angry_bunny/ear.html
angry_bear http://www.animal.com/angry_bear/ear.html
angry_bull http://www.animal.com/angry_bull/ear.html
apply isn't really designed to apply to multiple columns in the same row. What you can do is to change your function so that it takes in a series instead and then assigns source_col, dest_col to the appropriate value in the series. One way of doing it is as below:
def trial(x):
source_col = x["OrigWord"]
dest_col = x['WordinUrl' ]
splitter = str(dest_col).split(str(source_col))
res = splitter[0] + 'angry_' + source_col + splitter[1]
return res
df['Final'] = df.apply(trial,axis = 1 )
here is an alternative approach:
df['WordinUrl'] = (df.apply(lambda x: x.WordinUrl.replace(x.OrigWord,
'angry_' + x.OrigWord), axis=1))
In [25]: df
Out[25]:
OrigWord WordinUrl
0 bunny http://www.animal.com/angry_bunny/ear.html
1 bear http://www.animal.com/angry_bear/ear.html
2 bull http://www.animal.com/angry_bull/ear.html
Instead of using split, you can use the replace method to prepend the angry_ to the corresponding source:
def trial(row):
row.WordinUrl = row.WordinUrl.replace(row.OrigWord, "angry_" + row.OrigWord)
row.OrigWord = "angry_" + row.OrigWord
return row
df.apply(trial, axis = 1)
OrigWord WordinUrl
0 angry_bunny http://www.animal.com/angry_bunny/ear.html
1 angry_bear http://www.animal.com/angry_bear/ear.html
2 angry_bull http://www.animal.com/angry_bull/ear.html

Categories