pandas query isnt recognizing f string or .format - python

All Im trying to do is pass a variable to a pandas .query function. I keep getting empty rows returned when I use a python string variable (even when its formatted).
This works
a = '1736_4_A1'
df = metaData.query("array_id == #a")
print(df)
output:
array_id wafer_id slide position array_no sample_id
0 1736_4_A1 1736 4 A1 1 Rat 2nd
But this does not work! I dont understand why
array = str(waferid) + '_' + str(slideid) + '_' + str(position)
a = f'{array}'
a = "{}_{}_{}".format(waferid, slideid, position)
print(a)
df = metaData.query("array_id == #a")
print(df)
output:
1736_4_a1
Empty DataFrame
Columns: [array_id, wafer_id, slide, position, array_no, sample_id]
Index: []
I've spent too many hours on this. I feel like this should be simple! What am I doing wrong here?

Related

Rename columns in dataframe using bespoke function python pandas

I've got a data frame with column names like 'AH_AP' and 'AH_AS'.
Essentially all i want to do is swap the part before the underscore and the part after the underscore so that the column headers are 'AP_AH' and 'AS_AH'.
I can do that if the elements are in a list, but i've no idea how to get that to apply to column names.
My solution if it were a list goes like this:
columns = ['AH_AP','AS_AS']
def rejig_col_names():
elements_of_header = columns.split('_')
new_title = elements_of_header[-1] + "_" + elements_of_header[0]
return new_title
i'm guessing i need to apply this to something like the below, but i've no idea how, or how to reference a single column within df.columns:
df.columns = df.columns.map()
Any help appreciated. Thanks :)
You can do it this way:
Input:
df = pd.DataFrame(data=[['1','2'], ['3','4']], columns=['AH_PH', 'AH_AS'])
print(df)
AH_PH AH_AS
0 1 2
1 3 4
Output:
df.columns = df.columns.str.split('_').str[::-1].str.join('_')
print(df)
PH_AH AS_AH
0 1 2
1 3 4
Explained:
Use string accessor and the split method on '_'
Then using the str accessor with index slicing reversing, [::-1], you
can reverse the order of the list
Lastly, using the string accessor and join, we can concatenate the
list back together again.
You were almost there: you can do
df.columns = df.columns.map(rejig_col_names)
except that the function gets called with a column name as argument, so change it like this:
def rejig_col_names(col_name):
elements_of_header = col_name.split('_')
new_title = elements_of_header[-1] + "_" + elements_of_header[0]
return new_title
An alternative to the other answer. Using your function and DataFrame.rename
import pandas as pd
def rejig_col_names(columns):
elements_of_header = columns.split('_')
new_title = elements_of_header[-1] + "_" + elements_of_header[0]
return new_title
data = {
'A_B': [1, 2, 3],
'C_D': [4, 5, 6],
}
df = pd.DataFrame(data)
df.rename(rejig_col_names, axis='columns', inplace=True)
print(df)
str.replace is also an option via swapping capture groups:
Sample input borrowed from ScottBoston
df = pd.DataFrame(data=[['1', '2'], ['3', '4']], columns=['AH_PH', 'AH_AS'])
Then Capture everything before and after the '_' and swap capture group 1 and 2.
df.columns = df.columns.str.replace(r'^(.*)_(.*)$', r'\2_\1', regex=True)
PH_AH AS_AH
0 1 2
1 3 4

Loop through cell range (Every 3 cells) and add ranking to it

The problem is I am trying to make a ranking for every 3 cells in that column
using pandas.
For example:
This is the outcome I want
I have no idea how to make it.
I tried something like this:
for i in range(df.iloc[1:],df.iloc[,:],3):
counter = 0
i['item'] += counter + 1
The code is completely wrong, but I need help with the range and put df.iloc in the brackets in pandas.
Does this match the requirements ?
import pandas as pd
df = pd.DataFrame()
df['Item'] = ['shoes','shoes','shoes','shirts','shirts','shirts']
df2 = pd.DataFrame()
for i, item in enumerate(df['Item'].unique(), 1):
df2.loc[i-1,'rank'] = i
df2.loc[i-1, 'Item'] = item
df2['rank'] = df2['rank'].astype('int')
print(df)
print("\n")
print(df2)
df = df.merge(df2, on='Item', how='inner')
print("\n")
print(df)

Positional string-formatting on pandas DataFrame

I'm using python to automatise some processes at work. My final product has to be in excel format (formulas have to be there, and everything has to be traceable), so I work on a pandas DataFrame and then export the result to a .xlsx.
What I want to do is to create a pandas DataFrame that looks like this:
ID Price Quantity Total
0 A =VLOOKUP(A2;'Sheet2'!A:J;6;0) =VLOOKUP(A2;'Sheet2'!A:J;7;0) =B2*C2
1 B =VLOOKUP(A3;'Sheet2'!A:J;6;0) =VLOOKUP(A3;'Sheet2'!A:J;7;0) =B3*C3
2 C =VLOOKUP(A4;'Sheet2'!A:J;6;0) =VLOOKUP(A4;'Sheet2'!A:J;7;0) =B4*C4
3 D =VLOOKUP(A5;'Sheet2'!A:J;6;0) =VLOOKUP(A5;'Sheet2'!A:J;7;0) =B5*C5
4 E =VLOOKUP(A6;'Sheet2'!A:J;6;0) =VLOOKUP(A6;’Sheet2'!A:J;7;0) =B6*C6
As you can see in the first row, the formulas reference A2, B2 and C2; the second row references A3, B3 and C3; the 'n' row references A(n+2), B(n+2) and C(n+2). The DataFrame has about 3.000 rows.
I want to generate this dataframe with a few lines of code, and i haven't got the expected result. I though using positional formatting would do:
df = pd.DataFrame()
df['temp'] = range(3000)
df['Price'] = """=VLOOKUP(A{0};'Sheet2'!A:J;6;0)""" .format(df.index + 2)
df['Quantity'] = """=VLOOKUP(A{0};'Sheet2'!A:J;7;0)""" .format(df.index + 2)
df['Total'] = """=B{0}*C{0}""" .format(df.index + 2)
df.drop('temp', axis=1, inplace=True)
Unfortunately it doesn't work. It returns something like this:
"=VLOOKUP(ARangeIndex(start=2, stop=3002, step=1);'Sheet2'!A:J;6;0)"
Does anyone have any suggestion on how to do this?
Thanks!
Try vectorised string concatenation:
df = pd.DataFrame(index=range(2000)) # no need for temp here, btw
idx = (df.index + 2).astype(str)
df['Price'] = "=VLOOKUP(A" + idx + ";'Sheet2'!A:J;6;0)"
A similar process follows for the remainder of your columns:
df['Quantity'] = "=VLOOKUP(A" + idx + ";'Sheet2'!A:J;7;0)"
df['Total'] = 'B' + idx + '*C' + idx
df.head()
Price Quantity Total
0 =VLOOKUP(A2;'Sheet2'!A:J;6;0) =VLOOKUP(A2;'Sheet2'!A:J;7;0) B2*C2
1 =VLOOKUP(A3;'Sheet2'!A:J;6;0) =VLOOKUP(A3;'Sheet2'!A:J;7;0) B3*C3
2 =VLOOKUP(A4;'Sheet2'!A:J;6;0) =VLOOKUP(A4;'Sheet2'!A:J;7;0) B4*C4
3 =VLOOKUP(A5;'Sheet2'!A:J;6;0) =VLOOKUP(A5;'Sheet2'!A:J;7;0) B5*C5
4 =VLOOKUP(A6;'Sheet2'!A:J;6;0) =VLOOKUP(A6;'Sheet2'!A:J;7;0) B6*C6

Pandas - Working on multiple columns seems slow

I have some trouble processing a big csv with Pandas. Csv consists of an index and about other 450 columns in groups of 3, something like this:
cola1 colb1 colc1 cola2 colb2 colc2 cola3 colb3 colc3
1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1
2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2
3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3
For each trio of columns I would like to "analyze B column (it's a sort of "CONTROL field" and depending on its value I should then return a value by processing col A and C.
Finally I need to return a concatenation of all resulting columns starting from 150 to 1.
I already tried with apply but it seems too slow (10 min to process 50k rows).
df['Path'] = df.apply(lambda x: getFullPath(x), axis=1)
with an example function you can find here:
https://pastebin.com/S9QWTGGV
I tried extracting a list of unique combinations of cola,colb,colc - preprocessing the list - and applying map to generate results and it speeds up a little:
for i in range(1,151):
df['Concat' + str(i)] = df['cola' + str(i)] + '|' + df['colb' + str(i)] + '|' + df['colc' + str(i)]
concats = []
for i in range(1,151):
concats.append('Concat' + str(i))
ret = df[concats].values.ravel()
uniq = list(set(ret))
list = {}
for member in ret:
list[member] = getPath2(member)
for i in range(1,MAX_COLS + 1):
df['Res' + str(i)] = df['Concat' + str(i)].map(list)
df['Path'] = df.apply(getFullPath2,axis=1)
function getPath and getFullPath2 are defined as example here:
https://pastebin.com/zpFF2wXD
But it seems still a little bit slow (6 min for processing everything)
Do you have any suggestion on how I could speed up csv processing?
I don't even know if the way I using to "concatenate" columns could be better :), tried with Series.cat but I didn't get how to chain only some columns and not the full df
Thanks very much!
Mic
Amended answer: I see from your criteria, you actually have multiple controls on each column. I think what works is to split these into 3 dataframes, applying your mapping as follows:
import pandas as pd
series = {
'cola1': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb1': pd.Series(['ret1','ret1','ret2'],index=[1,2,3]),
'colc1': pd.Series(['B_1','C_2','B_3'],index=[1,2,3]),
'cola2': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb2': pd.Series(['ret3','ret1','ret2'],index=[1,2,3]),
'colc2': pd.Series(['B_2','A_1','A_3'],index=[1,2,3]),
'cola3': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb3': pd.Series(['ret2','ret2','ret1'],index=[1,2,3]),
'colc3': pd.Series(['A_1','B_2','C_3'],index=[1,2,3]),
}
your_df = pd.DataFrame(series, index=[1,2,3], columns=['cola1','colb1','colc1','cola2','colb2','colc2','cola3','colb3','colc3'])
# Split your dataframe into three frames for each column type
bframes = your_df[[col for col in your_df.columns if 'colb' in col]]
aframes = your_df[[col for col in your_df.columns if 'cola' in col]]
cframes = your_df[[col for col in your_df.columns if 'colc' in col]]
for df in [bframes, aframes, cframes]:
df.columns = ['col1','col2','col3']
# Mapping criteria
def map_colb(c):
if c == 'ret1':
return 'A'
elif c == 'ret2':
return None
else:
return 'F'
def map_cola(a):
if a.startswith('D_'):
return 'D'
else:
return 'E'
def map_colc(c):
if c.startswith('B_'):
return 'B'
elif c.startswith('C_'):
return 'C'
elif c.startswith('A_'):
return None
else:
return 'F'
# Use it on each frame
aframes = aframes.applymap(map_cola)
bframes = bframes.applymap(map_colb)
cframes = cframes.applymap(map_colc)
# The trick here is filling 'None's from the left to right in order of precedence
final = bframes.fillna(cframes.fillna(aframes))
# Then just combine them using whatever delimiter you like
# df.values.tolist() turns a row into a list
pathlist = ['|'.join(item) for item in final.values.tolist()]
This gives a result of:
In[70]: pathlist
Out[71]: ['A|F|D', 'A|A|B', 'B|E|A']

Split and Join Series in Pandas

I have two series in the dataframe below. The first is a string which will appear in the second, which will be a url string. What I want to do is change the first series by concatenating on extra characters, and have that change applied onto the second string.
import pandas as pd
#import urlparse
d = {'OrigWord' : ['bunny', 'bear', 'bull'], 'WordinUrl' : ['http://www.animal.com/bunny/ear.html', 'http://www.animal.com/bear/ear.html', 'http://www.animal.com/bull/ear.html'] }
df = pd.DataFrame(d)
def trial(source_col, dest_col):
splitter = dest_col.str.split(str(source_col))
print type(splitter)
print splitter
res = 'angry_' + str(source_col).join(splitter)
return res
df['Final'] = df.applymap(trial(df.OrigWord, df.WordinUrl))
I'm trying to find the string from the source_col, then split on that string in the dest_col, then effect that change on the string in dest_col. Here I have it as a new series called Final but I would rather inplace. I think the main issue are the splitter variable, which isn't working and the application of the function.
Here's how result should look:
OrigWord WordinUrl
angry_bunny http://www.animal.com/angry_bunny/ear.html
angry_bear http://www.animal.com/angry_bear/ear.html
angry_bull http://www.animal.com/angry_bull/ear.html
apply isn't really designed to apply to multiple columns in the same row. What you can do is to change your function so that it takes in a series instead and then assigns source_col, dest_col to the appropriate value in the series. One way of doing it is as below:
def trial(x):
source_col = x["OrigWord"]
dest_col = x['WordinUrl' ]
splitter = str(dest_col).split(str(source_col))
res = splitter[0] + 'angry_' + source_col + splitter[1]
return res
df['Final'] = df.apply(trial,axis = 1 )
here is an alternative approach:
df['WordinUrl'] = (df.apply(lambda x: x.WordinUrl.replace(x.OrigWord,
'angry_' + x.OrigWord), axis=1))
In [25]: df
Out[25]:
OrigWord WordinUrl
0 bunny http://www.animal.com/angry_bunny/ear.html
1 bear http://www.animal.com/angry_bear/ear.html
2 bull http://www.animal.com/angry_bull/ear.html
Instead of using split, you can use the replace method to prepend the angry_ to the corresponding source:
def trial(row):
row.WordinUrl = row.WordinUrl.replace(row.OrigWord, "angry_" + row.OrigWord)
row.OrigWord = "angry_" + row.OrigWord
return row
df.apply(trial, axis = 1)
OrigWord WordinUrl
0 angry_bunny http://www.animal.com/angry_bunny/ear.html
1 angry_bear http://www.animal.com/angry_bear/ear.html
2 angry_bull http://www.animal.com/angry_bull/ear.html

Categories