All I'm trying to do is pass a variable to a pandas .query function. I keep getting empty rows returned when I use a Python string variable (even when it's formatted).
This works
a = '1736_4_A1'
df = metaData.query("array_id == @a")
print(df)
output:
array_id wafer_id slide position array_no sample_id
0 1736_4_A1 1736 4 A1 1 Rat 2nd
But this does not work! I don't understand why:
array = str(waferid) + '_' + str(slideid) + '_' + str(position)
a = f'{array}'
a = "{}_{}_{}".format(waferid, slideid, position)
print(a)
df = metaData.query("array_id == @a")
print(df)
output:
1736_4_a1
Empty DataFrame
Columns: [array_id, wafer_id, slide, position, array_no, sample_id]
Index: []
I've spent too many hours on this. I feel like this should be simple! What am I doing wrong here?
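One detail worth checking, going only by the output printed above: the generated string is 1736_4_a1 (lowercase a1), while the value stored in the table is 1736_4_A1. String comparison inside .query is case-sensitive, so that difference alone would give an empty result. Below is a minimal sketch, assuming the lowercase comes from the position variable. The lookup helper, the sample frame and the variable values are hypothetical stand-ins, not taken from the original code.

```python
import pandas as pd

# Hypothetical stand-in for metaData, holding the one row shown in the question
metaData = pd.DataFrame({
    "array_id": ["1736_4_A1"],
    "wafer_id": [1736],
    "slide": [4],
    "position": ["A1"],
})

def lookup(frame, key):
    # '@key' tells query() to read the local Python variable named key
    return frame.query("array_id == @key")

# Assumed values; the lowercase 'a1' mirrors the printed output in the question
waferid, slideid, position = 1736, 4, "a1"

raw_key = f"{waferid}_{slideid}_{position}"
fixed_key = f"{waferid}_{slideid}_{str(position).upper()}"

print(len(lookup(metaData, raw_key)))    # no match: 'a1' differs from 'A1'
print(len(lookup(metaData, fixed_key)))  # matches the stored row
```

If the case can vary on either side, normalising both the column and the key (for example with .str.upper()) before comparing may be the safer choice.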
I've got a data frame with column names like 'AH_AP' and 'AH_AS'.
Essentially all I want to do is swap the part before the underscore and the part after the underscore so that the column headers are 'AP_AH' and 'AS_AH'.
I can do that if the elements are in a list, but I've no idea how to get that to apply to column names.
My solution if it were a list goes like this:
columns = ['AH_AP','AS_AS']
def rejig_col_names():
    elements_of_header = columns.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
I'm guessing I need to apply this to something like the below, but I've no idea how, or how to reference a single column within df.columns:
df.columns = df.columns.map()
Any help appreciated. Thanks :)
You can do it this way:
Input:
df = pd.DataFrame(data=[['1','2'], ['3','4']], columns=['AH_PH', 'AH_AS'])
print(df)
AH_PH AH_AS
0 1 2
1 3 4
Output:
df.columns = df.columns.str.split('_').str[::-1].str.join('_')
print(df)
PH_AH AS_AH
0 1 2
1 3 4
Explained:
Use the string accessor and the split method on '_'.
Then use the str accessor with a reversing slice, [::-1], to reverse the order of each list.
Lastly, use the string accessor and join to concatenate each list back together again.
You were almost there: you can do
df.columns = df.columns.map(rejig_col_names)
except that the function gets called with a column name as argument, so change it like this:
def rejig_col_names(col_name):
    elements_of_header = col_name.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
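Put together, the map approach can be run end to end as in the sketch below. The two-column frame is made up for illustration, using the column names from the question.

```python
import pandas as pd

def rejig_col_names(col_name):
    # Swap the parts before and after the underscore: 'AH_AP' -> 'AP_AH'
    elements_of_header = col_name.split('_')
    return elements_of_header[-1] + "_" + elements_of_header[0]

# Illustrative data; only the column names matter here
df = pd.DataFrame([[1, 2], [3, 4]], columns=['AH_AP', 'AH_AS'])

# Index.map calls the function once per column name
df.columns = df.columns.map(rejig_col_names)
print(list(df.columns))  # ['AP_AH', 'AS_AH']
```

Note that the function takes the first and last pieces only, so a name with more than one underscore would lose its middle part.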
An alternative to the other answer. Using your function and DataFrame.rename
import pandas as pd
def rejig_col_names(columns):
    elements_of_header = columns.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
data = {
'A_B': [1, 2, 3],
'C_D': [4, 5, 6],
}
df = pd.DataFrame(data)
df.rename(rejig_col_names, axis='columns', inplace=True)
print(df)
str.replace is also an option via swapping capture groups:
Sample input borrowed from ScottBoston
df = pd.DataFrame(data=[['1', '2'], ['3', '4']], columns=['AH_PH', 'AH_AS'])
Then capture everything before and after the '_' and swap capture groups 1 and 2.
df.columns = df.columns.str.replace(r'^(.*)_(.*)$', r'\2_\1', regex=True)
PH_AH AS_AH
0 1 2
1 3 4
The problem is that I am trying to assign a rank to every group of 3 cells in that column using pandas.
For example:
This is the outcome I want
I have no idea how to make it.
I tried something like this:
for i in range(df.iloc[1:],df.iloc[,:],3):
    counter = 0
    i['item'] += counter + 1
The code is completely wrong, but I need help with the range and with how to put df.iloc in the brackets in pandas.
Does this match the requirements?
import pandas as pd
df = pd.DataFrame()
df['Item'] = ['shoes','shoes','shoes','shirts','shirts','shirts']
df2 = pd.DataFrame()
for i, item in enumerate(df['Item'].unique(), 1):
    df2.loc[i-1,'rank'] = i
    df2.loc[i-1, 'Item'] = item
df2['rank'] = df2['rank'].astype('int')
print(df)
print("\n")
print(df2)
df = df.merge(df2, on='Item', how='inner')
print("\n")
print(df)
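If the rank is simply "order of first appearance of each item", the lookup table and merge can be skipped. The sketch below uses pd.factorize on the same sample data. It assumes that ranking by first appearance is what is wanted, which the question does not state explicitly.

```python
import pandas as pd

df = pd.DataFrame({'Item': ['shoes', 'shoes', 'shoes',
                            'shirts', 'shirts', 'shirts']})

# factorize returns an integer code per row (0 for the first distinct
# value seen, 1 for the second, ...) plus the array of distinct values
codes, uniques = pd.factorize(df['Item'])
df['rank'] = codes + 1  # shift so that ranks start at 1

print(df)
```

Unlike a fixed "every 3 rows" counter, this keeps working when groups have different sizes.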
I'm using Python to automatise some processes at work. My final product has to be in Excel format (formulas have to be there, and everything has to be traceable), so I work on a pandas DataFrame and then export the result to a .xlsx.
What I want to do is to create a pandas DataFrame that looks like this:
ID Price Quantity Total
0 A =VLOOKUP(A2;'Sheet2'!A:J;6;0) =VLOOKUP(A2;'Sheet2'!A:J;7;0) =B2*C2
1 B =VLOOKUP(A3;'Sheet2'!A:J;6;0) =VLOOKUP(A3;'Sheet2'!A:J;7;0) =B3*C3
2 C =VLOOKUP(A4;'Sheet2'!A:J;6;0) =VLOOKUP(A4;'Sheet2'!A:J;7;0) =B4*C4
3 D =VLOOKUP(A5;'Sheet2'!A:J;6;0) =VLOOKUP(A5;'Sheet2'!A:J;7;0) =B5*C5
4 E =VLOOKUP(A6;'Sheet2'!A:J;6;0) =VLOOKUP(A6;'Sheet2'!A:J;7;0) =B6*C6
As you can see, the formulas in the first row reference A2, B2 and C2; the second row references A3, B3 and C3; row n references A(n+2), B(n+2) and C(n+2). The DataFrame has about 3,000 rows.
I want to generate this DataFrame with a few lines of code, but I haven't got the expected result. I thought using positional formatting would do:
df = pd.DataFrame()
df['temp'] = range(3000)
df['Price'] = """=VLOOKUP(A{0};'Sheet2'!A:J;6;0)""" .format(df.index + 2)
df['Quantity'] = """=VLOOKUP(A{0};'Sheet2'!A:J;7;0)""" .format(df.index + 2)
df['Total'] = """=B{0}*C{0}""" .format(df.index + 2)
df.drop('temp', axis=1, inplace=True)
Unfortunately it doesn't work. It returns something like this:
"=VLOOKUP(ARangeIndex(start=2, stop=3002, step=1);'Sheet2'!A:J;6;0)"
Does anyone have any suggestion on how to do this?
Thanks!
Try vectorised string concatenation:
df = pd.DataFrame(index=range(2000)) # no need for temp here, btw
idx = (df.index + 2).astype(str)
df['Price'] = "=VLOOKUP(A" + idx + ";'Sheet2'!A:J;6;0)"
A similar process follows for the remainder of your columns:
df['Quantity'] = "=VLOOKUP(A" + idx + ";'Sheet2'!A:J;7;0)"
df['Total'] = '=B' + idx + '*C' + idx
df.head()
Price Quantity Total
0 =VLOOKUP(A2;'Sheet2'!A:J;6;0) =VLOOKUP(A2;'Sheet2'!A:J;7;0) =B2*C2
1 =VLOOKUP(A3;'Sheet2'!A:J;6;0) =VLOOKUP(A3;'Sheet2'!A:J;7;0) =B3*C3
2 =VLOOKUP(A4;'Sheet2'!A:J;6;0) =VLOOKUP(A4;'Sheet2'!A:J;7;0) =B4*C4
3 =VLOOKUP(A5;'Sheet2'!A:J;6;0) =VLOOKUP(A5;'Sheet2'!A:J;7;0) =B5*C5
4 =VLOOKUP(A6;'Sheet2'!A:J;6;0) =VLOOKUP(A6;'Sheet2'!A:J;7;0) =B6*C6
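The original attempt failed because .format was called once with the whole index object, so the index's repr was pasted into a single string. An alternative that stays close to the formatting idea is to format once per row with a list comprehension. This is a sketch only; the five-row frame stands in for the real one.

```python
import pandas as pd

df = pd.DataFrame(index=range(5))  # small stand-in for the ~3,000-row frame

# Excel row numbers: DataFrame row 0 lands on sheet row 2 (row 1 is the header)
rows = df.index + 2

# One formatted string per row, instead of one string for the whole index
df['Price'] = [f"=VLOOKUP(A{n};'Sheet2'!A:J;6;0)" for n in rows]
df['Quantity'] = [f"=VLOOKUP(A{n};'Sheet2'!A:J;7;0)" for n in rows]
df['Total'] = [f"=B{n}*C{n}" for n in rows]

print(df.head())
```

For a few thousand rows either approach should be fast enough; the comprehension is mostly a readability choice when the formula has several placeholders.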
I have some trouble processing a big CSV with pandas. The CSV consists of an index and about 450 other columns in groups of 3, something like this:
cola1 colb1 colc1 cola2 colb2 colc2 cola3 colb3 colc3
1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1
2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2
3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3
For each trio of columns I would like to analyze the B column (it's a sort of "CONTROL" field), and depending on its value I should then return a value by processing columns A and C.
Finally I need to return a concatenation of all resulting columns starting from 150 to 1.
I already tried with apply but it seems too slow (10 min to process 50k rows).
df['Path'] = df.apply(lambda x: getFullPath(x), axis=1)
with an example function you can find here:
https://pastebin.com/S9QWTGGV
I tried extracting a list of unique combinations of cola,colb,colc - preprocessing the list - and applying map to generate results and it speeds up a little:
for i in range(1,151):
    df['Concat' + str(i)] = df['cola' + str(i)] + '|' + df['colb' + str(i)] + '|' + df['colc' + str(i)]
concats = []
for i in range(1,151):
    concats.append('Concat' + str(i))
ret = df[concats].values.ravel()
uniq = list(set(ret))
list = {}
for member in ret:
    list[member] = getPath2(member)
for i in range(1,MAX_COLS + 1):
    df['Res' + str(i)] = df['Concat' + str(i)].map(list)
df['Path'] = df.apply(getFullPath2,axis=1)
The functions getPath2 and getFullPath2 are defined, as an example, here:
https://pastebin.com/zpFF2wXD
But it still seems a little slow (6 min to process everything).
Do you have any suggestions on how I could speed up the CSV processing?
I don't even know if the way I'm "concatenating" columns could be better :) I tried Series.cat but I didn't get how to chain only some columns and not the full df.
Thanks very much!
Mic
Amended answer: I see from your criteria, you actually have multiple controls on each column. I think what works is to split these into 3 dataframes, applying your mapping as follows:
import pandas as pd
series = {
'cola1': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb1': pd.Series(['ret1','ret1','ret2'],index=[1,2,3]),
'colc1': pd.Series(['B_1','C_2','B_3'],index=[1,2,3]),
'cola2': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb2': pd.Series(['ret3','ret1','ret2'],index=[1,2,3]),
'colc2': pd.Series(['B_2','A_1','A_3'],index=[1,2,3]),
'cola3': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
'colb3': pd.Series(['ret2','ret2','ret1'],index=[1,2,3]),
'colc3': pd.Series(['A_1','B_2','C_3'],index=[1,2,3]),
}
your_df = pd.DataFrame(series, index=[1,2,3], columns=['cola1','colb1','colc1','cola2','colb2','colc2','cola3','colb3','colc3'])
# Split your dataframe into three frames for each column type
bframes = your_df[[col for col in your_df.columns if 'colb' in col]]
aframes = your_df[[col for col in your_df.columns if 'cola' in col]]
cframes = your_df[[col for col in your_df.columns if 'colc' in col]]
for df in [bframes, aframes, cframes]:
    df.columns = ['col1','col2','col3']
# Mapping criteria
def map_colb(c):
    if c == 'ret1':
        return 'A'
    elif c == 'ret2':
        return None
    else:
        return 'F'
def map_cola(a):
    if a.startswith('D_'):
        return 'D'
    else:
        return 'E'
def map_colc(c):
    if c.startswith('B_'):
        return 'B'
    elif c.startswith('C_'):
        return 'C'
    elif c.startswith('A_'):
        return None
    else:
        return 'F'
# Use it on each frame
aframes = aframes.applymap(map_cola)
bframes = bframes.applymap(map_colb)
cframes = cframes.applymap(map_colc)
# The trick here is filling 'None's from the left to right in order of precedence
final = bframes.fillna(cframes.fillna(aframes))
# Then just combine them using whatever delimiter you like
# df.values.tolist() turns a row into a list
pathlist = ['|'.join(item) for item in final.values.tolist()]
This gives a result of:
In[70]: pathlist
Out[71]: ['A|F|D', 'A|A|B', 'B|E|A']
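The question asks for the concatenation to run from the last group down to the first (150 to 1), whereas the join above runs left to right. One way to flip it is to reverse the column order before joining. The sketch below rebuilds the final frame by hand from the result shown above so that it runs on its own.

```python
import pandas as pd

# Hand-built copy of `final`, matching the pathlist shown above
final = pd.DataFrame(
    {'col1': ['A', 'A', 'B'],
     'col2': ['F', 'A', 'E'],
     'col3': ['D', 'B', 'A']},
    index=[1, 2, 3],
)

# Reverse the column order (col3, col2, col1), then join each row as before
reversed_final = final[final.columns[::-1]]
pathlist = ['|'.join(item) for item in reversed_final.values.tolist()]

print(pathlist)  # ['D|F|A', 'B|A|A', 'A|E|B']
```

With 150 groups the same slice applies unchanged, as long as the columns were renamed in ascending group order.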
I have two series in the dataframe below. The first is a string which will appear in the second, which will be a url string. What I want to do is change the first series by concatenating on extra characters, and have that change applied onto the second string.
import pandas as pd
#import urlparse
d = {'OrigWord' : ['bunny', 'bear', 'bull'], 'WordinUrl' : ['http://www.animal.com/bunny/ear.html', 'http://www.animal.com/bear/ear.html', 'http://www.animal.com/bull/ear.html'] }
df = pd.DataFrame(d)
def trial(source_col, dest_col):
    splitter = dest_col.str.split(str(source_col))
    print type(splitter)
    print splitter
    res = 'angry_' + str(source_col).join(splitter)
    return res
df['Final'] = df.applymap(trial(df.OrigWord, df.WordinUrl))
I'm trying to find the string from source_col, then split on that string in dest_col, then apply that change to the string in dest_col. Here I have it as a new series called Final, but I would rather do it in place. I think the main issues are the splitter variable, which isn't working, and the application of the function.
Here's how result should look:
OrigWord WordinUrl
angry_bunny http://www.animal.com/angry_bunny/ear.html
angry_bear http://www.animal.com/angry_bear/ear.html
angry_bull http://www.animal.com/angry_bull/ear.html
applymap works element by element, so the function never sees more than one column of the same row at a time. What you can do is change your function so that it takes a row (a Series) instead, assign source_col and dest_col from the appropriate values in that row, and call it with apply and axis=1. One way of doing it is as below:
def trial(x):
    source_col = x["OrigWord"]
    dest_col = x['WordinUrl']
    splitter = str(dest_col).split(str(source_col))
    res = splitter[0] + 'angry_' + source_col + splitter[1]
    return res
df['Final'] = df.apply(trial,axis = 1 )
Here is an alternative approach:
df['WordinUrl'] = (df.apply(lambda x: x.WordinUrl.replace(x.OrigWord,
'angry_' + x.OrigWord), axis=1))
In [25]: df
Out[25]:
OrigWord WordinUrl
0 bunny http://www.animal.com/angry_bunny/ear.html
1 bear http://www.animal.com/angry_bear/ear.html
2 bull http://www.animal.com/angry_bull/ear.html
Instead of using split, you can use the replace method to prepend the angry_ to the corresponding source:
def trial(row):
    row.WordinUrl = row.WordinUrl.replace(row.OrigWord, "angry_" + row.OrigWord)
    row.OrigWord = "angry_" + row.OrigWord
    return row
df.apply(trial, axis = 1)
OrigWord WordinUrl
0 angry_bunny http://www.animal.com/angry_bunny/ear.html
1 angry_bear http://www.animal.com/angry_bear/ear.html
2 angry_bull http://www.animal.com/angry_bull/ear.html
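The same replace idea also works without apply, by walking the two columns in parallel with zip and then prefixing OrigWord in one vectorised step. This is a sketch on the sample data from the question; it updates both columns of df directly, which is what the question asked for.

```python
import pandas as pd

df = pd.DataFrame({
    'OrigWord': ['bunny', 'bear', 'bull'],
    'WordinUrl': ['http://www.animal.com/bunny/ear.html',
                  'http://www.animal.com/bear/ear.html',
                  'http://www.animal.com/bull/ear.html'],
})

# Pair each word with its own URL and prefix the word inside the URL
df['WordinUrl'] = [url.replace(word, 'angry_' + word)
                   for word, url in zip(df['OrigWord'], df['WordinUrl'])]

# Prefix the word column itself; this must come after the URL update,
# because the replacement above searches for the original word
df['OrigWord'] = 'angry_' + df['OrigWord']

print(df)
```

As with the replace-based answers above, every occurrence of the word in the URL is replaced, so a word that also appears elsewhere in the URL would be prefixed there too.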