Using Multiple Wildcards in Python Pandas

Thanks for helping out, greatly appreciated. I have looked through S.O. and couldn't quite get the answer I was hoping for.
I have a data frame with columns that I would like to sum, but I would like to exclude some of them based on a wildcard
(so I am hoping to include columns based on one wildcard but also exclude them based on another).
My columns include:
"dose_1", "dose_2", "dose_3"... "new_dose" + "infusion_dose_1" + "infusion_dose_2" + many more similarly
I understand that if I want to sum using a wildcard, I can do
df['new_column'] = df.filter(regex = 'dose').sum(axis = 1)
but what if I want to exclude columns that contain the string "infusion"?
Appreciate it!

regex is probably the wrong tool for this job. Excluding based on a match is overly complicated; see "Regular expression to match a line that doesn't contain a word". Just use a list comprehension to select the labels:
import pandas as pd

df = pd.DataFrame(columns=["dose_1", "dose_2", "dose_3", "new_dose",
                           "infusion_dose_1", "infusion_dose_2", "foobar"])
cols = [x for x in df.columns if 'dose' in x and 'infusion' not in x]
# ['dose_1', 'dose_2', 'dose_3', 'new_dose']
df['new_column'] = df[cols].sum(axis=1)
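If you would rather keep the whole thing vectorised, the same selection can be written with the column index's string methods; a minimal sketch against the frame above:

mask = df.columns.str.contains('dose') & ~df.columns.str.contains('infusion')
df['new_column'] = df.loc[:, mask].sum(axis=1)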

Related

Check if terms are in columns and remove

Originally I wanted to filter only for specific terms; however, I've found Python will match the pattern regardless of specificity, e.g.:
possibilities = ['temp', 'degc']
temp = (df.filter(regex='|'.join(re.escape(x) for x in possibilities))
          .columns.to_list())
The output does find the correct columns, but unfortunately it also returns columns like temp_uncalibrated, which I do not want.
So to solve this, so far I define and remove the unwanted columns first, before filtering, i.e.:
if 'temp_uncalibrated' in df.columns:
    df = df.drop('temp_uncalibrated', axis=1)
else:
    pass
However, I have found more and more of these unwanted columns, and now the code looks messy and hard to read with all the terms. Is there a way to do this more succinctly? I tried putting the terms in a list and doing it that way, but it does not work, i.e.:
if list in df.columns:
    df = df.drop(list, axis=1)
else:
    pass
I thought maybe a function might be a better way to do it, but I'm not really sure where to start.
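One succinct option (a sketch, not from the original thread): DataFrame.drop accepts a whole list of labels together with errors='ignore', which silently skips any label that is not present, and anchoring the filter regex avoids partial matches like temp_uncalibrated:

import re
import pandas as pd

# Hypothetical frame with column names from the question.
df = pd.DataFrame(columns=['temp', 'degc', 'temp_uncalibrated'])

# Drop any number of unwanted columns in one call; errors='ignore'
# skips labels that are not present, so no membership test is needed.
unwanted = ['temp_uncalibrated', 'degc_raw']
df = df.drop(columns=unwanted, errors='ignore')

# Alternatively, anchor the pattern so only whole names match:
possibilities = ['temp', 'degc']
pattern = '|'.join(f'^{re.escape(x)}$' for x in possibilities)
cols = df.filter(regex=pattern).columns.to_list()  # ['temp', 'degc']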

How to sum different categorical data of a data frame into different column

I want to add "R", "F" and "M" into a single column. suppose if any record has R = 1, F = 1 and M = 1 then I want 3. But when I am doing
rfm['RFM_Score'] = rfm[['R','F','M']].sum(axis = 1)
print(rfm['RFM_Score'].head())
I am getting 111.0 instead of 3
What about this:
rfm['RFM_Score'] = rfm['R'].astype(int) + rfm['F'].astype(int) + rfm['M'].astype(int)
Or you can even use:
rfm[['R', 'F', 'M']].apply(pd.to_numeric, errors='coerce').sum(axis=1)
Looks like those are strings rather than numeric, so try:
rfm['RFM_Score'] = rfm[['R','F','M']].astype(float, errors='ignore').sum(axis = 1)
This might not be the most elegant solution; however, I verified with some test data and this should work:
rfm['RFM_Score'] = rfm['R'] + rfm['F'] + rfm['M']
I'm afraid I'm not exactly sure why it's concatenating your values there rather than summing them. I'm still learning myself, and I've found that sometimes the most intuitive way of using these functions does not give the intuitively expected result. Hope this helps though! Good luck!
Edit: I saw someone point out that those values are likely strings rather than ints, and that would definitely explain the concatenation.
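A minimal reproduction of the concatenation (with assumed data), plus the numeric fix:

import pandas as pd

# String digits concatenate under sum instead of adding up.
rfm = pd.DataFrame({'R': ['1'], 'F': ['1'], 'M': ['1']})
print(rfm[['R', 'F', 'M']].sum(axis=1))  # '111', not 3

# Convert each column to numbers first, then sum.
rfm['RFM_Score'] = rfm[['R', 'F', 'M']].apply(pd.to_numeric).sum(axis=1)
print(rfm['RFM_Score'])  # 3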

Split a string based on a delimiter but shift across 1

I am trying to split a string (the numbers are currently strings in a df column) but am struggling to find an answer anywhere. I think regular expressions might be the way forward, but I haven't quite got my head around them.
example 1) 12.540%
example 2) 4.555.6%
I would like to take everything to the left of the first '.' and only one number going to the right of the same first '.'
I need to apply it to all different number lengths and the above statement is the only constant.
example 1) 12.5 and 40%
example 2) 4.5 and 55.6%
Thank you
The following function should do what you want:
def split_string(num):
    # Split on the first period only.
    s = num.split('.', 1)
    # Integer part, the period, and exactly one digit after it.
    s1 = s[0] + '.' + s[1][0]
    # Everything past that first digit is the remainder.
    s2 = s[1][1:]
    return (s1, s2)
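On the question's examples, and then (with assumed column names) applied across a whole column by expanding the returned tuples:

print(split_string('12.540%'))   # ('12.5', '40%')
print(split_string('4.555.6%'))  # ('4.5', '55.6%')

df[['value', 'pct']] = df['raw'].apply(split_string).tolist()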
This is a straightforward problem in string manipulation; any string tutorial will teach you the basic operations.
Find the location of the period.
Add 1.
Split the string at that point: grab one slice through that index, and a second slice from there to the end.
For instance, once you find the location loc and adjust it one or two spots to the right:
num, pct = s[:loc], s[loc:]
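Concretely, a short sketch with the first example:

s = '12.540%'
loc = s.find('.') + 2          # index just past the one digit after the period
num, pct = s[:loc], s[loc:]    # '12.5', '40%'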
If you want regular expressions, catch the groups using this.
^(\d+\..)(.*)$
Use it with re.search:
b = re.search(r'^(\d+\..)(.*)$', string)
b.group(1)
b.group(2)
Example:
val = '12.445.6'
b = re.search(r'^(\d+\..)(.*)$', val)
b.group(1)
Out[24]: '12.4'
b.group(2)
Out[25]: '45.6'

Remove list type in columns while preserving list structure

I have two columns that, from the way my data was pulled, are in lists. This may be a really easy question; I just haven't found the exactly correct way to create the result I'm looking for.
I need the "a" column to be a string without the [] and the "b" column to be integers separated by a comma, if that's possible.
I've tried this code to convert to a string:
df['a'] = df['a'].astype(str)
but it failed; the output was not what I need.
What I need the output to look like is:
a                   b
hbhprecision.com    123,1234,12345,123456
thomsonreuters.com  1234,12345,123456
etc.
Please help and thank you very much in advance!
For the first part, removing the brackets [ ]:
df['c_u'].apply(lambda x: x.strip("['").strip("']"))
For the second part (assuming you removed your brackets as well), splitting the values across columns:
df['tawgs.db_id'].str.split(',', expand=True)
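Putting it together on the question's own columns (a sketch; it assumes the cells are strings that merely look like lists, as the strip calls above imply — if they are real Python lists, use lst[0] and ','.join(map(str, lst)) instead):

import pandas as pd

# Assumed data shaped like the question's desired output.
df = pd.DataFrame({'a': ["['hbhprecision.com']"],
                   'b': ['[123, 1234, 12345]']})

# Strip list punctuation from 'a'; strip brackets and spaces from 'b'.
df['a'] = df['a'].str.strip("[]'")
df['b'] = df['b'].str.strip('[]').str.replace(' ', '', regex=False)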

Python Pandas replace string based on format

Please, is there any way to replace "x-y" with "x,x+1,x+2,...,y" in every row of a data frame (where x and y are integers)?
For example, I want to replace every row like this:
"1-3,7" by "1,2,3,7"
"1,4,6-9,11-13,5" by "1,4,6,7,8,9,11,12,13,5"
etc
I know that by looping through lines and using regular expressions we can do that. But the table is quite big and that takes quite some time, so I think using pandas might be faster.
Thanks a lot.
In pandas you can use apply to apply any function to either rows or columns in a DataFrame. The function can be passed with a lambda, or defined separately.
(side-remark: your example does not entirely make clear if you actually have a 2-D DataFrame or just a 1-D Series. Either way, apply can be used)
The next step is to find the right function. Here's a rough version (without regular expressions):
def make_list(s):
    lst = s.split(',')
    newlst = []
    for i in lst:
        if '-' in i:
            start, end = (int(j) for j in i.split('-'))
            # range is half-open, so add 1 to keep the right endpoint.
            newlst.extend(range(start, end + 1))
        else:
            newlst.append(int(i))
    return newlst
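To get back the comma-separated string form the question asks for, join the result; the column name here is assumed:

make_list('1-3,7')  # [1, 2, 3, 7]
df['col'] = df['col'].apply(lambda s: ','.join(map(str, make_list(s))))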
