Replace string in column with other text - python

This seems like an elementary question with many online examples, but for some reason it does not work for me.
I am trying to replace any cells in column 'A' that have the value "Facility-based testing-OH" with the value "Facility based testing-OH". The only difference between the two is a single '-', but for my purposes I do not want to use the split function on a delimiter; I simply want to locate the values that need replacement.
I have tried the following approaches, but none of them worked.
1st Method:
df = df.str.replace('Facility-based testing-OH','Facility based testing-OH')
2nd Method:
df['A'] = df['A'].str.replace(['Facility-based testing-OH'], "Facility based testing-OH"), inplace=True
3rd Method:
df.loc[df['A'].isin(['Facility-based testing-OH'])] = 'Facility based testing-OH'

Try:
df["A"] = df["A"].str.replace(
"Facility-based testing-OH", "Facility based testing-OH", regex=False
)
print(df)
Prints:
A
0 Facility based testing-OH
1 Facility based testing-OH
df used:
A
0 Facility-based testing-OH
1 Facility based testing-OH
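If you only need to match the full cell value (rather than a substring), the 3rd method is close; it just needs the column named in the .loc selector so that only column 'A' is overwritten. A minimal sketch:
# Exact-match replacement: select the matching rows and the target column
df.loc[df["A"] == "Facility-based testing-OH", "A"] = "Facility based testing-OH"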

Related

Replace with Python regex in pandas column

a = "Other (please specify) (What were you trying to do on the website when you encountered ^f('l1')^?Â\xa0)"
There are many values starting with '^f' and ending with '^' in a pandas column, and I need to replace them like below:
"Other (please specify) (What were you trying to do on the website when you encountered THIS ISSUE?Â\xa0)"
You don't mention what you've tried already, nor what the rest of your DataFrame looks like but here is a minimal example:
import pandas as pd

# Create a DataFrame with a single data point
df = pd.DataFrame(["... encountered ^f('l1')^?Â\xa0)"])
# Define a regex pattern that matches the ^f...^ token
pattern = r'(\^f.+\^)'
# Use the .replace() method with regex=True to substitute it
df = df.replace(to_replace=pattern, value='TEST', regex=True)
Output
0
0 ... encountered TEST? )
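If the strings live in a named DataFrame column, the same idea works with .str.replace; a minimal sketch, assuming a hypothetical column called 'comment' and a non-greedy pattern so each ^f...^ token is replaced on its own:
import pandas as pd

# Hypothetical column name, used only for illustration
df = pd.DataFrame({"comment": ["... encountered ^f('l1')^?\xa0)"]})
# Replace every ^f...^ token with the desired text
df["comment"] = df["comment"].str.replace(r"\^f.+?\^", "THIS ISSUE", regex=True)
print(df["comment"].iloc[0])  # ... encountered THIS ISSUE? )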

How to elegantly remove values in a column based on 5 day/idx rule?

I have a dataframe as given below:
test1 = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'flag': ['', '', 'T1', 'T1', 'T1', 'T1', 'T1', 'T1', 'T1', 'T1', '', '', 'T1', 'T1', 'T1', 'T1', 'T1', 'T1', 'T1', 'T1']
})
As per the rule/logic, T1 can appear in the flag field only after 5 days/records from its first occurrence. For example, if T1 first occurred at the 3rd index, it can then only occur at the 9th index or later. Anything before that is invalid and has to be removed.
I tried the below. Though this works, it doesn't look elegant and isn't suitable for all subjects.
a = test1[test1['flag']=='T1'].index.min()
test1.loc[a+1:a+6, 'flag'] = ''
How can I do this check individually for all the subjects? Each subject and its flag should follow this rule.
I expect the invalid flags to be removed in my output.
We can do:
# index of the first 'T1' within each subject_id, broadcast to every row of that subject
s = test1['flag'].eq('T1').groupby(test1['subject_id']).transform('idxmax')
# keep the first 'T1' and anything more than 5 rows after it; blank out the rest
test1.loc[~((test1.index == s) | (test1.index > (s + 5))), 'flag'] = ''
Here is a slightly different way to do it, in a single piped statement. For clarity, I'm creating additional columns for the cumsum and the condition and then sub-setting the dataframe.
(
    test1
    .assign(cum_sum=lambda x: x.flag.eq('T1').groupby(x.subject_id).cumsum())
    .assign(condition=lambda x: (x.flag == '') | (x.cum_sum == 1) | (x.cum_sum >= 5))
    .loc[lambda x: x.condition]
)
Hope this helps.

beginner panda change row data based upon code

I'm a beginner in pandas and Python, trying to learn them.
I would like to iterate over pandas rows to apply simple coded logic.
Instead of fancy mapping functions, I just want simple coded logic.
So then I can easily adapt it later for other coded logic rules as well.
In my dataframe dc,
I would like to check if column AgeUnknown == 1 (or > 0),
and if so it should move the value of column Age to AgeUnknown,
and then make Age equal to 0.0.
I tried various combinations of the code below, but it won't work.
# using a row reference #########
for index, row in dc.iterrows():
    r = row['AgeUnknown']
    if r > 0:
        w = dc.at[index, 'Age']
        dc.at[index, 'AgeUnknown'] = w
        dc.at[index, 'Age'] = 0
Another attempt:
for index in dc.index:
    r = dc.at[index, 'AgeUnknown'].[0]  # also tried .sum here
    if r > 0:
        w = dc.at[index, 'Age']
        dc.at[index, 'AgeUnknown'] = w
        dc.at[index, 'Age'] = 0
Also tried:
if (dc[index, 'Age'] > 0  # wasn't allowed either
Why isn't this working? As far as I understood, a dataframe should be able to be addressed like the above.
I realize you requested a solution involving iterating the df, but I thought I'd provide one that I think is more traditional.
A non-iterating solution to your problem is something like this: 1) get all the indexes that meet your criteria, then 2) set those indexes of the df to what you want.
# indexes where column AgeUnknown is > 0
inds = dc[dc['AgeUnknown'] > 0].index.tolist()
# copy the Age column into AgeUnknown at those indexes
dc.loc[inds, 'AgeUnknown'] = dc.loc[inds, 'Age']
# change Age to 0 at those indexes
dc.loc[inds, 'Age'] = 0
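A minimal, self-contained run of the same idea, using made-up values for dc (the numbers are only for illustration):
import pandas as pd

# Hypothetical data: rows 1 and 2 are flagged as unknown age
dc = pd.DataFrame({'Age': [34.0, 27.0, 51.0], 'AgeUnknown': [0.0, 1.0, 1.0]})

inds = dc[dc['AgeUnknown'] > 0].index.tolist()
dc.loc[inds, 'AgeUnknown'] = dc.loc[inds, 'Age']
dc.loc[inds, 'Age'] = 0

print(dc)
# expected: Age = [34.0, 0.0, 0.0], AgeUnknown = [0.0, 27.0, 51.0]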

Python PANDAS: Groupby Transform First Occurrence of Condition

I have dataframe in the following general format:
customer_id,transaction_dt,product,price,units
1,2004-01-02 00:00:00,thing1,25,47
1,2004-01-17 00:00:00,thing2,150,8
2,2004-01-29 00:00:00,thing2,150,25
3,2017-07-15 00:00:00,thing3,55,17
3,2016-05-12 00:00:00,thing3,55,47
4,2012-02-23 00:00:00,thing2,150,22
4,2009-10-10 00:00:00,thing1,25,12
4,2014-04-04 00:00:00,thing2,150,2
5,2008-07-09 00:00:00,thing2,150,43
5,2004-01-30 00:00:00,thing1,25,40
5,2004-01-31 00:00:00,thing1,25,22
5,2004-02-01 00:00:00,thing1,25,2
I have it sorted by the relevant fields in ascending order. Now I am trying to figure out how to check for a criterion inside a group and create a new indicator flag for only the first time it occurs. As a toy example, I am trying something like this to start:
conditions = (df['units'] > 20) | (df['price'] > 50)
df['flag'] = df[conditions].groupby(['customer_id']).transform()
Any help on how best to formulate this properly would be most welcome!
Assuming you want the first chronological appearance of a customer_id, within the grouping you defined, you can use query, groupby, and first:
(
    df.sort_values("transaction_dt")
    .query("units > 20 & price > 50")
    .groupby("customer_id")
    .first()
)
Note: The example data you provided doesn't actually have multiple customer_id entries for the filters you specified, but the syntax will work in either case.
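If you do want an indicator flag on the original rows, as the question asks, rather than a collapsed per-customer table, here is one possible sketch (it uses the OR condition from the question's attempt and assumes df is the dataframe shown above):
# Sort so that "first" means the earliest transaction within each customer
df = df.sort_values(['customer_id', 'transaction_dt'])
# Row-wise condition from the question's sketch
cond = (df['units'] > 20) | (df['price'] > 50)
# Flag only the first qualifying row per customer_id
df['flag'] = (cond & (cond.groupby(df['customer_id']).cumsum() == 1)).astype(int)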

How to output groupby variables when using .groupby() in pandas?

I have some data that I want to analyze. I group my data by the relevant group variables (here, 'test_condition' and 'region') and analyze the measure variable ('rt') with a function I wrote:
grouped = data.groupby(['test_condition', 'region'])['rt'].apply(summarize)
That works fine. The output looks like this (fake data):
ci1 ci2 mean
test_condition region
Test Condition Name And 0 295.055978 338.857066 316.956522
Spill1 0 296.210167 357.036210 326.623188
Spill2 0 292.955327 329.435977 311.195652
The problem is that 'test_condition' and 'region' are not actual columns, so I can't index into them. I just want columns with the names of the group variables! This seems so simple (it is done automatically in R's ddply), but after lots of googling I have come up with nothing. Does anyone have a simple solution?
By default, the grouping variables are turned into an index. You can change the index to columns with grouped.reset_index().
My second suggestion, specifying this in the groupby call with as_index=False, seems not to work as desired in this case with apply (but it does work when using aggregate).
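A minimal sketch of the first suggestion, reusing the groupby call from the question:
grouped = data.groupby(['test_condition', 'region'])['rt'].apply(summarize)
# Move the group keys out of the index and into ordinary columns
result = grouped.reset_index()
# 'test_condition' and 'region' can now be indexed into directly
print(result['test_condition'])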
