Replace with Python regex in pandas column - python

a = "Other (please specify) (What were you trying to do on the website when you encountered ^f('l1')^?Â\xa0)"
There are many values starting with '^f' and ending with '^' in a pandas column, and I need to replace them like below:
"Other (please specify) (What were you trying to do on the website when you encountered THIS ISSUE?Â\xa0)"

You don't mention what you've tried already, nor what the rest of your DataFrame looks like, but here is a minimal example:
import pandas as pd

# Create a DataFrame with a single data point
df = pd.DataFrame(["... encountered ^f('l1')^?Â\xa0)"])
# Define a regex pattern that matches from '^f' up to a later '^'
pattern = r'(\^f.+\^)'
# Use the .replace() method with regex=True to substitute each match
df = df.replace(to_replace=pattern, value='TEST', regex=True)
Output
0
0 ... encountered TEST? )
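If the target is a single named column rather than the whole frame, Series.str.replace does the same job. A minimal sketch, assuming a hypothetical column name 'question' and using a non-greedy quantifier so that several ^f(...)^ tokens in one cell would each be replaced:
import pandas as pd

df = pd.DataFrame({'question': ["... encountered ^f('l1')^? ..."]})
# Non-greedy .+? stops at the first closing caret instead of the last one
df['question'] = df['question'].str.replace(r'\^f.+?\^', 'THIS ISSUE', regex=True)
print(df)  # ... encountered THIS ISSUE? ...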

Related

Python loop to search multiple sets of keywords in all columns of dataframe

I've used the code below to search across all columns of my dataframe to see if each row has the word "pool" and the words "slide" or "waterslide".
AR11_regex = r"""
(?=.*(?:slide|waterslide)).*pool
"""
f = lambda x: x.str.findall(AR11_regex, flags=re.VERBOSE | re.IGNORECASE)
d['AR']['AR11'] = d['AR'].astype(str).apply(f).any(1).astype(int)
This has worked fine, but when I write a for loop to do this for more than one regex pattern (e.g., AR11, AR12, AR21) using the code below, the new columns are all zeros (i.e., the search is not finding any hits):
for i in AR_list:
    print(i)
    pat = i + "_regex"
    print(pat)
    f = lambda x: x.str.findall(i + "_regex", flags=re.VERBOSE | re.IGNORECASE)
    d['AR'][str(i)] = d['AR'].astype(str).apply(f).any(1).astype(int)
Any advice on why this loop didn't work would be much appreciated!
A small sample data frame would help us understand your question. In any case, your code sample appears to have a multitude of problems.
i+"_regex" is just the string "AR11_regex". It won't evaluate to the value of the variable with the identifier AR11_regex. Put your regex patterns in a dict.
d['AR'] is the values in the AR column. It seems like you expect it to be a row.
d['AR'][str(i)] is adding a new row. It seems like you want to add a new column.
Lastly, this approach to setting a cell generally (always for me) yields the following warning:
/var/folders/zj/pnrcbb6n01z2qv1gmsk70b_m0000gn/T/ipykernel_13985/876572204.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
The suggested approach would be to use .at, as in d.at[str(i), 'AR'] or some such.
Add a sample data frame and refine your question for more suggestions.
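For illustration, here is a minimal sketch of the dict suggestion. It assumes, as the question's .any(1) implies, that d['AR'] is a DataFrame of text columns; the single-line pattern also removes the need for re.VERBOSE:
import re

patterns = {
    'AR11': r'(?=.*(?:slide|waterslide)).*pool',
    # 'AR12': ..., 'AR21': ...
}
for name, pat in patterns.items():
    # Test every column, then flag the rows where any column matched
    hits = d['AR'].astype(str).apply(
        lambda s: s.str.contains(pat, flags=re.IGNORECASE, na=False)
    )
    d['AR'][name] = hits.any(axis=1).astype(int)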

Replace string in column with other text

This seems like an elementary question with many online examples, but for some reason it does not work for me.
I am trying to replace any cells in column 'A' that have the value "Facility-based testing-OH" with the value "Facility based testing-OH". As you can see, the only difference between the two is a single '-'; however, for my purposes I do not want to use the split function on a delimiter. I simply want to locate the values that need replacement.
I have tried the following code, but none have worked.
1st Method:
df = df.str.replace('Facility-based testing-OH','Facility based testing-OH')
2nd Method:
df['A'] = df['A'].str.replace(['Facility-based testing-OH'], "Facility based testing-OH"), inplace=True
3rd Method
df.loc[df['A'].isin(['Facility-based testing-OH'])] = 'Facility based testing-OH'
Try:
df["A"] = df["A"].str.replace(
"Facility-based testing-OH", "Facility based testing-OH", regex=False
)
print(df)
Prints:
A
0 Facility based testing-OH
1 Facility based testing-OH
df used:
A
0 Facility-based testing-OH
1 Facility based testing-OH
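As a side note, if the goal were to replace only cells whose entire value matches (rather than a substring anywhere), Series.replace with a dict would be an alternative sketch:
df["A"] = df["A"].replace(
    {"Facility-based testing-OH": "Facility based testing-OH"}
)
Unlike str.replace, this swaps whole cell values, so partial occurrences inside longer strings are left alone.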

How to remove row completely when removing non-ascii characters?

I am using the code below to remove all non-English characters:
df.text.replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)
where df has a column called text with text in it like below:
text
hi what are you saying?
okay let me know
sounds great, mikey
ok.
right
ご承知のとおり、残念ながら悪質な詐欺が増加しているようですのでお気を付けください。\n
¡Hola miguel! Lamento mucho la confusión cau
expected output:
text
hi what are you saying?
okay let me know
sounds great, mikey
ok.
right
For the rows where my code removes characters, I want to delete those rows from the df completely: if any non-English characters are replaced, the whole row should be dropped, to avoid being left with rows that have zero characters or a few meaningless characters after the code above has altered them.
You can use
df[~df['text'].str.contains(r'[^\x00-\x7F]')]
Pandas test:
import pandas as pd
df = pd.DataFrame({'text': ['hi what are you saying?', 'ご承知のとおり、残念ながら悪質な詐欺が増加しているようですのでお気を付けください。'], 'another_col':['demo 1', 'demo 2']})
df[~df['text'].str.contains(r'[^\x00-\x7F]')]
# text another_col
# 0 hi what are you saying? demo 1
Notes:
df['text'].str.contains(r'[^\x00-\x7F]') finds all values in the text column that contain a non-ASCII character (it is our "mask")
df[~...] only keeps those rows that did not match the regex.
str.contains() returns a Series of booleans that we can use to index our frame
patternDel = "[^\x00-\x7F]"
filter = df['Event Name'].str.contains(patternDel)
I tend to keep the things we want, as opposed to deleting rows. Since filter represents the things we want to delete, we use ~ to get all the rows that don't match and keep them:
df = df[~filter]
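One caveat with either version: if the text column contains NaN values, str.contains returns NaN for those rows and the boolean indexing raises an error. Passing na=False treats missing values as non-matches, so here they count as ASCII-only and are kept:
mask = df['text'].str.contains(r'[^\x00-\x7F]', na=False)
df = df[~mask]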

Extract regex matches, and not groups, in data frame rows in Python

I am a novice at coding; I generally use R (stringr) for this, but I have started to learn Python's syntax as well.
I have a data frame with one column generated from an imported Excel file. The values in this column contain both uppercase and lowercase characters, symbols, and numbers.
I would like to generate a second column in the data frame containing only some of the words included in the first column, according to a regex pattern.
df = pd.DataFrame(["THIS IS A TEST 123123. s.m.", "THIS IS A Test test 123 .s.c.e", "TESTING T'TEST 123 da."],columns=['Test'])
df
Now, to extract what I want (words in capital case), in R I would generally use:
df <- str_extract_all(df$Test, "\\b[A-Z]{1,}\\b", simplify = FALSE)
to extract the matches of the regular expression in different data frame rows, which are:
* THIS IS A TEST
* THIS IS A
* TESTING T TEST
I couldn't find a similar solution for Python, and the closest I've got to is the following:
df["Name"] = df["Test"].str.extract(r"(\b[A-Z]{1,}\b)", expand = True)
Unfortunately this does not work, as it exports only the groups rather than the matches of the regex. I've tried multiple strategies, but str.extractall does not seem to work either ("TypeError: incompatible index of inserted column with frame index").
How can I extract the information I want with Python?
Thanks!
If I understand correctly, you can try:
df["Name"] = (df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)")
              .unstack().fillna('').apply(' '.join, axis=1))
[EDIT]:
Here is a shorter version I discovered by looking at the doc:
df["Name"] = df["Test"].str.extractall(r"(\b[A-Z]{1,}\b)").unstack(fill_value='').apply(' '.join, 1)
You are on the right track with the pattern. This solution uses a regular expression together with join and map:
import re

df['Name'] = df['Test'].map(lambda x: ' '.join(re.findall(r"\b[A-Z\s]+\b", x)))
Result:
Test Name
0 THIS IS A TEST 123123. s.m. THIS IS A TEST
1 THIS IS A Test test 123 .s.c.e THIS IS A
2 TESTING T'TEST 123 da. TESTING T TEST
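For comparison, the same extractall result can also be flattened with a groupby over the original row index instead of unstack; a sketch:
df["Name"] = (
    df["Test"]
    .str.extractall(r"(\b[A-Z]{1,}\b)")[0]  # one row per individual match
    .groupby(level=0)                       # regroup by the original row label
    .agg(' '.join)
)
Note that rows with no capital-case match would come back as NaN here rather than as an empty string.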

sub string python pandas

I have a pandas dataframe that has a string column in it. The length of the frame is over 2 million rows and looping to extract the elements I need is a poor choice. My current code looks like the following
for i in range(len(table["series_id"])):
    table["state_code"] = table["series_id"][i][2:4]
    table["area_code"] = table["series_id"][i][5:9]
    table["supersector_code"] = table["series_id"][i][11:12]
where "series_id" is the string containing multiple information fields I want to create an example data element:
columns:
[series_id, year, month, value, footnotes]
The data:
[['SMS01000000000000001' '2006' 'M01' 1966.5 '']
['SMS01000000000000001' '2006' 'M02' 1970.4 '']
['SMS01000000000000001' '2006' 'M03' 1976.6 '']
However, series_id is the column of interest that I am struggling with. I have looked at the str functions for Python and specifically pandas.
http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern
has a section describing each of the string functions; get and slice are specifically the functions I would like to use. Ideally I envision a solution like so:
table["state_code"] = table["series_id"].str.get(1:3)
or
table["state_code"] = table["series_id"].str.slice(1:3)
or
table["state_code"] = table["series_id"].str.slice([1:3])
When I tried these, I got an invalid syntax error for the ':', but alas I cannot seem to figure out the proper way to perform a vectorized substring operation on a pandas DataFrame column.
Thank you
I think I would use str.extract with some regex (which you can tweak for your needs):
In [11]: s = pd.Series(["SMU78000009092000001"])
In [12]: s.str.extract(r'^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})')
Out[12]:
state_code area_code supersector_code
0 U78 0000 92
This reads as: starts (^) with any two characters (which are ignored), the next three (any) characters are state_code, followed by any character (ignored), followed by four digits are area_code, ...
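To answer the literal syntax question as well: str.slice takes start and stop as separate arguments rather than a colon, so the vectorized equivalent of the original loop (offsets copied from the question) would be:
table["state_code"] = table["series_id"].str.slice(2, 4)
table["area_code"] = table["series_id"].str.slice(5, 9)
table["supersector_code"] = table["series_id"].str.slice(11, 12)
The bracket form table["series_id"].str[2:4] works as well.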
