Python: String match is not working with regular expression

We are trying to extract the rows whose column value is exactly one of the following values: TC1, TC2, TC3. The trick is that some rows contain values such as TC12, TC13, etc., and we don't want to extract those. Using str.contains is not an option here.
Col_1 Col_2 Col_3
1 A TC1
2 B TC2
3 C TC3
4 D TC12
5 D TC15
6 D TC16
Expected output:
Col_1 Col_2 Col_3
1 A TC1
2 B TC2
3 C TC3
We used the following commands:
df1 = df.loc[df['Col_3'].str.match("TC\d{1}")]
df1 = df.loc[df['Col_3'].str.match("TC[1-3]{1}")]
df1 = df.loc[df['Col_3'].str.match("TC[1,2,3]")]
But the problem is that it is not working. Instead of returning the first 3 rows, it returns all of the rows. We don't understand what's wrong.

I would do
import pandas as pd
df = pd.DataFrame({"col":['TC1','TC2','TC3','TC12','TC15','TC16']})
print(df[df["col"].str.match(r"^TC\d$")])
output
col
0 TC1
1 TC2
2 TC3
Explanation: I used ^ and $, which anchor the start and end of the string, so the pattern only detects full matches. r"..." is a so-called raw string, so I can use \d inside it without additional escaping (see the re docs for more about this). As a side note, "TC[1,2,3]" does not do what you think: characters enumerated inside [ ] take no separator, so the , is treated as a character to match:
import re
if re.match("TC[1,2,3]", "TC,"):
    print("match")
else:
    print("no match")
output
match
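To make the original problem concrete: str.match only anchors at the start of the string, so without a $ anchor (or str.fullmatch) the pattern also matches the longer codes. A minimal sketch, using the values from the question:
import pandas as pd
s = pd.Series(['TC1', 'TC2', 'TC3', 'TC12', 'TC15', 'TC16'])
# str.match anchors only at the start, so "TC\d" also matches "TC12"
print(s.str.match(r"TC\d").tolist())      # [True, True, True, True, True, True]
print(s.str.match(r"^TC\d$").tolist())    # [True, True, True, False, False, False]
print(s.str.fullmatch(r"TC\d").tolist())  # [True, True, True, False, False, False]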

You can use str.contains -
df = df[df.Col_3.str.contains(pat=r'^TC\d$')]
or via str.match -
df = df[df.Col_3.str.match(pat=r'^TC\d$')]
or via str.fullmatch -
df = df[df.Col_3.str.fullmatch(pat=r'TC\d')]
or via apply (slow) -
import re
df = df[df.Col_3.apply(lambda x: re.match(r'^TC\d$', x)).notna()]
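If the allowed codes are a small fixed set, a non-regex alternative (not from the original answer, just a sketch) is isin:
import pandas as pd
df = pd.DataFrame({'Col_1': [1, 2, 3, 4, 5, 6],
                   'Col_2': ['A', 'B', 'C', 'D', 'D', 'D'],
                   'Col_3': ['TC1', 'TC2', 'TC3', 'TC12', 'TC15', 'TC16']})
# exact membership test against the allowed values, no regex needed
print(df[df['Col_3'].isin(['TC1', 'TC2', 'TC3'])])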

Related

Pandas remove duplicates within the list of values and identifying id's that share the same values

I have a pandas dataframe.
I used to have duplicate test_no values, so I removed the duplicates with
df['test_no'] = df['test_no'].apply(lambda x: ','.join(set(x.split(','))))
but as you can see the duplicates are still there; I think it's due to extra spaces, and I want to clean them up.
Part 1:
my_id test_no
0 10000000000055910 461511, 461511
1 10000000000064510 528422
2 10000000000064222 528422,528422 , 528421
3 10000000000161538 433091.0, 433091.0
4 10000000000231708 nan,nan
Expected Output
my_id test_no
0 10000000000055910 461511
1 10000000000064510 528422
2 10000000000064222 528422, 528421
3 10000000000161538 433091.0
4 10000000000231708 nan
Part 2:
I also want to check whether any of the my_id values share any of the test_no values;
for example :
my_id matched_myid
10000000000064222 10000000000064510
You can use a regex to split:
import re
df['test_no'] = df['test_no'].apply(lambda x: ','.join(set(re.split(r'\s*,\s*', x))))
# or
df['test_no'] = [','.join(set(re.split(r'\s*,\s*', x))) for x in df['test_no']]
If you want to keep the original order use dict.fromkeys in place of set.
If the duplicates are successive you can also use:
df['test_no'] = df['test_no'].str.replace(r'([^,\s]+),\s*\1', r'\1', regex=True)
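For the order-preserving variant mentioned above, a minimal sketch using made-up rows shaped like the question's data:
import re
import pandas as pd
df = pd.DataFrame({'test_no': ['461511, 461511', '528422', '528422,528422 , 528421']})
# split on commas with optional surrounding spaces; dict.fromkeys removes
# duplicates while keeping first-seen order (a plain set would not)
df['test_no'] = df['test_no'].apply(
    lambda x: ', '.join(dict.fromkeys(re.split(r'\s*,\s*', x))))
print(df)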

Removing comma from values in column (csv file) using Python Pandas

I want to remove commas from a column named size.
CSV looks like below:
number name size
1 Car 9,32,123
2 Bike 1,00,000
3 Truck 10,32,111
I want the output as below:
number name size
1 Car 932123
2 Bike 100000
3 Truck 1032111
I am using Python 3 and the pandas module for handling this CSV.
I am trying the replace method but I don't get the desired output.
Snapshot from my code:
import pandas as pd
df = pd.read_csv("file.csv")
# df.replace(",", "")
# df['size'] = df['size'].replace(to_replace=",", value="")
# df['size'] = df['size'].replace(",", "")
df['size'] = df['size'].replace({",", ""})
print(df['size'])  # expecting to see the 'size' column without commas
I don't see any error/exception. The last line print(df['size']) simply displays the values as they are, i.e., with commas.
With replace, we need regex=True because otherwise it looks for an exact match of the whole cell value, i.e., it would only replace cells whose entire content is ,:
>>> df["size"] = df["size"].replace(",", "", regex=True)
>>> df
number name size
0 1 Car 932123
1 2 Bike 100000
2 3 Truck 1032111
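To see why regex=True matters, a small sketch with the same column: without it, replace only substitutes cells whose entire value equals the pattern, so nothing changes.
import pandas as pd
df = pd.DataFrame({'size': ['9,32,123', '1,00,000', '10,32,111']})
# without regex=True nothing matches, because no cell is exactly ","
print(df['size'].replace(',', '').tolist())
# with regex=True the comma is treated as a pattern and stripped everywhere
print(df['size'].replace(',', '', regex=True).tolist())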
I am using Python 3 and the pandas module for handling this CSV
Note that the pandas.read_csv function has an optional thousands argument; if , is used to denote thousands, you might set thousands=",". Consider the following example:
import io
import pandas as pd
some_csv = io.StringIO('value\n"1"\n"1,000"\n"1,000,000"\n')
df = pd.read_csv(some_csv, thousands=",")
print(df)
output
value
0 1
1 1000
2 1000000
For brevity I used io.StringIO; the same effect can be achieved by passing the name of a file with the same content as the first argument to pd.read_csv.
Try with str.replace instead:
df['size'] = df['size'].str.replace(',', '')
Optionally convert to int with astype:
df['size'] = df['size'].str.replace(',', '').astype(int)
number name size
0 1 Car 932123
1 2 Bike 100000
2 3 Truck 1032111
Sample Frame Used:
df = pd.DataFrame({'number': [1, 2, 3], 'name': ['Car', 'Bike', 'Truck'],
                   'size': ['9,32,123', '1,00,000', '10,32,111']})
number name size
0 1 Car 9,32,123
1 2 Bike 1,00,000
2 3 Truck 10,32,111
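If some cells might be missing or malformed, a slightly more defensive variant (an assumption on my part, not part of the answer above) is to clean and then convert with pd.to_numeric:
import pandas as pd
df = pd.DataFrame({'size': ['9,32,123', '1,00,000', None]})
# strip commas, then convert; errors='coerce' turns unparsable values into NaN
df['size'] = pd.to_numeric(df['size'].str.replace(',', '', regex=False),
                           errors='coerce')
print(df)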

Transform complicated string based on criteria in python and pandas

I have a column that contains a complicated string format. I would like to keep the first word only, and/or keep the first word in addition to certain other words.
I wish to keep certain key words in the string, such as 'RED', 'DB', 'APP', 'Infra', etc.
DATA
type grp
Goodbye-CCC-LET-TestData-A.1 a
Hello-PIR-SSS-Hellosims-App-INN-A.0 b
Hello-PIR-SSS-DB-RED-INN-C.0 c
Hello-PIR-SSS-App-SA200-F.0 d
Goodbye-PIR-SIR-DB_set-int-e.1 c
OK-PIR-SVV-Infra_ll-NA-A.0 e
DESIRED
type grp
Goodbye a
Hello-App b
Hello-DB-RED c
Hello-App d
Goodbye-DB c
OK-Infra e
DOING
s = (df['type'].str.split('-')
               .str[0]
               .str.cat(df['type'].str.extract(r'(\d.\d+T|\d+T)', expand=False),
                        sep=' ',
                        na_rep='')
               .str.strip())
df.insert(1, 'type', s)
The code above just gives me the first word, for example:
Goodbye
Hello
OK
Any suggestion is appreciated. I am still researching.
You can use str.extractall on your series, then join the values:
import pandas as pd
import re
df.drop(columns='type').join(
    df['type'].str.extractall(r'(^\w+)-|(app|red|infra|db)', flags=re.IGNORECASE)
              .stack()
              .groupby(level=0)
              .agg(type='-'.join))
grp type
0 a Goodbye
1 b Hello-App
2 c Hello-DB-RED
3 d Hello-App
4 c Goodbye-DB
5 e OK-Infra
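An alternative without extractall is to split on '-' and keep the first token plus any keyword tokens. A minimal sketch with a subset of the question's rows; note that unlike the extractall answer it matches whole tokens only, so 'DB_set' would not be reduced to 'DB':
import pandas as pd
df = pd.DataFrame({'type': ['Goodbye-CCC-LET-TestData-A.1',
                            'Hello-PIR-SSS-Hellosims-App-INN-A.0',
                            'Hello-PIR-SSS-DB-RED-INN-C.0'],
                   'grp': ['a', 'b', 'c']})
keywords = {'red', 'db', 'app', 'infra'}

def shorten(value):
    # keep the first token, then any later token that is a keyword (case-insensitive)
    tokens = value.split('-')
    return '-'.join([tokens[0]] + [t for t in tokens[1:] if t.lower() in keywords])

df['type'] = df['type'].apply(shorten)
print(df)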

Pandas - Extract a string starting with a particular character

It should be fairly simple yet I'm not able to achieve it.
I have a dataframe df1, having a column "name_str". Example below:
name_str
0 alp:ha
1 bra:vo
2 charl:ie
I have to create another column comprising, say, the 5 characters that start after the colon (:). I've written the following code:
import pandas as pd
data = {'name_str':["alp:ha", "bra:vo", "charl:ie"]}
#indx = ["name_1",]
df1 = pd.DataFrame(data=data)
n= df1['name_str'].str.find(":")+1
df1['slize'] = df1['name_str'].str.slice(n,2)
print(df1)
But the output is disappointing: NaN.
name_str slize
0 alp:ha NaN
1 bra:vo NaN
2 charl:ie NaN
The output should've been:
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Would anyone please help? Appreciate it.
You can use str.extract to extract everything after the colon with this regular expression: :(.*)
df1['slize'] = df1.name_str.str.extract(':(.*)')
>>> df1
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Edit, based on your updated question
If you'd like to extract up to 5 characters after the colon, then you can use this modification:
df1['slize'] = df1.name_str.str.extract(':(.{,5})')
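A non-regex alternative (just a sketch, assuming the same df1) is to split on the colon and slice the remainder:
import pandas as pd
df1 = pd.DataFrame({'name_str': ['alp:ha', 'bra:vo', 'charl:ie']})
# take everything after the first colon, then at most 5 characters of it
df1['slize'] = df1['name_str'].str.split(':', n=1).str[1].str[:5]
print(df1)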

pandas: Replace string is not replacing targeted substring

I am iterating over a list of strings from dataframe1 to check whether dataframe2 contains any of those strings, and to replace them if it does.
for index, row in nlp_df.iterrows():
    print(row['x1'])
    string1 = row['x1'].replace("(", "\(")
    string1 = string1.replace(")", "\)")
    string1 = string1.replace("[", "\[")
    string1 = string1.replace("]", "\]")
    nlp2_df['title'] = nlp2_df['title'].replace(string1, "")
In order to do this I iterated using the code shown above to check and replace for any string found in df1
The output below shows the strings in df1:
wait_timeout
interactive_timeout
pool_recycle
....
__all__
folder_name
re.compile('he(lo')
The output below shows df2 after replacing the strings:
0 have you tried watching the traffic between th...
1 /dev/cu.xxxxx is the "callout" device, it's wh...
2 You'll want the struct package.\r\r\n
In the df2 output, strings like /dev/cu.xxxxx should have been replaced during the iteration, but as shown they are not removed. However, I have attempted using nlp2_df['title'] = nlp2_df['title'].replace("/dev/cu.xxxxx","") and managed to remove it successfully.
Is there a reason why writing the string directly works, but replacing via a variable in the loop does not?
IIUC you can simply use regular expressions:
nlp2_df['title'] = nlp2_df['title'].str.replace(r'([\(\)\[\]])', r'\\\1', regex=True)
PS: you don't need a for loop at all...
Demo:
In [15]: df
Out[15]:
title
0 aaa (bbb) ccc
1 A [word] ...
In [16]: df['new'] = df['title'].str.replace(r'([\(\)\[\]])', r'\\\1', regex=True)
In [17]: df
Out[17]:
title new
0 aaa (bbb) ccc aaa \(bbb\) ccc
1 A [word] ... A \[word\] ...
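For the original loop, instead of escaping each special character by hand, re.escape can build a literal pattern from any string. A minimal sketch with made-up data, assuming the column names from the question:
import re
import pandas as pd
nlp_df = pd.DataFrame({'x1': ["re.compile('he(lo')", '__all__']})
nlp2_df = pd.DataFrame({'title': ["see re.compile('he(lo') for details",
                                  'exports are listed in __all__']})
for value in nlp_df['x1']:
    # re.escape escapes (, ), [, ], . and every other regex metacharacter
    nlp2_df['title'] = nlp2_df['title'].str.replace(re.escape(value), '', regex=True)
print(nlp2_df)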
