Python: String match is not working with regular expression

We are trying to extract the rows whose column value is exactly one of the following values: TC1, TC2, TC3. The trick is that some rows contain values such as TC12, TC13, etc., and we don't want to extract those. Using str.contains is not an option here.
Col_1 Col_2 Col_3
1 A TC1
2 B TC2
3 C TC3
4 D TC12
5 D TC15
6 D TC16
Expected output:
Col_1 Col_2 Col_3
1 A TC1
2 B TC2
3 C TC3
We used the following commands:
df1 = df.loc[df['Col_3'].str.match("TC\d{1}")]
df1 = df.loc[df['Col_3'].str.match("TC[1-3]{1}")]
df1 = df.loc[df['Col_3'].str.match("TC[1,2,3]")]
But the problem is that it is not working. Instead of returning the first 3 rows, it returns all of the rows. We don't understand what's wrong.

I would do
import pandas as pd
df = pd.DataFrame({"col":['TC1','TC2','TC3','TC12','TC15','TC16']})
print(df[df["col"].str.match(r"^TC\d$")])
output
col
0 TC1
1 TC2
2 TC3
Explanation: I used ^ and $, which anchor the start and end of the string, so the pattern only detects full matches. r"..." is a so-called raw string, so I can use \d inside it without additional escaping (see the re docs for more about this). As a side note, "TC[1,2,3]" does not do what you think: characters enumerated inside [ ] take no separator, so the , is treated as a character to match:
import re
if re.match("TC[1,2,3]", "TC,"):
    print("match")
else:
    print("no match")
output
match
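To make the original problem concrete: str.match only anchors at the start of the string, so without a $ anchor (or str.fullmatch) the pattern also matches the longer codes. A minimal sketch, using the values from the question:
import pandas as pd
s = pd.Series(['TC1', 'TC2', 'TC3', 'TC12', 'TC15', 'TC16'])
# str.match anchors only at the start, so "TC\d" also matches "TC12"
print(s.str.match(r"TC\d").tolist())      # [True, True, True, True, True, True]
print(s.str.match(r"^TC\d$").tolist())    # [True, True, True, False, False, False]
print(s.str.fullmatch(r"TC\d").tolist())  # [True, True, True, False, False, False]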

You can use str.contains -
df = df[df.Col_3.str.contains(pat=r'^TC\d$')]
or via str.match -
df = df[df.Col_3.str.match(pat=r'^TC\d$')]
or via str.fullmatch -
df = df[df.Col_3.str.fullmatch(pat=r'TC\d')]
or via apply (slow) -
import re
df = df[df.Col_3.apply(lambda x: re.match(r'^TC\d$', x)).notna()]
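If the allowed codes are a small fixed set, a non-regex alternative (not from the original answer, just a sketch) is isin:
import pandas as pd
df = pd.DataFrame({'Col_1': [1, 2, 3, 4, 5, 6],
                   'Col_2': ['A', 'B', 'C', 'D', 'D', 'D'],
                   'Col_3': ['TC1', 'TC2', 'TC3', 'TC12', 'TC15', 'TC16']})
# exact membership test against the allowed values, no regex needed
print(df[df['Col_3'].isin(['TC1', 'TC2', 'TC3'])])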

Related

Pandas remove duplicates within the list of values and identifying id's that share the same values

I have a pandas dataframe.
I used to have duplicate test_no values, so I removed the duplicates with
df['test_no'] = df['test_no'].apply(lambda x: ','.join(set(x.split(','))))
but as you can see the duplicates are still there; I think it's due to extra spaces, and I want to clean them up.
Part 1:
my_id test_no
0 10000000000055910 461511, 461511
1 10000000000064510 528422
2 10000000000064222 528422,528422 , 528421
3 10000000000161538 433091.0, 433091.0
4 10000000000231708 nan,nan
Expected Output
my_id test_no
0 10000000000055910 461511
1 10000000000064510 528422
2 10000000000064222 528422, 528421
3 10000000000161538 433091.0
4 10000000000231708 nan
Part 2:
I also want to check whether any of the my_id values share any of the test_no values;
for example :
my_id matched_myid
10000000000064222 10000000000064510
You can use a regex to split:
import re
df['test_no'] = df['test_no'].apply(lambda x: ','.join(set(re.split(r'\s*,\s*', x))))
# or
df['test_no'] = [','.join(set(re.split(r'\s*,\s*', x))) for x in df['test_no']]
If you want to keep the original order use dict.fromkeys in place of set.
If the duplicates are successive you can also use:
df['test_no'] = df['test_no'].str.replace(r'([^,\s]+),\s*\1', r'\1', regex=True)
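For the order-preserving variant mentioned above, a minimal sketch using made-up rows shaped like the question's data:
import re
import pandas as pd
df = pd.DataFrame({'test_no': ['461511, 461511', '528422', '528422,528422 , 528421']})
# split on commas with optional surrounding spaces; dict.fromkeys removes
# duplicates while keeping first-seen order (a plain set would not)
df['test_no'] = df['test_no'].apply(
    lambda x: ', '.join(dict.fromkeys(re.split(r'\s*,\s*', x))))
print(df)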

Removing comma from values in column (csv file) using Python Pandas

I want to remove commas from a column named size.
CSV looks like below:
number name size
1 Car 9,32,123
2 Bike 1,00,000
3 Truck 10,32,111
I want the output as below:
number name size
1 Car 932123
2 Bike 100000
3 Truck 1032111
I am using Python 3 and the pandas module for handling this CSV.
I am trying the replace method but I don't get the desired output.
Snapshot from my code:
import pandas as pd
df = pd.read_csv("file.csv")
# df.replace(",", "")
# df['size'] = df['size'].replace(to_replace=",", value="")
# df['size'] = df['size'].replace(",", "")
df['size'] = df['size'].replace({",", ""})
print(df['size'])  # expecting to see the 'size' column without commas
I don't see any error/exception. The last line print(df['size']) simply displays the values as they are, i.e., with commas.
With replace, we need regex=True because otherwise it looks for an exact match of the whole cell value, i.e., it would only replace cells whose entire content is ,:
>>> df["size"] = df["size"].replace(",", "", regex=True)
>>> df
number name size
0 1 Car 932123
1 2 Bike 100000
2 3 Truck 1032111
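To see why regex=True matters, a small sketch with the same column: without it, replace only substitutes cells whose entire value equals the pattern, so nothing changes.
import pandas as pd
df = pd.DataFrame({'size': ['9,32,123', '1,00,000', '10,32,111']})
# without regex=True nothing matches, because no cell is exactly ","
print(df['size'].replace(',', '').tolist())
# with regex=True the comma is treated as a pattern and stripped everywhere
print(df['size'].replace(',', '', regex=True).tolist())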
I am using Python 3 and the pandas module for handling this CSV
Note that the pandas.read_csv function has an optional thousands argument; if , is used to denote thousands, you might set thousands=",". Consider the following example:
import io
import pandas as pd
some_csv = io.StringIO('value\n"1"\n"1,000"\n"1,000,000"\n')
df = pd.read_csv(some_csv, thousands=",")
print(df)
output
value
0 1
1 1000
2 1000000
For brevity I used io.StringIO; the same effect can be achieved by passing the name of a file with the same content as the first argument to pd.read_csv.
Try with str.replace instead:
df['size'] = df['size'].str.replace(',', '')
Optionally convert to int with astype:
df['size'] = df['size'].str.replace(',', '').astype(int)
number name size
0 1 Car 932123
1 2 Bike 100000
2 3 Truck 1032111
Sample Frame Used:
df = pd.DataFrame({'number': [1, 2, 3], 'name': ['Car', 'Bike', 'Truck'],
                   'size': ['9,32,123', '1,00,000', '10,32,111']})
number name size
0 1 Car 9,32,123
1 2 Bike 1,00,000
2 3 Truck 10,32,111
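If some cells might be missing or malformed, a slightly more defensive variant (an assumption on my part, not part of the answer above) is to clean and then convert with pd.to_numeric:
import pandas as pd
df = pd.DataFrame({'size': ['9,32,123', '1,00,000', None]})
# strip commas, then convert; errors='coerce' turns unparsable values into NaN
df['size'] = pd.to_numeric(df['size'].str.replace(',', '', regex=False),
                           errors='coerce')
print(df)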

Transform complicated string based on criteria in python and pandas

I have a column that contains a complicated string format. I would like to keep the first word only, and/or keep the first word in addition to certain other words.
I wish to keep certain key words in the string, such as 'RED', 'DB', 'APP', 'Infra', etc.
DATA
type grp
Goodbye-CCC-LET-TestData-A.1 a
Hello-PIR-SSS-Hellosims-App-INN-A.0 b
Hello-PIR-SSS-DB-RED-INN-C.0 c
Hello-PIR-SSS-App-SA200-F.0 d
Goodbye-PIR-SIR-DB_set-int-e.1 c
OK-PIR-SVV-Infra_ll-NA-A.0 e
DESIRED
type grp
Goodbye a
Hello-App b
Hello-DB-RED c
Hello-App d
Goodbye-DB c
OK-Infra e
DOING
s = (df['type'].str.split('-')
               .str[0]
               .str.cat(df['type'].str.extract(r'(\d.\d+T|\d+T)', expand=False),
                        sep=' ',
                        na_rep='')
               .str.strip())
df.insert(1, 'type', s)
The code above just gives me the first word, for example:
Goodbye
Hello
OK
Any suggestion is appreciated. I am still researching.
You can use str.extractall on your series, then join the values:
import pandas as pd
import re
df.drop(columns='type').join(
    df['type'].str.extractall(r'(^\w+)-|(app|red|infra|db)', flags=re.IGNORECASE)
              .stack()
              .groupby(level=0)
              .agg(type='-'.join))
grp type
0 a Goodbye
1 b Hello-App
2 c Hello-DB-RED
3 d Hello-App
4 c Goodbye-DB
5 e OK-Infra
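An alternative without extractall is to split on '-' and keep the first token plus any keyword tokens. A minimal sketch with a subset of the question's rows; note that unlike the extractall answer it matches whole tokens only, so 'DB_set' would not be reduced to 'DB':
import pandas as pd
df = pd.DataFrame({'type': ['Goodbye-CCC-LET-TestData-A.1',
                            'Hello-PIR-SSS-Hellosims-App-INN-A.0',
                            'Hello-PIR-SSS-DB-RED-INN-C.0'],
                   'grp': ['a', 'b', 'c']})
keywords = {'red', 'db', 'app', 'infra'}

def shorten(value):
    # keep the first token, then any later token that is a keyword (case-insensitive)
    tokens = value.split('-')
    return '-'.join([tokens[0]] + [t for t in tokens[1:] if t.lower() in keywords])

df['type'] = df['type'].apply(shorten)
print(df)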

Pandas - Extract a string starting with a particular character

It should be fairly simple yet I'm not able to achieve it.
I have a dataframe df1, having a column "name_str". Example below:
name_str
0 alp:ha
1 bra:vo
2 charl:ie
I have to create another column comprising, say, the 5 characters that start after the colon (:). I've written the following code:
import pandas as pd
data = {'name_str':["alp:ha", "bra:vo", "charl:ie"]}
#indx = ["name_1",]
df1 = pd.DataFrame(data=data)
n= df1['name_str'].str.find(":")+1
df1['slize'] = df1['name_str'].str.slice(n,2)
print(df1)
But the output is disappointing: NaN.
name_str slize
0 alp:ha NaN
1 bra:vo NaN
2 charl:ie NaN
The output should've been:
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Would anyone please help? Appreciate it.
You can use str.extract to extract everything after the colon with this regular expression: :(.*)
df1['slize'] = df1.name_str.str.extract(':(.*)')
>>> df1
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Edit, based on your updated question
If you'd like to extract up to 5 characters after the colon, then you can use this modification:
df1['slize'] = df1.name_str.str.extract(':(.{,5})')
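A non-regex alternative (just a sketch, assuming the same df1) is to split on the colon and slice the remainder:
import pandas as pd
df1 = pd.DataFrame({'name_str': ['alp:ha', 'bra:vo', 'charl:ie']})
# take everything after the first colon, then at most 5 characters of it
df1['slize'] = df1['name_str'].str.split(':', n=1).str[1].str[:5]
print(df1)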

pandas: Replace string is not replacing targeted substring

I am iterating over a list of strings from dataframe1 to check whether dataframe2 contains any of those strings, and to replace them if it does.
for index, row in nlp_df.iterrows():
    print(row['x1'])
    string1 = row['x1'].replace("(", "\(")
    string1 = string1.replace(")", "\)")
    string1 = string1.replace("[", "\[")
    string1 = string1.replace("]", "\]")
    nlp2_df['title'] = nlp2_df['title'].replace(string1, "")
In order to do this I iterated using the code shown above to check and replace for any string found in df1
The output below shows the strings in df1:
wait_timeout
interactive_timeout
pool_recycle
....
__all__
folder_name
re.compile('he(lo')
The output below shows df2 after replacing the strings:
0 have you tried watching the traffic between th...
1 /dev/cu.xxxxx is the "callout" device, it's wh...
2 You'll want the struct package.\r\r\n
In the df2 output, strings like /dev/cu.xxxxx should have been replaced during the iteration, but as shown they are not removed. However, I have attempted using nlp2_df['title'] = nlp2_df['title'].replace("/dev/cu.xxxxx","") and managed to remove it successfully.
Is there a reason why writing the string directly works, but replacing via a variable in the loop does not?
IIUC you can simply use regular expressions:
nlp2_df['title'] = nlp2_df['title'].str.replace(r'([\(\)\[\]])', r'\\\1', regex=True)
PS: you don't need a for loop at all...
Demo:
In [15]: df
Out[15]:
title
0 aaa (bbb) ccc
1 A [word] ...
In [16]: df['new'] = df['title'].str.replace(r'([\(\)\[\]])', r'\\\1', regex=True)
In [17]: df
Out[17]:
title new
0 aaa (bbb) ccc aaa \(bbb\) ccc
1 A [word] ... A \[word\] ...
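For the original loop, instead of escaping each special character by hand, re.escape can build a literal pattern from any string. A minimal sketch with made-up data, assuming the column names from the question:
import re
import pandas as pd
nlp_df = pd.DataFrame({'x1': ["re.compile('he(lo')", '__all__']})
nlp2_df = pd.DataFrame({'title': ["see re.compile('he(lo') for details",
                                  'exports are listed in __all__']})
for value in nlp_df['x1']:
    # re.escape escapes (, ), [, ], . and every other regex metacharacter
    nlp2_df['title'] = nlp2_df['title'].str.replace(re.escape(value), '', regex=True)
print(nlp2_df)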
