I have a list of names written in different notations, for example:
myList = ['ab2000', 'abc2000_2000', 'AB2000', 'ab2000_1', 'ABC2000_01', 'AB2000_2', 'ABC2000_02', 'AB2000_A1']
The standardized versions of those notations are, for example:
'ab2000' is 'ABC2000'
'ab2000_1' is 'ABC2000_01'
'AB2000_A1' is 'ABC2000_A1'
What I tried is to split each string into its components using a compiled regex (re.compile).
input:
compiled = re.compile(r'[A-Za-z]+|\d+|\W+')
compiled.findall("AB2000_2000_A1")
output:
characters = ['AB', '2000', '2000', 'A', '1']
Then applying:
characters = list(set(characters))
Finally, I try to match the values of that list against the main components of the string: an alphabetic part, followed by a numeric part, followed by an alphanumeric part.
But as you can see in the previous output, I can't get 'A1' as a single element using \W+. My desired output is:
characters = ['AB', '2000', '2000', 'A1']
Any idea how to fix that?
Or any better idea for solving my problem in general? Thank you in advance.
Use the following pattern with capturing groups, two of them wrapped in optional non-capturing groups:
r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?'
together with the re.I flag.
Note that (?:_([A-Z\d]+))? must be written twice in order to capture both the third and the fourth group. If you instead wrote it once and repeated it with "*", the capturing group would keep only the last repetition, and the earlier suffix would be lost.
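The difference between repeating the optional group with "*" and writing it out twice can be seen in a small sketch. The variable names here are only illustrative; the input is the longest name from the question.

```python
import re

name = "AB2000_2000_A1"

# One suffix group repeated with "*": only the last repetition is kept.
starred = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))*', re.I)
starred_groups = starred.match(name).groups()

# The suffix group written out twice: each suffix gets its own slot.
explicit = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?', re.I)
explicit_groups = explicit.match(name).groups()

print(starred_groups)   # ('AB', '2000', 'A1')
print(explicit_groups)  # ('AB', '2000', '2000', 'A1')
```

With the starred version the middle '2000' is matched but not retained, which is why the pattern spells the optional group out once per expected suffix.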
To test it, I ran the following code:
import re

myList = ['ab2000', 'abc2000_2000', 'AB2000', 'ab2000_1', 'ABC2000_01',
          'AB2000_2', 'ABC2000_02', 'AB2000_A1', 'AB2000_2000_A1']
pat = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?', re.I)
for tt in myList:
    print(f'{tt:16} ', end=' ')
    mtch = pat.match(tt)
    if mtch:
        # Print only the groups that took part in the match
        for it in mtch.groups():
            if it is not None:
                print(f'{it:5}', end=' ')
    print()
getting:
ab2000 ab 2000
abc2000_2000 abc 2000 2000
AB2000 AB 2000
ab2000_1 ab 2000 1
ABC2000_01 ABC 2000 01
AB2000_2 AB 2000 2
ABC2000_02 ABC 2000 02
AB2000_A1 AB 2000 A1
AB2000_2000_A1 AB 2000 2000 A1
I want to extract only the numbers before a list of specific words. Then put the extracted numbers in a new column.
The list of words is: l = ["car", "truck", "van"]. I only put singular form here, but it should also apply to plural.
df = pd.DataFrame(columns=["description"], data=[["have 3 cars"], ["a 1-car situation"], ["may be 2 trucks"]])
The new column for the extracted numbers can be called df["extracted_num"].
Thank you!
You can use Series.str.extract:
l = ["car", "truck", "van"]
pat = rf"(\d+)[\s-](?:{'|'.join(l)})"
df['extracted_num'] = df['description'].str.extract(pat)
Output:
>>> print(pat)
(\d+)[\s-](?:car|truck|van)
>>> df
description extracted_num
0 have 3 cars 3
1 a 1-car situation 1
2 may be 2 trucks 2
Explanation:
(\d+) - Matches one or more digits and captures the group;
[\s-] - Matches a single space or hyphen;
(?:{'|'.join(l)}) - Matches any word from the list l without capturing it. Nothing is required after the word, so plural forms such as cars match too.
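The pattern can also be tried without pandas. This is a minimal sketch using only the re module; the fourth description ("no vehicles") is an invented input added to show what happens when nothing matches.

```python
import re

words = ["car", "truck", "van"]
alternation = "|".join(words)
pattern = re.compile(rf"(\d+)[\s-](?:{alternation})")

descriptions = ["have 3 cars", "a 1-car situation", "may be 2 trucks", "no vehicles"]
extracted = []
for text in descriptions:
    m = pattern.search(text)
    # group(1) is the number; None marks a description without a match
    extracted.append(m.group(1) if m else None)

print(extracted)  # ['3', '1', '2', None]
```

If the word list could ever contain regex metacharacters, passing each word through re.escape before joining would be a sensible precaution.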
I tried looking for previous posts but couldn't find anything that matches exactly what I'm looking for so here goes.
I'm trying to parse through strings in a dataframe and capture a certain substring (year) if a match is found. The formatting can vary a lot and I figured out a non-elegant way to get it done but I wonder if there is a better way.
Strings can look like this:
Random Text 31.12.2020
1.1. -31.12.2020
010120-311220
31.12.2020
1.1.2020-31.12.2020 -
1.1.2019 - 31.12.2019
1.1. . . 31.12.2019 -
1.1.2019 - -31.12.2019
010120-311220 other random words
I'm looking to find the year, currently by finding the last date and its year.
Current regex is .+3112(\d{2,4})|.+31\.12\.(\d{2,4}) where
it would return 20 in group 1 for 010120-311220,
and it would return 2020 in group 2 for 1.1.2020-31.12.2020 -.
The problem is I cannot know beforehand which group the match will belong to, as in the first example group 2 doesn't exist and in the second example group 1 will return None when using re.match(regexPattern, stringOfInterest). Therefore I couldn't access the value by naively using .group(1) on the match object, as sometimes the value would be in .group(2).
The best I've come up with so far is naming the groups with (?P<groupName>\d{2,4}) and checking for None values:
import re

def getYear(stringOfInterest):
    regexPattern = r'(^|.+)3112(?P<firstMatchType>\d{2,4})|(^|.+)31\.12\.(?P<secondMatchType>\d{2,4})'
    matchObject = re.match(regexPattern, stringOfInterest)
    if matchObject is not None:
        matchDict = matchObject.groupdict()
        if matchDict['firstMatchType'] is not None:
            return matchDict['firstMatchType']
        else:
            return matchDict['secondMatchType']
    return None

df['year'] = df['text'].apply(getYear)
And while this works it intuitively seems like a stupid way to do it. Any ideas?
It looks like all your years are from the 21st century. In this case, all you need is
df['year'] = '20' + df['text'].str.extract(r'.*31\.?12\.?(?:\d{2})?(\d{2})', expand=False)
Details:
.* - any zero or more chars other than line break chars, as many as possible
31\.?12\.? - 31, an optional ., 12, and an optional . char
(?:\d{2})? - an optional sequence of two digits (the century, when the year is written with four digits)
(\d{2}) - Group 1: the last two digits of the year.
See a Pandas test:
import pandas as pd
df = pd.DataFrame({'text': ['Random Text 31.12.2020','1.1. -31.12.2020','010120-311220','31.12.2020','1.1.2020-31.12.2020 -','1.1.2019 - 31.12.2019','1.1. . . 31.12.2019 -','1.1.2019 - -31.12.2019','010120-311220 other random words']})
df['year'] = '20' + df['text'].str.extract(r'.*31\.?12\.?(?:\d{2})?(\d{2})', expand=False)
Output:
>>> df
text year
0 Random Text 31.12.2020 2020
1 1.1. -31.12.2020 2020
2 010120-311220 2020
3 31.12.2020 2020
4 1.1.2020-31.12.2020 - 2020
5 1.1.2019 - 31.12.2019 2019
6 1.1. . . 31.12.2019 - 2019
7 1.1.2019 - -31.12.2019 2019
8 010120-311220 other random words 2020
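The same pattern can be checked with the re module alone. This is only a sketch on three of the sample strings; it assumes, as the answer does, that every year belongs to the 2000s.

```python
import re

pattern = re.compile(r'.*31\.?12\.?(?:\d{2})?(\d{2})')

samples = ['Random Text 31.12.2020', '010120-311220', '1.1.2019 - -31.12.2019']
years = []
for text in samples:
    m = pattern.match(text)
    # Group 1 holds the last two digits of the year; prefix the century
    years.append('20' + m.group(1) if m else None)

print(years)  # ['2020', '2020', '2019']
```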
We can try using re.findall here against your input list, with a regex alternation covering both variants:
import re
inp = ["Random Text 31.12.2020", "1.1. -31.12.2020", "010120-311220", "31.12.2020", "1.1.2020-31.12.2020 -", "1.1.2019 - 31.12.2019", "1.1. . . 31.12.2019 -", "1.1.2019 - -31.12.2019", "010120-311220 other random words"]
output = [re.findall(r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})', x)[-1] for x in inp]
output = [x[0] if x[0] else x[1] for x in output]
print(output) # ['2020', '2020', '20', '2020', '2020', '2019', '2019', '2019', '20']
The strategy here is to match either of the two date variants and retain the last match for each input. Then a list comprehension picks the non-empty value. Note that there are two capture groups but only one of them participates in any given match, so the other comes back as an empty string.
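An alternative way to pick whichever group matched is the match object's lastindex attribute, which holds the number of the last capturing group that took part in the match. The sketch below uses finditer instead of findall so that match objects are available; last_year is a hypothetical helper name.

```python
import re

pattern = re.compile(r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})')

def last_year(text):
    """Return the year captured by the last date-like match, or None."""
    matches = list(pattern.finditer(text))
    if not matches:
        return None
    last = matches[-1]
    # Exactly one of the two groups participated; lastindex is its number.
    return last.group(last.lastindex)

print(last_year("1.1.2019 - 31.12.2019"))  # 2019
print(last_year("010120-311220"))          # 20
print(last_year("no date here"))           # None
```

This avoids the second pass that filters out empty strings, at the cost of building a list of match objects per input.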
Your regex can be simplified a lot by grouping just the alternation at the start of the date; this removes the need to check two groups:
regexPattern = r'(?:^|.+)(?:3112|31\.12\.)(?P<year>\d{2,4})'
Once the group is extracted, it can be normalized into a proper four-digit year:
if matchObject is not None:
    return ('20' + matchObject.group('year'))[-4:]
All in all, we get:
import re

def getYear(stringOfInterest):
    regexPattern = r'(?:^|.+)(?:3112|31\.12\.)(?P<year>\d{2,4})'
    matchObject = re.match(regexPattern, stringOfInterest)
    if matchObject is not None:
        # Prefix '20' and keep the last four characters: '20' -> '2020', '2019' -> '2019'
        return ('20' + matchObject.group('year'))[-4:]
    return None

df['year'] = df['text'].apply(getYear)
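The slicing step is easy to check in isolation. This is a minimal sketch that assumes plain strings instead of a DataFrame column; get_year is just a snake_case restatement of the function above, and 'no date' is an invented input.

```python
import re

YEAR_RE = re.compile(r'(?:^|.+)(?:3112|31\.12\.)(?P<year>\d{2,4})')

def get_year(text):
    m = YEAR_RE.match(text)
    if m is None:
        return None
    # '20' + '20' -> '2020'; '20' + '2019' -> '202019', last four -> '2019'
    return ('20' + m.group('year'))[-4:]

samples = ['010120-311220', '1.1.2019 - 31.12.2019', '31.12.2020', 'no date']
results = [get_year(s) for s in samples]
print(results)  # ['2020', '2019', '2020', None]
```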
This is my approach to your problem; maybe it will be useful:
import re
string = '''
Random Text 31.12.2020
1.1. -31.12.2022
010120-311220
31.12.2020
1.1.2020-31.12.2018 -
1.1.2019 - 31.12.2019
1.1. . . 31.12.2019 -
1.1.2019 - -31.12.2019
010120-311220 other random words'''
pattern = r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})'
matches = re.findall(pattern, string)
print("1) ", matches)
# Flatten the list of tuples into a single list
match_array = [i for sub in matches for i in sub]
print(match_array)
# Drop the empty strings left by the group that did not match
res = [element for element in match_array if element.strip()]
print(res)
I'm playing around with regular expression in Python for the below data.
Random
0 helloooo
1 hahaha
2 kebab
3 shsh
4 title
5 miss
6 were
7 laptop
8 welcome
9 pencil
I would like to delete the words which have patterns of repeated letters (e.g. blaaaa), repeated pairs of letters (e.g. hahaha), and any words which have the same letter on both sides of a single letter (e.g. title, kebab, were).
Here is the code:
import pandas as pd
data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
df = df.loc[~df.agg(lambda x: x.str.contains(r"([a-z])+\1{1,}\b"), axis=1).any(1)].reset_index(drop=True)
print(df)
Below is the output for the above with a Warning message:
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
Random
0 hahaha
1 kebab
2 shsh
3 title
4 were
5 laptop
6 welcome
7 pencil
However, I expect to see this:
Random
0 laptop
1 welcome
2 pencil
You can use Series.str.contains directly to create a mask, disabling the user warning before the call and re-enabling it afterwards:
import pandas as pd
import warnings

data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
warnings.filterwarnings("ignore", 'This pattern has match groups') # Disable the warning
mask = df['Random'].str.contains(r"([a-z]+)[a-z]?\1")
warnings.filterwarnings("always", 'This pattern has match groups') # Enable the warning
df = df[~mask]  # Keep only the rows that do not match
Output:
>>> df['Random']
7 laptop
8 welcome
9 pencil
Name: Random, dtype: object
The regex you have contains an issue: the quantifier is placed outside of the group, so \1 refers to the wrong repeated string. Also, the \b word boundary is superfluous. The ([a-z]+)[a-z]?\1 pattern matches one or more letters, then one optional letter, and then the same substring that Group 1 captured.
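The behaviour of the ([a-z]+)[a-z]?\1 pattern on the question's word list can be reproduced with the re module alone, as in this minimal sketch (the variable names are illustrative):

```python
import re

words = ['helloooo', 'hahaha', 'kebab', 'shsh', 'title',
         'miss', 'were', 'laptop', 'welcome', 'pencil']
repeat = re.compile(r"([a-z]+)[a-z]?\1")

# A word is dropped as soon as the pattern is found anywhere inside it
kept = [w for w in words if not repeat.search(w)]
print(kept)  # ['laptop', 'welcome', 'pencil']
```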
We can safely disable the user warning because we deliberately use the capturing group here, as we need to use a backreference in this regex pattern. The warning needs re-enabling to avoid using capturing groups in other parts of our code where it is not necessary.
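Another way to limit the suppression to one call is the standard library's warnings.catch_warnings context manager, which restores the previous filters on exit. The sketch below does not use pandas: contains_with_warning is a hypothetical stand-in that emits a similar UserWarning, so the example stays self-contained.

```python
import re
import warnings

def contains_with_warning(text, pattern):
    """Hypothetical stand-in for a call that warns about capture groups."""
    compiled = re.compile(pattern)
    if compiled.groups:
        warnings.warn("This pattern has match groups.", UserWarning)
    return compiled.search(text) is not None

with warnings.catch_warnings():
    # Filters changed inside this block are restored when it exits
    warnings.simplefilter("ignore", UserWarning)
    found = contains_with_warning("hahaha", r"([a-z]+)[a-z]?\1")

print(found)  # True
```

The design benefit is that the "re-enable" step cannot be forgotten, even if the call raises.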
IIUC, you can use something like the pattern r'(\w+)(\w)?\1', i.e., one or more word characters, an optional word character, and then the text captured by the first group again. This gives the right result:
df[~df.Random.str.contains(r'(\w+)(\w)?\1')]
I have a string which has 4 sections with no white spaces:
The first section can have 3-5 letters followed by 6 digits followed by letter 'A' followed by a floating number. A typical string could be ABCD192014A82.5
or, ABC192014A82.5 or, ABCDE192014A82.5
I would like to split this string into sub-strings as 'ABCD','192014','A' and '82.5'
I tried the following code, but it only works if the first section doesn't contain 'A'. So the string CDBF192014A82.5 gets split correctly, but the string ADBF192014A82.5 has issues because, I guess, the first section contains an A itself.
Any suggestions?
re.match(r"([a-z]+)([0-9]+)", MyString.split('A')[0], re.I)
Using re.split with a capture group:
import re

l = ['ABCD192014A82.5', 'ABC192014A82.5', 'ABCDE192014A82.5']
for i in l:
    print(i, re.split('([A-Z]+)', i)[1:])
Output:
ABCD192014A82.5 ['ABCD', '192014', 'A', '82.5']
ABC192014A82.5 ['ABC', '192014', 'A', '82.5']
ABCDE192014A82.5 ['ABCDE', '192014', 'A', '82.5']
Try this:
>>> import re
>>> for testcase in [
...     'ABCD192014A82.5',
...     'ABC192014A82.5',
...     'ABCDE192014A82.5',
...     'CDBF192014A82.5',
...     'ADBF192014A82.5'
... ]:
...     components = re.match(r'([A-Za-z]{3,5})(\d{6})(A)([0-9.]{3,4})', testcase).groups()
...     print(testcase, *components, sep='\t')
...
ABCD192014A82.5 ABCD 192014 A 82.5
ABC192014A82.5 ABC 192014 A 82.5
ABCDE192014A82.5 ABCDE 192014 A 82.5
CDBF192014A82.5 CDBF 192014 A 82.5
ADBF192014A82.5 ADBF 192014 A 82.5
The parts of the regex are:
[A-Za-z]{3,5} # 3 to 5 letters
\d{6} # 6 digit integer
A # Letter 'A'
[0-9.]{3,4} # 3 to 4 characters, each a digit or a dot (the float)
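A variant with named groups may make the four sections easier to work with. This is a sketch under the layout stated in the question (3 to 5 letters, 6 digits, a literal 'A', then a float); the group names and the split_parts helper are assumptions, and the float part is written more strictly as digits with an optional fraction.

```python
import re

PART_RE = re.compile(
    r'(?P<prefix>[A-Za-z]{3,5})'   # 3 to 5 letters
    r'(?P<number>\d{6})'           # 6 digits
    r'(?P<marker>A)'               # the literal letter 'A'
    r'(?P<value>\d+(?:\.\d+)?)'    # digits with an optional fractional part
)

def split_parts(text):
    """Return the four sections as a dict, or None if the layout differs."""
    m = PART_RE.fullmatch(text)
    if m is None:
        return None
    parts = m.groupdict()
    parts['value'] = float(parts['value'])
    return parts

print(split_parts('ADBF192014A82.5'))
# {'prefix': 'ADBF', 'number': '192014', 'marker': 'A', 'value': 82.5}
print(split_parts('AB192014A82.5'))  # None (prefix too short)
```

Because the digits separate the prefix from the marker, an 'A' inside the prefix no longer causes trouble.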
Hello all… I want to pick up the texts ‘DesignerXXX’ from a text file which contains the contents below:
C DesignerTEE edBore 1 1/42006
Cylinder SingleVerticalB DesignerHHJ e 1 1/8Cooling 1
EngineBore 11/16 DesignerTDT 8Length 3Width 3
EngineCy DesignerHEE Inline2008Bore 1
Height 4TheChallen DesignerTET e 1Stroke 1P 305
Height 8C 606Wall15ccG DesignerQBG ccGasEngineJ 142
Height DesignerEQE C 60150ccGas2007
An idea is to use ‘Designer’ as a key and split each line into 2 parts: before the key and after the key.
file_object = open('C:\\file.txt')
lines = file_object.readlines()
for line in lines:
    if 'Designer' in line:
        where = line.find('Designer')
        before = line[0:where]
        after = line[where:len(line)]
file_object.close()
In the ‘before the key’ part, I need to find the LAST space (‘ ’) and replace it with another symbol/character.
In the ‘after the key’ part, I need to find the FIRST space (‘ ’) and replace it with another symbol/character.
Then I can slice the line and pick up the wanted text according to the new symbols/characters.
Is there a better way to pick up the wanted texts? If not, how can I replace just those particular spaces?
With the string replace function I can limit the number of replacements, but not choose exactly which occurrence is replaced. How can I do that?
Thanks
Using regular expressions, it's a trivial task:
>>> s = '''C DesignerTEE edBore 1 1/42006
... Cylinder SingleVerticalB DesignerHHJ e 1 1/8Cooling 1
... EngineBore 11/16 DesignerTDT 8Length 3Width 3
... EngineCy DesignerHEE Inline2008Bore 1
... Height 4TheChallen DesignerTET e 1Stroke 1P 305
... Height 8C 606Wall15ccG DesignerQBG ccGasEngineJ 142
... Height DesignerEQE C 60150ccGas2007'''
>>> import re
>>> exp = 'Designer[A-Z]{3}'
>>> re.findall(exp, s)
['DesignerTEE', 'DesignerHHJ', 'DesignerTDT', 'DesignerHEE', 'DesignerTET', 'DesignerQBG', 'DesignerEQE']
The regular expression is Designer[A-Z]{3}, which means the letters Designer followed by exactly three letters, each from capital A to capital Z.
So it won't match Designer123 (digits are not letters), and for DesignerABCD (4 letters) it matches only the leading DesignerABC.
It also won't match Designerabc (abc are lowercase letters). To make it ignore case, you can pass the optional flag re.I (as the third argument of re.findall, for example); but this will also match designerabc (you have to be very specific with regular expressions).
So, to match Designer followed by exactly 3 upper or lower case letters, you'd have to change the expression to Designer[A-Za-z]{3}.
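The effect of the character class and of re.I can be compared side by side. The sample string below is made up for illustration. Note that a class written as [Aa-zZ] means "A, a through z, or Z", so it does not cover the other capital letters.

```python
import re

text = "DesignerTEE Designerabc designerXYZ DesignerABCD Designer123"

upper_only = re.findall(r'Designer[A-Z]{3}', text)
both_cases = re.findall(r'Designer[A-Za-z]{3}', text)
ignore_case = re.findall(r'Designer[A-Z]{3}', text, re.I)
whole_word = re.findall(r'Designer[A-Z]{3}\b', text)
odd_class = re.findall(r'Designer[Aa-zZ]{3}', text)

print(upper_only)   # ['DesignerTEE', 'DesignerABC']
print(both_cases)   # ['DesignerTEE', 'Designerabc', 'DesignerABC']
print(ignore_case)  # ['DesignerTEE', 'Designerabc', 'designerXYZ', 'DesignerABC']
print(whole_word)   # ['DesignerTEE']
print(odd_class)    # ['Designerabc']
```

Adding \b after the three letters is one way to reject longer runs such as DesignerABCD instead of matching their first three letters.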
If you want to search and replace, then you can use re.sub for substituting matches; so if I want to replace all matches with the word 'hello':
>>> x = re.sub(exp, 'hello', s)
>>> print(x)
C hello edBore 1 1/42006
Cylinder SingleVerticalB hello e 1 1/8Cooling 1
EngineBore 11/16 hello 8Length 3Width 3
EngineCy hello Inline2008Bore 1
Height 4TheChallen hello e 1Stroke 1P 305
Height 8C 606Wall15ccG hello ccGasEngineJ 142
Height hello C 60150ccGas2007
And what if there are characters both before and after 'Designer', and their number is not fixed? I tried
'[Aa-zZ]Designer[Aa-zZ]{0~9}', but it doesn't work.
For these things, there are special characters (quantifiers) in regular expressions. Briefly summarized:
When you want to say "1 or more, but at least 1", use +
When you want to say "0 or any number, so there may be none", use *
When you want to say "optional: either absent or present exactly once", use ?
You put these "repetition" modifiers right after the expression you want to modify.
For more on this, have a read through the documentation.
Now your requirement is "there are characters but the length is not fixed"; based on this, we have to use +.
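The quantifiers summarized above can be compared on one input. The sample string is invented for illustration. The last line also shows why an attempt like {0~9} finds nothing: Python's re treats braces that do not form a valid {m,n} quantifier as literal characters.

```python
import re

# Made-up sample: letters glued to both sides, to neither side, and one letter each side
sample = "xxDesignerTEEyy Designer aDesignerb"

plus = re.findall(r'[A-Za-z]+Designer[A-Za-z]+', sample)         # at least one letter each side
star = re.findall(r'[A-Za-z]*Designer[A-Za-z]*', sample)         # letters allowed, not required
optional = re.findall(r'[A-Za-z]?Designer[A-Za-z]?', sample)     # at most one letter each side
bounded = re.findall(r'[A-Za-z]{0,9}Designer[A-Za-z]{0,9}', sample)  # explicit bounds
literal = re.findall(r'Designer[A-Za-z]{0~9}', sample)           # "{0~9}" is literal text

print(plus)      # ['xxDesignerTEEyy', 'aDesignerb']
print(star)      # ['xxDesignerTEEyy', 'Designer', 'aDesignerb']
print(optional)  # ['xDesignerT', 'Designer', 'aDesignerb']
print(bounded)   # ['xxDesignerTEEyy', 'Designer', 'aDesignerb']
print(literal)   # []
```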
Try re.sub. The regular expression matches your keyword surrounded by whitespace. The second argument of sub replaces the surrounding whitespace with your_special_char (a hyphen in my script):
>>> import re
>>> with open('file.txt') as file_object:
...     your_special_char = '-'
...     for line in file_object:
...         formatted_line = re.sub(r'(\s)(Designer[A-Z]{3})(\s)', r'%s\2%s' % (your_special_char, your_special_char), line)
...         print(formatted_line, end='')
...
C -DesignerTEE-edBore 1 1/42006
Cylinder SingleVerticalB-DesignerHHJ-e 1 1/8Cooling 1
EngineBore 11/16-DesignerTDT-8Length 3Width 3
EngineCy-DesignerHEE-Inline2008Bore 1
Height 4TheChallen-DesignerTET-e 1Stroke 1P 305
Height 8C 606Wall15ccG-DesignerQBG-ccGasEngineJ 142
Height-DesignerEQE-C 60150ccGas2007
Maroun Maroun asked, 'Why not simply split the string?' So I guess one working way is:
file_object = open('C:\\file.txt')
lines = file_object.readlines()
b = []
for line in lines:
    # Split on whitespace and collect every token
    a = line.split()
    for aa in a:
        b.append(aa)
for bb in b:
    if 'Designer' in bb:
        print(bb)
file_object.close()
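The same split-based idea can be written more compactly. This sketch works on two in-memory lines taken from the question instead of a file, so the variable names and the choice of lines are only illustrative.

```python
sample_lines = [
    "C DesignerTEE edBore 1 1/42006",
    "Height DesignerEQE C 60150ccGas2007",
]

# Split each line on whitespace and keep the tokens containing the keyword
designers = [
    token
    for line in sample_lines
    for token in line.split()
    if 'Designer' in token
]
print(designers)  # ['DesignerTEE', 'DesignerEQE']
```

This relies on the wanted text being delimited by whitespace on both sides, as it is in the sample data.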