Python: How to get string between matches?

I have
import re
FILE = open("file.txt", "r")  # long text file
TEXT = FILE.read()
# long identification code with dots (.) and a hyphen (-)
regex = r"process \d\d\d\d\d\d\d\-\d\d\.\d\d\d\d\.\d+\.\d\d\.\d\d\d\d"
SRC = re.findall(regex, TEXT, flags=re.IGNORECASE|re.MULTILINE)
How can I get the text between the first char of the first occurrence SRC[i] and the first char of the next occurrence SRC[i+1], and so on? I couldn't find any straightforward, satisfactory answer...
MORE INFO EDIT:
pattern = 'process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}'
sample_input = "Process 1234567-89.1234.12431242.12.1234 - text title and long text description with no assured pattern Process 2234567-89.1234.12431242.12.1234 : chars and more text Process 3234567-89.1234.12431242.12.1234 - more text process 3234567-89.1234.12431242.12.1234 (...)"
sample_output[0] = "Process 1234567-89.1234.12431242.12.1234 - text title and long text description with no assured pattern "
sample_output[1] = "Process 2234567-89.1234.12431242.12.1234 : chars and more text "
sample_output[2] = "Process 3234567-89.1234.12431242.12.1234 - more text "
sample_output[3] = "process 3234567-89.1234.12431242.12.1234 "

You can use this regex:
(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*?)(?=Process)|(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*)
Match information
MATCH 1
1. [0-105] `Process 1234567-89.1234.12431242.12.1234 - text title and long text description with no assured pattern `
MATCH 2
1. [105-168] `Process 2234567-89.1234.12431242.12.1234 : chars and more text `
MATCH 3
1. [168-221] `Process 3234567-89.1234.12431242.12.1234 - more text `
MATCH 4
2. [221-267] `Process 3234567-89.1234.12431242.12.1234 (...)`
You can use this code:
sample_input = "Process 1234567-89.1234.12431242.12.1234 - text title and long text description with no assured pattern Process 2234567-89.1234.12431242.12.1234 : chars and more text Process 3234567-89.1234.12431242.12.1234 - more text process 3234567-89.1234.12431242.12.1234 (...)"
import re
pattern = r"(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*?)(?=Process)|(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*)"
# re.match would only try the start of the string and return a single match,
# so iterate with re.finditer; re.IGNORECASE also catches the lowercase "process".
for m in re.finditer(pattern, sample_input, flags=re.IGNORECASE):
    print(m.group(1) or m.group(2))  # group 2 is set only for the final segment
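An alternative that avoids the lookahead is to record where each identifier starts and slice the text between consecutive start positions. This is only a sketch: the helper name split_at_matches is made up for illustration, and it assumes the identifier pattern from the question.

```python
import re

# Identifier pattern from the question, matched case-insensitively.
PATTERN = re.compile(r"process \d{7}-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}", re.IGNORECASE)

def split_at_matches(text):
    """Return the chunks that run from one match start to the next match start."""
    starts = [m.start() for m in PATTERN.finditer(text)]
    # Each chunk ends where the next one begins; the last chunk runs to the end.
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]

sample_input = (
    "Process 1234567-89.1234.12431242.12.1234 - text title and long text "
    "description with no assured pattern Process 2234567-89.1234.12431242.12.1234 "
    ": chars and more text Process 3234567-89.1234.12431242.12.1234 - more text "
    "process 3234567-89.1234.12431242.12.1234 (...)"
)

for chunk in split_at_matches(sample_input):
    print(repr(chunk))
```

Any text before the first identifier is dropped, and the last chunk simply runs to the end of the input.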

Suppose you have a string some_str = 'abcARelevant_SubstringAcba' and you want the string between the first A and the second A; i.e. the desired output is 'Relevant_Substring'.
You can find the indices of occurrences of A in some_str with the following line:
inds = [a.start() for a in re.finditer('A', some_str)]
So now inds = [3, 22]. Now some_str[inds[0]+1:inds[1]] will contain 'Relevant_Substring'.
This should be extensible to your issue.
EDIT: Here's a concrete example.
Suppose you have a file "file.txt" that contains the following text:
Stuff I don't want.
0
Stuff I do want.
1
More stuff I don't want.
You want to use all digits (0-9) as separators. Therefore, both 0 and 1 above will act as separators. Try the following code:
import re
with open("file.txt", "r") as file:
    data = file.read()
patt = re.compile('[0-9]')
inds = [a.start() for a in re.finditer(patt, data)]
# strip() drops the newlines that surround the wanted line
print(data[inds[0]+1:inds[1]].strip())
This should print out Stuff I do want.
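If every piece between separators is of interest, and not only the first one, re.split can replace the manual index bookkeeping. A minimal sketch, using an inline string in place of file.txt so that it runs on its own:

```python
import re

# Same content as the file.txt example, inlined so the sketch is self-contained.
data = "Stuff I don't want.\n0\nStuff I do want.\n1\nMore stuff I don't want."

# Split on every digit and trim the surrounding newlines from each piece.
pieces = [p.strip() for p in re.split(r"[0-9]", data)]

# pieces[1] is the text between the first and the second separator.
print(pieces[1])
```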

You don't need re to find a string between two chars:
some_str = 'abcARelevant_SubstringAcba'
print(some_str.split("A", 2)[1])
Relevant_Substring

Related

Split string with multiple possible delimiters to get substring

I am trying to make a simple Discord bot to respond to some user input and having difficulty trying to parse the response for the info I need. I am trying to get their "gamertag"/username but the format is a little different sometimes.
So, my idea was to make a list of delimiter words I am looking for (different versions of the word gamertag such as Gamertag:, Gamertag -, username, etc.)
Then, look line by line for one that contains any of those delimiters.
Split the string on first matching delim, strip non alphanumeric characters
I had it kind of working for a single line, then realized some people don't put it on the first line, so I added a line-by-line check and broke it (on line 19, I just realized). I also think there must be a better way than this. Please advise; some partly working code is at this link and copied below:
testString = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""
applicationString = testString
gamertagSplitList = [ "gamertag", "Gamertag","Gamertag:", "gamertag:"]
#splWord = 'Gamertag'
lineNum = 0
for line in applicationString.partition('\n'):
    print(line)
    if line in gamertagSplitList:
        applicationString = line
        break
#get first line
#applicationString = applicationString.partition('\n')[0]
res = ""
#split on word, want to split on first occurrence of list of words
for splitWord in gamertagSplitList:
    if splitWord in applicationString:
        res = applicationString.split(splitWord)
        break
splitString = res[1]
#res = test_string.split(spl_word, 1)
#splitString = res[1]
#get rid of non alphaNum characters
finalString = "" #define string for ouput
for character in splitString:
    if(character.isalnum()):
        # if character is alphanumeric concat to finalString
        finalString = finalString + character
print(finalString)
Don't know if this will work with all your different inputs, but you can tweak it to get what you want :
import re
gamertagSplitList = ["gamertag", "Gamertag", "Gamertag:", "gamertag:"]
applicationString = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""
for line in applicationString.split('\n'):
    line = line.replace(' ', '')
    for tag in gamertagSplitList:
        if tag in line:
            gamer_tag = line.replace(tag, '', 1)
            break
print(re.sub(r'\W+', '', gamer_tag))
Output :
testGamertag
You can do it without any loops with a single regex:
import re
gamertagSplitList = ["gamertag", "Gamertag"]
applicationString = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""
print(re.search(r'(' + '|'.join(gamertagSplitList) + r')\s*[:-]?\s*(\w+)\s*', applicationString)[2])
If all values in gamertagSplitList differ just by casing, you can simplify that even further:
print(re.search(r'gamertag\s*[:-]?\s*(\w+)\s*', applicationString, re.IGNORECASE)[1])
Let's take a closer look at this regex:
gamertag will match a string 'gamertag'
\s* will match zero or more whitespace characters (space, newline, tab, etc.)
[:-]? will match at most one character that is either : or -
(\w+) will match one or more word characters (letters, digits, or underscore). The parentheses denote a group: a specific substring that we can extract from the match later.
By using re.IGNORECASE we make the matching case-insensitive, so that a separator such as GaMeRtAg is also recognised by this pattern.
The indexing part [1] means that we're interested in the first group in our pattern (remember the parentheses). The group with index 0 is always the full match, and groups from index 1 upwards represent the substrings matched by the parenthesized subexpressions (ordered by where their ( appears in the regex).
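To guard against input that contains no tag at all, the same pattern can be wrapped in a small helper that returns None instead of raising. This is a sketch: find_gamertag is a made-up name, and adding "username" as a second keyword is an assumption based on the delimiters mentioned in the question.

```python
import re

# A non-capturing group keeps the tag value in group 1.
GAMERTAG_RE = re.compile(r"(?:gamertag|username)\s*[:-]?\s*(\w+)", re.IGNORECASE)

def find_gamertag(text):
    """Return the first word after a gamertag/username label, or None."""
    match = GAMERTAG_RE.search(text)
    return match.group(1) if match else None

application = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""

print(find_gamertag(application))
print(find_gamertag("Age - 25\nUSERNAME-player_1"))
print(find_gamertag("nothing useful here"))
```

Because \w+ stops at the first non-word character, tags containing spaces or punctuation would be cut short; the character class would need widening for those.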

Implement regular expression in Python to replace every occurence of "meshname = x" in a text file

I want to replace with " " every line in a text file that starts with "meshname = " and ends with any combination of letters, digits, and underscores. I used regexes in CS, but I never really understood the different notations in Python. Can you help me with that?
Is this the right regex for my problem and how would i transform that into a Python regex?
m.e.s.h.n.a.m.e.' '.=.' '.{{_}*,{0,...,9}*,{a,...,z}*,{A,...,Z}*}*
x.y = Concatenation of x and y
' ' = whitespace
{x} = set containing x
x* = x.x.x. ... .x or empty word
What would the script look like in order to replace every string/line in a file containing meshname = ... with the Python regex? Something like this?
fin = open("test.txt", 'r')
data = fin.read()
data = data.replace("^meshname = [[a-z]*[A-Z]*[0-9]*[_]*]+", "")
fin.close()
fin = open("test.txt", 'w')
fin.write(data)
fin.close()
or is this completely wrong? I've tried to get it working with this approach, but somehow it never matched the right string: How to input a regex in string.replace?
Following the current code logic, you can use
data = re.sub(r'^meshname = .*\w$', ' ', data, flags=re.M)
re.sub will replace with a space any line that matches:
^ - line start (note the flags=re.M argument, which turns multiline mode on)
meshname = - the literal text meshname = (including the spaces around =)
.* - any zero or more chars other than line break chars, as many as possible
\w - a letter/digit/_
$ - line end.
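The substitution can be tried on an in-memory string before touching the file. The sample text below is invented for illustration; the second pattern is a stricter variant (an assumption about what the asker wants) that accepts only letters, digits, and underscores after "meshname = ".

```python
import re

# Invented sample content standing in for test.txt.
sample = "meshname = wall_01\nvertices = 128\nmeshname = Floor2\n"

# Pattern from the answer: anything after "meshname = " as long as the line
# ends with a word character.
cleaned = re.sub(r"^meshname = .*\w$", " ", sample, flags=re.M)
print(repr(cleaned))

# Stricter variant: only letters, digits, and underscores may follow.
strict = re.sub(r"^meshname = \w+$", " ", "meshname = two words\nmeshname = ok_1", flags=re.M)
print(repr(strict))
```

Note that str.replace treats its first argument as literal text, which is why the attempt in the question never matched.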

Re.search for comma/delimiter-removed substring

I have a text and used a function to extract a part of the text. However, in the returned value, delimiters (e.g ',', '-') are removed. I need to find the extracted part in the original text including substring and position.
e.g:
original_text = "xyz, 19900 Praha 9, Letnany"
(or original_text = "xyz, 19900 Praha 9 - Letnany")
extracted_text = "praha 9 letnany" (lower case, delimiters are removed)
I expect the output to be the same as the output of re.search('praha 9, letnany', original_text), meaning I get the substring 'Praha 9, Letnany' and the start of the match: 11.
Is there any regular expression to locate extracted text in the original text?
The output of the function can't be changed (up until now)
I have tried to find problems related to ignoring some character while using regex but their problems are different.
This will locate a span in the original text that matches the extracted text ignoring case & inserting delimiters at will (in this case, comma or dash):
import re
pat = ("[,-]*".join(list(extracted_text))).replace(" ","\\s")
mat = re.search( pat, original_text, re.I )
if mat:
    print(mat.span())
else:
    print("No match")
Same idea as @ScottHunter, but processing at word level instead of character level:
import re
ori_txt = '19900, Praha 7, Letnany'
extr_txt = 'praha 7 letnany'
delimiters = [',', r'\s', '-']
deli = '|'.join(delimiters)
extr_arr = re.split(deli, extr_txt)
ins_c = ''.join(delimiters)
ins_c = ''.join(['[', ins_c, ']', '*'])
pat = ins_c.join(extr_arr)
mat = re.search(pat, ori_txt, re.I)
if mat:
    print(mat.group())
else:
    print('not found')
I first wanted to find a regular expression that searches for the extracted text directly in the original text, but there seems to be no such expression. This is another way to solve my problem. Thank you.
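The word-level idea can be wrapped in a helper that returns both the matched substring and its start position, which is what the question asks for. This is a sketch: the name locate and its delimiters parameter are made up, and it assumes the words of the extracted text are separated in the original by at least one delimiter or whitespace character.

```python
import re

def locate(extracted, original, delimiters=",-"):
    """Find `extracted` in `original`, allowing delimiters between the words.

    Returns (matched_substring, start_index), or None if nothing matches.
    """
    words = extracted.split()
    # Between two words allow any run of the delimiters and whitespace.
    gap = "[" + re.escape(delimiters) + r"\s]+"
    pattern = gap.join(re.escape(word) for word in words)
    match = re.search(pattern, original, re.IGNORECASE)
    return (match.group(), match.start()) if match else None

print(locate("praha 9 letnany", "xyz, 19900 Praha 9, Letnany"))
print(locate("praha 9 letnany", "xyz, 19900 Praha 9 - Letnany"))
```

re.escape keeps words that contain regex metacharacters (a dot in an abbreviation, for example) from being read as pattern syntax.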

Picking up field value using Python regex

This is an example of two lines in a file that I am trying to pick up information from.
...
{ "SubtitleSettings_REPOSITORY", FieldType_STRING, (int32_t)REPOSITORY},
{ "PREFERRED_SUBTITLE_LANGUAGE", FieldType_STRING,SUBTITLE_LANGUAGE},
...
What I want to do is find the 3rd field of this weird data structure for the entry whose 1st field matches a given string, i.e.
SubtitleSettings_REPOSITORY => REPOSITORY
PREFERRED_SUBTITLE_LANGUAGE => SUBTITLE_LANGUAGE
The regex in my Python code handles only the second line and cannot cope with the first. How can I improve it?
import re
...
#field is given a value in previous code, can be "SubtitleSettings_REPOSITORY", or "PREFERRED_SUBTITLE_LANGUAGE"
match = re.search(field+'"[, \t]+(\w+)[, \t]+(\w+)', src_file.read(), re.M|re.I)
return_value = match.group(2)
You can insert (?:\(\w+\))?, which allows (and ignores) an optional word in parentheses there:
match = re.search(field + r'"[, \t]+(\w+)[, \t]+(?:\(\w+\))?(\w+)', line, re.M|re.I)
With this, the line matches and you get 'REPOSITORY' as desired.
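A self-contained way to check this pattern against both sample lines is sketched below. The helper name third_field is invented, and the source text is inlined rather than read from a file.

```python
import re

# The two sample entries from the question.
src = (
    '{ "SubtitleSettings_REPOSITORY", FieldType_STRING, (int32_t)REPOSITORY},\n'
    '{ "PREFERRED_SUBTITLE_LANGUAGE", FieldType_STRING,SUBTITLE_LANGUAGE},\n'
)

def third_field(field, text):
    """Return the 3rd field of the entry whose 1st field is `field`, or None."""
    # (?:\(\w+\))? skips an optional cast such as (int32_t) before the value.
    pattern = re.escape(field) + r'"[, \t]+(\w+)[, \t]+(?:\(\w+\))?(\w+)'
    match = re.search(pattern, text, re.I)
    return match.group(2) if match else None

print(third_field("SubtitleSettings_REPOSITORY", src))
print(third_field("PREFERRED_SUBTITLE_LANGUAGE", src))
```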
import re
with open("input.txt") as f:
    pattern = r"\{ \"(.+)\",.+,(.+)\}"
    for line in f:
        first, third = re.findall(pattern, line.strip())[0]
        print(first.strip(), "=>", third.strip())
prints
SubtitleSettings_REPOSITORY => (int32_t)REPOSITORY
PREFERRED_SUBTITLE_LANGUAGE => SUBTITLE_LANGUAGE
where input.txt contains
{ "SubtitleSettings_REPOSITORY", FieldType_STRING, (int32_t)REPOSITORY},
{ "PREFERRED_SUBTITLE_LANGUAGE", FieldType_STRING,SUBTITLE_LANGUAGE}
Breakdown:
\{ \"(.+)\" matches strings with the structure { + space + " + text + " and extracts text
,.+,(.+)\} matches strings with the structure , + text1 + , + text2 + } and extracts text2

Split group of special characters from string

In test.txt:
quiet confidence^_^
want:P
(:let's start
Code:
import re
file = open('test.txt').read()
for line in file.split('\n'):
    line = re.findall(r"[^\w\s$]+|[a-zA-z]+|[^\w\s$]+", line)
    print(" ".join(line))
Results showed:
quiet confidence^_^
want : P
(: let ' s start
I tried to separate groups of special characters from the string, but the result is still incorrect.
Any suggestions?
Expected results:
quiet confidence ^_^
want :P
(: let's start
As @interjay said, you must define what you consider a word and what counts as "special characters". Still, I would use two separate regexes to find what is a word and what is not.
word = re.compile(r"[a-zA-Z']+")
not_word = re.compile(r"[^a-zA-Z']+")
for line in file.split('\n'):
    matched_words = re.findall(word, line)
    non_matching_words = re.findall(not_word, line)
    print(" ".join(matched_words))
    print(" ".join(non_matching_words))
Keep in mind that runs of whitespace (\s+) will be grouped as non-words.
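To keep words and symbol groups in their original order on one line, a single alternation can tokenize each line. This is a sketch under stated assumptions: a "word" is letters plus apostrophes, and emoticons that contain letters (such as :P) have to be listed explicitly. The EMOTICONS list and the separate helper are made up for illustration.

```python
import re

# Hypothetical list of emoticons that contain letters; extend as needed.
EMOTICONS = [":P", ":D"]

# Try listed emoticons first (not followed by more letters), then words,
# then runs of any other non-whitespace characters.
emoticon_part = "(?:" + "|".join(re.escape(e) for e in EMOTICONS) + r")(?![A-Za-z'])"
TOKEN = re.compile(emoticon_part + r"|[A-Za-z']+|[^\sA-Za-z']+")

def separate(line):
    """Insert single spaces between words and groups of special characters."""
    return " ".join(TOKEN.findall(line))

for line in ["quiet confidence^_^", "want:P", "(:let's start"]:
    print(separate(line))
```

Note that the character range [a-zA-z] in the question also covers ^ and _, which sit between Z and a in ASCII; this is why confidence^_^ was kept as one token there.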
