Python extract string in a phrase - python

I have a Python string that arrives in a standard format, and I want to extract a piece of that string.
The string looks like this:
logs(env:production service:FourDS3.Expirer #Properties.NewStatus:(ChallengeAbandoned OR Expired) #Properties.Source:Session).index(processing).rollup(count).by(#Properties.AcsInfo.Host).last(15m) > 60
I want to extract everything inside logs(...), that is, I need to get this: env:production service:FourDS3.Expirer #Properties.NewStatus:(ChallengeAbandoned OR Expired) #Properties.Source:Session
I have tried the below regex but it's not working:
result = re.search('logs((.+?)).', message.strip())
return result.group(1)
result = re.search('logs((.*?)).', message.strip())
return result.group(1)
Can someone please help me ?

Conclusion first:
import pyparsing as pp
txt = 'logs(env:production service:FourDS3.Expirer #Properties.NewStatus:(ChallengeAbandoned OR Expired) #Properties.Source:Session).index(processing).rollup(count).by(#Properties.AcsInfo.Host).last(15m) > 60'
pattern = pp.Regex(r'.*?logs(?=\()') + pp.original_text_for(pp.nested_expr('(', ')'))
result = pattern.parse_string(txt)[1][1:-1]
print(result)
* You can install pyparsing by pip install pyparsing
If you insist on using regex, this answer will not suit you.
According to this post, however, nested parentheses like these are difficult to parse with a regex, so I used pyparsing to deal with your case.
Other examples:
The following examples work fine as well:
txt = 'logs(a(bc)d)e'
result = pattern.parse_string(txt)[1][1:-1]
print(result) # a(bc)d
txt = 'logs(a(b(c)d)e(f)g)h(ij(k)l)m'
result = pattern.parse_string(txt)[1][1:-1]
print(result) # a(b(c)d)e(f)g
Note:
Unfortunately, if the parentheses inside logs() are unbalanced, you get an unexpected result or an IndexError is raised. So you have to be careful about what kind of text comes in:
txt = 'logs(a)b)c'
result = pattern.parse_string(txt)[1][1:-1]
print(result) # a
txt = 'logs(a(b)c'
result = pattern.parse_string(txt)[1][1:-1]
print(result) # IndexError

If that input string is always in exactly the same format, then you could use the fact that the closing bracket for logs is followed by a .:
original = '''logs(env:production service:FourDS3.Expirer #Properties.NewStatus:(ChallengeAbandoned OR Expired)#Properties.Source:Session).index(processing).rollup(count).by(#Properties.AcsInfo.Host).last(15m) > 60'''
extracted = original.split('logs(')[1].split(').')[0]
print(extracted)
Which gives you this, without the need for regex:
'env:production service:FourDS3.Expirer #Properties.NewStatus:(ChallengeAbandoned OR Expired)#Properties.Source:Session'

You can achieve the result via regex like this:
import re
text = "logs(env:production service:FourDS3.Expirer #Properties.NewStatus:(ChallengeAbandoned OR Expired) #Properties.Source:Session).index(processing).rollup(count).by(#Properties.AcsInfo.Host).last(15m) > 60"
# the dot before "index" is escaped so that it only matches a literal "."
pattern = r'logs\((?P<log>.*)\)\.index'
print(re.search(pattern, text).group('log'))
# which prints:
# env:production service:FourDS3.Expirer #Properties.NewStatus:(ChallengeAbandoned OR Expired) #Properties.Source:Session
The (?P<log>...) construct is a named group, which you access by calling group() with the name specified inside <>. Because .* is greedy, this relies on .index appearing right after the closing parenthesis of logs(...).
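If the text after the closing parenthesis is not guaranteed to start with .index, another standard-library option is to count parenthesis depth by hand. The following is only a minimal sketch under that assumption; the function name extract_call_args is made up for illustration and does not come from any of the answers above.

```python
def extract_call_args(text, name="logs"):
    """Return the text inside the balanced parentheses after `name(`, or None."""
    start = text.find(name + "(")
    if start == -1:
        return None
    open_idx = start + len(name)  # index of the opening "("
    depth = 0
    for i in range(open_idx, len(text)):
        ch = text[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                # Found the parenthesis that closes the first one.
                return text[open_idx + 1:i]
    return None  # the opening parenthesis was never closed


message = ("logs(env:production service:FourDS3.Expirer "
           "#Properties.NewStatus:(ChallengeAbandoned OR Expired) "
           "#Properties.Source:Session).index(processing).rollup(count)"
           ".by(#Properties.AcsInfo.Host).last(15m) > 60")
print(extract_call_args(message))
```

Unlike the split and regex variants, this returns None for an unclosed parenthesis instead of raising, which may or may not be the behaviour you want.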

Related

Delete all characters that come after a given string

How exactly can I delete the characters after .jpg? Is there a way to separate the extension I extract with Python from what follows it?
For example, I have a link like this:
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC
How can I delete everything after .jpg?
I tried replacing, but it didn't work.
Is there another way?
Use a for loop to count the characters, or something like that?
I tried to get jpg files with this
for link in links:
    res = requests.get(link).text
    soup = BeautifulSoup(res, 'html.parser')
    img_links = []
    for img in soup.select('a.thumbnail img[src]'):
        print(img["src"])
        with open('links'+'.csv', 'a', encoding = 'utf-8', newline='') as csv_file:
            file_is_empty = os.stat(self.filename+'.csv').st_size == 0
            fieldname = ['links']
            writer = csv.DictWriter(csv_file, fieldnames = fieldname)
            if file_is_empty:
                writer.writeheader()
            writer.writerow({'links':img["src"]})
        img_links.append(img["src"])
You could use split. This assumes the string contains '.jpg'; if it does not, the code below returns the original URL with '.jpg' appended.
string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
jpg_removed = string.split('.jpg')[0]+'.jpg'
Example
string = 'www.google.com'
com_removed = string.split('.com')[0]
# com_removed = 'www.google'
You can also use a regular expression. You just want to ignore the characters after .jpg, so you can use something like this:
import re
new_url = re.findall(r"(.*\.jpg).*", old_url)[0]  # old_url holds the original link
(.*\.jpg) is a capturing group that matches any number of characters up to and including .jpg. Since . has a special meaning in a regex, the . in .jpg has to be escaped as \. The trailing .* matches whatever follows, but because it sits outside the capturing group () it is matched without being extracted.
You can use the .find() method to locate .jpg and then slice the string to keep everything up to and including it. Ex:
string = 'https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
index = string.find(".jpg")
new_string = string[:index + 4]
You have to add four because that is the length of .jpg, so that the extension itself is not cut off.
The find() method returns the lowest index of the substring if it is found in the given string. If it is not found, it returns -1.
str ='https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
result = str.find('jpg')
print(result)
new_str = str[:result]
print(new_str+'jpg')
See: Extracting extension from filename in Python
Instead of extracting the extension, we extract the filename and add the extension (if we know it's always .jpg, it's fine!)
import os
filename, file_extension = os.path.splitext('/path/to/somefile.jpg_corruptedpath')
result = filename + '.jpg'
Now, outside of the original question, I think there might be something wrong with how you got that piece of information in the first place. There must be a better way to extract that JPEG link without messing around with the path. Sadly, I can't help you with that since I am a novice with BeautifulSoup.
You could use a regular expression to replace everything after .jpg with an empty string:
import re
url ='https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC'
name = re.sub(r'(?<=\.jpg).*',"",url)
print(name)
https://s13emagst.akamaized.net/products/29146/29145166/images/res_cd1fa80f252e88faa70ffd465c516741.jpg
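Several of the answers above behave oddly when the marker is missing (find() returns -1, split() leaves the whole string). A small sketch using str.partition avoids that special case; the helper name truncate_after is hypothetical and chosen only for this illustration.

```python
def truncate_after(url, marker=".jpg"):
    """Keep everything up to and including the first `marker`.

    If `marker` does not occur, partition() returns (url, '', ''),
    so the original string comes back unchanged.
    """
    head, sep, _tail = url.partition(marker)
    return head + sep


link = ("https://s13emagst.akamaized.net/products/29146/29145166/images/"
        "res_cd1fa80f252e88faa70ffd465c516741.jpg10DCC3DD9E74DC1D10104F623D7E9BDC")
print(truncate_after(link))
print(truncate_after("https://example.com/no-extension"))
```

Note that this cuts at the first occurrence of the marker; if .jpg can also appear earlier in the path, str.rpartition would be the variant to look at.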

Searching text file for string in python

I'm using Python to search a large text file for a certain string, below the string is the data that I am interested in performing data analysis on.
def my_function(filename, variable2, variable3, variable4):
    array1 = []
    with open(filename) as a:
        special_string = str('info %d info =*' %variable3)
        for line in a:
            if special_string == array1:
                array1 = [next(a) for i in range(9)]
                line = next(a)
                break
            elif special_string != c:
                c = line.strip()
In the special_string variable, whatever comes after info = can vary, so I am trying to put a wildcard operator as seen above. The only way I can get the function to run though is if I put in the exact string I want to search for, including everything after the equals sign as follows:
special_string = str('info %d info = more_stuff' %variable3)
How can I assign a wildcard operator to the rest of the string to make my function more robust?
If your special string always occurs at the start of a line, then you can use the below check (where special_string does not have the * at the end):
line.startswith(special_string)
Otherwise, please do look at the module re in the standard library for working with regular expressions.
Have you thought about using something like this?
Based on your input, I'm assuming the following:
import re
variable3 = 100000
special_string = 'info %d info = more_stuff' % variable3
# raw string; \s* after "=" keeps the leading space out of the second group
pattern = re.compile(r'(info\s*\d+\s*info\s*=\s*)(.*)')
output = pattern.findall(special_string)
print(output[0][1])
Which would print:
more_stuff
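Putting the startswith() suggestion together with the question's goal of reading the nine lines below the marker could look roughly like this. It is a sketch under the assumption that the marker begins its line (leading whitespace aside); the function name lines_after_marker and the sample data are invented for the example.

```python
from itertools import islice


def lines_after_marker(lines, prefix, count=9):
    """Return the `count` lines that follow the first line starting with `prefix`."""
    lines = iter(lines)  # one shared iterator, so islice continues after the match
    for line in lines:
        if line.strip().startswith(prefix):
            return [nxt.rstrip("\n") for nxt in islice(lines, count)]
    return []  # marker not found


variable3 = 7
prefix = "info %d info =" % variable3  # no wildcard needed: only the prefix is compared
sample = ["header\n", "info 7 info = anything at all\n", "a\n", "b\n", "c\n"]
print(lines_after_marker(sample, prefix, count=2))
```

For a real file, the same function can be called with the open file object in place of the sample list, since a file iterates line by line.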

How would I get rid of certain characters then output a cleaned up string In python?

In this snippet of code I am trying to obtain the links to images posted in a groupchat by a certain user:
import groupy
from groupy import Bot, Group, Member
prog_group = Group.list().first
prog_members = prog_group.members()
prog_messages = prog_group.messages()
rojer = str(prog_members[4])
rojer_messages = ['none']
rojer_pics = []
links = open('rojer_pics.txt', 'w')
print(prog_group)
for message in prog_messages:
    if message.name == rojer:
        rojer_messages.append(message)
        if message.attachments:
            links.write(str(message) + '\n')
links.close()
The issue is that in the links file it prints the entire message: ("Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12')>"
What I am wanting to do, is to get rid of characters that aren't part of the URL so it is written like so:
"https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12"
Are there any methods in Python that can manipulate a string like this?
I just used string.split() and split the message into 3 parts on the single-quote character:
for message in prog_messages:
    if message.name == rojer:
        rojer_messages.append(message)
        if message.attachments:
            link = str(message).split("'")
            rojer_pics.append(link[1])
            links.write(str(link[1]) + '\n')
This can be done using string indices and the string method .find():
>>> url = "(\"Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12')"
>>> url = url[url.find('+')+1:-2]
>>> url
'https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12'
>>>
>>> string = '("Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12\')>"'
>>> string.split('+')[1][:-4]
'https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12'
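The slicing answers above depend on the exact number of trailing characters. A regular expression that simply looks for the URL is less sensitive to that; the sketch below assumes the link starts with http:// or https:// and contains no quotes, brackets, or whitespace. URL_RE and first_url are names made up for this example.

```python
import re

# Match from the scheme up to the first character that cannot be part of the link here.
URL_RE = re.compile(r"https?://[^\s'\"()<>]+")


def first_url(text):
    """Return the first URL found in `text`, or None if there is none."""
    match = URL_RE.search(text)
    return match.group(0) if match else None


message = ("(\"Rojer Doewns: Heres a special one "
           "+https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12')>\"")
print(first_url(message))
```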

Find what part of a string do not match with regular expression python

In order to see if a filename is correctly named (using re) I use the following regular expression pattern :
*^S_hc_[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}_[0-9]{4,4}-[0-9]{1,3}T[0-9]{6,6}\.xml$*
Here is a correct file name : *S_hc_1.2.3_2014-213T123121.xml*
Here is an incorrect file name : *S_hc_1.2.IncorrectName_2014-213T123121.xml*
I would like to know if there is a simple way to retrieve the part of the file name that does not match.
In the end, an error message would display :
Error, incorrect file name, the part 'IncorrectName' does not match with expected name.
You can use re.split and a generator expression within next, but you also need to check that the overall structure of your string matches what you want; you can do that with the following re.match:
re.match(r"^S_hc_(.*)\.(.*)\.(.*)_(.*)-(.*)\.xml$",s2)
And in code:
>>> import re
>>> s2 ='S_hc_1.2.IncorrectName_2014-213T123121.xml'
>>> s1
'S_hc_1.2.3_2014-213T123121.xml'
#with s1
>>> next((i for i in re.split(r'^S_hc_|[0-9]{1,2}\.|[0-9]{1,2}_|_|[0-9]{4,4}|-|[0-9]{1,3}T[0-9]{6}|\.|xml$',s1) if i and re.match(r"^S_hc_(.*)\.(.*)\.(.*)_(.*)-(.*)\.xml$",s1)),None)
#with s2
>>> next((i for i in re.split(r'^S_hc_|[0-9]{1,2}\.|[0-9]{1,2}_|_|[0-9]{4,4}|-|[0-9]{1,3}T[0-9]{6}|\.|xml$',s2) if i and re.match(r"^S_hc_(.*)\.(.*)\.(.*)_(.*)-(.*)\.xml$",s2)),None)
'IncorrectName'
All you need is to put a pipe (|) between the distinct parts of your regex pattern; the split function will then split your string on any of those patterns.
The part that doesn't match any of your patterns will not be split off, and you can find it by looping over the split text!
next(iterator[, default])
Retrieve the next item from the iterator by calling its next() method. If default is given, it is returned if the iterator is exhausted, otherwise StopIteration is raised.
If you want it on several lines:
>>> for i in re.split(r'^S_hc_|[0-9]{1,2}\.|[0-9]{1,2}_|_|[0-9]{4,4}|-|[0-9]{1,3}T[0-9]{6}|\.|xml$',s2):
...     if i and re.match(r"^S_hc_(.*)\.(.*)\.(.*)_(.*)-(.*)\.xml$",s2):
...         print(i)
...
IncorrectName
Maybe this is a longer solution but it will tell you what failed and what it expected. It is similar to Kasra's solution - breaking up the file name into individual bits and matching them in turn. This allows you to find out where the matching breaks:
import re
from functools import reduce  # reduce is no longer a builtin in Python 3
# break up the big file name pattern into individual bits that we can match
RX = re.compile
pattern = [
    RX(r"\*"),
    RX(r"S_hc_"),
    RX(r"[0-9]{1,2}"),
    RX(r"\."),
    RX(r"[0-9]{1,2}"),
    RX(r"\."),
    RX(r"[0-9]{1,2}"),
    RX(r"_"),
    RX(r"[0-9]{4}"),
    RX(r"-"),
    RX(r"[0-9]{1,3}"),
    RX(r"T"),
    RX(r"[0-9]{6}"),
    RX(r"\.xml"),
    RX(r"\*")
]
# 'fn' is the part of the file name that has not been matched yet
def reductor(fn, rx):
    if fn is None:
        return None
    mo = rx.match(fn)
    if mo is None:
        print("File name mismatch: got {}, expected {}".format(fn, rx.pattern))
        return None
    # proceed with the remainder of the string
    return fn[mo.end():]
validFile = lambda fn: reduce(reductor, pattern, fn) is not None
Let's test it:
print(validFile("*S_hc_1.2.3_2014-213T123121.xml*"))
print(validFile("*S_hc_1.2.IncorrectName_2014-213T123121.xml*"))
Outputs:
True
File name mismatch: got IncorrectName_2014-213T123121.xml*, expected [0-9]{1,2}
False
Here is the method I am going to use; please let me know of any cases where it fails:
def verifyFileName(self, filename__, pattern__):
    '''
    Verifies if a file name is correct
    :param filename__: file name
    :param pattern__: pattern
    :return: empty string if file name is correct, otherwise the incorrect part of file
    '''
    incorrectPart = ""
    # turn every literal "." and "_" of the pattern into its own alternative
    pattern = pattern__.replace(r'\.', r'|\.|').replace('_', '|_|')
    for i in re.split(pattern, filename__):
        if len(i) > 1:
            incorrectPart = i
    return incorrectPart
Here's the counterexample. I've taken your method and defined three test cases - file names plus expected output.
Here's the output, the code follows below:
$> python m.py
S_hc_1.2.3_2014-213T123121.xml: PASS [expect None got None]
S_hc_1.2.3_Incorrect-213T123121.xml: PASS [expect Incorrect- got Incorrect-]
X_hc_1.2.3_2014-213T123121.xml: FAIL [expect X got None]
This is the code - cut & paste & run it.
import re
def verifyFileName(filename__, pattern__):
    '''
    Verifies if a file name is correct
    :param filename__: file name
    :param pattern__: pattern
    :return: None if file name is correct, otherwise the incorrect part of file
    '''
    incorrectPart = None
    pattern = pattern__.replace(r'\.', r'|\.|').replace('_', '|_|')
    for i in re.split(pattern, filename__):
        if len(i) > 1:
            incorrectPart = i
    return incorrectPart
pattern = r"^S_hc_[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}_[0-9]{4,4}-[0-9]{1,3}T[0-9]{6,6}\.xml$"
# list of test cases: filenames + expected return from verifyFileName:
testcases = [
    # correct file name
    ("S_hc_1.2.3_2014-213T123121.xml", None),
    # obviously incorrect
    ("S_hc_1.2.3_Incorrect-213T123121.xml", "Incorrect-"),
    # subtly incorrect but still incorrect
    ("X_hc_1.2.3_2014-213T123121.xml", "X")
]
for (fn, expect) in testcases:
    # pass the pattern defined above (the original called an undefined name 'pat')
    res = verifyFileName(fn, pattern)
    print("{}: {} [expect {} got {}]".format(fn, "PASS" if res==expect else "FAIL", expect, str(res)))
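For comparison, the piece-by-piece idea from the answers above can also be written so that it returns the offending part instead of printing it. This is only a sketch: the labels in PIECES and the function name first_mismatch are invented for illustration, and "offending part" is assumed to mean the text up to the next ".", "_" or "-".

```python
import re

# (label, regex) pairs in the order they must appear in the file name.
PIECES = [
    ("prefix", r"S_hc_"),
    ("major", r"[0-9]{1,2}"), ("dot", r"\."),
    ("minor", r"[0-9]{1,2}"), ("dot", r"\."),
    ("patch", r"[0-9]{1,2}"), ("underscore", r"_"),
    ("year", r"[0-9]{4}"), ("dash", r"-"),
    ("day", r"[0-9]{1,3}"), ("T", r"T"),
    ("time", r"[0-9]{6}"), ("extension", r"\.xml$"),
]


def first_mismatch(filename):
    """Return (label, offending_text) for the first piece that fails, or None if valid."""
    pos = 0
    for label, piece in PIECES:
        match = re.compile(piece).match(filename, pos)
        if match is None:
            # Report the text up to the next delimiter; fall back to the whole remainder.
            bad = re.match(r"[^._-]*", filename[pos:]).group(0) or filename[pos:]
            return label, bad
        pos = match.end()
    return None


for name in ("S_hc_1.2.3_2014-213T123121.xml",
             "S_hc_1.2.IncorrectName_2014-213T123121.xml",
             "X_hc_1.2.3_2014-213T123121.xml"):
    print(name, first_mismatch(name))
```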

Analysing a text file in Python

I have a text file that needs to be analysed. Each line in the file is of this form:
7:06:32 (slbfd) IN: "lq_viz_server" aqeela#nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj#nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj#nabwmps3
I need to skip the timestamp and the (slbfd) and only keep a count of the lines with the IN and OUT. Further, depending on the name in quotes, I need to increase a variable count for different variables if a line starts with OUT and decrease the variable count otherwise. How would I go about doing this in Python?
The other answers with regex and splitting the line will get the job done, but if you want a fully maintainable solution that will grow with you, you should build a grammar. I love pyparsing for this:
S ='''
7:06:32 (slbfd) IN: "lq_viz_server" aqeela#nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj#nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj#nabwmps3'''
from pyparsing import *
from collections import defaultdict
# Define the grammar
num = Word(nums)
marker = Literal(":").suppress()
timestamp = Group(num + marker + num + marker + num)
label = Literal("(slbfd)")
flag = Word(alphas)("flag") + marker
name = QuotedString(quoteChar='"')("name")
line = timestamp + label + flag + name + restOfLine
grammar = OneOrMore(Group(line))
# Now parsing is a piece of cake!
P = grammar.parseString(S)
counts = defaultdict(int)
for x in P:
    if x.flag=="IN": counts[x.name] += 1
    if x.flag=="OUT": counts[x.name] -= 1
for key in counts:
    print(key, counts[key])
This gives as output:
lq_viz_server 1
OFM32 -1
This would look more impressive if your sample log file were longer. The beauty of a pyparsing solution is the ability to adapt to a more complex query in the future (e.g. grab and parse the timestamp, pull the email address, parse error codes...). The idea is that you write the grammar independently of the query: you simply convert the raw text to a computer-friendly format, abstracting the parsing implementation away from its usage.
If the file is divided into lines (I don't know whether that is true), you can apply the split() function to each line. You will get this:
['7:06:32', '(slbfd)', 'IN:', '"lq_viz_server"', 'aqeela#nabltas1']
After that, you should be able to apply whatever logic you need by comparing those values.
I made some wild assumptions about your specification; here is some sample code to help you start:
objects = {}
with open("data.txt") as data:
    for line in data:
        if "IN:" in line or "OUT:" in line:
            try:
                name = line.split("\"")[1]
            except IndexError:
                print("No double quoted name on line: {}".format(line))
                name = "PARSING_ERRORS"
            if "OUT:" in line:
                diff = 1
            else:
                diff = -1
            try:
                objects[name] += diff
            except KeyError:
                objects[name] = diff
print(objects) # for debug only, not advisable to print huge number of names
You have two options:
Use the .split() function of the string (as pointed out in the comments)
Use the re module for regular expressions.
I would suggest using the re module and create a pattern with named groups.
Recipe:
first create a pattern with re.compile() containing named groups
do a for loop over the file to get the lines
use .match() of the created pattern object on each line
use .groupdict() of the returned match object to access your values of interest
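The recipe above might be sketched as follows. The group names (time, daemon, flag, name) and the function count_checkouts are assumptions made for this example, and the sign convention follows the question's wording (OUT increases the count, IN decreases it), which is the opposite of some of the other answers.

```python
import re
from collections import Counter

# One named group per field of interest; lines that are neither IN nor OUT do not match.
LINE_RE = re.compile(
    r'(?P<time>\d+:\d+:\d+) \((?P<daemon>\w+)\) (?P<flag>IN|OUT): "(?P<name>[^"]+)"'
)


def count_checkouts(lines):
    """Return a Counter mapping each quoted name to (OUT lines - IN lines)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # e.g. UNSUPPORTED lines are skipped
        fields = match.groupdict()
        counts[fields["name"]] += 1 if fields["flag"] == "OUT" else -1
    return counts


sample = [
    '7:06:32 (slbfd) IN: "lq_viz_server" aqeela#nabltas1',
    '7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj#nabwmps3',
    '7:08:21 (slbfd) OUT: "OFM32" Albahraj#nabwmps3',
]
print(count_checkouts(sample))
```

With a real log, an open file object can be passed in place of the sample list.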
In the mode of just get 'er done with the standard distribution, this works:
import re
from collections import Counter
# open your file as inF...
count=Counter()
for line in inF:
    match=re.match(r'\d+:\d+:\d+ \(slbfd\) (\w+): "(\w+)"', line)
    if match:
        if match.group(1) == 'IN': count[match.group(2)]+=1
        elif match.group(1) == 'OUT': count[match.group(2)]-=1
print(count)
Prints:
Counter({'lq_viz_server': 1, 'OFM32': -1})
