Python Regex Match Integer After String - python

I need a regex in python to match and return the integer after the string "id": in a text file.
The text file contains the following:
{"page":1,"results": [{"adult":false,"backdrop_path":"/ba4CpvnaxvAgff2jHiaqJrVpZJ5.jpg","id":807,"original_title":"Se7en","release_date":"1995-09-22","p
I need to get the 807 after the "id", using a regular expression.

Is this what you mean?
#!/usr/bin/env python
import re
subject = '{"page":1,"results": [{"adult":false,"backdrop_path":"/ba4CpvnaxvAgff2jHiaqJrVpZJ5.jpg","id":807,"original_title":"Se7en","release_date":"1995-09-22","p'
match = re.search(r'"id":([^,]+)', subject)
if match:
    result = match.group(1)
else:
    result = "no result"
print(result)
The Output: 807
Edit:
In response to your comment, adding one simple way to ignore the first match. If you use this, remember to add something like "id":809,"etc to your subject so that we can ignore 807 and find 809.
n = 1
for match in re.finditer(r'"id":([^,]+)', subject):
    if n == 1:
        print("ignoring the first match")
    else:
        print(match.group(1))
    n += 1
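The same idea can be written without a manual counter. This is only a sketch: the two-record subject string and the helper name ids_after_first are made up for illustration, and the pattern assumes the id is always a run of digits.

```python
import re
from itertools import islice

# Hypothetical subject with two "id" fields, as the edit above suggests adding.
subject = '{"id":807,"original_title":"Se7en"},{"id":809,"original_title":"Other"}'

def ids_after_first(text):
    """Return every "id" value except the first one, as integers."""
    matches = re.finditer(r'"id":(\d+)', text)
    # islice(matches, 1, None) skips the first match and yields the rest.
    return [int(m.group(1)) for m in islice(matches, 1, None)]

print(ids_after_first(subject))  # [809]
```

If only one "id" is present, the function returns an empty list.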

Assuming that the file holds complete, valid JSON (more than the truncated excerpt shown):
import json
with open('/path/to/file.txt') as f:
    data = json.loads(f.read())
print(data['results'][0]['id'])
If the file is not valid JSON, then you can get the value of id with:
from re import compile, IGNORECASE
r = compile(r'"id"\s*:\s*(\d+)', IGNORECASE)
with open('/path/to/file.txt') as f:
    # finditer yields match objects, so group(1) is available
    for match in r.finditer(f.read()):
        print(match.group(1))
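The two approaches can also be combined: try the JSON parser first and fall back to the regex only when parsing fails. A minimal sketch, assuming the 'results' layout from the question; the function name first_id and the shortened sample string are illustrative, not part of the original answers.

```python
import json
import re

ID_RE = re.compile(r'"id"\s*:\s*(\d+)')

def first_id(text):
    """Return the first "id" as an int, or None if there is none."""
    try:
        data = json.loads(text)
        return data['results'][0]['id']
    except (ValueError, KeyError, IndexError, TypeError):
        # Not valid JSON (for example, truncated) or an unexpected shape:
        # fall back to a plain regex search.
        match = ID_RE.search(text)
        return int(match.group(1)) if match else None

# Truncated input like the one in the question (shortened here).
truncated = '{"page":1,"results": [{"adult":false,"id":807,"original_title":"Se7en","p'
print(first_id(truncated))  # 807
```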

Related

Extract Message-ID from a file

I have the following code that extracts the Message-ID lines and gathers them in a DataFrame. It works and gives me the following results.
This is an example of the lines in the DataFrame:
Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>
What I want is only the string between the < and > characters (the Message-ID value always ends with >). Some lines also have an empty Message-ID value; I want to delete those lines.
Here is the code that I wrote:
import pandas as pd
import numpy as np
f = open('C:\\Users\\hmk\\Desktop\\PFE 2019\\ML\\MachineLearningPhishing-master\\MachineLearningPhishing-master\\code\\resources\\emails-enron.mbox','r')
line_num = 0
e = []
search_phrase = "Message-ID"
for line in f.readlines():
    line_num += 1
    if line.find(search_phrase) >= 0:
        #line = line[13:]
        #line = line[:-2]
        e.append(line)
f.close()
dfObj = pd.DataFrame(e)
One way to do it is using regex and pandas DataFrame replace (df here is the DataFrame of Message-ID lines, dfObj in the question):
clean_df = df.replace(to_replace=r'<|>', value='', regex=True)
clean_df = clean_df.replace(to_replace=r'(Message-ID:\s*$)', value=np.nan, regex=True).dropna()
The first line of code removes the < and > characters, assuming the messages contain only those two.
The second checks whether the line carries a Message-ID value; if not, it is replaced with NaN.
Note that I used numpy.nan just to simplify the process of dropping the blank messages.
You can use a regex which will extract the desired Message-ID for you.
So your first part for extracting the message id would be like below:
import re
s = 'Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>'
message_id = re.search(r'Message-ID: <(.*?)>', s).group(1)
print('message_id:', message_id)
Your ideal Message ID:
message_id: 23272646.1075847145300.JavaMail.evans#thyme
So you can loop through your data and check for the regex like this:
for line in f.readlines():
    line_num += 1
    message_id = re.search(r'Message-ID: <(.*?)>', line)
    if message_id:
        msg_id_string = message_id.group(1)
        e.append(msg_id_string)  # keep only the ID, not the whole line
        # your other works
The if message_id: check tests whether the line contains a Message-ID. If there is no match, re.search returns None and the body of the if is skipped.
You want a substring of each line:
for line in f.readlines():
    if all(word in line for word in [search_phrase, "<", ">"]):
        # slice between "<" and ">" (the line also ends with a newline,
        # so a fixed -1 would leave the ">" in place)
        e.append(line[line.find("<")+1:line.find(">")])
Use in to check if a string is inside another string
Use find to get the index of your pattern
Use [in:out] to get substring between your two values
s = "We want <This text inside only>. yes we do."
s2 = s[s.find("<")+1:s.find(">")]
print(s2) # Prints : This text inside only
# If you want to remove empty lines :
lines = filter(lambda x: x.strip(), lines)
filter goes through all the lines, so no explicit for loop is needed.
One suggestion for you:
import re
f = open('PATH/TO/FILE', 'r').read()
msgID = re.findall(r'(?<=<).*?(?=>)', f)
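The answers above can be folded into one small helper that returns only the IDs and skips lines whose Message-ID is empty. This is a sketch rather than a definitive solution: the name extract_message_ids and the sample lines are invented for illustration, and it assumes one Message-ID per line.

```python
import re

# [^<>]+ requires at least one character, so "<>" (an empty ID) does not match.
MESSAGE_ID_RE = re.compile(r'Message-ID:\s*<([^<>]+)>')

def extract_message_ids(lines):
    """Return the text between < and > for every non-empty Message-ID line."""
    ids = []
    for line in lines:
        match = MESSAGE_ID_RE.search(line)
        if match:  # lines with an empty or missing ID are skipped
            ids.append(match.group(1))
    return ids

sample = [
    'Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>\n',
    'Message-ID: <>\n',
    'Message-ID: \n',
    'Subject: hello\n',
]
print(extract_message_ids(sample))
```

The resulting list can be passed to pd.DataFrame in place of the list e from the question.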

Regular Expression result don't match with tester

I'm new to Python...
After a couple of days of googling I still can't get it to work.
My script:
import re
pattern = '^Hostname=([a-zA-Z0-9.]+)'
hand = open('python_test_data.conf')
for line in hand:
    line = line.rstrip()
    if re.search(pattern, line):
        print(line)
Test file content:
### Option: Hostname
# Unique, case sensitive Proxy name. Make sure the Proxy name is known to the server!
# Value is acquired from HostnameItem if undefined.
#
# Mandatory: no
# Default:
# Hostname=
Hostname=bbg-zbx-proxy
Script results:
ubuntu-workstation:~$ python python_test.py
Hostname=bbg-zbx-proxy
But when I tested the regex in a tester, this is the result: https://regex101.com/r/wYUc4v/1
I need some advice on how I can get only bbg-zbx-proxy as the script output.
You have already written a regular expression that captures one part of the match, so you might as well use it. Additionally, change your character class to include - and get rid of the line.rstrip() call; it's not necessary with your expression.
In total this comes down to:
import re
pattern = '^Hostname=([-a-zA-Z0-9.]+)'
hand = open('python_test_data.conf')
for line in hand:
    m = re.search(pattern, line)
    if m:
        print(m.group(1))  # group(1) is the captured hostname
The simple solution would be to split on the equals sign. You know it will always contain that and you will be able to ignore the first item in the split.
import re
pattern = '^Hostname=([a-zA-Z0-9.]+)'
hand = open('testdata.txt')
for line in hand:
    line = line.rstrip()
    if re.search(pattern, line):
        print(line.split("=")[1]) # UPDATED HERE
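For a simple key=value file like this one, the split idea also works without a regex. A minimal sketch, assuming comment lines start with # and that the first uncommented Hostname= line is the one wanted; read_hostname and the sample list are names made up for this example.

```python
def read_hostname(lines):
    """Return the value of the first uncommented Hostname= line, or None."""
    for line in lines:
        line = line.strip()
        if line.startswith('#'):
            continue  # skip commented-out settings such as "# Hostname="
        # partition splits on the first "=" only, so values containing "=" survive
        key, sep, value = line.partition('=')
        if sep and key == 'Hostname':
            return value
    return None

sample = [
    '### Option: Hostname\n',
    '# Default:\n',
    '# Hostname=\n',
    'Hostname=bbg-zbx-proxy\n',
]
print(read_hostname(sample))  # bbg-zbx-proxy
```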

Python - finding specific string in files

I am trying to read a specific string from files. Basically, a file looks like this:
S0M6A36A108A180A252A324A36|1|48|89|36|Single|
S0M6A36A108A180A252A324A36|2|43|83|108|Single|
S0M6A36A108A180A252A324A36|3|37|85|180|Single|
S0M6A36A108A180A252A324A36|4|37|93|252|Single|
S0M6A36A108A180A252A324A36|5|43|95|324|Single|
S0M6A36A108A180A252A324A36|6|42|89|36|Single|
[META DATA]
01/10/2015|14:50:27|USA|UWI_N2C34_2|MMS1|FORD35|Bednarek|true|6|0|false|
[QUALITY CAMERA CHECK]
1|1|0|
2|1|0|
3|1|0|
4|1|0|
5|1|0|
6|1|0|
[PRESET]
S0M6A36A108A180A252A324A36|TA|
What I need is to read, from the line 01/10/2015|14:50:27|USA|UWI_N2C34_2|MMS1|FORD35|Bednarek|true|6|0|false|,
the country name, that is, the value between the pipes in |USA|.
To do that I tried to use the group function of the regular expression match object. I figured that I need to read from the specific line that holds this string, so I wrote a small piece of code:
import os
import string
import re
import sys
import glob
import fileinput
country_pattern = 'MYS','IDN','ZAF', 'THA','TWN','SGP', 'NWZ', 'AUS','ALB','AUT','BEL', 'BGR', 'BIH', 'CHE','CZE', 'DEU', 'DNK', 'ESP','EST','SRB','MDK','MNE','BIH', 'BIH','MNE','FIN', 'FRA', 'GBR','GRC', 'HRV', 'HUN', 'IRL', 'ITA', 'LIE', 'LTU', 'LUX', 'LVA', 'MDA', 'SMR','CYP','NLD','NOR','POL','PRT','ROU','SCG', 'SVK','SVN','SWE','TUR','BRA','CAN','USA','MEX','CHL','ARG','RUS'
pattern = r'(\d+)/(\d+)/(\d+)|(\d+):(\d+):(\d+)|(\S+)|(\S+)|(\S+)|(\S+)|(\S+)|(\S+)|(\d+)|(\d+)|(\S+)|'
src = raw_input("Enter source disk location: ")
src = os.path.dirname(src)
for dir,_,_ in os.walk(src):
    file_path = glob.glob(os.path.join(dir,"*.txt"))
    for file in file_path:
        f = open(file, 'r')
        object_name = f.readlines()
        f.close()
        for line_name_tmp in object_name:
            line_name = line_name_tmp.replace('\n','')
            if line_name == '':
                line_name.split()
                continue
            else:
                try:
                    searchObj = re.search(pattern, line_name)
                    m = searchObj.group(7)
                    if m in country_pattern:
                        print "searchObj.group(7) : ", searchObj.group(7)
                    else:
                        print 'did not find any match'
                except:
                    print line_name
                    pass
But it always prints 'did not find any match'. Did I miss something?
Thanks for the advice.
Your regex is the problem.
Try this one:
pattern = r'(\d+)/(\d+)/(\d+)\|(\d+):(\d+):(\d+)\|(\S+)\|(\S+)\|(\S+)\|(\S+)\|(\S+)\|(\S+)\|(\d+)\|(\d+)\|(\S+)\|'
In regular expressions, the character | separates alternatives. So if you define a regex like this,
(\d+)/(\d+)/(\d+)|(\d+):(\d+):(\d+)
it will match a string of the form digits/digits/digits or a string of the form digits:digits:digits. Not both.
Accordingly, when you take your pattern regex and search the line
01/10/2015|14:50:27|USA|UWI_N2C34_2|MMS1|FORD35|Bednarek|true|6|0|false|
for a match, the regex winds up matching only the part 01/10/2015, because that part is matched by the first alternative ((\d+)/(\d+)/(\d+)). The seventh capturing group in the regex is not within the part that matched, so m.group(7) returns None, and of course None is not one of the elements in country_pattern.
The easy - or one might say lazy - way to fix this is to escape the pipe characters in the definition of the regex: use \| instead of |. But since you have fields separated by | in the file, I think you might have a better designed program if you were to use line_name.split('|') and then pick out the third field, instead of using a regular expression.
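The difference between the unescaped pattern, the escaped pattern, and a plain split can be seen side by side. This is a sketch on the single sample line from the question; the patterns are shortened versions of the originals, kept only as far as the seventh group.

```python
import re

line = '01/10/2015|14:50:27|USA|UWI_N2C34_2|MMS1|FORD35|Bednarek|true|6|0|false|'

# Unescaped: "|" means "or", so only the first alternative (the date) matches.
unescaped = re.search(r'(\d+)/(\d+)/(\d+)|(\d+):(\d+):(\d+)|(\S+)', line)
print(unescaped.group(0))  # 01/10/2015
print(unescaped.group(7))  # None

# Escaped: "\|" matches a literal pipe, so group 7 is the third field.
escaped = re.search(r'(\d+)/(\d+)/(\d+)\|(\d+):(\d+):(\d+)\|([^|]+)\|', line)
print(escaped.group(7))  # USA

# Without a regex: split on the separator and take the third field.
fields = line.split('|')
print(fields[2])  # USA
```

The split version has no group numbering to keep track of, which is why it tends to be easier to maintain for delimiter-separated records.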
If you just need to find the country abbreviation in the text, this will do it:
import re
data = '''
01/10/2015|14:50:27|USA|UWI_N2C34_2|MMS1|FORD35|Bednarek|true|6|0|false|
'''
country_pattern = 'MYS','IDN','ZAF', 'THA','TWN','SGP', 'NWZ', 'AUS','ALB','AUT','BEL', 'BGR', 'BIH', 'CHE','CZE', 'DEU', 'DNK', 'ESP','EST','SRB','MDK','MNE','BIH', 'BIH','MNE','FIN', 'FRA', 'GBR','GRC', 'HRV', 'HUN', 'IRL', 'ITA', 'LIE', 'LTU', 'LUX', 'LVA', 'MDA', 'SMR','CYP','NLD','NOR','POL','PRT','ROU','SCG', 'SVK','SVN','SWE','TUR','BRA','CAN','USA','MEX','CHL','ARG','RUS'
mo = re.search(r'\|[A-Z]{3}\|', data)
if mo:
    print(mo.group(0))
Output:
|USA|

Find what part of a string do not match with regular expression python

In order to check whether a file name is correctly formed (using re), I use the following regular expression pattern:
*^S_hc_[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}_[0-9]{4,4}-[0-9]{1,3}T[0-9]{6,6}\.xml$*
Here is a correct file name: *S_hc_1.2.3_2014-213T123121.xml*
Here is an incorrect file name: *S_hc_1.2.IncorrectName_2014-213T123121.xml*
I would like to know whether there is a simple way to retrieve the part of the file name that does not match.
In the end, an error message would display :
Error, incorrect file name, the part 'IncorrectName' does not match with expected name.
You can use re.split and a generator expression within next, but you also need to check that the overall structure of your string matches what you want. You can do that with the following re.match:
re.match(r"^S_hc_(.*)\.(.*)\.(.*)_(.*)-(.*)\.xml$",s2)
And in code:
>>> import re
>>> s2 ='S_hc_1.2.IncorrectName_2014-213T123121.xml'
>>> s1
'S_hc_1.2.3_2014-213T123121.xml'
#with s1
>>> next((i for i in re.split(r'^S_hc_|[0-9]{1,2}\.|[0-9]{1,2}_|_|[0-9]{4,4}|-|[0-9]{1,3}T[0-9]{6}|\.|xml$',s1) if i and re.match(r"^S_hc_(.*)\.(.*)\.(.*)_(.*)-(.*)\.xml$",s2)),None)
#with s2
>>> next((i for i in re.split(r'^S_hc_|[0-9]{1,2}\.|[0-9]{1,2}_|_|[0-9]{4,4}|-|[0-9]{1,3}T[0-9]{6}|\.|xml$',s2) if i and re.match(r"^S_hc_(.*)\.(.*)\.(.*)_(.*)-(.*)\.xml$",s2)),None)
'IncorrectName'
All you need is to put a pipe (|) between the unique parts of your regex pattern; the split function will then split your string on any one of those parts.
The part that doesn't match any of your patterns will not be split away, and you can find it by looping over the split text!
next(iterator[, default])
Retrieve the next item from the iterator by calling its next() method. If default is given, it is returned if the iterator is exhausted, otherwise StopIteration is raised.
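A short illustration of next with a default, using a hand-written list in place of the re.split result (the list contents are made up for the example):

```python
parts = ['', '', 'IncorrectName', '']

# First non-empty item, or None when every item is empty.
first = next((p for p in parts if p), None)
print(first)  # IncorrectName

# With no non-empty item, the default is returned instead of raising StopIteration.
nothing = next((p for p in ['', ''] if p), None)
print(nothing)  # None
```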
If you want it on several lines:
>>> for i in re.split(r'^S_hc_|[0-9]{1,2}\.|[0-9]{1,2}_|_|[0-9]{4,4}|-|[0-9]{1,3}T[0-9]{6}|\.|xml$',s2):
...     if i and re.match(r"^S_hc_(.*)\.(.*)\.(.*)_(.*)-(.*)\.xml$",s2):
...         print(i)
...
IncorrectName
Maybe this is a longer solution but it will tell you what failed and what it expected. It is similar to Kasra's solution - breaking up the file name into individual bits and matching them in turn. This allows you to find out where the matching breaks:
import re
from functools import reduce  # reduce is not a builtin in Python 3

# break up the big file name pattern into individual bits that we can match
RX = re.compile
pattern = [
    RX(r"\*"),
    RX(r"S_hc_"),
    RX(r"[0-9]{1,2}"),
    RX(r"\."),
    RX(r"[0-9]{1,2}"),
    RX(r"\."),
    RX(r"[0-9]{1,2}"),
    RX(r"_"),
    RX(r"[0-9]{4}"),
    RX(r"-"),
    RX(r"[0-9]{1,3}"),
    RX(r"T"),
    RX(r"[0-9]{6}"),
    RX(r"\.xml"),
    RX(r"\*")
]

# 'fn' is the part of the file name that has not been matched yet
def reductor(fn, rx):
    if fn is None:
        return None
    mo = rx.match(fn)
    if mo is None:
        print("File name mismatch: got {}, expected {}".format(fn, rx.pattern))
        return None
    # proceed with the remainder of the string
    return fn[mo.end():]

validFile = lambda fn: reduce(reductor, pattern, fn) is not None
Let's test it:
print(validFile("*S_hc_1.2.3_2014-213T123121.xml*"))
print(validFile("*S_hc_1.2.IncorrectName_2014-213T123121.xml*"))
Outputs:
True
File name mismatch: got IncorrectName_2014-213T123121.xml*, expected [0-9]{1,2}
False
Here is the method I am going to use; please let me know if there are cases where it fails:
def verifyFileName(self, filename__, pattern__):
    '''
    Verifies if a file name is correct
    :param filename__: file name
    :param pattern__: pattern
    :return: empty string if file name is correct, otherwise the incorrect part of file
    '''
    incorrectPart = ""
    pattern = pattern__.replace(r'\.', r'|\.|').replace('_', '|_|')
    for i in re.split(pattern, filename__):
        if len(i) > 1:
            incorrectPart = i
    return incorrectPart
Here's the counterexample. I've taken your method and defined three test cases - file names plus expected output.
Here's the output, the code follows below:
$> python m.py
S_hc_1.2.3_2014-213T123121.xml: PASS [expect None got None]
S_hc_1.2.3_Incorrect-213T123121.xml: PASS [expect Incorrect- got Incorrect-]
X_hc_1.2.3_2014-213T123121.xml: FAIL [expect X got None]
This is the code - cut & paste & run it.
import re

def verifyFileName(filename__, pattern__):
    '''
    Verifies if a file name is correct
    :param filename__: file name
    :param pattern__: pattern
    :return: None if file name is correct, otherwise the incorrect part of file
    '''
    incorrectPart = None
    pattern = pattern__.replace(r'\.', r'|\.|').replace('_', '|_|')
    for i in re.split(pattern, filename__):
        if len(i) > 1:
            incorrectPart = i
    return incorrectPart

pattern = r"^S_hc_[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2}_[0-9]{4,4}-[0-9]{1,3}T[0-9]{6,6}\.xml$"
# list of test cases: filenames + expected return from verifyFileName:
testcases = [
    # correct file name
    ("S_hc_1.2.3_2014-213T123121.xml", None),
    # obviously incorrect
    ("S_hc_1.2.3_Incorrect-213T123121.xml", "Incorrect-"),
    # subtly incorrect but still incorrect
    ("X_hc_1.2.3_2014-213T123121.xml", "X")
]
for (fn, expect) in testcases:
    res = verifyFileName(fn, pattern)
    print("{}: {} [expect {} got {}]".format(fn, "PASS" if res == expect else "FAIL", expect, str(res)))
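One way to also catch the "X" case is to match the name segment by segment from the left and report the token where matching stops. This is only a sketch, not the method from either answer: first_mismatch and SEGMENTS are hypothetical names, the segment list is a hand-made breakdown of the pattern from the question, and "token" is assumed to mean the text up to the next ".", "_" or "-".

```python
import re

# Segments of a name such as S_hc_1.2.3_2014-213T123121.xml, in order.
SEGMENTS = [
    r"S", r"_", r"hc", r"_",
    r"[0-9]{1,2}", r"\.", r"[0-9]{1,2}", r"\.", r"[0-9]{1,2}", r"_",
    r"[0-9]{4}", r"-", r"[0-9]{1,3}T[0-9]{6}", r"\.", r"xml",
]

def first_mismatch(filename):
    """Return None for a valid name, else the text where matching stopped
    (an empty string if the name simply ended too early)."""
    pos = 0
    for segment in SEGMENTS:
        # Compiled patterns accept a start position, so no slicing is needed.
        mo = re.compile(segment).match(filename, pos)
        if mo is None:
            rest = filename[pos:]
            # Report the offending token: everything up to the next delimiter.
            token = re.match(r"[^._-]*", rest).group()
            return token or rest[:1]
        pos = mo.end()
    # Anything left after the last segment is also a mismatch.
    return filename[pos:] or None

for name in ("S_hc_1.2.3_2014-213T123121.xml",
             "S_hc_1.2.IncorrectName_2014-213T123121.xml",
             "X_hc_1.2.3_2014-213T123121.xml"):
    print(name, "->", first_mismatch(name))
```

Because matching proceeds strictly left to right, only the first offending part is reported; later errors in the same name are not seen until the first one is fixed.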

KeyError in Python Script

I've tried debugging this script but I'm not sure what's causing the error.
list1 = ['<p>Text ([0-9]):(.*)</p>' ,'<p>Text2 ([0-9]):(.*)</p>','<p>Text ([0-9]):(.*)</p>']
list2 = ["<p class='text'>Text \1:\2</p>" ,"<p class='text'>Text \1:\2</p>","<p class='text'>TEXT ([0-9]):(.*)</p>"]
translation = dict(zip(list1, list2))
pattern = re.compile('(%s)' % '|'.join(dicts.list1))
file.close()
file = open(args.file,'r+')
def translate(match):
    return dicts.translation[match.group(0)]
with open(args.file, 'r+') as output:
    with open(args.file, 'r+') as book:
        for line in book:
            output.write(pattern.sub(translate, line))
Error:
return dicts.translation5[match.group(0)]
KeyError: '<p>Text 1:1-1</p>'
I believe you are trying to match each line you read, find out which regexp it matches, and then apply the appropriate change to it (also in regexp form). This approach might work, but using a dictionary is redundant in this case.
The broad approach is:
You match the line against the compiled pattern to find a match.
Then you compare each pattern in list1 to the matched string to see if it matches.
If it does, you convert the matched string to the form in list2.
Something like:
import re

# raw strings keep \1 and \2 as backreferences instead of control characters
list1 = [r'<p>Text ([0-9]):(.*)</p>', r'<p>Text2 ([0-9]):(.*)</p>', r'<p>Text3 ([0-9]):(.*)</p>']
list2 = [r"<p class='text'>Text \1:\2</p>", r"<p class='text'>Text \1:\2</p>", r"<p class='text'>TEXT ([0-9]):(.*)</p>"]
translation = dict(zip(list1, list2))
pattern = re.compile('(%s)' % '|'.join(list1))

def translate(m):
    # find which individual pattern matched and apply its replacement
    for x, v in translation.items():
        if re.search(x, m.group()):
            return re.sub(x, v, m.group())

# 'book' and 'output' are the open input and output files from the question
for line in book:
    m = pattern.search(line)
    ret = translate(m) if m else None
    if ret is not None:
        output.write(ret)
    else:
        # No match. Echo back original line
        output.write(line)
Input
<p>Text 1:1-1</p>
Output
<p class='text'>Text 1:1-1</p>
There are probably other better ways to do it
The issue is that the text '<p>Text 1:1-1</p>' is not a key in your dict. As dicts is a free variable in your code, there is nothing more we can tell you.
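Try match.group(1) instead. In regex results, group(0) is the entire matched string and groups 1 and following are the capturing groups in the regex itself. In your case group(0) == "<p>Text 1:1-1</p>"; because your compiled pattern wraps every alternative in an outer group, group(1) is that same string and the digit "1" is in group(2).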
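The KeyError arises because the dictionary keys are pattern strings, while match.group(0) is the text that was matched. One way around this is to skip the dictionary lookup and try each (pattern, replacement) pair in turn. A minimal sketch, assuming the first two pairs from the question; RULES and translate_line are names invented for this example.

```python
import re

# Each rule pairs a compiled pattern with its replacement template.
# Raw strings keep \1 and \2 as backreferences.
RULES = [
    (re.compile(r"<p>Text ([0-9]):(.*)</p>"), r"<p class='text'>Text \1:\2</p>"),
    (re.compile(r"<p>Text2 ([0-9]):(.*)</p>"), r"<p class='text'>Text \1:\2</p>"),
]

def translate_line(line):
    """Apply the first rule that matches; return the line unchanged otherwise."""
    for rx, repl in RULES:
        new_line, count = rx.subn(repl, line)
        if count:
            return new_line
    return line

print(translate_line("<p>Text 1:1-1</p>"))  # <p class='text'>Text 1:1-1</p>
```

Since subn rewrites only the matched portion, the rest of the line (including the trailing newline) is preserved.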
