I'm trying to read a specific string from files. Basically, a file looks like this:
S0M6A36A108A180A252A324A36|1|48|89|36|Single|
S0M6A36A108A180A252A324A36|2|43|83|108|Single|
S0M6A36A108A180A252A324A36|3|37|85|180|Single|
S0M6A36A108A180A252A324A36|4|37|93|252|Single|
S0M6A36A108A180A252A324A36|5|43|95|324|Single|
S0M6A36A108A180A252A324A36|6|42|89|36|Single|
[META DATA]
01/10/2015|14:50:27|USA|UWI_N2C34_2|MMS1|FORD35|Bednarek|true|6|0|false|
[QUALITY CAMERA CHECK]
1|1|0|
2|1|0|
3|1|0|
4|1|0|
5|1|0|
6|1|0|
[PRESET]
S0M6A36A108A180A252A324A36|TA|
What I need is to read, from the line:
01/10/2015|14:50:27|USA|UWI_N2C34_2|MMS1|FORD35|Bednarek|true|6|0|false|
the country name between pipes, i.e. |USA|.
To do that I tried to use the group function, which is part of the regular expression module. I figured I need to read from the specific line which holds this string. So I wrote this small piece of code:
import os
import string
import re
import sys
import glob
import fileinput
country_pattern = 'MYS','IDN','ZAF', 'THA','TWN','SGP', 'NWZ', 'AUS','ALB','AUT','BEL', 'BGR', 'BIH', 'CHE','CZE', 'DEU', 'DNK', 'ESP','EST','SRB','MDK','MNE','BIH', 'BIH','MNE','FIN', 'FRA', 'GBR','GRC', 'HRV', 'HUN', 'IRL', 'ITA', 'LIE', 'LTU', 'LUX', 'LVA', 'MDA', 'SMR','CYP','NLD','NOR','POL','PRT','ROU','SCG', 'SVK','SVN','SWE','TUR','BRA','CAN','USA','MEX','CHL','ARG','RUS'
pattern = r'(\d+)/(\d+)/(\d+)|(\d+):(\d+):(\d+)|(\S+)|(\S+)|(\S+)|(\S+)|(\S+)|(\S+)|(\d+)|(\d+)|(\S+)|'
src = raw_input("Enter source disk location: ")
src = os.path.dirname(src)
for dir, _, _ in os.walk(src):
    file_path = glob.glob(os.path.join(dir, "*.txt"))
    for file in file_path:
        f = open(file, 'r')
        object_name = f.readlines()
        f.close()
        for line_name_tmp in object_name:
            line_name = line_name_tmp.replace('\n', '')
            if line_name == '':
                line_name.split()
                continue
            else:
                try:
                    searchObj = re.search(pattern, line_name)
                    m = searchObj.group(7)
                    if m in country_pattern:
                        print "searchObj.group(7) : ", searchObj.group(7)
                    else:
                        print 'did not find any match'
                except:
                    print line_name
                    pass
But it always prints 'did not find any match'. Did I miss something?
Thanks for the advice.
Your regex is the problem. Try this one:
pattern = r'(\d+)/(\d+)/(\d+)\|(\d+):(\d+):(\d+)\|(\S+)\|(\S+)\|(\S+)\|(\S+)\|(\S+)\|(\S+)\|(\d+)\|(\d+)\|(\S+)\|'
In regular expressions, the character | separates alternatives. So if you define a regex like this,
(\d+)/(\d+)/(\d+)|(\d+):(\d+):(\d+)
it will match a string of the form digits/digits/digits or a string of the form digits:digits:digits. Not both.
Accordingly, when you take your pattern regex and search the line
01/10/2015|14:50:27|USA|UWI_N2C34_2|MMS1|FORD35|Bednarek|true|6|0|false|
for a match, the regex winds up matching only the part 01/10/2015, because that part is matched by the first alternative ((\d+)/(\d+)/(\d+)). The seventh capturing group in the regex is not within the part that matched, so m.group(7) returns None, and of course None is not one of the elements in country_pattern.
The easy - or one might say lazy - way to fix this is to escape the pipe characters in the definition of the regex: use \| instead of |. But since you have fields separated by | in the file, I think you might have a better designed program if you were to use line_name.split('|') and then pick out the third field, instead of using a regular expression.
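To illustrate the split-based suggestion (a minimal sketch using the sample line from the question):

```python
# Sketch of the split-based approach: the fields are '|'-separated,
# so the country code is simply the third field.
line = '01/10/2015|14:50:27|USA|UWI_N2C34_2|MMS1|FORD35|Bednarek|true|6|0|false|'

fields = line.split('|')
country = fields[2]  # third field holds the country code
print(country)  # USA
```

This sidesteps the alternation pitfall entirely, since no regex is involved.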
If you just need to find the three-letter country abbreviation in the text, this will do it:
data = '''
01/10/2015|14:50:27|USA|UWI_N2C34_2|MMS1|FORD35|Bednarek|true|6|0|false|
'''
country_pattern = 'MYS','IDN','ZAF', 'THA','TWN','SGP', 'NWZ', 'AUS','ALB','AUT','BEL', 'BGR', 'BIH', 'CHE','CZE', 'DEU', 'DNK', 'ESP','EST','SRB','MDK','MNE','BIH', 'BIH','MNE','FIN', 'FRA', 'GBR','GRC', 'HRV', 'HUN', 'IRL', 'ITA', 'LIE', 'LTU', 'LUX', 'LVA', 'MDA', 'SMR','CYP','NLD','NOR','POL','PRT','ROU','SCG', 'SVK','SVN','SWE','TUR','BRA','CAN','USA','MEX','CHL','ARG','RUS'
mo = re.search(r'\|[A-Z]{3}\|',data)
if mo:
    print(mo.group(0))
|USA|
Related
I am learning Python and am struggling with finding an exact word in each string in a list of strings.
Apologies if this is an already asked question for this situation.
This is what my code looks like so far:
with open('text.txt') as f:
    lines = f.readlines()
lines = [line.rstrip('\n') for line in open('text.txt')]

keyword = input("Enter a keyword: ")
matching = [x for x in lines if keyword.lower() in x.lower()]
match_count = len(matching)
print('\nNumber of matches: ', match_count, '\n')
print(*matching, sep='\n')
Right now, matching will return all strings containing the word, not strings containing the exact word. For example, if I enter 'local' as the keyword, strings with 'locally' and 'localized' will be returned in addition to 'local', when I only want instances of 'local' returned.
I have tried:
match_test = re.compile(r"\b" + keyword+ r"\b")
match_test = ('\b' + keyword + '\b')
match_test = re.compile(r'(?:^|\s|$){0}'.format(keyword))
matching = [x for x in lines if keyword.lower() == x.lower()]
matching = [x for x in lines if keyword.lower() == x.lower().strip()]
And none of them have worked, so I'm a bit stuck.
How do I take the keyword entered from the user, and then return all strings in a list that contain that exact keyword?
Thanks
in means contained in: 'abc' in 'abcd' is True. For an exact match, use ==
matching = [x for x in lines if keyword.lower() == x.lower()]
You might need to remove spaces/newlines as well
matching = [x for x in lines if keyword.lower().strip() == x.lower().strip()]
Edit:
To find a line containing the keyword you can use loops
matches = []
for line in lines:
    for string in line.split(' '):
        if string.lower().strip() == keyword.lower().strip():
            matches.append(line)
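The same word-by-word check can also be written as a comprehension (a sketch with made-up sample data):

```python
# Exact-word matching via split, condensed into a list comprehension.
# `lines` and `keyword` here are hypothetical sample data.
lines = ['local student', 'i live locally', 'end with local']
keyword = 'local'

matches = [line for line in lines
           if keyword.lower() in (word.lower() for word in line.split())]
print(matches)  # ['local student', 'end with local']
```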
This method avoids having to read the whole file into memory. It also handles cases like "LocaL" or "LOCAL", assuming you want to capture all such variants. There is a bit of performance overhead in building the temp string each time a line is read, however:
import re

def reader(filename, target):
    # This regexp matches a word at the front, end or in the middle of a line
    # stripped of all punctuation and other non-alpha, non-whitespace characters:
    regexp = re.compile(r'(^| )' + target.lower() + r'($| )')
    with open(filename) as fin:
        matching = []
        # read lines one at a time:
        for line in fin:
            line = line.rstrip('\n')
            # generate a line of lowercase and whitespace to test against
            temp = ''.join([x.lower() for x in line if x.isalpha() or x == ' '])
            print(temp)
            if regexp.search(temp):
                matching.append(line)  # store unaltered line
    return matching
Given the following tests:
locally local! localized
locally locale nonlocal localized
the magic word is Local.
Localized or nonlocal or LOCAL
This is returned:
['locally local! localized',
'the magic word is Local.',
'Localized or nonlocal or LOCAL']
Please find my solution, which should match only 'local' among the text shown below from the text file. I used a search regular expression to find the instances containing only 'local'; other strings containing 'local' as a substring will not be matched.
Variables which were provided in text file :
local
localized
locally
local
local diwakar
local
local##!
Code to find only instances of 'local' in text file :
import os
import sys
import time
import re
with open('C:/path_to_file.txt') as f:
    for line in f:
        a = re.search(r'local\W$', line)
        if a:
            print(line)
Output
local
local
local
Let me know if this is what you were looking for
Your first test seems to be on the right track.
Using input:
import re
lines = [
    'local student',
    'i live locally',
    'keyboard localization',
    'what if local was in middle',
    'end with local',
]
keyword = 'local'
Try this:
pattern = re.compile(r'.*\b{}\b'.format(keyword.lower()))
matching = [x for x in lines if pattern.match(x.lower())]
print(matching)
Output:
['local student', 'what if local was in middle', 'end with local']
pattern.match will return the first instance of the regex matching, or None. Using this as your if condition filters for strings that match the whole keyword somewhere. This works because \b matches the beginning/end of words. The .* captures any characters that may occur at the start of the line before your keyword shows up.
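A quick illustration of the \b behavior described above (sample strings are made up):

```python
import re

# \b matches at a word boundary, so only whole-word 'local' matches;
# 'locally' and 'nonlocal' do not.
pattern = re.compile(r'.*\blocal\b')

print(bool(pattern.match('a local shop')))     # True
print(bool(pattern.match('locally sourced')))  # False
print(bool(pattern.match('nonlocal issue')))   # False
```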
For more info about using Python's re, checkout the docs here: https://docs.python.org/3.8/library/re.html
You can try
pattern = re.compile(r"\b{}\b".format(keyword))
match_test = pattern.search(line)
like shown in
Python - Concat two raw strings with an user name
How can I write a regular expression to match only the string names, without the .csv extension? This should be the required output:
Required Output:
['ap_2010', 'class_size', 'demographics', 'graduation','hs_directory', 'sat_results']
Input:
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"]
I tried, but it returns an empty list:
for i in data_files:
    regex = re.findall(r'/w+/_[/d{4}][/w*]?', i)
If you really want to use a regular expression, you can use re.sub to remove the extension if it exists, and if not, leave the string alone:
[re.sub(r'\.csv$', '', i) for i in data_files]
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results']
A better approach in general is using the os module to handle anything to do with filenames:
[os.path.splitext(i)[0] for i in data_files]
['ap_2010',
'class_size',
'demographics',
'graduation',
'hs_directory',
'sat_results']
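On Python 3 the same result also falls out of pathlib (an alternative I'm adding here, not part of the os.path answer):

```python
from pathlib import Path

data_files = ["ap_2010.csv", "class_size.csv", "demographics.csv"]

# Path.stem is the final path component without its extension.
print([Path(name).stem for name in data_files])
# ['ap_2010', 'class_size', 'demographics']
```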
If you want regex, the solution is r'(.*)\.csv':
for i in data_files:
    regex = re.findall(r'(.*)\.csv', i)
    print(regex)
Split the string at '.' and then take the last element of the split (using index [-1]). If this is 'csv' then it is a csv file.
for i in data_files:
    if i.split('.')[-1].lower() == 'csv':
        # It is a CSV file
        pass
    else:
        # Not a CSV
        pass
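As a concrete usage of that same test (a sketch; 'notes.txt' is a made-up non-CSV name added for contrast):

```python
data_files = ["ap_2010.csv", "class_size.csv", "notes.txt"]

# Keep only the names whose last '.'-separated part is 'csv'.
csv_files = [name for name in data_files
             if name.split('.')[-1].lower() == 'csv']
print(csv_files)  # ['ap_2010.csv', 'class_size.csv']
```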
# Input
data_files = [ 'ap_2010.csv', 'class_size.csv', 'demographics.csv', 'graduation.csv', 'hs_directory.csv', 'sat_results.csv' ]
import re
pattern = r'(?P<filename>[a-z0-9A-Z_]+)\.csv'
prog = re.compile(pattern)
# `map` function yields:
# - a `List` in Python 2.x
# - a `Generator` in Python 3.x
result = map(lambda data_file: re.search(prog, data_file).group('filename'), data_files)
l = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"]
print([i.rsplit('.', 1)[0] for i in l])
I have a file path saved as filepath in the form of /home/user/filename. Some examples of what the filename could be:
'1990MAlogfile'
'Tantrologfile'
'2003RF_2004logfile'
I need to write something that turns the filepath into just part of the filename (but I do not have just the filename saved as anything yet). For example:
/home/user/1990MAlogfile becomes '1990 MA', /home/user/Tantrologfile becomes 'Tantro', or /home/user/2003RF_2004logfile becomes '2003 RF'.
So I need everything after the last forward slash and before an underscore if it's present (or before the 'logfile' if it's not), and then I need to insert a space between the last number and first letter if there are numbers present. Then I'd like to save the outcome as objkey. Any idea on how I could do this? I was thinking I could use regex, but don't know how I would handle inserting a space in those certain cases.
Code
import os
import re

def get_filename(filepath):
    temp = os.path.basename(filepath)[:-7].split('_')[0]  # [:-7] strips 'logfile'
    a = re.findall('^[0-9]*', temp)[0]
    b = temp[len(a):]
    return ' '.join([a, b])
example = '/home/user/2003RF_2004logfile'
objkey = get_filename(example)
Explanation
import regular expression package
import re
example filepath
example = '/home/user/2003RF_2004logfile'
/home/user/2003RF_2004logfile
get the filename and remove everything after the _
temp = example.split('/')[-1].split('_')[0]
2003RF
get the beginning portion (splits if numbers at the beginning)
a = re.findall('^[0-9]*', temp)[0]
2003
get the end portion
b = temp[len(a):]
RF
combine the beginning and end portions
return ' '.join([a, b])
2003 RF
import os, re, string

mystr = 'home/user/2003RF_2004logfile'

def format_str(path):
    end = os.path.split(path)[-1]
    m1 = re.match('(.+)logfile', end)
    try:
        this = m1.group(1)
        this = this.split('_')[0]
    except AttributeError:
        return None
    m2 = re.match('(.+[0-9])(.+)', this)
    try:
        return " ".join([m2.group(1), m2.group(2)])
    except AttributeError:
        return this
I have a script that searches through config files and finds all matches of strings from another list as follows:
dstn_dir = "C:/xxxxxx/foobar"
dst_list =[]
files = [fn for fn in os.listdir(dstn_dir) if fn.endswith('txt')]
dst_list = []
for file in files:
    parse = CiscoConfParse(dstn_dir + '/' + file)
    for sfarm in search_str:
        int_objs = parse.find_all_children(sfarm)
        if len(int_objs) > 0:
            dst_list.append(["\n", "#" * 40, file + " " + sfarm, "#" * 40])
            dst_list.append(int_objs)
I need to change this part of the code:
for sfarm in search_str:
    int_objs = parse.find_all_children(sfarm)
search_str is a list containing strings similar to ['xrout:55','old:23'] and many others.
So it will only find entries that end with the string from the list I am iterating through in sfarm. My understanding is that this would require me to use re and match on something like sfarm$, but I'm not sure how to do this as part of the loop.
Am I correct in saying that sfarm is an iterable? If so, I need to know how to regex on an iterable object in this context.
Strings in python are iterable, so sfarm is an iterable, but that has little meaning in this case. From reading what CiscoConfParse.find_all_children() does, it is apparent that your sfarm is the linespec, which is a regular expression string. You do not need to explicitly use the re module here; just pass sfarm concatenated with '$':
search_str = ['xrout:55', 'old:23']
...
for sfarm in search_str:
    int_objs = parse.find_all_children(sfarm + '$')  # one of many ways to concat
    ...
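One caveat worth adding (my note, not from the original answer): if an entry in search_str could ever contain regex metacharacters, escape it before appending the anchor:

```python
import re

# 'price(usd)' is a hypothetical entry containing metacharacters;
# re.escape makes the parentheses match literally.
sfarm = 'price(usd)'
pattern = re.compile(re.escape(sfarm) + '$')

print(bool(pattern.search('int gi0/1 price(usd)')))  # True: at end of string
print(bool(pattern.search('price(usd) trailing')))   # False: not at the end
```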
Please check this code. I used the glob module to get all "*.txt" files in the folder.
Please check here for more info on glob module.
import glob
import re
dst_list = []
search_str = ['xrout:55', 'old:23']
for file_name in glob.glob(r'C:/Users/dinesh_pundkar\Desktop/*.txt'):
    with open(file_name, 'r') as f:
        text = f.read()
    for sfarm in search_str:
        regex = re.compile('%s$' % sfarm)
        int_objs = regex.findall(text)
        if len(int_objs) > 0:
            dst_list.append(["\n", "#" * 40, file_name + " " + sfarm, "#" * 40])
            dst_list.append(int_objs)

print dst_list
Output:
C:\Users\dinesh_pundkar\Desktop>python a.py
[['\n', '########################################', 'C:/Users/dinesh_pundkar\\De
sktop\\out.txt old:23', '########################################'], ['old:23']]
C:\Users\dinesh_pundkar\Desktop>
I need a regex in Python to match and return the integer after the string "id": in a text file.
The text file contains the following:
{"page":1,"results": [{"adult":false,"backdrop_path":"/ba4CpvnaxvAgff2jHiaqJrVpZJ5.jpg","id":807,"original_title":"Se7en","release_date":"1995-09-22","p
I need to get the 807 after the "id", using a regular expression.
Is this what you mean?
#!/usr/bin/env python
import re
subject = '{"page":1,"results": [{"adult":false,"backdrop_path":"/ba4CpvnaxvAgff2jHiaqJrVpZJ5.jpg","id":807,"original_title":"Se7en","release_date":"1995-09-22","p'
match = re.search('"id":([^,]+)', subject)
if match:
    result = match.group(1)
else:
    result = "no result"
print result
The Output: 807
Edit:
In response to your comment, adding one simple way to ignore the first match. If you use this, remember to add something like "id":809,"etc to your subject so that we can ignore 807 and find 809.
n = 1
for match in re.finditer('"id":([^,]+)', subject):
    if n == 1:
        print "ignoring the first match"
    else:
        print match.group(1)
    n += 1
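A shorter way to skip the first match (a sketch; the subject string here is a made-up sample holding two ids):

```python
import re

subject = '{"id":807,"x":1},{"id":809,"x":2}'

# findall returns every captured id in order; slicing drops the first.
ids = re.findall('"id":([^,]+)', subject)
print(ids[1:])  # ['809']
```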
Assuming that there is more to the file than that:
import json

with open('/path/to/file.txt') as f:
    data = json.loads(f.read())

print(data['results'][0]['id'])
If the file is not valid JSON, then you can get the value of id with:
from re import compile, IGNORECASE

r = compile(r'"id"\s*:\s*(\d+)', IGNORECASE)
with open('/path/to/file.txt') as f:
    for match in r.findall(f.read()):
        print(match)