I am using the Python Paramiko module to SFTP into one of my servers. I called listdir() to get all of the files in the folder. Out of that folder I'd like to use regex to find the matching pattern and then print out the entire string.
listdir() returns a list of XML files in this format:
LOG_MMDDYYYY_HHMM.XML
LOG_07202018_2018.XML --> this is for the date 07/20/2018 at the time 20:18
I'd like to use regex to find all the XML files for a particular date and store them in a list or a variable. I can then pass this variable to Paramiko to get the files.
for log in file_list:
    regex_pattern = 'POSLog_' + date + '*'
    if (re.search(regex_pattern, log) != None):
        matchObject = re.findall(regex_pattern, log)
        print(matchObject)
The code above just prints:
['Log_07202018']
I want it to store the entire string, e.g. Log_07202018_2018.XML, in a variable.
How would I go about doing this?
Thank you
If you are looking for a fixed string, don't use regex.
search_str = 'POSLog_' + date
for line in file_list:
    if search_str in line:
        print(line)
Alternatively, a list comprehension can make list of matching lines in one go:
log_lines = [line for line in file_list if search_str in line]
for line in log_lines:
    print(line)
If you must use regex, there are a few things to change:
Any variable part that you put into the regex pattern must either be guaranteed to be a regex itself, or it must be escaped.
"The rest of the line" is not *, it's .*.
The start-of-line anchor ^ should be used to speed up the search - this way the regex fails faster when there is no match on a given line.
To support the ^ on multiple lines instead of only at the start of the entire string, the MULTILINE flag is needed.
There are several ways of getting all matches. One could do "for each line, if there is a match, print line", same as above. Here I'm using .finditer() and a search over the whole input block (i.e. not split into lines).
log_pattern = '^POSLog_' + re.escape(date) + '.*'
for match in re.finditer(log_pattern, whole_file, re.MULTILINE):
    print(match.group(0))
Your code prints only the matched part; just do print(log) instead and it will print the whole filename.
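Tying the substring approach to the Paramiko use case, a minimal sketch (the sample filenames, the date value, and the commented-out sftp calls are assumptions, not from the question):

```python
# Hypothetical stand-ins for sftp.listdir() output and the date of interest
file_list = ['LOG_07202018_2018.XML',
             'LOG_07202018_0915.XML',
             'LOG_07192018_2300.XML']
date = '07202018'

# A plain substring/prefix test; no regex needed for a fixed prefix
search_str = 'LOG_' + date
matches = [name for name in file_list if name.startswith(search_str)]
print(matches)

# Each full filename can then be passed to Paramiko, e.g.:
# for name in matches:
#     sftp.get(name, '/local/dir/' + name)
```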
Related
I would like to parse out e-mail addresses from several text files in Python. In a first attempt, I tried to get the following element that includes an e-mail address from a list of strings ('2To whom correspondence should be addressed. E-mail: joachim+pnas#uci.edu.\n').
When I try to find the list element that includes the e-mail address via i.find("#") == 0 it does not give me the content[i]. Am I misunderstanding the .find() function? Is there a better way to do this?
from os import listdir

TextFileList = []
PathInput = "C:/Users/p282705/Desktop/PythonProjects/ExtractingEmailList/text/"

# Count the number of different files you have!
for filename in listdir(PathInput):
    if filename.endswith(".txt"):  # In case you accidentally put other files in directory
        TextFileList.append(filename)

for i in TextFileList:
    file = open(PathInput + i, 'r')
    content = file.readlines()
    file.close()
    for i in content:
        if i.find("#") == 0:
            print(i)
The standard way of checking whether a string contains a character, in Python, is using the in operator. In your case, that would be:
for i in content:
    if "#" in i:
        print(i)
The find method, as you were using, returns the position where the # character is located, starting at 0, as described in the official Python documentation.
For instance, in the string abc#google.com, it will return 3. In case the character is not located, it will return -1. The equivalent code would be:
for i in content:
    if i.find("#") != -1:
        print(i)
However, this is considered unpythonic and the in operator usage is preferred.
Find returns the index of the substring you are searching for, which isn't the right check for what you are trying to do.
You would be better off using a regular expression (RE) to search for an occurrence of #. In your case, you may run into a situation where there is more than one email address per line (again, I don't know your input data, so I can't take a guess).
Something along these lines would benefit you:
import re

for i in content:
    findEmail = re.search(r'[\w\.-]+#[\w\.-]+', i)
    if findEmail:
        print(findEmail.group(0))
You would need to adjust this for valid email addresses; for instance, symbols like + are legal in the local part but aren't matched by the character class above.
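To show the multiple-addresses-per-line case, re.findall returns every match rather than just the first. A quick sketch with the same (obfuscated) pattern; the sample line is an assumption:

```python
import re

# '#' stands in for '@', as in the question's obfuscated text
pattern = r'[\w\.-]+#[\w\.-]+'
line = 'E-mail: joachim+pnas#uci.edu or admin#example.org today'

# findall returns all matches on the line; note that 'joachim+' is
# dropped because '+' is not in the character class
found = re.findall(pattern, line)
print(found)
```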
The find function in Python returns the index of that character in the string (or -1 if it is absent). Maybe you can try this?
words = i.split(' ')   # split the string into words
for x in words:        # search each word for the '#' character
    if x.find("#") != -1:
        print(x)
I am trying to read a bunch of regexes from a file, using python.
The regexes come in a file regexes.csv, a pair in each line, and the pair is separated by commas. e.g.
<\? xml([^>]*?)>,<\? XML$1>
peter,Peter
I am doing
detergent = []

infile = open('regexes.csv', 'r')
for line in infile:
    line = line.strip()
    [search_term, replace_term] = line.split(',', 1)
    detergent += [[search_term, replace_term]]
This is not producing the right result. If I print detergent I get
[['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'], ['peter','Peter']]
It seems that the backslashes are being escaped.
Moreover, in a file containing, say
<? xml ........>
a command re.sub(search_term, replace_term, file_content) further down in the script replaces it with
<\? XML$1>
So, the $1 is not recovering the first capture group in the first regex of the pair.
What is the proper way to input regexes from a file to be later used in re.sub?
When I've had the regexes inside the script I would write them inside the r'...', but I am not sure what are the issues at hand when reading form a file.
There are no issues or special requirements for reading regexes from a file. The escaping of backslashes is simply how Python represents a string containing them. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Echo it at the interactive prompt and you'll see it displayed the same way ...
>>> r"\?"
'\\?'
The reason your $1 is not being replaced is that this is not Python's syntax for group references. The correct syntax is \1.
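If the file must keep its sed/Perl-style $1 references, one option is to rewrite them into Python's \1 syntax before calling re.sub. A sketch, assuming the file only uses $N-style group references (the sample content string is an assumption):

```python
import re

# A search/replace pair as it might be read from the CSV file
search_term = r'<\? xml([^>]*?)>'
replace_term = '<? XML$1>'

# Convert $1, $2, ... into Python's \1, \2 backreference syntax
replace_term = re.sub(r'\$(\d+)', r'\\\1', replace_term)

content = '<? xml version="1.0"?>'
result = re.sub(search_term, replace_term, content)
print(result)
```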
I'm parsing a file via
output = wildcard.parseFile(myfile)
print output
and I only get the first match of the string.
I have a big config file to parse, with "entries" which are surrounded by braces.
I expect to see all the matches in the file, or an exception if there is no match.
How do I achieve that?
By default, pyparsing will find the longest match, starting at the first character. So, if your parse is given by num = Word('0123456789'), parsing either "462" or "462-780" will both return the same value. However, if the parseAll=True option is passed, the parse will attempt to parse the entire string. In this case, "462" would be matched, but parsing "462-780" would raise a ParseException, because the parser doesn't know how to deal with the dash.
I would recommend constructing something that will match the entirety of the file, then using the parseAll=True flag in parseFile(). If I understand your description of each entry being separated by braces correctly, one could do the following.
entire_file = OneOrMore('[' + wildcard + ']')
output = entire_file.parseFile(myfile, parseAll=True)
print output
I am looking for a regex that will extract everything up to the first . (period) in a string, and everything including and after the last . (period)
For example:
my_file.10.4.5.6.csv
myfile2.56.3.9.txt
Ideally the regex when run against these strings would return:
my_file.csv
myfile2.txt
The numeric stamp in the file will be different each time the script is run, so I am looking essentially to exclude it.
The following prints out the string up to the first . (period)
print re.search("^[^.]*", data_file).group(0)
I am having trouble though getting it to also return the the last period and string after it.
Sorry just to update this based upon feedback and comments below:
This does need to be a regex. The regex will be passed into the program from a configuration file. The user will not have access to the source code as it will be packaged.
The user may need to change the regex based upon some arbitrary criteria, so they will need to update the config file, rather than edit the application and re-build the package.
Thanks
You don’t need a regular expression!
parts = data_file.split(".")
print parts[0] + "." + parts[-1]
Instead of regular expressions, I would suggest using str.split. For example:
>>> data_file = 'my_file.10.4.5.6.csv'
>>> parts = data_file.split('.')
>>> print parts[0] + '.' + parts[-1]
my_file.csv
However if you insist on regular expressions, here is one approach:
>>> print re.sub(r'\..*\.', '.', data_file)
my_file.csv
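A variant with capture groups keeps the two pieces you want explicit, which may be easier to maintain in a configuration file. A sketch, assuming the filename always contains at least two periods:

```python
import re

# Group 1: everything before the first period; group 2: everything after the last
pattern = r'^([^.]*)\..*\.([^.]*)$'

names = ['my_file.10.4.5.6.csv', 'myfile2.56.3.9.txt']
cleaned = [re.sub(pattern, r'\1.\2', name) for name in names]
print(cleaned)
```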
You don't need a regex.
tokens = expanded_name.split('.')
compressed_name = '.'.join((tokens[0], tokens[-1]))
If you are concerned about performance, you could use a length limit and rsplit() to only chop up the string as much as you need.
compressed_name = expanded_name.split('.', 1)[0] + '.' + expanded_name.rsplit('.', 1)[1]
Do you need a regex here?
>>> address = "my_file.10.4.5.6.csv"
>>> split_by_periods = address.split(".")
>>> "{}.{}".format(split_by_periods[0], split_by_periods[-1])
'my_file.csv'
I'm a Python beginner and just ran into a simple problem: I have a list of names (designators) and a very simple piece of code that reads lines from a CSV file and prints the lines whose first column (row[0]) matches a name in my designator list. So:
import csv
DesignatorList = ["AAX-435", "AAX-961", "HHX-9387", "HHX-58", "K-58", "K-14", "K-78524"]
with open('DesignatorFile.csv','rb') as FileReader:
    for row in csv.reader(FileReader, delimiter=';'):
        if row[0] in DesignatorList:
            print row
My CSV file is only a list of names, like this:
AAX-435
AAX-961
HHX-58
HHX-9387
I would like to be able to use wildcards like * and ., example: let's say that I put this on my csv file:
AAX*
H.X-9387
*58
I need my code to be able to interpret those wild cards/control characters, printing the following:
every line that starts with "AAX";
every line that starts with "H", then any following character, then finally ends with "X-9387";
every line that ends with "58".
Thank you!
EDIT: For future reference (in case somebody runs into the same problem), this is how I solved my problem following Roman's advice:
import csv
import re
DesignatorList = ["AAX-435", "AAX-961", "HHX-9387", "HHX-58", "K-58", "K-14", "K-78524"]
with open('DesignatorFile.txt','rb') as FileReader:
    for row in csv.reader(FileReader, delimiter=';'):
        designator_col0 = row[0]
        designator_col0_re = re.compile("^" + ".*".join(re.escape(i) for i in designator_col0.split("*")) + "$")
        for d in DesignatorList:
            if designator_col0_re.match(d):
                print d
Try the re module.
You may need to prepare the regular expression (regex) for use by replacing '*' with '.*' and adding ^ (beginning of string) and $ (end of string) anchors at the beginning and the end. In addition, you may need to escape everything else with the re.escape function (that is, the escape function from the re module).
In case you do not have any other "control characters" (as you call them), you can split the string on "*" and join the pieces with ".*" after applying escape to each piece.
For example,
import re

def make_rule(rule):  # where rule is, for example, "H*X-9387"
    return re.compile("^" + ".*".join(re.escape(i) for i in rule.split("*")) + "$")
Then you can match (I guess, your rule is row):
...
rule_re = make_rule(row)
for d in DesignatorList:
    if rule_re.match(d):
        print row  # or maybe print d
(My understanding is that the rules come from the CSV file while the designators come from a list; it is easy to do it the other way around.)
These are only examples; you still need to adapt them to your program.
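As a small check of the helper above, here it is in action against (a trimmed copy of) the question's designator list:

```python
import re

def make_rule(rule):
    # Escape each literal piece, then join the pieces with '.*' for every '*'
    return re.compile("^" + ".*".join(re.escape(i) for i in rule.split("*")) + "$")

designators = ["AAX-435", "AAX-961", "HHX-9387", "HHX-58", "K-58"]
rule_re = make_rule("AAX*")
matching = [d for d in designators if rule_re.match(d)]
print(matching)
```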
Python's string object does have a startswith and an endswith method, which you could use here if you only had a small number of rules. The most general way to go with this, since you seem to have fairly simple patterns, is regular expressions. That way you can encode those rules as patterns.
import re
rules = ['^AAX.*$',      # starts with AAX
         '^H.*X-9387$',  # starts with H, ends with X-9387
         '^.*58$']       # ends with 58

for line in reader:
    if any(re.match(rule, line) for rule in rules):
        print line