Regex trouble with list comprehension - python

I am working in Python 3.7.6 on Windows and am attempting to use regex to transform one list of foo.csv.gz filenames into a list of the corresponding foo.csv filenames. A code snippet is below.
zippedFileNames = [re.search('[^/]*\\.gz', link).group(0) for link in linksList]
unzippedFileNames = [re.search('.*\\.csv', name).group(0) for name in zippedFileNames]
In the above code, zippedFileNames is a list created by isolating the .gz filenames from a list of download links. This line works as I expect, and taking zippedFileNames[0] returns a string. The type of zippedFileNames[0] is str and the type of zippedFileNames is list.
However, the code throws an error on the second line:
Exception has occurred: AttributeError
'NoneType' object has no attribute 'group'
File "H:\foo\bar\foobar.py", line 133, in <listcomp>
x = [re.search('.*\\.csv', name).group(0) for name in zippedFileNames]
File "H:\foo\bar\foobar.py", line 133, in <module>
x = [re.search('.*\\.csv', name).group(0) for name in zippedFileNames]
This code was working yesterday but stopped working today, and I am not sure what I changed to break it. I believe it broke after I tried amending the second line's pattern to omit the first digits and underscore using the pattern '[^0-9\\_].*\\.csv' (the filenames all follow a pattern 0000_foo_bar_foobar.csv.gz). However, even reverting the pattern to the old one before the omission does not solve my problem.
Is there something I'm not seeing?
Thank you!
EDIT:
Thank you for your answers.
I checked whether there is a None in my list both by printing all list items and by using print(None in zippedFileNames). The latter test returned False and the former returned all the items as I'm expecting. I also did not find None in my linksList.
When I run the regex re.search on just one of the elements of linksList, linksList[0], I get the correct string output.
Are there other things I can try?
EDIT 2:
I tried re-using the original regex pattern '[^/]*\\.gz' in a separate call and it worked. Then I also tried using the pattern '[^/]*\\.csv\\.gz' in hopes of getting the same result as I got with the former pattern, but this pattern returned an error as well. I'm suspecting that the errors have something to do with \\.csv.
RESOLUTION
I was matching on .csv, but it turned out that I had a .report file as well, and that one was throwing off the entire script. Iterating through matches helped to isolate the issue. To solve the regex, I matched the pattern '.*\\[^.gz]' to keep all file extensions, not just .csv. Thank you very much!!
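A minimal sketch of that kind of match-by-match diagnosis (the file names below are invented; the second stands in for the stray .report file):

```python
import re

# Hypothetical names: the second entry plays the role of the .report
# file that broke the original list comprehension.
zippedFileNames = ["0000_foo_bar_foobar.csv.gz", "0001_foo.report.gz"]

for name in zippedFileNames:
    m = re.search(r'.*\.csv', name)
    if m is None:
        # this is the name that would raise AttributeError inside a
        # list comprehension calling .group(0) directly
        print("no match:", name)
    else:
        print("match:", m.group(0))
```

Checking `m is None` before calling `.group()` turns the opaque AttributeError into a printout of the offending file name.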

You can use
import re

zippedFileNames = ['0001_foo1.csv', 'def.bz', '0000_foo2.csv.gz']
unzippedFileNames = []
for name in zippedFileNames:
    m = re.match(r"\d+_(.*\.csv)", name)
    if m:
        unzippedFileNames.append(m.group(1))
print(unzippedFileNames)
# => ['foo1.csv', 'foo2.csv']
Here, the unzippedFileNames is declared as an empty list. Then, iterating over the zippedFileNames, each name is checked against a \d+_(.*\.csv) regex (note that re.match only searches for a match at the start of the string), and if there is a match (if m:) the Group 1 contents are appended to the unzippedFileNames list.

Check whether there is a None or empty value in the lists that you are using, zippedFileNames or linksList.

re.search returns None if the string doesn't match. Your second regex seems wrong; I think it should be '.*\.csv'. You can test it with regex101.
[EDIT]: Your regex is correct; you probably have a file in zippedFileNames which doesn't match the regex.

Related

Python regular expression for Windows file path

The problem, and it may not be easily solved with a regex, is that I want to be able to extract a Windows file path from an arbitrary string. The closest that I have been able to come (I've tried a bunch of others) is using the following regex:
[a-zA-Z]:\\([a-zA-Z0-9() ]*\\)*\w*.*\w*
Which picks up the start of the file and is designed to look at patterns (after the initial drive letter) of strings followed by a backslash and ending with a file name, optional dot, and optional extension.
The difficulty is what happens, next. Since the maximum path length is 260 characters, I only need to count 260 characters beyond the start. But since spaces (and other characters) are allowed in file names I would need to make sure that there are no additional backslashes that could indicate that the prior characters are the name of a folder and that what follows isn't the file name, itself.
I am pretty certain that there isn't a perfect solution (the perfect being the enemy of the good), but I wondered if anyone could suggest a "best possible" solution?
Here's the expression I got, based on yours, that allows me to get the path on Windows: [a-zA-Z]:\\((?:[a-zA-Z0-9() ]*\\)*).*. An example of it being used is available here: https://regex101.com/r/SXUlVX/1
First, I changed the capture group from ([a-zA-Z0-9() ]*\\)* to ((?:[a-zA-Z0-9() ]*\\)*).
Your original expression captures each XXX\ one after another (e.g. Users\, then the next Users\), so the group ends up holding only the last one.
Mine wraps the repetition as (?:[a-zA-Z0-9() ]*\\)*, which captures the concatenation XXX\YYYY\ZZZ\ as a whole. As such, it allows me to get the full path.
The second change I made is related to the filename: I just match whatever remains after the folder segments (the preceding capture group being greedy). This takes care of strange file names.
Another regex that would work would be [a-zA-Z]:\\((?:.*?\\)*).*, as shown in this example: https://regex101.com/r/SXUlVX/2
This time, I used .*?\\ to match the XXX\ parts of the path.
.*? will match in a non-greedy way: thus, .*?\\ will match the bare minimum of text followed by a backslash.
Do not hesitate if you have any question regarding the expressions.
I'd also encourage you to test how well your expression works using https://regex101.com. The site also has a list of the different tokens you can use in your regex.
Edit: As my previous answer did not work (though I'll need to spend some time finding out exactly why), I looked for another way to do what you want, and I managed to do so using string splitting and joining.
The command is "\\".join(TARGETSTRING.split("\\")[1:-1]).
How does this work: I split the original string into a list of substrings at each backslash. I then remove the first and last parts ([1:-1] keeps from the 2nd element to the one before the last) and join the resulting list back into a string.
This works, whether the value given is a path or the full address of a file.
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred is a file path
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred\ is a directory path
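A quick check of that split/join command on the two sample strings (TARGETSTRING is simply the input being examined):

```python
# Drop the first segment (before the first backslash) and the last
# segment (the filename, or the empty string after a trailing slash),
# then rejoin what remains as the folder path.
for TARGETSTRING in [
    "Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred",
    "Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred\\",
]:
    print("\\".join(TARGETSTRING.split("\\")[1:-1]))
# => Adobe\Acrobat Distiller
# => Adobe\Acrobat Distiller\acrbd.exe fred
```

Note the asymmetry: with a trailing backslash the filename-like last segment survives as part of the path, which is what distinguishes the directory-path case from the file-path case above.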

Python 2.7 Regex; Finding varying number of expressions

I am working on a bioinformatics project and am currently trying to split a certain string containing locations on a chromosome.
Example of a few strings, which go by the name "location":
NC_000023.11:g.154532082
NC_000023.11:g.154532058_154532060
NC_000023.11:g.154532046
What I would like returned looks like:
([154532082])
([154532058], [154532060])
([154532046])
I can not think of a regex that normally captures only the first number, and when present, separately captures the second number, without creating a second group, as with:
re.findall(":g.(\d*)_?(\d*)", location)
which gives:
([154532082], [])
([154532058], [154532060])
([154532046], [])
or
re.findall(":g.(\d*)", location), re.findall("\d_(\d*)", location)
which gives:
[(154532082), ()]
[(154532058), (154532060)]
[(154532046), ()]
Is there any expression that would solve this? Or should I see and try to remove the empty lists after finding them the way I do?
Here is what you could do:
[re.search("(?<=:g.)(\d*)_?(\d*)", item).group() for item in location.split("\n")]
What I did here was to make a list comprehension to do everything in a single line. Going by parts:
for item in location.split("\n")
This iterates over a list built from the location string, where I split the string in all the line breaks. Now the for loop will iterate over every part of the string between the line breaks. Each of these parts is now called 'item'.
re.search("(?<=:g.)(\d*)_?(\d*)", item).group()
Here I perform a positive lookbehind assertion, which means that the regex will look for ':g.' (the ?<=:g. part), match everything after that, and ditch the ':g.'. As for group(), this is just to print the match from the re.search() method.
Read the python documentation on regex, it helps a lot:
https://docs.python.org/2/library/re.html
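If the empty second captures are the main annoyance, one variation on the question's own pattern is to filter them out of .groups() (a sketch, assuming the locations are already in a list):

```python
import re

locations = [
    "NC_000023.11:g.154532082",
    "NC_000023.11:g.154532058_154532060",
    "NC_000023.11:g.154532046",
]

# \d+ for the first number, an optional _\d+ for the second; dropping
# the None entries from .groups() leaves only the numbers present.
results = [
    tuple(n for n in re.search(r":g\.(\d+)(?:_(\d+))?", loc).groups() if n)
    for loc in locations
]
print(results)
# => [('154532082',), ('154532058', '154532060'), ('154532046',)]
```

This yields one tuple per location with exactly as many numbers as the location actually contains, so no post-hoc removal of empty matches is needed.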

Python - Injecting html tags into strings based on regex match

I wrote a script in Python for custom HTML page that finds a word within a string/line and highlights just that word with use of following tags where instance is the word that is searched for.
<b><font color=\"red\">"+instance+"</font></b>
With the following result:
I need to find a word (case insensitive) let's say "port" within a string that can be port, Port, SUPPORT, Support, support etc, which is easy enough.
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However, my strings often contain 2 or more instances in a single line, and I need to append
<b><font color=\"red\">"+instance+"</font></b> to each of those instances without changing their case.
The problem with my approach is that I am iterating over each of the instances found with findall (exact match),
while the same match can also occur multiple times within the string.
for instance in find_all_instances:
    second_pattern = re.compile(instance)
    string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking, I would be able to avoid this if I was able to find out exact part of the string that the pattern.sub substitutes at the moment of doing it,
however I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone have a way I could use to insert <b><font color="red">instance</font></b> without replacing instance for all matches(case insensitive), then I would be grateful.
Maybe I'm misinterpreting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs
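For reference, a single re.sub call can wrap every occurrence while keeping its original case, by substituting a reference to the whole match (\g<0>) rather than the search term itself (the sample line here is invented):

```python
import re

line = "The port, the Port and SUPPORT all need highlighting."

# \g<0> in the replacement stands for the exact text that matched,
# so each occurrence keeps its original case.
highlighted = re.sub(
    r"port",
    r'<b><font color="red">\g<0></font></b>',
    line,
    flags=re.IGNORECASE,
)
print(highlighted)
```

Because the replacement is applied once per match in a single pass, this also avoids the nested-tags problem from the question's loop-and-sub approach.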
Okay, so here are two ways I tried quickly! The second loop is definitely the way to go. It uses re.sub (as someone else commented too). Bear in mind that it replaces each match with the lowercase search term.
import re

FILE = open("testing.txt", "r")
word = "port"

# THIS LOOP IS CASE SENSITIVE
for line in FILE:
    newline = line.replace(word, "<b><font color=\"red\">"+word+"</font></b>")
    print newline

FILE.seek(0)  # rewind; the first loop consumed the file

# THIS LOOP IS CASE-INSENSITIVE
for line in FILE:
    pattern = re.compile(word, re.IGNORECASE)
    newline = pattern.sub("<b><font color=\"red\">"+word+"</font></b>", line)
    print newline

How to make the re.search() try a best attempt approach

This is the text file sb.txt
JOHN:ENGINEER:35?:
Now this the piece of code that tries to perform a regex search on the above line.
biodata1 = re.search(r'([\w\W])+?:([\w\W])+?:([\w\W])+?:',line)
Now I get a proper output for biodata1.group(1), biodata1.group(2) and biodata1.group(3).
If however, I modify the file by removing ":" from the end
JOHN:ENGINEER:35?
and run the script again, I get the following error which makes sense since group(3) didn't match successfully
Traceback (most recent call last):
File "dictionary.py", line 26, in <module>
print('re.search(r([\w\W])+?:([\w\W])+?:([\w\W])+? '+biodata1.group(1)+' '+biodata1.group(2)+' '+biodata1.group(3)) # STMT1
AttributeError: 'NoneType' object has no attribute 'group'
But group(1) and group(2) should've still matched "N" and "R" respectively. Is there any way to avoid this error and attempt a best-effort approach with the regex, so it doesn't fail and at least prints biodata1.group(1) and biodata1.group(2)?
I tried to edit the output statement to not print biodata1.group(3), though that didn't work.
I think you misunderstand what has happened. Your entire regular expression has failed to match and therefore there is no match object.
Where it says AttributeError: 'NoneType' object has no attribute 'group' it's trying to tell you that biodata1 is None. None is the return you get from re.search when it fails to match.
To be clear, there's no way to get a "best match". What you're asking for is that re should make a decision as to what you really want. If you want groups to be optional, you need to make them optional.
Depending on what you actually want you can try the regexes:
r'([\w\W])+?:([\w\W])+?:([\w\W])+?:?'
or
r'([\w\W])+?:([\w\W])+?:(([\w\W])+?:)?'
Which respectively make the last : and the entire last group optional.
You'll have to modify the regex to instruct it on what exactly is optional and what isn't. Python regexes don't have this concept of partial matches. One possibility is to change it to
biodata1 = re.search(r'([\w\W])+?:(?:([\w\W])+?:(?:([\w\W])+?:)?)?',line)
Where you allow 1, 2 or 3 fields to match. In this case, any group that doesn't participate in the match will return None when you do match.group(X).
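A quick check of that pattern against both variants of the line:

```python
import re

pattern = r'([\w\W])+?:(?:([\w\W])+?:(?:([\w\W])+?:)?)?'

# With the trailing colon all three groups participate; without it,
# group(3) comes back as None instead of the whole search failing.
m1 = re.search(pattern, 'JOHN:ENGINEER:35?:')
print(m1.group(1), m1.group(2), m1.group(3))  # => N R ?
m2 = re.search(pattern, 'JOHN:ENGINEER:35?')
print(m2.group(1), m2.group(2), m2.group(3))  # => N R None
```

(Each group captures only the last repeated character, e.g. "N" of "JOHN", because the + sits outside the parentheses; that matches the behaviour the question describes.)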
What a regex does is it matches exactly what you provided. There is no best try or anything like that.
If you want some part of your match to be optional you need to declare it using the ? operator. So in your case your regex would need to look like this:
biodata1 = re.search(r'([\w\W])+?:([\w\W])+?:([\w\W])+?:?',line)
Also note that +? is a lazy "one or more"; it is not the same as * ("zero or more"). If empty fields are acceptable, you could instead write:
biodata1 = re.search(r'([\w\W])*:([\w\W])*:([\w\W])*:?',line)

How to search for string in Python by removing line breaks but return the exact line where the string was found?

I have a bunch of PDF files that I have to search for a set of keywords against. I have to extract the exact line where the keyword was found. I first used xpdf's pdf2text to convert the file to PDF. (Tried solr but had a tough time tailoring the output/schema to suit my requirement).
import sys
file_name = sys.argv[1]
searched_string = sys.argv[2]
result = [(line_number+1, line) for line_number, line in enumerate(open(file_name)) if searched_string.lower() in line.lower()]
#print result
for each in result:
    print each[0], each[1]
ThinkCode:~$ python find_string.py sample.txt "String Extraction"
The problem I have with this is the case where the search string is broken across the end of a line:
If you are going to index large binary files, remember to change the
size limits. String
Extraction is a common problem
If I am searching for 'String Extraction', I will miss this keyword if I use the code presented above. What is the most efficient way of achieving this without making two copies of the text file (one for searching the keyword to extract the line number, and another with line breaks removed, to catch keywords that span two lines)?
Much appreciated guys!
Note: Some considerations without any code, but I think they belong to an answer rather than to a comment.
My idea would be to search only for the first keyword; if a match is found, search for the second. This allows you, when the match is found at the end of the line, to take the next line into consideration, and to do line concatenation only if a match was found in the first place*.
Edit:
Coded a simple example and ended up using a different algorithm; the basic idea behind it is this code snippet:
def iterwords(fh):
    for number, line in enumerate(fh):
        for word in re.split(r'\s+', line.strip()):
            yield number, word
It iterates over the file handler and produces a (line_number, word) tuple for each word in the file.
The matching afterwards becomes pretty easy; you can find my implementation as a gist on github. It can be run as follows:
python search.py 'multi word search string' file.txt
There is one main concern with the linked code, I didn't code a workaround both for performance and complexity reasons. Can you figure it out? (Spoiler: try to search for a sentence whose first word appears two times in a row in the file)
* I didn't perform any testing on my own, but this article and the python wiki suggest that string concatenation is not that efficient in python (I don't know how current that information is).
There may be a better way of doing it, but my suggestion would be to start by taking in two lines (let's call them line1 and line2), concatenating them into line3 or something similar, and then search that resultant line.
Then you'd assign line2 to line1, get a new line2, and repeat the process.
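A rough sketch of that sliding-window idea (function and variable names are mine, and the duplicate report you'd get when a phrase sits entirely inside the second line of a window is deliberately left unhandled for brevity):

```python
def search_with_window(lines, phrase):
    # Join each line with the one after it, so a phrase broken across
    # a line break is still found; return the 1-based line numbers
    # where the matching window starts.
    phrase = phrase.lower()
    hits = []
    for i, line in enumerate(lines):
        window = line.strip()
        if i + 1 < len(lines):
            window += " " + lines[i + 1].strip()
        if phrase in window.lower():
            hits.append(i + 1)
    return hits

sample = [
    "If you are going to index large binary files, remember to change the",
    "size limits. String",
    "Extraction is a common problem",
]
print(search_with_window(sample, "String Extraction"))  # => [2]
```

This keeps only two lines in memory at a time, so the file is still read in a single pass with no second copy.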
Search the whole text at once (rather than line by line) and use \s+ in place of literal spaces to represent all whitespace, including newlines: http://docs.python.org/library/re.html
Note that \s matches newlines without any special flag; re.MULTILINE only changes the meaning of ^ and $.
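For example, replacing the literal space in the search phrase with \s+ lets the match survive the line break (the sample text is taken from the question):

```python
import re

text = ("If you are going to index large binary files, remember to change the\n"
        "size limits. String\n"
        "Extraction is a common problem")

# \s matches newline characters even without flags, so \s+ bridges
# the break between "String" and "Extraction".
m = re.search(r"String\s+Extraction", text)
print(repr(m.group(0)))  # => 'String\nExtraction'
```

To recover the line number afterwards, you could count the newlines in text[:m.start()].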
