Python regular expression NoneType error - python

I have a text file named "filename.txt"
content of file:
This is just a text
content to store
in a file.
I have written two Python scripts to extract "to" from the text file.
my 1st script:
#!/usr/bin/python
import re
f = open("filename.txt","r")
for line in f:
    text = re.match(r"content (\S+) store",line)
    x = text.group(1)
    print x
my 2nd script:
#!/usr/bin/python
import re
f = open("filename.txt","r")
for line in f:
    text = re.match(r"content (\S+) store",line)
    if text:
        x = text.group(1)
        print x
2nd script gives the correct output
bash-3.2$ ./script2.py
to
but 1st script gives me an error
bash-3.2$ ./script1.py
Traceback (most recent call last):
File "./script1.py", line 6, in ?
x = text.group(1)
AttributeError: 'NoneType' object has no attribute 'group'
Why does adding an "if" condition give me the correct output, while removing it causes an error?

The error is fairly self-explanatory: re.match returns None if no match is found (see doc).
So when your regex doesn't match (e.g. on the first line of the file), you are calling the group method on None, which raises the AttributeError.
In the second script, you only call group when text is a match object, since that is what if text: checks (None is falsy, while a match object is always truthy).

This is because in your first script the regex fails to match on some lines, so text is None (of type NoneType) for those lines. When you then call group on it, Python raises AttributeError: 'NoneType' object has no attribute 'group'.
Your second script doesn't fail because you are careful to call group only if something was actually matched.
The second approach is better since, unlike the first, it handles lines that don't match.
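As a minimal sketch of the guarded approach described above, written for Python 3 (the scripts in the question are Python 2): the list sample_lines stands in for the contents of filename.txt, and the helper name extract_words is made up for illustration.

```python
import re

# Stand-in for the three lines of filename.txt from the question.
sample_lines = ["This is just a text\n", "content to store\n", "in a file.\n"]


def extract_words(lines):
    """Collect the captured word from every line that matches the pattern."""
    found = []
    for line in lines:
        match = re.match(r"content (\S+) store", line)
        if match is None:
            # re.match returned None: there is no match object to call group() on.
            continue
        found.append(match.group(1))
    return found


words = extract_words(sample_lines)
print(words)
```

Only the second sample line matches, so the other two are skipped instead of raising AttributeError.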

Related

Regex trouble with list comprehension

I am working in Python 3.7.6 on Windows and am attempting to use regex to transform one list of foo.csv.gz filenames into a list of the corresponding foo.csv filenames. A code snippet is below.
zippedFileNames = [re.search('[^/]*\\.gz', link).group(0) for link in linksList]
unzippedFileNames = [re.search('.*\\.csv', name).group(0) for name in zippedFileNames]
In the above code, zippedFileNames is a list created by isolating the .gz filenames from a list of download links. This line works as I expect, and taking zippedFileNames[0] returns a string. The type of zippedFileNames[0] is str and the type of zippedFileNames is list.
However, the code throws an error on the second line:
Exception has occurred: AttributeError
'NoneType' object has no attribute 'group'
File "H:\foo\bar\foobar.py", line 133, in <listcomp>
x = [re.search('.*\\.csv', name).group(0) for name in zippedFileNames]
File "H:\foo\bar\foobar.py", line 133, in <module>
x = [re.search('.*\\.csv', name).group(0) for name in zippedFileNames]
This code was working yesterday but stopped working today, and I am not sure what I changed to break it. I believe it broke after I tried amending the second line's pattern to omit the first digits and underscore using the pattern '[^0-9\\_].*\\.csv' (the filenames all follow a pattern 0000_foo_bar_foobar.csv.gz). However, even reverting the pattern to the old one before the omission does not solve my problem.
Is there something I'm not seeing?
Thank you!
EDIT:
Thank you for your answers.
I checked whether there is a None in my list both by printing all list items and by using print(None in zippedFileNames). The latter test returned False and the former returned all the items as I'm expecting. I also did not find None in my linksList.
When I run the regex re.search on just one of the elements of linksList, linksList[0], I get the correct string output.
Are there other things I can try?
EDIT 2:
I tried re-using the original regex pattern '[^/]*\\.gz' in a separate call and it worked. Then I also tried using the pattern '[^/]*\\.csv\\.gz' in hopes of getting the same result as I got with the former pattern, but this pattern returned an error as well. I'm suspecting that the errors have something to do with \\.csv.
RESOLUTION
I was matching on .csv, but it turned out that I had a .report file as well, and that one was throwing off the entire script. Iterating through matches helped to isolate the issue. To solve the regex, I matched the pattern '.*\\[^.gz]' to keep all file extensions, not just .csv. Thank you very much!!
You can use
import re
zippedFileNames=['0001_foo1.csv', 'def.bz', '0000_foo2.csv.gz']
unzippedFileNames = []
for name in zippedFileNames:
    m = re.match(r"\d+_(.*\.csv)", name)
    if m:
        unzippedFileNames.append(m.group(1))
print(unzippedFileNames)
# => ['foo1.csv', 'foo2.csv']
See the Python demo.
Here, the unzippedFileNames is declared as an empty list. Then, iterating over the zippedFileNames, each name is checked against a \d+_(.*\.csv) regex (note that re.match only searches for a match at the start of the string), and if there is a match (if m:) the Group 1 contents are appended to the unzippedFileNames list.
Check whether there is a None or empty value in the lists that you are using, zippedFileNames or linksList.
re.search returns None if the string doesn't match. Your second regex seems wrong. I think it should be '.*\.csv'. You can test it with regex101.
[EDIT]: Your regex is correct; you probably have a file name in zippedFileNames that doesn't match the regex.
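One way to track down the offending entry is to list the names for which re.search returns None before calling group. The sketch below assumes a small made-up list of file names (including a .report file like the one mentioned in the resolution); the variable names are illustrative only.

```python
import re

# Hypothetical sample data; the real list comes from the download links.
zipped_names = ["0000_foo.csv.gz", "0001_bar.report.gz", "0002_baz.csv.gz"]

csv_pattern = re.compile(r".*\.csv")

# Names the pattern cannot match; these are what break the list comprehension.
unmatched = [name for name in zipped_names if csv_pattern.search(name) is None]

# Keep only the successful matches, so group(0) is never called on None.
unzipped_names = [m.group(0) for m in map(csv_pattern.search, zipped_names) if m]

print(unmatched)
print(unzipped_names)
```

Filtering on the match object inside the comprehension keeps the one-line style of the original code while skipping names that do not match.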

What is the RegEx pattern for 24-06-2015 10:15:45: Aditya Krishnakant:?

What is the RegEx pattern for 24-06-2015 10:15:45: Aditya Krishnakant:
If you look at the whatsapp chat transcript, it looks like a mess. The purpose of this code is to print messages sent by a person in a new line (for better readability). This is my code
import re
f = open("wa_chat.txt", "r")
match = re.findall(r'(\d{2})\:(\d{2})\:(\d{4})\s(\d{2})\:(\d{2})\:(\d{2})\:\s(\w)\s(\w)\:', f)
for content in match:
    print(f.readlines(), '\n')
f.close()
I am getting the following error message:
Traceback (most recent call last):
File "whatsapp.py", line 4, in <module>
match = re.findall(r'(\d{2})\:(\d{2})\:(\d{4})\s(\d{2})\:(\d{2})\:(\d{2})\:\s(\w)\s(\w)\:', f)
File "/usr/lib/python2.7/re.py", line 177, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
Where am I going wrong?
For some reason you're putting \: where - should be. Also, instead of \s you can be more specific and just use a space. You can be more specific with those kinds of things because you know exactly what the format is. Your other big problem is that you're only using \w, which only matches one alphanumeric character, when you should use \w+, matching the whole word. Lastly, your actual error is coming from the fact that you're passing in a file object instead of the string containing its contents, i.e. f.read(). Here's some code that should work:
import re
f = open("wa_chat.txt", 'r')
match = re.findall(r'(\d{2})-(\d{2})-(\d{4}) (\d{2}):(\d{2}):(\d{2}): (\w+) (\w+):', f.read())
print match #or do whatever you want with it
Note that match will be a list of tuples since you wanted to use grouping.
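To illustrate both points (pass a string, not a file object, and expect a list of tuples), here is a small Python 3 sketch. It uses io.StringIO as a stand-in for the opened wa_chat.txt, and the message text after the name is invented for the example.

```python
import io
import re

# Stand-in for open("wa_chat.txt"); the message body is made up.
chat_file = io.StringIO("24-06-2015 10:15:45: Aditya Krishnakant: Hello there\n")

header = re.compile(
    r'(\d{2})-(\d{2})-(\d{4}) (\d{2}):(\d{2}):(\d{2}): (\w+) (\w+):'
)

# findall needs the text itself, so read the file contents first.
matches = header.findall(chat_file.read())

for day, month, year, hour, minute, second, first, last in matches:
    print(f"{first} {last} at {hour}:{minute}:{second} on {day}-{month}-{year}")
```

Because the pattern contains eight groups, each element of matches is an 8-tuple of strings rather than the full matched text.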

How to make the re.search() try a best attempt approach

This is the text file sb.txt
JOHN:ENGINEER:35?:
Now this the piece of code that tries to perform a regex search on the above line.
biodata1 = re.search(r'([\w\W])+?:([\w\W])+?:([\w\W])+?:',line)
Now I get a proper output for biodata1.group(1), biodata1.group(2) and biodata1.group(3).
If however, I modify the file by removing ":" from the end
JOHN:ENGINEER:35?
and run the script again, I get the following error which makes sense since group(3) didn't match successfully
Traceback (most recent call last):
File "dictionary.py", line 26, in <module>
print('re.search(r([\w\W])+?:([\w\W])+?:([\w\W])+? '+biodata1.group(1)+' '+biodata1.group(2)+' '+biodata1.group(3)) # STMT1
AttributeError: 'NoneType' object has no attribute 'group'
But group(1) and group(2) should still have matched "N" and "R" respectively. Is there any way to avoid this error and have the regex make a best attempt, so that it doesn't fail and at least prints biodata1.group(1) and biodata1.group(2)?
I tried editing the output statement so that it doesn't print biodata1.group(3), but that didn't work.
I think you misunderstand what has happened. Your entire regular expression has failed to match and therefore there is no match object.
Where it says AttributeError: 'NoneType' object has no attribute 'group' it's trying to tell you that biodata1 is None. None is the return you get from re.search when it fails to match.
To be clear, there's no way to get a "best match". What you're asking for is that re should make a decision as to what you really want. If you want groups to be optional, you need to make them optional.
Depending on what you actually want you can try the regexes:
r'([\w\W])+?:([\w\W])+?:([\w\W])+?:?'
or
r'([\w\W])+?:([\w\W])+?:(([\w\W])+?:)?'
Which respectively make the last : and the entire last group optional.
You'll have to modify the regex to instruct it on what exactly is optional and what isn't. Python regexes don't have this concept of partial matches. One possibility is to change it to
biodata1 = re.search(r'([\w\W])+?:(?:([\w\W])+?:(?:([\w\W])+?:)?)?',line)
This allows 1, 2 or 3 groups to match. In this case, any group that doesn't take part in the match returns None when you call match.group(X).
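A rough sketch of the nested-optional idea follows. Note that it swaps in a different field pattern, ([^:]+), so that each group captures a whole field rather than its last character; that substitution and the helper name parse_fields are assumptions made for illustration, not part of the answer above.

```python
import re

# Field 1 is required; fields 2 and 3 are optional, each ending in ':'.
FIELDS = re.compile(r'([^:]+):(?:([^:]+):(?:([^:]+):)?)?')


def parse_fields(line):
    """Return a tuple of the three fields, with None for fields that did not match."""
    match = FIELDS.match(line)
    if match is None:
        return None
    return match.groups()


complete = parse_fields("JOHN:ENGINEER:35?:")
partial = parse_fields("JOHN:ENGINEER:35?")
no_colon = parse_fields("JOHN")

print(complete)
print(partial)
print(no_colon)
```

When the trailing colon is missing, the third group does not participate in the match and comes back as None, while the first two fields are still available. A line with no colon at all still fails to match entirely, so the None check on the match object remains necessary.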
What a regex does is it matches exactly what you provided. There is no best try or anything like that.
If you want some part of your match to be optional you need to declare it using the ? operator. So in your case your regex would need to look like this:
biodata1 = re.search(r'([\w\W])+?:([\w\W])+?:([\w\W])+?:?',line)
Note that +? is the lazy form of + (one or more, as few as possible); it is not the same as * (zero or more). If a field is also allowed to be empty, use * instead:
biodata1 = re.search(r'([\w\W])*:([\w\W])*:([\w\W])*:?',line)

How to re-match a group that did not capture anything?

I'm trying to parse a string in which a certain section can either be enclosed between " or ' or not be enclosed at all. However, I'm struggling finding a syntax that works when no quotation marks are there at all.
See the following (simplified) example:
>>> print re.match(r'\w(?P<quote>(\'|"))?\w', 'f"oo').group('quote')
"
>>> print re.match(r'\w(?P<quote>(\'|"))?\w', 'foo').group('quote')
None
>>> print re.match(r'\w(?P<quote>(\'|"))?\w(?P=quote)', 'f"o"o').group('quote')
"
>>> print re.match(r'\w(?P<quote>(\'|"))?\w(?P=quote)', 'foo').group('quote')
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "<string>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
'NoneType' object has no attribute 'group'
The desired result for the last attempt should be None as the second command in the example.
Based on the suggestions I got to another question, I was able to produce a slightly different regex that provides the correct answers:
>>> re.match(r'\w(?P<quote>[\'"]?)\w(?P=quote)\w', 'foo').group('quote')
u''
>>> re.match(r'\w(?P<quote>[\'"]?)\w(?P=quote)\w', 'f"o"o').group('quote')
u'"'
>>> re.match(r'\w(?P<quote>[\'"]?)\w(?P=quote)\w', 'f\'o\'o').group('quote')
u"'"
The trick was really to use a quantifier on the character matched rather than on the entire group.
[ The leading and trailing \w in this example are only there to prevent the regex from matching the full string (as an unquoted string). In the real-world case this was not needed, as this match is part of a larger regex with groups matched before and after it. ]
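The difference between the two forms can be sketched in Python 3 as below. The compiled pattern names are made up for the example; the patterns themselves follow the ones above, with the optional group written as a character class.

```python
import re

# Quantifier on the whole group: if no quote is present the group is unset,
# and the backreference (?P=quote) to an unset group makes the match fail.
optional_group = re.compile(r'\w(?P<quote>[\'"])?\w(?P=quote)')

# Quantifier inside the group: the group always participates, possibly
# capturing an empty string, so the backreference matches the empty string.
optional_char = re.compile(r'\w(?P<quote>[\'"]?)\w(?P=quote)\w')

failed = optional_group.match('foo')
unquoted = optional_char.match('foo').group('quote')
quoted = optional_char.match('f"o"o').group('quote')

print(failed)
print(repr(unquoted))
print(repr(quoted))
```

With the quantifier inside the group, an unquoted string yields an empty string for the quote group rather than no match at all.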

Python TypeError when using variable in re.sub

I'm new to Python and I keep getting an error doing the simplest thing.
I'm trying to use a variable in a regular expression and replace whatever it matches with an *.
The following gives me the error "TypeError: not all arguments converted during string formatting" and I can't tell why. This should be so simple.
import re
file = "my123filename.zip"
pattern = "123"
re.sub(r'%s', "*", file) % pattern
Error:
Traceback (most recent call last):
File "", line 1, in ?
TypeError: not all arguments converted during string formatting
Any tips?
Your problem is on this line:
re.sub(r'%s', "*", file) % pattern
What you're doing is replacing every occurrence of %s with * in the text from the string file (I'd recommend renaming the variable to filename, to avoid shadowing the builtin file object and to make it more explicit what you're working with). Then you're trying to apply % pattern to the (already replaced) text. However, that text doesn't contain any format specifiers, which leads to the TypeError you see. It's basically the same as:
'this is a string' % ("foobar!")
which will give you the same error.
What you probably want is something more like:
re.sub(str(pattern),'*',file)
which is exactly equivalent to:
re.sub(r'%s' % pattern,'*',file)
Try re.sub(pattern, "*", file)? Or maybe skip re altogether and just do file.replace("123", "*").
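As a small sketch of passing a variable to re.sub in Python 3: the example below additionally wraps the variable in re.escape, which is an extra precaution not mentioned in the answers above, for the case where the variable should be treated as literal text rather than as a regex. The second file name is invented to show the difference.

```python
import re

filename = "my123filename.zip"
pattern = "123"

# The variable is passed directly as the pattern argument; no % formatting needed.
masked = re.sub(re.escape(pattern), "*", filename)

# Why escaping can matter: an unescaped "." matches any character.
literal = re.sub(re.escape("1.3"), "*", "my1.3file_123.zip")
unescaped = re.sub("1.3", "*", "my1.3file_123.zip")

print(masked)
print(literal)
print(unescaped)
```

For a fixed literal such as "123", filename.replace(pattern, "*") gives the same result without involving re at all.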
