splitting a text file into words using regex in python

Brand new to Python! I'm given a text file (https://en.wikipedia.org/wiki/Character_mask) and I need to split it into single words, where a word is more than a single letter and words are separated by one or more of any other character. I've tried using regex but can't seem to split it right without an error. Here is the code I have so far. Can anyone help me fix this regular expression?
import re
file = open("charactermask.txt", "r")
text = file.read()
message = print(re.split(',.-\d\c\s',text))
print (message)
file.close()

You can use re.findall with the following regex pattern instead to find all words that are more than one character long.
Change:
message = print(re.split(',.-\d\c\s',text))
to:
message = re.findall(r'[A-Za-z]{2,}', text)
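A minimal sketch of that suggestion. The sample string is an assumption standing in for the contents of charactermask.txt, so the snippet runs without the file:

```python
import re

# Hypothetical sample text; in the real script this would come from file.read().
text = "A mask is worn; it hides 2 faces, I think."

# Keep runs of two or more ASCII letters. Digits, punctuation,
# and single letters such as "A" and "I" are skipped.
words = re.findall(r"[A-Za-z]{2,}", text)
print(words)
```

Note that [A-Za-z] only covers unaccented ASCII letters, so a word like "café" would be cut at the accented character.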

If you only need simple whitespace-separated tokens from a string, you can use str.split(); it works like a charm!
For example:
mystring = "My favorite color is blue"
mystring.split()
['My', 'favorite', 'color', 'is', 'blue']
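One caveat worth checking: split() only splits on whitespace, so punctuation stays attached to the tokens. A small sketch, using a made-up sentence, that strips it afterwards with string.punctuation:

```python
import string

sentence = "My favorite color is blue, not red."

# Whitespace split: punctuation remains glued to the neighbouring word.
tokens = sentence.split()

# Strip ASCII punctuation from both ends of every token.
stripped = [token.strip(string.punctuation) for token in tokens]

print(tokens)
print(stripped)
```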

If you're just trying to split the text then SmashGuy's answer should get your job done. Using regex would seem like an overkill. Additionally, your regex pattern doesn't quite seem to do what you described your intention to be. You might want to test your pattern out until you get it right before plugging it into your python script. Try https://regex101.com/
Here's what your pattern does right now:
, matches the character , literally (case sensitive)
. matches any character (except for line terminators)
- matches the character - literally (case sensitive)
\d matches a digit (equal to [0-9])
\c matches the character c literally (case sensitive) in older Python versions; since Python 3.6 an unknown escape such as \c raises re.error
\s matches any whitespace character (equal to [\r\n\t\f\v ])
I'm not sure if you actually meant [,.-], a character class that matches any one of these characters. You might also have had the wrong impression of the \c token, as it doesn't do anything special in Python's flavor of regex.
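If a character class was indeed the intent, a re.split version might look like the sketch below. The sample text and the length filter are assumptions, added to match the "more than a single letter" requirement from the question:

```python
import re

# Hypothetical sample text standing in for the file contents.
text = "one, two - 3 three.four  a."

# A character class matches ONE of the listed characters; the + lets a run
# of separators (commas, dots, hyphens, digits, whitespace) count as one.
pieces = re.split(r"[,.\-\d\s]+", text)

# Drop empty strings and single-letter leftovers.
words = [piece for piece in pieces if len(piece) > 1]

print(pieces)
print(words)
```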

Related

Regex to match following pattern in SQL query

I am trying to extract parts of a MySQL query to get the information I want.
I used this code / regex in Python:
import re
query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"
table_and_columns = re.findall('\`.*?`[.]\`.*?`',query)
My expected output:
['`asd`.`ssss`', '`ss`.`wwwwwww`']
My real output:
['`asd`.`ssss`', '`column1`, `ss`.`wwwwwww`']
Can anybody help me and explain where I went wrong?
The regex should only find the ones that have two strings like asd and a dot in the middle.
PS: I know that this is not a valid query.

The dot . can also match a backtick, so the pattern starts by matching a backtick and is able to match all chars until it reaches the literal dot in [.]
There is no need to use non-greedy quantifiers; you can use a negated character class to prevent crossing the backtick boundary.
`[^`]*`\.`[^`]*`
The asterisk * matches 0 or more times. If there has to be at least a single char, and newlines and spaces are unwanted, you could add \s to the negated character class to prevent matching whitespace chars and use + to match 1 or more times.
`[^`\s]+`\.`[^`\s]+`
For example
import re
query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"
table_and_columns = re.findall(r'`[^`\s]+`\.`[^`\s]+`', query)
print(table_and_columns)
Output
['`asd`.`ssss`', '`ss`.`wwwwwww`']

Please try the regex below. The . in .*? can match a backtick too, so the match is able to run past the closing backtick; that is what caused the issue.
Instead you should search for [^`]*
`[^`]*?`\.`[^`]*?`

The thing is that
.*? matches any character (except for line terminators), including whitespace.
Also, since * already means zero or more occurrences, I'm not sure you need the ? after it.
So this seems to work:
\`\S+\`[.]\`\S+\`
where \S is any non-whitespace character.
You can always check your regexes using https://regex101.com
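To compare the approaches side by side, here is a short sketch that runs the original pattern and two of the suggested ones on the query from the question (the variable names are illustrative):

```python
import re

query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"

# Original pattern: the lazy dot can still cross a backtick.
lazy_dot = re.findall(r"`.*?`[.]`.*?`", query)

# Negated character class: a match can never cross a backtick.
negated = re.findall(r"`[^`]*`\.`[^`]*`", query)

# Non-whitespace class: a match can never cross a space.
non_space = re.findall(r"`\S+`[.]`\S+`", query)

print(lazy_dot)
print(negated)
print(non_space)
```

On this input the last two patterns agree, but only the negated class guarantees that each identifier stays inside its own pair of backticks.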

Python: Regex to search for a "Mozilla" but ignore the match if the string also includes "iPhone" [duplicate]

I am trying to search for all occurrences of "Tom" which are not followed by "Thumb".
I have tried to look for
Tom ^((?!Thumb).)*$
but I still get the lines that match to Tom Thumb.

You don't say what flavor of regex you're using, but this should work in general:
Tom(?!\s+Thumb)

In case you are not looking for whole words, you can use the following regex:
Tom(?!.*Thumb)
If there are more words to check after a wanted match, you may use
Tom(?!.*(?:Thumb|Finger|more words here))
Tom(?!.*Thumb)(?!.*Finger)(?!.*more words here)
To make . match line breaks please refer to How do I match any character across multiple lines in a regular expression?
If you are looking for whole words (i.e. a whole word Tom should only be matched if there is no whole word Thumb further to the right of it), use
\bTom\b(?!.*\bThumb\b)
Note that:
\b - matches a leading/trailing word boundary
(?!.*Thumb) - is a negative lookahead that fails the match if there are any 0+ characters (depending on the engine including/excluding linebreak symbols) followed with Thumb.
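The difference between these lookaheads is easiest to see on a few test lines. The lines and the dictionary keys below are made up for illustration:

```python
import re

lines = [
    "Tom Thumb is small",
    "Tom Sawyer met Thumb",
    "Tom Sawyer paints",
    "Tomato soup",
]

patterns = {
    # Rejects Tom only when Thumb follows immediately after whitespace.
    "adjacent": re.compile(r"Tom(?!\s+Thumb)"),
    # Rejects Tom when Thumb appears anywhere later on the same line.
    "anywhere": re.compile(r"Tom(?!.*Thumb)"),
    # Same as above, but Tom and Thumb must both be whole words.
    "whole_word": re.compile(r"\bTom\b(?!.*\bThumb\b)"),
}

results = {
    name: [bool(pattern.search(line)) for line in lines]
    for name, pattern in patterns.items()
}

for name, flags in results.items():
    print(name, flags)
```

Only the whole-word variant rejects "Tomato soup", because the other two patterns accept Tom as a prefix of a longer word.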

Tom(?!\s+Thumb) is what you search for.

Python regex needed for format: 'delete([any text here])'

I am a total regex beginner. I want to create a regular expression that strictly allows the word delete followed by a pair of parentheses that contain any kind of characters (e.g. http://www.waynesworld1.com).
If I put it all together, it should accept the following: delete(http://www.waynesworld123.com).
Let me emphasize that the regex should strictly accept delete() and shouldn't accept elete(). As long as the user types in delete(), anything is acceptable within the parentheses (for example, delete(12!#Ww) would be fine).
How can I craft this regex in Python? So far all I have is /delete/ for my regex.

Here you go:
^delete\(.*\)$
^ assert position at start of the string
delete matches the characters delete literally (case sensitive)
\( matches the character ( literally
.* matches any character (except newline, unless re.DOTALL is set as in the test code below)
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\) matches the character ) literally
$ assert position at end of the string
Here is some Python test code:
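import re
# A list keeps the test order stable (a set iterates in arbitrary order).
txt = ["delete(http://www.waynesworld123.com)",
       "delete(12!#Ww)",
       "elete(test)",
       "delete[test]",
       "test"]
# Raw string so the backslashes reach the regex engine unchanged.
pattern = re.compile(r'^delete\(.*\)$', re.DOTALL)
for line in txt:
    if pattern.search(line):
        print('PASS', line)
    else:
        print('FAIL', line)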
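As a variation, and not part of the answer above, re.fullmatch can stand in for the ^ and $ anchors, and a capturing group can return the text between the parentheses. The helper name parse_delete is hypothetical:

```python
import re

# fullmatch requires the whole string to match, so no ^ or $ is needed.
DELETE_CALL = re.compile(r"delete\((.*)\)")

def parse_delete(command):
    """Return the text inside delete(...), or None if the format is wrong."""
    match = DELETE_CALL.fullmatch(command)
    return match.group(1) if match else None

print(parse_delete("delete(http://www.waynesworld123.com)"))
print(parse_delete("delete(12!#Ww)"))
print(parse_delete("elete(test)"))
print(parse_delete("delete[test]"))
```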

Strip punctuation with regular expression - python

I would like to strip all punctuation (except the dot) from the beginning and end of a string, but not from the middle of it.
For instance, for the original string:
##%%.Hol$a.A.$%
I would like to get .Hol$a.A. , with the punctuation removed from the beginning and end but not from the middle of the word.
Another example could be for the string:
##%%...&Hol$a.A....$%
In this case the returned string should be ..&Hol$a.A.... because we do not care if the allowed characters are repeated.
The idea is to remove all punctuation (except the dot) only at the beginning and end of the word. A word is defined as \w and/or a .
A practical example is the string 'Barnes&Nobles'. For text analysis it is important to recognize Barnes&Nobles as a single entity, but without the '
How to accomplish the goal using Regex?

Use this simple and easily adaptable regex:
[\w.].*[\w.]
It will match exactly your desired result, nothing more.
[\w.] matches any word character (letter, digit, or underscore) or a dot
.* matches any character (except newline by default), as many times as possible
[\w.] matches any word character or a dot
To change the delimiters, simply change the set of allowed characters inside the [] brackets.
Check this regex out on regex101.com
import re
data = '##%%.Hol$a.A.$%'
pattern = r'[\w.].*[\w.]'
print(re.search(pattern, data).group(0))
# Output: .Hol$a.A.

Depending on what you mean by stripping the punctuation, you can adapt the following code:
import re
res = re.search(r"^[^.]*(\.[^.]*\.([^.]*\.)*?)[^.]*$", "##%%.Hol$a.A.$%")
mystr = res.group(1)
This strips everything before the first dot and after the last dot in the expression.
Warning: re.search returns None if the string doesn't match, so check the result before calling group().
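The None check can be wrapped in a small helper. This sketch reuses the [\w.] idea from the first answer rather than the pattern directly above. The function name strip_edges is hypothetical, and the optional group is an added assumption so that a single allowed character also counts as a match:

```python
import re

# First allowed character, then optionally everything up to the last one.
EDGE_TRIM = re.compile(r"[\w.](?:.*[\w.])?")

def strip_edges(text):
    """Trim characters other than word chars and dots from both ends."""
    match = EDGE_TRIM.search(text)
    # re.search returns None when no allowed character is present at all.
    return match.group(0) if match else None

print(strip_edges("##%%.Hol$a.A.$%"))
print(strip_edges("'Barnes&Nobles'"))
print(strip_edges("$%^"))
```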

Python Regex for hyphenated words

I'm looking for a regex to match hyphenated words in Python.
The closest I've managed to get is: '\w+-\w+[-\w+]*'
text = "one-hundered-and-three- some text foo-bar some--text"
hyphenated = re.findall(r'\w+-\w+[-\w+]*',text)
which returns list ['one-hundered-and-three-', 'foo-bar'].
This is almost perfect except for the trailing hyphen after 'three'. I only want the additional hyphen if it is followed by a 'word', i.e. instead of the '[-\w+]*' I need something like '(-\w+)*', which I thought would work but doesn't (it returns ['-three', '']). In other words, I need something that matches |word followed by hyphen followed by word, followed by hyphen_word zero or more times|.

Try this:
re.findall(r'\w+(?:-\w+)+',text)
Here we consider a hyphenated word to be:
a number of word chars
followed by one or more repetitions of:
a single hyphen
followed by word chars
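A short sketch that runs this on the text from the question. It also shows why the capturing-group attempt returned fragments: when a pattern contains a capturing group, re.findall returns the group's content rather than the whole match.

```python
import re

text = "one-hundered-and-three- some text foo-bar some--text"

# Non-capturing group: findall returns each whole match.
hyphenated = re.findall(r"\w+(?:-\w+)+", text)

# Capturing group: findall returns only what the group captured last,
# or an empty string when the group did not take part in the match.
captured = re.findall(r"\w+-\w+(-\w+)*", text)

print(hyphenated)
print(captured)
```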
