I am receiving a stream of tweets with python and would like to extract the last word or know where to reference it.
for example in
NC don’t like working together www.linktowtweet.org
get back
together
I am not familiar with tweepy, so I am presuming you have the data in a python string, so maybe there is a better answer.
However, given a string in Python, it is simple to extract the last word.
Solution 1
Use str.rfind(' '). The idea here is to find the space, preceding the last word. Here is an example.
text = "NC don’t like working together"
text = text.rstrip() # Remove any whitespace at the end that would otherwise confuse the algorithm.
last_word = text[text.rfind(' ')+1:] # Output every character *after* the space.
print(last_word)
Note: If a string is given with no words, last_word will be a blank string.
Now this presumes that all of the words are separated by spaces. To handle newlines and tabs, use str.replace to turn them into spaces. Whitespace characters in Python are the space plus \t\n\x0b\x0c\r, but I presume only newlines and tabs will be found in Twitter messages.
Also see: string.whitespace
So a complete example (wrapped as a function) would be
def last_word(text):
    text = text.replace('\n', ' ') # Replace newlines with spaces.
    text = text.replace('\t', ' ') # Replace tabs with spaces.
    text = text.rstrip(' ') # Remove trailing spaces.
    return text[text.rfind(' ')+1:]
print(last_word("NC don’t like working together")) # Outputs "together".
This may still be the best solution for basic parsing, but there is something better for larger problems.
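As an aside on the whitespace handling above, here is a minimal sketch of an alternative that avoids the explicit replace calls. It assumes that any run of whitespace counts as a word separator; the name last_word_rsplit is chosen just for this example.

```python
def last_word_rsplit(text):
    # With sep=None, rsplit treats any run of whitespace as a separator
    # and ignores trailing whitespace; maxsplit=1 splits off only the last word.
    parts = text.rsplit(None, 1)
    # An empty or whitespace-only string yields an empty list.
    return parts[-1] if parts else ''

print(last_word_rsplit("NC don’t like working\ttogether\n"))
```

Like Solution 1, this returns an empty string when the text contains no words.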
Solution 2
Regular Expressions
These are a way to handle strings in Python that is a lot more flexible. Regular expressions (regex, as they are often called) use their own language to specify a portion of text.
For example, .*\s(\S+) specifies the last word in a string.
Here it is again with a longer explanation.
.* # Match as many characters as possible.
\s # Until a whitespace ("\t\n\x0b\x0c\r ")
( # Remember the next section for the answer.
\S+ # Match as many non-whitespace characters (a ~word~) as possible.
) # End saved section.
So then, in python you would use this as follows.
import re # Import the REGEX library.
# Compile the code, (DOTALL makes . match \n).
LAST_WORD_PATTERN = re.compile(r".*\s(\S+)", re.DOTALL)
def last_word(text):
    m = LAST_WORD_PATTERN.match(text)
    if not m: # If there was not a last word to this text.
        return ''
    return m.group(1) # Otherwise return the last word.
print(last_word("NC don’t like working together")) # Outputs "together".
Now, even though this method is a lot less obvious, it has a couple of advantages. First off, it is a lot more customizable. If you wanted to match the final word, but not links, the regex r".*\s([^.:\s]+(?!\.\S|://))\b" would match the last word, but ignore a link if that was the last thing.
Example:
import re # Import the REGEX library.
# Compile the code, (DOTALL makes . match \n).
LAST_WORD_PATTERN = re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)
def last_word(text):
    m = LAST_WORD_PATTERN.match(text)
    if not m: # If there was not a last word to this text.
        return ''
    return m.group(1) # Otherwise return the last word.
print(last_word("NC don’t like working together www.linktowtweet.org")) # Outputs "together".
The second advantage to this method is speed.
As the Try it online! demo shows, the regex approach is almost as fast as the string manipulation, if not faster in some cases. (I actually found that the regex executes 0.2 usec faster on my machine than in the demo.)
Either way, the regex execution is extremely fast, even in the simple case, and it will generally be faster than a more complex string algorithm implemented in pure Python. So using the regex can also speed up the code.
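The linked benchmark is not reproduced here, so the following is only a sketch of how one might measure both approaches with the standard timeit module. The names last_word_str and last_word_re are labels chosen for this example, and the timings will vary by machine and Python version.

```python
import re
import timeit

LAST_WORD_PATTERN = re.compile(r".*\s(\S+)", re.DOTALL)

def last_word_str(text):
    # String-manipulation version (Solution 1).
    text = text.replace('\n', ' ').replace('\t', ' ').rstrip(' ')
    return text[text.rfind(' ') + 1:]

def last_word_re(text):
    # Regex version (Solution 2).
    m = LAST_WORD_PATTERN.match(text)
    return m.group(1) if m else ''

sample = "NC don’t like working together"
runs = 100_000

for func in (last_word_str, last_word_re):
    seconds = timeit.timeit(lambda: func(sample), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.3f} usec per call")
```

Measure with text that resembles your real tweets before drawing conclusions, since the relative speed depends on the input length.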
EDIT
Changed the url avoiding regex from
re.compile(r".*\s([^.\s]+(?!\.\S))\b", re.DOTALL)
to
re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)
So that calling last_word("NC don’t like working together http://www.linktowtweet.org") returns together and not http://.
To see how this regex works, look at https://regex101.com/r/sdwpqB/2.
Simple, so if your text is:
import re
text = "NC don’t like working together www.linktowtweet.org"
text = re.sub(r'(?:https?://|www\.)\S+', '', text) # remove any URL (http://, https:// or www.)
words = text.split() # split the sentence into words on any whitespace
last_word = words[-1]
So there you go!! Now you'll get the last word "together".
Related
I want to capitalize the first word after a dot in a whole paragraph (str) full of sentences. The problem is that all chars are lowercase.
I tried something like this:
text = "here a long. paragraph full of sentences. what in this case does not work. i am lost"
re.sub(r'(\b\. )([a-zA-z])', r'\1' (r'\2').upper(), text)
I expect something like this:
"Here a long. Paragraph full of sentences. What in this case does not work. I am lost."
You can use re.sub with a lambda:
import re
text = "here a long. paragraph full of sentences. what in this case does not work. i am lost"
result = re.sub(r'(?<=^)\w|(?<=\.\s)\w', lambda x: x.group().upper(), text)
Output:
'Here a long. Paragraph full of sentences. What in this case does not work. I am lost'
Regex Explanation:
(?<=^)\w: matches a word character (letter, digit, or underscore) preceded by the start of the string.
(?<=\.\s)\w: matches a word character preceded by a period and a whitespace character.
You can also use the regex ((?:^|\.\s)\s*)([a-z]), which does not depend on lookarounds. Lookarounds are not available in every regex dialect (JavaScript, for example, only gained lookbehind in ECMAScript 2018, and older engines lack it), so this version is simpler and more widely supported. Group 1 captures either the start of the string or a literal dot followed by whitespace, plus any further whitespace; group 2 captures the lowercase letter that follows. The replacement is group 1 unchanged followed by group 2 uppercased, which is done with a lambda expression. The Python code below writes the same idea slightly more compactly as (^\s*|\.\s+)([a-z]):
import re
arr = ['here a long. paragraph full of sentences. what in this case does not work. i am lost',
       ' this para contains more than one space after period and also has unneeded space at the start of string. here a long. paragraph full of sentences. what in this case does not work. i am lost']
for s in arr:
    print(re.sub(r'(^\s*|\.\s+)([a-z])', lambda m: m.group(1) + m.group(2).upper(), s))
Output,
Here a long. Paragraph full of sentences. What in this case does not work. I am lost
This para contains more than one space after period and also has unneeded space at the start of string. Here a long. Paragraph full of sentences. What in this case does not work. I am lost
And in case you want to get rid of the extra whitespace and reduce it to a single space, take \s* out of group 1 and use the regex ((?:^|\.\s))\s*([a-z]), with this updated Python code:
import re
arr = ['here a long. paragraph full of sentences. what in this case does not work. i am lost',
       ' this para contains more than one space after period and also has unneeded space at the start of string. here a long. paragraph full of sentences. what in this case does not work. i am lost']
for s in arr:
    print(re.sub(r'((?:^|\.\s))\s*([a-z])', lambda m: m.group(1) + m.group(2).upper(), s))
You get following where extra whitespace is reduced to just one space, which may often be desired,
Here a long. Paragraph full of sentences. What in this case does not work. I am lost
This para contains more than one space after period and also has unneeded space at the start of string. Here a long. Paragraph full of sentences. What in this case does not work. I am lost
Also, if this were done in a tool whose replacement syntax supports case-conversion escapes (as several PCRE-based tools do), you could use \U in the replacement itself instead of a lambda function and simply replace the match with \1\U\2.
I'm working on a simple personal project that requires me to learn to use regular expressions. I have successfully used findall() once before in my program:
def getStats():
    playername = input("Enter your OSRS name: ")
    try:
        with urllib.request.urlopen("https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=" + playername) as response:
            page = str(response.read())
            player.levels = re.findall(r',(\d\d),', page)
This worked fine and populated the list exactly as I wanted. I'm now trying to do something similar with a text file.
The text file contains a string, followed by a lot of digits, and then another string followed by a lot of digits, etc. I just want to populate a list with the text and ignore the digits, but I get no matches (the list is empty):
def getQuests():
    try:
        with open("quests.txt") as file:
            q = file.read()
            questList = re.findall(r',(\D\D),', q)
            print(questList)
Pythex link: https://pythex.org/?regex=%5CD%5CD&test_string=Desert%20Treasure%2C0%2C0%2C0%2C12%0AContact!%2C0%2C0%2C11%2C0%2C0%2C0%2C5%0ACook%27s%20Assistant%2C0%2C0%2C0%2C0%0AHorror%20from%20the%20Deep%2C0%2C0%2C13&ignorecase=0&multiline=0&dotall=0&verbose=0
I've gotten some help with the pattern and edited accordingly, but the list is still printing empty:
def getQuests():
    try:
        with open("quests.txt") as file:
            q = file.read()
            questList = re.findall(r'^(\D+),', q)
Your pattern is incorrect. Firstly, the demo website you linked displays adjacent matches as one single match, which is misleading: \D\D matches exactly 2 non-digit characters. Also, the pattern in your demo omits the commas that the pattern in your code has. Anyway, here is a correct pattern:
^(\D+),
With the re.M (multiline) flag, it matches the start of a line, then at least one non-digit character, then a comma. The first group contains the string you want to match.
Demo: https://regex101.com/r/pViF0h/2
In code:
import re
text = '''Desert Treasure,0,0,0,12
Contact!,0,0,11,0,0,0,5
Cook's Assistant,0,0,0,0
Horror from the Deep,0,0,13'''
print(re.findall(r'^(\D+),', text, re.M))
# ['Desert Treasure', 'Contact!', "Cook's Assistant", 'Horror from the Deep']
If the first entry is what you want no matter what, you can also use:
^(.+?),
Also, for these files, it is usually a much better idea to read it as a CSV and extract what you need that way.
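A minimal sketch of the CSV approach, using the standard csv module. Here io.StringIO stands in for the real file so the example is self-contained; with the actual file you would pass the object returned by open("quests.txt", newline="") instead. The name quest_names is chosen for this example.

```python
import csv
import io

text = """Desert Treasure,0,0,0,12
Contact!,0,0,11,0,0,0,5
Cook's Assistant,0,0,0,0
Horror from the Deep,0,0,13"""

# io.StringIO is a stand-in for the opened quests.txt file.
reader = csv.reader(io.StringIO(text))

# The quest name is the first column of every non-empty row.
quest_names = [row[0] for row in reader if row]
print(quest_names)
# ['Desert Treasure', 'Contact!', "Cook's Assistant", 'Horror from the Deep']
```

This also keeps working if a quest name happens to contain a digit, which the \D+ pattern would not handle.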
Your TypeError solution is correct.
Without knowing what that webpage looks like, I do see one problem. In your working example, you use ',(\d\d),', but in the problem one you use ',(\D\D),'. \d matches any digit character, but \D matches any non-digit character.
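To make the difference concrete, here is a small sketch that runs the patterns from the question against one sample line taken from the Pythex link. The variable names are chosen for this example.

```python
import re

sample = "Contact!,0,0,11,0,0,0,5"

# \d\d needs exactly two digits between commas.
two_digits = re.findall(r',(\d\d),', sample)

# \D\D needs exactly two non-digits between commas; the quest name is
# longer than that and has no comma in front of it, so nothing matches.
two_non_digits = re.findall(r',(\D\D),', sample)

# \D+ anchored at the start takes the whole name up to the first comma.
name = re.findall(r'^(\D+),', sample)

print(two_digits)      # ['11']
print(two_non_digits)  # []
print(name)            # ['Contact!']
```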
I'd like to create a regex that contains comments and a variable. I thought I'd split up the string like so, but it doesn't work:
import re
regex = re.compile(r'''
^(sm\d{5}-[a-z]+-\d{2}) # study number''' +
doctype + r'''# document
v(\d+)-(\d+) # version number
\.pdf$ # pdf extension
''', re.VERBOSE)
Break your regex pattern into multiple strings, then combine them into a single string with "".join(), like so
import re
pattern = "".join([
    r"^(sm\d{5}-[a-z]+-\d{2})", # study number
    doctype, # document
    r"v(\d+)-(\d+)", # version number
    r"\.pdf$", # pdf extension
])
regex = re.compile(pattern, re.VERBOSE)
To avoid the need for comments, you might use descriptive variable names for each section of the regex. Doing it this way, it might also make sense to separate the line positioning characters from the "business logic" of your regex to make these variables more reusable.
study_number_pattern = r"(sm\d{5}-[a-z]+-\d{2})"
version_number_pattern = r"v(\d+)-(\d+)"
pdf_extension_pattern = r"\.pdf"
pattern = "".join([
    "^",
    study_number_pattern,
    doctype,
    version_number_pattern,
    pdf_extension_pattern,
    "$"
])
regex = re.compile(pattern, re.VERBOSE)
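The closing block of quotes (''') on the second line of the pattern comes after your note, so the note ends up inside the string and doctype is appended to that same line. With re.VERBOSE, everything from # to the end of the line is a comment, so doctype is commented out along with your note. Simply move the closing quotes and the concatenation sign in front of the comment.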
I put both plus signs on one line, but you could still have them split across multiple ones, that's simply my preference...
regex = re.compile(r'''
^(sm\d{5}-[a-z]+-\d{2})''' # study number
+ doctype + # document
r'''v(\d+)-(\d+) # version number
\.pdf$ # pdf extension
''', re.VERBOSE)
On a related note, do you use an IDE or any text editor for writing code? If not, that would be quite helpful. Errors like this would be instantly caught and highlighted.
Edit:
Stack Overflow syntax highlighting makes it appear as though the lines with comments 'version number' and 'pdf extension' are part of the pattern in the code above, but using the re.VERBOSE flag makes them actual comments.
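For a quick check that a verbose pattern built this way really includes doctype, here is a self-contained sketch. The value of doctype and the sample file name are hypothetical, since the question does not show them; the closing quotes are placed on the line after each comment so that doctype starts on a fresh line of the pattern.

```python
import re

# Hypothetical value; the question does not show what doctype contains.
# In VERBOSE mode, literal spaces or '#' inside it would need escaping.
doctype = r"-(report|summary)-"

regex = re.compile(
    r"""
    ^(sm\d{5}-[a-z]+-\d{2})   # study number
    """ + doctype + r"""      # document
    v(\d+)-(\d+)              # version number
    \.pdf$                    # pdf extension
    """,
    re.VERBOSE,
)

# Hypothetical file name that follows the pattern.
m = regex.match("sm12345-abc-01-report-v2-3.pdf")
print(m.groups() if m else None)
# ('sm12345-abc-01', 'report', '2', '3')
```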
So I'm writing a Python program that reads lines of serial data and compares them to a dictionary of line codes to figure out which specific lines are being transmitted. I am attempting to use a regular expression to filter out the extra garbage the serial read string has on it, but I'm having a bit of an issue.
Every single code in my dictionary looks like this: T12F8B0A22**F8. The asterisks are the two alphanumeric characters that differentiate each code.
This is what I have so far as my regex: '/^T12F8B0A22[A-Z0-9]{2}F8$/'
I am getting a few errors with this, however. My first error is that there are some characters at the end of the string I still need to get rid of, which is odd because I thought $/ denoted the end of the line in regex. However, when I run my code through the debugger, I notice the following after running this code:
#regexString contains the serial read line data
regexString = re.sub('/^T12F8B0A22[A-Z0-9]{2}F8$/', '', regexString)
My string looks something like this: 'T12F8B0A2200F8\\r'
I need to get rid of the \\r.
If for some reason I can't get rid of this with regex, how in python do you send specific string character through an argument? In this case I suppose it would be length - 3?
Your problem is threefold:
1) your string contains extra \r (Carriage Return character) before \n (New Line character); this is common in Windows and in network communication protocols; it is probably best to remove any trailing whitespace from your string:
regexString = regexString.rstrip()
2) as mentioned by Wiktor Stribiżew, your regexp is unnecessarily surrounded with / characters - some languages, like Perl, define regexp as a string delimited by / characters, but Python is not one of them;
3) your instruction using re.sub is actually replacing the matching part of regexString with an empty string - I believe this is the exact opposite of what you want (you want to keep the match and remove everything else, right?); that's why fixing the regexp makes things "even worse".
To summarize, I think you should use this instead of your current code:
m = re.match('T12F8B0A22[A-Z0-9]{2}F8', regexString)
regexString = m.group(0)
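The three points above can be combined into a small helper. This is a sketch under the assumption that a line which contains no valid code should yield None rather than raise; the names LINE_CODE and extract_code are chosen for this example.

```python
import re

# No surrounding slashes: Python patterns are plain strings.
LINE_CODE = re.compile(r'T12F8B0A22[A-Z0-9]{2}F8')

def extract_code(raw_line):
    # strip() drops leading and trailing whitespace such as "\r\n".
    m = LINE_CODE.search(raw_line.strip())
    # Keep the matched code instead of substituting it away.
    return m.group(0) if m else None

print(extract_code('T12F8B0A2200F8\r\n'))  # T12F8B0A2200F8
print(extract_code('no code here'))        # None
```

Using search instead of match also tolerates garbage in front of the code; use match or fullmatch if the code must start the line.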
There are several ways to get rid of the "\r", but first a little analysis of your code:
1. The special character for the end of the string is just '$', not '$/', in Python.
2. re.sub substitutes the matched pattern with a string ('' in your case), which would replace the string you want to keep with an empty string, so you are left with only the \\r.
Possible solutions:
1. Use a simple replace:
regexString = regexString.replace('\\r','')
2. If you want to stick to regex, the approach is the same:
pattern = '\\\\r'
match = re.sub(pattern, '',regexString)
3. If you want to access the different groups, use re.search:
match = re.search('(^T12F8B0A22[A-Z0-9]{2}F8)(.*)',regexString)
match.group(1) # will give you the T12...
match.group(2) # gives you the \\r
Just match what you want to find. Couple of examples:
import re
data = '''lots of
otherT12F8B0A2212F8garbage
T12F8B0A2234F8around
T12F8B0A22ABF8the
stringsT12F8B0A22CDF8
'''
print(re.findall('T12F8B0A22..F8',data))
['T12F8B0A2212F8', 'T12F8B0A2234F8', 'T12F8B0A22ABF8', 'T12F8B0A22CDF8']
m = re.search('T12F8B0A22..F8',data)
if m:
    print(m.group(0))
T12F8B0A2212F8
Might be a bit messy title, but the question is simple.
I got this in Python:
string = "start;some;text;goes;here;end"
The start; and end; words are always at the same position in the string.
I want the second word which is some in this case. This is what I did:
import re
string = "start;some;text;goes;here;end"
word = re.findall("start;.+?;", string)
In this example, there might be a few things to modify to make it more appropriate, but in my actual code, this is the best way.
However, the string I get back is start;some;, where the search characters themselves are included in the output. I could index both ; characters and extract the middle part, but there has to be a way to get only the actual word and not the extra junk too?
No need for regex in my opinion, but all you need is a capture group here.
word = re.findall("start;(.+?);", string)
Another improvement I'd like to suggest is not using the dot (.). Rather, be more specific: what you are looking for is simply anything other than ;, the delimiter.
So I'd do this:
word = re.findall("start;([^;]+);", string)
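Since the answer notes that regex is not strictly needed here, this is a minimal sketch of both routes. It assumes the fields are always separated by ; and that the wanted word is always the second field; the name captured is chosen for this example.

```python
import re

string = "start;some;text;goes;here;end"

# Plain string splitting: index 1 is the field right after "start".
word = string.split(";")[1]
print(word)  # some

# Regex route: re.search returns a match object (or None), and
# group(1) holds only the captured field, without "start;" or ";".
m = re.search(r"start;([^;]+);", string)
captured = m.group(1) if m else None
print(captured)  # some
```

Note that re.findall returns a list (here ['some']), so index into it if you need the string itself.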