Python: fastest way to re.findall twice?

I like regular expressions. I often find myself using multiple regex statements to narrow in on the value I need when trying to get a substring from a large block of text.
So far, my approach has been the following:
1. Use resultOfRegex1 = re.findall(firstRegex, myString) for my first regex
2. Check to see that resultOfRegex1[0] exists
3. Use resultOfRegex2 = re.findall(secondRegex, resultOfRegex1[0]) for my second regex
4. Check to see that resultOfRegex2[0] exists, and print that value
But I feel like this is much more verbose and costly than it has to be. Is there an easier/faster way to match one regex and then match another regex based on the result of the first?

The whole point of groups is to allow extraction of subgroups from an overall match.
For example, instead of two searches done in the following fashion:
>>> import re
>>> s = 'The winning team scored 15 points and used only 2 timeouts'
>>> score_clause = re.search(r'scored \d+ point', s).group(0)
>>> re.search(r'\d+', score_clause).group(0)
'15'
Do a single search with a sub-group:
>>> re.search(r'scored (\d+) point', s).group(1)
'15'
One other thought: if you want to make decisions about whether to continue a findall-style search based on the first match, a reasonable choice would be to use re.finditer and extract values as needed:
>>> game_results = '''\
10 point victory: 1 in first period, 6 in second period, 3 in third period.
5 point victory: 0 in first period, 5 in second period, 0 in third period.
12 point victory: 5 in first period, 3 in second period, 4 in third period.
7 point victory: 3 in first period, 0 in second period, 4 in third period.
'''.splitlines()
>>> # Show period-by-period scores for games won by 8 or more points
>>> for game_result in game_results:
...     it = re.finditer(r'\d+', game_result)
...     if int(next(it).group(0)) >= 8:
...         print('Big win:', [int(mo.group(0)) for mo in it])
Big win: [1, 6, 3]
Big win: [5, 3, 4]

Related

Python code to return elements in a Series

I am currently putting together a script for topic modelling scraped Tweets but I am having a couple of issues. I want to be able to search for all instances of a word, then return all instances of that word, plus the words before and after, in order to provide better context into the use of a word.
I have tokenised all the tweets, and added them to a Series where the relative index position is used to identify surrounding words.
The code I currently have is:
myseries = pd.Series(["it", 'was', 'a', 'bright', 'cold', 'day', 'in', 'april'],
                     index=[0, 1, 2, 3, 4, 5, 6, 7])

def phrase(w):
    search_word = myseries[myseries == w].index[0]
    before = myseries[[search_word - 1]].index[0]
    after = myseries[[search_word + 1]].index[0]
    print(myseries[before], myseries[search_word], myseries[after])
The code mostly works, but will return an error if the first or last word is searched, as it falls outside the index range of the Series. Is there a way to ignore out of range indexes and simply return what is within range?
The current code also only returns the word before and after the searched word. I want to be able to input a number into the function which then returns a range of words before and after, but my current code is hard coded. Is there a way to have it return a designated range of elements?
I am also having issues creating a loop to search the entire series. Depending on what I write it either returns the first element and nothing else, or repeatedly prints the first element over and over again rather than continuing on with the search. The offending bit of code that keeps repeating the first element is:
def ws(word):
    for element in tokened_df:
        if word == element:
            search_word = tokened_df[tokened_df == word].index[0]
            before = tokened_df[[search_word - 1]].index[0]
            after = tokened_df[[search_word + 1]].index[0]
            print(tokened_df[before], word, tokened_df[after])
There is obviously something simple I've overlooked, but I can't for the life of me figure out what it is. How can I modify the code so that if the same word is repeated in the series, it will return each instance of the word plus the surrounding words? The way I want it to work follows the logic of 'if the condition is true, execute the phrase function; if not, continue down the series'.
Something like this? I have added a repeated word ("bright") to your example, and added n_before and n_after parameters to specify the number of surrounding words.
import pandas as pd

myseries = pd.Series(["it", 'was', 'a', 'bright', 'bright', 'cold', 'day', 'in', 'april'],
                     index=[0, 1, 2, 3, 4, 5, 6, 7, 8])

def phrase(w, n_before=1, n_after=1):
    search_words = myseries[myseries == w].index
    for index in search_words:
        start_index = max(index - n_before, 0)
        end_index = min(index + n_after + 1, myseries.shape[0])
        print(myseries.iloc[start_index:end_index])

phrase("bright", n_before=2, n_after=3)
This gives:
1 was
2 a
3 bright
4 bright
5 cold
6 day
dtype: object
2 a
3 bright
4 bright
5 cold
6 day
7 in
dtype: object
This is not very elegant, but you probably need some conditionals to account for words that come at the beginning or the end of your series. To account for repeated words, find all instances of the repeated word and loop through the print statements. In the variable myseries below, the word "cold" appears twice, so there should be two print statements.
import pandas as pd

myseries = pd.Series(["it", 'was', 'a', 'cold', 'bright', 'cold', 'day', 'in', 'april'],
                     index=[0, 1, 2, 3, 4, 5, 6, 7, 8])

def phrase(w):
    for i in myseries[myseries == w].index.tolist():
        search_word = i
        if search_word == 0:
            print(myseries[search_word], myseries[i + 1])
        elif search_word == len(myseries) - 1:
            print(myseries[i - 1], myseries[search_word])
        else:
            print(myseries[i - 1], myseries[search_word], myseries[i + 1])
Output:
>>> myseries
0 it
1 was
2 a
3 cold
4 bright
5 cold
6 day
7 in
8 april
dtype: object
>>> phrase("was")
it was a
>>> phrase("cold")
a cold bright
bright cold day

Removing rows from a DataFrame based on words in a string

Novice programmer here seeking help.
I have a Dataframe that looks like this:
Current
0 "Invest in $APPL, $FB and $AMZN"
1 "Long $AAPL, Short $AMZN"
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I also have a list with all the cashtags: cashtags = ["$AAPL", "$FB", "$AMZN"]
Basically, I want to go through all the lines in this column of the DataFrame, keep the rows with exactly one unique cashtag (regardless of capitalization), and delete all the others.
Desired Output:
Desired
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I've tried to basically count how many times the word appears in the string and add that value to a new column so that I can delete the rows based on the number.
for i in range(0, len(df) - 1):
    print(i, end="\r")
    tweet = df["Current"][i]
    count = 0
    for word in cashtags:
        count += str(tweet).count(word)
    df["Word_count"][i] = count
However, if I do this I will be deleting rows that I don't want to delete, for example rows where a single unique cashtag is mentioned several times ([3], [5]).
How can I achieve my desired output?
Rather than summing the count of each cashtag, you should sum its presence or absence, since you don't care how many times each cashtag occurs, only how many cashtags.
count = 0
for tag in cashtags:
    count += tag in tweet
Or more succinctly: sum(tag in tweet for tag in cashtags)
To make the comparison case insensitive, you can upper case the tweets beforehand. Additionally, it would be more idiomatic to filter on a temporary series and avoid explicitly looping over the dataframe (though you may need to read up more about Pandas to understand how this works):
df[df.Current.apply(lambda tweet: sum(tag in tweet.upper() for tag in cashtags)) == 1]
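For concreteness, here is a minimal, self-contained sketch of that filter, with the DataFrame rebuilt from the question's rows (the variable names here are assumptions):
import pandas as pd
cashtags = ["$AAPL", "$FB", "$AMZN"]
df = pd.DataFrame({"Current": [
    "Invest in $APPL, $FB and $AMZN",
    "Long $AAPL, Short $AMZN",
    "$AAPL earnings announcement soon",
    "$FB is releasing a new product. Will $FB's product be good?",
    "$Fb doing good today",
    "$AMZN high today. Will $amzn continue like this?",
]})
# Count how many distinct cashtags appear in each tweet (case-insensitively)
# and keep only the rows that mention exactly one.
tag_counts = df.Current.apply(lambda tweet: sum(tag in tweet.upper() for tag in cashtags))
print(df[tag_counts == 1])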
If you ever want to generalise your question to any tag, then this is a good place for a regular expression.
You want to match against (\$\w+)(?!.*\1). The general structure is:
\$\w+ : find a dollar sign followed by one or more letters/numbers (or an _). If you just wanted to count how many tags you had, this is all you need, e.g.
df.Current.str.count(r'\$\w+')
will print
0 3
1 2
2 1
3 2
4 1
5 2
but this counts repeated occurrences of the same tag separately, so you need to add a negative lookahead:
(?!.*\1) : a negative lookahead, which means don't match if the capture is followed by the same match later on. The effect is that only the last occurrence of each distinct tag in the string is counted.
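As a quick check with plain re, take row 3 from the question, where $FB appears twice; the lookahead collapses the repeats to a single count:
import re
s = "$FB is releasing a new product. Will $FB's product be good?"
# Only the last "$FB" matches: the earlier one is followed by the same
# captured tag later in the string, so the negative lookahead rejects it.
print(len(re.findall(r'(\$\w+)(?!.*\1)', s)))  # 1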
Using this, you can then use the pandas string methods, specifically Series.str.count (passing re.I makes the match case insensitive):
import re
df[df.Current.str.count(r'(\$\w+)(?!.*\1)', re.I) == 1]
which will give you your desired output
Current
2 $AAPL earnings announcement soon
3 $FB is releasing a new product. Will $FB's pro...
4 $Fb doing good today
5 $AMZN high today. Will $amzn continue like this?

Constant part of string

I've got a problem and don't know how to solve it.
E.g. I have a dynamically expanding file which contains lines separated by '\n'.
Each line is a message (a string) built from some pattern, plus a value part which is specific to that line.
For example:
line 1: The temperature is 10 above zero
line 2: The temperature is 16 above zero
line 3: The temperature is 5 degree zero
So, as you see, the constant part (pattern) is
The temperature is zero
Value part:
For line 1 it will be: 10 above
For line 2 it will be: 16 above
For line 3 it will be: 5 degree
Of course, this is a very simple example.
In fact there are a great many lines and about 50 patterns in one file.
The value part may be anything: a number, a word, punctuation, etc.
And my question is: how can I find all possible patterns in the data?
This sounds like a log message clustering problem.
Trivial solution: replace all numbers with the string NUMBER using a regex. You might need to exclude dates or IP addresses or something. That might be enough to give you a list of all patterns in your logs.
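A minimal sketch of that normalization, assuming the messages are already in a list:
import re
lines = [
    'The temperature is 10 above zero',
    'The temperature is 16 above zero',
    'The temperature is 5 degree zero',
]
# Replace each run of digits with NUMBER; identical results collapse to one pattern.
patterns = {re.sub(r'\d+', 'NUMBER', line) for line in lines}
print(sorted(patterns))
# ['The temperature is NUMBER above zero', 'The temperature is NUMBER degree zero']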
Alternately, you might be able to count the number of words (whitespace-delimited fields) in each message and group the messages that way. For example, maybe all messages with 7 words are in the same format. If two different messages have the same format you can also match on the first word or something.
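A rough sketch of that grouping idea, keyed on (word count, first word) as a cheap format fingerprint (the choice of key is my assumption, not something from the question):
from collections import defaultdict
lines = [
    'The temperature is 10 above zero',
    'The temperature is 16 above zero',
    'Disk usage at 93 percent',
]
groups = defaultdict(list)
for line in lines:
    words = line.split()
    # Messages with the same length and first word are candidates for one format.
    groups[(len(words), words[0])].append(line)
for key, members in groups.items():
    print(key, members)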
If neither of the above work then things get much more complicated; clustering arbitrary log messages is a research problem. If you search for "event log clustering" on Google Scholar you should see a lot of approaches you can learn from.
If the number of words in a line is fixed, as in your example, then you can use str.split():
text = '''
The temperature is 10 above zero
The temperature is 16 above zero
The temperature is 5 degree zero
'''
for line in text.split('\n'):
    if len(line.split()) >= 5:
        a, b = line.split()[3], line.split()[4]
        print(a, b)
Output:
10 above
16 above
5 degree
First, we would read the file line by line and add all sentences to a list.
In the example below, I am adding a few lines to a list.
This list has all the sentences:
lstSentences = ['The temperature is 10 above zero', 'The temperature is 16 above zero', 'The temperature is 5 degree above zero','Weather is ten degree below normal', 'Weather is five degree below normal' , 'Weather is two degree below normal']
Create a list to store all patterns
lstPatterns=[]
Initialize
intJ = len(lstSentences)-1
Compare each sentence against the one that follows it. If there are more than two matching words between the two sentences, perhaps this is a pattern.
for inti, sentence in enumerate(lstSentences):
    if intJ != inti:
        lstMatch = [matching for matching in sentence.split()
                    if matching in lstSentences[inti + 1].split()]
        if len(lstMatch) > 2:  # we need more than 2 matching words between sentences
            if not ' '.join(lstMatch) in lstPatterns:  # if not already in the list, add it
                lstPatterns.append(' '.join(lstMatch))
        lstMatch = []
print(lstPatterns)
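For the example list above, this should print two recovered patterns:
['The temperature is above zero', 'Weather is degree below normal']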
I am assuming patterns come one after the other (i.e., 10 rows with one pattern and then 10 rows with another pattern). If not, the above code needs to change.

Extract hashtags from columns of a pandas dataframe

I have a dataframe df. I want to extract hashtags from tweets where Max == 45:
Max Tweets
42 via #VIE_unlike at #fashion
42 Ny trailer #katamaritribute #ps3
45 Saved a baby bluejay from dogs #fb
45 #Niley #Niley #Niley
I'm trying something like this, but it's giving an empty dataframe:
df.loc[df['Max'] == 45, [hsh for hsh in 'tweets' if hsh.startswith('#')]]
Is there something in pandas which I can use to perform this effectively and faster?
You can use pd.Series.str.findall:
In [956]: df.Tweets.str.findall(r'#.*?(?=\s|$)')
Out[956]:
0 [#fashion]
1 [#katamaritribute, #ps3]
2 [#fb]
3 [#Niley, #Niley, #Niley]
This returns a column of lists.
If you want to filter first and then find, you can do so quite easily using boolean indexing:
In [957]: df.Tweets[df.Max == 45].str.findall(r'#.*?(?=\s|$)')
Out[957]:
2 [#fb]
3 [#Niley, #Niley, #Niley]
Name: Tweets, dtype: object
The regex used here is:
#.*?(?=\s|$)
To understand it, break it down:
#.*? - carries out a non-greedy match for a word starting with a hashtag
(?=\s|$) - lookahead for the end of the word or end of the sentence
If it's possible you have # in the middle of a word that is not a hashtag, that would yield false positives which you wouldn't want. In that case, you can modify your regex to include a lookbehind:
(?:(?<=\s)|(?<=^))#.*?(?=\s|$)
The regex lookbehind asserts that either a space or the start of the sentence must precede a # character.
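As a quick sanity check (these sample strings are made up, not from the question), the lookbehind variant skips a '#' in the middle of a word:
>>> import pandas as pd
>>> s = pd.Series(["C# is great #coding", "#fb rocks"])
>>> s.str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
0    [#coding]
1        [#fb]
dtype: object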

Python -- Search Substring Full Word(s)

I want to find the number of times a substring occurs in a string. I was doing this:
termCount = content.count(term)
But if I search for "Ford" it returns results like:
"Ford Motors" Result: 1 Correct
"cannot afford Ford" Result: 2 Incorrect
"ford is good" Result: 1 Correct
The search term can have multiple terms like "Ford Motors" or "Ford Auto".
For example, if I search "Ford Motor":
"Ford Motors" Result: 1 Correct
"cannot afford Ford Motor" Result: 1 Correct
"Ford Motorway" Result: 1 InCorrect
What I want is to search case-insensitively and for the term as a whole: if I search for a substring, it should be matched as a whole word or phrase (in the case of multiple terms), not as part of a word. I also need the count of the terms. How do I achieve this?
You can use regex, and in this case use re.findall, then take the length of the matched list:
re.findall(r'\byour_term\b',s)
Demo
>>> s="Ford Motors cannot afford Ford Motor Ford Motorway Ford Motor."
>>> import re
>>> def counter(text, term):
...     return len(re.findall(r'\b{}\b'.format(term), text))
...
>>> counter(s,'Ford Motor')
2
>>> counter(s,'Ford')
4
>>> counter(s,'Fords')
0
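The question also asks for case-insensitive matching, which the demo above doesn't cover; one option (my addition, continuing the same session, not part of the original answer) is to pass re.IGNORECASE:
>>> len(re.findall(r'\bford\b', s, re.IGNORECASE))
4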
I would split the strings by spaces so that we have independent words, and then carry out the count from there.
terms = ['Ford Motors', 'cannot afford Ford', 'ford is good']
splitWords = []
for term in terms:
    # Take each string in the list and split it into words,
    # then add these words to a list called splitWords.
    splitWords.extend(term.lower().split())
print(splitWords.count("ford"))
