I am using the Requests module to access the HTML from my target website and then using Beautiful Soup to select a specific element on the website. The element in question is a table that contains the results thus far of the English Premier League 2016/2017 season. The table contains the match date, the teams involved, the full-time score and the half-time score. I want to use Python to parse the HTML of the table element and extract the fixtures listed on there. The teams are always listed as:
Team A - Team B
A team name can be 1-3 separate strings (e.g. Burnley, Manchester United, West Ham United.
My attempt so far is:
import re
teamsRegex = re.compile(r'((\w+\s)+-(\s\w+)+)')
My logic here is that the first team can be 1-3 separate strings in length and each string is always followed by a white space. Therefore, the pattern (\w+\s)+ represents a string of any length followed by a white space and can be repeated 1 or many times. The second team name will always begin with a white space following the "-" character and again can be a string of any length, repeated 1 or many times (\s\w+)+.
I'm sort of achieving the desired results but the above is not entirely correct. I am returned a list with my desired result at index 0 followed by the first string of index 0 as index 1, and the last string in index 0 as index 2.
Example string:
'Burnley - Swansea City align=center width=45> 0 - 1 align=center> (0-0)'
Regex finds:
[('Burnley - Swansea City', 'Burnley ', ' City'), ('0 - 1', '0 ', ' 1')]
I would just like it to find [('Burnley - Swansea City')]
Many thanks in anticipation of any help!
r'(?:[A-Z][a-z]*\s)+-(?:\s[A-Z][a-z]*)+'
Here you have two non-capturing (?:, so you'll get the full match only) groups to match the teams' names. I chose to use letters explicitly, so the expressions only match words beginning with capital letters and exclude digits. You should change that if the teams' names can contain digits (like "BVB 09").
Depending on the HTML file's content one could add a final lookahead (?= align) to increase specifity.
Edit:
To match up to three capitals and optional '&'s, try this :
r'(?:[A-Z&]{1,3}[a-z]*\s)+-(?:\s[A-Z&]{1,3}[a-z]*)+'
Related
I am working with a dataset where I am separating the contents of one Excel column into 3 separate columns. A mock version of the data is as follows:
Movie Titles/Category/Rating
Wolf of Wall Street A-13 x 9
Django Unchained IMDB x 8
The EXPL Haunted House FEAR x 7
Silver Lining DC-23 x 8
This is what I want the results to look like:
Title
Category
Rating
Wolf of Wall Street
A-13
9
Django Unchained
IMDB
8
The EXPL Haunted House
FEAR
7
Silver Lining
DC-23
8
Here is the RegEx I used to successfully separate the cells:
For Rating, this RegEx worked:
data = [[Movie Titles/Category/Rating, Rating]] = data['Movie Titles/Category/Rating'].str.split(' x ', expand = True)
However, to separate Category from movie titles, this RegEx doesn't work:
data['Category']=data['Movie Titles/Category/Rating'].str.extract('((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4}$))', expand = True)
Since the uppercase letter pattern is present in the middle of the third cell as well (EXPL and I only want to separate FEAR into a separate column), the regex pattern '\s[A-Z]{4}$' is not working. Is there a way to indicate in the RegEx pattern that I only want the uppercase text in the end of the table cell to separate (FEAR) and not the middle (EXPL)?
You can use
import pandas as pd
df = pd.DataFrame({'Movie Titles/Category/Rating':['Wolf of Wall Street A-13 x 9','Django Unchained IMDB x 8','The EXPL Haunted House FEAR x 7','Silver Lining DC-23 x 8']})
df2 = df['Movie Titles/Category/Rating'].str.extract(r'^(?P<Movie>.*?)\s+(?P<Category>\S+)\s+x\s+(?P<Rating>\d+)$', expand=True)
See the regex demo.
Details:
^ - start of string
(?P<Movie>.*?) - Group (Column) "Movie": any zero or more chars other than line break chars, as few as possible
\s+ - one or more whitespaces
(?P<Category>\S+) - Group "Category": one or more non-whitespace chars
\s+x\s+ - x enclosed with one or more whitespaces
(?P<Rating>\d+) - Group "Rating": one or more digits
$ - end of string.
Assuming there is always x between Category and Rating, and the Category has no spaces in it, then the following should get what you want:
(.*) (.*) x (\d+)
I think
'((\s[A-Z]{1,2}-\d{1,2})|(\s[A-Z]{4})) x'
would work for you - to indicate that you want the part of the string that comes right before the x. (Assuming that pattern is always true for your data.)
Novice programmer here seeking help.
I have a Dataframe that looks like this:
Current
0 "Invest in $APPL, $FB and $AMZN"
1 "Long $AAPL, Short $AMZN"
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I also have a list with all the hashtags: cashtags = ["$AAPL", "$FB", $AMZN"]
Basically, I want to go through all the lines in this column of the DataFrame and keep the rows with a unique cashtag, regardless if it is in caps or not, and delete all others.
Desired Output:
Desired
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I've tried to basically count how many times the word appears in the string and add that value to a new column so that I can delete the rows based on the number.
for i in range(0,len(df)-1):
print(i, end = "\r")
tweet = df["Current"][i]
count = 0
for word in cashtags:
count += str(tweet).count(word)
df["Word_count"][i] = count
However if I do this I will be deleting rows that I don't want to. For example, rows where the unique cashtag is mentioned several times ([3],[5])
How can I achieve my desired output?
Rather than summing the count of each cashtag, you should sum its presence or absence, since you don't care how many times each cashtag occurs, only how many cashtags.
for tag in cashtags:
count += tag in tweet
Or more succinctly: sum(tag in tweet for tag in cashtags)
To make the comparison case insensitive, you can upper case the tweets beforehand. Additionally, it would be more idiomatic to filter on a temporary series and avoid explicitly looping over the dataframe (though you may need to read up more about Pandas to understand how this works):
df[df.Current.apply(lambda tweet: sum(tag in tweet.upper() for tag in cashtags)) == 1]
If you ever want to generalise your question to any tag, then this is a good place for a regular expression.
You want to match against (\$w+)(?!.*/1) see e.g. here for a detailed explanation, but the general structure is:
\$w+: find a dollar sign followed by one or more letters/numbers (or
an _), if you just wanted to count how many tags you had this is all you need
e.g.
df.Current.str.count(r'\$\w+')
will print
0 3
1 2
2 1
3 2
4 1
5 2
but this will remove cases where you have the same element more than once so you need to add a negative lookahead meaning don't match
(?!.*/1): Is a negative lookahead, this means don't match if it is followed by the same match later on. This will mean that only the last tag is counted in the string.
Using this, you can then use pandas DataFrame.str methods, specifically DataFrame.str.count (the re.I does a case insensitive match)
import re
df[df.Current.str.count(r'(\$\w+)(?!.*\1)', re.I) == 1]
which will give you your desired output
Current
2 $AAPL earnings announcement soon
3 $FB is releasing a new product. Will $FB's pro...
4 $Fb doing good today
5 $AMZN high today. Will $amzn continue like this?
i have a dataframe df. I want to extract hashtags from tweets where Max==45.:
Max Tweets
42 via #VIE_unlike at #fashion
42 Ny trailer #katamaritribute #ps3
45 Saved a baby bluejay from dogs #fb
45 #Niley #Niley #Niley
i m trying something like this but its giving empty dataframe:
df.loc[df['Max'] == 45, [hsh for hsh in 'tweets' if hsh.startswith('#')]]
is there something in pandas which i can use to perform this effectively and faster.
You can use pd.Series.str.findall:
In [956]: df.Tweets.str.findall(r'#.*?(?=\s|$)')
Out[956]:
0 [#fashion]
1 [#katamaritribute, #ps3]
2 [#fb]
3 [#Niley, #Niley, #Niley]
This returns a column of lists.
If you want to filter first and then find, you can do so quite easily using boolean indexing:
In [957]: df.Tweets[df.Max == 45].str.findall(r'#.*?(?=\s|$)')
Out[957]:
2 [#fb]
3 [#Niley, #Niley, #Niley]
Name: Tweets, dtype: object
The regex used here is:
#.*?(?=\s|$)
To understand it, break it down:
#.*? - carries out a non-greedy match for a word starting with a hashtag
(?=\s|$) - lookahead for the end of the word or end of the sentence
If it's possible you have # in the middle of a word that is not a hashtag, that would yield false positives which you wouldn't want. In that case, You can modify your regex to include a lookbehind:
(?:(?<=\s)|(?<=^))#.*?(?=\s|$)
The regex lookbehind asserts that either a space or the start of the sentence must precede a # character.
I have a large string and I want to find all the input sequences that are matching in this string.
So for example, I want to find all the possible matches of defensive rebound in:
Player xy had 10 defensive rebounds only in the 3rd quarter of a match that was a defensive battle between 2 teams that have a defensive rebound rate of over 80% and moreover the average number of rebounds in the defence by player was a staggering 3.5
I want to find all the bold words and after that extract them.
I managed to build a script that does the extraction but it only works for exact matches.
I was thinking of using difflib.SequenceMatcher but I got stuck.
You can use regex:
import re
#Find [defence(s)][space][rebound(s)][space][any word]
re.findall('defensive[\w]* rebound[\w]* [\w]+', s)
#Find [rebound(s)][space][any word][space][any word][space][any word]
re.findall('rebound[\w]* [\w]+ [\w]+ [\w]+', s)
findall returns a list of matches
If all your matches are in the same form of bold words you can extract them with:
re.findall('rebound[ \w]*defence', s)
re.findall('defensive[\w]* rebound[\w]*[ rate]*', s)
I am trying to get all names that start with a capital letter and ends with a full-stop on the same line where the number of characters are between 3 and 5
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words between 3 and 5 but haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches any sequence of characters following an initial capital letter. You can tighten that up if required, e.g. using [a-z] in the pattern [A-Z][a-z]{2,4}\. would match an upper case character followed by between 2 to 4 lowercase characters followed by a literal dot/period.
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
set(['Rosse.', 'King.'])
You may have your own reasons for wanting to use regexs here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
for line in open('text.txt'):
words = line.split()
for word in words:
if word[0].isupper() and word[-1] == '.' and 3 <= len(word)-1 <=5:
matched_words.append(word)
print matched_words