Regex using Python - python

I am trying to extract specific values from HTML that was downloaded from a specific URL, but without success.
Part of the HTML is:
"All My Loving"</td>\n<td style="text-align:center;">1963</td>\n<td><i>UK: With the Beatles<br />\nUS: Meet The Beatles!</i></td>\n<td>McCartney</td>\n<td>McCartney</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;"><span style="display:none" class="sortkey">7001450000000000000\xe2\x99\xa0</span>45</td>\n<td></td>\n</tr>\n<tr>\n<td>"All Things Must Pass"</td>\n<td style="text-align:center;">1969</td>\n<td><i>Anthology 3</i></td>\n<td>Harrison</td>\n<td>Harrison</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td></td>\n</tr>\n<tr>\n<td>"All Together Now"</td>\n<td style="text-align:center;">1967</td>\n<td><i>Yellow Submarine</i></td>\n<td>McCartney, with Lennon</td>\n<td>McCartney, with Lennon</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td></td>\n</tr>\n<tr>\n<td>"
I want to extract the title and the first <td>McCartney</td> value of each row and write them out as a JSON file.
Can I do this with a for loop and a regex? How can I do it in Python?
Thanks,

If you want to parse HTML, use an HTML parser (such as BeautifulSoup), not regex.
from bs4 import BeautifulSoup
html = '''All My Loving"</td>\n<td style="text-align:center;">1963</td>\n<td><i>UK: With the Beatles<br />\nUS: Meet The Beatles!</i></td>\n<td>McCartney</td>\n<td>McCartney</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;"><span style="display:none" class="sortkey">7001450000000000000\xe2\x99\xa0</span>45</td>\n<td></td>\n</tr>\n<tr>\n<td>"All Things Must Pass"</td>\n<td style="text-align:center;">1969</td>\n<td><i>Anthology 3</i></td>\n<td>Harrison</td>\n<td>Harrison</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td></td>\n</tr>\n<tr>\n<td>"All Together Now"</td>\n<td style="text-align:center;">1967</td>\n<td><i>Yellow Submarine</i></td>\n<td>McCartney, with Lennon</td>\n<td>McCartney, with Lennon</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td style="text-align:center;">\xe2\x80\x94</td>\n<td></td>\n</tr>\n<tr>\n<td>
'''
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')  # finds only the first <a> tag; returns None if the HTML has none
if a is not None:
    print(a.attrs['title'])
tds = soup.find_all('td') # will find all <td> tags
for td in tds:
    if 'McCartney' in td.text:
        print(td)
# All My Loving
# <td>McCartney</td>
# <td>McCartney</td>
# <td>McCartney, with Lennon</td>
# <td>McCartney, with Lennon</td>
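To address the original question directly (a for loop, a regex, and JSON output), here is a minimal sketch using only the standard library. It assumes every row has the same column order as the excerpt, with the title in the first cell and the first songwriter cell in the fourth. The shortened `html` string and the `title` and `songwriter` key names are made up for illustration. A parser remains the more robust choice if the markup varies.

```python
import json
import re

# Shortened, hypothetical excerpt in the same shape as the rows in the question.
html = (
    '<tr>\n<td>"All My Loving"</td>\n<td style="text-align:center;">1963</td>\n'
    '<td><i>With the Beatles</i></td>\n<td>McCartney</td>\n<td>McCartney</td>\n</tr>\n'
    '<tr>\n<td>"All Things Must Pass"</td>\n<td style="text-align:center;">1969</td>\n'
    '<td><i>Anthology 3</i></td>\n<td>Harrison</td>\n<td>Harrison</td>\n</tr>\n'
)

ROW = re.compile(r"<tr>(.*?)</tr>", re.DOTALL)           # one table row
CELL = re.compile(r"<td[^>]*>(.*?)</td>", re.DOTALL)     # one cell inside a row
TAG = re.compile(r"<[^>]+>")                             # any leftover inline tag

songs = []
for row in ROW.findall(html):
    # Strip inline tags such as <i> and surrounding whitespace from each cell.
    cells = [TAG.sub("", cell).strip() for cell in CELL.findall(row)]
    if len(cells) >= 4:
        songs.append({"title": cells[0].strip('"'), "songwriter": cells[3]})

print(json.dumps(songs, indent=2))
```

To write a file instead of printing, `json.dump(songs, open_file)` takes the same list.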

Related

Add strings with tags as HTML using BeautifulSoup

I have HTML table data formatted as a string that I'd like to add to an HTML row.
Let's say I have a row tag in BeautifulSoup.
<tr>
</tr>
I want to add the following data to the row which is formatted as a string (including the inner tags themselves)
<td>A\</td><td>A1<time>(3)</time>, A2<time>(4)</time>, A3<time>(8)</time></td>
Is there an easy way to do this through BeautifulSoup or otherwise? (For example, I could convert my document to a string, but that would make it harder to find the tag I need to edit.) I'm not sure if I have to add those inner tags manually.
Try tag.append:
from bs4 import BeautifulSoup
html = "<tr></tr>"
my_string = r'<td>A\</td><td>A1<time>(3)</time>, A2<time>(4)</time>, A3<time>(8)</time></td>'
soup = BeautifulSoup(html, "html.parser")
soup.find("tr").append(BeautifulSoup(my_string, "html.parser"))
print(soup)
Prints:
<tr><td>A\</td><td>A1<time>(3)</time>, A2<time>(4)</time>, A3<time>(8)</time></td></tr>

Regex to search specific text structure

I want to find all results of a certain structure in a string, preferably using regex.
To find all urls, one can use
re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', decode)
and it returns
'https://en.wikipedia.org'
I would like a regex string, which finds:
href="/wiki/*anything*"
OP: the match must begin with href="/wiki/, the middle can be anything, and it must end with ".
st = "since-OP-did-not-provide-a-sample-string-34278234$'blahhh-okay-enough.href='/wiki/anything/everything/nothing'okay-bye"
print(st[st.find('href'):st.rfind("'")+1])
OUTPUT:
href='/wiki/anything/everything/nothing'
EDIT:
I would go with BeautifulSoup if the text being parsed is HTML.
from bs4 import BeautifulSoup
text = '''<a href='/wiki/anything/everything/nothing'><img src="/hp_imgjhg/411/1/f_1hj11_100u.jpg" alt="dyufg" />well wait now <a href='/wiki/hello/how-about-now/nothing'>'''
soup = BeautifulSoup(text, features="lxml")
for line in soup.find_all('a'):
    print("href =", line.attrs['href'])
OUTPUT:
href = /wiki/anything/everything/nothing
href = /wiki/hello/how-about-now/nothing
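For the regex the question literally asks for, a minimal sketch follows. It assumes the attribute is written with double quotes, as in the question, and the sample `text` is made up. The character class `[^"]*` means "anything up to the closing quote", which keeps each match inside a single attribute value.

```python
import re

# Hypothetical sample input; only two of the three links point under /wiki/.
text = (
    '<a href="/wiki/Python">Python</a> '
    '<a href="/about">About</a> '
    '<a href="/wiki/Regex">Regex</a>'
)

# Capture the path: it must start with /wiki/ and runs until the closing quote.
links = re.findall(r'href="(/wiki/[^"]*)"', text)
print(links)
```

Dropping the parentheses around the path returns the whole `href="..."` text instead of only the path.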

Python - Extracting data from this Html tag using BS4, instead of getting None

This is my code:
html = '''
<td class="ClassName class" width="60%">Data I want to extract<span lang=EN-
UK style="font-size:12pt;font-family:'arial'"></span></td>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('td').string)
It returns None. I think it has to do with the empty span tag: perhaps it descends into that span and returns its contents? So I want to either delete that span tag, stop as soon as it finds 'Data I want to extract', or tell it to ignore empty tags.
If there are no empty tags inside the td, it actually works.
Is there a way to ignore empty tags in general, rather than just this specific span tag?
Sorry if this is too elementary, but I spent a fair amount of time searching.
Use the .text property, not .string:
html = '''
<td class="ClassName class" width="60%">Data I want to extract<span lang=EN-
UK style="font-size:12pt;font-family:'arial'"></span></td>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('td').text)
Output:
Data I want to extract
Use .text:
>>> soup.find('td').text
u'Data I want to extract'
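The difference between the two properties can be seen in a small sketch. It uses a simplified, made-up cell rather than the exact markup from the question. `.string` only returns a value when a tag has a single child, while `.text` (equivalent to `get_text()`) joins the text of all descendants, so an empty span does not get in the way.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical cell: some text followed by an empty <span>.
html = '<td>Data I want to extract<span></span></td>'
td = BeautifulSoup(html, 'html.parser').td

print(td.string)                   # None: the tag has two children, so .string is ambiguous
print(td.get_text(strip=True))     # text of all descendants, joined and stripped
print(next(td.stripped_strings))   # only the first non-empty text node
```

`stripped_strings` is useful when a cell holds several pieces of text and only one of them is wanted.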

Using BeautifulSoup to select div blocks within HTML

I am trying to parse several div blocks from a website's HTML using Beautiful Soup. However, I cannot work out which function should be used to select these div blocks. I have tried the following:
import urllib2
from bs4 import BeautifulSoup
def getData():
    html = urllib2.urlopen("http://www.racingpost.com/horses2/results/home.sd?r_date=2013-09-22", timeout=10).read().decode('UTF-8')
    soup = BeautifulSoup(html)
    print(soup.title)
    print(soup.find_all('<div class="crBlock ">'))
getData()
I want to be able to select everything between <div class="crBlock "> and its correct end </div>. (Obviously there are other div tags but I want to select the block all the way down to the one that represents the end of this section of html.)
The correct use would be:
soup.find_all('div', class_="crBlock")
The trailing space in class="crBlock " is not part of the class name, so search for "crBlock" without it. By default, Beautiful Soup returns the entire tag, including its contents. You can then do whatever you want with it if you store it in a variable. If you are only looking for one div, you can use find() instead. For instance:
div = soup.find('div', class_="crBlock")
print(div.find_all(text='foobar'))
Check out the documentation page for more info on all the filters you can use.
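A minimal sketch of selecting a whole block, using a made-up HTML fragment rather than the live page. It searches for the class name without the trailing space, which matches however the whitespace inside the attribute value is written. The nested `inner` div and the heading text are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one target block containing a nested div, plus an unrelated div.
html = '''
<div class="crBlock ">
  <h3>Race one</h3>
  <div class="inner">details</div>
</div>
<div class="other">ignored</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Each result is the full tag, from its opening <div> to its matching </div>.
blocks = soup.find_all("div", class_="crBlock")
for block in blocks:
    print(block.h3.get_text())
    print(block.find("div", class_="inner").get_text())
```

Because the parser tracks nesting, the inner `</div>` does not end the block early, which is the part that is hard to get right with string searching.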

Sifting a list returned from a webscrape produced with Beautiful Soup

I am coding in Python. I have been trying to scrape the names, team images, and colleges of NBA draft prospects. However, when I scrape the college names I get both the college page link and the college name. How do I get only the college names? I have tried adding .string and .text to the end of anchor (anchor.string).
import urllib2
from BeautifulSoup import BeautifulSoup
# or if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup
list = []
soup = BeautifulSoup(urllib2.urlopen(
    'http://www.cbssports.com/nba/draft/mock-draft'
    ).read()
)
rows = soup.findAll("table",
    attrs = {'class':'data borderTop'})[0].tbody.findAll("tr")[2:]
for row in rows:
    fields = row.findAll("td")
    if len(fields) >= 3:
        anchor = row.findAll("td")[2].findAll("a")[1:]
        if anchor:
            print anchor
Instead of just:
print anchor
use:
print anchor[0].text
The format of an anchor in HTML is <a href='web_address'>Text-that-is-displayed</a>, so printing the whole tag shows both the address and the displayed text. Here anchor is a list of such tags, which is why .string and .text fail on it directly: index into the list first, then read .text to get only the displayed part. BeautifulSoup, which the code already uses, handles this, so no regular expression is needed.
