I have this code trying to parse search results from a grant website (please find the URL in the code, I can't post the link yet until my rep is higher). I want the "Year" and "Award Amount" values that appear after the <b> tags and before the <br> tags.
Two questions:
1) Why is this only returning the 1st table?
2) Any way I can get the text that is inside the <b> tags (i.e. the Year and Award Amount label strings) and the text right after them (i.e. the actual values such as 2015 and $100000)?
Specifically:
<td valign="top">
<b>Year: </b>2014<br>
<b>Award Amount: </b>$84,907 </td>
Here is my script:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.ned.org/wp-content/themes/ned/search/grant-search.php?' \
      'organizationName=&region=ASIA&projectCountry=China&amount=&fromDate=&toDate=&' \
      'projectFocus%5B%5D=&search=&maxCount=25&orderBy=Year&start=1&sbmt=1'
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, "html.parser")

tables = soup.find_all('table')

data = {
    'col_names': [],
    'info': [],
    'year_amount': []
}

index = 0
for table in tables:
    rows = table.find_all('tr')[1:]
    for row in rows:
        cols = row.find_all('td')
        data['col_names'].append(cols[0].get_text())
        data['info'].append(cols[1].get_text())
        try:
            data['year_amount'].append(cols[2].get_text())
        except IndexError:
            data['year_amount'].append(None)
    grant_df = pd.DataFrame(data)
    index += 1
    filename = 'grant ' + str(index) + '.csv'
    grant_df.to_csv(filename)
I would suggest approaching the table parsing in a different manner. All of the information is available in the first row of each table. So you can parse the text of the row like:
Code:
text = '\n'.join([x.strip() for x in rows[0].get_text().split('\n')
                  if x.strip()]).replace(':\n', ': ')
data_dict = {k.strip(): v.strip() for k, v in
             [x.split(':', 1) for x in text.split('\n')]}
How?:
This takes the text and:
- splits it on newlines
- removes any blank lines
- removes any leading/trailing space
- joins the lines back together into a single text
- joins any line ending in : with the next line
Then:
- split the text again by newline
- split each line by :
- strip any whitespace from the text on either side of the :
- use the split text as key and value to a dict
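As a quick illustration of that transformation on a made-up row text (not fetched from the site):
raw = '\nYear:\n  2014\nAward Amount:\n$84,907\n'
text = '\n'.join([x.strip() for x in raw.split('\n')
                  if x.strip()]).replace(':\n', ': ')
# text is now 'Year: 2014\nAward Amount: $84,907'
data_dict = {k.strip(): v.strip() for k, v in
             [x.split(':', 1) for x in text.split('\n')]}
# data_dict is {'Year': '2014', 'Award Amount': '$84,907'}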
Test Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.ned.org/wp-content/themes/ned/search/grant-search.php?' \
      'organizationName=&region=ASIA&projectCountry=China&amount=&' \
      'fromDate=&toDate=&projectFocus%5B%5D=&search=&maxCount=25&' \
      'orderBy=Year&start=1&sbmt=1'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

data = []
for table in soup.find_all('table'):
    rows = table.find_all('tr')
    text = '\n'.join([x.strip() for x in rows[0].get_text().split('\n')
                      if x.strip()]).replace(':\n', ': ')
    data_dict = {k.strip(): v.strip() for k, v in
                 [x.split(':', 1) for x in text.split('\n')]}
    if data_dict.get('Award Amount'):
        data.append(data_dict)

grant_df = pd.DataFrame(data)
print(grant_df.head())
Results:
Award Amount Description \
0 $84,907 To strengthen the capacity of China's rights d...
1 $204,973 To provide an effective forum for free express...
2 $48,000 To promote religious freedom in China. The org...
3 $89,000 To educate and train civil society activists o...
4 $65,000 To encourage greater public discussion, transp...
Organization Name Project Country Project Focus \
0 NaN Mainland China Rule of Law
1 Princeton China Initiative Mainland China Freedom of Information
2 NaN Mainland China Rule of Law
3 NaN Mainland China Democratic Ideas and Values
4 NaN Mainland China Rule of Law
Project Region Project Title Year
0 Asia Empowering the Chinese Legal Community 2014
1 Asia Supporting Free Expression and Open Debate for... 2014
2 Asia Religious Freedom, Rights Defense and Rule of ... 2014
3 Asia Education on Civil Society and Democratization 2014
4 Asia Promoting Democratic Policy Change in China 2014
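As a side note on question 2: if you do want the label/value pairs straight from the <b> tags, each value is the text node that follows its tag, reachable via .next_sibling. A minimal sketch against just the snippet from the question:
from bs4 import BeautifulSoup

html = '''<td valign="top">
<b>Year: </b>2014<br>
<b>Award Amount: </b>$84,907 </td>'''

cell = BeautifulSoup(html, 'html.parser').td
for b in cell.find_all('b'):
    label = b.get_text().strip(': ')  # text inside the <b> tag, e.g. 'Year'
    value = b.next_sibling.strip()    # text node right after it, e.g. '2014'
    print(label, value)
# Year 2014
# Award Amount $84,907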
I'm trying to scrape the positions, the artists and the songs from a ranking list on kworb. For example: https://kworb.net/spotify/country/us_weekly.html
I used the following script:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://kworb.net/spotify/country/us_weekly.html")
content = response.content
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.get_text())
And here is the output:
ITUNES
WORLDWIDE
ARTISTS
CHARTS
DON'T PRAY
RADIO
SPOTIFY
YOUTUBE
TRENDING
HOME
CountriesArtistsListenersCities
Spotify Weekly Chart - United States - 2023/02/09 | Totals
PosP+Artist and TitleWksPk(x?)StreamsStreams+Total
1
+1
SZA - Kill Bill
9
1(x5)
15,560,813
+247,052
148,792,089
2
-1
Miley Cyrus - Flowers
4
1(x3)
13,934,413
-4,506,662
75,009,251
3
+20
Morgan Wallen - Last Night
2
3(x1)
11,560,741
+6,984,649
16,136,833
...
How do I get only the positions, the artists and the songs separately, and store them in an Excel file first?
expected output:
Pos Artist Songs
1 SZA Kill Bill
2 Miley Cyrus Flowers
3 Morgan Wallen Last Night
...
Best practice for scraping tables is pandas.read_html(); it uses BeautifulSoup under the hood for you.
import pandas as pd

# find the table by id and select the first index from the list of dfs
df = pd.read_html('https://kworb.net/spotify/country/us_weekly.html', attrs={'id':'spotifyweekly'})[0]

# split the column by delimiter and create your expected columns
df[['Artist','Song']] = df['Artist and Title'].str.split(' - ', n=1, expand=True)

# pick your columns and export to excel
df[['Pos','Artist','Song']].to_excel('yourfile.xlsx', index=False)
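Note that the attrs argument restricts read_html() to tables whose HTML attributes match, and since an id must be unique on the page, taking [0] from the returned list is safe here.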
An alternative based on a direct approach:
import requests
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(requests.get("https://kworb.net/spotify/country/us_weekly.html").content, 'html.parser')

data = []
# iterate only over rows of the chart table that contain <td> cells
for e in soup.select('#spotifyweekly tr:has(td)'):
    data.append({
        'Pos': e.td.text,
        'Artist': e.a.text,
        'Song': e.a.find_next_sibling('a').text
    })

pd.DataFrame(data).to_excel('yourfile.xlsx', index=False)
Outputs

Pos  Artist         Song
1    SZA            Kill Bill
2    Miley Cyrus    Flowers
3    Morgan Wallen  Last Night
4    Metro Boomin   Creepin'
5    Lil Uzi Vert   Just Wanna Rock
6    Drake          Rich Flex
7    Metro Boomin   Superhero (Heroes & Villains) [with Future & Chris Brown]
8    Sam Smith      Unholy
...
I am trying to scrape the text of some elements in a table using requests and BeautifulSoup, specifically the country names and the 2-letter country codes from this website.
Here is my code, which I have progressively walked back:
import requests
import bs4
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
for i in range(3):
    row = soup.find(f'#row{i} td')
    print(row)  # printing to check progress for now
I had hoped to go row-by-row and walk the tags to get the strings like so (over range(249)). However, soup.find() doesn't appear to work; it just prints None. soup.select(), however, works fine:
for i in range(3):
    row = soup.select(f'#row{i} td')
    print(row)
Why does soup.find() not work as expected here?
While .find() deals only with the first occurrence of an element, .select() / .find_all() will give you a ResultSet you can iterate.
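A quick demo of why the original call returns nothing: find() treats that string as a literal tag name, not a CSS selector (markup invented for the example):
from bs4 import BeautifulSoup

html = '<table><tr id="row0"><td>Afghanistan</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('#row0 td'))    # None - looks for a tag literally named '#row0 td'
print(soup.select('#row0 td'))  # [<td>Afghanistan</td>] - parsed as a CSS selector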
There are a lot of ways to achieve your goal, but the basic pattern is mostly the same - select the rows of the table and iterate over them.
In this first case I select the table by its id and, staying close to your initial approach, the <tr> elements by their id, using the CSS selector [id^="row"], which matches any id attribute whose value starts with "row". In addition I use .stripped_strings to extract the text from the elements, store it in a list, and pick items by index:
for row in soup.select('#countriesTable tr[id^="row"]'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
or, more precisely, selecting all <tr> in the <tbody> of the tag with id countriesTable:
for row in soup.select('#countriesTable tbody tr'):
    row = list(row.stripped_strings)
    print(row[2], row[3])
...
An alternative, and in my opinion the best, way to scrape tables is pandas.read_html(), which works with BeautifulSoup under the hood and does most of the work for you:
import pandas as pd
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,:]
or to get only the two specific columns:
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,[1,2]]
             Name ISO 2
0     Afghanistan    AF
1   Åland Islands    AX
2         Albania    AL
3         Algeria    DZ
4  American Samoa    AS
5         Andorra    AD
...
find() expects its first argument to be the name of the tag you're searching for; it won't work with CSS selectors.
So you'll need:
row = soup.find('tr', { 'id': f"row{i}" })
To get the tr with the desired ID.
Then to get the 2-letter country code, find the first a with title "ISO 3166-1 alpha-2 code" and get its .text:
iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text
To get the full name, there is no classname to search for, so I'd take the third <td>, then search it for the span containing the country name:
name = row.findAll('td')[2].findAll('span')[2].text
Putting it all together gives:
import requests
import bs4
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for i in range(3):
    row = soup.find('tr', { 'id': f"row{i}" })
    iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text
    name = row.findAll('td')[2].findAll('span')[2].text
    print(name, iso)
Which outputs:
Afghanistan AF
Åland Islands AX
Albania AL
find_all() and select() return a list, but find() and select_one() return only a single element.
import requests
import bs4
import pandas as pd
res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,'lxml')
data = []
for row in soup.select('.tablesorter.mark > tbody tr'):
    name = row.find("span", class_="sortkey").text
    country_code = row.select_one('td:nth-child(4)').text.replace('\n', '').strip()
    data.append({
        'name': name,
        'country_code': country_code})

df = pd.DataFrame(data)
print(df)
Output:
name country_code
0 afghanistan AF
1 aland-islands AX
2 albania AL
3 algeria DZ
4 american-samoa AS
.. ... ...
244 wallis-and-futuna WF
245 western-sahara EH
246 yemen YE
247 zambia ZM
248 zimbabwe ZW
[249 rows x 2 columns]
I want to retrieve the tables on the following website and store them in a pandas dataframe: https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs
However, the third table on the page returns an empty dataframe with all the table's data stored in tuples as the column headers:
Empty DataFrame
Columns: [(Service Providers, State of Colorado), (Cuban - Haitian Program, $0), (Refugee Preventive Health Program, $150,000.00), (Refugee School Impact, $450,000), (Services to Older Refugees Program, $0), (Targeted Assistance - Discretionary, $0), (Total FY, $600,000)]
Index: []
Is there a way to "flatten" the tuple headers into header + values, then append this to a dataframe made up of all four tables? My code is below -- it has worked on other similar pages but keeps breaking because of this table's formatting. Thanks!
funds_df = pd.DataFrame()
url = 'https://www.acf.hhs.gov/programs/orr/resource/ffy-2011-12-state-of-colorado-orr-funded-programs'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
year = url.split('ffy-')[1].split('-orr')[0]
tables = page.content
df_list = pd.read_html(tables)
for df in df_list:
    df['URL'] = url
    df['YEAR'] = year
    funds_df = funds_df.append(df)
For this site, there's no need for BeautifulSoup or requests.
pandas.read_html creates a list of DataFrames for each <table> at the URL.
import pandas as pd

url = 'https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs'

# read the url
dfl = pd.read_html(url)

# see each dataframe in the list; there are 4 in this case
for i, d in enumerate(dfl):
    print(i)
    display(d)  # display works in Jupyter, otherwise use print
    print('\n')
dfl[0]
Service Providers Cash and Medical Assistance* Refugee Social Services Program Targeted Assistance Program TOTAL
0 State of Colorado $7,140,000 $1,896,854 $503,424 $9,540,278
dfl[1]
WF-CMA 2 RSS TAG-F CMA Mandatory 3 TOTAL
0 $3,309,953 $1,896,854 $503,424 $7,140,000 $9,540,278
dfl[2]
Service Providers Refugee School Impact Targeted Assistance - Discretionary Services to Older Refugees Program Refugee Preventive Health Program Cuban - Haitian Program Total
0 State of Colorado $430,000 $0 $100,000 $150,000 $0 $680,000
dfl[3]
Volag Affiliate Name Projected ORR MG Funding Director
0 CWS Ecumenical Refugee & Immigration Services $127,600 Ferdi Mevlani 1600 Downing St., Suite 400 Denver, CO 80218 303-860-0128
1 ECDC ECDC African Community Center $308,000 Jennifer Guddiche 5250 Leetsdale Drive Denver, CO 80246 303-399-4500
2 EMM Ecumenical Refugee Services $191,400 Ferdi Mevlani 1600 Downing St., Suite 400 Denver, CO 80218 303-860-0128
3 LIRS Lutheran Family Services Rocky Mountains $121,000 Floyd Preston 132 E Las Animas Colorado Springs, CO 80903 719-314-0223
4 LIRS Lutheran Family Services Rocky Mountains $365,200 James Horan 1600 Downing Street, Suite 600 Denver, CO 80218 303-980-5400
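Coming back to the original question about tuple headers: if a table does come back with its data fused into the column headers, the (header, value) tuples can be rebuilt into a one-row frame. A minimal sketch, assuming columns shaped like those in the error message:
import pandas as pd

# simulate the problem: read_html stuffed (header, value) pairs into the columns
bad = pd.DataFrame(columns=pd.MultiIndex.from_tuples([
    ('Service Providers', 'State of Colorado'),
    ('Refugee School Impact', '$450,000'),
    ('Total FY', '$600,000'),
]))

# first tuple element -> real header, second tuple element -> the single data row
flat = pd.DataFrame([[value for _, value in bad.columns]],
                    columns=[header for header, _ in bad.columns])
print(flat)
#   Service Providers Refugee School Impact  Total FY
# 0 State of Colorado              $450,000  $600,000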
I'm scraping a website using Python and I'm having trouble extracting the dates and creating a new Date column with regex.
The code below uses BeautifulSoup to scrape the event data and the event links:
import pandas as pd
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('https://www.techmeme.com/events').read()
soup = bs.BeautifulSoup(source,'html.parser')
event = []
links = []

# ---Event Data---
for a in soup.find_all('a'):
    event.append(a.text)

df_event = pd.DataFrame(event)
df_event.columns = ['Event']
df_event = df_event.iloc[1:]

# ---Links---
for a in soup.find_all('a', href=True):
    if a.text:
        links.append(a['href'])

df_link = pd.DataFrame(links)
df_link.columns = ['Links']

# ---Combine dfs---
df = pd.concat([df_event.reset_index(drop=True), df_link.reset_index(drop=True)], sort=False, axis=1)
At the beginning of each event data row, the date is present. Example: (May 26-29Augmented World ExpoSan...). The dates follow the formats below, and I have included my regex for each (which I believe is correct).
Different date formats:
May 27: [A-Z][a-z]*(\ )[0-9]{1,2}
May 26-29: [A-Z][a-z]*(\ )[0-9]{1,2}-[0-9]{1,2}
May 28-Jun 2: [A-Z][a-z]*(\ )[0-9]{1,2}-[A-Z][a-z]*(\ )[0-9]{1,2}
Combined
[A-Z][a-z]*(\ )[0-9]{1,2}|[A-Z][a-z]*(\ )[0-9]{1,2}-[0-9]{1,2}|[A-Z][a-z]*(\ )[0-9]{1,2}-[A-Z][a-z]*(\ )[0-9]{1,2}
When I try to create a new column and extract the dates using Regex, I just receive an empty df['Date'] column.
df['Date'] = df['Event'].str.extract(r'[A-Z][a-z]*(\ )[0-9]{1,2}')
df.head()
Any help would be greatly appreciated! Thank you.
You may use
date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'
df['Date'] = df['Event'].str.extract(date_reg, expand=False)
See the regex demo. If you want to match as whole words and numbers, you may use (?<![A-Za-z])([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)(?!\d).
Details
- [A-Z][a-z]* - an uppercase letter and then 0 or more lowercase letters
- " " - a space (replace with \s to match any whitespace)
- [0-9]{1,2} - one or two digits
- (?:-(?:[A-Z][a-z]* )?[0-9]{1,2})? - an optional sequence of:
  - "-" - a hyphen
  - (?:[A-Z][a-z]* )? - an optional sequence of:
    - [A-Z][a-z]* - an uppercase letter and then 0 or more lowercase letters
    - " " - a space (replace with \s to match any whitespace)
  - [0-9]{1,2} - one or two digits
The (?<![A-Za-z]) construct is a lookbehind that fails the match if there is a letter immediately before the current location and (?!\d) fails the match if there is a digit immediately after.
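A quick sanity check of the pattern on the three formats from the question (sample strings invented for the demo). Note that str.extract keeps only what the capture group matches, which is why the original attempt, whose only group was (\ ), came back effectively empty:
import re

date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'
samples = ['May 27Earnings: HPQ,BOX',
           'May 26-29Augmented World Expo',
           'May 28-Jun 2WeAreDevelopers']
for s in samples:
    print(re.search(date_reg, s).group(1))
# May 27
# May 26-29
# May 28-Jun 2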
This script:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.techmeme.com/events'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

data = []
for row in soup.select('.rhov a'):
    date, event, place = map(lambda x: x.get_text(strip=True), row.find_all('div', recursive=False))
    data.append({'Date': date, 'Event': event, 'Place': place,
                 'Link': 'https://www.techmeme.com' + row['href']})

df = pd.DataFrame(data)
print(df)
will create this dataframe:
Date Event Place Link
0 May 26-29 NOW VIRTUAL:Augmented World Expo Santa Clara https://www.techmeme.com/gotos/www.awexr.com/
1 May 27 Earnings: HPQ,BOX https://www.techmeme.com/gotos/finance.yahoo.c...
2 May 28 Earnings: CRM, VMW https://www.techmeme.com/gotos/finance.yahoo.c...
3 May 28-29 CANCELED:WeAreDevelopers World Congress Berlin https://www.techmeme.com/gotos/www.wearedevelo...
4 Jun 2 Earnings: ZM https://www.techmeme.com/gotos/finance.yahoo.c...
.. ... ... ... ...
140 Dec 7-10 NEW DATE:GOTO Amsterdam Amsterdam https://www.techmeme.com/gotos/gotoams.nl/
141 Dec 8-10 Microsoft Azure + AI Conference Las Vegas https://www.techmeme.com/gotos/azureaiconf.com...
142 Dec 9-10 NEW DATE:Paris Blockchain Week Summit Paris https://www.techmeme.com/gotos/www.pbwsummit.com/
143 Dec 13-16 NEW DATE:KNOW Identity Las Vegas https://www.techmeme.com/gotos/www.knowidentit...
144 Dec 15-16 NEW DATE, NEW LOCATION:Fortune Brainstorm Tech San Francisco https://www.techmeme.com/gotos/fortuneconferen...
[145 rows x 4 columns]
I am extracting data from https://data.gov.au/dataset?organization=reservebankofaustralia&_groups_limit=0&groups=business and got the output I wanted, but now the problem is: the output I am getting is truncated, like "Business Support an..." and "Reserve Bank of Aus...", not the complete text. I want to print the whole text, without the "...", for all of them. I replaced lines 9 and 10 of the answer by jezrael (please refer to "Fetching content from html and write fetched content in a specific format in CSV" with code) with:
org = soup.find_all('a', {'class':'nav-item active'})[0].get('title')
groups = soup.find_all('a', {'class':'nav-item active'})[1].get('title')
When I run it separately I get the error: list index out of range. What should I use to extract the complete sentences? I also tried:
org = soup.find_all('span', class_="filtered pill")
which gave a result of type string when I ran it separately, but it could not run with the whole code.
All the data with longer text is in the title attribute; the shorter values are in the element text. So add two if checks:
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

dfs = []
for i in webpage_urls:  # webpage_urls is the list of result pages from the linked answer
    wiki2 = i
    page = urllib.request.urlopen(wiki2)
    soup = BeautifulSoup(page, "lxml")

    lobbying = {}

    # always only 2 active li, so select the first by [0] and the second by [1]
    l = soup.find_all('li', class_="nav-item active")
    org = l[0].a.get('title')
    if org == '':
        org = l[0].span.get_text()
    groups = l[1].a.get('title')
    if groups == '':
        groups = l[1].span.get_text()

    data2 = soup.find_all('h3', class_="dataset-heading")
    for element in data2:
        lobbying[element.a.get_text()] = {}

    prefix = "https://data.gov.au"
    for element in data2:
        lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]
        lobbying[element.a.get_text()]["Organisation"] = org
        lobbying[element.a.get_text()]["Group"] = groups
    #print(lobbying)

    df = pd.DataFrame.from_dict(lobbying, orient='index') \
           .rename_axis('Titles').reset_index()
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
df1 = df.drop_duplicates(subset='Titles').reset_index(drop=True)
df1['Organisation'] = df1['Organisation'].str.replace(r'\(\d+\)', '')
df1['Group'] = df1['Group'].str.replace(r'\(\d+\)', '')
print(df1.head())
Titles \
0 Banks – Assets
1 Consolidated Exposures – Immediate and Ultimat...
2 Foreign Exchange Transactions and Holdings of ...
3 Finance Companies and General Financiers – Sel...
4 Liabilities and Assets – Monthly
link \
0 https://data.gov.au/dataset/banks-assets
1 https://data.gov.au/dataset/consolidated-expos...
2 https://data.gov.au/dataset/foreign-exchange-t...
3 https://data.gov.au/dataset/finance-companies-...
4 https://data.gov.au/dataset/liabilities-and-as...
Organisation Group
0 Reserve Bank of Australia Business Support and Regulation
1 Reserve Bank of Australia Business Support and Regulation
2 Reserve Bank of Australia Business Support and Regulation
3 Reserve Bank of Australia Business Support and Regulation
4 Reserve Bank of Australia Business Support and Regulation
I guess you are trying to do this: each link has a title attribute, so I simply check whether a title attribute is present and, if it is, print it.
There are blank lines in the output because there are a few links with title="", so you can avoid those with a conditional statement and then get all of the non-empty titles.
>>> l = soup.find_all('a')
>>> for i in l:
... if i.has_attr('title'):
... print(i['title'])
...
Remove
Remove
Reserve Bank of Australia
Business Support and Regulation
Creative Commons Attribution 3.0 Australia
>>>
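To skip those blank lines, the conditional only needs to also test that the value is non-empty:
for a in soup.find_all('a'):
    # keep only links whose title attribute exists and is non-empty
    if a.has_attr('title') and a['title']:
        print(a['title'])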