How to scrap values of a specific paragraph based on pattern - python

In the page there is a paragraph like this:
The latest financial statements filed by x in the business register it corresponds to the year 2020 and shows a turnover range of 'Between 6,000,000 and 30,000,000 Euros'.
In the page it is: L’ultimo bilancio depositato da Euro P.a. - S.r.l. nel registro delle imprese corrisponde all’anno 2020 e riporta un range di fatturato di 'Tra 6.000.000 e 30.000.000 Euro'.
I need to scrape the value inside the ' ' in this case (Between 6,000,000 and 30,000,000 Euros).
And put it inside a column called "range".
I tried with no success this code:
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.informazione-aziende.it/Azienda_EURO-PA-SRL'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
turnover = soup.find("span", {"id": "turnover"}).text
year = soup.find("span", {"id": "year"}).text
data = {'turnover': turnover, 'year': year}
df = pd.DataFrame(data, index=[0])
print(df)
But i get: AttributeError: 'NoneType' object has no attribute 'text'

First, scrape the whole text with BeautifulSoup, and assign it to a variable such as:
text = "The latest financial statements filed by x in the business register it corresponds to the year 2020 and shows a turnover range of 'Between 6,000,000 and 30,000,000 Euros'."
Then, execute the following code:
import re
pattern = "'.+'"
result = re.search(pattern, text)
result = result[0].replace("'", "")
The output will be:
'Between 6,000,000 and 30,000,000 Euros'

An alternative can be:
Split the text by the single quote character - ' - and get the text at position 1 of the list.
Code:
text = "The latest financial statements filed by x in the business register it corresponds to the year 2020 and shows a turnover range of 'Between 6,000,000 and 30,000,000 Euros'."
# Get the text at position 1 of the list:
desired_text = text.split("'")[1]
# Print the result:
print(desired_text)
Result:
Between 6,000,000 and 30,000,000 Euros

Related

Python Regex Extract Date to New Column in Dataframe

I'm scraping a website using Python and I'm having troubles with extracting the dates and creating a new Date dataframe with Regex.
The code below is using BeautifulSoup to scrape event data and the event links:
import pandas as pd
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('https://www.techmeme.com/events').read()
soup = bs.BeautifulSoup(source,'html.parser')
event = []
links = []
# ---Event Data---
for a in soup.find_all('a'):
event.append(a.text)
df_event = pd.DataFrame(event)
df_event.columns = ['Event']
df_event = df_event.iloc[1:]
# ---Links---
for a in soup.find_all('a', href=True):
if a.text:
links.append(a['href'])
df_link = pd.DataFrame(links)
df_link.columns = ['Links']
# ---Combines dfs---
df = pd.concat([df_event.reset_index(drop=True),df_link.reset_index(drop=True)],sort=False, axis=1)
At the beginning of the each event data row, the date is present. Example: (May 26-29Augmented World ExpoSan...). The date follows the following format and I have included my Regex(which I believe is correct).
Different Date Formats:
May 27: [A-Z][a-z]*(\ )[0-9]{1,2}
May 26-29: [A-Z][a-z]*(\ )[0-9]{1,2}-[0-9]{1,2}
May 28-Jun 2: [A-Z][a-z]*(\ )[0-9]{1,2}-[A-Z][a-z]*(\ )[0-9]{1,2}
Combined
[A-Z][a-z]*(\ )[0-9]{1,2}|[A-Z][a-z]*(\ )[0-9]{1,2}-[0-9]{1,2}|[A-Z][a-z]*(\ )[0-9]{1,2}-[A-Z][a-z]*(\ )[0-9]{1,2}
When I try to create a new column and extract the dates using Regex, I just receive an empty df['Date'] column.
df['Date'] = df['Event'].str.extract(r[A-Z][a-z]*(\ )[0-9]{1,2}')
df.head()
Any help would be greatly appreciated! Thank you.
You may use
date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'
df['Date'] = df['Event'].str.extract(date_reg, expand=False)
See the regex demo. If you want to match as whole words and numbers, you may use (?<![A-Za-z])([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)(?!\d).
Details
[A-Z][a-z]* - an uppercase letter and then 0 or more lowercase letters
- a space (replace with \s to match any whitespace)
[0-9]{1,2} - one or two digits
(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})? - an optional sequence of
- - hyphen
(?:[A-Z][a-z]* )? - an optional sequence of
[A-Z][a-z]* - an uppercase letter and then 0 or more lowercase letters
- a space (replace with \s to match any whitespace)
[0-9]{1,2} - one or two digits
The (?<![A-Za-z]) construct is a lookbehind that fails the match if there is a letter immediately before the current location and (?!\d) fails the match if there is a digit immediately after.
This script:
import requests
from bs4 import BeautifulSoup
url = 'https://www.techmeme.com/events'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data = []
for row in soup.select('.rhov a'):
date, event, place = map(lambda x: x.get_text(strip=True), row.find_all('div', recursive=False))
data.append({'Date': date, 'Event': event, 'Place': place, 'Link': 'https://www.techmeme.com' + row['href']})
df = pd.DataFrame(data)
print(df)
will create this dataframe:
Date Event Place Link
0 May 26-29 NOW VIRTUAL:Augmented World Expo Santa Clara https://www.techmeme.com/gotos/www.awexr.com/
1 May 27 Earnings: HPQ,BOX https://www.techmeme.com/gotos/finance.yahoo.c...
2 May 28 Earnings: CRM, VMW https://www.techmeme.com/gotos/finance.yahoo.c...
3 May 28-29 CANCELED:WeAreDevelopers World Congress Berlin https://www.techmeme.com/gotos/www.wearedevelo...
4 Jun 2 Earnings: ZM https://www.techmeme.com/gotos/finance.yahoo.c...
.. ... ... ... ...
140 Dec 7-10 NEW DATE:GOTO Amsterdam Amsterdam https://www.techmeme.com/gotos/gotoams.nl/
141 Dec 8-10 Microsoft Azure + AI Conference Las Vegas https://www.techmeme.com/gotos/azureaiconf.com...
142 Dec 9-10 NEW DATE:Paris Blockchain Week Summit Paris https://www.techmeme.com/gotos/www.pbwsummit.com/
143 Dec 13-16 NEW DATE:KNOW Identity Las Vegas https://www.techmeme.com/gotos/www.knowidentit...
144 Dec 15-16 NEW DATE, NEW LOCATION:Fortune Brainstorm Tech San Francisco https://www.techmeme.com/gotos/fortuneconferen...
[145 rows x 4 columns]

How to get a value from a text document that has an unstructured table

I am trying to get the total assets values from the 10-K text filings. The problem is that the html format varies from one company to another.
Take Apple 10-K as an example:
total assets is in a table that has balance sheet header and typical terms like cash, inventories, ... exist in some rows of that table. In the last row, there is a summation of assets of 290,479 for 2015 and 231,839 for 2014. I wanted to get the number for the 2015 --> 290,479. I have not been able to find a way that
1) finds the relevant table that has some specific headings (like balance sheet) and words in rows (cash, ...)
2) get the value in the row that has the word total assets and belongs to the greater year (2015 for our example).
import re
url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, "xml")
for tag in soup.find_all(text=re.compile('Total\sassets')):
print(tag.findParent('table').findParent('table'))
Using lxml or html.parser instead of xml I can get
title > CONSOLIDATED BALANCE SHEETS
row > Total assets
column 0 > Total assets
column 1 >
column 2 > $
column 3 > 290,479
column 4 >
column 5 >
column 6 > $
column 7 > 231,839
column 8 >
using code
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/d17062d10k.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')# "lxml")
# get all `b` to find title
all_b = soup.find_all('b')
for item in all_b:
# check text in every `b`
title = item.get_text(strip=True)
if title == 'CONSOLIDATED BALANCE SHEETS':
print('title >', title)
# get first `table` after `b`
table = item.parent.findNext('table')
# all rows in table
all_tr = table.find_all('tr')
for tr in all_tr:
# all columns in row
all_td = tr.find_all('td')
# text in first column
text = all_td[0].get_text(strip=True)
if text == 'Total assets':
print('row >', text)
for i, td in enumerate(all_td):
print('column', i, '>', td.get_text(strip=True))

get title inside link tag in HTML using beautifulsoup

I am extracting data from https://data.gov.au/dataset?organization=reservebankofaustralia&_groups_limit=0&groups=business
and got output I wanted but now problem is: the output that I am getting is Business Support an... and Reserve Bank of Aus...., not complete text, I want to print the whole text not "......." for all. I replaced line 9 and 10 in answer by jezrael, please refer to Fetching content from html and write fetched content in a specific format in CSV with code
org = soup.find_all('a', {'class':'nav-item active'})[0].get('title')
groups = soup.find_all('a', {'class':'nav-item active'})[1].get('title')
. And I am running it separately and getting error: list index out of range. What should I use to extract complete sentences? I also tried :
org = soup.find_all('span',class_="filtered pill"), it gave answer of type string when I ran separately but could not run with whole code.
All data with longer text are in attribut title, shorter are in text. So add double if:
for i in webpage_urls:
wiki2 = i
page= urllib.request.urlopen(wiki2)
soup = BeautifulSoup(page, "lxml")
lobbying = {}
#always only 2 active li, so select first by [0] and second by [1]
l = soup.find_all('li', class_="nav-item active")
org = l[0].a.get('title')
if org == '':
org = l[0].span.get_text()
groups = l[1].a.get('title')
if groups == '':
groups = l[1].span.get_text()
data2 = soup.find_all('h3', class_="dataset-heading")
for element in data2:
lobbying[element.a.get_text()] = {}
data2[0].a["href"]
prefix = "https://data.gov.au"
for element in data2:
lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]
lobbying[element.a.get_text()]["Organisation"] = org
lobbying[element.a.get_text()]["Group"] = groups
#print(lobbying)
df = pd.DataFrame.from_dict(lobbying, orient='index') \
.rename_axis('Titles').reset_index()
dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
df1 = df.drop_duplicates(subset = 'Titles').reset_index(drop=True)
df1['Organisation'] = df1['Organisation'].str.replace('\(\d+\)', '')
df1['Group'] = df1['Group'].str.replace('\(\d+\)', '')
print (df1.head())
Titles \
0 Banks – Assets
1 Consolidated Exposures – Immediate and Ultimat...
2 Foreign Exchange Transactions and Holdings of ...
3 Finance Companies and General Financiers – Sel...
4 Liabilities and Assets – Monthly
link \
0 https://data.gov.au/dataset/banks-assets
1 https://data.gov.au/dataset/consolidated-expos...
2 https://data.gov.au/dataset/foreign-exchange-t...
3 https://data.gov.au/dataset/finance-companies-...
4 https://data.gov.au/dataset/liabilities-and-as...
Organisation Group
0 Reserve Bank of Australia Business Support and Regulation
1 Reserve Bank of Australia Business Support and Regulation
2 Reserve Bank of Australia Business Support and Regulation
3 Reserve Bank of Australia Business Support and Regulation
4 Reserve Bank of Australia Business Support and Regulation
I guess you are trying to do this. Here in each link there is title attribute. So here I simply checked if there is any title attribute present or not and if it is then I simply printed it.
There are blank lines because there are few links where title="" so you can avoid that using conditional statement and then get all titles from that.
>>> l = soup.find_all('a')
>>> for i in l:
... if i.has_attr('title'):
... print(i['title'])
...
Remove
Remove
Reserve Bank of Australia
Business Support and Regulation
Creative Commons Attribution 3.0 Australia
>>>

BeautifulSoup Parse Text after <b> and before </br>

I have this code trying to parse search results from a grant website (please find the URL in the code, I can't post the link yet until my rep is higher), the "Year"and "Amount Award" after tags and before tags.
Two questions:
1) Why is this only returning the 1st table?
2) Any way I can get the text that is after the (i.e. Year and Amount Award strings) and (i.e. the actual number such as 2015 and $100000)
Specifically:
<td valign="top">
<b>Year: </b>2014<br>
<b>Award Amount: </b>$84,907 </td>
Here is my script:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://www.ned.org/wp-content/themes/ned/search/grant-search.php?' \
'organizationName=&region=ASIA&projectCountry=China&amount=&fromDate=&toDate=&' \
'projectFocus%5B%5D=&search=&maxCount=25&orderBy=Year&start=1&sbmt=1'
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, "html.parser")
tables = soup.find_all('table')
data = {
'col_names': [],
'info' : [],
'year_amount':[]
}
index = 0
for table in tables:
rows = table.find_all('tr')[1:]
for row in rows:
cols = row.find_all('td')
data['col_names'].append(cols[0].get_text())
data['info'].append(cols[1].get_text())
try:
data['year_amount'].append(cols[2].get_text())
except IndexError:
data['year_amount'].append(None)
grant_df = pd.DataFrame(data)
index += 1
filename = 'grant ' + str(index) + '.csv'
grant_df.to_csv(filename)
I would suggest approaching the table parsing in a different manner. All of the information is available in the first row of each table. So you can parse the text of the row like:
Code:
text = '\n'.join([x.strip() for x in rows[0].get_text().split('\n')
if x.strip()]).replace(':\n', ': ')
data_dict = {k.strip(): v.strip() for k, v in
[x.split(':', 1) for x in text.split('\n')]}
How?:
This takes the text and
splits it on newlines
removes any blank lines
removes any leading/trailing space
joins the lines back together into a single text
joins any line ending in : with the next line
Then:
split the text again by newline
split each line by :
strip any whitespace of ends of text on either side of :
use the split text as key and value to a dict
Test Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://www.ned.org/wp-content/themes/ned/search/grant-search.php?' \
'organizationName=&region=ASIA&projectCountry=China&amount=&' \
'fromDate=&toDate=&projectFocus%5B%5D=&search=&maxCount=25&' \
'orderBy=Year&start=1&sbmt=1'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
data = []
for table in soup.find_all('table'):
rows = table.find_all('tr')
text = '\n'.join([x.strip() for x in rows[0].get_text().split('\n')
if x.strip()]).replace(':\n', ': ')
data_dict = {k.strip(): v.strip() for k, v in
[x.split(':', 1) for x in text.split('\n')]}
if data_dict.get('Award Amount'):
data.append(data_dict)
grant_df = pd.DataFrame(data)
print(grant_df.head())
Results:
Award Amount Description \
0 $84,907 To strengthen the capacity of China's rights d...
1 $204,973 To provide an effective forum for free express...
2 $48,000 To promote religious freedom in China. The org...
3 $89,000 To educate and train civil society activists o...
4 $65,000 To encourage greater public discussion, transp...
Organization Name Project Country Project Focus \
0 NaN Mainland China Rule of Law
1 Princeton China Initiative Mainland China Freedom of Information
2 NaN Mainland China Rule of Law
3 NaN Mainland China Democratic Ideas and Values
4 NaN Mainland China Rule of Law
Project Region Project Title Year
0 Asia Empowering the Chinese Legal Community 2014
1 Asia Supporting Free Expression and Open Debate for... 2014
2 Asia Religious Freedom, Rights Defense and Rule of ... 2014
3 Asia Education on Civil Society and Democratization 2014
4 Asia Promoting Democratic Policy Change in China 2014

Airline Price Scraping with Python

I've been trying to create python code to scrape airline prices from JFK to LAX.
The URL of the prices that I want to scrape are here: https://www.google.com/flights/#search;f=JFK;t=LAX;d=2014-05-28;r=2014-06-01;tt=o
I would ideally be able to get a list of time of the airline, time of departure and price.
I know that
'div class="GHOFUQ5BGJC>" $210 '
corresponds to the price and
'div class="GHOFUQ5BMFC">Sun Country'
corresponds to the airline.
So far, this is what I have
import re
import urllib
html = "https://www.google.com/flights/#search;f=JFK;t=LAX;d=2014-05-28;r=2014-06-01;tt=o"
htmlfile = urllib.urlopen(html)
htmltext = htmlfile.read()
re1 = '<div class="GHOFUQ5BGJC">(.+?)</div>'
pattern1 = re.compile(re1)
price = re.findall(pattern1, htmltext)
re2 ='<div class="GHOFUQ5BMFC">(.+?)</div>'
pattern2 = re.compile(re2)
airline = re.findall(pattern2, htmltext)
print price
print airline
Is there a way to access the price and airline tags through beautiful soup? Or am I on the right track with the regex?
When run, the code just gives me two empty lists.
What am I doing wrong?
Thanks

Categories