I'm scraping a website using Python and I'm having trouble extracting the dates and creating a new Date column with a regex.
The code below is using BeautifulSoup to scrape event data and the event links:
import pandas as pd
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('https://www.techmeme.com/events').read()
soup = bs.BeautifulSoup(source,'html.parser')
event = []
links = []
# ---Event Data---
for a in soup.find_all('a'):
    event.append(a.text)
df_event = pd.DataFrame(event)
df_event.columns = ['Event']
df_event = df_event.iloc[1:]
# ---Links---
for a in soup.find_all('a', href=True):
    if a.text:
        links.append(a['href'])
df_link = pd.DataFrame(links)
df_link.columns = ['Links']
# ---Combines dfs---
df = pd.concat([df_event.reset_index(drop=True),df_link.reset_index(drop=True)],sort=False, axis=1)
At the beginning of each event row, the date is present. Example: (May 26-29Augmented World ExpoSan...). The dates follow the formats below, and I have included my regex (which I believe is correct).
Different Date Formats:
May 27: [A-Z][a-z]*(\ )[0-9]{1,2}
May 26-29: [A-Z][a-z]*(\ )[0-9]{1,2}-[0-9]{1,2}
May 28-Jun 2: [A-Z][a-z]*(\ )[0-9]{1,2}-[A-Z][a-z]*(\ )[0-9]{1,2}
Combined
[A-Z][a-z]*(\ )[0-9]{1,2}|[A-Z][a-z]*(\ )[0-9]{1,2}-[0-9]{1,2}|[A-Z][a-z]*(\ )[0-9]{1,2}-[A-Z][a-z]*(\ )[0-9]{1,2}
When I try to create a new column and extract the dates using Regex, I just receive an empty df['Date'] column.
df['Date'] = df['Event'].str.extract(r'[A-Z][a-z]*(\ )[0-9]{1,2}')
df.head()
Any help would be greatly appreciated! Thank you.
You may use
date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'
df['Date'] = df['Event'].str.extract(date_reg, expand=False)
If you want to match as whole words and numbers, you may use (?<![A-Za-z])([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)(?!\d).
Details
[A-Z][a-z]* - an uppercase letter and then 0 or more lowercase letters
" " - a space (replace with \s to match any whitespace)
[0-9]{1,2} - one or two digits
(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})? - an optional sequence of:
    - - a hyphen
    (?:[A-Z][a-z]* )? - an optional sequence of:
        [A-Z][a-z]* - an uppercase letter and then 0 or more lowercase letters
        " " - a space (replace with \s to match any whitespace)
    [0-9]{1,2} - one or two digits
The (?<![A-Za-z]) construct is a lookbehind that fails the match if there is a letter immediately before the current location and (?!\d) fails the match if there is a digit immediately after.
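As a quick sanity check, the pattern can be tried against a few strings modeled on the event rows (the exact event texts below are made up for illustration):

```python
import pandas as pd

date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'

# sample "Event" cells where the date is glued to the event name
events = pd.Series([
    'May 26-29Augmented World ExpoSanta Clara',
    'May 27Earnings: HPQ, BOX',
    'May 28-Jun 2Some Conference',
])

# extract returns the first capture-group match per row
print(events.str.extract(date_reg, expand=False).tolist())
# -> ['May 26-29', 'May 27', 'May 28-Jun 2']
```

Note that the whole date is wrapped in one capturing group, so str.extract returns the full date rather than just the space captured by (\ ) in the original pattern.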
This script:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.techmeme.com/events'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

data = []
for row in soup.select('.rhov a'):
    date, event, place = map(lambda x: x.get_text(strip=True), row.find_all('div', recursive=False))
    data.append({'Date': date, 'Event': event, 'Place': place, 'Link': 'https://www.techmeme.com' + row['href']})

df = pd.DataFrame(data)
print(df)
will create this dataframe:
Date Event Place Link
0 May 26-29 NOW VIRTUAL:Augmented World Expo Santa Clara https://www.techmeme.com/gotos/www.awexr.com/
1 May 27 Earnings: HPQ,BOX https://www.techmeme.com/gotos/finance.yahoo.c...
2 May 28 Earnings: CRM, VMW https://www.techmeme.com/gotos/finance.yahoo.c...
3 May 28-29 CANCELED:WeAreDevelopers World Congress Berlin https://www.techmeme.com/gotos/www.wearedevelo...
4 Jun 2 Earnings: ZM https://www.techmeme.com/gotos/finance.yahoo.c...
.. ... ... ... ...
140 Dec 7-10 NEW DATE:GOTO Amsterdam Amsterdam https://www.techmeme.com/gotos/gotoams.nl/
141 Dec 8-10 Microsoft Azure + AI Conference Las Vegas https://www.techmeme.com/gotos/azureaiconf.com...
142 Dec 9-10 NEW DATE:Paris Blockchain Week Summit Paris https://www.techmeme.com/gotos/www.pbwsummit.com/
143 Dec 13-16 NEW DATE:KNOW Identity Las Vegas https://www.techmeme.com/gotos/www.knowidentit...
144 Dec 15-16 NEW DATE, NEW LOCATION:Fortune Brainstorm Tech San Francisco https://www.techmeme.com/gotos/fortuneconferen...
[145 rows x 4 columns]
In the page there is a paragraph like this:
The latest financial statements filed by x in the business register it corresponds to the year 2020 and shows a turnover range of 'Between 6,000,000 and 30,000,000 Euros'.
In the page it is: L’ultimo bilancio depositato da Euro P.a. - S.r.l. nel registro delle imprese corrisponde all’anno 2020 e riporta un range di fatturato di 'Tra 6.000.000 e 30.000.000 Euro'.
I need to scrape the value inside the ' ' in this case (Between 6,000,000 and 30,000,000 Euros).
And put it inside a column called "range".
I tried this code with no success:
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = 'https://www.informazione-aziende.it/Azienda_EURO-PA-SRL'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
turnover = soup.find("span", {"id": "turnover"}).text
year = soup.find("span", {"id": "year"}).text
data = {'turnover': turnover, 'year': year}
df = pd.DataFrame(data, index=[0])
print(df)
But I get: AttributeError: 'NoneType' object has no attribute 'text'
First, scrape the whole text with BeautifulSoup, and assign it to a variable such as:
text = "The latest financial statements filed by x in the business register it corresponds to the year 2020 and shows a turnover range of 'Between 6,000,000 and 30,000,000 Euros'."
Then, execute the following code:
import re
pattern = "'.+'"
result = re.search(pattern, text)
result = result[0].replace("'", "")
The output will be:
Between 6,000,000 and 30,000,000 Euros
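Putting both steps together and storing the value in a "range" column as requested (the text variable below stands in for whatever get_text() returns from the scraped page):

```python
import re
import pandas as pd

# stand-in for the paragraph text scraped with BeautifulSoup
text = ("The latest financial statements filed by x in the business register "
        "it corresponds to the year 2020 and shows a turnover range of "
        "'Between 6,000,000 and 30,000,000 Euros'.")

# capture what sits between the single quotes
match = re.search(r"'(.+)'", text)
turnover_range = match.group(1) if match else None

df = pd.DataFrame({'range': [turnover_range]})
print(df)
```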
An alternative can be:
Split the text by the single quote character - ' - and get the text at position 1 of the list.
Code:
text = "The latest financial statements filed by x in the business register it corresponds to the year 2020 and shows a turnover range of 'Between 6,000,000 and 30,000,000 Euros'."
# Get the text at position 1 of the list:
desired_text = text.split("'")[1]
# Print the result:
print(desired_text)
Result:
Between 6,000,000 and 30,000,000 Euros
Python 3.9.5/Pandas 1.1.3
I have a very large csv file with values that look like:
Ac\\Nme Products Inc.
and all the values are different company names with double backslashes in random places throughout.
I'm attempting to get rid of all the double backslashes, but it's not working in Pandas, even though a simple test against the standalone value using str.replace does work.
Example:
org = "Ac\\Nme Products Inc."
result = org.replace("\\","")
print(result)
returns AcNme Products Inc. as the output, as I would expect.
However, using Pandas with the names in a csv file:
import pandas as pd
csv_input = pd.read_csv('/Users/me/file.csv')
csv_input.replace("\\", "")
csv_input.to_csv('/Users/me/file_revised.csv', index=False)
When I open the new file_revised.csv file, the value still shows as Ac\\Nme Products Inc.
EDIT 1:
Here is a snippet of file.csv as requested:
id,company_name,address,country
1000566,A1 Comm\\Nodity Traders,LEVEL 28 THREE PACIFIC PLACE 1 QUEEN'S RD EAST HK,TH
1000579,"A2 A Mf\\g. Co., Ltd.",53 YONG-AN 2ND ST. TAINAN TAIWAN,CA
1000585,"A2 Z Logisitcs Indi\\Na Pvt., Ltd.",114A/1 1ST FLOOR SOUTH RAJA ST TUTICORIN - 628 001 TAMILNADU - INDIA,PE
Pandas doesn't have a DataFrame-level string replace, but you can update the string columns one at a time:
for col in csv_input.columns:
    if col == 'that_int_column':
        continue
    csv_input[col] = csv_input[col].str.replace(r"\\N", "", regex=True)
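A minimal, self-contained version of the per-column loop (the sample values are made up; note that the result of the replacement has to be assigned back, and regex=False treats the backslash literally):

```python
import pandas as pd

csv_input = pd.DataFrame({
    'id': [1000566, 1000579],                    # integer column to skip
    'company_name': ['A1 Comm\\Nodity Traders',  # one literal backslash each
                     'Ac\\Nme Products Inc.'],
})

for col in csv_input.columns:
    if col == 'id':
        continue
    # assign back -- str.replace returns a new Series, it does not mutate
    csv_input[col] = csv_input[col].str.replace('\\', '', regex=False)

print(csv_input['company_name'].tolist())
# -> ['A1 CommNodity Traders', 'AcNme Products Inc.']
```

This also shows why the original attempt appeared to do nothing: csv_input.replace("\\", "") without regex=True only matches whole cell values, and its return value was never assigned back.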
I would like to separate the author's name, the domain and the date out of a dataframe column.
While
.split(" in ")
works well to separate the author's name on the left, I also want to separate the domain and the date, which are not separated by a space.
from pandas import DataFrame
Cars = {'Details': ['Daniel Jacobs in HackeMoon.comJul 31, 2017','Wil Zelk in websiteabc.deJan 28','Wil Zelk in anotherwebsite.chJan 28, 2019'],
}
df = DataFrame(Cars,columns= ['Details'])
print(df)
df = df.Details.str.split(" in ", expand=True)
print(df)
You can try Series.str.extract for this in combination with a regex:
df['Details'].str.extract(r'(?P<author>.*?) in (?P<url>.*)(?P<date>[A-Z].*)', expand=True)
This yields:
author url date
0 Daniel Jacobs HackeMoon.com Jul 31, 2017
1 Wil Zelk websiteabc.de Jan 28
2 Wil Zelk anotherwebsite.ch Jan 28, 2019
To separate the strings I make use of the following assumptions:
the name and the url are separated by " in "
the first character (and only the first character) of the date is an upper case letter (so the last upper case character in the string marks the first character of the date part)
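A runnable check of those assumptions against the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'Details': [
    'Daniel Jacobs in HackeMoon.comJul 31, 2017',
    'Wil Zelk in websiteabc.deJan 28',
    'Wil Zelk in anotherwebsite.chJan 28, 2019',
]})

# the greedy url group backtracks to the last upper-case letter,
# which marks the start of the date part
parts = df['Details'].str.extract(
    r'(?P<author>.*?) in (?P<url>.*)(?P<date>[A-Z].*)', expand=True)
print(parts)
```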
I am extracting data from https://data.gov.au/dataset?organization=reservebankofaustralia&_groups_limit=0&groups=business and got the output I wanted, but now the problem is that the output is truncated: I get Business Support an... and Reserve Bank of Aus... instead of the complete text, and I want to print the whole text, without the "...", for all rows. I replaced lines 9 and 10 in the answer by jezrael (please refer to the question Fetching content from html and write fetched content in a specific format in CSV for the code) with:
org = soup.find_all('a', {'class':'nav-item active'})[0].get('title')
groups = soup.find_all('a', {'class':'nav-item active'})[1].get('title')
When I run this separately I get the error: list index out of range. What should I use to extract the complete sentences? I also tried:
org = soup.find_all('span', class_="filtered pill")
which returned a string when run separately, but could not run with the whole code.
All the longer texts are stored in the title attribute; the shorter ones are in the element text. So add two if checks:
dfs = []
for i in webpage_urls:
    page = urllib.request.urlopen(i)
    soup = BeautifulSoup(page, "lxml")

    lobbying = {}
    # always only 2 active li, so select first by [0] and second by [1]
    l = soup.find_all('li', class_="nav-item active")
    org = l[0].a.get('title')
    if org == '':
        org = l[0].span.get_text()
    groups = l[1].a.get('title')
    if groups == '':
        groups = l[1].span.get_text()

    data2 = soup.find_all('h3', class_="dataset-heading")
    for element in data2:
        lobbying[element.a.get_text()] = {}

    prefix = "https://data.gov.au"
    for element in data2:
        lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]
        lobbying[element.a.get_text()]["Organisation"] = org
        lobbying[element.a.get_text()]["Group"] = groups
    #print(lobbying)

    df = pd.DataFrame.from_dict(lobbying, orient='index') \
           .rename_axis('Titles').reset_index()
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
df1 = df.drop_duplicates(subset='Titles').reset_index(drop=True)
df1['Organisation'] = df1['Organisation'].str.replace(r'\(\d+\)', '', regex=True)
df1['Group'] = df1['Group'].str.replace(r'\(\d+\)', '', regex=True)
print(df1.head())
Titles \
0 Banks – Assets
1 Consolidated Exposures – Immediate and Ultimat...
2 Foreign Exchange Transactions and Holdings of ...
3 Finance Companies and General Financiers – Sel...
4 Liabilities and Assets – Monthly
link \
0 https://data.gov.au/dataset/banks-assets
1 https://data.gov.au/dataset/consolidated-expos...
2 https://data.gov.au/dataset/foreign-exchange-t...
3 https://data.gov.au/dataset/finance-companies-...
4 https://data.gov.au/dataset/liabilities-and-as...
Organisation Group
0 Reserve Bank of Australia Business Support and Regulation
1 Reserve Bank of Australia Business Support and Regulation
2 Reserve Bank of Australia Business Support and Regulation
3 Reserve Bank of Australia Business Support and Regulation
4 Reserve Bank of Australia Business Support and Regulation
I guess you are trying to do this. Each of these links has a title attribute, so I simply checked whether the title attribute is present and, if it is, printed it.
There are blank lines in the output because a few links have title="", so you can skip those with a conditional statement and then collect all the remaining titles.
>>> l = soup.find_all('a')
>>> for i in l:
... if i.has_attr('title'):
... print(i['title'])
...
Remove
Remove
Reserve Bank of Australia
Business Support and Regulation
Creative Commons Attribution 3.0 Australia
>>>
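With the conditional added, the blank titles are filtered out. A sketch on a hand-built soup, since the live page may change (the markup below only mirrors the relevant links):

```python
from bs4 import BeautifulSoup

# hypothetical markup mirroring the relevant links on the page
html = '''
<a href="#" title="">Remove</a>
<a href="#" title="Reserve Bank of Australia">Reserve Bank of Aus...</a>
<a href="#">no title attribute at all</a>
<a href="#" title="Business Support and Regulation">Business Support an...</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# keep only links that have a non-empty title attribute
titles = [a['title'] for a in soup.find_all('a')
          if a.has_attr('title') and a['title']]
print(titles)
# -> ['Reserve Bank of Australia', 'Business Support and Regulation']
```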
I have this code trying to parse search results from a grant website (please find the URL in the code, I can't post the link yet until my rep is higher). I want the "Year" and "Award Amount" values that sit after the <b> tags and before the <br> tags.
Two questions:
1) Why is this only returning the 1st table?
2) Is there any way I can get the text that comes after each </b> tag (i.e. after the Year and Award Amount labels): the actual values, such as 2015 and $100,000?
Specifically:
<td valign="top">
<b>Year: </b>2014<br>
<b>Award Amount: </b>$84,907 </td>
Here is my script:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://www.ned.org/wp-content/themes/ned/search/grant-search.php?' \
      'organizationName=&region=ASIA&projectCountry=China&amount=&fromDate=&toDate=&' \
      'projectFocus%5B%5D=&search=&maxCount=25&orderBy=Year&start=1&sbmt=1'
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, "html.parser")
tables = soup.find_all('table')
data = {
    'col_names': [],
    'info': [],
    'year_amount': []
}

index = 0
for table in tables:
    rows = table.find_all('tr')[1:]
    for row in rows:
        cols = row.find_all('td')
        data['col_names'].append(cols[0].get_text())
        data['info'].append(cols[1].get_text())
        try:
            data['year_amount'].append(cols[2].get_text())
        except IndexError:
            data['year_amount'].append(None)
    grant_df = pd.DataFrame(data)
    index += 1
    filename = 'grant ' + str(index) + '.csv'
    grant_df.to_csv(filename)
I would suggest approaching the table parsing in a different manner. All of the information is available in the first row of each table, so you can parse the text of that row like this:
Code:
text = '\n'.join([x.strip() for x in rows[0].get_text().split('\n')
                  if x.strip()]).replace(':\n', ': ')
data_dict = {k.strip(): v.strip() for k, v in
             [x.split(':', 1) for x in text.split('\n')]}
How?:
This takes the text and
splits it on newlines
removes any blank lines
removes any leading/trailing space
joins the lines back together into a single text
joins any line ending in : with the next line
Then:
split the text again by newline
split each line by :
strip any whitespace of ends of text on either side of :
use the split text as key and value to a dict
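The steps above can be traced on a hand-made sample of a row's text (the labels mirror the real table; the values come from the question's example):

```python
# roughly what rows[0].get_text() looks like: labels and values on separate lines
raw = """
Project Title:
Empowering the Chinese Legal Community
Year:
2014
Award Amount:
$84,907
"""

# first pass: split, drop blanks, strip, rejoin, glue "label:" lines to values
text = '\n'.join([x.strip() for x in raw.split('\n')
                  if x.strip()]).replace(':\n', ': ')

# second pass: one "key: value" pair per line -> dict
data_dict = {k.strip(): v.strip() for k, v in
             [x.split(':', 1) for x in text.split('\n')]}

print(data_dict)
# -> {'Project Title': 'Empowering the Chinese Legal Community',
#     'Year': '2014', 'Award Amount': '$84,907'}
```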
Test Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.ned.org/wp-content/themes/ned/search/grant-search.php?' \
      'organizationName=&region=ASIA&projectCountry=China&amount=&' \
      'fromDate=&toDate=&projectFocus%5B%5D=&search=&maxCount=25&' \
      'orderBy=Year&start=1&sbmt=1'

r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

data = []
for table in soup.find_all('table'):
    rows = table.find_all('tr')
    text = '\n'.join([x.strip() for x in rows[0].get_text().split('\n')
                      if x.strip()]).replace(':\n', ': ')
    data_dict = {k.strip(): v.strip() for k, v in
                 [x.split(':', 1) for x in text.split('\n')]}
    if data_dict.get('Award Amount'):
        data.append(data_dict)

grant_df = pd.DataFrame(data)
print(grant_df.head())
Results:
Award Amount Description \
0 $84,907 To strengthen the capacity of China's rights d...
1 $204,973 To provide an effective forum for free express...
2 $48,000 To promote religious freedom in China. The org...
3 $89,000 To educate and train civil society activists o...
4 $65,000 To encourage greater public discussion, transp...
Organization Name Project Country Project Focus \
0 NaN Mainland China Rule of Law
1 Princeton China Initiative Mainland China Freedom of Information
2 NaN Mainland China Rule of Law
3 NaN Mainland China Democratic Ideas and Values
4 NaN Mainland China Rule of Law
Project Region Project Title Year
0 Asia Empowering the Chinese Legal Community 2014
1 Asia Supporting Free Expression and Open Debate for... 2014
2 Asia Religious Freedom, Rights Defense and Rule of ... 2014
3 Asia Education on Civil Society and Democratization 2014
4 Asia Promoting Democratic Policy Change in China 2014