I have been scraping some HTML pages with Beautiful Soup, trying to extract some updated financial data. I only care about numbers that contain a comma, i.e. 100,000 or 12,000,000, but not 450 for example. The goal is to find the location of the comma-separated numbers within a string and then extract the entire sentence they are in.
I moved the entire scrape into a list of strings, and from that list I want to extract all numbers that have a comma.
import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.sec.gov/Archives/edgar/data/354950/000035495020000024/hd-2020proxystatement.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content)
text = soup.find_all(text = True)
strings = []
for i in range(len(text)):
    text_s = str(text[i])
    strings.append(text_s)
I thought about the following re code, but I am not sure whether it will extract all instances, i.e. within the list there may be multiple instances of comma-separated numbers.
number = re.sub('[^>0-9,]', "", text)
Any thoughts would be a huge help! Thank you
You can use:
from bs4 import BeautifulSoup
import requests, re
url = 'https://www.sec.gov/Archives/edgar/data/354950/000035495020000024/hd-2020proxystatement.htm'
soup = BeautifulSoup(requests.get(url).text, "html5lib")
for el in soup.find_all(True):  # loop over all elements in the page
    if re.search(r"(?=\d+,\d+).*", el.text):
        print(el.text)
        # print("END OF ELEMENT\n")  # debug only
If you simply want to check if a number has a comma or not, and you want to extract it if it does, then you could try the following.
new = []
for i in text:
    if ',' in i:
        new.append(i)
This will append all the elements in the 'text' collection that contain a comma, even if the exact same element is repeated multiple times.
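If you then need the numbers themselves, or the sentence around them, a minimal sketch using re.findall could look like this (the sample string and the naive split on '.' are purely illustrative assumptions):
import re

# hypothetical sample string, standing in for one entry of the scraped list
sample = "Revenue grew to 12,000,000 in 2020. Headcount was 450. Costs were 100,000."

# matches every digit run that contains at least one comma group, e.g. 100,000 or 12,000,000
comma_number = re.compile(r'\d{1,3}(?:,\d{3})+')

print(comma_number.findall(sample))
# ['12,000,000', '100,000']

# naive sentence split on '.', keeping only the sentences that contain such a number
sentences = [s.strip() + '.' for s in sample.split('.') if comma_number.search(s)]
print(sentences)
# ['Revenue grew to 12,000,000 in 2020.', 'Costs were 100,000.']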
I am having some problems manipulating strings here. I am scraping some data from a website and I am facing two challenges:
I am scraping unnecessary data as the website I target has redundant class naming. My goal is to isolate this data and delete it so I can keep only the data I am interested in.
With the data kept, I need to split the string in order to store some information into specific variables.
Initially I was planning to use a simple split() call, store each new string in a list, and then play with it to keep the parts that I want. Unfortunately, every time I do this, I end up with 3 separate lists that I cannot manipulate/split.
Here is the code:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome('\\Users\\rapha\\Desktop\\10Milz\\4. Python\\Python final\\Scrape\\chromedriver.exe')
driver.get("https://www.atptour.com/en/scores/2020/7851/MS011/match-stats")
content = driver.page_source
soup = BeautifulSoup(content, "html.parser" )
for infos in soup.find_all('h3', class_='section-title'):
    title = infos.get_text()
    title = ' '.join(title.split())
    title_list = []
    title_list = title.split(" | ")
    print(title_list)
Here is the "raw data" retrieve
Player Results
Tournament Results
Salvatore Caruso VS. Brandon Nakashima | Indian Wells 2020
And here is what I would like to achieve:
Variable_1 = Salvatore Caruso
Variable_2 = Brandon Nakashima
Variable_3 = Indian Wells
Variable_4 = 2020
Could you please let me know how to proceed here?
How about this?
It's not so pretty, but it will work as long as there is always a "VS." and a "|" separating the names, and the date is always 4 digits for the year.
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome('/home/lewis/Desktop/chromedriver')
driver.get("https://www.atptour.com/en/scores/2020/7851/MS011/match-stats")
content = driver.page_source
soup = BeautifulSoup(content, "html.parser" )
text = soup.find_all('h3', class_='section-title')[2].get_text().replace("\n","")
while text.find(" ")> -1:
text = text.replace(" "," ")
text = text.strip()
#split by two parameters
split = [st.split("|") for st in text.split("VS.")]
#flatten the nested lists
flat_list = [item for sublist in split for item in sublist]
#extract the date from the end of the last item
flat_list.append(flat_list[-1][-4:])
# remove date from the 3rd item
flat_list[2] = flat_list[2][:-4]
#strip any leading or trailing white space
final_list = [x.strip() for x in flat_list]
print(final_list)
output
['Salvatore Caruso', 'Brandon Nakashima', 'Indian Wells', '2020']
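As an alternative to the split-and-flatten approach, a single regex with named groups can pull the same four pieces out of the cleaned title string. This is only a sketch and assumes the "Name VS. Name | Tournament Year" shape always holds:
import re

title = "Salvatore Caruso VS. Brandon Nakashima | Indian Wells 2020"

match = re.match(r"(?P<p1>.+?)\s+VS\.\s+(?P<p2>.+?)\s*\|\s*(?P<tournament>.+?)\s+(?P<year>\d{4})$", title)
if match:
    variable_1 = match.group("p1")          # 'Salvatore Caruso'
    variable_2 = match.group("p2")          # 'Brandon Nakashima'
    variable_3 = match.group("tournament")  # 'Indian Wells'
    variable_4 = match.group("year")        # '2020'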
I would need to get all the information from this page:
https://it.wikipedia.org/wiki/Categoria:Periodici_italiani_in_lingua_italiana
from the symbol " to the letter Z.
Then:
"
"900", Cahiers d'Italie et d'Europe
A
Abitare
Aerei
Aeronautica & Difesa
Airone (periodico)
Alp (periodico)
Alto Adige (quotidiano)
Altreconomia
....
In order to do this, I have tried using the following code:
import requests
from bs4 import BeautifulSoup as bs

res = requests.get("https://it.wikipedia.org/wiki/Categoria:Periodici_italiani_in_lingua_italiana")
soup = bs(res.text, "html.parser")
url_list = []
links = soup.find_all('a')
for link in links:
    url = link.get("href", "")
    url_list.append(url)

lists_A = []
for url in url_list:
    lists_A.append(url)
print(lists_A)
However, this code collects more information than I need.
In particular, the last item I should collect is La Zanzara. Ideally, none of the items should have any word in brackets, i.e. they should not contain (rivista), (periodico), (settimanale), and so on, but just the title (e.g. Jack (periodico) should be just Jack).
Could you give me any advice on how to get this information? Thanks
This will help you filter out some of the unwanted URLs (not all, though): basically everything before "Corriere della Sera", which I'm assuming should be the first expected URL.
links = [a.get('href') for a in soup.find_all('a', {'title': True, 'href': re.compile('/wiki/(.*)'), 'accesskey': False})]
You can safely assume that all the magazine URLs are ordered at this point, and since you know that "La Zanzara" should be the last expected URL, you can get the position of that particular string in your new list and slice up to that index + 1:
links.index('/wiki/La_zanzara_(periodico)')
Out[20]: 144
links = links[:145]
As for removing '(periodico)' and other data cleaning, you need to inspect your data and figure out what it is that you want to remove.
You could write a simple function like this, maybe:
def clean(string):
    to_remove = ['_(periodico)', '_(quotidiano)']
    for s in to_remove:
        if s in string:
            return string.replace(s, '')
    return string
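For the parenthetical suffixes specifically, a small regex saves you from listing every variant by hand. A minimal sketch, with example titles that are purely illustrative:
import re

def clean(title):
    # drop a trailing parenthetical such as '_(periodico)' or '_(quotidiano)'
    return re.sub(r'_\([^)]*\)$', '', title)

print(clean('/wiki/Jack_(periodico)'))        # /wiki/Jack
print(clean('/wiki/La_zanzara_(periodico)'))  # /wiki/La_zanzara
print(clean('/wiki/Abitare'))                 # unchanged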
I have the HTML source code of a page:
import requests
text = requests.get("https://en.wikipedia.org/wiki/Collatz_conjecture").text
What I would like to do is to get a count of the number of unique HTML tags on this page.
For example: <head>, <title>. Closing tags do not count (<head> and </head> would be counted only once).
Yes, I know this is much easier using HTML parsers such as Beautiful Soup, but I would like to accomplish this using only regular expressions.
I've brute-force counted this and the answer is in the ballpark of 60 unique tags.
How would I go about doing this?
I've already tried using re.findall(), to no avail.
Since the answer is around 60, I would like the output to be:
"Number of unique HTML tags: 60"
The following will yield 63 URLs from the URL in question:
import requests
import re
url = "https://en.wikipedia.org/wiki/Collatz_conjecture"
text = requests.get(url).text
url_pattern = r"((http(s)?://)([\w-]+\.)+[\w-]+[.com]+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)"
# Get all matching patterns of url_pattern
# this will return a list of tuples
# where we are only interested in the first item of the tuple
urls = re.findall(url_pattern, text)
# using list comprehension to get the first item of the tuple,
# and the set function to filter out duplicates
unique_urls = set([x[0] for x in urls])
print(f'Number of unique HTML tags: {len(unique_urls)} found on {url}')
out:
Number of unique HTML tags: 63 found on https://en.wikipedia.org/wiki/Collatz_conjecture
Please do not parse HTML with regex; use modules like bs4. But if you insist, you can do it as follows:
import requests
import re
url = 'https://en.wikipedia.org/wiki/Collatz_conjecture'
text = requests.get(url).text
tags = re.findall('<[^>]*>', text)
total = []
for i in range(len(tags)):
    total.append(re.match(r'<[^\s>]+', tags[i]).group())
total = [elem + '>' for elem in total]
r = re.compile('</[^<]')
unwanted = list(filter(r.match, total))
un = ['<!-->', '<!--[if>', '<!DOCTYPE>', '<![endif]-->']
unwanted.extend(un)
final = [x for x in list(set(total)) if x not in set(unwanted)]
print('Number of unique HTML tags:', len(final))
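A shorter variation in the same regex-only spirit captures just the tag name and lets a set handle the deduplication; comments and the doctype are skipped because they do not start with a letter, and closing tags never match since '/' follows the '<'. The exact count may differ slightly from the filtering above:
import requests, re

text = requests.get('https://en.wikipedia.org/wiki/Collatz_conjecture').text

# capture the tag name immediately after '<'
tag_names = set(re.findall(r'<([a-zA-Z][a-zA-Z0-9]*)', text))
print('Number of unique HTML tags:', len(tag_names))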
My code outputs multiple empty line breaks.
How do I remove all the empty space?
from bs4 import BeautifulSoup
import urllib.request
import re
url = input('enter url moish')
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page,'lxml')
all = soup.find_all('a', {'class' : re.compile('itemIncludes')})
for i in all:
    print(i.text)
code output:
Canon EOS 77D DSLR Camera (Body Only)

LP-E17 Lithium-Ion Battery Pack

LC-E17 Charger for LP-E17 Battery Pack
desired output:
Canon EOS 77D DSLR Camera (Body Only)
LP-E17 Lithium-Ion Battery Pack
LC-E17 Charger for LP-E17 Battery Pack
Thanks!
You could remove empty lines before printing:
items = [item.text for item in all if item.text.strip() != '']
for i in all:
    items = ' '.join(i.text.split())
    print(items)
The code above removes all the white space.
You can use a regex to filter the output, something like:
import re
text = i.text.strip()
if not re.search(r"^\s+$", text):  # if not a blank line
    print(text)
Note:
This is just a fix for the output, since the problem may reside in the find_all arguments, which I cannot test.
I'm sure you've solved this by now, but I'm brand new to Python and had the same issue. I didn't want to just remove the lines when printing; I wanted to change them in the element. This was my solution:
soup = BeautifulSoup(getPage())
elements = soup.findAll()
for element in elements:
    text = element.text.strip()
    element.string = re.sub(r"[\n][\W]+[^\w]", "\n", text)
print(soup)
This loops through the elements, gets the text, replaces any instance of "\n followed by whitespace, but nothing else" (one way to find empty lines, but feel free to use a better one!), and sets the replaced value back into the element.
I want to extract some information off websites with URLs of the form:
http://www.pedigreequery.com/american+pharoah
where "american+pharoah" is the extension for one of many horse names.
I have a list of the horse names I'm searching for; I just need to figure out how to plug the names in after "http://www.pedigreequery.com/".
This is what I currently have:
import csv
allhorses = csv.reader(open('HORSES.csv') )
rows=list(allhorses)
import requests
from bs4 import BeautifulSoup
for i in rows: # Number of pages plus one
    url = "http://www.pedigreequery.com/".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    letters = soup.find_all("a", class_="horseName")
    print(letters)
When I print out the url, it doesn't have the horse's name at the end, just the URL in quotes. The letters/print statements at the end are just to check whether it's actually going to the website.
This is how I've seen it done for looping over URLs that change by numbers at the end; I haven't found advice on URLs that change by characters.
Thanks!
You are missing the placeholder in your format call, so change the format to:
url = "http://www.pedigreequery.com/{}".format(i)
                                    ^
                                    # add placeholder
Also, you are getting a list of lists at best from rows = list(allhorses), so you would be passing a list, not a string/horse name. Just open the file normally if you have one horse per line and iterate over the file object, stripping the newline.
Presuming one horse name per line, the whole working code would be:
import requests
from bs4 import BeautifulSoup
with open("HORSES.csv") as f:
for horse in map(str.strip,f): # Number of pages plus one
url = "http://www.pedigreequery.com/{}".format(horse)
r = requests.get(url)
soup = BeautifulSoup(r.content)
letters = soup.find_all("a", class_="horseName")
print(letters)
If you have multiple horses per line you can use the csv lib but you will need an inner loop:
with open("HORSES.csv") as f:
for row in csv.reader(f):
# Number of pages plus one
for horse in row:
url = "http://www.pedigreequery.com/{}".format(horse)
r = requests.get(url)
soup = BeautifulSoup(r.content)
letters = soup.find_all("a", class_="horseName")
print(letters)
Lastly, if you don't have the names stored correctly, you have a few options, the simplest of which is to split and create the query string manually:
url = "http://www.pedigreequery.com/{}".format("+".join(horse.split()))