My code outputs multiple empty line breaks.
How do I remove all the empty space?
from bs4 import BeautifulSoup
import urllib.request
import re

url = input('enter url moish')
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
all = soup.find_all('a', {'class': re.compile('itemIncludes')})
for i in all:
    print(i.text)
code output:

Canon EOS 77D DSLR Camera (Body Only)

LP-E17 Lithium-Ion Battery Pack

LC-E17 Charger for LP-E17 Battery Pack
desired output:
Canon EOS 77D DSLR Camera (Body Only)
LP-E17 Lithium-Ion Battery Pack
LC-E17 Charger for LP-E17 Battery Pack
Thanks!
You could remove empty lines before printing:
items = [item.text for item in all if item.text.strip() != '']
Alternatively, collapse the whitespace inside each string:
for i in all:
    items = ' '.join(i.text.split())
    print(items)
The code above collapses every run of whitespace into a single space.
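Note that collapsing whitespace alone still prints a blank line for any entry whose text was only whitespace, since print('') emits an empty line. A minimal sketch combining both approaches, using the all list from the question:

for i in all:
    text = ' '.join(i.text.split())  # collapse all runs of whitespace
    if text:                         # skip entries that were only whitespace
        print(text)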
You can use a regex to filter the output inside your loop, something like:
import re

text = i.text.strip()
if not re.search(r"^\s*$", text):  # skip blank lines (\s* so the empty string is also caught)
    print(text)
Note: this is just a fix for the output, since the problem may reside in the find_all arguments, which I cannot test.
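Putting the filter back into the loop from the question might look like this (a sketch; since text is already stripped, a plain truthiness check does the same job as the regex):

for i in all:
    text = i.text.strip()
    if text:  # non-empty after stripping, i.e. not a blank line
        print(text)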
I'm sure you've solved this by now, but I'm brand new to Python and had the same issue. I also didn't want to just drop the lines when printing; I wanted to change them in the element itself. This was my solution:
import re

soup = BeautifulSoup(getPage())  # getPage() stands in for however you fetch the HTML
elements = soup.findAll()
for element in elements:
    text = element.text.strip()
    element.string = re.sub(r"[\n][\W]+[^\w]", "\n", text)
print(soup)
Loop through the elements, get the text, replace any instance of "\n followed by whitespace but nothing else" (one way to find empty lines, but feel free to use a better one!), and set the replaced value back into the element.
I have been scraping some HTML pages with Beautiful Soup, trying to extract some updated financial data. I only care about numbers that contain a comma, i.e. 100,000 or 12,000,000 but not 450, for example. The goal is to find the location of the comma-separated numbers within a string; I then need to extract the entire sentence they are in.
I moved the entire scrape to a string list, and within that list I want to extract all numbers that have a comma.
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/354950/000035495020000024/hd-2020proxystatement.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content)
text = soup.find_all(text=True)

strings = []
for i in range(len(text)):
    text_s = str(text[i])
    strings.append(text_s)
I thought about the following re code, but I am not sure if it will extract all instances, i.e. within the list there may be multiple instances of numbers separated by commas.
number = re.sub('[^>0-9,]', "", text)
Any thoughts would be a huge help! Thank you
You can use:
from bs4 import BeautifulSoup
import requests, re

url = 'https://www.sec.gov/Archives/edgar/data/354950/000035495020000024/hd-2020proxystatement.htm'
soup = BeautifulSoup(requests.get(url).text, "html5lib")

for el in soup.find_all(True):  # loop over every element in the page
    if re.search(r"(?=\d+,\d+).*", el.text):
        print(el.text)
        # print("END OF ELEMENT\n")  # debug only
If you simply want to check if a number has a comma or not, and you want to extract it if it does, then you could try the following.
new = []
for i in text:
    if ',' in i:
        new.append(i)
This will append all the elements in the 'text' collection that contain a comma, even if the exact same element is repeated multiple times.
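If the goal is the comma-separated numbers themselves (and eventually the sentences around them), here is a minimal sketch with re.findall; the pattern below is an assumption about what counts as a comma-separated number, and strings is the list built in the question:

import re

# digit groups joined by commas, e.g. 100,000 or 12,000,000 but not 450
pattern = re.compile(r'\d{1,3}(?:,\d{3})+')

for s in strings:
    matches = pattern.findall(s)
    if matches:
        print(matches, '->', s.strip())  # the numbers plus the text they came from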
I would need to get all the information from this page:
https://it.wikipedia.org/wiki/Categoria:Periodici_italiani_in_lingua_italiana
from the symbol " to the letter Z.
Then:
"
"900", Cahiers d'Italie et d'Europe
A
Abitare
Aerei
Aeronautica & Difesa
Airone (periodico)
Alp (periodico)
Alto Adige (quotidiano)
Altreconomia
....
In order to do this, I have tried using the following code:
import requests
from bs4 import BeautifulSoup as bs

res = requests.get("https://it.wikipedia.org/wiki/Categoria:Periodici_italiani_in_lingua_italiana")
soup = bs(res.text, "html.parser")

url_list = []
links = soup.find_all('a')
for link in links:
    url = link.get("href", "")
    url_list.append(url)

lists_A = []
for url in url_list:
    lists_A.append(url)
print(lists_A)
However, this code collects more information than I need.
In particular, the last item I should collect is La Zanzara. Ideally the items should not have any word in brackets, i.e. they should not contain (rivista), (periodico), (settimanale), and so on, but just the title (e.g. Jack (periodico) should be just Jack).
Could you give me any advice on how to get this information? Thanks
This will help you filter out some of the unwanted URLs (not all, though): basically everything before "Corriere della Sera", which I'm assuming should be the first expected URL.
links = [a.get('href') for a in soup.find_all('a', {'title': True, 'href': re.compile('/wiki/(.*)'), 'accesskey': False})]
You can safely assume that all the magazine URLs are ordered at this point, and since you know that "La Zanzara" should be the last expected URL, you can get the position of that particular string in your new list and slice up to that index + 1:
links.index('/wiki/La_zanzara_(periodico)')
Out[20]: 144
links = links[:145]
As for removing ('periodico') and other data cleaning, you need to inspect your data and figure out what you want to remove.
Write a simple function like this maybe:
def clean(string):
    to_remove = ['_(periodico)', '_(quotidiano)']
    for s in to_remove:
        string = string.replace(s, '')
    return string
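For example (hypothetical input, using the sliced links list from above):

print(clean('/wiki/Jack_(periodico)'))   # /wiki/Jack
cleaned = [clean(link) for link in links]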
I'm trying to scrape the position off of this webpage using BeautifulSoup. Here is my relevant code.
info_panel = soup.find("div", {"id": "meta"})
info_panel_rows = info_panel.find_all("p")

if info_panel_rows[2].find("strong") is not None:
    position = info_panel_rows[2].find("strong").next_sibling
    position = str(position).strip()
else:  # executing this path in my current problem
    position = info_panel_rows[3].find("strong").next_sibling
    position = str(position).strip()

print(position)
When I scrape it, though, it prints like this:
Small Forward
▪
How would I go about stripping this down to just "Small Forward"? I've looked all over Stack Overflow and couldn't find a clear answer.
Thanks for any help you can provide!
Are you having issues with the newline and tab in position? If so, do
position = str(position).strip('\n\t ')
If that dot is also an issue, copy it from the printed output and paste it into strip(). When you don't put anything in strip(), it only removes whitespace from both sides; you need to specify what you want removed. The example above strips newlines, tabs, and spaces.
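For example, if the bullet character shows up in the printed value, include it in the strip set too (a sketch):

position = str(position).strip('\n\t ▪')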
If this does not solve your problem, you can try a regex:
import re
string_patterns = re.compile(r'\b[0-9a-zA-Z]*\b')
position = info_panel_rows[3].find("strong").next_sibling
results = string_patterns.findall(str(position))
results = ' '.join([item for item in results if len(item)])
print(results)
Hope this helps
If you encode it to ASCII, ignoring errors, and then call strip(), you get the desired output.
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.basketball-reference.com/players/y/youngtr01.html').text
soup = BeautifulSoup(html, 'html.parser')
info_panel = soup.find("div", {"id": "meta"})
info_panel_rows = info_panel.find_all("p")

if info_panel_rows[2].find("strong") is not None:
    position = info_panel_rows[2].find("strong").next_sibling
else:
    position = info_panel_rows[3].find("strong").next_sibling

print(position.encode('ascii', 'ignore').strip())
Outputs:
Point Guard
Encoding to ascii gets rid of the bullet point.
Or if you just want to print the second line:
print(position.splitlines()[1].strip())
Also outputs:
Point Guard
import requests
import re
from bs4 import BeautifulSoup

# The website I'd like to get; converts the contents of the web page to lxml format
base_url = "https://festivalfans.nl/event/dominator-festival"
url = requests.get(base_url)
soup = BeautifulSoup(url.content, "lxml")

# Modifies the given string to look visually good. Like this:
# ['21 / JulZaterdag2018'] becomes 21 Jul 2018
def remove_char(string):
    # All blacklisted characters and words
    blacklist = ["/", "[", "]", "'", "Maandag", "Dinsdag", "Woensdag",
                 "Donderdag", "Vrijdag", "Zaterdag", "Zondag"]
    # Replace every blacklisted character with whitespace
    for char in blacklist:
        string = string.replace(char, ' ')
    # Collapse runs of two or more consecutive whitespace characters into one
    string = re.sub(r"\s\s+", " ", string)

# Gets the date of the festival I'm interested in
def get_date_info():
    # Makes a list for the data
    raw_info = []
    # Adds every "div" with a certain name to the list, and converts it to text
    for link in soup.find_all("div", {"class": "event-single-data"}):
        raw_info.append(link.text)
    # Converts the list into a string, because remove_char() only accepts strings
    raw_info = str(raw_info)
    # Modifies the string as explained above
    final_date = remove_char(raw_info)
    # Prints the date in this format: 21 Jul 2018 (example)
    print(final_date)

get_date_info()
Hi there! I'm currently working on a little web scraping project. I thought I had a good idea and wanted to get more experienced with Python. What it basically does is get festival information like date, time, and price, and put it in a little text file. I'm using BeautifulSoup to navigate and edit the web page. Links are down below!
But now I'm running into a problem and I can't figure out what's wrong. Maybe I'm totally overlooking it. When I run this program it should give me 21 Jul 2018, but instead it returns None. For some reason every character in the string gets removed.
I tried running remove_char() on its own, with the same list (converted to a string first) as input. This worked perfectly: it returned "21 Jul 2018" like it was supposed to. So I'm quite sure the error is not in this function.
Somehow I'm missing something. Maybe it has to do with BeautifulSoup and how it handles things?
Hope someone can help me out!
BeautifulSoup:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Web page:
https://festivalfans.nl/event/dominator-festival
You forgot to return the value in the remove_char() function.
That's it!
Neither of your functions has a return statement, so both return None by default. remove_char() should end with return string, for example.
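A sketch of the fix, keeping everything else in remove_char() exactly as in the question:

def remove_char(string):
    blacklist = ["/", "[", "]", "'", "Maandag", "Dinsdag", "Woensdag",
                 "Donderdag", "Vrijdag", "Zaterdag", "Zondag"]
    for char in blacklist:
        string = string.replace(char, ' ')
    string = re.sub(r"\s\s+", " ", string)
    return string  # without this line the function returns None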
import requests
from bs4 import BeautifulSoup

base_url = "https://festivalfans.nl/event/dominator-festival"
url = requests.get(base_url)
soup = BeautifulSoup(url.content, "html.parser")

def get_date_info():
    for link in soup.find_all("div", {"class": "event-single-data"}):
        day = link.find('div', {"class": "event-single-day"}).text.replace(" ", '')
        month = link.find('div', {"class": "event-single-month"}).text.replace('/', "").replace(' ', '')
        year = link.find('div', {"class": "event-single-year"}).text.replace(" ", '')
        print(day, month, year)

get_date_info()
This version is simpler and has no need for re.
I have the following program, in which I am trying to pass a list of elements to consecutive Google searches:
search_terms = ['Telejob (ETH)', 'Luisa da Silva', 'The CERN Recruitment Services']

for el in search_terms:
    webpage = 'http://google.com/search?q=' + el
    print('xxxxxxxxxxxxxxxxxxx')
    print(webpage)
Unfortunately my program is not taking ALL the words in each list item, only the first one. The printed output looks fine:
http://google.com/search?q=Telejob (ETH)
xxxxxxxxxxxxxxxxxxx
http://google.com/search?q=Luisa da Silva
xxxxxxxxxxxxxxxxxxx
http://google.com/search?q=The CERN Recruitment Services
xxxxxxxxxxxxxxxxxxx
http://google.com/search?q=The Swiss National Science Foundation
Although the whole item with every word appears in the printed output above, when I verify the link, only the first word of each item is concatenated, like this:
http://google.com/search?q=Telejob
xxxxxxxxxxxxxxxxxxx
http://google.com/search?q=Luisa
xxxxxxxxxxxxxxxxxxx
http://google.com/search?q=The
xxxxxxxxxxxxxxxxxxx
http://google.com/search?q=The
What am I doing wrong, and how do I concatenate ALL the words in each list item into the Google search?
Thank you
This line:
webpage = 'http://google.com/search?q=' + el
should be split and joined with a %20 joiner:
webpage = 'http://google.com/search?q=' + '%20'.join(el.split())
You can use urllib.parse.urlencode in Python 3. For Python 2 you can use urllib.urlencode.
import urllib.parse

search_terms = ['Telejob (ETH)', 'Luisa da Silva', 'The CERN Recruitment Services']

for el in search_terms:
    query = urllib.parse.urlencode({'q': el})  # Python 2: urllib.urlencode({'q': el})
    webpage = 'http://google.com/search?{}'.format(query)
    print('xxxxxxxxxxxxxxxxxxx')
    print(webpage)
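For the first term this should print http://google.com/search?q=Telejob+%28ETH%29: urlencode escapes the parentheses and, by default, encodes spaces as +.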
Neither of these answers addresses the base issue: you need to encode the entire string as a URL.
I chose urllib.quote():
>>> import urllib
>>> for term in search_terms:
...     print urllib.quote(term)
...
Telejob%20%28ETH%29
Luisa%20da%20Silva
The%20CERN%20Recruitment%20Services
Notice the () are also encoded, as will be any other strange characters that might bork your query.
In your case, it would be:
webpage = 'http://google.com/search?q=' + urllib.quote(el)
The equivalent in Python 3:
from urllib import parse

for term in search_terms:
    print(parse.quote(term))
so
webpage = 'http://google.com/search?q=' + parse.quote(el)
The thing is that URLs need to be percent-encoded; some characters have a special meaning in URLs, for example:
#: jumps to a certain position in the page
/: I think you know what this one does...
You should use quote() to fix that, and just remember that:
urllib.quote() is for Python 2
urllib.parse.quote() is for Python 3
Here are some examples for Python 3:
from urllib.parse import quote
quote('/bars/will/stay/intact')
#'/bars/will/stay/intact'
quote('/bars/wont/stay/intact', safe='')
#'%2Fbars%2Fwont%2Fstay%2Fintact' #Actually, everything will be encoded here
quote('()ñ´ ç')
#'%28%29%C3%B1%C2%B4%20%C3%A7'
So your code is now:
search_terms = ['Telejob (ETH)', 'Luisa da Silva', 'The CERN Recruitment Services']

for el in search_terms:
    webpage = 'http://google.com/search?q=' + quote(el)
    print('xxxxxxxxxxxxxxxxxxx')
    print(webpage)
As search_terms could include other characters that won't be escaped by quote('something'), you'll have to use its safe argument:
search_terms = ['Telejob (ETH)', 'Luisa da Silva', 'The CERN Recruitment Services']

for el in search_terms:
    webpage = 'http://google.com/search?q=' + quote(el, safe='')
    print('xxxxxxxxxxxxxxxxxxx')
    print(webpage)
This last one, outputs:
xxxxxxxxxxxxxxxxxxx
http://google.com/search?q=Telejob%20%28ETH%29
xxxxxxxxxxxxxxxxxxx
http://google.com/search?q=Luisa%20da%20Silva
xxxxxxxxxxxxxxxxxxx
http://google.com/search?q=The%20CERN%20Recruitment%20Services
I suggest you read https://docs.python.org/3/library/urllib.parse.html#url-quoting for further information. (See? A # character!)
I believe your problem is with URL encoding.
To allow spaces in URLs, they are replaced by '%20'.
Try changing your links to be like:
https://www.google.com/search?q=The%20CERN%20Recruitment%20Services
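A minimal sketch of that manual replacement (quote(), as in the other answers, is the more robust option since it also escapes characters like parentheses):

for el in search_terms:
    webpage = 'http://google.com/search?q=' + el.replace(' ', '%20')
    print(webpage)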
Google queries have the format https://www.google.com/search?q=keyword_1+...+keyword_N so you should format your query like so:
search_terms = ["Telejob (ETH)", "Luisa da Silva","The CERN Recruitment Services"]
for search_term in search_terms:
query = "+".join(search_term.split())
url = "http://google.com/search?q=" + query