Duplicate email and telephone number printed when scraping - python

I am facing an issue: the email and phone number are printed twice, and the scraped values still contain the tel: and mailto: (send-to) prefixes. I tried to trim them while scraping but failed.
Please help me out with this.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

suit = []
url = "https://www.allenovery.com/en-gb/global/people/Shruti-Ajitsaria"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
#time.sleep(1)
content = soup.find_all('div', class_='hyperlinks')
#print(content)
for property in content:
    link = property.find('a', {'class': 'tel'})['href']
    email = property.find_next('a', {'class': 'mail'})['href']
    print(link, email)

Use this in your code to see how many times your loop runs:
print(len(content))
If you do, you will find that the loop runs once per element of content, whose length is 2. Since print(link, email) is inside the loop, it executes twice, so the result is printed two times. To fix that, remove the indentation of print(link, email) to move it outside the loop.
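The effect is easy to reproduce on a static snippet (the markup below is a simplified, invented stand-in for the real page): with two matching divs, anything printed inside the loop appears twice.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the real page: two divs match class "hyperlinks",
# each repeating the same tel/mail links.
html = """
<div class="hyperlinks"><a class="tel" href="tel:+44 20 3088 1831">call</a>
<a class="mail" href="mailto:someone#example.com">mail</a></div>
<div class="hyperlinks"><a class="tel" href="tel:+44 20 3088 1831">call</a>
<a class="mail" href="mailto:someone#example.com">mail</a></div>
"""
soup = BeautifulSoup(html, 'html.parser')
content = soup.find_all('div', class_='hyperlinks')
print(len(content))  # 2 -> the loop body runs twice, printing duplicates
```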

The email and telephone number are duplicated because they exist more than once on the page.
import requests
from bs4 import BeautifulSoup

url = "https://www.allenovery.com/en-gb/global/people/Shruti-Ajitsaria"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

links = soup.select('div.hyperlinks > ul > li > a')
tel = links[0].get('href')
email = links[1].get('href')
print(tel, email)
Output:
tel:+44 20 3088 1831 mailto:shruti.ajitsaria#allenovery.com
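Since the question also asks how to trim the tel: and mailto: prefixes from the scraped hrefs, here is a minimal sketch using str.removeprefix (the values are copied from the output above):

```python
# Hrefs as scraped above; removeprefix() needs Python 3.9+ (on older
# versions, slice manually, e.g. tel[len("tel:"):]).
tel = "tel:+44 20 3088 1831"
email = "mailto:shruti.ajitsaria#allenovery.com"

tel_clean = tel.removeprefix("tel:")
email_clean = email.removeprefix("mailto:")
print(tel_clean, email_clean)
```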

Related

Beautiful Soup get value from page which updates daily

I am trying to get a single value from a website which updates daily:
the latest price in Vijayawada on this page.
Below is my code, but I am getting empty output where I expect 440.
import requests
from bs4 import BeautifulSoup
import csv
res = requests.get('https://www.e2necc.com/home/eggprice')
soup = BeautifulSoup(res.content, 'html.parser')
price = soup.select("Vijayawada")
Looking to get the value 440 (today's value). Any suggestions?
One approach could be to select the tr by its text and iterate over its stripped_strings, picking the last one that isdigit():
[x for x in soup.select_one('tr:-soup-contains("Vijayawada")').stripped_strings if x.isdigit()][-1]
Example
import requests
from bs4 import BeautifulSoup
res = requests.get('https://www.e2necc.com/home/eggprice')
soup = BeautifulSoup(res.content, 'html.parser')
price = [x for x in soup.select_one('tr:-soup-contains("Vijayawada")').stripped_strings if x.isdigit()][-1]
print(price)
Another one is to pick the element by day:
import requests
from bs4 import BeautifulSoup
import datetime
d = datetime.datetime.now().strftime("%d")
res = requests.get('https://www.e2necc.com/home/eggprice')
soup = BeautifulSoup(res.content, 'html.parser')
print(soup.select('tr:-soup-contains("Vijayawada") td')[int(d)].text)
Output
440
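The same selection logic can be tried offline on a small static table (a simplified, invented version of the real markup), which makes it easier to see what stripped_strings and isdigit() are doing:

```python
from bs4 import BeautifulSoup

# Simplified, invented stand-in for the real price table
html = """
<table>
  <tr><td>Hyderabad</td><td>430</td><td>435</td></tr>
  <tr><td>Vijayawada</td><td>435</td><td>440</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
row = soup.select_one('tr:-soup-contains("Vijayawada")')  # needs soupsieve 2.1+
price = [x for x in row.stripped_strings if x.isdigit()][-1]
print(price)  # the last all-digit string in the row
```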

What is the proper syntax for .find() in bs4?

I am trying to scrape the bitcoin price off of Coinbase and cannot find the proper syntax. When I run the program (without the line with the question marks) I get the block of HTML that I need, but I don't know how to narrow it down and retrieve the price itself. Any help appreciated, thanks.
import requests
from bs4 import BeautifulSoup
url = 'https://www.coinbase.com/charts'
data = requests.get(url)
nicedata = data.text
soup = BeautifulSoup(nicedata, 'html.parser')
prettysoup = soup.prettify()
bitcoin = soup.find('h4', {'class': 'Header__StyledHeader-sc-1q6y56a-0 hZxUBM TextElement__Spacer-sc-18l8wi5-0 hpeTzd'})
price = bitcoin.find('???')
print(price)
The attached image contains the html
To get the text from an item:
price = bitcoin.text
But this page has many <h4> items with this class, and find() gets only the first one, whose text is Bitcoin, not the price from your image. You may need find_all() to get a list of all matching items; then you can use an index [index] or slicing [start:end] to pick items, or a for-loop to work with every item in the list.
import requests
from bs4 import BeautifulSoup

url = 'https://www.coinbase.com/charts'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
all_h4 = soup.find_all('h4', {'class': 'Header__StyledHeader-sc-1q6y56a-0 hZxUBM TextElement__Spacer-sc-18l8wi5-0 hpeTzd'})
for h4 in all_h4:
    print(h4.text)
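The find() vs find_all() difference is easy to see on a static snippet (the class name and values below are invented, not Coinbase's real markup):

```python
from bs4 import BeautifulSoup

# Invented sample: several h4 tags share one class, as on the real page
html = ('<h4 class="t">Bitcoin</h4>'
        '<h4 class="t">$30,000</h4>'
        '<h4 class="t">Ethereum</h4>')
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('h4', {'class': 't'})       # find() -> only the first match
all_h4 = soup.find_all('h4', {'class': 't'})  # find_all() -> a list of all matches

print(first.text)      # Bitcoin
print(all_h4[1].text)  # $30,000 - pick any item by index
```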
It can be easier to work with the data if you keep it in a list of lists, an array, or a DataFrame. To create a list of lists, it is easier to find the rows <tr> and search for <h4> inside every row:
import requests
from bs4 import BeautifulSoup

url = 'https://www.coinbase.com/charts'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
all_tr = soup.find_all('tr')
data = []
for tr in all_tr:
    row = []
    for h4 in tr.find_all('h4'):
        row.append(h4.text)
    if row:  # skip empty rows
        data.append(row)
for row in data:
    print(row)
This way you don't need the class to get all the <h4> elements.
BTW: this page uses JavaScript to append new rows as you scroll, but requests and BeautifulSoup can't run JavaScript - so if you need all the rows, you may need Selenium to control a web browser, which can run JavaScript.

python crawling beautifulsoup how to crawl several pages?

Please help.
I want to get all the company names from each page; there are 12 pages.
http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/1
http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/2
-- this website only changes the number.
Here is my code so far.
How can I get just the titles (company names) from all 12 pages?
Thank you in advance.
from bs4 import BeautifulSoup
import requests

maximum = 0
page = 1
URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/1'
response = requests.get(URL)
source = response.text
soup = BeautifulSoup(source, 'html.parser')
whole_source = ""
for page_number in range(1, maximum+1):
    URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/' + str(page_number)
    response = requests.get(URL)
    whole_source = whole_source + response.text
soup = BeautifulSoup(whole_source, 'html.parser')
find_company = soup.select("#content > div.wrap_analysis_data > div.public_con_box.public_list_wrap > ul > li:nth-child(13) > div > strong")
for company in find_company:
    print(company.text)
So, you want to remove all the headers and get only the string of the company name?
Basically, you can use soup.findAll to find the list of companies in a format like this:
<strong class="company"><span>중소기업진흥공단</span></strong>
Then you use the .find function to extract the information from the <span> tag:
<span>중소기업진흥공단</span>
After that, you use .contents to get the string from the <span> tag:
'중소기업진흥공단'
Then you write a loop to do the same for each page, and make a list called company_list to store the results from each page, appending them together.
Here's the code:
from bs4 import BeautifulSoup
import requests

maximum = 12
company_list = []  # list for storing results
for page_number in range(1, maximum+1):
    URL = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/{}'.format(page_number)
    response = requests.get(URL)
    print(page_number)
    whole_source = response.text
    soup = BeautifulSoup(whole_source, 'html.parser')
    for entry in soup.findAll('strong', attrs={'class': 'company'}):  # find all company names on the page
        company_list.append(entry.find('span').contents[0])  # extract the name from the result
The company_list will give you all the company names you want
I figured it out eventually. Thank you for your answer though!
Here is my final code.
from urllib.request import urlopen
from bs4 import BeautifulSoup

company_list = []
for n in range(12):
    url = 'http://www.saramin.co.kr/zf_user/jobs/company-labs/list/page/{}'.format(n+1)
    webpage = urlopen(url)
    source = BeautifulSoup(webpage, 'html.parser', from_encoding='utf-8')
    companys = source.findAll('strong', {'class': 'company'})
    for company in companys:
        company_list.append(company.get_text().strip().replace('\n', '').replace('\t', '').replace('\r', ''))

file = open('company_name1.txt', 'w', encoding='utf-8')
for company in company_list:
    file.write(company + '\n')
file.close()

Stock price data refresh

I am very new and I am totally stuck on a recent task. I want to auto-refresh a stock price as it changes. I am scraping the nasdaq.com website for the actual intraday price.
Here is my current code:
import bs4 as bs
import urllib.request

tiker = input("zadaj ticker: ")
url = urllib.request.urlopen("http://www.nasdaq.com/symbol/"+tiker+"/real-time")
stranka = url.read()
soup = bs.BeautifulSoup(stranka, 'lxml')
print(tiker.upper())
for each in soup.find('div', attrs={'id': 'qwidget_lastsale'}):
    print(each.string)
I was only able to make an infinite loop with while True, but the prices print on separate lines, whereas I want a single line to update as the actual price changes.
Thank you for your notes.
You can achieve it by printing "\b" characters to erase the previously printed string, then printing again on the same line:
import bs4 as bs
import urllib.request
import time
import sys

tiker = input("zadaj ticker: ")
print(tiker.upper())
written_string = ''
while True:
    url = urllib.request.urlopen("http://www.nasdaq.com/symbol/"+tiker+"/real-time")
    stranka = url.read()
    soup = bs.BeautifulSoup(stranka, 'lxml')
    for each in soup.find('div', attrs={'id': 'qwidget_lastsale'}):
        for i in range(len(written_string)):
            sys.stderr.write("\b")
        sys.stderr.write(each.string)
        written_string = each.string
    time.sleep(1)
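An alternative to emitting one "\b" per character is a single carriage return "\r", which rewinds the cursor to the start of the line; a minimal sketch with invented sample prices (not live quotes):

```python
import sys
import time

prices = ["101.25", "101.3", "101.10"]  # invented sample quotes
for price in prices:
    # "\r" moves the cursor back to the start of the line; ljust pads with
    # spaces so a shorter string fully overwrites a longer one
    sys.stdout.write("\r" + price.ljust(10))
    sys.stdout.flush()
    time.sleep(0.1)
print()
```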

Parsing error with Beautiful Soup 4 and Python

I need to get the list of the rooms from this website: http://www.studentroom.ch/en/dynasite.cfm?dsmid=106547
I'm using Beautiful Soup 4 in order to parse the page.
This is the code I wrote until now:
from bs4 import BeautifulSoup
import urllib

pageFile = urllib.urlopen("http://studentroom.ch/dynasite.cfm?dsmid=106547")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
roomsNoFilter = soup.find('div', {"id": "ImmoListe"})
rooms = roomsNoFilter.table.find_all('tr', recursive=False)
for room in rooms:
    print room
    print "----------------"
print len(rooms)
For now I'm trying to get only the rows of the table, but I get only 7 rows instead of 78 (or 77).
At first I thought I was receiving only partial HTML, but I printed the whole HTML and I'm receiving it correctly.
There are no AJAX calls that load new rows after the page has loaded...
Could someone please help me find the error?
This works for me:
soup = BeautifulSoup(pageHtml)
div = soup.select('#ImmoListe')[0]
table = div.select('table > tbody')[0]
k = 0
for room in table.find_all('tr'):
    if 'onmouseout' in str(room):
        print room
        k = k + 1
print "Total ", k
Let me know the status.
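A likely reason the question's find_all('tr', recursive=False) returned so few rows: when rows sit inside a <tbody> (or a nested table), they are not direct children of <table>. A small sketch with invented markup:

```python
from bs4 import BeautifulSoup

# Invented markup: rows wrapped in <tbody>, matching the answer's
# 'table > tbody' selector
html = """
<table>
  <tbody>
    <tr><td>row 1</td></tr>
    <tr><td>row 2</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

direct = table.find_all('tr', recursive=False)  # only direct children of <table>
all_tr = table.find_all('tr')                   # full recursive search

print(len(direct), len(all_tr))  # 0 2
```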
