I'm trying to export some data to Excel. I'm a total beginner, so I apologise for any dumb questions.
I'm practising scraping on the demo site webscraper.io, and so far I have scraped the data that I want: the laptop names and the links for the products.
import requests
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")
css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)
for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    print(text)
    print(full_url)
I'm having major difficulties wrapping my head around how to export text + full_url to Excel.
I have seen code written like this:
import pandas as pd
df = pd.DataFrame(laptops)
df.to_excel("laptops_testing.xlsx", encoding="utf-8")
But when I do so, I get an .xlsx file that contains a lot of markup that I don't want. I just want the data that I have been printing (text and full_url).
The data I'm seeing in Excel looks like this:
<div class="thumbnail">
<img alt="item" class="img-responsive" src="/images/test-sites/e-commerce/items/cart2.png"/>
<div class="caption">
<h4 class="pull-right price">$295.99</h4>
<h4>
<a class="title" href="/test-sites/e-commerce/allinone/product/545" title="Asus VivoBook X441NA-GA190">Asus VivoBook X4...</a>
</h4>
<p class="description">Asus VivoBook X441NA-GA190 Chocolate Black, 14", Celeron N3450, 4GB, 128GB SSD, Endless OS, ENG kbd</p>
</div>
<div class="ratings">
<p class="pull-right">14 reviews</p>
<p data-rating="3">
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
</p>
</div>
</div>
This is not that hard to solve. Append the URLs and text to lists, turn them into a pandas DataFrame, and then write a new Excel file:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")
css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)

laptop_name = []
laptop_url = []
for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    print(text)
    # appending names of laptops
    laptop_name.append(text)
    print(full_url)
    # appending urls
    laptop_url.append(full_url)

# changing it into a dataframe
new_df = pd.DataFrame({'Laptop Name': laptop_name, 'Laptop url': laptop_url})
print(new_df)

# defining the excel file
file_name = 'laptop.xlsx'
new_df.to_excel(file_name)
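One detail worth adding: by default pandas also writes the DataFrame's row index as an extra first column, and passing index=False drops it. A minimal sketch with stand-in data (the names and URLs below are hypothetical, not scraped):

```python
import pandas as pd

# Hypothetical stand-ins for the scraped titles and links
names = ["Laptop A", "Laptop B"]
urls = ["https://example.com/a", "https://example.com/b"]
df = pd.DataFrame({"Laptop Name": names, "Laptop url": urls})

# index=False drops the 0, 1, ... row-number column; the flag works the
# same way for to_excel (which additionally needs openpyxl installed)
csv_text = df.to_csv(index=False)
print(csv_text.splitlines()[0])  # Laptop Name,Laptop url
```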
Use the soup.select function to search by full CSS selectors.
Here's a short solution:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
laptops = [(a.getText(), requests.compat.urljoin(url, a.get('href')))
           for a in soup.select("div.col-sm-4.col-lg-4.col-md-4 a")]
df = pd.DataFrame(laptops)
df.to_excel("laptops_testing.xlsx")
The final document contains one row per laptop, with its name and full URL.
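The requests.compat.urljoin call above (the same function as urllib.parse.urljoin) resolves the relative href against the page URL, so the domain never has to be hard-coded. For example:

```python
from urllib.parse import urljoin

base = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
href = "/test-sites/e-commerce/allinone/product/545"
full = urljoin(base, href)
print(full)  # https://webscraper.io/test-sites/e-commerce/allinone/product/545
```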
Try this. Remember to import pandas.
And try not to run the code too many times; you send a new request to the website each time.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")
css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)

data = []
for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    data.append([text, full_url])

df = pd.DataFrame(data, columns=["laptop name", "Url"])
df.to_csv("laptops.csv")
I wrote a parser, but the code is crooked.
I'm trying to get several identical elements from a site, located in separate blocks.
Here is the parser code
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

URL_TEMPLATE = "http://127.0.0.1:5500/rr.html"
FILE_NAME = "img.csv"

def parse(url=URL_TEMPLATE):
    result_list = {'id': []}
    result_l = {'id': []}
    r = requests.get(url)
    soup = bs(r.text, "html.parser")
    vacancies_names = soup.find_all('ul', class_='PhotoListSmall')
    vacancies_li = soup.find_all('li')
    for name in vacancies_names:
        for i in vacancies_li:
            result_list['id'].append(i.a['href'])
        result_l['id'].append(result_list['id'])
        result_list['id'] = []
    return result_l

df = pd.DataFrame(data=parse())
df.to_csv(FILE_NAME)
Here is the page the parser is processing
<ul class="PhotoListSmall">
<li class="one">
</li>
<li class="two">
</li>
</ul>
<ul class="PhotoListSmall">
<li class="one">
</li>
<li class="two">
</li>
</ul>
And this is what I get
,id
0,"['one_1', 'one_2', 'two_1', 'two_2']"
1,"['one_1', 'one_2', 'two_1', 'two_2']"
And here is what I want to get
,id
0,"['one_1', 'one_2']"
1,"['two_1', 'two_2']"
What have I done wrong?
Don't judge too strictly, I'm just a beginner developer.
You used the soup that contains all the li elements from both PhotoListSmall blocks; that's why it appends the same array twice.
To get the li elements from only one block at a time, create another soup inside the loop instead of using the original soup:
for name in vacancies_names:
    li_soup = bs(str(name), "html.parser")
The new soup has name as its data, so it can only access the li elements of that one PhotoListSmall block. Now create vacancies_li by finding all li elements in li_soup.
The final code should look something like this:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

URL_TEMPLATE = "http://192.168.100.196:5500/test.html"
FILE_NAME = "img.csv"

def parse(url=URL_TEMPLATE):
    result_list = {'id': []}
    result_l = {'id': []}
    r = requests.get(url)
    soup = bs(r.text, "html.parser")
    vacancies_names = soup.find_all('ul', class_='PhotoListSmall')
    # vacancies_li = soup.find_all('li')
    for name in vacancies_names:
        li_soup = bs(str(name), "html.parser")
        vacancies_li = li_soup.find_all('li')
        for i in vacancies_li:
            result_list['id'].append(i.a['href'])
        result_l['id'].append(result_list['id'])
        result_list['id'] = []
    return result_l

df = pd.DataFrame(data=parse())
df.to_csv(FILE_NAME)
I hope it helped you :)
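As a side note, re-parsing with str(name) is not strictly necessary: a BeautifulSoup Tag supports find_all directly, so searching each ul tag already scopes the search to that block. A minimal sketch on markup shaped like the question's page (the hrefs are stand-ins):

```python
from bs4 import BeautifulSoup

html = """
<ul class="PhotoListSmall"><li><a href="one_1"></a></li><li><a href="one_2"></a></li></ul>
<ul class="PhotoListSmall"><li><a href="two_1"></a></li><li><a href="two_2"></a></li></ul>
"""
soup = BeautifulSoup(html, "html.parser")
# find_all on each <ul> Tag only sees that block's <li> children
groups = [[li.a["href"] for li in ul.find_all("li")]
          for ul in soup.find_all("ul", class_="PhotoListSmall")]
print(groups)  # [['one_1', 'one_2'], ['two_1', 'two_2']]
```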
My goal is to retrieve the ids of tweets in a twitter search as they are being posted. My code so far looks like this:
import requests
from bs4 import BeautifulSoup

keys = some_key_words + " -filter:retweets AND -filter:replies"
query = "https://twitter.com/search?f=tweets&vertical=default&q=" + keys + "&src=typd&lang=es"
req = requests.get(query).text
soup = BeautifulSoup(req, "lxml")
for tweets in soup.findAll("li", {"class": "js-stream-item stream-item stream-item"}):
    print(tweets)
However, this doesn't return anything. Is there a problem with the code itself, or am I looking in the wrong place in the source code? I understand that the ids should be stored here:
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item" data-item-id="1210306781806833664" id="stream-item-tweet-1210306781806833664" data-item-type="tweet">
from bs4 import BeautifulSoup

data = """
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item
" data-item-id="1210306781806833664"
id="stream-item-tweet-1210306781806833664"
data-item-type="tweet"
>
...
"""
soup = BeautifulSoup(data, 'html.parser')
for item in soup.findAll("li", {'class': 'js-stream-item stream-item stream-item'}):
    print(item.get("data-item-id"))
Output:
1210306781806833664
I am working on a project to get the real-time stock price on http://www.jpmhkwarrants.com/en_hk/market-statistics/underlying/underlying-terms/code/1. I have searched online and tried several ways to get the price, but still fail. Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getStockPrice():
    url = "http://www.jpmhkwarrants.com/zh_hk/market-statistics/underlying/underlying-terms/code/1"
    r = urlopen(url)
    soup = BeautifulSoup(r.read(), 'lxml')
    price = soup.find(id="real_time_box").find("span", {"class": "price"})
    print(price)
The output is None. I know the price is generated by a script, but I have no idea how to get it. Can this be solved with BeautifulSoup or some other module?
View the page source and you will see HTML like this:
<div class="table detail">
.....
<div class="tl">即市走勢 <span class="description">前收市價</span>
.....
<td>買入價(延遲*)<span>82.15</span></td>
The span that we want is the second match (index 1); select it with:
price = soup.select('.table.detail td span')[1]
print(price.text)
Demo:
from bs4 import BeautifulSoup
from urllib.request import urlopen

def getStockPrice():
    url = "http://www.jpmhkwarrants.com/zh_hk/market-statistics/underlying/underlying-terms/code/1"
    r = urlopen(url)
    soup = BeautifulSoup(r.read(), 'html.parser')
    price = soup.select('.table.detail td span')[1]
    print(price.text)

getStockPrice()
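To make the indexing explicit: soup.select returns a plain list of every element matching the selector, in document order, so [1] picks the second match. A small illustration on hypothetical markup (not the live page):

```python
from bs4 import BeautifulSoup

html = ('<div class="table detail">'
        '<td>Label<span>ignore</span></td>'
        '<td>Bid<span>82.15</span></td>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")
spans = soup.select('.table.detail td span')  # list of all matches
print(spans[1].text)  # 82.15
```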
I am attempting to scrape HTML to create a dictionary that includes a pitcher's name and his handed-ness. The data tags are buried; so far I've only been able to collect the pitcher's name from the data set. The HTML output (for each player) is as follows:
<div class="pitcher players">
<input name="import-data" type="hidden" value="%5B%7B%22slate_id%22%3A20190%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210893103%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%2C%7B%22slate_id%22%3A20192%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210894893%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%2C%7B%22slate_id%22%3A20193%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210895115%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%5D"/>
<a class="player-popup" data-url="https://rotogrinders.com/players/johnny-cueto-11193?site=draftkings" href="https://rotogrinders.com/players/johnny-cueto-11193">Johnny Cueto</a>
<span class="meta stats">
<span class="stats">
R
</span>
<span class="salary" data-role="salary" data-salary="$11.8K">
$11.8K
</span>
<span class="fpts" data-fpts="14.96" data-product="56" data-role="authorize" title="Projected Points">14.96</span>
I've tinkered and am coming up empty; I'm sure I'm overthinking this. Here is the code I have so far:
import requests
from bs4 import BeautifulSoup

url = "https://rotogrinders.com/lineups/mlb?site=draftkings"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
players_confirmed = {}
results = soup.find_all("div", {'class': 'pitcher players'})
What's the best way to loop through the results set for the more granular data tag information I need?
I need the text from the HTML beginning with , and handed-ness from the tag
Optimally, I would have a dictionary with the following:
{Johnny Cueto : R, Player 2 : L, ...}
import requests
from bs4 import BeautifulSoup

url = "https://rotogrinders.com/lineups/mlb?site=draftkings"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
players_confirmed = {}
results = soup.find_all("div", {'class': 'pitcher players'})
dicti = {}
for j in results:
    dicti[j.a.text] = j.select(".stats")[1].text.strip("\n").strip()
Just use the select or find function on the found element, and you will be able to iterate.
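Put together on markup shaped like the question's pitcher block (trimmed and hypothetical), the name-to-handedness dictionary can be built like this:

```python
from bs4 import BeautifulSoup

html = """
<div class="pitcher players">
  <a class="player-popup" href="#">Johnny Cueto</a>
  <span class="meta stats"><span class="stats">R</span></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
dicti = {}
for j in soup.find_all("div", {'class': 'pitcher players'}):
    # j.a finds the first <a> inside this div; select(".stats") is scoped to j
    dicti[j.a.text] = j.select(".stats")[1].text.strip()
print(dicti)  # {'Johnny Cueto': 'R'}
```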
My code:
from bs4 import BeautifulSoup
import urllib.request

url = "http://yaz.tek.firat.edu.tr/tr/duyurular"
url_oku = urllib.request.urlopen(url)
soup = BeautifulSoup(url_oku, 'html.parser')
icerik = soup.find_all('div', attrs={'class': 'views-row views-row-1 views-row-odd views-row-first'})
print(icerik)
My output:
[<div class="views-row views-row-1 views-row-odd views-row-first">
<span class="views-field views-field-title"> <span class="field-content">Grup-1, Grup-2, Grup-3, Grup-4 ve Grup-6 Öğrencileri İçin Staj Sunum Tarihleri</span> </span>
<span class="views-field views-field-created"> <span class="field-content"><i class="fa fa-calendar"></i> Salı, Aralık 5, 2017 - 09:58 </span> </span> </div>]
But I want to get just " Grup-1, Grup-2, Grup-3, Grup-4 ve Grup-6 Öğrencileri İçin Staj Sunum Tarihleri ". How can I achieve that?
You can call .text on a result from BeautifulSoup. It takes the textual content of the elements found, skipping the tags of the elements.
e.g.
from bs4 import BeautifulSoup
import urllib.request

url = "http://yaz.tek.firat.edu.tr/tr/duyurular"
url_oku = urllib.request.urlopen(url)
soup = BeautifulSoup(url_oku, 'html.parser')
icerik = soup.find_all('div', attrs={'class': 'views-row views-row-1 views-row-odd views-row-first'})
for result in icerik:
    print(result.text)
You can try it like this as well to get the title and link from that page. I used a CSS selector to get them:
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

url = "http://yaz.tek.firat.edu.tr/tr/duyurular"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
for item in soup.select("#content .field-content a"):
    link = urljoin(url, item['href'])
    print("Title: {}\nLink: {}\n".format(item.text, link))
Partial output:
Title: 2017-2018 Güz Dönemi Final Sınav Programı (TASLAK)
Link: http://yaz.tek.firat.edu.tr/tr/node/481
Title: NETAŞ İşyeri Eğitimi Mülakatları Hakkında Duyuru
Link: http://yaz.tek.firat.edu.tr/tr/node/480