I have a web scraping project and I faced a problem in my code:
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = requests.get('https://bama.ir/car')

products = []
prices = []
kilometrs = []

soup = BeautifulSoup(driver.text, 'html.parser')
for a in soup.find_all('li', href=True, attrs={'class': 'car-list-item-li list-data-main'}):
    name = a.find('div', attrs={'class': 'title'})
    price = a.find('p', attrs={'class': 'cost single-price'})
    kilometr = a.find('div', attrs={'class': 'car-func-details'})
    products.append(name.text)
    prices.append(price.text)
    kilometrs.append(kilometr.text)
    print(kilometr.text)

df = pd.DataFrame({'Product Name': products, 'Price': prices, 'kilometr': kilometrs})
df.to_csv('products.csv', index=False, encoding='utf-8')
a.find() is not working and I have no idea why! Can you help me?
Indeed, your request returns a 403 Forbidden status code.
The website is Cloudflare-protected; take a look at a package like https://github.com/VeNoMouS/cloudscraper.
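A minimal sketch of how cloudscraper could slot in for requests here (assuming the package can solve the challenge for this particular site, which is not guaranteed):

import cloudscraper  # pip install cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper()  # drop-in replacement for a requests session
response = scraper.get('https://bama.ir/car')
print(response.status_code)  # 200 rather than 403 if the challenge was solved

soup = BeautifulSoup(response.text, 'html.parser')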
I want to scrape multiple pages of a website using Python, but I'm getting a "Remote Connection closed" error.
Here is my code:
import pandas as pd

url_link = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p={}&selectedItem=viewAllAwardedContracts.do'
LIST = []
for number in range(1,5379):
    url = url_link.format(number)
    dframe = pd.read_html(url, header=None)[0]
    LIST.append(dframe)

Result_df = pd.concat(LIST)
Result_df.to_csv('Taneps_contracts.csv')
Any idea how to solve it?
For me, just using requests to fetch the HTML before passing it to read_html gets the data. I just edited your code to:
import requests
import pandas as pd

url_link = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p={}&selectedItem=viewAllAwardedContracts.do'
LIST = []
for number in range(1,5379):
    url = url_link.format(number)
    r = requests.get(url)  # fetch the page -> HTML in r.text
    dframe = pd.read_html(r.text, header=None)[0]
    LIST.append(dframe)

Result_df = pd.concat(LIST)
Result_df.to_csv('Taneps_contracts.csv')
I didn't even have to add headers, but if this isn't enough for you (i.e., if the program breaks or if you don't end up with 53770+ rows), try adding convincing headers or using something like HTMLSession instead of directly calling requests.get...
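If it does break, a hedged sketch of both fallbacks (browser-like headers with plain requests, or requests-html's HTMLSession, which can also render JavaScript), assuming the site accepts either:

import requests
import pandas as pd

url = 'https://www.taneps.go.tz/epps/viewAllAwardedContracts.do?d-3998960-p=1&selectedItem=viewAllAwardedContracts.do'

# option 1: plain requests with a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
dframe = pd.read_html(r.text, header=None)[0]

# option 2: requests-html (pip install requests-html); r.html.html holds the page HTML
# from requests_html import HTMLSession
# session = HTMLSession()
# r = session.get(url)
# dframe = pd.read_html(r.html.html, header=None)[0]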
I'm trying to read in a table from a website, but when I do this, I am getting a result from the website that says: "It appears your browser may be outdated. For the best website experience, we recommend updating your browser."
I am able to use requests.get on the Stats portion of this same PGA website without issue, but for some reason the way these historical results tables are displayed is causing issues. One interesting thing is that the website lets you select different years for the displayed table, but doing that doesn't change the address at all, so I suspect they are formatting it in a way that read_html won't work with. Any other suggestions? Code below.
import pandas as pd
import requests
farmers_url = 'https://www.pgatour.com/tournaments/farmers-insurance-open/past-results.html'
farmers = pd.read_html(requests.get(farmers_url).text, header=0)[0]
farmers.head()
I see a request to the following file for the content you want. In a browser this is an additional request made from your start URL. What you are currently getting is the content of the table at the requested URL before any of the updates that would happen dynamically in a browser.
import requests
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.pgatour.com/tournaments/farmers-insurance-open/past-results/jcr:content/mainParsys/pastresults.selectedYear.2021.004.html', headers=headers).text
pd.read_html(r)
If you want to tidy it up to look like the actual webpage, apply something like the following transformations and cleaning:
import requests
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.pgatour.com/tournaments/farmers-insurance-open/past-results/jcr:content/mainParsys/pastresults.selectedYear.2021.004.html', headers=headers).text
t = pd.read_html(r)[0]
t.reset_index()
# flatten the MultiIndex header, keeping both levels only for the ROUNDS columns
t.columns = [':'.join([i[0], i[1]]) if 'ROUNDS' in i else i[0] for i in t.columns]
# keep only the last space-separated token of each POS cell
t.POS = t.POS.map(lambda x: x.split(' ')[-1])
# keep only the part before the first space in each round cell
round_columns = [i for i in t.columns if 'ROUNDS' in i]
t[round_columns] = t[round_columns].applymap(lambda x: x.split(' ')[0])
t.drop('TO PAR', inplace=True, axis=1)
t.rename(columns={"TOTALSCORE": "TOTAL SCORE", "OFFICIALMONEY": "OFFICIAL MONEY", "FEDEXCUPPOINTS": "FEDEX CUP POINTS"}, inplace=True)
I cannot download the table even though the HTML shows there is a table. Is there any way to fix it?
The code works fine when I use other website links, so I'm not sure what is wrong with this website.
import requests
import pandas as pd
from io import StringIO
import datetime
import os
url = "https://www.cmoney.tw/etf/e210.aspx?key=0050"
response = requests.get(url)
listed = pd.read_html(response.text)[0]
listed.columns = listed.iloc[0,:]
listed = listed[["標的代號","標的名稱"]]
listed = listed.iloc[1:]
listed
ValueError: No tables found
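One common cause of "ValueError: No tables found" is that the table never appears in response.text, either because the server returns a block page to non-browser clients or because the table is rendered by JavaScript. A minimal sketch, assuming the first case (if the table is built client-side, a browser-driven approach such as Selenium would be needed instead):

import requests
import pandas as pd

url = "https://www.cmoney.tw/etf/e210.aspx?key=0050"
# assumption: the site may reject requests' default User-Agent
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
response.raise_for_status()

# still raises ValueError if no <table> element is present in the returned HTML
listed = pd.read_html(response.text)[0]
print(listed.head())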
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.haremaltin.com/canli-piyasalar/")
soup = BeautifulSoup(html.content, "html.parser")
atalira = soup.find_all(?????)
for gold in atalira:
    price = gold.text
    print(price)
Hello everyone, if you go to the page https://www.haremaltin.com/canli-piyasalar/, under "Altın Fiyatları" you will see "Eski Ata". I want to target that value in the ????? part of my Python code, and it is a little bit challenging for me. Thank you for your time in advance. Below is the HTML for the value I want to extract:
<span class="item end price"><span class="arrowWrapper"><!----> <!----></span>
3.327
</span>
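For reference, if that span were present in the static HTML, a class-based lookup would select it. A minimal sketch (as the edit below found, the value is actually rendered by JavaScript, so against the live site this will likely return nothing and a browser-driven approach is needed):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.haremaltin.com/canli-piyasalar/")
soup = BeautifulSoup(html.content, "html.parser")

# select the span by its CSS classes; returns None if the element
# is not in the static HTML (e.g. it is injected by JavaScript)
span = soup.select_one("span.item.end.price")
if span is not None:
    print(span.get_text(strip=True))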
Edit:
I have found a way
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time
# pip install selenium
# apt-get update # to update ubuntu to correctly run apt install
# apt install chromium-chromedriver
# cp /usr/lib/chromium-browser/chromedriver /usr/bin
# use command above if you code on google colab
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
site = 'https://www.haremaltin.com/altin-fiyatlari'
wd = webdriver.Chrome('chromedriver', options=options)
wd.get(site)
time.sleep(5) # give chrome 5 seconds to load the page
html = wd.page_source
df = pd.read_html(html)
gold = df[1][2][15] # table 1, column 2, row 15, the value I want
gold = int(float(gold) * 1000)
# the value came back as a float, something like 3.252,000,
# so I convert it to an int; the code above is the solution I found
I'm not sure you can do it this way. The best approach is to get an API for the site you want and go from there; if you can't get one, find a different site. Here is a sample code I made a while back.
import re
import http.client

def gold_price():
    conn = http.client.HTTPSConnection("www.goldapi.io")
    payload = ''
    headers = {
        'x-access-token': 'goldapi-aq2kfluknfhfjz4-io',
        'Content-Type': 'application/json'
    }
    conn.request("GET", "/api/XAU/USD", payload, headers)
    res = conn.getresponse()
    txt = res.read().decode("utf-8")
    # pull the integer part of the price out of the JSON response
    pattern = re.search(r'"price":\d\d\d\d', txt)
    # pattern = re.findall(r'\d\d\d\d', txt)
    print(pattern)

gold_price()
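As a side note, since the endpoint returns JSON, parsing it with the standard json module is more robust than a regex. A minimal sketch assuming the same endpoint and token as above, and that the response carries the "price" field the regex was matching:

import json
import http.client

conn = http.client.HTTPSConnection("www.goldapi.io")
conn.request("GET", "/api/XAU/USD", '', {
    'x-access-token': 'goldapi-aq2kfluknfhfjz4-io',  # same token as above
    'Content-Type': 'application/json',
})
body = conn.getresponse().read().decode("utf-8")

data = json.loads(body)  # parse the whole JSON document
print(data["price"])     # the numeric field the regex above was matching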
I would like to download data from the http://ec.europa.eu/taxation_customs/vies/ site. The problem is that when I enter data on it through the program, the URL doesn't change, so the file saved to disk shows the same page that was open at the beginning, without the data. Maybe I don't know how to access the site after submitting the data? I'm new to Python and tried to look for a solution with no result, so if there was such an issue, please link me to it. Here's my code. I appreciate all responses :)
import requests
import selenium
import select as something
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import pdfkit
url = "http://ec.europa.eu/taxation_customs/vies/?locale=pl"
driver = webdriver.Chrome(executable_path ="C:\\Users\\Python\\Chromedriver.exe")
driver.get("http://ec.europa.eu/taxation_customs/vies/")
#wait = WebDriverWait(driver, 10)
obj = Select(driver.find_element_by_id("countryCombobox"))
obj = obj.select_by_index(1)
vies_r = requests.get(url)
vies_vat = driver.find_element_by_id("number")
vies_vat.send_keys('U54799909')
vies_verify = driver.find_element_by_id("submit")
vies_verify.click()
path_wkhtmltopdf = r'C:\Users\Python\wkhtmltox\wkhtmltox\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)
print(driver.current_url)
pdfkit.from_url(driver.current_url, "out.pdf", configuration=config)
Ukalo
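A hedged suggestion for the problem described above: pdfkit.from_url(driver.current_url, ...) makes a fresh request to that URL, which returns the empty form again, while the results are only filled in dynamically inside the already-open Selenium browser. Rendering the browser's current page source after the results load may capture them instead. A minimal sketch under that assumption, reusing the driver and config objects from the code above:

import time
import pdfkit

# ... after vies_verify.click() in the code above
time.sleep(5)  # crude wait; a WebDriverWait on a result element would be more robust

# render the HTML the browser is currently showing, rather than re-requesting the URL
pdfkit.from_string(driver.page_source, "out.pdf", configuration=config)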