Unable to get the content of webpage using requests - python

I'm using requests to get a webpage, but it fails.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = 'http://db.house.qq.com/index.php?mod=search&city=bj'
headers = {}
headers['authority'] = 'db.house.qq.com'
headers['method'] = 'GET'
headers['path'] = '/index.php?mod=search&city=bj'
headers['scheme'] = 'https'
headers['accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
headers['accept-encoding'] = 'gzip, deflate, br'
headers['accept-language'] = 'en-US,en;q=0.9,zh-HK;q=0.8,zh;q=0.7,zh-CN;q=0.6,an;q=0.5'
headers['cookie'] = 'pgv_info=ssid=s9739254340; pgv_pvid=9743767040; ts_uid=2023229671; pac_uid=0_da940c972d7c0; h_uid=h592060229584922854; Hm_lvt_73f18bb34ff30f1061b904f30f86c5cb=1602238779; ts_refer=www.google.com/; ts_uid=6802299874; pgv_pvi=196710400; pgv_si=s9373821952; Hm_lpvt_73f18bb34ff30f1061b904f30f86c5cb=1602767734; hisuid=[%22h592060229584922854%22]; hisuin=[null]; feature={%2295%22:1%2C%2298%22:1}; ts_last=db.house.qq.com/index.php; ad_play_index=86'
headers['dnt'] = '1'
headers['sec-ch-ua'] = '"Chromium";v="86", "\"Not\\A;Brand";v="99", "Google Chrome";v="86"'
headers['sec-ch-ua-mobile'] = '?0'
headers['sec-fetch-dest'] = 'document'
headers['sec-fetch-mode'] = 'navigate'
headers['sec-fetch-site'] = 'none'
headers['upgrade-insecure-requests'] = '1'
headers['user-agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
res = requests.get(url, headers=headers)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text, 'html.parser')  # parse the response; the prints below use soup
print(soup.find('em', {'id': 'search_result_num'}).text)  # prints 0, but it should be 3767
print('三湘印象·森林海尚城' in res.text)  # prints False, but it should be True
How can I solve this problem?
Thanks.

The content is generated by JavaScript (not JSON). I caught this URL in the developer tools:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://db.house.qq.com/index.php?mod=search&act=newsearch&city=bj&showtype=1&mod=search&city=bj'
# url = 'http://db.house.qq.com/index.php?mod=search&city=bj'
headers = {}
headers['authority'] = 'db.house.qq.com'
headers['method'] = 'GET'
headers['path'] = '/index.php?mod=search&city=bj'
headers['scheme'] = 'https'
headers['accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
headers['accept-encoding'] = 'gzip, deflate, br'
headers['accept-language'] = 'en-US,en;q=0.9,zh-HK;q=0.8,zh;q=0.7,zh-CN;q=0.6,an;q=0.5'
headers['cookie'] = 'pgv_info=ssid=s9739254340; pgv_pvid=9743767040; ts_uid=2023229671; pac_uid=0_da940c972d7c0; h_uid=h592060229584922854; Hm_lvt_73f18bb34ff30f1061b904f30f86c5cb=1602238779; ts_refer=www.google.com/; ts_uid=6802299874; pgv_pvi=196710400; pgv_si=s9373821952; Hm_lpvt_73f18bb34ff30f1061b904f30f86c5cb=1602767734; hisuid=[%22h592060229584922854%22]; hisuin=[null]; feature={%2295%22:1%2C%2298%22:1}; ts_last=db.house.qq.com/index.php; ad_play_index=86'
headers['dnt'] = '1'
headers['sec-ch-ua'] = '"Chromium";v="86", "\"Not\\A;Brand";v="99", "Google Chrome";v="86"'
headers['sec-ch-ua-mobile'] = '?0'
headers['sec-fetch-dest'] = 'document'
headers['sec-fetch-mode'] = 'navigate'
headers['sec-fetch-site'] = 'none'
headers['upgrade-insecure-requests'] = '1'
headers['user-agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
res = requests.get(url, headers=headers)
text = res.content.decode("unicode_escape")  # unescape the \uXXXX sequences in the response
print("三湘印象·森林海尚城" in text)
soup = BeautifulSoup(text, "lxml")
result = soup.find(id="search_result_page").find_all("a")[-1].text
print(re.search(r"search_result_list_num = (\d+);", result).group(1))  # use a regex to find the number of results
Prints:
True
3767
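If all you need is the result count, a shorter variant of the same idea skips the BeautifulSoup step entirely. This is only a sketch reusing the URL, the unicode_escape decoding, and the regex from the answer above; that the counter string can be found anywhere in the decoded payload (not just inside that last anchor) is an assumption.
import re
import requests

# Sketch: fetch the JS-generated payload and pull the counter out with the answer's regex.
url = 'https://db.house.qq.com/index.php?mod=search&act=newsearch&city=bj&showtype=1&mod=search&city=bj'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
res = requests.get(url, headers=headers)
text = res.content.decode('unicode_escape')
m = re.search(r'search_result_list_num = (\d+);', text)
print(m.group(1) if m else 'counter not found')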

Related

Data are overwritten in pandas

When I make the CSV file, the data get overwritten in it. If there is any solution, please provide it; the link of the page is https://www.aeafa.es/asociados.php?provinput=&_pagi_pg=1. I have already searched for an answer here and spent a long time on Google, but nothing. I've already tried opening the file with 'w' instead of 'r' or 'a', but I still can't get my code to work.
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}
for page in range(1, 3):
    r = requests.get('https://www.aeafa.es/asociados.php?provinput=&_pagi_pg={page}'.format(page=page),
                     headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    tag = soup.find_all('div', class_='col-md-8 col-sm-8')
    temp = []
    for pro in tag:
        data = [tup.text for tup in pro.find_all('p')]
        Dirección = data[2]
        Dirección = Dirección[12:]
        Población = data[3]
        Población = Población[14:]
        Provincia = data[4]
        Provincia = Provincia[14:]
        Teléfono = data[5]
        Teléfono = "+" + Teléfono[11:].replace('.', "")
        Email = data[6]
        Email = Email[10:]
        temp.append([Dirección, Provincia, Población, Teléfono, Email])
    df = pd.DataFrame(temp, columns=["Dirección", "Provincia", "Población", "Teléfono", "Email"])
    df.to_csv('samp.csv')
Try to put the list temp outside of the for-loop. Then, create the dataframe after all the loops finish:
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"
}

temp = []
for page in range(1, 3):
    r = requests.get(
        "https://www.aeafa.es/asociados.php?provinput=&_pagi_pg={page}".format(page=page),
        headers=headers,
    )
    soup = BeautifulSoup(r.content, "lxml")
    tag = soup.find_all("div", class_="col-md-8 col-sm-8")
    for pro in tag:
        data = [tup.text for tup in pro.find_all("p")]
        Dirección = data[2]
        Dirección = Dirección[12:]
        Población = data[3]
        Población = Población[14:]
        Provincia = data[4]
        Provincia = Provincia[14:]
        Teléfono = data[5]
        Teléfono = "+" + Teléfono[11:].replace(".", "")
        Email = data[6]
        Email = Email[10:]
        temp.append([Dirección, Provincia, Población, Teléfono, Email])

df = pd.DataFrame(
    temp, columns=["Dirección", "Provincia", "Población", "Teléfono", "Email"]
)
df.to_csv("samp.csv")
print(len(df))
Prints:
98
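If you do want to keep writing inside the page loop instead, pandas can append to an existing CSV rather than overwrite it. This is only a rough sketch of that alternative (the helper name and signature are illustrative, not from the thread):
import os
import pandas as pd

# Hypothetical helper: append one page's rows to the CSV instead of rewriting the whole file.
def append_rows(rows, path="samp.csv"):
    df = pd.DataFrame(rows, columns=["Dirección", "Provincia", "Población", "Teléfono", "Email"])
    # mode="a" appends; write the header only when the file doesn't exist yet.
    df.to_csv(path, mode="a", header=not os.path.exists(path), index=False)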

urlopen Returning Redirect Error for Valid Links [HTTP Error 308: Permanent Redirect]

I'm trying to scrape Amazon listings, and I am consistently getting a redirect error with my scraper. I even used http.cookiejar.CookieJar and a urllib.request.HTTPCookieProcessor to avoid a redirect loop, but I'm still getting the error.
from bs4 import BeautifulSoup as soup
import pandas as pd
import requests
import urllib.request
import time
import random
from requests.exceptions import HTTPError
from socket import error as SocketError
from http.cookiejar import CookieJar

data = []

def getdata(url):
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
    ]
    user_agent = random.choice(user_agents)
    header_ = {'User-Agent': user_agent}
    req = urllib.request.Request(url, headers=header_)
    cj = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    response = opener.open(req)
    amazon_html = response.read().decode('utf8', errors='ignore')
    a_soup = soup(amazon_html, 'html.parser')
    cat = k
    for e in a_soup.select('div[data-component-type="s-search-result"]'):
        try:
            asin = e.find('a')['href'].replace('dp%2F', '/dp/').split('/dp/')[1].replace('%2', '/ref').split('/ref')[0]
        except:
            asin = 'No ASIN Found'
        try:
            title = e.find('h2').text
        except:
            title = None
        data.append({
            'Category': cat,
            'ASIN': asin,
            'Title': title
        })
    return a_soup

def getnextpage(a_soup):
    try:
        page = a_soup.find('a', attrs={"class": 's-pagination-item s-pagination-next s-pagination-button s-pagination-separator'})['href']
        url = 'http://www.amazon.in' + str(page)
    except:
        url = None
    return url

keywords = ['headphone', 'mobile', 'router', 'smartwatch']
for k in keywords:
    url = 'https://www.amazon.in/s?k=' + k
    while True:
        geturl = getdata(url)
        url = getnextpage(geturl)
        if not url:
            break
        print(url)
Output
HTTPError: HTTP Error 308: Permanent Redirect
Any ideas on how I can correct this?
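An aside, not an answer from the thread: urllib.request only learned to follow 308 Permanent Redirect responses in recent Python versions (roughly 3.11 and later), so on older interpreters the redirect surfaces as exactly this HTTPError. One possible workaround is to fetch the page with requests, which follows 308s by default. A minimal sketch keeping the rotating User-Agent idea (the function name is illustrative):
import random
import requests

# Sketch of a workaround: let requests follow the redirect chain (including 308)
# and hand the final HTML to BeautifulSoup as before.
def fetch_html(url, user_agents):
    headers = {'User-Agent': random.choice(user_agents)}
    response = requests.get(url, headers=headers, allow_redirects=True)
    response.raise_for_status()  # still raises for genuine 4xx/5xx responses
    return response.text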

What is the fix for this Error: 'NoneType' object has no attribute 'prettify'

I want to scrape this URL https://aviation-safety.net/wikibase/type/C206.
I don't understand the meaning of this error below:
'NoneType' object has no attribute 'prettify'
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import Request

url = 'https://aviation-safety.net/wikibase/type/C206'
req = Request(url, headers={
    'accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'})

data = []
while True:
    print(url)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    data.append(pd.read_html(soup.select_one('tbody').prettify())[0])
    if soup.select_one('div.pagenumbers + div a[href]'):
        url = soup.select_one('div.pagenumbers + div a')['href']
    else:
        break

df = pd.concat(data)
df.to_csv('206.csv', encoding='utf-8-sig', index=False)
You're not using headers with requests, which is the reason you're not getting the right HTML; also, the table you're after is the second one, not the first. I'd also highly recommend using requests over urllib.request.
So, having said that, here's how to get all the tables from all the pages:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://aviation-safety.net/wikibase/type/C206'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
}

data = []
with requests.Session() as s:
    total_pages = int(
        BeautifulSoup(s.get(url, headers=headers).text, "lxml")
        .select("div.pagenumbers > a")[-1]
        .getText()
    )
    for page in range(1, total_pages + 1):
        print(f"Getting page: {page}...")
        data.append(
            pd.read_html(
                s.get(f"{url}/{page}", headers=headers).text,
                flavor="lxml",
            )[1]
        )

df = pd.concat(data)
df.to_csv('206.csv', sep=";", index=False)

Scrape and save the data into a CSV with Beautiful Soup

Below is the URL to scrape: https://www.agtta.co.in/individuals.php
I need to extract Name, Mobile number, and Email, and then save them to a CSV.
I am able to scrape the full data with the code below, extracting with a user agent:
from bs4 import BeautifulSoup
import urllib.request

urls = ['https://www.agtta.co.in/individuals.php']
for url in urls:
    req = urllib.request.Request(
        url,
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        }
    )
    resp = urllib.request.urlopen(req)
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'), features='html.parser')
    scrape_data = soup.find('section', class_='b-branches')
    to_list = scrape_data.find_all_next(string=True)
I tried with:
for biz in results:
    # print(biz)
    title = biz.findAll('h3', {'class': 'b-branches__title ui-title-inner ui-title-inner_lg'})
    print(title)
I'm getting [<h3 class="b-branches__title ui-title-inner ui-title-inner_lg">SHRI RAMESHBHAI P. SAKARIYA</h3>], so the tag comes along while extracting. How do I remove the tag?
My expected output:
Name, Mobilenumber, Email
A, 333, mm#gmail.com
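On the immediate question of stripping the tag: findAll returns Tag objects, and calling .get_text() (or .text) on a Tag returns just its text without the markup, which is what both answers below rely on. A tiny illustration using the soup object from the snippet above:
# Minimal illustration: .get_text() strips the surrounding markup from a Tag.
for h3 in soup.find_all('h3', class_='b-branches__title ui-title-inner ui-title-inner_lg'):
    print(h3.get_text(strip=True))  # e.g. SHRI RAMESHBHAI P. SAKARIYA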
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

urls = ['https://www.agtta.co.in/individuals.php']
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
for url in urls:
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'), features='html.parser')
    result = []
    for individual in soup.findAll("section", {"class": "b-branches"}):
        name = individual.h3.text
        phone_data = individual.find('p')
        phone = phone_data.text.replace("Mobile No", "").strip() if phone_data else ""
        email_data = individual.select('div:contains("Email")')
        email = email_data[0].text.replace("Email", "").strip() if email_data else ""
        result.append({"Name": name, "Phone": phone, "Email": email})
    output = pd.DataFrame(result)
    output.to_csv("Details.csv", index=False)
Here is the full code to do it:
from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
r = requests.get('https://www.agtta.co.in/individuals.php', headers=headers).text
soup = BeautifulSoup(r, 'html5lib')
sections = soup.find_all('section', class_="b-branches")

names = []
phone_numbers = []
emails = []
for section in sections:
    name = section.h3.text
    names.append(name)
    phone_number = section.p.text
    phone_number = phone_number.split('Mobile No ')[1]
    phone_numbers.append(phone_number)
    try:
        email = section.find_all('div')[3].text
        email = email.split('Email ')[1]
        emails.append(email)
    except:
        emails.append(None)

details_dict = {"Names": names,
                "Phone Numbers": phone_numbers,
                "Emails": emails}
df = pd.DataFrame(details_dict)
df.to_csv("Details.csv", index=False)
Hope that this helps!

Retrieving Lyrics from Musixmatch

import requests
import json
import urllib
import lyricsgenius
import os
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client.dbsparta

def get_artist_id(artistName):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
    response = requests.get("https://api.musixmatch.com/ws/1.1/artist.search?page_size=100&format=json&apikey=123&q_artist=" + artistName, headers=headers)
    response.encoding = 'UTF-8'
    return response.json()['message']['body']['artist_list'][0]['artist']['artist_id']
    # print(response.json()['message']['body']['artist_list'][0]['artist']['artist_id'])

def get_album_ids(artist_id):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
    album_response = requests.get("https://api.musixmatch.com/ws/1.1/artist.albums.get?page_size=100&format=json&apikey=123&artist_id=" + str(artist_id), headers=headers)
    album_response.encoding = 'UTF-8'
    # counter = 0
    # album_list = album_response.json()['message']['body']['album_list']
    return album_response.json()['message']['body']['album_list']
    # print(album_response.json()['message']['body']['album_list'])
    # for album in album_list:
    #     # counter += 1
    #     print(album['album']['album_id'])

def get_album_tracks_ids(album_id):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
    response = requests.get("https://api.musixmatch.com/ws/1.1/album.tracks.get?page_size=100&format=json&apikey=123&album_id=" + str(album_id), headers=headers)
    response.encoding = 'UTF-8'
    return response.json()['message']['body']['track_list']

# def get_track_id(artist_id):
#     headers = {
#         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
#     response = requests.get("https://api.musixmatch.com/ws/1.1/track.search?page_size=100format=json&apikey=123&f_artist_id=" + str(artist_id), headers=headers)
#     response.encoding = 'UTF-8'
#     for tracks in response.json()['message']['body']['track_list']:
#         print(tracks['track']['track_name'])

def get_track_lyrics(track_id):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
    response = requests.get("https://api.musixmatch.com/ws/1.1/track.lyrics.get?apikey=123&track_id=" + str(track_id), headers=headers)
    response.encoding = 'UTF-8'
    # return response['message']['body']['lyrics']['lyrics_body']
    return response.json()['message']['body']['lyrics']['lyrics_body']

def main():
    stars_list = list(db.new_top200.find({}, {'_id': 0}))
    for stars in stars_list:
        print(stars['name'])
        album_ids = get_album_ids(get_artist_id(stars['name']))
        # if album_ids is not None:
        for album_id in album_ids:
            # if album_id is not None and get_album_tracks_ids(album_id['album']['album_id']) is not [] and get_album_tracks_ids(album_id['album']['album_id']) is not None:
            track_ids = get_album_tracks_ids(album_id['album']['album_id'])
            for track in track_ids:
                # if track is not [] and track['track']['track_id'] is not [] and track is not None:
                # if get_track_lyrics(track['track']['track_id']) is not [] and get_track_lyrics(track['track']['track_id']) is not None:
                lyric = get_track_lyrics(track['track']['track_id'])
                db.new_top200.update_one({'name': stars['name']}, {'$push': {'lyrics': lyric}})

# get_track_id(get_artist_id('Kanye West'))
# get_album_ids(get_artist_id("Kanye West"))
# get_album_tracks(15565713)

if __name__ == "__main__":
    # for album in get_album_ids(get_artist_id("Kanye West")):
    #     get_album_tracks_ids(album['album']['album_id'])
    # get_track_lyrics(96610952)
    # get_album_tracks_ids(15565713)
    # get_album_ids(get_artist_id('Drake'))
    main()
I'm trying to get ALL of the lyrics of an artist and store them in a database. For example, if the artist is "Drake", I want all of his lyrics stored under the 'lyrics' key in my database.
However, I get a bunch of unpredictable errors every time I run the same code. For example, it will be inserting 400 lyrics without any problem and suddenly I'll get an error saying 'list indices must be integers or slices, not str'. This error is quite confusing to me because I'm assuming that all of the JSON data are in the same format, yet the error only appears after 400 song lyrics have been processed with no problem.
I can run the same code and, at about 200 song lyrics in, I'll get a JSON decode error; then, when I run it AGAIN, after processing a different number of song lyrics, I'll get the error I described at the beginning again.
Can someone explain the random nature of this error?
Thank you!
You are making assumptions about the data types that will be returned from the JSON. In your case, I suspect that one of the JSON elements is a list, not an object.
Your issue can be reproduced with this simple example:
my_dict = {
    'message': {
        'body': {
            'lyrics': ['Always look on the bright side of life']
        }
    }
}

print(my_dict['message']['body']['lyrics']['lyrics_body'])
gives:
TypeError: list indices must be integers or slices, not str
How do you fix it? You'll need to check that each element matches what you expect; for example:
my_dict = {
    'message': {
        'body': {
            'lyrics': ['Always look on the bright side of life']
        }
    }
}

def checker(item, field):
    if isinstance(item, dict):
        return item.get(field)
    else:
        raise ValueError(f"'{item}' in field '{field}' is not a valid dict")

message = checker(my_dict, 'message')
body = checker(message, 'body')
lyrics = checker(body, 'lyrics')
print(checker(lyrics, 'lyrics'))
gives:
ValueError: '['Always look on the bright side of life']' in field 'lyrics' is not a valid dict
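Applied back to the code in the question, the same idea can be written with dict.get and an isinstance check so a missing or non-dict body is skipped instead of crashing. This is only a sketch; that a non-dict body from the Musixmatch API means "no lyrics available for this track" is an assumption.
import requests

# Sketch: same API call as the question's get_track_lyrics, but the body is
# checked before indexing, so a list/empty body returns None instead of raising.
def get_track_lyrics(track_id):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(
        "https://api.musixmatch.com/ws/1.1/track.lyrics.get?apikey=123&track_id=" + str(track_id),
        headers=headers)
    body = response.json().get('message', {}).get('body')
    if not isinstance(body, dict):
        return None  # assumed to mean no lyrics for this track
    return body.get('lyrics', {}).get('lyrics_body')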
