problem when download a image from website using beautiful soap

problem when download a image from website using beautiful soap - python

i am wrtiting this code to get information about top movies and also download the image blong to the movie but on some image they downloaded but their size are 0 but they have size on disk when i kilick on the link of the image that i cant download it well its opening well and there is no problem in link
for exampele this is one of the link that images :
https://static.stacker.com/s3fs-public/styles/slide_desktop/s3/00000116_4_0.png
import requests
from bs4 import BeautifulSoup
import pandas as pd
URL = "https://stacker.com/stories/1587/100-best-movies-all-time"
count = 0
local_description = ""
movie_data = []
data = requests.get(URL).text
soap = BeautifulSoup(data, "html.parser")
titles = soap.find_all(name="h2", class_="ct-slideshow__slide__text-container__caption")[1:]
description = soap.find_all(name="div", class_="ct-slideshow__slide__text-container__description")[1:]
images = soap.find_all(name="img", typeof="foaf:Image")[6:106]
for num in range(100):
movie_name = titles[num].getText().replace("\n", "")
local_des = description[num].find_all(name="p")[1:]
for s in local_des:
local_description = s.getText().replace(" ", "")
local_data = {"title": movie_name, "description": local_description}
movie_data.append(local_data)
movie_image_link = images[num].get("src")
response = requests.get(movie_image_link)
with open(f"images/{movie_name}.png", 'wb') as f:
f.write(response.content)
count += 1
print(count)
data_collected = pd.DataFrame(movie_data)
data_collected.to_csv("Data/100_movie.csv", index=False)

i found my problem in some movie name there was ":" and as you knwo you cant user ":"
in file names i fix the code with .replace()
movie_name.replace(":", "")

Once you get a response, check if it's empty before writing to disk. Might need to retry or the link may be bad.

Related

Web-scrape. BeautifulSoup. Multiple Pages. How on earth would you do that?

Hi I am a Newbie to programming. So I spent 4 days trying to learn python. I evented some new swear words too.
I was particularly interested in trying as an exercise some web-scraping to learn something new and get some exposure to see how it all works.
This is what I came up with. See code at end. It works (to a degree)
But what's missing?
This website has pagination on it. In this case 11 pages worth.  How would you go about adding to this script and getting python to go look at those other pages too and carry out the same scrape. Ie scrape page one , scrape page 2, 3 ... 11 and post the results to a csv?
https://www.organicwine.com.au/vegan/?pgnum=1
https://www.organicwine.com.au/vegan/?pgnum=2
https://www.organicwine.com.au/vegan/?pgnum=3
https://www.organicwine.com.au/vegan/?pgnum=4
https://www.organicwine.com.au/vegan/?pgnum=5
https://www.organicwine.com.au/vegan/?pgnum=6
https://www.organicwine.com.au/vegan/?pgnum=7
8, 9,10, and 11
On these pages the images are actually a thumbnail images something like 251px by 251px.
How would you go about adding to this script to say. And whilst you are at it follow the links to the detailed product page and capture the image link from there where the images are 1600px by 1600px and post those links to CSV
https://www.organicwine.com.au/mercer-wines-preservative-free-shiraz-2020
When we have identified those links lets also download those larger images to a folder
CSV writer. Also I don't understand line 58
for i in range(23)
how would i know how many products there were without counting them (i.e. there is 24 products on page one)
So this is what I want to learn how to do. Not asking for much (he says sarcastically) I could pay someone on up-work to do it but where's the fun in that? and that does not teach me how to 'fish'.
Where is a good place to learn python? A master class on web-scraping. It seems to be trial and error and blog posts and where ever you can pick up bits of information to piece it all together.
Maybe I need a mentor.
I wish there had been someone I could have reached out to, to tell me what beautifulSoup was all about. worked it out by trial and error and mostly guessing. No understanding of it but it just works.
Anyway, any help in pulling this all together to produce a decent script would be greatly appreciated.
Hopefully there is someone out there who would not mind helping me.
Apologies to organicwine for using their website as a learning tool. I do not wish to cause any harm or be a nuisance to the site
Thank you in advance
John
code:
import requests
import csv
from bs4 import BeautifulSoup
URL = "https://www.organicwine.com.au/vegan/?pgnum=1"
response = requests.get(URL)
website_html = response.text
soup = BeautifulSoup(website_html, "html.parser")
product_title = soup.find_all('div', class_="caption")
# print(product_title)
winename = []
for wine in product_title:
winetext = wine.a.text
winename.append(winetext)
print(f'''Wine Name: {winetext}''')
# print(f'''\nWine Name: {winename}\n''')
product_price = soup.find_all('div', class_='wrap-thumb-mob')
# print(product_price.text)
price =[]
for wine in product_price:
wineprice = wine.span.text
price.append(wineprice)
print(f'''Wine Price: {wineprice}''')
# print(f'''\nWine Price: {price}\n''')
image =[]
product_image_link = (soup.find_all('div', class_='thumbnail-image'))
# print(product_image_link)
for imagelink in product_image_link:
wineimagelink = imagelink.a['href']
image.append(wineimagelink)
# image.append(imagelink)
print(f'''Wine Image Lin: {wineimagelink}''')
# print(f'''\nWine Image: {image}\n''')
#
#
# """ writing data to CSV """
# open OrganicWine2.csv file in "write" mode
# newline stops a blank line appearing in csv
with open('OrganicWine2.csv', 'w',newline='') as file:
# create a "writer" object
writer = csv.writer(file, delimiter=',')
# use "writer" obj to write
# you should give a "list"
writer.writerow(["Wine Name", "Wine Price", "Wine Image Link"])
for i in range(23):
writer.writerow([
winename[i],
price[i],
image[i],
])

In this case, to do pagination, instead of for i in range(1, 100) which is a hardcoded way of paging, it's better to use a while loop to dynamically paginate all possible pages.
"While" is an infinite loop and it will be executed until the transition to the next page is possible, in this case it will check for the presence of the button for the next page, for which the CSS selector ".fa-chevron-right" is responsible:
if soup.select_one(".fa-chevron-right"):
params["pgnum"] += 1 # go to the next page
else:
break
To extract the full size image an additional request is required, CSS selector ".main-image a" is responsible for full-size images:
full_image_html = requests.get(link, headers=headers, timeout=30)
image_soup = BeautifulSoup(full_image_html.text, "lxml")
try:
original_image = f'https://www.organicwine.com.au{image_soup.select_one(".main-image a")["href"]}'
except:
original_image = None
An additional step to avoid being blocked is to rotate user-agents. Ideally, it would be better to use residential proxies with random user-agent.
pandas can be used to extract data in CSV format:
pd.DataFrame(data=data).to_csv("<csv_file_name>.csv", index=False)
For a quick and easy search for CSS selectors, you can use the SelectorGadget Chrome extension (not always work perfectly if the website is rendered via JavaScript).
Check code with pagination and saving information to CSV in online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml
import pandas as pd
# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}
params = {
'pgnum': 1 # number page by default
}
data = []
while True:
page = requests.get(
"https://www.organicwine.com.au/vegan/?",
params=params,
headers=headers,
timeout=30,
)
soup = BeautifulSoup(page.text, "lxml")
print(f"Extracting page: {params['pgnum']}")
for products in soup.select(".price-btn-conts"):
try:
title = products.select_one(".new-h3").text
except:
title = None
try:
price = products.select_one(".price").text.strip()
except:
price = None
try:
snippet = products.select_one(".price-btn-conts p a").text
except:
snippet = None
try:
link = products.select_one(".new-h3 a")["href"]
except:
link = None
# additional request is needed to extract full size image
full_image_html = requests.get(link, headers=headers, timeout=30)
image_soup = BeautifulSoup(full_image_html.text, "lxml")
try:
original_image = f'https://www.organicwine.com.au{image_soup.select_one(".main-image a")["href"]}'
except:
original_image = None
data.append(
{
"title": title,
"price": price,
"snippet": snippet,
"link": link,
"original_image": original_image
}
)
if soup.select_one(".fa-chevron-right"):
params["pgnum"] += 1
else:
break
# save to CSV (install, import pandas as pd)
pd.DataFrame(data=data).to_csv("<csv_file_name>.csv", index=False)
print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "Yangarra McLaren Vale GSM 2016",
"price": "$29.78 in a straight 12\nor $34.99 each",
"snippet": "The Yangarra GSM is a careful blending of Grenache, Shiraz and Mourvèdre in which the composition varies from year to year, conveying the traditional estate blends of the southern Rhône. The backbone of the wine comes fr...",
"link": "https://www.organicwine.com.au/yangarra-mclaren-vale-gsm-2016",
"original_image": "https://www.organicwine.com.au/assets/full/YG_GSM_16.png?20211110083637"
},
{
"title": "Yangarra Old Vine Grenache 2020",
"price": "$37.64 in a straight 12\nor $41.99 each",
"snippet": "Produced from the fruit of dry grown bush vines planted high up in the Estate's elevated vineyards in deep sandy soils. These venerated vines date from 1946 and produce a wine that is complex, perfumed and elegant with a...",
"link": "https://www.organicwine.com.au/yangarra-old-vine-grenache-2020",
"original_image": "https://www.organicwine.com.au/assets/full/YG_GRE_20.jpg?20210710165951"
},
#...
]

Create the URL by putting the page number in it, then put the rest of your code into a for loop and you can use len(winenames) to count how many results you have. You should do the writing outside the for loop. Here's your code with those changes:
import requests
import csv
from bs4 import BeautifulSoup
num_pages = 11
result = []
for pgnum in range(num_pages):
url = f"https://www.organicwine.com.au/vegan/?pgnum={pgnum+1}"
response = requests.get(url)
website_html = response.text
soup = BeautifulSoup(website_html, "html.parser")
product_title = soup.find_all("div", class_="caption")
winename = []
for wine in product_title:
winetext = wine.a.text
winename.append(winetext)
product_price = soup.find_all("div", class_="wrap-thumb-mob")
price = []
for wine in product_price:
wineprice = wine.span.text
price.append(wineprice)
image = []
product_image_link = soup.find_all("div", class_="thumbnail-image")
for imagelink in product_image_link:
winelink = imagelink.a["href"]
response = requests.get(winelink)
wine_page_soup = BeautifulSoup(response.text, "html.parser")
main_image = wine_page_soup.find("a", class_="fancybox")
image.append(main_image['href'])
for i in range(len(winename)):
result.append([winename[i], price[i], image[i]])
with open("/tmp/OrganicWine2.csv", "w", newline="") as file:
writer = csv.writer(file, delimiter=",")
writer.writerow(["Wine Name", "Wine Price", "Wine Image Link"])
writer.writerows(results)
And here's how I would rewrite your code to accomplish this task. It's more pythonic (you should basically never write range(len(something)), there's always a cleaner way) and it doesn't require knowing how many pages of results there are:
import csv
import itertools
import time
import requests
from bs4 import BeautifulSoup
data = []
# Try opening 100 pages at most, in case the scraping code is broken
# which can happen because websites change.
for pgnum in range(1, 100):
url = f"https://www.organicwine.com.au/vegan/?pgnum={pgnum}"
response = requests.get(url)
website_html = response.text
soup = BeautifulSoup(website_html, "html.parser")
search_results = soup.find_all("div", class_="thumbnail")
for search_result in search_results:
name = search_result.find("div", class_="caption").a.text
price = search_result.find("p", class_="price").span.text
# link to the product's page
link = search_result.find("div", class_="thumbnail-image").a["href"]
# get the full resolution product image
response = requests.get(link)
time.sleep(1) # rate limit
wine_page_soup = BeautifulSoup(response.text, "html.parser")
main_image = wine_page_soup.find("a", class_="fancybox")
image_url = main_image["href"]
# or you can just "guess" it from the thumbnail's URL
# thumbnail = search_result.find("div", class_="thumbnail-image").a.img['src']
# image_url = thumbnail.replace('/thumbL/', '/full/')
data.append([name, price, link, image_url])
# if there's no "next page" button or no search results on the current page,
# stop scraping
if not soup.find("i", class_="fa-chevron-right") or not search_results:
break
# rate limit
time.sleep(1)
with open("/tmp/OrganicWine3.csv", "w", newline="") as file:
writer = csv.writer(file, delimiter=",")
writer.writerow(["Wine Name", "Wine Price", "Wine Link", "Wine Image Link"])
writer.writerows(data)

Multiple for loops and csv files

I am new here and newbie with python and currently learning some basic stuff, mostly scraping and I encountered a problem that I hope you can help me to solve.
I'm trying to scrape few details from a website and writing them into a CSV file but I'm able to write only the last results into my CSV, apparently my script just overwrite the data.
Also if you find any mistakes on my code or any room for improvement (which I'm sure there are) I'd be glad if you will point them out as well.
Also2, any recommendation for videos/tutorials that can help me improve my python and scraping skills would be appreciated.
import requests
from bs4 import BeautifulSoup
import csv
url = 'https://www.tamarackgc.com/club-contacts'
source = requests.get(url).text
soup = BeautifulSoup (source, 'lxml')
csv_file = open('contacts.csv', 'w')
csv_writer = csv.writer (csv_file)
csv_writer.writerow(["department", "name", "position", "phone"])
for department in soup.find_all("div", class_="view-content"):
department_name = department.h3
print (department_name.text)
for contacts in soup.find_all("div", class_="col-md-7 col-xs-10"):
contact_name = contacts.strong
print(contact_name.text)
for position in soup.find_all("div", class_="field-content"):
print(position.text)
for phone in soup.find_all("div", class_="modal-content"):
first_phone = phone.h3
first_phones = first_phone
print(first_phones)
csv_writer.writerow([department_name, contact_name, position, first_phones])
csv_file.close()

Thanks Thomas,
Actually I tweaked my code a little bit by thinking how I can make it simpler (four for loops are too much, no?) so with the following code I solved my problem(dropped the 'department' and 'phones' because some other issues):
import requests
from bs4 import BeautifulSoup
import csv
url = 'https://www.tamarackgc.com/club-contacts'
source = requests.get(url).text
soup = BeautifulSoup (source, 'lxml')
f = open("contactslot.csv", "w+")
csv_writer = csv.writer (f)
csv_writer.writerow(["Name", "Position"])
infomation = soup.find_all("div", class_="well profile")
info = information[0]
for info in information:
contact_name = info.find_all("div", class_="col-md-7 col-xs-10")
names = contact_name[0].strong
name = names.text
print (name)
position_name = info.find_all("div", class_="field-content")
position = position_name[0].text
print(position)
print("")
csv_writer.writerow([name, position])
f.close()

Hi Babr welcome to use python. Your answer is good and here is one more little thing you may can do better.
use find replace find_all if you just want one element
import requests
from bs4 import BeautifulSoup
import csv
url = 'https://www.tamarackgc.com/club-contacts'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
f = open("/Users/mingjunliu/Downloads/contacts.csv", "w+")
csv_writer = csv.writer(f)
csv_writer.writerow(["Name", "Position"])
for info in soup.find_all("div", class_="well profile"):
contact_name = info.find("div", class_="col-md-7 col-xs-10")
names = contact_name.strong
name = names.text
print(name)
position_name = info.find("div", class_="field-content")
position = position_name.text
print(position)
print("")
csv_writer.writerow([name, position])
f.close()
And the reason you need to drop phone and department is because of the bad website structure. It's not your fault.

Python download image from HTTPS aspx

I am trying to download some images from NASS Case Viewer. An example of a case is
https://www-nass.nhtsa.dot.gov/nass/cds/CaseForm.aspx?xsl=main.xsl&CaseID=149006692
The link to the image viewer for this case is
https://www-nass.nhtsa.dot.gov/nass/cds/GetBinary.aspx?ImageView&ImageID=497001669&Desc=FRONT&Title=Vehicle+1+-+Front&Version=1&Extend=jpg
which may not be viewable, I assume because of the https. However, this is simply the Front second image.
The actual link to the image is (or should be?)
https://www-nass.nhtsa.dot.gov/nass/cds/GetBinary.aspx?Image&ImageID=497001669&CaseID=149006692&Version=1
This will simply download aspx binaries.
My problem is that I do not know how to store these binaries to proper jpg files.
Example of code I've tried is
import requests
test_image = "https://www-nass.nhtsa.dot.gov/nass/cds/GetBinary.aspx?Image&ImageID=497001669&CaseID=149006692&Version=1"
pull_image = requests.get(test_image)
with open("test_image.jpg", "wb+") as myfile:
myfile.write(str.encode(pull_image.text))
But this does not result in a proper jpg file. I've also inspected pull_image.raw.read() and saw that it's empty.
What could be the issue here? Are my URL's improper? I've used Beautifulsoup to put these URLs together and reviewed them by inspecting the HTML code from a few pages.
Am I saving the binaries incorrectly?

.text decodes the response content to string, so your imge file will be corrupted.
Instead you should use .content which holds the binary response content.
import requests
test_image = "https://www-nass.nhtsa.dot.gov/nass/cds/GetBinary.aspx?Image&ImageID=497001669&CaseID=149006692&Version=1"
pull_image = requests.get(test_image)
with open("test_image.jpg", "wb+") as myfile:
myfile.write(pull_image.content)
.raw.read() also returns bytes, but in order to use it you must set the stream parameter to True.
pull_image = requests.get(test_image, stream=True)
with open("test_image.jpg", "wb+") as myfile:
myfile.write(pull_image.raw.read())

I wanted to follow up on #t.m.adam 's answer to provide a complete answer for anyone who is interested in using this data for their own projects.
Here is my code to pull all images for a sample of Case IDs. It's fairly un-clean code, but I think it gives you what you may need to get started.
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
CaseIDs = [149006673, 149006651, 149006672, 149006673, 149006692, 149006693]
url_part1 = 'https://www-nass.nhtsa.dot.gov/nass/cds/'
data = []
with requests.Session() as sesh:
for caseid in tqdm(CaseIDs):
url_full = f"https://www-nass.nhtsa.dot.gov/nass/cds/CaseForm.aspx?ViewText&CaseID={caseid}&xsl=textonly.xsl&websrc=true"
#print(url_full)
source = sesh.get(url_full).text
soup = BeautifulSoup(source, 'lxml')
tr_tags = soup.find_all('tr', style="page-break-after: always")
for tag in tr_tags:
#print(tag)
"""
try:
vehicle = [x for x in tag.text.split('\n') if 'Vehicle' in x][0] ## return the first element
except IndexError:
vehicle = [x for x in tag.text.split('\n') if 'Scene' in x][0] ## return the first element
"""
tag_list = tag.find_all('tr', class_ = 'label')
test = [x.find('td').text for x in tag_list]
#print(test)
img_id, img_type, part_name = test
img_id = img_id.replace(":", "")
img = tag.find('img')
#part_name = img.get('alt').replace(":", "").replace("/", "")
part_name = part_name.replace(":", "").replace("/", "")
image_name = " ".join([img_type, part_name, img_id]) + ".jpg"
url_src = img.get('src')
img_url = url_part1 + url_src
print(img_url)
pull_image = sesh.get(img_url, stream=True)
with open(image_name, "wb+") as myfile:
myfile.write(pull_image.content)

loop stuck on first page

Been using beautiful soup to iterate through pages, but for whatever reason I can't get the loop to advance beyond the first page. it seems like it should be easy because it's a text string, but it seems to loop back, maybe it's my structure not my text string?
Here's what I have:
import csv
import urllib2
from bs4 import BeautifulSoup
f = open('nhlstats.csv', "w")
groups=['points', 'shooting', 'goaltending', 'defensive', 'timeonice', 'faceoffs', 'minor-penalties', 'major-penalties']
year = ["2016", "2015","2014","2013","2012"]
for yr in year:
for gr in groups:
url = "http://www.espn.com/nhl/statistics/player/_/stat/points/year/"+str(yr)
#www.espn.com/nhl/statistics/player/_/stat/points/year/2014/
page = urllib2.urlopen(url)
soup=BeautifulSoup(page, "html.parser")
pagecount = soup.findAll(attrs= {"class":"page-numbers"})[0].string
pageliteral = int(pagecount[5:])
for i in range(0,pageliteral):
number = int(((i*40) + 1))
URL = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/"+str(yr) + "/count/"+str(number)
page = urllib2.urlopen(url)
soup=BeautifulSoup(page, "html.parser")
for tr in soup.select("#my-players-table tr[class*=player]"):
row =[]
for ob in range(1,15):
player_info = tr('td')[ob].get_text(strip=True)
row.append(player_info)
f.write(str(yr) +","+",".join(row) + "\n")
f.close()
this gets the same first 40 records over and over.
I tried using this solution as an if and did find that doing
prevLink = soup.select('a[rel="nofollow"]')[0]
newurl = "http:" + prevLink.get('href')
did work better, but I'm not sure how to do the loop in such a way that it advances? possibly just tired but my loop there still just goes to the next set of records and gets stuck on that one. please help me fix my loop
UPDATE
my formatting was lost in the copy paste, my actual code looks like:
import csv
import urllib2
from bs4 import BeautifulSoup
f = open('nhlstats.csv', "w")
groups=['points', 'shooting', 'goaltending', 'defensive', 'timeonice', 'faceoffs', 'minor-penalties', 'major-penalties']
year = ["2016", "2015","2014","2013","2012"]
for yr in year:
for gr in groups:
url = "http://www.espn.com/nhl/statistics/player/_/stat/points/year/"+str(yr)
#www.espn.com/nhl/statistics/player/_/stat/points/year/2014/
page = urllib2.urlopen(url)
soup=BeautifulSoup(page, "html.parser")
pagecount = soup.findAll(attrs= {"class":"page-numbers"})[0].string
pageliteral = int(pagecount[5:])
for i in range(0,pageliteral):
number = int(((i*40) + 1))
URL = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/"+str(yr) + "/count/"+str(number)
page = urllib2.urlopen(url)
soup=BeautifulSoup(page, "html.parser")
for tr in soup.select("#my-players-table tr[class*=player]"):
row =[]
for ob in range(1,15):
player_info = tr('td')[ob].get_text(strip=True)
row.append(player_info)
f.write(str(yr) +","+",".join(row) + "\n")
f.close()

Your code indenting was mostly at fault. Also it would be wise to actually use the CSV library you imported, this will automatically wrap the player names in quotes to avoid any commas inside from ruining the csv structure.
This works by looking for the link to the next page and extracting the starting count. This is then used to build your the next page get. If no next page can be found, it moves to the next year group. Note, the count is not a page count but a starting entry count.
import csv
import urllib2
from bs4 import BeautifulSoup
groups= ['points', 'shooting', 'goaltending', 'defensive', 'timeonice', 'faceoffs', 'minor-penalties', 'major-penalties']
year = ["2016", "2015", "2014", "2013", "2012"]
with open('nhlstats.csv', "wb") as f_output:
csv_output = csv.writer(f_output)
for yr in year:
for gr in groups:
start_count = 1
while True:
#print "{}, {}, {}".format(yr, gr, start_count) # show progress
url = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/{}/count/{}".format(yr, start_count)
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
for tr in soup.select("#my-players-table tr[class*=player]"):
row = [yr]
for ob in range(1, 15):
player_info = tr('td')[ob].get_text(strip=True)
row.append(player_info)
csv_output.writerow(row)
try:
start_count = int(soup.find(attrs= {"class":"page-numbers"}).find_next('a')['href'].rsplit('/', 1)[1])
except:
break
Using with will also automatically close your file at the end.
This would give you a csv file starting as follows:
2016,"Patrick Kane, RW",CHI,82,46,60,106,17,30,1.29,287,16.0,9,17,20
2016,"Jamie Benn, LW",DAL,82,41,48,89,7,64,1.09,247,16.6,5,17,13
2016,"Sidney Crosby, C",PIT,80,36,49,85,19,42,1.06,248,14.5,9,10,14
2016,"Joe Thornton, C",SJ,82,19,63,82,25,54,1.00,121,15.7,6,8,21

You are changing the URL many times before you are opening it the first time, due to an indentation error. Try this:
for gr in groups:
url = "...some_url..."
page = urllib2.urlopen(url)
...everything else should be indented....

File Storage Problem with Python Web Crawler

I am screen scraping data using a web crawler and storing the results - (tweets from a twitter page) as separate html files for each user I'm crawling. I intend to later parse the html files and store the data into a database for analysis. However, I am having a bizarre problem.
When I run the following program - a small snippet from the overall crawler - I am able to get a separate html file for each follower:
import re
import urllib2
import twitter
start_follower = "NYTimesKrugman"
depth = 3
searched = set()
api = twitter.Api()
def crawl(follower, in_depth):
if in_depth > 0:
searched.add(follower)
directory = "C:\\Python28\\Followertest1\\" + follower + ".html"
output = open(directory, 'a')
output.write(follower)
output.write('\n\n')
users = api.GetFriends(follower)
names = set([str(u.screen_name) for u in users])
names -= searched
for name in list(names)[0:5]:
crawl(name, in_depth-1)
crawl(start_follower, depth)
for x in searched:
print x
print "Program is completed."
However, when I run the full crawler, I do not get a separate file for each follower:
import twitter
import urllib
from BeautifulSoup import BeautifulSoup
import re
import time
start_follower = "NYTimeskrugman"
depth = 2
searched = set()
api = twitter.Api()
def add_to_U(user):
U.append(user)
def site(follower): #creates a twitter site url in string format based on the follower username
followersite = "http://mobile.twitter.com/" + follower
return followersite
def getPage(follower): #obtains access to a webapge
url = site(follower)
response = urllib.urlopen(url)
return response
def getSoup(response): #creates the parsing module
html = response.read()
soup = BeautifulSoup(html)
return soup
def gettweets(soup, output):
tags = soup.findAll('div', {'class' : "list-tweet"})#to obtain tweet of a follower
for tag in tags:
a = tag.renderContents()
b = str (a)
output.write(b)
output.write('\n\n')
def are_more_tweets(soup):#to check whether there is more than one page on mobile twitter
links = soup.findAll('a', {'href': True}, {id: 'more_link'})
for link in links:
b = link.renderContents()
test_b = str(b)
if test_b.find('more') != -1:
return True
return False
def getnewlink(soup): #to get the link to go to the next page of tweets on twitter
links = soup.findAll('a', {'href': True}, {id : 'more_link'})
for link in links:
b = link.renderContents()
if str(b) == 'more':
c = link['href']
d = 'http://mobile.twitter.com' +c
return d
def crawl(follower, in_depth): #main method of sorts
if in_depth > 0:
searched.add(follower)
directory = "C:\\Python28\\Followertest2\\" + follower + ".html"
output = open(directory, 'a')
output.write(follower)
output.write('\n\n')
a = getPage(follower)
soup = getSoup(a)
gettweets(soup, output)
tweets = are_more_tweets(soup)
while(tweets):
b = getnewlink(soup)
red = urllib.urlopen(b)
html = red.read()
soup = BeautifulSoup(html)
gettweets(soup, output)
tweets = are_more_tweets(soup)
users = api.GetFriends(follower)
names = set([str(u.screen_name) for u in users])
names -= searched
for name in list(names)[0:5]:
print name
crawl(name, in_depth - 1)
crawl(start_follower, depth)
print("Program done. Look at output file.")
More specifically, I seem to get a separate html file for about the first five followers and then no new files appear to be created. Any help would be appreciated!

The depth value is different between the snippet and the full code (you're only going to get one level of recursion in the full code). Also, you only grab the first five names from the followers list: for name in list(names)[0:5]: So you get six people total: the starting follower and their first five friends.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

problem when download a image from website using beautiful soap - python

i found my problem in some movie name there was ":" and as you knwo you cant user ":" in file names i fix the code with .replace() movie_name.replace(":", "")

Once you get a response, check if it's empty before writing to disk. Might need to retry or the link may be bad.

Related

Web-scrape. BeautifulSoup. Multiple Pages. How on earth would you do that?

Multiple for loops and csv files

Python download image from HTTPS aspx

loop stuck on first page

File Storage Problem with Python Web Crawler

Categories

Resources