CSV manipulation with Python 3

I've created a script in Python 3 to scrape data from 4 different pages of a site. It works fine, but when I try to get the result into a CSV file, something goes wrong and it only writes the info from the last page. Could anybody help me out with this? I've attached the script for your consideration. Dying to know what I'm doing wrong.
import csv
import requests
from bs4 import BeautifulSoup

def web_crawler(mpage):
    page = 1
    while page <= mpage:
        url = requests.get("http://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=San%20Francisco%2C%20CA&page=" + str(page))
        soup = BeautifulSoup(url.text, 'html.parser')
        x = soup.findAll(class_='info')
        gist = []
        for z in x:
            Item = z.findAll(class_="business-name")
            for Title in Item:
                Name = Title.text
            Patta = z.findAll(class_="adr")
            for Thikana in Patta:
                Address = Thikana.text
            Number = z.findAll(class_="phones")
            for Token in Number:
                Phone = Token.text
            metco = (Name, Address, Phone)
            print(metco)
            gist.append(metco)
        outfile = open('data.csv', 'w', newline='')
        writer = csv.writer(outfile)
        writer.writerow(["Name", "Address", "Phone"])
        writer.writerows(gist)
        page += 1

web_crawler(4)

You are overwriting your file on every pass through the while loop:
outfile = open('data.csv', 'w', newline='')
Opening in 'w' mode truncates data.csv each time, so only the last page's rows survive. Try moving it (along with the header row) out of the main loop.
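
A minimal sketch of that fix, assuming the same page structure: open the file and write the header once, then write each page's rows inside the loop.

import csv
import requests
from bs4 import BeautifulSoup

def web_crawler(mpage):
    # Open once in 'w' mode and write the header once; every page's rows
    # are then written to the same open file instead of truncating it.
    with open('data.csv', 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["Name", "Address", "Phone"])
        for page in range(1, mpage + 1):
            resp = requests.get("http://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=San%20Francisco%2C%20CA&page=" + str(page))
            soup = BeautifulSoup(resp.text, 'html.parser')
            for z in soup.find_all(class_='info'):
                name = z.find(class_="business-name")
                address = z.find(class_="adr")
                phone = z.find(class_="phones")
                if name and address and phone:  # skip incomplete listings
                    writer.writerow([name.text, address.text, phone.text])

web_crawler(4)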

Related

Trying to loop through URL's and download images from these webpages

I have a nice URL structure, and I want to iterate through the URLs and download all the images from those webpages. I am trying to use BeautifulSoup to get the job done, along with the requests library.
Here is the URL - https://sixmorevodka.com/#&gid=0&pid={i}, and I want 'i' to iterate from, say, 1 to 100 for this example.
from bs4 import BeautifulSoup as soup
import requests, contextlib, re, os

@contextlib.contextmanager
def get_images(url: str):
    d = soup(requests.get(url).text, 'html.parser')
    yield [[i.find('img')['src'], re.findall('(?<=\.)\w+$', i.find('img')['alt'])[0]] for i in d.find_all('a') if re.findall('/image/\d+', i['href'])]

n = 100  # end value
for i in range(n):
    with get_images(f'https://sixmorevodka.com/#&gid=0&pid={i}') as links:
        print(links)
        for c, [link, ext] in enumerate(links, 1):
            with open(f'ART/image{i}{c}.{ext}', 'wb') as f:
                f.write(requests.get(f'https://sixmorevodka.com{link}').content)
I think I messed something up either in the yield line or in the very last write line. Could someone help me out, please? I am using Python 3.7.
Looking at the structure of that webpage, your gid parameter is invalid. To see for yourself, open a new tab and navigate to https://sixmorevodka.com/#&gid=0&pid=22.
You'll notice that none of the portfolio images are displayed. gid can be a value from 1 to 5, denoting the grid to which an image belongs.
Regardless, your current scraping methodology is inefficient and puts undue traffic on the website. Instead, you only need to make the request once, and extract the URLs actually containing the images using the ilb portfolio__grid__item class selector.
Then you can iterate over and download those URLs, which are directly the sources of the images.
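
A sketch of that single-request approach; it assumes the elements matched by that class are <a> tags whose href attribute points at the image, which may need adjusting against the real markup.

from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests, os

os.makedirs('ART', exist_ok=True)
d = BeautifulSoup(requests.get('https://sixmorevodka.com/').text, 'html.parser')

# One request for the page, then every element carrying the classes named
# in the answer above; the <a> tag name and href attribute are assumptions.
for c, a in enumerate(d.select('a.ilb.portfolio__grid__item'), 1):
    img_url = urljoin('https://sixmorevodka.com/', a['href'])
    ext = img_url.rsplit('.', 1)[-1]  # crude extension guess from the URL
    with open(f'ART/image{c}.{ext}', 'wb') as f:
        f.write(requests.get(img_url).content)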

HTML hidden elements

I'm trying to code a little "GPS", and I couldn't use the Google API because of the daily request limit.
I decided to use the site "viamichelin", which gives me the distance between two addresses. I wrote a little piece of code to build all the URLs I needed, like this:
import pandas
import numpy as np

df = pandas.read_excel('C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Clients')
df2 = pandas.read_excel('C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Agences')
matrix = df.as_matrix(columns=None)
clients = np.squeeze(np.asarray(matrix))
matrix2 = df2.as_matrix(columns=None)
agences = np.squeeze(np.asarray(matrix2))
compteagences = 0
comptetotal = 0
for j in agences:
    compteclients = 0
    for i in clients:
        print agences[compteagences]
        print clients[compteclients]
        url = 'https://fr.viamichelin.be/web/Itineraires?departure=' + agences[compteagences] + '&arrival=' + clients[compteclients] + '&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption='
        print url
        compteclients += 1
        comptetotal += 1
    compteagences += 1
All my data is in Excel, which is why I used the pandas library. I have all the URLs needed for my project.
However, I would like to extract the number of kilometers needed, but there's a little problem: the information I need doesn't appear in the source code, so I can't extract it with Python... The site is presented like this:
[screenshot: Michelin]
When I click on "Inspect" I can find the information I need (on the left), but I can't find it in the source code (on the right)... Can someone provide me some help?
[screenshot: Itinerary]
I have already tried this, without success:
import os
import csv
import requests
from bs4 import BeautifulSoup
requete = requests.get("https://fr.viamichelin.be/web/Itineraires?departure=Rue%20Lebeau%2C%20Liege%2C%20Belgique&departureId=34MTE1Mmc2NzQwMDM0NHoxMDU1ZW44d2NOVEF1TmpNek5ERT1jTlM0MU5qazJPQT09Y05UQXVOak16TkRFPWNOUzQxTnpBM01nPT1jTlRBdU5qTXpOREU9Y05TNDFOekEzTWc9PTBhUnVlIExlYmVhdQ==&arrival=Rue%20Rys%20De%20Mosbeux%2C%20Trooz%2C%20Belgique&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption=")
page = requete.content
soup = BeautifulSoup(page, "html.parser")
print soup
Looking at the inspector for the page, the actual routing is done via a JavaScript invocation to this rather long URL.
The data you need seems to be in that response, starting from _scriptLoaded(. (Since it's a JavaScript object literal, you can use Python's built-in JSON library to load the data into a dict.)
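
A sketch of that parsing step, with routing_url as a hypothetical placeholder for the long request URL found in the browser inspector (it is not reproduced here):

import json
import requests

routing_url = '...'  # hypothetical: paste the long URL from the inspector's Network tab

body = requests.get(routing_url).text

# The response is a JSONP-style call, _scriptLoaded({...});
# strip the wrapper and hand the object literal to the json module.
start = body.index('_scriptLoaded(') + len('_scriptLoaded(')
end = body.rindex(')')
data = json.loads(body[start:end])
print(type(data))  # a dict, ready to dig the kilometre figure out of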

Extracting text/numbers from HTML list using Python requests and lxml

I am trying to extract the 'Seller rank' from items on Amazon using Python requests and lxml. So:
<li id="SalesRank">
<b>Amazon Bestsellers Rank:</b>
957,875 in Books (See Top 100 in Books)
From this example, 957875 is the number I want to extract.
(Please note, the actual HTML has about 100 blank lines between 'Amazon Bestsellers Rank:' and '957,875'. I'm unsure if this is affecting my result.)
My current Python code is set up like so:
import re
import requests
from lxml import html

page = requests.get('http://www.amazon.co.uk/Lakeland-Expanding-Together-Compartments-Organiser/dp/B00A7Q77GM/ref=sr_1_1?s=kitchen&ie=UTF8&qid=1452504370&sr=1-1-spons&psc=1')
tree = html.fromstring(page.content)
salesrank = tree.xpath('//li[@id="SalesRank"]/text()')
print 'Sales Rank:', salesrank
and the printed output is
Sales Rank: []
I was expecting to receive the full list data, including all the blank lines, which I would later parse.
Am I correct in assuming that /text() is not the correct use in this instance and I need to put something else?
Any help is greatly appreciated.
You are getting an empty list because in one call of the URL you are not getting the complete data of the web page. For that, you have to stream the URL and fetch the data in small chunks, then look for the required element in each non-empty chunk. The code for that is:
import requests as rq
import re
from bs4 import BeautifulSoup as bs

r = rq.get('http://www.amazon.in/gp/product/0007950306/ref=s9_al_bw_g14_i1?pf_rd_m=A1VBAL9TL5WCBF&pf_rd_s=merchandised-search-3&pf_rd_r=1XBKB22RGT2HBKH4K2NP&pf_rd_t=101&pf_rd_p=798805127&pf_rd_i=4143742031', stream=True)
for chunk in r.iter_content(chunk_size=1024):
    if chunk:
        data = chunk
        soup = bs(data)
        elem = soup.find_all('li', attrs={'id': 'SalesRank'})
        if elem != []:
            s = re.findall('#[\d+,*]*\sin', str(elem[0]))
            print s[0].split()[0]
            break
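
As a side note on the /text() question itself: that XPath is valid, and the text nodes it returns (blank lines included) just need joining, after which a regex can pull the number out. A sketch, assuming the fetched page actually contains the SalesRank element:

import re
import requests
from lxml import html

page = requests.get('http://www.amazon.co.uk/Lakeland-Expanding-Together-Compartments-Organiser/dp/B00A7Q77GM/ref=sr_1_1?s=kitchen&ie=UTF8&qid=1452504370&sr=1-1-spons&psc=1')
tree = html.fromstring(page.content)

# /text() returns every text node under the <li>, blank lines and all;
# join them and let a regex find the first comma-separated number.
text = ' '.join(tree.xpath('//li[@id="SalesRank"]/text()'))
match = re.search(r'([\d,]+)\s+in', text)
if match:
    print('Sales Rank:', match.group(1).replace(',', ''))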

Python 3.0/BeautifulSoup: Looping through parsed data, choosing a specific link, opening that link, and repeating

I am trying to do the following: read through a website, choose the 18th link on that site, open that link, and repeat that 7 times. But I am not very advanced in programming, so I am stuck at trying to open the 18th link. How can I do that? My code is this:
import urllib.request
import io
from bs4 import BeautifulSoup

u = urllib.request.urlopen("http://xxxxxxxx.com/tsugi/mod/python-data/data/known_by_Yong.html", data=None)
f = io.TextIOWrapper(u, encoding='utf-8')
text = f.read()
soup = BeautifulSoup(text)
print(soup.find_all("a"))
My result looks like this, e.g.:
[<a href="http://xxxxxxxx.com/tsugi/mod/python-data/data/known_by_Keiva.html">Keiva</a>, <a href="http://xxxxx.com/tsugi/mod/python-data/data/known_by_Rowyn.html">Rowyn</a>]
An HTML document with names/links.
While I don't expect anybody to guide me through the whole code, where can I look up what I need?
Here are my main questions:
How can I make the program count the names/links?
How can I open the 18th link in the list?
How can I repeat that 7 times?
Thanks for your support in advance!!
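
A sketch of the pattern being asked about, assuming "the 18th link" means index 17 in the find_all list, and reusing the placeholder URL from the question:

import urllib.request
from bs4 import BeautifulSoup

url = "http://xxxxxxxx.com/tsugi/mod/python-data/data/known_by_Yong.html"
for _ in range(7):                            # repeat 7 times
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all("a")
    print(len(links), "links on this page")   # counting the names/links
    url = links[17]['href']                   # the 18th link (0-based index 17)
    print("following:", url)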

Wrong output using tree.xpath

I am a beginner at data scraping. I want to extract all the marathon event names from a website, Wikipedia.
For this I have written a small piece of code:
from lxml import html
import requests

page = requests.get('https://en.wikipedia.org/wiki/List_of_marathon_races')
tree = html.fromstring(page.text)
events = tree.xpath('//td/a[@class="new"]/text()')
print events
The problem is that when I execute this code, an empty list comes out as the output. What is the problem with this code? I would be grateful if anyone could help me correct it or find the mistakes in my code.
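
One possible explanation, offered as a guess rather than a confirmed diagnosis: on Wikipedia, class="new" marks red links to pages that do not exist yet, so restricting the XPath to that class can easily match nothing. A looser expression that takes every link inside a table cell, as a sketch:

from lxml import html
import requests

page = requests.get('https://en.wikipedia.org/wiki/List_of_marathon_races')
tree = html.fromstring(page.text)

# Take every link inside a table cell rather than only class="new" ones;
# filtering down to actual event rows is left as a follow-up step.
events = tree.xpath('//td/a/text()')
print(events)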
