I'm currently working on a project for myself, and that includes scraping this specific website.
My code currently looks like this:
for i in range(0, 4):
    my_url = 'https://www.kickante.com.br/campanhas-crowdfunding?page=' + str(i)
    uclient = ureq(my_url)
    page_html = uclient.read()
    uclient.close()
    page_soup = soup(page_html, 'html.parser')
    containers = page_soup.find_all("div", {"class": "campaign-card-wrapper views-row"})
    for container in containers:
        # Finding the campaign titles
        titleCampaignBruto = container.div.div.a.img["title"].replace('Crowdfunding para: ', '')
        titleCampaignParsed = titleCampaignBruto.strip().replace(",", ";")
        # Finding the amount raised
        arrecadadoFind = container.div.find_all("div", {"class": "funding-raised"})
        arrecadado = arrecadadoFind[0].text.strip().replace(",", ".")
        # Number of donors
        doadoresBruto = container.div.find_all('span', {"class": "contributors-value"})
        doadoresParsed = doadoresBruto[0].text.strip().replace(",", ";")
        # Campaign funding target
        fundingGoal = container.div.find_all('div', {"class": "funding-progress"})
        quantoArrecadado = fundingGoal[0].text.strip().replace(",", ";")
        # Campaign description
        descricaoBruta = container.div.find_all('div', {"class": "field field-name-field-short-description field-type-text-long field-label-hidden"})
        descricaoParsed = descricaoBruta[0].text.strip().replace(",", ";")
        # Campaign link
        linkCampanha = container.div.find_all('href')
        print("Título da campanha: " + titleCampaignParsed)
        print("Valor da campanha: " + arrecadado)
        print("Doadores: " + doadoresParsed)
        print("target: " + quantoArrecadado)
        print("descricao: " + descricaoParsed)
        # f was opened earlier (not shown in the question)
        f.write(titleCampaignParsed + "," + arrecadado + "," + doadoresParsed + "," + quantoArrecadado + "," + descricaoParsed.replace(",", ";") + "\n")
f.close()
When I open the csv file it generated, I see that some lines are broken where they shouldn't be (example: line 31 of the csv file). That line should be part of the previous line (line 30), as the body of the description.
Does anyone have an idea of what could be causing this? Thanks in advance.
Some of the text you're writing to CSV might contain newlines. You can remove them like so:
csv_line_entries = [
    titleCampaignParsed, arrecadado, doadoresParsed,
    quantoArrecadado, descricaoParsed.replace(",", ";")
]
csv_line = ','.join([
    entry.replace('\n', ' ') for entry in csv_line_entries
])
f.write(csv_line + '\n')
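If you're open to restructuring a little, the standard library's csv module handles this class of problem for you: csv.writer quotes any field that contains a comma, quote, or newline, so embedded separators can't break the row structure. A minimal sketch using the variables from the question (the filename is just a placeholder):

import csv

with open('campanhas.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    # each field is quoted automatically if it contains ',' or '\n'
    writer.writerow([titleCampaignParsed, arrecadado, doadoresParsed,
                     quantoArrecadado, descricaoParsed])

Note that csv keeps embedded newlines inside quoted fields (spreadsheet applications parse them correctly); if you want strictly one physical line per record, keep the replace('\n', ' ') as well.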
Cause of the bug
The strip() method removes only leading and trailing newlines/whitespace.
import bs4

soup = bs4.BeautifulSoup('<p>Whatever\nelse\n</p>', 'html.parser')
soup.find('p').text.strip()
# returns 'Whatever\nelse'
Notice that the inner \n is not removed.
You have newlines in the middle of the text. strip() only removes whitespace at the start and end of a string, so you need replace('\n', '') as well; it replaces every newline (\n) with the empty string ('').
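A quick illustration of the difference:

text = 'Whatever\nelse'
print(text.strip())             # 'Whatever\nelse' - the inner newline survives
print(text.replace('\n', ' '))  # 'Whatever else'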
Related
I have a URL as follows: https://www.vq.com/36851082/?p=1. I want to create a file named list_of_urls.txt that contains the URL links from p=1 to p=20, each separated by a space, and save it as a txt file.
Here is what I have tried, but it only prints the last one:
url = "https://www.vq.com/36851082/?p="
list_of_urls = []
for page in range(20):
    list_of_urls = url + str(page)
print(list_of_urls)
The expected txt file would contain this, all on one line:
https://www.vq.com/36851082/?p=1 https://www.vq.com/36851082/?p=2 ... https://www.vq.com/36851082/?p=20
This is a good occasion to use f-strings, available since Python 3.6 and fully described in PEP 498 -- Literal String Interpolation.
url_base = "https://www.vq.com/36851082/?p="

with open('your.txt', 'w') as f:
    for page in range(1, 20 + 1):
        f.write(f'{url_base}{page} ')
        # f.write('{}{} '.format(url_base, page))
        # f.write('{0}{1} '.format(url_base, page))
        # f.write('{u}{p} '.format(u=url_base, p=page))
        # f.write('{u}{p} '.format(**{'u': url_base, 'p': page}))
        # f.write('%s%s ' % (url_base, page))
Notice the space character at the end of each formatting expression.
Be careful with range - it starts from 0 by default, and the last number of the range is not included. Hence, if you want the numbers 1-20 you need to use range(1, 21).
url_template = "https://www.vq.com/36851082/?p={page}"
urls = [url_template.format(page=page) for page in range(1, 21)]

with open("/tmp/urls.txt", "w") as f:
    f.write(" ".join(urls))
Try this :)
url = "https://www.vq.com/36851082/?p="
list_of_urls = ""
for page in range(1, 21):  # range(20) would give 0-19; the expected output is 1-20
    list_of_urls = list_of_urls + url + str(page) + " "
print(list_of_urls)
Not sure if you want them all on one line inside your file, but if so:
url = "https://www.vq.com/36851082/?p=%i"
with open("expected.txt", "w") as f:
    f.write(' '.join([url % i for i in range(1, 21)]))
Output:
https://www.vq.com/36851082/?p=1 https://www.vq.com/36851082/?p=2 https://www.vq.com/36851082/?p=3 https://www.vq.com/36851082/?p=4 https://www.vq.com/36851082/?p=5 https://www.vq.com/36851082/?p=6 https://www.vq.com/36851082/?p=7 https://www.vq.com/36851082/?p=8 https://www.vq.com/36851082/?p=9 https://www.vq.com/36851082/?p=10 https://www.vq.com/36851082/?p=11 https://www.vq.com/36851082/?p=12 https://www.vq.com/36851082/?p=13 https://www.vq.com/36851082/?p=14 https://www.vq.com/36851082/?p=15 https://www.vq.com/36851082/?p=16 https://www.vq.com/36851082/?p=17 https://www.vq.com/36851082/?p=18 https://www.vq.com/36851082/?p=19 https://www.vq.com/36851082/?p=20
This one also works, thanks to my colleague!
url = "https://www.vq.com/36851082/?p=%d"
result = " ".join([url % (x + 1) for x in range(20)])
with open("list_of_urls.txt", "w") as f:
    f.write(result)
I'm an absolute beginner, but with YouTube and some websites I've written a crawler for the German website Immoscout24.
My problem: the crawler works fine if all attributes exist. But if a page is missing one of them (e.g. the "pre" element behind "beschreibung_container"), I get "NameError: name 'beschreibung' is not defined". How can I make it write an empty string ("") into my result list (csv) when the attribute does not exist, and continue crawling?
for number in numbers:
    my_url = "https://www.immobilienscout24.de/expose/%s#/" % number
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    containers = page_soup.find_all("div", {"id": "is24-content"})
    filename = "results_" + current_datetime + ".csv"
    f = open(filename, "a")
    headers = "Objekt-ID##Titel##Adresse##Merkmale##Kosten##Bausubstanz und Energieausweis##Beschreibung##Ausstattung##Lage\n"
    f.write(headers)
    for container in containers:
        try:
            objektid_container = container.find_all("div", {"class": "is24-scoutid__content padding-top-s"})
            objektid = objektid_container[0].get_text().strip()
            titel_container = container.find_all("h1", {"class": "font-semibold font-xl margin-bottom margin-top-m palm-font-l"})
            titel = titel_container[0].get_text().strip()
            adresse_container = container.find_all("div", {"class": "address-block"})
            adresse = adresse_container[0].get_text().strip()
            criteria_container = container.find_all("div", {"class": "criteriagroup criteria-group--two-columns"})
            criteria = criteria_container[0].get_text().strip()
            preis_container = container.find_all("div", {"class": "grid-item lap-one-half desk-one-half padding-right-s"})
            preis = preis_container[0].get_text().strip()
            energie_container = container.find_all("div", {"class": "criteriagroup criteria-group--border criteria-group--two-columns criteria-group--spacing"})
            energie = energie_container[0].get_text().strip()
            beschreibung_container = container.find_all("pre", {"class": "is24qa-objektbeschreibung text-content short-text"})
            beschreibung = beschreibung_container[0].get_text().strip()
            ausstattung_container = container.find_all("pre", {"class": "is24qa-ausstattung text-content short-text"})
            ausstattung = ausstattung_container[0].get_text().strip()
            lage_container = container.find_all("pre", {"class": "is24qa-lage text-content short-text"})
            lage = lage_container[0].get_text().strip()
        except:
            print("some mistake")
            pass
        f.write(objektid + "##" + titel + "##" + adresse + "##" + criteria.replace(" ", ";") + "##" + preis.replace(" ", ";") + "##" + energie.replace(" ", ";") + "##" + beschreibung.replace("\n", " ") + "##" + ausstattung.replace("\n", " ") + "##" + lage.replace("\n", " ") + "\n")
    f.close()
EDIT
The first problem is solved. Another problem: my result list repeats the headers in every row (see the linked screenshot). How can I make "Objekt-ID" and the other headings appear only in row 1?
For each variable, you can simply do the following:
obj = container.find_all("div", {"class": "xxxxx"}) or ""
objid = obj[0].get_text().strip() if obj else ""
The first line defaults the value to the empty string "" if find_all returns an empty list or None. The second line does the same thing, but checks for the existence of a value first and then applies the if/else condition.
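Since the question repeats this pattern for nine different fields, it may be worth wrapping it in a small helper; a sketch along these lines (safe_text is just a suggested name):

def safe_text(container, tag, css_class):
    """Return the stripped text of the first match, or '' if nothing matches."""
    found = container.find_all(tag, {"class": css_class})
    return found[0].get_text().strip() if found else ""

# usage, mirroring two of the question's variables:
objektid = safe_text(container, "div", "is24-scoutid__content padding-top-s")
beschreibung = safe_text(container, "pre", "is24qa-objektbeschreibung text-content short-text")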
I think you need to encapsulate each variable in a try-except block, e.g.:
try:
    objektid_container = container.find_all("div", {"class": "is24-scoutid__content padding-top-s"})
    objektid = objektid_container[0].get_text().strip()
except:
    objektid = ""
Do this for all variables.
For the second issue, move your header-writing code outside the loop. Remove this code:
filename = "results_" + current_datetime + ".csv"
f = open(filename, "a")
headers = "Objekt-ID##Titel##Adresse##Merkmale##Kosten##Bausubstanz und Energieausweis##Beschreibung##Ausstattung##Lage\n"
f.write(headers)
and add it before:
for number in numbers:
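Put together, the file handling would look roughly like this (a sketch keeping the question's variable names):

filename = "results_" + current_datetime + ".csv"
f = open(filename, "a")
f.write("Objekt-ID##Titel##Adresse##Merkmale##Kosten##Bausubstanz und Energieausweis##Beschreibung##Ausstattung##Lage\n")

for number in numbers:
    # ... fetch, parse and f.write() one row per expose, as in the question ...
    pass

f.close()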
Maybe this question was asked before, but since I could not find a proper answer, I dare to ask a similar one. My problem is that I have been trying to scrape a Turkish car sale website named 'Sahibinden'. I use the Jupyter Notebook and Sublime editors. Once I try to get the data written to a csv file, the Turkish letters change to different characters. I tried UTF-8 encoding, '# -*- coding: utf-8 -*-', ISO 8859-9, etc., but I could not solve the problem. The other issue is that Sublime does not create the csv file at all, although I had no problem in the Jupyter Notebook. You will find the csv file output in the image link. If someone can reply, I would appreciate it.
Note: the program works with no problem when I run the print commands in the editors.
Thanks a lot.
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
import unicodedata

with open('result1.csv', 'w') as f:
    f.write('brand, model, year, oil_type, gear, odometer, body, hp, '
            'eng_dim, color, warranty, condition, price, safe, '
            'in_fea, outs_fea, mul_fea, pai_fea, rep_fea, acklm \n')

chrome_path = r"C:\Users\Mike\Desktop\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)

def final_page(fn_20):
    for lur in fn_20:
        driver.get(lur)
        brand = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[3]/span''')
        brand = brand.text
        brand = brand.encode("utf-8")
        print(brand)
        model = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[5]/span''')
        model = model.text
        model = model.encode("utf-8")
        print(model)
        year = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[6]/span''')
        year = year.text
        year = year.encode("utf-8")
        print(year)
        oil_type = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[7]/span''')
        oil_type = oil_type.text
        oil_type = oil_type.encode("utf-8")
        print(oil_type)
        gear = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[8]/span''')
        gear = gear.text
        gear = gear.encode("utf-8")
        print(gear)
        odometer = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[9]/span''')
        odometer = odometer.text
        odometer = odometer.encode("utf-8")
        print(odometer)
        body = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[10]/span''')
        body = body.text
        body = body.encode("utf-8")
        print(body)
        hp = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[11]/span''')
        hp = hp.text
        hp = hp.encode("utf-8")
        print(hp)
        eng_dim = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[12]/span''')
        eng_dim = eng_dim.text
        eng_dim = eng_dim.encode("utf-8")
        print(eng_dim)
        color = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[14]/span''')
        color = color.text
        color = color.encode("utf-8")
        print(color)
        warranty = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[15]/span''')
        warranty = warranty.text
        warranty = warranty.encode("utf-8")
        print(warranty)
        condition = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[19]/span''')
        condition = condition.text
        condition = condition.encode("utf-8")
        print(condition)
        price = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/h3''')
        price = price.text
        price = price.encode("utf-8")
        print(price)
        safe = ''
        safety1 = driver.find_elements_by_xpath('''//div[@id='classifiedProperties']/ul[1]/li[@class='selected']''')
        for ur in safety1:
            ur1 = ur.text
            ur1 = ur1.encode("utf-8")
            safe += ur1 + ', '
        print(safe)
        in_fea = ''
        in_features = driver.find_elements_by_xpath('''//div[@id='classifiedProperties']/ul[2]/li[@class='selected']''')
        for ins in in_features:
            ins1 = ins.text
            ins1 = ins1.encode("utf-8")
            in_fea += ins1 + ', '
        print(in_fea)
        outs_fea = ''
        out_features = driver.find_elements_by_xpath('''//div[@id='classifiedProperties']/ul[3]/li[@class='selected']''')
        for outs in out_features:
            out1 = outs.text
            out1 = out1.encode("utf-8")
            outs_fea += out1 + ', '
        print(outs_fea)
        mul_fea = ''
        mult_features = driver.find_elements_by_xpath('''//div[@id='classifiedProperties']/ul[4]/li[@class='selected']''')
        for mults in mult_features:
            mul = mults.text
            mul = mul.encode("utf-8")
            mul_fea += mul + ', '
        print(mul_fea)
        pai_fea = ''
        paint = driver.find_elements_by_xpath('''//div[@class='classified-pair custom-area ']/ul[1]/li[@class='selected']''')
        for pai in paint:
            pain = pai.text
            pain = pain.encode("utf-8")
            pai_fea += pain + ', '
        print(pai_fea)
        rep_fea = ''
        replcd = driver.find_elements_by_xpath('''//div[@class='classified-pair custom-area']/ul[2]/li[@class='selected']''')
        for rep in replcd:
            repa = rep.text
            repa = repa.encode("utf-8")
            rep_fea += repa + ', '  # was "rep", the WebElement itself, not its text
        print(rep_fea)
        acklm = driver.find_element_by_xpath('''//div[@id='classified-detail']/div[@class='uiBox'][1]/div[@id='classifiedDescription']''')
        acklm = acklm.text
        acklm = acklm.encode("utf-8")
        print(acklm)
        try:
            with open('result1.csv', 'a') as f:
                f.write(brand + ',' + model + ',' + year + ',' + oil_type + ',' + gear + ',' + odometer + ',' + body + ',' + hp + ',' + eng_dim + ',' + color + ',' + warranty + ',' + condition + ',' + price + ',' + safe + ',' + in_fea + ',' + outs_fea + ',' + mul_fea + ',' + pai_fea + ',' + rep_fea + ',' + acklm + '\n')
        except Exception as e:
            print(e)
    driver.close()  # was "driver.close" - without parentheses nothing is called
import codecs
file = codecs.open("utf_test", "w", "utf-8")
file.write(u'\ufeff')
file.write("test with utf-8")
file.write("字符")
file.close()
or this also works for me
with codecs.open("utf_test", "w", "utf-8-sig") as temp:
    temp.write("this is a utf-test\n")
    temp.write(u"test")
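On Python 3 you don't need codecs at all: the built-in open() takes an encoding argument, and "utf-8-sig" writes the byte-order mark that makes Excel detect UTF-8 correctly. A sketch (the header is just an example):

with open("result1.csv", "w", encoding="utf-8-sig") as f:
    f.write("marka, model\n")  # Turkish characters such as ş, ğ, ı survive intact

In that case, also drop the .encode("utf-8") calls and write the strings directly.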
Ok, I wasn't clear enough before. What I am trying to do is take the list of college teams and their URLs from http://www.cfbstats.com/2014/player/index.html and export it to csv. I have done that successfully. From there I go into each team and grab each player and their link. If a player does not have a link, the scraper should just put their data in the csv. Currently I only handle players with URLs, not ones without. Eventually I will want to go into each player's page, grab each of his stats, and write them to a csv.
Sorry for all the confusion in the original post.
import csv
import sys
import json
import urllib
import requests
from bs4 import BeautifulSoup

def getCollegeandURL():
    f = open('colleges.csv', 'w')
    f.write("Teams" + "," + "," + "URL" + '\n')
    originalurl = "http://www.cfbstats.com/2014/player/index.html"
    base = requests.get("http://www.cfbstats.com/2014/player/index.html")
    base = base.text
    soup = BeautifulSoup(base)
    # this is to find all the colleges in the div conference
    mydivs = soup.find_all('div', {'class': 'conference'})
    # g is the csv file for the rosters
    g = open('rosters.csv', 'w')
    g.write("College Rosters" + '\n' + '\n' + 'College' + ',' + ',' + 'Playernumber' + ',' + 'Player Last Name' + ',' + 'Player First Name' + ',' + 'Position' + ',' + 'Year' + ',' + 'Height' + ',' + ' Weight' + ',' + 'Hometown' + ',' + 'State' + ',' + 'Last School' + ',' + '\n')
    # this for loop writes each college to a line
    for div in mydivs:
        urls = div.findAll('a')
        # this is to pull all the college names and each of their links
        for url in urls:
            college = url.text
            url = url.attrs['href']
            teamurl = originalurl[:23] + url
            f.write(college[:] + ',' + ',' + teamurl[:] + '\n')
            scrapeRosters(college, teamurl, g)

def scrapeRosters(college, teamurl, g):
    # g is the csv document to write into
    # college is the college name
    # teamurl is the url link to that team's roster
    roster = requests.get(teamurl)
    roster = roster.text
    roster = BeautifulSoup(roster)
    teamname = roster.find_all('h1', {'id': 'pageTitle'})
    teamAndPlayers = {}
    table = roster.find_all('table', {'class': 'team-roster'})
    for i in table:
        rows = i.find_all('tr')
        for row in rows[1:]:
            # this retrieves the player url
            for item in row.findAll('a'):
                if item not in row.findAll('a'):
                    row = row.text
                    row = row.split('\n')
                    row = str(row)
                    g.write(college + ',' + row + ',' + ',' + '\n')
                elif item['href'].startswith('/'):
                    playerurl = item.attrs['href']
                    row = row.text
                    row = row.split('\n')
                    row = str(row)
                    g.write(college + ',' + row + ',' + ',' + playerurl + ',' + '\n')

def main():
    getCollegeandURL()

main()
I believe the error is in my if and elif statements.
import urllib, bs4

data = urllib.urlopen('http://www.cfbstats.com/2014/team/140/roster.html')
soup = bs4.BeautifulSoup(data.read())  # creates a BS4 HTML parsing object

for row in soup('tr')[1:]:
    data = [str(i.getText()) for i in row('td')]
    link = row('td')[1]('a')  # the linked player
    if len(link) > 0:
        link = str(link[0]['href'])
        data = [str(link)] + data
    print data
    print '\n'
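In other words: iterate over every tr after the header row, collect the cell texts, and when the second cell contains an a element, prepend its href to the row; rows without a link simply keep their cell data. A single if with no elif avoids the mismatch in the question's branching.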
So I have created a web scraper that goes to cfbstats.com/2014/player/index.html and retrieves all the college football teams and their links. From there it goes into each link and takes the roster and the players' links. Finally it goes into each player's link and takes his stats.
I am currently having a problem with taking the players' stats. When I call the header of each table I get the printed output [Tackle], and when I call the first row of the table I get [G]. I would like to get rid of those tags. I was able to avoid them in my earlier functions. Any help would be appreciated.
import csv
import sys
import json
import urllib
import requests
from bs4 import BeautifulSoup
import xlrd
import xlwt

def getCollegeandURL():
    f = open('colleges.csv', 'w')
    f.write("Teams" + "," + "," + "URL" + '\n')
    originalurl = "http://www.cfbstats.com/2014/player/index.html"
    base = requests.get("http://www.cfbstats.com/2014/player/index.html")
    base = base.text
    soup = BeautifulSoup(base)
    # this is to find all the colleges in the div conference
    mydivs = soup.find_all('div', {'class': 'conference'})
    # g is a csv document for the roster
    g = open('rosters.csv', 'w')
    g.write("College Rosters" + '\n' + '\n' + 'College' + ',' + 'Playernumber' + ',' + 'Player Last Name' + ',' + 'Player First Name' + ',' + 'Position' + ',' + 'Year' + ',' + 'Height' + ',' + ' Weight' + ',' + 'Hometown' + ',' + 'State' + ',' + 'Last School' + ',' + '\n')
    # h is a workbook for each player's stats
    h = xlwt.Workbook()
    # this for loop writes each college to a line
    for div in mydivs:
        urls = div.findAll('a')
        # this is to pull all the college names and each of their links
        for url in urls:
            college = url.text
            url = url.attrs['href']
            teamurl = originalurl[:23] + url
            f.write(college[:] + ',' + ',' + teamurl[:] + '\n')
            scrapeRosters(college, teamurl, g, h)

############################################################################
def scrapeRosters(college, teamurl, g, h):
    # this gets the pages of teams
    roster = requests.get(teamurl)
    roster = roster.text
    roster = BeautifulSoup(roster)
    teamname = roster.find_all('h1', {'id': 'pageTitle'})
    teamAndPlayers = {}
    table = roster.find_all('table', {'class': 'team-roster'})
    for i in table:
        rows = i.find_all('tr')
        for row in rows[1:]:
            data = [str(i.getText()) for i in row('td')]
            link = row('td')[1]('a')
            if len(link) > 0:
                link = str(link[0]['href'])
                data = [str(link)] + data
                # unpacking data into variables
                (playerurl, playernumber, playerName, playerPosition,
                 YearinCollege, playerHeight, playerWeight, playerHometown,
                 lastSchool) = data
                # creating the full player url
                playerurl = teamurl[:23] + playerurl
                # repacking the data
                data = (college, playernumber, playerName, playerPosition,
                        YearinCollege, playerHeight, playerWeight,
                        playerHometown, lastSchool)
                g.write(college + ',' + playernumber + ',' + playerName + ',' + playerPosition + ',' + YearinCollege + ',' + playerHeight + ',' + playerWeight + ',' + playerHometown + ',' + lastSchool + ',' + ',' + playerurl + ',' + '\n')
                playerStats(data, playerurl, h)

############################################################################
def playerStats(data, playerurl, h):
    playerurl = requests.get(playerurl)
    playerurl = playerurl.text
    playerurl = BeautifulSoup(playerurl)
    tablestats = playerurl.find_all('table', {'class': 'player-home'})
    (college, playernumber, playerName, playerPosition, YearinCollege,
     playerHeight, playerWeight, playerHometown, lastSchool) = data
    # print college, playernumber, playerName
    print college, playerName, playernumber
    for x in tablestats:
        caption = x.find_all('caption')
        rows = x.find_all('tr')
        ## caption = caption.strip
        for row in rows:
            headers = x.find_all('th')
            headers = [str(i.getText()) for i in row('tr')]
            stats = [str(x.getText()) for x in row('td')]
            print caption, headers, stats

############################################################################
def main():
    getCollegeandURL()

main()
Don't work so hard, your data is already available in parseable form.
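For example, the roster pages are plain HTML tables, which pandas can read in one call (a sketch, not necessarily what the answer had in mind; the team URL is taken from the earlier snippet, and read_html requires lxml or html5lib to be installed):

import pandas as pd

# read_html returns one DataFrame per <table> element on the page;
# the roster table comes back with proper columns and no tag residue
tables = pd.read_html('http://www.cfbstats.com/2014/team/140/roster.html')
roster = tables[0]
roster.to_csv('roster.csv', index=False)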