CSV output is empty - python

I am one step away from finishing a project. As far as I know, all parts of the code work, and I have tested them separately. However, the output CSV still comes out empty for some reason. My code:
import requests, bs4, csv, sys
reload(sys)
sys.setdefaultencoding('utf-8')

url = 'http://www.constructeursdefrance.com/resultat/?dpt=01'
count = 1

def result():
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    links = []
    try:
        for div in soup.select('.link'):
            link = div.a.get('href')
            links.append(link)
        with open('french.csv', 'wb') as file:
            writer = csv.writer(file)
            for i in links:
                res2 = requests.get(i)
                soup2 = bs4.BeautifulSoup(res2.text, 'html.parser')
                for each in soup2.select('li > strong'):
                    writer.writerow([each.text, each.next_sibling])
    except:
        pass

while not url.endswith('?dpt=010'):
    print 'downloading %s' % url
    result()
    count += 1
    url = 'http://www.constructeursdefrance.com/resultat/?dpt=0' + str(count)

url = 'http://www.constructeursdefrance.com/resultat/?dpt=10'
count = 10
while not url.endswith('?dpt=102'):
    print 'downloading %s' % url
    result()
    count += 1
    url = 'http://www.constructeursdefrance.com/resultat/?dpt=' + str(count)

print 'done'
This is really one of the first bigger projects I am trying to tackle as a beginner, and being so close yet so stuck is frustrating. Any help is appreciated.

First, do not wrap a large block of code in try/except; keep it around only the small piece that can actually fail.
If you comment out your try/except statement, this error is raised:
Traceback (most recent call last):
  File "/home/li/PycharmProjects/tw/1.py", line 29, in <module>
    result()
  File "/home/li/PycharmProjects/tw/1.py", line 26, in result
    writer.writerow([each.text, each.next_sibling])
TypeError: a bytes-like object is required, not 'str'
The error message is clear: when writing to the file, a bytes-like object is required. Check the mode the file was opened in: 'wb', where 'b' means binary (bytes) mode. So the fix is to open the file in normal text mode instead, which accepts str objects:
with open('french.csv', 'w') as file:
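For instance, a trimmed-down result() with the file opened in text mode and the try/except narrowed to just the network call could look like this. This is only a sketch assuming Python 3 (which is where the TypeError above comes from); passing url as a parameter and the newline='' argument are my additions, not part of the original code:

import requests, bs4, csv

def result(url):
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    links = [div.a.get('href') for div in soup.select('.link')]
    with open('french.csv', 'w', newline='') as file:  # text mode: csv.writer expects str
        writer = csv.writer(file)
        for link in links:
            try:  # keep the try/except around the one call that can realistically fail
                res2 = requests.get(link)
                res2.raise_for_status()
            except requests.RequestException as e:
                print('skipping', link, e)
                continue
            soup2 = bs4.BeautifulSoup(res2.text, 'html.parser')
            for each in soup2.select('li > strong'):
                writer.writerow([each.text, each.next_sibling])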

Related

Not getting all possible variables from splitting a web-scraped string

I can't get my program to print every possible string from a split.
Here is one thing I tried:
var2 = "apple banana orange"
for var in var2.split():
#Here I would put what I want to do with the variable, but I put print() to show what happens
print(var)
I got:
applebananaorange
Full Code:
import requests

response = requests.get('https://raw.githubusercontent.com/Charonum/JSCode/main/Files.txt')
responsecontent = str(response.content)
for file in responsecontent.split("\n"):
    file = file.replace("b'", "")
    file = file.replace("'", "")
    file = file.replace(r"\n", "")
    if file == "":
        pass
    else:
        print(file)
        url = 'https://raw.githubusercontent.com/Charonum/JSCode/main/code/windows/' + file + ""
        wget.download(url)
What should I do?
It looks like one of the files in the list is not available. It is good practice to wrap input/output operations in try/except to handle problems like this. The code below downloads all available files and tells you which ones could not be downloaded:
import requests
import wget
from urllib.error import HTTPError

response = requests.get('https://raw.githubusercontent.com/Charonum/JSCode/main/Files.txt')
responsecontent = str(response.content)
for file in responsecontent.split("\\n"):
    file = file.replace("b'", "")
    file = file.replace("'", "")
    file = file.replace(r"\n", "")
    if file == "":
        pass
    else:
        url = 'https://raw.githubusercontent.com/Charonum/JSCode/main/code/windows/' + file + ""
        print(url)
        try:
            wget.download(url)
        except HTTPError:
            print(f"Error 404: {url} not found")
It seems to work for me after replacing the for statement with this one:
for file in responsecontent.split("\\n"):
    ...
Instead of responsecontent = str(response.content), try:
responsecontent = response.text
and then iterate with for file in responsecontent.split().
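Putting those two suggestions together, a sketch of the download loop built on response.text might look like this (same URLs as in the question; wget here is the third-party wget package, and catching HTTPError follows the answer above):

import requests
import wget
from urllib.error import HTTPError

listing = requests.get('https://raw.githubusercontent.com/Charonum/JSCode/main/Files.txt')
for name in listing.text.split():  # .text is already a str, so no b'...' cleanup is needed
    url = 'https://raw.githubusercontent.com/Charonum/JSCode/main/code/windows/' + name
    try:
        wget.download(url)
    except HTTPError:
        print(f"Error 404: {url} not found")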

Write output data to csv

I'm writing a short piece of code in Python to check the status code of a list of URLs. The steps are:
1. Read the URLs from a CSV file.
2. Check the request status code.
3. Write the status code into the CSV next to the checked URL.
I've managed the first two steps, but I'm stuck on writing the output of the requests back into the same CSV, next to the URLs. Please help.
import urllib.request
import urllib.error
from multiprocessing import Pool

file = open('innovators.csv', 'r', encoding="ISO-8859-1")
urls = file.readlines()

def checkurl(url):
    try:
        conn = urllib.request.urlopen(url)
    except urllib.error.HTTPError as e:
        print('HTTPError: {}'.format(e.code) + ', ' + url)
    except urllib.error.URLError as e:
        print('URLError: {}'.format(e.reason) + ', ' + url)
    else:
        print('200' + ', ' + url)

if __name__ == "__main__":
    p = Pool(processes=1)
    result = p.map(checkurl, urls)
    with open('innovators.csv', 'w') as f:
        for line in file:
            url = ''.join(line)
            checkurl(urls + "," + checkurl)
The .readlines() call leaves the file object positioned at end-of-file. When you then try to loop over file again without first rewinding it (file.seek(0)) or closing and reopening it, there are no lines left to read. It is always recommended to use the with open(...) as file construct so the file is closed once the operation is finished.
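For example, a small illustration of the point, reusing the question's file name:

with open('innovators.csv', 'r', encoding='ISO-8859-1') as file:
    urls = file.readlines()  # the file position is now at end-of-file
    file.seek(0)             # rewind before iterating over the file object again
    for line in file:
        print(line.strip())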
Additionally, there appears to be an error in the argument you pass to checkurl: you are adding a list (urls) to a string (",") to a function object (checkurl).
You probably meant for this section to read:
with open('innovators.csv', 'w') as f:
    for line in urls:
        url = ''.join(line.replace('\n', ''))  # readlines leaves a linefeed character at the end of each line
        f.write(url + "," + checkurl(url))
The checkurl function should return what you intend to place into the CSV file; at the moment you are simply printing to standard output (the screen). So replace your checkurl with:
def checkurl(url):
    try:
        conn = urllib.request.urlopen(url)
        ret = '0'
    except urllib.error.HTTPError as e:
        ret = 'HTTPError: {}'.format(e.code)
    except urllib.error.URLError as e:
        ret = 'URLError: {}'.format(e.reason)
    else:
        ret = '200'
    return ret
or something equivalent to your needs.
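Putting the pieces together, a minimal end-to-end sketch might look like the following. Writing to a separate output file (innovators_checked.csv) is my assumption, so the input file is not overwritten while its contents are still needed:

import csv
import urllib.request
import urllib.error

def checkurl(url):
    try:
        urllib.request.urlopen(url)
    except urllib.error.HTTPError as e:
        return 'HTTPError: {}'.format(e.code)
    except urllib.error.URLError as e:
        return 'URLError: {}'.format(e.reason)
    return '200'

if __name__ == '__main__':
    with open('innovators.csv', 'r', encoding='ISO-8859-1') as f:
        urls = [line.strip() for line in f if line.strip()]

    with open('innovators_checked.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        for url in urls:
            writer.writerow([url, checkurl(url)])  # one row per URL: url, status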
Alternatively, save the status in a dict, convert it to a DataFrame, and then write it to a CSV file. str(code.getcode()) returns '200' if the URL is reachable; otherwise urlopen raises an exception, in which case the status is set to '000'. So your CSV file will contain url,200 if the URL is reachable and url,000 if it is not.
import urllib.request
import pandas as pd

status_dict = {}
for line in lines:  # lines = the list of URLs read from the input file
    try:
        code = urllib.request.urlopen(line)
        status = str(code.getcode())
        status_dict[line] = status
    except:
        status = "000"
        status_dict[line] = status

df = pd.DataFrame(list(status_dict.items()), columns=['url', 'status'])
df.to_csv('filename.csv', index=False)
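With the DataFrame built this way, filename.csv would look roughly like this (hypothetical URLs, shown only to illustrate the layout):

url,status
http://example.com,200
http://broken.example,000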

Python requests stops working mid-file

I've got a CSV file with URLs and I need to scrape metadata from those websites. I'm using Python requests for that, with the code below:
from tempfile import NamedTemporaryFile
import shutil
import csv
from bs4 import BeautifulSoup
import requests
import re
import html5lib
import sys
#import logging

filename = 'TestWTF.csv'
#logging.basicConfig(level=logging.DEBUG)

#Get filename (with extension) from terminal
#filename = sys.argv[1]

tempfile = NamedTemporaryFile(delete=False)
read_timeout = 1.0

#Does actual scraping done, returns metaTag data
def getMetadata(url, metaTag):
    r = requests.get("http://" + url, timeout=2)
    data = r.text
    soup = BeautifulSoup(data, 'html5lib')
    metadata = soup.findAll(attrs={"name": metaTag})
    return metadata

#Gets either keyword or description
def addDescription(row):
    scrapedKeywordsData = getMetadata(row, 'keywords')
    if not scrapedKeywordsData:
        print row + ' NO KEYWORDS'
        scrapedKeywordsData = getMetadata(row, 'description')
    if not scrapedKeywordsData:
        return ''
    return scrapedKeywordsData[0]

def prepareString(data):
    output = data
    #Get rid of opening meta content
    if output.startswith('<meta content="'):
        output = data[15:]
    #Get rid of closing meta content (keywords)
    if output.endswith('" name="keywords"/>'):
        output = output[:-19]
    #Get rid of closing meta content (description)
    if output.endswith('" name="description"/>'):
        output = output[:-22]
    return output

def iterator():
    with open(filename, 'rb') as csvFile, tempfile:
        reader = csv.reader(csvFile, delimiter=',', quotechar='"')
        writer = csv.writer(tempfile, delimiter=',', quotechar='"')
        i = 0
        for row in reader:
            try:
                data = str(addDescription(row[1]))
                row[3] = prepareString(data)
            except requests.exceptions.RequestException as e:
                print e
            except requests.exceptions.Timeout as e:
                print e
            except requests.exceptions.ReadTimeout as e:
                print "lol"
            except requests.exceptions.ConnectionError as e:
                print "These aren't the domains we're looking for."
            except requests.exceptions.ConnectTimeout as e:
                print "Too slow Mojo!"
            writer.writerow(row)
            i = i + 1
            print i
    shutil.move(tempfile.name, filename)

def main():
    iterator()

#Defining main function
if __name__ == '__main__':
    main()
It works just fine, but at some URLs (maybe 2 or 3 out of, say, 3000) it suddenly stops and doesn't move on to the next one after the timeout. So I have to kill it with Ctrl+C, which means the file doesn't get saved.
I know it's a problem of catching exceptions, but I cannot figure out which one, or what else to do about it. I'd be more than happy to simply skip the URL it gets stuck on.
EDIT:
Added traceback:
^CTraceback (most recent call last):
  File "blacklist.py", line 90, in <module>
    main()
  File "blacklist.py", line 85, in main
    iterator()
  File "blacklist.py", line 62, in iterator
    data = str(addDescription (row[1] ))
  File "blacklist.py", line 30, in addDescription
    scrapedKeywordsData = getMetadata(row, 'keywords')
  File "blacklist.py", line 25, in getMetadata
    metadata = soup.findAll(attrs={"name":metaTag})
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 1259, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 537, in _find_all
    found = strainer.search(i)
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 1654, in search
    found = self.search_tag(markup)
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 1626, in search_tag
    if not self._matches(attr_value, match_against):
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 1696, in _matches
    if isinstance(markup, Tag):
KeyboardInterrupt
EDIT 2:
Example website for which script doesn't work: miniusa.com

Python Lists - How do I use a list of domains with dns.resolver.query for loop

print(data[1])
ymcacanada.ca

answers = dns.resolver.query(data[1], 'MX')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dns\resolver.py", line 981, in query
  File "dns\resolver.py", line 912, in query
  File "dns\resolver.py", line 143, in __init__
dns.resolver.NoAnswer
I expected data[1] to be equal to "ymcacanada", so that I could loop over the whole data list with dns.resolver looking up the MX records. What I got instead was this error, which does not happen when I run the query manually with the URL.
The next thing I want to do is write those lookups into a .CSV.
Here's my code so far (it doesn't work!):
import dns
import dns.resolver
import os
import csv
import dns.resolver

file = "domains.txt"
f = open(file)
data = f.read()
f.close

list = []
for url in data:
    answers = dns.resolver.query(url, 'MX')
    for rdata in answers:
        x = [rdata.exchange, rdata.preference]
        print(x)
        list.append([x])
EDIT
Here is my working code in its entirety. I'm sure it can be improved, but I'm still learning!
import dns
import dns.resolver
import os
import csv

## tested on python 2.7 only
errcountA = 0
errcountMX = 0
listoflists = []

with open("domains.txt") as f:
    for url in f:
        A = []
        MX = []
        url = url.strip()
        try:
            answers = dns.resolver.query(url, 'A')  ## Get the A record
            for rdata in answers:
                A.append(rdata.address)  ## Add A records to the list
        except:  ## In case the URL doesn't resolve
            A = "Error resolving"
            errcountA += 1
        try:  ## Get the MX record
            answers = dns.resolver.query(url, 'MX')
            for rdata in answers:
                MX.append([rdata.preference, rdata.exchange])
        except:  ## In case the URL doesn't resolve
            MX = "Error resolving"
            errcountMX += 1
        list = [url, MX, A]
        print(list)
        listoflists.append(list)

with open('output.csv', 'wb') as csvfile:  ## Write the csv file
    writer = csv.writer(csvfile)
    for r in listoflists:
        writer.writerow(r)

print listoflists
print ("There were %d A record errors") % errcountA
print ("There were %d MX record errors") % errcountMX
print ("Done!")
Your url data includes the trailing newline. Use .strip() to remove any leading or trailing whitespace from your data. Try this:
with open("domaints.txt") as f:
for url in f:
url = url.strip()
answers = dns.resolver.query(url, 'MX')
# Continue as before

Why won't this python script generate the desired text file? (despite the script running with no errors)

This script is supposed to generate a text file of stock price values. I can't find the text file that is supposed to be generated by this script, and the script doesn't seem to actually create it. I added a section of code to check whether the file exists, but I keep getting the result that the text file was not created. When I run the code I do not get any errors. Please let me know how to correct this. Thanks.
import urllib2
import time
import os
import sys

stockToPull = 'AAPL'

def pullData(stock):
    try:
        fileLine = stock + '.txt'
        urlToVisit = 'http://chartapi.finance.yahoo.com/instrument/1.0/' + stock + '/chartdata;type=quote;range=1y/csv'
        sourceCode = urllib2.urlopen(urlToVisit).read()
        splitSource = sourceCode.split('\n')
        for eachLine in splitSource:
            splitLine = eachLine.split(', ')
            if len(splitLine) == 6:
                if 'values' not in eachLine:
                    saveFile = open(fileLine, 'a')
                    lineToWrite = eachLine + '\n'
                    saveFile.write(lineToWrite)
        print 'Pulled', stock
        print 'sleeping'
        if os.path.isfile(fileLine):  # checks to see if the text file was created
            print "file does exist"
        else:
            print "No such file"
        time.sleep(5)
    except Exception, e:
        print 'main loop', str(e)

pullData(stockToPull)
You are splitting each row on the string ', ' (note the space). You should be splitting on the comma only:
for eachLine in splitSource:
    splitLine = eachLine.split(',')
    if len(splitLine) == 6:
        # etc
You would be better off opening the file once, writing each line to it, then closing the file when finished. You can use a with statement to do this:
with open(fileLine, 'w') as outfile:
    for eachLine in splitSource:
        splitLine = eachLine.split(',')
        if len(splitLine) == 6 and 'values' not in eachLine:
            outfile.write('%s\n' % eachLine)
I think you need to check whether data is arriving or not, using:
from urllib2 import Request, urlopen, URLError, HTTPError

fileLine = stock + '.txt'
urlToVisit = 'http://chartapi.finance.yahoo.com/instrument/1.0/' + stock + '/chartdata;type=quote;range=1y/csv'
try:
    response = urlopen(urlToVisit)
    sourceCode = response.read()
    <place your logic here>
except HTTPError as e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError as e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
Your code works fine. Here is the output I got:
['uri:/instrument/1.0/AAPL/chartdata;type=quote;range=1y/csv', 'ticker:aapl', 'Company-Name:Apple Inc.', 'Exchange-Name:NMS', 'unit:DAY', 'timestamp:', 'first-trade:19801212', 'last-trade:20140702', 'currency:USD', 'previous_close_price:59.7843', 'Date:20130703,20140702', 'labels:20130703,20130801,20130903,20131001,20131101,20131202,20140102,20140203,20140303,20140401,20140501,20140602,20140701', 'values:Date,close,high,low,open,volume', 'close:59.2929,94.2500', 'high:60.1429,95.0500', 'low:58.6257,93.5700', 'open:59.0857,94.7300', 'volume:28420900,266380800', '20140620,90.9100,92.5500,90.9000,91.8500,100813200', .....
'20140702,93.4800,94.0600,93.0900,93.8700,28420900', '']
['uri:/instrument/1.0/AAPL/chartdata;type=quote;range=1y/csv']
['ticker:aapl']
['Company-Name:Apple Inc.']
['Exchange-Name:NMS']
['unit:DAY']
['timestamp:']
['first-trade:19801212']
['last-trade:20140702']
['currency:USD']
['previous_close_price:59.7843']
['Date:20130703,20140702']
['labels:20130703,20130801,20130903,20131001,20131101,20131202,20140102,20140203,20140303,20140401,20140501,20140602,20140701']
['values:Date,close,high,low,open,volume']
['close:59.2929,94.2500']
['high:60.1429,95.0500']
['low:58.6257,93.5700']
['open:59.0857,94.7300']
['volume:28420900,266380800']
['20130703,60.1143,60.4257,59.6357,60.1229,60232200']
['20130705,59.6314,60.4700,59.3357,60.0557,68506200']
['20130708,59.2929,60.1429,58.6643,60.0157,74534600']
......
['20140630,92.9300,93.7300,92.0900,92.1000,49482300']
['20140701,93.5200,94.0700,93.1300,93.5200,38170200']
['20140702,93.4800,94.0600,93.0900,93.8700,28420900']
['']
Pulled AAPL
sleeping
file does exist
Your code is fine, but I don't think it is able to do what you want, or you don't know what you want: you haven't looked at your data carefully. I have made minor changes to your script; please run this version now:
import urllib2
import time
import os
import sys

stockToPull = 'AAPL'

def pullData(stock):
    try:
        fileLine = stock + '.txt'
        urlToVisit = 'http://chartapi.finance.yahoo.com/instrument/1.0/' + stock + '/chartdata;type=quote;range=1y/csv'
        sourceCode = urllib2.urlopen(urlToVisit).read()
        splitSource = sourceCode.split('\n')
        for eachLine in splitSource:
            splitLine = eachLine.split(', ')
            if len(splitLine) == 6:
                print 'Entering outer'
                if 'values' not in eachLine:
                    print 'Entering inner'
                    saveFile = open(fileLine, 'a')
                    lineToWrite = eachLine + '\n'
                    saveFile.write(lineToWrite)
        print 'Pulled', stock
        print 'sleeping'
        if os.path.isfile(fileLine):  # checks to see if the text file was created
            print "file does exist"
        else:
            print "No such file"
        time.sleep(5)
    except Exception, e:
        print 'main loop', str(e)

pullData(stockToPull)
As you'll notice, I have just put two print statements inside the if blocks that actually write to the file. On running the script, I noticed that those print statements are never executed. So there are no errors in your code as far as I can tell, but it isn't doing what you want it to, so re-check your data.
Lastly, to debug this kind of problem you should use the pdb module, the Python Debugger, which is an amazingly helpful tool for stepping through your code without making a mess of it. Check out this video from PyCon.
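For example, a quick way to pause inside the loop and inspect what each line actually contains (only a sketch; it assumes the splitSource variable from the pullData function above):

import pdb

for eachLine in splitSource:
    splitLine = eachLine.split(', ')
    pdb.set_trace()  # execution pauses here; inspect splitLine, step with 'n', continue with 'c'
    if len(splitLine) == 6:
        print 'would write:', eachLine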
