So I've never really used import csv before, but I've managed to scrape a bunch of information from websites and now want to put it in a CSV file. The issue I'm having is that every character of my list values is being separated by commas (e.g. Jane Doe = J,a,n,e, ,D,o,e).
Also, I have three lists (one with names, one with emails, and one with titles) and I would like to add each as its own column in the CSV file (so col1 = name, col2 = title, col3 = email).
Any thoughts on how to execute this? Thanks.
from bs4 import BeautifulSoup
import requests
import csv
urls = ''
with open('websites.txt', 'r') as f:
    for line in f.read():
        urls += line
urls = list(urls.split())
name_lst = []
position_lst = []
email_lst = []
for url in urls:
    print(f'CURRENTLY PARSING: {url}')
    print()
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    try:
        for information in soup.find_all('tr', class_='sidearm-staff-member'):
            names = information.find("th", attrs={'headers': "col-fullname"}).text.strip()
            positions = information.find("td", attrs={'headers': "col-staff_title"}).text.strip()
            emails = information.find("td", attrs={'headers': "col-staff_email"}).script
            target = emails.text.split('var firstHalf = "')[1]
            fh = target.split('";')[0]
            lh = target.split('var secondHalf = "')[1].split('";')[0]
            emails = fh + '#' + lh
            name_lst.append(names)
            position_lst.append(positions)
            email_lst.append(emails)
    except Exception as e:
        pass
with open('test.csv', 'w') as csv_file:
    csv_writer = csv.writer(csv_file)
    for line in name_lst:
        csv_writer.writerow(line)
    for line in position_lst:
        csv_writer.writerow(line)
    for line in email_lst:
        csv_writer.writerow(line)
Writing your data column by column is easy. All you have to do is write rows where each row contains the elements of the three lists that share the same index. Here is the code:
# newline='' stops the csv module from adding blank lines between rows on Windows
with open('test.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    # zip() pairs up the entries that share an index, one row per person
    for name, position, email in zip(name_lst, position_lst, email_lst):
        csv_writer.writerow([name, position, email])
Assuming that name_lst, position_lst and email_lst are all correct and of the same size, your problem is in the last part of your code, where you write to the CSV file.
Here is a way to do this:
fieldnames = ['Name', 'Position', 'Email']
with open('Data_to_Csv.csv', 'w') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for i in range(len(name_lst)):
        writer.writerow({'Name':name_lst[i],'Position':position_lst[i], 'Email':email_lst[i]})
This would of course fail if the lengths of the lists are unequal. You need to add dummy values for entries that are not available so that all three lists have the same number of values.
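If padding the lists while scraping is not practical, the padding can also happen at write time. The following is a minimal sketch, not the answerer's method: it swaps the index loop for itertools.zip_longest, uses made-up sample values, and writes to an in-memory buffer instead of a real file.

```python
import csv
import io
from itertools import zip_longest

# Made-up sample data: the email list is one entry short.
name_lst = ["Jane Doe", "John Roe"]
position_lst = ["Head Coach", "Assistant Coach"]
email_lst = ["jane@example.com"]

# io.StringIO stands in for open('test.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Name", "Position", "Email"])

# zip_longest pads the shorter lists with fillvalue instead of stopping early.
for row in zip_longest(name_lst, position_lst, email_lst, fillvalue=""):
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```

Note that this only pads at the end of a list. If a value is missing in the middle, the columns still drift out of alignment, so appending a placeholder at scrape time remains the safer option.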
I am trying to read data from a table in a pdf into a pandas dataframe. I am able to do so using tabula-py when the pdf has outlines around the table, but when I try on the pdf without an outline the script produces an error.
For example, I am looking at the pdfs available from two different urls. I have downloaded the pdfs from the urls and saved them as 'JSE Opts.pdf' and 'JSE Divs.pdf' respectively.
import requests
import pandas as pd
url='https://clientportal.jse.co.za/JSE%20Equity%20Derivatives/Dividends/ED_DividendsReport.pdf'
response = requests.get(url)
fname = 'JSE Divs.pdf'
f= open(fname, 'wb')
f.write(response.content)
f.close()
url='https://clientportal.jse.co.za/JSE%20Equity%20Derivatives/Options%20Daily%20Traded%20Report/ED_OptionsDailyTradedReport.pdf'
response = requests.get(url)
fname = 'JSE Opts.pdf'
f= open(fname, 'wb')
f.write(response.content)
f.close()
I am able to read the 'JSE Opts.pdf' into a pandas dataframe using the code:
import tabula as tb
pdf = './JSE Opts.pdf'
data = tb.read_pdf(pdf,pages = 1)
data = data[0]
print(data)
When I try to do the same for 'JSE Divs.pdf', I get errors and tabula-py is only able to read the header:
pdf = './JSE Divs.pdf'
data = tb.read_pdf(pdf,pages = 1)
data = data[0]
print(data)
I suspect that this is because there are no lines around the table. If that is the case, what is the best way to go about reading the data from 'JSE Divs.pdf' into pandas?
I was able to read the data into a string using pdfplumber, save the string as a CSV file (after cleaning the data to suit my needs) and then import into pandas.
import pdfplumber
pdf = pdfplumber.open("./JSE Divs.pdf")
text = ''
i = 0
while True:
    try:
        text += pdf.pages[i].extract_text() + '\n'
        i = i+1
    except IndexError:
        break
for replace_s in [' DN',' CA1',' ANY',' CSH',' PHY',' QUANTO']:
    text = text.replace(replace_s,'')
while True:
    try:
        idx = text.index('EXO')
        replace_s =text[idx-1:idx+8]
        text = text.replace(replace_s,'')
    except ValueError:
        break
cols ='EXPIRY_s,USYM,EXPIRY,EX_DATE,CUM_PV_DIVS,CUM_DIVS,ISIN,INSTR_ID\n'
text = text[text.index('Div\n')+4:]
text = cols + text
text = text.replace(' ',',')
f = open('divs.csv','w')
f.write(text)
f.close()
I am hoping to extract the change in cost of living between one city and many other cities. I plan to list the cities I would like to compare in a CSV file and use this list to build the web link that takes me to the page with the information I am looking for.
Here is the link to an example: http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city
Unfortunately, I am running into several challenges. Any assistance with the following is greatly appreciated!
1. The output only shows the percentage, with no indication of whether it is more expensive or cheaper. For the example listed above, my output based on the current code shows 48%, 129%, 63%, 43%, 42%, and 42%. I tried to correct for this by adding an 'if-statement' that adds a '+' sign if it is more expensive, or a '-' sign if it is cheaper. However, this 'if-statement' is not functioning correctly.
2. When I write the data to a CSV file, each of the percentages is written to a new row. I can't seem to figure out how to write them as a list on one line.
3. (related to item 2) When I write the data to a CSV file for the example listed above, the data is written in the format listed below. How can I correct the format and have the data written in the preferred format listed below (also without the percentage sign)?
CURRENT CSV FORMAT (Note: 'if-statement' not functioning correctly):
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,1,2,9,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,6,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
PREFERRED CSV FORMAT:
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city, 48,129,63,43,42,42
Here is my current code:
import requests
import csv
from bs4 import BeautifulSoup
#Read text file
Textfile = open("City.txt")
Textfilelist = Textfile.read()
Textfilelistsplit = Textfilelist.split("\n")
HomeCity = 'Phoenix'
i=0
while i<len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page = requests.get(url).text
    soup_expatistan = BeautifulSoup(page)
    #Prepare CSV writer.
    WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
    WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])
    expatistan_table = soup_expatistan.find("table",class_="comparison")
    expatistan_titles = expatistan_table.find_all("tr",class_="expandable")
    for expatistan_title in expatistan_titles:
        percent_difference = expatistan_title.find("th",class_="percent")
        percent_difference_title = percent_difference.span['class']
        if percent_difference_title == "expensiver":
            WriteResultsFile.writerow(Textfilelistsplit[i] + '+' + percent_difference.span.string)
        else:
            WriteResultsFile.writerow(Textfilelistsplit[i] + '-' + percent_difference.span.string)
    i+=1
Answers:
Question 1: the class of the span is a list, so you need to check whether expensiver is in that list. In other words, replace:
if percent_difference_title == "expensiver":
with:
if "expensiver" in percent_difference.span['class']:
Questions 2 and 3: you need to pass a list of column values to writerow(), not a string. And, since you want only one record per city, call writerow() outside of the loop over the tr elements.
Other issues:
open csv file for writing before the loop
use with context managers while working with files
try to follow PEP8 style guide
Here's the code with modifications:
import requests
import csv
from bs4 import BeautifulSoup
BASE_URL = 'http://www.expatistan.com/cost-of-living/comparison/{home_city}/{city}'
home_city = 'Phoenix'
with open('City.txt') as input_file:
    with open("Expatistan.csv", "w") as output_file:
        writer = csv.writer(output_file)
        writer.writerow(["City", "Food", "Housing", "Clothes", "Transportation", "Personal Care", "Entertainment"])
        for line in input_file:
            city = line.strip()
            url = BASE_URL.format(home_city=home_city, city=city)
            soup = BeautifulSoup(requests.get(url).text)
            table = soup.find("table", class_="comparison")
            differences = []
            for title in table.find_all("tr", class_="expandable"):
                percent_difference = title.find("th", class_="percent")
                if "expensiver" in percent_difference.span['class']:
                    differences.append('+' + percent_difference.span.string)
                else:
                    differences.append('-' + percent_difference.span.string)
            writer.writerow([city] + differences)
For the City.txt containing just one new-york-city line, it produces Expatistan.csv with the following content:
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city,+48%,+129%,+63%,+43%,+42%,+42%
Make sure you understand what changes I have made. Let me know if you need further help.
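The output above still carries the percent sign, which the question asked to drop. One way to handle that is sketched below. The helper name signed_percent is hypothetical, the input strings are the first three values quoted in the question, and the row is written to an in-memory buffer rather than to Expatistan.csv.

```python
import csv
import io


def signed_percent(raw, more_expensive):
    """Turn a scraped string such as '48%' into a signed number string."""
    value = raw.strip().rstrip("%")
    return ("+" if more_expensive else "-") + value


# First three percentages from the question, all flagged as more expensive.
scraped = ["48%", "129%", "63%"]
row = ["new-york-city"] + [signed_percent(p, True) for p in scraped]

buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(row)
csv_line = buffer.getvalue()
print(csv_line)
```

In the scraping loop, the second argument would be the result of the "expensiver" in percent_difference.span['class'] check.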
csv.writer.writerow() takes a sequence and makes each element a column; normally you'd give it a list with columns, but you are passing in strings instead; that'll add individual characters as columns instead.
Just build a list, then write it to the CSV file.
First, open the CSV file once, not for every separate city; you are clearing out the file every time you open it.
import requests
import csv
from bs4 import BeautifulSoup
HomeCity = 'Phoenix'
# "wb" is the Python 2 convention for csv files; on Python 3 use
# open("Expatistan.csv", "w", newline="") instead.
with open("City.txt") as cities, open("Expatistan.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["City", "Food", "Housing", "Clothes",
                     "Transportation", "Personal Care", "Entertainment"])
    for line in cities:
        city = line.strip()
        url = "http://www.expatistan.com/cost-of-living/comparison/{}/{}".format(
            HomeCity, city)
        resp = requests.get(url)
        soup = BeautifulSoup(resp.content, from_encoding=resp.encoding)
        titles = soup.select("table.comparison tr.expandable")
        # Collect every column for this city, then write a single row.
        row = [city]
        for title in titles:
            percent_difference = title.find("th", class_="percent")
            changeclass = percent_difference.span['class']
            change = percent_difference.span.string
            if "expensiver" in changeclass:
                change = '+' + change
            else:
                change = '-' + change
            row.append(change)
        writer.writerow(row)
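The point about the file being cleared on every open is easy to reproduce in isolation. The sketch below uses a temporary file and two placeholder city names (both made up) to compare opening in "w" mode inside the loop with opening once around it.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.csv")
cities = ["city-a", "city-b"]  # placeholder names

# Opening in "w" mode inside the loop truncates the file on every pass.
for city in cities:
    with open(path, "w") as handle:
        handle.write(city + "\n")
with open(path) as handle:
    reopened_each_time = handle.read()

# Opening once and looping inside the with block keeps every line.
with open(path, "w") as handle:
    for city in cities:
        handle.write(city + "\n")
with open(path) as handle:
    opened_once = handle.read()

print(repr(reopened_each_time))
print(repr(opened_once))
```

Only the last city survives in the first case, which matches the symptom of ending up with a single city in the output file.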
So, first of all, one passes the writerow method an iterable, and each object in that iterable gets written with commas separating them. So if you give it a string, then each character gets separated:
WriteResultsFile.writerow('hello there')
writes
h,e,l,l,o, ,t,h,e,r,e
But
WriteResultsFile.writerow(['hello', 'there'])
writes
hello,there
That's why you are getting results like
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
The rest of your problems are errors in your webscraping. First of all, when I scrape the site, searching for tables with CSS class "comparison" gives me None. So I had to use
expatistan_table = soup_expatistan.find("table","comparison")
Now, the reason your "if statement is broken" is because
percent_difference.span['class']
returns a list. If we modify that to
percent_difference.span['class'][0]
things will work the way you expect.
Now, your real issue is that inside the innermost loop you are finding the percent change in price for the individual items. You want these as items in your row of price differences, not as individual rows. So I declare an empty list, items, to which I append percent_difference.span.string, and then write the row outside the innermost loop, like so:
items = []
for expatistan_title in expatistan_titles:
    percent_difference = expatistan_title.find("th","percent")
    percent_difference_title = percent_difference.span["class"][0]
    print(percent_difference_title)
    if percent_difference_title == "expensiver":
        items.append('+' + percent_difference.span.string)
    else:
        items.append('-' + percent_difference.span.string)
row = [Textfilelistsplit[i]]
row.extend(items)
WriteResultsFile.writerow(row)
The final error is that inside the while loop you re-open the CSV file and overwrite everything, so you end up with only the final city. Accounting for all these errors (many of which you should have been able to find without help) leaves us with:
#Prepare CSV writer.
WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
#Write the header once, before looping over the cities.
WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])
i=0
while i<len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page = requests.get(url).text
    print(url)
    soup_expatistan = BeautifulSoup(page)
    expatistan_table = soup_expatistan.find("table","comparison")
    expatistan_titles = expatistan_table.find_all("tr","expandable")
    items = []
    for expatistan_title in expatistan_titles:
        percent_difference = expatistan_title.find("th","percent")
        percent_difference_title = percent_difference.span["class"][0]
        print(percent_difference_title)
        if percent_difference_title == "expensiver":
            items.append('+' + percent_difference.span.string)
        else:
            items.append('-' + percent_difference.span.string)
    row = [Textfilelistsplit[i]]
    row.extend(items)
    WriteResultsFile.writerow(row)
    i+=1
YAA - Yet Another Answer.
Unlike the other answers, this treats the data as a series of key-value pairs, i.e. a list of dictionaries, which is then written to CSV. A list of wanted fields is provided to the csv writer (DictWriter), which discards additional information (beyond the specified fields) and blanks missing information. Also, should the order of the information on the original page change, this solution is unaffected.
I also assume you are going to open the CSV file in something like Excel. Additional parameters need to be given to the csv writer for this to happen nicely (see dialect parameter). Given that we are not sanitising the returned data, we should explicitly delimit it with unconditional quoting (see quoting parameter).
import csv
import requests
from bs4 import BeautifulSoup
#Read text file, dropping newlines and blank lines so the URLs stay clean
with open("City.txt") as cities_h:
    cities = [line.strip() for line in cities_h if line.strip()]
home_city = "Phoenix"
city_data = []
for city in cities:
    url = "http://www.expatistan.com/cost-of-living/comparison/%s/%s" % (home_city, city)
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content, from_encoding = resp.encoding)
    titles = soup.select("table.comparison tr.expandable")
    if titles:
        data = {}
        for title in titles:
            name = title.find("th", class_ = "clickable")
            diff = title.find("th", class_ = "percent")
            exp = bool(diff.find("span", class_ = "expensiver"))
            data[name.text] = ("+" if exp else "-") + diff.span.text
        data["City"] = soup.find("strong", class_ = "city-2").text
        city_data.append(data)
with open("Expatistan.csv","w") as csv_h:
    fields = \
    [
        "City",
        "Food",
        "Housing",
        "Clothes",
        "Transportation",
        "Personal Care",
        "Entertainment"
    ]
    #Prepare CSV writer.
    writer = csv.DictWriter\
    (
        csv_h,
        fields,
        quoting = csv.QUOTE_ALL,
        extrasaction = "ignore",
        dialect = "excel",
        lineterminator = "\n",
    )
    writer.writeheader()
    writer.writerows(city_data)
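The DictWriter behaviour this answer relies on (extra keys dropped, missing keys left blank, every field quoted) can be checked without scraping anything. A small sketch follows; the rows and the extra "Internet" key are invented for illustration, and the output goes to an in-memory buffer.

```python
import csv
import io

fields = ["City", "Food", "Housing"]
rows = [
    # "Internet" is not in fields, so extrasaction="ignore" drops it.
    {"City": "city-a", "Food": "+48%", "Housing": "+129%", "Internet": "+5%"},
    # "Housing" is missing, so restval fills that column with an empty string.
    {"Food": "-10%", "City": "city-b"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fields,
    quoting=csv.QUOTE_ALL,
    extrasaction="ignore",
    restval="",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Key order inside each dictionary does not matter, because the column order comes from fields.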
I want to read a file in a more advanced way.
First:
In this file, I have a list of steps that the code has to follow. How do I read the steps until the string [data] appears?
[Steps]
step1 = WebAddress
step2 = Tab
step3 = SecurityType
step4 = Criteria
step5 = Date
step6 = Click1
step7 = Results
step8 = Download
[data]
......
Second:
How can I read everything after [data]?
[data]
WebAddress___________________________ Destination___________ Tab_____________ SecurityType___________________________________________________ Criteria___ Date_______ Click1_ Results_ Download
https://mbsdisclosure.fanniemae.com/ q:\\%s\\raw\\fnmapool Advanced Search Interim MBS: Single-Family Issue Date 09/01/2012 Search 100 CSV XML
https://mbsdisclosure.fanniemae.com/ q:\\%s\\raw\\fnmapool Advanced Search Preliminary Mega: Fannie Mae/Ginnie Mae backed Adjustable Rate Issue Date 09/01/2012 Search 100 CSV XML
https://mbsdisclosure.fanniemae.com/ q:\\%s\\raw\\fnmapool Advanced Search Preliminary Mega: Fannie Mae/Ginnie Mae backed Fixed Rate Issue Date 09/01/2012 Search 100 CSV XML
I want to pass along everything under each header (step____________________), where the header is one of the steps (e.g. WebAddress).
So for example, if step1 = WebAddress how do I read everything under WebAddress__________________________ and so on? Thanks!
First:
with open(file_name) as f:
    # The first element is everything before [data], i.e. the [Steps] section
    print(f.read().split("[data]"))
Second:
with open(file_name) as f:
    pre_data,post_data =[s.strip() for s in (f.read()).split("[data]")]
post_data_lines = post_data.splitlines()
headers = post_data_lines[0].split()
print(headers)
for line in post_data_lines[1:]:
    print(line.split())
    print(dict(zip(headers,line.split())))
I'm also not sure how your [data] section is delimited; you may want line.split('\t') if it is tab-separated.
This is untested, but it should work. It doesn't get you all the way to where you want, but it covers most of it (the "hard" parts).
To split by header width, use:
file_name = "testdata.txt"
with open(file_name) as f:
    pre_data,post_data =[s.strip() for s in (f.read()).split("[data]")]
post_data_lines = post_data.splitlines()
headers = post_data_lines[0].split()
for line in post_data_lines[1:]:
    tmpline = []
    pos = 0
    # Each column is as wide as its header (including the underscores)
    for itm in headers:
        tmpline.append(line[pos:pos+len(itm)])
        pos += len(itm)+1
    print(dict(zip(headers,tmpline)))
And if you want the actual headers without the underscores, use:
file_name = "testdata.txt"
with open(file_name) as f:
    pre_data,post_data =[s.strip() for s in (f.read()).split("[data]")]
post_data_lines = post_data.splitlines()
headers = post_data_lines[0].split()
# Same headers with the underscore padding removed, used as dictionary keys
headers2 = [s.replace("_"," ").strip() for s in headers]
for line in post_data_lines[1:]:
    tmpline = []
    pos = 0
    for itm in headers:
        tmpline.append(line[pos:pos+len(itm)])
        pos += len(itm)+1
    print(dict(zip(headers2,tmpline)))
First step:
>>> import ConfigParser
>>> cfg = ConfigParser.RawConfigParser()
>>> with open('sample.cfg') as f:
...     cfg.readfp(f)
...
>>> cfg.get('Steps','step1')
'WebAddress'
Second step:
>>> import csv, io
>>> data_section = ''
>>> with open('sample.cfg') as f:
...     data_section = f.read()
...
>>> data = data_section[data_section.index('[data]')+len('[data]')+1:]
>>> reader = csv.reader(io.BytesIO(data),delimiter='\t')
>>> header = reader.next() # skips header
>>> results = [row for row in reader]
Now results is a list of lists, with each inner list having items from the data section.
[['https://mbsdisclosure.fanniemae.com/','q:\\\\%s\\\\raw\\\\fnmapool','Advanced Search', 'Interim MBS: Single-Family', 'Issue Date','09/01/2012','Search','100', 'CSV XML']...]
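On Python 3, the two steps can be combined roughly as follows. This is a minimal sketch under stated assumptions, not a drop-in replacement for the answers above: it assumes the data rows are fixed-width and aligned under the underscore-padded headers, the function name parse_steps_file is hypothetical, and the sample text is a shortened stand-in with a placeholder URL.

```python
import configparser
import re

# Shortened stand-in for the real file; the URL is a placeholder.
SAMPLE = (
    "[Steps]\n"
    "step1 = WebAddress\n"
    "step2 = Tab\n"
    "[data]\n"
    "WebAddress__________ Tab_____________ Results_\n"
    "https://example.org/ Advanced Search  100\n"
)


def parse_steps_file(text):
    """Return (steps, records) from a file with [Steps] and [data] parts."""
    steps_text, _, data_text = text.partition("[data]")

    # Only the part before [data] is valid INI syntax.
    cfg = configparser.ConfigParser()
    cfg.read_string(steps_text)
    steps = dict(cfg["Steps"])

    lines = [ln for ln in data_text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]

    # Each header token marks a column: its name and its character span.
    spans = [
        (m.group().rstrip("_"), m.start(), m.end())
        for m in re.finditer(r"\S+", header)
    ]
    records = [
        {name: row[start:end].strip() for name, start, end in spans}
        for row in rows
    ]
    return steps, records


steps, records = parse_steps_file(SAMPLE)
# Look up the column named by step1 in the first data row.
print(records[0][steps["step1"]])
```

Slicing by header span rather than splitting on whitespace keeps multi-word values such as "Advanced Search" in one column.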