Need help writing to a CSV file in Python 3.5 - python
My code writes to a CSV file titled 'output'. Here is a link to past help on this code.
When I run my code, the CSV file's body row is overwritten on every pass. I want a new row to be written each time new information is scraped from the table at the stock table URL.
Here is what my CSV file looks like:
Index,P/E,EPS (ttm),Insider Own,Shs Outstand,Perf Week,Market Cap,Forward P/E,EPS next Y,Insider Trans,Shs Float,Perf Month,Income,PEG,EPS next Q,Inst Own,Short Float,Perf Quarter,Sales,P/S,EPS this Y,Inst Trans,Short Ratio,Perf Half Y,Book/sh,P/B,EPS next Y,ROA,Target Price,Perf Year,Cash/sh,P/C,EPS next 5Y,ROE,52W Range,Perf YTD,Dividend,P/FCF,EPS past 5Y,ROI,52W High,Beta,Dividend %,Quick Ratio,Sales past 5Y,Gross Margin,52W Low,ATR,Employees,Current Ratio,Sales Q/Q,Oper. Margin,RSI (14),Volatility,Optionable,Debt/Eq,EPS Q/Q,Profit Margin,Rel Volume,Prev Close,Shortable,LT Debt/Eq,Earnings,Payout,Avg Volume,Price,Recom,SMA20,SMA50,SMA200,Volume,Change
-,-,-3.00,45.18%,5.19M,30.47%,15.78M,-,-,0.00%,2.84M,-16.48%,-14.00M,-,-,1.00%,9.24%,88.82%,18.30M,0.86,-122.00%,136.99%,0.26,88.82%,27.27,0.11,-,-,4.00,-51.44%,0.87,3.51,15.00%,-,1.30 - 8.00,-27.10%,-,-,-15.40%,0.40%,-62.00%,2.73,-,1.10,-16.40%,25.10%,133.85%,0.52,450,1.20,-58.50%,-,53.21,19.81% 17.08%,No,0.37,-,-,5.40,2.96,Yes,0.13,-,-,991.40K,3.04,3.00,1.72%,-6.24%,29.44%,"5,358,503",2.70%
Here is my code:
import csv
import urllib.request
from bs4 import BeautifulSoup
twiturl = "https://twitter.com/ACInvestorBlog"
twitpage = urllib.request.urlopen(twiturl)
soup = BeautifulSoup(twitpage,"html.parser")
print(soup.title.text)
tweets = [i.text for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b')]
print(tweets)
url_base = "https://finviz.com/quote.ashx?t="
url_list = [url_base + tckr for tckr in tweets]
for url in url_list:
    fpage = urllib.request.urlopen(url)
    fsoup = BeautifulSoup(fpage, 'html.parser')

    # scrape single page and add data to list
    # write datalist
    with open('output.csv', 'wt') as file:
        writer = csv.writer(file)
        # write header row
        writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2-cp'})))
        # write body row
        writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2'})))
Append mode
The issue is with your call open('output.csv', 'wt'): the 'w' option opens the file for (over)writing, truncating it on every pass through the loop, which is why only the last row survives. If you want to append data at the end of the existing file, use the 'a' option instead, as shown in the fine manual at https://docs.python.org/3.7/library/functions.html#open .
Also, you might want to check if the file exists beforehand and write the header row only if it does not.
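A minimal sketch of that approach, reusing url_list and the selectors from the question; the os.path.exists check and the newline='' argument are additions for illustration, not part of the original code:

import os

for url in url_list:
    fpage = urllib.request.urlopen(url)
    fsoup = BeautifulSoup(fpage, 'html.parser')

    # write the header only when the file does not exist yet
    write_header = not os.path.exists('output.csv')

    with open('output.csv', 'a', newline='') as file:  # 'a' appends instead of truncating
        writer = csv.writer(file)
        if write_header:
            writer.writerow(e.text for e in fsoup.find_all('td', {'class': 'snapshot-td2-cp'}))
        writer.writerow(e.text for e in fsoup.find_all('td', {'class': 'snapshot-td2'}))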
Related
Output scrape results into multiple .csv files, with python, BeautifulSoup, pandas?
I am scraping links from multiple pages under multiple searches and want to output the scraped results into multiple .csv files. The table below lists both my source urls and the desired output file names:

url    outputfile
https://www.marketresearch.com/search/results.asp?categoryid=230&qtype=2&publisher=IDCs&datepub=0&submit2=Search    outputPS1xIDC.csv
https://www.marketresearch.com/search/results.asp?categoryid=90&qtype=2&publisher=IDC&datepub=0&submit2=Search    outputPS2xIDC.csv
https://www.marketresearch.com/search/results.asp?categoryid=233&qtype=2&publisher=IDC&datepub=0&submit2=Search    outputPS3xIDC.csv
https://www.marketresearch.com/search/results.asp?categoryid=169&qtype=2&publisher=IDC&datepub=0&submit2=Search    outputPS4xIDC.csv

Now, with the code below, I managed to read the urls in sequence, and the rest of the code also works well (when I specify the output filename directly). However, it only outputs the last of the 4 pages in the list, so it overwrites the result each time. What I actually want is for it to output the results from the first url to the first outputfile, the second to the second, etc. (Of course my actual list of source URLs is much longer than these 4.) Please help, especially with the last line, as clearly just writing [outputs] there doesn't work.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

with open('inputs.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    urls = [row["url"] for row in reader]
    outputs = [row["outputfile"] for row in reader]

data = []
for url in urls:
    def scrape_it(url):
        page = requests.get(url, headers={'Cookie': 'ResultsPerPage=100'})
        soup = BeautifulSoup(page.text, 'html.parser')
        nexturl = soup.find_all(class_="standardLinkDkBlue")[-1]['href']
        stri = soup.find_all(class_="standardLinkDkBlue")[-1].string
        reports = soup.find_all("tr", {"class": ["SearchTableRowAlt", "SearchTableRow"]})
        for report in reports:
            data.append({
                'title': report.find('a', class_='linkTitle').text,
                'price': report.find('div', class_='resultPrice').text,
                'date_author': report.find('div', class_='textGrey').text.replace(' | published by: TechNavio', ''),
                'detail_link': report.a['href']
            })
        if 'next' not in stri:
            print("All pages completed")
        else:
            scrape_it(nexturl)
    scrape_it(url)

myOutput = pd.DataFrame(data)
myOutput.to_csv([outputs], header=False)  # works (but only for the last url) if instead of [outputs] I have f'filename.csv'
I don't have Pandas, and I don't really want to run your input, but a couple of things jump out at me when I look at your code. It looks like you are not looping over url and output together: you loop over all the URLs, and then after all those loops you write once. Likewise, data just has the HTML table data appended and appended; it is never reset for each individual URL. Without being able to run this, I recommend something like the following. The scraping is fully encapsulated and separate from the loop, so you can now more clearly see the flow of inputs and outputs:

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

def scrape_it(url, data):
    page = requests.get(url, headers={'Cookie': 'ResultsPerPage=100'})
    soup = BeautifulSoup(page.text, 'html.parser')
    nexturl = soup.find_all(class_="standardLinkDkBlue")[-1]['href']
    stri = soup.find_all(class_="standardLinkDkBlue")[-1].string
    reports = soup.find_all("tr", {"class": ["SearchTableRowAlt", "SearchTableRow"]})

    for report in reports:
        data.append({
            'title': report.find('a', class_='linkTitle').text,
            'price': report.find('div', class_='resultPrice').text,
            'date_author': report.find('div', class_='textGrey').text.replace(' | published by: TechNavio', ''),
            'detail_link': report.a['href']
        })

    if 'next' in stri:
        data = scrape_it(nexturl, data)

    return data

with open('inputs.csv', newline='') as csvfile:
    # read the rows once so both lists are populated (a DictReader is exhausted after one pass)
    rows = list(csv.DictReader(csvfile))
    urls = [row["url"] for row in rows]
    outputs = [row["outputfile"] for row in rows]

for (url, output) in zip(urls, outputs):
    # work on url and output together
    data = scrape_it(url, [])
    myOutput = pd.DataFrame(data)
    myOutput.to_csv(output, header=False)
Issue using BeautifulSoup and reading target URLs from a CSV
Everything works as expected when I'm using a single URL for the URL variable to scrape, but I'm not getting any results when attempting to read the links from a csv. Any help is appreciated.

Info about the CSV: one column with a header called "Links"; 300 rows of links with no space, comma, semicolon, or other characters before/after the links; one link in each row.

import requests  # required to make request
from bs4 import BeautifulSoup  # required to parse html
import pandas as pd
import csv

with open("urls.csv") as infile:
    reader = csv.DictReader(infile)
    for link in reader:
        res = requests.get(link['Links'])
        #print(res.url)

url = res
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

email_elm0 = soup.find_all(class_= "app-support-list__item")[0].text.strip()
email_elm1 = soup.find_all(class_= "app-support-list__item")[1].text.strip()
email_elm2 = soup.find_all(class_= "app-support-list__item")[2].text.strip()
email_elm3 = soup.find_all(class_= "app-support-list__item")[3].text.strip()

final_email_elm = (email_elm0, email_elm1, email_elm2, email_elm3)
print(final_email_elm)

df = pd.DataFrame(final_email_elm)
# getting an output in csv format for the dataframe we created
#df.to_csv('draft_part2_scrape.csv')
The problem lies in this part of the code:

with open("urls.csv") as infile:
    reader = csv.DictReader(infile)
    for link in reader:
        res = requests.get(link['Links'])
        ...

After the loop is executed, res holds only the last link, so this program scrapes only the last link. To solve the problem, store all the links in a list and iterate over that list to scrape each link. You can store each scraped result in a separate dataframe and concatenate them at the end to write a single file:

import requests  # required to make request
from bs4 import BeautifulSoup  # required to parse html
import pandas as pd
import csv

links = []
with open("urls.csv") as infile:
    reader = csv.DictReader(infile)
    for link in reader:
        links.append(link['Links'])

dfs = []
for url in links:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    email_elm0 = soup.find_all(class_="app-support-list__item")[0].text.strip()
    email_elm1 = soup.find_all(class_="app-support-list__item")[1].text.strip()
    email_elm2 = soup.find_all(class_="app-support-list__item")[2].text.strip()
    email_elm3 = soup.find_all(class_="app-support-list__item")[3].text.strip()

    final_email_elm = (email_elm0, email_elm1, email_elm2, email_elm3)
    print(final_email_elm)

    dfs.append(pd.DataFrame(final_email_elm))

# getting an output in csv format for the dataframe we created
df = pd.concat(dfs)
df.to_csv('draft_part2_scrape.csv')
Writing Hrefs to CSV file
Having trouble writing to a CSV file; code is below. The whole set ends up on a single row when I write it into the "fofo" file:

response = requests.get(href)
soup = BeautifulSoup(response.content, 'lxml')

# This opens the "shomo" file with existing hrefs
with open('shomo.csv', newline='') as f:
    reader = csv.reader(f)
    seen = {row[0] for row in reader}

allthreads = soup.find('table', class_='categories').find_all('p')
for thread in allthreads:
    thread_link = thread.a.get('href')
    # Checks if link is in "seen"
    if thread_link not in seen:
        seen.add(thread_link)  # Add new href to seen
        thread_data = scrape_thread_link(thread_link)  # Calls function

        # Having trouble with this part
        with open('fofo.csv', 'w', newline='') as file:
            writer = csv.writer(file)
            writer.writerows([seen])

Without opening a previous file and adding to seen, the code prints perfectly fine, as below; not sure what is different/wrong?

import csv
import time
from bs4 import BeautifulSoup
import requests
import re

response = requests.get('https://website.com')
soup = BeautifulSoup(response.content, 'lxml')

seen = set()
with open('momo.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    allthreads = soup.find('table', class_='categories').find_all('p')
    for thread in allthreads:
        #thread_name = thread.text
        #print (thread_name)
        thread_link = thread.a.get('href')
        if thread_link not in seen:
            seen.add(thread_link)
            writer.writerows([seen])
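One likely cause, offered as a hedged aside rather than the thread's original answer: csv.writer.writerows expects a sequence of rows, so writer.writerows([seen]) writes the entire seen set as one row. A minimal sketch that writes one href per row instead, assuming the same scrape_thread_link helper and allthreads selection from the question:

with open('fofo.csv', 'a', newline='') as file:  # 'a' keeps rows written on earlier passes
    writer = csv.writer(file)
    for thread in allthreads:
        thread_link = thread.a.get('href')
        if thread_link not in seen:
            seen.add(thread_link)
            thread_data = scrape_thread_link(thread_link)  # helper from the question
            writer.writerow([thread_link])  # one href per row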
How to read the entire urls from the first column in a csv file
I am trying to read the urls from the first column in a csv file. In the csv file, there are 6051 urls in total which I want to read. To do so, I tried the following code:

urls = []
with open("C:/Users/hyoungm/Downloads/urls.csv") as csvfile:
    blogurl = csv.reader(csvfile)
    for row in blogurl:
        row = row[0]
        print(row)
len(row)

However, the number of urls that are shown is only 65. I have no idea why the total number of urls appears differently from the csv file. Can anybody help me figure out how to read all the urls (6051 in total) from the csv file?

To read all the urls from the csv file, I also tried several different pieces of code that resulted in the same number of urls (i.e., 65 urls) or failure, such as:

1)
openfile = open("C:/Users/hyoungm/Downloads/urls.csv")
r = csv.reader(openfile)
for i in r:
    # the urls are in the first column ... 0 refers to the first column
    blogurls = i[0]
    print(blogurls)
len(blogurls)

2)
urls = pd.read_csv("C:/Users/hyoungm/Downloads/urls.csv")
with closing(requests.get(urls, stream = True)) as r:
    reader = csv.reader(r.iter_lines(), delimiter = ',', quotechar = '""')
    for row in reader:
        print(row)
len(row)

3)
with open("C:/Users/hyoungm/Downloads/urls.csv") as csvfile:
    lines = csv.reader(csvfile)
    for i, line in enumerate(lines):
        if i == 0:
            for line in csvfile:
                print(line[1:])
len(line)

and 4)
blogurls = []
with open("C:/Users/hyoungm/Downloads/urls.csv") as csvfile:
    r = csv.reader(csvfile)
    for i in r:
        blogurl = i[0]
        r = requests.get(blogurl)
        blogurls.append(blogurl)

for url in blogurls:
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "html.parser")
len(blogurls)

I expect the output of 6051 urls as originally collected in the csv file, instead of 65 urls. After reading all the urls, I am going to crawl the textual data from each url, and I am supposed to get the following textual data using all 6051 urls. Please click the following link for the image: the codes and the outcomes based on 65 urls so far
The following two approaches work for me:

import requests

r = requests.get('https://raw.githubusercontent.com/GemmyMoon/MultipleUrls/master/urls.csv')
urls = r.text.splitlines()

print(len(urls))  # Returns 6051

and

import csv
import requests
from io import StringIO

r = requests.get('https://raw.githubusercontent.com/GemmyMoon/MultipleUrls/master/urls.csv')
reader = csv.reader(StringIO(r.text))
urls = [line[0] for line in reader]

print(len(urls))  # Returns 6051
Python BeautifulSoup Empty Rows in CSV File
I am working on a scraper to pull street names and zip codes from a site, and all of that is working great: it builds a CSV file just fine for me. But when I open the CSV file in Excel, the file has a blank row, then a row with a street name and the zip code in the next column, just like I want, then another blank row, then another row with a street name and zip code beside it, and so on all the way through the file. When imported into the phpMyAdmin database, this gives me a row with a street name and zip code, then the word None in the next row. I want to get rid of the blank rows. Here is my code.

from bs4 import BeautifulSoup
import csv
import urllib2

url = "http://www.conakat.com/states/ohio/cities/defiance/road_maps/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

f = csv.writer(open("Defiance Steets1.csv", "w"))
f.writerow(["Street", "Zipcode"])  # Write column headers as the first line

links = soup.find_all('a')
for link in links:
    i = link.find_next_sibling('i')
    if getattr(i, 'name', None):
        a, i = link.string, i.string[1:-1]
        f.writerow([a, i])
This worked for me (I added lineterminator="\n"):

from BeautifulSoup import BeautifulSoup
import csv
import urllib2

url = "http://www.conakat.com/states/ohio/cities/defiance/road_maps/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

f = csv.writer(open("Defiance Steets1.csv", "w"), lineterminator="\n")
f.writerow(["Street", "Zipcode"])  # Write column headers as the first line

#print soup
links = soup.findAll('a')
for link in links:
    #i = link.find_next_sibling('i')
    i = link.findNextSibling('i')
    if getattr(i, 'name', None):
        a, i = link.string, i.string[1:-1]
        print [a, i]
        f.writerow([a, i])
This works for me... thanks. If you have the writer and the open call on different lines, put lineterminator as a param in the writer function...
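A hedged side note, not part of the original answers: the answer above is Python 2 code, and on Python 3 the csv module documentation recommends opening the output file with newline='' rather than relying on lineterminator, which avoids the same blank-row symptom in Excel on Windows. A minimal sketch with a hypothetical data row for illustration:

import csv

# Python 3 sketch: newline='' prevents the extra blank rows Excel shows on Windows
with open("Defiance Steets1.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Street", "Zipcode"])   # header row
    writer.writerow(["Main St", "43512"])    # hypothetical data row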