Python BeautifulSoup Empty Rows in CSV File

I am working on a scraper to pull street names and zip codes from a site, and all of that works: it builds a CSV file just fine for me. But when I open the CSV file in Excel, there is a blank row, then a row with a street name and the zip code in the next column (just like I want), then another blank row, then another row with a street name and zip code beside it, and so on all the way through the file. When imported into the phpMyAdmin database, this gives me a row with a street name and zip code followed by the word "None" in the next row. I want to get rid of the blank rows. Here is my code:
from bs4 import BeautifulSoup
import csv
import urllib2

url = "http://www.conakat.com/states/ohio/cities/defiance/road_maps/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

f = csv.writer(open("Defiance Steets1.csv", "w"))
f.writerow(["Street", "Zipcode"])  # Write column headers as the first line

links = soup.find_all('a')
for link in links:
    i = link.find_next_sibling('i')
    if getattr(i, 'name', None):
        a, i = link.string, i.string[1:-1]
        f.writerow([a, i])

This worked for me (I added lineterminator="\n"):
from BeautifulSoup import BeautifulSoup
import csv
import urllib2

url = "http://www.conakat.com/states/ohio/cities/defiance/road_maps/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

f = csv.writer(open("Defiance Steets1.csv", "w"), lineterminator="\n")
f.writerow(["Street", "Zipcode"])  # Write column headers as the first line

#print soup.
links = soup.findAll('a')
for link in links:
    #i = link.find_next_sibling('i')
    i = link.findNextSibling('i')
    if getattr(i, 'name', None):
        a, i = link.string, i.string[1:-1]
        print [a, i]
        f.writerow([a, i])

This works for me, thanks! If you call open() and create the writer on different lines, pass lineterminator as a param to the writer function.
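For context on why the fix works: on Windows the csv module ends each row with \r\n, and a file opened in plain text mode translates that to \r\r\n, which Excel displays as a blank row after every record. Passing lineterminator="\n" sidesteps that in Python 2. On Python 3 the idiomatic fix is to open the file with newline='' instead; a minimal sketch (the filename and row are just illustrative):
import csv

# Python 3 sketch: newline='' hands line-ending control to the csv
# module, so no extra blank rows show up in Excel.
with open("streets.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["Street", "Zipcode"])
    writer.writerow(["Example St", "43512"])  # illustrative row, not scraped data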

Related

How to scrape a website properly and get all td texts from it

I am new to Python. Does anyone know what the [1:] does in sum(int(td.text) for td in soup.select('td:last-child')[1:]), or what [0] or [1] mean? I have seen them in many scraping examples, usually around for loops. Also, as I was practicing I built the code below and am not able to scrape all the data into a CSV file. Thanks in advance, and sorry for asking two questions at once.
import requests
from bs4 import BeautifulSoup
import csv

url = "https://iplt20.com/stats/2020/most-runs"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html5lib')

lst = []
table = soup.find('div', attrs={'class': 'js-table'})
#for row in table.findAll('div', attrs={'class': 'top-players__player-name'}):
#    score = {}
#    score['Player'] = row.a.text.strip()
#    lst.append(score)
for row in table.findAll(class_='top-players__m top-players__padded '):
    score = {}
    score['Matches'] = int(row.td.text)
    lst.append(score)

filename = 'iplStat.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['Player', 'Matches'])
    w.writeheader()
    for score in lst:
        w.writerow(score)
print(lst)
All of this is not even needed. Just use pandas:
import requests
import pandas as pd

url = "https://iplt20.com/stats/2020/most-runs"
r = requests.get(url)
df = pd.read_html(r.content)[0]
df.to_csv("iplStats.csv", index=False)

Issue using BeautifulSoup and reading target URLs from a CSV

Everything works as expected when I use a single URL for the URL variable to scrape, but I get no results when attempting to read the links from a CSV. Any help is appreciated.
Info about the CSV:
One column with a header called "Links"
300 rows of links, with no spaces, commas, semicolons, or other characters before/after the links
One link in each row
import requests  # required to make request
from bs4 import BeautifulSoup  # required to parse html
import pandas as pd
import csv

with open("urls.csv") as infile:
    reader = csv.DictReader(infile)
    for link in reader:
        res = requests.get(link['Links'])
        #print(res.url)

url = res
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
email_elm0 = soup.find_all(class_="app-support-list__item")[0].text.strip()
email_elm1 = soup.find_all(class_="app-support-list__item")[1].text.strip()
email_elm2 = soup.find_all(class_="app-support-list__item")[2].text.strip()
email_elm3 = soup.find_all(class_="app-support-list__item")[3].text.strip()
final_email_elm = (email_elm0, email_elm1, email_elm2, email_elm3)
print(final_email_elm)
df = pd.DataFrame(final_email_elm)
#getting an output in csv format for the dataframe we created
#df.to_csv('draft_part2_scrape.csv')
The problem lies in this part of the code:
with open("urls.csv") as infile:
reader = csv.DictReader(infile)
for link in reader:
res = requests.get(link['Links'])
...
After the loop is executed, res holds only the response for the last link, so this program scrapes just that one page.
To solve this, store all the links in a list and iterate over that list to scrape each link. You can store each scraped result in a separate dataframe and concatenate them at the end to write a single file:
import requests  # required to make request
from bs4 import BeautifulSoup  # required to parse html
import pandas as pd
import csv

links = []
with open("urls.csv") as infile:
    reader = csv.DictReader(infile)
    for link in reader:
        links.append(link['Links'])

dfs = []
for url in links:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    email_elm0 = soup.find_all(class_="app-support-list__item")[0].text.strip()
    email_elm1 = soup.find_all(class_="app-support-list__item")[1].text.strip()
    email_elm2 = soup.find_all(class_="app-support-list__item")[2].text.strip()
    email_elm3 = soup.find_all(class_="app-support-list__item")[3].text.strip()
    final_email_elm = (email_elm0, email_elm1, email_elm2, email_elm3)
    print(final_email_elm)
    dfs.append(pd.DataFrame(final_email_elm))

#getting an output in csv format for the dataframes we created
df = pd.concat(dfs)
df.to_csv('draft_part2_scrape.csv')
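One caveat about the loop above: indexing [0] through [3] raises an IndexError on any page with fewer than four matching elements. A slightly more defensive sketch, assuming the same class name, slices instead of indexing fixed positions:
import requests
from bs4 import BeautifulSoup

def scrape_items(url):
    # Return up to the first four support-list items on a page,
    # without raising IndexError when fewer are present.
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    items = soup.find_all(class_="app-support-list__item")
    return tuple(el.text.strip() for el in items[:4])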

Need help writing to a CSV file in Python 3.5

My code writes to a CSV file titled 'output' (it builds on help from an earlier question on this code).
When I run it, the CSV file keeps being rewritten over the same body row. I want a new row written every time new information is scraped from the stock table at each URL.
Here is what my CSV file looks like:
Index,P/E,EPS (ttm),Insider Own,Shs Outstand,Perf Week,Market Cap,Forward P/E,EPS next Y,Insider Trans,Shs Float,Perf Month,Income,PEG,EPS next Q,Inst Own,Short Float,Perf Quarter,Sales,P/S,EPS this Y,Inst Trans,Short Ratio,Perf Half Y,Book/sh,P/B,EPS next Y,ROA,Target Price,Perf Year,Cash/sh,P/C,EPS next 5Y,ROE,52W Range,Perf YTD,Dividend,P/FCF,EPS past 5Y,ROI,52W High,Beta,Dividend %,Quick Ratio,Sales past 5Y,Gross Margin,52W Low,ATR,Employees,Current Ratio,Sales Q/Q,Oper. Margin,RSI (14),Volatility,Optionable,Debt/Eq,EPS Q/Q,Profit Margin,Rel Volume,Prev Close,Shortable,LT Debt/Eq,Earnings,Payout,Avg Volume,Price,Recom,SMA20,SMA50,SMA200,Volume,Change
-,-,-3.00,45.18%,5.19M,30.47%,15.78M,-,-,0.00%,2.84M,-16.48%,-14.00M,-,-,1.00%,9.24%,88.82%,18.30M,0.86,-122.00%,136.99%,0.26,88.82%,27.27,0.11,-,-,4.00,-51.44%,0.87,3.51,15.00%,-,1.30 - 8.00,-27.10%,-,-,-15.40%,0.40%,-62.00%,2.73,-,1.10,-16.40%,25.10%,133.85%,0.52,450,1.20,-58.50%,-,53.21,19.81% 17.08%,No,0.37,-,-,5.40,2.96,Yes,0.13,-,-,991.40K,3.04,3.00,1.72%,-6.24%,29.44%,"5,358,503",2.70%
Here is my code:
import csv
import urllib.request
from bs4 import BeautifulSoup

twiturl = "https://twitter.com/ACInvestorBlog"
twitpage = urllib.request.urlopen(twiturl)
soup = BeautifulSoup(twitpage, "html.parser")
print(soup.title.text)

tweets = [i.text for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b')]
print(tweets)

url_base = "https://finviz.com/quote.ashx?t="
url_list = [url_base + tckr for tckr in tweets]

for url in url_list:
    fpage = urllib.request.urlopen(url)
    fsoup = BeautifulSoup(fpage, 'html.parser')
    #scrape single page and add data to list
    #write datalist
    with open('output.csv', 'wt') as file:
        writer = csv.writer(file)
        # write header row
        writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2-cp'})))
        # write body row
        writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2'})))
Append mode
The issue is with your call open('output.csv', 'wt'): the 'w' option opens the file for (over)writing, truncating it on every loop iteration. If you want to append data to the end of the existing file, use the 'a' option instead, as shown in the fine manual at https://docs.python.org/3.7/library/functions.html#open .
Also, you might want to check whether the file exists beforehand and write the header row only if it does not; a sketch of that pattern follows below.
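A minimal sketch of that check-then-append pattern (generic, not tied to the finviz markup): write the header only when the file does not exist yet, then always append the body row.
import csv
import os

def append_row(path, header, row):
    # Write the header only the first time the file is created;
    # afterwards, 'a' mode just appends new body rows.
    write_header = not os.path.exists(path)
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerow(row)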

How can I populate a txt file with results from a mechanize form?

I am trying to populate a txt file with the response I get from a mechanized form. Here's the form code
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.open('https://www.cpsbc.ca/physician_search')

first = raw_input('Enter first name: ')
last = raw_input('Enter last name: ')

br.select_form(nr=0)
br.form['filter[first_name]'] = first
br.form['filter[last_name]'] = last
response = br.submit()

content = response.read()
soup = BeautifulSoup(content, "html.parser")
for row in soup.find_all('tbody'):
    print row
This spits out lines of HTML, the number depending on how many location privileges the doctor has, but the last line has their specialty of training. Please go ahead and test it with any physician from BC, Canada.
I have a txt file that is listed as such:
lastname1, firstname1
lastname2, firstname2
lastname3, firstname3 middlename3
lastname4, firstname4 middlename4
I hope you get the idea. I would appreciate any help automating the following step:
go through the names in the txt file one by one and record the output text for each into a new txt file.
So far, I have the code below, which spits out the rows (raw HTML, which I don't mind), but I can't get it to write them into a txt file...
import mechanize
from bs4 import BeautifulSoup

with open('/Users/s/Downloads/hope.txt', 'w') as file_out:
    with open('/Users/s/Downloads/names.txt', 'r') as file_in:
        for line in file_in:
            a = line
            delim = ", "
            i1 = a.find(delim)
            br = mechanize.Browser()
            br.open('https://www.cpsbc.ca/physician_search')
            br.select_form(nr=0)
            br.form['filter[first_name]'] = a[i1+2:]
            br.form['filter[last_name]'] = a[:i1]
            response = br.submit()
            content = response.read()
            soup = BeautifulSoup(content, "html.parser")
            for row in soup.find_all('tbody'):
                print row
This should not be too complicated. Assuming your file with all the names you want to query upon is called "names.txt" and the output file you want to create is called "output.txt", the code should look something like:
with open('output.txt', 'w') as file_out:
    with open('names.txt', 'r') as file_in:
        for line in file_in:
            <your parsing logic goes here>
            file_out.write(new_record)
This assumes your parsing logic generates some sort of "record" to be written to the file as a string.
If you get more advanced, you can also look into the csv module to import/export data in CSV.
Also have a look at the Input and Output tutorial.
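Applied to the mechanize loop in the question, the only missing piece is writing each row instead of printing it. A minimal sketch of the inner loop, reusing soup and file_out from the code above (str(row) dumps the raw HTML, which the question says is acceptable):
# Replace the print loop with writes to file_out.
for row in soup.find_all('tbody'):
    file_out.write(str(row))   # serialize the Tag back to raw HTML
    file_out.write('\n')       # one record per line in the output txt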

Python, BeautifulSoup iterating through files issue

This may end up being a really novice question, because I'm a novice, but here goes.
I have a set of .html pages obtained using wget. I want to iterate through them and extract certain info, putting it in a .csv file.
Using the code below, all the names print when my program runs, but only the info from the next-to-last page (i.e., page 29.html here) is written to the .csv file. I'm trying this with only a handful of files at first; there are about 1,200 that I'd like to get into this format.
The files are based on those here: https://www.cfis.state.nm.us/media/ReportLobbyist.aspx?id=25&el=2014 where the page numbers are the id.
Thanks for any help!
from bs4 import BeautifulSoup
import urllib2
import csv

for i in xrange(22, 30):
    try:
        page = urllib2.urlopen('file:{}.html'.format(i))
    except:
        continue
    else:
        soup = BeautifulSoup(page.read())
        n = soup.find(id='ctl00_ContentPlaceHolder1_lnkBCLobbyist')
        name = n.string
        print name

        table = soup.find('table', 'reportTbl')

        #get the rows
        list_of_rows = []
        for row in table.findAll('tr')[1:]:
            col = row.findAll('td')
            filing = col[0].string
            status = col[1].string
            cont = col[2].string
            exp = col[3].string
            record = (name, filing, status, cont, exp)
            list_of_rows.append(record)

        #write to file
        writer = csv.writer(open('lob.csv', 'wb'))
        writer.writerows(list_of_rows)
You need to append, not overwrite: open('lob.csv', 'wb') truncates the file on every pass through your outer loop. Use append mode ('a') instead:
writer = csv.writer(open('lob.csv', 'ab'))
writer.writerows(list_of_rows)
You could also declare list_of_rows = [] outside the for loops and write to the file once at the very end, as sketched below.
If you also want page 30, you need to loop over range(22, 31).
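A minimal sketch of that accumulate-then-write-once alternative (the per-page parsing is reduced to a placeholder; the point is the structure):
import csv

list_of_rows = []
for i in xrange(22, 31):           # 31 so that page 30 is included
    # ... parse page i and build records as in the question ...
    list_of_rows.append(('name', 'filing', 'status', 'cont', 'exp'))  # placeholder record

# open the output file once, after all pages have been parsed
with open('lob.csv', 'wb') as f:
    csv.writer(f).writerows(list_of_rows)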
