Python Requests to parse HTML to get CSV

I am sending a POST request to a website that displays a CSV. The CSV is not downloadable; it only exists on the page in the form it is in, so it can be copied and pasted.
I am trying to get the HTML from the POST response, extract the CSV, and export it into a CSV file to then run a function on. I have managed to get it into CSV form as a string, but there don't appear to be any newlines, i.e.
>>> print(text1)
"Heading 1","Heading 2""Item 1","Item 2"
not
"Heading 1","Heading 2"
"Item 1","Item 2"
Is this format OK?
If not, how do I get it into an OK format?
Secondly, how can I write this string into a CSV file? If I try to convert text1 into bytes, I get _csv.Error: iterable expected, not int; if I don't, I get TypeError: a bytes-like object is required, not 'str'.
My code so far:
with requests.Session() as s:
    response = s.post(headers=headers, data=data, url=url)
    html = response.content

soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
text1 = text.replace(text[:56], '')
print(text1)

I think this will work for you. This will find the element containing the CSV data (it could be a body, a div, a p, etc.) and only extract text from there, so you don't need to worry about scripts or classes getting into your data:
import csv
from bs4 import BeautifulSoup

# emulate your html format
html_string = '''
<body>
<div class="csv">"Category","Position","Name","Time","Team","avg_power","20min","Male?"<br>"A","1","James ","00:21:31.45","5743","331","5.3","1"<br>"A","2","Da","00:21:31.51","4435","377","5.0","1"<br>"A","3","Timmy ","00:21:31.52","3964","371","4.8","1"<br>"A","4","Timothy ","00:21:31.83","5229","401","5.7","1"<br>"A","5","Stefan ","00:21:31.86","2991","338","","1"<br>"A","6","Josh ","00:21:31.92","","403","5.1","1"<br></div>
</body>
'''

soup = BeautifulSoup(html_string, 'html.parser')

# turn the <br> tags into real newlines
for br in soup.find_all('br'):
    br.replace_with('\n')

rows = [[i.replace('"', '').strip()  # clean the items
         for i in item.split(',')]   # split each line by the comma
        # get all the lines inside the div;
        # find() returns the first element matching the filter
        for item in soup.find('div', class_='csv').text.splitlines()]

# csv writing function
def write_csv(path, data):
    with open(path, 'w', newline='') as file:  # newline='' avoids blank rows on Windows
        writer = csv.writer(file)
        writer.writerows(data)

print(rows)
write_csv('./data.csv', rows)
Output (data.csv):
Category,Position,Name,Time,Team,avg_power,20min,Male?
A,1,James,00:21:31.45,5743,331,5.3,1
A,2,Da,00:21:31.51,4435,377,5.0,1
A,3,Timmy,00:21:31.52,3964,371,4.8,1
A,4,Timothy,00:21:31.83,5229,401,5.7,1
A,5,Stefan,00:21:31.86,2991,338,,1
A,6,Josh,00:21:31.92,,403,5.1,1
soup.find() / find_all() can isolate an HTML element for you to scrape from, so you don't have to worry about parsing the other elements.
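On the second part of the question: csv.writer expects a file opened in text mode and rows as iterables of strings, so handing it raw bytes or a single string fails with exactly those errors. A minimal sketch of writing the recovered string out, assuming text1 already holds one CSV row per line:

import csv
import io

# text1 as recovered above: one CSV row per line
text1 = '"Heading 1","Heading 2"\n"Item 1","Item 2"'

# csv.reader accepts any iterable of lines; io.StringIO wraps the string
rows = list(csv.reader(io.StringIO(text1)))

# csv.writer wants a text-mode file; newline='' avoids blank rows on Windows
with open('out.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)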

Related

Select multiple elements with BeautifulSoup and manage them individually

I am using BeautifulSoup to parse a webpage of poetry. The poetry is separated into h3 for the poem title and .line for each line of the poem. I can get both elements and add them to a list. But I want to manipulate the h3 to be uppercase, indicate a line break after it, and then insert it into the lines list.
linesArr = []
for lines in full_text:
    booktitles = lines.select('h3')
    for booktitle in booktitles:
        linesArr.append(booktitle.text.upper())
        linesArr.append('')
    for line in lines.select('h3, .line'):
        linesArr.append(line.text)
This code appends all book titles to the beginning of the list, then continues getting the h3 and .line items. I have tried inserting code like this:
linesArr = []
for lines in full_text:
    for line in lines.select('h3, .line'):
        if line.find('h3'):
            linesArr.append(line.text.upper())
            linesArr.append('')
        else:
            linesArr.append(line.text)
I'm not sure exactly what you are trying to do, but this way you can get an array with the title in upper case followed by all of your lines:
#!/usr/bin/python3
# coding: utf8

from bs4 import BeautifulSoup
import requests

page = requests.get("https://quod.lib.umich.edu/c/cme/CT/1:1?rgn=div2;view=fulltext")
soup = BeautifulSoup(page.text, 'html.parser')

title = soup.find('h3')
full_lines = soup.find_all('div', {'class': 'line'})

linesArr = []
linesArr.append(title.get_text().upper())
for line in full_lines:
    linesArr.append(line.get_text())

# Print full array with the title and text
print(linesArr)

# Print text here with line break
for linea in linesArr:
    print(linea + '\n')
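If you would rather keep the per-poem loop from the question (so each title stays next to its own lines), the check that distinguishes a title from a line is the element's tag name, not find() — line.find('h3') searches inside the element, so it never matches the h3 itself. A sketch, assuming full_text is the iterable of poem containers from the question:

linesArr = []
for lines in full_text:
    for line in lines.select('h3, .line'):
        if line.name == 'h3':  # .name is the tag's own name, e.g. 'h3'
            linesArr.append(line.text.upper())
            linesArr.append('')  # blank entry marks the line break
        else:
            linesArr.append(line.text)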

Scraping multiple URLs using beautiful soup

I have a dataframe with one of the columns containing over 4000 different URLs for articles. I have implemented the following code to extract all the text from the URLs; it seems to work for maybe one or two URLs, but does not work for all.
for i in df.url:
    http = urllib3.PoolManager()
    response = http.request('GET', i)
    soup = bsoup(response.data, 'html.parser')

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
break
In the first for loop, you are assigning every parsed page to the same variable, soup. At the end of the loop, this variable contains the parsed content of only the last url, not all the urls as you expected. That's why you are seeing only one output.
You can put all of your code in a single loop:
for url in df.url:
    http = urllib3.PoolManager()
    response = http.request('GET', url)
    soup = bsoup(response.data, 'html.parser')
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()  # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    print(url)
    print(text)
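With 4000+ urls you will probably want the text kept rather than printed, and the PoolManager can be created once and reused for every request. A sketch of the same loop collecting results into a dict and attaching them back to the dataframe (not from the original answer, just an illustration):

http = urllib3.PoolManager()  # one connection pool reused for all requests

texts = {}
for url in df.url:
    response = http.request('GET', url)
    soup = bsoup(response.data, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract()
    lines = (line.strip() for line in soup.get_text().splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    texts[url] = '\n'.join(chunk for chunk in chunks if chunk)

df['text'] = df.url.map(texts)  # new column with the scraped text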

How to filter out unimportant text from html to text conversion?

I am currently scraping a couple of websites, loading their HTML contents and stripping out all the unnecessary parts (tags etc.).
It often happens that I still get a lot of unneeded words that describe the bottom of the page or other subpages of the website.
import requests
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize

def text_extractor(url):
    try:
        html = requests.get(url).text
        soup = BeautifulSoup(html, 'lxml')
        for script in soup(["script", "style", "p"]):
            script.decompose()
        text = soup.get_text()
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        text = '\n'.join(chunk for chunk in chunks if chunk)
        text = word_tokenize(text)
        url = str(url).replace('/', '').replace('http:', '').replace('https:', '')
        name = 'textextractor_' + url + '.txt'
        f = open(name, 'a')
        f.write(str(text))
        f.close()
        return text
    except requests.exceptions.ConnectionError:
        pass
I am working with this function, which does a good job, but most of the generated tokens are irrelevant. Is there a smart way to get only the text that is relevant for the page and describes, for example, an event or something similar?
These are examples for unimportant tokens:
'JournalismBig', 'HollywoodNational', 'SecurityTechVideoSportsThe', 'WiresBreitbart', 'LondonBreitbart', 'JerusalemBreitbart', 'TexasBreitbart', 'CaliforniaPeopleSTORE', 'HomeSubscribe', 'advertisement'
Thanks a lot!
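Those tokens are mostly navigation menus and footer links that get_text() sweeps up along with the article. There is no reliable universal switch for "relevant text only", but a common heuristic is to decompose the obvious chrome elements and read from the page's main content container when it has one. A sketch of that idea; the tag names here are assumptions about the pages, not something from the question:

import requests
from bs4 import BeautifulSoup

def main_text(url):
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    # drop site chrome along with scripts and styles
    for tag in soup(['script', 'style', 'nav', 'header', 'footer', 'aside']):
        tag.decompose()
    # prefer a semantic main-content container if the page provides one
    body = soup.find('article') or soup.find('main') or soup.body or soup
    return body.get_text(separator='\n', strip=True)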

How can I populate a txt file with results from a mechanize form?

I am trying to populate a txt file with the response I get from a mechanize form. Here's the form code:
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.open('https://www.cpsbc.ca/physician_search')

first = raw_input('Enter first name: ')
last = raw_input('Enter last name: ')

br.select_form(nr=0)
br.form['filter[first_name]'] = first
br.form['filter[last_name]'] = last

response = br.submit()
content = response.read()
soup = BeautifulSoup(content, "html.parser")
for row in soup.find_all('tbody'):
    print row
This spits out lines of HTML code, the number depending on how many privileges the doc has with regard to locations, but the last line has their specialty of training. Please go ahead and test it with any physician from BC, Canada.
I have a txt file that is listed as such:
lastname1, firstname1
lastname2, firstname2
lastname3, firstname3 middlename3
lastname4, firstname4 middlename4
I hope you get the idea. I would appreciate any help in automating the following step: go through the txt names one by one and record the output text into a new txt file.
So far, I have this working to spit out the row (which is raw HTML), which I don't mind, but I can't get it to write into a txt file...
import mechanize
from bs4 import BeautifulSoup

with open('/Users/s/Downloads/hope.txt', 'w') as file_out:
    with open('/Users/s/Downloads/names.txt', 'r') as file_in:
        for line in file_in:
            a = line
            delim = ", "
            i1 = a.find(delim)
            br = mechanize.Browser()
            br.open('https://www.cpsbc.ca/physician_search')
            br.select_form(nr=0)
            br.form['filter[first_name]'] = a[i1+2:]
            br.form['filter[last_name]'] = a[:i1]
            response = br.submit()
            content = response.read()
            soup = BeautifulSoup(content, "html.parser")
            for row in soup.find_all('tbody'):
                print row
This should not be too complicated. Assuming your file with all the names you want to query upon is called "names.txt" and the output file you want to create is called "output.txt", the code should look something like:
with open('output.txt', 'w') as file_out:
    with open('names.txt', 'r') as file_in:
        for line in file_in:
            <your parsing logic goes here>
            file_out.write(new_record)
This assumes your parsing logic generates some sort of "record" to be written to the file as a string.
If you get more advanced, you can also look into the csv module to import/export data in CSV.
Also have a look at the Input and Output tutorial.
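Putting that pattern together with your loop, a sketch that writes each result row as flattened text (same paths and form fields as above; partition() also drops the trailing newline that a[i1+2:] kept):

import mechanize
from bs4 import BeautifulSoup

with open('/Users/s/Downloads/hope.txt', 'w') as file_out:
    with open('/Users/s/Downloads/names.txt', 'r') as file_in:
        for line in file_in:
            # "lastname, firstname" -> two fields, newline stripped
            last, _, first = line.strip().partition(', ')
            br = mechanize.Browser()
            br.open('https://www.cpsbc.ca/physician_search')
            br.select_form(nr=0)
            br.form['filter[first_name]'] = first
            br.form['filter[last_name]'] = last
            soup = BeautifulSoup(br.submit().read(), 'html.parser')
            for row in soup.find_all('tbody'):
                # get_text() flattens the row's html into plain text
                file_out.write(row.get_text(' ', strip=True) + '\n')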

python: extracting text from any website

So far my code successfully gets text from these two websites:
http://www.tutorialspoint.com/cplusplus/index.htm
http://www.cplusplus.com/doc/tutorial/program_structure/
But I don't know where I am going wrong; it does not get text from other websites, and it gives me an error when I use other links, such as:
http://www.cmpe.boun.edu.tr/~akin/cmpe223/chap2.htm
http://www.i-programmer.info/babbages-bag/477-trees.html
http://www.w3schools.com/html/html_elements.asp
Error:
Traceback (most recent call last):
  File "C:\Users\DELL\Desktop\python\s\fyp\data extraction.py", line 20, in <module>
    text = soup.select('.C_doc')[0].get_text()
IndexError: list index out of range
My code:
import urllib
from bs4 import BeautifulSoup

url = "http://www.i-programmer.info/babbages-bag/477-trees.html"  # unsuccessful
#url = "http://www.tutorialspoint.com/cplusplus/index.htm"  # works successfully
#url = "http://www.cplusplus.com/doc/tutorial/program_structure/"  # works successfully

html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style", "a", "<div id=\"bottom\" >"]):
    script.extract()  # rip it out

# get text
#text = soup.select('.C_doc')[0].get_text()
#text = soup.select('.content')[0].get_text()
if soup.select('.content'):
    text = soup.select('.content')[0].get_text()
else:
    text = soup.select('.C_doc')[0].get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print text

fo = open('foo.txt', 'w')
fo.seek(0, 2)
line = fo.writelines(text)
fo.close()
# writing done :)
Try using
Text = soup.findAll(text=True)
UPDATE
This is a basic text stripper you can start from.
import urllib
from bs4 import BeautifulSoup

url = "http://www.i-programmer.info/babbages-bag/477-trees.html"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
for script in soup(["script", "style", "a", "<div id=\"bottom\" >"]):
    script.extract()
text = soup.findAll(text=True)
for p in text:
    print p
You are assuming that every website you scrape has a class named content OR C_doc.
What if the website you scrape does not have a class named C_doc either?
Here is the fix:
text = ''
if soup.select('.content'):
    text = soup.select('.content')[0].get_text()
elif soup.select('.C_doc'):
    text = soup.select('.C_doc')[0].get_text()

if text:
    pass  # put the rest of the code here
else:
    print 'text does not exist.'
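The same idea generalizes to any list of candidate class names: try each selector in turn and fall back to the whole page when none match. A sketch (the fallback is an assumption, not part of the original answer):

text = ''
for selector in ('.content', '.C_doc'):
    matches = soup.select(selector)
    if matches:
        text = matches[0].get_text()
        break
if not text:
    text = soup.get_text()  # fall back to the whole page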
