I'm trying to scrape specific text from a website. Because I'm new to Python, I find it difficult to do the scraping in a single script, so I used this code first:
import urllib
import requests
from bs4 import BeautifulSoup
htmltext = urllib.urlopen("https://io.winmasters.com/Feeds/api/event/282576?lang=el").read()
data = htmltext
soup = BeautifulSoup(data)
f = open('/Desktop/text.txt', 'w')
f.write(data)
f.close()
and next I'm trying to write a script for searching the text and print specific words.
with open("/Desktop/text.txt") as openfile:
    for line in openfile:
        for part in line.split():
            if "odds=" in part:
                print part
but the search script doesn't return the text I'm searching for. Any suggestions please?
If you simply want the values associated with the odds key, without any context, you could do the following:
import urllib
from json import loads # JSON parser
jsontext = urllib.urlopen("https://io.winmasters.com/Feeds/api/event/282576?lang=el").read()
data = loads(jsontext) # Parse the JSON
odds = [[b['odds'] for b in a['children']] for a in data['children']]
The nested list comprehension takes advantage of the structure of the data. An advantage of working with the parsed data structure is that you can do quite rich analysis without much effort. If you want other information in addition to the odds, this would probably be better implemented as a nested for-loop, sketched below.
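For instance, a rough sketch of that nested for-loop, assuming the feed keeps the same nested 'children' layout; the 'name' key is a hypothetical placeholder for whatever extra field you actually want:
for market in data['children']:
    for selection in market['children']:
        odds = selection.get('odds')
        name = selection.get('name')  # hypothetical extra field -- check the real JSON keys
        print('%s: %s' % (name, odds))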
How about:
import sys
from bs4 import BeautifulSoup
import mechanize
def viewPage(url):
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('user-agent', 'Mozilla/5.0')]
    page = browser.open(url)
    source_code = page.read()
    soup = BeautifulSoup(source_code)
    info = soup.findAll("insert what you want to locate")
    print(info)

viewPage("http://www.xkcd.com")
I have a program that, when you choose a webpage, reads off all the links, chooses one at random and goes to it, then does the same again. It basically crawls across the interweb. The code above is a modified excerpt.
I've been trying to figure out how to write a loop and couldn't work it out from other threads, so I need help. I am totally new to this, so editing existing code is hard for me.
I am trying to scrape data from a website. Here's what I've done so far, but I have to insert each page "manually". I want it to automatically scrape prices in zl/m2 from pages 1 to 20, for example:
import requests
from bs4 import BeautifulSoup
link=("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=1")
page = requests.get(link).text
link1=("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=2")
page1 = requests.get(link1).text
link2=("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=3")
page2 = requests.get(link2).text
pages = page + page1 + page2
soup = BeautifulSoup(pages, 'html.parser')
price_box = soup.findAll('p', attrs={'class':'list__item__details__info details--info--price'})
prices=[]
for i in range(len(price_box)):
    prices.append(price_box[i].text.strip())
prices
I've tried this code, but got stuck. I don't know what I should add to get the output from all 20 pages at once, or how to save it to a csv file.
npages=20
baselink="https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona="
for i in range (1,npages+1):
link=baselink+str(i)
page = requests.get(link).text
Thanks in advance for any help.
Python is whitespace sensitive, so the body of any loop needs to be indented, like so:
for i in range(1, npages+1):
    link = baselink + str(i)
    page = requests.get(link).text
If you want all of the pages in a single string (so you can use the same approach as with your pages variable above), you can append the strings together in your loop:
pages = ""
for i in range(1, npages+1):
    link = baselink + str(i)
    pages += requests.get(link).text
To create a csv file with your results, you can look into csv.writer() in Python's built-in csv module (a short sketch of that follows the example below), but I usually find it easier to write to a file using print():
with open(samplefilepath, mode="w+") as output_file:
    for price in prices:
        print(price, file=output_file)
w+ tells Python to create the file if it doesn't exist and overwrite it if it does; a+ would instead append to an existing file.
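If you'd rather use csv.writer, a minimal sketch (assuming Python 3 and the prices list built above) might look like this:
import csv

with open("prices.csv", mode="w", newline="") as output_file:
    writer = csv.writer(output_file)
    writer.writerow(["price"])  # header row
    for price in prices:
        writer.writerow([price])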
from bs4 import BeautifulSoup
import requests
import csv
page=requests.get("http://www.gigantti.fi/catalog/tietokoneet/fi_kannettavat/kannettavat-tietokoneet")
data=BeautifulSoup(page.content)
h=open("test.csv","wb+")
h.write(data)
h.close()
print (data)
I have tried running the code as-is without writing it to the csv file and it runs perfectly, but the moment I try to save it to csv I get the error: argument 1 must be convertible to a buffer, not BeautifulSoup. Please help, and thanks in advance.
I don't know whether anyone else was able to solve it, but my trial and error worked. The problem was that I was not converting the content to a string.
# what I needed to add, after the line data = BeautifulSoup(page.content):
a = str(data)
# and then write the string instead of the BeautifulSoup object:
h.write(a)
Hopefully this helps
What you are trying to do doesn't make any sense.
As mentioned on Beautiful Soup Documentation:
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
You do not seem to be pulling any data out; instead you are trying to write a BeautifulSoup object straight into a file, which doesn't make sense:
>>> type(data)
<class 'bs4.BeautifulSoup'>
What you should be using BeautifulSoup for is searching the data for some information and then using that information; here's a trivial example:
from bs4 import BeautifulSoup
import requests
page = requests.get("http://www.gigantti.fi/catalog/tietokoneet/fi_kannettavat/kannettavat-tietokoneet")
data = BeautifulSoup(page.content)
with open("test.txt", "wb+") as f:
# find the first `<title>` tag and retrieve its value
value = data.findAll('title')[0].text
f.write(value)
It seems like you should be using BeautifulSoup to retrieve the information on each product in the product listing and put it into columns of a csv file, if I'm guessing correctly, but I will leave that work up to you. You would use BeautifulSoup to find each product in the html, then retrieve its details and write them to the csv; a rough sketch of the idea is below.
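As a hedged sketch only: the "product", "product-name" and "product-price" class names below are hypothetical placeholders, not the page's real markup, so inspect the HTML and substitute the actual classes before using this (assumes Python 3):
from bs4 import BeautifulSoup
import requests
import csv

page = requests.get("http://www.gigantti.fi/catalog/tietokoneet/fi_kannettavat/kannettavat-tietokoneet")
data = BeautifulSoup(page.content, "html.parser")

with open("products.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price"])
    # "product", "product-name" and "product-price" are made-up class names
    for product in data.findAll("div", attrs={"class": "product"}):
        name = product.find("span", attrs={"class": "product-name"})
        price = product.find("span", attrs={"class": "product-price"})
        writer.writerow([name.text.strip() if name else "",
                         price.text.strip() if price else ""])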
I am trying to parse the XML from YouTube that is embedded in the code below. I am trying to display all of the titles. However, I am running into trouble when I try to print the 'title': only blank lines appear. Any advice?
#import library to do http requests:
import urllib2
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations
#download the file:
file = urllib2.urlopen('http://gdata.youtube.com/feeds/api/users/buzzfeed/uploads?v=2&max-results=50')
#convert to string:
data = file.read()
#close file because we dont need it anymore:
file.close()
#parse the xml you downloaded
dom = parseString(data)
entry=dom.getElementsByTagName('entry')
for node in entry:
    video_title = node.getAttribute('title')
    print video_title
title is not an attribute; it is a child element of each entry.
Here is an example of how to extract it:
for node in entry:
    video_title = node.getElementsByTagName('title')[0].firstChild.nodeValue
    print video_title
lxml can be a bit difficult to figure out, so here's a really simple Beautiful Soup solution (it's called BeautifulSoup for a reason). You can also set up Beautiful Soup to use the lxml parser, so the speed is about the same.
from bs4 import BeautifulSoup
soup = BeautifulSoup(data) # data as is seen in your code
soup.findAll('title')
This returns a list of title elements. You can also use soup.findAll('media:title') in this case to return just the media:title elements (the actual video names), as in the sketch below.
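For example, a short sketch that prints just the video names, using the data string already downloaded in the question:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data)  # `data` as downloaded in the question
for title in soup.findAll('media:title'):
    print title.text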
There's a small bug in your code: you access title as an attribute, although it's a child element of each entry. Your code can be fixed like this:
dom = parseString(data)
for node in dom.getElementsByTagName('entry'):
    print node.getElementsByTagName('title')[0].firstChild.data
Hi, I've got a list of 10 websites in a CSV file. All of the sites have the same general format, including a large table. I only want the data in the 7th column. I am able to extract the html and filter the 7th-column data (via regex) on an individual basis, but I can't figure out how to loop through the CSV. I think I'm close, but my script won't run. I would really appreciate it if someone could help me figure out how to do this. Here's what I've got:
#Python v2.6.2
import csv
import urllib2
import re
urls = csv.reader(open('list.csv'))
n = 0
while n <= 10:
    for url in urls:
        response = urllib2.urlopen(url[n])
        html = response.read()
        print re.findall('td7.*?td', html)
    n += 1
When I copied your routine, I got a whitespace/tab error, so double-check your indentation (mixing tabs and spaces will break the script). You were also indexing into each CSV row with your loop counter (url[n]), which would have tripped you up as well.
Also, you don't really need to control the loop with a counter. This will loop once for each line in your CSV file:
#Python v2.6.2
import csv
import urllib2
import re
urls = csv.reader(open('list.csv'))
for url in urls:
    response = urllib2.urlopen(url[0])
    html = response.read()
    print re.findall('td7.*?td', html)
Lastly, be sure that your URLs are properly formed:
http://www.cnn.com
http://www.fark.com
http://www.cbc.ca
I am trying to scrape rows off of over 1200 .htm files that are on my hard drive. On my computer they are here 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM'. These .htm files are sequential from *20001.htm until *21230.htm. My plan is to eventually toss my data in MySQL or SQLite via a spreadsheet app or just straight in if I can get a clean .csv file out of this process.
This is my first attempt at coding (Python) and scraping, and I just installed Ubuntu 9.04 on my crappy Pentium IV. Needless to say, I am a newb and have hit some roadblocks.
How do I get mechanize to go through all the files in the directory in order? Can mechanize even do this? Can mechanize/Python/BeautifulSoup read a 'file:///' style URL, or is there another way to point it at /home/phi/Data/NHL/pl07-08/PL020001.HTM? Is it smart to do this in 100 or 250 file increments, or just send all 1230?
I just need rows that start with "<tr class="evenColor">" and end with "</tr>". Ideally I only want the rows that contain "SHOT"|"MISS"|"GOAL" within them, but I want the whole row (every column). Note that "GOAL" is in bold, so do I have to specify this? There are 3 tables per htm file.
Also I would like the name of the parent file (pl020001.htm) to be included in the rows I scrape so I can id them in their own column in the final database. I don't even know where to begin for that. This is what I have so far:
#!/usr/bin/python
from BeautifulSoup import BeautifulSoup
import re
from mechanize import Browser
mech = Browser()
url = "file:///home/phi/Data/NHL/pl07-08/PL020001.HTM"
##but how do I do multiple urls/files? PL02*.HTM?
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
##this confuses me and seems redundant
pl = open("input_file.html","r")
chances = open("chancesforsql.csv,"w")
table = soup.find("table", border=0)
for row in table.findAll("tr", attrs={"class": "evenColor"}):
    #should I do this instead of before?
    outfile = open("shooting.csv", "w")
    ##how do I end it?
Should I be using IDLE or something like it, or just the Terminal in Ubuntu 9.04?
You won't need mechanize. Since I do not know the exact HTML content, I'd try to see what matches first, like this:
import glob
from BeautifulSoup import BeautifulSoup
for filename in glob.glob('/home/phi/Data/*.htm'):
    soup = BeautifulSoup(open(filename, "r").read())  # assuming some HTML
    for a_tr in soup.findAll("tr", attrs={"class": "evenColor"}):
        print a_tr
Then pick out the parts you want and write them to stdout with commas (and redirect > to a file), or write the csv via Python; a rough sketch of that follows.
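A hedged sketch of that last step, assuming the interesting text sits in the <td> cells of each evenColor row (adjust to the real HTML); it keeps only rows mentioning SHOT, MISS or GOAL and prepends the source filename as its own column:
import csv
import glob
import os
from BeautifulSoup import BeautifulSoup

out = open("chancesforsql.csv", "wb")
writer = csv.writer(out)
for filename in glob.glob('/home/phi/Data/NHL/pl07-08/*.HTM'):
    soup = BeautifulSoup(open(filename, "r").read())
    for a_tr in soup.findAll("tr", attrs={"class": "evenColor"}):
        # collect the text of every cell in the row
        cells = [''.join(td.findAll(text=True)).strip() for td in a_tr.findAll("td")]
        if any(word in ' '.join(cells) for word in ("SHOT", "MISS", "GOAL")):
            writer.writerow([os.path.basename(filename)] + cells)
out.close()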
MYYN's answer looks like a great start to me. One pattern I've had luck with is:
import glob
for file_name in glob.glob('/home/phi/Data/*.htm'):
    #read the file and then parse with BeautifulSoup
I've found both the os and glob imports to be really useful for running through files in a directory.
Also, once you're using a for loop in this way, you have file_name available, which you can modify for use in the output file, so that the output filenames match the input filenames; a small sketch of that is below.
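For instance, a minimal sketch of deriving an output name from each input name (the naming scheme here is just illustrative):
import glob
import os

for file_name in glob.glob('/home/phi/Data/*.htm'):
    # e.g. PL020001.HTM -> PL020001.csv (purely illustrative naming)
    base = os.path.splitext(os.path.basename(file_name))[0]
    out_name = base + '.csv'
    print out_name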