Loop to automatically web-scrape data from a few pages - Python

I've been trying to figure out how to write a loop, but I couldn't work it out from other threads, so I need help. I am totally new to this, and editing existing code is hard for me.
I am trying to web scrape data from a website. Here's what I've done so far, but I have to insert the pages "manually". I want it to automatically scrape prices in zl/m2 from pages 1 to 20, for example:
import requests
from bs4 import BeautifulSoup

link = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=1"
page = requests.get(link).text
link1 = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=2"
page1 = requests.get(link1).text
link2 = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=3"
page2 = requests.get(link2).text
# ...and so on for every page fetched by hand
pages = page + page1 + page2
soup = BeautifulSoup(pages, 'html.parser')
price_box = soup.findAll('p', attrs={'class':'list__item__details__info details--info--price'})
prices = []
for i in range(len(price_box)):
    prices.append(price_box[i].text.strip())
prices
I've tried the code below, but I got stuck. I don't know what I should add to get output from all 20 pages at once, or how to save it to a CSV file.
npages = 20
baselink = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona="
for i in range(1, npages + 1):
    link = baselink + str(i)
    page = requests.get(link).text
Thanks in advance for any help.

Python is whitespace-sensitive, so the body of any loop needs to be indented, like so:
for i in range(1, npages + 1):
    link = baselink + str(i)
    page = requests.get(link).text
If you want all of the pages in a single string (so you can use the same approach as with your pages variable above), you can append the strings together in your loop:
pages = ""
for i in range (1,npages+1):
link=baselink+str(i)
pages += requests.get(link).text
To create a CSV file with your results, you can look into the csv.writer() function in Python's built-in csv module, but I usually find it easier to write to a file using print():
with open(samplefilepath, mode="w+") as output_file:
    for price in prices:
        print(price, file=output_file)
'w+' tells Python to create the file if it doesn't exist and overwrite it if it does; 'a+' would append to the file if it already exists.
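If you'd rather use csv.writer() itself, a minimal sketch could look like this, continuing with the prices list built above; the prices.csv filename and the single "price" column are just assumptions for illustration:
import csv

# Assumed output path; change it to wherever you want the file.
with open("prices.csv", mode="w", newline="") as output_file:
    writer = csv.writer(output_file)
    writer.writerow(["price"])      # header row (assumed column name)
    for price in prices:
        writer.writerow([price])    # one scraped price per row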

Related

How should I automate work with Python?

I'm a complete beginner at Python, but I know intermediate JavaScript. I have a project to complete; it's like a scraper, but I want to automate some work for myself.
1) I have an Excel sheet with more than 1000 rows of data, and it also has URLs. I want to write Python code that visits every URL from that Excel sheet and searches the first page for some predefined search texts (a list of texts).
2) If my code finds any of the texts on that web page, it should return true, else false.
I'd appreciate any idea or logic for this kind of process. Any help will make my head hurt less 😅
It is very heavy work, which is not a good idea to do in JavaScript; that's why I want to do it in Python.
An easy way to do this would be to get the requests module. Then learn how to use the csv module, which can read CSV files (export your Excel sheet to CSV first). Then here is roughly what you want to do:
import csv
import requests

URLS = []

def GetUrlFromCSVFile():
    global URLS
    #Figure out how to get the links from the csv file, then append them to the URLS list

for url in URLS:
    r = requests.get(url, headers={})  # you should probably set some headers here
    if whatever_keyword_u_looking_for in r.text:
        print("Found")
    else:
        print("Not here")
I suggest the following:
Read about the csv library - to read the contents of your spreadsheet (export the Excel file to CSV first).
Read about the requests library - to get the page's content from its URL.
Read about regular expressions in the re library.
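Putting those suggestions together, a minimal sketch could look like the following. It assumes the Excel sheet has been exported to a CSV file named urls.csv with the URL in the first column, and SEARCH_TEXTS stands in for your predefined list of texts; both names are placeholders, not anything from your project:
import csv
import requests

# Placeholder inputs: adjust the filename, column index, and texts to your data.
CSV_PATH = "urls.csv"
SEARCH_TEXTS = ["text one", "text two"]

def page_contains_any(url, texts):
    # Return True if the page at `url` contains any of the given texts.
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return False  # treat unreachable pages as "not found"
    return any(text in response.text for text in texts)

with open(CSV_PATH, newline="") as f:
    for row in csv.reader(f):
        url = row[0]  # assuming the URL is in the first column
        print(url, page_contains_any(url, SEARCH_TEXTS))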

Python: Attempting to scrape multiple (Similar) websites for (Similar) data

I have code that retrieves information from HPE's website regarding switches. The script works just fine and outputs the information into a CSV file. However, now I need to loop the script through 30 different switches.
I have a list of URLs that are stored in a CSV document. Here are a few examples.
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J4813A
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J4903A
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9019B
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9022A
In my code, I bind 'url' to one of these links, which pushes it through the code to retrieve the information I need.
Here is my full code:
url = "https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?
ProductNumber=J9775A"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
headers = [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr', {'releasetype': 'Current_Releases'}):
item = []
for val in row.find_all('td'):
item.append(val.text.encode('utf8').strip())
rows.append(item)
with open('c:\source\output_file.csv', 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow({url})
writer.writerow(headers)
writer.writerows(rows)
I am trying to find the best way to automate this, as the script needs to run at least once a week. It will need to output to one CSV file that is overwritten every time. That CSV file is then linked to my Excel sheet as a data source.
Please be patient with my ignorance. I am a novice at Python and I haven't been able to find a solution elsewhere.
Are you on a Linux system? You could set up a cron job to run your script whenever you want.
Personally, I would just make a list of each unique "ProductNumber" query-parameter value and iterate through that list in a loop.
Then, with the rest of your code wrapped inside that loop, you should be able to accomplish this task.
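A minimal sketch of that idea might look like this; the product_numbers list is taken from the example URLs above (extend it to all 30 switches), and the output filename is just a placeholder:
import csv
import requests
from bs4 import BeautifulSoup

# Assumed list based on the example URLs; extend it to all 30 switches.
product_numbers = ["J4813A", "J4903A", "J9019B", "J9022A"]
base_url = "https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber="

with open("output_file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for product_number in product_numbers:
        url = base_url + product_number
        soup = BeautifulSoup(requests.get(url).content, "lxml")
        table = soup.find("table", attrs={"class": "hpui-standardHrGrid-table"})
        if table is None:
            continue  # skip pages that don't have the expected table
        headers = [header.text for header in table.find_all("th")]
        writer.writerow([url])
        writer.writerow(headers)
        for row in table.find_all("tr", {"releasetype": "Current_Releases"}):
            writer.writerow([val.text.strip() for val in row.find_all("td")])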

Scraping and writing output to text file

I coded this scraper using Python 2.7 to fetch links from the first 3 pages of TrueLocal.com.au and write them to a text file.
When I run the program, only one link is written to the text file. What can I do so that all the URLs returned are written to the file?
import requests
from bs4 import BeautifulSoup

def tru_crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.truelocal.com.au/find/car-rental/' + str(page)
        code = requests.get(url)
        text = code.text
        soup = BeautifulSoup(text)
        for link in soup.findAll('a', {'class':'name'}):
            href = 'http://www.truelocal.com.au' + link.get('href')
            fob = open('c:/test/true.txt', 'w')
            fob.write(href + '\n')
            fob.close()
            print (href)
        page += 1

#Run the function
tru_crawler(3)
Your problem is that for each link, you open the output file, write it, then close the file again. Not only is this inefficient, but unless you open the file in "append" mode each time, it will just get overwritten. What's happening is actually that the last link gets left in the file and everything prior is lost.
The quick fix would be to change the open mode from 'w' to 'a', but it'd be even better to slightly restructure your program. Right now the tru_crawler function is responsible for both crawling your site and writing output; instead it's better practice to have each function responsible for one thing only.
You can turn your crawl function into a generator that yields links one at a time, and then write the generated output to a file separately. Replace the three fob lines with:
yield href + '\n'
Then you can do the following:
lines = tru_crawler(3)
filename = 'c:/test/true.txt'
with open(filename, 'w') as handle:
    handle.writelines(lines)
Also note the usage of the with statement; opening the file using with automatically closes it once that block ends, saving you from having to call close() yourself.
Taking the idea of generators and task-separation one step further, you may notice that the tru_crawler function is also responsible for generating the list of URLs to crawl. That too can be separated out, if your crawler accepts an iterable of URLs instead of creating them itself. Something like:
def make_urls(base_url, pages):
    for page in range(1, pages + 1):
        yield base_url + str(page)

def crawler(urls):
    for url in urls:
        pass  #fetch, parse, and yield hrefs
Then, instead of calling tru_crawler(3), it becomes:
urls = make_urls('http://www.truelocal.com.au/find/car-rental/', 3)
lines = crawler(urls)
and then proceed as above.
Now if you want to crawl other sites, you can just change your make_urls call, or create different generators for other URL-patterns, and the rest of your code doesn't need to change!
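Put together, the refactored version might look something like this sketch; the crawler body just reuses the parsing logic from the question, and the output path is the same one used there:
import requests
from bs4 import BeautifulSoup

def make_urls(base_url, pages):
    # Yield the page URLs one at a time.
    for page in range(1, pages + 1):
        yield base_url + str(page)

def crawler(urls):
    # Fetch each page, parse it, and yield one href per line.
    for url in urls:
        soup = BeautifulSoup(requests.get(url).text)
        for link in soup.findAll('a', {'class': 'name'}):
            yield 'http://www.truelocal.com.au' + link.get('href') + '\n'

urls = make_urls('http://www.truelocal.com.au/find/car-rental/', 3)
lines = crawler(urls)
with open('c:/test/true.txt', 'w') as handle:
    handle.writelines(lines)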
By default 'w' is truncating mode and you may need append mode. See: https://docs.python.org/2/library/functions.html#open.
Maybe appending your hrefs to a list inside the while loop and then writing them to the file later would be more readable. Or, as suggested, use yield for efficiency.
Something like
with open('c:/test/true.txt', 'w') as fob:
    fob.writelines(yourlistofhref)
https://docs.python.org/2/library/stdtypes.html#file.writelines

Python Web-Scrape Loop via CSV list of URLs?

Hi, I've got a list of 10 websites in a CSV. All of the sites have the same general format, including a large table. I only want the data in the 7th column. I am able to extract the HTML and filter the 7th-column data (via regex) on an individual basis, but I can't figure out how to loop through the CSV. I think I'm close, but my script won't run. I would really appreciate it if someone could help me figure out how to do this. Here's what I've got:
#Python v2.6.2
import csv
import urllib2
import re

urls = csv.reader(open('list.csv'))
n = 0
while n <= 10:
    for url in urls:
        response = urllib2.urlopen(url[n])
        html = response.read()
        print re.findall('td7.*?td', html)
    n += 1
When I copied your routine, I did get a whitespace/tab error. Check your tabs. You were also indexing into the URL entry incorrectly using your loop counter; that would have messed you up as well.
Also, you don't really need to control the loop with a counter. This will loop once for each line in your CSV file.
#Python v2.6.2
import csv
import urllib2
import re

urls = csv.reader(open('list.csv'))
for url in urls:
    response = urllib2.urlopen(url[0])
    html = response.read()
    print re.findall('td7.*?td', html)
Lastly, be sure that your URLs are properly formed:
http://www.cnn.com
http://www.fark.com
http://www.cbc.ca
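If some entries in list.csv are missing the scheme, a small guard like this could normalize them before the request; normalize_url is just an illustrative helper name, and the snippet keeps the same Python 2 style as the corrected code above:
#Python v2.6.2
import csv
import urllib2
import re

def normalize_url(raw):
    # Prepend a scheme if the CSV entry lacks one, e.g. "www.cnn.com".
    raw = raw.strip()
    if not raw.startswith(('http://', 'https://')):
        raw = 'http://' + raw
    return raw

urls = csv.reader(open('list.csv'))
for url in urls:
    response = urllib2.urlopen(normalize_url(url[0]))
    html = response.read()
    print re.findall('td7.*?td', html)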

Scraping Multiple html files to CSV

I am trying to scrape rows from over 1200 .htm files that are on my hard drive. On my computer they are here: 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM'. These .htm files are sequential, from *20001.htm through *21230.htm. My plan is to eventually toss my data into MySQL or SQLite via a spreadsheet app, or just straight in if I can get a clean .csv file out of this process.
This is my first attempt at code (Python) and scraping, and I just installed Ubuntu 9.04 on my crappy Pentium IV. Needless to say, I am a newbie and have hit some roadblocks.
How do I get mechanize to go through all the files in the directory in order? Can mechanize even do this? Can mechanize/Python/BeautifulSoup read a 'file:///'-style URL, or is there another way to point it to /home/phi/Data/NHL/pl07-08/PL020001.HTM? Is it smart to do this in 100- or 250-file increments, or just send all 1230?
I just need the rows that start with "<tr class="evenColor">" and end with "</tr>". Ideally I only want the rows that contain "SHOT"|"MISS"|"GOAL" within them, but I want the whole row (every column). Note that "GOAL" is in bold, so do I have to specify this? There are 3 tables per .htm file.
Also, I would like the name of the parent file (pl020001.htm) to be included in the rows I scrape, so I can ID them in their own column in the final database. I don't even know where to begin with that. This is what I have so far:
#!/usr/bin/python
from BeautifulSoup import BeautifulSoup
import re
from mechanize import Browser

mech = Browser()
url = "file:///home/phi/Data/NHL/pl07-08/PL020001.HTM"
##but how do I do multiple urls/files? PL02*.HTM?
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)

##this confuses me and seems redundant
pl = open("input_file.html", "r")
chances = open("chancesforsql.csv", "w")

table = soup.find("table", border=0)
for row in table.findAll("tr", {"class": "evenColor"}):
    #should I do this instead of before?
    outfile = open("shooting.csv", "w")
    ##how do I end it?
Should I be using IDLE or something like it? just Terminal in Ubuntu 9.04?
You won't need mechanize. Since I do not exactly know the HTML content, I'd try to see what matches, first. Like this:
import glob
from BeautifulSoup import BeautifulSoup

for filename in glob.glob('/home/phi/Data/*.htm'):
    soup = BeautifulSoup(open(filename, "r").read())  # assuming some HTML
    for a_tr in soup.findAll("tr", attrs={"class": "evenColor"}):
        print a_tr
Then pick the stuff you want and write it to stdout with commas (and redirect it > to a file). Or write the csv via python.
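As a sketch of the "write the csv via python" route: the following keeps the old BeautifulSoup style used above, keeps only the SHOT/MISS/GOAL rows the question mentions, and uses os.path.basename to add the parent-file column the question asks for; the glob pattern and output path are assumptions based on the paths in the question:
import csv
import glob
import os
from BeautifulSoup import BeautifulSoup

out = open("/home/phi/Data/NHL/chances.csv", "wb")  # assumed output path
writer = csv.writer(out)

for filename in glob.glob('/home/phi/Data/NHL/pl07-08/PL02*.HTM'):
    soup = BeautifulSoup(open(filename, "r").read())
    for a_tr in soup.findAll("tr", attrs={"class": "evenColor"}):
        # Collect the text of every cell in the row.
        cells = ["".join(td.findAll(text=True)).strip() for td in a_tr.findAll("td")]
        # Keep only the event rows the question cares about.
        if any(word in " ".join(cells) for word in ("SHOT", "MISS", "GOAL")):
            writer.writerow([os.path.basename(filename)] + cells)

out.close()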
MYYN's answer looks like a great start to me. One thing I'd add, which I've had luck with, is:
import glob

for file_name in glob.glob('/home/phi/Data/*.htm'):
    pass  #read the file and then parse with BeautifulSoup
I've found both the os and glob imports to be really useful for running through files in a directory.
Also, once you're using a for loop in this way, you have the file_name which you can modify for use in the output file, so that the output filenames will match the input filenames.
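For example, a small sketch of deriving one output name per input file (using .csv as an assumed suffix):
import glob
import os

for file_name in glob.glob('/home/phi/Data/*.htm'):
    # e.g. /home/phi/Data/PL020001.HTM -> /home/phi/Data/PL020001.csv
    base, ext = os.path.splitext(file_name)
    output_name = base + ".csv"
    print(output_name)  # then open(output_name, "w") and write the parsed rows there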
