I have a script that retrieves information from HPE's website regarding switches. The script works just fine and outputs the information into a CSV file. However, now I need to loop the script through 30 different switches.
I have a list of URLs that are stored in a CSV document. Here are a few examples.
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J4813A
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J4903A
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9019B
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9022A
In my code, I bind 'url' to one of these links, which pushes it through the code to retrieve the information I need.
Here is my full code:
import csv

import requests
from bs4 import BeautifulSoup

url = "https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9775A"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')

table = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
headers = [header.text for header in table.find_all('th')]

rows = []
for row in table.find_all('tr', {'releasetype': 'Current_Releases'}):
    item = []
    for val in row.find_all('td'):
        item.append(val.text.strip())
    rows.append(item)

with open(r'c:\source\output_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([url])
    writer.writerow(headers)
    writer.writerows(rows)
I am trying to find the best way to automate this, as the script needs to run at least once a week. It will need to output to one CSV file that is overwritten every time; that CSV file is then linked to my Excel sheet as a data source.
Please find patience in dealing with my ignorance. I am a novice at Python and I haven't been able to find a solution elsewhere.
Are you on a Linux system? You could set up a cron job to run your script whenever you want.
Personally, I would just make a list of the unique "ProductNumber" query parameter values and iterate through that list in a loop.
Then, with the rest of your code encapsulated inside that loop, you should be able to accomplish this task, as in the sketch below.
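A minimal sketch of that idea (product_numbers and base_url are names introduced here; the four values shown are just the examples from your list, standing in for all 30, and the output path is the one from your question):

import csv

import requests
from bs4 import BeautifulSoup

# Placeholder values -- replace with your 30 real product numbers (or read them from your CSV of URLs)
product_numbers = ['J4813A', 'J4903A', 'J9019B', 'J9022A']
base_url = "https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber="

# 'w' mode overwrites the file on every run, so the weekly output always starts fresh
with open(r'c:\source\output_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for product_number in product_numbers:
        url = base_url + product_number
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        table = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
        if table is None:  # skip products whose page has no release table
            continue
        headers = [header.text for header in table.find_all('th')]
        writer.writerow([url])
        writer.writerow(headers)
        for row in table.find_all('tr', {'releasetype': 'Current_Releases'}):
            writer.writerow([val.text.strip() for val in row.find_all('td')])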
I have literally started with Python today. I have managed to get one URL to display data in Python using:
import requests
URL = "https://gateway.pinata.cloud/ipfs/QmUJrnabRCMLsnvXNryojLWcysc4WwJCLqWYvJcWADfZFo/chadsJSON/1.json"
page = requests.get(URL)
print(page.text)
I need to look up multiple URLs (like the one above, but numbered 1 up to 10,000)
and save the results as a CSV with each URL's data in one cell; I will then be able to manipulate the data to make it usable in Excel.
Please can anyone write the Python code I can run?
This is a very simple answer, but it will be useful in your case, although you still need to do some work to reach the required output.
This is how you can send requests for numbers 0 through 99 using a for loop:
For Loop
import requests

for i in range(100):
    URL = "https://gateway.pinata.cloud/ipfs/QmUJrnabRCMLsnvXNryojLWcysc4WwJCLqWYvJcWADfZFo/chadsJSON/" + str(i) + ".json"
    page = requests.get(URL)
    print(page.text)
And to store the data in a CSV file, I advise you to use the csv library, which is really helpful for that; you can read more in its documentation: https://docs.python.org/3/library/csv.html
import csv

with open('file.csv', 'a+', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["COL1", "COL2"])
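Putting the two together, a rough sketch of one way to get one row per URL with the whole response in a single cell (out.csv is just a placeholder filename; the 1 to 10,000 range comes from your question):

import csv

import requests

base = "https://gateway.pinata.cloud/ipfs/QmUJrnabRCMLsnvXNryojLWcysc4WwJCLqWYvJcWADfZFo/chadsJSON/"

with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["URL", "DATA"])          # header row
    for i in range(1, 10001):                 # 1 up to 10,000
        url = base + str(i) + ".json"
        page = requests.get(url)
        writer.writerow([url, page.text])     # whole response text in one cell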
My website just launched a new simple component that contains a series of links. Every 24 hours, the links update/change based on an algorithm. I'm wanting to see how long a particular link stays in the component (because, based on the algorithm, sometimes a particular link may stay in the component for multiple days, or sometimes maybe it will be present for just one day).
I'm working on building a Python crawler to crawl the frontend of the website where this new component is present, and I want to have a simple output likely in a CSV file with two columns:
Column 1: URL (the URL that was found within the component)
Column 2: #/days seen (The number of times the Python crawler saw that URL. If it crawls every day, this could be simply thought of as the #/days the crawler has seen that particular URL. So this number would be updated every time the crawler runs. Or, if it was the first time a particular URL was seen, the URL would simply be added to the bottom of the list with a "1" in this column)
How can this be achieved from an output perspective? I'm pretty new to Python, but I'm pretty sure I've got the crawling part covered to identify the links. I'm just not sure how to accomplish the output part, especially as it will update daily, and I want to keep the historical data of how many times the link has been seen.
You need to learn how to web-scrape; I suggest using the Beautiful Soup package for that.
Your scraping script should then iterate over your CSV file, incrementing the count for each URL it finds, or adding a new row if it's not found.
Put this script in a cron job to run it once every 24 hours.
For part 2 you can do something like this:
from tempfile import NamedTemporaryFile
import shutil
import csv

links_found = []  # find the links here

filename = 'myfile.csv'  # the CSV that keeps the URL / count history

tempfile = NamedTemporaryFile(mode='w', delete=False, newline='')

with open(filename, newline='') as csv_file, tempfile:
    reader = csv.reader(csv_file)
    writer = csv.writer(tempfile)

    # Copy the header row
    writer.writerow(next(reader))

    # Increment existing links
    existing_links = []
    for row in reader:
        link = row[0]
        existing_links.append(link)
        times = int(row[1])
        if link in links_found:
            row[1] = str(times + 1)
        writer.writerow(row)

    # Add new links
    for link in links_found:
        if link not in existing_links:
            writer.writerow([link, 1])

# Replace the original file with the updated temporary file
shutil.move(tempfile.name, filename)
I've been trying to figure out how to make a loop and I couldn't work it out from other threads, so I need help. I am totally new to this, so editing existing code is hard for me.
I am trying to web-scrape data from a website. Here's what I've done so far, but I have to insert the pages "manually". I want it to automatically scrape prices in zl/m2 from pages 1 to 20, for example:
import requests
from bs4 import BeautifulSoup
link = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=1"
page = requests.get(link).text
link1 = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=2"
page1 = requests.get(link1).text
link2 = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=3"
page2 = requests.get(link2).text
# ...and so on, one pair of lines per page, up to page6

pages = page + page1 + page2 + page3 + page4 + page5 + page6

soup = BeautifulSoup(pages, 'html.parser')
price_box = soup.findAll('p', attrs={'class': 'list__item__details__info details--info--price'})

prices = []
for i in range(len(price_box)):
    prices.append(price_box[i].text.strip())
prices
I've tried with this code, but I got stuck. I don't know what I should add to get the output from all 20 pages at once and how to save it to a CSV file.
npages = 20
baselink = "https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona="

for i in range(1, npages + 1):
    link = baselink + str(i)
    page = requests.get(link).text
Thanks in advance for any help.
Python is whitespace sensitive, so the code block of any loops needs to be indented, like so:
for i in range(1, npages + 1):
    link = baselink + str(i)
    page = requests.get(link).text
If you want all of the pages in a single string (so you can use the same approach as with your pages variable above), you can append the strings together in your loop:
pages = ""
for i in range(1, npages + 1):
    link = baselink + str(i)
    pages += requests.get(link).text
To create a CSV file with your results, you can look into the csv.writer() method in Python's built-in csv module, but I usually find it easier to write to a file using print():
with open(samplefilepath, mode="w+") as output_file:
    for price in prices:
        print(price, file=output_file)
w+ tells Python to create the file if it doesn't exist and overwrite it if it does exist; a+ would append to the existing file if it exists.
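If you do want the csv.writer() route mentioned above, a minimal sketch (prices is the list built earlier; prices.csv is just a placeholder filename):

import csv

with open('prices.csv', 'w', newline='') as output_file:
    writer = csv.writer(output_file)
    writer.writerow(['price'])        # header row
    for price in prices:
        writer.writerow([price])      # one price per row, in a single column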
I have a script that is used to scrape data from a website and store it in a spreadsheet:
import csv

import requests
from bs4 import BeautifulSoup

with open(r"c:\source\list.csv") as f:
    for row in csv.reader(f):
        for url in row:
            r = requests.get(url)
            soup = BeautifulSoup(r.content, 'lxml')
            tables = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
            for rows in tables.find_all('tr', {'releasetype': 'Current_Releases'})[0::1]:
                item = []
                for val in rows.find_all('td'):
                    item.append(val.text.strip())
                with open(r'c:\output_file.csv', 'a', newline='') as output:
                    writer = csv.writer(output)
                    writer.writerow([url])
                    writer.writerow(item)
As of right now, when this script runs, about 50 new lines are added to the bottom of the CSV file (totally expected with append mode), but what I would like it to do is determine whether there are duplicate entries in the CSV file and skip them, and then change the mismatches.
I feel like this should be possible, but I can't seem to think of a way.
Any thoughts?
You cannot do that without reading the data from the CSV file. Also, to "change the mismatches", you will just have to overwrite them.
with open(r'c:\output_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for item in list_to_write_from:
        writer.writerow(item)
Here, you are assuming that list_to_write_from will contain the most current form of the data you need.
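As a rough sketch of the skip-duplicates idea (the first column of the output file is assumed to be the URL, and seen_urls / new_rows are names made up here for illustration; new_rows would be the (url, cells) pairs your scraper builds):

import csv
import os

output_path = r'c:\output_file.csv'

# Collect the URLs already present in the output file, if it exists
seen_urls = set()
if os.path.isfile(output_path):
    with open(output_path, newline='') as existing:
        for row in csv.reader(existing):
            if row:
                seen_urls.add(row[0])

# Append only rows whose URL has not been written before
with open(output_path, 'a', newline='') as f:
    writer = csv.writer(f)
    for url, cells in new_rows:       # hypothetical list of (url, [cell, ...]) pairs from the scraper
        if url not in seen_urls:
            writer.writerow([url] + cells)
            seen_urls.add(url)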
I found a workaround to this problem, as the answer provided did not work for me.
I added:
import os

if os.path.isfile(r"c:\source\output_file.csv"):
    os.remove(r"c:\source\output_file.csv")
To the top of my code, as this checks whether that file exists and, if it does, deletes it, only to recreate it with the most up-to-date information later. This is a duct-tape way of doing things, but it works.
I have scraped some links using the following code, but I can't seem to store them in one column of Excel. When I use the code, it splits each link address into its individual characters and stores them across multiple columns.
for h1 in soup.find_all('h1', class_="title entry-title"):
    print(h1.find("a")['href'])
This yields all the links I need to find.
To store them in a CSV I used:
import csv

with open('links1.csv', 'wb') as f:
    writer = csv.writer(f)
    for h1 in soup.find_all('h1', class_="title entry-title"):
        writer.writerow(h1.find("a")['href'])
I also tried to store the results, for instance using:
for h1 in soup.find_all('h1', class_="title entry-title"):
    dat = h1.find("a")['href']
and then tried using dat in other CSV code, but it would not work.
If you only need one link per line, you may not even need a csv writer? It looks like plain file writing to me.
The newline character should serve you well at one link per line:
with open('file', 'w') as f:
    for h1 in soup.find_all('h1', class_="title entry-title"):
        f.write(str(h1.find("a")['href']) + '\n')
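If you would rather keep the csv.writer approach from the question, the splitting happens because writerow() expects a sequence and therefore iterates over the string character by character; wrapping the link in a one-element list keeps it in a single cell. A small sketch, assuming Python 3 (hence 'w' with newline='' rather than 'wb'):

import csv

with open('links1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for h1 in soup.find_all('h1', class_="title entry-title"):
        writer.writerow([h1.find("a")['href']])   # one-element list -> one link per row, single column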