I have a script that scrapes data from a website and stores it in a spreadsheet:
with open("c:\source\list.csv") as f:
for row in csv.reader(f):
for url in row:
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
tables = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
for rows in tables.find_all('tr', {'releasetype': 'Current_Releases'})[0::1]:
item = []
for val in rows.find_all('td'):
item.append(val.text.strip())
with open('c:\output_file.csv', 'a', newline='') as f:
writer = csv.writer(f)
writer.writerow({url})
writer.writerows(item)
As of right now, when this script runs, about 50 new lines are added to the bottom of the CSV file (totally expected, since the file is opened in append mode). What I would like instead is for it to detect duplicate entries already in the CSV file and skip them, and then change the mismatches.
I feel like this should be possible, but I can't seem to think of a way to do it.
Any thoughts?
You cannot do that without reading the data back from the CSV file. Also, to "change the mismatches", you will just have to overwrite them:
import csv

# A raw string avoids the invalid '\o' escape in the Windows path.
with open(r'c:\output_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for item in list_to_write_from:
        writer.writerow(item)
Here, you are assuming that list_to_write_from will contain the most current form of the data you need.
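If you want to skip duplicates rather than rewrite everything, one approach is to load the existing rows into a set and append only rows that are not already present. A minimal sketch, assuming scraped_rows is a hypothetical list holding the rows your scraper just produced:

import csv
import os

path = r'c:\output_file.csv'
existing = set()
# Load the rows already in the file, if it exists.
if os.path.isfile(path):
    with open(path, newline='') as f:
        existing = {tuple(row) for row in csv.reader(f)}

with open(path, 'a', newline='') as f:
    writer = csv.writer(f)
    for row in scraped_rows:  # hypothetical: rows produced by the scraping loop
        if tuple(row) not in existing:
            writer.writerow(row)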
I found a workaround to this problem, as the answer provided did not work for me.
I added:
import os

if os.path.isfile(r"c:\source\output_file.csv"):
    os.remove(r"c:\source\output_file.csv")
to the top of my code. It checks whether that file exists and deletes it, so the file is recreated with the most up-to-date information later. This is a duct-tape way of doing things, but it works.
I have a CSV file listing 100 images I want to download. There are no headers on the CSV. The first column has the location of the image and the second column has the name I would like the image saved as.
I am using urllib to download the images. I would like to set up a for loop that takes the location from the first column and names the file from the data in the second column. The line I put in here is wrong, but hopefully you get the idea: urllib.request.urlretrieve('url[0]', 'url[1]'); I'm trying to pull from the first and second columns.
import csv
import urllib.request

urls = []
f = open('multicoltest.csv', encoding='utf-8-sig')
csv_f = csv.reader(f)
for row in csv_f:
    urls.append(row[0])
f.close()

for url in urls:
    # Wrong: these are the literal strings 'url[0]' and 'url[1]', not CSV values.
    urllib.request.urlretrieve('url[0]', 'url[1]')
You can do it a line at a time without reading the entire file first:
import csv
import urllib.request

with open('multicoltest.csv', encoding='utf-8-sig') as f:
    for row in csv.reader(f):
        urllib.request.urlretrieve(row[0], row[1])
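If some downloads might fail (a dead link, a timeout), you can wrap the call so one bad row doesn't abort the whole run. A sketch of the same loop with minimal error handling:

import csv
import urllib.request

with open('multicoltest.csv', encoding='utf-8-sig') as f:
    for row in csv.reader(f):
        try:
            urllib.request.urlretrieve(row[0], row[1])
        except OSError as e:  # urllib.error.URLError is a subclass of OSError
            print('skipping', row[0], '-', e)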
I have code that retrieves information about switches from HPE's website. The script works just fine and outputs the information into a CSV file. However, now I need to loop the script through 30 different switches.
I have a list of URLs that are stored in a CSV document. Here are a few examples:
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J4813A
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J4903A
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9019B
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9022A
In my code, I bind url to one of these links, which pushes it through the code to retrieve the information I need.
Here is my full code:
url = "https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?
ProductNumber=J9775A"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
headers = [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr', {'releasetype': 'Current_Releases'}):
item = []
for val in row.find_all('td'):
item.append(val.text.encode('utf8').strip())
rows.append(item)
with open('c:\source\output_file.csv', 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow({url})
writer.writerow(headers)
writer.writerows(rows)
I am trying to find the best way to automate this, as the script needs to run at least once a week. It will need to output to one CSV file that is overwritten every time. That CSV file then links into my Excel sheet as a data source.
Please be patient with my ignorance. I am a novice at Python and haven't been able to find a solution elsewhere.
Are you on a Linux system? You could set up a cron job to run your script whenever you want.
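For example, a crontab entry along these lines (the script path is hypothetical) would run it every Monday at 6:00:

# minute hour day-of-month month day-of-week command
0 6 * * 1 /usr/bin/python3 /path/to/scrape_switches.py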
Personally, I would just make a list of the unique "ProductNumber" query parameter values and iterate through that list in a loop.
Then, with the rest of the code encapsulated in that loop, you should be able to accomplish this task; see the sketch below.
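A minimal sketch of that idea, reusing the scraping code and the product numbers from the question (extend the list to cover all 30 switches):

import csv

import requests
from bs4 import BeautifulSoup

BASE = "https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber="
product_numbers = ["J4813A", "J4903A", "J9019B", "J9022A"]  # extend to all 30

with open(r'c:\source\output_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for pn in product_numbers:
        url = BASE + pn
        soup = BeautifulSoup(requests.get(url).content, 'lxml')
        table = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
        writer.writerow([url])
        writer.writerow([th.text for th in table.find_all('th')])
        for tr in table.find_all('tr', {'releasetype': 'Current_Releases'}):
            writer.writerow([td.text.strip() for td in tr.find_all('td')])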
I have scraped some links using the following code, but I can't seem to store them in one column in Excel. When I run it, each link is split into its individual characters, which end up spread across multiple columns.
for h1 in soup.find_all('h1', class_="title entry-title"):
    print(h1.find("a")['href'])
This yields all the links I need to find.
To store the links in a CSV I used:
import csv

with open('links1.csv', 'wb') as f:
    writer = csv.writer(f)
    for h1 in soup.find_all('h1', class_="title entry-title"):
        # Bug: writerow() expects a sequence of fields, so a plain string
        # is treated as a sequence of characters, one per column.
        writer.writerow(h1.find("a")['href'])
I also tried to store the results, for instance using
for h1 in soup.find_all('h1', class_="title entry-title"):
    dat = h1.find("a")['href']
and then tried using dat in other CSV-writing code, but that would not work either.
If you only need one link per line, you may not even need a csv writer; this looks like plain file writing to me. A newline character after each link should serve you well:
with open('file', 'w') as f:
    for h1 in soup.find_all('h1', class_="title entry-title"):
        f.write(str(h1.find("a")['href']) + '\n')
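If you do want a CSV writer after all (say, to add more columns later), the fix for the original error is to wrap each link in a list so it becomes a single field. A sketch for Python 3:

import csv

with open('links1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for h1 in soup.find_all('h1', class_="title entry-title"):
        writer.writerow([h1.find("a")['href']])  # one-element list = one column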
I'm new to Python.
I have a list with 19188 rows that I want to save as a CSV.
When I write the list's rows to the CSV, the file is missing the last rows (it stops at 19112).
Do you have any idea what might cause this?
Here is how I write to the CSV:
import csv

mycsvfile = open('file.csv', 'w')
thedatawriter = csv.writer(mycsvfile, lineterminator='\n')
list = []  # note: this name shadows the built-in list
# list creation code
thedatawriter.writerows(list)
Each row of list has 4 string elements.
Another piece of information:
If I create a list that contains only the missing last elements and add them to the CSV file, it kind of works (they are added, but twice...).
mycsvfile = open('file.csv', 'w')
thedatawriter = csv.writer(mycsvfile, lineterminator='\n')
list = []
# list creation code
thedatawriter.writerows(list)

list_end = []
# list_end creation code
thedatawriter.writerows(list_end)
If I try to add list_end alone, it doesn't seem to work. I'm thinking there might be a CSV writing parameter that I got wrong.
Another piece of information:
If I open the file with newline='' added, it writes more rows (though still not all):
mycsvfile = open('file.csv', 'w', newline='')
There must be a simple mistake in the way I open or write to the CSV (or in the dialect?).
Thanks for your help!
I found my answer! I was not closing the file handle before the script ended, which left some rows unwritten in the buffer.
Here is the fix:
import csv

with open('file.csv', 'w', newline='') as mycsvfile:
    thedatawriter = csv.writer(mycsvfile, lineterminator='\n')
    thedatawriter.writerows(list)
See: Writing to CSV from list, write.row seems to stop in a strange place
Close the filehandle before the script ends. Closing the filehandle will also flush any strings waiting to be written. If you don't flush and the script ends, some output may never get written. Using the with open(...) as f syntax is useful because it will close the file for you when Python leaves the with-suite. With with, you'll never omit closing a file again.
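Equivalently, if you keep a bare open() call instead of a with block, an explicit close() at the end of the script flushes the buffered rows; a sketch of the same fix:

import csv

mycsvfile = open('file.csv', 'w', newline='')
thedatawriter = csv.writer(mycsvfile, lineterminator='\n')
thedatawriter.writerows(list)
mycsvfile.close()  # flushes buffered output; without it, trailing rows can be lost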
I'm using Python's csv module to do some reading and writing of CSV files.
I've got reading working fine, and appending to the CSV works too, but I want to be able to overwrite a specific row in the CSV.
For reference, here's my reading and then writing code to append:
# reading
b = open("bottles.csv", "rb")
bottles = csv.reader(b)
bottle_list = []
bottle_list.extend(bottles)
b.close()

# appending
b = open('bottles.csv', 'a')
writer = csv.writer(b)
writer.writerow([bottle, emptyButtonCount, 100, img])
b.close()
And I'm using basically the same for the overwrite mode (which isn't correct; it just overwrites the whole CSV file):
b = open('bottles.csv', 'wb')
writer = csv.writer(b)
writer.writerow([bottle, btlnum, 100, img])
b.close()
In the second case, how do I tell Python I need a specific row overwritten? I've scoured Google and other Stack Overflow posts to no avail. I assume my limited programming knowledge is to blame rather than Google.
I will add to Steven's answer:
import csv

bottle_list = []
# Read all data from the csv file (text mode with newline='' on Python 3;
# on Python 2, use 'rb' instead).
with open('a.csv', newline='') as b:
    bottles = csv.reader(b)
    bottle_list.extend(bottles)

# Data to override, in the format {line_num_to_override: data_to_write}.
line_to_override = {1: ['e', 'c', 'd']}

# Write the data back, replacing the lines listed in line_to_override.
with open('a.csv', 'w', newline='') as b:
    writer = csv.writer(b)
    for line, row in enumerate(bottle_list):
        data = line_to_override.get(line, row)
        writer.writerow(data)
You cannot overwrite a single row in the CSV file. You'll have to write all the rows you want to a new file and then rename it back to the original file name.
Your pattern of usage may fit a database better than a CSV file. Look into the sqlite3 module for a lightweight database.
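A minimal sketch of the write-then-rename approach (the function name, row index, and replacement values are hypothetical):

import csv
import os
import tempfile

def replace_row(path, row_index, new_row):
    """Rewrite the CSV at path, replacing the row at row_index with new_row."""
    # Write everything to a temporary file in the same directory...
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    with os.fdopen(fd, 'w', newline='') as out, open(path, newline='') as src:
        writer = csv.writer(out)
        for i, row in enumerate(csv.reader(src)):
            writer.writerow(new_row if i == row_index else row)
    # ...then atomically rename it over the original.
    os.replace(tmp_path, path)

replace_row('bottles.csv', 2, ['bottle', 42, 100, 'img.png'])  # hypothetical values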