I have scraped some links using the following code, but I can't seem to store them in one column of an Excel sheet. When I run it, each link address gets split into its individual characters and every character lands in its own column.
for h1 in soup.find_all('h1', class_="title entry-title"):
    print(h1.find("a")['href'])
This yields all the links I need to find.
To store in csv I used:
import csv

with open('links1.csv', 'wb') as f:
    writer = csv.writer(f)
    for h1 in soup.find_all('h1', class_="title entry-title"):
        writer.writerow(h1.find("a")['href'])
I also tried storing the results first, for instance:
for h1 in soup.find_all('h1', class_="title entry-title"):
    dat = h1.find("a")['href']
and then passing dat to other CSV-writing snippets, but that did not work either.
If you only need one link per line, you may not even need a csv writer; this looks like plain file writing to me.
The newline character should serve you well at one link per line:
with open('file', 'w') as f:
    for h1 in soup.find_all('h1', class_="title entry-title"):
        f.write(str(h1.find("a")['href']) + '\n')
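If you would rather keep the csv module, the reason your original attempt spread each link across many columns is that writerow() iterates over whatever you pass it, and iterating over a string yields one character per cell. Wrapping each link in a one-element list keeps it in a single column. A minimal sketch, assuming soup is already built and using Python 3's newline='' convention:
import csv

with open('links1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for h1 in soup.find_all('h1', class_="title entry-title"):
        # a one-element list -> one cell per row
        writer.writerow([h1.find("a")['href']])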
I am incredibly new to python, so I might not have the right terminology...
I've extracted text from a pdf using pdfplumber. That's been saved as an object. The code I used for that is:
import pdfplumber

with pdfplumber.open('Bell_2014.pdf') as pdf:
    page = pdf.pages[0]
    bell = page.extract_text()

print(bell)
So "bell" is all of the text from the first page of the imported PDF.
I need to write all of that text as a string to a csv. I tried using:
with open('Bell_2014_ex.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(bell)
and
bell_ex = 'bell_2014_ex.csv'
with open(bell_ex, 'w', newline='') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=',')
    file_writer.writerow(bell)
All I keep finding when I search this is how to create a csv with specific characters or numbers, but nothing about writing out the output of code that has already run. For instance, I can get the following code:
bell_ex = 'bell_2014_ex.csv'
with open(bell_ex, 'w', newline='') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=',')
    file_writer.writerow(['bell'])
to create a csv that has "bell" in one cell of the csv, but that's as close as I can get.
I feel like this should be super easy, but I just can't seem to get it to work.
Any thoughts?
Please and thank you for helping my inexperienced self.
page.extract_text() is documented as "Collates all of the page's character objects into a single string", which makes bell just one very long string.
csv.writer's writerow() expects an iterable of field values, with each item corresponding to a single column.
Your main issue is a type mismatch: you're trying to write a single string where a list of strings is expected. You will need to further process your bell object to convert it into a format that can be written to a CSV.
Without knowing what bell contains or what you intend to write, I can't get any more specific, but the documentation for Python's csv module is very comprehensive in terms of setting delimiters, dialects, column definitions, etc. Once you have converted bell into a proper iterable of lists of strings, you can then write it to a CSV.
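For example, if all you want is one line of the page's text per CSV row, a minimal sketch (assuming bell already holds the extracted page text) could be:
import csv

with open('Bell_2014_ex.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # each line of the extracted text becomes a one-column row
    for line in bell.splitlines():
        writer.writerow([line])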
Some similar code I wrote recently converts a tab-separated file to csv for insertion into an sqlite3 database; maybe this is helpful:
import csv
import os

# Convert tab-delimited listfile.txt to a comma-separated values (.csv) file
out_file = os.path.join('input', 'listfile.csv')

in_text = open('listfile.txt', 'r')
in_reader = csv.reader(in_text, delimiter='\t')
out_csv = open(out_file, 'w', newline='')
out_writer = csv.writer(out_csv, dialect=csv.excel)
for _line in in_reader:
    out_writer.writerow(_line)
in_text.close()
out_csv.close()
... and that's it, not too tough
So my problem was that I was missing encoding='utf-8' for special characters, and my delimiter needed to be a space instead of a comma. What ended up working was:
import csv
from pdfminer.high_level import extract_text

object = extract_text('filepath.pdf')
print(object)

new_csv = 'filename.csv'
with open(new_csv, 'w', newline='', encoding='utf-8') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=' ')
    file_writer.writerow(object)
However, since a lot of my pdfs weren't true pdfs but scans, the csv ended up with a lot of odd symbols. This worked for about half of the pdfs I have; if yours are true pdfs, this will be great. If not, I'm currently trying to figure out how to extract all of the text into a pandas dataframe, separated by the headers within the pdfs, since pdfminer extracted all of the text perfectly.
Thank you for everyone that helped!
I'm trying to automate a find+replace for a series of broken image links in .rst files. I have a csv file where column A is the "old" link (which is seen in the .rst files) and column B is the new replacement link for each row.
I can't use pandoc first to convert to HTML because it "breaks" the rst file.
I did this once for a set of HTML files using BeautifulSoup and regex, but that parser won't work for my rst files.
A coworker suggested trying Grep, but I can't seem to figure out how to call in the csv file to make the "match" and switch.
For the HTML files, it would loop through each file, search for img tags, and replace links using the csv file as a dict:
import csv
from bs4 import BeautifulSoup

graph_main_nodes = []
graph_child_nodes = []
orphan_links = []
replaced_links = 0

with open(image_csv, newline='') as f:
    reader = csv.reader(f)
    next(reader, None)  # Ignore the header row
    for row in reader:
        graph_main_nodes.append(row[0])
        graph_child_nodes.append(row[1:])

graph = dict(zip(graph_main_nodes, graph_child_nodes))  # Dict with keys in correct location, vals in old locations
graph = dict((v, k) for k in graph for v in graph[k])   # Invert: old link -> new link

for fixfile in html:
    try:
        with open(fixfile, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f, 'html.parser')
            tags = soup.find_all('img')
            for tag in tags:
                print(tag['src'])
                if tag['src'] in graph.keys():
                    tag['src'] = graph[tag['src']]
                    replaced_links += 1
                    print("Match found!")
                else:
                    orphan_links.append(tag["src"])
                    print("Ignore")
I would love some suggestions on how to approach this. I'd love to repurpose my BeautifulSoup code but I'm not sure if that's realistic.
This question has information on parsing an RST file, but I don't think it's necessary. Your question boils down to replacing textA with textB. You already have your graph loaded from the csv, so you should be OK with just this (credit to this answer):
# Read in the file
with open('fixfile', 'r', encoding='utf-8') as file:
    filedata = file.read()

# Replace the target strings
for old, new in graph.items():
    filedata = filedata.replace(old, new)

# Write the file out again
with open('fixfile', 'w', encoding='utf-8') as file:
    file.write(filedata)
This is also a good candidate for sed or perl, using something like this answer; I also used this answer for help on specifying a rare delimiter for sed. (Change the -n to -i and the p to g after testing to get it to actually save the file):
DELIM=$(echo -en "\001")
IFS=","
cat csvFile | while read PATTERN REPLACEMENT   # feed the while loop with the csv lines and read fields separated by ","
do
    sed -n "s${DELIM}${PATTERN}${DELIM}${REPLACEMENT}${DELIM}p" fixfile.rst
done
I have a script that retrieves information from HPE's website about switches. The script works just fine and outputs the information into a CSV file. However, I now need to loop the script through 30 different switches.
I have a list of URLs that are stored in a CSV document. Here are a few examples.
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J4813A
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J4903A
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9019B
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9022A
In my code, I bind 'url' to one of these links, which pushes it through the code to retrieve the information I need.
Here is my full code:
url = "https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?
ProductNumber=J9775A"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
headers = [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr', {'releasetype': 'Current_Releases'}):
item = []
for val in row.find_all('td'):
item.append(val.text.encode('utf8').strip())
rows.append(item)
with open('c:\source\output_file.csv', 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow({url})
writer.writerow(headers)
writer.writerows(rows)
I am trying to find the best way to automate this, as the script needs to run at least once a week. It will need to output to one CSV file that is overwritten every time. That CSV file is then linked to my Excel sheet as a data source.
Please find patience in dealing with my ignorance. I am a novice at Python and I haven't been able to find a solution elsewhere.
Are you on a linux system? You could set up a cron job to run your script whenever you want.
Personally, I would just make an array of each "ProductNumber" unique query parameter value and iterate through that array via a loop.
Then, with the rest of the code encapsulated in that loop, you should be able to accomplish this task.
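A rough sketch of that idea, reusing the scraping code from your script (the product_numbers list is a placeholder built from your example URLs, and the page structure is assumed to be the same as in your current script):
import csv

import requests
from bs4 import BeautifulSoup

# Hypothetical list of the ProductNumber query-parameter values to loop over
product_numbers = ['J4813A', 'J4903A', 'J9019B', 'J9022A', 'J9775A']
base_url = 'https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber={}'

# 'w' mode overwrites the output file on every run
with open(r'c:\source\output_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for product in product_numbers:
        url = base_url.format(product)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        table = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
        writer.writerow([url])
        writer.writerow([header.text for header in table.find_all('th')])
        for row in table.find_all('tr', {'releasetype': 'Current_Releases'}):
            writer.writerow([val.text.strip() for val in row.find_all('td')])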
I have a script that scrapes data from a website and stores it in a spreadsheet:
with open("c:\source\list.csv") as f:
for row in csv.reader(f):
for url in row:
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
tables = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
for rows in tables.find_all('tr', {'releasetype': 'Current_Releases'})[0::1]:
item = []
for val in rows.find_all('td'):
item.append(val.text.strip())
with open('c:\output_file.csv', 'a', newline='') as f:
writer = csv.writer(f)
writer.writerow({url})
writer.writerows(item)
As of right now, when this script runs, about 50 new lines are added to the bottom of the CSV file (totally expected with append mode). What I would like it to do is determine whether there are duplicate entries already in the CSV file, skip them, and then update the mismatches.
I feel like this should be possible but I can't seem to think of a way
Any thoughts?
You cannot do that without reading the data from the CSV file. Also, to "change the mismatches", you will just have to overwrite them:
with open(r'c:\output_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for item in list_to_write_from:
        writer.writerow(item)
Here, you are assuming that list_to_write_from will contain the most current form of the data you need.
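If you do want to keep appending but skip rows that are already present, one approach is to read the existing file into a set first and only append rows that are not in it. A minimal sketch, where existing_file and new_rows are placeholder names for your output CSV path and the freshly scraped rows:
import csv
import os

existing_file = r'c:\output_file.csv'
seen = set()
if os.path.isfile(existing_file):
    with open(existing_file, newline='') as f:
        # remember every row already in the file
        seen = {tuple(row) for row in csv.reader(f)}

with open(existing_file, 'a', newline='') as f:
    writer = csv.writer(f)
    for row in new_rows:
        if tuple(row) not in seen:  # skip duplicates
            writer.writerow(row)
            seen.add(tuple(row))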
I found a workaround to this problem, as the answer provided did not work for me.
I added:
import os

if os.path.isfile(r"c:\source\output_file.csv"):
    os.remove(r"c:\source\output_file.csv")
To the top of my code, as this will check to see if that file exists, and deletes it, only to recreate it with the most up to date information later. This is a duct tape way of doing things, but it works.
I am using the following code and it works well, except that the CSV file it produces skips every other line when opened in Excel. I have googled the csv module documentation and other examples on stackoverflow.com, and I found that I need to use DictWriter with the lineterminator set to '\n'. My own attempts to write that into the code have been foiled.
So I am wondering: is there a way to apply this (the lineterminator) to the whole file so that I do not have any lines skipped? And if so, how?
Here is the code:
import urllib2
from BeautifulSoup import BeautifulSoup
import csv
page = urllib2.urlopen('http://finance.yahoo.com/q/ks?s=F%20Key%20Statistics').read()
f = csv.writer(open("pe_ratio.csv","w"))
f.writerow(["Name","PE"])
soup = BeautifulSoup(page)
all_data = soup.findAll('td', "yfnc_tabledata1")
f.writerow([all_data[2].getText()])
Thanks for your help in advance.
You need to open your file with the right options for the csv.writer class to work correctly. The csv module handles line endings itself, so you need to turn off Python's newline translation at the file level.
For Python 2, the docs say:
If csvfile is a file object, it must be opened with the 'b' flag on platforms where that makes a difference.
For Python 3, they say:
If csvfile is a file object, it should be opened with newline=''.
Also, you should probably use a with statement to handle opening and closing your file, like this:
with open("pe_ratio.csv","wb") as f: # or open("pe_ratio.csv", "w", newline="") in Py3
writer = csv.writer(f)
# do other stuff here, staying indented until you're done writing to the file
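For example, a rough Python 3 version of your script with the file opened correctly might look like the sketch below. It assumes the requests and bs4 packages, and that the Yahoo page still serves the same markup, which it may not:
import csv

import requests
from bs4 import BeautifulSoup

page = requests.get('http://finance.yahoo.com/q/ks?s=F%20Key%20Statistics').text
soup = BeautifulSoup(page, 'html.parser')
all_data = soup.find_all('td', 'yfnc_tabledata1')

# newline='' stops Excel from seeing a blank line between every row
with open("pe_ratio.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "PE"])
    writer.writerow(["Ford", all_data[2].get_text()])  # assumes the P/E is still the third cell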
First, since Yahoo provides an API that returns CSV files, maybe you can solve your problem that way? For example, this URL returns a CSV file containing prices, market cap, P/E and other metrics for all stocks in that industry. There is some more information in this Google Code project.
Your code only produces a two-row CSV because there are only two calls to f.writerow(). If the only piece of data you want from that page is the P/E ratio, this is almost certainly not the best way to do it, but you should pass to f.writerow() a tuple containing the value for each column. To be consistent with your header row, that would be something like:
f.writerow( ('Ford', all_data[2].getText()) )
Of course, that assumes that the P/E ratio will always be second in the list. If instead you wanted all the statistics provided on that page, you could try:
# scrape the html for the name and value of each metric
metrics = soup.findAll('td', 'yfnc_tablehead1')
values = soup.findAll('td', 'yfnc_tabledata1')
# create a list of tuples for the writerows method
def stripTag(tag): return tag.text
data = zip(map(stripTag, metrics), map(stripTag, values))
# write to csv file
f.writerows(data)