I am using the following code and it works well, except that the CSV file my code produces skips every other line when opened in Excel. I have googled the csv module documentation and other examples on stackoverflow.com, and I found that I need to use DictWriter with the lineterminator set to '\n'. My own attempts to write it into the code have been foiled.
So I am wondering: is there a way for me to apply this (the lineterminator) to the whole file so that I do not have any lines skipped? And if so, how?
Here is the code:
import urllib2
from BeautifulSoup import BeautifulSoup
import csv
page = urllib2.urlopen('http://finance.yahoo.com/q/ks?s=F%20Key%20Statistics').read()
f = csv.writer(open("pe_ratio.csv","w"))
f.writerow(["Name","PE"])
soup = BeautifulSoup(page)
all_data = soup.findAll('td', "yfnc_tabledata1")
f.writerow([all_data[2].getText()])
Thanks for your help in advance.
You need to open your file with the right options for the csv.writer class to work correctly. The module handles line endings itself, so you need to turn off Python's own newline translation at the file level.
For Python 2, the docs say:
If csvfile is a file object, it must be opened with the 'b' flag on platforms where that makes a difference.
For Python 3, they say:
If csvfile is a file object, it should be opened with newline=''.
Also, you should probably use a with statement to handle opening and closing your file, like this:
with open("pe_ratio.csv","wb") as f: # or open("pe_ratio.csv", "w", newline="") in Py3
writer = csv.writer(f)
# do other stuff here, staying indented until you're done writing to the file
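Putting that together with the code from the question, a minimal sketch of the fixed script might look like this (Python 2, matching the question's imports; the "Ford" label is an assumption based on the ticker):

import urllib2
import csv
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://finance.yahoo.com/q/ks?s=F%20Key%20Statistics').read()
soup = BeautifulSoup(page)
all_data = soup.findAll('td', "yfnc_tabledata1")

with open("pe_ratio.csv", "wb") as csvfile:  # 'wb' stops the extra blank rows in Excel
    writer = csv.writer(csvfile)
    writer.writerow(["Name", "PE"])
    writer.writerow(["Ford", all_data[2].getText()])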
First, since Yahoo provides an API that returns CSV files, maybe you can solve your problem that way? For example, this URL returns a CSV file containing prices, market cap, P/E and other metrics for all stocks in that industry. There is some more information in this Google Code project.
Your code only produces a two-row CSV because there are only two calls to f.writerow(). If the only piece of data you want from that page is the P/E ratio, this is almost certainly not the best way to get it, but in any case you should pass to f.writerow() a tuple containing the value for each column. To be consistent with your header row, that would be something like:
f.writerow( ('Ford', all_data[2].getText()) )
Of course, that assumes that the P/E ratio will always be second in the list. If instead you wanted all the statistics provided on that page, you could try:
# scrape the html for the name and value of each metric
metrics = soup.findAll('td', 'yfnc_tablehead1')
values = soup.findAll('td', 'yfnc_tabledata1')
# create a list of tuples for the writerows method
def stripTag(tag): return tag.text
data = zip(map(stripTag, metrics), map(stripTag, values))
# write to csv file
f.writerows(data)
I have a list of a million pins and one URL that has a pin within the URL
Example:
https://www.example.com/api/index.php?pin=101010&key=113494
I have to replace the number "101010" in the pin position with each of a million values like 093939, 493943, 344454 that I have in a csv file, and then save all of those new URLs to a csv file.
Here's what I have tried doing so far that has not worked:
def change(var_data):
    var = str(var_data)
    url = 'https://www.example.com/api/index.php?pin=101010&key=113494'
    url1 = url.split('=')
    url2 = ''.join(url1[:-2] + [var] + [url1[-1]])
    print(url2)

change('xxxxxxxxxx')
Also, this is for an API request that returns JSON. Would using Python and then iterating through the URLs I save in a csv file be the best way to do this? I want to collect some information for all of the pins that I have and save it to a BigQuery database, or somewhere else where I can connect to Google Data Studio, in order to be able to build a dashboard using all of this data.
Any ideas? What do you think the best way of getting this done would be?
Answering the first part of the question: the change function below builds each URL with an f-string and returns it. It can then be applied to every pin via a list comprehension. The variable url_variables stands in for the list of integers you are reading in from the other file. Finally, the URL list is written to rows of a csv.
import csv

url_variables = [93939, 493943, 344454]

def change(var_data):
    var_data = str(var_data)
    url = 'https://www.example.com/api/index.php?pin='
    key = 'key=113494'
    new_url = f'{url}{var_data}&{key}'
    return new_url

url_list = [change(x) for x in url_variables]

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for val in url_list:
        writer.writerow([val])
Output in output.csv:
https://www.example.com/api/index.php?pin=93939&key=113494
https://www.example.com/api/index.php?pin=493943&key=113494
https://www.example.com/api/index.php?pin=344454&key=113494
1. First part of the question (replacing the number between "pin=" and "&"): I will use an answer from the Change a text between two strings in Python with Regex post:
import re

def change(var_data):
    var = str(var_data)
    url = 'https://www.example.com/api/index.php?pin=101010&key=113494'
    url2 = re.sub("(?<=pin=).*?(?=&)", var, url)
    print(url2)

change('xxxxxxxxxx')
Here I use the sub method from the built-in re package and the regex lookaround syntax, where:

(?<=pin=)  # asserts that what immediately precedes the current position in the string is "pin="
.*?        # matches any run of characters, as few as possible
(?=&)      # asserts that what immediately follows the current position in the string is "&"

Here is a formal explanation of the lookaround syntax.
2. Second part of the question: As another answer explains, you can write the URLs to the csv file as rows, but I recommend you read this post about handling csv files with Python so you can get an idea of how you want to save them.
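For completeness, here is a minimal sketch tying the two parts together: it reads the pins from a csv file and writes one finished URL per row. The file names pins.csv and urls.csv, and the assumption that each pin sits in the first column, are mine, not from the question.

import csv
import re

url_template = 'https://www.example.com/api/index.php?pin=101010&key=113494'

with open('pins.csv', newline='') as src, open('urls.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # swap the placeholder pin for the one read from the file
        new_url = re.sub("(?<=pin=).*?(?=&)", row[0], url_template)
        writer.writerow([new_url])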
I am not very good at English, but I hope that I have explained myself well.
I'm learning to do web scraping and I managed to pull data out of a webpage into an Excel file. But, probably because some item names contain ",", the item names get split across multiple columns in the Excel file.
I have tried using strip and replace on the elements in the list, but it returns an error saying: AttributeError: 'WebElement' object has no attribute 'replace'.
item = driver.find_elements_by_xpath('//h2[@class="list_title"]')
item = [i.replace(",","") for i in item]
price = driver.find_elements_by_xpath('//div[@class="ads_price"]')
price = [p.replace("rm","") for p in price]
Expected result in the Excel file: (screenshot)
Actual result in the Excel file: (screenshot)
The function find_elements_by_xpath returns a list of WebElement objects; you will need to pull each element's text out as a string in order to use the replace function.
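For example, a minimal sketch that reads each element's text before calling replace, reusing the XPaths and the driver object from the question:

items = driver.find_elements_by_xpath('//h2[@class="list_title"]')
names = [i.text.replace(",", "") for i in items]      # .text returns a plain string

prices = driver.find_elements_by_xpath('//div[@class="ads_price"]')
cleaned = [p.text.replace("rm", "") for p in prices]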
Depending on your use case, you may want to reconsider using Excel as your storage medium, unless this is the final step of your process.
The portion of your code that you've included in your question isn't the portion that's relevant to the issue you're experiencing.
As CMMCD mentioned, I would also recommend skipping the binary Excel format for the sake of simplicity and using the built-in csv library instead. This will prevent unintended separators from splitting your cells:
from csv import writer

# your data should be a list of lists
data = [['product1', 8.0], ['product2', 12.25]]  # etc, as an example

with open('your_output_file.csv', 'w', newline='') as file:  # newline='' keeps csv from adding blank rows on Windows
    mywriter = writer(file)
    for line in data:
        mywriter.writerow(line)
The docs: https://docs.python.org/3/library/csv.html
I am trying to write a script to automate browsing to my most commonly visited websites. I have put the websites into a file and am trying to open them using the webbrowser module in Python. My code looks like the following at the moment:
import webbrowser

f = open("URLs", "r")
list = f.readline()
for line in list:
    webbrowser.open_new_tab(list)
This only reads the first line from my file "URLs" and opens it in the browser. Could anyone please help me understand how I can read through the entire file and open each URL in a different tab? I am also open to other approaches that achieve the same thing.
You have two main problems.
The first problem you have is that you are using readline and not readlines. readline will give you the first line in the file, while readlines gives you a list of all the lines in your file.
Take this file as an example:
# urls.txt
http://www.google.com
http://www.imdb.com
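With that file, a quick sketch shows the difference (assuming urls.txt contains just those two lines):

with open("urls.txt") as f:
    first = f.readline()   # 'http://www.google.com\n' - only the first line

with open("urls.txt") as f:
    lines = f.readlines()  # ['http://www.google.com\n', 'http://www.imdb.com\n']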
Also, get into the habit of using a context manager, as this will close the file for you once you have finished reading from it. Right now you are leaving your file open, even though for what you are doing there is no real danger.
Here is the information from the documentation on files. There is a mention about best practices with handling files and using with.
The second problem in your code is that, when you are iterating over list (which you should not use as a variable name, since it shadows the builtin list), you are passing list itself into your webbrowser call. This is definitely not what you are trying to do. You want to pass the loop variable instead.
So, taking all of this into account, your final solution will be:
import webbrowser

with open("urls.txt") as f:
    for url in f:
        webbrowser.open_new_tab(url.strip())
Note the strip that is called in order to ensure that newline characters are removed.
You're not reading the file properly. You're only reading the first line. Also, assuming you were reading the file properly, you're still trying to open list, which is incorrect. You should be trying to open line.
This should work for you:
import webbrowser

with open('file name goes here') as f:
    all_urls = f.read().split('\n')

for each_url in all_urls:
    webbrowser.open_new_tab(each_url)
My answer assumes that you have the URLs one per line in the text file. If they are separated by spaces, simply change the line to all_urls = f.read().split(' '). If they're separated in another way, just change the split accordingly.
First question here so forgive any lapses in the etiquette.
I'm new to Python. I have a small project I'm trying to accomplish, both for practical reasons and as a learning experience, and maybe some people here can help me out. There's a proprietary system I regularly retrieve data from. Unfortunately they don't use standard CSV format; they use a strange character, ‡, to separate data. I need it in CSV format in order to import it into another system. So what I need to do is replace the special character with a comma and clean up the data by removing whitespace and other minor things like unrecognized characters, so it's in the shape I need for the CSV import.
I want to learn some python so I figured I'd write it in python. I'll be reading it from a webservice URL, but for now I just have some test data in the same format I'd receive.
In reality it will be tons of data per request but I can scale it when I understand how to retrieve and manipulate the data properly.
My code so far just trying to read and write two columns from the data:
import requests
import csv

r = requests.get('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=0')
data = r.text

with open("testData.csv", "wb") as csvfile:
    f = csv.writer(csvfile)
    f.writerow(["PlayerID", "Partner"])  # add headers
    for elem in data:
        f.writerow([elem["PlayerID"], elem["Partner"]])
I'm getting this error.
File "csvTest.py", line 14, in
f.writerow([elem["PlayerID"], elem["Partner"]])
TypeError: string indices must be integers
It's probably evident from that that I don't know how to manipulate or read the data properly. I was able to pull back some JSON data and output it, so I know the core structure works with standardized data.
Thanks in advance for any tips.
I'll continue to poke at it.
Sample data is at the dropbox link mentioned in the script.
https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=0
There are multiple problems. First, the link is incorrect, since it returns the HTML page. To get the raw file, use:
r = requests.get ('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=1')
Then, data is a string, so elem in data will iterate over all the characters of the string, which is not what you want.
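A quick illustration of that point:

# iterating a string yields single characters, not records
data = "PlayerID,Partner"
for elem in data:
    print elem  # prints 'P', then 'l', then 'a', ... one character per iteration

# and indexing a character with a string is exactly the reported error:
# elem["PlayerID"]  ->  TypeError: string indices must be integers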
Then, your data is unicode, not a byte string, so it needs to be encoded before the byte-level replace.
Here is your program, with some changes:
import requests
import csv

r = requests.get('https://www.dropbox.com/s/7uhheam5lqppzis/singlelineTest.csv?dl=1')
data = r.text.encode('utf-8').replace("\xc2\x87", ",").splitlines()
headers = data.pop(0).split(",")
pidx = headers.index('PlayerID')
partidx = headers.index('Partner')

with open("testData.csv", "wb") as csvfile:
    f = csv.writer(csvfile)
    f.writerow(["PlayerID", "Partner"])  # add headers
    for line in data:  # the pop(0) above already removed the header row
        words = line.split(',')
        f.writerow([words[pidx], words[partidx]])
Output:
PlayerID,Partner
1038005,EXT
254034,EXT
Use split:
lines = data.split('\n')  # split your data into lines
headers = lines[0].split('‡')
player_index = headers.index('PlayerID')
partner_index = headers.index('Partner')

for line in lines[1:]:  # skip the headers line
    words = line.split('‡')  # split each line by the delimiter '‡'
    print words[player_index], words[partner_index]
For this to work, define the encoding of your python source code as UTF-8 by adding this line to the top of your file:
# -*- coding: utf-8 -*-
Read more about it in PEP 0263.
# my scraper script file
#-*- coding: utf-8 -*-
from selenium import webdriver
import csv
browser = webdriver.Firefox()
browser.get("http://web.com")
f = open("result.csv", 'w')
writer = csv.writer(f)
Then, the first method:
element = browser.find_element_by_xpath("xpath_addr")
temp = [element.get_attribute("innerHTML").encode("utf-8")]
print temp # ['\xec\x84\something\xa8']
writer.writerow(temp)
This results in the correct csv file in my language (e.g. 한글).
But in the second case, which I think is only a little different:
element = browser.find_element_by_xpath("xpath_addr")
temp = element.get_attribute("innerHTML").encode("utf-8")
print temp # "한글"
writer.writerow(temp)
then the csv file is full of non-character garbage. What makes this difference? print also gives different results, but why? (It must be a problem with encoding, which I know little about.)
Firstly, the writerow interface expects a list-like object, so the first snippet is correct for this interface. But in your second snippet, the method assumes the string you have passed as an argument is a list - and iterates it as such - which is probably not what you wanted. You could try writerow([temp]) and see that it should match the output of the first case.
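A short sketch of the difference, reusing the writer from the question with a hypothetical temp value:

temp = u"한글".encode("utf-8")  # a plain byte string, as in the question
writer.writerow(temp)          # iterated as a sequence: one column per byte
writer.writerow([temp])        # a single column containing the whole string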
Secondly, I want to warn you that the Python 2 csv module is notorious for headaches with unicode; basically, unicode is unsupported. Try using unicodecsv as a drop-in replacement for the csv module if you need to support unicode. Then you won't need to encode the strings before writing them to file; you just write the unicode objects directly and let the library handle the encoding.
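A minimal sketch with unicodecsv (installable with pip install unicodecsv); the file name is just an example:

# -*- coding: utf-8 -*-
import unicodecsv

with open("result.csv", "wb") as f:
    writer = unicodecsv.writer(f, encoding="utf-8")
    writer.writerow([u"한글"])  # pass unicode objects directly; the library encodes them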