I'm learning web scraping and managed to pull data out of a webpage into an Excel file. But the item names contain ",", and this split the item names across multiple columns in the Excel file.
I have tried using strip and replace on the elements in the list, but it returns an error saying: AttributeError: 'WebElement' object has no attribute 'replace'.
item = driver.find_elements_by_xpath('//h2[@class="list_title"]')
item = [i.replace(",", "") for i in item]
price = driver.find_elements_by_xpath('//div[@class="ads_price"]')
price = [p.replace("rm", "") for p in price]
Expected result in the Excel file: [screenshot]
Actual result in the Excel file: [screenshot]
The function find_elements_by_xpath returns a list of WebElement objects, not strings; you need to pull the text out of each element (via its .text attribute) before you can call replace on it.
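A minimal sketch of that fix against the code in the question, assuming the same locators and the same Selenium version:

# Extract the text from each WebElement before cleaning it up
items = driver.find_elements_by_xpath('//h2[@class="list_title"]')
item_names = [i.text.replace(",", "") for i in items]
prices = driver.find_elements_by_xpath('//div[@class="ads_price"]')
price_values = [p.text.replace("rm", "").strip() for p in prices]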
Depending on your use case, you may want to reconsider using Excel as your storage medium, unless this is the final step of your process.
The portion of your code that you've included in your question isn't the portion that's relevant to the issue you're experiencing.
As CMMCD mentioned, I would also recommend skipping the binary Excel format for the sake of simplicity and using the built-in csv library instead. This will prevent unintended separators from splitting your cells:
from csv import writer

# your data should be a list of lists
data = [['product1', 8.0], ['product2', 12.25]]  # etc, as an example
with open('your_output_file.csv', 'w', newline='') as file:  # newline='' avoids blank rows on Windows
    mywriter = writer(file)
    for line in data:
        mywriter.writerow(line)
The docs: https://docs.python.org/3/library/csv.html
A related question regarding pandas:
Say I created a DataFrame and generated output under separate variables rather than printing them. How would I go about combining them back into another DataFrame correctly, either to write out as a CSV and then upload to a DB, or to upload to a DB directly?
Everything works fine code-wise; I just haven't seen, and don't know, the best practice for doing this. I know we can store things in a list, dict, etc.
What I did was:
# imported all modules
object = df.iloc[0, 0]

# For loop magic goes here
# nested for loop
# if conditions are met, do this
result = df.iloc[i, k + 1]
print(object, result)
I've also stored them into a separate DataFrame trying:
df2 = pd.DataFrame({'object': object, 'result' : result}, index=[0])
df2.to_csv('output.csv', index=False, mode='a')
The only problem with that is that it appends everything to each row, most likely due to the append mode and perhaps not including it in the for loop. Which is odd, because the raw output is EXACTLY how I'm trying to get it into a CSV or into a DB.
As said though, I'm looking to combine both values back into a DataFrame for speed. I tried concat etc., but no luck, so I was wondering what the correct format would be? Thanks
So it turns out that after more research and revising, I solved my issue.
I referenced the question below plus personal revisions; this is the basis of what I did:
Empty space in between rows after using writer in python
import csv

# Had to wrap this in a for loop that is not listed, appending to the file
# after clearing it first, to remove the blank space after each row
with open('csvexample.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
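For the original goal of combining the per-loop values back into one DataFrame, a pattern that avoids appending to the CSV row by row is to collect the pairs in a list and build the frame once. A minimal sketch (the loop and column names are placeholders, not the actual code from the question):

import pandas as pd

rows = []
# inside the nested loops, collect each pair:
# rows.append((object, result))
rows = [('item1', 10), ('item2', 20)]  # example data standing in for the loop

df2 = pd.DataFrame(rows, columns=['object', 'result'])
df2.to_csv('output.csv', index=False)  # written once, no per-row append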
Additional supporting material:
Confused by python file mode "w+"
When working with json.dump() I noticed that it appears to rewrite the entire document. Is this correct, and is there another way to append to the dictionary like .append() does with lists?
When I write the function like this and change the key value (name), it would appear that the item is being appended.
filename = "infohere.json"
name = "Bob"
numbers = 20
#Write to JSON
def writejson(name = name, numbers = numbers):
with open(filename, "r") as info:
xdict = json.load(info)
xdict[name] = numbers
with open(filename, "w") as info:
json.dump(xdict, info)
When you write it out like this, however, you can see that the code clearly writes over the entire dictionary/JSON file.
filename = "infohere.json"
dict1 = {"Bob": 23, "Mark": 50}
dict2 = {"Ricky": 40}

# Write to JSON
def writejson2(somedict):
    with open(filename, "w") as info:
        json.dump(somedict, info)

writejson2(dict1)
writejson2(dict2)
In the second example only the last input ever shows up, leading me to believe that this rewrites the entire document. If json.dump does write the whole document on each call, does this cause issues with larger JSON files, and if so, is there another method like .append() for dealing with JSON?
Thanks in advance.
Neither.
json.dump doesn't decide whether to delete prior content when it writes to a file. That decision happens when you run open(filename, "w"); that is what deletes the old content.
But: Normal JSON isn't amenable to appends.
A single JSON document is one object. There are variants on the format that allow multiple documents in one file, the most common of which is JSONL (which has one JSON document per line). Unless you're using such a format, trying to append JSON to a non-empty file usually won't result in something that can be successfully parsed.
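If appending one record at a time is the goal, a minimal JSONL-style sketch looks like this (the filename is just an example):

import json

filename = "records.jsonl"

def append_record(record):
    # "a" mode appends; each record becomes one line of JSON
    with open(filename, "a") as f:
        f.write(json.dumps(record) + "\n")

def read_records():
    # read the file back, one document per line
    with open(filename) as f:
        return [json.loads(line) for line in f]

append_record({"Bob": 23})
append_record({"Ricky": 40})
print(read_records())  # both records survive the second write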
I have a huge text file that contains several JSON objects that I want to parse into a CSV file. Because I'm dealing with someone else's data, I can't really change the format it's being delivered in.
Since I don't know how many JSON objects there are, I can't just create a couple of dictionaries, wrap them in a list, and then json.loads() the list.
Also, since all the objects are on a single text line, I can't use a regex to separate each individual JSON object and put them in a list. (It's a super complicated and sometimes triple-nested JSON at some points.)
Here's my current code:
import fileinput
import json
import re
import sys

def json_to_csv(text_file_name, desired_csv_name):
    # Clean up the text file a bit: swap single quotes for double quotes
    file = fileinput.FileInput(text_file_name, inplace=True)
    for line in file:
        sys.stdout.write(line.replace(u'\'', u'"'))
    # Second pass: drop stray quotes inside quoted strings
    ile = fileinput.FileInput(text_file_name, inplace=True)
    for line in ile:
        sys.stdout.write(re.sub(r'("[\s\w]*)"([\s\w]*")', r"\1\2", line))
    # try to load the text file into the content var
    with open(text_file_name, "rb") as fin:
        content = json.load(fin)
    # Rest of the logic uses the json data in content
    # to build the desired csv format
This code gives a ValueError: Extra data: line 1 column 159816, because there is more than one object there.
I've seen similar questions on Google and Stack Overflow, but none of those solutions work here, because it's just one really long line in a text file and I don't know how many objects there are in the file.
If you are trying to split apart the highest-level braces, you could do something like:
string = '{"NextToken": {"value": "...'
objects = eval("[" + string + "]")
and then parse each item in the list.
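If eval on someone else's data is a concern, json.JSONDecoder.raw_decode can walk the line one document at a time instead; a sketch (the sample string is made up):

import json

def split_json_objects(s):
    # raw_decode returns one parsed object plus the index where it ended
    decoder = json.JSONDecoder()
    objects, idx = [], 0
    while idx < len(s):
        obj, end = decoder.raw_decode(s, idx)
        objects.append(obj)
        idx = end
        while idx < len(s) and s[idx].isspace():
            idx += 1  # skip whitespace between objects
    return objects

print(split_json_objects('{"a": 1}{"b": [2, 3]}'))  # [{'a': 1}, {'b': [2, 3]}]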
I am trying to take more than one file in CSV format containing information such as names, email addresses, and other fields, and remove everything except the emails. Then I want to output a new file with the emails all on the same line, separated by semicolons.
The final format should look like:
someone@hotmail.com; someoneelse@gmail.com; someone3@university.edu
I must check that the emails are in the correct format of alphanumeric@alphanumeric.3letters.
I must remove all duplicates.
I must compare this list to another and remove the emails from 1 list that occur in the others.
The final format will be such that someone can copy and paste into Outlook the email recipient addresses.
I have looked at some videos, and also searched here; I found: python csv copy column
But I get an error when trying to write the new file.
I have imported csv and re.
Here is my code below:
def final_emails(email_list):
    with open(email_list) as csv_file:
        read_csv = csv.reader(csv_file, delimiter=',')
        write_csv = csv.writer(out_emails, delimiter=";")
        for row in read_csv:
            email = row[2]  # only take the emails (from column 3)
            if email != '':  # remove empties
                # remove the header, or anything that doesn't have an '@'
                # convert to lowercase and append to list
                emails.append(re.findall('\w*@\w*.\w{3}', email.lower()))
                write_csv.write([email])
    return emails
final_emails(list1)
final_emails(list2)
print(emails)
I have the print at the bottom to check the result.
I added the writer to make a new file, but I get this error:
TypeError: argument 1 must have a "write" method
I'm still trying to learn, and I'm doing many things here that I haven't done before, like csv and regular expressions.
I'd appreciate any assistance. Thank you.
You need to define out_emails as a file handle with write permissions before you can use it in csv.writer.
csv.writer needs an object with a .write method, like a file handle, to be able to write to it. It seems that out_emails doesn't have one.
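A minimal sketch of that fix (the output filename is just an example):

with open('out_emails.csv', 'w', newline='') as out_emails:
    write_csv = csv.writer(out_emails, delimiter=';')
    write_csv.writerow(['someone@hotmail.com', 'someoneelse@gmail.com'])

Opening the file first gives csv.writer the .write method it expects; note that rows go out through writerow, not write.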
I am using the following code and it works well, except that the CSV file my code produces skips every other line when opened in Excel. I have googled the csv module documentation and other examples on stackoverflow.com, and I found that I need to use DictWriter with the lineterminator set to '\n'. My own attempts to write it into the code have been foiled.
So I am wondering: is there a way for me to apply this (the lineterminator) to the whole file so that I do not have any lines skipped? And if so, how?
Here is the code:
import urllib2
from BeautifulSoup import BeautifulSoup
import csv
page = urllib2.urlopen('http://finance.yahoo.com/q/ks?s=F%20Key%20Statistics').read()
f = csv.writer(open("pe_ratio.csv","w"))
f.writerow(["Name","PE"])
soup = BeautifulSoup(page)
all_data = soup.findAll('td', "yfnc_tabledata1")
f.writerow([all_data[2].getText()])
Thanks for your help in advance.
You need to open your file with the right options for the csv.writer class to work correctly. The module has universal newline support internally, so you need to turn off Python's universal newline support at the file level.
For Python 2, the docs say:
If csvfile is a file object, it must be opened with the 'b' flag on platforms where that makes a difference.
For Python 3, they say:
If csvfile is a file object, it should be opened with newline=''.
Also, you should probably use a with statement to handle opening and closing your file, like this:
with open("pe_ratio.csv","wb") as f: # or open("pe_ratio.csv", "w", newline="") in Py3
writer = csv.writer(f)
# do other stuff here, staying indented until you're done writing to the file
First, since Yahoo provides an API that returns CSV files, maybe you can solve your problem that way? For example, this URL returns a CSV file containing prices, market cap, P/E and other metrics for all stocks in that industry. There is some more information in this Google Code project.
Your code only produces a two-row CSV because there are only two calls to f.writerow(). If the only piece of data you want from that page is the P/E ratio, this is almost certainly not the best way to do it, but you should pass to f.writerow() a tuple containing the value for each column. To be consistent with your header row, that would be something like:
f.writerow( ('Ford', all_data[2].getText()) )
Of course, that assumes that the P/E ratio will always be second in the list. If instead you wanted all the statistics provided on that page, you could try:
# scrape the html for the name and value of each metric
metrics = soup.findAll('td', 'yfnc_tablehead1')
values = soup.findAll('td', 'yfnc_tabledata1')
# create a list of tuples for the writerows method
def stripTag(tag): return tag.text
data = zip(map(stripTag, metrics), map(stripTag, values))
# write to csv file
f.writerows(data)
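Putting those pieces together, a Python 3 sketch of the whole script might look like the following. The URL and class names are taken from the question, and the page layout may well have changed since, so treat this as illustrative only:

import csv
import urllib.request
from bs4 import BeautifulSoup  # the modern BeautifulSoup 4 import

page = urllib.request.urlopen('http://finance.yahoo.com/q/ks?s=F%20Key%20Statistics').read()
soup = BeautifulSoup(page, 'html.parser')

# scrape the name and value of each metric, as above
metrics = soup.find_all('td', 'yfnc_tablehead1')
values = soup.find_all('td', 'yfnc_tabledata1')

# newline='' stops the blank row after every record
with open('pe_ratio.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'PE'])
    writer.writerows(zip((m.get_text() for m in metrics),
                         (v.get_text() for v in values)))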