I have written a python web scraper and would like to output the data strings I have gotten into a csv/excel file. So far, I have a for loop that accesses multiple database websites and stores the data in a string. I would like to pop out these strings each time I complete the web scraping before moving onto the next page.
Someone suggested creating a whole repository of them, or a dictionary, and then referencing it. I tried implementing it, but my code returns the data in one cell instead of spanning multiple cells; I have a header at the top that separates the data into my desired attributes.
import csv
Substances = []
Whole_list = []
f = open(filename) # chemtest.txt
for sub in f:
    Substances.append(sub)
    print sub
for substance in Substances:
    #some logic
    names1 = [data ]
    Whole_list.append(names1)
with open('chemtest.csv', 'wb') as myfile: #creates new chemtest.csv
    wr = csv.writer(myfile)
    wr.writerow(Whole_list)
So far I'm running through 2 websites as a test and my outputs are:
names1 = ['Acetaldehyde', 'Acetaldehyde', '75-07-0', 'GO1N1ZPR3B', 'CC=O']
Whole_list = [['Acetaldehyde', 'Acetaldehyde', '75-07-0', 'GO1N1ZPR3B', 'CC=O']]
names1 = ['Acetone', 'Acetone', '67-64-1', '1364PS73AF', '=O']
Whole_list = [['Acetaldehyde', 'Acetaldehyde', '75-07-0', 'GO1N1ZPR3B', 'CC=O'], ['Acetone', 'Acetone', '67-64-1', '1364PS73AF', '=O']]
What is wrong with my method exactly and how can I improve it?
Use writerows (note the s at the end). writerow is for writing one line at a time.
wr.writerows(Whole_list)
As a side note, capitalized variable names are usually reserved for classes, so prefer whole_list.
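To see the difference concretely, here is a minimal sketch. It assumes Python 3 and writes to in-memory buffers instead of chemtest.csv; the two rows are the ones shown in the question.

```python
import csv
import io

whole_list = [
    ['Acetaldehyde', 'Acetaldehyde', '75-07-0', 'GO1N1ZPR3B', 'CC=O'],
    ['Acetone', 'Acetone', '67-64-1', '1364PS73AF', '=O'],
]

# writerow treats the outer list as ONE row, so each inner list is
# converted to its string form and ends up in a single cell.
one_row = io.StringIO()
csv.writer(one_row).writerow(whole_list)

# writerows treats each inner list as its own row.
many_rows = io.StringIO()
csv.writer(many_rows).writerows(whole_list)

print(one_row.getvalue())
print(many_rows.getvalue())
```

When writing to a real file in Python 3, open it with open('chemtest.csv', 'w', newline='') rather than 'wb'.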
I'm trying to loop through some unstructured text data in python. End goal is to structure it in a dataframe. For now I'm just trying to get the relevant data in an array and understand the line, readline() functionality in python.
This is what the text looks like:
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
This same format is repeated for lots of text articles in the same file. So far I've figured out how to pull out lines that include certain text. For example, I can loop through it and put all of the article titles in a list like this:
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the below:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            1. Concatenate this line with each line after it, until I reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, add the concatenated full text to the list array.
            2. Continue the for loop within which all of this sits
As a Python beginner, I'm spinning my wheels searching google on this topic. Any pointers would be much appreciated.
If you want to stick with your for-loop, you're probably going to need something like this:
titles = []
texts = []
subjects = []
with open('sample.txt', encoding="utf8") as f:
    inside_fulltext = False
    for line in f:
        if line.startswith("Title:"):
            inside_fulltext = False
            titles.append(line)
        elif line.startswith("Full text:"):
            inside_fulltext = True
            full_text = line
        elif line.startswith("Subject:"):
            inside_fulltext = False
            texts.append(full_text)
            subjects.append(line)
        elif inside_fulltext:
            full_text += line
        else:
            # Possibly throw a format error here?
            pass
(A couple of things: Python is weird about names, and when you write list = [], you're actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords, even though Python technically doesn't, just to save yourself the headache. Also, the startswith method is a little more precise here, given your description of the data.)
Alternatively, you could wrap the file object in an iterator (i = iter(f), and then next(i)), but that's going to cause some headaches with catching StopIteration exceptions - but it would let you use a more classic while-loop for the whole thing. For myself, I would stick with the state-machine approach above, and just make it sufficiently robust to deal with all your reasonably expected edge-cases.
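For comparison, the iterator variant could look roughly like the sketch below. It is only an illustration: an in-memory buffer stands in for sample.txt, and next(lines, None) is used with a default value (my choice, not something the question requires) so that no StopIteration has to be caught.

```python
import io

# Stand-in for open('sample.txt', encoding="utf8"); content is the sample from the question.
sample = io.StringIO(
    "Title: title of an article\n"
    "Full text: unfortunately the full text of each article,\n"
    "is on numerous lines. Each article has a differing number\n"
    "of lines. In this example, there are three..\n"
    "Subject: Python\n"
    "Title: title of another article\n"
    "Full text: again unfortunately the full text of each article,\n"
    "is on numerous lines.\n"
    "Subject: Python\n"
)

titles, texts, subjects = [], [], []
lines = iter(sample)
line = next(lines, None)  # None marks the end of the input
while line is not None:
    if line.startswith("Title:"):
        titles.append(line.strip())
    elif line.startswith("Full text:"):
        full_text = line
        line = next(lines, None)
        # Inner loop: absorb continuation lines until the Subject line.
        while line is not None and not line.startswith("Subject:"):
            full_text += line
            line = next(lines, None)
        texts.append(full_text.strip())
        continue  # re-examine the current line (the Subject line, or None)
    elif line.startswith("Subject:"):
        subjects.append(line.strip())
    line = next(lines, None)

print(titles)
print(texts)
print(subjects)
```

The nested loop reads closer to the pseudocode in the question, at the cost of advancing the iterator in more than one place.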
As your goal is to construct a DataFrame, here is a re+numpy+pandas solution:
import re
import pandas as pd
import numpy as np
# read the whole file
with open('sample.txt', encoding="utf8") as f:
    text = f.read()
keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)
# split text on keys
chunks = re.split(regex, text)[1:]
# reshape flat list of records to group key/value and infos on the same article
df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])
Output:
Title Full text Subject
0 title of an article unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three.. Python
1 title of another article again unfortunately the full text of each article,\nis on numerous lines. Python
This may be considered the second part of the question "Finding an element within an element using Selenium Webdriver".
What I'm doing here is, after extracting each text from the table, writing it into a csv file.
Here is the code:
from selenium import webdriver
import os
import csv
chromeDriver = "/home/manoj/workspace2/RedTools/test/chromedriver"
os.environ["webdriver.chrome.driver"] = chromeDriver
driver = webdriver.Chrome(chromeDriver)
driver.get("https://www.betfair.com/exchange/football/coupon?id=2")
list2 = driver.find_elements_by_xpath('//*[@data-sportid="1"]')
couponlist = []
finallist = []
for game in list2[1:]:
    coup = game.find_element_by_css_selector('span.home-team').text
    print(coup)
    couponlist.append(coup)
print(couponlist)
print('its done')
outfile = open("./footballcoupons.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Games"])
writer.writerows(couponlist)
Results of 3 print statements:
Santos Laguna
CSMS Iasi
AGF
Besiktas
Malmo FF
Sirius
FCSB
Eibar
Newcastle
Pescara
[u'Santos Laguna', u'CSMS Iasi', u'AGF', u'Besiktas', u'Malmo FF', u'Sirius', u'FCSB', u'Eibar', u'Newcastle', u'Pescara']
its done
Now, you can see the code where I write these values into the csv. But they end up written weirdly into the csv; please see the snapshot. Can someone help me fix this, please?
According to the documentation, writerows takes as parameter a list of rows, and
A row must be an iterable of strings or numbers for Writer objects
You are passing a list of strings, so writerows iterates over your strings, making a row out of each character.
You could use a loop:
for team in couponlist:
    writer.writerow([team])
or turn your list into a list of lists, then use writerows:
couponlist = [[team] for team in couponlist]
writer.writerows(couponlist)
But anyway, there's no need to use csv if you only have one column...
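A small sketch of both behaviours, assuming Python 3 and an in-memory buffer in place of footballcoupons.csv (the team names are the first three from the question's output):

```python
import csv
import io

couponlist = ['Santos Laguna', 'CSMS Iasi', 'AGF']

# Passing bare strings: each string is iterated character by character,
# so every character becomes its own cell.
broken = io.StringIO()
csv.writer(broken).writerows(couponlist)

# Wrapping each string in a one-element list gives one cell per row.
fixed = io.StringIO()
writer = csv.writer(fixed)
writer.writerow(['Games'])
writer.writerows([team] for team in couponlist)

print(broken.getvalue())
print(fixed.getvalue())
```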
I'm having trouble getting anything to write in my output file (word_count.txt).
I expect the script to review all 500 phrases in my phrases.txt document, and output a list of all the words and how many times they appear.
from re import findall,sub
from os import listdir
from collections import Counter
# path to folder containing all the files
str_dir_folder = '../data'
# name and location of output file
str_output_file = '../data/word_count.txt'
# the list where all the words will be placed
list_file_data = '../data/phrases.txt'
# loop through all the files in the directory
for str_each_file in listdir(str_dir_folder):
    if str_each_file.endswith('data'):
        # open file and read
        with open(str_dir_folder+str_each_file,'r') as file_r_data:
            str_file_data = file_r_data.read()
        # add data to list
        list_file_data.append(str_file_data)
# clean all the data so that we don't have all the nasty bits in it
str_full_data = ' '.join(list_file_data)
str_clean1 = sub('t','',str_full_data)
str_clean_data = sub('n',' ',str_clean1)
# find all the words and put them into a list
list_all_words = findall('w+',str_clean_data)
# dictionary with all the times a word has been used
dict_word_count = Counter(list_all_words)
# put data in a list, ready for output file
list_output_data = []
for str_each_item in dict_word_count:
    str_word = str_each_item
    int_freq = dict_word_count[str_each_item]
    str_out_line = '"%s",%d' % (str_word,int_freq)
    # populates output list
    list_output_data.append(str_out_line)
# create output file, write data, close it
file_w_output = open(str_output_file,'w')
file_w_output.write('n'.join(list_output_data))
file_w_output.close()
Any help would be great (especially if I'm able to actually output 'single' words within the output list).
Thanks very much.
Would be helpful if we got more information such as what you've tried and what sorts of error messages you received. As kaveh commented above, this code has some major indentation issues. Once I got around those, there were a number of other logic errors to work through. I've made some assumptions:
list_file_data is assigned to '../data/phrases.txt', but there is then a loop through all the files in a directory. Since you don't have any handling for multiple files elsewhere, I've removed that logic and referenced the file listed in list_file_data (and added a small bit of error handling). If you do want to walk through a directory, I'd suggest using os.walk() (http://www.tutorialspoint.com/python/os_walk.htm)
You named your file 'phrases.txt' but then check whether the file names end with 'data'. I've removed this logic.
You've placed the data set into a list when findall works just fine with strings and ignores special characters that you've manually removed. Test here:
https://regex101.com/ to make sure.
Changed 'w+' to '\w+' - check out the above link
Converting to a list outside of the output loop isn't necessary - your dict_word_count is a Counter object which has an 'iteritems' method to roll through each key and value. Also changed the variable name to 'counter_word_count' to be slightly more accurate.
Instead of manually generating csv's, I've imported csv and utilized the writerow method (and quoting options)
Code below, hope this helps:
import csv
import os
from collections import Counter
from re import findall,sub
# name and location of output file
str_output_file = '../data/word_count.txt'
# the list where all the words will be placed
list_file_data = '../data/phrases.txt'
if not os.path.exists(list_file_data):
    raise OSError('File {} does not exist.'.format(list_file_data))
with open(list_file_data, 'r') as file_r_data:
    str_file_data = file_r_data.read()
    # find all the words and put them into a list
    list_all_words = findall('\w+',str_file_data)
    # dictionary with all the times a word has been used
    counter_word_count = Counter(list_all_words)
with open(str_output_file, 'w') as output_file:
    fieldnames = ['word', 'freq']
    writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
    writer.writerow(fieldnames)
    for key, value in counter_word_count.iteritems():
        output_row = [key, value]
        writer.writerow(output_row)
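Note that iteritems only exists in Python 2; in Python 3 the equivalent is items(). A rough Python 3 sketch of the same idea follows. The function name write_word_counts and the sample sentence are made up for illustration, and an in-memory buffer stands in for word_count.txt.

```python
import csv
import io
import re
from collections import Counter

def write_word_counts(text, out_file):
    """Count the words in `text` and write word/frequency rows to `out_file`."""
    counts = Counter(re.findall(r'\w+', text))
    writer = csv.writer(out_file, quoting=csv.QUOTE_ALL)
    writer.writerow(['word', 'freq'])
    # most_common() yields (word, count) pairs, highest count first.
    writer.writerows(counts.most_common())
    return counts

buffer = io.StringIO()
counts = write_word_counts("the cat and the hat", buffer)
print(buffer.getvalue())
```

With a real file, pass open(str_output_file, 'w', newline='') as out_file.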
Something like this?
from collections import Counter
from glob import glob
def extract_words_from_line(s):
    # make this as complicated as you want for extracting words from a line
    return s.strip().split()
tally = sum(
    (Counter(extract_words_from_line(line))
     for infile in glob('../data/*.data')
     for line in open(infile)),
    Counter())
for k in sorted(tally, key=tally.get, reverse=True):
    print k, tally[k]
I’m trying to split downloaded data into a 2D array with different datatypes. The downloaded data looks like this:
000|17:40
000|17:45
010|17:50
025|17:55
056|18:00
178|18:05
202|18:10
203|18:15
190|18:20
072|18:25
013|18:30
002|18:35
000|18:40
000|18:45
000|18:50
000|18:55
000|19:00
000|19:05
000|19:10
000|19:15
000|19:20
000|19:25
000|19:30
000|19:35
000|19:40
I’m using the following code to parse this into a two dimensional array:
#!/usr/bin/python
import urllib2
response = urllib2.urlopen('http://gps.buienradar.nl/getrr.php?lat=52&lon=4')
html = response.read()
htmlsplit = []
for record in html.split("\r\n"):
    htmlsplit.append(record.split("|"))
print htmlsplit
This is working great, but as expected, it treats everything as a string. I’ve found some examples that split into integers. That’s great if both sides were integers. But in my case it’s an integer | string (or maybe some kind of Python time format)
How can I split this directly into different data types?
Something like this?
for record in html.split("\r\n"): # beware, newlines are treacherous!
    s = record.split("|")
    htmlsplit.append((int(s[0]), s[1]))
Just write a parser for each record, if you have data this simple. However, I would add some try/except clause to catch errors for non-conforming lines, empty lines, etc. which may be present in the data. The code above is very fragile. Also, you might want to break at only \n and then clean your strings by strip() (i.e. replace s[1] by s[1].strip()). The integer conversion takes care of it automatically.
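A more defensive version of that idea might look like the sketch below. It assumes every valid line has the form NNN|HH:MM and that bad lines should simply be skipped; the function name parse_records and the sample string are made up for illustration.

```python
from datetime import datetime

def parse_records(raw):
    """Return (int, datetime.time) tuples; skip lines that do not fit 'NNN|HH:MM'."""
    records = []
    for line in raw.splitlines():
        try:
            value, clock = line.split("|")
            parsed_time = datetime.strptime(clock.strip(), "%H:%M").time()
            records.append((int(value), parsed_time))
        except ValueError:
            # Raised for empty lines, a wrong number of fields,
            # a non-numeric value, or an unparseable time.
            continue
    return records

sample = "000|17:40\r\n010|17:50\r\n\r\nbad line\r\n178|18:05\r\n"
print(parse_records(sample))
```

Whether to skip, log, or raise on a bad line depends on how much you trust the source.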
Use str.splitlines instead of splitting on \r\n
Use the csv module to iterate over the lines:
import csv
txt = '000|17:40\n000|17:45\n000|17:50\n000|17:55\n000|18:00\n000|18:05\n000|18:10\n000|18:15\n000|18:20\n000|18:25\n000|18:30\n000|18:35\n000|18:40\n000|18:45\n000|18:50\n000|18:55\n000|19:00\n000|19:05\n000|19:10\n000|19:15\n000|19:20\n000|19:25\n000|19:30\n000|19:35\n000|19:40\n'
reader = csv.reader(txt.splitlines(), delimiter='|')
column1 = []
column2 = []
for c1, c2 in reader:
    column1.append(c1)
    column2.append(c2)
You can also use the DictReader
import StringIO
reader2 = csv.DictReader(StringIO.StringIO(txt),
                         fieldnames=['int', 'time'],
                         delimiter='|')
column1 = []
column2 = []
for row in reader2:
    column1.append(row['time'])
    column2.append(row['int'])
I am working on a school project to make a video club management program and I need some help. Here is what I am trying to do:
I have a txt file with the client data, in which there is this:
clientId:clientFirstName:clientLastName:clientPhoneNumber
The : is the separator for any file in data.
And in the movie title data file I got this:
movieid:movieKindFlag:MovieName:MovieAvalaible:MovieRented:CopieInTotal
The goal is that the rentedData file should contain this:
idClient:IdMovie:DateOfReturn
I am able to do this part. Where I fail due to lack of experience:
I need to actually make a container with 3 levels for the movie data file because I want to track the available and rented numbers (changing them when I rent a movie and when I return one).
The first level represents the whole file, calling it will print the whole file, the second level should have each line in a container, the third one is every word of the line in a container.
Here is an example of what I mean:
dataMovie = [[[movie id],[movie title],[MovieAvailable],[MovieRented],[CopieInTotal]],[[movie id],[movie title],[MovieAvailable],[MovieRented],[CopieInTotal]]]
I actually know that I can do this for a two layer in this way:
DataMovie=[]
MovieInfo = open('Data_Movie', 'r')
#Reading the file and putting it into a container
for ligne in MovieInfo:
    print(ligne, end='')
    words = ligne.split(":")
    DataMovie.append(words)
print(DataMovie)
MovieInfo.close()
It separates all the words into this:
[[MovieID],[MovieTitle],[movie id],[movie title],[MovieAvailable],[MovieRented],[CopieInTotal], [MovieID],[MovieTitle],[movie id],[movie title],[MovieAvailable],[MovieRented],[CopieInTotal]]
Each line is in the same container (second layer) but the lines are not separated. That is not very helpful, since I need to change specific values (the quantity available and the quantity rented) so that a movie cannot be rented when all of its copies are out.
I think you should be using dictionaries to store your data, rather than just embedding lists on top of one another.
Here is a quick page about dictionaries.
http://www.network-theory.co.uk/docs/pytut/Dictionaries.html
So your data might look like
movieDictionary = {"movie_id": 234234, "movie title": "Iron Man", "MovieAvailable": True, "MovieRented": False, "CopieInTotal": 20}
Then when you want to retrieve a value.
movieDictionary["movie_id"]
would yield the value.
234234
you can also embed lists inside of a dictionary value.
Does this help answer your question?
If you have to use a text file, storing it in XML format might make the task easier, since there are already several good XML parsers for Python.
For example ElementTree:
You could structure your data like this:
<?xml version="1.0"?>
<movies>
    <movie id = "1">
        <type>movieKind</type>
        <name>firstmovie</name>
        <MovieAvalaible>True</MovieAvalaible>
        <MovieRented>False</MovieRented>
        <CopieInTotal>2</CopieInTotal>
    </movie>
    <movie id = "2">
        <type>movieKind</type>
        <name>firstmovie2</name>
        <MovieAvalaible>True</MovieAvalaible>
        <MovieRented>False</MovieRented>
        <CopieInTotal>3</CopieInTotal>
    </movie>
</movies>
and then access and modify it like this:
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
search = root.findall('.//movie[@id="2"]')
for element in search:
    rented = element.find('MovieRented')
    rented.text = "False"
tree.write('data.xml')
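If you want to try this without creating data.xml first, a self-contained sketch could parse the XML from a string instead. The shortened document below is an abbreviated stand-in for the structure above, and setting the value to "True" is only there to make the change visible.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for data.xml.
xml_text = """<movies>
  <movie id="1"><name>firstmovie</name><MovieRented>False</MovieRented></movie>
  <movie id="2"><name>firstmovie2</name><MovieRented>False</MovieRented></movie>
</movies>"""

root = ET.fromstring(xml_text)
# ElementTree's limited XPath supports [@attrib='value'] predicates.
for movie in root.findall(".//movie[@id='2']"):
    movie.find('MovieRented').text = "True"

print(ET.tostring(root, encoding='unicode'))
```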
What you are actually doing is creating three databases:
one for clients
one for movies
one for rentals
A relatively easy way to read text files with one record per line and a : separator is to create a csv.reader object. For storing the databases into your program I would recommend using lists of collections.namedtuple objects for the clients and the rentals.
from collections import namedtuple
import csv
Rental = namedtuple('Rental', ['client', 'movie', 'returndate'])
with open('rentals.txt', newline='') as rentalsfile:
    rentalsreader = csv.reader(rentalsfile, delimiter=':')
    rentals = [Rental(int(row[0]), int(row[1]), row[2]) for row in rentalsreader]
And a list of dictionaries for the movies:
with open('movies.txt', newline='') as moviesfile:
    moviesreader = csv.reader(moviesfile, delimiter=':')
    movies = [{'id': int(row[0]), 'kind': row[1], 'name': row[2],
               'rented': int(row[3]), 'total': int(row[4])} for row in moviesreader]
The main reason for using a list of dictionaries for the movies is that a named tuple is a tuple and therefore immutable, and presumably you want to be able to change rented.
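To illustrate the difference, here is a small sketch. It assumes the six-field layout from the question (movieid:movieKindFlag:MovieName:MovieAvalaible:MovieRented:CopieInTotal), keys the movies by id instead of keeping a list (my choice, for easy lookup), and uses made-up sample data and a hypothetical rent helper.

```python
import csv
import io
from collections import namedtuple

Rental = namedtuple('Rental', ['client', 'movie', 'returndate'])

# Stand-in for movies.txt with made-up sample rows.
movies_file = io.StringIO("1:A:Iron Man:2:0:2\n2:B:Alien:0:3:3\n")
movies = {}
for row in csv.reader(movies_file, delimiter=':'):
    movies[int(row[0])] = {'kind': row[1], 'name': row[2],
                           'available': int(row[3]),
                           'rented': int(row[4]),
                           'total': int(row[5])}

def rent(movie_id):
    """Move one copy from available to rented; return False if none are left."""
    movie = movies[movie_id]
    if movie['available'] == 0:
        return False
    movie['available'] -= 1   # dictionaries can be updated in place
    movie['rented'] += 1
    return True

print(rent(1), movies[1])
print(rent(2), movies[2])

rental = Rental(client=7, movie=1, returndate='2024-01-31')
try:
    rental.returndate = '2024-02-07'   # namedtuple fields cannot be reassigned
except AttributeError:
    # _replace builds a new tuple with the changed field instead.
    rental = rental._replace(returndate='2024-02-07')
print(rental)
```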
Referring to your comment on Daniel Rasmuson's answer, since you only put the values of the fields in the text files, you will have to hardcode the names of the fields into your program one way or another.
An alternative solution is to store the data in JSON files. Those are easily mapped to Python data structures.
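As a rough sketch of that mapping (the field names and the sample movie are assumptions for illustration), a list of dictionaries survives a round trip through JSON text unchanged. With real files you would use json.dump and json.load on an open file instead of the string functions shown here.

```python
import json

movies = [
    {"id": 1, "kind": "A", "name": "Iron Man", "rented": 0, "total": 2},
]

text = json.dumps(movies, indent=2)   # the text that would be stored in the file
restored = json.loads(text)           # what reading the file would give back
restored[0]["rented"] += 1            # plain dicts, so they can be modified

print(text)
print(restored)
```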
This might be what you were looking for
#Using OrderedDict so we always get the items in the right order when iterating.
#So the values match up with the categories\headers
from collections import OrderedDict as Odict

class DataContainer(object):
    def __init__(self, fileName):
        '''
        Loading the text file in a list. First line assumed a header line and is used to set dictionary keys
        Using OrderedDict to fix the order of iteration for dict, so values match up with the headers again when called
        '''
        self.file = fileName
        self.data = []
        with open(self.file, 'r') as content:
            # next(content) works in both Python 2 and 3
            self.header = next(content).split('\n')[0].split(':')
            for line in content:
                words = line.split('\n')[0].split(':')
                self.data.append(Odict(zip(self.header, words)))

    def __call__(self):
        '''Outputs the contents as a string that can be written back to the file'''
        lines = []
        lines.append(':'.join(self.header))
        for i in self.data:
            this_line = ':'.join(i.values())
            lines.append(this_line)
        newContent = '\n'.join(lines)
        return newContent

    def __getitem__(self, index):
        '''Allows index access self[index]'''
        return self.data[index]

    def __setitem__(self, index, value):
        '''Allows editing of values self[index]'''
        self.data[index] = value

d = DataContainer('data.txt')
d[0]['MovieAvalaible'] = 'newValue' # Example of how to set the values
#Will print out a string with the contents
print(d())