I am new to programming and to Python. I have written a simple random quote generator that loads various categories of quotes as lists into a dictionary. It then randomly chooses a list, picks a quote from that list, and outputs it to the screen. It is mostly complete, but I am looking for ways to clean up the code and make it more efficient. Right now I have a set of 14 categories that the user can select from to populate the dictionary. Each category selection calls its own function to update the dictionary and the config.ini file that saves the user preferences. That results in hundreds of lines of near-identical code, where the only differences are the specific category and files in use. I am looking for a way to rewrite it so that the same function can be reused each time, with the correct information simply passed in. I have posted snippets of the relevant code below. I am using Python 3.6 and Tkinter. Thank you for any help you can provide.
# Adversity check button to call update_adversity function and add/remove adversity category quotes to/from the dictionary
self.adversity = BooleanVar()
self.adv = Checkbutton(self, text = 'Adversity/Hardship', variable = self.adversity, command = self.update_adversity)
self.adv.grid(row = 1, column = 0, sticky = 'W', padx = 0, pady = 0)
if 'adversity' in quotes:
    self.adversity.set(1)
elif 'adversity' not in quotes:
    self.adversity.set(0)
# add/remove adversity list in dictionary based on checkbutton value
def update_adversity(self):
    if self.adversity.get() == True:
        config.set('categories', 'adversity', 'True') # updates config file
        with open('adversity.py', 'r', encoding = 'UTF8') as f:
            new_quotes_added = f.readlines()
        quotes['adversity'] = new_quotes_added
        try:
            del quotes['default']
            config.set('categories', 'default', 'False') # updates config file
            return quotes
        except:
            return quotes
    elif self.adversity.get() == False:
        config.set('categories', 'adversity', 'False') # updates config file
        try:
            del quotes['adversity']
            if quotes == {}:
                with open('default.py', 'r', encoding = 'UTF8') as f:
                    quotes['default'] = f.readlines()
                config.set('categories', 'default', 'True') # updates config file
                return quotes
            else:
                return quotes
        except:
            return quotes
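One possible way to collapse the 14 near-identical functions is a single helper that takes the category name as a parameter. The sketch below is only an illustration: the name update_category, the quote_dir parameter, and the assumption that every category lives in a file called <name>.py are not part of the posted code.

```python
import os

def update_category(quotes, config, name, enabled, quote_dir='.'):
    """Add or remove one quote category and mirror the change in config.

    Assumes (hypothetically) that each category is stored in '<name>.py'
    inside quote_dir, next to a fallback file 'default.py'.
    """
    config.set('categories', name, str(enabled))  # updates config file
    if enabled:
        path = os.path.join(quote_dir, name + '.py')
        with open(path, 'r', encoding='UTF8') as f:
            quotes[name] = f.readlines()
        # A real category replaces the fallback 'default' list, if present.
        if quotes.pop('default', None) is not None:
            config.set('categories', 'default', 'False')
    else:
        quotes.pop(name, None)
        # Never leave the dictionary empty: reload the default quotes.
        if not quotes:
            path = os.path.join(quote_dir, 'default.py')
            with open(path, 'r', encoding='UTF8') as f:
                quotes['default'] = f.readlines()
            config.set('categories', 'default', 'True')
    return quotes
```

In the Tkinter class, each Checkbutton could then bind the category name through a default argument, for example command=lambda n='adversity': update_category(quotes, config, n, self.adversity.get()), so that one function serves every check button.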
I am trying to scrape data from a Word document available at:
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the E-mail using the below code.
import docx
content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
    stat = content.paragraphs[i].text
    if 'Email' in stat:
        location.append(i)
for i in location:
    print(content.paragraphs[i].text)
I tried the steps described in "How to read data from .docx file in python pandas?", but I am still facing issues.
I need to convert this into a data frame with all the columns mentioned above.
There are some inconsistencies in the document - phone numbers starting with Tel: sometimes, and Tel.: other times, and even Te: once, and I noticed one of the emails is just in the last line for that distributor without the Email: prefix, and the State isn't always in the last line.... Still, for the most part, most of the data can be extracted with regex and/or splits.
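As a small illustration of handling the inconsistent phone prefixes, a single pattern can cover Tel:, Tel.: and Te: at once. This is only a sketch; the names TEL_PREFIX and strip_tel_prefix are made up here, and the sample strings are invented rather than taken from the document.

```python
import re

# Matches 'Te', then any mix of 'l' and '.', then a colon (hypothetical helper).
TEL_PREFIX = re.compile(r'^Te[l.]*\s*:\s*')

def strip_tel_prefix(text):
    """Return the text with a leading Tel:/Tel.:/Te: label removed."""
    return TEL_PREFIX.sub('', text).strip()

for sample in ['Tel: 011-2345678', 'Tel.: 011-2345678', 'Te: 011-2345678']:
    print(strip_tel_prefix(sample))
```

Lines that do not start with one of those labels are returned unchanged, so the helper can also be used as a cheap check.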
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
from bs4 import BeautifulSoup

def getParaColor(para):
    try:
        return BeautifulSoup(
            para.paragraph_format.element.xml, 'xml'
        ).find('color').get('w:val')
    except Exception:
        return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
Then, you can process the paragraphs from docx.Document with a function like:
import re

def splitParas(paras):
    ptc = [(
        p.text, getParaColor(p), p.paragraph_format.element.xml
    ) for p in paras]

    curSectn = 'UNKNOWN'
    splitBlox = [{}]
    for pt, pc, px in ptc:
        # double-check for missing text
        xmlText = BeautifulSoup(px, 'xml').text
        xmlText = ' '.join([s for s in xmlText.split() if s != ''])
        if len(xmlText) > len(pt): pt = xmlText

        # initiate
        if not pt:
            if splitBlox[-1] != {}:
                splitBlox.append({})
            continue
        if pc == '20752E':
            curSectn = pt.strip()
            continue
        if splitBlox[-1] == {}:
            splitBlox[-1]['section'] = curSectn
            splitBlox[-1]['raw'] = []
            splitBlox[-1]['Name'] = []
            splitBlox[-1]['address_raw'] = []

        # collect
        splitBlox[-1]['raw'].append(pt)
        if pc == 'D12229':
            splitBlox[-1]['Name'].append(pt)
        elif re.search("^Te.*:.*", pt):
            splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
        elif re.search("^Mob.*:.*", pt):
            splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
        elif pt.startswith('Email:') or re.search(".*[@].*[.].*", pt):
            splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
        else:
            splitBlox[-1]['address_raw'].append(pt)

    # some cleanup
    if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
    for i in range(len(splitBlox)):
        addrsParas = splitBlox[i]['address_raw'] # for later

        # join lists into strings
        splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
        for k in ['raw', 'address_raw']:
            splitBlox[i][k] = '\n'.join(splitBlox[i][k])

        # search address for City, State and PostCode
        apLast = addrsParas[-1].split(',')[-1]
        maybeCity = [ap for ap in addrsParas if '–' in ap]
        if '–' not in apLast:
            splitBlox[i]['State'] = apLast.strip()
        if maybeCity:
            maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
            maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
            splitBlox[i]['City'] = maybeCity.strip()
            splitBlox[i]['PostCode'] = maybePIN.strip()

        # add mobile to tel
        if 'mobile_raw' in splitBlox[i]:
            if 'tel_raw' not in splitBlox[i]:
                splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
            else:
                splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
            del splitBlox[i]['mobile_raw']

        # split tel [as needed]
        if 'tel_raw' in splitBlox[i]:
            tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
            telNum = []
            for t in range(len(tel_i)):
                if '/' in tel_i[t]:
                    tns = [n.strip() for n in tel_i[t].split('/')]
                    tel1 = tns[0]
                    telNum.append(tel1)
                    for tn in tns[1:]:
                        telNum.append(tel1[:-1*len(tn)]+tn)
                else:
                    telNum.append(tel_i[t])

            splitBlox[i]['Tel_1'] = telNum[0]
            splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum

    return splitBlox
(Since I was getting font color anyway, I decided to add another column called "section" to put East/West/etc in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can be double checked manually at least.
The function combines "Mobile" into "Tel" even though they're extracted with separate regex.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
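To make the '/' handling concrete, here is the same suffix-replacement idea isolated into a tiny function. The name expand_slashed and the sample numbers are invented for illustration; the logic mirrors the tel1[:-1*len(tn)]+tn step in the function above.

```python
def expand_slashed(number):
    """Expand '2345678/79' style shorthand into full numbers.

    Each part after the first replaces the same number of trailing
    characters of the first part (hypothetical helper, not from the answer).
    """
    parts = [p.strip() for p in number.split('/')]
    first = parts[0]
    return [first] + [first[:-len(p)] + p for p in parts[1:]]

print(expand_slashed('2345678/79'))
print(expand_slashed('2345678'))
```

This only gives sensible results when the part after the slash really is a replacement for the last digits; if the document puts a complete second number, or one with different spacing, after the slash, the expanded value can be wrong, which is why the "raw" column is worth keeping.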
After this, you can just view as DataFrame with:
import docx
import pandas

content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs)) # <--all Columns
pandas.DataFrame(splitParas(content.paragraphs))[[
    'section', 'Name', 'address_raw', 'City',
    'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]
I want to read a .bib file (which I will be downloading frequently), go through all the entries in it, and wherever one of a predefined set of fields is missing from an entry, add that field with some static information and then update the file.
For example, if I have a file input.bib like this:
@article{ ISI:000361215300002,
Abstract = {{some abstract}},
Year = {{2016}},
Volume = {{47}}
}
Then I would execute something like this:
python code.py < input.bib > input.bib
And inside the code.py, I want to do something like:
def populateKey(data):
    for entry in data.entries:
        if 'key' not in entry:
            entry['key'] = KEY_VALUE

bib_str = ""
for line in sys.stdin:
    bib_str += line
bib_data = loads(bib_str)
populateKey(bib_data)
bibtex_str = bibtexparser.dumps(bib_data)
print bibtex_str
After I execute above code, I am getting an output that looks like:
@article{ ISI:000361215300002,
abstract = {some abstract},
key = value
volume = {47},
year = {2016},
}
The bibtex module is corrupting my format in that it makes everything lowercase and removes redundant brackets and jumbles the fields. Is there a way to not overwrite the file and just add a specific field wherever the specific field is not present?
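One way to leave the existing formatting alone is to skip the parse-and-dump round trip and edit the text directly. The following is a rough sketch under stated assumptions, not a full BibTeX parser: it assumes every entry starts on its own line in the form @type{citekey, and that field names appear at the start of a line. The names ENTRY_START and add_missing_field are made up for this example.

```python
import re

# First line of an entry, e.g. '@article{ ISI:000361215300002,' (assumed layout).
ENTRY_START = re.compile(r'^@\w+\s*\{[^,\n]*,[ \t]*$', re.MULTILINE)

def add_missing_field(bib_text, field, value):
    """Insert 'field = {{value}},' into each entry that lacks the field.

    All other text is copied through unchanged, so case, braces and
    field order are preserved.
    """
    field_re = re.compile(r'^\s*' + re.escape(field) + r'\s*=',
                          re.IGNORECASE | re.MULTILINE)
    matches = list(ENTRY_START.finditer(bib_text))
    pieces, pos = [], 0
    for i, m in enumerate(matches):
        # The entry body runs up to the next entry start (or end of file).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(bib_text)
        pieces.append(bib_text[pos:m.end()])
        if not field_re.search(bib_text[m.end():end]):
            # Insert right after the '@type{citekey,' line.
            pieces.append('\n{} = {{{{{}}}}},'.format(field, value))
        pos = m.end()
    pieces.append(bib_text[pos:])
    return ''.join(pieces)
```

Because the new field goes directly after the first line of the entry, the trailing-comma situation at the end of the entry does not need to be touched. Separately, note that python code.py < input.bib > input.bib is risky on its own: the shell truncates input.bib for writing before the script gets to read it, so writing to a different file and renaming it afterwards is safer.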
I have some log files that look like many lines of the following:
<tickPrice tickerId=0, field=2, price=201.81, canAutoExecute=1>
<tickSize tickerId=0, field=3, size=25>
<tickSize tickerId=0, field=8, size=534349>
<tickPrice tickerId=0, field=2, price=201.82, canAutoExecute=1>
I need to define a class of type tickPrice or tickSize. I will need to decide which to use before doing the definition.
What would be the Pythonic way to grab these values? In other words, I need an effective way to reverse str() on a class.
The classes are already defined and just contain the presented variables, e.g., tickPrice.tickerId. I'm trying to find a way to extract these values from the text and set the instance attributes to match.
Edit: Answer
This is what I ended up doing-
with open(commandLineOptions.simulationFilename, "r") as simulationFileHandle:
    for simulationFileLine in simulationFileHandle:
        (date, time, msgString) = simulationFileLine.split("\t")
        if ("tickPrice" in msgString):
            msgStringCleaned = msgString.translate(None, ''.join("<>,"))
            msgList = msgStringCleaned.split(" ")
            msg = message.tickPrice()
            msg.tickerId = int(msgList[1][9:])
            msg.field = int(msgList[2][6:])
            msg.price = float(msgList[3][6:])
            msg.canAutoExecute = int(msgList[4][15:])
        elif ("tickSize" in msgString):
            msgStringCleaned = msgString.translate(None, ''.join("<>,"))
            msgList = msgStringCleaned.split(" ")
            msg = message.tickSize()
            msg.tickerId = int(msgList[1][9:])
            msg.field = int(msgList[2][6:])
            msg.size = int(msgList[3][5:])
        else:
            print "Unsupported tick message type"
I'm not sure how you want to dynamically create objects in your namespace, but the following will at least dynamically create objects based on your loglines:
Take your line:
line = '<tickPrice tickerId=0, field=2, price=201.81, canAutoExecute=1>'
Remove chars that aren't interesting to us, then split the line into a list:
line = line.translate(None, ''.join('<>,'))
line = line.split(' ')
Name the potential class attributes for convenience:
line_attrs = line[1:]
Then create your object (name, base tuple, dictionary of attrs):
tickPriceObject = type(line[0], (object,), { key:value for key,value in [at.split('=') for at in line_attrs]})()
Prove it works as we'd expect:
print(tickPriceObject.field)
# 2
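The two-argument form line.translate(None, ...) used above only exists for Python 2 byte strings. For anyone on Python 3, the same steps might look like the sketch below, which builds the deletion table with str.maketrans. Note that, as in the original, the attribute values stay strings.

```python
line = '<tickPrice tickerId=0, field=2, price=201.81, canAutoExecute=1>'

# Python 3: the third argument of str.maketrans lists characters to delete.
cleaned = line.translate(str.maketrans('', '', '<>,'))
name, *attr_pairs = cleaned.split(' ')

# Turn 'key=value' pairs into a dict of class attributes (values are str).
attrs = dict(pair.split('=') for pair in attr_pairs)
tick = type(name, (object,), attrs)()

print(type(tick).__name__, tick.field, tick.price)
```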
Approaching the problem with regex, but with the same result as tristan's excellent answer (and stealing his use of the type constructor that I will never be able to remember)
import re

class_instance_re = re.compile(r"""
    <(?P<classname>\w[a-zA-Z0-9]*)[ ]
    (?P<arguments>
        (?:\w[a-zA-Z0-9]*=[0-9.]+[, ]*)+
    )>""", re.X)

objects = []
for line in whatever_file:
    result = class_instance_re.match(line)
    if result is None:
        continue  # skip lines that are not tick messages
    classname = result.group('classname')
    arguments = result.group('arguments')
    # build the class, then call it to get an instance
    new_obj = type(classname, (object,),
                   dict([s.split('=') for s in arguments.split(', ')]))()
    objects.append(new_obj)
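Since the question says the classes already exist, another option is to look the class up by name and set typed attributes on a fresh instance instead of creating new types. This is a sketch only: the empty tickPrice and tickSize classes stand in for the real ones in the asker's message module, and KNOWN, to_number and parse_tick are names invented here.

```python
import re

# Stand-ins for the asker's existing classes (assumption for this sketch).
class tickPrice:
    pass

class tickSize:
    pass

KNOWN = {'tickPrice': tickPrice, 'tickSize': tickSize}
LINE_RE = re.compile(r'<(\w+)\s+(.*)>')

def to_number(text):
    """Convert '2' to int and '201.81' to float."""
    try:
        return int(text)
    except ValueError:
        return float(text)

def parse_tick(line):
    """Build an instance of the named class with numeric attributes set."""
    m = LINE_RE.search(line)
    if m is None or m.group(1) not in KNOWN:
        raise ValueError('Unsupported tick message type: %r' % line)
    msg = KNOWN[m.group(1)]()
    for pair in m.group(2).split(','):
        key, value = pair.strip().split('=')
        setattr(msg, key, to_number(value))
    return msg

msg = parse_tick('<tickPrice tickerId=0, field=2, price=201.81, canAutoExecute=1>')
print(type(msg).__name__, msg.tickerId, msg.price)
```

Using search rather than match means the date and time columns in front of the message do not need to be split off first.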
I'm writing a Python scraper for OpenData and I have one question: how can I check whether every value is filled in on the site and, if a value is missing, store null instead?
My scraper is here.
I'm currently working on optimizing it.
My variables now look like:
evcisloval = soup.find_all('td')[3].text.strip()
prinalezival = soup.find_all('td')[5].text.strip()
popisfaplnenia = soup.find_all('td')[7].text.replace('\"', '')
hodnotafaplnenia = soup.find_all('td')[9].text[:-1].replace(",", ".").replace(" ", "")
datumdfa = soup.find_all('td')[11].text
datumzfa = soup.find_all('td')[13].text
formazaplatenia = soup.find_all('td')[15].text
obchmenonazov = soup.find_all('td')[17].text
sidlofirmy = soup.find_all('td')[19].text
pravnaforma = soup.find_all('td')[21].text
sudregistracie = soup.find_all('td')[23].text
ico = soup.find_all('td')[25].text
dic = soup.find_all('td')[27].text
cislouctu = soup.find_all('td')[29].text
And the output:
scraperwiki.sqlite.save(unique_keys=["invoice_id"],
    data={ "invoice_id":number,
           "invoice_price":hodnotafaplnenia,
           "evidence_no":evcisloval,
           "paired_with":prinalezival,
           "invoice_desc":popisfaplnenia,
           "date_received":datumdfa,
           "date_payment":datumzfa,
           "pay_form":formazaplatenia,
           "trade_name":obchmenonazov,
           "trade_form":pravnaforma,
           "company_location":sidlofirmy,
           "court":sudregistracie,
           "ico":ico,
           "dic":dic,
           "accout_no":cislouctu,
           "invoice_attachment":urlfa,
           "invoice_url":url})
I googled it but without success.
First, write a configuration dict of your variables in the form:
conf = {'evidence_no': (3, str.strip),
        'trade_form': (21, None),
        ...}
i.e. key is the output key, value is a tuple of id from soup.find_all('td') and of an optional function that has to be applied to the result, None otherwise. You don't need those Slavic variable names that may confuse other SO members.
Then iterate over conf and fill the data dict.
Also, run soup.find_all('td') before the loop.
tds = soup.find_all('td')
data = {}
for name, (num, func) in conf.iteritems():
    text = tds[num].text
    # replace text with None or "NULL" or whatever if needed
    ...
    if func is None:
        data[name] = text
    else:
        data[name] = func(text)
This will remove a lot of duplicated code. Easier to maintain.
Also, I am not sure the strings "NULL" are the best way to write missing data. Doesn't sqlite support Python's real None objects?
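Putting the pieces together, a self-contained version of the loop could look like the sketch below. It is written for Python 3 (items() instead of iteritems()) and, to stay runnable without a page to scrape, takes a plain list of cell texts where the real code would use [td.text for td in soup.find_all('td')]. The function name extract_fields and the missing parameter are assumptions made for this example.

```python
def extract_fields(cell_texts, conf, missing=None):
    """Build the data dict from cell texts using a {name: (index, func)} conf.

    Cells that are absent or blank are stored as `missing` (None by default).
    """
    data = {}
    for name, (index, func) in conf.items():
        text = cell_texts[index] if index < len(cell_texts) else ''
        if func is not None:
            text = func(text)
        data[name] = text if text.strip() else missing
    return data

conf = {'evidence_no': (3, str.strip),
        'trade_form': (21, None)}

cells = [''] * 22          # stand-in for the text of each <td>
cells[3] = ' 12/2014 '
print(extract_fields(cells, conf))
```

Passing missing="NULL" would give the string form instead, if that turns out to be what the database layer expects.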
Just read your attached link, and it seems what you want is
evcisloval = soup.find_all('td')[3].text.strip() or "NULL"
But be careful. You should only do this with strings. If the expression before or evaluates to an empty string, False, None, or 0, it will be replaced with "NULL" in every one of those cases.
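A quick demonstration of that pitfall, together with a stricter alternative that only substitutes for None and blank strings. The helper name text_or_null is made up for this sketch.

```python
def text_or_null(value, null="NULL"):
    """Return `null` only for None or blank strings; keep 0 and False as-is."""
    if value is None:
        return null
    if isinstance(value, str) and not value.strip():
        return null
    return value

# Compare plain `or` with the stricter helper on a few sample values.
for value in ['', 'text', 0, None, False, '0']:
    print(repr(value), '->', repr(value or "NULL"), 'vs', repr(text_or_null(value)))
```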
def replace_acronym(): # function not yet implemented
    #FIND
    for abbr, text in acronyms.items():
        if abbr == acronym_edit.get():
            textadd.insert(0,text)
    #DELETE
    name = acronym_edit.get().upper()
    name.upper()
    r =dict(acronyms)
    del r[name]
    with open('acronym_dict.py','w')as outfile:
        outfile.write(str(r))
        outfile.close() # unnecessary explicit closure since used with...
    message ='{0} {1} {2} \n '.format('Removed', name,'with its text from the database.')
    display.insert('0.0',message)
    #ADD
    abbr_in = acronym_edit.get()
    text_in = add_expansion.get()
    acronyms[abbr_in] = text_in
    # write amended dictionary
    with open('acronym_dict.py','w')as outfile:
        outfile.write(str(acronyms))
        outfile.close()
    message ='{0} {1}:{2}{3}\n '.format('Modified entry', abbr_in,text_in, 'added')
    display.insert('0.0',message)
I am trying to add the functionality of editing my dictionary entries in my tkinter widget. The dictionary is in the format {ACRONYM: text, ACRONYM2: text2...}
What I thought the function would achieve is to find the entry in the dictionary, delete both the acronym and its associated text and then add whatever the acronym and text have been changed to. What happens is for example if I have an entry TEST: test and I want to modify it to TEXT: abc what is returned by the function is TEXT: testabc - appending the changed text although I have (I thought) overwritten the file.
What am I doing wrong?
That's a pretty messy lookin' function. The acronym replacement itself can be done pretty simply:
acronyms = {'SONAR': 'SOund Navigation And Ranging',
            'HTML': 'HyperText Markup Language',
            'CSS': 'Cascading Style Sheets',
            'TEST': 'test',
            'SCUBA': 'Self Contained Underwater Breathing Apparatus',
            'RADAR': 'RAdio Detection And Ranging',
           }

def replace_acronym(a_dict,check_for,replacement_key,replacement_text):
    c = a_dict.get(check_for)
    if c is not None:
        del a_dict[check_for]
        a_dict[replacement_key] = replacement_text
    return a_dict

new_acronyms = replace_acronym(acronyms,'TEST','TEXT','abc')
That works perfect for me (in Python 3). You could just call this in another function that writes the new_acronyms dict into the file or do whatever else you want with it 'cause it's no longer tied to just being written to the file.
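For the file-writing part, a separate pair of functions keeps the storage format in one place. The sketch below swaps the str(dict) dump from the question for JSON, which is an assumption on my part, as are the function names and any file name such as acronyms.json.

```python
import json

def save_acronyms(acronyms, path):
    """Write the whole dictionary to `path`, replacing the old contents."""
    with open(path, 'w', encoding='utf-8') as outfile:
        json.dump(acronyms, outfile, indent=2, sort_keys=True)

def load_acronyms(path):
    """Read the dictionary back from `path`."""
    with open(path, encoding='utf-8') as infile:
        return json.load(infile)

def replace_and_save(acronyms, old_key, new_key, new_text, path):
    """Swap one entry and persist the result only if the old key existed."""
    if old_key in acronyms:
        del acronyms[old_key]
        acronyms[new_key] = new_text
        save_acronyms(acronyms, path)
    return acronyms
```

The Tkinter callback would then only read the entry widgets and call replace_and_save, which keeps the dictionary logic testable without a GUI.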