I am using the following line of code for executing and printing data from my sql database. For some reason that is the only command that works for me.
json_string = json.dumps(location_query_1)
My question is that when I print json_string it shows data in the following format:
Actions.py code:
class FindByLocation(Action):
def name(self) -> Text:
return "action_find_by_location"
def run (self, dispatcher: CollectingDispatcher,
tracker: Tracker,
doman: Dict[Text, Any])-> List[Dict[Text,Any]]:
global flag
location = tracker.get_slot("location")
price = tracker.get_slot("price")
cuisine = tracker.get_slot("cuisine")
print("In find by Location")
print(location)
location_query = "SELECT Name FROM Restaurant WHERE Location = '%s' LIMIT 5" % location
location_count_query = "SELECT COUNT(Name) FROM Restaurant WHERE Location = '%s'" % location
location_query_1 = getData(location_query)
location_count_query_1 = getData(location_count_query)
if not location_query_1:
flag = 1
sublocation_view_query = "CREATE VIEW SublocationView AS SELECT RestaurantID, Name, PhoneNumber, Rating, PriceRange, Location, Sublocation FROM Restaurant WHERE Sublocation = '%s'"%(location)
sublocation_view = getData(sublocation_view_query)
dispatcher.utter_message(text="یہ جگہ کس ایریا میں ہے")
else:
flag = 0
if cuisine is None and price is None:
json_string = json.dumps(location_query_1)
print(isinstance(json_string, str))
print("Check here")
list_a=json_string.split(',')
remove=["'",'"','[',']']
for i in remove:
list_a=[s.replace(i, '') for s in list_a]
dispatcher.utter_message(text="Restaurants in Location only: ")
dispatcher.utter_message(list_a)
What should I do se that the data is showed in a vertical list format (new line indentation) and without the bracket and quotation marks? Thank you
First of all, have you tried reading your data into a pandas object? I have done some programs with a sqlite database and this worked for me:
df = pd.read_sql_query("SELECT * FROM {}".format(self.tablename), conn)
But now to the string formatting part:
# this code should do the work for you
# first of all we have our string a like yours
a="[['hallo'],['welt'],['kannst'],['du'],['mich'],['hoeren?']]"
# now we split the string into a list on every ,
list_a=a.split(',')
# this is our list with chars we want to remove
remove=["'",'"','[',']']
# now we replace all elements step by step with nothing
for i in remove:
list_a=[s.replace(i, '') for s in list_a]
print(list_a)
for z in list_a:
print(z)
The output is then:
['hallo', 'welt', 'kannst', 'du', 'mich', 'hoeren?']
hallo
welt
kannst
du
mich
hoeren?
I hope I could help.
Related
I am trying to scrape data from a word document available at:-
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the E-mail using the below code.
import docx
content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
stat = content.paragraphs[i].text
if 'Email' in stat:
location.append(i)
for i in location:
print(content.paragraphs[i].text)
I tried to use the steps mentioned:
How to read data from .docx file in python pandas?
I need to convert this into a data frame with all the columns mentioned above.
Still facing issues with the same.
There are some inconsistencies in the document - phone numbers starting with Tel: sometimes, and Tel.: other times, and even Te: once, and I noticed one of the emails is just in the last line for that distributor without the Email: prefix, and the State isn't always in the last line.... Still, for the most part, most of the data can be extracted with regex and/or splits.
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
# from bs4 import BeautifulSoup
def getParaColor(para):
try:
return BeautifulSoup(
para.paragraph_format.element.xml, 'xml'
).find('color').get('w:val')
except:
return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
Then, you can process the paragraphs from docx.Document with a function like:
# import re
def splitParas(paras):
ptc = [(
p.text, getParaColor(p), p.paragraph_format.element.xml
) for p in paras]
curSectn = 'UNKNOWN'
splitBlox = [{}]
for pt, pc, px in ptc:
# double-check for missing text
xmlText = BeautifulSoup(px, 'xml').text
xmlText = ' '.join([s for s in xmlText.split() if s != ''])
if len(xmlText) > len(pt): pt = xmlText
# initiate
if not pt:
if splitBlox[-1] != {}:
splitBlox.append({})
continue
if pc == '20752E':
curSectn = pt.strip()
continue
if splitBlox[-1] == {}:
splitBlox[-1]['section'] = curSectn
splitBlox[-1]['raw'] = []
splitBlox[-1]['Name'] = []
splitBlox[-1]['address_raw'] = []
# collect
splitBlox[-1]['raw'].append(pt)
if pc == 'D12229':
splitBlox[-1]['Name'].append(pt)
elif re.search("^Te.*:.*", pt):
splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
elif re.search("^Mob.*:.*", pt):
splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
elif pt.startswith('Email:') or re.search(".*[#].*[.].*", pt):
splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
else:
splitBlox[-1]['address_raw'].append(pt)
# some cleanup
if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
for i in range(len(splitBlox)):
addrsParas = splitBlox[i]['address_raw'] # for later
# join lists into strings
splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
for k in ['raw', 'address_raw']:
splitBlox[i][k] = '\n'.join(splitBlox[i][k])
# search address for City, State and PostCode
apLast = addrsParas[-1].split(',')[-1]
maybeCity = [ap for ap in addrsParas if '–' in ap]
if '–' not in apLast:
splitBlox[i]['State'] = apLast.strip()
if maybeCity:
maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
splitBlox[i]['City'] = maybeCity.strip()
splitBlox[i]['PostCode'] = maybePIN.strip()
# add mobile to tel
if 'mobile_raw' in splitBlox[i]:
if 'tel_raw' not in splitBlox[i]:
splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
else:
splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
del splitBlox[i]['mobile_raw']
# split tel [as needed]
if 'tel_raw' in splitBlox[i]:
tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
telNum = []
for t in range(len(tel_i)):
if '/' in tel_i[t]:
tns = [t.strip() for t in tel_i[t].split('/')]
tel1 = tns[0]
telNum.append(tel1)
for tn in tns[1:]:
telNum.append(tel1[:-1*len(tn)]+tn)
else:
telNum.append(tel_i[t])
splitBlox[i]['Tel_1'] = telNum[0]
splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
return splitBlox
(Since I was getting font color anyway, I decided to add another
column called "section" to put East/West/etc in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can be double checked manually at least.
The function combines "Mobile" into "Tel" even though they're extracted with separate regex.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
After this, you can just view as DataFrame with:
#import docx
#import pandas
content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs)) # <--all Columns
pandas.DataFrame(splitParas(content.paragraphs))[[
'section', 'Name', 'address_raw', 'City',
'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]
So I got a list with names and I'm trying to delete the names out of the list. but somehow it doesn't work the way I want it to. Any help would be much appreciated. I think the problem is something with the replace function but I'm not sure.
Funtion:
#pyqtSlot(str, str)
def rapportmail (self, email, wachtwoord):
credentials = Credentials(email, wachtwoord)
acc = Account(email, credentials=credentials, autodiscover=True)
cursor.execute("SELECT DISTINCT Naam FROM Klant")
updatedklant = str(cursor.fetchall())
test = updatedklant
for item in acc.inbox.all().order_by('-datetime_received')[:500]:
inboxmail = str(item.sender).split("'")
currentinboxmail = inboxmail[3]
cursor.execute("SELECT DISTINCT Klant FROM Mail WHERE Mail=?", currentinboxmail)
currentklant = str(cursor.fetchall())
remove_characters = ["(",")",",","]","["]
for characters in remove_characters:
currentklant = currentklant.replace(characters, "")
if currentklant not in updatedklant:
for idx, elem in enumerate(test):
if elem[0] == currentklant:
test.pop(idx)
It is prints this now:
currentklant == 'alerttestting'
test == [('alerttestting', ), ('emretestingsystems', ), ('jarnodebaas', ),('yikes', )]
Result should be:
currentklant == 'alerttestting'
test ==[('emretestingsystems', ), ('jarnodebaas', ),('yikes', )]
Never, ever convert the result of cursor.fetchall() to a str and then strip out the characters you don't like. This is going to be a huge source of trouble. What if the name itself contains one of those characters, like "Bond, James" or "O'Brien"?
Keep it as a Python object (probably a list or tuple) and process it in that form.
In this case you probably wanted something like this:
...
updatedklant = cursor.fetchall() # No more str()
...
currentklant = cursor.fetchall()) # No more str()
...
test = [klant for klant in updatedklant if klant != currentklant]
should work perfectly
[('alerttestting',), ('emretestingsystems',), ('jarnodebaas',), ('yikes',)]
if __name__ == "__main__":
currentklant = 'alerttestting'
test = [('alerttestting',), ('emretestingsystems',), ('jarnodebaas',), ('yikes',)]
for idx, elem in enumerate(test):
if elem[0] == currentklant:
test.pop(idx)
[('emretestingsystems',), ('jarnodebaas',), ('yikes',)]
I have this code, but keep running into versions of the title error. Can anyone help me get past these? Traceback hits on the newfilingDate line (4th from bottom), but I suspect that's not where the actual error is?
def getIndexLink(tickerCode,FormType):
csvOutput = open(IndexLinksFile,"a+b") # "a+b" indicates that we are adding lines rather than replacing lines
csvWriter = csv.writer(csvOutput, quoting = csv.QUOTE_NONNUMERIC)
urlLink = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK="+tickerCode+"&type="+FormType+"&dateb=&owner=exclude&count=100"
pageRequest = urllib.request.Request(urlLink)
with urllib.request.urlopen(pageRequest) as url:
pageRead = url.read()
soup = BeautifulSoup(pageRead,"html.parser")
#Check if there is a table to extract / code exists in edgar database
try:
table = soup.find("table", { "class" : "tableFile2" })
except:
print("No tables found or no matching ticker symbol for ticker symbol for"+tickerCode)
return -1
docIndex = 1
for row in table.findAll("tr"):
cells = row.findAll("td")
if len(cells)==5:
if cells[0].text.strip() == FormType:
link = cells[1].find("a",{"id": "documentsbutton"})
docLink = "https://www.sec.gov"+link['href']
description = cells[2].text.encode('utf8').strip() #strip take care of the space in the beginning and the end
filingDate = cells[3].text.encode('utf8').strip()
newfilingDate = filingDate.replace("-","_") ### <=== Change date format from 2012-1-1 to 2012_1_1 so it can be used as part of 10-K file names
csvWriter.writerow([tickerCode, docIndex, docLink, description, filingDate,newfilingDate])
docIndex = docIndex + 1
csvOutput.close()
byte-like objects can have .replace called on it so long as the replace args are also byte-like. (special thanks to juanpa.arrivillaga for pointing this out)
foo = b'hi-mom'
foo = foo.replace(b"-", b"_")
print(foo)
Alternatively, you can recast to a string and then back to byte-like but that is messy and inefficient.
foo = b'hi-mom'
foo = str(foo).replace("-","_").encode('utf-8')
print(foo)
I am trying to extract raw data from a text file and after processing the raw data, I want to export it to another text file. Below is the python code I have written for this process. I am using the "petl" package in python 3 for this purpose. 'locations.txt' is the raw data file.
import glob, os
from petl import *
class ETL():
def __init__(self, input):
self.list = input
def parse_P(self):
personids = None
for term in self.list:
if term.startswith('P'):
personids = term[1:]
personid = personids.split(',')
return personid
def return_location(self):
location = None
for term in self.list:
if term.startswith('L'):
location = term[1:]
return location
def return_location_id(self, location):
location = self.return_location()
locationid = None
def return_country_id(self):
countryid = None
for term in self.list:
if term.startswith('C'):
countryid = term[1:]
return countryid
def return_region_id(self):
regionid = None
for term in self.list:
if term.startswith('R'):
regionid = term[1:]
return regionid
def return_city_id(self):
cityid = None
for term in self.list:
if term.startswith('I'):
cityid = term[1:]
return cityid
print (os.getcwd())
os.chdir("D:\ETL-IntroductionProject")
print (os.getcwd())
final_location = [['L','P', 'C', 'R', 'I']]
new_location = fromtext('locations.txt', encoding= 'Latin-1')
stored_list = []
for identifier in new_location:
if identifier[0].startswith('L'):
identifier = identifier[0]
info_list = identifier.split('_')
stored_list.append(info_list)
for lst in stored_list:
tabling = ETL(lst)
location = tabling.return_location()
country = tabling.return_country_id()
city = tabling.return_city_id()
region = tabling.return_region_id()
person_list = tabling.parse_P()
for person in person_list:
table_new = [location, person, country, region, city]
final_location.append(table_new)
totext(final_location, 'l1.txt')
However when I use "totext" function of petl, it throws me an "Assertion Error".
AssertionError: template is required
I am unable to understand what the fault is. Can some one please explain the problem I am facing and what I should be doing ?
The template parameter to the toext function is not optional there is no default format for how the rows are written in this case, you must provide a template. Check the doc for toext here for an example: https://petl.readthedocs.io/en/latest/io.html#text-files
The template describes the format of each row that it writes out using the field headers to describe things, you can optionally pass in a prologue to write the header too. A basic template in your case would be:
table_new_template = "{L} {P} {C} {R} {I}"
totext(final_location, 'l1.txt', template=table_new_template)
Here's my problem: I'm trying to parse a big text file (about 15,000 KB) and write it to a MySQL database. I'm using Python 2.6, and the script parses about half the file and adds it to the database before freezing up. Sometimes it displays the text:
MemoryError.
Other times it simply freezes. I figured I could avoid this problem by using generator's wherever possible, but I was apparently wrong.
What am I doing wrong?
When I press Ctrl + C to keyboard interrupt, it shows this error message:
...
sucessfully added vote # 2281
sucessfully added vote # 2282
sucessfully added vote # 2283
sucessfully added vote # 2284
floorvotes_db.py:35: Warning: Data truncated for column 'vote_value' at row 1
r['bill ID'] , r['last name'], r['vote'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "floorvotes_db.py", line 67, in addAllFiles
addFile(file)
File "floorvotes_db.py", line 61, in addFile
add(record)
File "floorvotes_db.py", line 35, in add
r['bill ID'] , r['last name'], r['vote'])
File "build/bdist.linux-i686/egg/MySQLdb/cursors.py", line 166, in execute
File "build/bdist.linux-i686/egg/MySQLdb/connections.py", line 35, in defaulte rrorhandler
KeyboardInterrupt
import os, re, datetime, string
# Data
DIR = '/mydir'
tfn = r'C:\Documents and Settings\Owner\Desktop\data.txt'
rgxs = {
'bill number': {
'rgx': r'(A|S)[0-9]+-?[A-Za-z]* {50}'}
}
# Compile rgxs for speediness
for rgx in rgxs: rgxs[rgx]['rgx'] = re.compile(rgxs[rgx]['rgx'])
splitter = rgxs['bill number']['rgx']
# Guts
class floor_vote_file:
def __init__(self, fn):
self.iterdata = (str for str in
splitter.split(open(fn).read())
if str and str <> 'A' and str <> 'S')
def iterVotes(self):
for record in self.data:
if record: yield billvote(record)
class billvote(object):
def __init__(self, section):
self.data = [line.strip() for line
in section.splitlines()]
self.summary = self.data[1].split()
self.vtlines = self.data[2:]
self.date = self.date()
self.year = self.year()
self.votes = self.parse_votes()
self.record = self.record()
# Parse summary date
def date(self):
d = [int(str) for str in self.summary[0].split('/')]
return datetime.date(d[2],d[0],d[1]).toordinal()
def year(self):
return datetime.date.fromordinal(self.date).year
def session(self):
"""
arg: 2-digit year int
returns: 4-digit session
"""
def odd():
return divmod(self.year, 2)[1] == 1
if odd():
return str(string.zfill(self.year, 2)) + \
str(string.zfill(self.year + 1, 2))
else:
return str(string.zfill(self.year - 1, 2))+ \
str(string.zfill(self.year, 2))
def house(self):
if self.summary[2] == 'Assembly': return 1
if self.summary[2] == 'Senate' : return 2
def splt_v_line(self, line):
return [string for string in line.split(' ')
if string <> '']
def splt_v(self, line):
return line.split()
def prse_v(self, item):
"""takes split_vote item"""
return {
'vote' : unicode(item[0]),
'last name': unicode(' '.join(item[1:]))
}
# Parse votes - main
def parse_votes(self):
nested = [[self.prse_v(self.splt_v(vote))
for vote in self.splt_v_line(line)]
for line in self.vtlines]
flattened = []
for lst in nested:
for dct in lst:
flattened.append(dct)
return flattened
# Useful data objects
def record(self):
return {
'date' : unicode(self.date),
'year' : unicode(self.year),
'session' : unicode(self.session()),
'house' : unicode(self.house()),
'bill ID' : unicode(self.summary[1]),
'ayes' : unicode(self.summary[5]),
'nays' : unicode(self.summary[7]),
}
def iterRecords(self):
for vote in self.votes:
r = self.record.copy()
r['vote'] = vote['vote']
r['last name'] = vote['last name']
yield r
test = floor_vote_file(tfn)
import MySQLdb as dbapi2
import floorvotes_parse as v
import os
# Initial database crap
db = dbapi2.connect(db=r"db",
user="user",
passwd="XXXXX")
cur = db.cursor()
if db and cur: print "\nConnected to db.\n"
def commit(): db.commit()
def ext():
cur.close()
db.close()
print "\nConnection closed.\n"
# DATA
DIR = '/mydir'
files = [DIR+fn for fn in os.listdir(DIR)
if fn.startswith('fvote')]
# Add stuff
def add(r):
"""add a record"""
cur.execute(
u'''INSERT INTO ny_votes (vote_house, vote_date, vote_year, bill_id,
member_lastname, vote_value) VALUES
(%s , %s , %s ,
%s , %s , %s )''',
(r['house'] , r['date'] , r['year'],
r['bill ID'] , r['last name'], r['vote'])
)
#print "added", r['year'], r['bill ID']
def crt():
"""create table"""
SQL = """
CREATE TABLE ny_votes (openleg_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
vote_house int(1), vote_date int(5), vote_year int(2), bill_id varchar(8),
member_lastname varchar(50), vote_value varchar(10));
"""
cur.execute(SQL)
print "\nCreate ny_votes.\n"
def rst():
SQL = """DROP TABLE ny_votes"""
cur.execute(SQL)
print "\nDropped ny_votes.\n"
crt()
def addFile(fn):
"""parse and add all records in a file"""
n = 0
for votes in v.floor_vote_file(fn).iterVotes():
for record in votes.iterRecords():
add(record)
n += 1
print 'sucessfully added vote # ' + str(n)
def addAllFiles():
for file in files:
addFile(file)
if __name__=='__main__':
rst()
addAllFiles()
Generators are a good idea, but you seem to miss the biggest problem:
(str for str in splitter.split(open(fn).read()) if str and str <> 'A' and str <> 'S')
You're reading the whole file in at once even if you only need to work with bits at a time. You're code is too complicated for me to fix, but you should be able to use file's iterator for your task:
(line for line in open(fn))
I noticed that you use a lot of slit() calls. This is memory consuming, according to http://mail.python.org/pipermail/python-bugs-list/2006-January/031571.html . You can start investigating this.
Try to comment out add(record) to see if the problem is in your code or on the database side. All the records are added in one transaction (if supported) and maybe this leads to a problem if it get too many records. If commenting out add(record) helps, you could try to call commit() from time to time.
This isn't a Python memory issue, but perhaps it's worth thinking about. The previous answers make me think you'll sort that issue out quickly.
I wonder about the rollback logs in MySQL. If a single transaction is too large, perhaps you can checkpoint chunks. Commit each chunk separately instead of trying to rollback a 15MB file's worth.