I am in the learning phase of writing Python code. I created the code below and it runs successfully; however, I have been asked to refactor it and I am not sure how to proceed. I referred to multiple posts related to refactoring but only got more confused, and it was not clear how it's done. Any assistance will be appreciated. Thanks.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns',100)
data = pd.read_excel (r'S:\folder\file1.xlsx')
df_mail =pd.DataFrame(data,columns= ['CustomerName','CDAAccount','Transit'])
print(df_mail)
df_maillist =df_mail.rename(columns={'CDAAccount':'ACOUNT_NUM','Transit':'BRANCH_NUM'})
print(df_maillist)
## 1) Read SAS files
pathcifbas = r'S:\folder\custbas.sas7bdat'
pathcifadr = r'S:\folder\cusadr.sas7bdat'
pathcifacc = r'S:\folder\cusact.sas7bdat'
##custbas.sas7bdat
columns=['CIFNUM','CUSTOMR_LANGUG_C']
dfcifbas = pd.read_sas(pathcifbas)
print(dfcifbas.head())
df_langprf= dfcifbas[columns]
print(df_langprf.head())
df_lang =df_langprf.rename(columns={'CUSTOMR_LANGUG_C':'Language Preference'})
print(df_lang)
## cusadr.sas7bdat
dfcifadr = pd.read_sas(pathcifadr)
print(dfcifadr.head())
cols=['CIFNUM','ADRES_STREET_NUM','ADRES_STREET_NAME','ADRES_CITY','ADRES_STATE_PROV_C','FULL_POSTAL','ADRES_COUNTRY_C','ADRES_SPECL_ADRES']
df_adr= dfcifadr[cols]
print(df_adr.head())
### Renaming the columns
df_adrress =df_adr.rename(columns={'ADRES_CITY':'City','ADRES_STATE_PROV_C':'Province','FULL_POSTAL':'Postal Code','ADRES_COUNTRY_C':'Country','ADRES_SPECL_ADRES':'Special Address'})
print(df_adrress)
## cusact.sas7bdat
dfcifacc = pd.read_sas(pathcifacc)
print(dfcifacc.head())
colmns=['CIFNUM','ACOUNT_NUM','BRANCH_NUM','APLICTN_ID']
df_acc= dfcifacc[colmns]
print(df_acc)
## Filtering the tables with ['APLICTN_ID']== b'CDA'
df_cda= df_acc['APLICTN_ID']== b'CDA'
print(df_cda.head())
df_acccda = df_acc[df_cda]
print(df_acccda)
## Joining dataframes (df_lang), (df_adrress) and (df_acccda) on CIF_NUM
from functools import reduce
Combine_CIFNUM= [df_acccda,df_lang,df_adrress ]
df_cifnum = reduce(lambda left,right: pd.merge(left,right,on='CIFNUM'), Combine_CIFNUM)
print(df_cifnum)
#convert multiple columns object byte to string
df_cifnumstr = df_cifnum.select_dtypes([object])
df_cifnumstr=df_cifnumstr.stack().str.decode('latin1').unstack()
for col in df_cifnumstr:
df_cifnum[col] = df_cifnumstr[col]
print(df_cifnum) ## Combined Data Frame
# Joining Mail list with df_cifnum(combined dataframe)
Join1_mailcifnum=pd.merge(df_maillist,df_cifnum, on=['ACOUNT_NUM','BRANCH_NUM'],how='left')
print(Join1_mailcifnum)
## dropping unwanted columns
Com_maillist= Join1_mailcifnum.drop(['CIFNUM','APLICTN_ID'], axis =1)
print(Com_maillist)
## concatenating Street Num + Street Name = Street Address
Com_maillist["Street Address"]=(Com_maillist['ADRES_STREET_NUM'].map(str)+ ' ' + Com_maillist['ADRES_STREET_NAME'].map(str))
print (Com_maillist.head())
## Rearranging columns
Final_maillist= Com_maillist[["CustomerName","ACOUNT_NUM","BRANCH_NUM","Street Address","City","Province","Postal Code","Country","Language Preference","Special Address"]]
print(Final_maillist)
## Export to excel
Final_maillist.to_excel(r'S:\Data Analysis\folder\Final_List.xlsx', index=False, sheet_name='Final_Maillist', header=True)
Good code refactoring can be composed of many different steps, and depending on what your educator/client/manager/etc. expects, could involve vastly different amounts of effort and time spent. It's a good idea to ask this person what expectations they have for this specific project and start there.
However, for someone relatively new to Python I'd recommend you start with readability and organization. Make sure all your variable names are explicit and readable (assuming you're not using a required pattern like Hungarian notation). As a starting point, the Python naming conventions tend to use lowercase letters and underscores, with exceptions for certain objects and class names. Python actually has a really in-depth style guide called PEP 8, which you can find here:
https://www.python.org/dev/peps/pep-0008/
A personal favorite of mine is comments. Comments should always contain the "why" of something, not necessarily the "how" (your code should be readable enough to make that part relatively obvious). This is a bit harder for smaller scripts or assignments where you don't have a ton of individual choice, but it's good to keep in mind.
If you've learned about object oriented programming, you should definitely split up tasks into functions and classes. In your specific case, you could create individual functions for things like loading files, performing specific operations on the file contents, and exporting. If you notice a bunch of functions that tend to have similar themes, that may be a good time to look into creating a class for those functions!
Finally, and again this is a personal preference (for basic scripts anyway), I like to see a main declaration for readability and organization.
# imports go here!

# specific functions
def some_function():
    return

if __name__ == "__main__":
    # the start of your program goes here!
    some_function()
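Applied to your script, a first pass might look something like this. This is only a minimal sketch: the paths and column names are copied from your code, while the function names and the renames parameter are my own suggestions.

import pandas as pd

def load_mail_list(path):
    # Read the Excel mail list and align its column names with the SAS data
    df = pd.read_excel(path)[['CustomerName', 'CDAAccount', 'Transit']]
    return df.rename(columns={'CDAAccount': 'ACOUNT_NUM', 'Transit': 'BRANCH_NUM'})

def load_sas_columns(path, columns, renames=None):
    # Read a SAS file, keep only the columns of interest, optionally rename them
    df = pd.read_sas(path)[columns]
    return df.rename(columns=renames) if renames else df

if __name__ == "__main__":
    mail_list = load_mail_list(r'S:\folder\file1.xlsx')
    languages = load_sas_columns(r'S:\folder\custbas.sas7bdat',
                                 ['CIFNUM', 'CUSTOMR_LANGUG_C'],
                                 {'CUSTOMR_LANGUG_C': 'Language Preference'})
    # ...load the address and account files the same way,
    # then filter, merge and export as in the original script

Each piece can now be tested on its own, and the repetition across the three SAS files collapses into one function.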
This is all pretty heavily simplified for the purposes of just starting out. There are plenty of other resources that can go more in depth in organization, good practices, and optimization.
Best of luck!
I am trying to get all the data from a view (Lotus Notes) with LotusScript and Python (the noteslib module) and export it to CSV, but the problem is that this takes too much time. I have tried two ways, both looping through all documents:
import noteslib
db = noteslib.Database('database','file.nsf')
view = db.GetView('My View')
doc = view.GetFirstDocument()
data = list()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)
Getting about 1,000 lines of data took 70 seconds, but the view has about 85,000 lines, so getting all the data would take far too long, especially since exporting everything to CSV manually via File -> Export in Lotus Notes takes only about 2 minutes.
And I tried a second way, with AllEntries, but it was even slower:
database = []
ec = view.AllEntries
ent = ec.GetFirstEntry()
while ent:
    row = []
    for v in ent.ColumnValues:
        row.append(v)
    database.append(row)
    ent = ec.GetNextEntry(ent)
Everything that I found on the Internet is based on "NextDocument" or "AllEntries". Is there any way to do it faster?
It is (or at least used to be) very expensive from a time standpoint to open a Notes document, like you are doing in your code.
Since you are saying that you want to export the data that is being displayed in the view, you could use the NotesViewEntry class instead. It should be much faster.
Set col = view.AllEntries
Set entry = col.GetFirstEntry()
Do Until entry Is Nothing
    values = entry.ColumnValues '*** Array of column values
    '*** Do stuff here
    Set entry = col.GetNextEntry(entry)
Loop
I wrote a blog about this back in 2013:
http://blog.texasswede.com/which-is-faster-columnvalues-or-getitemvalue/
Something is going on with your code "outside" the view navigation: you already chose the most performant way to navigate a view, using GetFirstDocument and GetNextDocument. Using the NotesViewNavigator as mentioned in the comments will be slightly better, but not significantly so.
You might get a little bit of performance out of your code by setting view.AutoUpdate = False to prevent the view object from refreshing when something in the backend changes. But as you only read data and do not change view data, that will not give you much of a performance boost.
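In the Python code from the question, that is a single assignment before the loop (assuming view is the view object obtained via noteslib):

view.AutoUpdate = False  # stop the view refreshing while we only read from it
doc = view.GetFirstDocument()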
My suggestion: identify the REAL bottleneck of your code by commenting out individual sections to find out where it starts to get slow:
First attempt:
doc = view.GetFirstDocument()
while doc:
    doc = view.GetNextDocument(doc)
Slow?
If not then next attempt:
doc = view.GetFirstDocument()
while doc:
    arr = doc.ColumnValues
    doc = view.GetNextDocument(doc)
Slow?
If yes: ColumnValues is your enemy...
If not then next attempt:
doc = view.GetFirstDocument()
while doc:
    arr = doc.ColumnValues
    data.append(arr)
    doc = view.GetNextDocument(doc)
I would be very interested to hear where it starts to become slow.
I would suspect the performance issue is using COM/ActiveX in Python to access Notes databases. Transferring data via COM involves datatype 'marshalling', possibly at every step, and especially for 'out-of-process' method/property calls.
I don't think there is any way around this in COM. You should consider arranging a Notes 'agent' to do this for you instead (LotusScript or Java, maybe). Even a basic LotusScript agent can export thousands of docs per minute. A further alternative may be to look at the Notes C API (not an easy option, and it requires API calls from Python).
I'm about to roll my own property-file parser. I've got a somewhat odd requirement where I need to be able to store metadata in an existing field of a GUI. The data needs to be easily parseable and human readable, preferably with some flexibility in defining the data (no YAML, for example).
I was thinking I could do something like this:
this is random text that is truly a description
.metadata.
owner.first: rick
owner.second: bob
property: blue
pets.mammals.dog: rufus
pets.mammals.cat: ludmilla
I was thinking I could use something like '.metadata.' to denote that anything below that line is metadata to be parsed. Then I would treat the properties almost like Java properties, reading each line in and building a map (or object) to hold the metadata, which would then be output and searchable via a simple web app.
My real question, before I roll this on my own: can anyone suggest a better method for solving this problem? A specific data format or library that would fit this use case? I would normally use something like YAML, but there's no good way for me to validate that the data is indeed in YAML format when it is saved.
You have 3 problems:
How to fit two different things into one box.
If you are mixing free form text and something that is more tightly defined, you are always going to end up with stuff that you can't parse. Then you will have a never ending battle of trying to deal with the rubbish that gets put in. Is there really no other way?
How to define a simple format for metadata that is robust enough for simple use.
This is a hard problem - all attempts to do so seem to expand until they become quite complicated (e.g. YAML). You will probably have custom requirements for your domain, so what you've proposed may be best.
How to parse that format.
For this I would recommend parsy.
It would be quite simple to split the text on .metadata. and then parse what remains.
Here is an example using parsy:
from parsy import *
attribute = letter.at_least(1).concat()
name = attribute.sep_by(string("."))
value = regex(r"[^\n]+")
definition = seq(name << string(":") << string(" ").many(), value)
metadata = definition.sep_by(string("\n"))
Example usage:
>>> metadata.parse_partial("""owner.first: rick
owner.second: bob
property: blue
pets.mammals.dog: rufus
pets.mammals.cat: ludmilla""")
([[['owner', 'first'], 'rick'],
[['owner', 'second'], 'bob'],
[['property'], 'blue'],
[['pets', 'mammals', 'dog'], 'rufus'],
[['pets', 'mammals', 'cat'], 'ludmilla']],
'')
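To deal with the free-form description above the marker, one option is to split on '.metadata.' first and hand only the tail to the parser. A tiny sketch, where field_contents is a hypothetical string holding the whole field:

free_text, _, meta = field_contents.partition(".metadata.")
parsed = metadata.parse(meta.strip())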
YAML is a simple and nice solution. There is a YAML library in Python:
import yaml

output = {'a': 1, 'b': {'c': [2, 3, 4]}}
print(yaml.dump(output, default_flow_style=False))
Giving as a result:
a: 1
b:
c:
- 2
- 3
- 4
You can also parse from a string and so on. Just explore it and check if it fits your requirements.
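For instance, reading the same structure back from a string. A minimal sketch using yaml.safe_load:

import yaml

# Round-trip the dumped text back into Python objects
data = yaml.safe_load("a: 1\nb:\n  c:\n  - 2\n  - 3\n  - 4\n")
print(data)  # {'a': 1, 'b': {'c': [2, 3, 4]}}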
Good luck!
I'm working on a GUI editor for a proprietary config format. Basically the editor will parse the config file, display the object properties so that users can edit them from the GUI, and then write the objects back to the file.
I've got the parse - edit - write part done, except for:
The parsed data structure only includes object property information, so comments and whitespace are lost on write
If there is any syntax error, the rest of the file is skipped
How would you address these issues? What is the usual approach to this problem? I'm using Python and the Parsec module (https://pythonhosted.org/parsec/documentation.html), but any help and general direction is appreciated.
I've also tried Pylens (https://pythonhosted.org/pylens/), which is really close to what I need, except it cannot skip syntax errors.
You asked about typical approaches to this problem. Here are two projects which tackle similar challenges to the one you describe:
sketch-n-sketch: "Direct manipulation" interface for vector images, where you can either edit the image-describing source language, or edit the image it represents directly and see those changes reflected in the source code. Check out the video presentation, it's super cool.
Boomerang: Using lenses to "focus" on the abstract meaning of some concrete syntax, alter that abstract model, and then reflect those changes in the original source.
Both projects have yielded several papers describing the approaches their authors took. As far as I can tell, the Lens approach is popular, where parsing and printing become the get and put functions of a Lens which takes some source code and focuses on the abstract concept that code describes.
Eventually I ran out of research time and had to settle for rather manual skipping: basically, each time the parser fails, we try to advance the cursor one character and repeat. Any part skipped by this process, whether whitespace, comment, or syntax error, is dumped into a Text structure. The code is quite reusable, except that you have to incorporate it at every place where results may repeat and the original parser may fail.
Here's the code, in case it helps anyone. It is written for Parsy.
from parsy import Parser, Result

class Text(object):
    '''Structure to contain all the parts that the parser does not understand.
    A better name would be Whitespace.
    '''
    def __init__(self, text=''):
        self.text = text

    def __repr__(self):
        return "Text(text='{}')".format(self.text)

    def __eq__(self, other):
        return self.text.strip() == getattr(other, 'text', '').strip()

def many_skip_error(parser, skip=lambda t, i: i + 1, until=None):
    '''Repeat the original `parser`, aggregating results into `values`
    and errors into `Text`.
    '''
    @Parser
    def _parser(stream, index):
        values, result = [], None
        while index < len(stream):
            result = parser(stream, index)
            # Original parser success
            if result.status:
                values.append(result.value)
                index = result.index
            # Check for end condition, effectively `manyTill` in Parsec
            elif until is not None and until(stream, index).status:
                break
            # Aggregate skipped text into the last `Text` value, or create a new one
            else:
                if len(values) > 0 and isinstance(values[-1], Text):
                    values[-1].text += stream[index]
                else:
                    values.append(Text(stream[index]))
                index = skip(stream, index)
        return Result.success(index, values).aggregate(result)
    return _parser

# Example usage
skip_error_parser = many_skip_error(original_parser)
On another note, I guess the real issue here is that I'm using a parser combinator library instead of a proper two-stage parsing process. In traditional parsing, the tokenizer handles retaining/skipping any whitespace/comment/syntax error, making them all effectively whitespace that is invisible to the parser.
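As a rough illustration of that two-stage idea, here is a toy tokenizer; the token set is invented and a real one would mirror the config format:

import re

# Anything the tokenizer cannot classify falls through to SKIP, so the
# parser stage never sees whitespace, comments or garbage
TOKEN_RE = re.compile(r'(?P<NAME>[A-Za-z_]\w*)|(?P<NUMBER>\d+)|(?P<SKIP>\s+|#[^\n]*|.)')

def tokenize(text):
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup != 'SKIP':
            yield match.lastgroup, match.group()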
What advantages and/or disadvantages are there to using a "snippets" plugin (e.g. snipmate, ultisnips) for Vim, as opposed to simply using the built-in "abbreviations" functionality?
Are there specific use cases where declaring iabbr, cabbr, etc. lacks some major features that the snippet plugins provide? I've been unsuccessful in finding a thorough comparison between these two "features" and their respective implementations.
As @peter-rincker pointed out in a comment:
It should be noted that abbreviations can execute code as well. Often via <c-r>= or via an expression abbreviation (<expr>). Example which expands ## to the current file's path: :iabbrev ## <c-r>=expand('%:p')<cr>
As an example for python, let's compare a snipmate snippet and an abbrev in Vim for inserting lines for class declaration.
Snipmate
# New Class
snippet cl
    class ${1:ClassName}(${2:object}):
        """${3:docstring for $1}"""
        def __init__(self, ${4:arg}):
            ${5:super($1, self).__init__()}
            self.$4 = $4
        ${6}
Vimscript
au FileType python :iabbr cl class ClassName(object):<CR><Tab>"""docstring for ClassName"""<CR>def __init__(self, arg):<CR><Tab>super(ClassName, self).__init__()<CR>self.arg = arg
Am I missing some fundamental functionality of "snippets", or am I correct in assuming they are overkill for the most part, given that Vim's abbr and :help template templates are able to do almost all of the stuff snippets do?
I assume it's easier to implement snippets, and they provide additional aesthetic/visual features. For instance, if I use abbr in Vim along with other plugins for running/testing Python code inside Vim (e.g. syntastic, pytest, ropevim, pep8), am I missing out on some key features that snippets provide?
Everything that can be done with snippets can be done with abbreviations and vice-versa. You can have (mirrored or not) placeholders with abbreviations, you can have context-sensitive snippets.
There are two important differences:
Abbreviations are triggered when the abbreviation text has been typed and a non-word character (or Esc) is hit. Snippets are triggered on demand, and shortcuts are possible (no need to type while + tab; w + tab may be enough).
It's much easier to define new snippets (or to maintain old ones) than to define abbreviations. With abbreviations, a lot of boilerplate code is required when we want to do neat things.
There are a few other differences. For instance, abbreviations are always triggered everywhere, and seeing for expanded into for(placeholder) {\n} within a comment or a string context is certainly not what the end-user expects. With snippets, this is not a problem any more: we can expect the end-user to know what he's doing when he asks to expand a snippet. Still, we can propose context-aware snippets that expand throw into @throw {domain::exception} {explanation} within a comment, or into throw domain::exception({message}); elsewhere.
Snippets
Rough superset of Vim's native abbreviations. Here are the highlights:
Only trigger on key press
Uses placeholders which a user can jump between
Exist only for insert mode
Dynamic expansions
Abbreviations
Great for common typos and small snippets.
Native to Vim so no need for plugins
Typically expand on whitespace or <c-]>
Some special rules on trigger text (See :h abbreviations)
Can be used in command mode via :cabbrev (often used to create command aliases)
No placeholders
Dynamic expansions
Conclusion
For the most part snippets are more powerful and provide many features that other editors enjoy, but you can use both, and many people do. Abbreviations enjoy the benefit of being native, which can be useful in remote environments. They also enjoy another clear advantage: they can be used in command mode.
Snippets are more powerful.
Depending on the implementation, snippets can let you change (or accept defaults for) multiple placeholders and can even execute code when the snippet is expanded.
For example with ultisnips, you can have it execute shell commands and Vimscript, but also Python code.
An (ultisnips) example:
snippet hdr "General file header" b
# file: `!v expand('%:t')`
# vim:fileencoding=utf-8:ft=`!v &filetype`
# ${1}
#
# Author: ${2:J. Doe} ${3:<jdoe#gmail.com>}
# Created: `!v strftime("%F %T %z")`
# Last modified: `!v strftime("%F %T %z")`
endsnippet
This presents you with three placeholders to fill in (it gives default values for two of them), and sets the filename, filetype and current date and time.
After the word "snippet", the start line contains three items:
the trigger string,
a description and
options for the snippet.
Personally I mostly use the b option, where the snippet is expanded at the beginning of a line, and the w option, which expands the snippet if the trigger string starts at the beginning of a word.
Note that you have to type the trigger string and then press a key or key combination that actually triggers the expansion. So a snippet is not expanded unless you want it to be.
Additionally, snippets can be specialized by filetype. Suppose you want to define four levels of headings, h1 .. h4. You can have the same name expand differently between e.g. an HTML, Markdown, LaTeX or reStructuredText file.
Snippets are like the built-in :abbreviate on steroids, usually with:
parameter insertions: You can insert (type or select) text fragments in various places inside the snippet. An abbreviation just expands once.
mirroring: Parameters may be repeated (maybe even in transformed fashion) elsewhere in the snippet, usually updated as you type.
multiple stops inside: You can jump from one point to another within the snippet, sometimes even recursively expand snippets within one.
There are three things to evaluate in a snippet plugin: first, the features of the snippet engine itself; second, the quality and breadth of the snippets provided by the author or others; third, how easy it is to add new snippets.
I'm writing a Python script that generates another Python script based on an external file. A small section of my code can be seen below. I haven't been exposed to many examples of these kinds of scripts, so I was wondering what the best practices are.
As seen in the last two lines of the code example, the techniques that I'm using can be unwieldy at times.
SIG_DICT_NAME = "sig_dict"
SIG_LEN_KEYWORD = "len"
SIG_BUS_IND_KEYWORD = "ind"
SIG_EP_ADDR_KEYWORD = "ep_addr"
KEYWORD_DEC = "{} = \"{}\""
SIG_LEN_KEYWORD_DEC = KEYWORD_DEC.format(SIG_LEN_KEYWORD, SIG_LEN_KEYWORD)
SIG_BUS_IND_KEYWORD_DEC = KEYWORD_DEC.format(SIG_BUS_IND_KEYWORD,
                                             SIG_BUS_IND_KEYWORD)
SIG_EP_ADDR_KEYWORD_DEC = KEYWORD_DEC.format(SIG_EP_ADDR_KEYWORD,
                                             SIG_EP_ADDR_KEYWORD)
SIG_DICT_DEC = "{} = dict()"
SIG_DICT_BODY_LINE = "{}[{}.{}] = {{{}:{}, {}:{}, {}:{}}}"
#line1 = SIG_DICT_DEC.format(SIG_DICT_NAME)
#line2 = SIG_DICT_BODY_LINE.format(SIG_DICT_NAME, x, y, z...)
You don't really see examples of this kind of thing because your solution might be a wee bit over-engineered ;)
I'm guessing that you're trying to collect some "state of things", and then you want to run a script to process that "state of things". Rather than writing a meta-script, it is typically far more convenient to write a script that will do the processing (say, process.py), and another script that will do the collecting of the "state of things" (say, collect.py).
Then you can take the results from collect.py and throw them at process.py and write out todays_results.txt or some such:
collect.py -> process.py -> 20150207_results.txt
If needed, you can write intermediate files to disk with something like:
with open('todays_progress.txt', 'w') as f_out:
    for thing, state in states_of_things.items():
        f_out.write('{}<^_^>{}\n'.format(state, thing))
Then you can parse it back in later with something like:
with open('todays_progress.txt') as f_in:
    lines = f_in.read().splitlines()
states, things = zip(*(line.split('<^_^>') for line in lines))
states_of_things = dict(zip(things, states))
More complicated data structures than a flat dict? Well, this is Python. There's probably more than one module for that! Off the top of my head I would suggest json if plaintext will do, or pickle if you need some more detailed structures. Two warnings with pickle: custom objects don't always get reinstantiated well, and it's vulnerable to code injection attacks, so only use it if your entire workflow is trusted.
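For example, with json the whole round trip is two calls. A sketch reusing the states_of_things dict from above, with an arbitrary filename:

import json

# Write the collected state out as plaintext JSON...
with open('todays_progress.json', 'w') as f_out:
    json.dump(states_of_things, f_out)

# ...and read it back in later
with open('todays_progress.json') as f_in:
    states_of_things = json.load(f_in)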
Hope this helps!
You seem to be translating keyword-by-keyword.
It would almost certainly be better to read each "sentence" into a representative Python class; you could then run the simulation directly, or have each class write itself to an "output sentence".
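A minimal sketch of that idea; the class name, fields, and output format are invented for illustration, loosely based on the keywords in your constants:

class Signal(object):
    def __init__(self, name, length, bus_index, ep_addr):
        # One parsed "sentence" from the external file
        self.name = name
        self.length = length
        self.bus_index = bus_index
        self.ep_addr = ep_addr

    def to_line(self):
        # Render this object back out as one line of the generated script
        return 'sig_dict["{}"] = {{"len": {}, "ind": {}, "ep_addr": {}}}'.format(
            self.name, self.length, self.bus_index, self.ep_addr)

print(Signal("spi_clk", 8, 0, 1).to_line())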
Done correctly, this should be much easier to write and debug and produce more idiomatic output.