How to read from a file or a list in Python?

Say you have a piece of code that accepts either a list of names or a file name, and must filter each item of whichever one is provided by applying the same criteria:
import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument('-n', '--name', help='single name', action='append')
group.add_argument('-N', '--names', help='text file of names')
args = parser.parse_args()

results = []
if args.name:
    # We are dealing with a list.
    for name in args.name:
        name = name.strip().lower()
        if name not in results and len(name) > 6:
            results.append(name)
else:
    # We are dealing with a file name.
    with open(args.names) as f:
        for name in f:
            name = name.strip().lower()
            if name not in results and len(name) > 6:
                results.append(name)
I'd like to remove as much redundancy as possible in the above code. I tried creating the following function for strip and lower, but it didn't remove much repeated code:
def getFilteredName(name):
    return name.strip().lower()
Is there any way to iterate over both a list and a file in the same function? How should I go about reducing as much code as possible?

You have duplicate code that you can simplify: lists and file objects are both iterables. If you write one function that takes an iterable and returns the correct output, you have less code duplication (DRY).
Choice of data structure:
You do not want duplicate items, which means set() or dict() are better suited to collect the data you want to parse: they eliminate duplicates by design, which is also faster than checking whether an item is already in a list.
- If the order of names matters, use an OrderedDict from collections on Python 3.6 or earlier, or a normal dict on 3.7 or later (dicts guarantee insertion order); more info: Are dictionaries ordered in Python 3.6+?
- If name order is not important, use a set().
Either one of the above choices removes duplicates for you; a short sketch of the idea follows, then the full example.
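As a quick illustration (a minimal sketch with made-up names, independent of the argparse handling), dict.fromkeys() deduplicates while preserving insertion order on Python 3.7+:

names = ["Alice", "bob", "ALICE", "Charlie", "bob"]

# dict keys are unique and (on 3.7+) keep insertion order,
# so this removes duplicates while preserving order:
unique = list(dict.fromkeys(n.strip().lower() for n in names))
print(unique)  # ['alice', 'bob', 'charlie']

Applying the same idea to your program: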
import argparse
from collections import OrderedDict  # use a normal dict on 3.7+, it keeps insertion order


def get_names(args):
    """Collects all unique, lower-cased names that are longer than
    6 characters, from either args.name or the file args.names."""
    seen = OrderedDict()  # or dict or set

    def add_names(iterable):
        """Takes care of adding the stuff to your return collection."""
        k = [n.strip().lower() for n in iterable]  # do the strip().lower()ing only once
        # using a generator expression to update - use .add() for a set()
        seen.update((n, None) for n in k if len(n) > 6)

    if args.name:
        # We are dealing with a list:
        add_names(args.name)
    elif args.names:
        # We are dealing with a file name:
        with open(args.names) as f:
            add_names(f)

    # return as list
    return list(seen)
Test code:
parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument('-n', '--name', help='single name', action='append')
group.add_argument('-N', '--names', help='text file of names')
args = parser.parse_args()

results = get_names(args)
print(results)
Output for -n Joh3333n -n Ji3333m -n joh3333n -n Bo3333b -n bo3333b -n jim:
['joh3333n', 'ji3333m', 'bo3333b']
Input file:
with open("names.txt","w") as names:
for n in ["a"*k for k in range(1,10)]:
names.write( f"{n}\n")
Output for -N names.txt:
['aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa']

Subclass list and make the subclass a context manager:
class F(list):
    def __enter__(self):
        return self

    def __exit__(self, *args, **kwargs):
        pass
Then the conditional can decide what to iterate over:
if args.name:
    # We are dealing with a list.
    thing = F(args.name)
else:
    # We are dealing with a file name.
    thing = open(args.names)
And the iteration code can be factored out.
results = []
with thing as f:
    for name in f:
        name = name.strip().lower()
        if name not in results and len(name) > 6:
            results.append(name)
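If you are on Python 3.7 or later, contextlib.nullcontext gives you the same no-op context manager without the subclass; a minimal sketch of the same idea:

from contextlib import nullcontext

if args.name:
    # nullcontext(x) is a context manager whose __enter__ returns x
    # and whose __exit__ does nothing, so the list passes straight through.
    thing = nullcontext(args.name)
else:
    thing = open(args.names)

results = []
with thing as f:
    for name in f:
        name = name.strip().lower()
        if name not in results and len(name) > 6:
            results.append(name)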
Here is a similar solution that makes an io.StringIO object from either the file or the list, then uses a single set of instructions to process it.
import io

if args.name:
    # We are dealing with a list.
    f = io.StringIO('\n'.join(args.name))
else:
    # We are dealing with a file name.
    with open(args.names) as fileobj:
        f = io.StringIO(fileobj.read())

results = []
for name in f:
    name = name.strip().lower()
    if name not in results and len(name) > 6:
        results.append(name)
If the file is huge and memory is scarce, this has the disadvantage of reading the entire file into memory.
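A sketch of one way around that, assuming the same args as above: bind the iterable directly, and only open (and later close) the file when a file name was given, so the file is consumed line by line instead of being copied into memory:

if args.name:
    source = args.name          # a list is already an iterable of names
    close_after = None
else:
    source = open(args.names)   # iterate the file lazily, line by line
    close_after = source

results = []
for name in source:
    name = name.strip().lower()
    if name not in results and len(name) > 6:
        results.append(name)

if close_after is not None:
    close_after.close()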

Related

Path inside the dictionary from the variable

I have this code:
import json

def replaceJSONFilesList(JSONFilePath, JSONsDataPath, newJSONData):
    JSONFileHandleOpen = open(JSONFilePath, 'r')
    ReadedJSONObjects = json.load(JSONFileHandleOpen)
    JSONFileHandleOpen.close()

    ReadedJSONObjectsModifyingSector = ReadedJSONObjects[JSONsDataPath]
    for newData in newJSONData:
        ReadedJSONObjectsModifyingSector.append(newData)

    JSONFileHandleWrite = open(JSONFilePath, 'w')
    json.dump(ReadedJSONObjects, JSONFileHandleWrite)
    JSONFileHandleWrite.close()

def modifyJSONFile(Path):
    JSONFilePath = '/path/file'
    JSONsDataPath = "['first']['second']"
    newJSONData = 'somedata'
    replaceJSONFilesList(JSONFilePath, JSONsDataPath, newJSONData)
Now I have an error:
KeyError: "['first']['second']"
But if I try:
ReadedJSONObjectsModifyingSector = ReadedJSONObjects['first']['second']
Everything is okay.
How should I pass the path to the list inside the JSON dictionary from one function to the other?
You cannot pass language syntax elements as if they were data strings. Similarly, you could not pass the string "2 > 1 and False", and expect the function to be able to insert that into an if condition.
Instead, extract the data items and pass them as separate strings (which matches their syntax in the calling routine), or as a tuple of strings. For instance:
JSONsDataPath = ('first', 'second')
...
Then, inside the function ...
ReadedJSONObjects[JSONsDataPath[0]][JSONsDataPath[1]]
If you have a variable sequence of indices, then you need to write code to handle that case; research that on Stack Overflow.
The iterative way to handle an unknown quantity of indices is like this:
obj = ReadedJSONObjects
for index in JSONsDataPath:
    obj = obj[index]
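If you prefer a functional one-liner, functools.reduce with operator.getitem performs the same walk; a minimal sketch (the nested_data and path values are illustrative):

from functools import reduce
from operator import getitem

nested_data = {'first': {'second': [1, 2, 3]}}
path = ('first', 'second')

# Walks the nested dict one key at a time:
# getitem(getitem(nested_data, 'first'), 'second')
target = reduce(getitem, path, nested_data)
print(target)  # [1, 2, 3]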

Sorting nested list

I'm trying to sort my list, which consists of 3 nested lists: paths, file names, and finally file creation times. I want to sort them to be able to get the latest files.
I've seen people using lambda for this, but I don't feel comfortable with it and don't quite understand how the sorting works with it.
I think the best way is just to swap the list components, but this does not work:
class FILE:
    PATH = 0
    NAME = 1
    DATE = 2

mayaFiles = [[], [], []]
mayaFiles[FILE.DATE] = [0, 56, 3, 12, 7, 35, 16]

doSwitch = True
while (doSwitch):
    for ma in range(0, len(mayaFiles[FILE.DATE]) - 1):
        doSwitch = False
        doSwitch = mayaFiles[FILE.DATE][ma] > mayaFiles[FILE.DATE][ma + 1]
        hi = mayaFiles[FILE.DATE][ma]
        lo = mayaFiles[FILE.DATE][ma + 1]
        if doSwitch:
            mayaFiles[FILE.DATE][ma] = lo
            mayaFiles[FILE.DATE][ma + 1] = hi
        else:
            break

print mayaFiles[FILE.DATE]
Assuming these lists are already aligned, you'll have a much easier time by combining the three separate lists into a list of tuples arranged by your sort order. The namedtuple construct in the collections module is great for this sort of thing. I'm assuming you can get your data into three lists: paths, dates and names. I'm supplying some dummy data here so you can see what I'm assuming.
names = "a.ma", "b.ma", "c.ma", "d.ma"
paths = "c:/test", "c/test", "c:/other", "d:/extra"
dates = "17-01-01", "16-01-01", "17-02-01", "17-06-30"
# this creates a namedtuple, which is a
# mini-class with named fields that otherwise
# works like a tuple
from collections import namedtuple
record = namedtuple("filerecord", "date name path")
# in real use this should be a list comp
# but this is easier to read:
records = []
for date, name, path in zip(dates, names, paths):
records.append(record(date, name, path))
records.sort(reverse=True)
for item in records:
print item
# filerecord(date='17-06-30', name='d.ma', path='d:/extra')
# filerecord(date='17-02-01', name='c.ma', path='c:/other')
# filerecord(date='17-01-01', name='a.ma', path='c:/test')
# filerecord(date='16-01-01', name='b.ma', path='c/test')
You could sort on other fields using the 'key' argument to sort():
records.sort(key=lambda k: k.name)
for item in records:
    print item
# filerecord(date='17-01-01', name='a.ma', path='c:/test')
# filerecord(date='16-01-01', name='b.ma', path='c/test')
# filerecord(date='17-02-01', name='c.ma', path='c:/other')
# filerecord(date='17-06-30', name='d.ma', path='d:/extra')
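If you would rather avoid lambdas altogether, operator.attrgetter builds the same kind of key function; a minimal sketch reusing the records list from above:

from operator import attrgetter

# Same as key=lambda k: k.name, but without writing a lambda:
records.sort(key=attrgetter("name"))

# attrgetter also accepts several attribute names for a compound key,
# e.g. newest date first, then name:
records.sort(key=attrgetter("date", "name"), reverse=True)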

How to change a python dictionary within a function?

I'm running into an issue trying to get my dictionary to change within a function without returning anything. Here is my code:
def load_twitter_dicts_from_file(filename, emoticons_to_ids, ids_to_emoticons):
    in_file = open(filename, 'r')
    emoticons_to_ids = {}
    ids_to_emoticons = {}
    for line in in_file:
        data = line.split()
        if len(data) > 0:
            emoticon = data[0].strip('"')
            id = data[2].strip('"')
            if emoticon not in emoticons_to_ids:
                emoticons_to_ids[emoticon] = []
            if id not in ids_to_emoticons:
                ids_to_emoticons[id] = []
            emoticons_to_ids[emoticon].append(id)
            ids_to_emoticons[id].append(emoticon)
Basically, what I'm trying to do is pass in two dictionaries and fill them with information from the file, which works out fine, but after I call it in main and try to print the two dictionaries, they are empty. Any ideas?
def load_twitter_dicts_from_file(filename, emoticons_to_ids, ids_to_emoticons):
    …
    emoticons_to_ids = {}
    ids_to_emoticons = {}
These two lines replace whatever you pass to the function. So if you passed two dictionaries to the function, those dictionaries are never touched. Instead, you create two new dictionaries which are never passed to the outside.
If you want to mutate the dictionaries you pass to the function, then remove those two lines and create the dictionaries first.
Alternatively, you could also return those two dictionaries from the function at the end:
return emoticons_to_ids, ids_to_emoticons
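To make the difference concrete, here is a minimal sketch (the fill_in_place and fill_and_return names, and the data, are just illustrative stand-ins for the real loader):

def fill_in_place(d):
    # Mutates the dict the caller passed in, so the caller sees the change.
    d["smile"] = ["12345"]

def fill_and_return():
    # Builds a new dict; the caller must capture the return value.
    return {"smile": ["12345"]}

emoticons_to_ids = {}
fill_in_place(emoticons_to_ids)
print(emoticons_to_ids)        # {'smile': ['12345']}

emoticons_to_ids = fill_and_return()
print(emoticons_to_ids)        # {'smile': ['12345']}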

How do I avoid overriding a key from a global dictionary in python?

I'm using the command line arguments as user input...
from datetime import timedelta, datetime
import csv, argparse
from collections import defaultdict

parser = argparse.ArgumentParser()
parser.add_argument("-p", dest='prodfile', action="append", help="file names for prod")
args = parser.parse_args()

files_d = {}
files_d[""] = []

if args.testfile:
    testfile = args.testfile
    type_file = "test"
    files_d[type_file] = testfile

if args.prodfile:
    prodfile = args.prodfile
    type_file = "prod"
    files_d[type_file] = prodfile

print files_d
How do I avoid overriding [type_file] in the dictionary?
EDIT: The core question is how to accumulate a list of values that share the same key (in this case, type_file is the repeated key, with testfile and prodfile both needing to be accumulated in a list).
The issue is that dictionary keys are unique, so any subsequent assignment to the same key will replace rather than add to the dict value. What you need is for the key to map to a list of matching values.
A collections.defaultdict with a list() factory function should meet your needs:
from collections import defaultdict  # <== new code

files_d = defaultdict(list)          # <== new code
files_d[""] = []

if args.testfile:
    testfile = args.testfile
    type_file = "test"
    files_d[type_file].append(testfile)  # <== new code

if args.prodfile:
    prodfile = args.prodfile
    type_file = "prod"
    files_d[type_file].append(prodfile)  # <== new code
Good luck. Hope this solves your issue regarding [type_file] overriding values (replacing them) instead of appending to a list of matching values.
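If you would rather keep a plain dict, dict.setdefault gives the same accumulate-into-a-list behaviour; a minimal sketch (the file names are made up):

files_d = {}

# setdefault returns the list already stored under the key,
# or inserts (and returns) a fresh empty list first.
files_d.setdefault("test", []).append("test_run1.csv")
files_d.setdefault("test", []).append("test_run2.csv")
files_d.setdefault("prod", []).append("prod_run1.csv")

print(files_d)  # {'test': ['test_run1.csv', 'test_run2.csv'], 'prod': ['prod_run1.csv']}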
You can simply do it as follows:
if args.testfile:
    testfile = args.testfile
    type_file = "test"
    if files_d.get(type_file):
        files_d[type_file].append(testfile)
    else:
        files_d[type_file] = [testfile]

if args.prodfile:
    prodfile = args.prodfile
    type_file = "prod"
    if files_d.get(type_file):
        files_d[type_file].append(prodfile)
    else:
        files_d[type_file] = [prodfile]

print files_d

Using a function to make a dict won't work, but outside the function does

I have a problem that I can't seem to find and fix.
FASTA =
>header1
ATCGATCGATCCCGATCGACATCAGCATCGACTAC
ATCGACTCAAGCATCAGCTACGACTCGACTGACTACGACTCGCT
>header2
ATCGATCGCATCGACTACGACTACGACTACGCTTCGTATCAGCATCAGCT
ATCAGCATCGACGACGACTAGCACTACGACTACGACGATCCCGATCGATCAGCT
def dnaSequence():
    '''
    This function makes a dict called DNAseq by reading the fasta file
    given as first argument on the command line
    INPUT: Fasta file containing strings
    OUTPUT: key is header and value is sequence
    '''
    DNAseq = {}
    for line in FASTA:
        line = line.strip()
        if line.startswith('>'):
            header = line
            DNAseq[header] = ""
        else:
            seq = line
            DNAseq[header] = seq
    return DNAseq

def digestFragmentsWithOneEnzyme(dnaSequence):
    '''
    This function digests the sequence from DNAseq into smaller parts
    by using the enzymes listed in the MODES.
    INPUT: DNAseq and the enzymes from sys.argv[2:]
    OUTPUT: The DNAseq is updated with the segments gained from the
    digesting
    '''
    enzymes = sys.argv[2:]
    updated_list = []
    for enzyme in enzymes:
        pattern = MODES(enzyme)
        p = re.compile(pattern)
        for dna in DNAseq.keys():
            matchlist = re.findall(p, dna)
            updated_list = re.split(MODES, DNAseq)
            DNAseq.update((key, updated_list.index(k)) for key in
                          d.iterkeys())
    return DNAseq

def getMolecularWeight(dnaSequence):
    '''
    This function calculates the molWeight of the sequence in DNAseq
    INPUT: the updated DNAseq from the previous function as a dict
    OUTPUT: The DNAseq is updated with the molweight of the digested fragments
    '''
    results = []
    for seq in DNAseq.keys():
        results = sum((dnaMass[base]) for base in DNAseq[seq])
        DNAseq.update((key, results.index(k)) for key in
                      d.iterkeys())
    return DNAseq

def main(argv=None):
    '''
    This function prints the results of the digested DNA sequence in the terminal.
    INPUT: The DNAseq from the previous function as a dict
    OUTPUT: name weight weight weight
            name2 weight weight weight
    '''
    if argv == None:
        argv = sys.argv
    if len(argv) < 2:
        usage()
        return 1
    digestFragmentsWithOneEnzyme(dnaSequence())
    Genes = getMolecularWeight(digestFragmentsWithOneEnzyme())
    print ({header},{seq}).format(**DNAseq)
    return 0

if __name__ == '__main__':
    sys.exit(main())
In the first function I'm trying to make a dict from the fasta file, using the same dict in the second function where the sequences are being sliced by regex and finally the molweight is being calculated.
My problem is that for some reason Python doesn't recognize my dict and I get an error:
NameError: DNAseq is not defined
If I make the dict outside of the function then I do have the dict.
You're passing the dict to both functions as dnaSequence, not DNAseq.
Note this is a very strange way of calling functions. You completely ignore the result of the first call to digestFragmentsWithOneEnzyme when you pass the sequence to it, then try to call it again to pass the result to getMolecularWeight, but you fail to actually pass the sequence in that call, so that would error if you got that far.
I think what you are trying to do is this:
sequence = dnaSequence()
fragments = digestFragmentsWithOneEnzyme(sequence)
genes = getMolecularWeight(fragments)
and you should avoid giving the parameter of both functions the same name as a separate function, as that will shadow the function name. Instead, choose a new name:
def digestFragmentsWithOneEnzyme(sequence):
    ...
    for dna in sequence:
(you don't need to call keys() - iterating over a dict is always over the keys.)
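Finally, a minimal sketch of the first function written with a renamed parameter and an explicit return value; the read_fasta name and the example file name are illustrative, and note that '+=' accumulates multi-line sequences instead of overwriting them:

def read_fasta(lines):
    """Map each '>header' line to its (possibly multi-line) sequence."""
    sequences = {}
    header = None
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            header = line
            sequences[header] = ""
        elif header is not None:
            sequences[header] += line   # accumulate instead of overwriting
    return sequences

with open("example.fasta") as handle:   # file name is illustrative
    dna_sequences = read_fasta(handle)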
