I'm parsing a string that doesn't have a delimiter but does have specific indexes where fields start and stop. Here's my list comprehension to generate a list from the string:
field_breaks = [(0,2), (2,10), (10,13), (13, 21), (21, 32), (32, 43), (43, 51), (51, 54), (54, 55), (55, 57), (57, 61), (61, 63), (63, 113), (113, 163), (163, 213), (213, 238), (238, 240), (240, 250), (250, 300)]
s = '4100100297LICACTIVE 09-JUN-198131-DEC-2010P0 Y12490227WYVERN RESTAURANTS INC 1351 HEALDSBURG AVE HEALDSBURG CA95448 ROUND TABLE PIZZA 575 W COLLEGE AVE STE 201 SANTA ROSA CA95401 '
data = [s[x[0]:x[1]].strip() for x in field_breaks]
Any recommendation on how to improve this?
You can cut your field_breaks list in half by doing:
field_breaks = [0, 2, 10, 13, 21, 32, 43, ..., 250, 300]
s = ...
data = [s[x[0]:x[1]].strip() for x in zip(field_breaks[:-1], field_breaks[1:])]
You can use tuple unpacking for cleaner code:
data = [s[a:b].strip() for a,b in field_breaks]
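Combining the two suggestions (a sketch, reusing the boundaries from the question's tuple list):

# Flat boundary list (halved, as suggested above) plus tuple unpacking in zip.
field_breaks = [0, 2, 10, 13, 21, 32, 43, 51, 54, 55, 57, 61, 63,
                113, 163, 213, 238, 240, 250, 300]
data = [s[a:b].strip() for a, b in zip(field_breaks, field_breaks[1:])]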
To be honest, I don't find the parse-by-column-number approach very readable, and I question its maintainability (off by one errors and the like). Though I'm sure the list comprehensions are very virtuous and efficient in this case, and the suggested zip-based solution has a nice functional tweak to it.
Instead, I'm going to throw softballs from out here in left field, since list comprehensions are supposed to be in part about making your code more declarative. For something completely different, consider the following approach based on the pyparsing module:
from pyparsing import Word, Combine, Literal, alphas, nums

def Fixed(chars, width):
    return Word(chars, exact=width)

myDate = Combine(Fixed(nums, 2) + Literal('-') + Fixed(alphas, 3) + Literal('-')
                 + Fixed(nums, 4))

fullRow = (Fixed(nums, 2) + Fixed(nums, 8) + Fixed(alphas, 3) + Fixed(alphas, 8)
           + myDate + myDate + ...)
data = fullRow.parseString(s)
# should be ['41', '00100297', 'LIC', 'ACTIVE ',
# '09-JUN-1981', '31-DEC-2010', ...]
To make this even more declarative, you could name each of the fields as you come across them. I have no idea what the fields actually are, but something like:
someId = Fixed(nums,2)
someOtherId = Fixed(nums,8)
recordType = Fixed(alphas,3)
recordStatus = Fixed(alphas,8)
birthDate = myDate
issueDate = myDate
fullRow = (someId + someOtherId + recordType + recordStatus
           + birthDate + issueDate + ...)
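As a possible refinement (a sketch; the field names are only illustrative), pyparsing also supports results names, so parsed fields can be read back by name instead of position:

# Attach results names so the ParseResults can be indexed like a dict.
someId = Fixed(nums, 2).setResultsName('someId')
recordType = Fixed(alphas, 3).setResultsName('recordType')
# ... name the remaining fields the same way, then rebuild fullRow from them ...
# After data = fullRow.parseString(s), data['recordType'] should be 'LIC'.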
Now an approach like this probably isn't going to break any land speed records. But, holy cow, wouldn't you find this easier to read and maintain?
Here is a way using map (note that str.__getslice__ only exists in Python 2):
data = map(s.__getslice__, *zip(*field_breaks))
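On Python 3, where str.__getslice__ is gone, a comparable one-liner (a sketch) can build slice objects and hand them to operator.itemgetter:

# Python 3 variant: itemgetter accepts slice objects, so one call pulls out
# every field, which we then strip as before.
from operator import itemgetter

slices = [slice(a, b) for a, b in field_breaks]
data = [field.strip() for field in itemgetter(*slices)(s)]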
The classic knapsack problem asks for the most valuable set of items that fits inside a knapsack with a limited weight capacity.
I am trying to get the least valuable items instead.
The following code uses recursive dynamic programming and comes from Rosetta Code: http://rosettacode.org/wiki/Knapsack_problem/0-1#Recursive_dynamic_programming_algorithm
def total_value(items, max_weight):
    return sum([x[2] for x in items]) if sum([x[1] for x in items]) <= max_weight else 0

cache = {}

def solve(items, max_weight):
    if not items:
        return ()
    if (items, max_weight) not in cache:
        head = items[0]
        tail = items[1:]
        include = (head,) + solve(tail, max_weight - head[1])
        dont_include = solve(tail, max_weight)
        if total_value(include, max_weight) > total_value(dont_include, max_weight):
            answer = include
        else:
            answer = dont_include
        cache[(items, max_weight)] = answer
    return cache[(items, max_weight)]

items = (
    ("map", 9, 150), ("compass", 13, 35), ("water", 153, 200), ("sandwich", 50, 160),
    ("glucose", 15, 60), ("tin", 68, 45), ("banana", 27, 60), ("apple", 39, 40),
    ("cheese", 23, 30), ("beer", 52, 10), ("suntan cream", 11, 70), ("camera", 32, 30),
    ("t-shirt", 24, 15), ("trousers", 48, 10), ("umbrella", 73, 40),
    ("waterproof trousers", 42, 70), ("waterproof overclothes", 43, 75),
    ("note-case", 22, 80), ("sunglasses", 7, 20), ("towel", 18, 12),
    ("socks", 4, 50), ("book", 30, 10),
)
max_weight = 400

solution = solve(items, max_weight)
print "items:"
for x in solution:
    print x[0]
print "value:", total_value(solution, max_weight)
print "weight:", sum([x[1] for x in solution])
I have been trying to figure out how I can get the least valuable items, searching the internet with no luck, so maybe somebody can help me with that.
I really appreciate your help in advance.
I'll try my best to guide you through what should be done to achieve this.
In order to change this code so that it finds the least valuable items with which you can fill the bag, write a function that:
1. Takes the most valuable items (solution in your code) as its input.
2. Finds the items you will be leaving behind (I'll call them least_items).
3. Checks whether the total weight of the items in least_items is greater than max_weight.
4. If yes, finds the most valuable items in least_items and removes them from least_items. This is where you will have to start some sort of recursion to keep separating the least valuable from the most valuable.
5. If no, that means you could fill your knapsack with more items, so you have to go back to the most valuable items you had and keep looking for the least valuable items until you fill the knapsack. Again, some sort of recursion will have to be started.
Take note that you will also have to include a terminating step so that the program stops when it has found the best solution (a rough sketch of the first few steps follows this list).
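To get you started, here is a rough, untested sketch of steps 1-4 (the helper name least_valuable_fill is mine, not part of your code); the refill logic of step 5 is left out:

def least_valuable_fill(items, max_weight):
    best = solve(items, max_weight)                     # step 1: most valuable packing
    left_behind = [x for x in items if x not in best]   # step 2: items left out
    left_behind.sort(key=lambda x: x[2])                # cheapest first
    # steps 3-4: drop the most valuable leftovers until the rest fits
    while left_behind and sum(x[1] for x in left_behind) > max_weight:
        left_behind.pop()                               # removes the priciest item
    return tuple(left_behind)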
This is not the best solution you could come up with, though. I tried finding something better myself, but unfortunately it demands more time than I thought. Feel free to leave any problems in the comments; I'll be happy to help.
Hope this helps.
I have a list with weekly figures and need to obtain the grouped totals by month.
The following code does the job, but there should be a more Pythonic way of doing it using the standard libraries.
The drawback of the code below is that the list needs to be in sorted order.
#Test data (not sorted)
sum_weekly=[('2020/01/05', 59), ('2020/01/19', 88), ('2020/01/26', 95), ('2020/02/02', 89),
('2020/02/09', 113), ('2020/02/16', 90), ('2020/02/23', 68), ('2020/03/01', 74), ('2020/03/08', 85),
('2020/04/19', 6), ('2020/04/26', 5), ('2020/05/03', 14),
('2020/05/10', 5), ('2020/05/17', 20), ('2020/05/24', 28),('2020/03/15', 56), ('2020/03/29', 5), ('2020/04/12', 2),]
month = sum_weekly[0][0].split('/')[1]
count=0
out=[]
for item in sum_weekly:
    m_sel = item[0].split('/')[1]
    if m_sel != month:
        out.append((month, count))
        count = item[1]
    else:
        count += item[1]
    month = m_sel
out.append((month, count))
# monthly sums output as ('01', 242), ('02', 360), ('03', 220), ('04', 13), ('05', 67)
print (out)
You could use defaultdict to store the result instead of a list. The keys of the dictionary would be the months and you can simply add the values with the same month (key).
Possible implementation:
# Test Data
from collections import defaultdict
sum_weekly = [('2020/01/05', 59), ('2020/01/19', 88), ('2020/01/26', 95), ('2020/02/02', 89),
('2020/02/09', 113), ('2020/02/16', 90), ('2020/02/23', 68), ('2020/03/01', 74), ('2020/03/08', 85),
('2020/03/15', 56), ('2020/03/29', 5), ('2020/04/12', 2), ('2020/04/19', 6), ('2020/04/26', 5),
('2020/05/03', 14),
('2020/05/10', 5), ('2020/05/17', 20), ('2020/05/24', 28)]
results = defaultdict(int)
for date, count in sum_weekly:  # unpacking makes the loop clearer
    month = date.split('/')[1]
    # because results is a defaultdict, a missing key is created
    # automatically and initialized to zero
    results[month] += count
print(results)
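As a small optional follow-up, the defaultdict converts straight back to the (month, total) tuples the question prints:

out = sorted(results.items())
print(out)  # [('01', 242), ('02', 360), ('03', 220), ('04', 13), ('05', 67)]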
You can use itertools.groupby (it is part of the standard library); it does pretty much what you did under the hood, grouping together runs of elements for which the key function gives the same output. It can look like the following:
import itertools

def select_month(item):
    return item[0].split('/')[1]

def get_value(item):
    return item[1]

result = [(month, sum(map(get_value, group)))
          for month, group in itertools.groupby(sorted(sum_weekly), select_month)]
print(result)
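A small variation (a sketch) is to sort by the same key the groupby uses, so only the month part of the date matters for the grouping:

sum_weekly_by_month = sorted(sum_weekly, key=select_month)
result = [(month, sum(map(get_value, group)))
          for month, group in itertools.groupby(sum_weekly_by_month, select_month)]
print(result)  # expected: [('01', 242), ('02', 360), ('03', 220), ('04', 13), ('05', 67)]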
Terse, but maybe not that pythonic:
import calendar, functools, collections, datetime
{calendar.month_name[i]: val for i, val in functools.reduce(lambda a, b: a + b, [collections.Counter({datetime.datetime.strptime(time, '%Y/%m/%d').month: val}) for time, val in sum_weekly]).items()}
A method using PySpark:
from pyspark import SparkContext
sc = SparkContext()
l = sc.parallelize(sum_weekly)
r = l.map(lambda x: (x[0].split("/")[1], x[1])).reduceByKey(lambda p, q: (p + q)).collect()
print(r) #[('04', 13), ('02', 360), ('01', 242), ('03', 220), ('05', 67)]
You can accomplish this with a pandas DataFrame: first isolate the month, then group by it and sum.
import pandas as pd
sum_weekly=[('2020/01/05', 59), ('2020/01/19', 88), ('2020/01/26', 95), ('2020/02/02', 89), ('2020/02/09', 113), ('2020/02/16', 90), ('2020/02/23', 68), ('2020/03/01', 74), ('2020/03/08', 85), ('2020/04/19', 6), ('2020/04/26', 5), ('2020/05/03', 14), ('2020/05/10', 5), ('2020/05/17', 20), ('2020/05/24', 28),('2020/03/15', 56), ('2020/03/29', 5), ('2020/04/12', 2)]
df= pd.DataFrame(sum_weekly)
df.columns=['Date','Sum']
df['Month'] = df['Date'].str.split('/').str[1]
print(df.groupby('Month').sum())
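Depending on your pandas version, .sum() may also try to aggregate the non-numeric Date column, so selecting the numeric column first is a safer variant (a small sketch):

# Aggregate only the 'Sum' column after grouping by month.
print(df.groupby('Month')['Sum'].sum())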
I have a big list in Python; here is a small example:
['GAATTCCTTGAGGCCTAAATGCATCGGGGTGCTCTGGTTTTGTTGTTGTTATTTCTGAATGACATTTACTTTGGTGCTCTTTATTTTGCGTATTTAAAAC', 'TAAGTCCCTAAGCATATATATAATCATGAGTAGTTGTGGGGAAAATAACACCATTAAATGTACCAAAACAAAAGACCGATCACAAACACTGCCGATGTTTCTCTGGCTTAAATTAAATGTATATACAACTTATATGATAAAATACTGGGC']
I want to make a new list in which every string is converted to a list of tuples. In fact, I want to divide the length of each string by 10: the 1st tuple would be (1, 10), the 2nd tuple (10, 20), and so on until the end, depending on the length of the string. In the end, every string becomes a list of tuples, and finally I would have a list of lists.
In the small example, the 1st string has 100 characters and the 2nd string has 150 characters.
For example, the expected output for the small example would be:
new_list = [[(1, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)], [(1, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100), (100, 110), (110, 120), (120, 130), (130, 140), (140, 150)]]
To make such a list I wrote the following code, but it does not return what I expect. Do you know how to fix it?
mylist = []
valTup = list()
for count, char in enumerate(mylist):
    if count % 10 == 0 and count > 0:
        valTup.append(count)
    else:
        new_list.append(tuple(valTup))
I recommend using the boltons package: boltons.iterutils.chunked_iter(src, size) returns pieces of the source iterable in size-sized chunks (this example was copied from the docs):
>>> list(chunked_iter(range(10), 3))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
Example:
from boltons.iterutils import chunked_iter
adn = [
'GAATTCCTTGAGGCCTAAATGCATCGGGGTGCTCTGGTTTTGTTGTTGTTATTTCTGAATGACATTTACTTTGGTGCTCTTTATTTTGCGTATTTAAAAC',
'TAAGTCCCTAAGCATATATATAATCATGAGTAGTTGTGGGGAAAATAACACCATTAAATGTACCAAAACAAAAGACCGATCACAAACACTGCCGATGTTTCTCTGGCTTAAATTAAATGTATATACAACTTATATGATAAAATACTGGGC'
]
result = []
for s in adn:
result.append(list(chunked_iter(list(s), 10)))
print(result)
I suggest the following solutions: the first one is based on your code, the second one takes only one line, and the third one, which is my preferred solution, is based on range(), zip() and slicing:
mylist = ['GAATTCCTTGAGGCCTAAATGCATCGGGGTGCTCTGGTTTTGTTGTTGTTATTTCTGAATGACATTTACTTTGGTGCTCTTTATTTTGCGTATTTAAAAC',
'TAAGTCCCTAAGCATATATATAATCATGAGTAGTTGTGGGGAAAATAACACCATTAAATGTACCAAAACAAAAGACCGATCACAAACACTGCCGATGTTTCTCTGGCTTAAATTAAATGTATATACAACTTATATGATAAAATACTGGGC']
# Here is the solution based on your code
resultlist = []
for s in mylist:
    valTup = []
    for count, char in enumerate(s, 1):
        if count % 10 == 0:
            valTup.append((count - 10, count))
    resultlist.append(valTup)
print(resultlist)
# Here is the one-line style solution
resultlist = [[(n-10, n) for n,char in enumerate(s, 1) if n % 10 == 0] for s in mylist]
print(resultlist)
# Here is my preferred solution
resultlist = []
for s in mylist:
    temp = range(1 + len(s))[::10]
    resultlist.append(list(zip(temp[:-1], temp[1:])))
print(resultlist)
Are you looking for something like this?
mylist = ['GAATTCCTTGAGGCCTAAATGCATCGGGGTGCTCTGGTTTTGTTGTTGTTATTTCTGAATGACATTTACTTTGGTGCTCTTTATTTTGCGTATTTAAAAC', 'TAAGTCCCTAAGCATATATATAATCATGAGTAGTTGTGGGGAAAATAACACCATTAAATGTACCAAAACAAAAGACCGATCACAAACACTGCCGATGTTTCTCTGGCTTAAATTAAATGTATATACAACTTATATGATAAAATACTGGGC']
new_list1 = list()
new_list2 = list()
for i in range(len(mylist[0]) // 10):
    if 10 + i*10 <= len(mylist[0]):
        new_list1.append(mylist[0][0+i*10:10+i*10])
    else:
        new_list1.append(mylist[0][0+i*10:])

for i in range(len(mylist[1]) // 10):
    if 10 + i*10 <= len(mylist[1]):
        new_list2.append(mylist[1][0+i*10:10+i*10])
    else:
        new_list2.append(mylist[1][0+i*10:])
new_list = [new_list1,new_list2]
[['GAATTCCTTG', 'AGGCCTAAAT', 'GCATCGGGGT', 'GCTCTGGTTT',
'TGTTGTTGTT', 'ATTTCTGAAT', 'GACATTTACT', 'TTGGTGCTCT',
'TTATTTTGCG', 'TATTTAAAAC'], ['TAAGTCCCTA', 'AGCATATATA',
'TAATCATGAG', 'TAGTTGTGGG', 'GAAAATAACA', 'CCATTAAATG',
'TACCAAAACA', 'AAAGACCGAT', 'CACAAACACT', 'GCCGATGTTT',
'CTCTGGCTTA', 'AATTAAATGT', 'ATATACAACT', 'TATATGATAA',
'AATACTGGGC']]
Sorry for the broad title, I just do not know how to name this.
I have a list of integers, let's say:
X = [20, 30, 40, 50, 60, 70, 80, 100]
And a second list of tuples of size 2 to 6 made from these integers:
Y = [(20, 30), (40, 50, 80, 100), (100, 100, 100), ...]
Some of the numbers come back quite often in Y, and I'd like to identify the groups of integers that come back together often.
Right now, I'm counting the number of appearances of each integer. That gives me some information, but nothing about the groups.
Example:
Y = [(20, 40, 80), (30, 60, 80), (60, 80, 100), (60, 80, 100, 20), (40, 60, 80, 20, 100), ...]
On that example (60, 80) and (60, 80, 100) are combinations which come back often.
I could use itertools.combinations_with_replacement() to generate every combination and then count the number of appearances, but is there a better way to do this?
Thanks.
I don't know if it is a strictly better way or just similar, but you could check the appearance fraction of subsets. Below is a brute-force way of doing so, storing the results in a dictionary. Quite possibly it would be better to build a tree where you don't search a branch if the appearance rate of its elements already did not make the cut (i.e. if (20, 80) does not appear together often enough, then why search for (20, 80, 100)?).
import itertools

N = len(Y)
dicter = {}
for i in range(2, 7):
    for comb in itertools.combinations(X, i):
        c3 = set(comb)
        d3 = sum([c3.issubset(set(val)) for val in Y]) / N
        dicter['{}'.format(c3)] = d3
As an edit: you probably are not interested in all the non-appearances, so I'll throw in a piece of code to chop down the final dictionary size. First we define a function that returns a shallow copy of our dictionary with one key removed. This is required to avoid a RuntimeError when looping over the dict.
def removekey(d, key):
    r = dict(d)
    del r[key]
    return r
Then we remove insignificant "clusters"
for d, v in dicter.items():
    if v < 0.1:
        dicter = removekey(dicter, d)
It will still be unsorted, as itertools and sets do not sort by themselves. Hope this will help you further along.
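If you do want it ordered, you can sort the remaining dictionary entries by their appearance fraction afterwards; a minimal sketch:

# Show the most frequently appearing combinations first.
for combo, frac in sorted(dicter.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(combo, frac)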
The approach that you are looking for is called Frequent Itemset Mining.
It finds frequent subsets, given a list of sets.
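If you want to stay library-free, one possible sketch of that idea is to count only the subsets that actually occur in Y (using the example data from the question), rather than generating every combination of X:

from collections import Counter
from itertools import combinations

Y = [(20, 40, 80), (30, 60, 80), (60, 80, 100), (60, 80, 100, 20), (40, 60, 80, 20, 100)]

counts = Counter()
for row in Y:
    uniq = sorted(set(row))                   # ignore duplicates within a row
    for size in range(2, len(uniq) + 1):
        for sub in combinations(uniq, size):  # every subset that really occurs
            counts[sub] += 1

print(counts.most_common(5))                  # e.g. (60, 80) and (60, 80, 100) rank high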
I would like to sort a list of tuples based on the two last columns:
mylist = [(33, 36, 84),
(34, 37, 656),
(23, 38, 42)]
I know I can do this like:
final = sorted(mylist, key=lambda x: [x[1], x[2]])
Now my problem is that I want to compare the second column of my list with a special condition: if the difference between two numbers is less than an offset they should be taken as equal ( 36 == 37 == 38) and the third column should be used to sort the list. The end result I wish to see is:
mylist = [(23, 38, 42),
          (33, 36, 84),
          (34, 37, 656)]
I was thinking of creating my own integer type and overriding the equal operator. Is this possible? is it overkill? Is there a better way to solve this problem?
I think the easiest way is to create a new class that compares like you want it to:
mylist = [(33, 36, 84),
(34, 37, 656),
(23, 38, 42)]
offset = 2
class Comp(object):
    def __init__(self, tup):
        self.tup = tup

    def __lt__(self, other):  # sorted works even if only __lt__ is implemented.
        # If the second items differ by no more than the offset, compare the third items
        if abs(self.tup[1] - other.tup[1]) <= offset:
            return self.tup[2] < other.tup[2]
        # otherwise compare them as usual
        else:
            return (self.tup[1], self.tup[2]) < (other.tup[1], other.tup[2])
A sample run shows your expected result:
>>> sorted(mylist, key=Comp)
[(23, 38, 42), (33, 36, 84), (34, 37, 656)]
I think it's a bit cleaner than using functools.cmp_to_key but that's a matter of personal preference.
Sometimes an old-style sort based on a cmp function is easier than doing one based on a key. So -- write a cmp function and then use functools.cmp_to_key to convert it to a key:
import functools
def compare(s, t, offset):
    _, y, z = s
    _, u, v = t
    if abs(y - u) > offset:  # use 2nd component
        if y < u:
            return -1
        else:
            return 1
    else:  # use 3rd component
        if z < v:
            return -1
        elif z == v:
            return 0
        else:
            return 1
mylist = [(33, 36, 84),
(34, 37, 656),
(23, 38, 42)]
mylist.sort(key = functools.cmp_to_key(lambda s,t: compare(s,t,2)))
for t in mylist: print(t)
output:
(23, 38, 42)
(33, 36, 84)
(34, 37, 656)
In https://wiki.python.org/moin/HowTo/Sorting look for "The Old Way Using the cmp Parameter". This allows you to write your own comparison function, instead of just setting the key and using comparison operators.
There is a danger to making a sort ordering like this. Look up "strict weak ordering." You could have multiple different valid orderings. This can break other code which assumes there is one correct way to sort things.
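To make the danger concrete, the offset-based equality is not transitive, which is exactly what breaks strict weak ordering; a small illustration with offset = 2:

offset = 2
print(abs(36 - 38) <= offset)  # True: 36 "equals" 38
print(abs(38 - 40) <= offset)  # True: 38 "equals" 40
print(abs(36 - 40) <= offset)  # False: but 36 does not "equal" 40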
Now to actually answer your question:
mylist = [(33, 36, 84),
(34, 37, 656),
(23, 38, 42)]
def custom_sort_term(x, y, offset=2):
    if abs(x - y) <= offset:
        return 0
    return x - y

def custom_sort_function(x, y):
    x1 = x[1]
    y1 = y[1]
    first_comparison_result = custom_sort_term(x1, y1)
    if first_comparison_result:
        return first_comparison_result
    x2 = x[2]
    y2 = y[2]
    return custom_sort_term(x2, y2)
final = sorted(mylist, cmp=custom_sort_function)
print final
[(23, 38, 42), (33, 36, 84), (34, 37, 656)]
Not pretty, but I tried to be general in my interpretation of the OP's problem statement.
I expanded the test case, then applied unsophisticated blunt force:
# test case expanded
mylist = [(33, 6, 104),
(31, 36, 84),
(35, 86, 84),
(30, 9, 4),
(23, 38, 42),
(34, 37, 656),
(33, 88, 8)]
threshld = 2 # different final output can be seen if changed to 1, 3, 30
def collapse(nums, threshld):
    """
    takes a sorted (increasing) list of numbers, nums;
    replaces runs of consecutive nums
    that successively differ by threshld or less
    with the 1st number in each run
    """
    cnums = nums[:]
    cur = nums[0]
    for i in range(len(nums) - 1):
        if (nums[i + 1] - nums[i]) <= threshld:
            cnums[i + 1] = cur
        else:
            cur = cnums[i + 1]
    return cnums
mylists = [list(i) for i in mylist] # change the tuples to lists to modify
indxd=[e + [i] for i, e in enumerate(mylists)] # append the original indexing
#print(*indxd, sep='\n')
im0 = sorted(indxd, key=lambda x: [ x[1]]) # sort by middle number
cns = collapse([i[1] for i in im0], threshld) # then collapse()
#print(cns)
for i in range(len(im0)):                        # overwrite collapsed into im0
    im0[i][1] = cns[i]
#print(*im0, sep='\n')
im1 = sorted(im0, key=lambda x: [ x[1], x[2]]) # now do 2 level sort
#print(*sorted(im0, key=lambda x: [ x[1], x[2]]), sep='\n')
final = [mylist[im1[i][3]] for i in range(len(im1))]  # rebuild using new order
                                                      # of original indices
print(*final, sep='\n')
(33, 6, 104)
(30, 9, 4)
(23, 38, 42)
(31, 36, 84)
(34, 37, 656)
(33, 88, 8)
(35, 86, 84)