I have a list of tuples converted from a dictionary. Starting from the beginning of the list, I want to compare a conditional value against the value in each tuple to see whether it is higher or lower. When this conditional value is lower than a tuple's value, I want to use that specific tuple for further coding.
Can somebody please give me some insight into how this is achieved?
I am relatively new to coding and self-taught. I am not 100% sure the example below would run, but I have tried my best for the sake of demonstration.
`tuple_list = [(12:00:00, £55.50), (13:00:00, £65.50), (14:00:00, £75.50), (15:00:00, £45.50), (16:00:00, £55.50)]
conditional_value = £50
if conditional_value != for x in tuple_list.values()
y = 0
if conditional_value < tuple_list(y)
y++1
else
///"return the relevant value from the tuple_list to use for further coding. I would be
looking to work with £45.50"///`
Thank you.
Just form a new list with a condition:
tuple_list = [("12:00:00", 55.50), ("13:00:00", 65.50), ("14:00:00", 75.50), ("15:00:00", 45.50), ("16:00:00", 55.50)]
threshold = 50
below = [tpl for tpl in tuple_list if tpl[1] < threshold]
print(below)
Which yields
[('15:00:00', 45.5)]
Note that I added quotation marks and removed the currency sign to be able to compare the values. If your actual values contain the £ sign, you'll have to preprocess them (strip the sign) before comparing.
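As a rough sketch of that preprocessing step, assuming the prices arrive as strings such as "£55.50" (the question does not show the real data, so the input format and the helper name parse_price are hypothetical):

```python
def parse_price(text):
    """Turn a string such as '£1,055.50' into the float 1055.5."""
    # Assumption: the only non-numeric characters are '£' and thousands commas.
    return float(text.replace("£", "").replace(",", ""))

raw_list = [("12:00:00", "£55.50"), ("15:00:00", "£45.50")]
cleaned = [(time, parse_price(price)) for time, price in raw_list]
print(cleaned)
```

After this step the second element of each tuple is a plain number and can be compared against the threshold directly.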
If I'm understanding your question correctly, this should be what you're looking for:
for key, value in tuple_list:
    if conditional_value < value:
        continue  # Skips to next in the list.
    else:
        # Do further coding.
        pass
You can use
tuple_list = [("12:00:00", 55.50), ("13:00:00", 65.50), ("14:00:00", 75.50), ("15:00:00", 45.50), ("16:00:00", 55.50)]
conditional_value = 50
new_tuple_list = list(filter(lambda x: x[1] > conditional_value, tuple_list))
This code will return a new_tuple_list with all items whose value is greater than the conditional_value.
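If only the first matching tuple is wanted rather than a whole list, one possible sketch is to combine a generator expression with next(). This assumes "matching" means the tuple's value is below the conditional value, which is what the expected result of 45.50 in the question suggests; the name first_below is only illustrative.

```python
tuple_list = [("12:00:00", 55.50), ("13:00:00", 65.50), ("14:00:00", 75.50),
              ("15:00:00", 45.50), ("16:00:00", 55.50)]
conditional_value = 50

# next() stops at the first tuple that satisfies the condition;
# the second argument (None) is returned when nothing matches.
first_below = next((tpl for tpl in tuple_list if tpl[1] < conditional_value), None)
print(first_below)
```

Because the generator is consumed lazily, the rest of the list is not examined once a match has been found.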
I'm trying to create a little load balancing function that will intake a list of ordered numbers (these numbers will be string lengths) and output load-balanced chunks. The idea is that we start with a chunk with index 0 (smallest string length) and then add to it the index -1 (longest string length). And we repeat this until we run out of string lengths (stored in list_ordered) so that each chunk has a desired chunk_size.
Anyway, the function below works fine but is not exactly scalable, since we are storing all the data in the list of lists res. My question is: taking into account what I want and the code below, could you please help me convert this function into a generator?
Thanks!
from math import ceil

def chunk_generator_load_balanced(list_ordered,chunk_size):
    n_chunks=ceil(len(list_ordered)/chunk_size)
    res=[]
    direction_chunks={}
    for i in range(n_chunks):
        res.append([])
        direction_chunks[i]=True
    chunk_index=0
    while list_ordered:
        if direction_chunks[chunk_index]:
            chunk_val=list_ordered.pop(0)
            direction_chunks[chunk_index]=False
        else:
            chunk_val=list_ordered.pop(-1)
            direction_chunks[chunk_index]=True
        res[chunk_index].append(chunk_val)
        if chunk_index==n_chunks-1: chunk_index=0
        else: chunk_index+=1
    return res

if __name__ == '__main__':
    list_keys=[i for i in range(50)]
    a=chunk_generator_load_balanced(list_keys,10)
Why not just use yield instead of return in the chunk_generator_load_balanced function?
I mean this:
from math import ceil

def chunk_generator_load_balanced(list_ordered,chunk_size):
    n_chunks=ceil(len(list_ordered)/chunk_size)
    res=[]
    direction_chunks={}
    for i in range(n_chunks):
        res.append([])
        direction_chunks[i]=True
    chunk_index=0
    while list_ordered:
        if direction_chunks[chunk_index]:
            chunk_val=list_ordered.pop(0)
            direction_chunks[chunk_index]=False
        else:
            chunk_val=list_ordered.pop(-1)
            direction_chunks[chunk_index]=True
        res[chunk_index].append(chunk_val)
        if chunk_index==n_chunks-1: chunk_index=0
        else: chunk_index+=1
    # Yields the complete list of chunks as a single item
    yield res
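Note that a single yield of res still builds every chunk in memory before anything is produced. A possible alternative, sketched below, computes the indices that belong to each chunk directly and yields one chunk at a time. It assumes the input supports indexing (for example a list) and that the alternating front/back order of the original function should be preserved. The name balanced_chunks is hypothetical, and unlike the original this version does not empty the input list.

```python
from math import ceil

def balanced_chunks(ordered, chunk_size):
    """Yield load-balanced chunks one at a time without mutating `ordered`."""
    n = len(ordered)
    if n == 0:
        return
    n_chunks = ceil(n / chunk_size)
    for i in range(n_chunks):
        chunk = []
        round_no = 0
        # In round r, chunk i receives the (r * n_chunks + i)-th popped element.
        while round_no * n_chunks + i < n:
            offset = (round_no // 2) * n_chunks + i
            if round_no % 2 == 0:
                chunk.append(ordered[offset])          # even rounds take from the front
            else:
                chunk.append(ordered[n - 1 - offset])  # odd rounds take from the back
            round_no += 1
        yield chunk

for chunk in balanced_chunks(list(range(50)), 10):
    print(chunk)
```

Only one chunk is held in memory at a time, at the cost of requiring random access to the input instead of popping from it.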
how are you?
I'm trying to get the entry with the lowest value from the following list. My idea is that the result will have the form country, price, date.
I'm using Python for the code.
valores= ["al[8075]['2019-05-27']", "de[2177]['2019-05-27']", "at[3946]['2019-05-27']", "be[3019]['2019-05-26']", "by[5741]['2019-05-27']", "ba[0]['2019-05-26', '2019-05-27']", "bg[3223]['2019-05-26']", "hr[4358]['2019-05-26']", "dk[5006]['2019-05-27']", "sk[4964]['2019-05-27']", "si[5253]['2019-05-26']", "es[3813]['2019-05-27']", "ee[4699]['2019-05-27']", "ru[4889]['2019-05-27']", "fi[5410]['2019-05-26']", "fr[2506]['2019-05-26']", "gi[0]['2019-05-26', '2019-05-27']", "gr[1468]['2019-05-26']", "hu[3475]['2019-05-27']", "ie[5360]['2019-05-26']", "is[0]['2019-05-26']", "it[2970]['2019-05-26']", "lv[2482]['2019-05-27']", "lt[1276]['2019-05-27']", "lu[0]['2019-05-26']", "mk[5417]['2019-05-26']", "mt[3532]['2019-05-26']", "md[6158]['2019-05-27']", "me[11080]['2019-05-26']", "no[2967]['2019-05-27']", "nl[3640]['2019-05-27']", "pl[2596]['2019-05-27']", "pt[5409]['2019-05-27']", "uk[5010]['2019-05-27']", "cz[5493]['2019-05-26']", "ro[1017]['2019-05-27']", "rs[6535]['2019-05-27']", "se[3971]['2019-05-26']", "ch[5112]['2019-05-26']", "tr[3761]['2019-05-26']", "ua[5187]['2019-05-26']"]
In this example, the entry with country (ro), price (1017) and date ('2019-05-27') is the lowest, so the result would be:
valores= "ro[1017]['2019-05-27']"
Python's max() and min() functions take a key argument, so whenever you need a minimum or maximum you can often leverage these built-ins. The only code you have to write is a function that converts each value into the representation to compare on. Here a price of 0 is mapped to infinity so that it is never picked as the minimum.
def f(s):
    return int(s.split('[')[1].split(']')[0]) or float('inf')

lowest = min(valores, key=f)  # ro[1017]['2019-05-27']
There is more than one way of coding this. The following will do it:
lowest = 1000000
target = " "
for i in valores:
    ix = i.find("[") + 1
    iy = i.find("]")
    value = int(i[ix:iy])
    if value < lowest and value != 0:
        lowest = value
        target = i
print(target)
It will output
ro[1017]['2019-05-27']
However, here I am assuming you do not want 0 values; otherwise the answer would be
ba[0]['2019-05-26', '2019-05-27']
If you want to include 0, just modify the if block.
This should work for you. I assume you want the lowest non-zero price.
I split every string in the list on the opening square bracket [ and strip the leftover brackets [ and ] from each piece, so each sublist has the form [country, price, dates].
I then sort on the price, which is the second item of each sublist, and filter out the 0 prices.
The result is then the first element of the filtered list.
import re
valores= ["al[8075]['2019-05-27']", "de[2177]['2019-05-27']", "at[3946]['2019-05-27']", "be[3019]['2019-05-26']", "by[5741]['2019-05-27']", "ba[0]['2019-05-26', '2019-05-27']", "bg[3223]['2019-05-26']", "hr[4358]['2019-05-26']", "dk[5006]['2019-05-27']", "sk[4964]['2019-05-27']", "si[5253]['2019-05-26']", "es[3813]['2019-05-27']", "ee[4699]['2019-05-27']", "ru[4889]['2019-05-27']", "fi[5410]['2019-05-26']", "fr[2506]['2019-05-26']", "gi[0]['2019-05-26', '2019-05-27']", "gr[1468]['2019-05-26']", "hu[3475]['2019-05-27']", "ie[5360]['2019-05-26']", "is[0]['2019-05-26']", "it[2970]['2019-05-26']", "lv[2482]['2019-05-27']", "lt[1276]['2019-05-27']", "lu[0]['2019-05-26']", "mk[5417]['2019-05-26']", "mt[3532]['2019-05-26']", "md[6158]['2019-05-27']", "me[11080]['2019-05-26']", "no[2967]['2019-05-27']", "nl[3640]['2019-05-27']", "pl[2596]['2019-05-27']", "pt[5409]['2019-05-27']", "uk[5010]['2019-05-27']", "cz[5493]['2019-05-26']", "ro[1017]['2019-05-27']", "rs[6535]['2019-05-27']", "se[3971]['2019-05-26']", "ch[5112]['2019-05-26']", "tr[3761]['2019-05-26']", "ua[5187]['2019-05-26']"]
results = []
# Iterate through valores
for item in valores:
    # Extract elements from each string by splitting on [ and then stripping extra square brackets
    items = [it.strip('][') for it in item.split('[')]
    results.append(items)
# Sort on the second element, which is the price, and filter out prices which are 0
res = list(
    filter(lambda x: int(x[1]) > 0,
           sorted(results, key=lambda x: int(x[1])))
)
# This is your lowest non-zero price
print(res[0])
The output will be
['ro', '1017', "'2019-05-27'"]
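If typed values are more convenient than strings, the same idea can be sketched with a regular expression and ast.literal_eval. This assumes every entry has exactly the shape code[price][dates] seen in the question; the names PATTERN and parse are hypothetical, and only a small sample of the list is used here.

```python
import ast
import re

# Assumed entry shape: letters, then [digits], then a Python-style list of dates.
PATTERN = re.compile(r"^(\w+)\[(\d+)\](\[.*\])$")

def parse(entry):
    """Split one entry into (country, price as int, list of date strings)."""
    country, price, dates = PATTERN.match(entry).groups()
    return country, int(price), ast.literal_eval(dates)

sample = ["al[8075]['2019-05-27']", "ba[0]['2019-05-26', '2019-05-27']",
          "ro[1017]['2019-05-27']", "lt[1276]['2019-05-27']"]
records = [parse(entry) for entry in sample]

# Ignore zero prices, then pick the record with the smallest price.
lowest = min((r for r in records if r[1] > 0), key=lambda r: r[1])
print(lowest)
```

Using min() with a key avoids sorting the whole list when only the smallest element is needed.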
I need some help getting my brain around designing an (efficient) Markov chain in Spark (via Python). I've written it as best I could, but the code I came up with doesn't scale. For the various map stages I wrote custom functions, and they work fine for sequences of a couple thousand, but when we get to 20,000+ (and I've got some up to 800k) things slow to a crawl.
For those of you not familiar with Markov models, this is the gist of it.
This is my data.. I've got the actual data (no header) in an RDD at this point.
ID, SEQ
500, HNL, LNH, MLH, HML
We look at sequences in tuples, so
(HNL, LNH), (LNH,MLH), etc..
And I need to get to this point.. where I return a dictionary (for each row of data) that I then serialize and store in an in memory database.
{500:
{HNLLNH : 0.333},
{LNHMLH : 0.333},
{MLHHML : 0.333},
{LNHHNL : 0.000},
etc..
}
So in essence, each sequence is combined with the next (HNL,LNH become 'HNLLNH'), then for all possible transitions (combinations of sequences) we count their occurrence and then divide by the total number of transitions (3 in this case) and get their frequency of occurrence.
There were 3 transitions above, and one of those was HNLLNH.. So for HNLLNH, 1/3 = 0.333
As a side note, and I'm not sure if it's relevant, the values for each position in a sequence are limited: 1st position (H/M/L), 2nd position (M/L), 3rd position (H/M/L).
What my code had previously done was to collect() the rdd, and map it a couple times using functions I wrote. Those functions first turned the string into a list, then merged list[1] with list[2], then list[2] with list[3], then list[3] with list[4], etc.. so I ended up with something like this..
[HNLLNH],[LNHMLH],[MLHHML], etc..
Then the next function created a dictionary out of that list, using the list item as a key, counting the total occurrences of that key in the full list and dividing by len(list) to get the frequency. I then wrapped that dictionary in another dictionary, along with its ID number (resulting in the 2nd code block, up above).
Like I said, this worked well for small-ish sequences, but not so well for lists with a length of 100k+.
Also, keep in mind, this is just one row of data. I have to perform this operation on anywhere from 10-20k rows of data, with rows of data varying between lengths of 500-800,000 sequences per row.
Any suggestions on how I can write pyspark code (using the API map/reduce/agg/etc.. functions) to do this efficiently?
EDIT
Code as follows.. Probably makes sense to start at the bottom. Please keep in mind I'm learning this(Python and Spark) as I go, and I don't do this for a living, so my coding standards are not great..
def f(x):
    # Custom RDD map function
    # Combines two separate transactions
    # into a single transition state
    cust_id = x[0]
    trans = ','.join(x[1])
    y = trans.split(",")
    s = ''
    for i in range(len(y)-1):
        s= s + str(y[i] + str(y[i+1]))+","
    return str(cust_id+','+s[:-1])

def g(x):
    # Custom RDD map function
    # Calculates the transition state probabilities
    # by adding up state-transition occurrences
    # and dividing by total transitions
    cust_id=str(x.split(",")[0])
    trans = x.split(",")[1:]
    temp_list=[]
    middle = int((len(trans[0])+1)/2)
    for i in trans:
        temp_list.append( (''.join(i)[:middle], ''.join(i)[middle:]) )
    state_trans = {}
    for i in temp_list:
        state_trans[i] = temp_list.count(i)/(len(temp_list))
    my_dict = {}
    my_dict[cust_id]=state_trans
    return my_dict

def gen_tsm_dict_spark(lines):
    # Takes RDD/string input with format CUST_ID(or)PROFILE_ID,SEQ,SEQ,SEQ....
    # Returns RDD of dict with CUST_ID and tsm per customer
    # i.e. {cust_id : { ('NLN', 'LNN') : 0.33, ('HPN', 'NPN') : 0.66}
    # creates a tuple ([cust/profile_id], [SEQ,SEQ,SEQ])
    cust_trans = lines.map(lambda s: (s.split(",")[0],s.split(",")[1:]))
    with_seq = cust_trans.map(f)
    full_tsm_dict = with_seq.map(g)
    return full_tsm_dict

def main():
    result = gen_tsm_dict_spark(my_rdd)
    # Insert into DB
    for x in result.collect():
        for k,v in x.iteritems():
            db_insert(k,v)
You can try something like the code below. It depends heavily on toolz, but if you prefer to avoid external dependencies you can easily replace it with some standard Python libraries.
from __future__ import division

from collections import Counter
from itertools import product
from toolz.curried import sliding_window, map, pipe, concat
from toolz.dicttoolz import merge

# Generate all possible transitions
defaults = sc.broadcast(dict(map(
    lambda x: ("".join(concat(x)), 0.0),
    product(product("HNL", "NL", "HNL"), repeat=2))))

rdd = sc.parallelize(["500, HNL, LNH, NLH, HNL", "600, HNN, NNN, NNN, HNN, LNH"])

def process(line):
    """
    >>> process("000, HHH, LLL, NNN")
    ('000', {'LLLNNN': 0.5, 'HHHLLL': 0.5})
    """
    bits = line.split(", ")
    transactions = bits[1:]
    n = len(transactions) - 1
    frequencies = pipe(
        sliding_window(2, transactions),  # Get all transitions
        map(lambda p: "".join(p)),  # Joins strings
        Counter,  # Count
        lambda cnt: {k: v / n for (k, v) in cnt.items()}  # Get frequencies
    )
    return bits[0], frequencies

def store_partition(iter):
    for (k, v) in iter:
        db_insert(k, merge([defaults.value, v]))

rdd.map(process).foreachPartition(store_partition)
Since you know all possible transitions I would recommend using a sparse representation and ignore zeros. Moreover you can replace dictionaries with sparse vectors to reduce memory footprint.
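For readers who want to avoid the toolz dependency, the per-line step might be sketched with the standard library only. This is a plain-Python illustration of the sparse representation (transitions that never occur are simply absent), not a drop-in replacement for the Spark job; the function name transition_frequencies is hypothetical.

```python
from collections import Counter

def transition_frequencies(line):
    """Return (id, {transition: frequency}) for one comma-separated line."""
    bits = [b.strip() for b in line.split(",")]
    key, states = bits[0], bits[1:]
    # zip pairs each state with its successor: (s0, s1), (s1, s2), ...
    counts = Counter(a + b for a, b in zip(states, states[1:]))
    total = sum(counts.values())
    # With fewer than two states there are no transitions and the dict stays empty.
    return key, {k: v / total for k, v in counts.items()}

print(transition_frequencies("500, HNL, LNH, MLH, HML"))
```

Counter makes a single pass over the transitions, which avoids calling list.count() once per element as in the question's code.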
You can achieve this result using pure PySpark; that is how I did it.
To create the frequencies, let's say you already have input RDDs like this:
ID, SEQ
500, [HNL, LNH, MLH, HML ...]
and want the frequencies of pairs like (HNL, LNH), (LNH, MLH), ...
inputRDD.map(lambda kv: get_frequencies(kv[1])).flatMap(lambda x: x) \
    .reduceByKey(lambda v1, v2: v1 + v2)
def get_frequencies(states_list):
    """
    :param states_list: A list of customer states.
    :return: State frequencies list.
    """
    rest = []
    tuples_list = []
    for idx in range(0, len(states_list)):
        if idx + 1 < len(states_list):
            tuples_list.append((states_list[idx], states_list[idx+1]))
    unique = set(tuples_list)
    for value in unique:
        rest.append((value, tuples_list.count(value)))
    return rest
and you will get results
((HNL, LNH), 98),((LNH, MLH), 458),() ......
After this you may convert the result RDDs into DataFrames, or you can insert directly into the DB using the RDD's mapPartitions.
I am new to Python and I have a hard time solving this.
I am trying to sort a list to be able to human sort it 1) by the first number and 2) the second number. I would like to have something like this:
'1-1bird'
'1-1mouse'
'1-1nmouses'
'1-2mouse'
'1-2nmouses'
'1-3bird'
'10-1birds'
(...)
Those numbers can be from 1 to 99 ex: 99-99bird is possible.
This is the code I have after a couple of headaches. Being able to then sort by the following first letter would be a bonus.
Here is what I've tried:
#!/usr/bin/python
myList = list()
myList = ['1-10bird', '1-10mouse', '1-10nmouses', '1-10person', '1-10cat', '1-11bird', '1-11mouse', '1-11nmouses', '1-11person', '1-11cat', '1-12bird', '1-12mouse', '1-12nmouses', '1-12person', '1-13mouse', '1-13nmouses', '1-13person', '1-14bird', '1-14mouse', '1-14nmouses', '1-14person', '1-14cat', '1-15cat', '1-1bird', '1-1mouse', '1-1nmouses', '1-1person', '1-1cat', '1-2bird', '1-2mouse', '1-2nmouses', '1-2person', '1-2cat', '1-3bird', '1-3mouse', '1-3nmouses', '1-3person', '1-3cat', '2-14cat', '2-15cat', '2-16cat', '2-1bird', '2-1mouse', '2-1nmouses', '2-1person', '2-1cat', '2-2bird', '2-2mouse', '2-2nmouses', '2-2person']
def mysort(x,y):
    x1=""
    y1=""
    for myletter in x :
        if myletter.isdigit() or "-" in myletter:
            x1=x1+myletter
    x1 = x1.split("-")
    for myletter in y :
        if myletter.isdigit() or "-" in myletter:
            y1=y1+myletter
    y1 = y1.split("-")
    if x1[0]>y1[0]:
        return 1
    elif x1[0]==y1[0]:
        if x1[1]>y1[1]:
            return 1
        elif x1==y1:
            return 0
        else :
            return -1
    else :
        return -1
myList.sort(mysort)
print myList
Thanks !
Martin
You have some good ideas with splitting on '-' and using isalpha() and isdigit(), but then we'll use those to create a function that takes in an item and returns a "clean" version of the item, which can be easily sorted. It will create a three-digit, zero-padded representation of the first number, then a similar thing with the second number, then the "word" portion (instead of just the first character). The result looks something like "001001bird" (that won't display - it'll just be used internally). The built-in function sorted() will use this callback function as a key, taking each element, passing it to the callback, and basing the sort order on the returned value. In the test, I use the * operator and the sep argument to print it without needing to construct a loop, but looping is perfectly fine as well.
def callback(item):
    phrase = item.split('-')
    first = phrase[0].rjust(3, '0')
    second = ''.join(filter(str.isdigit, phrase[1])).rjust(3, '0')
    word = ''.join(filter(str.isalpha, phrase[1]))
    return first + second + word
Test:
>>> myList = ['1-10bird', '1-10mouse', '1-10nmouses', '1-10person', '1-10cat', '1-11bird', '1-11mouse', '1-11nmouses', '1-11person', '1-11cat', '1-12bird', '1-12mouse', '1-12nmouses', '1-12person', '1-13mouse', '1-13nmouses', '1-13person', '1-14bird', '1-14mouse', '1-14nmouses', '1-14person', '1-14cat', '1-15cat', '1-1bird', '1-1mouse', '1-1nmouses', '1-1person', '1-1cat', '1-2bird', '1-2mouse', '1-2nmouses', '1-2person', '1-2cat', '1-3bird', '1-3mouse', '1-3nmouses', '1-3person', '1-3cat', '2-14cat', '2-15cat', '2-16cat', '2-1bird', '2-1mouse', '2-1nmouses', '2-1person', '2-1cat', '2-2bird', '2-2mouse', '2-2nmouses', '2-2person']
>>> print(*sorted(myList, key=callback), sep='\n')
1-1bird
1-1cat
1-1mouse
1-1nmouses
1-1person
1-2bird
1-2cat
1-2mouse
1-2nmouses
1-2person
1-3bird
1-3cat
1-3mouse
1-3nmouses
1-3person
1-10bird
1-10cat
1-10mouse
1-10nmouses
1-10person
1-11bird
1-11cat
1-11mouse
1-11nmouses
1-11person
1-12bird
1-12mouse
1-12nmouses
1-12person
1-13mouse
1-13nmouses
1-13person
1-14bird
1-14cat
1-14mouse
1-14nmouses
1-14person
1-15cat
2-1bird
2-1cat
2-1mouse
2-1nmouses
2-1person
2-2bird
2-2mouse
2-2nmouses
2-2person
2-14cat
2-15cat
2-16cat
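An alternative to the zero-padded string key is a tuple key whose numbers are compared as integers, so no padding width has to be chosen. The sketch below assumes every item matches the pattern number-number-word shown in the question; natural_key is a hypothetical name and the list is a shortened sample.

```python
import re

def natural_key(item):
    """Return (first number, second number, trailing word) for sorting."""
    first, second, word = re.match(r"(\d+)-(\d+)(.*)", item).groups()
    return int(first), int(second), word

items = ['1-10bird', '1-1mouse', '10-1birds', '1-2mouse', '1-1bird', '2-14cat']
print(sorted(items, key=natural_key))
# ['1-1bird', '1-1mouse', '1-2mouse', '1-10bird', '2-14cat', '10-1birds']
```

Tuples are compared element by element, so the word only matters when both numbers are equal.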
You need leading zeros. Strings are sorted character by character rather than by numeric value, so '10' sorts before '2'. It should be
'01-1bird'
'01-1mouse'
'01-1nmouses'
'01-2mouse'
'01-2nmouses'
'01-3bird'
'10-1birds'
As you can see, 1 goes after 0.
The other answers here are very respectable, I'm sure, but for full credit you should ensure that your answer fits on a single line and uses as many list comprehensions as possible:
import itertools
[''.join(r) for r in sorted([[''.join(x) for _, x in
                              itertools.groupby(v, key=str.isdigit)]
                             for v in myList], key=lambda v: (int(v[0]), int(v[2]), v[3]))]
That should do nicely:
['1-1bird',
'1-1cat',
'1-1mouse',
'1-1nmouses',
'1-1person',
'1-2bird',
'1-2cat',
'1-2mouse',
...
'2-2person',
'2-14cat',
'2-15cat',
'2-16cat']