I have a list of tuples
(something1, 500)
(something1, 200)
(something1, 300)
(something2, 200)
(something2, 600)
(something2, 400)
I have written a function in pySpark to do the calculation to get a result something like this. The function basically needs to sum up the total of the counts that occur
(something1, 1000),
(something2, 1200)
My function so far
def add_function(key, value):
last_key = None
recur_total = 0
key, value = join_data[0][0], join_data[0][1]
if last_key == key:
recur_total+ = value
else:
if last_key:
recur_total = value
if last_key == key:
recur_total = value
last_key = key
if last_key == key:
return(last_key, value)
Problems I am facing
I am unable to paste the function as one function at the pySpark console. It gets split to multiple prompts.
It says syntax error at line 6 (recur_total+ = value).
What am I doing wrong and how to rectify this?
I am unable to paste the function as one function at the pySpark console. It gets split to multiple prompts.
I do not understand what you mean by this. As long as your indentation is correct, the "multiple prompts" do create a single function correctly.
It says syntax error at line 6 (recur_total+ = value).
This error means that your code is being pasted correctly and that Python is parsing it. The problem is the space inside the operator: "+ =" is not valid Python, so write recur_total += value with no space between + and =.
Others have already answered your questions about indentation, but here are my 2 cents on the function itself.
The task can be done simply with groupby from itertools, as long as entries with the same key are adjacent (as they are here):
from itertools import groupby
data = [('something1', 500),
        ('something1', 200),
        ('something1', 300),
        ('something2', 200),
        ('something2', 600),
        ('something2', 400)]
for key, group in groupby(data, lambda x: x[0]):
    result = 0
    for things in group:
        result = result + things[1]
    print(key, result)
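If the input is not already grouped by key, accumulating into a dictionary avoids the ordering requirement. The following is only a sketch of that alternative; the function name sum_by_key is made up for illustration and is not part of the question or of PySpark.

```python
from collections import defaultdict

def sum_by_key(pairs):
    """Sum the values for each key, regardless of input order."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    # Sort the result so the output order is predictable.
    return sorted(totals.items())

# Same values as in the question, deliberately interleaved.
data = [('something1', 500), ('something2', 200), ('something1', 200),
        ('something2', 600), ('something1', 300), ('something2', 400)]
print(sum_by_key(data))  # [('something1', 1000), ('something2', 1200)]
```

On an actual RDD of (key, value) pairs, the usual PySpark equivalent would be reduceByKey with an addition function, assuming the data is already loaded as an RDD.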
I have a list of tuples converted from a dictionary. Starting from the beginning of the list, I want to compare a conditional value against the value in each tuple to see whether it is higher or lower. When this conditional value is lower than a tuple's value, I want to use that specific tuple for further coding.
Please can somebody give me an insight into how this is achieved?
I am relatively new to coding, self-learning and I am not 100% sure the example would run but for the sake of demonstrating I have tried my best.
`tuple_list = [(12:00:00, £55.50), (13:00:00, £65.50), (14:00:00, £75.50), (15:00:00, £45.50), (16:00:00, £55.50)]
conditional_value = £50
if conditional_value != for x in tuple_list.values()
y = 0
if conditional_value < tuple_list(y)
y++1
else
///"return the relevant value from the tuple_list to use for further coding. I would be
looking to work with £45.50"///`
Thank you.
Just form a new list with a condition:
tuple_list = [("12:00:00", 55.50), ("13:00:00", 65.50), ("14:00:00", 75.50), ("15:00:00", 45.50), ("16:00:00", 55.50)]
threshold = 50
below = [tpl for tpl in tuple_list if tpl[1] < threshold]
print(below)
Which yields
[('15:00:00', 45.5)]
Note that I added quotation marks around the times and removed the currency sign so that the values can be compared as numbers. If your actual values contain the £ sign, you will have to strip it and convert the values to numbers first.
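If the prices do arrive as strings with the currency sign, the preprocessing step could look roughly like this. It is a sketch that assumes every price is a string such as "£55.50"; the helper name parse_price is invented for the example.

```python
def parse_price(text):
    """Convert a string such as '£55.50' to a float."""
    # Drop the currency sign, thousands separators and surrounding spaces.
    return float(text.replace("£", "").replace(",", "").strip())

raw = [("12:00:00", "£55.50"), ("13:00:00", "£65.50"), ("14:00:00", "£75.50"),
       ("15:00:00", "£45.50"), ("16:00:00", "£55.50")]
threshold = 50

cleaned = [(time, parse_price(price)) for time, price in raw]
below = [tpl for tpl in cleaned if tpl[1] < threshold]
print(below)  # [('15:00:00', 45.5)]
```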
If I'm understanding your question correctly, this should be what you're looking for:
for key, value in tuple_list:
    if conditional_value < value:
        continue  # Skips to the next item in the list.
    else:
        # Do further coding with key and value here.
        pass  # placeholder so the block is valid Python
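If only the first matching tuple is needed, next() with a generator expression stops at the first hit. This is a sketch under the assumption that the prices are plain numbers; the name first_below is chosen here for illustration.

```python
tuple_list = [("12:00:00", 55.50), ("13:00:00", 65.50), ("14:00:00", 75.50),
              ("15:00:00", 45.50), ("16:00:00", 55.50)]
conditional_value = 50

# The second argument to next() is returned when nothing matches.
first_below = next(
    (tpl for tpl in tuple_list if tpl[1] < conditional_value), None)

if first_below is not None:
    time, price = first_below
    print(time, price)  # 15:00:00 45.5
```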
You can use
tuple_list = [("12:00:00", 55.50), ("13:00:00", 65.50), ("14:00:00", 75.50), ("15:00:00", 45.50), ("16:00:00", 55.50)]
conditional_value = 50
new_tuple_list = list(filter(lambda x: x[1] > conditional_value, tuple_list))
This code returns a new_tuple_list containing all items whose value is greater than conditional_value. To keep the items below it instead, change > to <.
I am trying to get proportion of nouns in my text using the code below and it is giving me an error. I am using a function that calculates the number of nouns in my text and I have the overall word count in a different column.
pos_family = {
    'noun' : ['NN','NNS','NNP','NNPS']
}
def check_pos_tag(x, flag):
    cnt = 0
    try:
        for tag,value in x.items():
            if tag in pos_family[flag]:
                cnt +=value
    except:
        pass
    return cnt
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')/df2['word_count'])
Note: I have used nltk package to get the counts by PoS tags and I have the counts in a dictionary in PoS_Count column in my dataframe.
If I remove "/df2['word_count']" in the first run and get the noun count and include it again and run, it works fine but if I run it for the first time I get the below error.
ValueError: Wrong number of items passed 100, placement implies 1
Any help is greatly appreciated
Thanks in Advance!
As you have guessed, the problem is in the /df2['word_count'] bit.
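df2['word_count'] is the whole pandas Series, not the word count of the current row. Inside the lambda, check_pos_tag(x, 'noun') (which is an int) is divided by that entire Series, so each row produces a Series instead of a single number, and pandas cannot place it in one cell.
A possible solution is to extract the corresponding field from the series and use it in your lambda.
However, it is easier (and arguably faster) to do the two operations separately: apply the function first, then divide the resulting column by df2['word_count'].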
Try this:
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')) / df2['word_count']
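To make explicit what that line computes, here is a plain-Python sketch of the same row-by-row calculation, with two lists standing in for the PoS_Count and word_count columns. The sample values are made up for illustration, and check_pos_tag is condensed; with pandas, the element-wise division does this pairing for you.

```python
pos_family = {'noun': ['NN', 'NNS', 'NNP', 'NNPS']}

def check_pos_tag(tag_counts, flag):
    """Add up the counts of every tag that belongs to the given family."""
    return sum(count for tag, count in tag_counts.items()
               if tag in pos_family[flag])

# Hypothetical rows standing in for df2['PoS_Count'] and df2['word_count'].
pos_counts = [{'NN': 3, 'VB': 2, 'NNS': 1}, {'JJ': 4, 'NNP': 2}]
word_counts = [6, 8]

# Each noun count is divided by the word count of the same row.
noun_share = [check_pos_tag(counts, 'noun') / total
              for counts, total in zip(pos_counts, word_counts)]
print(noun_share)
```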
I have a problem which my novice knowledge cannot solve.
I'm trying to copy some python-2.x code (which is working) to python-3.x. Now it gives me an error.
Here's a snippet of the code:
def littleUglyDataCollectionInTheSourceCode():
    a = {
        'Aabenraa': [842.86917819535, 25.58264089252],
        'Aalborg': [706.92644963185, 27.22746146366],
        'Aarhus': [696.60346488317, 25.67540525994],
        'Albertslund': [632.49007681987, 27.70499807418],
        'Allerød': [674.10474259426, 27.91964123274],
        'Assens': [697.02257492453, 25.83386400960],
        'Ballerup': [647.05121493736, 27.72466920284],
        'Billund': [906.63431520239, 26.23136823557],
        'Bornholm': [696.05765684503, 28.98396327957],
        'Brøndby': [644.89390717471, 28.18974127413],
    }
    return a
and:
def calcComponent(data):
    # Todo: implement interface to set these values by
    # the corresponding 'Kommune'
    T = float(data.period)
    k = 1.1
    rH = 1.0
    # import with s/\([^\s-].*?\)\t\([0-9.]*\)$/'\1':'\2',/
    myDict = littleUglyDataCollectionInTheSourceCode();
    #if data.kommune in myDict:
    # https://docs.djangoproject.com/en/1.10/ref/unicode/
    key = data.kommune.encode("utf-8")
    rd = myDict.get(key.strip(), 0)
    laP = float(rd[0]) # average precipitation
    midV = float(rd[1]) # Middelværdi Klimagrid
    print(("lap " + str(laP)))
    print(("mid V" + str(midV)))
It gives the error:
line 14, in calcComponent
laP = float(rd[0]) # average precipitation
TypeError: 'int' object is not subscriptable
I've tried different approaches and read dozens of articles with no luck. Being a novice, it is like tumbling in the dark.
In your example myDict is a dictionary with strings as keys and lists as values.
key = data.kommune.encode("utf-8")
will be a bytes object, so there can't ever be any corresponding value for that key in the dictionary. This worked in python2 where automatic conversion was performed, but not anymore in python3, you need to use the correct type for lookups.
rd = myDict.get(key.strip(), 0)
will always return the integer 0, which means that rd[0] can not work because integers are not indexable, as the error message tells you.
Generally the default value in a get() call should be compatible with what is returned in all other cases. Returning 0 as default where all non-default cases return lists can only lead to problems.
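Both points can be checked in isolation. The sketch below uses two entries from the question's dictionary; the variable names are chosen here for illustration and the input string stands in for data.kommune.

```python
places = {'Aabenraa': [842.86917819535, 25.58264089252],
          'Allerød': [674.10474259426, 27.91964123274]}

name = ' Allerød '                        # stand-in for data.kommune
bytes_key = name.encode('utf-8').strip()  # bytes in Python 3
str_key = name.strip()                    # str, the same type as the dict keys

print(bytes_key in places)  # False: a bytes key never equals a str key
print(str_key in places)    # True

# A default with the same shape as the real values keeps rd[0] and rd[1] safe.
missing = places.get(bytes_key, [0.0, 0.0])
found = places.get(str_key, [0.0, 0.0])
print(missing[0], found[0])
```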
You are using 0 as a default value for rd, whereas the values in the dict are lists, so if the key is not found, rd[0] or rd[1] will fail. Instead, use a list or tuple as default, then it should work.
rd = myDict.get(key.strip(), [0, 0])
And that is why googling the TypeError text didn't lead me to a solution: my problem was twofold. I forgot about the integrated encoding in Python 3.
I changed:
key = data.kommune.encode("utf-8")
rd = myDict.get(key.strip(), 0)
to:
key = data.kommune
rd = myDict.get(key.strip(), [0, 0])
And now it works:-)
I am new to Python and I have a hard time solving this.
I am trying to sort a list to be able to human sort it 1) by the first number and 2) the second number. I would like to have something like this:
'1-1bird'
'1-1mouse'
'1-1nmouses'
'1-2mouse'
'1-2nmouses'
'1-3bird'
'10-1birds'
(...)
Those numbers can be from 1 to 99 ex: 99-99bird is possible.
This is the code I have after a couple of headaches. Being able to then sort by the following first letter would be a bonus.
Here is what I've tried:
#!/usr/bin/python
myList = list()
myList = ['1-10bird', '1-10mouse', '1-10nmouses', '1-10person', '1-10cat', '1-11bird', '1-11mouse', '1-11nmouses', '1-11person', '1-11cat', '1-12bird', '1-12mouse', '1-12nmouses', '1-12person', '1-13mouse', '1-13nmouses', '1-13person', '1-14bird', '1-14mouse', '1-14nmouses', '1-14person', '1-14cat', '1-15cat', '1-1bird', '1-1mouse', '1-1nmouses', '1-1person', '1-1cat', '1-2bird', '1-2mouse', '1-2nmouses', '1-2person', '1-2cat', '1-3bird', '1-3mouse', '1-3nmouses', '1-3person', '1-3cat', '2-14cat', '2-15cat', '2-16cat', '2-1bird', '2-1mouse', '2-1nmouses', '2-1person', '2-1cat', '2-2bird', '2-2mouse', '2-2nmouses', '2-2person']
def mysort(x,y):
    x1=""
    y1=""
    for myletter in x :
        if myletter.isdigit() or "-" in myletter:
            x1=x1+myletter
    x1 = x1.split("-")
    for myletter in y :
        if myletter.isdigit() or "-" in myletter:
            y1=y1+myletter
    y1 = y1.split("-")
    if x1[0]>y1[0]:
        return 1
    elif x1[0]==y1[0]:
        if x1[1]>y1[1]:
            return 1
        elif x1==y1:
            return 0
        else :
            return -1
    else :
        return -1
myList.sort(mysort)
print myList
Thanks !
Martin
You have some good ideas with splitting on '-' and using isalpha() and isdigit(). We can use those to create a function that takes an item and returns a "clean" version of it that sorts easily. It creates a three-digit, zero-padded representation of the first number, does the same for the second number, then appends the "word" portion (the whole word, not just its first character). The result looks something like "001001bird" (it is never displayed; it is only used internally).
The built-in function sorted() uses this callback as its key: it passes each element to the callback and bases the sort order on the returned value. In the test, I use the * operator and the sep argument to print the result without writing a loop, but looping is perfectly fine as well.
def callback(item):
    phrase = item.split('-')
    first = phrase[0].rjust(3, '0')
    second = ''.join(filter(str.isdigit, phrase[1])).rjust(3, '0')
    word = ''.join(filter(str.isalpha, phrase[1]))
    return first + second + word
Test:
>>> myList = ['1-10bird', '1-10mouse', '1-10nmouses', '1-10person', '1-10cat', '1-11bird', '1-11mouse', '1-11nmouses', '1-11person', '1-11cat', '1-12bird', '1-12mouse', '1-12nmouses', '1-12person', '1-13mouse', '1-13nmouses', '1-13person', '1-14bird', '1-14mouse', '1-14nmouses', '1-14person', '1-14cat', '1-15cat', '1-1bird', '1-1mouse', '1-1nmouses', '1-1person', '1-1cat', '1-2bird', '1-2mouse', '1-2nmouses', '1-2person', '1-2cat', '1-3bird', '1-3mouse', '1-3nmouses', '1-3person', '1-3cat', '2-14cat', '2-15cat', '2-16cat', '2-1bird', '2-1mouse', '2-1nmouses', '2-1person', '2-1cat', '2-2bird', '2-2mouse', '2-2nmouses', '2-2person']
>>> print(*sorted(myList, key=callback), sep='\n')
1-1bird
1-1cat
1-1mouse
1-1nmouses
1-1person
1-2bird
1-2cat
1-2mouse
1-2nmouses
1-2person
1-3bird
1-3cat
1-3mouse
1-3nmouses
1-3person
1-10bird
1-10cat
1-10mouse
1-10nmouses
1-10person
1-11bird
1-11cat
1-11mouse
1-11nmouses
1-11person
1-12bird
1-12mouse
1-12nmouses
1-12person
1-13mouse
1-13nmouses
1-13person
1-14bird
1-14cat
1-14mouse
1-14nmouses
1-14person
1-15cat
2-1bird
2-1cat
2-1mouse
2-1nmouses
2-1person
2-2bird
2-2mouse
2-2nmouses
2-2person
2-14cat
2-15cat
2-16cat
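A variation on the same idea is to return a tuple of two integers and the word instead of a zero-padded string, so the numbers are compared numerically whatever their width. This is a sketch of that alternative, not the approach above; natural_key is a name invented here, and it assumes every item has the form number-number-word.

```python
import re

def natural_key(item):
    """Return (first number, second number, word) for strings like '1-10bird'."""
    match = re.fullmatch(r'(\d+)-(\d+)(.*)', item)
    if match is None:
        raise ValueError('unexpected item format: %r' % item)
    first, second, word = match.groups()
    # Tuples compare element by element, so ints sort numerically.
    return int(first), int(second), word

sample = ['1-10bird', '10-1birds', '1-2mouse', '1-1mouse', '2-14cat', '1-1bird']
print(sorted(sample, key=natural_key))
# ['1-1bird', '1-1mouse', '1-2mouse', '1-10bird', '2-14cat', '10-1birds']
```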
You need leading zeros. Strings are compared character by character rather than by numeric value, so '10' sorts before '2'. The items should be
'01-1bird'
'01-1mouse'
'01-1nmouses'
'01-2mouse'
'01-2nmouses'
'01-3bird'
'10-1birds'
As you can see, '10' now sorts after '01'.
The other answers here are very respectable, I'm sure, but for full credit you should ensure that your answer fits on a single line and uses as many list comprehensions as possible:
import itertools
[''.join(r) for r in sorted([[''.join(x) for _, x in
itertools.groupby(v, key=str.isdigit)]
for v in myList], key=lambda v: (int(v[0]), int(v[2]), v[3]))]
That should do nicely:
['1-1bird',
'1-1cat',
'1-1mouse',
'1-1nmouses',
'1-1person',
'1-2bird',
'1-2cat',
'1-2mouse',
...
'2-2person',
'2-14cat',
'2-15cat',
'2-16cat']
I am receiving the error
TypeError: 'filter' object is not subscriptable
When trying to run the following block of code
bonds_unique = {}
for bond in bonds_new:
    if bond[0] < 0:
        ghost_atom = -(bond[0]) - 1
        bond_index = 0
    elif bond[1] < 0:
        ghost_atom = -(bond[1]) - 1
        bond_index = 1
    else:
        bonds_unique[repr(bond)] = bond
        continue
    if sheet[ghost_atom][1] > r_length or sheet[ghost_atom][1] < 0:
        ghost_x = sheet[ghost_atom][0]
        ghost_y = sheet[ghost_atom][1] % r_length
        image = filter(lambda i: abs(i[0] - ghost_x) < 1e-2 and
                       abs(i[1] - ghost_y) < 1e-2, sheet)
        bond[bond_index] = old_to_new[sheet.index(image[0]) + 1 ]
        bond.sort()
        #print >> stderr, ghost_atom +1, bond[bond_index], image
    bonds_unique[repr(bond)] = bond
# Removing duplicate bonds
bonds_unique = sorted(bonds_unique.values())
And
sheet_new = []
bonds_new = []
old_to_new = {}
sheet=[]
bonds=[]
The error occurs at the line
bond[bond_index] = old_to_new[sheet.index(image[0]) + 1 ]
I apologise that this type of question has been posted on SO many times, but I am fairly new to Python and do not fully understand dictionaries. Am I trying to use a dictionary in a way in which it should not be used, or should I be using a dictionary where I am not using it?
I know that the fix is probably very simple (albeit not to me), and I will be very grateful if someone could point me in the right direction.
Once again, I apologise if this question has been answered already
Thanks,
Chris.
I am using Python IDLE 3.3.1 on Windows 7 64-bit.
In Python 3, filter() does not return a list but a lazy filter object (an iterator). Use the next() function on it to get the first filtered item:
bond[bond_index] = old_to_new[sheet.index(next(image)) + 1 ]
There is no need to convert it to a list, as you only use the first value.
Iterable objects like the one filter() returns produce results on demand rather than all in one go. If your sheet list is very large, it might take a long time and a lot of memory to put all the filtered results into a list, but filter() only needs to evaluate your lambda condition until one of the values from sheet produces a True result in order to produce one output. You tell the filter() object to scan through sheet for that first value by passing it to the next() function. You could do so multiple times to get multiple values, or use other tools that take iterables to do more complex things; the itertools library is full of such tools. The Python for loop is another such tool; it too takes values from an iterable one by one.
If you must have access to all filtered results together, because you have to, say, index into the results at will (e.g. because this time your algorithm needed to access index 223, index 17 then index 42) only then convert the iterable object to a list, by using list():
image = list(filter(lambda i: ..., sheet))
The ability to access any of the values of an ordered sequence of values is called random access; a list is such a sequence, and so is a tuple or a numpy array. Iterables do not provide random access.
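The difference is easy to see on a small example. The data below is made up and stands in for sheet; note also that next() without a default raises StopIteration when nothing matches, so the sketch passes None as a fallback.

```python
# Hypothetical coordinates standing in for the sheet list.
points = [(0.0, 1.0), (2.0, 3.0), (2.001, 3.001), (5.0, 5.0)]
target_x, target_y = 2.0, 3.0

def is_close(p):
    """Same test as the lambda in the question, against the target point."""
    return abs(p[0] - target_x) < 1e-2 and abs(p[1] - target_y) < 1e-2

matches = filter(is_close, points)   # nothing is evaluated yet
first = next(matches, None)          # scans only until the first match
second = next(matches, None)         # continues where the scan stopped

# Convert to a list only when indexing into the results is really needed.
all_matches = list(filter(is_close, points))
print(first, second, all_matches[1])
```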
Wrap the filter() call in list() and it works fine. That resolved the issue for me.
For example
list(filter(lambda x: x%2!=0, mylist))
instead of
filter(lambda x: x%2!=0, mylist)
image = list(filter(lambda i: abs(i[0] - ghost_x) < 1e-2 and abs(i[1] - ghost_y) < 1e-2, sheet))