I'm basically trying to do this (pseudo code, not valid python):
limit = 10
results = [xml_to_dict(artist) for artist in xml.findall('artist') while limit--]
So how could I code this in a concise and efficient way?
The XML file can contain anywhere between 0 and 50 artists, I can't control how many I get at a time, and AFAIK there's no XPath expression to say something like "get me up to 10 nodes".
Thanks!
Are you using lxml? You could use XPath to limit the items at the query level, e.g.
>>> from lxml import etree
>>> from io import StringIO
>>> xml = etree.parse(StringIO('<foo><bar>1</bar><bar>2</bar><bar>4</bar><bar>8</bar></foo>'))
>>> [bar.text for bar in xml.xpath('bar[position()<=3]')]
['1', '2', '4']
You could also use itertools.islice to limit any iterable, e.g.
>>> from itertools import islice
>>> [bar.text for bar in islice(xml.iterfind('bar'), 3)]
['1', '2', '4']
>>> [bar.text for bar in islice(xml.iterfind('bar'), 5)]
['1', '2', '4', '8']
Assuming that xml is an ElementTree object, the findall() method returns a list, so just slice that list:
limit = 10
limited_artists = xml.findall('artist')[:limit]
results = [xml_to_dict(artist) for artist in limited_artists]
For everyone else who found this question because they were trying to limit items returned from an infinite generator:
from itertools import takewhile
ltd = takewhile(lambda x: x[0] < MY_LIMIT, enumerate(MY_INFINITE_GENERATOR))
# ^ This is still an iterator.
# If you want to materialize the items, e.g. in a list, do:
ltd_m = list(ltd)
# If you don't want the enumeration indices, you can strip them as usual:
ltd_no_enum = [v for i, v in ltd_m]
EDIT: Actually, islice is a much better option.
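For comparison, a minimal sketch of the islice version, using itertools.count as a stand-in for the infinite generator:

```python
from itertools import count, islice

# islice caps any iterable at n items, with no enumerate bookkeeping;
# count() stands in here for an arbitrary infinite generator.
limited = list(islice(count(), 5))
print(limited)  # [0, 1, 2, 3, 4]
```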
limit = 10
limited_artists = first(xml.findall('artist'), limit)  # first() is defined below
results = [xml_to_dict(artist) for artist in limited_artists]
This avoids the issues of slicing: it doesn't change the order of operations, and doesn't construct a new list, which can matter for large lists if you're filtering the list comprehension.
def first(it, count):
    it = iter(it)
    for _ in range(count):
        try:
            yield next(it)
        except StopIteration:  # fewer than count items available: just stop
            return

print([i for i in first(range(1000), 5)])
It also works properly with generator expressions, where slicing will fall over due to memory use:
exp = (i for i in first(range(1000000000), 10000000))
for i in exp:
    print(i)
Related
I am checking a list of strings to see whether they contain specific substrings. Depending on the conditions, a string from the list is either added to another list or left out.
This is what I have so far. It works, but there are quite a few nested loops. Is there a better (or more pythonesque) way of writing this?
varNams = ['fee1.foo.bar','fee1.foo','fee2.foo.bar','fee2.foo']
selection = []
sub_incl = ['foo']
sub_excl = ['bar']
for i in range(len(varNams)):
    for sub_in in sub_incl:
        for sub_ex in sub_excl:
            if sub_in in varNams[i] and sub_ex not in varNams[i]:
                selection.append(varNams[i])
You can use itertools.product in a list comprehension.
from itertools import product
varNams = ['fee1.foo.bar','fee1.foo','fee2.foo.bar','fee2.foo']
sub_incl = ['foo']
sub_excl = ['bar']
res = [i for i in varNams for sub_in, sub_ex in product(sub_incl, sub_excl) if sub_in in i and sub_ex not in i]
print(res)
Output
['fee1.foo', 'fee2.foo']
My solution uses a regex:
import re
varNams = ['fee1.foo.bar','fee1.foo','fee2.foo.bar','fee2.foo']
selection = []
regex1='foo'
regex2='bar'
for i in varNams:
    if re.search(regex1, i) and not re.search(regex2, i):
        selection.append(i)
TL;DR: How can I best use map to filter a list based on logical indexing?
Given a list:
values = ['1', '2', '3', '5', 'N/A', '5']
I would like to map the following function and use the result to filter my list. I could do this with filter and other methods but mostly looking to learn if this can be done solely using map.
The function:
def is_int(val):
    try:
        x = int(val)
        return True
    except ValueError:
        return False
Attempted solution:
[x for x in list(map(is_int, values)) if x is False]
The above gives me the values I need. However, it does not return the index or allow logical indexing. I have tried to do other ridiculous things like:
[values[x] for x in list(map(is_int, values)) if x is False]
and many others that obviously don't work.
What I thought I could do:
values[[x for x in list(map(is_int, values)) if x is False]]
Expected outcome:
['N/A']
[v for v in values if not is_int(v)]
If you have a parallel list of booleans:
[v for v, b in zip(values, [is_int(x) for x in values]) if not b]
You can get the expected outcome using the simple snippet written below, which does not involve any map function:
[x for x in values if is_int(x) is False]
And if you want to strictly use the map function, then the snippet below will help you:
[values[i] for i,y in enumerate(list(map(is_int,values))) if y is False]
map is just not the right tool for the job, as that would transform the values, whereas you just want to check them. If anything, you are looking for filter, but you have to "inverse" the filter-function first:
>>> values = ['1', '2', "foo", '3', '5', 'N/A', '5']
>>> not_an_int = lambda x: not is_int(x)
>>> list(filter(not_an_int, values))
['foo', 'N/A']
In practice, however, I would rather use a list comprehension with a condition.
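For completeness, that list comprehension with a condition, reusing the asker's is_int (a small self-contained sketch):

```python
def is_int(val):
    try:
        int(val)
        return True
    except ValueError:
        return False

values = ['1', '2', "foo", '3', '5', 'N/A', '5']
# keep only the values that fail the int conversion
non_ints = [x for x in values if not is_int(x)]
print(non_ints)  # ['foo', 'N/A']
```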
You can do this using a bit of help from itertools and by negating the output of your original function since we want it to return True where it is not an int.
from itertools import compress
from operator import not_
list(compress(values, map(not_, map(is_int, values))))
['N/A']
You cannot use map() alone to perform a reduction. By its very definition, map() preserves the number of items (see e.g. here).
On the other hand, reduce operations are meant to do what you want. In Python these are normally implemented with a generator expression or, for the more functionally inclined programmers, with filter(). Other non-primitive approaches may exist, but they essentially boil down to one of the two, e.g.:
values = ['1', '2', '3', '5', 'N/A', '5']
list(filter(lambda x: not is_int(x), values))
# ['N/A']
Yet, if what you want is to combine the result of map() and use it for slicing, this cannot be done with Python alone.
However, NumPy supports precisely what you want except that the result will not be a list:
import numpy as np
np.array(values)[list(map(lambda x: not is_int(x), values))]
# array(['N/A'], dtype='<U3')
(Or you could have your own container defined in such a way as to implement this behavior).
That being said, it is quite common to use the following generator expression in Python in place of map() / filter().
filter(func, items)
is roughly equivalent to:
item for item in items if func(item)
while
map(func, items)
is roughly equivalent to:
func(item) for item in items
and their combination:
filter(f_func, map(m_func, items))
is roughly equivalent to:
m_func(item) for item in items if f_func(m_func(item))
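A quick self-contained check of that combined equivalence; m_func and f_func here are illustrative placeholders (note that filter tests the already-mapped value, not the original item):

```python
m_func = lambda x: x * 2   # example mapping
f_func = lambda x: x > 4   # example predicate, applied to mapped values

items = [1, 2, 3, 4]
combined = list(filter(f_func, map(m_func, items)))
genexpr = [m_func(item) for item in items if f_func(m_func(item))]
print(combined, genexpr)  # [6, 8] [6, 8]
```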
Not exactly what I had in mind, but something I learnt from this problem: we could do the following (which might be computationally less efficient). This is almost similar to @aws_apprentice's answer. Clearly one is better off using filter and/or a list comprehension:
from itertools import compress
list(compress(values, list(map(lambda x: not is_int(x), values))))
Or, as suggested by @aws_apprentice, simply:
from itertools import compress
list(compress(values, map(lambda x: not is_int(x), values)))
I ideally want to turn this: 100020630 into [100,020,630],
but so far I can only turn "100.020.630" into ["100","020","630"]:
def fulltotriple(x):
    X = x.split(".")
    return X
print(fulltotriple("192.123.010"))
For some additional info: my goal is to turn IP addresses into binary addresses, using this as a first step =)
Edit: I have not found any way on Stack Overflow of getting the list without the quotes around each element.
Here's one approach using a list comprehension:
s = '100020630'
[s[i:i + 3] for i in range(0, len(s), 3)]
# ['100', '020', '630']
If you want to handle IP addresses, you are doing it totally wrong.
An IPv4 address is a 32-bit number, not a 9-decimal-digit one. It is split into 4 sub-blocks, like: 192.168.0.1. BUT: in decimal view each block can be 3 digits, 2 digits, or any other combination. I recommend you use the ipaddress standard module:
import ipaddress
a = '192.168.0.1'
ip = ipaddress.ip_address(a)
ip.packed
will return you the packed binary format:
b'\xc0\xa8\x00\x01'
If you want to convert your IPv4 address to binary format, you can use this command (zero-padding each byte to 8 bits, so leading zeros are not lost):
''.join(format(i, '08b') for i in ip.packed)
It will return you this string:
'11000000101010000000000000000001'
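As a side note, an IPv4Address also converts to a plain int, so the whole 32-bit binary string can be produced in one formatting step (a sketch using the same standard ipaddress module):

```python
import ipaddress

ip = ipaddress.ip_address('192.168.0.1')
# int(ip) is the address as a 32-bit integer; '032b' zero-pads to 32 binary digits
binary = format(int(ip), '032b')
print(binary)  # 11000000101010000000000000000001
```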
You could use the wrap function from the standard-library textwrap module:
In [3]: s = "100020630"
In [4]: import textwrap
In [6]: textwrap.wrap(s, 3)
Out[6]: ['100', '020', '630']
Wraps the single paragraph in text (a string) so every line is at most width characters long. Returns a list of output lines, without final newlines.
If you want a list of ints:
[int(num) for num in textwrap.wrap(s, 3)]
Outputs (note that int() drops the leading zero; 020 would not even be a valid integer literal in Python 3):
[100, 20, 630]
You could use wrap, which comes with Python's textwrap module:
from textwrap import wrap
def fulltotriple(x):
    x = wrap(x, 3)
    return x
print(fulltotriple("100020630"))
Outputs:
['100', '020', '630']
You can use python built-ins for this:
text = '100020630'
# using wrap
from textwrap import wrap
wrap(text, 3)
>>> ['100', '020', '630']
# using map/zip (wrap in list() on Python 3, where map returns an iterator)
list(map(''.join, zip(*[iter(text)]*3)))
>>> ['100', '020', '630']
Use regex to find all matches of triplets \d{3}
import re
s = "100020630"  # renamed to avoid shadowing the built-in str

def fulltotriple(x):
    pattern = re.compile(r"\d{3}")
    return [int(found_match) for found_match in pattern.findall(x)]

print(fulltotriple(s))
Outputting:
[100, 20, 630]
def fulltotriple(data):
    result = []
    for i in range(0, len(data), 3):
        result.append(int(data[i:i + 3]))
    return result
print(fulltotriple("192123010"))
output:
[192, 123, 10]
I have a string array, for example ['a_text', 'b_text', 'ab_text', 'a_text']. I would like to get the number of objects that contain each prefix, such as ['a_', 'b_', 'ab_'], so the count for 'a_' would be 2.
so far I've been counting each by filtering the array e.g num_a = len(filter(lambda x: x.startswith('a_'), array)). I'm not sure if this is slower than looping through all the fields and incrementing each counter since I am filtering the array for each prefix I am counting. Are functions such as filter() faster than a for loop? For this scenario I don't need to build the filtered list if I use a for loop so that may make it faster.
Also perhaps instead of the filter I could use list comprehension to make it faster?
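For what it's worth, a single prefix can be counted without materializing a filtered list at all, by feeding a generator expression to sum() (a sketch of the idea, not a benchmark):

```python
array = ['a_text', 'b_text', 'ab_text', 'a_text']

# sum() consumes the generator lazily, so no intermediate list is built;
# note 'ab_text' does not start with 'a_', so it is not counted
num_a = sum(1 for x in array if x.startswith('a_'))
print(num_a)  # 2
```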
You can use collections.Counter with a regular expression (if all of your strings have prefixes):
import re
from collections import Counter
arr = ['a_text', 'b_text', 'ab_text', 'a_text']
Counter([re.match(r'^.*?_', i).group() for i in arr])
Output:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
If not all of your strings have prefixes, this will throw an error, since re.match will return None. If this is a possibility, just add an extra step:
arr = ['a_text', 'b_text', 'ab_text', 'a_text', 'test']
matches = [re.match(r'^.*?_', i) for i in arr]
Counter([i.group() for i in matches if i])
Output:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
Another way would be to use a defaultdict() object. You just go over the whole list once and count each prefix as you encounter it by splitting at the underscore. You need to check the underscore exists, else the whole word will be taken as a prefix (otherwise it wouldn't distinguish between 'a' and 'a_a').
from collections import defaultdict
array = ['a_text', 'b_text', 'ab_text', 'a_text'] * 250000
def count_prefixes(arr):
    counts = defaultdict(int)
    for item in arr:
        if '_' in item:
            counts[item.split('_')[0] + '_'] += 1
    return counts
The logic is similar to user3483203's answer, in that all prefixes are calculated in one pass. However, it seems invoking regex methods is a bit slower than simple string operations. But I also have to echo Michael's comment, in that the speed difference is insignificant for even 1 million items.
from timeit import timeit
setup = """
from collections import Counter, defaultdict
import re

array = ['a_text', 'b_text', 'ab_text', 'a_text'] * 250000

def with_defaultdict(arr):
    counts = defaultdict(int)
    for item in arr:
        if '_' in item:
            counts[item.split('_')[0] + '_'] += 1
    return counts

def with_counter(arr):
    matches = [re.match(r'^.*?_', i) for i in arr]
    return Counter([i.group() for i in matches if i])
"""

for method in ('with_defaultdict', 'with_counter'):
    print(timeit('{}(array)'.format(method), setup=setup, number=1))
Timing results:
0.4836089063341265
1.3238173544676142
If I'm understanding what you're asking for, it seems like you really want to use Regular Expressions (Regex). They're built for just this sort of pattern-matching use. I don't know Python, but I do see that regular expressions are supported, so it's a matter of using them. I use this tool because it makes it easy to craft and test your regex.
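Since that answer doesn't include any Python, here is a hedged sketch of the regex idea combined with collections.Counter (essentially the same pattern as the earlier Counter answer; the `^.*?_` prefix pattern is an assumption about what counts as a prefix):

```python
import re
from collections import Counter

arr = ['a_text', 'b_text', 'ab_text', 'a_text']
# ^.*?_ lazily matches everything up to and including the first underscore;
# strings without an underscore produce None and are skipped
matches = (re.match(r'^.*?_', s) for s in arr)
counts = Counter(m.group() for m in matches if m)
print(counts)  # Counter({'a_': 2, 'b_': 1, 'ab_': 1})
```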
You could also try using str.partition() to extract the string before the separator and the separator, then just concatenate these two to form the prefix. Then you just have to check if this prefix exists in the prefixes set, and count them with collections.Counter():
from collections import Counter
arr = ['a_text', 'b_text', 'ab_text', 'a_text']
prefixes = {'a_', 'b_', 'ab_'}
counter = Counter()
for word in arr:
    before, delim, _ = word.partition('_')
    prefix = before + delim
    if prefix in prefixes:
        counter[prefix] += 1

print(counter)
Which Outputs:
Counter({'a_': 2, 'b_': 1, 'ab_': 1})
I want part of a script I am writing to do something like this.
x=0
y=0
list=[["cat","dog","mouse",1],["cat","dog","mouse",2],["cat","dog","mouse",3]]
row=list[y]
item=row[x]
print(list.count(item))
The problem is that this will print 0, because it isn't searching the individual sublists. How can I make it return the total number of instances instead?
Search per sublist, adding up results per contained list with sum():
sum(sub.count(item) for sub in lst)
Demo:
>>> lst = [["cat","dog","mouse",1],["cat","dog","mouse",2],["cat","dog","mouse",3]]
>>> item = 'cat'
>>> sum(sub.count(item) for sub in lst)
3
sum() is a built-in function for adding up its arguments.
The x.count(item) for x in list part is a "generator expression" (similar to a list comprehension) - a handy way to produce values lazily in Python.
item_count = sum(x.count(item) for x in list)
That should do it
Using collections.Counter and itertools.chain.from_iterable:
>>> from collections import Counter
>>> from itertools import chain
>>> lst = [["cat","dog","mouse",1],["cat","dog","mouse",2],["cat","dog","mouse",3]]
>>> count = Counter(item for item in chain.from_iterable(lst) if not isinstance(item, int))
>>> count
Counter({'mouse': 3, 'dog': 3, 'cat': 3})
>>> count['cat']
3
I filtered out the ints because I didn't see why you had them in the first place.