generate list of tuples to import to pandas df

generate list of tuples to import to pandas df - python

I have found a list of consonant cluster with the following code:
list_2 = ['financial','disastrous','accuracy','important','numbers']
reg = r'[bdðfghjklmnprstvxþ]+'
d = []
largest = []
for w in list_2:
d.append(re.findall(reg, str(w), re.IGNORECASE))
print(d)
[['f', 'n', 'n', 'l'], ['d', 's', 'str', 's'], ['r'], ['mp', 'rt', 'nt'], ['n', 'mb', 'rs']]
I need to get the largest consonant count for each word to import as list (of tuples) to a pandas dataframe. I have tried various things but without success.

This should give you the int you are looking for:
def largest_cluster(cons_list):
return max(len(c) for c in cons_list)
Then you can get the tuples by:
tuples = [(w, largest_cluster(cons)) for w, cons in zip(list2, d)]

Works fine, now to import it to pandas with column labels. That will hopefully be OK.

If the word has no consonants I get ValueError: max() arg is an empty sequence. How do I fix that?

Related

Matching some values from a list with values from a column

I have the following issue:
I have to columns:
I have to create a list for each element from the first column. Precisely, I have to create the following lists:
the first list for element 1 should contain [a, g]
for element 2: [r, t], etc.
I tried something, but I something must not be correct
The first part of the code I used to find out which elements are duplicated in the first column
values = []
for i in range(2, ws11.max_row+1):
if ws11.cell(row=i, column=1).value in values:
pass
else:
values.append(ws11.cell(row=i, column=1).value)
print(values)
For the second part I wrote the following code:
listanew = []
for i in range(1, len(values)+1):
for j in range(2, ws11.max_row+1):
if ws11.cell(row=j, column=1).column == values[i-1]:
listanew.append(ws11.cell(row=j, column=2).value)
print(listanew)
When I tried to print my new list, I obtained an empty list.
Could you give me a solution?

I took the liberty to give names to the two columns; "A" and "B". Here is what you can do to group the data by column "A" and get a list of all the elements in "B":
df = pd.DataFrame({'A': [1,2,3,4,2,3,4,1],
'B': ['a', 'r', 'fg', 'h', 't', 'd', 'd', 'g']})
df = df.groupby('A')['B'].apply(list).reset_index()

Use a dictionary (a defaultdict for shorter code):
from collections import defaultdict
d=defaultdict(list)
for row in range(1,ws11.max_row+1):
d[ws11.cell(row,1).value].append(ws11.cell(row,2).value)
d
defaultdict(<class 'list'>, {1: ['a', 'g'], 2: ['r', 't'], 3: ['fg', 'd'], 4: ['h', 'd']})

How to find common elements from a list of lists such that order of occurrence is maintained?

I have a list of lists where the length of the lists are same. I need to find the common elements from them with the order of occurrence maintained.
For example:
Suppose the list of lists is [['a','e','d','c','f']['e','g','a','d','c']['c','a','h','e','j']]
The output list should contain ['a','e','c'] Priority should be given to elements which occur earlier in most of the lists. In this example 'a' occurs earlier, then 'e' and so on.
How to proceed with this?

you could find common items first then sorted it
from collections import defaultdict
data = [['a','e','d','c','f'],['e','g','a','d','c'],['c','a','h','e','j']]
common = set(data[0])
for line in data:
common = common.intersection(set(line))
res = defaultdict(int)
for line in data:
for idx, item in enumerate(line):
if item in common:
res[item] += idx
[item[0] for item in sorted(res.items(), key=lambda x: x[1])]
output:
['a', 'e', 'c']

Here's a quick solution that I managed to get working:
data = [['a', 'e', 'd', 'c', 'f'],
['e', 'g', 'a', 'd', 'c'], ['c', 'a', 'h', 'e', 'j']]
# count number of times each character appears
char_count = {}
for arr in data:
for char in arr:
if not char in char_count:
char_count.update({char: 1})
else:
char_count[char] += 1
# select characters that appear multiple times
common_chars = [i[0] for i in char_count.items() if i[1] > 1]
# remove characters that are not present in all lists
for char in common_chars:
count = 0
for arr in data:
if char in arr:
count += 1
if count < len(data):
common_chars.remove(char)
# final result with common characters
print(common_chars)
Resulting output:
['a', 'e', 'c']
Probably not the most efficient solution if you're working with lots of data though.

Flatten a list of strings to characters and then de-dupe the new list

I have the following and have flattened the list via this documentation
>>> wordlist = ['cat','dog','rabbit']
>>> letterlist = [lt for wd in wordlist for lt in wd]
>>> print(letterlist)
['c', 'a', 't', 'd', 'o', 'g', 'r', 'a', 'b', 'b', 'i', 't']
Can the list comprehension be extended to remove duplicate characters. The desired result is the following (in any order):
['a', 'c', 'b', 'd', 'g', 'i', 'o', 'r', 't']
I can convert to a set and then back to a list but I'd prefer to keep it as a list.

Easiest is to use a set comprehension instead of a list comp:
letterlist = {lt for wd in wordlist for lt in wd}
All I did was replace the square brackets with curly braces. This works in Python 2.7 and up.
For Python 2.6 and earlier, you'd use the set() callable with a generator expression instead:
letterlist = set(lt for wd in wordlist for lt in wd)
Last, but not least, you can replace the comprehension syntax altogether by producing the letters from all sequences by chaining the strings together, treat them all like one long sequence, with itertools.chain.from_iterable(); you give that a sequence of sequences and it'll give you back one long sequence:
from itertools import chain
letterlist = set(chain.from_iterable(wordlist))

Sets are an easy way to get unique elements from an iterable. To flatten a list of lists, itertools.chain provides a handy way to do that.
from itertools import chain
>>> set(chain.from_iterable(['cat','dog','rabbit'])
{'a', 'b', 'c', 'd', 'g', 'i', 'o', 'r', 't'}

I think set comprehension should be used
wordlist = ['cat','dog','rabbit']
letterlist = {lt for wd in wordlist for lt in wd}
print(letterlist)
this will work only in python 2.7 and higher
for previous versions use set instead of {}
wordlist = ['cat','dog','rabbit']
letterlist = set(lt for wd in wordlist for lt in wd)
print(letterlist)

What is the best way to generate all possible three letter strings?

I am generating all possible three letters keywords e.g. aaa, aab, aac.... zzy, zzz below is my code:
alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
keywords = []
for alpha1 in alphabets:
for alpha2 in alphabets:
for alpha3 in alphabets:
keywords.append(alpha1+alpha2+alpha3)
Can this functionality be achieved in a more sleek and efficient way?

keywords = itertools.product(alphabets, repeat = 3)
See the documentation for itertools.product. If you need a list of strings, just use
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 3)]
alphabets also doesn't need to be a list, it can just be a string, for example:
from itertools import product
from string import ascii_lowercase
keywords = [''.join(i) for i in product(ascii_lowercase, repeat = 3)]
will work if you just want the lowercase ascii letters.

You could also use map instead of the list comprehension (this is one of the cases where map is still faster than the LC)
>>> from itertools import product
>>> from string import ascii_lowercase
>>> keywords = map(''.join, product(ascii_lowercase, repeat=3))
This variation of the list comprehension is also faster than using ''.join
>>> keywords = [a+b+c for a,b,c in product(ascii_lowercase, repeat=3)]

from itertools import combinations_with_replacement
alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
for (a,b,c) in combinations_with_replacement(alphabets, 3):
print a+b+c

You can also do this without any external modules by doing simple calculation.
The PermutationIterator is what you are searching for.
def permutation_atindex(_int, _set, length):
"""
Return the permutation at index '_int' for itemgetter '_set'
with length 'length'.
"""
items = []
strLength = len(_set)
index = _int % strLength
items.append(_set[index])
for n in xrange(1,length, 1):
_int //= strLength
index = _int % strLength
items.append(_set[index])
return items
class PermutationIterator:
"""
A class that can iterate over possible permuations
of the given 'iterable' and 'length' argument.
"""
def __init__(self, iterable, length):
self.length = length
self.current = 0
self.max = len(iterable) ** length
self.iterable = iterable
def __iter__(self):
return self
def __next__(self):
if self.current >= self.max:
raise StopIteration
try:
return permutation_atindex(self.current, self.iterable, self.length)
finally:
self.current += 1
Give it an iterable object and an integer as the output-length.
from string import ascii_lowercase
for e in PermutationIterator(ascii_lowercase, 3):
print "".join(e)
This will start from 'aaa' and end with 'zzz'.

chars = range(ord('a'), ord('z')+1);
print [chr(a) + chr(b) +chr(c) for a in chars for b in chars for c in chars]

We could solve this without the itertools by utilizing two function definitions:
def combos(alphas, k):
l = len(alphas)
kRecur(alphas, "", l, k)
def KRecur(alphas, prfx, l, k):
if k==0:
print(prfx)
else:
for i in range(l):
newPrfx = prfx + alphas[i]
KRecur(alphas, newPrfx, l, k-1)
It's done using two functions to avoid resetting the length of the alphas, and the second function self-iterates itself until it reaches a k of 0 to return the k-mer for that i loop.
Adopted from a solution by Abhinav Ramana on Geeks4Geeks

Well, i came up with that solution while thinking about how to cover that topic:
import random
s = "aei"
b = []
lenght=len(s)
for _ in range(10):
for _ in range(length):
password = ("".join(random.sample(s,length)))
if password not in b:
b.append("".join(password))
print(b)
print(len(b))
Please let me describe what is going on inside:
Importing Random,
creating a string with letters that we want to use
creating an empty list that we will use to put our combinations in
and now we are using range (I put 10 but for 3 digits it can be less)
next using random.sample with a list and list length we are creating letter combinations and joining it.
in next steps we are checking if in our b list we have that combination - if so, it is not added to the b list. If current combination is not on the list, we are adding it to it. (we are comparing final joined combination).
the last step is to print list b with all combinations and print number of possible combinations.
Maybe it is not clear and most efficient code but i think it works...

print([a+b+c for a in alphabets for b in alphabets for c in alphabets if a !=b and b!=c and c!= a])
This removes the repetition of characters in one string

Ordered Sets Python 2.7

I have a list that I'm attempting to remove duplicate items from. I'm using python 2.7.1 so I can simply use the set() function. However, this reorders my list. Which for my particular case is unacceptable.
Below is a function I wrote; which does this. However I'm wondering if there's a better/faster way. Also any comments on it would be appreciated.
def ordered_set(list_):
newlist = []
lastitem = None
for item in list_:
if item != lastitem:
newlist.append(item)
lastitem = item
return newlist
The above function assumes that none of the items will be None, and that the items are in order (ie, ['a', 'a', 'a', 'b', 'b', 'c', 'd'])
The above function returns ['a', 'a', 'a', 'b', 'b', 'c', 'd'] as ['a', 'b', 'c', 'd'].

Another very fast method with set:
def remove_duplicates(lst):
dset = set()
# relies on the fact that dset.add() always returns None.
return [item for item in lst
if item not in dset and not dset.add(item)]

Use an OrderedDict:
from collections import OrderedDict
l = ['a', 'a', 'a', 'b', 'b', 'c', 'd']
d = OrderedDict()
for x in l:
d[x] = True
# prints a b c d
for x in d:
print x,
print

Assuming the input sequence is unordered, here's O(N) solution (both in space and time).
It produces a sequence with duplicates removed, while leaving unique items in the same relative order as they appeared in the input sequence.
>>> def remove_dups_stable(s):
... seen = set()
... for i in s:
... if i not in seen:
... yield i
... seen.add(i)
>>> list(remove_dups_stable(['q', 'w', 'e', 'r', 'q', 'w', 'y', 'u', 'i', 't', 'e', 'p', 't', 'y', 'e']))
['q', 'w', 'e', 'r', 'y', 'u', 'i', 't', 'p']

I know this has already been answered, but here's a one-liner (plus import):
from collections import OrderedDict
def dedupe(_list):
return OrderedDict((item,None) for item in _list).keys()
>>> dedupe(['q', 'w', 'e', 'r', 'q', 'w', 'y', 'u', 'i', 't', 'e', 'p', 't', 'y', 'e'])
['q', 'w', 'e', 'r', 'y', 'u', 'i', 't', 'p']

I think this is perfectly OK. You get O(n) performance which is the best you could hope for.
If the list were unordered, then you'd need a helper set to contain the items you've already visited, but in your case that's not necessary.

if your list isn't sorted then your question doesn't make sense.
e.g. [1,2,1] could become [1,2] or [2,1]
if your list is large you may want to write your result back into the same list using a SLICE to save on memory:
>>> x=['a', 'a', 'a', 'b', 'b', 'c', 'd']
>>> x[:]=[x[i] for i in range(len(x)) if i==0 or x[i]!=x[i-1]]
>>> x
['a', 'b', 'c', 'd']
for inline deleting see Remove items from a list while iterating or Remove items from a list while iterating without using extra memory in Python
one trick you can use is that if you know x is sorted, and you know x[i]=x[i+j] then you don't need to check anything between x[i] and x[i+j] (and if you don't need to delete these j values, you can just copy the values you want into a new list)
So while you can't beat n operations if everything in the set is unique i.e. len(set(x))=len(x)
There is probably an algorithm that has n comparisons as its worst case but can have n/2 comparisons as its best case (or lower than n/2 as its best case if you know somehow know in advance that len(x)/len(set(x))>2 because of the data you've generated):
The optimal algorithm would probably use binary search to find maximum j for each minimum i in a divide and conquer type approach. Initial divisions would probably be of length len(x)/approximated(len(set(x))). Hopefully it could be carried out such that even if len(x)=len(set(x)) it still uses only n operations.

There is unique_everseen solution described in
http://docs.python.org/2/library/itertools.html
def unique_everseen(iterable, key=None):
"List unique elements, preserving order. Remember all elements ever seen."
# unique_everseen('AAAABBBCCDAABBB') --> A B C D
# unique_everseen('ABBCcAD', str.lower) --> A B C D
seen = set()
seen_add = seen.add
if key is None:
for element in ifilterfalse(seen.__contains__, iterable):
seen_add(element)
yield element
else:
for element in iterable:
k = key(element)
if k not in seen:
seen_add(k)
yield element

Looks ok to me. If you really want to use sets do something like this:
def ordered_set (_list) :
result = set()
lastitem = None
for item in _list :
if item != lastitem :
result.add(item)
lastitem = item
return sorted(tuple(result))
I don't know what performance you will get, you should test it; probably the same because of method's overheat!
If you really are paranoid, just like me, read here:
http://wiki.python.org/moin/HowTo/Sorting/
http://wiki.python.org/moin/PythonSpeed/PerformanceTips
Just remembered this(it contains the answer):
http://www.peterbe.com/plog/uniqifiers-benchmark

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

generate list of tuples to import to pandas df - python

This should give you the int you are looking for: def largest_cluster(cons_list): return max(len(c) for c in cons_list) Then you can get the tuples by: tuples = [(w, largest_cluster(cons)) for w, cons in zip(list2, d)]

Works fine, now to import it to pandas with column labels. That will hopefully be OK.

If the word has no consonants I get ValueError: max() arg is an empty sequence. How do I fix that?

Related

Matching some values from a list with values from a column

How to find common elements from a list of lists such that order of occurrence is maintained?

Flatten a list of strings to characters and then de-dupe the new list

What is the best way to generate all possible three letter strings?

Ordered Sets Python 2.7

Categories

Resources