I have a pandas DataFrame with two columns:
sentence: 'fo n bar'
annotations: [B-inv, B-inv, O, I-acc, O, B-com, I-com, I-com]
I want to insert an additional 'O' element into the annotations list in front of each annotation starting with 'B', so that it looks like this:
[O, B-inv, O, B-inv, O, I-acc, O, O, B-com, I-com, I-com]
Then I want to insert an additional whitespace into the sentence in front of each character whose index matches a 'B' annotation in the initial list, i.e. in front of the characters at indices [0, 1, 5], which gives:
' f o n  bar'
Maybe to make it more visibly appealing I should represent it this way:
Initial sentence:
Ind  Sentence char  Annot
0    f              B-inv
1    o              B-inv
2    whitespace     O
3    n              I-acc
4    whitespace     O
5    b              B-com
6    a              I-com
7    r              I-com
End sentence:
Ind  Sentence char  Annot
0    whitespace     O
1    f              B-inv
2    whitespace     O
3    o              B-inv
4    whitespace     O
5    n              I-acc
6    whitespace     O
7    whitespace     O
8    b              B-com
9    a              I-com
10   r              I-com
Updated answer (list comprehension)
from itertools import chain
annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
sent = list('fo n bar')
annot, sent = list(map(lambda l: list(chain(*l)), list(zip(*[(['O', a], [' ', s]) if a.startswith('B') else ([a], [s]) for a,s in zip(annot, sent)]))))
print(annot)
print(''.join(sent))
chain from itertools lets you chain together a list of lists to form a single list. The rest is a somewhat clumsy use of zip together with argument unpacking (the * prefix in the call) to fit everything on one line. map is only used to apply the same operation to both lists.
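To see what the individual pieces of the one-liner do, here is a minimal sketch on toy data. The names pairs, annot_parts and sent_parts are made up for this illustration and do not appear in the answer's code.

```python
from itertools import chain

# Each tuple holds (annotation pieces, sentence pieces) for one character.
pairs = [(['O', 'B-inv'], [' ', 'f']), (['O'], [' '])]

# zip(*pairs) regroups by position: first all annotation pieces,
# then all sentence pieces.
annot_parts, sent_parts = zip(*pairs)
print(annot_parts)  # (['O', 'B-inv'], ['O'])
print(sent_parts)   # ([' ', 'f'], [' '])

# chain(*lists) flattens exactly one level of nesting.
flat_annot = list(chain(*annot_parts))
flat_sent = list(chain(*sent_parts))
print(flat_annot)   # ['O', 'B-inv', 'O']
print(flat_sent)    # [' ', 'f', ' ']
```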
But a more readable version, so you can also follow the steps better, could be:
# find where in the annotations the element starts with 'B'
loc = [a.startswith('B') for a in annot]
# Use this locator to add an element and Merge the list of lists with `chain`
annot = list(chain.from_iterable([['O', a] if l else [a] for a,l in zip(annot, loc)]))
sent = ''.join(chain.from_iterable([[' ', a] if l else [a] for a,l in zip(sent, loc)])) # same on sentence
Note that above, I do not use map because each list is processed separately, and there is less zipping and casting to lists. This is most likely the cleaner, and hence preferred, solution.
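The readable version can be wrapped in a small helper and checked against the example from the question. This is only a sketch: the function name pad_before_b and its pad_char and pad_tag parameters are hypothetical, introduced here for illustration.

```python
from itertools import chain

def pad_before_b(sent, annot, pad_char=' ', pad_tag='O'):
    """Insert pad_char / pad_tag in front of every position tagged 'B...'."""
    # True where the annotation starts with 'B'
    is_b = [a.startswith('B') for a in annot]
    new_annot = list(chain.from_iterable(
        [pad_tag, a] if b else [a] for a, b in zip(annot, is_b)))
    new_sent = ''.join(chain.from_iterable(
        [pad_char, s] if b else [s] for s, b in zip(sent, is_b)))
    return new_sent, new_annot

sent = 'fo n bar'
annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
new_sent, new_annot = pad_before_b(sent, annot)
print(repr(new_sent))  # ' f o n  bar'
print(new_annot)
```

Both results have 11 elements, so the sentence and the annotations stay aligned character by character.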
Old answer (pandas)
I am not sure it is the most convenient to do this on a DataFrame. It might be easier on a simple list, before converting to a DataFrame.
But anyway, here is a way through it, assuming you don't really have meaningful indices in your DataFrame (so that indices are simply the integer count of each row).
The trick is to use .str string functions, such as startswith in this case, to find matching strings in the column Series of interest. You can then loop over the matching indices ([0, 1, 5] in the example) and insert the row with the whitespace and 'O' data at a dummy location (a half index, e.g. 0.5 to place the row before row 1). Sorting by index with .sort_index() then rearranges all rows in the way you want.
import numpy as np
import pandas as pd
annot = ['B-inv', 'B-inv', 'O', 'I-acc', 'O', 'B-com', 'I-com', 'I-com']
sent = list('fo n bar')
df = pd.DataFrame({'sent':sent, 'annot':annot})
idx = np.argwhere(df.annot.str.startswith('B').values) # find rows where annotations start with 'B'
for i in idx.ravel(): # Loop over the indices before which we want to insert a new row
    df.loc[i-0.5] = [' ', 'O'] # made up indices so that the subsequent sorting will place the row where you want it
df.sort_index().reset_index(drop=True) # this will output the new DataFrame
I have a grouped list of strings that sort of looks like this, the lists inside of these groups will always contain 5 elements:
text_list = [['aaa','bbb','ccc','ddd','eee'],
['fff','ggg','hhh','iii','jjj'],
['xxx','mmm','ccc','bbb','aaa'],
['fff','xxx','aaa','bbb','ddd'],
['aaa','bbb','ccc','ddd','eee'],
['fff','xxx','aaa','ddd','eee'],
['iii','xxx','ggg','jjj','aaa']]
The objective is simple: group all of the lists that are similar, where the first 3 elements of each list are compared against all of the elements of the other lists.
So from the above example the output might look like this (output is the index of the list):
[[0,2,4],[3,5]]
Notice that if another list contains the same elements but in a different order, it is removed.
I've written the following code to extract the groups, but it returns duplicates and I am unsure how to proceed. I also think this might not be the most efficient way to do the extraction, as the real list can contain upwards of millions of groups:
grouped_list = []
for i in range(0,len(text_list)):
    int_temp = []
    for m in range(0,len(text_list)):
        if i == m:
            continue
        bool_check = all( x in text_list[m] for x in text_list[i][0:3])
        if bool_check:
            if len(int_temp) == 0:
                int_temp.append(i)
                int_temp.append(m)
                continue
            int_temp.append(m)
    grouped_list.append(int_temp)
## remove index with no groups
grouped_list = [x for x in grouped_list if x != []]
Is there a better way to go about this? How do I remove the duplicate group afterwards? Thank you.
Edit:
To be clearer, I would like to retrieve the lists that are similar to each other, but only using the first 3 elements of each list. For example, using the first 3 elements of list A, check whether lists B, C, D... contain all 3 of those elements. Repeat for the entire list, then remove any list that contains duplicate elements.
You can build a set of frozensets to keep track of indices of groups with the first 3 items being a subset of the rest of the members:
groups = set()
sets = list(map(set, text_list))
for i, lst in enumerate(text_list):
    groups.add(frozenset((i, *(j for j, s in enumerate(sets) if set(lst[:3]) <= s))))
print([sorted(group) for group in groups if len(group) > 1])
If the input list is long, it would be faster to create a set of frozensets of the first 3 items of all sub-lists and use the set to filter all combinations of 3 items from each sub-list, so that the time complexity is essentially linear to the input list rather than quadratic despite the overhead in generating combinations:
from itertools import combinations
sets = {frozenset(lst[:3]) for lst in text_list}
groups = {}
for i, lst in enumerate(text_list):
    for c in map(frozenset, combinations(lst, 3)):
        if c in sets:
            groups.setdefault(c, []).append(i)
print([sorted(group) for group in groups.values() if len(group) > 1])
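As a sketch, the combinations approach can be wrapped in a function and run on the question's data. The function name group_by_first_three and the parameter k are made up for illustration. The sketch assumes the first k items of every sub-list are distinct, and it deduplicates each sub-list with set() so that a repeated item cannot add the same index twice.

```python
from itertools import combinations

def group_by_first_three(text_list, k=3):
    # Keys: the first k items of every sub-list, order ignored.
    keys = {frozenset(lst[:k]) for lst in text_list}
    groups = {}
    for i, lst in enumerate(text_list):
        # Every k-item subset of this sub-list that matches a key
        # puts index i into that key's group.
        for combo in combinations(sorted(set(lst)), k):
            key = frozenset(combo)
            if key in keys:
                groups.setdefault(key, []).append(i)
    # Keep only groups with more than one member.
    return sorted(g for g in groups.values() if len(g) > 1)

text_list = [['aaa', 'bbb', 'ccc', 'ddd', 'eee'],
             ['fff', 'ggg', 'hhh', 'iii', 'jjj'],
             ['xxx', 'mmm', 'ccc', 'bbb', 'aaa'],
             ['fff', 'xxx', 'aaa', 'bbb', 'ddd'],
             ['aaa', 'bbb', 'ccc', 'ddd', 'eee'],
             ['fff', 'xxx', 'aaa', 'ddd', 'eee'],
             ['iii', 'xxx', 'ggg', 'jjj', 'aaa']]
result = group_by_first_three(text_list)
print(result)  # [[0, 2, 4], [3, 5]]
```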
I have a string and 2 arrays like below:
st="a1b2c3d"
arr1 = ['1','2','3']
arr2 = ['X','Y','Z']
I want to replace all the value of '1', '2', '3' to 'X', 'Y', 'Z'. The final string will look like:
'aXbYcZd'
So I wrote this for loop:
for i in range(0, len(arr1)):
    st.replace(str(arr1[i]),str(arr2[i]))
The result is:
'aXb2c3d'
'a1bYc3d'
'a1b2cZd'
How to correctly do what I want above?
Thanks!
Use zip() to iterate through two lists simultaneously to replace values:
st = "a1b2c3d"
arr1 = ['1','2','3']
arr2 = ['X','Y','Z']
for x, y in zip(arr1, arr2):
    st = st.replace(x, y)
print(st)
# aXbYcZd
str.replace() does not modify a string in place (strings are immutable); it returns a new string, so you need to assign the returned value back to a variable.
If you're replacing characters, instead of the inefficient replace loop use str.translate with str.maketrans:
>>> table = str.maketrans('123', 'XYZ')
>>> result = 'a1b2c3d'.translate(table)
>>> result
'aXbYcZd'
maketrans requires 2 strings as arguments. If you really have a list, you can use ''.join(l) to make it into a suitable string. You need to make the table only once.
The efficiency is but one point. str.translate is the way to do this correctly in cases where you will map a => b and b => something else. If you want to replace strings then you might need to use re.sub instead.
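A small sketch of why this matters, using a made-up mapping ('a' to 'b', 'b' to 'c'). The variable names and sample strings are illustrative assumptions and are not taken from the question.

```python
import re

text = 'abc'
mapping = {'a': 'b', 'b': 'c'}

# Chained replace: the 'b' produced by the first step is replaced again.
chained = text
for old, new in mapping.items():
    chained = chained.replace(old, new)
print(chained)     # ccc

# translate looks at each original character exactly once.
translated = text.translate(str.maketrans(mapping))
print(translated)  # bcc

# For multi-character keys, a single re.sub pass with a lookup
# function avoids re-replacing earlier results in the same way.
words = {'cat': 'dog', 'dog': 'bird'}
pattern = re.compile('|'.join(map(re.escape, words)))
swapped = pattern.sub(lambda m: words[m.group(0)], 'cat chases dog')
print(swapped)     # dog chases bird
```

If some keys are prefixes of others, the order of the alternatives in the pattern matters; sorting the keys longest first is one way to handle that.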
Calling replace over and over means you have to iterate through the entire string for each replacement, which is O(m * n). Instead:
rep = dict(zip(arr1, arr2)) # make mapping, O(m)
result = ''.join(rep.get(ch, ch) for ch in st)
The first line is O(m), where m is the length of arr1 and arr2.
The second line is O(n), where n is the length of st.
In total this is O(m + n) instead of O(m * n), which is a significant win if either m or n is large.
I have a tab-delimited text file of letters and a numpy array (obj) with a few letters (single row). The rows of the text file have different numbers of columns. Some rows may contain multiple copies of the same letter (I would like to consider only a single copy of a letter in each row). Letters in the same row of the text file are assumed to be similar to each other. Also, each letter of the numpy array obj is present in one or more rows of the text file.
Below is an example of the text file (you can download the file from here ):
b q a i m l r
j n o r o
e i k u i s
In the above example, the letter o appears twice in the second row, and the letter i appears twice in the third row. I would like to consider only a single copy of each letter in each row of the text file.
This is an example of obj: obj = np.asarray(['a', 'e', 'i', 'o', 'u'])
I want to compare obj with rows of the text file and form clusters from elements in obj.
This is how I want to do it. Corresponding to each row of the text file, I want to have a list which denotes a cluster (in the above example we will have three clusters, since the text file has three rows). For every given element of obj, I want to find the rows of the text file where the element is present. Then, I would like to assign the index of that element of obj to the cluster which corresponds to the row with maximum length (the lengths of rows are computed with each row holding a single copy of each letter).
Below is a python code that I have written for this task
import pandas as pd
import numpy as np
data = pd.read_csv('file.txt', sep=r'\t+', header=None, engine='python').values[:,:].astype('<U1000')
obj = np.asarray(['a', 'e', 'i', 'o', 'u'])
for i in range(data.shape[0]):
    globals()['data_row' + str(i).zfill(3)] = []
    globals()['clust' + str(i).zfill(3)] = []
    for j in range(len(obj)):
        if obj[j] in set(data[i, :]): globals()['data_row' + str(i).zfill(3)] += [j]
for i in range(len(obj)):
    globals()['obj_lst' + str(i).zfill(3)] = [0]*data.shape[0]
    for j in range(data.shape[0]):
        if i in globals()['data_row' + str(j).zfill(3)]:
            globals()['obj_lst' + str(i).zfill(3)][j] = len(globals()['data_row' + str(j).zfill(3)])
    indx_max = globals()['obj_lst' + str(i).zfill(3)].index( max(globals()['obj_lst' + str(i).zfill(3)]) )
    globals()['clust' + str(indx_max).zfill(3)] += [i]
for i in range(data.shape[0]): print(globals()['clust' + str(i).zfill(3)])
>> [0]
>> [3]
>> [1, 2, 4]
The above code gives me the right answer. But, in my actual work, the text file has tens of thousands of rows, and the numpy array has hundreds of thousands of elements. And, the above given code is not very fast. So, I want to know if there is a better (faster) way to implement the above functionality and aim (using python).
You can do it using merge after a stack on data (in pandas), then some groupby with nunique or idxmax to get what you want
#keep data in pandas
data = pd.read_csv('file.txt', sep=r'\t+', header=None, engine='python')
obj = np.asarray(['a', 'e', 'i', 'o', 'u'])
#merge to keep only the letters from obj
df = (data.stack().reset_index(0,name='l')
.merge(pd.DataFrame({'l':obj})).set_index('level_0'))
#get the number of unique elements of obj in each row of data
# and use transform to keep this length along each row of df
df['len'] = df.groupby('level_0').transform('nunique')
#get the result you want in a series
res = (pd.DataFrame({'data_row':df.groupby('l')['len'].idxmax().values})
.groupby('data_row').apply(lambda x: list(x.index)))
print(res)
data_row
0 [0]
1 [3]
2 [1, 2, 4]
dtype: object
res contains the clusters with the index being the row in the original data
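If pandas is not a requirement, the same clustering can be sketched with plain sets and dictionaries. This is an alternative technique, not the approach above. The function name cluster_indices is made up, the rows are given inline instead of being read from file.txt, the letters in obj are assumed to be distinct, and, as in the question's code, a row's length is taken to be the number of distinct obj letters it contains, with ties going to the earliest row.

```python
def cluster_indices(rows, obj):
    # Position of each letter inside obj.
    position = {letter: i for i, letter in enumerate(obj)}
    # For every row: the set of obj indices present (duplicates collapse).
    members = [{position[x] for x in row if x in position} for row in rows]
    # Inverted index: obj index -> rows that contain it, in row order.
    rows_with = {}
    for r, present in enumerate(members):
        for i in present:
            rows_with.setdefault(i, []).append(r)
    clusters = [[] for _ in rows]
    for i in range(len(obj)):
        candidates = rows_with.get(i, [])
        if candidates:
            # max() returns the first maximal element, i.e. the earliest row.
            best = max(candidates, key=lambda r: len(members[r]))
            clusters[best].append(i)
    return clusters

rows = [list('bqaimlr'), list('jnoro'), list('eikuis')]
obj = ['a', 'e', 'i', 'o', 'u']
clusters = cluster_indices(rows, obj)
print(clusters)  # [[0], [3], [1, 2, 4]]
```

Each row and each obj element is visited once when building the inverted index, so the work grows with the total number of letters rather than with rows times elements.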
Say I have a string list:
li = ['a', 'b', 'c']
I would like to construct a new list such that each entry of the new list is a concatenation of a selection of 3 entries in the original list. Note that each entry can be chosen repeatedly:
new_li=['abc', 'acb', 'bac', 'bca', 'cab', 'cba', 'aab', 'aac',....'aaa', 'bbb', 'ccc']
The brute-force way is to construct a 3-fold nested for loop and insert each 3-combination into the new list. I was wondering if there is any Pythonic way to deal with that? Thanks.
Update:
Later I will convert the new list into a set, so the order does not matter anyway.
This looks like a job for itertools.product.
import itertools
def foo(l):
    yield from itertools.product(*([l] * 3))
for x in foo('abc'):
    print(''.join(x))
aaa
aab
aac
aba
abb
abc
aca
acb
acc
baa
bab
bac
bba
bbb
bbc
bca
bcb
bcc
caa
cab
cac
cba
cbb
cbc
cca
ccb
ccc
yield from is available from Python 3.3 onwards. For older versions, yield within a loop:
def foo(l):
    for i in itertools.product(*([l] * 3)):
        yield i
The best way to get all combinations with repetition (the Cartesian product of the list with itself) is to use itertools.product with the repeat argument set to the desired length, which here happens to equal len(li) (that's where it differs from the other answer):
from itertools import product
li = ['a', 'b', 'c']
for comb in product(li, repeat=len(li)):
    print(''.join(comb))
or if you want the result as list:
>>> combs = [''.join(comb) for comb in product(li, repeat=len(li))]
>>> combs
['aaa', 'aab', 'aac', 'aba', 'abb', 'abc', 'aca', 'acb', 'acc', 'baa',
'bab', 'bac', 'bba', 'bbb', 'bbc', 'bca', 'bcb', 'bcc', 'caa', 'cab',
'cac', 'cba', 'cbb', 'cbc', 'cca', 'ccb', 'ccc']
It's a bit cleaner to use the repeat argument than to multiply and unpack the list you have manually.
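The repeat argument is the length of each result and does not have to equal the number of letters. A minimal sketch, with a made-up helper name all_words, that also checks the result against the triple loop from the question:

```python
from itertools import product

def all_words(alphabet, length):
    """Every string of the given length over alphabet, repetition allowed."""
    return [''.join(p) for p in product(alphabet, repeat=length)]

li = ['a', 'b', 'c']
words = all_words(li, 3)
print(len(words))   # 27, i.e. 3 ** 3
print(words[:4])    # ['aaa', 'aab', 'aac', 'aba']

# Same set as the 3-fold nested loop.
nested = {a + b + c for a in li for b in li for c in li}
print(set(words) == nested)  # True

# Length 2 over the same three letters gives 3 ** 2 results.
pairs = all_words(li, 2)
print(len(pairs))   # 9
```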
An alternate approach using list comprehension:
li = ['a', 'b', 'c']
new_li = [a+b+c for a in li for b in li for c in li]
import itertools
repeat = int(input("Enter length: "))
def password():
    def foo(l):
        yield from itertools.product(*([l] * repeat))
    for x in foo('abcdefghijklmnopqrstuvwxyz'):
        # you could also use string.ascii_lowercase or ["a","b","c"]
        print(''.join(x))
password()
I'll show you a way to do this without any libraries so that you can understand the logic behind how to achieve it.
First, we need to understand how to achieve all combinations mathematically.
Let's take a look at the pattern of every possible combination of characters ranging from a-b with a length of '1'.
a
b
Not much to see but from what we can see, there is one set of each character in the list. Let's increase our string length to '2' and see what pattern emerges.
aa
ab
ba
bb
So looking at this pattern, we see a new column has been added. The far right column is the same as the first example, with there being only 1 set of characters, but it's looped this time. The column on the far left has 2 set of characters. Could it be that for every new column added, one more set of characters is added? Let's take a look and find out by increasing the string length to '3'.
aaa
aab
aba
abb
baa
bab
bba
bbb
We can see the two columns on the right have stayed the same and the new column on the left has 4 of each character! Not what we were expecting. So the number of characters doesn't increase by 1 for each column. Instead, if you notice the pattern, it is actually increasing by powers of 2.
The first column with only '1' set of characters : 2 ^ 0 = 1
The second column with '2' sets of characters : 2 ^ 1 = 2
The third column with '4' sets of characters : 2 ^ 2 = 4
So the answer here is: with each new column added, the number of each character in the column is determined by its position as a power, with the first column on the right being x ^ 0, then x ^ 1, then x ^ 2... and so on.
But what is x? In the example I gave x = 2. But is it always 2? Let's take a look.
I will now give an example of each possible combination of characters from range a-c
aa
ab
ac
ba
bb
bc
ca
cb
cc
If we count how many characters are in the first column on the right, there is still only one of each character every time it loops. This is because the very first column on the right will always be equal to x ^ 0, and anything to the power of 0 is always 1. But if we look at the second column, we see 3 of each character for every loop. So if x ^ 1 is for the second column, then x = 3. From the first example with a range of a-b (range of 2) to the second example with a range of a-c (range of 3), it seems that x is always the number of characters used in your combinations.
With this first pattern recognised, we can start building a function that can identify what each column should represent. If we want to build every combination of characters from range a-b with a string length of 3, then we need a function that understands that the sets of characters in each column will be as follows: [4, 2, 1].
Now create a function that finds how many characters of each set should be in each column by returning a list of numbers that represent the total number of characters in a column based on its position. We do this using powers.
Remember, if we use a range of characters from a-b (2), then each column should have a total of x ^ y characters for each set, where x represents the number of characters being used and y represents the column position, with the very first column on the right being column number 0.
Example:
A combination of characters ranging from ['a', 'b'] with a string length of 3 will have a total of 4 a's and b's in the far left column for each set, a total of 2 a's and b's in the next for each set and a total of 1 a's and b's in the last for each set.
To return a list with this total number of characters respective to their columns as so [4, 2, 1] we can do this
def getCharPower(stringLength, charRange):
    charpowers = []
    for x in range(0, stringLength):
        charpowers.append(len(charRange)**(stringLength - x - 1))
    return charpowers
With the above function - if we want to create every possible combination of characters that range from a-b (2) and have a string length of 4, like so
aaaa
aaab
aaba
aabb
abaa
abab
abba
abbb
baaa
baab
baba
babb
bbaa
bbab
bbba
bbbb
which have a total set of (8) a's and b's, (4) a's and b's, (2) a's and b's, and (1) a's and b's, then we want to return a list of [8, 4, 2, 1]. The stringLength is 4 and our charRange is ['a', 'b'] and the result from our function is [8, 4, 2, 1].
So now all we have to do is print out each character x number of times depending on the value of its column placement from our returned list.
In order to do this, though, we need to find out how many times each set is printed in its column. Take a look at the first column on the right of the previous combination example. Although a and b are only printed once per set, the set loops and prints out the same thing 7 more times (8 total). If the string was only 3 characters in length, then it would loop a total of 4 times.
The reason for this is that the length of our strings determines how many combinations there will be in total. The formula for working this out is x ^ y = a, where x equals the number of characters in our range, y equals the length of the string and a equals the total number of combinations that are possible within those specifications.
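One way to check the formula, as a sketch: if the combinations are numbered from 0 to x ^ y - 1, then the character in column c (counted from the right, starting at 0) of combination number k sits at position (k // x ^ c) % x in the character list. The function nth_combination and its parameter names are made up for this illustration.

```python
def nth_combination(k, string_length, char_range):
    """Build combination number k directly from the column powers."""
    base = len(char_range)
    chars = []
    # The leftmost column has the highest power, so count down to column 0.
    for column in range(string_length - 1, -1, -1):
        chars.append(char_range[(k // base**column) % base])
    return ''.join(chars)

char_range = ['a', 'b']
total = len(char_range) ** 3          # x ^ y = 2 ^ 3 = 8
rows = [nth_combination(k, 3, char_range) for k in range(total)]
print(total)  # 8
print(rows)   # ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba', 'bbb']
```

The output matches the length-3 listing for a-b shown earlier, column by column.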
So to finalise this problem, our solution is to figure out
How many characters in each set go into each column
How many times to repeat each set in each column
Our first option has already been solved with our previously created function.
Our second option can be solved by finding out how many combinations there are in total by calculating charRange ^ stringLength. Then running through a loop, we add how many sets of characters there are until a (total number of possible combinations) has been reached in that column. Run that for each column and you have your result.
Here is the function that solves this
def Generator(stringLength, charRange):
    workbench = []
    results = []
    charpowers = getCharPower(stringLength, charRange)
    for x in range(0, stringLength):
        while len(workbench) < len(charRange)**stringLength:
            for char in charRange:
                for z in range(0, charpowers[x]):
                    workbench.append(char)
        results.append(workbench)
        workbench = []
    results = ["".join(result) for result in list(zip(*results))]
    return results
That function will return every possible combination of characters and of string length that you provide.
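A quick check, as a sketch: the two functions are repeated here with their indentation written out so the block runs on its own, and the result is compared against itertools.product. The comparison is an addition used purely as a reference, not part of the library-free approach.

```python
from itertools import product

def getCharPower(stringLength, charRange):
    charpowers = []
    for x in range(0, stringLength):
        charpowers.append(len(charRange)**(stringLength - x - 1))
    return charpowers

def Generator(stringLength, charRange):
    workbench = []
    results = []
    charpowers = getCharPower(stringLength, charRange)
    for x in range(0, stringLength):
        # Fill one column: repeat each character charpowers[x] times
        # until the column holds one entry per combination.
        while len(workbench) < len(charRange)**stringLength:
            for char in charRange:
                for z in range(0, charpowers[x]):
                    workbench.append(char)
        results.append(workbench)
        workbench = []
    # Transpose the columns into rows and join each row into a string.
    return ["".join(result) for result in zip(*results)]

powers = getCharPower(4, ['a', 'b'])
generated = Generator(3, ['a', 'b'])
reference = [''.join(p) for p in product('ab', repeat=3)]
print(powers)                  # [8, 4, 2, 1]
print(generated == reference)  # True
```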
A much simpler way of approaching this problem would be to just run one for loop per character position.
So to create every possible combination of characters ranging from a-b with a length of 2
characters = ['a', 'b']
for charone in characters:
    for chartwo in characters:
        print(charone+chartwo)
Although this is a lot simpler, it is limited. This code only prints every combination with a length of 2. To create more than this, we would have to manually add another for loop each time we wanted to change it. The functions I provided before this code, however, will produce every combination for whatever string length you give them, making them 100% adaptable and the best way to solve this issue manually yourself without any libraries.
My list looks like this :
['', 'CCCTTTCGCGACTAGCTAATCTGGCATTGTCAATACAGCGACGTTTCCGTTACCCGGGTGCTGACTTCATACTT
CGAAGA', 'ACCGGGCCGCGGCTACTGGACCCATATCATGAACCGCAGGTG', '', '', 'AGATAAGCGTATCACG
ACCTCGTGATTAGCTTCGTGGCTACGGAAGACCGCAACAGGCCGCTCTTCTGATAAGTGTGCGG', '', '', 'ATTG
TCTTACCTCTGGTGGCATTGCAACAATGCAAATGAGAGTCACAAGATTTTTCTCCGCCCGAGAATTTCAAAGCTGT', '
TGAAGAGAGGGTCGCTAATTCGCAATTTTTAACCAAAAGGCGTGAAGGAATGTTTGCAGCTACGTCCGAAGGGCCACATA
', 'TTTTTTTAGCACTATCCGTAAATGGAAGGTACGATCCAGTCGACTAT', '', '', 'CCATGGACGGTTGGGGG
CCACTAGCTCAATAACCAACCCACCCCGGCAATTTTAACGTATCGCGCGGATATGTTGGCCTC', 'GACAGAGACGAGT
TCCGGAACTTTCTGCCTTCACACGAGCGGTTGTCTGACGTCAACCACACAGTGTGTGTGCGTAAATT', 'GGCGGGTGT
CCAGGAGAACTTCCCTGAAAACGATCGATGACCTAATAGGTAA', '']
Those are sample DNA sequences read from a file. The list can vary in length, and a sequence can have anywhere from 10 to 10,000 letters. In the source file, sequences are delimited by empty lines, hence the empty items in the list. How can I join all items in between the empty ones?
Try this, it's a quick and dirty solution that works fine, but won't be efficient if the input list is really big:
lst = ['GATTACA', 'etc']
[x for x in ''.join(',' if not e else e for e in lst).split(',') if x]
This is how it works, using generator expressions and list comprehensions from the inside-out:
',' if not e else e for e in lst : replace all '' strings in the list with ','
''.join(',' if not e else e for e in lst) : join together all the strings. Now the sequences are separated by one or more , characters
''.join(',' if not e else e for e in lst).split(',') : split the string at the points where there are , characters, this produces a list
[x for x in ''.join(',' if not e else e for e in lst).split(',') if x] : finally, remove the empty strings, leaving a list of sequences
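The four steps can be followed on a small made-up list by keeping each intermediate value. The sample data and the variable names marked, joined and pieces are illustrative only. The trick assumes that no sequence contains a comma, which holds for DNA letters.

```python
lst = ['', 'GAT', 'TACA', '', '', 'CCG', '']

# Step 1: replace every empty string with a ',' marker.
marked = [',' if not e else e for e in lst]
print(marked)     # [',', 'GAT', 'TACA', ',', ',', 'CCG', ',']

# Step 2: join everything into one string.
joined = ''.join(marked)
print(joined)     # ,GATTACA,,CCG,

# Step 3: split at the markers; consecutive markers give empty strings.
pieces = joined.split(',')
print(pieces)     # ['', 'GATTACA', '', 'CCG', '']

# Step 4: drop the empty strings.
sequences = [x for x in pieces if x]
print(sequences)  # ['GATTACA', 'CCG']
```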
Alternatively, the same functionality could be written in a longer way using explicit loops, like this:
answer = [] # final answer
partial = [] # partial answer
for e in lst:
    if e == '': # if current element is an empty string …
        if partial: # … and there's a partial answer
            answer.append(''.join(partial)) # join and append partial answer
            partial = [] # reset partial answer
    else: # otherwise it's a new element of partial answer
        partial.append(e) # add it to partial answer
else: # this part executes after the loop exits
    if partial: # if one partial answer is left
        answer.append(''.join(partial)) # add it to final answer
The idea is the same: we keep track of the non empty-strings and accumulate them, and whenever an empty string is found, we add all the accumulated values to the answer, taking care of adding the last sublist after the loop ends. The result ends up in the answer variable, and this solution only makes a single pass across the input.
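As another single-pass option, itertools.groupby can do the accumulation. This is a different technique from the loop above, shown only as a sketch on made-up sample data: grouping by bool puts consecutive non-empty strings into one group and consecutive empty strings into another.

```python
from itertools import groupby

lst = ['', 'GAT', 'TACA', '', '', 'CCG', '']

# bool('') is False and bool of any non-empty string is True, so each
# run of non-empty strings forms one group that is joined into a sequence.
sequences = [''.join(group)
             for non_empty, group in groupby(lst, key=bool)
             if non_empty]
print(sequences)  # ['GATTACA', 'CCG']
```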