split and flatten tuple of tuples - python

What is the best way to split and flatten the tuple of tuples below?
I have this tuple of tuples:
(('aaaa_BBB_wacker* cccc',), ('aaaa_BBB_tttt*',), ('aaaa_BBB2_wacker,aaaa_BBB',), ('BBB_ffff',), ('aaaa_BBB2MM*\r\naaaa_BBB_cccc2MM*',), ('BBBMM\\r\\nBBB2MM BBB',), ('aaaa_BBB_cccc2MM_tttt',), ('aaaa_BBB_tttt, aaaa_BBB',))
I need to:
split by \n, \r, \\n, \\r, "," and " "
and flatten it. So the end result should look like this:
['aaaa_BBB_wacker*','cccc', 'aaaa_BBB_tttt*','aaaa_BBB2_wacker','aaaa_BBB','BBB_ffff','aaaa_BBB2MM*','naaaa_BBB_cccc2MM*','BBBMM','BBB2MM BBB','aaaa_BBB_cccc2MM_tttt','aaaa_BBB_tttt', 'aaaa_BBB']
I tried the following and it eventually completes the job but I have to repeat it multiple times for each pattern.
patterns = [[i.split('\\r') for i in patterns]]
patterns = [item for sublist in patterns for item in sublist]
patterns = [item for sublist in patterns for item in sublist]
patterns = [[i.split('\\n') for i in patterns]]

You should use a regexp to split the strings:
import re
re.split(r'[\n\r, ]+', s)
It will be easier using a loop:
patterns = []
for tup in l:  # l is the tuple of tuples
    for item in tup:
        patterns += re.split(r'[\n\r, ]+', item)

Given
tups = (('aaaa_BBB_wacker* cccc',), ('aaaa_BBB_tttt*',),
('aaaa_BBB2_wacker,aaaa_BBB',), ('BBB_ffff',),
('aaaa_BBB2MM*\r\naaaa_BBB_cccc2MM*',), ('BBBMM\\r\\nBBB2MM BBB',),
('aaaa_BBB_cccc2MM_tttt',), ('aaaa_BBB_tttt, aaaa_BBB',))
Do
import re
delimiters = ('\r', '\n', ',', ' ', '\\r', '\\n')
pat = '(?:{})+'.format('|'.join(map(re.escape, delimiters)))
result = [s for tup in tups for item in tup for s in re.split(pat, item)]
Notes. Calling re.escape on your delimiters makes sure that they are properly escaped for your regular expression. | makes them alternatives. ?: makes your delimiter group non-capturing so it isn't returned by re.split. + means match the previous group one or more times.
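As a rough sketch of the notes above, the same pattern can be wrapped in a small helper. The function name split_and_flatten, the sample data, and the extra filter that drops empty strings are assumptions added for illustration, not part of the answer.

```python
import re

def split_and_flatten(tups, delimiters=('\r', '\n', ',', ' ', '\\r', '\\n')):
    """Split every string in a tuple of tuples on any run of delimiters."""
    # Escape each delimiter, join them as alternatives, and let + collapse runs such as ', '.
    pat = re.compile('(?:{})+'.format('|'.join(map(re.escape, delimiters))))
    # re.split yields '' when a string starts or ends with a delimiter, so skip empty pieces.
    return [s for tup in tups for item in tup for s in pat.split(item) if s]

# Hypothetical sample covering real CR/LF, literal backslash sequences, commas and spaces.
sample = (('a b',), ('c,d',), ('e\r\nf',), ('g\\r\\nh, i',), (' j ',))
print(split_and_flatten(sample))
# ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```

Compiling the pattern once keeps the comprehension short and avoids rebuilding the regex for every string.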

Here is a one-liner.. but it's not simple. You can add as many items as you want in the replace portion, just keep adding them.
start = (('aaaa_BBB_wacker* cccc',), ('aaaa_BBB_tttt*',), ('aaaa_BBB2_wacker,aaaa_BBB',), ('BBB_ffff',), ('aaaa_BBB2MM*\r\naaaa_BBB_cccc2MM*',), ('BBBMM\\r\\nBBB2MM BBB',), ('aaaa_BBB_cccc2MM_tttt',), ('aaaa_BBB_tttt, aaaa_BBB',))
output = [final_item for sublist in start for item in sublist for final_item in item.replace('\\r',' ').replace('\\n',' ').replace(',',' ').split()]

Related

How to replace a character within a string in a list?

I have a list that has some elements of type string. Each item in the list has characters that are unwanted and want to be removed. For example, I have the list = ["string1.", "string2."]. The unwanted character is: ".". Therefore, I don't want that character in any element of the list. My desired list should look like list = ["string1", "string2"] Any help? I have to remove some special characters; therefore, the code must be used several times.
hola = ["holamundoh","holah","holish"]
print(hola[0])
print(hola[0][0])
for i in range(0,len(hola),1):
    for j in range(0,len(hola[i]),1):
        if (hola[i][j] == "h"):
            hola[i] = hola[i].translate({ord('h'): None})
print(hola)
However, I have an error in the conditional if: "string index out of range". Any help? thanks
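Strings are immutable in Python, so every modification creates a new string. The error occurs because the inner loop's range is computed from the original length of hola[i]; once translate shortens the string, later values of j fall outside it. Replace the characters in each element as a whole instead: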
list_ = ["string1.", "string2."]
for i, s in enumerate(list_):
    list_[i] = s.replace('.', '')
Or, without a loop:
list_ = ["string1.", "string2."]
list_ = list(map(lambda s: s.replace('.', ''), list_))
You can define the function for removing an unwanted character.
def remove_unwanted(original, unwanted):
    return [x.replace(unwanted, "") for x in original]
Then you can call this function like the following to get the result.
print(remove_unwanted(hola, "."))
Use str.replace for simple replacements:
lst = [s.replace('.', '') for s in lst]
Or use re.sub for more powerful and more complex regular expression-based replacements:
import re
lst = [re.sub(r'[.]', '', s) for s in lst]
Here are a few examples of more complex replacements that you may find useful, e.g., replace everything that is not a word character:
import re
lst = [re.sub(r'[\W]+', '', s) for s in lst]
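When several single characters have to go at once, one more option is str.translate with a table built by str.maketrans. This is only a sketch; the helper name strip_chars is made up for the example.

```python
def strip_chars(strings, unwanted):
    """Return copies of the strings with every character in `unwanted` removed."""
    # With three arguments, str.maketrans maps each character of the third one to None.
    table = str.maketrans('', '', unwanted)
    return [s.translate(table) for s in strings]

print(strip_chars(["string1.", "string2."], "."))            # ['string1', 'string2']
print(strip_chars(["holamundoh", "holah", "holish"], "h"))   # ['olamundo', 'ola', 'olis']
```

The table is built once and reused for every element, so removing several characters still takes a single pass over each string.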

Reverse a string based on custom delimeter

I have a string;
txt = "Hello$JOHN$*How*Are*$You"
I want output like:
Output: "You*$Are*How$*JOHN$Hello"
If you see closely, the character delimiters ($ and *) are NOT reversed in their sequence of occurrence. The string is reversed word-wise, but the delimiters are kept sequential.
I have tried the following:
sep=['$','*']
txt_1 = ""
for ch in txt:
    if ch in sep:
        txt_1 = txt_1+ch
I can't come up with the logic to capture the sequence of the delimiters and reverse the words of the string.
One approach using regex:
import re
s = "Hello$JOHN$*How*Are*$You"
splits = re.split('([$*]+)', s)
res = ''.join(reversed(splits))
print(res)
Output
You*$Are*How$*JOHN$Hello
A (perhaps less elegant) solution (but easier to understand) is to use itertools.groupby:
from itertools import groupby
s = "Hello$JOHN$*How*Are*$You"
splits = [''.join(g) for k, g in groupby(s, key=lambda x: x in ('$', '*'))]
res = ''.join(reversed(splits))
print(res)
The idea here is to group the string into alternating runs of delimiter and non-delimiter characters, and then reverse the order of those runs.
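To see what those runs look like, here is a small sketch that returns the intermediate list along with the result. The function name reverse_keeping_delimiters and its delimiters parameter are hypothetical, introduced only for this example.

```python
from itertools import groupby

def reverse_keeping_delimiters(text, delimiters="$*"):
    # groupby yields runs of consecutive characters that share the same key:
    # True for delimiter characters, False for everything else.
    runs = [''.join(group) for _, group in groupby(text, key=lambda ch: ch in delimiters)]
    # Reversing the runs (not the characters) keeps each delimiter run intact.
    return runs, ''.join(reversed(runs))

runs, result = reverse_keeping_delimiters("Hello$JOHN$*How*Are*$You")
print(runs)    # ['Hello', '$', 'JOHN', '$*', 'How', '*', 'Are', '*$', 'You']
print(result)  # You*$Are*How$*JOHN$Hello
```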

Searching for similar values within a regex string

I'm trying to do a search with regex within two lists that have similar strings, but not the same, how to fix the fault below?
Script:
import re
list1 = [
'juice',
'potato']
list2 = [
'juice;44',
'potato;55',
'apple;66']
correlation = []
for a in list1:
    r = re.compile(r'\b{}\b'.format(a), re.I)
    for b in list2:
        if r.search(b):
            pass
        else:
            correlation.append(b)
print(correlation)
Output:
['potato;55', 'apple;66', 'juice;44', 'apple;66']
Desired Output:
['apple;66']
Regex:
You can create a single regex pattern to match terms from list1 as whole words, and then use filter:
import re
list1 = ['juice', 'potato']
list2 = ['juice;44', 'potato;55', 'apple;66']
rx = re.compile(r'\b(?:{})\b'.format("|".join(list1)))
print( list(filter(lambda x: not rx.search(x), list2)) )
# => ['apple;66']
The regex is \b(?:juice|potato)\b. The \b is a word boundary, so the regex matches juice or potato only as whole words. filter(lambda x: not rx.search(x), list2) keeps only the items of list2 that do not match the regex.
First, the inner and outer for-loops must be swapped to make this work.
Then set a flag to False before the inner for-loop, set it to True inside the inner loop when a match is found, and after the loop append to correlation if the flag is still False.
This finally looks like:
import re
list1 = [
'juice',
'potato']
list2 = [
'juice;44',
'potato;55',
'apple;66']
correlation = []
for b in list2:
    found = False
    for a in list1:
        r = re.compile(r'\b{}\b'.format(a), re.I)
        if r.search(b):
            found = True
    if not found:
        correlation.append(b)
print(correlation)
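The flag logic can also be expressed with any(), which stops at the first match. The sketch below is one possible variant rather than the answerer's code; it additionally compiles each pattern once and applies re.escape, on the assumption that the search terms should be treated as literal text.

```python
import re

list1 = ['juice', 'potato']
list2 = ['juice;44', 'potato;55', 'apple;66']

# Compile each pattern once, outside the loop over list2.
patterns = [re.compile(r'\b{}\b'.format(re.escape(a)), re.I) for a in list1]

# Keep b only if none of the patterns finds a match in it.
correlation = [b for b in list2 if not any(p.search(b) for p in patterns)]
print(correlation)  # ['apple;66']
```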
Convert list1 into a single regexp that matches all the words. Then append the element of list2 if it doesn't match the regexp.
regex = re.compile(r'\b(?:' + '|'.join(re.escape(word) for word in list1) + r')\b')
correlation = [a for a in list2 if not regex.search(a)]

How can you terminate a string after k consecutive numbers have been found?

Say I have some list with files of the form *.1243.*, and I wish to obtain everything before these 4 digits. How do I do this efficiently?
An ugly, inefficient example of working code is:
names = []
for file in file_list:
    words = file.split('.')
    for i, word in enumerate(words):
        if word.isdigit():
            if int(word)>999 and int(word)<10000:
                names.append(' '.join(words[:i]))
                break
print(names)
Obviously though, this is far from ideal and I was wondering about better ways to do this.
You may want to use regular expressions for this.
import re
name = []
for file in file_list:
    m = re.match(r'^(.+?)\.\d{4}\.', file)
    if m:
        name.append(m.groups()[0])
Using a regular expression, this would become simpler
import re
names = ['hello.1235.sas','test.5678.hai']
for fn in names:
    myreg = r'(.*)\.(?:\d{4})\..*'
    output = re.findall(myreg,fn)
    print(output)
output:
['hello']
['test']
If you know that all entries have the same format, here is a list comprehension approach:
[parts[0] for parts in filter(lambda parts: len(parts[1]) == 4, (item.split('.') for item in file_list))]
To be fair, I also like the solution provided by @James. Note that the downside of this list comprehension is that it involves three steps:
1. On all items to split
2. Filtering all items, that match
3. Returning result.
With a regular for loop it could be more efficient:
output = []
for item in file_list:
    beginning, digits, end = item.split('.')
    if len(digits) == 4:
        output.append(beginning)
It does only one loop, which is better.
You can use Positive Lookahead (?=(\.\d{4}))
import re
pattern=r'(.*)(?=(\.\d{4}))'
text=['*hello.1243.*','*.1243.*','hello.1235.sas','test.5678.hai','a.9999']
print(list(map(lambda x:re.search(pattern,x).group(0),text)))
output:
['*hello', '*', 'hello', 'test', 'a']
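One caveat with the map/lambda version: re.search returns None when a name has no dot followed by four digits, so calling .group on it would raise AttributeError. A minimal sketch that skips such names follows; the helper name prefixes and the sample file names are assumptions for illustration.

```python
import re

# Same lookahead idea as above: match everything up to a dot followed by four digits.
pattern = re.compile(r'.*(?=\.\d{4})')

def prefixes(names):
    """Return the text before '.dddd' for each name that contains such a part."""
    matches = (pattern.search(name) for name in names)
    # Names without a match produce None and are skipped.
    return [m.group(0) for m in matches if m]

print(prefixes(['hello.1235.sas', 'test.5678.hai', 'no_digits.txt']))
# ['hello', 'test']
```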

Remove special characters from individual python list

I have a list that contain many elements.
I was able to find a way to remove duplicates, blank values, and white space.
The only thing left is to:
remove any thing that contain (ae) string.
remove from the list any thing that contain the period (.)
Order of the resulting list is not important.
The final list should only contain:
FinalList = ['eth-1/1/0', 'jh-3/0/1', 'eth-5/0/0','jh-5/9/9']
Code:
XYList = ['eth-1/1/0', 'ae1', 'eth-1/1/0', 'eth-1/1/0', 'ae1', 'jh-3/0/1','jh-5/9/9', 'jh-3/0/1.3321', 'jh-3/0/1.53', 'ae0', '', 'eth-5/0/0', 'ae0', '', 'eth-5/0/0', 'ae0', 'eth-5/0/0', '', 'jh-2.1.2']
XYUnique = set(XYList)
XYNoBlanks = filter(None, XYUnique)
RemovedWhitespace = [item.strip() for item in XYNoBlanks]
# the order of the list is not important
# the final result should be
FinalList = ['eth-1/1/0', 'jh-3/0/1', 'eth-5/0/0','jh-5/9/9']
The entire conversion sequence (excluding uniqueness) can be accomplished with a list comprehension:
FinalList = [elem.strip() for elem in set(XYList) if elem and "." not in elem and "ae" not in elem]
filtered_l = [s for s in XYList if 'ae' not in s and '.' not in s]
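A possible variant of the comprehension above strips whitespace before deduplicating, so that entries differing only in surrounding spaces collapse into one. This is a sketch under that assumption; the function name clean, the banned parameter, and the shortened sample list are hypothetical.

```python
def clean(items, banned=('ae', '.')):
    # Strip first so ' eth-1/1/0 ' and 'eth-1/1/0' become the same set element.
    stripped = {item.strip() for item in items}
    # Drop blanks and anything containing one of the banned substrings.
    return [s for s in stripped if s and not any(b in s for b in banned)]

XYList = ['eth-1/1/0', 'ae1', ' eth-1/1/0 ', 'jh-3/0/1', 'jh-3/0/1.53',
          'ae0', '', 'eth-5/0/0', 'jh-5/9/9', 'jh-2.1.2']
# Sets are unordered, so sort only to make the printed result stable.
print(sorted(clean(XYList)))
# ['eth-1/1/0', 'eth-5/0/0', 'jh-3/0/1', 'jh-5/9/9']
```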
