How to split by newline and ignore blank lines using regex? - python

Let's say I have this data:
data = '''a, b, c
d, e, f
g. h, i
 
j, k , l


'''
The 4th line contains a single space; the 6th and 7th lines contain nothing at all, just a newline.
Now when I split the same using splitlines
data.splitlines()
I get
['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l', '', '']
However expected was just
['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
Is there a simple way to do this using regular expressions?
Please note that I already know the other way of doing this, by filtering empty strings out of the output of splitlines().
I am not sure whether the same can be achieved using regex.
When I use a regex to split on newlines, it gives me:
import re
re.split("\n", data)
Output:
['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l', '', '', '']

I disagree with your assessment that filtering is more complicated than using regular expressions. However, if you really want to use regex, you could split at multiple consecutive newlines like so:
>>> re.split(r"\n+", data)
['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l', '']
Unfortunately, this leaves an empty string at the end of your list (and keeps the line that contains only a space).
To get around this, use re.findall to find everything that isn't a newline:
>>> re.findall(r"([^\n]+)", data)
['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l']
That regex still keeps lines that contain only whitespace, so here's one that skips them:
>>> re.findall(r"^([ \t]*\S.*)$", data, re.MULTILINE)
['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
Here's the explanation:
^([ \t]*\S.*)$
^ $ : Start of line and end of line
( ) : Capturing group
[ \t]* : Zero or more spaces or tabs (i.e. whitespace that isn't a newline)
\S : One non-whitespace character
.* : Zero or more of any character
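To try the pattern out, here is a small sketch. The data string is reconstructed from the splitlines() output shown in the question, so treat it as an assumption; the capturing group is dropped because findall returns the whole match when there is no group.

```python
import re

# Reconstructed from the splitlines() output in the question (an assumption):
# line 4 is a single space, lines 6 and 7 are empty.
data = "a, b, c\nd, e, f\ng. h, i\n \nj, k , l\n\n\n"

# Keep only lines that contain at least one non-whitespace character.
non_blank = re.findall(r"^[ \t]*\S.*$", data, re.MULTILINE)
print(non_blank)  # ['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
```

Lines made up only of spaces or tabs fail at \S, and empty lines never match at all, so both kinds are skipped.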

List comprehension approach
You can build the list with a condition that skips lines that are empty or contain only whitespace.
str.strip() returns an empty string for such lines, and an empty string is falsy, so only lines with real content pass the check.
filtered_data = [el for el in data.splitlines() if el.strip()]
# ['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
Regexp approach
import re
p = re.compile(r"^([^\s]+.+)", re.M)
p.findall(data)
# ['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']


Merging items in list given condition

Let's say I have ['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G'].
If the second word of an element is the same as the first word of any other element in the list, they should be merged into one item. Order matters as well.
['A B C D E G', 'X Y Z'] should be the final product.
Letters will not form a cycle, i.e., ['A B', 'B C', 'C A'] is not possible.
As for why G was added at the end, let's say we are at 'A B C' and we see 'C D'. And later on in the list we have 'C G'. We add 'C D' since it comes first, so we have 'A B C D', and then if there is 'D E' we merge that to 'A B C D E'. Now at the end of a 'chain', we add 'C G' so we have 'A B C D E G'. If there had been a 'B H', since B comes first, then it would be 'A B C D E H G'.
Since order matters, ['A B', 'C A'] is still ['A B', 'C A'] since B does not connect to C.
Another example is:
If we have ['A B', 'A C'], then at step 1 we just have A B. Then we see A C, merge it into A B, and end up with A B C.
I have tried many things, such as dissecting each element into two lists, then indexing and removing etc. It is way too complex and I feel there should be a more intuitive way to solve this in Python. Alas, I can't seem to figure it out.
A simple algorithm solving this appears to be:
initialize results as empty list
repeat for each pair in input list:
    repeat for each sublist R in results:
        if R contains the first item of pair, append second item to R and continue with next pair
    if no sublist R contained the first item of pair, append pair as new sublist to results
The implementation in Python is straightforward and should be self-explanatory:
def merge_pairs(pairs):
    results = []
    for pair in pairs:
        first, second = pair.split()
        for result in results:
            if first in result:
                result.append(second)
                break
        else:
            results.append([first, second])
    return [' '.join(result) for result in results]
The only extra steps are the conversions between space-separated letters and lists of letters by .split() and ' '.join().
Note the else clause of the for statement which is only executed if no break was encountered.
Some examples:
>>> merge_pairs(['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G'])
['A B C D E G', 'X Y Z']
>>> merge_pairs(['A B', 'A C'])
['A B C']
>>> merge_pairs(['B C', 'A B'])
['B C', 'A B']
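The for/else behaviour noted above can be seen in isolation in the following sketch. The function name find_group and the sample data are made up for illustration.

```python
def find_group(groups, letter):
    """Return the index of the first group containing letter, or -1."""
    for index, group in enumerate(groups):
        if letter in group:
            found = index
            break
    else:
        # Runs only when the loop finished without hitting break.
        found = -1
    return found


groups = [['A', 'B', 'C'], ['X', 'Y']]
print(find_group(groups, 'Y'))  # 1
print(find_group(groups, 'Q'))  # -1
```

The else branch also runs when the list is empty, because the loop then finishes without ever reaching break.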
It's a bit messy, but this works:
a = ['AB', 'BC', 'XY', 'CD', 'YZ', 'DE', 'CG']
used_item = []
result = []
for f, fword in enumerate(a):
    for s, sword in enumerate(a):
        if f == s:
            continue
        if fword[1] == sword[0]:
            if f in used_item or s in used_item:
                idx = [i for i, w in enumerate(result) if fword[1] in w][0]
                result = [r + sword[1] if i == idx else r for i, r in enumerate(result)]
                used_item.append(f)
                used_item.append(s)
            else:
                result.append(fword + sword[1])
                used_item.append(f)
                used_item.append(s)
print(result)
The output is ['ABCDGE', 'XYZ'].
Note that G ends up before E here; you could sort 'ABCDGE' if necessary.
I appreciate all the above answers and wanted to attempt the question using Graph Theory.
This question is a classic example of connected components, where we assign a unique identifier to each node (the nodes are the different characters in the list: A, B, C, ..., Z).
Also, as we need to maintain the order, DFS is the correct choice to proceed with.
Algorithm:
1. Treat the characters A, B, C, ..., Z as 26 different nodes.
2. Perform DFS on each node.
    2.1. If the node has not been assigned a unique identifier:
        2.1.1. Assign a new unique identifier to the node.
        2.1.2. dfs(node)
    2.2. Else:
        do nothing
The function dfs() will group the nodes according to the unique identifier. While performing DFS, the child gets the same unique identifier as its parent.
Have a look at the following implementation:
class Graph:
    def __init__(self, n):
        self.number_of_nodes = n
        self.visited = []
        self.adjacency_list = []
        self.identifier = [-1] * self.number_of_nodes
        self.merged_list = []

    def add_edge(self, edge_start, edge_end):
        if(self.encode(edge_end) not in self.adjacency_list[self.encode(edge_start)]):
            self.adjacency_list[self.encode(edge_start)] = self.adjacency_list[self.encode(edge_start)] + [self.encode(edge_end)]

    def initialize_graph(self):
        self.visited = [False] * self.number_of_nodes
        for i in range(0, self.number_of_nodes):
            self.adjacency_list = self.adjacency_list + [[]]

    def get_adjacency_list(self):
        return self.adjacency_list

    def encode(self, node):
        return ord(node) - 65

    def decode(self, node):
        return chr(node + 65)

    def dfs(self, start_index):
        if(self.visited[self.encode(start_index)] == True):
            return
        self.visited[self.encode(start_index)] = True
        for node in self.adjacency_list[self.encode(start_index)]:
            if(self.identifier[node] == -1):
                self.identifier[node] = self.identifier[self.encode(start_index)]
            if(self.visited[node] == False):
                self.merged_list[self.identifier[node]] = self.merged_list[self.identifier[node]] + self.decode(node)
                self.dfs(self.decode(node))

graph = Graph(26)
graph.initialize_graph()
input_list = ['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G']
for inputs in input_list:
    edge = inputs.split()
    edge_start = edge[0]
    edge_end = edge[1]
    graph.add_edge(edge_start, edge_end)
unique_identifier = 0
for inputs in input_list:
    edge = inputs.split()
    edge_start = edge[0]
    if(graph.identifier[graph.encode(edge_start)] == -1):
        graph.identifier[graph.encode(edge_start)] = unique_identifier
        graph.merged_list = graph.merged_list + [edge_start]
        unique_identifier = unique_identifier + 1
        graph.dfs(edge_start)
print(graph.merged_list)
Output:
['ABCDEG', 'XYZ']
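For comparison, the same DFS idea can be sketched more compactly with a dictionary as the adjacency list, which also drops the restriction to the 26 letters A to Z. The function name merge_by_dfs is made up for this sketch, and it has only been checked against the sample input from the question.

```python
def merge_by_dfs(pairs):
    # Build an adjacency list; dicts and lists keep insertion order.
    children = {}
    for pair in pairs:
        start, end = pair.split()
        children.setdefault(start, [])
        if end not in children[start]:
            children[start].append(end)

    visited = set()
    merged = []

    def dfs(node, chain):
        # Record the node, then follow its unvisited children in input order.
        visited.add(node)
        chain.append(node)
        for child in children.get(node, []):
            if child not in visited:
                dfs(child, chain)

    # Start a new chain from every start word that has not been reached yet.
    for pair in pairs:
        start = pair.split()[0]
        if start not in visited:
            chain = []
            dfs(start, chain)
            merged.append(' '.join(chain))
    return merged


print(merge_by_dfs(['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G']))
# ['A B C D E G', 'X Y Z']
```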

How to find a substring in a string list?

I'm trying to find a string in a list, but I'm having problems because of word order.
list = ['a b c d', 'e f g', 'h i j k']
str = 'e g'
I need to find the 2nd item in the list and output it.
You can use a combination of any() and all() to check for its presence in one line:
>>> my_list = ['a b c d', 'e f g', 'h i j k']
>>> my_str = 'e g'
>>> any(all(s in sub_list for s in my_str.split()) for sub_list in my_list)
True
The expression above returns True or False depending on whether any string in the list contains every space-separated part of my_str.
To get the matching strings themselves, replace any() with a list comprehension:
>>> [sub_list for sub_list in my_list if all(s in sub_list for s in my_str.split())]
['e f g']
It returns the list of strings that contain all of those parts.
You can try:
for l in list:
    l_words = l.split(" ")
    if all([x in l_words for x in str.split(" ")]):
        print(l_words)
You can try this:
list = ['a b c d', 'e f g', 'h i j k']
str = list[1].split()  # index 1 is the 2nd item
for letter in str:
    print(letter)
This can be achieved by using sets and list comprehension
ls = ['a b c d', 'e f g', 'h i j k']
s = 'e g'
print([i for i in ls if len(set(s.replace(" ", "")).intersection(set(i.replace(" ", "")))) == len(s.replace(" ", ""))])
OR
ls = ['a b c d', 'e f g', 'h i j k']
s = 'e g'
s_set = set(s.replace(" ", ""))
print([i for i in ls if len(s_set.intersection(set(i.replace(" ", "")))) == len(s_set)])
Output
['e f g']
The list comprehension keeps only the items of ls that contain every character of s; any item missing one of those characters is dropped.
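If the search should compare whole words rather than single characters, one possible variation is a word-level subset test. This is a sketch; the function name find_items is made up for illustration.

```python
def find_items(items, query):
    """Return the items that contain every word of query, in any order."""
    wanted = set(query.split())
    # issubset accepts any iterable, so the list of words can be passed directly.
    return [item for item in items if wanted.issubset(item.split())]


items = ['a b c d', 'e f g', 'h i j k']
print(find_items(items, 'e g'))  # ['e f g']
print(find_items(items, 'g e'))  # ['e f g']
```

Because the comparison is on sets of words, the order of the words in the query does not matter.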

Regular expression to remove numbers ending with "%" in python

Quick question: how do I use a regular expression, or any other method for that matter, to remove every numeric value ending with the % sign from the list below? If a similar question has been asked before, kindly post the link; no need to downvote the question.
Thanks
my_list = [['First Class', 'F, U', '150%'], ['P', '125%'],
           ['Business Class', 'J, C, D, I', '125%'],
           ['Premium Economy Class', 'W', '110%'],
           ['Economy Class', 'Y, B', '100%'],
           ['E, H, M', '75%'],
           ['L, N, R, S, V, K', '50%'],
           ['T', '30%'],
           ['Not eligible for accrual', 'Z, Q, G', '0%']]
You can just use a normal list comprehension and test every element of the sublists with endswith().
new_list = [[i for i in l if not i.endswith('%')] for l in my_list]
print(new_list)
gives
[['First Class', 'F, U'], ['P'], ['Business Class', 'J, C, D, I'], ['Premium Economy Class', 'W'], ['Economy Class', 'Y, B'], ['E, H, M'], ['L, N, R, S, V, K'], ['T'], ['Not eligible for accrual', 'Z, Q, G']]
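Since the question asked about regular expressions, the same filter can also be sketched with re. This is an illustration rather than a recommendation over endswith(); the name PERCENT is made up here and my_list is shortened to three rows.

```python
import re

# Matches strings made up only of digits followed by a percent sign.
PERCENT = re.compile(r"\d+%")

my_list = [['First Class', 'F, U', '150%'], ['P', '125%'], ['T', '30%']]

# Keep every item that is not a bare percentage such as '150%'.
new_list = [[item for item in row if not PERCENT.fullmatch(item)]
            for row in my_list]
print(new_list)  # [['First Class', 'F, U'], ['P'], ['T']]
```

Unlike endswith('%'), fullmatch only drops items that are entirely a number plus %, so a label such as 'Up to 50%' would be kept.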

Concatenating Arbitrary number of items of a string in Python

Given a list ['a','b','c','d','e','f'] and a number of divisions (here 2), I want the first string to contain elements 0, 2, 4 of the list joined by spaces, and the second string to contain elements 1, 3, 5.
The output needs to be in the form k = ["a c e", "b d f"].
The actual program takes in a string of gifts (e.g. {ball,bat,doll,choclate,bat,kite}) and the number of kids who take those gifts (e.g. 2), and divides the gifts as follows: the first kid takes a gift and goes to the back of the line, the second kid takes a gift and goes to the back, and so on until every kid has taken one. If gifts remain, the first kid takes another gift and the cycle continues.
Desired output for the above example: {"ball doll bat", "bat choclate kite"}
Here is a general way to do this for any number of groups:
def merge(lst, ngroups):
    return [' '.join(lst[start::ngroups]) for start in range(ngroups)]
Here is how it's used:
>>> lst = ['a','b','c','d','e','f']
>>> merge(lst, 2)
['a c e', 'b d f']
>>> merge(lst, 3)
['a d', 'b e', 'c f']
lst = ['a','b','c','d','e','f']
k = [" ".join(lst[::2]), " ".join(lst[1::2])]
output:
['a c e', 'b d f']
more generic solution:
def group(lst, n):
    return [" ".join(lst[i::n]) for i in range(n)]
lst = ['a','b','c','d','e','f']
print(group(lst, 3))
output:
['a d', 'b e', 'c f']
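Applied to the gift example from the question, the same slicing idea might look like the sketch below. The function name distribute and the assumption that the input arrives as a brace-wrapped, comma-separated string are mine, not the asker's.

```python
def distribute(gifts_text, kids):
    """Deal gifts round-robin and return one space-separated string per kid."""
    # Assumed input format: "{ball,bat,...}", possibly with stray spaces.
    parts = gifts_text.strip("{} ").split(",")
    gifts = [part.strip() for part in parts if part.strip()]
    # Kid number `start` gets every `kids`-th gift, beginning at `start`.
    return [" ".join(gifts[start::kids]) for start in range(kids)]


print(distribute("{ball,bat,doll,choclate,bat,kite}", 2))
# ['ball doll bat', 'bat choclate kite']
```

When the gifts do not divide evenly, the kids at the front of the line simply end up with one more gift each.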

Lining up DNA base pairs in an array

Trying to import a series of base pairs into an array.
I want it in the form ['AA','AT','AG','AC'...]
Here's my code:
paths = [str(x[4:7]) for x in mm_start]
paths
['A A', 'A T', 'A G', 'A C', 'T A', 'T T', 'T G', 'T C', 'G A', 'G T', 'G G', 'G C', 'C A', 'C T', 'C G', 'C C']
I get spaces in between the letters!
This replace call isn't helping either.
paths = str(paths).replace(" ","")
paths
"['AA','AT','AG','AC','TA','TT','TG','TC','GA','GT','GG','GC','CA','CT','CG','CC']"
Now I get a (") at the beginning and end of this array.
Any ideas very welcome!
The text file has the base pairs laid out like this:
1 2 3 4 A A
1 2 3 1 A T
...
Thanks
You're converting the list into a string. You really want to say
paths = [pair.replace(" ", "") for pair in paths]
That iterates over each pair in your list of strings and removes spaces rather than converting the entire list to a string.
paths = str(paths).replace(" ","")
Well of course it doesn't work, because you converted the list to a string (and you didn't convert it back). What you need to do is apply the replace to each element of the list, not to the string expression of the list.
paths = [str(x[4:7]).replace(" ","") for x in mm_start]
(that's only one way of many).
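Another possible way, sketched under the assumption that every line of the text file looks like the sample in the question (four numbers followed by two bases), is to split each line and join the last two fields. The lines list below stands in for the file contents.

```python
# Stand-in for the text file, using the two sample lines from the question.
lines = ["1 2 3 4 A A", "1 2 3 1 A T"]

# The two bases are the last two whitespace-separated fields of each line.
paths = ["".join(line.split()[-2:]) for line in lines]
print(paths)  # ['AA', 'AT']
```

This avoids relying on fixed character positions such as x[4:7], so it keeps working if the numeric columns change width.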
