How to remove empty separators from read files in Python?

How to remove empty separators from read files in Python? - python

Here is my input file sample (z.txt)
>qrst
ABCDE-- 6 6 35 25 10
>qqqq
ABBDE-- 7 7 28 29 2
I store the alpha and numeric in separate lists. Here is the output of numerics list
#Output : ['', '6', '', '6', '35', '25', '10']
['', '7', '', '7', '28', '29', '', '2']
The output has an extra space when there are single digits because of the way the file has been created. Is there anyway to get rid of the '' (empty spaces)?

You can take advantage of filter with None as function for that:
numbers = ['', '7', '', '7', '28', '29', '', '2']
numbers = filter(None, numbers)
print numbers
See it in action here: https://eval.in/640707

If your input looks like this:
>>> li=[' 6 6 35 25 10', ' 7 7 28 29 2']
Just use .split() which will handle the repeated whitespace as a single delimiter:
>>> [e.split() for e in li]
[['6', '6', '35', '25', '10'], ['7', '7', '28', '29', '2']]
vs .split(" "):
>>> [e.split(" ") for e in li]
[['', '6', '', '6', '', '35', '', '25', '', '10'], ['', '7', '7', '28', '', '29', '2']]

I guess there are many ways to do this. I prefer using regular expressions, although this might be slower if you have a large input file with tens of thousands of lines. For smaller files, it's okay.
Few points:
Use context manager (with statement) to open files. When the with statement ends, the file will automatically be closed.
An alternative to re.findall() is re.match() or re.search(). Subsequent code will be slightly different.
It org, sequence and numbers are related element-wise, I suggest you maintain a list of 3-element tuples instead. Of course, you have buffer the org field and add to the list of tuples when the next line is obtained.
import re
org = []
sequence = []
numbers = []
with open('ddd', 'r') as f:
for line in f.readlines():
line = line.strip()
if re.search(r'^>', line):
org.append(line)
else:
m = re.findall(r'^([A-Z]+--)\s+(.*)\s+', line)
if m:
sequence.append(m[0][0])
numbers.append(map(int, m[0][1].split())) # convert from str to int
print(org, sequence, numbers)

Related

Check if multiple elements are in a list

i have 2 lists :
A = ['1', '2', '3', '4', '5']
B = ['0', '1', '9', '3', '0']
and i want to check if elements in list B are in A and return a list, if so it should return the same number, if not it should return empty string '', here is the result i'm looking for :
C = ['', '1', '', '3', '']
i Tried using for loop and append result to an empty list, but i got this :
C = ['', '1', '', '', '', '', '', '', '3', ''...]
it doubled the number of elements in the list cuz it looks for the first number in the entire list then move to second one, which makes sense since i'm using for loop, what should i use instead to get back a 5 elements list please ?
thanks in advance.

To get your required output, you can just loop through B and check if the item exists in A. If it does, then append it in c otherwise append an empty string to c.
A = ['1', '2', '3', '4', '5']
B = ['0', '1', '9', '3', '0']
c = []
for i in B:
if i in A:
c.append(i)
else:
c.append('')
print(c)
The output of the above code
['', '1', '', '3', '']

How to get all the substrings in string using Regex in Python

I have a string such as: "12345"
using the regex, how to get all of its substrings that consist of one up to three consecutive characters to get an output such as:
'1', '2', '3', '4', '5', '12', '23', '34', '45', '123', '234', '345'

You can use re.findall with a positive lookahead pattern that matches a character repeated for a number of times that's iterated from 1 to 3:
[match for size in range(1, 4) for match in re.findall('(?=(.{%d}))' % size, s)]
However, it would be more efficient to use a list comprehension with nested for clauses to iterate through all the sizes and starting indices:
[s[start:start + size] for size in range(1, 4) for start in range(len(s) - size + 1)]
Given s = '12345', both of the above would return:
['1', '2', '3', '4', '5', '12', '23', '34', '45', '123', '234', '345']

How can I use regex to match only one character in Python?

I am trying do precess a list of files
file_list = ['.DS_Store', '9', '7', '6', '8', '01', '4', '3', '2', '5']
the goal is to find the files whose name has only one character.
I tried this code
r = re.compile('[0-9]')
result_list = list(filter(r.match, file_list))
result_list
and got
['9', '7', '6', '8', '01', '4', '3', '2', '5']
where '01' should not be included.
I made a workaround
tmp = []
for i in file_list:
if len(i)==1:
tmp.append(i)
tmp
and I got
['9', '7', '6', '8', '4', '3', '2', '5']
this is exactly what I want. Although the method is ugly.
how can I use regex in Python to finish the task?

r = re.compile('^[0-9]$')
The ^ matches the beginning of a line and $ matches the end.
And if you really want it to match any character, not just numbers, it should be
r = re.compile('^.$')
The . in the regex is a single-character wildcard.

Match a string if it's simply any single character appearing at the beginning of the string (^.) right before the end of the string ($):
^.$
Regex101
Your Python then becomes:
r = re.compile('^.$')
result_list = list(filter(r.match, file_list))

Your code is equivalent to
[ i for i in file_list if len(i)==1]
And this method adapts to every case in which file's name has only one character.

Splitting a string similar to ip addresses using regex in Python

I want to have a regular expression which will split on seeing a '.'(dot)
For example:
Input: '1.2.3.4.5.6'
Output : ['1', '2', '3', '4', '5', '6']
What I have tried:-
>>> pattern = '(\d+)(\.(\d+))+'
>>> test = '192.168.7.6'
>>> re.findall(pat, test)
What I get:-
[('192', '.6', '6')]
What I expect from re.findall():-
[('192', '168', '7', '6')]
Could you please help in pointing what is wrong?
My thinking -
In pattern = '(\d+)(\.(\d+))+', initial (\d+) will find first number i.e. 192 Then (\.(\d+))+ will find one or more occurences of the form '.<number>' i.e. .168 and .7 and .6
[EDIT:]
This is a simplified version of the problem I am solving.
In reality, the input can be-
192.168 dot 7 {dot} 6
and expected output is still [('192', '168', '7', '6')].
Once I figure out the solution to extract .168, .7, .6 like patterns, I can then extend it to dot 168, {dot} 7 like patterns.

Since you only need to find the numbers, the regex \d+ should be enough to find numbers separated by any other token/separator:
re.findall("\d+", test)
This should work on any of those cases:
>>> re.findall("\d+", "192.168.7.6")
['192', '168', '7', '6']
>>> re.findall("\d+", "192.168 dot 7 {dot} 6 | 125 ; 1")
['192', '168', '7', '6', '125', '1']

Conditional regular expression to split on commas

I am splitting a string in python and my goal is to split by commas except these between quotations marks. I am using
fields = line.strip().split(",")
but some strings are like the following one:
10,20,"Installations, machines",3,5
How can I use regular expressions for accomplishing this?

Although I agree that regular expressions may not be the best tool for the job, I found the problem quite interesting on its own.
import re
split_on_commas = re.compile(r'[^,]*".*"[^,]*|[^,]+|(?<=,)|^(?=,)').findall
This regexp consists in four alternative parts in this order:
any number of non-commas, followed by a substring enclosed between double quotes, followed by any number of non-commas;
at least one non-comma;
an empty substring following a comma;
an empty substring at the start of the string, and followed by a comma.
Some tests:
assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,,20,"aaa, bbb",3,5') == ['10', '', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,,,20,"aaa, bbb",3,5') == ['10', '', '', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas(',10,20,"aaa, bbb",3,5') == ['', '10', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,20,"aaa, bbb",3,5,') == ['10', '20', '"aaa, bbb"', '3', '5', '']
assert split_on_commas('10,20,"aaa, bbb" ccc,3,5') == ['10', '20', '"aaa, bbb" ccc', '3', '5']
assert split_on_commas('10,20,ccc "aaa, bbb",3,5') == ['10', '20', 'ccc "aaa, bbb"', '3', '5']
assert split_on_commas('10,20,"aaa, bbb" "ccc",3,5,') == ['10', '20', '"aaa, bbb" "ccc"', '3', '5', '']
assert split_on_commas('10,20,"aaa, bbb" "ccc, ddd",3,5,') == ['10', '20', '"aaa, bbb" "ccc, ddd"', '3', '5', '']
assert split_on_commas('10,20,"aaa, "bbb",3,5') == ['10', '20', '"aaa, "bbb"', '3', '5']
assert split_on_commas('10,20,"",3,5') == ['10', '20', '""', '3', '5']
assert split_on_commas('10,20,",",3,5') == ['10', '20', '","', '3', '5']
assert split_on_commas(',,,') == ['', '', '', '']
assert split_on_commas('') == []
assert split_on_commas(',') == ['', '']
assert split_on_commas('","') == ['","']
assert split_on_commas('",') == ['"', '']
assert split_on_commas(',"') == ['', '"']
assert split_on_commas('"') == ['"']
Update: comparison with the csv module solution
Similar questions have been asked many times on SO, and each time the best / accepted answer was "Just use the csv module". Perhaps it's useful to point out some differences between the recommended solution and my re proposition. But first, devise a csv function with the same interface as split (not idiomatic, but consistent with the original requirement):
import csv
split_on_commas = lambda s: csv.reader([s]).next()
The first thing to be aware of is that csv.reader does more than a smart split. The external delimiters are suppressed:
assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', 'aaa, bbb', '3', '5']
Which can lead to some strange behaviours:
assert split_on_commas('10,20,"aaa, bbb" ccc,3,5') == ['10', '20', 'aaa, bbb ccc', '3', '5']
assert split_on_commas('10,20,aaa", bbb ccc",3,5') == ['10', '20', 'aaa"', ' bbb ccc"', '3', '5']
I am sure this is not a problem with a generated CSV, since the offending double quotes would be escaped.
More shocking is the fact that this module still does not support Unicode:
split_on_commas(u'10,20,"Juan, Chô",3,5')
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-83-a0ef82b5fc26> in <module>()
----> 1 split_on_commas(u'10,20,"Juan, Chô",3,5')
<ipython-input-81-18a2b4070348> in <lambda>(s)
1 if __name__ == "__main__":
2 import csv
----> 3 split_on_commas = lambda s: csv.reader([s]).next()
4
5 assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', 'aaa, bbb', '3', '5']
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf4' in position 15: ordinal not in range(128)
But there is of course a third difference: my solution has not be thoroughly tested, and is not guaranteed to work in the cases I didn't think of... Now, since this approach seems to have several real use cases (e.g., non-TSV files, non-ASCII input), I would be glad if some regex guru, far from dismissing it as dangerous, could help to find out its limitations and improve it.

This is how I'd do it:
import re
data = "my string \"string is nice\" other string "
print re.findall(r'(\w+|".*?")', data)
The output will be:
['my', 'string', '"string is nice"', 'other', 'string']
I don't think there's anything to explain here as the regex speaks for itself. Anyway, if you have any doubts I recommend regex101
\w+ - match any word character [a-zA-Z0-9_]
" - matches the characters " literally
.*? - matches any character (except newline)
If you also want to get rid of the square brackets, do this:
import re
string = "my string \"string is nice\" other string "
parsed_string = re.findall(r'(\w+|".*?")', string)
print(", ".join(parsed_string))
The output will be:
my, string, "string is nice", other, string

As jonrsharpe and Alan Moore mentioned, the Python's built-in CSV module would be a much better solution.
As per their own example:
import csv
with open('some.csv', 'rb') as f:
reader = csv.reader(f)
for row in reader:
print row

Regular expressions will not work well here.
You can split by comma and then recombine...
Or use the csv module as suggested in the comments...
line = '10,20,"Installations, machines",3,5'
fields = line.strip().split(",")
result = []
tmpfield = ''
for checkfield in fields:
tmpfield = checkfield if tmpfield=='' else tmpfield +','+ checkfield
if tmpfield.strip().startswith('"'):
if tmpfield.strip().endswith('"'):
result.append(tmpfield)
tmpfield = ''
else:
result.append(tmpfield)
tmpfield = ''
if tmpfield<>'':
result.append(tmpfield)
print(result)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to remove empty separators from read files in Python? - python

You can take advantage of filter with None as function for that: numbers = ['', '7', '', '7', '28', '29', '', '2'] numbers = filter(None, numbers) print numbers See it in action here: https://eval.in/640707

Related

Check if multiple elements are in a list

How to get all the substrings in string using Regex in Python

How can I use regex to match only one character in Python?

Splitting a string similar to ip addresses using regex in Python

Conditional regular expression to split on commas

Categories

Resources