I want to replace numbers from a text file and write the result to a new text file. I tried to solve it with a dictionary, but now Python also replaces substrings.
For example: I want to replace the number 01489 with 1489, but with this code it also replaces the 01489 inside 014896 - how can I get rid of this? Thank you!!!
replacements = {'01489':'1489', '01450':'1450'}
infile = open('test_fplan.txt', 'r')
outfile = open('test_fplan_neu.txt', 'w')
for line in infile:
    for src, target in replacements.iteritems():
        line = line.replace(src, target)
    outfile.write(line)
I don't know how your input file looks, but if the numbers are surrounded by spaces, this should work:
replacements = {' 01489 ':' 1489 ', ' 01450 ':' 1450 '}
It looks like your concern is that it's also modifying numbers that contain your src pattern as a substring. To avoid that, you'll need to first define the boundaries that should be respected. For instance, do you want to insist that only matched numbers surrounded by spaces get replaced? Or perhaps just that there be no adjacent digits (or periods or commas). Since you'll probably want to use regular expressions to constrain the matching, as suggested by JoshuaF in another answer, you'll likely need to avoid the simple replace function in favor of something from the re library.
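As a minimal sketch of that re-based approach, assuming the numbers are delimited by spaces or other non-word characters, `\b` word boundaries stop the pattern from matching inside a longer digit run:

```python
import re

replacements = {'01489': '1489', '01450': '1450'}
# \b word boundaries keep '01489' from matching inside '014896';
# for purely numeric data, digit lookarounds (?<!\d)...(?!\d) work too
pattern = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, replacements)))

line = 'codes: 01489 and 014896'
fixed = pattern.sub(lambda m: replacements[m.group(0)], line)
# fixed == 'codes: 1489 and 014896'
```

Joining the keys into one alternation means the whole dictionary is applied in a single pass per line instead of one `replace` call per key.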
Use regexp with negative lookarounds:
import re

replacements = {'01489':'1489', '01450':'1450'}

def find_replacement(match_obj):
    number = match_obj.group(0)
    return replacements.get(number, number)

with open('test_fplan.txt') as infile:
    with open('test_fplan_neu.txt', 'w') as outfile:
        outfile.writelines(
            re.sub(r'(?<!\d)(\d+)(?!\d)', find_replacement, line)
            for line in infile
        )
Check out the regular expression syntax https://docs.python.org/2/library/re.html. It should allow you to match whatever pattern you're looking for exactly.
Related
I'm working with a .csv file and, as always, it has format problems. In this case it's a ;-separated table, but there's a column whose values sometimes contain semicolons themselves, like this:
code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction; animals;2;2
So there are three cases:
no semicolon -> no problem
word character (non-numeric), semicolon, whitespace, word character (non-numeric)
word character (non-numeric), semicolon, 2x whitespace, word character (non-numeric)
I turned the .csv into a .txt, imported it as a string, and then compiled this regex:
re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)
Which should do. I almost managed to replace those semicolons for commas, doing the following:
def replace_comma(match):
    text = match.group()
    return text.replace(';', ',')

regex = re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)
string2 = string.split('\n')
for n, i in enumerate(string2):
    if len(re.findall(r'([^\d\W]);(\s+)([^\d\W])', i)) >= 1:
        string2[n] = regex.sub(replace_comma, i)
This mostly works, but when there are two whitespace characters after the semicolon, it leaves an \xa0 after the comma. I have two problems with this approach:
It's not very straightforward.
Why is it leaving this \xa0 character?
Do you know any better way to approach this?
Thanks
Edit: My desired output would be:
code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction, animals;2;2
Edit: Added explanation about turning the file into a string for better manipulation.
For this case I wouldn't use regex; split() and rsplit() with the maxsplit= parameter are enough:
data = '''1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction; animals;2;2'''

for line in data.splitlines():
    row = line.split(';', maxsplit=1)
    row = row[:1] + row[-1].rsplit(';', maxsplit=2)
    row[1] = row[1].replace(';', ',')
    print(';'.join(row))
Prints:
1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction, animals;2;2
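If you do want a regex, a single substitution can also handle it: a legitimate field separator in this data is never followed by whitespace, so ";" plus whitespace marks a semicolon inside the summary text. This sketch also normalizes the stray \xa0, since \s matches non-breaking spaces too:

```python
import re

rows = [
    'code;summary;sector;sub_sector',
    '1;fishes;2;2',
    '2;agriculture; also fishes;1;2',
    '3;fishing. Extraction; animals;2;2',
]
# replace ';' + any run of whitespace (including \xa0) with ', '
fixed = [re.sub(r';\s+', ', ', row) for row in rows]
# fixed[2] == '2;agriculture, also fishes;1;2'
```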
I currently have a list of filenames in a txt file and I am trying to sort them. The first thing I am trying to do is split them into a list, since they are all on a single line. There are 3 types of file extensions in the list. I am able to split the list, but I would like to keep the delimiters in the end result, and I have not been able to find a way to do this. The way that I am splitting the files is as follows:
import re

def breakLines():
    unsorted_list = []
    file_obj = open("index.txt", "rt")
    file_str = file_obj.read()
    unsorted_list.append(re.split('.txt|.mpd|.mp4', file_str))
    print(unsorted_list)

breakLines()
I found DeepSpace's answer to be very helpful here Split a string with "(" and ")" and keep the delimiters (Python), but that only seems to work with single characters.
EDIT:
Sample input:
file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4
Expected output:
file_name1234.mp4
file_name1235.mp4
file_name1236.mp4
file_name1237.mp4
In re.split, the key is to parenthesise the split pattern so it's kept in the result of re.split. Your attempt is:
>>> s = "file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4"
>>> re.split('.txt|.mpd|.mp4', s)
['file_name1234', 'file_name1235', 'file_name1236', 'file_name1237', '']
okay that doesn't work (and the dots would need escaping to be really compliant with what an extension is), so let's try:
>>> re.split(r'(\.txt|\.mpd|\.mp4)', s)
['file_name1234',
'.mp4',
'file_name1235',
'.mp4',
'file_name1236',
'.mp4',
'file_name1237',
'.mp4',
'']
works, but this splits the extensions from the filenames and leaves a blank at the end - not what you want (unless you want some ugly post-processing). Plus this is a duplicate question: In Python, how do I split a string and keep the separators?
But you don't want re.split you want re.findall:
>>> s = "file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4"
>>> re.findall(r'(\w*?(?:\.txt|\.mpd|\.mp4))', s)
['file_name1234.mp4',
'file_name1235.mp4',
'file_name1236.mp4',
'file_name1237.mp4']
the expression matches word characters (basically digits, letters & underscores), followed by the extension. To be able to create an OR, I used a non-capturing group inside the main group.
If you have more exotic file names, you can't use \w any more, but .*? still works reasonably well (you may need some str.strip post-processing to remove leading/trailing blanks, which are likely not part of the filenames):
>>> s = " file name1234.mp4file-name1235.mp4 file_name1236.mp4file_name1237.mp4"
>>> re.findall(r'(.*?(?:\.txt|\.mpd|\.mp4))', s)
[' file name1234.mp4',
'file-name1235.mp4',
' file_name1236.mp4',
'file_name1237.mp4']
So sometimes you think re.split when you need re.findall, and the reverse is also true.
I am new to regex. I am attempting to use regex with python to find a line in a file and extract all of the subsequent words separated by tab stops. My line looks like this.
#position 4450 4452 4455 4465 4476 4496 D110 D111 D112 D114 D116 D118 D23 D24 D27 D29 D30 D56 D59 D69 D85 D88 D90 D91 JW1 JW10 JW15 JW22 JW28 JW3 JW35 JW39 JW43 JW45 JW47 JW49 JW5 JW52 JW54 JW56 JW57 JW59 JW66 JW7 JW70 JW75 JW77 JW9 REF_OR74A
I have identified that the base of this expression involves the positive lookbehind.
(?<=#position).*
I do not expect this to separate the matches by tabstop. However, it does find my line in the file:
import re
file = open('src.txt','r')
f = list(file)
file.close()
pattern = '(?<=#position).*'
regex = re.compile(pattern)
regex.findall(''.join(f))
['\t4450\t4452\t4455\t4465\t4476\t4496\tD110\tD111\tD112\tD114\tD116\tD118\tD23\tD24\tD27\tD29\tD30\tD56\tD59\tD69\tD85\tD88\tD90\tD91\tJW1\tJW10\tJW15\tJW22\tJW28\tJW3\tJW35\tJW39\tJW43\tJW45\tJW47\tJW49\tJW5\tJW52\tJW54\tJW56\tJW57\tJW59\tJW66\tJW7\tJW70\tJW75\tJW77\tJW9\tREF_OR74A']
With some kludge and list slicing / string methods, I can manipulate this and get my data out. What I'd really like to do is have findall yield a list of just these entries. What would the regular expression look like to do that?
Do you need to use regex? List slicing and string methods don't appear to be as much of a kludge as you say.
something like:
f = open('src.txt', 'r')
for line in f:
    if line.startswith("#position"):
        l = line.split()  # with no arguments it splits on all whitespace characters
        l = l[1:]  # get rid of the "#position" tag
        break
and further manipulate from there?
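If you do want findall-style output directly from the regex, one option (a sketch, assuming the fields are tab-separated and never contain tabs) is to keep the lookbehind but grab the rest of the line in a group and split it:

```python
import re

line = '#position\t4450\t4452\tD110\tD111'
# the lookbehind anchors at the tag; group 1 is everything after it
m = re.search(r'(?<=#position)\t?(.*)', line)
entries = m.group(1).split('\t') if m else []
# entries == ['4450', '4452', 'D110', 'D111']
```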
I want to replace all occurrences of integers that are greater than 2147483647 and are followed by ^^<int> with the first 3 digits of the number. For example, I have my original data as:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
I want to replace the original data by the below mentioned data:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "255"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
The way I have implemented it is by scanning the data line by line. If I find a number greater than 2147483647, I replace it with its first 3 digits. However, I don't know how I should check that the next part of the string is ^^<int>.
What I want to do is: for numbers greater than 2147483647 e.g. 25500000000, I want to replace them with the first 3 digits of the number. Since my data is 1 Terabyte in size, a faster solution is much appreciated.
Use the re module to construct a regular expression:
regex = r"""
(.*?)            # Capture any leading text lazily into group #1
"(\d{10,})"      # Match a quoted run of at least 10 digits into group #2
(\^\^<int>)      # Match the two circumflex characters (escaped, since a
                 # bare ^ is an anchor) and <int> into group #3
(.*)             # Followed by anything at all into group #4
"""
COMPILED_REGEX = re.compile(regex, re.VERBOSE)
Next, we need to define a callback function (since re.RegexObject.sub takes a callback) to handle the replacement:
def replace_callback(matches):
    full_line = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        # str.replace needs strings, so pass the digits, not the int
        return full_line.replace(number_text, number_text[:3])
    else:
        return full_line
And then find and replace:
fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
If you have a terabyte of data you will probably not want to do this in memory - you'll want to open the file and then iterate over it, replacing the data line by line and writing it back out to another file (there are undoubtedly ways to speed this up, but they would make the gist of the technique harder to follow):
# Given the above
def process_data():
    with open("path/to/your/file") as data_file, \
            open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
If each line in your text file looks like your example, then you can do this:
In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"'
In [2079]: re.findall(r'\d+"\^\^', line)
Out[2079]: ['25500000000"^^']
with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall(r'\d+"\^\^', line):
            if int(found[:-3]) > 2147483647:
                # keep the closing quote and circumflexes that were matched
                line = line.replace(found, found[:3] + found[-3:])
        outfile.write(line)
Because of the inner for-loop, this has the potential to be an inefficient solution. However, I can't think of a better regex at the moment, so this should get you started, at the very least.
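Since the question is specifically how to check that ^^<int> follows the number, a lookahead can encode that condition directly in the pattern - a sketch (the `shorten` helper name is my own):

```python
import re

def shorten(m):
    n = m.group(1)
    # keep the number unless it exceeds the 32-bit signed maximum
    return '"%s"' % (n[:3] if int(n) > 2147483647 else n)

line = '"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .'
# the lookahead requires ^^<int> after the closing quote but does not
# consume it, so the suffix survives the substitution untouched
fixed = re.sub(r'"(\d+)"(?=\^\^<int>)', shorten, line)
# fixed == '"Ask a Question" <at> "255"^^<int> <stack_overflow> .'
```

Numbers not followed by ^^<int>, like "89028899" in the sample, never match and are left alone, so no inner loop is needed.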
I have a file of accession numbers and 16S rrna sequences, and what I'm trying to do is remove all lines of RNA, and only keep the lines with the accession numbers and the species name (and remove all the junk in between). So my input file looks like this (there are > in front of the accession numbers):
> D50541 1 1409 1409bp rna Abiotrophia defectiva Aerococcaceae
CUGGCGGCGU GCCUAAUACA UGCAAGUCGA ACCGAAGCAU CUUCGGAUGC UUAGUGGCGA ACGGGUGAGU AACACGUAGA
UAACCUACCC UAGACUCGAG GAUAACUCCG GGAAACUGGA GCUAAUACUG GAUAGGAUAU AGAGAUAAUU UCUUUAUAUU
(... and many more lines)
> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae
CUGGCGGCGU GCCUAAUACA UGCAAGUCGA ACGCUCUAUA GCAAUAUAGG GAGUGGCGAA CGGGUGAGUA ACACGUAGAU
AACCUACCCU UACUUCGAGG AUAACUUCGG GAAACUGGAG CUAAUACUGG AUAGGACAUA UUGAGGCAUC UUAAUAUGUU
...
I want my output to look like this:
>D50541 Abiotrophia defectiva Aerococcaceae
>AY538167 Acholeplasma hippikon Acholeplasmataceae
The code I wrote does what I want... for most of the lines. It looks like this:
#!/usr/bin/env python
# take LTPs111.compressed fasta and reduce to accession numbers with names.
import re

infilename = 'LTPs111.compressed.fasta'
outfilename = 'acs.fasta'

regex = re.compile(r'(>)\s(\w+).+[rna]\s+([A-Z].+)')

# remove extra letters and spaces
with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
    for line in infile:
        x = regex.sub(r'\1\2 \3', line)
        # remove rna sequences
        for line in x:
            if '>' in line:
                outfile.write(x)
Sometimes the code seems to skip over some of the names. For example, for the first accession number above, I only got back:
>D50541 Aerococcaceae
Why might my code be doing this? The input for each accession number looks identical, and the spacing between 'rna' and the first name is the same for each line (5 spaces).
Thank you to anyone who might have some ideas!
I still haven't been able to run your code to get the claimed results, but I think I know what the problem is:
>>> line = '> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae'
>>> regex = re.compile(r'(>)\s(\w+).+[rna]\s+([A-Z].+)')
>>> regex.findall(line)
[('>', 'AY538167', 'Acholeplasmataceae')]
The problem is that [rna]\s+ matches any one of the characters r, n, or a at the end of a word. And, because all of the matches are greedy, with no lookahead or anything else to prevent it, this means that it matches the n at the end of hippikon.
The simple solution is to remove the brackets, so it matches the string rna:
>>> regex = re.compile(r'(>)\s(\w+).+rna\s+([A-Z].+)')
That won't work if any of your species or genera can end with that string. Are there any such names? If so, you need to come up with a better way to describe the cutoff between the 1409bp part and the rna part. The simplest may be to just look for rna surrounded by spaces:
>>> regex = re.compile(r'(>)\s(\w+).+\s+rna\s+([A-Z].+)')
Whether this is actually correct or not, I can't say without knowing more about the format, but hopefully you understand what I'm doing well enough to verify that it's correct (or at least to ask smarter questions than I can ask).
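To see the tightened pattern in action on the header line from the question (a quick sketch):

```python
import re

line = '> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae'
# \s+rna\s+ now requires the standalone token 'rna', so the n at the
# end of 'hippikon' can no longer satisfy the pattern
regex = re.compile(r'(>)\s(\w+).+\s+rna\s+([A-Z].+)')
result = regex.sub(r'\1\2 \3', line)
# result == '>AY538167 Acholeplasma hippikon Acholeplasmataceae'
```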
It may help debug things to add capture groups. For example, instead of this:
(>)\s(\w+).+[rna]\s+([A-Z].+)
… search for this:
(>)(\s)(\w+)(.+[rna]\s+)([A-Z].+)
Obviously your desired capture groups are now \1\3 \5 instead of \1\2 \3… but the big thing is that you can see what got matched in \4:
[('>', ' ', 'AY538167', ' 1 1411 1411bp Acholeplasma hippikon ', 'Acholeplasmataceae')]
So, now the question is "Why did .+[rna]\s+ match '1 1411 1411bp Acholeplasma hippikon '? Sometimes the context matters, but in this case, it doesn't. You don't want that group to match that string in any context, and yet it will always match it, so that's the part you have to debug.
Also, a visual regexp explorer often helps a lot. The best ones can color parts of the expression and the matched text, etc., to show you how and why the regexp is doing what it does.
Of course you're limited by those that run on your platform or online, and work with Python syntax. If you're careful and/or only use simple features (as in your example), perl/PCRE syntax is very close to Python, and JavaScript/ActionScript is also pretty close (the one big difference to keep in mind is that replace/sub uses $ instead of \1).
I don't have a good online one to strongly recommend, but from a quick glance Debuggex looks pretty cool.
Items between brackets form a character class, so by setting your regex to look for "[rna]" you are asking for a single character that is r, n, or a - not the three-character sequence "rna".
Further, if the lines you want all have the pattern "bp rna", I'd use that to yank those lines. By reading the file in line by line, the following worked for me for a quick and dirty line-yanker, for instance:
regex = re.compile(r'^[\w\s]+bp rna .*$')
But, again, if it's as simple as finding lines with "bp rna" in them, you could read the file line by line and forego regex entirely:
for line in file:
if "bp rna" in line:
print(line)
EDIT: I blew it by not reading the request carefully enough. Maybe a capture-and-replace regex would help?
for line in file:
    if "bp rna" in line:
        # (>)\s*(\w+) allows the space after '>' shown in the sample input
        subreg = re.sub(r'^(>)\s*(\w+)\s[\d\s]+bp\srna\s([\w\s]+$)', r"\1\2 \3", line)
        print(subreg)
OUTPUT:
>AY538166 Acholeplasma granularum Acholeplasmataceae
>AY538167 Acholeplasma hippikon Acholeplasmataceae
This should match any whitespace (tabs or spaces) between the things you want.
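A regex-free variant of the same idea (a sketch, assuming "rna" always appears as its own whitespace-separated token right before the names):

```python
line = '> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae'
parts = line.split()
# everything after the 'rna' token is the species/family name
idx = parts.index('rna')
result = '>' + parts[1] + ' ' + ' '.join(parts[idx + 1:])
# result == '>AY538167 Acholeplasma hippikon Acholeplasmataceae'
```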