sed 's/\t/_tab_/3g'
I have a sed command that basically replaces all excess tab delimiters in my text document.
My documents are supposed to be 3 columns, but occasionally there's an extra delimiter. I don't have control over the files.
I use the above command to clean up the document. However all my other operations on these files are in python. Is there a way to do the above sed command in python?
sample input:
Column1 Column2 Column3
James 1,203.33 comment1
Mike -3,434.09 testing testing 123
Sarah 1,343,342.23 there here
sample output:
Column1 Column2 Column3
James 1,203.33 comment1
Mike -3,434.09 testing_tab_testing_tab_123
Sarah 1,343,342.23 there_tab_here
You may read the file line by line, split on tab, and if there are more than 3 items, join the items after the 3rd one with _tab_:
lines = []
with open('inputfile.txt', 'r') as fr:
    for line in fr:
        split = line.split('\t')
        if len(split) > 3:
            tmp = split[:2]  # Slice off the first two items
            tmp.append("_tab_".join(split[2:]))  # Append the rest joined with _tab_
            lines.append("\t".join(tmp))  # Use the updated line
        else:
            lines.append(line)  # Else, keep the line as is
The lines variable will contain something like
Mike -3,434.09 testing_tab_testing_tab_123
Mike -3,434.09 testing_tab_256
Alternatively, you can call sed itself from Python:
import os
os.system("sed -i 's/\t/_tab_/3g' " + file_path)
Note the -i argument to sed, which modifies the input file in place.
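If you do shell out, subprocess is usually preferred over os.system. Here is a sketch; the temporary file only stands in for your real file_path, and both the -i flag and the \t escape assume GNU sed:

```python
import subprocess
import tempfile

# Demo file standing in for the real file_path (assumption: GNU sed is installed).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Mike\t-3,434.09\ttesting\ttesting\t123\n")
    file_path = f.name

# An argument list instead of a shell string: no quoting issues with file_path,
# and check=True raises CalledProcessError if sed fails.
subprocess.run(["sed", "-i", r"s/\t/_tab_/3g", file_path], check=True)

with open(file_path) as f:
    print(f.read(), end="")
```

Passing the command as a list avoids problems with file names containing spaces or shell metacharacters.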
You can mimic the sed behavior in Python:
import re

pattern = re.compile(r'\t')
string = 'Mike\t3,434.09\ttesting\ttesting\t123'
replacement = '_tab_'
count = -1
spans = []
start = 2  # 0-based index of the first match to replace

for match in re.finditer(pattern, string):
    count += 1
    if count >= start:
        spans.append(match.span())

spans.reverse()  # Replace from the end so earlier spans stay valid
new_str = string
for sp in spans:
    new_str = new_str[0:sp[0]] + replacement + new_str[sp[1]:]
And now new_str is 'Mike\t3,434.09\ttesting_tab_testing_tab_123'.
You can wrap it in a function and repeat for every line.
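One way to wrap the same sed-like behavior in a function is a counting callback with re.sub instead of collecting spans; a sketch (the name replace_from_nth is mine):

```python
import re

def replace_from_nth(string, pattern, replacement, n):
    """Replace the n-th match (1-based) and every later one, like GNU sed's Ng."""
    seen = 0
    def repl(m):
        nonlocal seen
        seen += 1
        # keep matches before the n-th, replace from the n-th on
        return replacement if seen >= n else m.group(0)
    return re.sub(pattern, repl, string)

print(replace_from_nth('Mike\t3,434.09\ttesting\ttesting\t123', r'\t', '_tab_', 3))
```

Because re.sub calls the function once per match in order, no reverse pass over spans is needed.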
However, note that this GNU sed behavior isn't standard. From the GNU sed manual:
'NUMBER'
    Only replace the NUMBERth match of the REGEXP.
    Note: the POSIX standard does not specify what should happen when
    you mix the 'g' and NUMBER modifiers, and currently there is no
    widely agreed upon meaning across 'sed' implementations. For GNU
    'sed', the interaction is defined to be: ignore matches before the
    NUMBERth, and then match and replace all matches from the NUMBERth
    on.
Related
I have a csv file in which pipes serve as delimiters.
But sometimes a short substring follows the 3rd pipe: up to 2 alphanumeric characters behind it. Then the 3rd pipe should not be interpreted as a delimiter.
example: split on each pipe:
x1 = "as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY"
=> split after XXL because it is followed by more than 2 characters
examples: split on all pipes except the 3rd if there are less than 3 characters between pipes 3 and 4:
x2 = "as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf"
x3 = "as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf"
=> keep "1g|z4" and "a1|2" together.
My regex attempts only suffice for a simpler substring replacement, which turns a pipe between two digits into a hyphen: 3|4 => 3-4.
x = re.sub(r'(?<=\d)\|(?=\d)', repl='-', string=x1, count=1)
My question is:
If after the third pipe follows a short alphanumeric substring no longer than 1 or 2 characters (like Bx, 2, 42, z or 3b), then re.split should ignore the 3rd pipe and continue with the 4th pipe. All other pipes but #3 are unconditional delimiters.
You can use re.sub to add a quote character around the short columns, then use Python's builtin csv module to parse the text:
import re
import csv
from io import StringIO

txt = """\
as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY
as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf
as234-H3wer23-dZ|df3r Xa12yu wg|a1|2|sweDSgIKU|2017-SS0|123.45|YTf"""

pat = re.compile(r"^((?:[^|]+\|){2})([^|]+\|[^|]{,2}(?=\|))", flags=re.M)
txt = pat.sub(r'\1"\2"', txt)

reader = csv.reader(StringIO(txt), delimiter="|", quotechar='"')
for line in reader:
    print(line)
Prints:
['as234-HJ123-HG', 'dfdf KHT werg', 'XXL', 's45dtgIKU', '2017-SS0', '123.45', 'asUJY']
['as234-H344423-dfX', 'dfer XXYUyu werg', '1g|z4', 'sweDSgIKU', '2017-SS0', '123.45', 'YTf']
['as234-H3wer23-dZ', 'df3r Xa12yu wg', 'a1|2', 'sweDSgIKU', '2017-SS0', '123.45', 'YTf']
I adapted Andrej's solution as follows:
Assume that the dataframe has already been imported from csv without parsing.
To split the dataframe's single column 0, apply a function that checks if the 3rd pipe is a qualified delimiter.
pat1 is Andrej's pattern for deciding whether substring4, the part after the 3rd pipe, is longer than 2 characters. If it is short, then substring3, pipe3 and substring4 are enclosed in double quotes in text x (in a dataframe, this result type differs from the list shown by the print loop).
This part could be replaced by a different regex if your own criterion for "delimiter to ignore" differs from the example.
Next, I replace the disqualified pipe(s), those between double quotes, with a hyphen: pat2 in re.sub.
The function returns the resulting text y to the new dataframe column "out".
We can then strip the double quotes from the entire column; they were only needed for the replacements.
Finally, we split column "out" into multiple columns at all remaining pipe delimiters using str.split.
I suppose my 3 steps could be combined into fewer (first enclose the 3rd pipe in double quotes if it matches a pattern that disqualifies it as a delimiter, then replace the disqualified pipe with a hyphen, then split the text/column). But I'm happy enough that this 3-step solution works.
import re

# identify if the 3rd pipe is a valid delimiter:
def cond_replace_3rd_pipe(row):
    # put substring3, the 3rd pipe and a short substring4 between double quotes
    pat1 = re.compile(r"^((?:[^|]+\|){2})([^|]+\|[^|]{,2}(?=\|))", flags=re.M)
    x = pat1.sub(r'\1"\2"', row[0])
    # replace pipes between double quotes with a hyphen
    pat2 = r'"(.+)\|(.+)"'
    y = re.sub(pat2, r'"\1-\2"', x)
    return y

df["out"] = df.apply(cond_replace_3rd_pipe, axis=1, result_type="expand")
df["out"] = df["out"].str.replace('"', "")  # double quotes no longer needed
df["out"].str.split('|', expand=True)  # split "out" into separate columns at all remaining pipes
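For reference, the three steps can indeed be collapsed into a single substitution. A sketch on plain strings (it produces the same hyphen-merged column, e.g. '1g-z4', as the quote-then-replace approach):

```python
import re

# If the field after the 3rd pipe has at most 2 non-pipe characters,
# replace that 3rd pipe with a hyphen in one pass, then split normally.
pat = re.compile(r"^((?:[^|]+\|){2}[^|]+)\|([^|]{,2})(?=\|)")

rows = [
    "as234-HJ123-HG|dfdf KHT werg|XXL|s45dtgIKU|2017-SS0|123.45|asUJY",
    "as234-H344423-dfX|dfer XXYUyu werg|1g|z4|sweDSgIKU|2017-SS0|123.45|YTf",
]
for row in rows:
    print(pat.sub(r"\1-\2", row).split("|"))
```

The same substitution could be applied to the dataframe column with df[0].str.replace(pat, r"\1-\2", regex=True) before splitting.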
I want to write a script that reads from a csv file and splits each line by comma except any commas in-between two specific characters.
In the below code snippet I would like to split line by commas except the commas in-between two $s.
line = "$abc,def$,$ghi$,$jkl,mno$"
output = line.split(',')
for o in output:
print(o)
How do I write output = line.split(',') so that I get the following terminal output?
~$ python script.py
$abc,def$
$ghi$
$jkl,mno$
You can do this with a regular expression:
In re, (?<=\$), matches a comma only when it immediately follows a $.
Similarly, ,(?=\$) matches a comma only when it is immediately followed by a $.
The delimiter commas are exactly the ones between two $ signs, so both conditions must hold at once; chain the lookbehind and lookahead around the comma:
expression = r"(?<=\$),(?=\$)"
Full program:
import re
expression = r"(?<=\$),(?=\$)"
print(re.split(expression, "$abc,def$,$ghi$,$jkl,mno$"))
This prints ['$abc,def$', '$ghi$', '$jkl,mno$'].
One solution (maybe not the most elegant but it will work) is to replace the string $,$ with something like $,,$ and then split ,,. So something like this
output = line.replace('$,$','$,,$').split(',,')
Using regex like mousetail suggested is the more elegant and robust solution but requires knowing regex (not that anyone KNOWS regex)
Try regular expressions:
import re
line = "$abc,def$,$ghi$,$jkl,mno$"
output = re.findall(r"\$(.*?)\$", line)
for o in output:
print('$'+o+'$')
$abc,def$
$ghi$
$jkl,mno$
First, you can identify a character that is not used in that line:
c = chr(max(map(ord, line)) + 1)
Then, you can proceed as follows:
line.replace('$,$', f'${c}$').split(c)
Here is your example:
>>> line = '$abc,def$,$ghi$,$jkl,mno$'
>>> c = chr(max(map(ord, line)) + 1)
>>> result = line.replace('$,$', f'${c}$').split(c)
>>> print(*result, sep='\n')
$abc,def$
$ghi$
$jkl,mno$
I'm reading some strings from a file such as this one:
s = "Ab [word] 123 \test[abc] hi \abc [] a \command123[there\hello[www]]!"
which should be transformed into
"Ab [word] 123 abc hi \abc [] a therewww!"
Another example is
s = "\ human[[[rr] \[A] r \B[] r p\[]q \A[x\B[C]!"
which should be transformed into
"\ human[[[rr] A r r pq \A[xC!"
How can you generalize this to all similar "functions" with alphanumeric names? By "function" I mean a pattern such as \name[arg] where name is a (possibly empty) alphanumeric string and arg is a (possibly empty) arbitrary string.
Update: After reading kcsquared's comments, I looked through the input files and found stray brackets and backslashes, so I've updated my examples accordingly. The previous regex solution (see below) breaks completely for these special cases:
s = re.sub(r'\\command123\[([^}]*)\]', ' \\1', s)
s = re.sub(r'\test\[([^}]*)\]', ' \\1', s) # Fails if this substitution is executed first
s = " ".join(s.split())
Use an array to push and pop the strings onto it, as if it were a stack.
Scan the string by character and interpret it one by one, don't use regex.
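The scan-by-character approach can be sketched like this. The expand helper and its parsing rules are my assumptions: a "function" is a backslash, a possibly empty alphanumeric name, and a balanced [...]; anything that doesn't parse, such as a stray backslash or an unclosed bracket, is copied through unchanged:

```python
def expand(s):
    out = []
    i = 0
    while i < len(s):
        if s[i] == '\\':
            # try to parse \name[ where name is alphanumeric (possibly empty)
            j = i + 1
            while j < len(s) and s[j].isalnum():
                j += 1
            if j < len(s) and s[j] == '[':
                # find the matching close bracket, tracking nesting depth
                depth, k = 1, j + 1
                while k < len(s) and depth:
                    if s[k] == '[':
                        depth += 1
                    elif s[k] == ']':
                        depth -= 1
                    k += 1
                if depth == 0:
                    out.append(expand(s[j + 1:k - 1]))  # recurse into the argument
                    i = k
                    continue
        out.append(s[i])  # not a function: copy the character as-is
        i += 1
    return ''.join(out)

print(expand(r"Ab [word] 123 \test[abc] hi \abc [] a \command123[there\hello[www]]!"))
```

Combined with the whitespace normalization already used above (" ".join(s.split())), this also handles the second example with its stray brackets and backslashes.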
I am really new to python and for some reason this has stumped me for a while so I figured I'd ask for help.
I am working on a python script that would allow me to read in my files, and if there is a '\' at the end of a line, join it with the line after it.
So if the lines are as follows:
: Student 1
: Student 2 \
Student 3
Any line that doesn't start with the colon, where the previous line ends with the '\', should be combined with that previous line to look like this:
: Student 2 Student 3
Here is what I tried:
s = ""
if line.endswith('\\'):
s.join(line) ## line being the line read from the file
Any help in the right direction would be great.
s.join doesn't do what you think it does. Also consider that the line in the file has a newline character ('\n') so .endswith('\\') won't catch for that reason.
Something like this (although somewhat different method)
output = ''
with open('/path/to/file.txt') as f:
    for line in f:
        if line.rstrip().endswith('\\'):
            next_line = next(f)
            line = line.rstrip()[:-1] + next_line
        output += line
In the above, we use line.rstrip() to get rid of any trailing whitespace (the newline character) so that the .endswith method matches properly.
If a line ends with \, we go ahead and pull the next line out of the file generator using the builtin function next.
Finally, we combine the line and next line, taking care to once again remove the whitespace (.rstrip()) and the \ character ([:-1] means all chars up to last character) and taking the new line and adding it to the output.
The resulting string prints out like so
: Student 1
: Student 2 Student 3
Note about s.join... It's probably best explained as the opposite of split, using s as the separator (or joining) character.
>>> "foo.bar.baz".split('.')
['foo', 'bar', 'baz']
>>> "|".join(['foo', 'bar', 'baz'])
'foo|bar|baz'
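A variant of the loop above (a sketch, with io.StringIO standing in for the real file object) that also handles several consecutive continuation lines and a file that ends right after a backslash:

```python
import io

data = ": Student 1\n: Student 2 \\\nStudent 3\n"
f = io.StringIO(data)  # stands in for open('/path/to/file.txt')

output_lines = []
for line in f:
    line = line.rstrip('\n')
    # keep pulling lines while the current line still ends with '\'
    while line.rstrip().endswith('\\'):
        nxt = next(f, '')  # '' if the file ends right after a backslash
        line = line.rstrip()[:-1] + nxt.rstrip('\n')
    output_lines.append(line)

print('\n'.join(output_lines))
```

The two-argument form of next avoids a StopIteration crash when the last line of the file carries a continuation backslash.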
If you can read the full file without splitting it into lines, you can use a regex:
import re
text = r"""
: Student 1
: Student 2 \
Student 3
""".strip()
print(re.sub(r'\s*\\\s*\n(?=[^:])', ' ', text))
: Student 1
: Student 2 Student 3
The regex matches a \ (and any whitespace around it) followed by a newline, as long as the next line does not start with a ':'; the lookahead checks that character without consuming it. Note the raw string (r"""..."""): without it, Python itself would swallow the backslash-newline as a line continuation before the regex ever ran.
You can use regex and join to avoid a loop if you start with a list of strings:
import re
l = ['a\\', 'b', 'c']
s = '_'.join(l)
lx = re.split(r'(?<!\\)_', s)  # negative lookbehind: only split an underscore with no `\` before it
[e.replace('\\_', '') for e in lx]  # replace with '' (or ' ' if you need a space)
Output:
['ab', 'c']
I need to parse lines having multiple language codes, as below:
008800002 Bruxelles-Nord$Brüssel Nord$<deu>$Brussel Noord$<nld>
008800002 being an id
Bruxelles-Nord$Brüssel Nord$ being name one
deu being language one
$Brussel Noord$ being name two
nld being language two.
So, the idea is that a name and language can appear N number of times. I need to collect them all.
The language in <> is 3 characters in length (fixed)
and all names end with a $ sign.
I tried this one but it is not giving the expected output:
x = re.compile('(?P<stop_id>\d{9})\s(?P<authority>[[\x00-\x7F]{3}|\s{3}])\s(?P<stop_name>.*)(?P<lang_code>(?:[<]\S{0,4}))', flags=re.UNICODE)
I have no idea how to get repeated elements.
It takes Bruxelles-Nord$Brüssel Nord$<deu>$Brussel Noord$ as stop_name and <nld> as language.
Do it in two steps. First separate ID from name/language pairs; then use re.finditer on the name/language section to iterate over the pairs and stuff them into a dict.
import re

line = u"008800002 Bruxelles-Nord$Brüssel Nord$<deu>$Brussel Noord$<nld>"
m = re.search(r"(\d+)\s+(.*)", line, re.UNICODE)
id = m.group(1)
names = {}
for m in re.finditer(r"(.*?)<(.*?)>", m.group(2), re.UNICODE):
    names[m.group(2)] = m.group(1)
print(id, names)
\b(\d+)\b\s*|(.*?)(?=<)<(.*?)>
Try this. Just grab the captures. See the demo:
http://regex101.com/r/hS3dT7/4
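In Python, grabbing the captures from that pattern could look like this (my sketch; the alternation's first branch captures the id, the second each name/language pair):

```python
import re

line = "008800002 Bruxelles-Nord$Brüssel Nord$<deu>$Brussel Noord$<nld>"
pat = re.compile(r"\b(\d+)\b\s*|(.*?)(?=<)<(.*?)>")

stop_id, names = None, {}
for m in pat.finditer(line):
    if m.group(1):        # first alternative: the numeric id
        stop_id = m.group(1)
    elif m.group(3):      # second alternative: a name/language pair
        names[m.group(3)] = m.group(2)
print(stop_id, names)
```
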