Join lines together in file that end with '\' - python

I am really new to Python and for some reason this has stumped me for a while, so I figured I'd ask for help.
I am working on a Python script that would allow me to read in my files, but if there is a '\' at the end of a line it should join that line with the line after it.
So if the lines are as follows:
: Student 1
: Student 2 \
Student 3
For any line that doesn't have the colon before it, where the previous line ends with '\', I want to combine them to look like this:
: Student 2 Student 3
Here is what I tried:
s = ""
if line.endswith('\\'):
s.join(line) ## line being the line read from the file
Any help in the right direction would be great.

s.join doesn't do what you think it does. Also consider that the line read from the file has a trailing newline character ('\n'), so .endswith('\\') won't match for that reason.
Something like this works (although with a somewhat different method):
output = ''
with open('/path/to/file.txt') as f:
    for line in f:
        if line.rstrip().endswith('\\'):
            next_line = next(f)
            line = line.rstrip()[:-1] + next_line
        output += line
In the above, we used line.rstrip() to get rid of any trailing whitespace (the newline character) so that the .endswith method would match properly.
If a line ends with \, we go ahead and pull the next line out of the file generator using the builtin function next.
Finally, we combine the line and the next line, taking care to once again remove the trailing whitespace (.rstrip()) and the \ character ([:-1] means all characters up to the last one), and add the joined line to the output.
The resulting string prints out like so
: Student 1
: Student 2 Student 3
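One caveat: if the line pulled in with next(f) itself ends with \, the snippet above won't join a third line. A small variant of the same idea (same placeholder path, and assuming a continued line is never the last line of the file) keeps joining until the continuations stop:
output = ''
with open('/path/to/file.txt') as f:
    for line in f:
        # keep pulling lines while the accumulated line still ends with '\'
        while line.rstrip().endswith('\\'):
            line = line.rstrip()[:-1] + next(f)
        output += line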
Note about s.join... It's probably best explained as the opposite of split, using s as the separator (or joining) character.
>>> "foo.bar.baz".split('.')
['foo', 'bar', 'baz']
>>> "|".join(['foo', 'bar', 'baz'])
'foo|bar|baz'

If you can read the full file without splitting it into lines, you can use a regex:
import re
text = """
: Student 1
: Student 2 \
Student 3
""".strip()
print(re.sub(r'\\\s*\n[^:]', ' ', text))
: Student 1
: Student 2 Student 3
The regex matches occurrences of \ followed by a newline, but only when the next character is not a : (the lookahead keeps that character in place). Note the r prefix on the triple-quoted string: without it, Python treats the trailing \ as a line continuation and the demo text would contain no backslash at all.
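To run the same substitution over an actual file instead of an inline string, you can read the whole file in one go (the path is a placeholder):
import re

with open('/path/to/file.txt') as f:  # placeholder path
    text = f.read()
print(re.sub(r'\\\s*\n(?=[^:])', '', text))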

You can use regex and join to avoid a loop if you start with a list of strings.
import re

l = ['a\\', 'b', 'c']
s = '_'.join(l)
lx = re.split(r'(?<!\\)_', s)  # negative lookbehind: only split on underscores with no `\` before them
[e.replace('\\_', '') for e in lx]  # replace with '', or ' ' if you need the space
Output:
['ab', 'c']
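To see the intermediate values the trick builds (shown as Python reprs):
>>> s
'a\\_b_c'
>>> lx
['a\\_b', 'c']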


Split string if separator is not in-between two characters

I want to write a script that reads from a csv file and splits each line by comma except any commas in-between two specific characters.
In the below code snippet I would like to split line by commas except the commas in-between two $s.
line = "$abc,def$,$ghi$,$jkl,mno$"
output = line.split(',')
for o in output:
print(o)
How do I write output = line.split(',') so that I get the following terminal output?
~$ python script.py
$abc,def$
$ghi$
$jkl,mno$
You can do this with a regular expression:
In re, (?<=\$) matches at a position immediately after a $.
Similarly, (?=\$) matches at a position immediately before a $.
The | character can match multiple options. So to match a comma that has a $ on at least one side (which is true of the separators here and false of the commas inside a field) you can use:
expression = r"(?<=\$),|,(?=\$)"
Full program:
import re
expression = r"(?<=\$),|,(?=\$)"
print(re.split(expression, "$abc,def$,$ghi$,$jkl,mno$"))
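This prints the line split only at the field-separating commas:
['$abc,def$', '$ghi$', '$jkl,mno$']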
One solution (maybe not the most elegant, but it will work) is to replace the string $,$ with something like $,,$ and then split on ,,. So something like this:
output = line.replace('$,$','$,,$').split(',,')
Using regex like mousetail suggested is the more elegant and robust solution, but it requires knowing regex (not that anyone KNOWS regex)
Try regular expressions:
import re
line = "$abc,def$,$ghi$,$jkl,mno$"
output = re.findall(r"\$(.*?)\$", line)
for o in output:
    print('$'+o+'$')
$abc,def$
$ghi$
$jkl,mno$
First, you can identify a character that is not used in that line:
c = chr(max(map(ord, line)) + 1)
Then, you can proceed as follows:
line.replace('$,$', f'${c}$').split(c)
Here is your example:
>>> line = '$abc,def$,$ghi$,$jkl,mno$'
>>> c = chr(max(map(ord, line)) + 1)
>>> result = line.replace('$,$', f'${c}$').split(c)
>>> print(*result, sep='\n')
$abc,def$
$ghi$
$jkl,mno$

Implement regular expression in Python to replace every occurrence of "meshname = x" in a text file

I want to replace with " " every line in a text file that starts with "meshname = " and ends with any letter/number/underscore combination. I used regexes in CS but I never really understood the different notations in Python. Can you help me with that?
Is this the right regex for my problem, and how would I transform it into a Python regex?
m.e.s.h.n.a.m.e.' '.=.' '.{{_}*,{0,...,9}*,{a,...,z}*,{A,...,Z}*}*
x.y = Concatenation of x and y
' ' = whitespace
{x} = set containing x
x* = x.x.x. ... .x or empty word
What would the script look like in order to replace every string/line in a file containing meshname = ... with the Python regex? Something like this?
fin = open("test.txt", 'r')
data = fin.read()
data = data.replace("^meshname = [[a-z]*[A-Z]*[0-9]*[_]*]+", "")
fin.close()
fin = open("test.txt", 'w')
fin.write(data)
fin.close()
or is this completely wrong? I've tried to get it working with this approach, but somehow it never matched the right string: How to input a regex in string.replace?
Following the current code logic, you can use
data = re.sub(r'^meshname = .*\w$', ' ', data, flags=re.M)
The re.sub will replace with a space any line that matches
^ - line start (note the flags=re.M argument that makes sure the multiline mode is on)
meshname - a meshname word
= - a = string
.* - any zero or more chars other than line break chars as many as possible
\w - a letter/digit/_
$ - line end.
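Putting that back into your read/modify/write flow (the same test.txt as in your attempt):
import re

with open("test.txt") as fin:
    data = fin.read()

data = re.sub(r'^meshname = .*\w$', ' ', data, flags=re.M)

with open("test.txt", "w") as fout:
    fout.write(data)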

How to substitute specific patterns in python

I want to replace all occurrences of integers which are greater than 2147483647 and are followed by ^^<int> with the first 3 digits of the number. For example, I have my original data as:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
I want to replace the original data by the below mentioned data:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "255"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
The way I have implemented it is by scanning the data line by line. If I find numbers greater than 2147483647, I replace them with the first 3 digits. However, I don't know how I should check that the next part of the string is ^^<int> .
What I want to do is: for numbers greater than 2147483647, e.g. 25500000000, I want to replace them with the first 3 digits of the number. Since my data is 1 terabyte in size, a faster solution is much appreciated.
Use the re module to construct a regular expression:
regex = r"""
( # Capture in group #1
"[\w\s]+" # Three sequences of quoted letters and white space characters
\s+ # followed by one or more white space characters
"[\w\s]+"
\s+
"[\w\s]+"
\s+
)
"(\d{10,})" # Match a quoted set of at least 10 integers into group #2
(^^\s+\.\s+) # Match by two circumflex characters, whitespace and a period
# into group #3
(.*) # Followed by anything at all into group #4
"""
COMPILED_REGEX = re.compile(regex, re.VERBOSE)
Next, we need to define a callback function (since re.RegexObject.sub takes a callback) to handle the replacement:
def replace_callback(matches):
    full_line = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        # replace the digit string (not the int) with its first 3 digits
        return full_line.replace(number_text, number_text[:3])
    else:
        return full_line
And then find and replace:
fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
If you have a terabyte of data you will probably not want to do this in memory - you'll want to open the file and then iterate over it, replacing the data line by line and writing it back out to another file (there are undoubtedly ways to speed this up, but they would make the gist of the technique harder to follow):
# Given the above
def process_data():
    with open("path/to/your/file") as data_file, \
         open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
If each line in your text file looks like your example, then you can do this:
In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"'
In [2079]: re.findall('\d+"\^\^', line)
Out[2079]: ['25500000000"^^']
with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall(r'\d+"\^\^', line):
            if int(found[:-3]) > 2147483647:
                # keep the '"^^' suffix; shorten only the digits
                line = line.replace(found, found[:3] + found[-3:])
        outfile.write(line)
Because of the inner for-loop, this has the potential to be an inefficient solution. However, I can't think of a better regex at the moment, so this should get you started, at the very least
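As a sketch of one way around the inner loop (assuming the oversized numbers are always quoted and immediately followed by ^^, as in the sample), re.sub with a callback does the test and the replacement in a single pass:
import re

def shorten(match):
    digits = match.group(1)
    # shorten only values that overflow a signed 32-bit integer
    if int(digits) > 2147483647:
        return '"%s"^^' % digits[:3]
    return match.group(0)

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        outfile.write(re.sub(r'"(\d{10,})"\^\^', shorten, line))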

How do I efficiently replace the last line in a string?

I have a multi-line string, and I would like to replace the last line of the string with a different line. How do I most efficiently do this?
Split on the last linebreak and attach a new line:
new = old.rstrip('\n').rsplit('\n', 1)[0] + '\nNew line to be added with line break included.'
This first removes any trailing newline after the last line, splits once on the last newline in the text, takes everything before that last newline, and concatenates the result with a new newline and text.
Demo:
>>> old = '''The quick
... brown fox jumps
... over the lazy
... dog
... '''
>>> old.rstrip('\n').rsplit('\n', 1)[0] + '\nhorse and rider'
'The quick\nbrown fox jumps\nover the lazy\nhorse and rider'
This presumes that your lines are separated by \n characters; reading text files in text mode gives you such data on any platform.
If you are dealing with data with different line endings, adjust accordingly. In such cases os.linesep can come in useful.
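For example (a sketch, assuming the string genuinely uses the platform separator):
import os

new = old.rstrip(os.linesep).rsplit(os.linesep, 1)[0] + os.linesep + 'New last line'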
I would suggest this approach:
>>> x = """
... test1
... test2
... test3"""
>>> print "\n".join(x.splitlines()[:-1]+["something else"])
test1
test2
something else
>>>
You can accomplish this using a simple regular expression.
import re
new_string = re.sub(r'[^\n]*\n?$', "replacement", "existing\nstring", count=1)
It matches the last line of the string (any run of non-newline characters, plus an optional trailing \n) and replaces it with the replacement string. The count=1 keeps re.sub from also replacing the zero-width match the pattern finds at the very end of the string (since Python 3.7, empty matches adjacent to a non-empty match are replaced too).
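A quick check against the demo data used earlier:
>>> import re
>>> re.sub(r'[^\n]*\n?$', 'horse and rider', 'The quick\nbrown fox jumps\nover the lazy\ndog', count=1)
'The quick\nbrown fox jumps\nover the lazy\nhorse and rider'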

Python - trying to capture the middle of a line, regex or split

I have a text file with some names and emails and other stuff. I want to capture email addresses.
I don't know whether this is a split or regex problem.
Here are some sample lines:
[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81
[name]mark hilly [email]mark.hilly#hotmail.com [dob]02.11.80
[name]gill silly [email]gill.silly#hotmail.com [dob]03.12.79
I want to be able to do a loop that prints all the email addresses.
Thanks.
I'd use a regex:
import re
data = '''[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81
[name]mark hilly [email]mark.hilly#hotmail.com [dob]02.11.80
[name]gill silly [email]gill.silly#hotmail.com [dob]03.12.79'''
group_matcher = re.compile(r'\[(.*?)\]([^\[]+)')
for line in data.split('\n'):
    o = dict(group_matcher.findall(line))
    print o['email']
\[ is literally [.
(.*?) is a non-greedy capturing group. It matches as little text as it can while still letting the overall pattern match.
\] is literally ]
( is the beginning of a capturing group.
[^\[] matches anything but a [.
+ repeats the last pattern any number of times.
) closes the capturing group.
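To see what findall hands to dict for a single line (the values keep a trailing space, which doesn't matter for printing):
>>> group_matcher.findall('[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81')
[('name', 'bill billy '), ('email', 'bill.billy#hotmail.com '), ('dob', '01.01.81')]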
for line in lines:
    print line.split("]")[2].split(" ")[0]
You can pass whole substrings, not just single characters, as separators to partition and split, so:
email = line.partition('[email]')[-1].partition('[')[0].rstrip()
This has an advantage over the simple split solutions that it will work on fields that can have spaces in the value, on lines that have things in a different order (even if they have [email] as the last field), etc.
To generalize it:
def get_field(line, field):
    return line.partition('[{}]'.format(field))[-1].partition('[')[0].rstrip()
However, I think it's still more complicated than the regex solution. Plus, it can only search for exactly one field at a time, instead of all fields at once (without making it even more complicated). To get two fields, you'll end up parsing each line twice, like this:
for line in data.splitlines():
    print '''{} "babysat" Dan O'Brien on {}'''.format(get_field(line, 'name'),
                                                      get_field(line, 'dob'))
(I may have misinterpreted the DOB field, of course.)
You can split by space and then search for the element that starts with [email]:
line = '[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81'
items = line.split()
for item in items:
    if item.startswith('[email]'):
        print item.replace('[email]', '', 1)
Say you have a file with such lines:
import re

f = open("logfile", "r")
data = f.read()
for line in data.split("\n"):
    match = re.search(r"email\](?P<id>.*)\[dob", line)
    if match:
        # either store or print the emails as you like
        print match.group('id').strip(), "\n"
That's all! (Try it; for Python 3 and above, remember that print is a function, so make those changes.)
The output from your sample data:
bill.billy#hotmail.com
mark.hilly#hotmail.com
gill.silly#hotmail.com
