Joining Multiple Lines in Python (Text Formatting)

Joining Multiple Lines in Python (Text Formatting) - python

I am working on pulling logs through an web API and so far when pulling the logs they return in the following format (3 events below starting with and ending with . My question is what would be the best way to loop through each line and concatenate them so that the result event looks like below.
Current output
<attack_headlines version="1.0.1">
<attack_headline>
<site_id>1</site_id>
<category>V2luZG93cyBEaXJlY3RvcmllcyBhbmQgRmlsZXM=</category>
<subcategory>SUlTIEhlbHA=</subcategory>
<client_ip>172.17.1.126</client_ip>
<date>1363735940</date>
<gmt_diff>0</gmt_diff>
<reference_id>6D13-DE3D-9539-8980</reference_id>
</attack_headline>
</attack_headlines>
<attack_headlines version="1.0.1">
<attack_headline>
<site_id>1</site_id>
<category>V2luZG93cyBEaXJlY3RvcmllcyBhbmQgRmlsZXM=</category>
<subcategory>SUlTIEhlbHA=</subcategory>
<client_ip>172.17.1.136</client_ip>
<date>1363735971</date>
<gmt_diff>0</gmt_diff>
<reference_id>6D13-DE3D-9539-8981</reference_id>
</attack_headline>
</attack_headlines>
<attack_headlines version="1.0.1">
<attack_headline>
<site_id>1</site_id>
<category>V2luZG93cyBEaXJlY3RvcmllcyBhbmQgRmlsZXM=</category>
<subcategory>SUlTIEhlbHA=</subcategory>
<client_ip>172.17.1.156</client_ip>
<date>1363735975</date>
<gmt_diff>0</gmt_diff>
<reference_id>6D13-DE3D-9539-8982</reference_id>
</attack_headline>
</attack_headlines>
Expected output
<attack_headlines version="1.0.1"><attack_headline><site_id>1</site_id<category>V2luZG93cyBEaXJlY3RvcmllcyBhbmQgRmlsZXM=</category<subcategory>SUlTIEhlbHA=</subcategory><client_ip>172.17.1.156</client_ip<date>1363735975</date><gmt_diff>0</gmt_diff<reference_id>6D13-DE3D-9539-8982</reference_id></attack_headline</attack_headlines>
Thanks in advance!
import json
import os
from suds.transport.https import WindowsHttpAuthenticated
class Helpers:
def set_connection(self,conf):
#SUDS BUG FIXER(doctor)
protocol=conf['protocol']
hostname=conf['hostname']
port=conf['port']
path=conf['path']
file=conf['file']
u_name=conf['login']
passwrd=conf['password']
auth_type = conf['authType']
from suds.xsd.doctor import ImportDoctor, Import
from suds.client import Client
url = '{0}://{1}:{2}/{3}/{4}?wsdl'.format(protocol,
hostname,port, path, file)
imp = Import('http://schemas.xmlsoap.org/soap/encoding/')
d = ImportDoctor(imp)
if(auth_type == 'ntlm'):
ntlm = WindowsHttpAuthenticated(username=u_name, password=passwrd)
client = Client(url, transport=ntlm, doctor=d)
else:
client = Client(url, username=u_name, password=passwrd, doctor=d)
return client
def read_from_file(self, filename):
try:
fo = open(filename, "r")
try:
result = fo.read()
finally:
fo.close()
return result
except IOError:
print "##Error opening/reading file {0}".format(filename)
exit(-1)
def read_json(self,filename):
string=self.read_from_file(filename)
return json.loads(string)
def get_recent_attacks(self, client):
import time
import base64
from xml.dom.minidom import parseString
epoch_time_now = int(time.time())
epochtimeread = open('epoch_last', 'r')
epoch_time_last_read = epochtimeread.read()
epochtimeread.close()
epoch_time_last = int(float(epoch_time_last_read))
print client.service.get_recent_attacks("",epoch_time_last,epoch_time_now,1,"",15)

If this is just a single, large string object with line-breaks, you can simply delete them:
import re
text = re.sub('\s*\n\s*', '', text)
To leave the line breaks in that follow the </attack_headline> delimiter, try:
re.sub('(?<!<\/attack_headline>)\s*\n\s*', '', x)

You could use:
oneline = "".join(multiline.split())
Edit 1 (I've just seen your edit) - I will change your code like this:
with open(filename, "r") as fo:
result = []
for line in fo.readlines():
result.append(line.strip())
return result
Edit 2 (I've read your comment on the other answer) - You could do like this:
with open(filename, "r") as fo:
partial = []
for line in fo.readlines():
if line.startswith("<"):
yield "".join(partial)
partial = []
else:
clean = line.strip()
if clean:
partial.append(clean)

import re
# remove all newline whitespace stuff as in answer given before:
text = re.sub(r'\s*\n\s*', '', text)
# break again at desired points:
text = re.sub(r'</attack_headlines>', '</attack_headlines>\n', text)

Related

TypeError: init() got an unexpected keyword argument 'output' -- in a comment stripper program

So my program's basic intention is to take comments out of C files and display them.
import urllib2
import html2text
import re
import subprocess
from cStringIO import StringIO
url = raw_input('Please input URL youd like to analyze: ')
page = urllib2.urlopen(url)
html_content = page.read().decode('utf8')
rendered_content =
html2text.html2text(html_content).encode('ascii','ignore')
f = open('file_text.txt', 'wb')
f.write(rendered_content)
f.close()
fd = open("file_text.txt", "r")
buf = fd.read()
def comment_remover(text):
def replacer(match):
s = match.group(0)
if s.startswith('/'):
return " "
else:
return s
pattern = re.compile(
#Stackoverflow- properly removes all comments
r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
re.DOTALL | re.MULTILINE
)
return re.sub(pattern, replacer, text)
input = StringIO(comment_remover(buf)) # source_code is a string with the source code.
output = StringIO()
process = subprocess.Popen(['sed', '/path/to/remccoms3.sed'], input=input, output=output)
return_code = process.wait()
stripped_code = output.getvalue()
Here's where I get the error message:
from linecache import getline
with open("file_text.txt") as f:
for ind, line in enumerate(f,1):
if line.rstrip() == "---|---":
print(getline(f.name, ind + 4))
Basically the line.rstrip == '---|---' is the point in which, when I run the program, that line is where the comments from the actual page itself is showed, and not the comments from the source.

How to update line with modified data in Jython?

I'm have a csv file which contains hundred thousands of rows and below are some sample lines..,
1,Ni,23,28-02-2015 12:22:33.2212-02
2,Fi,21,28-02-2015 12:22:34.3212-02
3,Us,33,30-03-2015 12:23:35-01
4,Uk,34,31-03-2015 12:24:36.332211-02
I need to get the last column of csv data which is in wrong datetime format. So I need to get default datetimeformat("YYYY-MM-DD hh:mm:ss[.nnn]") from last column of the data.
I have tried the following script to get lines from it and write into flow file.
import json
import java.io
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
class PyStreamCallback(StreamCallback):
def __init__(self):
pass
def process(self, inputStream, outputStream):
text = IOUtils.readLines(inputStream, StandardCharsets.UTF_8)
for line in text[1:]:
outputStream.write(line + "\n")
flowFile = session.get()
if (flowFile != None):
flowFile = session.write(flowFile,PyStreamCallback())
flowFile = session.putAttribute(flowFile, "filename", flowFile.getAttribute('filename'))
session.transfer(flowFile, REL_SUCCESS)
but I am not able to find a way to convert it like below output.
1,Ni,23,28-02-2015 12:22:33.221
2,Fi,21,29-02-2015 12:22:34.321
3,Us,33,30-03-2015 12:23:35
4,Uk,34,31-03-2015 12:24:36.332
I have checked solutions with my friend(google) and was still not able to find solution.
Can anyone guide me to convert those input data into my required output?

In this transformation the unnecessary data located at the end of each line, so it's really easy to manage transform task with regular expression.
^(.*:\d\d)((\.\d{1,3})(\d*))?(-\d\d)?
Check the regular expression and explanation here:
https://regex101.com/r/sAB4SA/2
As soon as you have a large file - better not to load it into the memory. The following code loads whole the file into the memory:
IOUtils.readLines(inputStream, StandardCharsets.UTF_8)
Better to iterate line by line.
So this code is for ExecuteScript nifi processor with python (Jython) language:
import sys
import re
import traceback
from org.apache.commons.io import IOUtils
from org.apache.nifi.processor.io import StreamCallback
from org.python.core.util import StringUtil
from java.lang import Class
from java.io import BufferedReader
from java.io import InputStreamReader
from java.io import OutputStreamWriter
class TransformCallback(StreamCallback):
def __init__(self):
pass
def process(self, inputStream, outputStream):
try:
writer = OutputStreamWriter(outputStream,"UTF-8")
reader = BufferedReader(InputStreamReader(inputStream,"UTF-8"))
line = reader.readLine()
p = re.compile('^(.*:\d\d)((\.\d{1,3})(\d*))?(-\d\d)?')
while line!= None:
# print line
match = p.search(line)
writer.write( match.group(1) + (match.group(3) if match.group(3)!=None else '') )
writer.write('\n')
line = reader.readLine()
writer.flush()
writer.close()
reader.close()
except:
traceback.print_exc(file=sys.stdout)
raise
flowFile = session.get()
if flowFile != None:
flowFile = session.write(flowFile, TransformCallback())
# Finish by transferring the FlowFile to an output relationship
session.transfer(flowFile, REL_SUCCESS)
And as soon as question is about nifi, here are alternatives that seems to be easier
the same code as above but in groovy for nifi ExecuteScript processor:
def ff = session.get()
if(!ff)return
ff = session.write(ff, {rawIn, rawOut->
// ## transform streams into reader and writer
rawIn.withReader("UTF-8"){reader->
rawOut.withWriter("UTF-8"){writer->
reader.eachLine{line, lineNum->
if(lineNum>1) { // # skip the first line
// ## let use regular expression to transform each line
writer << line.replaceAll( /^(.*:\d\d)((\.\d{1,3})(\d*))?(-\d\d)?/ , '$1$3' ) << '\n'
}
}
}
}
} as StreamCallback)
session.transfer(ff, REL_SUCCESS)
ReplaceText processor
And if regular expression is ok - the easiest way in nifi is a ReplaceText processor that could do regular expression replace line-by-line.
In this case you don't need to write any code, just build the regular expression and configure your processor correctly.

Just using pure jython. It is an example that can be adapted to OP's needs.
Define a datetime parser for this csv file
from datetime import datetime
def parse_datetime(dtstr):
mydatestr='-'.join(dtstr.split('-')[:-1])
try:
return datetime.strptime(mydatestr,'%d-%m-%Y %H:%M:%S.%f').strftime('%d-%m-%Y %H:%M:%S.%f')[:-3]
except ValueError:
return datetime.strptime(mydatestr,'%d-%m-%Y %H:%M:%S').strftime('%d-%m-%Y %H:%M:%S')
my test.csv includes data like this: ( 2015 didnt have 29 Feb had to change OP's example ).
1,Ni,23,27-02-2015 12:22:33.2212-02
2,Fi,21,28-02-2015 12:22:34.3212-02
3,Us,33,30-03-2015 12:23:35-01
4,Uk,34,31-03-2015 12:24:36.332211-02
now the solution
with open('test.csv') as fi:
for line in fi:
line_split=line.split(',')
out_line = ', '.join(word if i<3 else parse_datetime(word) for i,word in enumerate(line_split))
#print(out_line)
#you can write this out_line to a file here.
printing out_line looks like this
1, Ni, 23, 27-02-2015 12:22:33.221
2, Fi, 21, 28-02-2015 12:22:34.321
3, Us, 33, 30-03-2015 12:23:35
4, Uk, 34, 31-03-2015 12:24:36.332

You can get them with regex :
(\d\d-\d\d-\d\d\d\d\ \d\d:\d\d:)(\d+(?:\.\d+)*)(-\d\d)$
Then just replace #2 with a rounded version of #2
See regex example at regexr.com
You could even do it "nicer" by getting every single value with a capturing group and then put them into a datetime.datetime object and print it from there, but imho that would be an overkill in maintainability and loose you too much performance.
Code had no possibility to test
import re
...
pattern = '^(.{25})(\d+(?:\.\d+)*)(-\d\d)$' //used offset for simplicity
....
for line in text[1:]:
match = re.search(pattern, line)
line = match.group(1) + round(match.group(2),3) + match.group(3)
outputStream.write(line + "\n")

Loop only reads 1st line of file

I'm trying to convert a JSON file to XML using a small python script, but for some reason the loop only seems to read the first line of the JSON file.
from xml.dom import minidom
from json import JSONDecoder
import json
import sys
import csv
import os
import re
import dicttoxml
from datetime import datetime, timedelta
from functools import partial
reload(sys)
sys.setdefaultencoding('utf-8')
nav = 'navigation_items.bson.json'
df = 'testxmloutput.txt'
def json2xml(json_obj, line_padding=""):
result_list = list()
json_obj_type = type(json_obj)
if json_obj_type is list:
for sub_elem in json_obj:
result_list.append(json2xml(sub_elem, line_padding))
return "\n".join(result_list)
if json_obj_type is dict:
for tag_name in json_obj:
sub_obj = json_obj[tag_name]
result_list.append("%s<%s>" % (line_padding, tag_name))
result_list.append(json2xml(sub_obj, "\t" + line_padding))
result_list.append("%s</%s>" % (line_padding, tag_name))
return "\n".join(result_list)
return "%s%s" % (line_padding, json_obj)
def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
buffer = ''
for chunk in iter(partial(fileobj.read, buffersize), ''):
buffer += chunk
while buffer:
try:
result, index = decoder.raw_decode(buffer)
yield result
buffer = buffer[index:]
except ValueError:
# Not enough data to decode, read more
break
def converter(data):
f = open(df,'w')
data = open(nav)
for line in json_parse(data):
f.write(dicttoxml.dicttoxml(line, attr_type=False))
f.close()
converter(nav)
I was under the assumption that the iter would read the first line to memory and move on to the next. The converted output looks great but im too sure where to look on how to get it to loop through to the next line in file.

Try json.load to load the file into a dict and then iterate the dict for your output.
import sys
import json
json_file = sys.argv[1]
data = {}
with open(json_file) as data_file:
data = json.load(data_file)
for key in data:
#do your things

Understanding how to quote XML string in order to serialize using Python's ElementTree

My specs:
Python 3.4.3
Windows 7
IDE is Jupyter Notebooks
What I have referenced:
how-to-properly-escape-single-and-double-quotes
python-escaping-strings-for-use-in-xml
escaping-characters-in-a-xml-file-with-python
Here is the data and script, respectively, below (I have tried variations on serializing Column 'E' using both Sax and ElementTree):
Data
A,B,C,D,E,F,G,H,I,J
"3","8","1","<Request TransactionID="3" RequestType="FOO"><InstitutionISO /><CallID>23</CallID><MemberID>12</MemberID><MemberPassword /><RequestData><AccountNumber>2</AccountNumber><AccountSuffix>85</AccountSuffix><AccountType>S</AccountType><MPIAcctType>Checking</MPIAcctType><TransactionCount>10</TransactionCount></RequestData></Request>","<Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000',0001,0070,</ShareList></Response>","1967-12-25 22:18:13.471000","2005-12-25 22:18:13.768000","2","70","0"
Script
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os.path
import sys
import csv
from io import StringIO
import xml.etree.cElementTree as ElementTree
from xml.etree.ElementTree import XMLParser
import xml
import xml.sax
from xml.sax import ContentHandler
class MyHandler(xml.sax.handler.ContentHandler):
def __init__(self):
self._charBuffer = []
self._result = []
def _getCharacterData(self):
data = ''.join(self._charBuffer).strip()
self._charBuffer = []
return data.strip() #remove strip() if whitespace is important
def parse(self, f):
xml.sax.parse(f, self)
return self._result
def characters(self, data):
self._charBuffer.append(data)
def startElement(self, name, attrs):
if name == 'Response':
self._result.append({})
def endElement(self, name):
if not name == 'Response': self._result[-1][name] = self._getCharacterData()
def read_data(path):
with open(path, 'rU', encoding='utf-8') as data:
reader = csv.DictReader(data, delimiter =',', quotechar="'", skipinitialspace=True)
for row in reader:
yield row
if __name__ == "__main__":
empty = ''
Response = 'sample.csv'
for idx, row in enumerate(read_data(Response)):
if idx > 10: break
data = row['E']
print(data) # The before
data = data[1:-1]
data = ""'{}'"".format(data)
print(data) # Sanity check
# data = '<Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000',0001,0070,</ShareList></Response>'
try:
root = ElementTree.XML(data)
# print(root)
except StopIteration:
raise
pass
# xmlstring = StringIO(data)
# print(xmlstring)
# Handler = MyHandler().parse(xmlstring)
Specifically, due to the quoting in the CSV file (which is beyond my control), I have had to resort to slicing the string (line 51) and then formatting it (line 52).
However the print out from the above attempt is as follows:
"<Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000'
<Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000
File "<string>", line unknown
ParseError: no element found: line 1, column 69
Interestingly - if I assign the variable "data" (as in line 54) I receive this:
File "<ipython-input-80-7357c9272b92>", line 56
data = '<Response TransactionID="2" RequestType="HoldInquiry"><ShareList>0000',0001,0070,</ShareList></Response>'
^
SyntaxError: invalid token
I seek feedback and information on how to address utilizing the most Pythonic means to do so. Ideally, is there a method that can leverage ElementTree. Thank you, in advance, for your feedback and guidance.

It seems that You have badly formatted (well, badly quoted) csv data.
If csv file is beyond Your control I suggest not using csv reader to read them,
instead - if You can rely on each field being properly quoted - split them yourself.
with open(Response, 'rU', encoding='utf-8') as data:
separated = data.read().split('","')
try:
x = ElementTree.XML(separated[3])
print(x)
xml.etree.ElementTree.dump(x)
y = ElementTree.XML(separated[4])
xml.etree.ElementTree.dump(y)
except Exception as e:
print(e)
outputs
<Element 'Request' at 0xb6d973b0>
<Request RequestType="FOO" TransactionID="3"><InstitutionISO /><CallID>23</CallID><MemberID>12</MemberID><MemberPassword /><RequestData><AccountNumber>2</AccountNumber><AccountSuffix>85</AccountSuffix><AccountType>S</AccountType><MPIAcctType>Checking</MPIAcctType><TransactionCount>10</TransactionCount></RequestData></Request>
<Response RequestType="HoldInquiry" TransactionID="2"><ShareList>0000',0001,0070,</ShareList></Response>

Twitter Python Json to CSV

I am new to Python and programing in general.
I wrote this script and it runs with out error but is doesn't print any content in the .csv even though I know there is content to print. I have been stuck for a day or 2 and need some help.
import sys
import json
import urllib
import oauth2 as oauth
import requests
import time
import csv
CONSUMER_KEY = ""
CONSUMER_SECRET = ""
ACCESS_KEY = ""
ACCESS_SECRET = ""
consumer = oauth.Consumer(key=CONSUMER_KEY, secret=CONSUMER_SECRET)
access_token = oauth.Token(key=ACCESS_KEY, secret=ACCESS_SECRET)
client = oauth.Client(consumer, access_token)
html ="https://api.twitter.com/1.1/search/tweets.json?q=#gmail.com"
response, data = client.request(html)
f = open("twitter_gmail.csv", 'a')
handle_tweet =json.loads(data)
def handle_tweet(self, data):
search_terms = ['#gmail.com']
text = message.get('text')
words = text.split()
matches = []
for term in search_terms:
match = [word for word in words if term in word]
matches.append(match)
f.write('%s,%s,%s,%s\n' % (message.get('created_at'), message.get('text'), message.get('user').get('id'),matches))

It looks like you are importing the csv module without even using it. There is probably something funky with your f.write statement, but things will be much easier for you if you try to write to your file using csv.writer. The csv.writer can easily take in a list, and spit out a comma separated line of the values in the list. I'd recommend reading its documentation in order to implement it.
You're pretty close already. I'm not exactly sure what you want on each row, but the following might be close to your intended goal.
f = open("twitter_gmail.csv", 'a')
# This next line makes the csv writer.
writer = csv.writer(f)
handle_tweet =json.loads(data)
def handle_tweet(self, data):
search_terms = ['#gmail.com']
text = message.get('text')
words = text.split()
matches = []
for term in search_terms:
match = [word for word in words if term in word]
matches.append(match)
# The next line is how you write with a csv writer.
writer.writerows(matches)
If I'm reading your code correctly, matches is a list of lists. I've used csv.writer's writerows method which will put each list on its own line (notice how it's plural, as opposed to writerow, which expects a single list and writes it to a single line).

I don't have Twitter auth keys, so can't test, but should work based on Twitter's API documentation.
If you are doing serious work (as opposed to figuring out how it is done), you should probably use a tested module like twitter or python-twitter.
import csv
import json
import oauth2 as oauth
import urllib
# I don't if any of these are actually needed?
import sys
import requests
import time
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
ACCESS_KEY = ''
ACCESS_SECRET = ''
class TwitterSearch:
def __init__(self,
ckey = CONSUMER_KEY,
csecret = CONSUMER_SECRET,
akey = ACCESS_KEY,
asecret = ACCESS_SECRET,
query = 'https://api.twitter.com/1.1/search/tweets.{mode}?{query}'
):
consumer = oauth.Consumer(key=ckey, secret=csecret)
access_token = oauth.Token(key=akey, secret=asecret)
self.client = oauth.Client(consumer, access_token)
self.query = query
def search(self, q, mode='json', **queryargs):
queryargs['q'] = q
query = urllib.urlencode(queryargs)
return self.client.request(self.query.format(query=query, mode=mode))
def write_csv(fname, rows, header=None, append=False, **kwargs):
filemode = 'ab' if append else 'wb'
with open(fname, filemode) as outf:
out_csv = csv.writer(outf, **kwargs)
if header:
out_csv.writerow(header)
out_csv.writerows(rows)
def main():
ts = TwitterSearch()
response, data = ts.search('#gmail.com', result_type='recent')
js = json.loads(data)
# This _should_ work, based on sample data from
# https://dev.twitter.com/docs/api/1.1/get/search/tweets
messages = ([msg['created_at'], msg['txt'], msg['user']['id']] for msg in js.get('statuses', []))
write_csv('twitter_gmail.csv', messages, append=True)
if __name__ == "__main__":
main()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Joining Multiple Lines in Python (Text Formatting) - python

If this is just a single, large string object with line-breaks, you can simply delete them: import re text = re.sub('\s\n\s', '', text) To leave the line breaks in that follow the </attack_headline> delimiter, try: re.sub('(?<!<\/attack_headline>)\s\n\s', '', x)

import re # remove all newline whitespace stuff as in answer given before: text = re.sub(r'\s\n\s', '', text) # break again at desired points: text = re.sub(r'</attack_headlines>', '</attack_headlines>\n', text)

Related

TypeError: init() got an unexpected keyword argument 'output' -- in a comment stripper program

How to update line with modified data in Jython?

Loop only reads 1st line of file

Understanding how to quote XML string in order to serialize using Python's ElementTree

Twitter Python Json to CSV

Categories

Resources

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Joining Multiple Lines in Python (Text Formatting) - python

If this is just a single, large string object with line-breaks, you can simply delete them: import re text = re.sub('\s*\n\s*', '', text) To leave the line breaks in that follow the </attack_headline> delimiter, try: re.sub('(?<!<\/attack_headline>)\s*\n\s*', '', x)

import re # remove all newline whitespace stuff as in answer given before: text = re.sub(r'\s*\n\s*', '', text) # break again at desired points: text = re.sub(r'</attack_headlines>', '</attack_headlines>\n', text)

Related

TypeError: __init__() got an unexpected keyword argument 'output' -- in a comment stripper program

How to update line with modified data in Jython?

Loop only reads 1st line of file

Understanding how to quote XML string in order to serialize using Python's ElementTree

Twitter Python Json to CSV

Categories

Resources

If this is just a single, large string object with line-breaks, you can simply delete them: import re text = re.sub('\s\n\s', '', text) To leave the line breaks in that follow the </attack_headline> delimiter, try: re.sub('(?<!<\/attack_headline>)\s\n\s', '', x)

import re # remove all newline whitespace stuff as in answer given before: text = re.sub(r'\s\n\s', '', text) # break again at desired points: text = re.sub(r'</attack_headlines>', '</attack_headlines>\n', text)

TypeError: init() got an unexpected keyword argument 'output' -- in a comment stripper program