I want to convert a Python multiline string to a single line. If I open the string in Vim, I can see ^M at the start of each line. How do I process the string to put it all on a single line, with a tab separating the original lines? For example, in Vim it looks like:
Serialnumber
^MName Rick
^MAddress 902, A.street, Elsewhere
I would like it to be something like:
Serialnumber \t Name \t Rick \t Address \t 902, A.street,......
where each string is in one line. I tried
somestring.replace(r'\r','\t')
But it doesn't work. Also, once the string is on a single line, how would I add a newline (a UNIX newline?) at the end of it?
Deleted my previous answer because I realized it was wrong and I needed to test this solution.
Assuming that you are reading this from the file, you can do the following:
with open('test.txt', 'r') as f:
    lines = f.readlines()
mystr = '\t'.join([line.strip() for line in lines])
As ep0 said, the ^M represents '\r', which is the carriage return character used in Windows line endings. It is surprising that you would have ^M at the beginning of each line, since the Windows newline sequence is \r\n. Having ^M at the beginning of the line indicates that your file contains \n\r instead.
Regardless, the code above makes use of a list comprehension to loop over each of the lines read from test.txt. For each line in lines, we call str.strip() to remove any whitespace (including \r and \n) from the ENDS of each line. Finally, we call '\t'.join() on the resulting list to insert tabs.
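The strip-and-join approach described above can be sketched as a small function that also appends the single UNIX newline the question asks about. The name join_lines and the sample lines are made up for illustration, and skipping lines that are empty after stripping is an added assumption:

```python
def join_lines(lines, sep="\t"):
    # strip() removes whitespace, including \r and \n, from both ends of a line.
    cleaned = (line.strip() for line in lines)
    # Skip lines that are empty after stripping, then end with one UNIX newline.
    return sep.join(part for part in cleaned if part) + "\n"

# Hypothetical lines in the shape the question describes (\r at the start).
lines = ["Serialnumber\n", "\rName Rick\n", "\rAddress 902, A.street, Elsewhere"]
result = join_lines(lines)
print(repr(result))
```

The same function works on the list returned by f.readlines().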
You can replace "\r" characters by "\t".
my_string = my_string.replace("\r", "\t")
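A short sketch of why the attempt in the question has no effect while the plain "\r" version does. The sample string is invented to resemble the question's data:

```python
s = "Serialnumber\n\rName Rick\n\rAddress 902, A.street, Elsewhere"

# r'\r' is two characters (a backslash and the letter r), so nothing in s matches.
unchanged = s.replace(r'\r', '\t')

# '\r' is the single carriage-return character.
replaced = s.replace('\r', '\t')

print(len(r'\r'), len('\r'))
print(unchanged == s)
print('\r' in replaced)
```

Note also that str.replace returns a new string, so the result has to be assigned to a name to be kept.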
I use splitlines() to detect all types of lines, and then join everything together. This way you don't have to guess to replace \r or \n etc.
"".join(somestring.splitlines())
This is hard-coded, but it works.
poem='''
If I can stop one heart from breaking,
I shall not live in vain;
If I can ease one life the aching,
Or cool one pain,
Or help one fainting robin
Unto his nest again,
I shall not live in vain.
'''
lst=list(poem)
str=''
for i in lst:
    str+=i
print(str)
lst1=str.split("\n")
str1=""
for i in lst1:
    str1+=i+" "
str2=str1[:-2]
print(str2)
This happens because of how Vim interprets CR (carriage return), which Windows uses to delimit new lines. You should use just one editor (I personally prefer Vim). Read this: VIM ^M
This trick can also be useful when the text contains a literal backslash followed by n rather than a real newline character: write "\n" as a raw string, like this:
my_string = my_string.replace(r"\n", "\t")
this should do the work:
def flatten(multiline):
    lst = multiline.split('\n')
    flat = ''
    for line in lst:
        flat += line.replace(' ', '')+' '
    return flat
This should do the job:
string = """Name Rick
Address 902, A.street, Elsewhere"""
single_line = string.replace("\n", "\t")
I'm looking to find out how to use Python to get rid of needless newlines in text like what you get from Project Gutenberg, where their plain-text files are formatted with newlines every 70 characters or so. In Tcl, I could do a simple string map, like this:
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
This would keep paragraphs separated by two newlines (or a newline and a tab) separate, but run together the lines that ended with a single newline (substituting a space), and drop superfluous CR's. Since Python doesn't have string map, I haven't yet been able to find out the most efficient way to dump all the needless newlines, although I'm pretty sure it's not just to search for each newline in order and replace it with a space. I could just evaluate the Tcl expression in Python, if all else fails, but I'd like to find out the best Pythonic way to do the same thing. Can some Python connoisseur here help me out?
The nearest equivalent to the tcl string map would be str.translate, but unfortunately it can only map single characters. So it would be necessary to use a regexp to get a similarly compact example. This can be done with look-behind/look-ahead assertions, but the \r's have to be replaced first:
import re
oldtext = """\
This would keep paragraphs separated.
This would keep paragraphs separated.
This would keep paragraphs separated.
\tThis would keep paragraphs separated.
\rWhen, in the course
of human events,
it becomes necessary
\rfor one people
"""
newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
output:
This would keep paragraphs separated. This would keep paragraphs separated.
This would keep paragraphs separated.
This would keep paragraphs separated.
When, in the course of human events, it becomes necessary for one people
I doubt whether this is as efficient as the tcl code, though.
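To see what the pattern keeps and what it joins, here is a small sketch that wraps the same substitution in a function. The name unwrap and the sample text are illustrative only:

```python
import re

def unwrap(text):
    # Drop carriage returns first, then replace each newline that is neither
    # preceded by a newline nor followed by a newline or a tab with a space.
    return re.sub(r'(?<!\n)\n(?![\n\t])', ' ', text.replace('\r', ''))

sample = "one\r\ntwo\r\n\r\nthree\r\n\tfour\r\nfive"
unwrapped = unwrap(sample)
print(repr(unwrapped))
```

The blank-line break after "two" and the tab-indented break before "four" survive, while the other line breaks become spaces.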
UPDATE:
I did a little test using this Project Gutenberg EBook of War and Peace (Plain Text UTF-8, 3.1 MB). Here's my tcl script:
set fp [open "gutenberg.txt" r]
set oldtext [read $fp]
close $fp
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
puts $newtext
and my python equivalent:
import re
with open('gutenberg.txt') as stream:
    oldtext = stream.read()
newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
print(newtext)
Crude performance test:
$ /usr/bin/time -f '%E' tclsh gutenberg.tcl > output1.txt
0:00.18
$ /usr/bin/time -f '%E' python gutenberg.py > output2.txt
0:00.30
So, as expected, the tcl version is more efficient. However, the output from the python version seems somewhat cleaner (no extra spaces inserted at the beginning of lines).
You can use a regular expression with a look-ahead search:
import re
text = """
...
"""
newtext = re.sub(r"\n(?=[^\n\t])", " ", text)
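That will replace with a space any newline that is followed by a character other than a newline or a tab. Note that the second newline of a blank-line paragraph break is followed by ordinary text, so it is replaced as well; adding a look-behind such as (?<!\n) in front of the pattern avoids that.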
I use the following script when I want to do this:
import sys
import os
filename, extension = os.path.splitext(sys.argv[1])
with open(filename+extension, encoding='utf-8-sig') as file, \
        open(filename+"_unwrapped"+extension, 'w', encoding='utf-8-sig') as output:
    *lines, last = list(file)
    for line in lines:
        if line == "\n":
            line = "\n\n"
        elif line[0] == "\t":
            line = "\n" + line[:-1] + " "
        else:
            line = line[:-1] + " "
        output.write(line)
    output.write(last)
A "blank" line, with only a linefeed, turns into two linefeeds (to replace the one removed from the previous line). This handles files that separate paragraphs with two linefeeds.
A line beginning with a tab gets a leading linefeed (to replace the one removed from the previous line) and gets its trailing linefeed replaced with a space. This handles files that separate paragraphs with a tab character.
A line that is neither blank nor beginning with a tab gets its trailing linefeed replaced with a space.
The last line in the file may not have a trailing linefeed and therefore gets copied directly.
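The four rules above can be tried without touching any files by applying them to a list of lines. This is only a sketch: unwrap_lines and the sample data are hypothetical, and the input is assumed to contain at least one line:

```python
def unwrap_lines(lines):
    # Every line except the last is assumed to end with a linefeed.
    *body, last = lines
    out = []
    for line in body:
        if line == "\n":
            out.append("\n\n")                   # blank line: paragraph break
        elif line[0] == "\t":
            out.append("\n" + line[:-1] + " ")   # tab-indented paragraph start
        else:
            out.append(line[:-1] + " ")          # ordinary wrapped line
    out.append(last)                             # last line is copied as-is
    return "".join(out)

sample = ["first line\n", "second line\n", "\n",
          "new paragraph\n", "\tindented paragraph\n", "last line"]
unwrapped = unwrap_lines(sample)
print(repr(unwrapped))
```

One visible consequence of these rules is a trailing space before each preserved line break.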
I have a large textfile on my computer (location: /home/Seth/documents/bruteforce/passwords.txt) and I'm trying to find a specific string in the file. The list has one word per line and 215,000 lines/words. Does anyone know of simple Python script I can use to find a specific string?
Here's the code I have so far,
f = open("home/seth/documents/bruteforce/passwords.txt", "r")
for line in f.readlines():
    line = str(line.lower())
    print str(line)
    if str(line) == "abe":
        print "success!"
    else:
        print str(line)
I keep running the script, but it never finds the word in the file (and I know for sure the word is in the file).
Is there something wrong with my code? Is there a simpler method than the one I'm trying to use?
Your help is greatly appreciated.
Ps: I'm using Python 2.7 on a Debian Linux laptop.
I'd rather use the in keyword to look for a string in a line. Here I'm looking for the keyword 'KHANNA' in a csv file, and the code prints True if any line contains it.
In [121]: with open('data.csv') as f:
   .....:     print any('KHANNA' in line for line in f)
   .....:
True
It's just because you forgot to strip the new line char at the end of each line.
line = line.strip().lower()
would help.
Usually, when you read lines out of a file, they have a newline character at the end. Thus, they will technically not be equal to the same string without the newline character. You can get rid of this character by adding the line line=line.strip() before the test for equality to your target string. By default, the strip() method removes all white space (such as newlines) from the string it is called on.
What do you want to do? Just test whether the word is in the file? Here:
print 'abe' in open("passwords.txt").read().split()
Or:
print 'abe' in map(str.strip, open("passwords.txt"))
Or if it doesn't have to be Python:
egrep '^abe$' passwords.txt
EDIT: Oh, I forgot the lower. Probably because passwords are usually case sensitive. But if it really does make sense in your case:
print 'abe' in open("passwords.txt").read().lower().split()
or
print 'abe' in (line.strip().lower() for line in open("passwords.txt"))
or
print 'abe' in map(str.lower, map(str.strip, open("passwords.txt")))
Your script doesn't find the line because you didn't check for the newline characters:
Your file is made of many "lines". Each "line" ends with a character that you didn't account for - the newline character ('\n'; see note 1). This is the character that creates a new line - it is what gets written to the file when you hit Enter. This is how the next line is created.
So, when you read the lines out of your file, the string contained in each line actually ends with a newline character. This is why your equality test fails. You should instead, test equality against the line, after it has been stripped of this newline character:
with open("home/seth/documents/bruteforce/passwords.txt") as infile:
    for line in infile:
        line = line.rstrip('\n')
        if line == "abe":
            print 'success!'
1 Note that on some machines, the newline is in fact two characters: the carriage return (CR) and the line feed (LF). The terminology dates from typewriters, where the paper had to advance by one line and the carriage holding the paper had to be returned to its starting position. In a file, this pair appears as '\r\n'.
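For readers on Python 3, the same check can be sketched as below. The helper name contains_word is made up, and a small temporary file stands in for passwords.txt so the example runs anywhere:

```python
import os
import tempfile

def contains_word(path, word):
    # Compare each line, minus surrounding whitespace, case-insensitively.
    with open(path) as infile:
        return any(line.strip().lower() == word for line in infile)

# Build a small stand-in word list (one word per line).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Alpha\nAbe\nzulu\n")
    path = tmp.name

found_abe = contains_word(path, "abe")
found_bob = contains_word(path, "bob")
os.remove(path)
print(found_abe, found_bob)
```

Because any() stops at the first match, the file is only read as far as needed.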
I have a text file with numbers and symbols. I want to delete some of the characters and put newlines in their place.
For example, the text file looks like this:
00004430474314-3","100004430474314-3","1779803519-3","100003004929477-3","100006224433874-3","1512754498-3","100003323786067
I want the output to look like this:
00004430474314
100004430474314
100003004929477
1779803519
100006224433874
1512754498
100003323786067
I tried to replace -3"," with \n using this code, but it does not work. Any help?
import re
import collections
s = re.findall('\w+', open('text.txt').read().lower())
print(s.replace("-3","",">\n"))
The re.findall is useless here.
with open('path/to/file') as infile:
    contents = infile.read()
contents = contents.replace('-3","', '\n')
print(contents)
Another problem with your code is that you seem to think that "-3","" is a string containing -3",". This is not the case. Python sees a second " and interprets that as the end of the string. You have a comma right afterward, which makes python consider the second bit as the second parameter to s.replace().
What you really want to do is to tell python that those double quotes are part of the string. You can do this by manually escaping them as follows:
some_string_with_double_quotes = "this is a \"double quote\" within a string"
You can also accomplish the same thing by defining the string with single quotes:
some_string_with_double_quotes = 'this is a "double quote" within a string'
Both types of quotes are equivalent in python and can be used to define strings. This may be weird to you if you come from a language like C++, where single quotes are used for characters, and double quotes are used for strings.
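Putting the two points together, here is a brief sketch on a shortened copy of the sample data from the question. The variable names are illustrative:

```python
raw = '00004430474314-3","100004430474314-3","1779803519-3","100003323786067'

# Single quotes delimit the literal, so the double quotes inside need no escaping.
ids = raw.replace('-3","', '\n').splitlines()

# The backslash-escaped spelling denotes exactly the same five characters.
same_literal = ('-3","' == "-3\",\"")

print(ids)
print(same_literal)
```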
First, I think that the s object is not a string but a list, and if you try to make it a string (s=''.join(s) for example) you are going to end up with something like this:
0000443047431431000044304743143177980351931000030049294773100006224433874315127544983100003323786067
Where replace() is useless.
I would change your code to the following (tested in python 3.2)
lines = [line.strip() for line in open('text.txt')]
line=''.join(lines)
cl=line.replace("-3\",\"","\n")
print(cl)
This is a continuation of my former questions (check them if you are curious).
I can already see the light at the end of the tunnel, but I've got a last problem.
For some reason, every line starts with a TAB character.
How can I ignore that first character ("tab" (\t) in my case)?
filename = "terem.txt"
OraRend = collections.namedtuple('OraRend', 'Nap, OraKezdese, OraBefejezese, Azonosito, Terem, OraNeve, Emelet')
csv.list_dialects()
for line in csv.reader(open(filename, "rb"), delimiter='\t', lineterminator='\t\t', doublequote=False, skipinitialspace=True):
    print line
    orar = OraRend._make(line) # Here comes the trouble!
The text file:
http://pastebin.com/UYg4P4J1
(Can't really paste it here with all the tabs.)
I have found lstrip, strip and other methods, all of them would eat all the chars, so the filling of the tuple would fail.
You could do line = line[1:] to just strip the first character. But if you do this, you should add an assertion that the first character is indeed a tab, to avoid mangling data without leading tab.
There is an easier alternative that also handles several other cases and doesn't break things if the things to be removed aren't there. You can strip all leading and trailing whitespace with line = line.strip(). Alternatively, use .lstrip() to strip only leading whitespace, and add '\t' as argument to either method call if you want to leave other whitespace in place and just remove tabs.
To remove the first character from a string:
>>> s = "Hello"
>>> s
'Hello'
>>> s[1:]
'ello'
From the docs:
str.lstrip([chars])
Return a copy of the string with leading characters removed. The chars argument is a string specifying the set of characters to be removed. If omitted or None, the chars argument defaults to removing whitespace. The chars argument is not a prefix; rather, all combinations of its values are stripped.
If you want to only remove the tab at the beginning of a line, use
str.lstrip("\t")
This has the benefit that you don't have to check to make sure the first character is, in fact, a tab. However, if there are cases when there are more than one tab, and you want to keep the second tab and on, you're going to have to use str[1:].
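The difference between the two approaches shows up on a line that starts with more than one tab. In this sketch the sample row and the helper drop_one_leading_tab are made up for illustration:

```python
row = "\t\tvalue1\tvalue2"

all_tabs_removed = row.lstrip("\t")   # every leading tab is removed
first_char_removed = row[1:]          # only the first character is removed

def drop_one_leading_tab(line):
    # Remove a single leading tab and leave lines without one untouched.
    return line[1:] if line.startswith("\t") else line

print(repr(all_tabs_removed))
print(repr(first_char_removed))
print(repr(drop_one_leading_tab("no tab here")))
```

The helper combines the slice with the check that the first character really is a tab, so data without a leading tab is not mangled.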
Consider this. You don't need to pass a "file" to csv.reader. Any iterable that yields lines as strings works nicely.
filename = "terem.txt"
OraRend = collections.namedtuple('OraRend', 'Nap, OraKezdese, OraBefejezese, Azonosito, Terem, OraNeve, Emelet')
with open(filename, "rb") as source:
    cleaned = ( line.lstrip() for line in source )
    rdr= csv.reader( cleaned, delimiter='\t', lineterminator='\t\t', doublequote=False, skipinitialspace=True)
    for line in rdr:
        print line
        orar = OraRend._make(line) # Here comes the trouble!
When writing to a text file, some of the file.write instances are followed by a linebreak in the output file and others aren't. I don't want linebreaks except where I tell them to occur. Code:
for doc,wc in wordcounts.items():
    out.write(doc) #this works fine, no linebreak
    for word in wordlist:
        if word in wc: out.write("\t%d" % wc[word]) #linebreaks appear
        else: out.write("\t0") #after each of these
    out.write("\n") #this line had mixed spaces/tabs
What am I missing?
Update
I should have taken a clue from how the code pasted into SO. For some reason there was a mixture of spaces and tabs in the final line, such that in TextMate it visually appeared outside the "for word..." loop—but the interpreter was treating it as part of that loop. Converting spaces to tabs solved the problem.
Thanks for your input.
file.write() does not add any newlines if the string you write does not contain any \ns.
But you force a newline for each word in your word list using out.write("\n"), is that what you want?
for doc,wc in wordcounts.items():
    out.write(doc) #this works fine, no linebreak
    for word in wordlist:
        if word in wc: out.write("\t%d" % wc[word]) #linebreaks appear
        else: out.write("\t0") #after each of these
        out.write("\n") #<--- NEWLINE ON EACH ITERATION!
Perhaps you indented out.write("\n") too far?
You write a line break after every word:
for word in wordlist:
    ...
    out.write("\n")
Are these the line breaks you are seeing, or are there additional ones?
You might need to perform a strip() on each wc[word]. Printing a single item from wc would probably be enough to determine whether those items already contain line breaks that are causing this behavior.
Either that or the indentation on your final out.write("\n") is not doing what you intended it to do.
I think your indentation is wrong.
(also I took the liberty to make your if clause redundant and code more readable :)
for doc,wc in wordcounts.items():
    out.write(doc)
    for word in wordlist:
        out.write("\t%d" % wc.get(word,0))
    out.write("\n")
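To check the corrected indentation without creating a file, the loop can be run against an in-memory buffer. The sample wordcounts and wordlist below are invented, and io.StringIO stands in for the output file:

```python
import io

# Hypothetical data in the shape the question implies.
wordcounts = {"doc1": {"apple": 2, "pear": 1}, "doc2": {"pear": 4}}
wordlist = ["apple", "pear", "plum"]

out = io.StringIO()
for doc, wc in wordcounts.items():
    out.write(doc)
    for word in wordlist:
        out.write("\t%d" % wc.get(word, 0))
    out.write("\n")  # once per document, outside the inner loop

written = out.getvalue()
print(written)
```

Each document ends up on exactly one line, since write() adds nothing beyond the characters it is given.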