How to remove large spaces between the sentences in a text file? - python

I am working with a Unicode file. After processing it, I am getting very large spacing between sentences, for example:
തൃശൂരില്‍ ഹര്‍ത്താല്‍ പൂര്‍ണം
തൃശൂവില്‍ ഇടതുമുന്നണി ഹര്‍ത്താലില്‍ ജനജീവിതം പൂര്‍ണമായും സ്‌...
ഡി.വൈ.എഫ്‌.ഐ. ഉപരോധം; കലക്‌ടറേറ്റ്‌ സ്‌തംഭിച്ചു
തൃശൂര്‍: നിയമനനിരോധനം, അഴിമതി, വിലക്കയറ്റം എന്നീ വിഷയങ്ങള്‍ മുന്‍...
ബൈക്ക്‌ പോസ്‌റ്റിലിടിച്ച്‌ പതിന്നേഴുകാരന്‍ മരിച്ചു
How do I remove these large spaces?
I have tried this:
" ".join(raw.split())
It is not working at all. Any suggestions?

The easiest way is to write the results to another file, or to rewrite your file. Most operating systems don't let you edit the middle of a file in place (inserting or deleting bytes there means rewriting what follows). For simple cases like this, rewriting the whole file is much simpler:
with open('f.txt') as raw:
    # split() with no argument drops every run of whitespace (spaces and newlines alike);
    # use split('\n') if you want to remove newlines only
    data = ''.join(raw.read().split())
with open('f.txt', 'w') as raw:
    raw.write(data)
Hope this helps!

Assuming raw is your raw data as a string, split it with str.splitlines, filter out the empty lines, and rejoin them with a newline:
print '\n'.join(line for line in raw.splitlines() if line.strip())
If you are open to using a regex, you may also try:
import re
print re.sub("\n+","\n", raw)
If raw is instead a file object, you can collapse consecutive identical lines (the repeated blank lines) into one with itertools.groupby:
from itertools import groupby
with open("<some-file>") as raw:
    data = ''.join(k for k, _ in groupby(raw))
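For illustration, here is how groupby collapses runs of identical blank lines; a hypothetical in-memory list stands in for the file object:
from itertools import groupby

lines = ["foo\n", "\n", "\n", "\n", "bar\n", "\n", "baz\n"]  # stand-in for the file's lines
print(''.join(k for k, _ in groupby(lines)))
# prints "foo", "bar" and "baz" separated by single blank lines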

Assuming the extra lines are empty (only a newline), in Python:
import re
import sys

f = sys.argv[1]
for line in open(f, 'r'):
    if not re.search('^$', line):
        print line
or if you prefer:
egrep -v "^$" <filename>

Related

Python CSV remove new lines denoted by &#x0D

I have a BCP file that contains lots of &#x0D; (carriage return) symbols. They are not meant to be there, and I have no control over the original output, so I am left with trying to parse the file to remove them.
A sample of the data looks like....
"test1","apples","this is
some sample","3877"
"test66","bananas","this represents more
wrong data","378"
I am trying to end up with...
"test1","apples","this is some sample","3877"
"test66","bananas","this represents more wrong data","378"
Is there a simple way to do this, preferably using Python's csv module?
You can try:
import re
with open("old.csv") as f, open("new.csv", "w") as w:
    for line in f:
        # replace the stray carriage return (and any whitespace after it) with a single space
        line = re.sub(r"\r\s*", " ", line)
        w.write(line)
which gives:
"test1","apples","this is some sample","3877"
"test66","bananas","this represents more wrong data","378"

How to read a very large integer (1000+ digits) that is saved in **multiple lines** in a text file using Python?

Suppose I have a very large integer (say, 1000+ digits) that I've saved into a text file named 'large_number.txt'. The problem is that the integer has been split across multiple lines in the file, like the following:
47451445736001306439091167216856844588711603153276
70386486105843025439939619828917593665686757934951
62176457141856560629502157223196586755079324193331
64906352462741904929101432445813822663347944758178
92575867718337217661963751590579239728245598838407
58203565325359399008402633568948830189458628227828
80181199384826282014278194139940567587151170094390
35398664372827112653829987240784473053190104293586
86515506006295864861532075273371959191420517255829
71693888707715466499115593487603532921714970056938
54370070576826684624621495650076471787294438377604
Now, I want to read this number from the file and use it as a regular integer in my program. I tried the following, but it doesn't give me what I need.
My Try (Python):
with open('large_number.txt') as f:
    data = f.read().splitlines()
Is there any way to do this properly in Python 3.6? Or what is the best that can be done in this situation?
Just replace the newlines with nothing, then parse:
with open('large_number.txt') as f:
    data = int(f.read().replace('\n', ''))
If you might have arbitrary (ASCII) whitespace and you want to discard all of it, switch to:
import string
killwhitespace = str.maketrans(dict.fromkeys(string.whitespace))
with open('large_number.txt') as f:
    data = int(f.read().translate(killwhitespace))
Either way that's significantly more efficient than processing line-by-line in this case (because you need all the lines to parse, any line-by-line solution would be ugly), both in memory and runtime.
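For example, with the translation table defined above, every ASCII whitespace character is deleted before parsing:
>>> "123\t456\n 789\n".translate(killwhitespace)
'123456789'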
You can use this code:
with open('large_number.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')
number = int(data)
You can use str.rstrip to remove the trailing newline characters and use str.join to join the lines into one string:
with open('large_number.txt') as file:
    data = int(''.join(line.rstrip() for line in file))

Python code to search for a text string in multiple text files

I have multiple text files in a folder, say "configs". I want to search for the text "-cfg" in each file and copy the data after -cfg, from the opening to the closing quotation marks ("data"). The result should be written to another text file, "result.txt", with the filename, test name and config for each file.
NOTE: Each file can have multiple "-cfg" entries on separate lines, along with the test name related to that configuration.
E.g: cube_demo -cfg "RGB 888; MODE 3"
My approach is to open each text file one at a time and find the pattern, then store the required result into a buffer. Later, copy the entire result into a new file.
I came across Python, and it looks like this is easy to do in Python. I am still learning Python and trying to figure out how to do it. Please help. Thanks.
I know how to open the file and iterate over each line to search for a particular string:
import re
search_term = "Cfg\s(\".*\")" // Not sure, if it's correct
ifile = open("testlist.csv", "r")
ofile = open("result.txt", "w")
searchlines = ifile.readlines()
for line in searchlines:
    if search_term in line:
        if re.search(search_term, line):
            ofile.write(\1)
            // trying to get string with the \number special sequence
ifile.close()
ofile.close()
But this gives me the complete line; I could not figure out how to use a regular expression to get only the "data", or how to iterate over the files in the folder to search the text.
Not quite there yet...
import re
search_term = "Cfg\s(\".*\")" // Not sure, if it's correct
"//" is not a valid comment marker, you want "#"
With regard to your regexp, what you want (from your specs) is: 'cfg', followed by spaces, followed by any text between double quotes, stopping at the first closing double quote, and you want to capture the part between those double quotes. This is spelled cfg *"(.+?)". Since you don't want to deal with escape characters, the best way is to use a raw single-quoted string:
exp = r'cfg *"(.+?)"'
Now, since you're going to reuse this expression in a loop, you might as well compile it right away:
exp = re.compile(r'cfg *"(.+?)"')
So now exp is a compiled pattern object instead of a string. To use it, you call its search(<text>) method with your current line as the argument. If the line matches the expression, you'll get a match object, else you'll get None:
>>> match = exp.search('foo bar "baaz" boo')
>>> match is None
True
>>> match = exp.search('foo bar -cfg "RGB 888; MODE 3" tagada "tsoin"')
>>> match is None
False
>>>
To get the part between the double quotes, you call match.group(1) (group(0) being the match for the whole expression):
>>> match.group(0)
'cfg "RGB 888; MODE 3"'
>>> match.group(1)
'RGB 888; MODE 3'
>>>
Now you just have to learn to make correct use of files... First hint: files are context managers that know how to close themselves. Second hint: files are iterable, so there is no need to read the whole file into memory. Third hint: file.write("text") WON'T append a newline after "text".
If we glue all this together, your code should look something like:
import re

search_term = re.compile(r'cfg *"(.+?)"')
with open("testlist.csv", "r") as ifile:
    with open("result.txt", "w") as ofile:
        for line in ifile:
            match = search_term.search(line)
            if match:
                ofile.write(match.group(1) + "\n")
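The question also asks how to iterate over every file in the "configs" folder and record the filename, test name and config in result.txt. A minimal sketch building on the same pattern (assuming the files are plain *.txt and that the test name is the word immediately before -cfg) might look like:
import glob
import os
import re

# capture the test name (the word before -cfg) and the quoted config string
pattern = re.compile(r'(\S+)\s+-cfg\s+"(.+?)"')

with open("result.txt", "w") as ofile:
    for path in glob.glob(os.path.join("configs", "*.txt")):
        with open(path) as ifile:
            for line in ifile:
                match = pattern.search(line)
                if match:
                    ofile.write("%s, %s, %s\n" % (os.path.basename(path), match.group(1), match.group(2)))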

How can I split a text file by a specific string that has wildcards?

I have a text file that contains solutions from a textbook, and I'm attempting to split each solution into its own text file. After searching through SO, I can't seem to find an elegant solution.
Each solution is prefaced with the problem number such as *1-3; or *4-2;.
I can read in the file and store each line in a list, but I'm having trouble actually processing the list to split by the header.
Here's a pastebin with a few of the solutions straight from the .txt: http://pastebin.com/ntSXLn72
Thank you!
Use re.split:
import re
with open('text.txt') as f:
    text = f.read()
solutions = re.split(r'\*[0-9]\-[0-9];', text)
That regex will look for *<digit>-<digit>; and split the full text at every match. You may have to do a little cleanup for empty members, and widen the pattern (e.g. [0-9]+) if the problem numbers can have more than one digit.
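If you also want to keep each header with its solution (so that each chunk can be written to a file named after its problem number), a sketch using a capture group in re.split, which then keeps the delimiters in the result, could be:
import re

with open('text.txt') as f:
    text = f.read()

# A capture group makes re.split keep the headers; [0-9]+ allows multi-digit problem numbers.
# This assumes the file starts with a problem header (any preamble would throw the pairing off).
parts = [p for p in re.split(r'(\*[0-9]+-[0-9]+;)', text) if p.strip()]

# parts now alternates header, body, header, body, ...
for header, body in zip(parts[0::2], parts[1::2]):
    with open('solution_%s.txt' % header.strip('*;'), 'w') as out:
        out.write(header + body)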
#!/usr/bin/python
import re

file_name = ""  # put the txt file you're working on here
new_header = None
for line in open(file_name, "r").readlines():
    if re.search("^[*][0-9]+[-][0-9]+[;]", line):
        if new_header:
            new_header.close()
        # strip the trailing newline so it doesn't end up in the new file's name
        new_header = open("%s_section:%s" % (file_name, line.strip()), "w")
    if new_header:
        new_header.write(line)
if new_header:
    new_header.close()

Delete newline / carriage return in file output

I have a wordlist that contains returns to separate each new letter. Is there a way to programmatically delete each of these returns using file I/O in Python?
Edit: I know how to manipulate strings to delete returns. I want to physically edit the file so that those returns are deleted.
I'm looking for something like this:
wfile = open("wordlist.txt", "r+")
for line in wfile:
if len(line) == 0:
# note, the following is not real... this is what I'm aiming to achieve.
wfile.delete(line)
>>> string = "testing\n"
>>> string
'testing\n'
>>> string = string[:-1]
>>> string
'testing'
This basically says "chop off the last thing in the string". The : is the "slice" operator; it would be a good idea to read up on how it works, as it is very useful.
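For illustration, a few more slices on the same string:
>>> string = "testing\n"
>>> string[0:4]   # characters 0 through 3
'test'
>>> string[:-1]   # everything except the last character
'testing'
>>> string[2:]    # everything from index 2 onwards
'sting\n'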
EDIT
I just read your updated question. I think I understand now. You have a file, like this:
aqua:test$ cat wordlist.txt
Testing
This
Wordlist
With
Returns
Between
Lines
and you want to get rid of the empty lines. Instead of modifying the file while you're reading from it, create a new file that you can write the non-empty lines from the old file into, like so:
# script
rf = open("wordlist.txt")
wf = open("newwordlist.txt","w")
for line in rf:
newline = line.rstrip('\r\n')
wf.write(newline)
wf.write('\n') # remove to leave out line breaks
rf.close()
wf.close()
You should get:
aqua:test$ cat newwordlist.txt
Testing
This
Wordlist
With
Returns
Between
Lines
If you want something like
TestingThisWordlistWithReturnsBetweenLines
just comment out
wf.write('\n')
You can use a string's rstrip method to remove the newline characters from a string.
>>> 'something\n'.rstrip('\r\n')
'something'
The most efficient is to not specify a strip value:
'\nsomething\n'.strip() will remove all leading and trailing whitespace from the string.
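For example, with no argument every kind of leading and trailing whitespace is removed:
>>> '\nsomething\r\n  '.strip()
'something'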
Or simply use
string.strip("\r\n")
which solves the issue by removing leading and trailing carriage returns and newlines.
Remove empty lines in the file:
#!/usr/bin/env python
import fileinput
for line in fileinput.input("wordlist.txt", inplace=True):
    if line != '\n':
        print line,
The file is moved to a backup file and standard output is directed to the input file.
'whatever\r\r\r\r\r\r\r\r\n\n\n\n\n'.translate(None, '\r\n')
returns
'whatever'
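Note that the two-argument form of str.translate only exists in Python 2; in Python 3 the equivalent would be something like:
'whatever\r\r\r\r\r\r\r\r\n\n\n\n\n'.translate(str.maketrans('', '', '\r\n'))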
This is also a possible solution
file1 = open('myfile.txt', 'r')
conv_file = open("numfile.txt", "w")
temp = file1.read().splitlines()
for element in temp:
    conv_file.write(element)
file1.close()
conv_file.close()
