Python read text file and split on control character - python

I'm working with output text files from Hadoop and Hive where the files have fields delimited by control-A. I'm then using Python to read the file line-by-line, but the string split() function is not splitting correctly even when I specify the delimiter.
Here is some sample data that is typical of what I get from Hadoop. Note that ^A is actually a control character.
field1^Afield2^Afield3^Afield4
field5^Afield6^Afield7^Afield8
You can see that the Linux command-line tool cut using the control code as a delimiter actually works. It is outputting the third field:
bash> cat test.txt | cut -d $'\001' -f 3
field3
field7
I then wrote a Python function that reads the file line-by-line using the standard Python idiom:
import re
def read_file(filename):
    ''' Read file line-by-line and split. '''
    with open(filename, "r") as myfile:
        for line in myfile:
            tokens = line.split('\u0001')
            #tokens = line.split('\^A')
            #tokens = re.split('\^A', line)
            print 'len(tokens): %d, tokens[0]: %s\n' % (len(tokens), tokens[0])
However, when I run the function, the string is not split correctly. There should be four tokens in each line.
>>> read_file('test2.txt')
len(tokens): 1, tokens[0]: field1field2field3field4
len(tokens): 1, tokens[0]: field5field6field7field8
As you can see in my Python function, I tried three different approaches to splitting the string. None of them worked.
tokens = line.split('\u0001')
tokens = line.split('\^A')
tokens = re.split('\^A', line)
Thanks for any help.
Related questions (none had a working solution for me):
delimiting carat A in python
re.split not working on ^A

Assuming that control-A is character "\x01" (ASCII code 1):
>>> line="field1\x01field2\x01field3\x01field4"
>>> line.split("\x01")
['field1', 'field2', 'field3', 'field4']
If you want to use the "\u0001" notation, you need the 'u' prefix (Python 2):
>>> line.split(u"\u0001")
[u'field1', u'field2', u'field3', u'field4']
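For reference, here is a minimal Python 3 sketch of the same idea applied to a whole file. In Python 3 every str is Unicode, so "\x01" and "\u0001" are the same character and no prefix is needed. The function name read_ctrl_a_file and the UTF-8 encoding are assumptions for illustration, not something taken from the question.

```python
def read_ctrl_a_file(filename):
    """Return one list of fields per line, splitting on control-A (\\x01)."""
    rows = []
    # encoding="utf-8" is an assumption; use whatever your Hive export uses.
    with open(filename, "r", encoding="utf-8") as myfile:
        for line in myfile:
            # Drop the trailing newline first so it does not end up
            # glued to the last field.
            rows.append(line.rstrip("\r\n").split("\x01"))
    return rows
```

Each element of the returned list should then hold the four fields of one line.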

Related

Find TM superscript in python 2 using regex

My text file includes "SSS™" as one of its words and I am trying to find it using regular expression. My problem is with finding ™ superscript. My code is:
import re
path='G:\python_code\A.txt'
f_general=open(path, 'r')
special=re.findall(r'\U2122',f_general.read())
print(special)
but it doesn't print anything. How can I fix it?
It may have to do with the encoding of your file. Try this:
import re
path = r"g:\python_code\A.txt"
f_general=open(path, "r", encoding="UTF-16")
data = f_general.read()
special=re.findall(chr(8482), data)
print(special)
print(chr(8482))
Note I'm using the decimal value for Trade mark. This is the site I use:
https://www.ascii.cl/htmlcodes.htm
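So, open your file in Notepad, choose Save As, and pick the "Unicode" encoding (which Notepad writes as UTF-16); then this should all work. Working with extended ASCII can be a hassle. I am using Python 3.6; in Python 2 the built-in open() has no encoding argument and chr() only accepts values below 256, so this exact code will not run there.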
Note when it prints out the chr(8482) in your command line it will probably just be a T, at least that is what I get in windows.
Update
Try this for Python 2; it should capture the word before the trademark sign:
import re
with open(r"g:\python_code\A.txt", "rb") as f:
    data = f.read().decode("UTF-16")
# unichr() is the Python 2 counterpart of chr() for code points above 255
regex = re.compile(u"\\S+" + unichr(8482), re.UNICODE)
match = regex.search(data)
if match:
    print (match.group(0))

search for string and print line containing that string in a very large file

I'm trying to search for a string (an email address) and print the line it is found in, within a 1.66 GB .dump file (Ashley Madison). If I change print (line) to print ('true'), I get true returned, so I know it is reading the file, but when I try to print the line, Python crashes with no error. Please help. Python 3.4 on Windows Vista (rather than using a database and importing, I'm using this as a learning exercise for Python).
import os
with open('aminno_member_email.dump', 'r', errors='ignore') as searchfile:
    for line in searchfile:
        if 'email#address.com' in line:
            #print ('true')
            print (line)
As I suspected, each line of that file is very long (to the tune of nearly a million characters, as you found). Most consoles are not set up to handle that sort of thing, so writing that line to a text file is your best bet. You can then open the file in a text editor or word processor and use its search function to locate areas of interest.
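Writing the matching lines to a file instead of the console could look roughly like the sketch below. The function name save_matching_lines and the output path are made up for illustration; errors="ignore" is carried over from the question's code.

```python
def save_matching_lines(src_path, dst_path, needle):
    """Copy every line of src_path that contains needle to dst_path.

    Returns the number of lines written.
    """
    count = 0
    # The file is still read one line at a time, so memory use is bounded
    # by the longest single line, not by the size of the whole dump.
    with open(src_path, "r", errors="ignore") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if needle in line:
                dst.write(line)
                count += 1
    return count
```

For example, save_matching_lines('aminno_member_email.dump', 'matches.txt', 'email#address.com') would leave the hits in matches.txt (a hypothetical file name), which you can then open in an editor.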
To display your search string with some characters of surrounding text, you can use a regular expression.
import re
...
# replace this:
'''
if 'email#address.com' in line:
    #print ('true')
    print (line)
'''
# with this:
print(*re.findall(r'(.{0,10}email#address\.com.{0,10})', line), sep='\n')
That will print each match with up to 10 characters before and after the search string, separated by a newline.
Example:
>>> print(*re.findall(r'(.{0,10}str.{0,10})', 'hello this is a string with text and it is very strong stuff'), sep='\n')
this is a string with t
t is very strong stuff
Open the file as a stream and read from it line by line instead of loading the entire file into RAM. Use io from the Python standard library.
import io
with io.open('aminno_member_email.dump', 'r') as file:
    ...

Reconstruct a URL string from sys.stdin in Python

I have a script that takes input from a large log file. This file has encoded URLs.
I am using standard input to grab these URLs from the file. I wish to process each URL separately.
The problem is that when I get a single URL, it is split up into the individual characters of the URL. I call ''.join(something), but after processing I still get characters.
e.g.
for line in sys.stdin:
    line = line.strip()
    line1 = ''.join(line)
I also tried collecting all the characters in the URL and then joining them. Still the same result.
Sample output I get:
Input from file: " www.cnn.com"
output after sys.stdin and processing: ['w','w','w','.','c','n','n','.','c','o','m']
The list appears because I make it so; otherwise I get www.cnn.com from sys.stdin. But the underlying structure is the same as the output.
What I want is:
Input from file: " www.cnn.com"
output: "www.cnn.com" (this should be one string. not strings of individual characters)
Thanks
I think your stdin input might be garbled. Consider this script:
#stdin.py
import sys
for line in sys.stdin:
    print line.strip()
Then piping input into it works as expected:
$ echo -e "www.cnn.com\nwww.test.com" | python stdin.py
www.cnn.com
www.test.com
If you call list() on a string, it splits it up by character:
>>> list("test")
['t', 'e', 's', 't']
I'm guessing what you probably want to do is read the entire input and then split it on whitespace (which includes newlines), like this:
import sys
lines = sys.stdin.read().split()
print lines
Running it, I get:
$ echo -e "www.cnn.com\nwww.test.com" | python stdin.py
['www.cnn.com', 'www.test.com']
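If you would rather keep one string per line (so that a line containing spaces stays intact), a Python 3 sketch along these lines may help. The helper name read_urls is hypothetical, and io.StringIO only stands in for sys.stdin so the example can be run on its own.

```python
import io


def read_urls(stream):
    """Return one stripped string per non-empty line of a text stream."""
    urls = []
    for line in stream:
        url = line.strip()  # still a single str, not a list of characters
        if url:
            urls.append(url)
    return urls


# io.StringIO stands in for sys.stdin here.
urls = read_urls(io.StringIO(" www.cnn.com\nwww.test.com\n"))
print(urls)  # ['www.cnn.com', 'www.test.com']
```

In a real script you would pass sys.stdin to read_urls. Each element is one whole URL; you only get single characters if you iterate over a string or call list() on it.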

Using grep in python

There is a file (query.txt) containing keywords/phrases that are to be matched against other files using grep. The last three lines of the following code work perfectly, but when the same command is used inside the while loop it goes into an infinite loop or something (i.e. it doesn't respond).
import os
f=open('query.txt','r')
b=f.readline()
while b:
    cmd='grep %s my2.txt'%b #my2 is the file in which we are looking for b
    os.system(cmd)
    b=f.readline()
f.close()
a='He is'
cmd='grep %s my2.txt'%a
os.system(cmd)
First of all, you are not iterating over the file properly. You can simply use for b in f: without the .readline() stuff.
Then your code will blow up in your face as soon as a query contains any characters which have a special meaning in the shell (including the trailing newline that readline() leaves on each line). Use subprocess.call instead of os.system() and pass an argument list.
Here's a fixed version:
import os
import subprocess
with open('query.txt', 'r') as f:
    for line in f:
        line = line.rstrip() # remove trailing whitespace such as '\n'
        subprocess.call(['/bin/grep', line, 'my2.txt'])
However, you can improve your code even more by not calling grep at all.
Read my2.txt into a string instead and then use the re module to perform the search. If you do not need a regex at all, you can simply use if line in my2_content.
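A rough sketch of that grep-free approach, assuming Python 3. The function name matching_lines and its use_regex switch are invented for this example; text is expected to be the content of my2.txt already read into a string.

```python
import re


def matching_lines(text, query, use_regex=False):
    """Return the lines of text that contain query.

    With use_regex=False the query is matched literally (plain `in` test);
    with use_regex=True it is treated as a regular expression.
    """
    lines = text.splitlines()
    if use_regex:
        pattern = re.compile(query)
        return [line for line in lines if pattern.search(line)]
    return [line for line in lines if query in line]
```

The literal branch needs no escaping of special characters, which sidesteps both the shell-quoting and the regex-metacharacter problems.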
Your code scans the whole my2.txt file for each query in query.txt.
You want to:
read all queries into a list
iterate once over all lines of the text file and check each line against all queries.
Try this code:
with open('query.txt','r') as f:
    queries = [l.strip() for l in f]
with open('my2.txt','r') as f:
    for line in f:
        for query in queries:
            if query in line:
                print query, line
This isn't actually a good way to use Python, but if you have to do something like that, then do it correctly:
from __future__ import with_statement
import subprocess
def grep_lines(filename, query_filename):
    with open(query_filename, "rb") as myfile:
        for line in myfile:
            subprocess.call(["/bin/grep", line.strip(), filename])
grep_lines("my2.txt", "query.txt")
And hope that your file doesn't contain any characters which have special meanings in regular expressions =)
Also, you might be able to do this with grep alone:
grep -f query.txt my2.txt
It works like this:
~ $ cat my2.txt
One two
two two
two three
~ $ cat query.txt
two two
three
~ $ python bar.py
two two
two three
$ grep -wFf query.txt my2.txt > out.txt
This matches all the keywords in query.txt against my2.txt (-F treats them as fixed strings rather than regular expressions, -w matches whole words only) and saves the output in out.txt.
Read man grep for a description of all the possible arguments.

How can I detect DOS line breaks in a file?

I have a bunch of files. Some have Unix line endings, many have DOS ones. I'd like to test each file to see if it is DOS formatted before I switch the line endings.
How would I do this? Is there a flag I can test for? Something similar?
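In Python 2, Python can automatically detect which newline convention a file uses, thanks to the "universal newline mode" (U), and you can access Python's guess through the newlines attribute of file objects (in Python 3, universal newlines are the default in text mode, and the 'U' flag was removed in Python 3.11):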
f = open('myfile.txt', 'U')
f.readline() # Reads a line
# The following now contains the newline ending of the first line:
# It can be "\r\n" (Windows), "\n" (Unix), "\r" (Mac OS pre-OS X).
# If no newline is found, it contains None.
print repr(f.newlines)
This gives the newline ending of the first line (Unix, DOS, etc.), if any.
As John M. pointed out, if by any chance you have a pathological file that uses more than one newline coding, f.newlines is a tuple with all the newline codings found so far, after reading many lines.
Reference: http://docs.python.org/2/library/functions.html#open
If you just want to convert a file, you can simply do:
with open('myfile.txt', 'U') as infile:
    text = infile.read() # Automatic ("Universal read") conversion of newlines to "\n"
with open('myfile.txt', 'w') as outfile:
    outfile.write(text) # Writes newlines for the platform running the program
You could search the string for \r\n. That's DOS style line ending.
EDIT: Take a look at this
(Python 2 only:) If you just want to read text files, either DOS or Unix-formatted, this works:
print open('myfile.txt', 'U').read()
That is, Python's "universal" file reader will automatically use all the different end of line markers, translating them to "\n".
http://docs.python.org/library/functions.html#open
(Thanks handle!)
As a complete Python newbie & just for fun, I tried to find some minimalistic way of checking this for one file. This seems to work:
if "\r\n" in open("/path/file.txt","rb").read():
    print "DOS line endings found"
Edit: simplified as per John Machin's comment (no need to use regular expressions).
DOS line breaks are \r\n, Unix only \n. So just search for \r\n.
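If you also want to spot files with mixed endings, a small sketch like the following (assuming Python 3 and a file that fits in memory) counts each kind after reading the file in binary mode. The function name count_line_endings is made up for this example.

```python
def count_line_endings(data):
    """Count line-ending styles in a bytes object.

    Returns a dict with the number of CRLF (DOS), bare LF (Unix)
    and bare CR (old Mac) line endings.
    """
    crlf = data.count(b"\r\n")
    # Every CRLF also contains one "\n" and one "\r", so subtract them
    # to get the bare occurrences.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"crlf": crlf, "lf": lf, "cr": cr}
```

You would call it as count_line_endings(open(path, "rb").read()); a file is purely DOS-formatted when only the "crlf" count is non-zero.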
Using grep & bash:
grep -c -m 1 $'\r$' file
echo $'\r\n\r\n' | grep -c $'\r$' # test
echo $'\r\n\r\n' | grep -c -m 1 $'\r$'
You can use the following function (which should work in Python 2 and Python 3) to get the newline representation used in an existing text file. All three possible kinds are recognized. The function reads the file only up to the first newline to decide. This is faster and less memory consuming when you have larger text files, but it does not detect mixed newline endings.
In Python 3, you can then pass the output of this function to the newline parameter of the open function when writing the file. This way you can alter the content of a text file without changing its newline representation.
def get_newline(filename):
    with open(filename, "rb") as f:
        while True:
            c = f.read(1)
            if not c or c == b'\n':
                break
            if c == b'\r':
                if f.read(1) == b'\n':
                    return '\r\n'
                return '\r'
    return '\n'
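A possible way to use that result in Python 3 is sketched below. The helper rewrite_with_newline and its transform argument are hypothetical, and UTF-8 is assumed as the file encoding; the newline argument is meant to be the string returned by get_newline above.

```python
def rewrite_with_newline(path, transform, newline):
    """Rewrite a text file in place while keeping its newline style.

    transform is a function taking and returning the file's text;
    newline is the ending to write: "\\n", "\\r\\n" or "\\r".
    """
    # Default text mode translates every line ending to "\n" on read.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    # On write, each "\n" is translated to the requested newline string.
    with open(path, "w", encoding="utf-8", newline=newline) as f:
        f.write(transform(text))
```

For example, rewrite_with_newline(path, str.upper, get_newline(path)) would upper-case a file and leave its line endings as they were.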
