Python 3: receive user input including newline characters - python

I'm trying to read in the following text from the command-line in Python 3 (copied verbatim, newlines and all):
lcbeika
rraobmlo
grmfina
ontccep
emrlin
tseiboo
edosrgd
mkoeys
eissaml
knaiefr
Using input(), I can only read in the first word, because it stops reading at the first newline.
Is there a way to read them all in without calling input() repeatedly?

You can import sys and use the methods on sys.stdin, for example:
import sys
text = sys.stdin.read()
or:
lines = sys.stdin.readlines()
or:
for line in sys.stdin:
    # Do something with line.
    pass
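As a minimal sketch of the first option, the helper below reads everything up to end-of-file and returns one entry per non-empty line. The name `read_words` is made up for illustration; it accepts any text stream so it can be tried without a terminal, and `sys.stdin` is only the default.

```python
import io
import sys


def read_words(stream=None):
    """Read the stream until EOF and return one entry per non-empty line."""
    if stream is None:
        stream = sys.stdin
    # read() returns the whole input, newlines included;
    # splitlines() removes the line endings again.
    return [line.strip() for line in stream.read().splitlines() if line.strip()]


# Simulated input; in a real script, call read_words() with no argument.
sample = io.StringIO("lcbeika\nrraobmlo\ngrmfina\n")
print(read_words(sample))
```

When pasting text into an interactive terminal, end the input with Ctrl-D (Unix) or Ctrl-Z followed by Enter (Windows) so that read() sees end-of-file.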

If you are passing the text into your script as a file, you can use readlines(), e.g.:
data = open("file").readlines()
Or you can use fileinput:
import fileinput
for line in fileinput.input():
    print(line, end="")

Related

Python read text file and split on control character

I'm working with output text files from Hadoop and Hive where the files have fields delimited by control-A. I'm then using Python to read the file line-by-line, but the string split() function is not splitting correctly even when I specify the delimiter.
Here is some sample data that is typical of what I get from Hadoop. Note that ^A is actually a control character.
field1^Afield2^Afield3^Afield4
field5^Afield6^Afield7^Afield8
You can see that the Linux command-line tool cut using the control code as a delimiter actually works. It is outputting the third field:
bash> cat test.txt | cut -d $'\001' -f 3
field3
field7
I then wrote a Python function that reads the file line-by-line using the standard Python idiom:
import re
def read_file(filename):
    ''' Read file line-by-line and split. '''
    with open(filename, "r") as myfile:
        for line in myfile:
            tokens = line.split('\u0001')
            #tokens = line.split('\^A')
            #tokens = re.split('\^A', line)
            print 'len(tokens): %d, tokens[0]: %s\n' % (len(tokens), tokens[0])
However, when I run the function, the string is not split correctly. There should be four tokens in each line.
>>> read_file('test2.txt')
len(tokens): 1, tokens[0]: field1field2field3field4
len(tokens): 1, tokens[0]: field5field6field7field8
As you can see in my Python function, I tried three different approaches to splitting the string. None of them worked.
tokens = line.split('\u0001')
tokens = line.split('\^A')
tokens = re.split('\^A', line)
Thanks for any help.
Related questions (none had a working solution for me):
delimiting carat A in python
re.split not working on ^A
Assuming that control-A is character "\x01" (ASCII code 1):
>>> line="field1\x01field2\x01field3\x01field4"
>>> line.split("\x01")
['field1', 'field2', 'field3', 'field4']
If you want to use the "\u0001" notation, you need the 'u' prefix (Python 2):
>>> line.split(u"\u0001")
[u'field1', u'field2', u'field3', u'field4']
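In Python 3 every str literal is Unicode, so "\x01" and "\u0001" denote the same character and no prefix is needed. A small sketch under that assumption; `split_ctrl_a` is a hypothetical helper, and the sample lines mirror the data in the question:

```python
def split_ctrl_a(lines):
    """Split each line on control-A (U+0001) after dropping the line ending."""
    return [line.rstrip("\r\n").split("\x01") for line in lines]


sample = [
    "field1\x01field2\x01field3\x01field4\n",
    "field5\x01field6\x01field7\x01field8\n",
]
for tokens in split_ctrl_a(sample):
    # Print the token count and the third field, like the cut example.
    print(len(tokens), tokens[2])
```

An open file object can be passed in place of the list, since iterating over it yields one line at a time.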

python stdin with txt file and input() function

I have an input.txt file with the following content.
3
4 5
I want to use this as a standard input by using the following command in the command line.
python a.py < input.txt
In the a.py script, I am trying to read the input line by line using the input() function. I know there are better ways to read stdin, but I need to use input().
A naive approach of
line1 = input()
line2 = input()
did not work. I get the following error message.
File "<string>", line 1
4 5
^
SyntaxError: unexpected EOF while parsing
That way is ok, it works:
read = input()
print(read)
but you are just reading one line.
From the input() doc:
The function then reads a line from input, converts it to a string
(stripping a trailing newline), and returns that.
That means that if the file does not end with a blank line or, equivalently, if the last non-blank line of the file does not end with an end-of-line character, you will get exceptions.SyntaxError and the last line will not be read.
You mention HackerRank; looking at some of my old submissions, I think I opted to give up on input in favor of sys.stdin manipulations. input() is very similar to next(sys.stdin), but the latter will handle EOF just fine.
By way of example, my answer for https://www.hackerrank.com/challenges/maximize-it/:
import sys
import itertools
# next(sys.stdin) is functionally identical to input() here
nK, M = (int(n) for n in next(sys.stdin).split())
# but I can also iterate over it
K = [[int(n) for n in line.split()][1:] for line in sys.stdin]
print(max(sum(x**2 for x in combo) % M for combo in itertools.product(*K)))
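A sketch of the same idea for the two-line input from the question. readline() returns an empty string at end-of-file instead of raising, and it never evaluates what it reads. The `File "<string>"` line in the traceback is what Python 2 prints when its input() evaluates the text it has read, so it may be worth checking which interpreter the `python` command starts; that is a guess from the traceback, not something the question states. `read_int_rows` is an illustrative name.

```python
import io
import sys


def read_int_rows(stream=None):
    """Return a list of integer lists, one per non-blank line of the stream."""
    if stream is None:
        stream = sys.stdin
    rows = []
    while True:
        line = stream.readline()
        if not line:  # an empty string means EOF
            break
        if line.strip():
            rows.append([int(tok) for tok in line.split()])
    return rows


# The last line deliberately has no trailing newline.
print(read_int_rows(io.StringIO("3\n4 5")))
```

Calling read_int_rows() with no argument reads from standard input, so it works with python a.py < input.txt.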

search for string and print line containing that string in a very large file

I'm trying to search for a string (an email address) and print the line it is found in, within a 1.66 GB .dump file (Ashley Madison). If I change print (line) to print ('true'), I get true returned, so I know it is reading the file, but when I try to print the line, Python crashes with no error. Please help. Python 3.4 on Windows Vista. (Rather than using a database and importing, I'm using this as a learning exercise for Python.)
import os
with open('aminno_member_email.dump', 'r', errors='ignore') as searchfile:
    for line in searchfile:
        if 'email#address.com' in line:
            #print ('true')
            print (line)
As I suspected, each line of that file is very long (to the tune of nearly a million characters, as you found). Most consoles are not set up to handle that sort of thing, so writing that line to a text file is your best bet. You can then open the file in a text editor or word processor and use its search function to locate areas of interest.
To display your search string with some characters of surrounding text, you can use a regular expression.
import re
...
# replace this:
'''
if 'email#address.com' in line:
    #print ('true')
    print (line)
'''
# with this:
print(*re.findall(r'(.{0,10}email#address\.com.{0,10})', line), sep='\n')
That will print each match with up to 10 characters before and after the search string, separated by a newline.
Example:
>>> print(*re.findall(r'(.{0,10}str.{0,10})', 'hello this is a string with text and it is very strong stuff'), sep='\n')
this is a string with t
t is very strong stuff
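The two suggestions above (write to a file instead of the console, and show only a little context around each match) can be combined. This is a sketch rather than a drop-in fix: `find_with_context` and the `.hits.txt` output name are made up for illustration, and the demo runs on a small throwaway file with invented content instead of the real dump.

```python
import os
import re
import tempfile


def find_with_context(path, needle, width=10):
    """Yield each occurrence of needle with up to `width` characters around it."""
    pattern = re.compile(r".{0,%d}%s.{0,%d}" % (width, re.escape(needle), width))
    # Iterating over the file object reads one line at a time.
    with open(path, "r", errors="ignore") as searchfile:
        for line in searchfile:
            for match in pattern.finditer(line):
                yield match.group(0)


# Demo on a throwaway file standing in for the real dump (made-up content).
with tempfile.NamedTemporaryFile("w", suffix=".dump", delete=False) as tmp:
    tmp.write("x" * 50 + "email#address.com" + "y" * 50 + "\n")

hits = list(find_with_context(tmp.name, "email#address.com"))

# Write the short snippets to a text file instead of the console.
results_path = tmp.name + ".hits.txt"
with open(results_path, "w") as out:
    out.write("\n".join(hits) + "\n")

# Clean up the demo files.
os.remove(tmp.name)
os.remove(results_path)
```

re.escape keeps characters such as the dot in the address from being treated as regular expression syntax.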
Open the file as a stream and read from the stream instead of loading the entire file into RAM. Use io from the Python standard library.
import io
with io.open('aminno_member_email.dump', 'r') as file:
    ...

Use of readline function with Python StringIO module

I am trying to read a file from an FTP site and process it one line at a time. I write from the FTP server to a StringIO object and call the readline function, but this returns the entire file rather than the first line. I downloaded the file to my PC and examined it with a hex editor, and the file uses 0x0D0A for a newline, that is, a carriage return followed by a line feed. Could somebody point out where I might be going wrong here?
Thanks in advance!
#!/usr/bin/python
import ftplib
import StringIO
settles = StringIO.StringIO()
ftp = ftplib.FTP('ftp.cmegroup.com')
ftp.login()
ftp.cwd('pub/settle/')
ftp.retrlines('RETR cbt.settle.s.txt', settles.write)
settles.seek(0)
print settles.readline()
According to the FTP.retrlines documentation:
... The callback function is called for each line with a string argument containing the line with the trailing CRLF stripped. ....
Replace retrlines with retrbinary.
Alternatively, you can keep retrlines and append the newline yourself in the callback:
ftp.retrlines('RETR cbt.settle.s.txt', lambda line: settles.write(line + '\n'))
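For Python 3, where the StringIO module has become io.StringIO, the same callback approach might look like the sketch below. `fetch_text` is an illustrative name, and `FakeFTP` is a stand-in written only so the example runs without a network connection; with a real server you would pass an ftplib.FTP instance after login() and cwd().

```python
import io


def fetch_text(ftp, remote_name):
    """Download a text file through ftp.retrlines into an in-memory buffer."""
    buffer = io.StringIO()
    # retrlines passes each line without its line ending, so add one back.
    ftp.retrlines("RETR " + remote_name, lambda line: buffer.write(line + "\n"))
    buffer.seek(0)
    return buffer


class FakeFTP:
    """Stand-in for ftplib.FTP (hypothetical, for the demo only)."""

    def retrlines(self, cmd, callback):
        for line in ("first line", "second line"):
            callback(line)


settles = fetch_text(FakeFTP(), "cbt.settle.s.txt")
print(settles.readline())
```

With the newline restored, readline() returns only the first line instead of the whole file.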

How can I detect DOS line breaks in a file?

I have a bunch of files. Some have Unix line endings, many have DOS ones. I'd like to test each file to see if it is DOS formatted before I switch the line endings.
How would I do this? Is there a flag I can test for? Something similar?
Python can automatically detect what newline convention is used in a file, thanks to the "universal newline mode" (U), and you can access Python's guess through the newlines attribute of file objects:
f = open('myfile.txt', 'U')
f.readline() # Reads a line
# The following now contains the newline ending of the first line:
# It can be "\r\n" (Windows), "\n" (Unix), "\r" (Mac OS pre-OS X).
# If no newline is found, it contains None.
print repr(f.newlines)
This gives the newline ending of the first line (Unix, DOS, etc.), if any.
As John M. pointed out, if by any chance you have a pathological file that uses more than one newline coding, f.newlines is a tuple with all the newline codings found so far, after reading many lines.
Reference: http://docs.python.org/2/library/functions.html#open
If you just want to convert a file, you can simply do:
with open('myfile.txt', 'U') as infile:
    text = infile.read()  # Automatic ("Universal read") conversion of newlines to "\n"
with open('myfile.txt', 'w') as outfile:
    outfile.write(text)  # Writes newlines for the platform running the program
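The 'U' flag belongs to Python 2, and recent Python 3 releases reject it. In Python 3, text mode handles universal newlines by default and the file object still exposes the newlines attribute. A sketch under that assumption; `detect_newlines` is an illustrative name, and the demo file is created on the fly:

```python
import os
import tempfile


def detect_newlines(path):
    """Return the newline style(s) seen in the file: a str, a tuple, or None."""
    # newline=None (the default) enables universal newline handling.
    with open(path, "r", newline=None) as f:
        for _ in f:  # read to the end so every line ending is seen
            pass
        return f.newlines


# Throwaway file with DOS line endings.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"one\r\ntwo\r\n")

print(repr(detect_newlines(tmp.name)))
os.remove(tmp.name)
```

A file with mixed line endings yields a tuple, so an isinstance(result, tuple) check is one way to flag the pathological case mentioned above.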
You could search the string for \r\n. That's DOS style line ending.
EDIT: Take a look at this
(Python 2 only:) If you just want to read text files, either DOS or Unix-formatted, this works:
print open('myfile.txt', 'U').read()
That is, Python's "universal" file reader will automatically use all the different end of line markers, translating them to "\n".
http://docs.python.org/library/functions.html#open
(Thanks handle!)
As a complete Python newbie & just for fun, I tried to find some minimalistic way of checking this for one file. This seems to work:
if "\r\n" in open("/path/file.txt","rb").read():
    print "DOS line endings found"
Edit: simplified as per John Machin's comment (no need to use regular expressions).
DOS line breaks are \r\n; Unix uses only \n. So just search for \r\n.
Using grep & bash:
grep -c -m 1 $'\r$' file
echo $'\r\n\r\n' | grep -c $'\r$' # test
echo $'\r\n\r\n' | grep -c -m 1 $'\r$'
You can use the following function (which should work in Python 2 and Python 3) to get the newline representation used in an existing text file. All three possible kinds are recognized. The function reads the file only up to the first newline to decide. This is faster and uses less memory for large text files, but it does not detect mixed newline endings.
In Python 3, you can then pass the output of this function to the newline parameter of the open function when writing the file. This way you can alter the content of a text file without changing its newline representation.
def get_newline(filename):
    with open(filename, "rb") as f:
        while True:
            c = f.read(1)
            if not c or c == b'\n':
                break
            if c == b'\r':
                if f.read(1) == b'\n':
                    return '\r\n'
                return '\r'
    return '\n'
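Since the function above stops at the first newline, a complementary sketch is to count every style in the raw bytes, which also exposes mixed endings. This trades the early exit for a full pass over the data held in memory; `count_newline_styles` is an illustrative name, not part of the answer above.

```python
def count_newline_styles(data):
    """Count CRLF, lone CR and lone LF line endings in a bytes object."""
    crlf = data.count(b"\r\n")
    return {
        "\r\n": crlf,
        "\r": data.count(b"\r") - crlf,  # CRs that are not part of a CRLF
        "\n": data.count(b"\n") - crlf,  # LFs that are not part of a CRLF
    }


counts = count_newline_styles(b"dos\r\nunix\nold mac\rdos again\r\n")
print(counts)
```

To check a file, read it in binary mode and pass the bytes, for example count_newline_styles(open(path, "rb").read()); more than one non-zero count means the endings are mixed.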
