I have a script that takes input from a large log file. This file contains encoded URLs.
I am using standard input to read these URLs from the file, and I want to process each URL separately.
The problem is that when I get a single URL, it is split up into its individual characters. I call ''.join(something), but after processing I still get characters.
e.g.
for line in sys.stdin:
    line = line.strip()
    line1 = ''.join(line)
I also tried collecting all the characters in the URL and then joining. Still same result.
Sample output I get:
Input from file: " www.cnn.com"
Output after sys.stdin and processing: ['w','w','w','.','c','n','n','.','c','o','m']
The list appears because I make it so. Otherwise I get www.cnn.com from sys.stdin, but the underlying structure is the same as the output.
What I want is:
Input from file: " www.cnn.com"
Output: "www.cnn.com" (this should be one string, not a string per character)
Thanks
I think your stdin input might be garbled. Consider this script:
#stdin.py
import sys
for line in sys.stdin:
    print line.strip()
Then piping input into it works as expected:
$ echo -e "www.cnn.com\nwww.test.com" | python stdin.py
www.cnn.com
www.test.com
If you call list() on a string, it splits it up by character:
>>> list("test")
['t', 'e', 's', 't']
I'm guessing what you probably want to do is read the entire input and then split it on whitespace (which includes newlines), like this:
import sys
lines = sys.stdin.read().split()
print lines
Running it, I get:
$ echo -e "www.cnn.com\nwww.test.com" | python stdin.py
['www.cnn.com', 'www.test.com']
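The line-by-line approach can be sketched as a small function. This is only a minimal sketch, assuming one URL per line and Python 3; `read_urls` is a hypothetical helper name, and `io.StringIO` stands in for `sys.stdin` so the example runs without piped input.

```python
import io


def read_urls(stream):
    """Return one stripped string per non-empty line of `stream`."""
    urls = []
    for line in stream:
        line = line.strip()    # drop surrounding whitespace and the newline
        if line:               # skip blank lines
            urls.append(line)  # each item is a whole URL, not a character
    return urls


# io.StringIO stands in for sys.stdin here (an assumption for the demo).
fake_stdin = io.StringIO(" www.cnn.com\nwww.test.com\n")
print(read_urls(fake_stdin))
```

In a real script you would pass `sys.stdin` instead of the `StringIO` object.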
I'm working with output text files from Hadoop and Hive where the files have fields delimited by control-A. I'm then using Python to read the file line-by-line, but the string split() function is not splitting correctly even when I specify the delimiter.
Here is some sample data that is typical of what I get from Hadoop. Note that ^A is actually a control character.
field1^Afield2^Afield3^Afield4
field5^Afield6^Afield7^Afield8
You can see that the Linux command-line tool cut using the control code as a delimiter actually works. It is outputting the third field:
bash> cat test.txt | cut -d $'\001' -f 3
field3
field7
I then wrote a Python function that reads the file line-by-line using the standard Python idiom:
import re
def read_file(filename):
    ''' Read file line-by-line and split. '''
    with open(filename, "r") as myfile:
        for line in myfile:
            tokens = line.split('\u0001')
            #tokens = line.split('\^A')
            #tokens = re.split('\^A', line)
            print 'len(tokens): %d, tokens[0]: %s\n' % (len(tokens), tokens[0])
However, when I run the function, the string is not split correctly. There should be four tokens in each line.
>>> read_file('test2.txt')
len(tokens): 1, tokens[0]: field1field2field3field4
len(tokens): 1, tokens[0]: field5field6field7field8
As you can see in my Python function, I tried three different approaches to splitting the string. None of them worked.
tokens = line.split('\u0001')
tokens = line.split('\^A')
tokens = re.split('\^A', line)
Thanks for any help.
Related questions (none had a working solution for me):
delimiting carat A in python
re.split not working on ^A
Assuming that control-A is character "\x01" (ASCII code 1):
>>> line="field1\x01field2\x01field3\x01field4"
>>> line.split("\x01")
['field1', 'field2', 'field3', 'field4']
If you want to use the "\u0001" notation, you need the 'u' prefix (Python 2):
>>> line.split(u"\u0001")
[u'field1', u'field2', u'field3', u'field4']
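Applied to a whole file, the same split might look like the sketch below. It assumes Python 3 and that the delimiter really is "\x01"; `split_fields` and `DELIMITER` are hypothetical names, and `io.StringIO` stands in for the opened file.

```python
import io

DELIMITER = "\x01"  # control-A, ASCII code 1 (assumed to be the field delimiter)


def split_fields(stream, delimiter=DELIMITER):
    """Return a list of field lists, one per line of `stream`."""
    rows = []
    for line in stream:
        # Remove only the line ending so empty trailing fields survive.
        rows.append(line.rstrip("\r\n").split(delimiter))
    return rows


# Stand-in for open('test.txt'); the data mirrors the sample in the question.
sample = io.StringIO("field1\x01field2\x01field3\x01field4\n"
                     "field5\x01field6\x01field7\x01field8\n")
for tokens in split_fields(sample):
    print(len(tokens), tokens[2])
```

With a real file, `split_fields(open(filename))` inside a `with` block would do the same thing.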
I am using sys.stdin in my code, and I want to know how to test my code on multiple text files.
My code(test.py) is:
for line in sys.stdin:
    line = line.strip()
    words = line.split()
I am trying to test it on 2 text files, so I type in terminal:
echo "test1.txt" "test2.txt" | test.py
but it does not work, so I just want to know how I can test the code on 2 text files.
echo "test1.txt" "test2.txt" | test.py
won't actually run test.py; you need to use this command instead:
echo "test1.txt" "test2.txt" | python test.py
However, another method for getting arguments into python would be:
import sys
for arg in sys.argv:
    print arg
Which when run like so:
python test.py "test1" "test2"
Produces the following output:
test.py
test1
test2
The first argument of argv is the name of the program. This can be skipped with:
import sys
for arg in sys.argv[1:]:
    print arg
A further problem you appear to be having is you're assuming that python is opening the text files you're handing it in the loop - this isn't true. If you print in the loop you'll see it's only printing the strings you gave it initially.
If you actually want to open and parse the files, do something like this in the loop:
import sys
args = sys.stdin.readlines()[0].replace("\"","").split()
for arg in args:
    arg = arg.strip()
    with open(arg, "r") as f:
        for line in f:
            line = line.strip()
            words = line.split()
The reason we have that odd first line is that stdin is a stream, so we have to read it in via readlines().
The result is a list with a single element (because we only gave it one line), hence the [0].
Then we need to remove the internal quotes. The quotes aren't really required when piping, so this would also work:
echo test1.txt test2.txt | python test.py
Finally, we have to split the string into the actual filenames.
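If the filenames are passed as command-line arguments instead of through a pipe, the stdin parsing disappears. The following is a minimal sketch under that assumption, written for Python 3; `words_in_files` is a hypothetical helper name, and the temporary file exists only so the demo runs anywhere.

```python
import os
import tempfile


def words_in_files(filenames):
    """Return a list of (filename, words) pairs, one per line read."""
    results = []
    for name in filenames:
        with open(name, "r") as f:
            for line in f:
                results.append((name, line.strip().split()))
    return results


# Demo with a temporary file; in test.py the list would come from
# sys.argv[1:] instead.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello world\nsecond line here\n")
print(words_in_files([tmp.name]))
os.remove(tmp.name)
```

In test.py this would be called as words_in_files(sys.argv[1:]) and run with: python test.py test1.txt test2.txt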
I am trying to search for a string (an email address) and print the line it is found in, within a 1.66 GB .dump file (Ashley Madison). If I change print (line) to print ('true'), I get true returned, so I know it is reading the file, but when I try to print the line, Python crashes with no error. Please help. Python 3.4 on Windows Vista. (Rather than using a database and importing, I'm using this as a learning exercise for Python.)
import os
with open('aminno_member_email.dump', 'r', errors='ignore') as searchfile:
    for line in searchfile:
        if 'email#address.com' in line:
            #print ('true')
            print (line)
As I suspected, each line of that file is very long (to the tune of nearly a million characters, as you found). Most consoles are not set up to handle that sort of thing, so writing that line to a text file is your best bet. You can then open the file in a text editor or word processor and use its search function to locate areas of interest.
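Writing the matching lines to a text file could be sketched like this. It is only an illustration, assuming Python 3; `save_matching_lines` and the output name 'matches.txt' are hypothetical, and the demo runs on a small throwaway file rather than the real dump.

```python
import os
import tempfile


def save_matching_lines(source_path, needle, output_path):
    """Copy every line of source_path containing needle to output_path.

    Returns the number of lines written.
    """
    count = 0
    with open(source_path, "r", errors="ignore") as src, \
            open(output_path, "w") as dst:
        for line in src:  # reads one line at a time
            if needle in line:
                dst.write(line)
                count += 1
    return count


# Small demo on a throwaway file; for the real dump the call would be e.g.
# save_matching_lines('aminno_member_email.dump', 'email#address.com', 'matches.txt')
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "sample.dump")
target = os.path.join(workdir, "matches.txt")
with open(source, "w") as f:
    f.write("alice email#address.com 1\nbob other 2\n")
print(save_matching_lines(source, "email#address.com", target))
```

The resulting file can then be opened in an editor, which copes with very long lines better than most consoles.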
To display your search string with some characters of surrounding text, you can use a regular expression.
import re
...
# replace this:
'''
if 'email#address.com' in line:
    #print ('true')
    print (line)
'''
# with this:
print(*re.findall(r'(.{0,10}email#address\.com.{0,10})', line), sep='\n')
That will print each match with up to 10 characters before and after the search string, separated by a newline.
Example:
>>> print(*re.findall(r'(.{0,10}str.{0,10})', 'hello this is a string with text and it is very strong stuff'), sep='\n')
this is a string with t
t is very strong stuff
Open the file as a stream and read from it line by line, rather than loading the entire file into RAM. Use io from the Python standard library.
import io
with io.open('aminno_member_email.dump', 'r') as file:
    ...
There is a file (query.txt) containing keywords/phrases that are to be matched against other files using grep. The last three lines of the following code work perfectly, but when the same command is used inside the while loop it goes into an infinite loop or something (i.e. it doesn't respond).
import os
f=open('query.txt','r')
b=f.readline()
while b:
    cmd='grep %s my2.txt'%b #my2 is the file in which we are looking for b
    os.system(cmd)
    b=f.readline()
f.close()
a='He is'
cmd='grep %s my2.txt'%a
os.system(cmd)
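First of all, you are not iterating over the file properly. You can simply use for b in f: without the .readline() stuff.
Then your code will blow up in your face as soon as a query contains any characters which have a special meaning in the shell. That includes the trailing newline that readline() leaves on b: the shell treats it as a command separator, so grep can end up running without a file name and waiting on standard input. Use subprocess.call instead of os.system() and pass an argument list.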
Here's a fixed version:
import os
import subprocess
with open('query.txt', 'r') as f:
    for line in f:
        line = line.rstrip() # remove trailing whitespace such as '\n'
        subprocess.call(['/bin/grep', line, 'my2.txt'])
However, you can improve your code even more by not calling grep at all.
Read my2.txt into a string instead and then use the re module to perform the search. If you do not need a regex at all, you can simply use if line in my2_content.
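A rough sketch of that grep-free approach, assuming Python 3: `find_queries` and its `use_regex` flag are hypothetical names, and the `my2_content` string stands in for the result of reading my2.txt. Note that, unlike grep, it reports which queries occur rather than printing the matching lines.

```python
import re


def find_queries(queries, content, use_regex=False):
    """Return the queries that occur somewhere in `content`."""
    found = []
    for query in queries:
        query = query.strip()
        if not query:
            continue  # an empty query would match everything
        if use_regex:
            matched = re.search(query, content) is not None
        else:
            matched = query in content  # plain substring test
        if matched:
            found.append(query)
    return found


# Stand-in for open('my2.txt').read(); the queries keep their newlines,
# as they would when read line by line from query.txt.
my2_content = "One two\ntwo two\ntwo three\n"
print(find_queries(["two two\n", "three\n", "four\n"], my2_content))
```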
Your code scans the whole my2.txt file for each query in query.txt.
You want to:
read all queries into a list
iterate once over all lines of the text file and check each line against all queries.
Try this code:
with open('query.txt','r') as f:
    queries = [l.strip() for l in f]
with open('my2.txt','r') as f:
    for line in f:
        for query in queries:
            if query in line:
                print query, line
This isn't actually a good way to use Python, but if you have to do something like that, then do it correctly:
from __future__ import with_statement
import subprocess
def grep_lines(filename, query_filename):
    with open(query_filename, "rb") as myfile:
        for line in myfile:
            subprocess.call(["/bin/grep", line.strip(), filename])
grep_lines("my2.txt", "query.txt")
And hope that your file doesn't contain any characters which have special meanings in regular expressions =)
Also, you might be able to do this with grep alone:
grep -f query.txt my2.txt
It works like this:
~ $ cat my2.txt
One two
two two
two three
~ $ cat query.txt
two two
three
~ $ python bar.py
two two
two three
$ grep -wFf query.txt my2.txt > out.txt
This will match all the keywords in query.txt against the my2.txt file and save the output in out.txt.
Read man grep for a description of all the possible arguments.
I'm trying to read in the following text from the command-line in Python 3 (copied verbatim, newlines and all):
lcbeika
rraobmlo
grmfina
ontccep
emrlin
tseiboo
edosrgd
mkoeys
eissaml
knaiefr
Using input, I can only read in the first word as once it reads the first newline it stops reading.
Is there a way I could read in them all without iteratively calling input?
You can import sys and use the methods on sys.stdin for example:
text = sys.stdin.read()
or:
lines = sys.stdin.readlines()
or:
for line in sys.stdin:
    # Do something with line.
If you are passing the text into your script as a file, you can use readlines(), e.g.:
data=open("file").readlines()
Or you can use fileinput:
import fileinput
for line in fileinput.input():
    print(line)
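For the pasted word list in the question, reading everything at once and splitting on whitespace gives one word per list item. This is a minimal Python 3 sketch; `read_words` is a hypothetical name, and `io.StringIO` stands in for `sys.stdin` so the example runs without typed input.

```python
import io


def read_words(stream):
    """Return every whitespace-separated word in `stream` as a list."""
    return stream.read().split()


# Stand-in for sys.stdin, using the first three words from the question.
pasted = io.StringIO("lcbeika\nrraobmlo\ngrmfina\n")
print(read_words(pasted))
```

With read_words(sys.stdin), interactive input only ends at end-of-file (Ctrl-D on Unix-like terminals, Ctrl-Z followed by Enter on Windows).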