Python readline() from a string?

In Python, is there a built-in way to do a readline() on a string? I have a large chunk of data and want to strip off just the first couple of lines without doing split() on the whole string.
Hypothetical example:
def handleMessage(msg):
    headerTo = msg.readline()
    headerFrom = msg.readline()
    sendMessage(headerTo, headerFrom, msg)
msg = "Bob Smith\nJane Doe\nJane,\nPlease order more widgets\nThanks,\nBob\n"
handleMessage(msg)
I want this to result in:
sendMessage("Bob Smith", "Jane Doe", "Jane,\nPlease order...")
I know it would be fairly easy to write a class that does this, but I'm looking for something built-in if possible.
EDIT: Python v2.7

In Python 3, you can use io.StringIO:
>>> msg = "Bob Smith\nJane Doe\nJane,\nPlease order more widgets\nThanks,\nBob\n"
>>> msg
'Bob Smith\nJane Doe\nJane,\nPlease order more widgets\nThanks,\nBob\n'
>>>
>>> import io
>>> buf = io.StringIO(msg)
>>> buf.readline()
'Bob Smith\n'
>>> buf.readline()
'Jane Doe\n'
>>> len(buf.read())
44
In Python 2, you can use StringIO (or cStringIO if performance is important):
>>> import StringIO
>>> buf = StringIO.StringIO(msg)
>>> buf.readline()
'Bob Smith\n'
>>> buf.readline()
'Jane Doe\n'

The easiest way for both Python 2 and 3 is to use the string method splitlines(). It returns a list of lines.
>>> "some\nmultiline\nstring\n".splitlines()
['some', 'multiline', 'string']
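If you need the line endings kept, as readline() returns them, splitlines() also accepts a keepends flag:
>>> "some\nmultiline\nstring\n".splitlines(True)
['some\n', 'multiline\n', 'string\n']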

Why not do only as many splits as you need? Since you're using all of the resulting parts (including the rest of the string), loading it into some other buffer object and then reading it back out again is probably going to be slower, not faster (plus the overhead of the function calls).
If you want the first N lines separated out, just do .split("\n", N).
>>> foo = "ABC\nDEF\nGHI\nJKL"
>>> foo.split("\n", 1)
['ABC', 'DEF\nGHI\nJKL']
>>> foo.split("\n", 2)
['ABC', 'DEF', 'GHI\nJKL']
So for your function:
def handleMessage(msg):
    headerTo, headerFrom, msg = msg.split("\n", 2)
    sendMessage(headerTo, headerFrom, msg)
or if you really wanted to get fancy:
# either...
def handleMessage(msg):
    sendMessage(*msg.split("\n", 2))

# or just...
handleMessage = lambda msg: sendMessage(*msg.split("\n", 2))

Do it like StringIO does it:
i = self.buf.find('\n', self.pos)
So this means:
pos = msg.find("\n")
first_line = msg[:pos]
...
Seems more elegant than using the whole StringIO...
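A minimal sketch of that idea, with iter_lines as a hypothetical helper name (nothing built-in): a generator that walks the string with str.find and yields lines the way readline() would:
def iter_lines(s):
    # Scan for each newline starting from the current position,
    # the same trick StringIO uses internally.
    pos = 0
    while pos < len(s):
        i = s.find('\n', pos)
        if i == -1:
            # No trailing newline: yield whatever is left.
            yield s[pos:]
            return
        # Keep the newline, like file.readline() does.
        yield s[pos:i + 1]
        pos = i + 1
Then it = iter_lines(msg) followed by next(it) plays the role of msg.readline() in the hypothetical example above.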

In Python, strings have a splitlines() method:
msg = "Bob Smith\nJane Doe\nJane,\nPlease order more widgets\nThanks,\nBob\n"
msg_splitlines = msg.splitlines()
headerTo = msg_splitlines[0]
headerFrom = msg_splitlines[1]
sendMessage(headerTo, headerFrom, msg)

Related

Split string by comma, ignoring commas inside quoted strings (trying csv)

I have a string like this:
s = '1,2,"hello, there"'
And I want to turn it into a list:
[1,2,"hello, there"]
Normally I'd use split:
my_list = s.split(",")
However, that doesn't work if there's a comma in a string.
So, I've read that I need to use csv, but I don't really see how. I've tried:
from csv import reader
s = '1,2,"hello, there"'
ll = reader(s)
print ll
for row in ll:
    print row
Which writes:
<_csv.reader object at 0x020EBC70>
['1']
['', '']
['2']
['', '']
['hello, there']
I've also tried with
ll = reader(s, delimiter=',')
That happens because you gave the csv reader a bare string as input, so it iterates over it character by character. If you do not want to use a file or a StringIO object, just wrap your string in a list as shown below.
>>> import csv
>>> s = ['1,2,"hello, there"']
>>> ll = csv.reader(s, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
>>> list(ll)
[['1', '2', 'hello, there']]
It sounds like you probably want to use the csv module. To use the reader on a string, you want a StringIO object.
As an example:
>>> import csv, StringIO
>>> print list(csv.reader(StringIO.StringIO(s)))
[['1', '2', 'hello, there']]
To clarify, csv.reader expects something it can iterate over line by line, such as a file-like object, not a plain string, so StringIO does the trick. However, if you're reading this csv from a file object (a typical use case) you can just as easily give the file object to the reader and it'll work the same way.
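For instance, a minimal sketch reading from a hypothetical data.csv (the Python 2 csv module wants the file opened in binary mode):
import csv

with open('data.csv', 'rb') as f:
    for row in csv.reader(f):
        print row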
It's usually easier to reuse than to reinvent the wheel... You just need to use the csv library properly. If you can't for some reason, you can always check out the source code and learn how the parsing is done there.
An example of parsing a single string into a list. Notice that the string is wrapped in a list.
>>> import csv
>>> s = '1,2,"hello, there"'
>>> list(csv.reader([s]))[0]
['1', '2', 'hello, there']
You can split first on the quote characters, then split every even-indexed chunk on the commas; the odd-indexed chunks are the quoted strings:
import itertools

new_data = s.split('"')
for i in range(len(new_data)):
    if i % 2 == 1:
        # Odd indices sit inside the quotes: keep each as a one-item list.
        new_data[i] = [new_data[i]]
    else:
        # Even indices sit outside the quotes: split on commas and drop
        # the empty strings left by commas adjacent to the quotes.
        new_data[i] = [x for x in new_data[i].split(",") if x]
data = list(itertools.chain(*new_data))
Which goes something like:
'1,2,"hello, there"'
['1,2,', 'hello, there', '']
[['1', '2'], ['hello, there'], []]
['1', '2', 'hello, there']
But it's probably better to use the csv library if that's what you're working with.
You could also use ast.literal_eval if you want to preserve the integers:
>>> from ast import literal_eval
>>> literal_eval('[{}]'.format('1,2,"hello, there"'))
[1, 2, 'hello, there']

Extracting float numbers from a file using Python

I have a .txt file which looks like:
[ -5.44339373e+00 -2.77404404e-01 1.26122094e-01 9.83589873e-01
1.95201179e-01 -4.49866890e-01 -2.06423297e-01 1.04780491e+00]
[ 4.34562117e-01 -1.04469577e-01 2.83633101e-01 1.00452355e-01 -7.12572469e-01 -4.99234705e-01 -1.93152897e-01 1.80787567e-02]
I need to extract all the floats from it and put them into a list/array.
What I've done is this:
A = []
for line in open("general.txt", "r").read().split(" "):
    for unit in line.split("]", 3):
        A.append(list(map(lambda x: str(x), unit.replace("[", "").replace("]", "").split(" "))))
but A contains elements like [''] or, even worse, ['3.20973096e-02\n']. These are all strings, but I need floats. How can I do that?
Why not use a regular expression?
>>> import re
>>> e = r'(\d+\.\d+e?(?:\+|-)\d{2}?)'
>>> results = re.findall(e, your_string)
>>> results
['5.44339373e+00',
'2.77404404e-01',
'1.26122094e-01',
'9.83589873e-01',
'1.95201179e-01',
'4.49866890e-01',
'2.06423297e-01',
'1.04780491e+00',
'4.34562117e-01',
'1.04469577e-01',
'2.83633101e-01',
'1.00452355e-01',
'7.12572469e-01',
'4.99234705e-01',
'1.93152897e-01',
'1.80787567e-02']
Now, these are the matched strings, but you can easily convert them to floats:
>>> map(float, re.findall(e, your_string))
[5.44339373,
0.277404404,
0.126122094,
0.983589873,
0.195201179,
0.44986689,
0.206423297,
1.04780491,
0.434562117,
0.104469577,
0.283633101,
0.100452355,
0.712572469,
0.499234705,
0.193152897,
0.0180787567]
Note, the regular expression might need some tweaking, but it's a good start.
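For example, a variant (only checked against the sample data above) that also keeps the leading sign and makes the exponent optional:
e = r'[-+]?\d+\.\d+(?:[eE][-+]?\d+)?'
floats = [float(x) for x in re.findall(e, your_string)]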
As a more precise approach, you can use a regex to split the lines:
>>> s="""[ -5.44339373e+00 -2.77404404e-01 1.26122094e-01 9.83589873e-01
... 1.95201179e-01 -4.49866890e-01 -2.06423297e-01 1.04780491e+00]
... [ 4.34562117e-01 -1.04469577e-01 2.83633101e-01 1.00452355e-01 -7.12572469e-01 -4.99234705e-01 -1.93152897e-01 1.80787567e-02] """
>>> print re.split(r'[\s\[\]]+',s)
['', '-5.44339373e+00', '-2.77404404e-01', '1.26122094e-01', '9.83589873e-01', '1.95201179e-01', '-4.49866890e-01', '-2.06423297e-01', '1.04780491e+00', '4.34562117e-01', '-1.04469577e-01', '2.83633101e-01', '1.00452355e-01', '-7.12572469e-01', '-4.99234705e-01', '-1.93152897e-01', '1.80787567e-02', '']
And in this case, since you have the data in a file, you can do:
import re
print re.split(r'[\s\[\]]+',open("general.txt", "r").read())
If you want to get rid of the empty strings at the leading and trailing positions, you can just use a list comprehension:
>>> print [i for i in re.split(r'[\s\[\]]*',s) if i]
['-5.44339373e+00', '-2.77404404e-01', '1.26122094e-01', '9.83589873e-01', '1.95201179e-01', '-4.49866890e-01', '-2.06423297e-01', '1.04780491e+00', '4.34562117e-01', '-1.04469577e-01', '2.83633101e-01', '1.00452355e-01', '-7.12572469e-01', '-4.99234705e-01', '-1.93152897e-01', '1.80787567e-02']
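The same comprehension can also do the conversion to floats in one pass:
floats = [float(i) for i in re.split(r'[\s\[\]]+', s) if i]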
Let's slurp the file:
content = open('data.txt').read()
Split on ']':
logical_lines = content.split(']')
Strip the '[' and the other stuff:
logical_lines = [ll.lstrip(' \n[') for ll in logical_lines]
Convert to floats:
lol = [map(float, ll.split()) for ll in logical_lines]
Sticking it all in a one-liner, with a filter for the empty chunk left after the final ']':
lol = [map(float, l.lstrip(' \n[').split()) for l in open('data.txt').read().split(']') if l.strip()]
I've tested it on the example data we were given and it works...

How to easily read Python built-in types from a file

I have a file which lists values of some Python built-in types: None, integers, and strings, with proper Python syntax, including escaping. For example, the file might look like this:
2
"""\\nfoo
bar
""" 'foo bar'
None
I then want to read that file into an array of the values. For the above example, the array would be:
[2, '\\nfoo\nbar\n', 'foo bar', None]
I can do this by careful parsing and/or using the split function.
Is there an easy way to do it?
I would recommend changing your file format. That said, what you have is parseable. It might get harder to parse if you have multi-token values like lists, but with only None, ints, and strings, you can tokenize the input with tokenize and parse it with something like ast.literal_eval:
import tokenize
import ast

values = []
with open('input_file') as f:
    for token_type, token_string, _, _, _ in tokenize.generate_tokens(f.readline):
        # Ignore newlines and the file-ending dummy token.
        if token_type in (tokenize.ENDMARKER, tokenize.NEWLINE, tokenize.NL):
            continue
        values.append(ast.literal_eval(token_string))
You can use ast.literal_eval:
>>> import ast
>>> ast.literal_eval('2')
2
>>> type(ast.literal_eval('2'))
<type 'int'>
>>> ast.literal_eval('[1,2,3]')
[1, 2, 3]
>>> type(ast.literal_eval('[1,2,3]'))
<type 'list'>
>>> ast.literal_eval('"a"')
'a'
>>> type(ast.literal_eval('"a"'))
<type 'str'>
This almost gets you there, but due to the way strings work, it ends up combining the two strings:
import ast

with open('tokens.txt') as in_file:
    current_string = ''
    tokens = []
    for line in in_file:
        current_string += line.strip()
        try:
            new_token = ast.literal_eval(current_string)
            tokens.append(new_token)
            current_string = ''
        except SyntaxError:
            print("Couldn't parse current line, combining with next")
tokens
Out[8]: [2, '\\nfoobarfoo bar', None]
The problem is that in Python, if you have two string literals sitting next to each other, they concatenate even if you don't use +, e.g.:
x = 'string1' 'string2'
x
Out[10]: 'string1string2'
I apologize for posting an answer to my own question, but it looks like what works is to replace unquoted whitespace (including newlines) with commas, then put [] around the whole thing and evaluate it.
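A rough sketch of that idea for Python 2, with read_values as a hypothetical helper name: tokenize the text, join the tokens with commas so that adjacent string literals no longer concatenate, wrap the result in brackets, and hand it to ast.literal_eval:
import ast
import tokenize
from StringIO import StringIO

def read_values(text):
    tokens = []
    for tok_type, tok_string, _, _, _ in tokenize.generate_tokens(StringIO(text).readline):
        # Skip newlines and the file-ending dummy token.
        if tok_type in (tokenize.ENDMARKER, tokenize.NEWLINE, tokenize.NL):
            continue
        tokens.append(tok_string)
    # The inserted commas keep the two adjacent string literals separate.
    return ast.literal_eval('[' + ','.join(tokens) + ']')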

Python strings / match case

I have a CSV file which has the following format:
id,case1,case2,case3
Here is a sample:
123,null,X,Y
342,X,X,Y
456,null,null,null
789,null,null,X
For each line I need to know which of the cases is not null. Is there an easy way to find out which case(s) are not null without splitting the string and going through each element?
This is what the result should look like:
123,case2:case3
342,case1:case2:case3
456:None
789:case3
You probably want to take a look at the CSV module, which has readers and writers that will enable you to create transforms.
>>> from StringIO import StringIO
>>> from csv import DictReader
>>> fh = StringIO("""
... id,case1,case2,case3
...
... 123,null,X,Y
...
... 342,X,X,Y
...
... 456,null,null,null
...
... 789,null,null,X
... """.strip())
>>> dr = DictReader(fh)
>>> dr.next()
{'case1': 'null', 'case3': 'Y', 'case2': 'X', 'id': '123'}
At which point you can do something like:
>>> from csv import DictWriter
>>> out_fh = StringIO()
>>> writer = DictWriter(out_fh, fieldnames=dr.fieldnames)
>>> for mapping in dr:
...     writer.writerow(dict((k, v) for k, v in mapping.items() if v != 'null'))
...
The last bit is only a sketch; replace out_fh with the filehandle that you'd like to output to (dr.fieldnames is a real attribute of DictReader, so the header carries over).
Any way you slice it, you are still going to have to go through the list. There are more and less elegant ways to do it. Depending on the Python version you are using, you can use list comprehensions:
ids = line.split(",")
print "%s:%s" % (ids[0], ":".join(["case%d" % x for x in range(1, len(ids)) if ids[x] != "null"]))
Why do you treat splitting as a problem? For performance reasons?
Literally, you could avoid splitting with smart regexps, like:
\d+,null,\w+,\w+
\d+,\w+,null,\w+
...
but I find it a worse solution than reparsing the data into lists.
You could use the Python csv module, which comes with the standard installation of Python... It will not be much easier, though...
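A minimal sketch of that route, assuming the rows live in a hypothetical cases.csv and using ':' as the separator throughout, like the answer above:
import csv

with open('cases.csv', 'rb') as f:
    reader = csv.reader(f)
    fields = reader.next()  # header row: id,case1,case2,case3
    for row in reader:
        # Collect the column names whose value is not the literal 'null'.
        hits = [fields[i] for i in range(1, len(row)) if row[i] != 'null']
        print "%s:%s" % (row[0], ":".join(hits) or "None")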

pyyaml: dumping without tags

I have
>>> import yaml
>>> yaml.dump(u'abc')
"!!python/unicode 'abc'\n"
But I want
>>> import yaml
>>> yaml.dump(u'abc', magic='something')
'abc\n'
What magic param forces no tagging?
You can use safe_dump instead of dump. Just keep in mind that it won't be able to represent arbitrary Python objects then. Also, when you load the YAML, you will get a str object instead of unicode.
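For instance (a Python 2 session; the trailing '...' is YAML's document-end marker):
>>> import yaml
>>> yaml.safe_dump(u'abc')
'abc\n...\n'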
How about this:
def unicode_representer(dumper, uni):
    node = yaml.ScalarNode(tag=u'tag:yaml.org,2002:str', value=uni)
    return node

yaml.add_representer(unicode, unicode_representer)
This seems to make dumping unicode objects work the same as dumping str objects for me (Python 2.6).
In [72]: yaml.dump(u'abc')
Out[72]: 'abc\n...\n'
In [73]: yaml.dump('abc')
Out[73]: 'abc\n...\n'
In [75]: yaml.dump(['abc'])
Out[75]: '[abc]\n'
In [76]: yaml.dump([u'abc'])
Out[76]: '[abc]\n'
You need a new dumper class that does everything the standard Dumper class does but overrides the representers for str and unicode.
from yaml.dumper import Dumper
from yaml.representer import SafeRepresenter

class KludgeDumper(Dumper):
    pass

KludgeDumper.add_representer(str, SafeRepresenter.represent_str)
KludgeDumper.add_representer(unicode, SafeRepresenter.represent_unicode)
Which leads to
>>> print yaml.dump([u'abc',u'abc\xe7'],Dumper=KludgeDumper)
[abc, "abc\xE7"]
>>> print yaml.dump([u'abc',u'abc\xe7'],Dumper=KludgeDumper,encoding=None)
[abc, "abc\xE7"]
Granted, I'm still stumped on how to keep this pretty.
>>> print u'abc\xe7'
abcç
And it breaks a later yaml.load()
>>> yy=yaml.load(yaml.dump(['abc','abc\xe7'],Dumper=KludgeDumper,encoding=None))
>>> yy
['abc', 'abc\xe7']
>>> print yy[1]
abc�
>>> print u'abc\xe7'
abcç
A little addition to interjay's excellent answer: you can keep your unicode on a reload if you take care of your file encodings.
# -*- coding: utf-8 -*-
import yaml
import codecs

data = dict(key=u"abcç\U0001F511")
fn = "test2.yaml"

with codecs.open(fn, "w", encoding="utf-8") as fo:
    yaml.safe_dump(data, fo)

with codecs.open(fn, encoding="utf-8") as fi:
    data2 = yaml.safe_load(fi)

print ("data2:", data2, "type(data.key):", type(data2.get("key")))
print data2.get("key")
test2.yaml contents in my editor:
{key: "abc\xE7\uD83D\uDD11"}
print outputs:
('data2:', {'key': u'abc\xe7\U0001f511'}, 'type(data.key):', <type 'unicode'>)
abcç🔑
Plus, after reading http://nedbatchelder.com/blog/201302/war_is_peace.html I am pretty sure that safe_load/safe_dump is where I want to be anyway.
I've just started with Python and YAML, but this may also help. Just compare the outputs:
def test_dump(self):
    print yaml.dump([{'name': 'value'}, {'name2': 1}], explicit_start=True)
    print yaml.dump_all([{'name': 'value'}, {'name2': 1}])
