How would you do this in Python?
(It goes through a file, pulls out the string between 'author": "' and '", ' and the string between 'text": "' and '\', and then writes each to its own file.)
Here is an example string before it goes through this:
{"text": "Love this series!\ufeff", "time": "Hace 11 horas", "author": "HasRah", "cid": "UgyvXmvSiMjuDrOQn-l4AaABAg"}
#!/bin/bash
cat html.txt | awk -F 'author": "' {'print $2'} | cut -d '"' -f1 >> users.txt
cat html.txt | awk -F 'text": "' {'print $2'} | cut -d '\' -f1 >> comments.txt
I tried to do it like this in Python (it didn't work):
import re
start = '"author": "'
end = '", '
st = open("html.txt", "r")
s = st.readlines()
u = re.search('%s(.*)%s' % (start, end), s).group(1)
#print u.group(1)
Not sure if I'm close.
I get this error code:
Traceback (most recent call last):
File "test.py", line 9, in <module>
u = re.search('%s(.*)%s' % (start, end), s).group(1)
File "/usr/lib/python2.7/re.py", line 146, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
Before getting into any of this: As chepner pointed out in a comment, this input looks like, and therefore is probably intended to be, JSON. Which means you shouldn't be parsing it with regular expressions; just parse it as JSON:
>>> s = ''' {"text": "Love this series!\ufeff", "time": "Hace 11 horas", "author": "HasRah", "cid": "UgyvXmvSiMjuDrOQn-l4AaABAg"}'''
>>> import json
>>> obj = json.loads(s)
>>> obj['author']
'HasRah'
Actually, it's not clear whether your input is a JSON file (a file containing one JSON text), or a JSONlines file (a file containing a bunch of lines, each of which is a JSON text with no embedded newlines). [1]
For the former, you want to parse it like this:
obj = json.load(st)
For the latter, you want to loop over the lines, and parse each one like this:
for line in st:
    obj = json.loads(line)
… or, alternatively, you can get a JSONlines library off PyPI.
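For your particular task, here is a minimal sketch assuming html.txt is JSONlines, one object per line with "author" and "text" keys as in your example (the output file names are taken from your shell script):

import json

# Assumes one JSON object per line; writes the two fields to separate files.
with open("html.txt") as st, \
        open("users.txt", "w") as users, \
        open("comments.txt", "w") as comments:
    for line in st:
        obj = json.loads(line)
        users.write(obj["author"].encode("utf-8") + "\n")
        comments.write(obj["text"].encode("utf-8") + "\n")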
But meanwhile, if you want to understand what's wrong with your code:
The error message is telling you the problem, although maybe not in the user-friendliest way:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/re.py", line 148, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
As the docs for re.search make clear:
re.search(pattern, string, flags=0)
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance…
You haven't passed it a string, you've passed it a list of strings. That's the whole point of readlines, after all.
There are two obvious fixes here.
First, you could read the whole file into a single string, instead of reading it into a list of strings:
s = st.read()
u = re.search('%s(.*)%s' % (start, end), s).group(1)
Alternatively, you could loop over the lines, trying to match each one. And, if you do this, you still don't need readlines(), because a file is already an iterable of lines:
for line in st:
    u = re.search('%s(.*)%s' % (start, end), line).group(1)
While we're at it, if any of your lines don't match the pattern, this is going to raise an AttributeError. After all, search returns None if there's no match, but then you're going to try to call None.group(1).
There are two obvious fixes here as well.
You could handle that error:
try:
    u = re.search('%s(.*)%s' % (start, end), line).group(1)
except AttributeError:
    pass
… or you could check whether you got a match:
m = re.search('%s(.*)%s' % (start, end), line)
if m:
    u = m.group(1)
[1] In fact, there are at least two other formats that are nearly, but not quite, identical to JSONlines. I think that if you only care about reading, not creating files, and you don't have any numbers, you can parse all of them with a loop around json.loads or with a JSONlines library. But if you know who created the file, and know that they intended it to be, say, NDJ rather than JSONlines, you should read the docs on NDJ, or get a library made for NDJ, rather than just trusting that some guy on the internet thinks it's OK to treat it as JSONlines.
I'm trying to write a simple lex parser. The code is currently:
from ply import lex

tokens = (
    'COMMENT',
    'OTHER'
)

t_COMMENT = r'^\#.*\n'
t_OTHER = r'^[^\#].*\n'

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

lex.lex()
lex.input(yaml)
for tok in iter(lex.token, None):
    print repr(tok.type), repr(tok.value)
But it fails to parse a simple input file:
# This is a real comment
#And this one also
#/*
# *
# *Variable de feeu
# */
ma_var: True
It is done, over, kaput
With the following output:
'COMMENT' '# This is a real comment\n'
Traceback (most recent call last):
File "parser_adoc.py", line 62, in <module>
main2()
File "parser_adoc.py", line 57, in main2
for tok in iter(lex.token, None):
File "/usr/lib/python2.7/site-packages/ply/lex.py", line 384, in token
newtok = self.lexerrorf(tok)
File "parser_adoc.py", line 44, in t_error
raise TypeError("Unknown text '%s'" % (t.value,))
TypeError: Unknown text '#And this one also
#/*
# *
# *Variable de feeu
# */
ma_var: True
this is done
'
So in summary, I defined two regexes:
One for lines beginning with #
One for lines not beginning with #
But it's not working.
I don't understand what's wrong with my regex.
Could you help?
Simon
In Python regexes (which PLY uses), ^ matches at the beginning of the string, not at the beginning of each line, unless multi-line mode has been set. So since both of your rules start with ^, they can only match on the first line.
You could fix this by wrapping your regexes in (?m:...), which enables multi-line mode, but that's not even necessary here. Instead you can just remove the ^ from the beginning of your rules and it will work as you intend. Since both of your rules always match the entire line, the next token will always start at the beginning of the line - no need to anchor them.
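For example, here is a minimal sketch with the anchors dropped (the rules are otherwise yours; the sample input string is just for illustration):

from ply import lex

tokens = (
    'COMMENT',
    'OTHER'
)

# No ^ anchors: each rule consumes a whole line including its newline,
# so the next token automatically starts at the beginning of the next line.
t_COMMENT = r'\#.*\n'
t_OTHER = r'[^\#].*\n'

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

lex.lex()
lex.input("# a comment\nma_var: True\n")
for tok in iter(lex.token, None):
    print repr(tok.type), repr(tok.value)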
I have a list named my = ['cbs is down', 'abnormal']
and I have opened a file in read mode.
Now I want to check whether any of the strings in the list occur in that file, and perform the if action when one does.
fopen = open("test.txt","r")
my =['cbs is down', 'abnormal']
for line in fopen:
    if my in line:
        print ("down")
and when I execute it, I get the following
Traceback (most recent call last):
File "E:/python/fileread.py", line 4, in <module>
if my in line:
TypeError: 'in <string>' requires string as left operand, not list
This should work things out:
if any(i in line for i in my):
...
Basically you are going through my and checking whether any of its elements is present in line.
fopen = open("test.txt","r")
my =['cbs is down', 'abnormal']
for line in fopen:
    for x in my:
        if x in line:
            print ("down")
Sample input
Some text cbs is down
Yes, abnormal
not in my list
cbs is down
Output
down
down
down
The reason for your error:
The in operator as used in:
if my in line: ...
   ^     ^
   |     |
   |     |_ right-hand side
   |
   |_ left-hand side
for a string operand on the right-hand side (i.e. line) requires a corresponding string operand on the left-hand side. This operand consistency check is implemented by the str.__contains__ method, where the call to __contains__ is made from the string on the right-hand side (see the CPython implementation). Same as:
if line.__contains__(my): ...
You're however passing a list, my, instead of a string.
An easy way to resolve this is to check whether any of the items in the list are contained in the current line, using the builtin any function:
for line in fopen:
    if any(item in line for item in my):
        ...
Or, since you have just two items, use the or operator (pun unintended), which short-circuits in the same way as any:
for line in fopen:
    if 'cbs is down' in line or 'abnormal' in line:
        ...
You could also join the terms in my to a regular expression like \b(cbs is down|abnormal)\b and use re.findall or re.search to find the terms. This way, you can also enclose the pattern in word-boundaries \b...\b so it does not match parts of longer words, and you also see which term was matched, and where.
>>> import re
>>> my = ['cbs is down', 'abnormal']
>>> line = "notacbs is downright abnormal"
>>> p = re.compile(r"\b(" + "|".join(map(re.escape, my)) + r")\b")
>>> p.findall(line)
['abnormal']
>>> p.search(line).span()
(21, 29)
I have thousands of text files containing multiple JSON objects, but unfortunately there is no delimiter between the objects. Objects are stored as dictionaries and some of their fields are themselves objects. Each object might have a variable number of nested objects. Concretely, an object might look like this:
{field1: {}, field2: "some value", field3: {}, ...}
and hundreds of such objects are concatenated without a delimiter in a text file. This means that I can neither use json.load() nor json.loads().
Any suggestions on how I can solve this problem? Is there a known parser that does this?
This decodes your "list" of JSON Objects from a string:
from json import JSONDecoder
def loads_invalid_obj_list(s):
    decoder = JSONDecoder()
    s_len = len(s)
    objs = []
    end = 0
    while end != s_len:
        obj, end = decoder.raw_decode(s, idx=end)
        objs.append(obj)
    return objs
The bonus here is that you play nice with the parser. Hence it keeps telling you exactly where it found an error.
Examples
>>> loads_invalid_obj_list('{}{}')
[{}, {}]
>>> loads_invalid_obj_list('{}{\n}{')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "decode.py", line 9, in loads_invalid_obj_list
obj, end = decoder.raw_decode(s, idx=end)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 2 column 2 (char 5)
Clean Solution (added later)
import json
import re

# shameless copy-paste from json/decoder.py
FLAGS = re.VERBOSE | re.MULTILINE | re.DOTALL
WHITESPACE = re.compile(r'[ \t\n\r]*', FLAGS)

class ConcatJSONDecoder(json.JSONDecoder):
    def decode(self, s, _w=WHITESPACE.match):
        s_len = len(s)
        objs = []
        end = 0
        while end != s_len:
            obj, end = self.raw_decode(s, idx=_w(s, end).end())
            end = _w(s, end).end()
            objs.append(obj)
        return objs
Examples
>>> print json.loads('{}', cls=ConcatJSONDecoder)
[{}]
>>> print json.load(open('file'), cls=ConcatJSONDecoder)
[{}]
>>> print json.loads('{}{} {', cls=ConcatJSONDecoder)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
return cls(encoding=encoding, **kw).decode(s)
File "decode.py", line 15, in decode
obj, end = self.raw_decode(s, idx=_w(s, end).end())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 1 column 5 (char 5)
Sebastian Blask has the right idea, but there's no reason to use regexes for such a simple change.
objs = json.loads("[%s]"%(open('your_file.name').read().replace('}{', '},{')))
Or, more legibly
raw_objs_string = open('your_file.name').read() #read in raw data
raw_objs_string = raw_objs_string.replace('}{', '},{') #insert a comma between each object
objs_string = '[%s]'%(raw_objs_string) #wrap in a list, to make valid json
objs = json.loads(objs_string) #parse json
How about something like this:
import re
import json
jsonstr = open('test.json').read()
p = re.compile( '}\s*{' )
jsonstr = p.sub( '}\n{', jsonstr )
jsonarr = jsonstr.split( '\n' )
for jsonstr in jsonarr:
    jsonobj = json.loads( jsonstr )
    print json.dumps( jsonobj )
Solution
As far as I know }{ does not appear in valid JSON, so the following should be perfectly safe when trying to get strings for separate objects that were concatenated (txt is the content of your file). It does not require any import (not even the re module):
retrieved_strings = map(lambda x: '{'+x+'}', txt.strip('{}').split('}{'))
or, if you prefer list comprehensions (as David Zwicker mentioned in the comments), like this:
retrieved_strings = ['{' + x + '}' for x in txt.strip('{}').split('}{')]
It will result in retrieved_strings being a list of strings, each containing a separate JSON object. See proof here: http://ideone.com/Purpb
Example
The following string:
'{field1:"a",field2:"b"}{field1:"c",field2:"d"}{field1:"e",field2:"f"}'
will be turned into:
['{field1:"a",field2:"b"}', '{field1:"c",field2:"d"}', '{field1:"e",field2:"f"}']
as proven in the example I mentioned.
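Each retrieved string can then be parsed with json.loads (assuming the real objects, unlike the illustration above, have properly quoted keys):

import json

objs = [json.loads(s) for s in retrieved_strings]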
Why don't you load the file as string, replace all }{ with },{ and surround the whole thing with []? Something like:
re.sub(r'\}\s*?\{', '}, {', string_read_from_a_file)
Or simple string replace if you are sure you always have }{ without whitespaces in between.
In case you expect }{ to occur inside strings as well, you could instead split on }{ and try json.loads on each fragment; if you get an error, the fragment wasn't complete, so add the next fragment to it and try again, and so forth, as sketched below.
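Here is a rough sketch of that retry loop (the function name is mine; trailing incomplete data is silently dropped):

import json

def parse_concatenated(text):
    # Split on '}{', then re-grow fragments until json.loads accepts
    # them; a failed parse means the '}{' was inside a string value.
    fragments = text.split('}{')
    objs = []
    buf = ''
    for i, frag in enumerate(fragments):
        if i > 0:
            frag = '{' + frag    # restore the brace removed by split()
        if i < len(fragments) - 1:
            frag = frag + '}'    # restore the brace removed by split()
        buf += frag
        try:
            objs.append(json.loads(buf))
            buf = ''
        except ValueError:
            continue             # incomplete fragment; keep accumulating
    return objs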
import json

file1 = open('filepath', 'r')
data = file1.readlines()
for line in data:
    values = json.loads(line)
    # Now you can access all the objects using values.get('key')
How about reading through the file, incrementing a counter every time a { is found and decrementing it when you come across a }? When your counter reaches 0, you'll know that you've come to the end of one object, so send that slice through json.loads and start counting again. Then just repeat to completion.
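A quick sketch of that idea (note it naively ignores braces inside JSON string values, so treat it as a starting point, not a robust parser):

import json

def split_objects(text):
    # Track brace depth; each time it returns to zero we have one
    # complete top-level object to hand to json.loads.
    objs = []
    depth = 0
    start = 0
    for i, ch in enumerate(text):
        if ch == '{':
            if depth == 0:
                start = i
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                objs.append(json.loads(text[start:i + 1]))
    return objs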
Suppose you added a [ to the start of the text in a file, and used a version of json.load() which, when it detects the error of finding a { instead of an expected comma (or hits the end of the file), spits out the just-completed object?
Replace a file with that junk in it:
$ sed -i -e 's;}{;}, {;g' foo
Do it on the fly in Python:
junkJson.replace('}{', '}, {')
I've recently gotten back into programming, and as a project to get me going and motivated I decided to write a character editor for Fallout 2. The issue I'm having is that after the first few strings I can't seem to pull the data I need using the file offsets or structs.
This is what I am doing.
The file I am working with is www.retro-gaming-world.com/SAVE.DAT
import struct
savefile = open('SAVE.DAT', 'rb')
try:
test = savefile.read()
finally:
savefile.close()
print 'Header: ' + test[0x00:0x18] # returns the save files header description "'FALLOUT SAVE FILE '"
print "Character Name: " + test[0x1D:0x20+4] Returns the characters name "f1nk"
print "Save game name: " + test[0x3D:0x1E+4] # isn't returning the save name "church" like expected
print "Experience: " + str(struct.unpack('>h', test[0x08:0x04])[0]) # is expected to return the current experience but gives the follosing error
Output:
Header: FALLOUT SAVE FILE
Character Name: f1nk
Save game name:
Traceback (most recent call last):
File "test", line 11, in <module>
print "Experience: " + str(struct.unpack('>h', test[0x08:0x04])[0])
struct.error: unpack requires a string argument of length 2
I've confirmed the offsets, but it just isn't returning anything as expected.
test[0x08:0x04] is an empty string because the end index is smaller than the starting index.
For example, test[0x08:0x0A] would give you two bytes as required by the h code.
The syntax for string slicing is s[start:end] or s[start:end:step]; see the Python docs on slicings.
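For example, keeping your 0x08 offset and '>h' format purely for illustration (whether that is really where experience lives in SAVE.DAT is for you to verify):

import struct

# '>h' is a big-endian signed short, which requires exactly 2 bytes,
# so the slice must span [offset : offset + 2].
experience = struct.unpack('>h', test[0x08:0x08 + 2])[0]
print "Experience: " + str(experience)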
I'm in a little over my head on this one, so please pardon my terminology in advance.
I'm running this using Python 2.7 on Windows XP.
I found some Python code that reads a log file, does some stuff, then displays something.
What, that's not enough detail? Ok, here's a simplified version:
#!/usr/bin/python
import re
import sys
class NotSupportedTOCError(Exception):
    pass

def filter_toc_entries(lines):
    while True:
        line = lines.next()
        if re.match(r""" \s*
                .+\s+ \| (?#track)
                \s+.+\s+ \| (?#start)
                \s+.+\s+ \| (?#length)
                \s+.+\s+ \| (?#start sec)
                \s+.+\s*$ (?#end sec)
                """, line, re.X):
            lines.next()
            break
    while True:
        line = lines.next()
        m = re.match(r"""
                ^\s*
                (?P<num>\d+)
                \s*\|\s*
                (?P<start_time>[0-9:.]+)
                \s*\|\s*
                (?P<length_time>[0-9:.]+)
                \s*\|\s*
                (?P<start_sector>\d+)
                \s*\|\s*
                (?P<end_sector>\d+)
                \s*$
                """, line, re.X)
        if not m:
            break
        yield m.groupdict()

def calculate_mb_toc_numbers(eac_entries):
    eac = list(eac_entries)
    num_tracks = len(eac)
    tracknums = [int(e['num']) for e in eac]
    if range(1, num_tracks + 1) != tracknums:
        raise NotSupportedTOCError("Non-standard track number sequence: %s", tracknums)
    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
    offsets = [(int(x['start_sector']) + 150) for x in eac]
    return [1, num_tracks, leadout_offset] + offsets
f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart
The code works fine as long as the log file is "simple" text (I'm tempted to say ASCII, although that may not be precise/accurate; e.g. Notepad++ indicates it's ANSI).
However, the script doesn't work on certain log files (in these cases, Notepad++ says "UCS-2 Little Endian").
I get the following error:
Traceback (most recent call last):
File "simple.py", line 55, in <module>
mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_
toc_entries(f)))
File "simple.py", line 49, in calculate_mb_toc_numbers
leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
IndexError: list index out of range
This log works
This log breaks
I believe it's the encoding that's breaking the script because if I simply do this at a command prompt:
type ascii.log > scrubbed.log
and then run the script on scrubbed.log, the script works fine (this is actually fine for my purposes since there's no loss of important information and I'm not writing back to a file, just printing to the console).
One workaround would be to "scrub" the log file before passing it to Python (e.g. using the type pipe trick above to a temporary file and then have the script run on that), but I would like to have Python "ignore" the encoding if it's possible. I'm also not sure how to detect what type of log file the script is reading so I can act appropriately.
I'm reading this and this but my eyes are still spinning around in their head, so while that may be my longer term strategy, I'm wondering if there's an interim hack I could use.
codecs.open() will allow you to open a file using a specific encoding, and it will produce unicode strings. You can try a few, going from most likely to least likely (or the tool could just always produce UTF-16LE, but ha ha fat chance).
Also, "Unicode In Python, Completely Demystified".
works.log appears to be encoded in ASCII:
>>> data = open('works.log', 'rb').read()
>>> all(d < '\x80' for d in data)
True
breaks.log appears to be encoded in UTF-16LE -- it starts with the 2 bytes '\xff\xfe'. None of the characters in breaks.log are outside the ASCII range:
>>> data = open('breaks.log', 'rb').read()
>>> data[:2]
'\xff\xfe'
>>> udata = data.decode('utf16')
>>> all(d < u'\x80' for d in udata)
True
If these are the only two possibilities, you should be able to get away with the following hack. Change your mainline code from:
f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart
to this:
f = open(sys.argv[1], 'rb')
data = f.read()
f.close()
if data[:2] == '\xff\xfe':
    data = data.decode('utf16').encode('ascii')
# ilines is a generator which produces newline-terminated strings
ilines = (line + '\n' for line in data.splitlines())
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(ilines)))
print mb_toc_urlpart
Python 2.x expects normal strings to be ASCII (or at least single-byte). Try this:
Put this at the top of your Python source file:
from __future__ import unicode_literals
And change all the str to unicode.
[edit]
And as Ignacio Vazquez-Abrams wrote, try codecs.open() to open the input file.