How to read a JSON file after skipping a few lines in Python? - python

I have a JSON file whose contents are as follows:
[
{"time":"56990","device_id":"1","kwh":"279.4"},
{"time":"60590","device_id":"1","kwh":"289.4"},
{"time":"64190","device_id":"1","kwh":"299.4"},
{"time":"67790","device_id":"1","kwh":"319.4"},
]
Now I want to read this file one line at a time using the seek and tell methods in Python. I tried the code below, but it raises an error saying it is not able to decode. What I actually want is to re-read the JSON file every 15 minutes or so, starting from the position where it was last read.
This is what I have tried.
last_pointer = 0
with open(FILENAME) as f:
    f.seek(last_pointer)
    raw_data = json.load(f)  # this raw_data should load json starting from the last pointer
    # .....process something.........
    last_position = f.tell()

If your data is arranged in lines exactly as shown, you can construct an ad-hoc solution by reading the file line by line, trimming the trailing comma, and feeding the result to json.loads. But a better option may be a streaming parser such as ijson.
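For the ijson route, a minimal sketch, assuming the third-party ijson package is installed and the file is valid JSON (the trailing comma in the sample above would need removing first):

import ijson

# stream the top-level array one element at a time instead of loading it whole
with open('dat') as f:
    for record in ijson.items(f, 'item'):  # 'item' addresses each array element
        print(record)

The ad-hoc line-by-line variant looks like this: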

import json
import time

with open('dat') as f:
    line = f.readline()
    while line:
        try:
            raw_data = json.loads(line.strip().strip(','))
            print(raw_data)
            time.sleep(15 * 60)
        except ValueError:
            pass
        line = f.readline()
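To pick up where the last read left off (the seek/tell idea from the question), here is a minimal sketch, assuming the file only ever grows by appended lines and is never rewritten:

import json
import time

FILENAME = 'dat'  # same hypothetical file name as above
last_position = 0

while True:
    with open(FILENAME) as f:
        f.seek(last_position)  # resume where the previous pass stopped
        line = f.readline()
        while line:
            record = line.strip().strip(',')
            if record not in ('[', ']', ''):  # skip the array brackets and blanks
                try:
                    print(json.loads(record))
                except ValueError:
                    pass  # ignore lines that are not valid JSON
            line = f.readline()
        last_position = f.tell()  # remember the offset for the next pass
    time.sleep(15 * 60)  # wait roughly 15 minutes before re-reading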

Related

Read only first line of gzip JSON file

Although a lot of code has been posted here about how to read the first line of a file, I cannot figure out how to read only the first line of a gzipped JSON file in Python.
Here is my current working example. However, it contains a nasty break statement, and the loop seems completely unnecessary:
for line in gzip.open(file, 'rb'):
    one_line = json.loads(line)
    print(one_line)
    break
Is there a solution that keeps the json.loads() command (or a similar one that reads in the JSON file correctly), while only reading the first line of the gzipped JSON file?
Call readline() instead of a for loop.
with gzip.open(file, 'rb') as f:
    line = f.readline()
one_line = json.loads(line)
print(one_line)
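Since gzip.open() returns a file-like object, calling next(f) would work just as well as f.readline() here; either way, only as much of the file as needed for the first line gets decompressed.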

read the header and replace a column value with another one in Python

I am a newbie to Python and I am trying to read in a file with the format below:
ORDER_NUMBER!Speed_Status!Days!
10!YES!100!
10!NO!100!
10!TRUE!100!
And the output to be written to the same file is
ORDER_NUMBER!STATUS!Days!
10!YES!100!
10!NO!100!
10!TRUE!100!
So far I have tried:
# a file named "repo" will be opened in reading mode
file = open('repo.dat', 'r+')
# This will print every line one by one in the file
for line in file:
    if line.startswith('ORDER_NUMBER'):
        words = [w.replace('Speed_Status', 'STATUS') for w in line.partition('!')]
        file.write(words)
input()
But somehow it's not working. What am I missing?
Read file ⇒ replace content ⇒ write to file:
with open('repo.dat', 'r') as f:
    data = f.read()
data = data.replace('Speed_Status', 'STATUS')
with open('repo.dat', 'w') as f:
    f.write(data)
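Reading the whole file into memory like this is fine for a small file like repo.dat; the fileinput approach below processes one line at a time and avoids holding everything at once.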
The ideal way would be to use the fileinput module to replace the file contents in-place, instead of opening the file in update mode r+:
from __future__ import print_function
import fileinput

for line in fileinput.input("repo.dat", inplace=True):
    if line.startswith('ORDER_NUMBER'):
        print(line.replace("Speed_Status", "STATUS"), end="")
    else:
        print(line, end="")
As for why your attempt didn't work: the logic used to form words is incorrect. When you partition the line on '!', the list you get back is ['ORDER_NUMBER', '!', 'STATUS!Days!\n'], with the embedded newline still attached, because partition only splits at the first '!'. Also, write() never accepts a non-string object such as a list; you would need to join it back into a single string before writing it.
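To see this concretely, here is a quick sketch of what that list comprehension actually produces:

line = 'ORDER_NUMBER!Speed_Status!Days!\n'
words = [w.replace('Speed_Status', 'STATUS') for w in line.partition('!')]
print(words)  # ['ORDER_NUMBER', '!', 'STATUS!Days!\n'] -- pieces of the line, not a fixed line

file.write(words) then fails because words is a list, not a string.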

Python Loop through dictionary

I have a file that I wish to parse. It has data in the json format, but the file is not a json file. I want to loop through the file, and pull out the ID where totalReplyCount is greater than 0.
{ "totalReplyCount": 0,
  "newLevel": {
    "main": {
      "url": "http://www.someURL.com",
      "name": "Ronald Whitlock",
      "timestamp": "2016-07-26T01:22:03.000Z",
      "text": "something great"
    },
    "id": "z12wcjdxfqvhif5ee22ys5ejzva2j5zxh04"
  }
},
{ "totalReplyCount": 4,
  "newLevel": {
    "main": {
      "url": "http://www.someUR2L.com",
      "name": "other name",
      "timestamp": "2016-07-26T01:22:03.000Z",
      "text": "something else great"
    },
    "id": "kjsdbesd2wd2eedd23rf3r3r2e2dwe2edsd"
  }
},
My initial attempt was to do the following
def readCsv(filename):
    with open(filename, 'r') as csvFile:
        for row in csvFile["totalReplyCount"]:
            print row
but I get an error stating
TypeError: 'file' object has no attribute '__getitem__'
I know this is just an attempt at printing and not doing what I want to do, but I am a novice at Python and lost as to what I am doing wrong. What is the correct way to do this? My end result should look like this for the ids:
['insdisndiwneien23e2es', 'lsndion2ei2esdsd',....]
EDIT 1 - 7/26/16
I saw that I made a mistake in my formatting when I copied the code (it was late, I was tired...). I switched it to a proper format that is more like JSON; this new edit properly matches the file I am parsing. I then tried to parse it with json and got ValueError: Extra data: line 2 column 1 - line X column 1, where line X is the last line of the file.
def readCsv(filename):
    with open(filename, 'r') as file:
        data = json.load(file)
        pprint(data)
I also tried DictReader and got a KeyError: 'totalReplyCount'. Is the dictionary unordered?
EDIT 2 - 7/27/16
After taking a break, coming back to it, and thinking it over, I realized that what I have (after proper massaging of the data) is a CSV file that contains a proper JSON object on each line. So I have to parse the CSV file, then parse each line, which is a top-level, whole and complete JSON object. The code I used to try and parse this is below, but all I get is the first character of the string, an opening curly brace '{':
def readCsv(filename):
    with open(filename, 'r') as csvfile:
        for row in csv.DictReader(csvfile):
            for item in row:
                print item[0]
I am guessing that the DictReader is converting the JSON object to a string, and that is why I am only getting a curly brace as opposed to the first key. If I were to do print item[0:5] I would get a mishmash of the first few characters in an unordered fashion on each line, which I assume is because the format has turned into an unordered list? I think I understand my problem a little better, but I am still wrapping my head around the data structures and the methods used to parse them. What am I missing?
After reading the question and all the above answers, please check whether this is useful to you. I have treated the input file as a plain text file, not as CSV or JSON.
The flow of the code is as follows:
1. Open and read the file in reverse order.
2. Search each line for an ID; extract it and store it in a temp variable.
3. Keep reading line by line, searching for totalReplyCount.
4. Once totalReplyCount is found, check whether it is greater than 0.
5. If yes, store the temp ID in id_list and re-initialize the temp variable.
import re

tmp_id_to_store = ''
id_list = []
for line in reversed(open("a.txt").readlines()):
    m = re.search(r'"id":"(\w+)"', line.rstrip())
    if m:
        tmp_id_to_store = m.group(1)
    n = re.search(r'{ "totalReplyCount": (\d+),', line.rstrip())
    if n:
        fou = n.group(1)
        if int(fou) > 0:
            id_list.append(tmp_id_to_store)
            tmp_id_to_store = ''
print id_list
More check points can be added.
As the error states, your csvFile is a file object, not a dict object, so you can't index into it with ["totalReplyCount"].
If your file is in CSV format, you can use the csv module to read each line of the CSV into a dict:
import csv

with open(filename) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print row['totalReplyCount']
Note the DictReader class from the csv module: it reads each CSV line and parses it into a dict object.
If your input file is JSON, why not just use the json library to parse it and then run a for loop over the resulting data? Then it is just a matter of iterating over the keys and extracting the data.
import json
from pprint import pprint

with open('data.json') as data_file:
    data = json.load(data_file)
pprint(data)
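If, as Edit 2 suggests, each line of the file really is one complete JSON object (with a trailing comma between objects), a line-by-line json.loads pass is a minimal sketch; the file name here is hypothetical and the field names are taken from the sample data:

import json

ids = []
with open('data.txt') as f:               # hypothetical file name
    for line in f:
        line = line.strip().rstrip(',')   # drop the comma separating objects
        if not line:
            continue
        obj = json.loads(line)
        if obj['totalReplyCount'] > 0:
            ids.append(obj['newLevel']['id'])
print(ids)  # e.g. ['kjsdbesd2wd2eedd23rf3r3r2e2dwe2edsd']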
Look at Justin Peel's answer to "Parsing values from a JSON file using Python?" on Stack Overflow; it should help.
Here is a shell one-liner that should solve your problem, though it's not Python:
egrep -o '"(totalReplyCount|id)":(.*)$' filename | awk '/totalReplyCount/ {if ($2+0 > 0) {getline; print}}' | cut -d: -f2
output:
"kjsdbesd2wd2eedd23rf3r3r2e2dwe2edsd"

Python removing duplicates and saving the result

I am trying to remove duplicates from a 3-column tab-delimited txt file; as long as the first two columns of two rows are duplicates, the later row should be removed even if the two rows have different third columns.
from operator import itemgetter
import sys

input = sys.argv[1]
output = sys.argv[2]

# Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0, 1)
seen = set()
data = []
for line in input.splitlines():
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)
    file = open(output, "w")
    file.write(data)
    file.close()
First, I get an error:
key = ig(line.split())
IndexError: list index out of range
Also, I can't see how to save the result to output.txt.
People say saving to output.txt is a really basic matter, but no tutorial helped. I tried methods that use codecs, methods that use with, methods that use file.write(data), and none of them helped.
I could learn MATLAB quite easily. The online tutorial was fantastic and a series of Googling always helped a lot.
But I can't find a helpful tutorial for Python yet. This is obviously because I am a complete novice. For complete novices like me, what would be the best tutorial with 1) comprehensiveness AND 2) lots of examples AND 3) line-by-line explanation that doesn't leave any line without explanation?
And why is the above code causing an error and not saving the result?
I'm assuming that since you assign input to the first command line argument with input = sys.argv[1] and output to the second, you intend those to be your input and output file names. But you're never opening any file for the input data, so you're calling .splitlines() on a file name, not on file contents.
Next, splitlines() is the wrong approach here anyway. To iterate over a file line by line, simply use for line in f, where f is an open file. Those lines will include the newline at the end of the line, so it needs to be stripped if it's not supposed to be part of the third column's data.
Then you're opening and closing the file inside your loop, which means you'll try to write the entire contents of data to the file every iteration, effectively overwriting any data written to the file before. Therefore I moved that block out of the loop.
It's good practice to use the with statement for opening files. with open(out_fn, "w") as outfile will open the file named out_fn and assign the open file to outfile, and close it for you as soon as you exit that indented block.
input is a builtin function in Python. I therefore renamed your variables so no builtin names get shadowed.
You're trying to write data directly to the output file. This won't work, since data is a list of lines; you need to join those lines first to turn them back into a single string before writing it to a file.
So here's your code with all those issues addressed:
from operator import itemgetter
import sys

in_fn = sys.argv[1]
out_fn = sys.argv[2]

getkey = itemgetter(0, 1)

seen = set()
data = []
with open(in_fn, 'r') as infile:
    for line in infile:
        line = line.strip()
        key = getkey(line.split())
        if key not in seen:
            data.append(line)
            seen.add(key)

with open(out_fn, "w") as outfile:
    outfile.write('\n'.join(data))
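As a quick check: given a hypothetical tab-delimited input with the rows a/b/1, a/b/2, and a/c/3, this keeps only a/b/1 and a/c/3, since the key is built from columns 0 and 1 alone and the second a/b row repeats that key.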
Why is the above code causing an error?
Because you haven't opened the file: you are trying to work with the string 'input.txt' rather than with the file contents. When you then try to access your items, you get a list index out of range error because line.split() returns ['input.txt'], which has no second element.
How to fix that: open the file and then work with it, not with its name.
For example, you can do (I tried to stay as close to your code as possible)
input = sys.argv[1]
infile = open(input, 'r')
(...)
lines = infile.readlines()
infile.close()
for line in lines:
    (...)
Why is this not saving the result?
Because you are opening/closing the file inside the loop. What you need to do is write the data once you're out of the loop. Also, you cannot write a list directly to a file, so you need to do something like this (outside of your loop):
outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()
All together
There are other ways of reading/writing files, and they are pretty well documented on the internet, but I tried to stay close to your code so that you would understand better what was wrong with it.
from operator import itemgetter
import sys

input = sys.argv[1]
infile = open(input, 'r')
output = sys.argv[2]

# Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0, 1)
seen = set()
data = []
lines = infile.readlines()
infile.close()
for line in lines:
    print line
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)
print data
outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()
PS: it seems to produce the result that you needed; see also Python to remove duplicates using only some, not all, columns.

Given a URL to a text file, what is the simplest way to read the contents of a text file that has tons and tons of data?

I have checked this other answer that I found on this forum: In Python, given a URL to a text file, what is the simplest way to read the contents of the text file?
It was useful, but if you take a look at my URL file here, http://baldboybakery.com/courses/phys2300/resources/CDO6674605799016.txt, you'll notice that there is a ton of data in it. So when I use this code:
import urllib2

data = urllib2.urlopen('http://baldboybakery.com/courses/phys2300/resources/CDO6674605799016.txt').read(69700)  # read only 69700 chars
data = data.split("\n")  # then split it into lines
for line in data:
    print line
Python reads 69700 characters of the URL file, headers included, but my problem is that I need all of the data in there, which is about 30000000 characters or so.
When I ask for that many characters, I get only a chunk of the data, and the headers for each of the columns in the URL file are gone. How can I fix this problem?
What yer gonna wanna do here is read and process the data in chunks, e.g.:
import urllib2

f = urllib2.urlopen('http://baldboybakery.com/courses/phys2300/resources/CDO6674605799016.txt')
while True:
    next_chunk = f.read(4096)  # read next 4k
    if not next_chunk:  # all data has been read
        break
    process_chunk(next_chunk)  # arbitrary processing
f.close()
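Note that fixed-size chunks can split a line across two reads, so if the processing is line-oriented, the line-by-line approach in the next answer is usually the safer choice.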
The simple ways work just fine:
If you want to examine the file line by line:
for line in urllib2.urlopen('http://baldboybakery.com/courses/phys2300/resources/CDO6674605799016.txt'):
    # Do something, like maybe print the data:
    print line,
Or, if you want to download all of the data:
import sys
import urllib2

data = urllib2.urlopen('http://baldboybakery.com/courses/phys2300/resources/CDO6674605799016.txt')
data = data.read()
sys.stdout.write(data)
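If the goal is simply to get all of the data onto disk without holding roughly 30 million characters in memory, here is one more minimal sketch, assuming Python 2 to match the urllib2 code above (the local file name is arbitrary):

import shutil
import urllib2

url = 'http://baldboybakery.com/courses/phys2300/resources/CDO6674605799016.txt'
response = urllib2.urlopen(url)
with open('CDO6674605799016.txt', 'wb') as out:  # arbitrary local file name
    shutil.copyfileobj(response, out)  # streams in chunks, never loads the whole file
response.close()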
