Extracting data from dictionary in python


I have a program from Dr. Chuck that should print the sum of the counts in this data. The problem is that the count for the JSON shows 2 when there are many entries.
import json
import urllib
url="http://python-data.dr-chuck.net/comments_42.json"
uh = urllib.urlopen(url)
data = uh.read()
print 'Retrieved',len(data),'characters'
print data
info = json.loads(data)
print 'User count:', len(info)
The line print 'User count:', len(info) outputs 2 even though there is a lot of data, so I can only access 2 items and not the rest.
I have no idea why. I can solve the summing part myself; I just don't understand why I only get access to the first 2 items while the rest of the JSON is ignored.

The JSON has two top-level properties: note and comments. That is why you get a length of 2.
This will probably give you what you want:
len(info["comments"])

To count the number of comments:
print 'User count:', len(info["comments"])
To print the total "count":
count = 0
for comment in info["comments"]:
    count += comment["count"]
print 'Total count:', count
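The point made in these answers can be checked offline with a small sketch. It is written for Python 3 (the question uses Python 2 print statements), and the sample string is a made-up stand-in for the real response: only the note/comments shape comes from the answers, and the "name" field is an assumption.

```python
import json

# Hypothetical sample standing in for the real response; only the shape
# (a "note" string plus a "comments" list with "count" fields) is assumed.
sample = '''
{
  "note": "sample data",
  "comments": [
    {"name": "A", "count": 3},
    {"name": "B", "count": 5},
    {"name": "C", "count": 7}
  ]
}
'''

info = json.loads(sample)

top_level_keys = len(info)            # counts the dict keys: note, comments
user_count = len(info["comments"])    # counts the entries in the list
total_count = sum(comment["count"] for comment in info["comments"])

print("Top-level keys:", top_level_keys)
print("User count:", user_count)
print("Total count:", total_count)
```

len() on a dict counts its keys, which is why the outer object always reports 2 no matter how many comments it holds.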

So, your JSON is parsed into a dict like
{"note":"bla", "comments":[...]}
Its length is 2 because there are only two keys in this dict. The right way to handle your case is to get the comments themselves and count them.
For example:
len(info.get('comments', []))

The JSON is composed of note and comments. Inside comments there is another array of objects.
To access that array, use info['comments']; to get its length, use len(info['comments']).

Related

Extracting multiple data from a single list

I'm working on a text file that contains several kinds of information. I converted it into a list in Python and now I'm trying to separate the different data into different lists. The data is laid out as follows:
CODE/ DESCRIPTION/ Unity/ Value1/ Value2/ Value3/ Value4 and then it repeats. An example would be:
P03133 Auxiliar helper un 203.02 417.54 437.22 675.80
My approach so far has been:
Creating lists to store each piece of information:
codes = []
description = []
unity = []
cost = []
Looping to find a code, based on the code's structure, and using the code's index as a base to find the remaining values.
Finding a code is easy; it is a distinct type of information among the other data.
For the remaining values I made a loop that finds the next numeric value after a code. That way I can delimit the rest of the indexes:
The unity would be at the code's index + index until isnumeric - 1, since it is the last item before the first numeric value in each line.
The cost would be at the code's index + index until isnumeric + 2; the third value is the only one I need to store.
The description is a little harder, because the number of elements that compose it varies across the list. So I used slicing, starting at the code's index + 1 and ending at index until isnumeric - 2.
for i, carc in enumerate(txtl):
    if carc[0] == "P" and carc[1].isnumeric():
        codes.append(carc)
        j = 0
        while not txtl[i+j].isnumeric():
            j = j + 1
        description.append(" ".join(txtl[i+1:i+j-2]))
        unity.append(txtl[i+j-1])
        cost.append(txtl[i+j])
I'm facing a problem with this approach: although there will always be more elements in the list after a code, I get this error:
while not txtl[i+j].isnumeric():
txtl[i+j] list index out of range.
I welcome any solution that debugs my code, or even a new approach to the problem.
Note: I will also have to do this for a very similar data source, where the code is just a sequence of 7 digits and thus harder to find among the other data. Any solution that covers this case is also appreciated!

A slight addition to your code should resolve this:
while i+j < len(txtl) and not txtl[i+j].isnumeric():
    j += 1
The first condition fails when out of bounds, so the second one doesn't get checked.
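As a rough illustration of that short-circuit behaviour, here is a minimal Python 3 sketch. The helper name first_numeric_index and the sample tokens are made up for the example. It also shows why the loop in the question runs off the end of the list: a decimal such as 203.02 is not isnumeric(), so no token ever stops the search.

```python
def first_numeric_index(tokens, start):
    """Return the index of the first numeric token at or after start, or None."""
    j = 0
    # The bounds check comes first; `and` short-circuits, so tokens[start + j]
    # is never evaluated once start + j has run past the end of the list.
    while start + j < len(tokens) and not tokens[start + j].isnumeric():
        j += 1
    return start + j if start + j < len(tokens) else None


found = first_numeric_index(["P03133", "helper", "un", "203"], 0)
missing = first_numeric_index(["P03133", "helper", "un", "203.02"], 0)
print(found)    # index of "203"
print(missing)  # None, because "203.02".isnumeric() is False
```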
Also, please use a list of dict items instead of 4 different lists, for example:
thelist = []
thelist.append({'codes': 69, 'description': 'random text', 'unity': 'whatever', 'cost': 'your life'})
In this way you always have the correct values together in the list, and you don't need to keep track of where you are with indexes or other black magic...
EDIT after comment interactions:
Ok, so in this case you split the line you are processing on the space character, and then process the words in the line.
from pprint import pprint  # just for pretty printing

textl = 'P03133 Auxiliar helper un 203.02 417.54 437.22 675.80'
the_list = []

def handle_line(textl: str):
    description = ''
    unity = None
    values = []
    for word in textl.split()[1:]:
        # it splits on space characters by default
        # you can ignore the first item in the list, as this will always be the code
        # str.isnumeric() doesn't work with floats, only integers. See https://stackoverflow.com/a/23639915/9267296
        if not word.replace(',', '').replace('.', '').isnumeric():
            # unity holds the most recent non-numeric word; when another one
            # arrives, the previous one turns out to be part of the description
            if unity is not None:
                description = f'{description} {unity}'.strip()  # I like f-strings
            unity = word
        else:
            values.append(word)
    return {'code': textl.split()[0], 'description': description, 'unity': unity, 'values': values}

the_list.append(handle_line(textl))
pprint(the_list)
str.isnumeric() doesn't work with floats, only integers. See https://stackoverflow.com/a/23639915/9267296
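For the second data source mentioned in the question, where the code is a run of 7 digits, one possible variant is to classify tokens with regular expressions. This is only a sketch under stated assumptions: each record sits on its own line, a code is either "P" followed by digits or exactly seven digits, values look like 203.02, and the unit is the last non-numeric token before the values. The names CODE_RE, NUMBER_RE and parse_line are invented for the example.

```python
import re

# Assumption: a code is "P" followed by digits, or exactly seven digits.
CODE_RE = re.compile(r'P\d+|\d{7}')
# Assumption: values are digits with optional '.' or ',' separators.
NUMBER_RE = re.compile(r'\d[\d.,]*')


def parse_line(line):
    """Split one record into code, description, unity and values."""
    tokens = line.split()
    if not tokens or not CODE_RE.fullmatch(tokens[0]):
        return None  # not a record line
    rest = tokens[1:]
    # Position of the first numeric token; everything before it is text.
    first_num = next(
        (i for i, tok in enumerate(rest) if NUMBER_RE.fullmatch(tok)),
        len(rest),
    )
    words, values = rest[:first_num], rest[first_num:]
    return {
        'code': tokens[0],
        'description': ' '.join(words[:-1]),
        'unity': words[-1] if words else None,
        'values': values,
    }


record = parse_line('P03133 Auxiliar helper un 203.02 417.54 437.22 675.80')
other = parse_line('1234567 Some item kg 1.00 2.00 3.00 4.00')
print(record)
print(other)
```

A description word made only of digits would be mistaken for a value here, so the patterns may need tightening for real data.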

Variable table width with .format

I'm trying to display data from a CSV in a text table. I've got to the point where it displays everything that I need; however, the table width still has to be set by hand, so problems start as soon as the data is longer than the fixed width.
I currently print the table using .format to sort out formatting. Is there a way to set the width of the data to a variable that is dependent on the length of the longest piece of data?
for i in range(len(list_l)):
    if i == 0:
        print(h_dashes)
        print('{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}'.format('|', (list_l[i][0].upper()),'|', (list_l[i][1].upper()),'|',(list_l[i][2].upper()),'|', (list_l[i][3].upper()),'|'))
        print(h_dashes)
    else:
        print('{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}{:^26s}{:^1s}'.format('|', list_l[i][0], '|', list_l[i][1], '|', list_l[i][2],'|', list_l[i][3],'|'))
I realise that the code is far from perfect, however I'm still a newbie so it's piecemeal from various tutorials

You can use a two-pass approach: the first pass gets the maximum length of each field, and the second does what you're currently doing, with the calculated rather than fixed lengths. As per your example with four fields per line, the following shows the basic idea:
# Can set MINIMUM lengths here if desired, eg: lengths = [10, 0, 41, 7]
lengths = [0] * 4
fmtstr = None
for pass_num in range(2):  # "pass" is a reserved word, so it cannot be a variable name
    for i in range(len(list_l)):
        if pass_num == 0:
            # First pass sets lengths as per data.
            for field in range(4):
                lengths[field] = max(lengths[field], len(list_l[i][field]))
        else:
            # Second pass prints the data.
            # First, set format string if not yet set.
            if fmtstr is None:
                fmtstr = '|'
                for item in lengths:
                    fmtstr += '{:^%ds}|' % (item)
            # Now print item (and header stuff if first item).
            if i == 0: print(h_dashes)
            print(fmtstr.format(list_l[i][0].upper(), list_l[i][1].upper(), list_l[i][2].upper(), list_l[i][3].upper()))
            if i == 0: print(h_dashes)
The construction of the format string is done the first time you process an item in pass two.
It does so by taking a collection like [31,41,59] and giving you the string:
|{:^31s}|{:^41s}|{:^59s}|
There's little point using all those {:^1s} format specifiers when the | is not actually a varying item - you may as well code it directly into the format string.
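An alternative that avoids building the format string by hand is a nested replacement field, where the width itself is passed as an argument to .format. The sketch below is a Python 3 illustration, not the answer's own code: the rows are invented sample data, render_table is a made-up helper name, and the '+---+' rule line replaces h_dashes purely so the example is self-contained.

```python
# Hypothetical sample data; the first row is treated as the header.
rows = [
    ['name', 'city'],
    ['Ada Lovelace', 'London'],
    ['Grace Hopper', 'New York'],
]


def render_table(rows):
    """Return the table as a list of lines, sized to the longest cell per column."""
    # zip(*rows) transposes rows into columns so each column can be measured.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    rule = '+' + '+'.join('-' * width for width in widths) + '+'
    lines = [rule]
    for i, row in enumerate(rows):
        cells = [cell.upper() if i == 0 else cell for cell in row]
        # '{:^{w}}' centres the cell in a field whose width is supplied as w.
        body = '|'.join('{:^{w}}'.format(cell, w=width)
                        for cell, width in zip(cells, widths))
        lines.append('|' + body + '|')
        if i == 0:
            lines.append(rule)
    return lines


lines = render_table(rows)
print('\n'.join(lines))
```

Measuring the columns with zip(*rows) takes the place of the explicit first pass, at the cost of assuming every row has the same number of fields.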

Using two for loop to iterate two large list in python. Nothing happens

I want to use two for loops to iterate over two large lists, but nothing happens.
Here is the code:
pfam = open(r'D:\RPS_data\pfam_annotations.tbl', 'r')
number = []
description = []
for j in pfam:
    number.append(j.split('\t')[0])
    description.append(j.split('\t')[-2])
print len(number),len(description)
for j in number:
    for r in description:
        if j==r:
            print 'ok'
else:
    print 'not match'
The result is:
D:\Pathyon\pathyon2.7\python.exe "D:/Biopython/Pycharm/Python Learning/war_battle/AUL-prediction/trash.py"
16230 16230
Process finished with exit code 0
My goal is to find the values that number and description have in common. It looks as if Python does not run the for loop. Is it because the two lists are large? Does anyone know how to overcome this?

I modified your script to run a test. This is what I did: I inserted a dummy text blob as an io.StringIO file, and printed the resulting columns.
import io
annotations=u"""
1024\tmiddle\tkay\tend
3192\tmiddle\tdiana\tend
8675309\tmiddle\tjenny\tend
""".strip()
#pfam = open(r'D:\RPS_data\pfam_annotations.tbl', 'r')
pfam = io.StringIO(annotations)
number = []
description = []
for j in pfam:
    number.append(j.split('\t')[0])
    description.append(j.split('\t')[-2])
print len(number),len(description)
print number, description
for j in number:
    for r in description:
        if j==r:
            print 'ok'
else:
    print 'not match'
When I run this, the output is:
$ python2.7 test.py
3 3
[u'1024', u'3192', u'8675309'] [u'kay', u'diana', u'jenny']
not match
I'll argue that (1) your initial code appears to be working fine; and (2) Python is, in fact, running the second loop.
I am suspicious of your second loop. You are comparing every number against every description, using a direct comparison (==). Those both seem to be possible sources of error.
For example, are you sure you want to compare a number from line 1 (1024) with a description from line 3 (jenny)?
For example, are you sure you want to compare using equality? Are you looking for directly-equals, or containment. A "description" might contain several words, one of which was the number in question.
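The alternatives raised by those questions can be sketched side by side. This is a Python 3 illustration with invented sample values, not the contents of the real file: same-line equality, same-line containment, and values that occur anywhere in both columns.

```python
# Hypothetical values standing in for the two columns read from the file.
number = ["1024", "3192", "8675309"]
description = ["kay", "see also 3192", "1024"]

# 1. Same-line equality: compare each number with the description on its row.
same_row = [n for n, d in zip(number, description) if n == d]

# 2. Same-line containment: the number appears somewhere in the description.
contained = [n for n, d in zip(number, description) if n in d]

# 3. Any-line equality: values present in both columns. A set makes each
#    lookup cheap, instead of comparing every number with every description.
description_set = set(description)
in_both = [n for n in number if n in description_set]

print(same_row)
print(contained)
print(in_both)
```

For two lists of 16230 entries, the nested loop performs roughly 263 million comparisons, while the set-based version does one lookup per number.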

Python *.count is returning a letter instead of a number

I am attempting to count the number of times the letter C appears in a list. When I use:
count = data[data.count('C')]
print ("There are", count, "molecules in the file")
When the code is run, it prints: There are . molecules in the file
If I type data.count('C') after the program has run, it returns the correct value (43). I can't figure out what I am doing wrong.

Could this line have something to do with it, maybe? ;)
count = data[data.count('C')] # This gives you the value at index data.count('C') of data
The actual count, as you later put it, is:
count = data.count('C')

Try replacing the 1st line with:
count = data.count('C')

Modify the first line:
count = data.count('C')
The problem is that you were printing the n'th element of the list data (where n=count) instead of the count itself.
As a side note, this is a better way to print your result:
print("There are {0} molecules in the file".format(count))
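The difference between the two expressions is easy to see on a tiny list. The data below is a made-up stand-in for the file contents, chosen so that the mistaken expression returns "." as in the question.

```python
data = list("CC.")        # hypothetical stand-in: ['C', 'C', '.']

count = data.count("C")   # number of times 'C' occurs in the list
element = data[count]     # the item stored at that index, not a count

print("There are {0} molecules in the file".format(count))
print("data[data.count('C')] gives", repr(element))
```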

You are using data twice ....
count = data[data.count('C')]
should be
count = data.count('C')
This would print
There are 43 molecules in the file

The good news is that you're getting the correct value from the method.
The bad news is that you're using it incorrectly.
You're using the result as an index into the string, which then results in a character from the string. Stop doing that.

Python, AttributeError: 'float' object has no attribute 'encode'

I have a script that consumes a bus location API. I am attempting to parse the lat/lng fields, which are float objects, and I repeatedly receive this error:
row.append(Decimal(items['longitude'].encode('utf-16')))
AttributeError: 'float' object has no attribute 'encode'
# IMPORTS
from decimal import *
#Make Python understand how to read things on the Internet
import urllib2
#Make Python understand the stuff in a page on the Internet is JSON
import simplejson as json
from decimal import Decimal
# Make Python understand csvs
import csv
# Make Python know how to take a break so we don't hammer API and exceed rate limit
from time import sleep
# tell computer where to put CSV
outfile_path='C:\Users\Geoffrey\Desktop\pycharm1.csv'
# open it up, the w means we will write to it
writer = csv.writer(open(outfile_path, 'wb'))
#create a list with headings for our columns
headers = ['latitude', 'longitude']
#write the row of headings to our CSV file
writer.writerow(headers)
# GET JSON AND PARSE IT INTO DICTIONARY
# We need a loop because we have to do this for every JSON file we grab
#set a counter telling us how many times we've gone through the loop, this is the first time, so we'll set it at 1
i=1
#loop through pages of JSON returned, 100 is an arbitrary number
while i<100:
    #print out what number loop we are on, which will make it easier to track down problems when they appear
    print i
    #create the URL of the JSON file we want. We search for 'egypt', want English tweets,
    #and set the number of tweets per JSON file to the max of 100, so we have to do as little looping as possible
    url = urllib2.Request('http://api.metro.net/agencies/lametro/vehicles' + str(i))
    #use the JSON library to turn this file into a Pythonic data structure
    parsed_json = json.load(urllib2.urlopen('http://api.metro.net/agencies/lametro/vehicles'))
    #now you have a giant dictionary.
    #Type in parsed_json here to get a better look at this.
    #You'll see the bulk of the content is contained inside the value that goes with the key, or label "results".
    #Refer to results as an index. Just like list[1] refers to the second item in a list,
    #dict['results'] refers to values associated with the key 'results'.
    print parsed_json
    #run through each item in results, and jump to an item in that dictionary, ex: the text of the tweet
    for items in parsed_json['items']:
        #initialize the row
        row = []
        #add every 'cell' to the row list, identifying the item just like an index in a list
        #if latitude is not None:
        #latitude = str(latitude)
        #if longitude is not None:
        #longitude = str(longitude)
        row.append(Decimal(items['longitude'].encode('utf-16')))
        row.append(Decimal(items['latitude'].encode('utf-16')))
        #row.append(bool(services['predictable'].unicode('utf-8')))
        #once you have all the cells in there, write the row to your csv
        writer.writerow(row)
    #increment our loop counter, now we're on the next time through the loop
    i = i +1
    #tell Python to rest for 5 secs, so we don't exceed our rate limit
    sleep(5)

encode is available only for strings. In your case items['longitude'] is a float, and float doesn't have an encode method. You can type cast it and then use encode.
You can write it like this:
str(items['longitude']).encode('utf-16')
str(items['latitude']).encode('utf-16')
However, Decimal cannot parse an encoded string, so if you need a Decimal, pass str(items['longitude']) to it without calling encode.

encode is a method that strings have, not floats.
Change row.append(Decimal(items['longitude'].encode('utf-16'))) to row.append(Decimal(str(items['longitude']))) and do the same with the other line. The encode call has to go, because Decimal cannot parse a UTF-16 encoded string.
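A minimal Python 3 sketch of the conversion follows, assuming the API returns plain floats as the question describes. The coordinates are invented sample values, not real API output.

```python
from decimal import Decimal

# Hypothetical item shaped like one entry of parsed_json['items'].
item = {"latitude": 34.05, "longitude": -118.25}

# Floats have no encode method, which is what the AttributeError reports.
float_has_encode = hasattr(item["longitude"], "encode")

# Converting through str() keeps the short decimal text of the float;
# no encoding step is involved.
row = [Decimal(str(item["longitude"])), Decimal(str(item["latitude"]))]

print(float_has_encode)
print(row)
```

Going through str() rather than calling Decimal on the float directly matters: Decimal(34.05) captures the exact binary value of the float, which is not equal to Decimal('34.05').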
