Scrapy with a nested array - python

I'm new to Scrapy and would like to understand how to scrape an object for output into nested JSON. Right now, I'm producing JSON that looks like this:
[
{'a' : 1,
'b' : '2',
'c' : 3},
]
And I'd like it to look more like this:
[
{ 'a' : '1',
'_junk' : {
'b' : 2,
'c' : 3}},
]
where I put some fields under _junk to post-process later.
The current code under the parser definition file in my scrapername.py is...
item['a'] = x
item['b'] = y
item['c'] = z
And it seemed like
item['a'] = x
item['_junk']['b'] = y
item['_junk']['c'] = z
---might fix that, but I'm getting an error about the _junk key:
File "/usr/local/lib/python2.7/dist-packages/scrapy/item.py", line 49, in __getitem__
return self._values[key]
exceptions.KeyError: '_junk'
Does this mean I need to change my items.py somehow? Currently I have:
class Website(Item):
    a = Field()
    _junk = Field()
    b = Field()
    c = Field()

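You need to create the _junk dictionary before storing values in it. The statement item['_junk']['b'] = y first looks up item['_junk'], and that lookup raises the KeyError because nothing has been assigned to the field yet. Declaring _junk = Field() in items.py only declares the field; it does not give it a value.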
item['a'] = x
item['_junk'] = {}
item['_junk']['b'] = y
item['_junk']['c'] = z
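A minimal sketch of the failure and the fix, using a plain dict as a stand-in for the Scrapy item (an assumption made so the snippet runs without Scrapy; x, y and z are placeholder values):

```python
# Plain dict standing in for the Scrapy item (assumption: like the item,
# it raises KeyError when a key that was never assigned is read).
item = {}
x, y, z = 1, 2, 3  # placeholder values

# Reading item['_junk'] before it exists is what fails.
try:
    item['_junk']['b'] = y
except KeyError as exc:
    missing_key = exc.args[0]

# Create the nested dict first, then fill it.
item['a'] = x
item['_junk'] = {}
item['_junk']['b'] = y
item['_junk']['c'] = z

# Equivalent: build the nested dict in one assignment.
item2 = {'a': x, '_junk': {'b': y, 'c': z}}

print(item)
```

Building the nested dict in one assignment avoids the intermediate empty dict when all values are known at the same point in the parser.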

Related

Dynamic list creation and append values - python

I have input data parsed from JSON. Printing it by key (tablename, columnname, columnlen) gives output like this:
data = ('tablename', 'abc.xyz'),('tablename','abc.xyz'),('columnname', 'xxx'),('columnname', 'yyy'),('columnlen', 55)
data[0] =
abc.xyz
abc.xyz
abc.xyz
data[1] =
xxx
yyy
zzz
data[2] =
20
30
60
data[0] represents tablename
data[1] represents columnname
data[2] represents column length
The code below creates the empty lists manually:
TableName_list = []
ColumnName_list = []
ColumnLen_list = []
for x in data:
    # x is a (key, value) pair; append the value to the matching list
    if x[0] == 'tablename':
        TableName_list.append(x[1])
    elif x[0] == 'columnname':
        ColumnName_list.append(x[1])
    elif x[0] == 'columnlen':
        ColumnLen_list.append(x[1])
I need to create an empty list dynamically for each field (tablename, columnname, columnlen), append the data to it, and collect the lists in a dictionary.
The output I need is a dictionary like this:
dict = {'TableName':TableName_list,'ColumnName':ColumnName_list,'ColumnLen':ColumnLen_list }
This is probably most easily done with a defaultdict:
from collections import defaultdict
dd = defaultdict(list)
data = [
('tablename', 'abc.xyz'),('tablename','abc.xyz'),
('columnname', 'xxx'),('columnname', 'yyy'),
('columnlen', 55),('columnlen', 30)
]
for d in data:
    dd[d[0]].append(d[1])
Output:
defaultdict(<class 'list'>, {
'tablename': ['abc.xyz', 'abc.xyz'],
'columnname': ['xxx', 'yyy'],
'columnlen': [55, 30]
})
If the case of the names in the result is important, you could use a dictionary to translate the incoming names:
aliases = { 'tablename' : 'TableName', 'columnname' : 'ColumnName', 'columnlen' : 'ColumnLen' }
for d in data:
    dd[aliases[d[0]]].append(d[1])
Output:
defaultdict(<class 'list'>, {
'TableName': ['abc.xyz', 'abc.xyz'],
'ColumnName': ['xxx', 'yyy'],
'ColumnLen': [55, 30]
})
I suggest building the dictionary directly, something like this:
out_dict = {}
for x in data:
    key = x[0]
    if key in out_dict:
        # list.append returns None, so do not assign its result back
        out_dict[key].append(x[1])
    else:
        out_dict[key] = [x[1]]
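The same grouping can also be written without the if/else by using dict.setdefault. A minimal sketch, reusing the sample data from the defaultdict answer:

```python
data = [
    ('tablename', 'abc.xyz'), ('tablename', 'abc.xyz'),
    ('columnname', 'xxx'), ('columnname', 'yyy'),
    ('columnlen', 55), ('columnlen', 30),
]

out_dict = {}
for key, value in data:
    # setdefault returns the list already stored under key,
    # or stores a new empty list first if the key is missing
    out_dict.setdefault(key, []).append(value)

print(out_dict)
```

The result is a plain dict, which can be convenient if the defaultdict wrapper is not wanted in the output.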
Using pandas:
import pandas as pd
pd.DataFrame(data).groupby(0)[1].apply(list).to_dict()
Output:
{'columnlen': [55, 30],
'columnname': ['xxx', 'yyy'],
'tablename': ['abc.xyz', 'abc.xyz']}

python set dict not exist, how can I handle it?

import SimpleITK as sitk
reader = sitk.ImageFileReader()
reader.SetFileName(filePath)
reader.ReadImageInformation()
img = reader.Execute()
meta = {
    "a": reader.GetMetaData('0'),  # <- should give 'undefined' if the key does not exist
    "b": reader.GetMetaData('1'),
    "c": reader.GetMetaData('2'),
}
I am a JavaScript developer.
I want to build the meta dict, but I get the error "Key '0' does not exist".
The key may not exist. How can I set meta in that case?
From the docs, the ImageFileReader class has a HasMetaDataKey() boolean function. So you should be able to do something like this:
meta = {
    "a": reader.GetMetaData('0') if reader.HasMetaDataKey('0') else 'undefined',
    "b": reader.GetMetaData('1') if reader.HasMetaDataKey('1') else 'undefined',
    "c": reader.GetMetaData('2') if reader.HasMetaDataKey('2') else 'undefined',
}
And you could do it with a single dict comprehension:
meta = {m: reader.GetMetaData(k) if reader.HasMetaDataKey(k) else 'undefined'
        for m, k in zip(['a', 'b', 'c'], ['0', '1', '2'])}
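The check-then-read pattern can also be factored into a small helper. The sketch below is only illustrative: get_meta is a hypothetical helper name, and FakeReader is a made-up stand-in that mimics just the two reader methods used above, so the snippet runs without SimpleITK or an image file.

```python
class FakeReader:
    """Hypothetical stand-in for the image reader, for illustration only."""

    def __init__(self, metadata):
        self._metadata = metadata

    def HasMetaDataKey(self, key):
        return key in self._metadata

    def GetMetaData(self, key):
        # Mimic the error from the question when the key is missing
        if key not in self._metadata:
            raise RuntimeError("Key '%s' does not exist" % key)
        return self._metadata[key]


def get_meta(reader, key, default='undefined'):
    """Return the metadata value for key, or default if the key is absent."""
    return reader.GetMetaData(key) if reader.HasMetaDataKey(key) else default


reader = FakeReader({'0': 'first', '2': 'third'})
meta = {name: get_meta(reader, key)
        for name, key in (('a', '0'), ('b', '1'), ('c', '2'))}
print(meta)
```

With a real reader, only the FakeReader lines would be replaced; the helper itself relies on nothing beyond HasMetaDataKey and GetMetaData.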
You can use a defaultdict:
from collections import defaultdict
d = defaultdict(lambda: 'xx')  # <- whatever default value you want
d[10]  # key is missing, so the default 'xx' is stored for it automatically
d[11] = 12  # value 12 assigned
# read values with d[key]; d.get(key) returns None for a missing key instead of the default
print(d[10])  # prints xx
print(d)
The last line prints:
defaultdict(<function <lambda> at 0x000001557B4B03A8>, {10: 'xx', 11: 12})
You get the idea; modify it according to your needs.

How can i merge several dictionaries by a for loop?

This is my code:
a = int(input())
for i in range(a):
    b = input()
    b = b.split(".")  # creating a list
    #print(b)
    b[1] = b[1].lower()
    b[1] = b[1].capitalize()
    a = b[1]
    #print(b[1])
    #print(b[0] , b [1] , b[2])
    dic = {}
    dic_final = {}
    dic = {b[1] : {'name':b[0] ,'lan':b[2] }}
    dic_final.update(dic)
    del(dic)
print(dic_final)
My input :
2
f.sara.python
m.john.java
The output has to look like this:
{ 'sara':{'gender':'f' , 'lan':'python'} , 'john':{'gender':'m' , 'lan':'java'}}
But I always get only the last item I entered:
{'john':{'gender':'m' , 'lan':'java'}}
How can I solve this and get a dictionary like the one below?
{ 'sara':{'gender':'f' , 'lan':'python'} , 'john':{'gender':'m' , 'lan':'java'}}
This is a clear solution that I came up with:
num_sample = int(input("how many test cases?: "))
final = {}
for case in range(num_sample):
    new_case = input("insert new case: ")
    gender, name, lan = new_case.split(".")
    info = {"gender": gender, "lan": lan}
    final[name] = info
# access final from here
Create the dictionary once, before the for loop, instead of creating a new one inside the loop on every iteration. Also, your code stores the gender under the key 'name'.
a = int(input())
dic_final = {}
for i in range(a):
    b = input()
    b = b.split(".")
    b[1] = b[1].lower()
    b[1] = b[1].capitalize()
    a = b[1]
    dic = {b[1] : {'gender':b[0] ,'lan':b[2] }}
    dic_final.update(dic)
    del(dic)
print(dic_final)
Output:
2
f.sara.python
m.john.java
{'Sara': {'gender': 'f', 'lan': 'python'}, 'John': {'gender': 'm', 'lan': 'java'}}
The reason you are getting only the last entry is that you re-initialise the dictionary on every iteration.
All you have to do is create dic_final once, before the loop.
You're overwriting the dic_final dictionary in the for loop each time.
This line is causing the issue: dic_final={}
Solution:
Add this line before your for loop and remove the declaration inside it.
dic_final = dict()
Better coding style:
a = int(input())
dic_final = dict()
for i in range(a):
    b = input()
    gender, name, lan = b.split(".")
    name = name.capitalize()
    dic_final.update({name : {"gender": gender, "lan": lan}})
print(dic_final)
Output:
2
f.sara.python
m.john.java
{'Sara': {'gender': 'f', 'lan': 'python'}, 'John': {'gender': 'm', 'lan': 'java'}}
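To try the loop without typing input by hand, the same logic can be wrapped in a function that takes the lines as a list. This is a sketch only; parse_records is a hypothetical name that does not appear in the answers above.

```python
def parse_records(lines):
    """Build {Name: {'gender': ..., 'lan': ...}} from 'gender.name.lan' strings."""
    result = {}  # created once, outside the loop, so entries accumulate
    for line in lines:
        gender, name, lan = line.split(".")
        result[name.capitalize()] = {"gender": gender, "lan": lan}
    return result


records = parse_records(["f.sara.python", "m.john.java"])
print(records)
```

Because the dictionary is created before the loop starts, each iteration adds a key instead of replacing the whole dictionary.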

Compare difference in two json files and output the difference

I am trying to compare two JSON files and output the list of r_id values which are present in file a but not in file b.
The JSON files which I am trying to compare:
File a =
{"r_id":"123","RefNumber":"2341234131","amount":"22.99"},
{"r_id":"345","RefNumber":"2341234131","amount":"22.99"},
{"r_id":"678","RefNumber":"2341234131","amount":"22.99"}
File b =
{"name" : "James", "id" : "123", "class" : "1A"},
{"name" : "Sam","id" : "345", "class" : "1A"},
{"name" : "Jen","id" : "005", "class" : "1A"}
The comparison should be based on the ids in both files. I expect the following output in the difference file:
{"r_id":"678","RefNumber":"2341234131","amount":"22.99"}
This will work even if the ids are not in order and the JSON files don't have the same number of items.
import json
with open("json_a.json","r") as first, open("json_b.json","r") as second:
    # keep only the first key/value pair of each object in file a: ('r_id', ...)
    b = json.load(first, object_pairs_hook=lambda x: x[0])
    # keep only the second key/value pair of each object in file b: ('id', ...)
    c = json.load(second, object_pairs_hook=lambda x: x[1])
b = [_[1] for _ in b]
c = [_[1] for _ in c]
with open("json_a.json","r") as first:
    for each_line in json.load(first):
        for uniq_id in list(set(b).difference(c)):
            if each_line['r_id'] == uniq_id:
                print(each_line)
Another approach:
import json
with open("json_a.json","r") as first, open("json_b.json","r") as second:
    b = json.load(first)
    c = json.load(second)
b_ids = [x['r_id'] for x in b]
c_ids = [x['id'] for x in c]
for each_item in b:
    for uniq_id in list(set(b_ids).difference(c_ids)):
        if each_item['r_id'] == uniq_id:
            print(each_item)
Write to file:
# Serializing json
json_object = json.dumps(each_item)
# Writing to sample.json
with open("sample.json", "w") as outfile:
outfile.write(json_object)
More details about file writing options can be found here.
Try this Code :
import json
a = ['{"r_id":"123","RefNumber":"2341234131","amount":"22.99"}',
'{"r_id":"345","RefNumber":"2341234131","amount":"22.99"}',
'{"r_id":"678","RefNumber":"2341234131","amount":"22.99"}'
]
b = [ '{"name" : "James", "id" : "123", "class" : "1A"}',
'{"name" : "Sam", "id" : "345", "class" : "1A"}',
'{"name" : "Jen", "id" : "005", "class" : "1A"}'
]
for i in range(len(a)):
    y = json.loads(a[i])
    z = json.loads(b[i])
    if y["r_id"] != z["id"]:
        print(a[i])
Output :
{"r_id":"678","RefNumber":"2341234131","amount":"22.99"}
Before working with the JSON files, each file should be in the format below (a JSON array):
[{"r_id":"123","RefNumber":"2341234131","amount":"22.99"},
{"r_id":"345","RefNumber":"2341234131","amount":"22.99"},
{"r_id":"678","RefNumber":"2341234131","amount":"22.99"}
]
Try with this code(with files):
import json
with open('file1.json','r') as a:
    data1 = a.read()
obj1 = json.loads(data1)
with open('file2.json','r') as a:
    data2 = a.read()
obj2 = json.loads(data2)
count = 0
for i in obj1:
    a = obj2[count]
    if i["r_id"] != a["id"]:
        print(i)
    count = count + 1
Output is same as above.
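The positional loops above compare record i of one file with record i of the other, so they only give the right answer when both files list their records in the same order. A sketch that avoids that assumption by collecting the ids of file b into a set first (the records are inlined here so the snippet runs without files; with real files they would come from json.load):

```python
import json

# Sample records from the question, inlined instead of read from disk
file_a = [
    {"r_id": "123", "RefNumber": "2341234131", "amount": "22.99"},
    {"r_id": "345", "RefNumber": "2341234131", "amount": "22.99"},
    {"r_id": "678", "RefNumber": "2341234131", "amount": "22.99"},
]
file_b = [
    {"name": "James", "id": "123", "class": "1A"},
    {"name": "Sam", "id": "345", "class": "1A"},
    {"name": "Jen", "id": "005", "class": "1A"},
]

# Set membership tests do not depend on record order or file length
ids_in_b = {row["id"] for row in file_b}
difference = [row for row in file_a if row["r_id"] not in ids_in_b]

# Serialise all differing records at once, e.g. for writing to a file
print(json.dumps(difference))
```

Collecting the matches in a list also means every differing record can be written out, not just the last one found.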

Get information from different dict by dict name

I have a data/character_data.py:
CHARACTER_A = { 1: {"level": 1, "name":"Ann", "skill_level" : 1},
2: {"level": 2, "name":"Tom", "skill_level" : 1}}
CHARACTER_B = { 1: {"level": 1, "name":"Kai", "skill_level" : 1},
2: {"level": 2, "name":"Mel", "skill_level" : 1}}
In main.py, I can do this:
from data import character_data as character_data
print character_data.CHARACTER_A[1]["name"]
>>> output: Ann
print character_data.CHARACTER_B[2]["name"]
>>> output: Mel
How do I achieve this?
from data import character_data as character_data
character_type = "CHARACTER_A"
character_id = 1
print character_data.character_type[character_id]["name"]
>>> correct output should be: Ann
I get an AttributeError when I try to use character_type as "CHARACTER_A".
How about this
In [38]: from data import character_data as character_data
In [39]: character_type = "CHARACTER_A"
In [40]: character_id = 1
In [41]: getattr(character_data, character_type)[character_id]["name"]
Out[41]: 'Ann'
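getattr also accepts a default value, which helps when the type name may be wrong. A minimal sketch in Python 3, where a SimpleNamespace stands in for the data.character_data module (an assumption made so the snippet is self-contained) and lookup_name is a hypothetical helper:

```python
from types import SimpleNamespace

# Stand-in for the data.character_data module, with the same attribute names
character_data = SimpleNamespace(
    CHARACTER_A={1: {"level": 1, "name": "Ann", "skill_level": 1},
                 2: {"level": 2, "name": "Tom", "skill_level": 1}},
    CHARACTER_B={1: {"level": 1, "name": "Kai", "skill_level": 1},
                 2: {"level": 2, "name": "Mel", "skill_level": 1}},
)


def lookup_name(character_type, character_id):
    """Return the name for the given type and id, with a clear error for bad types."""
    # The third argument is returned instead of raising AttributeError
    table = getattr(character_data, character_type, None)
    if table is None:
        raise ValueError("unknown character type: %r" % character_type)
    return table[character_id]["name"]


name = lookup_name("CHARACTER_A", 1)
print(name)
```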
You can use locals():
>>> from data.character_data import CHARACTER_A, CHARACTER_B
>>> character_id = 1
>>> character_type = "CHARACTER_A"
>>> locals()[character_type][character_id]["name"]
Ann
Though, think about merging CHARACTER_A and CHARACTER_B into one dict and access this dict instead of locals().
Also, see Dive into Python: locals and globals.
You need to structure your data properly.
characters = {}
characters['type_a'] = {1: {"level": 1, "name":"Ann", "skill_level" : 1},
2: {"level": 2, "name":"Tom", "skill_level" : 1}}
characters['type_b'] = ...
Or, the better solution is to create your own "character" type, and use that instead:
class Character(object):
    def __init__(self, type, level, name, skill):
        self.type = type
        self.level = level
        self.name = name
        self.skill = skill

characters = []
characters.append(Character('A',1,'Ann',1))
characters.append(Character('A',2,'Tom',1))
characters.append(Character('B',1,'Kai',1)) # and so on
Then,
all_type_a = []
looking_for = 'A'
for i in characters:
    if i.type == looking_for:
        all_type_a.append(i)
Or, the shorter way:
all_type_a = [i for i in characters if i.type == looking_for]
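If lookups by type happen often, the characters can be grouped once instead of being filtered each time. A sketch under the assumption of Python 3.7+, using a dataclass in place of the hand-written class; the by_type name is illustrative only:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Character:
    type: str
    level: int
    name: str
    skill: int


characters = [
    Character('A', 1, 'Ann', 1),
    Character('A', 2, 'Tom', 1),
    Character('B', 1, 'Kai', 1),
    Character('B', 2, 'Mel', 1),
]

# Group once; each later lookup by type is a single dict access
by_type = defaultdict(list)
for character in characters:
    by_type[character.type].append(character)

names_a = [c.name for c in by_type['A']]
print(names_a)
```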
