Parsing through a really long text - python

I'm a complete beginner to python, but I am making a web scraper as a project.
I'm using a jupyter notebook, beautifulsoup, and lxml.
I managed to grab the text that contains all the information I need, but now I'm lost on what to do.
I want to obtain specific pieces of data like longitude, latitude, siteid, direction (North, South, etc.), and I want to download the photos and rename them. I need to do this for all 41 locations.
If anyone could suggest any packages or methods I would really appreciate it! Thank you!
Here's a small portion of the text I grabbed (pattern repeats 41 times):
{
"count": 41,
"message": "success",
"results": [
{
"protocol": "land_covers",
"measuredDate": "2020-06-13",
"createDate": "2020-06-13T16:35:04",
"updateDate": "2020-06-15T14:00:10",
"publishDate": "2020-07-17T21:06:31",
"organizationId": 17043304,
"organizationName": "United States of America Citizen Science",
"siteId": 202689,
"siteName": "18TWK294769",
"countryName": null,
"countryCode": null,
"latitude": xx.xxx(edited),
"longitude": xx.xxx(edited),
"elevation": 25.4,
"pid": 163672280,
"data": {
"landcoversDownwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682247/original.jpg",
"landcoversEastExtraData": "(source: app, (compassData.horizon: -14.32171587255965))",
"landcoversEastPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682242/original.jpg",
"landcoversMucCode": null,
"landcoversUpwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682246/original.jpg",
"landcoversEastCaption": "",
"landcoversMeasurementLatitude": xx.xxx(edited),
"landcoversWestClassifications": null,
"landcoversNorthCaption": "",
"landcoversNorthExtraData": "(source: app, (compassData.horizon: -10.817734330181267))",
"landcoversDataSource": "GLOBE Observer App",
"landcoversDryGround": true,
"landcoversSouthClassifications": null,
"landcoversWestCaption": "",
"landcoversNorthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682241/original.jpg",
"landcoversUpwardCaption": "",
"landcoversDownwardExtraData": "(source: app, (compassData.horizon: -84.48900393488086))",
"landcoversEastClassifications": null,
"landcoversMucDetails": "",
"landcoversMeasuredAt": "2020-06-13T15:12:00",
"landcoversDownwardCaption": "",
"landcoversSouthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682243/original.jpg",
"landcoversMuddy": false,
"landcoversWestPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682245/original.jpg",
"landcoversStandingWater": false,
"landcoversLeavesOnTrees": true,
"landcoversUserid": 67150810,
"landcoversSouthExtraData": "(source: app, (compassData.horizon: -14.872806403121302))",
"landcoversSouthCaption": "",
"landcoversRainingSnowing": false,
"landcoversUpwardExtraData": "(source: app, (compassData.horizon: 89.09211989270894))",
"landcoversMeasurementElevation": 24.1,
"landcoversWestExtraData": "(source: app, (compassData.horizon: -15.47334477111039))",
"landcoversLandCoverId": 32043,
"landcoversMeasurementLongitude": xx.xxx(edited),
"landcoversMucDescription": null,
"landcoversSnowIce": false,
"landcoversNorthClassifications": null,
"landcoversFieldNotes": "(none)"
}
},
{
"protocol": "land_covers",
"measuredDate": "2020-06-13",
"createDate": "2020-06-13T16:35:04",
"updateDate": "2020-06-15T14:00:10",
"publishDate": "2020-07-17T21:06:31",
"organizationId": 17043304,
"organizationName": "United States of America Citizen Science",
"siteId": 202689,
"siteName": "18TWK294769",
"countryName": null,
"countryCode": null,
"latitude": xx.xxx(edited),
"longitude": xx.xxx(edited),
"elevation": 25.4,
"pid": 163672280,
"data": {
"landcoversDownwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682240/original.jpg",
"landcoversEastExtraData": "(source: app, (compassData.horizon: -6.06710116543897))",
"landcoversEastPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682235/original.jpg",
"landcoversMucCode": null,
"landcoversUpwardPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682239/original.jpg",
"landcoversEastCaption": "",
"landcoversMeasurementLatitude": xx.xxx(edited),
"landcoversWestClassifications": null,
"landcoversNorthCaption": "",
"landcoversNorthExtraData": "(source: app, (compassData.horizon: -9.199031748908894))",
"landcoversDataSource": "GLOBE Observer App",
"landcoversDryGround": true,
"landcoversSouthClassifications": null,
"landcoversWestCaption": "",
"landcoversNorthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682233/original.jpg",
"landcoversUpwardCaption": "",
"landcoversDownwardExtraData": "(source: app, (compassData.horizon: -88.86569321651771))",
"landcoversEastClassifications": null,
"landcoversMucDetails": "",
"landcoversMeasuredAt": "2020-06-13T15:07:00",
"landcoversDownwardCaption": "",
"landcoversSouthPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682236/original.jpg",
"landcoversMuddy": false,
"landcoversWestPhotoUrl": "https://data.globe.gov/system/photos/2020/06/13/1682237/original.jpg",
"landcoversStandingWater": false,
"landcoversLeavesOnTrees": true,
"landcoversUserid": 67150810,
"landcoversSouthExtraData": "(source: app, (compassData.horizon: -11.615041431350335))",
"landcoversSouthCaption": "",
"landcoversRainingSnowing": false,
"landcoversUpwardExtraData": "(source: app, (compassData.horizon: 86.6284079864236))",
"landcoversMeasurementElevation": 24,
"landcoversWestExtraData": "(source: app, (compassData.horizon: -9.251774266832626))",
"landcoversLandCoverId": 32042,
"landcoversMeasurementLongitude": xx.xxx(edited),
"landcoversMucDescription": null,
"landcoversSnowIce": false,
"landcoversNorthClassifications": null,
"landcoversFieldNotes": "(none)"
}
},

It would help to see some code. Having said that, as has been pointed out, the built-in json library is what you need. This is JSON-formatted output; see here for an introduction to the format.
Say, for the sake of argument, your output is stored in a variable called data. You can convert this JSON data to a dictionary.
Coding Example
import json
data_dict = json.loads(data)  # data is a string here, so use json.loads; json.load is for file objects
What json.loads does is take a JSON string and convert it into a Python dictionary (json.load does the same for a file object). It uses a conversion table to map JSON types to Python types; other JSON values get converted to other Python object types (arrays become lists, null becomes None, and so on). See here for that table.
Now you have a Python dictionary you can access the data from. So let's go through longitude, latitude, siteId, and direction (North, South, etc.). I see there's an opening '[' without the corresponding ']', so I can only assume the list has 41 items in it, as you describe; I'll take the first result to start. You can always loop through it to get all 41 results (see the sketch after the code below).
longitude = data_dict['results'][0]['longitude']
latitude = data_dict['results'][0]['latitude']
site_id = data_dict['results'][0]['siteId']
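To handle all 41 locations and download the photos, you can loop over data_dict['results']. Here is a rough sketch (assuming the requests library is installed; the file-naming scheme is just an example, adjust it as you like):
import requests

for i, result in enumerate(data_dict['results']):
    site_id = result['siteId']
    latitude = result['latitude']
    longitude = result['longitude']
    # the direction photos live under the nested "data" dictionary
    for direction in ('North', 'South', 'East', 'West', 'Upward', 'Downward'):
        url = result['data'].get(f'landcovers{direction}PhotoUrl')
        if not url:
            continue
        photo = requests.get(url)
        with open(f"{site_id}_{i}_{direction}.jpg", 'wb') as f:
            f.write(photo.content)
The index i is included in the filename because several results can share the same siteId.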
Tips
I always use Jupyter notebooks as a quick way to try grabbing the specific data I want from JSON objects; it can take a few goes to access the right part. That way, when I write my variables, I know I'm getting the data I want back from the JSON object. JSON objects can be heavily nested and hard to follow.


How to parse nested JSON object?

I am working on a new project in HubSpot that returns nested JSON like the sample below. I am trying to access the associated contact's id, but am struggling to reference it correctly (the id I am looking for is the value '201' in the example below). I've put together this script, but it only returns the entire associations portion of the JSON, and I only want the id. How do I reference the id correctly?
Here is the output from the script:
{'contacts': {'paging': None, 'results': [{'id': '201', 'type': 'ticket_to_contact'}]}}
And here is the script I put together:
import hubspot
from pprint import pprint
from hubspot.crm.tickets import ApiException  # needed for the except clause below

client = hubspot.Client.create(api_key="API_KEY")
try:
    api_response = client.crm.tickets.basic_api.get_page(limit=2, associations=["contacts"], archived=False)
    for x in range(2):
        pprint(api_response.results[x].associations)
except ApiException as e:
    print("Exception when calling basic_api->get_page: %s\n" % e)
Here is what the full JSON looks like ('contacts' property shortened for readability):
{
"results": [
{
"id": "34018123",
"properties": {
"content": "Hi xxxxx,\r\n\r\nCan you clarify on how the blocking of script happens? Is it because of any CSP (or) the script will decide run time for every URL’s getting triggered from browser?\r\n\r\nRegards,\r\nLogan",
"createdate": "2019-07-03T04:20:12.366Z",
"hs_lastmodifieddate": "2020-12-09T01:16:12.974Z",
"hs_object_id": "34018123",
"hs_pipeline": "0",
"hs_pipeline_stage": "4",
"hs_ticket_category": null,
"hs_ticket_priority": null,
"subject": "RE: call followup"
},
"createdAt": "2019-07-03T04:20:12.366Z",
"updatedAt": "2020-12-09T01:16:12.974Z",
"archived": false
},
{
"id": "34018892",
"properties": {
"content": "Hi Guys,\r\n\r\nI see that we were placed back on the staging and then removed again.",
"createdate": "2019-07-03T07:59:10.606Z",
"hs_lastmodifieddate": "2021-12-17T09:04:46.316Z",
"hs_object_id": "34018892",
"hs_pipeline": "0",
"hs_pipeline_stage": "3",
"hs_ticket_category": null,
"hs_ticket_priority": null,
"subject": "Re: Issue due to server"
},
"createdAt": "2019-07-03T07:59:10.606Z",
"updatedAt": "2021-12-17T09:04:46.316Z",
"archived": false,
"associations": {
"contacts": {
"results": [
{
"id": "201",
"type": "ticket_to_contact"
}
]
}
}
}
],
"paging": {
"next": {
"after": "35406270",
"link": "https://api.hubapi.com/crm/v3/objects/tickets?associations=contacts&archived=false&hs_static_app=developer-docs-ui&limit=2&after=35406270&hs_static_app_version=1.3488"
}
}
}
You can do api_response.results[x].associations["contacts"]["results"][0]["id"].
Sorted this out, posting in case anyone else is struggling with the response from the HubSpot v3 Api. The response schema for this call is:
Response schema type: Object
String results[].id
Object results[].properties
String results[].createdAt
String results[].updatedAt
Boolean results[].archived
String results[].archivedAt
Object results[].associations
Object paging
Object paging.next
String paging.next.after
String paging.next.link
So to access the id of the contact associated with the ticket, you need to reference it using this notation:
api_response.results[1].associations["contacts"].results[0].id
Notes:
results[x] - reference the result by its index
associations["contacts"] - associations is a dictionary object, so you can access the contacts item by its name
associations["contacts"].results is a list - reference it by index []
id - is a string
In my case the type was ModelProperty or CollectionResponseProperty, and I couldn't get to a dict anyhow.
For the record this got me to go through the results.
for result in list(api_response.results):
    ID = result.id
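A small sketch building on that loop (assuming the same api_response as above; tickets without an associated contact may have associations set to None or without a "contacts" entry, so guard against that):
contact_ids = []
for result in api_response.results:
    if result.associations and "contacts" in result.associations:
        for contact in result.associations["contacts"].results:
            contact_ids.append(contact.id)
print(contact_ids)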

Deleting duplicates from List of dict elements (created from Twitter json objects) [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
I have downloaded Twitter user objects. This is an example of one object:
{
"id": 6253282,
"id_str": "6253282",
"name": "Twitter API",
"screen_name": "TwitterAPI",
"location": "San Francisco, CA",
"profile_location": null,
"description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
"url": "https:\/\/t.co\/8IkCzCDr19",
"entities": {
"url": {
"urls": [{
"url": "https:\/\/t.co\/8IkCzCDr19",
"expanded_url": "https:\/\/developer.twitter.com",
"display_url": "developer.twitter.com",
"indices": [
0,
23
]
}]
},
"description": {
"urls": []
}
},
"protected": false,
"followers_count": 6133636,
"friends_count": 12,
"listed_count": 12936,
"created_at": "Wed May 23 06:01:13 +0000 2007",
"favourites_count": 31,
"utc_offset": null,
"time_zone": null,
"geo_enabled": null,
"verified": true,
"statuses_count": 3656,
"lang": null,
"contributors_enabled": null,
"is_translator": null,
"is_translation_enabled": null,
"profile_background_color": null,
"profile_background_image_url": null,
"profile_background_image_url_https": null,
"profile_background_tile": null,
"profile_image_url": null,
"profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
"profile_banner_url": null,
"profile_link_color": null,
"profile_sidebar_border_color": null,
"profile_sidebar_fill_color": null,
"profile_text_color": null,
"profile_use_background_image": null,
"has_extended_profile": null,
"default_profile": false,
"default_profile_image": false,
"following": null,
"follow_request_sent": null,
"notifications": null,
"translator_type": null
}
but somehow it has many duplicates; maybe the input file had duplicated values.
This is the pattern of the downloaded Twitter file, which I named rawjson:
{
user-object
}{
user-object
}{
user-object
}
So I ended up with a 16 GB file of users with repeated values. I need to delete the duplicated users.
This is what I have done so far
def twitterToListJsonMethodTwo(self, rawjson, twitterToListJson):
    # Delete old file
    if os.path.exists(twitterToListJson):
        try:
            os.remove(twitterToListJson)
        except OSError:
            pass
    counter = 1
    objc = 1
    with open(rawjson, encoding='utf8') as fin, open(twitterToListJson, 'w', encoding='utf8') as fout:
        for line in fin:
            if line.find('}{') != -1 and len(line) == 3:
                objc = objc + 1
                fout.write(line.replace('}{', '},\n{'))
            else:
                fout.write(line)
            counter = counter + 1
            # print(counter)
    print("Process Complete: Twitter object to Total lines: ", counter)
    self.twitterToListJsonMethodOne(twitterToListJson)
and the output sample file now looks like this:
[
{user-object},
{user-object},
{user-object}
]
Each user-object is a dict, but I cannot find a way to remove the duplicates; all of the tutorials/solutions I found are only for small objects and small lists. I am not very good with Python, but I need an efficient solution, as the file is too big to fit comfortably in memory.
Each user-object looks like the example above, with a unique id and screen_name.
To process huge JSON datasets, especially long lists of objects, it's better to use JSON streaming from https://github.com/daggaz/json-stream to read the user objects one by one, then add them to your results if this user was not encountered before.
Example:
import json_stream

unique_users = []
seen_users = set()
with open('input.json') as f:
    js = json_stream.load(f)
    for us in js:
        user = dict(us.items())
        if user['id'] not in seen_users:
            unique_users.append(user)
            seen_users.add(user['id'])
The reason for user = dict(us.items()) is that if we go looking for the id in the object via the stream, we can't backtrack to get the whole object any more. So we need to "render" out every user object and then check the id.
You could modify a merge sort and just delete duplicates in O(nlogn).
Use ijson (a minimal sketch follows below).
Create a set that will hold the item ids.
If an id is already in the set, drop the item; otherwise, collect it and add its id to the set.
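A minimal sketch with ijson (assuming the converted file is a single JSON array of user objects, i.e. the [ {...}, {...} ] layout shown above; the filename is just a placeholder):
import ijson

unique_users = []
seen_ids = set()
with open('twitterToListJson.json', 'rb') as f:
    for user in ijson.items(f, 'item'):  # streams one user object at a time
        if user['id'] not in seen_ids:
            seen_ids.add(user['id'])
            unique_users.append(user)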
Convert the dictionaries into tuples using the dict items() method, turning the list of dictionaries into a list of tuples. Then you can run set() on the list to get rid of duplicates, because tuples are hashable. When using items() on each dict, remember to wrap the result in tuple(). Sample code:
data = (tuple(d.items()) for d in twitter_data)
This should solve the issue of duplicate dictionaries if the dictionaries are identical on every key-value pair.
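Completing that idea, a tiny sketch (note this only works when every value in the dicts is hashable, i.e. flat dictionaries; nested values such as the Twitter "entities" object would need to be flattened or dropped first):
flat_users = [
    {"id": 6253282, "screen_name": "TwitterAPI"},
    {"id": 6253282, "screen_name": "TwitterAPI"},
]
unique_users = [dict(t) for t in set(tuple(d.items()) for d in flat_users)]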
I did not find any useful and memory-efficient solution, so I downloaded the data again.
One possible solution was (step by step):
1- Make the input data unique (the file I used for downloading the data).
2- Then read the JSON file and copy the elements to another file one by one, deleting processed values from the input file to avoid duplication.
3- But that would not be memory efficient, and it is too much work compared to downloading the data again.
In the future, if someone comes across this problem: you are better off downloading the data again.
vaizki's answer is good and may be useful for someone, but I could not install the package: pip did not find it, and conda does not work well here (I am in China; my university network or VPN may be the problem).

How do I insert another nested document using pymongo?

Ahoy,
I have a document that looks like this:
{"_id": "123abc456def",
"name": "John Smith",
"address": [
{"street": "First St.", "date": "yesterday", "last_updated": "two days ago"}
],
"age": 123}
When I try to add another street document using $push, it errors out with:
pymongo.errors.WriteError: The field 'address' must be an array but is of type object in document {_id: ObjectId('6049e88657e43d8801197c72')}
Code I'm using:
mydb3 = myclient["catalogue"]
mycolALL = mydb3["locations"]
query = {"charID": 0}
newvalue = {"$push": {"address": {"street": "test123", "date": "test123", "last_updated": "now123"}}}
mycolALL.update_one(query, newvalue)
Not making an address book or anything, just edited it so it makes a bit more sense to anyone without context.
My desired output would be that the document would look like this:
{"_id": "123abc456def",
"name": "John Smith",
"address": [
{"street": "First St.", "date": "yesterday", "last_updated": "two days ago"},
{"street": "test123", "date": "test123", "last_updated": "now123"}
],
"age": 123}
Normally I can google my way to an answer that makes the coin drop and JACKPOT! but this time I'm outta luck.
$set = it just changes the existing document, effectively replacing it, which is not what I want.
$addToSet = for arrays only; error message: "pymongo.errors.WriteError: Cannot apply $addToSet to non-array field. Field named 'address' has non-array type object"
Anyone that can help?
Just a guess, but are you sure you're looking at the right data/database? Based on the data you posted, your update_one() won't update that record because it doesn't match your filter {"charID": 0}.
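For example, a quick sanity check (assuming the same client and collection names as in the question):
from bson import ObjectId

print(mycolALL.find_one({"charID": 0}))  # None here means the $push filter matched nothing

# inspect the document from the error message instead
doc = mycolALL.find_one({"_id": ObjectId("6049e88657e43d8801197c72")})
print(type(doc["address"]))  # $push/$addToSet need this to be a list, not a dict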

Group_by and filter Django

I'm trying to make a query in Django, but I can't get the output I want. I want to use group by and filter in a Django query. I tried using annotate by looking at some answers on Stack Overflow and other sites, but couldn't make it work. Here's my response after using filter:
[
{
"id": 11667,
"rate_id": "FIT-PIT2",
"name": "FIT-PIT111",
"pms_room": null,
"description": null,
"checkin": "",
"checkout": "",
"connected_room": null
},
{
"id": 11698,
"rate_id": "343",
"name": "dfggffd",
"pms_room": "5BZ",
"description": null,
"checkin": null,
"checkout": null,
"connected_room": null
},
{
"id": 11699,
"rate_id": "343",
"name": "dfggffd",
"pms_room": "6BZ",
"description": null,
"checkin": null,
"checkout": null,
"connected_room": null
}]
What I want to do is group all the pms_rooms that have the same rate_id, roughly something like this:
{'343':['5BZ','6BZ'],'FIT-PIT2':[null]}
I can do it using a dictionary or list, but I want to do it directly from the query, like table.objects.filter(condition).group_by('rate_id') - the SQL equivalent of SELECT *, GROUP_CONCAT('name') FROM table_name WHERE PMS = hotel.pms GROUP BY rate_id. Can somebody please help me out? Thanks.
table.objects.filter(condition).values('rate_id') - check out the docs: https://docs.djangoproject.com/en/3.0/ref/models/querysets/
Since your example mentions GROUP_CONCAT, I'll assume you are using MySQL. Django does not support GROUP_CONCAT natively, but you can try django-mysql, which provides an equivalent database function, GroupConcat. Then you can make a query like this:
table.objects.values('rate_id').annotate(grouped_rooms=GroupConcat('pms_room'))
The result may be like this:
[
{
'rate_id': '343',
'grouped_rooms': '5BZ,6BZ',
},
{
'rate_id': 'FIT-PIT2',
'grouped_rooms': '',
},
...
]
This does not exactly match the format you mentioned in the OP, but you can post-process the result in plain Python to make it what you expected.
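For example, a small post-processing sketch (assuming the annotated queryset from above; GroupConcat produces a comma-separated string):
grouped = table.objects.values('rate_id').annotate(grouped_rooms=GroupConcat('pms_room'))

rooms_by_rate = {}
for row in grouped:
    value = row['grouped_rooms']
    rooms_by_rate[row['rate_id']] = value.split(',') if value else [None]

# rooms_by_rate is now roughly {'343': ['5BZ', '6BZ'], 'FIT-PIT2': [None]}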

Check a dataframe column to see if a bool if True/False, if False, geocode only those values

I am using the geocoder Python API library. I have a pandas dataframe column of boolean True/False based on whether I already have that particular address geocoded or not. Is there a way to modify my existing code so it only geocodes the addresses that aren't geocoded yet?
Right now all it does is print a True statement and then geocodes everything, regardless of the boolean I have. Help please!
Here is another way to put it:
I have a dataframe of Tweets. If a Tweet was geocoded, I have marked that Tweet with True; if it has not been geocoded, with False. What I'm trying to do is: if the column is True, print out that row; else if that row is False, send it into my for loop to be geocoded. I will edit the original post to include an input.
Here is my exisiting code:
for d in tweets2['Exist']:
    if d is True:
        print d
    elif d.any() is False:
        coord = []
        for index, row in tweets2.iterrows():
            print(row['location_x'])
            time.sleep(1.01)
            g = geocoder.osm(row['location_x'])
            geo = g.latlng
            print(geo)
            coord.append(geo)
    else:
        pass
Here is an example of the JSON file as an input:
{
"data": [
{
"user_id": 3299796214,
"features": {
"screen_name": "SaveOurSparrows",
"text": "Details confirmed for inquiry into #INEOS #Derbyshire #Fracking site! \n\nAnti Fracking, #keepitintheground #wesaidno\u2026",
"location": "West Pennine Moors AONB SSSI",
"tweets": 3,
"geo_type": "User location",
"primary_geo": "West Pennine Moors AONB SSSI",
"id": 3299796214,
"name": "SaveOurSparrows",
"Exist": "True"
}
},
{
"user_id": 3302831409,
"features": {
"screen_name": "ProjectLower",
"text": "Cutting down on energy costs is the dream for many #smallbusinesses, but to put ideas into practice isn\u2019t always ea\u2026",
"location": "Manchester",
"tweets": 1,
"geo_type": "User location",
"primary_geo": "Manchester",
"id": 3302831409,
"name": "Project Lower",
"Exist": "False"
}
},
{
"user_id": 2205129714,
"features": {
"screen_name": "AmbCanHaiti",
"text": "Petit-d\u00e9jeuner causerie le mercredi 28 mars 2018 \u00e0 l'h\u00f4tel Montana sur l'\u00e9nergie #micror\u00e9seaux #microgrids\u2026",
"location": "Haiti",
"tweets": 1,
"geo_type": "User location",
"primary_geo": "Haiti",
"id": 2205129714,
"name": "Canada en Ha\u00efti",
"Exist": "False"
}
}
]
}
The simplest way is to walk over your data set, and if there is no coords key, add it:
for data in your_data_set['data']:
    if 'coords' not in data:  # only geocode entries that haven't been geocoded yet
        data['coords'] = geocoder.osm(data['location_x']).latlng
Then, convert it into a dataframe.
If you already have it as a dataframe:
mask = df['coords'] == False
df.loc[mask, 'coords'] = df.loc[mask, 'location_x'].apply(lambda loc: geocoder.osm(loc).latlng)
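Alternatively, a row-wise sketch based on the question's dataframe (assuming tweets2 has an 'Exist' column holding the strings "True"/"False" as in the JSON sample, a 'location_x' column with the address text, and that the result should land in a 'coords' column, which is my own naming):
import time
import geocoder

def maybe_geocode(row):
    if str(row['Exist']) == 'True':
        return row.get('coords')  # already geocoded, keep whatever is there
    time.sleep(1.01)  # stay under the rate limit, as in the question
    return geocoder.osm(row['location_x']).latlng

tweets2['coords'] = tweets2.apply(maybe_geocode, axis=1)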
