Python - Extract information from dataframe (JSON)

I'm a beginner, and it's been a long time since I coded anything :-) I'm using the requests library to retrieve JSON data from the Incapsula (cloud web security service) API to get some stats about a website. What I want in the end is to write the traffic type, timestamp, and visit count to a file to create reports.
The API response looks something like this:
{
"res": 0,
"res_message": "OK",
"visits_timeseries" : [
{
"id":"api.stats.visits_timeseries.human",
"name":"Human visits",
"data":[
[1344247200000,50],
[1344247500000,40],
...
]
},
{
"id":"api.stats.visits_timeseries.bot",
"name":"Bot visits",
"data":[
[1344247200000,10],
[1344247500000,20],
...
]
}
]
}
I'm retrieving the visits_timeseries data like this:
r = requests.post('https://my.incapsula.com/api/stats/v1', params=payload)
reply=r.json()
reply = reply['visits_timeseries']
reply = pandas.DataFrame(reply)
I get the data in this form (date in Unix time, number of visits):
print(reply[['name', 'data']].head())
name data
0 Human visits [[1500163200000, 39], [1499904000000, 73], [14...
1 Bot visits [[1500163200000, 1891], [1499904000000, 1926],...
I don't understand how to extract the fields I want from the dataframe so that only they are written to the Excel file. I would need to split the data field into two columns (date, value), with only the names as the header row.
What would be great is:
        Human Visit    Bot Visit
Date    Value          Value
Date    Value          Value
Date    Value          Value
Thanks for your help!

Well, if it is any help, this is a hardcoded version:
import pandas as pd
reply = {
"res": 0,
"res_message": "OK",
"visits_timeseries" : [
{
"id":"api.stats.visits_timeseries.human",
"name":"Human visits",
"data":[
[1344247200000,50],
[1344247500000,40]
]
},
{
"id":"api.stats.visits_timeseries.bot",
"name":"Bot visits",
"data":[
[1344247200000,10],
[1344247500000,20]
]
}
]
}
human_data = reply['visits_timeseries'][0]['data']
bot_data = reply['visits_timeseries'][1]['data']
df_h = pd.DataFrame(human_data, columns=['Date', 'Human Visit'])
df_b = pd.DataFrame(bot_data, columns=['Date', 'Bot Visit'])
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df_h, df_b], ignore_index=True).fillna(0)
df = df.groupby('Date').sum()
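A more general variant of the same idea, as a minimal sketch: it loops over every entry in visits_timeseries instead of hardcoding two of them, and converts the millisecond timestamps to dates. The helper name timeseries_to_frame is an assumption made for illustration, and the sample dict is trimmed from the response in the question.

```python
import pandas as pd

# Sample shaped like the API response quoted in the question.
reply = {
    "visits_timeseries": [
        {"name": "Human visits", "data": [[1344247200000, 50], [1344247500000, 40]]},
        {"name": "Bot visits", "data": [[1344247200000, 10], [1344247500000, 20]]},
    ]
}


def timeseries_to_frame(series_list):
    """Return one DataFrame indexed by Date with one column per series name."""
    columns = []
    for series in series_list:
        frame = pd.DataFrame(series["data"], columns=["Date", series["name"]])
        # The timestamps are milliseconds since the Unix epoch.
        frame["Date"] = pd.to_datetime(frame["Date"], unit="ms")
        columns.append(frame.set_index("Date")[series["name"]])
    # Align the series on their dates; points missing from one series become 0.
    return pd.concat(columns, axis=1).fillna(0).sort_index()


df = timeseries_to_frame(reply["visits_timeseries"])
print(df)
```

From there, df.to_csv("report.csv") writes the report (the file name is a placeholder); df.to_excel should also work if an Excel writer such as openpyxl is installed.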

Related

How to write a Python script to automate API calls and retrieve a specific part of the result

I have a csv file of schools that contains one school per row for a total of 32091 schools. The name of the school is indicated in the 6th column, and the city code is indicated in the 7th column.
I would like to retrieve the latitude and longitude of the schools by using the geocoding API of the IGN (Institut Géographique National de France) whose documentation in French is here: https://geoservices.ign.fr/documentation/services/api-et-services-ogc/geocodage-beta-20/documentation-technique-de-lapi-de
This API allows me to indicate a string of characters as search terms, and to restrict the search with a filter on the city code. I have tested several queries and the results seem to be satisfactory. For example, for the school "ecole primaire privee st joseph de bonabry" located in Fougères (city code 35115), the following query:
https://wxs.ign.fr/essentiels/geoportail/geocodage/rest/0.1/search?q=ecole%20primaire%20privee%20st%20joseph%20de%20bonabry&index=poi&limit=1&returntruegeometry=false&postcode=35300
returns the following json:
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"properties": {
"postcode": [
"35300"
],
"citycode": [
"35115",
"35"
],
"city": [
"Fougères"
],
"toponym": "École Primaire Saint-Joseph de Bonabry",
"category": [
"area of activity or interest",
"primary education"
],
"extrafields": {
"cleabs": "SURFACTI0000000215529805",
"names": [
"saint joseph de bonabry elementary school"
]
},
"_score": 0.703030303030303,
"_type": "poi"
},
"geometry": {
"type": "Point",
"coordinates": [
-1.19610139955834,
48.3550652629677
]
}
}
]
}
So the coordinates to extract are located here: {"features":[{ "geometry":{"coordinates":[lon, lat]}}]}
I would like to go through a Python script to automate the task. From what I understand, the steps could be as follows:
Open the CSV
Read the value contained in the sixth column
Perform an http get request for each row, changing the URL based on the value in the sixth column
Extract longitude and latitude from the results
Update the longitude and latitude columns (already existing) with the previously extracted values.
pandas lets me read the CSV, while requests lets me build the query. Being a beginner in programming, I don't really know how to write the script. I guess it can start this way:
import pandas as pd
import requests
df = pd.read_csv("myfile.csv")
...but I'm stuck on what to do next. I guess a loop would let me repeat the request, but how do I change the URL terms? In general, any help with the whole script will be greatly appreciated!

This is how I would do it.
Replace "name" and "post" with the actual column names from your CSV
import pandas as pd
import requests
# read the data CSV
# you have to replace "name" and "post" with the actual column names
df = pd.read_csv("data.csv", usecols=["name", "post"])
# define the request URL
url = "https://wxs.ign.fr/essentiels/geoportail/geocodage/rest/0.1/search"
# API call for each element
for i in range(len(df["name"])):
    # prepare the name for the URL
    genName = df["name"][i].replace(" ", "%20")
    print(genName)
    # prepare the request
    request = url + "?q=" + genName + "&index=poi&limit=1&returntruegeometry=false&postcode=" + str(df["post"][i])
    print(request)
    # do the request
    r = requests.get(request)
    # response
    result = r.text
    print(result)
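The loop above prints the raw response but stops before pulling out the coordinates. A minimal sketch of that last step follows, using only the standard library so it runs without network access. The helper names build_search_url and extract_lon_lat are assumptions made for illustration; the sample payload is trimmed from the response quoted in the question.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://wxs.ign.fr/essentiels/geoportail/geocodage/rest/0.1/search"


def build_search_url(name, postcode):
    """Build the search URL; urlencode escapes spaces and accented characters."""
    params = {
        "q": name,
        "index": "poi",
        "limit": 1,
        "returntruegeometry": "false",
        "postcode": postcode,
    }
    # quote_via=quote encodes spaces as %20, as in the question's URL.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


def extract_lon_lat(payload):
    """Return (lon, lat) from a FeatureCollection dict, or (None, None)."""
    features = payload.get("features") or []
    if not features:
        return None, None
    coordinates = features[0].get("geometry", {}).get("coordinates")
    if not coordinates or len(coordinates) < 2:
        return None, None
    return coordinates[0], coordinates[1]


# Trimmed from the response shown in the question.
sample = {
    "type": "FeatureCollection",
    "features": [
        {"geometry": {"type": "Point",
                      "coordinates": [-1.19610139955834, 48.3550652629677]}}
    ],
}
print(build_search_url("ecole primaire privee st joseph de bonabry", 35300))
print(extract_lon_lat(sample))
```

Inside the loop, extract_lon_lat(r.json()) would give the pair to store back into the longitude and latitude columns; a school with no match comes back as (None, None) rather than raising an error.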

How to save json data as it is without data type conversion in dynamo db using python

I want to store key-value JSON data in aws DynamoDB where key is a date string in YYYY-mm-dd format and value is entries which is a python dictionary. When I used boto3 client to save data there, it saved it as a data type object, which I don't want. My purpose is simple: Store JSON data against a key which is a date, so that later I will query the data by giving that date. I am struggling with this issue because I did not find any relevant link which says how to store JSON data and retrieve it without any conversion.
I need help to solve it in Python.
What I am doing now:
item = {
"entries": [
{
"path": [
{
"name": "test1",
"count": 1
},
{
"name": "test2",
"count": 2
}
],
"repo": "test3"
}
],
"date": "2022-10-11"
}
dynamodb_client = boto3.resource('dynamodb')
table = dynamodb_client.Table(table_name)
response = table.put_item(Item = item)
What actually saved:
[{"M":{"path":{"L":[{"M":{"name":{"S":"test1"},"count":{"N":"1"}}},{"M":{"name":{"S":"test2"},"count":{"N":"2"}}}]},"repo":{"S":"test3"}}}]
But I want to save exactly the same JSON data as it is, without any conversion at all.
When I retrieve it programmatically, you can see the differences: single quotes instead of double quotes, and the count values come back changed.
response = table.get_item(
Key={
"date": "2022-10-12"
}
)
Output
{'Item': {'entries': [{'path': [{'name': 'test1', 'count': Decimal('1')}, {'name': 'test2', 'count': Decimal('2')}], 'repo': 'test3'}], 'date': '2022-10-12'}}

Why not store it as a single attribute of type string? Then you’ll get out exactly what you put in, byte for byte.

When you store this in DynamoDB you get exactly what you want/have provided. Key is your date and you have a list of entries.
If you need it to store in a different format you need to provide the JSON which correlates with what you need. It's important to note that DynamoDB is a key-value store not a document store. You should also look up the differences in these.
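The nested structure itself survives the round trip; the visible difference in the output quoted in the question is that numbers come back as Decimal. If that is the only obstacle, one option is to convert them while serializing. The sketch below assumes the item has already been fetched, so it uses a literal dict shaped like the get_item output instead of calling DynamoDB; the name decimal_default is made up for illustration.

```python
import json
from decimal import Decimal


def decimal_default(value):
    """json.dumps fallback: turn Decimal into int or float."""
    if isinstance(value, Decimal):
        # Whole numbers become int, everything else becomes float.
        return int(value) if value == value.to_integral_value() else float(value)
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


# Shaped like the get_item output quoted in the question.
item = {
    "entries": [
        {"path": [{"name": "test1", "count": Decimal("1")},
                  {"name": "test2", "count": Decimal("2")}],
         "repo": "test3"}
    ],
    "date": "2022-10-12",
}

as_json = json.dumps(item, default=decimal_default)
print(as_json)
```

This keeps the attributes as native maps and lists in the table, at the cost of converting on every read.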

I figured out how to solve this issue. I have two columns, named date and entries, in my DynamoDB table (also visible in the screenshot in the question).
I convert the entries value from a list to a string and then save it in the database. At retrieval time I do the reverse: parse the string, build a proper JSON response, and return it.
I am also sharing sample code below so that anybody else dealing with the same situation has at least one option.
import json
import boto3
# While storing:
entries_string = json.dumps([
{
"path": [
{
"name": "test1",
"count": 1
},
{
"name": "test2",
"count": 2
}
],
"repo": "test3"
}
])
item = {
"entries": entries_string,
"date": "2022-10-12"
}
dynamodb_client = boto3.resource('dynamodb')
table = dynamodb_client.Table(<TABLE-NAME>)
table.put_item(Item=item)
-------------------------
# While fetching:
response = table.get_item(
Key={
"date": "2022-10-12"
}
)['Item']
entries_string=response['entries']
entries_dic = json.loads(entries_string)
response['entries'] = entries_dic
print(json.dumps(response))

Select specific keys inside a json using python

I have the following JSON, which I extracted using requests with Python and json.loads. The whole JSON basically repeats itself with changes in the IDs and names. It has a lot of information, but I'm just posting a small sample as an example:
"status":"OK",
"statuscode":200,
"message":"success",
"apps":[
{
"id":"675832210",
"title":"AGED",
"desc":"No annoying ads＆easy to play",
"urlImg":"https://test.com/pImg.aspx?b=675832&z=1041813&c=495181&tid=API_MP&u=https%3a%2f%2fcdna.test.com%2fbanner%2fwMMUapCtmeXTIxw_square.png&q=",
"urlImgWide":"https://cdna.test.com/banner/sI9MfGhqXKxVHGw_rectangular.jpeg",
"urlApp":"https://admin.test.com/appLink.aspx?b=675832&e=1041813&tid=API_MP&sid=2c5cee038cd9449da35bc7b0f53cf60f&q=",
"androidPackage":"com.agedstudio.freecell",
"revenueType":"cpi",
"revenueRate":"0.10",
"categories":"Card",
"idx":"2",
"country":[
"CH"
],
"cityInclude":[
"ALL"
],
"cityExclude":[
],
"targetedOSver":"ALL",
"targetedDevices":"ALL",
"bannerId":"675832210",
"campaignId":"495181210",
"campaignType":"network",
"supportedVersion":"",
"storeRating":"4.3",
"storeDownloads":"10000+",
"appSize":"34603008",
"urlVideo":"",
"urlVideoHigh":"",
"urlVideo30Sec":"https://cdn.test.com/banner/video/video-675832-30.mp4?rnd=1620699136",
"urlVideo30SecHigh":"https://cdn.test.com/banner/video/video-675832-30_o.mp4?rnd=1620699131",
"offerId":"5825774"
},
I don't need all that data, just a few fields like 'title', 'country', 'revenueRate' and 'urlApp', but I don't know if there is a way to extract only those.
My solution so far was to turn the JSON into a dataframe and then drop the columns; however, I wanted to find an easier solution.
My ideal final result would be a dataframe with the selected keys and arrays.
Does anybody know an easy solution for this problem?
Thanks

I assume you have that data as a dictionary, let's call it json_data. You can just iterate over the apps and write them into a list. Alternatively, you could obviously also define a class and initialize objects of that class.
EDIT:
I just found this answer: https://stackoverflow.com/a/20638258/6180150, which tells how you can convert a list of dicts like from my sample code into a dataframe. See below adaptions to the code for a solution.
json_data = {
"status": "OK",
"statuscode": 200,
"message": "success",
"apps": [
{
"id": "675832210",
"title": "AGED",
"desc": "No annoying ads＆easy to play",
"urlImg": "https://test.com/pImg.aspx?b=675832&z=1041813&c=495181&tid=API_MP&u=https%3a%2f%2fcdna.test.com%2fbanner%2fwMMUapCtmeXTIxw_square.png&q=",
"urlImgWide": "https://cdna.test.com/banner/sI9MfGhqXKxVHGw_rectangular.jpeg",
"urlApp": "https://admin.test.com/appLink.aspx?b=675832&e=1041813&tid=API_MP&sid=2c5cee038cd9449da35bc7b0f53cf60f&q=",
"androidPackage": "com.agedstudio.freecell",
"revenueType": "cpi",
"revenueRate": "0.10",
"categories": "Card",
"idx": "2",
"country": [
"CH"
],
"cityInclude": [
"ALL"
],
"cityExclude": [
],
"targetedOSver": "ALL",
"targetedDevices": "ALL",
"bannerId": "675832210",
"campaignId": "495181210",
"campaignType": "network",
"supportedVersion": "",
"storeRating": "4.3",
"storeDownloads": "10000+",
"appSize": "34603008",
"urlVideo": "",
"urlVideoHigh": "",
"urlVideo30Sec": "https://cdn.test.com/banner/video/video-675832-30.mp4?rnd=1620699136",
"urlVideo30SecHigh": "https://cdn.test.com/banner/video/video-675832-30_o.mp4?rnd=1620699131",
"offerId": "5825774"
},
]
}
filtered_data = []
for app in json_data["apps"]:
    app_data = {
        "id": app["id"],
        "title": app["title"],
        "country": app["country"],
        "revenueRate": app["revenueRate"],
        "urlApp": app["urlApp"],
    }
    filtered_data.append(app_data)
print(filtered_data)
# Output
d = [
{
'id': '675832210',
'title': 'AGED',
'country': ['CH'],
'revenueRate': '0.10',
'urlApp': 'https://admin.test.com/appLink.aspx?b=675832&e=1041813&tid=API_MP&sid=2c5cee038cd9449da35bc7b0f53cf60f&q='
}
]
import pandas as pd
d = pd.DataFrame(filtered_data)
print(d)
# Output
id title country revenueRate urlApp
0 675832210 AGED [CH] 0.10 https://admin.test.com/appLink.aspx?b=675832&e=1041813&tid=API_MP&sid=2c5cee038cd9449da35bc7b0f53cf60f&q=

If your end goal is a dataframe, just load the dataframe and take the columns you want.
With the JSON loaded into data:
df = pd.json_normalize(data['apps'])
yields
id title desc urlImg ... urlVideoHigh urlVideo30Sec urlVideo30SecHigh offerId
0 675832210 AGED No annoying ads＆easy to play https://test.com/pImg.aspx?b=675832&z=1041813&... ... https://cdn.test.com/banner/video/video-675832... https://cdn.test.com/banner/video/video-675832... 5825774
[1 rows x 28 columns]
then if you want certain columns:
df_final = df[['title', 'desc', 'urlImg']]
title desc urlImg
0 AGED No annoying ads＆easy to play https://test.com/pImg.aspx?b=675832&z=1041813&...

Use a dictionary comprehension to extract a dictionary of the key/value pairs you want:
import json
json_string="""{
"id":"675832210",
"title":"AGED",
"desc":"No annoying ads＆easy to play",
"urlApp":"https://admin.test.com/appLink.aspx?b=675832&e=1041813&tid=API_MP&sid=2c5cee038cd9449da35bc7b0f53cf60f&q=",
"revenueRate":"0.10",
"categories":"Card",
"idx":"2",
"country":[
"CH"
],
"cityInclude":[
"ALL"
],
"cityExclude":[
]
}"""
json_dict = json.loads(json_string)
filter_fields=['title','country','revenueRate','urlApp']
dict_result = { key: json_dict[key] for key in json_dict if key in filter_fields}
json_elements = []
for key in dict_result:
    json_elements.append((key,json_dict[key]))
print(json_elements)
output:
[('title', 'AGED'), ('urlApp', 'https://admin.test.com/appLink.aspx?b=675832&e=1041813&tid=API_MP&sid=2c5cee038cd9449da35bc7b0f53cf60f&q='), ('revenueRate', '0.10'), ('country', ['CH'])]
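The same comprehension can be applied to every entry in the apps list. The sketch below is one way to do that; pick_fields is a made-up helper name, and the second record is hypothetical, added only to show what happens when a key is missing.

```python
def pick_fields(record, fields):
    """Return a new dict with only the requested keys, in the order given."""
    # dict.get yields None when a key is absent instead of raising KeyError.
    return {field: record.get(field) for field in fields}


apps = [
    # Trimmed from the sample in the question.
    {"id": "675832210", "title": "AGED", "revenueRate": "0.10",
     "country": ["CH"], "categories": "Card"},
    # Hypothetical record with missing keys, for illustration only.
    {"id": "000000000", "title": "HYPOTHETICAL", "country": ["DE"]},
]

wanted = ["title", "country", "revenueRate", "urlApp"]
rows = [pick_fields(app, wanted) for app in apps]
print(rows)
```

Passing rows to pd.DataFrame gives a dataframe with exactly the wanted columns, with missing values shown as None.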

python Issue with getting correct info from the json

I have an issue when I try to get the correct information out of a JSON response.
For example, I get a very big JSON output from a request I make with POST (I can't use GET).
"offers": [
{
"rank": 1,
"provider": {
"id": 6653,
"isLocalProvider": false,
"logoUrl": "https://img.vxcdn.com/i/partner-energy/c_6653.png?v=878adaf9ed",
"userRatings": {
"additonalCustomerRatings": {
"price": {
"percent": 73.80
},
"service": {
"percent": 67.50
},
"switching": {
"percent": 76.37
},
"caption": {
"text": "Zusätzliche Kundenbewertungen"
}
},
I can't show it all because it's very big.
As you can see, this is "rank": 1; the response contains 20 ranks, each with information such as content and totalCost, and I need to pick them all, for example content and totalCost for rank 6, content and totalCost for rank 8.
So first of all, in Python I use this code to get the JSON data:
import requests
import json
url = "https://www.verivox.de/api/energy/offers/electricity/postCode/10555/custom?"
payload="{\"profile\":\"H0\",\"prepayment\":true,\"signupOnly\":true,\"includePackageTariffs\":true,\"includeTariffsWithDeposit\":true,\"includeNonCompliantTariffs\":true,\"bonusIncluded\":\"non-compliant\",\"maxResultsPerPage\":20,\"onlyProductsWithGoodCustomerRating\":false,\"benchmarkTariffId\":741122,\"benchmarkPermanentTariffId\":38,\"paolaLocationId\":\"71085\",\"includeEcoTariffs\":{\"includesNonEcoTariffs\":true},\"maxContractDuration\":240,\"maxContractProlongation\":240,\"usage\":{\"annualTotal\":3500,\"offPeakUsage\":0},\"priceGuarantee\":{\"minDurationInMonths\":0},\"maxTariffsPerProvider\":999,\"cancellationPeriod\":null,\"previewDisplayTime\":null,\"onlyRegionalTariffs\":false,\"sorting\":{\"criterion\":\"TotalCosts\",\"direction\":\"Ascending\"},\"includeSpecialBonusesInCalculation\":\"None\",\"totalCostViewMode\":1,\"ecoProductType\":0}"
headers = {
'Content-Type': 'application/json',
'Cookie': '__cfduid=d97a159bb287de284487ebdfa0fd097b41606303469; ASP.NET_SessionId=jfg3y20s31hclqywloocjamz; 0e3a873fd211409ead79e21fffd2d021=product=Power&ReturnToCalcLink=/power/&CustomErrorsEnabled=False&IsSignupWhiteLabelled=False; __RequestVerificationToken=vrxksNqu8CiEk9yV-_QHiinfCqmzyATcGg18dAqYXqR0L8HZNlvoHZSZienIAVQ60cB40aqfQOXFL9bsvJu7cFOcS2s1'
}
response = requests.request("POST", url, headers=headers, data=payload)
jsondata = response.json()
# print(response.text)
That works fine, but when I try to pick out the data I need, as described above, I get:
for Rankdata in str(jsondata['rank']):
KeyError: 'rank'
This is the code that raises the error:
dataRank = []
for Rankdata in str(jsondata['rank']):
    dataRank.append({
        'tariff':Rankdata['content'],
        'cost': Rankdata['totalCost'],
        'sumOfOneTimeBonuses': Rankdata['content'],
        'savings': Rankdata['content']
    })
Then I tried another way, getting just one or a few values, but that doesn't work either.
data = response.json()
#print(data)
test = float((data['rank']['totalCost']['content']))
I know my code is not perfect, but this is the first time I have dealt with JSON that is this big and this difficult. I would be very grateful if someone could show me, using my example, how to pick the data for rank 1 through rank 20 and print it.
Thank you for your help.

If you look closely at the highest level in the json, you can see that the value for key offers is a list of dicts. You can therefore loop through it like this:
for offer in jsondata['offers']:
    print(offer.get('rank'))
    print(offer.get('provider').get('id'))
And the same goes for other keys in the offers.
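To collect the values instead of printing them, the loop can append one dict per offer. The sketch below uses only keys that appear in the excerpt from the question (rank, provider, userRatings and so on); where content and totalCost sit in the full response is not shown there, so they are left out. The dig helper and the second offer are assumptions added for illustration.

```python
def dig(mapping, *keys, default=None):
    """Walk nested dicts key by key; return default if any level is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Trimmed sample using only the keys visible in the question's excerpt.
jsondata = {
    "offers": [
        {"rank": 1,
         "provider": {"id": 6653,
                      "userRatings": {"additonalCustomerRatings": {
                          "price": {"percent": 73.80}}}}},
        {"rank": 2, "provider": {"id": 1}},  # hypothetical second offer
    ]
}

rows = []
for offer in jsondata["offers"]:
    rows.append({
        "rank": offer.get("rank"),
        "provider_id": dig(offer, "provider", "id"),
        "price_percent": dig(offer, "provider", "userRatings",
                             "additonalCustomerRatings", "price", "percent"),
    })
print(rows)
```

Because dig returns None for a missing level, an offer without ratings does not stop the loop with a KeyError.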

Using Pandas to convert JSON to CSV with specific fields

I am currently trying to convert a JSON file to a CSV file using Pandas.
The code I'm using now is able to convert the JSON to a CSV file.
import pandas as pd
json_data = pd.read_json("out1.json")
df = pd.json_normalize(json_data["events"])
df.to_csv("out.csv")
This is my JSON file:
{
"events": [
{
"raw": "{\"level\": \"INFO\", \"message\": \"Disabled camera with QR scan on by 80801234 at Area A\n\"}",
"logtypes": [
"json"
],
"timestamp": 1537190572023,
"unparsed": null,
"logmsg": "{\"level\": \"INFO\", \"message\": \"Disabled camera with QR scan on by 80801234 at Area A\n\"}",
"id": "c77afb4c-ba7c-11e8-8000-12b233ae723a",
"tags": [
"INFO"
],
"event": {
"json": {
"message": "Disabled camera with QR scan on by 80801234 at Area A\n",
"level": "INFO"
},
"http": {
"clientHost": "116.197.237.29",
"contentType": "text/plain; charset=UTF-8"
}
}
},
{
"raw": "{\"level\": \"INFO\", \"message\": \"Employee number saved successfully.\"}",
"logtypes": [
"json"
],
"timestamp": 1537190528619,
"unparsed": null,
"logmsg": "{\"level\": \"INFO\", \"message\": \"Employee number saved successfully.\"}",
"id": "ad9c0175-ba7c-11e8-803d-12b233ae723a",
"tags": [
"INFO"
],
"event": {
"json": {
"message": "Employee number saved successfully.",
"level": "INFO"
},
"http": {
"clientHost": "116.197.237.29",
"contentType": "text/plain; charset=UTF-8"
}
}
}
]
}
But what I wanted was just some fields (timestamp, level, message) inside the JSON file not all of it.
I have tried a variety of ways:
df = json_normalize(json_data["timestamp"])  # gives a KeyError on 'timestamp'
df = json_normalize(json_data, 'timestamp', ['event', 'json', ['level', 'message']])  # TypeError: string indices must be integers
Where did I go wrong?

I don't think json_normalize is intended to work on this specific orientation. I could be wrong but from the documentation, it appears that normalization means "Deal with lists within each dictionary".
Assume data is
import json
import pandas as pd
from pandas import json_normalize
data = json.load(open('out1.json'))['events']
Look at the first entry
data[0]['timestamp']
1537190572023
json_normalize wants this to be a list
[{'timestamp': 1537190572023}]
Create augmented data2
I don't actually recommend this approach.
If we create data2 accordingly:
data2 = [{**d, **{'timestamp': [{'timestamp': d['timestamp']}]}} for d in data]
We can use json_normalize
json_normalize(
    data2, 'timestamp',
    [['event', 'json', 'level'], ['event', 'json', 'message']]
)
timestamp event.json.level event.json.message
0 1537190572023 INFO Disabled camera with QR scan on by 80801234 a...
1 1537190528619 INFO Employee number saved successfully.
Comprehension
I think it's simpler to just do
pd.DataFrame([
    (d['timestamp'],
     d['event']['json']['level'],
     d['event']['json']['message'])
    for d in data
], columns=['timestamp', 'level', 'message'])
timestamp level message
0 1537190572023 INFO Disabled camera with QR scan on by 80801234 a...
1 1537190528619 INFO Employee number saved successfully.
json_normalize
But without the fancy arguments
json_normalize(data).pipe(
    lambda d: d[['timestamp']].join(
        d.filter(like='event.json')
    )
)
timestamp event.json.level event.json.message
0 1537190572023 INFO Disabled camera with QR scan on by 80801234 a...
1 1537190528619 INFO Employee number saved successfully.
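To finish with the CSV the question asks for, the flattened columns can be selected, renamed, and written out. This is a minimal sketch under a few assumptions: the events list is trimmed from the JSON file in the question, the short column names level and message are a choice made here, and stripping the trailing newline from the message is optional.

```python
import pandas as pd

# Two events trimmed from the JSON file in the question.
events = [
    {"timestamp": 1537190572023,
     "event": {"json": {"message": "Disabled camera with QR scan on by 80801234 at Area A\n",
                        "level": "INFO"}}},
    {"timestamp": 1537190528619,
     "event": {"json": {"message": "Employee number saved successfully.",
                        "level": "INFO"}}},
]

# Nested keys are flattened into dotted column names.
flat = pd.json_normalize(events)
df = flat[["timestamp", "event.json.level", "event.json.message"]].rename(
    columns={"event.json.level": "level", "event.json.message": "message"}
)
# Drop the trailing newline so every record stays on one CSV line.
df["message"] = df["message"].str.strip()

csv_text = df.to_csv(index=False)  # pass a file path instead to write a file
print(csv_text)
```

With the real file, events would be the list loaded from out1.json under the "events" key, and df.to_csv("out.csv", index=False) would write the result to disk.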
