How to convert JSON data inside a pandas column into new columns - python

I have this short version of ADSB JSON data and would like to convert it into DataFrame columns such as Icao, Alt, Lat, Long, Spd, Cou.....
Following Alperen's suggestion to use
df = pd.read_json('2016-06-20-2359Z.json', lines=True)
I can load it into a DataFrame. However, df.acList looks like this:
[{'Id': 10537990, 'Rcvr': 1, 'HasSig': False, ...
Name: acList, dtype: object
How can I get the Icao, Alt, Lat, Long, Spd, Cou data?
"src":1,
"feeds":[
{
"id":1,
"name":"ADSBexchange.com",
"polarPlot":false
}
],
"srcFeed":1,
"showSil":true,
"showFlg":true,
"showPic":true,
"flgH":20,
"flgW":85,
"acList":[
{
"Id":11281748,
"Rcvr":1,
"HasSig":false,
"Icao":"AC2554",
"Bad":false,
"Reg":"N882AS",
"FSeen":"\/Date(1466467166951)\/",
"TSecs":3,
"CMsgs":1,
"AltT":0,
"Tisb":false,
"TrkH":false,
"Type":"CRJ2",
"Mdl":"2001
BOMBARDIER INC
CL-600-2B19",
"Man":"Bombardier",
"CNum":"7503",
"Op":"EXPRESSJET AIRLINES INC - ATLANTA, GA",
"OpIcao":"ASQ",
"Sqk":"",
"VsiT":0,
"WTC":2,
"Species":1,
"Engines":"2",
"EngType":3,
"EngMount":1,
"Mil":false,
"Cou":"United States",
"HasPic":false,
"Interested":false,
"FlightsCount":0,
"Gnd":false,
"SpdTyp":0,
"CallSus":false,
"TT":"a",
"Trt":1,
"Year":"2001"
},
{
"Id":11402205,
"Rcvr":1,
"HasSig":true,
"Sig":110,
"Icao":"ADFBDD",
"Bad":false,
"FSeen":"\/Date(1466391940977)\/",
"TSecs":75229,
"CMsgs":35445,
"Alt":8025,
"GAlt":8025,
"AltT":0,
"Call":"TEST1234",
"Tisb":false,
"TrkH":false,
"Sqk":"0262",
"Help":false,
"VsiT":0,
"WTC":0,
"Species":0,
"EngType":0,
"EngMount":0,
"Mil":true,
"Cou":"United States",
"HasPic":false,
"Interested":false,
"FlightsCount":0,
"Gnd":true,
"SpdTyp":0,
"CallSus":false,
"TT":"a",
"Trt":1
}
],
"totalAc":4231,
"lastDv":"636019887431643594",
"shtTrlSec":61,
"stm":1466467170029
}

If you already have your data in the acList column of a pandas DataFrame, simply do:
import pandas as pd
pd.io.json.json_normalize(df.acList[0])
Alt AltT Bad CMsgs CNum Call CallSus Cou EngMount EngType ... Sqk TSecs TT Tisb TrkH Trt Type VsiT WTC Year
0 NaN 0 False 1 7503 NaN False United States 1 3 ... 3 a False False 1 CRJ2 0 2 2001
1 8025.0 0 False 35445 NaN TEST1234 False United States 0 0 ... 0262 75229 a False False 1 NaN 0 0 NaN
Since pandas 1.0, the call should be:
import pandas as pd
pd.json_normalize(df.acList[0])
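To then pull out just the fields asked about in the question (Icao, Alt, Lat, Long, Spd, Cou), you can reindex the normalized frame. This is a minimal sketch, assuming those keys appear in at least some aircraft records; fields absent from a record simply come back as NaN:
import pandas as pd

# Flatten the list of aircraft dicts, then keep only the requested fields.
# reindex(columns=...) tolerates fields that are missing from every record.
flat = pd.json_normalize(df.acList[0])
wanted = ['Icao', 'Alt', 'Lat', 'Long', 'Spd', 'Cou']
print(flat.reindex(columns=wanted))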

Sergey's answer solved the issue for me, but I was running into problems because the JSON in my DataFrame column was stored as a string rather than as an object. I had to add the extra step of parsing the column:
import json
import pandas as pd
pd.io.json.json_normalize(df.acList.apply(json.loads))
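As a small, self-contained illustration of the string case (the data below is invented for the example), parsing and then flattening looks like this:
import json
import pandas as pd

# Each cell of acList is a JSON *string*, not a dict
df = pd.DataFrame({'acList': ['{"Icao": "AC2554", "Cou": "United States"}',
                              '{"Icao": "ADFBDD", "Alt": 8025}']})

# Parse every string into a dict, then flatten the dicts into columns
flat = pd.json_normalize(df['acList'].apply(json.loads))
print(flat)
#      Icao            Cou     Alt
# 0  AC2554  United States     NaN
# 1  ADFBDD            NaN  8025.0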

Since pandas 1.0, json_normalize is available in the top-level namespace.
Therefore use:
import pandas as pd
pd.json_normalize(df.acList[0])

I can't comment yet on ThinkBonobo's answer, but in case the JSON in the column isn't exactly a dictionary, you can keep applying .apply until it is. In my case:
import json
import pandas as pd

pd.json_normalize(
    df
    .theColumnWithJson
    .apply(json.loads)
    .apply(lambda x: x[0])  # the inner JSON is a list with the dictionary as its only item
)
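For a concrete (made-up) example of what that pattern handles, each cell below holds a JSON list whose only element is the dictionary we want:
import json
import pandas as pd

df = pd.DataFrame({'theColumnWithJson': ['[{"Icao": "AC2554", "Cou": "United States"}]',
                                         '[{"Icao": "ADFBDD", "Alt": 8025}]']})

flat = pd.json_normalize(
    df['theColumnWithJson']
    .apply(json.loads)        # string -> list
    .apply(lambda x: x[0])    # list -> the single dict inside it
)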

In my case I had some missing values (None), so I wrote more specific code that also drops the original column after creating the new ones:
for prefix in ['column1', 'column2']:
    df_temp = df[prefix].apply(lambda x: {} if pd.isna(x) else x)
    df_temp = pd.io.json.json_normalize(df_temp)
    df_temp = df_temp.add_prefix(prefix + '_')
    df.drop([prefix], axis=1, inplace=True)
    df = pd.concat([df, df_temp], axis=1, sort=False)
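As a rough usage sketch (toy data, with the column names from the snippet above; pd.json_normalize replaces the deprecated pd.io.json.json_normalize on pandas 1.0+):
import pandas as pd

df = pd.DataFrame({
    'column1': [{'a': 1, 'b': 2}, None],
    'column2': [{'x': 'u'}, {'x': 'v'}],
})

for prefix in ['column1', 'column2']:
    df_temp = df[prefix].apply(lambda x: {} if pd.isna(x) else x)
    df_temp = pd.json_normalize(df_temp)
    df_temp = df_temp.add_prefix(prefix + '_')
    df.drop([prefix], axis=1, inplace=True)
    df = pd.concat([df, df_temp], axis=1, sort=False)

print(df.columns.tolist())  # ['column1_a', 'column1_b', 'column2_x']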

Related

Is there a way to replace True/False with string values in Pandas?

The pandas data frame looks like this:
job_url
0 https://neuvoo.ca/view/?id=34134414434
1 https://zip.com/view/?id=35453453454
2 https://neuvoo.com/view/?id=2452444252
I want to turn all the strings beginning with 'https://neuvoo.ca' into 'Canada'.
My solution to this was to search for that using str.startswith and then replacing True with 'Canada'
csv_file['job_url'] = csv_file['job_url'].str.startswith('https://neuvoo.ca/')
csv_file['job_url'] = replace({'job_url': {np.True: 'Canada', np.False: 'USA'}})
But I found out that it doesn't replace the boolean with a string.
Let's try with np.where instead:
import numpy as np
import pandas as pd
csv_file = pd.DataFrame({
    'job_url': ['https://neuvoo.ca/view/?id=34134414434',
                'https://zip.com/view/?id=35453453454',
                'https://neuvoo.com/view/?id=2452444252']
})
csv_file['job_url'] = np.where(
    csv_file['job_url'].str.startswith('https://neuvoo.ca/'),
    'Canada',
    'USA'
)
job_url
0 Canada
1 USA
2 USA
replace would work as well, but with True and False, not np.True/np.False:
csv_file['job_url'] = (
    csv_file['job_url'].str.startswith('https://neuvoo.ca/')
    .replace({True: 'Canada', False: 'USA'})
)
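If you ever need more than two labels (an extension, not part of the answers above), numpy.select generalizes the same idea; this assumes job_url still holds the original URLs, and the country column name is made up for the example:
import numpy as np

conditions = [
    csv_file['job_url'].str.startswith('https://neuvoo.ca/'),
    csv_file['job_url'].str.startswith('https://neuvoo.com/'),
]
choices = ['Canada', 'USA']
csv_file['country'] = np.select(conditions, choices, default='Other')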

Pandas isin() returning all false

I'm using pandas 1.1.3, the latest available with Anaconda.
I have two DataFrames, imported from a .txt and a .xlsx file. They have a column called "ID" which is an int64 (verified with df.info()) on both DataFrames.
df1:
ID Name
0 1234564567 Last, First
1 1234564569 Last, First
...
df2:
ID Amount
0 1234564567 59.99
1 5678995545 19.99
I want to check if all of the IDs on df1 are on df2. For this I create a series:
foo = df1["ID"].isin(df2["ID"])
And I get that all values are False, even though I checked manually and the values do match.
0 False
1 False
2 False
3 False
4 False
...
I'm not sure if I'm missing something, if there is something wrong with the environment, or if it is a known bug.
You must be doing something wrong. Try to reproduce the error with a toy example, as I did here; the code below works for me.
Reproducing the error with a minimal example and sharing it not only lets you challenge your own assumptions, it also lets us help you.
import pandas as pd
import numpy as np
data = {'Name':['Tom', 'nick'], 'ID':[1234564567, 1234564569]}
data2 = {'Name':['Tom', 'nick'], 'ID':[1234564567, 5678995545]}
# Create DataFrame
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
df["ID"].isin(df2["ID"])
0 True
1 False
Name: ID, dtype: bool
EDIT: with Paul's data I don't get any error. See the importance of providing examples?
import pandas as pd
data = {'ID':['1234564567', '1234564569'],'Name':['Last, First', 'Last, First']}
data2 = {'ID':['1234564567', '5678995545'],'Amount': [59.99, 19.99]}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
df["ID"].isin(df2["ID"])
0 True
1 False
import pandas as pd
data = {'ID':['1234564567', '1234564569'],'Name':['Last, First', 'Last, First']}
data2 = {'ID':['1234564567', '5678995545'],'Amount': [59.99, 19.99]}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
Now that we have that set up, we get to the meat:
df1["ID"].apply(lambda x: df2['ID'].isin([x]))
Which shows:
       0      1
0   True  False
1  False  False
That is, ID 0 of df1 matches ID 0 of df2.
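A common reason isin returns all False even when the values look identical is a dtype mismatch between the two ID columns (for example, one read as int64 and the other as strings by the .txt/.xlsx import). This is a hedged sketch of that failure mode and a fix, not a diagnosis of the original files:
import pandas as pd

df1 = pd.DataFrame({'ID': [1234564567, 1234564569]})        # int64
df2 = pd.DataFrame({'ID': ['1234564567', '5678995545']})    # object (strings)

print(df1['ID'].isin(df2['ID']))
# 0    False
# 1    False   -> all False, because 1234564567 != '1234564567'

# Align the dtypes before comparing
print(df1['ID'].isin(df2['ID'].astype('int64')))
# 0     True
# 1    False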

How to drop rows by condition on string value in pandas dataframe?

Consider a Pandas Dataframe like:
>>> import pandas as pd
>>> df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com', 'http://www.url2.com','http://www.url3.com','http://www.url1.com']))
>>> df
Giving:
url
0 http://url1.com
1 http://www.url1.com
2 http://www.url2.com
3 http://www.url3.com
4 http://www.url1.com
I want to remove all rows containing url1.com and url2.com, to obtain a DataFrame like:
url
0 http://www.url3.com
I tried this:
domainToCheck = ('url1.com', 'url2.com')
goodUrl = df['url'].apply(lambda x : any(domain in x for domain in domainToCheck))
But this gives me no result.
Any idea how to solve the above problem?
Edit: Solution
import pandas as pd
import tldextract
df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com','http://www.url2.com','http://www.url3.com','http://www.url1.com']))
domainToCheck = ['url1', 'url2']
s = df.url.map(lambda x : tldextract.extract(x).domain).isin(domainToCheck)
df = df[~s].reset_index(drop=True)
If we are checking domains, we should match the domain exactly rather than rely on string containment, since a subdomain may contain the same keyword as the domain:
import tldextract
s=df.url.map(lambda x : tldextract.extract(x).domain).isin(['url1','url2'])
Out[594]:
0 True
1 True
2 True
3 False
4 True
Name: url, dtype: bool
df=df[~s]
Use Series.str.contains to create a boolean mask m, and then filter the DataFrame df using this mask:
m = df['url'].str.contains('|'.join(domainToCheck))
df = df[~m].reset_index(drop=True)
Result:
url
0 http://www.url3.com
You can use pd.Series.str.contains here.
df[~df.url.str.contains('|'.join(domainToCheck))]
url
3 http://www.url3.com
If you want to reset the index, use this:
df[~df.url.str.contains('|'.join(domainToCheck))].reset_index(drop=True)
url
0 http://www.url3.com
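One caution about the str.contains approach (an addition, not from the answers above): the joined pattern is treated as a regular expression, and the domains here contain '.', which is a regex metacharacter, so it is safer to escape them first:
import re

# domainToCheck is the tuple from the question: ('url1.com', 'url2.com')
pattern = '|'.join(re.escape(d) for d in domainToCheck)
df[~df['url'].str.contains(pattern)].reset_index(drop=True)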

How to force specific column type in Dataframe?

I have the following code:
import pandas as pd
dic = {"rates":{
"IT":{
"country_name":"Italy",
"standard_rate":20,
"reduced_rates":{
"food":13,
"books":11
}
},
"UK":{
"country_name":"United Kingdom",
"standard_rate":21,
"reduced_rates":{
"food":12,
"books":1
}
}
}}
df = pd.DataFrame([{'code': k, 'standard_rate': v["standard_rate"]} for k,v in dic["rates"].items()])
print(df)
Which gives:
code standard_rate
0 IT 20
1 UK 21
How can I force standard_rate to be of float type?
Note: I know how to print it as a float; that is not what I want.
I want to change the type of the column in the DataFrame itself, so that if I export it to CSV or JSON, the value in standard_rate will be a float.
Cast it in the list comprehension:
comp=[{'code': k, 'standard_rate': float(v["standard_rate"])} for k,v in dic["rates"].items()]
df = pd.DataFrame(comp)
print (df)
code standard_rate
0 IT 20.0
1 UK 21.0
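Alternatively (not shown in the answer above), you can build the frame as-is and cast the column afterwards with astype:
df = pd.DataFrame([{'code': k, 'standard_rate': v['standard_rate']}
                   for k, v in dic['rates'].items()])
df['standard_rate'] = df['standard_rate'].astype(float)
print(df.dtypes)
# code              object
# standard_rate    float64
# dtype: object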

Groupby mean in pandas python

I have a csv file consisting of 5 fields.
Some sample data:
market_name,vendor_name,price,name,ship_from
'Greece',03wel,1.79367196,huhif,Germany
'Greece',le,0.05880975,fdfd,Germany
'Mlkio',dpg,0.11344859,fdfd,Germany
'Greece',gert,0.18655316,,Germany
'Tu',roland,0.52856728,fdfsdv,Germany
'ghuo',andy,0.52856728,jhjhj,Germany
'ghuo',didier,0.02085452,fsdfdf,Germany
'arsen',roch,0.02578377,uykujkj,Germany
'arsen',dpg,0.10010169,wrefrewrf,Germany
'arsen',dpg,0.06415609,jhgjhg,Germany
'arsen',03wel,0.02578377,gfdgb,Germany
'giar',03wel,0.02275039,gfhfbf,Germany
'giar',03wel,0.42751765,sdgfdgfg,Germany
In this file there are multiple records for every vendor. I want to find every unique value of the field vendor_name and also calculate the average price for each vendor. I am using the following script:
import pandas as pd
import numpy as np
import csv
from random import randint
ds = pd.read_csv("sxedonetoimo2.csv",
dtype={"vendor_name": object, "name" : object,
"ship_from" : object, "price": object})
ds['ship_from']=ds.ship_from.str.lower()
print(ds.dtypes)
pd.to_numeric(ds['price'], errors='coerce')
d = { 'name': pd.Series.nunique,
'ship_from' : lambda x: randint(1,2) if (x==('eu'or'europe'or'eu'or'europeanunion'or'worldwide'or'us'or'unitedstates'or'usa'or'us'or'ww'or'wweu'or'euww'or'internet')).any() else randint(3,20)
,'price': ds.groupby('vendor_name')['price'].mean()
}
result = ds.groupby('vendor_name').agg(d)
result.to_csv("scaled_upd.csv")
But I am getting this error:
raise DataError('No numeric types to aggregate')
pandas.core.base.DataError: No numeric types to aggregate
In my CSV file, the values of the price field are only numbers. If I change the dtype of that field to float, it raises an error that a specific string cannot be parsed. I am really confused. Any help?
Just use read_csv(), groupby() and agg():
import pandas as pd
df = pd.read_csv('test.csv').groupby('vendor_name') \
       .agg({'price': 'mean', 'name': lambda x: x.nunique()})
Yields:
price name
vendor_name
03wel 0.567431 4
andy 0.528567 1
didier 0.020855 1
dpg 0.092569 3
gert 0.186553 0
le 0.058810 1
roch 0.025784 1
roland 0.528567 1
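The underlying problem in the original script is that price was forced to dtype object in read_csv and the pd.to_numeric result was never assigned back, so the groupby saw no numeric column to aggregate. A hedged sketch of a minimal fix, keeping the asker's file names, would be:
import pandas as pd

# Let pandas infer price as a float instead of forcing it to object,
# or coerce it explicitly and assign the result back.
ds = pd.read_csv('sxedonetoimo2.csv')
ds['price'] = pd.to_numeric(ds['price'], errors='coerce')

result = ds.groupby('vendor_name').agg({'price': 'mean', 'name': 'nunique'})
result.to_csv('scaled_upd.csv')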
