Build JSON object from pandas dataframe - python

I'm trying to format this pandas dataframe:
> year mileage model manufacturer power fuel_type price
> 0 2011 184000 c-klasa Mercedes-Benz 161 diesel 114340
> 1 2013 102000 v40 Volvo 130 diesel 80511
> 2 2014 191000 scenic Renault 85 diesel 57613
> 3 1996 210000 vectra Opel 85 benzin 6278
> 4 2005 258000 tucson Hyundai 83 diesel 41363
> 5 2007 325000 astra Opel 74 diesel 26590
> 6 2002 200000 megane Renault 79 plin 16988
> 7 2011 191000 touran VW 77 diesel 62783
> 8 2007 210000 118 BMW 105 diesel 44318
> 9 2012 104000 3 Mazda 85 diesel 63522
> 10 2011 68000 c3 Citroen 54 benzin 44318
> 11 1993 200000 ax Citroen 37 diesel 43467
> 12 2011 142000 twingo Renault 55 benzin 28068
> 13 2005 280000 320 BMW 120 diesel 28068
so that the output fits JSON object requirements.
Here's my code:
for model, car in carsDF.groupby('manufacturer'):
    print("{\"", model, ":\"[\"", '","'.join(car['model'].unique()), "\"]},")
which yields:
> {" Alfa Romeo
> :"["156","159","146","147","giulietta","gt","33","mito","166","145","brera","sprint","spider","155","ostalo
> "]}, {" Aston Martin :"[" vantage "]},...
Which is OK except for the spaces that show up each time I use the escape chars \".
How can I create the JSON object without them?
Is there a better way to generate a JSON object for a case like this?

I believe you need to create a Series of unique values with SeriesGroupBy.unique and then convert it to JSON with Series.to_json. (The stray spaces in your output come from print inserting its default separator between arguments, not from the escapes.)
j = carsDF.groupby('manufacturer')['model'].unique().to_json()
print(j)
{
"BMW": ["118", "320"],
"Citroen": ["c3", "ax"],
"Hyundai": ["tucson"],
"Mazda": ["3"],
"Mercedes-Benz": ["c-klasa"],
"Opel": ["vectra", "astra"],
"Renault": ["scenic", "megane", "twingo"],
"VW": ["touran"],
"Volvo": ["v40"]
}
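Note that to_json() actually returns a compact single-line string; the pretty-printed form above is just for readability. If you are on pandas 1.0 or newer (an assumption about your version), you can request indentation directly:

j = carsDF.groupby('manufacturer')['model'].unique().to_json(indent=4)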
If you want each JSON object separately, the solution is to create dictionaries and convert them to JSON:
import json
for model, car in carsDF.groupby('manufacturer'):
    print(json.dumps({model: car['model'].unique().tolist()}))
{"BMW": ["118", "320"]}
{"Citroen": ["c3", "ax"]}
{"Hyundai": ["tucson"]}
{"Mazda": ["3"]}
{"Mercedes-Benz": ["c-klasa"]}
{"Opel": ["vectra", "astra"]}
{"Renault": ["scenic", "megane", "twingo"]}
{"VW": ["touran"]}
{"Volvo": ["v40"]}

Related

compare 2 dataframes simultaneously - 2 itertuples?

I'm comparing 2 dataframes and I'd like to see if the name matches on the address, then pull the unique ID; otherwise, continue on and search for the best match. (I'm using a fuzzy matcher for that part.)
I was exploring itertools and wondered if the itertools.zip_longest option would work to compare 2 items together simultaneously, rather than using 2 for loops (e.g. for x in df1.itertuples(): do something... for y in df2.itertuples(): do something). Would something like this work?
result = itertools.zip_longest(df1.itertuples(), df2.itertuples())
Here are my 2 dataframes. DF1:
NAME ADDRESS CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
0 OFFICE 1 123 road 2222277 84 170 265
1 OFFICE 2 15 lane 2222289 7 167 288
2 OFFICE 3 3 highway 1111111 1 2 286
3 OFFICE 4 2 street 1111171 95 193 299
4 OFFICE 5 1 place 1111191 9 193 298
DF2:
NAME ADDRESS CUSTOMER_SUPPLIER_NUMBER UNIQUE ID
0 OFFICE 1 123 road 2222277 014168
1 OFFICE 2 15 lane 2222289 131989
2 OFFICE 3 3 highway 1111111 149863
3 OFFICE 4 2 street 1111171 198664
4 OFFICE 5 1 place 1111191 198499
5 OFFICE 6 zzzz rd 165198 198791
6 OFFICE 7 5z st 19844 298791
7 OFFICE 8 34 hwy 981818 398791
8 OFFICE 9 81290 rd 899811 498791
9 OFFICE 10 59 rd 699161 598791
10 OFFICE 11 5141 bldvd 33211 698791
Then I perform a for loop and do some comparison if statements. I can access both items side by side, but how would I then loop over the items to do the check?
Right now I'm getting:
TypeError: 'NoneType' object is not subscriptable
for yy in result:
    if yy[0][1] == yy[1][1]:
        print(yy) ...
If your headers are the same in both dfs, just apply merge:
dfmerge = pd.merge(df1, df2)
The output should be the inner join of the two frames on all of their shared columns.
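As a side note, the TypeError in the question comes from itertools.zip_longest padding the shorter frame's rows with None, which then blows up on yy[0][1]. A slightly more explicit merge sketch, assuming NAME and ADDRESS are the intended join keys:

import pandas as pd

# left join keeps every row of df1; rows with no match get NaN for UNIQUE ID
matched = df1.merge(
    df2[['NAME', 'ADDRESS', 'UNIQUE ID']],
    on=['NAME', 'ADDRESS'],
    how='left',
)
# the rows still missing a UNIQUE ID are the candidates for fuzzy matching
unmatched = matched[matched['UNIQUE ID'].isna()]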

Pandas - Count the number of rows that would be true for a function - for each input row

I have a dataframe that needs a column added to it. That column needs to be a count of all the other rows in the table that meet a certain condition, and that condition needs to take input from both the "input" row and the "output" row.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
I'd want the height and weight of the row, as well as the height and weight of the other rows in a function, so I can do something like:
def example_function(height1, weight1, height2, weight2):
    if height1 > height2 and weight1 < weight2:
        return True
    else:
        return False
And it would just sum up all the True's and give that sum in the column.
Is something like this possible?
Thanks in advance for any ideas!
Edit: Sample input:
id name height weight country
0 Adam 70 180 USA
1 Bill 65 190 CANADA
2 Chris 71 150 GERMANY
3 Eric 72 210 USA
4 Fred 74 160 FRANCE
5 Gary 75 220 MEXICO
6 Henry 61 230 SPAIN
The result would need to be:
id name height weight country new_column
0 Adam 70 180 USA 1
1 Bill 65 190 CANADA 1
2 Chris 71 150 GERMANY 3
3 Eric 72 210 USA 1
4 Fred 74 160 FRANCE 4
5 Gary 75 220 MEXICO 1
6 Henry 61 230 SPAIN 0
I believe it will need to be some sort of function, as the actual logic I need to use is more complicated.
Edit 2: fixed a typo.
You can add booleans, like this:
count = ((df.height1 > df.height2) & (df.weight1 < df.weight2)).sum()
EDIT:
I tested it a bit and then changed the conditions with a custom function:
def f(x):
    # check boolean mask
    # print((df.height > x.height) & (df.weight < x.weight))
    return ((df.height < x.height) & (df.weight > x.weight)).sum()

df['new_column'] = df.apply(f, axis=1)
print(df)
id name height weight country new_column
0 0 Adam 70 180 USA 2
1 1 Bill 65 190 CANADA 1
2 2 Chris 71 150 GERMANY 3
3 3 Eric 72 210 USA 1
4 4 Fred 74 160 FRANCE 4
5 5 Gary 75 220 MEXICO 1
6 6 Henry 61 230 SPAIN 0
Explanation:
For each row, compare the values, and for the count simply sum the True values.
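As an alternative (not from the original answer), the same comparison can be vectorized with numpy broadcasting, avoiding the per-row apply:

import numpy as np

h = df['height'].to_numpy()
w = df['weight'].to_numpy()
# mask[i, j] is True when row j is shorter and heavier than row i
mask = (h[None, :] < h[:, None]) & (w[None, :] > w[:, None])
df['new_column'] = mask.sum(axis=1)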
> For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
As far as I understand, you want to assign to a new column something like
df['num_heigher_and_leighter'] = df.apply(lambda r: ((df.height > r.height) & (df.weight < r.weight)).sum(), axis=1)
However, your text description doesn't seem to match the outcome, which is:
0 2
1 3
2 0
3 1
4 0
5 0
6 6
dtype: int64
Edit
As in any other case, you can use a named function instead of a lambda:
df = ...

def foo(r):
    return ((df.height > r.height) & (df.weight < r.weight)).sum()

df['num_heigher_and_leighter'] = df.apply(foo, axis=1)
I'm assuming you had a typo and want to compare heights with heights and weights with weights. If so, you could count the number of persons both taller and heavier like so:
>>> for i, height, weight in zip(df.index, df.height, df.weight):
...     cnt = df.loc[((df.height > height) & (df.weight > weight)), 'height'].count()
...     df.loc[i, 'thing'] = cnt
...
>>> df
name height weight country thing
0 Adam 70 180 USA 2.0
1 Bill 65 190 CANADA 2.0
2 Chris 71 150 GERMANY 3.0
3 Eric 72 210 USA 1.0
4 Fred 74 160 FRANCE 1.0
5 Gary 75 220 MEXICO 0.0
6 Henry 61 230 SPAIN 0.0
Here, for instance, no person is heavier than Henry, and no person is taller than Gary. If that's not what you intended, it should be easy to modify the & above to a |, or to switch the > to a <.
When you're more accustomed to Pandas, I suggest you use Ami Tavory's excellent answer instead.
PS. For the love of god, use the Metric system for representing weight and height, and convert to whatever for presentation. These numbers are totally nonsensical for the world population at large. :)

Reshape data frame (with R or Python)

I want to know if it's possible to have this result:
example:
With this data frame
df
y Faisceaux destination Trajet RED_Groupe Nbr observation RED Pond Nbr observation total RED pct
1 2015 France DOM-TOM Aller 78248.47 87 85586.75 307 0.9142591 0.04187815
2 2015 Hors Schengen Aller 256817.64 234 195561.26 1194 1.3132337 0.06015340
3 2015 INTERNATIONAL Aller 258534.78 473 288856.53 2065 0.8950283 0.04099727
4 2015 Maghreb Aller 605514.45 270 171718.14 1130 3.5262113 0.16152007
5 2015 NATIONAL Aller 361185.82 923 1082529.19 5541 0.3336500 0.01528302
6 2015 Schengen Aller 312271.06 940 505181.07 4190 0.6181369 0.02831411
7 2015 France DOM-TOM Retour 30408.70 23 29024.60 108 1.0476871 0.04798989
8 2015 Hors Schengen Retour 349805.15 225 168429.96 953 2.0768583 0.09513165
9 2015 INTERNATIONAL Retour 193536.63 138 99160.52 678 1.9517509 0.08940104
10 2015 Maghreb Retour 302863.83 110 41677.90 294 7.2667735 0.33285861
11 2015 NATIONAL Retour 471520.80 647 757258.33 3956 0.6226684 0.02852167
12 2015 Schengen Retour 307691.66 422 243204.76 2104 1.2651548 0.05795112
without using Excel.
With R or Python? I don't know if splitting columns like that is possible.
Thanks to all the comments; here is my solution:
I split my data frame into two data frames, df15 (with 2015 data) and df16 (with 2016 data), then:
library(tables)  # tabular() comes from the tables package
mytable15 <- tabular(Heading()*Faisceaux_destination ~ Trajet*(`RED_Groupe` + `Nbr observation RED` + Pond + `Nbr observation total` + RED + pct)*Heading()*(identity), data=df15)
mytable16 <- tabular(Heading()*Faisceaux_destination ~ Trajet*(`RED_Groupe` + `Nbr observation RED` + Pond + `Nbr observation total` + RED + pct)*Heading()*(identity), data=df16)
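For the Python side of the question, a rough pandas equivalent of the tabular() call might look like this (a sketch; it assumes the data is loaded as df with the same column names used in the R code):

import pandas as pd

# spread the Trajet values (Aller / Retour) into a second column level;
# each (destination, Trajet) pair is unique, so the default mean acts as an identity
wide = df.pivot_table(
    index='Faisceaux_destination',
    columns='Trajet',
    values=['RED_Groupe', 'Nbr observation RED', 'Pond',
            'Nbr observation total', 'RED', 'pct'],
)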

ValueError errors while reading JSON file with pd.read_json

I am trying to read a JSON file using pandas:
import pandas as pd
df = pd.read_json('https://data.gov.in/node/305681/datastore/export/json')
I get ValueError: arrays must all be same length
Some other JSON pages show this error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
How do I somehow read the values? I am not particular about data validity.
Looking at the JSON, it is valid, but it's nested with data and fields:
import json
import requests
In [11]: d = json.loads(requests.get('https://data.gov.in/node/305681/datastore/export/json').text)
In [12]: list(d.keys())
Out[12]: ['data', 'fields']
You want the data as the content, and fields as the column names:
In [13]: pd.DataFrame(d["data"], columns=[x["label"] for x in d["fields"]])
Out[13]:
S. No. States/UTs 2008-09 2009-10 2010-11 2011-12 2012-13
0 1 Andhra Pradesh 183446.36 193958.45 201277.09 212103.27 222973.83
1 2 Arunachal Pradesh 360.5 380.15 407.42 419 438.69
2 3 Assam 4658.93 4671.22 4707.31 4705 4709.58
3 4 Bihar 10740.43 11001.77 7446.08 7552 8371.86
4 5 Chhattisgarh 9737.92 10520.01 12454.34 12984.44 13704.06
5 6 Goa 148.61 148 149 149.45 457.87
6 7 Gujarat 12675.35 12761.98 13269.23 14269.19 14558.39
7 8 Haryana 38149.81 38453.06 39644.17 41141.91 42342.66
8 9 Himachal Pradesh 977.3 1000.26 1020.62 1049.66 1069.39
9 10 Jammu and Kashmir 7208.26 7242.01 7725.19 6519.8 6715.41
10 11 Jharkhand 3994.77 3924.73 4153.16 4313.22 4238.95
11 12 Karnataka 23687.61 29094.3 30674.18 34698.77 36773.33
12 13 Kerala 15094.54 16329.52 16856.02 17048.89 22375.28
13 14 Madhya Pradesh 6712.6 7075.48 7577.23 7971.53 8710.78
14 15 Maharashtra 35502.28 38640.12 42245.1 43860.99 45661.07
15 16 Manipur 1105.25 1119 1137.05 1149.17 1162.19
16 17 Meghalaya 994.52 999.47 1010.77 1021.14 1028.18
17 18 Mizoram 411.14 370.92 387.32 349.33 352.02
18 19 Nagaland 831.92 833.5 802.03 703.65 617.98
19 20 Odisha 19940.15 23193.01 23570.78 23006.87 23229.84
20 21 Punjab 36789.7 32828.13 35449.01 36030 37911.01
21 22 Rajasthan 6449.17 6713.38 6696.92 9605.43 10334.9
22 23 Sikkim 136.51 136.07 139.83 146.24 146
23 24 Tamil Nadu 88097.59 108475.73 115137.14 118518.45 119333.55
24 25 Tripura 1388.41 1442.39 1569.45 1650 1565.17
25 26 Uttar Pradesh 10139.8 10596.17 10990.72 16075.42 17073.67
26 27 Uttarakhand 1961.81 2535.77 2613.81 2711.96 3079.14
27 28 West Bengal 33055.7 36977.96 39939.32 43432.71 47114.91
28 29 Andaman and Nicobar Islands 617.58 657.44 671.78 780 741.32
29 30 Chandigarh 272.88 248.53 180.06 180.56 170.27
30 31 Dadra and Nagar Haveli 70.66 70.71 70.28 73 73
31 32 Daman and Diu 18.83 18.9 18.81 19.67 20
32 33 Delhi 1.17 1.17 1.17 1.23 NA
33 34 Lakshadweep 134.64 138.22 137.98 139.86 139.99
34 35 Puducherry 111.69 112.84 113.53 116 112.89
See also json_normalize for more complex JSON DataFrame extraction.
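A minimal sketch of json_normalize on nested records (the data here is made up, just to show the flattening; pandas 1.0+ exposes it as pd.json_normalize, older versions as pandas.io.json.json_normalize):

import pandas as pd

records = [
    {"state": {"name": "Goa", "code": 30}, "gdp": {"2008-09": 148.61}},
    {"state": {"name": "Assam", "code": 18}, "gdp": {"2008-09": 4658.93}},
]
# nested keys become dotted column names: state.name, state.code, gdp.2008-09
flat = pd.json_normalize(records)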
The following listed both the key and value pairs for me:
import json

import pandas as pd
import requests

# the top-level GitHub API response is a flat dict of repo metadata
d = json.loads(requests.get('https://api.github.com/repos/akkhil2012/MachineLearning').text)
data = pd.DataFrame.from_dict(d, orient='index')
print(data)

Pivot tables using pandas

I have the following dataframe:
df1= df[['rsa_units','regions','ssno','veteran','pos_off_ttl','occ_ser','grade','gender','ethnicity','age','age_category','service_time','type_appt','disabled','actn_dt','nat_actn_2_3','csc_auth_12','fy']]
This will produce 1.4 million records; I've taken the first 12:
Eastern Region (R9),Eastern Region (R9),123456789,Non Vet,LBRER,3502,3,Male,White,43.0,Older Gen X'ers,5.0,Temporary,,2009-05-18 00:00:00,115,BDN,2009
Northern Region (R1),Northern Region (R1),234567891,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,7.0,Temporary,,2007-05-27 00:00:00,115,BDN,2007
Northern Region (R1),Northern Region (R1),345678912,Non Vet,FRSTRY AID,0462,3,Male,White,33.0,Younger Gen X'ers,8.0,Temporary,,2006-06-05 00:00:00,115,BDN,2006
Northern Research Station (NRS),Research & Development(RES),456789123,Non Vet,FRSTRY TECHNCN,0462,7,Male,White,37.0,Younger Gen X'ers,10.0,Term,,2006-11-26 00:00:00,702,N6M,2007
Intermountain Region (R4),Intermountain Region (R4),5678912345,Non Vet,BIOLCL SCI TECHNCN,0404,5,Male,White,45.0,Older Gen X'ers,6.0,Temporary,,2008-05-18 00:00:00,115,BWA,2008
Intermountain Region (R4),Intermountain Region (R4),678912345,Non Vet,FRSTRY AID (FIRE),0462,3,Female,White,31.0,Younger Gen X'ers,5.0,Temporary,,2009-05-10 00:00:00,115,BDN,2009
Pacific Southwest Region (R5),Pacific Southwest Region (R5),789123456,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2012-05-06 00:00:00,115,NAM,2012
Pacific Southwest Region (R5),Pacific Southwest Region (R5),891234567,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2011-06-05 00:00:00,115,BDN,2011
Intermountain Region (R4),Intermountain Region (R4),912345678,Non Vet,FRSTRY TECHNCN,0462,5,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2006-04-30 00:00:00,115,BDN,2006
Northern Region (R1),Northern Region (R1),987654321,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2005-04-11 00:00:00,115,BDN,2005
Southwest Region (R3),Southwest Region (R3),876543219,Non Vet,FRSTRY TECHNCN (HOTSHOT/HANDCREW),0462,4,Male,White,30.0,Gen Y Millennial,4.0,Temporary,,2013-03-24 00:00:00,115,NAM,2013
Southwest Region (R3),Southwest Region (R3),765432198,Non Vet,FRSTRY TECHNCN (RECR),0462,4,Male,White,30.0,Gen Y Millennial,5.0,Temporary,,2010-11-21 00:00:00,115,BDN,2011
I then filter on ['nat_actn_2_3'] for certain hiring codes.
h1 = df1[df1['nat_actn_2_3'].isin(['100','101','108','170','171','115','130','140','141','190','702','703'])]
h2 = h1.sort('ssno')
h3 = h2.drop_duplicates(['ssno','actn_dt'])
and can look at value_counts() to see total hires by region.
total_newhires = h3['regions'].value_counts()
total_newhires
produces:
Out[38]:
Pacific Southwest Region (R5) 42255
Pacific Northwest Region (R6) 32081
Intermountain Region (R4) 24045
Northern Region (R1) 22822
Rocky Mountain Region (R2) 17481
Southwest Region (R3) 17305
Eastern Region (R9) 11034
Research & Development(RES) 7337
Southern Region (R8) 7288
Albuquerque Service Center(ASC) 7032
Washington Office(WO) 4837
Alaska Region (R10) 4210
Job Corps(JC) 4010
nda 438
I'd like to do something like in Excel, where I can have ['regions'] as my rows and ['fy'] as the columns, giving me a total count based off ['ssno'] for each ['fy']. It would also be nice to eventually do calculations based off the numbers too, like averages and sums.
Along with looking at examples in the url: http://pandas.pydata.org/pandas-docs/stable/reshaping.html, I've also tried:
hirestable = pivot_table(h3, values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'])
I'm wondering if groupby may be what I'm looking for?
Any help is appreciated. I've spent 3 days on this and can't seem to put it together.
So, based off the answer below, I did a pivot using the following code:
h3.pivot_table(values=['ssno'], rows=['nat_actn_2_3'], cols=['fy'], aggfunc=len)
This produced a somewhat decent result. When I used 'ethnicity' or 'veteran' as a value, my results came out really strange and didn't match my value_counts() numbers. Not sure if the pivot eliminates duplicates or what, but it did not come out correctly.
ssno
fy 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
nat_actn_2_3
100 34 20 25 18 38 43 45 14 19 25 10
101 510 453 725 795 1029 1293 957 383 470 605 145
108 170 132 112 85 123 127 84 43 40 29 10
115 9203 8972 7946 9038 10139 10480 9211 8735 10482 11258 339
130 299 313 431 324 291 325 336 202 230 436 112
140 62 74 71 75 132 125 82 42 45 74 18
141 20 16 23 17 20 14 10 9 13 17 7
170 202 433 226 278 336 386 284 265 121 118 49
171 4771 4627 4234 4196 4470 4472 3270 3145 354 341 34
190 1 1 NaN NaN NaN 1 NaN NaN NaN NaN NaN
702 3141 3099 3429 3030 3758 3952 3813 2902 2329 2375 650
703 2280 2354 2225 2050 2260 2328 2172 2503 2649 2856 726
Try it like this:
h3.pivot_table(values=['ethnicity', 'veteran'], index=['regions'], columns=['fy'], aggfunc=len, fill_value=0)
To get counts, use aggfunc=len.
Also, your isin references a list of strings, but the data you provide in column 'nat_actn_2_3' is int.
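A sketch of one way to handle that dtype mismatch (an assumption about your data; cast the column so both sides compare as strings):

codes = ['100', '101', '108', '170', '171', '115',
         '130', '140', '141', '190', '702', '703']
# cast to str so int-typed codes still match the string list
h1 = df1[df1['nat_actn_2_3'].astype(str).isin(codes)]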
If you have an older version of pandas, try:
h3.pivot_table(values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'], aggfunc=len, fill_value=0)
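Since the question mentions groupby: a groupby/unstack sketch that should give the same counts as the pivot, assuming h3 as built above:

counts = h3.groupby(['regions', 'fy'])['ssno'].count().unstack('fy', fill_value=0)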
