GroupBY frequency counts JSON response - nested field - python

I'm trying to aggregate the response from an API call that returns a JSON object and get some frequency counts.
I've managed to do it for one of the fields in the JSON response, but a second field that I want to treat the same way isn't working.
Both fields are called "category", but the one that isn't working is nested within "outcome_status".
The error I get is KeyError: 'category'.
The code below uses a public API that does not require authentication, so it can be tested easily.
import simplejson
import requests
#make a polygon for use in the API call
lat_coord = 51.767538
long_coord = -1.497488
lat_upper = str(lat_coord + 0.02)
lat_lower = str(lat_coord - 0.02)
long_upper = str(long_coord + 0.02)
long_lower = str(long_coord - 0.02)
#call from the API - no authentication required
api_call="https://data.police.uk/api/crimes-street/all-crime?poly=" + lat_lower + "," + long_upper + ":" + lat_lower + "," + long_lower + ":" + lat_upper + "," + long_lower + ":" + lat_upper + "," + long_upper + "&date=2017-01"
print (api_call)
request_resp=requests.get(api_call).json()
import pandas as pd
import numpy as np
df_resp = pd.DataFrame(request_resp)
#frequency counts for non-nested field (this works)
df_resp.groupby('category').context.count()
#next bit tries to do the nested (this doesn't work)
#tried dropping nulls
df_outcome = df_resp['outcome_status'].dropna()
print(df_outcome)
#tried index reset
df_outcome.reset_index()
#just errors
df_outcome.groupby('category').date.count()

I think you will have the easiest time if you expand the dicts in the "outcome_status" column, like:
Code:
outcome_status = [
    {'outcome_status_' + k: v for k, v in z.items()} for z in (
        dict(category=None, date=None) if x is None else x
        for x in (y['outcome_status'] for y in request_resp)
    )
]
df = pd.concat([df_resp.drop('outcome_status', axis=1),
                pd.DataFrame(outcome_status)], axis=1)
This uses some comprehensions to rename the fields in outcome_status by prepending "outcome_status_" to the key names and turning them into columns. It also handles None values.
Test Code:
import requests
import pandas as pd
# make a polygon for use in the API call
lat_coord = 51.767538
long_coord = -1.497488
lat_upper = str(lat_coord + 0.02)
lat_lower = str(lat_coord - 0.02)
long_upper = str(long_coord + 0.02)
long_lower = str(long_coord - 0.02)
# call from the API - no authentication required
api_call = ("https://data.police.uk/api/crimes-street/all-crime?poly=" +
            lat_lower + "," + long_upper + ":" +
            lat_lower + "," + long_lower + ":" +
            lat_upper + "," + long_lower + ":" +
            lat_upper + "," + long_upper + "&date=2017-01")
request_resp = requests.get(api_call).json()
df_resp = pd.DataFrame(request_resp)
outcome_status = [
    {'outcome_status_' + k: v for k, v in z.items()} for z in (
        dict(category=None, date=None) if x is None else x
        for x in (y['outcome_status'] for y in request_resp)
    )
]
df = pd.concat([df_resp.drop('outcome_status', axis=1),
                pd.DataFrame(outcome_status)], axis=1)
# the groupby that previously errored now works on the expanded column
print(df.groupby('outcome_status_category').category.count())
Results:
outcome_status_category
Court result unavailable 4
Investigation complete; no suspect identified 38
Local resolution 1
Offender given a caution 2
Offender given community sentence 3
Offender given conditional discharge 1
Offender given penalty notice 2
Status update unavailable 6
Suspect charged as part of another case 1
Unable to prosecute suspect 9
Name: category, dtype: int64
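As an aside to the comprehension approach above: newer pandas versions ship pandas.json_normalize, which flattens nested records in one call. A minimal sketch on hand-made records shaped like the API response (the records below are invented for illustration, not real API output):

```python
import pandas as pd

# Toy records shaped like the police-API response (invented for illustration)
records = [
    {"category": "burglary", "context": "",
     "outcome_status": {"category": "Local resolution", "date": "2017-01"}},
    {"category": "shoplifting", "context": "",
     "outcome_status": {"category": "Unable to prosecute suspect", "date": "2017-01"}},
]

# sep="_" yields the same 'outcome_status_category' column name as above
df = pd.json_normalize(records, sep="_")
print(df.groupby("outcome_status_category").category.count())
```

Note that json_normalize was promoted to the top-level pandas namespace in pandas 1.0; on older versions it lives under pandas.io.json.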

Pandas - Add List to multiple columns (for multiple rows)

I have a list of values that I want to write into multiple columns; this is fine for a single row. However, when I try to update over multiple rows it simply overwrites the whole column with the last value.
The list for each row looks like the one below (note: the list length varies):
['2016-03-16T09:53:05',
'2016-03-16T16:13:33',
'2016-03-17T13:30:31',
'2016-03-17T13:39:09',
'2016-03-17T16:59:01',
'2016-03-23T12:20:47',
'2016-03-23T13:22:58',
'2016-03-29T17:26:26',
'2016-03-30T09:08:17']
I can store this in empty columns by using:
for i in range(len(trans_dates)):
    df[('T' + str(i + 1) + ' - Date')] = trans_dates[i]
However, this updates the whole column with the single trans_dates[i] value.
I thought looping over each row with the above code would work, but it still overwrites:
for issues in all_issues:
    for i in range(len(trans_dates)):
        df[('T' + str(i + 1) + ' - Date')] = trans_dates[i]
How do I only update my current row in the loop?
Am I even going about this the right way? Or is there a faster vectorised way of doing it?
Full code snippet below:
for issues in all_issues:
    print(issues)
    changelog = issues.changelog
    trans_dates = []
    from_status = []
    to_status = []
    for history in changelog.histories:
        for item in history.items:
            if item.field == 'status':
                trans_dates.append(history.created[:19])
                from_status.append(item.fromString)
                to_status.append(item.toString)
    trans_dates = list(reversed(trans_dates))
    from_status = list(reversed(from_status))
    to_status = list(reversed(to_status))
    print(trans_dates)
    # Store raw data in created columns and convert dates to pd.to_datetime
    for i in range(len(trans_dates)):
        df[('T' + str(i + 1) + ' - Date')] = trans_dates[i]
    for i in range(len(to_status)):
        df[('T' + str(i + 1) + ' - To')] = to_status[i]
    for i in range(len(from_status)):
        df[('T' + str(i + 1) + ' - From')] = from_status[i]
    for i in range(len(trans_dates)):
        df['T' + str(i + 1) + ' - Date'] = pd.to_datetime(df['T' + str(i + 1) + ' - Date'])
EDIT: Sample input and output added.
input:
issue/row #1 list (note year changes):
['2016-03-16T09:53:05',
'2016-03-16T16:13:33',
'2016-03-17T13:30:31',
'2016-03-17T13:39:09']
issue #2
['2017-03-16T09:53:05',
'2017-03-16T16:13:33',
'2017-03-17T13:30:31']
issue #3
['2018-03-16T09:53:05',
'2018-03-16T16:13:33',
'2018-03-17T13:30:31']
issue #4
['2015-03-16T09:53:05',
'2015-03-16T16:13:33']
output:
col  T1                     T2                     T3                     T4
17   '2016-03-16T09:53:05'  '2016-03-16T16:13:33'  '2016-03-17T13:30:31'  '2016-03-17T13:39:09'
18   '2017-03-16T09:53:05'  '2017-03-16T16:13:33'  '2017-03-17T13:30:31'  np.nan
19   '2018-03-16T09:53:05'  '2018-03-16T16:13:33'  '2018-03-17T13:30:31'  np.nan
20   '2015-03-16T09:53:05'  '2015-03-16T16:13:33'  np.nan                 np.nan
Instead of this:
for i in range(len(trans_dates)):
    df[('T' + str(i + 1) + ' - Date')] = trans_dates[i]
Try this, tracking which row the current issue belongs to and passing both a row and a column label to .loc:
for row, issues in enumerate(all_issues):
    ...
    for i in range(len(trans_dates)):
        df.loc[row, 'T' + str(i + 1) + ' - Date'] = trans_dates[i]
There are probably better ways to do this... df.merge or df.replace come to mind... it would be helpful if you posted what the input dataframe looked like and what the expected result is.
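To make the difference concrete, here is a runnable toy version (the data is invented): plain column assignment broadcasts one value to every row, while .loc[row, col] writes a single cell and leaves untouched cells as NaN. As a vectorised alternative, pd.DataFrame on a list of ragged lists NaN-pads the short rows in one shot:

```python
import pandas as pd

# One variable-length date list per issue (invented sample data)
issues = [
    ['2016-03-16T09:53:05', '2016-03-16T16:13:33'],
    ['2017-03-16T09:53:05'],
]

df = pd.DataFrame(index=range(len(issues)))
for row, trans_dates in enumerate(issues):
    for i, date in enumerate(trans_dates):
        # writes exactly one cell; cells never written stay NaN
        df.loc[row, 'T' + str(i + 1) + ' - Date'] = date
print(df)

# Vectorised alternative: ragged rows are NaN-padded automatically
df2 = pd.DataFrame(issues).rename(columns=lambda c: 'T' + str(c + 1) + ' - Date')
print(df2)
```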

reduce for loop time in dataframe operation

To see a sample response, open https://bittrex.com/Api/v2.0/pub/market/GetTicks?marketName=BTC-ETH&tickInterval=thirtyMin in a browser. I have 275 markets in market_list and 330 intervals in time_series. The GetTicks API returns thousands of dicts per market. I am only interested in records where an interval in time_series matches the 'T' value in the GetTicks response; where there is no match, I set the respective 'BV'/'L' values in the master df to 0. Each loop iteration takes about 3 seconds, giving a total runtime of around 20-25 minutes. Is there a better, more pythonic way to construct this master df in less time? I appreciate any help/suggestions.
My code:
for (mkt, market_pair) in enumerate(market_list):
    getTicks = requests.get("https://bittrex.com/Api/v2.0/pub/market/GetTicks?marketName=" +
                            str(market_pair) + "&tickInterval=thirtyMin")
    getTicks_result = (getTicks.json())["result"]
    print(mkt + 1, '/', len_market_list, market_pair, "API called", datetime.utcnow().strftime('%H:%M:%S.%f'))
    first_df = pd.DataFrame(getTicks_result)
    first_df.set_index('T', inplace=True)
    for tk, interval in enumerate(time_series):
        if interval in first_df.index:
            master_bv_df.loc[market_pair, interval] = first_df.loc[interval, 'BV']
            bv_sh.cell(row=mkt + 2, column=tk + 3).value = first_df.loc[interval, 'BV']
            master_lp_df.loc[market_pair, interval] = first_df.loc[interval, 'L']
            lp_sh.cell(row=mkt + 2, column=tk + 3).value = first_df.loc[interval, 'L']
        else:
            master_bv_df.loc[market_pair, interval] = master_lp_df.loc[market_pair, interval] = 0
            bv_sh.cell(row=mkt + 2, column=tk + 3).value = lp_sh.cell(row=mkt + 2, column=tk + 3).value = 0
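One way to drop the inner 330-iteration loop entirely is to align each market's ticks to the full interval list with Series.reindex, which fills missing intervals with 0 in a single call. A sketch on toy stand-in data (the master-df and openpyxl worksheet writes are omitted; names and values are invented):

```python
import pandas as pd

# Toy stand-ins for time_series and one market's GetTicks result
time_series = ['2018-01-01T00:00:00', '2018-01-01T00:30:00', '2018-01-01T01:00:00']
first_df = pd.DataFrame({
    'T':  ['2018-01-01T00:00:00', '2018-01-01T01:00:00'],
    'BV': [1.5, 2.5],
    'L':  [0.1, 0.2],
}).set_index('T')

# reindex aligns to the full interval list; intervals absent from 'T' become 0
bv_row = first_df['BV'].reindex(time_series, fill_value=0)
lp_row = first_df['L'].reindex(time_series, fill_value=0)
print(bv_row.tolist())  # [1.5, 0.0, 2.5]
```

The whole row could then be assigned in one statement, e.g. master_bv_df.loc[market_pair] = bv_row.values, so each market needs one alignment instead of 330 .loc lookups; the per-cell worksheet writes would still need a loop, but over a plain list.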

Same equation gives different values in Matlab and Numpy?

I'm trying to convert a function from Matlab to Python. The Matlab function is:
function [f,df_dr1,df_dr2,g,dg_dr1,dg_dr2] = f_eval_2eq(r1,r2,r3,z1,z2,z3,n1,n2,n3)
f = (r1)./sqrt(r1.^2 + z1.^2)...
- (n2/n1)*(r2-r1)./sqrt((r2-r1).^2 + z2.^2);
df_dr1 = 1./sqrt(r1.^2 + z1.^2)...
- r1.^2./(r1.^2 + z1.^2).^(3/2)...
+ (n2/n1)./sqrt(z2.^2 + (r1-r2).^2)...
- (n2/n1).*(r1-r2).*(2*r1-2*r2)./(2*((r1-r2).^2 + z2.^2).^(3/2));
df_dr2 = (n2/n1).*(r1-r2).*(2*r1-2*r2)./(2*((r1-r2).^2 + z2.^2).^(3/2))...
- (n2/n1)./sqrt(z2.^2 + (r1-r2).^2);
g = (r2-r1)./sqrt((r2-r1).^2 + z2.^2)...
- (n3/n2)*(r3-r2)./sqrt((r3-r2).^2 + z3.^2);
dg_dr1 = (r1-r2).*(2*r1-2*r2)./(2*((r1-r2).^2 + z2.^2).^(3/2))...
- 1./sqrt(z2.^2 + (r1-r2).^2);
dg_dr2 = 1./sqrt((r1-r2).^2 + z2.^2)...
+ (n3/n2)./sqrt(z3.^2 + (r2-r3).^2)...
- (r1-r2).*(2*r1-2*r2)./(2*((r1-r2).^2 + z2.^2).^(3/2))...
- (n3/n2).*(r2-r3).*(2*r2-2*r3)./(2*((r2-r3).^2 + z3.^2).^(3/2));
end
%test code
K>> a=[1,2,3];b=a+1;c=b+1;d=a;e=b;f=c;g=1;h=2;i=3;
K>> [f,df_dr1,df_dr2,g,dg_dr1,dg_dr2] = f_eval_2eq(a,b,c,d,e,f,g,h,i)
The Python function I wrote is:
def f_eval_2eq(r1, r2, r3, z1, z2, z3, n1, n2, n3):
    # evaluate gradients
    # n_ are scalars
    f = (r1)/np.sqrt(r1**2 + z1**2) \
        - (n2/n1)*(r2-r1)/np.sqrt((r2-r1)**2 + z2**2)
    df_dr1 = 1/np.sqrt(r1**2 + z1**2) \
        - r1**2/((r1**2 + z1**2)**(3/2)) \
        + (n2/n1)/np.sqrt(z2**2 + (r1-r2)**2) \
        - (n2/n1)*(r1-r2)*(2*r1-2*r2)/(2*((r1-r2)**2 + z2**2)**(3/2))
    df_dr2 = (n2/n1)*(r1-r2)*(2*r1-2*r2)/(2*((r1-r2)**2 + z2**2)**(3/2)) \
        - (n2/n1)/np.sqrt(z2**2 + (r1-r2)**2)
    g = (r2-r1)/np.sqrt((r2-r1)**2 + z2**2) \
        - (n3/n2)*(r3-r2)/np.sqrt((r3-r2)**2 + z3**2)
    dg_dr1 = (r1-r2)*(2*r1-2*r2)/(2*((r1-r2)**2 + z2**2)**(3/2)) \
        - 1/np.sqrt(z2**2 + (r1-r2)**2)
    dg_dr2 = 1/np.sqrt((r1-r2)**2 + z2**2) \
        + (n3/n2)/np.sqrt(z3**2 + (r2-r3)**2) \
        - (r1-r2)*(2*r1-2*r2)/(2*((r1-r2)**2 + z2**2)**(3/2)) \
        - (n3/n2)*(r2-r3)*(2*r2-2*r3)/(2*((r2-r3)**2 + z3**2)**(3/2))
    return (f, df_dr1, df_dr2, g, dg_dr1, dg_dr2)
#test code
A=np.array([1,2,3])
B=A+1
C=B+1
D=A
E=B
F=C
G=1
H=2
I=3
[f,df_dr1,df_dr2,g,dg_dr1,dg_dr2] =f_eval_2eq(A,B,C,D,E,F,G,H,I)
print ('f= '+str(f) +'\n'+'df_dr1= '+str(df_dr1) +'\n' +'df_dr2='+str(df_dr2) +'\n'+'g= '+str(g) +'\n'+'dg_dr1= '+str(dg_dr1) +'\n'+'dg_dr2= '+str(dg_dr2) +'\n')
The output for f is the same in both, but all the other values are different and I can't figure out why.
Any help is appreciated.
In Python 2.x, if you divide two integers (such as 3 and 2), the result is floored to an integer as well:
x = 3/2
# 1
type(x)
# <type 'int'>
You need to explicitly make either the numerator or the denominator a float by adding a decimal point, and the output will then be a float as well.
y = 3./2
# 1.5
type(y)
# <type 'float'>
Alternately, as suggested by @rayryeng, you can place the following at the top of your file if you're using Python 2, in order to get the Python 3 behavior, i.e. always using float division:
from __future__ import division
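This is exactly what bites the Matlab port above: under Python 2 every exponent written as (3/2) evaluates to 1, so terms like (...)**(3/2) silently become first powers while f, which contains no such exponent, matches. A quick check (runs identically under Python 3, where / is already true division):

```python
from __future__ import division  # no-op on Python 3

print(3 / 2)    # 1.5 (would be 1 without the import on Python 2)
print(3 // 2)   # 1 (floor division, if an integer result is actually wanted)

# the kind of denominator exponent used throughout the ported function
print(5 ** (3 / 2))  # ~11.18, not 5 ** 1 == 5
```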

Algorithm to invert strings of algebraic expressions in Python

Is there an easy way to make a function that inverts an algebraic expression, for example like this:
>>> value = inverse("y = 2*x+3")
>>> print(value)
"x = (y-3)/2"
If you can't produce actual code for the function, please recommend tools that would make this task easier. The function would only be used to invert expressions containing +, -, * and /.
You should try SymPy for doing that:
from sympy import solve
from sympy.abc import x, y
e = 2*x+3-y
solve(e,x)
#[y/2 - 3/2]
solve(e,y)
#[2*x + 3]
Based on this, you can build your inverse() like (works for two variables):
def inverse(string, left_string=None):
    from sympy import solve, Symbol, sympify
    string = '-' + string
    e = sympify(string.replace('=', '+'))
    if left_string:
        ans = left_string + ' = ' + str(solve(e, sympify(left_string))[0])
    else:
        left = sympify(string.split('=')[0].strip().replace('-', ''))
        symbols = e.free_symbols
        symbols.remove(left)
        right = list(symbols)[0]
        ans = str(right) + ' = ' + str(solve(e, right)[0])
    return ans
Examples:
inverse(' x = 4*y/2')
#'y = x/2'
inverse(' y = 100/x + x**2')
#'x = -y/(3*(sqrt(-y**3/27 + 2500) + 50)**(1/3)) - (sqrt(-y**3/27 + 2500) + 50)**(1/3)'
inverse("screeny = (isox+isoy)*29/2.0344827586206895", "isoy")
#'isoy = -isox + 0.0701545778834721*screeny'
This is a little long for a comment, but here's the sort of thing I had in mind:
import sympy

def inverse(s):
    terms = [sympy.sympify(term) for term in s.split("=")]
    eqn = sympy.Eq(*terms)
    var_to_solve_for = min(terms[1].free_symbols)
    solns = sympy.solve(eqn, var_to_solve_for)
    output_eqs = [sympy.Eq(var_to_solve_for, soln) for soln in solns]
    return output_eqs
After which we have
>>> inverse("y = 2*x+3")
[x == y/2 - 3/2]
>>> inverse("x = 100/z + z**2")
[z == -x/(3*(sqrt(-x**3/27 + 2500) + 50)**(1/3)) - (sqrt(-x**3/27 + 2500) + 50)**(1/3), z == -x/(3*(-1/2 - sqrt(3)*I/2)*(sqrt(-x**3/27 + 2500) + 50)**(1/3)) - (-1/2 - sqrt(3)*I/2)*(sqrt(-x**3/27 + 2500) + 50)**(1/3),
z == -x/(3*(-1/2 + sqrt(3)*I/2)*(sqrt(-x**3/27 + 2500) + 50)**(1/3)) - (-1/2 + sqrt(3)*I/2)*(sqrt(-x**3/27 + 2500) + 50)**(1/3)]
etc.

Factor/collect expression in Sympy

I have an equation like:
     R₂⋅V₁ + R₃⋅V₁ - R₃⋅V₂
i₁ = ─────────────────────
     R₁⋅R₂ + R₁⋅R₃ + R₂⋅R₃
defined and I'd like to split it into factors that include only single variable - in this case V1 and V2.
So as a result I'd expect
             -R₃                       (R₂ + R₃)
i₁ = V₂⋅───────────────────── + V₁⋅─────────────────────
        R₁⋅R₂ + R₁⋅R₃ + R₂⋅R₃      R₁⋅R₂ + R₁⋅R₃ + R₂⋅R₃
But the best I could get so far is
     -R₃⋅V₂ + V₁⋅(R₂ + R₃)
i₁ = ─────────────────────
     R₁⋅R₂ + R₁⋅R₃ + R₂⋅R₃
using equation.factor(V1,V2). Is there some other option to factor or another method to separate the variables even further?
If it were possible to exclude something (the denominator, in this case) from the factor algorithm, this would be easy. I don't know a way to do that, so here is a manual solution:
In [1]: a
Out[1]:
r₁⋅v₁ + r₂⋅v₂ + r₃⋅v₂
─────────────────────
r₁⋅r₂ + r₁⋅r₃ + r₂⋅r₃
In [2]: b,c = factor(a,v2).as_numer_denom()
In [3]: b.args[0]/c + b.args[1]/c
Out[3]:
        r₁⋅v₁                  v₂⋅(r₂ + r₃)
───────────────────── + ─────────────────────
r₁⋅r₂ + r₁⋅r₃ + r₂⋅r₃   r₁⋅r₂ + r₁⋅r₃ + r₂⋅r₃
You may also look at the evaluate=False options in Add and Mul, to build those expressions manually. I don't know of a nice general solution.
In[3] can be a list comprehension if you have many terms.
You may also check if it is possible to treat this as multivariate polynomial in v1 and v2. It may give a better solution.
Here I have sympy 0.7.2 installed and the sympy.collect() works for this purpose:
import sympy
r1, r2, r3, v1, v2 = sympy.symbols('r1 r2 r3 v1 v2')
i1 = (r2*v1 + r3*v1 - r3*v2)/(r1*r2 + r1*r3 + r2*r3)
sympy.pretty_print(sympy.collect(i1, (v1, v2)))
# -r3*v2 + v1*(r2 + r3)
# ---------------------
# r1*r2 + r1*r3 + r2*r3
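For completeness, here is a sketch of a variant that produces exactly the fully separated form asked for, one fraction per variable: split the numerator with .coeff for each variable and divide each piece by the common denominator. SymPy does not automatically re-combine a sum of fractions, so the terms stay apart:

```python
import sympy

r1, r2, r3, v1, v2 = sympy.symbols('r1 r2 r3 v1 v2')
i1 = (r2*v1 + r3*v1 - r3*v2) / (r1*r2 + r1*r3 + r2*r3)

# Pull out numerator and denominator, then rebuild one fraction per variable
num, den = sympy.fraction(sympy.together(i1))
split = sum(num.coeff(v) * v / den for v in (v1, v2))
sympy.pretty_print(split)
```

The rebuilt sum is mathematically equal to i1 (sympy.simplify(split - i1) gives 0), but is displayed as v1*(r2 + r3)/den plus -r3*v2/den rather than as a single fraction.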
