Pandas - Add List to multiple columns (for multiple rows) - python

I have a list of values that I want to write into multiple columns. This is fine for a single row; however, when I try to update multiple rows, it simply overwrites each whole column with the last value.
The list for each row looks like the one below (note: the list length varies):
['2016-03-16T09:53:05',
'2016-03-16T16:13:33',
'2016-03-17T13:30:31',
'2016-03-17T13:39:09',
'2016-03-17T16:59:01',
'2016-03-23T12:20:47',
'2016-03-23T13:22:58',
'2016-03-29T17:26:26',
'2016-03-30T09:08:17']
I can store this in empty columns by using:
for i in range(len(trans_dates)):
    df[('T' + str(i + 1) + ' - Date')] = trans_dates[i]
However, this updates the whole column with the single trans_dates[i] value.
I thought looping over each row with the above code would work, but it still overwrites:
for issues in all_issues:
    for i in range(len(trans_dates)):
        df[('T' + str(i + 1) + ' - Date')] = trans_dates[i]
How do I only update my current row in the loop?
Am I even going about this the right way? Or is there a faster vectorised way of doing it?
Full code snippet below:
for issues in all_issues:
    print(issues)
    changelog = issues.changelog
    trans_dates = []
    from_status = []
    to_status = []
    for history in changelog.histories:
        for item in history.items:
            if item.field == 'status':
                trans_dates.append(history.created[:19])
                from_status.append(item.fromString)
                to_status.append(item.toString)
    trans_dates = list(reversed(trans_dates))
    from_status = list(reversed(from_status))
    to_status = list(reversed(to_status))
    print(trans_dates)
    # Store raw data in the created columns and convert the dates with pd.to_datetime
    for i in range(len(trans_dates)):
        df[('T' + str(i + 1) + ' - Date')] = trans_dates[i]
    for i in range(len(to_status)):
        df[('T' + str(i + 1) + ' - To')] = to_status[i]
    for i in range(len(from_status)):
        df[('T' + str(i + 1) + ' - From')] = from_status[i]
    for i in range(len(trans_dates)):
        df['T' + str(i + 1) + ' - Date'] = pd.to_datetime(df['T' + str(i + 1) + ' - Date'])
EDIT: Sample input and output added.
input:
issue/row #1 list (note year changes):
['2016-03-16T09:53:05',
'2016-03-16T16:13:33',
'2016-03-17T13:30:31',
'2016-03-17T13:39:09']
issue #2
['2017-03-16T09:53:05',
'2017-03-16T16:13:33',
'2017-03-17T13:30:31']
issue #3
['2018-03-16T09:53:05',
'2018-03-16T16:13:33',
'2018-03-17T13:30:31']
issue #4
['2015-03-16T09:53:05',
'2015-03-16T16:13:33']
output:
col  T1                     T2                     T3                     T4
17   '2016-03-16T09:53:05'  '2016-03-16T16:13:33'  '2016-03-17T13:30:31'  '2016-03-17T13:39:09'
18   '2017-03-16T09:53:05'  '2017-03-16T16:13:33'  '2017-03-17T13:30:31'  np.nan
19   '2018-03-16T09:53:05'  '2018-03-16T16:13:33'  '2018-03-17T13:30:31'  np.nan
20   '2015-03-16T09:53:05'  '2015-03-16T16:13:33'  np.nan                 np.nan

Instead of this:
for i in range(len(trans_dates)):
    df[('T' + str(i + 1) + ' - Date')] = trans_dates[i]
Try this:
for i in range(len(trans_dates)):
    df.loc[i, ('T' + str(i + 1) + ' - Date')] = trans_dates[i]
There are probably better ways to do this... df.merge or df.replace come to mind... it would be helpful if you posted what the input dataframe looked like and what the expected result is.
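Building on that, one way to avoid per-cell assignment altogether is sketched below. It assumes all_issues is in the same order as the rows of df, and build_trans_dates is a hypothetical stand-in for the changelog loop shown in the question; it collects one dict per issue and lets the DataFrame constructor pad the shorter lists:
import pandas as pd

rows = []
for issues in all_issues:
    # trans_dates is built per issue exactly as in the question's changelog loop;
    # build_trans_dates is a hypothetical helper standing in for that loop.
    trans_dates = build_trans_dates(issues)
    rows.append({'T' + str(i + 1) + ' - Date': pd.to_datetime(d)
                 for i, d in enumerate(trans_dates)})

dates_df = pd.DataFrame(rows)   # shorter lists are padded automatically
dates_df.index = df.index       # assumes all_issues is ordered like df's rows
df = pd.concat([df, dates_df], axis=1)
Issues with fewer transitions end up with NaT in the later 'T... - Date' columns, which plays the role of the np.nan in the expected output.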

Related

Selecting columns using [[]] is very inefficient, especially as the size of the dataset increases (Python, pandas)

I created sample data using the function below:
import random
import time

import pandas as pd
import matplotlib.pyplot as plt

def create_sample(num_of_rows=1000):
    # num_of_rows is the number of records to generate
    data = {
        'var1': [random.uniform(0.0, 1.0) for x in range(num_of_rows)],
        'other': [random.uniform(0.0, 1.0) for x in range(num_of_rows)]
    }
    df = pd.DataFrame(data)
    print("Shape : {}".format(df.shape))
    print("Type : \n{}".format(df.dtypes))
    return df
df = create_sample()
times = []
for i in range(1, 300):
    start = time.time()
    # Make the dataframe 1 column bigger
    df['var' + str(i + 1)] = df['var' + str(i)]
    # Select two columns from the dataframe using double square brackets
    ####################################################
    temp = df[['var' + str(i + 1), 'var' + str(i)]]
    ####################################################
    end = time.time()
    times.append(end - start)
    start = end
plt.plot(times)
print(sum(times))
The graph is linear
When I use pd.concat to select the columns instead, the graph shows peaks at every 100 iterations. Why is this so? The code is below:
df = create_sample()
times = []
for i in range(1, 300):
    start = time.time()
    # Make the dataframe 1 column bigger
    df['var' + str(i + 1)] = df['var' + str(i)]
    # Select two columns from the dataframe using pd.concat
    ####################################################
    temp = pd.concat([df['var' + str(i + 1)], df['var' + str(i)]], axis=1)
    ####################################################
    end = time.time()
    times.append(end - start)
    start = end
plt.plot(times)
print(sum(times))
From the above we can see that the time taken to select columns with [[]] increases linearly as the dataframe grows. However, using pd.concat the time does not increase materially, apart from spikes at every 100 iterations. Why do the spikes occur only every 100 iterations? This is not obvious to me.

Trying to sum/combine a whole lot of columns faster/easier... help appreciated

I'm trying to sum columns into groups of 30 (one month). Each column is a day, and there are almost 2,000 columns.
Each row is an individual product, and there are about 30,000 of them.
Below is what I am doing to sum them in Jupyter.
Is there an easier/faster way to do this without having to repeat what I did below 60 more times?
Month1 = (df_sales["d_1"] + df_sales["d_2"] + df_sales["d_3"] + df_sales["d_4"] + df_sales["d_5"] + df_sales["d_6"] + df_sales["d_7"] + df_sales["d_8"] + df_sales["d_9"] + df_sales["d_10"]
+ df_sales["d_11"] + df_sales["d_12"] + df_sales["d_13"] + df_sales["d_14"] + df_sales["d_15"] + df_sales["d_16"] + df_sales["d_17"] + df_sales["d_18"] + df_sales["d_19"] + df_sales["d_20"]
+ df_sales["d_21"] + df_sales["d_22"] + df_sales["d_23"] + df_sales["d_24"] + df_sales["d_25"] + df_sales["d_26"] + df_sales["d_27"] + df_sales["d_28"] + df_sales["d_29"] + df_sales["d_30"])
Month1 = df_sales.loc[:, "d_1":"d_30"].sum(axis=1)
If every month in your table has 30 days (columns) and you start with the first column, you may perform
all_months = pd.concat((df_sales.iloc[:, i:i+30].sum(axis=1)
                        for i in range(0, df_sales.shape[1], 30)),
                       axis=1)
to obtain a dataframe of all the monthly sums.
Replace
range(0, df_sales.shape[1], 30)
with
range(n, df_sales.shape[1], 30)
if your days start in column n (be aware that the first column has number 0).
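If you would rather avoid the generator, here is a sketch of an alternative. It assumes the day columns are contiguous, start at "d_1" and run to the end of the frame: label each column position with its month number and sum the transposed frame by that grouping.
import numpy as np

day_cols = df_sales.loc[:, "d_1":]                 # assumes the day columns run to the end
month_of_col = np.arange(day_cols.shape[1]) // 30  # 0 for d_1..d_30, 1 for d_31..d_60, ...
all_months = day_cols.T.groupby(month_of_col).sum().T
all_months.columns = ['Month' + str(i + 1) for i in all_months.columns]
This does the same thing as the pd.concat answer above; it just trades the generator for a single groupby.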

reduce for loop time in dataframe operation

To see a sample response, check https://bittrex.com/Api/v2.0/pub/market/GetTicks?marketName=BTC-ETH&tickInterval=thirtyMin in a browser. I have 275 values in market_list and 330 intervals in time_series. The GetTicks API returns thousands of dicts. I am only interested in the records where an interval in time_series matches the 'T' value in the GetTicks response; if an interval in time_series does not match any 'T' value, I set the respective 'BV'/'L' values in the master df to 0. Each loop iteration takes about 3 seconds, giving an overall execution time of around 20-25 minutes. Is there a better, more Pythonic way to construct this master df in less time? I appreciate any help or suggestions.
My code:
for (mkt, market_pair) in enumerate(market_list):
    getTicks = requests.get("https://bittrex.com/Api/v2.0/pub/market/GetTicks?marketName=" + str(
        market_pair) + "&tickInterval=thirtyMin")
    getTicks_result = (getTicks.json())["result"]
    print(mkt + 1, '/', len_market_list, market_pair, "API called", datetime.utcnow().strftime('%H:%M:%S.%f'))
    first_df = pd.DataFrame(getTicks_result)
    first_df.set_index('T', inplace=True)
    for tk, interval in enumerate(time_series):
        if interval in first_df.index:
            master_bv_df.loc[market_pair, interval] = first_df.loc[interval, 'BV']
            bv_sh.cell(row=mkt + 2, column=tk + 3).value = first_df.loc[interval, 'BV']
            master_lp_df.loc[market_pair, interval] = first_df.loc[interval, 'L']
            lp_sh.cell(row=mkt + 2, column=tk + 3).value = first_df.loc[interval, 'L']
        else:
            master_bv_df.loc[market_pair, interval] = master_lp_df.loc[market_pair, interval] = 0
            bv_sh.cell(row=mkt + 2, column=tk + 3).value = lp_sh.cell(row=mkt + 2, column=tk + 3).value = 0
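One way to drop the inner loop entirely is sketched below. It assumes master_bv_df and master_lp_df are indexed by market_pair with the time_series values as columns, and that the 'T' values in each response are unique: align each response to the full interval list with reindex and write the whole row at once.
for mkt, market_pair in enumerate(market_list):
    resp = requests.get("https://bittrex.com/Api/v2.0/pub/market/GetTicks?marketName=" + str(
        market_pair) + "&tickInterval=thirtyMin").json()["result"]
    first_df = pd.DataFrame(resp).set_index('T')

    # Intervals missing from the response become 0 instead of being handled one by one.
    bv_row = first_df['BV'].reindex(time_series, fill_value=0)
    lp_row = first_df['L'].reindex(time_series, fill_value=0)

    master_bv_df.loc[market_pair, time_series] = bv_row.values
    master_lp_df.loc[market_pair, time_series] = lp_row.values
The worksheet cells could then be filled in one pass from the finished DataFrames instead of cell by cell inside the loop.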

GroupBY frequency counts JSON response - nested field

I'm trying to aggregate the response from an API call that returns a JSON object and get some frequency counts.
I've managed to do it for one of the fields in the JSON response, but a second field that I want to do the same thing with isn't working.
Both fields are called "category", but the one that isn't working is nested within "outcome_status".
The error I get is KeyError: 'category'.
The code below uses a public API that does not require authentication, so it can be tested easily.
import simplejson
import requests
#make a polygon for use in the API call
lat_coord = 51.767538
long_coord = -1.497488
lat_upper = str(lat_coord + 0.02)
lat_lower = str(lat_coord - 0.02)
long_upper = str(long_coord + 0.02)
long_lower = str(long_coord - 0.02)
#call from the API - no authentication required
api_call="https://data.police.uk/api/crimes-street/all-crime?poly=" + lat_lower + "," + long_upper + ":" + lat_lower + "," + long_lower + ":" + lat_upper + "," + long_lower + ":" + lat_upper + "," + long_upper + "&date=2017-01"
print (api_call)
request_resp=requests.get(api_call).json()
import pandas as pd
import numpy as np
df_resp = pd.DataFrame(request_resp)
#frequency counts for non-nested field (this works)
df_resp.groupby('category').context.count()
#next bit tries to do the nested (this doesn't work)
#tried dropping nulls
df_outcome = df_resp['outcome_status'].dropna()
print(df_outcome)
#tried index reset
df_outcome.reset_index()
#just errors
df_outcome.groupby('category').date.count()
I think you will have the easiest time of it if you expand the dict in the "outcome_status" column, like:
Code:
outcome_status = [
    {'outcome_status_' + k: v for k, v in z.items()} for z in (
        dict(category=None, date=None) if x is None else x
        for x in (y['outcome_status'] for y in request_resp)
    )
]
df = pd.concat([df_resp.drop('outcome_status', axis=1),
                pd.DataFrame(outcome_status)], axis=1)
This uses some comprehensions to rename the fields in outcome_status by prepending "outcome_status_" to the key names and turning them into columns. It also expands None values.
Test Code:
import requests
import pandas as pd
# make a polygon for use in the API call
lat_coord = 51.767538
long_coord = -1.497488
lat_upper = str(lat_coord + 0.02)
lat_lower = str(lat_coord - 0.02)
long_upper = str(long_coord + 0.02)
long_lower = str(long_coord - 0.02)
# call from the API - no authentication required
api_call = ("https://data.police.uk/api/crimes-street/all-crime?poly=" +
            lat_lower + "," + long_upper + ":" +
            lat_lower + "," + long_lower + ":" +
            lat_upper + "," + long_lower + ":" +
            lat_upper + "," + long_upper + "&date=2017-01")
request_resp = requests.get(api_call).json()
df_resp = pd.DataFrame(request_resp)
outcome_status = [
    {'outcome_status_' + k: v for k, v in z.items()} for z in (
        dict(category=None, date=None) if x is None else x
        for x in (y['outcome_status'] for y in request_resp)
    )
]
df = pd.concat([df_resp.drop('outcome_status', axis=1),
                pd.DataFrame(outcome_status)], axis=1)
# the grouped count on the nested field now works
print(df.groupby('outcome_status_category').category.count())
Results:
outcome_status_category
Court result unavailable 4
Investigation complete; no suspect identified 38
Local resolution 1
Offender given a caution 2
Offender given community sentence 3
Offender given conditional discharge 1
Offender given penalty notice 2
Status update unavailable 6
Suspect charged as part of another case 1
Unable to prosecute suspect 9
Name: category, dtype: int64
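As a side note, if a recent pandas is available, pd.json_normalize can do a similar flattening in one call. This is only a sketch; on older versions the same function lives under pandas.io.json, and rows whose outcome_status is null come through as NaN, which groupby simply skips:
flat = pd.json_normalize(request_resp, sep='_')
# the nested field becomes the column 'outcome_status_category'
print(flat.groupby('outcome_status_category').category.count())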

Factor/collect expression in Sympy

I have an equation like:
     R₂⋅V₁ + R₃⋅V₁ - R₃⋅V₂
i₁ = ─────────────────────
     R₁⋅R₂ + R₁⋅R₃ + R₂⋅R₃
defined, and I'd like to split it into factors that each include only a single variable - in this case V1 and V2.
So as a result I'd expect
                 -R₃                     (R₂ + R₃)
i₁ = V₂⋅───────────────────── + V₁⋅─────────────────────
        R₁⋅R₂ + R₁⋅R₃ + R₂⋅R₃      R₁⋅R₂ + R₁⋅R₃ + R₂⋅R₃
But the best I could get so far is
     -R₃⋅V₂ + V₁⋅(R₂ + R₃)
i₁ = ─────────────────────
     R₁⋅R₂ + R₁⋅R₃ + R₂⋅R₃
using equation.factor(V1,V2). Is there some other option to factor or another method to separate the variables even further?
If it were possible to exclude something from the factor algorithm (the denominator in this case), this would be easy. I don't know a way to do that, so here is a manual solution:
In [1]: a
Out[1]:
r₁⋅v₁ + r₂⋅v₂ + r₃⋅v₂
─────────────────────
r₁⋅r₂ + r₁⋅r₃ + r₂⋅r₃

In [2]: b, c = factor(a, v2).as_numer_denom()

In [3]: b.args[0]/c + b.args[1]/c
Out[3]:
        r₁⋅v₁                v₂⋅(r₂ + r₃)
───────────────────── + ─────────────────────
r₁⋅r₂ + r₁⋅r₃ + r₂⋅r₃   r₁⋅r₂ + r₁⋅r₃ + r₂⋅r₃
You may also look at the evaluate=False options in Add and Mul, to build those expressions manually. I don't know of a nice general solution.
In[3] can be a list comprehension if you have many terms.
You may also check whether it is possible to treat this as a multivariate polynomial in v1 and v2. That may give a better solution.
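Putting the manual approach together as a runnable snippet (a sketch, using the same expression as the In/Out session above and the list-comprehension idea for many terms):
import sympy as sp

r1, r2, r3, v1, v2 = sp.symbols('r1 r2 r3 v1 v2')
a = (r1*v1 + r2*v2 + r3*v2) / (r1*r2 + r1*r3 + r2*r3)

# Collect the v2 terms, then split numerator and denominator.
b, c = sp.factor(a, v2).as_numer_denom()

# Divide each numerator term by the denominator separately; the fractions
# are not recombined over a common denominator.
separated = sp.Add(*[term / c for term in b.args])
sp.pretty_print(separated)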
Here I have SymPy 0.7.2 installed, and sympy.collect() works for this purpose:
import sympy

r1, r2, r3, v1, v2 = sympy.symbols('r1 r2 r3 v1 v2')
i1 = (r2*v1 + r3*v1 - r3*v2)/(r1*r2 + r1*r3 + r2*r3)
sympy.pretty_print(sympy.collect(i1, (v1, v2)))
# -r3*v2 + v1*(r2 + r3)
# ---------------------
# r1*r2 + r1*r3 + r2*r3
