Pandas recursive cumulative sum? - python

I am having a bit of trouble getting some pandas code to work. My basic problem is: given a set of transactions and a set of balances, I need to come up with "balancing transactions", i.e. fake transactions (which will be tagged as such) that make the sum of transactions equal to the balances (ignore the fact that in most cases this isn't a good idea; it makes sense in the context I am working in, I promise!).
Sample data:
import pandas as pd
from io import StringIO
txn_data = StringIO(
"""Contract,Txndate,TxnAmount
"wer42134423",1/1/2014, 50
"wer42134423",1/2/2014, -10
"wer42134423",1/3/2014, 100
"wer42134423",1/4/2014, -50
"wer42134423",1/5/2014, -10
"wer42134423",1/6/2014, 20
"wer42134423",1/7/2014, 50
"wer42134423",1/8/2014, -70
"wer42134423",1/10/2014, 21
"wer42134423",1/11/2014, -3
"""
)
txns = pd.read_csv(txn_data, parse_dates=["Txndate"])
txns.head()
balance_data = StringIO(
"""Contract,Baldate,Amount
"wer42134423", 1/1/2014, 50
"wer42134423", 1/4/2014, 100
"wer42134423", 1/9/2014, 96
"wer42134423", 1/11/2014, 105
"""
)
balances = pd.read_csv(balance_data, parse_dates=["Baldate"])
txns["CumulativeSumofTxns"] = txns.groupby("Contract")["TxnAmount"].cumsum()
balances_merged = pd.merge_asof(balances, txns, by="Contract", left_on="Baldate", right_on="Txndate")
balances_merged.head()
I can do this fairly easily in Excel: I merge the cumulative sum of transactions onto my balance data, apply a fairly simple sum formula, and everything balances out.
However, I cannot for the life of me figure out how to do the same in pandas (without manually iterating through each "cell", which would be horrendous for performance). After a lot of digging it almost seems like the expanding window function would do the trick, but I couldn't get that to work after multiple attempts with shifting and such. I think the problem is that every entry in my new column depends on entries in the same row (namely, the current balance and cumulative sum of transactions) and on all the prior entries in the column (namely, all the prior balancing transactions). Any help appreciated!

IIUC, is this what you want?
balances_merged['Cumulative Sum Balancing Transactions'] = balances_merged['Amount'] - balances_merged['CumulativeSumofTxns']
balances_merged['Balancing Transaction'] = balances_merged['Cumulative Sum Balancing Transactions'].diff()
balances_merged
Output:
      Contract    Baldate  Amount    Txndate  TxnAmount  CumulativeSumofTxns  Cumulative Sum Balancing Transactions  Balancing Transaction
0  wer42134423 2014-01-01      50 2014-01-01         50                   50                                      0                    NaN
1  wer42134423 2014-01-04     100 2014-01-04        -50                   90                                     10                   10.0
2  wer42134423 2014-01-09      96 2014-01-08        -70                   80                                     16                    6.0
3  wer42134423 2014-01-11     105 2014-01-11         -3                   98                                      7                   -9.0
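If you then need the balancing rows as actual fake transactions tagged as such (as described in the question), a rough, untested sketch could look like the following; the IsBalancing column is just an assumed name for the tag:
balancing_txns = balances_merged[["Contract", "Baldate"]].rename(columns={"Baldate": "Txndate"})
# The first diff is NaN, so fall back to the cumulative balancing amount for that row
balancing_txns["TxnAmount"] = (
    balances_merged["Balancing Transaction"]
    .fillna(balances_merged["Cumulative Sum Balancing Transactions"])
)
balancing_txns["IsBalancing"] = True  # hypothetical tag column
all_txns = pd.concat([txns.assign(IsBalancing=False), balancing_txns], ignore_index=True)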

Related

Get the daily percentages of values that fall within certain ranges

I have a large dataset of test results with a column for the date a test was completed and a column for the number of hours the test took, i.e.
df = pd.DataFrame({'Completed':['21/03/2020','22/03/2020','21/03/2020','24/03/2020','24/03/2020',], 'Hours_taken':[23,32,8,73,41]})
I have a month's worth of test data, and the tests can take anywhere from a couple of hours to a couple of days. I want to work out, for each day, what percentage of tests fall within the ranges of 24hrs/48hrs/72hrs etc. to complete, up to the percentage of tests that took longer than a week.
I've been able to work it out generally without taking the dates into account like so:
Lab_tests['one-day'] = Lab_tests['hours'].between(0, 24)
Lab_tests['two-day'] = Lab_tests['hours'].between(24, 48)
Lab_tests['GreaterThanWeek'] = Lab_tests['hours'] > 168
one = Lab_tests['one-day'].value_counts().loc[True]
two = Lab_tests['two-day'].value_counts().loc[True]
eight = Lab_tests['GreaterThanWeek'].value_counts().loc[True]
print(one/10407 * 100)
print(two/10407 * 100)
print(eight/10407 * 100)
Ideally I'd like to represent the percentages in another dataset where the rows represent the dates and the columns represent the duration ranges. But I can't work out how to take what I've done and modify it to get these percentages for each date. Is this possible to do in pandas?
This question, Counting qualitative values based on the date range in Pandas, is quite similar, but the fact that I'm counting occurrences in specified ranges is throwing me off and I haven't been able to get a solution out of it.
Bonus Question
I'm sure you've noticed my current code is not the most elegant thing in the world. Is there a cleaner way to do what I've done above, since I'm repeating it for every duration range that I want?
Edit:
So the Output for the sample data given would look like so:
df = pd.DataFrame({'1-day':[100,0,0,0], '2-day':[0,100,0,50],'3-day':[0,0,0,0],'4-day':[0,0,0,50]},index=['21/03/2020','22/03/2020','23/03/2020','24/03/2020'])
You're almost there. You just need to do a few final steps:
First, cast your bools to ints, so that you can sum them.
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24).astype(int)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48).astype(int)
Lab_tests['GreaterThanWeek'] = (Lab_tests['hours'] > 168).astype(int)
    Completed  hours  one-day  two-day  GreaterThanWeek
0  21/03/2020     23        1        0                0
1  22/03/2020     32        0        1                0
2  21/03/2020      8        1        0                0
3  24/03/2020     73        0        0                0
4  24/03/2020     41        0        1                0
Then, drop the hours column and roll the rest up to the level of Completed:
Lab_tests['one-day'] = Lab_tests['hours'].between(0,24).astype(int)
Lab_tests['two-day'] = Lab_tests['hours'].between(24,48).astype(int)
Lab_tests['GreaterThanWeek'] = (Lab_tests['hours'] > 168).astype(int)
Lab_tests.drop('hours', axis=1).groupby('Completed').sum()
            one-day  two-day  GreaterThanWeek
Completed
21/03/2020        2        0                0
22/03/2020        0        1                0
24/03/2020        0        1                0
EDIT: To get to percentages, you just need to divide each column by the sum of all three columns. You can sum across the columns by defining the axis of the sum:
...
daily_totals = Lab_tests.drop('hours', axis=1).groupby('Completed').sum()
daily_totals.sum(axis=1)
Completed
21/03/2020 2
22/03/2020 1
24/03/2020 1
dtype: int64
Then divide the daily totals dataframe by the row totals, i.e. the sum across its columns (again, we use axis to define whether each value of the series will be the divisor for a row or a column):
daily_totals.div(daily_totals.sum(axis=1), axis=0)
            one-day  two-day  GreaterThanWeek
Completed
21/03/2020      1.0      0.0              0.0
22/03/2020      0.0      1.0              0.0
24/03/2020      0.0      1.0              0.0
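For the bonus question, a possibly cleaner route (untested sketch; the bin edges and labels are assumptions) is to bucket the hours with pd.cut and let pd.crosstab compute the per-day percentages in one go, rather than building one boolean column per range:
import numpy as np
import pandas as pd

bins = [0, 24, 48, 72, 96, 120, 144, 168, np.inf]   # assumed day boundaries
labels = ['1-day', '2-day', '3-day', '4-day', '5-day', '6-day', '7-day', 'GreaterThanWeek']
buckets = pd.cut(Lab_tests['hours'], bins=bins, labels=labels)
# normalize='index' gives per-day fractions; multiply by 100 for percentages
pct = pd.crosstab(Lab_tests['Completed'], buckets, normalize='index') * 100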

Inserting missing numbers in dataframe

I have a program that ideally measures the temperature every second. However, in reality this does not happen. Sometimes it skips a second, or it breaks down for 400 seconds and then decides to start recording again. This leaves gaps in my 2-by-n dataframe, where ideally n = 86400 (the number of seconds in a day). I want to apply some sort of moving/rolling average to get a nicer plot, but if I do that on the "raw" datafiles, the number of data points shrinks. This is shown here, watch the x-axis. I know the "nice data" doesn't look nice yet; I'm just playing with some values.
So, I want to implement a data-cleaning method which adds data to the dataframe. I have thought about it, but don't know how to implement it. My idea is as follows:
If the index is not equal to the time, then we need to add a value at time = index. If the gap is only 1 value, then the average of the previous and the next value will do for me. But if it is bigger, say 100 seconds are missing, then a linear function needs to be made, which will increase or decrease the value steadily.
So I guess a training set could be like this:
index  time   temp
    0     0  20.10
    1     1  20.20
    2     2  20.20
    3     4  20.10
    4   100  22.30
Here, I would like to get a value for index 3, time 3, as well as the values missing between time = 4 and time = 100. I'm sorry about my formatting skills, I hope it is clear.
How would I go about programming this?
Use merge with a complete time column and then interpolate:
# Create your table (random sample with missing time values)
import numpy as np
import pandas as pd

time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']
>>> temps
   time  temperature
0   4.0    21.662352
1  10.0    20.904659
2  15.0    20.345858
3  18.0    24.787389
4  19.0    20.719487
The above is a random table generated with missing time data.
# modify it
filled = pd.Series(np.arange(temps.iloc[0,0], temps.iloc[-1, 0]+1))
filled = filled.to_frame()
filled.columns = ['time'] # Create a fully filled time column
merged = pd.merge(filled, temps, on='time', how='left') # merge it with original, time without temperature will be null
merged.temperature = merged.temperature.interpolate() # fill nulls linearly.
# Alternatively, use reindex, this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()
>>> merged  # or final
    time  temperature
0    4.0    21.662352
1    5.0    21.536070
2    6.0    21.409788
3    7.0    21.283505
4    8.0    21.157223
5    9.0    21.030941
6   10.0    20.904659
7   11.0    20.792898
8   12.0    20.681138
9   13.0    20.569378
10  14.0    20.457618
11  15.0    20.345858
12  16.0    21.826368
13  17.0    23.306879
14  18.0    24.787389
15  19.0    20.719487
First you can convert the seconds values to actual datetime values and set them as the index:
df.index = pd.to_datetime(df['time'], unit='s')
After which you can use pandas' built-in time series operations to resample and fill in the missing values:
df = df.resample('s').interpolate('time')
Optionally, if you still want to do some smoothing you can use the following operation for that:
df.rolling(5, center=True, win_type='hann').mean()
Which will smooth with a 5 element wide Hanning window. Note: any window-based smoothing will cost you value points at the edges.
Now your dataframe will have datetimes (including date) as index. This is required for the resample method. If you want to lose the date, you can simply use:
df.index = df.index.time
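Putting that together on the toy data from the question (untested sketch; column names taken from the example above):
import pandas as pd

df = pd.DataFrame({'time': [0, 1, 2, 4, 100],
                   'temp': [20.10, 20.20, 20.20, 20.10, 22.30]})
df.index = pd.to_datetime(df['time'], unit='s')
df = df.resample('s').interpolate('time')   # fills time = 3 and time = 5..99 linearly
smoothed = df['temp'].rolling(5, center=True, win_type='hann').mean()  # win_type needs SciPy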

How to check the type of missing data in python (randomly missing or not)?

I have a big amount of data (93 files, ~150 MB each). The data is a time series, i.e. information about a given set of coordinates (3.3 million latitude-longitude values) is recorded and stored every day for 93 days, and the whole data is broken up into 93 files respectively. Example of two such files:
Day 1:
 lon  lat    A    B  day1
68.4  8.4  NaN   20    20
68.4  8.5   16   20    18
68.6  8.4  NaN  NaN   NaN
.
.
Day 2:
 lon  lat    C    D  day2
68.4  8.4  NaN  NaN   NaN
68.4  8.5   24   25  24.5
68.6  8.4  NaN  NaN   NaN
.
.
I am interested in understanding the nature of the missing data in the columns 'day1', 'day2', 'day3', etc. For example, if the values missing in those columns are evenly distributed across the whole set of coordinates, the data is probably missing at random; but if the missing values are concentrated in a particular set of coordinates, my data will become biased. The way my data is divided into multiple large files, and isn't in a very standard form to operate on, makes it harder to use some tools.
I am looking for a diagnostic tool or visualization in python that can check/show how the missing data is distributed over the set of coordinates so I can impute/ignore it appropriately.
Thanks.
P.S.: This is the first time I am handling missing data, so it would be great to know if there is a workflow that people who do similar work follow.
Assuming that you read a file and name it df, you can count the NaNs using:
df.isnull().sum()
It will return the number of NaNs per column.
You could also use:
df.isnull().sum(axis=1).value_counts()
This, on the other hand, sums the number of NaNs per row and then counts how many rows have no NaNs, 1 NaN, 2 NaNs, and so on.
Regarding working with files of this size: to speed up loading and processing the data, I recommend using Dask and converting your files, preferably to Parquet, so that you can read and write them in parallel.
You could easily recreate function above in Dask like this:
from dask import dataframe as dd
dd.read_parquet(file_path).isnull().sum().compute()
Answering the comment question:
Use .loc to slice your dataframe; in the code below I choose all rows (:) and two columns (['col1', 'col2']).
df.loc[:, ['col1', 'col2']].isnull().sum(axis=1).value_counts()
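To actually see whether the missingness is spatially clustered (the diagnostic the question asks for), one rough, untested sketch is to count missing days per coordinate and plot it; the file names and column layout here are assumptions:
import glob
import pandas as pd
import matplotlib.pyplot as plt

files = sorted(glob.glob('day*.csv'))        # hypothetical file names
merged = pd.read_csv(files[0])[['lon', 'lat']]
for i, path in enumerate(files, start=1):
    daily = pd.read_csv(path)
    merged[f'day{i}'] = daily.iloc[:, -1]    # assumes rows line up across files
merged['n_missing'] = merged.filter(like='day').isnull().sum(axis=1)

plt.scatter(merged['lon'], merged['lat'], c=merged['n_missing'], s=1)
plt.colorbar(label='missing days out of 93')
plt.show()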

Detecting outliers in a Pandas dataframe using a rolling standard deviation

I have a DataFrame for a fast Fourier transformed signal.
There is one column for the frequency in Hz and another column for the corresponding amplitude.
I have read a post from a couple of years ago saying that you can use a simple boolean mask to exclude, or only include, outliers in the final dataframe that are above or below a few standard deviations.
df = pd.DataFrame({'Data':np.random.normal(size=200)}) # example dataset of normally distributed data.
df[~(np.abs(df.Data-df.Data.mean())>(3*df.Data.std()))] # or if you prefer the other way around
The problem is that my signal drops by several orders of magnitude (up to 10,000 times smaller) as frequency increases up to 50,000 Hz. Therefore, I am unable to use a function that only exports values above 3 standard deviations, because I will only pick up the "peak" outliers from the first 50 Hz.
Is there a way I can export outliers in my dataframe that are above 3 rolling standard deviations of a rolling mean instead?
This is maybe best illustrated with a quick example. Basically you're comparing your existing data to a new column that is the rolling mean plus three standard deviations, also on a rolling basis.
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Data':np.random.normal(size=200)})
# Create a few outliers (3 of them, at index locations 10, 55, 80)
df.iloc[[10, 55, 80]] = 40.
r = df.rolling(window=20) # Create a rolling object (no computation yet)
mps = r.mean() + 3. * r.std() # Combine a mean and stdev on that object
print(df[df.Data > mps.Data]) # Boolean filter
# Data
# 55 40.0
# 80 40.0
To add a new column filtering only to outliers, with NaN elsewhere:
df['Peaks'] = df['Data'].where(df.Data > mps.Data, np.nan)
print(df.iloc[50:60])
        Data  Peaks
50  -1.29409    NaN
51  -1.03879    NaN
52   1.74371    NaN
53  -0.79806    NaN
54   0.02968    NaN
55  40.00000   40.0
56   0.89071    NaN
57   1.75489    NaN
58   1.49564    NaN
59   1.06939    NaN
Here .where returns
An object of same shape as self and whose corresponding entries are
from self where cond is True and otherwise are from other.
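Since the original global filter in the question used np.abs, a two-sided rolling version would look something like this (untested sketch, reusing r and mps from above):
lower = r.mean() - 3. * r.std()
outliers = df[(df.Data > mps.Data) | (df.Data < lower.Data)]  # beyond 3 rolling stdevs on either side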

Efficient operation over grouped dataframe Pandas

I have a very big pandas dataframe where I need an ordering within groups based on another column. I know how to iterate over the groups, do an operation on each group, and union all those groups back into one dataframe; however, this is slow and I feel like there is a better way to achieve this. Here is the input and what I want out of it. Input:
ID   price
1   100.00
1    80.00
1    90.00
2    40.00
2    40.00
2    50.00
Output:
ID   price  order
1   100.00      3
1    80.00      1
1    90.00      2
2    40.00      1
2    40.00      2   (could be 1, doesn't matter too much)
2    50.00      3
Since this is over about 5 million (5kk) records with around 250,000 IDs, efficiency is important.
If speed is what you want, then the following should be pretty good, although it is a bit more complicated as it makes use of complex-number sorting in numpy. This is similar to the approach used (by me) when writing the aggregate-sort method in the numpy-groupies package.
import numpy as np

# get global sort order, for sorting by ID then price
full_idx = np.argsort(df['ID'] + 1j*df['price'])
# get min of full_idx for each ID (note that there are multiple ways of doing this)
n_for_id = np.bincount(df['ID'])
first_of_idx = np.cumsum(n_for_id) - n_for_id
# subtract first_of_idx from full_idx
rank = np.empty(len(df), dtype=int)
rank[full_idx] = np.arange(len(df)) - first_of_idx[df['ID'][full_idx]]
df['rank'] = rank + 1
It takes 2 s for 5m rows on my machine, which is about 100x faster than using groupby.rank from pandas (although I didn't actually run the pandas version with 5m rows because it would take too long; I'm not sure how @ayhan managed to do it in only 30 s, perhaps a difference in pandas versions?).
If you do use this, then I recommend testing it thoroughly, as I have not.
You can use rank:
df["order"] = df.groupby("ID")["price"].rank(method="first")
df
Out[47]:
   ID  price  order
0   1  100.0    3.0
1   1   80.0    1.0
2   1   90.0    2.0
3   2   40.0    1.0
4   2   40.0    2.0
5   2   50.0    3.0
It takes about 30 s on a dataset of 5m rows with 250,000 IDs (i5-3330):
df = pd.DataFrame({"price": np.random.rand(5000000), "ID": np.random.choice(np.arange(250000), size = 5000000)})
%time df["order"] = df.groupby("ID")["price"].rank(method="first")
Wall time: 36.3 s
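If integer ranks are wanted, as in the expected output in the question, a small (untested) follow-up is to cast the result:
df["order"] = df.groupby("ID")["price"].rank(method="first").astype(int)  # safe here since rank produces no NaNs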
