I have a table with several million transactions. The table contains a timestamp for the transaction, an amount, and several other properties (e.g., address). For each transaction I want to calculate the count and the sum of amounts of earlier transactions that happened within a given timeframe (e.g., 1 month) and share a given property (e.g., the same address).
Here is an example input:
+----+---------------------+----------------+--------+
| id | ts | address | amount |
+----+---------------------+----------------+--------+
| 0 | 2016-10-11 00:34:21 | 123 First St. | 56.20 |
+----+---------------------+----------------+--------+
| 1 | 2016-10-13 02:53:58 | 456 Second St. | 96.19 |
+----+---------------------+----------------+--------+
| 2 | 2016-10-23 02:28:17 | 123 First St. | 64.65 |
+----+---------------------+----------------+--------+
| 3 | 2016-10-31 07:14:35 | 456 Second St. | 36.38 |
+----+---------------------+----------------+--------+
| 4 | 2016-11-04 09:25:39 | 123 First St. | 93.65 |
+----+---------------------+----------------+--------+
| 5 | 2016-11-20 22:30:15 | 123 First St. | 88.39 |
+----+---------------------+----------------+--------+
| 6 | 2016-11-28 09:39:14 | 123 First St. | 74.40 |
+----+---------------------+----------------+--------+
| 7 | 2016-12-03 17:09:12 | 123 First St. | 83.13 |
+----+---------------------+----------------+--------+
This should output:
+----+-------+--------+
| id | count | amount |
+----+-------+--------+
| 0 | 0 | 0.00 |
+----+-------+--------+
| 1 | 0 | 0.00 |
+----+-------+--------+
| 2 | 1 | 56.20 |
+----+-------+--------+
| 3 | 1 | 96.19 |
+----+-------+--------+
| 4 | 2 | 120.85 |
+----+-------+--------+
| 5 | 1 | 64.65 |
+----+-------+--------+
| 6 | 1 | 88.39 |
+----+-------+--------+
| 7 | 2 | 162.79 |
+----+-------+--------+
In order to do this, I sorted the table by timestamp and then I'm essentially using queues and dictionaries, but it seems to be running really slow, so I was wondering if there's a better way to do it.
Here is my code:
import csv
import Queue
import time

props = [ 'address', ... ]
spans = { '1m': 2629800, ... }

h = [ 'id' ]
for value in [ 'count', 'amount' ]:
    for span in spans:
        for prop in props:
            h.append(span + '_' + prop + '_' + value)

tq = { }  # timestamps currently inside each span's window
kq = { }  # property values, in the same order as the timestamps
vq = { }  # per property value: queue of amounts inside the window
for span in spans:
    tq[span] = Queue.Queue()
    kq[span] = { }
    vq[span] = { }
    for prop in props:
        kq[span][prop] = Queue.Queue()
        vq[span][prop] = { }

with open('transactions.csv', 'r') as csvin, open('velocities.csv', 'w') as csvout:
    reader = csv.DictReader(csvin)
    writer = csv.DictWriter(csvout, h)
    writer.writeheader()
    for i in reader:
        o = { 'id': i['id'] }
        ts = time.mktime(time.strptime(i['ts'], '%Y-%m-%d %H:%M:%S'))
        for span in spans:
            # evict transactions that have fallen out of the window
            while not tq[span].empty() and ts > tq[span].queue[0] + spans[span]:
                tq[span].get()
                for prop in props:
                    key = kq[span][prop].get()
                    vq[span][prop][key].get()
                    if vq[span][prop][key].empty():
                        del vq[span][prop][key]
            tq[span].put(ts)
            for prop in props:
                kq[span][prop].put(i[prop])
                if not i[prop] in vq[span][prop]:
                    vq[span][prop][i[prop]] = Queue.Queue()
                o[span + '_' + prop + '_count'] = vq[span][prop][i[prop]].qsize()
                o[span + '_' + prop + '_amount'] = sum(vq[span][prop][i[prop]].queue)
                vq[span][prop][i[prop]].put(float(i['amount']))
        writer.writerow(o)
        csvout.flush()
I also tried replacing vq[span][prop] with a RB-trees but the performance was even worse.
Either I fundamentally misunderstand what you're trying to do, or you do, because your code is vastly more complicated (not complex, complicated) than it needs to be if you're doing what you say you're doing.
import csv
from collections import namedtuple, defaultdict, Counter
from datetime import datetime

Span = namedtuple('Span', ('start', 'end'))
month_span = Span(start=datetime(2016, 1, 1), end=datetime(2016, 1, 31))

counts = defaultdict(Counter)
amounts = defaultdict(Counter)

with open('transactions.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        timestamp = datetime.strptime(row['ts'], '%Y-%m-%d %H:%M:%S')
        if month_span.start < timestamp < month_span.end:  # or <=
            # You do some checking for properties. If you *will* always
            # have these columns, you *should* just use ``row['count']``
            # and ``row['amount']``
            counts[month_span][row['address']] += int(row.get('count', 0))
            amounts[month_span][row['address']] += float(row.get('amount', 0.00))

print(counts)
print(amounts)
Note that you're still operating, as you say, over "several million transactions". That's going to take a while no matter which way you turn it, because you're doing the same thing several million times. If you want to see where your current code is spending all its time, you can profile it. I find that the line profiler is easy to use and works well.
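For example, here is a minimal line_profiler sketch; it assumes you wrap the CSV loop in a function (called process_transactions here, which is not in your original code):

from line_profiler import LineProfiler

def process_transactions():
    # the CSV-reading loop from the question goes here
    ...

lp = LineProfiler()
wrapped = lp(process_transactions)  # wrap the function that holds the hot loop
wrapped()                           # run it once under the profiler
lp.print_stats()                    # per-line timings for the wrapped function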
Chances are, because you're doing what you're doing a million times, you're not going to be able to speed this up much without dropping to a lower-level language, e.g. Cython, C, or C++. That will speed some things up, but it will definitely be a lot harder to write the code.
Related
I have the following data file.
>| --- |
>| Adelaide |
>| --- |
>| 2021 |
>| --- |
>| Rnd | T | Opponent | Scoring | F | Scoring | A | R | M | W-D-L | Venue | Crowd | Date |
>| R1 | H | Geelong | 4.4 11.7 13.9 15.13 | 103 | 2.3 5.5 10.8 13.13 | 91 | W | 12 | 1-0-0 | Adelaide Oval | 26985 | Sat 20-Mar-2021 4:05 PM |
>| R2 | A | Sydney | 3.2 4.6 6.14 11.22 | 88 | 4.1 9.6 15.11 18.13 | 121 | L | -33 | 1-0-1 | S.C.G. | 23946 | Sat 27-Mar-2021 1:45 PM |
I wrote code to manipulate that data into my desired result, which is a list. When I print my variable row at the current spot, it prints correctly.
However, when I append my list row to another list, my_array, I have issues: I get an empty list returned.
I think the issue is the placement of where I am appending?
My code is this:
with open('adelaide.md', 'r') as f:
    my_array = []
    team = ''
    year = ''
    for line in f:
        row = []
        line = line.strip()
        fields = line.split('|')
        num_fields = len(fields)
        if len(fields) == 3:
            val = fields[1].strip()
            if val.isnumeric():
                year = val
            elif val != '---':
                team = val
        elif num_fields == 15:
            row.append(team)
            row.append(year)
            for i in range(1, 14):
                row.append(fields[i].strip())
            print(row)
my_array.append(row)
You need to append the row array inside the for loop.
I think the last line should be inside the for loop. Your code is probably appending only the last 'row' list. Just give it a tab.
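A minimal sketch of that fix, with the append moved inside the loop (my reading of the intended indentation):

with open('adelaide.md', 'r') as f:
    my_array = []
    team = ''
    year = ''
    for line in f:
        row = []
        fields = line.strip().split('|')
        num_fields = len(fields)
        if num_fields == 3:
            val = fields[1].strip()
            if val.isnumeric():
                year = val
            elif val != '---':
                team = val
        elif num_fields == 15:
            row.append(team)
            row.append(year)
            for i in range(1, 14):
                row.append(fields[i].strip())
            my_array.append(row)  # append once per parsed data line, inside the loop
print(my_array)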
I am really struggling to work out the logic for this. I have a dataset with a column called Col, as shown below. I am using Python and pandas.
I want to add a new column called "Status". The logic is:
a. When Col == 0, I will Buy. But this Buy happens only when Col == 0 is the first value in the dataset or it comes after a Sell in the Status column. There cannot be two Buys without a Sell in between.
b. When Col <= -8, I will Sell. But this happens only if there is a Buy preceding it in the Status column. There cannot be two Sells without a Buy in between.
I have provided an example of the output I want. Any help is really appreciated.
Here the raw data is in the column Col and the output I want is in Status:
+-------+--------+
| Col | Status |
+-------+--------+
| 0 | Buy |
| -1.41 | 0 |
| 0 | 0 |
| -7.37 | 0 |
| -8.78 | Sell |
| -11.6 | 0 |
| 0 | Buy |
| -5 | 0 |
| -6.1 | 0 |
| -8 | Sell |
| -11 | 0 |
| 0 | Buy |
| 0 | 0 |
| -9 | Sell |
+-------+--------+
Took me some time.
It relies on the following property: the last order you can see from now, even if you chose not to send it, is always the last decision that you took. (Otherwise it would have been sent.)
# +1 marks a potential Buy (Col == 0), -1 a potential Sell (Col <= -8)
df['order'] = (df['Col'] == 0).astype(int) - (df['Col'] <= -8).astype(int)
# keep only the rows that carry a signal
orders_no_filter = df.loc[df['order'] != 0, 'order']
# a signal is only possible if it differs from the previous signal
possible = (orders_no_filter != orders_no_filter.shift(1))
df['order'] = df['order'] * possible.reindex(df.index, fill_value=0)
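To turn that numeric order column back into the Status column from the question, a small follow-up step (my addition, assuming the code above has already run):

df['Status'] = df['order'].map({1: 'Buy', -1: 'Sell', 0: 0})
print(df[['Col', 'Status']])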
I am trying to aggregate data in a PySpark dataframe based on a particular criterion. I am trying to align the accounts based on switchOUT amount to switchIN amount, so that accounts that money switches out of become from accounts and the other accounts become to accounts.
Data I am getting in the dataframe to begin with
+--------+------+-----------+----------+----------+-----------+
| person | acct | close_amt | open_amt | switchIN | switchOUT |
+--------+------+-----------+----------+----------+-----------+
| A | 1 | 125 | 50 | 75 | 0 |
+--------+------+-----------+----------+----------+-----------+
| A | 2 | 100 | 75 | 25 | 0 |
+--------+------+-----------+----------+----------+-----------+
| A | 3 | 200 | 300 | 0 | 100 |
+--------+------+-----------+----------+----------+-----------+
To this table
+--------+-----------+---------+----------+-----------+
| person | from_acct | to_acct | switchIN | switchOUT |
+--------+-----------+---------+----------+-----------+
| A      | 3         | 1       | 75       | 100       |
+--------+-----------+---------+----------+-----------+
| A      | 3         | 2       | 25       | 100       |
+--------+-----------+---------+----------+-----------+
Also, how can I do this so that it works for N rows (not just 3 accounts)?
So far I have used this code:
import operator
from pyspark.sql import functions as F

# define udfs
def sorter(l):
    res = sorted(l, key=operator.itemgetter(1))
    return [item[0] for item in res]

def list_to_string(l):
    res = 'from_fund_' + str(l[0]) + '_to_fund_' + str(l[1])
    return res

def listfirstAcc(l):
    res = str(l[0])
    return res

def listSecAcc(l):
    res = str(l[1])
    return res

sort_udf = F.udf(sorter)
list_str = F.udf(list_to_string)
extractFirstFund = F.udf(listfirstAcc)
extractSecondFund = F.udf(listSecAcc)

# Add additional columns
df = df.withColumn("move", sort_udf("list_col").alias("sorted_list"))
df = df.withColumn("move_string", list_str("move"))
df = df.withColumn("From_Acct", extractFirstFund("move"))
df = df.withColumn("To_Acct", extractSecondFund("move"))
Current outcome I am getting:
+--------+-----------+---------+----------+-----------+
| person | from_acct | to_acct | switchIN | switchOUT |
+--------+-----------+---------+----------+-----------+
| A      | 3         | 1,2     | 75       | 100       |
+--------+-----------+---------+----------+-----------+
I want to calculate APRU for several countries.
country_list = ['us','gb','ca','id']

count = {}
for i in country_list:
    count[i] = df_day_country[df_day_country.isin([i])]
    count[i+'_reverse'] = count[i].iloc[::-1]
    for j in range(1, len(count[i+'_reverse'])):
        count[i+'_reverse']['count'].iloc[j] = count[i+'_reverse']['count'][j-1:j+1].sum()
    for k in range(1, len(count[i])):
        count[i][revenue_sum].iloc[k] = count[i][revenue_sum][k-1:k+1].sum()
    count[i]['APRU'] = count[i][revenue_sum] / count[i]['count'][0]/100
After that, I will create 4 dataframes, df_us, df_gb, df_ca, and df_id, that show each country's APRU.
But the dataset is large, and the running time becomes extremely slow as the country list grows. Is there a way to decrease the running time?
Consider using numba
Your code thus becomes
from numba import njit

country_list = ['us','gb','ca','id']

@njit
def count(country_list):
    count = {}
    for i in country_list:
        count[i] = df_day_country[df_day_country.isin([i])]
        count[i+'_reverse'] = count[i].iloc[::-1]
        for j in range(1, len(count[i+'_reverse'])):
            count[i+'_reverse']['count'].iloc[j] = count[i+'_reverse']['count'][j-1:j+1].sum()
        for k in range(1, len(count[i])):
            count[i][revenue_sum].iloc[k] = count[i][revenue_sum][k-1:k+1].sum()
        count[i]['APRU'] = count[i][revenue_sum] / count[i]['count'][0]/100
    return count
Numba makes Python loops a lot faster and is in the process of being integrated into heavier-duty Python libraries like SciPy. Definitely give this a look.
IIUC, from your code and variable names, it looks like you are trying to compute average:
import numpy as np
import pandas as pd

# toy data set:
country_list = ['us','gb']
np.random.seed(1)
datalen = 10
df_day_country = pd.DataFrame({'country': np.random.choice(country_list, datalen),
                               'count': np.random.randint(0, 100, datalen),
                               'revenue_sum': np.random.uniform(0, 100, datalen)})

df_day_country['APRU'] = (df_day_country.groupby('country', group_keys=False)
                          .apply(lambda x: x['revenue_sum'] / x['count'].sum())
                          )
Output:
+---+---------+-------+-------------+----------+
|   | country | count | revenue_sum | APRU     |
+---+---------+-------+-------------+----------+
| 0 | gb      | 16    | 20.445225   | 0.150333 |
| 1 | gb      | 1     | 87.811744   | 0.645675 |
| 2 | us      | 76    | 2.738759    | 0.011856 |
| 3 | us      | 71    | 67.046751   | 0.290246 |
| 4 | gb      | 6     | 41.730480   | 0.306842 |
| 5 | gb      | 25    | 55.868983   | 0.410801 |
| 6 | gb      | 50    | 14.038694   | 0.103226 |
| 7 | gb      | 20    | 19.810149   | 0.145663 |
| 8 | gb      | 18    | 80.074457   | 0.588783 |
| 9 | us      | 84    | 96.826158   | 0.419161 |
+---+---------+-------+-------------+----------+
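If you still want the separate df_us, df_gb, ... frames mentioned in the question, one option (my addition, not part of the code above) is to split the result by group:

# build one dataframe per country from the same grouped result
country_frames = {name: grp for name, grp in df_day_country.groupby('country')}
df_us = country_frames['us']
df_gb = country_frames['gb']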
What is the best way to compare 2 dataframes with the same column names, row by row, and if a cell is different, get the Before & After values and which cell is different in that dataframe?
I know this question has been asked a lot, but none of the applications fit my use case. Speed is important. There is a package called datacompy, but it is not good if I have to compare 5,000 dataframes in a loop (I'm only comparing 2 at a time, but around 10,000 dataframes total, and 5,000 comparisons).
I don't want to join the dataframes on a column. I want to compare them row by row: row 1 with row 1, etc. If a column in row 1 is different, I only need to know the column name, the before, and the after. Perhaps if it is numeric I could also add a column with the absolute value of the difference.
The problem is, there is sometimes an edge case where rows are out of order (only by 1 entry), and I don't want these to come up as false positives.
Example:
These dataframes would be created when I pass in race # (there are 5,000 race numbers)
df1:
+-----+-------+------+----------+-------------+
| Id  | Speed | Name | Distance | Location    |
+-----+-------+------+----------+-------------+
| 181 | 10.3  | Joe  | 2        | New York    |
| 192 | 9.1   | Rob  | 1        | Chicago     |
| 910 | 1.0   | Fred | 5        | Los Angeles |
| 97  | 1.8   | Bob  | 8        | New York    |
| 88  | 1.2   | Ken  | 7        | Miami       |
| 99  | 1.1   | Mark | 6        | Austin      |
+-----+-------+------+----------+-------------+
df2:
+-----+-------+------+----------+-------------+
| Id  | Speed | Name | Distance | Location    |
+-----+-------+------+----------+-------------+
| 181 | 10.3  | Joe  | 2        | New York    |
| 192 | 9.4   | Rob  | 1        | Chicago     |
| 910 | 1.0   | Fred | 5        | Los Angeles |
| 97  | 1.5   | Bob  | 8        | New York    |
| 99  | 1.1   | Mark | 6        | Austin      |
| 88  | 1.2   | Ken  | 7        | Miami       |
+-----+-------+------+----------+-------------+
diff:
+-------+----------+--------+-------+
| Race# | Diff_col | Before | After |
+-------+----------+--------+-------+
| 123   | Speed    | 9.1    | 9.4   |
| 123   | Speed    | 1.8    | 1.5   |
+-------+----------+--------+-------+
An example of a false positive is with the last 2 rows, Ken + Mark.
I could summarize the differences in one line per race, but if the dataframe has 3000 records and there are 1,000 differences (unlikely, but possible) then I will have tons of columns. I figured this way was easier, as I can export to Excel and then sort by race # to see all the differences, or by Diff_col to see which columns are different.
def DiffCol2(df1, df2, race_num):
    is_diff = False
    diff_cols_list = []
    row_coords, col_coords = np.where(df1 != df2)
    diffDf = []
    alldiffDf = []
    for y in set(col_coords):
        col_df1 = df1.iloc[:, y].name
        col_df2 = df2.iloc[:, y].name
        for index, row in df1.iterrows():
            if df1.loc[index, col_df1] != df2.loc[index, col_df2]:
                col_name = col_df1
                if col_df1 != col_df2:
                    col_name = (col_df1, col_df2)
                # record every raw difference
                alldiffDf.append({'Race #': race_num, 'Column Name': col_name,
                                  'Before': df2.loc[index, col_df2], 'After': df1.loc[index, col_df1]})
                try:
                    check_edge_case = df1.loc[index, col_df1] == df2.loc[index+1, col_df1]
                except:
                    check_edge_case = False
                try:
                    check_edge_case_two = df1.loc[index, col_df1] == df2.loc[index-1, col_df1]
                except:
                    check_edge_case_two = False
                if not (check_edge_case or check_edge_case_two):
                    col_name = col_df1
                    if col_df1 != col_df2:
                        col_name = (col_df1, col_df2)  # if for some reason the column names aren't the same, which should never happen, I want to know both
                    is_diff = True
                    # record only differences not explained by a one-row offset
                    diffDf.append({'Race #': race_num, 'Column Name': col_name,
                                   'Before': df2.loc[index, col_df2], 'After': df1.loc[index, col_df1]})
    return diffDf, alldiffDf, is_diff
[Apologies in advance for the weirdly formatted tables; I did my best given how annoying pasting tables into S/O is.]
The code below works if the dataframes have the same column names, the same number of columns, and the same number of rows, so it compares only the values in the tables.
Not sure where you want to get Race# from.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df2 = df1.copy(deep=True)
df2['B'][5] = 100  # Creating difference
df2['C'][6] = 100  # Creating difference

dif = []
for col in df1.columns:
    for bef, aft in zip(df1[col], df2[col]):
        if bef != aft:
            dif.append([col, bef, aft])
print(dif)
Alternative solution without loops
df = df1.melt()
df.columns=['Column', 'Before']
df.insert(2, 'After', df2.melt().value)
df[df.Before!=df.After]
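As a side note, recent pandas (1.1+) also ships DataFrame.compare, which produces a similar before/after view; this assumes the two frames share the same shape, index, and column names:

diff = df1.compare(df2)  # MultiIndex columns: (column name, 'self'/'other')
diff = diff.rename(columns={'self': 'Before', 'other': 'After'}, level=1)
print(diff)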