Optimising finding row pairs in a dataframe - python

I have a dataframe with rows that describe a movement of value between nodes in a system. This dataframe looks like this:
index from_node to_node value invoice_number
0 A E 10 a
1 B F 20 a
2 C G 40 c
3 D H 60 d
4 E I 35 c
5 X D 43 d
6 Y F 50 d
7 E H 70 a
8 B A 55 b
9 X B 33 a
I am looking to find "swaps" in the invoice history. A swap is defined as a node that both receives a value from and sends a value to other nodes within the same invoice number (the "sent to" and "received from" nodes may even be the same). In the above dataset there are two swaps in invoice "a" and one swap in invoice "d":
index node sent_to sent_value received_from received_value invoice_number
0 B F 20 X 33 a
1 E H 70 A 10 a
2 D H 60 X 43 d
I have solved this problem by iterating over all of the unique invoice numbers in the dataset and then iterating over each row within that invoice number to find pairs:
import pandas as pd

df = pd.DataFrame({
    'from_node': ['A','B','C','D','E','X','Y','E','B','X'],
    'to_node': ['E','F','G','H','I','D','F','H','A','B'],
    'value': [10,20,40,60,35,43,50,70,55,33],
    'invoice_number': ['a','a','c','d','c','d','d','a','b','a'],
})

invoices = set(df.invoice_number)
list_df_swap = []
for invoice in invoices:
    df_inv = df[df.invoice_number.isin([invoice])]
    for r in df_inv.itertuples():
        df_is_swap = df_inv[df_inv.to_node.isin([r.from_node])]
        if len(df_is_swap.index) == 1:
            swap = {'node': r.from_node,
                    'sent_to': r.to_node,
                    'sent_value': r.value,
                    'received_from': df_is_swap.iloc[0]['from_node'],
                    'received_value': df_is_swap.iloc[0]['value'],
                    'invoice_number': r.invoice_number}
            list_df_swap.append(pd.DataFrame(swap, index=[0]))

df_swap = pd.concat(list_df_swap, ignore_index=True)
The total dataset consists of several hundred million rows, and this approach is not very efficient. Is there a way to solve this problem using some kind of vectorised solution, or another method that would speed up the execution time?

Calculate all possible swaps, regardless of the invoice number:
swaps = df.merge(df, left_on='from_node', right_on='to_node')
Then select those that have the same invoice number:
columns = ['from_node_x', 'to_node_x', 'value_x', 'from_node_y', 'value_y',
'invoice_number_x']
swaps[swaps.invoice_number_x == swaps.invoice_number_y][columns]
# from_node_x to_node_x value_x from_node_y value_y invoice_number_x
#1 B F 20 X 33 a
#3 D H 60 X 43 d
#5 E H 70 A 10 a
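To avoid building the full cross-product of node pairs across all invoices (which can get huge on several hundred million rows), a variant of this merge is to include the invoice number in the join keys so only same-invoice pairs are ever generated. A sketch using the question's df, renaming columns up front so nothing gets suffixed:
# pre-filter by invoice via the join keys so only same-invoice pairs are built
left = df.rename(columns={'to_node': 'sent_to', 'value': 'sent_value'})
right = df.rename(columns={'from_node': 'received_from',
                           'value': 'received_value',
                           'invoice_number': 'invoice_number_r'})
swaps = left.merge(right,
                   left_on=['from_node', 'invoice_number'],
                   right_on=['to_node', 'invoice_number_r'])
result = swaps.rename(columns={'from_node': 'node'})[
    ['node', 'sent_to', 'sent_value', 'received_from', 'received_value', 'invoice_number']]
print(result)
Note that, unlike the len == 1 check in your loop, this keeps every matching pair if a node received more than once within the same invoice.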

Related

Python/Pandas: .CSV Data Cleanup For Migration

I'm a fairly new python/pandas user, and I have been tasked with the cleanup of a roughly 5,000 row csv of records, and the subsequent migration of the records into a SQL database.
The contents are individual people's personal information (which prevents me from posting it for reference) and their 'seat' occupation information, but the file has been... mismanaged... over the years, and has ended up looking like this:
#Sect1 Sect2 Sect3 Seat#
L/L/L/L 320/320/319/321 D/C/D/C 1-2/1-2/1-2/1-2
V 602 - 1-6
T 101 F 1&3
R 158 - 3* 4
U 818 4 Ds9R
With that individual's personal information in four columns not shown to the left.
In reality, even just the top row from the selection above should actually be:
#Sect1 Sect2 Sect3 Seat#
L 320 D 1
L 320 D 2
L 320 C 1
L 320 C 2
L 319 D 1
L 319 D 2
L 321 C 1
L 321 C 2
with the '-'s implying that it's 'through', not 'and'. (For example, the second row in my original example would be Seat# 1 through Seat# 6, not Seat# 1 and 6.)
I should also note that there's no unique ID/Index for these individuals, and it's purely based on First/Last name.
I've been attempting to break some of this up, and have limited success with
df1 = df1.drop('Sect2', axis=1).join(df1['Sect2'].str.split('/', expand=True).stack().reset_index(level=1, drop=True).rename('Sect2'))
but this eventually ends up creating erroneous records such as
#Sect1 Sect2 Sect3 Seat#
L 319 C 1
In the end, my question is: is using a script to clean this data even possible? I'm rapidly running out of ideas, and I really don't want to have to do this manually, but I also don't want to waste any more time trying to script this out if it's a pointless endeavor.
The code below should address the two issues described in your post. It should be selective enough to avoid misinterpreting rows, but some manual curation will likely still be necessary.
The basic concept is to iterate row by row, processing as much as possible before proceeding. The first priority is to split rows containing the character "/". If none is found, range values ("-") are interpreted. The while loop permits gradual improvement: for example, the code converts 1-3 into 1/2/3, then re-reads the same row and splits it into 3 different rows.
import re
import pandas as pd

# build demo dataframe
d = {"Sect1": ["L/L/L/L", "V", "T", "R", "U"],
     "Sect2": ["320/320/319/321", "602", "101", "158", "818"],
     "Sect3": ["D/C/D/C", "-", "F", "-", "4"],
     "Seat#": ["1-2/1-2/1-2/1-2", "1-6", "1&3", "3* 4", "Ds9R"]}
df = pd.DataFrame(data=d)

index = 0
while index < len(df):
    len_df = len(df)
    row_li = [df.iloc[index][x] for x in df.columns]
    # extract separated values
    sep_li = [x.split("/") for x in row_li]
    sep_min, sep_max = len(min(sep_li, key=len)), len(max(sep_li, key=len))
    # extract range values
    num_range_li = [re.findall(r"^\d+\-\d+$|$", x)[0].split("-") for x in row_li]
    num_range_max = len(max(num_range_li, key=len))
    # create temporary dictionary representing the current row
    r = {}
    for i, head in enumerate(df.columns):
        r[head] = row_li[i]
    # separated values treatment -> split into distinct rows
    if sep_min > 1 and sep_min == sep_max:
        for i, head in enumerate(df.columns):
            r[head] = sep_li[i]
        row_df = pd.DataFrame(data=r)
        df = pd.concat([df, row_df], ignore_index=True)
    # range values treatment -> convert into separated values
    elif num_range_max > 1:
        for part in (1, 2):
            for idx, header in enumerate(df.columns):
                if len(num_range_li[idx]) > 1:
                    split_li = [str(x) for x in range(int(num_range_li[idx][0]),
                                                      int(num_range_li[idx][1]) + 1)]
                    # convert range values to separated values
                    if part == 1:
                        r[header] = "/".join(split_li)
                    # multiply other values
                    else:
                        for i, head in enumerate(df.columns):
                            if i != idx:
                                r[head] = "/".join([r[head] for x in range(len(split_li))])
        row_df = pd.DataFrame(data=r, index=[0])
        df = pd.concat([df, row_df], ignore_index=True)
    # if no new rows are added, increment
    if len(df) == len_df:
        index += 1
    # if rows are added, drop the current row
    else:
        df = pd.concat([df.iloc[:index], df.iloc[index + 1:]])
print(df)
Output
Sect1 Sect2 Sect3 Seat#
0 T 101 F 1&3
1 R 158 - 3* 4
2 U 818 4 Ds9R
4 V 602 - 1
5 V 602 - 2
6 V 602 - 3
7 V 602 - 4
8 V 602 - 5
9 V 602 - 6
10 L 320 D 1
11 L 320 D 2
12 L 320 C 1
13 L 320 C 2
14 L 319 D 1
15 L 319 D 2
16 L 321 C 1
17 L 321 C 2
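A more vectorised sketch is also possible for this particular layout, assuming every "/"-separated cell in a row splits into the same number of parts and that ranges like 1-2 only occur in the Seat# column. It reuses the demo df above and relies on DataFrame.explode (multi-column explode needs pandas 1.3+):
import re
import pandas as pd

cols = ["Sect1", "Sect2", "Sect3", "Seat#"]
tmp = df.copy()

# split the "/"-separated cells into lists and explode all columns together
for c in cols:
    tmp[c] = tmp[c].astype(str).str.split("/")
tmp = tmp.explode(cols, ignore_index=True)

# expand "a-b" ranges in Seat# into one entry per seat; leave everything else alone
def expand_range(s):
    m = re.fullmatch(r"(\d+)-(\d+)", s)
    if m:
        return [str(i) for i in range(int(m.group(1)), int(m.group(2)) + 1)]
    return [s]

tmp["Seat#"] = tmp["Seat#"].map(expand_range)
tmp = tmp.explode("Seat#", ignore_index=True)
print(tmp)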

Sampling a dataframe by selecting rows where the location modulo P = Q

Let's say I have a dataframe with N rows. I want to pick the rows where the row location modulo P gives Q. So for concreteness, let's say P = 7 and Q = 5.
Row 0: 0 mod 7 = 0 (not satisfied)
Row 1: 1 mod 7 = 1 (not satisfied)
...
Row 5: 5 mod 7 = 5 (satisfied)
...
Row 12: 12 mod 7 = 5 (satisfied)
So the rows that are selected will be 5, 12, 19, 26 ....
If Q=0, you can use the slicing method df.iloc[::P]. How does one do it for mod P = Q?
Use df.iloc[Q::P], which starts at row Q and then steps in increments of P.
When the first argument isn't given, as in .iloc[::P], it is implicitly 0 (and the stop value is implicitly the end of the DataFrame); you can simply specify a start other than 0 if that is what you need.
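A minimal illustration with P = 7 and Q = 5 (iloc is positional, so this works regardless of the index labels):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(30)})
print(df.iloc[5::7])   # selects rows 5, 12, 19, 26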
Using the numpy package (note this flags the rows rather than selecting them, and df.index % P matches row position only for a default RangeIndex):
import numpy as np
#instantiate new col
df["satisfied"] = 0
#fill new col based on modulus condition
df.satisfied = np.where(df.index % P == Q, "(satisfied)", "(not satisfied)")
code:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(100).reshape(25,4), columns = ['A','B','C','D'])
p = 7
q = 5
a = []
# collect the row positions i where i % p == q
for i in range(df.shape[0]):
    if i % p == q:
        a.append(i)
# select and print those rows
print(df.iloc[a, :])
Output:
================
A B C D
5 20 21 22 23
12 48 49 50 51
19 76 77 78 79
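An equivalent vectorised selection, sketched with a positional boolean mask (reusing df, p and q from the snippet above) so it does not depend on the index labels:
mask = np.arange(len(df)) % p == q   # True at positions 5, 12, 19, ...
print(df[mask])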

Is there a way to speed up the following pandas for loop?

My data frame contains 10,000,000 rows! After group by, ~ 9,000,000 sub-frames remain to loop through.
The code is:
data = pd.read_csv('big.csv')
for id, new_df in data.groupby(level=0):  # look at mini df and do some analysis
    # some code for each of the small data frames
This is super inefficient, and the code has been running for 10+ hours now.
Is there a way to speed it up?
Full Code:
d = pd.DataFrame()  # new df to populate
print('Start of the loop')
for id, new_df in data.groupby(level=0):
    c = [new_df.iloc[i:] for i in range(len(new_df.index))]
    x = pd.concat(c, keys=new_df.index).reset_index(level=(2, 3), drop=True).reset_index()
    x = x.set_index(['level_0', 'level_1', x.groupby(['level_0', 'level_1']).cumcount()])
    d = pd.concat([d, x])
To get the data:
data = pd.read_csv('https://raw.githubusercontent.com/skiler07/data/master/so_data.csv', index_col=0).set_index(['id','date'])
Note:
Most of the id's will only have 1 date, which indicates only 1 visit. For id's with more visits, I would like to structure them in a 3D format, e.g. store all of their visits in the 2nd dimension out of 3. The output shape is (id, visits, features).
Here is one way to speed that up. It adds the desired new rows in code that processes the rows directly, which saves the overhead of constantly constructing small dataframes. Your sample of 100,000 rows runs in a couple of seconds on my machine, while your original code takes > 100 seconds on only 10,000 rows of the sample data, so this represents a couple of orders of magnitude of improvement.
Code:
import pandas as pd

def make_3d(csv_filename):

    def make_3d_lines(a_df):
        a_df['depth'] = 0
        depth = 0
        prev = None
        accum = []
        for row in a_df.values.tolist():
            row[0] = 0
            key = row[1]
            if key == prev:
                depth += 1
                accum.append(row)
            else:
                if depth == 0:
                    yield row
                else:
                    depth = 0
                    to_emit = []
                    for i in range(len(accum)):
                        date = accum[i][2]
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j
                            to_emit[-1][2] = date
                    for r in to_emit[1:]:
                        yield r
                accum = [row]
                prev = key

    df_data = pd.read_csv(csv_filename)
    df_data.columns = ['depth'] + list(df_data.columns)[1:]
    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split())),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())
    return new_df.set_index('id date'.split())
Test Code:
import time

start_time = time.time()
df = make_3d('big-data.csv')
print(time.time() - start_time)

df = df.drop(columns=['feature%d' % i for i in range(3, 25)])
print(df[df['depth'] != 0].head(10))
Results:
1.7390995025634766
depth feature0 feature1 feature2
id date
207555809644681 20180104 1 0.03125 0.038623 0.008130
247833985674646 20180106 1 0.03125 0.004378 0.004065
252945024181083 20180107 1 0.03125 0.062836 0.065041
20180107 2 0.00000 0.001870 0.008130
20180109 1 0.00000 0.001870 0.008130
329567241731951 20180117 1 0.00000 0.041952 0.004065
20180117 2 0.03125 0.003101 0.004065
20180117 3 0.00000 0.030780 0.004065
20180118 1 0.03125 0.003101 0.004065
20180118 2 0.00000 0.030780 0.004065
I believe your approach to feature engineering could be done better, but I will stick to answering your question.
In Python, iterating over a dictionary is much faster than iterating over a DataFrame row by row.
Here is how I managed to process a huge pandas DataFrame (~100,000,000 rows):
# reset the DataFrame index to get the index levels (id, date) back as columns
df = data.reset_index()

# split the DataFrame based on id
# and store the splits as DataFrames in a dictionary using id as the key
d = dict(tuple(df.groupby('id')))

# iterate over the dictionary and process the values
for key, value in d.items():
    pass  # each value is a DataFrame

# concat the values and get the original (processed) DataFrame back
df2 = pd.concat(d.values(), ignore_index=True)
Modified version of @Stephen's code:
def make_3d(dataset):

    def make_3d_lines(a_df):
        a_df['depth'] = 0    # sets all depth values to 0
        depth = 1            # initiate from 1, so that the first loop is correct
        prev = None
        accum = []           # accumulates blocks of data belonging to a given user
        for row in a_df.values.tolist():  # for each row in our dataset
            row[0] = 0                    # NOT SURE
            key = row[1]                  # this is the id of the row
            if key == prev:               # if this row's id matches the previous row's id, append together
                depth += 1
                accum.append(row)
            else:                         # else this id is new, so the previous block is complete -> process it
                if depth == 0:            # previous id appeared only once -> get that row from accum
                    yield accum[0]        # also remember that depth = 0
                else:                     # process the block and emit each row
                    depth = 0
                    to_emit = []          # prepare the list to emit
                    for i in range(len(accum)):      # for each unique day in the accumulated list
                        date = accum[i][2]           # define date to be the first date it sees
                        for j, r in enumerate(accum[i:]):
                            to_emit.append(list(r))
                            to_emit[-1][0] = j       # define the depth
                            to_emit[-1][2] = date    # define the date
                    for r in to_emit[0:]:
                        yield r
                accum = [row]
                prev = key

    df_data = dataset.reset_index()
    df_data.columns = ['depth'] + list(df_data.columns)[1:]
    new_df = pd.DataFrame(
        make_3d_lines(df_data.sort_values('id date'.split(), ascending=[True, False])),
        columns=df_data.columns
    ).astype(dtype=df_data.dtypes.to_dict())
    return new_df.set_index('id date'.split())
Testing:
t = pd.DataFrame(data={'id': [1,1,1,1,2,2,3,3,4,5],
                       'date': [20180311,20180310,20180210,20170505,20180312,
                                20180311,20180312,20180311,20170501,20180304],
                       'feature': [10,20,45,1,14,15,20,20,13,11],
                       'result': [1,1,0,0,0,0,1,0,1,1]})
t = t.reindex(columns=['id','date','feature','result'])
print(t)
id date feature result
0 1 20180311 10 1
1 1 20180310 20 1
2 1 20180210 45 0
3 1 20170505 1 0
4 2 20180312 14 0
5 2 20180311 15 0
6 3 20180312 20 1
7 3 20180311 20 0
8 4 20170501 13 1
9 5 20180304 11 1
Output
depth feature result
id date
1 20180311 0 10 1
20180311 1 20 1
20180311 2 45 0
20180311 3 1 0
20180310 0 20 1
20180310 1 45 0
20180310 2 1 0
20180210 0 45 0
20180210 1 1 0
20170505 0 1 0
2 20180312 0 14 0
20180312 1 15 0
20180311 0 15 0
3 20180312 0 20 1
20180312 1 20 0
20180311 0 20 0
4 20170501 0 13 1
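Whichever variant you choose, note that the d = pd.concat([d, x]) inside your original loop re-copies the whole accumulated frame on every iteration. A small sketch of the original code, kept otherwise unchanged, that collects the pieces in a list and concatenates once at the end:
pieces = []
for key, new_df in data.groupby(level=0):
    c = [new_df.iloc[i:] for i in range(len(new_df.index))]
    x = pd.concat(c, keys=new_df.index).reset_index(level=(2, 3), drop=True).reset_index()
    x = x.set_index(['level_0', 'level_1',
                     x.groupby(['level_0', 'level_1']).cumcount()])
    pieces.append(x)   # accumulate, do not concat inside the loop
d = pd.concat(pieces)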

Pandas track consecutive near numbers via compare-cumsum-groupby pattern

I am trying to extend my current pattern to accommodate an extra condition of +- a percentage of the last value, rather than a strict match against the previous value.
import numpy as np
import pandas as pd

data = np.array([[2,30],[2,900],[2,30],[2,30],[2,30],[2,1560],[2,30],
                 [2,300],[2,30],[2,450]])
df = pd.DataFrame(data)
df.columns = ['id','interval']
UPDATE 2 (id fix): Updated Data 2 with more data:
data2 = np.array([[2,30],[2,900],[2,30],[2,29],[2,31],[2,30],[2,29],[2,31],[2,1560],[2,30],[2,300],[2,30],[2,450], [3,40],[3,900],[3,40],[3,39],[3,41], [3,40],[3,39],[3,41] ,[3,1560],[3,40],[3,300],[3,40],[3,450]])
df2 = pd.DataFrame(data2)
df2.columns = ['id','interval']
for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
results in [30, 30, 30].
However, I really want to catch near-number conditions, say when a number is within +-10% of the previous number.
So, looking at df2, I would like to pick up the series [30, 29, 31]:
for i, g in df2.groupby([(df2.interval != <??? +-10% magic ???>).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
UPDATE: Here is the end-of-the-line processing code, where I store the gathered lists in a dictionary with the id as the key:
leak_intervals = {}
final_leak_intervals = {}
serials = []
for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
        serial = g.id.values[0]
        if serial not in serials:
            serials.append(serial)
        if serial not in leak_intervals:
            leak_intervals[serial] = g.interval.tolist()
        else:
            leak_intervals[serial] = leak_intervals[serial] + (g.interval.tolist())
UPDATE:
In [116]: df2.groupby(df2.interval.pct_change().abs().gt(0.1).cumsum()) \
.filter(lambda x: len(x) >= 3)
Out[116]:
id interval
2 2 30
3 2 29
4 2 31
5 2 30
6 2 29
7 2 31
15 3 40
16 3 39
17 3 41
18 3 40
19 3 39
20 3 41
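If runs should never span two different ids, a sketched variant is to also break the groups whenever the id changes:
# break a run when the interval jumps by more than 10% OR when the id changes
breaks = (df2.interval.pct_change().abs().gt(0.1) | (df2.id != df2.id.shift())).cumsum()
runs = df2.groupby(breaks).filter(lambda x: len(x) >= 3)
print(runs)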

Binning values into groups with a minimum size using pandas

I'm trying to bin a sample of observations into n discrete groups, then combine these groups until each subgroup has a minimum of 6 members. So far, I've generated bins, and grouped my DataFrame into them:
# df is a DataFrame containing 135 measurments
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
grp.size()
1 4
2 1
3 2
4 3
5 2
6 8
7 7
8 6
9 19
10 12
11 13
12 12
13 7
14 12
15 12
16 2
17 3
18 6
19 3
21 1
So I can see that I need to combine groups 1 - 3, 3 - 5, and 16 - 21, while leaving the others intact, but I don't know how to do this programmatically.
You can do this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 201, 135), columns=['heights'])
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
sizes = grp.size()

def f(vals, max):
    sum = 0
    group = 1
    for v in vals:
        sum += v
        if sum <= max:
            yield group
        else:
            group += 1
            sum = v
            yield group

# I've changed 6 to 30 for the example because I don't have your original dataset
grp.size().groupby([g for g in f(sizes, 30)])
And if you do print(grp.size().groupby([g for g in f(sizes, 30)]).cumsum()) you will see that the cumulative sums are grouped as expected.
Also if you want to group the original values you can do something like:
dat = np.random.randint(0, 201, 135)
dat = np.array([78,116,146,111,147,78,14,91,196,92,163,144,107,182,58,89,77,134,
                83,126,94,70,121,175,174,88,90,42,93,131,91,175,135,8,142,166,
                1,112,25,34,119,13,95,182,178,200,97,8,60,189,49,94,191,81,
                56,131,30,107,16,48,58,65,78,8,0,11,45,179,151,130,35,64,
                143,33,49,25,139,20,53,55,20,3,63,119,153,14,81,93,62,162,
                46,29,84,4,186,66,90,174,55,48,172,83,173,167,66,4,197,175,
                184,20,23,161,70,153,173,127,51,186,114,27,177,96,93,105,169,158,
                83,155,161,29,197,143,122,72,60])
df = pd.DataFrame({'heights': dat})
bins = np.digitize(dat, np.linspace(0, 200, 21))
grp = df.heights.groupby(bins)

m = 15  # you should put 6 here, the minimum
s = 0
c = 1
def f(x):
    global c, s
    res = pd.Series([c] * x.size, index=x.index)
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res

g = grp.apply(f)
print(df.groupby(g).size())

# another way of doing the same, just a matter of taste
m = 15  # you should put 6 here, the minimum
s = 0
c = 1
def f2(x):
    global c, s
    res = [c] * x.size  # here is the main difference with f
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res

g = grp.transform(f2)  # call it this way
print(df.groupby(g).size())
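As a side note, the same merging of undersized bins can be sketched without globals, reusing df, bins and grp from above, by walking the bin sizes once and building a mapping from original bin to merged group (min_size would be 6 in your case):
# merge adjacent bins until each merged group has at least min_size members
min_size = 6
sizes = grp.size()

mapping = {}
group = 1
count = 0
for bin_label, n in sizes.items():
    mapping[bin_label] = group
    count += n
    if count >= min_size:   # current merged group is large enough; start a new one
        group += 1
        count = 0
# note: the trailing group may still be under min_size; fold it into the previous group if needed

merged = df.heights.groupby([mapping[b] for b in bins])
print(merged.size())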
