Pandas - merging rows with contiguous intervals - python

I've started using Pandas recently and have been stumbling over this issue for a few days. I have a dataframe with interval information that looks a bit like this:
df = pd.DataFrame({'RangeBegin' : [1,3,5,10,12,42,65],
                   'RangeEnd' : [2,4,7,11,41,54,100],
                   'Var1' : ['A','A','A','B','B','B','A'],
                   'Var2' : ['A','A','B','B','B','B','A']})
RangeBegin RangeEnd Var1 Var2
0 1 2 A A
1 3 4 A A
2 5 7 A B
3 10 11 B B
4 12 41 B B
5 42 54 B B
6 65 100 A A
It is sorted by RangeBegin. The idea is to end up with something like this instead:
RangeBegin RangeEnd Var1 Var2
0 1.0 4.0 A A
2 5.0 7.0 A B
3 10.0 54.0 B B
6 65.0 100.0 A A
Where every "duplicate" (matching Var1 and Var2) row with contiguous ranges is aggregated into a single row. I'm thinking of expanding this algorithm to detect and deal with overlaps, but I'd like to get this working properly first.
You see, I've got a solution working by using iterrows to build a new dataframe row-by-row, but it takes far too long on my real dataset and I'd like to use a more vectorized implementation.
I've looked into groupby but can't find a set of keys (or a function to apply to said groups) that would make this work.
Here's my current implementation as it stands:
def test():
    df = pd.DataFrame({'RangeBegin' : [1,3,5,10,12,42,65],
                       'RangeEnd' : [2,4,7,11,41,54,100],
                       'Var1' : ['A','A','A','B','B','B','A'],
                       'Var2' : ['A','A','B','B','B','B','A']})
    print(df)

    i = 0
    cols = df.columns
    aggData = pd.DataFrame(columns=cols)
    for row in df.iterrows():
        rowIndex, rowData = row
        # if our new dataframe is empty or its last row is not contiguous, append it
        if aggData.empty or not duplicateContiguousRow(cols, rowData, aggData.loc[i]):
            aggData = aggData.append(rowData)
            i = rowIndex
        # otherwise, modify the last row
        else:
            aggData.loc[i, 'RangeEnd'] = rowData['RangeEnd']
    print(aggData)

def duplicateContiguousRow(cols, row, aggDataRow):
    # first bool: are the ranges contiguous?
    contiguousBool = aggDataRow['RangeEnd'] + 1 == row['RangeBegin']
    if not contiguousBool:
        return False
    # second bool: is this row a duplicate (minus range columns)?
    duplicateBool = True
    for col in cols:
        if not duplicateBool:
            break
        elif col not in ['RangeBegin', 'RangeEnd']:
            # NaN != NaN
            duplicateBool = duplicateBool and (row[col] == aggDataRow[col] or (row[col] != row[col] and aggDataRow[col] != aggDataRow[col]))
    return duplicateBool
EDIT: This question just got asked while I was writing this one. The answer looks promising.

You can use groupby for this purpose, after first detecting the consecutive segments:
df['block'] = ((df['Var1'].shift(1) != df['Var1']) | (df['Var2'].shift(1) != df['Var2'])).astype(int).cumsum()
df.groupby(['Var1', 'Var2', 'block']).agg({'RangeBegin': np.min, 'RangeEnd': np.max}).reset_index()
will result in:
Var1 Var2 block RangeBegin RangeEnd
0 A A 1 1 4
1 A A 4 65 100
2 A B 2 5 7
3 B B 3 10 54
You could then sort by block to restore the original order.
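For example, a minimal sketch of that last step (assuming the block helper column is no longer wanted afterwards):

result = (df.groupby(['Var1', 'Var2', 'block'])
            .agg({'RangeBegin': np.min, 'RangeEnd': np.max})
            .reset_index()
            .sort_values('block')    # restore the original order
            .drop('block', axis=1))  # drop the helper column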

Related

vectoring pandas df by row with multiple conditional statements

I'm trying to avoid for loops that apply a function row by row to a pandas df. I have looked at many vectorization examples but have not come across anything that will work completely. Ultimately I am trying to add a column to the df containing, for each row, the sum of points awarded for the conditions that row satisfies, with a specified value per condition.
I have looked at np.apply_along_axis, but that's just a hidden loop, and at np.where, but I could not see that working for the 25 conditions that I am checking.
A B C ... R S T
0 0.279610 0.307119 0.553411 ... 0.897890 0.757151 0.735718
1 0.718537 0.974766 0.040607 ... 0.470836 0.103732 0.322093
2 0.222187 0.130348 0.894208 ... 0.480049 0.348090 0.844101
3 0.834743 0.473529 0.031600 ... 0.049258 0.594022 0.562006
4 0.087919 0.044066 0.936441 ... 0.259909 0.979909 0.403292
[5 rows x 20 columns]
def point_calc(row):
    points = 0
    if row[2] >= row[13]:
        points += 1
    if row[2] < 0:
        points -= 3
    if row[4] >= row[8]:
        points += 2
    if row[4] < row[12]:
        points += 1
    if row[16] == row[18]:
        points += 4
    return points

points_list = []
for indx, row in df.iterrows():
    value = point_calc(row)
    points_list.append(value)
df['points'] = points_list
This is obviously not efficient but I am not sure how I can vectorize my code since it requires the values per row for each column in the df to get a custom summation of the conditions.
Any help in pointing me in the right direction would be much appreciated.
Thank you.
UPDATE:
I was able to get a little more speed by replacing the df.iterrows section with df.apply.
df['points'] = df.apply(lambda row: point_calc(row), axis=1)
UPDATE2:
I updated the function as follows and have substantially decreased the run time, roughly a 10x speed increase over using df.apply with the initial function.
def point_calc(row):
    a1 = np.where(row[:, 2] >= row[:, 13], 1, 0)
    a2 = np.where(row[:, 2] < 0, -3, 0)
    a3 = np.where(row[:, 4] >= row[:, 8], 2, 0)
    # etc.
    all_points = a1 + a2 + a3  # + etc.
    return all_points

df['points'] = point_calc(df.to_numpy())
What I am still working on is using np.vectorize on the function itself to see if that can be improved upon as well.
You can try it the following way:
# this is a small version of your dataframe
df = pd.DataFrame(np.random.random((10,4)), columns=list('ABCD'))
It looks like this:
A B C D
0 0.724198 0.444924 0.554168 0.368286
1 0.512431 0.633557 0.571369 0.812635
2 0.680520 0.666035 0.946170 0.652588
3 0.467660 0.277428 0.964336 0.751566
4 0.762783 0.685524 0.294148 0.515455
5 0.588832 0.276401 0.336392 0.997571
6 0.652105 0.072181 0.426501 0.755760
7 0.238815 0.620558 0.309208 0.427332
8 0.740555 0.566231 0.114300 0.353880
9 0.664978 0.711948 0.929396 0.014719
You can create a Series which counts your points and is initialized with zeros:
points = pd.Series(0, index=df.index)
It looks like this:
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: int64
Afterwards you can add and subtract values line by line if you want. The condition within the brackets selects the rows where the condition is true, so -= and += are only applied to those rows.
points.loc[df.A < df.C] += 1
points.loc[df.B < 0] -= 3
At the end you can extract the values of the series as numpy array if you want (optional):
point_list = points.values
Does this solve your problem?
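For reference, a hedged sketch of the same pattern applied to the positional conditions from the original point_calc (this assumes the asker's wide DataFrame rather than the small 4-column demo above; the column positions are taken from the question):

points = pd.Series(0, index=df.index)
points.loc[df.iloc[:, 2] >= df.iloc[:, 13]] += 1
points.loc[df.iloc[:, 2] < 0] -= 3
points.loc[df.iloc[:, 4] >= df.iloc[:, 8]] += 2
points.loc[df.iloc[:, 4] < df.iloc[:, 12]] += 1
points.loc[df.iloc[:, 16] == df.iloc[:, 18]] += 4
df['points'] = points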

Improving performance of Python for loops with Pandas data frames

please consider the following DataFrame df:
timestamp id condition
1234 A
2323 B
3843 B
1234 C
8574 A
9483 A
Based on the condition contained in the column condition, I have to define a new column in this data frame which counts how many ids are in that condition.
However, please note that since the DataFrame is ordered by the timestamp column, one could have multiple entries of the same id and then a simple .cumsum() is not a viable option.
I have come out with the following code, which is working properly but is extremely slow:
# I start by defining empty arrays
ids_with_condition_a = np.empty(0)
ids_with_condition_b = np.empty(0)
ids_with_condition_c = np.empty(0)

# Initializing the new column
df['count'] = 0

# Using a for loop to do the task, but this is sooo slow!
for r in range(0, df.shape[0]):
    if df.condition[r] == 'A':
        ids_with_condition_a = np.append(ids_with_condition_a, df.id[r])
    elif df.condition[r] == 'B':
        ids_with_condition_b = np.append(ids_with_condition_b, df.id[r])
        ids_with_condition_a = np.setdiff1d(ids_with_condition_a, ids_with_condition_b)
    elif df.condition[r] == 'C':
        ids_with_condition_c = np.append(ids_with_condition_c, df.id[r])
    df.loc[r, 'count'] = ids_with_condition_a.size
Keeping these NumPy arrays is very useful to me because it gives the list of the ids in a particular condition. I would also like to be able to dynamically put these arrays in a corresponding cell in the df DataFrame.
Are you able to come up with a better solution in terms of performance?
You need to use groupby on the column 'condition' and cumcount to count how many ids are in each condition up to the current row (which seems to be what your code does):
df['count'] = df.groupby('condition').cumcount()+1 # +1 is to start at 1 not 0
with your input sample, you get:
id condition count
0 1234 A 1
1 2323 B 1
2 3843 B 2
3 1234 C 1
4 8574 A 2
5 9483 A 3
which is faster than using a for loop.
And if you just want the rows with condition A, for example, you can use a mask: if you do print(df[df['condition'] == 'A']), you only see the rows where condition equals A. So to get an array:
arr_A = df.loc[df['condition'] == 'A','id'].values
print (arr_A)
array([1234, 8574, 9483])
EDIT: to create two columns per condition, you can do, for example for condition A:
# put 1 in a column where the condition is met
df['nb_cond_A'] = pd.np.where(df['condition'] == 'A',1,None)
# then use cumsum to increment the number, ffill to fill the same number down
# where the condition is not met, and fillna(0) to fill the other missing values
df['nb_cond_A'] = df['nb_cond_A'].cumsum().ffill().fillna(0).astype(int)
# for the partial list, first create the full array
arr_A = df.loc[df['condition'] == 'A','id'].values
# create the column with apply (here another might exist, but it's one way)
df['partial_arr_A'] = df['nb_cond_A'].apply(lambda x: arr_A[:x])
the output looks like this:
id condition nb_condition_A partial_arr_A nb_cond_A
0 1234 A 1 [1234] 1
1 2323 B 1 [1234] 1
2 3843 B 1 [1234] 1
3 1234 C 1 [1234] 1
4 8574 A 2 [1234, 8574] 2
5 9483 A 3 [1234, 8574, 9483] 3
Then do the same thing for B and C. A loop like for cond in set(df['condition']) might be practical for generalisation.
EDIT 2: one idea to do what you explained in the comments, though I'm not sure it improves the performance:
# array of unique condition
arr_cond = df.condition.unique()
#use apply to create row-wise the list of ids for each condition
df[arr_cond] = (df.apply(lambda row: (df.loc[:row.name].drop_duplicates('id', 'last')
                                        .groupby('condition').id.apply(list)), axis=1)
                  .applymap(lambda x: [] if not isinstance(x, list) else x))
Some explanations: for each row, select the dataframe up to this row with loc[:row.name], drop the duplicated 'id' keeping the last one with drop_duplicates('id','last') (in your example, this means that once we reach row 3, row 0 is dropped, as the id 1234 appears twice), then the data is grouped by condition with groupby('condition'), and the ids for each condition are put into the same list with id.apply(list). The part starting with applymap fills the missing values with an empty list (you can't use fillna([]); it's not possible).
For the length for each condition, you can do:
for cond in arr_cond:
df['len_{}'.format(cond)] = df[cond].str.len().fillna(0).astype(int)
The result looks like this:
id condition A B C len_A len_B len_C
0 1234 A [1234] [] [] 1 0 0
1 2323 B [1234] [2323] [] 1 1 0
2 3843 B [1234] [2323, 3843] [] 1 2 0
3 1234 C [] [2323, 3843] [1234] 0 2 1
4 8574 A [8574] [2323, 3843] [1234] 1 2 1
5 9483 A [8574, 9483] [2323, 3843] [1234] 2 2 1

Pandas logical indexing on a single column of a dataframe to assign values

I am an R programmer and looking for a similar way to do something like this in R:
data[data$x > value, y] <- 1
(basically, take all rows where the x column is greater than some value and assign the y column at those rows the value of 1)
In pandas it would seem the equivalent would go something like:
data['y'][data['x'] > value] = 1
But this gives a SettingWithCopyWarning.
Equivalent statements I've tried are:
condition = data['x']>value
data.loc(condition,'x')=1
But I'm seriously confused. Maybe I'm thinking too much in R terms and can't wrap my head around what's going on in Python.
What would be equivalent code for this in Python, or workarounds?
Your statement is incorrect; it should be:
data.loc[condition, 'x'] = 1
Example:
In [3]:
df = pd.DataFrame({'a':np.random.randn(10)})
df
Out[3]:
a
0 -0.063579
1 -1.039022
2 -0.011687
3 0.036160
4 0.195576
5 -0.921599
6 0.494899
7 -0.125701
8 -1.779029
9 1.216818
In [4]:
condition = df['a'] > 0
df.loc[condition, 'a'] = 20
df
Out[4]:
a
0 -0.063579
1 -1.039022
2 -0.011687
3 20.000000
4 20.000000
5 -0.921599
6 20.000000
7 -0.125701
8 -1.779029
9 20.000000
As you are subscripting the df, you should use square brackets [] rather than parentheses (), which would be a function call. See the docs.

Editing then concatenating values of several columns into a single one (pandas, python)

I'm looking for a way to use pandas and python to combine several columns in an excel sheet with known column names into a new, single one, keeping all the important information as in the example below:
input:
ID,tp_c,tp_b,tp_p
0,transportation - cars,transportation - boats,transportation - planes
1,checked,-,-
2,-,checked,-
3,checked,checked,-
4,-,checked,checked
5,checked,checked,checked
desired output:
ID,tp_all
0,transportation
1,cars
2,boats
3,cars+boats
4,boats+planes
5,cars+boats+planes
The row with ID 0 contains a description of the contents of each column. Ideally the code would parse that description, take what comes after the '-', and concatenate those values in the new "tp_all" column.
This is quite interesting as it's a reverse get_dummies...
I think I would manually munge the column names so that you have a boolean DataFrame:
In [11]: df1 # df == 'checked'
Out[11]:
cars boats planes
0
1 True False False
2 False True False
3 True True False
4 False True True
5 True True True
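One hedged way to build that boolean frame from the original df, assuming the layout in the question (descriptions in row 0, 'checked' marking membership):

# take the transport names after ' - ' in row 0 as the new column names
modes = df.loc[0, ['tp_c', 'tp_b', 'tp_p']].str.split(' - ').str[1]
df1 = df.loc[1:, ['tp_c', 'tp_b', 'tp_p']] == 'checked'
df1.columns = modes.values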
Now you can use an apply with zip:
In [12]: df1.apply(lambda row: '+'.join([col for col, b in zip(df1.columns, row) if b]),
axis=1)
Out[12]:
0
1 cars
2 boats
3 cars+boats
4 boats+planes
5 cars+boats+planes
dtype: object
Now you just have to tweak the headers, to get the desired csv.
Would be nice if there were a less manual way / faster to do reverse get_dummies...
OK a more dynamic method:
In [63]:
# get a list of the columns
col_list = list(df.columns)
# remove 'ID' column
col_list.remove('ID')
# create a dict as a lookup
col_dict = dict(zip(col_list, [df.iloc[0][col].split(' - ')[1] for col in col_list]))
col_dict
Out[63]:
{'tp_b': 'boats', 'tp_c': 'cars', 'tp_p': 'planes'}
In [64]:
# define a func that tests the value and uses the dict to create our string
def func(x):
    temp = ''
    for col in col_list:
        if x[col] == 'checked':
            if len(temp) == 0:
                temp = col_dict[col]
            else:
                temp = temp + '+' + col_dict[col]
    return temp
df['combined'] = df[1:].apply(lambda row: func(row), axis=1)
df
Out[64]:
ID tp_c tp_b tp_p \
0 0 transportation - cars transportation - boats transportation - planes
1 1 checked NaN NaN
2 2 NaN checked NaN
3 3 checked checked NaN
4 4 NaN checked checked
5 5 checked checked checked
combined
0 NaN
1 cars
2 boats
3 cars+boats
4 boats+planes
5 cars+boats+planes
[6 rows x 5 columns]
In [65]:
df = df.ix[1:,['ID', 'combined']]
df
Out[65]:
ID combined
1 1 cars
2 2 boats
3 3 cars+boats
4 4 boats+planes
5 5 cars+boats+planes
[5 rows x 2 columns]
Here is one way:
newCol = pandas.Series('',index=d.index)
for col in d.ix[:, 1:]:
name = '+' + col.split('-')[1].strip()
newCol[d[col]=='checked'] += name
newCol = newCol.str.strip('+')
Then:
>>> newCol
0 cars
1 boats
2 cars+boats
3 boats+planes
4 cars+boats+planes
dtype: object
You can create a new DataFrame with this column or do what you like with it.
Edit: I see that you have edited your question so that the names of the modes of transportation are now in row 0 instead of in the column headers. It is easier if they're in the column headers (as my answer assumes), and your new column headers don't seem to contain any additional useful information, so you should probably start by just setting the column names to the info from row 0, and deleting row 0.
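A small sketch of that suggestion, assuming the layout shown in the question ('ID' stays as the first column and the descriptions sit in row 0):

# promote the descriptions in row 0 to column headers, then drop that row
df.columns = ['ID'] + list(df.iloc[0, 1:])
df = df.drop(0).reset_index(drop=True)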

Pandas Multi-Column Boolean Indexing/Selection with Dict Generator

Let's imagine you have a DataFrame df with a large number of columns, say 50, and df does not have any indexes (i.e. index_col=None). You would like to select a subset of the columns as defined by a required_columns_list, but would like to only return those rows meeting multiple criteria as defined by various boolean indexes. Is there a way to concisely generate the selection statement using a dict generator?
As an example:
df = pd.DataFrame(np.random.randn(100,50),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# df.columns = Index[u'Col001', u'Col002', ..., u'Col050']
required_columns_list = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
Now let's imagine that I define:
boolean_index_dict = {'Col001':"MyAccount", 'Col002':"Summary", 'Col005':"Total"}
I would like to do the selection using a dict generator to construct the multiple boolean indices:
df.loc[GENERATOR_USING_boolean_index_dict, required_columns_list].values
The above generator boolean method would be the equivalent of:
df.loc[(df['Col001']=="MyAccount") & (df['Col002']=="Summary") & (df['Col005']=="Total"), ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']].values
Hopefully you can see that this would be a really useful 'template' for operating on large DataFrames, where the boolean indexing can then be defined in the boolean_index_dict. I would greatly appreciate it if you could let me know whether this is possible in Pandas and how to construct the GENERATOR_USING_boolean_index_dict.
Many thanks and kind regards,
Bertie
p.s. If you would like to test this out, you will need to populate some of the df columns with text. The definition of df using random numbers was simply given as a starting point if required for testing...
Suppose this is your df:
df = pd.DataFrame(np.random.randint(0,4,(100,50)),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# the first five cols and rows:
df.iloc[:5,:5]
Col001 Col002 Col003 Col004 Col005
0 2 0 2 3 1
1 0 1 0 1 3
2 0 1 1 0 3
3 3 1 0 2 1
4 1 2 3 1 0
Compared to your example all columns are filled with ints of 0,1,2 or 3.
Lets define the criteria:
req = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
filt = {'Col001': 2, 'Col002': 2, 'Col005': 2}
So we want some columns from the rows where some other columns all contain the value 2.
You can then get the result with:
df.loc[df[filt.keys()].apply(lambda x: x.tolist() == filt.values(), axis=1), req]
In my case this is the result:
Col002 Col012 Col025 Col032 Col033
43 2 2 1 3 3
98 2 1 1 1 2
Let's check the filter columns for those rows:
df[filt.keys()].iloc[[43,98]]
Col005 Col001 Col002
43 2 2 2
98 2 2 2
And some other (non-matching) rows:
df[filt.keys()].iloc[[44,99]]
Col005 Col001 Col002
44 3 0 3
99 1 0 0
I'm starting to like Pandas more and more.
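Note that the keys()/values() comparison above relies on Python 2 dict methods returning lists; a hedged, Python-3-friendly sketch of the dict-driven selection the question asks for could build the mask from a generator over boolean_index_dict:

from functools import reduce

# AND together one boolean Series per (column, value) pair
mask = reduce(lambda a, b: a & b,
              (df[col] == val for col, val in boolean_index_dict.items()))
result = df.loc[mask, required_columns_list]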
