Joining a DataFrame on itself to speed up iteration - python

I am working on a data project and I'm trying to speed up my initial data processing because inevitably I want to do something else/new with the data. So far I've been trying to do more vectorization and using np.where and the like. I've seen material gains with just that.
The last bit of my code that I need to process is the slowest. I am using itterrows to cycle through a very large dataframe (>million rows).
What I am essentially trying to do is the SQL equivalent of
select curr.value, prev.value from t1 left join t2 on curr.number = prev.number - 1
As far as I know, there's no way to join a DataFrame on itself like that. Is there some other way to iterate through it to compare the current and previous values? Here's how the data frame currently looks
df =
[a b c
3 1 0
4 1 0
5 1 0
6 0 1]
Note that b goes from 1 to 0 and that's what I am trying to capture such that I would now have a df that looks like this
[a b c b_c
3 1 0 0
4 1 0 0
5 1 0 0
6 0 1 1]
Any help is much appreciated, thank you.

I think you are looking for something like this. Basically you want to know the switch from b to c and back.
df = pd.DataFrame()
df["a"] = [3,4,5,6,7,8,9]
df["b"] = [1,1,1,0,0,1,1]
df["c"] = [0,0,0,1,1,0,0]
df["b_c"] = df["b"].eq(df["c"].shift()).astype(int)
print(df)
Output:
a b c b_c
0 3 1 0 0
1 4 1 0 0
2 5 1 0 0
3 6 0 1 1
4 7 0 1 0
5 8 1 0 1
6 9 1 0 0
I'm not sure if this is the fastest method out there or if it's faster than with iterrows but I assume it is. (at least it looks nice)

Related

Python - Attempting to create binary features from a column with lists of strings

It was hard for me to come up with clear title but an example should make things more clear.
Index C1
1 [dinner]
2 [brunch, food]
3 [dinner, fancy]
Now, I'd like to create a set of binary features for each of the unique values in this column.
The example above would turn into:
Index C1 dinner brunch fancy food
1 [dinner] 1 0 0 0
2 [brunch, food] 0 1 0 1
3 [dinner, fancy] 1 0 1 0
Any help would be much appreciated.
For a performant solution, I recommend creating a new DataFrame by listifying your column.
pd.get_dummies(pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')
brunch dinner fancy food
0 0 1 0 0
1 1 0 0 1
2 0 1 1 0
This is going to be so much faster than apply(pd.Series).
This works assuming lists don't have more of the same value (eg., ['dinner', ..., 'dinner']). If they do, then you'll need an extra groupby step:
(pd.get_dummies(
pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')
.groupby(level=0, axis=1)
.sum())
Well, if your data is like this, then what you're looking for isn't "binary" anymore.
Maybe using MultiLabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.C1),columns=mlb.classes_,index=df.Index).reset_index()
Out[970]:
Index brunch dinner fancy food
0 1 0 1 0 0
1 2 1 0 0 1
2 3 0 1 1 0

Problems transforming a pandas dataframe

I have troubles converting a pandas dataframe into the format i need in order to analyze it further. The current data is derived from a survey where we asked people to order preferred means of communication (1=highest,4=lowest). Every row is a respondee.
The current dataframe:
A B C D
0 1 2 4 3
1 2 3 1 4
2 2 1 4 3
3 2 1 4 3
4 1 3 4 2
...
For data analysis i want to transform this into the following dataframe, where every row is a different means of communication and the columns are the counts how often a person ranked it in that spot.
1st 2d 3th 4th
A 2 3 0 0
B 2 1 2 0
C 1 0 0 4
D 0 1 3 1
I have tried apply defined functions on the original dataframe, i have tried to apply .groupby function or .T on the dataframe with I don't seem to come closer to the result I actually want.
This is the function I wrote but I can't figure out how to apply it correctly to give me the desired result.
def count_values_rank(column,rank):
total_count_n1 = 0
for i in column:
if i == rank:
total_count_n1 += 1
return total_count_n1
Running this piece of code on a single column of my dataframe get's the desired results but having troubles to actually write it so i can apply it to the dataframe and get the result I am looking for. The below line of code would return 2.
count_values_rank(df.iloc[:,0],'1')
It is probably a really obvious solution but having troubles seeing the easiest way to solve this.
Thanks alot!
melt with crosstab
pd.crosstab(df.melt().variable,df.melt().value).add_suffix('st')
Out[107]:
value 1st 2st 3st 4st
variable
A 2 3 0 0
B 2 1 2 0
C 1 0 0 4
D 0 1 3 1

iterate through the rows of a pandas df to generate new df (depending on conditions)

I have a df with badminton scores. Each sets of a games for a team are on rows and the score at each point on the columns like so:
0 0 1 1 2 3 4
0 1 2 3 3 4 4
I want to obtain only O and 1 when a point is scored, like so: (to analyse if there any pattern in the points):
0 0 1 0 1 1 1
0 1 1 1 0 1 0
I was thinking of using df.itertuples() and iloc and conditions to attribute 1 to new dataframe if next score = score+1 or 0 if next score = score + 1
But I dont know how to iterate through the generated tuples and how to generate my new df with the 0 and 1 at the good locations.
Hope that is clear thanks for your help.
Oh also, any suggestions to analyse the patterns after that ?
You just need diff(If you need convert it back try cumsum)
df.diff(axis=1).fillna(0).astype(int)
Out[1382]:
1 2 3 4 5 6 7
0 0 0 1 0 1 1 1
1 0 1 1 1 0 1 0

Generating new columns as a full-combination of other columns

Could not find similar cases here.
Suppose, i have a DataFrame
df = pd.DataFrame({'A':[2,2,1,2],
'B':[2,2,3,3],
'C':[3,3,3,4],
'I':[1,0,0,1],
'II':[0,1,0,1]})
So it is:
A B C I II
0 2 2 3 1 0
1 2 2 3 0 1
2 1 3 3 0 0
3 2 3 4 1 1
I want to make a full pairwise combination between {A,B,C} and {I,II}, so i get {I-A,I-B,I-C,II-A,II-B,II-C}:
Each of a new column is just an elementwise multiplication of corresponding base columns
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
ATM i dont have any working solution. I'am trying to use loops(not succeding in this), but i hope there's more sufficient way.
It's pretty simple, really. You have two sets of columns that you want to combine pairwise. I won't even bother with permutation tools:
>>> new_df = pd.DataFrame()
>>>
>>> for i in ["I", "II"]:
for a in ["A", "B", "C"]:
new_df[i+"-"+a] = df[i] * df[a]
>>> new_df
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
Of course you could obtain the lists of column names as slices off df.columns, or in whatever other way is convenient. E.g. for your example dataframe you could write
>>> for i in df.columns[3:]:
for a in df.columns[:3]:
new_df[i+"-"+a] = df[i] * df[a]
Using loops, you can use this code. It's definitely not the most elegant solution but should work for your purpose. It only requires that you specify the columns that you'd like to use for the pairwise multiplication. It seems to be quite readable though, which is something you may want.
def element_wise_mult(first, second):
element_wise_mult = []
for i, el in enumerate(first):
element_wise_mult.append(el * second[i])
return element_wise_mult
if __name__ == '__main__':
import pandas as pd
df = pd.DataFrame({'A':[2,2,1,2],
'B':[2,2,3,3],
'C':[3,3,3,4],
'I':[1,0,0,1],
'II':[0,1,0,1]})
fs = ['I', 'II']
sc = ['A', 'B', 'C']
series = []
names = []
for i in fs:
for j in sc:
names.append(i + '-' + j)
series.append(pd.Series(element_wise(df[i], df[j]))) # append array creates as a pandas series
print(pd.DataFrame(series, index=names).T) # reconstruct dataframe from the series and names stored
Returns:
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
Here is a solution without for loops for your specific example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[2,2,1,2],
'B':[2,2,3,3],
'C':[3,3,3,4],
'I':[1,0,0,1],
'II':[0,1,0,1]})
cross_vals=np.tile(df[df.columns[:3]].values,(1,2))*np.repeat(df[df.columns[3:]].values,3,axis=1)
cros_cols=np.repeat(df.columns[3:].values,3)+np.array('-')+np.tile(df.columns[:3].values,(1,2))
new_df=pd.DataFrame(cross_vals,columns=cros_cols[0])
Then new_df is
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
You could generalize it to any size as long as the columns A,B,C,... are consecutive and similarly the columns I,II,... are consecutive.
For the general case, if the columns are not necessarily consecutive, you can do the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[2,2,1,2],
'B':[2,2,3,3],
'C':[3,3,3,4],
'I':[1,0,0,1],
'II':[0,1,0,1]})
let=np.array(['A','B','C'],dtype=object)
num=np.array(['I','II'],dtype=object)
cross_vals=np.tile(df[let].values,(1,len(num)))*np.repeat(df[num].values,len(let),axis=1)
cros_cols=np.repeat(num,len(let))+np.array('-')+np.tile(let,(1,len(num)))
new_df=pd.DataFrame(cross_vals,columns=cros_cols[0])
And the result is the same as above.

pandas: Grouping or filtering based on values in list, instead of dataframe

I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light' : pd.Series(['b','b','c','a','a','a','a'], index=[1,2,3,4,5,6,9]),'injury' : pd.Series([1,5,5,5,2,2,4], index=[1,2,3,4,5,6,9])}
testdf = pd.DataFrame(d)
injury light
1 1 b
2 5 b
3 5 c
4 5 a
5 2 a
6 2 a
9 4 a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light',columns='injury',fill_value=0,aggfunc='count')
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
d 0 0 0 0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a','b','c','d']):
idx2 = (df['light'].isin([v]))
df2 = df[idx2]
print(df2.shape[0])
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?
Try this:
df = pd.crosstab(df.light, df.injury,margins=True)
df
injury 1 2 4 5 All
light
a 0 2 1 1 4
b 1 0 0 1 2
c 0 0 0 1 1
All 1 2 1 3 7
df["All"]
light
a 4
b 2
c 1
All 7

Categories