Comparing multiple columns of a massive DataFrame with complex duplicate rows

Comparing multiple columns of a massive DataFrame with complex duplicate rows - python

I have a massive dataframe df with around 10 million rows:
df.sort_values(['pair','x1','x2'])
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
-------------------------------------------------------------------------------
A male H female a male d male 0
A male W male a male d male 0 (*)
A male KK female a male d male 0 (**)
B female C male a male d male 0 (-)
B female W male a male d male 0 (*)
B female BB female a male d male 0
B female KK female a male d male 0 (**)
F male W male a male d male 0 (*)
A male T female b female d male 1
A male BB female b female d male 1
B female C male b female d male 1 (-)
D male E male b female d male 1
A male C male b female e female 2
...
Each column can be explained by the following:
x1gen is a gender data of x1, x2gen is of x2, and so on.
x1 cites y1 and x2 cites y2.
Each pair of y1 and y2 is assigned a unique pair value.
My objective is to find four values per unique pair:
male citing male
male citing female
female citing male
female citing female
where, each citation network should not be counted more than once.
For example, in the given sample, x2 = W is appeared three times in pair = 0 (see (*)), so it should be counted once, not three times. Same applies to x2 = KK in pair = 0 (see (**)). However, we can count the same reference if it is a new pair. (C -> d in (-) is counted separately once per pair = 0 and pair = 1)
Hence, for the first pair pair = 0, the objective values are:
male citing male = 4 (A -> a, F -> a, W -> d, C -> d)
male citing female = 0
female citing male = 4 (B -> a, H -> d, KK -> d, BB -> d)
female citing female = 0
What I initially did was using a for loop and a set of if loops and creating four lists separately for x1 and x2:
mm = [1]
mf = [0]
fm = [0]
ff = [0]
mm1 = 1
mf1 = 0
fm1 = 0
ff1 = 0
for i in range(1, len(df)):
if df['pair'][i] == df['pair'][i-1]:
if df['x1'][i] != df['x1'][i-1]:
if df['x1gen'][i] == 'male':
if df['y1gen'][i] == 'male':
mm1 += 1
else:
mf1 += 1
else:
if df['y1gen'][i] == 'male':
fm1 += 1
else:
ff1 += 1
...
and the gist is analogous (the code itself is MANY lines long, but just a repetition of those lines). As one can tell, this is HIGHLY inefficient (takes around 120 minutes).
What is the optimal way to find such values without having to do a highly inefficient string-matching?

You can try the following:
import io
import re
import pandas as pd
# this just recreates the dataframe
s = '''
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
A male H female a male d male 0
A male W male a male d male 0
A male KK female a male d male 0
B female C male a male d male 0
B female W male a male d male 0
B female BB female a male d male 0
B female KK female a male d male 0
F male W male a male d male 0
A male T female b female d male 1
A male BB female b female d male 1
B female C male b female d male 1
D male E male b female d male 1
A male C male b female e female 2
'''
s = re.sub(r" +", " ", s)
df = pd.read_csv(io.StringIO(s), sep=" ")
print(df)
It gives:
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
0 A male H female a male d male 0
1 A male W male a male d male 0
2 A male KK female a male d male 0
3 B female C male a male d male 0
4 B female W male a male d male 0
5 B female BB female a male d male 0
6 B female KK female a male d male 0
7 F male W male a male d male 0
8 A male T female b female d male 1
9 A male BB female b female d male 1
10 B female C male b female d male 1
11 D male E male b female d male 1
12 A male C male b female e female 2
Counting citation pairs:
# count x1-> y1 pairs
df1 = df.drop_duplicates(subset=['x1', 'y1', 'pair'])
c1 = (df1['x1gen'] + '_' + df1['y1gen']).value_counts()
# count x2-> y2 pairs
df2 = df.drop_duplicates(subset=['x2', 'y2', 'pair'])
c2 = (df2['x2gen'] + '_' + df2['y2gen']).value_counts()
# add results
c1.add(c2, fill_value=0).astype(int)
This gives:
female_female 1
female_male 6
male_female 4
male_male 6
Computing results for each pair separately:
def cit_count(g):
# count x2-> y2 pairs
df1 = g.drop_duplicates(subset=['x1', 'y1'])
c1 = (df1['x1gen'] + '_' + df1['y1gen']).value_counts()
# count x2-> y2 pairs
df2 = g.drop_duplicates(subset=['x2', 'y2'])
c2 = (df2['x2gen'] + '_' + df2['y2gen']).value_counts()
# add results
return c1.add(c2, fill_value=0)
print(df.groupby('pair').apply(cit_count).unstack().fillna(0).astype(int))
It gives:
female_female female_male male_female male_male
pair
0 0 4 0 4
1 1 2 2 2
2 0 0 2 0

Related

How to add multiple calculated columns to dataframe at once

I have dataframe as below
Slno Name_x Age_x Sex_x Name_y Age_y Sex_y
0 1 A 27 Male A 32 Male
1 2 B 28 Female B 28 Female
2 3 C 8 Female C 1 Female
3 4 D 28 Male D 72 Male
4 5 E 25 Female E 64 Female
I need to create calculated column , difference between age, check gender match and to achieve this in one go I am using
DF3.loc[:,["Gendermatch","Agematch"]]= pd.DataFrame([np.where(DF3["Name_x"]==DF3["Name_y"],True,False),np.where(DF3["Age_x"]-DF3["Age_y"]==0,True,False)])
and the resultant dataframe looks like as below
Slno Name_x Age_x Sex_x Name_y Age_y Sex_y Gendermatch Agematch
0 1 A 27 Male A 32 Male NaN NaN
1 2 B 28 Female B 28 Female NaN NaN
2 3 C 8 Female C 1 Female NaN NaN
3 4 D 28 Male D 72 Male NaN NaN
4 5 E 25 Female E 64 Female NaN NaN
Resultant columns shows not a number , what wrong am I doing here?

DF3[["Gendermatch","Agematch"]]= np.where(DF3["Name_x"]==DF3["Name_y"],True,False),np.where(DF3["Age_x"]-DF3["Age_y"]==0,True,False)

DF3[["Gendermatch","Agematch"]] = pd.DataFrame([np.where(DF3["Name_x"]==DF3["Name_y"],True,False),np.where(DF3["Age_x"]-DF3["Age_y"]==0,True,False)]).T

np.where is useless, Series comparison already returns boolean Series
DF3["Gendermatch"] = DF3["Name_x"]==DF3["Name_y"]
DF3["Agematch"] = DF3["Age_x"]==DF3["Age_y"]
# or in one line
DF3["Gendermatch"], DF3["Agematch"] = (DF3["Name_x"]==DF3["Name_y"]), (DF3["Age_x"]==DF3["Age_y"])
print(DF3)
Slno Name_x Age_x Sex_x Name_y Age_y Sex_y Gendermatch Agematch
0 1 A 27 Male A 32 Male True False
1 2 B 28 Female B 28 Female True True
2 3 C 8 Female C 1 Female True False
3 4 D 28 Male D 72 Male True False
4 5 E 25 Female E 64 Female True False

pandas: transform based on count of row value in another dataframe

I have two dataframes:
df1:
Gender Registered
female 1
male 0
female 0
female 1
male 1
male 0
df2:
Gender
female
female
male
male
I want to modify df2, so that there is a new column 'Count' with the count of registered = 1 for corresponding gender values from df1. For example, in df1 there are 2 registered females and 1 registered male. I want to transform the df2 so that the output is as follows:
output:
Gender Count
female 2
female 2
male 1
male 1
I tried many things and got close but couldn't make it fully work.

sum + map:
v = df1.groupby('Gender').Registered.sum()
df2.assign(Count=df2.Gender.map(v))
Gender Count
0 female 2
1 female 2
2 male 1
3 male 1

pd.merge
pd.merge(df2, df1.groupby('Gender', as_index=False).sum())
Gender Registered
0 female 2
1 female 2
2 male 1
3 male 1

Could not convert string to float error from the Titanic competition

I'm trying to solve the Titanic survival program from Kaggle. It's my first step in actually learning Machine Learning. I have a problem where the gender column causes an error. The stacktrace says could not convert string to float: 'female'. How did you guys come across this issue? I don't want solutions. I just want a practical approach to this problem because I do need the gender column to build my model.
This is my code:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)
columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)
x = filtered_titanic_data[columns_of_interest]
y = filtered_titanic_data.Survived
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)
print(filtered_titanic_data)

There are a couple ways to deal with this, and it kind of depends what you're looking for:
You could encode your categories to numeric values, i.e. transform each level of your category to a distinct number,
or
dummy code your category, i.e. turn each level of your category into a separate column, which gets a value of 0 or 1.
In lots of machine learning applications, factors are better to deal with as dummy codes.
Note that in the case of a 2-level category, encoding to numeric according to the methods outlined below is essentially equivalent to dummy coding: all the values that are not level 0 are necessarily level 1. In fact, in the dummy code example I've given below, there is redundant information, as I've given each of the 2 classes its own column. It's just to illustrate the concept. Typically, one would only create n-1 columns, where n is the number of levels, and the omitted level is implied (i.e. make a column for Female, and all the 0 values are implied to be Male).
Encoding Categories to numeric:
Method 1: pd.factorize
pd.factorize is a simple, fast way of encoding to numeric:
For example, if your column gender looks like this:
>>> df
gender
0 Female
1 Male
2 Male
3 Male
4 Female
5 Female
6 Male
7 Female
8 Female
9 Female
df['gender_factor'] = pd.factorize(df.gender)[0]
>>> df
gender gender_factor
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
Method 2: categorical dtype
Another way would be to use category dtype:
df['gender_factor'] = df['gender'].astype('category').cat.codes
This would result in the same output
Method 3 sklearn.preprocessing.LabelEncoder()
This method comes with some bonuses, such as easy back transforming:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# Transform the gender column
df['gender_factor'] = le.fit_transform(df.gender)
>>> df
gender gender_factor
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
# Easy to back transform:
df['gender_factor'] = le.inverse_transform(df.gender_factor)
>>> df
gender gender_factor
0 Female Female
1 Male Male
2 Male Male
3 Male Male
4 Female Female
5 Female Female
6 Male Male
7 Female Female
8 Female Female
9 Female Female
Dummy Coding:
Method 1: pd.get_dummies
df.join(pd.get_dummies(df.gender))
gender Female Male
0 Female 1 0
1 Male 0 1
2 Male 0 1
3 Male 0 1
4 Female 1 0
5 Female 1 0
6 Male 0 1
7 Female 1 0
8 Female 1 0
9 Female 1 0
Note, if you want to omit one column to get a non-redundant dummy code (see my note at the beginning of this answer), you can use:
df.join(pd.get_dummies(df.gender, drop_first=True))
gender Male
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0

Finding unique combinations of columns from a dataframe

In my below data set, I need to find unique sequences and assign them a serial no ..
DataSet :
user age maritalstatus product
A Young married 111
B young married 222
C young Single 111
D old single 222
E old married 111
F teen married 222
G teen married 555
H adult single 444
I adult single 333
Expected output:
young married 0
young single 1
old single 2
old married 3
teen married 4
adult single 5
After finding the unique values like shown above, if I pass a new user like below,
user age maritalstatus
X young married
it should return me the products as a list .
X : [111, 222]
if there is no sequence, like below
user age maritalstatus
Y adult married
it should return me an empty list
Y : []

First select only columns for output and add drop_duplicates, last add new column by range:
df = df[['age','maritalstatus']].drop_duplicates()
df['no'] = range(len(df.index))
print (df)
age maritalstatus no
0 Young married 0
1 young married 1
2 young Single 2
3 old single 3
4 old married 4
5 teen married 5
7 adult single 6
If want convert all values to lowercase first:
df = df[['age','maritalstatus']].apply(lambda x: x.str.lower()).drop_duplicates()
df['no'] = range(len(df.index))
print (df)
age maritalstatus no
0 young married 0
2 young single 1
3 old single 2
4 old married 3
5 teen married 4
7 adult single 5
EDIT:
First convert to lowercase:
df[['age','maritalstatus']] = df[['age','maritalstatus']].apply(lambda x: x.str.lower())
print (df)
user age maritalstatus product
0 A young married 111
1 B young married 222
2 C young single 111
3 D old single 222
4 E old married 111
5 F teen married 222
6 G teen married 555
7 H adult single 444
8 I adult single 333
And then use merge for unique product converted to list:
df2 = pd.DataFrame([{'user':'X', 'age':'young', 'maritalstatus':'married'}])
print (df2)
age maritalstatus user
0 young married X
a = pd.merge(df, df2, on=['age','maritalstatus'])['product'].unique().tolist()
print (a)
[111, 222]
df2 = pd.DataFrame([{'user':'X', 'age':'adult', 'maritalstatus':'married'}])
print (df2)
age maritalstatus user
0 adult married X
a = pd.merge(df, df2, on=['age','maritalstatus'])['product'].unique().tolist()
print (a)
[]
But if need column use transform:
df['prod'] = df.groupby(['age', 'maritalstatus'])['product'].transform('unique')
print (df)
user age maritalstatus product prod
0 A young married 111 [111, 222]
1 B young married 222 [111, 222]
2 C young single 111 [111]
3 D old single 222 [222]
4 E old married 111 [111]
5 F teen married 222 [222, 555]
6 G teen married 555 [222, 555]
7 H adult single 444 [444, 333]
8 I adult single 333 [444, 333]
EDIT1:
a = (pd.merge(df, df2, on=['age','maritalstatus'])
.groupby('user_y')['product']
.apply(lambda x: x.unique().tolist())
.to_dict())
print (a)
{'X': [111, 222]}
Detail:
print (pd.merge(df, df2, on=['age','maritalstatus']))
user_x age maritalstatus product user_y
0 A young married 111 X
1 B young married 222 X

One way is pd.factorize. Note I convert columns to lower case first for results to make sense.
for col in ['user', 'age', 'maritalstatus']:
df[col] = df[col].str.lower()
df['category'] = list(zip(df.age, df.maritalstatus))
df['category'] = pd.factorize(df['category'])[0]
# user age maritalstatus product category
# 0 a young married 111 0
# 1 b young married 222 0
# 2 c young single 111 1
# 3 d old single 222 2
# 4 e old married 111 3
# 5 f teen married 222 4
# 6 g teen married 555 4
# 7 h adult single 444 5
# 8 i adult single 333 5
Finally, drop duplicates:
df_cats = df[['age', 'maritalstatus', 'category']].drop_duplicates()
# age maritalstatus category
# 0 young married 0
# 2 young single 1
# 3 old single 2
# 4 old married 3
# 5 teen married 4
# 7 adult single 5
To map a list of products, try this:
s = df.groupby(['age', 'maritalstatus'])['product'].apply(list)
df['prod_catwise'] = list(map(s.get, zip(df.age, df.maritalstatus)))
Another option is to use categorical data, which I highly recommend for workflows. You can easily extract codes from a categorical series via pd.Series.cat.codes.

Pandas get sorted index order for multiple columns

I have something like the following multi-index Pandas series where the values are indexed by Team, Year, and Gender.
>>> import pandas as pd
>>> import numpy as np
>>> multi_index=pd.MultiIndex.from_product([['Team A','Team B', 'Team C', 'Team D'],[2015,2016],['Male','Female']], names = ['Team','Year','Gender'])
>>> np.random.seed(0)
>>> df=pd.Series(index=multi_index, data=np.random.randint(1, 10, 16))
>>> df
>>>
Team Year Gender
Team A 2015 Male 6
Female 1
2016 Male 4
Female 4
Team B 2015 Male 8
Female 4
2016 Male 6
Female 3
Team C 2015 Male 5
Female 8
2016 Male 7
Female 9
Team D 2015 Male 9
Female 2
2016 Male 7
Female 8
My goal is to get a dataframe of the team ranked order for each of the 4 Year / Gender combinations (Male 2015, Male 2016, Female 2015, and Female 2016).
My approach has been to first unstack the dataframe so that it is indexed by team...
>>> unstacked_df = df.unstack(['Year','Gender'])
>>> print unstacked_df
>>>
>>>
Year 2015 2016
Gender Male Female Male Female
Team
Team A 6 1 4 4
Team B 8 4 6 3
Team C 5 8 7 9
Team D 9 2 7 8
And then create a dataframe from the index orders by looping through and sorting each of those 4 columns...
>>> team_orders = np.array([unstacked_df.sort_values(x).index.tolist() for x in unstacked_df.columns]).T
>>> result = pd.DataFrame(team_orders, columns=unstacked_df.columns)
>>> print result
Year 2015 2016
Gender Male Female Male Female
0 Team C Team A Team A Team B
1 Team A Team D Team B Team A
2 Team B Team B Team C Team D
3 Team D Team C Team D Team C
Is there an easier / better approach that I'm missing?

Starting from your unstacked version, you can use .argsort() with .apply() to rank order each column and then just use that as a lookup against the index:
df.unstack([1,2]).apply(lambda x: x.index[x.argsort()]).reset_index(drop=True)
Year 2015 2016
Gender Male Female Male Female
0 Team C Team A Team A Team B
1 Team A Team D Team B Team A
2 Team B Team B Team C Team D
3 Team D Team C Team D Team C
EDIT: Here's a little more info on why this works. With just the .argsort(), you get:
print df.unstack([1,2]).apply(lambda x: x.argsort())
Year 2015 2016
Gender Male Female Male Female
Team
Team A 2 0 0 1
Team B 0 3 1 0
Team C 1 1 2 3
Team D 3 2 3 2
The lookup bit is essentially just doing the following for each column:
df.unstack([1,2]).index[[2,0,1,3]]
Index([u'Team C', u'Team A', u'Team B', u'Team D'], dtype='object', name=u'Team')
and the .reset_index() gets rid of the now-meaningless index labels.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Comparing multiple columns of a massive DataFrame with complex duplicate rows - python

Related

How to add multiple calculated columns to dataframe at once

pandas: transform based on count of row value in another dataframe

Could not convert string to float error from the Titanic competition

Finding unique combinations of columns from a dataframe

Pandas get sorted index order for multiple columns

Categories

Resources