CountTokenizing a field, turning into columns - python

I'm working with data that look something like this:
ID PATH GROUP
11937 MM-YT-UJ-OO GT
11938 YT-RY-LM TQ
11939 XX-XX-OT DX
I'd like to tokenize the PATH column into n-grams and then one-hot encode those into their own columns so I'd end up with something like:
ID GROUP MM YT UJ OO RY LM XX OT MM-YT YT-UH ...
11937 GT 1 1 1 1 0 0 0 0 1 1
I could also use counted tokens rather than one-hot, so 11939 would have a 2 in the XX column instead of a 1, but I can work with either.
I can tokenize the column quite easily with scikit-learn's CountVectorizer, but then I have to cbind the ID and GROUP fields back on. Is there a standard way to do this or a best practice that anyone has discovered?

A solution:
df.set_index(['ID', 'GROUP'], inplace=True)
pd.get_dummies(df.PATH.str.split('-', expand=True).stack())\
  .groupby(level=[0, 1]).sum().reset_index()
Isolate the ID and GROUP columns as the index, then split the string so that each token gets its own cell:
df.PATH.str.split('-', expand=True)
Out[37]:
0 1 2 3
ID GROUP
11937 GT MM YT UJ OO
11938 TQ YT RY LM None
11939 DX XX XX OT None
Get them into a single column of data
df.PATH.str.split('-', expand=True).stack()
Out[38]:
ID GROUP
11937 GT 0 MM
1 YT
2 UJ
3 OO
11938 TQ 0 YT
1 RY
2 LM
11939 DX 0 XX
1 XX
2 OT
get_dummies spreads the tokens out as indicator columns, one row per token:
pd.get_dummies(df.PATH.str.split('-', expand=True).stack())
Out[39]:
LM MM OO OT RY UJ XX YT
ID GROUP
11937 GT 0 0 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0 1
2 0 0 0 0 0 1 0 0
3 0 0 1 0 0 0 0 0
11938 TQ 0 0 0 0 0 0 0 0 1
1 0 0 0 0 1 0 0 0
2 1 0 0 0 0 0 0 0
11939 DX 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 1 0
2 0 0 0 1 0 0 0 0
Group the data by ID and GROUP (levels 0 and 1 of the index) and sum the rows so there is one line per (ID, GROUP) pair. Finally, reset the index to get ID and GROUP back as regular columns.
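Putting the pieces together, the result of the full chain should come out roughly like this (note the 2 under XX for 11939, so this already covers the counted-token variant mentioned in the question):
      ID GROUP  LM  MM  OO  OT  RY  UJ  XX  YT
0  11937    GT   0   1   1   0   0   1   0   1
1  11938    TQ   1   0   0   0   1   0   0   1
2  11939    DX   0   0   0   1   0   0   2   0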

Maybe you can try something like this.
# Test data
df = pd.DataFrame({'GROUP': ['GT', 'TQ', 'DX'],
                   'ID': [11937, 11938, 11939],
                   'PATH': ['MM-YT-UJ-OO', 'YT-RY-LM', 'XX-XX-OT']})
# Expanding the data and creating one column per token
tmp = pd.concat([df.loc[:, ['GROUP', 'ID']],
                 df['PATH'].str.split('-', expand=True)], axis=1)
# Converting wide to long format
tmp = pd.melt(tmp, id_vars=['ID', 'GROUP'])
# Now grouping and counting
tmp.groupby(['ID', 'GROUP', 'value']).count().unstack().fillna(0)
# variable
# value LM MM OO OT RY UJ XX YT
# ID GROUP
# 11937 GT 0.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0
# 11938 TQ 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
# 11939 DX 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0.0
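Both snippets above only produce unigram columns; neither keeps the scikit-learn CountVectorizer the question mentions nor builds the bigram columns (MM-YT, ...) from the desired output. Here is a rough sketch of that route, reusing the test frame defined just above and assuming a recent scikit-learn with get_feature_names_out (note that scikit-learn joins n-gram tokens with spaces, so the bigram columns come out as 'MM YT' rather than 'MM-YT'):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Tokenize on '-' and build unigrams plus bigrams
vec = CountVectorizer(tokenizer=lambda s: s.split('-'),
                      ngram_range=(1, 2), lowercase=False)
counts = vec.fit_transform(df['PATH'])

# Wrap the sparse count matrix in a DataFrame that shares df's index,
# then concatenate ID and GROUP back on (the "cbind" step from the question)
token_df = pd.DataFrame(counts.toarray(),
                        columns=vec.get_feature_names_out(),
                        index=df.index)
result = pd.concat([df[['ID', 'GROUP']], token_df], axis=1)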

Related

Realise accumulated DataFrame from a column of Boolean values

Given the following python pandas DataFrame:
ID  Holidays  visit_1  visit_2  visit_3  other
 0      True        1        2        0    red
 0     False        3        2        0    red
 0      True        4        4        1   blue
 1     False        2        0        0    red
 1      True        1        2        1  green
 2     False        1        0        0    red
Currently I calculate a new DataFrame with the accumulated visit values as follows.
# Calculate the columns of the total visit count
visit_df = df.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
I would like to create a new one taking into account only the rows whose Holiday value is True. How could I do this?
Simply subset the rows first:
df[df['Holidays']].groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
output:
visit_1 visit_2 visit_3
ID
0 5 6 1
1 1 2 1
An alternative if you also want to keep the groups without any match:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
    .groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
)
output:
visit_1 visit_2 visit_3
ID
0 5.0 6.0 1.0
1 1.0 2.0 1.0
2 0.0 0.0 0.0
A variant that keeps integer dtypes and adds a suffix to the column names:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
    .groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
    .convert_dtypes()
    .add_suffix('_Holidays')
)
output:
visit_1_Holidays visit_2_Holidays visit_3_Holidays
ID
0 5 6 1
1 1 2 1
2 0 0 0
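Another way to keep the IDs that have no True rows at all (a sketch building on the first answer, not part of the original): reindex the filtered result against every ID and fill the gaps with 0, which also preserves the integer dtypes.
(df[df['Holidays']]
   .groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
   .reindex(df['ID'].unique(), fill_value=0))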

Boolean Masking on a Pandas Dataframe where columns may not exist

I have a dataframe called compare that looks like this:
Resident  1xdisc  1xdisc_doc  conpark  parking  parking_doc  conmil  conmil_doc  pest  pest_doc  pet  pet1x  pet_doc  rent  rent_doc  stlc  storage  trash  trash_doc  water  water_doc
John           0        -500        0       50           50       0           0     3         3    0      0        0  1803      1803     0        0     30         30      0          0
Cheldone    -500           0        0       50           50       0           0  1.25      1.25    0      0        0  1565      1565     0        0     30         30      0          0
Dieu        -300        -300        0        0            0       0           0     3         3    0      0        0  1372      1372     0        0     18         18      0          0
Here is the dataframe in a form that can be copied and pasted:
,Resident,1xdisc,1xdisc_doc,conpark,parking,parking_doc,conmil,conmil_doc,pest,pest_doc,pet,pet1x,pet_doc,rent,rent_doc,stlc,storage,trash,trash_doc,water,water_doc
0,Acacia,0,0,0,0,0,0,-500,3.0,3.0,0,0,70,2067,2067,0,0,15,15,0,0
1,ashley,0,0,0,0,0,0,0,3.0,3.0,0,0,0,2067,2067,0,0,15,15,0,0
2,Sheila,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1574,1574,0,0,0,0,0,0
3,Brionne,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1787,1787,0,0,0,0,0,0
4,Danielle,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1422,0,0,0,0,0,0,0
5,Nmesomachi,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1675,1675,0,0,0,0,0,0
6,Doaa,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1306,1306,0,0,0,0,0,0
7,Reynaldo,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1685,1685,0,0,0,0,0,0
8,Shajuan,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1768,0,0,0,0,0,0,0
9,Dalia,0,0,0,0,0,0,0,0.0,0.0,0,0,0,1596,1596,0,0,0,0,0,0
I want to create another dataframe using boolean masking that only contains rows where there are mismatches in various sets of columns. For example, where parking doesn't match parking_doc or conmil doesn't match conmil_doc.
Here is the code I am using currently:
nonmatch = compare[((compare['1xdisc']!=compare['1xdisc_doc']) &
                    (compare['conpark']!=compare['1xdisc']))
                   | (compare['rent']!=compare['rent_doc'])
                   | (compare['parking']!=compare['parking_doc'])
                   | (compare['trash']!=compare['trash_doc'])
                   | (compare['pest']!=compare['pest_doc'])
                   | (compare['stlc']!=compare['stlc_doc'])
                   | (compare['pet']!=compare['pet_doc'])
                   | (compare['conmil']!=compare['conmil_doc'])
                   ]
The problem I'm having is that some columns may not always exist, for example stlc_doc or pet_doc. How do I select rows with mismatches, but only check for mismatches for particular columns if the columns exist?
If the column names don't always exist, you could add the missing columns, but I don't think that is a good idea: you would have to replicate the corresponding columns, which would only increase the size of the dataframe.
So another approach might be to filter the column names themselves and keep only the column pairs that actually exist:
Given DataFrame:
>>> df.head(3)
Resident 1xdisc 1xdisc_doc conpark parking parking_doc conmil conmil_doc pest pest_doc pet pet1x pet_doc rent rent_doc stlc storage trash trash_doc water water_doc
0 Acacia 0 0 0 0 0 0 -500 3.0 3.0 0 0 70 2067 2067 0 0 15 15 0 0
1 ashley 0 0 0 0 0 0 0 3.0 3.0 0 0 0 2067 2067 0 0 15 15 0 0
2 Sheila 0 0 0 0 0 0 0 0.0 0.0 0 0 0 1574 1574 0 0 0 0 0 0
Take out the columns pairs:
>>> maskingCols = [(col[:-4], col) for col in df if col[:-4] in df and col.endswith('_doc')]
>>> maskingCols
[('1xdisc', '1xdisc_doc'), ('parking', 'parking_doc'), ('conmil', 'conmil_doc'), ('pest', 'pest_doc'), ('pet', 'pet_doc'), ('rent', 'rent_doc'), ('trash', 'trash_doc')]
Now that you have the column pairs, you can create the expression required to mask the dataframe.
>>> "|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols)
"(df['1xdisc'] != df['1xdisc_doc'])|(df['parking'] != df['parking_doc'])|(df['conmil'] != df['conmil_doc'])|(df['pest'] != df['pest_doc'])|(df['pet'] != df['pet_doc'])|(df['rent'] != df['rent_doc'])|(df['trash'] != df['trash_doc'])"
You can simply pass this expression string to the built-in eval function to evaluate it.
>>> eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols))
You can also add other criteria on top of this mask:
>>> eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols)) | ((df['1xdisc']!=df['1xdisc_doc']) & (df['conpark']!=df['1xdisc']))
0 True
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 True
9 False
dtype: bool
You can use it to get your desired dataframe:
>>> df[eval("|".join(f"(df['{col1}'] != df['{col2}'])" for col1, col2 in maskingCols)) | ((df['1xdisc']!=df['1xdisc_doc']) & (df['conpark']!=df['1xdisc']))]
OUTPUT:
Resident 1xdisc 1xdisc_doc conpark parking parking_doc conmil conmil_doc pest pest_doc pet pet1x pet_doc rent rent_doc stlc storage trash trash_doc water water_doc
0 Acacia 0 0 0 0 0 0 -500 3.0 3.0 0 0 70 2067 2067 0 0 15 15 0 0
4 Danielle 0 0 0 0 0 0 0 0.0 0.0 0 0 0 1422 0 0 0 0 0 0 0
8 Shajuan 0 0 0 0 0 0 0 0.0 0.0 0 0 0 1768 0 0 0 0 0 0 0
I'm not a pandas expert, so there might be a simpler, library way to do this, but here's a relatively Pythonic, adaptable implementation:
# Start from an all-False mask and OR in one comparison per column pair
# that actually exists, so missing *_doc columns are simply skipped.
mask = pd.Series(False, index=df.index)
for col_name in df.columns:
    # inefficient but readable, could get this down
    # to O(n) with a better data structure
    if col_name + '_doc' in df.columns:
        mask = mask | (df[col_name] != df[col_name + '_doc'])
non_match = df[mask]
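A possible middle ground between the two answers (a sketch, not from either answer): build the list of existing column pairs as shown earlier, turn each pair into a boolean Series, and OR them together without eval.
import numpy as np

# One mask per (col, col_doc) pair that actually exists in the frame
pairs = [(c[:-4], c) for c in df.columns if c.endswith('_doc') and c[:-4] in df.columns]
masks = [df[a] != df[b] for a, b in pairs]
nonmatch = df[np.logical_or.reduce(masks)] if masks else df.head(0)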

How to perform pd.get_dummies() on a dataframe while simultaneously keeping NA values in place instead of creating an NA column?

I have a dataset with some missing data. I would like to maintain the missingness within the data while performing pd.get_dummies().
Here is an example dataset:
Table 1.
someCol
A
B
NA
C
D
I would expect pd.get_dummies(df, dummy_na=True)) to transform the data into something like this:
Table 2.
someCol_A someCol_B someCol_NA someCol_C someCol_D
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
But, what I would like is this:
Table 3.
someCol_A someCol_B someCol_C someCol_D
1 0 0 0
0 1 0 0
NA NA NA NA
0 0 1 0
0 0 0 1
Notice that the 3rd row has the NA in place of all of the row values broken out from the original column.
How can I achieve the results of Table 3?
A bit of a hack, but you could do something like this: get the dummies only for the non-null rows, then re-insert the missing values in their proper place by reindexing the resulting dummies with the index of the original dataframe.
pd.get_dummies(df.dropna()).reindex(df.index)
someCol_A someCol_B someCol_C someCol_D
0 1.0 0.0 0.0 0.0
1 0.0 1.0 0.0 0.0
2 NaN NaN NaN NaN
3 0.0 0.0 1.0 0.0
4 0.0 0.0 0.0 1.0
#sacuL essentially provided the answer. The following is my modification:
df_lister = []
for i in range(len(df.columns)):
    df_lister.append(pd.get_dummies(df[df.columns[i]].dropna(), prefix=df.columns[i]).reindex(df[df.columns[i]].index))
df = pd.concat(df_lister, axis=1)
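A slightly more compact way to write the same loop (a sketch, not from the original answer) iterates over the column names directly:
import pandas as pd

# One get_dummies per column, each reindexed back to the full original index
df = pd.concat(
    [pd.get_dummies(df[col].dropna(), prefix=col).reindex(df.index) for col in df.columns],
    axis=1)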

Pandas split column based on category of the elements

I have a pandas.DataFrame with a column that contains integers, strings, times, and so on.
I want to create indicator columns (containing 0 or 1) that tell whether the value in that column is a string, a number, a date..., in an efficient way.
A
0 Hello
1 Name
2 123
3 456
4 22/03/2019
And the output should be
A A_string A_number A_date
0 Hello 1 0 0
1 Name 1 0 0
2 123 0 1 0
3 456 0 1 0
4 22/03/2019 0 0 1
Using pandas str methods to check for the string type could help:
df = pd.read_clipboard()
df['A_string'] = df.A.str.isalpha().astype(int)
df['A_number'] = df.A.str.isdigit().astype(int)
#naive assumption
df['A_Date'] = (~df.A.str.isalnum()).astype(int)
df.filter(['A','A_string','A_number','A_Date'])
A A_string A_number A_Date
0 Hello 1 0 0
1 Name 1 0 0
2 123 0 1 0
3 456 0 1 0
4 22/03/2019 0 0 1
We can use the native pandas to_numeric and to_datetime functions to test for numbers and dates, then use .loc for assignment and fillna to match your target df.
df.loc[~pd.to_datetime(df['A'],errors='coerce').isna(),'A_Date'] = 1
df.loc[~pd.to_numeric(df['A'],errors='coerce').isna(),'A_Number'] = 1
df.loc[(pd.to_numeric(df['A'],errors='coerce').isna())
       & pd.to_datetime(df['A'],errors='coerce').isna(),
       'A_String'] = 1
df = df.fillna(0)
print(df)
A A_Date A_Number A_String
0 Hello 0.0 0.0 1.0
1 Name 0.0 0.0 1.0
2 123 0.0 1.0 0.0
3 456 0.0 1.0 0.0
4 22/03/2019 1.0 0.0 0.0
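A compact variant of the same idea (a sketch, not from the original answer): compute the numeric and date masks once, let numbers win over dates so values like 123 are never double-counted, and cast the booleans to ints.
import pandas as pd

num = pd.to_numeric(df['A'], errors='coerce').notna()
date = pd.to_datetime(df['A'], errors='coerce').notna() & ~num  # numbers take priority
df['A_number'] = num.astype(int)
df['A_date'] = date.astype(int)
df['A_string'] = (~num & ~date).astype(int)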

How can I delete int in Pandas dataframe column?

I have a dataframe like this; how can I delete all the integers in a column?
For example, the value of column[0]['material'] should be transformed from lm792 to lm.
material item
index
0 lm792 1
1 sotl085-pu01. 1
2 lm792 1
3 sotl085-pu01. 1
4 ym11-3527 1
... ... ...
135526 0 0
135527 0 0
135528 0 0
135529 0 0
135530 0 0
You could use a simple regex:
\d is a digit (a character in the range 0-9), and + means one or more times, so \d+ matches one or more digits.
df['material'] = df['material'].str.replace(r'\d+', '', regex=True)
print(df)
material item
0 lm 1.0
1 sotl-pu. 1.0
2 lm 1.0
3 sotl-pu. 1.0
4 ym- 1.0
5 NaN
6 NaN
7 NaN
8 NaN
9 0.0
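One caveat (not covered in the original answer): for rows where material is the integer 0 rather than a string, .str.replace returns NaN. A small sketch that keeps those values unchanged, instead of the single replace above:
# .str.replace yields NaN for non-string entries (e.g. the 0 rows),
# so fall back to the original value for those rows.
cleaned = df['material'].str.replace(r'\d+', '', regex=True)
df['material'] = cleaned.fillna(df['material'])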
