Pandas split column based on category of the elements - python

I have a pandas.DataFrame with a column that contains a mix of integers, strings, dates, and so on.
I want to create indicator columns (containing 0/1) that tell whether the value in that column is a string, a date, etc., in an efficient way.
            A
0       Hello
1        Name
2         123
3         456
4  22/03/2019
And the output should be:
            A  A_string  A_number  A_date
0       Hello         1         0       0
1        Name         1         0       0
2         123         0         1       0
3         456         0         1       0
4  22/03/2019         0         0       1

Using pandas str methods to check for the string type could help:
import pandas as pd

df = pd.read_clipboard()  # read the example column from the clipboard

df['A_string'] = df.A.str.isalpha().astype(int)   # purely alphabetic -> string
df['A_number'] = df.A.str.isdigit().astype(int)   # purely digits -> number
# naive assumption: anything that is not alphanumeric is a date
df['A_Date'] = (~df.A.str.isalnum()).astype(int)
df.filter(['A', 'A_string', 'A_number', 'A_Date'])
            A  A_string  A_number  A_Date
0       Hello         1         0       0
1        Name         1         0       0
2         123         0         1       0
3         456         0         1       0
4  22/03/2019         0         0       1

We can use the native pandas pd.to_numeric and pd.to_datetime to test for numbers and dates, then use .loc for assignment and fillna to match your target df.
# flag values that parse as dates or numbers
df.loc[~pd.to_datetime(df['A'], errors='coerce').isna(), 'A_Date'] = 1
df.loc[~pd.to_numeric(df['A'], errors='coerce').isna(), 'A_Number'] = 1
# whatever parses as neither is treated as a string
df.loc[pd.to_numeric(df['A'], errors='coerce').isna()
       & pd.to_datetime(df['A'], errors='coerce').isna(),
       'A_String'] = 1
df = df.fillna(0)
print(df)
            A  A_Date  A_Number  A_String
0       Hello     0.0       0.0       1.0
1        Name     0.0       0.0       1.0
2         123     0.0       1.0       0.0
3         456     0.0       1.0       0.0
4  22/03/2019     1.0       0.0       0.0
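If you prefer integer 0/1 flags in a single pass instead of the float columns produced by fillna, here is a minimal sketch along the same lines (it reuses the same naive pd.to_numeric / pd.to_datetime parsing as above and assumes the same single-column df; dayfirst=True is only there to read the DD/MM/YYYY date):

import pandas as pd

df = pd.DataFrame({'A': ['Hello', 'Name', '123', '456', '22/03/2019']})

# Parse once and reuse the boolean masks
as_number = pd.to_numeric(df['A'], errors='coerce').notna()
as_date = pd.to_datetime(df['A'], errors='coerce', dayfirst=True).notna()

df['A_string'] = (~as_number & ~as_date).astype(int)  # parses as neither -> treat as string
df['A_number'] = as_number.astype(int)
df['A_date'] = (as_date & ~as_number).astype(int)     # if a value parses as both, prefer "number"
print(df)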

Related

Realise accumulated DataFrame from a column of Boolean values

Given the following pandas DataFrame:
ID  Holidays  visit_1  visit_2  visit_3  other
 0      True        1        2        0    red
 0     False        3        2        0    red
 0      True        4        4        1   blue
 1     False        2        0        0    red
 1      True        1        2        1  green
 2     False        1        0        0    red
Currently I calculate a new DataFrame with the accumulated visit values as follows.
# Calculate the columns of the total visit count
visit_df = df.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
I would like to create a new one that takes into account only the rows whose Holidays value is True. How could I do this?
Simply subset the rows first:
df[df['Holidays']].groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
output:
    visit_1  visit_2  visit_3
ID
0         5        6        1
1         1        2        1
Alternatively, if you also want the groups without any match:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
    .groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
)
output:
    visit_1  visit_2  visit_3
ID
0       5.0      6.0      1.0
1       1.0      2.0      1.0
2       0.0      0.0      0.0
A variant with integer dtypes restored and suffixed column names:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
    .groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
    .convert_dtypes()
    .add_suffix('_Holidays')
)
output:
    visit_1_Holidays  visit_2_Holidays  visit_3_Holidays
ID
0                  5                 6                 1
1                  1                 2                 1
2                  0                 0                 0
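Another way to keep every ID, sketched under the assumption that the visit columns are numeric: multiply them by the boolean Holidays mask before grouping, so non-holiday rows contribute zero.

import pandas as pd

df = pd.DataFrame({'ID':       [0, 0, 0, 1, 1, 2],
                   'Holidays': [True, False, True, False, True, False],
                   'visit_1':  [1, 3, 4, 2, 1, 1],
                   'visit_2':  [2, 2, 4, 0, 2, 0],
                   'visit_3':  [0, 0, 1, 0, 1, 0],
                   'other':    ['red', 'red', 'blue', 'red', 'green', 'red']})

visit_cols = ['visit_1', 'visit_2', 'visit_3']

# True * x = x and False * x = 0, so non-holiday visits are zeroed out before the sum
holiday_visits = (df[visit_cols].mul(df['Holidays'], axis=0)
                  .groupby(df['ID']).sum()
                  .add_suffix('_Holidays'))
print(holiday_visits)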

Applying a lambda function to columns on pandas avoiding redundancy

I have this dataset, which contains some NaN values:
df = pd.DataFrame({'Id':[1,2,3,4,5,6], 'Name':['Eve','Diana',np.NaN,'Mia','Mae',np.NaN], "Count":[10,3,np.NaN,8,5,2]})
df
   Id   Name  Count
0   1    Eve   10.0
1   2  Diana    3.0
2   3    NaN    NaN
3   4    Mia    8.0
4   5    Mae    5.0
5   6    NaN    2.0
I want to test whether each column has a NaN value (0) or not (1), creating two new columns. I have tried this:
df_clean = df
df_clean[['Name_flag','Count_flag']] = df_clean[['Name','Count']].apply(lambda x: 0 if x == np.NaN else 1, axis = 1)
But it raises "The truth value of a Series is ambiguous". I want to do this while avoiding redundancy, but I see there is a mistake in my logic. Please, could you help me with this question?
The expected table is:
   Id   Name  Count  Name_flag  Count_flag
0   1    Eve   10.0          1           1
1   2  Diana    3.0          1           1
2   3    NaN    NaN          0           0
3   4    Mia    8.0          1           1
4   5    Mae    5.0          1           1
5   6    NaN    2.0          0           1
Multiply the boolean mask by 1 (using notna, so that 1 means "value present", matching the expected output):
df[['Name_flag', 'Count_flag']] = df[['Name', 'Count']].notna() * 1
>>> df
   Id   Name  Count  Name_flag  Count_flag
0   1    Eve   10.0          1           1
1   2  Diana    3.0          1           1
2   3    NaN    NaN          0           0
3   4    Mia    8.0          1           1
4   5    Mae    5.0          1           1
5   6    NaN    2.0          0           1
Regarding the error "The truth value of a Series is ambiguous": with apply along axis=1, each call receives a whole row as a Series, so the lambda cannot return a single 0 or 1 from a comparison on it. You have to use applymap instead to apply a function element-wise. But comparing to NaN is not straightforward, since x == np.NaN is always False; one workaround is:
df[['Name', 'Count']].applymap(lambda x: str(x) == 'nan') * 1
(note this flags the missing values, so invert it if you want 1 to mean "present").
We can use notna and convert the boolean to int:
df[["Name_flag", "Count_flag"]] = df[["Name", "Count"]].notna().astype(int)
   Id   Name  Count  Name_flag  Count_flag
0   1    Eve   10.0          1           1
1   2  Diana    3.0          1           1
2   3    NaN    NaN          0           0
3   4    Mia    8.0          1           1
4   5    Mae    5.0          1           1
5   6    NaN    2.0          0           1
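If you would rather keep the apply-based style from the question, a minimal sketch (assuming the same df as above): apply a vectorized function to each whole column instead of comparing a Series to a scalar inside the lambda.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 6],
                   'Name': ['Eve', 'Diana', np.nan, 'Mia', 'Mae', np.nan],
                   'Count': [10, 3, np.nan, 8, 5, 2]})

# Each call of the lambda receives one whole column (a Series);
# notna() is already element-wise, so there is no ambiguous truth value.
df[['Name_flag', 'Count_flag']] = df[['Name', 'Count']].apply(lambda col: col.notna().astype(int))
print(df)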

How to perform pd.get_dummies() on a dataframe while simultaneously keeping NA values in place instead of creating an NA column?

I have a dataset with some missing data. I would like to maintain the missingness within the data while performing pd.get_dummies().
Here is an example dataset:
Table 1.
someCol
A
B
NA
C
D
I would expect pd.get_dummies(df, dummy_na=True) to transform the data into something like this:
Table 2.
someCol_A  someCol_B  someCol_NA  someCol_C  someCol_D
        1          0           0          0          0
        0          1           0          0          0
        0          0           1          0          0
        0          0           0          1          0
        0          0           0          0          1
But, what I would like is this:
Table 3.
someCol_A  someCol_B  someCol_C  someCol_D
        1          0          0          0
        0          1          0          0
       NA         NA         NA         NA
        0          0          1          0
        0          0          0          1
Notice that the 3rd row has the NA in place of all of the row values broken out from the original column.
How can I achieve the results of Table 3?
A bit of a hack, but you could do something like this: get the dummies only for the non-null rows, then re-insert the missing values in their proper place by reindexing the resulting dummies with the index of the original dataframe.
pd.get_dummies(df.dropna()).reindex(df.index)
   someCol_A  someCol_B  someCol_C  someCol_D
0        1.0        0.0        0.0        0.0
1        0.0        1.0        0.0        0.0
2        NaN        NaN        NaN        NaN
3        0.0        0.0        1.0        0.0
4        0.0        0.0        0.0        1.0
@sacuL essentially provided the answer. The following is my modification for a dataframe with several such columns:
df_lister = []
for i in range(len(df.columns)):
    df_lister.append(pd.get_dummies(df[df.columns[i]].dropna(),
                                    prefix=df.columns[i]).reindex(df[df.columns[i]].index))
df = pd.concat(df_lister, axis=1)
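The same per-column idea can also be written as a single concat. A minimal sketch, assuming a toy df containing only the someCol column from the example; the .astype(float) is just there so the output shows 1.0/0.0/NaN regardless of the pandas version:

import numpy as np
import pandas as pd

df = pd.DataFrame({'someCol': ['A', 'B', np.nan, 'C', 'D']})

# One get_dummies per column, each reindexed back to the full index so rows
# that were NaN come back as all-NaN.
dummies = pd.concat(
    {col: pd.get_dummies(df[col].dropna()).astype(float).reindex(df.index) for col in df.columns},
    axis=1,
)
# Flatten the (column, category) MultiIndex into 'col_category' names.
dummies.columns = [f'{col}_{val}' for col, val in dummies.columns]
print(dummies)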

How can I delete int in Pandas dataframe column?

I have a dataframe like this; how can I delete all the digits in a column?
For example, the value of df['material'][0] should be transformed from lm792 to lm.
              material  item
index
0                lm792     1
1        sotl085-pu01.     1
2                lm792     1
3        sotl085-pu01.     1
4            ym11-3527     1
...                ...   ...
135526               0     0
135527               0     0
135528               0     0
135529               0     0
135530               0     0
You could use a simple regex.
\d is a digit (a character in the range 0-9), and + means 1 or more times, so \d+ matches 1 or more digits.
df['material'] = df['material'].str.replace(r'\d+', '', regex=True)
print(df)
   material  item
0        lm   1.0
1  sotl-pu.   1.0
2        lm   1.0
3  sotl-pu.   1.0
4       ym-   1.0
5       NaN
6       NaN
7       NaN
8       NaN
9       0.0
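If the column mixes real strings with numeric placeholders such as 0 (as in the last rows above), the .str accessor turns the non-string values into NaN. A minimal sketch on a small made-up sample that applies the regex only to the string values and leaves the numeric rows untouched:

import pandas as pd

df = pd.DataFrame({'material': ['lm792', 'sotl085-pu01.', 'lm792', 'ym11-3527', 0, 0],
                   'item': [1, 1, 1, 1, 0, 0]})

# Only touch the rows whose value is actually a string
is_str = df['material'].apply(lambda v: isinstance(v, str))
df.loc[is_str, 'material'] = df.loc[is_str, 'material'].str.replace(r'\d+', '', regex=True)
print(df)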

CountTokenizing a field, turning into columns

I'm working with data that look something like this:
ID PATH GROUP
11937 MM-YT-UJ-OO GT
11938 YT-RY-LM TQ
11939 XX-XX-OT DX
I'd like to tokenize the PATH column into n-grams and then one-hot encode those into their own columns so I'd end up with something like:
ID     GROUP  MM  YT  UJ  OO  RY  LM  XX  OT  MM-YT  YT-UJ  ...
11937  GT      1   1   1   1   0   0   0   0      1      1
I could also use counted tokens rather than one-hot, so 11939 would have a 2 in the XX column instead of a 1, but I can work with either.
I can tokenize the column quite easily with scikit-learn's CountVectorizer, but then I have to cbind (column-bind) the ID and GROUP fields back on. Is there a standard way to do this, or a best practice that anyone has discovered?
A solution:
df.set_index(['ID', 'GROUP'], inplace=True)
pd.get_dummies(df.PATH.str.split('-', expand=True).stack())\
    .groupby(level=[0, 1]).sum().reset_index()
Set the ID and GROUP columns as the index, then split the string into one token per cell:
df.PATH.str.split('-', expand=True)
Out[37]:
              0   1   2     3
ID    GROUP
11937 GT     MM  YT  UJ    OO
11938 TQ     YT  RY  LM  None
11939 DX     XX  XX  OT  None
Get them into a single column of data
df.PATH.str.split('-', expand=True).stack()
Out[38]:
ID     GROUP
11937  GT     0    MM
              1    YT
              2    UJ
              3    OO
11938  TQ     0    YT
              1    RY
              2    LM
11939  DX     0    XX
              1    XX
              2    OT
get_dummies spreads the token indicators across columns, one row per token:
pd.get_dummies(df.PATH.str.split('-', expand=True).stack())
Out[39]:
                LM  MM  OO  OT  RY  UJ  XX  YT
ID    GROUP
11937 GT     0   0   1   0   0   0   0   0   0
             1   0   0   0   0   0   0   0   1
             2   0   0   0   0   0   1   0   0
             3   0   0   1   0   0   0   0   0
11938 TQ     0   0   0   0   0   0   0   0   1
             1   0   0   0   0   1   0   0   0
             2   1   0   0   0   0   0   0   0
11939 DX     0   0   0   0   0   0   0   1   0
             1   0   0   0   0   0   0   1   0
             2   0   0   0   1   0   0   0   0
Group the data by ID and GROUP (levels 0 and 1 of the index) to sum the rows and get one line per (ID, GROUP) tuple. Finally, reset the index to turn ID and GROUP back into regular columns.
Maybe you can try something like this:
import pandas as pd

# Test data
df = pd.DataFrame({'GROUP': ['GT', 'TQ', 'DX'],
                   'ID': [11937, 11938, 11939],
                   'PATH': ['MM-YT-UJ-OO', 'YT-RY-LM', 'XX-XX-OT']})

# Expand the data, creating one column per token
tmp = pd.concat([df.loc[:, ['GROUP', 'ID']],
                 df['PATH'].str.split('-', expand=True)], axis=1)

# Convert from wide to long format
tmp = pd.melt(tmp, id_vars=['ID', 'GROUP'])

# Now group and count
tmp.groupby(['ID', 'GROUP', 'value']).count().unstack().fillna(0)
#              variable
# value            LM   MM   OO   OT   RY   UJ   XX   YT
# ID    GROUP
# 11937 GT        0.0  1.0  1.0  0.0  0.0  1.0  0.0  1.0
# 11938 TQ        1.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0
# 11939 DX        0.0  0.0  0.0  1.0  0.0  0.0  2.0  0.0
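Neither answer produces the MM-YT style bigram columns the question also asks for. A rough sketch of the CountVectorizer route mentioned in the question, joined back onto ID and GROUP by position; the split-on-'-' tokenizer and ngram_range=(1, 2) are assumptions, and scikit-learn joins bigram feature names with a space ('MM YT') rather than a dash:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'ID': [11937, 11938, 11939],
                   'GROUP': ['GT', 'TQ', 'DX'],
                   'PATH': ['MM-YT-UJ-OO', 'YT-RY-LM', 'XX-XX-OT']})

# Count unigrams and bigrams of '-'-separated tokens; these are counts rather
# than one-hot flags, so the repeated XX in 'XX-XX-OT' shows up as 2.
vec = CountVectorizer(tokenizer=lambda s: s.split('-'), ngram_range=(1, 2), lowercase=False)
counts = pd.DataFrame(vec.fit_transform(df['PATH']).toarray(),
                      columns=vec.get_feature_names_out(),
                      index=df.index)

result = pd.concat([df[['ID', 'GROUP']], counts], axis=1)
print(result)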
