I would like to find the mode value of each digit in binary strings of a pandas column. Suppose I have the following data
df = pd.DataFrame({'categories':['A','B','C'],'values':['001','110','111']})
so my data looks like this:
categories values
A 001
B 110
C 111
If we consider the column "values" at the first digit (0, 1, 1) of A, B, and C respectively, the mode value is 1. If we do the same for other digits, my expected output should be 111.
I can find the mode of a particular column, so if I split each bit into its own column and take the mode of each, I could get the expected output by concatenating later. However, when the data has many more columns of binary strings, I'm not sure this approach scales well. Is there a more elegant way to do this?
I think you can use apply with Series and list to convert the digits to columns, and then call mode:
print (df['values'].apply(lambda x: pd.Series(list(x))))
0 1 2
0 0 0 1
1 1 1 0
2 1 1 1
df1 = df['values'].apply(lambda x: pd.Series(list(x))).mode()
print (df1)
0 1 2
0 1 1 1
Last, select the first row, convert it to a list and join:
print (''.join(df1.iloc[0].tolist()))
111
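Putting those steps together, a minimal end-to-end sketch (same toy data as above):

```python
import pandas as pd

df = pd.DataFrame({'categories': ['A', 'B', 'C'], 'values': ['001', '110', '111']})

# expand each binary string into one column per digit,
# take the column-wise mode, then join the first mode row back together
digits = df['values'].apply(lambda x: pd.Series(list(x)))
result = ''.join(digits.mode().iloc[0])
print(result)  # 111
```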
Another possible solution uses a list comprehension:
df = pd.DataFrame([list(x) for x in df['values']])
print (df)
0 1 2
0 0 0 1
1 1 1 0
2 1 1 1
If the output is a DataFrame with multiple mode rows, you can use apply with join:
df = pd.DataFrame({'categories':['A','B','C', 'D'],'values':['001','110','111', '000']})
print (df)
categories values
0 A 001
1 B 110
2 C 111
3 D 000
print (pd.DataFrame([list(x) for x in df['values']]).mode())
0 1 2
0 0 0 0
1 1 1 1
df1 = pd.DataFrame([list(x) for x in df['values']]).mode().apply(''.join, axis=1)
print (df1)
0 000
1 111
dtype: object
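End to end, the multi-mode case can be checked like this (a sketch using the four-row data above; with an even number of rows, each digit position can have two modes, so mode() returns one row per candidate):

```python
import pandas as pd

df = pd.DataFrame({'categories': ['A', 'B', 'C', 'D'],
                   'values': ['001', '110', '111', '000']})

# each digit position has both '0' and '1' as modes here,
# so mode() yields two rows and join produces both strings
df1 = pd.DataFrame([list(x) for x in df['values']]).mode().apply(''.join, axis=1)
print(df1.tolist())  # ['000', '111']
```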
I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'sent.1':[0,1,0,1],
'sent.2':[0,1,1,0],
'sent.3':[0,0,0,1],
'sent.4':[1,1,0,1]
})
I am trying to replace the non-zero values with the 5th character of the column names (which is the numeric part of the column names), so the output should be:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
I have tried the following but it does not work,
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However, when I replace the values with the column names themselves, the code works, so I am not sure which part is wrong.
print(df.replace(1, pd.Series(df.columns, df.columns)))
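For what it's worth, the replace approach in the question does work once the Series index matches the column names; replace looks up per-column replacements by the Series index, so indexing by the extracted digits finds no match. A sketch of the corrected call (my addition):

```python
import pandas as pd

df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})

# the Series *index* must be the column names; the *values* are the digits
repl = pd.Series([int(c[5]) for c in df.columns], index=df.columns)
out = df.replace(1, repl)
print(out)
```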
Since you're dealing with 1's and 0's, you can actually just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
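A runnable sketch of the column-name variant (toy data from the question):

```python
import pandas as pd

df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})

# multiplication broadcasts the per-column factor down each column,
# so zeros stay zero and ones become the column's number
out = df * df.columns.str.split('.').str[-1].astype(int)
print(out)
```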
You could use string multiplication on a boolean array to place the strings based on the condition, and where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))
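A sketch of the mask-based route on the question's data; note the replacement digits stay strings here, so the affected columns become object dtype (my addition):

```python
import pandas as pd

df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})

# for each column, mask the non-zero entries with the digit
# after the last '.' in the column name
out = df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))
print(out)
```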
Suppose I have a dataframe df as shown below
qty
0 1.300
1 1.909
Now I want to extract only the integer portion of the qty column and the df should look like
qty
0 1
1 1
Tried using df['qty'].round(0) but didn't get the desired result, as it rounds the number to the nearest integer.
Java has a function intValue() which does the desired operation. Is there a similar function in pandas ?
Convert the values to integers with Series.astype:
df['qty'] = df['qty'].astype(int)
print (df)
qty
0 1
1 1
If the above does not work, you can use numpy.modf to extract the values before the .:
a, b = np.modf(df['qty'])
df['qty'] = b.astype(int)
print (df)
qty
0 1
1 1
Or split the string representation on ., though this will be slow for a large DataFrame:
df['qty'] = df['qty'].astype(str).str.split('.').str[0].astype(int)
Or use numpy.floor:
df['qty'] = np.floor(df['qty']).astype(int)
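One point worth noting (not in the original answers): astype(int) truncates toward zero, while np.floor rounds toward negative infinity, so the two only agree for non-negative values. A sketch with a negative row added:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'qty': [1.300, 1.909, -1.5]})

# truncation vs floor differ for negative values
trunc = df['qty'].astype(int)
floored = np.floor(df['qty']).astype(int)
print(trunc.tolist())    # [1, 1, -1]
print(floored.tolist())  # [1, 1, -2]
```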
You can use the method floordiv:
df['col'].floordiv(1).astype(int)
For example:
col
0 9.748333
1 6.612708
2 2.888753
3 8.913470
4 2.354213
Output:
0 9
1 6
2 2
3 8
4 2
Name: col, dtype: int64
I have a dataframe (df) in which I want to insert a value for a column using only the index numbers (positions).
The dataframe df is in following form.
a b c
1 0 0 0
2 0 0 0
3 0 0 0
df.iloc[[0],[1]] = predictions[:1]
This gives me the following warning and does not write anything to the row:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
However when I try using
pred_row.iloc[0,1] = predictions[:1]
It gives me error
ValueError: Incompatible indexer with Series
Is there a way to write a value to a single cell of the dataframe?
predictions is just an arbitrary value that I am trying to set in a particular cell of df.
To set one element of the Series into the DataFrame, change the selection to predictions[0]:
print (df)
a b c
1 0 0 0
2 0 0 0
3 0 0 0
predictions = pd.Series([1,2,3])
print (predictions)
0 1
1 2
2 3
dtype: int64
df.iloc[0, 1] = predictions[0]
#more general for set one element of Series by position
#df.iloc[0, 1] = predictions.iat[0]
print (df)
a b c
1 0 1 0
2 0 0 0
3 0 0 0
Details:
#scalar
print (predictions[0])
1
#one element Series
print (predictions[:1])
0 1
dtype: int64
Converting the one-element Series to a one-element array also works, but setting a scalar is simpler:
df.iloc[0, 1] = predictions[:1].values
print (df)
a b c
1 0 1 0
2 0 0 0
3 0 0 0
print (predictions[:1].values)
[1]
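The positional route mentioned in the comment above can be sketched with .iat, which reads and writes a single scalar by position:

```python
import pandas as pd

df = pd.DataFrame(0, index=[1, 2, 3], columns=['a', 'b', 'c'])
predictions = pd.Series([1, 2, 3])

# .iat is the most direct accessor for setting one scalar by position
df.iat[0, 1] = predictions.iat[0]
print(df)
```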
I need to process my dataframe in Python such that I add the numeric values of numeric columns that lie between 2 rows of the dataframe.
The dataframe can be created using
df = pd.DataFrame(np.array([['a',0,1,0,0,0,0,'i'],
['b',1,0,0,0,0,0,'j'],
['c',0,0,1,0,0,0,'k'],
['None',0,0,0,1,0,0,'l'],
['e',0,0,0,0,1,0,'m'],
['f',0,1,0,0,0,0,'n'],
['None',0,0,0,1,0,0,'o'],
['h',0,0,0,0,1,0,'p']]),
columns=[0,1,2,3,4,5,6,7],
index=[0,1,2,3,4,5,6,7])
I need to add up all rows that occur before each 'None' entry and move the aggregated rows to a new dataframe.
Your dataframe's dtypes are messed up because you built it from a single numpy array: an array can only hold one type, so all the ints were pushed to strings. We need to convert them first.
df = df.apply(pd.to_numeric, errors='ignore')  # convert
df['newkey'] = df[0].eq('None').cumsum()  # build the group key with cumsum
df.loc[df[0].ne('None'), :].groupby('newkey').agg(lambda x: x.sum() if x.dtype == 'int64' else x.head(1))  # then aggregate
0 1 2 3 4 5 6 7
newkey
0 a 1 1 1 0 0 0 i
1 e 0 1 0 0 1 0 m
2 h 0 0 0 0 1 0 p
You can also specify the agg funcs explicitly:
s = lambda s: sum(int(k) for k in s)
d = {i: s for i in range(8)}
d.update({0: 'first', 7: 'first'})
df.groupby((df[0] == 'None').cumsum().shift().fillna(0)).agg(d)
0 1 2 3 4 5 6 7
0
0.0 a 1 1 1 1 0 0 i
1.0 e 0 1 0 1 1 0 m
2.0 h 0 0 0 0 1 0 p
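The two answers differ only in how the group key is built; a minimal sketch of both keys on a toy Series (my addition) makes the difference visible:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'None', 'c', 'None'])

# key1: a 'None' row opens a new group (those rows are filtered out later)
key1 = s.eq('None').cumsum()
# key2: shifting keeps each 'None' row inside the group it terminates
key2 = s.eq('None').cumsum().shift().fillna(0)
print(key1.tolist())  # [0, 0, 1, 1, 2]
print(key2.tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0]
```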
In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:
0 'a'
1 'a,b,c'
2 'a,b,d'
3 'd'
4 'c,d'
Ultimately, I'd want to have binary columns for each possible discrete value; in other words, final column count equals number of unique values in the original column. I imagine I'd have to use split() to get each separate value but not sure what to do afterwards. Any hint much appreciated!
Edit: Additional twist. Column has null values. And in response to comment, the following is the desired output. Thanks!
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
Use str.get_dummies:
df['col'].str.get_dummies(sep=',')
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
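Regarding the "column has null values" twist in the edit: NaN entries simply come out of str.get_dummies as all-zero rows, so no extra handling is needed. A quick sketch (my addition):

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'a,b,c', np.nan, 'd'])

# the NaN row encodes to all zeros
out = s.str.get_dummies(sep=',')
print(out)
```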
Edit: Updating the answer to address some questions.
Qn 1: Why is it that the series method get_dummies does not accept the argument prefix=... while pandas.get_dummies() does accept it?
Series.str.get_dummies is a series level method (as the name suggests!). We are one hot encoding values in one Series (or a DataFrame column) and hence there is no need to use prefix. Pandas.get_dummies on the other hand can one hot encode multiple columns. In which case, the prefix parameter works as an identifier of the original column.
If you want to apply prefix to str.get_dummies, you can always use DataFrame.add_prefix
df['col'].str.get_dummies(sep=',').add_prefix('col_')
Qn 2: If you have more than one column to begin with, how do you merge the dummies back into the original frame?
You can use pandas.concat to merge the one-hot-encoded columns with the rest of the columns in the dataframe.
df = pd.DataFrame({'other':['x','y','x','x','q'],'col':['a','a,b,c','a,b,d','d','c,d']})
df = pd.concat([df, df['col'].str.get_dummies(sep=',')], axis=1).drop(columns='col')
other a b c d
0 x 1 0 0 0
1 y 1 1 1 0
2 x 1 1 0 1
3 x 0 0 0 1
4 q 0 0 1 1
The str.get_dummies function does not accept a prefix parameter, but you can rename the columns of the returned dummy DataFrame:
data['col'].str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')
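A self-contained check that the rename route yields the same prefixed columns as add_prefix (toy data):

```python
import pandas as pd

s = pd.Series(['a', 'a,b,c', 'a,b,d', 'd', 'c,d'])

# rename with a callable prepends the prefix to every column label
out = s.str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')
print(out.columns.tolist())  # ['col_a', 'col_b', 'col_c', 'col_d']
```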