I have a df called all_data and within it a column called 'Neighborhood'. all_data["Neighborhood"].head() looks like
0 CollgCr
1 Veenker
2 CollgCr
3 Crawfor
4 NoRidge
I want to replace certain neighborhood names with 0, and others with 1 to get
0 1
1 1
2 1
3 0
4 1
So I did this:
all_data["Neighb_Good"] = all_data["Neighborhood"].copy().replace({'Neighborhood': {'StoneBr': 1, 'NrdigHt': 1,
'Veenker': 1, 'Somerst': 1,
'Timber': 1, 'CollgCr': 1,
'Blmngtn': 1, 'NoRidge': 1,
'Mitchel': 1, 'ClearCr': 1,
'ClearCr': 0, 'Crawfor': 0,
'SawyerW': 0, 'Gilbert': 0,
'Sawyer': 0, 'NPkVill': 0,
'NAmes': 0, 'NWAmes': 0,
'BrkSide': 0, 'MeadowV': 0,
'Edwards': 0, 'Blueste': 0,
'BrDale': 0, 'OldTown': 0,
'IDOTRR': 0, 'SWISU': 0,
}})
It doesn't give me an error, but nothing happens. Instead, all_data["Neighb_Good"] looks exactly like all_data["Neighborhood"].
I've been trying to figure it out for a while now and I swear I can't see what's the matter because I've used this same method yesterday on some other columns and it worked perfectly.
UPDATE: you seem to need Series.map():
In [196]: df['Neighb_Good'] = df['Neighborhood'].map(d['Neighborhood'])
In [197]: df
Out[197]:
Neighborhood Neighb_Good
0 CollgCr 1
1 Veenker 1
2 CollgCr 1
3 Crawfor 0
4 NoRidge 1
Using the data set from the comments:
In [201]: df["ExterQual_Good"] = df["ExterQual"].map(d)
In [202]: df
Out[202]:
ExterQual ExterQual_Good
0 TA 1
1 Fa 0
2 Gd 1
3 Po 0
4 Ex 1
Old answer:
Use DataFrame.replace() instead of Series.replace() if you have a nested dict keyed by column name:
In [81]: df['Neighb_Good'] = df.replace(d)
In [82]: df
Out[82]:
Neighborhood Neighb_Good
0 CollgCr 1
1 Veenker 1
2 CollgCr 1
3 Crawfor 0
4 NoRidge 1
or use Series.replace() with a flat (not nested) dict:
In [85]: df['Neighb_Good'] = df['Neighborhood'].replace(d['Neighborhood'])
In [86]: df
Out[86]:
Neighborhood Neighb_Good
0 CollgCr 1
1 Veenker 1
2 CollgCr 1
3 Crawfor 0
4 NoRidge 1
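For reference, here is a minimal self-contained sketch of the two flat-dict approaches. The frame and the mapping named good are made up for illustration (only the neighborhood names come from the question, and "Unknown" is a deliberately unmapped value). It shows one difference worth knowing: Series.map() yields NaN for names missing from the dict, whereas Series.replace() leaves them unchanged.

```python
import pandas as pd

# Illustrative data; "Unknown" is intentionally absent from the mapping.
df = pd.DataFrame(
    {"Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "Unknown"]}
)

# Flat dict: neighborhood name -> flag (hypothetical subset of the question's dict).
good = {"CollgCr": 1, "Veenker": 1, "NoRidge": 1, "Crawfor": 0}

# map() looks every value up in the dict; unmapped values become NaN.
mapped = df["Neighborhood"].map(good)

# replace() only touches values that appear as keys; the rest stay as they are.
replaced = df["Neighborhood"].replace(good)

print(pd.DataFrame({"Neighborhood": df["Neighborhood"], "map": mapped, "replace": replaced}))
```

If every name is covered by the dict, both calls give the same 0/1 values; map() is usually the more natural fit when the dict is meant to be exhaustive.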
How about
A B C
P 0 1 2
Q 3 4 5
R 6 7 8
S 9 10 11
T 12 13 14
U 15 16 17
data1.A.replace({0:"A"..and so on})
Hi, I am using df.mode() to find the most common value in each row, but df.mode(axis=1) gives me an extra column. How can I get only one column?
For example, I have this data frame:
0 1 2 3 4
1 1 0 1 1 1
2 0 1 0 0 1
3 0 0 1 1 0
so I want the output
1 1
2 0
3 0
but I am getting
1 1 NaN
2 0 NaN
3 0 NaN
Does anyone know why?
The code you tried gives the expected output in Python 3.7.6 with Pandas 1.0.3.
import pandas as pd
df = pd.DataFrame(
data=[[1, 0, 1, 1, 1], [0, 1, 0, 0, 1], [0, 0, 1, 1, 0]],
index=[1, 2, 3])
df
0 1 2 3 4
1 1 0 1 1 1
2 0 1 0 0 1
3 0 0 1 1 0
df.mode(axis=1)
0
1 1
2 0
3 0
Your columns may hold different data types, and mode() cannot compare values of different types.
Convert the columns to one consistent type, e.g. with df.astype(int) or df.astype(str), before calling mode(axis=1).
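Another way an extra, mostly-NaN column can appear is a tie: if some row has two equally common values, mode(axis=1) returns both and pads the other rows with NaN. The sketch below uses made-up four-column data (not the asker's frame) to show this, and keeps only the first mode per row.

```python
import pandas as pd

# Made-up data: row 3 contains two 0s and two 1s, so it has two modes.
df = pd.DataFrame(
    data=[[1, 0, 1, 1], [0, 1, 0, 0], [0, 0, 1, 1]],
    index=[1, 2, 3],
)

# One column per mode; rows with a single mode are padded with NaN.
modes = df.mode(axis=1)

# Keep only the first mode of each row (the smallest one when there is a tie).
first_mode = modes.iloc[:, 0].astype(int)

print(modes)
print(first_mode)
```

Whether taking the first mode is the right tie-break depends on the data; it is only one possible choice.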
Quick question that I'm brain-farting on how to best implement. I am generating a matrix to add up how many times two items are found next to each other in a list across a large number of permutations of this list. My code looks something like this:
agreement_matrix = pandas.DataFrame(0, index=names, columns=names)
for lst in bunch_of_lists:
    for i in range(len(lst) - 1):
        # column = current item, row = the item that follows it
        agreement_matrix.loc[lst[i + 1], lst[i]] += 1
It generates an array like:
A B C D
A 0 2 1 1
B 2 0 1 1
C 1 1 0 2
D 1 1 2 0
And because I don't care about order as much I want to add up values so it's like this:
A B C D
A 0 4 2 2
B 0 0 2 2
C 0 0 0 4
D 0 0 0 0
Is there any fast/simple way to achieve this? I've been toying around with both doing it after generation and trying to do it as I add values.
Use np.tri*:
np.triu(df) + np.tril(df).T
array([[0, 4, 2, 2],
[0, 0, 2, 2],
[0, 0, 0, 4],
[0, 0, 0, 0]])
Call the DataFrame constructor:
pd.DataFrame(np.triu(df) + np.tril(df).T, df.index, df.columns)
A B C D
A 0 4 2 2
B 0 0 2 2
C 0 0 0 4
D 0 0 0 0
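A self-contained sketch of the same idea, using the matrix from the question. The one change is an assumption on my part: np.tril is called with k=-1 so that the diagonal is not counted twice, which only matters if the diagonal can be non-zero.

```python
import numpy as np
import pandas as pd

names = list("ABCD")
counts = pd.DataFrame(
    [[0, 2, 1, 1],
     [2, 0, 1, 1],
     [1, 1, 0, 2],
     [1, 1, 2, 0]],
    index=names,
    columns=names,
)

values = counts.to_numpy()

# Upper triangle (with diagonal) plus the transposed strictly-lower triangle.
folded_values = np.triu(values) + np.tril(values, k=-1).T

folded = pd.DataFrame(folded_values, index=names, columns=names)
print(folded)
```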
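If the matrix is symmetric (as in the example), doubling the upper triangle gives the same result:
np.triu(df.values * 2)  # in general: np.triu(df.values.T + df.values)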
Out[595]:
array([[0, 4, 2, 2],
[0, 0, 2, 2],
[0, 0, 0, 4],
[0, 0, 0, 0]], dtype=int64)
Then you do
pd.DataFrame(np.triu(df.values*2), df.index, df.columns)
Out[600]:
A B C D
A 0 4 2 2
B 0 0 2 2
C 0 0 0 4
D 0 0 0 0
A pandas solution to avoid the first loop:
values=['ABCD'[i] for i in np.random.randint(0,4,100)] # data
df=pd.DataFrame(values)
df[1]=df[0].shift()
df=df.iloc[1:]
df.values.sort(axis=1)
df[2]=1
res=df.pivot_table(2,0,1,np.sum,0)
#
#1 A B C D
#0
#A 2 14 11 16
#B 0 5 9 13
#C 0 0 10 17
#D 0 0 0 2
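If the goal is to ignore order while counting, another option is to sort each adjacent pair before counting it, so the result is upper-triangular from the start. This is only a sketch: bunch_of_lists below is made-up sample data, and collections.Counter is my substitution for the pivot_table step.

```python
from collections import Counter

import pandas as pd

names = list("ABCD")
# Made-up sample permutations standing in for the real bunch_of_lists.
bunch_of_lists = [list("ABCD"), list("BADC"), list("CDAB")]

# Count each unordered adjacent pair once, keyed as (smaller, larger).
pair_counts = Counter()
for seq in bunch_of_lists:
    for left, right in zip(seq, seq[1:]):
        pair_counts[tuple(sorted((left, right)))] += 1

# Fill an all-zero frame; only cells on or above the diagonal get values.
agreement = pd.DataFrame(0, index=names, columns=names)
for (row, col), n in pair_counts.items():
    agreement.loc[row, col] = n

print(agreement)
```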
I have a pandas dataframe with the following columns (all of dtype object):
1. x_id
2. y_id
3. Sentence1
4. Sentence2
5. Label
I want to split Sentence1 and Sentence2 into multiple columns, one word per column, in the same dataframe.
Here is an example dataframe, named df:
x_id y_id Sentence1 Sentence2 Label
0 2 This is a ball I hate you 0
1 5 I am a boy Ahmed Ali 1
2 1 Apple is red Rose is red 1
3 9 I love you so much Me too 1
After splitting the columns Sentence1 and Sentence2 on spaces, the dataframe should look like:
x_id y_id 1 2 3 4 5 6 7 8 Label
0 2 This is a ball NONE I hate you 0
1 5 I am a boy NONE Ahmed Ali NONE 1
2 1 Apple is red NONE NONE Rose is red 1
3 9 I love you so much Me too NONE 1
How can I split the columns like this in Python, using a pandas dataframe?
In [26]: x = pd.concat([df.pop('Sentence1').str.split(expand=True),
...: df.pop('Sentence2').str.split(expand=True)],
...: axis=1)
...:
In [27]: x.columns = np.arange(1, x.shape[1]+1)
In [28]: x
Out[28]:
1 2 3 4 5 6 7 8
0 This is a ball None I hate you
1 I am a boy None Ahmed Ali None
2 Apple is red None None Rose is red
3 I love you so much Me too None
In [29]: df = df.join(x)
In [30]: df
Out[30]:
x_id y_id Label 1 2 3 4 5 6 7 8
0 0 2 0 This is a ball None I hate you
1 1 5 1 I am a boy None Ahmed Ali None
2 2 1 1 Apple is red None None Rose is red
3 3 9 1 I love you so much Me too None
One-hot-encoding labeling solution:
In [14]: df.Sentence1 += ' ' + df.pop('Sentence2')
In [15]: df
Out[15]:
x_id y_id Sentence1 Label
0 0 2 This is a ball I hate you 0
1 1 5 I am a boy Ahmed Ali 1
2 2 1 Apple is red Rose is red 1
3 3 9 I love you so much Me too 1
In [16]: from sklearn.feature_extraction.text import CountVectorizer
In [17]: vect = CountVectorizer()
In [18]: X = vect.fit_transform(df.Sentence1.fillna(''))
X is a sparse (memory-saving) matrix:
In [23]: X
Out[23]:
<4x17 sparse matrix of type '<class 'numpy.int64'>'
with 19 stored elements in Compressed Sparse Row format>
In [24]: type(X)
Out[24]: scipy.sparse.csr.csr_matrix
In [19]: X.toarray()
Out[19]:
array([[0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
[1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]], dtype=int64)
Most sklearn methods accept sparse matrices.
If you want to "unpack" it:
In [21]: r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
In [22]: r
Out[22]:
ahmed ali am apple ball boy hate is love me much red rose so this too you
0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1
1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 1 0 0 0 2 0 0 0 2 1 0 0 0 0
3 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 1
Here is how to do it for the sentences in the column Sentence1. The idea is identical for the Sentence2 column.
splits = df.Sentence1.str.split(' ')
longest = splits.apply(len).max()
Note that longest is the length of the longest sentence. Now make the Null columns:
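for j in range(1, longest + 1):
    df[str(j)] = None  # object-dtype columns, so words can be assigned later
And finally, go through the split values and assign them:
for i, words in splits.items():
    # i is the row label, words is the list of tokens for that row
    for k, word in enumerate(words, start=1):
        df.loc[i, str(k)] = word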
This looks like a machine learning problem. Converting one column into as many columns as the longest sentence has words may not be efficient.
Another (probably more efficient) solution is to convert each word to an integer and then pad every sentence to the length of the longest one. TensorFlow has tools for that.
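To make the integer-plus-padding idea concrete, here is a minimal plain-Python sketch. It does not use TensorFlow; the vocabulary scheme (ids assigned in order of first appearance, 0 reserved for padding, lower-casing) is an assumption made for illustration, and the sentences are taken from the example frame.

```python
sentences = ["This is a ball", "I am a boy", "I love you so much"]

# Tokenize on whitespace; lower-casing is an illustrative choice.
tokens = [s.lower().split() for s in sentences]

# Assign each new word the next free id, starting at 1 (0 is the padding id).
vocab = {}
for words in tokens:
    for word in words:
        vocab.setdefault(word, len(vocab) + 1)

# Right-pad every encoded sentence with 0 up to the longest sentence.
longest = max(len(words) for words in tokens)
padded = [
    [vocab[word] for word in words] + [0] * (longest - len(words))
    for words in tokens
]

for row in padded:
    print(row)
```

Each sentence becomes a fixed-length row of integers, which is the shape most sequence models expect.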
I have a DataFrame like below and would like for B to be 1 for n rows after the 1 in column A (where below n = 2)
index A B
0 0 0
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
6 0 1
7 0 0
8 1 0
9 0 1
I think I can do it using .ix, similar to this example, but I'm not sure how. I'd like to do it in a single pandas-style selection command if possible. (Ideally not using rolling_apply.)
Modifying a subset of rows in a pandas dataframe
EDIT: the application is that a 1 in column A is "ignored" if it falls within n rows of the previous 1. As per the comments, for n = 2 and these examples:
A = [1, 0, 1, 0, 1], B should be = [0, 1, 1, 0, 0]
A = [1, 1, 0, 0], B should be [0, 1, 1, 0]
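One possible reading of these rules, sketched as a plain loop rather than the single selection command asked for. mark_following is a hypothetical helper name, and the sketch assumes that an "ignored" 1 does not restart the window; it reproduces the table and both examples above, but other interpretations are possible.

```python
def mark_following(a, n):
    """Return B: 1 for the n rows after each non-ignored 1 in a, else 0."""
    b = [0] * len(a)
    remaining = 0  # rows still to be marked after the last triggering 1
    for i, value in enumerate(a):
        if remaining > 0:
            # Inside a window: mark the row and ignore any 1 found here.
            b[i] = 1
            remaining -= 1
        elif value == 1:
            # A 1 outside any window opens a new window of n rows.
            remaining = n
    return b


print(mark_following([1, 0, 1, 0, 1], n=2))
print(mark_following([1, 1, 0, 0], n=2))
print(mark_following([0, 1, 0, 0, 1, 0, 0, 0, 1, 0], n=2))
```

With a DataFrame, the result could be assigned as df["B"] = mark_following(df["A"].tolist(), n=2).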
I've got a dataset with a big number of rows. Some of the values are NaN, like this:
In [91]: df
Out[91]:
1 3 1 1 1
1 3 1 1 1
2 3 1 1 1
1 1 NaN NaN NaN
1 3 1 1 1
1 1 1 1 1
And I want to count the number of NaN values in each row, it would be like this:
In [91]: list = <somecode with df>
In [92]: list
Out[91]:
[0,
0,
0,
3,
0,
0]
What is the best and fastest way to do it?
You could first mark each element as NaN or not with isnull(), and then take the row-wise sum with sum(axis=1):
In [195]: df.isnull().sum(axis=1)
Out[195]:
0 0
1 0
2 0
3 3
4 0
5 0
dtype: int64
And if you want the output as a list, call tolist():
In [196]: df.isnull().sum(axis=1).tolist()
Out[196]: [0, 0, 0, 3, 0, 0]
Or subtract the per-row count of non-NaN values from the number of columns:
In [130]: df.shape[1] - df.count(axis=1)
Out[130]:
0 0
1 0
2 0
3 3
4 0
5 0
dtype: int64
To count NaNs per row over specific columns only, use
cols = ['col1', 'col2']
df['number_of_NaNs'] = df[cols].isna().sum(1)
or index the columns by position, e.g. count NaNs in the first 4 columns:
df['number_of_NaNs'] = df.iloc[:, :4].isna().sum(1)
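A small self-contained check of the variants above. The frame and the column names col1, col2, col3 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: row 0 has no NaN, row 1 has one, row 2 has two.
df = pd.DataFrame(
    {
        "col1": [1, 1, np.nan],
        "col2": [3, np.nan, np.nan],
        "col3": [1, 1, 1],
    }
)

# NaNs per row across all columns.
per_row = df.isna().sum(axis=1)

# Same result via count(), which counts the non-NaN values per row.
via_count = df.shape[1] - df.count(axis=1)

# NaNs per row restricted to a subset of columns.
subset = df[["col1", "col2"]].isna().sum(axis=1)

print(per_row.tolist(), via_count.tolist(), subset.tolist())
```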