I have a dataframe and I want to change some elements of a column based on a condition.
In particular, given this column:
... VALUE ....
0
"1076A"
12
9
"KKK0139"
5
I want to obtain this:
... VALUE ....
0
"1076A"
12
9
"0139"
5
The 'VALUE' column contains both strings and numbers. When I find a particular substring in a string value, I want to keep the same value with that substring removed.
I have tried:
1) df['VALUE'] = np.where(df['VALUE'].str.contains('KKK', na=False), df['VALUE'].str[3:], df['VALUE'])
2) df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df['VALUE'].str[3:]
But both attempts return an IndexError: invalid index to scalar variable.
Any advice?
As the column contains both numeric (non-string) and string values, you cannot use .str.replace(), since it handles strings only: non-string elements would be converted to NaN. You have to use .replace() instead.
Here, you can use:
df['VALUE'] = df['VALUE'].replace(r'KKK', '', regex=True)
Input:
data = {'VALUE': [0, "1076A", 12, 9, "KKK0139", 5]}
df = pd.DataFrame(data)
Result:
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
If you use .str.replace() instead, you will get the following. Note the NaN results for the numeric (non-string) values:
0 NaN
1 1076A
2 NaN
3 NaN
4 0139
5 NaN
Name: VALUE, dtype: object
In general, if you want to remove a leading alphabetic substring, you can use:
df['VALUE'] = df['VALUE'].replace(r'^[A-Za-z]+', '', regex=True)
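Note the ^ anchor: 'KKK0139' becomes '0139', while '1076A' is left unchanged, because its letter is trailing rather than leading.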
Alternatively, cast to string first so that .str.replace() works on every element (note this turns the numeric values into strings too):
>>> df['VALUE'].astype(str).str.replace('KKK', '')
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
Your second solution fails because you also need to apply the row selector to the right-hand side of your assignment:
df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'].str[3:]
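A minimal sketch that computes the mask once instead of twice (same sample data assumed):
mask = df['VALUE'].str.contains('KKK', na=False)
df.loc[mask, 'VALUE'] = df.loc[mask, 'VALUE'].str[3:]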
Looking at your sample data, if 'K' is the only problem, just replace it with an empty string:
df['VALUE'].str.replace('K', '')
0 0
1 "1076A"
2 12
3 9
4 "0139"
5 5
Name: VALUE, dtype: object
If you want to do it only for specific occurrences or positions of 'K', you can do that as well.
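For instance, Series.str.replace accepts an n parameter limiting the number of replacements; a small sketch (like any .str method, it still requires string values):
# remove only the first 'K' in each string
df['VALUE'].str.replace('K', '', n=1)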
I have a column that contains both numbers and words, all of type str:
ex.
['2','3','Amy','199','Happy']
And I want to convert all "str number" into int and remove (the rows with) the "str words".
So my expected output would be a list like below:
[2, 3, 199]
Since I have a pandas dataframe, and this is supposed to be one of its columns, it would be even better if it could be a Series as follows:
0 2.0
1 3.0
3 199.0
dtype: float64
As you mentioned, you have a column (a Series), so let's say it's called s:
s = pd.Series(['2', '3', 'Amy', '199', 'Happy'])
Then just apply pd.to_numeric with the parameter errors='coerce', and remove the NaNs with dropna:
print(pd.to_numeric(s, errors='coerce').dropna())
Then the above code will output:
0 2.0
1 3.0
3 199.0
dtype: float64
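If you prefer integers rather than floats, a small optional extra step is to cast after dropping the NaNs:
print(pd.to_numeric(s, errors='coerce').dropna().astype(int))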
Without using pandas, since you are supplying a plain list:
import re
data = ['2','3','Amy','199','Happy']
for item in data:
    print(*re.findall(r'\d+', item))
will give
2
3
199
and
import re
data = ['2','3','Amy','199','Happy']
out = []
for item in data:
    # take the first run of digits, if any
    m = re.findall(r'\d+', item)
    if m:
        out.append(int(m[0]))
print(out)
will give
[2, 3, 199]
You can use isnumeric to filter out non-numeric items.
s = pd.Series(['2','3','Amy','199','Happy'])
print(s[s.str.isnumeric()].astype(int))
Output:
0 2
1 3
3 199
dtype: int64
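Note that isnumeric() is False for strings such as '1.5' or '-2', so prefer the to_numeric approach if such values can occur.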
I have a python pandas data frame like this, with 200k to 400k rows:
Index value
1 a
2
3 v
4
5
6 6077
7
8 h
and I want the value column to be filled downward, so that every row below a single-character string gets that string until the next one appears (in this table the markers are the strings of length 1: 'a', 'v', 'h').
I want my dataframe to be like this.
Index value
1 a
2 a
3 v
4 v
5 v
6 v
7 v
8 h
If you need to repeat the strings of length 1, you can use Series.str.match with the regex ^[a-zA-Z]{1}$ to check for single-letter strings, replace the non-matching values with NaN via Series.where, and finally forward-fill the missing values with ffill:
df['value'] = df['value'].where(df['value'].str.match('^[a-zA-Z]{1}$', na=False)).ffill()
print (df)
Index value
0 1 a
1 2 a
2 3 v
3 4 v
4 5 v
5 6 v
6 7 v
7 8 h
Another idea:
m1 = df['value'].str.len().eq(1)
m2 = df['value'].str.isalpha()
df['value'] = df['value'].where(m1 & m2).ffill()
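Here m1 checks for length 1 and m2 for alphabetic characters, so together they are equivalent to the regex match above without using a regex.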
The forward-fill method is exactly for this, assuming the missing entries are already NaN.
This should work for you (fillna(method='ffill') is deprecated in recent pandas, so call ffill directly):
df.ffill()
Try this (pd.np has been removed from pandas, so import numpy directly):
import numpy as np
df['value'].replace(r'\d+', np.nan, regex=True).ffill()
0 a
1 a
2 v
3 v
4 v
5 v
6 v
7 h
Name: value, dtype: object
Once you have removed all numbers, turn the empty strings into NaN (using .loc so only the value column is affected, not whole rows) and forward-fill:
df.loc[df['value'] == '', 'value'] = np.nan
df.ffill()
Assuming that any value that is not an empty string or a number should be forward-filled, the regular expression r'^\d*$' matches both an empty string and a number. These values can be replaced by np.nan, and then ffill can be called (assignment is used instead of inplace=True, which is unreliable on a column selection in recent pandas):
import numpy as np
df['value'] = df['value'].replace(r'^\d*$', np.nan, regex=True).ffill()
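For reference, a self-contained sketch of this approach, assuming the blank cells in the question are empty strings:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': ['a', '', 'v', '', '', '6077', '', 'h']})
df['value'] = df['value'].replace(r'^\d*$', np.nan, regex=True).ffill()
print(df['value'].tolist())  # ['a', 'a', 'v', 'v', 'v', 'v', 'v', 'h']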
I was passing an Index-type variable (pandas.Index) containing the labels of the columns I want to drop from my DataFrame, and it was working correctly. It was Index type because I was extracting the column names from the DataFrame itself based on a certain condition.
Afterwards, I needed to add another column name to that list, so I converted the Index object to a Python list so that I could append the additional label. But on passing the list as the columns parameter to the drop() method on the DataFrame, I now keep getting the error:
ValueError: Need to specify at least one of 'labels', 'index' or 'columns'
How to resolve this error?
The code I use is like this:
unique_count = df.apply(pd.Series.nunique)
redundant_columns = unique_count[unique_count == 1].index.values.tolist()
redundant_columns.append('DESCRIPTION')
print(redundant_columns)
df.drop(columns=redundant_columns, inplace=True)
Out: None
I found why the error is occurring: after the append() statement, redundant_columns becomes None. I don't know why. I would love it if someone could explain why this is happening.
For me, your solution works fine.
Another solution to remove the columns, by boolean indexing:
df = pd.DataFrame({'A':list('bbbbbb'),
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'DESCRIPTION':list('aaabbb')})
print (df)
A C D DESCRIPTION E
0 b 7 1 a 5
1 b 8 3 a 3
2 b 9 5 a 6
3 b 4 7 b 9
4 b 2 1 b 2
5 b 3 0 b 4
mask = df.nunique().ne(1)
mask['DESCRIPTION'] = False
df = df.loc[:, mask]
print (df)
C D E
0 7 1 5
1 8 3 3
2 9 5 6
3 4 7 9
4 2 1 2
5 3 0 4
Explanation:
First get the number of unique values per column with nunique and compare to 1 with ne (not equal)
Then set the mask entry for column DESCRIPTION to False so it is always removed
Finally filter the columns by boolean indexing
Details:
print (df.nunique())
A 1
C 6
D 5
DESCRIPTION 2
E 6
dtype: int64
mask = df.nunique().ne(1)
print (mask)
A False
C True
D True
DESCRIPTION True
E True
dtype: bool
mask['DESCRIPTION'] = False
print (mask)
A False
C True
D True
DESCRIPTION False
E True
dtype: bool
After trying things out, this got fixed by using a numpy.ndarray instead of a plain Python list, although I didn't know why at the time.
In my trials, passing a plain Python list gave the ValueError, while a pandas.Index or numpy.ndarray object containing the labels worked fine. So I went with np.ndarray, as it can still be appended to via np.append.
Current working code:
unique_count = df.apply(pd.Series.nunique)
redundant_columns: np.ndarray = unique_count[unique_count == 1].index.values
redundant_columns = np.append(redundant_columns, 'DESCRIPTION')
self.full_data.drop(columns=redundant_columns, inplace=True)
I had the same error when using .remove in the line of initialization:
myNewList = [i for i in myOldList].remove('Last Item')
myNewList became NoneType: list.remove (like list.append) mutates the list in place and returns None, so assigning its result replaces the list with None. Calling .tolist() on a separate line might help you:
redundant_columns = unique_count[unique_count == 1].index.values
redundant_columns = redundant_columns.tolist()
redundant_columns.append('DESCRIPTION')
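This is also the likely cause of the original problem; a minimal sketch of the pitfall, assuming the real code assigned the result of append:
cols = ['A', 'B']
cols = cols.append('DESCRIPTION')  # list.append returns None...
print(cols)                        # ...so cols is now None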
I want to replace the entire cell that contains the word (circled in the picture) with a blank or NaN. However, when I try to replace, for example, '1.25 Dividend', it turns out as '1.25 NaN'. I want the whole cell to become NaN. Any idea how to make this work?
Option 1
Use a regular expression in your replace
df.replace('^.*Dividend.*$', np.nan, regex=True)
From the comments:
Using regex=True means the pattern is interpreted as a regular expression, so you still need an appropriate pattern. The '^' anchors the match at the beginning of the string, so '^.*' matches all characters from the beginning. The '$' anchors at the end, so '.*$' matches all characters up to the end. Altogether, '^.*Dividend.*$' matches any string with 'Dividend' somewhere in the middle, and the whole match is replaced with np.nan.
Consider the dataframe df
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
0 1
0 1 2 Dividend
1 3 4
2 5 6 Dividend
then the proposed solution yields
0 1
0 1 NaN
1 3 4.0
2 5 NaN
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap, passing a lambda that identifies whether a cell contains 'Dividend':
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
0 1
0 1 NaN
1 3 4
2 5 NaN
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
0 1
0 1 NaN
1 3 4
2 5 NaN
Replace all strings:
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
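Note that this coerces every non-numeric cell to NaN, not just those containing 'Dividend', so use it only if that is what you want.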
I would use applymap like this (np.nan rather than the string 'NaN', so the result is a real missing value):
df.applymap(lambda x: np.nan if (isinstance(x, str) and 'Dividend' in x) else x)
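In pandas 2.1+, where applymap is deprecated, the equivalent call is DataFrame.map:
df.map(lambda x: np.nan if (isinstance(x, str) and 'Dividend' in x) else x)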
I have the following dataframe df1:
X Y Order_ NEW_ID
0 484970.4517 408844.0920 95083 1320437
1 478512.3233 415791.5395 96478 1320727
2 504516.3032 452923.4420 105246 1321260
3 485147.0529 428172.1055 99633 1320979
And another one, df2:
Order_ Loc
0 83158 239,211
1 83159 239,212
2 83160 239,213
3 83161 239,214
which I want to merge with the first so that the Loc column gets added to df1 with the correct values. To do the merge, I use map to perform a left-merge-style lookup, first casting the Loc values to string:
df2['Loc'] = df2['Loc'].astype(str)
df1['Loc']=df1.Order_.map(df2.Loc)
The result is odd in that the Loc values appearing in df1 are all NaN:
X Y Order_ NEW_ID Loc
0 484970.4517 408844.0920 95083 1320437 NaN
1 478512.3233 415791.5395 96478 1320727 NaN
2 504516.3032 452923.4420 105246 1321260 NaN
3 485147.0529 428172.1055 99633 1320979 NaN
whereas I expected them to be strings appearing in a 239,211 fashion (a string that includes a comma). When investigating the dtype of Loc in df2, I get:
Order_ int64
Loc object
dtype: object
My question: how can I perform a change of type from object to string, so that I can effectively read the Loc values and avoid them becoming NaN?
I think you need to cast Order_ to int, if necessary, so the dtypes match:
df1['Order_'] = df1['Order_'].astype(int)
But the more likely problem is that map aligns on df2's index (0, 1, 2, ...), not on the Order_ column, so Order_ has to be set as the index, here via a dict:
d = df2.set_index('Order_')['Loc'].to_dict()
df1['Loc']= df1.Order_.map(d)
Sample:
print (df1)
X Y Order_ NEW_ID
0 484970.4517 408844.0920 95083 1320437
1 478512.3233 415791.5395 96478 1320727
2 504516.3032 452923.4420 105246 1321260
3 485147.0529 428172.1055 99633 1320979
print (df2)
Order_ Loc
0 95083 239,211 <- first value changed so that one key aligns
1 83159 239,212
2 83160 239,213
3 83161 239,214
#check if same dtypes
print (df1['Order_'].dtypes)
int64
print (df2['Order_'].dtypes)
int64
d = df2.set_index('Order_')['Loc'].to_dict()
print (d)
{83160: '239,213', 83161: '239,214', 95083: '239,211', 83159: '239,212'}
df1['Loc']= df1.Order_.map(d)
print (df1)
X Y Order_ NEW_ID Loc
0 484970.4517 408844.0920 95083 1320437 239,211
1 478512.3233 415791.5395 96478 1320727 NaN
2 504516.3032 452923.4420 105246 1321260 NaN
3 485147.0529 428172.1055 99633 1320979 NaN
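An equivalent alternative to map (a standard approach, not part of the original answer) is a left merge on the Order_ column:
df1 = df1.merge(df2, on='Order_', how='left')
This adds the Loc column and leaves NaN where no Order_ key matches, just like the map above.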