Manipulate DataFrame Series - Python

I have a dataframe and I want to change some element of a column based on a condition.
In particular given this column:
... VALUE ....
0
"1076A"
12
9
"KKK0139"
5
I want to obtain this:
... VALUE ....
0
"1076A"
12
9
"0139"
5
The 'VALUE' column contains both strings and numbers. When I find a particular substring in a string value, I want to obtain the same value without that substring.
I have tried:
1) df['VALUE'] = np.where(df['VALUE'].str.contains('KKK', na=False), df['VALUE'].str[3:], df['VALUE'])
2) df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df['VALUE'].str[3:]
But both attempts return an IndexError: invalid index to scalar variable.
Any advice?

As the column contains both numeric (non-string) and string values, you cannot use .str.replace(), which handles strings only: the non-string elements would be converted to NaN. Use .replace() instead.
Here, you can use:
df['VALUE'] = df['VALUE'].replace(r'KKK', '', regex=True)
Input:
data = {'VALUE': [0, "1076A", 12, 9, "KKK0139", 5]}
df = pd.DataFrame(data)
Result:
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
If you use .str.replace() instead, you will get the following. Note the NaN results for the numeric (non-string) values:
0 NaN
1 1076A
2 NaN
3 NaN
4 0139
5 NaN
Name: VALUE, dtype: object
In general, if you want to remove leading alphabet substring, you can use:
df['VALUE'] = df['VALUE'].replace(r'^[A-Za-z]+', '', regex=True)
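A minimal runnable sketch of this general approach, using the sample data from the question:

```python
import pandas as pd

# Column of mixed numbers and strings, as in the question.
df = pd.DataFrame({'VALUE': [0, "1076A", 12, 9, "KKK0139", 5]})

# .replace() with regex=True applies the pattern only to string values,
# so the numeric entries (0, 12, 9, 5) pass through unchanged.
df['VALUE'] = df['VALUE'].replace(r'^[A-Za-z]+', '', regex=True)
print(df['VALUE'].tolist())  # [0, '1076A', 12, 9, '0139', 5]
```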

>>> df['VALUE'].replace(r'KKK', '', regex=True)
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object

Your second solution fails because you also need to apply the row selector to the right side of your assignment.
df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'].str[3:]
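A runnable sketch of this fix, storing the mask once so both sides of the assignment use the same row selector (data taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'VALUE': [0, "1076A", 12, 9, "KKK0139", 5]})

# na=False makes the non-string rows evaluate to False instead of NaN.
mask = df['VALUE'].str.contains('KKK', na=False)

# Applying the same mask on both sides keeps the sliced strings aligned
# with the rows being assigned.
df.loc[mask, 'VALUE'] = df.loc[mask, 'VALUE'].str[3:]
print(df['VALUE'].tolist())  # [0, '1076A', 12, 9, '0139', 5]
```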

Looking at your sample data, if 'K' is the only problem, just replace it with an empty string:
df['VALUE'].replace('K', '', regex=True)
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
If you want to do it only for specific occurrences or positions of 'K', you can do that as well.

Related

Convert type str (with number and words) column into int pandas

I have a column that contains type str of both numbers and words:
ex.
['2','3','Amy','199','Happy']
And I want to convert all "str number" into int and remove (the rows with) the "str words".
So my expected output would be a list like below:
[2, 3, 199]
Since I have a pandas dataframe, and this supposed to be one of the columns, it would be even better if it could be a Series as follows:
0 2.0
1 3.0
3 199.0
dtype: float64
As you mentioned you have a column (a series), so let's say it's called s:
s = pd.Series(['2', '3', 'Amy', '199', 'Happy'])
Then, after the assignment, simply apply pd.to_numeric with errors='coerce', and remove the resulting NaNs with dropna:
print(pd.to_numeric(s, errors='coerce').dropna())
Then the above code will output:
0 2.0
1 3.0
3 199.0
dtype: float64
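If an integer dtype is preferred over the float result shown above, the NaNs can be dropped first and the remainder cast back to int. A short sketch:

```python
import pandas as pd

s = pd.Series(['2', '3', 'Amy', '199', 'Happy'])

# coerce turns non-numeric strings into NaN; dropna removes them;
# astype(int) is safe afterwards because no NaN remains.
cleaned = pd.to_numeric(s, errors='coerce').dropna().astype(int)
print(cleaned.tolist())  # [2, 3, 199]
```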
Without using pandas, since you are supplying a plain list:
import re
data = ['2','3','Amy','199','Happy']
for item in data:
    print(*re.findall(r'\d+', item))
will give
2
3
199
and
import re
data = ['2','3','Amy','199','Happy']
out = []
for item in data:
    m = str(*re.findall(r'\d+', item))
    if m != "":
        out.append(int(m))
print (out)
will give
[2, 3, 199]
You can use isnumeric to filter out non-numeric items.
s = pd.Series(['2','3','Amy','199','Happy'])
print(s[s.str.isnumeric()].astype(int))
Output:
0 2
1 3
3 199
dtype: int64

return values from dataframe

I have python pandas data frame like this with 200k to 400k rows
Index value
1 a
2
3 v
4
5
6 6077
7
8 h
and I want the blank and numeric rows of the value column to be filled with the most recent single-letter string above them (in this table the string values all have length 1).
I want my dataframe to be like this.
Index value
1 a
2 a
3 v
4 v
5 v
6 v
7 v
8 h
If you need to repeat strings of length 1, you can use Series.str.match with the regex [a-zA-Z]{1} to check for single-letter strings, replace the non-matching values with NaN via Series.where, and finally forward-fill the missing values with ffill:
df['value'] = df['value'].where(df['value'].str.match('^[a-zA-Z]{1}$', na=False)).ffill()
print (df)
Index value
0 1 a
1 2 a
2 3 v
3 4 v
4 5 v
5 6 v
6 7 v
7 8 h
Another idea:
m1 = df['value'].str.len().eq(1)
m2 = df['value'].str.isalpha()
df['value'] = df['value'].where(m1 & m2).ffill()
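A runnable sketch of this two-mask idea, assuming the blanks in the question's data are empty strings:

```python
import pandas as pd

# Single letters to keep, empty strings and a number to overwrite.
df = pd.DataFrame({'value': ['a', '', 'v', '', '', '6077', '', 'h']})

m1 = df['value'].str.len().eq(1)   # exactly one character
m2 = df['value'].str.isalpha()     # alphabetic only

# Keep only single-letter values, then forward-fill the rest.
df['value'] = df['value'].where(m1 & m2).ffill()
print(df['value'].tolist())  # ['a', 'a', 'v', 'v', 'v', 'v', 'v', 'h']
```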
The forward-fill method is exactly for this, once the values to overwrite are NaN. This should work for you:
df.ffill()
try this,
import numpy as np
df['value'].replace(r'\d+', np.nan, regex=True).ffill()
0 a
1 a
2 v
3 v
4 v
5 v
6 v
7 h
Name: value, dtype: object
Once you have removed all numbers, do this:
df[df['value'] == ""] = np.nan
df.ffill()
Assuming that any value that is not an empty string or number should be forward filled, then the regular expression r'^\d*$' will match both an empty string or number. These values can be replaced by np.nan and then ffill can be called:
import numpy as np
df['value'] = df['value'].replace(r'^\d*$', np.nan, regex=True).ffill()
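A self-contained run of the regex approach, again assuming the blanks are empty strings (r'^\d*$' matches both an empty string and a pure number):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': ['a', '', 'v', '', '', '6077', '', 'h']})

# Empty strings and pure numbers both become NaN, then get forward-filled.
df['value'] = df['value'].replace(r'^\d*$', np.nan, regex=True).ffill()
print(df['value'].tolist())  # ['a', 'a', 'v', 'v', 'v', 'v', 'v', 'h']
```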

Getting ValueError: Need to specify at least one of 'labels', 'index' or 'columns' on passing a list of lables as 'columns' parameter of drop() method

I was passing an Index type variable (Pandas.Index) containing the labels of columns I want to drop from my DataFrame and it was working correctly. It was Index type because I was extracting the column names based on certain condition from the DataFrame itself.
Afterwards, I needed to add another column name to that list, so I converted the Index object to a Python list so I could append the additional label name. But on passing the list as columns parameter to the drop() method on the Dataframe, I now keep getting the error :
ValueError: Need to specify at least one of 'labels', 'index' or 'columns'
How to resolve this error?
The code I use is like this:
unique_count = df.apply(pd.Series.nunique)
redundant_columns = unique_count[unique_count == 1].index.values.tolist()
redundant_columns.append('DESCRIPTION')
print(redundant_columns)
df.drop(columns=redundant_columns, inplace=True)
Out: None
I found why the error is occurring. After the append() statement, redundant_columns is becoming None. I don't know why. I would love if someone can explain why this is happening?
For me, your solution works fine.
Another solution removes the columns by boolean indexing:
df = pd.DataFrame({'A': list('bbbbbb'),
                   'C': [7,8,9,4,2,3],
                   'D': [1,3,5,7,1,0],
                   'DESCRIPTION': list('aaabbb'),
                   'E': [5,3,6,9,2,4]})
print (df)
A C D DESCRIPTION E
0 b 7 1 a 5
1 b 8 3 a 3
2 b 9 5 a 6
3 b 4 7 b 9
4 b 2 1 b 2
5 b 3 0 b 4
mask = df.nunique().ne(1)
mask['DESCRIPTION'] = False
df = df.loc[:, mask]
print (df)
C D E
0 7 1 5
1 8 3 3
2 9 5 6
3 4 7 9
4 2 1 2
5 3 0 4
Explanation:
First get the number of unique values per column with nunique and compare with ne (not equal).
Then change the boolean mask: set column DESCRIPTION to False so it is always removed.
Finally, filter by boolean indexing.
Details:
print (df.nunique())
A 1
C 6
D 5
DESCRIPTION 2
E 6
dtype: int64
mask = df.nunique().ne(1)
print (mask)
A False
C True
D True
DESCRIPTION True
E True
mask['DESCRIPTION'] = False
print (mask)
A False
C True
D True
DESCRIPTION False
E True
dtype: bool
After trying around, this got fixed by using numpy.ndarray instead of a plain Python list, although I don't know why.
In my trials, a plain Python list gives the ValueError, while a pandas.Index or numpy.ndarray containing the labels works fine. So I went with np.ndarray, since it can be extended via np.append.
Current working code:
unique_count = df.apply(pd.Series.nunique)
redundant_columns: np.ndarray = unique_count[unique_count == 1].index.values
redundant_columns = np.append(redundant_columns, 'DESCRIPTION')
self.full_data.drop(columns=redundant_columns, inplace=True)
I had the same error when calling .remove on the same line as the initialization:
myNewList = [i for i in myOldList].remove('Last Item')
myNewList becomes NoneType, because these mutating methods return None. Calling .tolist() and .append() as separate statements might help you:
redundant_columns = unique_count[unique_count == 1].index.values.tolist()
redundant_columns.append('DESCRIPTION')
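The underlying pitfall can be shown in two lines: list.append (like list.remove) mutates the list in place and returns None, so assigning its result silently destroys the list.

```python
# Assigning the result of append: the variable becomes None.
cols = ['A', 'B']
result = cols.append('DESCRIPTION')
print(result)  # None
print(cols)    # ['A', 'B', 'DESCRIPTION'] - the list itself was mutated

# Correct pattern: call append as a statement, then use the list.
cols2 = ['A', 'B']
cols2.append('DESCRIPTION')
print(cols2)   # ['A', 'B', 'DESCRIPTION']
```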

How to replace an entire cell with NaN on pandas DataFrame

I want to replace the entire cell that contains the word as circled in the picture with blanks or NaN. However when I try to replace for example '1.25 Dividend' it turned out as '1.25 NaN'. I want to return the whole cell as 'NaN'. Any idea how to work on this?
Option 1
Use a regular expression in your replace
df.replace('^.*Dividend.*$', np.nan, regex=True)
From comments
(Using regex=True) means that it will interpret the problem as a regular expression one. You still need an appropriate pattern. The '^' says to start at the beginning of the string. '^.*' matches all characters from the beginning of the string. '$' says to end the match with the end of the string. '.*$' matches all characters up to the end of the string. Finally, '^.*Dividend.*$' matches all characters from the beginning, has 'Dividend' somewhere in the middle, then any characters after it. Then replace this whole thing with np.nan
Consider the dataframe df
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
0 1
0 1 2 Dividend
1 3 4
2 5 6 Dividend
then the proposed solution yields
0 1
0 1 NaN
1 3 4.0
2 5 NaN
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap, passing a lambda to applymap that identifies whether a cell contains 'Dividend'.
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
0 1
0 1 NaN
1 3 4
2 5 NaN
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
0 1
0 1 NaN
1 3 4
2 5 NaN
Replace all strings:
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
I would use applymap like this (note np.nan rather than the string 'NaN', so the cell becomes a real missing value):
df.applymap(lambda x: np.nan if (type(x) is str and 'Dividend' in x) else x)

Pandas: using `map` for left merge returns NaN

I have the following dataframe df1:
X Y Order_ NEW_ID
0 484970.4517 408844.0920 95083 1320437
1 478512.3233 415791.5395 96478 1320727
2 504516.3032 452923.4420 105246 1321260
3 485147.0529 428172.1055 99633 1320979
And another one, df2:
Order_ Loc
0 83158 239,211
1 83159 239,212
2 83160 239,213
3 83161 239,214
which I want to merge with the first so that the Loc column gets added with the correct values to df1. To do the merge, I use map to perform a left merge, first casting the Loc values as string:
df2['Loc'] = df2['Loc'].astype(str)
df1['Loc']=df1.Order_.map(df2.Loc)
The result is odd in that the Loc values appearing in df1 are of the NaN type:
X Y Order_ NEW_ID Loc
0 484970.4517 408844.0920 95083 1320437 NaN
1 478512.3233 415791.5395 96478 1320727 NaN
2 504516.3032 452923.4420 105246 1321260 NaN
3 485147.0529 428172.1055 99633 1320979 NaN
whereas I expected them to be string and to appear in a 239,211 fashion (string that includes a comma). When investigating the dtype of Loc in df2 I get:
Order_ int64
Loc object
dtype: object
My question: How can I perform a change of type from object to string, so that I am able to effectively read the Loc values, and avoid their becoming NaN?
I think you need to cast Order_ to int, if necessary, so the dtypes match:
df1['Order_'] = df1['Order_'].astype(int)
But maybe the problem is that map needs a Series or dict keyed by Order_, so Order_ has to be set as the index:
d = df2.set_index('Order_')['Loc'].to_dict()
df1['Loc']= df1.Order_.map(d)
Sample:
print (df1)
X Y Order_ NEW_ID
0 484970.4517 408844.0920 95083 1320437
1 478512.3233 415791.5395 96478 1320727
2 504516.3032 452923.4420 105246 1321260
3 485147.0529 428172.1055 99633 1320979
print (df2)
Order_ Loc
0 95083 239,211 <- first value was changed for alignment
1 83159 239,212
2 83160 239,213
3 83161 239,214
#check if same dtypes
print (df1['Order_'].dtypes)
int64
print (df2['Order_'].dtypes)
int64
d = df2.set_index('Order_')['Loc'].to_dict()
print (d)
{83160: '239,213', 83161: '239,214', 95083: '239,211', 83159: '239,212'}
df1['Loc']= df1.Order_.map(d)
print (df1)
X Y Order_ NEW_ID Loc
0 484970.4517 408844.0920 95083 1320437 239,211
1 478512.3233 415791.5395 96478 1320727 NaN
2 504516.3032 452923.4420 105246 1321260 NaN
3 485147.0529 428172.1055 99633 1320979 NaN
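An equivalent alternative sketch: a left merge on Order_ performs the same lookup without building an intermediate dict (shortened sample data shaped like the question's):

```python
import pandas as pd

df1 = pd.DataFrame({'Order_': [95083, 96478],
                    'NEW_ID': [1320437, 1320727]})
df2 = pd.DataFrame({'Order_': [95083, 83159],
                    'Loc': ['239,211', '239,212']})

# how='left' keeps every row of df1; unmatched Order_ values get NaN in Loc.
merged = df1.merge(df2, on='Order_', how='left')
print(merged['Loc'].tolist())
```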
