Removing portions of string in Pandas: not working + errors

I have a pandas DataFrame named full_list with a string column named domains. A snippet is shown here:
domains
0 naturalhealth365.com
1 truththeory.com
2 themillenniumreport.com
3 https://www.cernovich.com
4 https://www.christianpost.com
5 http://evolutionnews.org
6 http://www.greenmedinfo.com
7 http://www.magapill.com8
8 https://needtoknow.news
I need to remove the https:// OR http:// from the website names.
I checked multiple pandas posts on SO dealing with vaguely similar issues, and I have tried all of these methods:
full_list['domains'] = full_list['domains'].apply(lambda x: x.lstrip('http://'))
but that erroneously strips leading characters from the set {h, t, p, :, /} as well, e.g. "truththeory.com" (index 1) becomes "ruththeory.com"
full_list['domains'] = full_list['domains'].replace(('http://', '')) makes no changes to the strings AT ALL: before and after the line runs, the values in domains stay the same
full_list['domains'] = full_list['domains'].str.replace(('http://', '')) gives the error replace() missing 1 required positional argument: 'repl'
full_list['domains'] = full_list['domains'].str.split('//', n=1).str.get(1) makes the first 3 rows (index 0, 1, 2) NaN
For the life of me, I am unable to see what it is that I am doing wrong. Any help is appreciated.
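For reference, a minimal sketch of why the lstrip attempt misbehaves: str.lstrip treats its argument as a set of characters to strip, not as a literal prefix, and Series.str.removeprefix (available in pandas 1.4+) is a prefix-safe alternative:
# lstrip removes ANY leading characters found in the set {'h','t','p',':','/'},
# which is why the leading 't' of 'truththeory.com' disappears too:
'truththeory.com'.lstrip('http://')   # -> 'ruththeory.com'
# Prefix-safe alternative: strip the literal prefixes only
full_list['domains'] = (full_list['domains']
                        .str.removeprefix('https://')
                        .str.removeprefix('http://'))
The NaNs from the split attempt have a similarly mundane cause: rows without '//' split into a single-element list, so .str.get(1) finds nothing at position 1.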

Use Series.str.replace with a regex: ^ anchors the match at the start of the string and s? makes the s optional:
df['domains'] = df['domains'].str.replace(r'^https?://', '', regex=True)
print (df)
domains
0 naturalhealth365.com
1 truththeory.com
2 themillenniumreport.com
3 www.cernovich.com
4 www.christianpost.com
5 evolutionnews.org
6 www.greenmedinfo.com
7 www.magapill.com8
8 needtoknow.news
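For completeness, a non-regex sketch that uses the standard library's urllib.parse to detect the scheme, so it also copes with schemes other than http(s):
from urllib.parse import urlparse

def strip_scheme(url):
    # Only strip when urlparse actually detects a scheme, so bare
    # domains like 'truththeory.com' come back unchanged.
    if urlparse(url).scheme and '://' in url:
        return url.split('://', 1)[1]
    return url

df['domains'] = df['domains'].apply(strip_scheme)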

Try str.replace with a regex like the following (in pandas 2.0+ you must pass regex=True explicitly, since the default changed to literal matching):
>>> df['domains'].str.replace(r'http(s|)://', '', regex=True)
0 naturalhealth365.com
1 truththeory.com
2 themillenniumreport.com
3 www.cernovich.com
4 www.christianpost.com
5 evolutionnews.org
6 www.greenmedinfo.com
7 www.magapill.com8
8 needtoknow.news
Name: domains, dtype: object


Change multiple column names in pandas dataframe (not all column names) at the same time using index numbers

I have successfully changed a single column name in the dataframe using this:
df.columns=['new_name' if x=='old_name' else x for x in df.columns]
However, I have lots of columns to update (but not all 240 of them) and I don't want to have to write it out for each single change if I can help it.
I have tried to follow the advice from @StefanK in this thread:
Changing multiple column names but not all of them - Pandas Python
my code:
df.columns=[[4,18,181,182,187,188,189,190,203,204]]=['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
but i am getting an error message:
File "<ipython-input-17-2808488b712d>", line 3
df.columns=[[4,18,181,182,187,188,189,190,203,204]]=['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
^
SyntaxError: can't assign to literal
So, having googled the error and read many more SO questions here, it looks to me like it is trying to read the numbers as integers instead of an index? I'm not certain here though.
So how do I fix it so it looks at the numbers as the index? The column names I am replacing are at least 10 words long each, so I'm keen not to have to type them all out! My only idea is to use iloc somehow, but I'm going into new territory here!
I'd really appreciate some help.
Remove the '=' after df.columns in your code and use this instead:
df.columns.values[[4,18,181,182,187,188,189,190,203,204]]=['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
Because an Index does not support mutable operations, convert it to a NumPy array, reassign, and set it back:
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4,5,4,5,5,4],
    'C': [7,8,9,4,2,3],
    'D': [1,3,5,7,1,0],
    'E': [5,3,6,9,2,4],
    'F': list('aaabbb')
})
arr = df.columns.to_numpy()
arr[[0,2,3]] = list('RTG')
df.columns = arr
print (df)
R B T G E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
So with your data use:
idx = [4,18,181,182,187,188,189,190,203,204]
names = ['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
arr = df.columns.to_numpy()
arr[idx] = names
df.columns = arr
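A sketch of an alternative that avoids mutating anything in place: build an old-name -> new-name mapping from the positions and pass it to rename (this assumes the labels at those positions are unique):
idx = [4,18,181,182,187,188,189,190,203,204]
names = ['Brand','Reason','Chat_helpful','Chat_expertise','Answered_questions','Recommend_chat','Alternate_help','Customer_comments','Agent_category','Agent_outcome']
# Look up the current labels at those positions and map each one to its new name
mapping = dict(zip(df.columns[idx], names))
df = df.rename(columns=mapping)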

How to filter pd.Dataframe based on strings and special characters?

Here is what I have:
import re
import pandas as pd
d = {'ID': [1, 2, 3, 4, 5], 'Desc': ['0*1***HHCM', 'HC:83*20', 'HC:5*2CASL', 'DM*72\nCAS*', 'HC:564*CAS*5']}
df = pd.DataFrame(data=d)
df
Output:
ID Desc
0 1 0*1***HHCM
1 2 HC:83*20
2 3 HC:5*2CASL
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
I need to filter the dataframe by column "Desc", if it contains "CAS" or "HC" that are not surrounded by letters or digits.
Here is what I tried:
new_df = df[df['Desc'].str.match(r'[^A-Za-z0-9]CAS[^A-Za-z0-9]|[^A-Za-z0-9]HC[^A-Za-z0-9]') == True]
It returns an empty dataframe.
I want it to return the following:
ID Desc
1 2 HC:83*20
2 3 HC:5*2CASL
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
Another thing: since row 3 has "\nCAS", where "\n" is a line separator, will it be treated as a letter before "CAS"?
Please help.
Your pattern returns nothing because str.match only matches at the start of the string, and your pattern requires a non-alphanumeric character before "CAS"/"HC". Try this instead:
df.loc[df['Desc'].str.contains(r'(\W|^)(HC|CAS)(\W|$)', flags=re.M)]
# If you don't want to import re you can just use flags=8:
df.loc[df['Desc'].str.contains(r'(\W|^)(HC|CAS)(\W|$)', flags=8)]
Result:
ID Desc
1 2 HC:83*20
2 3 HC:5*2CASL
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
To answer your other question: as long as the \n is passed in correctly, it will be parsed as a newline character rather than the alphanumeric character n. That is:
r'\n' -> '\\n' (a backslash character + an n character)
'\n' -> a single newline character
For further explanation on the regex, please see Regex101 demo: https://regex101.com/r/FNBgPV/2
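As an aside, str.contains emits a UserWarning when the pattern contains capturing groups. A sketch of an equivalent filter using \b word boundaries and a non-capturing group, which gives the same result on this data:
# \b requires a word/non-word boundary, so HC/CAS must not touch a
# letter, digit or underscore on either side:
df.loc[df['Desc'].str.contains(r'\b(?:HC|CAS)\b')]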
You can also try this; it only checks for letters and digits before CAS and HC, but you can easily extend it to check after as well (see the sketch below the output):
print(df[~df['Desc'].str.contains(r'[0-9a-zA-Z](?:CAS|HC)', regex=True)])
ID Desc
1 2 HC:83*20
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
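To extend this to check the characters after CAS and HC as well, one possible sketch rejects a match when a letter or digit touches either side; on this data it keeps the same three rows:
# Exclude rows where CAS/HC is immediately preceded OR followed by a letter/digit
mask = df['Desc'].str.contains(r'[0-9a-zA-Z](?:CAS|HC)|(?:CAS|HC)[0-9a-zA-Z]', regex=True)
print(df[~mask])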

How to replace an entire cell with NaN on pandas DataFrame

I want to replace every cell that contains the word circled in the picture with a blank or NaN. However, when I try to replace, for example, '1.25 Dividend', it turns out as '1.25 NaN'. I want the whole cell to become NaN. Any idea how to make this work?
Option 1
Use a regular expression in your replace:
import numpy as np

df.replace('^.*Dividend.*$', np.nan, regex=True)
From comments
Passing regex=True means the pattern is interpreted as a regular expression. You still need an appropriate pattern: '^' anchors the match at the beginning of the string, so '^.*' matches all characters from the beginning; '$' anchors at the end, so '.*$' matches all characters up to the end. Altogether, '^.*Dividend.*$' matches an entire string that has 'Dividend' somewhere in the middle, and the whole thing is replaced with np.nan.
Consider the dataframe df
import pandas as pd

df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
0 1
0 1 2 Dividend
1 3 4
2 5 6 Dividend
then the proposed solution yields
0 1
0 1 NaN
1 3 4.0
2 5 NaN
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap. Pass a lambda to applymap that identifies whether a cell contains 'Dividend':
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
0 1
0 1 NaN
1 3 4
2 5 NaN
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
0 1
0 1 NaN
1 3 4
2 5 NaN
Replace all strings (note: this coerces every non-numeric cell to NaN, not just those containing 'Dividend'):
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
I would use applymap like this, with np.nan rather than the string 'NaN' so the cell is truly missing:
df.applymap(lambda x: np.nan if isinstance(x, str) and 'Dividend' in x else x)
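Note that applymap was deprecated in pandas 2.1 in favour of DataFrame.map; as far as I know the call is otherwise identical:
import numpy as np
# pandas >= 2.1 spelling of the same idea
df.map(lambda x: np.nan if isinstance(x, str) and 'Dividend' in x else x)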

Python parse dataframe element

I have a pandas dataframe column (Data Type) which I want to split into three columns
target_table_df = LoadS_A [['Attribute Name',
'Data Type',
'Primary Key Indicator']]
Example input (target_table_df)
Attribute Name Data Type Primary Key Indicator
0 ACC_LIM DECIMAL(18,4) False
1 ACC_NO NUMBER(11,0) False
2 ACC_OPEN_DT DATE False
3 ACCB DECIMAL(18,4) False
4 ACDB DECIMAL(18,4) False
5 AGRMNT_ID NUMBER(11,0) True
6 BRNCH_NUM NUMBER(11,0) False
7 CLRD_BAL DECIMAL(18,4) False
8 CR_INT_ACRD_GRSS DECIMAL(18,4) False
9 CR_INT_ACRD_NET DECIMAL(18,4) False
I aim to:
Reassign 'Data Type' to the text preceding the parenthesis
[..if a parenthesis exists in 'Data Type']:
Create a new column 'Precision' and assign it the first comma-separated value
Create a new column 'Scale' and assign it the second comma-separated value
Intended output would therefore become:
Data Type Precision Scale
0 decimal 18 4
1 number 11 0
2 date
3 decimal 18 4
4 decimal 18 4
5 number 11 0
I have tried in anger to achieve this, but I'm new to dataframes and can't work out whether I should iterate over all rows or whether there is a way to apply this to all values in the dataframe.
Any help much appreciated
Use target_table_df['Data Type'].str.extract(pattern)
You'll need to assign pattern to be a regular expression that captures each of the components you're looking for.
pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
([^\(]+) says to grab as many non-open-parenthesis characters as you can, up to the first open parenthesis.
\(([^,]*), says to grab the first set of non-comma characters after an open parenthesis and stop at the comma.
,(.*)\) says to grab the rest of the characters between the comma and the close parenthesis.
(\(([^,]*),(.*)\))? says the whole parenthesis thing may not even happen, grab it if you can.
Solution
Everything together looks like this:
pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
df = target_table_df['Data Type'].str.extract(pattern, expand=True).iloc[:, [0, 2, 3]]
# Formatting to get it how you wanted
df.columns = ['Data Type', 'Precision', 'Scale']
df.index.name = None
print(df)
I put the .iloc[:, [0, 2, 3]] at the end because the pattern I used captures the whole parenthesized part as column 1 and I wanted to skip it. Leave it off and see.
Data Type Precision Scale
0 decimal 18 4
1 number 11 0
2 date NaN NaN
3 decimal 18 4
4 decimal 18 4
5 number 11 0
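A variant sketch using named capture groups, which labels the result columns directly and makes the .iloc step unnecessary (the outer optional group becomes non-capturing):
pattern = r'(?P<DataType>[^(]+)(?:\((?P<Precision>[^,]*),(?P<Scale>.*)\))?'
out = target_table_df['Data Type'].str.extract(pattern, expand=True)
out = out.rename(columns={'DataType': 'Data Type'})  # group names cannot contain spaces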

Delineate twice through a dataframe in pandas

I have a sparse pandas DataFrame/Series with values that look like variations of "AB1234:12, CD5678:34, EF3456:56". Something to the effect of
"AB1234:12, CD5678:34, EF3456:56"
"AB1234:12, CD5678:34"
NaN
"GH5678:34, EF3456:56"
"OH56:34"
Which I'd like to convert into
["AB1234","CD5678", "EF3456"]
["AB1234","CD5678"]
NaN
["GH5678","EF3456"]
["OH56"]
This kind of "double delineation" has been proving difficult. I know we can do A = df["columnName"].str.split(","), but I've run across a couple of problems, including that .split(", ") doesn't seem to work and .split(",") leaves whitespace. Also, iterating through the generated A and splitting seems to interpret my new lists as floats, although that last one might be a technical difficulty with ipython; I'm trying to work out that problem as well.
Is there a way to delineate on two types of separators - instead of just one? If not, how do you perform the loop to iterate over the inner list?
//Edit: changed the apostrophes to commas - that was just my dyslexia kicking in
You nearly had it, note you can use a regular expression to split more generally:
In [11]: s2
Out[11]:
0 AB1234:12, CD5678:34, EF3456:56
1 AB1234:12, CD5678:34
2 NaN
3 GH5678:34, EF3456:56
4 OH56:34
dtype: object
In [12]: s2.str.split(", ")
Out[12]:
0 [AB1234:12, CD5678:34, EF3456:56]
1 [AB1234:12, CD5678:34]
2 NaN
3 [GH5678:34, EF3456:56]
4 [OH56:34]
dtype: object
In [13]: s2.str.split(r"\s*,\s*")
Out[13]:
0 [AB1234:12, CD5678:34, EF3456:56]
1 [AB1234:12, CD5678:34]
2 NaN
3 [GH5678:34, EF3456:56]
4 [OH56:34]
dtype: object
Where this removes any spaces before or after a comma.
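Note that the desired output in the question also drops the ':12'-style suffixes. A sketch building on the split above, assuming each piece contains at most one colon:
# Split on comma (trimming surrounding whitespace), then cut each piece at its
# colon; NaN rows pass through untouched because they are floats, not lists.
s2.str.split(r'\s*,\s*').apply(
    lambda parts: [p.split(':')[0] for p in parts] if isinstance(parts, list) else parts
)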
Here is your DataFrame
>>> df
A
0 AB1234:12, CD5678:34, EF3456:56
1 AB1234:12, CD5678:34
2 None
3 GH5678:34, EF3456:56
4 OH56:34
And now I use split and replace to split by ', ' and remove all ':'
>>> df.A = [i.replace(':','').split(", ") if isinstance(i,str) else i for i in df.A]
>>> df.A
0 [AB123412, CD567834, EF345656]
1 [AB123412, CD567834]
2 None
3 [GH567834, EF345656]
4 [OH5634]
Name: A
