Delineate twice through a dataframe in pandas - python

I have a sparse pandas DataFrame/Series with values that look like variations of "AB1234:12, CD5678:34, EF3456:56". Something to the effect of
"AB1234:12, CD5678:34, EF3456:56"
"AB1234:12, CD5678:34"
NaN
"GH5678:34, EF3456:56"
"OH56:34"
Which I'd like to convert into
["AB1234","CD5678", "EF3456"]
["AB1234","CD5678"]
NaN
["GH5678","EF3456"]
["OH56"]
This kind of "double delineation" has been proving difficult. I know we can do A = df["columnName"].str.split(","), however I've run across a couple of problems, including that .split(", ") doesn't seem to work and .split(",") leaves whitespace. Also, iterating through the generated A and splitting seems to interpret my new lists as 'floats'. Although that last one might be a technical difficulty with ipython - I'm trying to work out that problem as well.
Is there a way to delineate on two types of separators - instead of just one? If not, how do you perform the loop to iterate over the inner list?
//Edit: changed the apostrophes to commas - that was just my dyslexia kicking in

You nearly had it; note that you can use a regular expression to split more generally:
In [11]: s2
Out[11]:
0 AB1234:12, CD5678:34, EF3456:56
1 AB1234:12, CD5678:34
2 NaN
3 GH5678:34, EF3456:56
4 OH56:34
dtype: object
In [12]: s2.str.split(", '")
Out[12]:
0 [AB1234:12, CD5678:34, EF3456:56]
1 [AB1234:12, CD5678:34]
2 NaN
3 [GH5678:34, EF3456:56]
4 [OH56:34]
dtype: object
In [13]: s2.str.split("\s*,\s*'")
Out[13]:
0 [AB1234:12, CD5678:34, EF3456:56]
1 [AB1234:12, CD5678:34]
2 NaN
3 [GH5678:34, EF3456:56]
4 [OH56:34]
dtype: object
The regex version also removes any spaces before or after a comma.
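To also drop the ":12"-style suffixes and get exactly the lists in the question, a minimal sketch (assuming every code you want to keep is immediately followed by a colon) is to use str.findall instead of str.split:
import pandas as pd

s2 = pd.Series(["AB1234:12, CD5678:34, EF3456:56",
                "AB1234:12, CD5678:34",
                None,
                "GH5678:34, EF3456:56",
                "OH56:34"])

# capture each run of word characters that is followed by a colon;
# NaN entries pass through untouched
s2.str.findall(r'(\w+):')
# 0    [AB1234, CD5678, EF3456]
# 1            [AB1234, CD5678]
# 2                         NaN
# 3            [GH5678, EF3456]
# 4                      [OH56]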

Here is your DataFrame
>>> df
A
0 AB1234:12, CD5678:34, EF3456:56
1 AB1234:12, CD5678:34
2 None
3 GH5678:34, EF3456:56
4 OH56:34
And now I use replace and split to remove every ':' and split on ', ':
>>> df.A = [i.replace(':','').split(", ") if isinstance(i,str) else i for i in df.A]
>>> df.A
0 [AB123412, CD567834, EF345656]
1 [AB123412, CD567834]
2 None
3 [GH567834, EF345656]
4 [OH5634]
Name: A
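The same transformation can also be done with vectorized string methods, which keeps the NaN handling implicit; a sketch, assuming the column is named A as above:
# remove ':' and split on ', ' in one vectorized chain; NaN stays NaN
df['A'] = df['A'].str.replace(':', '', regex=False).str.split(', ')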

Related

Removing portions of string in Pandas: not working + errors

I have a pandas DataFrame named full_list with a string column named domains. Part of a snippet is shown here:
domains
0 naturalhealth365.com
1 truththeory.com
2 themillenniumreport.com
3 https://www.cernovich.com
4 https://www.christianpost.com
5 http://evolutionnews.org
6 http://www.greenmedinfo.com
7 http://www.magapill.com8
8 https://needtoknow.news
I need to remove the https:// OR http:// from the website names.
I checked multiple pandas posts on SO dealing with vaguely similar issues and I have tried all of these methods:
full_list['domains'] = full_list['domains'].apply(lambda x: x.lstrip('http://'))
but that erroneously removes the letters t, h and p as well, i.e. "truththeory.com" (index 1) becomes "uththeory.com"
full_list['domains'] = full_list['domains'].replace(('http://', '')) and this makes no changes to the strings at all - the values in domains are the same before and after the line runs
full_list['domains'] = full_list['domains'].str.replace(('http://', '')) gives the error replace() missing 1 required positional argument: 'repl'
full_list['domains'] = full_list['domains'].str.split('//', n=1).str.get(1) makes the first 3 rows (index 0, 1, 2) NaN
For the life of me, I am unable to see what it is that I am doing wrong. Any help is appreciated.
Use Series.str.replace with a regex: ^ anchors the start of the string and [s]* makes the s optional:
df['domains'] = df['domains'].str.replace(r'^http[s]*://', '', regex=True)
print (df)
domains
0 naturalhealth365.com
1 truththeory.com
2 themillenniumreport.com
3 www.cernovich.com
4 www.christianpost.com
5 evolutionnews.org
6 www.greenmedinfo.com
7 www.magapill.com8
8 needtoknow.news
Try str.replace with regex like the following:
>>> df['domains'].str.replace('http(s|)://', '')
0 naturalhealth365.com
1 truththeory.com
2 themillenniumreport.com
3 www.cernovich.com
4 www.christianpost.com
5 evolutionnews.org
6 www.greenmedinfo.com
7 www.magapill.com8
8 needtoknow.news
Name: domains, dtype: object
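Note that in pandas 2.0+, Series.str.replace defaults to regex=False, so the pattern above would be treated as a literal string there. A version that works on both old and new pandas passes regex=True explicitly (using s? rather than the (s|) group):
df['domains'] = df['domains'].str.replace(r'https?://', '', regex=True)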

extracting numerical information from strings in a dataframe column

I've seen this done in Excel but I'd like to split the SOP and number into different columns. It gets a little tricky since the formatting is different at times.
0 SOP-015641
1 SOP-007809
2 SOP018262
3 SOP-007802
4 SOP-007804
5 SOP-007807
Use the .str.extract() method:
In [8]: df[['a','b']] = df.pop('col').str.extract(r'(\D+)(\d+)', expand=True)
In [9]: df
Out[9]:
a b
0 SOP- 015641
1 SOP- 007809
2 SOP 018262
3 SOP- 007802
4 SOP- 007804
5 SOP- 007807
RegEx explained
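If you'd rather the first column not carry the inconsistent trailing hyphen (SOP- vs SOP), a variant of the same extract (a sketch, assuming the prefix is always letters, optionally followed by a hyphen, and reusing the hypothetical column name 'col' from above) is:
# letters only in group 1, skip an optional hyphen, digits in group 2
df[['a', 'b']] = df['col'].str.extract(r'([A-Za-z]+)-?(\d+)', expand=True)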

Deleting zeros from string column in pandas dataframe

I have a column in my dataframe, where the values are something like this:
col1:
00000000000012VG
00000000000014SG
00000000000014VG
00000000000010SG
20000000000933LG
20000000000951LG
20000000000957LG
20000000000963LG
20000000000909LG
20000000000992LG
I want to delete all zeros:
a) that are in front of other numbers and letters (for example, in the case of 00000000000010SG I want to delete this part 000000000000 and keep 10SG).
b) In cases like 20000000000992LG I want to delete this part 0000000000 and unite 2 with 992LG.
str.strip('0') solves only part a), as I checked.
But what is the right solution for both cases?
I would recommend something similar to Ed's answer, but using regex to ensure that not all 0s are replaced, and to eliminate the need to hardcode the number of 0s.
In [2426]: df.col1.str.replace(r'[0]{2,}', '', 1)
Out[2426]:
0 12VG
1 14SG
2 14VG
3 10SG
4 2933LG
5 2951LG
6 2957LG
7 2963LG
8 2909LG
9 2992LG
Name: col1, dtype: object
Only the first string of 0s is replaced.
Thanks to @jezrael for pointing out a small bug in my answer.
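On newer pandas versions it is safer to pass the count and the regex flag as keywords; an equivalent sketch (0{2,} matches the same runs as [0]{2,}):
# replace only the first run of two or more zeros, as a regex
df['col1'] = df['col1'].str.replace(r'0{2,}', '', n=1, regex=True)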
You can just do
In[9]:
df['col1'] = df['col1'].str.replace('000000000000','')
df['col1'] = df['col1'].str.replace('0000000000','')
df
Out[9]:
col1
0 12VG
1 14SG
2 14VG
3 10SG
4 2933LG
5 2951LG
6 2957LG
7 2963LG
8 2909LG
9 2992LG
This replaces a fixed number of 0s with an empty string. It isn't dynamic, but for your given dataset it is the simplest thing to do, unless you can better explain the pattern.

How to replace an entire cell with NaN on pandas DataFrame

I want to replace the entire cell that contains the word as circled in the picture with blanks or NaN. However, when I try to replace, for example, '1.25 Dividend', it turns out as '1.25 NaN'. I want to return the whole cell as 'NaN'. Any idea how to work on this?
Option 1
Use a regular expression in your replace
df.replace('^.*Dividend.*$', np.nan, regex=True)
From comments
Using regex=True means the pattern is interpreted as a regular expression, so you still need an appropriate pattern. The '^' says to start at the beginning of the string, so '^.*' matches all characters from the beginning of the string. '$' says to end the match with the end of the string, so '.*$' matches all characters up to the end of the string. Finally, '^.*Dividend.*$' matches all characters from the beginning, has 'Dividend' somewhere in the middle, then any characters after it. Then replace this whole thing with np.nan.
Consider the dataframe df
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
0 1
0 1 2 Dividend
1 3 4
2 5 6 Dividend
then the proposed solution yields
0 1
0 1 NaN
1 3 4.0
2 5 NaN
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap.
Pass a lambda to applymap that identifies whether a cell has 'Dividend' in it:
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
0 1
0 1 NaN
1 3 4
2 5 NaN
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
0 1
0 1 NaN
1 3 4
2 5 NaN
Replace all strings (note this coerces every non-numeric cell to NaN, not just the ones containing 'Dividend'):
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
I would use applymap like this
df.applymap(lambda x: 'NaN' if (type(x) is str and 'Dividend' in x) else x)
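Note that 'NaN' in the lambda above is the literal string, not a missing value. If you want a real NaN, a variant (assuming numpy is imported as np) is:
import numpy as np

# replace matching cells with an actual missing value instead of the string 'NaN'
df.applymap(lambda x: np.nan if (isinstance(x, str) and 'Dividend' in x) else x)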

Creating a new column in Pandas by selecting part of string in other column

I have a lot of experience programming in Matlab; now I'm using Python and I just can't get this thing to work... I have a dataframe containing a column with timecodes like 00:00:00.033.
timecodes = ['00:00:01.001', '00:00:03.201', '00:00:09.231', '00:00:11.301', '00:00:20.601', '00:00:31.231', '00:00:90.441', '00:00:91.301']
df = pd.DataFrame(timecodes, columns=['TimeCodes'])
All my inputs are 90 seconds or less, so I want to create a column with just the seconds as float. To do this, I need to select position 6 to end and make that into a float, which I can do for the first row like:
float(df['TimeCodes'][0][6:])
This works just fine, but if I now want to create a whole new column 'Time_sec', the following does not work:
df['Time_sec'] = float(df['TimeCodes'][:][6:])
Because df['TimeCodes'][:][6:] takes row 6 to last row, while I want WITHIN each row the 6th till last position. Also this does not work:
df['Time_sec'] = float(df['TimeCodes'][:,6:])
Do I need to make a loop? There must be a better way... And why does df['TimeCodes'][:][6:] not work?
You can use the slice string method and then cast the whole thing to a float:
In [13]: df["TimeCodes"].str.slice(6).astype(float)
Out[13]:
0 1.001
1 3.201
2 9.231
3 11.301
4 20.601
5 31.231
6 90.441
7 91.301
Name: TimeCodes, dtype: float64
As to why df['TimeCodes'][:][6:] doesn't work, what this ends up doing is chaining some selections. First you grab the pd.Series associated with the TimeCodes column, then you select all of the items from the Series with [:], and then you just select the items with index 6 or higher with [6:].
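Spelled out step by step, the chain does this (the intermediate names are just for illustration):
col = df['TimeCodes']   # the whole column as a Series
rows = col[:]           # [:] selects all rows, changing nothing
rows[6:]                # [6:] selects rows from position 6 on, not characters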
Solution: indexing with str and casting to float with astype:
print (df["TimeCodes"].str[6:])
0 01.001
1 03.201
2 09.231
3 11.301
4 20.601
5 31.231
6 90.441
7 91.301
Name: TimeCodes, dtype: object
df['new'] = df["TimeCodes"].str[6:].astype(float)
print (df)
TimeCodes new
0 00:00:01.001 1.001
1 00:00:03.201 3.201
2 00:00:09.231 9.231
3 00:00:11.301 11.301
4 00:00:20.601 20.601
5 00:00:31.231 31.231
6 00:00:90.441 90.441
7 00:00:91.301 91.301
