I have a sparse pandas DataFrame/Series with values that look like variations of "AB1234:12, CD5678:34, EF3456:56". Something to the effect of
"AB1234:12, CD5678:34, EF3456:56"
"AB1234:12, CD5678:34"
NaN
"GH5678:34, EF3456:56"
"OH56:34"
Which I'd like to convert into
["AB1234","CD5678", "EF3456"]
["AB1234","CD5678"]
NaN
["GH5678","EF3456"]
["OH56"]
This kind of "double delineation" has been proving difficult. I know we can do A = df["columnName"].str.split(","), but I've run across a couple of problems, including that .split(", ") doesn't seem to work and .split(",") leaves whitespace. Also, iterating through the generated A and splitting again seems to interpret my new lists as floats, although that last one might be a technical difficulty with IPython - I'm still trying to work that out.
Is there a way to split on two types of separator instead of just one? If not, how do you write the loop that iterates over the inner lists?
// Edit: changed the apostrophes to commas - that was just my dyslexia kicking in
You nearly had it; note that you can use a regular expression to split more generally:
In [11]: s2
Out[11]:
0 AB1234:12, CD5678:34, EF3456:56
1 AB1234:12, CD5678:34
2 NaN
3 GH5678:34, EF3456:56
4 OH56:34
dtype: object
In [12]: s2.str.split(", '")
Out[12]:
0 [AB1234:12, CD5678:34, EF3456:56]
1 [AB1234:12, CD5678:34]
2 NaN
3 [GH5678:34, EF3456:56]
4 [OH56:34]
dtype: object
In [13]: s2.str.split(r"\s*,\s*")
Out[13]:
0 [AB1234:12, CD5678:34, EF3456:56]
1 [AB1234:12, CD5678:34]
2 NaN
3 [GH5678:34, EF3456:56]
4 [OH56:34]
dtype: object
The regex version also removes any spaces before or after each comma.
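The split alone still leaves the ":NN" suffixes in each list. To land on the exact lists in the question, one hedged sketch extracts the codes directly with str.findall, assuming every code is uppercase letters followed by digits:
import numpy as np
import pandas as pd

s2 = pd.Series(["AB1234:12, CD5678:34, EF3456:56",
                "AB1234:12, CD5678:34",
                np.nan,
                "GH5678:34, EF3456:56",
                "OH56:34"])
# findall keeps each captured code and drops the ":NN" suffix in one pass;
# NaN rows stay NaN because .str methods propagate missing values
codes = s2.str.findall(r"([A-Z]+\d+):\d+")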
Here is your DataFrame
>>> df
A
0 AB1234:12, CD5678:34, EF3456:56
1 AB1234:12, CD5678:34
2 None
3 GH5678:34, EF3456:56
4 OH56:34
And now I use replace and split to remove every ':' and split on ', ':
>>> df.A = [i.replace(':','').split(", ") if isinstance(i,str) else i for i in df.A]
>>> df.A
0 [AB123412, CD567834, EF345656]
1 [AB123412, CD567834]
2 None
3 [GH567834, EF345656]
4 [OH5634]
Name: A
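If you want the question's exact output (dropping the digits after ':' rather than merging them in), a small variant of the same comprehension keeps only the token before ':'. A minimal sketch, rebuilding the same DataFrame:
import pandas as pd

df = pd.DataFrame({'A': ['AB1234:12, CD5678:34, EF3456:56',
                         'AB1234:12, CD5678:34',
                         None,
                         'GH5678:34, EF3456:56',
                         'OH56:34']})
# keep only the part before ':' in each item; leave non-strings (None) alone
df.A = [[p.split(':')[0] for p in i.split(', ')] if isinstance(i, str) else i
        for i in df.A]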
Related
I have a pandas DataFrame named full_list with a string-variable column named domains. Part of a snip shown here
domains
0 naturalhealth365.com
1 truththeory.com
2 themillenniumreport.com
3 https://www.cernovich.com
4 https://www.christianpost.com
5 http://evolutionnews.org
6 http://www.greenmedinfo.com
7 http://www.magapill.com8
8 https://needtoknow.news
I need to remove the https:// OR http:// from the website names.
I checked multiple pandas posts on SO dealing with vaguely similar issues, and I have tried all of these methods:
full_list['domains'] = full_list['domains'].apply(lambda x: x.lstrip('http://'))
but that erroneously removes the letters t, h and p as well (lstrip treats its argument as a set of characters, not a prefix), i.e. "truththeory.com" (index 1) becomes "uththeory.com"
full_list['domains'] = full_list['domains'].replace(('http://', '')) and this makes no change to the strings at all - the values in domains are the same before and after the line runs
full_list['domains'] = full_list['domains'].str.replace(('http://', '')) gives the error replace() missing 1 required positional argument: 'repl'
full_list['domains'] = full_list['domains'].str.split('//', n=1).str.get(1) makes the first 3 rows (index 0, 1, 2) nan
For the life of me, I am unable to see what it is that I am doing wrong. Any help is appreciated.
Use Series.str.replace with a regex: ^ anchors the match at the start of the string and s? makes the s optional:
df['domains'] = df['domains'].str.replace(r'^https?://', '', regex=True)
print (df)
domains
0 naturalhealth365.com
1 truththeory.com
2 themillenniumreport.com
3 www.cernovich.com
4 www.christianpost.com
5 evolutionnews.org
6 www.greenmedinfo.com
7 www.magapill.com8
8 needtoknow.news
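If you'd rather not hand-roll the regex, a hedged alternative sketch uses the standard library's urlparse; it assumes a host only lands in netloc when a scheme is present, so bare domains fall back to the original string:
from urllib.parse import urlparse

import pandas as pd

df = pd.DataFrame({'domains': ['naturalhealth365.com',
                               'https://www.cernovich.com',
                               'http://evolutionnews.org']})
# urlparse only fills .netloc when a scheme like http:// is present
df['domains'] = df['domains'].map(lambda d: urlparse(d).netloc or d)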
Try str.replace with a regex like the following (regex=True is needed in newer pandas, where the default is a literal match):
>>> df['domains'].str.replace(r'https?://', '', regex=True)
0 naturalhealth365.com
1 truththeory.com
2 themillenniumreport.com
3 www.cernovich.com
4 www.christianpost.com
5 evolutionnews.org
6 www.greenmedinfo.com
7 www.magapill.com8
8 needtoknow.news
Name: domains, dtype: object
I've seen this done in Excel, but I'd like to split the SOP and number into different columns. It gets a little tricky since the formatting differs at times.
0 SOP-015641
1 SOP-007809
2 SOP018262
3 SOP-007802
4 SOP-007804
5 SOP-007807
Use the .str.extract() method:
In [8]: df[['a','b']] = df.pop('col').str.extract(r'(\D+)(\d+)', expand=True)
In [9]: df
Out[9]:
a b
0 SOP- 015641
1 SOP- 007809
2 SOP 018262
3 SOP- 007802
4 SOP- 007804
5 SOP- 007807
RegEx explained: \D+ captures one or more non-digit characters (here 'SOP' or 'SOP-'), and \d+ captures the digits that follow.
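If you also want the hyphen kept out of the first column, a hedged variant puts the optional '-' outside both capture groups; a minimal runnable sketch with a made-up frame:
import pandas as pd

df = pd.DataFrame({'col': ['SOP-015641', 'SOP018262']})
# '-?' is consumed when present but captured by neither group
df[['a', 'b']] = df.pop('col').str.extract(r'([A-Za-z]+)-?(\d+)', expand=True)
print(df)  # 'a' is 'SOP' in both rows; 'b' keeps the zero-padded digits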
I have a column in my dataframe,where the values are something like this:
col1:
00000000000012VG
00000000000014SG
00000000000014VG
00000000000010SG
20000000000933LG
20000000000951LG
20000000000957LG
20000000000963LG
20000000000909LG
20000000000992LG
I want to delete all zeros:
a)that are in front of other numbers and letters(For example in case of 00000000000010SG I want to delete this part000000000000 and keep 10SG).
b) In cases like 20000000000992LG I want to delete this part 0000000000 and unite 2 with 992LG.
str.strip('0') solves only part a), as I checked.
But what is the right solution for both cases?
I would recommend something similar to Ed's answer, but using a regex to ensure that not all 0s are replaced and to eliminate the need to hardcode the number of 0s.
In [2426]: df.col1.str.replace(r'0{2,}', '', n=1, regex=True)
Out[2426]:
0 12VG
1 14SG
2 14VG
3 10SG
4 2933LG
5 2951LG
6 2957LG
7 2963LG
8 2909LG
9 2992LG
Name: col1, dtype: object
Only the first run of 0s is replaced.
Thanks to @jezrael for pointing out a small bug in my answer.
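A self-contained sketch of the same approach, with the keyword arguments spelled out (newer pandas no longer treats the pattern as a regex by default):
import pandas as pd

s = pd.Series(['00000000000012VG', '00000000000010SG',
               '20000000000933LG', '20000000000992LG'], name='col1')
# n=1 limits the replacement to the first run of zeros;
# regex=True must be explicit in pandas >= 2.0
print(s.str.replace(r'0{2,}', '', n=1, regex=True))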
You can just do
In[9]:
df['col1'] = df['col1'].str.replace('000000000000','')
df['col1'] = df['col1'].str.replace('0000000000','')
df
Out[9]:
col1
0 12VG
1 14SG
2 14VG
3 10SG
4 2933LG
5 2951LG
6 2957LG
7 2963LG
8 2909LG
9 2992LG
This replaces a fixed number of 0s with an empty string. It isn't dynamic, but for the given dataset it's the simplest thing to do unless you can describe the pattern more precisely.
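If you do need something dynamic without a regex, one hedged sketch (continuing with the same df) strips the zeros row by row while preserving a leading non-zero digit; it assumes every value is a non-empty string:
# keep a leading non-zero digit and strip the zeros after it;
# otherwise strip the leading zeros directly
df['col1'] = [s[0] + s[1:].lstrip('0') if s[0] != '0' else s.lstrip('0')
              for s in df['col1']]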
I want to replace the entire cell that contains the word 'Dividend' with blanks or NaN. However, when I try to replace, for example, '1.25 Dividend', it turns out as '1.25 NaN'. I want the whole cell to become 'NaN'. Any idea how to make this work?
Option 1
Use a regular expression in your replace
df.replace('^.*Dividend.*$', np.nan, regex=True)
From comments
(Using regex=True) means that the problem is interpreted as a regular-expression one. You still need an appropriate pattern. The '^' says to start at the beginning of the string, so '^.*' matches all characters from the beginning. The '$' says to end the match at the end of the string, so '.*$' matches all characters up to the end. Put together, '^.*Dividend.*$' matches the entire string whenever 'Dividend' appears somewhere in the middle, and the whole match is replaced with np.nan.
Consider the dataframe df
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
0 1
0 1 2 Dividend
1 3 4
2 5 6 Dividend
then the proposed solution yields
0 1
0 1 NaN
1 3 4.0
2 5 NaN
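For reference, a self-contained run of Option 1 (np is assumed to be the usual numpy alias):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
# any cell whose full string contains 'Dividend' is replaced with NaN
print(df.replace(r'^.*Dividend.*$', np.nan, regex=True))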
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap.
Pass a lambda to applymap that identifies whether a cell contains 'Dividend':
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
0 1
0 1 NaN
1 3 4
2 5 NaN
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
0 1
0 1 NaN
1 3 4
2 5 NaN
Replace all strings by coercing every column to numeric (anything non-numeric becomes NaN):
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
I would use applymap like this, so matching cells become real NaN values rather than the string 'NaN':
df.applymap(lambda x: np.nan if (isinstance(x, str) and 'Dividend' in x) else x)
I have a lot of experience programming in Matlab and am now using Python, and I just can't get this thing to work... I have a dataframe containing a column with timecodes like 00:00:00.033.
timecodes = ['00:00:01.001', '00:00:03.201', '00:00:09.231', '00:00:11.301', '00:00:20.601', '00:00:31.231', '00:00:90.441', '00:00:91.301']
df = pd.DataFrame(timecodes, columns=['TimeCodes'])
All my inputs are 90 seconds or less, so I want to create a column with just the seconds as float. To do this, I need to select position 6 to end and make that into a float, which I can do for the first row like:
float(df['TimeCodes'][0][6:])
This works just fine, but if I now want to create a whole new column 'Time_sec', the following does not work:
df['Time_sec'] = float(df['TimeCodes'][:][6:])
Because df['TimeCodes'][:][6:] takes row 6 to last row, while I want WITHIN each row the 6th till last position. Also this does not work:
df['Time_sec'] = float(df['TimeCodes'][:,6:])
Do I need to make a loop? There must be a better way... And why does df['TimeCodes'][:][6:] not work?
You can use the slice string method and then cast the whole thing to a float:
In [13]: df["TimeCodes"].str.slice(6).astype(float)
Out[13]:
0 1.001
1 3.201
2 9.231
3 11.301
4 20.601
5 31.231
6 90.441
7 91.301
Name: TimeCodes, dtype: float64
As to why df['TimeCodes'][:][6:] doesn't work, what this ends up doing is chaining some selections. First you grab the pd.Series associated with the TimeCodes column, then you select all of the items from the Series with [:], and then you just select the items with index 6 or higher with [6:].
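An alternative sketch lets pandas parse the timecodes itself; pd.to_timedelta appears to accept the out-of-range second values here (e.g. '00:00:90.441'), but that is worth verifying on your own data:
import pandas as pd

timecodes = ['00:00:01.001', '00:00:90.441', '00:00:91.301']
df = pd.DataFrame(timecodes, columns=['TimeCodes'])
# total_seconds() folds hours and minutes into a single float of seconds
df['Time_sec'] = pd.to_timedelta(df['TimeCodes']).dt.total_seconds()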
Solution: indexing with str and casting to float with astype:
print (df["TimeCodes"].str[6:])
0 01.001
1 03.201
2 09.231
3 11.301
4 20.601
5 31.231
6 90.441
7 91.301
Name: TimeCodes, dtype: object
df['new'] = df["TimeCodes"].str[6:].astype(float)
print (df)
TimeCodes new
0 00:00:01.001 1.001
1 00:00:03.201 3.201
2 00:00:09.231 9.231
3 00:00:11.301 11.301
4 00:00:20.601 20.601
5 00:00:31.231 31.231
6 00:00:90.441 90.441
7 00:00:91.301 91.301
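If some rows might not follow the HH:MM:SS.fff pattern, a defensive variant of the same slice coerces anything unparsable to NaN instead of raising:
# using the same df: malformed or missing timecodes become NaN
df['new'] = pd.to_numeric(df['TimeCodes'].str[6:], errors='coerce')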