Cut string in dataframe column until certain string but including - python

I have similar data as the following:
df = pd.DataFrame({'pagePath':['/my/retour/details/n8hWu7iWtuRXzSvDvCAUZRAlPda6LM/',
'/my/orders/details/151726/',
'/my/retours/retourmethod/']})
print(df)
pagePath
0 /my/retour/details/n8hWu7iWtuRXzSvDvCAUZRAlPda...
1 /my/orders/details/151726/
2 /my/retours/retourmethod/
What I want to do is cut each string off after details, keeping details itself (and its trailing slash) in the result.
Expected output
pagePath
0 /my/retour/details/
1 /my/orders/details/
2 /my/retours/retourmethod/
The following works, but it's slow:
df['pagePath'] = np.where(df.pagePath.str.contains('details'),
df.pagePath.apply(lambda x: x[0:x.find('details')+8]),
df.pagePath)
print(df)
pagePath
0 /my/retour/details/
1 /my/orders/details/
2 /my/retours/retourmethod/
I tried regex, but could only get it to work excluding:
df['pagePath'] = np.where(df.pagePath.str.contains('details'),
df.pagePath.str.extract('(.+?(?=details))'),
df.pagePath)
print(df)
pagePath
0 /my/retour/
1 /my/orders/
2 NaN
On top of that, the regex version returns NaN when the row does not contain details.
So I feel there's an easier and more elegant way for this. How would I write a regex code to solve my problem? Or is my solution already sufficient?

All you need to do is provide a fallback in the regex for when there is no 'details':
>>> df.pagePath.str.extract('(.+?details/?|.*)')
0
0 /my/retour/details/
1 /my/orders/details/
2 /my/retours/retourmethod/
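As a self-contained sketch of the pattern above (the only additions are the raw-string prefix and expand=False, which I use so that the result is a Series that can be assigned straight back to the column):

```python
import pandas as pd

df = pd.DataFrame({'pagePath': [
    '/my/retour/details/n8hWu7iWtuRXzSvDvCAUZRAlPda6LM/',
    '/my/orders/details/151726/',
    '/my/retours/retourmethod/',
]})

# First alternative: lazily match up to and including the first 'details/'.
# Second alternative: fall back to the whole string when 'details' is absent.
# expand=False returns a Series instead of a one-column DataFrame.
df['pagePath'] = df['pagePath'].str.extract(r'(.+?details/?|.*)', expand=False)
print(df)
```

Because the fallback lives inside the regex, no np.where or str.contains check is needed.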

You could also try str.extract with fillna as the fallback:
('/'+df.pagePath.str.extract('/(.*)details')+'details')[0].fillna(df.pagePath)
Out[130]:
0 /my/retour/details
1 /my/orders/details
2 /my/retours/retourmethod/
Name: 0, dtype: object
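Note that the output above drops the trailing slash that the expected output keeps. If the slash matters, one possible variant (my own sketch, not part of the answer above) is str.replace with a capture group, which leaves rows without a match untouched and so needs no fillna:

```python
import pandas as pd

paths = pd.Series([
    '/my/retour/details/n8hWu7iWtuRXzSvDvCAUZRAlPda6LM/',
    '/my/orders/details/151726/',
    '/my/retours/retourmethod/',
])

# Keep the first 'details/' (group 1) and discard everything after it.
# Strings that do not contain 'details/' are returned unchanged.
trimmed = paths.str.replace(r'(details/).*', r'\1', regex=True)
print(trimmed)
```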

Related

pandas.Index.isin produces a different dataframe than simple slicing

I'm really new to pandas and python in general, so I apologize if this is too basic.
I have a list of indices that I must use to take a subset of the rows of a dataframe. First, I simply sliced the dataframe using the indices to produce (df_1). Then I tried to use index.isin just to see if it also works (df_2). Well, it works but it produces a shorter dataframe (and seemingly ignores some of the rows that are supposed to be selected).
df_1 = df.iloc[df_idx]
df_2 = df[df.index.isin(df_idx)]
So my question is, why are they different? How exactly does index.isin work and when is it appropriate to use it?
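If I synthesise a dataframe with duplicates in the index, it reproduces the behaviour you describe. If your index has duplicates, the two are expected to give different results: iloc selects rows by integer position, while index.isin tests each row's index label for membership. If you want to use them interchangeably, you need to ensure that your index values uniquely identify a row (and that the labels coincide with the positions).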
n = 6
df = pd.DataFrame({"idx":[i//2 for i in range(n)],"col1":[f"text {i}" for i in range(n)]}).set_index("idx")
df_idx = df.index
print(f"""
{df}
{df.iloc[df_idx]}
{df[df.index.isin(df_idx)]}
""")
output
col1
idx
0 text 0
0 text 1
1 text 2
1 text 3
2 text 4
2 text 5
col1
idx
0 text 0
0 text 0
0 text 1
0 text 1
1 text 2
1 text 2
col1
idx
0 text 0
0 text 1
1 text 2
1 text 3
2 text 4
2 text 5
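A shorter result from isin can also come from repeated entries in the list of indices. Here is a minimal sketch with made-up data (the frame, the labels 10 to 13 and the variable names are all hypothetical) showing that iloc repeats a row once per requested position, while isin returns each matching row at most once:

```python
import pandas as pd

# Hypothetical frame whose labels (10..13) differ from the positions (0..3).
df = pd.DataFrame({"col1": ["a", "b", "c", "d"]}, index=[10, 11, 12, 13])

positions = [0, 0, 2]                      # position 0 requested twice
by_position = df.iloc[positions]           # positional: rows a, a, c

labels = [10, 10, 12]                      # label 10 listed twice
by_label = df[df.index.isin(labels)]       # membership test: rows a, c

# Using positions as if they were labels matches nothing here.
mismatch = df[df.index.isin(positions)]

print(by_position, by_label, mismatch, sep="\n\n")
```

If the values in your list are labels rather than positions, df.loc[labels] is the label-based counterpart of iloc and also repeats rows for repeated labels.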

Update dataframe values that match a regex condition and keep remaining values intact

The following is an excerpt from my dataframe:
In[1]: df
Out[1]:
LongName BigDog
1 Big Dog 1
2 Mastiff 0
3 Big Dog 1
4 Cat 0
I want to use regex to update BigDog values to 1 if LongName is a mastiff. I need other values to stay the same. I tried this, and although it assigns 1 to mastiffs, it nulls all other values instead of keeping them intact.
def BigDog(longname):
    if re.search('(?i)mastiff', longname):
        return '1'

df['BigDog'] = df['LongName'].apply(BigDog)
I'm not sure what to do, could anybody please help?
You don't need a loop or apply; use str.match with DataFrame.loc:
df.loc[df['LongName'].str.match('(?i)mastiff'), 'BigDog'] = 1
LongName BigDog
1 Big Dog 1
2 Mastiff 1
3 Big Dog 1
4 Cat 0
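One caveat: str.match only matches at the start of the string, whereas the re.search in the question looks anywhere in it. If names such as "English Mastiff" can occur, str.contains is the closer equivalent. A sketch, where row 5 is a hypothetical addition of mine to show the difference:

```python
import pandas as pd

df = pd.DataFrame(
    {"LongName": ["Big Dog", "Mastiff", "Big Dog", "Cat", "English Mastiff"],
     "BigDog": [1, 0, 1, 0, 0]},
    index=[1, 2, 3, 4, 5],  # row 5 is hypothetical, not from the question
)

# case=False replaces the inline (?i) flag; na=False treats missing names as no match.
is_mastiff = df["LongName"].str.contains("mastiff", case=False, na=False)

# Only the matching rows are overwritten; every other value is left as it was.
df.loc[is_mastiff, "BigDog"] = 1
print(df)
```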

Selecting iloc based on a condition

My problem is quite hard to explain but easy to understand with an example:
From this dataframe
pd.DataFrame([[2,"1523974569"],[3,"3214569871"],[0,"9384927512"]])
I would like to obtain :
pd.DataFrame(["15","321",""])
That is, the first column tells me how many characters I should extract from the second column, counting from the start.
Thanks
You could get it using apply with a lambda on the dataframe, as below:
df = pd.DataFrame([[2,"1523974569"],[3,"3214569871"],[0,"9384927512"]])
df[2] = df.apply(lambda x : x[1][:x[0]], axis=1)
df
It will give you this output:
0 1 2
0 2 1523974569 15
1 3 3214569871 321
2 0 9384927512
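If the row-wise apply turns out to be slow on a large frame, a plain list comprehension over the two columns is one possible alternative. This is a sketch under the same column layout as above, not a benchmarked recommendation:

```python
import pandas as pd

df = pd.DataFrame([[2, "1523974569"], [3, "3214569871"], [0, "9384927512"]])

# Pair each length (column 0) with its string (column 1) and slice from the start.
df[2] = [text[:n] for n, text in zip(df[0], df[1])]
print(df)
```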

How to replace an entire cell with NaN on pandas DataFrame

I want to replace the entire cell that contains the word as circled in the picture with blanks or NaN. However when I try to replace for example '1.25 Dividend' it turned out as '1.25 NaN'. I want to return the whole cell as 'NaN'. Any idea how to work on this?
Option 1
Use a regular expression in your replace
df.replace('^.*Dividend.*$', np.nan, regex=True)
From comments
(Using regex=True) means that it will interpret the problem as a regular expression one. You still need an appropriate pattern. The '^' says to start at the beginning of the string. '^.*' matches all characters from the beginning of the string. '$' says to end the match with the end of the string. '.*$' matches all characters up to the end of the string. Finally, '^.*Dividend.*$' matches all characters from the beginning, has 'Dividend' somewhere in the middle, then any characters after it. Then replace this whole thing with np.nan
Consider the dataframe df
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
0 1
0 1 2 Dividend
1 3 4
2 5 6 Dividend
then the proposed solution yields
0 1
0 1 NaN
1 3 4.0
2 5 NaN
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap, passing applymap a lambda that flags every cell containing 'Dividend':
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
0 1
0 1 NaN
1 3 4
2 5 NaN
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
0 1
0 1 NaN
1 3 4
2 5 NaN
Replace all strings:
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
I would use applymap like this
df.applymap(lambda x: 'NaN' if (type(x) is str and 'Dividend' in x) else x)
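Note that the line above writes the literal string 'NaN' rather than a real missing value, and that applymap is deprecated in favour of DataFrame.map as of pandas 2.1. A sketch that avoids both by building the boolean mask column by column (the name has_dividend is mine):

```python
import pandas as pd

df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])

# For each column, cast to str and flag the cells containing the word.
# regex=False makes this a plain substring test.
has_dividend = df.apply(
    lambda col: col.astype(str).str.contains('Dividend', regex=False)
)

# mask() replaces every flagged cell with a real NaN and keeps the rest.
cleaned = df.mask(has_dividend)
print(cleaned)
```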

Delineate twice through a dataframe in pandas

I have a sparse pandas DataFrame/Series with values that look like variations of "AB1234:12, CD5678:34, EF3456:56". Something to the effect of
"AB1234:12, CD5678:34, EF3456:56"
"AB1234:12, CD5678:34"
NaN
"GH5678:34, EF3456:56"
"OH56:34"
Which I'd like to convert into
["AB1234","CD5678", "EF3456"]
["AB1234","CD5678"]
NaN
["GH5678","EF3456"]
["OH56"]
This kind of "double delineation" has been proving difficult. I know we can do A = df["columnName"].str.split(","), but I've run across a couple of problems: .split(", ") doesn't seem to work, and .split(",") leaves whitespace. Also, iterating through the generated A and splitting again seems to interpret my new lists as floats, although that last one might be a technical difficulty with IPython; I'm trying to work out that problem as well.
Is there a way to delineate on two types of separators - instead of just one? If not, how do you perform the loop to iterate over the inner list?
//Edit: changed the apostrophes to commas - that was just my dyslexia kicking in
You nearly had it, note you can use a regular expression to split more generally:
In [11]: s2
Out[11]:
0 AB1234:12, CD5678:34, EF3456:56
1 AB1234:12, CD5678:34
2 NaN
3 GH5678:34, EF3456:56
4 OH56:34
dtype: object
In [12]: s2.str.split(", ")
Out[12]:
0 [AB1234:12, CD5678:34, EF3456:56]
1 [AB1234:12, CD5678:34]
2 NaN
3 [GH5678:34, EF3456:56]
4 [OH56:34]
dtype: object
In [13]: s2.str.split(r"\s*,\s*")
Out[13]:
0 [AB1234:12, CD5678:34, EF3456:56]
1 [AB1234:12, CD5678:34]
2 NaN
3 [GH5678:34, EF3456:56]
4 [OH56:34]
dtype: object
The second pattern also removes any whitespace before or after each comma.
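The split above still leaves the ":12" suffixes in place. To get only the codes in one pass, one option (my own sketch, assuming every item has the form code:number) is str.findall with a capture group, which also passes NaN through untouched:

```python
import numpy as np
import pandas as pd

s2 = pd.Series([
    "AB1234:12, CD5678:34, EF3456:56",
    "AB1234:12, CD5678:34",
    np.nan,
    "GH5678:34, EF3456:56",
    "OH56:34",
])

# Capture each run of characters (no comma, whitespace or colon) that is
# immediately followed by a colon; the number after the colon is ignored.
codes = s2.str.findall(r"([^,\s:]+):")
print(codes)
```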
Here is your DataFrame
>>> df
A
0 AB1234:12, CD5678:34, EF3456:56
1 AB1234:12, CD5678:34
2 None
3 GH5678:34, EF3456:56
4 OH56:34
And now I use split and replace to split by ', ' and remove all ':'
>>> df.A = [i.replace(':','').split(", ") if isinstance(i,str) else i for i in df.A]
>>> df.A
0 [AB123412, CD567834, EF345656]
1 [AB123412, CD567834]
2 None
3 [GH567834, EF345656]
4 [OH5634]
Name: A
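Removing only the ':' leaves the trailing digits attached (AB123412 instead of AB1234). If the goal is the output in the question, a small variation of the same list comprehension (a sketch, again assuming every item looks like code:number) keeps only the part before the colon:

```python
import pandas as pd

df = pd.DataFrame({"A": [
    "AB1234:12, CD5678:34, EF3456:56",
    "AB1234:12, CD5678:34",
    None,
    "GH5678:34, EF3456:56",
    "OH56:34",
]})

# Split each cell on ', ', then keep the text before ':' in every item.
# Non-string cells (the missing value) are passed through unchanged.
df["A"] = [
    [item.split(":")[0] for item in cell.split(", ")] if isinstance(cell, str) else cell
    for cell in df["A"]
]
print(df["A"])
```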
