extract values from column in dataframe - python

I have the following dataframe:
A
url/3gth33/item/PO151302
url/3jfj6/item/S474-3
url/dfhk34j/item/4964114989191
url/sdfkj3k4/place/9b81f6fd
url/as3f343d/thing/ecc539ec
I'm looking to extract anything with /item/ and its subsequent value.
The end result should be:
item
/item/PO151302
/item/S474-3
/item/4964114989191
here is what I've tried:
df['A'] = df['A'].str.extract(r'(/item/\w+\D+\d+$)')
This is returning what I need except the integer only values.
Based on the regex docs I'm reading this should grab all instances.
What am I missing here?

Use /item/.+ to match /item/ and anything after. Also, if you put ?P<foo> at the beginning of a group, e.g. (?P<foo>...), the column for that matched group in the returned dataframe of captures will be named what's inside the <...>:
item = df['A'].str.extract('(?P<item>/item/.+)').dropna()
Output:
>>> item
item
0 /item/PO151302
1 /item/S474-3
2 /item/4964114989191
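As for why the original pattern misses the digit-only value: \D+ needs at least one non-digit character somewhere after /item/, and 4964114989191 has none. A minimal sketch of both patterns side by side, assuming the question's sample data is rebuilt by hand (the names original and fixed are only for illustration):

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({"A": [
    "url/3gth33/item/PO151302",
    "url/3jfj6/item/S474-3",
    "url/dfhk34j/item/4964114989191",
    "url/sdfkj3k4/place/9b81f6fd",
    "url/as3f343d/thing/ecc539ec",
]})

# Pattern from the question: \D+ forces a non-digit after /item/,
# so the all-digit value in row 2 cannot match and comes back as NaN
original = df["A"].str.extract(r"(/item/\w+\D+\d+$)", expand=False)

# Pattern from the answer: /item/ followed by anything;
# the named group becomes the column name of the result
fixed = df["A"].str.extract(r"(?P<item>/item/.+)").dropna()

print(original)
print(fixed)
```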

This is not a regex solution, but it could come in handy in some situations.
keyword = "/item/"
df["item"] = ((keyword + df["A"].str.split(keyword).str[-1]) *
df["A"].str.contains(keyword))
which returns
A item
0 url/3gth33/item/PO151302 /item/PO151302
1 url/3jfj6/item/S474-3 /item/S474-3
2 url/dfhk34j/item/4964114989191 /item/4964114989191
3 url/sdfkj3k4/place/9b81f6fd
4 url/as3f343d/thing/ecc539ec
5
And in case you want only the rows where item is not empty you could use
df[df["item"].ne("")][["item"]]
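The multiplication above relies on Python's str * bool behaviour ("abc" * False is the empty string). A sketch of the same idea with Series.where, which may read more explicitly; the frame is rebuilt from the question's sample, and has_item, tail and only_items are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"A": [
    "url/3gth33/item/PO151302",
    "url/3jfj6/item/S474-3",
    "url/dfhk34j/item/4964114989191",
    "url/sdfkj3k4/place/9b81f6fd",
    "url/as3f343d/thing/ecc539ec",
]})

keyword = "/item/"

# Boolean mask: which rows contain the keyword (plain substring test)
has_item = df["A"].str.contains(keyword, regex=False)

# Text after the last occurrence of the keyword
tail = df["A"].str.split(keyword).str[-1]

# Keep keyword + tail where the mask is True, empty string elsewhere
df["item"] = (keyword + tail).where(has_item, "")

# Only the rows that actually contain an item
only_items = df.loc[has_item, ["item"]]
print(only_items)
```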

Related

Splitting row values and count unique's from a DataFrame

I have the following data in a column titled Reference:
ABS052
ABS052/01
ABS052/02
ADA010/00
ADD005
ADD005/01
ADD005/02
ADD005/03
ADD005/04
ADD005/05
...
WOO032
WOO032/01
WOO032/02
WOO032/03
WOO045
WOO045/01
WOO045/02
WOO045/03
WOO045/04
I would like to know how to split the row values to create a Dataframe that contains the single Reference code, plus a Count value, for example:
Reference  Count
ABS052     3
ADA010     0
ADD005     2
...        ...
WOO032     3
WOO045     4
I have the following code:
df['Reference'] = df['Reference'].str.split('/')
Results in:
['ABS052'],
['ABS052','01'],
['ABS052','02'],
['ABS052','03'],
...
But I'm not sure how to ditch the last two digits from the list in each row.
All I want now is to keep the string in each row [0] if that makes sense, then I could just retrieve a value_count from the 'Reference' column.
There seems to be something wrong with the expected result listed in the question.
Let's say you want to ditch the digits and count the prefix occurrences:
df.Reference.str.split("/", expand=True)[0].value_counts()
If instead the suffix means something and you want to keep the highest value this should do
df.Reference.str.split("/", expand=True).fillna("00").astype({0: str, 1: int}).groupby(0).max()
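Both one-liners can be tried on the first few rows of the question's data. This is only a sketch: the frame is rebuilt by hand, and parts, prefix_counts and max_suffix are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"Reference": [
    "ABS052", "ABS052/01", "ABS052/02", "ADA010/00",
    "ADD005", "ADD005/01", "ADD005/02", "ADD005/03", "ADD005/04", "ADD005/05",
]})

# Column 0 holds the prefix, column 1 the suffix (None when there is no "/")
parts = df["Reference"].str.split("/", expand=True)

# Number of rows per prefix
prefix_counts = parts[0].value_counts()

# Highest numeric suffix per prefix; rows without a suffix count as 0
max_suffix = parts.fillna("00").astype({0: str, 1: int}).groupby(0).max()

print(prefix_counts)
print(max_suffix)
```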
You can just use regex to replace the last two digits like this:
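import pandas as pd
df = pd.DataFrame({'a':['ABS052','ABS052/01','ABS052/02','ADA010/00','ADD005','ADD005/01','ADD005/02','ADD005/03','ADD005/04','ADD005/05']})
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
df = df['a'].str.replace(r'/\d+$', '', regex=True).value_counts().reset_index()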
Output:
>>>> index a
0 ADD005 6
1 ABS052 3
2 ADA010 1
You are almost there, you can add expand=True to split and then use groupby:
df['Reference'].str.split("/", expand=True).fillna("--").groupby(0).count()
returns:
1
0
ABS052 3
ADA010 1
ADD005 6
for the first couple of rows of your data.
The fillna("--") makes sure you also count lines like ABS052 that have no "/", i.e. None in the second column, since count() skips missing values.
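The effect of the fillna call is easiest to see by running the groupby with and without it. A small sketch on the same sample rows (with_fill and without_fill are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"Reference": [
    "ABS052", "ABS052/01", "ABS052/02", "ADA010/00",
    "ADD005", "ADD005/01", "ADD005/02", "ADD005/03", "ADD005/04", "ADD005/05",
]})

parts = df["Reference"].str.split("/", expand=True)

# count() ignores missing values, so rows without a suffix are dropped here
without_fill = parts.groupby(0).count()

# Filling the missing suffix with a placeholder makes every row count
with_fill = parts.fillna("--").groupby(0).count()

print(without_fill)
print(with_fill)
```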
Output to df with column names
df['Reference'] = df['Reference'].str.split('/').str[0]
df_counts = df['Reference'].value_counts().rename_axis('Reference').reset_index(name='Counts')
output
Reference Counts
0 ADD005 6
1 ABS052 3
2 ADA010 1
Explanation - The first line gives a clean series called 'Reference'. The second line gives a count of unique items and then resets the index and renames the columns.

Match the column name based on the string in python?

I am new to Python and have an issue matching the column names of a DataFrame. I have a string s = "8907*890a", where a is the column name of a data frame. I want to check whether that name is among the column names of df. I have tried, but the string is being taken as a whole. How do I get only the 'a' from the whole string?
My code:
s = "8907*890a"
df=
a b c
0 rr 12 4
1 rt 45 9
2 ht 78 0
for col in df.columns:
    for i in s.split():
        print(i)
Which gives:
"8907*890a"
Expected out:
a
The split function accepts a delimiter as a parameter. By default it splits on whitespace. So when you call s.split(), Python looks for whitespace in the string, finds none, and returns the whole string as the only element of the result. If you try s.split('*') you will get
8907
890a
as output. In your case it appears that splitting the string is not the best option to extract the column name. I would go with extracting the last character instead. This can be done using s[-1:]
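The points above can be checked directly. A short sketch, assuming the frame from the question; the endswith check at the end is an extra suggestion for column names longer than one character, not something from the question:

```python
import pandas as pd

s = "8907*890a"
df = pd.DataFrame({"a": ["rr", "rt", "ht"], "b": [12, 45, 78], "c": [4, 9, 0]})

# No whitespace in s, so the default split returns the whole string in a list
whole = s.split()

# Splitting on "*" gives the two halves
pieces = s.split("*")

# Last character of the string
last_char = s[-1:]

# Is that character one of the column names?
is_column = last_char in df.columns

# Alternative: every column name that the string ends with
matching = [col for col in df.columns if s.endswith(col)]

print(whole, pieces, last_char, is_column, matching)
```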

Python: Find list elements with part of it as a duplicate. A logic to work would be sufficient

List1 = ['ABCD_123.A_062320_082824', 'ABCD_123.A_062320_094024','ABCD_123.A_063020_084447']
I want to keep the last element as it has the latest time stamp MonDayYear_HrMinSec
Method 1
names = []
for name in List1:
    names.append(name.split('_')[0])
    Day = name.split('_')[-2]
    Time = name.split('_')[-1]
    print(names,Day,Time)
Method 2
import re
for name in List1:
    namematch = re.search(r'^([a-zA-Z0-9]*)(_[\d]*.A_)([\d]{6})_([\d]{6})',name)
    names.append(namematch.group(1))
    #print(names)
I tried regex, which works, but I don't know how to check for the corresponding group. Do I use an if condition checking for groups 2 and 3 and keep group 1, or something along those lines?
You want this (assuming structure is name_date_time):
from itertools import groupby
out = [sorted(list(v))[-1] for k,v in groupby(sorted(List1), key=lambda x: '_'.join(x.split('_')[:-2]))]
Explanation:
Split your elements by '_' and throw away date and time and join the rest by '_' to form the names
Use groupby to group by names and then sort each group
Select the last in the sorted group (if you sort, latest date and time will come last)
output (Note that the order of elements can be different in this solution. If you need to keep the order, simply keep the order of names and reorder this by that):
['ABCD_123.A_063020_084447']
Another example:
List1 = ['ABCE_123.A_062320_082824', 'ABCE_123.A_062320_094024','ABCD_123.A_063020_084447']
out:
['ABCD_123.A_063020_084447', 'ABCE_123.A_062320_094024']
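One caveat with sorting the raw strings: MMDDYY compares as text, so entries from different years could sort in the wrong order. A possible variant that parses the timestamp first, assuming every element ends in _MMDDYY_HHMMSS (name_of and stamp_of are hypothetical helper names):

```python
from datetime import datetime
from itertools import groupby

def name_of(item):
    # Everything before the trailing _date_time part
    return item.rsplit("_", 2)[0]

def stamp_of(item):
    # Parse the trailing MMDDYY and HHMMSS fields into a datetime
    _, day, time = item.rsplit("_", 2)
    return datetime.strptime(day + time, "%m%d%y%H%M%S")

List1 = ['ABCE_123.A_062320_082824', 'ABCE_123.A_062320_094024', 'ABCD_123.A_063020_084447']

# groupby needs its input sorted by the grouping key
latest = [
    max(group, key=stamp_of)
    for _, group in groupby(sorted(List1, key=name_of), key=name_of)
]
print(latest)
```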

How to grab a string inside a pandas dataframe using a regex

I am trying to regex out a certain string inside my pandas df.
Say I have a df like so:
a b
0 foo foo AA123 bar 4
1 foo foo BB245 bar 5
2 foo CA234 bar bar 5
How would I get this df:
a b
0 AA123 4
1 BB245 5
2 CA234 5
One method I tried was df.replace({'(\w{3}\d{3})': ?}) but wasn't sure what to put for the second parameter.
You could use the regex-based Series.str.extract function to keep just the matching group. You also need a fix to your regex - the cardinality for the \w elements should be 2. In the end the code would be:
df["a"] = df["a"].str.extract(r'(\w{2}\d{3})', expand=False)
The expand=False is to indicate you don't want str.extract to return a DataFrame, which it does by default in order to accommodate multiple regex groups (it returns one column per group). Since you already know there is just one regex group here, for convenience you specify expand=False to get back a Series you can immediately assign to df["a"]. If there were more than one regex group, the function would return a DataFrame no matter what you specified for expand, and you would index into it to get the column/group you wanted.
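The difference between the return types can be seen in a small sketch, assuming the frame from the question (as_series, as_frame and two_groups are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["foo foo AA123 bar", "foo foo BB245 bar", "foo CA234 bar bar"],
    "b": [4, 5, 5],
})

# One group, expand=False: a Series
as_series = df["a"].str.extract(r"(\w{2}\d{3})", expand=False)

# One group, default expand=True: a one-column DataFrame
as_frame = df["a"].str.extract(r"(\w{2}\d{3})")

# Two groups: a DataFrame with one column per group, even with expand=False
two_groups = df["a"].str.extract(r"(\w{2})(\d{3})", expand=False)

df["a"] = as_series
print(df)
```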

Pattern Match in List of Strings, Create New Column in pandas

I have a pandas dataframe with the following general format:
id,product_name_extract
1,00012CDN
2,14311121NDC
3,NDC37ba
4,47CD27
I also have a list of product codes I would like to match (unfortunately, I have to do NLP extraction, so it will not be a clean match) and then create a new column with the matching list value:
product_name = ['12CDN','21NDC','37ba','7CD2']
id,product_name_extract,product_name_mapped
1,00012CDN,12CDN
2,14311121NDC,21NDC
3,NDC37ba,37ba
4,47CD27,7CD2
I am not too worried about there being collisions.
This would be easy enough if I just needed a True/False indicator using contains and the list values concatenated together with "|" for alternation, but I am a bit stumped now on how I would create a column value of the exact match. Any tips or trick appreciated!
Since you're not worried about collisions, you can join your product_name list with the | operator, and use that as a regex:
df['product_name_mapped'] = (df.product_name_extract.str
                             .findall('|'.join(product_name))
                             .str[0])
Result:
>>> df
id product_name_extract product_name_mapped
0 1 00012CDN 12CDN
1 2 14311121NDC 21NDC
2 3 NDC37ba 37ba
3 4 47CD27 7CD2
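If the product codes might contain regex metacharacters such as "." or "+", escaping them before joining is a cheap safeguard. A sketch of that variant using str.extract with a single capture group; the re.escape step is an addition that the question's codes do not strictly need:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "product_name_extract": ["00012CDN", "14311121NDC", "NDC37ba", "47CD27"],
})
product_name = ['12CDN', '21NDC', '37ba', '7CD2']

# One capture group holding all codes as escaped alternatives
pattern = "(" + "|".join(re.escape(code) for code in product_name) + ")"

# First matching code per row; NaN where none of the codes occur
df["product_name_mapped"] = df["product_name_extract"].str.extract(pattern, expand=False)
print(df)
```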
