I am new to Python and have an issue matching column names of a DataFrame. I have a string s = "8907*890a", where a is the name of a column in a data frame. I want to check whether that column name matches any of the columns of df. I have tried, but the string is being taken as a whole. How do I get only the 'a' out of the whole string?
My code:
s = "8907*890a"
df=
a b c
0 rr 12 4
1 rt 45 9
2 ht 78 0
for col in df.columns:
    for i in s.split():
        print(i)
Which gives:
"8907*890a"
Expected out:
a
The split function accepts a delimiter as a parameter; by default the delimiter is whitespace. So when you call s.split(), the interpreter looks for whitespace in the string, doesn't find any in this case, and returns the whole string as the output. If you try s.split('*') you will get
8907
890a
as output. In your case, though, splitting the string is not the best way to extract the column name; I would extract the last character instead, which can be done with s[-1:]
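Putting the pieces together, here is a minimal sketch (the sample df is rebuilt inline) that splits on '*' and then checks the last character against the columns:

```python
import pandas as pd

df = pd.DataFrame({"a": ["rr", "rt", "ht"], "b": [12, 45, 78], "c": [4, 9, 0]})
s = "8907*890a"

# s.split("*") -> ["8907", "890a"]; the column name itself is the last character
candidate = s[-1:]
matches = [col for col in df.columns if col == candidate]
print(matches)  # ['a']
```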
I have the following dataframe:
A
url/3gth33/item/PO151302
url/3jfj6/item/S474-3
url/dfhk34j/item/4964114989191
url/sdfkj3k4/place/9b81f6fd
url/as3f343d/thing/ecc539ec
I'm looking to extract anything with /item/ and its subsequent value.
The end result should be:
item
/item/PO151302
/item/S474-3
/item/4964114989191
here is what I've tried:
df['A'] = df['A'].str.extract(r'(/item/\w+\D+\d+$)')
This returns what I need except for the integer-only values.
Based on the regex docs I'm reading this should grab all instances.
What am I missing here?
Use /item/.+ to match /item/ and anything after it. Also, if you name a group by putting ?P<foo> at its start, e.g. (?P<foo>...), the column for that group in the dataframe of captures returned by extract will be named foo:
item = df['A'].str.extract('(?P<item>/item/.+)').dropna()
Output:
>>> item
item
0 /item/PO151302
1 /item/S474-3
2 /item/4964114989191
This is not a regex solution, but it could come in handy in some situations.
keyword = "/item/"
df["item"] = ((keyword + df["A"].str.split(keyword).str[-1]) *
df["A"].str.contains(keyword))
which returns
A item
0 url/3gth33/item/PO151302 /item/PO151302
1 url/3jfj6/item/S474-3 /item/S474-3
2 url/dfhk34j/item/4964114989191 /item/4964114989191
3 url/sdfkj3k4/place/9b81f6fd
4 url/as3f343d/thing/ecc539ec
5
And in case you want only the rows where item is not empty you could use
df[df["item"].ne("")][["item"]]
I have a pandas dataframe column with characters like this (supposed to be a dictionary but became strings after scraping into a CSV):
{"id":307,"name":"Drinks","slug":"food/drinks"...
I'm trying to extract the values for "name", so in this case it would be "Drinks".
The code I have right now (shown below) keeps outputting NaN for the entire dataframe.
df['extracted_category'] = df.category.str.extract('("name":*(?="slug"))')
What's wrong with my regex? Thanks!
Better to convert it into a dataframe. You can use eval and pd.Series for that, like:
# sample dataframe
df
category
0 {"id":307,"name":"Drinks","slug":"food/drinks"}
df.category.apply(lambda x : pd.Series(eval(x)))
id name slug
0 307 Drinks food/drinks
Or convert only the string to a dictionary using eval:
df['category'] = df.category.apply(eval)
df.category.str["name"]
0 Drinks
Name: category, dtype: object
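As a side note, eval executes arbitrary code, so for strings scraped from the web, ast.literal_eval from the standard library is a safer drop-in for parsing literal dicts — a sketch on the sample value:

```python
import ast
import pandas as pd

df = pd.DataFrame({"category": ['{"id":307,"name":"Drinks","slug":"food/drinks"}']})

# literal_eval parses Python literals (dicts, lists, numbers) without running code
df["category"] = df["category"].apply(ast.literal_eval)
print(df["category"].str["name"])
```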
Hi @Ellie, check also this approach:
x = {"id":307,"name":"Drinks","slug":"food/drinks"}
result = [(key, value) for key, value in x.items() if key.startswith("name")]
print(result)
[('name', 'Drinks')]
So, firstly, the outermost parentheses in ("name":*(?="slug")) need to go: they represent the first group, and the extracted value would then equal that whole group, which is not where the value of 'name' lies.
A simpler regex to try would be "name":"(\w*)" (note: make sure to keep the part of the regex that you want extracted inside the parentheses). This regex looks for the following string:
"name":"
and extracts the word characters that follow it ((\w*)), stopping at the next double quotation mark.
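Applied to the sample column, the corrected pattern looks like this (expand=False keeps the single capture group as a Series):

```python
import pandas as pd

df = pd.DataFrame({"category": ['{"id":307,"name":"Drinks","slug":"food/drinks"}']})

# The capture group (\w*) holds the value; expand=False returns a Series
df["extracted_category"] = df["category"].str.extract(r'"name":"(\w*)"', expand=False)
print(df["extracted_category"].iloc[0])  # Drinks
```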
You can test your regex at: https://regex101.com/
I'm trying to use str.match to match a phrase exactly, but for each word in each row's string. I want to return the index of the matching row, which is why I'm using str.match instead of str.contains. I want the index of the row that contains exactly 'FL', not 'FLORIDA'. The problem with str.contains is that it also returns the index of the row with 'FLORIDA'.
import pandas as pd
data = [['Alex in FL','ten'],['Bob in FLORIDA','five'],['Will in GA','three']]
df = pd.DataFrame(data,columns=['Name','Age'])
df.index[df['Name'].str.contains('FL')]
df.index[df['Name'].str.match('FL')]
Here's what the dataframe looks like:
Name Age
0 Alex in FL ten
1 Bob in FLORIDA five
2 Will in GA three
The output should be returning the index of row 0:
Int64Index([0], dtype='int64')
Use contains with word boundaries:
import pandas as pd
data = [['Alex in FL','ten'],['Bob in FLORIDA','five'],['Will in GA','three']]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df.index[df['Name'].str.contains(r'\bFL\b')])
Output
Int64Index([0], dtype='int64')
Try:
df[df.Name.str.contains(r'\bFL\b', regex=True)]
OR
df[['FL' in i for i in df.Name.str.split()]]
Output:
Name Age
0 Alex in FL ten
The docs say that contains matches a regex (the expression "FL" in your case). Since "FLORIDA" does contain that substring, it matches too.
One way around this would be to match " FL " (padded with spaces) instead, but you would also need to pad each of the values with spaces as well (for when "FL" falls at the start or end of the string).
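A sketch of that padding approach on the sample data:

```python
import pandas as pd

data = [['Alex in FL', 'ten'], ['Bob in FLORIDA', 'five'], ['Will in GA', 'three']]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Pad both pattern and values with spaces so 'FL' still matches at either end
padded = ' ' + df['Name'] + ' '
idx = df.index[padded.str.contains(' FL ', regex=False)]
print(list(idx))  # [0]
```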
I have a pandas dataframe with the following general format:
id,product_name_extract
1,00012CDN
2,14311121NDC
3,NDC37ba
4,47CD27
I also have a list of product codes I would like to match (unfortunately, I have to do NLP extraction, so it will not be a clean match) and then create a new column with the matching list value:
product_name = ['12CDN','21NDC','37ba','7CD2']
id,product_name_extract,product_name_mapped
1,00012CDN,12CDN
2,14311121NDC,21NDC
3,NDC37ba,37ba
4,47CD27,7CD2
I am not too worried about there being collisions.
This would be easy enough if I just needed a True/False indicator using contains and the list values concatenated together with "|" for alternation, but I am a bit stumped now on how I would create a column value of the exact match. Any tips or trick appreciated!
Since you're not worried about collisions, you can join your product_name list with the | operator, and use that as a regex:
df['product_name_mapped'] = (df.product_name_extract.str
.findall('|'.join(product_name))
.str[0])
Result:
>>> df
id product_name_extract product_name_mapped
0 1 00012CDN 12CDN
1 2 14311121NDC 21NDC
2 3 NDC37ba 37ba
3 4 47CD27 7CD2
I have a pandas dataframe that has a string column in it. The length of the frame is over 2 million rows and looping to extract the elements I need is a poor choice. My current code looks like the following
for i in range(len(table["series_id"])):
    table["state_code"] = table["series_id"][i][2:4]
    table["area_code"] = table["series_id"][i][5:9]
    table["supersector_code"] = table["series_id"][i][11:12]
where "series_id" is the string containing the multiple information fields I want. To give an example of the data:
columns:
[series_id, year, month, value, footnotes]
The data:
[['SMS01000000000000001' '2006' 'M01' 1966.5 '']
['SMS01000000000000001' '2006' 'M02' 1970.4 '']
['SMS01000000000000001' '2006' 'M03' 1976.6 '']
However, series_id is the column of interest that I am struggling with. I have looked at the str.* functions in Python and specifically in pandas.
http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern
has a section describing each of the string functions; get and slice are specifically the ones I would like to use. Ideally I could envision a solution like so:
table["state_code"] = table["series_id"].str.get(1:3)
or
table["state_code"] = table["series_id"].str.slice(1:3)
or
table["state_code"] = table["series_id"].str.slice([1:3])
When I try these, I get invalid syntax errors for the ":". I cannot seem to figure out the proper way to perform this vectorized substring operation on a pandas dataframe column.
Thank you
I think I would use str.extract with some regex (which you can tweak for your needs):
In [11]: s = pd.Series(["SMU78000009092000001"])
In [12]: s.str.extract(r'^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})')
Out[12]:
state_code area_code supersector_code
0 U78 0000 92
This reads as: starts (^) with any two characters (which are ignored), the next three (any) characters are state_code, followed by one character (ignored), followed by four digits as area_code, ...
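For completeness, the fixed-position slicing the question asks about also works directly via the .str accessor — a sketch using the offsets from the question (adjust them to your real field layout):

```python
import pandas as pd

table = pd.DataFrame({"series_id": ["SMS01000000000000001", "SMU78000009092000001"]})

# .str supports ordinary Python slice syntax; .str.slice(start, stop) is equivalent
table["state_code"] = table["series_id"].str[2:4]
table["area_code"] = table["series_id"].str.slice(5, 9)
print(table[["state_code", "area_code"]])
```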