I have a pandas dataframe column with characters like this (supposed to be a dictionary but became strings after scraping into a CSV):
{"id":307,"name":"Drinks","slug":"food/drinks"...
I'm trying to extract the values for "name", so in this case it would be "Drinks".
The code I have right now (shown below) keeps outputting NaN for the entire dataframe.
df['extracted_category'] = df.category.str.extract('("name":*(?="slug"))')
What's wrong with my regex? Thanks!
It is better to convert it into a dataframe; you can use eval and pd.Series for that, like this:
# sample dataframe
df
category
0 {"id":307,"name":"Drinks","slug":"food/drinks"}
df.category.apply(lambda x : pd.Series(eval(x)))
id name slug
0 307 Drinks food/drinks
Or convert only the string to a dictionary using eval:
df['category'] = df.category.apply(eval)
df.category.str["name"]
0 Drinks
Name: category, dtype: object
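Since the sample string is valid JSON, json.loads may be a safer alternative to eval, which executes whatever expression the string contains. A minimal sketch, assuming every row of category holds a complete JSON object (the one-row frame below is made up for illustration):

```python
import json

import pandas as pd

# Hypothetical sample frame mirroring the one above.
df = pd.DataFrame(
    {"category": ['{"id":307,"name":"Drinks","slug":"food/drinks"}']}
)

# Parse each JSON string into a dict without evaluating arbitrary code.
parsed = df["category"].apply(json.loads)

# Pull out just the "name" value ...
df["extracted_category"] = parsed.map(lambda d: d.get("name"))

# ... or expand every key into its own column.
expanded = pd.DataFrame(parsed.tolist(), index=df.index)

print(df["extracted_category"])
print(expanded)
```

If some rows are truncated or otherwise not valid JSON, json.loads raises an error for them, so those rows would need separate handling.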
Hi @Ellie, also check this approach:
x = {"id":307,"name":"Drinks","slug":"food/drinks"}
result = [(key, value) for key, value in x.items() if key.startswith("name")]
print(result)
[('name', 'Drinks')]
First, the outermost parentheses in ("name":*(?="slug")) need to go: they form the first capture group, and str.extract returns that group, which is not where the value of 'name' lies. In addition, :* matches zero or more colons rather than arbitrary text, and the lookahead (?="slug") requires "slug" to follow immediately, so the pattern never matches and every row comes out as NaN.
A simpler regex to try would be "name":"(\w*)" (note: keep the part of the regex that you want extracted inside the parentheses). This regex looks for the following string:
"name":"
and extracts the word characters (letters, digits, and underscores) that follow it (\w*), stopping at the next double quotation mark.
You can test your regex at: https://regex101.com/
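As a rough sketch of how that pattern behaves with str.extract, here it is applied to a small made-up frame. The second row and the [^"]* variant are assumptions added for illustration: \w* does not match spaces, so a name such as "Hot Drinks" would need a pattern that accepts anything up to the closing quote.

```python
import pandas as pd

# Hypothetical sample data; the second row is invented to show the space case.
df = pd.DataFrame(
    {
        "category": [
            '{"id":307,"name":"Drinks","slug":"food/drinks"}',
            '{"id":1,"name":"Hot Drinks","slug":"food/hot-drinks"}',
        ]
    }
)

# Only the part inside the parentheses is returned by str.extract.
df["extracted_category"] = df["category"].str.extract(
    r'"name":"(\w*)"', expand=False
)

# Variant that accepts any characters except a double quote.
df["extracted_any"] = df["category"].str.extract(
    r'"name":"([^"]*)"', expand=False
)

print(df[["extracted_category", "extracted_any"]])
```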
I know there have been a lot of questions around this topic but I didn't find any that described my problem. I have a df, with a specific column that looks like this:
colA
['drinks/coke/diet', 'food/spaghetti']
['drinks/water', 'drinks/tea', 'drinks/coke', 'food/pizza']
['drinks/coke/diet', 'drinks/coke']
...
The values of colA are strings, NOT lists. What I want to achieve is a new column where I keep only the parts of each value that contain 'coke'. 'coke' can be repeated any number of times in the string and can appear in any place. The values between '' don't always contain an equal number of parts separated by /.
So the result should look like this:
colA colB
['drinks/coke/diet', 'food/spaghetti'] 'drinks/coke/diet'
['drinks/water', 'drinks/tea', 'drinks/coke', 'food/pizza'] 'drinks/coke'
['drinks/coke/diet', 'drinks/coke'] 'drinks/coke/diet', 'drinks/coke'
...
I've tried calling a function:
import json
df['coke'] = df['colA'].apply(lambda secties: [s for s in json.loads(colA) if 'coke' in s], meta=str)
But this one keeps throwing errors that I don't know how to solve.
You could split on comma and explode to create a Series. Then use str.contains to create a boolean mask that you could use to filter the items that contain the word "coke". Finally join the strings back across indices:
s = df['colA'].str.split(',').explode()
df['colB'] = s[s.str.contains('coke')].groupby(level=0).apply(','.join).str.strip('[]')
Output:
colA colB
0 ['drinks/coke/diet', 'food/spaghetti'] 'drinks/coke/diet'
1 ['drinks/water', 'drinks/tea', 'drinks/coke', ... 'drinks/coke'
2 ['drinks/coke/diet', 'drinks/coke'] 'drinks/coke/diet', 'drinks/coke'
Try parsing the string into a list first and then checking each item for 'coke'. Because the strings use single quotes, json.loads cannot parse them, but ast.literal_eval can. Something like this:
import ast
df['coke'] = df['colA'].apply(lambda secties: [s for s in ast.literal_eval(secties) if 'coke' in s])
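A self-contained sketch of the parse-then-filter idea, assuming each colA value is a Python-style list literal (single-quoted items), in which case ast.literal_eval rather than json.loads is the parser that accepts it. The helper name keep_coke is made up for illustration:

```python
import ast

import pandas as pd

# Sample data copied from the question; each value is a string, not a list.
df = pd.DataFrame(
    {
        "colA": [
            "['drinks/coke/diet', 'food/spaghetti']",
            "['drinks/water', 'drinks/tea', 'drinks/coke', 'food/pizza']",
            "['drinks/coke/diet', 'drinks/coke']",
        ]
    }
)


def keep_coke(raw):
    """Parse the list literal and keep only the items containing 'coke'."""
    items = ast.literal_eval(raw)
    return ", ".join(item for item in items if "coke" in item)


df["colB"] = df["colA"].apply(keep_coke)
print(df["colB"].tolist())
```

Rows with no 'coke' item end up as an empty string here; returning None instead may be preferable depending on what happens downstream.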
a = "Other (please specify) (What were you trying to do on the website when you encountered ^f('l1')^?Â\xa0)"
There are many values starting with '^f' and ending with '^' in a pandas column, and I need to replace them as shown below:
"Other (please specify) (What were you trying to do on the website when you encountered THIS ISSUE?Â\xa0)"
You don't mention what you've tried already, nor what the rest of your DataFrame looks like, but here is a minimal example:
import pandas as pd
# Create a DataFrame with a single data point
df = pd.DataFrame(["... encountered ^f('l1')^?Â\xa0)"])
# Define a regex pattern
pattern = r'(\^f.+\^)'
# Use the .replace() method to replace
df = df.replace(to_replace=pattern, value='TEST', regex=True)
Output
0
0 ... encountered TEST? )
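One caveat with the pattern above: .+ is greedy, so if a cell contains more than one ^f...^ marker, everything from the first ^f to the last ^ is replaced in one go. A small sketch of the difference, using a made-up string with two markers and a non-greedy .+? as a possible alternative:

```python
import pandas as pd

# Hypothetical cell containing two markers.
df = pd.DataFrame({"text": ["a ^f('l1')^ b ^f('l2')^ c"]})

# Greedy: one match spanning from the first ^f to the last ^.
greedy = df["text"].str.replace(r"\^f.+\^", "THIS ISSUE", regex=True)

# Non-greedy: each marker is matched and replaced separately.
lazy = df["text"].str.replace(r"\^f.+?\^", "THIS ISSUE", regex=True)

print(greedy.iloc[0])
print(lazy.iloc[0])
```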
I am new to Python and have an issue matching a column name of a DataFrame. I have a string s = "8907*890a", where a is the name of a column in the data frame. I want to check whether that name is among the column names of df. I have tried the code below, but the string is treated as a whole. How can I get only the 'a' from the string?
My code:
s = "8907*890a"
df=
a b c
0 rr 12 4
1 rt 45 9
2 ht 78 0
for col in df.columns:
    for i in s.split():
        print(i)
Which gives:
"8907*890a"
Expected output:
a
The split function accepts a delimiter as a parameter. By default it splits on whitespace. So when you call s.split(), Python looks for whitespace in the string, finds none, and returns the whole string as a single-element list. If you try s.split('*') you will get
8907
890a
as output. In your case, splitting the string does not seem to be the best way to extract the column name. I would extract the last character instead, which can be done using s[-1:]
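A short sketch of both ideas, using the frame from the question. Taking the last character only works for one-character column names, so the sketch also checks which column names the string ends with; that endswith check is an assumption about where the name sits in the string, not something stated in the question.

```python
import pandas as pd

s = "8907*890a"
df = pd.DataFrame(
    {"a": ["rr", "rt", "ht"], "b": [12, 45, 78], "c": [4, 9, 0]}
)

# Splitting on '*' gives the two halves of the string.
parts = s.split("*")

# The last character of the string.
last_char = s[-1:]

# Column names that the string ends with (also handles longer names).
matches = [col for col in df.columns if s.endswith(col)]

print(parts, last_char, matches)
```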
I have a pandas dataframe with the following general format:
id,product_name_extract
1,00012CDN
2,14311121NDC
3,NDC37ba
4,47CD27
I also have a list of product codes I would like to match (unfortunately, I have to do NLP extraction, so it will not be a clean match) and then create a new column with the matching list value:
product_name = ['12CDN','21NDC','37ba','7CD2']
id,product_name_extract,product_name_mapped
1,00012CDN,12CDN
2,14311121NDC,21NDC
3,NDC37ba,37ba
4,47CD27,7CD2
I am not too worried about there being collisions.
This would be easy enough if I just needed a True/False indicator using contains and the list values concatenated together with "|" for alternation, but I am a bit stumped on how to create a column holding the exact match. Any tips or tricks appreciated!
Since you're not worried about collisions, you can join your product_name list with the | operator, and use that as a regex:
df['product_name_mapped'] = (df.product_name_extract.str
.findall('|'.join(product_name))
.str[0])
Result:
>>> df
id product_name_extract product_name_mapped
0 1 00012CDN 12CDN
1 2 14311121NDC 21NDC
2 3 NDC37ba 37ba
3 4 47CD27 7CD2
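If the product codes could ever contain regex metacharacters (such as . or +), escaping them before joining is a reasonable precaution. A sketch of that variant using re.escape and str.extract with a single capture group, on the same sample data; like the findall approach, it keeps only the first match in each string:

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "product_name_extract": ["00012CDN", "14311121NDC", "NDC37ba", "47CD27"],
    }
)
product_name = ["12CDN", "21NDC", "37ba", "7CD2"]

# Escape each code so it is matched literally, then join with '|'.
pattern = "(" + "|".join(re.escape(p) for p in product_name) + ")"

# expand=False returns a Series holding the first match (NaN if none).
df["product_name_mapped"] = df["product_name_extract"].str.extract(
    pattern, expand=False
)
print(df)
```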
I have a pandas dataframe that has a string column in it. The length of the frame is over 2 million rows and looping to extract the elements I need is a poor choice. My current code looks like the following:
for i in range(len(table["series_id"])):
    table["state_code"] = table["series_id"][i][2:4]
    table["area_code"] = table["series_id"][i][5:9]
    table["supersector_code"] = table["series_id"][i][11:12]
where "series_id" is the string containing multiple information fields. Here is an example data element:
columns:
[series_id, year, month, value, footnotes]
The data:
[['SMS01000000000000001' '2006' 'M01' 1966.5 '']
['SMS01000000000000001' '2006' 'M02' 1970.4 '']
['SMS01000000000000001' '2006' 'M03' 1976.6 '']
However, series_id is the column of interest that I am struggling with. I have looked at the str functions for Python and specifically pandas.
http://pandas.pydata.org/pandas-docs/stable/basics.html#testing-for-strings-that-match-or-contain-a-pattern
has a section describing each of the string functions; get and slice specifically are the ones I would like to use. Ideally, I envision a solution like so:
table["state_code"] = table["series_id"].str.get(1:3)
or
table["state_code"] = table["series_id"].str.slice(1:3)
or
table["state_code"] = table["series_id"].str.slice([1:3])
When I try the functions above, I get an invalid syntax error for the ":", and I cannot seem to figure out the proper way to perform a vectorized substring operation on a pandas data frame column.
Thank you
I think I would use str.extract with some regex (which you can tweak for your needs):
In [11]: s = pd.Series(["SMU78000009092000001"])
In [12]: s.str.extract(r'^.{2}(?P<state_code>.{3}).{1}(?P<area_code>\d{4}).{2}(?P<supersector_code>.{2})')
Out[12]:
state_code area_code supersector_code
0 U78 0000 92
This reads as: starts (^) with any two characters (which are ignored), the next three (any) characters are state_code, followed by any character (ignored), followed by four digits are area_code, ...
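For fixed positions, plain vectorized slicing is also an option: the .str accessor supports slice notation directly, and str.slice takes start and stop as separate arguments rather than a start:stop expression. A minimal sketch, reusing the index ranges from the loop in the question (whether those ranges are the right field boundaries is for the asker to confirm):

```python
import pandas as pd

table = pd.DataFrame(
    {"series_id": ["SMS01000000000000001", "SMU78000009092000001"]}
)

# .str[start:stop] slices every string in the column at once.
table["state_code"] = table["series_id"].str[2:4]

# str.slice is the method form; start and stop are separate arguments.
table["area_code"] = table["series_id"].str.slice(5, 9)

table["supersector_code"] = table["series_id"].str[11:12]

print(table)
```

This avoids the Python-level loop entirely, which matters at 2 million rows.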