I have a dataframe:
Name url
A 'https://foo.com, https://www.bar.org, https://goo.com'
B 'https://foo.com, https://www.bar.org, https://www.goo.com'
C 'https://foo.com, https://www.bar.org, https://goo.com'
and then a keyword list:
keyword_list = ['foo','bar']
I'm trying to remove the urls that contain the keywords while keeping the ones that don't. So far this is the only thing that has worked for me, but it only removes the keyword itself rather than the whole url:
df['url'] = df['url'].str.replace('|'.join(keyword_list), ' ', regex=True)
I've tried converting each string to a list, but I get an indexing error when combining it back with the larger dataframe it's part of. Has anyone run into this before?
Desired output:
Name url
A 'https://goo.com'
B 'https://www.goo.com'
C 'https://goo.com'
I'm pretty sure you can do this with a regex, but you can also split each string into separate urls, stack them into one long Series, and filter:
new_df = df.set_index('Name').url.str.split(r',\s+', expand=True).stack()
(new_df[~new_df.str.contains('|'.join(keyword_list))]
.reset_index(level=1, drop=True)
.to_frame(name='url')
.reset_index()
)
Output:
Name url
0 A https://goo.com
1 B https://www.goo.com
2 C https://goo.com
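If the result should stay in the original dataframe (one row per Name, urls still comma-separated), a row-wise variant may be easier to merge back. This is only a sketch: it assumes the cells hold plain comma-separated urls without surrounding quotes, and the helper name drop_matching_urls is made up for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "url": [
        "https://foo.com, https://www.bar.org, https://goo.com",
        "https://foo.com, https://www.bar.org, https://www.goo.com",
        "https://foo.com, https://www.bar.org, https://goo.com",
    ],
})
keyword_list = ["foo", "bar"]

# re.escape guards against keywords that contain regex metacharacters.
pattern = re.compile("|".join(map(re.escape, keyword_list)))

def drop_matching_urls(cell):
    """Split a cell into urls, drop those containing a keyword, re-join."""
    urls = [u.strip() for u in cell.split(",")]
    return ", ".join(u for u in urls if not pattern.search(u))

df["url"] = df["url"].apply(drop_matching_urls)
print(df)
```

Because the column keeps its original index, there is nothing to realign afterwards.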
I have the next DataFrame:
a = [{'name': 'AAA|YYY'},{ 'name': 'BBB|LLL'}]
df = pd.DataFrame(a)
print(df)
name
0 AAA|YYY
1 BBB|LLL
and I'm trying to remove the part of the string from the right up to the character |:
df['name'] = [i.split('|')[:-1] for i in df['name']]
but I get the following result:
name
0 [AAA]
1 [BBB]
how can I get the following result?:
name
0 AAA
1 BBB
You're getting a list because [:-1] is a slice, and slicing the result of split returns a list of items rather than a single string.
To get your result, you just have to select the first part of the split, which is at index 0:
df['name'] = [i.split('|')[0] for i in df['name']]
Or if you have multiple occurrences of '|', and want to remove only the last part, you can join the remaining part after your selection:
df['name'] = ['|'.join(i.split('|')[:-1]) for i in df['name']]
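The same two variants can also be written with the pandas str accessor instead of a list comprehension. A minimal sketch; the third row ('A|B|C') is an added illustration of the multiple-separator case.

```python
import pandas as pd

df = pd.DataFrame({"name": ["AAA|YYY", "BBB|LLL", "A|B|C"]})

# Keep only the text before the first '|'.
first_part = df["name"].str.split("|").str[0]

# Drop only the last '|'-separated part: rsplit with n=1 splits once, from the right.
without_last = df["name"].str.rsplit("|", n=1).str[0]

print(first_part.tolist())
print(without_last.tolist())
```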
Problem: I have a dataframe column with values like 1.Goalkeeper, 4.Midfield... and I can't partially replace the string.
Objective: I want to turn these into 1.GK, 4.MD... but the replacement isn't applied; it's as if the lines below were never run. Any ideas?
The code works when the cell is exactly one of the keys, for example Goalkeeper or Midfield, but not when the value is prefixed with a number and a dot.
CODE
df2['Posicion'].replace({'Goalkeeper':'GK','Left-Back':'LI','Defensive Midfield':'MCD'
,'Right Midfield':'MD','Attacking Midfield':'MP','Right Winger':'ED','Centre-Forward':'DC',
'Centre-Back':'DFC','Right-Back':'LD','Central Midfield':'MC','Second Striker':'SD',
'Left Midfield':'MI','Left Winger':'EI','N':'','None':'','Sweeper':'DFC'}, inplace=True)
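regex=True will do the trick here. Without it, replace only swaps cells whose whole value equals a key; with it, each key is treated as a pattern and matched anywhere inside the string.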
df2 = pd.DataFrame({
'Posicion' : ['1.Goalkeeper', '2.Midfield', '3.Left Winger']
})
# Assign the result back rather than using inplace=True on a selected column.
df2['Posicion'] = df2['Posicion'].replace({'Goalkeeper':'GK',
                                           'Left Winger':'EI',
                                           'N':'',
                                           'None':'',
                                           'Sweeper':'DFC'},
                                          regex=True)
Output:
Posicion
0 1.GK
1 2.Midfield
2 3.EI
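To make the difference concrete, here is a small side-by-side sketch of the same mapping with and without regex=True. The Series values are made up for illustration.

```python
import pandas as pd

s = pd.Series(["1.Goalkeeper", "Goalkeeper", "3.Left Winger"])
mapping = {"Goalkeeper": "GK", "Left Winger": "EI"}

# Default: a cell is replaced only if its whole value equals a key.
exact = s.replace(mapping)

# regex=True: keys are patterns, so matching substrings are replaced too.
partial = s.replace(mapping, regex=True)

print(exact.tolist())
print(partial.tolist())
```

Since the keys become patterns, very short keys such as 'N' will match that letter anywhere in a value, so they may need anchoring (for example '^N$') if only whole-cell matches are intended.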
I've got a data frame with column names like 'AH_AP' and 'AH_AS'.
Essentially all I want to do is swap the part before the underscore and the part after the underscore so that the column headers are 'AP_AH' and 'AS_AH'.
I can do that if the elements are in a list, but I've no idea how to apply it to column names.
My solution if it were a list goes like this:
columns = ['AH_AP','AH_AS']
def rejig_col_names():
    elements_of_header = columns.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
I'm guessing I need to apply this to something like the below, but I've no idea how, or how to reference a single column within df.columns:
df.columns = df.columns.map()
Any help appreciated. Thanks :)
You can do it this way:
Input:
df = pd.DataFrame(data=[['1','2'], ['3','4']], columns=['AH_PH', 'AH_AS'])
print(df)
AH_PH AH_AS
0 1 2
1 3 4
Output:
df.columns = df.columns.str.split('_').str[::-1].str.join('_')
print(df)
PH_AH AS_AH
0 1 2
1 3 4
Explained:
Use the str accessor and the split method on '_'.
Then, using the str accessor with the reversing slice [::-1], you can reverse the order of each list.
Lastly, using the str accessor and join, we can concatenate the list back together again.
You were almost there: you can do
df.columns = df.columns.map(rejig_col_names)
except that the function gets called with a column name as argument, so change it like this:
def rejig_col_names(col_name):
    elements_of_header = col_name.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
An alternative to the other answer, using your function and DataFrame.rename:
import pandas as pd
def rejig_col_names(columns):
    elements_of_header = columns.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
data = {
    'A_B': [1, 2, 3],
    'C_D': [4, 5, 6],
}
df = pd.DataFrame(data)
df.rename(rejig_col_names, axis='columns', inplace=True)
print(df)
str.replace is also an option via swapping capture groups:
Sample input borrowed from ScottBoston
df = pd.DataFrame(data=[['1', '2'], ['3', '4']], columns=['AH_PH', 'AH_AS'])
Then capture everything before and after the '_' and swap capture groups 1 and 2:
df.columns = df.columns.str.replace(r'^(.*)_(.*)$', r'\2_\1', regex=True)
PH_AH AS_AH
0 1 2
1 3 4
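The approaches in this thread agree for names with a single underscore but diverge when a name has more than one. A small comparison sketch, where the column name 'A_B_C' is a made-up example:

```python
import pandas as pd

cols = pd.Index(["AH_AP", "A_B_C"])

def rejig_col_names(col_name):
    # Keeps only the first and last parts, swapped.
    parts = col_name.split("_")
    return parts[-1] + "_" + parts[0]

# Reverse every '_'-separated part.
reversed_parts = cols.str.split("_").str[::-1].str.join("_")

# Map the function over each column name.
first_and_last = cols.map(rejig_col_names)

# Greedy (.*) makes the split happen at the LAST underscore.
swapped_groups = cols.str.replace(r"^(.*)_(.*)$", r"\2_\1", regex=True)

print(list(reversed_parts))   # parts fully reversed
print(list(first_and_last))   # middle parts dropped
print(list(swapped_groups))   # last part moved to the front
```

Which behaviour is right depends on what the extra underscores mean in your data.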
I have read some pricing data into a pandas dataframe. The values appear as:
$40,000*
$40000 conditions attached
I want to strip it down to just the numeric values.
I know I can loop through and apply regex
[0-9]+
to each field then join the resulting list back together but is there a not loopy way?
Thanks
You could use Series.str.replace:
import pandas as pd
df = pd.DataFrame(['$40,000*','$40000 conditions attached'], columns=['P'])
print(df)
# P
# 0 $40,000*
# 1 $40000 conditions attached
df['P'] = df['P'].str.replace(r'\D+', '', regex=True).astype('int')
print(df)
yields
P
0 40000
1 40000
since \D matches any character that is not a decimal digit.
You could use pandas' replace method; also you may want to keep the thousands separator ',' and the decimal place separator '.'
import pandas as pd
df = pd.DataFrame(['$40,000.32*','$40000 conditions attached'], columns=['pricing'])
df['pricing'] = df['pricing'].replace(to_replace=r"\$([0-9,.]+).*", value=r"\1", regex=True)
print(df)
pricing
0 40,000.32
1 40000
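Note that the values above are still strings. If a numeric column is needed while keeping the decimal part, one possible follow-up (a sketch, assuming ',' is only ever a thousands separator and '.' the decimal point) is to extract the number, drop the commas and convert:

```python
import pandas as pd

df = pd.DataFrame({"pricing": ["$40,000.32*", "$40000 conditions attached"]})

# Pull out the first run of digits, commas and dots from each value.
number_text = df["pricing"].str.extract(r"([0-9][0-9,.]*)", expand=False)

# Drop thousands separators, then convert to float.
df["price"] = number_text.str.replace(",", "", regex=False).astype(float)

print(df)
```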
You could remove all the non-digits using re.sub():
value = re.sub(r"[^0-9]+", "", value)
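You don't need regex for the conversion itself. convert_objects(convert_numeric=True) used to do this, but it has been removed from pandas; pd.to_numeric is its replacement:
df['col'] = pd.to_numeric(df['col'], errors='coerce')
Note that values that are not plain numbers, such as '$40,000*', become NaN, so the extra characters still have to be stripped first.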
In case anyone is still reading this: I'm working on a similar problem and needed to rewrite an entire pandas column using a regex I had worked out with re.sub.
Here's the code that applies it to the whole column.
# add_map holds the replacement rules for the strings in the column.
add_map = dict([
("AV", "Avenue"),
("BV", "Boulevard"),
("BP", "Bypass"),
("BY", "Bypass"),
("CL", "Circle"),
("DR", "Drive"),
("LA", "Lane"),
("PY", "Parkway"),
("RD", "Road"),
("ST", "Street"),
("WY", "Way"),
("TR", "Trail"),
])
import re
obj = data_909['Address'].copy()  # data_909['Address'] contains the original addresses
for k, v in add_map.items():  # apply each rule in the dict
    rule1 = r"\b(%s)\b" % k  # match k only when it stands alone (\b is a word boundary)
    # Look the matched text up in add_map and fall back to the match itself.
    # upper() is needed because IGNORECASE also matches lowercase input.
    rule2 = lambda m: add_map.get(m.group().upper(), m.group())
    obj = obj.str.replace(rule1, rule2, regex=True, flags=re.IGNORECASE)  # case-insensitive match
data_909['Address_n'] = obj  # store the result in a new column
Hope this helps anyone searching for the problem I had. Cheers
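Since the replacement function already looks up whatever was matched, the loop over the dict is not strictly necessary: all keys can go into one alternation and be replaced in a single pass. A minimal sketch with a shortened add_map and made-up addresses; the names pattern and expand are illustrative only.

```python
import re
import pandas as pd

add_map = {"AV": "Avenue", "RD": "Road", "ST": "Street"}  # shortened for the example
addresses = pd.Series(["12 Main st", "4 Oak AV", "9 Stone Rd"])

# One pattern matching any key as a whole word.
pattern = r"\b(?:%s)\b" % "|".join(map(re.escape, add_map))

def expand(match):
    # Keys are uppercase, so normalise the matched text before the lookup.
    return add_map[match.group().upper()]

expanded = addresses.str.replace(pattern, expand, regex=True, flags=re.IGNORECASE)
print(expanded.tolist())
```

The word boundaries keep 'Stone' intact even though it starts with 'St'.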
Hi, say I have a column named submission in a data frame that contains:
mhttps://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip
I want a new column, say Zest, holding the value DDA1610095,
and another new column, say type, holding .zip. How can I do that using pandas?
you can use str.split to extract the zip from the url
df
url
0 mhttps://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip
# Take the last '/'-separated piece of each row's url.
df['zip'] = df.url.str.split('/').str[-1]
df.T
Out[46]:
0
url mhttps://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip
zip DDA1610095.zip
Try using str.split followed by another .str accessor, so you can index into the split result of each row.
data = [{'ID' : '1',
'URL': 'https://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip'}]
df = pd.DataFrame(data)
print(df)
ID URL
0 1 https://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d...
#Get the file name and remove the literal '.zip' (regex=False so '.' is not treated as a wildcard)
df['Zest'] = df.URL.str.split('/').str[-1].str.replace('.zip', '', regex=False)
#assign the type into the next column.
df['Type'] = df.URL.str.split('.').str[-1]
print(df)
ID URL Zest Type
0 1 https://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d... DDA1610095 zip
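Another option is to pull both pieces out in one pass with str.extract and named groups, which also keeps the leading dot on the extension as the question asked. A sketch only; the pattern assumes the url ends in a file name with an extension.

```python
import pandas as pd

df = pd.DataFrame({
    "URL": ["https://ckd.pdc.com/pdc/73ba5189-94fd-44aa-88d3-6b36aaa69b02/DDA1610095.zip"]
})

# Zest: text after the last '/', up to the final dot. Type: the final dot and extension.
parts = df["URL"].str.extract(r"/(?P<Zest>[^/]+?)(?P<Type>\.[^./]+)$")

# Named groups become column names; join them onto the original frame.
df = df.join(parts)
print(df[["Zest", "Type"]])
```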