I am trying to extract a string from a longer string in one of my columns.
Here is a sample of what I have tried:
df['Campaign'] = df.full_utm.str.extract('utm_campaign=([^&]*)')
and this is a sample of the string I am referring to:
?utm_source=Facebook&utm_medium=CPC&utm_campaign=April+Merchants+LAL+-+All+SA+-+CAP+250&utm_content=01noprice
The problem is that this only returns this:
A
The desired output in this context would be
April+Merchants+LAL+-+All+SA+-+CAP+250
Use urlparse
Ex:
import urllib.parse as urlparse
df['Campaign'] = df["full_utm"].apply(lambda x: urlparse.parse_qs(urlparse.urlparse(x).query)["utm_campaign"][0])
print(df)
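Note that parse_qs returns a list for each parameter and decodes '+' to spaces. A minimal sketch of that behaviour on the sample string, outside pandas:

```python
from urllib.parse import urlparse, parse_qs

s = "?utm_source=Facebook&utm_medium=CPC&utm_campaign=April+Merchants+LAL+-+All+SA+-+CAP+250&utm_content=01noprice"
# parse_qs maps each key to a list of values and decodes '+' to spaces
params = parse_qs(urlparse(s).query)
campaign = params["utm_campaign"][0]
print(campaign)  # April Merchants LAL - All SA - CAP 250
```

If the raw '+' characters need to be preserved exactly as they appear in the URL, a regex approach like str.extract keeps them untouched.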
I've a URL like this:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
x= 'Enterprise-Business-Planning-Analyst_3103928-1'
I want to extract the id at the end of the url (the x part from the string above) to get the unique id.
Any help regarding this will be highly appreciated.
_parsed_url.path.split("/")[-1].split('-')[-1]
I am using this, but it is giving an error.
Python's urllib.parse and pathlib builtin libraries can help here.
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
from urllib.parse import urlparse
from pathlib import PurePath
x = PurePath(urlparse(url).path).name
print(x)
# Enterprise-Business-Planning-Analyst_3103928-1
To print the text Enterprise-Business-Planning-Analyst_3103928-1 you can split() on the / character:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(url.split("/")[-1])
# Enterprise-Business-Planning-Analyst_3103928-1
To print the text 3103928 you can replace the _ character with - and then split() on the - character:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(url.replace("_", "-").split("-")[-2])
# 3103928
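If only the numeric id is needed, a regex is less fragile than counting split fields. A sketch, assuming the id always appears as `_<digits>-<digits>` at the end of the URL:

```python
import re

url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
# capture the digits between the final underscore and the trailing revision number
m = re.search(r'_(\d+)-\d+$', url)
print(m.group(1))  # 3103928
```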
I'm passing through some URLs and I'd like to strip a part of them which changes dynamically, so I don't know it firsthand.
An example url is:
https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2
And I'd like to strip the gid=lostchapter part without any of the rest.
How do I do that?
You can use urllib to convert the query string into a Python dict and access the desired item:
In [1]: from urllib import parse
In [2]: s = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
In [3]: q = parse.parse_qs(parse.urlsplit(s).query)
In [4]: q
Out[4]:
{'pid': ['2'],
'gid': ['lostchapter'],
'lang': ['en_GB'],
'practice': ['1'],
'channel': ['desktop'],
'demo': ['2']}
In [5]: q["gid"]
Out[5]: ['lostchapter']
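If the goal is the URL without the gid parameter (rather than the parameter itself), the parsed dict can be re-encoded and the URL rebuilt. A sketch using a placeholder host (example.com), since the real one is elided in the question:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

s = "https://example.com/?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
parts = urlsplit(s)
q = parse_qs(parts.query)
q.pop("gid", None)  # drop the unwanted parameter
# re-encode the remaining parameters and reassemble the URL
rebuilt = urlunsplit(parts._replace(query=urlencode(q, doseq=True)))
print(rebuilt)  # https://example.com/?pid=2&lang=en_GB&practice=1&channel=desktop&demo=2
```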
Here is a simple way to strip it (note this relies on gid being the second query parameter):
urls = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
# Take the query string after the "?"
query = urls.split("?")[1]
# "gid" is the second "&"-separated parameter in this URL
url = query.split("&")[1]
# Print the result
print(url) # Prints "gid=lostchapter"
Method 1: Using urlparse
from urllib.parse import urlparse
p = urlparse('https://.../?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2')
param: list[str] = [i for i in p.query.split('&') if i.startswith('gid=')]
Output: ['gid=lostchapter']
Method 2: Using Regex
import re
param: str = re.search(r'gid=[^&]*', 'https://.../?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2').group()
You can change the regex pattern to match something more specific; currently it extracts whatever follows gid= up to the next &.
We can try doing a regex replacement:
import re
url = "https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2"
output = re.sub(r'(?<=[?&])gid=lostchapter&?', '', url)
print(output) # https://...?pid=2&lang=en_GB&practice=1&channel=desktop&demo=2
For a more generic replacement, match on the following regex pattern:
(?<=[?&])gid=\w+&?
Using string slicing (I'm assuming there will be an '&' after gid=lostchapter):
url = r'https://...?pid=2&gid=lostchapter&lang=en_GB&practice=1&channel=desktop&demo=2'
start = url.find('gid')
end = start + url[start:].find('&')
url = url[start:end]
print(url)
output
gid=lostchapter
What I'm trying to do here is:
find the index of the first occurrence of "gid"
find the first "&" after "gid" is found
take the slice of the url from "gid" up to that "&"
I've got the next csv file:
Summary,Issue key,Issue id,Issue Type,Status,Project key,Attachment,Attachment.1,Attachment.2,Attachment.3,Attachment.4,Attachment.5
Find issue,IS-11,576,Task,Solved,One-1,10/28/21 11:49;Olga_Sokolova;SALUPRJBKK-1663_2021-10-28 14-38-01-372.mp4;file://SALUPRJBKK/SALUPRJBKK-1663/SALUPRJBKK-1663_2021-10-28 14-38-01-372.mp4
I need to choose all the attachments values and replace the "space" in the filename to "%20".
The main problem is to skip the first 'space' after the date in the attachment value and also to get all the attachment value.
I tried to use the standard csv reader, pandas, etc., but I can only get the name of the column.
import pandas as pd
data = pd.read_csv("SALUPRJBKK_new_10.csv")
for i in data:
    if "Attachment" in i:
        print(i)
Select the 'Attachment' columns using filter, replace all whitespace with '%20', then update your dataframe in place:
df.update(df.filter(like='Attachment').replace(' ', '%20', regex=True))
My advice, if you need to percent-encode the values properly, is to use quote from the urllib module:
from urllib.parse import quote
df.update(df.filter(like='Attachment').fillna('').applymap(quote))
Update
Try:
out = df.filter(like='Attachment').unstack().str.split(';').explode()
out = out.where(~(out.str.startswith('file://').fillna(False)),
                out.str.replace(' ', '%20'))
df.update(out.dropna().groupby(level=[0, 1]).apply(';'.join).unstack(0))
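The same per-attachment logic can be sketched without pandas, assuming the ';'-separated layout from the sample row (date;author;filename;path):

```python
att = ("10/28/21 11:49;Olga_Sokolova;"
       "SALUPRJBKK-1663_2021-10-28 14-38-01-372.mp4;"
       "file://SALUPRJBKK/SALUPRJBKK-1663/SALUPRJBKK-1663_2021-10-28 14-38-01-372.mp4")
fields = att.split(";")
# encode spaces only in the filename and path fields, leaving the date field untouched
fields[2:] = [f.replace(" ", "%20") for f in fields[2:]]
print(";".join(fields))
```

This skips the space inside the date automatically, because the date sits in its own ';'-separated field.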
consider specifying the separator:
data = pd.read_csv("SALUPRJBKK_new_10.csv", sep=",")
Note: there is a mix of separators in your example: "," and ";"
and if you want to access only "Attachment", do:
data["Attachment"]
I think you need URL encoding.
Try this:
import urllib.parse
query = 'Hellóóó W r l d # Pyt## h on.mp4'
newUrl = urllib.parse.quote(query)
print(newUrl)
And now the encoded result:
Hell%C3%B3%C3%B3%C3%B3%20W%20r%20l%20d%20%23%20Pyt%23%23%20h%20on.mp4
The string is encoded and all the special characters are percent-escaped.
I need to merge two dataframe by using url as a primary key. However, there are some extra strings in the url like in df1, I have https://www.mcdonalds.com/us/en-us.html, where in df2, I have https://www.mcdonalds.com
I need to remove the /us/en-us.html after the .com and the https:// from the url, so I can perform the merge using url between 2 dfs. Below is a simplified example. What would be the solution for this?
df1={'url': ['https://www.mcdonalds.com/us/en-us.html','https://www.cemexusa.com/find-your-location']}
df2={'url':['https://www.mcdonalds.com','www.cemexusa.com']}
df1['url']==df2['url']
Out[7]: False
Thanks.
URLs are not trivial to parse. Take a look at the urllib module in the standard library.
Here's how you could remove the path after the domain:
import urllib.parse
def remove_path(url):
    parsed = urllib.parse.urlparse(url)
    parsed = parsed._replace(path='')
    return urllib.parse.urlunparse(parsed)
df1['url'] = df1['url'].apply(remove_path)
You can use urlparse as suggested by others, or you could also use urlsplit. However, both will not handle www.cemexusa.com. So if you do not need the scheme in your key, you could use something like this:
def to_key(url):
    if "://" not in url:  # or: not re.match("(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname
df1["Key"] = df1["URL"].apply(to_key)
Here is a full working example:
import pandas as pd
import io
from urllib.parse import urlsplit
df1_data = io.StringIO("""
URL,Description
https://www.mcdonalds.com/us/en-us.html,Junk Food
https://www.cemexusa.com/find-your-location,Cemex
""")
df2_data = io.StringIO("""
URL,Last Update
https://www.mcdonalds.com,2021
www.cemexusa.com,2020
""")
df1 = pd.read_csv(df1_data)
df2 = pd.read_csv(df2_data)
def to_key(url):
    if "://" not in url:  # or: not re.match("(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname
df1["Key"] = df1["URL"].apply(to_key)
df2["Key"] = df2["URL"].apply(to_key)
joined = df1.merge(df2, on="Key", suffixes=("_df1", "_df2"))
# and if you want to get rid of the original urls
joined = joined.drop(["URL_df1", "URL_df2"], axis=1)
The output of print(joined) would be:
Description Key Last Update
0 Junk Food www.mcdonalds.com 2021
1 Cemex www.cemexusa.com 2020
There may be other special cases not handled in this answer. Depending on your data, you may also need to handle an omitted www:
urlsplit("https://realpython.com/pandas-merge-join-and-concat").hostname
# realpython.com
urlsplit("https://www.realpython.com").hostname # also a valid URL
# www.realpython.com
What is the difference between urlparse and urlsplit?
It depends on your use case and what information you would like to extract. Since you do not need the URL's params, I would suggest using urlsplit.
[urlsplit()] is similar to urlparse(), but does not split the params from the URL. https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit
Use urlparse and isolate the hostname:
from urllib.parse import urlparse
urlparse('https://www.mcdonalds.com/us/en-us.html').hostname
# 'www.mcdonalds.com'
I'm coding a python script to check a bunch of URLs and get their ID text. The URLs follow this sequence:
http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
Up to
http://XXXXXXX.XXX/index.php?id=YYYYYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
What I'm trying to do is get only the numbers after the id= and before the &.
I've tried the regex (\D+)(\d+), but I'm also getting the auth numbers.
Any suggestion on how to get only the id sequence?
Another way is to use split:
string = 'http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX'
string.split('id=')[1].split('&auth=')[0]
Output:
YY
These are URL addresses, so I would just use url parser in that case.
Look at urllib.parse
Use urlparse to get query parameters, and then parse_qs to get query dict.
import urllib.parse as p
url = "http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"
query = p.urlparse(url).query
params = p.parse_qs(query)
print(params['id'][0])  # parse_qs returns a list, so take the first element
You can include the start and stop tokens in the regex:
pattern = r'id=(\d+)(?:&|$)'
You can try this regex
import re
urls = ["http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX", "http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX", "http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"]
for url in urls:
    id_value = re.search(r"id=(.*?)(?=&)", url).group(1)
    print(id_value)
that will get you the id value from the URL
YY
YYY
YYYY
variables = """http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX""".splitlines()
for v in variables:
    p1 = v.split("id=")[1]
    p2 = p1.split("&")[0]
    print(p2)
output:
YY
YYY
YYYY
If you prefer regex
import re
variables = """http://XXXXXXX.XXX/index.php?id=YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=YYYY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"""
pattern = "id=(.*)\\&"
x = re.findall(pattern, variables)
print(x)
output:
['YY', 'YYY', 'YYYY']
I don't know whether by "only numbers after id= and before &" you mean that there could be letters mixed in with the numbers between them, so I also thought of this:
import re
variables = """http://XXXXXXX.XXX/index.php?id=5Y44Y&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=Y2242YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX
http://XXXXXXX.XXX/index.php?id=5YY453YY&auth=XXXYYYXXXYYYXXXYYYXXXYYYX"""
pattern = "id=(.*)\\&"
x = re.findall(pattern, variables)
print(x)
x2 = []
for p in x:
    x2.append(re.sub("\\D", "", p))
print(x2)
Output:
['5Y44Y', 'Y2242YY', '5YY453YY']
['544', '2242', '5453']
Use the regex id=[0-9]+:
import re
pattern = "id=[0-9]+"
id = re.findall(pattern, url)[0].split("id=")[1]
If you do it this way, there is no need for &auth to follow the id, which makes it very versatile; at the same time, a trailing &auth won't break it. It works for the edge cases as well as the simple ones.