I'm trying to parse a site. I don't want to use Selenium; requests handles it fine. But something strange is happening: I can't cut out the text I need with a regular expression, even though it's there (you can see it if you do print(data.text)). re just doesn't see it. If this text is copied into Notepad++, it shows these characters as a single line.
import requests
import re
data = requests.get('https://ru.runetki3.com/?page=1')
print(data.text)
What is this, and how do I work with it?
You can try to use their Ajax API to load all usernames + thumb images:
import pandas as pd
import requests
url = 'https://ru.runetki3.com/tools/listing_v3.php?livetab=female&offset=0&limit={}'
headers = {'X-Requested-With': 'XMLHttpRequest'}
all_data = []
for p in range(1, 4):  # <-- increase number of pages here
    data = requests.get(url.format(p * 144), headers=headers).json()
    for m in data['models']:
        all_data.append((m['username'], m['display_name'], m['thumb_image'].replace('{ext}', 'jpg')))
df = pd.DataFrame(all_data, columns=['username', 'display_name', 'thumb'])
print(df.head())
Prints:
username display_name thumb
0 wetlilu Little_Lilu //i.bimbolive.com/live/034/263/131/xbig_lq/c30823.jpg
1 mellannie8 mellannieSEX //i.bimbolive.com/live/034/24f/209/xbig_lq/314348.jpg
2 mokkoann mokkoann //i.bimbolive.com/live/034/270/279/xbig_lq/cb25cb.jpg
3 ogurezzi CynEp-nuCbka //i.bimbolive.com/live/034/269/02c/xbig_lq/3ebe2a.jpg
4 Pepetka22 _-Katya-_ //i.bimbolive.com/live/034/24f/36e/xbig_lq/18da8e.jpg
Avoid using . in a regex unless you really want to match any character; here, the usernames (as far as I can see) contain only - and alphanumeric characters, so you can retrieve them with (note that inside a character class, | is a literal pipe rather than alternation, so it doesn't belong there):
re.findall(r'"username":"([\w-]+)"', data.text)
An even simpler way, which removes the need to deal with special characters entirely, is to match every character except ":
re.findall(r'"username":"([^"]+)"',data.text)
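For instance, on a small sample string that mimics the embedded JSON (a made-up sample, just to show the negated character class at work):
import re

sample = '{"username":"JuliaCute","display_name":"_Cute"}'  # hypothetical sample
print(re.findall(r'"username":"([^"]+)"', sample))
# ['JuliaCute']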
So here's a way of getting the info you seek (I joined them into a dictionary, but you can change that to whatever you prefer):
import requests
import re
data = requests.get('https://ru.runetki3.com/?page=1')
with open("return.txt", 'w', encoding='utf-8') as f:
    f.write(data.text)
names = re.findall(r'"username":"([^"]+)"',data.text)
disp_names = re.findall(r'"display_name":"([^"]+)"',data.text)
thumbs = re.findall(r'"thumb_image":"([^"]+)"',data.text)
names_dict = {name:[disp, thumb.replace('{ext}', 'jpg')] for name, disp, thumb in zip(names, disp_names, thumbs)}
Example
names_dict['JuliaCute']
# ['_Cute',
# '\\/\\/i.bimbolive.com\\/live\\/055\\/2b0\\/15d\\/xbig_lq\\/d89ef4.jpg']
I've a URL like this:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
x= 'Enterprise-Business-Planning-Analyst_3103928-1'
I want to extract the id at the end of the url (the x part of the string above) to get the unique id.
Any help regarding this will be highly appreciated.
_parsed_url.path.split("/")[-1].split('-')[-1]
I am using this, but it is giving an error.
Python's urllib.parse and pathlib builtin libraries can help here.
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
from urllib.parse import urlparse
from pathlib import PurePath
x = PurePath(urlparse(url).path).name
print(x)
# Enterprise-Business-Planning-Analyst_3103928-1
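Note that urlparse's .path excludes any query string, so this should also behave when the link carries extra parameters (the ?source=email below is a made-up example):
x = PurePath(urlparse(url + '?source=email').path).name
print(x)
# Enterprise-Business-Planning-Analyst_3103928-1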
To print the text Enterprise-Business-Planning-Analyst_3103928-1 you can split() with respect to the / character:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(url.split("/")[-1])
# Enterprise-Business-Planning-Analyst_3103928-1
To print the text 3103928 you can replace the _ character with - and then split() on the - character:
url = 'https://hp.wd5.myworkdayjobs.com/en-US/ExternalCareerSite/job/Enterprise-Business-Planning-Analyst_3103928-1'
print(url.replace("_", "-").split("-")[-2])
# 3103928
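If you only want the numeric id, a regex is another option; a sketch assuming the id always appears as _<digits>-<digits> at the end of the URL:
import re

m = re.search(r'_(\d+)-\d+$', url)
print(m.group(1))
# 3103928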
I need to merge two dataframes using the url as a primary key. However, there are extra strings in some urls: in df1 I have https://www.mcdonalds.com/us/en-us.html, where in df2 I have https://www.mcdonalds.com.
I need to remove the /us/en-us.html after the .com, and the https:// from the url, so I can perform the merge on url between the 2 dfs. Below is a simplified example. What would be the solution for this?
df1 = {'url': ['https://www.mcdonalds.com/us/en-us.html', 'https://www.cemexusa.com/find-your-location']}
df2={'url':['https://www.mcdonalds.com','www.cemexusa.com']}
df1['url']==df2['url']
Out[7]: False
Thanks.
URLs are not trivial to parse. Take a look at the urllib module in the standard library.
Here's how you could remove the path after the domain:
import urllib.parse
def remove_path(url):
    parsed = urllib.parse.urlparse(url)
    parsed = parsed._replace(path='')
    return urllib.parse.urlunparse(parsed)
df1['url'] = df1['url'].apply(remove_path)
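For example:
print(remove_path('https://www.mcdonalds.com/us/en-us.html'))
# https://www.mcdonalds.com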
You can use urlparse as suggested by others, or you could also use urlsplit. However, neither will handle www.cemexusa.com (it has no scheme). So if you do not need the scheme in your key, you could use something like this:
def to_key(url):
    if "://" not in url:  # or: not re.match(r"(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname
df1["Key"] = df1["URL"].apply(to_key)
Here is a full working example:
import pandas as pd
import io
from urllib.parse import urlsplit
df1_data = io.StringIO("""
URL,Description
https://www.mcdonalds.com/us/en-us.html,Junk Food
https://www.cemexusa.com/find-your-location,Cemex
""")
df2_data = io.StringIO("""
URL,Last Update
https://www.mcdonalds.com,2021
www.cemexusa.com,2020
""")
df1 = pd.read_csv(df1_data)
df2 = pd.read_csv(df2_data)
def to_key(url):
    if "://" not in url:  # or: not re.match(r"(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname
df1["Key"] = df1["URL"].apply(to_key)
df2["Key"] = df2["URL"].apply(to_key)
joined = df1.merge(df2, on="Key", suffixes=("_df1", "_df2"))
# and if you want to get rid of the original urls
joined = joined.drop(["URL_df1", "URL_df2"], axis=1)
The output of print(joined) would be:
Description Key Last Update
0 Junk Food www.mcdonalds.com 2021
1 Cemex www.cemexusa.com 2020
There may be other special cases not handled in this answer. Depending on your data, you may also need to handle an omitted www:
urlsplit("https://realpython.com/pandas-merge-join-and-concat").hostname
# realpython.com
urlsplit("https://www.realpython.com").hostname # also a valid URL
# www.realpython.com
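If that matters for your data, one way to normalize (a sketch, assuming you want both forms to compare equal) is to strip a leading www. from the hostname:
def to_key(url):
    if "://" not in url:
        url = f"https://{url}"
    host = urlsplit(url).hostname
    # str.removeprefix needs Python 3.9+; on older versions use
    # host[4:] if host.startswith("www.") else host
    return host.removeprefix("www.")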
What is the difference between urlparse and urlsplit?
It depends on your use case and what information you would like to extract. Since you do not need the URL's params, I would suggest using urlsplit.
[urlsplit()] is similar to urlparse(), but does not split the params from the URL. https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit
Use urlparse and isolate the hostname:
from urllib.parse import urlparse
urlparse('https://www.mcdonalds.com/us/en-us.html').hostname
# 'www.mcdonalds.com'
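Applied to the question's data (a minimal sketch; note this assumes every url carries a scheme, unlike www.cemexusa.com in df2):
import pandas as pd
from urllib.parse import urlparse

df1 = pd.DataFrame({'url': ['https://www.mcdonalds.com/us/en-us.html']})
df1['key'] = df1['url'].apply(lambda u: urlparse(u).hostname)
print(df1['key'].iloc[0])
# www.mcdonalds.com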
I made a function that scrapes the last 64 characters of text from a website and adds them to url1, resulting in new_url. I want to repeat the process by scraping the last 64 characters from the resulting URL (new_url) and adding them to url1 again. The goal is to repeat this until I hit a page whose last 3 characters are "END".
Here is my code so far:
from urllib import request

# function
def getlink(url):
    url1 = 'https://www.random.computer/api.php?file='
    req = request.urlopen(url)
    link = req.read().splitlines()
    for i, line in enumerate(link):
        text = line.decode('utf-8')
        last64 = text[-64:]
        new_url = url1 + last64
        return new_url
getlink('https://www.random/api.php?file=abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz012345678910')
#output
'https://www.random/api.php?file=zyxwvutsrqponmlkjihgfedcba012345678910abcdefghijklmnopqrstuvwxyz'
My trouble is figuring out a way to be able to repeat the function on its output. Any help would be appreciated!
A simple loop should work. I've removed the first token as it may be sensitive information. Just change the WRITE_YOUR_FIRST_TOKEN_HERE string to the code for the first link.
from urllib import request
def get_chunk(chunk, url='https://www.uchicago.computer/api.php?file='):
    with request.urlopen(url + chunk) as f:
        return f.read().decode('UTF-8').strip()

if __name__ == '__main__':
    chunk = 'WRITE_YOUR_FIRST_TOKEN_HERE'
    while chunk[-3:] != "END":
        chunk = get_chunk(chunk[-64:])
        print(chunk)
        # Chunk is a string, do whatever you want with it,
        # like chunk.splitlines() to get a list of the lines
read() gets the byte stream, decode() turns it into a string, and strip() removes leading and trailing whitespace (like \n) so that it doesn't mess with the last 64 chars (if one of the last 64 chars is a \n, you only get 63 useful chars of the token).
Try the code below. It should do what you describe above:
import requests
from bs4 import BeautifulSoup
def getlink(url):
    url1 = 'https://www.uchicago.computer/api.php?file='
    response = requests.post(url)
    doc = BeautifulSoup(response.text, 'html.parser')
    text = doc.decode('utf-8')
    last64 = text[-65:-1]
    new_url = url1 + last64
    return new_url

def caller(url):
    url = getlink(url)
    if not url[-3:] == 'END':
        print(url)
        caller(url)
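One caveat: CPython limits recursion to roughly 1000 frames by default, so a long chain of pages will eventually raise RecursionError with the recursive caller. The same logic as a loop (a sketch):
def caller(url):
    # iterate instead of recursing; safe for arbitrarily long chains
    url = getlink(url)
    while url[-3:] != 'END':
        print(url)
        url = getlink(url)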
I've created a script in Python to get the names of neighborhoods from a webpage. I've used the requests library along with the re module to parse the content out of a script tag on that site. When I run the script, I get the names of the neighborhoods correctly. However, the problem is that I've used the line if not item.startswith("NY:"): continue to get rid of unwanted results from that page. I do not wish to use the hardcoded portion NY: to do this trick.
website link
I've tried with:
import re
import json
import requests
link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
data = json.loads(re.findall(r'data-hypernova-key[^{]+(.*)--></script>',resp.text)[0])
items = data['searchPageProps']['filterPanelProps']['filterInfoMap']
for item in items:
    if not item.startswith("NY:"): continue
    print(item)
Result I'm getting (desired result):
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
If I do not use this line if not item.startswith("NY:"):continue, the results are something like:
rating
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
NY:New_York:Staten_Island:Lighthouse_Hill
NY:New_York:Queens:Rochdale
NY:New_York:Queens:Pomonok
BusinessParking.validated
food_court
NY:New_York:Queens:Little_Neck
The bottom line is that I wish to get everything that starts with NY:New_York:. What I mean by unwanted results are rating, BusinessParking.validated, food_court and so on.
How can I get the neighborhoods without hardcoding any portion of the search string within the script?
I'm not certain what your complete data set looks like, but based on your sample,
you might use something like:
if ':' not in item:
    continue

# or perhaps:
if item.count(':') < 3:
    continue

# I'd prefer a list comprehension if I didn't need the other data
items = [x for x in data['searchPageProps']['filterPanelProps']['filterInfoMap'] if ':' in x]
If that doesn't work for what you're trying to achieve, you could just use a variable for the state, as sketched below.
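A sketch (state here is hypothetical; in practice it would come from your search parameters rather than being hardcoded in the script):
state = "NY"
items = [x for x in items if x.startswith(f"{state}:")]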
Another solution, using BeautifulSoup, which doesn't involve regex or hardcoding "NY:New_York", is below; it's convoluted, but mainly because Yelp buried its treasure several layers deep...
So for future reference:
from bs4 import BeautifulSoup as bs
import json
import requests
link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
soup = bs(resp.text, 'html.parser')
target = soup.find_all('script')[14]
content = target.text.replace('<!--','').replace('-->','')
js_data = json.loads(content)
And now the fun of extracting NYC info from the json begins....
for a in js_data:
    if a == 'searchPageProps':
        level1 = js_data[a]
        for b in level1:
            if b == 'filterPanelProps':
                level2 = level1[b]
                for c in level2:
                    if c == 'filterSets':
                        level3 = level2[c][1]
                        for d in level3:
                            if d == 'moreFilters':
                                level4 = level3[d]
                                for e in range(len(level4)):
                                    print(level4[e]['title'])
                                    print(level4[e]['sectionFilters'])
                                    print('---------------')
The output is the name of each borough plus a list of all neighborhoods in that borough. For example:
Manhattan
['NY:New_York:Manhattan:Alphabet_City',
'NY:New_York:Manhattan:Battery_Park',
'NY:New_York:Manhattan:Central_Park', 'NY:New_York:Manhattan:Chelsea',
'...]
etc.
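Since each nested if only ever matches one key, the whole traversal can also be written as direct indexing (assuming the JSON structure stays as above):
for section in js_data['searchPageProps']['filterPanelProps']['filterSets'][1]['moreFilters']:
    print(section['title'])
    print(section['sectionFilters'])
    print('---------------')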
I'm having trouble trying to make this work
import requests
import random
response = requests.get("https://cdn.discordapp.com/attachments/480168592164257792/557872162661335040/aaaaa.txt")
data = response.text
for line in data:
    print(line)
I am trying to pull a txt file from the internet and use the list inside of the text file.
Right now all it does is treat each letter as a separate string(?)
response.text is a single string; if you loop over a string, you get individual characters (read about how Python handles strings).
In this case Python doesn't know what a "line" is, so split the data on newlines and try again:
import requests
import random
response = requests.get("https://cdn.discordapp.com/attachments/480168592164257792/557872162661335040/aaaaa.txt")
data = response.text
for line in data.split("\n"):
    print(line)
The attribute response.text is a string, so iterating over it will give you individual chars. You can split the string by spaces (or maybe be newlines) to get what you need (I also added a few print statements to show the steps):
import requests
response = requests.get(
    "https://cdn.discordapp.com/attachments/480168592164257792/557872162661335040/aaaaa.txt")
print('response.text type:', type(response.text))
print('response.text len:', len(response.text))
print(response.text)
print()
print('splitting by spaces:')
for i, s in enumerate(response.text.split()):
    print(i, s)
print()
print('splitting by newlines:')
for i, line in enumerate(response.text.split('\n')):
    print(i, line)
The code gives this output:
response.text type: <class 'str'>
response.text len: 21
a = ["please","work"]
splitting by spaces:
0 a
1 =
2 ["please","work"]
splitting by newlines:
0 a = ["please","work"]
@bruno suggested in a comment to use str.splitlines(); this will work even if the response is bytes, since the method bytes.splitlines() also exists.
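And if the goal is to actually use the list inside the file (the body is the single line a = ["please","work"]), ast.literal_eval can parse that literal safely; a sketch assuming exactly that format:
import ast

line = response.text.splitlines()[0]   # 'a = ["please","work"]'
value = ast.literal_eval(line.split('=', 1)[1].strip())
print(value)
# ['please', 'work']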