Repeat a Python function on its own output

I made a function that scrapes the last 64 characters of text from a website and adds it to url1, resulting in new_url. I want to repeat the process by scraping the last 64 characters from the resulting URL (new_url) and adding it to url1 again. The goal is to repeat this until I hit a website where the last 3 characters are "END".
Here is my code so far:
# function
from urllib import request

def getlink(url):
    url1 = 'https://www.random.computer/api.php?file='
    req = request.urlopen(url)
    link = req.read().splitlines()
    for i, line in enumerate(link):
        text = line.decode('utf-8')
        last64 = text[-64:]
        new_url = url1 + last64
    return new_url
getlink('https://www.random/api.php?file=abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz012345678910')
#output
'https://www.random/api.php?file=zyxwvutsrqponmlkjihgfedcba012345678910abcdefghijklmnopqrstuvwxyz'
My trouble is figuring out a way to repeat the function on its own output. Any help would be appreciated!

A simple loop should work. I've removed the first token, as it may be sensitive information. Just replace the WRITE_YOUR_FIRST_TOKEN_HERE string with the code for the first link.
from urllib import request

def get_chunk(chunk, url='https://www.uchicago.computer/api.php?file='):
    with request.urlopen(url + chunk) as f:
        return f.read().decode('UTF-8').strip()

if __name__ == '__main__':
    chunk = 'WRITE_YOUR_FIRST_TOKEN_HERE'
    while chunk[-3:] != "END":
        chunk = get_chunk(chunk[-64:])
        print(chunk)
    # chunk is a string; do whatever you want with it,
    # e.g. chunk.splitlines() to get a list of the lines
read() gets the byte stream, decode() turns it into a string, and strip() removes leading and trailing whitespace (like \n) so that it doesn't interfere with the last 64 characters (if the last 64 characters include a trailing \n, you only get 63 characters of the actual token).
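To make that concrete, here's a minimal sketch with a made-up byte string ending in a newline:

raw = b'A' * 64 + b'\n'                         # hypothetical 64-char token plus '\n'
print(len(raw.decode('UTF-8')[-64:]))           # 64 chars, but one of them is the '\n'
print(len(raw.decode('UTF-8').strip()[-64:]))   # 64 real token characters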

Try the code below. It should do what you describe above.
import requests
from bs4 import BeautifulSoup

def getlink(url):
    url1 = 'https://www.uchicago.computer/api.php?file='
    response = requests.get(url)  # GET, matching the urlopen call in the question
    doc = BeautifulSoup(response.text, 'html.parser')
    text = str(doc)               # doc.decode('utf-8') would pass 'utf-8' as indent_level and raise
    last64 = text[-65:-1]         # take 64 chars, skipping the final character (e.g. a trailing newline)
    new_url = url1 + last64
    return new_url

def caller(url):
    url = getlink(url)
    if url[-3:] != 'END':
        print(url)
        caller(url)
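One caveat with the recursive caller: a very long chain of links will hit Python's default recursion limit (about 1000 frames). An iterative sketch with the same logic avoids that:

def caller(url):
    url = getlink(url)
    while url[-3:] != 'END':
        print(url)
        url = getlink(url)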

Related

Python requests returns incomprehensible content

I'm trying to parse the site without using Selenium; requests copes fine. BUT! Something strange is happening: I can't cut out the text I need with a regular expression, even though it's there (you can see it if you do print(data.text)), yet re doesn't see it. If this text is copied into Notepad++, it shows these characters as a single line.
import requests
import re
data = requests.get('https://ru.runetki3.com/?page=1')
print(data.text)
What is this, and how can I work with it?
You can try to use their Ajax API to load all usernames + thumb images:
import pandas as pd
import requests

url = 'https://ru.runetki3.com/tools/listing_v3.php?livetab=female&offset=0&limit={}'
headers = {'X-Requested-With': 'XMLHttpRequest'}

all_data = []
for p in range(1, 4):  # <-- increase number of pages here
    data = requests.get(url.format(p * 144), headers=headers).json()
    for m in data['models']:
        all_data.append((m['username'], m['display_name'], m['thumb_image'].replace('{ext}', 'jpg')))

df = pd.DataFrame(all_data, columns=['username', 'display_name', 'thumb'])
print(df.head())
Prints:
username display_name thumb
0 wetlilu Little_Lilu //i.bimbolive.com/live/034/263/131/xbig_lq/c30823.jpg
1 mellannie8 mellannieSEX //i.bimbolive.com/live/034/24f/209/xbig_lq/314348.jpg
2 mokkoann mokkoann //i.bimbolive.com/live/034/270/279/xbig_lq/cb25cb.jpg
3 ogurezzi CynEp-nuCbka //i.bimbolive.com/live/034/269/02c/xbig_lq/3ebe2a.jpg
4 Pepetka22 _-Katya-_ //i.bimbolive.com/live/034/24f/36e/xbig_lq/18da8e.jpg
Avoid using . in a regex unless you really want to match any character; here, the usernames (as far as I can see) contain only - and alphanumeric characters, so you can retrieve them with (note that inside a character class the | is a literal pipe, not alternation):
re.findall(r'"username":"([\w|-]+)"',data.text)
An even simpler way, which removes the need to deal with special characters by matching every character except ", is:
re.findall(r'"username":"([^"]+)"',data.text)
So here's a way of getting the info you seek (I joined them into a dictionary, but you can change that to whatever you prefer):
import requests
import re

data = requests.get('https://ru.runetki3.com/?page=1')
with open("return.txt", 'w', encoding='utf-8') as f:
    f.write(data.text)

names = re.findall(r'"username":"([^"]+)"', data.text)
disp_names = re.findall(r'"display_name":"([^"]+)"', data.text)
thumbs = re.findall(r'"thumb_image":"([^"]+)"', data.text)
names_dict = {name: [disp, thumb.replace('{ext}', 'jpg')]
              for name, disp, thumb in zip(names, disp_names, thumbs)}
Example
names_dict['JuliaCute']
# ['_Cute',
# '\\/\\/i.bimbolive.com\\/live\\/055\\/2b0\\/15d\\/xbig_lq\\/d89ef4.jpg']
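Note that the thumb URLs scraped this way still contain the JSON-escaped slashes visible above; a one-line cleanup, assuming that is the only escaping present:

thumb = names_dict['JuliaCute'][1].replace('\\/', '/')
# '//i.bimbolive.com/live/055/2b0/15d/xbig_lq/d89ef4.jpg'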

Find Location of All Numbers with a Comma

I have been scraping some HTML pages with Beautiful Soup, trying to extract some updated financial data. I only care about numbers that have a comma, i.e. 100,000 or 12,000,000, but not 450, for example. The goal is just to find the location of the comma-separated numbers within a string; then I need to extract the entire sentence they are in.
I moved the entire scrape to a string list and within that list I want to extract all numbers that have a comma.
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/354950/000035495020000024/hd-2020proxystatement.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content)
text = soup.find_all(text=True)

strings = []
for i in range(len(text)):
    text_s = str(text[i])    # was proxy_text[i], which is undefined here
    strings.append(text_s)   # was strings.append(text), which appended the whole list
I thought about the following re code, but I am not sure it will extract all instances; i.e., within the list there may be multiple instances of numbers separated by commas.
number = re.sub('[^>0-9,]', "", text)
Any thoughts would be a huge help! Thank you
You can use:
from bs4 import BeautifulSoup
import requests, re

url = 'https://www.sec.gov/Archives/edgar/data/354950/000035495020000024/hd-2020proxystatement.htm'
soup = BeautifulSoup(requests.get(url).text, "html5lib")

for el in soup.find_all(True):  # loop over every element in the page
    if re.search(r"(?=\d+,\d+).*", el.text):
        print(el.text)
        # print("END OF ELEMENT\n")  # debug only
If you simply want to check whether an element contains a comma, and extract it if it does, then you could try the following.
new = []
for i in text:
    if ',' in i:
        new.append(i)
This will append every element in the text collection that contains a comma, even if the exact same element is repeated multiple times. Note it matches any comma, not just commas inside numbers.
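If you want only comma-grouped numbers (100,000 but not a prose comma), plus the sentence around them, here's a sketch; the 1-to-3-digits-then-groups-of-3 pattern and the naive sentence split are assumptions about the filing text:

import re

num_re = re.compile(r'\d{1,3}(?:,\d{3})+')  # matches 100,000 and 12,000,000, but not 450

sample = "The fund held 12,000,000 shares. Fees were 450 dollars. Sales hit 100,000."
for sentence in re.split(r'(?<=[.!?])\s+', sample):
    if num_re.search(sentence):
        print(sentence)
# The fund held 12,000,000 shares.
# Sales hit 100,000.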

How to strip string down to "point guard"

I'm trying to scrape the position off of this webpage using BeautifulSoup. Here is my relevant code.
info_panel = soup.find("div", {"id": "meta"})
info_panel_rows = info_panel.find_all("p")

if info_panel_rows[2].find("strong") is not None:
    position = info_panel_rows[2].find("strong").next_sibling
    position = str(position).strip()
else:  # executing this path in my current problem
    position = info_panel_rows[3].find("strong").next_sibling
    position = str(position).strip()

print(position)
When I scrape it, though, it prints like this:
Small Forward
▪
How would I go about stripping this down to just "Small Forward"? I've looked all over Stack Overflow and couldn't find a clear answer.
Thanks for any help you can provide!
Are you having issues with a newline and tab in position? If so, do
position = str(position).strip('\n\t ')
and if that dot is also an issue, copy it from the printed output and paste it into strip(). When you don't pass anything to strip(), it only removes whitespace from both sides; you need to specify which characters you want removed. The example above removes newlines, tabs, and spaces.
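For instance, with a made-up value shaped like the output above:

s = '\n\tSmall Forward\n▪'
print(repr(s.strip()))          # 'Small Forward\n▪' -- only outer whitespace is removed
print(repr(s.strip('\n\t ▪')))  # 'Small Forward'    -- the bullet is stripped as well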
If this does not solve your problem, you can try regex
import re
string_patterns = re.compile(r'\b[0-9a-zA-Z]*\b')
position = info_panel_rows[3].find("strong").next_sibling
results = string_patterns.findall(str(position))
results = ' '.join([item for item in results if len(item)])
print(results)
Hope this helps
If you encode it to ASCII, ignoring errors, and then call strip(), you get the desired output.
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.basketball-reference.com/players/y/youngtr01.html').text
soup = BeautifulSoup(html, 'html.parser')
info_panel = soup.find("div", {"id": "meta"})
info_panel_rows = info_panel.find_all("p")

if info_panel_rows[2].find("strong") is not None:
    position = info_panel_rows[2].find("strong").next_sibling
else:
    position = info_panel_rows[3].find("strong").next_sibling

# decode back to str so print() doesn't show b'...'
print(position.encode('ascii', 'ignore').decode().strip())
Outputs:
Point Guard
Encoding to ascii gets rid of the bullet point.
Or if you just want to print the second line:
print(position.splitlines()[1].strip())
Also outputs:
Point Guard

How do I import an external text file from a website and use a list from inside it

I'm having trouble trying to make this work
import requests
import random

response = requests.get("https://cdn.discordapp.com/attachments/480168592164257792/557872162661335040/aaaaa.txt")
data = response.text
for line in data:
    print(line)
I am trying to pull a txt file from the internet and be able to use the list inside the text file.
Right now all it does is treat each letter as a separate string(?)
response.text is a single string of characters; if you loop over it you get each character. (Read about how Python handles strings.)
In this case Python doesn't know what a "line" is, so split the data on newlines and try again:
import requests
import random

response = requests.get("https://cdn.discordapp.com/attachments/480168592164257792/557872162661335040/aaaaa.txt")
data = response.text
for line in data.split("\n"):
    print(line)
The attribute response.text is a string, so iterating over it will give you individual characters. You can split the string by spaces (or maybe by newlines) to get what you need (I also added a few print statements to show the steps):
import requests

response = requests.get(
    "https://cdn.discordapp.com/attachments/480168592164257792/557872162661335040/aaaaa.txt")

print('response.text type:', type(response.text))
print('response.text len:', len(response.text))
print(response.text)
print()

print('splitting by spaces:')
for i, s in enumerate(response.text.split()):
    print(i, s)
print()

print('splitting by newlines:')
for i, line in enumerate(response.text.split('\n')):
    print(i, line)
The code gives this output:
response.text type: <class 'str'>
response.text len: 21
a = ["please","work"]
splitting by spaces:
0 a
1 =
2 ["please","work"]
splitting by newlines:
0 a = ["please","work"]
@bruno suggested in a comment using str.splitlines(); this will work even if the response is bytes, since there is also a bytes.splitlines() method.
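Since the file shown above holds a Python-style assignment (a = ["please","work"]), here's a sketch for turning its right-hand side into an actual list; it assumes the file keeps exactly this one-line format:

import ast
import requests

response = requests.get(
    "https://cdn.discordapp.com/attachments/480168592164257792/557872162661335040/aaaaa.txt")
line = response.text.splitlines()[0]                       # 'a = ["please","work"]'
values = ast.literal_eval(line.split('=', 1)[1].strip())   # safely parse the list literal
print(values)      # ['please', 'work']
print(values[0])   # please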

Manipulate string data

I'm new to python and trying to create a script to modify the output of a JS file to match what is required to send data to an API. The JS file is being read via urllib2.
import urllib2

def getPage():
    url = "http://url:port/min_day.js"
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()

# JS data
# m[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"
# m[mi++]="19.12.12 09:25:00|1911;2060;3277;293;59"

# Required format for the API
# addbatchstatus.jsp?data=20121219,09:25,3277.0,1911,-1,-1,59.0,293.0;20121219,09:30,3440.0,1964,-1,-1,60.0,293.0
As a breakdown of one line (comparing with the required API string, every field except the second value, 2121, is needed):
m[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"
and the values -1,-1 need to be added into the string.
I've managed to get the date into the correct format and to replace characters and line breaks so the output looks like the line below, but I have a feeling I'm heading down the wrong track if I need to reorder the string's values. It also looks like the order is reversed with regard to time.
20121219,09:30:00,1964,2121,3440,293,60;20121219,09:25:00,1911,2060,3277,293,59
Any help would be greatly appreciated! I'm thinking along the lines of regex being what I need.
Here's a regex pattern to strip out the bits you don't want:
m\[mi\+\+\]="(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{2}) (?P<time>[\d:]{8})\|(?P<v1>\d+);(?P<v2>\d+);(?P<v3>\d+);(?P<v4>\d+);(?P<v5>\d+).+
and replace with
20\g<year>\g<month>\g<day>,\g<time>,\g<v3>,\g<v1>,-1,-1,\g<v5>,\g<v4>
(in Python's re, named groups are referenced in the replacement as \g<name>). This pattern assumes that the characters before the date are constant. You can replace m\[mi\+\+\]=" with [^\d]+ if you want more general handling of that bit.
So to put this into practice in Python:
import re
import urllib2

def getPage():
    url = "http://url:port/min_day.js"
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()

def repl(match):
    return '20%s%s%s,%s,%s,%s,-1,-1,%s,%s' % (match.group('year'),
                                              match.group('month'),
                                              match.group('day'),
                                              match.group('time'),
                                              match.group('v3'),
                                              match.group('v1'),
                                              match.group('v5'),
                                              match.group('v4'))

pattern = re.compile(r'm\[mi\+\+\]="(?P<day>\d{2})\.(?P<month>\d{2})\.(?P<year>\d{2}) (?P<time>[\d:]{8})\|(?P<v1>\d+);(?P<v2>\d+);(?P<v3>\d+);(?P<v4>\d+);(?P<v5>\d+).+')
data = [re.sub(pattern, repl, line).split(',') for line in getPage().split('\n')]

# If you want to sort your data (newest first, by date)
data = sorted(data, key=lambda x: x[0], reverse=True)

# If you want to write your data back to a formatted string
new_string = ';'.join(','.join(x) for x in data)

# If you want to write it back to a file
with open('new/file.txt', 'w') as f:
    f.write(new_string)
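As a quick sanity check, here is the substitution run over the two sample lines from the question, reusing pattern and repl from the block above:

sample = ('m[mi++]="19.12.12 09:30:00|1964;2121;3440;293;60"\n'
          'm[mi++]="19.12.12 09:25:00|1911;2060;3277;293;59"')
rows = [re.sub(pattern, repl, line).split(',') for line in sample.split('\n')]
print(';'.join(','.join(x) for x in rows))
# 20121219,09:30:00,3440,1964,-1,-1,60,293;20121219,09:25:00,3277,1911,-1,-1,59,293

Note the result still keeps seconds in the time and lacks the .0 decimals of the target API string, so repl would need small tweaks for an exact match.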
Hope that helps!
