Splitting a website URL into keywords, multiple splits - Python

I am currently creating a tool that scans the URL of a website and returns the keywords as a list. For example, for google.com/images the tool should output:
{"google", "images"}
I know how to filter the .com part out, but the problem is that I can't split the parts produced by the first split again, so I end up with only the results of the first split. How do I split these parts again?
First run split(".") -> {"google", "com/images"}
Second run split("/") -> {"google", "com", "images"}
because then I can filter out things like the .com part. I'm writing this in Python, and this is my code at the moment.
First, the error:
AttributeError: 'list' object has no attribute 'split'
So the problem is that a list object has no split method, and I can't split it again.
Now the code:
url_content = input('Enter url: ')
url_split1 = url_content.split('.')
url_split2 = url_split1.split('/')
url_split3 = url_split2.split('-')
url_split4 = url_split3.split('&')
filtered = {'com', 'net'}
print(url_split4)
for key in url_split4:
    if key not in filtered:
        print(key)

You can use replace:
url_content = input('Enter url: ').replace('/','.').replace('-','.').replace('&','.')
and then split it once:
url_split1 = url_content.split('.')
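Putting the two lines together, a minimal sketch of the replace-then-split idea (the sample URL and the filtering step are assumptions based on the question):

```python
# Normalize every separator to '.' so one split handles them all
url_content = "google.com/images-cats&dogs"
normalized = url_content.replace('/', '.').replace('-', '.').replace('&', '.')
url_split1 = normalized.split('.')

# Drop TLD-like parts, as in the question
filtered = {'com', 'net'}
keywords = [key for key in url_split1 if key not in filtered]
print(keywords)  # ['google', 'images', 'cats', 'dogs']
```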

You can either use Python's built-in regular expressions library as follows:
import re
re.split(r'\.|&|-|/', url_content)
or you may use the string replace method.
url_content.replace(".", "/").replace("&", "/").replace("-", "/").split("/")
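For completeness, here is the regex version end to end, using a character class instead of alternation and filtering out the TLD and any empty strings afterwards (the sample URL is an assumption):

```python
import re

url_content = "google.com/images-cats&dogs"
# Split on any of '.', '&', '/', '-' in a single pass
parts = re.split(r'[.&/-]', url_content)

filtered = {'com', 'net'}
keywords = [p for p in parts if p and p not in filtered]
print(keywords)  # ['google', 'images', 'cats', 'dogs']
```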

Related

How to iterate through a list of Twitter users using Snscrape?

I am trying to retrieve tweets over a list of users; however, in the snscrape function this argument is inside quotes, which makes the username be taken as a fixed input.
import snscrape.modules.twitter as sntwitter
tweets_list1 = []
users_name = [{'username':'@bbcmundo'},{'username':'@nytimes'}]
for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:{}').get_items().format(username)):
    if i>100:
        break
    tweets_list1.append([tweet.date, tweet.id, tweet.content, tweet.url,
                         tweet.user.username, tweet.user.followersCount, tweet.replyCount,
                         tweet.retweetCount, tweet.likeCount, tweet.quoteCount, tweet.lang,
                         tweet.outlinks, tweet.media, tweet.retweetedTweet, tweet.quotedTweet,
                         tweet.inReplyToTweetId, tweet.inReplyToUser, tweet.mentionedUsers,
                         tweet.coordinates, tweet.place, tweet.hashtags, tweet.cashtags])
As output, Python gives:
AttributeError: 'generator' object has no attribute 'format'
This code works fine after replacing the curly braces with the username and deleting the .format call. If you want to replicate this code, be sure to install the snscrape library using:
pip install git+https://github.com/JustAnotherArchivist/snscrape.git
I found some mistakes that I made writing this code, so I want to share them with all of you in case you run into this very same problem or a similar one:
First: I changed the users_name format from a dict to a list of items.
Second: I put the format call in the right place, right after the text input string.
Third: I added a nested loop to scrape each Twitter user account.
users_name = ['bbcmundo','nytimes']
for n, k in enumerate(users_name):
    for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:{}'.format(users_name[n])).get_items()):
        if i>100:
            break
        tweets_list1.append([tweet.date, tweet.id, tweet.content, tweet.url,
                             tweet.user.username, tweet.user.followersCount, tweet.replyCount,
                             tweet.retweetCount, tweet.likeCount, tweet.quoteCount, tweet.lang,
                             tweet.outlinks, tweet.media, tweet.retweetedTweet, tweet.quotedTweet,
                             tweet.inReplyToTweetId, tweet.inReplyToUser, tweet.mentionedUsers,
                             tweet.coordinates, tweet.place, tweet.hashtags, tweet.cashtags])
You can avoid making several requests by using more than one from: criterion:
users = ['bbcmundo','nytimes']
filters = ['since:2022-07-06', 'until:2022-07-07']
from_filters = []
for user in users:
    from_filters.append(f'from:{user}')
filters.append(' OR '.join(from_filters))
tweets = list(sntwitter.TwitterSearchScraper(' '.join(filters)).get_items())
# The argument is 'since:2022-07-06 until:2022-07-07 from:bbcmundo OR from:nytimes'
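The query string itself can be built and checked without touching the network; a sketch of just the string assembly (no snscrape call, so nothing here depends on the library or the Twitter API):

```python
users = ['bbcmundo', 'nytimes']
filters = ['since:2022-07-06', 'until:2022-07-07']

# OR together one from: clause per user, then join with the date filters
from_filters = [f'from:{user}' for user in users]
query = ' '.join(filters + [' OR '.join(from_filters)])
print(query)  # since:2022-07-06 until:2022-07-07 from:bbcmundo OR from:nytimes
```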

Python - Format a string from a list not working

I want to crawl a webpage for some information, and what I've done so far is working, but I need to make a request to another URL on the website. I'm trying to format it, but it's not working. This is what I have so far:
name = input("> ")
page = requests.get("http://www.mobafire.com/league-of-legends/champions")
tree = html.fromstring(page.content)
for index, champ in enumerate(champ_list):
    if name == champ:
        y = tree.xpath(".//*[@id='browse-build']/a[{}]/@href".format(index + 1))
        print(y)
        guide = requests.get("http://www.mobafire.com{}".format(y))
        builds = html.fromstring(guide.content)
        print(builds)
        for title in builds.xpath(".//table[@class='browse-table']/tr[2]/td[2]/div[1]/a/text()"):
            print(title)
From the input, the user enters a name; if the name matches one from a list (champ_list), it prints a URL, and from there it formats it into the guide variable to fetch more information, but I'm getting errors such as "invalid ipv6".
This is the output URL (one of them, but they're all similar anyway): ['/league-of-legends/champion/ivern-133']
I tried using slicing, but it doesn't do anything; probably I'm using it wrong, or it doesn't work in this case. I tried using replace as well, but it doesn't work on lists; I tried it as:
y = [y.replace("'", "") for y in y] so I could at least see if it removed the quotes, but that didn't work either. What would be another approach to format this properly?
I take it y is the list you want to insert into the string?
Try this:
"http://www.mobafire.com{}".format('/'.join(y))
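Since the xpath call returns a one-element list here, joining (or just taking the element out) before formatting gives a clean URL instead of the bracketed list representation; a small sketch using the output value from the question:

```python
# y is what tree.xpath(...) returned: a list with one href string
y = ['/league-of-legends/champion/ivern-133']

# ''.join collapses the one-element list into its string
url = "http://www.mobafire.com{}".format(''.join(y))
print(url)  # http://www.mobafire.com/league-of-legends/champion/ivern-133
```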

How do you slice part of an input statement in Python 3?

I have a question which asks me to get a user's email address and then return the URL it is associated with. So, for example: 'abc123@address.com' --> 'http://www.address.com'
I did get this:
def main():
    email_address = input('Enter your email address (eg. abc123@address.com): ').strip()
    strip_username = email_address.split('@', 1)[-1]
    the_url(strip_username)

def the_url(url_ending):
    print('Your associated URL is: http://www.' + str(url_ending))

main()
which does what I want, but this code: split('@', ...) is something I haven't learned yet; I just found it online. I need to use indexing and slicing for this program, but how can I use slicing if I don't know the length of the user's email? I need to get rid of everything before and including the '@' symbol so that it leaves me with just 'address.com', but I don't know what the address will be. It could be hotmail, gmail, etc. Thanks, and I'm really new to Python, so I'm trying to only use what I've learned in class so far.
The split method just splits up the string based on the character you give it, so:
"Hello@cat".split("@")
will give you:
["Hello", "cat"]
Then you can just take index 1 of that list to get whatever's after the first @ symbol.
If you don't want to use str.split, then with indexing and slicing
you can do something like this.
>>> email = 'abc123@address.com'
>>> 'http://www.' + email[email.index('@')+1:]
'http://www.address.com'
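The two ideas above combine into a tiny helper using only index and a slice (the function name mirrors the question's code, but this version is a sketch, not the asker's original):

```python
def the_url(email):
    # Everything after the first '@' is the domain part
    domain = email[email.index('@') + 1:]
    return 'http://www.' + domain

print(the_url('abc123@address.com'))  # http://www.address.com
```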

Python splitting values from urllib in string

I'm trying to get IP location and other stuff from ipinfodb.com, but I'm stuck.
I want to split all of the values into new strings that I can format how I want later. What I wrote so far is:
resp = urllib2.urlopen('http://api.ipinfodb.com/v3/ip-city/?key=mykey&ip=someip').read()
out = resp.replace(";", " ")
print out
Before I replaced things in the string, the output was:
OK;;someip;somecountry;somecountrycode;somecity;somecity;-;42.1975;23.3342;+05:00
So I made it show only
OK someip somecountry somecountrycode somecity somecity - 42.1975;23.3342 +05:00
But the problem is that this is pretty stupid, because I don't want to use them in one string but in several. What I do now is print out, and it outputs the line above; I want to change it so that print country, print city, etc. output the country, the city, and so on. I tried checking on their site; there's a class for that, but it's for a different API version, so I can't use it (it's for v2, mine is v3). Does anyone have an idea how to do that?
PS. Sorry if the answer is obvious or I'm mistaken, I'm new with Python :s
You need to split the resp text by ;:
out = resp.split(';')
Now out is a list of values instead; use indexes to access the various items:
print 'Country: {}'.format(out[3])
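Based on the field order in the sample response above, the interesting pieces can be unpacked by index; a sketch with made-up values (the exact field layout is an assumption inferred from the question's output):

```python
# Assumed layout: status;;ip;country;countryCode;region;city;zip;lat;lon;tz
resp = 'OK;;203.0.113.9;Bulgaria;BG;Sofia;Sofia;-;42.1975;23.3342;+05:00'
out = resp.split(';')

country = out[3]
city = out[6]
lat, lon = out[8], out[9]
print(country, city, lat, lon)  # Bulgaria Sofia 42.1975 23.3342
```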
Alternatively, add format=json to your query string and receive a JSON response from that API:
import json
resp = urllib2.urlopen('http://api.ipinfodb.com/v3/ip-city/?format=json&key=mykey&ip=someip')
data = json.load(resp)
print data['countryName']

Pattern matching Twitter Streaming API

I'm trying to insert certain values from the Streaming API into a dictionary. One of these is the value of the term that is used in the filter method via track=keyword. I've written some code, but on the print statement I get an "Encountered Exception: 'term'" error. This is my partial code:
for term in setTerms:
    a = re.compile(term, re.IGNORECASE)
    if re.search(a, status.text):
        message['term'] = term
    else:
        pass
print message['text'], message['term']
This is the filter code:
setTerms = ['BBC','XFactor','Obama']
streamer.filter(track = setTerms)
It matches the string, but I also need to be able to match all instances, e.g. BBC should also match @BBC, #BBC, BBC1, etc.
So my question is: how would I get a term in setTerms, e.g. BBC, to match all these instances in if re.search(term, status.text)?
Thanks
Have you tried putting all your search terms into a single expression?
i.e. setTerms = '(BBC)|(XFactor)|(Obama)', then seeing if it matches the whole string (not just an individual word)?
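A sketch of that idea, escaping each term and compiling one case-insensitive alternation; the sample status text is invented, and re.search already matches substrings, so BBC hits @BBC, #BBC, and BBC1 too:

```python
import re

setTerms = ['BBC', 'XFactor', 'Obama']
# One pattern matching any term, case-insensitively, anywhere in the text
pattern = re.compile('|'.join(re.escape(t) for t in setTerms), re.IGNORECASE)

status_text = 'Watching @BBC and bbc1 before #XFactor'
hit = pattern.search(status_text)
print(hit.group(0) if hit else None)  # BBC
```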
