I am trying to write a small script that will take a URL as input and parse it.
Following is my script:
#! /usr/bin/env python
import sys
from urlparse import urlsplit
url = sys.argv[1]
parseUrl = urlsplit(url)
print 'scheme :', parseUrl.scheme
print 'netloc :', parseUrl.netloc
But when I execute this script with ./myscript http://www.example.com
it shows the following error:
AttributeError: 'tuple' object has no attribute 'scheme'
I am new to Python/scripting; where am I going wrong?
Edit: The Python version I am using is 2.7.5.
You don't want .scheme here. Instead, in this case, you want to access index 0 and index 1 of the tuple:
print 'scheme :', parseUrl[0]
print 'netloc :', parseUrl[1]
urlparse uses the .scheme and .netloc notation; urlsplit instead uses a tuple (refer to the appropriate index number):
This is similar to urlparse(), but does not split the params from the
URL. This should generally be used instead of urlparse() if the more
recent URL syntax allowing parameters to be applied to each segment of
the path portion of the URL (see RFC 2396) is wanted. A separate
function is needed to separate the path segments and parameters. This
function returns a 5-tuple: (addressing scheme, network location,
path, query, fragment identifier).
The return value is actually an instance of a subclass of tuple. This
class has the following additional read-only convenience attributes:
Attribute   Index   Value                                 Value if not present
scheme      0       URL scheme specifier                  empty string
netloc      1       Network location part                 empty string
path        2       Hierarchical path                     empty string
query       3       Query component                       empty string
fragment    4       Fragment identifier                   empty string
username            User name                             None
password            Password                              None
hostname            Host name (lower case)                None
port                Port number as integer, if present    None
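For example, on Python 2.5 or later both access styles work; a quick sketch (the example URL is mine, not from the question):
from urlparse import urlsplit

parts = urlsplit('http://user@www.example.com:8080/path?q=1#top')
print parts.scheme, parts[0]       # http http
print parts.netloc, parts[1]       # user@www.example.com:8080 (same value via index)
print parts.hostname, parts.port   # www.example.com 8080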
Looking at the docs, it sounds like you are actually using Python 2.4, which does not have the attributes added. The other answers missed the critical bit from the docs:
New in version 2.2.
Changed in version 2.5: Added attributes to return value.
You will have to access the tuple parts by index or unpacking:
scheme, netloc, path, query, fragment = urlsplit(url)
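Or equivalently by index, which also works on Python 2.4; a minimal sketch using the URL from the question:
parts = urlsplit('http://www.example.com')
print 'scheme :', parts[0]  # http
print 'netloc :', parts[1]  # www.example.com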
However, you should really be upgrading to Python 2.7. Python 2.4 is no longer supported.
Related
I have a dynamically created array of URIs that I want to iterate through for retrieval of json data, which I'll then use to search a particular field for a specific value. Unfortunately, I keep getting syntax errors.
for i in list_url_id:{
var t = requests.get(base_url+i+'?limit='+str(count),auth=HTTPBasicAuth(uname,pw)).json()
print(t)
}
If I do a print(i) in the loop, it prints the full URL out properly. I'm lost.
EDIT:
base_url is a URL similar to https://www.abcdef.com:1443
the URI in list_url_id is a URI similar to /v1/messages/list/0293842
I have no issue (as mentioned) concatenating them in a print operation, but when the same concatenation is used to build the string for requests.get, I get a nondescript syntax error.
Python sees the code inside the braces as a dictionary (or set) literal; that's what is causing the syntax errors. Python does not use braces for loop bodies at all; indentation is what delimits the body of a for loop:
for i in list_url_id:
    t = requests.get(base_url + i + '?limit=' + str(count), auth=HTTPBasicAuth(uname, pw)).json()
    print(t)
This should work for you. Note that var has also been removed, since it is not valid Python syntax: Python variables do not need an explicit declaration; a variable is created the moment you first assign a value to it.
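As a side note (my suggestion, not part of the original fix), requests can also build the query string for you via its params argument, which avoids the manual '?limit=' concatenation; names such as list_url_id, base_url, count, uname and pw are taken from the question:
import requests
from requests.auth import HTTPBasicAuth

for i in list_url_id:
    # requests appends ?limit=<count> to the URL for you
    t = requests.get(base_url + i,
                     params={'limit': count},
                     auth=HTTPBasicAuth(uname, pw)).json()
    print(t)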
I am attempting to parse Shodan query results and print only the results that match the criteria I have set. The output needs to be in JSON format to be integrated later into Splunk.
I'd like to iterate over the set of elements and remove an element if its location country_code does not match "US".
Here is my code :
import shodan
import os
import sys
import json
SHODAN_API_KEY = os.environ.get("SHODAN_API_KEY")
api = shodan.Shodan(SHODAN_API_KEY)
query = sys.argv[1]
try:
    query_results = api.search(query)
except shodan.APIError as err:
    print('Error: {}'.format(err))

for element in query_results['matches']:
    if 'US' in format(element['location']['country_code']):
        del element

print(query_results['matches'])
But with this code my element won't get removed from query_result['matches'].
There are a few things:
Consider using the Shodan.search_cursor(query) method instead of just Shodan.search(query). The search_cursor() method handles paging through results for you in case there are more than 100 results. Otherwise you need to do that on your own by providing the page parameter to the search() method. Here is an article that explains it a bit further: https://help.shodan.io/guides/how-to-download-data-with-api
You can actually filter by country within the search query itself! Since you only want US results, simply add " country:US" to your query (conversely, "-country:US" would exclude US services). I.e. do the following, assuming you have Python 3.6+ for f-strings:
query_results = api.search(f'{query} country:US')
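If you'd rather keep the query as-is and filter client-side, here is a minimal sketch (the field layout is assumed from typical Shodan match dictionaries). Rebuilding the list also avoids the del element pitfall: del only unbinds the loop variable's name and never removes anything from the list being iterated.
query_results['matches'] = [
    m for m in query_results['matches']
    if m.get('location', {}).get('country_code') == 'US'
]
# json is already imported in your script; this gives the JSON output for Splunk
print(json.dumps(query_results['matches']))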
I'm using the FOURSQUARE API for extracting the venue searches. I have created a URL with my client_id and client_secret, but I'm unable to work out what version date I need to put in the v= parameter.
Please check the error in the image. I checked online but can't quite understand it. Any help will be appreciated.
First of All:
Go revoke and regenerate your token immediately, since you just posted it to the internet
Your URL only contained 3 format placeholders (count the {} pairs in your format string), but you tried to stuff 4 variables into those 3 holes:
.format(
venue_id, # gets placed in the url after client_id=
CLIENT_ID, # gets placed after client_secret=
CLIENT_SECRET, # placed after v=
VERSION # placed nowhere because you don't have a 4th {} in the string.
)
The error result you are seeing is the API complaining that your CLIENT_SECRET, which landed in the v= slot, is not a valid version.
You may be violating their TOS by publishing your client keys, that is why you should revoke and regenerate.
Suggestion
Use named format strings. If you use named placeholders, you reduce your chance of mistakes like this: if a required name isn't passed in, you get a KeyError; if you pass in extras, no problem. Either way, you won't get this silent argument-shifting error.
url = "https://<stuff>/client_id={client_id}&client_secret={client_secret}&v={version}".format(
client_id=CLIENT_ID,
client_secret=CLIENT_SECRET,
version=VERSION
)
or shorthand format strings (f-strings), where the names inside the braces are variables in your program:
url = f"https://<stuff>/client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}"
#     ^ the leading f marks this as an inline format string (Python 3.6+)
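To see the earlier point about named placeholders failing loudly rather than shifting values around, a quick sketch (not from the original answer):
"{client_id}&v={version}".format(client_id="abc")
# raises KeyError: 'version' -- the missing name is reported explicitly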
I'm using the library ABPY (library here) for Python, but I think it is written for an older version; I'm using Python 3.3.
I did fix some print errors, but that's about as much as I know; I'm really new to programming.
I want to fetch some webpage, filter the advertising out of it, and then print it again.
EDITED: after Sg'te'gmuj told me how to convert from Python 2.x to 3.x, this is my new code:
#!/usr/local/bin/python3.1
import cgitb;cgitb.enable()
import urllib.request
response = urllib.request.build_opener()
response.addheaders = [('User-agent', 'Mozilla/5.0')]
response = urllib.request.urlopen("http://www.youtube.com")
html = response.read()
from abpy import Filter
with open("easylist.txt") as f:
    ABPFilter = Filter(file('easylist.txt'))
ABPFilter.match(html)
print("Content-type: text/html")
print()
print (html)
Now it is displaying a blank page
Just took a peek at the library. It seems that the file "easylist.txt" does not exist; you need to create the file and populate it with the appropriate filters (in whatever format ABP specifies).
Additionally, it appears it takes a file object; try something like this instead:
with open("easylist.txt") as f:
    ABPFilter = Filter(f)
I can't say this is wholly accurate though, since I have no experience with the library; but looking at its code I'd suspect either of the two is the problem, if not both.
Addendum #1
Looking at the code more in-depth, I have to agree that even if the fix I supplied does work, you're going to have more problems (the library is in 2.x as you suggested, while you're using 3.x). I'd suggest utilizing Python's 2to3 tool to convert from typical Python 2 to Python 3 code (it's not foolproof though). The command line would be:
2to3 -w abpy.py
That will convert it from Python 2.x to 3.x code, and re-write the source file.
Addendum #2
The file object passed should be the f variable, as shown above (I've modified the snippet to reflect that; I wasn't paying attention and had left the old file function call in the argument).
You need to pass a URI to the function as well:
ABPFilter.match(URI)
You'll need to modify the code to pass those items in (as an array, I'm assuming, at least); I'm playing with it now to see. At present I'm getting a rule error (not a Python error, but merely the error handling used by abpy.py, which is good because it suggests this is the right train of thought).
The code for the Filter.match function is as follows (after running the 2to3 script):
def match(self, url, elementtype=None):
    tokens = RE_TOK.split(url)
    print(tokens)
    for tok in tokens:
        if len(tok) > 2:
            if tok in self.index:
                for rule in self.index[tok]:
                    if rule.match(url, elementtype=elementtype):
                        print(str(rule))
What this means is that, at present, the module only tells you which rule matched; you need to program the filtering functionality yourself. However, that is still useful.
It also means you're going to have to modify this function to take the HTML in place of the "url" parameter. You're going to regex the HTML (this may be rather intensive) for a list of URIs and then run each item through the match loop. Where you go from there to actually filter the nodes, I'm not sure; but there is a list of filter types, so I'm assuming there is a typical procedure ABP follows to remove the nodes (possibly, in some cases, merely by removing the given URI from the HTML?).
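For example, a rough sketch of that train of thought (the regex and the decoding step are my assumptions, not part of abpy):
import re

# html is the bytes object returned by response.read(); pull out URL-like strings
URL_RE = re.compile(rb'https?://[^\s"\'<>]+')

for uri in URL_RE.findall(html):
    # run each candidate URI through the filter's match loop
    ABPFilter.match(uri.decode('utf-8', 'replace'))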
References
http://docs.python.org/3.3/library/2to3.html
I can get the results from a one_shot query, but I can't get the full content of the _raw field.
import splunklib.client as client
import splunklib.results as results
def splunk_oneshot(search_string, **CARGS):
    # Run a oneshot search and display the results using the results reader
    service = client.connect(**CARGS)
    oneshotsearch_results = service.jobs.oneshot(search_string)
    # Get the results and display them using the ResultsReader
    reader = results.ResultsReader(oneshotsearch_results)
    for item in reader:
        for key in item.keys():
            print(key, len(item[key]), item[key])
This gives me the following for _raw:
('_raw', 120, '2013-05-03 22:17:18,497+0000 [SWF Activity attNsgActivitiesTaskList1 19] INFO c.s.r.h.s.a.n.AttNsgSmsRequestAdapter - ')
So this content is truncated at 120 characters. I need the entire value of the search result, because I need to run some string comparisons on it. I have not found any documentation on the ResultsReader fields or their size restrictions.
My best guess is that this is caused by the insertion of special tags into the event's raw data to highlight matched search terms in the Splunk UI front-end. In all likelihood, your search string specifies a literal term that is present in the raw data right at the point of truncation. This is not an appropriate default behavior for the SDK's result-fetching methods, and there is currently a bug open to fix it (internal reference DVPL-1519).
Fortunately, avoiding this problem is fairly trivial: one simply needs to pass segmentation='none' as an argument to the results-fetching call (job.results() or, as here, jobs.oneshot()):
(...)
oneshotsearch_results = service.jobs.oneshot(search_string,segmentation='none')
(...)
Do note that the 'segmentation' argument for the service.jobs() method is only available on Splunk 5.0 and onwards.
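Putting it together in your function, the change is one line (a sketch; everything else is as in your original code):
def splunk_oneshot(search_string, **CARGS):
    service = client.connect(**CARGS)
    # segmentation='none' keeps _raw free of search-term highlighting tags
    oneshotsearch_results = service.jobs.oneshot(search_string, segmentation='none')
    reader = results.ResultsReader(oneshotsearch_results)
    for item in reader:
        if isinstance(item, dict):  # skip any diagnostic Message objects
            print(item['_raw'])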