Google search from Python program - python

I'm trying to take an input file, read each line, search Google with that line, and print the search results for the query only if a result comes from a specific website. As a simple example, if I search for "dog" I only want results from Wikipedia printed, whether that is one result or ten. My problem is that I've been getting really weird results. Below is my Python code, which contains the specific URL I want results from.
My program
inputFile = open("small.txt", 'r') # Makes File object
outputFile = open("results1.txt", "w")
dictionary = {} # Our "hash table"
compare = "www.someurl.com/" # urls will compare against this string
from googlesearch import GoogleSearch
for line in inputFile.read().splitlines():
    lineToRead = line
    dictionary[lineToRead] = [] # initialized to empty list
    gs = GoogleSearch(lineToRead)
    for url in gs.top_urls():
        print url # check to make sure this is printing URLs
        compare2 = url
        if compare in compare2: # compare the two URLs, if they match
            dictionary[lineToRead].append(url) # write out query string to dictionary key & append EACH url that matches
inputFile.close()
for i in dictionary:
    print i # this print is a test that shows what the query was in google (dictionary key)
    outputFile.write(i+"\n")
    for j in dictionary[i]:
        print j # this print is a test that shows the results from the query which should look like correct URL: "www.medicaldepartmentstore.com/..."(dictionary value(s))
        outputFile.write(j+"\n") # write results for the query string to the output file.
My output file is incorrect. It is supposed to be formatted like this:
query string
http://www.
http://www.
http://www.
query string
http://www.
query string
http://www.medical...
http://www.medical...

Can you limit the scope of the results to the specific site (e.g. wikipedia) at the time of the query? For example, using:
gs = GoogleSearch("site:wikipedia.org %s" % query) #as shown in https://pypi.python.org/pypi/googlesearch/0.7.0
This would instruct Google to return only the results from that domain, so you won't need to filter them after seeing the results.

I think @Cahit has the right idea. The only reason you would be getting lines with just the query string is that the domain you were looking for wasn't in top_urls(). You can verify this by checking whether the list stored in the dictionary for a given key is empty:
for i in dictionary:
    outputFile.write("%s: " % str(i))
    if len(dictionary[i]) == 0:
        outputFile.write("No results in top_urls\n")
    else:
        outputFile.write("%s\n" % ", ".join(dictionary[i]))
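If the results still need to be filtered after the search, comparing the parsed host name is usually more robust than a plain substring test, because a substring can also match inside a path or query string. The following is only a sketch in Python 3 using the standard library; the helper names `is_from_site` and `filter_by_site` and the sample URLs are made up for illustration and are not part of the googlesearch package.

```python
from urllib.parse import urlsplit

def is_from_site(url, site):
    """Return True if the URL's host is `site` or a subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    site = site.lower()
    return host == site or host.endswith("." + site)

def filter_by_site(urls, site):
    """Keep only the URLs served from `site` (hypothetical helper)."""
    return [u for u in urls if is_from_site(u, site)]

# Made-up sample results; the second one only mentions the site in its query.
urls = [
    "https://en.wikipedia.org/wiki/Dog",
    "https://www.example.com/?ref=wikipedia.org",
    "http://wikipedia.org/",
]
print(filter_by_site(urls, "wikipedia.org"))
```

The same check could replace the `if compare in compare2` test in the loop over the search results.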

Related

Generate DF from attributes of tags in list

I have a list of revisions from a Wikipedia article that I queried like this:
import urllib.request
import re
def getRevisions(wikititle):
    url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles="+wikititle
    revisions = [] #list of all accumulated revisions
    next_param = '' #information for the next request
    while True:
        response = urllib.request.urlopen(url + next_param).read() #web request
        response = response.decode('utf-8') #bytes to text
        revisions += re.findall('<rev [^>]*>', response) #adds all revisions from the current request to the list
        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont: #break the loop if 'continue' element missing
            break
        next_param = "&rvcontinue=" + cont.group(1) #gets the revision Id from which to start the next request
    return revisions
Which results in a list with each element being a rev Tag as a string:
['<rev revid="343143654" parentid="6546465" minor="" user="name" timestamp="2021-12-12T08:26:38Z" comment="abc" />',...]
How can I generate a DataFrame from this list?
An "easy" way without using regex would be to split each string and then parse the pieces:
import pandas as pd
rows = []
for rev_string in revisions:
    rev_dict = {}
    # Skip the first and last tokens: they are the tag name and the closing "/>".
    attributes = rev_string.split(' ')[1:-1]
    # Split each token on the first "=" into key and value, and strip the surrounding quotes from the value.
    for attribute in attributes:
        key, value = attribute.split("=", 1)
        rev_dict[key] = value.strip('"')
    rows.append(rev_dict)
# One row per revision, one column per attribute.
df = pd.DataFrame(rows)
This sample collects one dictionary per revision and builds a single DataFrame from all of them at the end. Attributes that are missing from some revisions (I don't know whether they vary between wiki documents) show up as NaN in those rows. Note that splitting on spaces breaks as soon as an attribute value, such as the comment, contains a space.
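Since each element is a small piece of XML, another option is to let an XML parser read the attributes, which also copes with spaces inside values. This is a minimal sketch under the assumption that every string is a self-closed tag like the sample shown in the question; `parse_revisions` is a hypothetical helper name and the sample comment text is invented.

```python
import xml.etree.ElementTree as ET

def parse_revisions(rev_tags):
    """Turn a list of '<rev ... />' strings into a list of attribute dicts."""
    rows = []
    for tag in rev_tags:
        element = ET.fromstring(tag)      # parse one self-closed tag
        rows.append(dict(element.attrib))  # attribute name -> value
    return rows

sample = [
    '<rev revid="343143654" parentid="6546465" minor="" user="name" '
    'timestamp="2021-12-12T08:26:38Z" comment="fixed a typo" />',
]
rows = parse_revisions(sample)
print(rows[0]["comment"])
```

With pandas installed, `pandas.DataFrame(rows)` should then give one row per revision and one column per attribute.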
Use the JSON output format (format=json instead of format=xml in the API URL); then you can easily create a DataFrame from the JSON.
Example URL for JSON output
For help converting JSON to a DataFrame, check out this Stack Overflow question.
Is there any other solution if I have multiple revisions, like
''''[, ]''''

How to get the same name with multiple value get unique results in Python

I have a large CSV file whose URLs I compare against the entries in a txt file.
How can I group the results so that each name appears once, with its unique values listed under it? Also, is there a faster way to compare the two files? The CSV file is at least 1 GB.
file1.csv
[01/Nov/2019:09:54:26 +0900] ","","102.12.14.22","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","164.16.37.75","52.222.194.116","200","CONNECT","http://www.google.com:443","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","192.10.77.95","21.323.12.96","200","CONNECT","http://www.wakers.com/sg/wew/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","197.99.94.32","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","157.87.34.72","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
file2.txt
1 www.amazon.com shop
1 wakers.com shop
script:
import csv
with open("file1.csv", 'r') as f:
    reader = csv.reader(f)
    for k in reader:
        ko = set()
        srcip = k[2]
        url = k[6]
        lines = url.replace(":443", "").replace(":8080", "")
        war = lines.split("//")[-1].split("/")[0].split('?')[0]
        ko.add((war,srcip))
        for to in ko:
            with open("file2.txt", "r") as f:
                all_val = set()
                for i in f:
                    val = i.strip().split(" ")[1]
                    if val in to[0]:
                        all_val.add(to)
                for ki in all_val:
                    print(ki)
my output:
('www.amazon.com', '102.12.14.22')
('www.amazon.com', '167.27.14.62')
('www.wakers.com', '192.10.77.95')
('www.amazon.com', '167.27.14.62')
('www.amazon.com', '197.99.94.32')
('www.amazon.com', '157.87.34.72')
If the URL is the same, how do I collect its values together, with each value listed only once?
How can I get results like this?
amazon.com 102.12.14.22
167.27.14.62
197.99.94.32
157.87.34.72
wakers.com 192.10.77.95
Short answer: you can't do so directly. Well, you can, but with poor performance.
CSV is a good storage format, but for something like this you might want to store everything in another, custom data file. You could first parse your file so that it holds only unique IDs instead of long strings (amazon = 0, wakers = 1, and so on), which performs better and reduces the comparison cost.
The thing is, those approaches work badly for a CSV that keeps changing. Memory mapping, or building a database from your CSV, might also be a good option (making the changes in the database and only dumping to CSV when you need to).
See "How do quickly search through a .csv file in Python" for a more complete answer.
Problem solution
import csv
import re
def possible_urls(filename, category, category_position, url_position):
    # Read a txt file to create a list of domains that could correspond to shops
    domains = []
    with open(filename, "r") as file:
        file_content = file.read().splitlines()
        for line in file_content:
            info_in_line = line.split(" ")
            # Use a regular expression to strip "www." from the url.
            domain = re.sub(r'www\.', '', info_in_line[url_position])
            if info_in_line[category_position] == category:
                domains.append(domain)
    return domains
def read_from_csv(filename, ip_position, url_position, possible_domains):
    # Build a dictionary that holds all the ips seen for each domain.
    # The dictionary looks like this:
    # {domain_name: [list of possible ips]}
    domain_ip = {domain: [] for domain in possible_domains}
    with open(filename, 'r') as f:
        reader = csv.reader(f)
        for line in reader:
            # Both positions must be valid indexes into the row.
            if len(line) <= max(ip_position, url_position):
                print(f'Not enough items in line {line}, to obtain url or ip')
                continue
            ip = line[ip_position]
            url = line[url_position]
            # Use a regular expression to get the domain name
            # from the url.
            domain = re.search(r'//[w]?[w]?[w]?\.?(.[^/]*)[:|/]', url).group(1)
            if domain in domain_ip.keys():
                domain_ip[domain].append(ip)
    return domain_ip
def print_formatted_result(result):
    # Prints formatted result
    for shop_domain in result.keys():
        print(f'{shop_domain}: ')
        for shop_ip in result[shop_domain]:
            print(f' {shop_ip}')
def create_list_of_shops():
    # First create a list of possible domains, then
    # read the ips for those domains from the csv
    possible_domains = possible_urls('file2.txt', 'shop', 2, 1)
    shop_domains_with_ip = read_from_csv('file1.csv', 2, 6, possible_domains)
    # Display the result of the previous operations
    print(shop_domains_with_ip)
    print_formatted_result(shop_domains_with_ip)
create_list_of_shops()
Output
A dictionary of IPs keyed by domain, so you can get all the IPs seen for a domain by looking up its name:
{'amazon.com': ['102.12.14.22', '167.27.14.62', '167.27.14.62', '197.99.94.32', '157.87.34.72'], 'wakers.com': ['192.10.77.95']}
amazon.com:
102.12.14.22
167.27.14.62
167.27.14.62
197.99.94.32
157.87.34.72
wakers.com:
192.10.77.95
Regular expressions
A very useful thing you can learn from the solution is regular expressions. Regular expressions are a tool for filtering or extracting information from lines of text in a very convenient way. They can also greatly reduce the amount of code, which makes it more readable and less error-prone.
Let's take your code for removing ports from strings and think about how we can replace it with a regex.
lines = url.replace(":443", "").replace(":8080", "")
Replacing ports this way is fragile, because you can never be sure which port numbers will actually appear in a URL. What if port 5460 or port 1022 shows up? For each such port you will have to add a new replace, and soon your code will look something like this:
lines = url.replace(":443", "").replace(":8080", "").replace(":5460","").replace(":1022","")...
Not very readable. But with a regular expression you can describe a pattern. And the great news is that we actually know the pattern for URLs with port numbers. They all look like this: :some_digits. So if we know the pattern, we can describe it with a regular expression and tell Python to find everything that matches it and replace it with the empty string '':
re.sub(r':\d+', '', url)
This tells the Python regular expression engine:
Look for every run of digits in the string url that follows a : and replace it, together with the colon, with the empty string. This solution is shorter, safer, and far more readable than the chain of replace calls, so I suggest you read up on regular expressions a little. A great resource for learning about regular expressions is
this site. Here you can test your regex.
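As a quick check of the port-stripping pattern described above, here is a small sketch. `strip_port` is a hypothetical helper name and the second URL is invented; note that the pattern would also remove a ":123" that happens to appear later in a path, so the standard library's `urlsplit` is shown as an alternative way to get the host without the port.

```python
import re
from urllib.parse import urlsplit

def strip_port(url):
    """Remove a ':<digits>' port suffix such as ':443' or ':8080'."""
    return re.sub(r':\d+', '', url)

print(strip_port("http://www.google.com:443"))
print(strip_port("http://www.amazon.com:8080/asdd/asd/"))

# Alternative: let the URL parser separate the host from the port.
print(urlsplit("http://www.google.com:443").hostname)
```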
Explanation of Regular expressions in code
re.sub(r'www\.', '', info_in_line[url_position])
Look for every www. in the string info_in_line[url_position] and replace it with the empty string. The dot is escaped (\.) so that it matches a literal dot rather than any character.
re.search('www.(.[^/]*)[:|/]', url).group(1)
Let's split it into parts:
[^/] - any character except / can appear here.
(.[^/]*) - here I used a capturing group. It tells the engine which part of the match we are interested in.
[:|/] - the characters that may appear at that position. Long story short: the capturing group may be followed by : or /. (Inside square brackets | is a literal character, not "or", so [:/] would be enough.)
So, to summarize, the regex can be expressed in words as follows:
Find the substring that starts with www. and ends with : or /, and return everything that stands between them.
group(1) - returns the text matched by the first capturing group.
Hope the answer is helpful!
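To make the capturing-group idea concrete, the sketch below uses a slightly different pattern from the one in the answer: an optional "www." followed by a group that stops at the first : or /. It also guards against URLs that do not match, since calling group(1) on a failed search raises an error. `DOMAIN_RE` and `extract_domain` are illustrative names only.

```python
import re

# "//", an optional "www.", then capture everything up to the next ":" or "/".
DOMAIN_RE = re.compile(r'//(?:www\.)?([^/:]+)')

def extract_domain(url):
    """Return the host without 'www.' and port, or None if nothing matches."""
    match = DOMAIN_RE.search(url)
    return match.group(1) if match else None

for url in ("http://www.amazon.com/asdd/asd/",
            "http://www.google.com:443",
            "not a url"):
    print(url, "->", extract_domain(url))
```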
If you used the URL as the key in a dictionary, and had your IP address sets as the elements of the dictionary, would that achieve what you intended?
my_dict = {
    'www.amazon.com': {
        '102.12.14.22',
        '167.27.14.62',
        '197.99.94.32',
        '157.87.34.72',
    },
    'www.wakers.com': {'192.10.77.95'},
}
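One way to build such a dictionary of sets while reading the CSV row by row is sketched below. It assumes the source IP is in column 2 and the URL in column 6, as in the question's script; `group_ips_by_host` is a hypothetical helper, and the sample rows are shortened stand-ins ("ts" replaces the timestamp field) rather than the real file.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlsplit

def group_ips_by_host(rows, wanted, ip_col=2, url_col=6):
    """Map each wanted domain to the set of source IPs that requested it."""
    grouped = defaultdict(set)
    for row in rows:
        if len(row) <= max(ip_col, url_col):
            continue  # skip short or malformed rows
        host = urlsplit(row[url_col]).hostname or ""
        for domain in wanted:
            # Match the domain itself or any subdomain such as "www.".
            if host == domain or host.endswith("." + domain):
                grouped[domain].add(row[ip_col])  # the set drops duplicates
    return grouped

# Shortened sample rows with the same column positions as file1.csv.
sample = io.StringIO(
    '"ts","","102.12.14.22","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/"\n'
    '"ts","","164.16.37.75","52.222.194.116","200","CONNECT","http://www.google.com:443"\n'
    '"ts","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/"\n'
    '"ts","","192.10.77.95","21.323.12.96","200","CONNECT","http://www.wakers.com/sg/wew/"\n'
    '"ts","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/"\n'
)
grouped = group_ips_by_host(csv.reader(sample), ["amazon.com", "wakers.com"])
for domain in sorted(grouped):
    print(domain, sorted(grouped[domain]))
```

Because csv.reader yields one row at a time, the whole file never has to be held in memory, which matters for a file of around 1 GB.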
## I have reused your code to get your desired output (no pandas needed)
## Copy, paste and run the code to get the result
import csv
url_dict = {}
## STEP 1: Open file2.txt to get url names
with open("file2.txt", "r") as f:
    for i in f:
        val = i.strip().split(" ")[1]
        url_dict[val] = []
## STEP 2: 2.1 Open csv file 'file1.csv' to extract url name & ip address
## 2.2 Check if url from file2.txt is available from the extracted url from 'file1.csv'
## 2.3 Create a dictionary with the matched url & its ip address
## 2.4 Remove duplicates in ip addresses from same url
with open("file1.csv", 'r') as f: ## 2.1
    reader = csv.reader(f)
    for k in reader:
        #ko = set()
        srcip = k[2]
        #print(srcip)
        url = k[6]
        lines = url.replace(":443", "").replace(":8080", "")
        war = lines.split("//")[-1].split("/")[0].split('?')[0]
        for key, value in url_dict.items():
            if key in war: ## 2.2
                url_dict[key].append(srcip) ## 2.3
## 2.4
for key, value in url_dict.items():
    url_dict[key] = list(set(value))
## STEP 3: Print dictionary output to .TXT file
file3 = open('output_text.txt', 'w')
for key, value in url_dict.items():
    file3.write('\n' + key + '\n')
    for item in value:
        file3.write(' '*15 + item + '\n')
file3.close()

Python Django- Not Returning a Valid Result

I am generating all possible combinations of the given scrambled letters and storing them in a list. Then I check whether words from that list are in my database. Although the word is in the database, the lookup does not return it.
example for result list:
result = ['aargh', 'raagh', 'hraag']
Although there is a word called aargh in my database, it's not returning it.
for r in result:
    # print(r)
    try:
        actual = Dictionary.objects.get(word=r)
        print(actual.word)
    except:
        actual = 'Not found'
    print("Actual Word " + str(actual))
The words are stored in the 'Dictionary' table. What is wrong here?
You can check whether the word exists or not:
for r in result:
    actual = Dictionary.objects.filter(word__iexact=r).first()
    if actual:
        print(actual.word)
        actual = actual.word
    else:
        actual = 'Not found'
    print("Actual Word " + str(actual))
Try using icontains
Ex:
actual = Dictionary.objects.get(word__icontains=r)
Info on icontains

How to pass string variable into search function?

I am having issues passing a string variable into a search function.
Here is what I'm trying to accomplish:
I have a file full of values and I want to check the file to make sure a specific matching line exists before I proceed. I want to ensure that the line <endSW=UNIQUE-DNS-NAME-HERE<> exists if a valid <begSW=UNIQUE-DNS-NAME-HERE<> exists and is reachable.
Everything works fine until I call if searchForString(searchString,fileLoc): which always returns false. If I assign the variable 'searchString' a direct value and pass it it works, so I know it must be something with the way I'm combining the strings, but I can't seem to figure out what I'm doing wrong.
If I examine the data that 'searchForString' is using I see what seems to be valid values:
values in fileLines list:
['<begSW=UNIQUE-DNS-NAME-HERE<>', ' <begPortType=UNIQUE-PORT-HERE<>', ' <portNumbers=80,443,22<>', ' <endPortType=UNIQUE-PORT-HERE<>', '<endSW=UNIQUE-DNS-NAME-HERE<>']
value of searchVar:
<endSW=UNIQUE-DNS-NAME-HERE<>
An example of the entry in the file is:
<begSW=UNIQUE-DNS-NAME-HERE<>
<begPortType=UNIQUE-PORT-HERE<>
<portNumbers=80,443,22<>
<endPortType=UNIQUE-PORT-HERE<>
<endSW=UNIQUE-DNS-NAME-HERE<>
Here is the code in question:
def searchForString(searchVar,readFile):
    with open(readFile) as findMe:
        fileLines = findMe.read().splitlines()
        print fileLines
        print searchVar
        if searchVar in fileLines:
            return True
        return False
    findMe.close()
fileLoc = '/dir/folder/file'
fileLoc.lstrip()
fileLoc.rstrip()
with open(fileLoc,'r') as switchFile:
    for line in switchFile:
        #declare all the vars we need
        lineDelimiter = '#'
        endLine = '<>\n'
        begSWLine= '<begSW='
        endSWLine = '<endSW='
        begPortType = '<begPortType='
        endPortType = '<endPortType='
        portNumList = '<portNumbers='
        #skip over commented lines -(REMOVE THIS)
        if line.startswith(lineDelimiter):
            pass
        #checks the file for a valid switch name
        #checks to see if the host is up and reachable
        #checks to see if there is a file input is valid
        if line.startswith(begSWLine):
            #extract switch name from file
            switchName = line[7:-3]
            #check to make sure switch is up
            if pingCheck(switchName):
                print 'Ping success. Host is reachable.'
                searchString = endSWLine+switchName+'<>'
                #THIS PART IS SUCKING, WORKS WITH DIRECT STRING PASS
                #WONT WORK WITH A VARIABLE
                if searchForString(searchString,fileLoc):
                    print 'found!'
                else:
                    print 'not found'
Any advice or guidance would be extremely helpful.
Hard to tell without the file's contents, but I would try
switchName = line[7:-2]
So that would look like
>>> '<begSW=UNIQUE-DNS-NAME-HERE<>'[7:-2]
'UNIQUE-DNS-NAME-HERE'
Additionally, you could look into regex searches to make your cleanup more versatile.
import re
# re.findall(search_expression, string_to_search)
>>> re.findall('\=(.+)(?:\<)', '<begSW=UNIQUE-DNS-NAME-HERE<>')[0]
'UNIQUE-DNS-NAME-HERE'
>>> re.findall('\=(.+)(?:\<)', ' <portNumbers=80,443,22<>')[0]
'80,443,22'
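Building on the regex idea, the sketch below extracts the tag name and value without relying on fixed slice positions, and compares lines after stripping whitespace. Whether stray whitespace or a missing trailing newline is the actual cause in the question is only an assumption; `TAG_RE`, `parse_tag`, and `has_matching_end` are illustrative names, and the code is written for Python 3.

```python
import re

# Matches lines such as "<begSW=NAME<>" or "  <endSW=NAME<>\n".
TAG_RE = re.compile(r'^\s*<(\w+)=(.+)<>\s*$')

def parse_tag(line):
    """Return (tag, value) for a '<tag=value<>' line, or None."""
    match = TAG_RE.match(line)
    return match.groups() if match else None

def has_matching_end(lines, switch_name):
    """True if an '<endSW=switch_name<>' line exists, ignoring whitespace."""
    target = '<endSW=' + switch_name + '<>'
    return any(line.strip() == target for line in lines)

lines = [
    '<begSW=UNIQUE-DNS-NAME-HERE<>\n',
    ' <begPortType=UNIQUE-PORT-HERE<>\n',
    ' <portNumbers=80,443,22<>\n',
    ' <endPortType=UNIQUE-PORT-HERE<>\n',
    '<endSW=UNIQUE-DNS-NAME-HERE<>',  # last line without a newline
]
for line in lines:
    parsed = parse_tag(line)
    if parsed and parsed[0] == 'begSW':
        print(parsed[1], has_matching_end(lines, parsed[1]))
```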
I found "How to recursively iterate over XML tags in Python using ElementTree?" and used the methods detailed there to parse an XML file instead of using a TXT file.

Searching in a .txt file (JSON format) for a particular string and then printing a specific key

I have to take input from the user in the form of strings and then have to search for it in a .txt file which is in JSON format. If the text matches, X has to be done otherwise Y. For example if the user enters 'mac' my code should display the complete name(s) of the terms which contains the search term 'mac'.
My JSON file currently has Big Mac as an item, and when I search for 'mac' it shows nothing, whereas it should display (0 Big Mac). 0 is the index number, which is also required.
if option == 's':
    if 'name' in open('data.txt').read():
        sea = input ("Type a menu item name to search for: ")
        with open ('data.txt', 'r') as data_file:
            data = json.load(data_file)
        for line in data:
            if sea in line:
                print (data[index]['name'])
            else:
                print ('No such item exist')
    else:
        print ("The list is empty")
main()
I have applied a number of solutions but none works.
See How to search if dictionary value contains certain string with Python.
Since you know you are looking for the string within the value stored against the 'name' key, you can just change:
if sea in line
to:
if sea in line['name']
(or if sea in line.get('name', '') if there is a risk that one of your dictionaries might not have a 'name' key; the '' default avoids a TypeError when the key is missing).
However, you're attempting to use index without having set that up anywhere. If you need to keep track of where you are in the list, you'd be better off using enumerate:
for index, line in enumerate(data):
    if sea.lower() in line['name'].lower():
        print ((index, line['name']))
If you want 'm' to match 'Big Mac' then you will need to do case-insensitive matching ('m' is not the same as 'M'). See edit above, which converts all strings to lower case before comparing.
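Putting these pieces together, a self-contained version of the search might look like the sketch below. The menu data is made up and loaded from a string instead of data.txt, and `search_menu` is a hypothetical helper name. Collecting the matches first also lets the "no such item" message be printed once rather than once per non-matching entry.

```python
import json

def search_menu(items, term):
    """Return (index, name) pairs whose name contains term, ignoring case."""
    term = term.lower()
    return [(index, item['name'])
            for index, item in enumerate(items)
            if term in item.get('name', '').lower()]

# Made-up sample data standing in for the contents of data.txt.
data = json.loads('[{"name": "Big Mac"}, {"name": "Fries"}, {"name": "McFlurry"}]')

matches = search_menu(data, 'mac')
if matches:
    for index, name in matches:
        print(index, name)
else:
    print('No such item exists')
```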
