Grouping of documents having the same phone number - python

My database consists of a collection of a large number of hotels (approximately 121,000).
This is what a document in my collection looks like:
{
    "_id" : ObjectId("57bd5108f4733211b61217fa"),
    "autoid" : 1,
    "parentid" : "P01982.01982.110601173548.N2C5",
    "companyname" : "Sheldan Holiday Home",
    "latitude" : 34.169552,
    "longitude" : 77.579315,
    "state" : "JAMMU AND KASHMIR",
    "city" : "LEH Ladakh",
    "pincode" : 194101,
    "phone_search" : "9419179870|253013",
    "address" : "Sheldan Holiday Home|Changspa|Leh Ladakh-194101|LEH Ladakh|JAMMU AND KASHMIR",
    "email" : "",
    "website" : "",
    "national_catidlineage_search" : "/10255012/|/10255031/|/10255037/|/10238369/|/10238380/|/10238373/",
    "area" : "Leh Ladakh",
    "data_city" : "Leh Ladakh"
}
Each document can have one or more phone numbers separated by the "|" delimiter.
I have to group together documents having the same phone number.
By real time, I mean that when a user opens a particular hotel to see its details on the web interface, I should be able to display all the hotels linked to it, grouped by common phone numbers.
While grouping, if one hotel links to another and that hotel links to a third, then all three should be grouped together.
Example: Hotel A has phone numbers 1|2, B has phone numbers 3|4 and C has phone numbers 2|3; then A, B and C should be grouped together.
from pymongo import MongoClient
from pprint import pprint  # pretty print
import re  # for regex
#import unicodedata

client = MongoClient()
cLen = 0
cLenAll = 0
flag = 0
countA = 0
countB = 0
list = []
allHotels = []
conContact = []
conId = []
hotelTotal = []
splitListAll = []
contactChk = []
#We'll be passing the value later as a parameter via a function call
#hId = 37443;
regx = re.compile("^Vivanta", re.IGNORECASE)
#Connection
db = client.hotel
collection = db.hotelData
#Finding hotels wrt search input
for post in collection.find({"companyname": regx}):
    list.append(post)
#Copying all hotels into a list
for post1 in collection.find():
    allHotels.append(post1)
hotelIndex = 11  #Index of hotel selected from search result
conIndex = hotelIndex
x = list[hotelIndex]["companyname"]  #Name of selected hotel
y = list[hotelIndex]["phone_search"]  #Phone numbers of selected hotel
try:
    splitList = y.split("|")  #Splitting of phone numbers and storing in a list 'splitList'
except:
    splitList = y
print "Contact details of", x, ":"
#Printing all contacts...
for contact in splitList:
    print contact
    conContact.extend(contact)
    cLen = cLen + 1
print "No. of contacts in", x, "=", cLen
for i in allHotels:
    yAll = allHotels[countA]["phone_search"]
    try:
        splitListAll.append(yAll.split("|"))
        countA = countA + 1
    except:
        splitListAll.append(yAll)
        countA = countA + 1
# print splitListAll
#count = 0
#This block has errors
#Add code to stop when no new links occur and optimize the outer for loop
#for j in allHotels:
for contactAll in splitListAll:
    if contactAll in conContact:
        conContact.extend(contactAll)
        # contactChk = contactAll
        # if (set(conContact) & set(contactChk)):
        #     conContact = contactChk
        #     contactChk[:] = []  #drop contactChk list
        conId = allHotels[countB]["autoid"]
    countB = countB + 1
print "Printing the list of connected hotels..."
for final in collection.find({"autoid": conId}):
    print final
This is the code I wrote in Python. In it, I tried performing a linear search in a for loop. I am getting some errors as of now, but it should work once rectified.
I need an optimized version of this, as linear search has poor time complexity.
I am pretty new to this, so any other suggestions to improve the code are welcome.
Thanks.

The easiest answer to any Python in-memory search question is "use a dict". Dicts give O(1) average-case key access; lists give O(N).
Also remember that you can put a Python object into as many dicts (or lists), and as many times into one dict or list, as it takes. The objects are not copied; each entry is just a reference.
So the essentials will look like
hotelsbyphone = {}
for hotel in hotels:
    phones = hotel["phone_search"].split("|")
    for phone in phones:
        hotelsbyphone.setdefault(phone, []).append(hotel)
At the end of this loop, hotelsbyphone["123456"] will be a list of hotel objects which had "123456" as one of their phone_search strings. The key coding feature is the .setdefault(key, []) method which initializes an empty list if the key is not already in the dict, so that you can then append to it.
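An equivalent formulation uses collections.defaultdict, which creates the empty list automatically on first access. A minimal sketch with two invented hotel documents (the field names follow the question's schema):

```python
from collections import defaultdict

# Two made-up hotel documents for illustration.
hotels = [
    {"autoid": 1, "phone_search": "123456|777"},
    {"autoid": 2, "phone_search": "123456"},
]

# Index every hotel under each of its phone numbers.
hotelsbyphone = defaultdict(list)
for hotel in hotels:
    for phone in hotel["phone_search"].split("|"):
        hotelsbyphone[phone].append(hotel)
```

Both variants build the same index; defaultdict just hides the setdefault call.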
Once you have built this index, this will be fast
try:
    hotels = hotelsbyphone[x]
    # and process a list of one or more hotels
except KeyError:
    # no hotels exist with that number
    pass
Alternatively to try ... except, test if x in hotelsbyphone:
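The transitive grouping from the question (A and B linked through C) still needs one pass on top of this index. A minimal sketch using a breadth-first search over the phone index; the three hotel documents mirror the A/B/C example and are otherwise made up:

```python
def linked_hotels(start_hotel, hotelsbyphone):
    """Collect every hotel reachable from start_hotel via shared phone numbers."""
    seen_ids = set()
    group = []
    queue = [start_hotel]
    while queue:
        hotel = queue.pop()
        if hotel["autoid"] in seen_ids:
            continue
        seen_ids.add(hotel["autoid"])
        group.append(hotel)
        # Enqueue every hotel sharing any of this hotel's numbers.
        for phone in hotel["phone_search"].split("|"):
            queue.extend(hotelsbyphone.get(phone, []))
    return group

# The A/B/C scenario from the question: A-2-C and C-3-B chain together.
hotels = [
    {"autoid": "A", "phone_search": "1|2"},
    {"autoid": "B", "phone_search": "3|4"},
    {"autoid": "C", "phone_search": "2|3"},
]
hotelsbyphone = {}
for h in hotels:
    for p in h["phone_search"].split("|"):
        hotelsbyphone.setdefault(p, []).append(h)
group = linked_hotels(hotels[0], hotelsbyphone)  # contains A, C and B
```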


Is there a foolproof way of matching two similar string sequences?

This is the CSV in question. I got this data from Extra History, which is a video series that talks about many historical topics through many mini-series, such as 'Rome: The Punic Wars', or 'Europe: The First Crusades' (in the CSV). Episodes in these mini-series are numbered #1 up to #6 (though Justinian Theodora has two series, the first numbered #1-#6, the second #7-#12).
I would like to do some statistical analysis on these mini-series, which entails sorting the numbered episodes (e.g. episodes #2-#6) into their appropriate series, i.e. the end result should look something like this; I can then easily automate sorting into the appropriate Python list.
My Python code matches the #2-#6 episodes to the #1 episode correctly 99% of the time, with only one big error (in red) and one slight error (in yellow, because the first episode of that series is #7, not #1). However, I get the nagging feeling that there is an easier and more foolproof way, since the strings are well organized, with regular patterns in their names. Is that possible? And can I achieve that with my current code, or should I change it and approach the problem from a different angle?
import csv
import re

eh_csv = '/Users/Work/Desktop/Extra History Playlist Video Data.csv'
with open(eh_csv, newline='', encoding='UTF-8') as f:
    reader = csv.reader(f)
    data = list(reader)

series_first = []
series_rest = []
singles_music_lies = []
all_episodes = []  #list of names of all episodes

#separates all videos into 3 non-overlapping lists: first episodes of a series,
#the other numbered episodes of the series, and singles/music/lies videos
for video in data:
    all_episodes.append(video[0])
    #need regex b/c a normal string search for #1 also matches (Justinian &) Theodora #10
    if len(re.findall('\\b1\\b', video[0])) == 1:
        series_first.append(video[0])
    elif '#' not in video[0]:
        singles_music_lies.append(video[0])
    else:
        series_rest.append(video[0])
#Dice's Coefficient
#taken from John Rutledge's answer with NinjaMeTimbers' modification:
#https://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings
#------------------------------------------------------------------------------------------
def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                pairs2.remove(y)
                break
    return (2.0 * hit_count) / union
#-------------------------------------------------------------------------------------------
#Only take the first few words of the episode's name for comparison, b/c the first couple of words
#are 99% of the time the name of the series; can't make it too short or words like 'the', 'of', etc.
#will get matched (now or in the future), or too long because that increases the chance of a
#superfluous match; does much better than without limiting to the first few words
def first_three_words(name_string):
    #e.g. ''.join vs ' '.join
    first_three = ' '.join(name_string.split()[:5])  #--> 'The Haitian Revolution', slightly worse
    #first_three = ''.join(name_string.split()[:5])  #--> 'TheHaitianRevolution', slightly better
    return first_three

#compare a given episode with all first videos, and return a list of comparison scores
def compared_with_first(episode, series_name=series_first):
    episode_scores = []
    for i in series_name:
        x = first_three_words(episode)
        y = first_three_words(i)
        #comparison_score = round(string_similarity(episode, i), 4)
        comparison_score = round(string_similarity(x, y), 4)
        episode_scores.append((comparison_score, i))
    return episode_scores

matches = []
#go through videos number 2, 3, 4, etc. in a series and compare them with the first episode
#of every series, then get a comparison score
for episode in series_rest:
    scores_list = compared_with_first(episode)
    similarity_score = 0
    most_likely_match = []
    #go through the list of comparison scores returned from compared_with_first,
    #then append the current episode/highest score/first episode to
    #most_likely_match; repeat for all non-first episodes
    for score in scores_list:
        if score[0] > similarity_score:
            similarity_score = score[0]
            most_likely_match.clear()  #MIGHT HAVE BEEN THE CRUCIAL KEY
            most_likely_match.append((episode, score))
    matches.append(most_likely_match)

final_match = []
for i in matches:
    final_match.append((i[0][0], i[0][1][1], i[0][1][0]))

#just to get the output in the desired presentation
path = '/Users/Work/Desktop/'
with open('EH Sorting Episodes.csv', 'w', newline='', encoding='UTF-8') as csvfile:
    csvwriter = csv.writer(csvfile)
    for currentRow in final_match:
        csvwriter.writerow(currentRow)
        #print(currentRow)
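As a quick sanity check of the Dice coefficient used above: identical strings score 1.0, and partially overlapping strings score by shared bigrams. The two helpers are repeated here standalone so the snippet runs on its own:

```python
def get_bigrams(string):
    """Take a string and return a list of bigrams."""
    s = string.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]

def string_similarity(str1, str2):
    """Bigram (Dice) comparison; returns a match score between 0 and 1."""
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                pairs2.remove(y)  # each bigram in pairs2 can be matched once
                break
    return (2.0 * hit_count) / union

print(string_similarity("Rome: The Punic Wars", "Rome: The Punic Wars"))  # 1.0
print(string_similarity("night", "nacht"))  # 0.25 (only 'ht' is shared)
```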

Python: Matching Strings from an Array with Substrings from Texts in another Array

Currently I am crawling a webpage for newspaper articles using Python's BeautifulSoup library. These articles are stored in the object "details".
I also have the names of various streets stored in the object "lines". Now I want to search the articles for the street names contained in "lines".
If one of the street names is part of an article, I want to save the name of the street in an array.
If there is no match for an article (the selected article does not contain any of the street names), then there should be an empty element in the array.
So for example, let's assume the object "lines" would consist of ("Abbey Road", "St-John's Bridge", "West Lane", "Sunpoint", "East End").
The object "details" consists of 4 articles, of which 2 contain "Abbey Road" and "West Lane" (e.g. as in "Car accident on Abbey Road, three people hurt"). The other 2 articles don't contain any of names from "lines".
Then after matching the result should be an array like this:
[]["Abbey Road"][]["West Lane"]
I was also told to use vectorization for this, as my original data sample is quite big. However, I'm not familiar with using vectorization for string operations. Has anyone worked with this already?
My code currently looks like this; however, it only returns "-1" as the elements of my resulting array:
from bs4 import BeautifulSoup
import requests
import io
import re
import string
import numpy as np

my_list = []
for y in range(0, 2):
    y *= 27
    i = str(y)
    my_list.append('http://www.presseportal.de/blaulicht/suche.htx?q=' + 'einbruch' + '&start=' + i)

for link in my_list:
    # print(link)
    r = requests.get(link)
    r.encoding = 'utf-8'
    soup = BeautifulSoup(r.content, 'html.parser')

with open('a4.txt', encoding='utf8') as f:
    lines = f.readlines()
lines = [w.replace('\n', '') for w in lines]

details = soup.find_all(class_='news-bodycopy')
for class_element in details:
    details = class_element.get_text()

sdetails = ''.join(details)
slines = ''.join(lines)
i = str.find(sdetails, slines[1:38506])
print(i)
If someone wants to reproduce my experiment, the website URL is in the code above, and the crawling and storing of articles in the object "details" works properly, so the code can just be copied.
The .txt file with my original data for the object "lines" can be accessed in this Dropbox folder:
https://www.dropbox.com/s/o0cjk1o2ej8nogq/a4.txt?dl=0
Thanks a lot for any hints on how I can make this work, preferably via vectorization.
You could try something like this:
my_list = []
for y in range(0, 2):
    i = str(y)
    my_list.append('http://www.presseportal.de/blaulicht/suche.htx?q=einbruch&start=' + i)

for link in my_list:
    r = requests.get(link)
    soup = BeautifulSoup(r.content.decode('utf-8', 'ignore'), 'html.parser')
    details = soup.find_all(class_='news-bodycopy')

f = open('a4.txt')
lines = [line.rstrip('\r\n') for line in f]

result = []
for i in range(len(details)):
    found_in_line = 0
    for j in range(len(lines)):
        try:
            if details[i].get_text().index(lines[j].decode('utf-8', 'ignore')) is not None:
                result.append(lines[j])
                found_in_line = found_in_line + 1
        except:
            if (j == len(lines) - 1) and (found_in_line == 0):
                result.append(" ")
print result
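A simpler pattern than the nested index/try loop is to scan each article once against the full street list with a membership test; this keeps one result entry per article, empty when nothing matched. The street names and article texts below are stand-ins for the real "lines" and "details" data:

```python
# Stand-ins for the real data loaded from a4.txt and the crawled pages.
streets = ["Abbey Road", "St-John's Bridge", "West Lane", "Sunpoint", "East End"]
articles = [
    "Traffic is fine today.",
    "Car accident on Abbey Road, three people hurt",
    "Nothing to report.",
    "West Lane closed for repairs",
]

# One entry per article: the matched street names, or an empty list.
result = [[s for s in streets if s in text] for text in articles]
```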

Parsing data to lists of lists

I am running into problems when trying to parse data into a list of lists.
I am trying to scrape information about departments and their subjects.
However, as there are different numbers of subjects in each department, I need to create a list of lists so that I can later link the data together. I have managed to navigate the index error, and the problem seems to come from compiling the subject list.
from lxml import html
import requests

page = requests.get('URL')
page_source_code = html.fromstring(page.text)

departments_list = []
subject_list = []

for dep in range(1, 3):
    departments = page_source_code.xpath('tag' + str(dep) + ']tag/text()')
    ### print(dep, departments)
    if departments == []:
        pass
    else:
        departments_list.append(departments[0])
        for sub in range(1, 20):
            subjects = page_source_code.xpath('tag' + str(dep) + ']tag' + str(sub) + ']tag/text()')
            ### print(sub, subjects)
            if subjects == []:
                pass
            else:
                subject_list.append(subjects[0])

print('Department list ------ ', len(departments_list), departments_list, '\n')
print('Subject list ------ ', len(subject_list), subject_list)
My output looks like this:
Department list ------ 2 ['Department_1', 'Department_2']
Subject list ------ 7 ['Subject_1'(dep_1), 'Subject_2 '(dep_1), 'Subject_3 '(dep_1), 'Subject_4'(dep_1), 'Subject_5'(dep_2), 'Subject_6 '(dep_2), 'Subject_7 '(dep_2)']
This code seems to put all subjects into one list. I would like it as follows:
Subject list ------ 7 [['Subject_1'(dep_1), 'Subject_2 '(dep_1), 'Subject_3 '(dep_1), 'Subject_4'(dep_1)], ['Subject_5'(dep_2), 'Subject_6 '(dep_2), 'Subject_7 '(dep_2)']]
You need to declare the two sub-lists globally for the subject list,
and then check for the 'dep_1' or 'dep_2' word in the subjects[0] string.
#declare the lists for subjects
sub_list1 = []
sub_list2 = []
#this code goes inside the second for loop
if subjects[0].find('dep_1') == -1:
    sub_list2.append(subjects[0])
else:
    sub_list1.append(subjects[0])
#Remove the subject_list.append statement from the second for loop
#and put this at the end of both loops:
subject_list = [sub_list1, sub_list2]
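A more general pattern than hard-coding sub_list1/sub_list2 is to build a fresh inner list per department inside the outer loop and append it to subject_list; this works for any number of departments. Sketched here with a made-up lookup table standing in for the real XPath calls:

```python
# Hypothetical stand-in for the XPath lookups in the question.
fake_subjects = {
    1: ["Subject_1", "Subject_2", "Subject_3", "Subject_4"],
    2: ["Subject_5", "Subject_6", "Subject_7"],
}

departments_list = []
subject_list = []
for dep in range(1, 3):
    departments_list.append("Department_%d" % dep)
    dep_subjects = []                     # fresh inner list per department
    for sub in fake_subjects.get(dep, []):
        dep_subjects.append(sub)
    subject_list.append(dep_subjects)     # one inner list per department
```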

Including Exception in python

I have a file at /location/all-list-info.txt containing items in the following format:
aaa:xxx:abc.com:1857:xxx1:rel5t2:y
ifa:yyy:xyz.com:1858:yyy1:rel5t2:y
I process these items with the Python code below:
def pITEMName():
    global itemList
    itemList = str(raw_input('Enter pipe separated list of ITEMS : ')).upper().strip()
    items = itemList.split("|")
    count = len(items)
    print 'Total Distint Item Count : ', count
    pipelst = itemList.split('|')
    filepath = '/location/all-item-info.txt'
    f = open(filepath, 'r')
    for lns in f:
        split_pipe = lns.split(':', 1)
        if split_pipe[0] in pipelst:
            index = pipelst.index(split_pipe[0])
            del pipelst[index]
    for lns in pipelst:
        print lns, ' is wrong item Name'
    f.close()
    if podList:
After executing the above code, it gives a prompt:
Enter pipe separated list of ITEMS:
And then I pass the items:
Enter pipe separated list of ITEMS: aaa|ifa-mc|ggg-mc
Now, after pressing enter, the code proceeds as below:
Enter pipe separated list of ITEMS : aaa|ifa-mc|ggg-mc
Total Distint Item Count : 3
IFA-MC is wrong Item Name
GGG-MC is wrong Item Name
ITEMs Belonging to other Centers :
Item Count From Other Center = 0
ITEMs Belonging to Current Centers :
Active Items in US1 :
^IFA$
Test Active Items in US1 :
^AAA$
Ignored Item Count From Current center = 0
You Have Entered ItemList belonging to this center as: ^IFA$|^AAA$
Active Pod Count : 2
My question: if I suffix '-mc' to an item in the input, it is reported as a wrong item even when the base item is present in /location/all-item-info.txt. Please look at the output again:
IFA-MC is wrong Item Name
GGG-MC is wrong Item Name
In the above example, 'ifa' is present in /location/all-items-info.txt, whereas 'ggg' is not.
Please help me change the code so that items suffixed with -mc whose base names are present in /location/all-items-info.txt are not counted as wrong item names; only items genuinely absent from the file should be counted.
Thanks,
Ritesh.
If you want to avoid checking for -mc as well, then you can modify this part of your script:
pipelst = itemList.split('|')
to:
pipelst = [i.split('-')[0] for i in itemList.split('|')]
It's a bit unclear exactly what you are asking, but basically to ignore any '-mc' from user input, you can explicitly preprocess the user input to strip it out:
pipelst = itemList.split('|')
pipelst = [item.rsplit('-mc',1)[0] for item in pipelst]
If instead you want to allow for the possibility of -mc-suffixed words in the file as well, simply add the stripped version to the list instead of replacing it:
pipelst = itemList.split('|')
for item in list(pipelst):
    if item.endswith('-mc'):
        pipelst.append(item.rsplit('-mc', 1)[0])
Another issue may be based on the example lines you gave from /location/all-list-info.txt, it sounds like all the items are lowercase. However, pipelst is explicitly making the user input all uppercase. String equality and in mechanics is case-sensitive, so for instance
>>> print 'ifa-mc' in ['IFA-MC']
False
You probably want:
itemList = str(raw_input('Enter pipe separated list of ITEMS : ')).lower().strip()
and you could use .upper() only when printing or wherever it is needed
Finally, there are a few other things that could be tweaked with the code just to make things a bit faster and cleaner. The main one that comes to mind is it seems like pipelst should be a python set and not a list as checking inclusion and removal would then be much faster for large lists, and the code to remove an item from a set is much cleaner:
>>> desserts = set(['ice cream', 'cookies', 'cake'])
>>> if 'cake' in desserts:
...     desserts.remove('cake')
...
>>> print desserts
set(['cookies', 'ice cream'])
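Putting these suggestions together, a hedged sketch of the whole lookup: lowercase the input, strip an optional -mc suffix before checking, and keep the known items in a set. The file contents are shown inline here instead of reading /location/all-item-info.txt:

```python
# Lines as they would appear in the file; the first field is the item name.
file_lines = [
    "aaa:xxx:abc.com:1857:xxx1:rel5t2:y",
    "ifa:yyy:xyz.com:1858:yyy1:rel5t2:y",
]
known_items = {line.split(":", 1)[0] for line in file_lines}

user_input = "aaa|ifa-mc|ggg-mc"  # stand-in for raw_input()
entered = [item.lower().strip() for item in user_input.split("|")]

wrong = []
for item in entered:
    # Compare on the base name so 'ifa-mc' matches the file's 'ifa'.
    base = item[:-3] if item.endswith("-mc") else item
    if base not in known_items:
        wrong.append(item)
# Only 'ggg-mc' is reported: its base 'ggg' is absent from the file.
```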

Python dataframe issue

I have the following dataframe, scraped from IMDb. What I need to do is extract movies with a score lower than 5 that received more than 100,000 votes. My problem is that I don't understand what the last few code lines about the voting really do.
# two lists, one for movie data, the other for vote data
movie_data = []
vote_data = []

# this will do some reformatting to get the right unicode escape
hexentityMassage = [(re.compile('&#x([^;]+);'), lambda m: '&#%d;' % int(m.group(1), 16))]  # converts XML/HTML entities into unicode strings in Python

for i in range(20):
    next_url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=%d&title_type=feature&year=1950,2012' % (i*50+1)
    r = requests.get(next_url)
    bs = BeautifulSoup(r.text, convertEntities=BeautifulSoup.HTML_ENTITIES, markupMassage=hexentityMassage)
    # movie info is found in the table cell called 'title'
    for movie in bs.findAll('td', 'title'):
        title = movie.find('a').contents[0].replace('&amp;', '&')  # get '&' as in 'Batman & Robin'
        genres = movie.find('span', 'genre').findAll('a')
        year = int(movie.find('span', 'year_type').contents[0].strip('()'))
        genres = [g.contents[0] for g in genres]
        runtime = movie.find('span', 'runtime').contents[0]
        rating = float(movie.find('span', 'value').contents[0])
        movie_data.append([title, genres, runtime, rating, year])
    # rating info is found in a separate cell called 'sort_col'
    for voting in bs.findAll('td', 'sort_col'):
        vote_data.append(int(voting.contents[0].replace(',', '')))
Your problem is this snippet:
for voting in bs.findAll('td', 'sort_col'):
    vote_data.append(int(voting.contents[0].replace(',','')))
Here you are looping over all the td tags that have the class sort_col (i.e. class="sort_col"). In the second line, you are:
replacing ',' with '' (the empty string) in the first element of the list returned by voting.contents,
casting the result to int,
and then appending it to vote_data.
If I break this up, it will look like this:
for voting in bs.findAll('td', 'sort_col'):
    # voting.contents returns a list like this: [u'377,936']
    str_vote = voting.contents[0]
    # str_vote will be '377,936'
    int_vote = int(str_vote.replace(',', ''))
    # int_vote will be 377936
    vote_data.append(int_vote)
Print the values in the loop to get a better understanding. If you tag your question correctly, you might get a good answer faster.
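Once the two lists are filled, answering the original question (score below 5 with more than 100,000 votes) is a single filtering pass, assuming movie_data and vote_data line up row for row; the three rows here are invented:

```python
# Invented rows in the same shape the scraping loop produces:
# [title, genres, runtime, rating, year], plus a parallel vote count.
movie_data = [
    ["Movie A", ["Drama"], "120 mins.", 4.1, 1997],
    ["Movie B", ["Action"], "90 mins.", 7.5, 2005],
    ["Movie C", ["Comedy"], "100 mins.", 3.9, 2010],
]
vote_data = [150000, 300000, 50000]

# Keep movies rated below 5 that received more than 100,000 votes.
low_rated_popular = [
    movie for movie, votes in zip(movie_data, vote_data)
    if movie[3] < 5 and votes > 100000
]
```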
