Parsing data to lists of lists - python

I am running into problems when trying to parse data to a list of lists.
I am trying to scrape information about departments and their subjects.
However, as there are different numbers of subjects in each department I need to create a list of lists so that I may later link the data together. I have managed to navigate the index error and the problem seems to come from compiling the subject list.
from lxml import html
import requests
page = requests.get('URL')
page_source_code = html.fromstring(page.text)
departments_list = []
subject_list = []
for dep in range(1,3):
    departments = page_source_code.xpath('tag'
                                         +str(dep)+']tag/text()')
    ### print(dep, departments)
    if departments == []:
        pass
    else:
        departments_list.append(departments[0])
    for sub in range(1,20):
        subjects = page_source_code.xpath('tag'
                                          +str(dep)+']tag'
                                          +str(sub)+']tag/text()')
        ### print(sub, subjects)
        if subjects == []:
            pass
        else:
            subject_list.append(subjects[0])
print('Department list ------ ', len(departments_list), departments_list, '\n')
print('Subject list ------ ', len(subject_list), subject_list)
My output looks like this:
Department list ------ 2 ['Department_1', 'Department_2']
Subject list ------ 7 ['Subject_1'(dep_1), 'Subject_2 '(dep_1), 'Subject_3 '(dep_1), 'Subject_4'(dep_1), 'Subject_5'(dep_2), 'Subject_6 '(dep_2), 'Subject_7 '(dep_2)']
This code seems to put all subjects into one list. I would like it as follows:
Subject list ------ 7 [['Subject_1'(dep_1), 'Subject_2 '(dep_1), 'Subject_3 '(dep_1), 'Subject_4'(dep_1)], ['Subject_5'(dep_2), 'Subject_6 '(dep_2), 'Subject_7 '(dep_2)']]

You need two separate subject lists, declared globally,
and you need to look for the 'dep_1' or 'dep_2' marker in the subjects[0] string.
# declare the lists for the subjects
sub_list1 = []
sub_list2 = []
# this code goes inside the second for loop
if subjects[0].find('dep_1') == -1:
    sub_list2.append(subjects[0])
else:
    sub_list1.append(subjects[0])
# Remove the subject_list.append statement
# from the second for loop
# and build the list after both loops, like this:
subject_list = [sub_list1, sub_list2]
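A more general approach, which does not depend on the number of departments, is to start a fresh inner list for each department and append it to the outer list once the inner loop finishes. The sketch below is only an illustration: `FAKE_PAGE`, `FAKE_DEPARTMENTS`, `find_department` and `find_subject` are hypothetical stand-ins for the `page_source_code.xpath(...)` calls in the question, so that the example runs without a real page.

```python
# Hypothetical stand-in for the scraped page: position (dep, sub) -> text.
# In the real script these values would come from page_source_code.xpath(...).
FAKE_PAGE = {
    (1, 1): "Subject_1", (1, 2): "Subject_2", (1, 3): "Subject_3", (1, 4): "Subject_4",
    (2, 1): "Subject_5", (2, 2): "Subject_6", (2, 3): "Subject_7",
}
FAKE_DEPARTMENTS = {1: "Department_1", 2: "Department_2"}


def find_department(dep):
    """Mimic xpath(): return a list with the match, or an empty list."""
    return [FAKE_DEPARTMENTS[dep]] if dep in FAKE_DEPARTMENTS else []


def find_subject(dep, sub):
    """Mimic xpath(): return a list with the match, or an empty list."""
    return [FAKE_PAGE[(dep, sub)]] if (dep, sub) in FAKE_PAGE else []


departments_list = []
subject_list = []
for dep in range(1, 3):
    departments = find_department(dep)
    if not departments:
        continue  # no such department, so there are no subjects to collect
    departments_list.append(departments[0])

    dep_subjects = []  # fresh inner list for this department only
    for sub in range(1, 20):
        subjects = find_subject(dep, sub)
        if subjects:
            dep_subjects.append(subjects[0])
    subject_list.append(dep_subjects)  # one inner list per department

print(departments_list)
print(subject_list)
```

Because every department contributes exactly one inner list, `departments_list[i]` and `subject_list[i]` stay aligned, which makes it easy to link the data later, for example with `zip(departments_list, subject_list)`.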

Related

python splitting array into 2 arrays

I'm trying to separate the information in this array. Each entry has a title and a link:
the first part is the title and the last part is the URL.
u'6 Essential Tips on How to Become a Full Stack Developer',
u'https://hackernoon.com/6-essential-tips-on-how-to-become-a-full-stack-developer-1d10965aaead'
u'6 Essential Tips on How to Become a Full Stack Developer',
u'https://hackernoon.com/6-essential-tips-on-how-to-become-a-full-stack-developer-1d10965aaead'
u'What is a Full-Stack Developer? - Codeup',
u'https://codeup.com/what-is-a-full-stack-developer/'
u'A Guide to Becoming a Full-Stack Developer in 2017 \u2013 Coderbyte ...',
u'https://medium.com/coderbyte/a-guide-to-becoming-a-full-stack-developer-in-2017-5c3c08a1600c'
I want to be able to push the titles into one list and the links into another, instead of having everything in one list.
Here is my current code:
main.py
from gsearch.googlesearch import search
from orderedset import OrderedSet
import re
import csv
results = search('Full Stack Developer') # returns 10 or less results
myAraay = list()
for x in results:
    owl = re.sub('[\(\)\{\}<>]', '', str(x))
    myAraay.append(owl)
newArray = "\n".join(map(str, myAraay))
print(newArray)
Updated Main.py (now getting cannot concatenate 'str' and 'int' objects)
from gsearch.googlesearch import search
from orderedset import OrderedSet
import re
import csv
results = search('Full Stack Developer') # returns 10 or less results
myAraay = list()
for x in results:
    owl = re.sub('[\(\)\{\}<>]', '', str(x))
    myAraay.append(owl)
newArray = "\n".join(map(str, myAraay))
theDict = {} #keep a dictionary for title:link
for idx, i in enumerate(newArray): #loop through list
    if idx % 2 == 0: #the title goes first...
        dict[i] = dict[i+1] #add title and link to dictionary
    else: #link comes second
        continue #skip it, go to next title
print(theDict)
Assuming that the list is formatted such that the link always follows the title...
theDict = {} #keep a dictionary for title:link
for idx, i in enumerate(OriginalList): #loop through list
    if idx % 2 == 0: #the title goes first...
        theDict[i] = OriginalList[idx+1] #add title and link to dictionary
    else: #link comes second
        continue #skip it, go to next title
Here it looks like:
{"How to be Cool" : "http://www.cool.org/cool"}
If you want lists instead of a dictionary, it's much the same...
articles = [] #keep a list of [title, link] pairs
for idx, i in enumerate(OriginalList): #loop through list
    if idx % 2 == 0: #the title goes first...
        articles.append([i, OriginalList[idx+1]]) #add title and link to list
    else: #link comes second
        continue #skip it, go to next title
So here it looks like:
[["How to be Cool", "http://www.cool.org/cool"]]
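If the goal is two separate lists, one for titles and one for links, extended slicing may be simpler than an index-based loop. This is a minimal sketch that assumes the flat list strictly alternates title, link, title, link; `original_list` holds made-up sample data.

```python
# Made-up sample data in the same alternating title/link layout.
original_list = [
    "How to be Cool", "http://www.cool.org/cool",
    "What is a Full-Stack Developer? - Codeup", "https://codeup.com/what-is-a-full-stack-developer/",
]

titles = original_list[0::2]  # items at even positions: 0, 2, 4, ...
links = original_list[1::2]   # items at odd positions: 1, 3, 5, ...

# zip() pairs each title with its link, which also gives the dictionary directly.
title_to_link = dict(zip(titles, links))

print(titles)
print(links)
```

Note that a dictionary keeps only one link per title, so duplicate titles collapse into a single entry, while the two lists preserve every item.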

Grouping of documents having the same phone number

My database consists of a collection of a large number of hotels (approx. 121,000).
This is what a document in my collection looks like:
{
    "_id" : ObjectId("57bd5108f4733211b61217fa"),
    "autoid" : 1,
    "parentid" : "P01982.01982.110601173548.N2C5",
    "companyname" : "Sheldan Holiday Home",
    "latitude" : 34.169552,
    "longitude" : 77.579315,
    "state" : "JAMMU AND KASHMIR",
    "city" : "LEH Ladakh",
    "pincode" : 194101,
    "phone_search" : "9419179870|253013",
    "address" : "Sheldan Holiday Home|Changspa|Leh Ladakh-194101|LEH Ladakh|JAMMU AND KASHMIR",
    "email" : "",
    "website" : "",
    "national_catidlineage_search" : "/10255012/|/10255031/|/10255037/|/10238369/|/10238380/|/10238373/",
    "area" : "Leh Ladakh",
    "data_city" : "Leh Ladakh"
}
Each document can have one or more phone numbers separated by the "|" delimiter.
I have to group together documents that share a phone number, and I need to do it in real time.
By real time, I mean that when a user opens a particular hotel to see its details on the web interface, I should be able to display all the hotels linked to it, grouped by common phone numbers.
The grouping is transitive: if one hotel links to another and that hotel links to a third, then all three should be grouped together.
Example: Hotel A has phone numbers 1|2, B has phone numbers 3|4 and C has phone numbers 2|3, so A, B and C should be grouped together.
from pymongo import MongoClient
from pprint import pprint #Pretty print
import re #for regex
#import unicodedata
client = MongoClient()
cLen = 0
cLenAll = 0
flag = 0
countA = 0
countB = 0
list = []
allHotels = []
conContact = []
conId = []
hotelTotal = []
splitListAll = []
contactChk = []
#We'll be passing the value later as parameter via a function call
#hId = 37443;
regx = re.compile("^Vivanta", re.IGNORECASE)
#Connection
db = client.hotel
collection = db.hotelData
#Finding hotels wrt search input
for post in collection.find({"companyname":regx}):
    list.append(post)
#Copying all hotels in a list
for post1 in collection.find():
    allHotels.append(post1)
hotelIndex = 11 #Index of hotel selected from search result
conIndex = hotelIndex
x = list[hotelIndex]["companyname"] #Name of selected hotel
y = list[hotelIndex]["phone_search"] #Phone numbers of selected hotel
try:
    splitList = y.split("|") #Splitting of phone numbers and storing in a list 'splitList'
except:
    splitList = y
print "Contact details of",x,":"
#Printing all contacts...
for contact in splitList:
    print contact
    conContact.extend(contact)
    cLen = cLen+1
print "No. of contacts in",x,"=",cLen
for i in allHotels:
    yAll = allHotels[countA]["phone_search"]
    try:
        splitListAll.append(yAll.split("|"))
        countA = countA+1
    except:
        splitListAll.append(yAll)
        countA = countA + 1
# print splitListAll
#count = 0
#This block has errors
#Add code to stop when no new links occur and optimize the outer for loop
#for j in allHotels:
for contactAll in splitListAll:
    if contactAll in conContact:
        conContact.extend(contactAll)
        # contactChk = contactAll
        # if (set(conContact) & set(contactChk)):
        # conContact = contactChk
        # contactChk[:] = [] #drop contactChk list
        conId = allHotels[countB]["autoid"]
    countB = countB+1
print "Printing the list of connected hotels..."
for final in collection.find({"autoid":conId}):
    print final
This is the code I wrote in Python. In it, I tried performing a linear search in a for loop. I am getting some errors at the moment, but it should work once they are fixed.
I need an optimized version of this, as linear search has poor time complexity.
I am pretty new to this, so any other suggestions to improve the code are welcome.
Thanks.
The easiest answer to any Python in-memory search-for question is "use a dict". Dict key access is O(1) on average, while searching a list is O(N).
Also remember that you can put a Python object into as many dicts (or lists), and as many times into one dict or list, as it takes. The object is not copied; each entry is just a reference.
So the essentials will look like
hotelsbyphone = {}
for hotel in hotels:
    phones = hotel["phone_search"].split("|")
    for phone in phones:
        hotelsbyphone.setdefault(phone,[]).append(hotel)
At the end of this loop, hotelsbyphone["123456"] will be a list of hotel objects which had "123456" as one of their phone_search strings. The key coding feature is the .setdefault(key, []) method, which initializes an empty list if the key is not already in the dict, so that you can then append to it.
Once you have built this index, this will be fast
try:
    hotels = hotelsbyphone[x]
    # and process a list of one or more hotels
except KeyError:
    pass # no hotels exist with that number
Alternatively to try ... except, test if x in hotelsbyphone:
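The phone index alone returns only the hotels that share a number directly. The question also asks for transitive grouping (A links to C, C links to B, so all three belong together). One way to get that, sketched here as an assumption rather than as part of the answer above, is a breadth-first search that walks from hotel to hotel through the index. The sample records are made up to mirror the A/B/C example from the question, and `phones_of` and `linked_hotels` are hypothetical helper names.

```python
from collections import deque

# Made-up sample records following the example in the question:
# A has phones 1|2, B has 3|4, C has 2|3, and D shares no number with them.
hotels = [
    {"autoid": 1, "companyname": "A", "phone_search": "1|2"},
    {"autoid": 2, "companyname": "B", "phone_search": "3|4"},
    {"autoid": 3, "companyname": "C", "phone_search": "2|3"},
    {"autoid": 4, "companyname": "D", "phone_search": "9"},
]


def phones_of(hotel):
    """Split the phone_search field, ignoring empty entries."""
    return [p for p in hotel["phone_search"].split("|") if p]


# Index: phone number -> hotels that list it.
hotels_by_phone = {}
for hotel in hotels:
    for phone in phones_of(hotel):
        hotels_by_phone.setdefault(phone, []).append(hotel)


def linked_hotels(start):
    """Return every hotel reachable from `start` through shared phone numbers."""
    seen_ids = {start["autoid"]}
    group = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for phone in phones_of(current):
            for other in hotels_by_phone.get(phone, []):
                if other["autoid"] not in seen_ids:
                    seen_ids.add(other["autoid"])
                    group.append(other)
                    queue.append(other)
    return group


group_names = sorted(h["companyname"] for h in linked_hotels(hotels[0]))
print(group_names)
```

Each hotel enters the queue at most once, so a lookup costs time proportional to the size of the group it returns rather than to the whole collection. If the groups are needed for every hotel, they could be computed once and cached instead of being rebuilt per request.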

Including Exception in python

I have a file at /location/all-list-info.txt that contains items in the following format:
aaa:xxx:abc.com:1857:xxx1:rel5t2:y
ifa:yyy:xyz.com:1858:yyy1:rel5t2:y
I process these items with the Python code below:
def pITEMName():
    global itemList
    itemList = str(raw_input('Enter pipe separated list of ITEMS : ')).upper().strip()
    items = itemList.split("|")
    count = len(items)
    print 'Total Distint Item Count : ', count
    pipelst = itemList.split('|')
    filepath = '/location/all-item-info.txt '
    f = open(filepath, 'r')
    for lns in f:
        split_pipe = lns.split(':', 1)
        if split_pipe[0] in pipelst:
            index = pipelst.index(split_pipe[0])
            del pipelst[index]
    for lns in pipelst:
        print lns,' is wrong item Name'
    f.close()
    if podList:
When I run the above Python code, it gives this prompt:
Enter pipe separated list of ITEMS:
Then I pass the items:
Enter pipe separated list of ITEMS: aaa|ifa-mc|ggg-mc
After I press Enter, the code proceeds as below:
Enter pipe separated list of ITEMS : aaa|ifa-mc|ggg-mc
Total Distint Item Count : 3
IFA-MC is wrong Item Name
GGG-MC is wrong Item Name
ITEMs Belonging to other Centers :
Item Count From Other Center = 0
ITEMs Belonging to Current Centers :
Active Items in US1 :
^IFA$
Test Active Items in US1 :
^AAA$
Ignored Item Count From Current center = 0
You Have Entered ItemList belonging to this center as: ^IFA$|^AAA$
Active Pod Count : 2
My question is: if I add the '-mc' suffix to an item in the input, it is reported as a wrong item even when the item is present in /location/all-item-info.txt, exactly as if it were not present in the file. Please have a look at this part of the output again:
IFA-MC is wrong Item Name
GGG-MC is wrong Item Name
In the above example, 'ifa' is present in /location/all-items-info.txt whereas 'ggg' is not.
Please help me with what I can change in the above code so that an item suffixed with -mc that is present in /location/all-items-info.txt is not counted as a wrong item name. Only items that are not present in /location/all-items-info.txt should be counted.
Please help.
Thanks,
Ritesh.
If you want to avoid checking for -mc as well, then you can modify this part of your script -
pipelst = itemList.split('|')
To -
pipelst = [i.split('-')[0] for i in itemList.split('|')]
It's a bit unclear exactly what you are asking, but basically to ignore any '-mc' from user input, you can explicitly preprocess the user input to strip it out:
pipelst = itemList.split('|')
pipelst = [item.rsplit('-mc',1)[0] for item in pipelst]
If instead you want to allow for the possibility of -mc-suffixed words in the file as well, simply add the stripped version to the list instead of replacing:
pipelst = itemList.split('|')
for item in list(pipelst): # iterate over a copy, since pipelst grows inside the loop
    if item.endswith('-mc'):
        pipelst.append(item.rsplit('-mc',1)[0])
Another possible issue: based on the example lines you gave from /location/all-list-info.txt, it sounds like all the items are lowercase. However, the code explicitly converts the user input to uppercase. String equality and the in operator are case-sensitive, so for instance
>>> print 'ifa-mc' in ['IFA-MC']
False
You probably want:
itemList = str(raw_input('Enter pipe separated list of ITEMS : ')).lower().strip()
and you could use .upper() only when printing or wherever it is needed
Finally, there are a few other things that could be tweaked with the code just to make things a bit faster and cleaner. The main one that comes to mind is it seems like pipelst should be a python set and not a list as checking inclusion and removal would then be much faster for large lists, and the code to remove an item from a set is much cleaner:
>>> desserts = set(['ice cream', 'cookies', 'cake'])
>>> if 'cake' in desserts:
...     desserts.remove('cake')
...
>>> print desserts
set(['cookies', 'ice cream'])
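Putting these suggestions together, the check could be sketched as below. This is an illustration in Python 3 rather than a drop-in replacement for the original script: `find_wrong_items` is a hypothetical helper, the file is replaced by a list of sample lines taken from the question, and it assumes that a trailing "-mc" is only a suffix and never part of the real item name.

```python
def find_wrong_items(user_input, file_lines):
    """Return the entered item names that have no match in the file.

    user_input: pipe-separated names as typed, e.g. "aaa|ifa-mc|ggg-mc".
    file_lines: iterable of colon-separated lines; the first field is the name.
    """
    # Names known to the file, lowercased so the comparison ignores case.
    known = {
        line.split(":", 1)[0].strip().lower()
        for line in file_lines
        if line.strip()
    }

    wrong = []
    for raw in user_input.strip().split("|"):
        name = raw.strip().lower()
        # Assumption: a trailing "-mc" is only a suffix, not part of the name.
        base = name[:-len("-mc")] if name.endswith("-mc") else name
        if name not in known and base not in known:
            wrong.append(raw.strip().upper())
    return wrong


sample_lines = [
    "aaa:xxx:abc.com:1857:xxx1:rel5t2:y",
    "ifa:yyy:xyz.com:1858:yyy1:rel5t2:y",
]
print(find_wrong_items("aaa|ifa-mc|ggg-mc", sample_lines))
```

With the sample lines, only GGG-MC is reported, because 'ifa' exists in the file once the suffix is stripped. Using a set for the known names keeps each membership test fast even for a large file.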

Python dataframe issue

I have the following dataframe. It is from IMDb. What I need to do is extract the movies with a score lower than 5 that received more than 100000 votes. My problem is that I don't understand what the last lines of code, the ones about the voting, really do.
import re
import requests
from BeautifulSoup import BeautifulSoup # BeautifulSoup 3, which provides convertEntities and markupMassage
# two lists, one for movie data, the other of vote data
movie_data=[]
vote_data=[]
# this will do some reformating to get the right unicode escape for
hexentityMassage = [(re.compile('&#x([^;]+);'), lambda m: '&#%d;' % int(m.group(1), 16))] # converts XML/HTML entities into unicode string in Python
for i in range(20):
    next_url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=%d&title_type=feature&year=1950,2012'%(i*50+1)
    r = requests.get(next_url)
    bs = BeautifulSoup(r.text,convertEntities=BeautifulSoup.HTML_ENTITIES,markupMassage=hexentityMassage)
    # movie info is found in the table cell called 'title'
    for movie in bs.findAll('td', 'title'):
        title = movie.find('a').contents[0].replace('&amp;','&') #get '&' as in 'Batman & Robin'
        genres = movie.find('span', 'genre').findAll('a')
        year = int(movie.find('span', 'year_type').contents[0].strip('()'))
        genres = [g.contents[0] for g in genres]
        runtime = movie.find('span', 'runtime').contents[0]
        rating = float(movie.find('span', 'value').contents[0])
        movie_data.append([title, genres, runtime, rating, year])
    # rating info is found in a separate cell called 'sort_col'
    for voting in bs.findAll('td', 'sort_col'):
        vote_data.append(int(voting.contents[0].replace(',','')))
Your problem is this snippet,
for voting in bs.findAll('td', 'sort_col'):
    vote_data.append(int(voting.contents[0].replace(',','')))
Here you are looping over all the td tags whose class attribute is sort_col, that is, class="sort_col".
In the second line, you are:
replacing the ',' with '' (empty string) in the first element of the list returned by voting.contents,
casting it to int,
then appending it to vote_data.
If I break this up, it will look like this,
for voting in bs.findAll('td', 'sort_col'):
    # voting.contents returns a list like this [u'377,936']
    str_vote = voting.contents[0]
    # str_vote will be '377,936'
    int_vote = int(str_vote.replace(',', ''))
    # int_vote will be 377936
    vote_data.append(int_vote)
Print the values in the loop to get a better understanding. If you tagged your question right you might get a good answer faster.
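Once both lists are filled, the filtering the question asks for (score lower than 5, more than 100000 votes) can be sketched by pairing the two lists with zip(). This assumes that `movie_data` and `vote_data` line up row by row, which holds only if every page yields the same number of title cells and vote cells. The rows below are made-up placeholders, not real IMDb data.

```python
# Made-up rows in the same shape the scraping loop builds:
# each movie_data row is [title, genres, runtime, rating, year].
movie_data = [
    ["Example Movie A", ["Action"], "120 mins.", 4.3, 1997],
    ["Example Movie B", ["Drama"], "142 mins.", 9.2, 1994],
    ["Example Movie C", ["Comedy"], "95 mins.", 4.9, 2008],
]
vote_data = [150000, 900000, 80000]

MAX_RATING = 5.0    # keep movies rated strictly below this
MIN_VOTES = 100000  # keep movies with strictly more votes than this

low_rated_popular = [
    (movie[0], movie[3], votes)  # (title, rating, votes)
    for movie, votes in zip(movie_data, vote_data)
    if movie[3] < MAX_RATING and votes > MIN_VOTES
]
print(low_rated_popular)
```

Only "Example Movie A" passes both conditions here: B is rated too high and C has too few votes.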

Matching regex to list items in Python

I am attempting to write a Python script that shows the URL flow on my installation of nginx. My script currently opens my 'rewrites' file, which contains a list of regexes and locations like so:
rewritei ^/ungrad/info.cfm$ /ungrad/info/ permanent;
So far, Python reads the file and trims the first and last word off each line (rewritei and permanent;), which leaves a list like so:
[
['^/ungrad/info.cfm$', '/ungrad/info'],
['^/admiss/testing.cfm$', '/admiss/testing'],
['^/ungrad/testing/$', '/ungrad/info.cfm']
]
This results in the first element being the URL watched, and the second being the URL redirected to. What I would like to do now, is take each of the first elements, and run the regex over the entire list, and check if it matches any of the second elements.
With the example above, [0][0] would match [2][1].
However I am having trouble thinking of a good and efficient way to do this.
import re
a = [
    ['^/ungrad/info.cfm$', '/ungrad/info'],
    ['^/admiss/testing.cfm$', '/admiss/testing'],
    ['^/ungrad/testing/$', '/ungrad/info.cfm']
]
def matchingfun(b):
    # collect every [regex, target] pair whose target URL matches the compiled regex b
    matchedurl = []
    for pair in a: # iterating the main list
        c = b.match(pair[1]) # matching the regex against the second element
        if c:
            matchedurl.append(pair)
    return matchedurl
result1 = []
for pair in a:
    b = re.compile(pair[0]) # the first element is the regex
    result1.extend(matchingfun(b))
print "matched url is", result1
This is a bit inefficient, I guess, since it compares every regex against every target, but it does the job. Hope this answers your query. The above snippet prints the pairs whose second item is matched by one of the regexes in the list.
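To see which rule feeds into which, it may help to return index pairs instead of the matched entries. The following Python 3 sketch compiles each pattern once and reports (i, j) whenever the pattern of rule i matches the target of rule j. `find_chained_rewrites` is a hypothetical helper name, and the rules are the sample pairs from the question.

```python
import re

# Sample [pattern, target] pairs copied from the question.
rewrites = [
    ["^/ungrad/info.cfm$", "/ungrad/info"],
    ["^/admiss/testing.cfm$", "/admiss/testing"],
    ["^/ungrad/testing/$", "/ungrad/info.cfm"],
]


def find_chained_rewrites(rules):
    """Return (i, j) pairs where the pattern of rule i matches the target of rule j."""
    compiled = [re.compile(pattern) for pattern, _ in rules]  # compile each pattern once
    chains = []
    for i, regex in enumerate(compiled):
        for j, (_, target) in enumerate(rules):
            if regex.search(target):
                chains.append((i, j))
    return chains


chains = find_chained_rewrites(rewrites)
print(chains)
```

For the sample data this reports that rule 0 matches the target of rule 2, which is the [0][0] and [2][1] match described in the question. The comparison is still quadratic in the number of rules, which should be acceptable for a typical rewrites file.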
