Splitting an array into 2 arrays - Python

I'm trying to separate the information within this array. Each entry has a title and a link: the first element of each pair is the title, and the second is the URL.
u'6 Essential Tips on How to Become a Full Stack Developer',
u'https://hackernoon.com/6-essential-tips-on-how-to-become-a-full-stack-developer-1d10965aaead'
u'6 Essential Tips on How to Become a Full Stack Developer',
u'https://hackernoon.com/6-essential-tips-on-how-to-become-a-full-stack-developer-1d10965aaead'
u'What is a Full-Stack Developer? - Codeup',
u'https://codeup.com/what-is-a-full-stack-developer/'
u'A Guide to Becoming a Full-Stack Developer in 2017 \u2013 Coderbyte ...',
u'https://medium.com/coderbyte/a-guide-to-becoming-a-full-stack-developer-in-2017-5c3c08a1600c'
I want to be able to push the titles into one list and the links into another, instead of having everything in one list. Here is my current code:
main.py
from gsearch.googlesearch import search
from orderedset import OrderedSet
import re
import csv

results = search('Full Stack Developer')  # returns 10 or less results

myAraay = list()
for x in results:
    owl = re.sub('[\(\)\{\}<>]', '', str(x))
    myAraay.append(owl)

newArray = "\n".join(map(str, myAraay))
print(newArray)
Updated main.py (now getting "cannot concatenate 'str' and 'int' objects"):
from gsearch.googlesearch import search
from orderedset import OrderedSet
import re
import csv

results = search('Full Stack Developer')  # returns 10 or less results

myAraay = list()
for x in results:
    owl = re.sub('[\(\)\{\}<>]', '', str(x))
    myAraay.append(owl)

newArray = "\n".join(map(str, myAraay))

theDict = {}  # keep a dictionary for title:link
for idx, i in enumerate(newArray):  # loop through list
    if idx % 2 == 0:  # the title goes first...
        dict[i] = dict[i+1]  # add title and link to dictionary
    else:  # link comes second
        continue  # skip it, go to next title
print(theDict)

Assuming that the list is formatted such that the link always succeeds the title...
theDict = {}  # keep a dictionary for title:link
for idx, i in enumerate(OriginalList):  # loop through list
    if idx % 2 == 0:  # the title goes first...
        theDict[i] = OriginalList[idx + 1]  # add title and link to dictionary
    else:  # link comes second
        continue  # skip it, go to next title
Here is what it looks like:
{"How to be Cool" : "http://www.cool.org/cool"}
If you want lists instead of a dictionary it's kind of the same...
articles = []  # keep a list of [title, link] pairs
for idx, i in enumerate(OriginalList):  # loop through list
    if idx % 2 == 0:  # the title goes first...
        articles.append([i, OriginalList[idx + 1]])  # add title and link to list
    else:  # link comes second
        continue  # skip it, go to next title
so here it looks like:
[["How to be Cool", "http://www.cool.org/cool"]]

Related

Python: Matching Strings from an Array with Substrings from Texts in another Array

Currently I am crawling a webpage for newspaper articles using Python's BeautifulSoup library. These articles are stored in the object "details".
Then I have a couple of street names stored in the object "lines". Now I want to search the articles for the street names contained in "lines".
If one of the street names is part of one of the articles, I want to save the name of the street in an array.
If there is no match for an article (the selected article does not contain any of the street names), then there should be an empty element in the array.
So for example, let's assume the object "lines" would consist of ("Abbey Road", "St-John's Bridge", "West Lane", "Sunpoint", "East End").
The object "details" consists of 4 articles, of which 2 contain "Abbey Road" and "West Lane" (e.g. as in "Car accident on Abbey Road, three people hurt"). The other 2 articles don't contain any of names from "lines".
Then after matching the result should be an array like this:
[]["Abbey Road"][]["West Lane"]
I was also told to use vectorization for this, as my original data sample is quite big. However I'm not familiar with using vectorization for string operations. Has anyone worked with this already?
My code currently looks like this; however, it only returns "-1" as elements of my resulting array:
from bs4 import BeautifulSoup
import requests
import io
import re
import string
import numpy as np

my_list = []
for y in range(0, 2):
    y *= 27
    i = str(y)
    my_list.append('http://www.presseportal.de/blaulicht/suche.htx?q=' + 'einbruch' + '&start=' + i)

for link in my_list:
    # print(link)
    r = requests.get(link)
    r.encoding = 'utf-8'
    soup = BeautifulSoup(r.content, 'html.parser')

with open('a4.txt', encoding='utf8') as f:
    lines = f.readlines()
lines = [w.replace('\n', '') for w in lines]

details = soup.find_all(class_='news-bodycopy')
for class_element in details:
    details = class_element.get_text()

sdetails = ''.join(details)
slines = ''.join(lines)
i = str.find(sdetails, slines[1:38506])
print(i)
If someone wants to reproduce my experiment, the website URL is in the code above; the crawling and storing of articles in the object "details" works properly, so the code can just be copied.
The .txt file with my original data for the object "lines" can be accessed in this Dropbox folder:
https://www.dropbox.com/s/o0cjk1o2ej8nogq/a4.txt?dl=0
Thanks a lot for any hints on how I can make this work, preferably via vectorization.
You could try something like this:
my_list = []
for y in range(0, 2):
    i = str(y)
    my_list.append('http://www.presseportal.de/blaulicht/suche.htx?q=einbruch&start=' + i)

for link in my_list:
    r = requests.get(link)
    soup = BeautifulSoup(r.content.decode('utf-8', 'ignore'), 'html.parser')
    details = soup.find_all(class_='news-bodycopy')

f = open('a4.txt')
lines = [line.rstrip('\r\n') for line in f]

result = []
for i in range(len(details)):
    found_in_line = 0
    for j in range(len(lines)):
        try:
            if details[i].get_text().index(lines[j].decode('utf-8', 'ignore')) is not None:
                result.append(lines[j])
                found_in_line = found_in_line + 1
        except:
            if (j == len(lines) - 1) and (found_in_line == 0):
                result.append(" ")
print result
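For what it's worth, a plain substring test per article avoids the index/except dance; a minimal sketch, assuming details and lines are populated as above (it keeps one entry per article, blank when nothing matches):

result = []
for article in details:
    text = article.get_text()
    # take the first street name contained in the article, or " " if none match
    hit = next((street for street in lines if street in text), " ")
    result.append(hit)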

Parsing data to lists of lists

I am running into problems when trying to parse data to a list of lists.
I am trying to scrape information about departments and their subjects.
However, as there are different numbers of subjects in each department I need to create a list of lists so that I may later link the data together. I have managed to navigate the index error and the problem seems to come from compiling the subject list.
from lxml import html
import requests

page = requests.get('URL')
page_source_code = html.fromstring(page.text)

departments_list = []
subject_list = []

for dep in range(1, 3):
    departments = page_source_code.xpath('tag' + str(dep) + ']tag/text()')
    ### print(dep, departments)
    if departments == []:
        pass
    else:
        departments_list.append(departments[0])
    for sub in range(1, 20):
        subjects = page_source_code.xpath('tag' + str(dep) + ']tag' + str(sub) + ']tag/text()')
        ### print(sub, subjects)
        if subjects == []:
            pass
        else:
            subject_list.append(subjects[0])

print('Department list ------ ', len(departments_list), departments_list, '\n')
print('Subject list ------ ', len(subject_list), subject_list)
My output looks like this:
Department list ------ 2 ['Department_1', 'Department_2']
Subject list ------ 7 ['Subject_1'(dep_1), 'Subject_2 '(dep_1), 'Subject_3 '(dep_1), 'Subject_4'(dep_1), 'Subject_5'(dep_2), 'Subject_6 '(dep_2), 'Subject_7 '(dep_2)']
This code seems to put all subjects into one list. I would like it as follows:
Subject list ------ 7 [['Subject_1'(dep_1), 'Subject_2 '(dep_1), 'Subject_3 '(dep_1), 'Subject_4'(dep_1)], ['Subject_5'(dep_2), 'Subject_6 '(dep_2), 'Subject_7 '(dep_2)']]
You need to declare the two subject lists globally (outside the loops), and then check for the 'dep_1' or 'dep_2' word in the subjects[0] string.
# declare the lists for subjects
sub_list1 = []
sub_list2 = []

# this code goes inside the second for loop
if subjects[0].find('dep_1') == -1:
    sub_list2.append(subjects[0])
else:
    sub_list1.append(subjects[0])

# Please remove the subject_list.append statement
# from the second for loop
# and put this at the end of both loops:
subjectList = [sub_list1, sub_list2]
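A more general alternative that doesn't hard-code the department names is to collect each department's subjects into a temporary list inside the outer loop and append that list afterwards. A minimal sketch, assuming the same placeholder xpath expressions as in the question:

subject_list = []
for dep in range(1, 3):
    dep_subjects = []  # subjects for this department only
    for sub in range(1, 20):
        subjects = page_source_code.xpath('tag' + str(dep) + ']tag' + str(sub) + ']tag/text()')
        if subjects:
            dep_subjects.append(subjects[0])
    subject_list.append(dep_subjects)  # one sub-list per department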

Tuple trouble when trying to count elements in a list?

I am trying to count the number of contractions used by politicians in certain speeches. I have lots of speeches, but here are some of the URLs as a sample:
every_link_test = ['http://www.millercenter.org/president/obama/speeches/speech-4427',
'http://www.millercenter.org/president/obama/speeches/speech-4424',
'http://www.millercenter.org/president/obama/speeches/speech-4453',
'http://www.millercenter.org/president/obama/speeches/speech-4612',
'http://www.millercenter.org/president/obama/speeches/speech-5502']
I have a pretty rough counter right now - it only counts the total number of contractions used in all of those links. For example, the following code returns 79,101,101,182,224 for the five links above. However, I want to link up filename, a variable I create below, so I would have something like (speech_1, 79),(speech_2, 22),(speech_3,0),(speech_4,81),(speech_5,42). That way, I can track the number of contractions used in each individual speech. I'm getting the following error with my code: AttributeError: 'tuple' object has no attribute 'split'
Here's my code:
import urllib2, sys, os
from bs4 import BeautifulSoup, NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests

reload(sys)

url = 'http://www.millercenter.org/president/speeches'
url2 = 'http://www.millercenter.org'
conn = urllib2.urlopen(url)
html = conn.read()
miller_center_soup = BeautifulSoup(html)
links = miller_center_soup.find_all('a')
linklist = [tag.get('href') for tag in links if tag.get('href') is not None]

# remove all items in list that don't contain 'speeches'
linkslist = [_ for _ in linklist if re.search('speeches', _)]
del linkslist[0:2]

# concatenate 'http://www.millercenter.org' with each speech's URL ending
every_link_dups = [url2 + end_link for end_link in linkslist]

# remove duplicates
seen = set()
every_link = []  # no duplicates array
for l in every_link_dups:
    if l not in seen:
        every_link.append(l)
        seen.add(l)

def processURL_short_2(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div', {'id': 'transcript'}, {'class': 'displaytext'})
    item_str = item_div.text.lower()
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president, speech_num)
    return item_str, filename

every_link_test = every_link[0:5]
print every_link_test

count = 0
for l in every_link_test:
    content_1 = processURL_short_2(l)
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    print count, filename
As the error message explains, you cannot use split the way you are using it: split is for strings, and processURL_short_2 returns a tuple.
So you will need to change this:
for word in content_1.split():
to this:
for word in content_1[0].split():
I chose [0] by running your code; I think that gives you the chunk of text you are looking to search through.
#TigerhawkT3 has a good suggestion you should follow in their answer too:
https://stackoverflow.com/a/32981533/1832539
Instead of print count, filename, you should save these data to a data structure, like a dictionary. Since processURL_short_2 has been modified to return a tuple, you'll need to unpack it.
data = {}  # initialize a dictionary
for l in every_link_test:
    count = 0  # reset the count for each speech
    content_1, filename = processURL_short_2(l)  # unpack the content and filename
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    data[filename] = count  # add this to the dictionary as filename:count
This would give you a dictionary like {'obama_4424':79, 'obama_4453':101,...}, allowing you to easily store and access your parsed data.
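The per-speech tally can also be written as a single generator expression; a minimal sketch, assuming contractions is a set of contraction strings as in the question:

data = {}
for l in every_link_test:
    content_1, filename = processURL_short_2(l)
    # count the words that, once stripped of punctuation, are contractions
    data[filename] = sum(1 for word in content_1.split() if word.strip(p) in contractions)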

Including Exception in python

I have a file at /location/all-item-info.txt that contains some items in the below manner:
aaa:xxx:abc.com:1857:xxx1:rel5t2:y
ifa:yyy:xyz.com:1858:yyy1:rel5t2:y
I process these items with the below Python code:
def pITEMName():
    global itemList
    itemList = str(raw_input('Enter pipe separated list of ITEMS : ')).upper().strip()
    items = itemList.split("|")
    count = len(items)
    print 'Total Distinct Item Count : ', count
    pipelst = itemList.split('|')
    filepath = '/location/all-item-info.txt'
    f = open(filepath, 'r')
    for lns in f:
        split_pipe = lns.split(':', 1)
        if split_pipe[0] in pipelst:
            index = pipelst.index(split_pipe[0])
            del pipelst[index]
    for lns in pipelst:
        print lns, 'is wrong item Name'
    f.close()
    if podList:
After execution, the above Python code gives a prompt:
Enter pipe separated list of ITEMS:
Then I pass the items:
Enter pipe separated list of ITEMS: aaa|ifa-mc|ggg-mc
After pressing enter, the code proceeds like below:
Enter pipe separated list of ITEMS : aaa|ifa-mc|ggg-mc
Total Distinct Item Count : 3
IFA-MC is wrong Item Name
GGG-MC is wrong Item Name
ITEMs Belonging to other Centers :
Item Count From Other Center = 0
ITEMs Belonging to Current Centers :
Active Items in US1 :
^IFA$
Test Active Items in US1 :
^AAA$
Ignored Item Count From Current center = 0
You Have Entered ItemList belonging to this center as: ^IFA$|^AAA$
Active Pod Count : 2
My question is: if I suffix '-mc' to an item when giving the input, it is reported as a wrong item name even when the base item is present in /location/all-item-info.txt. Please have a look at the output above again:
IFA-MC is wrong Item Name
GGG-MC is wrong Item Name
In the above example 'ifa' is present in /location/all-item-info.txt, whereas 'ggg' is not.
Could you help me change the above code so that an item with an '-mc' suffix whose base name is present in /location/all-item-info.txt is not counted as a wrong item name? Only items that are genuinely absent from /location/all-item-info.txt should be counted.
Thanks,
Ritesh.
If you want to avoid checking for -mc as well, then you can modify this part of your script -
pipelst = itemList.split('|')
To -
pipelst = [i.split('-')[0] for i in itemList.split('|')]
It's a bit unclear exactly what you are asking, but basically to ignore any '-mc' from user input, you can explicitly preprocess the user input to strip it out:
pipelst = itemList.split('|')
pipelst = [item.rsplit('-mc',1)[0] for item in pipelst]
If instead you want to allow for the possibility of -mc-suffixed words in the file as well, simply add the stripped version to the list instead of replacing it:
pipelst = itemList.split('|')
for item in list(pipelst):  # iterate over a copy while appending
    if item.endswith('-mc'):
        pipelst.append(item.rsplit('-mc', 1)[0])
Another issue: based on the example lines you gave from /location/all-item-info.txt, it sounds like all the items are lowercase. However, pipelst is explicitly making the user input all uppercase. String equality and in checks are case-sensitive, so for instance
>>> print 'ifa-mc' in ['IFA-MC']
False
You probably want:
itemList = str(raw_input('Enter pipe separated list of ITEMS : ')).lower().strip()
and you could use .upper() only when printing or wherever it is needed
Finally, a few other things could be tweaked just to make the code a bit faster and cleaner. The main one that comes to mind: pipelst should be a Python set rather than a list, since checking inclusion and removing items are much faster on a set for large collections, and the removal code is cleaner:
>>> desserts = set(['ice cream', 'cookies', 'cake'])
>>> if 'cake' in desserts:
... desserts.remove('cake')
>>> print desserts
set(['cookies', 'ice cream'])
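Putting these suggestions together, here is a minimal sketch of the lookup using lowercase comparison, '-mc' stripping, and a set (the prompt and path are taken from the question; treat this as one possible arrangement, not a drop-in replacement):

itemList = str(raw_input('Enter pipe separated list of ITEMS : ')).lower().strip()
# strip any '-mc' suffix up front and use a set for fast membership tests
pipelst = set(item.rsplit('-mc', 1)[0] for item in itemList.split('|'))
with open('/location/all-item-info.txt', 'r') as f:
    for lns in f:
        name = lns.split(':', 1)[0]
        pipelst.discard(name)  # known item: not a wrong name
for lns in pipelst:
    print lns.upper(), 'is wrong item Name'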

Matching regex to list items in Python

I am attempting to write a Python script that shows the URL flow on my installation of nginx. I currently have my script opening my 'rewrites' file, which contains a list of regexes and locations like so:
rewritei ^/ungrad/info.cfm$ /ungrad/info/ permanent;
What I currently have Python doing is reading the file and trimming the first and last words off (rewritei and permanent;), which leaves a list like so:
[
['^/ungrad/info.cfm$', '/ungrad/info'],
['^/admiss/testing.cfm$', '/admiss/testing'],
['^/ungrad/testing/$', '/ungrad/info.cfm']
]
This results in the first element being the URL watched, and the second being the URL redirected to. What I would like to do now, is take each of the first elements, and run the regex over the entire list, and check if it matches any of the second elements.
With the example above, [0][0] would match [2][1].
However I am having trouble thinking of a good and efficient way to do this.
import re

a = [
    ['^/ungrad/info.cfm$', '/ungrad/info'],
    ['^/admiss/testing.cfm$', '/admiss/testing'],
    ['^/ungrad/testing/$', '/ungrad/info.cfm']
]

matchedurl = []
for pattern, _ in a:              # first element of each pair: the watched-URL regex
    b = re.compile(pattern)       # compile each regex once
    for _, target in a:           # second element of each pair: the redirect target
        if b.match(target):       # does the regex match any redirect target?
            matchedurl.append((pattern, target))

bs = list(set(matchedurl))        # drop any duplicates
print "matched url is", bs
This is a bit inefficient, I guess, but it gets the job done. Hope this answers your query: the above snippet prints each watched-URL pattern together with the redirect targets (second items) it matches anywhere in the list.
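Since the stated goal is to show URL flow, you could go one step further and chase redirect chains: starting from a URL, repeatedly look for a pattern that matches it and hop to its target. A minimal sketch, assuming the same list a as above (the follow helper is hypothetical, not part of the question's code):

def follow(url, rules, seen=None):
    # return the chain of redirects starting at url, guarding against loops
    seen = seen or set()
    for pattern, target in rules:
        if re.match(pattern, url) and url not in seen:
            seen.add(url)
            return [url] + follow(target, rules, seen)
    return [url]

print follow('/ungrad/testing/', a)  # ['/ungrad/testing/', '/ungrad/info.cfm', '/ungrad/info']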
