I've successfully extracted my sitemap, and I would like to turn the URLs into a list. I can't quite figure out how to separate the URLs from their last-modified dates. Ideally I would also like to turn it into a dictionary, with the associated date stamp. In the end, I plan to iterate over the list, create text files of the web pages, and save the date-time stamp at the top of each text file.
I will settle for the next step of turning this into a list. This is my code:
import urllib.request
import inscriptis
from inscriptis import get_text
sitemap = "https://grapaes.com/sitemap.xml"
html = urllib.request.urlopen(sitemap).read().decode('utf-8')
text = get_text(html)
dicto = {text}  # note: this creates a one-element set, not a dictionary
print(dicto)
for i in dicto:
    if i.startswith("https"):
        print(i + '\n')
The output is basically a row per entry with the date stamp, a space, and the URL.
You can split the text on whitespace first, then proceed like this:
text = text.split()  # split on any whitespace, including newlines
dicto = {}
for i in range(0, len(text), 2):
    dicto[text[i]] = text[i + 1]  # timestamp as key, URL as value
This gives a dictionary with the timestamp as key and the URL as value, as follows:
{
'2020-01-12T09:19+00:00': 'https://grapaes.com/',
'2020-01-12T12:13+00:00': 'https://grapaes.com/about-us-our-story/',
...,
'2019-12-05T12:59+00:00': 'https://grapaes.com/211-retilplast/',
'2019-12-01T08:29+00:00': 'https://grapaes.com/fruit-logistica-berlin/'
}
I believe you can do further processing from here onward.
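For the end goal described in the question (one text file per page, with the date-time stamp at the top), a rough sketch building on the timestamp-to-URL dictionary above might look like the following; the filename scheme here is just an assumption for illustration:
import urllib.request
from inscriptis import get_text

for timestamp, url in dicto.items():
    html = urllib.request.urlopen(url).read().decode('utf-8')
    page_text = get_text(html)
    # derive a file name from the last URL path segment (hypothetical naming scheme)
    filename = url.rstrip('/').rsplit('/', 1)[-1] or 'index'
    with open(filename + '.txt', 'w', encoding='utf-8') as f:
        f.write(timestamp + '\n')  # date-time stamp at the top of the file
        f.write(page_text)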
In addition to the answer above: you could also use an XML parser (a standard-library module) to achieve what you are trying to do:
# Save your XML to disk
with open('sitemap.xml', 'w') as f:
    f.write(html)  # write the raw sitemap XML; the text extracted via get_text() is no longer valid XML
# Import XML-Parser
import xml.etree.ElementTree as ET
# Load xml and obtain the root node
tree = ET.parse('sitemap.xml')
root_node = tree.getroot()
From here you can access your XML's nodes just like any other list-like object:
print(root_node[1][0].text) # output: 'https://grapaes.com/about-us-our-story/'
print(root_node[1][1].text) # output: '2020-01-12T12:13+00:00'
Creating a dict from this is as easy as this:
dicto = dict()
for child in root_node:
    dicto.setdefault(child[0].text, child[1].text)  # .text gives the string values; child[0] and child[1] are Element objects
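One caveat worth noting: sitemap files normally declare the namespace http://www.sitemaps.org/schemas/sitemap/0.9, so positional indexing works as shown above, but tag-based lookups need a namespace mapping, roughly like this:
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
dicto = {}
for url_node in root_node.findall('sm:url', ns):
    loc = url_node.find('sm:loc', ns)
    lastmod = url_node.find('sm:lastmod', ns)
    if loc is not None and lastmod is not None:
        dicto[loc.text] = lastmod.text  # URL as key, timestamp as value, as in the loop above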
Related
I've created a script in Python to get the names of neighborhoods from a webpage. I've used the requests library along with the re module to parse the content from a script tag on that site. When I run the script I get the names of the neighborhoods just fine. However, the problem is that I've used the line if not item.startswith("NY:"):continue to get rid of unwanted results from that page. I do not wish to use this hardcoded portion NY: to do the trick.
website link
I've tried with:
import re
import json
import requests
link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
data = json.loads(re.findall(r'data-hypernova-key[^{]+(.*)--></script>',resp.text)[0])
items = data['searchPageProps']['filterPanelProps']['filterInfoMap']
for item in items:
    if not item.startswith("NY:"): continue
    print(item)
Result I'm getting (desired result):
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
If I do not use this line if not item.startswith("NY:"):continue, the results are something like:
rating
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
NY:New_York:Staten_Island:Lighthouse_Hill
NY:New_York:Queens:Rochdale
NY:New_York:Queens:Pomonok
BusinessParking.validated
food_court
NY:New_York:Queens:Little_Neck
The bottom line is that I wish to get everything starting with NY:New_York:. What I meant by unwanted results are rating, BusinessParking.validated, food_court and so on.
How can I get the neighborhoods without hardcoding any portion of the search within the script?
I'm not certain what your complete data set looks like, but based on your sample,
you might use something like:
if ':' not in item:
    continue

# or perhaps:
if item.count(':') < 3:
    continue

# I'd prefer a list comprehension if I didn't need the other data
items = [x for x in data['searchPageProps']['filterPanelProps']['filterInfoMap'] if ':' in x]
If that doesn't work for what you're trying to achieve then you could just use a variable for the state.
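For instance, a minimal sketch of the variable idea, reusing the items mapping from the question:
state = "NY"  # could come from a config value or function argument instead of being hardcoded
neighborhoods = [item for item in items if item.startswith(state + ":")]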
Another solution - using BeautifulSoup - which doesn't involve regex or hardcoding "NY:New_York" is below; it's convoluted, but mainly because Yelp buried its treasure several layers deep...
So for future reference:
from bs4 import BeautifulSoup as bs
import json
import requests
link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
soup = bs(resp.text, 'html.parser')  # parse the page before searching it
target = soup.find_all('script')[14]  # index 14 held the JSON payload at the time; fragile if the page changes
content = target.text.replace('<!--','').replace('-->','')
js_data = json.loads(content)
And now the fun of extracting NYC info from the json begins....
for a in js_data:
    if a == 'searchPageProps':
        level1 = js_data[a]
        for b in level1:
            if b == 'filterPanelProps':
                level2 = level1[b]
                for c in level2:
                    if c == 'filterSets':
                        level3 = level2[c][1]
                        for d in level3:
                            if d == 'moreFilters':
                                level4 = level3[d]
                                for e in range(len(level4)):
                                    print(level4[e]['title'])
                                    print(level4[e]['sectionFilters'])
                                    print('---------------')
The output is the name of each borough plus a list of all neighborhoods in that borough. For example:
Manhattan
['NY:New_York:Manhattan:Alphabet_City',
'NY:New_York:Manhattan:Battery_Park',
'NY:New_York:Manhattan:Central_Park', 'NY:New_York:Manhattan:Chelsea',
'...]
etc.
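Since each loop above matches exactly one key, the same traversal can be collapsed into direct indexing on the js_data object parsed earlier; an equivalent sketch:
more_filters = js_data['searchPageProps']['filterPanelProps']['filterSets'][1]['moreFilters']
for section in more_filters:
    print(section['title'])
    print(section['sectionFilters'])
    print('---------------')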
I needed to take an XML file and replace certain values with other values.
This was easy enough parsing through the xml (as text) and replacing the old values with the new.
The issue is the new txt file is in the wrong format.
It's all encased in square brackets and has "\n" characters instead of line breaks.
I did try the xml.dom.minidom lib but it's not working ...
I could parse the resulting file as well and remove the "\n" characters and square brackets, but I don't want to do that, as I'm not sure these are the only artifacts added in this format.
Source code:
import json
import shutil
import itertools
import datetime
import time
import calendar
import sys
import string
import random
import uuid
import xml.dom.minidom
inputfile = open('data.txt')
outputfile = open('output.xml','w')
sess_id = "d87c2b8e063e5e5c789d277c34ea"
new_sess_id = ""
number_of_sessions = 4
my_text = str(inputfile.readlines())
my_text2 = ''
#print (my_text)
#Just replicate the session logs x times ...
#print ("UUID random : " + str(uuid.uuid4()).replace("-","")[0:28])
for i in range(0, number_of_sessions):
    new_sess_id = str(uuid.uuid4()).replace("-", "")[0:28]
    my_text2 = my_text + my_text2
    my_text2 = my_text2.replace(sess_id, new_sess_id)
#xml = xml.dom.minidom.parseString(my_text2)
outputfile.write(my_text2)
print (my_text)
inputfile.close()
outputfile.close()
The original text is in XML format, but the output looks like:
time is it</span></div><div
class=\\"di_transcriptAvatarAnswerEntry\\"><span
class=\\"di_transcriptAvatarTitle\\">[AVATAR] </span> <span
class=\\"di_transcriptAvatarAnswerText\\">My watch says it\'s 6:07
PM.<br/>Was my answer helpful? No Yes</span></div>\\r\\n"
</variable>\n', '</element>\n', '</path>\n', '</transaction>\n',
'</session>\n', '<session>']
You are currently using readlines(). This will read each line of your file and return a Python list, one line per entry (complete with \n on the end of each entry). You were then using str() to convert the list representation into a string, for example:
text = str(['line1\n', 'line2\n', 'line3\n'])
text would now be a string looking like your list, complete with all the [ and quote characters. Rather than using readlines(), you probably just need to use read(), which returns the whole file contents as a single string for you to work with.
Try using the following type approach which also uses the preferred with context manager for dealing with files (it closes them automatically for you).
import uuid
sess_id = "d87c2b8e063e5e5c789d277c34ea"
new_sess_id = ""
number_of_sessions = 4
with open('data.txt') as inputfile, open('output.xml', 'w') as outputfile:
    my_text = inputfile.read()
    my_text2 = ''
    # Just replicate the session logs x times ...
    for i in range(0, number_of_sessions):
        new_sess_id = str(uuid.uuid4()).replace("-", "")[0:28]
        my_text2 = my_text + my_text2
        my_text2 = my_text2.replace(sess_id, new_sess_id)
    outputfile.write(my_text2)
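If you still want the xml.dom.minidom check the question mentions, a hedged sketch: parseString raises an exception on malformed XML, so it can serve as a sanity check before writing (this assumes the concatenated sessions still form a single well-formed document):
import xml.dom.minidom

try:
    xml.dom.minidom.parseString(my_text2)  # raises an exception if not well-formed
except Exception as err:
    print('Output is not well-formed XML:', err)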
I want to start off by saying I am very new to Python, but I love the language.
Problem:
I was provided a very large Juniper configuration file in XML format. I am using the ElementTree library to parse through the file. The file has 5 main roots (parents) and several nested children. We are trying to append the text of several element tags to several lists and display the data. I have verified the paths to each. I think the problem is that some elements are empty, so when I display the data from each of the lists, the output does not line up with what is within the XML document. Is there a way to tell ElementTree, when parsing the document and appending elements to a list, to append "none" as a string if an element's text is empty? An element placeholder, so to speak. My hope is that when I display the data of several lists, it will match what is within the XML document, because the same number of elements will have been accounted for in the iteration for each list.
Thank you!
Code Example:
import xml.etree.ElementTree as ET
# file to parse, submit to memory
tree = ET.parse('JuniperXmlConf-Name-NewSSLVPNA.xml')
root = tree.getroot()
#Defining the relevant root by tag. Children of a Root Tag
user_realms = root[3][0]
user_roles = root[3][1]
#Defining lists for children of children (subelements)
name = []
idletimeout = []
maxtimeout = []
reminder = []
limit_concurrent_users = []
guaranteed_minimum = []
maximum = []
max_sessions_per_user = []
user_names = []
# Parsing Data into Lists from User_Roles.
#Notice .text method
for child in user_roles:
    try:
        name.append(child[0].text)
    except IndexError:
        continue
...
...
# Counter displays data as name is argument
i = 0
for value in name:
    try:
        i += 1
        print("Name: {}, Idletimeout {}, Maxtimeout {}, Reminder {}".format(name[i],
              idletimeout[i], maxtimeout[i], reminder[i]))
    except IndexError:
        break
"""
Example of XML:
"""
<users>
  <user-realms>
    <realm>
      <name>Customer Name here</name>
      <authentication-policy>
        <source-ip>
          <customized>any-ip</customized>
          <ips>
          </ips>
        </source-ip>
        <browser>
          <customized>user-agent here</customized>
          <user-agent-patterns>
          </user-agent-patterns>
        </browser>
........
........
<users>
  <user-realms>
    <realm>
      <name></name>
      <authentication-policy>
        <source-ip>
          <customized>any-ip</customized>
          <ips>
          </ips>
        </source-ip>
        <browser>
          <customized>user-agent here</customized>
          <user-agent-patterns>
          </user-agent-patterns>
        </browser>
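For the placeholder idea the question describes, a minimal sketch (assuming the same child ordering as in the code above): check the subelement's text before appending and fall back to "none", so every list stays the same length and the lists line up when displayed:
for child in user_roles:
    # append the text if the subelement exists and is non-empty; otherwise a "none" placeholder
    if len(child) > 0 and child[0].text:
        name.append(child[0].text)
    else:
        name.append("none")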
I am trying to count the number of contractions used by politicians in certain speeches. I have lots of speeches, but here are some of the URLs as a sample:
every_link_test = ['http://www.millercenter.org/president/obama/speeches/speech-4427',
'http://www.millercenter.org/president/obama/speeches/speech-4424',
'http://www.millercenter.org/president/obama/speeches/speech-4453',
'http://www.millercenter.org/president/obama/speeches/speech-4612',
'http://www.millercenter.org/president/obama/speeches/speech-5502']
I have a pretty rough counter right now - it only counts the total number of contractions used in all of those links. For example, the following code returns 79,101,101,182,224 for the five links above. However, I want to link up filename, a variable I create below, so I would have something like (speech_1, 79),(speech_2, 22),(speech_3,0),(speech_4,81),(speech_5,42). That way, I can track the number of contractions used in each individual speech. I'm getting the following error with my code: AttributeError: 'tuple' object has no attribute 'split'
Here's my code:
import urllib2,sys,os
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests
reload(sys)
url = 'http://www.millercenter.org/president/speeches'
url2 = 'http://www.millercenter.org'
conn = urllib2.urlopen(url)
html = conn.read()
miller_center_soup = BeautifulSoup(html)
links = miller_center_soup.find_all('a')
linklist = [tag.get('href') for tag in links if tag.get('href') is not None]
# remove all items in list that don't contain 'speeches'
linkslist = [_ for _ in linklist if re.search('speeches',_)]
del linkslist[0:2]
# concatenate 'http://www.millercenter.org' with each speech's URL ending
every_link_dups = [url2 + end_link for end_link in linkslist]
# remove duplicates
seen = set()
every_link = [] # no duplicates array
for l in every_link_dups:
    if l not in seen:
        every_link.append(l)
        seen.add(l)
def processURL_short_2(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div', {'id': 'transcript'}, {'class': 'displaytext'})
    item_str = item_div.text.lower()
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president, speech_num)
    return item_str, filename
every_link_test = every_link[0:5]
print every_link_test
count = 0
for l in every_link_test:
    content_1 = processURL_short_2(l)
    for word in content_1.split():  # this line raises the AttributeError: content_1 is a tuple
        word = word.strip(p)
        if word in contractions:  # 'contractions' is defined elsewhere in the script (omitted here)
            count = count + 1
print count, filename
As the error message explains, you cannot call split the way you are using it; split is a method on strings, and content_1 is a tuple.
So you will need to change this:
for word in content_1.split():
to this:
for word in content_1[0].split():
I chose [0] by running your code; processURL_short_2 returns a tuple (item_str, filename), so content_1[0] is the chunk of text you are looking to search through. It still needs .split() so that you iterate over words rather than individual characters.
#TigerhawkT3 has a good suggestion you should follow in their answer too:
https://stackoverflow.com/a/32981533/1832539
Instead of print count, filename, you should save these data to a data structure, like a dictionary. Since processURL_short_2 has been modified to return a tuple, you'll need to unpack it.
data = {}  # initialize a dictionary
for l in every_link_test:
    count = 0  # reset the count for each speech
    content_1, filename = processURL_short_2(l)  # unpack the content and filename
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    data[filename] = count  # add this to the dictionary as filename:count
This would give you a dictionary like {'obama_4424':79, 'obama_4453':101,...}, allowing you to easily store and access your parsed data.
So here is my conundrum!
I have 100+ XML files that I need to parse to find a string by tag name (or regular expression).
Once I find that string/tag value, I need to count the times it occurs (or find the highest value of that string).
Example:
<content styleCode="Bold">Value 1</content>
<content styleCode="Bold">Value 2</content>
<content styleCode="Bold">Value 3</content>
<content styleCode="Bold">Another Value 1</content>
<content styleCode="Bold">Another Value 2</content>
<content styleCode="Bold">Another Value 3</content>
<content styleCode="Bold">Another Value 4</content>
So basically I would want to parse the XML, find the tag listed above and output to an Excel spreadsheet with the highest value found. The spreadsheet already has headers so just the numerical value is output to the Excel file.
So the output would be in Excel:
Value    Another Value
3        4
Each file would then output onto another row.
I'm not sure how your XML files were named.
For the easy case, let's say they were named in this pattern:
file1.xml, file2.xml, ..., and they are stored in the same folder as your Python script.
Then you can use the following code to do the job:
import xml.etree.cElementTree as ElementTree
import re
from xlrd import open_workbook
from xlwt import Workbook
from xlutils.copy import copy
def process():
    for i in xrange(1, 100):  # loop from file1.xml to file99.xml
        resultDict = {}
        xml = ElementTree.parse('file%d.xml' % i)
        root = xml.getroot()
        for child in root:
            match = re.search(r'\d+', child.text)
            value = int(match.group())  # compare as integers, not strings
            key = child.text[:match.start()].strip()  # e.g. 'Value' or 'Another Value'
            try:
                if value > resultDict[key]:
                    resultDict[key] = value
            except KeyError:
                resultDict[key] = value
        rb = open_workbook("names.xls")
        wb = copy(rb)
        s = wb.get_sheet(0)
        # write the columns in the header order: Value first, then Another Value
        for index, key in enumerate(('Value', 'Another Value')):
            s.write(i, index, resultDict.get(key, ''))
        wb.save('names.xls')

if __name__ == '__main__':
    process()
So there are two main parts to the problem: (1) find the maximum value pair from each file, and (2) write these to an Excel workbook. One thing I always advocate is writing reusable code. Here you just have to put all your XML files in a folder, execute the main method, and get the results.
Well, now there are several options for writing to Excel. The simplest is to create a tab- or comma-separated file (CSV) and import it into Excel manually. xlwt (used in the answer above) is one library; openpyxl is another, which makes creating Excel files much simpler and smaller in terms of lines of code.
Be sure to import the required libraries and modules at the beginning of the file.
import re
import os
import openpyxl
While reading an XML file, we use regular expressions to extract the values you want.
regexPatternValue = ">Value\s+(\d+)</content>"
regexPatternAnotherValue = ">Another Value\s+(\d+)</content>"
To modularize it a little more, create a method that parses each line in the given XML file, looks for the regex patterns, extracts all the values, and returns the maximum of them. In the following method, I'm returning a tuple containing two elements, (Value, Another), which are the maximum numbers of each type seen in that file.
def get_values(filepath):
    values = []
    another = []
    for line in open(filepath).readlines():
        matchValue = re.search(regexPatternValue, line)
        matchAnother = re.search(regexPatternAnotherValue, line)
        if matchValue:
            values.append(int(matchValue.group(1)))
        if matchAnother:
            another.append(int(matchAnother.group(1)))
    # Now we want to calculate the highest number in both lists.
    try:
        maxVal = max(values)
    except ValueError:
        maxVal = ''  # this case handles files with NO values at all
    try:
        maxAnother = max(another)
    except ValueError:
        maxAnother = ''
    return maxVal, maxAnother
Now keep your XML files in one folder, iterate over them, and extract the regex patterns from each. In the following code, I'm appending the extracted values to a list named writable_lines. Finally, after parsing all the files, create a Workbook and add the extracted values in that format.
def process_folder(folder, output_xls_path):
    # filter on ".xml" since the inputs are XML files
    files = [folder + '/' + f for f in os.listdir(folder) if f.endswith(".xml")]
    writable_lines = []
    writable_lines.append(("Value", "Another Value"))  # header row in the Excel sheet
    for file in files:
        values = get_values(file)
        writable_lines.append((str(values[0]), str(values[1])))
    wb = openpyxl.Workbook()
    sheet = wb.active
    for i in range(len(writable_lines)):
        sheet['A' + str(i + 1)].value = writable_lines[i][0]
        sheet['B' + str(i + 1)].value = writable_lines[i][1]
    wb.save(output_xls_path)
In the lower for-loop, we're directing openpyxl to write the values to cells addressed in typical Excel style, like sheet["A3"], sheet["B3"], etc.
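As an aside, openpyxl also accepts numeric coordinates via cell(row=..., column=...), which can be handier inside loops; a sketch equivalent to the loop above:
for i, (val, another) in enumerate(writable_lines, start=1):
    sheet.cell(row=i, column=1, value=val)
    sheet.cell(row=i, column=2, value=another)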
Ready to go...
if __name__ == '__main__':
    process_folder("xmls", "try.xlsx")  # openpyxl saves in the .xlsx format