Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I am very new to Python and I am trying to create a list out of a string in Python.
Input = '<html><body><ul style="padding-left: 5pt"><i>(See attached file: File1.pdf)</i><i>(See attached file: File2.ppt)</i><i>(See attached file: File3.docx)</i></ul></body></html>'
Desired Output = [File1.pdf, File2.ppt, File3.docx]
What is the most efficient and pythonic way to achieve this? Any help will be very much appreciated.
Thanks
You can use BeautifulSoup, which has HTML parsing utilities.
>>> from bs4 import BeautifulSoup
>>> html = """<html><body><ul style="padding-left: 5pt"><i>(See attached file: File1.pdf)</i><i>(See attached file: File2.ppt)</i><i>(See attached file: File3.docx)</i></ul></body></html>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> files_list = [i.text.split('file: ')[1].replace(')', '') for i in soup.find_all('i')]
>>> print(files_list)
['File1.pdf', 'File2.ppt', 'File3.docx']
There might be a nicer way to do this using an HTML parser, as shree.pat18 suggested, but here is a quick and dirty way using str.split():
Output = [s.split(")")[0] for s in Input.split("file: ")[1:]]
Splitting on "file: " first gives a list of strings. The first element is everything before the first filename, so we skip it with [1:]. Each remaining element starts with a filename, and the first character after the filename is ")", so we split on ")" and keep the first part.
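If you would rather pull the names out in one step, a regular expression is another option. This is only a sketch: it assumes every filename is preceded by "file: " and followed by ")", as in the sample input, and the variable names are illustrative.

```python
import re

html = ('<html><body><ul style="padding-left: 5pt">'
        '<i>(See attached file: File1.pdf)</i>'
        '<i>(See attached file: File2.ppt)</i>'
        '<i>(See attached file: File3.docx)</i></ul></body></html>')

# Capture everything between "file: " and the next ")"
files = re.findall(r'file: ([^)]+)\)', html)
print(files)
```

Like the split() version, this breaks if a filename itself contains ")".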
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
How can I parse this to find out how many unique URLs there are, regardless of the number after each one? I am using Python.
You can open the file and read its lines into a list of strings using:
with open("/path/to/file.txt") as file:
    lines = list(file)
This will give you a list of all lines in the text file.
Since you do not want duplicates, I think a set would be a good fit (a set cannot contain duplicates).
answer = set()
for x in lines:
    # Keep the text after the first space, up to (not including) the last ":"
    answer.add(x[x.find(" ") + 1:x.rfind(":")])
This iterates through all the lines and adds the part after the first space, up to but not including the last ":", to the set, which takes care of duplicates. answer now contains all the unique URLs, and len(answer) gives their count.
Tested for Python3.6
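Since the question does not show the file, here is a small self-contained sketch of the same slicing idea. It assumes each line looks like "label url:number"; the sample lines, the example.com URLs and the function name unique_urls are made up for illustration.

```python
def unique_urls(lines):
    """Return the set of URLs found in lines shaped like 'label url:number'."""
    urls = set()
    for line in lines:
        start = line.find(" ") + 1   # first character after the first space
        end = line.rfind(":")        # last colon, just before the trailing number
        if start > 0 and end > start:  # skip lines that do not fit the assumed shape
            urls.add(line[start:end])
    return urls


# Hypothetical sample data; the real file contents are not shown in the question
sample = [
    "String_1 http://example.com/a:10\n",
    "String_2 http://example.com/b:20\n",
    "String_3 http://example.com/a:30\n",
]
print(len(unique_urls(sample)))
```

Using rfind for the colon matters here because the URL itself contains a ":" after the scheme.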
You can use a regex to parse and extract the uids from your file, line by line.
import re

uids = set()
with open('...') as f:
    for line in f:
        # Match the leading run of lowercase letters and digits on each line
        m = re.match(r'^[a-z0-9]+', line)
        if m:
            uids.add(m.group(0))
print(len(uids))
import re
A, List = ("String_1 URL_1:10\nString_2 URL_2:20\nString_3 URL_1:30".replace(" ", ",")).split("\n"), []
for x in range(len(A)):
    Result = re.search(",(.*):", A[x])
    if Result.group(1) not in List:
        List.append(Result.group(1))
print(len(List))
This should solve your problem.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I have the following link:
https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk
I have multiple links in a dataset, and each link follows the same pattern. I want to extract a specific part of each link: the text starting at the second "http" and ending just before the first "+" sign. For the link above, that would be https://cooking.nytimes.com/learn-to-cook.
I don't know how to do this using regex. I am working in Python. Kindly help me out.
If each link has the same pattern, you do not need regex. You can use str.find() and string slicing:
link = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
# This finds the second occurrence of "https://" and returns the position
second_https = link.find("https://", link.find("https://")+1)
# Index of the end of the link
end_of_link = link.find("+")
new_link = link[second_https:end_of_link]
print(new_link)
This prints https://cooking.nytimes.com/learn-to-cook and works as long as each link follows the pattern described (the part you want starts at the second https:// in the link and ends at the first + sign).
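If some links in the dataset might not follow the pattern, the same find-and-slice idea can be wrapped in a function that checks for the -1 that str.find() returns on a miss. This is a sketch, not part of the answer above; the function name and the choice to return None are assumptions.

```python
def extract_cached_url(link, prefix="https://"):
    """Return the text from the second `prefix` up to the first '+' after it, or None."""
    first = link.find(prefix)
    if first == -1:
        return None
    second = link.find(prefix, first + 1)
    if second == -1:
        return None
    end = link.find("+", second)
    # If there is no '+', keep everything to the end of the link
    return link[second:] if end == -1 else link[second:end]


link = ("https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:"
        "https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk")
print(extract_cached_url(link))
```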
I'd go with urlparse (Python 2) or urllib.parse (Python 3) and a little bit of regex:
import re
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
url_example = "https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk"
parsed = urlparse(url_example)
result = re.findall('https?.*', parsed.query)[0].split('+')[0]
print(result)
Output:
https://cooking.nytimes.com/learn-to-cook
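On Python 3 the regex can also be dropped by letting urllib.parse decode the query string. A sketch, assuming the q parameter always has the shape "cache:<id>:<original url>" as in the example link:

```python
from urllib.parse import urlparse, parse_qs

url_example = ("https://webcache.googleusercontent.com/search?q=cache:jAc7OJyyQboJ:"
               "https://cooking.nytimes.com/learn-to-cook+&cd=5&hl=en&ct=clnk")

query = parse_qs(urlparse(url_example).query)
# "+" in a query string decodes to a space, so strip() removes the trailing one;
# split(":", 2) cuts off the "cache" and id fields and leaves the URL intact
cached = query["q"][0].split(":", 2)[2].strip()
print(cached)
```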
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I'm trying to manipulate an HTML file and remove a div with a certain id, using Python 3.
Is there a more elegant way to manipulate or remove this container than a mix of for loops and regex?
I know there is the HTMLParser module, but I'm not sure whether it will help me (it finds the corresponding tags, but how do I remove them and their contents?).
Try lxml and css/xpath queries.
For example, with this html:
<html>
<body>
<p>Some text in a p.</p>
<div class="go-away">Some text in a div.</div>
<div><p>Some text in a p in a div</p></div>
</body>
</html>
You can read that in, remove the div with class "go-away", and output the result with:
import lxml.html
html = lxml.html.fromstring(html_txt)  # html_txt holds the HTML shown above
go_away = html.cssselect('.go-away')[0]  # Needs the cssselect package; or use a suitable xpath
go_away.getparent().remove(go_away)
lxml.html.tostring(html)  # Or lxml.html.tostring(html).decode("utf-8") to get a string
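The question also asks whether the standard library's HTMLParser can do this. It can, by re-emitting everything except the unwanted div. The following is a rough sketch of that different technique, not the lxml approach above: the class and function names are made up, it matches on id rather than class, and it makes no attempt to cope with badly nested markup.

```python
from html.parser import HTMLParser


class DivRemover(HTMLParser):
    """Re-emit HTML while skipping the <div> with a given id and everything inside it."""

    def __init__(self, target_id):
        # convert_charrefs=False so entities are passed through unchanged
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.out = []
        self.skip_depth = 0  # > 0 while inside the div being removed

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth += 1  # nested div inside the removed one
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth -= 1
        else:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        if not self.skip_depth:
            self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        if not self.skip_depth:
            self.out.append("<!%s>" % decl)


def remove_div(html_text, target_id):
    parser = DivRemover(target_id)
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


sample = ('<html><body><p>Keep</p><div id="go-away">Drop <div>nested</div> this</div>'
          '<div><p>Also keep</p></div></body></html>')
print(remove_div(sample, "go-away"))
```

The depth counter is what lets nested divs inside the removed container be skipped correctly. For anything beyond a simple, well-formed page, a tree-based library such as lxml is likely the more robust choice.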
While I can't stress this enough
DON'T PARSE HTML WITH REGEX!!
here's how I'd do it with regex.
from re import sub
new_html = sub('<div class=(\'go-away\'|"go-away")>.*?</div>', '', html)
Even though I think that should be OK here, you should avoid using regex to parse HTML. More often than not, it creates odd, hard-to-debug issues and leaves you with more work than you started with. Don't parse HTML with regex.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I have content like this:
aid: "1168577519", cmt_id = 1168594403;
Now I want to get all the number sequences:
1168577519
1168594403
using regex.
I have never dealt with regex before, but this time I need it for a parsing job.
So far I can only get the sequences after "aid" and "cmt_id" separately. I don't know how to merge the two patterns into one regex.
My current progress:
pattern = re.compile('(?<=aid: ").*?(?=",)')
print pattern.findall(s)
and
pattern = re.compile('(?<=cmt_id = ).*?(?=;)')
print pattern.findall(s)
There are many different approaches to designing a suitable regular expression which depend on the range of possible inputs you are likely to encounter.
The following would solve your exact question but could fail given different styled input. You need to provide more details, but this would be a start.
import re
content = 'aid: "1168577519", cmt_id = 1168594403;'
re_content = re.search(r'aid: "([0-9]*?)",\W*cmt_id = ([0-9]*?);', content)
print(re_content.groups())
This gives the following output:
('1168577519', '1168594403')
This example assumes that there might be other numbers in your input, and you are trying to extract just the aid and cmt_id values.
The simplest solution is to use re.findall
Example
>>> import re
>>> string = 'aid: "1168577519", cmt_id = 1168594403;'
>>> re.findall(r'\d+', string)
['1168577519', '1168594403']
>>>
\d+ matches one or more digits.
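If the input may contain other numbers and you still want a single pattern, the two lookbehind patterns from the question can be joined with | (alternation). A sketch; the extra "page = 7" in the sample string is made up to show that unrelated numbers are skipped.

```python
import re

s = 'aid: "1168577519", cmt_id = 1168594403; page = 7'

# Either branch may match: digits right after 'aid: "' or right after 'cmt_id = '
pattern = re.compile(r'(?<=aid: ")\d+|(?<=cmt_id = )\d+')
numbers = pattern.findall(s)
print(numbers)
```

Each lookbehind sits inside its own branch because Python's re module requires a lookbehind to have a fixed width, and the two prefixes differ in length.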
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
How to open a webpage and search for a word in python?
This is a little simplified:
>>> import urllib
>>> import re
>>> page = urllib.urlopen("http://google.com").read()
# => via regular expression
>>> re.findall("Shopping", page)
['Shopping']
# => via string.find, returns the position ...
>>> page.find("Shopping")
2716
First, get the page (e.g. via urllib.urlopen). Second, use a regular expression to find the portions of the text you are interested in, or use string.find.
you can use urllib2
import urllib2
webp=urllib2.urlopen("the_page").read()
webp.find("the_word")
hope that helps :D
How to open a webpage?
I think the most convenient way is:
from urllib2 import urlopen
page = urlopen('http://www.example.com').read()
How to search for a word?
I guess you are going to search for some pattern in the page next, so here we go:
import re
pattern = re.compile('^some regex$')
match = pattern.search(page)
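The answers above are written for Python 2 (urllib.urlopen and urllib2). On Python 3 the equivalent call lives in urllib.request and read() returns bytes, so the page has to be decoded before searching it as text. A minimal sketch, assuming the page is UTF-8; the function names are made up for illustration.

```python
import re
from urllib.request import urlopen


def fetch_page(url):
    """Download a page and decode it to text (UTF-8 is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = r"\b%s\b" % re.escape(word)
    return len(re.findall(pattern, text, flags=re.IGNORECASE))
```

For example, count_word(fetch_page("http://www.example.com"), "domain") would fetch the page and count how often "domain" appears in it. Note that this searches the raw HTML, so matches inside tags and attributes are counted too.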