removing characters off the end of a string - python

I am displaying prices of graphics cards from Newegg using web scraping. Some of the text I scrape has unwanted text after the price that gets scraped too. What is the most efficient way to display only the price text and nothing more?
price_container = container.findAll("li", {"class": "price-current"})
price = price_container[0].text
if len(price) > 7:
The prices (the bit I want to keep) are never more than 7 characters long, so I thought I could remove the unwanted text using this if statement, but I'm not sure how, because each price is followed by a different length of unwanted text.

Use a regular expression:
import re
m = re.search(r'\$([\d.]+)', price)
if m:
    print(m.group(0))  # to include the dollar sign
    print(m.group(1))  # the amount without the dollar sign
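As a rough sketch of how this could be wrapped up for reuse, the snippet below puts the search in a small helper. The helper name `extract_price`, the slightly stricter pattern (which also allows thousands separators) and the sample strings are all assumptions for illustration; they are not taken from the actual Newegg markup.

```python
import re

# Assumed pattern: a dollar sign, then digits with optional thousands commas
# and an optional decimal part.
PRICE_RE = re.compile(r'\$(\d[\d,]*(?:\.\d+)?)')

def extract_price(text):
    """Return the first dollar amount in text (with the $), or None if absent."""
    m = PRICE_RE.search(text)
    return m.group(0) if m else None

# Hypothetical scraped strings, made up for this example.
samples = ["$299.99 (2 Offers)", "$1,049.99 Sale", "Out of stock"]
for s in samples:
    print(extract_price(s))
```

Returning None when nothing matches lets the caller decide how to display items without a price.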

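You can either use a regular expression, or split the string and extract the numbers from it.
Example:
[float(p.lstrip('$')) for p in price.split() if p.lstrip('$').replace('.', '', 1).isdigit()] # Will give you a list of the numbers in the string. You can then join them back together. A plain p.isdigit() test would skip tokens such as "$299.99", because "$" and "." are not digits.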
Perhaps not exactly what you are looking for, but hopefully will help you :)

if len(price) > 7:
    price = price[:7]  # Reassigns price to its first 7 characters. (price[:-1] would drop only the last character.)

Related

Replace characters in specific locations in strings inside lists

Very new to Python/programming, trying to create a "grocery list generator" as a practice project.
I created a bunch of meal variables with their ingredients in a list, then to organise that list in a specific (albeit probably super inefficient) way with vegetables at the top I've added a numerical value at the start of each string. It looks like this -
meal = ["07.ingredient1", "02.ingredient2", "05.ingredient3"]
It organises, prints, and writes how I want it to, but now I want to remove the first three characters (the numbers) from each string in the list before I write it to my text file.
So far my final bit of code looks like this -
Have tried a few different things between the '.sort' and 'with open' like replace, strip, range and some other things but can't get them to work.
My next stop was trying something like this, but can't figure it out -
for item in groceries[1:]
str(groceries(range99)).replace('')
Thanks heaps for your help!
for item in groceries:
    shopping_list.write(item[3:] + '\n')
Instead of replacing you can just take a substring.
groceries = [g[3:] for g in groceries]
Depending on your general programming knowledge, this solution may be a bit advanced, but regular expressions would be another alternative.
import re
pattern = re.compile(r"\d+\.\s*(\w+)")
for item in groceries:
    ingredient = pattern.findall(item)[0]
\d means any digit (0-9), + means "at least one", \. matches a literal ".", \s is whitespace, * means "0 or more", and \w is any word character (a-z, A-Z, 0-9 and the underscore).
This would also match things like
groceries = ["1. sugar", "0110.salt", "10. tomatoes"]
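To check the pattern against those variations, a small sketch along these lines may help. It uses `match` rather than `findall` and keeps an item unchanged when no numeric prefix is found; that fallback is an assumption about the desired behaviour, not part of the original answer.

```python
import re

pattern = re.compile(r"\d+\.\s*(\w+)")
groceries = ["1. sugar", "0110.salt", "10. tomatoes", "07.ingredient1"]

cleaned = []
for item in groceries:
    m = pattern.match(item)
    # Keep the item unchanged if it has no numeric prefix (assumed fallback).
    cleaned.append(m.group(1) if m else item)

print(cleaned)  # ['sugar', 'salt', 'tomatoes', 'ingredient1']
```

Note that \w+ stops at the first space, so a two-word ingredient such as "olive oil" would be cut to "olive" with this pattern.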
>>> meal = ["07.ingredient1", "02.ingredient2", "05.ingredient3"]
>>> myarr = [i[3:] for i in meal]
>>> print(myarr)
['ingredient1', 'ingredient2', 'ingredient3']

I have a list and I want to print a specific string from it. How can I do that?

So far I have done this, but it returns the movie name; I want the year (e.g. 1995) in a separate list.
moviename=[]
for i in names:
    moviename.append(i.split(' (',1)[0])
One issue with the code you have is that you're taking the first element of the list returned by split, which is the movie title. You want the second element, split()[1].
That being said, this solution won't work very well for a couple of reasons.
You will still have the closing parenthesis in the year: "1995)"
It won't work if the title itself contains parentheses (e.g. for Shanghai Triad)
If the year is always at the end of each string, you could do something like this.
movie_years = []
for movie in names:
    movie_years.append(movie[-5:-1])
You could use a regular expression.
\(\d+\) will match an opening bracket, followed by one or more digit characters (0-9), followed by a closing bracket.
Put only the \d+ part inside a capturing group to get only that part.
import re
year_regex = r'\((\d+)\)'
moviename = []
for i in names:
    match = re.search(year_regex, i)  # search once and reuse the match
    if match:
        moviename.append(match.group(1))
By the way, you can make this all more concise using a list comprehension:
year_regex = r'\((\d+)\)'
moviename = [re.search(year_regex, name_and_year).group(1)
             for name_and_year in names
             if re.search(year_regex, name_and_year)]
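For titles that contain parentheses themselves, one possible variation is to anchor the pattern at the end of the string. The sketch below is only an illustration: the `names` list is made-up sample data (the original list is not shown), and the four-digit `\d{4}` and the `$` anchor are assumptions about how the years are written.

```python
import re

# Hypothetical sample data; the original `names` list is not shown.
names = [
    "Toy Story (1995)",
    "Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)",
    "Untitled",
]

# Anchor at the end of the string so parentheses inside the title are ignored.
year_regex = re.compile(r'\((\d{4})\)\s*$')

titles, years = [], []
for name in names:
    m = year_regex.search(name)
    if m:
        titles.append(name[:m.start()].rstrip())  # everything before "(year)"
        years.append(m.group(1))

print(titles)
print(years)
```

Entries without a trailing year, like "Untitled" here, are simply skipped.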

Trouble using regex to locate special characters

I am using BeautifulSoup and Selenium to collect some data from a page. After narrowing the data down to the string that I want, it gives me 'First Blood○○○○○●○○○○'. My goal is to determine the position of the filled-in dot (so 5 in this case, counting from 0).
I started by trying to remove all of the non-special characters using:
test = re.sub(r'[a-z]+', '', collectStatistics[5], re.I)
Which gave me 'F B○○○○○●○○○○' so I am guessing F B are also special characters. I have no clue how to go about writing a regex that will detect the filled in circle so any advice would be appreciated.
Thanks in advance :)
I think regexes (regices?) are overkill here.
First, cut off everything after the filled dot:
line = line.split('●')[0] # Split on filled dots, then take only the first part
Now, count the empty dots:
result = line.count('○') # Count occurrences
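The two steps can be combined into one small helper. This is just a sketch: the function name `filled_dot_position` and the choice to return -1 when there is no filled dot are assumptions, and `str.partition` is used in place of `split` so the missing-dot case can be detected.

```python
def filled_dot_position(line, filled='●', empty='○'):
    """Return the zero-based position of the filled dot among the dots, or -1."""
    before, sep, _ = line.partition(filled)
    if not sep:
        return -1  # no filled dot present (assumed convention)
    return before.count(empty)

print(filled_dot_position('First Blood○○○○○●○○○○'))  # 5
```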
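It finds F and B because your regex only matches lowercase letters: the fourth positional argument of re.sub is count, not flags, so re.I is never applied as a flag. If you want to match all letters, change the regex to [a-zA-Z]+ and pass any flags by keyword.
import re
collectStatistics = "First Blood○○○○○●○○○○"
test = re.sub(r'[a-zA-Z]+', '', collectStatistics, flags=re.I)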
print (test)
OUTPUT :
○○○○○●○○○○
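To see why the original call left 'F B' behind, it may help to compare the two ways of passing re.I. In the sketch below, `count=2` stands in for the positional re.I, since re.I has the integer value 2 and the fourth positional parameter of re.sub is count; the variable names are made up for the example.

```python
import re

s = "First Blood○○○○○●○○○○"

# Equivalent to re.sub(r'[a-z]+', '', s, re.I): at most two replacements,
# and the pattern stays case-sensitive.
limited = re.sub(r'[a-z]+', '', s, count=2)

# Passing the flag by keyword applies it as intended.
ignoring_case = re.sub(r'[a-z]+', '', s, flags=re.I)

print(repr(limited))
print(repr(ignoring_case))
```

The second result still contains the space that sat between the two words, so a trailing `.strip()` may be wanted if only the dots should remain.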

regex count occurrences

I am looking for a way to count the occurrences found in the string based on my regex. I used findall() and it returns a list, but the len() of the list is only 1. Shouldn't the len() of the list be 2?
import re
string1 = r'Total $200.00 Total $900.00'
regex = r'(.*Total.*|.*Invoice.*|.*Amount.*)?(\s+?\$\s?[1-9]{1,10}.*(?:[.,]\d{3})*(?:[.,]\d{2})?)'
patt = re.findall(regex,string1)
print(patt)
print(len(patt))
Result:
> [('Total $200.00 Total', ' $900.00')]
> 1
Not sure if my regex is causing it to miscount. I am looking to get the Total from a file, but there are many variations of this.
Examples:
Total $900.00
Invoice Amt $500.00
Total 800.00
etc.
I am looking to count this because there could be multiple invoice details in one file.
First off, because that's a common misconception:
There is no need to match "all text up to the match" or "all the text after a match". You can drop those .* in your regex. Start with what you actually want to match.
import re
string1 = 'Total $200.00 Total $900.00'
amount_pattern = r'(?:Total|Amt|Invoice Amt|Others)[:\s]*\$([\d\.,]*\d)'
amount_expr = re.compile(amount_pattern, re.IGNORECASE)
amount_expr.findall(string1)
# -> ['200.00', '900.00']
\$([\d\.,]*\d) is a half-way reasonable approximation of prices ("things that start with a $ and then contain a bunch of digits and possibly dots and commas"). The final \d makes sure we are not accidentally matching sentence punctuation. It might be good enough, but you know what data you are working with. Feel free to come up with a more specific sub-expression. Include an optional leading - if you expect to see negative amounts.
Try:
>>> re.findall(r'(\w*\s+\$\d+\.\d+)', string1)
['Total $200.00', 'Total $900.00']
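The issue you are having is that your regex has two capture groups, so re.findall returns a list of tuples, one tuple per match, each holding the two groups. Your pattern matched only once (the greedy .* swallowed the first amount), so the list contains a single tuple and its length is 1.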
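The effect of capture groups on what re.findall returns can be seen side by side in a short sketch. The three patterns here are simplified stand-ins chosen for illustration, not the pattern from the question.

```python
import re

string1 = 'Total $200.00 Total $900.00'

no_groups = re.findall(r'\$\d+\.\d+', string1)
one_group = re.findall(r'\$(\d+\.\d+)', string1)
two_groups = re.findall(r'(\w+)\s+\$(\d+\.\d+)', string1)

print(no_groups)   # ['$200.00', '$900.00']
print(one_group)   # ['200.00', '900.00']
print(two_groups)  # [('Total', '200.00'), ('Total', '900.00')]

# Counting matches without caring about the groups:
count = sum(1 for _ in re.finditer(r'\$\d+\.\d+', string1))
print(count)       # 2
```

In every case the length of the returned list equals the number of matches; only the shape of each element changes with the number of groups.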

Python extract information after phrase or group of words

I am trying to extract information from PDF.
Simple search worked:
filecontent = ReadDoc.getContent("c:\\temp\\pdf_1.pdf")
match = re.search(r'Document ID: (\d+)', filecontent)
if match:
    docid = match.group(1)
But when I want to search a long phrase, e.g.
I want to extract '$999,999.00', which may appear in the document like "Total Cumulative Payment (USD) $999,999.00" or "Total cumulative payment $55587323.23". Note that there is a difference in the text and I need to use some kind of fuzzy search, find the sentence, somehow extract the $ from there.
Similarly I also need to extract some date, number, amount, money in between phrases/words.
Appreciate your help!
I think this should do what you want:
import re
textlist = ["some other amount as $32,4545.34 and Total Cumulative Payment (USD) $999,999.00 and such","Total cumulative payment $55587323.23"]
matchlist = []
for text in textlist:
    match = re.findall(r"(\$[.\d,]+)", text)
    if match:
        matchlist.extend(match)
print(matchlist)
results:
['$32,4545.34', '$999,999.00', '$55587323.23']
The regex looks for a $ and then grabs periods, commas and digits up to the next character that is none of those (for example a space). Depending on what other kind of data you are parsing it may need to be tweaked; I am assuming you only want to capture periods, commas and numbers.
update:
it will now find any number of occurrences and put them all in a list
Well something like this can be done with regular expressions:
import re
source = 'total cumulative payment $2000.00; some other amount $1234.56. Total Cumulative Payment (USD) $5,600,000.06'
matches = re.findall( r'total\s+cumulative\s+payment[^$0-9]+\$([0-9,.]+)', source, re.IGNORECASE )
amounts = [ float( x.replace( ',', '' ).rstrip('.') ) for x in matches ]
This will match the two specific examples you've given. But you haven't given much of an idea of how loose the matching criteria should be, or what the rules are. The solution above will miss amounts if the source document has a spelling mistake in the word "cumulative". Or if the amount appears without the dollar sign. It also allows any amount of intervening text between "total cumulative payment" and the dollar amount (so you'll get a false positive from source = "This document contains information about total cumulative payment values, (...3 more pages of introductory material...) and by the way you owe me $20.") Now, these things can be tweaked and improved - but only if you know what is going to be important and what is not, and tighten the specification of the question accordingly.
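As one example of such tightening, the gap between the phrase and the amount could be capped. The sketch below is only a guess at a reasonable rule: the limit of 20 characters, the helper name `extract_payments` and the stricter amount pattern are all assumptions that would need adjusting to the real documents.

```python
import re

# Hypothetical tightening: allow at most 20 characters that are neither a
# digit nor "$" between the phrase and the amount (20 is an arbitrary choice).
PAYMENT_RE = re.compile(
    r'total\s+cumulative\s+payment[^$0-9]{0,20}\$([0-9,]*[0-9](?:\.[0-9]+)?)',
    re.IGNORECASE,
)

def extract_payments(text):
    """Return the matched amounts as floats, with thousands separators removed."""
    return [float(m.replace(',', '')) for m in PAYMENT_RE.findall(text)]

source = ('total cumulative payment $2000.00; some other amount $1234.56. '
          'Total Cumulative Payment (USD) $5,600,000.06')
far_away = ('This document contains information about total cumulative payment '
            'values, and much later, by the way you owe me $20.')

print(extract_payments(source))
print(extract_payments(far_away))
```

With this cap, the first string still yields both payments, while the amount that appears long after the phrase in the second string is no longer picked up.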
