Separating a string into numbers and letters in Python

I started learning Python two days ago. Today I built a web scraping script that pulls data from Yahoo Finance and writes it to a CSV file. The problem is that some values are strings, because Yahoo Finance displays them with a unit suffix.
For example: Revenue: 806.43M
When I copy them into the CSV I can't use them for calculations, so I was wondering whether it is possible to separate the "806.43" from the "M", keeping both so the unit is still visible, and put them in two different columns.
To write the file I use this command:
f.write(revenue + "," + revenue_value + "\n")
where:
print(revenue)
Revenue (ttm)
print(revenue_value)
806.43M
So in the end I should be able to use a command that looks something like this:
f.write(revenue + "," + revenue_value + "," + revenue_unit + "\n")
where revenue_value is 806.43 and revenue_unit is M
Hope someone could help with the problem.

I believe the easiest way is to keep the number as a string, split off the unit at the end of the string, and convert the rest to a float scaled by that unit.
The following should do the trick:
def parse_number(number_str) -> float:
    mapping = {
        "K": 1000,
        "M": 1000000,
        "B": 1000000000
    }
    unit = number_str[-1]
    number_float = float(number_str[:-1])
    return number_float * mapping[unit]
And here's an example:
my_number = "806.43M"
print(parse_number(my_number))
>>> 806430000.0
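The function above folds the unit into a single float. If the value and the unit should instead land in two separate CSV columns, as the question asks, a minimal sketch could look like the following. The helper name split_value_unit is hypothetical, and the sketch assumes the unit is at most one trailing letter.

```python
def split_value_unit(text):
    """Split '806.43M' into ('806.43', 'M'); the unit is '' when there is none."""
    text = text.strip()
    if text and text[-1].isalpha():
        return text[:-1], text[-1]
    return text, ""


revenue = "Revenue (ttm)"
revenue_value, revenue_unit = split_value_unit("806.43M")

# Same shape as the f.write() call in the question, with the extra column.
line = revenue + "," + revenue_value + "," + revenue_unit + "\n"
```

The value stays a string here, so it is written to the file exactly as scraped; float(revenue_value) can be applied later when a number is needed.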

You can always try regular expressions.
Here's a pretty good online tool to let you practice using Python-specific standards.
import re
sample = "Revenue (ttm): 806.43M"
# Note: the `(?P<name here>)` section is a named group. That way we can identify what we want to capture.
financials_pattern = r'''
(?P<category>.+?):?\s+? # Capture everything up until the colon
(?P<value>[\d\.]+) # Capture only numeric values and decimal points
(?P<unit>[\w]*)? # Capture a trailing unit type (M, MM, etc.)
'''
# Flags:
# re.I -> Ignore character case (upper vs lower)
# re.X -> Allows for 'verbose' pattern construction, as seen above
res = re.search(financials_pattern, sample, flags = re.I | re.X)
Print our dictionary of values:
res.groupdict()
Output:
{'category': 'Revenue (ttm)',
'value': '806.43',
'unit': 'M'}
We can also use .groups() to list results in a tuple.
res.groups()
Output:
('Revenue (ttm)', '806.43', 'M')
In this case, we'll immediately unpack those results into your variable names.
revenue = None # If this is None after trying to set it, don't print anything.
revenue, revenue_value, revenue_unit = res.groups()
We'll use fancy f-strings to print out both your f.write() call along with the results we've captured.
if revenue:
    print(f'f.write(revenue + "," + revenue_value + "," + revenue_unit + "\\n")\n')
    print(f'f.write("{revenue}" + "," + "{revenue_value}" + "," + "{revenue_unit}" + "\\n")')
Output:
f.write(revenue + "," + revenue_value + "," + revenue_unit + "\n")
f.write("Revenue (ttm)" + "," + "806.43" + "," + "M" + "\n")
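As a rough sketch of how the captured groups could go straight into a CSV row, the snippet below compiles a pattern close to the one above and writes through csv.writer. The function name to_csv_row is hypothetical, and an in-memory io.StringIO buffer stands in for the real output file.

```python
import csv
import io
import re

FINANCIALS_PATTERN = re.compile(
    r"""
    (?P<category>.+?):?\s+?   # label, up to an optional colon
    (?P<value>[\d.]+)         # digits and decimal points
    (?P<unit>[A-Za-z]*)       # optional trailing unit (K, M, B, ...)
    """,
    re.X,
)


def to_csv_row(line):
    """Return [category, value, unit], or None when the line does not match."""
    match = FINANCIALS_PATTERN.search(line)
    if match is None:
        return None
    return [match["category"], match["value"], match["unit"]]


buffer = io.StringIO()  # stand-in for the real file object
writer = csv.writer(buffer, lineterminator="\n")
row = to_csv_row("Revenue (ttm): 806.43M")
if row is not None:
    writer.writerow(row)
csv_text = buffer.getvalue()
```

Checking for None before writing avoids an AttributeError on lines that carry no number at all.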

Related

How to add a Google formula containing commas and quotes to a CSV file?

I'm trying to output a CSV file from Python and make one of the entries a Google sheet formula:
This is what the formula var would look like:
strLink = "https://xxxxxxx.xxxxxx.com/Interact/Pages/Content/Document.aspx?id=" + strId + "&SearchId=0&utm_source=interact&utm_medium=general_search&utm_term=*"
strLinkCellFormula = "=HYPERLINK(\"" + strLink + "\", \"" + strTitle + "\")"
and then for each row of the CSV I have this:
strCSV = strCSV + strId + ", " + "\"" + strTitle + "\", " + strAuthor + ", " + strDate + ", " + strStatus + ", " + "\"" + strSection + "\", \"" + strLinkCellFormula +"\"\n"
Which doesn't quite work, the hyperlink formula for Google sheets is like so:
=HYPERLINK(url, title)
and I can't seem to get that comma escaped. So in my Sheet I am getting an additional column with the title in it and obviously the formula does not work. Any help would be appreciated.
Try using ; as the formula argument separator. It should work the same.
Instead of reinventing the wheel, you should write your CSV rows using csv.writer from the standard library's csv module. It takes care of escaping any commas and quotes in the data, so you don't need to build your own escape logic, and it lets you avoid the mess of escaping in your strLinkCellFormula = ... and strCSV = strCSV + ... lines.
For example:
import csv
urls = ["https://google.com", "https://stackoverflow.com/", "https://www.python.org/"]
titles = ["Google", "Stack Overflow", "Python"]
# newline="" lets the csv module control line endings (avoids blank rows on Windows)
with open("file.csv", "w", newline="") as fw:
    writer = csv.writer(fw)
    writer.writerow(["Company", "Website"])
    for u, t in zip(urls, titles):
        formula = f'=HYPERLINK("{u}", "Visit {t}")'
        row = [t, formula]
        writer.writerow(row)
Note that in the line formula = ... above, I used the f-string syntax to format the URL and title into the string. I also used apostrophes to define the string, since I knew that the string was going to contain quotation marks and I didn't want to bother escaping them.
This gives the following CSV:
Company,Website
Google,"=HYPERLINK(""https://google.com"", ""Visit Google"")"
Stack Overflow,"=HYPERLINK(""https://stackoverflow.com/"", ""Visit Stack Overflow"")"
Python,"=HYPERLINK(""https://www.python.org/"", ""Visit Python"")"
where the escaping of commas and quotes is already taken care of.
It is also read correctly by Excel and Google Sheets, since it conforms to the standard CSV format.
For your specific case, you'd write to your CSV file like so:
with open(filename, "w", newline="") as wf:
    writer = csv.writer(wf)
    writer.writerow(headers) # if necessary
    for ...:
        strLink = f"https://xxxxxxx.xxxxxx.com/Interact/Pages/Content/Document.aspx?id={strId}&SearchId=0&utm_source=interact&utm_medium=general_search&utm_term=*"
        strLinkCellFormula = f'=HYPERLINK("{strLink}", "{strTitle}")'
        row = [strId, strTitle, strAuthor, strDate, strStatus, strSection, strLinkCellFormula]
        writer.writerow(row)
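One way to convince yourself that the quoting is handled correctly is to write a formula row and read it back. The sketch below does this with an in-memory io.StringIO buffer instead of a real file; the URL and title are just sample values.

```python
import csv
import io

formula = '=HYPERLINK("https://www.python.org/", "Visit Python")'

buffer = io.StringIO()  # stand-in for the real CSV file
csv.writer(buffer, lineterminator="\n").writerow(["Python", formula])
raw_line = buffer.getvalue()

# Reading the same text back yields the original two fields,
# even though the formula contains both commas and quotes.
buffer.seek(0)
parsed_row = next(csv.reader(buffer))
```

The raw line shows the doubled quotes that csv.writer adds, and the parsed row shows that a CSV reader undoes them again.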

PyParsing "parseString" class unavailable

I am trying to build a quick script that extracts only certain information from invoice PDFs without using regex.
When I try to define the grammar for, say, electric usage, I get an error "cannot import name 'parseString' from 'pyparsing'"
I have tried reinstalling, modifying casing from camel to snake, etc etc but I am at a loss at this point.
Here is the (I think) relevant documentation:
https://pyparsing-docs.readthedocs.io/en/latest/pyparsing.html
The code:
electric_usage = pp.Word(nums) + ',' + pp.Word(nums) + 'kwh'
dates_1 = pp.Word(nums) + '-' + pp.Word(nums) + '-' + pp.Word(nums)
dates_2 = pp.Word(nums) + '/' + pp.Word(nums) + '/' + pp.Word(nums)
for str in pdf_text:
    usage_pulled = electric_usage.parseString(pdf_text)
    print(usage_pulled)
here is an example of one of the regex patterns that actually seems to work to pull usage values:
'[0-9]+[0-9]+[0-9]+[,]+[0-9]+[0-9]+[0-9]'
and cost:
'[$]+[0-9]+[0-9]+[,]+[0-9]+[0-9]+[0-9]+[.]+[0-9]+[0-9]+$'

How to concatenate multiple variables?

I have three UV sensors - integers output; one BME280 - float output (temperature and pressure); and one GPS Module - float output.
I need to build a string in this form - #teamname;temperature;pressure;uv_1;uv_2;uv_3;gpscoordinates#
and send it via ser.write at least once per second. I'm using an APC220 module.
Is this the right (and fastest) way to do it?
textstr = str("#" + "teamname" + ";" + str(temperature) + ";" + str(pressure) + ";" + str(uv_1) + ";" + str(uv_2) + ";" + str(uv_3) + "#")
(...)
ser.write(('%s \n'%(textstr)).encode('utf-8'))
You may try something like this:
values = [teamname, temperature, pressure, uv_1, uv_2, uv_3, gpscoordinates]
joined = ';'.join(map(str, values))
ser.write(('#%s# \n' % joined).encode('utf-8'))
If using python 3.6+ then you can do this instead
textstr = f"#teamname;{temperature};{pressure};{uv_1};{uv_2};{uv_3}# \n"
(...)
ser.write((textstr).encode('utf-8'))
If teamname and gpscoordinates are also variables then add them the same way
textstr = f"#{teamname};{temperature};{pressure};{uv_1};{uv_2};{uv_3};{gpscoordinates}# \n"
(...)
ser.write((textstr).encode('utf-8'))
For more info about string formatting
https://realpython.com/python-f-strings/
It might improve readability to use python's format:
textstr = "#teamname;{};{};{};{};{};gpscoordinates#".format(temperature, pressure, uv_1, uv_2, uv_3)
ser.write(('%s \n'%(textstr)).encode('utf-8'))
assuming gpscoordinates is text (it's not in your attempted code). If it's a variable, then replace the text with {} and add it as a param to format.
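Putting the suggestions together, the frame can be built in one small function that returns bytes, so the result can be handed to ser.write() directly. This is only a sketch: build_packet is a hypothetical name, the sample readings are made up, and the trailing " \n" mirrors the terminator used in the question.

```python
def build_packet(teamname, temperature, pressure, uv_1, uv_2, uv_3, gpscoordinates):
    """Return the '#a;b;...#' frame as UTF-8 bytes, ready for ser.write()."""
    fields = [teamname, temperature, pressure, uv_1, uv_2, uv_3, gpscoordinates]
    body = ";".join(str(field) for field in fields)
    return f"#{body}# \n".encode("utf-8")


# Made-up sample readings, only to show the resulting frame.
packet = build_packet("teamname", 21.5, 1013.25, 1, 2, 3, "48.1,11.5")
```

Building a string this short takes a negligible amount of time next to a once-per-second serial write, so readability is the better criterion for choosing between the variants.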

How to format a long string while following pylint rules?

I have a very simple problem that I have been unable to find the solution to, so I thought I'd try my "luck" here.
I have a string that is created using variables and static text altogether. It is as follows:
filename_gps = 'id' + str(trip_id) + '_gps_did' + did + '_start' + str(trip_start) + '_end' + str(trip_end) + '.json'
However, my problem is that pylint complains that this line is too long. How would I split this string expression over multiple lines without it looking weird, while still staying within pylint's "rules"?
At one point I ended up with it looking like this, which is incredibly "ugly" to look at:
filename_gps = 'id' + str(
trip_id) + '_gps_did' + did + '_start' + str(
trip_start) + '_end' + str(
trip_end) + '.json'
I found that it would follow the "rules" of pylint if I formatted it like this:
filename_gps = 'id' + str(
trip_id) + '_gps_did' + did + '_start' + str(
trip_start) + '_end' + str(
trip_end) + '.json'
Which is much "prettier" to look at, but in case I didn't have the "str()" casts, how would I go about creating such a string?
I doubt that there is a difference between pylint for Python 2.x and 3.x, but if there is I am using Python 3.x.
Don't use so many str() calls. Use string formatting:
filename_gps = 'id{}_gps_did{}_start{}_end{}.json'.format(
    trip_id, did, trip_start, trip_end)
If you do have a long expression with a lot of parts, you can create a longer logical line by using (...) parentheses:
filename_gps = (
    'id' + str(trip_id) + '_gps_did' + did + '_start' +
    str(trip_start) + '_end' + str(trip_end) + '.json')
This would work for breaking up a string you are using as a template in a formatting operation, too:
foo_bar = (
    'This is a very long string with some {} formatting placeholders '
    'that is broken across multiple logical lines. Note that there are '
    'no "+" operators used, because Python auto-joins consecutive string '
    'literals.'.format(spam))
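Since the question mentions Python 3.x, the same parenthesized auto-joining also works with f-strings (Python 3.6+), which removes both the str() calls and the .format() call. A small sketch with made-up values for the four variables:

```python
# Made-up sample values, only for illustration.
trip_id = 42
did = "abc"
trip_start = 1000
trip_end = 2000

# Each piece is its own f-string; Python joins adjacent literals
# inside the parentheses, so no "+" is needed.
filename_gps = (
    f"id{trip_id}_gps_did{did}"
    f"_start{trip_start}_end{trip_end}.json"
)
```

Every piece that contains a placeholder needs its own f prefix, because the prefix applies per literal and not to the joined result.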

Writing multiple values in a text file using python

I want to write multiple values to a text file using Python.
I wrote the following line in my code:
text_file.write("sA" + str(chart_count) + ".Name = " + str(State_name.groups())[2:-3] + "\n")
Note: State_name.groups() is a regex captured word. So it is captured as a tuple and to remove the ( ) brackets from the tuple I have used string slicing.
Now the output comes as:
sA0.Name = GLASS_OPEN
No problem here
But I want the output to be like this:
sA0.Name = 'GLASS_HATCH_OPENED_PROTECTION_FCT'
I want the variable value to be enclosed inside the single quotes.
Does this work for you?
text_file.write("sA" + str(chart_count) + ".Name = '" + str(State_name.groups())[2:-3] + "'\n")
# Single quotes added: one after "= " and one before "\n".
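Slicing the string form of the tuple works, but it depends on how Python happens to print tuples. A slightly more robust sketch takes the captured text with group(1) and adds the quotes in an f-string. The pattern and input text below are hypothetical stand-ins for whatever regex produced State_name.

```python
import re

# Hypothetical pattern and input; the real regex is not shown in the question.
STATE_PATTERN = re.compile(r"state\s+(\w+)")
chart_count = 0

State_name = STATE_PATTERN.search("state GLASS_HATCH_OPENED_PROTECTION_FCT")

line = ""
if State_name is not None:
    # group(1) is the captured word itself, with no tuple punctuation to strip.
    line = f"sA{chart_count}.Name = '{State_name.group(1)}'\n"
```

The resulting line can be passed to text_file.write() unchanged.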
