Concatenate multi-column text files containing numbers & nan - python

I have a folder named 17307 which contains some .ismr files (essentially just CSV files) named like
SEPT307A.17_.ismr,
SEPT307B.17_.ismr,
SEPT307C.17_.ismr, ..., up to SEPT307X.17_.ismr.
I want to concatenate all these into a single text file using Python. I tried:
st = 'path/to/folder'
a = input('Enter first part of file') #i.e. SEPT307 in file name
alph = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X']
yr = input('enter the year')
last = '_.ismr'
for letter in alph:
    st1 = st + "a" + alph + "." + "yr" + last
    fp = open(st1, "r")
    data=np.append(data, np.fromfile(fp, dtype=list))
i.e. I am trying to put everything into data and later copy data to a separate text file.
However I am getting this error:
TypeError: Can't convert 'list' object to str implicitly
Can anyone kindly suggest some way for doing this?

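It looks like the error comes from this line:
st1 = st + "a" + alph + "." + "yr" + last
Here alph is the full list of letters, not the current letter. It should be:
st1 = st + a + letter + "." + yr + last
The issue is that you are trying to concatenate a list with a str. Note also that a and yr must be used without quotes: "a" and "yr" are literal strings, not the values you entered.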
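Since the files are plain text, one option is to skip NumPy and append the text of each file directly, which also keeps any nan entries exactly as written. The following is a minimal sketch under that assumption; the function name concatenate_ismr and the output name combined.txt are made up for illustration.

```python
from pathlib import Path
import string


def concatenate_ismr(folder, prefix, year, out_name="combined.txt"):
    """Append <prefix>A.<year>_.ismr ... <prefix>X.<year>_.ismr into one text file."""
    folder = Path(folder)
    out_path = folder / out_name
    with open(out_path, "w") as out:
        for letter in string.ascii_uppercase[:24]:  # 'A' through 'X'
            src = folder / f"{prefix}{letter}.{year}_.ismr"
            if not src.exists():
                continue  # skip letters that have no file
            text = src.read_text()
            out.write(text)
            # Make sure the next file starts on a fresh line.
            if text and not text.endswith("\n"):
                out.write("\n")
    return out_path
```

It could be called as concatenate_ismr('path/to/folder', 'SEPT307', '17').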

How to add a Google formula containing commas and quotes to a CSV file?

I'm trying to output a CSV file from Python and make one of the entries a Google sheet formula:
This is what the formula var would look like:
strLink = "https://xxxxxxx.xxxxxx.com/Interact/Pages/Content/Document.aspx?id=" + strId + "&SearchId=0&utm_source=interact&utm_medium=general_search&utm_term=*"
strLinkCellFormula = "=HYPERLINK(\"" + strLink + "\", \"" + strTitle + "\")"
and then for each row of the CSV I have this:
strCSV = strCSV + strId + ", " + "\"" + strTitle + "\", " + strAuthor + ", " + strDate + ", " + strStatus + ", " + "\"" + strSection + "\", \"" + strLinkCellFormula +"\"\n"
Which doesn't quite work, the hyperlink formula for Google sheets is like so:
=HYPERLINK(url, title)
and I can't seem to get that comma escaped. So in my Sheet I am getting an additional column with the title in it and obviously the formula does not work. Any help would be appreciated.
Try using ; as the formula argument separator. It should work the same.
Instead of reinventing the wheel, you should write your CSV rows using the built-in csv.writer class. It takes care of escaping any commas and quotes in the data, so you don't need to build your own escape logic. This helps you avoid the mess of escaping in your strLinkCellFormula = ... and strCSV = strCSV + ... lines.
For example:
import csv
urls = ["https://google.com", "https://stackoverflow.com/", "https://www.python.org/"]
titles = ["Google", "Stack Overflow", "Python"]
# newline="" lets the csv module control line endings (avoids blank rows on Windows)
with open("file.csv", "w", newline="") as fw:
    writer = csv.writer(fw)
    writer.writerow(["Company", "Website"])
    for u, t in zip(urls, titles):
        formula = f'=HYPERLINK("{u}", "Visit {t}")'
        row = [t, formula]
        writer.writerow(row)
Note that in the line formula = ... above, I used the f-string syntax to format the URL and title into the string. I also used apostrophes to define the string, since I knew that the string was going to contain quotation marks and I didn't want to bother escaping them.
This gives the following CSV:
Company,Website
Google,"=HYPERLINK(""https://google.com"", ""Visit Google"")"
Stack Overflow,"=HYPERLINK(""https://stackoverflow.com/"", ""Visit Stack Overflow"")"
Python,"=HYPERLINK(""https://www.python.org/"", ""Visit Python"")"
where the escaping of commas and quotes is already taken care of.
It is also read by Excel/GSheets correctly, since it conforms to the standard CSV format.
For your specific case, you'd write to your CSV file like so:
with open(filename, "w", newline="") as wf:
    writer = csv.writer(wf)
    writer.writerow(headers) # if necessary
    for ...:
        strLink = f"https://xxxxxxx.xxxxxx.com/Interact/Pages/Content/Document.aspx?id={strId}&SearchId=0&utm_source=interact&utm_medium=general_search&utm_term=*"
        strLinkCellFormula = f'=HYPERLINK("{strLink}", "{strTitle}")'
        row = [strId, strTitle, strAuthor, strDate, strStatus, strSection, strLinkCellFormula]
        writer.writerow(row)
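To check that the quoting survives a round trip, the row can be written to an in-memory buffer and read back with csv.reader. This is only a small sketch; the io.StringIO buffer stands in for the real file, and the sample title and URL are taken from the example above.

```python
import csv
import io

formula = '=HYPERLINK("https://www.python.org/", "Visit Python")'

# Write one row to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
csv.writer(buffer).writerow(["Python", formula])
line = buffer.getvalue()

# Read the same text back: the formula comes out as a single field,
# with its comma and quotation marks intact.
row = next(csv.reader(io.StringIO(line)))
print(row)
```

The formula ends up in one cell because csv.writer wraps the field in quotes and doubles the quotation marks inside it.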

Separating a string into numbers and letters in python

I started learning Python two days ago. Today I built a web scraping script which pulls data from Yahoo Finance and puts it in a CSV file. The problem I have is that some values are strings because Yahoo Finance displays them as such.
For example: Revenue: 806.43M
When I copy them into the CSV I can't use them for calculations, so I was wondering if it is possible to separate the "806.43" and the "M", while still keeping both to see the unit of the number, and put them in two different columns.
For writing the file I use this command:
f.write(revenue + "," + revenue_value + "\n")
where:
print(revenue)
Revenue (ttm)
print(revenue_value)
806.43M
so in the end I should be able to use a command which looks something like this
f.write(revenue + "," + revenue_value + "," + revenue_unit + "\n")
where revenue_value is 806.43 and revenue_unit is M
Hope someone could help with the problem.
I believe the easiest way is to parse the number as a string and convert it to a float based on the unit at the end of the string.
The following should do the trick:
def parse_number(number_str) -> float:
    mapping = {
        "K": 1000,
        "M": 1000000,
        "B": 1000000000
    }
    unit = number_str[-1]
    number_float = float(number_str[:-1])
    return number_float * mapping[unit]
And here's an example:
my_number = "806.43M"
print(parse_number(my_number))
>>> 806430000.0
You can always try regular expressions.
Here's a pretty good online tool to let you practice using Python-specific standards.
import re
sample = "Revenue (ttm): 806.43M"
# Note: the `(?P<name here>)` section is a named group. That way we can identify what we want to capture.
financials_pattern = r'''
(?P<category>.+?):?\s+? # Capture everything up until the colon
(?P<value>[\d\.]+) # Capture only numeric values and decimal points
(?P<unit>[\w]*)? # Capture a trailing unit type (M, MM, etc.)
'''
# Flags:
# re.I -> Ignore character case (upper vs lower)
# re.X -> Allows for 'verbose' pattern construction, as seen above
res = re.search(financials_pattern, sample, flags = re.I | re.X)
Print our dictionary of values:
res.groupdict()
Output:
{'category': 'Revenue (ttm)',
'value': '806.43',
'unit': 'M'}
We can also use .groups() to list results in a tuple.
res.groups()
Output:
('Revenue (ttm)', '806.43', 'M')
In this case, we'll immediately unpack those results into your variable names.
revenue = None # Stays None if the pattern did not match, so nothing is printed below.
if res:
    revenue, revenue_value, revenue_unit = res.groups()
We'll use fancy f-strings to print out both your f.write() call along with the results we've captured.
if revenue:
    print(f'f.write(revenue + "," + revenue_value + "," + revenue_unit + "\\n")\n')
    print(f'f.write("{revenue}" + "," + "{revenue_value}" + "," + "{revenue_unit}" + "\\n")')
Output:
f.write(revenue + "," + revenue_value + "," + revenue_unit + "\n")
f.write("Revenue (ttm)" + "," + "806.43" + "," + "M" + "\n")
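If the unit is always a single trailing letter, the two-column output from the question can also be produced without a regular expression. The sketch below rests on that assumption; split_value_unit is a made-up helper name, and the io.StringIO buffer stands in for the real output file.

```python
import csv
import io


def split_value_unit(text):
    """Split '806.43M' into ('806.43', 'M'); a plain number gets an empty unit."""
    text = text.strip()
    if text and text[-1].isalpha():
        return text[:-1], text[-1]
    return text, ""


revenue = "Revenue (ttm)"
revenue_value, revenue_unit = split_value_unit("806.43M")

# Write label, value and unit as three separate columns.
buffer = io.StringIO()
csv.writer(buffer).writerow([revenue, revenue_value, revenue_unit])
print(buffer.getvalue())
```

Keeping the value as the string "806.43" leaves the choice of converting it with float() to whoever reads the file.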

Writing multiple values in a text file using python

I want to write multiple values to a text file using Python.
I wrote the following line in my code:
text_file.write("sA" + str(chart_count) + ".Name = " + str(State_name.groups())[2:-3] + "\n")
Note: State_name.groups() is a regex captured word. So it is captured as a tuple and to remove the ( ) brackets from the tuple I have used string slicing.
Now the output comes as:
sA0.Name = GLASS_OPEN
No problem here
But I want the output to be like this:
sA0.Name = 'GLASS_HATCH_OPENED_PROTECTION_FCT'
I want the variable value to be enclosed inside the single quotes.
Does this work for you?
text_file.write("sA" + str(chart_count) + ".Name = '" + str(State_name.groups())[2:-3] + "'\n")
# ^single quote here and here^
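As an alternative to slicing the string form of the tuple, the captured text can be taken with group(1) and placed inside the quotes with an f-string. This is a sketch only: the sample line and the pattern are invented for illustration, since the original regex is not shown, and it assumes the pattern has a single capture group.

```python
import re

# Hypothetical input and pattern; the question does not show the real ones.
line = "State: GLASS_OPEN"
State_name = re.search(r"State: (\w+)", line)
chart_count = 0

# group(1) returns the captured text directly, so no slicing is needed.
name = State_name.group(1)
output = f"sA{chart_count}.Name = '{name}'\n"
print(output, end="")
```

The resulting string can then be passed to text_file.write(output).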

Reading data from Excel sheets and building SQL statements, writing to output file in Python

I have an Excel workbook with a couple of sheets. Each sheet has two columns, PersonID and LegacyID. We are basically trying to update some records in the database based on PersonID. This is relatively easy to do in T-SQL, and I might even be able to get it done pretty quickly in PowerShell, but since I have been trying to learn Python, I thought I would try it in Python. I used the xlrd module and was able to print the update statements. Below is my code:
import xlrd
book = xlrd.open_workbook('D:\Scripts\UpdateID01.xls')
sheet = book.sheet_by_index(0)
myList = []
for i in range(sheet.nrows):
    myList.append(sheet.row_values(i))
outFile = open('D:\Scripts\update.txt', 'wb')
for i in myList:
    outFile.write("\nUPDATE PERSON SET LegacyID = " + "'" + str(i[1]) + "'" + " WHERE personid = " + "'" + str(i[0])
                  + "'")
Two problems: when I read the output file, I see the LegacyID printed as a float. How do I get rid of the .0 at the end of each ID? Second problem: Python doesn't print each update statement on a new line in the output text file. How do I format it?
Edit: Please ignore the format issue. It did print in new lines when I opened the output file in Notepad++. The float issue still remains.
Can you turn the LegacyID values into ints?
for i in myList:
    i[1] = int(i[1])
    outFile.write("\nUPDATE PERSON SET LegacyID = " + "'" + str(i[1]) + "'" + " WHERE personid = " + "'" + str(i[0])
                  + "'")
Try this:
# use 'a' if you want to append to your text file
outFile = open(r'D:\Scripts\update.txt', 'a')
for i in myList:
    outFile.write("\nUPDATE PERSON SET LegacyID = '%s' WHERE personid = '%s'" % (int(i[1]), str(i[0])))
outFile.close()
Since you are learning Python (which is very laudable!) you should start reading about string formatting in the Python docs. This is the best place to start whenever you have a question like this.
Hint: You may want to convert the float items to integers using int().
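xlrd returns numeric cells as floats, so both columns may carry a trailing .0. The sketch below shows one way to drop it while leaving non-numeric IDs untouched; as_id and build_update are hypothetical helper names, and the sample rows stand in for the values read from the sheet.

```python
def as_id(cell):
    """Render a cell value as an ID string, dropping a trailing .0 from whole floats."""
    if isinstance(cell, float) and cell.is_integer():
        return str(int(cell))
    return str(cell)


def build_update(row):
    """Build one UPDATE statement from a [PersonID, LegacyID] row."""
    person_id, legacy_id = row[0], row[1]
    return "UPDATE PERSON SET LegacyID = '%s' WHERE personid = '%s'" % (
        as_id(legacy_id),
        as_id(person_id),
    )


# Sample rows standing in for sheet.row_values(i).
rows = [[1001.0, 555.0], [1002.0, "A17"]]
statements = [build_update(r) for r in rows]
print("\n".join(statements))
```

Checking is_integer() before converting avoids silently truncating a value that really does have a fractional part.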

Pyparsing 'no such attribute _ParseResults__tokdict' on multi-line inputs

The following code gives me the error 'no such attribute _ParseResults__tokdict' when run on an input with more than one line.
With single-line files, there is no error. If I comment out either the second or third line shown here, then I don't get that error either, no matter how long the file is.
for line in input:
    final = delimitedList(expr).parseString(line)
    notid = delimitedList(notid).parseString(line)
    dash_tags = ', '.join(format_tree(notid))
    print final.lineId + ": " + dash_tags
Does anyone know what's going on here?
EDIT: As suggested, I'm adding the complete code to allow others to reproduce the error.
from pyparsing import *
#first are the basic elements of the expression
#number at the beginning of the line, unique for each line
#top-level category for a sentiment
#semicolon should eventually become a line break
lineId = Word(nums)
topicString = Word(alphanums+'-'+' '+"'")
semicolon = Literal(';')
#call variable early to allow for recursion
#recursive function allowing for a line id at first, then the topic,
#then any subtopics, and so on. Finally, optional semicolon and repeat.
#set results name lineId.lineId here
expr = Forward()
expr << Optional(lineId.setResultsName("lineId")) + topicString.setResultsName("topicString") + \
Optional(nestedExpr(content=delimitedList(expr))).setResultsName("parenthetical") + \
Optional(Suppress(semicolon).setResultsName("semicolon") + expr.setResultsName("subsequentlines"))
notid = Suppress(lineId) + topicString + \
Optional(nestedExpr(content=delimitedList(expr))) + \
Optional(Suppress(semicolon) + expr)
#naming the parenthetical portion for independent reference later
parenthetical = nestedExpr(content=delimitedList(expr))
#open files for read and write
input = open('parserinput.txt')
output = open('parseroutput.txt', 'w')
#defining functions
#takes nested list output of parser grammer and translates it into
#strings suited for the final output
def format_tree(tree):
    prefix = ''
    for node in tree:
        if isinstance(node, basestring):
            prefix = node
            yield node
        else:
            for elt in format_tree(node):
                yield prefix + '_' + elt
#function for passing tokens from setResultsName
def id_number(tokens):
    #print tokens.dump()
    lineId = tokens
    lineId["lineId"] = lineId.lineId
def topic_string(tokens):
    topicString = tokens
    topicString["topicString"] = topicString.topicString
def parenthetical_fun(tokens):
    parenthetical = tokens
    parenthetical["parenthetical"] = parenthetical.parenthetical
#function for splitting line at semicolon and appending numberId
#not currently in use
def split_and_prepend(tokens):
    return '\n' + final.lineId
#setting parse actions
lineId.setParseAction(id_number)
topicString.setParseAction(topic_string)
parenthetical.setParseAction(parenthetical)
#reads each line in the input file
#calls the grammar expressed in 'expr' and uses it to read the line and assign names to the tokens for later use
#calls the 'notid' varient to easily return the other elements in the line aside from the lineId
#applies the format tree function and joins the tokens in a comma-separated string
#prints the lineId + the tokens from that line
for line in input:
    final = delimitedList(expr).parseString(line)
    notid = delimitedList(notid).parseString(line)
    dash_tags = ', '.join(format_tree(notid))
    print final.lineId + ": " + dash_tags
The input file is a txt document with the following two lines:
1768 dummy; data
1768 dummy data; price
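Reassigning notid inside the loop breaks the second iteration. The third line of the loop rebinds the name notid from the grammar expression defined earlier in the code to the ParseResults returned by parseString, so on the second pass delimitedList(notid) receives parse results instead of an expression. It therefore only works on the first iteration. Use a different name for the result of parseString.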
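The same pitfall can be reproduced with the standard library alone. The sketch below is an analogy, not pyparsing code: a compiled re pattern stands in for the notid grammar, and the function names buggy and fixed are made up for illustration.

```python
import re

number = re.compile(r"\d+")  # stands in for the `notid` grammar


def buggy(lines):
    pattern = number
    found = []
    for line in lines:
        # Rebinding `pattern` replaces the compiled pattern with a Match
        # object, so the second iteration fails with AttributeError.
        pattern = pattern.search(line)
        found.append(pattern.group())
    return found


def fixed(lines):
    found = []
    for line in lines:
        # A separate name keeps `number` intact for every iteration.
        match = number.search(line)
        found.append(match.group())
    return found


lines = ["1768 dummy; data", "1768 dummy data; price"]
print(fixed(lines))
```

Calling buggy(lines) works for the first line and raises on the second, which mirrors the single-line versus multi-line behaviour described in the question.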
