Reading and writing column data in Python with Pandas

This endeavour is a variation on the wonderful Mac Model Shelf. So far I have managed to write code myself that can read a single Mac serial number at the command line and give back the corresponding model type, based on the last 3 or 4 characters of the serial.
Right now I am trying to write a script to read in the column data from an Excel file and return the result for each cell in the neighbouring column.
The output Excel file would hopefully look something like this (with headers)...
Serial Model
C12PT70EG8WP Macbook Pro 2015 15" 2.5 Ghz i7
K12PT7EG0PW iMac 2010 Intel Core Duo 1.6 Ghz
This is all based on an Excel file that supplies its data to a Python shelve. Here is a small example of how it reads... I've called it 'pgList.xlsx' in the main code. In reality it will be hundreds of lines long.
G8WP Macbook Pro 2015 15" 2.5 Ghz i7
0PW iMac 2010 Intel Core Duo 1.6 Ghz
3RT iPad Pro 2017
Main Python 3 code...
import shelve
import pandas as pd

# getting the shelve/database ready from the library excel file
DBPATH = "/Users/me/PycharmProjects/shelve/macmodelshelfNEW"
databaseOfMacs = shelve.open(DBPATH)
excelDict = pd.read_excel('pgList.xlsx', header=None, index_col=0, squeeze=True).to_dict()
databaseOfMacs.update(excelDict)

# loading up the excel file and serial numbers I want to examine...
df = pd.read_excel('testSerials.xlsx', sheet_name='Sheet1')
listSerials = df['Serial']
listModels = df['Model']

for i in listSerials:
    inputSerial = i
    inputSerial = inputSerial.upper()
    modelCodeIsolatedFromSerial = ""
    if len(inputSerial) == 12:
        modelCodeIsolatedFromSerial = inputSerial[-4:]
    elif len(inputSerial) == 11:
        modelCodeIsolatedFromSerial = inputSerial[-3:]
    try:
        model = databaseOfMacs[modelCodeIsolatedFromSerial]
        # printing to console to check the code works
        print(model)
    except KeyError:
        print("Result not found")

databaseOfMacs.clear()
databaseOfMacs.close()
Could you help me out with writing the results back to the same Excel file? For example, if the serial number is in cell A2, the result (the model type) would be written to B2?
I have tried including this line of code before the main 'for' loop, but it only ever serves to wipe the Excel file empty after running the script! I just comment it out for the moment.
writer = pd.ExcelWriter('testSerials.xlsx', engine='xlsxwriter')
Could you also help me handle any potential blank cells in the serials column?
A blank cell will throw this error:
AttributeError: 'float' object has no attribute 'upper'
Thanks again for looking after me!
WL
UPDATE
The comments I have received up to now have really helped. I think the part where I am getting stuck is getting the output of the 'for' loop ('model' in this case) into the 'Model' column. The variable 'listModels' doesn't seem to behave like other lists in Python 3, i.e. I cannot append anything to it.
UPDATE 2
Some more tinkering, trying to get the result of the serial-number lookup of the values in the "Serial" column into the "Model" column.
I have tried (without any real success)
try:
    model = databaseOfMacs[modelCodeIsolatedFromSerial]
    print(model)
    listModels.replace(['nan'], [model], inplace=True)
This doesn't give me an error message, but still nothing appears in the outputted Excel file.
When I run a for loop to print the contents of 'listModels' I just get back a list of "NaN"s, suggesting nothing at all has been changed... bummer!
I've also tried
try:
    model = databaseOfMacs[modelCodeIsolatedFromSerial]
    print(model)
    listModels[i] = model
This will spit back a console error about
A value is trying to be set on a copy of a slice from a DataFrame
but at least I can see the model name relating to a serial number in the console when I iterate through 'listModels', still nothing in the output Excel file though (along with a 'nan' for every serial number that is examined?)
I am sure it's something small that I am missing in the code to fix this problem. Thanks again to anybody who can help me out.
UPDATE 3
I've solved it on my own. Just had to use a while loop instead.
sizeOfSerialsList = len(listSerials)
count = 0
while count < sizeOfSerialsList:
    inputSerial = listSerials.iloc[count]
    inputSerial = str(inputSerial).upper()
    modelCodeIsolatedFromSerial = ""
    model = ""
    if len(inputSerial) == 12:
        modelCodeIsolatedFromSerial = inputSerial[-4:]
    elif len(inputSerial) == 11:
        modelCodeIsolatedFromSerial = inputSerial[-3:]
    try:
        model = databaseOfMacs[modelCodeIsolatedFromSerial]
        listModels.iloc[count] = model
    except KeyError:
        listModels.iloc[count] = "Not found"
    count = count + 1

From the XlsxWriter docs, you'll need to call df.to_excel(writer) followed by writer.save().
To avoid that AttributeError, one fix (maybe not the most python-3-esque?) is to change inputSerial = inputSerial.upper() to inputSerial = str(inputSerial).upper().
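Putting the two together, a minimal sketch of the write-back step (assuming df and listModels from the question, with listModels already filled in by the loop; writer.save() matches the pandas/XlsxWriter versions used here):

import pandas as pd

df['Model'] = listModels  # put the looked-up models back into the DataFrame
writer = pd.ExcelWriter('testSerials.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
writer.save()  # without this call the file created by ExcelWriter stays empty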

See Update 3 for code that solved the issue

Related

Timestamp repeated values in Python

I'm trying to record the output of an ADC in Python with its corresponding timestamp, but when reading the generated files (it generates a new file every minute), I sometimes get repeated timestamp values in the last records of one file and at the beginning of the next one.
I can't find why this is happening, because I get a new timestamp every time I go through the loop that reads the ADC values. Does anybody have a solution or a workaround?
Thanks in advance.
P.S.: A simplified version of the code is here:
import os
import threading
import time

file_flag = 0  # shared flag between the two threads

def write_file():
    global file_flag
    while True:
        nombre_archivo = str(int(time.time())) + ".txt"
        f = open(nombre_archivo, "a")
        print("nuevo_archivo:" + nombre_archivo)
        while file_flag == 0:
            adc_value = getadcvalues()  # ADC helper, not shown in the question
            timestamp = time.time()
            x = adc_value[1]  # index the returned values, not the function
            y = adc_value[2]
            z = adc_value[3]
            f.write(str(timestamp) + ',' + str(x) + ',' + str(y) + ',' + str(z) + '\n')
        f.close()
        os.rename(nombre_archivo, nombre_archivo[:-3] + 'finish')
        file_flag = 0

def cronometro():
    global file_flag
    while True:
        contador = 60
        inicio = time.time()
        diferencia = 0
        while diferencia <= contador:
            diferencia = time.time() - inicio
            time.sleep(0.5)
        file_flag = 1

escritor = threading.Thread(target=write_file)  # was escribir_archivo; renamed to match the definition above
temporizador = threading.Thread(target=cronometro)
escritor.start()
temporizador.start()

How do you compile variables and still use them individually?

TL;DR - trying to clean this up but unsure of the best practice for compiling a list of variables while still separating them onto individual lines in the .txt file they're being copied to.
This is my first post here.
I've recently created a script to automate an extremely tedious process at work that involves modifying an Excel document, copying outputs from specific cells depending on the type of configuration we are generating, and pasting into 3 separate .txt files to send out via email.
I've got the script functioning, but I hate how my code looks and to be honest, it is quite the pain to try to make additions to.
I'm using openpyxl & pycel for this, as the cells I copy are outputs from formulas; I couldn't seem to get anything except #N/A when strictly using openpyxl, so I integrated pycel for that piece.
I've referenced my code below, & I appreciate any input.
F62 = format(excel.evaluate('Config!F62'))
F63 = format(excel.evaluate('Config!F63'))
F64 = format(excel.evaluate('Config!F64'))
F65 = format(excel.evaluate('Config!F65'))
F66 = format(excel.evaluate('Config!F66'))
F67 = format(excel.evaluate('Config!F67'))
F68 = format(excel.evaluate('Config!F68'))
F69 = format(excel.evaluate('Config!F69'))
F70 = format(excel.evaluate('Config!F70'))
F71 = format(excel.evaluate('Config!F71'))
F72 = format(excel.evaluate('Config!F72'))
F73 = format(excel.evaluate('Config!F73'))
F74 = format(excel.evaluate('Config!F74'))
F75 = format(excel.evaluate('Config!F75'))
F76 = format(excel.evaluate('Config!F76'))
F77 = format(excel.evaluate('Config!F77'))
#so on and so forth to put into:
with open(f'./GRAK-R-{KIT}/3_GRAK-R-{KIT}_FULL.txt', 'r') as basedone:
    linetest = f"{F62}\n{F63}\n{F64}\n{F65}\n{F66}\n{F67}\n{F68}\n{F69}\n{F70}\n{F71}\n{F72}\n{F73}\n{F74}\n{F75}\n{F76}\n{F77}\n{F78}\n{F79}\n{F80}\n{F81}\n{F82}\n{F83}\n{F84}\n{F85}\n{F86}\n{F87}\n{F88}\n{F89}\n{F90}\n{F91}\n{F92}\n{F93}\n{F94}\n{F95}\n{F96}\n{F97}\n{F98}\n{F99}\n{F100}\n{F101}\n{F102}\n{F103}\n{F104}\n{F105}\n{F106}\n{F107}\n{F108}\n{F109}\n{F110}\n{F111}\n{F112}\n{F113}\n{F114}\n{F115}\n{F116}\n{F117}\n{F118}\n{F119}\n{F120}\n{F121}\n{F122}\n{F123}\n{F124}\n{F125}\n{F126}\n{F127}\n{F128}\n{F129}\n{F130}\n{F131}\n{F132}\n{F133}\n{F134}\n{F135}\n{F136}\n{F137}\n{F138}\n{F139}\n{F140}\n{F141}\n{F142}\n{F143}\n{F144}\n{F145}\n{F146}\n{F147}\n{F148}\n{F149}\n{F150}\n{F151}\n{F152}\n{F153}\n{F154}\n{F155}\n{F156}\n{F157}\n{F158}\n{F159}\n{F160}\n{F161}\n{F162}\n{F163}\n{F164}\n{F165}\n{F166}\n{F167}\n{F168}\n{F169}\n{F170}\n{F171}\n{F172}\n{F173}\n{F174}\n{F175}\n{F176}\n{F177}\n{F178}\n{F179}\n{F180}\n{F181}\n{F182}\n{F183}\n{F184}\n{F185}\n{F186}\n{F187}\n{F188}\n{F189}\n{F190}\n{F191}\n{F192}\n{F193}\n{F194}\n{F195}\n{F196}\n{F197}\n{F198}\n{F199}\n{F200}\n{F201}\n{F202}\n{F203}\n{F204}\n{F205}\n{F206}\n{F207}\n{F208}\n{F209}\n{F210}\n{F211}\n{F212}\n{F213}\n{F214}\n{F215}\n{F216}\n{F217}\n{F218}\n{F219}\n{F220}\n{F221}\n{F222}\n{F223}\n{F224}\n{F225}\n{F226}\n{F227}\n{F228}\n{F229}\n{F230}\n{F231}\n{F232}\n{F233}\n{F234}\n{F235}\n{F236}\n{F237}\n{F238}\n{F239}\n{F240}\n{F241}\n{F242}\n{F243}\n{F244}\n{F245}\n{F246}\n{F247}\n{F248}\n{F249}\n{F250}\n{F251}\n{F252}\n{F253}\n{F254}\n{F255}\n{F256}\n{F257}\n{F258}\n{F259}\n{F260}\n{F261}\n{F262}\n{F263}\n{F264}\n{F265}\n{F266}\n{F267}\n{F268}\n{F269}\n{F270}\n{F271}\n{F272}\n{F273}\n{F274}\n"
    oline = basedone.readlines()
    oline.insert(9, linetest)
with open(f'./GRAK-R-{KIT}/3_GRAK-R-{KIT}_FULL.txt', 'w') as basedone:
    basedone.writelines(oline)
I don't think you need to name every single variable. You can use f-strings and list comprehensions to keep your code flexible.
min_cell = 62
max_cell = 77
column_name = 'F'
sheet_name = 'Config'
cell_names = [f'{sheet_name}!{column_name}{i}' for i in range(min_cell, max_cell + 1)]
vals = [format(excel.evaluate(cn)) for cn in cell_names]
linetest = '\n'.join(vals)
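The file-splicing step from the question then stays almost the same (a sketch, assuming KIT is defined as in the original script):

path = f'./GRAK-R-{KIT}/3_GRAK-R-{KIT}_FULL.txt'
with open(path, 'r') as basedone:
    oline = basedone.readlines()
oline.insert(9, linetest + '\n')  # linetest from the list comprehension above
with open(path, 'w') as basedone:
    basedone.writelines(oline)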

Add the value of two different specific csv columns

So I currently get a .csv file that looks like this:
HostType,Number
Windows_Desktop,84
Linux_Desktop,12
Windows_Desktop,60
Linux_Desktop,7
I am trying to write a script that performs a function based on the total value. So I have two global variables:
WINDOWS = 0
LINUX = 0
I am trying to make it so that the function adds the two Window_Desktop numbers together, and the Linux_Desktop numbers together. So something like...
def count_function():
    global WINDOWS
    global LINUX
    count_file = open('counts.csv', 'rb')
    reader = csv.reader(count_file)
    WINDOWS = float(row[2]) + float(otherrow[2])
    LINUX = float(row[2]) + float(otherrow[2])
(I know this is very wrong syntax, just a brief example of what I am trying to figure out)
But I don't know how to specify the column and row I want to add together. They are always in the same place: Windows is always in rows 2 and 4, Linux is always in rows 3 and 5. So I don't need to regex them by any means. I am just trying to figure out how to do Row 2 Column 2 + Row 4 Column 2.
Basically, I am ultimately trying to do something like:
if WINDOWS < 80:
    some_function()
Although I have that part figured out, it's getting the numbers to add up that I can't seem to figure out, despite how many times I bash my head.
You need to identify the type of thing you are collecting by analyzing the contents of the first column. Since you are collecting Windows and Linux totals, you can use a dictionary to collect these data.
Try this version:
import csv
from collections import defaultdict

data = defaultdict(float)  # the default value of a key that doesn't exist is float() == 0.0

with open('yourfile.csv') as f:
    reader = csv.reader(f)
    next(f)  # this will skip the header
    for row in reader:
        data[row[0].split('_')[0].strip()] += float(row[1])

if data['Windows'] < 80:
    print('Do stuff')

for key, value in data.iteritems():
    print('Value for {} is {}'.format(key, value))
I would highly recommend using the Pandas package. It is very useful for working with csv files.
import pandas as pd
df = pd.read_csv("/Users/daddy30000/Dropbox/Stackoverflow/16_06_20_example.csv")
windows = df[df['HostType'] == 'Windows_Desktop'].sum()[1]
linux = df[df['HostType'] == 'Linux_Desktop'].sum()[1]
print windows
>>> 144
print linux
>>> 19
Note that I am assuming all your Windows rows have the same spelling, 'Windows_Desktop'. You use two different spellings in the example.
One way you can do it is like so:
with open("/tmp/foo.txt", 'r') as input_file:
    counts = {}
    next(input_file)  # skip the header row so int() doesn't choke on 'Number'
    for line in input_file:
        split_line = line.split(",")
        device = split_line[0]
        counts[device] = int(split_line[1]) + (counts.get(device) or 0)
    print counts  ## prints {'Windows_Desktop': 144, 'Linux_Desktop': 19}
There are many ways, but this one doesn't require imports or downloading anything new to Python
For such a small dataset, I'd read the whole thing into memory and use indices (slightly different from yours) to directly access the proper rows and columns. I also see no need for global variables (or for using float instead of int):
import csv

def count_desktops(filename):
    with open(filename, 'rb') as count_file:
        data = list(csv.reader(count_file))
    windows = float(data[1][1]) + float(data[3][1])
    linux = float(data[2][1]) + float(data[4][1])
    return windows, linux

windows, linux = count_desktops('counts.csv')
if windows < 80:
    some_function()

Efficiently Find Partial String Match --> Values Starting From List of Values in 5 GB file with Python

I have a 5GB file of businesses and I'm trying to extract all the businesses whose business type codes (SNACODE) start with the SNACODE corresponding to grocery stores. For example, SNACODEs for some businesses could be 42443013, 44511003, 44419041, 44512001, 44522004 and I want all businesses whose codes start with one of my grocery SNACODEs, codes = [4451,4452,447,772,45299,45291,45212]. In this case, I'd get the rows for 44511003, 44512001, and 44522004.
Based on what I googled, the most efficient way to read in the file seemed to be one row at a time (if not the SQL route). I then used a for loop and checked whether my SNACODE column started with any of my codes (which probably was a bad idea, but it was the only way I could get working).
I have no idea how many rows are in the file, but there are 84 columns. My computer was running for so long that I asked a friend who said it should only take 10-20 min to complete this task. My friend edited the code but I think he misunderstood what I was trying to do because his result returns nothing.
I am now trying to find a more efficient method than re-doing my 9.5 hours and having my laptop run for an unknown amount of time. The closest thing I've been able to find is most efficient way to find partial string matches in large file of strings (python), but it doesn't seem like what I was looking for.
Questions:
What's the best way to do this? How long should this take?
Is there any way that I can start where I stopped? (I have no idea how many rows of my 5gb file I read, but I have the last saved line of data--is there a fast/easy way to find the line corresponding to a unique ID in the file without having to read each line?)
This is what I tried -- in 9.5 hours it outputted a 72MB file (200k+ rows) of grocery stores
codes = [4451,4452,447,772,45299,45291,45212]  # codes for grocery stores
for df in pd.read_csv('infogroup_bus_2010.csv', sep=',', chunksize=1):
    data = np.asarray(df)
    data = pd.DataFrame(data, columns=headers)
    for code in codes:
        if np.char.startswith(str(data["SNACODE"][0]), str(code)):
            with open("grocery.csv", "a") as myfile:
                data.to_csv(myfile, header=False)
            print code
            break  # break code for loop if match
grocery.to_csv("grocery.csv", sep='\t')
This is what my friend edited it to. I'm pretty sure the x = df[df.SNACODE.isin(codes)] is only matching perfect matches, and thus returning nothing.
codes = [4451,4452,447,772,45299,45291,45212]
matched = []
for df in pd.read_csv('infogroup_bus_2010.csv', sep=',', chunksize=1024*1024, dtype=str, low_memory=False):
    x = df[df.SNACODE.isin(codes)]
    if len(x):
        matched.append(x)
    print "Processed chunk and found {} matches".format(len(x))
output = pd.concat(matched, axis=0)
output.to_csv("grocery.csv", index=False)
Thanks!
To increase speed you could pre-build a single regexp matching the lines you need, then read the raw file lines (no csv parsing) and check them with the regexp...
import re

codes = [4451,4452,447,772,45299,45291,45212]
col_num = 4  # number of columns before SNACODE
expr = re.compile("[^,]*," * col_num +
                  "(?:" + "|".join(map(str, codes)) + ")" +  # group so the column prefix applies to every code
                  ".*")
for L in open('infogroup_bus_2010.csv'):
    if expr.match(L):
        print L
Note that this is just a simple sketch as no escaping is considered... if the SNACODE column is not the first one and preceding fields may contain a comma you need a more sophisticated regexp like:
...
'([^"][^,]*,|"([^"]|"")*",)' * col_num +
...
that ignores commas inside double-quotes
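Assembled, that quote-aware variant would look something like this (still just a sketch, assuming col_num and codes as defined above):

expr = re.compile('([^"][^,]*,|"([^"]|"")*",)' * col_num +
                  "(?:" + "|".join(map(str, codes)) + ")" +
                  ".*")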
You can probably make your pandas solution much faster:
import pandas as pd

codes = [4451, 4452, 447, 772, 45299, 45291, 45212]
codes = [str(code) for code in codes]

sna = pd.read_csv('infogroup_bus_2010.csv', usecols=['SNACODE'],
                  chunksize=int(1e6), dtype={'SNACODE': str})
with open('grocery.csv', 'w') as fout:
    for chunk in sna:
        for code in chunk['SNACODE']:
            for target_code in codes:
                if code.startswith(target_code):
                    fout.write('{}\n'.format(code))
Read only the needed column with usecols=['SNACODE']. You can adjust the chunk size with chunksize=int(1e6). Depending on your RAM you can likely make it much bigger.
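One further simplification (not from the original answer, just standard str behaviour): startswith accepts a tuple of prefixes, so the inner loop over codes collapses into a single test.

targets = tuple(codes)  # str.startswith accepts a tuple of prefixes
for chunk in sna:
    for code in chunk['SNACODE']:
        if code.startswith(targets):
            fout.write('{}\n'.format(code))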

Erasing only a part of an excel with win32com and python

I have 3 parameters.
startLine, startColumn and width (here 2, 8, 3)
How can I erase the selected area without writing blanks in each cell?
(here there are only 30 lines, but there could potentially be 10,000 lines)
Right now I'm successfully counting the number of lines, but I can't manage to find out how to select and delete an area.
self.startLine = 2
self.startColumn = 8
self.width = 8
self.xl = client.Dispatch("Excel.Application")
self.xl.Visible = 1
self.xl.ScreenUpdating = False
self.worksheet = self.xl.Workbooks.Open(r"c:\test.xls")  # raw string so \t isn't read as a tab
sheet = self.xl.Sheets("data")
# Count the number of lines in the record
nb = 0
while sheet.Cells(self.startLine + nb, self.startColumn).Value is not None:
    nb += 1
# must select from startLine,startColumn to startColumn+width,nb
# and then erase
self.worksheet.Save()
P.S.: the code works; I may have forgotten some parts due to copy/paste errors. In reality the handling of the Excel file is managed by several classes inheriting from each other.
thanks
What I usually do is record a macro in Excel and then try to re-hack the VBA in Python. For deleting content I got something like this, which should not be hard to convert to Python:
Range("H5:J26").Select
Selection.ClearContents
In Python it should be something like:
self.xl.Range("H5:J26").Select()
self.xl.Selection.ClearContents()
Working example:
from win32com.client.gencache import EnsureDispatch
exc = EnsureDispatch("Excel.Application")
exc.Visible = 1
exc.Workbooks.Open(r"f:\Python\Examples\test.xls")
exc.Sheets("data").Select()
exc.Range("H5:J26").Select()
exc.Selection.ClearContents()
This worked for me:
xl = EnsureDispatch('Excel.Application')
wb2=xl.Workbooks.Open(file)
ws=wb2.Worksheets("data")
ws.Range("A12:B20").ClearContents()
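To clear the area from the question's own parameters instead of a hard-coded range, Range can also take two Cells objects (a sketch, assuming sheet, self.startLine, self.startColumn, self.width and the counted nb from the question):

start_cell = sheet.Cells(self.startLine, self.startColumn)
end_cell = sheet.Cells(self.startLine + nb - 1, self.startColumn + self.width - 1)
sheet.Range(start_cell, end_cell).ClearContents()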
