Python - Excel: writing to multiple cells takes time

I'm using win32com.client to write data to an Excel file.
This takes too much time (the code below simulates the amount of data I want to write, and it takes about 2 seconds).
Is there a way to update multiple cells (with different values) in one call rather than filling them one by one, or maybe a different method that is more efficient?
I'm using Python 2.7 and Office 2010.
Here is the code:
from win32com.client import Dispatch
xlsApp = Dispatch('Excel.Application')
xlsApp.Workbooks.Add()
xlsApp.Visible = True
workSheet = xlsApp.Worksheets(1)
for i in range(300):
    for j in range(20):
        workSheet.Cells(i+1, j+1).Value = (i+10000)*j

A few suggestions:
ScreenUpdating off, manual calculation
Try the following:
xlsApp.ScreenUpdating = False
xlsApp.Calculation = -4135  # xlCalculationManual
try:
    worksheet = xlsApp.Worksheets(1)
    for i in range(300):
        for j in range(20):
            worksheet.Cells(i+1, j+1).Value = (i+10000)*j
finally:
    xlsApp.ScreenUpdating = True
    xlsApp.Calculation = -4105  # xlCalculationAutomatic
Assign several cells at once
Using VBA, you can set a range's value to an array. Setting several values at once might be faster:
' VBA code
ActiveSheet.Range("A1:D1").Value = Array(1, 2, 3, 4)
I have never tried this from Python, but I suggest you try something like:
worksheet.Range("A1:D1").Value = [1, 2, 3, 4]
A different approach
Consider using openpyxl or xlwt. openpyxl lets you create .xlsx files without having Excel installed; xlwt does the same for .xls files.
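For instance, a minimal openpyxl sketch (the output file name is a placeholder) that writes the same 300x20 block from the question without going through COM:

from openpyxl import Workbook

wb = Workbook()
ws = wb.active

# append() adds one row per call; everything stays in memory until save()
for i in range(300):
    ws.append([(i + 10000) * j for j in range(20)])

wb.save("output.xlsx")  # placeholder file name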

Using the Range suggestion from the other answer, I wrote this:
from win32com.client import Dispatch

def writeLineToExcel(wsh, line):
    # chr(len(line) + 96) gives the last column letter; works for up to 26 columns (A-Z)
    wsh.Range("A1:" + chr(len(line) + 96).upper() + "1").Value = line

xlApp = Dispatch("Excel.Application")
xlApp.Visible = 1
xlDoc = xlApp.Workbooks.Open("test.xlsx")
wsh = xlDoc.Sheets("Sheet1")
writeLineToExcel(wsh, [1, 2, 3, 4])
You may also write multiple lines at once:
def writeLinesToExcel(wsh, lines):  # assumes all lines have the same length
    # last column letter comes from the row length, last row number from the number of lines
    wsh.Range("A1:" + chr(len(lines[0]) + 96).upper() + str(len(lines))).Value = lines

writeLinesToExcel(wsh, [[1, 2, 3, 4],
                        [5, 6, 7, 8],
                        [9, 10, 11, 12],
                        [13, 14, 15, 16]])

Note that you can also build ranges from numeric addresses easily by using the following code:
cl1 = Sheet1.Cells(X1,Y1)
cl2 = Sheet1.Cells(X2,Y2)
Range = Sheet1.Range(cl1,cl2)
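Combining the numeric addressing with the bulk assignment above, the whole 300x20 block from the question can be written in a single call; a sketch, assuming the workSheet object from the question (with pywin32, a nested Python list assigned to Range(...).Value is passed to Excel as one 2-D array):

data = [[(i + 10000) * j for j in range(20)] for i in range(300)]
cl1 = workSheet.Cells(1, 1)
cl2 = workSheet.Cells(len(data), len(data[0]))
# one COM call instead of 6000 individual cell writes
workSheet.Range(cl1, cl2).Value = data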

Related

Appending data to a Google Sheet using Python

I have 3 different tables I'm looking to directly push to 3 separate tabs in a Google Sheet. I set up the GSpread connection and that's working well. I started to adjust my first print statement into what I thought would append the information to Tab A (waveData), but no luck.
I'm looking to append the information to the FIRST blank row in a tab. Basically, so that the data will be ADDED to what is already in there.
I'm trying to use append_rows to do this, but am hitting a "gspread.exceptions.APIError: {'code': 400, 'message': 'Invalid value at 'data.values' (type.googleapis.com/google.protobuf.ListValue).
I'm really new to this, just thought it would be a fun project to evaluate wave sizes in NJ across all major surf spots, but really in over my head (no pun intended).
Any thoughts?
import requests
import pandas as pd
import gspread
gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('152qSpr-4nK9V5uHOiYOWTWUx4ojjVNZMdSmFYov-n50')
waveData = sh.get_worksheet(0)
tideData = sh.get_worksheet(1)
lightData = sh.get_worksheet(2)
# AddValue = ["Test", 25, "Test2"]
# lightData.insert_row(AddValue, 3)
id_list = [
    '/Belmar-Surf-Report/3683/',
    '/Manasquan-Surf-Report/386/',
    '/Ocean-Grove-Surf-Report/7945/',
    '/Asbury-Park-Surf-Report/857/',
    '/Avon-Surf-Report/4050/',
    '/Bay-Head-Surf-Report/4951/',
    '/Belmar-Surf-Report/3683/',
    '/Boardwalk-Surf-Report/9183/',
]

for x in id_list:
    waveData.append_rows(pd.read_html(requests.get('http://magicseaweed.com' + x).text)
                         [2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]].to_json(), value_input_option="USER_ENTERED")
    # print(pd.read_html(requests.get('http://magicseaweed.com' + x).text)[0])
    # print(pd.read_html(requests.get('http://magicseaweed.com' + x).text)[1])
From your following reply,
there really is no relationship between the 3. When I scrape with IMPORTHTML into Google sheets, those are just Tables at the locations 0,1, and 2. I'm basically just trying to have an output of each table on a separate tab
I understood that you want to retrieve the values with pd.read_html(requests.get('http://magicseaweed.com' + x).text)[2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]] for each entry in id_list and put them into a sheet of the Google Spreadsheet.
In this case, how about the following modification? With append_rows, it seems that JSON data cannot be used directly; a 2-dimensional array is required. I'm also concerned about the NaN values in the dataframe. Reflecting these points in your script gives the following.
Modified script 1:
In this sample, all values are put into a sheet.
gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('152qSpr-4nK9V5uHOiYOWTWUx4ojjVNZMdSmFYov-n50')
waveData = sh.get_worksheet(0)
id_list = [
    "/Belmar-Surf-Report/3683/",
    "/Manasquan-Surf-Report/386/",
    "/Ocean-Grove-Surf-Report/7945/",
    "/Asbury-Park-Surf-Report/857/",
    "/Avon-Surf-Report/4050/",
    "/Bay-Head-Surf-Report/4951/",
    "/Belmar-Surf-Report/3683/",
    "/Boardwalk-Surf-Report/9183/",
]

# I modified the below script.
res = []
for x in id_list:
    df = pd.read_html(requests.get("http://magicseaweed.com" + x).text)[2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]].fillna("")
    values = [[x], df.columns.values.tolist(), *df.values.tolist()]
    res.extend(values)
    res.append([])
waveData.append_rows(res, value_input_option="USER_ENTERED")
When this script is run, the retrieved values are put into the 1st sheet. In this sample modification, the path and a blank row are inserted between each table. Please modify this for your actual situation.
Modified script 2:
In this sample, each value is put into each sheet.
gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('152qSpr-4nK9V5uHOiYOWTWUx4ojjVNZMdSmFYov-n50')
id_list = [
    "/Belmar-Surf-Report/3683/",
    "/Manasquan-Surf-Report/386/",
    "/Ocean-Grove-Surf-Report/7945/",
    "/Asbury-Park-Surf-Report/857/",
    "/Avon-Surf-Report/4050/",
    "/Bay-Head-Surf-Report/4951/",
    "/Belmar-Surf-Report/3683/",
    "/Boardwalk-Surf-Report/9183/",
]

obj = {e.title: e for e in sh.worksheets()}
for e in id_list:
    if e not in obj:
        obj[e] = sh.add_worksheet(title=e, rows="1000", cols="26")

for x in id_list:
    df = pd.read_html(requests.get("http://magicseaweed.com" + x).text)[2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]].fillna("")
    values = [df.columns.values.tolist(), *df.values.tolist()]
    obj[x].append_rows(values, value_input_option="USER_ENTERED")
When this script is run, sheets named after the values in id_list are created if they don't already exist, and each table is put into its own sheet.
Reference:
append_rows
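As a minimal illustration of the 2-dimensional-array requirement (the spreadsheet key and values below are placeholders):

import gspread

gc = gspread.service_account(filename="creds.json")
ws = gc.open_by_key("your-spreadsheet-key").get_worksheet(0)  # placeholder key

# append_rows expects a list of rows; each inner list is one row,
# appended after the last non-empty row of the sheet.
ws.append_rows(
    [["Spot", "Height"], ["Belmar", "3-4ft"]],
    value_input_option="USER_ENTERED",
)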

Reading R dataframes into Python in Jupyter

I want to read R objects back into Python in Jupyter. In the example below, a dataframe generated in Python is passed to R and processed there. Now I process this dataframe in R and create a new one that I want to read back into Python.
Python cell:
# enables the %%R magic, not necessary if you've already done this
%load_ext rpy2.ipython
import pandas as pd
df = pd.DataFrame({
    'cups_of_coffee': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'productivity': [2, 5, 6, 8, 9, 8, 0, 1, 0, -1]
})
R cell:
%%R -i df
# import df from global environment
df$time = 1
df_new = df
df_new
If I move to a new Python cell, I cannot read the new dataframe df_new; it is not recognized.
I tried this:
%Rget df_new
But I don't know how to assign it to a pandas dataframe or pass it to a Python function.
How can I switch back to a Python cell and read this new dataframe created in the R cell?
So, I randomly tried something myself and it worked; I couldn't find any good documentation.
So, one can just simply do:
df_python = %Rget df_new
This worked for me.
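Another option, if I'm not mistaken, is to export the variable directly from the R cell with the -o flag of the %%R magic, so no separate %Rget call is needed:

%%R -i df -o df_new
# import df from Python, export df_new back to Python when the cell finishes
df$time = 1
df_new = df

After this cell runs, df_new should be available in later Python cells; whether it arrives as a pandas dataframe or an rpy2 object depends on the rpy2 version and whether the pandas conversion is active.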

Adding a pandas.DataFrame to another one with its own name

I have data that I want to retrieve from a couple of text files in a folder. For each file in the folder, I create a pandas.DataFrame to store the data. For now it works correctly, and all the files have the same number of rows.
Now I want to add each of these dataframes to a 'master' dataframe containing all of them, each labelled with its file name.
I already have the file name.
For example, let's say I have 2 dataframes with their own file names; I want to add them to the master dataframe with a header for each of these 2 dataframes representing the name of the file.
What I have tried so far is the following:
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame()
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_name = getFileName(file)
    t0_data.insert(loc=len(t0_data.columns), column=file_name, value=file_data)
Could someone help me with this please?
Thank you :)
Edit:
I think I was not clear enough; the expected output is shown in a screenshot attached to the question.
You may be looking for the concat function. Here's an example:
import pandas as pd
A = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [4, 5, 6]})
B = pd.DataFrame({'Col1': [7, 8, 9], 'Col2': [10, 11, 12]})
a_filename = 'a_filename.txt'
b_filename = 'b_filename.txt'
A['filename'] = a_filename
B['filename'] = b_filename
C = pd.concat((A, B), ignore_index = True)
print(C)
Output:
   Col1  Col2        filename
0     1     4  a_filename.txt
1     2     5  a_filename.txt
2     3     6  a_filename.txt
3     7    10  b_filename.txt
4     8    11  b_filename.txt
5     9    12  b_filename.txt
There are a couple of changes to make here in order to make this happen in an easy way. I'll list the changes and reasoning below:
Specify which columns your master DataFrame will have.
Instead of using some function that it seems like you were trying to define, you can simply create a new column called "file_name" that holds the file path used to build the DataFrame, for every record in that DataFrame. That way, when you combine the DataFrames, each record's origin is clear. I left a comment in the code at the spot where you can use string methods to clean up the filenames.
At the end, don't use insert. For combining DataFrames with the same columns (a union operation, if you're familiar with SQL or set theory), you can use the append method.
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame(columns=['wavelength', 'max', 'min','file_name'])
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file  # You can make edits here
    t0_data = t0_data.append(file_data, ignore_index=True)
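Note that DataFrame.append was deprecated and removed in pandas 2.0. On newer versions the same idea can be written by collecting the per-file frames and concatenating once at the end; a sketch using the poster's own helpers (parseGFfile, t0_folder):

frames = []
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file  # You can make edits here
    frames.append(file_data)

# A single concat at the end is also faster than growing the DataFrame row by row.
t0_data = pd.concat(frames, ignore_index=True)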

Python: How should I structure data scraping from an xlsx file?

Currently I am scraping some data from an xlsx file. My code works, but it looks like a mess, at least to me.
So I am unsure whether my code is OK according to PEP 8.
from openpyxl import load_workbook
[...]
for row in sheet.iter_rows():
    id = row[0].value
    name = row[1].value
    second_name = row[2].value
    # ignore the following
    # middle_name = row[3].value
    city = row[4].value
    address = row[5].value
    field_x = row[7].value
    field_y = row[10].value
    some_function_to_save_to_database(id, name, second_name, ...)
And so on. (Please note that for some of those values I do extra validation, etc.)
So it works, but it feels a bit "clunky". Obviously I could pass the values directly to the function, making it some_function_to_save_to_database(row[0].value, row[1].value, ...), but is that any better? It feels like I lose a lot of readability that way.
So my question is: is this a good approach, or should I map the field names to their positions in the row? What is the proper way to style this kind of scraping?
Your code does not violate PEP 8. However, it's a little cumbersome, and it's not easy to maintain if the data changes. Maybe you can try:
DATA_INDEX_MAP = {
    'id': 0,
    'name': 1,
    'second_name': 2,
    'city': 4,
    'address': 5,
    'field_x': 7,
    'field_y': 10,
}

def get_data_from_row(row):
    return {key: row[DATA_INDEX_MAP[key]].value for key in DATA_INDEX_MAP}

for row in sheet.iter_rows():
    data = get_data_from_row(row)
    some_function_to_save_to_database(**data)
Then all you need to do when the layout changes is modify DATA_INDEX_MAP.
A lighter alternative to the dict in LiuChang's answer:
from operator import itemgetter
get_data = itemgetter(0, 1, 2, 4, 5, 7, 10)
for row in sheet.iter_rows():
    data = [x.value for x in get_data(row)]
    some_function_to_save_to_database(*data)

Python - Error while using numpy genfromtxt to import CSV data with multiple data types

I'm working on a Kaggle competition to predict restaurant revenue based on multiple predictors. I'm a beginner Python user; I would normally use RapidMiner for data analysis. I am using Python 3.4 in the Spyder 2.3 dev environment.
I am using the below code to import the training csv file.
from sklearn import linear_model
from numpy import genfromtxt, savetxt
def main():
    # create the training & test sets, skipping the header row with [1:]
    dataset = genfromtxt(open('data/train.csv', 'rb'), delimiter=",", dtype=None)[1:]
    train = [x[1:41] for x in dataset]
    test = genfromtxt(open('data/test.csv', 'rb'), delimiter=",")[1:]
This is the error I get:
dataset = genfromtxt(open('data/train.csv','rb'), delimiter=",", dtype= None)[1:]
IndexError: too many indices for array
Then I checked the imported data types using print(dataset.dtype).
I noticed that a datatype had been individually assigned for every value in the csv file. Moreover, the code wouldn't work with [1:] at the end; it gave me the same "too many indices" error. And if I removed [1:] and defined the input with the skip_header=1 option, I got the error below:
output = np.array(data, dtype=ddtype)
TypeError: Empty data-type
It seems to me like the entire data set is being read as a single row with over 5000 columns.
The data set consists of 43 columns and 138 rows.
I'm stuck at this point, I would appreciate any help with how I can proceed.
I'm posting the raw csv data below (a sample):
Id,Open Date,City,City Group,Type,P1,P2,P3,P4,P5,P6,P7,P8,P9,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30,P31,P32,P33,P34,P35,P36,P37,revenue
0,7/17/99,Ä°stanbul,Big Cities,IL,4,5,4,4,2,2,5,4,5,5,3,5,5,1,2,2,2,4,5,4,1,3,3,1,1,1,4,2,3,5,3,4,5,5,4,3,4,5653753
1,2/14/08,Ankara,Big Cities,FC,4,5,4,4,1,2,5,5,5,5,1,5,5,0,0,0,0,0,3,2,1,3,2,0,0,0,0,3,3,0,0,0,0,0,0,0,0,6923131
2,3/9/13,DiyarbakÄr,Other,IL,2,4,2,5,2,3,5,5,5,5,2,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,2055379
3,2/2/12,Tokat,Other,IL,6,4.5,6,6,4,4,10,8,10,10,8,10,7.5,6,4,9,3,12,20,12,6,1,10,2,2,2.5,2.5,2.5,7.5,25,12,10,6,18,12,12,6,2675511
4,5/9/09,Gaziantep,Other,IL,3,4,3,4,2,2,5,5,5,5,2,5,5,2,1,2,1,4,2,2,1,2,1,2,3,3,5,1,3,5,1,3,2,3,4,3,3,4316715
5,2/12/10,Ankara,Big Cities,FC,6,6,4.5,7.5,8,10,10,8,8,8,10,8,6,0,0,0,0,0,5,6,3,1,5,0,0,0,0,7.5,5,0,0,0,0,0,0,0,0,5017319
6,10/11/10,Ä°stanbul,Big Cities,IL,2,3,4,4,1,5,5,5,5,5,2,5,5,3,4,4,3,4,2,4,1,2,1,5,4,4,5,1,3,4,5,2,2,3,5,4,4,5166635
7,6/21/11,Ä°stanbul,Big Cities,IL,4,5,4,5,2,3,5,4,4,4,4,3,4,0,0,0,0,0,3,5,2,4,2,0,0,0,0,3,2,0,0,0,0,0,0,0,0,4491607
8,8/28/10,Afyonkarahisar,Other,IL,1,1,4,4,1,2,1,5,5,5,1,5,5,1,1,2,1,4,1,1,1,1,1,4,4,4,2,2,3,4,5,5,3,4,5,4,5,4952497
9,11/16/11,Edirne,Other,IL,6,4.5,6,7.5,6,4,10,10,10,10,2,10,7.5,0,0,0,0,0,25,3,3,1,10,0,0,0,0,5,2.5,0,0,0,0,0,0,0,0,5444227
I think the characters (e.g. Ä°) are causing the problem in genfromtxt. I found that the following reads in the data you posted here:
dtypes = "i8,S12,S12,S12,S12" + ",i8"*38
test = genfromtxt(open('data/test.csv','rb'), delimiter="," , names = True, dtype=dtypes)
You can then access the elements by name,
In [16]: test['P8']
Out[16]: array([ 4, 5, 5, 8, 5, 8, 5, 4, 5, 10])
The values for the city column,
test['City']
returns,
array(['\xc3\x84\xc2\xb0stanbul', 'Ankara', 'Diyarbak\xc3\x84r', 'Tokat',
'Gaziantep', 'Ankara', '\xc3\x84\xc2\xb0stanbul',
'\xc3\x84\xc2\xb0stanbul', 'Afyonkarahis', 'Edirne'],
dtype='|S12')
In principle, you could try to convert these to unicode in your python script with something like,
In [17]: unicode(test['City'][0], 'utf8')
Out[17]: u'\xc4\xb0stanbul'
Where \xc4\xb0 is UTF-8 hexadecimal encoding for İ. To avoid this, you could also try to clean up the csv input files.
[Solved]
I just chucked numpy's genfromtxt and opted to use read_csv from pandas instead, since it gives the option to import text in 'utf-8' encoding.
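A minimal sketch of that approach (the column selection mirrors the [1:41] slice from the original script):

import pandas as pd

# pandas infers a dtype per column and handles the non-ASCII city names
train = pd.read_csv('data/train.csv', encoding='utf-8')

X = train.iloc[:, 1:41]   # predictors, skipping the Id column
y = train['revenue']      # target column from the csv header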
