I've got a very hard task to do. I need to process an Excel file with 6336 rows x 53 columns. My task is to create a program which:
Read data from the input Excel file.
Sort all rows by a specific column, e.g. sort by A1:A(last).
Place the columns in a new output Excel file in a given order, e.g.:

SaleCity     old file Col[A1:A(last)]      (for e.g. SaleCity='Oklahoma')
Branch       (for e.g. Branch='OKL GamesShop')
CustomerID   old file Col[M1:M(last)]
InvoiceNum   merge old file cols Col[K1:K(last)] & Col[B1:B(last)]

Save the new Excel file.
Excel sample: (screenshot of the sample data omitted)
(All data in this post is not real so don't try to hack someone or something :D)
I know that I did not provide any code, but to be honest I tried solving it by myself and I don't even know which module I should use. I tried using openpyxl and pandas, but there's too much data for my current skills.
Thank you in advance for any help. If I asked the question in the wrong place, please direct me to the right one.
Edit:
To be clear: I'm not asking for a full solution here. What I am asking for is guidance and mentoring.
I would recommend you use PySpark. It is more difficult than pandas, but the parallelization it provides will help with your large Excel files.
Or you could also use Python's multiprocessing library to parallelize pandas functions.
https://towardsdatascience.com/make-your-own-super-pandas-using-multiproc-1c04f41944a1
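If you want to see the shape of the plain pandas version first, here is a minimal sketch of the read/sort/reorder/merge/save pipeline from the question; the file names and column headers are assumptions based on the example table above:

    import pandas as pd

    # Read the input workbook (assumes the data is on the first sheet
    # and has a header row).
    df = pd.read_excel("input.xlsx")

    # Sort all rows by one column, e.g. the one headed "SaleCity".
    df = df.sort_values("SaleCity")

    # Build the output in the desired column order. "InvoiceNum" merges
    # two old columns, as in the question; "ColK"/"ColB" are hypothetical
    # header names standing in for columns K and B of the old file.
    out = pd.DataFrame({
        "SaleCity": df["SaleCity"],
        "Branch": df["Branch"],
        "CustomerID": df["CustomerID"],
        "InvoiceNum": df["ColK"].astype(str) + df["ColB"].astype(str),
    })

    out.to_excel("output.xlsx", index=False)

Note that sort_values returns a new, sorted frame, and index=False keeps pandas from writing its row index as an extra column.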
Related
I have a particular spreadsheet which has point of sale data exported from a SQL database.
I'm trying to migrate to a new point of sale system, and so I need to copy this data that I exported into a CSV file into another CSV file which has a different format, for example different columns that I have to rearrange the original data into.
I'm trying to do this using Python, but I'm failing to find a way to automate this task.
Does anyone have any ideas or any videos on a similar project?
Pandas seems like the Python tool for you.
Open up the first CSV file with Pandas as a DataFrame, apply any modifications you want, and save as a new CSV file. There is A LOT of documentation and support for Pandas, so I'm sure you can find tutorials on how to do any kind of data reshaping that you want.
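As a rough sketch of that flow (the file and column names here are made up for illustration):

    import pandas as pd

    # Load the exported point-of-sale data.
    df = pd.read_csv("old_pos_export.csv")

    # Rename columns to the new system's headers and put them in the
    # order the new format expects; these names are hypothetical.
    df = df.rename(columns={"cust_id": "CustomerID", "sale_amt": "Amount"})
    df = df[["CustomerID", "Amount"]]

    df.to_csv("new_pos_import.csv", index=False)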
I asked our data support team to share data for 12 months; they sent me 12 different files with 3 sheets in each file. I need to combine all of that data into another datasheet, and I have the following questions:
Would Excel be able to cram all the data into one large file? What are its limitations?
Is R a good solution? Can anyone share easy code and the libraries needed for such an operation? I've seen multiple videos on YouTube, but none of them work.
I've heard that pandas in Python is helpful, but my past experience was that Python is very slow.
I have no idea about VBA code.
Could anyone please help?
Maybe my answer can help you:
Excel has a limit of 1,048,576 rows per sheet.
You just need a package to import the Excel files (readxl, ...). You can use a for loop to import all the files, merge them into one data frame, and export it to Excel.
In VBA, I think, the logic is the same as in R.
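Since most of this thread is Python-centric, here is the same loop-and-merge idea sketched with pandas instead of readxl; the folder name and file pattern are assumptions, and the sheets are assumed to share the same columns:

    import glob
    import pandas as pd

    frames = []
    # Loop over the 12 monthly workbooks.
    for path in glob.glob("monthly_data/*.xlsx"):
        # sheet_name=None reads every sheet into a dict of DataFrames.
        sheets = pd.read_excel(path, sheet_name=None)
        frames.extend(sheets.values())

    # Stack all 36 sheets into one data frame and export it. Mind the
    # 1,048,576-row-per-sheet limit if you write back to Excel.
    combined = pd.concat(frames, ignore_index=True)
    combined.to_excel("combined.xlsx", index=False)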
My problem is rather simple: I have an Excel sheet that does calculations and creates a graph based on the values of two cells in the sheet. I also have two lists of inputs in text files. I would like to loop through those text files, add the values to the Excel sheet, refresh the sheet, and print the resulting graph to a PDF file or an Excel file named something like 'input1 - input2.xlsx'.
My programming knowledge is limited. I am decent with Python and have looked into Python libraries that work with Excel, such as openpyxl, however most of those don't seem to work for me for various reasons: openpyxl deletes the graphs when opening an Excel file; XlsxWriter can only write files, not read from them; and xlwings won't work for me.
Should I use Python, which I'm familiar with, or would VBA work for this kind of problem? Have any of you ever done something of the sort?
Thanks in advance
As a more transitional approach to what m. wasowski wrote above, I'd suggest you do the following.
Install the pandas package, and see how easy it is to load a file using read_excel. Then, read 10 Minutes to Pandas, and manipulate the data.
You state that the Excel sheet is complex. In general, the more complex it is, the more this approach will eventually simplify things. But you don't have to switch everything immediately. You can still do parts in Excel and parts in pandas.
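For instance, a first interactive look can be as simple as this (the file name is assumed):

    import pandas as pd

    # Load the first sheet of the workbook into a DataFrame.
    df = pd.read_excel("calculations.xlsx", sheet_name=0)

    print(df.head())      # first few rows
    print(df.describe())  # quick numeric summary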
I think you should consider win32com for Excel operations in Python instead of openpyxl or XlsxWriter.
You can read/write Excel files, create charts, and format workbooks using win32com without those limitations.
And for creating the chart you can consider matplotlib; after creating the chart there, you can also save it to a PDF file.
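A minimal sketch of the win32com route for this question's loop (Windows only; the paths, cell addresses, and input values are hypothetical):

    import win32com.client

    excel = win32com.client.Dispatch("Excel.Application")
    excel.Visible = False

    wb = excel.Workbooks.Open(r"C:\data\template.xlsx")
    ws = wb.Worksheets(1)

    input1, input2 = 3.5, 7.2        # one pair from the two text files
    ws.Range("B1").Value = input1    # first input cell
    ws.Range("B2").Value = input2    # second input cell
    excel.CalculateFull()            # recalculate the workbook

    # Type 0 is xlTypePDF: export the sheet, chart included, as a PDF
    # named after the two inputs.
    wb.ExportAsFixedFormat(0, rf"C:\data\{input1} - {input2}.pdf")

    wb.Close(False)   # close without saving changes to the template
    excel.Quit()

Because COM drives a real Excel instance, the charts survive untouched, which is exactly what openpyxl couldn't guarantee here.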
I have a big problem here with Python, openpyxl and Excel files. My objective is to write some calculated data to a preconfigured template in Excel. I load this template and write the data to it. There are two problems:
I'm talking about writing Excel books with more than 2 millions of cells, divided into several sheets.
I do this successfully, but the waiting time is unthinkable.
I don't know any other way to solve this problem. Maybe openpyxl is not the solution. I have tried writing xlsb, but I think openpyxl does not support this format. I have also tried the optimized writer and reader, but the problem comes when I save, due to the amount of data. However, the output file size is 10 MB at most. I'm very stuck with this. Do you know if there is another way to do this?
Thanks in advance.
The file size isn't really the issue when it comes to memory use; it's the number of cells in memory. Your use case really will push openpyxl to its limits: it is currently designed to support either optimised reading or optimised writing, but not both at the same time. One thing you might try would be to read in openpyxl with use_iterators=True; this will give you a generator that you can feed to xlsxwriter, which should be able to write a new file for you. xlsxwriter is currently significantly faster than openpyxl when creating files. The solution isn't perfect, but it might work for you.
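A sketch of that streamed copy; note that use_iterators=True was the flag at the time this answer was written, and current openpyxl versions spell it read_only=True:

    import openpyxl
    import xlsxwriter

    # Read-only mode streams rows instead of loading every cell at once.
    src = openpyxl.load_workbook("big_input.xlsx", read_only=True)

    # constant_memory flushes each row to disk as it is written.
    out = xlsxwriter.Workbook("big_output.xlsx", {"constant_memory": True})

    for sheet in src.worksheets:
        target = out.add_worksheet(sheet.title)
        # iter_rows yields one row of cells at a time, keeping memory flat.
        for r, row in enumerate(sheet.iter_rows()):
            for c, cell in enumerate(row):
                if cell.value is not None:
                    target.write(r, c, cell.value)

    out.close()

This copies values only; in constant_memory mode xlsxwriter also requires rows to be written in order, which the row-by-row loop above guarantees.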
Two-part question here, guys.
First, I've been trying to find a way to read .xlsx files in Python. Does xlrd read .xlsx files now? If not, what's the recommended way to read/write such a file?
Second, I have two files with similar information: one primary field with scoping subfields (like coordinates (the primary field) -> city -> state -> country). In the older file, the information is given an ID number, while the newer file (with records deleted/added) does not have these IDs. In Python, I'd 1) open the two files, 2) check the primary field of the older file against the primary field of the newer file, and merge their information into a new file if they match. Given that it's not too big a file, I don't mind the O(n^2) complexity. My question is this: is there a well-defined way to do this in VBA or Excel? Everything I think of using Excel's library seems too slow, and I'm not excellent with VBA.
I frequently access Excel files through Python and xlrd, or Python and the Excel COM object. For this job, xlrd won't work, because it does not support the xlsx format. But no matter: both approaches are overkill for what you are looking for. Simple Excel formulas will deliver what you want, specifically VLOOKUP.
VLOOKUP "looks for a value in the leftmost column of a table, and then returns a value in the same row from the column you specify".
Some advice on VLOOKUP: first, if you want to match on multiple cells, create a "key" cell which concatenates the cells you are interested in (in both workbooks). Second, make sure to set the last argument of VLOOKUP to FALSE, because you will only want exact matches.
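For example, assuming the old file's data lives on a sheet named 'Old File' with the match fields in columns A and B and the ID in column C (a hypothetical layout), you would build the key in both workbooks with

    =A2 & "|" & B2

and then pull the ID (column 3 of the range) with an exact-match lookup of the key in D2:

    =VLOOKUP(D2, 'Old File'!$A$2:$C$5000, 3, FALSE)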
Regarding performance, Excel formulas are often very fast.
Read the help file on VLOOKUP and ask further questions here.
Late edit (from Mark Baker's answer): there is now a Python solution for xlsx. openpyxl was created this year by Eric Gazoni to read and write Excel's xlsx format.
I only heard about this project this morning, so I've not had an opportunity to look at it and have no idea what it's like; but take a look at Eric Gazoni's openpyxl project. The code can be found on bitbucket. The driving force behind it was the ability to read/write xlsx files from Python.
Try http://www.python-excel.org/
My mistake - I missed the .xlsx detail.
I guess it's a question of what's easier: finding or writing a library that handles the .xlsx format natively, or saving all the Excel spreadsheets as .xls and getting on with it using the libraries that merely handle the older format.
Adding on to Steven Rubalski's answer:
You might want to be able to have your lookup value in a column other than the leftmost one. In those cases the INDEX and MATCH functions come in handy.
See: http://www.mrexcel.com/articles/excel-vlookup-index-match.php
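For instance, to look up a key in column C and return the ID from column A to its left (the ranges here are hypothetical):

    =INDEX($A$2:$A$5000, MATCH(D2, $C$2:$C$5000, 0))

MATCH finds the row of the exact match (the 0 means exact match), and INDEX returns the value from that row of the ID column; unlike VLOOKUP, the returned column can sit anywhere relative to the lookup column.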