I would need some help/ideas again.
I have been working on a pandas Jupyter notebook for some data wrangling with a file I get from our customer. Unfortunately I cannot disclose it.
The previous version I could read in fine using pd.read_excel(); for the latest one, however, everything ends up under just ONE column, the first one. The data is still in rows, which is fine, but each row's content is crammed into that first column.
df = pd.read_excel('./Files/import/ET5X report 03-01-2021.xlsx', header=0, usecols="A:BF")
I even tried passing the usecols argument explicitly, but it made no difference.
Any ideas what I could check? A .csv export would be an alternative, but then I have trouble with the format of some of the cells.
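For reference, this is how the mis-parse shows up when I inspect the result (the printed shape is just illustrative):

import pandas as pd

df = pd.read_excel('./Files/import/ET5X report 03-01-2021.xlsx', header=0, usecols="A:BF")
print(df.shape)      # something like (N, 1) instead of (N, 58) for A:BF
print(df.columns)    # only a single column name shows up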
Thanks!
I'm Lucas and I'm a master's student in management science in Belgium. I have a very poor background in coding and everything related to it, so my problem is probably very basic, but it is still very confusing for me. This semester I am taking a course called "Data management and business analytics". For a small group project, we are required to use Python and the pandasql package. We are supposed to import CSV files into Google Colab, run some queries to keep only what we need, then export the "transformed" data frame to Excel and make some graphs.
Here is the problem: I managed to import a first CSV file, run a simple query on the data frame created from it with pandas, and even export the result to Excel. However, when I tried to do the exact same thing with a second CSV file, I kept getting the same error message.
More precisely, I imported 3 CSV files. All three import fine and I can visualize their rows and columns without any problem. I named the data frames df, df1 and df2, and the results of the queries on them result_df, result_df1 and result_df2. The problem starts with df1. In this data frame there are columns named Year, Country, Model, etc. When I write the following basic query: result_df1 = pysql("select Year from df1"), this error message appears: OperationalError: no such column: Year. I checked the names of the different columns several times, but I can't find the reason why it doesn't recognize the columns even though they really exist in the data frame df1. The exact same problem occurs with df2, which has exactly the same columns as df1.
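For reference, a stripped-down version of what I run looks roughly like this (the file name is made up; pysql is just a small helper around pandasql's sqldf):

import pandas as pd
from pandasql import sqldf

pysql = lambda q: sqldf(q, globals())        # helper so queries can see the notebook's data frames

df1 = pd.read_csv("cars_by_country.csv")     # placeholder file name
print(df1.columns.tolist())                  # 'Year', 'Country', 'Model', ... are all there

result_df1 = pysql("select Year from df1")   # -> OperationalError: no such column: Year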
I hope that someone will be able to help me; it would be a relief 😪.
If you need more details to help me, don't hesitate to ask!
Thanks in advance! (and sorry if my English seems bad)
I've got a very hard task to do. I need to process an Excel file with 6336 rows x 53 columns. My task is to create a program which:
Reads data from the input Excel file.
Sorts all rows by a specific column, e.g. sort by A1:A(last).
Places the columns in a new output Excel file in a given order, for example:
SaleCity            Branch                    CustomerID          InvoiceNum
Old file            For example:              Old file            Merge old file columns
Col[A1:A(last)]     SaleCity='Oklahoma'       Col[M1:M(last)]     Col[K1:K(last) &
                    Branch='OKL GamesShop'                        B1:B(last)]
Saves the new Excel file.
Excel Sample: (screenshot of the sample file omitted)
(All data in this post is not real so don't try to hack someone or something :D)
I know that I did not provide any code, but to be honest I tried solving it by myself and I don't even know which module I should use. I tried openpyxl and pandas, but there's too much data for my capabilities.
Thank you in advance for any help. If I asked the question in the wrong place, please direct me to the right one.
Edit:
To be clear, I'm not asking for a full solution here. What I am asking for is guidance and mentoring.
I would recommend using PySpark. It is more difficult than pandas, but the parallelization it provides will help with your large Excel files.
Or you could use the multiprocessing lib from Python to parallelize pandas functions.
https://towardsdatascience.com/make-your-own-super-pandas-using-multiproc-1c04f41944a1
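A very rough sketch of that multiprocessing idea with pandas (the file names and column names are assumptions based on the question's example, not the real report):

import pandas as pd
from multiprocessing import Pool

def transform_chunk(chunk):
    # per-row work done on one slice of the data, e.g. building the merged column
    chunk = chunk.copy()
    chunk["InvoiceNum"] = chunk["InvoicePrefix"].astype(str) + chunk["InvoiceSuffix"].astype(str)
    return chunk

if __name__ == "__main__":
    df = pd.read_excel("old_report.xlsx")                          # placeholder file name
    chunks = [df.iloc[i:i + 1000] for i in range(0, len(df), 1000)]
    with Pool(processes=4) as pool:
        parts = pool.map(transform_chunk, chunks)                  # process chunks in parallel
    out = pd.concat(parts).sort_values("SaleCity")                 # one global sort at the end
    out = out[["SaleCity", "Branch", "CustomerID", "InvoiceNum"]]  # desired column order
    out.to_excel("new_report.xlsx", index=False)

The chunk size and number of processes are arbitrary here; the same pattern carries over to much larger files.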
Apologies if this has been answered - I just spent the last hour or so looking for a specific method and was unable to find it.
I am getting data from a reporting program via keyboard emulation with pynput - the program has specific menus for copying data and selecting what is actually copied.
I have managed to get the data copied to the clipboard, and have then called openpyxl to load my selected workbook. What I cannot figure out is how to select a specific cell and then paste the already-copied data starting at that cell.
The copy parameters from the reporting program copy the data in a way that pastes into Excel properly (cell by cell), so I know it will not try to paste all the data into one cell. I just can't determine the proper method to select the cell and paste into it.
Thanks in advance for any help.
As a side note, I'm INCREDIBLY new to python - I am well versed in VBA but I am trying to branch out so I apologize if I've stated anything incorrectly.
With the data already on the clipboard, you can use a pandas DataFrame to process it. Below is the method that accepts data copied from the clipboard and converts it into a DataFrame.
pandas.read_clipboard(sep='\\s+', **kwargs)
Now you can select whichever columns you want from the DataFrame and, using the to_csv method, write them out as a CSV file (or use to_excel if you need an actual Excel file).
DataFrame.to_csv()
Check out the documentation:
read_clipboard() - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_clipboard.html
to_csv() - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
Sample code:
import pandas as pd
data = pd.read_clipboard(sep='\\s+')
print(data)
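If the goal is to drop the clipboard data at a particular cell of an existing workbook, one possible sketch (file name, sheet name and target cell are assumptions; if_sheet_exists="overlay" needs a fairly recent pandas) builds on DataFrame.to_excel with an openpyxl writer:

import pandas as pd

data = pd.read_clipboard(sep='\\s+')

# write the frame into an existing workbook, starting at cell C5 of "Sheet1"
with pd.ExcelWriter("report.xlsx", engine="openpyxl", mode="a",
                    if_sheet_exists="overlay") as writer:
    data.to_excel(writer, sheet_name="Sheet1",
                  startrow=4, startcol=2,     # zero-based, i.e. row 5, column C
                  index=False, header=False)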
I love using pd.read_clipboard() to quickly import tabular data from Excel into a DataFrame in pandas.
However, if I have highlighted non-adjacent columns in Excel before hitting CTRL+C, what gets imported into my DataFrame includes all of the intervening columns that were sandwiched between the non-adjacent columns I had actually selected.
(I'm currently using Windows 10 & Excel 2013).
This is annoying and inconvenient.
I suspect that the problem may emanate from Excel or the Windows clipboard, since I also get the intervening (tab-separated) columns when I paste into Sublime Text 3.
If I paste into a blank worksheet, I only get the highlighted columns I want, and this has been my work-around (creating an interim Excel sheet that I then copy and import into pandas). It's a decent workaround, but I'm looking for something faster, since I go back and forth between Excel and pandas MANY times per day.
I am aware that the problem goes away if I HIDE the columns in Excel before copying them (whether via the CTRL+0 shortcut or by grouping the intervening columns I want to hide), but neither of these is suitable, since it changes my current view/design of the worksheet (and in any case Windows 10 has clobbered the "unhide columns" shortcut CTRL+SHIFT+0, although ALT+H+O+U+L still works).
The problem does not go away as long as the intervening columns I don't want are visible (so it doesn't matter whether I hit ALT+; to select only visible cells before CTRL+C).
I'm looking for a solution that is simple and fast, preferably an alternative shortcut to use in Excel, or a universal kwarg I can use for pd.read_clipboard().
The documentation for pd.read_clipboard() says to look at the keywords from pd.read_table(), but I can't figure out which might help. Again, I suspect the problem has to do with the clipboard on the Windows end, but I have searched and searched online and can't find anything except third-party commercial Excel plugins which claim to be able to help with copying non-contiguous cells.
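To make it concrete, the call itself is nothing special; what I'm after is something at this level that skips the sandwiched columns instead of trimming them afterwards (column names below are made up):

import pandas as pd

# after CTRL+C in Excel on two non-adjacent columns
df = pd.read_clipboard(sep='\t')     # the sandwiched columns come along too

# the obvious after-the-fact trim, which I'd rather not type every time
df = df[['Region', 'Revenue']]       # placeholder column names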
We have a dataframe we are working with in an IPython notebook. Granted, being able to save a dataframe in such a way that the whole group could access it through their notebooks would be ideal, and I'd love to know how to do that. However, could you help with the following specific problem?
When we do df.to_csv("Csv file name") it appears that the file ends up in the exact same place as the files we placed in object storage to use in the IPython notebook. However, when one goes to Manage Files, it's nowhere to be found.
When one runs pd.DataFrame.to_csv(df), the text of the CSV is apparently returned. However, when one copies that into a text editor (e.g. Sublime Text), saves it as a CSV, and attempts to read it back into a dataframe, the expected dataframe is not produced.
How does one export a dataframe to csv format, and then access it?
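For concreteness, the round trip we expected to work looks like this (the file name is a placeholder; df is the dataframe from the notebook):

import pandas as pd

df.to_csv("our_data.csv", index=False)     # write next to the notebook's other files
df_back = pd.read_csv("our_data.csv")      # read it back into a dataframe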
I'm not familiar with Bluemix, but it sounds like you're trying to save a pandas dataframe in a way that all of your collaborators can access, and that looks the same for everyone.
Maybe saving and reading from CSVs is messing up the formatting of your dataframe. Have you tried pickling? Since pickling is based on Python, it should give consistent results.
Try this:
import pandas as pd
pd.to_pickle(df, "/path/to/pickle/My_pickle")
and on the read side:
df_read = pd.read_pickle("/path/to/pickle/My_pickle")