Why are the indexing methods missing when working on a dataframe? - python

My code is an update of an existing script which outputs an xlsx file with a lot of data. The original script is pretty stable and has worked for ages.
What I'm trying to do is that, after the original script has ended and the xlsx file is created, I want to load the file into Pandas and then run a series of analyses on it, using the methods .loc(), .iloc(), .index().
But after I read the file into a variable, when I hit '.' after the variable's name in PyCharm, I get all the dataframe and NDArray methods... except those three that I need.
No errors, no explanations. They are just not there.
And if I ignore this and type them in manually anyway, the variable I put the results into doesn't show ANY methods when I next hit '.' on it (instead of showing the methods for, say, a Series).
I've tried clearing the xlsx file of all formatting (it originally had empty lines hidden). I tried running .info() and .head() to make sure they both run fine (they seem to, yes). I even updated my code from Python 2.7 to Python 3.7 using the 2to3 scripts to see if that might change anything. It didn't.
import pandas as pd
analysis_file = pd.read_excel("F:\\myprogram\\output1.xlsx", "Sheet1")
analysis_file. <--- The problem's here
Really not sure how to proceed, and no one I've asked so far has been able to help me.
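For what it's worth, .loc and .iloc are indexers (used with square brackets) and .index is an attribute rather than a method, which may be why they don't appear in PyCharm's list of callable completions. A minimal sketch of how they are typically used on the result of read_excel (the path is the one from the question; the selections are placeholders):

import pandas as pd

analysis_file = pd.read_excel("F:\\myprogram\\output1.xlsx", "Sheet1")

# .loc / .iloc are indexers (square brackets), .index is an attribute
first_row = analysis_file.iloc[0]                              # positional indexing
first_cols = analysis_file.loc[:, analysis_file.columns[:2]]   # label-based indexing
row_labels = analysis_file.index                               # the DataFrame's index object

print(type(first_row), type(first_cols), len(row_labels))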

Related

Pandas read_pickle breaks because of missing/outdated module

I implemented a small simulation environment and have been saving my evaluation results as Pandas data frames in the form of pickle files.
Later, to analyze the results, I have a Jupyter notebook, where I use pandas' df = pd.read_pickle(path) to load the data frames again and visualize the data.
I also annotated metadata as attributes of the data frames, using df.attr, which are loaded correctly afterwards.
This used to work fine. Unfortunately, my simulator has evolved and the corresponding Python package changed names, which leads to problems when trying to read old results.
Now, pd.read_pickle() still works fine for newly generated results.
But for old results it breaks with a ModuleNotFoundError, telling me that it doesn't find the simulator_old module, i.e., the previous version of the package with the old name.
I'm not sure why and where the dependency on my package comes from. Maybe I wrote some object from the old package as a data frame attribute. I can't figure it out because it always simply breaks.
I want to be able to read old and new results and have pd.read_pickle() simply skip any entries that it cannot read but read everything else.
Is there anything like that I can do to recover my old results? E.g., to tell pickle to ignore such errors?
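One thing that sometimes helps with renamed packages (a hedged sketch, not guaranteed to fit these pickles; the new package name simulator_new and the file path are placeholders) is to alias the old module name to the renamed package before unpickling, so pickle can resolve the references to simulator_old:

import sys
import pandas as pd

import simulator_new                        # the renamed package (placeholder name)

# Let pickle resolve references to the old module name
sys.modules["simulator_old"] = simulator_new

path = "results/old_run.pkl"                # placeholder path to an old result file
df = pd.read_pickle(path)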

Excessive indirect references in NAME formula

I am trying to read 'xls' files in Python using pandas. My code basically is a one-liner:
import pandas as pd
df = pd.read_excel(str("/test/test_file.xls"))
This code works for the majority of the files, but there are cases when it fails with the error:
Excessive indirect references in NAME formula
What I tried so far:
Tried changing the stack limit (panic and warning) to as far as 10000 in the Pandas package itself, where the exception was occurring. A recursion limit was hit, so I raised it as far as 125000, which led to my Mac/Python reaching its limit, so I'm guessing that's not the right solution.
Used a memory-intensive EMR to see if it can read the file - nope.
Looked at the GitHub repo for XLRD here to raise a bug only to find out it's out of support.
Opened the file, saved it as an xlsx, used the same code to read it into a dataframe. Worked like a charm.
Tried using Spark Excel Library to read in a particular section of the data - this worked too but I need to use pandas.
Googled it only to find out the results would show me the XLRD code where the exception is defined. Not one person has reported it.
Tried using Python2 and Python3 with the latest and older versions of Pandas - no use.
I cannot share the file, but has anyone faced this issue before? Can someone help? All suggestions are welcome!
Try the following:
Open the xls file
Copy/paste all cells as values
Rerun your script
Hard to help further without having access to the file to explain exactly what is happening.
But chances are xlrd is trying to resolve the value of a formula and is exceeding the "STACK_PANIC_LEVEL". Without seeing the formula, very difficult to say more.
xlrd has a function evaluate_name_formula(). When you try to open a .xls file with xlrd, it will raise an error (as you described) if your file has many user-defined name formulas. To try to solve your problem, you can delete these user-defined formulas and keep the file free of them. Or you can try to edit the xlrd code to prevent it from raising the error, which seems much more difficult.
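Since the asker reports that resaving the file as .xlsx works with the same read code, one pragmatic sketch is to fall back to such a copy when the .xls read fails on the NAME formula. The .xlsx path below is hypothetical and assumes the copy was resaved by hand:

import pandas as pd

xls_path = "/test/test_file.xls"       # original file from the question
xlsx_path = "/test/test_file.xlsx"     # hypothetical copy, resaved by hand in Excel

try:
    df = pd.read_excel(xls_path)       # xlrd raises on the NAME formula here
except Exception as exc:
    print(f"Falling back to the resaved copy: {exc}")
    df = pd.read_excel(xlsx_path)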

How to empty buffer without restarting Python

I am using Python 3.6.3.
A problem remains in my script, which is fully operational.
Main modules are pandas and xlsxwriter (with easygui for the GUI).
From a single master file (Excel 2010), this script can generate dozens of Excel files (with xlsxwriter), each of which can contain hundreds of columns of data (depending on parameters in the master file).
Indentation, logic and results are OK.
But the last Excel file is not committed to disk, and I have to restart Python to get it.
For example, if one run produces 100 files, only 99 will be written on disk. The last one is calculated, but not visible.
If Python is not restarted, this file is written to disk at the beginning of the next run of the script.
I suspect a flush problem and have tried some solutions, but the problem still remains.
Are there some tricks to force the buffer to flush? I am not allowed to modify the environment variables on my professional computer.
Thank you for your help and your time :)
Thank you Mark Tolonen.
You were right: the file was not closed properly, and it was because I made a mistake.
My code was a bit difficult to summarize, and I could not post a summary of my script.
First, a continue keyword (for the main loop) was badly indented, and I moved it to the right place.
But just before this keyword, I was closing the xlsxwriter file with workbook.close (there is only one workbook in the script, for the main loop).
But this was not reported as an error at run time.
Each xlsxwriter file was committed to disk except the last one, as mentioned in my question above.
I then reviewed the documentation at https://xlsxwriter.readthedocs.io/workbook.html, and I noticed that the parentheses were missing when the workbook is closed.
After correcting this by adding the missing parentheses, workbook.close(), all is fine now :)
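For anyone hitting the same thing, here is a minimal sketch of the pattern described above (file names and loop contents are placeholders); the key point is that workbook.close() must actually be called, with parentheses, before continue:

import xlsxwriter

for i in range(3):                               # the script's main loop (placeholder)
    workbook = xlsxwriter.Workbook(f"output_{i}.xlsx")
    worksheet = workbook.add_worksheet()
    worksheet.write(0, 0, "some data")

    # workbook.close (without parentheses) only references the method and never
    # flushes the file to disk; calling workbook.close() commits the file.
    workbook.close()
    continue                                     # continue comes after the close, as in the corrected script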
I would like to share this information, because some may have met the same problem.
Thank you also to progmatico for your information on flush properties.
Greetings from Paris, France :)

Scientific Notation is badly written to csv from python

I'm totally puzzled by the situation at hand. I have a large dataset with a broad range of numbers, all between 0 and 2. However, when I write the data to a .csv file with
df_Signals.to_csv('signals_IDG_TOut1.csv', sep=',')
to be able to import the file into another program, something strange happens. When I, for example, call the number with
print(df_Signals["Column"].iloc[44])
python prints: 2.8147020866287068e-05
However, when I open the .csv file it reads 281470208662,87. A quick inspection shows that this happens for all numbers written in E-notation that I could find. I have tried to figure out what is going on, but have no idea what the answer is. So my main question is: why? And secondly, how can I resolve this? And is this a structural problem when exporting to .csv files?
I use PyCharm 2017.1.4, with the Anaconda 3 interpreter.
Regards
Update: As the comments correctly pointed out, it is Excel that opens the data wrongly, which still makes me wonder why that happens.
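If the file does have to be opened in Excel directly, one workaround (a sketch with placeholder data; the column and file names are taken from the question) is to write the floats in plain decimal notation with to_csv's float_format argument, so Excel never sees E-notation:

import pandas as pd

df_Signals = pd.DataFrame({"Column": [2.8147020866287068e-05, 1.5]})

# float_format writes plain decimals instead of scientific notation,
# which Excel otherwise misinterprets depending on locale settings.
df_Signals.to_csv("signals_IDG_TOut1.csv", sep=",", float_format="%.10f")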

How to_csv in Bluemix

We have a dataframe we are working with in an IPython notebook. Granted, being able to save a dataframe in such a way that the whole group could access it through their notebooks would be ideal, and I'd love to know how to do that. However, could you help with the following specific problem?
When we do df.to_csv("Csv file name") it appears that the file is located in the exact same place as the files we placed in object storage to use in the IPython notebook. However, when one goes to Manage Files, it's nowhere to be found.
When one runs pd.DataFrame.to_csv(df), the text of the csv file is apparently returned. However, when one copies that into a text editor (e.g. Sublime Text), saves it as a csv, and attempts to read it back into a dataframe, the expected dataframe is not produced.
How does one export a dataframe to csv format, and then access it?
I'm not familiar with Bluemix, but it sounds like you're trying to save a pandas dataframe in a way that all of your collaborators can access, and so that it looks the same for everyone.
Maybe saving and reading from CSVs is messing up the formatting of your dataframe. Have you tried using pickling? Since pickling is based around python, it should give consistent results.
Try this:
import pandas as pd
pd.to_pickle(df, "/path/to/pickle/My_pickle")
and on the read side:
df_read = pd.read_pickle("/path/to/pickle/My_pickle")
