How to display a DataFrame without Jupyter Notebook crashing? - python

Every time I try to display a complete DataFrame in my Jupyter notebook, the notebook crashes. The file won't start up afterwards, so I had to make a new Jupyter notebook file. When I do display(df) it only shows a couple of the rows, but I need to show all 57623 rows. I need to show the results for all of these rows and put them into an HTML file.
I tried setting the max rows and max columns, but the entire DataFrame would not print out without the notebook crashing:
pd.set_option('display.max_columns', 24)
pd.set_option('display.max_rows', 57623)
I expected the entire DataFrame to print out, but instead the notebook would show an hourglass next to it and nothing would load.

I think your machine is not powerful enough to handle that much data. I can display 57623 rows of my data (with 16 columns) without any problem, but my machine has 252GB of memory, and it still took about one minute to display the data in a Jupyter notebook.
If I try to scroll through the data, it's slow and sometimes gets stuck for a while.
On the other hand, what do you want to achieve by displaying all of the data here? There is almost certainly another way to accomplish what you are doing now.
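Since the goal is an HTML file anyway, one option is to skip rendering the table in the notebook and write it straight to disk. A minimal sketch (the DataFrame contents and the output path 'output.html' are placeholders, not from the question):

import pandas as pd

# Stand-in for the real 57623-row DataFrame (hypothetical data).
df = pd.DataFrame({'a': range(57623), 'b': range(57623)})

# Write the full table directly to an HTML file instead of displaying it,
# so the notebook never has to render all the rows in an output cell.
df.to_html('output.html', index=False)

The resulting file may still be heavy to open in a browser, but the notebook kernel no longer has to render it.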

Related

pandas pd.read_excel puts all into one column

I would need some help/ideas again.
I have been working on a pandas jupyter notebook for some data wrangling with a file I get from our customer. Unfortunately I cannot disclose it.
Previous versions I could read in using pd.read_excel(); however, for the latest one, everything is put into just ONE column, the first one. The data is split into rows, which is OK, but each row's content ends up entirely in the first column.
df=pd.read_excel('./Files/import/ET5X report 03-01-2021.xlsx', header=0, usecols="A:BF")
I even tried to use the usecols parameter explicitly, but there was no change.
Any ideas what I could check? A .csv would be an alternative, but then I have some trouble with the format of some of the cells.
Thanks!
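One thing worth checking (an assumption, since the file itself cannot be shared): files like this are sometimes delimited text that has merely been renamed to .xlsx, in which case read_excel cannot split the columns. A minimal diagnostic sketch, reusing the path from the question and guessing a semicolon separator:

import pandas as pd

path = './Files/import/ET5X report 03-01-2021.xlsx'

# If the file is genuinely an Excel workbook this will fail or produce garbage;
# if it is really delimited text, an explicit separator keeps the columns apart.
df = pd.read_csv(path, sep=';', header=0)
print(df.shape)  # expect more than one column if the separator guess is right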

Will interrupting the script delete the progress in Jupyter Notebook?

I'm currently running a script in a Jupyter Notebook which loops over a DataFrame and manipulates the data of the current row. As my DataFrame has thousands of rows and each loop iteration takes a while to run, I am wondering whether it's safe to interrupt the script without losing all of my progress.
I am keeping track of the rows that have already been processed, so I could just start where I left off, provided the manipulations on the DataFrame don't get lost. I don't want to risk just trying it out right now, so advice would be appreciated.
Unless you are storing progress in external files, interrupting Jupyter will lose your data. I strongly recommend not counting on the variables inside Jupyter being in any particular state if you are mid-way through a calculation; instead, save intermediate steps to files so you can track progress, chunking as you go.
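A minimal checkpointing sketch (the file names, column name, chunk size, and row manipulation below are assumptions, not from the question): process the DataFrame in chunks and append each finished chunk to disk, so an interruption costs at most one chunk of work.

import pandas as pd

df = pd.read_csv('input.csv')   # assumed input file with a 'value' column
chunk_size = 1000               # assumed chunk size

for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size].copy()
    chunk['value'] = chunk['value'] * 2   # placeholder for the real row manipulation
    # Checkpoint: append this chunk to disk; write the header only for the first chunk.
    chunk.to_csv('progress.csv', mode='a', header=(start == 0), index=False)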

How to use PyCharm, like a Jupyter Notebook, to read an object that is stored in RAM?

Every time I want to view a df in PyCharm, I need to select everything from the beginning of the script up to the specific line that reads the df, and it takes time to run it every time. I want to view it in the variables view, but no variables are found in the IDE. Thanks.
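One workaround, sketched under the assumption that re-running the early part of the script is what costs the time: persist the DataFrame to disk once, then later runs (or PyCharm's Python console) can load it without executing everything above that line. The file name 'df_cache.pkl' is just an example:

import pandas as pd

# Stand-in for the expensive part of the script (hypothetical data).
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Save the result once; later sessions can reload it instantly.
df.to_pickle('df_cache.pkl')

# Elsewhere (another run, or an interactive console):
df = pd.read_pickle('df_cache.pkl')
print(df.head())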

PDF Grabbing Code is terminated (Cache full trying to find a workaround)

So I just started coding with Python. I have a lot of PDFs which are my target for data grabbing. The script is finished and works without errors if I limit it to a small number of PDFs (~200). If I let the script run on 4000 PDFs, it is terminated without an error. A friend of mine told me that this is due to the cache.
I save the grabbed data to lists and in the last step create a DataFrame out of the different lists. The DataFrame is then exported to Excel.
So I tried to export the DataFrame after every 200 PDFs (and then clear all the lists and the DataFrame), but then pandas overwrites the prior results. Is this the right way to go? Or can anyone think of a different approach to get around the termination caused by the large number of PDFs?
Right now I use:
import pandas as pd

MN = list()
VdS = list()
# ... MN and VdS are filled while looping over the PDFs ...
data = {'Materialnummer': MN, 'Verwendung des Stoffs': VdS}
df = pd.DataFrame(data)
df.to_excel('test.xls')
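Because to_excel rewrites the whole file on every call, each export of a new batch replaces the previous one. One way around that (a sketch, assuming a CSV output is acceptable and using a hypothetical flush_batch helper) is to append each batch to the same file and write the header only once:

import os
import pandas as pd

def flush_batch(mn_batch, vds_batch, path='results.csv'):
    # Append this batch to the output file instead of overwriting it;
    # write the header only when the file does not exist yet.
    batch = pd.DataFrame({'Materialnummer': mn_batch,
                          'Verwendung des Stoffs': vds_batch})
    batch.to_csv(path, mode='a', header=not os.path.exists(path), index=False)

# Example usage after every ~200 PDFs (hypothetical values):
flush_batch(['M-001', 'M-002'], ['Kleber', 'Reiniger'])

After all PDFs are processed, the CSV can be read back and written to Excel in one go if an .xlsx file is required.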

Jupyter notebook's response is slow when the code has many lines

I have a question about the Jupyter notebook.
When I copied and pasted 663 lines of Python code into the Jupyter notebook,
it responds much more slowly than a notebook which has just a few lines of code.
Has anyone experienced this issue?
Does anyone know the solution?
Without any information about your code it is really difficult to give you an answer.
However, try to keep your output under control. Too much output generated in a single run can overwhelm the kernel.
Moreover, it does not make much sense to run almost 700 lines of code in a single cell; are you sure you're using the right tool?
Sometimes one piece of code can slow down the whole session; if you split your execution into smaller pieces over multiple cells, you will find what your bottleneck really is.
Add this to your notebook and then click on the link after you execute the cell. Then you can track the progress of what's running and see which statements are causing it to be slow. You could also split the code up into multiple cells to see where the slowdown is occurring.
from IPython.core.display import display, HTML
from pyspark import SparkContext

# Create a SparkContext (or reuse an existing one with SparkContext.getOrCreate()).
# sc = SparkContext.getOrCreate()
sc = SparkContext()
spark_url = sc.uiWebUrl

# Render a clickable link to the Spark web UI, where running jobs can be monitored.
display(HTML('''
<p>
<br />Spark connection is ready! Use this URL to monitor your Spark application!
</p>
<p>
{spark_url}
</p>'''.format(spark_url=spark_url)))
