Causes of "negative dimensions are not allowed" - python

What would be possible causes of this error:
negative dimensions are not allowed
I am using crosstab on a pandas dataframe with 8 million rows. It works fine up to 2 million rows, but throws this error once the data gets larger.
I tried to achieve the same result (as described in this question) using unstack, and I get exactly the same error.
Can anyone give a hint on this? I checked this question and this one, but they were not helpful enough.
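For context, this ValueError comes from NumPy when it is asked to allocate an array with a negative size, and with inputs this large that usually points at an internal size computation overflowing a 32-bit integer; that crosstab/unstack hits such an overflow here is an assumption, not something the question confirms. A minimal sketch of both mechanisms:

import numpy as np

# The error itself: NumPy refuses a negative array shape.
try:
    np.zeros(-1)
except ValueError as e:
    print(e)  # negative dimensions are not allowed

# How a size can go negative: 32-bit integer products wrap around,
# so a result table with many rows x columns can overflow below zero.
n = np.int32(50_000)
print(n * n)  # wraps to -1794967296 instead of 2500000000

If that is the cause, reducing the number of distinct row/column keys or building the table in chunks is the usual workaround.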

Why is pandas compare not working when comparing two dataframes?

I am creating two dataframes that I align with each other on an index field, so each frame has the same indices on both sides, and I sort them as well. I want to return the differences between these fields, so as to catch any rows that have updated since the last run. But I am getting a weird result.
df1.compare(df2)
I fail to see any differences here, and when I manually look at the ids involved I do not see any changes at all. What could be causing this?
If you look at the code below, it's working.
Can you please share both of your dfs so that we can assist you better?
I solved it and will post this in case someone else gets stuck. Apparently nulls were being read into one dataframe as None, while the other dataframe actually contained the string 'None' rather than a null. You would never know this by printing the dataframes, as they look identical to the eye.
It took me a while to realize this, and hopefully this saves someone else some time.
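A minimal reproduction of that gotcha (the column names and values here are made up for illustration): a genuine None and the string 'None' print identically, but compare() sees them as different values.

import pandas as pd

# One frame holds a genuine null (None), the other the literal string 'None'.
df1 = pd.DataFrame({"id": [1, 2], "status": ["active", None]}).set_index("id")
df2 = pd.DataFrame({"id": [1, 2], "status": ["active", "None"]}).set_index("id")

print(df1)               # both frames print the word None in row 2...
print(df2)
print(df1.compare(df2))  # ...yet compare() flags row 2, because None != 'None'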

Resizing list from a 2D array

I came across a problem when trying to do the resize as follows:
I have a 2D array after processing data, and now I want to resize it, with each row ignoring the first 5 items.
What I am doing right now is:
Edit: this approach works fine as long as you make sure you are working with a list, not a string. It failed to work on my side because I hadn't done the conversion from string to list properly,
so it ended up eliminating the first five characters of each string.
2dArray = [[array1], [array2]]
new_Resize_2dArray = [array[5:] for array in 2dArray]
However, it does not seem to work, as it just copies all the elements over to new_Resize_2dArray.
I would like to ask for help to see what I did wrong, or whether there is any scientific calculation library I could use to achieve this.
First, because Python list indexing is zero-based, your code should read new_Resize_2dArray = [array[5:] for array in 2dArray] if you want to exclude the first 5 columns. Otherwise, I see no issue with your single line of code.
As for scientific computing libraries, numpy is a highly prevalent third-party package with a high-performance multidimensional array type, ndarray. Using ndarrays, your code could be shortened to new_Resize_2dArray = 2dArray[:, 5:]
Aside: it would help to include a bit more of your code, or a minimal example where you are getting the unexpected result (e.g., use a fake/stand-in 2D array to see if it works as expected or still fails).
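To illustrate both points with a small made-up array: the slice works as intended on a list of lists, silently chops characters if the rows are still strings (the failure mode from the edit above), and becomes a one-liner with numpy.

import numpy as np

# Rows as lists: [5:] drops the first five *items* of each row, as intended.
rows_as_lists = [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]]
print([row[5:] for row in rows_as_lists])    # [[5, 6, 7], [13, 14, 15]]

# Rows as strings: the same slice drops the first five *characters* instead.
rows_as_strings = ["0,1,2,3,4,5,6,7"]
print([row[5:] for row in rows_as_strings])  # [',3,4,5,6,7']

# With numpy, the column slice is a single expression.
arr = np.array(rows_as_lists)
print(arr[:, 5:])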

Keep original precision in dataframes during calculation

I am a bit confused about how Python stores values and how I should keep the original precision.
My problem is that after some calculation, I got a dataframe which contains Inf and NaN.
Then I replicated the same steps in Excel. The strange thing is that I get concrete numbers in Excel for the NaN/Inf cases in Python, while all the other values are the same.
The calculation involves division, and wherever Excel produced a very small number, for example 2.48065E-16, I got 0.0 in Python.
So I tried df = df.round(20) at the very beginning; the NaN disappeared, but I still got Inf.
I also tried pd.set_option('float_format', '{:f}'.format), but it didn't work.
Any idea on how I can have Python store and manipulate the values in dataframes without changing the original precision?
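No answer is recorded here, but the mechanics can be sketched (the values below are stand-ins, not the asker's data): pandas stores float64, whose roughly 15-17 significant digits mean a subtraction of nearly equal values can land on exactly 0.0 under one order of operations and on a tiny residue under another, and only division by an exact 0.0 produces Inf. round() and float_format affect rounding and display, not the underlying storage.

import pandas as pd

# Catastrophic cancellation: subtracting nearly equal float64 values may
# leave a tiny residue or exactly 0.0, depending on how the intermediate
# roundings fall; Excel and pandas both use 64-bit floats, but need not
# perform the same sequence of operations.
tiny = (0.1 + 0.2) - 0.3
print(tiny)                  # 5.551115123125783e-17, not 0.0

s = pd.Series([tiny, 0.0])
print(1.0 / s)               # dividing by the residue is finite (huge);
                             # dividing by an exact 0.0 is what gives inf

# Formatting options change presentation, not the stored float64 values.
pd.set_option("display.float_format", "{:.17g}".format)
print(s)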

Python IDLE column limitation

I am trying to use IDLE for working with some data, and my problem is that when there are too many columns, some of them are omitted from the output and replaced with dots. Is there a way to increase the limit set by the IDLE IDE? I have seen sentdex using IDLE with up to 11 columns, all of them shown, hence my question.
Thank you very much for your responses.
What type are you printing? Some have a reduced representation that is produced by default by their str conversion to avoid flooding the terminal. You can get some of them to produce their full representation by applying repr to them and then printing the result.
This doesn't work for dataframes. They have their own adjustable row and column limits. See Pretty-print an entire Pandas Series / DataFrame
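For the dataframe case, the relevant pandas display options look like this (a small sketch; the frame is made up, and None means no limit):

import pandas as pd

# Show every column instead of collapsing the middle ones into "..."
pd.set_option("display.max_columns", None)
# Allow more horizontal space before the output wraps
pd.set_option("display.width", 200)

df = pd.DataFrame({f"col{i}": range(3) for i in range(15)})
print(df)  # all 15 columns now appear, wrapped to the configured width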

subselection of columns in dask (from pandas) by computed boolean indexer

I'm new to dask (imported as dd) and am trying to convert some pandas (imported as pd) code.
The goal of the following lines is to slice the data down to those columns whose values fulfill the calculated requirement, in dask.
There is a given table in CSV. The former code reads
inputdata = pd.read_csv("inputfile.csv")
pseudoa = inputdata.quantile([.035, .965])
pseudob = pseudoa.diff().loc[.965]
inputdata = inputdata.loc[:, inputdata.columns[pseudob.values > 0]]
inputdata.describe()
and is working fine.
My simple idea for the conversion was to substitute the first line with
inputdata = dd.read_csv("inputfile.csv")
but that resulted in the strange error message IndexError: too many indices for array.
Even after switching to already-computed data in inputdata and pseudob, the error remains.
Maybe the question really comes down to the idea of computed boolean slicing of dask columns.
I just found a (maybe suboptimal) way (not a full solution) to do this. Changing line 4 to the following
inputdata = inputdata.loc[:, inputdata.columns[(pseudob.values > 0).compute()[0]]]
seems to work.
Yes, Dask.dataframe's .loc accessor only works if it gets concrete indexing values. Otherwise it doesn't know which partitions to ask for the data. Computing your lazy dask result to a concrete Pandas result is one sensible solution to this problem, especially if your indices fit in memory.
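Putting that together, a minimal end-to-end sketch of the pattern (the toy frame stands in for inputfile.csv; the quantile thresholds mirror the question):

import pandas as pd
import dask.dataframe as dd

# Stand-in for inputfile.csv: column "b" is constant, so its quantile
# difference is 0 and it should be dropped by the filter.
pdf = pd.DataFrame({"a": range(100), "b": [1.0] * 100, "c": range(100, 200)})
ddf = dd.from_pandas(pdf, npartitions=4)

# Materialize the quantiles as a concrete pandas object first...
pseudoa = ddf.quantile([.035, .965]).compute()
pseudob = pseudoa.diff().loc[.965]

# ...so the column selection receives concrete labels, not a lazy mask.
keep = ddf.columns[pseudob.values > 0]
ddf = ddf[list(keep)]
print(ddf.describe().compute())  # "b" is gone; "a" and "c" remain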
