Python IDLE column limitation

I am trying to use IDLE to work with some data, and my problem is that when there are too many columns, some of them are omitted from the output and replaced with dots. Is there a way to increase the limit set by the IDLE IDE? I have seen sentdex using IDLE with up to 11 columns and all of them were shown, hence my question.
Thank you very much for your responses.

What type are you printing? Some types have a reduced representation produced by their default str conversion, to avoid flooding the terminal. You can get some of them to produce their full representation by applying repr to them and then printing the result.
That doesn't work for dataframes, though: they have their own adjustable row and column limits. See Pretty-print an entire Pandas Series / DataFrame
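Since the question is about tabular data, the dots are most likely pandas truncating the display rather than IDLE itself. A minimal sketch of raising the pandas display limits (these are standard pandas display options; the 30-column frame is just for illustration):

```python
import pandas as pd

# Print every column instead of collapsing the middle ones into "..."
pd.set_option('display.max_columns', None)
# Widen the printable line width (the default is 80 characters)
pd.set_option('display.width', 200)

# Small demonstration frame with 30 columns
df = pd.DataFrame({f'col{i}': range(3) for i in range(30)})
print(df)  # all 30 columns are shown, none replaced with dots
```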

Related

Python: encoding issues - comparing two strings with different encoding

I have some problems with encoding in Python. I've searched for an answer for a couple of hours now and still no luck.
I am currently working in a Jupyter notebook with pandas dataframes.
Long story short: in a dataframe column I have different strings, single letters of the alphabet. I wanted to apply a function to this column that would convert letters to numbers based on a specific key, but I got an error every time I tried. When I dug into the reason behind this I realised that:
I have two strings 'T'. But they are not equal.
string1.encode() = b'T'
string2.encode() = b'\xd0\xa2'
How can I standardize/encode/decode/modify all strings so that they use the same encoding and I can compare them and perform operations on them? What is the easiest way to achieve that?
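The two bytes b'\xd0\xa2' are the UTF-8 encoding of the Cyrillic capital letter Te, which looks identical to the Latin 'T'. A minimal sketch of normalizing such look-alikes, assuming the data mixes Cyrillic and Latin homoglyphs (the translation table below only covers the letters it lists, and the sample column is made up):

```python
import unicodedata
import pandas as pd

# Confirm what the two visually identical characters really are
print(unicodedata.name('T'))   # LATIN CAPITAL LETTER T
print(unicodedata.name('Т'))   # CYRILLIC CAPITAL LETTER TE

# Map common Cyrillic look-alikes onto their Latin counterparts
# (illustrative only; extend the table to whatever appears in your data)
homoglyphs = str.maketrans('АВЕКМНОРСТХасеорух', 'ABEKMHOPCTXaceopyx')

df = pd.DataFrame({'letter': ['T', 'Т', 'A']})   # the second 'Т' is Cyrillic
df['letter'] = df['letter'].str.translate(homoglyphs)

print((df['letter'] == 'T').tolist())            # [True, True, False]
```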

Running the same script on the same pandas data produces very slightly different floating-point values

I am executing a script that I have run before on the same data. The dataframe I get is only very slightly different from the previous one (around the 10th decimal place). For example:
in some column (and row) the old dataframe contains the price 5673391.88.
In the same column and same row of the new dataframe the value seems to be exactly the same (5673391.88).
However if I subtract the two columns I get a difference of -9.445123e-10.
This is of course the case for the entire column, not just that particular row. How can that be? Please note that I cannot confirm the environment (pandas or Python version) was the same between the two script runs. Could it be one of these two reasons? Something else?
One possible reason: pandas 1.2.0, released on 26 Dec 2020, highlights exactly this in its release notes:
Change in default floating precision for read_csv and read_table
Previously, the methods read_csv() and read_table() could read floating-point numbers slightly incorrectly with respect to the last bit in precision.
Before this version, float_precision="high" had always been available to avoid the issue.
From this version on, the default float_precision=None corresponds to the more accurate parser, without any impact on performance.
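If the two runs may have used different pandas versions, pinning the parser explicitly makes the parsing reproducible. A minimal sketch, assuming the data is read from CSV files (the file and column names are placeholders):

```python
import numpy as np
import pandas as pd

# Force the round-trip parser so the text "5673391.88" maps to the same
# float64 regardless of which pandas version does the parsing
df_old = pd.read_csv('prices_run1.csv', float_precision='round_trip')
df_new = pd.read_csv('prices_run2.csv', float_precision='round_trip')

# Compare with a tolerance instead of exact subtraction, so last-bit
# differences are not reported as real changes
print(np.allclose(df_old['price'], df_new['price'], rtol=0, atol=1e-6))
```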

Keep original precision in dataframes during calculation

I am a bit confused about how Python stores values and how I should keep the original precision.
My problem is that after some calculation I get a dataframe which contains Inf and NaN.
I then replicated the same steps in Excel. The strange thing is that Excel gives specific numbers for the cells that are NaN/Inf in Python, while all other values are the same.
The calculation involves division, and whenever I encounter a very small number, for example 2.48065E-16 in Excel, I get 0.0 in Python.
So I tried df = df.round(20) at the very beginning; the NaN disappeared, but I still got Inf.
I also tried pd.set_option('float_format', '{:f}'.format) but it didn't work.
Any idea how I should let Python store and manipulate the values in dataframes without changing the original precision?
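For what it's worth, pandas and Excel both work with 64-bit floats, and pd.set_option('float_format', ...) only changes how values are printed, not what is stored; the Inf values typically come from dividing by something that is exactly 0.0 on the Python side. A minimal sketch for locating those rows (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'num': [1.0, 2.0, 3.0],
                   'den': [4.0, 0.0, 2.48065e-16]})

ratio = df['num'] / df['den']      # dividing by 0.0 yields inf, not an error
print(ratio)

# Find the rows that produced inf and inspect their denominators
bad = np.isinf(ratio)
print(df.loc[bad, 'den'])

# round(20) cannot add precision: a float64 already carries ~15-17
# significant digits, so 2.48065e-16 is stored as-is, not rounded to 0.0
print(df['den'].iloc[2])
```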

Python Pandas: Dataframe.loc[] vs dataframe[] behavior difference?

I'm completely new to Pandas, but I was given some code that uses it, and I came across a line that looked like the following:
df.loc[bool_array]['column_name']=0
Looking at that code, I would expect that if I then ran df.loc[bool_array]['column_name'] to retrieve the values of the column, the resulting values would all be zero. In fact, however, this was not the case: the values remained unchanged. To actually change the values, I had to remove the .loc, like so:
df[bool_array]['column_name']=0
whereupon displaying the values did, in fact, show 0 as expected, regardless of whether I used .loc or not.
Everything I can find about .loc seems to indicate that it should behave essentially the same as simply doing [] when given a boolean array like this. What difference am I missing that explains this behavioral discrepancy?
EDIT: As pointed out by @QuangHoang in the comments, the following recommended code DOES work:
df.loc[bool_array, 'column_name']=0
which simplifies the question to "why doesn't the original code work"?
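A small sketch of the difference: the chained form first builds a temporary object with df.loc[bool_array] and then assigns into that temporary, while the single .loc[rows, column] call writes directly into the original frame.

```python
import pandas as pd

df = pd.DataFrame({'column_name': [1, 2, 3, 4]})
bool_array = df['column_name'] > 2

# Chained indexing: df.loc[bool_array] returns a new object, so the
# assignment lands on that temporary and df is left untouched
# (older pandas versions emit a SettingWithCopyWarning here)
df.loc[bool_array]['column_name'] = 0
print(df['column_name'].tolist())       # [1, 2, 3, 4]

# Single indexer: one assignment directly on df itself
df.loc[bool_array, 'column_name'] = 0
print(df['column_name'].tolist())       # [1, 2, 0, 0]
```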

Python pandas.read_csv rounding values on import?

I have a 10000 x 250 dataset in a csv file. When I use the command
data = pd.read_csv('pool.csv', delimiter=',',header=None)
from the correct working directory, the values do get imported.
First I get the dataframe. Since I want to work with the numpy package, I need to convert it to its values using
data = data.values
And this is where it gets weird. At position [9999,0] the file contains the value -0.3839. However, after importing it and calculating with it, I noticed that Python (or numpy) does something strange during the import.
Calling data[9999,0] SHOULD give the expected -0.3839, but gives something like -0.383899892....
I already imported the file in other languages like Matlab and there was no issue with rounding those values. I also tried using the .to_csv command from the pandas package instead of .values; however, the exact same problem appears.
The last 10 elements of the first column are
-0.2716
0.3711
0.0487
-1.518
0.5068
0.4456
-1.753
-0.4615
-0.5872
-0.3839
Is there an import routine that does not have these rounding errors?
Passing float_precision='round_trip' should solve this issue:
data = pd.read_csv('pool.csv', delimiter=',', header=None, float_precision='round_trip')
That's a floating-point error. It comes from the way computers store real numbers in binary. (You can look it up if you really want to know how it works.) Don't be bothered by it; it is very small.
If you really want exact precision (because you are testing for exact values), you can look at Python's decimal module, but your program will be a lot slower (possibly around 100 times slower).
You can read more here: https://docs.python.org/3/tutorial/floatingpoint.html
You should know that all languages have this problem; some are just better at hiding it. (Also note that in Python 3 this "hiding" of the floating-point error has been improved.)
Since this problem has no ideal solution, you have to decide for yourself which approach is the most appropriate for your situation.
I don't know about 'round_trip' and its limitations, but it can probably help you. Another option is the float_format argument of the to_csv method (https://docs.python.org/3/library/string.html#format-specification-mini-language)
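A short sketch of the underlying effect: the text -0.3839 has no exact binary representation, so the stored value is the nearest 64-bit float; the decimal module keeps the textual value exactly, at the cost of speed.

```python
from decimal import Decimal

x = float('-0.3839')
print(x)                    # -0.3839 (Python 3 prints the shortest round-trip repr)
print(f'{x:.20f}')          # the full decimal expansion of the stored double
print(Decimal('-0.3839'))   # exact text value; Decimal arithmetic is much slower
```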
