pandas dataframe; error when converting NaN to 0 - python

The last step of my dataframe is to convert all NaN values to 0 (zero). My dataframe contains more than 1000 columns, some are text, some are integers, and some are floats.
To convert NaN to 0, I use the following command:
#replace nan in columns with 0
nan_cols = df5c.columns[df5c.isnull().any(axis=0)]
for col in nan_cols:
    df5c[col] = df5c[col].fillna(0).astype(int)
This worked fine, until I added a new column with new data, which gives the following error:
Traceback (most recent call last):
File "pythonscript_v8.py", line 233, in <module>
df5c[col] = df5c[col].fillna(0).astype(int)
File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 2632, in astype
dtype=dtype, copy=copy, raise_on_error=raise_on_error, **kwargs)
File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 2864, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 2823, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 430, in astype
values=values, **kwargs)
File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 472, in _astype
values = com._astype_nansafe(values.ravel(), dtype, copy=True)
File "/usr/lib/python3/dist-packages/pandas/core/common.py", line 2463, in _astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas/lib.pyx", line 935, in pandas.lib.astype_intsafe (pandas/lib.c:16612)
File "pandas/src/util.pxd", line 60, in util.set_value_at (pandas/lib.c:66830)
ValueError: invalid literal for int() with base 10: 'NODE_1_length_402490_cov_43.5825_ID_1'
What does this error mean, and how can I solve this?
My dataframe looks like this:
source contigID contig_length SCM/genes plasmid_genes/genes A053_1 parA_1
COLS157_1 NODE_1_length_402490_cov_43.5825_ID_1 402490 0.87 0.95 NaN NaN
COLS157_10 NODE_10_length_218177_cov_45.105_ID_19 218177 0.79 0.97 NaN NaN
COLS157_100 NODE_157_length_248_cov_34.4628_ID_313 248 NaN NaN NaN NaN
COLS157_11 NODE_11_length_176130_cov_51.1495_ID_21 176130 0.75 0.86 NaN NaN
COLS157_12 NODE_12_length_165446_cov_50.2044_ID_23 165446 0.77 0.88 NaN NaN

If I'm understanding the problem correctly, the error means astype(int) is being applied to a text column: contigID contains a NaN, so it ends up in nan_cols, and a string like 'NODE_1_length_402490_cov_43.5825_ID_1' can't be cast to int. Filling the NaN cells one at a time avoids the cast entirely:
nan_cols = df5c.columns[df5c.isnull().any(axis=0)]
for col in nan_cols:
    for i in range(len(df5c)):
        if pd.isnull(df5c.loc[i, col]):
            df5c.loc[i, col] = 0
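A faster, vectorized alternative is to fill only the columns that are numeric to begin with, so text columns such as contigID are never touched. This is a sketch on a made-up two-column frame mirroring the structure above, assuming only numeric columns should get the 0 fill:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the question: a string column with a missing
# value alongside a numeric column with a NaN.
df5c = pd.DataFrame({
    "contigID": ["NODE_1_length_402490_cov_43.5825_ID_1", None, "NODE_3"],
    "SCM/genes": [0.87, np.nan, 0.75],
})

# Restrict the fill to numeric columns; string columns keep their NaN
# (or can be filled with a string sentinel instead).
num_cols = df5c.select_dtypes(include="number").columns
df5c[num_cols] = df5c[num_cols].fillna(0)
```

With this approach astype(int) can still be applied afterwards, but only to columns where an integer cast makes sense.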

Related

I wrote this code and received this error. How should I fix this?

import pandas as pd
import numpy as np
from sklearn import preprocessing
country ={'data source':['data','country name','brazil','switzerland','germany','denmark','spain','france','japan','greece','iran','kuwait','morocco','nigeria','qatar','sweden','india','world'],
'unnamed1':['nan','country code','BRA','CHE','DEU','DNK','ESP','FRA','JPN','GRC','IRN','KWT','MAR','NGA','QAT','SWE','IND','WLD'],
'unnamed2':[2016,'population growth',0.817555711,1.077221168,1.193866758,0.834637611,-0.008048086,0.407491036,-0.115284177,-0.687542545,1.1487886,2.924206194,'nan',1.148214693,1.18167997],
'unnamed3':['nan','total population',207652865,8372098,82667685,'nan',46443959,66896109,126994511,10746740,80277428,4052584,35276786,185989640,2569804,9903122,1324171354,7442135578],
'unnamed4':['area(sq.km)',8358140,39516,348900,42262,500210,547557,394560,128900,16287601,'nan',446300,910770,11610,407310,2973190,129733172.7]}
my_df = pd.DataFrame(country, index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17], columns=['data source','unnamed1','unnamed2','unnamed3','unnamed4'])
print(my_df)
and this is the error:
Traceback (most recent call last):
File "c:/Users/se7en/Desktop/AI/skl.py", line 11, in <module>
my_df = pd.DataFrame(country, index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17], columns=['data source','unnamed1','unnamed2','unnamed3','unnamed4'])
File "C:\Program Files\Python37\lib\site-packages\pandas\core\frame.py", line 614, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "C:\Program Files\Python37\lib\site-packages\pandas\core\internals\construction.py", line 465, in dict_to_mgr
arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
File "C:\Program Files\Python37\lib\site-packages\pandas\core\internals\construction.py", line 136, in arrays_to_mgr
arrays, arr_names, axes, consolidate=consolidate
File "C:\Program Files\Python37\lib\site-packages\pandas\core\internals\managers.py", line 1776, in create_block_manager_from_arrays
raise construction_error(len(arrays), arrays[0].shape, axes, e)
File "C:\Program Files\Python37\lib\site-packages\pandas\core\internals\managers.py", line 1773, in create_block_manager_from_arrays
blocks = _form_blocks(arrays, names, axes, consolidate)
File "C:\Program Files\Python37\lib\site-packages\pandas\core\internals\managers.py", line 1863, in _form_blocks
items_dict["ObjectBlock"], np.object_, consolidate=consolidate
File "C:\Program Files\Python37\lib\site-packages\pandas\core\internals\managers.py", line 1903, in _simple_blockify
values, placement = _stack_arrays(tuples, dtype)
File "C:\Program Files\Python37\lib\site-packages\pandas\core\internals\managers.py", line 1959, in _stack_arrays
stacked[i] = arr
ValueError: could not broadcast input array from shape (15,) into shape (18,)
All the lists/arrays in the dictionary must have the same length for the DataFrame constructor to accept the input.
This is not the case with your data:
{k:len(v) for k,v in country.items()}
output:
{'data source': 18,
'unnamed1': 18,
'unnamed2': 15,
'unnamed3': 18,
'unnamed4': 17}
Either trim the elements to the min length, or pad the shortest ones to the max length.
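A minimal sketch of the pad-to-max-length option (the helper name is mine; None is used as the filler, which pandas displays as NaN):

```python
import pandas as pd

def pad_to_max(d, filler=None):
    """Pad every list in the dict so all values share the maximum length."""
    n = max(len(v) for v in d.values())
    return {k: list(v) + [filler] * (n - len(v)) for k, v in d.items()}

# Shortened stand-in for the `country` dict with mismatched lengths.
country_small = {"data source": ["data", "brazil", "india"],
                 "unnamed2": [2016]}

padded = pad_to_max(country_small)
df = pd.DataFrame(padded)  # constructs without a broadcast error now
```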
Another option to circumvent this might be to use a dictionary of Series, which will do the padding job automatically:
df = pd.DataFrame({k:pd.Series(v) for k,v in country.items()})
output:
data source unnamed1 unnamed2 unnamed3 unnamed4
0 data nan 2016 nan area(sq.km)
1 country name country code population growth total population 8358140
2 brazil BRA 0.817556 207652865 39516
3 switzerland CHE 1.077221 8372098 348900
4 germany DEU 1.193867 82667685 42262
5 denmark DNK 0.834638 nan 500210
6 spain ESP -0.008048 46443959 547557
7 france FRA 0.407491 66896109 394560
8 japan JPN -0.115284 126994511 128900
9 greece GRC -0.687543 10746740 16287601
10 iran IRN 1.148789 80277428 nan
11 kuwait KWT 2.924206 4052584 446300
12 morocco MAR nan 35276786 910770
13 nigeria NGA 1.148215 185989640 11610
14 qatar QAT 1.18168 2569804 407310
15 sweden SWE NaN 9903122 2973190
16 india IND NaN 1324171354 129733172.7
17 world WLD NaN 7442135578 NaN
NB: you should clarify the output you expect, as your lists seem to mix labels and data.

ValueError: Length of values (1) does not match length of index (12797) - Indexes are the same length

So this is driving me crazy now, cause I really don't see the problem.
I have the following code:
dataframe.to_csv(f"user_data/candle_data.csv")
print (dataframe)
st12 = self.supertrend(dataframe, 3, 12)
st12['ST'].to_csv(f"user_data/st12.csv")
print (st12)
print(dataframe.index.difference(st12.index))
dataframe.loc[:, 'st_12'] = st12['ST'],
Checking the csv files, I can see that the first index is 0 and the last index is 12796. The last row is also on line number 12798. This is true for both files.
The output from three print is as follows
date open high low close volume
0 2020-12-29 21:45:00+00:00 723.33 726.14 723.26 725.05 3540.48612
1 2020-12-29 22:00:00+00:00 725.17 728.77 723.78 726.94 3983.90892
2 2020-12-29 22:15:00+00:00 726.94 727.30 724.72 724.75 3166.57435
3 2020-12-29 22:30:00+00:00 724.94 725.99 723.80 725.91 2848.08122
4 2020-12-29 22:45:00+00:00 725.99 730.30 725.95 729.64 6288.69499
... ... ... ... ... ... ...
12792 2021-05-12 03:45:00+00:00 4292.42 4351.85 4292.35 4332.81 24410.30155
12793 2021-05-12 04:00:00+00:00 4332.12 4347.60 4300.07 4343.05 16545.66776
12794 2021-05-12 04:15:00+00:00 4342.84 4348.00 4305.87 4313.82 10048.32828
12795 2021-05-12 04:30:00+00:00 4313.82 4320.68 4273.35 4287.49 13201.88547
12796 2021-05-12 04:45:00+00:00 4287.49 4306.79 4276.87 4300.80 9663.73327
[12797 rows x 6 columns]
ST STX
0 0.000000 nan
1 0.000000 nan
2 0.000000 nan
3 0.000000 nan
4 0.000000 nan
... ... ...
12792 4217.075684 up
12793 4217.075684 up
12794 4217.260609 up
12795 4217.260609 up
12796 4217.260609 up
[12797 rows x 2 columns]
RangeIndex(start=0, stop=0, step=1)
Full Error Traceback:
Traceback (most recent call last):
File "/freqtrade/freqtrade/main.py", line 37, in main
return_code = args['func'](args)
File "/freqtrade/freqtrade/commands/optimize_commands.py", line 53, in start_backtesting
backtesting.start()
File "/freqtrade/freqtrade/optimize/backtesting.py", line 479, in start
min_date, max_date = self.backtest_one_strategy(strat, data, timerange)
File "/freqtrade/freqtrade/optimize/backtesting.py", line 437, in backtest_one_strategy
preprocessed = self.strategy.ohlcvdata_to_dataframe(data)
File "/freqtrade/freqtrade/strategy/interface.py", line 670, in ohlcvdata_to_dataframe
return {pair: self.advise_indicators(pair_data.copy(), {'pair': pair})
File "/freqtrade/freqtrade/strategy/interface.py", line 670, in <dictcomp>
return {pair: self.advise_indicators(pair_data.copy(), {'pair': pair})
File "/freqtrade/freqtrade/strategy/interface.py", line 687, in advise_indicators
return self.populate_indicators(dataframe, metadata)
File "/freqtrade/user_data/strategies/TrippleSuperTrendStrategy.py", line 94, in populate_indicators
dataframe.loc[:, 'st_12'] = st12['ST'],
File "/home/ftuser/.local/lib/python3.9/site-packages/pandas/core/indexing.py", line 692, in __setitem__
iloc._setitem_with_indexer(indexer, value, self.name)
File "/home/ftuser/.local/lib/python3.9/site-packages/pandas/core/indexing.py", line 1597, in _setitem_with_indexer
self.obj[key] = value
File "/home/ftuser/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 3163, in __setitem__
self._set_item(key, value)
File "/home/ftuser/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 3242, in _set_item
value = self._sanitize_column(key, value)
File "/home/ftuser/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 3899, in _sanitize_column
value = sanitize_index(value, self.index)
File "/home/ftuser/.local/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 751, in sanitize_index
raise ValueError(
ValueError: Length of values (1) does not match length of index (12797)
ERROR: 1
So if both data frames have exactly the same amount of rows and the indexes are exactly the same, why am I getting this error?
There is a typo - the trailing comma. It should be:
dataframe.loc[:, 'st_12'] = st12['ST']
The comma at the end of your line wraps the right-hand side in a 1-element tuple, so pandas sees a single value where it expects 12797, hence "Length of values (1) does not match length of index (12797)".
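To see why the comma matters, here is a small stand-alone sketch (not the freqtrade code): in Python a trailing comma turns the right-hand side into a 1-element tuple, and pandas then treats it as one value rather than a column of values:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30]})
s = pd.Series([1, 2, 3])

rhs = s,  # trailing comma: this is a 1-element tuple, not a Series
assert isinstance(rhs, tuple) and len(rhs) == 1

# Without the comma, the assignment aligns by index and works as intended.
df.loc[:, "st_12"] = s
```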

Not able to read text data into a pandas dataframe in python

I am trying to read text data (copied from the question "Pandas: populate column with if condition not working as expected") into a dataframe. My code is:
dftxt = """
0 1 2
1 10/1/2016 'stringvalue' 456
2 NaN 'anothersting' NaN
3 NaN 'and another ' NaN
4 11/1/2016 'more strings' 943
5 NaN 'stringstring' NaN
"""
from io import StringIO
df = pd.read_csv(StringIO(dftxt), sep='\s+')
print (df)
But I am getting following error:
Traceback (most recent call last):
File "mydf.py", line 16, in <module>
df = pd.read_csv(StringIO(dftxt), sep='\s+')
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 401, in _read
data = parser.read()
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 939, in read
ret = self._engine.read(nrows)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1508, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 848, in pandas.parser.TextReader.read (pandas/parser.c:10415)
File "pandas/parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10691)
File "pandas/parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas/parser.c:11437)
File "pandas/parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:11308)
File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:27037)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 4 fields in line 5, saw 6
I can't understand which 6 fields are being read in error: Expected 4 fields in line 5, saw 6 . Where is the problem and how can it be solved?
Line 5 would be this one -
3     NaN   'and   another   '     NaN
1     2     3      4         5     6
The problem lies with your separator. It's interpreting each space separated word as a separate column. In this case, you'd need to
change your sep argument to \s{2,}, and
change your engine to 'python' to suppress warnings
df = pd.read_csv(StringIO(dftxt), sep=r'\s{2,}', engine='python')
Also, I'd get rid of the quotes (they're superfluous) using str.strip -
df.iloc[:, 1] = df.iloc[:, 1].str.strip("'")
df
0 1 2
1 10/1/2016 stringvalue 456.0
2 NaN anothersting NaN
3 NaN and another NaN
4 11/1/2016 more strings 943.0
5 NaN stringstring NaN
Lastly, from one pandas user to another, there's a little convenience function called pd.read_clipboard I think you should take a look at. It reads data from clipboard and accepts just about every argument that read_csv does.

Converting data type of a column in a csv file

I was trying to modify the data type of column in Python in Pycharm using Numpy and Pandas library but I am getting the following error.
dataset.fillna(1e6).astype(int)
D:\Softwares\Python3.6.1\python.exe D:/PythonPractice/DataPreprocessing/DataPreprocessing_1.py
     Country   Age   Salary Purchased
0     France  44.0  72000.0        No
1      Spain  27.0  48000.0       Yes
2    Germany  30.0  54000.0        No
3      Spain  38.0  61000.0        No
4    Germany  40.0      NaN       Yes
5     France  35.0  58000.0       Yes
6      Spain   NaN  52000.0        No
7     France  48.0  79000.0       Yes
8    Germany  50.0  83000.0        No
9     France  37.0  67000.0       Yes
Traceback (most recent call last):
File "D:/PythonPractice/DataPreprocessing/DataPreprocessing_1.py", line 6, in <module>
dataset.fillna(1e6).astype(int)
File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\util\_decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\generic.py", line 3299, in astype
**kwargs)
File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\internals.py", line 3224, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\internals.py", line 3091, in apply
applied = getattr(b, f)(**kwargs)
File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\internals.py", line 471, in astype
**kwargs)
File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\internals.py", line 521, in _astype
values = astype_nansafe(values.ravel(), dtype, copy=True)
File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\dtypes\cast.py", line 625, in astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas\_libs\lib.pyx", line 917, in pandas._libs.lib.astype_intsafe (pandas\_libs\lib.c:16260)
File "pandas\_libs\src\util.pxd", line 93, in util.set_value_at_unsafe (pandas\_libs\lib.c:73093)
ValueError: invalid literal for int() with base 10: 'France'
Your error message - ValueError: invalid literal for int() with base 10: 'France' - suggests the conversion is being applied to the Country column, whose contents are strings and can't be cast to integers. Restrict the cast to the numeric columns only.
You can't transform 'France' to an integer; map the labels to codes first:
dataset['Country'] = dataset['Country'].map({'France': 0, 'Spain': 1, 'Germany': 2})
then:
dataset['Country'] = dataset['Country'].astype(int)
If there is still an error like this:
ValueError: Cannot convert non-finite values (NA or inf) to integer
it is because there are NaN values in dataset['Country'] (map() returns NaN for any label missing from the dict).
Handle these NaN values with fillna() or dropna() and the error will go away.
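Putting the two steps together, here is a sketch on a hypothetical slice of the dataset, with a made-up extra label ('Italy') included to show where the NaN comes from:

```python
import pandas as pd

# Hypothetical slice of the dataset; 'Italy' is not in the mapping.
dataset = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Italy"]})

codes = dataset["Country"].map({"France": 0, "Spain": 1, "Germany": 2})
# map() yields NaN for unmapped labels, so fill before the int cast.
dataset["Country"] = codes.fillna(-1).astype(int)
```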

Error to find minimum of last column of pandas DataFrame in Python

I'm using read_csv() to read data from external .csv file. It's working fine. But whenever I try to find the minimum of the last column of that dataframe using np.min(...), it's giving lots of errors. But it's interesting that the same procedure is working for the rest of the columns that the dataframe has.
I'm attaching the code here.
import numpy as np
import pandas as pd
import os
data = pd.read_csv("test_data_v4.csv", sep = ",")
print(data)
The output is like below:
LINK_CAPACITY_KBPS THROUGHPUT_KBPS HOP_COUNT PACKET_LOSS JITTER_MS \
0 25 15.0 50 0.25 20
1 20 10.5 70 0.45 3
2 17 12.0 49 0.75 7
3 18 11.0 65 0.30 11
4 14 14.0 55 0.50 33
5 15 8.0 62 0.25 31
RSSI
0 -30
1 -11
2 -26
3 -39
4 -25
5 -65
np.min(data['RSSI'])
Now the error comes:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 1914, in __getitem__
return self._getitem_column(key)
File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 1921, in _getitem_column
return self._get_item_cache(key)
File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 1090, in _get_item_cache
values = self._data.get(item)
File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 3102, in get
loc = self.items.get_loc(item)
File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/index.py", line 1692, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)
File "pandas/index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)
File "pandas/hashtable.pyx", line 668, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)
File "pandas/hashtable.pyx", line 676, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)
KeyError: 'RSSI'
Following on DSM's comment, the header most likely contains stray whitespace (e.g. a trailing space in 'RSSI '), so the lookup data['RSSI'] fails even though the column looks right when printed. Try data.columns = data.columns.str.strip()
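A sketch reproducing the problem with a trailing space in the header (the CSV literal here is made up for illustration):

```python
import pandas as pd
from io import StringIO

# Note the space after RSSI in the header row.
csv = "LINK_CAPACITY_KBPS,RSSI \n25,-30\n20,-11"
data = pd.read_csv(StringIO(csv))

assert "RSSI" not in data.columns      # the actual column name is 'RSSI '
data.columns = data.columns.str.strip()  # normalize the header
lowest = data["RSSI"].min()              # the lookup works now
```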