I am trying to read the following text data into a pandas dataframe. My code is:
dftxt = """
0 1 2
1 10/1/2016 'stringvalue' 456
2 NaN 'anothersting' NaN
3 NaN 'and another ' NaN
4 11/1/2016 'more strings' 943
5 NaN 'stringstring' NaN
"""
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(dftxt), sep='\s+')
print (df)
But I am getting the following error:
Traceback (most recent call last):
File "mydf.py", line 16, in <module>
df = pd.read_csv(StringIO(dftxt), sep='\s+')
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 401, in _read
data = parser.read()
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 939, in read
ret = self._engine.read(nrows)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1508, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 848, in pandas.parser.TextReader.read (pandas/parser.c:10415)
File "pandas/parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10691)
File "pandas/parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas/parser.c:11437)
File "pandas/parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:11308)
File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:27037)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 4 fields in line 5, saw 6
I can't understand which 6 fields are being counted in the error Expected 4 fields in line 5, saw 6. Where is the problem and how can it be solved?
Line 5 would be this one -
3     NaN     'and     another     '     NaN
1      2        3         4        5      6
The problem lies with your separator: it's interpreting each space-separated word as a separate column, so the single spaces inside 'and another ' produce extra fields. In this case, you'd need to
change your sep argument to \s{2,} (two or more whitespace characters), and
change your engine to 'python' (the C engine doesn't support regex separators) to suppress warnings
df = pd.read_csv(StringIO(dftxt), sep=r'\s{2,}', engine='python')
Also, I'd get rid of the quotes (they're superfluous) using str.strip -
df.iloc[:, 1] = df.iloc[:, 1].str.strip("'")
df
0 1 2
1 10/1/2016 stringvalue 456.0
2 NaN anothersting NaN
3 NaN and another NaN
4 11/1/2016 more strings 943.0
5 NaN stringstring NaN
Lastly, from one pandas user to another, there's a little convenience function called pd.read_clipboard I think you should take a look at. It reads data from clipboard and accepts just about every argument that read_csv does.
So this is driving me crazy now, because I really don't see the problem.
I have the following code:
dataframe.to_csv(f"user_data/candle_data.csv")
print (dataframe)
st12 = self.supertrend(dataframe, 3, 12)
st12['ST'].to_csv(f"user_data/st12.csv")
print (st12)
print(dataframe.index.difference(st12.index))
dataframe.loc[:, 'st_12'] = st12['ST'],
Checking the csv files, I can see that the first index is 0 and the last index is 12796. The last row is also on line number 12798. This is true for both files.
The output from the three prints is as follows:
date open high low close volume
0 2020-12-29 21:45:00+00:00 723.33 726.14 723.26 725.05 3540.48612
1 2020-12-29 22:00:00+00:00 725.17 728.77 723.78 726.94 3983.90892
2 2020-12-29 22:15:00+00:00 726.94 727.30 724.72 724.75 3166.57435
3 2020-12-29 22:30:00+00:00 724.94 725.99 723.80 725.91 2848.08122
4 2020-12-29 22:45:00+00:00 725.99 730.30 725.95 729.64 6288.69499
... ... ... ... ... ... ...
12792 2021-05-12 03:45:00+00:00 4292.42 4351.85 4292.35 4332.81 24410.30155
12793 2021-05-12 04:00:00+00:00 4332.12 4347.60 4300.07 4343.05 16545.66776
12794 2021-05-12 04:15:00+00:00 4342.84 4348.00 4305.87 4313.82 10048.32828
12795 2021-05-12 04:30:00+00:00 4313.82 4320.68 4273.35 4287.49 13201.88547
12796 2021-05-12 04:45:00+00:00 4287.49 4306.79 4276.87 4300.80 9663.73327
[12797 rows x 6 columns]
ST STX
0 0.000000 nan
1 0.000000 nan
2 0.000000 nan
3 0.000000 nan
4 0.000000 nan
... ... ...
12792 4217.075684 up
12793 4217.075684 up
12794 4217.260609 up
12795 4217.260609 up
12796 4217.260609 up
[12797 rows x 2 columns]
RangeIndex(start=0, stop=0, step=1)
Full Error Traceback:
Traceback (most recent call last):
File "/freqtrade/freqtrade/main.py", line 37, in main
return_code = args['func'](args)
File "/freqtrade/freqtrade/commands/optimize_commands.py", line 53, in start_backtesting
backtesting.start()
File "/freqtrade/freqtrade/optimize/backtesting.py", line 479, in start
min_date, max_date = self.backtest_one_strategy(strat, data, timerange)
File "/freqtrade/freqtrade/optimize/backtesting.py", line 437, in backtest_one_strategy
preprocessed = self.strategy.ohlcvdata_to_dataframe(data)
File "/freqtrade/freqtrade/strategy/interface.py", line 670, in ohlcvdata_to_dataframe
return {pair: self.advise_indicators(pair_data.copy(), {'pair': pair})
File "/freqtrade/freqtrade/strategy/interface.py", line 670, in <dictcomp>
return {pair: self.advise_indicators(pair_data.copy(), {'pair': pair})
File "/freqtrade/freqtrade/strategy/interface.py", line 687, in advise_indicators
return self.populate_indicators(dataframe, metadata)
File "/freqtrade/user_data/strategies/TrippleSuperTrendStrategy.py", line 94, in populate_indicators
dataframe.loc[:, 'st_12'] = st12['ST'],
File "/home/ftuser/.local/lib/python3.9/site-packages/pandas/core/indexing.py", line 692, in __setitem__
iloc._setitem_with_indexer(indexer, value, self.name)
File "/home/ftuser/.local/lib/python3.9/site-packages/pandas/core/indexing.py", line 1597, in _setitem_with_indexer
self.obj[key] = value
File "/home/ftuser/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 3163, in __setitem__
self._set_item(key, value)
File "/home/ftuser/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 3242, in _set_item
value = self._sanitize_column(key, value)
File "/home/ftuser/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 3899, in _sanitize_column
value = sanitize_index(value, self.index)
File "/home/ftuser/.local/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 751, in sanitize_index
raise ValueError(
ValueError: Length of values (1) does not match length of index (12797)
ERROR: 1
So if both data frames have exactly the same amount of rows and the indexes are exactly the same, why am I getting this error?
There is a typo: the trailing comma after st12['ST'] wraps the Series in a one-element tuple, so pandas tries to fit a value of length 1 into 12797 rows, which is exactly what the error says. Remove the comma:
dataframe.loc[:, 'st_12'] = st12['ST']
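A minimal reproduction of the length mismatch (toy data, not the freqtrade frames):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
st = pd.Series([10, 20, 30])

# With the trailing comma the right-hand side is the tuple (st,), i.e. a
# value of length 1, which cannot be aligned with the 3-row index.
try:
    df.loc[:, 'b'] = st,
except ValueError as e:
    print(e)  # pandas reports the length mismatch, e.g. "Length of values (1) ..."

# Without the comma the Series aligns on the index and assigns cleanly.
df.loc[:, 'b'] = st
print(df['b'].tolist())
```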
I have two dataframes I'm trying to merge:
target:
version city_id code
id
4 2 4 5736201000175
26 2 3 8290265000183
27 3 3 9529184000156
30 3 3 9263064000150
34 2 3 9312770000144
54 1 3 8407830000140
55 1 3 5590100000139
city:
federation_unit_id name
id
3 8 SAO PAULO
4 8 CAMPINAS
7 8 BARUERI
8 8 BEBEDOURO
9 8 SANTOS
I want to merge them combining target's "city_id" with city's "id", in a way that the final dataframe looks like this:
target:
version city_id code federation_unit_id name
id
4 2 4 5736201000175 8 CAMPINAS
26 2 3 8290265000183 8 SAO PAULO
27 3 3 9529184000156 8 SAO PAULO
30 3 3 9263064000150 8 SAO PAULO
34 2 3 9312770000144 8 SAO PAULO
54 1 3 8407830000140 8 SAO PAULO
55 1 3 5590100000139 8 SAO PAULO
To achieve that, I'm trying to use the following code:
target=target.merge(city, left_on='city_id', right_on='id')
However, it keeps giving me the following KeyError:
Traceback (most recent call last):
File "/file.py", line 12, in <module>
target=target.merge(city, left_on='index', right_on='city_id')
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/frame.py", line 4437, in merge
copy=copy, indicator=indicator)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/tools/merge.py", line 38, in merge
copy=copy, indicator=indicator)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/tools/merge.py", line 210, in __init__
self.join_names) = self._get_merge_keys()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/tools/merge.py", line 434, in _get_merge_keys
right_keys.append(right[rk]._values)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/frame.py", line 1997, in __getitem__
return self._getitem_column(key)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/frame.py", line 2004, in _getitem_column
return self._get_item_cache(key)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/generic.py", line 1350, in _get_item_cache
values = self._data.get(item)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/internals.py", line 3290, in get
loc = self.items.get_loc(item)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/indexes/base.py", line 1947, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)
File "pandas/index.pyx", line 159, in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)
File "pandas/hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)
File "pandas/hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)
KeyError: 'id'
I can't figure out what I'm doing wrong :/
Can someone help with that?
You can use join:
target.join(city, on='city_id')
join is inherently index oriented. However, you can specify an alternative column to join on in the dataframe that constitutes the left side. If we call the join method on target then we want to specify 'city_id' as that alternative column. The city dataframe already has the appropriate index.
The id in the city dataframe is its index, not a column, so try setting right_index=True:
target.merge(city, left_on='city_id', right_index=True)
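Both suggestions can be checked on a small cut of the data (values abridged from the question):

```python
import pandas as pd

target = pd.DataFrame(
    {'version': [2, 2, 3], 'city_id': [4, 3, 3],
     'code': [5736201000175, 8290265000183, 9529184000156]},
    index=pd.Index([4, 26, 27], name='id'))
city = pd.DataFrame(
    {'federation_unit_id': [8, 8], 'name': ['SAO PAULO', 'CAMPINAS']},
    index=pd.Index([3, 4], name='id'))

# merge: match target's 'city_id' column against city's index; since one
# side joins on an index, target's index is carried through to the result.
merged = target.merge(city, left_on='city_id', right_index=True)
print(merged)

# join: equivalent here, because join always targets the right frame's
# index, with 'on' naming the column on the calling (left) side.
joined = target.join(city, on='city_id')
print(joined)
```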
I was trying to modify the data type of a column in PyCharm using the NumPy and Pandas libraries, but I am getting the following error. The offending line is:
dataset.fillna(1e6).astype(int)
D:\Softwares\Python3.6.1\python.exe D:/PythonPractice/DataPreprocessing/DataPreprocessing_1.py
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
Traceback (most recent call last):
  File "D:/PythonPractice/DataPreprocessing/DataPreprocessing_1.py", line 6, in <module>
    dataset.fillna(1e6).astype(int)
  File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\util\_decorators.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\generic.py", line 3299, in astype
    **kwargs)
  File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\internals.py", line 3224, in astype
    return self.apply('astype', dtype=dtype, **kwargs)
  File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\internals.py", line 3091, in apply
    applied = getattr(b, f)(**kwargs)
  File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\internals.py", line 471, in astype
    **kwargs)
  File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\internals.py", line 521, in _astype
    values = astype_nansafe(values.ravel(), dtype, copy=True)
  File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\dtypes\cast.py", line 625, in astype_nansafe
    return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
  File "pandas\_libs\lib.pyx", line 917, in pandas._libs.lib.astype_intsafe (pandas\_libs\lib.c:16260)
  File "pandas\_libs\src\util.pxd", line 93, in util.set_value_at_unsafe (pandas\_libs\lib.c:73093)
ValueError: invalid literal for int() with base 10: 'France'
Your error message - ValueError: invalid literal for int() with base 10: 'France' - shows the conversion is hitting the Country column, whose contents are strings and can't be cast to integers. Restrict the astype(int) to the numeric columns.
You can't convert 'France' to an integer; you should first encode it:
dataset['Country'] = dataset['Country'].map({'France': 0, 'Spain': 1, 'Germany': 2})
then:
dataset['Country'].astype(int)
If there is still an error like this:
ValueError: Cannot convert non-finite values (NA or inf) to integer
it is because there are NaN values in dataset['Country']. Deal with those NaNs using fillna() or dropna() and the conversion will work.
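A minimal sketch of that mapping, including the NaN case (a toy frame modeled on the question's columns; the -1 sentinel for missing countries is an arbitrary choice):

```python
import pandas as pd

dataset = pd.DataFrame({'Country': ['France', 'Spain', 'Germany', None],
                        'Age': [44.0, 27.0, 30.0, None]})

# Map country names to integer codes; unmapped values (here None) become NaN.
dataset['Country'] = dataset['Country'].map({'France': 0, 'Spain': 1, 'Germany': 2})

# astype(int) would fail on NaN, so fill first (with -1 as a sentinel code).
dataset['Country'] = dataset['Country'].fillna(-1).astype(int)
print(dataset['Country'].tolist())  # [0, 1, 2, -1]
```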
The last step of my dataframe is to convert all NaN values to 0 (zero). My dataframe contains more than 1000 columns, some are text, some are integers, and some are floats.
To convert NaN to 0, I use the following command:
#replace nan in columns with 0
nan_cols = df5c.columns[df5c.isnull().any(axis=0)]
for col in nan_cols:
    df5c[col] = df5c[col].fillna(0).astype(int)
This worked fine, until I added a new column with new data, which gives the following error:
Traceback (most recent call last):
File "pythonscript_v8.py", line 233, in <module>
df5c[col] = df5c[col].fillna(0).astype(int)
File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 2632, in astype
dtype=dtype, copy=copy, raise_on_error=raise_on_error, **kwargs)
File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 2864, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 2823, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 430, in astype
values=values, **kwargs)
File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 472, in _astype
values = com._astype_nansafe(values.ravel(), dtype, copy=True)
File "/usr/lib/python3/dist-packages/pandas/core/common.py", line 2463, in _astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas/lib.pyx", line 935, in pandas.lib.astype_intsafe (pandas/lib.c:16612)
File "pandas/src/util.pxd", line 60, in util.set_value_at (pandas/lib.c:66830)
ValueError: invalid literal for int() with base 10: 'NODE_1_length_402490_cov_43.5825_ID_1'
What does this error mean, and how can I solve this?
My dataframe looks like this:
source contigID contig_length SCM/genes plasmid_genes/genes A053_1 parA_1
COLS157_1 NODE_1_length_402490_cov_43.5825_ID_1 402490 0.87 0.95 NaN NaN
COLS157_10 NODE_10_length_218177_cov_45.105_ID_19 218177 0.79 0.97 NaN NaN
COLS157_100 NODE_157_length_248_cov_34.4628_ID_313 248 NaN NaN NaN NaN
COLS157_11 NODE_11_length_176130_cov_51.1495_ID_21 176130 0.75 0.86 NaN NaN
COLS157_12 NODE_12_length_165446_cov_50.2044_ID_23 165446 0.77 0.88 NaN NaN
If I'm understanding the problem correctly, this will work. The error means astype(int) ran into a string value ('NODE_1_length_402490_cov_43.5825_ID_1', from your new column), which can't be converted to an integer, so fill the NaNs without the cast:
nan_cols = df5c.columns[df5c.isnull().any(axis=0)]
for col in nan_cols:
    for i in range(len(df5c)):
        if pd.isnull(df5c.loc[i, col]):
            df5c.loc[i, col] = 0
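Alternatively, a vectorized sketch that fills every NaN with 0 but only integer-casts columns that are genuinely numeric, so string columns like the contig IDs are left alone (two toy columns stand in for the real dataframe):

```python
import pandas as pd

df5c = pd.DataFrame({'contigID': ['NODE_1_length_402490_cov_43.5825_ID_1', None],
                     'contig_length': [402490.0, None]})

nan_cols = df5c.columns[df5c.isnull().any(axis=0)]
for col in nan_cols:
    filled = df5c[col].fillna(0)
    # Only cast genuinely numeric columns to int; text columns keep their strings.
    if pd.api.types.is_numeric_dtype(filled):
        filled = filled.astype(int)
    df5c[col] = filled
print(df5c)
```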
I'm using read_csv() to read data from an external .csv file. It's working fine. But whenever I try to find the minimum of the last column of that dataframe using np.min(...), it gives lots of errors. Interestingly, the same procedure works for the rest of the columns in the dataframe.
I'm attaching the code here.
import numpy as np
import pandas as pd
import os
data = pd.read_csv("test_data_v4.csv", sep = ",")
print(data)
The output is like below:
LINK_CAPACITY_KBPS THROUGHPUT_KBPS HOP_COUNT PACKET_LOSS JITTER_MS \
0 25 15.0 50 0.25 20
1 20 10.5 70 0.45 3
2 17 12.0 49 0.75 7
3 18 11.0 65 0.30 11
4 14 14.0 55 0.50 33
5 15 8.0 62 0.25 31
RSSI
0 -30
1 -11
2 -26
3 -39
4 -25
5 -65
np.min(data['RSSI'])
Now the error comes:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 1914, in __getitem__
    return self._getitem_column(key)
  File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 1921, in _getitem_column
    return self._get_item_cache(key)
  File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 1090, in _get_item_cache
    values = self._data.get(item)
  File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 3102, in get
    loc = self.items.get_loc(item)
  File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/index.py", line 1692, in get_loc
    return self._engine.get_loc(_values_from_object(key))
  File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)
  File "pandas/index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)
  File "pandas/hashtable.pyx", line 668, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)
  File "pandas/hashtable.pyx", line 676, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)
KeyError: 'RSSI'
Following on DSM's comment: the KeyError suggests the CSV header contains stray whitespace around the column name (e.g. ' RSSI'), so data['RSSI'] finds no exact match. Try data.columns = data.columns.str.strip()
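A quick reproduction of the failure mode (made-up CSV text; the real file presumably has a similar stray space in its header):

```python
from io import StringIO
import pandas as pd

# Note the space after the comma in the header row.
csv_text = "LINK_CAPACITY_KBPS, RSSI\n25,-30\n20,-11\n"
data = pd.read_csv(StringIO(csv_text), sep=",")

print(list(data.columns))  # ['LINK_CAPACITY_KBPS', ' RSSI'] - note the leading space

# Stripping the column names makes data['RSSI'] resolve again.
data.columns = data.columns.str.strip()
print(data['RSSI'].min())  # -30
```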