With a pandas DataFrame, I can call to_string(float_format='%.1f') on it. However, applying the same method to df.describe() fails.
The following code demonstrates the issue.
>>> df = pd.DataFrame([[1, 2, 'March'],[5, 6, 'Dec'],[3, 4, 'April'], [0, 1, 'March']], columns=['a','b','m'])
>>> df
a b m
0 1 2 March
1 5 6 Dec
2 3 4 April
3 0 1 March
>>> df.to_string(float_format='%.1f')
u' a b m\n0 1 2 March\n1 5 6 Dec\n2 3 4 April\n3 0 1 March'
>>> df.describe().to_string(float_format='%.1f')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 1343, in to_string
formatter.to_string()
File "/Library/Python/2.7/site-packages/pandas/core/format.py", line 511, in to_string
strcols = self._to_str_columns()
File "/Library/Python/2.7/site-packages/pandas/core/format.py", line 439, in _to_str_columns
fmt_values = self._format_col(i)
File "/Library/Python/2.7/site-packages/pandas/core/format.py", line 693, in _format_col
space=self.col_space
File "/Library/Python/2.7/site-packages/pandas/core/format.py", line 1930, in format_array
return fmt_obj.get_result()
File "/Library/Python/2.7/site-packages/pandas/core/format.py", line 1946, in get_result
fmt_values = self._format_strings()
File "/Library/Python/2.7/site-packages/pandas/core/format.py", line 2022, in _format_strings
fmt_values = [self.formatter(x) for x in self.values]
TypeError: 'str' object is not callable
It works the first time because none of your columns are float. You can check that with df.dtypes:
In [37]: df.dtypes
Out[37]:
a int64
b int64
m object
dtype: object
From docs:
float_format : one-parameter function, optional
formatter function to apply to columns’ elements if they are floats, default None. The result of this function must be a unicode string.
So you need to pass a function, not a string:
df.describe().to_string(float_format=lambda x: '%.1f' % x)
or with .format:
df.describe().to_string(float_format=lambda x: "{:.1f}".format(x))
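Putting the pieces together as a runnable snippet (same sample frame as above):
import pandas as pd

df = pd.DataFrame([[1, 2, 'March'], [5, 6, 'Dec'], [3, 4, 'April'], [0, 1, 'March']],
                  columns=['a', 'b', 'm'])

# describe() produces float columns (count, mean, std, quartiles...), so this
# time the float_format callable really is applied
print(df.describe().to_string(float_format=lambda x: '%.1f' % x))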
Related
I want to calculate the anomaly of climate data. The code is as follows:
import pandas as pd
import numpy as np
import xarray as xr
date = pd.date_range('2000-01-01','2010-12-31') #4018 days
data = np.random.rand(len(date))
da = xr.DataArray(data=data,
dims='date',
coords=dict(date=date))
monthday = pd.MultiIndex.from_arrays([da['date.month'].values, da['date.day'].values])
da = da.assign_coords(monthday=('date',monthday)).groupby('monthday').mean(dim='date')
print(da)
<xarray.DataArray (monthday: 366)>
array([0.38151556, 0.46306277, 0.46148326, 0.35894069, 0.48318011,
0.44736969, 0.46828286, 0.44927365, 0.59294693, 0.61940206,
0.54264219, 0.51797117, 0.46200014, 0.50356122, 0.49371135,
...
0.44668478, 0.32583885, 0.36537256, 0.64087588, 0.56546472,
0.5021695 , 0.42450777, 0.49071572, 0.39639316, 0.53538823,
0.48345995, 0.46290486, 0.75160507, 0.4945804 , 0.52283262,
0.45320128])
Coordinates:
* monthday (monthday) MultiIndex
- monthday_level_0 (monthday) int64 1 1 1 1 1 1 1 1 ... 12 12 12 12 12 12 12
- monthday_level_1 (monthday) int64 1 2 3 4 5 6 7 8 ... 25 26 27 28 29 30 31
The monthday coordinate contains (2, 29), i.e., the leap day. How can I drop the leap day? I have tried the following, but it doesn't work:
da.drop_sel(monthday=(2,29))
Traceback (most recent call last):
File "/Users/osamuyuubu/anaconda3/envs/xesmf_env/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3441, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-65-caf7267f29a4>", line 11, in <module>
da.drop_sel(monthday=(2,29))
File "/Users/osamuyuubu/anaconda3/envs/xesmf_env/lib/python3.7/site-packages/xarray/core/dataarray.py", line 2374, in drop_sel
ds = self._to_temp_dataset().drop_sel(labels, errors=errors)
File "/Users/osamuyuubu/anaconda3/envs/xesmf_env/lib/python3.7/site-packages/xarray/core/dataset.py", line 4457, in drop_sel
new_index = index.drop(labels_for_dim, errors=errors)
File "/Users/osamuyuubu/anaconda3/envs/xesmf_env/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2201, in drop
loc = self.get_loc(level_codes)
File "/Users/osamuyuubu/anaconda3/envs/xesmf_env/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2922, in get_loc
loc = self._get_level_indexer(key, level=0)
File "/Users/osamuyuubu/anaconda3/envs/xesmf_env/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 3204, in _get_level_indexer
idx = self._get_loc_single_level_index(level_index, key)
File "/Users/osamuyuubu/anaconda3/envs/xesmf_env/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2855, in _get_loc_single_level_index
return level_index.get_loc(key)
File "/Users/osamuyuubu/anaconda3/envs/xesmf_env/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: 29
So, how could I achieve this using xr.drop_sel()?
Thanks in advance!
With drop_sel you need to give the exact value in the index; if you had grouped by dayofyear instead of monthday, that would be:
da.drop_sel(dayofyear=60)
But for non-leap years this would drop the 1st of March.
To safely drop every 29th of February, I would probably mask the original daily data (before the groupby), whose datetime coordinate is called date here:
mask = np.logical_and(da.date.dt.is_leap_year, da.date.dt.dayofyear == 60)
result = da.where(~mask, drop=True)
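If you would rather clean up the grouped array itself, one option is to mask the leap day through the MultiIndex level coordinates visible in the printout above (a sketch using the coordinate names monthday_level_0/monthday_level_1; untested on these exact versions):
# monthday_level_0 is the month, monthday_level_1 the day of the month
leap_mask = (da.monthday_level_0 == 2) & (da.monthday_level_1 == 29)
da_no_leap = da.where(~leap_mask, drop=True)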
I have 2 dataframes (df1 and df2) which look like:
df1
Quarter Body Total requests Requests Processed … Requests on-hold
Q3 2019 A 93 92 … 0
Q3 2019 B 228 210 … 0
Q3 2019 C 180 178 … 0
Q3 2019 D 31 31 … 0
Q3 2019 E 555 483 … 0
df2
Quarter Body Total requests Requests Processed … Requests on-hold
Q2 2019 A 50 50 … 0
Q2 2019 B 191 177 … 0
Q2 2019 C 186 185 … 0
Q2 2019 D 35 35 … 0
Q2 2019 E 344 297 … 0
I am trying to append df2 onto df1 to create df3:
df3
Quarter Body Total requests Requests Processed … Requests on-hold
Q3 2019 A 93 92 … 0
Q3 2019 B 228 210 … 0
Q3 2019 C 180 178 … 0
Q3 2019 D 31 31 … 0
Q3 2019 E 555 483 … 0
Q2 2019 A 50 50 … 0
Q2 2019 B 191 177 … 0
Q2 2019 C 186 185 … 0
Q2 2019 D 35 35 … 0
Q2 2019 E 344 297 … 0
using:
df3= df1.append(df2)
but get the error:
AttributeError: 'NoneType' object has no attribute 'is_extension'
the full error trace is:
File "<ipython-input-405-e3e0e047dbc0>", line 1, in <module>
runfile('C:/2019_Q3/Code.py', wdir='C:/2019_Q3')
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
execfile(filename, namespace)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/2019_Q3/Code.py", line 420, in <module>
main()
File "C:/2019_Q3/Code.py", line 319, in main
df3= df1.append(df2, ignore_index=True)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\frame.py", line 6692, in append
sort=sort)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\reshape\concat.py", line 229, in concat
return op.get_result()
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\reshape\concat.py", line 426, in get_result
copy=self.copy)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\internals\managers.py", line 2056, in concatenate_block_managers
elif is_uniform_join_units(join_units):
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\internals\concat.py", line 379, in is_uniform_join_units
all(not ju.is_na or ju.block.is_extension for ju in join_units) and
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\internals\concat.py", line 379, in <genexpr>
all(not ju.is_na or ju.block.is_extension for ju in join_units) and
AttributeError: 'NoneType' object has no attribute 'is_extension'
using:
df3= pd.concat([df1, df2], ignore_index=True)
gives me an error:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
the full error trace is:
Traceback (most recent call last):
File "<ipython-input-406-e3e0e047dbc0>", line 1, in <module>
runfile('C:/2019_Q3/Code.py', wdir='C:/2019_Q3')
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
execfile(filename, namespace)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/2019_Q3/Code.py", line 421, in <module>
main()
File "C:/2019_Q3/Code.py", line 321, in main
finalCSV = pd.concat([PreviousCSVdf, df], ignore_index=True)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\reshape\concat.py", line 228, in concat
copy=copy, sort=sort)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\reshape\concat.py", line 381, in __init__
self.new_axes = self._get_new_axes()
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\reshape\concat.py", line 448, in _get_new_axes
new_axes[i] = self._get_comb_axis(i)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\reshape\concat.py", line 469, in _get_comb_axis
sort=self.sort)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\indexes\api.py", line 70, in _get_objs_combined_axis
return _get_combined_index(obs_idxes, intersect=intersect, sort=sort)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\indexes\api.py", line 117, in _get_combined_index
index = _union_indexes(indexes, sort=sort)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\indexes\api.py", line 183, in _union_indexes
result = result.union(other)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\indexes\base.py", line 2332, in union
indexer = self.get_indexer(other)
File "C:\Anaconda_Python 3.7\2019.03\lib\site-packages\pandas\core\indexes\base.py", line 2740, in get_indexer
raise InvalidIndexError('Reindexing only valid with uniquely'
Both df1 and df2 have identical numbers of columns and column names. How would I append df1 and df2?
This tends to happen when you have duplicate columns in one or both of the datasets.
Also, for general use it's easier to go with pd.concat, since DataFrame.append was deprecated in pandas 1.4 and removed in 2.0:
pd.concat([df1, df2], ignore_index=True)  # ignore_index will reset the index for you
And for the InvalidIndexError, you can drop the duplicated columns first (it is the union of the column indexes that fails):
df1 = df1.loc[:, ~df1.columns.duplicated(keep='first')]
df2 = df2.loc[:, ~df2.columns.duplicated(keep='first')]
I'll make this short and sweet. I had this same issue.
The issue is not caused by duplicate column names but instead by duplicate column names with different data types.
Swapping to pd.concat will not fix this issue for you if you don't address the data types first.
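A minimal way to check for both conditions before concatenating, sketched on toy frames (df_a and df_b are illustrative, not the asker's data; df_a deliberately carries the column x twice):
import pandas as pd

# df_a has 'x' twice, which is exactly what trips up append/concat
df_a = pd.DataFrame([[1, 2, 3]], columns=['x', 'x', 'y'])
df_b = pd.DataFrame([[4, 5]], columns=['x', 'y'])

for name, df in [('df_a', df_a), ('df_b', df_b)]:
    dupes = df.columns[df.columns.duplicated()].unique()
    if len(dupes):
        print(name, 'has duplicated columns:', list(dupes))
    print(df.dtypes)  # compare dtypes of same-named columns across frames

# Keep the first occurrence of each column, then concatenate
df_a = df_a.loc[:, ~df_a.columns.duplicated(keep='first')]
df3 = pd.concat([df_a, df_b], ignore_index=True)
print(df3)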
I am trying to read text data (taken from the question Pandas: populate column with if condition not working as expected) into a dataframe. My code is:
dftxt = """
0 1 2
1 10/1/2016 'stringvalue' 456
2 NaN 'anothersting' NaN
3 NaN 'and another ' NaN
4 11/1/2016 'more strings' 943
5 NaN 'stringstring' NaN
"""
from io import StringIO
df = pd.read_csv(StringIO(dftxt), sep='\s+')
print (df)
But I am getting following error:
Traceback (most recent call last):
File "mydf.py", line 16, in <module>
df = pd.read_csv(StringIO(dftxt), sep='\s+')
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 401, in _read
data = parser.read()
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 939, in read
ret = self._engine.read(nrows)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1508, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 848, in pandas.parser.TextReader.read (pandas/parser.c:10415)
File "pandas/parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10691)
File "pandas/parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas/parser.c:11437)
File "pandas/parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:11308)
File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:27037)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 4 fields in line 5, saw 6
I can't understand which 6 fields are being read in the error Expected 4 fields in line 5, saw 6. Where is the problem and how can it be solved?
Line 5 would be this one -
3    NaN    'and    another    '    NaN
1    2      3       4          5    6
The problem lies with your separator. It's interpreting every space-separated word as a separate column. In this case, you'd need to
change your sep argument to r'\s{2,}' (two or more whitespace characters), and
change your engine to 'python' (regex separators fall back to the Python engine anyway; naming it explicitly suppresses the ParserWarning)
df = pd.read_csv(StringIO(dftxt), sep=r'\s{2,}', engine='python')
Also, I'd get rid of the quotes (they're superfluous) using str.strip -
df.iloc[:, 1] = df.iloc[:, 1].str.strip("'")
df
0 1 2
1 10/1/2016 stringvalue 456.0
2 NaN anothersting NaN
3 NaN and another NaN
4 11/1/2016 more strings 943.0
5 NaN stringstring NaN
Lastly, from one pandas user to another, there's a little convenience function called pd.read_clipboard that I think you should take a look at. It reads data from the clipboard and accepts just about every argument that read_csv does.
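For example, after copying a whitespace-padded table to the clipboard (hypothetical usage; read_clipboard forwards its keyword arguments to read_csv):
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')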
I was trying to modify the data type of a column in PyCharm using the NumPy and Pandas libraries, but I am getting the following error.
dataset.fillna(1e6).astype(int)
D:\Softwares\Python3.6.1\python.exe D:/PythonPractice/DataPreprocessing/DataPreprocessing_1.py
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
Traceback (most recent call last):
  File "D:/PythonPractice/DataPreprocessing/DataPreprocessing_1.py", line 6, in <module>
    dataset.fillna(1e6).astype(int)
  File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\util\_decorators.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\generic.py", line 3299, in astype
    **kwargs)
  File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\internals.py", line 3224, in astype
    return self.apply('astype', dtype=dtype, **kwargs)
  File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\internals.py", line 3091, in apply
    applied = getattr(b, f)(**kwargs)
  File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\internals.py", line 471, in astype
    **kwargs)
  File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\internals.py", line 521, in _astype
    values = astype_nansafe(values.ravel(), dtype, copy=True)
  File "D:\Softwares\Python3.6.1\lib\site-packages\pandas\core\dtypes\cast.py", line 625, in astype_nansafe
    return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
  File "pandas\_libs\lib.pyx", line 917, in pandas._libs.lib.astype_intsafe (pandas\_libs\lib.c:16260)
  File "pandas\_libs\src\util.pxd", line 93, in util.set_value_at_unsafe (pandas\_libs\lib.c:73093)
ValueError: invalid literal for int() with base 10: 'France'
Your error message - ValueError: invalid literal for int() with base 10: 'France' - suggests you're converting the Country column, whose contents are strings and can't be cast to integers. Try restricting the conversion to the numeric columns instead.
You can't transform 'France' to an integer; map the categories to numbers first:
dataset['Country'] = dataset['Country'].map({'France': 0, 'Spain': 1, 'Germany': 2})
then:
dataset['Country'] = dataset['Country'].astype(int)
If there is still an error like this:
ValueError: Cannot convert non-finite values (NA or inf) to integer
it is because there are NaN values in dataset['Country'] (for example, countries not covered by the map). Deal with those NaN values via fillna() or dropna() first, and the cast will succeed.
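The whole flow on a toy frame (illustrative data; -1 is an arbitrary placeholder for missing countries):
import pandas as pd

dataset = pd.DataFrame({'Country': ['France', 'Spain', 'Germany', None],
                        'Salary': [72000.0, 48000.0, 54000.0, None]})

# Map the categories to codes; unmapped or missing values become NaN
dataset['Country'] = dataset['Country'].map({'France': 0, 'Spain': 1, 'Germany': 2})

# Handle the NaN before the cast, otherwise astype(int) raises
dataset['Country'] = dataset['Country'].fillna(-1).astype(int)
print(dataset)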
I'm using read_csv() to read data from an external .csv file, and it works fine. But whenever I try to find the minimum of the last column of that dataframe using np.min(...), it throws a long error, even though the same procedure works for every other column of the dataframe.
I'm attaching the code here.
import numpy as np
import pandas as pd
import os
data = pd.read_csv("test_data_v4.csv", sep = ",")
print(data)
The output is shown below:
LINK_CAPACITY_KBPS THROUGHPUT_KBPS HOP_COUNT PACKET_LOSS JITTER_MS \
0 25 15.0 50 0.25 20
1 20 10.5 70 0.45 3
2 17 12.0 49 0.75 7
3 18 11.0 65 0.30 11
4 14 14.0 55 0.50 33
5 15 8.0 62 0.25 31
RSSI
0 -30
1 -11
2 -26
3 -39
4 -25
5 -65
np.min(data['RSSI'])
Now the error comes:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 1914, in __getitem__
    return self._getitem_column(key)
  File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/frame.py", line 1921, in _getitem_column
    return self._get_item_cache(key)
  File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py", line 1090, in _get_item_cache
    values = self._data.get(item)
  File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/internals.py", line 3102, in get
    loc = self.items.get_loc(item)
  File "/home/koushik_k/anaconda3/lib/python3.5/site-packages/pandas/core/index.py", line 1692, in get_loc
    return self._engine.get_loc(_values_from_object(key))
  File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)
  File "pandas/index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)
  File "pandas/hashtable.pyx", line 668, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)
  File "pandas/hashtable.pyx", line 676, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)
KeyError: 'RSSI'
Following on DSM's comment, the CSV header most likely contains stray whitespace (e.g. 'RSSI ' instead of 'RSSI'), so strip the column names first: data.columns = data.columns.str.strip()
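A minimal reproduction of the failure and the fix, assuming a trailing space in the last header field (hypothetical stand-in for the real file):
import pandas as pd
from io import StringIO

# Note the space after RSSI in the header line
raw = "THROUGHPUT_KBPS,RSSI \n15.0,-30\n10.5,-11\n12.0,-26\n"
data = pd.read_csv(StringIO(raw))

print(list(data.columns))   # ['THROUGHPUT_KBPS', 'RSSI '] -- hence KeyError: 'RSSI'
data.columns = data.columns.str.strip()
print(data['RSSI'].min())   # -30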