I'm trying to convert a dask dataframe to a pandas dataframe with the following code:
import dask.dataframe as dd
uri = "mysql+pymysql://myUser:myPassword@myHost:myPort/myDatabase"
dataframe = dd.read_sql_table("myTable", uri, "id", columns=["id", "name", "type_id"])
df = dataframe.fillna(0)
print(len(df.index))
However I'm facing the following error:
Traceback (most recent call last):
File "tmp.py", line 5, in <module>
print(len(df.index))
File "/home/user/.local/lib/python3.7/site-packages/dask/dataframe/core.py", line 593, in __len__
len, np.sum, token="len", meta=int, split_every=False
File "/home/user/.local/lib/python3.7/site-packages/dask/base.py", line 288, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/home/user/.local/lib/python3.7/site-packages/dask/base.py", line 570, in compute
results = schedule(dsk, keys, **kwargs)
File "/home/user/.local/lib/python3.7/site-packages/dask/threaded.py", line 87, in get
**kwargs
File "/home/user/.local/lib/python3.7/site-packages/dask/local.py", line 517, in get_async
raise_exception(exc, tb)
File "/home/user/.local/lib/python3.7/site-packages/dask/local.py", line 325, in reraise
raise exc
File "/home/user/.local/lib/python3.7/site-packages/dask/local.py", line 223, in execute_task
result = _execute_task(task, data)
File "/home/user/.local/lib/python3.7/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/home/user/.local/lib/python3.7/site-packages/dask/utils.py", line 35, in apply
return func(*args, **kwargs)
File "/home/user/.local/lib/python3.7/site-packages/dask/dataframe/io/sql.py", line 232, in _read_sql_chunk
return df.astype(meta.dtypes.to_dict(), copy=False)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 5683, in astype
col.astype(dtype=dtype[col_name], copy=copy, errors=errors)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/generic.py", line 5698, in astype
new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 582, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 442, in apply
applied = getattr(b, f)(**kwargs)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 625, in astype
values = astype_nansafe(vals1d, dtype, copy=True)
File "/home/user/.local/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 868, in astype_nansafe
raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
ValueError: Cannot convert non-finite values (NA or inf) to integer
The table I'm using has the following structure (retrieved using only pandas):
id name type_id
-------------------------
2 name_2 3.0
3 name_3 3.0
4 name_4 1.0
6 name_6 NaN
7 name_7 2.0
...
I tried the same code without retrieving the 'type_id' column and it works as expected.
What I don't understand is why the NaN values are not replaced by 0, since I call fillna(0) before trying to convert the dataframe.
If I look at my database with phpmyadmin the pandas 'NaN' values are 'NULL' values.
Why are the NaN values not replaced by 0?
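With df = dataframe.fillna(0) you are asking to fill NaNs in every column, which can be problematic. Note also that the traceback ends in _read_sql_chunk, where each chunk is cast to the dtypes dask expects for the table, so the error may be raised before fillna has a chance to run. Handling the column that contains NaNs explicitly might work: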
df = dataframe.copy()
df["type_id"] = df["type_id"].astype('float').fillna(0)
Another option is to try dd.to_numeric:
df["type_id"] = dd.to_numeric(df["type_id"], errors="coerce").fillna(0)
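The cast that fails in the traceback can be reproduced with pandas alone. This is only a minimal sketch, not dask's actual code path: the chunk frame below is a hypothetical stand-in for one partition read from the table, and it assumes the column was expected to be int64.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one partition returned by the SQL query.
chunk = pd.DataFrame({"id": [4, 6, 7], "type_id": [1.0, np.nan, 2.0]})

# Casting a float column that contains NaN straight to int64 raises ValueError.
try:
    chunk.astype({"type_id": "int64"})
    cast_error = None
except ValueError as exc:
    cast_error = exc

# Filling the missing values first makes the integer cast safe.
filled = chunk["type_id"].fillna(0).astype("int64")
print(filled.tolist())
```

If the cast happens while the data is being read, a fillna applied afterwards comes too late, which would explain why the column has to be treated as float (or filled) before any integer conversion.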
With a single-index dataframe, we can use loc to get, set, and change values:
>>> df=pd.DataFrame()
>>> df.loc['A',1]=1
>>> df
1
A 1.0
>>> df.loc['A',1]=2
>>> df.loc['A',1]
2.0
However, with a multiindex dataframe, loc can get and change values:
>>> df=pd.DataFrame([['A','B',1]])
>>> df=df.set_index([0,1])
>>> df.loc[('A','B'),2]
1
>>> df.loc[('A','B'),2]=3
>>> df.loc[('A','B'),2]
3
but setting them seems to fail:
>>> df=pd.DataFrame()
>>> df.loc[('A','B'),2]=3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files\Python39\lib\site-packages\pandas\core\indexing.py", line 688, in __setitem__
indexer = self._get_setitem_indexer(key)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\indexing.py", line 630, in _get_setitem_indexer
return self._convert_tuple(key, is_setter=True)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\indexing.py", line 754, in _convert_tuple
idx = self._convert_to_indexer(k, axis=i, is_setter=is_setter)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\indexing.py", line 1212, in _convert_to_indexer
return self._get_listlike_indexer(key, axis, raise_missing=True)[1]
File "C:\Program Files\Python39\lib\site-packages\pandas\core\indexing.py", line 1266, in _get_listlike_indexer
self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\indexing.py", line 1308, in _validate_read_indexer
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['A', 'B'], dtype='object')] are in the [index]"
Why is this, and what is the "right" way to use loc to set a single value in a multiindex dataframe?
This fails because you don't have the correct number of levels in the MultiIndex.
You need to initialize an empty DataFrame with the correct number of levels, for example using pandas.MultiIndex.from_arrays:
idx = pd.MultiIndex.from_arrays([[],[]])
df = pd.DataFrame(index=idx)
df.loc[('A','B'), 2] = 3
Output:
2
A B 3.0
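The difference in the number of levels can be checked directly. A small sketch, assuming nothing beyond pandas itself; the level names "outer" and "inner" are made up for illustration:

```python
import pandas as pd

# A default empty DataFrame has a flat, single-level index.
plain = pd.DataFrame()

# Two empty arrays give an empty index that still has two levels.
idx = pd.MultiIndex.from_arrays([[], []], names=["outer", "inner"])
multi = pd.DataFrame(index=idx)

print(plain.index.nlevels)
print(multi.index.nlevels)
```

With only one level available, a tuple key such as ('A', 'B') cannot be matched against two index levels, which is consistent with the KeyError above.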
With this example DataFrame: df = pd.DataFrame([['A-3', 'B-4'], ['C-box', 'D1-go']])
Calling extract on individual columns as series works fine:
df.iloc[:, 0].str.extract('-(.+)')
df.iloc[:, 1].str.extract('-(.+)')
and also on the other axis:
df.iloc[0, :].str.extract('-(.+)')
df.iloc[1, :].str.extract('-(.+)')
So, I'd expect using apply would work (by applying extract to each column):
df.apply(lambda s: s.str.extract('-(.+)'), axis=0)
But it throws this error:
Traceback (most recent call last):
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\IPython\core\interactiveshell.py", line 3325, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-588-70b1808d5457>", line 2, in <module>
df.apply(lambda s: s.str.extract('-(.+)'))
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\frame.py", line 6487, in apply
return op.get_result()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 151, in get_result
return self.apply_standard()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 260, in apply_standard
return self.wrap_results()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 308, in wrap_results
return self.wrap_results_for_axis()
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\apply.py", line 340, in wrap_results_for_axis
result = self.obj._constructor(data=results)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\frame.py", line 392, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\internals\construction.py", line 212, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\internals\construction.py", line 51, in arrays_to_mgr
index = extract_index(arrays)
File "C:\ProgramData\Miniconda3\envs\py3\lib\site-packages\pandas\core\internals\construction.py", line 308, in extract_index
raise ValueError('If using all scalar values, you must pass'
ValueError: If using all scalar values, you must pass an index
Using axis=1 yields an unexpected result, a Series with each row being a Series:
Out[2]:
0 0
0 3
1 4
1 0
0 box
1 go
dtype: object
I'm using apply because I think it would give the fastest execution time, but I'm open to other suggestions.
You can use split instead.
df.apply(lambda s: s.str.split('-', expand=True)[1])
Out[1]:
0 1
0 3 4
1 box go
The default value of expand in str.extract is True, in which case it returns a DataFrame. Since you are applying it to multiple columns, apply ends up with one DataFrame per column and cannot combine them. Set expand to False so that each call returns a Series instead:
df.apply(lambda x: x.str.extract('-(.*)', expand = False))
0 1
0 3 4
1 box go
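The effect of expand on the return type can be seen on a single column. A short sketch using the first column of the example data:

```python
import pandas as pd

s = pd.Series(["A-3", "C-box"])

# expand=True (the default) returns a DataFrame with one column per group.
as_frame = s.str.extract(r"-(.+)")

# expand=False with a single capture group returns a Series.
as_series = s.str.extract(r"-(.+)", expand=False)

print(type(as_frame).__name__, type(as_series).__name__)
```

apply can assemble one Series per column into a DataFrame, which is why the expand=False version works.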
I'll get straight to the point. Everyone knows that a column, say col = df['field'], is a 'pandas.core.series.Series', and that counts = df['field'].value_counts() is also a 'pandas.core.series.Series'.
You can also extract the value in the first row of a 'pandas.core.series.Series' with square brackets: col[0] or counts[0].
Nonetheless, the indexes of col and counts are different, and I think this is at the root of the problem I'm about to present.
I have the following 'pandas.core.series.Series' objects, generated by the code below.
First, read the data frame as df:
df = pd.read_csv('file.csv')
df has 'year' and 'product' columns; I take their unique values and convert them to strings:
vals_year = df['year'].astype('str').unique()
vals_product = df['product'].astype('str').unique()
This is the content in each variable:
>>> vals_year
['16' '18' '17']
>>> vals_product
['card' 'cash']
Then I use the value_counts() method to count the values, which creates 'pandas.core.series.Series' objects:
cy = df['year'].value_counts()
cp = df['product'].value_counts()
This is the output:
>>> cy
16    65
17    40
18    12
Name: year, dtype: int64
>>> cp
card    123
cash    106
Name: product, dtype: int64
Here is the first value of cp:
>>> cp[0]
123
But when I try to see the first value from cy this happens:
>>> cy[0]
Traceback (most recent call last):
File "C:.../Test3.py", line 44, in <module>
print(cr[0])
File "C:\...\venv\lib\site-packages\pandas\core\series.py", line 1064, in __getitem__
result = self.index.get_value(self, key)
File "C:\...\venv\lib\site-packages\pandas\core\indexes\base.py", line 4723, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 992, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
(I just copy-pasted the message.)
Why does this happen? It makes no sense! Is this a glitch in pandas? I believe the problem lies, as I said before, in the fact that the original values in the 'year' column were ints.
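A minimal sketch of what is probably going on, assuming the two Series are rebuilt by hand from the output shown above. cy has an integer index, so cy[0] is looked up as the label 0, which does not exist; cp has a string index, so the plain-bracket lookup was treated as a position. Using .iloc for positions and .loc for labels avoids the ambiguity:

```python
import pandas as pd

# Rebuilt by hand from the printed output; not read from file.csv.
cy = pd.Series([65, 40, 12], index=[16, 17, 18], name="year")
cp = pd.Series([123, 106], index=["card", "cash"], name="product")

# 0 is not one of the labels of cy, hence the KeyError for cy[0].
zero_is_label = 0 in cy.index

# .iloc always selects by position, whatever the index contains.
first_cy = cy.iloc[0]
first_cp = cp.iloc[0]

# .loc always selects by label.
count_for_16 = cy.loc[16]

print(zero_is_label, first_cy, first_cp, count_for_16)
```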
I have a data frame, df, of size 2x2. When I call df.boxplot() I get an IndexError: list index out of range error message:
Traceback (most recent call last):
File "my_code.py", line 155, in <module>
main()
File "my_code.py", line 135, in main
df.boxplot()
File "/server/software/rhel7/python27_pandas-0.19.2-mkl/lib/python2.7/site-packages/pandas/core/frame.py", line 5749, in boxplot
return_type=return_type, **kwds)
File "/server/software/rhel7/python27_pandas-0.19.2-mkl/lib/python2.7/site-packages/pandas/tools/plotting.py", line 2797, in boxplot
result = plot_group(columns, data.values.T, ax)
File "/server/software/rhel7/python27_pandas-0.19.2-mkl/lib/python2.7/site-packages/pandas/tools/plotting.py", line 2751, in plot_group
bp = ax.boxplot(values, **kwds)
File "/server/software/rhel7/python27_matplotlib-1.5.1-mkl/lib/python2.7/site-packages/matplotlib/__init__.py", line 1812, in inner
return func(ax, *args, **kwargs)
File "/server/software/rhel7/python27_matplotlib-1.5.1-mkl/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 3212, in boxplot
labels=labels)
File "/server/software/rhel7/python27_matplotlib-1.5.1-mkl/lib/python2.7/site-packages/matplotlib/cbook.py", line 1980, in boxplot_stats
X = _reshape_2D(X)
File "/server/software/rhel7/python27_matplotlib-1.5.1-mkl/lib/python2.7/site-packages/matplotlib/cbook.py", line 2245, in _reshape_2D
if not hasattr(X[0], '__len__'):
IndexError: list index out of range
Interestingly, if I do df.iloc[1,:] = [200, 210], the error disappears. However, running df.iloc[1,0] = 200; df.iloc[1,1] = 210 doesn't fix the error. What could the issue be?
print(df):
C_5 C_10
Date
0 100 150
1 200 210
print(df) looks the same after df.iloc[1,:] = [200, 210] or after df.iloc[1,0] = 200; df.iloc[1,1] = 210 (which is expected).
Looking at print('df.dtypes: \n{0}'.format(df.dtypes)), the issue is that the dtype of one of the columns was object:
df.dtypes:
C_5 float64
C_10 object
whereas it should be:
df.dtypes:
C_5 float64
C_10 float64
otherwise you'll get the very explicit error message IndexError: list index out of range.
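One way to check for and repair the dtype before plotting is sketched below. The frame is a hypothetical reconstruction of the 2x2 example (the object column is created on purpose), and astype is just one of several ways to convert:

```python
import pandas as pd

# Hypothetical reconstruction: C_10 holds numbers but is stored as object.
df = pd.DataFrame({
    "C_5": pd.Series([100.0, 200.0]),
    "C_10": pd.Series([150, 210], dtype=object),
})
dtypes_before = df.dtypes.astype(str).tolist()

# Convert every column to float64 so that plotting sees numeric data.
df = df.astype("float64")
dtypes_after = df.dtypes.astype(str).tolist()

print(dtypes_before, dtypes_after)
```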
I have a text file that looks like this:
# Pearson correlation [n=344 #col=2]
# Name Name Value BiasCorr 2.50% 97.50% N: 2.50% N:97.50%
# --------------- --------------- -------- -------- -------- -------- -------- --------
101_DGCA3.1D[0] 101_LEC.1D[0] +0.85189 +0.85071 +0.81783 +0.87777 +0.82001 +0.87849
I have loaded it into python pandas using the following code:
import pandas as pd
data = pd.read_table('test.txt')
print data
However, I can't seem to access the different columns separately. I have tried using sep=' ' and copying the spaces between the columns in the text file, but I still don't get any column names and trying to print data[0] gives me an error:
Traceback (most recent call last):
File "cut_afni_output.py", line 3, in <module>
print data[0]
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 1969, in __getitem__
return self._getitem_column(key)
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 1976, in _getitem_column
return self._get_item_cache(key)
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 1091, in _get_item_cache
values = self._data.get(item)
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3211, in get
loc = self.items.get_loc(item)
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/index.py", line 1759, in get_loc
return self._engine.get_loc(key)
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)
File "pandas/index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)
File "pandas/hashtable.pyx", line 668, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)
File "pandas/hashtable.pyx", line 676, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)
KeyError: 0
I haven't been able to set the header row manually because it seems like python views the whole thing as one column. How do I make the text file be read in as separate columns that I can call?
Try this:
In [33]: df = pd.read_csv(filename, comment='#', header=None, delim_whitespace=True)
In [34]: df
Out[34]:
0 1 2 3 4 5 6 7
0 101_DGCA3.1D[0] 101_LEC.1D[0] 0.85189 0.85071 0.81783 0.87777 0.82001 0.87849
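The same idea can be tried without a file on disk. The sketch below is self-contained and makes a few assumptions: the text is inlined through io.StringIO, the column names are made up from the commented header line, and sep=r"\s+" is used in place of delim_whitespace=True, which recent pandas versions deprecate:

```python
import io
import pandas as pd

text = """# Pearson correlation [n=344 #col=2]
# Name Name Value BiasCorr 2.50% 97.50% N: 2.50% N:97.50%
# --------------- --------------- -------- -------- -------- -------- -------- --------
101_DGCA3.1D[0] 101_LEC.1D[0] +0.85189 +0.85071 +0.81783 +0.87777 +0.82001 +0.87849
"""

# Hypothetical column names, loosely based on the commented header line.
names = ["Name1", "Name2", "Value", "BiasCorr",
         "2.50%", "97.50%", "N:2.50%", "N:97.50%"]

# Lines starting with '#' are skipped; columns are split on runs of whitespace.
df = pd.read_csv(io.StringIO(text), comment="#", header=None,
                 sep=r"\s+", names=names)

print(df["Value"])
```

With names assigned, each column can be selected as df["Value"], df["BiasCorr"], and so on.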