In the dataframe below:
T2MN T2MX RH2M DFP2M RAIN
6.96 9.32 84.27 5.57 -
6.31 10.46 - 5.63 -
- 10.66 79.38 3.63 -
0.79 4.45 94.24 1.85 -
1.45 3.99 91.71 1.17 -
How do I replace all the '-' with NaNs? I do not want to specify column names, since I do not know beforehand which columns will contain a '-'.
If those are strings, then your floats are probably also strings.
Assuming your dataframe is df, I'd try
pd.to_numeric(df.stack(), errors='coerce').unstack()
Deeper explanation
Pandas doesn't usually represent missing floats with '-'. Therefore, that '-' must be a string. Thus, the dtype of any column with a '-' in it must be 'object'. That makes it highly likely that whatever parsed the data left the floats as strings.
setup
from io import StringIO
import pandas as pd
txt = """T2MN T2MX RH2M DFP2M RAIN
6.96 9.32 84.27 5.57 -
6.31 10.46 - 5.63 -
- 10.66 79.38 3.63 -
0.79 4.45 94.24 1.85 -
1.45 3.99 91.71 1.17 - """
df = pd.read_csv(StringIO(txt), delim_whitespace=True)
print(df)
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 -
1 6.31 10.46 - 5.63 -
2 - 10.66 79.38 3.63 -
3 0.79 4.45 94.24 1.85 -
4 1.45 3.99 91.71 1.17 -
What are the dtypes?
print(df.dtypes)
T2MN object
T2MX float64
RH2M object
DFP2M float64
RAIN object
dtype: object
What is the type of the first element?
print(type(df.iloc[0, 0]))
<class 'str'>
This means that any column with a '-' is a column of strings that look like floats. You want to use pd.to_numeric with the parameter errors='coerce' to force non-numeric items to np.nan. However, pd.to_numeric does not operate on a pd.DataFrame, so we stack and unstack.
pd.to_numeric(df.stack(), errors='coerce').unstack()
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 NaN
1 6.31 10.46 NaN 5.63 NaN
2 NaN 10.66 79.38 3.63 NaN
3 0.79 4.45 94.24 1.85 NaN
4 1.45 3.99 91.71 1.17 NaN
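An alternative sketch (not from the original answer): since pd.to_numeric works on one Series at a time, you can also apply it to each column with DataFrame.apply and skip the stack/unstack round trip:
# Apply pd.to_numeric column by column; errors='coerce' turns
# any non-numeric string such as '-' into NaN.
df_numeric = df.apply(pd.to_numeric, errors='coerce')
print(df_numeric.dtypes)  # every column should now be float64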
Just replace() the string:
In [10]: df.replace('-', 'NaN')
Out[10]:
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 NaN
1 6.31 10.46 NaN 5.63 NaN
2 NaN 10.66 79.38 3.63 NaN
3 0.79 4.45 94.24 1.85 NaN
4 1.45 3.99 91.71 1.17 NaN
I think you want the actual numpy.nan instead of the string 'NaN', since then you can use methods such as fillna/isnull/notnull on the pandas.Series/pandas.DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame([['-']*10]*10)
df = df.replace('-', np.nan)
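A quick check (assuming the small all-'-' frame built above) that the replaced values are real NaNs:
print(df.isnull().all().all())  # True: every cell is now a genuine NaN
print(df.fillna(0).iloc[0, 0])  # 0: fillna now works as expected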
It looks like you were reading this data from a CSV/FWF file... If so, the easiest way to get rid of '-' is to tell pandas that it is a representation of NaN:
df = pd.read_csv(filename, na_values=['NaN', 'nan', '-'])
Test:
In [79]: df
Out[79]:
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 NaN
1 6.31 10.46 NaN 5.63 NaN
2 NaN 10.66 79.38 3.63 NaN
3 0.79 4.45 94.24 1.85 NaN
4 1.45 3.99 91.71 1.17 NaN
In [80]: df.dtypes
Out[80]:
T2MN float64
T2MX float64
RH2M float64
DFP2M float64
RAIN float64
dtype: object
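A side note: by default read_csv keeps its built-in NA markers (keep_default_na=True), which already include 'NaN' and 'nan', so passing only the extra marker works too:
df = pd.read_csv(filename, na_values=['-'])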
Related
I have Excel files with multiple sheets, each of which looks a little like this (but much longer):
Sample CD4 CD8
Day 1 8311 17.3 6.44
8312 13.6 3.50
8321 19.8 5.88
8322 13.5 4.09
Day 2 8311 16.0 4.92
8312 5.67 2.28
8321 13.0 4.34
8322 10.6 1.95
The first column is actually four cells merged vertically.
When I read this using pandas.read_excel, I get a DataFrame that looks like this:
Sample CD4 CD8
Day 1 8311 17.30 6.44
NaN 8312 13.60 3.50
NaN 8321 19.80 5.88
NaN 8322 13.50 4.09
Day 2 8311 16.00 4.92
NaN 8312 5.67 2.28
NaN 8321 13.00 4.34
NaN 8322 10.60 1.95
How can I either get Pandas to understand merged cells, or quickly and easily remove the NaN and group by the appropriate value? (One approach would be to reset the index, step through to find the values and replace NaNs with values, pass in the list of days, then set the index to the column. But it seems like there should be a simpler approach.)
You could use the Series.fillna method to forward-fill the NaN values:
df.index = pd.Series(df.index).fillna(method='ffill')
For example,
In [42]: df
Out[42]:
Sample CD4 CD8
Day 1 8311 17.30 6.44
NaN 8312 13.60 3.50
NaN 8321 19.80 5.88
NaN 8322 13.50 4.09
Day 2 8311 16.00 4.92
NaN 8312 5.67 2.28
NaN 8321 13.00 4.34
NaN 8322 10.60 1.95
[8 rows x 3 columns]
In [43]: df.index = pd.Series(df.index).fillna(method='ffill')
In [44]: df
Out[44]:
Sample CD4 CD8
Day 1 8311 17.30 6.44
Day 1 8312 13.60 3.50
Day 1 8321 19.80 5.88
Day 1 8322 13.50 4.09
Day 2 8311 16.00 4.92
Day 2 8312 5.67 2.28
Day 2 8321 13.00 4.34
Day 2 8322 10.60 1.95
[8 rows x 3 columns]
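A side note (not part of the original answer): newer pandas versions deprecate fillna(method='ffill') in favor of the dedicated ffill method, so the same fix can be written as:
df.index = pd.Series(df.index).ffill()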
df = df.fillna(method='ffill', axis=0)  # forward-fill down the rows to fill in the missing entries
To casually come back 8 years later, pandas.read_excel() can solve this internally for you with the index_col parameter.
df = pd.read_excel('path_to_file.xlsx', index_col=[0])
Passing index_col as a list makes pandas look for a MultiIndex. When the list has length one, pandas creates a regular Index and fills in the gaps left by the merged cells.
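A minimal sketch of both variants (the file name is a placeholder):
import pandas as pd

# index_col=[0]: a regular index; the merged 'Day' cells are filled in.
df = pd.read_excel('path_to_file.xlsx', index_col=[0])

# index_col=[0, 1]: a (Day, Sample) MultiIndex, if both columns label the rows.
df_multi = pd.read_excel('path_to_file.xlsx', index_col=[0, 1])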
I created a function, Quantile_Returns, that returns the returns of quantile portfolios as a time series.
If I call Quantile_Returns(2014), the result (DataFrame) looks like this.
Date Q1 Q2 Q3 Q4 Q5
2014-02-28 6.20 4.87 5.41 5.04 4.91
2014-03-31 -0.50 0.05 1.55 1.36 1.49
2014-04-30 -0.17 0.20 0.33 -0.26 1.76
2014-05-30 2.69 1.95 1.95 2.11 2.29
2014-06-30 3.12 3.40 2.81 1.82 2.36
2014-07-31 -2.52 -2.34 -1.92 -2.36 -1.80
2014-08-29 4.60 3.87 4.50 4.65 3.58
2014-09-30 -3.29 -3.25 -3.51 -0.96 -1.76
2014-10-31 2.55 4.63 2.37 3.60 2.10
2014-11-28 0.88 2.08 1.26 4.46 2.83
2014-12-31 0.35 0.20 -0.19 1.01 0.34
2015-01-30 -2.97 -2.63 -3.44 -2.32 -2.61
Now I want to call this function for a range of years, time_period = list(range(1960, 2021)), and get a single time series that goes from 1960 to 2021.
I tried like this:
time_period = list(range(1960, 2021))
for j in time_period:
    if j == 1960:
        Quantile = pd.DataFrame(Quantile_Returns(j))
    else:
        Quantile = pd.concat(Quantile, Quantile_Returns(j + 1))
But it did not work.
The Error is:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
How can I implement this?
Thank you!
Try replacing the whole loop with
Quantile = pd.concat(Quantile_Returns(j) for j in range(1960, 2021))
pd.concat expects a sequence of pandas objects, and on the second pass through your loop you are giving it a DataFrame as the first argument (not a sequence of DataFrames). Also, the second argument of pd.concat is the axis to concatenate on, not another DataFrame.
Here, I just passed it the sequence of all the DataFrames for the different years as the first argument (using a generator expression).
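If you prefer to keep an explicit loop, an equivalent sketch is to collect the frames in a list and concatenate once at the end:
frames = []
for j in range(1960, 2021):
    frames.append(Quantile_Returns(j))
Quantile = pd.concat(frames)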
I have a dataframe that I'd like to export to a csv file where the columns are stacked on top of one another. I want to use each header, combined with the year, as a label in this format: Allu_1_2013.
date Allu_1 Allu_2 Alluv_3 year
2013-01-01 2.00 1.45 3.54 2013
2014-01-01 3.09 2.35 9.01 2014
2015-01-01 4.79 4.89 10.04 2015
The final csv text file should look like
Allu_1_2013 2.00
Allu_1_2014 3.09
Allu_1_2015 4.79
Allu_2_2013 1.45
Allu_2_2014 2.35
Allu_2_2015 4.89
Allu_3_2013 3.54
Allu_3_2014 9.01
Allu_3_2015 10.04
You can use melt:
new_df = df.melt(id_vars=["date", "year"],
                 var_name="Date",
                 value_name="Value").drop(columns=['date'])
new_df['idx'] = new_df['Date'] + '_' + new_df['year'].astype(str)
new_df = new_df.drop(columns=['year', 'Date'])
   Value           idx
0   2.00   Allu_1_2013
1   3.09   Allu_1_2014
2   4.79   Allu_1_2015
3   1.45   Allu_2_2013
4   2.35   Allu_2_2014
5   4.89   Allu_2_2015
6   3.54  Alluv_3_2013
7   9.01  Alluv_3_2014
8  10.04  Alluv_3_2015
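To produce the space-separated text file shown above (the file name is a placeholder; header and index are dropped to match the desired layout):
new_df[['idx', 'Value']].to_csv('stacked.txt', sep=' ', header=False,
                                index=False, float_format='%.2f')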
import pandas as pd
diamonds = pd.read_csv('diam.csv')
print(diamonds.head())
Unnamed: 0 carat cut color clarity depth table price x y z quality?color
0 0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 Ideal,E
1 1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 Premium,E
2 2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 Good,E
3 3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 Premium,I
4 4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 Good,J
I want to print only the columns with object data types:
x=diamonds.dtypes=='object'
diamonds.where(diamonds[x]==True)
But I get this error:
unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
where works along the row axis, so a column-length boolean mask doesn't align with it. Use diamonds.loc[:, diamonds.dtypes == object], or the built-in select_dtypes.
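For example (a small sketch, assuming the diamonds frame above):
# Boolean indexing along the column axis:
obj_cols = diamonds.loc[:, diamonds.dtypes == object]

# Equivalent built-in that selects columns by dtype directly:
obj_cols = diamonds.select_dtypes(include='object')
print(obj_cols.head())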
From your post (badly formatted) I recreated the diamonds DataFrame, getting a result like the one below:
   Unnamed: 0  carat      cut color clarity  depth  table  price     x     y     z quality?color
0           0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43       Ideal,E
1           1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31     Premium,E
2           2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31        Good,E
3           3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63     Premium,I
4           4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75        Good,J
When you run x = diamonds.dtypes == 'object' and print x, the result is:
Unnamed: 0 False
carat False
cut True
color True
clarity True
depth False
table False
price False
x False
y False
z False
quality?color True
dtype: bool
It is a boolean vector, answering the question: is this column of object type?
Note that diamonds.columns.where(x).dropna().tolist() prints:
['cut', 'color', 'clarity', 'quality?color']
i.e. the names of columns with object dtype.
So to print all columns of object type, you should run:
diamonds[diamonds.columns.where(x).dropna()]