Error adding date column in Pandas - python

I need some help figuring out why my dataframe is returning all NaNs.
print df
0 1 2 3 4
0 1 9 0 7 30
1 2 8 0 4 30
2 3 5 0 3 30
3 4 3 0 3 30
4 5 1 0 3 30
Then I added a date index. I only need to increment by one day for 5 days.
import datetime
import pandas as pd
from pandas import DataFrame

date = pd.date_range(datetime.datetime.today(), periods=5)
data = DataFrame(df, index=date)
print data
0 1 2 3 4
2014-04-10 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-11 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-12 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-13 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-14 17:16:09.433000 NaN NaN NaN NaN NaN
I tried a few different things, to no avail. If I switch my original dataframe to
np.random.randn(5,5)
then it works. Does anyone have an idea of what is going on here?
Edit: I'll add that the data types are float64:
print df.dtypes
0 float64
1 float64
2 float64
3 float64
4 float64
dtype: object

You should overwrite the index of the original dataframe with the following:
df.index = date
What DataFrame(df, index=date) does is create a new dataframe by matching the values of index against the index of the df being passed in, for example:
DataFrame(df, index=[0,1,2,5,5])
returns the following:
0 1 2 3 4
0 1 9 0 7 30
1 2 8 0 4 30
2 3 5 0 3 30
5 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
because 5 is not included in the index of the original dataframe.
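For completeness, here is the fix as a runnable sketch, with imports and data reconstructed from the question:
import datetime
import pandas as pd

df = pd.DataFrame([[1, 9, 0, 7, 30],
                   [2, 8, 0, 4, 30],
                   [3, 5, 0, 3, 30],
                   [4, 3, 0, 3, 30],
                   [5, 1, 0, 3, 30]])
date = pd.date_range(datetime.datetime.today(), periods=5)
# pd.DataFrame(df, index=date) would align on df's existing index (0-4);
# no date label matches, so every value would become NaN.
# Overwriting the index keeps the data and just relabels the rows.
df.index = date
print(df)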

How to select and print some columns for all the rows that are not NA and are a specific number in python?

I am having trouble selecting the rows I want and printing the columns of choice.
So I have 8 columns, and what I am looking to do is take all the rows where column 8 is not NA and is equal to 2, and print only columns 2 to 5.
I have tried this:
df.where(df['dhch'].notnull())[['scchdg', 'dhch']]
Here I have just entered 2 columns to check that the condition that dhch is not NA worked, and I got the expected result:
scchdg dhch
0 3 1
1 -1 2
2 -1 2
3 1 1
4 3 1
... ...
12094 -9 1
12095 1 1
12096 4 1
12097 3 1
12098 4 1
[12099 rows x 2 columns]
And when I check the conditional value, I get the expected output (i.e., values of 2 and NaNs in the dhch column):
df.where(df['dhch']==2)[['scchdg', 'dhch']]
Out[50]:
scchdg dhch
0 NaN NaN
1 -1.0 2.0
2 -1.0 2.0
3 NaN NaN
4 NaN NaN
... ...
12094 NaN NaN
12095 NaN NaN
12096 NaN NaN
12097 NaN NaN
12098 NaN NaN
[12099 rows x 2 columns]
But when I combine these, I just get piles of NaNs:
df.where(df['dhch'].notnull() & df['dhch']==2)[['scchdg', 'dhch']]
Out[51]:
scchdg dhch
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ...
12094 NaN NaN
12095 NaN NaN
12096 NaN NaN
12097 NaN NaN
12098 NaN NaN
[12099 rows x 2 columns]
What am I doing wrong, please?
In R, what I want to do is as follows:
df[!is.na(df$dhch) & df$dhch==2, c('scchdg', 'dhch')]
But how do I do exactly this in Python, please?
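For what it's worth, the likely culprit in the combined expression above is operator precedence: in Python, & binds tighter than ==, so df['dhch'].notnull() & df['dhch']==2 parses as (df['dhch'].notnull() & df['dhch']) == 2. A sketch of the equivalent of the R expression, assuming the same df:
# Parenthesize the comparison; .loc drops the non-matching rows
# instead of masking them to NaN, matching the R one-liner.
df.loc[df['dhch'].notnull() & (df['dhch'] == 2), ['scchdg', 'dhch']]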

Pandas assign series to a column based on index

I have 2 dataframes:
DF1:
Count
0 98.0
1 176.0
2 260.5
3 389.0
I have to assign these values to a column in another dataframe, at every 3rd row starting from the 3rd row.
The output DF2 should look like this:
Count
0
1
2 98.0
3
4
5 176.0
6
7
8 260.5
9
10
11 389.0
I am doing:
DF2.loc[2::3,'Count'] = DF1['Count']
But I am not getting the expected results.
Use .values
Otherwise, pandas tries to align the index values from DF1, and that messes you up.
DF2.loc[2::3, 'Count'] = DF1['Count'].values
DF2
Count
0 NaN
1 NaN
2 98.0
3 NaN
4 NaN
5 176.0
6 NaN
7 NaN
8 260.5
9 NaN
10 NaN
11 389.0
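To see the alignment issue concretely, here is a minimal sketch (DF1 and DF2 reconstructed from the question):
import numpy as np
import pandas as pd

DF1 = pd.DataFrame({'Count': [98.0, 176.0, 260.5, 389.0]})
DF2 = pd.DataFrame({'Count': [np.nan] * 12})
# Without .values, pandas aligns on DF1's index labels (0-3):
# of the target labels 2, 5, 8, 11 only label 2 exists in DF1,
# so almost every assigned slot stays NaN.
DF2.loc[2::3, 'Count'] = DF1['Count'].values  # positional, as intended
print(DF2)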
New frame built from DF1:
DF1.set_index(DF1.index * 3 + 2).reindex(range(len(DF1) * 3))
Count
0 NaN
1 NaN
2 98.0
3 NaN
4 NaN
5 176.0
6 NaN
7 NaN
8 260.5
9 NaN
10 NaN
11 389.0

pandas ffill/bfill for a specific number of observations

I have the following dataframe:
id indicator
1 NaN
1 NaN
1 1
1 NaN
1 NaN
1 NaN
In reality, I have several more ids. My question now is, how do I do a forward or backward fill for a specific range only, e.g. only the next/last 2 observations? My dataframe should look like this:
id indicator
1 NaN
1 NaN
1 1
1 1
1 1
1 NaN
I know the command
df.groupby("id")["indicator"].fillna(value=None, method="ffill")
However, this fills all the missing values instead of just the next two observations. Does anyone know a solution?
I think DataFrameGroupBy.ffill or DataFrameGroupBy.bfill with the limit parameter is nicer:
df.groupby("id")["indicator"].ffill(limit=2)
df.groupby("id")["indicator"].bfill(limit=2)
Sample:
# the 5.0 near the end of group 2 is followed by only one row, so only one value is filled
df['filled'] = df.groupby("id")["indicator"].ffill(limit=2)
print (df)
id indicator filled
0 1 NaN NaN
1 1 NaN NaN
2 1 1.0 1.0
3 1 NaN 1.0
4 1 NaN 1.0
5 1 NaN NaN
6 1 NaN NaN
7 1 NaN NaN
8 1 4.0 4.0
9 1 NaN 4.0
10 1 NaN 4.0
11 1 NaN NaN
12 1 NaN NaN
13 2 NaN NaN
14 2 NaN NaN
15 2 1.0 1.0
16 2 NaN 1.0
17 2 NaN 1.0
18 2 NaN NaN
19 2 5.0 5.0
20 2 NaN 5.0
21 3 3.0 3.0
22 3 NaN 3.0
23 3 NaN 3.0
24 3 NaN NaN
25 3 NaN NaN
Almost there; straight from the doc:
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
df.groupby("id")["indicator"].fillna(value=None, method="ffill", limit=2)

Boolean indexing to retain falsy values as NaN

Given a dataframe:
Data
1 246804
2 135272
3 898.01
4 3453.33
5 shine
6 add
7 522
8 Nan
9 string
10 29.11
11 20
I would like two new columns, Floats and Strings, both having the same length as the original dataframe. Getting the Floats column is easy:
In [176]: pd.to_numeric(df.Data, errors='coerce')
Out[176]:
1 246804.00
2 135272.00
3 898.01
4 3453.33
5 NaN
6 NaN
7 522.00
8 NaN
9 NaN
10 29.11
11 20.00
Name: Data, dtype: float64
As you can see, non-float values are coerced to NaN, which is exactly what I want.
To get strings, this is what I do:
In [177]: df[df.Data.str.isalpha()]
Out[177]:
Data
5 shine
6 add
8 Nan
9 string
But as you can see, it does not retain the non-string values as NaN. I want something like this:
1 NaN
2 NaN
3 NaN
4 NaN
5 shine
6 add
7 NaN
8 Nan (not NaN)
9 string
10 NaN
11 NaN
How can I get it to do so?
To get Strings, you can use boolean indexing on the Data column, locating where Floats is null.
df['Floats'] = pd.to_numeric(df.Data, errors='coerce')
df['Strings'] = df.Data.loc[df.Floats.isnull()] # Optional: .astype(str)
>>> df
# Output:
# Data Floats Strings
# 1 246804 246804.00 NaN
# 2 135272 135272.00 NaN
# 3 898.01 898.01 NaN
# 4 3453.33 3453.33 NaN
# 5 shine NaN shine
# 6 add NaN add
# 7 522 522.00 NaN
# 8 Nan NaN Nan
# 9 string NaN string
# 10 29.11 29.11 NaN
# 11 20 20.00 NaN
floats = pd.to_numeric(df.Data, 'coerce')
pd.DataFrame(dict(
    floats=floats,
    strings=df.Data.mask(floats.notnull())
))
floats strings
1 246804.00 NaN
2 135272.00 NaN
3 898.01 NaN
4 3453.33 NaN
5 NaN shine
6 NaN add
7 522.00 NaN
8 NaN Nan
9 NaN string
10 29.11 NaN
11 20.00 NaN
You can make it even more obvious by passing an alternative value to mask:
floats = pd.to_numeric(df.Data, 'coerce')
pd.DataFrame(dict(
    floats=floats,
    strings=df.Data.mask(floats.notnull(), '')
))
floats strings
1 246804.00
2 135272.00
3 898.01
4 3453.33
5 NaN shine
6 NaN add
7 522.00
8 NaN Nan
9 NaN string
10 29.11
11 20.00
How about
df.Data.where(pd.to_numeric(df.Data, errors='coerce').isnull())
Out[186]:
Data
1 NaN
2 NaN
3 NaN
4 NaN
5 shine
6 add
7 NaN
8 Nan #not NaN
9 string
10 NaN
11 NaN
Or, based on your df.Data.str.isalpha():
df['Data'].where(df['Data'].str.isalpha())
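If the goal is a Strings column alongside Floats, the same mask can be assigned directly (assuming the df from the question):
df['Strings'] = df['Data'].where(df['Data'].str.isalpha())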

reshape a pandas dataframe index to columns

Consider the pandas Series object below:
import numpy as np
import pandas as pd

index = list('abcdabcdabcd')
df = pd.Series(np.arange(len(index)), index=index)
My desired output is,
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
I have put some effort into pd.pivot_table and unstack, and the solution probably lies in the correct use of one of them. The closest I have reached is
df.reset_index(level=1).unstack(level=1)
but this does not give me the output I am looking for.
Here is something even closer to the desired output, but I am not able to handle the index grouping:
df.to_frame().set_index(df.values, append=True, drop=False).unstack(level=0)
a b c d
0 0.0 NaN NaN NaN
1 NaN 1.0 NaN NaN
2 NaN NaN 2.0 NaN
3 NaN NaN NaN 3.0
4 4.0 NaN NaN NaN
5 NaN 5.0 NaN NaN
6 NaN NaN 6.0 NaN
7 NaN NaN NaN 7.0
8 8.0 NaN NaN NaN
9 NaN 9.0 NaN NaN
10 NaN NaN 10.0 NaN
11 NaN NaN NaN 11.0
A bit more general solution using cumcount to get new index values, and pivot to do the reshaping:
# Reset the existing index, and construct the new index values.
df = df.reset_index()
df.index = df.groupby('index').cumcount()
# Pivot and remove the column axis name.
df = df.pivot(columns='index', values=0).rename_axis(None, axis=1)
The resulting output:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Here is a way that will work if the index is always cycling in the same order, and you know the "period" (in this case 4):
>>> pd.DataFrame(df.values.reshape(-1,4), columns=list('abcd'))
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
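If you would rather not hard-code the period or the column names, one sketch (still assuming the index cycles in a fixed order) derives both from the index itself:
cols = df.index.unique()  # Index(['a', 'b', 'c', 'd'], dtype='object')
pd.DataFrame(df.values.reshape(-1, len(cols)), columns=cols)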
