Fill in missing rows as NaN in python

I have a file that has daily precipitation data from 83 weather stations, with 101 years of data per station. I want to determine the number of NaNs per year for each station.
As a shortened example, let's assume I only have one station and only care about one year of data, 2009.
If I have this:
station_id year month 1 2 3
210018 2009 1 5 6 8
210018 2009 2 NaN NaN 6
210018 2009 12 8 5 6
I want to get to this:
station_id year month 1 2 3
210018 2009 1 5 6 8
210018 2009 2 NaN NaN 6
210018 2009 3 NaN NaN NaN
210018 2009 4 NaN NaN NaN
210018 2009 5 NaN NaN NaN
210018 2009 6 NaN NaN NaN
210018 2009 7 NaN NaN NaN
210018 2009 8 NaN NaN NaN
210018 2009 9 NaN NaN NaN
210018 2009 10 NaN NaN NaN
210018 2009 11 NaN NaN NaN
210018 2009 12 8 5 6
So my station needs 12 rows, one for each month, each paired with its year. Again, I have 101 years in the real example.
I am trying to use this code:
df_indexed = df.set_index(['year'])
new_index = np.arange(1910, 2011, 1)
idx = pd.Index(new_index)
df2 = df_indexed.reindex(idx, method=None)
but it returns a long error that ends with
ValueError: cannot reindex from a duplicate axis
I hope that makes sense.

What I'd probably do is create a target MultiIndex and then use that to index in. (The ValueError comes from reindexing on year alone: with multiple stations and months, each year label appears more than once, and pandas cannot reindex a duplicate axis.) For example:
>>> target_ix = pd.MultiIndex.from_product(
...     [df.station_id.unique(), np.arange(1910, 2011, 1), np.arange(1, 13)],
...     names=["station_id", "year", "month"])
>>> df = df.set_index(["station_id", "year", "month"])
>>> new_df = df.loc[target_ix]
>>> new_df.tail(24)
1 2 3
station_id year month
210018 2009 1 5 6 8
2 NaN NaN 6
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 8 5 6
2010 1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
You can .reset_index() at this point if you prefer.
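Since the end goal is the number of NaNs per station and year, a possible follow-up (not part of the original answer) is sketched below. One caveat: newer pandas versions raise a KeyError when .loc is given labels that are missing from the index, so df.reindex(target_ix) is the safer spelling of that lookup step there.
# Hypothetical follow-up, assuming new_df is the fully expanded frame from above.
nan_per_year = (
    new_df.isnull()                      # True where a value is missing
          .sum(axis=1)                   # missing values per (station, year, month) row
          .groupby(level=["station_id", "year"])
          .sum()                         # missing values per station and year
)
print(nan_per_year.tail())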

[edit]
THIS IS NOT A PANDAS ANSWER: the question was not tagged pandas when I started answering. I will leave it here because it may benefit someone.
Suppose you organize your data using a dict where the keys are tuples of (station_id, year, month) and the values are lists of your data points; you can use collections.defaultdict:
>>> data = defaultdict(lambda: [None, None, None])
>>> data[(210018, 2009, 3)]
[None, None, None]
You are probably reading from a file; I will not do all your homework for you, just give a few hints.
for line in file:
    station_id, year, month, d1, d2, d3 = parse_line(line)
    data[(station_id, year, month)] = [
        None if d == 'NaN' else float(d) for d in (d1, d2, d3)
    ]
Writing the parse_line function is left as an exercise for the reader.
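To tie this back to the original goal (NaN counts per station and year), here is a minimal counting sketch on top of the same defaultdict layout; the sample values are the ones from the question, and the three-values-per-month layout is the assumption made above.
from collections import defaultdict

# (station_id, year, month) -> [d1, d2, d3], with None standing in for NaN.
data = defaultdict(lambda: [None, None, None])
data[(210018, 2009, 1)] = [5.0, 6.0, 8.0]
data[(210018, 2009, 2)] = [None, None, 6.0]
data[(210018, 2009, 12)] = [8.0, 5.0, 6.0]

# Count missing values per (station_id, year). Months that never appear in
# `data` still count, because the defaultdict hands back [None, None, None].
nan_counts = {}
for station_id, year in {(s, y) for s, y, _ in data}:
    nan_counts[(station_id, year)] = sum(
        value is None
        for month in range(1, 13)
        for value in data[(station_id, year, month)]
    )

print(nan_counts)  # {(210018, 2009): 29} -- 2 in February plus 27 for the nine absent months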

Related

How to create multiple year columns in a new dataframe, from the original single column datetime dataframe?

Source df has sdate datetime64 and svalue float64 columns as:
sdate svalue
1980-01-01 5
1980-01-02 7
1980-01-05 2
1981-01-01 6
1981-01-02 3
1982-01-01 4
1982-01-02 2
1982-01-06 9
1983-01-06 8
How to create multiple year columns in a new dataset as:
dayofyear 1980 1981 1982 1983
1 5 6 4 nan
2 7 3 2 nan
3 nan nan nan nan
4 nan nan nan nan
5 2 nan nan nan
6 nan nan 9 8
I tried something like
df_new = df.pivot(index=df.sdate.dt.dayofyear, columns=df.sdate.dt.year, values='svalue')
Use DataFrame.assign to add the helper columns and then pivot (keyword arguments, since newer pandas no longer accepts positional ones):
df_new = df.assign(d=df.sdate.dt.dayofyear, y=df.sdate.dt.year).pivot(index='d', columns='y', values='svalue')
print(df_new)
y 1980 1981 1982 1983
d
1 5.0 6.0 4.0 NaN
2 7.0 3.0 2.0 NaN
5 2.0 NaN NaN NaN
6 NaN NaN 9.0 8.0
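One detail worth noting: the desired output also lists day-of-year rows 3 and 4, which never occur in the source data, so the pivot alone will not produce them. A self-contained sketch that adds those rows back with reindex (the helper column names dayofyear and year are my own choice):
import pandas as pd

df = pd.DataFrame({
    "sdate": pd.to_datetime([
        "1980-01-01", "1980-01-02", "1980-01-05",
        "1981-01-01", "1981-01-02",
        "1982-01-01", "1982-01-02", "1982-01-06",
        "1983-01-06",
    ]),
    "svalue": [5, 7, 2, 6, 3, 4, 2, 9, 8],
})

df_new = (
    df.assign(dayofyear=df.sdate.dt.dayofyear, year=df.sdate.dt.year)
      .pivot(index="dayofyear", columns="year", values="svalue")
      # The pivot only keeps days that actually occur; reindexing to days 1-6
      # inserts the missing day-of-year rows as NaN, matching the desired output.
      .reindex(range(1, 7))
)
print(df_new)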

Map function in pandas

full['Name'].head(10)
here is a Series which show like below:
0 Mr
1 Mrs
2 Miss
3 Mrs
4 Mr
5 Mr
6 Mr
7 Master
8 Mrs
9 Mrs
Name: Name, dtype: object
And after using map with a dict:
full['Name']=full['Name'].map({'Mr':1})
full['Name'].head(100)
it turns out to be:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
21 NaN
22 NaN
23 NaN
24 NaN
25 NaN
26 NaN
27 NaN
28 NaN
29 NaN
And it is strange that I have succeeded in doing this with other Series in the DataFrame full, which really confuses me.
Please help.
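For reference, Series.map with a dict returns NaN for every value that is not a key of the dict, so mapping only 'Mr' turns every other title into NaN; and if the cell is re-run after the column is already numeric, the 1s are mapped to NaN as well, which would produce the all-NaN output shown above. A minimal sketch of that behavior:
import pandas as pd

s = pd.Series(['Mr', 'Mrs', 'Miss', 'Mrs', 'Mr'])

# map() with a dict returns NaN for values that are not keys of the dict,
# so only the 'Mr' entries survive as 1.
print(s.map({'Mr': 1}))
# 0    1.0
# 1    NaN
# 2    NaN
# 3    NaN
# 4    1.0
# dtype: float64

# Re-running the same mapping on the already-converted column maps the 1s
# to NaN too, leaving an all-NaN Series.
print(s.map({'Mr': 1}).map({'Mr': 1}))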

Replace dataframe values by NaN

I am working with a pandas dataframe of 15 rows and 8 columns, such as:
A B ... G H
0 0.158979 0.187282 ... 0.330566 0.458748
1 0.227254 0.273307 ... 0.489372 0.649698
2 0.308775 0.351285 ... 0.621399 0.833404
3 0.375850 0.444228 ... 0.759206 0.929980
4 0.431860 0.507906 ... 0.850741 1.038544
5 0.507219 0.596291 ... 0.980404 1.145819
6 0.570170 0.676551 ... 1.094201 1.282077
7 0.635122 0.750434 ... 1.155645 1.292930
8 0.704220 0.824748 ... 1.261516 1.395316
9 0.762619 0.887669 ... 1.337860 1.410864
10 0.824553 0.968889 ... 1.407665 1.437886
11 0.893413 1.045289 ... 1.519902 1.514017
12 0.946757 1.109964 ... 1.561611 1.478634
13 1.008294 1.174139 ... 1.596135 1.501220
14 1.053086 1.227203 ... 1.624630 1.503892
where columns from C to F have been omitted.
I would like to know how I can find the closest value to 1 in every column. Once this value is found, I would like to replace the rest of the values in the column with NaN, except for the values in the previous and next rows, obtaining a dataframe like this:
A B ... G H
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN 0.929980
4 NaN NaN ... 0.850741 1.038544
5 NaN NaN ... 0.980404 1.145819
6 NaN NaN ... 1.094201 NaN
7 NaN NaN ... NaN NaN
8 NaN NaN ... NaN NaN
9 NaN 0.887669 ... NaN NaN
10 NaN 0.968889 ... NaN NaN
11 NaN 1.045289 ... NaN NaN
12 0.946757 NaN ... NaN NaN
13 1.008294 NaN ... NaN NaN
14 1.053086 NaN ... NaN NaN
Does anyone have a suggestion for this?
Thanks in advance
You can use the fact that the value closest to 1 is the minimum of the absolute value of df after subtracting 1. So check where that minimum is met, use shift once with 1 and once with -1 to also pick up the next and previous rows, and use this mask in where.
df_ = (df - 1).abs()                       # distance of every value from 1
df_ = df_.min() == df_                     # True where a value is its column's closest to 1
df_ = df_ | df_.shift(1) | df_.shift(-1)   # also keep the previous and next rows
df_ = df.where(df_)                        # NaN everywhere else
print(df_)
A B G H
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN 0.929980
4 NaN NaN 0.850741 1.038544
5 NaN NaN 0.980404 1.145819
6 NaN NaN 1.094201 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN 0.887669 NaN NaN
10 NaN 0.968889 NaN NaN
11 NaN 1.045289 NaN NaN
12 0.946757 NaN NaN NaN
13 1.008294 NaN NaN NaN
14 1.053086 NaN NaN NaN

Replacing labels with names using merge

I am trying to figure out how to do a merge. I have a labels.csv that contains the names I have to use to replace the numbers in the corresponding field of my dat.csv.
My dat.csv is as follows:
Id,Help in household,Maths,Reading,Science,Social
11011001001,4,20.37,,27.78,
11011001002,3,12.96,,38.18,
11011001003,4,27.78,70,,
11011001004,4,,56.67,,36
11011001005,1,,,14.55,8.33
11011001006,4,,23.33,,30
11011001007,4,40.74,70,,
11011001008,3,,26.67,,22.92
11011001009,2,24.07,,25.45,
11011001010,4,18.52,26.67,,
11011001012,2,37.04,16.67,,
11011001013,4,20.37,,20,
11011001014,2,,,29.63,35.42
11011001015,4,27.78,66.67,,
11011001016,0,18.52,,,
11011001017,4,,,42.59,32
11011001018,2,16.67,,,
11011001019,3,,,21.82,
11011001020,4,,20,,16
11011001021,1,,,18.52,16.67
My labels.csv is as follows:
Column,Name,Level,Rename
Help in household,Every day,4,Every day
Help in household,Never,1,Never
Help in household,Once a month,2,Once a month
Help in household,Once a week,3,Once a week
my programme is as follows:
import pandas as pd
df = pd.read_csv('dat.csv')
labels = pd.read_csv('labels.csv')
df=df.merge(labels,left_on='Help in household',right_on='Name',how='left')
print df
However, the names do not appear as I want them to.
STUID Help in household Maths % Reading % Science % Social % \
0 11011001001 4 20.37 NaN 27.78 NaN
1 11011001002 3 12.96 NaN 38.18 NaN
2 11011001003 4 27.78 70.00 NaN NaN
3 11011001004 4 NaN 56.67 NaN 36.00
4 11011001005 1 NaN NaN 14.55 8.33
5 11011001006 4 NaN 23.33 NaN 30.00
6 11011001007 4 40.74 70.00 NaN NaN
7 11011001008 3 NaN 26.67 NaN 22.92
8 11011001009 2 24.07 NaN 25.45 NaN
9 11011001010 4 18.52 26.67 NaN NaN
10 11011001012 2 37.04 16.67 NaN NaN
11 11011001013 4 20.37 NaN 20.00 NaN
12 11011001014 2 NaN NaN 29.63 35.42
13 11011001015 4 27.78 66.67 NaN NaN
14 11011001016 0 18.52 NaN NaN NaN
15 11011001017 4 NaN NaN 42.59 32.00
16 11011001018 2 16.67 NaN NaN NaN
17 11011001019 3 NaN NaN 21.82 NaN
18 11011001020 4 NaN 20.00 NaN 16.00
19 11011001021 1 NaN NaN 18.52 16.67
Column Name Level Rename
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
11 NaN NaN NaN NaN
12 NaN NaN NaN NaN
13 NaN NaN NaN NaN
14 NaN NaN NaN NaN
15 NaN NaN NaN NaN
16 NaN NaN NaN NaN
17 NaN NaN NaN NaN
18 NaN NaN NaN NaN
19 NaN NaN NaN NaN
What am I doing wrong?
Okay, is this what you want?
df['Name'] = df['Help in household'].map(labels.set_index('Level')['Name'])
Output:
Id Help in household Maths Reading Science Social \
0 11011001001 4 20.37 NaN 27.78 NaN
1 11011001002 3 12.96 NaN 38.18 NaN
2 11011001003 4 27.78 70.00 NaN NaN
3 11011001004 4 NaN 56.67 NaN 36.00
4 11011001005 1 NaN NaN 14.55 8.33
5 11011001006 4 NaN 23.33 NaN 30.00
6 11011001007 4 40.74 70.00 NaN NaN
7 11011001008 3 NaN 26.67 NaN 22.92
8 11011001009 2 24.07 NaN 25.45 NaN
9 11011001010 4 18.52 26.67 NaN NaN
10 11011001012 2 37.04 16.67 NaN NaN
11 11011001013 4 20.37 NaN 20.00 NaN
12 11011001014 2 NaN NaN 29.63 35.42
13 11011001015 4 27.78 66.67 NaN NaN
14 11011001016 0 18.52 NaN NaN NaN
15 11011001017 4 NaN NaN 42.59 32.00
16 11011001018 2 16.67 NaN NaN NaN
17 11011001019 3 NaN NaN 21.82 NaN
18 11011001020 4 NaN 20.00 NaN 16.00
19 11011001021 1 NaN NaN 18.52 16.67
Name
0 Every day
1 Once a week
2 Every day
3 Every day
4 Never
5 Every day
6 Every day
7 Once a week
8 Once a month
9 Every day
10 Once a month
11 Every day
12 Once a month
13 Every day
14 NaN
15 Every day
16 Once a month
17 Once a week
18 Every day
19 Never
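For completeness (this is not part of the answer above): the original merge came back all-NaN because left_on='Help in household' holds numeric codes while right_on='Name' holds text labels, so no rows could match. A sketch of the merge done against Level instead, with column names taken from the question:
import pandas as pd

df = pd.read_csv('dat.csv')
labels = pd.read_csv('labels.csv')

# Match the numeric code against the numeric 'Level' column, not the text 'Name' column.
merged = df.merge(
    labels[['Level', 'Name']],
    left_on='Help in household',
    right_on='Level',
    how='left',
).drop(columns='Level')

print(merged.head())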

Python interpolate not working on rows

Related to Error in gapfilling by row in Pandas, I would like to interpolate instead of using fillna. Currently, I am doing this:
df.ix[:,'2015':'2100'].interpolate(axis = 1, method = 'linear')
However, this does not seem to replace the NaNs. Any suggestions?
--EDIT
This does not seem to work either:
df.apply(pandas.Series.interpolate, inplace = True)
This looks like a bug (I'm using Pandas 0.16.2 with Python 3.4.3).
Using a subset of your data:
>>> df.ix[:3, '2015':'2020']
2015 2016 2017 2018 2019 2020
0 0.001248 NaN NaN NaN NaN 0.001281
1 0.009669 NaN NaN NaN NaN 0.009963
2 0.020005 NaN NaN NaN NaN 0.020651
3 0.025557 NaN NaN NaN NaN 0.026211
The linear interpolation works fine and returns a new dataframe.
>>> df.ix[:3, '2015':'2020'].interpolate(axis=1, method='linear')
2015 2016 2017 2018 2019 2020
0 0.001248 0.001255 0.001261 0.001268 0.001275 0.001281
1 0.009669 0.009728 0.009786 0.009845 0.009904 0.009963
2 0.020005 0.020134 0.020264 0.020393 0.020522 0.020651
3 0.025557 0.025687 0.025818 0.025949 0.026080 0.026211
The original is still untouched.
>>> df.ix[:4, '2015':'2020']
2015 2016 2017 2018 2019 2020
0 0.001248 NaN NaN NaN NaN 0.001281
1 0.009669 NaN NaN NaN NaN 0.009963
2 0.020005 NaN NaN NaN NaN 0.020651
3 0.025557 NaN NaN NaN NaN 0.026211
4 0.060077 NaN NaN NaN NaN 0.060909
So let's try to change it using the inplace=True parameter.
df.ix[:3, '2015':'2020'].interpolate(axis=1, method='linear', inplace=True)
>>> df.ix[:4, '2015':'2020']
2015 2016 2017 2018 2019 2020
0 0.001248 NaN NaN NaN NaN 0.001281
1 0.009669 NaN NaN NaN NaN 0.009963
2 0.020005 NaN NaN NaN NaN 0.020651
3 0.025557 NaN NaN NaN NaN 0.026211
4 0.060077 NaN NaN NaN NaN 0.060909
The changes didn't hold.
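A way to actually persist the interpolated values (not shown in the answer above) is to assign the result back to the slice rather than relying on inplace=True; on modern pandas, .loc replaces the long-deprecated .ix. A minimal sketch, assuming the year columns run from '2015' to '2100' as in the question:
# Interpolate row-wise across the year columns and write the result back;
# calling interpolate(inplace=True) on a slice only modifies a temporary copy.
df.loc[:, '2015':'2100'] = df.loc[:, '2015':'2100'].interpolate(axis=1, method='linear')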
