Related
Say I have just 2 columns in pandas.
Column 1 has all numerical values and column 2 has values only at the every 16th position (so column 2 has value at index 0 followed by 15 NaN and value at index 16 followed by 15 NaNs).
How to create a new row, that contains itself and next 15 values of column 1 (as list [value, value2,....value16]) when column 2 is not null.
Can someone let me know a time efficient solution for the below:
Here is the pandas code to reproduce the sample data
df=pd.DataFrame(zip([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32],
['xyz',None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,
'abc',None,None,None,None,None,None,None,None,None,None,None,None,None,None,None],
[[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,
[17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32],None,None,None,None,None,None,None,None,None,None,None,None,None,None,None]), columns= ['A','B','C'])
Use a boolean mask:
m = df['column 2'].notna()
df.loc[m, 'column 3'] = df.groupby(m.cumsum())['column 1'].agg(list).values
print(df)
# Output
column 1 column 2 column 3
0 1 xyz [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
1 2 NaN NaN
2 3 NaN NaN
3 4 NaN NaN
4 5 NaN NaN
5 6 NaN NaN
6 7 NaN NaN
7 8 NaN NaN
8 9 NaN NaN
9 10 NaN NaN
10 11 NaN NaN
11 12 NaN NaN
12 13 NaN NaN
13 14 NaN NaN
14 15 NaN NaN
15 16 NaN NaN
16 17 abc [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2...
17 18 NaN NaN
18 19 NaN NaN
19 20 NaN NaN
20 21 NaN NaN
21 22 NaN NaN
22 23 NaN NaN
23 24 NaN NaN
24 25 NaN NaN
25 26 NaN NaN
26 27 NaN NaN
27 28 NaN NaN
28 29 NaN NaN
29 30 NaN NaN
30 31 NaN NaN
31 32 NaN NaN
I am trying merge 2 dataframes.
df1
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
df2
Date B
07.01.2021 14
08.01.2021 27
09.01.2021 28
10.01.2021 29
11.01.2021 30
12.01.2021 31
13.01.2021 32
Both dataframes have one same row (although there could be several overlappings).
So I want to get df3 that looks as follows:
df3
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
08.01.2021 Nan 27 Nan
09.01.2021 Nan 28 Nan
10.01.2021 Nan 29 Nan
11.01.2021 Nan 30 Nan
12.01.2021 Nan 31 Nan
13.01.2021 Nan 32 Nan
I've tried
df3=df1.merge(df2, on='Date', how='outer') but it gives extra A,B,C columns. Could you give some idea how to get df3?
Thanks a lot.
merge outer without specifying on (default on is the intersection of columns between the two DataFrames in this case ['Date', 'B']):
df3 = df1.merge(df2, how='outer')
df3:
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
7 08.01.2021 NaN 27 NaN
8 09.01.2021 NaN 28 NaN
9 10.01.2021 NaN 29 NaN
10 11.01.2021 NaN 30 NaN
11 12.01.2021 NaN 31 NaN
12 13.01.2021 NaN 32 NaN
Assuming you always want to keep the first full version, you can concat the df2 on the end of df1 and drop duplicates on the Date column.
pd.concat([df1,df2]).drop_duplicates(subset='Date')
Output
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
1 08.01.2021 NaN 27 NaN
2 09.01.2021 NaN 28 NaN
3 10.01.2021 NaN 29 NaN
4 11.01.2021 NaN 30 NaN
5 12.01.2021 NaN 31 NaN
6 13.01.2021 NaN 32 NaN
When I try to take an SQL Query generated from a pd.read_sql_query to a dataframe using pd.DataFrame my string values get converted to nan.
I tried using dtypes to set the type of each column
SQL_Query = pd.read_sql_query('''SELECT [CircuitID], [Status],
[LatestJiraTicket], [MrcNew]
FROM CircuitInfoTable
WHERE ([Status] = 'Active')
OR ([Status] = 'Pending')
OR ([Status] = 'Planned')''', conn)
# print(SQL_Query)
cdf = pd.DataFrame(SQL_Query, columns=['CID', 'Status', 'JiraTicket', 'MrcNew'])
SQL Query output:
0 OH1004-01 ... NaN
1 OH1004-02 ... NaN
2 OH1005-01 ... NaN
3 OH1005-02 ... NaN
4 AL1001-01 ... NaN
5 AL1001-02 ... NaN
6 AL1007-01 ... NaN
7 AL1007-02 ... NaN
8 NC1001-01 ... NaN
9 NC1001-02 ... NaN
10 NC1001-03 ... NaN
11 NC1001-04 ... NaN
12 NC1001-05 ... NaN
13 NC1001-06 ... NaN
14 (ommited on purpose) ... 5200.0
15 MO001-02 ... NaN
16 OR020-01 ... 8000.0
17 MA004-01 ... 6500.0
18 MA004-02 ... 6500.0
19 OR004-01 ... 10500.0
20 (ommited on purpose) ... 3975.0
21 OR007-01 ... 2500.0
22 (ommited on purpose) ... 9200.0
23 (ommited on purpose) ... 15000.0
24 (ommited on purpose) ... 5750.0
25 CA1005-02 ... 47400.0
26 CA1005-03 ... 47400.0
27 CA1005-04 ... 47400.0
28 CA1005-05 ... 47400.0
29 CA1006-01 ... 0.0
DataFrame output:
CID Status JiraTicket MrcNew
0 nan Planned nan NaN
1 nan Planned nan NaN
2 nan Planned nan NaN
3 nan Planned nan NaN
4 nan Planned nan NaN
5 nan Planned nan NaN
6 nan Planned nan NaN
7 nan Planned nan NaN
8 nan Planned nan NaN
9 nan Planned nan NaN
10 nan Planned nan NaN
11 nan Planned nan NaN
12 nan Planned nan NaN
13 nan Planned nan NaN
14 nan Active nan 5200.0
15 nan Pending nan NaN
16 nan Pending nan 8000.0
17 nan Pending nan 6500.0
18 nan Pending nan 6500.0
19 nan Pending nan 10500.0
20 nan Active nan 3975.0
21 nan Pending nan 2500.0
22 nan Active nan 9200.0
23 nan Pending nan 15000.0
24 nan Active nan 5750.0
25 nan Pending nan 47400.0
26 nan Pending nan 47400.0
27 nan Pending nan 47400.0
28 nan Pending nan 47400.0
29 nan Pending nan 0.0
Basically, you are using columns argument incorrectly in pandas.DataFrame where that arugment specifies columns to select in resulting output (not to rename). From your query there is no CID or JiraTicket and hence they migrate with all missing values.
Possibly you intended to rename columns. Consider renaming in either SQL with column aliases or in pandas with rename or set_axis:
SELECT [CircuitID] AS [CID],
[Status],
[LatestJiraTicket] AS JiraTicket,
[MrcNew]
FROM CircuitInfoTable
WHERE ([Status] = 'Active')
OR ([Status] = 'Pending')
OR ([Status] = 'Planned')
Pandas
cdf = (pd.read_sql_query(...original query...)
.rename(columns={'CircuitID': 'CID', 'LatestJiraTicket': 'JiraTicket'})
)
cdf = (pd.read_sql_query(...original query...)
.set_axis(['CID', 'Status', 'JiraTicket', 'MrcNew'], axis='columns', inplace=False)
)
I am trying to figure out how to do merge. I have a labels.csv which contains the names that I have to use to replace the numbers for the same field in my dat.csv
My dat.csv is as follows:
Id,Help in household,Maths,Reading,Science,Social
11011001001,4,20.37,,27.78,
11011001002,3,12.96,,38.18,
11011001003,4,27.78,70,,
11011001004,4,,56.67,,36
11011001005,1,,,14.55,8.33
11011001006,4,,23.33,,30
11011001007,4,40.74,70,,
11011001008,3,,26.67,,22.92
11011001009,2,24.07,,25.45,
11011001010,4,18.52,26.67,,
11011001012,2,37.04,16.67,,
11011001013,4,20.37,,20,
11011001014,2,,,29.63,35.42
11011001015,4,27.78,66.67,,
11011001016,0,18.52,,,
11011001017,4,,,42.59,32
11011001018,2,16.67,,,
11011001019,3,,,21.82,
11011001020,4,,20,,16
11011001021,1,,,18.52,16.67
My labels.csv is as follows:
Column,Name,Level,Rename
Help in household,Every day,4,Every day
Help in household,Never,1,Never
Help in household,Once a month,2,Once a month
Help in household,Once a week,3,Once a week
my programme is as follows:
import pandas as pd
df = pd.read_csv('dat.csv')
labels = pd.read_csv('labels.csv')
df=df.merge(labels,left_on='Help in household',right_on='Name',how='left')
print df
However, the names do not appear as I want them to.
STUID Help in household Maths % Reading % Science % Social % \
0 11011001001 4 20.37 NaN 27.78 NaN
1 11011001002 3 12.96 NaN 38.18 NaN
2 11011001003 4 27.78 70.00 NaN NaN
3 11011001004 4 NaN 56.67 NaN 36.00
4 11011001005 1 NaN NaN 14.55 8.33
5 11011001006 4 NaN 23.33 NaN 30.00
6 11011001007 4 40.74 70.00 NaN NaN
7 11011001008 3 NaN 26.67 NaN 22.92
8 11011001009 2 24.07 NaN 25.45 NaN
9 11011001010 4 18.52 26.67 NaN NaN
10 11011001012 2 37.04 16.67 NaN NaN
11 11011001013 4 20.37 NaN 20.00 NaN
12 11011001014 2 NaN NaN 29.63 35.42
13 11011001015 4 27.78 66.67 NaN NaN
14 11011001016 0 18.52 NaN NaN NaN
15 11011001017 4 NaN NaN 42.59 32.00
16 11011001018 2 16.67 NaN NaN NaN
17 11011001019 3 NaN NaN 21.82 NaN
18 11011001020 4 NaN 20.00 NaN 16.00
19 11011001021 1 NaN NaN 18.52 16.67
Column Name Level Rename
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
11 NaN NaN NaN NaN
12 NaN NaN NaN NaN
13 NaN NaN NaN NaN
14 NaN NaN NaN NaN
15 NaN NaN NaN NaN
16 NaN NaN NaN NaN
17 NaN NaN NaN NaN
18 NaN NaN NaN NaN
19 NaN NaN NaN NaN
What am I doing wrong?
Okay, is this what you want?
df['Name'] = df['Help in household'].map(labels.set_index('Level')['Name'])
Output:
Id Help in household Maths Reading Science Social \
0 11011001001 4 20.37 NaN 27.78 NaN
1 11011001002 3 12.96 NaN 38.18 NaN
2 11011001003 4 27.78 70.00 NaN NaN
3 11011001004 4 NaN 56.67 NaN 36.00
4 11011001005 1 NaN NaN 14.55 8.33
5 11011001006 4 NaN 23.33 NaN 30.00
6 11011001007 4 40.74 70.00 NaN NaN
7 11011001008 3 NaN 26.67 NaN 22.92
8 11011001009 2 24.07 NaN 25.45 NaN
9 11011001010 4 18.52 26.67 NaN NaN
10 11011001012 2 37.04 16.67 NaN NaN
11 11011001013 4 20.37 NaN 20.00 NaN
12 11011001014 2 NaN NaN 29.63 35.42
13 11011001015 4 27.78 66.67 NaN NaN
14 11011001016 0 18.52 NaN NaN NaN
15 11011001017 4 NaN NaN 42.59 32.00
16 11011001018 2 16.67 NaN NaN NaN
17 11011001019 3 NaN NaN 21.82 NaN
18 11011001020 4 NaN 20.00 NaN 16.00
19 11011001021 1 NaN NaN 18.52 16.67
Name
0 Every day
1 Once a week
2 Every day
3 Every day
4 Never
5 Every day
6 Every day
7 Once a week
8 Once a month
9 Every day
10 Once a month
11 Every day
12 Once a month
13 Every day
14 NaN
15 Every day
16 Once a month
17 Once a week
18 Every day
19 Never
I have a file that has daily precipitation data form 83 weather stations and 101 years per station. I want to determine number of NaN per year for each station.
As a shortened example lets assume I only have one stations and only care about 1 years of data, 2009.
If I have this:
station_id year month 1 2 3
210018 2009 1 5 6 8
210018 2009 2 NaN NaN 6
210018 2009 12 8 5 6
I want to get to this:
station_id year month 1 2 3
210018 2009 1 5 6 8
210018 2009 2 NaN NaN 6
210018 2009 3 NaN NaN NaN
210018 2009 4 NaN NaN NaN
210018 2009 5 NaN NaN NaN
210018 2009 6 NaN NaN NaN
210018 2009 7 NaN NaN NaN
210018 2009 8 NaN NaN NaN
210018 2009 9 NaN NaN NaN
210018 2009 10 NaN NaN NaN
210018 2009 11 NaN NaN NaN
210018 2009 12 8 5 6
So my station needs 12 rows for all 12 months and a year to go along with each one. Again I have 101 years in the real example.
I am trying to use this code:
df_indexed=df.set_index(['year'])
new_index=np.arange(1910,2011,1)
idx=pd.Index(new_index)
df2=df_indexed.reindex(idx, method=None)
but it returns a long error that ends with
ValueError: cannot reindex from a duplicate axis
I hope that makes sense.
What I'd probably do is create a target MultiIndex and then use that to index in. For example:
>>> target_ix = pd.MultiIndex.from_product([df.station_id.unique(),
np.arange(1910, 2011, 1), np.arange(1,13)],
names=["station_id", "year", "month"])
>>> df = df.set_index(["station_id", "year", "month"])
>>> new_df = df.loc[target_ix]
>>> new_df.tail(24)
1 2 3
station_id year month
210018 2009 1 5 6 8
2 NaN NaN 6
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 8 5 6
2010 1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
You can .reset_index() at this point if you prefer.
[edit]
THIS IS NOT A PANDAS ANSWER: question was not tagged pandas when I started answering, I will let it here because it can benefit someone.
Suppose you organize your data using a dict where the keys are a tuple of (station_id, year, month) and the values are an array of your data points - you can use collections.defaultdict:
>>> data = defaultdict(lambda: [None, None, None])
>>> data[(210018, 2009, 3)]
[None, None, None]
You are probably reading from a file, I will not do all your homework for you - just give a few hints.
for line in file:
station_id, year, month, d1, d2, d3 = parse_line(line)
data[(station_id, year, month)] = [
None if d == 'NaN' else float(d) for d in (d1, d2, d3)
]
Writing the parse_line function is left as an exercise for the reader.