Python CSV joining columns - python

I am trying to create a new column based on a conditional match, using pandas version 0.17.1. I have two CSV files, each about 100 MB in size.
What I have:
CSV1:
Index TC_NUM
1241 1105.0017
1242 1105.0018
1243 1105.0019
1244 1105.002
1245 1105.0021
1246 1105.0022
CSV2:
KEYS TC_NUM
UXS-689 3001.0045
FIT-3015 1135.0027
FIT-2994 1140.0156
FIT-2991 1910, 1942.0001, 3004.0004, 3004.0020, 3004.0026, 3004.0063, 3004.0065, 3004.0079, 3004.0084, 3004.0091, 2101.0015, 2101.0016, 2101.0017, 2101.0018, 2101.0050, 2101.0052, 2101.0054, 2101.0055, 2101.0071, 2101.0074, 2101.0075, 2206.0001, 2103.0001, 2103.0002, 2103.0009, 2103.0011, 3000.0004, 3000.0030, 1927.0020
FIT-2990 2034.0002, 3004.0035, 3004.0084, 2034.0001
FIT-2918 3001.0039, 3004.0042
What I want:
Index TC_NUM Matched_Keys
1241 1105.0017 FIT-3015
1242 1105.0018 UXS-668
1243 1105.0019 FIT-087
1244 1105.002 FIT-715
1245 1105.0021 FIT-910
1246 1105.0022 FIT-219
If a TC_NUM in CSV2 matches the TC_NUM in CSV1, the corresponding key should be written to a new column in CSV1.
Code:
dftakecolumns = pd.read_csv('JiraKeysEnv.csv')
dfmergehere = pd.read_csv('output2.csv')
s = dftakecolumns['KEYS']
a = dftakecolumns['TC_NUM']
d = dfmergehere['TC_NUM']
for crows in a:
    for toes in d:
        if toes == crows:
            print toes
dfmergehere['Matched_Keys'] = dftakecolumns.apply(toes, axis=None, join_axis=None, join='outer')

You can try this solution.
Note: I changed the TC_NUM values in the first row (to 1105.0017) and the fourth row (to 1105.0022) of df2 so that the merge has matches to show.
print df1
Index TC_NUM
0 1241 1105.0017
1 1242 1105.0018
2 1243 1105.0019
3 1244 1105.0020
4 1245 1105.0021
5 1246 1105.0022
print df2
KEYS TC_NUM
0 UXS-689 1105.0017
1 FIT-3015 1135.0027
2 FIT-2994 1140.0156
3 FIT-2991 1105.0022, 1942.0001, 3004.0004, 3004.0020, 30...
4 FIT-2990 2034.0002, 3004.0035, 3004.0084, 2034.0001
5 FIT-2918 3001.0039, 3004.0042
#convert string column TC_NUM to dataframe df3
df3 = pd.DataFrame([ x.split(',') for x in df2['TC_NUM'].tolist() ])
#convert string df3 to float df3
df3 = df3.astype(float)
print df3
0 1 2 3 4 5 \
0 1105.0017 NaN NaN NaN NaN NaN
1 1135.0027 NaN NaN NaN NaN NaN
2 1140.0156 NaN NaN NaN NaN NaN
3 1105.0022 1942.0001 3004.0004 3004.0020 3004.0026 3004.0063
4 2034.0002 3004.0035 3004.0084 2034.0001 NaN NaN
5 3001.0039 3004.0042 NaN NaN NaN NaN
6 7 8 9 ... 19 20 \
0 NaN NaN NaN NaN ... NaN NaN
1 NaN NaN NaN NaN ... NaN NaN
2 NaN NaN NaN NaN ... NaN NaN
3 3004.0065 3004.0079 3004.0084 3004.0091 ... 2101.0074 2101.0075
4 NaN NaN NaN NaN ... NaN NaN
5 NaN NaN NaN NaN ... NaN NaN
21 22 23 24 25 26 27 \
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 2206.0001 2103.0001 2103.0002 2103.0009 2103.0011 3000.0004 3000.003
4 NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN
28
0 NaN
1 NaN
2 NaN
3 1927.002
4 NaN
5 NaN
[6 rows x 29 columns]
#concat column KEYS to df3
df2 = pd.concat([df2['KEYS'], df3], axis=1)
#stack - rows to one column for merging
df2 = df2.set_index('KEYS').stack().reset_index(level=1,drop=True).reset_index(name='TC_NUM')
print df2
KEYS TC_NUM
0 UXS-689 1105.0017
1 FIT-3015 1135.0027
2 FIT-2994 1140.0156
3 FIT-2991 1105.0022
4 FIT-2991 1942.0001
5 FIT-2991 3004.0004
6 FIT-2991 3004.0020
7 FIT-2991 3004.0026
8 FIT-2991 3004.0063
9 FIT-2991 3004.0065
10 FIT-2991 3004.0079
11 FIT-2991 3004.0084
12 FIT-2991 3004.0091
13 FIT-2991 2101.0015
14 FIT-2991 2101.0016
15 FIT-2991 2101.0017
16 FIT-2991 2101.0018
17 FIT-2991 2101.0050
18 FIT-2991 2101.0052
19 FIT-2991 2101.0054
20 FIT-2991 2101.0055
21 FIT-2991 2101.0071
22 FIT-2991 2101.0074
23 FIT-2991 2101.0075
24 FIT-2991 2206.0001
25 FIT-2991 2103.0001
26 FIT-2991 2103.0002
27 FIT-2991 2103.0009
28 FIT-2991 2103.0011
29 FIT-2991 3000.0004
30 FIT-2991 3000.0030
31 FIT-2991 1927.0020
32 FIT-2990 2034.0002
33 FIT-2990 3004.0035
34 FIT-2990 3004.0084
35 FIT-2990 2034.0001
36 FIT-2918 3001.0039
37 FIT-2918 3004.0042
#merge on column TC_NUM
print pd.merge(df1, df2, on=['TC_NUM'])
Index TC_NUM KEYS
0 1241 1105.0017 UXS-689
1 1246 1105.0022 FIT-2991
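On a newer pandas (0.25 or later, which is an assumption, since the question uses 0.17.1), the same reshaping can probably be sketched more compactly with str.split and DataFrame.explode. The small frames below are made up for illustration, and Matched_Keys is simply the column name the question asks for. Note that merging on floats relies on both sides parsing to exactly the same value, so keeping TC_NUM as a string on both sides may be safer on real data.

```python
import pandas as pd

# Tiny made-up stand-ins for the two CSV files.
df1 = pd.DataFrame({"Index": [1241, 1246],
                    "TC_NUM": [1105.0017, 1105.0022]})
df2 = pd.DataFrame({"KEYS": ["UXS-689", "FIT-2991", "FIT-2918"],
                    "TC_NUM": ["1105.0017",
                               "1105.0022, 1942.0001",
                               "3001.0039, 3004.0042"]})

# One row per (KEYS, TC_NUM) pair: split the string, then explode the lists.
long_df2 = (df2.assign(TC_NUM=df2["TC_NUM"].str.split(","))
               .explode("TC_NUM")
               .reset_index(drop=True))
long_df2["TC_NUM"] = long_df2["TC_NUM"].str.strip().astype(float)

# A left merge keeps every row of df1, whether or not it found a key.
result = df1.merge(long_df2, on="TC_NUM", how="left")
result = result.rename(columns={"KEYS": "Matched_Keys"})
print(result)
```

If one TC_NUM can belong to several keys, the left merge yields one row per match; grouping by TC_NUM and joining the keys would collapse them back to a single row.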

More efficient way to do the same merge on multiple columns in a dataframe?

Input:
df1
OFF_P1 OFF_P2 OFF_P3 OFF_P4 OFF_P5 GAME_ID
0 1629675 1627736 1630162 201976 1629020 22101224
1 201599 1630178 202699 1629680 201980 22101228
2 1630191 1630180 1630587 1630240 1628402 22101228
3 1627759 201143 1628464 1628369 203935 22101223
4 1630573 1630271 1630238 1628436 1630346 22101223
df2
PLAYER_ID GAME_ID PTS
0 201980 21900002 28
1 201586 21900001 13
2 1628366 21900001 8
3 200755 21900001 16
4 202324 21900001 6
Desired Output:
OFF_P1 OFF_P2 OFF_P3 OFF_P4 OFF_P5 GAME_ID OFF_P1_PTS OFF_P2_PTS etc...
0 1629675 1627736 1630162 201976 1629020 22101224 28 13 ...
1 201599 1630178 202699 1629680 201980 22101228 12
2 1630191 1630180 1630587 1630240 1628402 22101228 14
3 1627759 201143 1628464 1628369 203935 22101223 8
4 1630573 1630271 1630238 1628436 1630346 22101223 19
I would like to merge the PTS column from df2 to df1 but for each column of OFF_P1, OFF_P2, etc...
Is there a more efficient way to do this other than something like the below?
df1 = df1.merge(df2, left_on=['GAME_ID', 'OFF_P1'], right_on=['GAME_ID', 'PLAYER_ID'])
df1 = df1.merge(df2, left_on=['GAME_ID', 'OFF_P2'], right_on=['GAME_ID', 'PLAYER_ID'])
df1 = df1.merge(df2, left_on=['GAME_ID', 'OFF_P3'], right_on=['GAME_ID', 'PLAYER_ID'])
df1 = df1.merge(df2, left_on=['GAME_ID', 'OFF_P4'], right_on=['GAME_ID', 'PLAYER_ID'])
df1 = df1.merge(df2, left_on=['GAME_ID', 'OFF_P5'], right_on=['GAME_ID', 'PLAYER_ID'])
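I would prefer the MultiIndex.map approach: build a PTS Series indexed by (GAME_ID, PLAYER_ID), then look up each (GAME_ID, OFF_Pn) pair in it:
d = df2.set_index(['GAME_ID', 'PLAYER_ID'])['PTS']
for c in df1.filter(like='OFF_P'):
    df1[f'{c}_PTS'] = df1.set_index(['GAME_ID', c]).index.map(d)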
print(df1)
OFF_P1 OFF_P2 OFF_P3 OFF_P4 OFF_P5 GAME_ID OFF_P1_PTS OFF_P2_PTS OFF_P3_PTS OFF_P4_PTS OFF_P5_PTS OFF_P1_PTS_PTS OFF_P2_PTS_PTS OFF_P3_PTS_PTS OFF_P4_PTS_PTS OFF_P5_PTS_PTS
0 1629675 1627736 1630162 201976 1629020 22101224 28.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 201599 1630178 202699 1629680 201980 22101228 NaN 13.0 NaN NaN NaN NaN NaN NaN NaN NaN
2 1630191 1630180 1630587 1630240 1628402 22101228 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1627759 201143 1628464 1628369 203935 22101223 16.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1630573 1630271 1630238 1628436 1630346 22101223 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
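The all-NaN *_PTS_PTS columns in the output above look like the result of running the loop a second time, when filter(like='OFF_P') also picks up the freshly added OFF_Pn_PTS columns. A minimal sketch of a re-runnable variant follows; the frames are made-up toy data, add_player_points is a hypothetical helper name, and it assumes each (GAME_ID, PLAYER_ID) pair appears only once in df2.

```python
import pandas as pd

# Made-up toy data with the same column layout as the question.
df1 = pd.DataFrame({"OFF_P1": [101, 103],
                    "OFF_P2": [102, 104],
                    "GAME_ID": [1, 2]})
df2 = pd.DataFrame({"PLAYER_ID": [101, 102, 103],
                    "GAME_ID": [1, 1, 2],
                    "PTS": [28, 13, 8]})

def add_player_points(lineups, box_scores):
    """Add one <col>_PTS column per OFF_Pn column (hypothetical helper)."""
    lookup = box_scores.set_index(["GAME_ID", "PLAYER_ID"])["PTS"]
    out = lineups.copy()
    # Anchored regex: matches OFF_P1, OFF_P2, ... but not OFF_P1_PTS.
    for col in out.filter(regex=r"^OFF_P\d+$").columns:
        out[f"{col}_PTS"] = out.set_index(["GAME_ID", col]).index.map(lookup)
    return out

once = add_player_points(df1, df2)
twice = add_player_points(once, df2)  # re-running adds no extra columns
print(twice)
```

Pairs that have no entry in df2 come back as NaN, which is why the new columns may be float rather than int.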

Python/Pandas outer merge not including all relevant columns

I want to merge the following 2 data frames in Pandas but the result isn't containing all the relevant columns:
L1aIn[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L1aIn
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a 2021 3 29 1
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a 2021 3 29 1
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b 2021 3 29 1
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a 2021 3 29 1
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a 2021 3 29 1
L2Std[0:5]
Filename OrbitNumber OrbitMode OrbitModeCounter Year Month Day L2Std
0 oco2_L2StdGL_35861a_210329_B10206r_21042704283... 35861 GL a 2021 3 29 1
1 oco2_L2StdXS_35860a_210329_B10206r_21042700342... 35860 XS a 2021 3 29 1
2 oco2_L2StdND_35852a_210329_B10206r_21042622540... 35852 ND a 2021 3 29 1
3 oco2_L2StdGL_35862a_210329_B10206r_21042622403... 35862 GL a 2021 3 29 1
4 oco2_L2StdTG_35856a_210329_B10206r_21042622422... 35856 TG a 2021 3 29 1
>>> df = L1aIn.copy(deep=True)
>>> df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
0 oco2_L1aInDP_35863a_210329_B10206_210330111927.h5 35863 DP a ... NaN NaN NaN NaN
1 oco2_L1aInDP_35862a_210329_B10206_210330111935.h5 35862 DP a ... NaN NaN NaN NaN
2 oco2_L1aInDP_35861b_210329_B10206_210330111934.h5 35861 DP b ... NaN NaN NaN NaN
3 oco2_L1aInLP_35861a_210329_B10206_210330111934.h5 35861 LP a ... NaN NaN NaN NaN
4 oco2_L1aInSP_35861a_210329_B10206_210330111934.h5 35861 SP a ... NaN NaN NaN NaN
5 NaN 35861 GL a ... 2021.0 3.0 29.0 1.0
6 NaN 35860 XS a ... 2021.0 3.0 29.0 1.0
7 NaN 35852 ND a ... 2021.0 3.0 29.0 1.0
8 NaN 35862 GL a ... 2021.0 3.0 29.0 1.0
9 NaN 35856 TG a ... 2021.0 3.0 29.0 1.0
[10 rows x 13 columns]
>>> df.columns
Index(['Filename', 'OrbitNumber', 'OrbitMode', 'OrbitModeCounter', 'Year',
'Month', 'Day', 'L1aIn'],
dtype='object')
I want the resulting merged table to include both the "L1aIn" and "L2Std" columns but as you can see it doesn't and only picks up the original columns from L1aIn.
I'm also puzzled about why it seems to be returning a dataframe object rather than None.
A toy example works fine for me, but the real-life one does not. What circumstances provoke this kind of behavior for merge?
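It seems you just need to assign the output of merge to a variable. merge returns a new DataFrame and leaves df unchanged, which is why df.columns still shows only the original columns: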
merged_df = df.merge(L2Std, how="outer", on=["OrbitNumber","OrbitMode","OrbitModeCounter"])
print(merged_df.columns)
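A small sketch of that behaviour, using made-up frames that only borrow a few column names from the question. The suffixes values are an illustrative choice, not something from the original post; they label the overlapping non-key column (Year here) so it is clear which side each copy came from.

```python
import pandas as pd

# Made-up miniature versions of the two tables.
l1a_in = pd.DataFrame({"OrbitNumber": [35861, 35862],
                       "OrbitMode": ["DP", "DP"],
                       "Year": [2021, 2021],
                       "L1aIn": [1, 1]})
l2_std = pd.DataFrame({"OrbitNumber": [35861, 35860],
                       "OrbitMode": ["DP", "XS"],
                       "Year": [2021, 2021],
                       "L2Std": [1, 1]})

# merge returns a new frame; l1a_in itself is not modified.
merged = l1a_in.merge(l2_std, how="outer",
                      on=["OrbitNumber", "OrbitMode"],
                      suffixes=("_L1aIn", "_L2Std"))

print(list(l1a_in.columns))  # still only the original columns
print(list(merged.columns))  # contains both L1aIn and L2Std
```

Rows that exist on only one side get NaN in the other side's columns, which matches the pattern in the question's output.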

Locating columns values in pandas dataframe with conditions

We have a dataframe (df_source):
Unnamed: 0 DATETIME DEVICE_ID COD_1 DAT_1 COD_2 DAT_2 COD_3 DAT_3 COD_4 DAT_4 COD_5 DAT_5 COD_6 DAT_6 COD_7 DAT_7
0 0 200520160941 002222111188 35 200408100500.0 12 200408100400 16 200408100300 11 200408100200 19 200408100100 35 200408100000 43
1 19 200507173541 000049000110 00 190904192701.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 20 200507173547 000049000110 00 190908185501.0 08 190908185501 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 21 200507173547 000049000110 00 190908205601.0 08 190908205601 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 22 200507173547 000049000110 00 190909005800.0 08 190909005800 NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
159 775 200529000843 000049768051 40 200529000601.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
160 776 200529000843 000049015792 00 200529000701.0 33 200529000701 NaN NaN NaN NaN NaN NaN NaN NaN NaN
161 779 200529000843 000049180500 00 200529000601.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
162 784 200529000843 000049089310 00 200529000201.0 03 200529000201 61 200529000201 NaN NaN NaN NaN NaN NaN NaN
163 786 200529000843 000049768051 40 200529000401.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
We calculated values_cont, a Series of counts, for a subset of columns:
v_subset = ['COD_1', 'COD_2', 'COD_3', 'COD_4', 'COD_5', 'COD_6', 'COD_7']
values_cont = pd.value_counts(df_source[v_subset].values.ravel())
We obtained as result (values, counter):
00 134
08 37
42 12
40 12
33 3
11 3
03 2
35 2
43 2
44 1
61 1
04 1
12 1
60 1
05 1
19 1
34 1
16 1
Now, the question is:
How to locate values in columns corresponding to counter, for instance:
How to locate:
df['DEVICE_ID'] # corresponding with values ('00') and counter ('134')
df['DEVICE_ID'] # corresponding with values ('08') and counter ('37')
...
df['DEVICE_ID'] # corresponding with values ('16') and counter ('1')
I believe you need DataFrame.melt to reshape the COD columns into one long column, then a groupby that joins the DEVICE_IDs and uses size for the counts.
This implementation will result in a dataframe with a column (value) for the CODES, all the associated DEVICE_IDs, and the count of ids associated with each code.
This is an alternative to values_cont in the question.
v_subset = ['COD_1', 'COD_2', 'COD_3', 'COD_4', 'COD_5', 'COD_6', 'COD_7']
df = (df_source.melt(id_vars='DEVICE_ID', value_vars=v_subset)
               .dropna(subset=['value'])
               .groupby('value')
               .agg(DEVICE_ID=('DEVICE_ID', ','.join), count=('value', 'size'))
               .reset_index())
print (df)
value DEVICE_ID count
0 00 000049000110,000049000110,000049000110,0000490... 7
1 03 000049089310 1
2 08 000049000110,000049000110,000049000110 3
3 11 002222111188 1
4 12 002222111188 1
5 16 002222111188 1
6 19 002222111188 1
7 33 000049015792 1
8 35 002222111188,002222111188 2
9 40 000049768051,000049768051 2
10 43 002222111188 1
11 61 000049089310 1
# print DEVICE_ID for CODES == '03'
print(df.DEVICE_ID[df.value == '03'])
[out]:
1 000049089310
Name: DEVICE_ID, dtype: object
Since the question is about df_source itself, you can also select specific parts of that dataframe with pandas boolean indexing:
# to return all rows where COD_1 is '00'
df_source[df_source.COD_1 == '00']
# to return only the DEVICE_ID column where COD_1 is '00'
df_source['DEVICE_ID'][df_source.COD_1 == '00']
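The examples above test only COD_1, while the counts in the question cover all seven COD columns. A minimal sketch of the same boolean indexing across several columns, on made-up data; devices_with_code is a hypothetical helper name.

```python
import pandas as pd

# Made-up sample: codes are kept as strings, as in the question.
df_source = pd.DataFrame({"DEVICE_ID": ["A", "B", "C"],
                          "COD_1": ["00", "00", "40"],
                          "COD_2": ["08", None, None]})
cod_cols = ["COD_1", "COD_2"]

def devices_with_code(df, code, cols):
    """Return DEVICE_ID for rows where any of `cols` equals `code`."""
    mask = df[cols].eq(code).any(axis=1)
    return df.loc[mask, "DEVICE_ID"]

print(devices_with_code(df_source, "08", cod_cols).tolist())
print(devices_with_code(df_source, "00", cod_cols).tolist())
```

The length of the returned Series is the number of matching rows, so it equals the value_counts figure only if a code never appears twice in the same row.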
You can use df.loc with a boolean mask to select the rows that match on a column, then pick the column of interest from those rows. (.iloc is purely position-based, so it accepts neither a boolean Series nor a column label. Also note that 134 is the count of code '00', not a value stored in the dataframe, so there is nothing to filter on for it.)
mask = df_source['COD_1'] == '00'
df_out = df_source.loc[mask, 'DEVICE_ID']
More info on .loc and .iloc: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

How to read and write table with extra information as a dataframe and adding new columns from the informations

I have a file-like object generated from StringIO. It contains a table preceded by several lines of extra information (see below, starting from #TIMESTAMP).
I want to add extra columns to the existing table using "Date" and the difference between "UTCOffset" and "Time" from #TIMESTAMP, and "ZenAngle" from #GLOBAL_SUMMARY.
I used pd.read_csv to read it, but it only worked when I skipped the first 8 rows, which contain the information I need. I also got the error "TypeError: data argument can't be an iterator" when I tried to import the object below as a dataframe.
#TIMESTAMP
UTCOffset,Date,Time
+00:30:32,2011-09-05,08:32:21

#GLOBAL_SUMMARY
Time,IntACGIH,IntCIE,ZenAngle,MuValue,AzimAngle,Flag,TempC,O3,Err_O3,SO2,Err_SO2,F324
08:32:21,7.3576,52.758,59.109,1.929,114.427,000000,24,291,1,,,91.9

#GLOBAL
Wavelength,S-Irradiance,Time
290.0,0.000e+00
290.5,0.000e+00
291.0,4.380e-06
291.5,2.234e-05
292.0,2.102e-05
292.5,2.204e-05
293.0,2.453e-05
293.5,2.256e-05
294.0,3.088e-05
294.5,4.676e-05
295.0,3.384e-05
295.5,3.582e-05
296.0,4.298e-05
296.5,3.774e-05
297.0,4.779e-05
297.5,7.399e-05
298.0,9.214e-05
298.5,1.080e-04
299.0,2.143e-04
299.5,3.180e-04
300.0,3.337e-04
300.5,4.990e-04
301.0,8.688e-04
301.5,1.210e-03
302.0,1.133e-03
I think you can first use read_csv to create 3 DataFrames:
import pandas as pd
import io
temp=u"""#TIMESTAMP
UTCOffset,Date,Time
+00:30:32,2011-09-05,08:32:21

#GLOBAL_SUMMARY
Time,IntACGIH,IntCIE,ZenAngle,MuValue,AzimAngle,Flag,TempC,O3,Err_O3,SO2,Err_SO2,F324
08:32:21,7.3576,52.758,59.109,1.929,114.427,000000,24,291,1,,,91.9

#GLOBAL
Wavelength,S-Irradiance,Time
290.0,0.000e+00
290.5,0.000e+00
291.0,4.380e-06
291.5,2.234e-05
292.0,2.102e-05
292.5,2.204e-05
293.0,2.453e-05
293.5,2.256e-05
294.0,3.088e-05
294.5,4.676e-05
295.0,3.384e-05
295.5,3.582e-05
296.0,4.298e-05
296.5,3.774e-05
297.0,4.779e-05
297.5,7.399e-05
298.0,9.214e-05
298.5,1.080e-04
299.0,2.143e-04
299.5,3.180e-04
300.0,3.337e-04
300.5,4.990e-04
301.0,8.688e-04
301.5,1.210e-03
302.0,1.133e-03
"""
df1 = pd.read_csv(io.StringIO(temp),
                  skiprows=9)
print (df1)
Wavelength S-Irradiance Time
0 290.0 0.000000 NaN
1 290.5 0.000000 NaN
2 291.0 0.000004 NaN
3 291.5 0.000022 NaN
4 292.0 0.000021 NaN
5 292.5 0.000022 NaN
6 293.0 0.000025 NaN
7 293.5 0.000023 NaN
8 294.0 0.000031 NaN
9 294.5 0.000047 NaN
10 295.0 0.000034 NaN
11 295.5 0.000036 NaN
12 296.0 0.000043 NaN
13 296.5 0.000038 NaN
14 297.0 0.000048 NaN
15 297.5 0.000074 NaN
16 298.0 0.000092 NaN
17 298.5 0.000108 NaN
18 299.0 0.000214 NaN
19 299.5 0.000318 NaN
20 300.0 0.000334 NaN
21 300.5 0.000499 NaN
22 301.0 0.000869 NaN
23 301.5 0.001210 NaN
24 302.0 0.001133 NaN
df2 = pd.read_csv(io.StringIO(temp),
                  skiprows=1,
                  nrows=1)
print (df2)
UTCOffset Date Time
0 +00:30:32 2011-09-05 08:32:21
df3 = pd.read_csv(io.StringIO(temp),
                  skiprows=5,
                  nrows=1)
print (df3)
Time IntACGIH IntCIE ZenAngle MuValue AzimAngle Flag TempC O3 \
0 08:32:21 7.3576 52.758 59.109 1.929 114.427 0 24 291
Err_O3 SO2 Err_SO2 F324
0 1 NaN NaN 91.9
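Once the three pieces are read, the header values can be broadcast onto the spectrum table. The following is only a sketch on a shortened copy of the file: it assumes the sections are separated by blank lines (so the skiprows values line up), and it assumes the wanted quantity is Time minus UTCOffset, since the question does not say which way round the subtraction goes. The column name TimeMinusOffset is made up.

```python
import io
import pandas as pd

# Shortened, made-up copy of the file; sections separated by blank lines.
raw = """#TIMESTAMP
UTCOffset,Date,Time
+00:30:32,2011-09-05,08:32:21

#GLOBAL_SUMMARY
Time,IntACGIH,ZenAngle
08:32:21,7.3576,59.109

#GLOBAL
Wavelength,S-Irradiance,Time
290.0,0.000e+00
290.5,0.000e+00
"""

stamp = pd.read_csv(io.StringIO(raw), skiprows=1, nrows=1)
summary = pd.read_csv(io.StringIO(raw), skiprows=5, nrows=1)
spectrum = pd.read_csv(io.StringIO(raw), skiprows=9)

# The data rows have no third field, so the Time column is empty.
spectrum = spectrum.drop(columns="Time")

# Assumed sign convention: measurement time minus the UTC offset.
local_time = pd.to_timedelta(stamp.loc[0, "Time"])
offset = pd.to_timedelta(stamp.loc[0, "UTCOffset"].lstrip("+"))

# Scalars are broadcast to every row of the spectrum table.
spectrum["Date"] = stamp.loc[0, "Date"]
spectrum["TimeMinusOffset"] = local_time - offset
spectrum["ZenAngle"] = summary.loc[0, "ZenAngle"]
print(spectrum)
```

Each call wraps the text in a fresh io.StringIO because a StringIO object is consumed as it is read.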

Reindexing and filling on one level of a hierarchical index in pandas

I have a pandas dataframe with a two level hierarchical index ('item_id' and 'date'). Each row has columns for a variety of metrics for a particular item in a particular month. Here's a sample:
total_annotations unique_tags
date item_id
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2008-07-01 2 81 33
2008-11-01 2 82 34
2009-04-01 2 84 35
2010-03-01 2 90 35
2010-04-01 2 100 36
2010-11-01 2 105 40
2011-05-01 2 106 40
2011-07-01 2 108 42
2005-08-01 3 479 200
2005-09-01 3 707 269
2005-10-01 3 980 327
2005-11-01 3 1176 373
2005-12-01 3 1536 438
2006-01-01 3 1854 497
2006-02-01 3 2206 560
2006-03-01 3 2558 632
2007-02-01 3 5650 1019
As you can see, observations are not available for every consecutive month for each item. What I want to do is reindex the dataframe so that each item has rows for each month in a specified range. This is easy to accomplish for any single item. For item_id 99, for example:
baseDateRange = pd.date_range('2005-07-01','2013-01-01',freq='MS')
data.xs(99,level='item_id').reindex(baseDateRange,method='ffill')
But with this method, I'd have to iterate through all the item_ids, then merge everything together, which seems woefully over-complicated.
So how can I apply this to the full dataframe, ffill-ing the observations (but also the item_id index) such that each item_id has properly filled rows for all the dates in baseDateRange?
Essentially for each group you want to reindex and ffill. The apply gets passed a data frame that has the item_id and date still in the index, so reset, then set and reindex with filling.
idx is your baseDateRange from above.
In [33]: df.groupby(level='item_id').apply(
             lambda x: x.reset_index().set_index('date').reindex(idx, method='ffill')).head(30)
Out[33]:
item_id annotations tags
item_id
2 2005-07-01 NaN NaN NaN
2005-08-01 NaN NaN NaN
2005-09-01 NaN NaN NaN
2005-10-01 NaN NaN NaN
2005-11-01 NaN NaN NaN
2005-12-01 NaN NaN NaN
2006-01-01 NaN NaN NaN
2006-02-01 NaN NaN NaN
2006-03-01 NaN NaN NaN
2006-04-01 NaN NaN NaN
2006-05-01 NaN NaN NaN
2006-06-01 NaN NaN NaN
2006-07-01 NaN NaN NaN
2006-08-01 NaN NaN NaN
2006-09-01 NaN NaN NaN
2006-10-01 NaN NaN NaN
2006-11-01 NaN NaN NaN
2006-12-01 NaN NaN NaN
2007-01-01 NaN NaN NaN
2007-02-01 NaN NaN NaN
2007-03-01 NaN NaN NaN
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2007-07-01 2 36 19
2007-08-01 2 36 19
2007-09-01 2 36 19
2007-10-01 2 36 19
2007-11-01 2 36 19
2007-12-01 2 36 19
Building on Jeff's answer, I consider this to be somewhat more readable. It is also considerably more efficient since only the droplevel and reindex methods are used.
# all_dates is the full monthly range, e.g. baseDateRange from the question
df = df.set_index(['item_id', 'date'])
def fill_missing_dates(x, idx=all_dates):
    x.index = x.index.droplevel('item_id')
    return x.reindex(idx, method='ffill')
filled_df = (df.groupby('item_id')
               .apply(fill_missing_dates))
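For reference, a self-contained sketch of the same groupby / reindex / ffill idea on a few made-up rows. It uses DataFrame.droplevel (which needs a reasonably recent pandas, an assumption here) instead of assigning to x.index, so the group passed in by apply is not modified in place.

```python
import pandas as pd

# Made-up sample with gaps: item 2 is missing May, item 3 starts in May.
df = pd.DataFrame({
    "item_id": [2, 2, 3],
    "date": pd.to_datetime(["2007-04-01", "2007-06-01", "2007-05-01"]),
    "total_annotations": [30, 36, 479],
}).set_index(["item_id", "date"])

all_dates = pd.date_range("2007-04-01", "2007-07-01", freq="MS", name="date")

def fill_missing_dates(group):
    # Drop the item_id level so the group is indexed by date only,
    # then reindex to the full range, carrying the last value forward.
    return group.droplevel("item_id").reindex(all_dates, method="ffill")

filled = df.groupby(level="item_id").apply(fill_missing_dates)
print(filled)
```

Months before an item's first observation stay NaN, because ffill has nothing earlier to carry forward.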
