This is quite simple, but I do not get why I can't merge two dataframes. I have the following dfs with different shapes (one is larger and wider than the other):
df1
A id
0 microsoft inc 1
1 apple computer. 2
2 Google Inc. 3
3 IBM 4
4 amazon, Inc. 5
df2
B C D E id
0 (01780-500-01) 237489 - 342 API True. 1
0 (409-6043-01) 234324 API Other 2
0 23423423 API NaN NaN 3
0 (001722-5e240-60) NaN NaN Other 4
1 (0012172-52411-60) 32423423. NaN Other 4
0 29849032-29482390 API Yes False 5
1 329482030-23490-1 API Yes False 5
I would like to merge df1 and df2 on the id column:
df3
A B C D E id
0 microsoft inc (01780-500-01) 237489 - 342 API True. 1
1 apple computer. (409-6043-01) 234324 API Other 2
2 Google Inc. 23423423 API NaN NaN 3
3 IBM (001722-5e240-60) NaN NaN Other 4
4 IBM (0012172-52411-60) 32423423. NaN Other 4
5 amazon, Inc. 29849032-29482390 API Yes False 5
6 amazon, Inc. 329482030-23490-1 API Yes False 5
I know that this can be done with merge(). I also read this excellent tutorial and tried:
In:
pd.merge(df1, df2, on=df1.id, how='outer')
Out:
IndexError: indices are out-of-bounds
Then I tried:
pd.merge(df2, df1, on='id', how='outer')
And apparently it repeats the merged rows several times, something like this:
A B C D E index
0 microsoft inc (01780-500-01) 237489 - 342 API True. 1
1 apple computer. (409-6043-01) 234324 API Other 2
2 apple computer. (409-6043-01) 234324 API Other 2
3 apple computer. (409-6043-01) 234324 API Other 2
4 apple computer. (409-6043-01) 234324 API Other 2
5 apple computer. (409-6043-01) 234324 API Other 2
6 apple computer. (409-6043-01) 234324 API Other 2
7 apple computer. (409-6043-01) 234324 API Other 2
8 apple computer. (409-6043-01) 234324 API Other 2
...
I think this is related to the fact that I created a temporary index with df2['position'] = df2.index (the indices look weird) and then removed it. So, my question is: how do I get df3?
UPDATE
I fixed the index of df2 like this:
df2.reset_index(drop=True, inplace=True)
And now looks like this:
B C D E id
0 (01780-500-01) 237489 - 342 API True. 1
1 (409-6043-01) 234324 API Other 2
2 23423423 API NaN NaN 3
3 (001722-5e240-60) NaN NaN Other 4
4 (0012172-52411-60) 32423423. NaN Other 4
5 29849032-29482390 API Yes False 5
6 329482030-23490-1 API Yes False 5
I am still having the same issue. The merged rows are repeating several times.
>>>print(df2.dtypes)
B object
C object
D object
E object
id int64
dtype: object
>>>print(df1.dtypes)
A object
id int64
dtype: object
Update2
>>>print(df2['id'])
0 1
1 2
2 3
3 4
4 4
5 5
6 5
7 6
8 6
9 7
10 8
11 8
12 8
13 8
14 9
15 10
16 11
17 11
18 12
19 12
20 13
21 13
22 14
23 15
24 16
25 16
26 17
27 17
28 18
29 18
...
476 132
477 132
478 132
479 132
480 132
481 132
482 132
483 132
484 133
485 133
486 133
487 133
488 134
489 134
490 134
491 134
492 135
493 135
494 136
495 136
496 137
497 137
498 137
499 137
500 137
501 137
502 137
503 138
504 138
505 138
Name: id, dtype: int64
And
>>>print(df1)
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 8
12 12
13 6
14 7
15 8
16 6
17 11
18 13
19 14
20 15
21 11
22 2
23 16
24 17
25 18
26 9
27 19
28 11
29 20
..
108 57
109 43
110 22
111 2
112 58
113 49
114 22
115 59
116 2
117 6
118 22
119 2
120 37
121 2
122 9
123 60
124 61
125 62
126 63
127 42
128 64
129 4
130 29
131 11
132 2
133 25
134 4
135 65
136 66
137 4
Name: id, dtype: int64
You could try setting the index as id and then using join:
import pandas as pd

df1 = pd.DataFrame([('microsoft inc', 1),
                    ('apple computer.', 2),
                    ('Google Inc.', 3),
                    ('IBM', 4),
                    ('amazon, Inc.', 5)], columns=('A', 'id'))
df2 = pd.DataFrame([('(01780-500-01)', '237489', '- 342', 'API', 1),
                    ('(409-6043-01)', '234324', ' API', 'Other ', 2),
                    ('23423423', 'API', 'NaN', 'NaN', 3),
                    ('(001722-5e240-60)', 'NaN', 'NaN', 'Other', 4),
                    ('(0012172-52411-60)', '32423423', ' NaN', 'Other', 4),
                    ('29849032-29482390', 'API', ' Yes', ' False', 5),
                    ('329482030-23490-1', 'API', ' Yes', ' False', 5)],
                   columns=['B', 'C', 'D', 'E', 'id'])

df1 = df1.set_index('id')
df1.drop_duplicates(inplace=True)
df2 = df2.set_index('id')
df3 = df1.join(df2, how='outer')
Since you've set the index columns (aka join keys) for both dataframes, you wouldn't have to specify the on='id' param.
This is an alternate way to solve the problem. I don't see anything wrong with pd.merge(df1, df2, on='id', how='outer') itself. You might want to double-check the id column in both dataframes, as mentioned by @JohnE: if an id value appears m times in df1 and n times in df2, the merge produces m × n rows for that id, which is exactly the row repetition you're seeing.
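To see the many-to-many effect concretely, here is a minimal sketch with made-up toy frames (not the asker's data):

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 2], 'A': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [2, 2, 3], 'B': ['x', 'y', 'z']})

# id 2 appears twice on each side, so the merge yields 2 x 2 = 4 rows for it
print(pd.merge(left, right, on='id', how='outer'))

# dropping duplicate keys on one side first keeps one row per match
print(pd.merge(left.drop_duplicates('id'), right, on='id', how='outer'))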
Related
I'm trying to generate some stacked bar graphs. I'm using this data:
index 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 No 94 123 96 108 122 106.0 95.0 124 104 118 73 82 106 124 109 70 59
1 Yes 34 4 33 21 5 25.0 34.0 5 21 9 55 46 21 3 19 59 41
2 Dont know 1 2 1 1 2 NaN NaN 1 4 2 2 2 2 2 2 1 7
Basically I want to use the column names as x and the Yes/No/Don't know counts as the y values. Here is my code and the result I have at the moment.
ax = dfu.plot.bar(x='index', stacked=True)
UPDATE:
Here is an example:
import pandas as pd
import matplotlib.pyplot as plt

data = [{0: 1, 1: 2, 2: 3}, {0: 3, 1: 2, 2: 1}, {0: 1, 1: 1, 2: 1}]
index = ["yes", "no", "dont know"]
df = pd.DataFrame(data, index=index)
df.T.plot.bar(stacked=True)  # .T transposes so the categories become the stacked series
plt.show()
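Applied to the asker's dfu, a possible one-liner (assuming the category labels live in the 'index' column, as in the question):

ax = dfu.set_index('index').T.plot.bar(stacked=True)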
I have some data.
I want to keep only the rows where an ID has a run of at least 4 consecutive numbers. For example, if ID 1 has the values 100, 101, 102, 103, 105, the "105" should be excluded.
Data:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
4 1 105
5 2 100
6 2 102
7 2 103
8 2 104
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
18 3 112
19 4 100
20 4 102
21 4 103
22 4 104
23 4 105
24 4 107
Expected results:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
4 3 100
5 3 101
6 3 102
7 3 103
8 3 106
9 3 107
10 3 108
11 3 109
12 3 110
13 4 102
14 4 103
15 4 104
16 4 105
You can identify the consecutive values, then filter the groups by size with groupby.filter:
# label runs of consecutive X values
g = df['X'].diff().gt(1).cumsum()  # no need to group by ID here, we group below
# keep only runs of at least 4 rows
out = df.groupby(['ID', g]).filter(lambda x: len(x) >= 4)  # .reset_index(drop=True)
Output:
ID X
0 1 100
1 1 101
2 1 102
3 1 103
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
20 4 102
21 4 103
22 4 104
23 4 105
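To see what the grouper is doing, you can inspect the intermediate run labels (a quick sketch using the sample df above):

print(df.assign(run=df['X'].diff().gt(1).cumsum()))
# 'run' increments whenever X jumps by more than 1; the labels carry across
# IDs, which is fine because the filter groups by ['ID', run] together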
Another method:
out = df.groupby(df.groupby('ID')['X'].diff().ne(1).cumsum()).filter(lambda x: len(x) >= 4)
print(out)
# Output
ID X
0 1 100
1 1 101
2 1 102
3 1 103
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
20 4 102
21 4 103
22 4 104
23 4 105
def function1(dd: pd.DataFrame):
    # rk = size of the run of consecutive X values each row belongs to
    run = (dd.X.diff() > 1).cumsum()
    return dd.assign(rk=dd.groupby(run)['X'].transform('size'))

df.groupby('ID', group_keys=False).apply(function1).loc[lambda x: x.rk > 3, :'X']
ID X
0 1 100
1 1 101
2 1 102
3 1 103
9 3 100
10 3 101
11 3 102
12 3 103
13 3 106
14 3 107
15 3 108
16 3 109
17 3 110
20 4 102
21 4 103
22 4 104
23 4 105
I am trying to append every column from one row onto another row. I want to do this for every row, but some rows will not have any values to append. Take a look at my code; it will make this clearer:
Here is my data
date day_of_week day_of_month day_of_year month_of_year
5/1/2017 0 1 121 5
5/2/2017 1 2 122 5
5/3/2017 2 3 123 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5
5/9/2017 1 9 129 5
5/10/2017 2 10 130 5
5/11/2017 3 11 131 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5
5/16/2017 1 16 136 5
5/17/2017 2 17 137 5
5/18/2017 3 18 138 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5
5/24/2017 2 24 144 5
5/25/2017 3 25 145 5
5/26/2017 4 26 146 5
Here is my current code:
s = df_md['date'].shift(-1)
df_md['next_calendarday'] = s.mask(s.dt.dayofweek.diff().lt(0))
df_md.set_index('date', inplace=True)
df_md.apply(lambda row: GetNextDayMarketData(row, df_md), axis=1)
def GetNextDayMarketData(row, dataframe):
    if row['next_calendarday'] is pd.NaT:
        return
    key = row['next_calendarday'].strftime("%Y-%m-%d")
    nextrow = dataframe.loc[key]
    for index, val in nextrow.iteritems():
        if index != "next_calendarday":
            dataframe.loc[row.name, index + '_nextday'] = val
This works, but it's so slow it might as well not work. Here is what the result should look like; you can see that the values from the next row have been added to the previous row. The kicker is that it's the next calendar date, not just the next row in the sequence. If a row has no entry for the next calendar date, its new columns are simply left blank.
Here is the expected result in csv
date day_of_week day_of_month day_of_year month_of_year next_workingday day_of_week_nextday day_of_month_nextday day_of_year_nextday month_of_year_nextday
5/1/2017 0 1 121 5 5/2/2017 1 2 122 5
5/2/2017 1 2 122 5 5/3/2017 2 3 123 5
5/3/2017 2 3 123 5 5/4/2017 3 4 124 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5 5/9/2017 1 9 129 5
5/9/2017 1 9 129 5 5/10/2017 2 10 130 5
5/10/2017 2 10 130 5 5/11/2017 3 11 131 5
5/11/2017 3 11 131 5 5/12/2017 4 12 132 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5 5/16/2017 1 16 136 5
5/16/2017 1 16 136 5 5/17/2017 2 17 137 5
5/17/2017 2 17 137 5 5/18/2017 3 18 138 5
5/18/2017 3 18 138 5 5/19/2017 4 19 139 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5 5/24/2017 2 24 144 5
5/24/2017 2 24 144 5 5/25/2017 3 25 145 5
5/25/2017 3 25 145 5 5/26/2017 4 26 146 5
5/26/2017 4 26 146 5
5/30/2017 1 30 150 5
Use DataFrame.join, then remove the column next_calendarday_nextday:
df = df.set_index('date')
df = (df.join(df, on='next_calendarday', rsuffix='_nextday')
        .drop('next_calendarday_nextday', axis=1))
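For context, a minimal end-to-end sketch combining this with the setup from the question (assuming 'date' is already a datetime column):

import pandas as pd

# next calendar date, blanked out when the day-of-week sequence wraps (from the question)
s = df['date'].shift(-1)
df['next_calendarday'] = s.mask(s.dt.dayofweek.diff().lt(0))

# one vectorized self-join replaces the slow row-by-row apply
df = df.set_index('date')
df = (df.join(df, on='next_calendarday', rsuffix='_nextday')
        .drop('next_calendarday_nextday', axis=1))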
I have the following dataframe.
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 4 101 79
6 4 102 21
7 5 101 129
8 6 101 561
Notice that for sensor_id 102, there are no values for hour = 3. This is due to the fact that the sensors do not generate a separate row of data if the hourly_count is equal to zero. This means that sensor 102 should have hourly_counts = 0 at hour = 3, but this is just the way the original data was collected.
Ideally I would like code that fills in this gap: if there are 2 sensors, each sensor should have a record for every hour, and if one is missing, a row should be inserted for that sensor and hour with hourly_count set to 0.
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
Any help is really appreciated.
Using DataFrame.reindex, you can explicitly define your index. This is useful if you are missing data from both sensors for a particular hour. You can also extend the hour beyond what you have. In the following example, it extends out to hour 8.
new_ix = pd.MultiIndex.from_product([range(1,9), [101, 102]], names=['hour', 'sensor_id'])
df_new = df.set_index(['hour', 'sensor_id'])
df_new.reindex(new_ix, fill_value=0).reset_index()
Output:
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
12 7 101 0
13 7 102 0
14 8 101 0
15 8 102 0
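If you'd rather derive the new index from the data instead of hardcoding range(1,9) and the sensor list, a possible variant:

new_ix = pd.MultiIndex.from_product(
    [range(df['hour'].min(), df['hour'].max() + 1),
     sorted(df['sensor_id'].unique())],
    names=['hour', 'sensor_id'])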
Use pandas.DataFrame.pivot and then unstack with reset_index:
new_df = df.pivot(index='sensor_id', columns='hour', values='hourly_count').fillna(0).unstack().reset_index()
print(new_df)
Output:
hour sensor_id 0
0 1 101 651.0
1 1 102 19.0
2 2 101 423.0
3 2 102 12.0
4 3 101 356.0
5 3 102 0.0
6 4 101 79.0
7 4 102 21.0
8 5 101 129.0
9 5 102 0.0
10 6 101 561.0
11 6 102 0.0
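Note that the value column comes out named 0 after reset_index (the unstacked Series has no name); a possible cleanup:

new_df = new_df.rename(columns={0: 'hourly_count'})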
Assume the missing rows are for sensor 102 only. One way is to create a new df with all combinations of hours and sensor ids, then left-merge the original df onto it to get hourly_count and fill the NaNs:
a = df.hour.unique()
df1 = pd.MultiIndex.from_product([a, [101, 102]]).to_frame(index=False, name=['hour', 'sensor_id'])
df1
Out[157]:
hour sensor_id
0 1 101
1 1 102
2 2 101
3 2 102
4 3 101
5 3 102
6 4 101
7 4 102
8 5 101
9 5 102
10 6 101
11 6 102
df1.merge(df, on=['hour','sensor_id'], how='left').fillna(0)
Out[161]:
hour sensor_id hourly_count
0 1 101 651.0
1 1 102 19.0
2 2 101 423.0
3 2 102 12.0
4 3 101 356.0
5 3 102 0.0
6 4 101 79.0
7 4 102 21.0
8 5 101 129.0
9 5 102 0.0
10 6 101 561.0
11 6 102 0.0
Another way: use unstack with fill_value:
df.set_index(['hour', 'sensor_id']).unstack(fill_value=0).stack().reset_index()
Out[171]:
hour sensor_id hourly_count
0 1 101 651
1 1 102 19
2 2 101 423
3 2 102 12
4 3 101 356
5 3 102 0
6 4 101 79
7 4 102 21
8 5 101 129
9 5 102 0
10 6 101 561
11 6 102 0
I have df
number A B C
123 10 10 1
123 10 11 1
123 18 27 1
456 10 18 2
456 42 34 2
789 13 71 3
789 19 108 3
789 234 560 4
and second df
number A B
123 18 27
456 32 19
789 234 560
I need this: if number, A, B in the first df match a row in the second df, add that row to a new df, and also add every other row whose C value equals the C of a matched row.
Desire output
number A B C
123 10 10 1
123 10 11 1
123 18 27 1
789 234 560 4
How can I write this condition?
One way is to give df2 a dummy column:
In [11]: df2["in_df2"] = True
then you can do the merge:
In [12]: df1.merge(df2, how="left")
Out[12]:
number A B C in_df2
0 123 10 10 1 NaN
1 123 10 11 1 NaN
2 123 18 27 1 True
3 456 10 18 2 NaN
4 456 42 34 2 NaN
5 789 13 71 3 NaN
6 789 19 108 3 NaN
7 789 234 560 4 True
Now, we only want those groups which contain a True:
In [13]: df1.merge(df2, how="left").groupby(["number", "C"]).filter(lambda x: x["in_df2"].any())
Out[13]:
number A B C in_df2
0 123 10 10 1 NaN
1 123 10 11 1 NaN
2 123 18 27 1 True
7 789 234 560 4 True
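As a side note, merge has a built-in indicator parameter that avoids the dummy column; a possible variant of the same idea:

merged = df1.merge(df2, how="left", indicator=True)
out = (merged.groupby(["number", "C"])
             .filter(lambda x: x["_merge"].eq("both").any())
             .drop(columns="_merge"))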