Dataframe Merge in Pandas - python

For some reason, I cannot get this merge to work correctly.
This Dataframe (rspars) has 2,000+ rows...
rsparid f1mult f2mult f3mult
0 1 0.318 0.636 0.810
1 2 0.348 0.703 0.893
2 3 0.384 0.777 0.000
3 4 0.296 0.590 0.911
4 5 0.231 0.458 0.690
5 6 0.275 0.546 0.839
6 7 0.248 0.486 0.731
7 8 0.430 0.873 0.000
8 9 0.221 0.438 0.655
9 11 0.204 0.399 0.593
I am trying to join the above, on the rsparid column, to this DataFrame...
line_track line_race rsparid
line_date
2013-03-23 TP 10 1400
2013-02-23 GP 7 634
2013-01-01 GP 7 1508
2012-11-11 AQU 5 96
2012-10-11 BEL 2 161
Using this...
df = pd.merge(datalines, rspars, how='left', on='rsparid')
I get blanks..
line_track line_race rsparid f1mult f2mult f3mult
0 TP 10 1400 NaN NaN NaN
1 TP 10 1400 NaN NaN NaN
2 TP 10 1400 NaN NaN NaN
3 GP 7 634 NaN NaN NaN
4 GP 10 634 NaN NaN NaN
Note, the "datalines" DataFrame can have thousands more rows than rspars, hence the left join. Am I doing something wrong?
I also tried it this way...
df = datalines.merge(rspars, how='left', on='rsparid')
EXAMPLE #2
I dropped the data down to a few rows...
rspars:
rsparid f1mult f2mult f3mult
0 1400 0.216 0.435 0.656
datalines:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
Merging...
datalines.merge(rspars, how='left', on='rsparid')
Output...
rsparid f1mult f2mult f3mult
0 1400 NaN NaN NaN
1 634 NaN NaN NaN
2 1508 NaN NaN NaN
3 96 NaN NaN NaN
4 161 NaN NaN NaN
5 1011 NaN NaN NaN
6 1007 NaN NaN NaN
7 518 NaN NaN NaN
8 1955 NaN NaN NaN
9 678 NaN NaN NaN

The NaNs mean the two frames have no values of rsparid in common. This can be tricky when merging columns that look the same in their repr but have different dtypes.
The repr of small DataFrames with strings (of integers) or integers looks the same and no dtype information is printed when frames are small. You can get this information (and more) for small frames by calling the DataFrame.info() method, like so: df.info(). This will give you a nice summary of what's in the DataFrame and what the dtypes of its columns are:
In [205]: datalines_int = DataFrame({'rsparid':[1400,634,1508,96,161,1011,1007,518,1955,678]})
In [206]: datalines_str = DataFrame({'rsparid':map(str,[1400,634,1508,96,161,1011,1007,518,1955,678])})
In [207]: datalines_int
Out[207]:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
In [208]: datalines_str
Out[208]:
rsparid
0 1400
1 634
2 1508
3 96
4 161
5 1011
6 1007
7 518
8 1955
9 678
In [209]: datalines_int.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid 10 non-null values
dtypes: int64(1)
In [210]: datalines_str.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
rsparid 10 non-null values
dtypes: object(1)
NOTE: You'll notice a slight difference in the reprs here, most likely because of padding of numeric DataFrames. Point is, no one would really be able to see that using this interactively, unless they were specifically looking for the difference.
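A minimal sketch of the check and fix described above, assuming the key in datalines was read as strings while the key in rspars holds integers. The miniature frames and their values are made up for illustration; the `indicator=True` argument adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Hypothetical miniature frames: the keys print identically but differ in dtype.
rspars = pd.DataFrame({"rsparid": [1400], "f1mult": [0.216]})
datalines = pd.DataFrame({"rsparid": ["1400", "634", "1508"]})

# Inspect the key dtypes; the printed frames alone do not reveal the mismatch.
print(datalines["rsparid"].dtype, rspars["rsparid"].dtype)

# Cast the key to a common dtype before merging.
datalines["rsparid"] = datalines["rsparid"].astype("int64")

# indicator=True records whether each row matched ('both') or not ('left_only').
merged = datalines.merge(rspars, how="left", on="rsparid", indicator=True)
print(merged)
```

If the `_merge` column still shows only `left_only` after the cast, the keys really do have no values in common.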

Related

python pandas add new column with values grouped count

I want to add a new column with the number of times the points were over 700 and after the year 2014.
import numpy as np
import pandas as pd
ipl_data = {'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
            'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
df.loc[(df['Points'] > 700) & (df['Year'] > 2014), 'High_points'] = df['Points']
#df['Point_per_year_gr_700']=df.groupby(by='Year')['Points'].transform('count')
df['Point_per_year_gr_700'] = grouped['Points'].agg(np.size)
The end dataframe should look like this, but I can't get the 'Point_per_year_gr_700' column right:
Year Points Point_per_year_gr_700 High_points
0 2014 876 NaN
1 2015 789 3 789.0
2 2014 863 NaN
3 2015 673 NaN
4 2014 741 NaN
5 2015 812 3 812.0
6 2016 756 1 756.0
7 2017 788 1 788.0
8 2016 694 NaN
9 2014 701 NaN
10 2015 804 3 804.0
11 2017 690 NaN
Use where to mask the DataFrame to NaN where your condition isn't met. You can use this to create the High_points column and also to exclude rows that shouldn't count when you groupby year and find how many rows satisfy High_points each year.
df['High_points'] = df['Points'].where(df['Year'].gt(2014) & df['Points'].gt(700))
df['ppy_gt700'] = (df.where(df['High_points'].notnull())
.groupby('Year')['Year'].transform('size'))
Year Points High_points ppy_gt700
0 2014 876 NaN NaN
1 2015 789 789.0 3.0
2 2014 863 NaN NaN
3 2015 673 NaN NaN
4 2014 741 NaN NaN
5 2015 812 812.0 3.0
6 2016 756 756.0 1.0
7 2017 788 788.0 1.0
8 2016 694 NaN NaN
9 2014 701 NaN NaN
10 2015 804 804.0 3.0
11 2017 690 NaN NaN
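As a variation on the approach above (a sketch, not the answerer's code), the same two columns can be built from a single boolean mask: summing the mask within each year counts the qualifying rows, and a final `where` blanks out the rows that do not qualify.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
    "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

# True where the row has more than 700 points and falls after 2014.
mask = (df["Points"] > 700) & (df["Year"] > 2014)

df["High_points"] = df["Points"].where(mask)

# Count the True values per year, then keep the count only on qualifying rows.
df["ppy_gt700"] = mask.groupby(df["Year"]).transform("sum").where(mask)
print(df)
```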

Taking away all previous values in a column in dataframe

I am using some data where I need to find the time difference between each row and all previous rows, i.e. in row 3 I need to know the time between row 2 and row 1 and between row 2 and row 0; in row 5 I need to know the time between row 5 and row 4, row 5 and row 3, ... row 5 and row 0. I then want to have one big dataframe with all these differences in it (as well as the other columns).
I have made a test dataframe for this
data = {'random': [1, 3, 9, 3, 4, 7, 8, 10],
'timestamp': [2, 138, 157, 232, 245, 302, 323, 379]}
df = pd.DataFrame(data)
I then tried to do
for i in range(0,len(df-1)):
    difference = df.timestamp.diff(periods=i+1)
    print(difference)
This iterates over the lags, subtracting the previous row on the first iteration, the row two back on the second iteration, etc.
I am stuck on how to combine this into one large dataframe after all the iterations AND how to make sure the loop uses the original dataframe at the start of each iteration (not the dataframe from the previous iteration).
This is what is being outputted
0 NaN
1 136.0
2 19.0
3 75.0
4 13.0
5 57.0
6 21.0
7 56.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 155.0
3 94.0
4 88.0
5 70.0
6 78.0
7 77.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 230.0
4 107.0
5 145.0
6 91.0
7 134.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 243.0
5 164.0
6 166.0
7 147.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 300.0
6 185.0
7 222.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 321.0
7 241.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 377.0
Name: timestamp, dtype: float64
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
Name: timestamp, dtype: float64
If anyone knows how to solve this that would be great :)
Here is one way of solving the problem with Series.expanding:
df['diff'] = [list(s.iat[-1] - s[-2::-1]) for s in df['timestamp'].expanding(1)]
random timestamp diff
0 1 2 []
1 3 138 [136]
2 9 157 [19, 155] #--> 157-138, 157-2
3 3 232 [75, 94, 230] #--> 232-157, 232-138, 232-2
4 4 245 [13, 88, 107, 243]
5 7 302 [57, 70, 145, 164, 300]
6 8 323 [21, 78, 91, 166, 185, 321]
7 10 379 [56, 77, 134, 147, 222, 241, 377]
I may be misunderstanding what you mean but if you're asking how to collect these differences together:
differences = [df.timestamp.diff(periods=i+1) for i in range(0,len(df-1))]
differences = pd.concat(differences)
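By default `pd.concat` stacks the Series end to end. A minimal sketch of one way to get a single wide frame instead, with one column per lag, is to concatenate along `axis=1`; the `lag_k` column names are an assumption made for illustration. Each `diff` is computed from the original column, so no iteration depends on the previous one.

```python
import pandas as pd

df = pd.DataFrame({
    "random": [1, 3, 9, 3, 4, 7, 8, 10],
    "timestamp": [2, 138, 157, 232, 245, 302, 323, 379],
})

# Lags 1 .. len(df)-1; a lag of len(df) would be all NaN, so it is skipped.
lags = range(1, len(df))

# One Series per lag, placed side by side as columns.
diffs = pd.concat(
    [df["timestamp"].diff(periods=k) for k in lags],
    axis=1,
    keys=[f"lag_{k}" for k in lags],  # hypothetical column names
)

# Attach the differences to the original columns.
wide = df.join(diffs)
print(wide)
```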
I also may be misunderstanding, but this is the best representation I could think of from what you described:
>>> df2 = df.copy()
>>> for i in df2.timestamp:
...     df2[i] = df2['timestamp'] - i
>>> df2
random timestamp 2 138 157 232 245 302 323 379
0 1 2 0 -136 -155 -230 -243 -300 -321 -377
1 3 138 136 0 -19 -94 -107 -164 -185 -241
2 9 157 155 19 0 -75 -88 -145 -166 -222
3 3 232 230 94 75 0 -13 -70 -91 -147
4 4 245 243 107 88 13 0 -57 -78 -134
5 7 302 300 164 145 70 57 0 -21 -77
6 8 323 321 185 166 91 78 21 0 -56
7 10 379 377 241 222 147 134 77 56 0

Subtracting fix date from whole panda data frame - python

I have data
customer_id purchase_amount date_of_purchase
0 760 25.0 06-11-2009
1 860 50.0 09-28-2012
2 1200 100.0 10-25-2005
3 1420 50.0 09-07-2009
4 1940 70.0 01-25-2013
5 1960 40.0 10-29-2013
6 2620 30.0 09-03-2006
7 3050 50.0 12-04-2007
8 3120 150.0 08-11-2006
9 3260 45.0 10-20-2010
10 3510 35.0 04-05-2013
11 3970 30.0 07-06-2007
12 4000 20.0 11-25-2005
13 4180 20.0 09-22-2010
14 4390 30.0 04-15-2011
15 4750 60.0 02-12-2013
16 4840 30.0 10-14-2005
17 4910 15.0 12-13-2006
18 4950 50.0 05-19-2010
19 4970 30.0 01-12-2006
20 5250 50.0 12-20-2005
Now I want to find the number of days between 01-01-2016 and each row of date_of_purchase.
I tried the following, expecting to get a new column days_since with the number of days:
NOW = pd.to_datetime('01/01/2016').strftime('%m-%d-%Y')
gb = customer_purchases_df.groupby('customer_id')
df2 = gb.agg({'date_of_purchase': lambda x: (NOW - x.max()).days})
Any suggestions on how I can achieve this?
Thanks in advance.
pd.to_datetime(df['date_of_purchase']).rsub(pd.to_datetime('2016-01-01')).dt.days
0 2395
1 1190
2 3720
3 2307
4 1071
5 794
6 3407
7 2950
8 3430
9 1899
10 1001
11 3101
12 3689
13 1927
14 1722
15 1053
16 3731
17 3306
18 2053
19 3641
20 3664
Name: date_of_purchase, dtype: int64
I'm assuming the 'date_of_purchase' column already has the datetime dtype.
>>> df
customer_id purchase_amount date_of_purchase
0 760 25.0 2009-06-11
1 860 50.0 2012-09-28
2 1200 100.0 2005-10-25
>>> df['days_since'] = df['date_of_purchase'].sub(pd.to_datetime('01/01/2016')).dt.days.abs()
>>> df
customer_id purchase_amount date_of_purchase days_since
0 760 25.0 2009-06-11 2395
1 860 50.0 2012-09-28 1190
2 1200 100.0 2005-10-25 3720
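The attempt in the question groups by customer and fails because `strftime` turns NOW into a string. A minimal sketch of that per-customer version, assuming the goal is the number of days since each customer's most recent purchase (only the first three rows of the data are used here):

```python
import pandas as pd

customer_purchases_df = pd.DataFrame({
    "customer_id": [760, 860, 1200],
    "purchase_amount": [25.0, 50.0, 100.0],
    "date_of_purchase": ["06-11-2009", "09-28-2012", "10-25-2005"],
})

# Parse with an explicit format so month and day are not guessed.
customer_purchases_df["date_of_purchase"] = pd.to_datetime(
    customer_purchases_df["date_of_purchase"], format="%m-%d-%Y"
)

# Keep the reference date as a Timestamp; a string cannot be subtracted from dates.
NOW = pd.Timestamp("2016-01-01")

# Most recent purchase per customer, then the elapsed whole days.
last_purchase = customer_purchases_df.groupby("customer_id")["date_of_purchase"].max()
days_since = (NOW - last_purchase).dt.days
print(days_since)
```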

Frequency count unique values Pandas

I have a Pandas Series as follows:
2014-05-24 23:59:49 1.3
2014-05-24 23:59:50 2.17
2014-05-24 23:59:50 1.28
2014-05-24 23:59:51 1.30
2014-05-24 23:59:51 2.17
2014-05-24 23:59:53 2.17
2014-05-24 23:59:58 2.17
Name: api_id, Length: 483677
I'm trying to count the frequency of each id per day.
For now I'm doing this:
count = {}
for x in apis.unique():
    count[x] = apis[apis == x].resample('D').count()
count_df = pd.DataFrame(count)
That gives me what I want, which is:
... 2.13 2.17 2.4 2.6 2.7 3.5(user) 3.9 4.2 5.1 5.6
timestamp ...
2014-05-22 ... 391 49962 3727 161 2 444 113 90 1398 90
2014-05-23 ... 450 49918 3861 187 1 450 170 90 629 90
2014-05-24 ... 396 46359 3603 172 3 513 171 89 622 90
But is there a way to do so without the for loop?
You can use the value_counts function for this, applying it after a groupby (which is similar to the resample('D') you did, but resample expects an aggregated output, so we have to use the more general groupby in this case). With a small example:
In [16]: s = pd.Series([1,1,2,2,1,2,5,6,2,5,4,1], index=pd.date_range('2012-01-01', periods=12, freq='8H'))
In [17]: counts = s.groupby(pd.Grouper(freq='D')).value_counts()
In [18]: counts
Out[18]:
2012-01-01 1 2
2 1
2012-01-02 2 2
1 1
2012-01-03 2 1
6 1
5 1
2012-01-04 1 1
5 1
4 1
dtype: int64
To get this in the desired format, you can just unstack this (move the second level row indices to the columns):
In [19]: counts.unstack()
Out[19]:
1 2 4 5 6
2012-01-01 2 1 NaN NaN NaN
2012-01-02 1 2 NaN NaN NaN
2012-01-03 NaN 1 NaN 1 1
2012-01-04 1 NaN 1 1 NaN
Note: for the use of groupby(pd.Grouper(freq='D')) you need pandas 0.14. If you have an older version, you can use groupby(pd.TimeGrouper(freq='D')) to obtain exactly the same result. This is also similar to doing groupby(s.index.date) (with the difference that you then have datetime.date objects in the index).
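The NaN cells in the unstacked table are day/value combinations that never occurred. A small sketch, using the same example series, of how they can be shown as zero counts by passing `fill_value=0` to `unstack`:

```python
import pandas as pd

# Same values as the example above, spaced 8 hours apart.
s = pd.Series(
    [1, 1, 2, 2, 1, 2, 5, 6, 2, 5, 4, 1],
    index=pd.date_range("2012-01-01", periods=12, freq=pd.Timedelta(hours=8)),
)

# Count each value per day, then pivot the values into columns,
# filling combinations that never occurred with 0 instead of NaN.
counts = s.groupby(pd.Grouper(freq="D")).value_counts().unstack(fill_value=0)
print(counts)
```

With zeros in place of NaN, the columns keep an integer dtype.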

Reindexing and filling on one level of a hierarchical index in pandas

I have a pandas dataframe with a two level hierarchical index ('item_id' and 'date'). Each row has columns for a variety of metrics for a particular item in a particular month. Here's a sample:
total_annotations unique_tags
date item_id
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2008-07-01 2 81 33
2008-11-01 2 82 34
2009-04-01 2 84 35
2010-03-01 2 90 35
2010-04-01 2 100 36
2010-11-01 2 105 40
2011-05-01 2 106 40
2011-07-01 2 108 42
2005-08-01 3 479 200
2005-09-01 3 707 269
2005-10-01 3 980 327
2005-11-01 3 1176 373
2005-12-01 3 1536 438
2006-01-01 3 1854 497
2006-02-01 3 2206 560
2006-03-01 3 2558 632
2007-02-01 3 5650 1019
As you can see, not every item has an observation for every consecutive month. What I want to do is reindex the dataframe such that each item has rows for each month in a specified range. Now, this is easy to accomplish for any given item. So, for item_id 99, for example:
baseDateRange = pd.date_range('2005-07-01','2013-01-01',freq='MS')
data.xs(99,level='item_id').reindex(baseDateRange,method='ffill')
But with this method, I'd have to iterate through all the item_ids, then merge everything together, which seems woefully over-complicated.
So how can I apply this to the full dataframe, ffill-ing the observations (but also the item_id index) such that each item_id has properly filled rows for all the dates in baseDateRange?
Essentially for each group you want to reindex and ffill. The apply gets passed a data frame that has the item_id and date still in the index, so reset, then set and reindex with filling.
idx is your baseDateRange from above.
In [33]: df.groupby(level='item_id').apply(
    lambda x: x.reset_index().set_index('date').reindex(idx,method='ffill')).head(30)
Out[33]:
item_id annotations tags
item_id
2 2005-07-01 NaN NaN NaN
2005-08-01 NaN NaN NaN
2005-09-01 NaN NaN NaN
2005-10-01 NaN NaN NaN
2005-11-01 NaN NaN NaN
2005-12-01 NaN NaN NaN
2006-01-01 NaN NaN NaN
2006-02-01 NaN NaN NaN
2006-03-01 NaN NaN NaN
2006-04-01 NaN NaN NaN
2006-05-01 NaN NaN NaN
2006-06-01 NaN NaN NaN
2006-07-01 NaN NaN NaN
2006-08-01 NaN NaN NaN
2006-09-01 NaN NaN NaN
2006-10-01 NaN NaN NaN
2006-11-01 NaN NaN NaN
2006-12-01 NaN NaN NaN
2007-01-01 NaN NaN NaN
2007-02-01 NaN NaN NaN
2007-03-01 NaN NaN NaN
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2007-07-01 2 36 19
2007-08-01 2 36 19
2007-09-01 2 36 19
2007-10-01 2 36 19
2007-11-01 2 36 19
2007-12-01 2 36 19
Building on Jeff's answer, I consider this to be somewhat more readable. It is also considerably more efficient, since only the droplevel and reindex methods are used.
df = df.set_index(['item_id', 'date'])
def fill_missing_dates(x, idx=all_dates):
    x.index = x.index.droplevel('item_id')
    return x.reindex(idx, method='ffill')
filled_df = (df.groupby('item_id')
               .apply(fill_missing_dates))
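A self-contained sketch of the same idea on a tiny made-up frame, so it can be run as is. The sample values and the four-month `all_dates` range are assumptions for illustration, and the helper uses `droplevel` on a copy rather than assigning to the group's index in place.

```python
import pandas as pd

# Hypothetical sample: two items, each missing some months.
df = pd.DataFrame({
    "item_id": [2, 2, 3, 3],
    "date": pd.to_datetime(["2007-04-01", "2007-06-01", "2007-05-01", "2007-07-01"]),
    "total_annotations": [30, 36, 479, 707],
}).set_index(["item_id", "date"])

# Month-start dates every item should have a row for.
all_dates = pd.date_range("2007-04-01", "2007-07-01", freq="MS", name="date")

def fill_missing_dates(group):
    # Drop the item_id level, then reindex on the dates with forward fill.
    return group.droplevel("item_id").reindex(all_dates, method="ffill")

# The group key is added back as the outer index level.
filled_df = df.groupby(level="item_id").apply(fill_missing_dates)
print(filled_df)
```

Months before an item's first observation stay NaN, since forward fill has nothing earlier to carry forward.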
