My first data frame has several columns, one of which identifies a practice, and my second data frame has several columns, one of which contains a practice number, so I have found the link between the two. However, how can I join them on that number so that the postcode information from data frame 2 is assigned to the correct practice in data frame 1?
Any help would be greatly appreciated!
Data frame 1
ID Practice Items Cost
0 5 10 2001.00
1 12 2 20.98
2 2 4 100.80
3 7 7 199.60
Data frame 2
ID Prac No Dr Postcode
0 1 Dr.K BT94 7HX
1 5 Dr.H BT7 4MC
2 3 Dr.Love BT9 1HE
3 7 Dr.Kerr BT72 4TX
I want to create a new column 'Postcode' in data frame 1 and assign to each row the postcode of its practice:
ID Practice Items Cost Postcode
0 5 10 2001.00 BT7 4MC
1 12 2 20.98 NaN
2 2 4 100.80 NaN
3 7 7 199.60 BT72 4TX
How can I do this?
IIUC, what you are looking for is the 'left_on' and 'right_on' parameters of merge:
df1.merge(df2, left_on='Practice', right_on='Prac No')
Output:
ID_x Practice Items Cost ID_y Prac No Dr Postcode
0 0 5 10 2001.0 1 5 Dr.H BT7 4MC
1 3 7 7 199.6 3 7 Dr.Kerr BT72 4TX
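Note that merge defaults to an inner join, so rows of df1 without a matching Prac No (Practice 12 and 2 here) are dropped. A minimal sketch, not from the original answer, that keeps every row of df1 and brings over only the Postcode column:
# how='left' keeps all rows of df1; unmatched practices get NaN postcodes
df1.merge(df2[['Prac No', 'Postcode']],
          left_on='Practice', right_on='Prac No',
          how='left').drop(columns='Prac No')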
Or another way is to use set_index and map:
df1['Postcode'] = df1['Practice'].map(df2.set_index('Prac No')['Postcode'])
df1
Output:
ID Practice Items Cost Postcode
0 0 5 10 2001.00 BT7 4MC
1 1 12 2 20.98 NaN
2 2 2 4 100.80 NaN
3 3 7 7 199.60 BT72 4TX
Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it by numbering the records within each group after a groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? And is there a more elegant way to number the records within each group (like the SQL window function row_number())?
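As an aside (my addition, not from the thread): groupby().cumcount() is the closest pandas analogue of SQL's row_number(), so the filtering can also be sketched as follows, where 'rn' is just a throwaway column name:
df['rn'] = df.groupby('id').cumcount()   # 0-based row number within each id
df[df['rn'] < 2][['id', 'value']]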
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
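Following up on the note above about sorting first: if you want the two largest values per id rather than the first two rows in their current order, one possible sketch (not from the original answer) is to sort before grouping:
(df.sort_values(['id', 'value'], ascending=[True, False])
   .groupby('id')
   .head(2)
   .reset_index(drop=True))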
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
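If you need the full rows rather than just the value column, one hedged sketch (my addition, assuming the original index labels are unique) is to use the second index level that nlargest keeps, which holds the original row labels:
idx = df.groupby('id')['value'].nlargest(2).index.get_level_values(1)
df.loc[idx]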
Sometimes sorting all of the data up front is very time consuming.
We can group first and take the top k within each group:
topk = 2  # number of rows to keep per group
g = df.groupby('id').apply(lambda x: x.nlargest(topk, 'value')).reset_index(drop=True)
df.groupby('id').apply(lambda x: x.sort_values(by='value', ascending=False).head(2).reset_index(drop=True))
Here, sort_values with ascending=False behaves like nlargest, while ascending=True behaves like nsmallest.
The number passed to head is the same as the number passed to nlargest: how many rows to keep for each group.
The reset_index is optional and not necessary.
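A small variant of the same idea (my addition, not from the original answer): passing group_keys=False to groupby stops the group key from being added as an extra index level, so no reset_index is needed at all:
df.groupby('id', group_keys=False).apply(lambda x: x.sort_values('value', ascending=False).head(2))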
This works for duplicated values
If you have duplicated values among the top-n values and want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, the top 3 salaries for the Audit department are 110k, 100k and 100k.
If we want de-duplicated salaries for each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First, sort by "id" and "value" (making sure to sort "id" in ascending order and "value" in descending order via the ascending parameter), then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# to keep the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# to keep only a specific column
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here. On a sample with 100k rows and 8000 groups, a %timeit test showed they were 24-150 times faster than those solutions.
Instead of slicing, you can also pass a list/tuple/range to .nth():
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
I would like to create a new column (Col_Val) in my DataFrame (Global_Dataset) based on another DataFrame (List_Data).
I need faster code because Global_Dataset has 2 million rows and List_Data has 50,000 rows.
Col_Val must contain the value from the Value column of List_Data whose Key matches Col_Key.
List_Data:
id Key Value
1 5 0
2 7 1
3 9 2
Global_Dataset:
id Col_Key Col_Val
1 9 2
2 5 0
3 9 2
4 7 1
5 7 1
6 5 0
7 9 2
8 7 1
9 9 2
10 5 0
I have tried this code, but it takes a very long time to run. Is there a faster way to achieve my goal?
Col_Val = []
for i in range(len(List_Data)):
    for j in range(len(Global_Data)):
        if List_Data.get_value(i, "Key") == Global_Data.get_value(j, 'Col_Key'):
            Col_Val.append(List_Data.get_value(i, 'Value'))
Global_Data['Col_Val'] = Col_Val
PS: I have tried loc and iloc instead of get_value, but they are also very slow.
Try this:
data_dict = {key : value for key, value in zip(List_Data['Key'], List_Data['Value'])}
Global_Data['Col_Val'] = pd.Series([data_dict[key] for key in Global_Data['Col_Key']])
I don't know how long it will take on your machine with the amount of data you need to handle, but it should be faster than what you are using now.
You could also generate the dictionary with data_dict = {row['Key']: row['Value'] for _, row in List_Data.iterrows()}, but on my machine that is slower than what I proposed above.
It works under the assumption that all the keys in Global_Data['Col_Key'] are present in List_Data['Key']; otherwise you will get a KeyError.
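A possible alternative (my addition, not from the original answer) is Series.map, which performs the same lookup and fills NaN for missing keys instead of raising:
# map Col_Key through a Key -> Value lookup built from List_Data
Global_Data['Col_Val'] = Global_Data['Col_Key'].map(List_Data.set_index('Key')['Value'])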
There is no reason to loop through anything, either manually or with iterrows. If I understand your problem, this should be a simple merge operation.
df
Key Value
id
1 5 0
2 7 1
3 9 2
global_df
Col_Key
id
1 9
2 5
3 9
4 7
5 7
6 5
7 9
8 7
9 9
10 5
global_df.reset_index()\
.merge(df, left_on='Col_Key', right_on='Key')\
.drop('Key', axis=1)\
.set_index('id')\
.sort_index()
Col_Key Value
id
1 9 2
2 5 0
3 9 2
4 7 1
5 7 1
6 5 0
7 9 2
8 7 1
9 9 2
10 5 0
Note that the essence of this is the global_df.merge(...), but the extra operations are to keep the original indexing and remove unwanted extra columns. I encourage you to try each step individually to see the results.
I have been struggling to merge data frames. I need the rows arranged by time, with the columns from both data sets combined into a new data frame. I'm sorry if this is clearly documented somewhere, but it has been hard for me to find an appropriate method. I tried append and merge, but I could not find an appropriate solution.
dataframe1:
# Date Time, GMT-07:00 Crossflow (Cold) (Volts) \
0 1 8:51:00 AM 1.13431
1 2 8:51:01 AM 1.12821
2 3 8:51:02 AM 1.12943
3 4 8:51:03 AM 1.12759
4 5 8:51:04 AM 1.13065
5 6 8:51:05 AM 1.12821
6 7 8:51:06 AM 1.12943
7 8 8:51:07 AM 1.13065
8 9 8:51:08 AM 1.13126
9 10 8:51:09 AM 1.13126
10 11 8:51:10 AM 1.12821
dataframe2:
# Date Time, GMT-07:00 \
0 1 9:06:39 AM
1 2 9:06:40 AM
2 3 9:06:41 AM
3 4 9:06:42 AM
4 5 9:06:43 AM
5 6 9:06:44 AM
6 7 9:06:45 AM
7 8 9:06:46 AM
8 9 9:06:47 AM
9 10 9:06:48 AM
10 11 9:06:49 AM
K-Type, °C (LGR S/N: 10118625, SEN S/N: 10118625)
0 43.96
1 47.25
2 48.90
3 50.21
4 43.63
5 43.63
6 42.97
7 42.97
8 42.30
9 41.64
10 40.98
It appears that you want to append the dataframes to each other. Make sure that your date column has the same name in both dataframes; otherwise pandas will treat them as two totally separate columns.
df = dataframe1.append(dataframe2, ignore_index=True)
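One caveat worth noting (my addition): DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the equivalent is pd.concat:
import pandas as pd

# same result as append(ignore_index=True): stack the rows and renumber the index
df = pd.concat([dataframe1, dataframe2], ignore_index=True)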
I've got a "Perf" dataframe with people performance data over time.
The index is a timstamp and the columns are the persons name.
There are 100 persons (columns) and each person belongs to one of 10 groups, however the group assignment is dynamic, everyday each person could be assigned to a different group.
So there is a second "Group" DataFrame of the same shape than "Perf" that contains group number (0-9) for each timestamp and person.
The question is how can I elegantly do a mean subtraction everyday for each person with regards to its group assignment?
One method that is really slow is:
for g in range(10):
    Perf[Group == g] -= Perf[Group == g].mean(1)
But this is really slow, and I'm sure there is a way to do it in one shot with pandas.
Here is a concrete example.
perf holds the score of each of the 5 persons (columns 0-4) for 10 days (rows 0-9):
In [8]: perf = DataFrame(np.random.randn(10,5))
In [9]: perf
Out[9]:
0 1 2 3 4
0 0.945575 -0.805883 1.338865 0.420829 -1.074329
1 -1.086116 0.430230 1.296153 0.527612 1.269646
2 0.705276 -1.409828 2.859838 -0.769508 1.520295
3 0.331860 -0.217884 0.962576 -0.495888 -1.083996
4 0.402625 0.018885 -0.260516 -0.547802 -0.995959
5 2.168944 -0.361657 0.184537 0.391014 0.972161
6 1.959699 0.590739 -0.781736 1.059761 1.080997
7 2.090273 -2.446399 0.553785 0.806368 -0.786343
8 0.441160 -2.320302 -1.981387 2.190607 0.345626
9 -0.276013 -1.319214 1.339096 0.269680 -0.509884
Then I've got a grouping DataFrame that, for each day, shows the group assignment of each of the 5 persons; the grouping changes every day.
In [20]: grouping
Out[20]:
0 1 2 3 4
0 3 1 2 1 2
1 3 1 2 2 1
2 2 2 3 1 1
3 1 2 2 3 1
4 3 2 1 2 1
5 2 1 1 2 3
6 1 2 1 2 3
7 2 2 1 1 3
8 2 1 2 1 3
9 1 3 2 1 2
I want to modify perf such that, for each day, I subtract from each person the mean score of that person's group on that day.
For example, for day 0 the result would be: 0.0 -0.613356 1.206597 0.613356 -1.206597
I want to do it in one line without loops. groupby seems to be the function to use, but I couldn't efficiently use its output to perform the mean subtraction on the original matrix.
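A hedged sketch of one possible vectorized approach (my own, not from the thread), assuming perf and grouping share the same index and columns: stack both frames to long form, group by day and group label, and subtract a transformed mean:
# perf and grouping are the DataFrames from the question
p = perf.stack()       # long form: one score per (day, person)
g = grouping.stack()   # matching group label per (day, person)
# mean of each (day, group) pair, broadcast back via transform, then subtracted
demeaned = (p - p.groupby([p.index.get_level_values(0), g]).transform('mean')).unstack()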
I am trying to add several columns of data to an existing dataframe. The dataframe itself was built from a number of other dataframes, which I successfully joined on their identical indices. For that, I used code like this:
data = p_data.join(r_data)
I actually joined these on a multi-index, so the dataframe looks something like the following, where Name1 and Name2 are the index levels:
Name1 Name2 present r behavior
a 1 1 0 0
2 1 .5 2
4 3 .125 1
b 2 1 0 0
4 5 .25 4
8 1 0 1
So the Name1 index does not repeat data, but the Name2 index does (I'm using this to keep track of dyads, so that Name1 & Name2 together are only represented once). What I now want to add are 4 columns of data that correspond to Name2 (information on the second member of the dyad). Unlike the "present", "r" and "behavior" data, these data are per individual, not per dyad. So I don't need to consider Name1 data when merging.
The problem is that while Name2 data are repeated to exhaust the dyad combos, the "Name2" column in the data I would now like to add only has one piece of data per Name2 individual:
Name2 Data1 Data2 Data3
1 80 6 1
2 61 8 3
4 45 7 2
8 30 3 6
What I would like the output to look like:
Name1 Name2 present r behavior Data1 Data2 Data3
a 1 1 0 0 80 6 1
2 1 .5 2 61 8 3
4 3 .125 1 45 7 2
b 2 1 0 0 61 8 3
4 5 .25 4 45 7 2
8 1 0 1 30 3 6
Despite reading the documentation, I am not clear on whether I can use join() or merge() for the desired outcome. If I try a join to the existing dataframe like the simple one I've used previously, I end up with the new columns but they are full of NaN values. I've also tried various combinations using Name1 and Name2 as either columns or as indices, with either join or merge (not as random as it sounds, but I'm clearly not interpreting the documentation correctly!). Your help would be very much appreciated, as I am presently very much lost.
I'm not sure if this is the best way, but you could use reset_index to temporarily make your original DataFrame indexed by Name2 only. Then you could perform the join as usual. Then use set_index to again make Name1 part of the MultiIndex:
import pandas as pd
df = pd.DataFrame({'Name1':['a','a','a','b','b','b'],
'Name2':[1,2,4,2,4,8],
'present':[1,1,3,1,5,1]})
df.set_index(['Name1','Name2'], inplace=True)
df2 = pd.DataFrame({'Data1':[80,61,45,30],
'Data2':[6,8,7,3]},
index=pd.Series([1,2,4,8], name='Name2'))
result = df.reset_index(level=0).join(df2).set_index('Name1', append=True)
print(result)
# present Data1 Data2
# Name2 Name1
# 1 a 1 80 6
# 2 a 1 61 8
# b 1 61 8
# 4 a 3 45 7
# b 5 45 7
# 8 b 1 30 3
To make the result look even more like your desired DataFrame, you could reorder and sort the index:
print(result.reorder_levels([1, 0], axis=0).sort_index())
# present Data1 Data2
# Name1 Name2
# a 1 1 80 6
# 2 1 61 8
# 4 3 45 7
# b 2 1 61 8
# 4 5 45 7
# 8 1 30 3