Merging two dataframes while considering overlaps and missing indexes [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have multiple dataframes that each have an ID and a value, and I am trying to merge them so that each ID has all of its values in its row.
ID Value
1 10
3 21
4 12
5 43
7 11
And then I have another dataframe:
ID Value2
1 12
2 14
4 55
6 23
7 90
I want to merge these two so that IDs already in the first dataframe are kept, and any ID that is in the second dataframe but not in the first is added as a new row with Value2 filled in and Value left empty. This is what my result would look like:
ID Value Value2
1 10 12
3 21 -
4 12 55
5 43 -
7 11 90
2 - 14
6 - 23
Hope this makes sense. I don't really care about the order of the ID numbers; they can be sorted or not. My goal is to create a dictionary for each ID with "Value", "Value2", "Value3", ... as keys and the corresponding numbers as the values. Please let me know if any clarification is needed.

You can use pandas' merge method:
import pandas as pd
df1.merge(df2, how='outer', on='ID')
Specifying how='outer' uses the union of keys from both dataframes, so IDs that appear in only one of them are kept and their missing values become NaN.
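As a minimal sketch of the outer merge on the data from the question, followed by one possible way to build the per-ID dictionaries the asker mentions. The name records and the choice to drop missing values from each dictionary are assumptions for illustration, not something the original answer specifies.

```python
import pandas as pd

# Data copied from the question
df1 = pd.DataFrame({"ID": [1, 3, 4, 5, 7], "Value": [10, 21, 12, 43, 11]})
df2 = pd.DataFrame({"ID": [1, 2, 4, 6, 7], "Value2": [12, 14, 55, 23, 90]})

# Outer join: keep every ID found in either dataframe
merged = df1.merge(df2, how="outer", on="ID")

# One dictionary per ID, leaving out the values that are missing (NaN)
records = {
    row_id: row.dropna().to_dict()
    for row_id, row in merged.set_index("ID").iterrows()
}
print(merged)
print(records)
```

Because missing entries are stored as NaN, the Value and Value2 columns come back as floats. With more than two dataframes, the same merge call can be applied repeatedly, one dataframe at a time.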

Related

drop rows based on a condition based on another [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 6 months ago.
I have the following data frame
user_id value
1 5
1 7
1 11
1 15
1 35
2 8
2 9
2 14
I want to drop all rows that do not hold the maximum value for their user_id, resulting in a 2-row data frame:
user_id value
1 35
2 14
How can I do that?
You can take the max of the value column after grouping.
Assuming that your original dataframe is named df, try the code below:
out = df.groupby('user_id', as_index=False)['value'].max()
print(out)
Edit:
If you want to group by more than one column (for example user_id and sex), use this:
out = df.groupby(['user_id', 'sex'], as_index=False, sort=False)['value'].max()
print(out)
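Taking the max after a groupby computes the maximum of each column separately, which is fine for a two-column frame. If the dataframe has further columns and the whole row holding the maximum should be kept, one alternative (not part of the answer above) is to look up the row labels with idxmax. A minimal sketch on the question's data:

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 2, 2, 2],
    "value": [5, 7, 11, 15, 35, 8, 9, 14],
})

# idxmax gives, per user_id, the index label of the row with the largest value;
# .loc then selects those complete rows
out = df.loc[df.groupby("user_id")["value"].idxmax()].reset_index(drop=True)
print(out)
```

If several rows tie for the maximum within a group, idxmax keeps only the first of them.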

Populate Dataframe column from information in other Dataframe [duplicate]

This question already has answers here:
Pandas: how to merge two dataframes on a column by keeping the information of the first one?
(4 answers)
Closed 1 year ago.
I have two dataframes. One (A) contains the notes associated with certain accounts. The other (B) is a list of accounts to which I wish to add a column containing the note for each account. Sometimes an account number in dataframe B is not in dataframe A, and in that case I would like to fill the note with either NaN or 0.
Input:
Dataframe A:
Account Note
11 a
12 b
13 c
14 d
15 e
16 f
Dataframe B:
Account
11
25
42
14
15
19
26
Desired Output:
Dataframe C:
Account Note
11 a
25
42
14 d
15 e
19
26
Note that in my real-world example Dataframe B will be much bigger than A.
Try merge with how='left' and on='Account':
>>> df_b.merge(df_a, how='left', on='Account')
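A minimal sketch of that left merge on the question's data. The names df_a, df_b, and df_c follow the answer and the question; the frames themselves are rebuilt here by hand.

```python
import pandas as pd

# Data copied from the question
df_a = pd.DataFrame({"Account": [11, 12, 13, 14, 15, 16], "Note": list("abcdef")})
df_b = pd.DataFrame({"Account": [11, 25, 42, 14, 15, 19, 26]})

# Left join: keep every row of df_b in its original order and
# pull in Note where the account exists in df_a
df_c = df_b.merge(df_a, how="left", on="Account")
print(df_c)
```

Accounts that are missing from df_a end up with NaN in the Note column, which matches the "fill with NaN" option in the question.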

How can I filter a table based in two values at same time? [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have two dataframes, and I want to filter one of them by dropping every row whose pair of values is not present in the other. Both dataframes share the same column names.
For example, dataframe A has:
col1,col2
1 5
-10 15
6 7
and dataframe B has:
col1,col2
6 7
-10 15
-1 5
So in this example, I would like to take each value pair in A and check whether it is present in B.
The first row of A has the value pair 1,5, and since 1,5 is not present in B, that row would be excluded from A.
The second and third rows of A have the values -10,15 and 6,7, and since both are present in B, I would like to keep them.
So the desired output of the filtered table A would be:
col1,col2
-10 15
6 7
How can I achieve this?
EDIT: One of the first things I tried was a merge, but the resulting dataframe was actually bigger than the original. Since merge and the Merging 101 topic were suggested, I will add the real dataframes here.
Dataframe A has latitude, longitude and id columns (id is not the index). It has 363 rows:
id lat lon
0 0 -33.252192 -70.765291
1 1 -33.224300 -70.780249
2 2 -33.251651 -70.797289
3 3 -33.298574 -70.770133
4 4 -33.214315 -70.787822
... ... ... ...
358 499 -33.227614 -70.770126
359 501 -33.299217 -70.770685
360 502 -33.191476 -70.801492
361 503 -33.239037 -70.780278
362 504 -33.263893 -70.762674
Dataframe B has 73096 rows, and it also has id, latitude and longitude columns. I'm showing only lat and lon here.
lat lon
1 -33.260415 -70.713767
2 -33.461718 -70.853525
3 -33.258741 -70.638032
4 -33.544858 -70.578624
8 -33.535512 -70.574188
... ... ...
97724 -33.451817 -70.847999
97725 -33.452225 -70.846520
97726 -33.450841 -70.841494
97729 -33.461407 -70.856090
97730 -33.457633 -70.822085
So I want to see if the lat,lon pair in A is present in B and if not then exclude it from A.
When I do A.merge(B) I get a dataframe that is 1108 rows long.
You can try pandas.merge, for example df1.merge(df2, how='inner', left_on=['col1','col2'], right_on=['col1','col2']).
(To help you remember, the name of the how argument comes from the inner join in database terminology.)
A simple merge will do:
df_out = dfA.merge(dfB)
Output:
col1 col2
0 -10 15
1 6 7
DataFrame.merge does an inner join on all shared column names by default.
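The question's edit reports that the merged result was larger than A. A common cause is that an inner merge returns one row per matching pair, so a key that occurs several times in B multiplies the matching row of A. It is also worth noting that A and B share an id column in the real data, and a merge without on= would use it as a key too. The sketch below assumes duplicated pairs are the cause; the repeated (6, 7) row in dfB is hypothetical data added to reproduce the effect.

```python
import pandas as pd

dfA = pd.DataFrame({"col1": [1, -10, 6], "col2": [5, 15, 7]})
# Hypothetical B in which the pair (6, 7) appears twice
dfB = pd.DataFrame({"col1": [6, 6, -10, -1], "col2": [7, 7, 15, 5]})

# Plain inner merge: the (6, 7) row of A is matched twice
naive = dfA.merge(dfB, on=["col1", "col2"])

# Keep only the key columns of B and drop repeated pairs before merging,
# so each row of A can match at most once
keys = dfB[["col1", "col2"]].drop_duplicates()
filtered = dfA.merge(keys, on=["col1", "col2"], how="inner")
print(len(naive), len(filtered))
```

For the real data the key columns would be lat and lon. The merge compares floating-point keys for exact equality, so coordinates that differ only in the last digits will not match.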

groupby with multiple columns with addition and frequency counts in pandas [duplicate]

This question already has answers here:
Multiple aggregations of the same column using pandas GroupBy.agg()
(4 answers)
Closed 4 years ago.
I have a table that looks as follows:
name type val
A online 12
B online 24
A offline 45
B online 32
A offline 43
B offline 44
I want to group the dataframe by the two columns name and type, and add columns that give the number of records in each group (count) and the sum of val over those records. It should look as follows:
name type count val
A online 1 12
offline 2 88
B online 2 56
offline 1 44
I have tried df.groupby(['name', 'type'])['val'].sum(), which gives the sum, but I am unable to add the count of records.
Add the parameter sort=False to groupby to avoid the default sorting, aggregate with agg using tuples of new column names and aggregate functions, and finally call reset_index to turn the MultiIndex into columns:
df1 = (df.groupby(['name', 'type'], sort=False)['val']
         .agg([('count', 'count'), ('val', 'sum')])
         .reset_index())
print(df1)
name type count val
0 A online 1 12
1 B online 2 56
2 A offline 2 88
3 B offline 1 44
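The same result can also be written with pandas' named aggregation, where each keyword argument is an output column name and its value is the aggregate function. This is a sketch of an alternative spelling, not the answer's own code, rebuilt on the question's data:

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "name": ["A", "B", "A", "B", "A", "B"],
    "type": ["online", "online", "offline", "online", "offline", "offline"],
    "val": [12, 24, 45, 32, 43, 44],
})

# keyword = output column name, value = aggregate function
df1 = (
    df.groupby(["name", "type"], sort=False)["val"]
      .agg(count="count", val="sum")
      .reset_index()
)
print(df1)
```

With sort=False the groups appear in order of first occurrence, which is why (A, online) and (B, online) come before the offline groups.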
You can also try a pivot table:
df.pivot_table(index=['name','type'],aggfunc=['count','sum'],values='val')
count sum
val val
name type
A offline 2 88
online 1 12
B offline 1 44
online 2 56

Pandas: Get top 10 values AFTER grouping

I have a pandas data frame with a column 'id' and a column 'value'. It is already sorted by first id (ascending) and then value (descending). What I need is the top 10 values per id.
I assumed that something like the following would work, but it doesn't:
df.groupby("id", as_index=False).aggregate(lambda (index,rows) : rows.iloc[:10])
What I get is just a list of ids; the value column (and the other columns that I omitted from the question) are no longer there.
Any ideas how it might be done, without iterating through each of the single rows and appending the first ten to another data structure?
Is this what you're looking for?
df.groupby('id').head(10)
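A minimal sketch of how head behaves on a frame that is already sorted as the question describes. The data below is hypothetical (3 ids with 12 rows each, plus a made-up extra column) and only serves to show that all columns survive.

```python
import pandas as pd

# Hypothetical data: 3 ids, 12 rows per id, one additional column
df = pd.DataFrame({
    "id": [i for i in range(3) for _ in range(12)],
    "value": list(range(12, 0, -1)) * 3,
    "extra": ["x"] * 36,
})
# Sort as in the question: id ascending, then value descending
df = df.sort_values(["id", "value"], ascending=[True, False])

# head(10) keeps the first 10 rows of every group, with all columns intact
top10 = df.groupby("id").head(10)
print(top10)
```

Unlike aggregate, head does not reduce each group to a single row, so the result is simply a row subset of the original frame in its original order.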
I would like to answer this with an example dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([['a','a','b','c','a','c','b'],[4,6,1,8,9,4,1],[12,11,7,1,5,5,7],[123,54,146,96,10,114,200]]).T,columns=['item','date','hour','value'])
df['value'] = pd.to_numeric(df['value'])
This gives you the dataframe:
item date hour value
a 4 12 123
a 6 11 54
b 1 7 146
c 8 1 96
a 9 5 10
c 4 5 114
b 1 7 200
Now group by item and display the first 2 values of each group:
df.groupby(['item'])['value'].head(2)
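Note that head(2) returns the first two rows of each group in their current order, which are the largest two only if the frame is already sorted by value. When it is not, as in this example frame, one option is to sort first. A minimal sketch using just the item and value columns of the example:

```python
import pandas as pd

# item and value columns from the example dataframe above
df = pd.DataFrame({
    "item": ["a", "a", "b", "c", "a", "c", "b"],
    "value": [123, 54, 146, 96, 10, 114, 200],
})

top2 = (
    df.sort_values("value", ascending=False)  # largest values first
      .groupby("item")
      .head(2)                                # first 2 rows per item = top 2
      .sort_values(["item", "value"], ascending=[True, False])
)
print(top2)
```

The final sort_values is only there to make the output easier to read and can be dropped.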
