Merge DataFrames with ordering criteria - python

In a previous question, I was asking how to match values from this DataFrame source:
   car_id   lat   lon
0     100  10.0  15.0
1     100  12.0  10.0
2     100  13.0   9.0
3     110  23.0   8.0
4     110  13.0   9.0
5     110  12.0  10.0
6     110  12.0   2.0
7     120  11.0  11.0
8     120  12.0  10.0
9     120  13.0   9.0
10    120  14.0   8.0
11    130  12.0  10.0
And keep only those whose coords are in this second DataFrame coords:
    lat   lon
0  12.0  10.0
1  13.0   9.0
But this time I'd like to match each car_id that gets:
all the values from coords
in the same order
So that the resulting DataFrame result would be:
   car_id
1     100
2     120
# 110 has all the values from coords, but not in the same order
# 130 doesn't have all the values from coords
Is there a way to achieve this result in a vectorized way, avoiding going through a lot of loops and conditionals?

plan
We will group by 'car_id' and evaluate each subset. After an inner merge with coords we should see two things:
the resultant merged dataframe should have the same values as coords
the resultant merged dataframe should cover everything

def duper(df):
    m = df.merge(coords)
    c = pd.concat([m, coords])
    # we put the merged rows first and those are
    # the ones we'll keep after `drop_duplicates(keep='first')`
    # `keep='first'` is the default, so I don't pass it
    c1 = (c.drop_duplicates().values == coords.values).all()
    # if `keep=False` then I drop all duplicates. If I got
    # everything in `coords` this should be empty
    c2 = c.drop_duplicates(keep=False).empty
    return c1 & c2

source.set_index('car_id').groupby(level=0).filter(duper).index.unique().values
array([100, 120])
slight alternative
def duper(df):
    # drop(columns=...) replaces the positional axis argument,
    # which was removed in pandas 2.0
    m = df.drop(columns='car_id').merge(coords)
    c = pd.concat([m, coords])
    c1 = (c.drop_duplicates().values == coords.values).all()
    c2 = c.drop_duplicates(keep=False).empty
    return c1 & c2

source.groupby('car_id').filter(duper).car_id.unique()
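Since an inner merge preserves the order of the left frame's keys (per the pandas merge documentation for how='inner'), a slightly more direct order-aware check is also possible. A minimal sketch with the question's data inlined; `ordered_match` is my own helper name, not from the answer above:

```python
import pandas as pd

source = pd.DataFrame({
    'car_id': [100, 100, 100, 110, 110, 110, 110, 120, 120, 120, 120, 130],
    'lat': [10.0, 12.0, 13.0, 23.0, 13.0, 12.0, 12.0, 11.0, 12.0, 13.0, 14.0, 12.0],
    'lon': [15.0, 10.0, 9.0, 8.0, 9.0, 10.0, 2.0, 11.0, 10.0, 9.0, 8.0, 10.0],
})
coords = pd.DataFrame({'lat': [12.0, 13.0], 'lon': [10.0, 9.0]})

def ordered_match(g):
    # the inner merge keeps the left frame's row order, so `sub` is the
    # subsequence of this car's coordinates that also appears in `coords`
    sub = g[['lat', 'lon']].merge(coords).reset_index(drop=True)
    # equal only if the group contains exactly coords' rows, in coords' order
    return sub.equals(coords)

result = source.groupby('car_id').filter(ordered_match).car_id.unique()
print(result)  # [100 120]
```

Note this treats repeated visits to a coords point as a mismatch, which may or may not be the behaviour you want.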

This isn't pretty, but what if you did something like this:
df2 = pd.DataFrame(df, copy=True)
df2[['lat2', 'lon2']] = df[['lat', 'lon']].shift(-1)
df2.set_index(['lat', 'lon', 'lat2', 'lon2'], inplace=True)
print(df2.loc[(12, 10, 13, 9)].reset_index(drop=True))
car_id
0 100
1 120
And this would be the general case:
raw_data = {'car_id': [100, 100, 100, 110, 110, 110, 110, 120, 120, 120, 120, 130],
            'lat': [10, 12, 13, 23, 13, 12, 12, 11, 12, 13, 14, 12],
            'lon': [15, 10, 9, 8, 9, 10, 2, 11, 10, 9, 8, 10],
            }
df = pd.DataFrame(raw_data, columns=['car_id', 'lat', 'lon'])
# note: this coords (the full three-point track of car 100) differs from the
# question's coords, which is why only car 100 matches below
raw_data = {
    'lat': [10, 12, 13],
    'lon': [15, 10, 9],
}
coords = pd.DataFrame(raw_data, columns=['lat', 'lon'])

def submatch(df, match):
    df2 = pd.DataFrame(df['car_id'])
    n = match.shape[0]
    for x in range(n):
        df2[['lat{}'.format(x), 'lon{}'.format(x)]] = df[['lat', 'lon']].shift(-x)
    cols = [item for sublist in
            [['lat{}'.format(x), 'lon{}'.format(x)] for x in range(n)]
            for item in sublist]
    df2.set_index(cols, inplace=True)
    return df2.loc[tuple(match.stack().values)].reset_index(drop=True)

print(submatch(df, coords))
car_id
0 100

Related

Compare two dataframe and conditionally capture random data in Python

The main logic of my question is about comparing two dataframes, but it differs from the existing questions here (Q1, Q2, Q3).
Let's create two dummy dataframes.
import pandas as pd

data1 = {'user': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
         'checkinid': [10, 20, 30, 40, 50, 35, 45, 55, 20, 120, 100, 35, 55, 180, 200, 400],
         'count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
data2 = {'checkinid': [10, 20, 30, 35, 40, 45, 50, 55, 60, 70, 100, 120, 180, 200, 300, 400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
data2 contains the whole set of checkinid values. I am trying to create a training file.
For example, user 1 visited 5 places whose ids are (10, 20, 30, 40, 50).
I want to randomly add places that user 1 did not visit and set their 'count' column to 0.
My expectation dataframe like this
user checkinid count
1 10 1
1 20 1
1 30 1
1 40 1
1 50 1
1 300 0 (add randomly)
1 180 0 (add randomly)
1 55 0 (add randomly)
2 35 1
2 45 1
2 55 1
2 20 1
2 120 1
2 10 0 (add randomly)
2 400 0 (add randomly)
2 180 0 (add randomly)
... ...
Those who read the question may ask how much random data to add.
For this example, adding 3 non-visited places per user is enough.
This might not be the best solution, but it works.
You have to get each user and then pick the checkinids which are not assigned to them:

# get all users
users = df1.user.unique()
for user in users:
    checkins = df1.loc[df1['user'] == user]
    df = (checkins.merge(df2, how='outer', indicator=True)
                  .loc[lambda x: x['_merge'] == 'right_only']
                  .sample(n=3))
    df['user'] = user
    df['count'] = 0
    df.pop("_merge")
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df1 = pd.concat([df1, df], ignore_index=True)

# sort the data frame based on user
df1 = df1.sort_values(by=['user'])
# re-arrange cols
df1 = df1[['user', 'checkinid', 'count']]
print(df1)
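If the per-user loop becomes a bottleneck, the same negative sampling can be expressed with a single groupby. A sketch over the question's data; `random_state=0` is only there to make the sampling reproducible:

```python
import pandas as pd

data1 = {'user': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
         'checkinid': [10, 20, 30, 40, 50, 35, 45, 55, 20, 120, 100, 35, 55, 180, 200, 400],
         'count': [1] * 16}
data2 = {'checkinid': [10, 20, 30, 35, 40, 45, 50, 55, 60, 70, 100, 120, 180, 200, 300, 400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# for each user, sample 3 checkinids they have not visited and mark count=0
neg = (df1.groupby('user')['checkinid']
          .apply(lambda s: df2.loc[~df2['checkinid'].isin(s), 'checkinid']
                              .sample(n=3, random_state=0))
          .reset_index(level=0)      # bring 'user' back as a column
          .assign(count=0))

out = (pd.concat([df1, neg], ignore_index=True)
         .sort_values('user', kind='stable')
         .reset_index(drop=True))
print(out)
```

The stable sort keeps each user's visited rows before the sampled zero-count rows.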

Merge two dataframes based on interval overlap

I have two dataframes A and B:
For example:
import pandas as pd
import numpy as np
In [37]:
A = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200]})
A[["Start","End"]]
Out[37]:
Start End
0 10 11
1 11 11
2 20 35
3 62 70
4 198 200
In [38]:
B = pd.DataFrame({'Start': [8, 5, 8, 60], 'End': [10, 90, 13, 75], 'Info': ['some_info0','some_info1','some_info2','some_info3']})
B[["Start","End","Info"]]
Out[38]:
Start End Info
0 8 10 some_info0
1 5 90 some_info1
2 8 13 some_info2
3 60 75 some_info3
I would like to add the column Info to dataframe A based on whether the interval (Start-End) of A overlaps with an interval of B. In case the A interval overlaps with more than one B interval, the info corresponding to the shorter interval should be added.
I have been looking around for how to manage this issue and I have found somewhat similar questions, but most of their answers use iterrows(), which is not viable in my case as I am dealing with huge dataframes.
I would like something like:
A.merge(B,on="overlapping_interval", how="left")
And then drop duplicates keeping the info coming from the shorter interval.
The output should look like this:
In [39]:
C = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200], 'Info': ['some_info0','some_info2','some_info1','some_info3',np.nan]})
C[["Start","End","Info"]]
Out[39]:
Start End Info
0 10 11 some_info0
1 11 11 some_info2
2 20 35 some_info1
3 62 70 some_info3
4 198 200 NaN
I have found this question really interesting as it suggests the possibility of solving this issue using the pandas Interval object. But after lots of attempts I have not managed to solve it.
Any ideas?
I would suggest writing a function and then applying it to the rows.
First, compute the delta (End - Start) in B for sorting purposes:

B['delta'] = B.End - B.Start

Then a function to get the information:

def get_info(x):
    # fully included
    c0 = (x.Start >= B.Start) & (x.End <= B.End)
    # starts lower, end included
    c1 = (x.Start <= B.Start) & (x.End >= B.Start)
    # start included, ends higher
    c2 = (x.Start <= B.End) & (x.End >= B.End)
    # filter with the conditions and sort by delta
    _B = B[c0 | c1 | c2].sort_values('delta', ascending=True)
    # None if no info corresponds
    return None if len(_B) == 0 else _B.iloc[0].Info

Then you can apply this function to A:

A['info'] = A.apply(get_info, axis='columns')
print(A)
Start End info
0 10 11 some_info0
1 11 11 some_info2
2 20 35 some_info1
3 62 70 some_info3
4 198 200 None
Note: instead of using pd.Interval, build your own conditions. The cx conditions are your interval-overlap definitions; change them to get the exact expected behaviour.
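For what it's worth, the three conditions c0, c1 and c2 together are equivalent to the single standard overlap test (A.Start <= B.End and A.End >= B.Start), which also allows a fully vectorized variant via a cross merge (pandas >= 1.2). A sketch assuming (Start, End) pairs uniquely identify rows of A; beware that the cross product can get large for huge frames:

```python
import pandas as pd

A = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200]})
B = pd.DataFrame({'Start': [8, 5, 8, 60], 'End': [10, 90, 13, 75],
                  'Info': ['some_info0', 'some_info1', 'some_info2', 'some_info3']})

# pair every A row with every B row; B's columns get the '_b' suffix
m = A.merge(B.assign(delta=B.End - B.Start), how='cross', suffixes=('', '_b'))

# standard interval-overlap test
overlap = (m.Start <= m.End_b) & (m.End >= m.Start_b)

# for each A interval keep only the shortest overlapping B interval
best = (m[overlap].sort_values('delta')
                  .drop_duplicates(['Start', 'End'], keep='first'))

C = A.merge(best[['Start', 'End', 'Info']], on=['Start', 'End'], how='left')
print(C)
```

Non-overlapping A rows come back with NaN in Info, matching the expected output.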

How to write a nested query in python pandas?

Hi all, I am new to pandas. I need some help regarding how to write a pandas query for my required output.
I want to retrieve output data like
when 0 < minimum_age < 10, I need to get sum(population) for that 0 to 10 range only
when 10 < minimum_age < 20, I need to get sum(population) for that 10 to 20 range only
and so on
My Input Data Looks Like:
population,minimum_age,maximum_age,gender,zipcode,geo_id
50,30,34,f,61747,8600000US61747
5,85,NaN,m,64120,8600000US64120
1389,10,34,m,95117,8600000US95117
231,5,60,f,74074,8600000US74074
306,22,24,f,58042,8600000US58042
My Code:
import pandas as pd
import numpy as np

# note: a raw string (r"...") is needed here, otherwise sequences like \U
# in the Windows path are treated as escape sequences
df1 = pd.read_csv(r"C:\Users\Rahul\Desktop\Desktop_Folders\Code\Population\population_by_zip_2010.csv")
df2 = df1.set_index("geo_id")
df2['sum_population'] = np.where(df2['minimum_age'] < 10, sum(df2['population']), 0)
print(df2)
You can try pandas cut along with groupby:

(df.groupby(pd.cut(df['minimum_age'], bins=np.arange(0, 100, 10), right=False))
   .population.sum()
   .reset_index(name='sum of population'))
minimum_age sum of population
0 [0, 10) 231.0
1 [10, 20) 1389.0
2 [20, 30) 306.0
3 [30, 40) 50.0
4 [40, 50) NaN
5 [50, 60) NaN
6 [60, 70) NaN
7 [70, 80) NaN
8 [80, 90) 5.0
Explanation: pd.cut helps create bins of minimum_age by putting them in groups of 0-10, 10-20 and so on. This is how it looks:
pd.cut(df['minimum_age'], bins=np.arange(0, 100, 10), right=False)
0 [30, 40)
1 [80, 90)
2 [10, 20)
3 [0, 10)
4 [20, 30)
Now we use groupby on the output of pd.cut to find sum of population.
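A self-contained version of the above with the question's rows inlined; the bin edges are just np.arange(0, 100, 10). Depending on your pandas version, empty bins may show 0 rather than the NaN in the output above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'population': [50, 5, 1389, 231, 306],
                   'minimum_age': [30, 85, 10, 5, 22]})

bins = np.arange(0, 100, 10)  # edges 0, 10, ..., 90 -> bins [0, 10), ..., [80, 90)
groups = pd.cut(df['minimum_age'], bins=bins, right=False)
# observed=False keeps empty bins in the result
out = df.groupby(groups, observed=False)['population'].sum()
print(out)
```

Each row's population lands in the bin containing its minimum_age, e.g. age 5 puts 231 into [0, 10).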

Write to file from dictionary instead of pandas

I would like to print dictionaries to file in a different way.
Right now, I am using pandas to convert dictionaries to DataFrames, combine several DataFrames, and then print them to file (see the code below).
However, the pandas operations seem to take a very long time and I would like to do this more efficiently.
Is it possible to do the below more efficiently while retaining the structure of the output files (e.g. by printing from the dictionaries directly)?
import pandas as pd
labels = ["A", "B", "C"]
periods = [0, 1, 2]
header = ['key', 'scenario', 'metric', 'labels']
metrics_names = ["metric_balances", "metric_record"]
key = "key_x"
scenario = "base"
# The metrics are structured as dicts where the keys are `periods` and the values
# are arrays (where each array entry correspond to one of the `labels`)
metric_balances = {0: [1000, 100, 50], 1: [900, 150, 100], 2: [800, 350, 100]}
metric_record = {0: [20, 10, 5], 1: [90, 15, 10], 2: [80, 35, 10]}
# Combine all metrics into one output structure for key "x"
output_x = pd.concat([pd.DataFrame(metric_balances, columns=periods, index=labels),
                      pd.DataFrame(metric_record, columns=periods, index=labels)],
                     keys=pd.MultiIndex.from_product([[key], [scenario], metrics_names]),
                     names=header)
key = "key_y"
scenario = "base_2"
metric_balances = {0: [2000, 200, 50], 1: [1900, 350, 100], 2: [1200, 750, 100]}
metric_record = {0: [40, 5, 3], 1: [130, 45, 10], 2: [82, 25, 18]}
# Combine all metrics into one output structure for key "y"
output_y = pd.concat([pd.DataFrame(metric_balances, columns=periods, index=labels),
                      pd.DataFrame(metric_record, columns=periods, index=labels)],
                     keys=pd.MultiIndex.from_product([[key], [scenario], metrics_names]),
                     names=header)
# Concatenate all output dataframes
output = pd.concat([output_x, output_y], names=header)
# Print results to a csv file
output.to_csv("test.csv", index=False)
Below are the respective outputs:
OUTPUT X
0 1 2
key scenario metric labels
key_x base metric_balances A 1000 900 800
B 100 150 350
C 50 100 100
metric_record A 20 90 80
B 10 15 35
C 5 10 10
-----------------------------------
OUTPUT Y
0 1 2
key scenario metric labels
key_y base_2 metric_balances A 2000 1900 1200
B 200 350 750
C 50 100 100
metric_record A 40 130 82
B 5 45 25
C 3 10 18
------------------------------
OUTPUT COMBINED
0 1 2
key scenario metric labels
key_x base metric_balances A 1000 900 800
B 100 150 350
C 50 100 100
metric_record A 20 90 80
B 10 15 35
C 5 10 10
key_y base_2 metric_balances A 2000 1900 1200
B 200 350 750
C 50 100 100
metric_record A 40 130 82
B 5 45 25
C 3 10 18
I was looking into row-wise printing of the dictionaries, but I had difficulties in merging the labels with the relevant arrays.
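One way to bypass pandas entirely is to write the rows with the stdlib csv module. This is only a sketch: the nested `metrics` dict below is an assumed layout ((key, scenario) -> metric name -> {period: values per label}), not something given in the question, and the row format mirrors the key/scenario/metric/labels structure of the pandas output:

```python
import csv

labels = ["A", "B", "C"]
periods = [0, 1, 2]

# assumed layout: (key, scenario) -> metric name -> {period: [value per label]}
metrics = {
    ("key_x", "base"): {
        "metric_balances": {0: [1000, 100, 50], 1: [900, 150, 100], 2: [800, 350, 100]},
        "metric_record": {0: [20, 10, 5], 1: [90, 15, 10], 2: [80, 35, 10]},
    },
    ("key_y", "base_2"): {
        "metric_balances": {0: [2000, 200, 50], 1: [1900, 350, 100], 2: [1200, 750, 100]},
        "metric_record": {0: [40, 5, 3], 1: [130, 45, 10], 2: [82, 25, 18]},
    },
}

with open("test.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["key", "scenario", "metric", "labels"] + periods)
    for (key, scenario), by_metric in metrics.items():
        for metric, per_period in by_metric.items():
            for i, label in enumerate(labels):
                # one row per label: that label's value in each period
                w.writerow([key, scenario, metric, label] +
                           [per_period[p][i] for p in periods])
```

Each inner loop iteration emits one row such as key_x,base,metric_balances,A,1000,900,800, which matches the row layout of the combined pandas output.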

Looping through a dataframe element by element

If I have a data frame df (indexed by integer)
BBG.KABN.S BBG.TKA.S BBG.CON.S BBG.ISAT.S
index
0 -0.004881 0.008011 0.007047 -0.000307
1 -0.004881 0.008011 0.007047 -0.000307
2 -0.005821 -0.016792 -0.016111 0.001028
3 0.000588 0.019169 -0.000307 -0.001832
4 0.007468 -0.011277 -0.003273 0.004355
and I want to iterate through each element individually (by row and column), I know I need to use .iloc[row, column], but do I need to create 2 for loops (one for rows and one for columns), and how would I do that?
I guess it would be something like:
for col in rollReturnRandomDf.keys():
    for row in rollReturnRandomDf.iterrows():
        item = df.iloc(col, row)
But I am unsure of the exact syntax.
Maybe try using df.values.ravel().
import pandas as pd
import numpy as np
# data
# =================
df = pd.DataFrame(np.arange(25).reshape(5,5), columns='A B C D E'.split())
Out[72]:
A B C D E
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
# np.ravel
# =================
df.values.ravel()
Out[74]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24])
for item in df.values.ravel():
    # do something with item
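If you also need the row and column position of each element (as the question's nested loops suggest), np.ndenumerate is a small variation on the same idea:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape(5, 5), columns=list('ABCDE'))

# ndenumerate yields ((row, col), value) pairs, which plain ravel() does not
for (i, j), item in np.ndenumerate(df.to_numpy()):
    # the positional indexer df.iloc[i, j] recovers the same element
    assert item == df.iloc[i, j]
```

Note the bracket syntax df.iloc[i, j]; iloc is an indexer, not a method call.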
