compare dfs with nearest Lon,Lat (Python, Pandas) - python

I have a large df1 with columns(Lon,Lat,V1,V2,V3) and a large df2(V4,V5,Lat,Lon,V6). dfs coordinates are not exact match. df2 have different row numbers. I want to:
1)Find the nearest df2(Lon,Lat) to df1(Lon,Lat) based on (abs(df1.Lon-df2.Lon<=0.11))&(abs(df1.Lat-df2.Lat<=0.11))
2)Create new df3 with columns (df1.Lon,df1.Lat, df1.V1,df2.V6).
df1:
Lon,Lat,V1,V2,V3
-94.9324,34.9099,5.0,66.9,46.6
-103.524,34.457,6.0,186.7,3.8
-92.5145,38.7823,4.0,188.7,273.5
-92.5143,37.3182,2.0,78.8,218.4
-92.5142,36.6965,5.0,98.5,27.7
-89.2187,36.4448,7.3,79.8,35.8
df2:
V4,V5,Lat,Lon,V6
20190329,10,35.0,-94.9,105.9
20180329,11,34.5,-103.5,305.9
20170329,15,38.7,-92.5,206.0
20160329,14,36.5,-89.22,402.1
20150329,13,36.7,-92.6,316.1
20140329,05,37.4,-92.5,290.0
20130329,05,33.8,-89.2,250.0
df3:
Lon,Lat,V1,V6
-94.9324,34.9099,5.0,105.9
-103.524,34.457,6.0,305.9
-92.5145,38.7823,4.0,206.0
-92.5143,37.3182,2.0,290.0
-92.5142,36.6965,5.0,316.1
-89.2187,36.4448,7.3,402.1
Different codes not working:
df3 = df1.loc[~((abs(df2.Lat - df1.Lat) <= 0.11) & (abs(df2.Lon - df1.Lon) <= 0.11))]
df3 = df1.where((abs(df1[df1.Lon] - df2[df2.Lon]) <=0.11) & (abs(df1[df1.Lat] -df2[df2.Lat]) <=0.11))
df3 = pd.merge(df1, df2, on=[(abs(df1.Lon-df2.Lon)<=0.11), (abs(df1.Lat-df2.Lat)<=0.11)], how='inner')

It is possible, but with cross join, so if large DataFrames, need much memory:
df = pd.merge(df1.assign(A=1), df2.assign(A=1), on='A', how='outer', suffixes=('','_'))
cols = ['Lon','Lat','V1','V6']
df3 = df[(((df.Lat_ - df.Lat) <= 0.11).abs() & ((df.Lon_ - df.Lon).abs() <= 0.11))]
df3 = df3.drop_duplicates(subset=df1.columns)[cols]
print (df3)
Lon Lat V1 V6
0 -94.9324 34.9099 5.0 105.9
8 -103.5240 34.4570 6.0 305.9
16 -92.5145 38.7823 4.0 206.0
25 -92.5143 37.3182 2.0 316.1
32 -92.5142 36.6965 5.0 316.1
38 -89.2187 36.4448 7.3 402.1

Related

Grouper() and agg() functions produce multiple copies when squashed

I have a sample dataframe as given below.
import pandas as pd
import numpy as np
NaN = np.nan
data = {'ID':['A', 'A', 'A', 'B','B','B'],
'Date':['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-20 04:38:26', '2021-09-01
00:12:29','2021-09-01 11:20:58','2021-09-02 09:20:58'],
'Name':['xx','xx',NaN,'yy',NaN,NaN],
'Height':[174,174,NaN,160,NaN,NaN],
'Weight':[74,NaN,NaN,58,NaN,NaN],
'Gender':[NaN,'Male',NaN,NaN,'Female',NaN],
'Interests':[NaN,NaN,'Hiking,Sports',NaN,NaN,'Singing']}
df1 = pd.DataFrame(data)
df1
I want to combine the data present on the same date into a single row. The 'Date' column is in timestamp format. I have written a code for it. Here is my TRY code:
TRY:
df1['Date'] = pd.to_datetime(df1['Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.agg(lambda x: ''.join(x.dropna().astype(str)))
.reset_index()
).replace('', np.nan)
This gives an output where if there are multiple entries of same value, the final result has multiple entries in the same row as shown below.
Obtained Output
However, I do not want the values to be repeated if there are multiple entries. The final output should look like the image shown below.
Required Output
The first column should not have 'xx' and 174.0 instead of 'xxxx' and '174.0 174.0'.
Any help is greatly appreciated. Thank you.
In your case replace agg join to first
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.first()
.reset_index()
).replace('', np.nan)
df_out
Out[113]:
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing
Since you're only trying to keep the first available value for each column for each date, you can do:
>>> df1.groupby(["ID", pd.Grouper(key='Date', freq='D')]).agg("first").reset_index()
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing

Python>Pandas>Summing columns in different data frames which have same column names, same index values but not same same length of index

i have two data frames which look like below. i want to sum df2 and df1 and override df1 to reflect this sum. though column name matches in both data frames, and even index have similar values, but DF2 is smaller in size and not have all rows(or index values). how can i best do this operation? "Buckets" is the index on both the data frame.
No need to merge, let's use pandas intrinsic data aligment with indexes:
df1.set_index("Buckets")\
.add(df2.set_index("Buckets"), fill_value=0)\
.reset_index()
Output:
Buckets EUR
0 20Y 200.0
1 25Y 200.0
2 30Y 200.0
3 35Y 200.0
Note: You can leave out the set_index if Buckets is already in the index.
Do, df1.add(df2, fill_value=0)
Try this (the join type left or outer, you can decide as per your data)
df1 = pd.merge(df1, df2, on=['Buckets'], how='left').set_index(['Buckets']).sum(axis=1).reset_index()
# .set_index(['Buckets']) this is optional for you, as it is already index(as mentioned by you)
# output, You may have to rename column 0 to EUR after that
Buckets 0
0 20Y 200.0
1 25Y 200.0
2 30Y 200.0
3 35Y 200.0
OR try this
df1 = pd.merge(df1, df2, on=['Buckets'], how='left')
# you wll have 2 columns for EUR(as both df1 and df2 has it) suffixed as _x and _y
df1['EUR_y'] = df1['EUR_y'].fillna(0) # as NaN will create issue
df1['EUR'] = df1['EUR_x'] +df1['EUR_y']
# o/p
>>> df1
Buckets EUR_x EUR_y EUR
0 20Y 100 100.0 200.0
1 25Y 200 0.0 200.0
2 30Y 200 0.0 200.0
3 35Y 400 -200.0 200.0

Python (Pandas) How to merge 2 dataframes with different dates in incremental order?

I am trying to merge 2 dataframes by date index in order. Is this possible?
A sample code of what I need to manipulate
Link for sg_df:https://query1.finance.yahoo.com/v7/finance/download/%5ESTI?P=^STI?period1=1442102400&period2=1599955200&interval=1mo&events=history
Link for facemask_compliance_df: https://today.yougov.com/topics/international/articles-reports/2020/05/18/international-covid-19-tracker-update-18-may (YouGov COVID-19 behaviour changes tracker: Wearing a face mask when in public places)
# Singapore Index
# Read file
# Format Date
# index date column for easy referencing
sg_df = pd.read_csv("^STI.csv")
conv = lambda x: datetime.strptime(x, "%d/%m/%Y")
sg_df["Date"] = sg_df["Date"].apply(conv)
sg_df.sort_values("Date", inplace = True)
sg_df.set_index("Date", inplace = True)
# Will wear face mask in public
# Read file
# Format Date, Removing time
# index date column for easy referencing
facemask_compliance_df = pd.read_csv("yougov-chart.csv")
convert1 = lambda x: datetime.strptime(x, "%d/%m/%Y %H:%M")
facemask_compliance_df["DateTime"] = facemask_compliance_df["DateTime"].apply(convert1).dt.date
facemask_compliance_df.sort_values("DateTime", inplace = True)
facemask_compliance_df.set_index("DateTime", inplace = True)
sg_df = sg_df.merge(facemask_compliance_df["Singapore"], left_index = True, right_index = True, how = "outer").sort_index()
and I wish to output a table kind of like this.
Kindly let me know if you need any more info, I will kindly provide them to you shortly if I am able to.
Edit:
This is the issue
data from yougov-chart
I think it is reading the dates even when it is not from Singapore
Use:
merge to merge to tables.
1.1. on to choose on which column to merge:
Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
1.2. outer option:
outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
sort_values to sort by date
import pandas as pd
df1 = pd.read_csv("^STI.csv")
df1['Date'] = pd.to_datetime(df1.Date)
df2 = pd.read_csv("yougov-chart.csv")
df2['Date'] = pd.to_datetime(df2.DateTime)
result = df2.merge(df1, on='Date', how='outer')
result = result.sort_values('Date')
print(result)
Output:
Date US_GDP_Thousands Mask Compliance
6 2016-02-01 NaN 37.0
7 2017-07-01 NaN 73.0
8 2019-10-01 NaN 85.0
0 2020-02-21 50.0 27.0
1 2020-03-18 55.0 NaN
2 2020-03-19 60.0 NaN
3 2020-03-25 65.0 NaN
4 2020-04-03 70.0 NaN
5 2020-05-14 75.0 NaN
First use parameters parse_dates and index_col in read_csv for DatetimeIndex in both and in second remove times by Series.dt.floor:
sg_df = pd.read_csv("^STI.csv",
parse_dates=['Date'],
index_col=['Date'])
facemask_compliance_df = pd.read_csv("yougov-chart.csv",
parse_dates=['DateTime'],
index_col=['DateTime'])
facemask_compliance_df["DateTime"] = facemask_compliance_df["DateTime"].dt.floor('d')
Then use DataFrame.merge by index by outer join and then sort index by DataFrame.sort_index:
df = sg_df.merge(facemask_compliance_df,
left_index=True,
right_index=True,
how='outer').sort_index()
print (df)
Mask Compliance US_GDP_Thousands
Date
2016-02-01 37.0 NaN
2017-07-01 73.0 NaN
2019-10-01 85.0 NaN
2020-02-21 27.0 50.0
2020-03-18 NaN 55.0
2020-03-19 NaN 60.0
2020-03-25 NaN 65.0
2020-04-03 NaN 70.0
2020-05-14 NaN 75.0
If i remember right In numpy you can do v.stack or h.stack. depends on how you want to join them together.
in pandas there was something like concatenate https://pandas.pydata.org/docs/user_guide/merging.html which i used for merging dataframes

Pandas: concatenate dataframes

I have 2 dataframe
category count_sec_target
3D-шутеры 0.09375
Cериалы 201.90625
GPS и ГЛОНАСС 0.015625
Hi-Tech 187.1484375
Абитуриентам 0.8125
Авиакомпании 8.40625
and
category count_sec_random
3D-шутеры 0.369565217
Hi-Tech 70.42391304
АСУ ТП, промэлектроника 0.934782609
Абитуриентам 1.413043478
Авиакомпании 14.93478261
Авто 480.3369565
I need to concatenate that And get
category count_sec_target count_sec_random
3D-шутеры 0.09375 0.369565217
Cериалы 201.90625 0
GPS и ГЛОНАСС 0.015625 0
Hi-Tech 187.1484375 70.42391304
Абитуриентам 0.8125 1.413043478
Авиакомпании 8.40625 14.93478261
АСУ ТП, промэлектроника 0 0.934782609
Авто 0 480.3369565
And next I want to divide values in col (count_sec_target / count_sec_random) * 100%
But when I try to concatenate df
frames = [df1, df1]
df = pd.concat(frames)
I get
category count_sec_random count_sec_target
0 3D-шутеры 0.369565 NaN
1 Hi-Tech 70.423913 NaN
2 АСУ ТП, промэлектроника 0.934783 NaN
3 Абитуриентам 1.413043 NaN
4 Авиакомпании 14.934783 NaN
Also I try df = df1.append(df2)
BUt I get wrong result.
How can I fix that?
df3 = pd.concat([d.set_index('category') for d in frames], axis=1).fillna(0)
df3['ratio'] = df3.count_sec_random / df3.count_sec_target
df3
Setup Reference
import pandas as pd
from StringIO import StringIO
t1 = """category;count_sec_target
3D-шутеры;0.09375
Cериалы;201.90625
GPS и ГЛОНАСС;0.015625
Hi-Tech;187.1484375
Абитуриентам;0.8125
Авиакомпании;8.40625"""
t2 = """category;count_sec_random
3D-шутеры;0.369565217
Hi-Tech;70.42391304
АСУ ТП, промэлектроника;0.934782609
Абитуриентам;1.413043478
Авиакомпании;14.93478261
Авто;480.3369565"""
df1 = pd.read_csv(StringIO(t1), sep=';')
df2 = pd.read_csv(StringIO(t2), sep=';')
frames = [df1, df2]
Merge should be appropriate here:
df_1.merge(df_2, on='category', how='outer').fillna(0)
To get the division output, simply do:
df['division'] = df['count_sec_target'].div(df['count_sec_random']) * 100
where: df is the merged DF

python pandas:get rolling value of one Dataframe by rolling index of another Dataframe

I have two dataframes: one has multi levels of columns, and another has only single level column (which is the first level of the first dataframe, or say the second dataframe is calculated by grouping the first dataframe).
These two dataframes look like the following:
first dataframe-df1
second dataframe-df2
The relationship between df1 and df2 is:
df2 = df1.groupby(axis=1, level='sector').mean()
Then, I get the index of rolling_max of df1 by:
result1=pd.rolling_apply(df1,window=5,func=lambda x: pd.Series(x).idxmax(),min_periods=4)
Let me explain result1 a little bit. For example, during the five days (window length) 2016/2/23 - 2016/2/29, the max price of the stock sh600870 happened in 2016/2/24, the index of 2016/2/24 in the five-day range is 1. So, in result1, the value of stock sh600870 in 2016/2/29 is 1.
Now, I want to get the sector price for each stock by the index in result1.
Let's take the same stock as example, the stock sh600870 is in sector ’家用电器视听器材白色家电‘. So in 2016/2/29, I wanna get the sector price in 2016/2/24, which is 8.770.
How can I do that?
idxmax (or np.argmax) returns an index which is relative to the rolling
window. To make the index relative to df1, add the index of the left edge of
the rolling window:
index = pd.rolling_apply(df1, window=5, min_periods=4, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=5, min_periods=4)
index = index.add(shift, axis=0)
Once you have ordinal indices relative to df1, you can use them to index
into df1 or df2 using .iloc.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 15
columns = pd.MultiIndex.from_product([['foo','bar'], ['A','B']])
columns.names = ['sector', 'stock']
dates = pd.date_range('2016-02-01', periods=N, freq='D')
df1 = pd.DataFrame(np.random.randint(10, size=(N, 4)), columns=columns, index=dates)
df2 = df1.groupby(axis=1, level='sector').mean()
window_size, min_periods = 5, 4
index = pd.rolling_apply(df1, window=window_size, min_periods=min_periods, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=window_size, min_periods=min_periods)
# alternative, you could use
# shift = np.pad(np.arange(len(df1)-window_size+1), (window_size-1, 0), mode='constant')
# but this is harder to read/understand, and therefore it maybe more prone to bugs.
index = index.add(shift, axis=0)
result = pd.DataFrame(index=df1.index, columns=df1.columns)
for col in index:
sector, stock = col
mask = pd.notnull(index[col])
idx = index.loc[mask, col].astype(int)
result.loc[mask, col] = df2[sector].iloc[idx].values
print(result)
yields
sector foo bar
stock A B A B
2016-02-01 NaN NaN NaN NaN
2016-02-02 NaN NaN NaN NaN
2016-02-03 NaN NaN NaN NaN
2016-02-04 5.5 5 5 7.5
2016-02-05 5.5 5 5 8.5
2016-02-06 5.5 6.5 5 8.5
2016-02-07 5.5 6.5 5 8.5
2016-02-08 6.5 6.5 5 8.5
2016-02-09 6.5 6.5 6.5 8.5
2016-02-10 6.5 6.5 6.5 6
2016-02-11 6 6.5 4.5 6
2016-02-12 6 6.5 4.5 4
2016-02-13 2 6.5 4.5 5
2016-02-14 4 6.5 4.5 5
2016-02-15 4 6.5 4 3.5
Note in pandas 0.18 the rolling_apply syntax was changed. DataFrames and Series now have a rolling method, so that now you would use:
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax)
shift = (pd.Series(np.arange(len(df1)))
.rolling(window=window_size, min_periods=min_periods).min())
index = index.add(shift.values, axis=0)

Categories