Merge 2 data frames in pandas - python

I have 2 data frames: GPS coordinates
Time X Y Z
2013-06-01 00:00:00 13512.466575 -12220.845913 19279.970720
2013-06-01 00:00:00 -13529.778408 -14013.560399 -18060.112972
2013-06-01 00:00:00 25108.907276 8764.536182 1594.215305
2013-06-01 00:00:00 -8436.586675 -22468.562354 -11354.726511
2013-06-01 00:05:00 13559.288748 -11476.738832 19702.063737
2013-06-01 00:05:00 -13500.120049 -14702.564328 -17548.488127
2013-06-01 00:05:00 25128.357948 8883.802142 664.732379
2013-06-01 00:05:00 -8346.854582 -22878.993160 -10544.640975
and Glonass coordinates
Time X Y Z
2013-06-01 00:00:00 0.248752905273E+05 -0.557450976562E+04 -0.726176757812E+03
2013-06-01 00:15:00 0.148314306641E+05 0.510153710938E+04 0.201156157227E+05
2013-06-01 00:15:00 0.242346674805E+05 -0.562089208984E+04 0.561714257812E+04
2013-06-01 00:15:00 0.195601284180E+05 -0.122148081055E+05 -0.108823476562E+05
2013-06-01 00:15:00 0.336192968750E+04 -0.122589394531E+05 -0.220986958008E+05
and I need to merge them on the Time column, so that I get the satellite coordinates from the same time only (all GPS coordinates and all Glonass coordinates for a particular time). The result for the example above should look like this:
Time X_gps Y_gps Z_gps X_glonass Y_glonass Z_glonass
0 2013-06-01 00:00:00 13512.466575 -12220.845913 19279.970720 0.248752905273E+05 -0.557450976562E+04 -0.726176757812E+03
1 2013-06-01 00:00:00 -13529.778408 -14013.560399 -18060.112972
2 2013-06-01 00:00:00 25108.907276 8764.536182 1594.215305
3 2013-06-01 00:00:00 -8436.586675 -22468.562354 -11354.726511
What I ended up doing is coord = pd.merge(d_gps, d_glonass, on='Time', how='inner', suffixes=('_gps', '_glonass')), but it repeats the Glonass coordinates to fill the empty spaces in the data frame. What should I change to get the result I want?
I'm new to pandas so I really need your help.

After merging (I took the liberty of renaming the columns first), you can iterate over the columns, test for duplicated values and set those to NaN. You can't set them to blank, because the column dtype is float and assigning an empty string would raise an invalid literal error:
In [272]:
import numpy as np

df1 = df1.rename(columns={'X':'X_glonass', 'Y':'Y_glonass', 'Z':'Z_glonass'})
df = df.rename(columns={'X':'X_gps', 'Y':'Y_gps', 'Z':'Z_gps'})
merged = df.merge(df1, on='Time')
In [278]:
# replace duplicated values in every column except Time with NaN
for col in merged.columns[1:]:
    merged.loc[merged[col].duplicated(), col] = np.nan
merged
Out[278]:
Time X_gps Y_gps Z_gps X_glonass \
0 2013-06-01 13512.466575 -12220.845913 19279.970720 24875.290527
1 2013-06-01 -13529.778408 -14013.560399 -18060.112972 NaN
2 2013-06-01 25108.907276 8764.536182 1594.215305 NaN
3 2013-06-01 -8436.586675 -22468.562354 -11354.726511 NaN
Y_glonass Z_glonass
0 -5574.509766 -726.176758
1 NaN NaN
2 NaN NaN
3 NaN NaN
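The same masking can also be written without an explicit loop, for example with DataFrame.mask (a sketch, assuming the merged frame built above):

cols = merged.columns[1:]
merged[cols] = merged[cols].mask(merged[cols].apply(lambda s: s.duplicated()))

Either way, only the repeated values are blanked out; the merge itself still repeats the Glonass rows, which is why the first Glonass row survives in each group.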

Related

python masking each day in dataframe

I have to compute a daily sum on a dataframe, but only if at least 70% of that day's data is not NaN; otherwise the day must not be taken into account. Is there a way to create such a mask? My dataframe covers more than 17 years of hourly data.
my data is something like this:
clear skies all skies Lab
2015-02-26 13:00:00 597.5259 376.1830 307.62
2015-02-26 14:00:00 461.2014 244.0453 199.94
2015-02-26 15:00:00 283.9003 166.5772 107.84
2015-02-26 16:00:00 93.5099 50.7761 23.27
2015-02-26 17:00:00 1.1559 0.2784 0.91
... ... ...
2015-12-05 07:00:00 95.0285 29.1006 45.23
2015-12-05 08:00:00 241.8822 120.1049 113.41
2015-12-05 09:00:00 363.8040 196.0568 244.78
2015-12-05 10:00:00 438.2264 274.3733 461.28
2015-12-05 11:00:00 456.3396 330.6650 447.15
If I groupby and aggregate, there is no way to know whether a day was missing data; such days would get lower sums and therefore drag down my monthly means.
As said in the comments, use groupby to group the data by date and then write an appropriate selection. This is an example that would sum all days (assuming regular data points, 24 per day) with less than 50% of nan entries:
import pandas as pd
import numpy as np
# create a date range
date_rng = pd.date_range(start='1/1/2018', end='1/1/2021', freq='H')
# create random data
df = pd.DataFrame({"data":np.random.randint(0,100,size=(len(date_rng)))}, index = date_rng)
# set some values to nan
df["data"][df["data"] > 50] = np.nan
# looks like this
df.head(20)
# sum everything where less than 50% are nan
df.groupby(df.index.date).sum()[df.isna().groupby(df.index.date).sum() < 12]
Example output:
data
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 487.0
2018-01-04 NaN
2018-01-05 421.0
... ...
2020-12-28 NaN
2020-12-29 NaN
2020-12-30 NaN
2020-12-31 392.0
2021-01-01 0.0
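The same idea can be phrased directly as the 70% rule from the question, using the fraction of non-NaN entries per day instead of a fixed count (a sketch; the column name data follows the example above, and the fraction works regardless of how many samples a day has):

valid_frac = df["data"].notna().groupby(df.index.date).mean()  # share of non-NaN entries per day
daily_sum = df["data"].groupby(df.index.date).sum()
daily_sum[valid_frac < 0.7] = np.nan  # discard days with less than 70% valid data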
An alternative solution - you may find it useful & flexible:
# pip install convtools
from convtools import conversion as c
total_number = c.ReduceFuncs.Count()
total_not_none = c.ReduceFuncs.Count(where=c.item("amount").is_not(None))
total_sum = c.ReduceFuncs.Sum(c.item("amount"))
input_data = [] # e.g. iterable of dicts
converter = (
    c.group_by(
        c.item("key1"),
        c.item("key2"),
    )
    .aggregate(
        {
            "key1": c.item("key1"),
            "key2": c.item("key2"),
            "sum_if_70": c.if_(
                total_not_none / total_number < 0.7,
                None,
                total_sum,
            ),
        }
    )
    .gen_converter(
        debug=False
    )  # install black and set debug=True to see the generated ad-hoc code
)
result = converter(input_data)

Identifying events from time series

I have a time series for visibility data that contains half-hourly measurements of visibility. A fog event is defined when the visibility falls below 1 km and the fog event ends when the visibility exceeds 1 Km. Please find the code attached below. I intend to find out the number of such fog events and the duration of each such fog event.
from IPython.display import display
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import files
uploaded = files.upload()
import io
df = pd.read_csv(io.BytesIO(uploaded['visibility.csv']))
df.set_index('Unnamed: 0',inplace=True)
df.index = pd.to_datetime(df.index)
df=df.interpolate(method='linear', limit_direction='forward')
display(df)
Unnamed: 0 visibility_km
2016-01-01 00:00:00 0.595456
2016-01-01 00:30:00 0.595456
2016-01-01 01:00:00 0.595456
2016-01-01 01:30:00 0.595456
2016-01-01 02:00:00 0.595456
... ...
2020-12-31 21:30:00 0.925370
2020-12-31 22:00:00 0.901230
2020-12-31 22:30:00 0.804670
2020-12-31 23:00:00 0.804670
2020-12-31 23:30:00 0.692016
# FOG Events
fog_events=df[df<1.0].count()
print('no. of fog events',fog_events)
no. of fog events 10318
But it simply gives the number of times the visibility drops below 1 km and not the number of fog events.
You can create sample time series data like this:
import pandas as pd
tdf = pd.DataFrame({'Time': pd.date_range(start='1/1/2016', periods=11, freq='30s'),
                    'Visibility_km': [0.56, 0.75, 0.99, 1.01, 1.1, 1.3, 0.5, 0.6, 0.7, 1.2, 1.3]})
Data in this format makes it easier to copy and paste your problem. To get the total number of fog events and their durations, start by creating a column that marks the events and another that marks where an event starts and ends:
# Create column to mark duration of events
tdf['fog_event'] = (tdf['Visibility_km'] < 1.).astype(int)
# Create column to mark event start and end
tdf['event_diff'] = tdf['fog_event'] != tdf['fog_event'].shift(1)
print(tdf)
Time Visibility_km fog_event event_diff
0 2016-01-01 00:00:00 0.56 1 True
1 2016-01-01 00:00:30 0.75 1 False
2 2016-01-01 00:01:00 0.99 1 False
3 2016-01-01 00:01:30 1.01 0 True
4 2016-01-01 00:02:00 1.10 0 False
5 2016-01-01 00:02:30 1.30 0 False
6 2016-01-01 00:03:00 0.50 1 True
7 2016-01-01 00:03:30 0.60 1 False
8 2016-01-01 00:04:00 0.70 1 False
9 2016-01-01 00:04:30 1.20 0 True
10 2016-01-01 00:05:00 1.30 0 False
Now you can get the events in two ways:
The first way doesn't use Pandas and is how I originally grouped the events:
import numpy as np
from itertools import groupby

groups = [list(g) for _, g in groupby(tdf.fog_event.values)]
fog_durations = np.array([sum(g) for g in groups])
duration_each_event = fog_durations[fog_durations != 0]
total_fog_events = sum(fog_durations != 0)
print(duration_each_event)
array([3, 3])
print(total_fog_events)
2
To do it using Pandas, you can group by the cumulative sum of the event difference
fdf = tdf.groupby([tdf['event_diff'].cumsum(), 'fog_event']).size()
fdf = fdf.reset_index(name = 'duration').rename(columns = {'event_diff': 'index'})
duration_each_event = fdf.loc[fdf['fog_event'] == 1, 'duration'].values
total_fog_events = fdf.loc[fdf['fog_event'] == 1, 'fog_event'].sum()
print(duration_each_event)
[3, 3]
print(total_fog_events)
2
Assuming the time interval between measurements doesn't change (i.e. always measured 30 seconds apart), you can multiply duration_each_event by 30 (for seconds) or 0.5 (for minutes) to get the duration in time units.
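Applied to the original half-hourly visibility data, the same cumulative-sum trick condenses to a few lines (a sketch; the column name visibility_km is taken from the question's dataframe):

fog = df['visibility_km'] < 1.0               # True while a fog event is ongoing
event_id = (fog != fog.shift()).cumsum()      # label each run of consecutive values
durations = fog.groupby(event_id).sum()       # number of half-hour samples per run
durations = durations[durations > 0] * pd.Timedelta('30min')  # keep fog runs, convert to time
total_fog_events = len(durations)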

Copy row to another dataframe

I have 2 dataframes with a DatetimeIndex, and I would like to copy a column from one to the other. The dataframes are:
variable: df
DateTime
2013-01-01 01:00:00 0.0
2013-01-01 02:00:00 0.0
2013-01-01 03:00:00 0.0
....
Freq: H, Length: 8759, dtype: float64
variable: consumption_year
Potência Ativa ... Costs
Datetime ...
2019-01-01 00:00:00 11.500000 ... 1.08874
2019-01-01 01:00:00 6.500000 ... 0.52016
2019-01-01 02:00:00 5.250000 ... 0.38183
2019-01-01 03:00:00 5.250000 ... 0.38183
[8760 rows x 5 columns]
here is my code:
mc.run_model(tmy_data)
df=round(mc.ac.fillna(0)/1000,3)
consumption_year['PVProduction'] = df.iloc[:,[1]] #1
consumption_year['PVProduction'] = df[:,1] #2
I am trying to copy the second column of df into a new column of the consumption_year dataframe, but neither of those attempts worked. Looking at the indexes, I see 3 major differences:
year (2013 and 2019)
starting hour: 01:00 and 00:00
length: 8760 and 8759
Do I need to resolve those 3 differences first (making the datetime index of df match consumption_year), before I can copy the data across? If so, could you suggest a way to fix those differences?
Those are the errors:
1: consumption_year['PVProduction'] = df.iloc[:,[1]]
raise IndexingError("Too many indexers")
pandas.core.indexing.IndexingError: Too many indexers
2: consumption_year['PVProduction'] = df[:,1]
raise ValueError("Can only tuple-index with a MultiIndex")
ValueError: Can only tuple-index with a MultiIndex
You can merge the two data frames together on their indexes:
pd.merge(df, consumption_year, left_index=True, right_index=True, how='outer')
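Note that because the two indexes cover different years (2013 vs 2019), an index join will not actually line the rows up; the outer merge above simply stacks them with NaNs. If the intent is to align the values positionally (first hour with first hour, and so on), a sketch along these lines may be closer to the goal (PVProduction and the two frames are taken from the question; the positional alignment itself is an assumption about the intent):

consumption_year = consumption_year.copy()
# align by position: the 8759 values of df fill the first 8759 rows of consumption_year
consumption_year.loc[consumption_year.index[:len(df)], 'PVProduction'] = df.values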

Why is the difference of datetime = zero for two rows in a dataframe?

This issue that I am facing is very simple yet weird and has troubled me to no end.
I have a dataframe as follows :
df['datetime'] = df['datetime'].dt.tz_convert('US/Pacific')
#converting datetime from datetime64[ns, UTC] to datetime64[ns,US/Pacific]
df.head()
vehicle_id trip_id datetime
6760612 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:00-08:00
6760613 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:01-08:00
6760614 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:02-08:00
6760615 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:03-08:00
6760616 1000500 4f874888ce404720a203e36f1cf5b716 2017-01-01 10:00:04-08:00
df.info ()
vehicle_id int64
trip_id object
datetime datetime64[ns, US/Pacific]
I am trying to compute the datetime difference as follows (in two different ways):
df['datetime_diff'] = df['datetime'].diff()
df['time_diff'] = (df['datetime'] - df['datetime'].shift(1)).astype('timedelta64[s]')
For a particular trip_id, I have the results as follows :
df[trip_frame['trip_id'] == '4f874888ce404720a203e36f1cf5b716'][['datetime','datetime_diff','time_diff']].head()
datetime datetime_diff time_diff
6760612 2017-01-01 10:00:00-08:00 NaT NaN
6760613 2017-01-01 10:00:01-08:00 00:00:01 1.0
6760614 2017-01-01 10:00:02-08:00 00:00:01 1.0
6760615 2017-01-01 10:00:03-08:00 00:00:01 1.0
6760616 2017-01-01 10:00:04-08:00 00:00:01 1.0
But for some other trip_ids, like the ones below, you can observe that the datetime difference comes out as zero (for both columns) when it actually isn't; there is a time difference in seconds.
df[trip_frame['trip_id'] == '01b8a24510cd4e4684d67b96369286e0'][['datetime','datetime_diff','time_diff']].head(4)
datetime datetime_diff time_diff
3236107 2017-01-28 03:00:00-08:00 0 days 0.0
3236108 2017-01-28 03:00:01-08:00 0 days 0.0
3236109 2017-01-28 03:00:02-08:00 0 days 0.0
3236110 2017-01-28 03:00:03-08:00 0 days 0.0
df[df['trip_id'] == '01c2a70c25e5428bb33811ca5eb19270'][['datetime','datetime_diff','time_diff']].head(4)
datetime datetime_diff time_diff
8915474 2017-01-21 10:00:00-08:00 0 days 0.0
8915475 2017-01-21 10:00:01-08:00 0 days 0.0
8915476 2017-01-21 10:00:02-08:00 0 days 0.0
8915477 2017-01-21 10:00:03-08:00 0 days 0.0
Any leads as to what the actual issue is? I would be very grateful.
If I just execute your code without the type conversion, everything looks fine:
df.timestamp - df.timestamp.shift(1)
On the example lines
rows = ['2017-01-21 10:00:00-08:00',
        '2017-01-21 10:00:01-08:00',
        '2017-01-21 10:00:02-08:00',
        '2017-01-21 10:00:03-08:00',
        '2017-01-21 10:00:03-08:00']  # the lines above are from your example; I invented the last one to have one equal entry
df = pd.DataFrame(rows, columns=['timestamp'])
df['timestamp'] = df['timestamp'].astype('datetime64')
df.timestamp - df.timestamp.shift(1)
The last line returns
Out[40]:
0 NaT
1 00:00:01
2 00:00:01
3 00:00:01
4 00:00:00
Name: timestamp, dtype: timedelta64[ns]
That looks fine so far. Note that you already have a timedelta64 series.
If I now add your conversion, I get:
(df.timestamp - df.timestamp.shift(1)).astype('timedelta64[s]')
Out[42]:
0 NaN
1 1.0
2 1.0
3 1.0
4 0.0
Name: timestamp, dtype: float64
You see that the result is a series of floats. This is probably because there is a NaN in the series. The other thing is the [s] unit: casting to timedelta64[s] doesn't seem to work here, while [ns] does. If you want to get rid of the nanoseconds, I guess you need to do that separately.
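If a column of whole seconds is what's wanted, one option (a sketch, not part of the original answer) is the .dt.total_seconds() accessor on the timedelta series:

diff = df.timestamp - df.timestamp.shift(1)
df['time_diff_s'] = diff.dt.total_seconds()  # float seconds; NaN for the first row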

Dataframe.rolling().mean not calculating moving average

I've been trying to calculate a moving average using pandas, but when I use DataFrame.rolling().mean(), it just copies the value instead.
stock_info['stock'].head()
Fecha Open High Low Close Volume
0 04-05-2007 00:00:00 234,4593 255,5703 234,3532 246,8906 6044574
1 07-05-2007 00:00:00 246,8906 254,7023 247,855 252,1563 2953869
2 08-05-2007 00:00:00 252,1562 250,7482 244,9617 250,1695 2007217
3 09-05-2007 00:00:00 250,1695 249,7838 245,9261 248,3757 2329078
4 10-05-2007 00:00:00 248,8194 248,9158 244,9617 245,6368 2138002
stock_info['stock']['MA'] = stock_info['stock']['Close'].rolling(window=2).mean()
Fecha Open High Low Close Volume MA
0 04-05-2007 00:00:00 234,4593 255,5703 234,3532 246,8906 6044574 246,8906
1 07-05-2007 00:00:00 246,8906 254,7023 247,855 252,1563 2953869 252,1563
2 08-05-2007 00:00:00 252,1562 250,7482 244,9617 250,1695 2007217 250,1695
3 09-05-2007 00:00:00 250,1695 249,7838 245,9261 248,3757 2329078 248,3757
4 10-05-2007 00:00:00 248,8194 248,9158 244,9617 245,6368 2138002 245,6368
My first thought is that the values in stock_info['stock']['Close'] are stored as strings, rather than as a numeric type. Attempting
df['MA'] = df['Close'].rolling(window=2).mean()
on
df = pd.DataFrame({'Close': ['246,8906', '252,1563', '250,1695']})
gives
df
Out[38]:
Close MA
0 246,8906 246,8906
1 252,1563 252,1563
2 250,1695 250,1695
as happened for you.
Converting this to a numeric value first, say with
df['MA'] = df['Close'].str.replace(',', '.').astype(float).rolling(window=2).mean()
gives
df
Out[40]:
Close MA
0 246,8906 NaN
1 252,1563 249.52345
2 250,1695 251.16290
as desired.
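If the data comes from a CSV file, another option (a sketch; the file name stock.csv and the default separator are assumptions) is to let pandas parse the comma decimals at read time via the decimal parameter of read_csv, so the Close column is numeric from the start:

import pandas as pd

# decimal=',' makes pandas parse '246,8906' as 246.8906
stock = pd.read_csv('stock.csv', decimal=',')
stock['MA'] = stock['Close'].rolling(window=2).mean()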
According to the latest pandas docs, http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html, you have to use the on parameter of the rolling function.
df1 = pd.DataFrame({'val': range(10,30)})
df1['avg'] = df1.val.mean()
df1['rolling'] = df1.rolling(window=2, on='avg').mean()['val']
instead of using df1['avg'].rolling()
You can use pd.rolling_mean to calculate it (note that this function only exists in older pandas versions; newer versions use .rolling().mean() instead).
example:
df1 = pd.DataFrame([np.random.randint(-10, 10) for _ in range(100)], columns=['val'])
val
0 4
1 -3
2 -7
3 3
4 -10
df1['MA'] = pd.rolling_mean(df1.val,2)
val MA
0 4 NaN
1 -3 0.5
2 -7 -5.0
3 3 -2.0
4 -10 -3.5
