I want to merge two DataFrames:
df1:
dt_object Lng
1 2020-01-01 00:00:00 1.57423
2 2020-01-01 01:00:00 1.57444
3 2020-01-01 02:00:00 1.57465
4 2020-01-01 03:00:00 1.57486
df2:
dt_object Price
0 2020-01-03 10:00:00 256.086667
1 2020-01-03 11:00:00 256.526667
2 2020-01-03 12:00:00 257.386667
3 2020-01-03 13:00:00 256.703333
4 2020-01-03 14:00:00 255.320000
dt_object in both cases has type datetime64.
df1 never has missing rows, so it has 24 hours per day.
But df2 DOES have missing rows.
When I combine them, there is a mismatch.
df = pd.merge(df1, df2, on = 'dt_object')
Merged df:
dt_object Lng Price
0 2020-04-01 10:00:00 1.59270 183.996667
1 2020-04-01 11:00:00 1.59294 184.466667
2 2020-04-01 12:00:00 1.59319 184.810000
3 2020-04-01 13:00:00 1.59343 184.386667
4 2020-04-01 14:00:00 1.59367 184.533333
Problems:
Lng 1.59270 is in the wrong place: it landed on 2020-04-01 10:00:00 but came from 04.01.2020 10:00:00 (the month and day are swapped). Price 183.996667 is in the correct place, however. So ALL Lng values were pulled from the wrong date, with day and month swapped.
Prices in df2 start in January (2020-01-03 10:00:00), but the merged dataframe starts in April (2020-04-01).
When I saw this problem, I added this for both dataframes:
df1['dt_object'] = pd.to_datetime(df1['dt_object'], format='%Y-%m-%d %H:%M:%S')
df2['dt_object'] = pd.to_datetime(df2['dt_object'], format='%Y-%m-%d %H:%M:%S')
but it didn't help; nothing changed. There is a strange day/month bug inside dt_object, but I cannot pin it down.
Help me fix it, please!
You have to specify that you want to perform a left join. The pandas documentation explains what the different options for the how parameter do.
>>> df1 = pd.DataFrame({'dt_object': pd.date_range('2020-01-01', '2020-01-04'), 'Lng': [0, 1, 2, 3]})
>>> df1
dt_object Lng
0 2020-01-01 0
1 2020-01-02 1
2 2020-01-03 2
3 2020-01-04 3
>>> df2 = pd.DataFrame({'dt_object': [pd.Timestamp('2020-01-01'), pd.Timestamp('2020-01-02'), pd.Timestamp('2020-01-04')], 'Price': [1000, 2000, 3000]})
>>> df2
dt_object Price
0 2020-01-01 1000
1 2020-01-02 2000
2 2020-01-04 3000
>>> df1.merge(df2, how='left')
dt_object Lng Price
0 2020-01-01 0 1000.0
1 2020-01-02 1 2000.0
2 2020-01-03 2 NaN
3 2020-01-04 3 3000.0
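With how='left', every row of df1 is kept and the hours missing from df2 get NaN in Price. If you would rather carry the last known price forward than keep the NaNs (an assumption about your use case, not something stated in the question), a minimal sketch:
>>> merged = df1.merge(df2, on='dt_object', how='left')
>>> merged['Price'] = merged['Price'].ffill()  # carry the last known Price forward into the gaps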
I have the following data and I'm resampling it to find out how many bikes arrive at each station every 15 minutes. However, my code is aggregating the stations too, and I only want to aggregate on the variable "dtm_end_trip".
Sample data:
id_trip  dtm_start_trip       dtm_end_trip         start_station  end_station
1        2018-10-01 10:15:00  2018-10-01 10:17:00  A              B
2        2018-10-01 10:17:00  2018-10-01 10:18:00  B              A
...      ...                  ...                  ...            ...
999999   2021-12-31 23:58:00  2022-01-01 00:22:00  C              A
1000000  2021-12-31 23:59:00  2022-01-01 00:29:00  A              D
Trial code:
df2 = df.groupby(['end_station', 'dtm_end_trip']).size().to_frame(name='count').reset_index()
df2 = df2.sort_values(by='count', ascending=False)
df2 = df2.set_index('dtm_end_trip')
df2 = df2.resample('15T').count()
Output I get:
dtm_end_trip         end_station  count
2018-10-01 00:15:00  2            2
2018-10-01 00:30:00  0            0
2018-10-01 00:45:00  1            1
2018-10-01 01:00:00  2            2
2018-10-01 01:15:00  1            1
Desired output:
dtm_end_trip         end_station  count
2018-10-01 00:15:00  A            2
2018-10-01 00:15:00  B            0
2018-10-01 00:15:00  C            1
2018-10-01 00:15:00  D            2
2018-10-01 00:30:00  A            3
2018-10-01 00:30:00  B            2
The count column above is filled with random numbers, solely to illustrate the shape of the desired output.
You can use pd.Grouper like this:
out = df.groupby([
    pd.Grouper(freq='15min', key='dtm_end_trip'),
    'end_station',
]).size()
>>> out
dtm_end_trip         end_station
2018-10-01 10:15:00  A              1
                     B              1
2022-01-01 00:15:00  A              1
                     D              1
dtype: int64
The result is a Series, but you can easily convert it to a DataFrame with the same headings as per your desired output:
>>> out.to_frame('count').reset_index()
dtm_end_trip end_station count
0 2018-10-01 10:15:00 A 1
1 2018-10-01 10:15:00 B 1
2 2022-01-01 00:15:00 A 1
3 2022-01-01 00:15:00 D 1
Note: this is the result from the four rows in your sample input data.
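If you also want explicit zero rows for every (interval, station) pair, as in your desired output, one approach (a sketch, assuming every station of interest appears somewhere in end_station) is to unstack with a fill value and stack back:
>>> full = (out.unstack('end_station', fill_value=0)  # one column per station, 0 where absent
...            .stack()                               # back to one row per (interval, station)
...            .rename('count')
...            .reset_index())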
I have a DataFrame with relevant stock information that looks like this.
Screenshot of my dataframe
I need it so that if the 'close' of one row differs from the 'open' of the next row, a new dataframe is created storing the rows that fulfil this criterion. I would like all of the values from those rows to be saved in the new dataframe. To clarify, I would like both rows where this happens to be stored in the new dataframe.
DataFrame as text as requested:
timestamp open high low close volume
0 2020-01-01 00:00:00 129.16 130.98 128.68 130.24 4.714333e+04
1 2020-01-01 08:00:00 130.24 132.40 129.87 132.08 5.183323e+04
2 2020-01-01 16:00:00 132.08 133.05 129.74 130.77 4.579396e+04
3 2020-01-02 00:00:00 130.72 130.78 128.69 129.26 6.606601e+04
4 2020-01-02 08:00:00 129.23 130.28 128.90 129.59 4.849893e+04
5 2020-01-02 16:00:00 129.58 129.78 126.38 127.19 9.919212e+04
6 2020-01-03 00:00:00 127.19 130.15 125.88 128.86 1.276414e+05
This can be accomplished using Series.shift:
>>> df['close'] != df['open'].shift(-1)
0    False
1    False
2     True
3     True
4     True
5    False
6     True
dtype: bool
This compares the close value in one row to the open value of the next row ("shifted" one row ahead).
You can then select the rows for which the condition is True.
>>> df[df['close'] != df['open'].shift(-1)]
timestamp open high low close volume
2 2020-01-01 16:00:00 132.08 133.05 129.74 130.77 45793.96
3 2020-01-02 00:00:00 130.72 130.78 128.69 129.26 66066.01
4 2020-01-02 08:00:00 129.23 130.28 128.90 129.59 48498.93
6 2020-01-03 00:00:00 127.19 130.15 125.88 128.86 127641.40
This only returns the first of the two rows in each pair; to also get the second, we can shift the mask forward by one row and unite the two conditions.
>>> row_condition = df['close'] != df['open'].shift(-1)
>>> row_before = row_condition.shift(1, fill_value=False)  # True on the row after each mismatch
>>> df[row_condition | row_before]
timestamp open high low close volume
2 2020-01-01 16:00:00 132.08 133.05 129.74 130.77 45793.96
3 2020-01-02 00:00:00 130.72 130.78 128.69 129.26 66066.01
4 2020-01-02 08:00:00 129.23 130.28 128.90 129.59 48498.93
5 2020-01-02 16:00:00 129.58 129.78 126.38 127.19 99192.12
6 2020-01-03 00:00:00 127.19 130.15 125.88 128.86 127641.40
Providing a textual sample of the DataFrame is useful because it can be copied directly into a Python session; otherwise I would have had to type out the content of your screenshot by hand.
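As an aside, such a text sample can be loaded straight into pandas; a minimal sketch, assuming the table has first been copied to the clipboard:
>>> import pandas as pd
>>> df = pd.read_clipboard()  # parses the whitespace-separated text table from the clipboard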
I have a dataframe that looks like this:
2020-01-01 10
2020-02-01 5
2020-05-01 2
2020-08-01 7
2020-01-01 00:00:00 0
2020-02-01 00:00:00 0
2020-03-01 00:00:00 0
2020-04-01 00:00:00 0
I want to remove the time from the index and combine rows where the dates are the same, so the end result will look like:
2020-01-01 10
2020-02-01 5
2020-03-01 0
2020-04-01 0
2020-05-01 2
2020-06-01 0
2020-07-01 0
2020-08-01 7
etc, etc
Change the index data type and filter with .duplicated:
df.index = pd.to_datetime(df.index)
df = df[~df.index.duplicated(keep='first')]
df
Out[1]:
             1
0
2020-01-01  10
2020-02-01   5
2020-05-01   2
2020-08-01   7
2020-03-01   0
2020-04-01   0
If you want to sum them together rather than get rid of the duplicates, then use:
df.index = pd.to_datetime(df.index)
df = df.groupby(level=0, sort=False).sum()  # replaces df.sum(level=0), which was removed in pandas 2.0
df
Out[2]:
             1
0
2020-01-01  10
2020-02-01   5
2020-05-01   2
2020-08-01   7
2020-03-01   0
2020-04-01   0
If the index content is in string format, you can simply slice:
df.reset_index(inplace=True)  # assume the column is named "date"
df["date"] = df["date"].str[:10]  # keep only the date part, dropping the time
df.set_index("date", inplace=True)
If it is in datetime format:
df.reset_index(inplace=True)
df['date'] = pd.to_datetime(df['date']).dt.date
df.set_index("date", inplace=True)
Given this data (reflecting your own) with the string dates and int data in columns (not as index):
dates = ['2020-01-01', '2020-02-01', '2020-05-01', '2020-08-01',
         '2020-01-01 00:00:00', '2020-02-01 00:00:00', '2020-03-01 00:00:00',
         '2020-04-01 00:00:00']
data = [10,5,2,7,0,0,0,0]
df = pd.DataFrame({'dates':dates, 'data':data})
You can do the following:
df['dates'] = pd.to_datetime(df['dates']).dt.date #convert to datetime and get the date
df = df.groupby('dates').sum().sort_index() # groupby and sort index
Giving:
data
dates
2020-01-01 10
2020-02-01 5
2020-03-01 0
2020-04-01 0
2020-05-01 2
2020-08-01 7
You can replace .sum() with your favorite aggregation method. Also, if you want to impute the missing dates (as in your expected output), you can do:
months = pd.date_range(min(df.index), max(df.index), freq='MS').date
df = df.reindex(months).fillna(0)
Giving:
data
dates
2020-01-01 10.0
2020-02-01 5.0
2020-03-01 0.0
2020-04-01 0.0
2020-05-01 2.0
2020-06-01 0.0
2020-07-01 0.0
2020-08-01 7.0
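If you would rather keep integer counts, note that reindex followed by fillna upcasts the column to float (the intermediate NaN forces it); passing fill_value directly to reindex avoids that. A small variation on the code above:
months = pd.date_range(min(df.index), max(df.index), freq='MS').date
df = df.reindex(months, fill_value=0)  # fills the missing months with 0 without upcasting to float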
I have a dataframe with a datetime64[ns] column in the format below, so I have data on an hourly basis:
Datum Values
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-02-28 00:00:00 5
2020-03-01 00:00:00 4
and another table with closing days, also in a datetime64[ns] column, but in a day-only format:
Dates
2020-02-28
2020-02-29
....
How can I delete all days in the first dataframe df which occur in the second dataframe Dates? So that df becomes:
2020-01-01 00:00:00 1
2020-01-01 01:00:00 10
....
2020-03-01 00:00:00 4
Use Series.dt.floor to set the times to 0, making it possible to filter with Series.isin and an inverted mask in boolean indexing:
df['Datum'] = pd.to_datetime(df['Datum'])
df1['Dates'] = pd.to_datetime(df1['Dates'])
df = df[~df['Datum'].dt.floor('d').isin(df1['Dates'])]
print (df)
Datum Values
0 2020-01-01 00:00:00 1
1 2020-01-01 01:00:00 10
3 2020-03-01 00:00:00 4
EDIT: For a flag column, convert the mask to integers with Series.view or Series.astype:
df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).view('i1')
#alternative
#df['flag'] = df['Datum'].dt.floor('d').isin(df1['Dates']).astype('int')
print (df)
Datum Values flag
0 2020-01-01 00:00:00 1 0
1 2020-01-01 01:00:00 10 0
2 2020-02-28 00:00:00 5 1
3 2020-03-01 00:00:00 4 0
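As a side note, Series.dt.normalize() is an equivalent way to zero out the time while keeping the datetime64 dtype, so this sketch should behave the same as the floor('d') version above:
df = df[~df['Datum'].dt.normalize().isin(df1['Dates'])]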
Taking your added comment into consideration:
Build a "|"-joined string of the Dates in df1:
c = "|".join(df1.Dates.values)  # assumes the Dates column holds strings like '2020-02-28'
c
Coerce Datum to datetime:
df['Datum'] = pd.to_datetime(df['Datum'])
df.dtypes
Extract Datum as Dates (dtype string):
df.set_index(df['Datum'], inplace=True)
df['Dates'] = df.index.date.astype(str)
Boolean-select the dates that appear in both:
m = df.Dates.str.contains(c)
m
Mark inclusive dates as 0 and exclusive ones as 1:
import numpy as np
df['drop'] = np.where(m, 0, 1)
df
Drop the unwanted rows and the helper columns:
df = df[df['drop'].eq(1)]
df.reset_index(drop=True).drop(columns=['Dates', 'drop'])
Outcome
I'd like to change my dataframe by adding time intervals for every hour during a month.
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
...
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
where 2231 = out_rows_number - 1 = month_days_number * hours_per_day * orig_rows_number - 1
What is the proper way to do this?
Use a cross join via DataFrame.merge with a helper DataFrame holding all hours of the month, created by date_range:
df1 = pd.DataFrame({'a':1,
'time':pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print (df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]
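On pandas 1.2 or newer the dummy a column is unnecessary, because merge supports cross joins directly; the same idea as a short sketch:
df1 = pd.DataFrame({'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.merge(df1, how='cross')  # every original row paired with every hour of the month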