I am reading in a CSV file of a 30-minute time series running from "2015-01-01 00:00" up to and including "2020-12-31 23:30". There are five of these time series, one per location, and each has 105215 rows, one per 30-minute step. My job is to find the timedelta between consecutive rows in each column. It should be 30 minutes every time, except sometimes it isn't, and I have to find those cases.
So far I'm reading in the data fine via
ca_time = np.array(ca.iloc[0:, 1], dtype= "datetime64")
ny_time = np.array(ny.iloc[0:, 1], dtype = "datetime64")
tx_time = np.array(tx.iloc[0:, 1], dtype = "datetime64")
#I'm then passing these to a pandas dataframe for more convenient manipulation
frame_ca = pd.DataFrame(data = ca_time, dtype = "datetime64[s]")
frame_ny = pd.DataFrame(data = ny_time, dtype = "datetime64[s]")
frame_tx = pd.DataFrame(data = tx_time, dtype = "datetime64[s]")
#Then concatenating them into an array with 100k+ rows, and the five columns represent each location
full_array = pd.concat([frame_ca, frame_ny, frame_tx], axis = 1)
I now want to find the timedelta between each cell for each respective location.
Currently I'm trying this as a simple test
first_row = full_array2.loc[1:1, :1]
second_row = full_array2.loc[2:2, :1]
delta = first_row - second_row
I'm getting back
0 0 0
1 NaT NaT NaT
2 NaT NaT NaT
This seems simple enough, but I don't know why I'm getting NaT (Not a Time) here.
For reference, below are both those rows I'm trying to subtract
ca ny tx fl az
1 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00, 0 0 0 0 0
2 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00
Any help appreciated!
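A likely explanation, offered as a sketch rather than a confirmed diagnosis: arithmetic between two DataFrames aligns on index labels first, so subtracting the row labelled 1 from the row labelled 2 finds no matching labels and fills every cell with NaT. DataFrame.diff() instead subtracts each row from the one after it by position. The tiny frame, its column names and the deliberately missing 01:00 stamp in ny are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the data: "ca" is regular, "ny" is missing 01:00.
ca = pd.date_range("2015-01-01 00:00", periods=5, freq="30min")
ny = pd.to_datetime([
    "2015-01-01 00:00", "2015-01-01 00:30", "2015-01-01 01:30",
    "2015-01-01 02:00", "2015-01-01 02:30",
])
frame = pd.DataFrame({"ca": ca, "ny": ny})

# Label alignment: rows 1 and 2 share no index label, so every cell is NaT.
aligned = frame.loc[2:2] - frame.loc[1:1]

# diff() subtracts the previous row by position; the first row has no predecessor.
deltas = frame.diff()

# Flag every step that is not exactly 30 minutes (ignoring the leading NaT row).
expected = pd.Timedelta(minutes=30)
irregular = deltas.ne(expected) & deltas.notna()

print(deltas)
print(frame[irregular.any(axis=1)])
```

On the real data the same two lines (diff, then the comparison against 30 minutes) should apply to full_array directly, assuming every column is already datetime64.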
I am looking for an elegant way to select the timestamps common to multiple dataframes. I know that something like this could work, where df is the dataframe of common timestamps:
df = df1[df1['Timestamp'].isin(df2['Timestamp'])]
However, with several more dataframes this solution becomes quite inelegant. Is there an easier approach when working with multiple dataframes?
So, let's say for example that I have:
date1 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='H')
date2 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='15min')
date3 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='45min')
date4 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='30min')
data1 = np.random.randn(len(date1))
data2 = np.random.randn(len(date2))
data3 = np.random.randn(len(date3))
data4 = np.random.randn(len(date4))
df1 = pd.DataFrame(data = {'date1' : date1, 'data1' : data1})
df2 = pd.DataFrame(data = {'date2' : date2, 'data2' : data2})
df3 = pd.DataFrame(data = {'date3' : date3, 'data3' : data3})
df4 = pd.DataFrame(data = {'date4' : date4, 'data4' : data4})
I would like as an output a dataframe containing the common timestamps of the four dataframes as well as the respective data column out of each of them, for example (just to illustrate what I mean, it doesn't reflect on the result):
common Timestamp data1 data2 data3 data4
0 2018-01-01 00:00:00 -1.129439 1.2312 1.11 -0.83
1 2018-01-01 01:00:00 0.853421 0.423 0.241 0.123
2 2018-01-01 02:00:00 -1.606047 1.001 -0.005 -0.12
3 2018-01-01 03:00:00 -0.668267 0.98 1.11 -0.23
[...]
You can use reduce from functools to perform the complete inner merge. We'll need to rename the columns just so the merge is a bit easier.
from functools import reduce
lst = [df1.rename(columns={'date1': 'Timestamp'}), df2.rename(columns={'date2': 'Timestamp'}),
df3.rename(columns={'date3': 'Timestamp'}), df4.rename(columns={'date4': 'Timestamp'})]
reduce(lambda l,r: l.merge(r, on='Timestamp'), lst)
Timestamp data1 data2 data3 data4
0 2018-01-01 00:00:00 -0.971201 -0.978107 1.163339 0.048824
1 2018-01-01 03:00:00 -1.063810 0.125318 -0.818835 -0.777500
2 2018-01-01 06:00:00 0.862549 -0.671529 1.902272 0.011490
3 2018-01-01 09:00:00 1.030826 -1.306481 0.438610 -1.817053
4 2018-01-01 12:00:00 -1.191646 -1.700694 1.007190 -1.932421
5 2018-01-01 15:00:00 -1.803248 0.415256 0.690243 1.387650
6 2018-01-01 18:00:00 -0.304502 0.514616 0.974318 -0.062800
7 2018-01-01 21:00:00 -0.668874 -1.262635 -0.504298 -0.043383
8 2018-01-02 00:00:00 -0.943615 1.010958 1.343095 0.119853
Alternatively concat with an 'inner' join and setting the Timestamp to the index
pd.concat([x.set_index('Timestamp') for x in lst], axis=1, join='inner')
If it would be acceptable to name every timestamp column in the same way (date for example), something like this could work:
def common_stamps(*args):  # *args lets you feed it any number of dataframes
    df = (
        pd.concat([df_i.set_index('date') for df_i in args], axis=1)
        .dropna()  # this removes all rows with uncommon stamps
        .reset_index()
    )
    return df
df = common_stamps(df1, df2, df3, df4)
print(df)
Output:
date data1 data2 data3 data4
0 2018-01-01 00:00:00 -0.667090 0.487676 -1.001807 -0.200328
1 2018-01-01 03:00:00 -1.639815 2.320734 -0.396013 -1.838732
2 2018-01-01 06:00:00 0.469890 0.626428 0.040004 -2.063454
3 2018-01-01 09:00:00 -0.916928 -0.260329 -0.598313 0.383281
4 2018-01-01 12:00:00 0.132670 1.771344 -0.441651 0.664980
5 2018-01-01 15:00:00 -0.761542 0.255955 1.378836 -1.235562
6 2018-01-01 18:00:00 -0.120083 0.243652 -1.261733 1.045454
7 2018-01-01 21:00:00 0.339921 -0.901171 1.492577 -0.797161
8 2018-01-02 00:00:00 -1.397864 -0.173818 -0.581590 -0.402472
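Another possible route, sketched here under the assumption that each frame's timestamps are unique: intersect the timestamp indexes first and only then pick the matching rows. This avoids relying on dropna(), which would also drop rows where the data itself contains NaN. The frequencies are spelled in minutes and the shared column name Timestamp is chosen for the example.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Same setup as in the question, with the frequencies spelled in minutes.
freqs = ["60min", "15min", "45min", "30min"]
frames = []
for i, freq in enumerate(freqs, start=1):
    stamps = pd.date_range(start="2018-01-01", end="2018-01-02", freq=freq)
    frames.append(
        pd.DataFrame({"Timestamp": stamps, f"data{i}": np.random.randn(len(stamps))})
    )

# Intersect the timestamp indexes first, then select those rows from every frame.
indexed = [f.set_index("Timestamp") for f in frames]
common = reduce(lambda a, b: a.intersection(b), [f.index for f in indexed])
result = pd.concat([f.loc[common] for f in indexed], axis=1)
print(result)
```

The common stamps come out every three hours, since 180 minutes is the least common multiple of the four step sizes.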
My problem is that I have a big dataframe with over 40000 rows, and I want to select the rows from 2013-01-01 00:00:00 until 2013-12-31 00:00:00.
print(df.loc[df['localhour'] == '2013-01-01 00:00:00'])
That's my code so far, but I can't select an interval to print. Any ideas?
One way is to set your index as datetime and then use pd.DataFrame.loc with string indexers:
df = pd.DataFrame({'Date': ['2013-01-01', '2014-03-01', '2011-10-01', '2013-05-01'],
'Var': [1, 2, 3, 4]})
df['Date'] = pd.to_datetime(df['Date'])
res = df.set_index('Date').sort_index().loc['2010-01-01':'2013-01-01']  # sort first so the date slice is well defined
print(res)
Var
Date
2011-10-01 3
2013-01-01 1
Make a datetime object and then apply the condition:
print(df)
date
0 2013-01-01
1 2014-03-01
2 2011-10-01
3 2013-05-01
df['date']=pd.to_datetime(df['date'])
df['date'].loc[(df['date']<='2013-12-31 00:00:00') & (df['date']>='2013-01-01 00:00:00')]
Output:
0 2013-01-01
3 2013-05-01
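The same filter can also be written with Series.between, which is inclusive on both ends by default. This is only a sketch on the toy frame above. In the question's data the column is called localhour, and an upper bound of "2013-12-31" means midnight at the start of that day, which matches the interval asked for.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2013-01-01", "2014-03-01", "2011-10-01", "2013-05-01"],
                   "Var": [1, 2, 3, 4]})
df["date"] = pd.to_datetime(df["date"])

# between() keeps rows with lower <= date <= upper.
in_2013 = df[df["date"].between("2013-01-01", "2013-12-31")]
print(in_2013)
```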
I have a large data set like this
user category
time
2014-01-01 00:00:00 21155349 2
2014-01-01 00:00:00 56347479 6
2014-01-01 00:00:00 68429517 13
2014-01-01 00:00:00 39055685 4
2014-01-01 00:00:00 521325 13
I want to make it as
user category
time
00:00:00 21155349 2
00:00:00 56347479 6
00:00:00 68429517 13
00:00:00 39055685 4
00:00:00 521325 13
How can I do this using pandas?
If you want to mutate a series (column) in pandas, the pattern is to apply a function to it (one that updates one element of the series at a time), and then to assign that series back into the dataframe
import pandas
from io import StringIO
# load data
data = '''date,user,category
2014-01-01 00:00:00, 21155349, 2
2014-01-01 00:00:00, 56347479, 6
2014-01-01 00:00:00, 68429517, 13
2014-01-01 00:00:00, 39055685, 4
2014-01-01 00:00:00, 521325, 13'''
df = pandas.read_csv(StringIO(data))
df['date'] = pandas.to_datetime(df['date'])
# make the required change
without_date = df['date'].apply(lambda d: d.time())
df['date'] = without_date
# display results
print(df)
If the problem is because the date is the index, you've got a few more hoops to jump through:
df = pandas.read_csv(StringIO(data), index_col='date')
ser = pandas.to_datetime(df.index).to_series()
df = df.set_index(ser.apply(lambda d: d.time()))
As suggested by @DSM, if you have pandas later than 0.15.2, you can use the .dt accessor on the series to do fast updates.
df = pandas.read_csv(StringIO(data), index_col='date')
ser = pandas.to_datetime(df.index).to_series()
df = df.set_index(ser.dt.time)
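When the index is already a DatetimeIndex, a shorter route that should work is its .time attribute, with no detour through a Series. This is a sketch on made-up rows (one of them moved to 00:30 so the effect is visible). Note that the result is a plain object index of datetime.time values, so datetime-specific indexing no longer applies afterwards.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample in the same shape as the question's data.
data = """date,user,category
2014-01-01 00:00:00,21155349,2
2014-01-01 00:00:00,56347479,6
2014-01-01 00:30:00,68429517,13"""

df = pd.read_csv(StringIO(data), index_col="date")
df.index = pd.to_datetime(df.index)

# DatetimeIndex.time yields an array of datetime.time objects.
df.index = pd.Index(df.index.time, name="time")
print(df)
```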
I'm trying to merge a (pandas 0.14.1) dataframe and a series. The series should form a new column, with some NAs (since the index values of the series are a subset of the index values of the dataframe).
This works for a toy example, but not with my data (detailed below).
Example:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6, 4), columns=['A', 'B', 'C', 'D'], index=pd.date_range('1/1/2011', periods=6, freq='D'))
df1
A B C D
2011-01-01 -0.487926 0.439190 0.194810 0.333896
2011-01-02 1.708024 0.237587 -0.958100 1.418285
2011-01-03 -1.228805 1.266068 -1.755050 -1.476395
2011-01-04 -0.554705 1.342504 0.245934 0.955521
2011-01-05 -0.351260 -0.798270 0.820535 -0.597322
2011-01-06 0.132924 0.501027 -1.139487 1.107873
s1 = pd.Series(np.random.randn(3), name='foo', index=pd.date_range('1/1/2011', periods=3, freq='2D'))
s1
2011-01-01 -1.660578
2011-01-03 -0.209688
2011-01-05 0.546146
Freq: 2D, Name: foo, dtype: float64
pd.concat([df1, s1],axis=1)
A B C D foo
2011-01-01 -0.487926 0.439190 0.194810 0.333896 -1.660578
2011-01-02 1.708024 0.237587 -0.958100 1.418285 NaN
2011-01-03 -1.228805 1.266068 -1.755050 -1.476395 -0.209688
2011-01-04 -0.554705 1.342504 0.245934 0.955521 NaN
2011-01-05 -0.351260 -0.798270 0.820535 -0.597322 0.546146
2011-01-06 0.132924 0.501027 -1.139487 1.107873 NaN
The situation with the real data (see below) seems basically identical: concatenating a series with a DatetimeIndex whose values are a subset of the dataframe's. But it raises "ValueError: Shape of passed values is (5, 286), indices imply (5, 276)". Why doesn't it work?
In[187]: df.head()
Out[188]:
high low loc_h loc_l
time
2014-01-01 17:00:00 1.376235 1.375945 1.376235 1.375945
2014-01-01 17:01:00 1.376005 1.375775 NaN NaN
2014-01-01 17:02:00 1.375795 1.375445 NaN 1.375445
2014-01-01 17:03:00 1.375625 1.375515 NaN NaN
2014-01-01 17:04:00 1.375585 1.375585 NaN NaN
In [186]: df.index
Out[186]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 17:00:00, ..., 2014-01-01 21:30:00]
Length: 271, Freq: None, Timezone: None
In [189]: hl.head()
Out[189]:
2014-01-01 17:00:00 1.376090
2014-01-01 17:02:00 1.375445
2014-01-01 17:05:00 1.376195
2014-01-01 17:10:00 1.375385
2014-01-01 17:12:00 1.376115
dtype: float64
In [187]:hl.index
Out[187]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 17:00:00, ..., 2014-01-01 21:30:00]
Length: 89, Freq: None, Timezone: None
In: pd.concat([df, hl], axis=1)
Out: [stack trace] ValueError: Shape of passed values is (5, 286), indices imply (5, 276)
I had a similar problem (join worked, but concat failed).
Check for duplicate index values in df1 and s1 (e.g. df1.index.is_unique).
Removing duplicate index values (e.g., df = df[~df.index.duplicated()]; note that df.drop_duplicates() compares row values, not index labels) or one of the methods here https://stackoverflow.com/a/34297689/7163376 should resolve it.
My problem was that the indices differed; the following code solved it.
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
df = pd.concat([df1, df2], axis=1)
To drop duplicate indices, use df = df.loc[df.index.drop_duplicates()]. C.f. pandas.pydata.org/pandas-docs/stable/generated/… – BallpointBen Apr 18 at 15:25
This is wrong, but I can't reply directly to BallpointBen's comment due to low reputation. The reason it's wrong is that df.index.drop_duplicates() returns an Index of the unique labels, but when you index back into the dataframe with those labels it still returns all records. I think this is because selecting by a duplicated label returns every row that carries it.
Instead, use df.index.duplicated(), which returns a boolean array (add the ~ to get the not-duplicated records):
df = df.loc[~df.index.duplicated()]
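A small sketch of the difference, using a made-up frame whose index contains the label 0 twice (the names df, s and foo are only for illustration):

```python
import pandas as pd

# Hypothetical frame whose index contains the label 0 twice.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=[0, 0, 1])
s = pd.Series([10.0, 20.0], index=[0, 1], name="foo")

# Selecting by the unique labels still returns every row carrying label 0.
still_all = df.loc[df.index.drop_duplicates()]

# The boolean mask keeps only the first occurrence of each label.
deduped = df.loc[~df.index.duplicated()]

# With unique indexes on both sides, the column-wise concat aligns cleanly.
combined = pd.concat([deduped, s], axis=1)
print(len(still_all), len(deduped))
print(combined)
```

duplicated() keeps the first occurrence by default; whether the first, the last, or some aggregate of the duplicated rows is the right one to keep depends on the data.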
Aus_lacy's post gave me the idea of trying related methods, of which join does work:
In [196]:
hl.name = 'hl'
Out[196]:
'hl'
In [199]:
df.join(hl).head(4)
Out[199]:
high low loc_h loc_l hl
2014-01-01 17:00:00 1.376235 1.375945 1.376235 1.375945 1.376090
2014-01-01 17:01:00 1.376005 1.375775 NaN NaN NaN
2014-01-01 17:02:00 1.375795 1.375445 NaN 1.375445 1.375445
2014-01-01 17:03:00 1.375625 1.375515 NaN NaN NaN
Some insight into why concat works on the example but not this data would be nice though!
Your indexes probably contain duplicated values.
import pandas as pd
T1_INDEX = [
    0,
    1,  # <= !!! if I write e.g.: "0" here then it fails
    0.2,
]
T1_COLUMNS = [
    'A', 'B', 'C', 'D'
]
T1 = [
    [1.0, 1.1, 1.2, 1.3],
    [2.0, 2.1, 2.2, 2.3],
    [3.0, 3.1, 3.2, 3.3],
]
T2_INDEX = [
    1.2,
    2.11,
]
T2_COLUMNS = [
    'D', 'E', 'F',
]
T2 = [
    [54.0, 5324.1, 3234.2],
    [55.0, 14.5324, 2324.2],
    # [3.0, 3.1, 3.2],
]
df1 = pd.DataFrame(T1, columns=T1_COLUMNS, index=T1_INDEX)
df2 = pd.DataFrame(T2, columns=T2_COLUMNS, index=T2_INDEX)
print(pd.concat([pd.DataFrame({})] + [df2, df1], axis=1))
Try sorting index after concatenating them
result=pd.concat([df1,df2]).sort_index()
Maybe it is simple; try this.
If you have a DataFrame, make sure that both matrices or vectors that you're trying to combine have the same row names/index.
I had the same issue. I changed the row index names so that they match each other.
Here is an example where a matrix (principal components) and a vector (target) have the same row indices (circled in blue on the left side of the picture).
Before, when it was not working, the matrix had the normal row indices (0, 1, 2, 3) while the vector had the row indices (ID0, ID1, ID2, ID3).
Then I changed the vector's row indices to (0, 1, 2, 3) and it worked for me.
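Since the screenshot is not available, here is a rough reconstruction of that situation with invented values. The names components and target, and the numbers in them, are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: the matrix uses 0..3 as row labels, the vector uses ID0..ID3.
components = pd.DataFrame({"pc1": [0.1, 0.2, 0.3, 0.4], "pc2": [1.0, 2.0, 3.0, 4.0]})
target = pd.Series([0, 1, 0, 1], index=["ID0", "ID1", "ID2", "ID3"], name="target")

# Mismatched labels: concat takes the union of both indexes and pads with NaN.
mismatched = pd.concat([components, target], axis=1)

# Relabel the vector by position so both objects share the same row index.
target.index = components.index
matched = pd.concat([components, target], axis=1)
print(matched)
```

Relabelling by position is only safe when the rows of both objects are already in the same order.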