Merge multiple dataframes with non-unique indices - python

I have a bunch of pandas time series. Here is an example for illustration (real data has ~ 1 million entries in each series):
>>> for s in series:
...     print s.head()
...     print
2014-01-01 01:00:00 -0.546404
2014-01-01 01:00:00 -0.791217
2014-01-01 01:00:01 0.117944
2014-01-01 01:00:01 -1.033161
2014-01-01 01:00:02 0.013415
2014-01-01 01:00:02 0.368853
2014-01-01 01:00:02 0.380515
2014-01-01 01:00:02 0.976505
2014-01-01 01:00:02 0.881654
dtype: float64
2014-01-01 01:00:00 -0.111314
2014-01-01 01:00:01 0.792093
2014-01-01 01:00:01 -1.367650
2014-01-01 01:00:02 -0.469194
2014-01-01 01:00:02 0.569606
2014-01-01 01:00:02 -1.777805
dtype: float64
2014-01-01 01:00:00 -0.108123
2014-01-01 01:00:00 -1.518526
2014-01-01 01:00:00 -1.395465
2014-01-01 01:00:01 0.045677
2014-01-01 01:00:01 1.614789
2014-01-01 01:00:01 1.141460
2014-01-01 01:00:02 1.365290
dtype: float64
The times in each series are not unique. For example, the last series has 3 values at 2014-01-01 01:00:00. The second series has only one value at that time. Also, not all the times need to be present in all the series.
My goal is to create a merged DataFrame with times that are a union of all the times in the individual time series. Each timestamp should be repeated as many times as needed. So, if a timestamp occurs (2, 0, 3, 4) times in the series above, the timestamp should be repeated 4 times (the maximum of the frequencies) in the resulting DataFrame. The values of each column should be "filled forward".
As an example, the result of merging the above should be:
c0 c1 c2
2014-01-01 01:00:00 -0.546404 -0.111314 -0.108123
2014-01-01 01:00:00 -0.791217 -0.111314 -1.518526
2014-01-01 01:00:00 -0.791217 -0.111314 -1.395465
2014-01-01 01:00:01 0.117944 0.792093 0.045677
2014-01-01 01:00:01 -1.033161 -1.367650 1.614789
2014-01-01 01:00:01 -1.033161 -1.367650 1.141460
2014-01-01 01:00:02 0.013415 -0.469194 1.365290
2014-01-01 01:00:02 0.368853 0.569606 1.365290
2014-01-01 01:00:02 0.380515 -1.777805 1.365290
2014-01-01 01:00:02 0.976505 -1.777805 1.365290
2014-01-01 01:00:02 0.881654 -1.777805 1.365290
To give an idea of size and "uniqueness" in my real data:
>>> [len(s.index.unique()) for s in series]
[48617, 48635, 48720, 48620]
>>> len(times)
51043
>>> [len(s) for s in series]
[1143409, 1143758, 1233646, 1242864]
Here is what I have tried:
I can create a union of all the unique times:
uniques = [s.index.unique() for s in series]
times = uniques[0].union_many(uniques[1:])
I can now index each series using times:
series[0].loc[times]
But that seems to repeat the values for each item in times, which is not what I want.
I can't reindex() the series using times because the index for each series is not unique.
I can do it by a slow Python loop or do it in Cython, but is there a "pandas-only" way to do what I want to do?
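For reference, a minimal sketch of the slow Python loop I have in mind (column names c0, c1, ... are only for illustration):
import numpy
import pandas

def merge_series_slow(series):
    # sorted union of every timestamp seen in any series
    times = sorted(set().union(*[set(s.index) for s in series]))
    counts = [s.groupby(level=0).size() for s in series]
    last = [numpy.nan] * len(series)        # last value seen per series
    rows, out_index = [], []
    for t in times:
        n = max(c.get(t, 0) for c in counts)   # repeat t max-count times
        chunks = [s.loc[[t]].values if t in s.index else [] for s in series]
        for i in range(n):
            row = []
            for j, chunk in enumerate(chunks):
                if i < len(chunk):
                    last[j] = chunk[i]      # fresh observation for this series
                row.append(last[j])         # otherwise carry the old value forward
            rows.append(row)
            out_index.append(t)
    return pandas.DataFrame(rows, index=out_index,
                            columns=['c%d' % j for j in range(len(series))])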
I created my example series using the following code:
import random
import numpy
import pandas

def make_series(n=3, rep=(0, 5)):
    times = pandas.date_range('2014/01/01 01:00:00', periods=n, freq='S')
    reps = [random.randint(*rep) for _ in xrange(n)]
    dates = []
    values = numpy.random.randn(numpy.sum(reps))
    for date, rep in zip(times, reps):
        dates.extend([date] * rep)
    return pandas.Series(data=values, index=dates)

series = [make_series() for _ in xrange(3)]

This is very nearly a concat:
In [11]: s0 = pd.Series([1, 2, 3], name='s0')
In [12]: s1 = pd.Series([1, 4, 5], name='s1')
In [13]: pd.concat([s0, s1], axis=1)
Out[13]:
s0 s1
0 1 1
1 2 4
2 3 5
However, concat cannot deal with duplicate indices (it's ambiguous how they should merge, and in your case you don't want to merge them in the "ordinary" way - as all combinations)...
I think you are going to use a groupby:
In [21]: s0 = pd.Series([1, 2, 3], [0, 0, 1], name='s0')
In [22]: s1 = pd.Series([1, 4, 5], [0, 1, 1], name='s1')
Note: I've appended a faster method which works for int-like dtypes (like datetime64).
We want to add a MultiIndex level of the cumcounts for each item, that way we trick the Index into becoming unique:
In [23]: s0.groupby(level=0).cumcount()
Out[23]:
0 0
0 1
1 0
dtype: int64
Note: I can't seem to append a column to the index without it being a DataFrame first...
In [24]: df0 = pd.DataFrame(s0).set_index(s0.groupby(level=0).cumcount(), append=True)
In [25]: df1 = pd.DataFrame(s1).set_index(s1.groupby(level=0).cumcount(), append=True)
In [26]: df0
Out[26]:
s0
0 0 1
1 2
1 0 3
Now we can go ahead and concat these:
In [27]: res = pd.concat([df0, df1], axis=1)
In [28]: res
Out[28]:
s0 s1
0 0 1 1
1 2 NaN
1 0 3 4
1 NaN 5
If you want to drop the cumcount level:
In [29]: res.index = res.index.droplevel(1)
In [30]: res
Out[30]:
s0 s1
0 1 1
0 2 NaN
1 3 4
1 NaN 5
Now you can ffill to get the desired result... (if you don't want values forward-filled across different datetimes, you could first groupby the index and then ffill within each group).
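Putting the pieces together for a list of series like the one in the question - a sketch only; it assumes the list is called series and that the columns should be named c0, c1, ...:
import pandas as pd

def dedupe(s, name):
    # append the per-timestamp cumcount so (timestamp, cumcount) is unique
    cc = s.groupby(level=0).cumcount()
    return pd.DataFrame({name: s.values}, index=s.index).set_index(cc, append=True)

dfs = [dedupe(s, 'c%d' % i) for i, s in enumerate(series)]
res = pd.concat(dfs, axis=1).sort_index()
res = res.ffill()                      # forward-fill the gaps
res.index = res.index.droplevel(1)     # drop the helper cumcount level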
If the upper bound on repetitions in each group were reasonable (I'm picking 1000, but much higher is still "reasonable"!), you could use a Float64Index as follows (and it certainly seems more elegant):
s0.index = s0.index + (s0.groupby(level=0)._cumcount_array() / 1000.)
s1.index = s1.index + (s1.groupby(level=0)._cumcount_array() / 1000.)
res = pd.concat([s0, s1], axis=1)
res.index = res.index.values.astype('int64')
Note: I'm cheekily using a private method here which returns the cumcount as a numpy array...
Note 2: This is pandas 0.14; in 0.13 you have to pass a numpy array to _cumcount_array, e.g. np.arange(len(s0)); pre-0.13 you're out of luck - there's no cumcount.
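In more recent pandas the private method can be avoided altogether: the public cumcount() can be turned into real timedelta offsets instead of a Float64Index. A sketch, assuming the timestamps are at whole-second resolution and no timestamp repeats anywhere near a million times:
import pandas as pd

def uniquify(s):
    # offset the k-th duplicate of each timestamp by k microseconds
    offsets = pd.to_timedelta(s.groupby(level=0).cumcount().values, unit='us')
    return pd.Series(s.values, index=s.index + offsets)

res = pd.concat([uniquify(s) for s in series], axis=1).sort_index().ffill()
res.index = res.index.floor('s')       # snap back to the original seconds
res.columns = ['c%d' % i for i in range(len(series))]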

How about this - convert the series to dataframes with labeled columns first, then merge() them on their indices with an outer join.
import pandas as pd

s1 = pd.Series(index=['4/4/14', '4/4/14', '4/5/14'],
               data=[12.2, 0.0, 12.2])
s2 = pd.Series(index=['4/5/14', '4/8/14'],
               data=[14.2, 3.0])
d1 = pd.DataFrame(s1, columns=['a'])
d2 = pd.DataFrame(s2, columns=['b'])
final_df = pd.merge(d1, d2, left_index=True, right_index=True, how='outer')
This gives me
a b
4/4/14 12.2 NaN
4/4/14 0.0 NaN
4/5/14 12.2 14.2
4/8/14 NaN 3.0
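To get the forward-filled frame the original question asks for, you could then sort and ffill the merged result - a sketch:
final_df = final_df.sort_index().ffill()
#            a     b
# 4/4/14  12.2   NaN
# 4/4/14   0.0   NaN
# 4/5/14  12.2  14.2
# 4/8/14  12.2   3.0
One caveat: when the same index value is duplicated in both inputs, merge pairs up every combination of those rows, which is not the repeat-by-max behaviour the question describes.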

Related

Finding the Timedelta through pandas dataframe, I keep return NaT

So I am reading in a csv file of a 30-minute timeseries running from "2015-01-01 00:00" up to and including "2020-12-31 23:30". There are five of these timeseries, one per location, with 105215 rows each (one row per 30-minute step). My job is to go through and find the timedelta between consecutive rows, for each column. It should be 30 minutes every time, except sometimes it isn't, and I have to find those cases.
So far I'm reading in the data fine via
ca_time = np.array(ca.iloc[0:, 1], dtype= "datetime64")
ny_time = np.array(ny.iloc[0:, 1], dtype = "datetime64")
tx_time = np.array(tx.iloc[0:, 1], dtype = "datetime64")
#I'm then passing these to a pandas dataframe for more convenient manipulation
frame_ca = pd.DataFrame(data = ca_time, dtype = "datetime64[s]")
frame_ny = pd.DataFrame(data = ny_time, dtype = "datetime64[s]")
frame_tx = pd.DataFrame(data = tx_time, dtype = "datetime64[s]")
#Then concatenating them into an array with 100k+ rows, and the five columns represent each location
full_array = pd.concat([frame_ca, frame_ny, frame_tx], axis = 1)
I now want to find the timedelta between each cell for each respective location.
Currently I'm trying this as a simple test:
first_row = full_array.loc[1:1, :1]
second_row = full_array.loc[2:2, :1]
delta = first_row - second_row
I'm getting back
0 0 0
1 NaT NaT NaT
2 NaT NaT NaT
This seems simple enough, but I don't know how I'm getting Not a Time (NaT) here.
For reference, below are both those rows I'm trying to subtract
    ca                   ny                   tx                   fl                   az
1   2015-01-01 01:00:00  2015-01-01 01:00:00  2015-01-01 01:00:00  2015-01-01 01:00:00  2015-01-01 01:00:00
2   2015-01-01 01:30:00  2015-01-01 01:30:00  2015-01-01 01:30:00  2015-01-01 01:30:00  2015-01-01 01:30:00
Any help appreciated!
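The NaT most likely comes from index alignment: first_row and second_row carry different row labels (1 and 2), so the subtraction aligns on those labels, finds no overlap, and returns NaT everywhere. A sketch of a label-free way to get the per-column gaps, assuming the full_array frame built above:
import pandas as pd

# row-to-row differences per column; each should be a 30-minute gap
deltas = full_array.diff().iloc[1:]      # skip the first row (no predecessor)

# rows where any location's gap is not exactly 30 minutes
irregular = deltas[deltas.ne(pd.Timedelta(minutes=30)).any(axis=1)]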

Create a dataframe based on common timestamps of multiple dataframes

I am looking for an elegant solution for selecting the common timestamps from multiple dataframes. I know that something like this could work, supposing df to be the dataframe of common timestamps:
df = df1[df1['Timestamp'].isin(df2['Timestamp'])]
However, if I have several other dataframes, this solution becomes quite inelegant. Therefore, I have been wondering if there is an easier approach to achieve my goal when working with multiple dataframes.
So, let's say for example that I have:
date1 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='H')
date2 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='15min')
date3 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='45min')
date4 = pd.date_range(start='1/1/2018', end='1/02/2018', freq='30min')
data1 = np.random.randn(len(date1))
data2 = np.random.randn(len(date2))
data3 = np.random.randn(len(date3))
data4 = np.random.randn(len(date4))
df1 = pd.DataFrame(data = {'date1' : date1, 'data1' : data1})
df2 = pd.DataFrame(data = {'date2' : date2, 'data2' : data2})
df3 = pd.DataFrame(data = {'date3' : date3, 'data3' : data3})
df4 = pd.DataFrame(data = {'date4' : date4, 'data4' : data4})
I would like as an output a dataframe containing the common timestamps of the four dataframes as well as the respective data column from each of them, for example (just to illustrate what I mean; the numbers don't reflect the actual result):
common Timestamp data1 data2 data3 data4
0 2018-01-01 00:00:00 -1.129439 1.2312 1.11 -0.83
1 2018-01-01 01:00:00 0.853421 0.423 0.241 0.123
2 2018-01-01 02:00:00 -1.606047 1.001 -0.005 -0.12
3 2018-01-01 03:00:00 -0.668267 0.98 1.11 -0.23
[...]
You can use reduce from functools to perform the complete inner merge. We'll need to rename the columns just so the merge is a bit easier.
from functools import reduce
lst = [df1.rename(columns={'date1': 'Timestamp'}), df2.rename(columns={'date2': 'Timestamp'}),
df3.rename(columns={'date3': 'Timestamp'}), df4.rename(columns={'date4': 'Timestamp'})]
reduce(lambda l,r: l.merge(r, on='Timestamp'), lst)
Timestamp data1 data2 data3 data4
0 2018-01-01 00:00:00 -0.971201 -0.978107 1.163339 0.048824
1 2018-01-01 03:00:00 -1.063810 0.125318 -0.818835 -0.777500
2 2018-01-01 06:00:00 0.862549 -0.671529 1.902272 0.011490
3 2018-01-01 09:00:00 1.030826 -1.306481 0.438610 -1.817053
4 2018-01-01 12:00:00 -1.191646 -1.700694 1.007190 -1.932421
5 2018-01-01 15:00:00 -1.803248 0.415256 0.690243 1.387650
6 2018-01-01 18:00:00 -0.304502 0.514616 0.974318 -0.062800
7 2018-01-01 21:00:00 -0.668874 -1.262635 -0.504298 -0.043383
8 2018-01-02 00:00:00 -0.943615 1.010958 1.343095 0.119853
Alternatively concat with an 'inner' join and setting the Timestamp to the index
pd.concat([x.set_index('Timestamp') for x in lst], axis=1, join='inner')
If it would be acceptable to name every timestamp column in the same way (date for example), something like this could work:
def common_stamps(*args):  # *args lets you feed it any number of dataframes
    df = (pd.concat([df_i.set_index('date') for df_i in args], axis=1)
          .dropna()        # this removes all rows with `uncommon` stamps
          .reset_index())
    return df
df = common_stamps(df1, df2, df3, df4)
print(df)
Output:
date data1 data2 data3 data4
0 2018-01-01 00:00:00 -0.667090 0.487676 -1.001807 -0.200328
1 2018-01-01 03:00:00 -1.639815 2.320734 -0.396013 -1.838732
2 2018-01-01 06:00:00 0.469890 0.626428 0.040004 -2.063454
3 2018-01-01 09:00:00 -0.916928 -0.260329 -0.598313 0.383281
4 2018-01-01 12:00:00 0.132670 1.771344 -0.441651 0.664980
5 2018-01-01 15:00:00 -0.761542 0.255955 1.378836 -1.235562
6 2018-01-01 18:00:00 -0.120083 0.243652 -1.261733 1.045454
7 2018-01-01 21:00:00 0.339921 -0.901171 1.492577 -0.797161
8 2018-01-02 00:00:00 -1.397864 -0.173818 -0.581590 -0.402472

How can I print certain rows from a CSV in pandas

My problem is that I have a big dataframe with over 40000 rows and now I want to select the rows from 2013-01-01 00:00:00 until 2013-12-31 00:00:00
print(df.loc[df['localhour'] == '2013-01-01 00:00:00'])
That's my code right now, but I cannot choose an interval for printing out... any ideas?
One way is to set your index as datetime and then use pd.DataFrame.loc with string indexers:
df = pd.DataFrame({'Date': ['2013-01-01', '2014-03-01', '2011-10-01', '2013-05-01'],
'Var': [1, 2, 3, 4]})
df['Date'] = pd.to_datetime(df['Date'])
res = df.set_index('Date').loc['2010-01-01':'2013-01-01']
print(res)
Var
Date
2013-01-01 1
2011-10-01 3
Make a datetime object and then apply the condition:
print(df)
date
0 2013-01-01
1 2014-03-01
2 2011-10-01
3 2013-05-01
df['date']=pd.to_datetime(df['date'])
df['date'].loc[(df['date']<='2013-12-31 00:00:00') & (df['date']>='2013-01-01 00:00:00')]
Output:
0 2013-01-01
3 2013-05-01
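The same condition can also be written with Series.between, which is inclusive on both ends by default - a sketch:
mask = df['date'].between('2013-01-01', '2013-12-31')
print(df.loc[mask])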

Python Pandas remove date from timestamp

I have a large data set like this
user category
time
2014-01-01 00:00:00 21155349 2
2014-01-01 00:00:00 56347479 6
2014-01-01 00:00:00 68429517 13
2014-01-01 00:00:00 39055685 4
2014-01-01 00:00:00 521325 13
I want to make it as
user category
time
00:00:00 21155349 2
00:00:00 56347479 6
00:00:00 68429517 13
00:00:00 39055685 4
00:00:00 521325 13
How do you do this using pandas?
If you want to mutate a series (column) in pandas, the pattern is to apply a function to it (one that updates one element of the series at a time), and then assign that series back into the dataframe:
import pandas
import StringIO
# load data
data = '''date,user,category
2014-01-01 00:00:00, 21155349, 2
2014-01-01 00:00:00, 56347479, 6
2014-01-01 00:00:00, 68429517, 13
2014-01-01 00:00:00, 39055685, 4
2014-01-01 00:00:00, 521325, 13'''
df = pandas.read_csv(StringIO.StringIO(data))
df['date'] = pandas.to_datetime(df['date'])
# make the required change
without_date = df['date'].apply( lambda d : d.time() )
df['date'] = without_date
# display results
print df
If the problem is because the date is the index, you've got a few more hoops to jump through:
df = pandas.read_csv(StringIO.StringIO(data), index_col='date')
ser = pandas.to_datetime(df.index).to_series()
df.set_index(ser.apply(lambda d : d.time() ))
As suggested by @DSM, if you have pandas later than 0.15.2, you can use the .dt accessor on the series to do fast updates.
df = pandas.read_csv(StringIO.StringIO(data), index_col='date')
ser = pandas.to_datetime(df.index).to_series()
df.set_index(ser.dt.time)
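And if the timestamps live in an ordinary column rather than in the index, the .dt accessor applies just as directly - a sketch reusing the data string from above:
df = pandas.read_csv(StringIO.StringIO(data))
df['date'] = pandas.to_datetime(df['date']).dt.time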

Pandas concat: ValueError: Shape of passed values is blah, indices imply blah2

I'm trying to merge a (pandas 0.14.1) dataframe and a series. The series should form a new column, with some NAs (since the index values of the series are a subset of the index values of the dataframe).
This works for a toy example, but not with my data (detailed below).
Example:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6, 4), columns=['A', 'B', 'C', 'D'], index=pd.date_range('1/1/2011', periods=6, freq='D'))
df1
A B C D
2011-01-01 -0.487926 0.439190 0.194810 0.333896
2011-01-02 1.708024 0.237587 -0.958100 1.418285
2011-01-03 -1.228805 1.266068 -1.755050 -1.476395
2011-01-04 -0.554705 1.342504 0.245934 0.955521
2011-01-05 -0.351260 -0.798270 0.820535 -0.597322
2011-01-06 0.132924 0.501027 -1.139487 1.107873
s1 = pd.Series(np.random.randn(3), name='foo', index=pd.date_range('1/1/2011', periods=3, freq='2D'))
s1
2011-01-01 -1.660578
2011-01-03 -0.209688
2011-01-05 0.546146
Freq: 2D, Name: foo, dtype: float64
pd.concat([df1, s1],axis=1)
A B C D foo
2011-01-01 -0.487926 0.439190 0.194810 0.333896 -1.660578
2011-01-02 1.708024 0.237587 -0.958100 1.418285 NaN
2011-01-03 -1.228805 1.266068 -1.755050 -1.476395 -0.209688
2011-01-04 -0.554705 1.342504 0.245934 0.955521 NaN
2011-01-05 -0.351260 -0.798270 0.820535 -0.597322 0.546146
2011-01-06 0.132924 0.501027 -1.139487 1.107873 NaN
The situation with the data (see below) seems basically identical - concatting a series with a DatetimeIndex whose values are a subset of the dataframe's. But it gives the ValueError in the title (blah1 = (5, 286) blah2 = (5, 276) ). Why doesn't it work?:
In[187]: df.head()
Out[188]:
high low loc_h loc_l
time
2014-01-01 17:00:00 1.376235 1.375945 1.376235 1.375945
2014-01-01 17:01:00 1.376005 1.375775 NaN NaN
2014-01-01 17:02:00 1.375795 1.375445 NaN 1.375445
2014-01-01 17:03:00 1.375625 1.375515 NaN NaN
2014-01-01 17:04:00 1.375585 1.375585 NaN NaN
In [186]: df.index
Out[186]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 17:00:00, ..., 2014-01-01 21:30:00]
Length: 271, Freq: None, Timezone: None
In [189]: hl.head()
Out[189]:
2014-01-01 17:00:00 1.376090
2014-01-01 17:02:00 1.375445
2014-01-01 17:05:00 1.376195
2014-01-01 17:10:00 1.375385
2014-01-01 17:12:00 1.376115
dtype: float64
In [187]:hl.index
Out[187]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 17:00:00, ..., 2014-01-01 21:30:00]
Length: 89, Freq: None, Timezone: None
In: pd.concat([df, hl], axis=1)
Out: [stack trace] ValueError: Shape of passed values is (5, 286), indices imply (5, 276)
I had a similar problem (join worked, but concat failed).
Check for duplicate index values in df1 and s1 (e.g. df1.index.is_unique).
Removing duplicate index values (e.g., df.drop_duplicates(inplace=True)) or one of the methods here https://stackoverflow.com/a/34297689/7163376 should resolve it.
My problem were different indices, the following code solved my problem.
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
df = pd.concat([df1, df2], axis=1)
To drop duplicate indices, use df = df.loc[df.index.drop_duplicates()]. C.f. pandas.pydata.org/pandas-docs/stable/generated/… – BallpointBen Apr 18 at 15:25
This is wrong, but I can't reply directly to BallpointBen's comment due to low reputation. The reason it's wrong is that df.index.drop_duplicates() returns the unique index values, but when you index back into the dataframe using those unique values it still returns all records. I think this is because indexing with one of the duplicated values returns every row carrying that value.
Instead, use df.index.duplicated(), which returns a boolean list (add the ~ to get the not-duplicated records):
df = df.loc[~df.index.duplicated()]
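A small demonstration of the difference, using a deliberately duplicated index (the frame here is made up purely for illustration):
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'a', 'b'])

df.loc[df.index.drop_duplicates()]   # looks up 'a' again -> still 3 rows
df.loc[~df.index.duplicated()]       # boolean mask -> 2 rows, first 'a' kept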
Aus_lacy's post gave me the idea of trying related methods, of which join does work:
In [196]:
hl.name = 'hl'
Out[196]:
'hl'
In [199]:
df.join(hl).head(4)
Out[199]:
high low loc_h loc_l hl
2014-01-01 17:00:00 1.376235 1.375945 1.376235 1.375945 1.376090
2014-01-01 17:01:00 1.376005 1.375775 NaN NaN NaN
2014-01-01 17:02:00 1.375795 1.375445 NaN 1.375445 1.375445
2014-01-01 17:03:00 1.375625 1.375515 NaN NaN NaN
Some insight into why concat works on the example but not this data would be nice though!
Your indexes probably contain duplicate values.
import pandas as pd

T1_INDEX = [
    0,
    1,  # <= !!! if I write e.g. "0" here then it fails
    0.2,
]
T1_COLUMNS = ['A', 'B', 'C', 'D']
T1 = [
    [1.0, 1.1, 1.2, 1.3],
    [2.0, 2.1, 2.2, 2.3],
    [3.0, 3.1, 3.2, 3.3],
]
T2_INDEX = [
    1.2,
    2.11,
]
T2_COLUMNS = ['D', 'E', 'F']
T2 = [
    [54.0, 5324.1, 3234.2],
    [55.0, 14.5324, 2324.2],
    # [3.0, 3.1, 3.2],
]
df1 = pd.DataFrame(T1, columns=T1_COLUMNS, index=T1_INDEX)
df2 = pd.DataFrame(T2, columns=T2_COLUMNS, index=T2_INDEX)
print(pd.concat([pd.DataFrame({})] + [df2, df1], axis=1))
Try sorting the index after concatenating them:
result = pd.concat([df1, df2]).sort_index()
Maybe it is simple - try this: if you have DataFrames, make sure that the matrices or vectors you are trying to combine have the same row names/index.
I had the same issue and changed the row indices so that they matched each other. For example, for a matrix of principal components and a target vector that should share row indices: when it was not working, the matrix had ordinary row indices (0, 1, 2, 3) while the vector had row indices (ID0, ID1, ID2, ID3). I changed the vector's row indices to (0, 1, 2, 3) and it worked for me.
