I have an Excel sheet with values representing the start and end times of time series data, as shown below. Times are in seconds.
Start_Time  End_Time  Value
         0         2      A
         2         3      B
         3         9      A
         9        11      C
I want to extrapolate the values between Start_Time and End_Time and display the value for each second.
Time  Value
   0      A
   1      A
   2      A
   3      B
   4      A
   5      A
   6      A
   7      A
   8      A
   9      A
  10      C
  11      C
Any help to implement it in Python will be appreciated. Thanks.
Setup
Reading your Excel sheet with pandas (e.g. with pd.read_excel) should be straightforward, and the options depend on the file itself, so I won't cover that part.
Below is a reproduction of your sample DataFrame, used for the example.
import pandas as pd
df = pd.DataFrame({'Start_Time': [0, 2, 3, 9],
'End_Time': [2, 3, 9, 11],
'Value': ['A', 'B', 'A', 'C']})
>>> df
Out[]:
End_Time Start_Time Value
0 2 0 A
1 3 2 B
2 9 3 A
3 11 9 C
Solution
(pd.Series(range(df.End_Time.max() + 1), name='Value') # Create a series on whole range
.map(df.set_index('End_Time').Value) # Set values from "df"
.bfill() # Backward fill NaNs values
.rename_axis('Time')) # Purely cosmetic axis rename
Out[]:
Time
0 A
1 A
2 A
3 B
4 A
5 A
6 A
7 A
8 A
9 A
10 C
11 C
Name: Value, dtype: object
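If you prefer a two-column DataFrame matching your desired output rather than a Series, a reset_index() at the end of the same chain does it; a quick sketch:
(pd.Series(range(df.End_Time.max() + 1), name='Value')
 .map(df.set_index('End_Time').Value)
 .bfill()
 .rename_axis('Time')
 .reset_index())  # turns the 'Time' index into a regular column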
Walkthrough
Create the whole "Time" range
s = pd.Series(range(df.End_Time.max() + 1))
>>> s
Out[]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
dtype: int32
Use "End_Time" as index for df
>>> df.set_index('End_Time')
Out[]:
Start_Time Value
End_Time
2 0 A
3 2 B
9 3 A
11 9 C
Map df values to corresponding "End_Time" values from s
s = s.map(df.set_index('End_Time').Value)
>>> s
Out[]:
0 NaN
1 NaN
2 A
3 B
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 A
10 NaN
11 C
dtype: object
Backward-fill the NaN values
s = s.bfill()
>>> s
Out[]:
0 A
1 A
2 A
3 B
4 A
5 A
6 A
7 A
8 A
9 A
10 C
11 C
dtype: object
Then rename_axis('Time') only renames the series axis to match your desired output.
Note that this works here because your intervals exclude Start_Time: each Value applies after the previous End_Time up to and including its own End_Time.
If instead your intervals included Start_Time (i.e. each Value really starts at Start_Time, which is more common), you should map on Start_Time instead of End_Time and change bfill() to ffill() (forward-fill).
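A minimal sketch of that inclusive-Start_Time variant, using the same df as above:
(pd.Series(range(df.End_Time.max() + 1), name='Value')
 .map(df.set_index('Start_Time').Value)  # set values at each Start_Time
 .ffill()                                # forward-fill up to the next start
 .rename_axis('Time'))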
Related
Having two data frames:
df1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df2 = pd.DataFrame({'c':[7],'d':[8]})
c d
0 7 8
The goal is to add all df2 column values to df1, repeated for every row of df1, creating the following result. It is assumed that the two data frames do not share any column names.
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If the column names are strings, you can use DataFrame.assign, unpacking the Series created by selecting the first row of df2:
df = df1.assign(**df2.iloc[0])
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
Another idea is to repeat the values along df1.index with DataFrame.reindex and then use DataFrame.join (this works here because the first index value of df2 is the same as the first index value of df1):
df = df1.join(df2.reindex(df1.index, method='ffill'))
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If there are no missing values in the original df, you can simply join and forward-fill the missing values in a last step, but note that the joined columns are upcast to floats (thanks @Dishin H Goyan):
df = df1.join(df2).ffill()
print (df)
a b c d
0 1 4 7.0 8.0
1 2 5 7.0 8.0
2 3 6 7.0 8.0
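If your pandas version is 1.2 or newer, a cross merge is another option (a sketch; since df2 has a single row, each row of df1 is paired with it exactly once):
df = df1.merge(df2, how='cross')
print (df)
   a  b  c  d
0  1  4  7  8
1  2  5  7  8
2  3  6  7  8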
I have a data frame where there are several groups of numeric series where the values are cumulative. Consider the following:
df = pd.DataFrame({'Cat': ['A', 'A','A','A', 'B','B','B','B'], 'Indicator': [1,2,3,4,1,2,3,4], 'Cumulative1': [1,3,6,7,2,4,6,9], 'Cumulative2': [1,3,4,6,1,5,7,12]})
In [74]:df
Out[74]:
Cat Cumulative1 Cumulative2 Indicator
0 A 1 1 1
1 A 3 3 2
2 A 6 4 3
3 A 7 6 4
4 B 2 1 1
5 B 4 5 2
6 B 6 7 3
7 B 9 12 4
I need to create discrete series for Cumulative1 and Cumulative2, with the starting point being the earliest entry in 'Indicator'.
My approach is to use diff():
In[82]: df['Discrete1'] = df.groupby('Cat')['Cumulative1'].diff()
Out[82]: df
Cat Cumulative1 Cumulative2 Indicator Discrete1
0 A 1 1 1 NaN
1 A 3 3 2 2.0
2 A 6 4 3 3.0
3 A 7 6 4 1.0
4 B 2 1 1 NaN
5 B 4 5 2 2.0
6 B 6 7 3 2.0
7 B 9 12 4 3.0
I have 3 questions:
How do I avoid the NaN in an elegant/Pythonic way? The correct values are to be found in the original Cumulative series.
Secondly, how do I elegantly apply this computation to all series, say -
cols = ['Cumulative1', 'Cumulative2']
Thirdly, I have a lot of data that needs this computation -- is this the most efficient way?
You do not want to avoid NaNs, you want to fill them with the start values from the "cumulative" column:
df['Discrete1'] = df['Discrete1'].combine_first(df['Cumulative1'])
To apply the operation to all (or selected) columns at once, pass the list of columns of interest:
sources = ['Cumulative1', 'Cumulative2']
targets = ["Discrete" + x[len('Cumulative'):] for x in sources]
df[targets] = df.groupby('Cat')[sources].diff()
You still have to fill the NaNs in a loop:
for s, t in zip(sources, targets):
    df[t] = df[t].combine_first(df[s])
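If you want to avoid the loop entirely, a vectorized sketch: DataFrame.fillna aligns on column labels, so rename the source columns to the target names first.
df[targets] = df[targets].fillna(
    df[sources].rename(columns=dict(zip(sources, targets))))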
I'm attempting to populate a column in a data frame based on whether the index value of that record falls within a range defined by two columns in another data frame.
df1 looks like:
a
0 4
1 45
2 7
3 5
4 48
5 44
6 22
7 89
8 45
9 44
10 23
and df2 is:
START STOP CLASS
0 2 3 1
1 5 7 2
2 8 8 3
what I want would look like:
a CLASS
0 4 nan
1 45 nan
2 7 1
3 5 1
4 48 nan
5 44 2
6 22 2
7 89 2
8 45 3
9 44 nan
10 23 nan
The START column in df2 is the minimum value of the range and the STOP column is the max.
You can use IntervalIndex (requires pandas 0.20.0 or later).
First construct the index:
df2.index = pd.IntervalIndex.from_arrays(df2['START'], df2['STOP'], closed='both')
df2
Out:
START STOP CLASS
[2, 3] 2 3 1
[5, 7] 5 7 2
[8, 8] 8 8 3
Now if you index into the second DataFrame it will look up the value in the intervals. For example,
df2.loc[6]
Out:
START 5
STOP 7
CLASS 2
Name: [5, 7], dtype: int64
returns the second class. I don't know if it can be used with merge or with merge_asof but as an alternative you can use map:
df1['CLASS'] = df1.index.to_series().map(df2['CLASS'])
Note that I first converted the index to a Series to be able to use the Series.map method. This results in
df1
Out:
a CLASS
0 4 NaN
1 45 NaN
2 7 1.0
3 5 1.0
4 48 NaN
5 44 2.0
6 22 2.0
7 89 2.0
8 45 3.0
9 44 NaN
10 23 NaN
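A related sketch for non-overlapping intervals (as here): IntervalIndex supports get_indexer, which returns the position of the interval containing each value, or -1 when none does.
import numpy as np
pos = df2.index.get_indexer(df1.index)  # -1 where no interval contains the value
df1['CLASS'] = np.where(pos != -1, df2['CLASS'].values[pos], np.nan)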
Alternative solution:
classdict = df2.set_index("CLASS").to_dict("index")
rangedict = {}
for key, value in classdict.items():
    # get all items in range and assign value (the key)
    for item in range(value["START"], value["STOP"] + 1):
        rangedict[item] = key
rangedict now contains:
{2: 1, 3: 1, 5: 2, 6: 2, 7: 2, 8: 3}
Now map, and optionally format the output:
df1['CLASS'] = df1.index.to_series().map(rangedict)
df1.applymap("{0:.0f}".format)
outputs:
a CLASS
0 4 nan
1 45 nan
2 7 1
3 5 1
4 48 nan
5 44 2
6 22 2
7 89 2
8 45 3
9 44 nan
10 23 nan
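As an aside, the same lookup dict can be built in one pass with a dict comprehension over df2.itertuples(), assuming df2 still has its original integer index:
rangedict = {t: row.CLASS
             for row in df2.itertuples()
             for t in range(row.START, row.STOP + 1)}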
import pandas as pd
import numpy as np

# Here is your existing dataframe
df_existing = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

# Create a new empty dataframe with specific column names and data types
df_new = pd.DataFrame()
columns = ['field01', 'field02', 'field03', 'field04']
dtypes = [str, int, int, int]
for c, d in zip(columns, dtypes):
    df_new[c] = pd.Series(dtype=d)

# Align the new dataframe to the same index as the existing one
# (assigning a length-100 index as a column of a 0-row frame would raise)
df_new = df_new.reindex(df_existing.index)

# Fill the new dataframe with specific fields from the existing dataframe
df_new[['field02', 'field03']] = df_existing[['B', 'C']]
print(df_new)
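A more compact sketch of the same idea builds the typed empty frame in a single constructor call and aligns it in the same step (note that reindexing with all-NaN rows upcasts the int columns to float, just as in the loop version):
df_new = (pd.DataFrame({c: pd.Series(dtype=d) for c, d in zip(columns, dtypes)})
            .reindex(df_existing.index))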
I need to remove all rows from a pandas.DataFrame that satisfy an unusual condition.
If there is an otherwise identical row, except that this one has a NaN value in column "C", I want to remove this row.
Given a table:
A B C D
1 2 NaN 3
1 2 50 3
10 20 NaN 30
5 6 7 8
I need to remove the first row, since it has NaN in column C but there is an otherwise identical row (the second) with a real value in column C.
However, the 3rd row must stay, because there are no other rows with the same A, B and D values.
How do you perform this using pandas? Thank you!
You can achieve this using drop_duplicates.
Initial DataFrame:
df=pd.DataFrame(columns=['a','b','c','d'], data=[[1,2,None,3],[1,2,50,3],[10,20,None,30],[5,6,7,8]])
df
a b c d
0 1 2 NaN 3
1 1 2 50 3
2 10 20 NaN 30
3 5 6 7 8
Then you can sort the DataFrame by column C. This pushes the NaNs to the bottom of the column:
df = df.sort_values(['c'])
df
a b c d
3 5 6 7 8
1 1 2 50 3
0 1 2 NaN 3
2 10 20 NaN 30
And then remove duplicates, taking into account only the relevant columns (ignoring C) and keeping the first row caught:
df1 = df.drop_duplicates(['a','b','d'], keep='first')
a b c d
3 5 6 7 8
1 1 2 50 3
2 10 20 NaN 30
But this is only valid if the NaNs are in column C.
You can try filling the NaNs temporarily and combining that with drop_duplicates, then selecting the surviving rows back from the original df so the NaN in column C is preserved:
df.loc[df.bfill().ffill().drop_duplicates(subset=['A', 'B', 'D'], keep='last').index]
This also handles the scenario where the A, B and D values are the same but both rows have non-NaN values in C.
You get
    A   B     C   D
1   1   2  50.0   3
2  10  20   NaN  30
3   5   6   7.0   8
This feels right to me
notdups = ~df.duplicated(df.columns.difference(['C']), keep=False)
notnans = df.C.notnull()
df[notdups | notnans]
A B C D
1 1 2 50.0 3
2 10 20 NaN 30
3 5 6 7.0 8
I have this pandas dataframe:
SourceDomain 1 2 3
0 www.theguardian.com profile.theguardian.com 1 Directed
1 www.theguardian.com membership.theguardian.com 2 Directed
2 www.theguardian.com subscribe.theguardian.com 3 Directed
3 www.theguardian.com www.google.co.uk 4 Directed
4 www.theguardian.com jobs.theguardian.com 5 Directed
I would like to add a new column which is a pandas series created like this:
Weights = Weights.value_counts()
However, when I try to add the new column using edgesFile[4] = Weights it fills it with NA instead of the values:
SourceDomain 1 2 3 4
0 www.theguardian.com profile.theguardian.com 1 Directed NaN
1 www.theguardian.com membership.theguardian.com 2 Directed NaN
2 www.theguardian.com subscribe.theguardian.com 3 Directed NaN
3 www.theguardian.com www.google.co.uk 4 Directed NaN
4 www.theguardian.com jobs.theguardian.com 5 Directed NaN
How can I add the new column keeping the values?
Thanks!
Dani
You are getting NaNs because the index of Weights does not match up with the index of edgesFile. If you want Pandas to ignore Weights.index and just paste the values in order then pass the underlying NumPy array instead:
edgesFile[4] = Weights.values
Here is an example which demonstrates the difference:
In [14]: df = pd.DataFrame(np.arange(4)*10, index=list('ABCD'))
In [15]: df
Out[15]:
0
A 0
B 10
C 20
D 30
In [16]: s = pd.Series(np.arange(4), index=list('CDEF'))
In [17]: s
Out[17]:
C 0
D 1
E 2
F 3
dtype: int64
Here we see Pandas aligning the index:
In [18]: df[4] = s
In [19]: df
Out[19]:
0 4
A 0 NaN
B 10 NaN
C 20 0
D 30 1
Here, Pandas simply pastes the values in s into the column:
In [20]: df[4] = s.values
In [21]: df
Out[21]:
0 4
A 0 0
B 10 1
C 20 2
D 30 3
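If you know the lengths match and edgesFile has a default RangeIndex, an equivalent sketch is to discard the Series' index before assigning:
edgesFile[4] = Weights.reset_index(drop=True)  # drops Weights' index, keeps values in order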
This is a small example of your question.
You can add a new column with a column name to an existing DataFrame:
>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns = ['A', 'B', 'C'])
>>> df
A B C
0 1 2 3
1 4 5 6
>>> s = pd.Series([7,8])
>>> s
0    7
1    8
>>> df['D']=s
>>> df
A B C D
0 1 2 3 7
1 4 5 6 8
Or, you can make a DataFrame from the Series and then concat:
>>> df = pd.DataFrame([[1,2,3],[4,5,6]])
>>> df
0 1 2
0 1 2 3
1 4 5 6
>>> s = pd.DataFrame(pd.Series([7,8]))  # if you don't provide a column name, the default name will be 0
>>> s
0
0 7
1 8
>>> df = pd.concat([df,s], axis=1)
>>> df
0 1 2 0
0 1 2 3 7
1 4 5 6 8
Hope this will help