How do I write Python code to calculate values for a new column, "UEC_saving", by offsetting a number of rows in the "UEC" column? The table is a pandas DataFrame.
Positive numbers in the "Offset_rows" column mean offset that many rows down, and negative numbers mean offset up.
For example, "UEC_saving" for index 0 is 8.6 - 7.2, and for index 2 it is 0.2 - 7.2.
The output for "UEC_saving" should look like this:
Product_Class UEC Offset_rows UEC_saving
0 PC1 8.6 1 1.4
1 PC1 7.2 0 0.0
2 PC1 0.2 -1 -7.0
3 PC2 18.8 2 2.2
4 PC2 10.0 1 1.4
5 PC2 8.6 0 0.0
6 PC2 0.3 -1 -8.3
You can do:
for i, row in df.iterrows():
    # subtract the UEC found Offset_rows away from this row's UEC
    df.at[i, 'UEC_saving'] = row['UEC'] - df.loc[i + row['Offset_rows'], 'UEC']
df
UEC Offset_rows UEC_saving
0 8.6 1 1.4
1 7.2 0 0.0
2 0.2 -1 -7.0
3 18.8 2 10.2
4 10.0 1 1.4
5 8.6 0 0.0
6 0.3 -1 -8.3
This lines up with all your desired answers except for index 3: an offset of 2 from index 3 lands on index 5, giving 18.8 - 8.6 = 10.2 rather than your 2.2. Please let me know if there was a typo, or explain further in your question exactly what occurs there.
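If the DataFrame has a default 0..n-1 RangeIndex, as here, a vectorized alternative is possible (a sketch; it assumes every i + Offset_rows stays within range):
import numpy as np
import pandas as pd

# position each row wants to read from (assumes a default RangeIndex)
target = df.index + df['Offset_rows']
# subtract the UEC found at that position from the row's own UEC
df['UEC_saving'] = df['UEC'] - df['UEC'].to_numpy()[target]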
I have a DataFrame similar to this:
MACD
0 -2.3
1 -0.3
2 0.8
3 0.1
4 0.6
5 -0.7
6 1.1
7 2.4
How can I add an extra column showing the number of rows since MACD was on the opposite side of the origin (positive/negative)?
Desired Outcome:
MACD RowsSince
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1
3 0.1 2
4 0.6 3
5 -0.7 1
6 1.1 1
7 2.4 2
We can use np.sign with diff to create the subgroups, then groupby + cumcount:
import numpy as np

# sign flips start a new group id; the first group is masked to NaN
s = np.sign(df['MACD']).diff().ne(0).cumsum()
df['new'] = (df.groupby(s).cumcount() + 1).mask(s.eq(1))
df
Out[80]:
MACD new
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1.0
3 0.1 2.0
4 0.6 3.0
5 -0.7 1.0
6 1.1 1.0
7 2.4 2.0
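For clarity, here is what the group key s looks like for this data (a sketch reproducing the steps above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'MACD': [-2.3, -0.3, 0.8, 0.1, 0.6, -0.7, 1.1, 2.4]})
s = np.sign(df['MACD']).diff().ne(0).cumsum()
# s is 1 1 2 2 2 3 4 4 -- rows 0-1 form group 1 (the first group, masked
# to NaN), rows 2-4 form group 2, row 5 group 3, rows 6-7 group 4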
I have data collected from a lineage of instruments with some overlap. I want to merge them into a single pandas data structure such that the newest available data for each column take precedence if not NaN; otherwise the older data are retained.
The following code produces the intended output, but involves a lot of code for such a simple task. Additionally, the final step involves identifying duplicated index values, and I am nervous about whether I can rely on the "last" part, because df.combine_first(other) reorders the data. Is there a more compact, efficient and/or predictable way to do this?
import numpy as np
import pandas as pd

# set up the data
df0 = pd.DataFrame({"x": [0., 1., 2., 3., 4.], "y": [0., 1., 2., 3., np.nan], "t": [0, 1, 2, 3, 4]})  # oldest/lowest priority
df1 = pd.DataFrame({"x": [np.nan, 4.1, 5.1, 6.1], "y": [3.1, 4.1, 5.1, 6.1], "t": [3, 4, 5, 6]})
df2 = pd.DataFrame({"x": [8.2, 10.2], "t": [8, 10]})
df0.set_index("t", inplace=True)
df1.set_index("t", inplace=True)
df2.set_index("t", inplace=True)

# this concatenates, leaving redundant indices from df0, df1, df2
dfmerge = pd.concat((df0, df1, df2), sort=True)
print("dfmerge, with duplicate rows and interlaced NaN data")
print(dfmerge)

# Now apply, in priority order, each of the original dataframes to fill the NaNs
dfmerge2 = dfmerge.copy()
for ddf in (df2, df1, df0):
    dfmerge2 = dfmerge2.combine_first(ddf)
print("\ndfmerge2, fillable NaNs filled but duplicate indices now reordered")
print(dfmerge2)  # row order has changed unpredictably

# finally, drop duplicate indices, keeping the last occurrence
dfmerge3 = dfmerge2.copy()
dfmerge3 = dfmerge3.loc[~dfmerge3.index.duplicated(keep='last')]
print("dfmerge3, final")
print(dfmerge3)
The output of which is this:
dfmerge, with duplicate rows and interlaced NaN data
x y
t
0 0.0 0.0
1 1.0 1.0
2 2.0 2.0
3 3.0 3.0
4 4.0 NaN
3 NaN 3.1
4 4.1 4.1
5 5.1 5.1
6 6.1 6.1
8 8.2 NaN
10 10.2 NaN
dfmerge2, fillable NaNs filled but duplicate indices now reordered
x y
t
0 0.0 0.0
1 1.0 1.0
2 2.0 2.0
3 3.0 3.0
3 3.0 3.1
4 4.0 4.1
4 4.1 4.1
5 5.1 5.1
6 6.1 6.1
8 8.2 NaN
10 10.2 NaN
dfmerge3, final
x y
t
0 0.0 0.0
1 1.0 1.0
2 2.0 2.0
3 3.0 3.1
4 4.1 4.1
5 5.1 5.1
6 6.1 6.1
8 8.2 NaN
10 10.2 NaN
In your case:
s = pd.concat([df0, df1, df2], sort=False)
# np.sort sorts each column independently, pushing NaNs to the end; this
# relies on every column increasing over time, and the original index
# labels are left as they were
s[:] = np.sort(s, axis=0)
s = s.dropna(thresh=1)  # drop rows that became all-NaN
s
x y
t
0 0.0 0.0
1 1.0 1.0
2 2.0 2.0
3 3.0 3.0
4 4.0 3.1
3 4.1 4.1
4 5.1 5.1
5 6.1 6.1
6 8.2 NaN
8 10.2 NaN
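A more predictable alternative (not part of the answer above, just a sketch assuming the frames are passed oldest first): stack everything and keep, per index value and per column, the last non-NaN entry, since groupby(...).last() skips NaN by default.
# older data survive only where newer frames have gaps
dfmerge = pd.concat([df0, df1, df2], sort=True).groupby(level='t').last()
This reproduces dfmerge3 in one line, with the index sorted by groupby rather than depending on combine_first's reordering.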
I have a dataframe with some columns representing counts for every timestep. I would like to drop these automatically, something like the df.dropna() functionality, but along the lines of a df.dropcounts().
Here is an example dataframe
import pandas as pd

array = [[0.0, 1.6, 2.7, 12.0], [1.0, 3.5, 4.5, 13.0], [2.0, 6.5, 8.6, 14.0]]
df = pd.DataFrame(array)
df
0 1 2 3
0 0.0 1.6 2.7 12.0
1 1.0 3.5 4.5 13.0
2 2.0 6.5 8.6 14.0
I would like to drop the first and last columns
I believe you need:
val = 1
# keep only columns where not every consecutive difference equals val
df = df.loc[:, df.diff().fillna(val).ne(val).any()]
print (df)
1 2
0 1.6 2.7
1 3.5 4.5
2 6.5 8.6
Explanation:
First get the differences with DataFrame.diff:
print (df.diff())
0 1 2 3
0 NaN NaN NaN NaN
1 1.0 1.9 1.8 1.0
2 1.0 3.0 4.1 1.0
Replace NaNs:
print (df.diff().fillna(val))
0 1 2 3
0 1.0 1.0 1.0 1.0
1 1.0 1.9 1.8 1.0
2 1.0 3.0 4.1 1.0
Compare if not equal by ne:
print (df.diff().fillna(val).ne(val))
0 1 2 3
0 False False False False
1 False True True False
2 False True True False
And check that at least one value per column is True with DataFrame.any:
print (df.diff().fillna(val).ne(val).any())
0 False
1 True
2 True
3 False
dtype: bool
Using all
df.loc[:, ~df.diff().fillna(1).eq(1).all().values]
Out[295]:
1 2
0 1.6 2.7
1 3.5 4.5
2 6.5 8.6
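If the counters might step by some value other than 1, a more general sketch (hypothetical, not from the answers above) is to drop any column whose consecutive differences are all identical:
# nunique ignores the NaN in the first diff row; a single distinct step
# marks an arithmetic counter
df.loc[:, df.diff().nunique() > 1]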
I am attempting to combine two sets of data, but I can't figure out which method is most suitable (join, merge, concat, etc.) for this application, and the documentation doesn't have any examples that do what I need to do.
I have two sets of data, structured like so:
>>> A
Time Voltage
1.0 5.1
2.0 5.5
3.0 5.3
4.0 5.4
5.0 5.0
>>> B
Time Current
-1.0 0.5
0.0 0.6
1.0 0.3
2.0 0.4
3.0 0.7
I would like to combine the data columns and merge the 'Time' column together so that I get the following:
>>> AB
Time Voltage Current
-1.0 0.5
0.0 0.6
1.0 5.1 0.3
2.0 5.5 0.4
3.0 5.3 0.7
4.0 5.4
5.0 5.0
I've tried AB = merge_ordered(A, B, on='Time', how='outer'), and while it successfully combined the data, it output something akin to:
>>> AB
Time Voltage Current
-1.0 0.5
0.0 0.6
1.0 5.1
1.0 0.3
2.0 5.5
2.0 0.4
3.0 5.3
3.0 0.7
4.0 5.4
5.0 5.0
You'll note that it did not combine rows with shared 'Time' values.
I have also tried merging a la AB = A.merge(B, on='Time', how='outer'), but that outputs something combined, but not sorted, like so:
>>> AB
Time Voltage Current
-1.0 0.5
0.0 0.6
1.0 5.1
2.0 5.5
3.0 5.3 0.7
4.0 5.4
5.0 5.0
1.0 0.3
2.0 0.4
...it essentially skips some of the data in 'Current' and appends it to the bottom, but it does so inconsistently. And again, it does not merge the rows together.
I have also tried AB = pandas.concat([A, B], axis=1), but the result does not get merged. I simply get, well, the concatenation of the two DataFrames, like so:
>>> AB
Time Voltage Time Current
1.0 5.1 -1.0 0.5
2.0 5.5 0.0 0.6
3.0 5.3 1.0 0.3
4.0 5.4 2.0 0.4
5.0 5.0 3.0 0.7
I've been scouring the documentation and this site to try to figure out the exact differences between merge and join, but from what I gather they're pretty similar. Still, I haven't found anything that specifically answers the question of "how to merge rows that share an identical key/index". Can anyone enlighten me on how to do this? I only have a few days' worth of experience with Pandas!
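For reference, a minimal setup reproducing the two frames shown above, so the answers below can be run as-is:
import pandas as pd

A = pd.DataFrame({'Time': [1.0, 2.0, 3.0, 4.0, 5.0],
                  'Voltage': [5.1, 5.5, 5.3, 5.4, 5.0]})
B = pd.DataFrame({'Time': [-1.0, 0.0, 1.0, 2.0, 3.0],
                  'Current': [0.5, 0.6, 0.3, 0.4, 0.7]})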
merge
merge combines on columns. By default it takes all commonly named columns. Otherwise, you can specify which columns to combine on. In this example, I chose 'Time'.
A.merge(B, 'outer', 'Time')
Time Voltage Current
0 1.0 5.1 0.3
1 2.0 5.5 0.4
2 3.0 5.3 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
5 -1.0 NaN 0.5
6 0.0 NaN 0.6
join
join combines on index values unless you specify the left-hand side's column instead. That is why I set the index on the right-hand side and specify a column, 'Time', for the left-hand side.
A.join(B.set_index('Time'), 'Time', 'outer')
Time Voltage Current
0 1.0 5.1 0.3
1 2.0 5.5 0.4
2 3.0 5.3 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
4 -1.0 NaN 0.5
4 0.0 NaN 0.6
pd.concat
concat combines on index values... so I create a list comprehension in which I iterate over each dataframe I want to combine, [A, B]. In the comprehension, each dataframe assumes the name d, hence the for d in [A, B]. axis=1 says to combine them side by side, using the index as the joining feature.
pd.concat([d.set_index('Time') for d in [A, B]], axis=1).reset_index()
Time Voltage Current
0 -1.0 NaN 0.5
1 0.0 NaN 0.6
2 1.0 5.1 0.3
3 2.0 5.5 0.4
4 3.0 5.3 0.7
5 4.0 5.4 NaN
6 5.0 5.0 NaN
combine_first
combine_first aligns on the index and fills gaps in the caller from the other frame; note that it sorts both the rows and the columns of the result, which is why Current appears before Voltage below.
A.set_index('Time').combine_first(B.set_index('Time')).reset_index()
Time Current Voltage
0 -1.0 0.5 NaN
1 0.0 0.6 NaN
2 1.0 0.3 5.1
3 2.0 0.4 5.5
4 3.0 0.7 5.3
5 4.0 NaN 5.4
6 5.0 NaN 5.0
It should work properly if the Time column is of the same dtype in both DFs:
In [192]: A.merge(B, how='outer').sort_values('Time')
Out[192]:
Time Voltage Current
5 -1.0 NaN 0.5
6 0.0 NaN 0.6
0 1.0 5.1 0.3
1 2.0 5.5 0.4
2 3.0 5.3 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
In [193]: A.dtypes
Out[193]:
Time float64
Voltage float64
dtype: object
In [194]: B.dtypes
Out[194]:
Time float64
Current float64
dtype: object
Reproducing your problem:
In [198]: A.merge(B.assign(Time=B.Time.astype(str)), how='outer').sort_values('Time')
Out[198]:
Time Voltage Current
5 -1.0 NaN 0.5
6 0.0 NaN 0.6
0 1.0 5.1 NaN
7 1.0 NaN 0.3
1 2.0 5.5 NaN
8 2.0 NaN 0.4
2 3.0 5.3 NaN
9 3.0 NaN 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
In [199]: B.assign(Time=B.Time.astype(str)).dtypes
Out[199]:
Time object # <------ NOTE
Current float64
dtype: object
Visually it's hard to distinguish:
In [200]: B.assign(Time=B.Time.astype(str))
Out[200]:
Time Current
0 -1.0 0.5
1 0.0 0.6
2 1.0 0.3
3 2.0 0.4
4 3.0 0.7
In [201]: B
Out[201]:
Time Current
0 -1.0 0.5
1 0.0 0.6
2 1.0 0.3
3 2.0 0.4
4 3.0 0.7
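A quick sanity check before merging (a sketch) is to compare the key column's dtype on both sides, since equal-looking values of different dtypes will never match:
assert A['Time'].dtype == B['Time'].dtype, 'Time key dtypes differ'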
Solution found
As per the suggestions above, I had to round the numbers in the 'Time' column prior to merging them, despite the fact that they were both of the same dtype (float64). The suggestion was to round like so:
A = A.assign(Time=A.Time.round(4))
But in my actual situation the column was labeled 'Time, (sec)', and that punctuation is not valid in a keyword argument, so the assignment failed. Instead I used the following line to round it:
A['Time, (sec)'] = A['Time, (sec)'].round(4)
And it worked like a charm. Are there any issues with doing it like that?
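For what it's worth, the underlying issue is binary float representation: two values can print identically yet differ in their last bits, so they never compare equal as merge keys. Rounding both sides to the same precision is a standard fix; a tiny illustration:
>>> 0.1 + 0.2 == 0.3
False
>>> round(0.1 + 0.2, 4) == round(0.3, 4)
True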
Given the following data frame:
index value
1 0.8
2 0.9
3 1.0
4 0.9
5 nan
6 nan
7 nan
8 0.4
9 0.9
10 nan
11 0.8
12 2.0
13 1.4
14 1.9
15 nan
16 nan
17 nan
18 8.4
19 9.9
20 10.0
…
in which the 'value' data are separated into a number of clusters by NaN values. Is there any way I can calculate statistics such as the accumulated sum or the mean of each cluster? For example, I want to calculate the accumulated sum and generate the following data frame:
index value cumsum
1 0.8 0.8
2 0.9 1.7
3 1.0 2.7
4 0.9 3.6
5 nan 0
6 nan 0
7 nan 0
8 0.4 0.4
9 0.9 1.3
10 nan 0
11 0.8 0.8
12 2.0 2.8
13 1.4 4.2
14 1.9 6.1
15 nan 0
16 nan 0
17 nan 0
18 8.4 8.4
19 9.9 18.3
20 10.0 28.3
…
Any suggestions?
Also, as a simple extension of the problem: if two clusters of data are close enough that only a single NaN separates them, we consider them one cluster, such that we get the following data frame:
index value cumsum
1 0.8 0.8
2 0.9 1.7
3 1.0 2.7
4 0.9 3.6
5 nan 0
6 nan 0
7 nan 0
8 0.4 0.4
9 0.9 1.3
10 nan 1.3
11 0.8 2.1
12 2.0 4.1
13 1.4 5.5
14 1.9 7.4
15 nan 0
16 nan 0
17 nan 0
18 8.4 8.4
19 9.9 18.3
20 10.0 28.3
Thank you for the help!
You can do the first part using the compare-cumsum-groupby pattern. Your "simple extension" isn't quite so simple, but we can still pull it off by finding the stretches of value that we want to treat as zero:
n = df["value"].isnull()
# give every run of NaN / non-NaN values its own group id
clusters = (n != n.shift()).cumsum()
df["cumsum"] = df["value"].groupby(clusters).cumsum().fillna(0)

# runs consisting of exactly one NaN should be bridged: zero them out
to_zero = n & (df["value"].groupby(clusters).transform('size') == 1)
tmp_value = df["value"].where(~to_zero, 0)
n2 = tmp_value.isnull()
new_clusters = (n2 != n2.shift()).cumsum()
df["cumsum_skip1"] = tmp_value.groupby(new_clusters).cumsum().fillna(0)
produces
>>> df
index value cumsum cumsum_skip1
0 1 0.8 0.8 0.8
1 2 0.9 1.7 1.7
2 3 1.0 2.7 2.7
3 4 0.9 3.6 3.6
4 5 NaN 0.0 0.0
5 6 NaN 0.0 0.0
6 7 NaN 0.0 0.0
7 8 0.4 0.4 0.4
8 9 0.9 1.3 1.3
9 10 NaN 0.0 1.3
10 11 0.8 0.8 2.1
11 12 2.0 2.8 4.1
12 13 1.4 4.2 5.5
13 14 1.9 6.1 7.4
14 15 NaN 0.0 0.0
15 16 NaN 0.0 0.0
16 17 NaN 0.0 0.0
17 18 8.4 8.4 8.4
18 19 9.9 18.3 18.3
19 20 10.0 28.3 28.3
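For reference, a minimal setup to reproduce the frame used above (a sketch; the 'index' column from the question is shown as a regular column, matching the output):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'index': range(1, 21),
    'value': [0.8, 0.9, 1.0, 0.9, np.nan, np.nan, np.nan, 0.4, 0.9, np.nan,
              0.8, 2.0, 1.4, 1.9, np.nan, np.nan, np.nan, 8.4, 9.9, 10.0],
})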