I've got a dataframe that looks like:
0 1 2 3 4 5 6 7 8 9 10 11
12 13 13 13.4 13.4 12.4 12.4 16 0 0 0 0
14 12.2 12.2 13.4 13.4 12.6 12.6 19 5 5 6.7 6.7
.
.
.
Each 'layer'/row has pairs that are duplicates that I want to reduce.
The one problem is that there are repeating 0s as well, so I cannot simply remove duplicates per row or it will leave rows of uneven length.
Ideally I'd like a lambda function that I could apply to all rows of this dataframe to get this:
0 1 2 3 4 5 6
12 13 13.4 12.4 16 0 0
14 12.2 13.4 12.6 19 5 6.7
.
.
.
Is there a simple function I could write to do this?
Method 1 using transpose
As mentioned by Yuca in the comments:
df = df.T.drop_duplicates().T
df.columns = range(len(df.columns))
print(df)
0 1 2 3 4 5 6
0 12.0 13.0 13.4 12.4 16.0 0.0 0.0
1 14.0 12.2 13.4 12.6 19.0 5.0 6.7
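If the double transpose is a concern (it copies the frame twice and can upcast dtypes, which is why the output above is all floats), the same selection can be written with duplicated() on the transpose only; a small sketch of that variant:

# keep only the first occurrence of each distinct column
df = df.loc[:, ~df.T.duplicated()]
df.columns = range(len(df.columns))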
Method 2 using list comprehension with even numbers
We can make a list of even column indices and then select those columns by position. (Note that this assumes every value really comes in a pair; the lone 16 in column 7 has no duplicate, so it is dropped here.)
idxcols = [x-1 for x in range(len(df.columns)) if x % 2]
df = df.iloc[:, idxcols]
df.columns = range(len(df.columns))
print(df)
0 1 2 3 4 5
0 12 13.0 13.4 12.4 0 0.0
1 14 12.2 13.4 12.6 5 6.7
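If the strict pairing does hold, the same even-column selection can be written more compactly with positional slicing:

df = df.iloc[:, ::2]   # every second column, starting from column 0
df.columns = range(len(df.columns))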
In your case
l = [sorted(set(x), key=x.index) for x in df.values.tolist()]   # unique values per row, keeping first-appearance order
newdf = pd.DataFrame(l).ffill(axis=1)                           # rows of unequal length get NaN-padded, then forward-filled
newdf
Out[177]:
0 1 2 3 4 5 6
0 12.0 13.0 13.4 12.4 16.0 0.0 0.0
1 14.0 12.2 13.4 12.6 19.0 5.0 6.7
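If you specifically want something you can apply row by row, as the question asks, the same idea can be wrapped in apply; a sketch along these lines, assuming df holds the data shown above:

# deduplicate each row preserving first-appearance order, then pad shorter rows by forward-filling
newdf = df.apply(lambda row: pd.Series(sorted(set(row), key=list(row).index)), axis=1).ffill(axis=1)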
You can use functools.reduce to sequentially concatenate columns to your output DataFrame if the next column is not equal to the last column added:
from functools import reduce

output_df = reduce(
    lambda d, c: d if (d.iloc[:, -1] == df[c]).all() else pd.concat([d, df[c]], axis=1),
    df.columns[1:],
    df[df.columns[0]].to_frame()
)
print(output_df)
# 0 1 3 5 7 8 10
#0 12 13.0 13.4 12.4 16 0 0.0
#1 14 12.2 13.4 12.6 19 5 6.7
This method also maintains the column names of the columns which were picked, if that's important.
Assuming this is your input df:
print(df)
# 0 1 2 3 4 5 6 7 8 9 10 11
#0 12 13.0 13.0 13.4 13.4 12.4 12.4 16 0 0 0.0 0.0
#1 14 12.2 12.2 13.4 13.4 12.6 12.6 19 5 5 6.7 6.7
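If you would rather end up with the 0..n-1 labels from the desired output instead of the surviving original labels, you can renumber them afterwards:

output_df.columns = range(output_df.shape[1])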
I have a DataFrame and I'd like to perform the same operations (i.e. correlations, graphing) on grouped data. The grouping is based on location (referred to as STA in the DataFrame).
Sample of the Dataframe is below:
Index STA Date Var1 Var2 Var3
0 RE25 1973-04-09 1.0 10.5 6.3
1 RE30 1973-04-09 1.0 10.0 7.6
2 RE25 1973-04-09 5.0 10.6 NaN
3 RE30 1973-04-09 5.0 10.0 NaN
4 RE25 1973-04-09 10.0 10.6 NaN
5 RE30 1973-04-09 10.0 10.2 NaN
6 RE25 1973-04-09 15.0 10.7 NaN
7 RE30 1973-04-09 15.0 10.1 NaN
8 RE25 1973-04-09 20.0 10.7 NaN
9 RE30 1973-04-09 20.0 10.1 NaN
10 RE30 1973-04-09 23.0 10.0 7.6
To generate the list of unique sampling STA (which will be different for each DataFrame), I used
Stations = np.sort(Resdat.STA.unique()).tolist()
which works in creating the unique list of STA that I'm after. However, when I try to call this list I get the following error:
TypeError: 'list' object is not callable.
With my limited knowledge, I'm only making progress with the following code:
RE01 = Resdat.groupby('STA').get_group('RE01')
RE01 = RE01.dropna(axis = 1, how = 'all')
repeated over and over for each unique STA.
I'm sure there is a better way but I'm struggling to find other posted answers that I can use.
You can use a for loop:
names = []
l = []
for name, data in df.groupby('STA'):
    names.append(name)
    l.append(data.dropna(axis=1, how='all'))
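If you want to look the groups up by station name afterwards, rather than keeping two parallel lists, a dict comprehension over the same groupby is a natural variant; a sketch, assuming Resdat is the frame from the question (called df in the loop above):

# one NaN-stripped sub-frame per station, keyed by its STA value
groups = {name: data.dropna(axis=1, how='all') for name, data in Resdat.groupby('STA')}
re25 = groups['RE25']   # e.g. the kind of frame previously built by hand with get_group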
I am sure there must be a very simple solution to this problem, but I am failing to find it (and browsing through previously asked questions, I didn't find the answer I wanted or didn't understand it).
I have a dataframe similar to this (just much bigger, with many more rows and columns):
x val1 val2 val3
0 0.0 10.0 NaN NaN
1 0.5 10.5 NaN NaN
2 1.0 11.0 NaN NaN
3 1.5 11.5 NaN 11.60
4 2.0 12.0 NaN 12.08
5 2.5 12.5 12.2 12.56
6 3.0 13.0 19.8 13.04
7 3.5 13.5 13.3 13.52
8 4.0 14.0 19.8 14.00
9 4.5 14.5 14.4 14.48
10 5.0 15.0 19.8 14.96
11 5.5 15.5 15.5 15.44
12 6.0 16.0 19.8 15.92
13 6.5 16.5 16.6 16.40
14 7.0 17.0 19.8 18.00
15 7.5 17.5 17.7 NaN
16 8.0 18.0 19.8 NaN
17 8.5 18.5 18.8 NaN
18 9.0 19.0 19.8 NaN
19 9.5 19.5 19.9 NaN
20 10.0 20.0 19.8 NaN
In the next step, I need to compute the derivative dVal/dx for each of the value columns (in reality I have more than 3 columns, so I need a robust solution in a loop; I can't select the rows manually each time). But because of the NaN values in some of the columns, I am facing the problem that x and val are not of the same dimension. I feel the way to overcome this would be to select only those x intervals for which val is notnull, but I am not able to do that. I am probably making some very stupid mistakes (I am not a programmer, so please be patient with me :) ).
Here is the code so far (I may have introduced some mistakes by leaving in old pieces of code, because I've been messing with it for a while, trying different things):
import pandas as pd
import numpy as np

df = pd.read_csv('H:/DocumentsRedir/pokus/dataframe.csv', delimiter=',')
vals = list(df.columns.values)[1:]

for i in vals:
    V = np.asarray(pd.notnull(df[i]))
    mask = pd.notnull(df[i])
    X = np.asarray(df.loc[mask]['x'])
    derivative = np.diff(V) / np.diff(X)
But I am getting this error:
ValueError: operands could not be broadcast together with shapes (20,) (15,)
So, apparently, it did not select only the notnull values...
Is there an obvious mistake that I am making or a different approach that I should adopt? Thanks!
(And another less important question: is np.diff the right function to use here, or would it be better to calculate the finite differences manually? I'm not finding the numpy documentation very helpful.)
To calculate dVal/dX:
dVal = df.iloc[:, 1:].diff() # `x` is in column 0.
dX = df['x'].diff()
>>> dVal.apply(lambda series: series / dX)
val1 val2 val3
0 NaN NaN NaN
1 1 NaN NaN
2 1 NaN NaN
3 1 NaN NaN
4 1 NaN 0.96
5 1 NaN 0.96
6 1 15.2 0.96
7 1 -13.0 0.96
8 1 13.0 0.96
9 1 -10.8 0.96
10 1 10.8 0.96
11 1 -8.6 0.96
12 1 8.6 0.96
13 1 -6.4 0.96
14 1 6.4 3.20
15 1 -4.2 NaN
16 1 4.2 NaN
17 1 -2.0 NaN
18 1 2.0 NaN
19 1 0.2 NaN
20 1 -0.2 NaN
We difference all columns (except the first one), and then apply a lambda function to each column which divides it by the differenced x column.
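If you want the derivative taken only over the rows where a given column actually has data (the interval-skipping the question describes), the loop from the question needs the values themselves rather than the boolean mask for V; a sketch of that correction, assuming the same df with x in the first column:

derivatives = {}
for col in df.columns[1:]:
    mask = df[col].notnull()                   # rows where this column has data
    x = np.asarray(df.loc[mask, 'x'])
    v = np.asarray(df.loc[mask, col])          # the values, not the mask
    derivatives[col] = np.diff(v) / np.diff(x)

Each entry then holds the derivative at that column's non-null sample points, so the arrays can have different lengths per column.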
I have several datasets which I am trying to merge into one. Below, I created smaller, simpler made-up datasets to test the method, and it worked perfectly fine.
examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
                           'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
                           'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
                           'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30,40,50,60],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20,30,40,50,60],'T4':[12.0,12.2,12.4,13.2,14.1]})
logs = [log1, log2]
result = examplelog.copy()
for i in logs:
    result = result.merge(i, how='left', on='Depth')
print(result)
The result is, as expected:
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN 12.0
2 30 11.5 11.8 28.8 12.1 12.2
3 40 12.0 12.2 37.7 12.6 12.4
4 50 12.3 12.4 46.6 13.7 13.2
5 60 12.6 12.7 55.5 14.0 14.1
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Happy with the result, I applied this method to my actual data, but for T3 and T4 the resulting dataframe contained only empty columns (all values were NaN). I suspect the problem is floating-point precision: my datasets were created on different machines by different software, and although "Depth" has a precision of two decimal places in all of the files, a given depth may not be stored as exactly 20.05 in both of them; one might hold 20.049999999999999 while the other holds 20.05000000000001. Then the merge function will not match them, as shown in the following example:
examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
                           'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
                           'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
                           'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30.05,40.05,50.05,60.05],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20.01,30.01,40.01,50.01,60.01],'T4':[12.0,12.2,12.4,13.2,14.1]})
logs = [log1, log2]
result = examplelog.copy()
for i in logs:
    result = result.merge(i, how='left', on='Depth')
print(result)
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN NaN
2 30 11.5 11.8 28.8 NaN NaN
3 40 12.0 12.2 37.7 NaN NaN
4 50 12.3 12.4 46.6 NaN NaN
5 60 12.6 12.7 55.5 NaN NaN
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Do you know how to fix this?
Thanks!
Round the Depth values to the appropriate precision:
for df in [examplelog, log1, log2]:
    df['Depth'] = df['Depth'].round(1)
import numpy as np
import pandas as pd

examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
                           'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
                           'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
                           'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30.05,40.05,50.05,60.05],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20.01,30.01,40.01,50.01,60.01],
                     'T4':[12.0,12.2,12.4,13.2,14.1]})

for df in [examplelog, log1, log2]:
    df['Depth'] = df['Depth'].round(1)

logs = [log1, log2]
result = examplelog.copy()
for i in logs:
    result = result.merge(i, how='left', on='Depth')
print(result)
yields
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN 12.0
2 30 11.5 11.8 28.8 12.1 12.2
3 40 12.0 12.2 37.7 12.6 12.4
4 50 12.3 12.4 46.6 13.7 13.2
5 60 12.6 12.7 55.5 14.0 14.1
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Per the comments, rounding does not appear to work for the OP on the actual
data. To debug the problem, find some rows which should merge:
subframes = []
for frame in [examplelog, log2]:
    mask = (frame['Depth'] < 20.051) & (frame['Depth'] >= 20.0)
    subframes.append(frame.loc[mask])
Then post the output of
for frame in subframes:
    print(frame.to_dict('list'))
    print(frame.info())   # shows the dtypes of the columns
This might give us the info we need to reproduce the problem.
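If rounding genuinely cannot be made to line the keys up, one alternative worth trying (not part of the answer above) is a nearest-key merge with a tolerance via pandas.merge_asof; a sketch, assuming both frames are sorted by Depth and using 0.1 as an illustrative tolerance:

result = examplelog.copy()
result['Depth'] = result['Depth'].astype(float)   # merge_asof requires matching key dtypes
for log in [log1, log2]:
    result = pd.merge_asof(result.sort_values('Depth'), log.sort_values('Depth'),
                           on='Depth', direction='nearest', tolerance=0.1)

This matches each Depth to the closest key within the tolerance instead of requiring exact equality.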