I have data collected from a lineage of instruments with some overlap. I want to merge them into a single pandas data structure so that, for each column, the newest available data take precedence wherever they are not NaN, and the older data are retained otherwise.
The following code produces the intended output, but it is a lot of code for such a simple task. Additionally, the final step involves identifying duplicated index values, and I am nervous about relying on the keep='last' part, because df.combine_first(other) reorders the data. Is there a more compact, efficient and/or predictable way to do this?
# set up the data
import numpy as np
import pandas as pd

df0 = pd.DataFrame({"x": [0., 1., 2., 3., 4.], "y": [0., 1., 2., 3., np.nan], "t": [0, 1, 2, 3, 4]})  # oldest/lowest priority
df1 = pd.DataFrame({"x": [np.nan, 4.1, 5.1, 6.1], "y": [3.1, 4.1, 5.1, 6.1], "t": [3, 4, 5, 6]})
df2 = pd.DataFrame({"x": [8.2, 10.2], "t": [8, 10]})  # newest/highest priority
df0.set_index("t", inplace=True)
df1.set_index("t", inplace=True)
df2.set_index("t", inplace=True)
# this concatenates, leaving duplicate index values from df0, df1, df2
dfmerge = pd.concat((df0, df1, df2), sort=True)
print("dfmerge, with duplicate rows and interlaced NaN data")
print(dfmerge)
# Now apply, in priority order, each of the original dataframes to fill the NaNs
dfmerge2 = dfmerge.copy()
for ddf in (df2, df1, df0):
    dfmerge2 = dfmerge2.combine_first(ddf)
print("\ndfmerge2, fillable NaNs filled but duplicate indices now reordered")
print(dfmerge2) # row order has changed unpredictably
# finally, drop duplicate indices
dfmerge3 = dfmerge2.copy()
dfmerge3 = dfmerge3.loc[~dfmerge3.index.duplicated(keep='last')]
print("dfmerge3, final")
print(dfmerge3)
The output of which is this:
dfmerge, with duplicate rows and interlaced NaN data
x y
t
0 0.0 0.0
1 1.0 1.0
2 2.0 2.0
3 3.0 3.0
4 4.0 NaN
3 NaN 3.1
4 4.1 4.1
5 5.1 5.1
6 6.1 6.1
8 8.2 NaN
10 10.2 NaN
dfmerge2, fillable NaNs filled but duplicate indices now reordered
x y
t
0 0.0 0.0
1 1.0 1.0
2 2.0 2.0
3 3.0 3.0
3 3.0 3.1
4 4.0 4.1
4 4.1 4.1
5 5.1 5.1
6 6.1 6.1
8 8.2 NaN
10 10.2 NaN
dfmerge3, final
x y
t
0 0.0 0.0
1 1.0 1.0
2 2.0 2.0
3 3.0 3.1
4 4.1 4.1
5 5.1 5.1
6 6.1 6.1
8 8.2 NaN
10 10.2 NaN
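For what it's worth, a more compact route (a hedged sketch, not from the original post): concatenate oldest-first and take, per index label and per column, the last non-NaN value. GroupBy.last skips NaNs by default, so the newest available datum wins, and the result comes back sorted by t with no duplicate labels:
import numpy as np
import pandas as pd

# df0, df1, df2 as defined above (oldest to newest)
dfmerge3 = (
    pd.concat([df0, df1, df2], sort=True)  # oldest first, newest last
      .groupby(level=0)                    # group duplicate t labels together
      .last()                              # last non-NaN per column = newest datum
)
This avoids both the combine_first loop and the duplicated-index cleanup.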
In your case
s = pd.concat([df0, df1, df2], sort=False)
s[:] = np.sort(s, axis=0)
s = s.dropna(thresh=1)
s
x y
t
0 0.0 0.0
1 1.0 1.0
2 2.0 2.0
3 3.0 3.0
4 4.0 3.1
3 4.1 4.1
4 5.1 5.1
5 6.1 6.1
6 8.2 NaN
8 10.2 NaN
I have a DataFrame that looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
a b
0 1.0 4.0
1 2.0 2.0
2 NaN 3.0
3 1.0 NaN
4 NaN NaN
5 NaN 1.0
6 4.0 5.0
7 2.0 NaN
8 3.0 5.0
9 NaN 8.0
I want to dynamically replace the NaN values. I have tried (df.ffill() + df.bfill()) / 2, but that does not yield the desired output: it computes all the fill values from the original neighbours at once, rather than updating them sequentially. I have also tried interpolate, but it doesn't work well for non-linear data.
I have seen this answer, but I did not fully understand it and am not sure whether it would work here.
Update on the computation of the values
I want every NaN value to be the mean of the previous and next non-NaN values. When more than one NaN occurs in sequence, I want to replace them one at a time, recomputing the mean as I go. For example, given 1, np.nan, np.nan, 4, I first take the mean of 1 and 4 (2.5) for the first NaN, obtaining 1, 2.5, np.nan, 4; the second NaN then becomes the mean of 2.5 and 4, giving 1, 2.5, 3.25, 4.
The desired output is
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
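A minimal single-pass sketch of this rule (hedged: the helper name fill_sequential is mine, not from the thread, and like the answers below it falls back to 0 when a NaN has no valid neighbour on one side):
import numpy as np
import pandas as pd

def fill_sequential(s: pd.Series) -> pd.Series:
    out = s.copy()
    nxt = s.bfill()          # next originally-valid value at each position
    prev = 0.0               # fallback when nothing valid exists to the left
    for i in range(len(out)):
        if pd.isna(out.iloc[i]):
            right = nxt.iloc[i]
            right = 0.0 if pd.isna(right) else right
            out.iloc[i] = (prev + right) / 2
        prev = out.iloc[i]   # the freshly filled value feeds the next step
    return out

df_filled = df.apply(fill_sequential)
On the example above this reproduces the desired output exactly.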
Inspired by the @ye olde noobe answer (thanks!):
I've optimized it to be roughly 100x faster (times comparison below):
def custom_fillna(s: pd.Series):
    # visit only the positions that are actually NaN
    for i in range(len(s)):
        if pd.isna(s[i]):
            last_valid_number = (s[s[:i].last_valid_index()]
                                 if s[:i].last_valid_index() is not None else 0)
            next_valid_number = (s[s[i:].first_valid_index()]
                                 if s[i:].first_valid_index() is not None else 0)
            s[i] = (last_valid_number + next_valid_number) / 2
custom_fillna(df['a'])
df
Times comparison: (the original benchmark figure is not reproduced here)
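A rough way to rerun the comparison yourself (a sketch; it assumes both custom_fillna above and fill_dynamically from the answer below are in scope):
import timeit
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
vals = rng.random(2_000)
vals[rng.random(2_000) < 0.3] = np.nan   # knock out ~30% of the values
big = pd.Series(vals)

print(timeit.timeit(lambda: custom_fillna(big.copy()), number=3))
print(timeit.timeit(lambda: fill_dynamically(big.copy()), number=3))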
Maybe not the most optimized, but it works (note: from your example, I assume that if there is no valid value before or after a NaN, as in the last row of column a, 0 is used as a replacement):
import numpy as np
import pandas as pd

def fill_dynamically(s: pd.Series):
    for i in range(len(s)):
        s[i] = (
            (0 if s[i:].first_valid_index() is None else s[i:][s[i:].first_valid_index()]) +
            (0 if s[:i + 1].last_valid_index() is None else s[:i + 1][s[:i + 1].last_valid_index()])
        ) / 2
Use like this for the full dataframe:
df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
df.apply(fill_dynamically)
df after applying:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
In case you have other columns and don't want to apply it to the whole dataframe, you can of course use it on a single column, like this:
df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
fill_dynamically(df['a'])
In this case, df looks like this:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 NaN
4 2.50 NaN
5 3.25 1.0
6 4.00 5.0
7 2.00 NaN
8 3.00 5.0
9 1.50 8.0
When I try to use fillna to replace the NaNs in a column with its mean, the column changes from float64 to object, and the filled entries show something like:
<bound method Series.mean of 0    NaN
1    ...>
Here is the code:
mean = df['texture_mean'].mean
df['texture_mean'] = df['texture_mean'].fillna(mean)
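For reference, a quick sketch (mine, not from the original post) that makes the failure mode visible: without parentheses, .mean is the bound method object, and that object is what ends up in the column:
import pandas as pd

s = pd.Series([2.0, 4.0, None])
print(s.mean)    # <bound method Series.mean of ...>  -- a method object
print(s.mean())  # 3.0  -- the number you actually want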
You cannot use mean = df['texture_mean'].mean: without the parentheses this binds the method object itself, not its result, and fillna then fills with that object instead of a number. This is where the problem lies. The following code will work -
df = pd.DataFrame({'texture_mean': [2, 4, None, 6, 1, None], 'A': [1, 2, 3, 4, 5, None]})  # example
df
A texture_mean
0 1.0 2.0
1 2.0 4.0
2 3.0 NaN
3 4.0 6.0
4 5.0 1.0
5 NaN NaN
df['texture_mean'] = df['texture_mean'].fillna(df['texture_mean'].mean())
df
A texture_mean
0 1.0 2.00
1 2.0 4.00
2 3.0 3.25
3 4.0 6.00
4 5.0 1.00
5 NaN 3.25
In case you want to replace the NaNs in every column with that column's respective mean, just do this -
df = df.fillna(df.mean())
df
A texture_mean
0 1.0 2.00
1 2.0 4.00
2 3.0 3.25
3 4.0 6.00
4 5.0 1.00
5 3.0 3.25
Let me know if this is what you want.
I am attempting to combine two sets of data, but I can't figure out which method is most suitable (join, merge, concat, etc.) for this application, and the documentation doesn't have any examples that do what I need to do.
I have two sets of data, structured like so:
>>> A
Time Voltage
1.0 5.1
2.0 5.5
3.0 5.3
4.0 5.4
5.0 5.0
>>> B
Time Current
-1.0 0.5
0.0 0.6
1.0 0.3
2.0 0.4
3.0 0.7
I would like to combine the data columns and merge the 'Time' column together so that I get the following:
>>> AB
Time Voltage Current
-1.0 0.5
0.0 0.6
1.0 5.1 0.3
2.0 5.5 0.4
3.0 5.3 0.7
4.0 5.4
5.0 5.0
I've tried AB = pd.merge_ordered(A, B, on='Time', how='outer'), and while it successfully combined the data, it output something akin to:
>>> AB
Time Voltage Current
-1.0 0.5
0.0 0.6
1.0 5.1
1.0 0.3
2.0 5.5
2.0 0.4
3.0 5.3
3.0 0.7
4.0 5.4
5.0 5.0
You'll note that it did not combine rows with shared 'Time' values.
I have also tried merging à la AB = A.merge(B, on='Time', how='outer'), but that outputs something combined but not sorted, like so:
>>> AB
Time Voltage Current
-1.0 0.5
0.0 0.6
1.0 5.1
2.0 5.5
3.0 5.3 0.7
4.0 5.4
5.0 5.0
1.0 0.3
2.0 0.4
...it essentially skips some of the data in 'Current' and appends it to the bottom, but it does so inconsistently. And again, it does not merge the rows together.
I have also tried AB = pandas.concat([A, B], axis=1), but the result does not get merged. I simply get, well, the concatenation of the two DataFrames, like so:
>>> AB
Time Voltage Time Current
1.0 5.1 -1.0 0.5
2.0 5.5 0.0 0.6
3.0 5.3 1.0 0.3
4.0 5.4 2.0 0.4
5.0 5.0 3.0 0.7
I've been scouring the documentation and this site to try to figure out the exact differences between merge and join, but from what I gather they're pretty similar. Still, I haven't found anything that specifically answers the question of "how to merge rows that share an identical key/index". Can anyone enlighten me on how to do this? I only have a few days' worth of experience with Pandas!
merge
merge combines on columns. By default it takes all commonly named columns; otherwise, you can specify which columns to combine on. In this example, I chose Time.
A.merge(B, 'outer', 'Time')
Time Voltage Current
0 1.0 5.1 0.3
1 2.0 5.5 0.4
2 3.0 5.3 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
5 -1.0 NaN 0.5
6 0.0 NaN 0.6
join
join combines on index values unless you specify a column for the left-hand side instead. That is why I set the index on the right-hand side and specify the Time column for the left-hand side.
A.join(B.set_index('Time'), 'Time', 'outer')
Time Voltage Current
0 1.0 5.1 0.3
1 2.0 5.5 0.4
2 3.0 5.3 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
4 -1.0 NaN 0.5
4 0.0 NaN 0.6
pd.concat
concat combines on index values, so I use a list comprehension in which I iterate over each dataframe I want to combine, [A, B]. In the comprehension, each dataframe takes the name d, hence the for d in [A, B]. axis=1 says to combine them side by side, using the index as the joining key.
pd.concat([d.set_index('Time') for d in [A, B]], axis=1).reset_index()
Time Voltage Current
0 -1.0 NaN 0.5
1 0.0 NaN 0.6
2 1.0 5.1 0.3
3 2.0 5.5 0.4
4 3.0 5.3 0.7
5 4.0 5.4 NaN
6 5.0 5.0 NaN
combine_first
combine_first aligns on the index and fills the gaps in the calling frame with values from the other frame. Note that it sorts the result by index and orders the columns lexicographically (hence Current before Voltage below).
A.set_index('Time').combine_first(B.set_index('Time')).reset_index()
Time Current Voltage
0 -1.0 0.5 NaN
1 0.0 0.6 NaN
2 1.0 0.3 5.1
3 2.0 0.4 5.5
4 3.0 0.7 5.3
5 4.0 NaN 5.4
6 5.0 NaN 5.0
It should work properly if the Time column is of the same dtype in both DFs:
In [192]: A.merge(B, how='outer').sort_values('Time')
Out[192]:
Time Voltage Current
5 -1.0 NaN 0.5
6 0.0 NaN 0.6
0 1.0 5.1 0.3
1 2.0 5.5 0.4
2 3.0 5.3 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
In [193]: A.dtypes
Out[193]:
Time float64
Voltage float64
dtype: object
In [194]: B.dtypes
Out[194]:
Time float64
Current float64
dtype: object
Reproducing your problem:
In [198]: A.merge(B.assign(Time=B.Time.astype(str)), how='outer').sort_values('Time')
Out[198]:
Time Voltage Current
5 -1.0 NaN 0.5
6 0.0 NaN 0.6
0 1.0 5.1 NaN
7 1.0 NaN 0.3
1 2.0 5.5 NaN
8 2.0 NaN 0.4
2 3.0 5.3 NaN
9 3.0 NaN 0.7
3 4.0 5.4 NaN
4 5.0 5.0 NaN
In [199]: B.assign(Time=B.Time.astype(str)).dtypes
Out[199]:
Time object # <------ NOTE
Current float64
dtype: object
Visually it's hard to distinguish:
In [200]: B.assign(Time=B.Time.astype(str))
Out[200]:
Time Current
0 -1.0 0.5
1 0.0 0.6
2 1.0 0.3
3 2.0 0.4
4 3.0 0.7
In [201]: B
Out[201]:
Time Current
0 -1.0 0.5
1 0.0 0.6
2 1.0 0.3
3 2.0 0.4
4 3.0 0.7
Solution found
As per the suggestions below, I had to round the numbers in the 'Time' column prior to merging them, despite the fact that they were both of the same dtype (float64). The suggestion was to round like so:
A = A.assign(Time=A.Time.round(4))
But in my actual situation the column was labeled 'Time, (sec)', and that punctuation makes it unusable as an assign keyword. So instead I used the following line to round it:
A['Time, (sec)'] = A['Time, (sec)'].round(4)
And it worked like a charm. Are there any issues with doing it like that?
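Nothing wrong with that: plain column assignment is the idiomatic fix. If you would rather keep using assign with a column name that is not a valid Python identifier, keyword unpacking is a possible workaround (a sketch; it assumes CPython, which accepts non-identifier string keys passed via **):
# 'Time, (sec)' cannot be written as a keyword argument, but it can be a dict key
A = A.assign(**{'Time, (sec)': A['Time, (sec)'].round(4)})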
I'm trying to assign values to some columns of a DataFrame from another one, mapping the rows by a single key. The problem is that I don't think the mapping is being applied, because NaN is being assigned to the columns.
I should be mapping them by 'SampleID'.
Here is the DF I want to assign values to
>>> df.ix[new_df['SampleID'].isin(pooled['SampleID']), cols]
Volume_Received Quantity massug
88280 2.0 15.0 1.0
88282 3.0 55.0 5.0
88284 2.5 46.2 3.0
88286 2.0 98.0 5.0
229365 2.0 8.4 3.0
229366 3.0 15.9 3.0
229367 1.5 7.7 2.0
233666 1.5 50.8 3.0
233667 4.0 60.2 5.0
This is the new value I have for them
>>> numerical
Volume_Received Quantity massug
SampleID
sample8 10.0 75.0 5.0
sample70 15.0 275.0 25.0
sample72 12.5 231.0 15.0
sample89 6.0 294.0 15.0
sample90 4.0 16.8 6.0
sample96 6.0 31.8 6.0
sample97 3.0 15.4 4.0
sample99 3.0 101.6 6.0
sample100 8.0 120.4 10.0
I'm using this command to assign the values:
df.ix[df['SampleID'].isin(pooled['SampleID']), cols] = numerical[cols]
Here pooled is basically pooled = df[df['type'] == 'Pooled'] and cols is a list of the three columns shown above. After I run the code above, all the values come back as NaN. I think the right-hand side is being aligned on index labels that don't exist on the left, so pandas returns nulls, which become NaN (assumption).
The index does not match. You can use
df.ix[df['SampleID'].isin(pooled['SampleID']), cols] = numerical[cols].values
but only if the sizes (and row order) match exactly! Using .values drops the index, so the assignment becomes purely positional.
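A hedged alternative sketch that aligns by label rather than by position (it assumes df has a SampleID column whose values match numerical's index and are unique):
# align on SampleID instead of relying on row order
tmp = df.set_index('SampleID')
tmp.update(numerical[cols])   # overwrite matching rows/columns with non-NaN values
df = tmp.reset_index()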
I am using pandas.read_csv to read a whitespace delimited file. The file has a variable number of whitespace characters in front of every line (the numbers are right-aligned). When I read this file, it creates a column of NaN. Why does this happen, and what is the best way to prevent it?
Example:
Text file:
9.0 3.3 4.0
32.3 44.3 5.1
7.2 1.1 0.9
Command:
import pandas as pd
pd.read_csv("test.txt", delim_whitespace=True, header=None)
Output:
0 1 2 3
0 NaN 9.0 3.3 4.0
1 NaN 32.3 44.3 5.1
2 NaN 7.2 1.1 0.9
FWIW I tend to use sep=r"\s+" instead, and it doesn't suffer from the same problem:
>>> pd.read_csv("wspace.csv", header=None, delim_whitespace=True)
0 1 2 3
0 NaN 9.0 3.3 4.0
1 NaN 32.3 44.3 5.1
2 NaN 7.2 1.1 0.9
>>> pd.read_csv("wspace.csv", header=None, sep=r"\s+")
0 1 2
0 9.0 3.3 4.0
1 32.3 44.3 5.1
2 7.2 1.1 0.9