My question is exactly the same as that question, but my language is Python, not R, so I am asking it again here.
I have two time series with different time stamps and a different number of data points. The first series is column 'A' and the second is column 'B', each indexed by its own set of times.
I concatenate the two tables into one table.
I want to do two things. First, the time index should be in order.
That is easily done by pd.concat([df1, df2], axis=1). The result is one table with a sorted time index and NaN wherever a series has no observation at that time.
The second thing is to replace each NaN with the most recent earlier data point. For example, at time 0.10 the value of column 'B' should become 2.1, which is the value at time 0.09. In the same manner, the value of column 'A' at time 0.30 should become 3.0. Column 'A' still has no value at time 0.09, though, because there is nothing earlier to carry forward.
How can I do this second step?
Thank you!
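For concreteness, here is a minimal sketch of two such series and the concatenation step. The exact sample values are only illustrative, reconstructed to be consistent with the forward-filled output shown in the answer below:
import pandas as pd

# illustrative values, chosen to match the forward-filled output below
df1 = pd.DataFrame({'A': [2.0, 3.0, 5.0, 4.0, 10.0]},
                   index=[0.10, 0.22, 0.33, 0.50, 0.60])
df2 = pd.DataFrame({'B': [2.1, 3.3, 5.1, 4.0, 10.0]},
                   index=[0.09, 0.22, 0.30, 0.50, 0.59])

# outer-join on the time index; missing observations become NaN
df = pd.concat([df1, df2], axis=1).sort_index()
#          A     B
# 0.09   NaN   2.1
# 0.10   2.0   NaN
# 0.22   3.0   3.3
# 0.30   NaN   5.1
# 0.33   5.0   NaN
# 0.50   4.0   4.0
# 0.59   NaN  10.0
# 0.60  10.0   NaN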
You can use fillna with method='ffill' (forward fill):
>>> df.fillna(method='ffill')
A B
0.09 NaN 2.1
0.10 2.0 2.1
0.22 3.0 3.3
0.30 3.0 5.1
0.33 5.0 5.1
0.50 4.0 4.0
0.59 4.0 10.0
0.60 10.0 10.0
If you want to apply this to the same DataFrame rather than get a copy, set the parameter inplace=True.
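On newer pandas versions, fillna(method=...) has been deprecated in favour of the dedicated method, so the same forward fill can also be written as:
df.ffill()  # equivalent to df.fillna(method='ffill')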
I have a pandas DataFrame containing about 2 million rows, which looks like the following example:
ID V1 V2 V3 V4 V5
12 0.2 0.3 0.5 0.03 0.9
12 0.5 0.4 0.6 0.7 1.8
01 3.8 2.9 1.1 1.6 1.5
17 0.9 1.2 1.8 2.6 9.0
02 0.2 0.3 0.5 0.03 0.9
12 0.5 0.4 0.6 0.7 1.8
07 3.8 2.9 1.1 1.6 1.5
19 0.9 1.2 1.8 2.6 9.0
19 0.5 0.4 0.6 0.7 1.8
06 3.8 2.9 1.1 1.6 1.5
17 0.9 1.2 1.8 2.6 9.0
18 0.9 1.2 1.8 2.6 9.0
I want to create three subsets of this data such that the ID values are mutually exclusive across the subsets, and each subset contains all rows of the main DataFrame for its IDs.
As of now, I am collecting the unique IDs in a list, shuffling it randomly, and then selecting all rows from the DataFrame whose ID belongs to a given fraction of that list.
import numpy as np
import random
distinct = list(set(df.ID.values))
random.shuffle(distinct)
X1, X2 = distinct[:1000000], distinct[1000000:2000000]
df_X1 = df.loc[df['ID'].isin(list(X1))]
df_X2 = df.loc[df['ID'].isin(list(X2))]
This works as expected for smaller data; however, for larger data the run does not complete even after many hours. Is there a more efficient way to do this? I appreciate any responses.
I think the slowdown is coming from the isin lookup against a plain Python list inside the loc slice. I tried a different approach using NumPy and a boolean mask that seems to double the speed.
First, set up the DataFrame. I wasn't sure how many unique IDs you had, so I selected 50. I was also unsure how many columns and rows, so I arbitrarily selected 10,000 of each.
import numpy as np
import pandas as pd
# sample data: 10,000 rows by 10,000 columns, with IDs drawn from 50 possible values
df = pd.DataFrame(np.random.randn(10000, 10000))
ID = np.random.randint(0, 50, 10000)
df['ID'] = ID
Then I use mostly NumPy arrays and avoid the Python list lookup by building a boolean mask.
# create a numpy array from the ID column
a_ID = np.array(df['ID'])
# use np.unique to get the unique IDs
a = np.unique(a_ID)
# shuffle the unique array
np.random.seed(100)
np.random.shuffle(a)
# cut the shuffled array in half
X1 = a[0:25]
# create a boolean mask marking the rows whose ID is in X1
mask = np.isin(a_ID, X1)
# set the index to the mask and select the True rows
df.index = mask
df.loc[True]
When I ran your code on my sample df, it took 817 ms; the code above runs in 445 ms.
Not sure if this helps. Good question, thanks.
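Building on the same idea, the boolean mask can be used to index df directly instead of being assigned to df.index, and np.array_split can cut the shuffled unique IDs into the three mutually exclusive groups the question asks for. A sketch, reusing df, a_ID, and the shuffled array a from above:
# split the shuffled unique IDs into three disjoint groups
groups = np.array_split(a, 3)
# boolean-mask indexing leaves df.index untouched
df_X1, df_X2, df_X3 = [df[np.isin(a_ID, g)] for g in groups]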
We have a DataFrame with a sorted float index and two columns that should be the same. Their values are not always present, and in the worst case they have no overlapping index values. The goal is to be able to check how far apart they are.
I was thinking about interpolating the missing values and then calculating the distance. This would result in a large collection of index values for which this distance can be calculated.
Another approach would be to compare the actual values, and come up with an index error for which this comparison would make sense.
The question is which approach makes more sense and how to calculate the distance. The result should tell us how close the columns are to each other, with e.g. 0 meaning that they are identical.
Example
We have a data frame with two columns a1 and a2 and a sorted, float index.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a1': [6.1, np.nan, 6.8, 7.5, 7.9],
                   'a2': [6.2, 6.6, 6.8, np.nan, 7.7]},
                  index=[0.10, 0.11, 0.13, 0.16, 0.17])
a1 a2
0.10 6.1 6.2
0.11 NaN 6.6
0.13 6.8 6.8
0.16 7.5 NaN
0.17 7.9 7.7
If your objective is to get the absolute distance between the interpolated vectors, you can proceed as follows:
r = df.interpolate()
absolute_sum = (r["a1"] - r["a2"]).abs().sum()
With the given example the result is 0.7000000000000011.
Though if you are interested in how similar the two columns are, you could take a look at the correlation coefficient:
r = df.interpolate()
correlation = r["a1"].corr(r["a2"])
With the given example the result is 0.9929580338258082.
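Since the index here is a sorted float axis, you may also want interpolation weighted by the index values rather than by row position, which pandas supports via method='index'. A sketch (note that this changes the numbers above slightly):
# interpolate using the actual index spacing instead of assuming equal spacing
r = df.interpolate(method='index')
absolute_sum = (r["a1"] - r["a2"]).abs().sum()
correlation = r["a1"].corr(r["a2"])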
Since you mention distance:
from scipy.spatial import distance

df = df.interpolate(axis=0)
pd.DataFrame(distance.cdist(df.values, df.values, 'euclidean'), columns=df.index, index=df.index)
Out[468]:
0.10 0.11 0.13 0.16 0.17
0.10 0.000000 0.531507 0.921954 1.750000 2.343075
0.11 0.531507 0.000000 0.403113 1.234909 1.820027
0.13 0.921954 0.403113 0.000000 0.832166 1.421267
0.16 1.750000 1.234909 0.832166 0.000000 0.602080
0.17 2.343075 1.820027 1.421267 0.602080 0.000000
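If, instead of pairwise distances between rows, you want a single number describing how far apart the two columns are, the same scipy module can be applied to the interpolated columns directly (np.linalg.norm of the difference would give the same value). A small sketch, assuming df has already been interpolated as above:
# one Euclidean distance between the two interpolated columns
d = distance.euclidean(df['a1'], df['a2'])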
I have:
A1 A2 Random data Random data2 Average Stddev
0 0.1 2.0 300 3000 1.05 1.343503
1 0.5 4.5 4500 450 2.50 2.828427
2 3.0 1.2 800 80 2.10 1.272792
3 9.0 9.0 900 90 9.00 0.000000
And I would like to add a column 'ColumnX' whose values are calculated as:
ColumnX = min(df['Random data'] - df['Average'],
              (df['Random data2'] - df['Stddev']) / (3.0 * df['A2']))
I get the error:
ValueError: The truth value of a Series is ambiguous.
Your error comes from the built-in min function: comparing two Series produces another Series, and a Series has no unambiguous truth value, so built-in min is not going to work row-wise.
A potential solution is to make two new calculated columns and then use the pandas DataFrame .min method:
df['calc_col_1'] = df['Random data']-df['Average']
df['calc_col_2'] = (df['Random data2']-df['Stddev'])/(3.0*df['A2'])
df['min_col'] = df[['calc_col_1','calc_col_2']].min(axis=1)
The min(axis=1) call finds the minimum of the two columns row by row and assigns the result to the new column. This approach is efficient because it uses NumPy vectorization, and it is easier to read.
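If you would rather not keep the intermediate columns, the same row-wise minimum can be written in one step with np.minimum, which takes the element-wise minimum of two Series. A minimal sketch, assuming the column names above:
import numpy as np

df['ColumnX'] = np.minimum(df['Random data'] - df['Average'],
                           (df['Random data2'] - df['Stddev']) / (3.0 * df['A2']))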
Is it possible to change column names using data in a list?
df = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53],
                   [2, 3.35, 2.0, 0.2, 0.65],
                   [2, 3.4, 2.0, 0.25, 0.55],
                   [3, 3.4, 2.0, 0.25, 0.55],
                   [1, 3.4, 2.0, 0.25, 0.55],
                   [3, 3.4, 2.0, 0.25, 0.55]],
                  columns=["ID", "A", "B", "C", "D"]).set_index('ID')
I have my new labels as below:
New_Labels=['NaU', 'MgU', 'AlU', 'SiU']
Is it possible to change the names using the data in the above list? My original data set has 100 columns, and I do not want to rename each column manually.
I was trying to use df.rename but keep getting errors. Thanks!
You can use this:
df.columns = New_Labels
Using rename is the formally more correct approach. You just have to provide a dictionary that maps your current column names to the new ones (which guarantees the expected result even if the columns are in a different order):
new_names = {'A':'NaU', 'B':'MgU', 'C':'Alu', 'D':'SiU'}
df.rename(index=str, columns=new_names)
Notice that you only have to provide entries for the names you want to substitute; the rest will remain the same. Also remember that rename returns a new DataFrame, so assign the result back (or pass inplace=True) to keep the change.
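Since the new names are already in a list, the mapping dictionary does not have to be typed by hand; zipping the current columns against the list builds it for you. A sketch, assuming New_Labels is in the same order as the existing columns:
# build the old -> new mapping from the current columns and the list of labels
df = df.rename(columns=dict(zip(df.columns, New_Labels)))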
df = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53],
                   [2, 3.35, 2.0, 0.2, 0.65],
                   [2, 3.4, 2.0, 0.25, 0.55],
                   [3, 3.4, 2.0, 0.25, 0.55],
                   [1, 3.4, 2.0, 0.25, 0.55],
                   [3, 3.4, 2.0, 0.25, 0.55]],
                  columns=["ID", "A", "B", "C", "D"]).set_index('ID')
New_Labels=['NaU', 'MgU', 'AlU', 'SiU']
df.columns = New_Labels
This will make df look like this:
NaU MgU AlU SiU
ID
1 1.00 2.3 0.20 0.53
2 3.35 2.0 0.20 0.65
2 3.40 2.0 0.25 0.55
3 3.40 2.0 0.25 0.55
1 3.40 2.0 0.25 0.55
3 3.40 2.0 0.25 0.55
df.columns = New_Labels
Take care with the order of the new column names.
The accepted rename answer is fine, but it is mainly for mapping old names to new names. If we just want to replace the column names with a new list, there is no need to create an intermediate mapping dictionary; just use set_axis directly.
set_axis
To set a list as the columns, use set_axis along axis=1 (the default axis=0 sets the index values):
df.set_axis(New_Labels, axis=1)
# NaU MgU AlU SiU
# ID
# 1 1.00 2.3 0.20 0.53
# 2 3.35 2.0 0.20 0.65
# 2 3.40 2.0 0.25 0.55
# 3 3.40 2.0 0.25 0.55
# 1 3.40 2.0 0.25 0.55
# 3 3.40 2.0 0.25 0.55
Note that set_axis is similar to modifying df.columns directly, but set_axis allows method chaining, e.g.:
df.some_method().set_axis(New_Labels, axis=1).other_method()
Theoretically, set_axis should also provide better error checking than directly modifying an attribute, though I can't find a concrete example at the moment.
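One practical note, assuming a reasonably recent pandas: set_axis returns a new DataFrame rather than modifying df in place, so assign the result back if you want to keep it:
df = df.set_axis(New_Labels, axis=1)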
I am merging one column from a DataFrame (df1) with another DataFrame (df2), where both have the same index. The result of this operation gives me a lot more rows than I started with (duplicates). Is there a way to avoid the duplicates? Please see the example code below to replicate my issue.
df1 = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53],
                    [2, 3.35, 2.0, 0.2, 0.65],
                    [2, 3.4, 2.0, 0.25, 0.55]],
                   columns=["Sample_ID", "NaX", "NaU", "OC", "EC"]).set_index('Sample_ID')

df2 = pd.DataFrame([[1, 0.2, 1.5, 82],
                    [2, 3.35, 2.4, 92],
                    [2, 3.4, 2.0, 0.25]],
                   columns=["Sample_ID", "OC", "Flow", "Diameter"]).set_index('Sample_ID')

df1 = pd.merge(df1, df2['Flow'].to_frame(), left_index=True, right_index=True)
My result (below) has two entries for sample "2" starting with 3.35 and then two entries for "2" starting with 3.40.
What I was expecting was just two entries for "2": one starting with 3.35 and the other starting with 3.40. So the total number of rows should be only three, while I now have five rows of data.
Can you please see what the reason for this is? Thanks for your help!
NaX NaU OC EC Flow
Sample_ID
1 1.00 2.3 0.20 0.53 1.5
2 3.35 2.0 0.20 0.65 2.4
2 3.35 2.0 0.20 0.65 2.0
2 3.40 2.0 0.25 0.55 2.4
2 3.40 2.0 0.25 0.55 2.0
What you want to do is concatenate as follows:
pd.concat([df1, df2['Flow'].to_frame()], axis=1)
...which returns your desired output. The axis=1 argument lets you "glue on" extra columns.
As to why your join is returning twice as many entries for Sample_ID = 2, you can read through the docs on joins. The relevant portion is:
In SQL / standard relational algebra, if a key combination appears more than once in both tables, the resulting table will have the Cartesian product of the associated data.
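If the two frames really do share the same index in the same row order, another option is to assign the column positionally, which bypasses index alignment entirely. A sketch, assuming the original df1 and df2 from the question and that their rows line up one-for-one:
# positional assignment: no index alignment, so no Cartesian product on duplicate keys
df1['Flow'] = df2['Flow'].to_numpy()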