I have a script where I do munging with dataframes and extract data like the following:
times = df.loc[df['sy_x'].str.contains('AA'), 't_diff'].quantile([.1, .25, .5, .75, .9])
I want to add the resulting data from quantile() to a data frame with separate columns for each of those quantiles; let's say the columns are:
ID pt_1 pt_2 pt_5 pt_7 pt_9
AA
BB
CC
How might I add the quantiles to each row of ID?
new_df = None
for index, value in times.items():
for col in df[['pt_1', 'pt_2','pt_5','pt_7','pt_9',]]:
...but that feels wrong and not idiomatic. Should I be using loc or iloc? I have a couple more Series that I'll need to add to other columns not shown, but I think I can figure that out once I know the right approach.
EDIT:
Some of the output of times looks like:
0.1 -0.5
0.25 -0.3
0.5 0.0
0.75 2.0
0.90 4.0
Thanks in advance for any insight
IIUC, you want a groupby():
# toy data
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'sy_x': np.random.choice(['AA', 'BB', 'CC'], 100),
                   't_diff': np.random.randint(0, 100, 100)})
df.groupby('sy_x').t_diff.quantile((0.1,.25,.5,.75,.9)).unstack(1)
Output:
0.10 0.25 0.50 0.75 0.90
sy_x
AA 16.5 22.25 57.0 77.00 94.5
BB 9.1 21.00 58.5 80.25 91.3
CC 9.7 23.25 40.5 65.75 84.1
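If you need the pt_1 … pt_9 column layout from the question, one way (a sketch; the quantile-to-name mapping is an assumption based on your example) is to rename the unstacked columns and turn the group keys into an ID column:
res = df.groupby('sy_x').t_diff.quantile((0.1, .25, .5, .75, .9)).unstack(1)
res.columns = ['pt_1', 'pt_2', 'pt_5', 'pt_7', 'pt_9']  # assumed mapping of quantile -> name
res = res.rename_axis('ID').reset_index()  # the sy_x values become the ID column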
Try something like:
pd.DataFrame(times.values, index=times.index)  # one column, quantile levels as the index
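If you instead want the quantiles laid out as a single row (closer to the layout in the question), a small alternative sketch is to transpose the Series:
row = times.to_frame().T  # one row, columns are the quantile levels 0.1 ... 0.9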
I am trying to add a row to my existing pandas dataframe, and the value of the new row should be a computation.
I have a dataframe that looks like the below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
I want to add a row called "Metric" which is the sum of the "LE_St" values for "Rating" >= 4 and <= 6, divided by the "LE_St" value for "All", i.e. Metric = (0.05 + 1.77) / 10.17.
My output dataframe should look like below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
Metric 0.44
I believe your approach to the dataframe is wrong.
Usually rows hold values that correlate with the columns in a way that makes sense, not free-form information. The power of pandas and Python lies in holding and manipulating data: you can easily compute a value from one column, or even all columns, and store it in a "summary"-like dataframe or in separate variables. That might help you here as well.
For computations on a column (i.e. a Series object) you can use the .sum() method (or any of the other computational tools) and slice your dataframe by the values in the "Rating" column.
For one-off computations of small statistics you may be better off with Excel :)
An example of a solution might look like this:
total = 10.17  # I don't know where this value comes from; renamed so it doesn't shadow the builtin all()
# 'Rating' mixes numbers with the string 'All', so coerce to numeric before comparing
rating = pd.to_numeric(df['Rating'], errors='coerce')
sliced_df = df[rating.between(4, 6, inclusive='both')]
metric = sliced_df['LE_St'].sum() / total
print(metric)  # or store it somewhere, however you like
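If you also want the value appended as the "Metric" row from your expected output, a minimal sketch continuing the snippet above (assuming pandas is imported as pd, and assuming the number belongs in the LE_St column) would be:
metric_row = pd.DataFrame({'Rating': ['Metric'], 'LE_St': [metric]})
df = pd.concat([df, metric_row], ignore_index=True)  # '% Total' stays NaN for this row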
We have a data frame with a sorted float index and two columns that should be the same. Their values are not always present, and in the worst case scenario, they do not have overlaps in the index values. The goal is to be able to check how far they are from each other.
I was thinking about interpolating the missing values and then calculating the distance. This would result in a large collection of index values for which this distance can be calculated.
Another approach would be to compare the actual values, and come up with an index error for which this comparison would make sense.
The question is which approach would make more sense and how to calculate the distance. The result should tell us how close they are to each other, where, for example, 0 means they are the same.
Example
We have a data frame with two columns a1 and a2 and a sorted, float index.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a1': [6.1, np.nan, 6.8, 7.5, 7.9],
                   'a2': [6.2, 6.6, 6.8, np.nan, 7.7]},
                  index=[0.10, 0.11, 0.13, 0.16, 0.17])
a1 a2
0.10 6.1 6.2
0.11 NaN 6.6
0.13 6.8 6.8
0.16 7.5 NaN
0.17 7.9 7.7
If your objective is to get the absolute distance between the interpolated vectors, you can proceed as follows:
r = df.interpolate()
absolute_sum = (r["a1"] - r["a2"]).abs().sum()
With the given example the result is 0.7000000000000011.
Though if you are interested in how similar the two columns are, you could take a look at the correlation coefficient:
r = df.interpolate()
correlation = r["a1"].corr(r["a2"])
With the given example the result is 0.9929580338258082.
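Note that interpolate defaults to method='linear', which treats the rows as equally spaced and ignores the float index; since the index carries the spacing information here, it may be more appropriate (an assumption about your data) to interpolate against it:
r = df.interpolate(method='index')  # weight the interpolation by the index values
absolute_sum = (r["a1"] - r["a2"]).abs().sum()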
Since you mention distance:
from scipy.spatial import distance

df = df.interpolate(axis=0)
pd.DataFrame(distance.cdist(df.values, df.values, 'euclidean'),
             columns=df.index, index=df.index)
Out[468]:
0.10 0.11 0.13 0.16 0.17
0.10 0.000000 0.531507 0.921954 1.750000 2.343075
0.11 0.531507 0.000000 0.403113 1.234909 1.820027
0.13 0.921954 0.403113 0.000000 0.832166 1.421267
0.16 1.750000 1.234909 0.832166 0.000000 0.602080
0.17 2.343075 1.820027 1.421267 0.602080 0.000000
My question is exactly the same as that question, but my language is Python, not R, so I am asking it again.
I have two time series with different time stamps and a different number of data points.
For example, the first and second data sets look as follows (tables omitted).
I concatenate two tables into one table.
I want to do two things. First, the time index should be in order. That is easily done by pd.concat([df1, df2], axis=1) (result omitted).
The second thing is to replace NaN with the most recent data point.
For example, at time 0.10 the value of column 'B' should be 2.1, which is the value at time 0.09. In the same manner, the value of column 'A' at time 0.30 should be 3.0. But there is still no value at time 0.09 for column 'A'.
How can I do this second job?
Thank you!
You can use fillna with method='ffill' (forward fill):
>>> df.fillna(method='ffill')
A B
0.09 NaN 2.1
0.10 2.0 2.1
0.22 3.0 3.3
0.30 3.0 5.1
0.33 5.0 5.1
0.50 4.0 4.0
0.59 4.0 10.0
0.60 10.0 10.0
If you want to reassign this to the same dataframe, set the parameter inplace=True.
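Note that on recent pandas releases (2.1+, if I recall correctly) the method= argument to fillna is deprecated in favor of the dedicated methods, so the forward fill would be written as:
df = df.ffill()  # equivalent to fillna(method='ffill') on current pandas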
Is it possible to change column names using data in a list?
import pandas as pd

df = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53], [2, 3.35, 2.0, 0.2, 0.65],
                   [2, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55],
                   [1, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55]],
                  columns=["ID", "A", "B", "C", "D"]).set_index('ID')
I have my new labels as below:
New_Labels=['NaU', 'MgU', 'AlU', 'SiU']
Is it possible to change the names using the data in the above list? My original data set has 100 columns, and I do not want to do it manually for each column.
I was trying df.rename but kept getting errors. Thanks!
You can use this:
df.columns = New_Labels
Using rename is the formally more correct approach. You just have to provide a dictionary that maps your current column names to the new ones (which guarantees the expected result even if the columns are out of order):
new_names = {'A': 'NaU', 'B': 'MgU', 'C': 'AlU', 'D': 'SiU'}
df.rename(columns=new_names)
Note that you can provide entries only for the names you want to substitute; the rest will remain the same.
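Since the real data has 100 columns, the mapping does not have to be written by hand; assuming New_Labels is ordered to match df.columns, it can be built programmatically, e.g.:
new_names = dict(zip(df.columns, New_Labels))  # old name -> new name
df = df.rename(columns=new_names)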
df = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53], [2, 3.35, 2.0, 0.2, 0.65],
                   [2, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55],
                   [1, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55]],
                  columns=["ID", "A", "B", "C", "D"]).set_index('ID')
New_Labels = ['NaU', 'MgU', 'AlU', 'SiU']
df.columns = New_Labels
This will make df look like this:
NaU MgU AlU SiU
ID
1 1.00 2.3 0.20 0.53
2 3.35 2.0 0.20 0.65
2 3.40 2.0 0.25 0.55
3 3.40 2.0 0.25 0.55
1 3.40 2.0 0.25 0.55
3 3.40 2.0 0.25 0.55
df.columns = New_Labels
Take care that the order of the new column names matches the order of the existing columns.
The accepted rename answer is fine, but it's mainly for mapping old→new names. If we just want to wipe out the column names with a new list, there's no need to create an intermediate mapping dictionary. Just use set_axis directly.
set_axis
To set a list as the columns, use set_axis along axis=1 (the default axis=0 sets the index values):
df.set_axis(New_Labels, axis=1)
# NaU MgU AlU SiU
# ID
# 1 1.00 2.3 0.20 0.53
# 2 3.35 2.0 0.20 0.65
# 2 3.40 2.0 0.25 0.55
# 3 3.40 2.0 0.25 0.55
# 1 3.40 2.0 0.25 0.55
# 3 3.40 2.0 0.25 0.55
Note that set_axis is similar to modifying df.columns directly, but set_axis allows method chaining, e.g.:
df.some_method().set_axis(New_Labels, axis=1).other_method()
Theoretically, set_axis should also provide better error checking than directly modifying an attribute, though I can't find a concrete example at the moment.
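One version note (hedged, from memory): older pandas releases had an inplace parameter on set_axis, but current versions simply return a new frame, so assign the result back if you want to keep it:
df = df.set_axis(New_Labels, axis=1)  # set_axis returns a new frame on current pandas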