I have a script where I do munging with dataframes and extract data like the following:
times = df.loc[df['sy_x'].str.contains('AA'), 't_diff'].quantile([.1, .25, .5, .75, .9])
I want to add the resulting data from quantile() to a data frame with separate columns for each of those quantiles; let's say the columns are:
ID pt_1 pt_2 pt_5 pt_7 pt_9
AA
BB
CC
How might I add the quantiles to each row of ID?
new_df = None
for index, value in times.items():
for col in df[['pt_1', 'pt_2','pt_5','pt_7','pt_9',]]:
...but that feels wrong and not idiomatic. Should I be using loc or iloc? I have a couple more Series that I'll need to add to other columns not shown, but I think I can figure that out once I know the right approach.
EDIT:
Some of the output of times looks like:
0.1 -0.5
0.25 -0.3
0.5 0.0
0.75 2.0
0.90 4.0
Thanks in advance for any insight
IIUC, you want a groupby():
# toy data
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'sy_x': np.random.choice(['AA', 'BB', 'CC'], 100),
                   't_diff': np.random.randint(0, 100, 100)})
df.groupby('sy_x').t_diff.quantile((0.1,.25,.5,.75,.9)).unstack(1)
Output:
0.10 0.25 0.50 0.75 0.90
sy_x
AA 16.5 22.25 57.0 77.00 94.5
BB 9.1 21.00 58.5 80.25 91.3
CC 9.7 23.25 40.5 65.75 84.1
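If you need the pt_1 … pt_9 column layout from the question, one way (a sketch; the quantile-to-name mapping is an assumption based on your example) is to rename the unstacked columns and turn the group keys into an ID column:
res = df.groupby('sy_x').t_diff.quantile((0.1, .25, .5, .75, .9)).unstack(1)
res.columns = ['pt_1', 'pt_2', 'pt_5', 'pt_7', 'pt_9']  # assumed mapping of quantile -> name
res = res.rename_axis('ID').reset_index()  # the sy_x values become the ID column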
Try something like:
pd.DataFrame(times.values, index=times.index)  # one column, quantile levels as the index
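If you instead want the quantiles laid out as a single row (closer to the layout in the question), a small alternative sketch is to transpose the Series:
row = times.to_frame().T  # one row, columns are the quantile levels 0.1 ... 0.9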
I am trying to add a row to my existing pandas dataframe, and the value of the new row should be a computation.
I have a dataframe that looks like the below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
I want to add a row called "Metric" which is the sum of the "LE_St" values for "Rating" >= 4 and <= 6, divided by the "LE_St" value for "All", i.e. Metric = (0.05 + 1.77) / 10.17.
My output dataframe should look like below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
Metric 0.44
I believe your approach to the dataframe is wrong.
Usually rows hold values that correlate with the columns in a way that makes sense, not free-form information. The power of pandas and Python lies in holding and manipulating data: you can easily compute a value from one column, or even all columns, and store it in a "summary"-like dataframe or in separate variables. That might help you here as well.
For computations on a column (i.e. a Series object) you can use the .sum() method (or any of the other computational tools) and slice your dataframe by the values in the "Rating" column.
For one-off computations of small statistics you may be better off with Excel :)
An example of a solution might look like this:
total = 10.17  # I don't know where this value comes from; renamed so it doesn't shadow the builtin all()
# 'Rating' mixes numbers with the string 'All', so coerce to numeric before comparing
rating = pd.to_numeric(df['Rating'], errors='coerce')
sliced_df = df[rating.between(4, 6, inclusive='both')]
metric = sliced_df['LE_St'].sum() / total
print(metric)  # or store it somewhere, however you like
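If you also want the value appended as the "Metric" row from your expected output, a minimal sketch continuing the snippet above (assuming pandas is imported as pd, and assuming the number belongs in the LE_St column) would be:
metric_row = pd.DataFrame({'Rating': ['Metric'], 'LE_St': [metric]})
df = pd.concat([df, metric_row], ignore_index=True)  # '% Total' stays NaN for this row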
We have a data frame with a sorted float index and two columns that should be the same. Their values are not always present, and in the worst case scenario, they do not have overlaps in the index values. The goal is to be able to check how far they are from each other.
I was thinking about interpolating the missing values and then calculating the distance. This would result in a large collection of index values for which this distance can be calculated.
Another approach would be to compare the actual values, and come up with an index error for which this comparison would make sense.
The question is which approach would make more sense and how to calculate the distance. The result should tell us how close they are to each other, where, for example, 0 means they are the same.
Example
We have a data frame with two columns a1 and a2 and a sorted, float index.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a1': [6.1, np.nan, 6.8, 7.5, 7.9],
                   'a2': [6.2, 6.6, 6.8, np.nan, 7.7]},
                  index=[0.10, 0.11, 0.13, 0.16, 0.17])
a1 a2
0.10 6.1 6.2
0.11 NaN 6.6
0.13 6.8 6.8
0.16 7.5 NaN
0.17 7.9 7.7
If your objective is to get the absolute distance between the interpolated vectors, you can proceed as follows:
r = df.interpolate()
absolute_sum = (r["a1"] - r["a2"]).abs().sum()
With the given example the result is 0.7000000000000011.
Though if you are interested in how similar the two columns are, you could take a look at the correlation coefficient:
r = df.interpolate()
correlation = r["a1"].corr(r["a2"])
With the given example the result is 0.9929580338258082.
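Note that interpolate defaults to method='linear', which treats the rows as equally spaced and ignores the float index; since the index carries the spacing information here, it may be more appropriate (an assumption about your data) to interpolate against it:
r = df.interpolate(method='index')  # weight the interpolation by the index values
absolute_sum = (r["a1"] - r["a2"]).abs().sum()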
Since you mention distance:
from scipy.spatial import distance

df = df.interpolate(axis=0)
pd.DataFrame(distance.cdist(df.values, df.values, 'euclidean'),
             columns=df.index, index=df.index)
Out[468]:
0.10 0.11 0.13 0.16 0.17
0.10 0.000000 0.531507 0.921954 1.750000 2.343075
0.11 0.531507 0.000000 0.403113 1.234909 1.820027
0.13 0.921954 0.403113 0.000000 0.832166 1.421267
0.16 1.750000 1.234909 0.832166 0.000000 0.602080
0.17 2.343075 1.820027 1.421267 0.602080 0.000000
My question is exactly the same as that question, but my language is Python, not R, so I am asking it again.
I have two time series with different time stamps and a different number of data points.
For example, the first and second data sets look as follows (tables omitted).
I concatenate two tables into one table.
I want to do two things. First, the time index should be in order. That is easily done by pd.concat([df1, df2], axis=1) (result omitted).
The second thing is to replace NaN with the most recent data point.
For example, at time 0.10 the value of column 'B' should be 2.1, which is the value at time 0.09. In the same manner, the value of column 'A' at time 0.30 should be 3.0. But there is still no value at time 0.09 for column 'A'.
How can I do this second job?
Thank you!
You can use fillna with method='ffill' (forward fill):
>>> df.fillna(method='ffill')
A B
0.09 NaN 2.1
0.10 2.0 2.1
0.22 3.0 3.3
0.30 3.0 5.1
0.33 5.0 5.1
0.50 4.0 4.0
0.59 4.0 10.0
0.60 10.0 10.0
If you want to reassign this to the same dataframe, set the parameter inplace=True.
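Note that on recent pandas releases (2.1+, if I recall correctly) the method= argument to fillna is deprecated in favor of the dedicated methods, so the forward fill would be written as:
df = df.ffill()  # equivalent to fillna(method='ffill') on current pandas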
Is it possible to change column names using data in a list?
import pandas as pd

df = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53], [2, 3.35, 2.0, 0.2, 0.65],
                   [2, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55],
                   [1, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55]],
                  columns=["ID", "A", "B", "C", "D"]).set_index('ID')
I have my new labels as below:
New_Labels=['NaU', 'MgU', 'AlU', 'SiU']
Is it possible to change the names using the data in the above list? My original data set has 100 columns, and I do not want to do it manually for each column.
I was trying df.rename but kept getting errors. Thanks!
You can use this:
df.columns = New_Labels
Using rename is the formally more correct approach. You just have to provide a dictionary that maps your current column names to the new ones (which guarantees the expected result even if the columns are out of order):
new_names = {'A': 'NaU', 'B': 'MgU', 'C': 'AlU', 'D': 'SiU'}
df.rename(columns=new_names)
Note that you can provide entries only for the names you want to substitute; the rest will remain the same.
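Since the real data has 100 columns, the mapping does not have to be written by hand; assuming New_Labels is ordered to match df.columns, it can be built programmatically, e.g.:
new_names = dict(zip(df.columns, New_Labels))  # old name -> new name
df = df.rename(columns=new_names)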
df = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53], [2, 3.35, 2.0, 0.2, 0.65],
                   [2, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55],
                   [1, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55]],
                  columns=["ID", "A", "B", "C", "D"]).set_index('ID')
New_Labels = ['NaU', 'MgU', 'AlU', 'SiU']
df.columns = New_Labels
This will make df look like this:
NaU MgU AlU SiU
ID
1 1.00 2.3 0.20 0.53
2 3.35 2.0 0.20 0.65
2 3.40 2.0 0.25 0.55
3 3.40 2.0 0.25 0.55
1 3.40 2.0 0.25 0.55
3 3.40 2.0 0.25 0.55
df.columns = New_Labels
Take care that the order of the new column names matches the order of the existing columns.
The accepted rename answer is fine, but it's mainly for mapping old→new names. If we just want to wipe out the column names with a new list, there's no need to create an intermediate mapping dictionary. Just use set_axis directly.
set_axis
To set a list as the columns, use set_axis along axis=1 (the default axis=0 sets the index values):
df.set_axis(New_Labels, axis=1)
# NaU MgU AlU SiU
# ID
# 1 1.00 2.3 0.20 0.53
# 2 3.35 2.0 0.20 0.65
# 2 3.40 2.0 0.25 0.55
# 3 3.40 2.0 0.25 0.55
# 1 3.40 2.0 0.25 0.55
# 3 3.40 2.0 0.25 0.55
Note that set_axis is similar to modifying df.columns directly, but set_axis allows method chaining, e.g.:
df.some_method().set_axis(New_Labels, axis=1).other_method()
Theoretically, set_axis should also provide better error checking than directly modifying an attribute, though I can't find a concrete example at the moment.
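One version note (hedged, from memory): older pandas releases had an inplace parameter on set_axis, but current versions simply return a new frame, so assign the result back if you want to keep it:
df = df.set_axis(New_Labels, axis=1)  # set_axis returns a new frame on current pandas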