Pandas applying data subset to new data frame - python

I have a script where I munge dataframes and extract data like the following:
times = pd.Series(df.loc[df['sy_x'].str.contains('AA'), ('t_diff')].quantile([.1, .25, .5, .75, .9]))
I want to add the data resulting from quantile() to a data frame, with a separate column for each of those quantiles; let's say the columns are:
ID pt_1 pt_2 pt_5 pt_7 pt_9
AA
BB
CC
How might I add the quantiles to each row of ID?
new_df = None
for index, value in times.items():
    for col in df[['pt_1', 'pt_2', 'pt_5', 'pt_7', 'pt_9']]:
        ...
…but that feels wrong and not idiomatic. Should I be using loc or iloc? I have a couple more Series that I'll need to add to other columns not shown, but I think I can figure that out once I know the right approach here.
EDIT:
Some of the output of times looks like:
0.1 -0.5
0.25 -0.3
0.5 0.0
0.75 2.0
0.90 4.0
Thanks in advance for any insight

IIUC, you want a groupby():
# toy data
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'sy_x': np.random.choice(['AA', 'BB', 'CC'], 100),
                   't_diff': np.random.randint(0, 100, 100)})

df.groupby('sy_x').t_diff.quantile((0.1, .25, .5, .75, .9)).unstack(1)
Output:
0.10 0.25 0.50 0.75 0.90
sy_x
AA 16.5 22.25 57.0 77.00 94.5
BB 9.1 21.00 58.5 80.25 91.3
CC 9.7 23.25 40.5 65.75 84.1
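If you also want the pt_* column labels from the question's mock-up, you can rename the unstacked result afterwards. A small sketch (the pt_* names and the ID index label are taken from the question's example, not from pandas):
out = df.groupby('sy_x').t_diff.quantile((0.1, .25, .5, .75, .9)).unstack(1)
out.columns = ['pt_1', 'pt_2', 'pt_5', 'pt_7', 'pt_9']
out.index.name = 'ID'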

Try something like:
pd.DataFrame(times.values.T, index=times.keys())
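For reference, a minimal sketch of what that produces with the times values from the EDIT above, plus a to_frame().T variant that gives one row per ID (my assumption about the shape you want):
import pandas as pd

times = pd.Series([-0.5, -0.3, 0.0, 2.0, 4.0],
                  index=[0.1, 0.25, 0.5, 0.75, 0.9])

# One column, with the quantiles as the index:
pd.DataFrame(times.values.T, index=times.keys())

# One row labelled 'AA', with the quantiles as columns:
row = times.to_frame('AA').T
row.columns = ['pt_1', 'pt_2', 'pt_5', 'pt_7', 'pt_9']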

Related

How To Iterate Over A Timespan and Calculate some Values in a Dataframe using Python?

I have a dataset like the one below:
data = {'ReportingDate': ['2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31',
                          '2013/6/28','2013/6/28','2013/6/28','2013/6/28','2013/6/28'],
        'MarketCap': [' ', 0.35, 0.7, 0.875, 0.7, 0.35, ' ', 1, 1.5, 0.75, 1.25],
        'AUM': [3.5, 3.5, 3.5, 3.5, 3.5, 3.5, 5, 5, 5, 5, 5],
        'weight': [' ', 0.1, 0.2, 0.25, 0.2, 0.1, ' ', 0.2, 0.3, 0.15, 0.25]}

# Create DataFrame
df = pd.DataFrame(data)
df.set_index('ReportingDate', inplace=True)  # note: the column name has no space
df
This is just a sample of an 8,000-row dataset. ReportingDate runs from 2013/5/31 to 2015/10/30 and covers every month in that period, but only the last day of each month. The first line of each month has two missing values. I know that:
the sum of weight for each month is equal to 1
weight * AUM is equal to MarketCap
I can use the lines below to get the answer I want, but only for one month:
a = 1 - df.loc["2013-5"].iloc[1:]['weight'].sum()
b = a * 3.5        # this month's AUM
df.iloc[0, 0] = b  # fill the missing MarketCap
df.iloc[0, 2] = a  # fill the missing weight
How can I use a loop to get the data for the whole period? Thanks
One way, using pandas.DataFrame.groupby:
import numpy as np
import pandas as pd

# If the blanks really are whitespace strings, not NaN:
df = df.replace(r"\s+", np.nan, regex=True)
# The replace leaves object dtype behind; coerce back to numbers:
df[["MarketCap", "AUM", "weight"]] = df[["MarketCap", "AUM", "weight"]].apply(pd.to_numeric)
# If the index is not already a DatetimeIndex:
df.index = pd.to_datetime(df.index)

# Missing weight = 1 - (sum of the known weights in the same month)
s = df["weight"].fillna(1) - df.groupby(df.index.date)["weight"].transform("sum")
df["weight"] = df["weight"].fillna(s)
df["MarketCap"] = df["MarketCap"].fillna(s * df["AUM"])
Note: this assumes each month's rows all carry the same last-day date, so grouping by exact date is equivalent to grouping by year-month. If that does not hold, try:
s = df["weight"].fillna(1) - df.groupby(df.index.strftime("%Y%m"))["weight"].transform("sum")
Output:
MarketCap AUM weight
ReportingDate
2013-05-31 0.525 3.5 0.15
2013-05-31 0.350 3.5 0.10
2013-05-31 0.700 3.5 0.20
2013-05-31 0.875 3.5 0.25
2013-05-31 0.700 3.5 0.20
2013-05-31 0.350 3.5 0.10
2013-06-28 0.500 5.0 0.10
2013-06-28 1.000 5.0 0.20
2013-06-28 1.500 5.0 0.30
2013-06-28 0.750 5.0 0.15
2013-06-28 1.250 5.0 0.25
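A quick sanity check (my addition, not part of the original answer) is that every month's filled weights now sum to 1:
# Each date's weights should sum to 1 after filling (assumes df from above).
assert np.allclose(df.groupby(df.index.date)["weight"].sum(), 1.0)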

Change column names in Pandas Dataframe from a list

Is it possible to change column names using data in a list?
df = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53], [2, 3.35, 2.0, 0.2, 0.65],
                   [2, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55],
                   [1, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55]],
                  columns=["ID", "A", "B", "C", "D"]).set_index('ID')
I have my new labels as below:
New_Labels=['NaU', 'MgU', 'AlU', 'SiU']
Is it possible to change the names using the data in the above list? My original data set has 100 columns, and I do not want to do it manually for each column.
I was trying the following using df.rename but keep getting errors. Thanks!
You can use this:
df.columns = New_Labels
Using rename is the formally more correct approach. You just have to provide a dictionary that maps your current column names to the new ones, which guarantees the expected result even if the columns are out of order:
new_names = {'A': 'NaU', 'B': 'MgU', 'C': 'AlU', 'D': 'SiU'}
df.rename(index=str, columns=new_names)
Notice you can provide entries for only the names you want to substitute; the rest will remain the same.
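With 100 columns, a practical sketch is to build the mapping by zipping the existing columns with the new labels (this assumes New_Labels is ordered to match df.columns):
# Build the old -> new mapping programmatically instead of typing it out.
new_names = dict(zip(df.columns, New_Labels))
df = df.rename(columns=new_names)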
df = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53], [2, 3.35, 2.0, 0.2, 0.65],
                   [2, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55],
                   [1, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55]],
                  columns=["ID", "A", "B", "C", "D"]).set_index('ID')
New_Labels=['NaU', 'MgU', 'AlU', 'SiU']
df.columns = New_Labels
this will make df look like this:
NaU MgU AlU SiU
ID
1 1.00 2.3 0.20 0.53
2 3.35 2.0 0.20 0.65
2 3.40 2.0 0.25 0.55
3 3.40 2.0 0.25 0.55
1 3.40 2.0 0.25 0.55
3 3.40 2.0 0.25 0.55
df.columns = New_Labels
Take care that the order of the new column names matches the order of the existing columns.
The accepted rename answer is fine, but it's mainly for mapping old→new names. If we just want to replace the column names outright with a new list, there's no need to create an intermediate mapping dictionary; just use set_axis directly.
set_axis
To set a list as the columns, use set_axis along axis=1 (the default axis=0 sets the index values):
df.set_axis(New_Labels, axis=1)
# NaU MgU AlU SiU
# ID
# 1 1.00 2.3 0.20 0.53
# 2 3.35 2.0 0.20 0.65
# 2 3.40 2.0 0.25 0.55
# 3 3.40 2.0 0.25 0.55
# 1 3.40 2.0 0.25 0.55
# 3 3.40 2.0 0.25 0.55
Note that set_axis is similar to modifying df.columns directly, but set_axis allows method chaining, e.g.:
df.some_method().set_axis(New_Labels, axis=1).other_method()
Theoretically, set_axis should also provide better error checking than directly modifying an attribute, though I can't find a concrete example at the moment.

Merge Pandas Dataframe using "to_frame" without duplicates

I am merging one column from a DataFrame (df1) with another DataFrame (df2), where both have the same index. The result of this operation gives me a lot more rows than I started with (duplicates). Is there a way to avoid duplicates? Please see the example code below to replicate my issue.
df1 = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53], [2, 3.35, 2.0, 0.2, 0.65],
                    [2, 3.4, 2.0, 0.25, 0.55]],
                   columns=["Sample_ID", "NaX", "NaU", "OC", "EC"]).set_index('Sample_ID')
df2 = pd.DataFrame([[1, 0.2, 1.5, 82], [2, 3.35, 2.4, 92], [2, 3.4, 2.0, 0.25]],
                   columns=["Sample_ID", "OC", "Flow", "Diameter"]).set_index('Sample_ID')
df1 = pd.merge(df1, df2['Flow'].to_frame(), left_index=True, right_index=True)
My result (below) has two entries for sample "2" starting with 3.35 and then two more entries for "2" starting with 3.40.
What I was expecting was just two entries for "2": one starting with 3.35 and one starting with 3.40, so the total should be only three rows, whereas I now have five rows of data.
Can you see the reason for this? Thanks for your help!
NaX NaU OC EC Flow
Sample_ID
1 1.00 2.3 0.20 0.53 1.5
2 3.35 2.0 0.20 0.65 2.4
2 3.35 2.0 0.20 0.65 2.0
2 3.40 2.0 0.25 0.55 2.4
2 3.40 2.0 0.25 0.55 2.0
What you want to do is concatenate as follows:
pd.concat([df1, df2['Flow'].to_frame()], axis=1)
...which returns your desired output. The axis=1 argument lets you "glue on" extra columns.
As to why your join is returning twice as many entries for Sample_ID = 2, you can read through the docs on joins. The relevant portion is:
In SQL / standard relational algebra, if a key combination appears more than once in both tables, the resulting table will have the Cartesian product of the associated data.
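If you do need a merge rather than a concat, one hedged workaround is to align the Flow values positionally instead of on the duplicated index. A sketch starting from the original df1 and df2 above, before the merge; it assumes both frames list the samples in the same row order:
# Align by row position, sidestepping the Cartesian product on index 2.
out = df1.reset_index()
out['Flow'] = df2['Flow'].values  # assumes identical row order and length
out = out.set_index('Sample_ID')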

Splitting pandas dataframe into multiple columns by row value

I am new to python and pandas and am attempting to make aggregate plots of orientation data across response time for my research. The approach I am attempting requires grouping and splitting the data by trial; however, I have no index variable to group by in the raw data file. For context, I am working with about 300 .csv files of 10-15k rows each. Here is a snippet of the raw .csv format:
25.10 3.1 7.8 173.6 0.695646 -0.046507 0.716452 -0.024699 -0.014172 -0.712739 -0.086428 0.695940 88.4 1.8 -174.3
25.25 3.1 7.6 173.6 0.696440 -0.045587 0.715711 -0.025514 -0.013402 -0.712050 -0.085468 0.696778 88.5 1.7 -174.3
25.40 2.9 7.6 173.6 0.697160 -0.045407 0.715048 -0.024725 -0.014230 -0.711399 -0.085251 0.697454 88.6 1.7 -174.3
25.55 3.2 7.8 173.6 0.695360 -0.046466 0.716729 -0.024797 -0.014058 -0.713018 -0.086403 0.695660 88.4 1.8 -174.2
Response: S
END TRIAL 1
BEGIN TRIAL 2
0.05 126.4 126.4 0.0 -0.322306 -0.712941 -0.535465 -0.317978 -0.322306 -0.712941 -0.535465 -0.317978 105.7 -34.0 74.7
0.20 129.1 129.1 0.0 -0.311974 -0.711464 -0.555195 -0.297070 -0.311974 -0.711464 -0.555195 -0.297070 105.1 -37.2 76.3
As you can see, there is no variable to group by trial and no headers; only three rows (with a different structure) separate the trials.
I have managed to get column headers and extract the relevant variables with pandas:
df = pd.read_csv('data.csv', delim_whitespace=True,
                 names=['time', 'S1', 'S2', 'Enc', 'q1a', 'q1b', 'q1c', 'q1d',
                        'q2a', 'q2b', 'q2c', 'q2d', 'yaw', 'pitch', 'roll'],
                 header=0, usecols=['time', 'S1', 'S2'])
Which outputs this data structure:
time S1 S2
0 0.25 277.5 277.5
1 0.25 277.5 277.5
2 0.40 277.5 277.5
3 0.55 277.5 277.5
4 0.70 277.5 277.5
5 0.85 277.5 277.5
.........................
784 117.70 161.2 96.9
785 END TRIAL 1.0
786 BEGIN TRIAL 2.0
787 0.10 159.9 159.9
Closer, but now I am stuck on how to parse the individual trials into separate columns or data structures, because there is no grouping variable that signifies the trial number, and the number of rows varies for each trial. After reading the pandas documentation for the last few days and scouring similar questions raised here, I am having trouble finding the best solution.
Eventually, I'd like to have a single data structure to make some exploratory visualizations. Am I heading in the right direction, or is there a smarter approach to this?
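One possible sketch (an assumption-laden illustration, not a canonical answer): derive a trial id from the BEGIN TRIAL marker rows with a cumulative count, then drop the non-numeric marker rows. This reuses the read_csv call above but omits header=0, since the raw file has no header row:
import pandas as pd

names = ['time', 'S1', 'S2', 'Enc', 'q1a', 'q1b', 'q1c', 'q1d',
         'q2a', 'q2b', 'q2c', 'q2d', 'yaw', 'pitch', 'roll']
df = pd.read_csv('data.csv', delim_whitespace=True, names=names,
                 usecols=['time', 'S1', 'S2'])

# Marker rows parse with the literal string 'BEGIN' in the time column;
# a cumulative count of them numbers the trials (trial 1 has no marker).
df['trial'] = df['time'].eq('BEGIN').cumsum() + 1

# Keep only rows whose time field is numeric, then restore float dtypes.
df = df[pd.to_numeric(df['time'], errors='coerce').notna()].astype(float)

# A single tidy structure, ready for grouped aggregation or plotting:
df.groupby('trial')[['S1', 'S2']].describe()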

How do I apply a lambda function on pandas slices, and return the same format as the input data frame?

I want to apply a function to slices of a dataframe in pandas and get back a dataframe in the same format as the input, with every value transformed within its slice.
So, for example
df = pandas.DataFrame(numpy.round(numpy.random.normal(size=(2, 10)),2))
f = lambda x: (x - x.mean())
What I want is to apply lambda function f from column 0 to 5 and from column 5 to 10.
I did this:
a = pandas.DataFrame(f(df.T.iloc[0:5, :]))
but this is only for the first slice. How can I include the second slice, so that my resulting output frame looks exactly like the input frame, just with every data point changed to its value minus the mean of the corresponding slice?
I hope it makes sense. What would be the right way to go about this?
Thank you.
You can simply reassign the result to original df, like this:
import pandas as pd
import numpy as np

# I'd rather use a function than a lambda here -- preference, I guess
def f(x):
    return x - x.mean()

df = pd.DataFrame(np.round(np.random.normal(size=(2, 10)), 2))
df.T
0 1
0 0.92 -0.35
1 0.32 -1.37
2 0.86 -0.64
3 -0.65 -2.22
4 -1.03 0.63
5 0.68 -1.60
6 -0.80 -1.10
7 -0.69 0.05
8 -0.46 -0.74
9 0.02 1.54
# make a copy of df here
df1 = df.copy()
# reassign the transformed slices back to the copy; transposing means
# x.mean() is taken over each 5-column slice of a row
df1.iloc[:, :5] = f(df.T.iloc[0:5, :]).T
df1.iloc[:, 5:] = f(df.T.iloc[5:, :]).T
df1.T
0 1
0 0.836 0.44
1 0.236 -0.58
2 0.776 0.15
3 -0.734 -1.43
4 -1.114 1.42
5 0.930 -1.23
6 -0.550 -0.73
7 -0.440 0.42
8 -0.210 -0.37
9 0.270 1.91
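As a side note, a more idiomatic equivalent (my sketch, not from the original answer) labels each column with its block and uses groupby/transform to subtract the block means:
import numpy as np

# Column blocks: columns 0-4 -> block 0, columns 5-9 -> block 1.
blocks = np.arange(df.shape[1]) // 5
# Subtract each row's block mean; same result as the slice assignments above.
demeaned = df - df.T.groupby(blocks).transform('mean').T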
