I have a data frame like this:
df:
name score
Coby0 8
Sony1 3
Coby1 4
Sony2 6
Coby2 7
Sony3 8
Coby3 3
Sony4 2
Coby4 9
Sony5 5
Coby5 7
Sony6 2
Coby6 10
I want to keep the rows of this data frame from the start up to and including the first row whose name starts with 'Sony':
name score
Coby0 8
Sony1 3
I also want to keep the rows from the start up to and including the last row whose name starts with 'Sony':
df:
name score
Coby0 8
Sony1 3
Coby1 4
Sony2 6
Coby2 7
Sony3 8
Coby3 3
Sony4 2
Coby4 9
Sony5 5
Coby5 7
Sony6 2
Thanks in advance!
Here is the simplest way that I know to find the first and last rows that start with 'Sony'.
import pandas as pd
foo = [['Coby0',8],['Sony0',3],['Coby1',4],['Sony1',6],['Coby2',7],['Sony2',8]]
df = pd.DataFrame(foo, columns=['name','score'])
print(df.head())
first = df[df.name.str.startswith('Sony')].iloc[0]
print(first)
last = df[df.name.str.startswith('Sony')].iloc[-1]
print(last)
You can use
m = df['name'].str.startswith('Sony')
first_true_idx = m.idxmax()
df1 = df.iloc[:first_true_idx+1]
print(df1)
name score
0 Coby0 8
1 Sony1 3
last_true_idx = m[::-1].idxmax()
df2 = df.iloc[:last_true_idx+1]
print(df2)
name score
0 Coby0 8
1 Sony1 3
2 Coby1 4
3 Sony2 6
4 Coby2 7
5 Sony3 8
6 Coby3 3
7 Sony4 2
8 Coby4 9
9 Sony5 5
10 Coby5 7
11 Sony6 2
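One caveat with the idxmax approach: idxmax returns an index label, and on an all-False mask it still returns the first label, so a frame with no 'Sony' rows or a non-default index can give a misleading slice. Below is a minimal sketch that works with positions instead and returns an empty frame when nothing matches. The helper name head_through_match and the small sample frame are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Small sample frame (illustrative data only).
df = pd.DataFrame({
    "name": ["Coby0", "Sony1", "Coby1", "Sony2", "Coby2"],
    "score": [8, 3, 4, 6, 7],
})

def head_through_match(frame, prefix, which="first"):
    """Hypothetical helper: rows from the start through the first/last match."""
    mask = frame["name"].str.startswith(prefix).to_numpy()
    positions = np.flatnonzero(mask)   # integer positions of matching rows
    if positions.size == 0:
        return frame.iloc[0:0]         # no match: return an empty frame
    stop = positions[0] if which == "first" else positions[-1]
    return frame.iloc[: stop + 1]      # positional slice, independent of index labels

first_part = head_through_match(df, "Sony", "first")
last_part = head_through_match(df, "Sony", "last")
no_match = head_through_match(df, "Acme", "first")
```

Because the slice is positional, this should behave the same whether or not the frame still has its default RangeIndex.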
I have a pandas dataframe with two variables (Begin and End) for three replicates (R1, R2, R3) each of Control (C) and Treatment (T):
Begin End Expt
2 5 C_R1
2 5 C_R2
2 5 C_R3
2 5 T_R1
2 5 T_R2
2 5 T_R3
4 7 C_R2
4 7 C_R3
4 7 T_R1
4 7 T_R2
4 7 T_R3
I want to keep only those rows for which all three replicates of both control and treatment,
six in total, were observed, i.e. (Begin, End: 2, 5) and not (Begin, End: 4, 7), which has only five observations
and is missing C_R1.
I've gone through some posts here and tried the following, which works on a small sample, but I still have to test it on real data, which has around 50K rows:
my_df[my_df.groupby(["Begin", "End"])['Expt'].transform('nunique') == 6]
Please let me know if this is OK or if any better technique exists.
Thanks
df[df.groupby(['Begin', 'End'])['Expt']
.transform(lambda x: (np.unique(x.str.split('_').str[0], return_counts = True)[1] == 3).all())]
Begin End Expt
0 2 5 C_R1
1 2 5 C_R2
2 2 5 C_R3
3 2 5 T_R1
4 2 5 T_R2
5 2 5 T_R3
df1
df2 = df1[df1.groupby(['Begin','End'])['Expt'].transform('nunique') == 6]
df2
index Begin End Expt
0 2 5 C_R1
1 2 5 C_R2
2 2 5 C_R3
3 2 5 T_R1
4 2 5 T_R2
5 2 5 T_R3
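The nunique == 6 test counts distinct labels, so it would also accept a group that contains six distinct but unexpected labels. A minimal sketch of a slightly stricter variant follows, assuming the six expected labels are known up front. The names expected, valid and complete are made up for this example.

```python
import pandas as pd

# Sample data from the question.
my_df = pd.DataFrame({
    "Begin": [2] * 6 + [4] * 5,
    "End": [5] * 6 + [7] * 5,
    "Expt": ["C_R1", "C_R2", "C_R3", "T_R1", "T_R2", "T_R3",
             "C_R2", "C_R3", "T_R1", "T_R2", "T_R3"],
})

# Assumption: these are exactly the labels a complete group must contain.
expected = {"C_R1", "C_R2", "C_R3", "T_R1", "T_R2", "T_R3"}

# Drop rows with unexpected labels, then count distinct labels per group.
valid = my_df[my_df["Expt"].isin(expected)]
counts = valid.groupby(["Begin", "End"])["Expt"].transform("nunique")
complete = valid[counts == len(expected)]
print(complete)
```

Both steps are vectorized, so this should scale to the 50K rows mentioned in the question about as well as the plain nunique version.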
I'm having problems with the pandas rolling() method: it returns several outputs per row even though the function I apply returns a single value.
My objective is to:
Calculate the absolute percentage difference between two DataFrames with 3 columns in each df.
Sum all values
I can do this using iterrows(), but that becomes inefficient on larger datasets.
This is the test data I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
"column3" : [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]
}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
This method, which uses iterrows(), produces the output I want:
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        # compare df2 with the 3 rows of df1 that end at the current row
        Div = abs(((df2 / df1.iloc[index-3+1:index+1].reset_index(drop=True).values) - 1) * 100)
        Average = Div.sum(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[991.2698412698413,
636.2698412698412,
456.19047619047626,
616.6666666666667,
935.7142857142858,
627.3809523809524,
592.8571428571429,
350.8333333333333,
449.1666666666667,
1290.0,
658.531746031746,
646.031746031746,
597.4603174603175,
478.80952380952385,
383.0952380952381,
980.5555555555555,
612.5]
Finally, below is my attempt to use rolling() so that I don't need to loop through each row.
def SumOfAverageFunction(vals):
    Div = abs(((df2.values / vals.reset_index(drop=True).values) - 1) * 100)
    Average = Div.sum()
    SumOfAverages = np.sum(Average)
    return SumOfAverages
RunningSums = df1.rolling(window=3,axis=0).apply(SumOfAverageFunction)
Here is my problem: printing RunningSums outputs several values per row, and they are not close to the results I get with the iterrows method. How do I solve this?
print(RunningSums)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 702.380952 780.000000 283.333333
3 533.333333 640.000000 533.333333
4 1200.000000 475.000000 403.174603
5 833.333333 1280.000000 625.396825
6 563.333333 760.000000 1385.714286
7 346.666667 386.666667 1016.666667
8 473.333333 573.333333 447.619048
9 533.333333 1213.333333 327.619048
10 375.000000 746.666667 415.714286
11 408.333333 453.333333 515.000000
12 604.166667 338.333333 1250.000000
13 1366.666667 577.500000 775.000000
14 847.619048 1400.000000 683.333333
15 314.285714 733.333333 455.555556
16 533.333333 441.666667 474.444444
17 347.619048 616.666667 546.666667
18 735.714286 466.666667 1290.000000
19 350.000000 488.888889 875.000000
20 525.000000 1361.111111 1266.666667
It's just the way rolling behaves: apply is called on each column separately, so your function only ever sees a one-dimensional window, and I don't know of a way around that. One solution is to apply rolling to a single column and use the index of each window to slice the dataframe inside your function. Still expensive, but probably not as bad as what you're doing.
Also, the output of your first method looks wrong. You're actually starting your calculations a few rows too late: the first full 3-row window ends at index 2, but your loop only starts at index 4.
import numpy as np
def SumOfAverageFunction(vals):
    # vals is one window of column1; its index selects the same rows from all of df1
    return (abs(np.divide(df2.values, df1.loc[vals.index].values)-1)*100).sum()
vals = df1.column1.rolling(3)
vals.apply(SumOfAverageFunction, raw=False)
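If calling a Python function once per window is still too slow, the whole calculation can be done in NumPy without rolling. This is a different technique from the one in the answer: a minimal sketch using numpy.lib.stride_tricks.sliding_window_view, which assumes NumPy 1.20 or later. The names windows, diff and running are made up for this example.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Test data from the question.
df1 = pd.DataFrame({
    "column1": [7, 2, 3, 1, 3, 2, 5, 3, 2, 4, 6, 8, 1, 3, 7, 3, 7, 2, 6, 3, 8],
    "column2": [1, 5, 2, 4, 1, 5, 5, 3, 1, 5, 3, 5, 8, 1, 6, 4, 2, 3, 9, 1, 4],
    "column3": [3, 6, 3, 9, 7, 1, 2, 3, 7, 5, 4, 1, 4, 2, 9, 6, 5, 1, 4, 1, 3],
})
df2 = pd.DataFrame([[2, 3, 4], [3, 4, 1], [3, 6, 1]])

window = len(df2)
# Shape (n_windows, n_columns, window): the window axis is appended last.
windows = sliding_window_view(df1.to_numpy(), window_shape=window, axis=0)
# Reorder to (n_windows, window, n_columns) so each slice lines up with df2.
windows = windows.transpose(0, 2, 1)

# Absolute percentage difference per cell, summed over each 3x3 window.
diff = np.abs(df2.to_numpy() / windows - 1) * 100
running = pd.Series(diff.sum(axis=(1, 2)), index=df1.index[window - 1:])
print(running)
```

The result is labelled by the last row of each window, so it starts at index 2. The values from index 4 onwards correspond to the iterrows output in the question.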
I would like to create a new column (Col_Val) in my DataFrame (Global_Dataset) based on another DataFrame (List_Data).
I need faster code because my dataset has 2 million samples and List_Data contains 50000 samples.
Col_Val must contain the value of column Value that corresponds to Col_Key.
List_Data:
id Key Value
1 5 0
2 7 1
3 9 2
Global_Dataset:
id Col_Key Col_Val
1 9 2
2 5 0
3 9 2
4 7 1
5 7 1
6 5 0
7 9 2
8 7 1
9 9 2
10 5 0
I have tried this code, but it takes a long time to run. Is there a faster way to achieve my goal?
Col_Val = []
for i in range(len(List_Data)):
    for j in range(len(Global_Data)):
        if List_Data.get_value(i, "Key") == Global_Data.get_value(j, 'Col_Key'):
            Col_Val.append(List_Data.get_value(i, 'Value'))
Global_Data['Col_Val'] = Col_Val
PS: I have tried loc and iloc instead of get_value, but they are very slow.
Try this:
data_dict = {key : value for key, value in zip(List_Data['Key'], List_Data['Value'])}
Global_Data['Col_Val'] = pd.Series([data_dict[key] for key in Global_Data['Col_Key']])
I don't know how long it will take on your machine with the amount of data you need to handle, but it should be faster than what you are using now.
You could also generate the dictionary with data_dict = {row['Key'] : row['Value'] for _, row in List_Data.iterrows()}, but on my machine it is slower than what I proposed above.
It works under the assumption that all the keys in Global_Data['Col_Key'] are present in List_Data['Key']; otherwise you will get a KeyError.
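The same dictionary-style lookup can also be written with Series.map, which avoids the Python-level list comprehension and gives NaN instead of a KeyError for keys that are missing from List_Data. A minimal sketch, assuming the Key column has no duplicates. The name lookup is made up for this example.

```python
import pandas as pd

# Sample data from the question.
List_Data = pd.DataFrame({"id": [1, 2, 3], "Key": [5, 7, 9], "Value": [0, 1, 2]})
Global_Data = pd.DataFrame({
    "id": range(1, 11),
    "Col_Key": [9, 5, 9, 7, 7, 5, 9, 7, 9, 5],
})

# Series indexed by Key; map() looks up each Col_Key in this index.
# Assumption: Key values are unique, otherwise map() raises an error.
lookup = List_Data.set_index("Key")["Value"]
Global_Data["Col_Val"] = Global_Data["Col_Key"].map(lookup)
print(Global_Data)
```

Because map returns a Series aligned with Global_Data's own index, the assignment stays correct even if that index is not 0..n-1.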
There is no reason to loop through anything, either manually or with iterrows. If I understand your problem, this should be a simple merge operation.
df
Key Value
id
1 5 0
2 7 1
3 9 2
global_df
Col_Key
id
1 9
2 5
3 9
4 7
5 7
6 5
7 9
8 7
9 9
10 5
global_df.reset_index()\
.merge(df, left_on='Col_Key', right_on='Key')\
.drop('Key', axis=1)\
.set_index('id')\
.sort_index()
Col_Key Value
id
1 9 2
2 5 0
3 9 2
4 7 1
5 7 1
6 5 0
7 9 2
8 7 1
9 9 2
10 5 0
Note that the essence of this is the global_df.merge(...), but the extra operations are to keep the original indexing and remove unwanted extra columns. I encourage you to try each step individually to see the results.
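If the only goal is to attach Value while keeping global_df's index and row order, DataFrame.join with the on parameter is a possible shortcut for the reset_index / merge / set_index / sort_index chain. A minimal sketch, rebuilding the two frames shown above. The name result is made up for this example.

```python
import pandas as pd

# The two frames shown above, both indexed by id.
df = pd.DataFrame(
    {"Key": [5, 7, 9], "Value": [0, 1, 2]},
    index=pd.Index([1, 2, 3], name="id"),
)
global_df = pd.DataFrame(
    {"Col_Key": [9, 5, 9, 7, 7, 5, 9, 7, 9, 5]},
    index=pd.Index(range(1, 11), name="id"),
)

# join() matches global_df["Col_Key"] against the index of the other object
# (here: Value indexed by Key) and keeps global_df's own index and order.
result = global_df.join(df.set_index("Key")["Value"], on="Col_Key")
print(result)
```

join defaults to a left join, so rows of global_df whose key is missing from df would be kept with NaN in Value rather than dropped.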
My first data frame has various columns, one of which is an ID column, and my second data frame has various columns, one of which is a No column, so I have found the link between the two. How can I link them using that number, so that the postcode information from data frame 2 is assigned to the correct practice in data frame 1?
Any help would be greatly appreciated!!!
Data frame 1
ID place Items Cost
0 5 10 2001.00
1 12 2 20.98
2 2 4 100.80
3 7 7 199.60
Data frame 2
ID No Dr Postcode
0 1 Dr.K BT94 7HX
1 5 Dr.H BT7 4MC
2 3 Dr.Love BT9 1HE
3 7 Dr.Kerr BT72 4TX
I want to create a new column 'Postcode' in Data frame 1 and assign the postcode to the correct Practice
ID Place Items Cost Postcode
0 5 10 BT7 4MC
1 2 3 BT9 1HE
2 22 8 BT62 4TU
3 7 7 BT72 4TX
How can I do this??
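IIUC, I think what you are looking for is the 'left_on' and 'right_on' parameters in merge. Note that merge does an inner join by default, so only the practices present in both frames are kept: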
df1.merge(df2, left_on='Practice', right_on='Prac No')
Output:
ID_x Practice Items Cost ID_y Prac No Dr Postcode
0 0 5 10 2001.0 1 5 Dr.H BT7 4MC
1 3 7 7 199.6 3 7 Dr.Kerr BT72 4TX
Or another way is to use set_index and map:
df1['Postcode'] = df1['Practice'].map(df2.set_index('Prac No')['Postcode'])
df1
Output:
ID Practice Items Cost Postcode
0 0 5 10 2001.00 BT7 4MC
1 1 12 2 20.98 NaN
2 2 2 4 100.80 NaN
3 3 7 7 199.60 BT72 4TX
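The two ideas above can also be combined: a left merge against only the needed columns of df2 keeps every row of df1 and avoids the ID_x / ID_y columns. A minimal sketch, assuming the column names 'Practice' and 'Prac No' used in this answer. The name merged is made up for this example.

```python
import pandas as pd

# Frames rebuilt from the tables above, with the column names used in this answer.
df1 = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Practice": [5, 12, 2, 7],
    "Items": [10, 2, 4, 7],
    "Cost": [2001.00, 20.98, 100.80, 199.60],
})
df2 = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "Prac No": [1, 5, 3, 7],
    "Dr": ["Dr.K", "Dr.H", "Dr.Love", "Dr.Kerr"],
    "Postcode": ["BT94 7HX", "BT7 4MC", "BT9 1HE", "BT72 4TX"],
})

# how="left" keeps all rows of df1; unmatched practices get NaN as Postcode.
merged = (
    df1.merge(df2[["Prac No", "Postcode"]], how="left",
              left_on="Practice", right_on="Prac No")
       .drop(columns="Prac No")
)
print(merged)
```

If 'Prac No' contained duplicates, the left merge would repeat the matching df1 rows, so it may be worth checking that column for uniqueness first.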