I want to sort a subset of a dataframe (say, between indexes i and j) according to some value. I tried
df2=df.iloc[i:j].sort_values(by=...)
df.iloc[i:j]=df2
No problem with the first line, but nothing happens when I run the second one (not even an error). How should I do this? (I also tried the update function, but that didn't work either.)
I believe you need to assign back to the filtered DataFrame after converting to a NumPy array with .values, to avoid aligning on the index:
df = pd.DataFrame({'A': [1,2,3,4,3,2,1,4,1,2]})
print (df)
A
0 1
1 2
2 3
3 4
4 3
5 2
6 1
7 4
8 1
9 2
i = 2
j = 7
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A').values
print (df)
A
0 1
1 2
2 1
3 2
4 3
5 3
6 4
7 4
8 1
9 2
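The reason the plain assignment appears to do nothing is that pandas aligns on the index: the sorted slice keeps its original labels, so each value is written back to the row it came from. Stripping the index makes the assignment purely positional. A minimal sketch of the same idea, assuming pandas 0.24+ where .to_numpy() is available:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 3, 2, 1, 4, 1, 2]})
i, j = 2, 7

# .to_numpy() drops the index, so the sorted values are written back by position
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A').to_numpy()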
I need to take the lowest value over n rows and add it to these n rows in a new column of the dataframe. For example:
n=3
Column 1 Column 2
5 3
3 3
4 3
7 2
8 2
2 2
5 4
4 4
9 4
8 2
2 2
3 2
5 2
Please note that if the number of rows is not divisible by n, the remaining values are incorporated into the last group. So in this example n=4 for the end of the dataframe.
Thanking you in advance!
I do not know any straightforward way to do this, but here is a working example (not elegant, but working...).
If you do not worry about the number of rows being divisible by n, you could use .groupby():
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
df['new_col']=df.groupby(df.index // n).transform('min')
which yields:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 4
7 6 4
8 4 4
9 1 1
10 2 1
However, we can see that the last 2 rows are grouped together, instead of being grouped with the 3 previous values in this case.
A way around this is to look at the .count() of elements in each group generated by groupby and check the last one:
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
# Temporary dataframe
A = df.groupby(df.index // n).transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# if the last size is not equal to n
if last_batch.values[0][0] != n:
    last_group = last_batch + n
    A[-last_group.values[0][0]:] = min_df[-2:].min()
# Assign the temporary modified dataframe to df
df['new_col'] = A
which yields the expected result:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 1
7 6 1
8 4 1
9 1 1
10 2 1
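A more compact alternative (a sketch, not part of the answer above) is to build the group labels up front and fold a trailing short group into the previous one before grouping:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]})
n = 3

# Label rows in blocks of n, then merge a trailing short block into the previous block
groups = np.arange(len(df)) // n
if len(df) % n != 0:
    last = groups.max()
    groups[groups == last] = last - 1

df['new_col'] = df.groupby(groups)['col1'].transform('min')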
I have a pandas dataframe of two variables (Begin and End) for three replicates (R1, R2, R3) each of Control (C) and Treatment (T):
Begin End Expt
2 5 C_R1
2 5 C_R2
2 5 C_R3
2 5 T_R1
2 5 T_R2
2 5 T_R3
4 7 C_R2
4 7 C_R3
4 7 T_R1
4 7 T_R2
4 7 T_R3
I want to pick out only those rows for which all three replicates of both control and treatment (six in total) were observed, i.e. (Begin, End: 2, 5) but not (Begin, End: 4, 7), as the latter has only five observations, missing C_R1.
I've gone through some posts here and tried the following, which works for a small sample set, but I have to test it on real data with around 50K rows:
my_df[my_df.groupby(["Begin", "End"])['Expt'].transform('nunique') == 6]
Please let me know if this is OK or if any better technique exists.
Thanks
import numpy as np

df[df.groupby(['Begin', 'End'])['Expt']
   .transform(lambda x: (np.unique(x.str.split('_').str[0], return_counts=True)[1] == 3).all())]
Begin End Expt
0 2 5 C_R1
1 2 5 C_R2
2 2 5 C_R3
3 2 5 T_R1
4 2 5 T_R2
5 2 5 T_R3
df1
df2 = df1[df1.groupby(['Begin','End'])['Expt'].transform('nunique') == 6]
df2
   Begin  End  Expt
0      2    5  C_R1
1      2    5  C_R2
2      2    5  C_R3
3      2    5  T_R1
4      2    5  T_R2
5      2    5  T_R3
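Your transform('nunique') == 6 approach is vectorised, so 50K rows should not be a problem. If you want to be strict about which six labels must be present (a sketch, assuming the replicate names are fixed and the frame is the question's my_df), you can compare against an explicit set with groupby().filter():
expected = {'C_R1', 'C_R2', 'C_R3', 'T_R1', 'T_R2', 'T_R3'}

# Keep only the (Begin, End) groups that contain exactly the six expected labels
result = my_df.groupby(['Begin', 'End']).filter(lambda g: set(g['Expt']) == expected)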
I have dataframe with many columns. There is a datetime column, and there are duplicated entries for the datetime with data for those duplicates coming from different sources. I would like to drop the duplicates based on column "dt", but I want to keep the result based on what is in column "pref". I have provided simplified data below, but the reason for this is that I also have a value column, and the "Pref" column is the data source. I prefer certain data sources, but I only need one entry per date (column "dt"). I would like this code to work so that I don't have to provide a complete list of preferences either.
Artificial Data Code
import pandas as pd
import numpy as np
df=pd.DataFrame({'dt':[1,1,1,2,2,3,3,4,4,5],
"Pref":[1,2,3,2,3,1,3,1,2,3],
"Value":np.random.normal(size=10),
"String_col":['A']*10})
df
Out[1]:
dt Pref Value String_col
0 1 1 -0.479593 A
1 1 2 0.553963 A
2 1 3 0.194266 A
3 2 2 0.598814 A
4 2 3 -0.909138 A
5 3 1 -0.297539 A
6 3 3 -1.100855 A
7 4 1 0.747354 A
8 4 2 1.002964 A
9 5 3 0.301373 A
Desired Output 1 (CASE 1):
In this case my preference list matters all the way down. I prefer data source 2 the most, followed by 1, but will take 3 if that is all I have.
preference_list=[2,1,3]
Out[2]:
dt Pref Value String_col
1 1 2 0.553963 A
3 2 2 0.598814 A
5 3 1 -0.297539 A
8 4 2 1.002964 A
9 5 3 0.301373 A
Desired Output 2 (CASE 2)
In this case I just want to look for data source 1. If it is not present I don't actually care what the other data source is.
preference_list2=[1]
Out[3]:
dt Pref Value String_col
0 1 1 -0.479593 A
3 2 2 0.598814 A
5 3 1 -0.297539 A
7 4 1 0.747354 A
9 5 3 0.301373 A
I can imagine doing this in a really slow and complicated loop, but I feel like there should be a command to accomplish this. Another important thing: I need to keep some other text columns in the data frame, so .agg may cause issues for that metadata. I have experimented with sorting and with the keep argument of drop_duplicates, but without success.
You are actually looking for sorting by category, which can be done by pd.Categorical:
df["Pref"] = pd.Categorical(df["Pref"], categories=preference_list, ordered=True)
print (df.sort_values(["dt","Pref"]).drop_duplicates("dt"))
dt Pref Value String_col
1 1 2 -1.004362 A
3 2 2 -1.316961 A
5 3 1 0.513618 A
8 4 2 -1.859514 A
9 5 3 1.199374 A
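For CASE 2, where the preference list does not cover every source, one option (a sketch, not part of the answer above, starting again from the original df) is to map each source to its rank in the list and let unlisted sources sort last, which also keeps the original Pref values intact:
preference_list2 = [1]

# Rank each source by its position in the list; unlisted sources get the lowest priority
rank = {p: i for i, p in enumerate(preference_list2)}
order = df['Pref'].map(rank).fillna(len(preference_list2))

out = (df.assign(_rank=order)
         .sort_values(['dt', '_rank'])
         .drop_duplicates('dt')
         .drop(columns='_rank'))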
Here is a very efficient and simple solution, I hope it helps!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.DataFrame({'dt':[1,1,1,2,2,3,3,4,4,5],
"Pref":[1,2,3,2,3,1,3,1,2,3],
"Value":np.random.normal(size=10),
"String_col":['A']*10})
preference_list = [2,3]
df_clean = df[df['Pref'].isin(preference_list)]
print(df)
print(df_clean)
Output:
dt Pref Value String_col
0 1 1 1.404505 A
1 1 2 0.840923 A
2 1 3 -1.509667 A
3 2 2 -1.431240 A
4 2 3 -0.576142 A
5 3 1 -1.208514 A
6 3 3 -0.456773 A
7 4 1 0.574463 A
8 4 2 -1.682750 A
9 5 3 0.719394 A
dt Pref Value String_col
1 1 2 0.840923 A
2 1 3 -1.509667 A
3 2 2 -1.431240 A
4 2 3 -0.576142 A
6 3 3 -0.456773 A
8 4 2 -1.682750 A
9 5 3 0.719394 A
I would like to create a new column (Col_Val) on my DataFrame (Global_Dataset) based on the other DataFrame (List_Data).
I need faster code because my dataset has 2 million samples and List_Data contains 50,000 samples.
Col_Val must contain the value of column Value matched according to Col_Key.
List_Data:
id Key Value
1 5 0
2 7 1
3 9 2
Global_Dataset:
id Col_Key Col_Val
1 9 2
2 5 0
3 9 2
4 7 1
5 7 1
6 5 0
7 9 2
8 7 1
9 9 2
10 5 0
I have tried this code, but it takes a long time to execute. Is there any other faster way to achieve my goal?
Col_Val = []
for i in range(len(List_Data)):
    for j in range(len(Global_Data)):
        if List_Data.get_value(i, "Key") == Global_Data.get_value(j, 'Col_Key'):
            Col_Val.append(List_Data.get_value(i, 'Value'))

Global_Data['Col_Val'] = Col_Val
PS: I have tried loc and iloc instead of get_value, but they are also very slow.
Try this:
data_dict = {key : value for key, value in zip(List_Data['Key'], List_Data['Value'])}
Global_Data['Col_Val'] = pd.Series([data_dict[key] for key in Global_Data['Col_Key']])
I don't know how long it will take on your machine with the amount of data you need to handle, but it should be faster than what you are using now.
You could also generate the dictionary with data_dict = {row['Key'] : row['Value'] for _, row in List_Data.iterrows()}, but on my machine it is slower than what I proposed above.
It works under the assumption that all the keys in Global_Data['Col_Key'] are present in List_Data['Key']; otherwise you will get a KeyError.
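An equivalent approach (a sketch, using pandas' own lookup machinery instead of a Python comprehension) is to build a Series indexed by Key and map it onto Col_Key; keys missing from List_Data then become NaN instead of raising a KeyError:
# Series indexed by Key, then a vectorised lookup on Col_Key
mapping = List_Data.set_index('Key')['Value']
Global_Data['Col_Val'] = Global_Data['Col_Key'].map(mapping)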
There is no reason to loop through anything, either manually or with iterrows. If I understand your problem, this should be a simple merge operation.
df
Key Value
id
1 5 0
2 7 1
3 9 2
global_df
Col_Key
id
1 9
2 5
3 9
4 7
5 7
6 5
7 9
8 7
9 9
10 5
global_df.reset_index()\
.merge(df, left_on='Col_Key', right_on='Key')\
.drop('Key', axis=1)\
.set_index('id')\
.sort_index()
Col_Key Value
id
1 9 2
2 5 0
3 9 2
4 7 1
5 7 1
6 5 0
7 9 2
8 7 1
9 9 2
10 5 0
Note that the essence of this is the global_df.merge(...), but the extra operations are to keep the original indexing and remove unwanted extra columns. I encourage you to try each step individually to see the results.
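One caveat: merge defaults to an inner join, so any Col_Key value missing from df would silently drop that row of global_df. If that can happen, how='left' (a small variation on the chain above) keeps such rows with NaN in Value:
result = (global_df.reset_index()
          .merge(df, left_on='Col_Key', right_on='Key', how='left')
          .drop('Key', axis=1)
          .set_index('id')
          .sort_index())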
I realize this question is similar to join or merge with overwrite in pandas, but the accepted answer does not work for me since I want to use the on='keys' from df.join().
I have a DataFrame df which looks like this:
keys values
0 0 0.088344
1 0 0.088344
2 0 0.088344
3 0 0.088344
4 0 0.088344
5 1 0.560857
6 1 0.560857
7 1 0.560857
8 2 0.978736
9 2 0.978736
10 2 0.978736
11 2 0.978736
12 2 0.978736
13 2 0.978736
14 2 0.978736
Then I have a Series s (which is a result from some df.groupby().apply()) with the same keys:
keys
0 0.183328
1 0.239322
2 0.574962
Name: new_values, dtype: float64
Basically I want to replace the 'values' in the df with the values in the Series, by keys so every keys block gets the same new value. Currently, I do it as follows:
df = df.join(s, on='keys')
df['values'] = df['new_values']
df = df.drop('new_values', axis=1)
The obtained (and desired) result is then:
keys values
0 0 0.183328
1 0 0.183328
2 0 0.183328
3 0 0.183328
4 0 0.183328
5 1 0.239322
6 1 0.239322
7 1 0.239322
8 2 0.574962
9 2 0.574962
10 2 0.574962
11 2 0.574962
12 2 0.574962
13 2 0.574962
14 2 0.574962
That is, I add it as a new column, and by using on='keys' it gets the correct shape. Then I assign values to be new_values and remove the new_values column. This of course works perfectly; the only problem is that I find it extremely ugly.
Is there a better way to do this?
You could try something like:
df = df[df.columns[df.columns!='values']].join(s, on='keys')
Make sure s is named 'values' instead of 'new_values'.
To my knowledge, pandas doesn't have the ability to join with "force overwrite" or "overwrite with warning".
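A shorter alternative (a sketch, assuming s is indexed by the key values as in the question) is to skip the join entirely and overwrite the column by mapping the Series onto the keys:
# Look each key up in s and overwrite 'values' in place
df['values'] = df['keys'].map(s)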