I have a df that looks like this:
df
time score
83623 4
83624 3
83625 3
83629 2
83633 1
I want to explode df.time so that the missing values are filled in, incrementing by 1, and the df.score value is duplicated for each added row. See the example below:
time score
83623 4
83624 3
83625 3
83626 3
83627 3
83628 3
83629 2
83630 2
83631 2
83632 2
83633 1
From your sample, I assume df.time is an integer column. You may try it this way:
df_final = df.set_index('time').reindex(range(df.time.min(), df.time.max()+1),
method='pad').reset_index()
Out[89]:
time score
0 83623 4
1 83624 3
2 83625 3
3 83626 3
4 83627 3
5 83628 3
6 83629 2
7 83630 2
8 83631 2
9 83632 2
10 83633 1
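An equivalent route, if you prefer building the full range explicitly (a sketch, not taken from the answer above), is a left merge followed by a forward fill:
import pandas as pd
df = pd.DataFrame({'time': [83623, 83624, 83625, 83629, 83633],
                   'score': [4, 3, 3, 2, 1]})
# Attach the known scores to the complete range, then forward-fill the gaps.
full = pd.DataFrame({'time': range(df.time.min(), df.time.max() + 1)})
df_final = full.merge(df, on='time', how='left')
df_final['score'] = df_final['score'].ffill().astype(int)  # NaN gaps take the previous score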
I need to take the lowest value over n rows and add it to these n rows in a new column of the dataframe. For example:
n=3
Column 1 Column 2
5 3
3 3
4 3
7 2
8 2
2 2
5 4
4 4
9 4
8 2
2 2
3 2
5 2
Please note that if the number of rows is not divisible by n, the leftover values are incorporated into the last group; in this example the final group therefore has n=4 rows.
Thank you in advance!
I do not know any straightforward way to do this, but here is a working example (not elegant, but working...).
If you do not worry about the number of rows being divisible by n, you can use .groupby():
import pandas as pd
d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
df['new_col'] = df.groupby(df.index // n).transform('min')
which yields:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 4
7 6 4
8 4 4
9 1 1
10 2 1
However, we can see that the last 2 rows are grouped together instead of being grouped with the 3 previous values.
A way around this is to look at the .count() of elements in each group generated by groupby and check the last one:
import pandas as pd
d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
# Temporary dataframe holding the per-group minima broadcast to each row
A = df.groupby(df.index // n).transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# If the last group is smaller than n, merge it with the previous group
if last_batch.values[0][0] != n:
    last_group = last_batch + n
    A[-last_group.values[0][0]:] = min_df[-2:].min()
# Assign the temporary modified dataframe to df
df['new_col'] = A
which yields the expected result:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 1
7 6 1
8 4 1
9 1 1
10 2 1
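For what it's worth, a shorter way to fold the remainder into the last full group (a sketch, not the code above) is to clamp the integer group index before grouping:
import pandas as pd
import numpy as np
d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
# Clamp the group index so leftover rows join the last complete group.
groups = np.minimum(df.index // n, len(df) // n - 1)
df['new_col'] = df.groupby(groups)['col1'].transform('min')
This produces the same new_col as the longer version above.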
I have a pandas dataframe of two variables (Begin and End) for three replicates (R1, R2, R3) each of Control (C) and Treatment (T):
Begin End Expt
2 5 C_R1
2 5 C_R2
2 5 C_R3
2 5 T_R1
2 5 T_R2
2 5 T_R3
4 7 C_R2
4 7 C_R3
4 7 T_R1
4 7 T_R2
4 7 T_R3
I want to pick only those rows for which all three replicates of both control and treatment (six in total) were observed, i.e. (Begin, End: 2, 5) but not (Begin, End: 4, 7), since the latter has only five observations, missing C_R1.
I've gone through some posts here and tried the following, which works for a small sample, but I have to test it on real data with around 50K rows:
my_df[my_df.groupby(["Begin", "End"])['Expt'].transform('nunique') == 6]
Please let me know if this is OK or if any better technique exists.
Thanks
import numpy as np
df[df.groupby(['Begin', 'End'])['Expt']
   .transform(lambda x: (np.unique(x.str.split('_').str[0], return_counts=True)[1] == 3).all())]
Begin End Expt
0 2 5 C_R1
1 2 5 C_R2
2 2 5 C_R3
3 2 5 T_R1
4 2 5 T_R2
5 2 5 T_R3
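If each Expt label can appear at most once per (Begin, End) pair, a plain size count should be a bit cheaper than nunique at 50K rows (a sketch under that assumption):
df[df.groupby(['Begin', 'End'])['Expt'].transform('size') == 6]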
df1
df2 = df1[df1.groupby(['Begin','End'])['Expt'].transform('nunique') == 6]
df2
   Begin  End  Expt
0      2    5  C_R1
1      2    5  C_R2
2      2    5  C_R3
3      2    5  T_R1
4      2    5  T_R2
5      2    5  T_R3
I have a dataframe with many columns. There is a datetime column with duplicated entries, and the data for those duplicates comes from different sources. I would like to drop the duplicates based on column "dt", keeping the row chosen by column "Pref". I have provided simplified data below, but the background is that I also have a value column, and the "Pref" column is the data source. I prefer certain data sources, but I only need one entry per date (column "dt"). I would also like the code to work without me having to provide a complete list of preferences.
Artificial Data Code
import pandas as pd
import numpy as np
df=pd.DataFrame({'dt':[1,1,1,2,2,3,3,4,4,5],
"Pref":[1,2,3,2,3,1,3,1,2,3],
"Value":np.random.normal(size=10),
"String_col":['A']*10})
df
Out[1]:
dt Pref Value String_col
0 1 1 -0.479593 A
1 1 2 0.553963 A
2 1 3 0.194266 A
3 2 2 0.598814 A
4 2 3 -0.909138 A
5 3 1 -0.297539 A
6 3 3 -1.100855 A
7 4 1 0.747354 A
8 4 2 1.002964 A
9 5 3 0.301373 A
Desired Output 1 (CASE 1):
In this case my preference list matters all the way down. I prefer data source 2 the most, followed by 1, but will take 3 if that is all I have.
preference_list=[2,1,3]
Out[2]:
dt Pref Value String_col
1 1 2 0.553963 A
3 2 2 0.598814 A
5 3 1 -0.297539 A
8 4 2 1.002964 A
9 5 3 0.301373 A
Desired Output 2 (CASE 2):
In this case I just look for data source 1. If it is not present, I don't actually care which other data source is used.
preference_list2=[1]
Out[3]:
dt Pref Value String_col
0 1 1 -0.479593 A
3 2 2 0.598814 A
5 3 1 -0.297539 A
7 4 1 0.747354 A
9 5 3 0.301373 A
I can imagine doing this in a really slow and complicated loop, but I feel like there should be a command to accomplish it. Another important point: I need to keep some other text columns in the dataframe, so .agg may cause issues for that metadata. I have experimented with sorting and with the keep argument of drop_duplicates, but with no success.
You are actually looking for sorting by category, which can be done by pd.Categorical:
df["Pref"] = pd.Categorical(df["Pref"], categories=preference_list, ordered=True)
print (df.sort_values(["dt","Pref"]).drop_duplicates("dt"))
dt Pref Value String_col
1 1 2 -1.004362 A
3 2 2 -1.316961 A
5 3 1 0.513618 A
8 4 2 -1.859514 A
9 5 3 1.199374 A
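For CASE 2 the categorical trick needs a tweak, since values missing from preference_list become NaN and the source label is lost. One alternative (a sketch; _rank is a temporary helper column I am introducing, and it assumes the original integer Pref column): rank each source by its position in the list and give unknown sources a shared lowest priority.
preference_list2 = [1]
# Sources in the list get their list position; everything else ties for last place.
rank = df['Pref'].map({p: i for i, p in enumerate(preference_list2)})
out = (df.assign(_rank=rank.fillna(len(preference_list2)))
         .sort_values(['dt', '_rank'], kind='stable')
         .drop_duplicates('dt')
         .drop(columns='_rank'))
This reproduces Desired Output 2: the stable sort keeps the original order among tied sources, so drop_duplicates picks the first of them.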
Here is a simple and efficient solution, I hope it helps!
import pandas as pd
import numpy as np
df=pd.DataFrame({'dt':[1,1,1,2,2,3,3,4,4,5],
"Pref":[1,2,3,2,3,1,3,1,2,3],
"Value":np.random.normal(size=10),
"String_col":['A']*10})
preference_list = [2,3]
df_clean = df[df['Pref'].isin(preference_list)]
print(df)
print(df_clean)
Output:
dt Pref Value String_col
0 1 1 1.404505 A
1 1 2 0.840923 A
2 1 3 -1.509667 A
3 2 2 -1.431240 A
4 2 3 -0.576142 A
5 3 1 -1.208514 A
6 3 3 -0.456773 A
7 4 1 0.574463 A
8 4 2 -1.682750 A
9 5 3 0.719394 A
dt Pref Value String_col
1 1 2 0.840923 A
2 1 3 -1.509667 A
3 2 2 -1.431240 A
4 2 3 -0.576142 A
6 3 3 -0.456773 A
8 4 2 -1.682750 A
9 5 3 0.719394 A
I have a dataframe as follows:
d = {'item': [1, 2, 3, 4, 5, 6], 'time': [1297468800, 1297468809, 1297468801, 1297468890, 1297468820, 1297468805]}
df = pd.DataFrame(data=d)
The output of df is as follows:
item time
0 1 1297468800
1 2 1297468809
2 3 1297468801
3 4 1297468890
4 5 1297468820
5 6 1297468805
The time here is Unix epoch time. My goal is to replace the time column in the dataframe. Given
mintime = 1297468800
maxtime = 1297468890
I want to split this time range into 10 intervals (adjustable via a parameter, e.g. 20) and recode the time column in df, like so:
item time
0 1 1
1 2 1
2 3 1
3 4 9
4 5 3
5 6 1
What is the most efficient way to do this, since I have billions of records? Thanks
You can use pd.cut with np.linspace to specify the bins. This encodes your column categorically, from which you can then extract the codes in order:
# note: np.linspace's third argument counts edges, so 10 points yield 9 intervals
bins = np.linspace(df.time.min() - 1, df.time.max(), 10)
df['time'] = pd.cut(df.time, bins=bins, right=True).cat.codes + 1
df
item time
0 1 1
1 2 1
2 3 1
3 4 9
4 5 3
5 6 1
Alternatively, depending on how you treat the interval edges, you could also do
bins = np.linspace(df.time.min(), df.time.max() + 1, 10)
pd.cut(df.time, bins=bins, right=False).cat.codes + 1
0 1
1 1
2 1
3 9
4 2
5 1
dtype: int8
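At billions of rows you could also skip pd.cut entirely and compute the bin index with plain integer arithmetic (a sketch; k = 9 matches the 9 intervals that the 10-point linspace above actually creates):
import numpy as np
k = 9  # number of equal-width intervals
lo, hi = df.time.min(), df.time.max()
# Vectorised arithmetic; clamp so the maximum value lands in the last bin.
df['time'] = (np.minimum((df.time - lo) * k // (hi - lo), k - 1) + 1).astype('int8')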
I want to sort a subset of a dataframe (say, between indexes i and j) according to some value. I tried
df2 = df.iloc[i:j].sort_values(by=...)
df.iloc[i:j] = df2
No problem with the first line, but nothing happens when I run the second one (not even an error). How should I do this? (I also tried the update function, but it didn't work either.)
I believe you need to assign back to the filtered DataFrame after converting to a NumPy array with .values, to avoid index alignment:
df = pd.DataFrame({'A': [1,2,3,4,3,2,1,4,1,2]})
print (df)
A
0 1
1 2
2 3
3 4
4 3
5 2
6 1
7 4
8 1
9 2
i = 2
j = 7
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A').values
print (df)
A
0 1
1 2
2 1
3 2
4 3
5 3
6 4
7 4
8 1
9 2
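The reason the plain assignment appeared to do nothing: pandas aligns assignments on the index, so each sorted row is written straight back to its original slot. Stripping the index (with .values above, or .to_numpy() in recent pandas) makes the assignment positional. A minimal before/after sketch:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 3, 2, 1, 4, 1, 2]})
i, j = 2, 7
df2 = df.iloc[i:j].sort_values(by='A')
df.iloc[i:j] = df2             # aligned on the index: no visible change
df.iloc[i:j] = df2.to_numpy()  # positional: the sorted order sticks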