How to explode aggregated pandas column - python

I have a df that looks like this:
df
time score
83623 4
83624 3
83625 3
83629 2
83633 1
I want to explode df.time so that it increments by 1 between consecutive rows, and the df.score value is duplicated for each added row. See the example below:
time score
83623 4
83624 3
83625 3
83626 3
83627 3
83628 3
83629 2
83630 2
83631 2
83632 2
83633 1

From your sample, I assume df.time is an integer column. You can try this:
df_final = df.set_index('time').reindex(range(df.time.min(), df.time.max() + 1),
                                        method='pad').reset_index()
Out[89]:
time score
0 83623 4
1 83624 3
2 83625 3
3 83626 3
4 83627 3
5 83628 3
6 83629 2
7 83630 2
8 83631 2
9 83632 2
10 83633 1
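
For context, method='pad' is reindex's forward-fill: each newly created time takes the score of the last observed row. A self-contained sketch of the same idea (the sample frame is rebuilt here so the snippet runs on its own):
import pandas as pd

# Rebuild the sample frame so the snippet runs standalone
df = pd.DataFrame({'time': [83623, 83624, 83625, 83629, 83633],
                   'score': [4, 3, 3, 2, 1]})

# Reindex onto the full integer range; 'pad' forward-fills the score
# of each missing time from the last observed row
full_range = range(df['time'].min(), df['time'].max() + 1)
df_final = df.set_index('time').reindex(full_range, method='pad').reset_index()
print(df_final)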

Related

Allocate lowest value over n rows to n rows in DataFrame

I need to take the lowest value over n rows and add it to these n rows in a new column of the dataframe. For example:
n=3
Column 1 Column 2
5 3
3 3
4 3
7 2
8 2
2 2
5 4
4 4
9 4
8 2
2 2
3 2
5 2
Please note that if the number of rows is not divisible by n, the leftover values are incorporated into the last group, so in this example the last group at the end of the dataframe effectively has n=4.
Thanking you in advance!
I do not know a straightforward way to do this, but here is a working example (not elegant, but working...).
If you do not worry about the number of rows being divisible by n, you can use .groupby():
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
# Integer-divide the row position by n to label each block of n rows
df['new_col'] = df.groupby(df.index // n).transform('min')
which yields:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 4
7 6 4
8 4 4
9 1 1
10 2 1
However, we can see that the last 2 rows are grouped together on their own, instead of being merged with the previous 3 values in this case.
A way around this is to look at the .count() of elements in each group generated by groupby, and check the last one:
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3
# Temporary dataframe with each group's minimum broadcast to its rows
A = df.groupby(df.index // n).transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# If the last group is smaller than n, merge it with the previous group
if last_batch.values[0][0] != n:
    last_group = last_batch + n
    A[-last_group.values[0][0]:] = min_df[-2:].min()
# Assign the temporary modified dataframe to df
df['new_col'] = A
which yields the expected result:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 1
7 6 1
8 4 1
9 1 1
10 2 1
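
A more compact alternative (my own sketch, not from the original answer): clamp the block labels so any trailing remainder joins the last full group, then take a single transform('min'):
import numpy as np
import pandas as pd

d = {'col1': [1, 2, 1, 5, 3, 2, 5, 6, 4, 1, 2]}
df = pd.DataFrame(data=d)
n = 3

# Label blocks of n rows, then fold the short trailing block (if any)
# into the previous label; assumes len(df) >= n
groups = df.index // n
if len(df) % n:
    groups = np.minimum(groups, len(df) // n - 1)
df['new_col'] = df.groupby(groups)['col1'].transform('min')
print(df)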

Selecting pandas dataframe rows with same pair of two column values and different on third for certain number of counts

I have a pandas dataframe of two variables (Begin and End) for three replicates (R1, R2, R3) each of Control (C) and Treatment (T):
Begin End Expt
2 5 C_R1
2 5 C_R2
2 5 C_R3
2 5 T_R1
2 5 T_R2
2 5 T_R3
4 7 C_R2
4 7 C_R3
4 7 T_R1
4 7 T_R2
4 7 T_R3
I want to pick only those rows for which all three replicates of both control and treatment, six in total, were observed, i.e. (Begin, End) = (2, 5) but not (4, 7), since the latter has only five observations, missing C_R1.
I've gone through some posts here and tried the following, which works for a small sample, but I have to test it against real data with around 50K rows:
my_df[my_df.groupby(["Begin", "End"])['Expt'].transform('nunique') == 6]
Please let me know if this is OK or if a better technique exists.
Thanks
import numpy as np

df[df.groupby(['Begin', 'End'])['Expt']
   .transform(lambda x: (np.unique(x.str.split('_').str[0], return_counts=True)[1] == 3).all())]
Begin End Expt
0 2 5 C_R1
1 2 5 C_R2
2 2 5 C_R3
3 2 5 T_R1
4 2 5 T_R2
5 2 5 T_R3
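
The lambda above runs Python-level code once per group; for ~50K rows, a fully vectorized variant (my own sketch, not part of the original answer) can count the C/T prefixes with pd.crosstab and keep only pairs with exactly three of each:
import pandas as pd

# Rebuilt sample for a standalone run
df = pd.DataFrame({'Begin': [2]*6 + [4]*5,
                   'End':   [5]*6 + [7]*5,
                   'Expt':  ['C_R1', 'C_R2', 'C_R3', 'T_R1', 'T_R2', 'T_R3',
                             'C_R2', 'C_R3', 'T_R1', 'T_R2', 'T_R3']})

# Count replicates per (Begin, End) pair and source prefix (C or T);
# crosstab fills 0 where a prefix is missing entirely
ct = pd.crosstab([df['Begin'], df['End']], df['Expt'].str.split('_').str[0])
ok_pairs = ct[(ct == 3).all(axis=1)].index
result = df[df.set_index(['Begin', 'End']).index.isin(ok_pairs)]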
The transform('nunique') approach from the question works as well; with the sample data as df1:
df2 = df1[df1.groupby(['Begin','End'])['Expt'].transform('nunique') == 6]
df2
   Begin  End  Expt
0      2    5  C_R1
1      2    5  C_R2
2      2    5  C_R3
3      2    5  T_R1
4      2    5  T_R2
5      2    5  T_R3

Use drop duplicates in Pandas DF but choose keep column based on a preference list

I have a dataframe with many columns. There is a datetime column with duplicated entries, the duplicates coming from different sources. I would like to drop the duplicates based on column "dt", choosing which row to keep based on what is in column "Pref". I have provided simplified data below; in reality I also have a value column, and the "Pref" column is the data source. I prefer certain data sources, but I only need one entry per date (column "dt"). Ideally the code would also work without my providing a complete list of preferences.
Artificial Data Code
import pandas as pd
import numpy as np

df = pd.DataFrame({'dt': [1, 1, 1, 2, 2, 3, 3, 4, 4, 5],
                   'Pref': [1, 2, 3, 2, 3, 1, 3, 1, 2, 3],
                   'Value': np.random.normal(size=10),
                   'String_col': ['A'] * 10})
df
Out[1]:
dt Pref Value String_col
0 1 1 -0.479593 A
1 1 2 0.553963 A
2 1 3 0.194266 A
3 2 2 0.598814 A
4 2 3 -0.909138 A
5 3 1 -0.297539 A
6 3 3 -1.100855 A
7 4 1 0.747354 A
8 4 2 1.002964 A
9 5 3 0.301373 A
Desired Output 1 (CASE 1):
In this case my preference list matters all the way down. I prefer data source 2 the most, followed by 1, but will take 3 if that is all I have.
preference_list=[2,1,3]
Out[2]:
dt Pref Value String_col
1 1 2 0.553963 A
3 2 2 0.598814 A
5 3 1 -0.297539 A
8 4 2 1.002964 A
9 5 3 0.301373 A
Desired Output 2 (CASE 2):
In this case I just want to look for data source 1. If it is not present, I don't actually care what the other data source is.
preference_list2=[1]
Out[3]:
dt Pref Value String_col
0 1 1 -0.479593 A
3 2 2 0.598814 A
5 3 1 -0.297539 A
7 4 1 0.747354 A
9 5 3 0.301373 A
I can imagine doing this in a really slow and complicated loop, but I feel like there should be a command to accomplish this. Another important thing: I need to keep some other text columns in the dataframe, so .agg may cause issues for that metadata. I have experimented with sorting and with the keep argument of drop_duplicates, but with no success.
You are actually looking for sorting by category, which can be done with pd.Categorical:
df["Pref"] = pd.Categorical(df["Pref"], categories=preference_list, ordered=True)
print (df.sort_values(["dt","Pref"]).drop_duplicates("dt"))
dt Pref Value String_col
1 1 2 -1.004362 A
3 2 2 -1.316961 A
5 3 1 0.513618 A
8 4 2 -1.859514 A
9 5 3 1.199374 A
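
For CASE 2, where the preference list is partial, the same Categorical idea still works if the unlisted sources are appended after the listed ones (my own extension of this answer; otherwise the unlisted values become NaN and the kept row's Pref is lost). A sketch, continuing from the artificial data above:
# Append the unlisted sources after the preference list so every value
# gets a defined rank (the order among the unlisted ones is arbitrary here)
preference_list2 = [1]
rest = sorted(set(df['Pref']) - set(preference_list2))
order = preference_list2 + rest
df['Pref'] = pd.Categorical(df['Pref'], categories=order, ordered=True)
print(df.sort_values(['dt', 'Pref']).drop_duplicates('dt'))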
Here is a very efficient and simple solution; I hope it helps!
import pandas as pd
import numpy as np

df = pd.DataFrame({'dt': [1, 1, 1, 2, 2, 3, 3, 4, 4, 5],
                   'Pref': [1, 2, 3, 2, 3, 1, 3, 1, 2, 3],
                   'Value': np.random.normal(size=10),
                   'String_col': ['A'] * 10})
preference_list = [2, 3]
df_clean = df[df['Pref'].isin(preference_list)]
print(df)
print(df_clean)
Output:
dt Pref Value String_col
0 1 1 1.404505 A
1 1 2 0.840923 A
2 1 3 -1.509667 A
3 2 2 -1.431240 A
4 2 3 -0.576142 A
5 3 1 -1.208514 A
6 3 3 -0.456773 A
7 4 1 0.574463 A
8 4 2 -1.682750 A
9 5 3 0.719394 A
dt Pref Value String_col
1 1 2 0.840923 A
2 1 3 -1.509667 A
3 2 2 -1.431240 A
4 2 3 -0.576142 A
6 3 3 -0.456773 A
8 4 2 -1.682750 A
9 5 3 0.719394 A

Encode pandas column as categorical values

I have a dataframe as follows:
import pandas as pd

d = {'item': [1, 2, 3, 4, 5, 6],
     'time': [1297468800, 1297468809, 1297468801, 1297468890, 1297468820, 1297468805]}
df = pd.DataFrame(data=d)
The output of df is as follows:
item time
0 1 1297468800
1 2 1297468809
2 3 1297468801
3 4 1297468890
4 5 1297468820
5 6 1297468805
The time here is Unix time. My goal is to recode the time column in the dataframe. For example, with
mintime = 1297468800
maxtime = 1297468890
I want to split this range into 10 intervals (the count should be adjustable via a parameter, e.g. 20) and recode the time column in df accordingly, such as:
item time
0 1 1
1 2 1
2 3 1
3 4 9
4 5 3
5 6 1
What is the most efficient way to do this, given that I have billions of records? Thanks
You can use pd.cut with np.linspace to specify the bins. This encodes your column categorically, from which you can then extract the codes in order:
bins = np.linspace(df.time.min() - 1, df.time.max(), 10)
df['time'] = pd.cut(df.time, bins=bins, right=True).cat.codes + 1
df
item time
0 1 1
1 2 1
2 3 1
3 4 9
4 5 3
5 6 1
Alternatively, depending on how you treat the interval edges, you could also do
bins = np.linspace(df.time.min(), df.time.max() + 1, 10)
pd.cut(df.time, bins=bins, right=False).cat.codes + 1
0 1
1 1
2 1
3 9
4 2
5 1
dtype: int8
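
One detail worth noting (my observation, not part of the original answer): np.linspace(lo, hi, k) returns k edges and therefore k - 1 intervals, so both snippets above actually produce 9 bins. For exactly 10, or a parameterized count, pass n_bins + 1 edges. A sketch, starting from the original df above:
import numpy as np
import pandas as pd

n_bins = 10  # the desired number of intervals, exposed as a parameter
bins = np.linspace(df.time.min() - 1, df.time.max(), n_bins + 1)
df['time'] = pd.cut(df.time, bins=bins, right=True).cat.codes + 1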

Set value to slice of a Pandas dataframe

I want to sort a subset of a dataframe (say, between indexes i and j) according to some value. I tried
df2 = df.iloc[i:j].sort_values(by=...)
df.iloc[i:j] = df2
The first line works fine, but nothing happens when I run the second one (not even an error). What should I do? (I also tried the update function, but that didn't work either.)
You need to assign the sorted values back to the sliced DataFrame as a NumPy array (via .values) to avoid index alignment:
df = pd.DataFrame({'A': [1,2,3,4,3,2,1,4,1,2]})
print (df)
A
0 1
1 2
2 3
3 4
4 3
5 2
6 1
7 4
8 1
9 2
i = 2
j = 7
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A').values
print (df)
A
0 1
1 2
2 1
3 2
4 3
5 3
6 4
7 4
8 1
9 2
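
The reason the original attempt silently did nothing: pandas aligns on the index during assignment, so each row of df2 was written back to the position it originally came from, a no-op. Stripping the index forces a purely positional write; .to_numpy() is the modern spelling of .values:
# Equivalent to the answer above, using the newer .to_numpy()
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A').to_numpy()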
