Python: Explode rows from pandas DataFrame

I'm new to Python and I'm working with a pandas DataFrame.
So I have a dataframe like:
Client_id  Nb_Products
1          2
2          3
3          1
And I need to explode each row Nb_Products times for each Client_id.
So I need to output the following table:
Client_id  Product_Nb
1          1
1          2
2          1
2          2
2          3
3          1
At first I thought I should create a range of numbers for Nb_Products, like:
Client_id  Nb_Products_rng
1          [1, 2]
2          [1, 2, 3]
3          [1]
And then explode it.
But I couldn't manage to create this.
I'd be grateful for any answer or partial answer.
Thank you

Methodology
First I set an index, to speed things up and to get the unique client ids:
df = df.set_index('Client_id')
client_ids = df.index.get_level_values('Client_id').unique()
Then I just reconstruct the DataFrame by iterating over all products per client
res = pd.DataFrame(
    [
        [client, prod]
        for client in client_ids
        for prod in range(1, df.loc[client, 'Nb_Products'].max() + 1)
    ],
    columns=['Client_id', 'Nb_Products']
)
Example / Test
The test data I used
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [2, 3], [3, 3]],
    columns=['Client_id', 'Nb_Products']
)
Initial DataFrame
   Client_id  Nb_Products
0          1            2
1          2            3
2          3            3
Result
   Client_id  Nb_Products
0          1            1
1          1            2
2          2            1
3          2            2
4          2            3
5          3            1
6          3            2
7          3            3

You can do it simply by repeating the values in Client_id Nb_Products times to 'explode' your dataset. Repeating each Client_id value by the value against it in the Nb_Products column produces the Client_id column of the new dataframe; I do this using a list comprehension.
To get the second column, Product_Nb, you simply need a sequence starting from 1 for each client.
from io import StringIO
import pandas as pd

TESTDATA = StringIO("""Client_id Nb_Products
1 2
2 3
3 1""")
df = pd.read_csv(TESTDATA, sep=" ")

# repeat each Client_id value Nb_Products times
col1 = [a for a, b in zip(df['Client_id'], df['Nb_Products']) for _ in range(b)]
# build the sequence 1..Nb_Products for each client
col2 = [j for b in df['Nb_Products'] for j in range(1, b + 1)]

df2 = pd.DataFrame(list(zip(col1, col2)), columns=['Client_id', 'Product_Nb'])
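For what it's worth, a shorter way to get the same result (my own sketch, not part of the answer above) is to repeat the rows with Index.repeat and number the repeats with GroupBy.cumcount:
# assumes df has the Client_id and Nb_Products columns from the question
out = df.loc[df.index.repeat(df['Nb_Products']), ['Client_id']].reset_index(drop=True)
out['Product_Nb'] = out.groupby('Client_id').cumcount() + 1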

Related

Count number of occurrences in DataFrame per column

I have a sample dataframe where all the numbers are user IDs:
from  to
1     3
1     2
2     3
How do I count the number of occurrences in each column, sum them up for the same values, and display them in the following format in a new dataframe?
UserID  Occurences
1       2
2       2
3       2
Thank you.
IIUC, you can stack then value_counts
out = (df.stack().value_counts()
         .to_frame('Occurences')
         .rename_axis('UserID')
         .reset_index())
print(out)
   UserID  Occurences
0       1           2
1       2           2
2       3           2
Use DataFrame.melt with GroupBy.size:
df = df.melt(value_name='UserID').groupby('UserID').size().reset_index(name='Occurences')
print (df)
   UserID  Occurences
0       1           2
1       2           2
2       3           2
The pd.Series.value_counts method can be used to count the instances of each userID in the "from" and "to" columns, and pd.concat can be used to combine the results. At the end, create a dataframe from the resulting series using pd.DataFrame.reset_index:
import pandas as pd

df = pd.DataFrame({'from': [1, 1, 2], 'to': [3, 2, 3]})
occur = pd.concat([df['from'].value_counts(), df['to'].value_counts()])
result_df = occur.reset_index()
result_df.columns = ['UserID', 'occur']
result_df = result_df.groupby(['UserID'])['occur'].sum().reset_index()
   UserID  occur
0       1      2
1       2      2
2       3      2
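As a further sketch (not from the answers above), numpy.unique can produce the same counts in one call, assuming the frame holds only the two ID columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'from': [1, 1, 2], 'to': [3, 2, 3]})
# flatten both columns into one array and count each distinct user id
vals, counts = np.unique(df[['from', 'to']].to_numpy(), return_counts=True)
out = pd.DataFrame({'UserID': vals, 'Occurences': counts})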

How can I insert a row in between every other row in a dataframe?

I have a data frame that looks like:
   a  b
1  1  2
2  1  2
3  1  2
and a row that looks like: [1,2]
How can I insert this row in between rows 1 & 2, 2 & 3, and so on?
In other words, how do I insert a row every other row in a dataframe?
If you just want to add [1,2] to a table that contains only rows of 1, 2, then you can simply repeat those values:
df = df.reindex(df.index.repeat(2)).reset_index(drop=True)
Otherwise, if the row to insert has different values, you can try:
# give the existing rows odd indices, leaving the even slots free
df.index = [x for x in range(len(df) * 2) if x % 2 != 0]
for x in range(2, (len(df) * 2) + 2):
    if x % 2 == 0:
        df.loc[x] = [2, 3]
df = df.sort_index()
output of df:
   a  b
1  1  2
2  2  3
3  1  2
4  2  3
5  1  2
6  2  3
This reminds me of a mathematical problem about a hotel with an infinite number of rooms (Hilbert's hotel).
But here is the solution: we multiply the index by 2, concatenate a new dataframe with odd indexes, and then sort by index.
import pandas as pd
from io import StringIO

rows = [[3, 4]]
df = pd.read_csv(StringIO(
    """a b
1 2
1 2
1 2"""), sep=r"\s+")
nrows = df.shape[0] - 1               # one new row between each pair of existing rows
df.index = df.index * 2               # existing rows take the even indexes
new_df = pd.DataFrame(rows * nrows, columns=["a", "b"])
new_df.index = new_df.index * 2 + 1   # new rows take the odd indexes
>>> pd.concat([df, new_df]).sort_index()
   a  b
0  1  2
1  3  4
2  1  2
3  3  4
4  1  2
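A related sketch (my own, not from the answers above) sidesteps the even/odd bookkeeping by giving the new rows fractional indices, which sort between the existing integer ones:
import pandas as pd

df = pd.DataFrame([[1, 2]] * 3, columns=['a', 'b'])
new = pd.DataFrame([[3, 4]] * (len(df) - 1), columns=df.columns)
new.index = new.index + 0.5   # 0.5 and 1.5 sort between rows 0-1 and 1-2
out = pd.concat([df, new]).sort_index().reset_index(drop=True)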

Groupby and sum of multiple columns with the same value

I am working with pandas and have the following dataframe:
import numpy as np
import pandas as pd

data = pd.DataFrame()
data['HomeTeam'] = ['A', 'B', 'C', 'D', 'E']
data['AwayTeam'] = ['E', 'D', 'A', 'B', 'C']
data['HomePoint'] = [1, 3, 0, 1, 3]
data['AwayPoint'] = [1, 0, 3, 1, 0]
data['Match'] = data['HomeTeam'].astype(str) + ' Vs ' + data['AwayTeam'].astype(str)
# I want to duplicate the match
Nsims = 2
data_Dub = pd.DataFrame(np.tile(data, (Nsims, 1)))  # pd.np was removed; use numpy directly
data_Dub.columns = data.columns
# Then I will assign the stage of the match
data_Dub['SimStage'] = data_Dub.groupby('Match').cumcount()
What I want to do is sum the HomePoint and AwayPoint obtained by each team and save the result to a new data frame: HomePoint and AwayPoint should be added up per team, as I have 5 teams in the dataframe.
Can anyone advise how to do it?
I used the following code and it does not work:
Point = data_Dub.groupby(['SimStage', 'HomeTeam', 'AwayTeam'])['HomePoint', 'AwayPoint'].sum()
Thanks.
You can aggregate sum separately for HomeTeam and AwayTeam, then use add; last, sort_index and reset_index to turn the MultiIndex into columns, rename the team column, and reorder the columns if necessary:
a = data_Dub.groupby(['AwayTeam', 'SimStage'])['AwayPoint'].sum()
b = data_Dub.groupby(['HomeTeam', 'SimStage'])['HomePoint'].sum()
s = a.add(b).rename('Point')
df = s.sort_index(level=[1, 0]).reset_index().rename(columns={'AwayTeam': 'Team'})
df = df[['Team', 'Point', 'SimStage']]
print(df)
  Team  Point  SimStage
0    A      4         0
1    B      4         0
2    C      0         0
3    D      1         0
4    E      4         0
5    A      4         1
6    B      4         1
7    C      0         1
8    D      1         1
9    E      4         1
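An equivalent sketch (mine, not part of the answer above) stacks the home and away legs into one long frame first, which avoids aligning two grouped Series:
# assumes data_Dub is the duplicated frame built in the question
home = data_Dub[['SimStage', 'HomeTeam', 'HomePoint']].rename(
    columns={'HomeTeam': 'Team', 'HomePoint': 'Point'})
away = data_Dub[['SimStage', 'AwayTeam', 'AwayPoint']].rename(
    columns={'AwayTeam': 'Team', 'AwayPoint': 'Point'})
points = (pd.concat([home, away])
            .groupby(['Team', 'SimStage'], as_index=False)['Point'].sum())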

Improve pandas filter speed by storing indices?

I have the following df:
df = pd.DataFrame({'ID1':[1,2,3,4,5,6],'ID2':[2,6,6,2,1,2],'AREA':[1,1,1,1,1,1]})
...
   ID1  ID2  AREA
0    1    2     1
1    2    6     1
2    3    6     1
3    4    2     1
4    5    1     1
5    6    2     1
I accumulate the AREA column like so:
for id_ in df.ID1:
    id1_filter = df.ID1 == id_
    id2_filter = (df.ID1 == id_) | (df.ID2 == id_)
    df.loc[id1_filter, 'AREA'] = df.loc[id2_filter].AREA.sum()
print(df)
...
   ID1  ID2  AREA
0    1    2     2
1    2    6     5
2    3    6     1
3    4    2     1
4    5    1     1
5    6    2     7
For each id_ in ID1, AREA is summed where ID1 == id_ or ID2 == id_,
and this is always run with df sorted on ID1.
The real dataframe I'm working on, though, has 150,000 records, each row belonging to a unique ID1.
Running the above on that dataframe takes 2.5 hours. Since this operation will take place repeatedly
for the foreseeable future, I decided to store the indices of the True values in id1_filter and id2_filter
in a DB with the following schema.
Table ID1:
ID_, INDEX_
1,   0
2,   1
etc., etc.

Table ID2:
ID_, INDEX_
1,   0
1,   4
2,   0
2,   1
2,   3
2,   5
etc., etc.
The next time I run the accumulation on the AREA column (now filled with different AREA values),
I read in the SQL tables and convert them to dicts. I then use these dicts
to grab the records I need during the summing loop.
id1_dict = pd.read_sql('select * from ID1',db_engine).groupby('ID_').INDEX_.unique().to_dict()
id2_dict = pd.read_sql('select * from ID2',db_engine).groupby('ID_').INDEX_.unique().to_dict()
# print indices for id1_filter and id2_filter for id 1
print(id1_dict[1])
print(id2_dict[1])
...
[0]
[0, 4]
for id_ in df.ID1:
    df.loc[id1_dict[id_], 'AREA'] = df.loc[id2_dict[id_]].AREA.sum()
When run this way it only takes 6 minutes!
My question: is there a better/standard way to handle this scenario, i.e. storing dataframe selections for
later use? Side note: I have set an index on the SQL tables' ID columns and tried getting the
indices by querying the table for each id; that works well too, but still takes a little longer than the above (9 minutes).
One way to do it is like this:
df = df.set_index('ID1')
for row in df.join(df.groupby('ID2')['AREA'].apply(lambda x: x.index.tolist()), rsuffix='_').dropna().itertuples():
    df.loc[row[0], 'AREA'] += df.loc[row[3], 'AREA'].sum()
df = df.reset_index()
and you get the expected result:
   ID1  ID2  AREA
0    1    2     2
1    2    6     5
2    3    6     1
3    4    2     1
4    5    1     1
5    6    2     7
Now on a bigger df like:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID1': range(1, 1501),
                   'ID2': np.random.randint(1, 1501, (1500,)),
                   'AREA': [1] * 1500},
                  columns=['ID1', 'ID2', 'AREA'])
The method presented here runs in about 0.76 s on my computer, while your first one runs in 6.5 s.
Ultimately, you could create a df_list such as:
df_list = (df.set_index('ID1')
             .join(df.set_index('ID1').groupby('ID2')['AREA']
                   .apply(lambda x: x.index.tolist()), rsuffix='_ID2')
             .dropna().drop(['AREA', 'ID2'], axis=1))
to keep the information linking ID1 and ID2 somewhere. For example, for ID1 = 2, the AREA_ID2 column lists the ID1 labels (1, 4 and 6) of the rows where ID2 equals 2:
       AREA_ID2
ID1
1           [5]
2     [1, 4, 6]
6        [2, 3]
and then you can run the loop below without re-creating df_list, with a small difference in the code:
df = df.set_index('ID1')
for row in df_list.itertuples():
    df.loc[row[0], 'AREA'] += df.loc[row[1], 'AREA'].sum()
df = df.reset_index()
Hope it's faster
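If the goal is simply to avoid recomputing the boolean filters on every run, a sketch that keeps the precomputed row labels in memory with groupby(...).groups (my suggestion, not part of the answer above) skips the SQL round-trip entirely:
# assumes df is the original frame from the question, with a unique ID1 per row
df = df.set_index('ID1')
# computed once: maps each ID2 value to the ID1 labels of the rows holding it
id2_groups = df.groupby('ID2').groups

for id_ in df.index:
    if id_ in id2_groups:          # some ids never appear in ID2
        df.loc[id_, 'AREA'] += df.loc[id2_groups[id_], 'AREA'].sum()
df = df.reset_index()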

Transform pandas timeseries into timeseries with non-date index

I'm trying to generate a timeseries from a dataframe, but the solutions I've found here don't really address my specific problem. I have a dataframe which is a series of IDs that iterate from 1 to n, then repeat, like this:
key  ID  Var_1
0    1   1
0    2   1
0    3   2
1    1   3
1    2   2
1    3   1
I want to reshape it into a timeseries in which the index is ID and each value of key becomes a column:
ID  Var_1_0  Var_2_0
1   1        3
2   1        2
3   2        1
I have tried the stack() method but it doesn't generate the result I want. Generating an index from ID seems to be the right approach, but ID is not a proper date, so I'm not sure how to proceed. Pointers much appreciated.
Try this:
import pandas as pd
df = pd.DataFrame([[0,1,1], [0,2,1], [0,3,2], [1,1,3], [1,2,2], [1,3,1]], columns=('key', 'ID', 'Var_1'))
Use the pivot function:
df2 = df.pivot(index='ID', columns='key', values='Var_1')  # keyword arguments required in newer pandas
You can rename the columns by:
df2.columns = ('Var_1_0', 'Var_2_0')
Result:
Out:
    Var_1_0  Var_2_0
ID
1         1        3
2         1        2
3         2        1
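For reference, the same reshape can also be written with set_index plus unstack; a small equivalent sketch of mine, using the same df as above:
df3 = df.set_index(['ID', 'key'])['Var_1'].unstack()
df3.columns = [f'Var_{c + 1}_0' for c in df3.columns]   # key 0/1 -> Var_1_0 / Var_2_0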
