Pandas Multilevel header, creating multiple rows for entries within a line - python

I have a pandas dataframe with multilevel columns containing orders from an online retailer. Each row holds the information for a single order, so multiple items end up in the same row and there are hundreds of columns. I need to create a single row for each item sold in the orders.
   orderID  orderinfo     orderline    orderline    orderline    orderline
            orderaspect1  0            0            1            1
                          itemaspect1  itemaspect2  itemaspect1  itemaspect2
0  1        2             3            4            5            6
1  10       20            30           40           50           60
2  100      200           300          400          500          600
Each item number under orderline needs to become its own row, carrying the orderID and all the order aspects. It should look something like this.
   orderID  orderinfo  orderline    orderline    item #
                       itemaspect1  itemaspect2
0  1        2          3            4            0
1  10       20         30           40           0
2  100      200        300          400          0
3  1        2          5            6            1
4  10       20         50           60           1
5  100      200        500          600          1
This way there is a row for each ITEM instead of a row for each ORDER.
I've tried using melt and stack to unpivot, followed by pivoting, but I've run into issues with the MultiIndex columns. Nothing formats it correctly.
[edit]
A dataframe with a similar structure can be built with code like this.
import numpy as np
import pandas as pd

i = pd.DatetimeIndex(['2011-03-31', '2011-04-01', '2011-04-04', '2011-04-05',
                      '2011-04-06', '2011-04-07', '2011-04-08', '2011-04-11',
                      '2011-04-12', '2011-04-13'])
cols = pd.MultiIndex.from_product([['orderID', 'orderlines'], ['order data', 0, 1]])
df = pd.DataFrame(np.random.randint(10, size=(len(i), 6)), index=i, columns=cols)
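One possible approach, sketched below under the assumption that the columns form a three-level MultiIndex of (section, item number, item aspect): stack the item-number level of the orderline block into the row index, then join the order-level columns back on. The column labels here are a hypothetical reconstruction of the layout shown above, not the asker's actual frame.
import pandas as pd

# Hypothetical reconstruction of the described column layout.
cols = pd.MultiIndex.from_tuples([
    ('orderID', '', ''),
    ('orderinfo', 'orderaspect1', ''),
    ('orderline', 0, 'itemaspect1'),
    ('orderline', 0, 'itemaspect2'),
    ('orderline', 1, 'itemaspect1'),
    ('orderline', 1, 'itemaspect2'),
])
df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [10, 20, 30, 40, 50, 60],
                   [100, 200, 300, 400, 500, 600]], columns=cols)

# Order-level columns, flattened to a single level for the final join.
order_info = df[['orderID', 'orderinfo']].droplevel([1, 2], axis=1)

# Stack the item-number level of the orderline block into the row index,
# producing one row per (order, item) pair.
items = df['orderline'].stack(level=0)
items.index.names = [None, 'item #']

# Re-attach the order-level data to every item row.
result = items.reset_index('item #').join(order_info).reset_index(drop=True)
print(result)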

Related

Pandas create column based on values in other rows and columns

I am currently trying to add a new column to a pandas dataset whose values are based on the contents of other rows in the dataset.
In the example, for each row x I want to find the id_real entry from the row y whose id matches the id_par of row x. See the following example.
id_real  id  id_par
100      1   2
200      2   3
300      3   4

The desired result:

id_real  id  id_par  new_col
100      1   2       200
200      2   3       300
300      3   4       NaN
I have tried a lot of things and the last thing I tried was the following:
df["new_col"] = df[df["id"] == df["id_par"]]["node_id"]
Unfortunately, the new column then only contains NaN entries. Can you help me?
Use Series.map with DataFrame.drop_duplicates to match the first row for each id:
df["new_col"] = df['id_par'].map(df.drop_duplicates('id').set_index('id')['id_real'])
print(df)
id_real id id_par new_col
0 100 1 2 200.0
1 200 2 3 300.0
2 300 3 4 NaN
Use map to match the "id_par" using "id" as index and "id_real" as values:
df['new_col'] = df['id_par'].map(df.set_index('id')['id_real'])
output:
id_real id id_par new_col
0 100 1 2 200.0
1 200 2 3 300.0
2 300 3 4 NaN
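For completeness, the same lookup can also be written as a self-merge; a minimal sketch using the example frame above:
# Join each row's id_par against the id column of the frame itself.
lookup = df[['id', 'id_real']].rename(columns={'id': 'id_par', 'id_real': 'new_col'})
out = df.merge(lookup, on='id_par', how='left')
print(out)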

Pandas Dataframe fill column with sequence_id based on multiple columns ids and timestamp

*I'm editing the df, since it contained a typo in ne1_id.
I'm having a really hard time trying to solve the following and would really appreciate any assistance or light on it.
I have a DataFrame df that looks like this:
    timestamp        user_id  ne1_id  ne2_id  attempt_no
0   18:11:42.838363  1        100             1
1   18:11:42.838364           100     123456
2   18:11:42.838365           100     123456
3   18:11:42.83836            100     123456
4   18:11:45.838365  1        100             2
5   18:11:45.838366           100     321234
6   18:11:45.838369           100     321234
7   18:11:46.838363  3        12              3
8   18:11:46.838364           12      9832
9   18:11:47.838363  2        12              4
10  18:11:47.838369           100
What I want to do is fill the empty attempt_no cells (they are genuinely empty, not NaN) in the subsequent rows, based on timestamp (or index), with the proper attempt_no by matching on the user_id / ne1_id / ne2_id associations.
I'm not seeing the logic of it, nor how to do it.
The result should be something like this:
    timestamp        user_id  ne1_id  ne2_id  attempt_no
0   18:11:42.838363  1        100             1
1   18:11:42.838364           100     123456  1
2   18:11:42.838365           100     123456  1
3   18:11:42.838369           100     123456  1
4   18:11:45.838365  1        100             2
5   18:11:45.838366           100     321234  2
6   18:11:45.838369           100     321234  2
7   18:11:46.838363  3        12              3
8   18:11:46.838364           12      9832    3
9   18:11:47.838363  2        12              4
10  18:11:47.838369           100             4
Something that says the following:
"find all the rows where there is a user_id, then find the subsequent rows with the same ne1_id, an empty user_id and an empty attempt_no, and fill attempt_no with the attempt_no of that earlier row."
I tried with groupby, which I believe is the way to do it, but I'm kind of stuck there.
I appreciate any suggestion.
import numpy as np
import pandas as pd

def f(x):
    # Carry the last non-NaN value forward, replacing NaNs as we go.
    last = None
    for i in range(len(x)):
        if np.isnan(x[i]):
            x[i] = last
        else:
            last = x[i]
    return x

df = pd.DataFrame({'x': [1, None, None, 2, None, None, None, 3, None]})
df[['x']].apply(f)
By applying the function on axis=0 you are able to jointly process the entire column.
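Mapping this idea onto the question's frame, one possible sketch is to convert the empty cells to NaN and forward-fill attempt_no within each ne1_id group. The frame below is a hypothetical simplification with only the columns the fill logic needs:
import numpy as np
import pandas as pd

# Empty strings stand in for the blank cells from the question.
df = pd.DataFrame({
    'ne1_id':     [100, 100, 100, 12, 12],
    'attempt_no': [1,   '',  '',  3,  ''],
})

# Treat empties as missing, then carry the last attempt_no forward
# within each ne1_id group.
df['attempt_no'] = (df['attempt_no']
                    .replace('', np.nan)
                    .groupby(df['ne1_id'])
                    .ffill())
print(df)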

How to drop rows by threshold of index column's occur frequence in Pandas

I have a dataframe like this:
userid itemid timestamp
1 1 50
1 2 50
1 3 50
1 4 60
2 1 40
2 2 50
I want to drop all rows whose userid occurs more than 2 times and get a new dataframe as follows. Can someone help me? Thanks.
userid itemid timestamp
2 1 40
2 2 50
You can use pd.Series.value_counts and calculate an array of userid filtered by your condition. Then use this to filter your original dataframe.
c = df['userid'].value_counts()
idx = c[c > 2].index
res = df[~df['userid'].isin(idx)]
print(res)
userid itemid timestamp
4 2 1 40
5 2 2 50
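An equivalent one-liner, as a sketch, uses groupby with transform('size') so no intermediate index is needed:
# Keep only rows whose userid appears at most twice.
res = df[df.groupby('userid')['userid'].transform('size') <= 2]
print(res)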

comparing columns within groupby objects in pandas

My dataframe looks like this:
id month spent limit
1 1 2.6 10
1 2 4 10
1 3 6 10
2 1 3 100
2 2 89 100
2 3 101 100
3 1 239 500
3 2 432 500
3 3 100 500
I want to groupby id and then get the ids for which spent column is less than or equal to limit column for every row in the grouped by object.
For my above example, I should get ids 1 and 3 as my result because id 2 spends 101 in 3rd month and hence exceeds the limit of 100.
How can I do this in pandas efficiently?
Thanks in advance!
You can build a mask by finding the ids where spent is greater than limit, then keep only the ids that are not in it:
mask = df.loc[df['spent'] > df['limit'], 'id'].unique()
df.loc[~df['id'].isin(mask), 'id'].unique()
gives you
array([1, 3])
This should give you something like what you want:
df.groupby('id').apply(lambda g: (g.spent <= g.limit).all()).to_frame('not_exceeded').query('not_exceeded')
Reverse logic! Check for unique ids where spent is greater than limit. Then filter out those.
df[~df.id.isin(df.set_index('id').query('limit < spent').index.unique())]
id month spent limit
0 1 1 2.6 10
1 1 2 4.0 10
2 1 3 6.0 10
6 3 1 239.0 500
7 3 2 432.0 500
8 3 3 100.0 500
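Yet another way to express the same check, as a sketch, is groupby + filter, which keeps only the groups where every row satisfies the condition:
# Groups where spent never exceeds limit; read the qualifying ids off the result.
ok = df.groupby('id').filter(lambda g: (g['spent'] <= g['limit']).all())
print(ok['id'].unique())   # array([1, 3])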

Pandas: Merge or join dataframes based on column data?

I am trying to add several columns of data to an existing dataframe. The dataframe itself was built from a number of other dataframes, which I successfully joined on indices, which were identical. For that, I used code like this:
data = p_data.join(r_data)
I actually joined these on a multi-index, so the dataframe looks something like the following, where Name1 and Name2 are indices:
Name1  Name2  present  r      behavior
a      1      1        0      0
       2      1        .5     2
       4      3        .125   1
b      2      1        0      0
       4      5        .25    4
       8      1        0      1
So the Name1 index does not repeat data, but the Name2 index does (I'm using this to keep track of dyads, so that Name1 & Name2 together are only represented once). What I now want to add are 4 columns of data that correspond to Name2 data (information on the second member of the dyad). Unlike the "present", "r", and "behavior" data, these data are per individual, not per dyad. So I don't need to consider Name1 data when merging.
The problem is that while Name2 data are repeated to exhaust the dyad combos, the "Name2" column in the data I would now like to add only has one piece of data per Name2 individual:
Name2  Data1  Data2  Data3
1      80     6      1
2      61     8      3
4      45     7      2
8      30     3      6
What I would like the output to look like:
Name1 Name2 present r behavior Data1 Data2 Data3
a 1 1 0 0 80 6 1
2 1 .5 2 61 8 3
4 3 .125 1 45 7 2
b 2 1 0 0 61 8 3
4 5 .25 4 45 7 2
8 1 0 1 30 3 6
Despite reading the documentation, I am not clear on whether I can use join() or merge() for the desired outcome. If I try a join to the existing dataframe like the simple one I've used previously, I end up with the new columns but they are full of NaN values. I've also tried various combinations using Name1 and Name2 as either columns or as indices, with either join or merge (not as random as it sounds, but I'm clearly not interpreting the documentation correctly!). Your help would be very much appreciated, as I am presently very much lost.
I'm not sure if this is the best way, but you could use reset_index to temporarily make your original DataFrame indexed by Name2 only. Then you could perform the join as usual. Then use set_index to again make Name1 part of the MultiIndex:
import pandas as pd

df = pd.DataFrame({'Name1': ['a','a','a','b','b','b'],
                   'Name2': [1,2,4,2,4,8],
                   'present': [1,1,3,1,5,1]})
df.set_index(['Name1','Name2'], inplace=True)

df2 = pd.DataFrame({'Data1': [80,61,45,30],
                    'Data2': [6,8,7,3]},
                   index=pd.Series([1,2,4,8], name='Name2'))

result = df.reset_index(level=0).join(df2).set_index('Name1', append=True)
print(result)
# present Data1 Data2
# Name2 Name1
# 1 a 1 80 6
# 2 a 1 61 8
# b 1 61 8
# 4 a 3 45 7
# b 5 45 7
# 8 b 1 30 3
To make the result look even more like your desired DataFrame, you could reorder and sort the index:
print(result.reorder_levels([1,0], axis=0).sort_index())
# present Data1 Data2
# Name1 Name2
# a 1 1 80 6
# 2 1 61 8
# 4 3 45 7
# b 2 1 61 8
# 4 5 45 7
# 8 1 30 3
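As a possible alternative, sketched with the same df and df2 as above, the join can also be expressed as a merge on the Name2 column, rebuilding the MultiIndex afterwards:
# Merge on the Name2 column rather than the index, then restore the MultiIndex.
merged = (df.reset_index()
            .merge(df2.reset_index(), on='Name2', how='left')
            .set_index(['Name1', 'Name2']))
print(merged)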
