How to group hierarchical data in pandas/SQL? - python

I have a problem involving hierarchical data. My data looks like this:
id   performance_rating  parent_id  level
111  8                   null       0
122  3                   null       0
123  9                   null       0
254  5                   111        1
265  8                   111        1
298  7                   122        1
220  6                   123        1
305  5                   298        2
395  8                   220        2
...  ...                 ...        ...
654  4                   562        5
id is a person's unique identifier.
performance_rating is that person's rating out of 10.
parent_id is the id of the person the corresponding id works under.
I need to find the average rating of each individual tree (rooted at 111, 122 and 123).
What I tried was to split the data frame by level, then merge and group by, but it is quite long.

There will be a few different ways to do this - here's an ugly solution.
We use a while and a for loop over the dataframe to "back-level" each row:
This requires that we first set 'id' as the index and sort by 'level', descending. It also requires that there are no duplicate ids. Here goes:
df = df.set_index('id')
df = df.sort_values(by='level', ascending=False)

for i in df.index:
    while df.loc[i, 'level'] > 1:
        # re-point this row at its grandparent and move it up one level
        old_pid = df.loc[i, 'parent_id']
        df.loc[i, 'parent_id'] = df.loc[old_pid, 'parent_id']
        old_level = df.loc[i, 'level']
        df.loc[i, 'level'] = old_level - 1
This way, no matter how many levels there are, we are left with everything at level 1 of the hierarchy and can then do:
grouped = df.groupby('parent_id').mean()
(or whatever variation of that you need)
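If you'd rather not mutate the frame level by level, here is a rough sketch of the same idea that maps every id straight to its level-0 root and groups on that. It starts from the original frame (before 'id' is set as the index) and assumes the top-level people really have a null/NaN parent_id:

import pandas as pd

parent = df.set_index('id')['parent_id']   # id -> parent_id lookup

def root_of(node):
    # climb the parent chain until we reach a level-0 person (null parent_id)
    while pd.notna(parent.get(node)):
        node = int(parent[node])   # parent_id comes back as float because of the nulls
    return node

df['root'] = df['id'].map(root_of)
tree_avg = df.groupby('root')['performance_rating'].mean()
print(tree_avg)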
I hope that helps!

Related

Assign values to a column based on Min and Max of a particular group

I have 3 fields, Cust_ID, Acc_No and Product, as in the table below.
I need to add a column 'Type' based on the 'Product' values for each Cust_ID. If all of a customer's 'Product' values lie in the range 'a' to 'm', or all of them lie in the range 'n' to 'z', the customer should be labelled 'Single'; otherwise 'Multiple', as in the table below.
I am trying to group by 'Cust_ID' and compare the min and max of 'Product' against the ranges '<=m' and '>=n', but I have not been able to implement it successfully. Any help will be appreciated; thanks in advance.
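For anyone reproducing this, here is a minimal construction of the example table; the values are taken from the expected output shown further down:

import numpy as np
import pandas as pd

# Reconstructed from the expected output in the answer below
df = pd.DataFrame(
    {
        "Cust_ID": [1, 1, 1, 2, 2, 2, 3, 3],
        "Acc_No": [111, 112, 113, 221, 222, 223, 331, 332],
        "Product": ["a", "b", "c", "a", "x", "y", "z", "x"],
    }
)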
You can use .groupby.transform + Series.between:
df["Type"] = df.groupby("Cust_ID")["Product"].transform(
lambda x: np.where(
x.between("a", "m").all() | x.between("n", "z").all(),
"Single",
"Multiple",
)
)
print(df)
Prints:
   Cust_ID  Acc_No Product      Type
0        1     111       a    Single
1        1     112       b    Single
2        1     113       c    Single
3        2     221       a  Multiple
4        2     222       x  Multiple
5        2     223       y  Multiple
6        3     331       z    Single
7        3     332       x    Single
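If you'd rather avoid numpy, a roughly equivalent sketch computes one label per customer and maps it back onto the rows:

# One True/False per Cust_ID: does the whole group sit inside a single range?
single = df.groupby("Cust_ID")["Product"].apply(
    lambda x: x.between("a", "m").all() or x.between("n", "z").all()
)

df["Type"] = df["Cust_ID"].map(single.map({True: "Single", False: "Multiple"}))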

Best way to iterate through dict of pandas dataframes with identical structures, to generate one dataframe with the sum of each (row, col) element?

I have a dict of pandas dataframes, d1, where each value is a two-column (ID and Weight), 100-row dataframe.
I want to iterate through the dict, and for each dataframe, I want to sum all the 'Weight' values in row n, where n is the value between 1 and 100 representing the row. I then want to write the output to another dict, d2, where the key is 1-100, and the value is the sum of the values.
Example d1 value dataframe:
ID   Weight
1    0.021
2    0.445
3    1.018
..   ..
99   77.31
100  234.04
Essentially, imagine I have 10000 of these dataframes, and I want to sum all the Weight values for ID 1 across the 10000, then all the Weight values for ID 2 across the 10000, and so on up to ID 100.
I have a solution, which is basically a nested loop. It works, and it will do. However, I'm really keen to expand my basic pandas / numpy knowledge, and I wondered if there is a more pythonic way to do this?
My existing code :
for i in range(1, 101):
    tot = 0
    for key, value in d1.items():
        tot = tot + value.at[i, 'Weight']
    d2[i] = tot
Hugely appreciate any help and advice!
You can use the pandas add function:
import numpy as np
import pandas as pd

# create a zero-filled dataframe with the same shape and columns
df = pd.DataFrame(0, index=np.arange(len(df1)), columns=df1.columns)

# iterate through the dict and add each frame to df
for value in d1.values():
    df = df.add(value)
You can set ID as the index via df_i = df_i.set_index('ID') and then add them all up, so that only the weights are added, and then df = df.reset_index() at the end.
Example:
df1 = pd.DataFrame([(1, 2), (3, 4), (5, 6)], columns=['ID', 'Weight'])

   ID  Weight
0   1       2
1   3       4
2   5       6

df2 = pd.DataFrame([(10, 20), (30, 40), (50, 60)], columns=['ID', 'Weight'])

   ID  Weight
0  10      20
1  30      40
2  50      60

df3 = pd.DataFrame([(100, 200), (300, 400), (500, 600)], columns=['ID', 'Weight'])

   ID  Weight
0  100     200
1  300     400
2  500     600
d1 = {'df1': df1, 'df2': df2, 'df3': df3}

df = pd.DataFrame(0, index=np.arange(len(df1)), columns=df1.columns)
print(df)

for value in d1.values():
    df = df.add(value)
df:

   ID  Weight
0  111     222
1  333     444
2  555     666
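A sketch of the set_index('ID') variant mentioned above, so that only the Weight column is summed; it assumes, as in the question, that every frame carries the same set of IDs. A concat-and-groupby version is shown as well, which also copes with frames whose IDs differ:

# Variant 1: index on ID so only the Weight column is added up
total = sum(v.set_index('ID') for v in d1.values()).reset_index()

# Variant 2: stack all the frames and group by ID
total2 = pd.concat(list(d1.values())).groupby('ID', as_index=False)['Weight'].sum()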

How do I create a DataFrame with multi-level columns?

An existing question, Creating a Pandas Dataframe with Multi Column Index, deals with a very "regular" DataFrame where all columns and rows are products and all data is present.
My situation is, alas, different. I have this kind of data:
[{"street": "Euclid", "house":42, "area":123, (1,"bedrooms"):1, (1,"bathrooms"):4},
{"street": "Euclid", "house":19, "area":234, (2,"bedrooms"):3, (2,"bathrooms"):3},
{"street": "Riemann", "house":42, "area":345, (1,"bedrooms"):5,
(1,"bathrooms"):2, (2,"bedrooms"):12, (2, "bathrooms"):17},
{"street": "Riemann", "house":19, "area":456, (1,"bedrooms"):7, (1,"bathrooms"):1}]
and I want this sort of DataFrame with both rows and columns having multi-level indexes:
                area        1                   2
street  house        bedrooms bathrooms bedrooms bathrooms
Euclid  42       123        1         4
Euclid  19       234                         3         3
Riemann 42       345        5         2       12        17
Riemann 19       456        7         1
So, the row index should be
MultiIndex([("Euclid",42),("Euclid",19),("Riemann",42),("Riemann",19)],
names=["street","house"])
and the columns index should be
MultiIndex([("area",None),(1,"bedrooms"),(1,"bathrooms"),(2,"bedrooms"),(2,"bathrooms")],
names=["floor","entity"])
and I see no way to generate these indexes from the list of dictionaries I have.
I feel there should be something better than this; hopefully someone on SO posts something much better.
Create a function to process each entry (each dictionary) in the list:
def process(entry):
    # read in the data and make the keys the column names
    m = pd.DataFrame.from_dict(entry, orient='index').T
    # set the row index
    m = m.set_index(['street', 'house'])
    # create the multi-index columns
    col1 = [ent[0] if isinstance(ent, tuple) else ent for ent in m.columns]
    col2 = [ent[-1] if isinstance(ent, tuple) else None for ent in m.columns]
    # assign the multi-index columns to m
    m.columns = [col1, col2]
    return m
Apply the function above to the data (I wrapped the list of dictionaries in a data variable):
res = [process(entry) for entry in data]
Concatenate to get the final output:
pd.concat(res)
                area         1                    2
                 NaN  bedrooms  bathrooms  bedrooms  bathrooms
street  house
Euclid  42       123         1          4       NaN        NaN
        19       234       NaN        NaN         3          3
Riemann 42       345         5          2        12         17
        19       456         7          1       NaN        NaN
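For what it's worth, here is a sketch of a more direct construction that skips the per-entry loop. It assumes pandas keeps the tuple keys as flat tuple column labels when building the frame from the list of dicts (called data, as above), and it gives plain columns like 'area' an empty string rather than None as their second level:

import pandas as pd

df = pd.DataFrame(data)                     # tuple keys become tuple column labels
df = df.set_index(['street', 'house'])
df.columns = pd.MultiIndex.from_tuples(
    [c if isinstance(c, tuple) else (c, '') for c in df.columns],
    names=['floor', 'entity'],
)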

Join pandas dataframes based on column values

I'm quite new to pandas dataframes, and I'm having some trouble joining two tables.
The first df has just 3 columns:
DF1:
item_id  position  document_id
336      1         10
337      2         10
338      3         10
1001     1         11
1002     2         11
1003     3         11
38       10        146
And the second has exactly the same two columns (and plenty of others):
DF2:
item_id  document_id  col1  col2  col3 ...
337      10           ...   ...   ...
1002     11           ...   ...   ...
1003     11           ...   ...   ...
What I need is to perform an operation which, in SQL, would look as follows:
DF1 join DF2
  on DF1.document_id = DF2.document_id
 and DF1.item_id = DF2.item_id
And, as a result, I want to see DF2, complemented with column 'position':
item_id document_id position col1 col2 col3 ...
What is a good way to do this using pandas?
I think you need merge with the default inner join, but it is necessary that there are no duplicated combinations of values in these two columns in either dataframe:
print(df2)

   item_id  document_id col1  col2  col3
0      337           10    s     4     7
1     1002           11    d     5     8
2     1003           11    f     7     0

df = pd.merge(df1, df2, on=['document_id', 'item_id'])
print(df)

   item_id  position  document_id col1  col2  col3
0      337         2           10    s     4     7
1     1002         2           11    d     5     8
2     1003         3           11    f     7     0
But if you need the position column to be the third column:
df = pd.merge(df2, df1, on=['document_id', 'item_id'])
cols = df.columns.tolist()
df = df[cols[:2] + cols[-1:] + cols[2:-1]]
print(df)

   item_id  document_id  position col1  col2  col3
0      337           10         2    s     4     7
1     1002           11         2    d     5     8
2     1003           11         3    f     7     0
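As a small variant of the reordering above, you could also move 'position' in place with pop/insert (a sketch):

df = pd.merge(df2, df1, on=['document_id', 'item_id'])
# move 'position' so it sits right after document_id
df.insert(2, 'position', df.pop('position'))
print(df)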
If you're merging on all common columns as in the OP, you don't even need to pass on=; simply calling merge() will do the job.
merged_df = df1.merge(df2)
The reason is that under the hood, if on= is not passed, pd.Index.intersection is called on the columns to determine the common columns and merge on all of them.
A special thing about merging on common columns is that it doesn't matter which dataframe is on the right or the left; the rows in the result are the same, because they are selected by looking up matching rows on the common columns. The only difference is where the columns end up: the columns in the right dataframe that are not in the left dataframe are added to the right of the left dataframe's columns. So unless the order of the columns matters (which can easily be fixed using column selection or reindex()), it doesn't really matter which dataframe is on the right and which is on the left. In other words,
df12 = df1.merge(df2, on=['document_id','item_id']).sort_index(axis=1)
df21 = df2.merge(df1, on=['document_id','item_id']).sort_index(axis=1)
# df12 and df21 are the same.
df12.equals(df21) # True
This is not true if the columns to be merged on don't have the same name and you have to pass left_on= and right_on= (see #1 in this answer).
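And a small sketch of that last case, using hypothetical, differently named key columns (doc_id, itm_id) on the right-hand frame:

# Hypothetical variant of df2 whose key columns are named differently
df2_renamed = df2.rename(columns={'document_id': 'doc_id', 'item_id': 'itm_id'})

merged = df1.merge(df2_renamed,
                   left_on=['document_id', 'item_id'],
                   right_on=['doc_id', 'itm_id'])

# Both sets of key columns are kept, so drop the duplicates from the right side
merged = merged.drop(columns=['doc_id', 'itm_id'])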

Appending two dataframes with same columns, different order

I have two pandas dataframes.
from pandas import DataFrame

noclickDF = DataFrame([[0, 123, 321], [0, 1543, 432]],
                      columns=['click', 'id', 'location'])
clickDF = DataFrame([[1, 123, 421], [1, 1543, 436]],
                    columns=['click', 'location', 'id'])
I simply want to join such that the final DF will look like:
click  id    location
0      123   321
0      1543  432
1      421   123
1      436   1543
As you can see the column names of both original DF's are the same, but not in the same order. Also there is no join in a column.
You could also use pd.concat:
In [36]: pd.concat([noclickDF, clickDF], ignore_index=True)
Out[36]:
   click    id  location
0      0   123       321
1      0  1543       432
2      1   421       123
3      1   436      1543
Under the hood, DataFrame.append calls pd.concat.
DataFrame.append has code for handling various types of input, such as Series, tuples, lists and dicts. If you pass it a DataFrame, it passes straight through to pd.concat, so using pd.concat is a bit more direct. (Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions pd.concat is the only option anyway.)
For future users (sometime >pandas 0.23.0):
You may also need to add sort=True to sort the non-concatenation axis when it is not already aligned (i.e. to retain the OP's desired concatenation behavior). I used the code contributed above and got a warning, see Python Pandas User Warning. The code below works and does not throw a warning.
In [36]: pd.concat([noclickDF, clickDF], ignore_index=True, sort=True)
Out[36]:
   click    id  location
0      0   123       321
1      0  1543       432
2      1   421       123
3      1   436      1543
You can use append for that:
df = noclickDF.append(clickDF)
print(df)

   click    id  location
0      0   123       321
1      0  1543       432
0      1   421       123
1      1   436      1543
and if you need to, you can reset the index with
df = df.reset_index(drop=True)
print(df)

   click    id  location
0      0   123       321
1      0  1543       432
2      1   421       123
3      1   436      1543
