I have a grouped dataframe
id num week val
101 23 7 3
8 1
9 2
102 34 8 4
9 1
10 2
...
And I need to create new columns to get a DataFrame like this:
id num 7 8 9 10
101 23 3 1 2 0
102 34 0 4 1 2
...
As you may see, the values of the week column turned into several columns.
I may also have the input DataFrame ungrouped, or with reset_index, like this:
id num week val
101 23 7 3
101 23 8 1
101 23 9 2
102 34 8 4
102 34 9 1
102 34 10 2
...
but I don't know which would be easier to start with.
Notice that id and num are both keys
Use unstack(), then fillna(0) to avoid NaNs.
Let's load the data:
id num week val
101 23 7 3
101 23 8 1
101 23 9 2
102 34 8 4
102 34 9 1
102 34 10 2
s = pd.read_clipboard(index_col=[0,1,2], squeeze=True)
Notice I have set the index to be id, num and week. If you haven't yet, use set_index. (Note: the squeeze argument was removed in pandas 2.0; call .squeeze('columns') on the result instead.)
Now we can unstack: move week from the index (rows) to the columns. By default unstack acts on the last index level, which is week here, but you can specify it explicitly with level=-1 or level='week':
s.unstack().fillna(0)
Note that, as pointed out by @piRsquared, you can do s.unstack(fill_value=0) to do it in one go.
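A self-contained version of the steps above, building the frame in code instead of using read_clipboard (column names follow the question's id/num/week layout, with the value column called val as in the answer):

```python
import pandas as pd

# Reconstruct the question's long-format data
df = pd.DataFrame({
    'id':   [101, 101, 101, 102, 102, 102],
    'num':  [23, 23, 23, 34, 34, 34],
    'week': [7, 8, 9, 8, 9, 10],
    'val':  [3, 1, 2, 4, 1, 2],
})

# Index on the keys plus week, keep val as a Series
s = df.set_index(['id', 'num', 'week'])['val']

# unstack moves the last index level ('week') into the columns;
# fill_value=0 fills the missing (id, num, week) combinations
out = s.unstack(fill_value=0)
print(out)
```

The result has one column per week (7, 8, 9, 10), with 0 where a combination was absent.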
I have two data frames in Pandas with following structure:
pr_df:
id prMerged idRepository avgTime
0 1 1 2 63.93
1 2 0 3 41.11
2 3 0 3 36.03
3 4 1 4 77.28
...
98 99 1 20 54.78
99 100 0 20 42.12
repo_df:
id stars forks
0 1 1245 45
1 2 3689 78
2 3 458 15
3 4 954 75
...
19 20 1947 102
I would like to combine pr_df with repo_df by comparing idRepository (from pr_df) and id (from repo_df) with each other and add two columns to pr_df: stars and forks. As a result, I would like to achieve:
pr_df:
id prMerged idRepository avgTime stars forks
0 1 1 2 63.93 3689 78
1 2 0 3 41.11 458 15
2 3 0 3 36.03 458 15
3 4 1 4 77.28 954 75
...
98 99 1 20 54.78 1947 102
99 100 0 20 42.12 1947 102
How can I do it using Pandas? How can I compare idRepository with id and add new columns to pr_df based on that?
You can use the merge function; you have to supply the columns you want to merge on. Note that merge returns a new DataFrame rather than modifying pr_df in place:
pr_df = pr_df.merge(repo_df, left_on='idRepository', right_on='id')
Since both frames have an id column, the result will contain id_x and id_y; drop or rename whichever one you don't need.
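A runnable sketch with a truncated version of the question's data. The suffixes argument keeps pr_df's id column untouched and tags repo_df's copy (the '_repo' suffix is just an illustrative choice) so it can be dropped:

```python
import pandas as pd

# Truncated versions of the two frames from the question
pr_df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'prMerged': [1, 0, 0, 1],
    'idRepository': [2, 3, 3, 4],
    'avgTime': [63.93, 41.11, 36.03, 77.28],
})
repo_df = pd.DataFrame({
    'id': [2, 3, 4],
    'stars': [3689, 458, 954],
    'forks': [78, 15, 75],
})

# merge returns a new frame; drop the repository's duplicate id column
pr_df = (pr_df.merge(repo_df, left_on='idRepository', right_on='id',
                     how='left', suffixes=('', '_repo'))
              .drop(columns='id_repo'))
print(pr_df)
```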
Have got a dataframe df like below:
Store Aisle Table
11 59 2
11 61 3
Need to repeat each combination/row 3 times, generating a new column 'Bit' with the range value, as below:
Store Aisle Table Bit
11 59 2 1
11 59 2 2
11 59 2 3
11 61 3 1
11 61 3 2
11 61 3 3
Have tried the below code but it didn't work:
df.loc[df.index.repeat(range(3))]
Help me out! Thanks in Advance.
You should provide a number, not a range to repeat. Also, you need a bit of processing:
(df.loc[df.index.repeat(3)]
.assign(Bit=lambda d: d.groupby(level=0).cumcount().add(1))
.reset_index(drop=True)
)
output:
Store Aisle Table Bit
0 11 59 2 1
1 11 59 2 2
2 11 59 2 3
3 11 61 3 1
4 11 61 3 2
5 11 61 3 3
Alternatively, using MultiIndex.from_product:
idx = pd.MultiIndex.from_product([df.index, range(1,3+1)], names=(None, 'Bit'))
(df.reindex(idx.get_level_values(0))
.assign(Bit=idx.get_level_values(1))
)
df = df.iloc[np.repeat(np.arange(len(df)), 3)]
df['Bit'] = list(range(1, 4)) * (len(df) // 3)  # rows repeat as 0,0,0,1,1,1, so tile 1,2,3
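Putting the accepted Index.repeat approach together as a runnable sketch (n is the repeat count):

```python
import pandas as pd

df = pd.DataFrame({'Store': [11, 11], 'Aisle': [59, 61], 'Table': [2, 3]})

n = 3
out = (df.loc[df.index.repeat(n)]                         # repeat each row n times
         .assign(Bit=lambda d: d.groupby(level=0).cumcount() + 1)  # 1..n per original row
         .reset_index(drop=True))
print(out)
```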
I have a dataframe df as shown:
1-1 1-2 1-3 2-1 2-2 3-1 3-2 4-1 5-1
10 3 9 1 3 9 33 10 11
21 31 3 22 21 13 11 7 13
33 22 61 31 35 34 8 10 16
6 9 32 5 4 8 9 6 8
where the columns are explained as follows:
the first digit is the group number and the second is the subgroup within it; in our example we have groups 1, 2, 3, 4, 5, and group 1 consists of 1-1, 1-2, 1-3.
I would like to create a new dataframe that has only the groups 1, 2, 3, 4, 5 without subgroups, taking for each row the max value among each group's subgroups, and that stays flexible to any later changes or additions of groups or subgroups.
The new dataframe I need is like the shown:
1 2 3 4 5
10 3 33 10 11
31 22 13 7 13
61 35 34 10 16
32 5 9 6 8
You can aggregate by columns with axis=1, using a lambda function that splits each column name and selects the first part, together with max and DataFrame.groupby:
This works correctly even if group numbers contain 2 or more digits.
df1 = df.groupby(lambda x: x.split('-')[0], axis=1).max()
Alternative is pass splitted columns names:
df1 = df.groupby(df.columns.str.split('-').str[0], axis=1).max()
print (df1)
1 2 3 4 5
0 10 3 33 10 11
1 31 22 13 7 13
2 61 35 34 10 16
3 32 5 9 6 8
You can use .str[] or .str.get here (this takes only the first character of each column name, so it only works while group numbers are single digits):
df.groupby(df.columns.str[0], axis=1).max()
1 2 3 4 5
0 10 3 33 10 11
1 31 22 13 7 13
2 61 35 34 10 16
3 32 5 9 6 8
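As a side note, DataFrame.groupby(axis=1) is deprecated in pandas 2.x. A sketch of an equivalent that avoids it: transpose, group the rows by the prefix before the '-', take the max, and transpose back:

```python
import pandas as pd

df = pd.DataFrame(
    [[10, 3, 9, 1, 3, 9, 33, 10, 11],
     [21, 31, 3, 22, 21, 13, 11, 7, 13],
     [33, 22, 61, 31, 35, 34, 8, 10, 16],
     [6, 9, 32, 5, 4, 8, 9, 6, 8]],
    columns=['1-1', '1-2', '1-3', '2-1', '2-2', '3-1', '3-2', '4-1', '5-1'])

# group key: the part of each column name before the '-'
groups = df.columns.str.split('-').str[0]

# transpose so columns become rows, group them, take max, transpose back
df1 = df.T.groupby(groups).max().T
print(df1)
```

Like the split-based answers above, this also handles multi-digit group numbers.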
I have the following Pandas Dataframe.
name day h1 h2 h3 h4 h5
pepe 1 10 4 0 4 7
pepe 2 54 65 4 42 6
pepe 3 1 3 28 6 12
pepe 4 5 6 1 8 5
juan 1 78 9 2 65 4
juan 2 2 42 14 54 95
I want to obtain:
name day h1 h2 h3 h4 h5 sum
pepe 1 10 4 0 4 7
pepe 2 54 65 4 42 6 18
pepe 3 1 3 28 6 12 165
pepe 4 5 6 1 8 5 38
juan 1 78 9 2 65 4
juan 2 2 42 14 54 95 154
I've been searching the web, but without success.
The number 38 in the sum column is in pepe's row for day 4, and it is the sum of h1 through h4 of pepe's row for day 4-1 = 3.
It proceeds similarly for day 3 and day 2. For day 1 the corresponding sum cell must stay empty.
The same must be done for Juan and so for the different values of name.
How can I do it? Maybe it's better to try a loop using iterrows first, or something like that.
I would sum the rows and then shift the result within each name. This is my favorite resource for complex loc calls, lots of options here --
https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
df['sum'] = (df[['h1', 'h2', 'h3', 'h4']].sum(axis=1)
               .groupby(df['name']).shift())
To use a loop, you would need df.itertuples():
df['sum'] = float('nan')  # must initialize the column first
prev = {}
for row in df.itertuples():
    temp_sum = row.h1 + row.h2 + row.h3 + row.h4
    if row.name in prev:  # day 1 of each name keeps an empty sum
        df.at[row.Index, 'sum'] = prev[row.name]
    prev[row.name] = temp_sum
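A self-contained check of the shift-based idea, using the question's data: sum h1..h4 per row, then shift that series one row down within each name so every day sees the previous day's total:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['pepe'] * 4 + ['juan'] * 2,
    'day':  [1, 2, 3, 4, 1, 2],
    'h1': [10, 54, 1, 5, 78, 2],
    'h2': [4, 65, 3, 6, 9, 42],
    'h3': [0, 4, 28, 1, 2, 14],
    'h4': [4, 42, 6, 8, 65, 54],
    'h5': [7, 6, 12, 5, 4, 95],
})

# Row-wise sum of h1..h4, shifted down one row within each name;
# the first day of each name is left as NaN (the empty cell asked for)
df['sum'] = (df[['h1', 'h2', 'h3', 'h4']].sum(axis=1)
               .groupby(df['name']).shift())
print(df)
```

This assumes rows are already ordered by day within each name, as in the question.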
I have a data sheet with about 1700 columns and 100 rows of data with a unique identifier. It is survey data: every employee of an organization answers the same 9 questions, but the answers are compiled into one row per organization. Is there a way in python/pandas to stack this data vertically, as opposed to the elongated format along the x-axis it is in now? I am currently cutting and pasting.
You can reshape the underlying numpy array and reindex with proper companies:
import numpy as np
import pandas as pd

# sample data, assuming index is the company
df = pd.DataFrame(np.arange(36).reshape(2,-1))
# new index
idx = df.index.repeat(df.shape[1]//9)
# new data:
new_df = pd.DataFrame(df.values.reshape(-1,9), index=idx)
Output:
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
0 9 10 11 12 13 14 15 16 17
1 18 19 20 21 22 23 24 25 26
1 27 28 29 30 31 32 33 34 35
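The same reshape, runnable end to end with named question columns (the q1..q9 names and the org index label are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical small version of the sheet: 2 organizations, 2 employees
# each answering the same 9 questions, flattened into 18 columns
df = pd.DataFrame(np.arange(36).reshape(2, -1))

per_org = df.shape[1] // 9          # employees per organization
long_df = pd.DataFrame(df.values.reshape(-1, 9),
                       index=df.index.repeat(per_org),
                       columns=[f'q{i}' for i in range(1, 10)])
long_df.index.name = 'org'
print(long_df)
```

Each organization now occupies per_org rows, one per employee, with one column per question.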