transform pandas dataframe and combine rows - python

I have a pandas dataframe which looks like that:
|---------------------|------------------|------------------|
| student-id | subject-id | grade |
|---------------------|------------------|------------------|
| 1 | 1234 | 4 |
|---------------------|------------------|------------------|
| 1 | 2234 | 3 |
|---------------------|------------------|------------------|
| 1 | 3234 | 3 |
|---------------------|------------------|------------------|
| 2 | 1234 | 2 |
|---------------------|------------------|------------------|
| 2 | 2234 | 1 |
|---------------------|------------------|------------------|
| 2 | 3234 | 4 |
|---------------------|------------------|------------------|
now I want to transform it, that I get only one row for every student-id with every grade from this student in this row like that:
|---------------------|------------------|------------------|------------------|
| student-id | grade 1 | grade 2 | grade 3 |
|---------------------|------------------|------------------|------------------|
| 1 | 4 | 3 | 3 |
|---------------------|------------------|------------------|------------------|
| 2 | 2 | 1 | 4 |
|---------------------|------------------|------------------|------------------|
thx for help!

you may drop subject-id by del df['column_name'] and then df.groupBy['student-id'] will give grades with respect to student-id.

Related

pandas group by category and assign a bin with pd.cut

I have a dataframe like the following:
+-------+-------+
| Group | Price |
+-------+-------+
| A | 2 |
| B | 3 |
| A | 1 |
| C | 4 |
| B | 2 |
+-------+-------+
I would like to create a column, that would give me the in which range (if I divided each group into 4 intervals) my price value is within each group.
+-------+-------+--------------------------+
| Group | Price | Range |
+-------+-------+--------------------------+
| A | 2 | [1-2] |
| B | 3 | [2-3] |
| A | 1 | [0-1] |
| C | 4 | [0-4] |
| B | 2 | [0-2] |
+-------+-------+--------------------------+
Anyone has any idea by using pandas pd.cut and groupby operations?
Thanks
You can pass pd.cut to groupby():
df['Range'] = df.groupby('Group')['Price'].transform(pd.cut, bins=4)

Fetch values corresponding to id of each row python

Is is possible to fetch column containing values corresponding to an id column?
Example:-
df1
| ID | Value | Salary |
|:---------:--------:|:------:|
| 1 | amr | 34 |
| 1 | ith | 67 |
| 2 | oaa | 45 |
| 1 | eea | 78 |
| 3 | anik | 56 |
| 4 | mmkk | 99 |
| 5 | sh_s | 98 |
| 5 | ahhi | 77 |
df2
| ID | Dept |
|:---------:--------:|
| 1 | hrs |
| 1 | cse |
| 2 | me |
| 1 | ece |
| 3 | eee |
Expected Output
| ID | Dept | Value |
|:---------:--------:|----------:|
| 1 | hrs | amr |
| 1 | cse | ith |
| 2 | me | oaa |
| 1 | ece | eea |
| 3 | eee | anik |
I want to fetch each values in the 'Value' column corresponding to values in df2's ID column. And create column containing 'Values' in df2. The number of rows in the two dfs are not the same. I have tried
this
Not worked
IIUC , you can try df.merge after assigning a helper column by doing groupby+cumcount on ID:
out = (df1.assign(k=df1.groupby("ID").cumcount())
.merge(df2.assign(k=df2.groupby("ID").cumcount()),on=['ID','k'])
.drop("k",1))
print(out)
ID Value Dept
0 1 Amr hrs
1 1 ith cse
2 2 oaa me
3 1 eea ece
4 3 anik eee
is this what you want to do?
df1.merge(df2, how='inner',on ='ID')
Since you have duplicated IDs in both dfs, but these are ordered, try:
df1 = df1.drop(columns="ID")
df3 = df2.merge(df1, left_index=True, right_index=True)

How to get the column values of a Dataframe into another dataframe as a new column after matching the values in columns that both dataframes have?

I'm trying to create a new column in a DataFrame and storing it with values stored in a different dataframe by first comparing the values of columns that both dataframes have. For example:
df1 >>>
| name | team | week | dates | interceptions | pass_yds | rating |
| ---- | ---- | -----| ---------- | ------------- | --------- | -------- |
| maho | KC | 1 | 2020-09-10 | 0 | 300 | 105 |
| went | PHI | 1 | 2020-09-13 | 2 | 225 | 74 |
| lock | DEN | 1 | 2020-09-14 | 0 | 150 | 89 |
| dris | DEN | 2 | 2020-09-20 | 1 | 220 | 95 |
| went | PHI | 2 | 2020-09-20 | 2 | 250 | 64 |
| maho | KC | 2 | 2020-09-21 | 1 | 245 | 101 |
df2 >>>
| name | team | week | catches | rec_yds | rec_tds |
| ---- | ---- | -----| ------- | ------- | ------- |
| ertz | PHI | 1 | 5 | 58 | 1 |
| fant | DEN | 2 | 6 | 79 | 0 |
| kelc | KC | 2 | 8 | 105 | 1 |
| fant | DEN | 1 | 3 | 29 | 0 |
| kelc | KC | 1 | 6 | 71 | 1 |
| ertz | PHI | 2 | 7 | 91 | 2 |
| goed | PHI | 2 | 2 | 15 | 0 |
I want to create a dates column in df2 with the values of the dates stored in the dates column in df1 after matching the teams and the weeks columns. After the matching, df2 in this example should look something like this:
df2 >>>
| name | team | week | catches | rec_yds | rec_tds | dates |
| ---- | ---- | -----| ------- | ------- | ------- | ---------- |
| ertz | PHI | 1 | 5 | 58 | 1 | 2020-09-13 |
| fant | DEN | 2 | 6 | 79 | 0 | 2020-09-20 |
| kelc | KC | 2 | 8 | 105 | 1 | 2020-09-20 |
| fant | DEN | 1 | 3 | 29 | 0 | 2020-09-14 |
| kelc | KC | 1 | 6 | 71 | 1 | 2020-09-10 |
| ertz | PHI | 2 | 7 | 91 | 2 | 2020-09-20 |
| goed | PHI | 2 | 2 | 15 | 0 | 2020-09-20 |
I'm looking for an optimal solution. I've already tried nested for loops and comparing the week and team columns from both dataframes together but that hasn't worked. At this point I'm all out of ideas. Please help!
Disclaimer: The actual DataFrames I'm working with are a lot larger. They have a lot more rows, columns, and values (i.e. a lot more teams in the team columns, a lot more dates in the dates columns, and a lot more weeks in the week columns)

Transform a Pandas dataframe in a pandas with multicolumns

I have the following pandas dataframe, where the column id is the dataframe index
+----+-----------+------------+-----------+------------+
| | price_A | amount_A | price_B | amount_b |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
And I want to convert this dataframe in to a multi column data frame, that looks like this
+----+-----------+------------+-----------+------------+
| | A | B |
+----+-----------+------------+-----------+------------+
| id | price | amount | price | amount |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
I've tried transforming my old pandas dataframe in to a dict this way:
dict = {"A": df[["price_a","amount_a"]], "B":df[["price_b", "amount_b"]]}
df = pd.DataFrame(dict, index=df.index)
But I had no success, how can I do that?
Try renaming columns manually:
df.columns=pd.MultiIndex.from_tuples([x.split('_')[::-1] for x in df.columns])
df.index.name='id'
Output:
A B b
price amount price amount
id
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495
You can split the column names on the underscore and convert to a tuple. Once you map each split column name to a tuple, pandas will convert the Index to a MultiIndex for you. From there we just need to call swaplevel to get the letter level to come first and reassign to the dataframe.
note: in my input dataframe I replaced the column name "amount_b" with "amount_B" because it lined up with your expected output so I assumed it was a typo
df.columns = df.columns.str.split("_", expand=True).swaplevel()
print(df)
A B
price amount price amount
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495

How to create query that's update columns value using sql server or pandas function even

Now I have a table something like the below table:
esn_missing_in_DF_umts
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| cell_name | n_cell_name | source_vendor | target_vendor | source_rnc | target_rnc |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 1 | 8 | x | y | | |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 2 | 5 | x | x | | |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 3 | 6 | x | x | | |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 4 | 9 | x | y | | |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 5 | 10 | x | y | | |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 6 | 11 | x | y | | |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 7 | 12 | x | y | | |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
Now I have two columns are empty in sqlServer or dataframe the source_rnc and the target_rnc:
Here's the other two tables I want to update the two columns from
esn_umts_intra_sho
|---------------------|------------------|------------------|
| ucell | urelation | ucell_rnc |
|---------------------|------------------|------------------|
| 13 | 5 | abc567 |
|---------------------|------------------|------------------|
| 8 | 6 | abc568 |
|---------------------|------------------|------------------|
| 14 | 8 | abc569 |
|---------------------|------------------|------------------|
| 7 | 9 | abc570 |
|---------------------|------------------|------------------|
| 16 | 10 | abc571 |
|---------------------|------------------|------------------|
| 5 | 11 | abc572 |
|---------------------|------------------|------------------|
| 17 | 12 | abc573 |
|---------------------|------------------|------------------|
| 10 | 9 | abc574 |
|---------------------|------------------|------------------|
| 9 | 17 | abc575 |
|---------------------|------------------|------------------|
| 12 | 11 | abc576 |
|---------------------|------------------|------------------|
| 11 | 12 | abc577 |
|---------------------|------------------|------------------|
df_umts_carrier
|---------------------|------------------|
| cell_name_umts | rnc |
|---------------------|------------------|
| 1 | xyz123 |
|---------------------|------------------|
| 2 | xyz124 |
|---------------------|------------------|
| 3 | xyz125 |
|---------------------|------------------|
| 4 | xyz126 |
|---------------------|------------------|
| 5 | xyz127 |
|---------------------|------------------|
| 6 | xyz128 |
|---------------------|------------------|
| 7 | xyz129 |
|---------------------|------------------|
So Not I want to update the source_rnc and target_rnc through those two tables esn_umts_intra_sho and df_umts_carrier
So I imagine that the query could be like this
UPDATE [toolDB].[dbo].[esn_missing_in_DF_umts]
SET [toolDB].[dbo].[esn_missing_in_DF_umts].[target_rnc] = CASE WHEN [toolDB].[dbo].[esn_missing_in_DF_umts].[target_vendor] = 'HUA' THEN [toolDB].[dbo].[df_umts_carrier].[rnc]
FROM [toolDB].[dbo].[esn_missing_in_DF_umts]
INNER JOIN [toolDB].[dbo].[df_umts_carrier]
ON [n_cell_name] = [cell_name_umts]
ELSE
UPDATE [toolDB].[dbo].[esn_missing_in_DF_umts]
SET [toolDB].[dbo].[esn_missing_in_DF_umts].[target_rnc] = [toolDB].[dbo].[esn_umts_intra_sho].[ucell_rnc]
From [toolDB].[dbo].[esn_missing_in_DF_umts] INNER JOIN [toolDB].[dbo].[esn_umts_intra_sho]
ON [n_cell_name] = [ucell]
I want the final output to be somthing like this:
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| cell_name | n_cell_name | source_vendor | target_vendor | source_rnc | target_rnc |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 1 | 8 | x | y | xyz123 | abc568 |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 2 | 5 | x | x | xyz124 | xyz127 |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 3 | 6 | x | x | xyz125 | xyz128 |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 4 | 9 | x | y | xyz126 | abc575 |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 5 | 10 | x | y | xyz127 | abc574 |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 6 | 11 | x | y | xyz128 | abc576 |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
| 7 | 12 | x | y | xyz129 | abc577 |
|---------------------|------------------|---------------------|------------------|------------------|------------------|
I tried even with pandas but doesn't work...
I wish someone help me.
The best thing is to make the query as if you were writing a SELECT statement with the Case clause in it. Once it works as expected, you can amend it for your update.
So in this example, if the main tables Column = bla, then get the data from the first joined table, else the other table.
Quick amendment Make sure its all rows you are happy to update, else remember to put in a where statement. That's why its best to work out your logic in a SELECT and move on from there.
I think you want something like this:
UPDATE [toolDB].[dbo].[esn_missing_in_DF_umts]
SET [toolDB].[dbo].[esn_missing_in_DF_umts].[target_rnc] = (CASE WHEN UMT.target_vendor = 'HUA' THEN carrier.rnc ELSE SHO.ucell_rnc END )
FROM [toolDB].[dbo].[esn_missing_in_DF_umts] UMT
LEFT JOIN [toolDB].[dbo].[df_umts_carrier] carrier ON UMT.n_cell_name = carrier.cell_name_umts
LEFT JOIN [toolDB].[dbo].[esn_umts_intra_sho] SHO ON UMT.n_cell_name = SHO.ucell

Categories