How to create duplicate rows based on columns? - python

Consider this data frame
| order number | Item               | column 0 | column 1 | Column 2 |
|--------------|--------------------|----------|----------|----------|
| 12           | [abcd][efgh]       | [abcd]   | [efgh]   |          |
| 34           | [mnop]             | [mnop]   |          |          |
| 56           | [xyzz][zzyx][mnoq] | [xyzz]   | [zzyx]   | [mnoq]   |
How do I turn it into this?
| order number | Item               | column 0 |
|--------------|--------------------|----------|
| 12           | [abcd][efgh]       | [abcd]   |
| 12           | [abcd][efgh]       | [efgh]   |
| 34           | [mnop]             | [mnop]   |
| 56           | [xyzz][zzyx][mnoq] | [xyzz]   |
| 56           | [xyzz][zzyx][mnoq] | [zzyx]   |
| 56           | [xyzz][zzyx][mnoq] | [mnoq]   |
This is my first time posting on Stack Overflow, so apologies for any mistakes. I've tried searching the blogs but haven't had any luck with this kind of problem. Any help is really appreciated.
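One way to get there (a minimal sketch, not from the original post; the inline frame and the column name "value" are just for illustration) is to melt the three value columns into rows and drop the empty cells:

```python
import pandas as pd

df = pd.DataFrame({
    "order number": [12, 34, 56],
    "Item": ["[abcd][efgh]", "[mnop]", "[xyzz][zzyx][mnoq]"],
    "column 0": ["[abcd]", "[mnop]", "[xyzz]"],
    "column 1": ["[efgh]", None, "[zzyx]"],
    "Column 2": [None, None, "[mnoq]"],
})

# Melt the per-item columns into rows, drop the empty cells, and tidy up.
out = (
    df.melt(id_vars=["order number", "Item"], value_name="value")
      .dropna(subset=["value"])
      .drop(columns="variable")
      .rename(columns={"value": "column 0"})
      .sort_values("order number", kind="stable")
      .reset_index(drop=True)
)
print(out)
```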

Related

How to get the column values of a Dataframe into another dataframe as a new column after matching the values in columns that both dataframes have?

I'm trying to create a new column in a DataFrame and fill it with values stored in a different DataFrame, by first matching the values of columns that both DataFrames share. For example:
df1 >>>
| name | team | week | dates | interceptions | pass_yds | rating |
| ---- | ---- | -----| ---------- | ------------- | --------- | -------- |
| maho | KC | 1 | 2020-09-10 | 0 | 300 | 105 |
| went | PHI | 1 | 2020-09-13 | 2 | 225 | 74 |
| lock | DEN | 1 | 2020-09-14 | 0 | 150 | 89 |
| dris | DEN | 2 | 2020-09-20 | 1 | 220 | 95 |
| went | PHI | 2 | 2020-09-20 | 2 | 250 | 64 |
| maho | KC | 2 | 2020-09-21 | 1 | 245 | 101 |
df2 >>>
| name | team | week | catches | rec_yds | rec_tds |
| ---- | ---- | -----| ------- | ------- | ------- |
| ertz | PHI | 1 | 5 | 58 | 1 |
| fant | DEN | 2 | 6 | 79 | 0 |
| kelc | KC | 2 | 8 | 105 | 1 |
| fant | DEN | 1 | 3 | 29 | 0 |
| kelc | KC | 1 | 6 | 71 | 1 |
| ertz | PHI | 2 | 7 | 91 | 2 |
| goed | PHI | 2 | 2 | 15 | 0 |
I want to create a dates column in df2, filled with the values stored in the dates column of df1, after matching on the team and week columns. After the matching, df2 in this example should look something like this:
df2 >>>
| name | team | week | catches | rec_yds | rec_tds | dates |
| ---- | ---- | -----| ------- | ------- | ------- | ---------- |
| ertz | PHI | 1 | 5 | 58 | 1 | 2020-09-13 |
| fant | DEN | 2 | 6 | 79 | 0 | 2020-09-20 |
| kelc | KC | 2 | 8 | 105 | 1 | 2020-09-20 |
| fant | DEN | 1 | 3 | 29 | 0 | 2020-09-14 |
| kelc | KC | 1 | 6 | 71 | 1 | 2020-09-10 |
| ertz | PHI | 2 | 7 | 91 | 2 | 2020-09-20 |
| goed | PHI | 2 | 2 | 15 | 0 | 2020-09-20 |
I'm looking for an optimal solution. I've already tried nested for loops, comparing the week and team columns of both DataFrames, but that hasn't worked. At this point I'm all out of ideas. Please help!
Disclaimer: The actual DataFrames I'm working with are a lot larger. They have a lot more rows, columns, and values (i.e. a lot more teams in the team columns, a lot more dates in the dates columns, and a lot more weeks in the week columns)
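A left merge on the shared keys does this without loops (a minimal sketch, assuming df1 and df2 are the frames above):

```python
# Keep only the key columns plus dates, dropping duplicate key rows so the
# merge stays one-to-many, then attach the dates to df2 on team + week.
dates = df1[["team", "week", "dates"]].drop_duplicates(subset=["team", "week"])
df2 = df2.merge(dates, on=["team", "week"], how="left")
print(df2)
```

This also scales to large frames far better than nested loops, since merge does a vectorized join instead of row-by-row Python comparisons.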

Pivot multiple columns from row to column

I have a PySpark dataframe which looks like this:
| id | name | policy | payment_name | count |
|------|--------|------------|--------------|-------|
| 2 | two | 0 | Hybrid | 58 |
| 2 | two | 1 | Hybrid | 2 |
| 5 | five | 1 | Excl | 13 |
| 5 | five | 0 | Excl | 70 |
| 5 | five | 0 | Agen | 811 |
| 5 | five | 1 | Agen | 279 |
| 5 | five | 1 | Hybrid | 600 |
| 5 | five | 0 | Hybrid | 2819 |
I would like each combination of policy and payment_name to become a column holding the respective count (reducing down to one row per id).
Output would look like this:
| id | name | no_policy_hybrid | no_policy_excl | no_policy_agen | policy_hybrid | policy_excl | policy_agen |
|----|------|------------------|----------------|----------------|---------------|-------------|-------------|
| 2 | two | 58 | 0 | 0 | 2 | 0 | 0 |
| 5 | five | 2819 | 70 | 811 | 600 | 13 | 279 |
In cases where there is no combination, we can default it to 0; i.e., id 2 has no combination including payment_name Excl, so it is set to 0 in the example output.
To pivot the table, you would first need a grouping column to combine the policy and the payment_name.
df = df.withColumn("groupingCol", udf("{}_{}".format)("policy", "payment_name"))
When you have that, you can group by the id and name columns and pivot the grouping column.
df.groupBy("id", "name").pivot("groupingCol").agg(F.max("count"))
That should return the correct table columns.
+---+----+------+------+--------+------+------+--------+
| id|name|0_Agen|0_Excl|0_Hybrid|1_Agen|1_Excl|1_Hybrid|
+---+----+------+------+--------+------+------+--------+
| 5|five| 811| 70| 2819| 279| 13| 600|
| 2| two| null| null| 58| null| null| 2|
+---+----+------+------+--------+------+------+--------+
To get the same column names as in your example, you can start by changing the content of the policy column to policy and no_policy, like this:
df = df.withColumn("policy", when(col("policy") == 1, "policy").otherwise("no_policy"))
This is how you would replace the missing values with 0:
df = df.na.fill(0)
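Putting the steps together in order (a sketch, assuming a SparkSession is available; it uses concat_ws for the grouping column instead of a udf, which works here because both columns are strings by that point, and lower-cases payment_name so the column names match the example):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(2, "two", 0, "Hybrid", 58), (2, "two", 1, "Hybrid", 2),
     (5, "five", 1, "Excl", 13), (5, "five", 0, "Excl", 70),
     (5, "five", 0, "Agen", 811), (5, "five", 1, "Agen", 279),
     (5, "five", 1, "Hybrid", 600), (5, "five", 0, "Hybrid", 2819)],
    ["id", "name", "policy", "payment_name", "count"],
)

# 1. Rename the policy values first, so the pivoted columns get readable names.
df = df.withColumn("policy", F.when(F.col("policy") == 1, "policy").otherwise("no_policy"))

# 2. Build the combined grouping column, pivot it, and fill the gaps with 0.
df = df.withColumn("groupingCol", F.concat_ws("_", "policy", F.lower("payment_name")))
df = df.groupBy("id", "name").pivot("groupingCol").agg(F.max("count")).na.fill(0)
df.show()
```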

Joining two dataframes based on the columns of one of them and the rows of another

Sorry if the title doesn't make sense, but I wasn't sure how else to explain it. Here's an example of what I'm talking about.
df_1
| ID | F_Name | L_Name |
|----|---------|---------|
| 0 | | |
| 1 | | |
| 2 | | |
| 3 | | |
df_2
| ID | Name_Type | Name |
|----|------------|--------|
| 0 | First | Bob |
| 0 | Last | Smith |
| 1 | First | Maria |
| 1 | Last | Garcia |
| 2 | First | Bob |
| 2 | Last | Stoops |
| 3 | First | Joe |
df_3 (result)
| ID | F_Name | L_Name |
|----|---------|---------|
| 0 | Bob | Smith |
| 1 | Maria | Garcia |
| 2 | Bob | Stoops |
| 3 | Joe | |
Any and all advice is welcome! Thank you.
I guess that what you want to do is reshape your second DataFrame to have the same structure as the first one, right?
You can use the pivot method to achieve it:
df_3 = df_2.pivot(index="ID", columns="Name_Type", values="Name")
Then, you can rename the columns and clear the columns' name:
df_3 = df_3.rename(columns={"First": "F_Name", "Last": "L_Name"})
df_3.columns.name = None

Creating a MultiIndex from 2 rows with duplicated columns

I have an excel file that I read with pandas and convert to a dataframe. Here is a sample of the dataframe:
| | salads_count | salads_count | salads_count | carrot_counts | carrot_counts | carrot_counts |
|---------------|--------------|--------------|--------------|---------------|---------------|---------------|
| | 01.2016 | 02.2016 | 03.2016 | 01.2016 | 02.2016 | 03.2016 |
| farm_location | | | | | | |
| sweden | 42 | 41 | 43 | 52 | 51 | 53 |
It's very weird formatting, but that's what's in the Excel file. At first, the first two rows are not even in MultiIndex form.
I managed to get it into a MultiIndex with the code below, but some columns are duplicated (salads_count appears several times, for example):
import pandas as pd
arrays = [df.columns.tolist(), df.iloc[0].tolist()]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
df.columns = index
I would like to convert the columns to a MultiIndex, something like this:
| | salads_count | | | carrot_counts | | |
|---------------|--------------|---------|---------|---------------|---------|---------|
| | 01.2016 | 02.2016 | 03.2016 | 01.2016 | 02.2016 | 03.2016 |
| farm_location | | | | | | |
| sweden | 42 | 41 | 43 | 52 | 51 | 53 |
Or, even better, like this:
|               | 01.2016       |              | 02.2016       |              | 03.2016       |              |
|---------------|---------------|--------------|---------------|--------------|---------------|--------------|
|               | carrot_counts | salads_count | carrot_counts | salads_count | carrot_counts | salads_count |
| farm_location |               |              |               |              |               |              |
| sweden        | 52            | 42           | 51            | 41           | 53            | 43           |
How can I do this?
The best approach is to convert the columns to a MultiIndex directly in read_excel, via the parameter header=[0,1]:
df = pd.read_excel(file, header=[0,1])
Then use swaplevel with sort_index:
df = df.swaplevel(0,1, axis=1).sort_index(axis=1, level=0)
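A small sketch of the effect, building the two-level columns in memory instead of reading the Excel file (layout assumed from the sample above):

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["salads_count", "carrot_counts"], ["01.2016", "02.2016", "03.2016"]]
)
df = pd.DataFrame(
    [[42, 41, 43, 52, 51, 53]],
    index=pd.Index(["sweden"], name="farm_location"),
    columns=columns,
)

# Swap the two header levels, then sort so each date groups its counts together.
df = df.swaplevel(0, 1, axis=1).sort_index(axis=1, level=0)
print(df)
```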

Parsing out indices and values from a pandas MultiIndex DataFrame

I have a dataframe in a similar format to this:
+--------+--------+----------+------+------+------+------+
| | | | | day1 | day2 | day3 |
+--------+--------+----------+------+------+------+------+
| id_one | id_two | id_three | date | | | |
| 18273 | 50 | 1 | 3 | 9 | 11 | 3 |
| | | | 4 | 26 | 27 | 68 |
| | | | 5 | 92 | 25 | 4 |
| | | | 6 | 60 | 72 | 83 |
| | 60 | 2 | 5 | 69 | 93 | 84 |
| | | | 6 | 69 | 30 | 12 |
| | | | 7 | 65 | 65 | 59 |
| | | | 8 | 57 | 88 | 59 |
| | 70 | 3 | 5 | 22 | 95 | 7 |
| | | | 6 | 40 | 24 | 20 |
| | | | 7 | 73 | 81 | 57 |
| | | | 8 | 43 | 8 | 66 |
+--------+--------+----------+------+------+------+------+
I am trying to create a tuple that contains id_one, id_two, and the values that each grouping contains.
To test this, I am simply trying to print the ids and values like this:
for id_two, data in df.head(100).groupby(level='id_two'):
    print(id_two, data.values.ravel())
This gives me the id_two and the data exactly as it should.
I run into problems when I try to incorporate id_one. I tried this, but was met with the error ValueError: need more than 2 values to unpack:
for id_one, id_two, data in df.head(100).groupby(level='id_two'):
    print(id_one, id_two, data.values.ravel())
How can I print id_one, id_two and the data?
You can pass a list of level names to the level parameter:
df.head(100).groupby(level=['id_one', 'id_two'])
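With two levels, each group key comes back as a tuple, so the loop unpacks like this (a sketch assuming the frame above):

```python
for (id_one, id_two), data in df.head(100).groupby(level=['id_one', 'id_two']):
    print(id_one, id_two, data.values.ravel())
```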
