Combining MultiIndex columns with similar root names in Pandas/Python

I have a MultiIndex dataframe with the top level columns named:
Col1_1 | Col1_2 | Col2_1 | Col2_2 | ... |
I'm looking to combine Col1_1 with Col1_2 as Col1. I could also do this before creating the MultiIndex, but the original data is more drawn out as:
Col1_1.aspect1 | Col1_1.aspect2 | Col1_2.aspect1 | Col1_2.aspect2 | ... |
where 'aspect1' and 'aspect2' become subcolumns in the MultiIndex.
Please let me know if I can clarify anything, and many thanks in advance.
The expected result combines the two as just Col1; any number of ways is fine, including stacking/concatenating the data, outputting a summary stat, e.g. mean(), etc.

You can use groupby and apply an aggregation function such as mean against it.
Group along axis 1 (the columns) at level 1 (the lower level of the MultiIndex columns); this groups the matching subcolumns across all samples. Then simply take the mean, if that is what you want to achieve:
df.groupby(level=1, axis=1).mean()
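For context, here is a minimal runnable sketch of that approach; the column and aspect names are illustrative stand-ins, not from the original post. Note that axis=1 grouping is deprecated in pandas 2.x, so the transpose-based spelling below is the forward-compatible equivalent:

import pandas as pd

# Hypothetical data: two samples (Col1_1, Col1_2), each with two aspects.
cols = pd.MultiIndex.from_product([['Col1_1', 'Col1_2'], ['aspect1', 'aspect2']])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=cols)

# Grouping at level 1 averages matching aspect columns across the samples;
# transposing first avoids the deprecated groupby(axis=1) form.
combined = df.T.groupby(level=1).mean().T
print(combined)  # one aspect1 and one aspect2 column, averaged over samples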

Related

How to list values from column A where column B is NaN?

I have a dataframe (let's call it df) that looks a bit like this:
Offer | Cancelled | Restriction
------|-----------|------------
1 | N | A
2 | Y | B
3 | N | NaN
4 | Y | NaN
I have the following bit of code, which creates a list of all offers that have been cancelled:
cancelled = list('000'+df.loc[(df['Cancelled']=='Y'),'Offer'].astype(str))
What I now want to do is to adapt this to create a list of all offers where the 'Restriction' column is not NaN. So my desired result would look like this:
['0001','0002']
Does anyone know how to do this please?
You were almost there. Just add the extra condition that the Restriction column is not NaN.
list('000'+df.loc[(df['Restriction'].notna()) & (df['Cancelled'] == 'Y'), 'Offer'].astype(str))
If you just want to filter on Restriction not being NaN, which is what the desired output shows, drop the Cancelled condition, as #Henry Ecker noted in the comments.
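A minimal runnable sketch of that simpler filter, reconstructing the example frame from the question; the NaN check alone yields the desired ['0001', '0002']:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Offer': [1, 2, 3, 4],
                   'Cancelled': ['N', 'Y', 'N', 'Y'],
                   'Restriction': ['A', 'B', np.nan, np.nan]})

# Keep offers whose Restriction is not NaN, then zero-pad as before.
restricted = list('000' + df.loc[df['Restriction'].notna(), 'Offer'].astype(str))
print(restricted)  # ['0001', '0002']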

Create a new column based off values in two others

I'm trying to merge two columns. Merging is not working out, and I've been reading and asking questions for two days trying to find a solution, so I'm going to go a different route.
Since I have to change the column name after I merge anyway, why not just create a new column and fill it based on the other two?
So I have column A, B and C now.
C is a blank column.
Column A has values for most rows, but not all. In the case that column A doesn't have a value, I want to use column B's value. Whichever of the two applies should go in column C.
Please keep in mind that when column A doesn't have a value a "-" was put in its place (hence why I'm having a horrendous time trying to merge these columns).
I have converted the "-" to NaN but then the .fillna function doesn't work and I'm not sure why.
I'm thinking I have to write a for loop and an if statement to accomplish this, although I feel like there is a function that would build a new column from the other two columns' values.
| A | B |
| 34 | 35 |
| 37 | - |
| - | 32 |
| 94 | 92 |
| - | 91 |
| 47 | - |
Desired Result
|C |
|34|
|37|
|32|
|94|
|91|
|47|
Does this answer your question:
df['A'] = df.apply(lambda x: x['B'] if x['A'] == '-' else x['A'], axis=1)
df['A'] = df.apply(lambda x: x['B'] if pd.isna(x['A']) else x['A'], axis=1)
(Note: testing x['A'] == np.NaN does not work, because NaN never compares equal to itself; pd.isna is the correct check.)
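As a hedged alternative to the row-wise apply, the same result can be had with vectorized replace/fillna; this sketch reconstructs the question's frame with '-' as the placeholder:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [34, 37, '-', 94, '-', 47],
                   'B': [35, '-', 32, 92, 91, '-']})

# Replace the placeholder with NaN, then fall back to column B.
# fillna returns a new Series, so the result must be assigned back;
# a missing assignment is a likely reason fillna appeared to do nothing.
df['C'] = df['A'].replace('-', np.nan).fillna(df['B'])
print(df['C'].tolist())  # [34, 37, 32, 94, 91, 47]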

Converting a dataframe of 2 columns to a series of 2 columns in Python

I am trying to work on some time series data and am quite new to pandas dataframe. I have a dataframe with two columns as below:
+---+-----------------------+-------+--+
| | 0 | 1 | |
+---+-----------------------+-------+--+
| 1 | 2018-08-02 23:00:00 | 456.8 | |
| 2 | 2018-08-02 23:01:00 | 457.9 | |
+---+-----------------------+-------+--+
I am trying to convert it into a series with two columns, as it is in the dataframe. How can this be done? pd.Series converts the dataframe into a series of one column.
There is no such thing as a pandas Series with two columns. My guess is that you want to generate a Series with column 0 as the index and column 1 as the values. You can get that by setting the index and extracting the column of interest (assuming your DataFrame is in df):
df.set_index(0)[1]
As stated in the comments, pd.Series(df.col1, df.col2) produces a Series with NaNs. The reason is that the Series is reindexed with the object passed as the index argument. The current dev docs clarify:
If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values.
To circumvent reindexing this can be done:
pd.Series(df[1].values, index=df[0])
Since df[1].values is a NumPy array rather than a dict-like pd.Series, nothing is reindexed and df[0] is used as the index as-is.
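A minimal runnable sketch of both constructions, with the data values reconstructed from the question:

import pandas as pd

# Column 0 holds timestamps, column 1 the readings.
df = pd.DataFrame({0: pd.to_datetime(['2018-08-02 23:00:00', '2018-08-02 23:01:00']),
                   1: [456.8, 457.9]})

s1 = df.set_index(0)[1]                    # timestamps as index, readings as values
s2 = pd.Series(df[1].values, index=df[0])  # same result, built without reindexing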

Spark DataFrame aggregate and groupby multiple columns while retaining order

I have the following data
id | value1 | value2
-----------------------
1 A red
1 B red
1 C blue
2 A blue
2 B blue
2 C green
The result I need is:
id | values
---------------------------------
1 [[A,red],[B,red][C,blue]]
2 [[A,blue],[B,blue][C,green]]
My approach so far is to groupby and aggregate value1 and value2 into separate arrays and then merge them together, as described in Combine PySpark DataFrame ArrayType fields into single ArrayType field:
df.groupBy(["id"]).agg(*[F.collect_list("value1"), F.collect_list("value2")])
However, since order is not guaranteed in collect_list() (see here), how can I make sure value1 and value2 are both matched to the correct values? The two lists could come back in different orders, and merging them afterwards would pair up the wrong values.
As commented by #Raphael, you can combine the value1 and value2 columns into a single struct-type column first, and then collect_list:
import pyspark.sql.functions as F
(df.withColumn('values', F.struct(df.value1, df.value2))
   .groupBy('id')
   .agg(F.collect_list('values').alias('values'))).show()
+---+--------------------+
| id| values|
+---+--------------------+
| 1|[[A,red], [B,red]...|
| 2|[[A,blue], [B,blu...|
+---+--------------------+
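Because each struct travels as a unit, every value1 stays paired with its value2 regardless of how the list ends up ordered. If the order inside each collected list also matters, one hedged follow-up (not part of the original answer) is to sort the array after aggregation; sort_array orders structs by their first field, value1 here, which reproduces the A, B, C ordering of the expected output:

(df.withColumn('values', F.struct(df.value1, df.value2))
   .groupBy('id')
   .agg(F.sort_array(F.collect_list('values')).alias('values'))).show()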

Pivot or transpose in SQL or Pandas

I have a table of the form:
item_code | attribute | time_offset | mean | median | description | ...
The attribute column has one of 40 possible values and the time_offset column can be an integer from 0 to 20.
I want to transform this table to a wide one of the form:
item_code | <attribute1>_<time_offset1>_mean | <attribute1>_<time_offset1>_median | <attribute1>_<time_offset1>_description | <attribute1>_<time_offset1>_... | <attribute2>...
I can do this either in SQL or in Pandas but I'm having difficulty with the fact that some of the columns are not numeric, so it is hard to come up with an aggregation function for them.
I can guarantee that each combination of item_code, attribute and time_offset will have only one row, so I do not need an aggregation function. Is there something like a transpose operation that will allow me to do what I am looking for?
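Since no aggregation is needed, a pure reshape works. A hedged pandas sketch, assuming a DataFrame df with the columns named above and exactly one row per (item_code, attribute, time_offset) combination:

import pandas as pd

# set_index + unstack reshapes without aggregating, so the non-numeric
# columns (e.g. description) survive intact.
wide = (df.set_index(['item_code', 'attribute', 'time_offset'])
          .unstack(['attribute', 'time_offset']))

# Flatten the resulting MultiIndex columns into <attribute>_<time_offset>_<stat> names.
wide.columns = [f'{attr}_{off}_{stat}' for stat, attr, off in wide.columns]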
