Python: Replace None value with value of other column [duplicate] - python

I'm trying to fill a column of a dataframe from another dataframe based on conditions. Let's say my first dataframe is df1 and the second is named df2.
# df1 is described below:
+------+------+
| Col1 | Col2 |
+------+------+
| A    | 1    |
| B    | 2    |
| C    | 3    |
| A    | 1    |
+------+------+
And
# df2 is described below:
+------+------+
| Col1 | Col2 |
+------+------+
| A    | NaN  |
| B    | NaN  |
| D    | NaN  |
+------+------+
Each distinct value of Col1 has its own id number (in Col2), so what I want is to fill the NaN values in df2.Col2 where df2.Col1 == df1.Col1.
So that my second dataframe will look like:
# df2 :
+------+------+
| Col1 | Col2 |
+------+------+
| A    | 1    |
| B    | 2    |
| D    | NaN  |
+------+------+
I'm using Python 2.7

Use drop_duplicates with set_index and combine_first:
df = df2.set_index('Col1').combine_first(df1.drop_duplicates().set_index('Col1')).reset_index()
If duplicates should be detected using the Col1 key column only:
df = df2.set_index('Col1').combine_first(df1.drop_duplicates(subset=['Col1']).set_index('Col1')).reset_index()
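For reference, here is a minimal runnable sketch of the combine_first approach on the example frames (the frame construction and the final .loc restriction are my own additions, mirroring the question's data):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Col1": ["A", "B", "C", "A"], "Col2": [1, 2, 3, 1]})
df2 = pd.DataFrame({"Col1": ["A", "B", "D"], "Col2": [np.nan] * 3})

# combine_first keeps df2's values and fills its NaN holes from df1, matching
# rows on the shared Col1 index. Note that it returns the union of both indexes,
# so C (present only in df1) would show up too; restricting to df2's keys keeps
# the desired shape.
out = (df2.set_index("Col1")
          .combine_first(df1.drop_duplicates().set_index("Col1"))
          .loc[df2["Col1"]]
          .reset_index())
print(out)
#   Col1  Col2
# 0    A   1.0
# 1    B   2.0
# 2    D   NaN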

Here is a solution with the filter df1.Col1 == df2.Col1
df2['Col2'] = df1[df1.Col1 == df2.Col1]['Col2']
It is arguably better to use loc (though less clear, in my opinion):
df2['Col2'] = df1.loc[df1.Col1 == df2.Col1, 'Col2']
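Note that the boolean filters above only evaluate when df1 and df2 carry identically-labeled indexes; with the example frames (4 rows vs. 3) pandas raises 'Can only compare identically-labeled Series objects'. As a hedged alternative sketch, a Col1-to-id lookup plus map avoids the alignment requirement (frame construction is mine, mirroring the question's data):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Col1": ["A", "B", "C", "A"], "Col2": [1, 2, 3, 1]})
df2 = pd.DataFrame({"Col1": ["A", "B", "D"], "Col2": [np.nan] * 3})

# Build a Col1 -> Col2 lookup from df1 and fill only the missing ids in df2.
lookup = df1.drop_duplicates("Col1").set_index("Col1")["Col2"]
df2["Col2"] = df2["Col2"].fillna(df2["Col1"].map(lookup))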

Related

How to keep only certain df columns that appear as rows in another df

Assuming that we have two dataframes
df_1
+----+------+------+------+
| id | col1 | col2 | col3 |
+----+------+------+------+
| A  | 10   | 5    | 4    |
| B  | 5    | 3    | 2    |
+----+------+------+------+
and df_2
+----------+-------+-------+
| col_name | col_t | col_d |
+----------+-------+-------+
| col1     | 3.3   | 2.2   |
| col3     | 1     | 2     |
+----------+-------+-------+
What I want to achieve is to join the two tables, such that only the columns that appear under df_2's col_name are kept in df_1 i.e. the desired table would be
+----+------+------+
| id | col1 | col3 |
+----+------+------+
| A  | 10   | 4    |
| B  | 5    | 2    |
+----+------+------+
However, I need to perform this action only through joins and/or a transpose or pivot of the df, if possible.
I know that the above could be easily achieved by just selecting the df_1 columns as they appear in df_2's col_name, but this is not what I am looking for here.
One way to do this is to dedup the values in df_2.col_name using collect_list, then pass that list of column names to select from df_1:
from pyspark.sql.functions import collect_list
col_list = list(set(df_2.select(collect_list("col_name")).collect()[0][0]))
list_with_id = ['id'] + col_list
df_1[list_with_id].show()
Output:
+---+----+----+
| id|col1|col3|
+---+----+----+
|  A|  10|   4|
|  B|   5|   2|
+---+----+----+
Is this what you're looking for? (Assuming you want something dynamic and not manually selecting columns). I'm not using joins or pivots here though.
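If the question's join/pivot constraint is strict, one possible sketch (untested against the original data, column names taken from the example) is to unpivot df_1 with a stack expression, inner-join on df_2.col_name, and pivot back:
from pyspark.sql import functions as F

# Unpivot df_1 to (id, col_name, value) long form. The stack expression hard-codes
# the three value columns from the example; a dynamic version could build it from
# df_1.columns.
long_df = df_1.selectExpr(
    "id",
    "stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col_name, value)"
)

# Keep only the column names listed in df_2, then pivot back to wide form.
kept = long_df.join(df_2.select("col_name"), on="col_name", how="inner")
kept.groupBy("id").pivot("col_name").agg(F.first("value")).show()
# +---+----+----+
# | id|col1|col3|
# +---+----+----+
# |  A|  10|   4|
# |  B|   5|   2|
# +---+----+----+
# (row order may vary)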

Select value from other dataframe where index is equal

I have two dataframes with the same index. I would like to add a column to one of those dataframes based on an equation for which I need the value from a row of another dataframe where the index is the same.
Using
df2['B'].loc[df2['Date'] == df1['Date']]
I get the 'Can only compare identically-labeled Series objects' error.
df1
+--------+---+
| Index  | A |
+--------+---+
| 3-2-20 | 3 |
| 4-2-20 | 1 |
| 5-2-20 | 3 |
+--------+---+
df2
+--------+---+
| Index  | A |
+--------+---+
| 1-2-20 | 2 |
| 2-2-20 | 4 |
| 3-2-20 | 3 |
| 4-2-20 | 1 |
| 5-2-20 | 3 |
+--------+---+
I tried df1['B'] = 1 + df2['A'].loc[df2['Date'] == df1['Date']]. The index is a date, but in my real df I also have a column called Date with the same values.
df1 desired
+--------+---+---+
| Index  | A | B |
+--------+---+---+
| 3-2-20 | 3 | 4 |
| 4-2-20 | 1 | 2 |
| 5-2-20 | 3 | 4 |
+--------+---+---+
This should work. If not, just play with the column names, because they are similar in both tables. A_y is df2's A column (auto-renamed by the merge because both frames have a column named A):
df1['B']=df1.merge(df2, left_index=True, right_index=True)['A_y']+1
I guess for now I will have to settle for doing it with a clone of df2 cut down to the indexes of df1:
dfc = df2.copy()
t = list(df1['Date'])
dfc = dfc.loc[dfc['Date'].isin(t)]
df1['B'] = 1 + dfc['A']
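For what it's worth, a minimal sketch assuming both frames really do share the same date index: pandas aligns column assignment on the index, so no explicit filtering is needed (frame construction is mine, mirroring the example):
import pandas as pd

df1 = pd.DataFrame({"A": [3, 1, 3]}, index=["3-2-20", "4-2-20", "5-2-20"])
df2 = pd.DataFrame({"A": [2, 4, 3, 1, 3]},
                   index=["1-2-20", "2-2-20", "3-2-20", "4-2-20", "5-2-20"])

# Assignment aligns df2['A'] on df1's index; the extra 1-2-20 and 2-2-20 rows
# in df2 are simply ignored.
df1["B"] = df2["A"] + 1
print(df1)
#         A  B
# 3-2-20  3  4
# 4-2-20  1  2
# 5-2-20  3  4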

How do I reference a sub-column from a pivoted table inside a method such as sort_values?

This table has 2 levels of columns after pivoting col2. I want to sort the table with df['col3']['A'], but in .sort_values() you can only use a string or a list of strings to reference column(s).
I know for this specific problem I can just sort the table before pivoting, but this problem applies to all other methods, e.g. df.style.set_properties(subset=...), on any dataframe with multi-level/hierarchical columns.
EDIT:
Table here:
+------+---------------------+---------------------+
|      | col3                | col4                |
+------+----------+----------+----------+----------+
| col2 | A        | B        | A        | B        |
+------+----------+----------+----------+----------+
| col1 |          |          |          |          |
+------+----------+----------+----------+----------+
| 1    | 0.242926 | 0.181175 | 0.189465 | 0.338340 |
| 2    | 0.240864 | 0.494611 | 0.211883 | 0.614739 |
| 3    | 0.052051 | 0.757591 | 0.361446 | 0.389341 |
+------+----------+----------+----------+----------+
Found the answer here: Sort Pandas Pivot Table by the margin ('All') values column
Basically, just put the column and sub-column(s) in a tuple, i.e. for my case it is just .sort_values(('col3', 'A')).
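A small self-contained sketch of the tuple addressing; the data is made up to match the shape of the pivoted table above:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((3, 4)),
    index=pd.Index([1, 2, 3], name="col1"),
    columns=pd.MultiIndex.from_product([["col3", "col4"], ["A", "B"]],
                                       names=[None, "col2"]),
)

# A sub-column of a hierarchical frame is addressed by a tuple of labels,
# and the same tuple form works in other label-taking methods too.
df_sorted = df.sort_values(("col3", "A"))
styled = df.style.set_properties(subset=[("col3", "A")], color="red")  # df.style needs jinja2 installed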

What is the smartest way to get the rest of a pandas.DataFrame?

Here is a pandas.DataFrame df.
| Foo | Bar |
|-----|-----|
| 0   | A   |
| 1   | B   |
| 2   | C   |
| 3   | D   |
| 4   | E   |
I selected some rows and defined a new dataframe with df1 = df.iloc[[1, 3], :].
| Foo | Bar |
|-----|-----|
| 1   | B   |
| 3   | D   |
What is the best way to get the rest of df, like the following?
| Foo | Bar |
|-----|-----|
| 0   | A   |
| 2   | C   |
| 4   | E   |
Fast set-based diffing.
df2 = df.loc[df.index.difference(df1.index)]
df2
   Foo Bar
0    0   A
2    2   C
4    4   E
Works as long as your index values are unique.
If I'm understanding correctly, you want to take a dataframe, select some rows from it and store those in a variable df2, and then select rows in df that are not in df2.
If that's the case, you can do df[~df.isin(df2)].dropna().
df[ x ] subsets the dataframe df based on the condition x
~df.isin(df2) is the negation of df.isin(df2): it evaluates to True for the entries of df that do not belong to df2.
.dropna() drops rows with a NaN value. In this case the rows we don't want were coerced to NaN in the filtering expression above, so we get rid of those.
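A quick runnable check of that route on the question's data (frame construction is mine). Two caveats worth knowing: the masking coerces the integer Foo column to float, and dropna() would also remove rows that legitimately contained NaN in the original df.
import pandas as pd

df = pd.DataFrame({"Foo": [0, 1, 2, 3, 4], "Bar": list("ABCDE")})
df2 = df.iloc[[1, 3], :]  # the selected rows

# Entries matching df2 are masked to NaN, then the resulting all-NaN rows are dropped.
rest = df[~df.isin(df2)].dropna()
print(rest)
#    Foo Bar
# 0  0.0   A
# 2  2.0   C
# 4  4.0   E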
I assume that Foo can be treated as a unique index.
First select Foo values from df1:
idx = df1['Foo'].values
Then filter your original dataframe:
df2 = df[~df['Foo'].isin(idx)]

PySpark: Create a column from ordered concatenation of columns

I am having an issue creating a new column from the ordered concatenation of two existing columns on a pyspark dataframe i.e.:
+------+------+--------+
| Col1 | Col2 | NewCol |
+------+------+--------+
| ORD  | DFW  | DFWORD |
| CUN  | MCI  | CUNMCI |
| LAX  | JFK  | JFKLAX |
+------+------+--------+
In other words, I want to grab Col1 and Col2, order them alphabetically, and concatenate them.
Any suggestions?
Combine concat_ws, array and sort_array
from pyspark.sql.functions import concat_ws, array, sort_array
df = spark.createDataFrame(
    [("ORD", "DFW"), ("CUN", "MCI"), ("LAX", "JFK")],
    ("Col1", "Col2"))
df.withColumn("NewCol", concat_ws("", sort_array(array("Col1", "Col2")))).show()
# +----+----+------+
# |Col1|Col2|NewCol|
# +----+----+------+
# | ORD| DFW|DFWORD|
# | CUN| MCI|CUNMCI|
# | LAX| JFK|JFKLAX|
# +----+----+------+
