Select value from other dataframe where index is equal - python

I have two dataframes with the same index. I would like to add a column to one of them, based on an equation that needs the value from the row of the other dataframe that has the same index.
Using
df2['B'].loc[df2['Date'] == df1['Date']]
I get the 'Can only compare identically-labeled Series objects' error.
df1
+--------+---+
| Index  | A |
+--------+---+
| 3-2-20 | 3 |
| 4-2-20 | 1 |
| 5-2-20 | 3 |
+--------+---+
df2
+--------+---+
| Index  | A |
+--------+---+
| 1-2-20 | 2 |
| 2-2-20 | 4 |
| 3-2-20 | 3 |
| 4-2-20 | 1 |
| 5-2-20 | 3 |
+--------+---+
df1['B'] = 1 + df2['A'].loc[df2['Date'] == df1['Date']] is what I am trying. The index is a date, but in my real dataframes I also have a column called Date with the same values.
df1 desired
+--------+---+---+
| Index  | A | B |
+--------+---+---+
| 3-2-20 | 3 | 4 |
| 4-2-20 | 1 | 2 |
| 5-2-20 | 3 | 4 |
+--------+---+---+

This should work. If not, adjust the column names, since they are the same in both tables. A_y is df2's A column (automatically renamed by merge because of the name clash):
df1['B'] = df1.merge(df2, left_index=True, right_index=True)['A_y'] + 1
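For reference, a minimal runnable sketch of that merge on the example data (the frames below are reconstructed from the question's tables, with the date strings used verbatim as the index, so treat the construction as an assumption):
import pandas as pd

df1 = pd.DataFrame({'A': [3, 1, 3]}, index=['3-2-20', '4-2-20', '5-2-20'])
df2 = pd.DataFrame({'A': [2, 4, 3, 1, 3]},
                   index=['1-2-20', '2-2-20', '3-2-20', '4-2-20', '5-2-20'])

# The merge is on the index; the shared column name 'A' becomes A_x (from df1)
# and A_y (from df2), and the result aligns back to df1's index on assignment.
df1['B'] = df1.merge(df2, left_index=True, right_index=True)['A_y'] + 1
print(df1)
#         A  B
# 3-2-20  3  4
# 4-2-20  1  2
# 5-2-20  3  4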

I guess for now I will settle for making a copy of df2 cut down to the indexes of df1:
dfc = df2.copy()                    # work on a copy so df2 is left untouched
t = list(df1['Date'])
dfc = dfc.loc[dfc['Date'].isin(t)]  # keep only the dates that appear in df1
df1['B'] = 1 + dfc['A']             # the assignment aligns on the shared index
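As an aside (a hedged sketch, assuming the dates really are the index of both frames): pandas aligns on the index during assignment, so the explicit filter can be skipped and the line below should give the same result.
# Pick df2's A values for exactly df1's index labels, then add 1;
# rows of df2 whose index is not in df1 are simply ignored.
df1['B'] = 1 + df2.loc[df1.index, 'A']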

Replace column name by Index

I have the below data in a Dataframe.
+----+------+----+------+
| Id | Name | Id | Name |
+----+------+----+------+
| 1  | A    | 1  | C    |
| 2  | B    | 2  | B    |
+----+------+----+------+
Though the column names repeat, it is really a comparison of the first two columns (old data) with the last two columns (new data).
I was trying to rename the second-to-last column by appending _New to it via its index, using the code below. Unfortunately, the first column also gets _New appended.
df.rename(columns={df.columns[2]: df.columns[2] + '_New'}, inplace=True)
Here's the result I am getting using the above code.
+--------+------+--------+------+
| Id_New | Name | Id_New | Name |
+--------+------+--------+------+
| 1      | A    | 1      | C    |
| 2      | B    | 2      | B    |
+--------+------+--------+------+
My understanding was that it should add _New only to the second-to-last column. Below is the expected result.
+----+------+--------+------+
| Id | Name | Id_New | Name |
+----+------+--------+------+
| 1  | A    | 1      | C    |
| 2  | B    | 2      | B    |
+----+------+--------+------+
Is there any way to accomplish this?
Note that rename maps by label, and df.columns[2] is just the string 'Id', so every column with that label gets renamed. You can instead rebuild the column list with a simple loop, using a dictionary to keep track of how many times each name has been seen. I generalized the logic here to handle an arbitrary number of duplicates:
cols = {}
new_cols = []
for c in df.columns:
    if c in cols:
        new_cols.append(f'{c}_New{cols[c]}')
        cols[c] += 1
    else:
        new_cols.append(c)
        cols[c] = 1
df.columns = new_cols
Output:
   Id Name  Id_New1 Name_New1
0   1    A        1         C
1   2    B        2         B
If you really want Id_New, then Id_New2, and so on, change:
new_cols.append(f'{c}_New{cols[c]}')
to
i = cols[c] if cols[c] != 1 else ''
new_cols.append(f'{c}_New{i}')
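For reference, a minimal runnable sketch of the loop on the question's table (the DataFrame below is a reconstruction, built directly with duplicate column labels):
import pandas as pd

# Hypothetical reconstruction of the question's table with duplicate column labels.
df = pd.DataFrame([[1, 'A', 1, 'C'],
                   [2, 'B', 2, 'B']],
                  columns=['Id', 'Name', 'Id', 'Name'])

cols = {}
new_cols = []
for c in df.columns:
    if c in cols:
        # repeated name: add the _New suffix plus a counter
        new_cols.append(f'{c}_New{cols[c]}')
        cols[c] += 1
    else:
        # first occurrence: keep the name unchanged
        new_cols.append(c)
        cols[c] = 1
df.columns = new_cols

print(df.columns.tolist())  # ['Id', 'Name', 'Id_New1', 'Name_New1']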

Python: Replace None value with value of other column [duplicate]

I'm trying to fill a column of a dataframe from another dataframe based on conditions. Let's say my first dataframe is df1 and the second is named df2.
# df1 is described as below:
+------+------+
| Col1 | Col2 |
+------+------+
| A    | 1    |
| B    | 2    |
| C    | 3    |
| A    | 1    |
+------+------+
And
# df2 is described as below:
+------+------+
| Col1 | Col2 |
+------+------+
| A    | NaN  |
| B    | NaN  |
| D    | NaN  |
+------+------+
Each distinct value of Col1 has its own id number (in Col2), so what I want is to fill the NaN values in df2.Col2 where df2.Col1 == df1.Col1.
So that my second dataframe will look like :
# df2 :
+------+------+
| Col1 | Col2 |
+------+------+
| A    | 1    |
| B    | 2    |
| D    | NaN  |
+------+------+
I'm using Python 2.7
Use drop_duplicates with set_index and combine_first:
df = df2.set_index('Col1').combine_first(df1.drop_duplicates().set_index('Col1')).reset_index()
If you need to check duplicates only in the Col1 column, pass it to drop_duplicates:
df = df2.set_index('Col1').combine_first(df1.drop_duplicates('Col1').set_index('Col1')).reset_index()
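A minimal runnable sketch of the combine_first approach on the example frames (reconstructed from the question, so treat the construction as an assumption). Note that combine_first keeps the union of the index labels, so the row C that exists only in df1 also shows up and may need to be filtered out:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Col1': ['A', 'B', 'C', 'A'], 'Col2': [1, 2, 3, 1]})
df2 = pd.DataFrame({'Col1': ['A', 'B', 'D'], 'Col2': [np.nan] * 3})

# combine_first fills df2's NaNs from df1 wherever the Col1 labels match.
out = (df2.set_index('Col1')
          .combine_first(df1.drop_duplicates().set_index('Col1'))
          .reset_index())
# keep only the rows that were originally in df2
out = out[out['Col1'].isin(df2['Col1'])]
print(out)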
Here is a solution with the filter df1.Col1 == df2.Col1
df2['Col2'] = df1[df1.Col1 == df2.Col1]['Col2']
It is even better to use loc (though less readable, in my view):
df2['Col2'] = df1.loc[df1.Col1 == df2.Col1, 'Col2']

Pandas Merge multiple dataframes on index and column

I am trying to merge multiple dataframes into one main dataframe, using the datetime index and id column of the main dataframe and the DateTime and id columns of the other dataframes.
Main dataframe
DateTime   | id | data
(Df.Index) |    |
-----------|----|------
2017-9-8   | 1  | a
2017-9-9   | 2  | b
df1
id | data1 | data2 | DateTime
---|-------|-------|----------
1  | a     | c     | 2017-9-8
2  | b     | d     | 2017-9-9
5  | a     | e     | 2017-9-20
df2
id | data3 | data4 | DateTime
---|-------|-------|----------
1  | d     | c     | 2017-9-8
2  | e     | a     | 2017-9-9
4  | f     | h     | 2017-9-20
The main dataframe and the other dataframes are stored in different dictionaries. I want to read from each dictionary and merge when the joining condition (DateTime, id) is met.
for sleep in dictOfSleep:                # main dataframes
    for sensorDevice in dictOfSensor:    # other dataframes
        try:
            dictOfSleep[sleep] = pd.merge(dictOfSleep[sleep],
                                          dictOfSensor[sensorDevice],
                                          how='outer', on=['DateTime', 'id'])
        except:
            print('Join could not be done')
Desired Output:
DateTime   | id | data | data1 | data2 | data3 | data4
(Df.Index) |    |      |       |       |       |
-----------|----|------|-------|-------|-------|------
2017-9-8   | 1  | a    | a     | c     | d     | c
2017-9-9   | 2  | b    | b     | d     | e     | a
I'm not sure how your dictionaries are set up, so you will most likely need to modify this, but I'd try something like:
for sensorDevice in dictOfSensor:
    df = dictOfSensor[sensorDevice]
    # set df index to match the main_df index
    df = df.set_index(['DateTime'])
    # try join (not merge) when combining on index
    main_df = main_df.join(df, how='outer')
Alternatively, if the id column is very important, you can first reset your main_df index and then merge.
main_df = main_df.reset_index()
for sensorDevice in dictOfSensor:
    df = dictOfSensor[sensorDevice]
    # try to merge on both columns
    main_df = main_df.merge(df, how='outer', on=['DateTime', 'id'])
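A minimal runnable sketch of that second approach on the example data (the frames below are reconstructed from the question, so the exact dtypes and dictionary layout are assumptions). With how='outer' the unmatched ids (4 and 5) also survive with NaNs; switch to how='left' if you only want the main dataframe's rows, as in the desired output:
import pandas as pd

main_df = pd.DataFrame({'id': [1, 2], 'data': ['a', 'b']},
                       index=pd.to_datetime(['2017-9-8', '2017-9-9'])).rename_axis('DateTime')
df1 = pd.DataFrame({'id': [1, 2, 5], 'data1': ['a', 'b', 'a'], 'data2': ['c', 'd', 'e'],
                    'DateTime': pd.to_datetime(['2017-9-8', '2017-9-9', '2017-9-20'])})
df2 = pd.DataFrame({'id': [1, 2, 4], 'data3': ['d', 'e', 'f'], 'data4': ['c', 'a', 'h'],
                    'DateTime': pd.to_datetime(['2017-9-8', '2017-9-9', '2017-9-20'])})

# Move the DateTime index into a column so it can be used as a merge key,
# then merge each extra frame on both DateTime and id.
main_df = main_df.reset_index()
for df in (df1, df2):
    main_df = main_df.merge(df, how='outer', on=['DateTime', 'id'])
print(main_df.set_index('DateTime'))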

pyspark - how can I remove all duplicate rows (ignoring certain columns) and not leaving any dupe pairs behind?

I have the following table:
df = spark.createDataFrame([(2, 'john', 1),
                            (2, 'john', 1),
                            (3, 'pete', 8),
                            (3, 'pete', 8),
                            (5, 'steve', 9)],
                           ['id', 'name', 'value'])
df.show()
+----+-------+-------+--------------+
| id | name  | value | date         |
+----+-------+-------+--------------+
| 2  | john  | 1     | 131434234342 |
| 2  | john  | 1     | 10-22-2018   |
| 3  | pete  | 8     | 10-22-2018   |
| 3  | pete  | 8     | 3258958304   |
| 5  | steve | 9     | 124324234    |
+----+-------+-------+--------------+
I want to remove all duplicate pairs (where the duplication is in id, name, and value, but NOT in date), so that I end up with:
+----+-------+-------+-----------+
| id | name  | value | date      |
+----+-------+-------+-----------+
| 5  | steve | 9     | 124324234 |
+----+-------+-------+-----------+
How can I do this in PySpark?
You could groupBy id, name and value and filter on the count column:
df = df.groupBy('id','name','value').count().where('count = 1')
df.show()
+---+-----+-----+-----+
| id| name|value|count|
+---+-----+-----+-----+
|  5|steve|    9|    1|
+---+-----+-----+-----+
You can then drop the count column if needed.
Do a groupBy on the columns you want, count, filter where the count is equal to 1, and then drop the count column, like below:
import pyspark.sql.functions as f
df = df.groupBy("id", "name", "value").agg(f.count("*").alias('cnt')).where('cnt = 1').drop('cnt')
You can add the date column to the groupBy condition if you want.
Hope this helps you
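Both answers collapse the frame down to the grouping columns. If you need to keep the date column in the result, a possible variant (a sketch, not from the answers above) is to count per group with a window function and then filter:
import pyspark.sql.functions as f
from pyspark.sql import Window

# Count how many rows share the same (id, name, value) without aggregating,
# keep only the rows whose group occurs exactly once, then drop the helper
# column. All original columns, including date, survive the filter.
w = Window.partitionBy('id', 'name', 'value')
df_unique = (df.withColumn('cnt', f.count(f.lit(1)).over(w))
               .where(f.col('cnt') == 1)
               .drop('cnt'))
df_unique.show()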

What is the smartest way to get the rest of a pandas.DataFrame?

Here is a pandas.DataFrame df.
| Foo | Bar |
|-----|-----|
| 0   | A   |
| 1   | B   |
| 2   | C   |
| 3   | D   |
| 4   | E   |
I selected some rows and defined a new dataframe with df1 = df.iloc[[1,3],:].
| Foo | Bar |
|-----|-----|
| 1   | B   |
| 3   | D   |
What is the best way to get the rest of df, like the following?
| Foo | Bar |
|-----|-----|
| 0   | A   |
| 2   | C   |
| 4   | E   |
Fast set-based diffing.
df2 = df.loc[df.index.difference(df1.index)]
df2
   Foo Bar
0    0   A
2    2   C
4    4   E
Works as long as your index values are unique.
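A minimal runnable sketch on the example frame (reconstructed from the question's table, so treat the construction as an assumption):
import pandas as pd

df = pd.DataFrame({'Foo': [0, 1, 2, 3, 4], 'Bar': list('ABCDE')})
df1 = df.iloc[[1, 3], :]

# index.difference returns the index labels of df that are not in df1,
# so .loc selects exactly the remaining rows.
df2 = df.loc[df.index.difference(df1.index)]
print(df2)
#    Foo Bar
# 0    0   A
# 2    2   C
# 4    4   E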
If I'm understanding correctly, you want to take a dataframe, select some rows from it and store those in a variable df2, and then select rows in df that are not in df2.
If that's the case, you can do df[~df.isin(df2)].dropna().
df[x] subsets the dataframe df based on the condition x.
~df.isin(df2) is the negation of df.isin(df2), which evaluates to True for the rows of df that belong to df2.
.dropna() drops rows with a NaN value. In this case the rows we don't want were coerced to NaN in the filtering expression above, so we get rid of those.
I assume that Foo can be treated as a unique index.
First select Foo values from df1:
idx = df1['Foo'].values
Then filter your original dataframe:
df2 = df[~df['Foo'].isin(idx)]
