Pandas Merge/Join - python

I have a dataframe called Bob with columns [A, B], where A has only unique values (like a serial ID). Its shape is (100, 2).
I have another dataframe called Anna with columns [C, D, E, F], where C has the same values as A in Bob, but with duplicates. Column D is a category (phone/laptop/ipad) that is determined by the serial ID found in C. The shape of Anna is (500, 4).
Example rows in Anna:
C     D       E   F
K103  phone   12  17
K103  phone   14  23
G221  laptop  25  6
I want to create a new dataframe with columns A, B, D by looking up each value of A in anna['C']. The final dataframe should have shape (100, 3).
I'm finding this difficult with pd.merge (I tried left/inner/right joins) because it keeps creating two rows with the same values in the new dataframe, i.e. K103 shows up twice.

Tell me if this works; I'm thinking of it as I type, so I couldn't actually check it.
df = Bob.merge(Anna[['C','D']].drop_duplicates(keep='last'), how='left', left_on='A', right_on='C')
Let me know if it doesn't work, I'll create a sample dataset and edit it with the correct code.
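In case it helps, here is a runnable sketch of that idea with hypothetical toy data (two serials instead of 100). One tweak over the snippet above: subset='C' keys the deduplication on the serial itself, and the redundant C column is dropped after the merge so exactly the three requested columns remain.

import pandas as pd

# Toy stand-ins for Bob and Anna (hypothetical values).
Bob = pd.DataFrame({'A': ['K103', 'G221'], 'B': [1, 2]})
Anna = pd.DataFrame({'C': ['K103', 'K103', 'G221'],
                     'D': ['phone', 'phone', 'laptop'],
                     'E': [12, 14, 25],
                     'F': [17, 23, 6]})

# One row per serial: deduplicate the lookup table on C before merging.
lookup = Anna[['C', 'D']].drop_duplicates(subset='C', keep='last')
df = Bob.merge(lookup, how='left', left_on='A', right_on='C').drop(columns='C')
print(df)  # columns A, B, D; one row per serial in Bob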

Related

Pyspark DataFrame - propagate column subset onto another DataFrame's column

Using Python 3.9, I have a Spark DataFrame (just an example):
DataFrame A:
id  Name     Place
5   Michael  null
8   Dwight   null
17  Darryl   null
After some computing I get a subset of DataFrame A, containing only people working in specific Place like so:
DataFrame B:
id  Name     Place
5   Michael  Office
8   Dwight   Office
Please note: all entries in DataFrame B's Place column always have the same value (for example "Office" here, or "Warehouse" in some other computation).
My question is: how do I "propagate" column Place from DataFrame B into DataFrame A? The result should look like:
DataFrame C:
id  Name     Place
5   Michael  Office
8   Dwight   Office
17  Darryl   null
I tried something like the following, with no luck:
row_list = B.select('id').collect()
B_ids = [row.id for row in row_list]
A.withColumn('Place', when(A['id'].isin(B_ids), 'Office'))
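One thing to note about the attempt above: withColumn returns a new DataFrame (the result is never assigned), and 'Office' is hardcoded. A minimal sketch of an alternative, assuming id is unique in both frames: drop A's all-null Place and pull in B's Place with a left join, so ids missing from B come back as null.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy frames standing in for A and B.
A = spark.createDataFrame(
    [(5, 'Michael', None), (8, 'Dwight', None), (17, 'Darryl', None)],
    'id INT, Name STRING, Place STRING')
B = spark.createDataFrame(
    [(5, 'Michael', 'Office'), (8, 'Dwight', 'Office')],
    'id INT, Name STRING, Place STRING')

# Replace A's Place with whatever value B carries for matching ids.
C = A.drop('Place').join(B.select('id', 'Place'), on='id', how='left')
C.show()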

Dropping rows where a dynamic number of integer columns contain only 0's

I have the following problem, illustrated by this example dataframe:
Name    Planet    Number Column #1  Number Column #2
John    Earth     2                 0
Peter   Terra     0                 0
Anna    Mars      5                 4
Robert  Knowhere  0                 1
Here, I want to remove only the rows in which all the number columns are 0. In this case that is the second row, so my dataframe has to become:
Name    Planet    Number Column #1  Number Column #2
John    Earth     2                 0
Anna    Mars      5                 4
Robert  Knowhere  0                 1
For this I have a solution, and it is the following:
new_df = old_df.loc[(old_df['Number Column #1'] > 0) | (old_df['Number Column #2'] > 0)]
This works; however, I have another problem: based on the request, my dataframe will dynamically have a different number of number columns. For example:
Name    Planet    Number Column #1  Number Column #2  Number Column #3
John    Earth     2                 0                 1
Peter   Terra     0                 0                 0
Anna    Mars      5                 4                 2
Robert  Knowhere  0                 1                 1
This is the problematic part, as I am not sure how to adjust my code to work with dynamic columns. I've tried multiple things from Stack Overflow and the pandas documentation; however, most examples only work for dataframes in which all columns are integers. Pandas then considers them booleans, and a simple solution like this works:
new_df = (df != 0).any(axis=1)
In my case, however, the text columns, which are always the same, are the problematic ones. Does anyone have an idea for a solution here? Thanks a lot in advance!
P.S. I have the names of the number columns available beforehand in the code as a list, for example:
my_num_columns = ["Number Column #1", "Number Column #2", "Number Column #3"]
# my pandas logic...
IIUC:
You can use select_dtypes() to select the int and float columns, then check your condition and filter the dataframe:
df=df.loc[~df.select_dtypes(['int','float']).eq(0).all(axis=1)]
#OR
df=df.loc[df.select_dtypes(['int','float']).ne(0).any(axis=1)]
Note: if needed, you can also include 'bool' columns, typecast them to float, and then check your condition:
df=df.loc[df.select_dtypes(['int','float','bool']).astype(float).ne(0).any(axis=1)]
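Alternatively, since the question mentions the numeric column names are already available as a list, a minimal sketch filtering on that list directly (toy data reconstructed from the example above):

import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Peter", "Anna", "Robert"],
    "Planet": ["Earth", "Terra", "Mars", "Knowhere"],
    "Number Column #1": [2, 0, 5, 0],
    "Number Column #2": [0, 0, 4, 1],
})
my_num_columns = ["Number Column #1", "Number Column #2"]

# Keep rows where at least one of the listed number columns is non-zero.
df = df.loc[df[my_num_columns].ne(0).any(axis=1)]
print(df)  # Peter's all-zero row is dropped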

Trying to multiply specific columns by a portion of multiple rows in a Pandas DataFrame (Python)

I am trying to multiply a few specific columns by a portion of multiple rows, creating a new column from every result. I could not really find an answer to my question in previous Stack Overflow questions or on Google, so maybe one of you can help.
I would like to point out that I am quite the beginner in Python, so apologies ahead of time for any obvious questions or strange code.
This is what my DataFrame currently looks like:
So, for the column Rank of Hospital by Doctor_1, I want to multiply all its numbers by the values of the first row, from column Rank of Doctor by Hospital_1 through column Rank of Doctor by Hospital_10, which would result in:
1*1
2*1
3*1
4*4
...
and so on.
I want to do this for every Doctor_ column: the values of Doctor_2 should be multiplied by the second row of all ten Rank of Doctor by Hospital_ columns, Doctor_3 by the third row, and so on.
So far, I have transposed the Rank of Doctor by Hospital_ columns into a new DataFrame:
and tried to multiply this by a DataFrame of the Rank of Hospital by Doctor_ columns, where the first column of the first df should be multiplied by the first column of the second df (and the second column by the second column, etc.):
But my current formula
preferences_of_doctors_and_hospitals_doctors_ranking.mul(preferences_of_doctors_and_hospitals_hospitals_ranking_transposed)
is obviously not working:
Does anybody know what I am doing wrong and how I could fix this? Maybe I could write a for loop so that a new column is created for every multiplication of columns? So Multiplication_column_1 of DF3 = Column 1 of DF1 * Column 1 of DF2 and Multiplication_column_2 of DF3 = Column 2 of DF1 * Column 2 of DF2.
Thank you in advance!
Jeff
You can multiply the 2d arrays created by filtering the columns with filter and taking values first:
arr = df.filter(like='Rank of Hospital by').values * df.filter(like='Rank of Doctor by').values
Or:
arr = (preferences_of_doctors_and_hospitals_doctors_ranking.values *
preferences_of_doctors_and_hospitals_hospitals_ranking_transposed.values)
Note: this requires the same ordering of columns, the same number of columns, and the same index in both filtered DataFrames.
This gives a 2d array, so create a DataFrame with the constructor and join it to the original:
df = df.join(pd.DataFrame(arr, index=df.index).add_prefix('Multiplied '))
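For illustration, a small runnable sketch of this approach with hypothetical data (two doctors and two hospitals instead of ten):

import pandas as pd

# Hypothetical toy frame: two positionally aligned column blocks.
df = pd.DataFrame({
    'Rank of Hospital by Doctor_1': [1, 2],
    'Rank of Hospital by Doctor_2': [3, 4],
    'Rank of Doctor by Hospital_1': [1, 1],
    'Rank of Doctor by Hospital_2': [2, 2],
})

# Element-wise product of the two column blocks as 2d arrays.
arr = (df.filter(like='Rank of Hospital by').values *
       df.filter(like='Rank of Doctor by').values)

# Wrap back into a DataFrame and join onto the original.
df = df.join(pd.DataFrame(arr, index=df.index).add_prefix('Multiplied '))
print(df)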
df = pd.DataFrame({"A":[1,2,3,4,5], "B":[6,7,8,9,10]})
df["mul"] = df["A"] * df["B"]
print(df)
Output:
   A   B  mul
0  1   6    6
1  2   7   14
2  3   8   24
3  4   9   36
4  5  10   50
If I understood the question correctly, I think you've way over-complicated it.
You can just create another column, telling pandas to give it the value of the first column multiplied by the second column.
Closer to your specific case, with more than two columns:
df = pd.DataFrame({"A":[1,2,3,4,5], "B":[6,7,8,9,10], "C":[11,12,13,14,15]})
df["mul"] = df["A"] * df["B"] * df["C"]

PANDAS: Convert 2 row df to single row multilevel column df

I have been searching for an answer to my question for a while and have not been able to find anything that produces my desired result.
The problem is this: I have a dataframe with two rows that I want to merge into a single-row dataframe with multi-level columns. Using my example below (which I drafted in Excel to better visualize my desired output), I want the new DF to have a multi-level column index whose first level is based on the original columns A-C, with a new column sub-level based on the values from the original 'Name' column. It is quite possible I'm using existing functions incorrectly. If you could provide me with your simplest way of altering the dataframe, I would greatly appreciate it!
Code to construct current df:
import pandas as pd
df = pd.DataFrame([['Alex',1,2,3],['Bob',4,5,6]], columns='Name A B C'.split())
Image of current df with desired output:
Using set_index + unstack
df.set_index('Name').unstack().to_frame().T
Out[198]:
        A        B        C
Name Alex Bob Alex Bob Alex Bob
0       1   4    2   5    3   6
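For a quick end-to-end check, the same code run against the constructed df (a sketch):

import pandas as pd

df = pd.DataFrame([['Alex',1,2,3],['Bob',4,5,6]], columns='Name A B C'.split())
out = df.set_index('Name').unstack().to_frame().T
print(out.columns)
# Expect a MultiIndex: level 0 is A/B/C, level 1 is the Name values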

Python Pandas select group where a specific column contains zeroes

I'm working on a small project using Python and pandas, and I'm stuck on the following problem:
I have a table where column A contains multiple, possibly non-unique values, and a second column B with values that might be zero. I want to group all rows in the DataFrame by their value in column A and then only "keep" or "select" the groups which contain one or more zeros in the B column.
For example from a DataFrame that looks like this:
Column A  Column B
--------  --------
b         12
c         56
f         0
b         456
b         334
f         10
I am only interested in all rows (the group) where column A = f:
Column A  Column B
--------  --------
f         0
f         10
I know how I could achieve this using loops and iterating over groups, but I'm looking for simple and reasonably fast code, as the DataFrames I work with can get very large.
My current approach is something like this:
df.groupby("A").filter(lambda x: 0 in x["B"].values)
Obviously I'm new to Python and pandas, and am hoping for your help!
Thank you in advance!
One way is to get all values of column A where column B is zero, and then filter on this set:
groups = df[df['Column B'] == 0]['Column A'].unique()
>>> df[df['Column A'].isin(groups)]
  Column A  Column B
2        f         0
5        f        10
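For completeness, a runnable sketch of this answer using the sample data from the question:

import pandas as pd

df = pd.DataFrame({'Column A': ['b', 'c', 'f', 'b', 'b', 'f'],
                   'Column B': [12, 56, 0, 456, 334, 10]})

# Values of Column A that have at least one zero in Column B...
groups = df[df['Column B'] == 0]['Column A'].unique()
# ...then keep every row belonging to those groups.
print(df[df['Column A'].isin(groups)])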
