Pyspark DataFrame - propagate column subset onto another DataFrame's column

Using Python 3.9, I have a Spark DataFrame (just an example):
DataFrame A:

id  Name     Place
5   Michael  null
8   Dwight   null
17  Darryl   null
After some computation I get a subset of DataFrame A, containing only people working in a specific Place, like so:
DataFrame B:

id  Name     Place
5   Michael  Office
8   Dwight   Office
Please note: all entries in DataFrame B's Place column always have the same value (for example "Office" here, or "Warehouse" in some other computation).
My question is: how do I "propagate" the Place column from DataFrame B into DataFrame A? The result should look like:
DataFrame C:

id  Name     Place
5   Michael  Office
8   Dwight   Office
17  Darryl   null
I tried something like the following, with no luck:

from pyspark.sql.functions import when

row_list = B.select('id').collect()
B_ids = [row.id for row in row_list]
A.withColumn('Place', when(A['id'].isin(B_ids), 'Office'))
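The when call above has two drawbacks: without an otherwise clause it returns null for every non-matching row, and the value 'Office' is hardcoded rather than taken from B. A left join sidesteps both and avoids collecting the ids to the driver. A minimal sketch (my own, not from the original post), assuming id is a unique key in B:

from pyspark.sql import functions as F

# Join B's Place onto A by id; rows of A without a match in B get null there,
# and coalesce keeps A's original Place in that case.
C = (
    A.join(B.select('id', F.col('Place').alias('Place_B')), on='id', how='left')
     .withColumn('Place', F.coalesce(F.col('Place_B'), F.col('Place')))
     .drop('Place_B')
)
C.show()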

Related

Fill in NA with other dataframe and then add the rows that are not in the first dataframe

I am currently working with Python to merge two dataframes that look like the ones below:
import pandas as pd

# Primary
df1 = [['A','2021-03','NA',9,'NA'], ['B','2021-09','NA','NA',27], ['C','2021-12','NA',12,28]]
df1_fin = pd.DataFrame(df1, columns=['ID','Date','Value_1','Value_2','Value_3'])
# Secondary
df2 = [['A','2021-03',80,20,30], ['B','2021-09',90,'NA',20], ['B','2021-12','NA','NA',27], ['D','2020-06',4,12,28]]
df2_fin = pd.DataFrame(df2, columns=['ID','Date','Value_1','Value_2','Value_3'])
I want to perform an outer join but keep the values of the first dataframe where they already exist.
The key columns are ID and Date.
If ID and Date match, NA values are filled from the second dataframe and existing values are left untouched.
If ID and Date do not match, a new row is created.
The result dataframe will look like below:

ID  Date     Value_1  Value_2  Value_3
A   2021-03  80       9        30
B   2021-09  90       NA       27
B   2021-12  NA       NA       27
C   2021-12  NA       12       28
D   2020-06  4        12       28
Should I fill in the NA values first and then append the remaining rows? Or is there a function whose parameters let me perform both actions at once?
Yes, there's a function for it in pandas combine_first:
Combine two DataFrame objects by filling null values in one DataFrame
with non-null values from other DataFrame. The row and column indexes
of the resulting DataFrame will be the union of the two.
df1_fin.set_index(['ID', 'Date']).combine_first(df2_fin.set_index(['ID', 'Date'])).reset_index()
(Please note that in your example, you provide two dataframes without any NaN values but with the string 'NA' instead, which has no special meaning. Replace 'NA' with None in the example to get the intended behaviour.)
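For a quick end-to-end check, here is the question's example with the 'NA' strings replaced by None as suggested (a sketch added for illustration, not part of the original answer):

import pandas as pd

df1 = [['A','2021-03',None,9,None], ['B','2021-09',None,None,27], ['C','2021-12',None,12,28]]
df1_fin = pd.DataFrame(df1, columns=['ID','Date','Value_1','Value_2','Value_3'])
df2 = [['A','2021-03',80,20,30], ['B','2021-09',90,None,20], ['B','2021-12',None,None,27], ['D','2020-06',4,12,28]]
df2_fin = pd.DataFrame(df2, columns=['ID','Date','Value_1','Value_2','Value_3'])

# Values from df1_fin win wherever they are non-null; df2_fin fills the gaps
# and contributes the rows whose (ID, Date) keys are missing from df1_fin.
result = (
    df1_fin.set_index(['ID', 'Date'])
           .combine_first(df2_fin.set_index(['ID', 'Date']))
           .reset_index()
)
print(result)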

List of Python Dataframe column values meeting criteria in another dataframe?

I have two dataframes, df1 and df2, which have a common column heading, Name. The Name values are unique within df1 and df2. df1's Name values are a subset of those in df2; df2 has more rows -- about 17,300 -- than df1 -- about 6,900 -- but each Name value in df1 is in df2. I would like to create a list of Name values in df1 that meet certain criteria in other columns of the corresponding rows in df2.
Example:
df1:

   Name  Age  Hair
0  Jim   25   black
1  Mary  58   brown
3  Sue   15   purple
df2:

   Name   Country  phoneOS
0  Shari  GB       Android
1  Jim    US       Android
2  Alain  TZ       iOS
3  Sue    PE       iOS
4  Mary   US       Android
I would like a list of only those Name values in df1 that have df2 Country and phoneOS values of US and Android. The example result would be [Jim, Mary].
I have successfully selected rows within one dataframe that meet multiple criteria in order to copy those rows to a new dataframe. In that case pandas/Python does the iteration over rows internally. I guess I could write a "manual" iteration over the rows of df1 and access df2 on each iteration. I was hoping for a more efficient solution whereby the iteration was handled internally as in the single-dataframe case. But my searches for such a solution have been fruitless.
try:
df_1.loc[df_1.Name.isin(df_2.loc[df_2.Country.eq('US') &
                                 df_2.phoneOS.eq('Android'), 'Name']), 'Name']
Result:
0 Jim
1 Mary
Name: Name, dtype: object
If you want the result as a list, just add .to_list() at the end.

Alternatively, using a merge:

data = df1.merge(df2, on='Name')
data.loc[(data.phoneOS == 'Android') & (data.Country == 'US'), 'Name'].values.tolist()
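For completeness, a self-contained version of the example that can be run as-is (the frame construction below is mine, for verification only):

import pandas as pd

df1 = pd.DataFrame({'Name': ['Jim', 'Mary', 'Sue'],
                    'Age': [25, 58, 15],
                    'Hair': ['black', 'brown', 'purple']})
df2 = pd.DataFrame({'Name': ['Shari', 'Jim', 'Alain', 'Sue', 'Mary'],
                    'Country': ['GB', 'US', 'TZ', 'PE', 'US'],
                    'phoneOS': ['Android', 'Android', 'iOS', 'iOS', 'Android']})

# Filter df2 to rows meeting the criteria, then keep the df1 names in that set.
wanted = df2.loc[df2.Country.eq('US') & df2.phoneOS.eq('Android'), 'Name']
print(df1.loc[df1.Name.isin(wanted), 'Name'].to_list())  # ['Jim', 'Mary']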

copy rows of pandas dataframe to a new dataframe

I have a dataframe almost like this one:

id  name  numOfppl
1   A     30
2   B     31
3   C     10
4   D     0
...
31  comp  52
These numbers come from Python code. Once there are 5 rows where numOfppl >= 30, the code should stop and return all the remaining rows in a new dataframe.
My code so far:

df[df['numOfppl'] >= 30].iloc[:5]

If more rows are added later, how can I copy them to a new dataframe?
Once you have created a dataframe for the condition you mentioned, you need all the other rows to be in a new dataframe, right?
Please check the below:

df_1 = df[df['numOfppl'] >= 30].iloc[:5]
df_2 = df[~df.isin(df_1)].dropna()

Here df_1 will have the 5 rows matching the condition, and all the remaining rows will be copied into df_2.
Also, rows added later can be appended directly to df_2.
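One caveat worth adding (my note, not from the original answer): the isin/dropna trick also discards any row that legitimately contains a NaN. If df has a unique index, dropping by index label is a more direct way to say "everything except df_1":

# Sketch assuming df has a unique index.
df_1 = df[df['numOfppl'] >= 30].iloc[:5]
df_2 = df.drop(df_1.index)  # all rows not selected into df_1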

Remove the rows from dataframe till the actual column names are found

I am reading tabular data from an email into a pandas dataframe.
There is no guarantee that the column names will be in the first row. Sometimes the data is in the following format.
The columns that will always be there are ID, Name and Year; sometimes there can be additional columns such as "Age":
dummy1 dummy2 dummy3 dummy4
test_column1 test_column2 test_column3 test_column4
ID Name Year Age
1 John Sophomore 20
2 Lisa Junior 21
3 Ed Senior 22
Sometimes the column names come in the first row as expected.
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Once I read the HTML table from the email, how can I remove the initial rows that don't contain the column names ["ID","Name","Year"]?
So in the first case I would need to remove the first 2 rows of the dataframe (including the column row), and in the second case I wouldn't have to remove anything.
Also, the column names can come in any order and can vary, but these 3 columns will always be there: ["ID","Name","Year"].
If I do the following, it only works if the dataframe contains only the 3 columns ["ID","Name","Year"]:

col_index = df.index[(df == ["ID","Name","Year"]).all(1)].item() # get columns index
df.columns = df.iloc[col_index].to_numpy() # set valid columns
df = df.iloc[col_index + 1 :]

I should be able to fetch the corresponding header row index as long as the row contains any of these 3 column names ["ID","Name","Year"].
How can I achieve this?
I tried

col_index = df.index[(["ID","Name","Year"] in df).any(1)].item()

but I am getting an error.
You could stack the dataframe and use isin to find the header row.
IIUC, a small function could work (personally, I'd change this to pass in your file I/O read method and return a dataframe starting at that header row).

# make sure your read method uses pd.read_...(header=None)
def find_columns(dataframe, cols) -> int:
    stack_df = dataframe.stack()
    header_row = stack_df[stack_df.isin(cols)].index.get_level_values(0)[0]
    return header_row

header_row = find_columns(df, ["Age", "Year", "ID", "Name"])
new_df = pd.read_csv(file, skiprows=header_row)
ID Name Year Age
0 1 John Sophomore 20
1 2 Lisa Junior 21
2 3 Ed Senior 22
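If re-reading the file is not an option (for instance when the table came straight out of the email body), the same header index can be used to slice the dataframe already in memory, generalising the snippet from the question. A sketch assuming df was read with header=None, so its row labels match their positions:

def promote_header(df, cols):
    # Locate the first row whose cells contain the known column names.
    stacked = df.stack()
    header_row = stacked[stacked.isin(cols)].index.get_level_values(0)[0]
    # Promote that row to the header and keep only the rows below it.
    out = df.iloc[header_row + 1:].copy()
    out.columns = df.iloc[header_row].to_numpy()
    return out.reset_index(drop=True)

new_df = promote_header(df, ["ID", "Name", "Year"])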

Pandas Merge/Join

I have a dataframe called Bob with columns [A, B], where A has only unique values, like a serial ID. Its shape is (100, 2).
I have another dataframe called Anna with columns [C, D, E, F], where C has the same values as A in Bob but contains duplicates. Column D is a category (phone/laptop/ipad) determined by the serial ID found in C. The shape of Anna is (500, 4).
Example rows in Anna (the columns shown are C, D, E, F):

C     D       E   F
K103  phone   12  17
K103  phone   14  23
G221  laptop  25  6

I want to create a new dataframe with columns A, B, D by looking up the value of A in Anna's C column. The final dataframe should have shape (100, 3).
I'm finding this difficult with pd.merge (I tried left/inner/right joins) because it keeps creating 2 rows with the same values in the new dataframe, i.e. K103 will show up twice.
Tell me if this works; I'm thinking of this while typing it, so I couldn't actually check.
df = Bob.merge(Anna[['C','D']].drop_duplicates(keep='last'), how='left', left_on='A', right_on='C')
Let me know if it doesn't work, I'll create a sample dataset and edit it with the correct code.
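In the meantime, here is a quick check on a toy dataset (the data below is made up for illustration):

import pandas as pd

Bob = pd.DataFrame({'A': ['K103', 'G221'], 'B': ['x', 'y']})
Anna = pd.DataFrame({'C': ['K103', 'K103', 'G221'],
                     'D': ['phone', 'phone', 'laptop'],
                     'E': [12, 14, 25],
                     'F': [17, 23, 6]})

# Deduplicate the lookup columns first so each serial ID maps to one category;
# the left merge then produces exactly one row per row of Bob.
df = Bob.merge(Anna[['C', 'D']].drop_duplicates(keep='last'),
               how='left', left_on='A', right_on='C')
df = df.drop(columns='C')  # keep only A, B, D
print(df)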
