Why is a PySpark joined column turning into null values?

I'm trying to join two dataframes but the values of the second keep turning into nulls:
joint = sdf.join(k, "date", how='left').select(sdf.date, sdf.Res, sdf.Ind, k.gen.cast(IntegerType())).orderBy('date')
output: | 1/1/2001 | 4103 | 9223 | null |
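Two common causes of this symptom, worth checking before blaming the join itself, are join keys that never actually match (for example, date stored with a different format or type in the two frames, so every left-join row comes back null on k's side) and a cast that cannot parse the values, since casting a non-numeric string to IntegerType yields null. A minimal diagnostic sketch, assuming the column names from the question:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# How many date keys actually overlap? Zero overlap means the left join
# fills k's columns with nulls on every row.
print(sdf.select("date").intersect(k.select("date")).count())

# Does the cast itself produce nulls? Non-numeric strings cast to null.
k.filter(F.col("gen").isNotNull() & F.col("gen").cast(IntegerType()).isNull()).show()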

Related

Subsetting python dataframe using position values from lists

I have a dataframe with raw data and I would like to select a different range of rows for each column, using two lists: one containing the first row position to select and the other the last.
INPUT
| Index | Column A | Column B |
|:--------:|:--------:|:--------:|
| 1 | 2 | 8 |
| 2 | 4 | 9 |
| 3 | 1 | 7 |
first_position=[1,2]
last_position=[2,3]
EXPECTED OUTPUT
| Index | Column A | Column B |
|:--------:|:--------:|:--------:|
| 1 | 2 | 9 |
| 2 | 4 | 7 |
Which function can I use?
Thanks!
I tried df.filter but I think it does not accept a list as input.
Basically, as far as I can see, you have two meaningful columns in your DataFrame.
Thus, I would suggest using the "Index" column as the actual index:
df.set_index(df.columns[0], inplace=True)
That way you might use .loc:
df_out = pd.concat(
    [
        df.loc[first_position, "Column A"].reset_index(drop=True),
        df.loc[last_position, "Column B"].reset_index(drop=True)
    ],
    axis=1
)
However, with the positions stored in separate lists, you would need to maintain them yourself, which may not be convenient.
Instead, I would re-organize it with slicing:
df_out = pd.concat(
    [
        df[["Column A"]][:-1].reset_index(drop=True),
        df[["Column B"]][1:].reset_index(drop=True)
    ],
    axis=1
)
In either case, the original index is destroyed. If that matters, you would need a variant without .reset_index(drop=True).
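For reference, a self-contained sketch that puts the .loc approach together with the sample data from the question (the "Index" column and the position lists are taken from the example above):
import pandas as pd

df = pd.DataFrame({"Index": [1, 2, 3],
                   "Column A": [2, 4, 1],
                   "Column B": [8, 9, 7]})
first_position = [1, 2]
last_position = [2, 3]

# Use the "Index" column as the real index, then select rows by label with .loc.
df = df.set_index("Index")
df_out = pd.concat(
    [
        df.loc[first_position, "Column A"].reset_index(drop=True),
        df.loc[last_position, "Column B"].reset_index(drop=True)
    ],
    axis=1
)
print(df_out)
#    Column A  Column B
# 0         2         9
# 1         4         7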

Pandas drop duplicates is not working as expected

I am building multiple dataframes from a SQL query that contains a lot of left joins which is producing a bunch of duplicate values. I am familiar with pd.drop_duplicates() as I use it regularly in my other scripts, however, I can't get this particular one to work.
I am trying to drop_duplicates on a subset of 2 columns. Here is my code:
df = pd.read_sql("query")
index = []
for i in range(len(df)):
    index.append(i)
df['index'] = index
df.set_index([df['index']])
df2 = df.groupby(['SSN', 'client_name', 'Evaluation_Date']).substance_use_name.agg(' | '.join).reset_index()
df2.shape is (182, 4).
df3 = pd.concat([df, df2], axis=1, join='outer').drop_duplicates(keep=False)
df3.drop_duplicates(subset=['client_name', 'Evaluation_Date'], keep='first', inplace=True)
df3 returns 791 rows of data (which is the exact number of rows that my original query returns). After the drop_duplicates call I expected to have only 190 rows, however it only drops the duplicates down to 301 rows. When I do df3.to_excel(r'file_path.xlsx') and remove duplicates manually on the same subset in Excel, it works just fine and gives me the 190 rows that I expect. I'm not sure why.
I noticed in other similar questions regarding this topic that pandas cannot drop duplicates if a date field is a dtype 'object' and that it must be changed to a datetime, however, my date field is already a datetime.
Data frame looks like this:
ID | substance1 | substance2 | substance3 | substance4
01 | drug | null | null | null
01 | null | drug | null | null
01 | null | null | drug | null
01 | null | null | null | drug
02 | drug | null | null | null
so on and so forth. I want to merge the rows into one row so it looks like this:
ID | substance1 | substance2 | substance3 | substance4
01 | drug | drug | drug | drug
02 | drug | drug | drug | drug
so on and so forth. Does that make better sense?
Would anyone be able to help me with this?
Thanks!
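One way to collapse the staggered rows into a single row per ID is a plain groupby: GroupBy.first() returns the first non-null value per column within each group. A minimal sketch with a toy frame shaped like the sample above (column names assumed):
import pandas as pd

# Only the rows that were shown in the example.
df = pd.DataFrame({
    "ID": ["01", "01", "01", "01", "02"],
    "substance1": ["drug", None, None, None, "drug"],
    "substance2": [None, "drug", None, None, None],
    "substance3": [None, None, "drug", None, None],
    "substance4": [None, None, None, "drug", None],
})

# first() picks the first non-null value per column within each ID group,
# so ID 01 collapses to drug/drug/drug/drug; columns that are all null
# within a group simply stay missing.
collapsed = df.groupby("ID", as_index=False).first()
print(collapsed)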

Merge 2 data frames based on values in dataframe 1 and the index and columns of dataframe 2

I have 2 DataFrames as follows
DataFrame 1
DataFrame 2
I want to merge these 2 DataFrames based on the values of each row in DataFrame 2, matched against the combination of index and column in DataFrame 1.
So I want to append another column to DataFrame 2, name it "weight", and store the merged value there.
For example,
| | col1 | col2 | relationship | weight |
|:--:|:----:|:------:|:------------:|:------:|
| 0 | Andy | Claude | 0 | 1 |
| 1 | Andy | Frida | 20 | 1 |
and so on. How to do this?
Use DataFrame.join with DataFrame.stack for MultiIndex Series:
df2 = df2.join(df1.stack().rename('weight'), on=['col1','col2'])
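A small self-contained sketch of that pattern, with made-up data since the original frames were posted as images (the names and values below are assumptions, chosen to reproduce the example rows above):
import pandas as pd

# df1: a name-by-name weight matrix; df2: name pairs with a relationship score.
df1 = pd.DataFrame([[1, 1], [0, 0]],
                   index=["Andy", "Claude"],
                   columns=["Claude", "Frida"])
df2 = pd.DataFrame({"col1": ["Andy", "Andy"],
                    "col2": ["Claude", "Frida"],
                    "relationship": [0, 20]})

# Stacking df1 gives a Series indexed by (row label, column label);
# joining on ["col1", "col2"] looks each pair up in that MultiIndex.
df2 = df2.join(df1.stack().rename("weight"), on=["col1", "col2"])
print(df2)  # both rows get weight 1, as in the example table above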

How to change the size and distribution of a PySpark Dataframe according to the values of its rows & columns?

I have a large PySpark DataFrame that I would like to manipulate as in the example below. I think it is easier to visualise it than to describe it. Hence, for illustrative purposes, let us take a simple DataFrame df:
df.show()
+----------+-----------+-----------+
| series | timestamp | value |
+----------+-----------+-----------+
| ID1 | t1 | value1_1 |
| ID1 | t2 | value2_1 |
| ID1 | t3 | value3_1 |
| ID2 | t1 | value1_2 |
| ID2 | t2 | value2_2 |
| ID2 | t3 | value3_2 |
| ID3 | t1 | value1_3 |
| ID3 | t2 | value2_3 |
| ID3 | t3 | value3_3 |
+----------+-----------+-----------+
In the above DataFrame, each of the three unique values contained in column series (i.e. ID1, ID2 and ID3) has corresponding values (under column value) occurring simultaneously at the same time (i.e. same entries in column timestamp).
From this DataFrame, I would like a transformation which ends up with the following DataFrame, named, say, result. As can be seen, the size of the DataFrame has changed and even the columns have been renamed according to entries of the original DataFrame.
result.show()
+-----------+-----------+-----------+-----------+
| timestamp | ID1 | ID2 | ID3 |
+-----------+-----------+-----------+-----------+
| t1 | value1_1 | value1_2 | value1_3 |
| t2 | value2_1 | value2_2 | value2_3 |
| t3 | value3_1 | value3_2 | value3_3 |
+-----------+-----------+-----------+-----------+
The order of the columns in result is arbitrary and should not affect the final answer. This illustrative example only contains three unique values in series (i.e. ID1, ID2 and ID3). Ideally, I would like to write a piece of code which automatically detects the unique values in series and generates a corresponding new column for each. Does anyone know where I could start from? I have tried grouping by timestamp and then collecting a set of distinct series and value pairs using the aggregate function collect_set, but with no luck :(
Many thanks in advance!
Marioanzas
Just a simple pivot:
import pyspark.sql.functions as F
result = df.groupBy('timestamp').pivot('series').agg(F.first('value'))
Make sure that each row in df is distinct; otherwise duplicate entries may be silently deduplicated.
Extending mck's answer, I have found a way of improving the pivot performance. pivot is a very expensive operation; hence, for Spark 2.0 onwards, it is recommended to provide the column values (if known) as an argument to the function, as shown in the code below. This improves performance for DataFrames much larger than the illustrative one posed in this question. Given that the values of series are known beforehand, we can use:
import pyspark.sql.functions as F
series_list = ('ID1', 'ID2', 'ID3')
result = df.groupBy('timestamp').pivot('series', series_list).agg(F.first('value'))
result.show()
+---------+--------+--------+--------+
|timestamp| ID1| ID2| ID3|
+---------+--------+--------+--------+
| t1|value1_1|value1_2|value1_3|
| t2|value2_1|value2_2|value2_3|
| t3|value3_1|value3_2|value3_3|
+---------+--------+--------+--------+

Combining two dataframes in Pandas Python [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I would like to combine two dataframes
I would like to combine both dataframes in such a way that the accts line up.
For example, acct 10 should have values in CME and NISSAN while the rest are zeros.
I think you can use df.combine_first():
It updates null elements with the value in the same location in the other frame.
df2.combine_first(df1)
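A tiny illustration of what combine_first does (toy frames, column names assumed):
import pandas as pd

# Each frame holds the value the other one is missing.
df1 = pd.DataFrame({"acct": [10, 30], "CME": [None, 6237.0]})
df2 = pd.DataFrame({"acct": [10, 30], "CME": [1728005.0, None]})

# Nulls in df2 are filled from df1 at the same row/column position,
# so acct 30 picks up 6237.0 while acct 10 keeps 1728005.0.
print(df2.combine_first(df1))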
Also, you can try:
pd.concat([df1.set_index('acct'),df2.set_index('acct')],axis=1).reset_index()
It looks like what you're trying to do is merge these two DataFrames.
You can use df.merge to merge the two. Since you want to match on the acct column, set the on keyword arg to "acct" and set how to "inner" to keep only those rows that appear in both DataFrames.
For example:
merged = df1.merge(df2, how="inner", on="acct")
Output:
+------+--------------------+------------------+-------------------+-----------+--------------------+-------------------+--------------------+
| acct | GOODM | KIS | NISSAN | CME | HKEX | OSE | SGX |
+------+--------------------+------------------+-------------------+-----------+--------------------+-------------------+--------------------+
| 10 | | | 1397464.227495019 | 1728005.0 | 0.0 | | |
| 30 | 30569.300965712766 | 4299649.75104102 | | 6237.0 | | | |
+------+--------------------+------------------+-------------------+-----------+--------------------+-------------------+--------------------+
If you want to fill empty values with zeroes, you can use df.fillna(0).
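If the goal is instead to keep every acct from both frames (as the question implies for acct 10) and fill the gaps with zeros, an outer merge plus fillna is one option, sketched under the same column-name assumption:
# df1 and df2 as in the answer above; keep every acct and replace missing values with 0.
merged_all = df1.merge(df2, how="outer", on="acct").fillna(0)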
