Inverse a column in pyspark - python

a b
'1' 1
'2' 2
'3' 3
'4' 4
I would like to insert a new column c that contains the values of column b in reverse order, while keeping the other columns unchanged.
Example:
a b c
'1' 1 4
'2' 2 3
'3' 3 2
'4' 4 1
We use temp['b'][::-1] to achieve this result in pandas. Is this transformation possible in pyspark as well?

Let's say your dataframe is ordered by column a.
You could try performing a self-join on a generated column that reverses the order. Such a column, rn, can be generated using row_number, e.g.:
Using pyspark api
from pyspark.sql import functions as F
from pyspark.sql import Window

output_df = (
    df.withColumn(
        "rn",
        F.row_number().over(Window.orderBy("a"))
    )
    .alias("df1")
    .join(
        df.withColumn(
            "rn",
            F.row_number().over(Window.orderBy(F.col("a").desc()))
        ).alias("df2"),
        ["rn"],
        "inner"
    )
    .selectExpr("df1.a", "df1.b", "df2.b as c")
)
Using spark sql
select
    df1.a,
    df1.b,
    df2.b as c
from (
    select
        *,
        row_number() over (order by a) rn
    from df
) df1
inner join (
    select
        b,
        row_number() over (order by a desc) rn
    from df
) df2
    on df1.rn = df2.rn;
a  b  c
1  1  4
2  2  3
3  3  2
4  4  1
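To try both snippets locally, here is a minimal sketch (an assumption, not part of the original answer) of building the sample dataframe and registering it for the Spark SQL variant:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question
df = spark.createDataFrame(
    [("1", 1), ("2", 2), ("3", 3), ("4", 4)],
    ["a", "b"],
)

# needed only for the Spark SQL version ("from df")
df.createOrReplaceTempView("df")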

Related

How to Count No. of Occurrences after Union All

I'm new to SQLite and am trying to figure out how to implement the following using Python.
The sample database table is as below:
from  to
1     2
1     3
3     4
4     1
How do I implement the UNION ALL and count the number of occurrences of each integer?
My code is as below:
combine_query = "SELECT * FROM (SELECT col1 FROM table UNION ALL SELECT col2 FROM table)"
c.execute(combine_query)
df = pd.DataFrame(c.fetchall(), columns=['Integers', 'Occurrences'])
Expected output:
Integers  No of Occurrences
1         3
2         1
3         2
4         2
Thank you.
We can try the following union approach:
SELECT val, COUNT(*) AS num
FROM
(
SELECT col1 AS val FROM yourTable
UNION ALL
SELECT col2 FROM yourTable
) t
GROUP BY val
ORDER BY val;
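Since the question asks how to run this from Python, a self-contained sketch with sqlite3 and pandas might look like the following; it builds the sample data in memory and reuses the yourTable / col1 / col2 names from the query above (adjust them to your real table, whose columns appear to be named "from" and "to"):

import sqlite3
import pandas as pd

# build the sample table from the question in an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE yourTable (col1 INTEGER, col2 INTEGER)')
conn.executemany('INSERT INTO yourTable VALUES (?, ?)',
                 [(1, 2), (1, 3), (3, 4), (4, 1)])

combine_query = """
SELECT val, COUNT(*) AS num
FROM
(
    SELECT col1 AS val FROM yourTable
    UNION ALL
    SELECT col2 FROM yourTable
) t
GROUP BY val
ORDER BY val
"""

# read_sql_query runs the aggregation and returns the result as a dataframe
df = pd.read_sql_query(combine_query, conn)
df.columns = ['Integers', 'No of Occurrences']
conn.close()
print(df)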

Python Pandas Compare 2 DataFrames [duplicate]

This question already has an answer here:
Find unique column values out of two different Dataframes
(1 answer)
Closed 1 year ago.
I'm working in Python with pandas and I have 2 DataFrames:

df1:
1 'A'
2 'B'

df2:
1 'A'
2 'B'
3 'C'
4 'D'

and I want to return the difference:
1 'C'
2 'D'
You can concatenate two dataframes and drop duplicates:
pd.concat([df1, df2]).drop_duplicates(keep=False)
If your dataframes contain more columns, you can restrict the comparison to a certain column with the subset argument:
pd.concat([df1, df2]).drop_duplicates(subset='col_name', keep=False)
What I get with pd.concat([df1, df2]).drop_duplicates(keep=False)
(N = name of the column):
df1:
N
0 A
1 B
2 C
df2:
N
0 A
1 B
2 C
df3
N
0 A
1 B
2 C
0 A
1 B
2 C
The values in the dataframes are phone numbers without the '+', so I can't show them here.
I import them with:
df1 = pd.DataFrame(ListResponse, columns=['33000000000'])
df2 = pd.read_csv('number.csv')
ListResponse returns a list of numbers, and number.csv is the ListResponse that I saved to CSV the last time I ran the script.
Edit: (what I want in this case is an empty DataFrame)
Just tested with new values:
df3:
N
0 A
1 B
2 C
3 D
0 B
1 C
2 D
Edit 2: I think drop_duplicates is not working because my function inserts new values at index = 0 instead of index = length + 1, as you can see just above. But when both dataframes contain the same values, it does not return an empty dataframe...
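For what it's worth, a small self-contained reproduction of the suggested approach with toy data (df4 below is just an illustrative extra frame, not from the question). drop_duplicates(keep=False) compares row values rather than the index, so two frames with identical values should give an empty result, provided the values really are equal (same dtype, no stray whitespace):

import pandas as pd

df1 = pd.DataFrame({'N': ['A', 'B', 'C']})
df2 = pd.DataFrame({'N': ['A', 'B', 'C']})

# rows that appear in only one of the two frames
diff = pd.concat([df1, df2]).drop_duplicates(keep=False)
print(diff.empty)  # True -> "Empty DataFrame"

# with differing frames, only the non-shared rows remain
df4 = pd.DataFrame({'N': ['B', 'C', 'D']})
print(pd.concat([df1, df4]).drop_duplicates(keep=False))
#    N
# 0  A
# 2  D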

Pandas groupby and then join multiple columns

Given the following dataframe:
A B C
1 2 3
1 9 8
df = df.groupby(['A'])['B'].apply(','.join).reset_index()
this produces
A B
1 2,9
However, I also want to join the 'C' column values together with a comma, the same way as B.
Expected:
A B C
1 2,9 3,8
I tried:
df = df.groupby(['A'])['B','C'].apply(','.join).reset_index()
Use GroupBy.agg, selecting the columns with a list (passing 'B','C' directly after groupby is no longer supported in recent pandas):
df = df.groupby(['A'])[['B', 'C']].agg(','.join).reset_index()
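A quick sanity check with the example data (assuming the values are stored as strings, since ','.join only works on strings):

import pandas as pd

df = pd.DataFrame({'A': ['1', '1'], 'B': ['2', '9'], 'C': ['3', '8']})

out = df.groupby('A')[['B', 'C']].agg(','.join).reset_index()
print(out)
#    A    B    C
# 0  1  2,9  3,8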

How to join pandas dataframes based on wildcards?

I have two dataframes, df and df2, and I would like to merge them with * as a wildcard.
import pandas as pd
data = [[".",".",1],["AB.","B.",3],["B.",".",2]]
data2 = [["A","B","1"],["ABC","BC",4],["B","A",2]]
columns = ["Type1","Type2","Value"]
df = pd.DataFrame(data,columns=columns)
df2 = pd.DataFrame(data2,columns=columns)
print(df)
print(df2)
Type1 Type2 Value
0 * * 1
1 AB* B* 3
2 B* * 2
Type1 Type2 Value
0 A B 1
1 ABC BC 4
2 B A 2
Here, for example, the second line of df2 should match lines 1 and 2 of df, whereas line 0 of df2 should only match the first line of df.
Somehow I would like to get something like
df2.merge(df,how='left',on=["Type1","Type2"])
But the result here does not match anything, since merge only joins on exact equality.
This is the result that I would like to get.
data3 = [["A","B","1","1"],["ABC","BC",4,1],["ABC","BC",4,3],["B","A",2,1],["B","A",2,2]]
columns3 = ["Type1","Type2","Value_x","Value_y"]
results = pd.DataFrame(data3,columns=columns3)
print(results)
Type1 Type2 Value_x Value_y
0 A B 1 1
1 ABC BC 4 1
2 ABC BC 4 3
3 B A 2 1
4 B A 2 2
Please note that the df2 table actually has more than 1 million rows, so I can't afford to loop over it for efficiency reasons.
Finally I decided to use the code below. It transfers the dataframes into a SQLite database, performs the join there, and brings the result back into another dataframe. This is not optimal, but it works.
import sqlite3
conn = sqlite3.connect(':memory:')
df.to_sql('df', conn, index=False)
df2.to_sql('df2', conn, index=False)
query = """
SELECT [df2].[Type1],
[df2].[Type2],
[df2].[value],
[df].[value]
FROM ([df]
LEFT OUTER JOIN [df2]
ON [df].[type1] LIKE [df2].[type1]
AND [df].[type2] LIKE [df2].[type2])
"""
df3 = pd.read_sql_query(query, conn)
conn.close()
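If the wildcard character really is '*', one alternative worth considering (not part of the solution above) is SQLite's GLOB operator, which treats '*' and '?' as wildcards. A sketch reusing the same in-memory setup, matching the concrete values in df2 against the patterns in df:

import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
df.to_sql('df', conn, index=False)
df2.to_sql('df2', conn, index=False)

# GLOB matches the value on the left against the pattern on the right
query = """
SELECT df2.Type1,
       df2.Type2,
       df2.Value AS Value_x,
       df.Value  AS Value_y
FROM df2
JOIN df
  ON df2.Type1 GLOB df.Type1
 AND df2.Type2 GLOB df.Type2
"""
df3 = pd.read_sql_query(query, conn)
conn.close()
print(df3)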

how to concat 2 pandas dataframes based on time ranges

I have two dataframes and I would like to concat them based on time ranges.
For example:
dataframe A
user timestamp product
A 2015/3/13 1
B 2015/3/15 2
dataframe B
user time behavior
A 2015/3/1 2
A 2015/3/8 3
A 2015/3/13 1
B 2015/3/1 2
I would like to concat the 2 dataframes as below (frame B left-joined to frame A),
where column "time1" falls within the 7 days before column "timestamp".
For example, when timestamp is 3/13, then 3/6-3/13 is in the range;
otherwise don't concat.
user timestamp product time1 behavior
A 2015/3/13 1 2015/3/8 3
A 2015/3/13 1 2015/3/13 1
B 2015/3/15 2 NaN NaN
The SQL would look roughly like:
select *
from B
left join A on A.user = B.user
where B.time >= A.timestamp - interval 7 day
  and B.time <= A.timestamp;
-- i.e. WHERE B.time BETWEEN DATE_SUB(A.timestamp, INTERVAL 7 DAY) AND A.timestamp
How can we do this in Python? I can only think of the following and don't know how to handle the time condition:
new = pd.merge(A, B, on='user', how='left')
Thanks!
The few steps required to solve this:
from datetime import timedelta
First, convert your timestamps to pandas datetime (df1 refers to dataframe A and df2 refers to dataframe B):
df1[['time']]=df1[['timestamp']].apply(pd.to_datetime)
df2[['time']]=df2[['time']].apply(pd.to_datetime)
Merge as follows (based on your final dataset, I think your left join is more of a right join):
df3 = pd.merge(df1,df2,how='left')
Get your final df:
df4 = df3[(df3.time>=df3.timestamp-timedelta(days=7)) & (df3.time<=df3.timestamp)]
The row containing NaN is missing because conditional joins are not a feature of pandas yet; a way to get past that is to filter after the join.
Here's one solution that relies on two merges - first, to narrow down dataframe B (df2), and then to produce the desired output:
We can read in the example dataframes with read_clipboard():
import pandas as pd
# copy dataframe A data, then:
df1 = pd.read_clipboard(parse_dates=['timestamp'])
# copy dataframe B data, then:
df2 = pd.read_clipboard(parse_dates=['time'])
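If you'd rather not depend on the clipboard, the same frames can be built directly; a sketch with the column names taken from the question:

import pandas as pd

df1 = pd.DataFrame({
    'user': ['A', 'B'],
    'timestamp': pd.to_datetime(['2015/3/13', '2015/3/15']),
    'product': [1, 2],
})
df2 = pd.DataFrame({
    'user': ['A', 'A', 'A', 'B'],
    'time': pd.to_datetime(['2015/3/1', '2015/3/8', '2015/3/13', '2015/3/1']),
    'behavior': [2, 3, 1, 2],
})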
Merge and filter:
# left merge df2, df1
tmp = df2.merge(df1, on='user', how='left')
# then drop rows which disqualify based on timestamp
mask = tmp.time < (tmp.timestamp - pd.Timedelta(days=7))
tmp.loc[mask, ['time', 'behavior']] = None
tmp = tmp.dropna(subset=['time']).drop(['timestamp','product'], axis=1)
# now perform final merge
merged = df1.merge(tmp, on='user', how='left')
Output:
user timestamp product time behavior
0 A 2015-03-13 1 2015-03-08 3.0
1 A 2015-03-13 1 2015-03-13 1.0
2 B 2015-03-15 2 NaT NaN
