a b
'1' 1
'2' 2
'3' 3
'4' 4
I would like to insert a new column which is the reverse of the b column, while keeping the other columns unchanged.
Example:
a b c
'1' 1 4
'2' 2 3
'3' 3 2
'4' 4 1
In pandas we use temp['b'][::-1] to achieve this result. Is this transformation possible in pyspark as well?
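For reference, a minimal sketch of that pandas approach (temp is the question's frame; .values is needed so the reversed series is not realigned by index on assignment):

import pandas as pd

temp = pd.DataFrame({'a': ['1', '2', '3', '4'], 'b': [1, 2, 3, 4]})
temp['c'] = temp['b'][::-1].values  # .values drops the index to avoid realignment
print(temp)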
Let's say your dataframe is ordered by column a.
You could try performing a self-join on a generated column that reverses the order. Such a column, rn, can be generated using row_number, e.g.:
Using the pyspark API:
from pyspark.sql import functions as F
from pyspark.sql import Window

output_df = (
    df.withColumn(
        # ascending position of each row
        "rn",
        F.row_number().over(Window.orderBy("a"))
    )
    .alias("df1")
    .join(
        df.withColumn(
            # descending position, so rn pairs each row with its mirror
            "rn",
            F.row_number().over(Window.orderBy(F.col("a").desc()))
        ).alias("df2"),
        ["rn"],
        "inner"
    )
    .selectExpr("df1.a", "df1.b", "df2.b as c")
)
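To try this out, the sample dataframe can be built as follows (a sketch; it assumes an active SparkSession bound to the name spark):

# hypothetical test setup; `spark` is an existing SparkSession
df = spark.createDataFrame(
    [("1", 1), ("2", 2), ("3", 3), ("4", 4)],
    ["a", "b"],
)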
Using Spark SQL:
select
    df1.a,
    df1.b,
    df2.b as c
from (
    select
        *,
        row_number() over (order by a) rn
    from
        df
) df1
inner join (
    select
        b,
        row_number() over (order by a desc) rn
    from
        df
) df2 on df1.rn = df2.rn;
Result:

a  b  c
1  1  4
2  2  3
3  3  2
4  4  1
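To run the SQL variant from pyspark, the dataframe first has to be registered as a temporary view (a sketch; the view name df is an assumption):

# expose the dataframe to Spark SQL under the name `df`
df.createOrReplaceTempView("df")
output_df = spark.sql("""
    select df1.a, df1.b, df2.b as c
    from (select *, row_number() over (order by a) rn from df) df1
    inner join (select b, row_number() over (order by a desc) rn from df) df2
        on df1.rn = df2.rn
""")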
I'm new to SQLite and am figuring out how to implement the following using Python.
The sample database table is as below:

from  to
1     2
1     3
3     4
4     1
How do I implement the UNION ALL and count the number of occurrences of each integer?
My code is as below:

combine_query = "SELECT * FROM (SELECT col1 FROM table UNION ALL SELECT col2 FROM table)"
c.execute(combine_query)
df = pd.DataFrame(c.fetchall(), columns=['Integers', 'Occurences'])
The desired output is:

Integers  No of Occurences
1         3
2         1
3         2
4         2
Thank you.
We can try the following union approach:
SELECT val, COUNT(*) AS num
FROM
(
    SELECT col1 AS val FROM yourTable
    UNION ALL
    SELECT col2 FROM yourTable
) t
GROUP BY val
ORDER BY val;
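Wired into Python, a minimal sketch might look like this (table and column names from the question; the cursor c is assumed to exist as in the question's code):

import pandas as pd

query = """
SELECT val, COUNT(*) AS num
FROM
(
    SELECT col1 AS val FROM yourTable
    UNION ALL
    SELECT col2 FROM yourTable
) t
GROUP BY val
ORDER BY val;
"""
df = pd.DataFrame(c.execute(query).fetchall(),
                  columns=['Integers', 'No of Occurences'])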
I'm working in Python with pandas and I have two DataFrames:

df1:
1 'A'
2 'B'

df2:
1 'A'
2 'B'
3 'C'
4 'D'
and I want to return the difference:
1 'C'
2 'D'
You can concatenate two dataframes and drop duplicates:
pd.concat([df1, df2]).drop_duplicates(keep=False)
If your dataframes contain more columns, you can restrict the comparison to a certain column with the subset argument:
pd.concat([df1, df2]).drop_duplicates(subset='col_name', keep=False)
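A quick end-to-end check on the sample data (a minimal sketch, using N as the column name as in the edits below):

import pandas as pd

df1 = pd.DataFrame({'N': ['A', 'B']})
df2 = pd.DataFrame({'N': ['A', 'B', 'C', 'D']})

# keep=False drops every member of a duplicated group, so only values
# present in exactly one of the two frames survive
print(pd.concat([df1, df2]).drop_duplicates(keep=False))
#    N
# 2  C
# 3  D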
What I retrieve with pd.concat([df1, df2]).drop_duplicates(keep=False)
(N = name of the column):
df1:
   N
0  A
1  B
2  C

df2:
   N
0  A
1  B
2  C

df3:
   N
0  A
1  B
2  C
0  A
1  B
2  C
The values in the dataframes are phone numbers without the '+', so I can't show them.
I import them with:

df1 = pd.DataFrame(ListResponse, columns=['33000000000'])
df2 = pd.read_csv('number.csv')

ListResponse returns a list of numbers, and number.csv is the ListResponse that I saved to CSV the last time I ran the script.
Edit: (what I want in this case is an empty DataFrame)
I just tested with new values:

df3:
   N
0  A
1  B
2  C
3  D
0  B
1  C
2  D
Edit 2: I think drop_duplicates is not working because my function inserts new values at index 0 rather than at index length+1, as you can see just above. But even when the same values are in both dataframes, it does not return an empty dataframe...
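Worth checking here (an assumption on my part, since the real numbers can't be shown): drop_duplicates compares values, not index labels, so inserting at index 0 is harmless; a dtype mismatch between the two frames would explain this, because the integer 33000000000 never equals the string '33000000000'. A minimal sketch of the check:

# hypothetical check: align dtypes before comparing (column name as in the
# examples above; for the phone data it would be '33000000000')
print(df1.dtypes, df2.dtypes, sep='\n')
df1['N'] = df1['N'].astype(str)
df2['N'] = df2['N'].astype(str)
print(pd.concat([df1, df2]).drop_duplicates(keep=False))  # empty if values truly match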
Given the following dataframe:
A B C
1 2 3
1 9 8
df = df.groupby(['A'])['B'].apply(','.join).reset_index()
this produces
A B
1 2,9
However, I also want to join the 'C' column values together with a comma, the same way as B.
Expected:
A B C
1 2,9 3,8
I tried:
df = df.groupby(['A'])['B','C'].apply(','.join).reset_index()
Use GroupBy.agg, selecting the columns with a list (indexing a groupby with ['B','C'] directly is deprecated):
df = df.groupby(['A'])[['B','C']].agg(','.join).reset_index()
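A quick run on the sample data (a sketch; note that ','.join needs strings, so the columns are cast with astype(str) first, which the question's integer values would otherwise trip over):

import pandas as pd

df = pd.DataFrame({'A': [1, 1], 'B': [2, 9], 'C': [3, 8]})

# ','.join only accepts strings, so cast the joined columns
df[['B', 'C']] = df[['B', 'C']].astype(str)
out = df.groupby('A')[['B', 'C']].agg(','.join).reset_index()
print(out)
#    A    B    C
# 0  1  2,9  3,8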
I have two dataframes df & df2 and I would like to merge them with * as a wildcard.
import pandas as pd
data = [["*","*",1],["AB*","B*",3],["B*","*",2]]
data2 = [["A","B",1],["ABC","BC",4],["B","A",2]]
columns = ["Type1","Type2","Value"]
df = pd.DataFrame(data,columns=columns)
df2 = pd.DataFrame(data2,columns=columns)
print(df)
print(df2)
Type1 Type2 Value
0 * * 1
1 AB* B* 3
2 B* * 2
Type1 Type2 Value
0 A B 1
1 ABC BC 4
2 B A 2
Here the second row of df2 (ABC, BC) should match both the */* and AB*/B* rows of df, whereas row 0 of df2 (A, B) should only match the */* row of df.
Somehow I would like to get something like

df2.merge(df, how='left', on=["Type1", "Type2"])

but a plain merge compares values literally, so the result here does not match anything.
This is the result that I would like to get.
data3 = [["A","B",1,1],["ABC","BC",4,1],["ABC","BC",4,3],["B","A",2,1],["B","A",2,2]]
columns3 = ["Type1","Type2","Value_x","Value_y"]
results = pd.DataFrame(data3,columns=columns3)
print(results)
Type1 Type2 Value_x Value_y
0 A B 1 1
1 ABC BC 4 1
2 ABC BC 4 3
3 B A 2 1
4 B A 2 2
Please note that the df2 table actually has more than 1 million rows, so I can't afford to loop over it for efficiency reasons.
Finally I decided to use the code below. It transfers the dataframes into a SQLite database, performs the join there, and brings the result back into another dataframe. This is not optimal, but it works.
import sqlite3

conn = sqlite3.connect(':memory:')

# SQLite's LIKE wildcards are % and _, so the * patterns in df must be
# translated first, e.g. df = df.replace({'\*': '%'}, regex=True)
df.to_sql('df', conn, index=False)
df2.to_sql('df2', conn, index=False)

# df carries the patterns, so it belongs on the right-hand side of LIKE
query = """
SELECT [df2].[Type1],
       [df2].[Type2],
       [df2].[Value] AS Value_x,
       [df].[Value]  AS Value_y
FROM [df2]
LEFT OUTER JOIN [df]
    ON [df2].[Type1] LIKE [df].[Type1]
    AND [df2].[Type2] LIKE [df].[Type2]
"""
df3 = pd.read_sql_query(query, conn)
conn.close()
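For comparison, here is a sketch of a pure-pandas variant of the same idea: turn each * pattern into an anchored regex, cross-join, and filter. The row-wise check is still Python-level work, so on a million-row df2 the SQLite route above may well scale better (how='cross' needs pandas >= 1.2; column names are those from the question):

import re
import pandas as pd

def to_regex(pattern: str) -> str:
    # 'AB*' -> 'AB.*'; everything except * is escaped literally
    return '.*'.join(re.escape(part) for part in pattern.split('*'))

pat = df.assign(re1=df['Type1'].map(to_regex), re2=df['Type2'].map(to_regex))
cross = df2.merge(pat, how='cross', suffixes=('_x', '_y'))
keep = [
    re.fullmatch(r1, t1) is not None and re.fullmatch(r2, t2) is not None
    for t1, t2, r1, r2 in zip(cross['Type1_x'], cross['Type2_x'],
                              cross['re1'], cross['re2'])
]
result = (cross[keep]
          .drop(columns=['Type1_y', 'Type2_y', 're1', 're2'])
          .rename(columns={'Type1_x': 'Type1', 'Type2_x': 'Type2'}))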
I have two dataframes and I would like to concatenate them based on time ranges.
For example:
dataframe A
user timestamp product
A 2015/3/13 1
B 2015/3/15 2
dataframe B
user time behavior
A 2015/3/1 2
A 2015/3/8 3
A 2015/3/13 1
B 2015/3/1 2
I would like to concatenate the two dataframes as below (frame B left-joined to frame A), where column "time1" must fall within the 7 days before column "timestamp".
For example, when timestamp is 3/13, the range 3/6-3/13 qualifies; otherwise don't concatenate.
user timestamp product time1 behavior
A 2015/3/13 1 2015/3/8 3
A 2015/3/13 1 2015/3/13 1
B 2015/3/15 2 NaN NaN
The SQL would look something like:

SELECT *
FROM B LEFT JOIN A ON B.user = A.user
WHERE B.time >= DATE_SUB(A.timestamp, INTERVAL 7 DAY)
  AND B.time <= A.timestamp;
How can we do this in Python? I can only think of the following and don't know how to handle the time condition:

new = pd.merge(A, B, on='user', how='left')

Thanks, and sorry.
The few steps required to solve this:

from datetime import timedelta

First, convert your timestamps to pandas datetimes (df1 refers to dataframe A and df2 refers to dataframe B):

df1['timestamp'] = pd.to_datetime(df1['timestamp'])
df2['time'] = pd.to_datetime(df2['time'])
Merge as follows (based on your final dataset, I think your left join is really more of a right join):

df3 = pd.merge(df1, df2, on='user', how='left')
Get your final df:
df4 = df3[(df3.time>=df3.timestamp-timedelta(days=7)) & (df3.time<=df3.timestamp)]
The row containing NaN is missing, and this is because of the way conditional joins are done in pandas. Conditional joins are not a feature of pandas yet; a way to get past that is by filtering after a join.
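If the NaN row for user B is needed, a small sketch (using the same re-merge idea as the answer below) is to left-merge the filtered frame back onto df1:

# re-attach users whose rows were all dropped by the filter;
# their time and behavior come back as NaT/NaN
final = df1.merge(df4, on=['user', 'timestamp', 'product'], how='left')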
Here's one solution that relies on two merges: first to narrow down dataframe B (df2), and then to produce the desired output.
We can read in the example dataframes with read_clipboard():
import pandas as pd
# copy dataframe A data, then:
df1 = pd.read_clipboard(parse_dates=['timestamp'])
# copy dataframe B data, then:
df2 = pd.read_clipboard(parse_dates=['time'])
Merge and filter:
# left merge df2, df1
tmp = df2.merge(df1, on='user', how='left')
# then drop rows which disqualify based on timestamp
mask = tmp.time < (tmp.timestamp - pd.Timedelta(days=7))
tmp.loc[mask, ['time', 'behavior']] = None
tmp = tmp.dropna(subset=['time']).drop(['timestamp','product'], axis=1)
# now perform final merge
merged = df1.merge(tmp, on='user', how='left')
Output:
user timestamp product time behavior
0 A 2015-03-13 1 2015-03-08 3.0
1 A 2015-03-13 1 2015-03-13 1.0
2 B 2015-03-15 2 NaT NaN