I have two PySpark dataframes that share some key IDs but have different values in the other columns. What I want to achieve is to merge one dataframe into the other.
First Dataframe:
ID1  ID2  DATE        VAL1  VAL2
19   22   05-03-2012  311   622
20   30   05-03-2012  40    60
20   30   06-03-2012  70    120
20   30   07-03-2012  480   3
20   30   08-03-2012  49    98
Second Dataframe:
ID1  ID2  DATE        VAL1  VAL2
19   22   07-03-2012  311   622
20   30   06-03-2012  22    2
Final DF:
ID1  ID2  DATE        VAL1  VAL2
19   22   05-03-2012  311   622
19   22   07-03-2012  311   622
20   30   05-03-2012  40    60
20   30   06-03-2012  70    120
20   30   07-03-2012  480   3
20   30   08-03-2012  49    98
As you can see, every row that is absent from one of the dataframes is present in the final dataframe, and rows with the same ID1, ID2, DATE are taken from the first dataframe. These are simplified examples; the real dataframes are much more complicated, with different columns (I'll select the important ones) and hundreds of thousands of rows.
I was experimenting with an outer join, but after many tries I've lost all hope, so I'd be grateful for any help.
This should work -
Essentially, first do a left_anti join to extract only those rows that are present in the second dataframe but absent from the first, then union them (i.e. append them) onto the first dataframe:
Seq<String> colList = convertListToSeq(Stream.of("id1", "id2", "date").collect(Collectors.toList()));
// Only present in Right
Dataset<Row> missingInLeft = rightDF.join(leftDF, colList, "left_anti");
leftDF.union(missingInLeft).show(); // Left + Missing in left
Update:
PySpark code:
leftDF.union(rightDF.join(leftDF, ["id1", "id2", "date"], how='left_anti')).show()
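For reference, here's a self-contained PySpark sketch of the same idea using the sample data from the question (column names lowercased to match the join keys in the answer; adjust to your real schema):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cols = ["id1", "id2", "date", "val1", "val2"]
left_df = spark.createDataFrame(
    [(19, 22, "05-03-2012", 311, 622),
     (20, 30, "05-03-2012", 40, 60),
     (20, 30, "06-03-2012", 70, 120),
     (20, 30, "07-03-2012", 480, 3),
     (20, 30, "08-03-2012", 49, 98)], cols)

right_df = spark.createDataFrame(
    [(19, 22, "07-03-2012", 311, 622),
     (20, 30, "06-03-2012", 22, 2)], cols)

# Rows of right_df whose (id1, id2, date) key does not appear in left_df
missing_in_left = right_df.join(left_df, ["id1", "id2", "date"], how="left_anti")

# Append them to left_df; unionByName avoids relying on column order
final_df = left_df.unionByName(missing_in_left)
final_df.orderBy("id1", "id2", "date").show()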
I'm working on a large dataset with more than 60K rows.
I have a continuous measurement of current in a column. Each code is measured for about a second, during which the equipment samples it 14/15/16/17 times depending on the equipment speed; the measurement then moves to the next code and again samples it 14/15/16/17 times, and so forth.
Every time the measurement moves from one code to another, there is a jump of more than 0.15 in the current measurement.
The top 48 rows of the data are as follows:
Index  Curr(mA)
0      1.362476
1      1.341721
2      1.362477
3      1.362477
4      1.355560
5      1.348642
6      1.327886
7      1.341721
8      1.334804
9      1.334804
10     1.348641
11     1.362474
12     1.348644
13     1.355558
14     1.334805
15     1.362477
16     1.556172
17     1.542336
18     1.549252
19     1.528503
20     1.549254
21     1.528501
22     1.556173
23     1.556172
24     1.542334
25     1.556172
26     1.542336
27     1.542334
28     1.556170
29     1.535415
30     1.542334
31     1.729109
32     1.749863
33     1.749861
34     1.749861
35     1.736024
36     1.770619
37     1.742946
38     1.763699
39     1.749861
40     1.749861
41     1.763703
42     1.756781
43     1.742946
44     1.736026
45     1.756781
46     1.964308
47     1.957395
I want to write a script where the 14/15/16/17 similar measurements for each code are averaged into a separate column, one value per code. I have been thinking of doing this with pandas.
I want the data to look like
Index  Curr(mA)
0      1.34907
1      1.54556
2      1.74986
I need some help to get this done.
First, get the indexes of every row where there's a jump. Use pandas' DataFrame.diff() to get the difference between the value in each row and the previous row, then check whether it's greater than 0.15. Use that to filter the dataframe index, and save the resulting indices (three, in the case of your sample data) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on whether there are more columns in the source dataframe that you want in the output, or whether it's really just Curr(mA) and the index. In the latter case, you can use np.split() to cut the dataframe into a list of dataframes based on the indexes you just pulled, then average them in a list comprehension (a sketch for the former, more-columns case follows at the end of this answer).
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To get it to match your desired output above (the same thing, but as a dataframe rather than a list), convert the list to a pd.Series and reset_index().
pd.Series(
    [df['Curr(mA)'].mean() for df in np.split(df, indices)]
).reset_index()

   index         0
0      0  1.349073
1      1  1.545564
2      2  1.749863
3      3  1.960851
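For the former case, where the dataframe has extra columns you want to keep, a hedged alternative (not shown in the snippets above) is to label each segment with a cumulative counter and aggregate per segment; the sample data and the Volt column below are made up for illustration:
import pandas as pd

# Made-up sample: three codes, with jumps of more than 0.15 between them
df = pd.DataFrame({
    "Curr(mA)": [1.36, 1.34, 1.35, 1.55, 1.54, 1.56, 1.75, 1.74],
    "Volt":     [3.30, 3.31, 3.29, 3.32, 3.30, 3.31, 3.33, 3.32],
})

# Each jump > 0.15 starts a new segment; cumsum turns the booleans into segment ids
segment = (df["Curr(mA)"].diff() > 0.15).cumsum()

out = df.groupby(segment).agg({"Curr(mA)": "mean", "Volt": "mean"})
print(out.reset_index(drop=True))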
I have a pandas DataFrame (70,000 rows) like this:
No     Time      Length
1      2.12079   60
2      2.581     12
3      2.7172    60
4      3.43883   60
5      3.6883    54
6      3.7233    54
...    ...       ...
70000  172.2777  24
In the column Time I have float values representing time in seconds. I need to drop rows that have duplicate Length values, but only within a range of 1 second. So my output should look like this:
No   Time     Length
1    2.12079  60
2    2.581    12
4    3.43883  60
5    3.6883   54
...  ...      ...
Can somebody help me?
Since your requirement is
"I want to delete duplicates from 1.0000s to 2.000s and then 2.000 to 3.000"
create a temporary column that is the floor of your time value (same as //1), then drop duplicates with respect to that and the Length.
df = (df.assign(Time2=df.Time//1)
        .drop_duplicates(['Time2', 'Length'])
        .drop(columns='Time2'))
# No Time Length
#0 1 2.12079 60
#1 2 2.58100 12
#3 4 3.43883 60
#4 5 3.68830 54
#6 70 000 172.27770 24
Maybe this is what you seek?
df["int_time"] = df["Time"].astype(int)
print(
    df.groupby(["Length", "int_time"], as_index=False)
    .first()
    .sort_values(by="No")
    .drop(columns="int_time")
)
Prints:
Length No Time
3 60 1 2.12079
0 12 2 2.58100
4 60 4 3.43883
2 54 5 3.68830
1 24 70000 172.27770
I'm very confused by your example output, but based on your description, and assuming the dataframe is already sorted by time: you could make a new column containing the difference between the time in row x and the time in row x-1:
df['Diff'] = df['Time'].diff()
Then keep only the rows where Diff > 1:
df['Diff'] = df['Diff'].fillna(999) # make sure first row doesn't get deleted
df = df[df['Diff'] > 1].drop(['Diff'],axis=1) # filter and drop the temporary "Diff" column
I want to get a new dataframe, in which I can see the sum of certain columns for rows that have the same values in the 'index' columns (campaign_id and group_name in my example).
This is a sample of my dataframe:
campaign_id group_name clicks conversions cost label city_id
101 blue 40 15 100 foo 15
102 red 20 5 50 bar 12
102 red 7 3 25 bar 12
102 brown 5 0 18 bar 12
this is what I want to get:
campaign_id group_name clicks conversions cost label city_id
101 blue 40 15 100 foo 15
102 red 27 8 75 bar 12
102 brown 5 0 18 bar 12
I tried:
df = df.groupby(['campaign_id','group_name'])['clicks','conversions','cost'].sum().reset_index()
but this gives me only the mentioned (summarized) columns (and the index columns), like this:
campaign_id group_name clicks conversions cost
101 blue 40 15 100
102 red 27 8 75
102 brown 5 0 18
I can try to add the leftover columns after this operation, but I'm not sure that would be an optimal and adequate way to solve the problem.
Is there a simple way to sum certain columns and leave the other columns untouched? (I don't care if they differ, because in my data all leftover columns have the same data for rows with the same corresponding values in the 'index' columns, which are campaign_id and group_name.)
When I finished my post I saw the answer right away: since all columns except those I want to sum have matching values, I just need to include all of those columns in the multi-index for this operation. Like this:
df = df.groupby(['campaign_id','group_name','label','city_id'])[['clicks','conversions','cost']].sum().reset_index()
In this case I got exactly what I wanted.
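A hedged alternative, in case some leftover columns are not guaranteed to be identical within a group: group only on the real keys and take 'first' for the rest (example data from the question):
import pandas as pd

df = pd.DataFrame({
    "campaign_id": [101, 102, 102, 102],
    "group_name": ["blue", "red", "red", "brown"],
    "clicks": [40, 20, 7, 5],
    "conversions": [15, 5, 3, 0],
    "cost": [100, 50, 25, 18],
    "label": ["foo", "bar", "bar", "bar"],
    "city_id": [15, 12, 12, 12],
})

out = (df.groupby(["campaign_id", "group_name"], as_index=False)
         .agg({"clicks": "sum", "conversions": "sum", "cost": "sum",
               "label": "first", "city_id": "first"}))
print(out)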
I have two dataframes of the same length (39014 rows): one has a datetime index, and the other just a regular index. I need to copy one column into the other, but when the copy is made it returns NaNs. I did:
df_datetime["newcol"]=df_regular["col"]
If you check the column newcol in df_datetime, it's full of NaN, even though the column col of df_regular has numbers. Why is this happening? How can I fix it? Thanks!
I also tried:
pd.merge(df_datetime, df_regular[["col"]], left_index=True, right_index=True, how='left')
And the same thing happens.
This is because the indices are not aligned.
When you assign a new column like that, df_datetime.loc[x, 'newcol'] gets the value of df_regular.loc[x, 'col'] for each index label x, and the datetime index shares no labels with the regular index, so every value ends up NaN.
You need to make the two indices match (or assign by position) before copying the column.
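A tiny sketch of the alignment issue with made-up data: label-based assignment gives NaN because the two indices share no labels, while assigning the underlying values copies by position:
import pandas as pd

df_datetime = pd.DataFrame({"N1": [120, 130, 140]},
                           index=pd.date_range("2019-09-01", periods=3))
df_regular = pd.DataFrame({"col": [23, 26, 48]})

df_datetime["newcol"] = df_regular["col"]         # all NaN: no matching index labels
df_datetime["newcol"] = df_regular["col"].values  # copies by position
print(df_datetime)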
What if you do this:
listOfColumn = list(df_regular["col"])
df_datetime["newcol"] = listOfColumn
As df_datetime and df_regular I prepared the following DataFrames:
N1
Dat
2019-09-01 120
2019-09-02 130
2019-09-03 140
2019-09-04 150
2019-09-05 160
and
col N2
0 23 19
1 26 32
2 48 61
3 51 53
4 62 60
Both with 5 rows.
If you want to "add" col column from df_regular to df_datetime
ignoring index values in both DataFrames, run:
df_datetime['newcol'] = df_regular.col.values
The result is:
N1 newcol
Dat
2019-09-01 120 23
2019-09-02 130 26
2019-09-03 140 48
2019-09-04 150 51
2019-09-05 160 62
I have 3 days of time series data with multiple columns, all in one single DataFrame. I want 3 different DataFrames split by the values in the column "Dates", i.e. df["Dates"].
For Example:
Available Dataframe is: df
Expected Output: Based on Three different Dates
First DataFrame: df_23
Second DataFrame: df_24
Third DataFrame: df_25
I want to use these all three DataFrames separately for analysis.
I tried the code below, but I am not able to use those three dataframes (rather, I don't know how to use them). Can anybody help me make my code work better? Thank you.
The above code just prints the DataFrame as three DataFrames, and not as expected.
Unsure if you're saving your variable into a csv or keeping it in memory for further use; you could put each unique value into a dict and access it by its key:
print(df)
Cal Dates
0 85 23
1 75 23
2 74 23
3 97 23
4 54 24
5 10 24
6 77 24
7 95 24
8 58 25
9 53 25
10 44 25
11 94 25
d = {}
for frame, data in df.groupby('Dates'):
    d[f'df{frame}'] = data
print(d['df23'])
Cal Dates
0 85 23
1 75 23
2 74 23
3 97 23
Edit, for the updated request:
for k, v in d.items():
    i = v['Cal'].loc[v['Cal'] > 70].count()
    print(f"{v['Dates'].unique()[0]} --> {i} times")
23 --> 4 times
24 --> 2 times
25 --> 1 times