Pandas groupby on multiple values - python

Start with a sorted table:
Index | A | B | C |
0 | A1| 0 | Group 1 |
1 | A1| 0 | Group 1 |
2 | A1| 1 | Group 2 |
3 | A1| 1 | Group 2 |
4 | A1| 2 | Group 3 |
5 | A1| 2 | Group 3 |
6 | A2| 7 | Group 4 |
7 | A2| 7 | Group 4 |
The desired result contains records 0, 1, 2, 3, 6, 7.
First I want to create groups based on columns A and B.
Then I want only the first two subgroups of each column-A group returned.
I want all the records of those subgroups returned.
Thank you so much.

Use factorize inside a groupby transform and keep the rows whose factorized B value is less than 2:
df[df.groupby('A').B.transform(lambda x: x.factorize()[0]).lt(2)]
# same as
# df[df.groupby('A').B.transform(lambda x: x.factorize()[0]) < 2]
A B C
0 A1 0 Group 1
1 A1 0 Group 1
2 A1 1 Group 2
3 A1 1 Group 2
6 A2 7 Group 4
7 A2 7 Group 4
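For reference, a small self-contained sketch of what that transform produces (the DataFrame below just rebuilds the example): factorize numbers each distinct B value within an A group in order of first appearance, so .lt(2) keeps the first two subgroups.
import pandas as pd

df = pd.DataFrame({'A': ['A1'] * 6 + ['A2'] * 2,
                   'B': [0, 0, 1, 1, 2, 2, 7, 7],
                   'C': ['Group 1', 'Group 1', 'Group 2', 'Group 2',
                         'Group 3', 'Group 3', 'Group 4', 'Group 4']})

# subgroup number of each row within its A group: 0 0 1 1 2 2 0 0
print(df.groupby('A').B.transform(lambda x: x.factorize()[0]))

# keep rows belonging to the first two subgroups of each A group
print(df[df.groupby('A').B.transform(lambda x: x.factorize()[0]).lt(2)])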

Related

How to merge and sum multi index dataframes

I would like to add multiple multi-index data frames together, with differing columns; where columns/cells overlap, the values should be added together.
Constraints:
There are up to n dataframes to be merged and summed
The code must not reference the multi-index columns or index by name (it should stay general)
Example:
df1
Category | A | B | C |
Region | AC| AD| AK|
0-5 years | 2 | 3 | 4 |
5-10 years | 1 | 2 | 5 |
10-12 years| 2 | 0 | 2 |
df2
Category | A | B | D |
Region | AC| AD| AM|
0-5 years | 1 | 4 | 1 |
5-10 years | 2 | 5 | 1 |
10-12 years| 3 | 6 | 0 |
Desired outcome:
Category | A | B | C | D |
Region | AC| AD| AK| AM|
0-5 years | 3 | 7 | 4 | 1 |
5-10 years | 3 | 7 | 5 | 1 |
10-12 years| 5 | 6 | 2 | 0 |
Any help would be greatly appreciated, thanks in advance!
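One possible approach, sketched below under the assumption that the frames share the same row index and that overlapping cells should simply be summed (the two frames are rebuilt from the example; the summing step itself never names a specific column or index level):
import pandas as pd
from functools import reduce

cols1 = pd.MultiIndex.from_tuples([('A', 'AC'), ('B', 'AD'), ('C', 'AK')],
                                  names=['Category', 'Region'])
cols2 = pd.MultiIndex.from_tuples([('A', 'AC'), ('B', 'AD'), ('D', 'AM')],
                                  names=['Category', 'Region'])
idx = ['0-5 years', '5-10 years', '10-12 years']

df1 = pd.DataFrame([[2, 3, 4], [1, 2, 5], [2, 0, 2]], index=idx, columns=cols1)
df2 = pd.DataFrame([[1, 4, 1], [2, 5, 1], [3, 6, 0]], index=idx, columns=cols2)

# works for any number of frames: align on index and columns,
# treat cells missing on one side as 0, then add
dfs = [df1, df2]
result = reduce(lambda left, right: left.add(right, fill_value=0), dfs)
print(result)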

Pandas sort_values, duplicate values in sort column

How does pandas treat equal values in the column it is sorting by?
dataFrame1
a | b | c |
--|---|---|
1 | 2 | 2 |
2 | 1 | 6 |
2 | 1 | 5 |
3 | 4 | 2 |
If I run dataFrame1.sort_values(by=['a'], ascending=True), how does it treat the duplicate values in a?
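For what it's worth, a short illustration: the default kind='quicksort' gives no guarantee about the relative order of rows with equal keys, while a stable algorithm such as 'mergesort' keeps their original order (the frame below just rebuilds the example):
import pandas as pd

dataFrame1 = pd.DataFrame({'a': [1, 2, 2, 3],
                           'b': [2, 1, 1, 4],
                           'c': [2, 6, 5, 2]})

# 'mergesort' is stable: rows with the same value in 'a' keep their
# original relative order; the default 'quicksort' makes no such promise
print(dataFrame1.sort_values(by=['a'], ascending=True, kind='mergesort'))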

Find count of unique value of each column and save in CSV

I have data like this:
+---+---+---+
| A | B | C |
+---+---+---+
| 1 | 2 | 7 |
| 2 | 2 | 7 |
| 3 | 2 | 1 |
| 3 | 2 | 1 |
| 3 | 2 | 1 |
+---+---+---+
I need to count the unique values of each column and report them like below:
+---+---+---+
| A | 3 | 3 |
| A | 2 | 1 |
| A | 1 | 1 |
| B | 2 | 5 |
| C | 1 | 3 |
| C | 7 | 2 |
+---+---+---+
I have no issue when the number of columns is small and I name them manually, but when the input file is big that becomes hard; I need a simple way to produce this output.
here is the code I have
import pandas as pd
df=pd.read_csv('1.csv')
A=df['A']
B=df['B']
C=df['C']
df1=A.value_counts()
df2=B.value_counts()
df3=C.value_counts()
all = {'A': df1,'B': df2,'C': df3}
result = pd.concat(all)
result.to_csv('out.csv')
Use DataFrame.stack with SeriesGroupBy.value_counts, then convert the Series to a DataFrame with Series.rename_axis and Series.reset_index:
df = pd.read_csv('1.csv')
result = (df.stack()
            .groupby(level=1)
            .value_counts()
            .rename_axis(['X','Y'])
            .reset_index(name='Z'))
print(result)
X Y Z
0 A 3 3
1 A 1 1
2 A 2 1
3 B 2 5
4 C 1 3
5 C 7 2
result.to_csv('out.csv', index=False)
You can loop over the columns and insert them into a dictionary.
Initialize the dictionary with all = {}. To keep it scalable, read the columns with colm = df.columns, which gives you all the columns in your df.
Try this code:
import pandas as pd

df = pd.read_csv('1.csv')
all = {}
colm = df.columns
for i in colm:
    # one value_counts Series per column, keyed by the column name
    all.update({i: df[i].value_counts()})
result = pd.concat(all)
result.to_csv('out.csv')
To find the unique values of a column:
df.A.unique()
To get the count of those unique values:
len(df.A.unique())
unique() returns an array, so len() gives the number of unique values.

Add values in two Spark DataFrames, row by row

I have two Spark DataFrames whose values I would like to add, then multiply, keeping only the lowest resulting value for each pair. I have written a function that does this:
def math_func(aValOne, aValTwo, bValOne, bValTwo):
    tmpOne = aValOne + bValOne
    tmpTwo = aValTwo + bValTwo
    final = tmpOne * tmpTwo
    return final
I would like to iterate through two Spark DataFrames, "A" and "B", row by row, and keep the lowest values results. So if I have two DataFrames:
DataFrameA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DataFrameB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
I would like to first take row 0 from DataFrameA, compare it to rows 0 and 1 of DataFrameB, and then keep the lowest value results. I have tried this:
results = DataFrameA.select('ID')(lambda i: DataFrameA.select('ID')(math_func(DataFrameA.ValOne, DataFrameA.ValTwo, DataFrameB.ValOne, DataFrameB.ValOne))
but I get errors about iterating through a DataFrame column. I know that in Pandas I would essentially make a nested "for loop", and then just write the results to another DataFrame and append the results. The results I would expect are:
Initial Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
0 | 117 | 1
1 | 77 | 0
1 | 150 | 1
Final Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
1 | 77 | 0
I am quite new at Spark, but I know enough to know I'm not approaching this the right way.
Any thoughts on how to go about this?
You will need multiple steps to achieve this.
Suppose you have data
DFA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DFB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
Step 1.
Do a cartesian join on your 2 dataframes. That will give you:
Cartesian:
DFA.ID | DFA.ValOne | DFA.ValTwo | DFB.ID | DFB.ValOne | DFB.ValTwo
0 | 2 | 4 | 0 | 4 | 5
1 | 3 | 6 | 0 | 4 | 5
0 | 2 | 4 | 1 | 7 | 9
1 | 3 | 6 | 1 | 7 | 9
Step 2.
Multiply columns:
Multiplied:
DFA.ID | DFA.Mul | DFB.ID | DFB.Mul
0 | 8 | 0 | 20
1 | 18 | 0 | 20
0 | 8 | 1 | 63
1 | 18 | 1 | 63
Step 3.
Group by DFA.ID and select min from DFA.Mul and DFB.Mul
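A sketch of these steps in PySpark is below. One adjustment: instead of multiplying within each frame as in Step 2, the per-pair value is computed with the question's math_func (add the matching columns, then multiply the sums), which is what reproduces the Initial and Final Results shown in the question; the aliases and variable names are only illustrative. A window function keeps the matching B ID alongside the minimum, which a plain groupBy/min would drop.
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

dfa = spark.createDataFrame([(0, 2, 4), (1, 3, 6)], ['ID', 'ValOne', 'ValTwo'])
dfb = spark.createDataFrame([(0, 4, 5), (1, 7, 9)], ['ID', 'ValOne', 'ValTwo'])

# Step 1: cartesian join -- every row of A paired with every row of B
a = dfa.select(F.col('ID').alias('A_ID'),
               F.col('ValOne').alias('A_ValOne'),
               F.col('ValTwo').alias('A_ValTwo'))
b = dfb.select(F.col('ID').alias('B_ID'),
               F.col('ValOne').alias('B_ValOne'),
               F.col('ValTwo').alias('B_ValTwo'))
pairs = a.crossJoin(b)

# Step 2 (adjusted): per-pair value via the question's math_func
pairs = pairs.withColumn(
    'Value',
    (F.col('A_ValOne') + F.col('B_ValOne')) * (F.col('A_ValTwo') + F.col('B_ValTwo')))

# Step 3: keep only the lowest Value for each A_ID
w = Window.partitionBy('A_ID').orderBy('Value')
result = (pairs.withColumn('rn', F.row_number().over(w))
               .filter(F.col('rn') == 1)
               .select('A_ID', 'Value', 'B_ID'))
result.show()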

Interleaving Pandas Dataframes by Timestamp

I've got 2 pandas DataFrames, each containing 2 columns. One of the columns is a timestamp column [t], the other contains sensor readings [s].
I now want to create a single DataFrame, containing 4 columns, that is interleaved on the timestamp column.
Example:
First Dataframe:
+----+----+
| t1 | s1 |
+----+----+
| 0 | 1 |
| 2 | 3 |
| 3 | 3 |
| 5 | 2 |
+----+----+
Second DataFrame:
+----+----+
| t2 | s2 |
+----+----+
| 1 | 5 |
| 2 | 3 |
| 4 | 3 |
+----+----+
Target:
+----+----+----+----+
| t1 | t2 | s1 | s2 |
+----+----+----+----+
| 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 5 |
| 2 | 1 | 3 | 5 |
| 2 | 2 | 3 | 3 |
| 3 | 2 | 3 | 3 |
| 3 | 4 | 3 | 3 |
| 5 | 4 | 2 | 3 |
+----+----+----+----+
I had a look at pandas.merge, but that left me with a lot of NaNs and an unsorted table.
a.merge(b, how='outer')
Out[55]:
t1 s1 t2 s2
0 0 1 NaN NaN
1 2 3 2 3
2 3 3 NaN NaN
3 5 2 NaN NaN
4 1 NaN 1 5
5 4 NaN 4 3
Merging puts NaNs wherever a key value is not present in both frames; it will not create new data that is not present in the dataframes being merged.
For example, index 0 in your target dataframe shows t2 with a value of 0. That value is not present in the second dataframe, so you cannot expect it to appear in the merged dataframe either. The same applies to the other rows.
What you can do instead is reindex the dataframes to a common index. In your case, since the maximum index value is 5 in the target dataframe, let's use this list to reindex both input dataframes:
In [382]: ind
Out[382]: [0, 1, 2, 3, 4, 5]
Now we reindex both inputs to this index:
In [372]: x = a.set_index('t1').reindex(ind).fillna(0).reset_index()
In [373]: x
Out[373]:
t1 s1
0 0 1
1 1 0
2 2 3
3 3 3
4 4 0
5 5 2
In [374]: y = b.set_index('t2').reindex(ind).fillna(0).reset_index()
In [375]: y
Out[375]:
t2 s2
0 0 0
1 1 5
2 2 3
3 3 0
4 4 3
5 5 0
And now we merge them to get something close to the target dataframe:
In [376]: x.merge(y, left_on=['t1'], right_on=['t2'], how='outer')
Out[376]:
t1 s1 t2 s2
0 0 1 0 0
1 1 0 1 5
2 2 3 2 3
3 3 3 3 0
4 4 0 4 3
5 5 2 5 0
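This gets close to the target but not exactly there: the target carries the last-seen reading forward at each event time rather than filling with zeros at every integer timestamp. If the goal is to reproduce the target table exactly, one alternative sketch is an event-based interleave with concat, a stable sort and a forward fill (the frames below just rebuild the example):
import pandas as pd

a = pd.DataFrame({'t1': [0, 2, 3, 5], 's1': [1, 3, 3, 2]})
b = pd.DataFrame({'t2': [1, 2, 4], 's2': [5, 3, 3]})

# one row per sensor event, ordered by a shared time column; the stable
# sort keeps a's rows ahead of b's rows when timestamps are equal
events = pd.concat([a.assign(t=a['t1']), b.assign(t=b['t2'])])
interleaved = (events.sort_values('t', kind='mergesort')
                     .ffill()      # carry the last-seen value of each column forward
                     .fillna(0)    # nothing seen yet -> 0, as in the target
                     .drop(columns='t')
                     .astype(int)
                     .reset_index(drop=True)
                     [['t1', 't2', 's1', 's2']])
print(interleaved)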
