My dataframe has 10,000 columns, and I have to apply some logic to each group (the key is region and dept). Each group uses at most 30 of the 10K columns; the list of 30 columns comes from the second data set's "colList" column. Each group has 2-3 million rows. My approach is to group by the key and call a function as below, but it fails: 1. shuffle, 2. a data group larger than 2 GB (can be solved by repartitioning, but that is costly), 3. it is very slow.
def testfunc(iter):
    # <<complex business logic that cannot be expressed with the Spark API>>
    ...

resRDD = df.rdd.groupBy(lambda row: (row.region, row.dept)).map(testfunc)
Input:
region dept week val0 val1 val2 val3 ... val10000
US CS 1 1 2 1 1 ... 2
US CS 2 1.5 2 3 1 ... 2
US CS 3 1 2 2 2.1 ... 2
US ELE 1 1.1 2 2 2.1 ... 2
US ELE 2 2.1 2 2 2.1 ... 2
US ELE 3 1 2 1 2 ... 2
UE CS 1 2 2 1 2 ... 2
Columns to pick for each group (data set 2):
region dept colList
US CS val0,val10,val100,val2000
US ELE val2,val5,val800,val900
UE CS val21,val54,val806,val9000
My second idea is to create a new data set from the input with only the 30 relevant columns, renamed col1 to col30, plus a mapping list of the original column names per group. Then I can apply groupByKey (I assume), which will be much skinnier than the original input of 10K columns.
region dept week col0 col1 col2 col3 ... col30
US CS 1 1 2 1 1 ... 2
US CS 2 1.5 2 3 1 ... 2
US CS 3 1 2 2 2.1 ... 2
US ELE 1 1.1 2 2 2.1 ... 2
US ELE 2 2.1 2 2 2.1 ... 2
US ELE 3 1 2 1 2 ... 2
UE CS 1 2 2 1 2 ... 2
Can anyone help convert the input from 10K columns down to 30? Any other alternative that avoids the group by would also be fine.
You can use the create_map function to convert all 10K columns into a single map per row. Then use a UDF that takes the map, region and dept and reduces the map to the 30 relevant columns, making sure to always use the same 30 key names.
Lastly, you can wrap your complex function to receive the map instead of the original 10K columns. Hopefully this gets the rows small enough to work properly.
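A minimal sketch of that idea, assuming the wide Spark DataFrame is df, the (region, dept, colList) DataFrame is col_map_df, and spark is the active SparkSession (those names are hypothetical):
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, DoubleType

# Pack the 10K value columns into one map column: {"val0": ..., "val1": ..., ...}
val_cols = [c for c in df.columns if c.startswith("val")]
map_expr = F.create_map(*[x for c in val_cols for x in (F.lit(c), F.col(c))])
packed = df.select("region", "dept", "week", map_expr.alias("vals"))

# Broadcast the per-group column lists, e.g. {("US", "CS"): ["val0", "val10", ...]}
col_lists = {(r["region"], r["dept"]): r["colList"].split(",")
             for r in col_map_df.collect()}
bc = spark.sparkContext.broadcast(col_lists)

@F.udf(MapType(StringType(), DoubleType()))   # assumes the valN columns are doubles
def keep_30(region, dept, vals):
    wanted = bc.value.get((region, dept), [])
    # rename to fixed keys col0..col29 so every group shares the same schema
    return {f"col{i}": vals.get(name) for i, name in enumerate(wanted)}

slim = packed.withColumn("vals", keep_30("region", "dept", "vals"))
From here the group by only has to shuffle a 30-entry map per row instead of 10K columns.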
If not, you can get the distinct (region, dept) pairs and, assuming there are few enough of them, loop over one and group by the other.
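A rough sketch of that fallback, reusing the hypothetical col_map_df and testfunc names from above:
from pyspark.sql import functions as F

col_lists = {(r["region"], r["dept"]): r["colList"].split(",")
             for r in col_map_df.collect()}

for p in df.select("region", "dept").distinct().collect():
    cols = col_lists[(p["region"], p["dept"])]    # the ~30 columns for this pair
    sub = (df.filter((F.col("region") == p["region"]) & (F.col("dept") == p["dept"]))
             .select("region", "dept", "week", *cols))
    # apply the complex business logic to this much narrower DataFrame,
    # e.g. sub.rdd.mapPartitions(testfunc) or plain pandas after sub.toPandas()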
I have a dataframe ("MUNg") like this:
MUN_id Col1
1-2 a
3 b
4-5-6 c
...
And another dataframe ("ppc") like this:
id population
0 1 20
1 2 25
2 3 4
3 4 45
4 5 100
5 6 50
...
I need to create a column in "MUNg" that contains the total population, obtained by summing the populations from "ppc" whose ids are present in MUN_id.
Expected result:
MUN_id Col1 total_population
1-2 a 45
3 b 4
4-5-6 c 195
...
I'm not showing what I tried, because I am new to Python and don't know how to approach this.
MUNg['total_population']=?
Many thanks!
You can split and explode the string into new rows, map the population data from ppc, and aggregate back with a groupby sum:
MUNg['total_population'] = (MUNg['MUN_id']
    .str.split('-')
    .explode()
    .astype(int)  # required if "id" in "ppc" is an integer; comment this out if it is a string
    .map(ppc.set_index('id')['population'])
    .groupby(level=0).sum()
)
output:
MUN_id Col1 total_population
0 1-2 a 45
1 3 b 4
2 4-5-6 c 195
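An equivalent sketch using an explicit explode and merge instead of map, assuming the same MUNg and ppc frames:
import pandas as pd

exploded = (MUNg.assign(id=MUNg["MUN_id"].str.split("-"))
                .explode("id")
                .reset_index()             # keep the original row number as a column
                .astype({"id": int}))      # match ppc["id"] if it is an integer
totals = (exploded.merge(ppc, on="id", how="left")
                  .groupby("index")["population"].sum())
MUNg["total_population"] = totals          # aligns on MUNg's original index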
My data includes a few variables holding data from multi-answer questions. These are stored as strings (comma separated) and aren't ordered by value.
I need to run different counts across 2 or more of these variables at the same time, i.e. get the frequencies of each combination of their unique values.
I also have a second dataframe with the available codes for each variable:
df_meta['a']['Categories'] = ['1', '2', '3','4']
df_meta['b']['Categories'] = ['1', '2']
If this is my data
df = pd.DataFrame(np.array([["1,3","1"],["3","1,2"],["1,3,2","1"],["3,1","2,1"]]),
columns=['a', 'b'])
index a b
1 1,3 1
2 3 1,2
3 1,3,2 1
4 3,1 2,1
Ideally, this is what the output would look like
a b count
1 1 3
1 2 1
2 1 1
2 2 0
3 1 4
3 2 2
4 1 0
4 2 0
Although if it's not possible to get the zero counts, this would be just fine:
a b count
1 1 3
1 2 1
2 1 1
3 1 4
3 2 2
So far, I have the counts for each of these variables individually, using split and value_counts:
df["a"].str.split(',',expand=True).stack().value_counts()
3 4
1 3
2 1
df["b"].str.split(',',expand=True).stack().value_counts()
1 4
2 2
But I can't figure out how to group by both of them, because of the differences in the indexes.
df2 = pd.DataFrame()
df2["a"] = df["a"].str.split(',',expand=True).stack()
df2["b"] = df["b"].str.split(',',expand=True).stack()
df2.groupby(['a','b']).size()
a b
1 1 3
3 1 1
2 1
Is there a way to adjust the groupby to only count the instances of the first index, or another way to count the unique combinations more efficiently?
I can alternatively iterate through all the codes using the df_meta dataframe, but some of the actual variables have 300-400 codes and it gets very slow when I try to cross 2-3 of them; if groupby or another function can be used instead, it should be much faster.
First we recreate your dataframe to start with.
df = pd.DataFrame(np.array([["1,3","1"],["3","1,2"],["1,3,2","1"],
["3,1","2,1"]]),columns=['a', 'b'])
Then split columns to separate dataframes.
da = df["a"].str.split(',',expand=True)
db = df["b"].str.split(',',expand=True)
Loop through all rows of both dataframes, make temporary dataframes of all combinations, and add them to a list.
ab = list()
for r in range(len(da)):
    for i in da.iloc[r, :]:
        for j in db.iloc[r, :]:
            if i is not None and j is not None:
                daf = pd.DataFrame({'a': [i], 'b': [j]})
                ab.append(daf)
Concatenate list of temporary dataframes into one new dataframe.
dfn = pd.concat(ab)
Grouping by the 'a' and 'b' columns and taking size() gives you the answer.
print(dfn.groupby(['a', 'b']).size().reset_index(name='count'))
a b count
0 1 1 3
1 1 2 1
2 2 1 1
3 3 1 4
4 3 2 2
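For larger data, a more vectorized sketch of the same idea (same df assumed) is to explode both columns and merge them on the original row index to form the per-row pairs:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([["1,3", "1"], ["3", "1,2"], ["1,3,2", "1"], ["3,1", "2,1"]]),
                  columns=['a', 'b'])

da = df['a'].str.split(',').explode().rename('a')
db = df['b'].str.split(',').explode().rename('b')
pairs = da.to_frame().merge(db.to_frame(), left_index=True, right_index=True)
print(pairs.groupby(['a', 'b']).size().reset_index(name='count'))

# To also get the zero counts, reindex against all code combinations from df_meta, e.g.:
# full = pd.MultiIndex.from_product([['1', '2', '3', '4'], ['1', '2']], names=['a', 'b'])
# pairs.groupby(['a', 'b']).size().reindex(full, fill_value=0)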
Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it by numbering the records within each group after a groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? And is there a more elegant way to number the records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
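For example, if "top 2" should mean the largest values rather than the first rows, a small sketch would be:
df.sort_values(['id', 'value'], ascending=[True, False]).groupby('id').head(2)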
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
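For example, with the same df as above:
df.groupby('id')['value'].nlargest(2).reset_index(level=1, drop=True)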
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
Sometimes sorting the whole data ahead of time is very time consuming.
We can group by first and take the top k rows of each group instead:
topk = 2
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here, sort_values with ascending=False behaves like nlargest, and with ascending=True like nsmallest.
The value passed to head is the same as the value passed to nlargest: the number of rows to keep per group.
reset_index is optional and not strictly necessary.
This also handles duplicated values.
If you have duplicated values among the top-n values and want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the Audit department we get the top 3 salaries as 110k, 100k and 100k.
If we instead want distinct salaries per department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (sort "id" in ascending order and "value" in descending order using the ascending parameter), then call groupby().nth[]:
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed them to be 24-150 times faster than those solutions.
Also, instead of slicing, you can pass a list/tuple/range to the .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
So I have a DataFrame that is the same thing repeated 348 times, but with a different date as a static column. What I would like to do is add a column that checks against that date and then counts the number of rows within 20 miles, using a lat/lon column and geopy.
My frame is like this:
What I am looking for is something like an apply function that takes all of the rows whose identifying date equals the column value and then runs this:
geopy.distance.vincenty(x, y).miles
X would be the location's lat/lon and y would be the lat/lon of the row being iterated over. I want the count of locations for which the above is < 20, and I'd then like to store this count as a column in the initial DataFrame.
I'm ok with Pandas, but this is just outside my comfort zone. Thanks.
I started with this DataFrame (because I did not want to type that much by hand and you did not provide any code for the data):
df
Index Number la ID
0 0 1 [43.3948, -23.9483] 1/1/90
1 1 2 [22.8483, -34.3948] 1/1/90
2 2 3 [44.9584, -14.4938] 1/1/90
3 3 4 [22.39458, -55.34924] 1/1/90
4 4 5 [33.9383, -23.4938] 1/1/90
5 5 6 [22.849, -34.397] 1/1/90
Now I introduce an artificial column which is only there to help us build the cartesian product of the rows:
df['join'] = 1
df_c = pd.merge(df, df[['la', 'join','Index']], on='join')
The next step is to apply the vincenty function via .apply and store the result in an extra column:
from geopy import distance
df_c['distance'] = df_c.apply(lambda x: distance.vincenty(x.la_x, x.la_y).miles, axis=1)
Now we have the cartesian product of the original frame, which means each city is also compared with itself. We account for that in the next step by subtracting 1. We group by Index_x and count all the distances smaller than 20 miles.
df['num_close_cities'] = df_c.groupby('Index_x').apply(lambda x: sum((x.distance < 20))) -1
df.drop(columns='join')
Index Number la ID num_close_cities
0 0 1 [43.3948, -23.9483] 1/1/90 0
1 1 2 [22.8483, -34.3948] 1/1/90 1
2 2 3 [44.9584, -14.4938] 1/1/90 0
3 3 4 [22.39458, -55.34924] 1/1/90 0
4 4 5 [33.9383, -23.4938] 1/1/90 0
5 5 6 [22.849, -34.397] 1/1/90 1
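Note that vincenty was removed in geopy 2.x; with a current geopy, the same step can be written with geodesic (a sketch, same column names as above assumed):
from geopy import distance

df_c['distance'] = df_c.apply(lambda x: distance.geodesic(x.la_x, x.la_y).miles, axis=1)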
I have a dataframe df like this:
trial id run rt acc
0 1 1 1 0.941836 1
1 2 1 1 0.913791 1
2 3 1 1 0.128986 1
3 4 1 1 0.155720 0
4 1 1 2 0.414175 0
5 2 1 2 0.699326 1
6 3 1 2 0.781877 1
7 4 1 2 0.554666 1
There are 2 runs per id, and 70+ trials per run. Each row contains one trial. So the hierarchy is id - run - trial.
I want to retain only runs where mean acc is above 0.5, so I used temp = df.groupby(['id', 'run']).agg(np.average) and keep = temp[temp['acc'] > 0.5].
Now I want to remove all trials from runs that are not in keep.
I tried to use df[df['id'].isin(keep['id'])&df['run'].isin(keep['run'])], but this doesn't seem to work correctly. df.query doesn't seem to work either as the indices and columns differ between the dataframes.
Is there another way of doing this?
I want to retain only runs where mean acc is above 0.5
Using groupby + transform, you can use a single Boolean series for indexing:
df = df[df.groupby(['id', 'run'])['acc'].transform('mean') > 0.5]
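A quick sketch of why this works on the sample rows above: transform('mean') broadcasts each (id, run) group's mean acc back onto every row, so the Boolean mask has the same index as df and can filter it directly.
means = df.groupby(['id', 'run'])['acc'].transform('mean')
df_kept = df[means > 0.5]   # in this sample both runs have mean acc 0.75, so all rows remain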