divide a dataframe on special threshold

divide a dataframe on special threshold - python

I got a DataFrame as an example:
name age
Ashe 12
Ashe 13
Ashe 23
John 33
John 45
Karin 55
David 84
Zaki 34
Mano 45
my threshold is I need to divide this on distinct names like I need 3 distinct names so I need the output to be :
name age
Ashe 12
Ashe 13
Ashe 23
John 33
John 45
Karin 55
and the second DF :
name age
David 84
Zaki 34
Zaki 23
Zaki 35
Mano 45
what can I do?

from itertools import islice
def chunk(lst, size):
lst = iter(lst)
return iter(lambda: tuple(islice(lst, size)), ())
name_groups = list(chunk(df.name.unique(),3))
data = {}
for i, group in enumerate(name_groups):
data[f'df{i}'] = df[df.name.isin(group)]
The chunk function splits an array to chunks of size n (in our case - 3)
You can read more here : https://stackoverflow.com/a/22045226/13104290
name_groups contains a list of tuples with up to 3 elements each one:
[('Ashe', 'John', 'Karin'), ('David', 'Zaki', 'Mano')]
Since we sent df.name.unique(), there are no duplications.
Now we need to dynamically create each new dataframe, we'll do this by creating a dictionary and adding each new partition one at a time.
The dictionary now contains two dataframes, df0 and df1.
data['df0'] :
name age
0 Ashe 12
1 Ashe 13
2 Ashe 23
3 John 33
4 John 45
5 Karin 55
data['df1']:
name age
6 David 84
7 Zaki 34
8 Mano 45

Related

Doing vlookup-like things on Python with multiple lookup values

Many of us know that the syntax for a Vlookup function on Excel is as follows:
=vlookup([lookup value], [lookup table/range], [column selected], [approximate/exact match (optional)])
I want to do something on Python with a lookup table (in dataframe form) that looks something like this:
Name Date of Birth ID#
Jack 1/1/2003 0
Ryan 1/8/2003 1
Bob 12/2/2002 2
Jack 3/9/2003 3
...and so on. Note how the two Jacks are assigned different ID numbers because they are born on different dates.
Say I have something like a gradebook (again, in dataframe form) that looks like this:
Name Date of Birth Test 1 Test 2
Jack 1/1/2003 89 91
Ryan 1/8/2003 92 88
Jack 3/9/2003 93 79
Bob 12/2/2002 80 84
...
How do I make it so that the result looks like this?
ID# Name Date of Birth Test 1 Test 2
0 Jack 1/1/2003 89 91
3 Ryan 1/8/2003 92 88
1 Jack 3/9/2003 93 79
2 Bob 12/2/2002 80 84
...
It seems to me that the "lookup value" would involve multiple columns of data ('Name' and 'Date of Birth'). I kind of know how to do this in Excel, but how do I do it in Python?

Turns out that I can just do
pd.merge([lookup value], [lookup table], on = ['Name', 'Date of Birth']
which produces
Name Date of Birth Test 1 Test 2 ID#
Jack 1/1/2003 89 91 0
Ryan 1/8/2003 92 88 3
Jack 3/9/2003 93 79 1
Bob 12/2/2002 80 84 2
...
Then everything needed is to move the last column to the front.

Random number to values columns

I have a large dataset with the columns 'group' and 'postcode'. An example of the df is given below:
Age
65+
16-25
16-25
26-39
40-64
65+
26-39
40-64
16-25
65+
I am trying to affect to each row value a random value with the code below
df['AGE'] = df['AGE'].replace({'65+': randint(65,100), '16-25': randint(16,25),
'26-39': randint(26,39), '40-64': randint(40,64)})
But what I'm getting are four random values to each of these values: {'65+', '16-25', '26-39', '40-64'} like so:
Age
73
23
23
34
42
73
34
42
23
73
Can someone please help me figure out what am I doing wrong by correcting my code?

You're generating the random numbers once and just replacing your column values.
If you want a different random number for each row, you need to call randint for each row. Try:
>>> df['AGE'].apply(lambda x: randint(int(x[:2]), 100 if x[-1]=="+" else int(x[-2:])))
0 82
1 23
2 18
3 27
4 45
5 83
6 38
7 64
8 17
9 93
Name: AGE, dtype: int64

i want to separate data frame based on marks and download it then

This is my data frame:
Name Age Stream Percentage
0 A 21 Math 88
1 B 19 Commerce 92
2 C 20 Arts 95
3 D 18 Biology 70
0 E 21 Math 88
1 F 19 Commerce 92
2 G 20 Arts 95
3 H 18 Biology 70
I want to download different excel file for each subject in one loop so basically, I should get 4 excel files for each subject
i tried this but didn't work:
n=0
for subjects in df.stream:
df.to_excel("sub"+ str(n)+".xlsx")
n+=1

I think groupby is helpful here. and you can use enumerate to keep track of the index.
for i, (group, group_df) in enumerate(df.groupby('stream')):
group_df.to_excel('sub{}.xlsx'.format(i))
# Alternatively, to name the file based on the stream...
group_df.to_excel('sub{}.xlsx'.format(group))
group is going to be the name of the stream.
group_df is going to be a sub-dataframe containing all the data in that group.

Select/Group rows from a data frame with the nearest values for a specific column(s)

I have the two columns in a data frame (you can see a sample down below)
Usually in columns A & B I get 10 to 12 rows with similar values.
So for example: from index 1 to 10 and then from index 11 to 21.
I would like to group these values and get the mean and standard deviation of each group.
I found this following line code where I can get the index of the nearest value. but I don't know how to do this repetitively:
Index = df['A'].sub(df['A'][0]).abs().idxmin()
Anyone has any ideas on how to approach this problem?
A B
1 3652.194531 -1859.805238
2 3739.026566 -1881.965576
3 3742.095325 -1878.707674
4 3747.016899 -1878.728626
5 3746.214554 -1881.270329
6 3750.325368 -1882.915532
7 3748.086576 -1882.406672
8 3751.786422 -1886.489485
9 3755.448968 -1885.695822
10 3753.714126 -1883.504098
11 -337.969554 24.070990
12 -343.019575 23.438956
13 -344.788697 22.250254
14 -346.433460 21.912217
15 -343.228579 22.178519
16 -345.722368 23.037441
17 -345.923108 23.317620
18 -345.526633 21.416528
19 -347.555162 21.315934
20 -347.229210 21.565183
21 -344.575181 22.963298
22 23.611677 -8.499528
23 26.320500 -8.744512
24 24.374874 -10.717384
25 25.885272 -8.982414
26 24.448127 -9.002646
27 23.808744 -9.568390
28 24.717935 -8.491659
29 25.811393 -8.773649
30 25.084683 -8.245354
31 25.345618 -7.508419
32 23.286342 -10.695104
33 -3184.426285 -2533.374402
34 -3209.584366 -2553.310934
35 -3210.898611 -2555.938332
36 -3214.234899 -2558.244347
37 -3216.453616 -2561.863807
38 -3219.326197 -2558.739058
39 -3214.893325 -2560.505207
40 -3194.421934 -2550.186647
41 -3219.728445 -2562.472566
42 -3217.630380 -2562.132186
43 234.800448 -75.157523
44 236.661235 -72.617806
45 238.300501 -71.963103
46 239.127539 -72.797922
47 232.305335 -70.634125
48 238.452197 -73.914015
49 239.091210 -71.035163
50 239.855953 -73.961841
51 238.936811 -73.887023
52 238.621490 -73.171441
53 240.771812 -73.847028
54 -16.798565 4.421919
55 -15.952454 3.911043
56 -14.337879 4.236691
57 -17.465204 3.610884
58 -17.270147 4.407737
59 -15.347879 3.256489
60 -18.197750 3.906086

A simpler approach consist in grouping the values where the percentage change is not greater than a given threshold (let's say 0.5):
df['Group'] = (df.A.pct_change().abs()>0.5).cumsum()
df.groupby('Group').agg(['mean', 'std'])
Output:
A B
mean std mean std
Group
0 3738.590934 30.769420 -1880.148905 7.582856
1 -344.724684 2.666137 22.496995 0.921008
2 24.790470 0.994361 -9.020824 0.977809
3 -3210.159806 11.646589 -2555.676749 8.810481
4 237.902230 2.439297 -72.998817 1.366350
5 -16.481411 1.341379 3.964407 0.430576
Note: I have only used the "A" column, since the "B" column appears to follow the same pattern of consecutive nearest values. You can check if the identified groups are the same between columns with:
grps = (df[['A','B']].pct_change().abs()>1).cumsum()
grps.A.eq(grps.B).all()

I would say that if you know the length of each group/index set you want then you can first subset the column and row with :
df['A'].iloc[0:11].mean()
Then figure out a way to find standard deviation.

Python numpy where function behavior

Have a question regarding using numpy's where condition. I am able to use where condition with == operator but not able to use where condition with "is one string substring of another string ?"
CODE:
import pandas as pd
import datetime as dt
import numpy as np
data = {'name': ['Smith, Jason', 'Bush, Molly', 'Smith, Tina',
'Clinton, Jake', 'Hamilton, Amy'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore',
'postTestScore'])
print "BEFORE---- "
print df
print "AFTER----- "
df["Smith Family"]=np.where("Smith" in df['name'],'Y','N' )
print df
OUTPUT:
BEFORE-----
name age preTestScore postTestScore
0 Smith, Jason 42 4 25
1 Bush, Molly 52 24 94
2 Smith, Tina 36 31 57
3 Clinton, Jake 24 2 62
4 Hamilton, Amy 73 3 70
AFTER-----
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 N
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 N
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
Why numpy.where condition does not work in the above case.
Had expected Smith Family to have values
Y
N
Y
N
N
But did not get that output. Output as seen above is all N,N,N,N,N
Instead of using condition "Smith" in df['name'] (also tried str(df['name']).find("Smith") >-1 ) but that did not work either.
Any idea what is wrong or what could I have done differently?

I think you need str.contains for boolean mask:
print (df['name'].str.contains("Smith"))
0 True
1 False
2 True
3 False
4 False
Name: name, dtype: bool
df["Smith Family"]=np.where(df['name'].str.contains("Smith"),'Y','N' )
print (df)
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 Y
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 Y
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
Or str.startswith:
df["Smith Family"]=np.where(df['name'].str.startswith("Smith"),'Y','N' )
print (df)
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 Y
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 Y
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
If want use in working with scalars need apply:
This solution is faster, but doesnt work if NaN in column name.
df["Smith Family"]=np.where(df['name'].apply(lambda x: "Smith" in x),'Y','N' )
print (df)
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 Y
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 Y
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N

The behavior of np.where("Smith" in df['name'],'Y','N' ) depends on what df['name'] produces - I assume some sort of numpy array. The rest is numpy
In [733]: x=np.array(['one','two','three'])
In [734]: 'th' in x
Out[734]: False
In [744]: 'two' in np.array(['one','two','three'])
Out[744]: True
in is a whole string test, both for a list and an array of strings. It's not a substring test.
np.char has a bunch of functions that apply string functions to elements of an array. These are roughly the equivalent of np.array([x.fn() for x in arr]).
In [754]: x=np.array(['one','two','three'])
In [755]: np.char.startswith(x,'t')
Out[755]: array([False, True, True], dtype=bool)
In [756]: np.where(np.char.startswith(x,'t'),'Y','N')
Out[756]:
array(['N', 'Y', 'Y'],
dtype='<U1')
Or with find:
In [760]: np.char.find(x,'wo')
Out[760]: array([-1, 1, -1])
The pandas .str method appears to do something similar; applying string methods to elements of a data series.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

divide a dataframe on special threshold - python

Related

Doing vlookup-like things on Python with multiple lookup values

Random number to values columns

i want to separate data frame based on marks and download it then

Select/Group rows from a data frame with the nearest values for a specific column(s)

Python numpy where function behavior

Categories

Resources