How to write a nested query in Python pandas?

Hi all, I am new to pandas. I need some help regarding how to write a pandas query for my required output.
I want to retrieve output like this:
when 0 < minimum_age < 10, I need to get sum(population) for that 0 to 10 range only;
when 10 < minimum_age < 20, I need to get sum(population) for that 10 to 20 range only;
and so on.
My Input Data Looks Like:
population,minimum_age,maximum_age,gender,zipcode,geo_id
50,30,34,f,61747,8600000US61747
5,85,NaN,m,64120,8600000US64120
1389,10,34,m,95117,8600000US95117
231,5,60,f,74074,8600000US74074
306,22,24,f,58042,8600000US58042
My Code:
import pandas as pd
import numpy as np
# Raw string so the backslashes in the Windows path are not treated as escapes
df1 = pd.read_csv(r"C:\Users\Rahul\Desktop\Desktop_Folders\Code\Population\population_by_zip_2010.csv")
df2 = df1.set_index("geo_id")
df2['sum_population'] = np.where(df2['minimum_age'] < 10, sum(df2['population']), 0)
print(df2)

You can try pandas cut along with groupby:
(df.groupby(pd.cut(df['minimum_age'], bins=np.arange(0, 100, 10), right=False))
   .population.sum()
   .reset_index(name='sum of population'))
minimum_age sum of population
0 [0, 10) 231.0
1 [10, 20) 1389.0
2 [20, 30) 306.0
3 [30, 40) 50.0
4 [40, 50) NaN
5 [50, 60) NaN
6 [60, 70) NaN
7 [70, 80) NaN
8 [80, 90) 5.0
Explanation: pandas cut helps create bins of minimum_age by putting the values in groups of 0-10, 10-20 and so on. This is how it looks:
pd.cut(df['minimum_age'], bins=np.arange(0, 100, 10), right=False)
0 [30, 40)
1 [80, 90)
2 [10, 20)
3 [0, 10)
4 [20, 30)
Now we use groupby on the output of pd.cut to find the sum of population in each bin.
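Putting the pieces together, here is a minimal runnable sketch of this answer, assuming the five sample rows from the question (the names bins and out are mine):

import numpy as np
import pandas as pd

# The sample rows from the question (only the columns the answer needs)
df = pd.DataFrame({
    'population': [50, 5, 1389, 231, 306],
    'minimum_age': [30, 85, 10, 5, 22],
})

# Bin minimum_age into [0, 10), [10, 20), ... and sum population per bin
bins = np.arange(0, 100, 10)
out = (df.groupby(pd.cut(df['minimum_age'], bins=bins, right=False))
         .population.sum()
         .reset_index(name='sum of population'))
print(out)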

Related

The minimum number of rows that sum to a given number in Python

I have a DataFrame with cash, inflows and outflows.
I need to create a feature survival that is the maximum number of periods the cash is enough to pay the projection of outflows (excluding the inflows from the computation).
Let's take an example from the table below.
(Again, the inflows do not count in this exercise).
In t=1, starting from cash = 100, I can add the outflows -20, -50, -10 and -10 and still have positive cash (100-20-50-10-10 = 10 > 0), while adding the outflow in t=5 would make the cash negative. Since I can "survive" 4 periods, in t=1 the survival = 4.
In t=2 the survival = 3, and so on.
As it is a big DataFrame, how can I do it efficiently with Pandas?
t    cash    outflow    inflow    survival
1    100     -20        10        4
2    90      -50        10        3
3    50      -10        80        2
4    120     -10        70        ...
5    40      -50        60        ...
I would do it like this:
df['survival'] = [(cash + df.iloc[i:].outflow.cumsum() > 0).sum() for i, cash in enumerate(df.cash)]
Output:
t cash outflow survival
0 1 100 -20 4
1 2 90 -50 3
2 3 50 -10 2
3 4 120 -10 2
4 5 40 -50 0
Explanation: I loop over the cash values, keeping track of the row number using enumerate. I use the row number to select only the portion of the dataframe from the current row downwards. On this portion of the dataframe I take a cumulative sum of the outflows and add it to the cash. This yields a series which is negative wherever the cash is smaller than the sum of the outflows so far. I then compare it with > 0, which gives True where it is positive and False where it is negative. Finally I sum the whole series; each True counts as 1, so the result is the survival number you are looking for. Hope it makes sense.
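To make the steps concrete, here is the i = 0 term worked out on the sample data (my own illustration; the printed values follow from the arithmetic above):

import pandas as pd

df = pd.DataFrame({'t': [1, 2, 3, 4, 5],
                   'cash': [100, 90, 50, 120, 40],
                   'outflow': [-20, -50, -10, -10, -50]})

# i = 0, cash = 100: cumulative outflows from row 0 onwards, added to the cash
running = df.cash.iloc[0] + df.iloc[0:].outflow.cumsum()
print(running.tolist())     # [80, 30, 20, 10, -40]
print((running > 0).sum())  # 4 -> the survival for t = 1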
With your sample data:
df = pd.DataFrame({
    't': [1, 2, 3, 4, 5],
    'cash': [100, 90, 50, 120, 40],
    'outflow': [-20, -50, -10, -10, -50]
})
I chose to write a function and use the pandas apply() method on it, with x being the evaluated row and df the complete DataFrame:
def survival(x, df):
    cash = x['cash']
    i = 0
    while cash > 0:
        try:
            cash = cash + df.loc[x.name + i]['outflow']
            i += 1
        except KeyError:
            print('End of dataframe')
            i += 1
            cash = -1  # To make sure we leave the loop
    return i - 1
Then apply it to every row:
df['survival'] = df.apply(survival, args=(df,), axis=1)
# Output
t cash outflow survival
0 1 100 -20 4
1 2 90 -50 3
2 3 50 -10 2
3 4 120 -10 2
4 5 40 -50 0
Creating the test dataframe:
import pandas as pd
import numpy as np

N = 50
cash = 50  # the initial cash
# I will not type your dataframe
df = pd.DataFrame({'inflow': np.random.randint(1, 10, N),
                   'outflow': np.random.randint(1, 20, N)})
Then the solution could be achieved with
# computes the cash for each period
ccash = (cash + (df['inflow'] - df['outflow']).cumsum())
survival = (ccash[::-1] >= 0).cumsum()[::-1]
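A self-contained usage sketch of this approach (my own assembly of the snippets above; the seed is an assumption added so the random frame is reproducible):

import numpy as np
import pandas as pd

np.random.seed(0)  # assumed seed, only for reproducibility
N = 50
cash = 50  # the initial cash
df = pd.DataFrame({'inflow': np.random.randint(1, 10, N),
                   'outflow': np.random.randint(1, 20, N)})

# Running cash per period, then for each period count how many periods
# from it to the end (itself included) have non-negative running cash
ccash = cash + (df['inflow'] - df['outflow']).cumsum()
df['survival'] = (ccash[::-1] >= 0).cumsum()[::-1]
print(df.head())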

Finding minimum variance based on combinations of binning in python

I am looking to use a loop to iterate through all combinations of binning a variable before doing a group by. Example data:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'age': [23, 54, 47, 38, 37, 21, 27, 72, 25, 36],
                   'score': [28, 38, 47, 27, 37, 26, 28, 48, 27, 47]})
df.head()
id age score
0 1 23 28
1 2 54 38
2 3 47 47
3 4 38 27
4 5 37 37
And then manually creating bins like so:
bins = [20,50,70,80]
labels = ['-'.join(map(str,(x,y))) for x, y in zip(bins[:-1], bins[1:])]
df["age_bin"] = pd.cut(df["age"], bins = bins,labels = labels)
Finally calculating the average variance for that bin combination:
df.groupby("age_bin").agg({'score':'var'}).mean()
How can I loop through all combinations of bins, with a minimum bin size of 10, but with no restrictions on the number of bins, and assuming they do not have to be the same size?
e.g.
bins mean
0 [20, 50, 70, 80] 82.553571
1 [20, 70, 80] 74.611111
2 [20, 30, 60, 80] 35.058333
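The thread leaves the looping part open, but a brute-force sketch could look like the following. It assumes bin edges restricted to multiples of 10 between the fixed endpoints 20 and 80, which guarantees the minimum bin width of 10; inner_edges and results are my own names:

import itertools
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'age': [23, 54, 47, 38, 37, 21, 27, 72, 25, 36],
                   'score': [28, 38, 47, 27, 37, 26, 28, 48, 27, 47]})

results = []
inner_edges = range(30, 80, 10)  # candidate edges strictly between 20 and 80
for r in range(len(inner_edges) + 1):
    for combo in itertools.combinations(inner_edges, r):
        bins = [20, *combo, 80]
        labels = ['-'.join(map(str, (x, y))) for x, y in zip(bins[:-1], bins[1:])]
        age_bin = pd.cut(df['age'], bins=bins, labels=labels)
        results.append({'bins': bins,
                        'mean': df.groupby(age_bin)['score'].var().mean()})

print(pd.DataFrame(results).sort_values('mean'))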

In pandas, how should one add age-range columns?

Let's say I've got a simple DataFrame that details when people have been playing music through their lives, like this:
import pandas as pd
df = pd.DataFrame(
    [[15, 8, 7],
     [20, 10, 10],
     [35, 15, 20],
     [50, 12, 38]],
    columns=['current age', 'age started playing music', 'years playing music'])
How should one add additional columns that break down the number of years playing music they've had in each decade of their lives? For example, if the columns added were 0-10, 10-20, 20-30 etc., then the first person would have had 2 years of playing music in their first decade, 5 in their second, 0 in their third etc.
You can also try this using pd.cut and value_counts:
import numpy as np

df.join(df.apply(lambda x: pd.cut(np.arange(x['age started playing music'],
                                            x['current age']),
                                  bins=[0, 9, 19, 29, 39, 49],
                                  labels=['0-10', '10-20', '20-30',
                                          '30-40', '40+'])
                            .value_counts(),
                 axis=1))
Output:
current age age started playing music years playing music 0-10 10-20 20-30 30-40 40+
0 15 8 7 2 5 0 0 0
1 20 10 10 0 10 0 0 0
2 35 15 20 0 5 10 5 0
3 50 12 38 0 8 10 10 10
I suggest creating a function that returns a list with the number of years played per decade, and then applying it to your dataframe:
import numpy as np

# Create a list with the number of years played in each decade
def get_years_playing_music_decade(current_age, age_start):
    if age_start > current_age:  # should not be possible
        return None
    # Convert the age to a list of booleans:
    # was he playing in his i-th year of life?
    # Example: age_start = 3 gives [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, ...]
    age_start_lst = [0] * age_start + (100 - age_start) * [1]
    # Was he living in his i-th year of life?
    current_age_lst = [1] * current_age + (100 - current_age) * [0]
    # Combination of living and playing
    playing_music_lst = [1 if x == y else 0 for x, y in zip(age_start_lst, current_age_lst)]
    # Group by decades of 10 years
    playing_music_lst_10y = [sum(playing_music_lst[(10 * i):((10 * i) + 10)]) for i in range(0, 10)]
    return playing_music_lst_10y
get_years_playing_music_decade(current_age=33, age_start=12)
# [0, 8, 10, 3, 0, 0, 0, 0, 0, 0]
# Create columns 0-10 .. 90-100
colnames = list()
for i in range(10):
    colnames += [str(10 * i) + '-' + str(10 * (i + 1))]

# Apply the defined function to the dataframe
df[colnames] = pd.DataFrame(df.apply(lambda x: get_years_playing_music_decade(
    int(x['current age']), int(x['age started playing music'])), axis=1).values.tolist())

Merge two dataframes based on interval overlap

I have two dataframes A and B:
For example:
import pandas as pd
import numpy as np
In [37]:
A = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200]})
A[["Start","End"]]
Out[37]:
Start End
0 10 11
1 11 11
2 20 35
3 62 70
4 198 200
In [38]:
B = pd.DataFrame({'Start': [8, 5, 8, 60], 'End': [10, 90, 13, 75], 'Info': ['some_info0','some_info1','some_info2','some_info3']})
B[["Start","End","Info"]]
Out[38]:
Start End Info
0 8 10 some_info0
1 5 90 some_info1
2 8 13 some_info2
3 60 75 some_info3
I would like to add column info to dataframe A based on if the interval (Start-End) of A overlaps with the interval of B. In case, the A interval overlaps with more than one B interval, the info corresponding to the shorter interval should be added.
I have been looking around for how to manage this issue and I have found somewhat similar questions, but most of their answers use iterrows(), which is not viable in my case, as I am dealing with huge dataframes.
I would like something like:
A.merge(B,on="overlapping_interval", how="left")
And then drop duplicates keeping the info coming from the shorter interval.
The output should look like this:
In [39]:
C = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200], 'Info': ['some_info0','some_info2','some_info1','some_info3',np.nan]})
C[["Start","End","Info"]]
Out[39]:
Start End Info
0 10 11 some_info0
1 11 11 some_info2
2 20 35 some_info1
3 62 70 some_info3
4 198 200 NaN
I found this question really interesting, as it suggests the possibility of solving this issue using the pandas Interval object. But after lots of attempts I have not managed to solve it.
Any ideas?
I would suggest writing a function and then applying it to the rows.
First I compute the delta (End - Start) in B for sorting purposes:
B['delta'] = B.End - B.Start
Then a function to get information:
def get_info(x):
    # Fully included
    c0 = (x.Start >= B.Start) & (x.End <= B.End)
    # Starts lower, end included
    c1 = (x.Start <= B.Start) & (x.End >= B.Start)
    # Start included, ends higher
    c2 = (x.Start <= B.End) & (x.End >= B.End)
    # Filter with the conditions and sort by delta
    _B = B[c0 | c1 | c2].sort_values('delta', ascending=True)
    return None if len(_B) == 0 else _B.iloc[0].Info  # None if no overlapping interval
Then you can apply this function to A:
A['info'] = A.apply(get_info, axis='columns')
print(A)
Start End info
0 10 11 some_info0
1 11 11 some_info2
2 20 35 some_info1
3 62 70 some_info3
4 198 200 None
Note: instead of using pd.Interval, build your own conditions. The cx conditions are your interval definitions; change them to get the exact expected behaviour.
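If the apply over A is still too slow, a vectorized variant (my own sketch, not from the answer above) is to cross-join A and B, keep only the overlapping pairs, and pick the shortest B interval per A row. Note that how='cross' needs pandas >= 1.2, and a cross join can be memory-hungry for really huge frames:

import pandas as pd

A = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200]})
B = pd.DataFrame({'Start': [8, 5, 8, 60], 'End': [10, 90, 13, 75],
                  'Info': ['some_info0', 'some_info1', 'some_info2', 'some_info3']})

# Pair every A row with every B row, carrying B's interval length as delta
m = (A.reset_index()
       .merge(B.assign(delta=B['End'] - B['Start']),
              how='cross', suffixes=('', '_B')))
# Two intervals overlap when each starts before the other ends
overlap = (m['Start'] <= m['End_B']) & (m['End'] >= m['Start_B'])
# For each A row (tracked by 'index'), keep the shortest overlapping B interval
best = m[overlap].sort_values('delta').drop_duplicates('index')
A['Info'] = A.index.map(best.set_index('index')['Info'])
print(A)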

Indexing on DataFrame with MultiIndex

I have a large pandas DataFrame that I need to fill.
Here is my code:
trains = np.arange(1, 101)
# The above are example values; it's actually 900 integers between 1 and 20000
tresholds = np.arange(10, 70, 10)
tuples = []
for i in trains:
    for j in tresholds:
        tuples.append((i, j))
index = pd.MultiIndex.from_tuples(tuples, names=['trains', 'tresholds'])
df = pd.DataFrame(np.zeros((len(index), len(trains))), index=index, columns=trains, dtype=float)
metrics = dict()
for i in trains:
    m = binary_metric_train(True, i)
    # The above function returns a binary array of length 35
    # Example: [1, 0, 0, 1, ...]
    metrics[i] = m

for i in trains:
    for j in tresholds:
        trA = binary_metric_train(True, i, tresh=j)
        for k in trains:
            if k != i:
                trB = metrics[k]
                corr = abs(pearsonr(trA, trB)[0])
                df[k][i][j] = corr
            else:
                df[k][i][j] = np.nan
My problem is, when this piece of code is finally done computing, my DataFrame df still contains nothing but zeros. Even the NaNs are not inserted. I think that my indexing is correct. Also, I have tested my binary_metric_train function separately, and it does return an array of length 35.
Can anyone spot what I am missing here?
EDIT: For clarity, this DataFrame looks like this:
                   1   2   3   4   5  ...
trains tresholds
1      10
       20
       30
       40
       50
       60
2      10
       20
       30
       40
       50
       60
...
As #EdChum noted, you should take a look at pandas indexing. Here's some test data for the purpose of illustration, which should clear things up.
import numpy as np
import pandas as pd

trains = [1, 1, 1, 2, 2, 2]
thresholds = [10, 20, 30, 10, 20, 30]
data = [1, 0, 1, 0, 1, 0]
df = pd.DataFrame({
    'trains': trains,
    'thresholds': thresholds,
    'C1': data,
    'C2': data
}).set_index(['trains', 'thresholds'])
print(df)

# .ix is removed in modern pandas; use .loc with the row labels as a tuple
df.loc[(2, 30), 'C1'] = 3  # using the column name
# but not...
df.loc[(2, 30), 1] = 3  # creates a new column labelled 1

print(df)
Which outputs the DataFrame before and after modification:
C1 C2
trains thresholds
1 10 1 1
20 0 0
30 1 1
2 10 0 0
20 1 1
30 0 0
C1 C2 1
trains thresholds
1 10 1 1 NaN
20 0 0 NaN
30 1 1 NaN
2 10 0 0 NaN
20 1 1 NaN
30 3 0 3
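Applied back to the question, the fix is to replace the chained df[k][i][j] = ... (which writes to a temporary copy) with a single .loc call keyed by the (trains, tresholds) tuple. A minimal sketch with the correlation stubbed out as a placeholder (my own illustration):

import numpy as np
import pandas as pd

trains = np.arange(1, 4)
tresholds = np.arange(10, 40, 10)
index = pd.MultiIndex.from_product([trains, tresholds],
                                   names=['trains', 'tresholds'])
df = pd.DataFrame(np.zeros((len(index), len(trains))),
                  index=index, columns=trains)

for i in trains:
    for j in tresholds:
        for k in trains:
            # One .loc assignment with the full (row tuple, column) key
            df.loc[(i, j), k] = np.nan if k == i else 0.5  # 0.5 is a placeholder
print(df)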
