In pandas, how should one add age-range columns? - python

Let's say I've got a simple DataFrame that details when people have been playing music through their lives, like this:
import pandas as pd

df = pd.DataFrame(
    [[15, 8, 7],
     [20, 10, 10],
     [35, 15, 20],
     [50, 12, 38]],
    columns=['current age', 'age started playing music', 'years playing music'])
How should one add additional columns that break down the number of years playing music they've had in each decade of their lives? For example, if the columns added were 0-10, 10-20, 20-30 etc., then the first person would have had 2 years of playing music in their first decade, 5 in their second, 0 in their third etc.

You can also try this using pd.cut and value_counts (note that numpy is needed as well):
import numpy as np

df.join(df.apply(lambda x: pd.cut(np.arange(x['age started playing music'],
                                            x['current age']),
                                  bins=[0, 9, 19, 29, 39, 49],
                                  labels=['0-10', '10-20', '20-30',
                                          '30-40', '40+'])
                             .value_counts(),
                 axis=1))
Output:
   current age  age started playing music  years playing music  0-10  10-20  20-30  30-40  40+
0           15                          8                    7     2      5      0      0    0
1           20                         10                   10     0     10      0      0    0
2           35                         15                   20     0      5     10      5    0
3           50                         12                   38     0      8     10     10   10
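To see what the binning is doing for a single person, here is a minimal sketch using only numpy (the age range 8 to 15 is taken from the first row of the question's data): integer-divide each playing year by 10 to get its decade, then count occurrences per decade.

```python
import numpy as np

# Years during which person 0 was playing: age 8 up to, not including, 15
ages = np.arange(8, 15)
decades = ages // 10                        # 0 => ages 0-9, 1 => ages 10-19, ...
counts = np.bincount(decades, minlength=5)  # count playing years per decade
print(counts)                               # [2 5 0 0 0]
```

This matches the first row of the output above: 2 years in the first decade, 5 in the second.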

I suggest creating a function that returns a list with the number of years played per decade, and then applying it to your DataFrame:
import numpy as np

# Build a list with the number of years played in each decade
def get_years_playing_music_decade(current_age, age_start):
    if age_start > current_age:  # should not be possible
        return None
    # Convert ages to lists of 100 booleans:
    # was the person playing in their i-th year of life?
    # Example: age_start = 3 gives [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, ...]
    age_start_lst = [0] * age_start + (100 - age_start) * [1]
    # Was the person alive in their i-th year of life?
    current_age_lst = [1] * current_age + (100 - current_age) * [0]
    # Both lists are 1 exactly in the years the person was alive and playing
    playing_music_lst = [1 if x == y else 0
                         for x, y in zip(age_start_lst, current_age_lst)]
    # Group by decades of 10 years
    playing_music_lst_10y = [sum(playing_music_lst[(10*i):((10*i) + 10)])
                             for i in range(10)]
    return playing_music_lst_10y

get_years_playing_music_decade(current_age=33, age_start=12)
# [0, 8, 10, 3, 0, 0, 0, 0, 0, 0]

# Create columns 0-10 .. 90-100
colnames = list()
for i in range(10):
    colnames += [str(10*i) + '-' + str(10*(i + 1))]

# Apply the defined function to the DataFrame
df[colnames] = pd.DataFrame(df.apply(lambda x: get_years_playing_music_decade(
    int(x['current age']), int(x['age started playing music'])), axis=1).values.tolist())
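For larger frames, the per-decade overlap can also be computed without building 100-element lists, by clipping each decade's interval against the playing interval. This is a vectorized sketch, not the answer's method; it reuses the column names from the question:

```python
import pandas as pd

df = pd.DataFrame(
    [[15, 8, 7], [20, 10, 10], [35, 15, 20], [50, 12, 38]],
    columns=['current age', 'age started playing music', 'years playing music'])

for i in range(10):
    lo, hi = 10 * i, 10 * (i + 1)
    # Overlap of [start, current) with [lo, hi): min(current, hi) - max(start, lo),
    # floored at zero when the intervals do not intersect
    overlap = (df['current age'].clip(upper=hi)
               - df['age started playing music'].clip(lower=lo)).clip(lower=0)
    df[f'{lo}-{hi}'] = overlap
```

Each decade column is computed in one vectorized pass over the frame instead of per-row with apply.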

Related

Trying to iterate through rows of pandas dataframe and edit row if it satisfies a condition

I have attached an image of my dataframe, and the code of the methods I tried. My goal is to switch the first half of the values in a row with the second half of the values in that row if the row satisfies a condition.
The first method checks if the condition is true (values need to be switched), and then assigns the new values directly to the original dataframe.
The second method checks if the condition is true (values need to be switched) and adds the values to two separate dataframes. If the condition is not true, I add the original values to df1 and df2. At the end of the block, I was planning to combine the dataframes together again.
However, both of these methods take very long to run, and it seems there has to be something more efficient. I had trouble finding the most efficient way online; which approach would that be?
METHOD 1:
finalGameID = list(final.loc[:, 'GameID'])
for i, v in enumerate(finalGameID):
    if final['HomeAway'][i] == 0:
        print(v)
        values = final.loc[i].values
        value1 = list(values[4:124])
        value2 = list(values[124:])
        final.iloc[i, 4:124] = value2
        final.iloc[i, 124:] = value1
METHOD 2:
df1 = final[final.columns[4:124]]
df2 = final[final.columns[124:]]
df3 = final[final.columns[0:4]]
df1 = df1[0:0]
df2 = df2[0:0]
finalGameID = list(final.loc[:, 'GameID'])
for i, v in enumerate(finalGameID):
    values = final.loc[i].values
    value1 = list(values[4:124])
    value2 = list(values[124:])
    if final['HomeAway'][i] == 0:
        print(v)
        df1.loc[len(df1.index)] = value2
        df2.loc[len(df2.index)] = value1
    else:
        df1.loc[len(df1.index)] = value1
        df2.loc[len(df2.index)] = value2
Your approach is slow because you loop over the rows and use intermediate copies.
You should be able to use boolean indexing for direct swapping. Note that .loc selects columns by label, so pick the two column blocks by position first, and assign NumPy arrays so pandas does not realign them by column name:
mask = final['HomeAway'].eq(0)
cols1, cols2 = final.columns[4:124], final.columns[124:]
final.loc[mask, cols1], final.loc[mask, cols2] = (
    final.loc[mask, cols2].to_numpy(), final.loc[mask, cols1].to_numpy())
The data you are working on is unknown, so I have tried to replicate your problem with dummy data. Change the variable names and the index values when using this in your project.
CODE
import pandas as pd
import numpy as np

data = pd.DataFrame({"HomeAway": [1, 1, 0, 0, 1],
                     "Value1": [14, 16, 29, 22, 21],
                     "Value2": [8, 14, 24, 14, 19],
                     "Value3": [6, 2, 5, 8, 2],
                     "Value4": [3, 3, 2, 2, 0]})
print("BEFORE")
print(data)
mask = (data["HomeAway"] == 0).to_numpy()  # .iloc needs a plain boolean array
left = np.asanyarray(data.iloc[mask, 1:3])
right = np.asanyarray(data.iloc[mask, 3:5])
data.iloc[mask, 1:3] = right
data.iloc[mask, 3:5] = left
print("AFTER")
print(data)
OUTPUT
BEFORE
   HomeAway  Value1  Value2  Value3  Value4
0         1      14       8       6       3
1         1      16      14       2       3
2         0      29      24       5       2
3         0      22      14       8       2
4         1      21      19       2       0
AFTER
   HomeAway  Value1  Value2  Value3  Value4
0         1      14       8       6       3
1         1      16      14       2       3
2         0       5       2      29      24
3         0       8       2      22      14
4         1      21      19       2       0

Compare two dataframe and conditionally capture random data in Python

The main logic of my question is comparing two dataframes, but it differs from the existing questions here: Q1, Q2, Q3.
Let's create dummy two dataframes.
data1 = {'user': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
         'checkinid': [10, 20, 30, 40, 50, 35, 45, 55, 20, 120, 100, 35, 55, 180, 200, 400],
         'count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
data2 = {'checkinid': [10, 20, 30, 35, 40, 45, 50, 55, 60, 70, 100, 120, 180, 200, 300, 400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
data2 contains the whole set of checkinid values. I am trying to create a training file.
For example, user 1 visited 5 places, whose ids are (10, 20, 30, 40, 50).
I want to randomly add places that user 1 did not visit and set their 'count' column to 0.
My expected dataframe looks like this:
user checkinid count
1 10 1
1 20 1
1 30 1
1 40 1
1 50 1
1 300 0 (add randomly)
1 180 0 (add randomly)
1 55 0 (add randomly)
2 35 1
2 45 1
2 55 1
2 20 1
2 120 1
2 10 0 (add randomly)
2 400 0 (add randomly)
2 180 0 (add randomly)
... ...
Those who read the question may ask how much random data to add.
For each user, adding 3 non-visited places is enough for this example.
This might not be the best solution, but it works:
get each user, then pick the checkinids which are not assigned to them.
# get all users
users = df1.user.unique()
for user in users:
    checkins = df1.loc[df1['user'] == user]
    # ids present only in df2, i.e. not visited by this user; sample 3 of them
    df = (checkins.merge(df2, how='outer', indicator=True)
                  .loc[lambda x: x['_merge'] == 'right_only']
                  .sample(n=3))
    df['user'] = user
    df['count'] = 0
    df.pop('_merge')
    df1 = pd.concat([df1, df], ignore_index=True)

# sort the dataframe based on user
df1 = df1.sort_values(by=['user'])
# re-arrange cols
df1 = df1[['user', 'checkinid', 'count']]
print(df1)
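An alternative sketch without merge: compute the set difference of checkinids per user and sample from it directly. This is not the answer above, just another way under the same assumptions, reusing the data from the question:

```python
import random
import pandas as pd

data1 = {'user': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
         'checkinid': [10, 20, 30, 40, 50, 35, 45, 55, 20, 120, 100, 35, 55, 180, 200, 400],
         'count': [1] * 16}
data2 = {'checkinid': [10, 20, 30, 35, 40, 45, 50, 55, 60, 70, 100, 120, 180, 200, 300, 400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

all_ids = set(df2['checkinid'])
rows = []
for user, grp in df1.groupby('user'):
    # ids this user has not visited; sorted for reproducible sampling input
    unvisited = sorted(all_ids - set(grp['checkinid']))
    for cid in random.sample(unvisited, 3):
        rows.append({'user': user, 'checkinid': cid, 'count': 0})

out = pd.concat([df1, pd.DataFrame(rows)], ignore_index=True).sort_values('user')
```

Building all the new rows first and concatenating once avoids repeatedly growing df1 inside the loop.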

Pandas DataFrame group-by indexes matching list - indexes respectively smaller than list[i+1] and greater than list[i]

I have a DataFrame Times_df with times in a single column and a second DataFrame End_df with specific end times for each group indexed by group name.
import random
import numpy as np
import pandas as pd

Times_df = pd.DataFrame({'time': np.unique(np.cumsum(np.random.randint(5, size=(100,))), axis=0)})
End_df = pd.DataFrame({'end time': np.unique(random.sample(range(Times_df.index.values[0], Times_df.index.values[-1]), 10))})
End_df.index.name = 'group'
I want to add a group index for all times in Times_df smaller than or equal to each consecutive end time in End_df but greater than the previous one.
I can only do it for now with a loop, which takes forever ;(
lis = []
i = 1
for row in Times_df['time'].values:
    while i <= row:
        lis.append((End_df['end time'] == row).index)
        i += 1
Then I add the list lis as a new column to Times_df
Times_df['group']=lis
Another solution that sadly still uses a loop is this:
test_df = pd.DataFrame()
for group, index in End_df.iterrows():
    test = count.loc[count.index <= index['end time']][:]
    test['group'] = group
    test_df = pd.concat([test_df, test], axis=0, ignore_index=True)
I think what you are looking for is pd.cut to bin your values into the groups.
bins = [0, 3, 10, 20, 53, 59, 63, 65, 68, 74, np.inf]
groups = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Times_df["group"] = pd.cut(Times_df["time"], bins, labels=groups)
print(Times_df)
   time group
0     2     0
1     3     0
2     7     1
3    11     2
4    15     2
5    16     2
6    18     2
7    22     3
8    25     3
9    28     3
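Rather than hardcoding the bin edges, they can be derived from End_df itself. A sketch under the assumption that the group boundaries are the sorted 'end time' values, with infinite guards on both sides so every time lands in some bin (the literal values below stand in for the random data in the question):

```python
import numpy as np
import pandas as pd

Times_df = pd.DataFrame({'time': [2, 3, 7, 11, 15, 16, 18, 22, 25, 28]})
End_df = pd.DataFrame({'end time': [3, 10, 20, 53, 59, 63, 65, 68, 74]})
End_df.index.name = 'group'

# One interval per boundary: (-inf, e0], (e0, e1], ..., (e_last, inf)
bins = [-np.inf, *End_df['end time'], np.inf]
Times_df['group'] = pd.cut(Times_df['time'], bins=bins,
                           labels=range(len(End_df) + 1))
```

This keeps the binning in sync with End_df if the end times change.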

How to conditionally fill new column using for loop in python? [duplicate]

This question already has answers here:
Pandas conditional creation of a series/dataframe column
(13 answers)
Closed 3 years ago.
I want to add a new column and fill values based on condition.
df:
indicator, value, a, b
1, 20, 5, 3
0, 30, 6, 8
0, 70, 2, 2
1, 10, 3, 7
I want to add a new column (value_new) based on Indicator. If indicator == 1, value_new = a*b otherwise value_new = value.
df:
indicator, value, a, b, value_new
1, 20, 5, 3, 15
0, 30, 6, 8, 30
0, 70, 2, 2, 70
1, 10, 3, 7, 21
I have tried following:
value_new = []
for i in range(1, len(df)):
    if indicator[i] == 1:
        value_new.append(df['a'][i]*df['b'][i])
    else:
        value_new.append(df['value'][i])
df['value_new'] = value_new
Error: 'Length of values does not match length of index'
And I have also tried:
for i in range(1, len(df)):
    if indicator[i] == 1:
        df['value_new'][i] = df['a'][i]*df['b'][i]
    else:
        df['value_new'][i] = df['value'][i]
KeyError: 'value_new'
You can use np.where:
import numpy as np

df['value_new'] = np.where(df['indicator'], df['a']*df['b'], df['value'])
print(df)
Prints:
indicator value a b value_new
0 1 20 5 3 15
1 0 30 6 8 30
2 0 70 2 2 70
3 1 10 3 7 21
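A pure-pandas alternative with the same effect uses Series.where: keep 'value' wherever the indicator is 0, and take a*b elsewhere. A minimal sketch with the data from the question:

```python
import pandas as pd

df = pd.DataFrame({'indicator': [1, 0, 0, 1],
                   'value': [20, 30, 70, 10],
                   'a': [5, 6, 2, 3],
                   'b': [3, 8, 2, 7]})

# Keep 'value' where indicator == 0; otherwise replace with a*b
df['value_new'] = df['value'].where(df['indicator'].eq(0), df['a'] * df['b'])
print(df['value_new'].tolist())  # [15, 30, 70, 21]
```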

python pandas isin method?

I have a dictionary 'wordfreq' like this:
{'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
and I want to put the keys in a list if the value is more than 5 and the key is not in another dataframe 'df', then add them to a list called 'stopword'. Here is the df dataframe:
word freq
1 paradies 1
5 tucuman 1
and here is the code I am using:
stopword = []
for k, v in wordfreq.items():
    if v >= 5:
        if k not in list_c:
            stopword.append(k)
Does anybody know how I can do the same thing with the isin() method, or at least more efficiently?
I'd load your dict into a df:
In [177]:
wordfreq = {'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
df = pd.DataFrame({'word':list(wordfreq.keys()), 'freq':list(wordfreq.values())})
df
Out[177]:
freq word
0 1 frogfeet
1 1 tucuman
2 57 paradies
3 1 d8848
4 5000 jobvark
5 100 midgley
6 1 jiaoyuwang
7 30 techsmart
8 2 weisman
9 19 walter
10 2 amdahl
And then filter using isin against the other df (df1 in my case) like this:
In [181]:
df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]
Out[181]:
freq word
4 5000 jobvark
5 100 midgley
7 30 techsmart
9 19 walter
So the boolean condition looks for freq values greater than 5, and uses isin with the inverted boolean mask ~ to keep only words that are not in the other df.
You can then easily get a list:
In [182]:
list(df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]['word'])
Out[182]:
['jobvark', 'midgley', 'techsmart', 'walter']
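If you only need the list, a plain set lookup avoids building the intermediate DataFrame entirely. A minimal sketch with the dict from the question (using >= 5 as in the original loop, and assuming the other df's 'word' column holds 'paradies' and 'tucuman'):

```python
wordfreq = {'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100,
            'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1,
            'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
seen = {'paradies', 'tucuman'}  # the 'word' column of the other df, as a set

# Set membership is O(1), so this is a single pass over the dict
stopword = [k for k, v in wordfreq.items() if v >= 5 and k not in seen]
print(stopword)  # ['techsmart', 'jobvark', 'midgley', 'walter']
```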
