Splitting a list of values in a Dataframe column - python

I was working with the Gun violence dataset from Kaggle which had the age column like this:
In [5]: df['participant_age_group'].head()
Out [5]:
0 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...
1 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...
2 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...
3 0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...
4 0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::...
Name: participant_age_group, dtype: object
Here 0::, 1:: correspond to the participant index. I want to split these strings and form a whole new dataframe, say df_age, with one row per age group and the total number of people belonging to that age group. For example:
Age Group No_of_people
18 300
25 210
30 100
So that I can then group by age, count the number of people, and visualize which age group is responsible for the most gun violence incidents.
Unfortunately I'm only able to split the strings; I haven't managed to get from there to what I want.

I started from this input:
import pandas as pd

df = pd.DataFrame({'participant_age_group': ['0::Adult 18+||1::Adult 18+||2::Adult 18+||',
                                             '0::Adult 18+||1::Adult 18+||2::Adult 18+||',
                                             '0::Adult 25+||1::Adult 25+||2::Adult 30+||',
                                             '0::Adult 18+||1::Adult 18+||2::Teen 12-17||']})
then to create the df_age:
df_age = (df['participant_age_group'].str.replace('+', '', regex=False)
          .str.split(r'\|{2}', expand=True).stack()
          .str.split(' ', expand=True).dropna()
          .groupby(1, as_index=False).count()
          .rename(columns={0: 'No_of_people', 1: 'Age_group'}))
Some explanation of the code.
str.split(r'\|{2}', expand=True).stack() splits each row wherever the symbol || appears in the string, and stack reshapes the expanded columns back into rows. You get something like this, where the first level of the index is the row number in your original df.
0 0 0::Adult 18
1 1::Adult 18
2 2::Adult 18
3
1 0 0::Adult 18
1 1::Adult 18
...
(Not all the data is printed.) Then str.split(' ', expand=True).dropna() will split each string where a space is (before the age) and also drop the empty rows to get:
0 1
0 0 0::Adult 18
1 1::Adult 18
2 2::Adult 18
1 0 0::Adult 18
1 1::Adult 18
...
Here you can see you have created two columns, 0 and 1, and column 1 contains the ages, so you just have to group by this column and count the occurrence of each age with groupby(1, as_index=False).count().
With my input, df_age is like:
Age_group No_of_people
0 12-17 1
1 18 8
2 25 2
3 30 1
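
From here, to get the view the question asks for (which age group accounts for the most incidents), a minimal follow-up sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt  # assumption: matplotlib is installed

# Sort so the most frequent age group comes first, then plot.
df_age = df_age.sort_values('No_of_people', ascending=False)
df_age.plot.bar(x='Age_group', y='No_of_people', legend=False)
plt.ylabel('No_of_people')
plt.show()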

Related

Is there a way to avoid while loops using pandas in order to speed up my code?

I'm writing code to merge several dataframes together using pandas.
Here is my first table:
Index Values Intensity
1 11 98
2 12 855
3 13 500
4 24 140
and here is the second one:
Index Values Intensity
1 21 1000
2 11 2000
3 24 0.55
4 25 500
With these two df, I concatenate and drop_duplicates on the Values column, which gives me the following df:
Index Values Intensity_df1 Intensity_df2
1 11 0 0
2 12 0 0
3 13 0 0
4 24 0 0
5 21 0 0
6 25 0 0
I would like to recover the intensity of each value in each dataframe. For this purpose, I'm iterating through each line of each df, which is very inefficient. Here is the code I use:
m = 0
while m < len(num_df):
    n = 0
    while n < len(df3):
        temp_intens_abs = df[m]['Intensity'][df3['Values'][n] == df[m]['Values']]
        if temp_intens_abs.empty:
            merged.at[n, "Intensity_df%s" % (m + 1)] = 0
        else:
            merged.at[n, "Intensity_df%s" % (m + 1)] = pandas.to_numeric(temp_intens_abs, errors='coerce')
        n = n + 1
    m = m + 1
The resulting df3 looks like this at the end:
Index Values Intensity_df1 Intensity_df2
1 11 98 2000
2 12 855 0
3 13 500 0
4 24 140 0.55
5 21 0 1000
6 25 0 500
My question is: is there a way to directly recover "present" values in a df by comparing two columns directly using pandas? I've tried several solutions using numpy but without success. Thanks in advance for your help.
You can try joining these dataframes with merge: df3 = df1.merge(df2, on="Values"). To reproduce your exact output, use an outer join so values present in only one frame are kept, as in the sketch below.
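
A minimal sketch with the frames from the question; how="outer" keeps values present in either frame, suffixes labels the two Intensity columns, and fillna(0) fills the gaps:

import pandas as pd

df1 = pd.DataFrame({'Values': [11, 12, 13, 24],
                    'Intensity': [98, 855, 500, 140]})
df2 = pd.DataFrame({'Values': [21, 11, 24, 25],
                    'Intensity': [1000, 2000, 0.55, 500]})

# Outer merge: one row per distinct Value, both intensities side by side.
df3 = df1.merge(df2, on='Values', how='outer',
                suffixes=('_df1', '_df2')).fillna(0)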

Aggregating customer spend without any customer ID

I have 2 columns as below. The first column is spend, and the second column is months from offer. Unfortunately there is no ID to identify each customer. In the case below, there are three customers. e.g. The first 5 rows represent customer 1, the next 3 rows are customer 2, and then final 7 rows are customer 3. You can tell by looking at the months_from_offer, which go from -x to x months for each customer (x is not necessarily the same for each customer, as shown here where x=2,1,3 respectively for customers 1,2,3).
What I am looking to do is calculate the difference between post-offer spend and pre-offer spend for each customer. I don't care about the individual customers themselves, but I would like an overview - e.g. 10 customers had a post/pre difference between $0 and $100.
As an example with the data below, to calculate the post/pre offer difference for customer 1, it is -$10 - $32 + $23 + $54 = $35
for customer 2: -$21 + $87 = $66
for customer 3: -$12 - $83 - $65 + $80 + $67 + $11 = -$2
spend months_from_offer
$10 -2
$32 -1
$43 0
$23 1
$54 2
$21 -1
$23 0
$87 1
$12 -3
$83 -2
$65 -1
$21 0
$80 1
$67 2
$11 3
You can identify the customers using the following and then groupby customer:
import numpy as np

df['customer'] = df['months_from_offer'].cumsum().shift().eq(0).cumsum().add(1)
# Another way to calculate 'customer', per @teylyn's method:
# df['customer'] = np.sign(df['months_from_offer']).diff().lt(0).cumsum().add(1)
df['amount'] = df['spend'].str[1:].astype(int) * np.sign(df['months_from_offer'])
df.groupby('customer')['amount'].sum().reset_index()
Output:
customer amount
0 1 35
1 2 66
2 3 -2
How it is done:
spend months_from_offer customer amount
0 $10 -2 1 -10
1 $32 -1 1 -32
2 $43 0 1 0
3 $23 1 1 23
4 $54 2 1 54
5 $21 -1 2 -21
6 $23 0 2 0
7 $87 1 2 87
8 $12 -3 3 -12
9 $83 -2 3 -83
10 $65 -1 3 -65
11 $21 0 3 0
12 $80 1 3 80
13 $67 2 3 67
14 $11 3 3 11
Calculate the 'customer' column using cumsum, shift and eq, then add 1 to start at customer 1.
Calculate 'amount' by stripping the '$' from 'spend' (string manipulation) and multiplying by np.sign of 'months_from_offer'.
Sum 'amount' with groupby on 'customer'.
In Excel, you can insert a helper column that looks at the sign, determines if the sign is different from the row above, and then increments a counter.
Hard code a customer ID of 1 into the first row of data, then calculate the rest.
=IF(AND(SIGN(A3)=-1,SIGN(A3)<>SIGN(A2)),B2+1,B2)
Copy the results and paste as values, then you can use them to aggregate your data
Use pandas.Series.diff with cumsum to create pseudo user id:
s = df["months_from_offer"].diff().lt(0).cumsum()
Output:
0 0
1 0
2 0
3 0
4 0
5 1
6 1
7 1
8 2
9 2
10 2
11 2
12 2
13 2
14 2
Name: months_from_offer, dtype: int64
Then use pandas.Series.clip to make the series either -1, 0, or 1, then do multiplication:
spend = (df["spend"] * df["months_from_offer"].clip(-1, 1))
Then use groupby.sum with the pseudo id s:
spend.groupby(s).sum()
Final output:
months_from_offer
0 35
1 66
2 -2
dtype: int64
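
Note that the snippets above assume 'spend' is already numeric; with the question's $-prefixed strings, a conversion step would be needed first, e.g.:

# Strip the leading '$' and convert to integers.
df["spend"] = df["spend"].str.lstrip("$").astype(int)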
Create id
s = df['months_from_offer'].iloc[::-1].cumsum().eq(0).iloc[::-1].cumsum()
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 3
11 3
12 3
13 3
14 3
Name: months_from_offer, dtype: int32
Then assign it
df['id']=s
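
From there, one way to finish (a sketch that strips the '$' from 'spend' and reuses the sign trick from the first answer):

import numpy as np

# Pre-offer rows get a negative sign, post-offer rows a positive sign.
df['amount'] = df['spend'].str.lstrip('$').astype(int) * np.sign(df['months_from_offer'])
df.groupby('id')['amount'].sum()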
I assume you wanted to read an Excel file using pandas.
import pandas as pd

df = pd.read_excel('file.xlsx', sheet_name='yoursheet')
pre = 0
post = 0
for i in df.index:
    if df['months_from_offer'][i] < 0:
        pre += int(df['spend'][i].lstrip('$'))  # strip the '$' before converting
    if df['months_from_offer'][i] > 0:
        post += int(df['spend'][i].lstrip('$'))
dif = post - pre
If you would like to read the data for each customer:
import pandas as pd

df = pd.read_excel('file.xlsx', sheet_name='yoursheet')
customers = []
last = None
pre = 0
post = 0
for i in df.index:
    # A new customer starts when two consecutive months are more than 1 apart.
    if last is not None and abs(df['months_from_offer'][i] - last) > 1:
        customers.append(post - pre)
        pre = 0
        post = 0
    if df['months_from_offer'][i] < 0:
        pre += int(df['spend'][i].lstrip('$'))
    if df['months_from_offer'][i] > 0:
        post += int(df['spend'][i].lstrip('$'))
    last = df['months_from_offer'][i]
customers.append(post - pre)  # append the final customer after the loop
Or you can use a dict to name each customer. The way I separated the customers: when two consecutive months are more than 1 apart, another person's record must be starting.

Pandas: Replace/ Change Duplicate values within a Time Range

I have a pandas dataframe where I am trying to replace/change duplicate values to 0 (I don't want to delete the values) within a certain range of days.
So, in the example given below, I want to replace duplicate values in all columns with 0 within a range of, let's say, 3 days (the number can be changed). The desired result is also given below.
A B C
01-01-2011 2 10 0
01-02-2011 2 12 2
01-03-2011 2 10 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 5 23 1
01-07-2011 4 21 4
01-08-2011 2 21 5
01-09-2011 1 11 0
So, the output should look like
A B C
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
Any help will be appreciated.
You can use df.shift() for this to look at a value from a row up or down (or several rows, specified by the number x in .shift(x)).
You can use that in combination with .loc to select all rows that have an identical value to the 2 rows above and then replace it with a 0.
Something like this should work (the code is written to be flexible for any number of columns and for the number of days):
numberOfDays = 3  # number of days to compare
for col in df.columns:
    for x in range(1, numberOfDays):
        df.loc[df[col] == df[col].shift(x), col] = 0
print(df)
This gives me the output:
A B C
date
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
I don't find anything better than looping over all columns, because every column leads to a different grouping.
First define a function which does what you want at the group level, i.e. setting all but the first entry to zero:
def set_zeros(g):
    g.values[1:] = 0
    return g

for c in df.columns:
    df[c] = df.groupby([c, pd.Grouper(freq='3D')], as_index=False)[c].transform(set_zeros)
This custom function can be applied to each group, which is defined by a time range (freq='3D') and equal values of a column within this period. As the columns generally have their equal values in different rows, this has to be done for each column in a loop.
Change freq to 5D, 10D or 20D for your other considerations.
For a detailed description of how to define the time period see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
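
For completeness, pd.Grouper(freq='3D') requires a DatetimeIndex; a minimal setup sketch matching the question's example (the dates are assumed from the table shown):

import pandas as pd

df = pd.DataFrame(
    {'A': [2, 2, 2, 3, 5, 5, 4, 2, 1],
     'B': [10, 12, 10, 11, 15, 23, 21, 21, 11],
     'C': [0, 2, 0, 3, 0, 1, 4, 5, 0]},
    index=pd.date_range('2011-01-01', periods=9, freq='D'))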

How to replace dataframe columns on given condition?

I have a dataframe with two columns, first_column and second_column. first_column is an id and second_column is a room number.
As you can see from the pic, a particular id can serve in different room numbers. Now I want to replace the second_column values with 1 or 0 based on the given condition:
1) If a particular first_column id does not serve in rooms 9, 10 and 11, then replace all of that id's room numbers with 1; if it does serve there, then replace all of its room numbers with 0.
In the picture above, first_column id 3737 does not work in rooms 9, 10 and 11, so all of 3737's room-number rows will be replaced by 1.
I think you need groupby with transform to compare by sets, then invert the mask with ~ and convert to integers:
df['new'] = ((~df.groupby('first_column')['second_column']
                .transform(lambda x: set(x) >= set([9, 10, 11])))
             .astype(int))
print (df)
first_column second_column new
0 3767 2 1
1 3767 4 1
2 3767 6 1
3 6282 2 0
4 6282 9 0
5 6282 10 0
6 6282 11 0
7 10622 0 1
8 13096 7 1
9 13096 10 1
10 13896 11 1
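
The set(x) >= set([9, 10, 11]) check asks whether a group contains all three of those rooms (a superset test). An equivalent, perhaps more readable, spelling of the same idea (a sketch):

required = {9, 10, 11}
# True for groups containing all required rooms; invert and cast to 0/1.
df['new'] = (~df.groupby('first_column')['second_column']
              .transform(lambda x: required.issubset(set(x)))).astype(int)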

Pandas Assign Value Multiple Conditions

Given the following DataFrame:
import pandas as pd
import numpy as np
d = pd.DataFrame({0: [10, 20, 30, 40], 1: [20, 45, 10, 35], 2: [34, 24, 54, 22],
                  '0 to 1': [1, 1, 1, 0], '0 to 2': [1, 0, 1, 1], '1 to 2': [1, 1, 1, 1]})
d = d[[0, 1, 2, '0 to 1', '0 to 2', '1 to 2']]
d
d
0 1 2 0 to 1 0 to 2 1 to 2
0 10 20 34 1 1 1
1 20 45 24 1 0 1
2 30 10 54 1 1 1
3 40 35 22 0 1 1
I'd like to produce 3 new columns; one for each of the 3 columns on the left with the following criteria:
Include the original value.
If the original value is greater than the value being compared, and if there is a 1 in the comparison column (the columns with 'to'), list the other column(s) separated by commas.
For example:
Column 1, row 0 has a value of 20, which is greater than its corresponding value in column 0 (10). The comparison column between columns 0 and 1 is '0 to 1'. In this column, there is a value of 1 for row 0. There is another column comparing column 1 to column 2, but the value for column 2, row 0 is 34, so since it is greater than 20, ignore the 1 in '1 to 2'.
So the final value would be '20 (0)'.
Here is the desired resulting data frame:
0 1 2 0 to 1 0 to 2 1 to 2 0 Final 1 Final 2 Final
0 10 20 34 1 1 1 10 20 (0) 34 (0,1)
1 20 45 24 1 0 1 20 45 (0,2) 24
2 30 10 54 1 1 1 30 (1) 10 54 (0,1)
3 40 35 22 0 1 1 40 (2) 35 (2) 22
Thanks in advance!
Note: Because my real data will have varying numbers of columns on the left (i.e. 0,1,2,3,4) and resulting comparisons, I need an approach that finds all conditions that apply. So, for a particular value, find all cases where the comparison column value is 1 and the value is higher than that being compared.
Update
To clarify:
'0 to 1' compares column 0 to column 1. If there is a significant difference between them, the value is 1, else 0. So for '0 Final', if 0 is larger than 1 and '0 to 1' is 1, there would be a (1) after the value to signify that 0 is significantly larger than 1 for that row.
Here's what I have so far:
d['0 Final']=d[0].astype(str)
d['1 Final']=d[1].astype(str)
d['2 Final']=d[2].astype(str)
d.loc[((d[0]>d[1])&(d['0 to 1']==1))|((d['0 to 2']==1)&(d[0]>d[2])),'0 Final']=d['0 Final']+' '
d.loc[((d[1]>d[0])&(d['0 to 1']==1))|((d['1 to 2']==1)&(d[1]>d[2])),'1 Final']=d['1 Final']+' '
d.loc[((d[2]>d[0])&(d['0 to 2']==1))|((d['1 to 2']==1)&(d[2]>d[1])),'2 Final']=d['2 Final']+' '
d.loc[(d['0 to 1']==1)&(d[0]>d[1]),'0 Final']=d['0 Final']+'1'
d.loc[(d['0 to 2']==1)&(d[0]>d[2]),'0 Final']=d['0 Final']+'2'
d.loc[(d['0 to 1']==1)&(d[1]>d[0]),'1 Final']=d['1 Final']+'0'
d.loc[(d['1 to 2']==1)&(d[1]>d[2]),'1 Final']=d['1 Final']+'2'
d.loc[(d['0 to 2']==1)&(d[2]>d[0]),'2 Final']=d['2 Final']+'0'
d.loc[(d['1 to 2']==1)&(d[2]>d[1]),'2 Final']=d['2 Final']+'1'
d.loc[d['0 Final'].str.contains(' '),'0 Final']=d[0].astype(str)+' ('+d['0 Final'].str.split(' ').str[1]+')'
d.loc[d['1 Final'].str.contains(' '),'1 Final']=d[1].astype(str)+' ('+d['1 Final'].str.split(' ').str[1]+')'
d.loc[d['2 Final'].str.contains(' '),'2 Final']=d[2].astype(str)+' ('+d['2 Final'].str.split(' ').str[1]+')'
0 1 2 0 to 1 0 to 2 1 to 2 0 Final 1 Final 2 Final
0 10 20 34 1 1 1 10 20 (0) 34 (01)
1 20 45 24 1 0 1 20 45 (02) 24
2 30 10 54 1 1 1 30 (1) 10 54 (01)
3 40 35 22 0 1 1 40 (2) 35 (2) 22
It has 2 shortcomings:
I cannot predict how many columns I will have to compare, so the first 3 .loc lines will need to somehow account for this, assuming it can and it's the best approach.
I still need to figure out how to get a comma and space between each number in parentheses if there is more than 1.
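
A sketch of one way around both shortcomings: loop over whichever value columns exist and join the "beaten" column labels with ', ' (value_cols is hard-coded here as an assumption; with real data it would be detected from the frame):

import numpy as np

value_cols = [0, 1, 2]  # assumption: detect these dynamically for real data

for col in value_cols:
    others = [c for c in value_cols if c != col]
    # For each other column, record its label where this column is larger
    # and the corresponding comparison column holds a 1.
    beaten = [np.where((d[f'{min(col, o)} to {max(col, o)}'] == 1) & (d[col] > d[o]),
                       str(o), None)
              for o in others]
    joined = [', '.join(lbl for lbl in row if lbl) for row in zip(*beaten)]
    d[f'{col} Final'] = [f'{v} ({j})' if j else str(v)
                         for v, j in zip(d[col], joined)]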
