I have a dataframe of the following format
import pandas as pd

df = pd.DataFrame(
    {"company": ["McDonalds", "Arbys", "Wendys"],
     "City": ["Dallas", "Austin", "Chicago"],
     "Datetime": [{"11/23/2016": "1", "09/06/2011": "2"},
                  {"02/23/2012": "1", "04/06/2013": "2"},
                  {"10/23/2017": "1", "05/06/2019": "2"}]})
df
>>>   company    City                                Datetime
>>> McDonalds  Dallas  {'11/23/2016': '1', '09/06/2011': '2'}
>>>     Arbys  Austin  {'02/23/2012': '1', '04/06/2013': '2'}
>>>    Wendys Chicago  {'10/23/2017': '1', '05/06/2019': '2'}
The dictionary inside the "Datetime" column is actually stored as a string, so I first have to parse it into a Python dictionary using ast.literal_eval.
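For example (assuming every cell in the column is a stringified dict):
import ast

# parse the stringified dicts into real Python dicts first
df['Datetime'] = df['Datetime'].apply(ast.literal_eval)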
I would like to unstack the dataframe based on the values in Datetime so that the output looks as follows:
df_out
>>> company City Date Value
>>> McDonalds Dallas 11/23/2016 1
>>> McDonalds Dallas 09/06/2011 2
>>> Arbys Austin 02/23/2012 1
>>> Arbys Austin 04/06/2013 2
>>> Wendys Chicago 10/23/2017 1
>>> Wendys Chicago 05/06/2019 2
I am a bit lost on this one. I know I will need to iterate over the rows and read each dictionary, so I had the idea of using df.iterrows() and creating namedtuples of each row's values that won't change, then looping over the dictionary itself and attaching the different datetime values, but I am not sure this is the most efficient way. Any tips would be appreciated.
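Roughly, the loop I have in mind would look like this (assuming the dicts are already parsed), though I suspect it is slow:
rows = []
for _, row in df.iterrows():
    # the company/city values repeat for every date in the row's dict
    for date, value in row['Datetime'].items():
        rows.append({'company': row['company'], 'City': row['City'],
                     'Date': date, 'Value': value})
df_out = pd.DataFrame(rows)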
My try:
(df.drop('Datetime', axis=1)
   .merge(df.Datetime.agg(lambda x: pd.Series(x))
            .stack()
            .reset_index(-1),
          left_index=True,
          right_index=True)
   .rename(columns={'level_1': 'Date', 0: 'Value'})
)
Output:
company City Date Value
0 McDonalds Dallas 11/23/2016 1
0 McDonalds Dallas 09/06/2011 2
1 Arbys Austin 02/23/2012 1
1 Arbys Austin 04/06/2013 2
2 Wendys Chicago 10/23/2017 1
2 Wendys Chicago 05/06/2019 2
I would flatten the dictionaries in Datetime, construct a new DataFrame from them, and finally join back:
from itertools import chain

# one (Date, Val) pair per dict item; repeat the original index once per
# item so the join below lines each pair up with its source row
df1 = pd.DataFrame(chain.from_iterable(df.Datetime.map(dict.items)),
                   index=df.index.repeat(df.Datetime.str.len()),
                   columns=['Date', 'Val'])
Out[551]:
Date Val
0 11/23/2016 1
0 09/06/2011 2
1 02/23/2012 1
1 04/06/2013 2
2 10/23/2017 1
2 05/06/2019 2
df_final = df.drop('Datetime', axis=1).join(df1)
Out[554]:
company City Date Val
0 McDonalds Dallas 11/23/2016 1
0 McDonalds Dallas 09/06/2011 2
1 Arbys Austin 02/23/2012 1
1 Arbys Austin 04/06/2013 2
2 Wendys Chicago 10/23/2017 1
2 Wendys Chicago 05/06/2019 2
Here is a clean solution:
Solution
df = df.set_index(['company', 'City'])
df_stack = (df['Datetime'].apply(pd.Series)
                          .stack()
                          .reset_index()
                          .rename(columns={'level_2': 'Datetime', 0: 'val'}))
Output
print(df_stack.to_string())
company City Datetime val
0 McDonalds Dallas 11/23/2016 1
1 McDonalds Dallas 09/06/2011 2
2 Arbys Austin 02/23/2012 1
3 Arbys Austin 04/06/2013 2
4 Wendys Chicago 10/23/2017 1
5 Wendys Chicago 05/06/2019 2
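Note that apply(pd.Series) is convenient but comparatively slow on large frames, since it constructs a new Series for every row; the itertools.chain approach above scales better.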
Related
I have a df that looks like this.
id rent place
0 Yes colorado
0 yes Mexico
0 yes Brazil
1 yes colorado
1 yes Mexico
1 yes Brazil
2 yes colorado
2 yes Mexico
2 yes Brazil
3 yes colorado
3 yes Mexico
3 yes Brazil
I need the "id" column to continue to increase by 1 and the values in the "place" column to repeat every 3rd row. I have no idea how to do this.
You could build your DataFrame row by row, collecting the relevant rows as you go and concatenating them at the end (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead):
ids = [0, 1, 2, 3]
rent = [123, 'yes', 'yes']
place = ['colorado', 'Mexico', 'Brazil']

pieces = []  # one single-row frame per (id, place) pair
for i in ids:
    for j in range(len(rent)):
        pieces.append(pd.DataFrame({'rent': rent[j], 'place': place[j]},
                                   index=[i]))
df = pd.concat(pieces)
df.reset_index(inplace=True)
df.rename(columns={'index': 'id'}, inplace=True)
Output df is:
id rent place
0 0 123 colorado
1 0 yes Mexico
2 0 yes Brazil
3 1 123 colorado
4 1 yes Mexico
5 1 yes Brazil
6 2 123 colorado
7 2 yes Mexico
8 2 yes Brazil
9 3 123 colorado
10 3 yes Mexico
11 3 yes Brazil
You can generate a new one like so:
from itertools import cycle

N = 200
places = cycle(["colorado", "mexico", "brazil"])
data = {"id": [j // 3 for j in range(N)],
        "rent": True,
        "place": [next(places) for j in range(N)]}
df = pd.DataFrame(data)
Note that I've replaced rent with a boolean to be less error-prone than text. Output:
id rent place
0 0 True colorado
1 0 True mexico
2 0 True brazil
3 1 True colorado
4 1 True mexico
.. .. ... ...
195 65 True colorado
196 65 True mexico
197 65 True brazil
198 66 True colorado
199 66 True mexico
Alternatively, you can build one sub-frame per place, concatenate them and then sort:
sub_dfs = []
for place in ["brazil", "colorado", "mexico"]:
    sub_dfs.append(pd.DataFrame({"id": range(N), "rent": True, "place": place}))
df = pd.concat(sub_dfs, axis=0).sort_values(["id"])
I have data on births that looks like this:
Date Country Sex
1.1.20 USA M
1.1.20 USA M
1.1.20 Italy F
1.1.20 England M
2.1.20 Italy F
2.1.20 Italy M
3.1.20 USA F
3.1.20 USA F
My goal is a new dataframe in which each row is a date-country combination, together with the total number of births and the numbers of male and female births. It's supposed to look like this:
Date Country Births Males Females
1.1.20 USA 2 2 0
1.1.20 Italy 1 0 1
1.1.20 England 1 1 0
2.1.20 Italy 2 1 1
3.1.20 USA 2 0 2
I tried using this code:
df.groupby(by=['Date', 'Country', 'Sex']).size()
but it only gave me a new column of total births, with different rows for each sex in every date+country combination.
Any help will be appreciated.
Thanks,
Eran
You can group the dataframe on the Date and Country columns, aggregate the Sex column with value_counts, reshape with unstack, and finally create the Births column by summing the counts along axis=1:
out = df.groupby(['Date', 'Country'], sort=False)['Sex']\
        .value_counts().unstack(fill_value=0)
out.assign(Births=out.sum(axis=1)).reset_index()\
   .rename(columns={'M': 'Male', 'F': 'Female'})
Or you can use a very similar approach with pd.crosstab instead of groupby + value_counts:
out = pd.crosstab([df['Date'], df['Country']], df['Sex'], colnames=[None])
out.assign(Births=out.sum(axis=1)).reset_index()\
   .rename(columns={'M': 'Male', 'F': 'Female'})
Date Country Female Male Births
0 1.1.20 USA 0 2 2
1 1.1.20 Italy 1 0 1
2 1.1.20 England 0 1 1
3 2.1.20 Italy 1 1 2
4 3.1.20 USA 2 0 2
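For completeness, the same summary can be sketched with named aggregation (available since pandas 0.25); here the Males/Females counts come from boolean sums rather than value_counts:
out = (df.groupby(['Date', 'Country'], sort=False)['Sex']
         .agg(Births='size',
              Males=lambda s: s.eq('M').sum(),
              Females=lambda s: s.eq('F').sum())
         .reset_index())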
There are 2 dfs; the datatypes are the same.
df1 =
ID city name value
1 LA John 111
2 NY Sam 222
3 SF Foo 333
4 Berlin Bar 444
df2 =
ID city name value
1 NY Sam 223
2 LA John 111
3 SF Foo 335
4 London Foo1 999
5 Berlin Bar 444
I need to compare them and produce a new df containing only the rows that are in df2 but not in df1.
For some reason, the results after applying different methods are wrong.
So far I've tried
pd.concat([df1, df2], join='inner', ignore_index=True)
but it returns all values together
pd.merge(df1, df2, how='inner')
it returns df1
then this one
df1[~(df1.iloc[:, 0].isin(list(df2.iloc[:, 0])))]
it returns df1
The desired output is
ID city name value
1 NY Sam 223
2 SF Foo 335
3 London Foo1 999
Use DataFrame.merge on all columns except the first, together with the indicator parameter:
c = df1.columns[1:].tolist()
Or:
c = ['city', 'name', 'value']
df = (df2.merge(df1, on=c, indicator=True, how='left', suffixes=('', '_'))
         .query("_merge == 'left_only'")[df1.columns])
print(df)
ID city name value
0 1 NY Sam 223
2 3 SF Foo 335
3 4 London Foo1 999
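Note that ID is deliberately left out of the join keys: the two frames number the same records differently (NY/Sam is ID 2 in df1 but ID 1 in df2), so matching on ID would wrongly treat identical rows as different.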
Try this:
print("------------------------------")
print(df1)
df2 = DataFrameFromString(s, columns)  # helper that builds df2 from a string (defined elsewhere)
print("------------------------------")
print(df2)
common = (df1.merge(df2, on=["city", "name"])
             .rename(columns={"value_y": "value", "ID_y": "ID"})
             .drop(["value_x", "ID_x"], axis=1))
print("------------------------------")
print(common)
OUTPUT:
------------------------------
ID city name value
0 ID city name value
1 1 LA John 111
2 2 NY Sam 222
3 3 SF Foo 333
4 4 Berlin Bar 444
------------------------------
ID city name value
0 1 NY Sam 223
1 2 LA John 111
2 3 SF Foo 335
3 4 London Foo1 999
4 5 Berlin Bar 444
------------------------------
city name ID value
0 LA John 2 111
1 NY Sam 1 223
2 SF Foo 3 335
3 Berlin Bar 5 444
If I have the following dataframe:
df = pd.DataFrame({'name': ['john', 'mary', 'peter', 'jeff', 'bill', 'lisa', 'jose'],
                   'gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
                   'state': ['california', 'dc', 'california', 'dc',
                             'california', 'texas', 'texas'],
                   'num_children': [2, 0, 0, 3, 2, 1, 4],
                   'num_pets': [5, 1, 0, 5, 2, 2, 3]})
name gender state num_children num_pets
0 john M california 2 5
1 mary F dc 0 1
2 peter M california 0 0
3 jeff M dc 3 5
4 bill M california 2 2
5 lisa F texas 1 2
6 jose M texas 4 3
I want to create a new row and a new column pct. that give the percentage of zero values in the columns num_children and num_pets.
Expected output:
name gender state num_children num_pets pct.
0 pct. 28.6% 14.3%
1 john M california 2 5 0%
2 mary F dc 0 1 50%
3 peter M california 0 0 100%
4 jeff M dc 3 5 0%
5 bill M california 2 2 0%
6 lisa F texas 1 2 0%
7 jose M texas 4 3 0%
I have calculated the percentage of zeros in each row for the target columns:
df['pct'] = df[['num_children', 'num_pets']].astype(bool).sum(axis=1) / 2
df['pct.'] = 1 - df['pct']
del df['pct']
df['pct.'] = pd.Series(["{0:.0f}%".format(val * 100) for val in df['pct.']],
                       index=df.index)
name gender state num_children num_pets pct.
0 john M california 2 5 0%
1 mary F dc 0 1 50%
2 peter M california 0 0 100%
3 jeff M dc 3 5 0%
4 bill M california 2 2 0%
5 lisa F texas 1 2 0%
6 jose M texas 4 3 0%
But I don't know how to insert the results below as the pct. row of the expected output. Please help me get the expected result in a more Pythonic way. Thanks.
df[['num_children', 'num_pets']].astype(bool).sum(axis=0)/len(df.num_children)
Out[153]:
num_children 0.714286
num_pets 0.857143
dtype: float64
UPDATE: the same thing, but for calculating sums; great thanks to #jezrael:
df['sums'] = df[['num_children', 'num_pets']].sum(axis=1)
df1 = (df[['num_children', 'num_pets']].sum()
.to_frame()
.T
.assign(name='sums'))
df = pd.concat([df1.reindex(columns=df.columns, fill_value=''), df],
ignore_index=True, sort=False)
print (df)
name gender state num_children num_pets sums
0 sums 12 18
1 john M california 2 5 7
2 mary F dc 0 1 1
3 peter M california 0 0 0
4 jeff M dc 3 5 8
5 bill M california 2 2 4
6 lisa F texas 1 2 3
7 jose M texas 4 3 7
You can use mean with a boolean mask, comparing to 0 with DataFrame.eq (sum/len = mean by definition), then multiply by 100 and append the percent sign with apply:
s = df[['num_children', 'num_pets']].eq(0).mean(axis=1)
df['pct'] = s.mul(100).apply("{0:.0f}%".format)
For the first row, create a new DataFrame with the same columns as the original and concat them together:
df1 = (df[['num_children', 'num_pets']].eq(0)
.mean()
.mul(100)
.apply("{0:.1f}%".format)
.to_frame()
.T
.assign(name='pct.'))
df = pd.concat([df1.reindex(columns=df.columns, fill_value=''), df],
ignore_index=True, sort=False)
print (df)
name gender state num_children num_pets pct
0 pct. 28.6% 14.3%
1 john M california 2 5 0%
2 mary F dc 0 1 50%
3 peter M california 0 0 100%
4 jeff M dc 3 5 0%
5 bill M california 2 2 0%
6 lisa F texas 1 2 0%
7 jose M texas 4 3 0%
I have this dataframe, and I want the count of all non-zero values of INTERATION per date, location and email.
DATE LOC EMAIL INTERATION
1/11 INDIA qw#mail.com 0
1/11 INDIA ap#mail.com 11
1/11 LONDON az#mail.com 2
2/11 INDIA qw#mail.com 5
2/11 INDIA rw#mail.com 5
2/11 LONDON az#mail.com 0
3/11 LONDON az#mail.com 1
So my resulting dataframe should look like this:
DATE LOC INTERATION
1/11 INDIA 1
1/11 LONDON 1
2/11 INDIA 2
2/11 LONDON 0
3/11 LONDON 1
Thanks in advance
Use groupby with agg and numpy.count_nonzero:
import numpy as np

df1 = df.groupby(['DATE', 'LOC'], as_index=False)['INTERATION'].agg(np.count_nonzero)
print(df1)
DATE LOC INTERATION
0 1/11 INDIA 1
1 1/11 LONDON 1
2 2/11 INDIA 2
3 2/11 LONDON 0
4 3/11 LONDON 1
Another solution is to create a boolean mask by comparing with not-equal via ne, casting to integers and aggregating with sum:
df1 = (df.assign(INTERATION = df['INTERATION'].ne(0).astype(int))
.groupby(['DATE','LOC'], as_index=False)['INTERATION']
.sum())
If you need to group by column EMAIL too:
df2 = df.groupby(['DATE','LOC','EMAIL'], as_index=False)['INTERATION'].agg(np.count_nonzero)
print (df2)
DATE LOC EMAIL INTERATION
0 1/11 INDIA ap#mail.com 1
1 1/11 INDIA qw#mail.com 0
2 1/11 LONDON az#mail.com 1
3 2/11 INDIA qw#mail.com 1
4 2/11 INDIA rw#mail.com 1
5 2/11 LONDON az#mail.com 0
6 3/11 LONDON az#mail.com 1
One not necessarily efficient solution is to convert to bool and then sum. This uses the fact that 0 / 1 are equivalent to False / True respectively in calculations:
res = df.groupby(['DATE', 'LOC'])['INTERATION']\
.apply(lambda x: x.astype(bool).sum()).reset_index()
print(res)
DATE LOC INTERATION
0 1/11 INDIA 1
1 1/11 LONDON 1
2 2/11 INDIA 2
3 2/11 LONDON 0
4 3/11 LONDON 1