Python 2.7 - Pandas DataFrame groupby with two criteria

Let's say I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame(columns=['name','time'])
df = df.append({'name':'Waren', 'time': '20:15'}, ignore_index=True)
df = df.append({'name':'Waren', 'time': '20:12'}, ignore_index=True)
df = df.append({'name':'Waren', 'time': '20:11'}, ignore_index=True)
df = df.append({'name':'Waren', 'time': '01:29'}, ignore_index=True)
df = df.append({'name':'Waren', 'time': '02:15'}, ignore_index=True)
df = df.append({'name':'Waren', 'time': '02:16'}, ignore_index=True)
df = df.append({'name':'Kim', 'time': '20:11'}, ignore_index=True)
df = df.append({'name':'Kim', 'time': '01:29'}, ignore_index=True)
df = df.append({'name':'Kim', 'time': '02:15'}, ignore_index=True)
df = df.append({'name':'Kim', 'time': '01:49'}, ignore_index=True)
df = df.append({'name':'Kim', 'time': '01:49'}, ignore_index=True)
df = df.append({'name':'Kim', 'time': '02:15'}, ignore_index=True)
df = df.append({'name':'Mary', 'time': '22:15'}, ignore_index=True)
df = df.drop(df.index[2])
df = df.drop(df.index[7])
I would like to group this frame first by name and then by runs of consecutive indexes (as in Group by continuous indexes in Pandas DataFrame).
The desired output groups the rows by name, and from each run of consecutive indexes only the first and last element is kept.
I tried it like so:
df.groupby(['name']).groupby(df.index.to_series().diff().ne(1).cumsum()).group
which only raises the error:
AttributeError: Cannot access callable attribute 'groupby' of 'DataFrameGroupBy' objects, try using the 'apply' method
Any help is welcome!

You are doing it wrong: df.groupby(['name']) returns a DataFrameGroupBy object, which does not let you call groupby on again (hence the error). You need to pass both grouping keys together in a single groupby call:
df.groupby(['name', df.index.to_series().diff().ne(1).cumsum()]).groups
Out:
{('Kim', 2): [6, 7],
('Kim', 3): [9, 10, 11],
('Mary', 3): [12],
('Waren', 1): [0, 1],
('Waren', 2): [3, 4, 5]}
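Since you only want the first and last row of each consecutive run, here is a minimal follow-up sketch built on the grouping above (the run counter variable and the 'first'/'last' aggregation are my own choices, not from the question):
runs = df.index.to_series().diff().ne(1).cumsum()
out = df.groupby(['name', runs])['time'].agg(['first', 'last'])
print(out)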

Related

How do I rearrange nested Pandas DataFrame columns?

In the DataFrame below, I want to rearrange the nested columns - i.e. to have 'region_sea' appearing before 'region_inland'
df = pd.DataFrame( {'state': ['WA', 'CA', 'NY', 'NY', 'CA', 'CA', 'WA' ]
                    , 'region': ['region_sea', 'region_inland', 'region_sea', 'region_inland', 'region_sea', 'region_sea', 'region_inland',]
                    , 'count': [1, 3, 4, 6, 7, 8, 4]
                    , 'income': [100, 200, 300, 400, 600, 400, 300]
                    }
                  )
df = df.pivot_table(index='state', columns='region', values=['count', 'income'], aggfunc={'count': 'sum', 'income': 'mean'})
df
I tried the code below but it's not working...any idea how to do this? Thanks
df[['count']]['region_sea', 'region_inland']
You can use sort_index to sort the columns. However, because the columns are nested, it will also swap the order of income and count.
df.sort_index(axis='columns', level=0, ascending=False, inplace=True)
If you don't want to swap income/count, sort only on the 'region' level, but then the two region columns will no longer sit under a single common header for each of count and income.
df.sort_index(axis='columns', level='region', ascending=False, inplace=True)
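If the goal is just a fixed region order while keeping count and income grouped as they are, a reindex on the 'region' level should also work (a sketch of my own, assuming the pivoted df from above):
df = df.reindex(columns=['region_sea', 'region_inland'], level='region')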

How to loop through a column and compare each value to a list

I have a column in a dataset. I need to compare each value from that column to a list. After comparison, if it satisfies a condition, the value of another column should change.
For example, given the list ['james', 'michael', 'clara']: if a name in column A is in the list, column B should be 1, else 0.
How can I solve this in Python?
Change column B where the value in A is in List
Using the loc indexer you can select the rows where the item in column A is in your List, and set column B for those rows.
df.loc[(df["A"].isin(List)), "B"] = 1
Use fillna to fill the remaining empty cells with zeros.
df.fillna(0, inplace=True)
Full Code
import pandas as pd
import numpy as np

names = ['james', 'randy', 'mona', 'lisa', 'paul', 'clara']
List = ["james", "michael", "clara"]
df = pd.DataFrame(data=names, columns=['A'])
df["B"] = np.nan
df.loc[(df["A"].isin(List)), "B"] = 1
df.fillna(0, inplace=True)
This would be a good time to use np.where()
import pandas as pd
import numpy as np
name_list = ['James', 'Sally', 'Sarah', 'John']
df = pd.DataFrame({
    'Names' : ['James', 'Roberts', 'Stephen', 'Hannah', 'John', 'Sally']
})
df['ColumnB'] = np.where(df['Names'].isin(name_list), 1, 0)
df
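As a small variation (my addition, not part of the original answer), the boolean mask returned by isin can also be cast straight to integers, which gives the same 1/0 column:
df['ColumnB'] = df['Names'].isin(name_list).astype(int)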

How to add a row to mention the names of the dataframes after we concatenate them?

I have 3 dataframes with the same format, which I combine horizontally.
I would like to add a row denoting the name of each dataframe, i.e. a header row with df1, df2 and df3 above the respective columns.
I only get that form by copying the data to MS Excel and manually adding the row. Is there any way to do this directly for display in Python?
import pandas as pd
data = {'Name': ['Tom', 'Joseph'], 'Age': [20, 21]}
df1 = pd.DataFrame(data)
data = {'Name': ['John', 'Kim'], 'Age': [15, 17]}
df2 = pd.DataFrame(data)
data = {'Name': ['Paul', 'Dood'], 'Age': [10, 5]}
df3 = pd.DataFrame(data)
pd.concat([df1, df2, df3], axis = 1)
Use the keys parameter in concat:
df = pd.concat([df1, df2, df3], axis = 1, keys=('df1','df2','df3'))
print (df)
      df1          df2          df3
     Name Age     Name Age     Name Age
0     Tom  20     John  15     Paul  10
1  Joseph  21      Kim  17     Dood   5
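If you also want the new top level to carry a name of its own (my addition, not part of the original answer), concat accepts a names argument alongside keys:
df = pd.concat([df1, df2, df3], axis=1, keys=('df1', 'df2', 'df3'), names=['source', None])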
The row is actually a first-level column. You can have it by adding this level to each dataframe before concatenating:
for df_name, df in zip(("df1", "df2", "df3"), (df1, df2, df3)):
    df.columns = pd.MultiIndex.from_tuples(((df_name, col) for col in df))
pd.concat([df1, df2, df3], axis = 1)
Very niche case, but you can use MultiIndex objects in order to build what you want.
Consider that what you need is a two-level header to display the information as you want. A MultiIndex at the column level can accomplish that.
To understand the code better, read about MultiIndex objects in pandas. You basically create the labels (called levels) and then use indexes that point to those labels (called codes) to build the object.
Here is how to do it:
data = {'Name': ['Tom', 'Joseph'], 'Age': [20, 21]}
df1 = pd.DataFrame(data)
data = {'Name': ['John', 'Kim'], 'Age': [15, 17]}
df2 = pd.DataFrame(data)
data = {'Name': ['Paul', 'Dood'], 'Age': [10, 5]}
df3 = pd.DataFrame(data)
df1.columns = pd.MultiIndex(levels=[['df1', 'df2', 'df3'], ['Name', 'Age']], codes=[[0, 0], [0, 1]])
df2.columns = pd.MultiIndex(levels=[['df1', 'df2', 'df3'], ['Name', 'Age']], codes=[[1, 1], [0, 1]])
df3.columns = pd.MultiIndex(levels=[['df1', 'df2', 'df3'], ['Name', 'Age']], codes=[[2, 2], [0, 1]])
And after the concatenation, you will have:
pd.concat([df1, df2, df3], axis = 1)
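A slightly less error-prone way to build the same headers (my own variation, equivalent in effect) is MultiIndex.from_product, which spares you from writing the codes by hand:
df1.columns = pd.MultiIndex.from_product([['df1'], ['Name', 'Age']])
df2.columns = pd.MultiIndex.from_product([['df2'], ['Name', 'Age']])
df3.columns = pd.MultiIndex.from_product([['df3'], ['Name', 'Age']])
The concatenation then displays the same two-level header as the keys-based result shown earlier.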

How to be able to concatenate the values of a column with the name of the other columns in a DataFrame

How can I concatenate the values of a column called "ITEM" with the names of the other columns, thus creating new columns?
If I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({
    'ITEM': ['Item1', 'Item2', 'Item3'],
    'Variable1': [1,1,2],
    'Variable2': [2,1,3],
    'Variable3': [3,2,4]
})
df
I need to transform this dataframe (one row per ITEM, one column per Variable) into a single-row dataframe whose columns combine the variable and item names, e.g. Variable1_Item1, Variable1_Item2, ..., Variable3_Item3.
import pandas as pd
df = pd.DataFrame({
    'ITEM': ['Item1', 'Item2', 'Item3'],
    'Variable1': [1,1,2],
    'Variable2': [2,1,3],
    'Variable3': [3,2,4]
})
# reshape to long format: one row per (ITEM, Variable) pair
df = pd.melt(df, id_vars='ITEM', value_vars=['Variable1','Variable2','Variable3'])
# build the new column names, e.g. 'Variable1_Item1'
df['title'] = df['variable'] + '_' + df['ITEM']
# transpose and promote the title row to the column header
df = df[['title','value']].T
df.columns = df.iloc[0]
df = df[1:]
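As a small alternative for the last three lines (my own variation, same result given the melt and title columns above), set_index avoids relabelling the columns by hand:
df = df.set_index('title')[['value']].T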

How do I use pandas to add rows to a data frame based on a date column and number of days column

I would like to know how to take a start date from a dataframe column and add rows to the dataframe based on the number of days in another column, one new date per day.
Essentially, I am trying to turn this data frame:
df = pd.DataFrame({
    'Name':['Peter', 'Peter', 'Peter', 'Peter'],
    'Planned_Start':['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
    'Duration':[2, 3, 5, 6],
    'Hrs':[0.6, 1, 1.2, 0.3]})
...into this data frame:
df_2 = pd.DataFrame({
    'Name':['Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter'],
    'Date':['1/1/2019', '1/2/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/10/2019', '1/15/2019', '1/16/2019'],
    'Hrs':[0.6, 0.6, 1, 1, 1, 1.2, 0.3, 0.3]})
I'm new to programming in general and have tried the following:
df_2 = pd.DataFrame({
    'date': pd.date_range(
        start = df.Planned_Start,
        end = pd.to_timedelta(df.Duration, unit='D'),
        freq = 'D'
    )
})
... and ...
df["date"] = df.Planned_Start + timedelta(int(df.Duration))
with no luck.
I am not entirely sure what you are trying to achieve, as your df_2 looks a bit off from what I can see.
If you want to take the Duration column as a number of days and add that many dated rows to a Date column, then the code below achieves that.
You can also drop any columns you don't need afterwards with the drop() method:
import pandas as pd
from datetime import datetime, timedelta

df = pd.DataFrame({
    'Name':['Peter', 'Peter', 'Peter', 'Peter'],
    'Planned_Start':['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
    'Duration':[2, 3, 5, 6],
    'Hrs':[0.6, 1, 1.2, 0.3]})

df_new = pd.DataFrame()
for i, row in df.iterrows():
    # emit one row per day of the planned duration
    for duration in range(row.Duration):
        date = pd.Series([datetime.strptime(row.Planned_Start, '%m/%d/%Y') + timedelta(days=duration)], index=['date'])
        newrow = row.append(date)
        df_new = df_new.append(newrow, ignore_index=True)
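For larger frames, a vectorized sketch of my own (assuming the same df as above and a pandas version that has DataFrame.explode) avoids the row-by-row appends:
df['Planned_Start'] = pd.to_datetime(df['Planned_Start'], format='%m/%d/%Y')
# one list of dates per row, then explode into one row per date
df['Date'] = [list(pd.date_range(start, periods=n, freq='D'))
              for start, n in zip(df['Planned_Start'], df['Duration'])]
df_new = df.explode('Date')[['Name', 'Date', 'Hrs']].reset_index(drop=True)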
