I have a Pandas DataFrame, say df, which is 1099 rows by 33 columns. I need the original file to be processed by another piece of software, but it is not in the proper format, which is why I'm trying to get it into the right format with pandas.
The problem is very simple: df consists of identifier columns (7 in the real case, only 3 in the following example), followed by the corresponding results per month. To be clear, it looks like
A B C date1result date2result date3result
a1 b1 c1 12 15 17
a2 b2 c3 5 8 3
But to be processed, it would need one line per result, with an added column for the date. In the given example, it would be
A B C result date
a1 b1 c1 12 date1
a1 b1 c1 15 date2
a1 b1 c1 17 date3
a2 b2 c3 5 date1
a2 b2 c3 8 date2
a2 b2 c3 3 date3
So to be more precise, I have manually edited all the date column names (after read_excel, they looked like '01/01/2015 0:00:00' or something like that, and I was unable to access them... As a secondary question, does anyone know how to access columns imported from a date field in an .xlsx?), so that the date column names are now 2015_01, 2015_02, ..., 2015_12, 2016_01, ..., 2016_12, the first 5 being 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep'. So I tried the following code:
core = df.loc[:, ('Account', 'Customer Name', 'Postcode', 'segment', 'Rep')]
df_final = pd.Series([])
for year in [2015, 2016]:
    for month in range(1, 13):
        label = "%i_%02i" % (year, month)
        date = []
        for i in range(core.shape[0]):
            date.append("01/%02i/%i" % (month, year))
        df_date = pd.Series(date)  # I don't know how to create this 1xn df
        df_final = df_final.append(pd.concat([core, df[label], df_date], axis=1))
That works roughly, but it is very unclean: I get a (26376, 30)-shaped df_final, the first column being the dates, then the results, but of course under the column name '2015_01', then all of '2015_02' through '2016_12' filled with NaN, and at last my 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep' columns. Does anyone know how I could do such a "slicing + stacking" in a clean way?
Thank you very much.
Edit: it is roughly the reverse of this question: Stacking and shaping slices of DataFrame (pandas) without looping
I think you need melt:
df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
print (df)
A B C date result
0 a1 b1 c1 date1result 12
1 a2 b2 c3 date1result 5
2 a1 b1 c1 date2result 15
3 a2 b2 c3 date2result 8
4 a1 b1 c1 date3result 17
5 a2 b2 c3 date3result 3
And then convert the date labels with to_datetime. If the columns look like the real ones (2015_01, 2016_10, 2016_12):
print (df)
A B C 2015_01 2016_10 2016_12
0 a1 b1 c1 12 15 17
1 a2 b2 c3 5 8 3
df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
df.date = pd.to_datetime(df.date, format='%Y_%m')
print (df)
A B C date result
0 a1 b1 c1 2015-01-01 12
1 a2 b2 c3 2015-01-01 5
2 a1 b1 c1 2016-10-01 15
3 a2 b2 c3 2016-10-01 8
4 a1 b1 c1 2016-12-01 17
5 a2 b2 c3 2016-12-01 3
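Applied to the frame from the question, a sketch under the assumption that the five identifier columns are named exactly as listed there and the remaining columns are 2015_01 ... 2016_12:
id_cols = ['Account', 'Customer Name', 'Postcode', 'segment', 'Rep']
df_long = pd.melt(df, id_vars=id_cols, value_name='result', var_name='date')
df_long['date'] = pd.to_datetime(df_long['date'], format='%Y_%m')
As for the secondary question: assuming read_excel returns the date headers as datetime/Timestamp objects rather than strings, one hedged way to rename them to the 2015_01 style is the following ('results.xlsx' is only a placeholder file name):
import datetime
import pandas as pd

df = pd.read_excel('results.xlsx')  # placeholder file name
# headers read from Excel date cells come back as datetime-like objects,
# so format them into plain '2015_01'-style strings
df.columns = [c.strftime('%Y_%m') if isinstance(c, datetime.datetime) else c
              for c in df.columns]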
How can I drop data from one level of a multi-level indexed data frame, based on aggregated information I get from a column within a groupby on that level?
For example, with data frame dfmi:
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([['A0','A1','A2'], ['B0','B1','B2']], names=["index_1", "index_2"])
columns = ['foo', 'bar']
dfmi = pd.DataFrame(np.arange(18).reshape((len(midx), len(columns))),
                    index=midx, columns=columns)
dfmi
foo bar
index_1 index_2
A0 B0 0 1
B1 2 3
B2 4 5
A1 B0 6 7
B1 8 9
B2 10 11
A2 B0 12 13
B1 14 15
B2 16 17
Let's say I only want to keep levels of index_1 if the mean for foo exceeds a certain threshold.
Like this:
thresh = 5
for grp, data in dfmi.groupby("index_1"):
    print(data.foo.mean() > thresh)
False <-- drop this level
True
True
Desired output:
foo bar
index_1 index_2
A1 B0 6 7
B1 8 9
B2 10 11
A2 B0 12 13
B1 14 15
B2 16 17
In this toy example I can get what I want with dfmi.loc[pd.IndexSlice["A1":"A2", :]]. But I can't figure out how to use IndexSlice or loc variants to do aggregations inside a grouped MultiIndex and then slice the full data frame based on the results.
My best solution so far is to just keep track of the level values that qualify as keepers (with grp), and then use the accumulated keepers collection with IndexSlice:
keepers = list()
for grp, data in dfmi.groupby("index_1"):
    if data.foo.mean() > thresh:
        keepers.append(grp)
dfmi.loc[pd.IndexSlice[keepers, :]]
I'm looking for a more efficient and/or more elegant way to accomplish that with native Pandas functionality.
You can use loc once you have created your mask, like so:
mask = dfmi.groupby(level=0)['foo'].mean()>thresh
dfmi.loc[mask.index[mask]]
Yields:
foo bar
index_1 index_2
A1 B0 6 7
B1 8 9
B2 10 11
A2 B0 12 13
B1 14 15
B2 16 17
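An equivalent route, not from the original answer but a sketch of the same idea, is groupby(...).filter(), which keeps every row of each group for which the callable returns True:
dfmi.groupby(level="index_1").filter(lambda g: g["foo"].mean() > thresh)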
I have a dataframe that looks like this:
v1 v2
0 a A1
1 b A2,A3
2 c B4
3 d A5, B6, B7
I want to modify this dataframe such that any row which has more than one value in the v2 column gets replicated for each value in v2. For example for the above dataframe, the result is as follows:
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
I was able to do this with the following code:
new_df = pd.DataFrame()
for index, row in df.iterrows():
    if len(row["v2"].split(',')) > 1:
        row_base = row
        for r in row["v2"].split(','):
            row_base["v2"] = r
            new_df = new_df.append(row_base, ignore_index=True)
    else:
        new_df = new_df.append(row)
However, it is extremely inefficient on a large dataframe and I would like to learn how to do it more efficiently.
Pandas solution for version 0.25+ with Series.str.split and DataFrame.explode:
df = df.assign(v2 = df.v2.str.split(',')).explode('v2').reset_index(drop=True)
print (df)
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
For older versions, and also for better performance, use numpy:
from itertools import chain

s = df.v2.str.split(',')
lens = s.str.len()
df = pd.DataFrame({
    'v1': df['v1'].values.repeat(lens),
    'v2': list(chain.from_iterable(s.values.tolist()))
})
print (df)
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
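One small caveat, my own note rather than part of either approach above: the sample data has a space after some commas ('A5, B6, B7'), so splitting on ',' alone would leave a leading space in values like ' B6'. Splitting on a small regex avoids that, e.g.:
df = df.assign(v2=df.v2.str.split(r'\s*,\s*')).explode('v2').reset_index(drop=True)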
What could be a way to compare multiple groupby outputs?
I have multiple groupby outputs from different dataframes, like below
>>> tmp1
account place balance type
0 A A1 10 B1
1 A A1 20 B1
2 A A1 30 B1
3 A A1 10 B4
4 A A1 20 B4
5 A A1 10 B5
6 A A1 10 B6
7 B A2 10 B7
8 B A2 20 B1
9 B A2 100 B1
I do
>>>tmp1.groupby(['account','place','type'])['balance'].last().sum(level=0).astype(int)
account
A 70
B 110
Name: balance, dtype: int64
Similarly
>>> tmp2
account place balance type
0 A A1 100 B1
1 A A1 200 B1
2 A A1 100 B1
3 A A1 100 B4
4 A A1 200 B4
5 A A1 100 B5
6 A A1 100 B6
7 B A2 100 B7
8 B A2 200 B1
9 B A2 200 B1
>>>tmp2.groupby(['account','place','type'])['balance'].last().sum(level=0).astype(int)
account
A 500
B 300
Name: balance, dtype: int64
#similarly tmp3 grouped..and so on
Is there a way to find the df with the maximum sum of balance? E.g. in this case tmp2 has the greater sum (70+110 < 500+300).
My try:
One of the ways I tried was taking the sum and maintaining a list, like below
mylist=[]
mylist.append(tmp1.groupby(['account','place','type'])['balance'].last().sum().astype(int))
mylist.append(tmp2.groupby(['account','place','type'])['balance'].last().sum().astype(int))
>>> mylist
[180,800]
Now I can take the max from the list, but I lose the account information (800 is the max, but I need the info on account A having 500 and B having 300).
I tried
>>>tmp2.groupby(['account','place','type'])['balance'].last().sum(level=0).to_dict()
{'A': 500, 'B': 300}
So for every df I have a dict; I just need to find the maximum among such dicts (I think I have come very close to solving it).
I intend to find which dataframe has the maximum sum (along with the per-account values).
If I understand you correctly, this also works in case you have more than 2 dfs:
import numpy as np
import pandas as pd

tmp1 = pd.DataFrame([{'acount':'A', 'balance':100, 'type':'A1'},
                     {'acount':'A', 'balance':200, 'type':'A2'},
                     {'acount':'B', 'balance':200, 'type':'B1'},
                     {'acount':'B', 'balance':300, 'type':'B2'}])
tmp2 = pd.DataFrame([{'acount':'A', 'balance':100, 'type':'A1'},
                     {'acount':'A', 'balance':200, 'type':'A2'},
                     {'acount':'B', 'balance':400, 'type':'B1'},
                     {'acount':'B', 'balance':300, 'type':'B2'}])
tmplist = [tmp1, tmp2]
tmprlist = [tmp.groupby(['acount','type']).last().sum(level=0).astype(int) for tmp in tmplist]
tmpslist = [tmp.groupby(['acount','type'])['balance'].last().sum() for tmp in tmplist]
tmprlist[np.argmax(tmpslist)]
Result:
acount balance
A 300
B 700
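If you also need to know which dataframe wins, not just its per-account sums, one possible sketch (my own variant; it uses groupby(level=0).sum() in place of the older sum(level=0)):
results = {name: t.groupby(['acount', 'type'])['balance'].last().groupby(level=0).sum()
           for name, t in [('tmp1', tmp1), ('tmp2', tmp2)]}
best = max(results, key=lambda name: results[name].sum())
print(best, results[best].to_dict())  # e.g. 'tmp2' {'A': 300, 'B': 700}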
I have the following dataframe:
Input:
ID month Name
A1 2017.01 A
A1 2017.02 B
A1 2017.04 C
A2 2017.02 A
A2 2017.03 D
A2 2017.05 C
Output:
ID month Name
A1 2017.01 A
A1 2017.02 B
A1 2017.03 B
A1 2017.04 C
A2 2017.02 A
A2 2017.03 D
A2 2017.04 D
A2 2017.05 C
I need to fill in the missing months in the sequence, taking the value from the preceding month that is present in the input. Consider the example of ID "A1": "A1" has months 1, 2 and 4, and is missing month 3, so I need to add a row with "A1" as ID, "2017.03" as month and "B" as Name. Please note the "Name" column should get its value from the row immediately above it that is present in the input.
How do I achieve this in pandas, or by any other method in python.
Any help is appreciated!
Thanks
Let's try this with @EFT's suggestion:
df['Date'] = pd.to_datetime(df.month.astype(str),format='%Y.%m')
df_out = df.set_index('Date').groupby('ID').resample('MS').asfreq().ffill().reset_index(level=0, drop=True)
df_out = df_out.reset_index()
df_out['month'] = df_out.Date.dt.strftime('%Y.%m')
df_out = df_out.drop('Date',axis=1)
print(df_out)
Output:
ID month Name
0 A1 2017.01 A
1 A1 2017.02 B
2 A1 2017.03 B
3 A1 2017.04 C
4 A2 2017.02 A
5 A2 2017.03 D
6 A2 2017.04 D
7 A2 2017.05 C
There was a question in the comments about how df knows which column to ffill, so I decided to go through it and post it here; maybe someone finds it useful (or I can use it as a reference for myself):
mytest = pd.DataFrame({'ID': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2'], 'month': ['2017.01', '2017.02', '2017.04', '2017.02', '2017.03', '2017.05'], 'Name':['A','B','C','A','D','C']})
mytest.month = pd.to_datetime(mytest.month)
mytest=mytest.set_index('month').groupby(['ID'])
mytest = mytest.resample('MS').asfreq()['Name']
mytest = pd.DataFrame(pd.DataFrame(mytest).to_records())
mytest.Name = mytest.Name.ffill()
mytest
This obviously outputs a very similar thing; I just have not formatted the months back to the original format.
ID month Name
0 A1 2017-01-01 A
1 A1 2017-02-01 B
2 A1 2017-03-01 B
3 A1 2017-04-01 C
4 A2 2017-02-01 A
5 A2 2017-03-01 D
6 A2 2017-04-01 D
7 A2 2017-05-01 C
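To put the month column of this second version back into the original '2017.01' style, the same strftime step from the first walkthrough applies (a one-line sketch):
mytest.month = mytest.month.dt.strftime('%Y.%m')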
I have the following pandas DataFrame:
df
A B
1 b0
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
The first row which starts with a is
df[df.B.str.startswith("a")]
A B
2 a0
I would like to extract the first row in column B that starts with a and every row after. My desired result is below
A B
2 a0
3 c0
5 c1
6 a1
7 b1
8 b2
How can this be done?
One option is to create a mask and use it for selection:
mask = df.B.str.startswith("a")
mask[~mask] = np.nan
df[mask.fillna(method='ffill').fillna(0).astype(int) == 1]
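A slightly shorter variant of that first idea (my own sketch, not part of the original answer): cummax on the boolean mask stays True from the first match onward, so the NaN/ffill steps are not needed:
df[df.B.str.startswith("a").cummax()]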
Another option is to build an index range:
first = df[df.B.str.startswith("a")].index[0]
df.ix[first:]
The latter approach assumes that an "a" is always present. (In newer pandas versions, df.loc[first:] does the same job as the deprecated df.ix.)
Using idxmax to find the first True:
df.loc[df.B.str[0].eq('a').idxmax():]
A B
1 2 a0
2 3 c0
3 5 c1
4 6 a1
5 7 b1
6 8 b2
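One caveat worth flagging (my note, not the answer's): if no value in B starts with "a", idxmax simply returns the first label, so the slice would keep the whole frame. A guarded sketch:
starts = df.B.str[0].eq('a')
result = df.loc[starts.idxmax():] if starts.any() else df.iloc[0:0]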
If I understand your question correctly, here is how you do it:
df = pd.DataFrame(data={'A': [1, 2, 3, 5, 6, 7, 8],
                        'B': ['b0', 'a0', 'c0', 'c1', 'a1', 'b1', 'b2']})
# index of the item beginning with a
index = df[df.B.str.startswith("a")].values.tolist()[0][0]
desired_df = pd.concat([df.A[index-1:], df.B[index-1:]], axis=1)
print(desired_df)
and you get: