Insert missing month rows in a dataframe in Python

I have the following dataframe:
Input:
ID month Name
A1 2017.01 A
A1 2017.02 B
A1 2017.04 C
A2 2017.02 A
A2 2017.03 D
A2 2017.05 C
Output:
ID month Name
A1 2017.01 A
A1 2017.02 B
A1 2017.03 B
A1 2017.04 C
A2 2017.02 A
A2 2017.03 D
A2 2017.04 D
A2 2017.05 C
I need to fill in the missing months in each sequence, taking the value from the nearest preceding month that is present in the input. Consider the example of ID "A1". "A1" has months 1, 2 and 4, so month 3 is missing. I need to add a row with ID "A1", month "2017.03" and Name "B". Please note the "Name" column should get its value from the row immediately above it that is present in the input.
How do I achieve this in pandas, or by any other method in Python?
Any help is appreciated!
Thanks
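For reference, here is a minimal reconstruction of the input dataframe used in the answers below (assuming the month column holds strings like '2017.01'):
import pandas as pd

# Hypothetical reconstruction of the input shown above
df = pd.DataFrame({
    'ID': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2'],
    'month': ['2017.01', '2017.02', '2017.04', '2017.02', '2017.03', '2017.05'],
    'Name': ['A', 'B', 'C', 'A', 'D', 'C'],
})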

Let's try this with @EFT's suggestion:
# Parse the '2017.01'-style strings into real datetimes
df['Date'] = pd.to_datetime(df.month.astype(str), format='%Y.%m')
# Per ID, resample to month-start ('MS') frequency and forward-fill the new rows
df_out = df.set_index('Date').groupby('ID').resample('MS').asfreq().ffill().reset_index(level=0, drop=True)
df_out = df_out.reset_index()
# Format the dates back to the original 'YYYY.MM' strings
df_out['month'] = df_out.Date.dt.strftime('%Y.%m')
df_out = df_out.drop('Date', axis=1)
print(df_out)
Output:
ID month Name
0 A1 2017.01 A
1 A1 2017.02 B
2 A1 2017.03 B
3 A1 2017.04 C
4 A2 2017.02 A
5 A2 2017.03 D
6 A2 2017.04 D
7 A2 2017.05 C

There was a question in the comments about how df knows which column to ffill, so I decided to go through it step by step and post it here; maybe someone finds it useful (or I'll use it as a reference myself):
mytest = pd.DataFrame({
    'ID': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2'],
    'month': ['2017.01', '2017.02', '2017.04', '2017.02', '2017.03', '2017.05'],
    'Name': ['A', 'B', 'C', 'A', 'D', 'C'],
})
mytest.month = pd.to_datetime(mytest.month, format='%Y.%m')  # explicit format avoids ambiguous parsing
mytest = mytest.set_index('month').groupby(['ID'])
mytest = mytest.resample('MS').asfreq()['Name']  # resampling inserts NaN rows for the missing months
mytest = pd.DataFrame(pd.DataFrame(mytest).to_records())  # flatten the (ID, month) MultiIndex back into columns
mytest.Name = mytest.Name.ffill()  # so ffill is applied explicitly to the Name column only
mytest
This outputs essentially the same thing; I just have not formatted the months back to the original format.
ID month Name
0 A1 2017-01-01 A
1 A1 2017-02-01 B
2 A1 2017-03-01 B
3 A1 2017-04-01 C
4 A2 2017-02-01 A
5 A2 2017-03-01 D
6 A2 2017-04-01 D
7 A2 2017-05-01 C

Related

Extract TLDs, SLDs from a dataframe column into new columns

I am trying to extract the top-level domain (TLD), second-level domain (SLD), etc. from a column in a dataframe and add them as new columns. Currently I have a solution where I convert the column to lists and expand them with tolist, but since this appends sequentially, the columns do not line up correctly: if a URL has three levels, the mapping gets messed up.
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":["xyz[.]com","abc123[.]pro","xyzabc[.]gouv[.]fr"]})
df['C'] = df.C.apply(lambda x: x.split('[.]'))
df.head()
A B C
0 1 2 [xyz, com]
1 2 3 [abc123, pro]
2 3 4 [xyzabc, gouv, fr]
d = [pd.DataFrame(df[col].tolist()).add_prefix(col) for col in df.columns]
df = pd.concat(d, axis=1)
df.head()
A0 B0 C0 C1 C2
0 1 2 xyz com None
1 2 3 abc123 pro None
2 3 4 xyzabc gouv fr
I want C2 to always contain the TLD (com,pro,fr) and C1 to always contain SLD
I am sure there is a better way to do this correctly and would appreciate any pointers.
You can shift the Cx columns:
df.loc[:, "C0":] = df.loc[:, "C0":].apply(
lambda x: x.shift(periods=x.isna().sum()), axis=1
)
print(df)
Prints:
A0 B0 C0 C1 C2
0 1 2 NaN xyz com
1 2 3 NaN abc123 pro
2 3 4 xyzabc gouv fr
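To see what the row-wise shift is doing, here is a minimal sketch for a single (hypothetical) row: axis=1 hands each row to the lambda as a Series, and shifting it right by its NaN count pushes the domain parts flush against C2.
row = pd.Series(['xyz', 'com', None], index=['C0', 'C1', 'C2'])
print(row.shift(periods=row.isna().sum()))
# C0    NaN
# C1    xyz
# C2    com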
You can also use a regex with a negative lookahead and pandas' built-in str.split with expand=True:
df[['C0', 'C2']] = df.C.str.split(r'\[\.\](?!.*\[\.\])', expand=True)  # split on the last [.] only
df[['C0', 'C1']] = df.C0.str.split(r'\[\.\]', expand=True)
that gives
A B C C0 C2 C1
0 1 2 xyz[.]com xyz com None
1 2 3 abc123[.]pro abc123 pro None
2 3 4 xyzabc[.]gouv[.]fr xyzabc fr gouv
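Another hedged sketch of the same idea: split, then left-pad the shorter lists with None before expanding, so the TLD always lands in the last column (assumes pandas >= 1.4 for the regex= keyword of str.split):
parts = df.C.str.split(r'\[\.\]', regex=True)
width = parts.str.len().max()
# Left-pad shorter lists so every list has the same length
padded = parts.apply(lambda lst: [None] * (width - len(lst)) + lst)
expanded = pd.DataFrame(padded.tolist(), index=df.index).add_prefix('C')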

How can I efficiently replicate a pandas row, changing only one column?

I have a dataframe that looks like this:
v1 v2
0 a A1
1 b A2,A3
2 c B4
3 d A5, B6, B7
I want to modify this dataframe such that any row which has more than one value in the v2 column gets replicated for each value in v2. For example for the above dataframe, the result is as follows:
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
I was able to do this with the following code:
new_df = pd.DataFrame()
for index, row in df.iterrows():
    if len(row["v2"].split(',')) > 1:
        row_base = row
        for r in row["v2"].split(','):
            row_base["v2"] = r
            new_df = new_df.append(row_base, ignore_index=True)
    else:
        new_df = new_df.append(row)
however, it is extremely inefficient on a large dataframe and I would like to learn how to do it more efficiently.
Pandas solution for versions 0.25+ with Series.str.split and DataFrame.explode:
df = df.assign(v2 = df.v2.str.split(',')).explode('v2').reset_index(drop=True)
print (df)
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
For older versions, and also for better performance, use numpy:
from itertools import chain

s = df.v2.str.split(',')
lens = s.str.len()
df = pd.DataFrame({
    'v1': df['v1'].values.repeat(lens),
    'v2': list(chain.from_iterable(s.values.tolist()))
})
print (df)
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
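One caveat with either approach: if the values contain spaces after the commas (as in 'A5, B6, B7' above), the split pieces keep a leading space. A small follow-up cleans that up, assuming v2 ends up as plain strings:
df['v2'] = df['v2'].str.strip()  # drop whitespace left over from splitting on ','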

pandas.merge with coinciding column names

Consider the following data frames:
import pandas as pd
df1 = pd.DataFrame({'id': list('fghij'), 'A': ['A' + str(i) for i in range(5)]})
A id
0 A0 f
1 A1 g
2 A2 h
3 A3 i
4 A4 j
df2 = pd.DataFrame({'id': list('fg'), 'B': ['B' + str(i) for i in range(2)]})
B id
0 B0 f
1 B1 g
df3 = pd.DataFrame({'id': list('ij'), 'B': ['B' + str(i) for i in range(3, 5)]})
B id
0 B3 i
1 B4 j
I want to merge them to get
A id B
0 A0 f B0
1 A1 g B1
2 A2 h NaN
3 A3 i B3
4 A4 j B4
Inspired by this answer I tried
from functools import reduce

final = reduce(lambda l, r: pd.merge(l, r, how='outer', on='id'), [df1, df2, df3])
but unfortunately it yields
A id B_x B_y
0 A0 f B0 NaN
1 A1 g B1 NaN
2 A2 h NaN NaN
3 A3 i NaN B3
4 A4 j NaN B4
Additionally, I checked out this question but I can't adapt the solution to my problem. Also, I didn't find any options in the docs for pandas.merge to make this happen.
In my real world problem the list of data frames might be much longer and the size of the data frames might be much larger.
Is there any "pythonic" way to do this directly and without "postprocessing"? It would be perfect to have a solution that raises an exception if column B of df2 and df3 would overlap (so if there might be multiple candidates for some value in column B of the final data frame).
Consider pd.concat + groupby?
pd.concat([df1, df2, df3], axis=0).groupby('id').first().reset_index()
id A B
0 f A0 B0
1 g A1 B1
2 h A2 NaN
3 i A3 B3
4 j A4 B4
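As for raising an exception when the B columns of df2 and df3 overlap, here is a minimal sketch on top of the same concat (count() ignores NaN, so an id with more than one non-null B across the frames signals a conflict):
combined = pd.concat([df1, df2, df3], axis=0)
conflicts = combined.groupby('id')['B'].count()  # counts non-null B values per id
if (conflicts > 1).any():
    raise ValueError('overlapping B values for ids: %s' % list(conflicts[conflicts > 1].index))
final = combined.groupby('id').first().reset_index()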

Combining rows of a dataframe with string columns

I think this is a simple one, but I am not able to figure it out today and need some help.
I have a pandas dataframe:
df = pd.DataFrame({
    'id': [0, 0, 1, 1, 2],
    'q.name': ['A'] * 3 + ['B'] * 2,
    'q.value': ['A1', 'A2', 'A3', 'B1', 'B2'],
    'w.name': ['Q', 'W', 'E', 'R', 'Q'],
    'w.value': ['B1', 'B2', 'C3', 'C1', 'D2']
})
that looks like this
id q.name q.value w.name w.value
0 0 A A1 Q B1
1 0 A A2 W B2
2 1 A A3 E C3
3 1 B B1 R C1
4 2 B B2 Q D2
I am looking to convert it to
id q.name q.value w.name w.value
0 0 A A A1 A2 Q W B1 B2
1 1 A B A3 B1 E R C3 C1
2 2 B B2 Q D2
I tried pd.DataFrame(df.apply(lambda s: s.str.cat(sep=' '))) but that did not give me the result I wanted. I have done this before, but I'm struggling to recall it or find any post on SO to help me.
Update:
I should have mentioned this before: Is there a way of doing this without specifying which column? The DataFrame changes based on context.
I have also updated the dataframe and shown an id field, as I just realised that this was possible. I think now a groupby on the id field should solve this.
UPDATE:
In [117]: df.groupby('id', as_index=False).agg(' '.join)
Out[117]:
id q.name q.value w.name w.value
0 0 A A A1 A2 Q W B1 B2
1 1 A B A3 B1 E R C3 C1
2 2 B B2 Q D2
Old answer:
In [106]: df.groupby('category', as_index=False).agg(' '.join)
Out[106]:
category name
0 A A1 A2 A3
1 B B1 B2
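Note that with either version, ' '.join raises a TypeError as soon as a non-key column holds non-string values; a hedged variant casts to str first:
df.groupby('id', as_index=False).agg(lambda s: ' '.join(s.astype(str)))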

Clean way of slicing + stacking pandas dataframe

I have a Pandas DataFrame, say df, which is 1099 rows by 33 columns. I need the original file to be processed by another piece of software, but it is not in the proper format. This is why I'm trying to get it into the right format with pandas.
The problem is very simple: df is constituted by columns of identifiers (7 in the real case, only 3 in the following example), and then by corresponding results by months. To be clear, it's like
A B C date1result date2result date3result
a1 b1 c1 12 15 17
a2 b2 c3 5 8 3
But to be processed, I would need it to have one line per result, adding a column for the date. In the given example, it would be
A B C result date
a1 b1 c1 12 date1
a1 b1 c1 15 date2
a1 b1 c1 17 date3
a2 b2 c3 5 date1
a2 b2 c3 8 date2
a2 b2 c3 3 date3
So to be more precise: I have manually edited all the date column names (after read_excel, they looked like '01/01/2015 0:00:00' or something like that, and I was unable to access them... As a secondary question, does anyone know how to access columns imported from a date field in an .xlsx?), so that the date column names are now 2015_01, 2015_02, ..., 2015_12, 2016_01, ..., 2016_12, the first 5 being 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep'. So I tried the following code:
core = df.loc[:, ('Account', 'Customer Name', 'Postcode', 'segment', 'Rep')]
df_final = pd.Series([])
for year in [2015, 2016]:
    for month in range(1, 13):
        label = "%i_%02i" % (year, month)
        date = []
        for i in range(core.shape[0]):
            date.append("01/%02i/%i" % (month, year))
        df_date = pd.Series(date)  # I don't know how to create this 1xn df
        df_final = df_final.append(pd.concat([core, df[label], df_date], axis=1))
That works roughly, but it is very unclean: I get a (26376, 30)-shaped df_final, the first column being the dates, then the results (but of course with '2015_01' as the column name, and all of '2015_02' through '2016_12' filled with NaN), and finally my 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep' columns. Does anyone know how I could do such a "slicing + stacking" in a clean way?
Thank you very much.
Edit: it is roughly the reverse of this question: Stacking and shaping slices of DataFrame (pandas) without looping
I think you need melt:
df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
print (df)
A B C date result
0 a1 b1 c1 date1result 12
1 a2 b2 c3 date1result 5
2 a1 b1 c1 date2result 15
3 a2 b2 c3 date2result 8
4 a1 b1 c1 date3result 17
5 a2 b2 c3 date3result 3
And then convert the date column with to_datetime:
print (df)
A B C 2015_01 2016_10 2016_12
0 a1 b1 c1 12 15 17
1 a2 b2 c3 5 8 3
df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
df.date = pd.to_datetime(df.date, format='%Y_%m')
print (df)
A B C date result
0 a1 b1 c1 2015-01-01 12
1 a2 b2 c3 2015-01-01 5
2 a1 b1 c1 2016-10-01 15
3 a2 b2 c3 2016-10-01 8
4 a1 b1 c1 2016-12-01 17
5 a2 b2 c3 2016-12-01 3
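On the secondary question about columns imported from a date field in an .xlsx: read_excel usually gives Timestamp (datetime-like) headers rather than strings, so one hedged way to avoid the manual renaming is to format them right after loading:
# Rename any datetime-like header (anything with a strftime method) to 'YYYY_MM'
df = df.rename(columns=lambda c: c.strftime('%Y_%m') if hasattr(c, 'strftime') else c)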
