I think this is a simple one, but I am not able to figure this out today and needed some help.
I have a pandas dataframe:
import pandas as pd
df = pd.DataFrame({
    'id': [0, 0, 1, 1, 2],
    'q.name': ['A'] * 3 + ['B'] * 2,
    'q.value': ['A1', 'A2', 'A3', 'B1', 'B2'],
    'w.name': ['Q', 'W', 'E', 'R', 'Q'],
    'w.value': ['B1', 'B2', 'C3', 'C1', 'D2']
})
that looks like this
id q.name q.value w.name w.value
0 0 A A1 Q B1
1 0 A A2 W B2
2 1 A A3 E C3
3 1 B B1 R C1
4 2 B B2 Q D2
I am looking to convert it to
   id q.name q.value w.name w.value
0   0    A A   A1 A2    Q W   B1 B2
1   1    A B   A3 B1    E R   C3 C1
2   2      B      B2      Q      D2
I tried pd.DataFrame(df.apply(lambda s: s.str.cat(sep=" "))) but that did not give me the result I wanted. I have done this before but I'm struggling to recall or find any post on SO to help me.
Update:
I should have mentioned this before: Is there a way of doing this without specifying which column? The DataFrame changes based on context.
I have also updated the dataframe and shown an id field, as I just realised that this was possible. I think now a groupby on the id field should solve this.
UPDATE:
In [117]: df.groupby('id', as_index=False).agg(' '.join)
Out[117]:
   id q.name q.value w.name w.value
0   0    A A   A1 A2    Q W   B1 B2
1   1    A B   A3 B1    E R   C3 C1
2   2      B      B2      Q      D2
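As a self-contained sketch of the groupby approach above: the example below rebuilds the frame from the question and joins every non-id column per group without naming the columns. The `astype(str)` step is an assumption added here, since `' '.join` only accepts strings and the real DataFrame may contain other dtypes.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 0, 1, 1, 2],
    'q.name': ['A'] * 3 + ['B'] * 2,
    'q.value': ['A1', 'A2', 'A3', 'B1', 'B2'],
    'w.name': ['Q', 'W', 'E', 'R', 'Q'],
    'w.value': ['B1', 'B2', 'C3', 'C1', 'D2'],
})

# Move 'id' into the index so that every remaining column is a value column,
# cast those columns to str (assumption: some may not be strings already),
# then join the values of each column within each id group.
out = (
    df.set_index('id')
      .astype(str)
      .groupby(level='id')
      .agg(' '.join)
      .reset_index()
)
print(out)
```

Because no value column is named explicitly, the same call should keep working when the set of columns changes, as long as the grouping key is still called id.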
Old answer:
In [106]: df.groupby('category', as_index=False).agg(' '.join)
Out[106]:
category name
0 A A1 A2 A3
1 B B1 B2
I have one pandas DataFrame like this:
A B C
0 A0 B0 X
1 A1 B1 Y
2 A2 B2 X
And I want to merge the above with the following DataFrames,
df_x
A D
0 A0 X0
1 A1 X1
2 A2 X2
3 A3 X3
df_y
A D
0 A0 Y0
1 A1 Y1
2 A2 Y2
3 A3 Y3
When merging, I want to select the second DataFrame based on column C: if the value in C is X, that row should be merged with df_x, and if it is Y, with df_y. So the final output would look like this:
A B C D
0 A0 B0 X X0
1 A1 B1 Y Y1
2 A2 B2 X X2
We could use methods such as (i) iterating over each row and processing it, or (ii) adding a C column to each of df_x and df_y and then merging. Obviously, iterating would not be very efficient, and the other method consumes additional space for a column of redundant data. Is there a better way to achieve this?
Try this:
import io
import pandas as pd
# Build the three example frames from whitespace-separated text
df=pd.read_csv(io.StringIO('''A B C
0 A0 B0 X
1 A1 B1 Y
2 A2 B2 X'''), sep=r'\s+', engine='python')
df_x=pd.read_csv(io.StringIO(''' A D
0 A0 X0
1 A1 X1
2 A2 X2
3 A3 X3'''), sep=r'\s+', engine='python')
df_y=pd.read_csv(io.StringIO(''' A D
0 A0 Y0
1 A1 Y1
2 A2 Y2
3 A3 Y3'''), sep=r'\s+', engine='python')
# print(df)
# print(df_x)
# print(df_y)
# Split df by the value in C, then merge each part with its own lookup frame
dfx = df[df.C == 'X']
# print(dfx)
dfy = df[df.C == 'Y']
# print(dfy)
df1 = dfx.merge(df_x, on='A')
df2 = dfy.merge(df_y, on='A')
print(df1)
print(df2)
# Stack the two merged parts and restore the original order by A
df_final = pd.concat([df1, df2]).sort_values('A')
print(df_final)
Output
A B C D
0 A0 B0 X X0
0 A1 B1 Y Y1
1 A2 B2 X X2
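For comparison, here is a minimal sketch of option (ii) from the question: tag each lookup frame with the C value it belongs to, stack them, and do a single merge on both A and C. It does spend memory on the extra C column, as the question points out, but it avoids splitting df by hand and keeps the original row order. The names `tagged` and `out` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                   'B': ['B0', 'B1', 'B2'],
                   'C': ['X', 'Y', 'X']})
df_x = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'D': ['X0', 'X1', 'X2', 'X3']})
df_y = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'D': ['Y0', 'Y1', 'Y2', 'Y3']})

# Label every lookup row with the C value that selects it, then stack.
tagged = pd.concat([df_x.assign(C='X'), df_y.assign(C='Y')],
                   ignore_index=True)

# One left merge on (A, C) picks the D value from the matching lookup frame.
out = df.merge(tagged, on=['A', 'C'], how='left')
print(out)
```

A left merge is used so that rows of df without a match are kept with NaN in D rather than dropped.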
There is no direct way of doing it; however, merge will do the job.
df_new = df.merge(df_x, 'left', ['A', 'B','C', 'D'], suffixes=('*x', '*y')).groupby(lambda x: x.split('*')[0], axis=1).last()
df_new = df.merge(df_y, 'left', ['A', 'B','C', 'D'], suffixes=('*x', '*y')).groupby(lambda x: x.split('*')[0], axis=1).last()
Try something like this. It may not be a direct answer, but you should be able to do the job once you understand the code above.
I have a pandas dataframe that looks like:
c1 c2 c3 c4 result
a b c d 1
b c d a 1
a e d b 1
g a f c 1
but I want to randomly select 50% of the rows, swap the column order in those rows, and flip the result column from 1 to 0 (as shown below):
c1 c2 c3 c4 result
a b c d 1
d a b c 0 (we swapped c3 and c4 with c1 and c2)
a e d b 1
f c g a 0 (we swapped c3 and c4 with c1 and c2)
What's the idiomatic way to accomplish this?
You had the general idea. Shuffle the DataFrame and split it in half. Then modify one half and join back.
import numpy as np
np.random.seed(410112)
dfs = np.array_split(df.sample(frac=1), 2) # Shuffle then split in 1/2
# On one half set result to 0 and swap the columns
dfs[1]['result'] = 0
dfs[1] = dfs[1].rename(columns={'c1': 'c2', 'c2': 'c1', 'c3': 'c4', 'c4': 'c3'})
# Join Back
df = pd.concat(dfs).sort_index()
c1 c2 c3 c4 result
0 a b c d 1
1 c b a d 0
2 e a b d 0
3 g a f c 1
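Note that the rename above exchanges c1 with c2 and c3 with c4. If the pairs (c1, c2) and (c3, c4) should trade places, as in the question's example, one possible variant is sketched below. It picks half of the index labels at random and assigns the reordered values back as a plain array so that pandas does not realign them by column name. The seed and the names `picked` and `out` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'c1': list('abag'), 'c2': list('bcea'),
    'c3': list('cddf'), 'c4': list('dabc'),
    'result': 1,
})

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
# Choose half of the rows (by index label) without replacement.
picked = rng.choice(df.index.to_numpy(), size=len(df) // 2, replace=False)

out = df.copy()
# Write (c3, c4, c1, c2) into (c1, c2, c3, c4) for the picked rows.
# .to_numpy() drops the labels, so values are assigned by position.
out.loc[picked, ['c1', 'c2', 'c3', 'c4']] = (
    df.loc[picked, ['c3', 'c4', 'c1', 'c2']].to_numpy()
)
out.loc[picked, 'result'] = 0
print(out)
```

Unpicked rows are left untouched, so no shuffle, split, or final sort is needed.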
I have a dataframe that looks like this:
v1 v2
0 a A1
1 b A2,A3
2 c B4
3 d A5, B6, B7
I want to modify this dataframe such that any row which has more than one value in the v2 column gets replicated for each value in v2. For example for the above dataframe, the result is as follows:
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
I was able to do this with the following code:
new_df = pd.DataFrame()
for index, row in df.iterrows():
    if len(row["v2"].split(','))>1:
        row_base = row
        for r in row["v2"].split(','):
            row_base["v2"] = r
            new_df = new_df.append(row_base, ignore_index=True)
    else:
        new_df = new_df.append(row)
However, it is extremely inefficient on a large dataframe, and I would like to learn how to do it more efficiently.
Pandas solution for 0.25+ version with Series.str.split and DataFrame.explode:
df = df.assign(v2 = df.v2.str.split(',')).explode('v2').reset_index(drop=True)
print (df)
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
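One detail worth checking: the sample data has spaces after some commas ("A5, B6, B7"), and splitting on ',' alone keeps those spaces in the pieces. A minimal sketch that strips them after exploding, assuming pandas 0.25 or later and that surrounding whitespace is never meaningful in v2:

```python
import pandas as pd

df = pd.DataFrame({
    'v1': ['a', 'b', 'c', 'd'],
    'v2': ['A1', 'A2,A3', 'B4', 'A5, B6, B7'],
})

out = (
    df.assign(v2=df['v2'].str.split(','))            # each cell becomes a list
      .explode('v2')                                 # one row per list element
      .assign(v2=lambda d: d['v2'].str.strip())      # drop stray spaces
      .reset_index(drop=True)
)
print(out)
```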
For older versions, the following works, and performance should also be better with numpy:
from itertools import chain
s = df.v2.str.split(',')
lens = s.str.len()
df = pd.DataFrame({
    'v1' : df['v1'].values.repeat(lens),
    'v2' : list(chain.from_iterable(s.values.tolist()))
})
print (df)
v1 v2
0 a A1
1 b A2
2 b A3
3 c B4
4 d A5
5 d B6
6 d B7
Consider the following data frames:
import pandas as pd
df1 = pd.DataFrame({'id': list('fghij'), 'A': ['A' + str(i) for i in range(5)]})
A id
0 A0 f
1 A1 g
2 A2 h
3 A3 i
4 A4 j
df2 = pd.DataFrame({'id': list('fg'), 'B': ['B' + str(i) for i in range(2)]})
B id
0 B0 f
1 B1 g
df3 = pd.DataFrame({'id': list('ij'), 'B': ['B' + str(i) for i in range(3, 5)]})
B id
0 B3 i
1 B4 j
I want to merge them to get
A id B
0 A0 f B0
1 A1 g B1
2 A2 h NaN
3 A3 i B3
4 A4 j B4
Inspired by this answer I tried
from functools import reduce
final = reduce(lambda l, r: pd.merge(l, r, how='outer', on='id'), [df1, df2, df3])
but unfortunately it yields
A id B_x B_y
0 A0 f B0 NaN
1 A1 g B1 NaN
2 A2 h NaN NaN
3 A3 i NaN B3
4 A4 j NaN B4
Additionally, I checked out this question but I can't adapt the solution to my problem. Also, I didn't find any options in the docs for pandas.merge to make this happen.
In my real world problem the list of data frames might be much longer and the size of the data frames might be much larger.
Is there any "pythonic" way to do this directly and without "postprocessing"? It would be perfect to have a solution that raises an exception if column B of df2 and df3 would overlap (so if there might be multiple candidates for some value in column B of the final data frame).
Consider pd.concat + groupby?
pd.concat([df1, df2, df3], axis=0).groupby('id').first().reset_index()
id A B
0 f A0 B0
1 g A1 B1
2 h A2 NaN
3 i A3 B3
4 j A4 B4
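The question also asks for an exception when df2 and df3 supply column B for the same id, which `groupby(...).first()` silently hides. One way to get that, sketched here under the assumption that all the partial frames share the same columns, is to stack them and do a single left merge with `validate='one_to_one'`, which makes pandas raise a `MergeError` (a `ValueError` subclass) if an id occurs more than once on either side. The helper name `merge_partial_columns` is hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'id': list('fghij'),
                    'A': ['A' + str(i) for i in range(5)]})
df2 = pd.DataFrame({'id': list('fg'), 'B': ['B0', 'B1']})
df3 = pd.DataFrame({'id': list('ij'), 'B': ['B3', 'B4']})


def merge_partial_columns(base, parts, key='id'):
    """Left-merge the stacked `parts` onto `base`, failing on duplicate keys."""
    stacked = pd.concat(parts, ignore_index=True)
    # validate='one_to_one' checks that `key` is unique in both frames.
    return base.merge(stacked, on=key, how='left', validate='one_to_one')


final = merge_partial_columns(df1, [df2, df3])
print(final)
```

Passing overlapping frames, for example `merge_partial_columns(df1, [df2, df2])`, should raise instead of returning a frame.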
I have a Pandas DataFrame, say df, which is 1099 rows by 33 columns. I need the original file to be processed by another piece of software, but it is not in the proper format. This is why I'm trying to get it into the right format with pandas.
The problem is very simple: df consists of identifier columns (7 in the real case, only 3 in the following example), followed by the corresponding results by month. To be clear, it looks like
A B C date1result date2result date3result
a1 b1 c1 12 15 17
a2 b2 c3 5 8 3
But to be processed, I would need it to have one line per result, adding a column for the date. In the given example, it would be
A B C result date
a1 b1 c1 12 date1
a1 b1 c1 15 date2
a1 b1 c1 17 date3
a2 b2 c3 5 date1
a2 b2 c3 8 date2
a2 b2 c3 3 date3
To be more precise, I manually edited all the date column names. After read_excel they looked like '01/01/2015 0:00:00' or something similar, and I was unable to access them. (As a secondary question, does anyone know how to access columns imported from a date field in an .xlsx?) The date column names are now 2015_01, 2015_02, ..., 2015_12, 2016_01, ..., 2016_12, and the first 5 columns are 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep'. So I tried the following code:
core = df.loc[:,('Account','Customer Name','Postcode','segment','Rep')]
df_final=pd.Series([])
for year in [2015,2016]:
    for month in range(1, 13):
        label = "%i_%02i" % (year,month)
        date = []
        for i in range(core.shape[0]):
            date.append("01/%02i/%i"%(month,year))
        df_date=pd.Series(date) #I don't know to create this 1xn df
        df_final = df_final.append(pd.concat([core, df[label], df_date], axis=1))
That roughly works, but it is very unclean: I get a (26376, 30) shaped df_final, the first column being the dates, then the results (but of course with '2015_01' as the column name), then all of '2015_02' through '2016_12' filled with NaN, and finally my 'Account', 'Customer Name', 'Postcode', 'segment' and 'Rep' columns. Does anyone know how I could do such a "slicing + stacking" in a clean way?
Thank you very much.
Edit: it is roughly the reverse of this question: Stacking and shaping slices of DataFrame (pandas) without looping
I think you need melt:
df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
print (df)
A B C date result
0 a1 b1 c1 date1result 12
1 a2 b2 c3 date1result 5
2 a1 b1 c1 date2result 15
3 a2 b2 c3 date2result 8
4 a1 b1 c1 date3result 17
5 a2 b2 c3 date3result 3
And then convert to_datetime:
print (df)
A B C 2015_01 2016_10 2016_12
0 a1 b1 c1 12 15 17
1 a2 b2 c3 5 8 3
df = pd.melt(df, id_vars=['A','B','C'], value_name='result', var_name='date')
df.date = pd.to_datetime(df.date, format='%Y_%m')
print (df)
A B C date result
0 a1 b1 c1 2015-01-01 12
1 a2 b2 c3 2015-01-01 5
2 a1 b1 c1 2016-10-01 15
3 a2 b2 c3 2016-10-01 8
4 a1 b1 c1 2016-12-01 17
5 a2 b2 c3 2016-12-01 3
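melt returns the rows grouped by date column, whereas the desired output in the question lists all dates for one identifier combination before moving on to the next. A short sketch of the full pipeline with a final sort is given below. The three month columns are made-up stand-ins for the real 2015_01 to 2016_12 columns, and `id_cols` would be the five identifier columns in the real data.

```python
import pandas as pd

wide = pd.DataFrame({
    'A': ['a1', 'a2'], 'B': ['b1', 'b2'], 'C': ['c1', 'c3'],
    '2015_01': [12, 5], '2015_02': [15, 8], '2015_03': [17, 3],
})

id_cols = ['A', 'B', 'C']  # every other column is treated as a month

# Wide to long: one row per (identifiers, month) pair.
long = wide.melt(id_vars=id_cols, var_name='date', value_name='result')
# Turn the 'YYYY_MM' labels into real timestamps (first day of the month).
long['date'] = pd.to_datetime(long['date'], format='%Y_%m')
# Group each identifier combination together, ordered by date.
long = long.sort_values(id_cols + ['date']).reset_index(drop=True)
print(long)
```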