I have 2 dataframes:
DF1:
Count
0 98.0
1 176.0
2 260.5
3 389.0
I need to assign these values to a column in another dataframe, DF2, on every 3rd row, starting from the 3rd row (index 2).
The Output of DF2 should look like this:
Count
0
1
2 98.0
3
4
5 176.0
6
7
8 260.5
9
10
11 389.0
I am doing
DF2.loc[2::3,'Count'] = DF1['Count']
But I am not getting the expected results.
Use .values
Otherwise, pandas tries to align the index values from DF1, and that alignment is what messes you up.
DF2.loc[2::3, 'Count'] = DF1['Count'].values
DF2
Count
0 NaN
1 NaN
2 98.0
3 NaN
4 NaN
5 176.0
6 NaN
7 NaN
8 260.5
9 NaN
10 NaN
11 389.0
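For a self-contained illustration (this DF2 is a hypothetical 12-row frame built just for the sketch, since the question doesn't show how DF2 was created), the failure mode and the fix look like this:
import numpy as np
import pandas as pd

DF1 = pd.DataFrame({'Count': [98.0, 176.0, 260.5, 389.0]})
DF2 = pd.DataFrame({'Count': np.full(12, np.nan)})

# Without .values, pandas aligns on index labels: the target rows are
# labelled 2, 5, 8, 11, but DF1's index is 0-3, so only label 2 matches
# (picking up DF1.loc[2] == 260.5) and the other rows stay NaN.
DF2.loc[2::3, 'Count'] = DF1['Count'].values  # positional, no alignment
print(DF2)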
Alternatively, build the result directly from DF1:
DF1.set_index(DF1.index * 3 + 2).reindex(range(len(DF1) * 3))
Count
0 NaN
1 NaN
2 98.0
3 NaN
4 NaN
5 176.0
6 NaN
7 NaN
8 260.5
9 NaN
10 NaN
11 389.0
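Broken into steps (assuming DF1 has the default RangeIndex 0..n-1), the one-liner does this:
# Move each DF1 row to the position it should occupy in the output:
spaced = DF1.set_index(DF1.index * 3 + 2)   # index becomes 2, 5, 8, 11
# Reinsert the missing positions 0..11, filled with NaN:
out = spaced.reindex(range(len(DF1) * 3))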
Related
I am looking for a method to create an array of numbers that labels groups, based on the values in the 'number' column, if that's possible.
With this abbreviated example DF:
import numpy as np
import pandas as pd

number = [np.nan, np.nan, 1, np.nan, np.nan, np.nan, 2, np.nan, np.nan, 3,
          np.nan, np.nan, np.nan, np.nan, np.nan, 4, np.nan, np.nan]
df = pd.DataFrame({'number': number})
Ideally I would like to make a new column, 'group', based on the ints in column 'number', so there would effectively be runs of 1s, 2s, 3s, etc. FWIW, the DF is thousands of lines long, with sporadically placed ints.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!
You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB: if you had zeros instead of NaNs, use df['group'] = df['number'].ne(0).cumsum().
output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
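To see why this works, look at the intermediate result: notna marks the rows that start a new group, and cumsum turns those True/False flags into a running counter:
df['number'].notna()            # True exactly where a group starts
df['number'].notna().cumsum()   # running count of starts = the group label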
You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64
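If you want integer labels rather than floats, a small follow-up (safe here because fillna(0) leaves no NaNs behind) is:
df['group'] = df['number'].ffill().fillna(0).astype(int)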
This question already has answers here: Pandas Merging 101 (8 answers). Closed 1 year ago.
I have three pandas dataframes, df1, df2 and df3, which look as follows:
df1 =
    X   Y  M1_x  M2_x  M3_x
   12  33     3     4     2
   12  54     0     3     4
   23  12     0     8     3
df2 =
    X   Y  M1_y  M2_y  M3_y
   12  33     9     4     1
   12  54     0     3     5
   12  11     0     2     1
df3 =
    X   Y  M1_z  M2_z  M3_z
   12  33     1    40    10
   11  10    10    30     0
   12  11     0    40     5
I would like to combine these dataframes and get the merged result below:
result =
    X   Y  M1_x  M2_x  M3_x  M1_y  M2_y  M3_y  M1_z  M2_z  M3_z
   12  33     3     4     2     9     4     1     1    40    10
   12  54     0     3     4     0     3     5   nan   nan   nan
   23  12     0     8     3   nan   nan   nan   nan   nan   nan
   12  11   nan   nan   nan     0     2     1     0    40     5
   11  10   nan   nan   nan   nan   nan   nan    10    30     0
I already tried the pd.concat function. The problem is that with axis=1 it doesn't merge the [X, Y] columns, so I get duplicate [X, Y] columns in the result, and with axis=0 it doesn't merge the rows, so I get the union of all rows instead of merging rows with common [X, Y] values.
How can I make this happen?
EDIT:
I know how to use the merge function for two dataframes. My problem here is that I have more than two (actually 4) dataframes to merge. Is there any function to combine more than two dfs in one line?
Use pandas.DataFrame.merge specifying how to merge (outer merge):
>>> df1.merge(df2, 'outer').merge(df3, 'outer')
X Y M1_x M2_x M3_x M1_y M2_y M3_y M1_z M2_z M3_z
0 12 33 3.0 4.0 2.0 9.0 4.0 1.0 1.0 40.0 10.0
1 12 54 0.0 3.0 4.0 0.0 3.0 5.0 NaN NaN NaN
2 23 12 0.0 8.0 3.0 NaN NaN NaN NaN NaN NaN
3 12 11 NaN NaN NaN 0.0 2.0 1.0 0.0 40.0 5.0
4 11 10 NaN NaN NaN NaN NaN NaN 10.0 30.0 0.0
If you have more dataframes you can do something like this:
out = df1.copy()
dfs = [df2,df3,df4,df5,df6]
for df in dfs:
    out = out.merge(df, 'outer')
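To get it literally in one line, as the edit asks, functools.reduce can fold the same outer merge over a list of dataframes (df4 here stands in for any further frames you have):
from functools import reduce

result = reduce(lambda left, right: left.merge(right, 'outer'), [df1, df2, df3, df4])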
Pandas uses an inner merge by default; the merge-type diagrams in Pandas Merging 101 (linked above) clarify the differences between inner, outer, left, and right merges.
Check if this can help.
new_df = pd.merge(df1, df2, how='outer')
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
I have a dataset similar to this
Serial A B
1 12
1 31
1
1 12
1 31 203
1 10
1 2
2 32 100
2 32 242
2 3
3 2
3 23 100
3
3 23
I group the dataframe by Serial and find the maximum A value of each group with df['A_MAX'] = df.groupby('Serial')['A'].transform('max').values, then retain only the first value per group with df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
Serial A B A_MAX B_corresponding
1 12 31 203
1 31
1
1 12
1 31 203
1 10
1 2
2 32 100 32 100
2 32 242
2 3
3 2 23 100
3 23 100
3
3 23
Now for the B_corresponding column, I would like to get the B values corresponding to A_MAX. I thought of locating the A_MAX values in A, but the same maximum A value can appear several times within a group. As an additional condition: in Serial 2, for example, I would prefer the smallest of the B values tied to the two 32s.
The idea is to use DataFrame.sort_values to bring the maximal values first within each group, then remove missing values with DataFrame.dropna and keep the first row per Serial with DataFrame.drop_duplicates. Create a Series with DataFrame.set_index and finally use Series.map:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated())
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated())
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31.0 203.0
1 1 31.0 NaN NaN NaN
2 1 NaN NaN NaN NaN
3 1 12.0 NaN NaN NaN
4 1 31.0 203.0 NaN NaN
5 1 10.0 NaN NaN NaN
6 1 2.0 NaN NaN NaN
7 2 32.0 100.0 32.0 100.0
8 2 32.0 242.0 NaN NaN
9 2 3.0 NaN NaN NaN
10 3 2.0 NaN 23.0 100.0
11 3 23.0 100.0 NaN NaN
12 3 NaN NaN NaN NaN
13 3 23.0 NaN NaN NaN
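Note that the stable sort above keeps the original row order among rows tied on the maximal A, which happens to pick B=100 for Serial 2; to enforce the "smallest B among ties" condition explicitly, you could add B as an ascending secondary sort key:
s = (df.sort_values(['Serial', 'A', 'B'], ascending=[True, False, True])
       .dropna(subset=['B'])
       .drop_duplicates('Serial')
       .set_index('Serial')['B'])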
Converting missing values to empty strings is possible, but it leaves mixed values (numeric and strings) in the columns, so further processing can be problematic:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated(), '')
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31 203
1 1 31.0 NaN
2 1 NaN NaN
3 1 12.0 NaN
4 1 31.0 203.0
5 1 10.0 NaN
6 1 2.0 NaN
7 2 32.0 100.0 32 100
8 2 32.0 242.0
9 2 3.0 NaN
10 3 2.0 NaN 23 100
11 3 23.0 100.0
12 3 NaN NaN
13 3 23.0 NaN
You could also use dictionaries to achieve the same result, if you would rather not rely solely on pandas.
# Smallest B for each A value, and maximal A for each Serial.
a_to_b_mapping = df.groupby('A')['B'].min().to_dict()
serial_to_a_mapping = df.groupby('Serial')['A'].max().to_dict()

rows = []
for serial, a in serial_to_a_mapping.items():
    rows.append((serial, a, a_to_b_mapping.get(a, None)))

agg_df = pd.DataFrame(rows, columns=['Serial', 'A_max', 'B_corresponding'])
agg_df.head()
Serial A_max B_corresponding
0 1 31.0 203.0
1 2 32.0 100.0
2 3 23.0 100.0
If you want, you could join this back to the original dataframe and mask the duplicates.
dft = df.join(agg_df.set_index('Serial'), on='Serial', how='left')
dft['A_max'] = dft['A_max'].mask(dft['Serial'].duplicated(), '')
dft['B_corresponding'] = dft['B_corresponding'].mask(dft['Serial'].duplicated(), '')
dft
Consider the below pandas Series object,
import numpy as np
import pandas as pd
index = list('abcdabcdabcd')
df = pd.Series(np.arange(len(index)), index=index)
My desired output is,
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
I have put some effort into pd.pivot_table and unstack, and the solution probably lies in the correct use of one of them. The closest I have reached is
df.reset_index(level = 1).unstack(level = 1)
but this does not give me the output I am looking for.
Here is something even closer to the desired output, but I am not able to handle the index grouping.
df.to_frame().set_index(df.values, append=True, drop=False).unstack(level=0)
a b c d
0 0.0 NaN NaN NaN
1 NaN 1.0 NaN NaN
2 NaN NaN 2.0 NaN
3 NaN NaN NaN 3.0
4 4.0 NaN NaN NaN
5 NaN 5.0 NaN NaN
6 NaN NaN 6.0 NaN
7 NaN NaN NaN 7.0
8 8.0 NaN NaN NaN
9 NaN 9.0 NaN NaN
10 NaN NaN 10.0 NaN
11 NaN NaN NaN 11.0
A bit more general solution using cumcount to get new index values, and pivot to do the reshaping:
# Reset the existing index, and construct the new index values.
df = df.reset_index()
df.index = df.groupby('index').cumcount()
# Pivot and remove the column axis name.
df = df.pivot(columns='index', values=0).rename_axis(None, axis=1)
The resulting output:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Here is a way that will work if the index is always cycling in the same order, and you know the "period" (in this case 4):
>>> pd.DataFrame(df.values.reshape(-1,4), columns=list('abcd'))
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
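If you'd rather not hard-code the column labels, they can be taken from the first cycle of the index (still assuming the index repeats in the same order with period 4):
pd.DataFrame(df.values.reshape(-1, 4), columns=df.index[:4])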
I need some help figuring out why my dataframe is returning all NaNs.
print df
0 1 2 3 4
0 1 9 0 7 30
1 2 8 0 4 30
2 3 5 0 3 30
3 4 3 0 3 30
4 5 1 0 3 30
Then I added a date index; I only need it to increment by one day for 5 days.
date = pd.date_range(datetime.datetime.today(), periods=5)
data = DataFrame(df, index=date)
print data
0 1 2 3 4
2014-04-10 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-11 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-12 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-13 17:16:09.433000 NaN NaN NaN NaN NaN
2014-04-14 17:16:09.433000 NaN NaN NaN NaN NaN
Tried a few different things to no avail. If I switch my original dataframe to
np.random.randn(5,5)
Then it works. Anyone have an idea of what is going on here?
Edit: I'll add that the data types are float64:
print df.dtypes
0 float64
1 float64
2 float64
3 float64
4 float64
dtype: object
You should overwrite the index of the original dataframe with the following:
df.index = date
What DataFrame(df, index=date) does is create a new dataframe by aligning the given index against the existing index of df. For example:
DataFrame(df, index=[0,1,2,5,5])
returns the following:
0 1 2 3 4
0 1 9 0 7 30
1 2 8 0 4 30
2 3 5 0 3 30
5 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
because 5 is not included in the index of the original dataframe.
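As a quick sketch of the difference (set_axis is a non-mutating alternative, available in recent pandas versions):
# Relabels the rows in place; no alignment happens, all values are kept:
df.index = date

# Or, without mutating df:
data = df.set_axis(date)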