I want to select specific columns from multiple dataframes and combine them into one dataframe. How can I accomplish this?
df1
   count  grade
0      3      0
1      5    100
2      4   50.5
3     10  80.10
df2
   books  saving
0      4      10
1      5    9000
2      8      70
3     10     500
How can I select the saving column from df2 and combine it with the grade column from df1 to form a separate pandas dataframe that looks like the one below?
   grade  saving
0      0      10
1    100    9000
2   50.5      70
3  80.10     500
I tried
df = pd.DataFrame([df1['grade'],df2['saving']])
print(df)
but the outcome is not what I wanted.
df = pd.concat([df1['grade'], df2['saving']], axis=1)
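For reference, a minimal runnable sketch using the sample frames from the question; note that pd.concat with axis=1 aligns on the index, so this assumes df1 and df2 share the same index:

import pandas as pd

df1 = pd.DataFrame({'count': [3, 5, 4, 10],
                    'grade': [0, 100, 50.5, 80.10]})
df2 = pd.DataFrame({'books': [4, 5, 8, 10],
                    'saving': [10, 9000, 70, 500]})

# axis=1 places the two Series side by side as columns
df = pd.concat([df1['grade'], df2['saving']], axis=1)
print(df)
#    grade  saving
# 0    0.0      10
# 1  100.0    9000
# 2   50.5      70
# 3   80.1     500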
A similar question has been answered here.
Pandas documentation for this function: pandas.concat
I've got a dataframe with lots of rows and columns of score results. I want to aggregate them and look at how many records there are for each score. Ideally the output would look like:
df=
Score count
10 576
9 306
8 644
7 829
etc...
I've been using the below code to try to get the main dataframe into that format.
df = df[['score']]
df = df.groupby(['score'])['score'].count()
df = df.reset_index()
This code works for the most part: the second line gets me the aggregated figures for each score, but the scores themselves end up in the index rather than in their own column, which is why I try to reset the index.
However, I keep getting the error: ValueError: cannot insert score, already exists.
Is there any way to get around this so I can have the two columns, score and count?
Sol 1
df = df.groupby(['score']).count()
df = df.reset_index()
df
   score  count
0      7      1
1      8      1
2      9      1
3     10      1
Sol 2
df = df[['score']]
df = df.groupby(['score']).size()
df = df.reset_index()
df
   score  0
0      7  1
1      8  1
2      9  1
3     10  1
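In both solutions the count column ends up awkwardly named (in Sol 2 it is literally 0). A minimal sketch that names the column explicitly with size() and reset_index(name=...), assuming a dummy df with a single 'score' column:

import pandas as pd

df = pd.DataFrame({'score': [10, 9, 8, 7, 10, 9, 10]})

# size() counts rows per group; reset_index(name=...) names the new column,
# sidestepping the "cannot insert score, already exists" collision
counts = df.groupby('score').size().reset_index(name='count')
print(counts)
#    score  count
# 0      7      1
# 1      8      1
# 2      9      2
# 3     10      3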
What I want to do:
Column 'angle' has tracked about 20 angles per second (this can vary), but my 'Time' timestamp only has a precision of one second, so roughly 20 consecutive rows share the same timestamp (the dataframe has over a million rows in total).
My result should be a new dataframe with one row per timestamp. The angle for each timestamp should be the median of the ~20 angle readings in that interval.
My Idea:
I iterate through the rows and check if the timestamp has changed.
If so, I select all rows up to the change, calculate the median, and append it to a new dataframe.
However, I have many big data files, and I am wondering if there is a faster way to achieve my goal.
Right now my code is the following. It is not fast, and I think there must be a better way to do this with pandas/numpy (or something else?).
a = 0
medians = []
for i in range(1, len(df1.index)):
    # column 1 holds the timestamp; keep scanning while it is unchanged
    if df1.iloc[a, 1] == df1.iloc[i, 1]:
        continue
    # timestamp changed: rows a..i-1 form one interval (slicing excludes i)
    medians.append(df1[a:i].median())
    a = i
medians.append(df1[a:].median())  # the final interval
df_result = pd.DataFrame(medians)
You can use groupby here. Below, I made a simple dummy dataframe.
import pandas as pd
df1 = pd.DataFrame({'time': [1,1,1,1,1,1,2,2,2,2,2,2],
                    'angle': [8,9,7,1,4,5,11,4,3,8,7,6]})
df1
    time  angle
0      1      8
1      1      9
2      1      7
3      1      1
4      1      4
5      1      5
6      2     11
7      2      4
8      2      3
9      2      8
10     2      7
11     2      6
Then, we group by the timestamp and take the median of the angle column within that group, and convert the result to a pandas dataframe.
df2 = pd.DataFrame(df1.groupby('time')['angle'].median())
df2 = df2.reset_index()
df2
   time  angle
0     1    6.0
1     2    6.5
You can also use .agg after grouping to choose which operation to apply to each column:
df1.groupby('Time', as_index=False).agg({"angle":"median"})
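Applied to the dummy df1 above (whose timestamp column is lowercase 'time' rather than 'Time'), the same pattern gives the result in one step, since as_index=False saves the reset_index call:

df2 = df1.groupby('time', as_index=False).agg({'angle': 'median'})
print(df2)
#    time  angle
# 0     1    6.0
# 1     2    6.5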
I have a huge data set in a pandas data frame. It looks something like this
df = pd.DataFrame([[1,2,3,4],[31,14,13,11],[115,613,1313,1]], columns=['c1','c1','c2','c2'])
Here the first two columns have the same name, so they should be concatenated into a single column with the values one below another. The dataframe should then look something like this:
df1 = pd.DataFrame([[1,3],[31,13],[115,1313],[2,4],[14,11],[613,1]], columns=['c1','c2'])
Note: my original dataframe has many columns, so I cannot use a simple concat to stack them. I also tried the stack function, apart from concat. What can I do?
Use groupby + cumcount to create a pd.MultiIndex. Reassign the columns with the new pd.MultiIndex, then stack:
df = pd.DataFrame(
    [[1,2,3,4],[31,14,13,11],[115,613,1313,1]],
    columns=['c1','c1','c2','c2'])
df1 = df.copy()
# the second index level numbers the duplicates within each column name (0, 1, ...)
df1.columns = [df.columns, df.columns.to_series().groupby(level=0).cumcount()]
# stacking that level places the duplicate columns' values one below another
print(df1.stack().reset_index(drop=True))
    c1    c2
0    1     3
1    2     4
2   31    13
3   14    11
4  115  1313
5  613     1
Or with a bit of creativity, in one line
df.T.set_index(
    df.T.groupby([df.columns]).cumcount(),
    append=True
).unstack().T.reset_index(drop=True)
    c1    c2
0    1     3
1    2     4
2   31    13
3   14    11
4  115  1313
5  613     1
You could melt the dataframe, number the entries within each column to use as the index for the new dataframe, and then unstack it back, like this:
import pandas as pd
df = pd.DataFrame(
    [[1,2,3,4],[31,14,13,11],[115,613,1313,1]],
    columns=['c1','c1','c2','c2'])

df1 = (pd.melt(df, var_name='column')
         .assign(n=lambda x: x.groupby('column').cumcount())
         .set_index(['n','column'])
         .unstack())
df1.columns = df1.columns.get_level_values(1)
print(df1)
Which produces
column   c1    c2
n
0         1     3
1        31    13
2       115  1313
3         2     4
4        14    11
5       613     1
I have a pandas dataframe defined as:
A  B  SUM_C
1  1     10
1  2     20
I would like to do a cumulative sum of SUM_C and add it as a new column to the same dataframe. In other words, my end goal is to have a dataframe that looks like the one below:
A  B  SUM_C  CUMSUM_C
1  1     10        10
1  2     20        30
Using cumsum in pandas on group() shows how to generate a new dataframe where the SUM_C column is replaced by the cumulative sum. However, what I am asking is how to add the cumulative sum as a new column to the existing dataframe.
Thank you
Just apply cumsum on the pandas.Series df['SUM_C'] and assign it to a new column:
df['CUMSUM_C'] = df['SUM_C'].cumsum()
Result:
df
Out[34]:
   A  B  SUM_C  CUMSUM_C
0  1  1     10        10
1  1  2     20        30
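If the running total should instead restart within each group of A (closer to what the linked question covers), groupby works the same way; a small sketch assuming the same frame:

import pandas as pd

df = pd.DataFrame({'A': [1, 1], 'B': [1, 2], 'SUM_C': [10, 20]})

# the cumulative sum is computed separately within each value of A
df['CUMSUM_C'] = df.groupby('A')['SUM_C'].cumsum()
print(df)
#    A  B  SUM_C  CUMSUM_C
# 0  1  1     10        10
# 1  1  2     20        30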
I have a pandas dataframe with 6 columns and several rows, each row being data from a specific participant in an experiment. Each column is a particular scale that the participant responded to and contains their scores. I want to create a new dataframe that has only the data from those participants whose score on one particular measure matches a criterion.
The criterion is that the score has to match one of the items in a list that I have generated separately.
To paraphrase, I have the data in a dataframe and I want to isolate participants who scored a certain score in one of the 6 measures that matches a list of scores that are of interest. I want to have all the 6 columns in the new dataframe with just the rows of participants of interest. Hope this is clear.
I tried using the groupby function, but it doesn't offer enough specificity for this kind of criterion, or at least I don't know the syntax for such a method if it exists. I'm fairly new to pandas.
You could use isin() and any() to isolate the participants getting a particular score in the tests.
Here's a small example DataFrame showing the scores of five participants in three tests:
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randint(1,6,(5,3)), columns=['Test1','Test2','Test3'])
>>> df
   Test1  Test2  Test3
0      3      3      5
1      5      5      2
2      5      3      4
3      1      3      3
4      2      1      1
If you want a DataFrame with the participants scoring a 1 or 2 in any of the three tests, you could do the following:
>>> score = [1, 2]
>>> df[df.isin(score).any(axis=1)]
   Test1  Test2  Test3
1      5      5      2
3      1      3      3
4      2      1      1
Here df.isin(score) creates a boolean DataFrame showing whether each value of df is in the list score. any(axis=1) then checks each row for at least one True value, creating a boolean Series, which is used to index the DataFrame df.
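If the criterion applies to just one particular measure rather than any of them, filtering that single column is enough; a short sketch reusing df and score from the example above:

score = [1, 2]
# keep only rows where Test1 specifically is 1 or 2
df[df['Test1'].isin(score)]
#    Test1  Test2  Test3
# 3      1      3      3
# 4      2      1      1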
If I understood your question correctly, you want to query a dataframe for entries that appear in a list.
Say you have a "results" df like
import numpy as np
import pandas as pd

df = pd.DataFrame({'score1': np.random.randint(0,10,5),
                   'score2': np.random.randint(0,10,5)})

   score1  score2
0       7       2
1       9       9
2       9       3
3       9       3
4       0       4
and a set of positive outcomes
positive_outcomes = [1,5,7,3]
then you can query the df like
df_final = df[df.score1.isin(positive_outcomes) | df.score2.isin(positive_outcomes)]
to get
   score1  score2
0       7       2
2       9       3
3       9       3
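With six score columns, chaining | conditions gets verbose; the isin/any pattern from the previous answer generalizes this in one line, assuming the same df and positive_outcomes:

# keep rows where at least one score column contains a positive outcome
df_final = df[df.isin(positive_outcomes).any(axis=1)]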