How would you use the values of one column in a pandas DataFrame to look up other columns whose names match those values? That is, if one column holds letters and other columns are titled with those letters, how can you use each row's letter as a column name?
Take this example data set:
letter a b c d
a 1 3 4 2
d 4 3 2 1
c 2 1 4 3
d 3 4 2 1
Desired results:
letter a b c d correct answer
a 1 3 4 2 1
d 4 3 2 1 1
c 2 1 4 3 4
d 3 4 2 1 1
How do you create the correct answer variable seen in the desired results?
Assuming that you have set letter as your index, I suppose you could do the following:
df['correct_answer'] = df.apply(lambda x: x.loc[x.name], axis=1)
Yields:
a b c d correct_answer
letter
a 1 3 4 2 1
d 4 3 2 1 1
c 2 1 4 3 4
d 3 4 2 1 1
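For reference, here is a minimal runnable sketch of this approach; the frame construction below is an assumption reconstructed from the example data:

import pandas as pd

# rebuild the example frame and make 'letter' the index
df = pd.DataFrame({"letter": ["a", "d", "c", "d"],
                   "a": [1, 4, 2, 3],
                   "b": [3, 3, 1, 4],
                   "c": [4, 2, 4, 2],
                   "d": [2, 1, 3, 1]}).set_index("letter")

# for each row, pick the value in the column named after the row's index label
df['correct_answer'] = df.apply(lambda x: x.loc[x.name], axis=1)
print(df.reset_index())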
Here is something that might help: pandas DataFrame diagonal
This is basically searching along the identity diagonal, i.e. where rowname == colname. That should be enough to get you started.
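If the frame is large, that rowname == colname lookup can also be vectorized with NumPy fancy indexing; a sketch, reusing the df with letter as the index from the sketch above:

import numpy as np

# position of the column matching each row's index label
pos = df.columns.get_indexer(df.index)
# pull one value per row: row i, column pos[i]
df['correct_answer'] = df.to_numpy()[np.arange(len(df)), pos]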
I am trying to do some basic dimensionality reduction. I have a CSV file that looks something like this:
A B C A B B A C
1 1 2 2 1 3 1 1
1 2 3 0 0 1 1 2
0 2 1 3 0 1 2 2
I want to import it as a pandas DataFrame, but without the headers being renamed to A.1, A.2, etc. Instead I want to sum the duplicate columns and keep the original column names. Ideally my new DataFrame should look like this:
A B C
4 5 3
2 3 5
5 3 3
Is it possible to do this easily or would you recommend a different way? I can also use bash, R, or anything that can do the trick with a file that is 1 million lines and 1000 columns.
Thank you!
Try splitting the column names on '.' and grouping by the first part:
df.groupby(df.columns.str.split('.').str[0], axis=1).sum()
Output:
A B C
0 4 5 3
1 2 3 5
2 5 3 3
Just load the dataframe normally, group by the first letter of each column name, and sum the values:
df.groupby(lambda colname: colname[0], axis=1).sum()
which gives
A B C
0 4 5 3
1 2 3 5
2 5 3 3
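Note that axis=1 in groupby is deprecated in recent pandas releases. An equivalent that groups on the transpose instead; a sketch, where "data.csv" is a placeholder for your file:

import pandas as pd

# pandas deduplicates repeated headers on read: A, B, C, A.1, B.1, B.2, A.2, C.1
df = pd.read_csv("data.csv")

# transpose, group the (now row) labels by their first part, sum, transpose back
out = df.T.groupby(df.columns.str.split('.').str[0]).sum().T
print(out)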
I would like to achieve the result described below in Python using pandas.
I tried groupby and sum on the id and Group columns using the below:
df.groupby(['id','Group'])['Total'].sum()
I got the first two columns, but I'm not sure how to get the third column (Overall_Total).
How can I do it?
Initial data (before grouping)
id  Group  Time
1   a      2
1   a      2
1   a      1
1   b      1
1   b      1
1   c      1
2   e      2
2   a      4
2   e      1
2   a      5
3   c      1
3   e      4
3   a      3
3   e      4
3   a      2
3   h      4
Assuming df is your initial dataframe, please try this:
df_group = df.groupby(['id', 'Group'], as_index=False)['Time'].sum().rename(columns={'Time': 'Total'})
df_group['Overall_Total'] = df_group.groupby('id')['Total'].transform('sum')
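As a check, here is a self-contained sketch that rebuilds the sample data above and applies those two lines:

import pandas as pd

# reconstruct the sample data from the question
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
    "Group": list("aaabbc") + list("eaea") + list("ceaeah"),
    "Time":  [2, 2, 1, 1, 1, 1, 2, 4, 1, 5, 1, 4, 3, 4, 2, 4],
})

df_group = (df.groupby(['id', 'Group'], as_index=False)['Time']
              .sum()
              .rename(columns={'Time': 'Total'}))
# per-id overall total, broadcast back onto each (id, Group) row
df_group['Overall_Total'] = df_group.groupby('id')['Total'].transform('sum')
print(df_group)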
I want to use the head / tail function, but take a different number of rows for each group, according to an input dictionary.
The function should have two inputs. The first input is a pandas DataFrame:
df = pd.DataFrame({"group": ["A", "A", "A", "B", "B", "B", "B"], "value": [0, 1, 2, 3, 4, 5, 6]})
print(df)
group value
0 A 0
1 A 1
2 A 2
3 B 3
4 B 4
5 B 5
6 B 6
The second input is a dict:
slice_per_group = {"A":1,"B":3}
Expected output:
df.groupby('group').head(slice_per_group) #Obviously this doesn't work
group value
0 A 0
3 B 3
4 B 4
5 B 5
Use head on each group separately:
df.groupby('group', group_keys=False).apply(lambda g: g.head(slice_per_group.get(g.name)))
group value
0 A 0
3 B 3
4 B 4
5 B 5
I'm currently trying to analyze rolling correlations of a dataset with four compared values, but I only need the output for rows containing 'a'.
I got my data frame by using the command newdf = df.rolling(3).corr()
Sample input (random numbers)
a b c d
1 a
1 b
1 c
1 d
2 a
2 b
2 c
2 d
3 a
3 b 5 6 3
3 c 4 3 1
3 d 3 4 2
4 a 1 3 5 6
4 b 6 2 4 1
4 c 8 6 6 7
4 d 2 5 4 6
5 a 2 5 4 1
5 b 1 4 6 3
5 c 2 6 3 7
5 d 3 6 3 7
and need the output:
a b c d
1 a 1 3 5 6
2 a 2 5 4 1
I've tried filtering it with adf = newdf.filter(['a'], axis=0), but that gets rid of everything, and doing the same on the other axis filters by column. Unfortunately, the column containing the values a, b, c, d is unnamed, so I can't filter on that column individually. This wouldn't be an issue, however, if it's possible to flip the rows and columns, with the values listed by index, to get the desired output.
Try using loc. Set the column of repeating a, b, c, d, ... values as the index and just use loc:
df.loc['a']
The actual source of the problem in your case is that your DataFrame has a MultiIndex. When you execute newdf.filter(['a'], axis=0), you are asking for rows whose index is exactly the string "a". But since your DataFrame has a MultiIndex, each row with "a" at level 1 also contains a number at level 0.
To get your intended result, run:
newdf.filter(like='a', axis=0)
maybe followed by .dropna().
An alternative solution is:
newdf.xs('a', level=1, drop_level=False)
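As an illustration, a small sketch exercising both options on a toy MultiIndex frame; the index shape (a number at level 0, one of a, b, c, d at level 1) is an assumption matching the rolling-correlation output described above:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[1, 2, 3], list("abcd")])
newdf = pd.DataFrame(np.arange(48).reshape(12, 4), index=idx, columns=list("abcd"))

print(newdf.filter(like='a', axis=0))            # substring match on the row labels
print(newdf.xs('a', level=1, drop_level=False))  # exact match at level 1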
I have a dataframe from which I want to select certain rows:
df = A B C D
'a' 1 1 1
'b' 1 2 1
'c' 1 1 1
'a' 1 2 2
'a' 2 2 2
'b' 1 2 2
And I want to get the rows where the value in one column is the maximum for its group. So for the example above, if I group by 'A' and 'B', I want the rows that have the greatest value in 'C':
df = A B C D
'a' 1 2 2
'b' 1 2 2
'c' 1 1 1
'a' 2 2 2
I know that I want to use a groupby, but I'm not sure what to do after that.
The easiest way is to use the transform function. This basically lets you apply a function to each group while retaining the same index as the original dataframe. In this case, you get the following from the transform:
In [13]: df.groupby(['A', 'B'])['C'].transform(max)
Out[13]:
0 2
1 2
2 1
3 2
4 2
5 2
Name: C, dtype: int64
This has the exact same index as the original dataframe, so you can use it to create a filter.
df[df['C'] == df.groupby(['A', 'B'])['C'].transform(max)]
Out[11]:
A B C D
1 b 1 2 1
2 c 1 1 1
3 a 1 2 2
4 a 2 2 2
5 b 1 2 2
For much more information on this, see the pandas groupby documentation, which is excellent.
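Note that transform keeps ties: both 'b' rows above have the group maximum C == 2, so both survive the filter. If you want exactly one row per group instead, an idxmax variant works; a sketch, assuming ties should be broken by first occurrence:

# idxmax returns the index label of the first row attaining each group's max
df.loc[df.groupby(['A', 'B'])['C'].idxmax()]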