I have a dataframe like this one:
       Name Team  Shots  Goals  Actions  Games  Minutes
1  Player 1  ABC      5      3       20      2       15
2  Player 2  ATL      6      2       15      1       30
3  Player 3  RMA      3      3       16      1       20
4  Player 4  BAR      9      0       22      3       28
5  Player 5  ATL      8      1       19      2       32
Actually, in my df I have around 120 columns, but this example shows what the solution would look like. I need the same df but with the values of most of the columns divided by one of them. In this case I would like to have the values of 'Shots', 'Goals' and 'Actions' divided by 'Minutes', but I don't want to apply this to 'Games' (and some 3 or 4 other specific columns in my real case).
Do you know any code that applies the division while letting me specify the exception columns it should skip?
Try:
exclude = ['Games', 'Minutes']
# list of columns to leave untouched
cols = df.columns[df.dtypes != 'O']
# keep only the numeric (non-object) columns
cols = cols[~cols.isin(exclude)]
# drop the columns that appear in the exclude list
Finally:
out = df[cols].div(df['Minutes'], axis=0)
Update:
If you want the complete, final df with the excluded columns alongside the divided values, you can use the join() method:
finaldf = out.join(df[exclude])
# if you want to join only the excluded columns
OR
cols = df.columns[df.dtypes == 'O'].tolist() + exclude
finaldf = out.join(df[cols])
# if you want all the excluded columns plus the string ones
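Putting the steps above together on the question's sample data (a minimal runnable sketch; the small frame below stands in for the real 120-column df):

```python
import pandas as pd

# stand-in for the real 120-column frame
df = pd.DataFrame({
    'Name': ['Player 1', 'Player 2'],
    'Team': ['ABC', 'ATL'],
    'Shots': [5, 6],
    'Goals': [3, 2],
    'Actions': [20, 15],
    'Games': [2, 1],
    'Minutes': [15, 30],
})

exclude = ['Games', 'Minutes']
cols = df.columns[df.dtypes != 'O']   # numeric (non-object) columns only
cols = cols[~cols.isin(exclude)]      # minus the excluded ones

out = df[cols].div(df['Minutes'], axis=0)
final = out.join(df[exclude])         # re-attach the untouched columns
```

Because the selection is driven by dtype plus the exclude list, the same few lines work unchanged however many columns the real frame has.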
You can use df.div() to divide multiple columns by one column (note that this returns a new DataFrame; assign the result back if you want to keep it):
df[['Shots','Goals','Actions']].div(df.Minutes, axis=0)
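If the goal is to overwrite the original columns, the result can be assigned back; a minimal sketch on made-up data mirroring the question's numeric columns:

```python
import pandas as pd

df = pd.DataFrame({'Shots': [5, 6], 'Goals': [3, 2],
                   'Actions': [20, 15], 'Minutes': [15, 30]})

cols = ['Shots', 'Goals', 'Actions']
df[cols] = df[cols].div(df['Minutes'], axis=0)  # assignment keeps the result
```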
I have a data frame like this
df:
Index C-1 C-2 C-3 C-4 ........
Ind-1 3 9 5 4
Ind-2 5 2 8 3
Ind-3 0 1 1 0
.
.
The data frame has more than a hundred columns and rows, with whole numbers (0-60) as values.
The first two rows (indexes) have values in the range 2-12.
I want to sort the columns based on the values in the first and second rows (indexes), in ascending order. I don't care about the sort order within the remaining rows.
Can anyone help me with this?
pandas.DataFrame.sort_values
Pass the rows you want to sort on as the by argument, and axis=1 to sort the columns.
Note that the order of the by list matters: if you want to sort first by the second row and then by the first, pass ['Ind-2','Ind-1'], which gives a different result.
df.sort_values(by=['Ind-1','Ind-2'],axis=1)
Output
C-1 C-4 C-3 C-2
Index
Ind-1 3 4 5 9
Ind-2 5 3 8 2
Ind-3 0 0 1 1
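For completeness, a runnable sketch that rebuilds the example frame and checks both sort orders (the index name Index is assumed from the question's printout):

```python
import pandas as pd

df = pd.DataFrame([[3, 9, 5, 4],
                   [5, 2, 8, 3],
                   [0, 1, 1, 0]],
                  index=pd.Index(['Ind-1', 'Ind-2', 'Ind-3'], name='Index'),
                  columns=['C-1', 'C-2', 'C-3', 'C-4'])

out = df.sort_values(by=['Ind-1', 'Ind-2'], axis=1)
# columns now ordered by the values in row Ind-1: C-1, C-4, C-3, C-2

swapped = df.sort_values(by=['Ind-2', 'Ind-1'], axis=1)
# sorting by Ind-2 first gives a different order: C-2, C-4, C-1, C-3
```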
I am working with a pandas dataframe with one column. I would like to keep a row unchanged if it does not contain a period, but if it does contain a period, I want to keep only what comes after the period.
df
col1
0 learn
1 media
2 email.kg
3 tracking1
4 link.mta2
5 schemas
6 email.lg
7 secure2
8 tags
9 links.seminars
Desired outcome:
df1
col1
0 learn
1 media
2 kg
3 tracking1
4 mta2
5 schemas
6 lg
7 secure2
8 tags
9 seminars
Try split and ffill:
df['col1'] = df['col1'].str.split('.', expand=True).ffill(axis=1).iloc[:, -1]
Output:
col1
0 learn
1 media
2 kg
3 tracking1
4 mta2
5 schemas
6 lg
7 secure2
8 tags
9 seminars
You can use the apply method to call a function on each element of the column (which is a Series in pandas). In this function, you can use the split method, which scans the string and breaks it at every occurrence of '.'. The result of split is a list with one element for each section delimited by '.'. Taking the last element of that list works whether or not the string contains a period: for 'learn' it returns 'learn', and for 'email.kg' it returns 'kg'.
df["col1"].apply(lambda x: x.split(".")[-1])
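As a vectorized alternative to apply, the same split can be done through the .str accessor; a small sketch on a few of the question's values:

```python
import pandas as pd

s = pd.Series(['learn', 'email.kg', 'link.mta2'])
out = s.str.split('.').str[-1]   # last piece; the whole string when there is no '.'
```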
I have a dataframe like the following:
This dataframe is a result of unstacking another dataframe.
cost 10 20 30
-------------------------------------------
cycles
--------------------------------------------
1 2 4 6
2 1 2 3
3 3 6 9
4 1 0 5
I want something like this:
cycles 10 20 30
-----------------------------------------------
1 2 4 6
2 1 2 3
3 3 6 9
4 1 0 5
I am a little confused about pivoting in the unstacked dataframes. I went through a couple of other similar posts but I really couldn't understand. I want to perform regression on every column of this dataframe. I figured it would be difficult to access the cycles column in the first dataframe, so I would really appreciate if someone can shed any light on this. Thanks in advance!
Do you want this?
df.reset_index()
Edit: To drop the axis name cost, you need to use:
df.rename_axis(None, axis=1).reset_index()
This will still show an index; that is just the way pandas works, but it will no longer have a label floating over it. If you want cycles to stay as the index, without the cost label, you can just use the first part:
df.rename_axis(None, axis=1)
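A minimal sketch reconstructing a frame shaped like the question's (with cost as the columns' axis name and cycles as the index name, as a df.unstack() would produce) and applying both calls:

```python
import pandas as pd

df = pd.DataFrame([[2, 4, 6], [1, 2, 3], [3, 6, 9], [1, 0, 5]],
                  index=pd.Index([1, 2, 3, 4], name='cycles'),
                  columns=pd.Index([10, 20, 30], name='cost'))

out = df.rename_axis(None, axis=1).reset_index()
# the 'cost' label is dropped and 'cycles' becomes an ordinary column
```

With cycles as a regular column, each of the remaining columns can be regressed against it directly.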
I have the following pandas DataFrame.
import pandas as pd
df = pd.read_csv('filename.csv')
print(df)
A B C D
0 2 0 11 0.053095
1 2 0 11 0.059815
2 0 35 11 0.055268
3 0 35 11 0.054573
4 0 1 11 0.054081
5 0 2 11 0.054426
6 0 1 11 0.054426
7 0 1 11 0.054426
8 42 7 3 0.048208
9 42 7 3 0.050765
10 42 7 3 0.05325
....
The problem is, the data is naturally "clustered" into groups, but this data is not given. From the above, rows 0-1 are one group, rows 2-3 are a group, rows 4-7 are a group, and 8-10 are a group.
I need to infer this grouping. One could use machine learning; however, is it possible to do this using only pandas?
Could one groupby the values of the columns to create these groups? The problem is that the values are not exact: for the third group, column B has the values 1, 2, 1, 1.
A pure pandas solution would involve binning, assuming that your values are close to each other within a cluster and your bin size is large enough to absorb within-cluster variation but smaller than the distance between clusters. Whether that holds depends on your data.
The binning approach uses the cut function in pandas. You provide a series (or array) and the number of bins you want to the function. The function evenly subdivides the range of your series into the given number of bins and determines where each value in the input falls. The output for the below set of columns will be which bin the value fell in and will be what you can group by, following your original train of thought.
The way this would come out in practice, for bins of size ~5, is:
import numpy as np

for col in df.columns:
    binned_name = col + '_binned'
    num_bins = int(np.ceil(df[col].max() / 5))   # pd.cut needs an integer bin count
    df[binned_name] = pd.cut(df[col], num_bins, labels=False)
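A runnable sketch of this loop on data mirroring the question's sample. It also shows the caveat in action: with a bin size of ~5, rows 0-1 and rows 4-7 land in the same bins (their A and B values differ by less than 5), so those two clusters merge and a smaller bin size would be needed to separate them:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 0, 0, 0, 0, 0, 0, 42, 42, 42],
                   'B': [0, 0, 35, 35, 1, 2, 1, 1, 7, 7, 7],
                   'C': [11, 11, 11, 11, 11, 11, 11, 11, 3, 3, 3]})

for col in list(df.columns):                      # snapshot: we add columns as we go
    binned_name = col + '_binned'
    num_bins = int(np.ceil(df[col].max() / 5))    # pd.cut needs an integer bin count
    df[binned_name] = pd.cut(df[col], num_bins, labels=False)

groups = df.groupby(['A_binned', 'B_binned', 'C_binned'])
# three groups here, not four: the first and third clusters share the same bins
```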
I have a pandas dataframe with 6 columns and several rows, each row being data from a specific participant in an experiment. Each column is a particular scale that the participant responded to and contains their scores. I want to create a new dataframe that has only data from those participants whose score for one particular measure matches a criteria.
The criteria is that it has to match one of the items from a list that I have generated separately.
To paraphrase, I have the data in a dataframe and I want to isolate participants who scored a certain score in one of the 6 measures that matches a list of scores that are of interest. I want to have all the 6 columns in the new dataframe with just the rows of participants of interest. Hope this is clear.
I tried using the groupby function but it doesn't offer enough specificity in specifying the criteria, or at least I don't know the syntax if such methods exist. I'm fairly new to pandas.
You could use isin() and any() to isolate the participants getting a particular score in the tests.
Here's a small example DataFrame showing the scores of five participants in three tests:
>>> df = pd.DataFrame(np.random.randint(1,6,(5,3)), columns=['Test1','Test2','Test3'])
>>> df
Test1 Test2 Test3
0 3 3 5
1 5 5 2
2 5 3 4
3 1 3 3
4 2 1 1
If you want a DataFrame with the participants scoring a 1 or 2 in any of the three tests, you could do the following:
>>> score = [1, 2]
>>> df[df.isin(score).any(axis=1)]
Test1 Test2 Test3
1 5 5 2
3 1 3 3
4 2 1 1
Here df.isin(score) creates a boolean DataFrame showing whether each value of df is in the list score. any(axis=1) checks each row for at least one True value, producing a boolean Series, which is then used to index the DataFrame df.
If I understood your question correctly you want to query a dataframe for inclusion of the entries in a list.
Say you have a "results" df like
df = pd.DataFrame({'score1': np.random.randint(0, 10, 5),
                   'score2': np.random.randint(0, 10, 5)})
score1 score2
0 7 2
1 9 9
2 9 3
3 9 3
4 0 4
and a set of positive outcomes
positive_outcomes = [1,5,7,3]
then you can query the df like
df_final = df[df.score1.isin(positive_outcomes) | df.score2.isin(positive_outcomes)]
to get
score1 score2
0 7 2
2 9 3
3 9 3
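The chained-isin form above and the df.isin(...).any(axis=1) form from the other answer select the same rows; a small sketch using the question's printed values instead of random ones:

```python
import pandas as pd

df = pd.DataFrame({'score1': [7, 9, 9, 9, 0],
                   'score2': [2, 9, 3, 3, 4]})
positive_outcomes = [1, 5, 7, 3]

chained = df[df.score1.isin(positive_outcomes) | df.score2.isin(positive_outcomes)]
compact = df[df.isin(positive_outcomes).any(axis=1)]  # same rows, scales to many columns
```

The chained form is handy when only some columns should count toward the match; the any() form is shorter when every column counts.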