I have a dictionary that I want to use to decide which column's value to pick for each row, sort of like an if condition driven by a dictionary.
import pandas as pd
dictname = {'A': 'Select1', 'B':'Select2','C':'Select3'}
df = pd.DataFrame([['A',1,2,3,4],['B',1,2,3,4],['B',1,3,4,5],['C',1,5,6,7]], columns=['Name','Score','Select1','Select2','Select3'])
So I want to create a new column called ChosenValue that takes its value based on the row's value in the column 'Name', e.g. ChosenValue should equal the value in column 'Select1' if the row's 'Name' is 'A', the value in 'Select2' if 'Name' is 'B', and so forth. I really want something that links it to the dictionary 'dictname'.
Use Index.get_indexer to get a list of indices. After that, you can just index into the underlying numpy array.
import numpy as np
idx = df.columns.get_indexer(df.Name.map(dictname))
df['ChosenValue'] = df.values[np.arange(len(df)), idx]
df
Name Score Select1 Select2 Select3 ChosenValue
0 A 1 2 3 4 2
1 B 1 2 3 4 3
2 B 1 3 4 5 4
3 C 1 5 6 7 7
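One caveat with the indexing approach above: get_indexer returns -1 for a label it cannot find, and -1 would silently select the last column. The following is a minimal sketch of one way to guard against that, assuming a hypothetical row whose Name ('Z') is not in the dictionary and assuming NaN is an acceptable placeholder for such rows.

```python
import numpy as np
import pandas as pd

dictname = {'A': 'Select1', 'B': 'Select2', 'C': 'Select3'}
# 'Z' is a made-up name that has no entry in dictname.
df = pd.DataFrame(
    [['A', 1, 2, 3, 4], ['B', 1, 2, 3, 4], ['Z', 1, 3, 4, 5], ['C', 1, 5, 6, 7]],
    columns=['Name', 'Score', 'Select1', 'Select2', 'Select3'],
)

mapped = df['Name'].map(dictname)               # NaN where the name is unknown
found = mapped.isin(df.columns).to_numpy()      # True only for usable column names
idx = df.columns.get_indexer(mapped[found])     # column positions for the valid rows

# Start with NaN everywhere, then fill in the rows that have a valid column.
chosen = np.full(len(df), np.nan)
chosen[found] = df.to_numpy()[np.flatnonzero(found), idx].astype(float)
df['ChosenValue'] = chosen
```

Because of the NaN placeholder, ChosenValue becomes a float column in this sketch.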
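If you know that every Name is in the dictionary, you could use lookup (DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this only works on older versions):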
In [104]: df["ChosenValue"] = df.lookup(df.index, df.Name.map(dictname))
In [105]: df
Out[105]:
Name Score Select1 Select2 Select3 ChosenValue
0 A 1 2 3 4 2
1 B 1 2 3 4 3
2 B 1 3 4 5 4
3 C 1 5 6 7 7
I want to use the head / tail function, but for each group I want to take a different number of rows according to an input dictionary.
The function should take 2 inputs. The first input is a pandas dataframe:
df = pd.DataFrame({"group":["A","A","A","B","B","B","B"],"value":[0,1,2,3,4,5,6]})
print(df)
group value
0 A 0
1 A 1
2 A 2
3 B 3
4 B 4
5 B 5
6 B 6
The second input is a dict:
slice_per_group = {"A":1,"B":3}
Expected output:
df.groupby('group').head(slice_per_group) #Obviously this doesn't work
group value
0 A 0
3 B 3
4 B 4
5 B 5
Use head on each group separately:
df.groupby('group', group_keys=False).apply(lambda g: g.head(slice_per_group.get(g.name)))
group value
0 A 0
3 B 3
4 B 4
5 B 5
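As an alternative to calling head once per group, the same selection can be written as a single boolean mask. This is only a sketch of a swapped-in technique (groupby cumcount plus map), not the method from the answer above; the names position, limit and result are made up for illustration. Note one difference: groups missing from the dictionary are dropped here, because their limit becomes NaN.

```python
import pandas as pd

df = pd.DataFrame({"group": ["A", "A", "A", "B", "B", "B", "B"],
                   "value": [0, 1, 2, 3, 4, 5, 6]})
slice_per_group = {"A": 1, "B": 3}

# Position of each row within its group: 0, 1, 2, ... restarting per group.
position = df.groupby("group").cumcount()
# Number of rows to keep for the group each row belongs to.
limit = df["group"].map(slice_per_group)

# Keep a row only if its position is below its group's limit.
result = df[position < limit]
```

For the tail case, cumcount(ascending=False) numbers the rows from the end of each group instead.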
I have a dataframe like this:
df1
a b c
0 1 2 [bg10, ng45, fg56]
1 4 5 [cv10, fg56]
2 7 8 [bg10, ng45, fg56]
3 7 8 [fg56, fg56]
4 7 8 [bg10]
I would like to count the total number of occurrences of each value in column 'c'. I would then like to return the value of column 'b' for the values in column 'c' that have a total count of 1.
The expected output is something like this:
c b total_count
0 bg10 2 2
0 ng45 2 2
0 fg56 2 5
1 cv10 5 1
1 fg56 5 5
I have tried the 'collections' library and a 'for' loop (I understand it's not best practice in pandas), but I think I'm missing some fundamental understanding of lists within cells and how to perform analyses like this.
Thank you for taking my question into consideration.
I would use apply in the following way.
First I create the df:
df1=pd.DataFrame({"b":[2,5,8,8], "c":[['bg10', 'ng45', 'fg56'],['cv10', 'fg56'],['bg10', 'ng45', 'fg56'],['fg56', 'fg56']]})
Next, use apply to count the number of (non-unique) items in each list and save it in a different column:
df1["count_c"]=df1.c.apply(len)
You will get the following:
b c count_c
0 2 [bg10, ng45, fg56] 3
1 5 [cv10, fg56] 2
2 8 [bg10, ng45, fg56] 3
3 8 [fg56, fg56] 2
To get the rows where the count is larger than a threshold:
df1[df1["count_c"]>2]["b"]
Note: if you want to count only the unique values in each list in column c, you should use:
df1["count_c"]=df1.c.apply(lambda x: len(set(x)))
EDIT
In order to count the total number of occurrences of each item, I would try this.
First let's unpack all the lists into columns and stack them into one row per item:
new_df1=(df1.c.apply(lambda x: pd.Series(x))).stack().reset_index(level=1,drop=True).to_frame("c").join(df1[["b"]],how="left")
Then get the total count of each item and add it to a new column:
counts_dict=new_df1.c.value_counts().to_dict()
new_df1["total_count_c"]=new_df1.c.map(counts_dict)
new_df1.head()
c b total_count_c
0 bg10 2 2
0 ng45 2 2
0 fg56 2 5
1 cv10 5 1
1 fg56 5 5
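On pandas 0.25 or later, the unpacking step can also be written with DataFrame.explode. This is a minimal sketch using the same df1 as above; the name b_for_singletons is made up here and covers the last part of the question (the 'b' values for items whose total count is 1).

```python
import pandas as pd

df1 = pd.DataFrame({"b": [2, 5, 8, 8],
                    "c": [["bg10", "ng45", "fg56"], ["cv10", "fg56"],
                          ["bg10", "ng45", "fg56"], ["fg56", "fg56"]]})

# One row per list item; the original index and 'b' value are repeated.
new_df1 = df1.explode("c")
# Count each item across the whole column and attach the count to every row.
new_df1["total_count_c"] = new_df1["c"].map(new_df1["c"].value_counts())
# Values of 'b' for the items that occur exactly once overall.
b_for_singletons = new_df1.loc[new_df1["total_count_c"] == 1, "b"]
```

With this data only cv10 occurs once, so b_for_singletons holds the single value 5.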
I want to add a new column to a DataFrame that stores the sort order of one of its columns. For example, I would like to sort by column 'B' (ascending) and add a new column 'C' that holds the sort order. That means I want to get a column 'C' equal to [4,3,1,2,4,2].
df=pd.DataFrame({"A":[1,2,3,4,5,6],"B":[5,2,0,1,5,1]})
Try with rank, and method='dense' so that rank always increases by 1 between groups:
import pandas as pd
df=pd.DataFrame({"A":[1,2,3,4,5,6],"B":[5,2,0,1,5,1]})
df['C']=df['B'].rank(method='dense').astype(int)
df
Output:
A B C
0 1 5 4
1 2 2 3
2 3 0 1
3 4 1 2
4 5 5 4
5 6 1 2
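To see why method='dense' matters here, it may help to compare it with another tie-handling method on the same data. This is a small sketch; the column names dense_rank and min_rank are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6], "B": [5, 2, 0, 1, 5, 1]})

# 'dense': tied values share a rank and the next distinct value gets rank + 1.
df["dense_rank"] = df["B"].rank(method="dense").astype(int)
# 'min': tied values share the lowest rank, and the ranks after a tie are skipped.
df["min_rank"] = df["B"].rank(method="min").astype(int)

print(df)
```

With this data dense_rank is [4, 3, 1, 2, 4, 2], which is the requested column, while min_rank is [5, 4, 1, 2, 5, 2] because the two 1s occupy ranks 2 and 3.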
I have a dataframe that I want to select rows from:
df = A B C D
'a' 1 1 1
'b' 1 2 1
'c' 1 1 1
'a' 1 2 2
'a' 2 2 2
'b' 1 2 2
I want to get the rows where the value in one column is the maximum for that group. So for the example above, if I wanted to group by 'A' and 'B' and get the rows that have the greatest value in 'C', I would get:
df = A B C D
'a' 1 2 2
'b' 1 2 2
'c' 1 1 1
'a' 2 2 2
I know that I want to use a groupby, but I'm not sure what to do after that.
The easiest way is to use the transform function. This basically lets you apply a function to each group while retaining the same index as the original dataframe. In this case, you get the following from the transform:
In [13]: df.groupby(['A', 'B'])['C'].transform('max')
Out[13]:
0 2
1 2
2 1
3 2
4 2
5 2
Name: C, dtype: int64
This has the exact same index as the original dataframe, so you can use it to create a filter.
df[df['C'] == df.groupby(['A', 'B'])['C'].transform('max')]
Out[11]:
A B C D
1 b 1 2 1
2 c 1 1 1
3 a 1 2 2
4 a 2 2 2
5 b 1 2 2
For much more information on this, see the pandas groupby documentation, which is excellent.
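Note that the transform filter keeps every row that ties for the group maximum, which is why both 'b' rows appear in the result. If only one row per group is wanted, a possible alternative is groupby idxmax. The sketch below rebuilds the example frame from the question; the names with_ties and first_only are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "a", "a", "b"],
                   "B": [1, 1, 1, 1, 2, 1],
                   "C": [1, 2, 1, 2, 2, 2],
                   "D": [1, 1, 1, 2, 2, 2]})

# transform-based filter: keeps every row that ties for the group maximum.
with_ties = df[df["C"] == df.groupby(["A", "B"])["C"].transform("max")]

# idxmax-based selection: keeps only the first maximal row of each group.
first_only = df.loc[df.groupby(["A", "B"])["C"].idxmax()]
```

Here with_ties has five rows (index 1 to 5), while first_only has four, one per (A, B) group, ordered by group key rather than by the original index.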
Given a dataframe of the format
A B C D
.......
........
I would like to select the rows whose value in column B is greater than 0.6 times the last value in column B. For example:
Input:
A B C
1 0 5
2 3 4
3 6 6
4 8 1
5 9 3
Output:
A B C
3 6 6
4 8 1
5 9 3
I am currently doing the following:
x = df.loc[df.tail(1).index,'B']
which returns a Series object holding the index and value of column B for the last row of the dataframe, and then:
new_df = df[df.B > x]
But I am getting the error:
ValueError: Series lengths must match to compare
How should I perform the query?
You first need to take the last value of column B using tail and multiply it by 0.6:
df[df['B'] > df['B'].tail(1).values[0] * 0.6]
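The attempt in the question fails because x is a one-element Series rather than a scalar, and pandas will not compare it element-wise against the five-element column B. A minimal sketch of an equivalent way to get a scalar threshold, using iloc[-1] on the example data from the question (the name threshold is made up here):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": [0, 3, 6, 8, 9],
                   "C": [5, 4, 6, 1, 3]})

# Scalar threshold: 0.6 times the last value of column B (0.6 * 9).
threshold = 0.6 * df["B"].iloc[-1]
# Comparing a column against a scalar broadcasts over every row.
new_df = df[df["B"] > threshold]
```

This keeps the rows where B is 6, 8 and 9, matching the expected output.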