Input
      DBN Grade   3   4   5
0  01M015     3  30  44  15
1  01M015     4  30  44  15
2  01M015     5  30  44  15

Desired Output

      DBN Grade   3   4   5  Enrollment
0  01M015     3  30  44  15          30
1  01M015     4  30  44  15          44
2  01M015     5  30  44  15          15
How would you create the Enrollment column?
Note that the column we seek for each record depends on the value at df['Grade'].
I've tried variations of df[df['Grade']] so that I could find the column df['3'], but I haven't been successful.
Is there a way to do this simply?
import pandas as pd
import numpy as np
data = {'DBN': ['01M015', '01M015', '01M015'],
        'Grade': ['3', '4', '5'],
        '3': ['30', '30', '30'],
        '4': ['44', '44', '44'],
        '5': ['15', '15', '15']}
df = pd.DataFrame(data)
# This doesn't work: the comprehension builds one item per (row, column) pair,
# so it raises ValueError: Length of values does not match length of index
df['Enrollment'] = [df[c] if (df.loc[i, 'Grade'] == c) else None
                    for i in df.index for c in df.columns]
Set your index, and then use lookup:
df.set_index('Grade').lookup(df['Grade'], df['Grade'])
array(['30', '44', '15'], dtype=object)
You might run into some issues if your data is numeric (in your sample data it is all strings), requiring a cast to make the lookup succeed.
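Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A sketch of the replacement pattern recommended in the pandas deprecation notes, using factorize and reindex:
import numpy as np
import pandas as pd

# Map each Grade to a column position, then pick one value per row
idx, cols = pd.factorize(df['Grade'])
df['Enrollment'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]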
import pandas as pd
import numpy as np
data = {'DBN': ['01M015', '01M015', '01M015'],
        'Grade': ['3', '4', '5'],
        '3': ['30', '30', '30'],
        '4': ['44', '44', '44'],
        '5': ['15', '15', '15']}
df = pd.DataFrame(data)
# For each row, pick the value from the column named after that row's Grade
enrollment_list = []
for index, row in df.iterrows():
    enrollment_list.append(row[row['Grade']])
df['Enrollment'] = enrollment_list
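The same row-wise pick can also be written without an explicit loop; a minimal sketch, assuming the Grade values match the column labels exactly (both are strings here):
# For each row, look up the column whose name equals that row's Grade
df['Enrollment'] = df.apply(lambda row: row[row['Grade']], axis=1)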
I want to stack two columns on top of each other
I have Left and Right values in one column each and want to combine them into a single column. How do I do this in Python?
I'm working with pandas DataFrames.
Basically from this
Left Right
0 20 25
1 15 18
2 10 35
3 0 5
To this:
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
It doesn't matter how they are combined as I will plot it anyway, and the new column name also doesn't matter because I can rename it.
You can create a list of the columns, calling squeeze on each so the data doesn't try to align on column names, and then call concat on this list. Passing ignore_index=True creates a new index; otherwise you'll get the original index values repeated:
cols = [df[col].squeeze() for col in df]
pd.concat(cols, ignore_index=True)
Many options, stack, melt, concat, ...
Here's one:
>>> df.melt(value_name='New Name').drop(columns='variable')
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
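Since stack was mentioned as another option, here is a sketch of that variant; note it flattens row by row, so the Left and Right values come out interleaved (20, 25, 15, 18, ...) rather than column by column:
out = df.stack().reset_index(drop=True)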
You can also use np.ravel:
import numpy as np
out = pd.DataFrame(np.ravel(df.values.T), columns=['New name'])
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Update
If you have only 2 cols:
out = pd.concat([df['Left'], df['Right']], ignore_index=True).to_frame('New name')
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Solution with unstack
import numpy as np

df2 = df.unstack()
# unstack returns a Series with a (column, original index) MultiIndex;
# recreate a flat integer index
df2.index = np.arange(len(df2))
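If you want a one-column DataFrame with the name from the question rather than a Series, a minimal variant of the same idea:
df2 = df.unstack().reset_index(drop=True).to_frame('New Name')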
A solution that selects the columns and ravels them. Note that np.ravel flattens row by row, so the Left and Right values come out interleaved.
# Your data
import numpy as np
import pandas as pd
df = pd.DataFrame({"Left":[20,15,10,0], "Right":[25,18,35,5]})
# Select the columns and flatten them row-major (C order)
df2 = pd.DataFrame({"New Name": np.ravel(df[["Left", "Right"]])})
df2
New Name
0 20
1 25
2 15
3 18
4 10
5 35
6 0
7 5
I ended up using this solution; it seems to work fine:
df1 = dfTest[['Left']].copy()
df2 = dfTest[['Right']].copy()
df2.columns = ['Left']  # rename so the two frames align on the same column
df3 = pd.concat([df1, df2], ignore_index=True)
import pandas as pd
data = {'term':[2, 7,10,11,13],'pay':[22,30,50,60,70]}
df = pd.DataFrame(data)
   term  pay
0     2   22
1     7   30
2    10   50
3    11   60
4    13   70

df.loc[2] = [9, 49]
print(df)

   term  pay
0     2   22
1     7   30
2     9   49
3    11   60
4    13   70

Expected output:
   term  pay
0     2   22
1     7   30
2     9   49
3    10   50
4    11   60
5    13   70
If we run the code above, it replaces the values at index 2. I want to add a new row with the desired values to my existing dataframe without replacing the existing ones. Please suggest how.
You cannot insert a new row directly by assigning values to df.loc[2], as that will overwrite the existing values. But you can slice the dataframe into two parts and then concat the parts together with the new row in between.
Try this:
new_df = pd.DataFrame({"term": 9, "pay": 49}, index=[2])
df = pd.concat([df.loc[:1], new_df, df.loc[2:]]).reset_index(drop=True)
print(df)
Output:
   term  pay
0     2   22
1     7   30
2     9   49
3    10   50
4    11   60
5    13   70
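The same approach can be wrapped in a small helper; a sketch (the insert_row name is my own, not a pandas API) that uses positional iloc slicing so it works for any index:
import pandas as pd

def insert_row(df, pos, row):
    # Insert `row` (a dict of column -> value) before integer position `pos`
    new = pd.DataFrame(row, index=[0])
    return pd.concat([df.iloc[:pos], new, df.iloc[pos:]]).reset_index(drop=True)

df = insert_row(df, 2, {'term': 9, 'pay': 49})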
A possible way is to prepare an empty slot in the index, add the row and sort according to the index:
df.index = list(range(2)) + list(range(3, len(df) + 1))  # leaves label 2 unused
df.loc[2] = [9, 49]
It gives:
   term  pay
0     2   22
1     7   30
3    10   50
4    11   60
5    13   70
2     9   49
Time to sort it:
df = df.sort_index()
   term  pay
0     2   22
1     7   30
2     9   49
3    10   50
4    11   60
5    13   70
That is because loc and iloc address rows that already exist in the dataframe, so assigning to an existing label overwrites it; normally you would insert by appending a value as the last row.
To address this situation you first need to split the dataframe, add the row you want, concatenate with the second part, and finally reset the index (in case you want to keep using consecutive integers).
# location where you want to insert
i = 2
# data to insert
data_to_insert = pd.DataFrame({'term': 9, 'pay': 49}, index=[i])
# split before position i, insert the new row, then concat the rest
df = pd.concat([df.loc[:i-1], data_to_insert, df.loc[i:]]).reset_index(drop=True)
Keep in mind that this slicing works because the index consists of integers, and that .loc slices include both endpoints, which is why the first part ends at i-1.
I have certain pandas dataframe which has a structure like this
A B C
1 2 2
2 2 2
...
I want to create a new column called ID and fill it with an alphanumeric series which looks somewhat like this
ID A B C
GT001 1 2 2
GT002 2 2 2
GT003 2 2 2
...
I know how to fill it with either letters or numbers, but I couldn't figure out whether there is a "pandas native" method that would let me fill in an alphanumeric series. What would be the best way to do this?
Welcome to Stack Overflow!
If you want a custom ID, you can build a list with the desired values (avoid naming it list, which shadows the builtin):
ids = []
for i in range(1, df.shape[0] + 1):  # df.shape[0] is the number of rows
    ids.append(f'GT{i:03d}')  # f-string; 03d pads with leading zeros
df['ID'] = ids
And if you want to set that as an index do df.set_index('ID', inplace=True)
import pandas as pd
import numpy as np
df = pd.DataFrame({'player': np.linspace(0, 20, 20)})
n = len(df) + 1  # one ID per row
data = ['GT' + '0' * (3 - len(str(i))) + str(i) for i in range(1, n)]
df['ID'] = data
Output:
player ID
0 0.000000 GT001
1 1.052632 GT002
2 2.105263 GT003
3 3.157895 GT004
4 4.210526 GT005
5 5.263158 GT006
6 6.315789 GT007
7 7.368421 GT008
8 8.421053 GT009
9 9.473684 GT010
10 10.526316 GT011
11 11.578947 GT012
12 12.631579 GT013
13 13.684211 GT014
14 14.736842 GT015
15 15.789474 GT016
16 16.842105 GT017
17 17.894737 GT018
18 18.947368 GT019
19 20.000000 GT020
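If you want something closer to the "pandas native" method asked about, a vectorized sketch using the .str accessor (no Python loop):
df['ID'] = 'GT' + pd.Series(np.arange(1, len(df) + 1), index=df.index).astype(str).str.zfill(3)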
I have the following multiindex dataframe:
from io import StringIO
import pandas as pd
datastring = StringIO("""File,no,runtime,value1,value2
A,0, 0,12,34
A,0, 1,13,34
A,0, 2,23,34
A,1, 6,23,38
A,1, 7,22,38
B,0,17,15,35
B,0,18,17,35
C,0,34,23,32
C,0,35,21,32
""")
df = pd.read_csv(datastring, sep=',')
df.set_index(['File','no',df.index], inplace=True)
>> df
runtime value1 value2
File no
A 0 0 0 12 34
1 1 13 34
2 2 23 34
1 3 6 23 38
4 7 22 38
B 0 5 17 15 35
6 18 17 35
C 0 7 34 23 32
8 35 21 32
What I would like to get is just the first value2 of every (File, no) combination:
A 0 34
A 1 38
B 0 35
C 0 32
The most similar questions I could find were these:
Resample pandas dataframe only knowing result measurement count
MultiIndex-based indexing in pandas
Select rows in pandas MultiIndex DataFrame
but I was unable to construct a solution from them. The best I got was an IndexSlice selection:
idx = pd.IndexSlice
df.loc[idx[:,0],:]
This can, for example, filter for no == 0, but since all the rows are technically still there (the display merely hides repeated index labels), it still returns every matching row rather than just the first one.
Is a multiindex even the right tool for the task at hand? How to solve this?
Use GroupBy.first by first and second level of MultiIndex:
s = df.groupby(level=[0,1])['value2'].first()
print (s)
File no
A 0 34
1 38
B 0 35
C 0 32
Name: value2, dtype: int64
If need one column DataFrame use one element list:
df1 = df.groupby(level=[0,1])[['value2']].first()
print (df1)
value2
File no
A 0 34
1 38
B 0 35
C 0 32
Another idea is to remove the 3rd level with DataFrame.reset_index and keep only the first row of each (File, no) pair using Index.duplicated with boolean indexing:
df2 = df.reset_index(level=2, drop=True)
s = df2.loc[~df2.index.duplicated(), 'value2']
print (s)
File no
A 0 34
1 38
B 0 35
C 0 32
Name: value2, dtype: int64
For the sake of completeness, I would like to add another method (which I would not have found without the answer by jezrael).
s = df.groupby(level=[0,1])['value2'].nth(0)
This can be generalized to select any entry, not merely the first:
t = df.groupby(level=[0,1])['value1'].nth(1)
Note that the selection was changed from value2 to value1 as for the former, the results of nth(0) and nth(1) would have been identical.
Pandas documentation link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.nth.html
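One caveat: in recent pandas versions (2.0 and later) GroupBy.nth behaves like a filter, returning the selected rows with their original index rather than indexed by (File, no), so first() is the safer choice if you rely on the index shown above:
df.groupby(level=[0, 1])['value2'].nth(0)   # original row index in pandas >= 2.0
df.groupby(level=[0, 1])['value2'].first()  # always indexed by (File, no)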
srch_destination hotel_booked count
28 1 4
28 5 1
28 8 2
28 11 9
28 14 17
19 11 3
19 2 5
19 5 8
19 6 10
Let's say I have a dataframe formatted above. These are searches, so let's say that 4 people who searched for destination 28 booked hotel 1. I essentially want to get a dataframe that contains a row for each search destination, along with the corresponding top 3 bookings. So for this dataframe, we would have two rows that look like:
srch_destination top_hotels
28 14 11 1
19 6 5 2
Currently, my code is below, where c_id is the initial dataframe and a is the desired output. I am coming from R and am wondering if there is a more efficient way to do this sorting and subsequent aggregation.
import numpy as np
import pandas as pd
a = pd.DataFrame()
for ind in np.unique(c_id.srch_destination):
    nlarg = (c_id[c_id.srch_destination == ind]
             .sort_values('count', ascending=False)
             .head(3)['hotel_booked'])
    # note: DataFrame.append was removed in pandas 2.0; this is pre-2.0 code
    a = a.append({'srch_destination': ind,
                  'top_hotels': ' '.join(map(str, nlarg))}, ignore_index=True)
a.to_csv('out.csv')
Use nlargest to get the top 3 based on the count column.
>>> (df.groupby('srch_destination')
.apply(lambda group: group.nlargest(3, 'count').hotel_booked.tolist()))
srch_destination
19 [6, 5, 2]
28 [14, 11, 1]
dtype: object
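To reproduce the exact space-joined format from the question, one possible variant (a sketch using the c_id frame from the question; sorting once up front avoids sorting inside each group):
a = (c_id.sort_values('count', ascending=False)
         .groupby('srch_destination')['hotel_booked']
         .apply(lambda s: ' '.join(map(str, s.head(3))))
         .reset_index(name='top_hotels'))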