This question already has answers here:
How can I fill in missing values in a range with Pandas?
(2 answers)
Closed last year.
I want to add missing rows based on the column "id" in a dataframe. The ids should be consecutive integers running from 1 to 60000. A small example follows: here id ranges from 1 to 5, so I need to add rows with ids 1, 3, and 4, filled with value 0, to the table below.
id value1 value2
2 13 33
5 45 24
The final dataframe would become
id value1 value2
1 0 0
2 13 33
3 0 0
4 0 0
5 45 24
You can set the column 'id' as the index, then use the reindex method to conform df to a new index running from 1 to 5. reindex places NaN in locations that had no value in the previous index, so you use the fillna method to fill these with 0s, then reset the index and finally cast df back to int dtype:
df = df.set_index('id').reindex(range(1,6)).fillna(0).reset_index().astype(int)
Output:
id value1 value2
0 1 0 0
1 2 13 33
2 3 0 0
3 4 0 0
4 5 45 24
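A side note on scale: since the real frame runs to id 60000, reindex's fill_value parameter can do the filling in one step, keeping the int dtype and skipping the fillna/astype round-trip. A minimal sketch on the example data:
import pandas as pd

df = pd.DataFrame({'id': [2, 5], 'value1': [13, 45], 'value2': [33, 24]})

# fill_value=0 inserts the missing ids as real ints, so no NaN ever appears
out = df.set_index('id').reindex(range(1, 6), fill_value=0).reset_index()
For the full problem, range(1, 60001) replaces range(1, 6).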
You may want to look at the DataFrame.append method: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
It adds rows to a DataFrame
You could use something like the following:
for i in [1, 3, 4]:
    df = df.append({'id': i, 'value1': 0, 'value2': 0}, ignore_index=True)
If you want them to be in order by id afterwards, you could sort it:
df.sort_values(by=['id'], inplace=True)
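One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the same loop is usually written with pd.concat. A sketch, assuming the same column names as above:
import pandas as pd

# Build the missing rows in one frame, then concatenate once
new_rows = pd.DataFrame({'id': [1, 3, 4], 'value1': 0, 'value2': 0})
df = pd.concat([df, new_rows], ignore_index=True)
df.sort_values(by=['id'], inplace=True)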
Context: I'd like to "bump" the index level of a multi-index dataframe up. In other words, I'd like to put the index name of the dataframe on the same level as the columns of the multi-indexed dataframe.
Let's say we have this dataframe:
tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
tt.index.name = 'Index Column'
And we perform this change to add a multi-index level (like a table label):
tt = pd.concat([tt],keys=['Multi-Index Table Label'], axis=1)
Which results in this:
             Multi-Index Table Label
                                   A  B  C
Index Column
0                                  1  4  7
1                                  2  5  8
2                                  3  6  9
Desired Output: How can I make it so that the dataframe looks like this instead (notice the removal of the empty level on the dataframe/table):
             Multi-Index Table Label
Index Column                       A  B  C
0                                  1  4  7
1                                  2  5  8
2                                  3  6  9
Attempts: I was testing something out and you can essentially remove the index level by doing this:
tt.index.name = None
Which would result in :
  Multi-Index Table Label
                        A  B  C
0                       1  4  7
1                       2  5  8
2                       3  6  9
This essentially removes that extra level/empty line, but I do want to keep "Index Column", since it gives information about the type of data present in the index (which in this example is just 0, 1, 2, but could be years, dates, etc.).
How could I do that?
Thank you all in advance :)
How about this:
tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
tt.insert(loc=0, column='Index Column', value=tt.index)
tt = pd.concat([tt],keys=['Multi-Index Table Label'], axis=1)
tt = tt.style.hide_index()
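Note that Styler.hide_index is deprecated on pandas >= 1.4 in favor of Styler.hide(axis='index'), and that the result is a Styler object for display, not a regular DataFrame. A sketch of the same idea on current pandas:
import pandas as pd

tt = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# Move the index into a real column so its name sits on the same row as A, B, C
tt.insert(loc=0, column='Index Column', value=tt.index)
tt = pd.concat([tt], keys=['Multi-Index Table Label'], axis=1)
styled = tt.style.hide(axis='index')  # hide the now-redundant RangeIndex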
This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 6 months ago.
I have the following data frame
user_id value
1 5
1 7
1 11
1 15
1 35
2 8
2 9
2 14
I want to drop all rows that do not hold the maximum value for their user_id, resulting in a 2-row data frame:
user_id value
1 35
2 14
How can I do that?
You can use pandas.DataFrame.max after the grouping.
Assuming that your original dataframe is named df, try the code below:
out = df.groupby('user_id', as_index=False)['value'].max()
print(out)

   user_id  value
0        1     35
1        2     14
Edit:
If you want to group by more than one column (for example a second column such as sex), use this:
out = df.groupby(['user_id', 'sex'], as_index=False, sort=False)['value'].max()
print(out)
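If the frame ever carries extra columns and you want to keep the whole row holding each group's maximum (rather than taking column-wise maxima), the idxmax pattern from the linked duplicate also works; a sketch on the same df:
# idxmax returns the row label of each group's maximum 'value';
# loc then pulls those complete rows
out = df.loc[df.groupby('user_id')['value'].idxmax()]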
I have a dataframe, and I would like to select a subset of the dataframe using both index and column values. I can do both of these separately, but cannot figure out the syntax to do them simultaneously. Example:
import pandas as pd
# sample dataframe:
cid=[1,2,3,4,5,6,17,18,91,104]
c1=[1,2,3,1,2,3,3,4,1,3]
c2=[0,0,0,0,1,1,1,1,0,1]
df=pd.DataFrame(list(zip(c1,c2)),columns=['col1','col2'],index=cid)
df
Returns:
col1 col2
1 1 0
2 2 0
3 3 0
4 1 0
5 2 1
6 3 1
17 3 1
18 4 1
91 1 0
104 3 1
Using .loc, I can collect by index:
rel_index=[5,6,17]
relc1=[2,3]
relc2=[1]
df.loc[rel_index]
Returns:
col1 col2
5 2 1
6 3 1
17 3 1
Or I can select by column values:
df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
Returning:
col1 col2
5 2 1
6 3 1
17 3 1
104 3 1
However, I cannot do both. When I try the following:
df.loc[rel_index,df['col1'].isin(relc1) & df['col2'].isin(relc2)]
Returns:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
I have tried a few other variations (such as "&" instead of the ","), but these return the same or other errors.
Once I collect this slice, I am hoping to reassign values on the main dataframe. I imagine this will be trivial once the above is done, but I note it here in case it is not. My goal is to assign something like df2 in the following:
c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)
to the slice referenced by index and multiple column conditions (overwriting what was in the original dataframe).
The reason for the IndexingError is that you're calling df.loc with arrays of two different sizes.
df.loc[rel_index] has a length of 3 whereas df['col1'].isin(relc1) has a length of 10.
You need the index results to also have a length of 10. If you look at the output of df['col1'].isin(relc1), it is an array of booleans.
You can achieve a similar array with the proper length by replacing df.loc[rel_index] with df.index.isin([5,6,17])
so you end up with:
df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)]
which returns:
col1 col2
5 2 1
6 3 1
17 3 1
That said, I'm not sure why your index would ever look like this. Typically when slicing by index you would use df.iloc and your index would match the 0,1,2...etc. format.
Alternatively, you could first search by value - then assign the resulting dataframe to a new variable df2
df2 = df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]
then df2.loc[rel_index] would work without issue.
As for your overall goal, you can simply do the following:
c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)
df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)] = df2
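For readability (purely a style choice, not a different method), the combined condition can be bound to a mask first; the .loc assignment then aligns df2 on its index, so the row order of df2 does not matter:
# Same assignment with the boolean filter named explicitly
mask = df.index.isin(rel_index) & df['col1'].isin(relc1) & df['col2'].isin(relc2)
df.loc[mask, ['col1', 'col2']] = df2  # df2's index (5, 6, 17) lines up with the masked rows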
@Rexovas explains it quite well. This is an alternative where you can compute the filters on the index before assigning; it is a bit long and involves a MultiIndex, but once you get your head around MultiIndex, it should be intuitive:
(df
# move columns into the index
.set_index(['col1', 'col2'], append = True)
# filter based on the index
.loc(axis = 0)[rel_index, relc1, relc2]
# return cols 1 and 2
.reset_index(level = [-2, -1])
# assign values
.assign(col1 = c3, col2 = c4)
)
col1 col2
5 1 5
6 2 6
17 3 7
I have a dataframe that looks like this:
import pandas as pd
foo = pd.DataFrame({'var_name': ['r1','r2','r3','var', 'r1','r2','r3','var'],
'group': ['a','a','a','a','b','b','b','b'],
'value': [1,2,3,4,6,7,8,9]})
I want a new column in this dataframe that contains 1 if the value is larger than the per-group median of the value column over the rows where var_name is in ['r1','r2','r3'], and 0 otherwise.
The output dataframe should look like:
foo = pd.DataFrame({'var_name': ['r1','r2','r3','var', 'r1','r2','r3','var'],
'group': ['a','a','a','a','b','b','b','b'],
'value': [1,2,3,4,6,7,8,9],
'test': [0,0,1,1,0,0,1,1]})
Explanation of output dataframe:
The median of r1, r2, r3 for group a is 2, so rows r3 and var get a 1 in the test column.
Is there a Pythonic way of doing that?
The first idea is to filter only the rows matching the r values with boolean indexing, aggregate the median per group, map it back onto the groups with Series.map, compare with Series.lt, and finally convert the booleans to 0/1 values with Series.view:
s = foo[foo['var_name'].isin(['r1','r2','r3'])].groupby('group')['value'].median()
foo['test'] = foo['group'].map(s).lt(foo['value']).view('i1')
Another idea uses Series.where to replace non-matching values with NaN; a new Series is then created for the comparison via GroupBy.transform with median:
foo['test'] = (foo['value'].where(foo['var_name'].isin(['r1','r2','r3']))
.groupby(foo['group'])
.transform('median')
.lt(foo['value'])
.view('i1'))
print(foo)
var_name group value test
0 r1 a 1 0
1 r2 a 2 0
2 r3 a 3 1
3 var a 4 1
4 r1 b 6 0
5 r2 b 7 0
6 r3 b 8 1
7 var b 9 1
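A compatibility note: Series.view was deprecated in pandas 2.2, so on current pandas the boolean-to-integer step is usually written with astype. A sketch of the second approach in that style:
import pandas as pd

foo = pd.DataFrame({'var_name': ['r1', 'r2', 'r3', 'var', 'r1', 'r2', 'r3', 'var'],
                    'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                    'value': [1, 2, 3, 4, 6, 7, 8, 9]})

# Median of the r1/r2/r3 rows per group, broadcast back to every row;
# the 'var' rows are NaN-ed out by where() and ignored by median
med = (foo['value'].where(foo['var_name'].isin(['r1', 'r2', 'r3']))
                   .groupby(foo['group'])
                   .transform('median'))
foo['test'] = (foo['value'] > med).astype('int8')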
I am trying to do a division of column 0 by columns 1 and 2. From the below, I would like to return a dataframe of 10 rows, 3 columns. The first column should all be 1's. Instead I get a 10x10 dataframe. What am I doing wrong?
data = np.random.randn(10,3)
df = pd.DataFrame(data)
df[0] / df
First, you should create a 10-by-3 DataFrame with all columns equal to the first column, and then divide it by your DataFrame.
df[[0, 0, 0]] / df.values
or
df[[0, 0, 0]].values / df
If you want to keep the column names.
(I use .values to avoid alignment, which would fail due to the duplicate column labels.)
You need to match the dimension of the Series with the rows of the DataFrame. There are a few ways to do this but I like to use transposes.
data = np.random.randn(10,3)
df = pd.DataFrame(data)
(df[0] / df.T).T
0 1 2
0 1 -0.568096 -0.248052
1 1 -0.792876 -3.539075
2 1 -25.452247 1.434969
3 1 -0.685193 -0.540092
4 1 0.451879 -0.217639
5 1 -2.691260 -3.208036
6 1 0.351231 -1.467990
7 1 0.249589 -0.714330
8 1 0.033477 -0.004391
9 1 -0.958395 -1.530424
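The same row-wise broadcast can also be spelled without the double transpose via DataFrame.rdiv, which computes other / df while aligning along the chosen axis; a sketch on the same setup:
import numpy as np
import pandas as pd

data = np.random.randn(10, 3)
df = pd.DataFrame(data)

# rdiv computes other / df; axis=0 aligns the Series with the rows,
# so this equals (df[0] / df.T).T
out = df.rdiv(df[0], axis=0)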