I have the following DataFrame:
Now I want to insert an empty row after every row where the column "Zweck" equals 7.
So, for example, the third row should be an empty row.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})
ren_dict = {i: df.columns[i] for i in range(len(df.columns))}
ind = df[df['f'] == 7].index
df = pd.DataFrame(np.insert(df.values, ind + 1, values=[33], axis=0))  # insert a placeholder row of 33s after each match
df.rename(columns=ren_dict, inplace=True)
ind_empt = df['a'] == 33
df[ind_empt] = ''
print(df)
Output
   a  b  f
0  1  1  1
1  2  2  7
2
3  3  3  3
4  4  4  4
5  5  5  7
6
Here the DataFrame is rebuilt in one step rather than appended to row by row, since repeated append operations are resource intensive. Placeholder rows with the value 33 are inserted because np.insert cannot insert string values into a numeric array. The columns are then renamed back to their original names with df.rename, and finally the rows where df['a'] == 33 are selected and set to empty strings.
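A sentinel-free variant is also possible (a sketch, not part of the original answer): cast the values to object dtype first, so empty strings can be inserted directly and the placeholder 33 is never needed:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})
ind = df[df['f'] == 7].index
arr = df.values.astype(object)  # object dtype accepts strings alongside numbers
df = pd.DataFrame(np.insert(arr, ind + 1, values='', axis=0), columns=df.columns)
print(df)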
Currently I have a dataframe looking like this:
import pandas as pd
import numpy as np
col1 = [2, 3, 3, 2]
col2 = [[1,3,2], 4, 2, [1,2]]
df = pd.DataFrame({"A": col1, "B": col2})
   A          B
0  2  [1, 3, 2]
1  3          4
2  3          2
3  2     [1, 2]
If I run df.loc[df['B'] == 4] it runs fine. However, if I try to run df.loc[df['B'] == [1, 3, 2]] it gives me an error: 'Lengths must match to compare'. Is there a way I can compare against a list within my pandas Series? The desired outcome is of course the row at index 0.
To compare with a list you have to use map/apply:
df[df['B'].map(lambda x: x == [1, 3, 2])]
   A          B
0  2  [1, 3, 2]
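An alternative sketch (not from the original answer): build the boolean mask with a list comprehension, which avoids calling a lambda through map for every element:
mask = [x == [1, 3, 2] for x in df['B']]  # element-wise list comparison
df[mask]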
I have a Pandas dataframe df with 102 columns. Each column is named differently, say A, B, C, etc., giving the original dataframe the following structure:
Column A  Column B  Column C  ...
Row 1
Row 2
...
Row n
I would like to change the column names from A, B, C, etc. to F1, F2, F3, ..., F102. I tried using df.columns but wasn't successful in renaming them this way. Is there any simple way to automatically rename all column names to F1 to F102, instead of renaming each column individually?
df.columns = ["F" + str(i) for i in range(1, 103)]
Note:
Instead of a “magic” number 103 you may use the calculated number of columns (+ 1), e.g.
len(df.columns) + 1, or
df.shape[1] + 1.
(Thanks to ALollz for this tip in his comment.)
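For example, that tip as a one-liner (a sketch):
df.columns = ["F" + str(i) for i in range(1, df.shape[1] + 1)]  # works for any column count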
One way to do this is to extract the column names and the row values as lists, rename the names in a loop using a running index, and build a new DataFrame from them:
import pandas as pd
d = {'Column A': [1, 2, 3, 4, 5, 4, 3, 2, 1], 'Column B': [1, 2, 3, 4, 5, 4, 3, 2, 1], 'Column c': [1, 2, 3, 4, 5, 4, 3, 2, 1]}
dataFrame = pd.DataFrame(data=d)
cols = list(dataFrame.columns.values)  # the original column names as a list
index = 1  # start at 1
for column in cols:
    cols[index - 1] = "F" + str(index)  # rename each column based on its index
    index += 1  # advance the index
vals = dataFrame.values.tolist()  # get the values of the rows
newDataFrame = pd.DataFrame(vals, columns=cols)  # new dataframe with the new column names and the row values
print(newDataFrame)
Output:
   F1  F2  F3
0   1   1   1
1   2   2   2
2   3   3   3
3   4   4   4
4   5   5   5
5   4   4   4
6   3   3   3
7   2   2   2
8   1   1   1
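The same result can also be reached without rebuilding the DataFrame, e.g. by passing a dict comprehension to rename (a sketch, not part of the original answer):
dataFrame = dataFrame.rename(columns={old: "F" + str(i) for i, old in enumerate(dataFrame.columns, start=1)})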
I want a new column based on certain conditions on existing columns. Below is what I am doing right now, but it takes too much time for huge data. Is there any efficient or faster way to do it?
DF["A"][0] = 0
for x in range(1,rows):
if(DF["B"][x]>DF["B"][x-1]):
DF["A"][x] = DF["A"][x-1] + DF["C"][x]
elif(DF["B"][x]<DF["B"][x-1]):
DF["A"][x] = DF["A"][x-1] - DF["C"][x]
else:
DF["A"][x] = DF["A"][x-1]
If I got you right, this is what you want:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [12, 15, 9, 8, 15],
                   'C': [3, 9, 12, 6, 8]})
df['A'] = np.where(df.index == 0,
                   0,
                   np.where(df['B'] > df['B'].shift(),
                            df['A'].shift() + df['C'],
                            np.where(df['B'] < df['B'].shift(),
                                     df['A'].shift() - df['C'],
                                     df['A'].shift())))
df
#       A   B   C
# 0   0.0  12   3
# 1  10.0  15   9
# 2 -10.0   9  12
# 3  -3.0   8   6
# 4  12.0  15   8
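Note that the loop in the question is recursive: each A[x] builds on the just-computed A[x-1], which nested np.where calls cannot reproduce. If that running behaviour is what is needed, the recurrence collapses to a cumulative sum of signed C values (a sketch, assuming A[0] = 0):
sign = np.sign(df['B'].diff().fillna(0))  # +1 if B rose, -1 if it fell, 0 if flat
df['A'] = (sign * df['C']).cumsum()       # A[x] = A[x-1] + sign[x] * C[x]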
“a new column based on certain conditions of existing columns”
I'm using the DataFrame provided by @zipa:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [12, 15, 9, 8, 15],
                   'C': [3, 9, 12, 6, 8]})
First approach
Here's a function that implements your specification efficiently. It works by leveraging pandas' indexing features, specifically boolean row masks:
def update(df):
    cond_larger = df['B'] > df['B'].shift().fillna(0)
    cond_smaller = df['B'] < df['B'].shift().fillna(0)
    cond_else = ~(cond_larger | cond_smaller)
    for cond, sign in [(cond_larger, +1),   # A[x-1] + C[x]
                       (cond_smaller, -1),  # A[x-1] - C[x]
                       (cond_else, 0)]:     # A[x-1] + 0
        if any(cond):
            df.loc[cond, 'A_updated'] = (df['A'].shift().fillna(0) +
                                         sign * df[cond]['C'])
    df['A'] = df['A_updated']
    df.drop(columns=['A_updated'], inplace=True)
    return df

update(df)
=>
      A   B   C
0   3.0  12   3
1  10.0  15   9
2 -10.0   9  12
3  -3.0   8   6
4  12.0  15   8
Optimized
It turns out you can use mask on the A column to achieve the same as above. Note you could combine the conditions into a single call of mask; however, I find it easier to read like this:
# specify conditions
cond_larger = df['B'] > df['B'].shift().fillna(0)
cond_smaller = df['B'] < df['B'].shift().fillna(0)
cond_else = ~(cond_larger | cond_smaller)
# apply to column A only; masking the whole DataFrame would overwrite B and C as well
A_shifted = df['A'].shift().fillna(0)
df['A'] = df['A'].mask(cond_larger, A_shifted + df['C'])
df['A'] = df['A'].mask(cond_smaller, A_shifted - df['C'])
df['A'] = df['A'].mask(cond_else, A_shifted)
=>
(same results as above)
Notes:
I'm assuming a default value of 0 for A/B[x-1]. If the first row should be treated differently, remove or replace .fillna(0); results will be different.
Conditions are checked in sequence. Depending on whether updates should use the original values in A or the values updated by a previous condition, you may not need the helper column A_updated.
See previous versions of this answer for a history of how I got here.
I have read the docs on Advanced indexing with hierarchical index, where using .loc for a MultiIndex is explained, and also this thread: Using .loc with a MultiIndex in pandas?
Still, I don't see how to select rows where (first index == some value) or (second index == some value).
Example:
import pandas as pd
index = pd.MultiIndex.from_arrays([['a', 'a', 'a', 'b', 'b', 'b'],
                                   ['a', 'b', 'c', 'a', 'b', 'c']],
                                  names=['i0', 'i1'])
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [6, 5, 4, 3, 2, 1]}, index=index)
which gives this DataFrame:
       x  y
i0 i1
a  a   1  6
   b   2  5
   c   3  4
b  a   4  3
   b   5  2
   c   6  1
How can I get rows where i0 == 'b' or i1 == 'b'?
       x  y
i0 i1
a  b   2  5
b  a   4  3
   b   5  2
   c   6  1
I think the easiest answer is to use the DataFrame.query function, which allows you to query the MultiIndex by level name as follows:
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_arrays([list("aaabbb"),
                                   list("abcabc")],
                                  names=['i0', 'i1'])
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [6, 5, 4, 3, 2, 1]}, index=index)
df.query('i0 == "b" | i1 == "b"')
returns:
       x  y
i0 i1
a  b   2  5
b  a   4  3
   b   5  2
   c   6  1
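If the index levels had no names, query can still address them by position through pandas' special ilevel names (a sketch):
df.query('ilevel_0 == "b" | ilevel_1 == "b"')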
Use get_level_values()
>>> mask = (df.index.get_level_values(0)=='b') | (df.index.get_level_values(1)=='b')
>>> df[mask] # same as df.loc[mask]
       x  y
i0 i1
a  b   2  5
b  a   4  3
   b   5  2
   c   6  1
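Level names work as well as positions here (a sketch of the same mask):
mask = (df.index.get_level_values('i0') == 'b') | (df.index.get_level_values('i1') == 'b')
df[mask]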
This might be possible with some logical condition on the index columns i0 and i1 using .loc. However, to me using .iloc seems easier:
You can get the iloc index via pd.MultiIndex.get_locs.
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_arrays([list("aaabbb"),
                                   list("abcabc")],
                                  names=['i0', 'i1'])
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [6, 5, 4, 3, 2, 1]}, index=index)
idx0 = index.get_locs(['b', slice(None)]) # i0 == 'b' => [3, 4, 5]
idx1 = index.get_locs([slice(None), 'b']) # i1 == 'b' => [1, 4]
idx = np.union1d(idx0, idx1)
print(df.iloc[idx])
will yield
       x  y
i0 i1
a  b   2  5
b  a   4  3
   b   5  2
   c   6  1
Note:
slice(None) means the same as [:] in index-slicing.
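For instance (a sketch), the same slice can be used directly in .loc to take every i1 under i0 == 'b':
df.loc[('b', slice(None)), :]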
Quite some time has passed since this question was raised. After reading the available answers, however, I still see the benefit of adding my response, which answers the original query exactly and with minimal code.
To select multiple indices as in your question, you can do:
df.loc[('b','b')]
Please note the most critical point here is to use parentheses () for the indices. This will give the output:
x    5
y    2
Name: (b, b), dtype: int64
You can further add a column name ('x' in my case) if needed, as below:
df.loc[('b','b'),'x']
This will give the output:
5
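To keep a DataFrame instead of collapsing to a Series or scalar, wrap the key in a list (a sketch):
df.loc[[('b', 'b')]]  # one-row DataFrame for index (b, b)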
I have a dataframe (df) of 12 rows x 5 columns. I sample 1 row from each label and create a new dataframe (df1) of 3 rows x 5 columns. I need the next sample drawn from df to exclude the rows that are already in df1. So how can I delete the already-sampled rows from df?
import pandas as pd
import numpy as np
# 12x5
df = pd.DataFrame(np.random.rand(12, 5))
label=np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
#3x5
df1 = pd.concat(g.sample(1) for idx, g in df.groupby('label'))
#My attempt. It should be a 9x5 dataframe
df2 = pd.concat(f.drop(idx) for idx, f in df1.groupby('label'))
df
df1
df2
Starting with this DataFrame:
df = pd.DataFrame(np.random.rand(12, 5))
label=np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
Your first sample is this:
df1 = pd.concat(g.sample(1) for idx, g in df.groupby('label'))
For the second sample, you can drop df1's indices from df:
pd.concat(g.sample(1) for idx, g in df.drop(df1.index).groupby('label'))
Out:
          0         1         2         3         4  label
2  0.188005  0.765640  0.549734  0.712261  0.334071      1
4  0.599812  0.713593  0.366226  0.374616  0.952237      2
8  0.631922  0.585104  0.184801  0.147213  0.804537      3
This is not an inplace operation. It doesn't modify the original DataFrame. It just drops the rows, returns a copy, and samples from that copy. If you want it to be permanent, you can do:
df2 = df.drop(df1.index)
And sample from df2 afterwards.
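Putting it together, a second sampling round might look like this (a sketch):
df2 = df.drop(df1.index)  # remove the rows already sampled into df1
df3 = pd.concat(g.sample(1) for idx, g in df2.groupby('label'))  # a fresh 3x5 sample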