I have a big dataframe with more than 100 columns. I am sharing a miniature version of my real dataframe below
ID rev_Q1 rev_Q5 rev_Q4 rev_Q3 rev_Q2 tx_Q3 tx_Q5 tx_Q2 tx_Q1 tx_Q4
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
I would like to do the below
a) sort the column names based on Quarters (ex:Q1,Q2,Q3,Q4,Q5..Q100..Q1000) for each column pattern
b) By column pattern, I mean the keyword that is before underscore which is rev and tx.
So, I tried the below but it doesn't work and it also shifts the ID column to the back
df = df.reindex(sorted(df.columns), axis=1)
I expect my output to be like as below. In real time, there are more than 100 columns with more than 30 patterns like rev, tx etc. I want my ID column to be in the first position as shown below.
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
For the provided example, df.sort_index(axis=1) should work fine.
If you have Q values higher that 9, use natural sorting with natsort:
from natsort import natsort_key
out = df.sort_index(axis=1, key=natsort_key)
Or using manual sorting with np.lexsort:
idx = df.columns.str.split('_Q', expand=True, n=1)
order = np.lexsort([idx.get_level_values(1).astype(float), idx.get_level_values(0)])
out = df.iloc[:, order]
Something like:
new_order = list(df.columns)
new_order = ['ID'] + sorted(new_order.remove("ID"))
df = df[new_order]
we manually put "ID" in front and then sort what is remaining
The idea is to create a dataframe from the column names. Create two columns: one for Variable and another one for Quarter number. Finally sort this dataframe by values then extract index.
idx = (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
.fillna(0).astype({'Q': int})
.sort_values(by=['V', 'Q']).index)
df = df.iloc[:, idx]
Output:
>>> df
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
0 1 1 1 1 1 1 1 1 1 1 1
1 2 1 1 1 1 1 1 1 1 1 1
>>> (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
.fillna(0).astype({'Q': int})
.sort_values(by=['V', 'Q']))
V Q
0 0 0
1 rev 1
5 rev 2
4 rev 3
3 rev 4
2 rev 5
9 tx 1
8 tx 2
6 tx 3
10 tx 4
7 tx 5
I've a question about a specific problem in pandas:
I have in a df a column with the following values:
5
4
3
2
1
0
0
0
0
1
2
3
4
5
I want to select all the rows from the first 5 to the last 0:
5
4
3
2
1
0
0
0
0
I tried with drop duplicates, but i loose the last three zeroes.
I'm thinking aboutusing a for cycle and stop when the i-th value of the column is greater than the i-1 value, but i don't know how to make such a cycle for a dataframe in pandas.
Can someone help me?
Thank you in advance, I hope I've explained the problem clearly.
You could use DataFrame.shift to compare with the next row, and keep only those that are less or equal than the previous. Here I use np.r_ to include the first value too:
import numpy as np
df[np.r_[True, df.col.le(df.col.shift()).to_numpy()[1:]]]
col
0 5
1 4
2 3
3 2
4 1
5 0
6 0
7 0
8 0
Let us try cummax:
s = df['col']
df.loc[s.eq(5).cummax() & s[::-1].eq(0).cummax()]
col
0 5
1 4
2 3
3 2
4 1
5 0
6 0
7 0
8 0
I am beginner, and I really need help on the following:
I need to do similar to the following but on a two dimensional dataframe Identifying consecutive occurrences of a value
I need to use this answer but for two dimensional dataframe. I need to count at least 2 consecuetive ones along the columns dimension. Here is a sample dataframe:
my_df=
0 1 2
0 1 0 1
1 0 1 0
2 1 1 1
3 0 0 1
4 0 1 0
5 1 1 0
6 1 1 1
7 1 0 1
The output I am looking for is:
0 1 2
0 3 5 4
Instead of the column 'consecutive', I need a new output called "out_1_df" for line
df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
So that later I can do
threshold = 2;
out_2_df= (out_1_df > threshold).astype(int)
I tried the following:
out_1_df= my_df.groupby(( my_df != my_df.shift(axis=0)).cumsum(axis=0))
out_2_df =`(out_1_df > threshold).astype(int)`
How can I modify this?
Try:
import pandas as pd
df=pd.DataFrame({0:[1,0,1,0,0,1,1,1], 1:[0,1,1,0,1,1,1,0], 2: [1,0,1,1,0,0,1,1]})
out_2_df=((df.diff(axis=0).eq(0)|df.diff(periods=-1,axis=0).eq(0))&df.eq(1)).sum(axis=0)
>>> out_2_df
[3 5 4]
I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light' : pd.Series(['b','b','c','a','a','a','a'], index=[1,2,3,4,5,6,9]),'injury' : pd.Series([1,5,5,5,2,2,4], index=[1,2,3,4,5,6,9])}
testdf = pd.DataFrame(d)
injury light
1 1 b
2 5 b
3 5 c
4 5 a
5 2 a
6 2 a
9 4 a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light',columns='injury',fill_value=0,aggfunc='count')
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
d 0 0 0 0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a','b','c','d']):
idx2 = (df['light'].isin([v]))
df2 = df[idx2]
print(df2.shape[0])
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?
Try this:
df = pd.crosstab(df.light, df.injury,margins=True)
df
injury 1 2 4 5 All
light
a 0 2 1 1 4
b 1 0 0 1 2
c 0 0 0 1 1
All 1 2 1 3 7
df["All"]
light
a 4
b 2
c 1
All 7
Given a DataFrame with a hierarchical index containing three levels (experiment, trial, slot) and a second DataFrame with a hierarchical index containing two levels (experiment, trial), how do I drop all the rows in the first DataFrame whose (experiment, trial) are not contained in the second dataframe?
Example data:
from io import StringIO
import pandas as pd
df1_data = StringIO(u',experiment,trial,slot,token\n0,btn144a10_p_RDT,0,0,4.0\n1,btn144a10_p_RDT,0,1,14.0\n2,btn144a10_p_RDT,1,0,12.0\n3,btn144a10_p_RDT,1,1,14.0\n4,btn145a07_p_RDT,0,0,6.0\n5,btn145a07_p_RDT,0,1,19.0\n6,btn145a07_p_RDT,1,0,17.0\n7,btn145a07_p_RDT,1,1,13.0\n8,chn004b06_p_RDT,0,0,6.0\n9,chn004b06_p_RDT,0,1,8.0\n10,chn004b06_p_RDT,1,0,2.0\n11,chn004b06_p_RDT,1,1,5.0\n12,chn008a06_p_RDT,0,0,12.0\n13,chn008a06_p_RDT,0,1,14.0\n14,chn008a06_p_RDT,1,0,6.0\n15,chn008a06_p_RDT,1,1,4.0\n16,chn008b06_p_RDT,0,0,3.0\n17,chn008b06_p_RDT,0,1,13.0\n18,chn008b06_p_RDT,1,0,12.0\n19,chn008b06_p_RDT,1,1,19.0\n20,chn008c04_p_RDT,0,0,17.0\n21,chn008c04_p_RDT,0,1,2.0\n22,chn008c04_p_RDT,1,0,1.0\n23,chn008c04_p_RDT,1,1,6.0\n')
df1 = pd.DataFrame.from_csv(df1_data).set_index(['experiment', 'trial', 'slot'])
df2_data = StringIO(u',experiment,trial,target\n0,btn145a07_p_RDT,1,13\n1,chn004b06_p_RDT,1,9\n2,chn008a06_p_RDT,0,15\n3,chn008a06_p_RDT,1,15\n4,chn008b06_p_RDT,1,1\n5,chn008c04_p_RDT,1,12\n')
df2 = pd.DataFrame.from_csv(df2_data).set_index(['experiment', 'trial'])
The first dataframe looks like:
token
experiment trial slot
btn144a10_p_RDT 0 0 4
1 14
1 0 12
1 14
btn145a07_p_RDT 0 0 6
1 19
1 0 17
1 13
chn004b06_p_RDT 0 0 6
1 8
1 0 2
1 5
chn008a06_p_RDT 0 0 12
1 14
1 0 6
1 4
chn008b06_p_RDT 0 0 3
1 13
1 0 12
1 19
chn008c04_p_RDT 0 0 17
1 2
1 0 1
1 6
The second dataframe looks like:
target
experiment trial
btn145a07_p_RDT 1 13
chn004b06_p_RDT 1 9
chn008a06_p_RDT 0 15
1 15
chn008b06_p_RDT 1 1
chn008c04_p_RDT 1 12
The result I want:
token
experiment trial slot
btn145a07_p_RDT 1 0 17
1 13
chn004b06_p_RDT 1 0 2
1 5
chn008a06_p_RDT 0 0 12
1 14
1 0 6
1 4
chn008b06_p_RDT 1 0 12
1 19
chn008c04_p_RDT 1 0 1
1 6
One way to do it would by using merge
merged = pd.merge(
df2.reset_index(),
df1.reset_index(),
left_on=['experiment', 'trial'],
right_on=['experiment', 'trial'],
how='left')
You just need to reindex merged to whatever you like (I could not tell exactly from the question).
What should work is
df1.loc[df2.index]
but multi indexing still has some problems. What does work is
df1.reset_index(2).loc[df2.index].set_index('slot', append=True)
which is a bit of a hack around this problem. Note that
df1.loc[df2.index[:1]]
gives garbage while
df.loc[df2.index[0]]
gives what you would expect. So passing multiple values from a m-level index to an n-level index where n > m > 2 doesn't work, though it should.