I've searched for quite a while, but I haven't found any similar question. If there is one, please let me know!
I am currently trying to divide one dataframe into n dataframes, where n is equal to the number of columns of the original dataframe. All the resulting dataframes must keep the first column of the original dataframe. As an extra, it would be nice to gather them all together in a list, for example, for further access.
To visualize my intention, here is a brief example:
>> original df
GeneID A B C D E
1 0.3 0.2 0.6 0.4 0.8
2 0.5 0.3 0.1 0.2 0.6
3 0.4 0.1 0.5 0.1 0.3
4 0.9 0.7 0.1 0.6 0.7
5 0.1 0.4 0.7 0.2 0.5
My desired output would be something like this:
>> df1
GeneID A
1 0.3
2 0.5
3 0.4
4 0.9
5 0.1
>> df2
GeneID B
1 0.2
2 0.3
3 0.1
4 0.7
5 0.4
....
And so on, until all the columns from the original dataframe are covered.
What would be the best solution?
You can use df.columns to get all column names and then create sub-dataframes:
outdflist = []
# for each column beyond the first:
for col in oridf.columns[1:]:
    # create a sub-dataframe with the desired columns
    subdf = oridf[['GeneID', col]]
    # append it to the list of dataframes
    outdflist.append(subdf)

# to view all the dataframes created:
for df in outdflist:
    print(df)
Output:
GeneID A
0 1 0.3
1 2 0.5
2 3 0.4
3 4 0.9
4 5 0.1
GeneID B
0 1 0.2
1 2 0.3
2 3 0.1
3 4 0.7
4 5 0.4
GeneID C
0 1 0.6
1 2 0.1
2 3 0.5
3 4 0.1
4 5 0.7
GeneID D
0 1 0.4
1 2 0.2
2 3 0.1
3 4 0.6
4 5 0.2
GeneID E
0 1 0.8
1 2 0.6
2 3 0.3
3 4 0.7
4 5 0.5
The above for loop can also be written more simply as a list comprehension:
outdflist = [oridf[['GeneID', col]] for col in oridf.columns[1:]]
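If access by column name is handier than positional access, the same comprehension can fill a dict keyed by column name. This is a small variation on the above, not part of the original answer; `oridf` and the sample values are reconstructed from the question (shortened to two value columns):

```python
import pandas as pd

# sample data reconstructed from the question
oridf = pd.DataFrame({
    'GeneID': [1, 2, 3, 4, 5],
    'A': [0.3, 0.5, 0.4, 0.9, 0.1],
    'B': [0.2, 0.3, 0.1, 0.7, 0.4],
})

# one two-column sub-dataframe per value column, keyed by column name
outdfs = {col: oridf[['GeneID', col]] for col in oridf.columns[1:]}
print(outdfs['A'])
```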
You can do this with groupby:
d = {'df' + str(x): y for x, y in df.groupby(level=0, axis=1)}
d
Out[989]:
{'dfA': A
0 0.3
1 0.5
2 0.4
3 0.9
4 0.1, 'dfB': B
0 0.2
1 0.3
2 0.1
3 0.7
4 0.4, 'dfC': C
0 0.6
1 0.1
2 0.5
3 0.1
4 0.7, 'dfD': D
0 0.4
1 0.2
2 0.1
3 0.6
4 0.2, 'dfE': E
0 0.8
1 0.6
2 0.3
3 0.7
4 0.5, 'dfGeneID': GeneID
0 1
1 2
2 3
3 4
4 5}
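If the GeneID pairing should survive in every piece (and the extra 'dfGeneID' entry above is unwanted), one possible variation is to move GeneID into the index before splitting. This is a sketch, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'GeneID': [1, 2, 3, 4, 5],
    'A': [0.3, 0.5, 0.4, 0.9, 0.1],
    'B': [0.2, 0.3, 0.1, 0.7, 0.4],
})

indexed = df.set_index('GeneID')
# one single-column frame per remaining column; GeneID travels along as the index
d = {'df' + col: indexed[[col]] for col in indexed.columns}
print(d['dfA'])
```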
You can create a list of column names, and manually loop through and create a new DataFrame each loop.
>>> import pandas as pd
>>> d = {'col1':[1,2,3], 'col2':[3,4,5], 'col3':[6,7,8]}
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2 col3
0 1 3 6
1 2 4 7
2 3 5 8
>>> newstuff = []
>>> columns = list(df)
>>> for column in columns:
...     newstuff.append(pd.DataFrame(data=df[column]))
Unless your dataframe is unreasonably massive, the above code should do the job.
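To also carry the first column along, as the original question asks, the same loop idea can pair each remaining column with the first one; here `col1` plays the GeneID role in this toy data (a sketch based on the snippet above):

```python
import pandas as pd

d = {'col1': [1, 2, 3], 'col2': [3, 4, 5], 'col3': [6, 7, 8]}
df = pd.DataFrame(data=d)

first = df.columns[0]
# pair every remaining column with the first one
newstuff = [df[[first, column]] for column in df.columns[1:]]
print(newstuff[0])
```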
Related
I have a DataFrame:
A     B     C     D
1    0.1   0.2   0.3
2    0.4   0.7   0.2
3    0     4.25  100
3   -2.5   4.20  70
3   -2.5   3.5   80
4    0     3.6   81
4   -5     3.5   77
4   -5     3.4   75
4   -5     3.1   74
5    0     3.2   75
5    0.1   3.3   73
Now I want to skip the first two rows. After that, for each value of column 'A' (e.g. the rows where A is 3), I want to subtract the first row's value from the last row's value of column 'C', then divide that by the same subtraction for column 'B'. Basically: |3.5 - 4.25| / |-2.5 - 0| = 0.3.
I tried, and skipped the first two rows with:
cols = ['A']
df[cols] = df[df[cols] > 2][cols]
df = df.dropna()
Expected output:
NEW_COL  Result
1        0.3
2        0.1
3        1
Could you please help me?
IIUC, you can compute the last-first per group, then divide C/B and drop NAs:
out = (df.groupby('A')
         .agg(lambda g: abs(g.iloc[-1] - g.iloc[0]))
         .eval('C/B')
         .dropna()
         .reset_index(drop=True))
output:
0 0.3
1 0.1
2 1.0
dtype: float64
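For completeness, a self-contained version of the snippet above, with the question's table transcribed by hand. Single-row groups produce 0/0 = NaN and are dropped, which also takes care of skipping the first two rows:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5],
    'B': [0.1, 0.4, 0, -2.5, -2.5, 0, -5, -5, -5, 0, 0.1],
    'C': [0.2, 0.7, 4.25, 4.20, 3.5, 3.6, 3.5, 3.4, 3.1, 3.2, 3.3],
    'D': [0.3, 0.2, 100, 70, 80, 81, 77, 75, 74, 75, 73],
})

# |last - first| per A group for every column, then C/B per group
out = (df.groupby('A')
         .agg(lambda g: abs(g.iloc[-1] - g.iloc[0]))
         .eval('C/B')
         .dropna()
         .reset_index(drop=True))
print(out)
```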
For example, df is the data with features. I want to split the train + test sets out of the data using indices that have been given. How shall I get the train/test df?
df=
0 2 0.3 0.5 0.5
1 4 0.5 0.7 0.4
2 2 0.5 0.1 0.4
3 4 0.4 0.1 0.3
4 2 0.3 0.1 0.5
where the train indices are read with:
train = pd.read_csv('data_train.txt')
In this dataframe the indices are given. How should I get the training data from those indices?
Contents of data_train.txt (there are 10000 rows of data; the train indices are given in this txt file):
0
2
4
I want to use these indices to select the training data with features, so the final train set should look like this (see the index):
0 2 0.3 0.5 0.5
2 2 0.5 0.1 0.4
4 2 0.3 0.1 0.5
If you have a df as given by:
0 1 2 3 4
0 0 2 0.3 0.5 0.5
1 1 4 0.5 0.7 0.4
2 2 2 0.5 0.1 0.4
3 3 4 0.4 0.1 0.3
4 4 2 0.3 0.1 0.5
and another train_indices as given by:
0
0 0
1 2
2 4
then how you get the corresponding rows of df depends on how the data is organised:
# if you're trying to match the index of the df itself
# (select the single column of train_indices, since indexing with a
# whole DataFrame is not allowed)
train_df = df.iloc[train_indices[0]]

# if you're trying to match column 0, which might be important
# if it's not aligned to the index
train_df = df.loc[df[0].isin(train_indices[0])]
Both of these (in this case) return:
0 1 2 3 4
0 0 2 0.3 0.5 0.5
2 2 2 0.5 0.1 0.4
4 4 2 0.3 0.1 0.5
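Putting it together with the index file, a minimal end-to-end sketch; an in-memory buffer stands in for data_train.txt, which is assumed to hold one index per line with no header:

```python
import io
import pandas as pd

df = pd.DataFrame([[2, 0.3, 0.5, 0.5],
                   [4, 0.5, 0.7, 0.4],
                   [2, 0.5, 0.1, 0.4],
                   [4, 0.4, 0.1, 0.3],
                   [2, 0.3, 0.1, 0.5]])

# stands in for 'data_train.txt': one train index per line, no header
train_file = io.StringIO("0\n2\n4\n")
train_indices = pd.read_csv(train_file, header=None)[0]

# select the training rows by index label
train_df = df.loc[train_indices]
print(train_df)
```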
I have a dataframe in python that looks like this:
ID Value
001 0.5
001 0.2
001 0.5
001 0.0
002 0.4
002 0.6
002 0.6
I would like the data to be reshaped into something like this:
ID Val1 Val2 Val3 Val4
001 0.5 0.2 0.5 0.0
002 0.4 0.6 0.6 NaN
Can anyone help with this? My first thought was de-melting the data with "pivot", but without a value denoting the "Val" position, it doesn't work as intended.
Thanks!
Groupby your ID, then reset the index to keep the columns consistent, and unstack:
df.groupby('ID')['Value'].apply(lambda df: df.reset_index(drop=True)).unstack()
0 1 2 3
ID
1 0.5 0.2 0.5 0.0
2 0.4 0.6 0.6 NaN
Or, to not use ID as the index:
df.sort_values('ID').groupby('ID')['Value'].apply(lambda df: df.reset_index(drop=True)).unstack().reset_index()
ID 0 1 2 3
0 1 0.5 0.2 0.5 0.0
1 2 0.4 0.6 0.6 NaN
You can assign an indexer series, then pivot:
res = df.assign(ValNum=df.groupby('ID').cumcount()+1)\
.pivot(index='ID', columns='ValNum', values='Value')\
.reset_index()
print(res)
ValNum ID 1 2 3 4
0 1 0.5 0.2 0.5 0.0
1 2 0.4 0.6 0.6 NaN
This might work:
>>> df = pd.DataFrame({"id": ["001"]*4 + ["002"]*3, "value": [0.5, 0.2, 0.5, 0.0, 0.4, 0.6, 0.6]})
>>> df
id value
0 001 0.5
1 001 0.2
2 001 0.5
3 001 0.0
4 002 0.4
5 002 0.6
6 002 0.6
>>> pd.concat([pd.Series(list(g["value"]), name=x) for x, g in df.groupby("id")], axis=1).T
0 1 2 3
001 0.5 0.2 0.5 0.0
002 0.4 0.6 0.6 NaN
Now all you have to do is rename the columns/rows.
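The renaming step might look like this; the Val1..Val4 and ID labels are taken from the question's desired output, not from the original answer:

```python
import pandas as pd

df = pd.DataFrame({"id": ["001"]*4 + ["002"]*3,
                   "value": [0.5, 0.2, 0.5, 0.0, 0.4, 0.6, 0.6]})

wide = pd.concat([pd.Series(list(g["value"]), name=x)
                  for x, g in df.groupby("id")], axis=1).T
# 0-based positions -> Val1..Val4, and name the row labels ID
wide.columns = ['Val' + str(c + 1) for c in wide.columns]
wide.index.name = 'ID'
print(wide.reset_index())
```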
The data shown below is a simplified example. The actual data frame is a 3750-row, 2-column data frame. I need to reshape it into another structure.
A A2
0.1 1
0.4 2
0.6 3
B B2
0.8 1
0.7 2
0.9 3
C C2
0.3 1
0.6 2
0.8 3
How can I reshape the above data frame horizontally, as follows:
A A2 B B2 C C2
0.1 1 0.8 1 0.3 1
0.4 2 0.7 2 0.6 2
0.6 3 0.9 3 0.8 3
You can reshape your data and create a new dataframe:
cols = 6
rows = 4
df = pd.DataFrame(df.values.T.reshape(cols,rows).T)
df.rename(columns=df.iloc[0]).drop(0)
A B C A2 B2 C2
1 0.1 0.8 0.3 1 1 1
2 0.4 0.7 0.6 2 2 2
3 0.6 0.9 0.8 3 3 3
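Note the column order above is A B C A2 B2 C2 rather than the desired A A2 B B2 C C2; selecting the columns afterwards restores the requested order. A self-contained sketch, with the data transcribed from the question:

```python
import pandas as pd

# the header rows are mixed into the data, so everything is object dtype
raw = pd.DataFrame({0: ['A', 0.1, 0.4, 0.6, 'B', 0.8, 0.7, 0.9, 'C', 0.3, 0.6, 0.8],
                    1: ['A2', 1, 2, 3, 'B2', 1, 2, 3, 'C2', 1, 2, 3]})

out = pd.DataFrame(raw.values.T.reshape(6, 4).T)
out = out.rename(columns=out.iloc[0]).drop(0)
# restore the desired A A2 B B2 C C2 ordering
out = out[['A', 'A2', 'B', 'B2', 'C', 'C2']]
print(out)
```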
Try this if you don't want to hard-code your values:
df['header'] = pd.to_numeric(df[0], errors='coerce')
l = df['header'].values
m_l = l.reshape((np.isnan(l).sum(), -1))[:, 1:]
h = df[df['header'].isnull()][0].values
print(pd.DataFrame(dict(zip(h, m_l))))
Output:
A B C
0 0.1 0.8 0.3
1 0.4 0.7 0.6
2 0.6 0.9 0.8
I have a pandas dataframe which looks like one long row.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
________________________________________________________________________________________________
2010 | 0.1 0.5 0.5 0.7 0.5 0.5 0.5 0.5 0.9 0.5 0.5 0.8 0.3 0.3 0.6
I would like to reshape it as:
0 1 2 3 4
____________________________________
|0| 0.1 0.5 0.5 0.7 0.5
2010 |1| 0.5 0.5 0.5 0.9 0.5
|2| 0.5 0.8 0.3 0.3 0.6
I can certainly do it using a loop, but I'm guessing (un)stack and/or pivot might be able to do the trick; I just couldn't figure out how...
Symmetry/filling up blanks - if the data is not evenly divisible by the number of rows after unstacking - is not important for now.
EDIT:
I coded up the loop solution in the meantime:
df = my_data_frame
dk = pd.DataFrame()
break_after = 3
for i in range(len(df) // break_after):
    dl = pd.DataFrame(df[i*break_after:(i+1)*break_after]).T
    dl.columns = range(break_after)
    dk = pd.concat([dk, dl])
If there is only one index (2010), this will work fine.
df1 = pd.DataFrame(np.reshape(df.values, (3, 5)))
df1['Index'] = '2010'
df1.set_index('Index', append=True, inplace=True)
df1 = df1.reorder_levels(['Index', None])
Output:
0 1 2 3 4
Index
2010 0 0.1 0.5 0.5 0.7 0.5
1 0.5 0.5 0.5 0.9 0.5
2 0.5 0.8 0.3 0.3 0.6
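If there were several rows like 2010, a possible generalization (a sketch, assuming every row's length is divisible by the target width) reshapes each row separately and stacks the pieces under a MultiIndex:

```python
import pandas as pd

df = pd.DataFrame([[0.1, 0.5, 0.5, 0.7, 0.5, 0.5, 0.5, 0.5,
                    0.9, 0.5, 0.5, 0.8, 0.3, 0.3, 0.6]],
                  index=[2010])

n_cols = 5
# one reshaped block per original row, keyed by its index label
out = pd.concat({idx: pd.DataFrame(row.values.reshape(-1, n_cols))
                 for idx, row in df.iterrows()})
print(out)
```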