I am creating a dataframe to store information on samples. Some of my column labels have the format index:subindex. Is there a better way of doing that? I was looking at pd.MultiIndex, but my subindices are specific to each index.
import numpy as np
import pandas as pd
df = pd.DataFrame(
np.random.random(size=(1234, 6)),
columns=['ID',
'Charge:pH2', 'Charge:pH4', 'Charge:pH6',
'Extinction:Wavelength200nm', 'Extinction:Wavelength500nm'])
I would like to be able to call df.loc[:, 'ID'] or df.loc[:, 'Charge'] or df.loc[:, ('Charge', 'pH6')]
You could use MultiIndex.from_tuples:
import numpy as np
import pandas as pd
df = pd.DataFrame(
np.random.random(size=(1234, 6)),
columns=['ID','Charge:pH2', 'Charge:pH4', 'Charge:pH6','Extinction:Wavelength200nm', 'Extinction:Wavelength500nm'])
df.columns = pd.MultiIndex.from_tuples(map(tuple, df.columns.str.split(':')))
print(df.head(10))
Output
ID Charge ... Extinction
NaN pH2 ... Wavelength200nm Wavelength500nm
0 0.301592 0.137384 ... 0.074137 0.339948
1 0.737711 0.557524 ... 0.813727 0.586845
2 0.615398 0.529687 ... 0.148700 0.466916
3 0.411509 0.725513 ... 0.380019 0.876992
4 0.031172 0.623944 ... 0.311610 0.488207
5 0.022140 0.450630 ... 0.422927 0.479094
6 0.119681 0.221624 ... 0.710848 0.719201
7 0.252039 0.632321 ... 0.453235 0.952687
8 0.379501 0.356493 ... 0.141977 0.028836
9 0.249950 0.316020 ... 0.307337 0.881437
[10 rows x 6 columns]
All the required indexing schemes work:
print(df.loc[:, 'ID'].shape)
print(df.loc[:, 'Charge'].shape)
print(df.loc[:, ('Charge', 'pH6')].shape)
Output
(1234, 1)
(1234, 3)
(1234,)
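Note that df.loc[:, 'ID'] returns a one-column DataFrame rather than a Series, because the unsplit 'ID' label gets NaN as its second level. If you prefer a Series, one possible sketch (starting again from the original single-level columns) is to pad unsplit labels with an empty string instead:
# Pad labels that have no ':' with '' so every column has two clean levels
tuples = [tuple(c.split(':')) if ':' in c else (c, '') for c in df.columns]
df.columns = pd.MultiIndex.from_tuples(tuples)
df.loc[:, ('ID', '')]  # now a Series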
I think the best approach is to first move the columns that cannot be split (those without the separator) into the index, and then create the MultiIndex by splitting with expand=True:
np.random.seed(2019)
df = pd.DataFrame(
np.random.random(size=(3, 6)),
columns=['ID',
'Charge:pH2', 'Charge:pH4', 'Charge:pH6',
'Extinction:Wavelength200nm', 'Extinction:Wavelength500nm'])
df = df.set_index('ID')
df.columns = df.columns.str.split(':', expand=True)
print (df)
Charge Extinction
pH2 pH4 pH6 Wavelength200nm Wavelength500nm
ID
0.903482 0.393081 0.623970 0.637877 0.880499 0.299172
0.702198 0.903206 0.881382 0.405750 0.452447 0.267070
0.162865 0.889215 0.148476 0.984723 0.032361 0.515351
A solution without setting ID as the index is also possible, but you get NaN in the second level for the column names that were not split:
# starting again from the original df, without set_index('ID')
df.columns = df.columns.str.split(':', expand=True)
print (df)
ID Charge Extinction
NaN pH2 pH4 pH6 Wavelength200nm Wavelength500nm
0 0.903482 0.393081 0.623970 0.637877 0.880499 0.299172
1 0.702198 0.903206 0.881382 0.405750 0.452447 0.267070
2 0.162865 0.889215 0.148476 0.984723 0.032361 0.515351
Finally, select by column name; it is also possible to use DataFrame.xs if you want to select by the second level:
print (df['Charge'])
pH2 pH4 pH6
ID
0.903482 0.393081 0.623970 0.637877
0.702198 0.903206 0.881382 0.405750
0.162865 0.889215 0.148476 0.984723
print (df.xs('Charge', axis=1, level=0))
pH2 pH4 pH6
ID
0.903482 0.393081 0.623970 0.637877
0.702198 0.903206 0.881382 0.405750
0.162865 0.889215 0.148476 0.984723
print (df.xs('pH4', axis=1, level=1))
Charge
ID
0.903482 0.623970
0.702198 0.881382
0.162865 0.148476
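The same second-level selection can also be written with pd.IndexSlice, which keeps both column levels; a minimal sketch:
idx = pd.IndexSlice
df = df.sort_index(axis=1)        # slicing a MultiIndex requires lexsorted labels
print(df.loc[:, idx[:, 'pH4']])   # like xs(..., level=1), but keeps the 'Charge' level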
I have a sample dataframe as given below.
import pandas as pd
import numpy as np
NaN = np.nan
data = {'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Date': ['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-20 04:38:26',
                 '2021-09-01 00:12:29', '2021-09-01 11:20:58', '2021-09-02 09:20:58'],
        'Name': ['xx', 'xx', NaN, 'yy', NaN, NaN],
        'Height': [174, 174, NaN, 160, NaN, NaN],
        'Weight': [74, NaN, NaN, 58, NaN, NaN],
        'Gender': [NaN, 'Male', NaN, NaN, 'Female', NaN],
        'Interests': [NaN, NaN, 'Hiking,Sports', NaN, NaN, 'Singing']}
df1 = pd.DataFrame(data)
df1
I want to combine the data present on the same date into a single row. The 'Date' column is in timestamp format. Here is my attempt:
df1['Date'] = pd.to_datetime(df1['Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.agg(lambda x: ''.join(x.dropna().astype(str)))
.reset_index()
).replace('', np.nan)
This gives an output where, if there are multiple entries of the same value, the final result has repeated values in the same row, as shown below.
Obtained Output
However, I do not want the values to be repeated if there are multiple entries. The final output should look like the image shown below.
Required Output
The first row should have 'xx' and 174.0 instead of 'xxxx' and '174.0174.0'.
Any help is greatly appreciated. Thank you.
In your case, replace the agg join with first:
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
.first()
.reset_index()
).replace('', np.nan)
df_out
Out[113]:
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing
Since you're only trying to keep the first available value for each column for each date, you can do:
>>> df1.groupby(["ID", pd.Grouper(key='Date', freq='D')]).agg("first").reset_index()
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female None
2 B 2021-09-02 None NaN NaN None Singing
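If some columns should still be combined rather than taking only the first value (for example, collecting all of a day's Interests), a per-column aggregation dict is a possible middle ground; a sketch using the question's column names:
df1['Date'] = pd.to_datetime(df1['Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .agg({'Name': 'first', 'Height': 'first', 'Weight': 'first',
                   'Gender': 'first',
                   'Interests': lambda s: ','.join(s.dropna().unique())})
             .reset_index())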
Using Python 3.8, pandas 1.1.2
I have two dataframes with MultiIndexes.
df1 (multi-level columns):
user price
count sum
name date hour
A 9/17 1 33 34
A 9/17 2 66 55
A 9/17 3 77 2
A 9/17 4 88 1
df2:
seller_count
name date hour
A 9/17 1 100
A 9/17 15 66
I am trying to do a full outer join on the two of them.
Desired output:
user price
count sum seller_count
name date hour
A 9/17 1 33 34 100
A 9/17 2 66 55 NaN
A 9/17 3 77 2 NaN
A 9/17 4 88 1 NaN
A 9/17 15 NaN NaN 66
I am trying to find a way to do this without resetting the indexes. Any help? Thanks!
The solution from Pandas Dataframe Multiindex Merge does not seem to work; I only get seller_count when a row has the same name, date, and hour as in df1.
df1.columns outputs:
MultiIndex([( 'user', 'count'),
( 'price', 'sum')])
df2.columns outputs:
Index(["seller_count"])
Setup:
print (df1.index)
MultiIndex([('A', '9/17', 1),
('A', '9/17', 2),
('A', '9/17', 3),
('A', '9/17', 4)],
names=['name', 'date', 'hour'])
print (df1.columns)
MultiIndex([( 'user', 'count'),
('price', 'sum')],
)
print (df2.index)
MultiIndex([('A', '9/17', 1),
('A', '9/17', 15)],
names=['name', 'date', 'hour'])
print (df2.columns)
Index(['seller_count'], dtype='object')
First it is necessary to create a MultiIndex in df2, then use merge with an outer join:
df2.columns = pd.MultiIndex.from_product([[''], df2.columns])
print (df2.columns)
MultiIndex([('', 'seller_count')],
)
df = df1.merge(df2, left_index=True, right_index=True, how="outer")
print (df)
user price
count sum seller_count
name date hour
A 9/17 1 33.0 34.0 100.0
2 66.0 55.0 NaN
3 77.0 2.0 NaN
4 88.0 1.0 NaN
15 NaN NaN 66.0
df = df1.join(df2, how="outer")
print (df)
user price
count sum seller_count
name date hour
A 9/17 1 33.0 34.0 100.0
2 66.0 55.0 NaN
3 77.0 2.0 NaN
4 88.0 1.0 NaN
15 NaN NaN 66.0
print (df.columns)
MultiIndex([( 'user', 'count'),
('price', 'sum'),
( '', 'seller_count')],
)
print (df.index)
MultiIndex([('A', '9/17', 1),
('A', '9/17', 2),
('A', '9/17', 3),
('A', '9/17', 4),
('A', '9/17', 15)],
names=['name', 'date', 'hour'])
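After the join, the new column lives under the ('', 'seller_count') label. If you would rather have a meaningful top level, it can be renamed afterwards; a small sketch with an assumed name 'seller':
df = df.rename(columns={'': 'seller'}, level=0)
print(df['seller'])   # seller_count, now under the 'seller' top level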
I assume that the column names in the index of df1 are "single level".
You can achieve this in the following way.
The source file contains:
name,date,hour,user,price
, , ,count,sum
A,9/17,1,33,34
A,9/17,2,66,55
A,9/17,3,77,2
A,9/17,4,88,1
Note the spaces as the first 3 column names at the second level.
Read the file executing:
df1 = pd.read_csv('Input_1.csv', header=[0,1])
df1 = df1.set_index([('name', ' '), ('date', ' '), ('hour', ' ')])\
.rename_axis(index=['name', 'date', 'hour'])
This way the "2-level" column names, after being set as the index, get single-level names.
Other details to note:
index column names in both DataFrames are of single level,
df1 has a MultiIndex on columns,
df2 has an ordinary (single level) index on columns,
the result should have MultiIndex on columns.
To perform the join, you have to start by adding a MultiIndex level
to the column index in df2 (with a space as the top level):
df2.columns = pd.MultiIndex.from_product([[' '], df2.columns])
Then perform ordinary outer join:
result = df1.join(df2, how='outer')
The result is:
user price
count sum seller_count
name date hour
A 9/17 1 33.0 34.0 100.0
2 66.0 55.0 NaN
3 77.0 2.0 NaN
4 88.0 1.0 NaN
15 NaN NaN 66.0
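One detail: this variant uses a space (' ') rather than an empty string as the added top level, so the new column is addressed as result[(' ', 'seller_count')]. If that bothers you, a one-line sketch renames it afterwards:
result = result.rename(columns={' ': ''}, level=0)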
Below is some dummy data that reflects the data I am working with.
import pandas as pd
import numpy as np
from numpy import random
random.seed(30)
# Dummy data that represents a percent change
datelist = pd.date_range(start='1983-01-01', end='1994-01-01', freq='Y')
df1 = pd.DataFrame({"P Change_1": np.random.uniform(low=-0.55528, high=0.0396181, size=(11,)),
"P Change_2": np.random.uniform(low=-0.55528, high=0.0396181, size=(11,))})
#This dataframe contains the rows we want to operate on
df2 = pd.DataFrame({
'Loc1': [None, None, None, None, None, None, None, None, None, None, 2.5415],
'Loc2': [None, None, None, None, None, None, None, None, None, None, 3.2126],})
#Set the datetime index
df1 = df1.set_index(datelist)
df2 = df2.set_index(datelist)
df1:
P Change_1 P Change_2
1983-12-31 -0.172080 -0.231574
1984-12-31 -0.328773 -0.247018
1985-12-31 -0.160834 -0.099079
1986-12-31 -0.457924 0.000266
1987-12-31 0.017374 -0.501916
1988-12-31 -0.349052 -0.438816
1989-12-31 0.034711 0.036164
1990-12-31 -0.415445 -0.415372
1991-12-31 -0.206852 -0.413107
1992-12-31 -0.313341 -0.181030
1993-12-31 -0.474234 -0.118058
df2:
Loc1 Loc2
1983-12-31 NaN NaN
1984-12-31 NaN NaN
1985-12-31 NaN NaN
1986-12-31 NaN NaN
1987-12-31 NaN NaN
1988-12-31 NaN NaN
1989-12-31 NaN NaN
1990-12-31 NaN NaN
1991-12-31 NaN NaN
1992-12-31 NaN NaN
1993-12-31 2.5415 3.2126
DataFrame details:
First off, Loc1 corresponds to P Change_1 and Loc2 corresponds to P Change_2, etc. Looking at Loc1 first, I want to either fill in the DataFrame containing Loc1 and Loc2 with the relevant values or compute a new dataframe that has columns Calc1 and Calc2.
The calculation:
I want to start with the 1993 value of Loc1 (the last row) and calculate a new value for 1992 by taking Loc1 1992 = Loc1 1993 + (Loc1 1993 * P Change_1 1992). With the values filled in it would be 2.5415 + (-0.313341 * 2.5415), which equals about 1.74514.
This 1.74514 value will replace the NaN value in 1992, and then I want to use that calculated value to get a value for 1991. This means we now compute Loc1 1991 = Loc1 1992 + (Loc1 1992 * P Change_1 1991). I want to carry out this operation row-wise until it reaches the earliest value in the time series.
What is the best way to go about implementing this row-wise equation? I hope this makes some sense and any help is greatly appreciated!
df = pd.merge(df1, df2, how='inner', right_index=True, left_index=True) # merging dataframes on date index
df['count'] = range(len(df)) # creating a column, count for easy operation
# split the dataframe into two parts: one above the non-NaN row and one below
da1 = df[df['count'] <= df.dropna().iloc[0]['count']].copy()  # .copy() avoids SettingWithCopyWarning later
da2 = df[df['count'] >= df.dropna().iloc[0]['count']].copy()
da1.sort_values(by=['count'],ascending=False, inplace=True)
g=[da1,da2]
num_col=len(df1.columns)
for w in range(len(g)):
list_of_col=[]
count = 0
list_of_col=[list() for i in range(len(g[w]))]
for item, rows in g[w].iterrows():
n=[]
if count==0:
for p in range(1,num_col+1):
n.append(rows[f'Loc{p}'])
else:
for p in range(1,num_col+1):
n.append(list_of_col[count-1][p-1]+ list_of_col[count-1][p-1]* rows[f'P Change_{p}'])
list_of_col[count].extend(n)
count+=1
tmp=[list() for i in range(num_col)]
for d_ in range(num_col):
for x_ in range(len(list_of_col)):
tmp[d_].append(list_of_col[x_][d_])
z1=[]
z1.extend(tmp)
for i in range(num_col):
g[w][f'Loc{i+1}']=z1[i]
da1.sort_values(by=['count'] ,inplace=True)
final_df = pd.concat([da1, da2[1:]])
calc_df = pd.DataFrame()
for i in range(num_col):
calc_df[f'Calc{i+1}']=final_df[f'Loc{i+1}']
print(calc_df)
I have tried to explain all the obscure things I have done in the comments. I have edited my code so that the initial dataframes remain unaffected.
[Edited]: I have edited the code to handle any number of columns in the given dataframes.
[Edited]: If the column names in df1 and df2 are arbitrary, please run this block of code before running the code above. It renames the columns using a list comprehension:
df1.columns = [f'P Change_{i+1}' for i in range(len(df1.columns))]
df2.columns = [f'Loc{i+1}' for i in range(len(df2.columns))]
[EDITED] Perhaps there are better/more elegant ways to do this, but this worked fine for me:
def fill_values(df1, df2, cols1=None, cols2=None):
    if cols1 is None: cols1 = df1.columns
    if cols2 is None: cols2 = df2.columns
    for i in reversed(range(df2.shape[0] - 1)):
        for col1, col2 in zip(cols1, cols2):
            if np.isnan(df2[col2].iloc[i]):
                val = df2[col2].iloc[i+1] + df2[col2].iloc[i+1] * df1[col1].iloc[i]
                df2.loc[df2.index[i], col2] = val  # .loc avoids chained-assignment pitfalls
    return df1, df2
df1, df2 = fill_values(df1, df2)
print(df2)
Loc1 Loc2
1983-12-31 0.140160 0.136329
1984-12-31 0.169291 0.177413
1985-12-31 0.252212 0.235614
1986-12-31 0.300550 0.261526
1987-12-31 0.554444 0.261457
1988-12-31 0.544976 0.524925
1989-12-31 0.837202 0.935388
1990-12-31 0.809117 0.902741
1991-12-31 1.384158 1.544128
1992-12-31 1.745144 2.631024
1993-12-31 2.541500 3.212600
This assumes that the rows in df1 and df2 corresponds perfectly (I'm not querying the index, but only the location). Hope it helps!
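Since Loc[t] = Loc[t+1] + Loc[t+1] * PChange[t] = Loc[t+1] * (1 + PChange[t]), the recursion is a backward cumulative product and can be vectorized. A minimal sketch under the same positional-alignment assumption (Loc1 pairs with 'P Change_1', Loc2 with 'P Change_2'):
growth = (1 + df1).rename(columns={'P Change_1': 'Loc1', 'P Change_2': 'Loc2'})
factor = growth.iloc[:-1][::-1].cumprod()[::-1]  # prod of (1 + change) from row t down to the penultimate row
calc = factor * df2.iloc[-1].values              # scale the known last-row values
calc = pd.concat([calc, df2.iloc[[-1]]])         # the seed row itself is unchanged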
Just to be clear, what you need is Loc1[year]=Loc1[next_year] + PChange[year]*Loc1[next_year], right?
The below loop will do what you are looking for, but it just assumes that the number of rows in both df's is always equal, etc. (instead of matching the value in the index). From your description, I think this works for your data.
for i in range(df2.shape[0] - 2, -1, -1):
    df2.loc[df2.index[i], 'Loc1'] = df2['Loc1'].iloc[i+1] + (df1['P Change_1'].iloc[i] * df2['Loc1'].iloc[i+1])
Hope this helps :)
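The same positional loop, extended to both column pairs (a sketch under the same assumptions, using .loc to avoid chained assignment):
pairs = [('P Change_1', 'Loc1'), ('P Change_2', 'Loc2')]
for pcol, lcol in pairs:
    for i in range(df2.shape[0] - 2, -1, -1):
        df2.loc[df2.index[i], lcol] = df2[lcol].iloc[i + 1] * (1 + df1[pcol].iloc[i])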
I am new to Python and pandas. I have a pandas dataframe with monthly columns ranging from 2000 (2000-01) to 2016 (2016-06).
I want to find the average of every three months and assign it to a new quarterly column (2000q1). I know I can do the following:
df['2000q1'] = df[['2000-01', '2000-02', '2000-03']].mean(axis=1)
df['2000q2'] = df[['2000-04', '2000-05', '2000-06']].mean(axis=1)
.
.
.
df['2016q2'] = df[['2016-04', '2016-05', '2016-06']].mean(axis=1)
But this is very tedious. I would appreciate it if someone could help me find a better way.
You can use groupby on columns:
df.groupby(np.arange(len(df.columns))//3, axis=1).mean()
Or, the columns can be converted to datetime, and then you can use resample:
df.columns = pd.to_datetime(df.columns)
df.resample('Q', axis=1).mean()
Here's a demo:
cols = pd.date_range('2000-01', '2000-06', freq='MS')
cols = cols.strftime('%Y-%m')
cols
Out:
array(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'],
dtype='<U7')
df = pd.DataFrame(np.random.randn(10, 6), columns=cols)
df
Out:
2000-01 2000-02 2000-03 2000-04 2000-05 2000-06
0 -1.263798 0.251526 0.851196 0.159452 1.412013 1.079086
1 -0.909071 0.685913 1.394790 -0.883605 0.034114 -1.073113
2 0.516109 0.452751 -0.397291 -0.050478 -0.364368 -0.002477
3 1.459609 -1.696641 0.457822 1.057702 -0.066313 -0.910785
4 -0.482623 1.388621 0.971078 -0.038535 0.033167 0.025781
5 -0.016654 1.404805 0.100335 -0.082941 -0.418608 0.588749
6 0.684735 -2.007105 0.552615 1.969356 -0.614634 0.021459
7 0.382475 0.965739 -1.826609 -0.086537 -0.073538 -0.534753
8 1.548773 -0.157250 0.494819 -1.631516 0.627794 -0.398741
9 0.199049 0.145919 0.711701 0.305382 -0.118315 -2.397075
First alternative:
df.groupby(np.arange(len(df.columns))//3, axis=1).mean()
Out:
0 1
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
Second alternative:
df.columns = pd.to_datetime(df.columns)
df.resample('Q', axis=1).mean()
Out:
2000-03-31 2000-06-30
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
You can assign this to a DataFrame:
res = df.resample('Q', axis=1).mean()
Change column names as you like:
res = res.rename(columns=lambda col: '{}q{}'.format(col.year, col.quarter))
res
Out:
2000q1 2000q2
0 -0.053692 0.883517
1 0.390544 -0.640868
2 0.190523 -0.139108
3 0.073597 0.026868
4 0.625692 0.006805
5 0.496162 0.029067
6 -0.256585 0.458727
7 -0.159465 -0.231609
8 0.628781 -0.467487
9 0.352223 -0.736669
And attach this to your current DataFrame by:
pd.concat([df, res], axis=1)
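Note that axis=1 in groupby and resample is deprecated in recent pandas versions. An equivalent sketch for newer versions transposes, resamples on the index, and transposes back:
res = df.T.set_axis(pd.to_datetime(df.columns)).resample('Q').mean().T  # 'QE' on pandas >= 2.2
res.columns = ['{}q{}'.format(c.year, c.quarter) for c in res.columns]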
I have 2 dataframes:
category count_sec_target
3D-шутеры 0.09375
Cериалы 201.90625
GPS и ГЛОНАСС 0.015625
Hi-Tech 187.1484375
Абитуриентам 0.8125
Авиакомпании 8.40625
and
category count_sec_random
3D-шутеры 0.369565217
Hi-Tech 70.42391304
АСУ ТП, промэлектроника 0.934782609
Абитуриентам 1.413043478
Авиакомпании 14.93478261
Авто 480.3369565
I need to concatenate them and get:
category count_sec_target count_sec_random
3D-шутеры 0.09375 0.369565217
Cериалы 201.90625 0
GPS и ГЛОНАСС 0.015625 0
Hi-Tech 187.1484375 70.42391304
Абитуриентам 0.8125 1.413043478
Авиакомпании 8.40625 14.93478261
АСУ ТП, промэлектроника 0 0.934782609
Авто 0 480.3369565
Next I want to divide the values in the columns: (count_sec_target / count_sec_random) * 100%.
But when I try to concatenate the dataframes
frames = [df1, df2]
df = pd.concat(frames)
I get
category count_sec_random count_sec_target
0 3D-шутеры 0.369565 NaN
1 Hi-Tech 70.423913 NaN
2 АСУ ТП, промэлектроника 0.934783 NaN
3 Абитуриентам 1.413043 NaN
4 Авиакомпании 14.934783 NaN
I also tried df = df1.append(df2), but I got the wrong result.
How can I fix that?
df3 = pd.concat([d.set_index('category') for d in frames], axis=1).fillna(0)
df3['ratio'] = df3.count_sec_target / df3.count_sec_random * 100
df3
Setup Reference
import pandas as pd
from io import StringIO
t1 = """category;count_sec_target
3D-шутеры;0.09375
Cериалы;201.90625
GPS и ГЛОНАСС;0.015625
Hi-Tech;187.1484375
Абитуриентам;0.8125
Авиакомпании;8.40625"""
t2 = """category;count_sec_random
3D-шутеры;0.369565217
Hi-Tech;70.42391304
АСУ ТП, промэлектроника;0.934782609
Абитуриентам;1.413043478
Авиакомпании;14.93478261
Авто;480.3369565"""
df1 = pd.read_csv(StringIO(t1), sep=';')
df2 = pd.read_csv(StringIO(t2), sep=';')
frames = [df1, df2]
Merge should be appropriate here:
df1.merge(df2, on='category', how='outer').fillna(0)
To get the division output, simply do:
df['division'] = df['count_sec_target'].div(df['count_sec_random']) * 100
where df is the merged DataFrame.
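One caveat: after fillna(0), dividing by a zero count_sec_random produces inf. A possible hedge is to put those zeros back to NaN before dividing:
df['division'] = (df['count_sec_target']
                  .div(df['count_sec_random'].replace(0, np.nan)) * 100)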