I read my data like this:
dataset = pd.read_csv(r' ...\x.csv')
Then I select the last data_length rows like this:
dataset = dataset.loc[len(dataset)-data_length: , :]
Then I do the shifting:
dataset_shifted = dataset.shift(1)
dataset_shifted = dataset_shifted.dropna()
I would also like to add a new row of 1s to the top of my dataset. But the following command doesn't work: because my data's indexes run from 3714 to 3722, it adds a row with index 0 to the end of the dataframe, not to the top!
dataset_shifted.loc[0, :] = 1
If there are no missing values in the DataFrame, you can simplify your solution by removing dropna and using DataFrame.fillna:
import pandas as pd
dataset = pd.DataFrame({
'B':[4,5,4],
'C':[7,8,9],
'D':[1,3,5],
}, index=[3714, 3715, 3716])
print (dataset)
B C D
3714 4 7 1
3715 5 8 3
3716 4 9 5
dataset_shifted = dataset.shift(1).fillna(1)
print (dataset_shifted)
B C D
3714 1.0 1.0 1.0
3715 4.0 7.0 1.0
3716 5.0 8.0 3.0
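Note the float values in the output above: shift inserts NaN into the first row, which upcasts the integer columns to float. If integers are needed, the dtype can be restored after the fill (safe here because fillna removed all NaN), e.g.:
dataset_shifted = dataset.shift(1).fillna(1).astype(int)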
If missing values are possible, set only the first row by position with DataFrame.iloc (this overwrites the NaN row in place, so the original index is kept):
dataset_shifted = dataset.shift(1)
dataset_shifted.iloc[0 , :] = 1
Your solution should be changed to:
dataset_shifted = dataset.shift(1)
dataset_shifted = dataset_shifted.dropna()
dataset_shifted.loc[0 , :] = 1
dataset_shifted = dataset_shifted.sort_index()
print (dataset_shifted)
B C D
0 1.0 1.0 1.0
3715 4.0 7.0 1.0
3716 5.0 8.0 3.0
I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({
'stuff_1_var_1': range(5),
'stuff_1_var_2': range(2, 7),
'stuff_2_var_1': range(3, 8),
'stuff_2_var_2': range(5, 10)
})
stuff_1_var_1 stuff_1_var_2 stuff_2_var_1 stuff_2_var_2
0 0 2 3 5
1 1 3 4 6
2 2 4 5 7
3 3 5 6 8
4 4 6 7 9
I would like to groupby based on the column headers and then add the mean and median of each group as new columns. So my expected output looks like this:
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1 1 4 4
1 2 2 5 5
2 3 3 6 6
3 4 4 7 7
4 5 5 8 8
Brief explanation:
we have two groups, stuff_1_var_ and stuff_2_var_, for which we calculate the mean and median per row. So, e.g. for stuff_1_var_ it would be:
# values from stuff_1_var_1 and stuff_1_var_2
(0 + 2) / 2 = 1 and
(1 + 3) / 2 = 2
The values are then added as a new column stuff_1_var_mean; analogously for the median and for stuff_2_var_.
I got as far as:
df = df.T
pattern = df.index.str.extract(r'(^stuff_\d_var_)', expand=False)
dfgb = df.groupby(pattern).agg(['mean', 'median']).T
stuff_1_var_ stuff_2_var_
0 mean 1 4
median 1 4
1 mean 2 5
median 2 5
2 mean 3 6
median 3 6
3 mean 4 7
median 4 7
4 mean 5 8
median 5 8
How can I do the final step(s)?
Your solution should be changed to:
df = df.T
pattern = df.index.str.extract(r'(^stuff_\d_var_)', expand=False)
dfgb = df.groupby(pattern).agg(['mean', 'median']).T.unstack()
dfgb.columns = dfgb.columns.map(lambda x: f'{x[0]}{x[1]}')
print (dfgb)
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1 1 4 4
1 2 2 5 5
2 3 3 6 6
3 4 4 7 7
4 5 5 8 8
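Here unstack moves the mean/median level from the rows into the columns, leaving column tuples like ('stuff_1_var_', 'mean'), which the map then joins into flat names.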
Unfortunately agg is not implemented for axis=1, so the following raises an error:
dfgb = df.groupby(pattern, axis=1).agg(['mean','median'])
NotImplementedError: axis other than 0 is not supported
A possible solution is therefore to compute the mean and median separately and then concat:
pattern = df.columns.str.extract(r'(^stuff_\d_var_)', expand=False)
g = df.groupby(pattern, axis=1)
dfgb = pd.concat([g.mean().add_suffix('mean'),
g.median().add_suffix('median')], axis=1)
dfgb = dfgb.iloc[:, [0,2,1,3]]
print (dfgb)
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1 1 4 4
1 2 2 5 5
2 3 3 6 6
3 4 4 7 7
4 5 5 8 8
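Note: groupby with axis=1 has since been deprecated (as of pandas 2.1), so on newer versions the transpose-based solution above is the more future-proof of the two.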
Here's a way you can do it:
col = 'stuff_1_var_'
use_col = [x for x in df.columns if 'stuff_1' in x]
df[f'{col}mean'] = df[use_col].mean(1)
df[f'{col}median'] = df[use_col].median(1)
col2 = 'stuff_2_var_'
use_col = [x for x in df.columns if 'stuff_2' in x]
df[f'{col2}mean'] = df[use_col].mean(1)
df[f'{col2}median'] = df[use_col].median(1)
print(df.iloc[:,-4:]) # showing last four new columns
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1.0 1.0 4.0 4.0
1 2.0 2.0 5.0 5.0
2 3.0 3.0 6.0 6.0
3 4.0 4.0 7.0 7.0
4 5.0 5.0 8.0 8.0
Of course, you can put it in a function to avoid repeating the same code.
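For example, a minimal sketch (the helper name add_group_stats is illustrative, not from the original post; it assumes the df defined in the question):
def add_group_stats(df, prefixes):
    # for each column-name prefix, append per-row mean and median columns
    out = df.copy()
    for prefix in prefixes:
        cols = [c for c in df.columns if c.startswith(prefix)]
        out[f'{prefix}mean'] = df[cols].mean(axis=1)
        out[f'{prefix}median'] = df[cols].median(axis=1)
    return out

df = add_group_stats(df, ['stuff_1_var_', 'stuff_2_var_'])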
I have two big dataframes (100K rows), One has 'values', one has 'types'. I want to assign a 'type' from df2 to a column in df1 based on depth. The depths are given as depth 'From' and depth 'To' columns. The 'types' are also defined by depth 'From' and 'To'. BUT they are NOT the same intervals. df1 depths may span multiple df2 types.
I want to assign the df2 'types' to df1 and if there are multiple types, try and capture that information too. Example below.
import pandas as pd
import numpy as np
df1=pd.DataFrame(np.array([[1,3,0.001],[3,5,0.005],[5,7,0.002],[7,10,0.001]]),columns=['From', 'To', 'val'])
df2=pd.DataFrame(np.array([[0.0,4,'A'],[4,5,'B'],[5,6,'C'],[6,8,'D'],[8,10,'E']]),columns=['From', 'To', 'Type'])
df1
Out[1]:
From To val
0 1.0 3.0 0.001
1 3.0 5.0 0.005
2 5.0 7.0 0.002
3 7.0 10.0 0.001
df2
Out[2]:
From To Type
0 0 4 A
1 4 5 B
2 5 6 C
3 6 8 D
4 8 10 E
Possible acceptable output:
Out[4]:
From To val Type
0 1 3 0.001 A
1 3 5 0.005 1 unit A,2 units B
2 5 7 0.002 1 unit C,1 unit D
3 7 10 0.001 1 unit D, 3 units E
Percentages of Types would also be a good output in the Type column.
One solution may be to create a new dataframe with high-resolution 'depths', forward-fill the types, and do a sort of VLOOKUP on the To and the From.
I also thought about the possibility of making a column in each df that is a 'set' based on the To and From cols.
Possibly a join or merge, but I'd need to get the data compatible first.
I don't know where to start. I'm hoping there is a neat way to tackle this; I have basically the exact same situation as this guy, but I don't speak 'R' and would like to possibly report the multiple-type info.
From df2 create an auxiliary Series, marking each "starting point"
of a unit (a range of length 1):
# From/To in df2 are strings here (np.array with mixed types coerces
# everything to str), so convert them to integers first for range()
df2[['From', 'To']] = df2[['From', 'To']].astype(float).astype(int)
units = df2.set_index('Type').apply(lambda row: pd.Series(
    range(row.From, row.To)), axis=1).stack()\
    .reset_index(level=1, drop=True)
The result is:
Type
A 0.0
A 1.0
A 2.0
A 3.0
B 4.0
C 5.0
D 6.0
D 7.0
E 8.0
E 9.0
dtype: float64
Then define a function generating Type for the current row:
def getType(row):
    # group the units covered by this row's interval by Type
    gr = units[units.ge(row.From) & units.lt(row.To)].groupby(level=0)
    if gr.ngroups == 1:
        # a single Type covers the whole interval: return just its name
        return gr.ngroup().index[0]
    txt = []
    for key, grp in gr:
        siz = grp.size
        un = 'unit' if siz == 1 else 'units'
        txt.append(f'{siz} {un} {key}')
    return ','.join(txt)
And to generate the Type column, apply it to each row:
df1['Type'] = df1.apply(getType, axis=1)
The result is:
From To val Type
0 1.0 3.0 0.001 A
1 3.0 5.0 0.005 1 unit A,1 unit B
2 5.0 7.0 0.002 1 unit C,1 unit D
3 7.0 10.0 0.001 1 unit D,2 units E
This result is a bit different from your expected result, but I think
you created yours in a slightly inconsistent way.
I think that my solution is correct (at least more consistent), because:
Row 1.0 - 3.0 is entirely within the limits of 0 4 A, so the
result is just A (like in your post).
Row 3.0 - 5.0 can be "divided" into:
3.0 - 4.0 is within the limits of 0 4 A (1 unit),
4.0 - 5.0 is within the limits of 4 5 B (also 1 unit,
but you want 2 units here).
Row 5.0 - 7.0 can be again "divided" into:
5.0 - 6.0 is within the limits of 5 6 C (1 unit),
6.0 - 7.0 is within the limits of 6 8 D (1 unit, just like you did).
Row 7.0 - 10.0 can be "divided" into:
7.0 - 8.0 is within the limits of 6 8 D (1 unit, just like you did),
8.0 - 10.0 is within the limits of 8 10 E (2 units, not 3 as you wrote).
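Side note: since the question mentions that percentages of Types would also be acceptable, the same units Series supports that directly. A minimal sketch along the same lines (getTypePct is an illustrative name, not part of the original answer):
def getTypePct(row):
    # share of each Type = its unit count divided by the row's total span
    gr = units[units.ge(row.From) & units.lt(row.To)].groupby(level=0)
    total = row.To - row.From
    return ','.join(f'{key}: {grp.size / total:.0%}' for key, grp in gr)

df1['TypePct'] = df1.apply(getTypePct, axis=1)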
I was wondering if there is an efficient way to add rows to a DataFrame that e.g. contain the average or a predefined value in case there are not enough rows for a specific value in another column. I guess this description of the problem is not the best, which is why you will find an example below:
Say we have the Dataframe
df1
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
And we want to have 2 rows for each client A, B, C, D, no matter whether these 2 rows already exist or not. So for clients A and B we can just keep the existing rows; for C we want to add a row that says Client = C, NumberOfProducts = average of the existing rows = 9, and ID is not of interest (so we could set it to the smallest existing ID - 1 = 0; any other value, even NaN, would also be possible). For client D there is no existing row at all, so we want to add 2 rows where NumberOfProducts equals the constant 2.5. The output should then look like this:
df1
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
C 9 0
D 2.5 NaN
D 2.5 NaN
What I have done so far is to loop through the dataframe and add rows where necessary. Since this is pretty inefficient, any better solution would be highly appreciated.
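For reference, the sample frame can be constructed like this (a sketch reconstructing the table shown above):
import pandas as pd

df = pd.DataFrame({'Client': ['A', 'A', 'B', 'B', 'C'],
                   'NumberOfProducts': [1, 5, 1, 6, 9],
                   'ID': [2, 1, 2, 1, 1]})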
Use:
clients = ['A','B','C','D']
N = 2
#test only values from list and also filter only 2 rows for each client if necessary
df = df[df['Client'].isin(clients)].groupby('Client').head(N)
#create helper counter and reshape by unstack
df1 = df.set_index(['Client',df.groupby('Client').cumcount()]).unstack()
#if only 1 row exists for a client, fill the missing second NumberOfProducts with the first
df1[('NumberOfProducts',1)] = df1[('NumberOfProducts',1)].fillna(df1[('NumberOfProducts',0)])
#...and fill the missing second ID with the first minus 1
df1[('ID',1)] = df1[('ID',1)].fillna(df1[('ID',0)] - 1)
#add missing clients by reindex
df1 = df1.reindex(clients)
#replace NumberOfProducts by constant 2.5
df1['NumberOfProducts'] = df1['NumberOfProducts'].fillna(2.5)
print (df1)
NumberOfProducts ID
0 1 0 1
Client
A 1.0 5.0 2.0 1.0
B 1.0 6.0 2.0 1.0
C 9.0 9.0 1.0 0.0
D 2.5 2.5 NaN NaN
#last reshape to original
df2 = df1.stack().reset_index(level=1, drop=True).reset_index()
print (df2)
Client NumberOfProducts ID
0 A 1.0 2.0
1 A 5.0 1.0
2 B 1.0 2.0
3 B 6.0 1.0
4 C 9.0 1.0
5 C 9.0 0.0
6 D 2.5 NaN
7 D 2.5 NaN
I have gone through the posts about filling multiple pandas columns in one go; however, my problem here is a little different, in the sense that I need to populate a missing column value with a specific other column's value, and to be able to do that for multiple columns in one go.
E.g., I can use the commands below individually to fill the NAs:
result1_copy['BASE_B'] = np.where(pd.isnull(result1_copy['BASE_B']), result1_copy['BASE_S'], result1_copy['BASE_B'])
result1_copy['QWE_B'] = np.where(pd.isnull(result1_copy['QWE_B']), result1_copy['QWE_S'], result1_copy['QWE_B'])
However, if I try populating them in one go, it does not work:
result1_copy['BASE_B','QWE_B'] = result1_copy['BASE_B', 'QWE_B'].fillna(result1_copy['BASE_S','QWE_S'])
Do we know why?
Please note I have only used 2 columns here for simplicity; however, I have tens of columns to impute, and they are either object, float or datetime.
Are datatypes the issue here?
You need to add [] to select a filtered DataFrame, and to align the columns add rename:
d = {'BASE_S':'BASE_B', 'QWE_S':'QWE_B'}
result1_copy[['BASE_B','QWE_B']] = (result1_copy[['BASE_B', 'QWE_B']]
                                     .fillna(result1_copy[['BASE_S','QWE_S']]
                                     .rename(columns=d)))
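This works because fillna aligns on both the index and the column labels; without the rename, the *_S columns would not match the *_B columns and nothing would be filled.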
More dynamic solution:
L = ['BASE_','QWE_']
orig = ['{}B'.format(x) for x in L]
new = ['{}S'.format(x) for x in L]
d = dict(zip(new, orig))
result1_copy[orig] = (result1_copy[orig].fillna(result1_copy[new]
.rename(columns=d)))
Another solution, if the B and S columns match pairwise:
for x in ['BASE_','QWE_']:
result1_copy[x + 'B'] = result1_copy[x + 'B'].fillna(result1_copy[x + 'S'])
Sample:
import numpy as np
import pandas as pd
result1_copy = pd.DataFrame({'A':list('abcdef'),
'BASE_B':[np.nan,5,4,5,5,np.nan],
'QWE_B':[np.nan,8,9,4,2,np.nan],
'BASE_S':[1,3,5,7,1,0],
'QWE_S':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (result1_copy)
A BASE_B BASE_S F QWE_B QWE_S
0 a NaN 1 a NaN 5
1 b 5.0 3 a 8.0 3
2 c 4.0 5 a 9.0 6
3 d 5.0 7 b 4.0 9
4 e 5.0 1 b 2.0 2
5 f NaN 0 b NaN 4
d = {'BASE_S':'BASE_B', 'QWE_S':'QWE_B'}
result1_copy[['BASE_B','QWE_B']] = (result1_copy[['BASE_B', 'QWE_B']]
.fillna(result1_copy[['BASE_S','QWE_S']]
.rename(columns=d)))
print (result1_copy)
A BASE_B BASE_S F QWE_B QWE_S
0 a 1.0 1 a 5.0 5
1 b 5.0 3 a 8.0 3
2 c 4.0 5 a 9.0 6
3 d 5.0 7 b 4.0 9
4 e 5.0 1 b 2.0 2
5 f 0.0 0 b 4.0 4
I am trying to clean up an Excel file for some further research. The problem I have is that I want to merge the first and second row. The code I have now:
xl = pd.ExcelFile("nanonose.xls")
df = xl.parse("Sheet1")
df = df.drop('Unnamed: 2', axis=1)
## Tried this line but no luck
##print(df.head().combine_first(df.iloc[[0]]))
The output of this is:
Nanonose Unnamed: 1 A B C D E \
0 Sample type Concentration NaN NaN NaN NaN NaN
1 Water 9200 95.5 21.0 6.0 11.942308 64.134615
2 Water 9200 94.5 17.0 5.0 5.484615 63.205769
3 Water 9200 92.0 16.0 3.0 11.057692 62.586538
4 Water 4600 53.0 7.5 2.5 3.538462 35.163462
F G H
0 NaN NaN NaN
1 21.498560 5.567840 1.174135
2 19.658560 4.968000 1.883444
3 19.813120 5.192480 0.564835
4 6.876207 1.641724 0.144654
So, my goal is to merge the first and second row to get: Sample type | Concentration | A | B | C | D | E | F | G | H
Could someone help me merge these two rows?
I think you need numpy.concatenate, with a similar principle to cᴏʟᴅsᴘᴇᴇᴅ's answer:
import numpy as np
df.columns = np.concatenate([df.iloc[0, :2], df.columns[2:]])
df = df.iloc[1:].reset_index(drop=True)
print (df)
Sample type Concentration A B C D E F \
0 Water 9200 95.5 21.0 6.0 11.942308 64.134615 21.498560
1 Water 9200 94.5 17.0 5.0 5.484615 63.205769 19.658560
2 Water 9200 92.0 16.0 3.0 11.057692 62.586538 19.813120
3 Water 4600 53.0 7.5 2.5 3.538462 35.163462 6.876207
G H
0 5.567840 1.174135
1 4.968000 1.883444
2 5.192480 0.564835
3 1.641724 0.144654
Just reassign df.columns.
df.columns = np.append(df.iloc[0, :2], df.columns[2:])
Or,
df.columns = df.iloc[0, :2].tolist() + (df.columns[2:]).tolist()
Next, skip the first row.
df = df.iloc[1:].reset_index(drop=True)
df
Sample type Concentration A B C D E F \
0 Water 9200 95.5 21.0 6.0 11.942308 64.134615 21.498560
1 Water 9200 94.5 17.0 5.0 5.484615 63.205769 19.658560
2 Water 9200 92.0 16.0 3.0 11.057692 62.586538 19.813120
3 Water 4600 53.0 7.5 2.5 3.538462 35.163462 6.876207
G H
0 5.567840 1.174135
1 4.968000 1.883444
2 5.192480 0.564835
3 1.641724 0.144654
reset_index is optional; use it if you want a 0-based index for your final output.
Fetch all the column names present in the second header row, then those in the first header row, and combine them into one list with all the column names. Then create a df from the Excel file with header=[0,1], and replace its headers with the combined list.
import pandas as pd

#read using the second header row to get its column names
df1 = pd.read_excel('nanonose.xls', header=1)
cols1 = df1.columns.tolist()
SecondRowColumns = []
for c in cols1:
    if "Unnamed" not in c and "NaN" not in c:
        SecondRowColumns.append(c)

#read using the first header row to get its column names
df2 = pd.read_excel('nanonose.xls', header=0)
cols2 = df2.columns.tolist()
FirstRowColumns = []
for c in cols2:
    if "Unnamed" not in c and "Nanonose" not in c:
        FirstRowColumns.append(c)

#combine them and reassign to a frame read with both header rows
AllColumn = SecondRowColumns + FirstRowColumns
df = pd.read_excel('nanonose.xls', header=[0, 1])
df.columns = AllColumn
print(df)