I have a dataframe which looks something like this:
Df
lev1 lev2 lev3 lev4 lev5 description
RD21 Nan Nan Nan Nan Oil
Nan RD32 Nan Nan Nan Oil/Canola
Nan Nan RD33 Nan Nan Oil/Canola/Wheat
Nan Nan RD34 Nan Nan Oil/Canola/Flour
Nan Nan Nan RD55 Nan Oil/Canola/Flour/Thick
ED54 Nan Nan Nan Nan Rice
Nan ED66 Nan Nan Nan Rice/White
Nan Nan ED88 Nan Nan Rice/White/Jasmine
Nan Nan ED89 Nan Nan Rice/White/Basmati
Nan ED68 Nan Nan Nan Rice/Brown
I want to remove all the NaN values and keep only the non-NaN values, something like this:
DF2
code description
RD21 Oil
RD32 Oil/Canola
RD33 Oil/Canola/Wheat
RD34 Oil/Canola/Flour
RD55 Oil/Canola/Flour/Thick
.
.
.
How do I do this? I tried using the notna() method, but it just returns a boolean mask of the dataframe. Any help would be appreciated.
We can mask with notna():
import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {
        'l1': [np.nan, 5],
        'l2': [6, np.nan],
        'd': ['a', 'b']
    }
)

# boolean mask of the cells that actually hold a value
notna = df1[['l1', 'l2']].notna().values
# indexing with the mask flattens row by row, leaving one value per row
notna_values = df1[['l1', 'l2']].values[notna]
print(notna_values)

df2 = pd.DataFrame(df1['d'])
df2['code'] = notna_values
print(df2)
out:
d code
0 a 6.0
1 b 5.0
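Applied to a frame shaped like the one in the question, a minimal sketch (assuming exactly one non-NaN code per row; the frame below is a hypothetical reconstruction of two of your rows) could look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'lev1': ['RD21', np.nan],
    'lev2': [np.nan, 'RD32'],
    'lev3': [np.nan, np.nan],
    'lev4': [np.nan, np.nan],
    'lev5': [np.nan, np.nan],
    'description': ['Oil', 'Oil/Canola'],
})

lev_cols = ['lev1', 'lev2', 'lev3', 'lev4', 'lev5']
mask = df[lev_cols].notna().values    # True where a code is present
codes = df[lev_cols].values[mask]     # flattens row by row: one code per row

df2 = pd.DataFrame({'code': codes, 'description': df['description']})
print(df2)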
You can use stack and groupby like this to find the first non-null value in each row:
df['code'] = df[['lev1', 'lev2', 'lev3', 'lev4', 'lev5']].stack().groupby(level=0).first().reindex(df.index)
Now you can select the code and description columns:
df[['code', 'description']]
code description
0 RD21 Oil
1 RD32 Oil/Canola
2 RD33 Oil/Canola/Wheat
3 RD34 Oil/Canola/Flour
4 RD55 Oil/Canola/Flour/Thick
5 ED54 Rice
6 ED66 Rice/White
7 ED88 Rice/White/Jasmine
8 ED89 Rice/White/Basmati
9 ED68 Rice/Brown
You can apply a function over every row of df[cols] (the sub-view over the problematic columns) with axis=1, dropping the NaNs in each row and taking the single value that remains.
>>> cols = "lev1 lev2 lev3 lev4 lev5".split()
>>> df["code"] = df[cols].apply(lambda row: row.dropna().iloc[0], axis=1)
You can also drop the original columns with drop if you don't need them anymore.
>>> df.drop(columns=cols, inplace=True)
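If some rows might contain no code at all, row.dropna().iloc[0] would raise an IndexError. A hedged defensive variant (my addition, assuming numpy is imported as np) falls back to NaN for such rows:
>>> df["code"] = df[cols].apply(lambda r: r.dropna().iloc[0] if r.notna().any() else np.nan, axis=1)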
With your dataframe, I'd first make sure that Nan is actually np.nan and not a literal string 'Nan'. Then I'd impute the missing values as empty strings. Thus,
df.replace('Nan', np.nan, inplace=True)
df.fillna('', inplace=True)
Afterwards,
df['code'] = df['lev1'] + df['lev2'] + df['lev3'] + df['lev4'] + df['lev5']
And then df.drop(columns=[s for s in df.columns if s.startswith('lev')], inplace=True) to dispose of the old columns.
Note that this works only with the assumption given in OP's comment that there is one unique code in the five columns and the others are all NaN.
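Putting those steps together, a minimal end-to-end sketch (using a made-up two-row sample and assuming, as above, at most one code per row) might look like this:
import numpy as np
import pandas as pd

# hypothetical two-row sample with literal 'Nan' strings, as suspected above
df = pd.DataFrame({
    'lev1': ['RD21', 'Nan'],
    'lev2': ['Nan', 'RD32'],
    'description': ['Oil', 'Oil/Canola'],
})

df.replace('Nan', np.nan, inplace=True)   # literal strings become real NaN
df.fillna('', inplace=True)               # empty strings so '+' concatenation works
df['code'] = df['lev1'] + df['lev2']      # at most one non-empty code per row survives
df.drop(columns=[s for s in df.columns if s.startswith('lev')], inplace=True)
print(df[['code', 'description']])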
Select the columns whose names contain lev with filter(like='lev'), back-fill the NaNs along the columns, keep the first column lev1, and concat it with your other column(s):
>>> pd.concat([df.filter(like='lev').bfill(axis='columns')['lev1'].rename('code'),
...            df['description']], axis="columns")
code description
0 RD21 Oil
1 RD32 Oil/Canola
2 RD33 Oil/Canola/Wheat
3 RD34 Oil/Canola/Flour
4 RD55 Oil/Canola/Flour/Thick
5 ED54 Rice
6 ED66 Rice/White
7 ED88 Rice/White/Jasmine
8 ED89 Rice/White/Basmati
9 ED68 Rice/Brown
or using melt:
>>> df.melt('description', value_name='code') \
...     .dropna().drop(columns='variable') \
...     .reset_index(drop=True) \
...     [['code', 'description']]
code description
0 RD21 Oil
1 ED54 Rice
2 RD32 Oil/Canola
3 ED66 Rice/White
4 ED68 Rice/Brown
5 RD33 Oil/Canola/Wheat
6 RD34 Oil/Canola/Flour
7 ED88 Rice/White/Jasmine
8 ED89 Rice/White/Basmati
9 RD55 Oil/Canola/Flour/Thick
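Note that melt reorders the rows, as seen above. If the original order matters, a hedged variant (assuming pandas >= 1.1 for the ignore_index parameter) keeps the original index and sorts on it before resetting:
(df.melt('description', value_name='code', ignore_index=False)
   .dropna(subset=['code'])
   .drop(columns='variable')
   .sort_index()
   .reset_index(drop=True)
   [['code', 'description']])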
I have a text file with this structure:
- ASDF : |a=1|b=1|c=1|d=1
- QWER : |b=2|e=2|f=2
- ZXCV : |a=3|c=3|e=3|f=3|g=3
- TREW : |a=4|b=4|g=4
and I'd like to create a dataframe like this:
index  A    B    C    D    E    F    G
ASDF   1    1    1    1    NaN  NaN  NaN
QWER   NaN  2    NaN  NaN  2    2    NaN
ZXCV   3    NaN  3    NaN  3    3    3
TREW   4    4    NaN  NaN  NaN  NaN  4
Which solution could I implement? Consider that I don't know in advance how many rows I have, nor the number or names of the tags. Thank you all in advance!
Use read_csv first, then set the first column as the index, split by | with Series.str.split, reshape with DataFrame.stack, split by =, convert the index levels back to columns, pivot with DataFrame.pivot, and finally uppercase the column names:
df = pd.read_csv('file', sep=':', header=None)
df1 = (df.set_index(0)[1]
         .str.split('|', expand=True)
         .stack()
         .str.split('=', expand=True)
         .dropna()
         .rename_axis(['a', 'b'])
         .reset_index()
         .pivot(index='a', columns=0, values=1)
         .rename(columns=str.upper, index=lambda x: x.strip(' -'))
         .rename_axis(index=None, columns=None))
print(df1)
A B C D E F G
ASDF 1 1 1 1 NaN NaN NaN
QWER NaN 2 NaN NaN 2 2 NaN
TREW 4 4 NaN NaN NaN NaN 4
ZXCV 3 NaN 3 NaN 3 3 3
You can use read_csv to load the data, then replace to clean up the index name (removing the leading - ), str.extractall to get the key=value pairs, and unstack to reshape:
# using io.StringIO for the example
# in real-life use your file as input
import io
data = '''- ASDF : |a=1|b=1|c=1|d=1
- QWER : |b=2|e=2|f=2
- ZXCV : |a=3|c=3|e=3|f=3|g=3
- TREW : |a=4|b=4|g=4'''
# in real-life use:
# (pd.read_csv('filename.txt', sep=' : ', header=None,
df = (pd.read_csv(io.StringIO(data), sep=' : ', header=None,
                  names=['index', 'val'], engine='python')
        # clean up the index column
        .assign(index=lambda d: d['index'].replace(r'^-\s*', '', regex=True))
        # extract the key=val pairs, reshape, then merge back to the original
        .pipe(lambda d: d.join(d.pop('val')
                                .str.extractall(r'(\w+)=(\d+)')
                                .droplevel('match')
                                .set_index(0, append=True)  # the key becomes an index level named 0
                                [1].unstack(0)              # unstack the level named 0 (the keys) into columns
                               )
             )
     )
Alternative with a plain Python dictionary comprehension (if your format is strictly what you showed):
with open('filename.txt') as f:
    df = (pd.DataFrame
          # k is e.g. '- ASDF', so k[2:] drops the leading '- '; v[1:] drops the leading '|'
          .from_dict({k[2:]: dict(x.split('=') for x in v[1:].split('|'))
                      for l in f if l.strip()
                      for k, v in [l.strip().split(' : ', 1)]},
                     orient='index')
          .reset_index()
          )
output:
index a b c d e f g
0 ASDF 1 1 1 1 NaN NaN NaN
1 QWER NaN 2 NaN NaN 2 2 NaN
2 ZXCV 3 NaN 3 NaN 3 3 3
3 TREW 4 4 NaN NaN NaN NaN 4
I would harness regular expressions for that in the following way:
import re
import pandas as pd
text = '''ASDF : |a=1|b=1|c=1|d=1
QWER : |b=2|e=2|f=2
ZXCV : |a=3|c=3|e=3|f=3|g=3
TREW : |a=4|b=4|g=4'''
def extract_data(line):
    name = line.split()[0]  # name is everything before the first whitespace
    data = dict(re.findall(r'([a-z]+)=(\d+)', line))
    return {"name": name, **data}

df = pd.DataFrame(map(extract_data, text.splitlines()))
df.set_index("name", inplace=True)
print(df)
output
a b c d e f g
name
ASDF 1 1 1 1 NaN NaN NaN
QWER NaN 2 NaN NaN 2 2 NaN
ZXCV 3 NaN 3 NaN 3 3 3
TREW 4 4 NaN NaN NaN NaN 4
Explanation: I use a regular expression with capturing groups to find pairs of one-or-more lowercase ASCII letters ([a-z]+) and one-or-more digits (\d+) separated by an equals sign (=), convert the matches into a dict, and build each row's dict from that together with the name. Disclaimer: this solution assumes each key appears no more than once in a given line.
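Note that the sample text above omits the leading '- ' shown in the question. If the real file keeps that prefix, a hedged tweak (my assumption about the file format) strips it before taking the name:
def extract_data(line):
    line = line.lstrip('- ')  # drop a leading '- ' if present
    name = line.split()[0]
    data = dict(re.findall(r'([a-z]+)=(\d+)', line))
    return {"name": name, **data}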
I am trying to run a regex over a pandas dataframe in Python using this script:
import pandas as pd
df1 = {'data':['1gsmxx,2gsm','abc10gsm','10gsm','18gsm hhh4gsm','Abc:10gsm','5gsmaaab3gsmABC55gsm','abc - 15gsm','3gsm,,ff40gsm','9gsm','VV - fg 8gsm','kk 5gsm 00g','001….abc..5gsm']}
df1 = pd.DataFrame(df1)
df1
df1['Result'] = df1['data'].str.findall(r'(\d{1,3}\s?gsm)')
OR
df2 = df1['data'].str.extractall(r'(\d{1,3}\s?gsm)').unstack()
However, this puts multiple results into one column. Is it possible to get a result like the one attached below?
Use pandas.Series.str.extractall with unstack.
If you want to keep your original column alongside the matches, use pandas.concat.
df2 = df1['data'].str.extractall(r'(\d{1,3}\s?gsm)').unstack()
df = pd.concat([df1, df2.droplevel(0, axis=1)], axis=1)
print(df)
Output:
data 0 1 2
0 1gsmxx,2gsm 1gsm 2gsm NaN
1 abc10gsm 10gsm NaN NaN
2 10gsm 10gsm NaN NaN
3 18gsm hhh4gsm 18gsm 4gsm NaN
4 Abc:10gsm 10gsm NaN NaN
5 5gsmaaab3gsmABC55gsm 5gsm 3gsm 55gsm
6 abc - 15gsm 15gsm NaN NaN
7 3gsm,,ff40gsm 3gsm 40gsm NaN
8 9gsm 9gsm NaN NaN
9 VV - fg 8gsm 8gsm NaN NaN
10 kk 5gsm 00g 5gsm NaN NaN
11 001….abc..5gsm 5gsm NaN NaN
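If you prefer named columns instead of 0, 1, 2, a hedged follow-up (the gsm_... names are made up for illustration) flattens and renames them before concatenating:
df2 = df1['data'].str.extractall(r'(\d{1,3}\s?gsm)').unstack().droplevel(0, axis=1)
df2.columns = [f'gsm_{i + 1}' for i in df2.columns]
out = pd.concat([df1, df2], axis=1)
print(out)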
I have a data frame in Python pandas as follows (the first two columns, mygroup1 & mygroup2, are groupby columns):
df =
mygroup1 mygroup2 tname #dt #num #vek
a p alpha may 6 a
b q alpha june 8 b
c r beta may 9 c
d s beta june 11 d
I want to pivot the table on the values in the tname column, so that the new column names are the tname values joined with the names of the other columns (#dt, #num and #vek):
mygroup1 mygroup2 alpha#dt alpha#num alpha#vek beta#dt beta#num beta#vek
a p may 6 a nan nan nan
b q june 8 b nan nan nan
c r nan nan nan may 9 c
d s nan nan nan june 11 d
I am trying to do this with a pandas pivot table but I am not able to get the format shown above, which is what I really want. I will appreciate any help.
You can do:
# move tname into the index, then unstack it across the columns
new_df = df.set_index(['mygroup1', 'mygroup2', 'tname']).unstack('tname')
# flatten the resulting MultiIndex columns into names like 'alpha#dt'
new_df.columns = [f'{y}{x}' for x, y in new_df.columns]
new_df = new_df.sort_index(axis=1).reset_index()
Output:
mygroup1 mygroup2 alpha#dt alpha#num alpha#vek beta#dt beta#num beta#vek
0 a p may 6.0 a NaN NaN NaN
1 b q june 8.0 b NaN NaN NaN
2 c r NaN NaN NaN may 9.0 c
3 d s NaN NaN NaN june 11.0 d
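Since the question mentions a pivot table, roughly the same result should also be reachable with pivot_table; a hedged sketch, assuming exactly one row per (mygroup1, mygroup2, tname) combination so that aggfunc='first' simply picks that value:
wide = df.pivot_table(index=['mygroup1', 'mygroup2'], columns='tname',
                      values=['#dt', '#num', '#vek'], aggfunc='first')
# flatten ('#dt', 'alpha') -> 'alpha#dt', then restore the group columns
wide.columns = [f'{t}{c}' for c, t in wide.columns]
wide = wide.sort_index(axis=1).reset_index()
print(wide)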
I have different Excel files that I am processing with Pandas. I need to remove a certain number of rows from the top of each file. These extra rows could be empty or they could contain text. Pandas is combining some of the rows, so I am not sure how many need to be removed. For example, here is a sample Excel file (represented as CSV):
,,
,,
some text,,
,,
,,
,,
name, date, task
Jason,1-Jan,swim
Aem,2-Jan,workout
Here is my current python script:
import pandas as pd
xl = pd.ExcelFile('extra_rows.xlsx')
dfs = xl.parse(xl.sheet_names[0])
print ("dfs: ", dfs)
Here is the results when I print the dataframe:
dfs: Unnamed: 0 Unnamed: 1 Unnamed: 2
0 some other text NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 name date task
5 Jason 2016-01-01 00:00:00 swim
6 Aem 2016-01-02 00:00:00 workout
From the file, I would remove the first 6 rows. However, from the dataframe I would only remove 4. Is there a way to read in the Excel file with the data in its raw state so the number of rows remains consistent?
I used Python 3 and pandas 0.18.1. The Excel load function is pandas.read_excel. You can try setting the parameter header=None to achieve this. Here are sample codes:
(1) With default parameters, the result will ignore leading blank lines:
In [12]: pd.read_excel('test.xlsx')
Out[12]:
Unnamed: 0 Unnamed: 1 Unnamed: 2
0 text1 NaN NaN
1 NaN NaN NaN
2 n1 t2 c3
3 NaN NaN NaN
4 NaN NaN NaN
5 jim sum tim
(2) With header=None, the result will keep leading blank lines:
In [13]: pd.read_excel('test.xlsx', header=None)
Out[13]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 text1 NaN NaN
3 NaN NaN NaN
4 n1 t2 c3
5 NaN NaN NaN
6 NaN NaN NaN
7 jim sum tim
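From there, since the question says the number of junk rows varies, a hedged follow-up sketch locates the real header row programmatically (assuming the header row can be recognised by its first cell being 'name'):
import pandas as pd

raw = pd.read_excel('extra_rows.xlsx', header=None)   # keep every row, no header guessed
header_pos = raw.index[raw[0] == 'name'][0]           # position of the real header row
df = raw.iloc[header_pos + 1:].reset_index(drop=True)
df.columns = raw.iloc[header_pos]                     # use that row as the column names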
Here is what you are looking for:
import pandas as pd
xl = pd.ExcelFile('extra_rows.xlsx')
dfs = xl.parse(skiprows=6)
print ("dfs: ", dfs)
Check the docs on ExcelFile for more details.
If you read your file in with pd.read_excel and pass header=None, the blank rows should be included:
In [286]: df = pd.read_excel("test.xlsx", header=None)
In [287]: df
Out[287]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 something NaN NaN
3 NaN NaN NaN
4 name date other
5 1 2 3
I would like to use the interpolate function, but only between known data values in a pandas DataFrame column. The issue is that the first and last values in the column are often NaN and sometimes it can be many rows before a value is not NaN:
col 1 col 2
0 NaN NaN
1 NaN NaN
...
1000 1 NaN
1001 NaN 1 <-----
1002 3 NaN <----- only want to fill in these 'in between value' rows
1003 4 3
...
3999 NaN NaN
4000 NaN NaN
I am tying together a dataset which is updated 'on event', but separately for each column, and is indexed by Timestamp. This means that there are often rows where no data is recorded for some columns, hence a lot of NaNs!
I select the slice between a column's min and max values with idxmin and idxmax, and fill it with fillna using forward filling.
print(df)
# col 1 col 2
#0 NaN NaN
#1 NaN NaN
#1000 1 NaN
#1001 NaN 1
#1002 3 NaN
#1003 4 3
#3999 NaN NaN
#4000 NaN NaN
df.loc[df['col 1'].idxmin(): df['col 1'].idxmax()] = df.loc[df['col 1'].idxmin(): df['col 1'].idxmax()].fillna(method='ffill')
df.loc[df['col 2'].idxmin(): df['col 2'].idxmax()] = df.loc[df['col 2'].idxmin(): df['col 2'].idxmax()].fillna(method='ffill')
print(df)
# col 1 col 2
#0 NaN NaN
#1 NaN NaN
#1000 1 NaN
#1001 1 1
#1002 3 1
#1003 4 3
#3999 NaN NaN
#4000 NaN NaN
Added different solution, thanks HStro.
df['col 1'].loc[df['col 1'].first_valid_index() : df['col 1'].last_valid_index()] = df['col 1'].loc[df['col 1'].first_valid_index(): df['col 1'].last_valid_index()].astype(float).interpolate()
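As a side note, newer pandas versions (0.23 and later, if I recall correctly) can express this directly: interpolate(limit_area='inside') fills only the NaNs that are surrounded by valid values, leaving leading and trailing NaNs untouched, so the per-column slicing becomes unnecessary. A minimal sketch:
df_filled = df.interpolate(limit_area='inside')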