I am trying to apply a regex to a pandas DataFrame column using this script:
import pandas as pd
df1 = {'data':['1gsmxx,2gsm','abc10gsm','10gsm','18gsm hhh4gsm','Abc:10gsm','5gsmaaab3gsmABC55gsm','abc - 15gsm','3gsm,,ff40gsm','9gsm','VV - fg 8gsm','kk 5gsm 00g','001….abc..5gsm']}
df1 = pd.DataFrame(df1)
df1
df1['Result'] = df1['data'].str.findall(r'(\d{1,3}\s?gsm)')
OR
df2 = df1['data'].str.extractall(r'(\d{1,3}\s?gsm)').unstack()
However, this puts multiple matches into one column. Is it possible to get a result like the one attached below?
Use pandas.Series.str.extractall with unstack.
If you want to keep your original column alongside, use pandas.concat.
df2 = df1['data'].str.extractall(r'(\d{1,3}\s?gsm)').unstack()
df = pd.concat([df1, df2.droplevel(0, axis=1)], axis=1)
print(df)
Output:
data 0 1 2
0 1gsmxx,2gsm 1gsm 2gsm NaN
1 abc10gsm 10gsm NaN NaN
2 10gsm 10gsm NaN NaN
3 18gsm hhh4gsm 18gsm 4gsm NaN
4 Abc:10gsm 10gsm NaN NaN
5 5gsmaaab3gsmABC55gsm 5gsm 3gsm 55gsm
6 abc - 15gsm 15gsm NaN NaN
7 3gsm,,ff40gsm 3gsm 40gsm NaN
8 9gsm 9gsm NaN NaN
9 VV - fg 8gsm 8gsm NaN NaN
10 kk 5gsm 00g 5gsm NaN NaN
11 001….abc..5gsm 5gsm NaN NaN
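If the numbered match columns (0, 1, 2) are awkward downstream, one optional variation is to prefix them before concatenating; this is just a sketch, and 'gsm_' is an arbitrary name, not anything required:

df2 = df1['data'].str.extractall(r'(\d{1,3}\s?gsm)').unstack().droplevel(0, axis=1)
# add_prefix renames 0, 1, 2 to gsm_0, gsm_1, gsm_2 before the concat
df = pd.concat([df1, df2.add_prefix('gsm_')], axis=1)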
I have a dataframe which looks something like this:
Df
lev1 lev2 lev3 lev4 lev5 description
RD21 Nan Nan Nan Nan Oil
Nan RD32 Nan Nan Nan Oil/Canola
Nan Nan RD33 Nan Nan Oil/Canola/Wheat
Nan Nan RD34 Nan Nan Oil/Canola/Flour
Nan Nan Nan RD55 Nan Oil/Canola/Flour/Thick
ED54 Nan Nan Nan Nan Rice
Nan ED66 Nan Nan Nan Rice/White
Nan Nan ED88 Nan Nan Rice/White/Jasmine
Nan Nan ED89 Nan Nan Rice/White/Basmati
Nan ED68 Nan Nan Nan Rice/Brown
I want to remove all the NaN values and just keep the non-NaN values, something like this:
DF2
code description
RD21 Oil
RD32 Oil/Canola
RD33 Oil/Canola/Wheat
RD34 Oil/Canola/Flour
RD55 Oil/Canola/Flour/Thick
.
.
.
How do I do this? I tried using the notna() method, but it returns a boolean mask of the dataframe. Any help would be appreciated.
We can mask with notna():
import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {
        'l1': [np.nan, 5],
        'l2': [6, np.nan],
        'd': ['a', 'b']
    }
)

# Boolean mask of the non-null cells, then pull out exactly those values
notna = df1[['l1', 'l2']].notna().values
notna_values = df1[['l1', 'l2']].values[notna]
print(notna_values)

df2 = pd.DataFrame(df1['d'])
df2['code'] = notna_values
print(df2)
Output:
d code
0 a 6.0
1 b 5.0
You can use stack and groupby like this to find the first non-null value:
df['code'] = df[['lev1', 'lev2', 'lev3', 'lev4', 'lev5']].stack().groupby(level=0).first().reindex(df.index)
Now, you can select the code and description columns:
df[['code', 'description']]
code description
0 RD21 Oil
1 RD32 Oil/Canola
2 RD33 Oil/Canola/Wheat
3 RD34 Oil/Canola/Flour
4 RD55 Oil/Canola/Flour/Thick
5 ED54 Rice
6 ED66 Rice/White
7 ED88 Rice/White/Jasmine
8 ED89 Rice/White/Basmati
9 ED68 Rice/Brown
You can apply a function over every row in df[cols] (the subview over problematic columns), dropping every NaN and taking the only one remaining.
>>> cols = "lev1 lev2 lev3 lev4 lev5".split()
>>> df["code"] = df[cols].apply(lambda row: row.dropna().iloc[0])
You can also drop the original columns if you don't need them anymore with drop.
>>> df.drop(columns=cols, inplace=True)
With your dataframe, I'd first make sure that Nan is actually np.nan and not a string saying 'Nan'. Then I'd replace the NaNs with empty strings. Thus,
df.replace('Nan', np.nan, inplace=True)
df.fillna('', inplace=True)
Afterwards,
df['code'] = df['lev1'] + df['lev2'] + df['lev3'] + df['lev4'] + df['lev5']
And then df.drop(columns=[s for s in df.columns if s.startswith('lev')], inplace=True) to dispose of the old columns.
Note that this works only with the assumption given in OP's comment that there is one unique code in the five columns and the others are all NaN.
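To verify that assumption, here is a small sketch you could run after the replace step but before the fillna (the lev column names come from the question):

lev_cols = [c for c in df.columns if c.startswith('lev')]
# each row should carry exactly one non-null code across the lev columns
assert df[lev_cols].notna().sum(axis=1).eq(1).all()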
Select the columns whose names contain lev and back-fill the NaN values along the columns. Keep the first column lev1 and concat it to your other column(s):
>>> pd.concat([df.filter(like='lev').bfill(axis='columns')['lev1'].rename('code'),
...            df['description']], axis="columns")
code description
0 RD21 Oil
1 RD32 Oil/Canola
2 RD33 Oil/Canola/Wheat
3 RD34 Oil/Canola/Flour
4 RD55 Oil/Canola/Flour/Thick
5 ED54 Rice
6 ED66 Rice/White
7 ED88 Rice/White/Jasmine
8 ED89 Rice/White/Basmati
9 ED68 Rice/Brown
or using melt:
>>> df.melt('description', value_name='code') \
...     .dropna().drop(columns='variable') \
...     .reset_index(drop=True) \
...     [['code', 'description']]
code description
0 RD21 Oil
1 ED54 Rice
2 RD32 Oil/Canola
3 ED66 Rice/White
4 ED68 Rice/Brown
5 RD33 Oil/Canola/Wheat
6 RD34 Oil/Canola/Flour
7 ED88 Rice/White/Jasmine
8 ED89 Rice/White/Basmati
9 RD55 Oil/Canola/Flour/Thick
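Note that melt changes the row order (rows come out grouped by lev column). If you want to keep the original order, a sketch using ignore_index=False (available in pandas 1.1+):

>>> df.melt('description', value_name='code', ignore_index=False) \
...     .dropna().sort_index().drop(columns='variable') \
...     .reset_index(drop=True) \
...     [['code', 'description']]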
I have a data frame in python pandas as follows:
(the first two columns, mygroup1 & mygroup2, are the groupby columns)
df =
**mygroup1 mygroup2 tname #dt #num #vek**
a p alpha may 6 a
b q alpha june 8 b
c r beta may 9 c
d s beta june 11 d
I want to pivot the table on the values in the tname column, so the new column names join each tname value with the other column names (#dt, #num and #vek):
**mygroup1 mygroup2 alpha#dt alpha#num alpha#vek beta#dt beta#num beta#vek**
a p may 6 a NaN NaN NaN
b q june 8 b NaN NaN NaN
c r NaN NaN NaN may 9 c
d s NaN NaN NaN june 11 d
I am trying to do this with a pandas pivot table but am not able to get the format shown above, which I really want. I will appreciate any help.
You can do:
new_df = df.set_index(['mygroup1','mygroup2','tname']).unstack('tname')
new_df.columns = [f'{y}{x}' for x, y in new_df.columns]  # ('#dt', 'alpha') -> 'alpha#dt'
new_df = new_df.sort_index(axis=1).reset_index()
Output:
mygroup1 mygroup2 alpha#dt alpha#num alpha#vek beta#dt beta#num beta#vek
0 a p may 6.0 a NaN NaN NaN
1 b q june 8.0 b NaN NaN NaN
2 c r NaN NaN NaN may 9.0 c
3 d s NaN NaN NaN june 11.0 d
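For comparison, here is a sketch of the same reshape with DataFrame.pivot, which accepts a list-like index in pandas 1.1+ (an assumption about your pandas version, not a requirement):

out = df.pivot(index=['mygroup1', 'mygroup2'], columns='tname')
# columns come back as (value_column, tname) tuples; flatten them the same way
out.columns = [f'{tname}{col}' for col, tname in out.columns]
out = out.sort_index(axis=1).reset_index()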
I import the data from an Excel file, but the merged cells in the Excel file are not preserved in Python. Therefore, I have to modify the data in Python.
For example, the data I import into Python looks like:
0 aa
1 NaN
2 NaN
3 NaN
4 b
5 NaN
6 NaN
7 NaN
8 NaN
9 ccc
10 NaN
11 NaN
12 NaN
13 dd
14 NaN
15 NaN
16 NaN
The result I want is:
0 aa
1 aa
2 aa
3 aa
4 b
5 b
6 b
7 b
8 b
9 ccc
10 ccc
11 ccc
12 ccc
13 dd
14 dd
15 dd
16 dd
I tried to use a for loop to fix the problem, but it took a lot of time and I have a huge dataset. Is there a faster way to do it?
Looks like .fillna() is your friend – quoting the documentation:
We can also propagate non-null values forward or backward.
>>> df
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
>>> df.fillna(method='ffill')
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 4.0 NaN 5
3 3.0 3.0 NaN 4
This is exactly the use case for the .fillna() method in pandas.
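For the single-column case in the question, a minimal sketch (the series values just mirror the example data):

import pandas as pd
import numpy as np

s = pd.Series(['aa', np.nan, np.nan, 'b', np.nan, np.nan, 'ccc', np.nan])
# ffill() propagates the last non-null value forward into the NaN gaps
print(s.ffill().tolist())
# ['aa', 'aa', 'aa', 'b', 'b', 'b', 'ccc', 'ccc']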
You can get your desired result with the help of the apply and fillna methods:
import pandas as pd
import numpy as np

df = pd.DataFrame(data={'A': ['a', np.nan, np.nan, 'b', np.nan]})

l = []

def change(value):
    # 'bhale' marks a cell that was NaN: replace it with the last real value seen
    if value == "bhale":
        return l[-1]
    else:
        l.append(value)
        return value

# first convert NaN values into a sentinel string like `bhale`
df['A'] = df['A'].fillna('bhale')
df["A"] = df['A'].apply(change)  # then map the sentinel back with apply
df
I hope it may help you.
I have a dataframe like this.
  Project 4  Project1  Project2  Project3
0 NaN        laptio    AB        NaN
1 NaN        windows   ten       NaN
0 one        NaN       NaN       NaN
1 two        NaN       NaN       NaN
I want to delete the NaN values from the Project 4 column.
My desired output should be:
  Project 4  Project1  Project2  Project3
0 one        laptio    AB        NaN
1 two        windows   ten       NaN
0 NaN        NaN       NaN       NaN
1 NaN        NaN       NaN       NaN
If your data frame's index is just standard 0 to n ordered integers, you can pop the Project4 column to a series, drop the NaN values, reset the index, and then merge it back with the data frame.
import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 1, 2, 3],
                   [np.nan, 4, 5, 6],
                   ['one', 7, 8, 9],
                   ['two', 10, 11, 12]], columns=['p4', 'p1', 'p2', 'p3'])

# pop the column, drop its NaNs, and realign it on a fresh 0..n index
s = df.pop('p4')
pd.concat([df, s.dropna().reset_index(drop=True)], axis=1)
# returns:
p1 p2 p3 p4
0 1 2 3 one
1 4 5 6 two
2 7 8 9 NaN
3 10 11 12 NaN
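The question's dataframe actually shows a repeated index (0, 1, 0, 1), so as a hedge you could reset it first, continuing the snippet above:

# if the index is not a clean 0..n range, realignment by position needs a reset
df = df.reset_index(drop=True)
s = df.pop('p4')
pd.concat([df, s.dropna().reset_index(drop=True)], axis=1)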
I have different Excel files that I am processing with Pandas. I need to remove a certain number of rows from the top of each file. These extra rows could be empty or they could contain text. Pandas is combining some of the rows, so I am not sure how many need to be removed. For example:
Here is an example excel file (represented as csv):
,,
,,
some text,,
,,
,,
,,
name, date, task
Jason,1-Jan,swim
Aem,2-Jan,workout
Here is my current python script:
import pandas as pd
xl = pd.ExcelFile('extra_rows.xlsx')
dfs = xl.parse(xl.sheet_names[0])
print ("dfs: ", dfs)
Here is the results when I print the dataframe:
dfs: Unnamed: 0 Unnamed: 1 Unnamed: 2
0 some text NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 name date task
5 Jason 2016-01-01 00:00:00 swim
6 Aem 2016-01-02 00:00:00 workout
From the file, I would remove the first 6 rows. However, from the dataframe I would only remove 4. Is there a way to read in the Excel file with the data in its raw state so the number of rows remains consistent?
I used python3 and pandas-0.18.1. The Excel load function is pandas.read_excel. You can try setting the parameter header=None to achieve this. Here are sample codes:
(1) With default parameters, the result will ignore leading blank lines:
In [12]: pd.read_excel('test.xlsx')
Out[12]:
Unnamed: 0 Unnamed: 1 Unnamed: 2
0 text1 NaN NaN
1 NaN NaN NaN
2 n1 t2 c3
3 NaN NaN NaN
4 NaN NaN NaN
5 jim sum tim
(2) With header=None, the result will keep leading blank lines:
In [13]: pd.read_excel('test.xlsx', header=None)
Out[13]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 text1 NaN NaN
3 NaN NaN NaN
4 n1 t2 c3
5 NaN NaN NaN
6 NaN NaN NaN
7 jim sum tim
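Building on that, here is a sketch for locating the real header row programmatically once the raw rows are preserved; it assumes you know the first header cell ('name' in the question's example):

raw = pd.read_excel('extra_rows.xlsx', header=None)
# find the file row whose first cell is the known header label
header_idx = raw.index[raw[0] == 'name'][0]
# skip everything above it; that row is then parsed as the header
df = pd.read_excel('extra_rows.xlsx', skiprows=header_idx)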
Here is what you are looking for:
import pandas as pd
xl = pd.ExcelFile('extra_rows.xlsx')
dfs = xl.parse(xl.sheet_names[0], skiprows=6)
print ("dfs: ", dfs)
Check the docs on ExcelFile for more details.
If you read your file in with pd.read_excel and pass header=None, the blank rows should be included:
In [286]: df = pd.read_excel("test.xlsx", header=None)
In [287]: df
Out[287]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 something NaN NaN
3 NaN NaN NaN
4 name date other
5 1 2 3