Replace column names with quotations with no quotations - python

I am trying to remove the quotation marks from my column names, but when I try this:
for x in df.columns:
    x = x.replace('"', '')
    print(x)
nothing happens and the quotation marks are still there.

I would do something like this
cols = [column_name.replace('"','') for column_name in df.columns]
df.columns = cols
CODE
import pandas as pd
df=pd.DataFrame({"a":[1,2],'"b"':[3,4]})
print('BEFORE')
print(df)
cols = [column_name.replace('"','') for column_name in df.columns]
df.columns = cols
print('AFTER')
print(df)
OUTPUT
BEFORE
a "b"
0 1 3
1 2 4
AFTER
a b
0 1 3
1 2 4

You can remove them with the following code:
col = []
for x in df.columns:
    x = x.replace('"', '')
    col.append(x)
df.columns = col
To learn more about column renaming, see "Renaming columns in pandas".

One canonical solution to this problem is using pandas str.replace on the header directly (this is "vectorized"):
df = pd.DataFrame({"a": [1, 2], '"b"': [3, 4]})
df.columns = df.columns.str.replace('"', '')
df
a b
0 1 3
1 2 4

Related

In python pandas, How to apply loop to create rows for multiple columns?

import pandas as pd
import numpy as np
column_names = [str(x) for x in range(1,4)]
df= pd.DataFrame ( columns = column_names )
new_row = []
for i in range(3):
    new_row.append(i)
df = df.append(new_row , ignore_index = True)
print(df)
output:
1 2 3 0
0 NaN NaN NaN 0.0
1 NaN NaN NaN 1.0
2 NaN NaN NaN 2.0
Is there a way to apply the loop to column 1, column 2, and column 3?
I think it's possible with a simple code, isn't it?
I've been thinking a lot, but I don't know how.
I also tried the .loc() method, but I couldn't apply the loop to the row of columns.
This is a supplementary explanation.
'column_names = [str(x) for x in range(1,4)]' creates columns 1 to 3.
A loop is applied to each column.
The "for" loop inserts 0 through 2 into column 1.
Therefore, 0, 1, 2 are input to the row of column 1.
The result I want is below.
You can add the following code after all of your code above:
for col in df:
    df[col] = new_row
Result:
If you first run all of your code:
column_names = [str(x) for x in range(1,4)]
df= pd.DataFrame ( columns = column_names )
new_row = []
for i in range(3):
    new_row.append(i)
df = df.append(new_row , ignore_index = True)
and then run the code here:
for col in df:
    df[col] = new_row
you should get:
print(df)
1 2 3 0
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
I know it's weird but you can use .loc to do that:
df.loc[len(df.index)+1] = new_row
>>> df
1 2 3
1 0 1 2
You can use the column names, for example:
for col in column_names:
    df[col] = new_row
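Note that `DataFrame.append`, which several snippets in this thread rely on, was removed in pandas 2.0. If the goal is simply to have 0, 1, 2 in each of the columns 1 to 3, a rough sketch that avoids both `append` and the stray `0` column is to build the frame from a dict in one step. This assumes every column should receive the same values, which is my reading of the question.

```python
import pandas as pd

column_names = [str(x) for x in range(1, 4)]
new_row = list(range(3))  # [0, 1, 2]

# Each column name maps to the same list of values, so every column
# ends up holding 0, 1, 2.
df = pd.DataFrame({col: new_row for col in column_names})
print(df)
```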
Assign the new row to the next index position in the dataframe using .loc.
import pandas as pd
import numpy as np
column_names = [str(x) for x in range(1,4)]
df= pd.DataFrame(columns=column_names)
new_row = []
for i in range(3):
    new_row.append(i)
df.loc[len(df)] = new_row
If you have multiple rows to add in a loop, using len(df) in the .loc statement ensures they are always added at the end.
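To illustrate the multi-row case, here is a small sketch that appends three rows in a loop. The row values (`start`, `start + 1`, `start + 2`) are invented for the example. It assumes the default integer index, so `len(df)` is always the next free label.

```python
import pandas as pd

column_names = [str(x) for x in range(1, 4)]
df = pd.DataFrame(columns=column_names)

for start in range(3):
    # len(df) is 0, then 1, then 2, so each row lands at the end.
    df.loc[len(df)] = [start, start + 1, start + 2]

print(df)
```

Growing a frame one row at a time is fine for a handful of rows. For many rows it is usually faster to collect them in a list and build the DataFrame once.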
I'm not 100% sure what you are trying to do. Can you rephrase?
import pandas as pd
column_names = [str(x) for x in range(1,4)]
df= pd.DataFrame ( columns = column_names )
new_row = []
for i in range(len(df.columns)):
    new_row.append(i)
df = df.append(new_row , ignore_index = True)
for i in df:
    df[i] = new_row
print(df)

How can I remove string after last underscore in python dataframe?

I want to remove everything after the last underscore from the strings in a dataframe column. Suppose my data looks like this:
AA_XX,
AAA_BB_XX,
AA_BB_XYX,
AA_A_B_YXX
I would like to get this result
AA,
AAA_BB,
AA_BB,
AA_A_B
You can do this simply using Series.str.split and Series.str.join:
In [2381]: df
Out[2381]:
col1
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
In [2386]: df['col1'] = df['col1'].str.split('_').str[:-1].str.join('_')
In [2387]: df
Out[2387]:
col1
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
Explanation:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']})
Creates
col
0 AA_XX
1 AAA_BB_XX
2 AA_BB_XYX
3 AA_A_B_YXX
Use apply to loop through the column you want to edit.
The lambda splits each string at _ and then joins all parts except the last one with _
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]))
print(df)
Results:
col
0 AA
1 AAA_BB
2 AA_BB
3 AA_A_B
If your dataset contains values without an underscore (like AA), change the lambda like this:
df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})
df['col'] = df['col'].apply(lambda r: '_'.join(r.split('_')[:-1]) if len(r.split('_')) > 1 else r)
print(df)
Here is another way of going about it.
import pandas as pd
data = {'s': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX']}
df = pd.DataFrame(data)
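def cond1(s):
    # Split on underscores; a value without an underscore is returned unchanged
    temp_s = s.split('_')
    if len(temp_s) == 1:
        return s
    # Rejoin every part except the last one so the result is a string, not a list
    return '_'.join(temp_s[:-1])
df['result'] = df['s'].apply(cond1)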
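Another option, not used in the answers above, is to split from the right exactly once with `Series.str.rsplit` and keep the first piece. A minimal sketch, with the output column name `trimmed` chosen only for this example:

```python
import pandas as pd

df = pd.DataFrame({'col': ['AA_XX', 'AAA_BB_XX', 'AA_BB_XYX', 'AA_A_B_YXX', 'AA']})

# Split once at the last underscore and keep everything before it.
# A value with no underscore yields a one-element list, so it is left as-is.
df['trimmed'] = df['col'].str.rsplit('_', n=1).str[0]
print(df)
```

This handles values without an underscore (such as AA) with no extra condition.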

Python DataFrame : Split data in rows based on custom value?

I have a dataframe with column a. I need to get data after second _.
a
0 abc_def12_0520_123
1 def_ghij123_0120_456
raw_data = {'a': ['abc_def12_0520_123', 'def_ghij123_0120_456']}
df = pd.DataFrame(raw_data, columns = ['a'])
Output:
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
What I have tried:
df['b'] = df.number.str.replace('\D+', '')
I tried removing the letters first, but it is getting complex. Any suggestions?
Here is how:
df['b'] = ['_'.join(s.split('_')[2:]) for s in df['a']]
print(df)
Output:
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
Explanation:
lst = ['_'.join(s.split('_')[2:]) for s in df['a']]
is the equivalent of:
lst = []
for s in df['a']:
    a = s.split('_')[2:]  # all substrings produced by splitting on '_', except the first 2
    lst.append('_'.join(a))
Try:
df['b'] = df['a'].str.split('_', n=2).str[-1]
a b
0 abc_def12_0520_123 0520_123
1 def_ghij123_0120_456 0120_456
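One thing to watch with the `.str[-1]` variant: a value with fewer than two underscores returns its last piece rather than a missing value. If you would rather get a missing value in that case, a possible sketch is to use `expand=True` and take the third column. The extra row `'no_second'` is made up to show the edge case, and `n` is passed by keyword because recent pandas versions require that.

```python
import pandas as pd

df = pd.DataFrame({'a': ['abc_def12_0520_123', 'def_ghij123_0120_456', 'no_second']})

# Split at most twice; expand=True returns one column per piece.
parts = df['a'].str.split('_', n=2, expand=True)

# Column 2 holds everything after the second underscore,
# and is missing where there was no second underscore.
df['b'] = parts[2]
print(df)
```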

How to remove double quotes while assigning columns to dataframe

I have the list below:
ColumnName = 'Emp_id','Emp_Name','EmpAGe'
When I try to read the above columns and assign them inside a dataframe, I get extra double quotes:
df = pd.DataFrame(data,columns=[ColumnName])
columns=[ColumnName]
I am getting columns = ["'Emp_id','Emp_Name','EmpAGe'"]
How can I handle these extra double quotes and remove them while assigning the header to the data?
This code
ColumnName = 'Emp_id','Emp_Name','EmpAGe'
is a tuple, not a list.
If you want three columns, one for each value in the tuple above, you will need
df = pd.DataFrame(data,columns=list(ColumnName))
The problem is how you define the columns for pandas DataFrame.
The example below will build a correct data frame :
import pandas as pd
ColumnName1 = 'Emp_id','Emp_Name','EmpAGe'
df1 = [['A1','A1','A2'],['1','2','1'],['a0','a1','a3']]
df = pd.DataFrame(data=df1,columns=ColumnName1 )
df
Result :
Emp_id Emp_Name EmpAGe
0 A1 A1 A2
1 1 2 1
2 a0 a1 a3
Just for the sake of understanding, here is where you can use col.replace to get the desired result.
Let's take an example:
>>> df
col1" col2"
0 1 1
1 2 2
Result:
>>> df.columns = [col.replace('"', '') for col in df.columns]
# df.columns = df.columns.str.replace('"', '') <-- can use this as well
>>> df
col1 col2
0 1 1
1 2 2
OR
>>> df = pd.DataFrame({ '"col1"':[1, 2], '"col2"':[1,2]})
>>> df
"col1" "col2"
0 1 1
1 2 2
>>> df.columns = [col.replace('"', '') for col in df.columns]
>>> df
col1 col2
0 1 1
1 2 2
Your input is not quite right. ColumnName is already list-like and it should be passed on directly rather than wrapped in another list. In the latter case it would be interpreted as one single column.
df = pd.DataFrame(data, columns=ColumnName)
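To make the difference concrete, here is a small sketch. The contents of `data` are invented for the example, since the question does not show them. The comma-separated assignment creates a tuple of three names, and passing that tuple (here converted to a list) gives three separate columns.

```python
import pandas as pd

ColumnName = 'Emp_id', 'Emp_Name', 'EmpAGe'  # commas without brackets make a tuple
print(type(ColumnName))

# Hypothetical rows, one value per column name.
data = [['A1', 'Ann', 30], ['A2', 'Bob', 41]]

# Pass the names themselves, not a list that wraps the whole tuple.
df = pd.DataFrame(data, columns=list(ColumnName))
print(df.columns.tolist())
```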

Pandas read multiindexed csv with blanks

I'm struggling to properly load a CSV that has a multi-line header with blanks. The CSV looks like this:
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
What I would like to get is:
When I try to load with pd.read_csv(file, header=[0,1], sep=','), I end up with the following:
Is there a way to get the desired result?
Note: alternatively, I would accept this as a result:
Versions used:
Python: 2.7.8
Pandas 0.16.0
Here is an automated way to fix the column index. First, pull the column level values into a DataFrame:
columns = pd.DataFrame(df.columns.tolist())
then rename the Unnamed: columns to NaN:
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
and then forward-fill the NaNs:
columns[0] = columns[0].ffill()
so that columns now looks like
In [314]: columns
Out[314]:
0 1
0 NaN A
1 NaN B
2 C X
3 C Y
4 C Z
5 D X
6 D Y
7 D Z
Now we can find the remaining NaNs and fill them with empty strings:
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
To make the first two columns, A and B, indexable as df['A'] and df['B'] -- as though they were single-leveled -- you could swap the values in the first and second columns:
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
Now you can build a new MultiIndex and assign it to df.columns:
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
Putting it all together, if data is
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0
then
import numpy as np
import pandas as pd
df = pd.read_csv('data', header=[0,1], sep=',')
columns = pd.DataFrame(df.columns.tolist())
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].ffill()
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
print(df)
yields
A B C D
X Y Z X Y Z
0 1 2 3 4 5 6 7 8
1 3 4 5 6 7 8 9 0
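The same idea can also be written as a plain loop over the column tuples, which some may find easier to follow. This is only a sketch: it reads the sample CSV from an in-memory string via `io.StringIO` instead of a file, and it assumes blank top-level cells come back from `read_csv` with names starting with `Unnamed:`. The names `csv_text`, `fixed` and `last_top` are made up for the example.

```python
import io
import pandas as pd

csv_text = ",,C,,,D,,\nA,B,X,Y,Z,X,Y,Z\n1,2,3,4,5,6,7,8\n3,4,5,6,7,8,9,0\n"
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

fixed = []
last_top = None  # most recent real top-level label seen so far
for top, bottom in df.columns:
    if str(top).startswith("Unnamed:"):
        if last_top is None:
            # No group label yet: promote the second-level name (A, B).
            fixed.append((bottom, ""))
        else:
            # Blank cell under a group: reuse the group label (C or D).
            fixed.append((last_top, bottom))
    else:
        last_top = top
        fixed.append((top, bottom))

df.columns = pd.MultiIndex.from_tuples(fixed)
print(df)
```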
There is no magical way of making pandas aware of how you want your index to look. The closest you can get is to specify a lot yourself, like this:
names = ['A', 'B',
         ('C','X'), ('C', 'Y'), ('C', 'Z'),
         ('D','X'), ('D','Y'), ('D', 'Z')]
pd.read_csv(file, mangle_dupe_cols=True,
            header=1, names=names, index_col=[0, 1])
Gives:
C D
X Y Z X Y Z
A B
1 2 3 4 5 6 7 8
To do this in a dynamic fashion, you could read the first two lines of the CSV as they are and loop through the columns you get to generate the names variable dynamically before loading the full dataset.
pd.read_csv(file, nrows=1, header=[0,1], index_col=[0, 1])
Then access the columns and loop to create your header.
Again, not a very clean solution, but should work.
You can read the file using:
df = pd.read_csv('file.csv', header=[0, 1], skipinitialspace=True, tupleize_cols=True)
and then
df.columns = pd.MultiIndex.from_tuples(df.columns)
Load the dataframe, with multiindex:
df = pd.read_csv(filelist,header=[0,1], sep=',')
Write a function to replace the index:
def replace_index(df):
    arr = df.columns.values
    l = [list(x) for x in arr]
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            if l[i-1][0][:7] != 'Unnamed':
                l[i][0] = l[i-1][0]
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            l[i][0] = l[i][1]
            l[i][1] = ''
    index = pd.MultiIndex.from_tuples(l)
    df.columns = index
    return df
Return the new dataframe properly indexed:
replace_index(df)
I used a technique to flatten from the multi-index columns and make one column. It works well for me.
your_df.columns = ['_'.join(col).strip() for col in your_df.columns.values]
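One caveat with this flattening: if a column tuple has an empty second level, such as ('A', ''), the joined name ends with an underscore, and `strip()` with no argument only removes whitespace. A small sketch of a variant that strips the underscore instead, on a made-up three-column frame:

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples([("A", ""), ("C", "X"), ("D", "Z")])
your_df = pd.DataFrame([[1, 2, 3]], columns=columns)

# Join the levels with "_" and drop any leading or trailing underscore
# left behind by an empty level.
your_df.columns = ["_".join(col).strip("_") for col in your_df.columns.values]
print(your_df.columns.tolist())
```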
Import your csv file providing the header row indexes:
df = pd.read_csv('file.csv', header=[0, 1, 2])
Then you can iterate over each column header, clean it up, assign it to a tuple, and then re-assign the dataframe columns using pd.MultiIndex.from_tuples(list_of_tuples):
df.columns = pd.MultiIndex.from_tuples(
    [tuple(['' if y.find('Unnamed')==0 else y for y in x]) for x in df.columns]
)
This is the quick one-liner I was looking for when trying to figure this out.
