How to remove newline characters from string in pandas python - python

For some reason, none of the solutions previously posted about this seem to answer my question.
I am reading in an Excel workbook with 150+ sheets. I am looping through them and preparing the data to be concatenated together (deleting unneeded/blank columns and transforming some data). However, I cannot get rid of any of the newline characters, no matter what I try. Here are some variations I've tried, so you can see what DIDN'T work.
import pandas as pd
import os
os.chdir(r'C:\Users\agray\Downloads')
sheets_dict = pd.read_excel('2022_Advanced_Control.xlsx', sheet_name=None)
df_list = list(sheets_dict.values())
df_list_clean = []
The top part stays the same; the loop portion is what changes.
for df in df_list:
    df.columns = [c.replace(' ', '_') for c in df.columns]
    df.drop(df.columns.difference(['Prescription_Drug_Name','Drug_Tier', 'Drug_Notes']), 1, inplace=True)
    df.drop(df.tail(3).index, inplace=True)
    df.loc[:, 'Prescription_Drug_Name'] = df.loc[:, 'Prescription_Drug_Name'].replace("\n", "", inplace=True)
    df_list_clean.append(df)
This gives me a column that has nothing but blank values.
Here's another way I tried
for df in df_list:
    df.columns = [c.replace(' ', '_') for c in df.columns]
    df.drop(df.columns.difference(['Prescription_Drug_Name','Drug_Tier', 'Drug_Notes']), 1, inplace=True)
    df.drop(df.tail(3).index, inplace=True)
    df['Prescription_Drug_Name'] = df['Prescription_Drug_Name'].replace(r'\n','', regex=True, inplace=True)
    df_list_clean.append(df)
This version is only applying to a copy, so none of the changes it says it's making are actually being made to my df. Any ideas how to get rid of all these "\n" characters in my column? Thanks!

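Use str.replace(), and pass regex=True so that r'\n' is treated as a newline pattern rather than a literal backslash followed by n:
df['Prescription_Drug_Name'] = df['Prescription_Drug_Name'].str.replace(r'\n', '', regex=True)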
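A minimal sketch of the difference between `Series.replace` and `Series.str.replace`, using a made-up two-value Series (the drug names are invented for illustration). Without `regex=True`, `Series.replace` only swaps cells whose entire value equals the search string, which is one reason a plain `replace("\n", "")` can appear to do nothing.

```python
import pandas as pd

# Toy values standing in for the Prescription_Drug_Name column (made up).
s = pd.Series(["Drug\nA", "Drug B\n"])

# Series.replace without regex matches whole cell values only,
# so cells that merely contain a newline are left untouched.
whole_cell = s.replace("\n", "")

# Series.str.replace works on substrings inside each cell.
# A real newline character is passed here, so regex=False is enough.
substring = s.str.replace("\n", "", regex=False)

print(whole_cell.tolist())
print(substring.tolist())
```

Either `str.replace("\n", "", regex=False)` or `str.replace(r"\n", "", regex=True)` removes the newlines; the point is to assign the returned Series back rather than combine assignment with `inplace=True`.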

I always advise against inplace=True. Make an explicit copy where you mean to.
"This version is only applying to a copy, so none of the changes... being made to my df."
Why don't you clone your data like this:
for df in df_list:
    clean = df.copy()
    clean.columns = [c.replace(' ', '_') for c in clean.columns]
    # keep only the three wanted columns (use the renamed columns, and the columns= keyword)
    clean = clean.drop(columns=clean.columns.difference(['Prescription_Drug_Name', 'Drug_Tier', 'Drug_Notes']))
    # drop last three rows
    clean = clean.iloc[:-3]
    # modify column, no `inplace` here
    clean['Prescription_Drug_Name'] = clean['Prescription_Drug_Name'].replace(r'\n', '', regex=True)
    df_list_clean.append(clean)
That being said, all of the above can be chained, so you can do something like this:
for df in df_list:
    clean = (df.rename(columns=lambda x: x.replace(' ', '_'))
               .reindex(['Prescription_Drug_Name', 'Drug_Tier', 'Drug_Notes'], axis=1)  # select only the columns
               .dropna(axis=0, how='all')
               .iloc[:-3]
               .assign(Prescription_Drug_Name=lambda x: x['Prescription_Drug_Name'].replace(r'\n', '', regex=True))
            )
    df_list_clean.append(clean)

Related

Pandas: replace string with special characters

I have a dataframe (shown below) with two columns, each containing either a list of patients or an empty list (like ['']). I want to remove the empty lists.
What I have:
Homozygous_list        heterozygous_list
[Patient1,Patient2]    ['']
['']                   [Patient1]
What I want:
Homozygous_list        heterozygous_list
[Patient1,Patient2]
                       [Patient1]
I tried several things, like:
variants["Homozygous_list"].replace("['']","", regex=True, inplace=True)
or
variants["Homozygous_list"].replace("\[''\]","", regex=True, inplace=True)
or
variants["Homozygous_list"] = variants["Homozygous_list"].replace("['']","", regex=True)
etc but nothing seems to work.
If you really have lists of strings, use applymap (named DataFrame.map from pandas 2.1 on):
df = df.applymap(lambda x: '' if x==[''] else x) # or pd.NA in place of ''
output:
        Homozygous_list heterozygous_list
0  [Patient1, Patient2]
1                              [Patient1]
used input:
df = pd.DataFrame({'Homozygous_list': [['Patient1','Patient2'], ['']],
                   'heterozygous_list': [[''], ['Patient1']]})
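The regex attempts in the question have no effect if the cells hold Python lists, because regex replacement only acts on string values. Below is a sketch of the same element-wise idea written with `Series.map`, so it does not depend on whether `applymap` is available in your pandas version. The helper name `blank_empty` is made up for this example.

```python
import pandas as pd

variants = pd.DataFrame({
    "Homozygous_list": [["Patient1", "Patient2"], [""]],
    "heterozygous_list": [[""], ["Patient1"]],
})

def blank_empty(cell):
    """Return '' for the placeholder list [''], otherwise the cell unchanged."""
    return "" if cell == [""] else cell

# Apply the helper to every cell, one column at a time.
cleaned = variants.apply(lambda col: col.map(blank_empty))
print(cleaned)
```

If the cells turn out to be strings such as "['']" rather than lists, this comparison will not match, and a string replacement would be the right tool instead.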

Pandas Read Excel _x00

I have an excel file that I read via :
df = pd.read_excel(path)
The problem is the encoding of the file when using pandas (while opening in Excel, everything is fine)
df
id group
_x0034_5109336 _x0020_N12
_x0035_4610785 _x0020_N32
_x0036_1987159 _x0020_N33
_x0034_6506844 _x0020_N41_x0020__x002F__x0020_N42
_x0033_8342845 _x0020_N23
I wanted to remove the characters manually:
df[col] = df[col].astype(str).str.replace('_x0020_', ' ')
But it might not be the best option.
Below is the expected output:
df
id group
45109336 N12
54610785 N32
61987159 N33
46506844 N41 / N42
38342845 N23
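You'd have to use regex. A greedy pattern like _x.*[0-9]_ removes everything between the first and the last escape, so match exactly four hex digits instead. Each _xHHHH_ sequence encodes one character (for example _x0034_ is 4 and _x0020_ is a space), so decode it rather than delete it to get the expected output:
df[col] = df[col].astype(str).str.replace(r'_x([0-9A-Fa-f]{4})_', lambda m: chr(int(m.group(1), 16)), regex=True).str.strip()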
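To see how a greedy pattern behaves on the sample rows, here is a small sketch. It assumes every escape has the form `_xHHHH_` with four hex digits, as in the data shown in the question; the names `ESCAPE` and `decode_escapes` are made up for this example.

```python
import re
import pandas as pd

# One escape = "_x", four hex digits, "_" (assumed from the sample data).
ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")

def decode_escapes(text):
    """Replace each _xHHHH_ escape with the character it encodes."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

df = pd.DataFrame({
    "id": ["_x0034_5109336", "_x0034_6506844"],
    "group": ["_x0020_N12", "_x0020_N41_x0020__x002F__x0020_N42"],
})

# Decode every cell, then trim the leading space the decoded _x0020_ leaves.
decoded = df.apply(lambda col: col.map(decode_escapes).str.strip())

# For comparison: the greedy pattern swallows the text between escapes.
greedy = re.sub(r"_x.*[0-9]_", "", "_x0020_N41_x0020__x002F__x0020_N42")

print(decoded)
print(greedy)
```

Decoding keeps the leading digit of each id (`_x0034_` becomes `4`), which simply deleting the escape would drop.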

Is there any method to replace specific data in a column without breaking its structure or splitting it

Hi there, I am trying to figure out how to replace specific data in a CSV file. I have a file that holds the base (location) data for the IDs:
https://store8.gofile.io/download/5b031959-e0b0-4dbf-aec6-264e0b87fd09/service%20block.xlsx (sheet 2 has the data).
The file in which I want to replace data using the ID is below:
https://store8.gofile.io/download/6e13a19a-bac8-4d16-8692-e4435eed2a08/Serp.csv
The highlighted part needs to be deleted after filling in the location.
import pandas as pd
df1= pd.read_excel("serp.xlsx", header=None)
df2= pd.read_excel("flocnam.xlsx", header=None)
df1 = df1[0].str.split(";", expand=True)
df1[4] = df1[4].apply(lambda x: v[-1] if (v := x.split()) else "")
df2[1] = df2[1].apply(lambda x: x.split("-")[0])
m = dict(zip(df2[1], df2[0]))
df1[4]= df1[4].replace(m)
print(df1)
df1.to_csv ("test.csv")
It worked, but not how I wanted:
https://store8.gofile.io/download/c0ae7e05-c0e2-4f43-9d13-da12ddf73a8d/test.csv
I am trying to replace it like this (desired output).
Thank you for being a supportive community ❤️
If I understand correctly, you simply need to specify the separator ; when writing the file:
>>> df1.to_csv('test.csv', sep=';', index_label=False)
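A small sketch of what the `sep` argument does, using an invented two-column frame (the column names and values are assumptions, not the asker's data). It uses `index=False`, which is a substitution on my part for the `index_label=False` shown above, on the assumption that the row index is not wanted in the output file.

```python
import pandas as pd

# Hypothetical stand-in for the merged data.
df1 = pd.DataFrame({"id": ["A1", "B2"], "location": ["North", "South"]})

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = df1.to_csv(sep=";", index=False)
lines = csv_text.splitlines()
print(lines)
```

Writing with the same separator the source file used keeps each row in a single semicolon-delimited line instead of adding a comma-separated index column in front.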

Pandas Attribute error: Nonetype has no attribute rename

I'm learning this from a tutorial, and the code isn't working on my machine. The error is on the line with df.rename.
def compile_data():
    colist = pd.read_csv("nse500symbolistnov2020.csv")
    tickers = colist['Symbol']
    maindf = pd.DataFrame()
    for count,ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df = df.set_index('Date',inplace=True)
        df = df.rename(columns={'Adj Close': ticker},inplace=True)
        df.drop(['Open','High','Low','CLose','Volume'],1,inplace=True)
        if maindf.empty:
            maindf = df
        else:
            maindf = maindf.join(df, how='outer')
        if count % 10 == 0:
            print(count)
    print(maindf.head())
    maindf.to_csv('NSE60joined.csv')
The problem is in the line
df = df.set_index('Date',inplace=True)
Either remove inplace=True, or remove the assignment df =, leaving just
df.set_index('Date',inplace=True)
The same goes for the next line. Either use inplace=True, or assign the new dataframe to df, not both.
When you pass inplace=True, the method mutates the DataFrame and returns None instead of a new object. Assigning that result back to df therefore replaces your DataFrame with None, and the next call (df.rename here) raises the AttributeError because None is not a pd.DataFrame.
You can fix it in two ways:
Use the inplace parameter without assigning:
df.rename(columns={'Adj Close': ticker}, inplace=True)
Or assign without the inplace parameter:
df = df.rename(columns={'Adj Close': ticker})
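The behaviour can be checked on a tiny frame. This is only an illustrative sketch: the dates, prices, and the ticker name `ABC` are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020-11-02", "2020-11-03"],
    "Adj Close": [10.0, 11.0],
})
ticker = "ABC"  # hypothetical ticker symbol

# With inplace=True the call mutates its target and returns None,
# so assigning the result would overwrite the DataFrame with None.
returned = df.copy().set_index("Date", inplace=True)
print(returned)

# Assigning without inplace keeps a real DataFrame at every step.
fixed = df.set_index("Date").rename(columns={"Adj Close": ticker})
print(fixed)
```

Chaining the two calls as in `fixed` avoids the problem entirely, since each method returns a new DataFrame.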

Converting DataFrame containing UTF-8 and Nulls to String Without Losing Data

Here's my code for reading in this dataframe:
html = 'https://www.agroindustria.gob.ar/sitio/areas/ss_mercados_agropecuarios/logistica/_archivos/000023_Posici%C3%B3n%20de%20Camiones%20y%20Vagones/000010_Entrada%20de%20camiones%20y%20vagones%20a%20puertos%20semanal%20y%20mensual.php'
url = urlopen(html)
df = pd.read_html(html, encoding = 'utf-8')
remove = []
for x in range(len(df)):
    if len(df[x]) < 10:
        remove.append(x)
for x in remove[::-1]:
    df.pop(x)
df = df[0]
The dataframe uses both ',' and '.' as thousands separators, and I want neither. So 5.103 should be 5103.
Using this code:
df = df.apply(lambda x: x.str.replace('.', ''))
df = df.apply(lambda x: x.str.replace(',', ''))
All of the data will get changed, but the values in the last four columns will all turn to NaN. I'm assuming this has something to do with trying to use str.replace on a float?
Trying any sort of df[column] = df[column].astype(str) also gives back errors, as does something convoluted like the following:
for x in df.columns.tolist():
    for k, v in df[x].iteritems():
        if pd.isnull(v) == False and type(v) = float:
            df.loc(k, df[x]) == str(v)
What is the right way to approach this problem?
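You can try a regex approach. Apply it cell by cell and only to string values, so that floats and nulls are left alone; calling str(x) on a whole column would turn the entire column into one string. Something like this should work (it needs import re):
df = df.apply(lambda col: col.map(lambda v: re.sub(r'(?<=\d)[.,](?=\d)', '', v) if isinstance(v, str) else v))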
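A minimal sketch of a cell-by-cell version, assuming (as the question states) that every '.' or ',' between digits is a thousands separator and never a decimal point. The column names and values are invented, and `strip_separators` is a hypothetical helper name.

```python
import re
import pandas as pd

def strip_separators(value):
    """Remove '.' and ',' that sit between two digits; leave non-strings alone."""
    if isinstance(value, str):
        return re.sub(r"(?<=\d)[.,](?=\d)", "", value)
    return value

# Made-up data: one text column with a null, one float column.
df = pd.DataFrame({
    "tons": ["5.103", "1,234,567", None],
    "trucks": [12.0, float("nan"), 7.0],
})

# Map the helper over every cell, one column at a time.
cleaned = df.apply(lambda col: col.map(strip_separators))
print(cleaned)
```

Because non-string cells are returned unchanged, the float columns keep their values instead of turning into NaN. If any column holds genuine decimals stored as text, this pattern would remove their decimal point too, so it should only be applied where the assumption holds.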
