These are the values in my DataFrame
What I am trying to do is change the datatype of the columns from string to float, but I can't because some of the values in df contain spaces.
What I've already tried:
df['Value'] = df['Value'].str.strip()
df['Value'] = df['Value'].str.replace(' ','')
Nothing helps... Anyone got any ideas?
This code should work:
df['Value'] = df['Value'].astype(str).str.replace(' ','').astype(float)
If it does not, try to troubleshoot with the following:
def check_cell(x):
    try:
        return float(str(x).replace(' ', ''))
    except ValueError:
        # print any value that cannot be converted so you can inspect it
        print(x)

df['Value'] = df['Value'].apply(check_cell)
You could try this variant:
df['Value'] = df['Value'].str.replace(' ', '', regex=True)
As the OP says, they want the result to be a Series of float, not str. A safe way to do this, after removing any potential spaces, is to use pd.to_numeric():
df['Value'] = pd.to_numeric(df['Value'].str.replace(' ', '', regex=False))
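To see why pd.to_numeric can be handy when some cells still refuse to convert, here is a minimal self-contained sketch on made-up values (not the OP's data), using errors='coerce' so bad cells become NaN instead of raising:

import pandas as pd

# made-up sample: spaces as thousands separators, plus one bad cell
df = pd.DataFrame({'Value': ['1 234', '56 7', '89', 'oops']})

cleaned = df['Value'].str.replace(' ', '', regex=False)
df['Value'] = pd.to_numeric(cleaned, errors='coerce')  # 'oops' becomes NaN instead of raising
print(df['Value'].tolist())  # [1234.0, 567.0, 89.0, nan]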
I have a dataframe (see below) in which I have two columns containing either a list of patients or an empty list (like this: ['']). I want to remove the empty lists.
What I have:
Homozygous_list        heterozygous_list
[Patient1,Patient2]    ['']
['']                   [Patient1]
What I want:
Homozygous_list        heterozygous_list
[Patient1,Patient2]
                       [Patient1]
I tried several things like:
variants["Homozygous_list"].replace("['']","", regex=True, inplace=True)
or
variants["Homozygous_list"].replace("\[''\]","", regex=True, inplace=True)
or
variants["Homozygous_list"] = variants["Homozygous_list"].replace("['']","", regex=True)
etc., but nothing seems to work.
If you really have lists of strings, use applymap:
df = df.applymap(lambda x: '' if x==[''] else x) # or pd.NA in place of ''
output:
        Homozygous_list heterozygous_list
0  [Patient1, Patient2]
1                            [Patient1]
used input:
df = pd.DataFrame({'Homozygous_list': [['Patient1','Patient2'], ['']],
                   'heterozygous_list': [[''], ['Patient1']]})
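Note: in pandas 2.1+, DataFrame.applymap was deprecated in favor of DataFrame.map, so on recent versions the same idea would be (a sketch, same logic):

# pandas >= 2.1: DataFrame.map is the element-wise equivalent of applymap
df = df.map(lambda x: '' if x == [''] else x)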
For some reason, none of the solutions previously posted about this seem to answer my question.
I am reading in an Excel file with 150+ sheets. I am looping through them and preparing the data to be concatenated together (doing things like deleting unneeded/blank columns and transforming some data). However, for some reason, I cannot get rid of any of the newline characters, no matter what I try. Here are some variations I've tried so you can see what DIDN'T work.
import pandas as pd
import os
os.chdir(r'C:\Users\agray\Downloads')
sheets_dict = pd.read_excel('2022_Advanced_Control.xlsx', sheet_name=None)
df_list = list(sheets_dict.values())
df_list_clean = []
The top part stays the same; this loop portion is what changes.
for df in df_list:
    df.columns = [c.replace(' ', '_') for c in df.columns]
    df.drop(df.columns.difference(['Prescription_Drug_Name','Drug_Tier', 'Drug_Notes']), 1, inplace=True)
    df.drop(df.tail(3).index, inplace=True)
    df.loc[:, 'Prescription_Drug_Name'] = df.loc[:, 'Prescription_Drug_Name'].replace("\n", "", inplace=True)
    df_list_clean.append(df)
This gives me a column that has nothing but blank values.
Here's another way I tried:
for df in df_list:
    df.columns = [c.replace(' ', '_') for c in df.columns]
    df.drop(df.columns.difference(['Prescription_Drug_Name','Drug_Tier', 'Drug_Notes']), 1, inplace=True)
    df.drop(df.tail(3).index, inplace=True)
    df['Prescription_Drug_Name'] = df['Prescription_Drug_Name'].replace(r'\n','', regex=True, inplace=True)
    df_list_clean.append(df)
This version is only applying to a copy, so none of the changes it says it's making are actually being made to my df. Any ideas how to get rid of all these "\n" characters in my column? Thanks!
Use str.replace():
df['Prescription_Drug_Name'] = df['Prescription_Drug_Name'].str.replace('\n', '', regex=False)
I always advise against inplace=True. Make an explicit copy where you mean to.
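As an illustration of that advice, here is a minimal sketch of the explicit-copy pattern on a made-up frame (column name borrowed from the question):

import pandas as pd

# made-up sample frame for illustration
df = pd.DataFrame({'Prescription_Drug_Name': ['Aspirin\n81mg', 'Ibuprofen']})

# explicit copy instead of inplace=True: the original df stays untouched
clean = df.copy()
clean['Prescription_Drug_Name'] = clean['Prescription_Drug_Name'].str.replace('\n', '', regex=False)
print(clean['Prescription_Drug_Name'].tolist())  # ['Aspirin81mg', 'Ibuprofen']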
"This version is only applying to a copy, so none of the changes... being made to my df." Why don't you clone your data like this:
for df in df_list:
    clean = df.copy()
    clean.columns = [c.replace(' ', '_') for c in df.columns]
    # keep only the three columns we need (note: use clean's renamed columns here)
    clean = clean.drop(clean.columns.difference(['Prescription_Drug_Name', 'Drug_Tier', 'Drug_Notes']), axis=1)
    # drop last three rows
    clean = clean.iloc[:-3]
    # modify column, remove `inplace` here
    clean['Prescription_Drug_Name'] = clean['Prescription_Drug_Name'].replace(r'\n', '', regex=True)
    df_list_clean.append(clean)
That being said, all of the above can be chained, so you can do something like this:
for df in df_list:
    clean = (df.rename(columns=lambda x: x.replace(' ', '_'))
               .reindex(['Prescription_Drug_Name', 'Drug_Tier', 'Drug_Notes'], axis=1)  # select only the columns we need
               .dropna(axis=0, how='all')
               .iloc[:-3]
               .assign(Prescription_Drug_Name=lambda d: d['Prescription_Drug_Name'].replace(r'\n', '', regex=True))
            )
    df_list_clean.append(clean)
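Since the goal is to concatenate the 150+ sheets afterwards, the cleaned list can then be combined in one step, e.g.:

# combine all cleaned sheets into a single frame
df_all = pd.concat(df_list_clean, ignore_index=True)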
Here's my code for reading in this dataframe:
from urllib.request import urlopen

import pandas as pd

html = 'https://www.agroindustria.gob.ar/sitio/areas/ss_mercados_agropecuarios/logistica/_archivos/000023_Posici%C3%B3n%20de%20Camiones%20y%20Vagones/000010_Entrada%20de%20camiones%20y%20vagones%20a%20puertos%20semanal%20y%20mensual.php'
url = urlopen(html)
df = pd.read_html(html, encoding='utf-8')
remove = []
for x in range(len(df)):
    if len(df[x]) < 10:
        remove.append(x)
for x in remove[::-1]:
    df.pop(x)
df = df[0]
The dataframe obtained uses both ',' and '.' as thousands separators, and I want neither. So 5.103 should be 5103.
Using this code:
df = df.apply(lambda x: x.str.replace('.', ''))
df = df.apply(lambda x: x.str.replace(',', ''))
All of the data will get changed, but the values in the last four columns will all turn to NaN. I'm assuming this has something to do with trying to use str.replace on a float?
Trying any sort of df[column] = df[column].astype(str) also gives back errors, as does something convoluted like the following:
for x in df.columns.tolist():
    for k, v in df[x].iteritems():
        if pd.isnull(v) == False and type(v) = float:
            df.loc(k, df[x]) == str(v)
What is the right way to approach this problem?
You can try this regex approach. I haven't tested it, but it should work.
import re

# apply the substitution element-wise (str(x) on a whole column would stringify the Series)
df = df.apply(lambda col: col.map(lambda x: re.sub(r'(\d+)[.,](\d+)', r'\1\2', str(x))))
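As a quick sanity check of that pattern on made-up values (not the OP's data):

import re
import pandas as pd

s = pd.Series(['5.103', '1,200', '89'])
s = s.map(lambda x: re.sub(r'(\d+)[.,](\d+)', r'\1\2', str(x)))
print(s.tolist())                 # ['5103', '1200', '89']
print(pd.to_numeric(s).tolist())  # [5103, 1200, 89]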
Hello, I have this file:
date;category_name;item_number;item_description;bottlevolume_ml;state_bottle_retail;bottles_sold;volume_sold_gallons
11/04/2015;APRICOT$ BRANDIES;54436;$Mr. Boston Apricot Brandy;750;6.75;12;2.38
03/02/2016;BLENDED WHISKIES;27605;Tin Cup;750;$20.63;2;0.40
02/11/2016;STRAIGHT BOURBON WHISKIES;19067;Jim Beam;1000;$18.89;24;6.34
02/03/2016;AMERICAN COCKTAILS;59154;1800 Ultimate Margarita;1750;$14.25;6;2.77
08/18/2015;VODKA 80 PROOF;35918;Five O'clock Vodka;1750;$10.80;12;5.55
I would like to remove the $ using pandas.
I tried this:
import pandas as pd
import numpy as np
df = pd.read_csv('data2.csv', delimiter=';')
df.date = [x.strip('$') for x in df.date]
df.category_name = [x.strip('$') for x in df.category_name]
df.item_number = [x.strip('$') for x in df.item_number]
But I would like to use pandas to remove the $ from all my columns.
Any ideas?
Thank you!
for c in df.select_dtypes('object').columns:
    df[c] = df[c].str.replace('$', '', regex=False)
Explanation:
If a column contains a '$', it will be an object-type column. It's useful to select only these, because then you can use .str.replace (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html) to find all '$' signs in that column and replace them with an empty string. Passing regex=False treats the '$' as a literal character rather than a regex end-of-string anchor.
Note that this solution also removes '$' in the middle of the string (in contrast to the .strip method you've used so far).
This should work.
df = df.apply(lambda x: x.str.strip('$') if x.dtype == "object" else x)
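Another option (a sketch, not from the original answers) is a single frame-wide replace; the $ has to be escaped because it is a regex metacharacter:

import pandas as pd

# made-up rows mirroring the question's data
df = pd.DataFrame({'category_name': ['APRICOT$ BRANDIES'],
                   'state_bottle_retail': ['$20.63']})
df = df.replace(r'\$', '', regex=True)  # removes '$' anywhere in any string cell
print(df.iloc[0].tolist())  # ['APRICOT BRANDIES', '20.63']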
I have a problem with pandas .to_csv.
The pandas dataframe works correctly, and pd.to_excel works well too.
When I try to use .to_csv, some rows are split in two (I see it in WordPad and Excel).
for example:
line 1: provincia;comune;Ragione sociale;indirizzo;civico;destinazione;sup_coperta
line2: AR;CHIUSI DELLA VERNA;ex sacci;LOC. CORSALONE STRADA REGIONALE
line3: 71;;SITO DISMESSO;
My code: toscana.to_csv("toscana.csv", index = False, encoding = "utf-8", sep=";")
EDIT: I added some lines showing the problem (thanks to all for the comments!)
How can I remove line breaks in values? I found a \r in a cell that gets split into 2 CSV lines: Out[17]: u'IMPIANTI SPORTIVI: CIRCOLO CULTURALE RICREATIVO \rPESTELLO'
I solved it with:
def replace(x):
    if type(x) == str or type(x) == unicode:
        x = x.replace('\r', '')
    else:
        x = x[0].replace('\r', '')
    return x

toscana["indirizzo"] = toscana["indirizzo"].map(lambda x: x.replace('"', ''))
toscana["indirizzo"] = toscana["indirizzo"].map(replace)
toscana["Ragione sociale"] = toscana["Ragione sociale"].map(lambda x: x.replace('"', ''))
toscana["Ragione sociale"] = toscana["Ragione sociale"].map(replace)
Is there a smarter method to do it?
You can use the pandas replace method to achieve this rather than creating a new function.
It supports regex, so you can combine several patterns in one call with | (or).
In the example we use regex=True and replace both the escaped sequences and the actual control characters with an empty string.
Adding inplace=True changes the values in place, without adding or removing any other data in the table.
r"\\t|\\n|\\r" matches the two-character sequences \t, \n and \r (a literal backslash followed by the letter), while "\t|\n|\r" matches the actual tab, newline and carriage-return characters.
df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["",""], regex=True, inplace=True)
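A small self-contained check of that call on a made-up row containing an embedded carriage return, like the OP's cell:

import pandas as pd

toscana = pd.DataFrame({'indirizzo': ['CIRCOLO CULTURALE RICREATIVO \rPESTELLO']})
toscana.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["", ""], regex=True, inplace=True)
print(toscana.iloc[0, 0])  # 'CIRCOLO CULTURALE RICREATIVO PESTELLO'
toscana.to_csv("toscana.csv", index=False, encoding="utf-8", sep=";")  # rows no longer split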