Here's my code for reading in this dataframe:
import pandas as pd

html = 'https://www.agroindustria.gob.ar/sitio/areas/ss_mercados_agropecuarios/logistica/_archivos/000023_Posici%C3%B3n%20de%20Camiones%20y%20Vagones/000010_Entrada%20de%20camiones%20y%20vagones%20a%20puertos%20semanal%20y%20mensual.php'
df = pd.read_html(html, encoding='utf-8')
# drop any scraped tables with fewer than 10 rows
remove = []
for x in range(len(df)):
    if len(df[x]) < 10:
        remove.append(x)
for x in remove[::-1]:
    df.pop(x)
df = df[0]
The dataframe uses both ',' and '.' as thousands separators, and I want neither: 5.103 should become 5103.
Using this code:
df = df.apply(lambda x: x.str.replace('.', ''))
df = df.apply(lambda x: x.str.replace(',', ''))
All of the data gets changed, but the values in the last four columns all turn to NaN. I'm assuming this has something to do with calling str.replace on a float?
Trying any sort of df[column] = df[column].astype(str) also gives back errors, as does something convoluted like the following:
for x in df.columns.tolist():
    for k, v in df[x].iteritems():
        if pd.isnull(v) == False and type(v) == float:
            df.loc[k, x] = str(v)
What is the right way to approach this problem?
The NaN values appear because .str.replace() returns NaN for every element of a non-string column. Convert the frame to strings first, then strip both separators in a single regex pass:
df = df.astype(str).apply(lambda col: col.str.replace(r'[.,]', '', regex=True))
Note that a bare '.' pattern is also risky: in older pandas versions str.replace treats the pattern as a regex by default, so '.' matches every character.
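A minimal, self-contained sketch of the whole clean-up on toy data (the column names here are invented, not the ones from the scraped table):

```python
import pandas as pd

# Toy frame mimicking the scraped table: both '.' and ',' used as thousands marks
df = pd.DataFrame({"trucks": ["5.103", "1,200", "87"],
                   "wagons": ["2.450", "300", "1,001"]})

# Strip both separators in one regex pass, then convert the strings to numbers
cleaned = df.apply(lambda col: col.str.replace(r"[.,]", "", regex=True))
cleaned = cleaned.apply(pd.to_numeric)

print(cleaned["trucks"].tolist())  # [5103, 1200, 87]
```

Finishing with pd.to_numeric avoids leaving the frame as strings once the separators are gone.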
I have the below dataframe
After doing the manipulations below, the output in the Rule column ends with a trailing comma, which is expected, but I want to remove it. How do I do that?
df['Rule'] = df.State.apply(lambda x: str("'"+str(x)+"',"))
df['Rule'] = df.groupby(['Description'])['Rule'].transform(lambda x: ' '.join(x))
df1 = df.drop_duplicates('Description',keep = 'first')
df1['Rule'] = df1['Rule'].apply(lambda x: str("("+str(x)+")"))
I have tried it using .iloc[-1].replace(",", ""), but it is not working.
Try this:
df['Rule'] = df.State.apply(lambda x: str("'"+str(x)+"'"))
df['Rule'] = df.groupby(['Description'])['Rule'].transform(lambda x: ', '.join(x))
df1 = df.drop_duplicates('Description', keep = 'first')
df1['Rule'] = df1['Rule'].apply(lambda x: str("("+str(x)+")"))
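A quick check of that pipeline on made-up data (the State and Description values here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Description": ["d1", "d1", "d2"],
                   "State": ["NY", "CA", "TX"]})

# Quote each state, join the quoted states per Description with ', ',
# then wrap each joined string in parentheses
df["Rule"] = df["State"].apply(lambda x: "'" + str(x) + "'")
df["Rule"] = df.groupby("Description")["Rule"].transform(lambda x: ", ".join(x))
df1 = df.drop_duplicates("Description", keep="first").copy()
df1["Rule"] = df1["Rule"].apply(lambda x: "(" + x + ")")

print(df1["Rule"].tolist())  # ["('NY', 'CA')", "('TX')"]
```

The .copy() after drop_duplicates avoids a SettingWithCopyWarning when assigning to df1.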
I need help on how to properly transform my df from this:
df = pd.DataFrame({'ID': ['ID no1', "ID no2", "ID no3"],
'ValueM2': ["11998","11076", "12025"],
'ValueSqFt': [129145.39718,119221.07178, 129.43600276]})
to this (and I need the output with double quotes (") instead of single quotes (')):
dfnew = pd.DataFrame({'ID': ["ID no1", "ID no2", "ID no3"],
'DataMetric': [{"ValueM2": "11998"}, {"ValueM2": "11076"}, {"ValueM2": "12025"}],
'DataImperial': [{"ValueSqFt": "129145.39718"}, {"ValueSqFt": "119221.07178"}, {"ValueSqFt": "129.43600276"}]})
If there are only 2 columns to be manipulated, it is best to adopt a manual approach as follows:
df['ValueM2'] = [{'ValueM2': x} for x in df['ValueM2'].values]
df['ValueSqFt'] = [{"ValueSqFt": x} for x in df['ValueSqFt'].values]
df = df.rename(columns={'ValueM2': 'DataMetric', 'ValueSqFt': 'DataImperial'})
If you want to have the output with double quotes, you can use json.dumps:
import json
df['DataMetric'] = df['DataMetric'].apply(lambda x: json.dumps(x))
df['DataImperial'] = df['DataImperial'].apply(lambda x: json.dumps(x))
or
df['DataMetric'] = df['DataMetric'].astype(str).apply(lambda x: x.replace("'", '"'))
df['DataImperial'] = df['DataImperial'].astype(str).apply(lambda x: x.replace("'", '"'))
but this will convert the data type to string!
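A minimal end-to-end sketch of the json.dumps route on one row of the sample data:

```python
import json
import pandas as pd

df = pd.DataFrame({"ID": ["ID no1"],
                   "ValueM2": ["11998"],
                   "ValueSqFt": [129145.39718]})

# Wrap each value in a one-key dict, then rename the column
df["ValueM2"] = [{"ValueM2": x} for x in df["ValueM2"]]
df = df.rename(columns={"ValueM2": "DataMetric"})

# json.dumps serializes the dict with double quotes
df["DataMetric"] = df["DataMetric"].apply(json.dumps)

print(df["DataMetric"].iloc[0])  # {"ValueM2": "11998"}
```

After json.dumps the column holds strings, so this is only appropriate if the dicts are not needed downstream.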
These are the values in my DataFrame
What I am trying to do is change the datatype of the columns from string to float, but I can't because some of the values in the df contain spaces.
What I've already tried:
df['Value'] = df['Value'].str.strip()
df['Value'] = df['Value'].str.replace(' ','')
Nothing helps... Anyone got any ideas?
This code should work:
df['Value'] = df['Value'].astype(str).str.replace(' ','').astype(float)
If it does not, try to troubleshoot with the following:
def check_cell(x):
    try:
        return float(str(x).replace(' ', ''))
    except ValueError:
        print(x)  # show the value that cannot be converted
        return None

df['Value'] = df['Value'].apply(check_cell)
You could try this variant:
df['Value'] = df['Value'].str.replace(' ', '', regex=True)
As the OP says, they want the result to be a Series of float, not str. A safe way to do this is, after removing any potential spaces, to use pd.to_numeric():
df['Value'] = pd.to_numeric(df['Value'].str.replace(' ', '', regex=False))
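One common reason the plain ' ' replacement "does nothing" is that the separator in scraped or pasted numbers is actually a non-breaking space (\xa0), which looks identical. A sketch, assuming that is the case, using \s to cover both:

```python
import pandas as pd

# '\xa0' (non-breaking space) often sneaks in from scraped or pasted data
df = pd.DataFrame({"Value": ["1 234", "5\xa0678", "90"]})

# \s matches ordinary spaces and \xa0 alike in Python's Unicode regex
df["Value"] = pd.to_numeric(df["Value"].str.replace(r"\s", "", regex=True))

print(df["Value"].tolist())  # [1234, 5678, 90]
```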
I have a column in a Python df like:
TAGS
{user_type:active}
{session_type:session1}
{user_type:inactive}
How can I efficiently make this column its own column for each of the tags specified?
Desired:
TAGS |user_type|session_type
{user_type:active} |active |null
{session_type:session1}|null |session1
{user_type:inactive} |inactive |null
My attempt is only able to do this in a boolean sense (not what I want), and only if I specify the tag columns ahead of time (which I don't know):
mask = df['tags'].apply(lambda x: 'user_type' in x)
df['user_type'] = mask
There are better ways, but building on what you have:
df['user_type'] = df['tags'].apply(lambda x: x.split(':')[1] if 'user_type' in x else np.nan)
df['session_type'] = df['tags'].apply(lambda x: x.split(':')[1] if 'session_type' in x else np.nan)
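Since the tag keys aren't known ahead of time, one option is to parse each TAGS string into a dict and let pandas create a column per key. A sketch, assuming every entry has the single {key:value} shape shown (the tags are not valid JSON, so braces are stripped and the string split on ':'):

```python
import pandas as pd

df = pd.DataFrame({"TAGS": ["{user_type:active}",
                            "{session_type:session1}",
                            "{user_type:inactive}"]})

def parse_tag(s):
    # "{user_type:active}" -> {"user_type": "active"}
    key, _, value = s.strip("{}").partition(":")
    return {key: value}

# json_normalize builds one column per key; missing keys become NaN
expanded = pd.json_normalize(df["TAGS"].apply(parse_tag).tolist())
out = df.join(expanded)

print(sorted(out.columns))  # ['TAGS', 'session_type', 'user_type']
```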
You could use pandas.json_normalize() to expand the TAGS column (assuming each entry is already a dict object), then check whether user_type is a key of each dict:
df2 = pd.json_normalize(df['TAGS'])
df2['user_type'] = df['TAGS'].apply(lambda x: x['user_type'] if 'user_type' in x else 'null')
This is what ended up working for me; I wanted to post a short working example using the json library that helped.
import json
import pandas as pd

def js(row):
    if row:
        return json.loads(row)
    else:
        return {'': ''}

# This example assumes a dataframe with other fields besides tags
df2 = df.copy()
# Make some dummy tags
df2['tags'] = ['{"user_type":"active","nonuser_type":"inactive"}'] * len(df2)
df2['tags'] = df2['tags'].apply(js)
df_temp = pd.DataFrame(df2['tags'].values.tolist())
df3 = pd.concat([df2.drop('tags', axis=1), df_temp], axis=1)
@Ynjxsjmh your approach reminds me of something I had used in the past, but in this case I got the following error: AttributeError: 'str' object has no attribute 'values'
@BingWang I am a big fan of list comprehensions, but in this case I don't know the column names beforehand.
I have a problem with pandas.to_csv. The dataframe works correctly, and pd.to_excel works well too, but when I try .to_csv some rows are split in two (I see it in WordPad and in Excel), for example:
line 1: provincia;comune;Ragione sociale;indirizzo;civico;destinazione;sup_coperta
line 2: AR;CHIUSI DELLA VERNA;ex sacci;LOC. CORSALONE STRADA REGIONALE
line 3: 71;;SITO DISMESSO;
My code: toscana.to_csv("toscana.csv", index=False, encoding="utf-8", sep=";")
EDIT: I added some lines showing the problem (thanks to all for the comments!).
How can I remove line breaks in the values? I found a \r in a cell that was split across two CSV lines: Out[17]: u'IMPIANTI SPORTIVI: CIRCOLO CULTURALE RICREATIVO \rPESTELLO'
I solved it with:
def replace(x):
    if type(x) == str or type(x) == unicode:  # Python 2: handle both string types
        x = x.replace('\r', '')
    else:
        x = x[0].replace('\r', '')
    return x
toscana["indirizzo"] = toscana["indirizzo"].map(lambda x: x.replace('"', ''))
toscana["indirizzo"] = toscana["indirizzo"].map(lambda x: replace(x))
toscana["Ragione sociale"] = toscana["Ragione sociale"].map(lambda x: x.replace('"', ''))
toscana["Ragione sociale"] = toscana["Ragione sociale"].map(lambda x: replace(x))
Is there a smarter way to do it?
You can use the pandas replace method for this rather than writing a new function.
It accepts regex, so you can combine several patterns with | (or), and adding inplace=True changes the values in place without adding or removing anything else from the table.
In the call below, r"\\t|\\n|\\r" matches the literal two-character sequences \t, \n and \r, while "\t|\n|\r" matches actual tab, newline and carriage-return characters; both are replaced with an empty string:
df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["", ""], regex=True, inplace=True)
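A small runnable sketch of this on data shaped like the OP's (the cell values are invented, with a stray \r inserted to reproduce the split-row symptom):

```python
import pandas as pd

df = pd.DataFrame({"indirizzo": ["LOC. CORSALONE \rSTRADA", "VIA ROMA"]})

# Strip stray carriage returns, tabs and newlines that break CSV rows
df = df.replace(r"[\t\n\r]", "", regex=True)

print(df["indirizzo"].tolist())  # ['LOC. CORSALONE STRADA', 'VIA ROMA']
```

Once the control characters are gone, to_csv no longer splits those rows.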