How to delete the \xa0 symbol from an entire dataframe in pandas? - python

I have a data frame with the 2011 census (Chile) information. Some column names and some variable values contain the \xa0 symbol, which makes it hard to reference parts of the data frame. My code is the following:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
#READ AND SAVE EACH DATA SET IN A DATA FRAME
data_2011 = pd.read_csv("2011.csv")
#SHOW FIRST 5 ROWS
print(data_2011.head())
Doing this, I got the following output:
So far, everything is fine, but when I want to see the column names using:
print("ATRIBUTOS DATOS CENSO 2011",list(data_2011))
I got the following output:
How can I fix this? Thanks in advance.

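Try replacing the whitespace characters in the column names with a regular space (in Python 3, \s also matches \xa0):
import re
data_2011.columns = [re.sub(r'\s', ' ', str(col)) for col in data_2011.columns]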
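If the goal is to clean both the column names and the string values, one possible approach is sketched below. It assumes the stray characters are non-breaking spaces (\xa0) that should become ordinary spaces; the helper name strip_nbsp and the sample frame are made up for illustration and are not the census data.

```python
import pandas as pd


def strip_nbsp(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with non-breaking spaces replaced by regular spaces."""
    cleaned = df.copy()
    # Clean the column labels; strip() also drops leading/trailing spaces.
    cleaned.columns = [str(col).replace("\xa0", " ").strip() for col in cleaned.columns]
    # regex=True makes replace() act on substrings of string cells;
    # numeric columns are left untouched.
    cleaned = cleaned.replace("\xa0", " ", regex=True)
    return cleaned


# Hypothetical sample standing in for the census data
sample = pd.DataFrame(
    {"Nombre\xa0Región\xa0": ["Los\xa0Lagos", "Maule"], "Población": [1, 2]}
)
result = strip_nbsp(sample)
print(list(result.columns))
print(result)
```

After this, the columns can be referenced with ordinary spaces, for example result["Nombre Región"].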

Related

Filtering pandas dataframe in python

I have a csv file with rows and columns separated by commas. This file contains headers (str) and values. Now, I want to filter all the data with a condition. For example, there is a header called "pmra" and I want to keep all the information for pmra values between -2.6 and -2.0. How can I do that? I tried with np.where but it did not work. Thanks for your help.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
filename="NGC188_C.csv"
data = pd.read_csv(filename)
ra = data["ra"]
dec = data["dec"]
parallax = data["parallax"]
pm_ra = data["pmra"]
pm_dec = data["pmdec"]
g_band = data["phot_g_mean_mag"]
bp_rp = data["bp_rp"]
You can use a boolean mask:
data[(data["pmra"] >= -2.6) & (data["pmra"] <= -2)]
Another approach is the between function, which is inclusive on both ends by default and returns a boolean mask that you can use to index the data frame:
data[data["pmra"].between(-2.6, -2)]
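As a small illustration of how a boolean mask filters whole rows, here is a sketch with a made-up frame standing in for NGC188_C.csv (the values are invented; only the column names come from the question):

```python
import pandas as pd

# Hypothetical stand-in for the data read from NGC188_C.csv
data = pd.DataFrame(
    {
        "pmra": [-3.0, -2.6, -2.3, -2.0, -1.5],
        "pmdec": [0.1, 0.2, 0.3, 0.4, 0.5],
    }
)

# between() is inclusive on both ends by default and returns a boolean Series
mask = data["pmra"].between(-2.6, -2.0)

# Indexing with the mask keeps every column for the matching rows
filtered = data[mask]
print(filtered)
```

Because the mask is applied to the whole frame, the other columns (ra, dec, parallax and so on in the real file) stay aligned with the selected pmra values.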

Using tabula-py why I get a list and not a Dataframe?

I want to work with PDF files, especially with tables. I wrote this code:
import pandas as pd
import numpy as np
import tabula
from tabula import read_pdf
tab = tabula.read_pdf(r'..\PDFs\Ala.pdf', encoding='latin-1', pages='all')
tab
But I get a list of values, like this:
[ Nombres Edad Ciudad
0 Noelia 20 Lima
1 Michelie 45 Lima
2 Ximena 18 Lima
3 Miguel 43 Lima]
I cannot analyze it because it's not a data frame. This is just an example; the real PDF file contains tables between text and spans several pages.
So, could someone please help me with this issue?
tabula should return a list of Pandas dataframes, one for each table found in the PDF. You could display (and work with them) as follows:
import pandas as pd
import numpy as np
import tabula
from tabula import read_pdf
dfs = tabula.read_pdf(r'..\PDFs\Ala.pdf', encoding='latin-1', pages='all')
print(f"Found {len(dfs)} tables")
# display each of the dataframes
for df in dfs:
    print(df.size)
    print(df)
tabula returns a list of pandas DataFrames. To work with a single DataFrame, select one element of the list, for example the first table:
import tabula
import pandas
tab = pandas.DataFrame(tabula.read_pdf(r'..\PDFs\Ala.pdf', pages='all')[0])
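When a PDF has several tables with the same columns, the list can also be merged into one frame with pd.concat. The sketch below uses hand-built frames in place of the list that tabula.read_pdf would return, so it runs without the PDF; the second table is invented by splitting the example rows from the question.

```python
import pandas as pd

# Hand-built frames standing in for the list returned by tabula.read_pdf
dfs = [
    pd.DataFrame(
        {"Nombres": ["Noelia", "Michelie"], "Edad": [20, 45], "Ciudad": ["Lima", "Lima"]}
    ),
    pd.DataFrame(
        {"Nombres": ["Ximena", "Miguel"], "Edad": [18, 43], "Ciudad": ["Lima", "Lima"]}
    ),
]

# ignore_index=True renumbers the rows 0..n-1 across all tables
combined = pd.concat(dfs, ignore_index=True)
print(combined)
print(combined["Edad"].mean())
```

This only makes sense when the tables share a layout; tables with different columns are usually easier to handle one at a time.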

how to drop a categorical value from a data frame column in python?

I am working with a data frame titled price_df, and I would like to drop the rows that contain '4wd' in the column drive-wheels. I have tried price_df2 = price_df.drop(index='4wd', axis=0) and a few other variations after reading the pandas docs, but I keep getting errors. Could anyone point me to the correct way to drop the rows that contain the value 4wd from that column? Below is the code I ran before trying to drop the values:
# Cleaned up Dataset location
fileName = "https://library.startlearninglabs.uw.edu/DATASCI410/Datasets/Automobile%20price%20data%20_Raw_.csv"
# Import libraries
from scipy.stats import norm
import numpy as np
import pandas as pd
import math
import numpy.random as nr
price_df = pd.read_csv(fileName)
round(price_df.head(),2) #getting an overview of that data
price_df.loc[:,'drive-wheels'].value_counts()
price_df2 = price_df.drop(index='4wd', axis=0)
You can use pd.DataFrame.query and back ticks for this column name with a hyphen:
price_df.query('`drive-wheels` != "4wd"')
Try this
price_df = pd.read_csv(fileName)
mask = price_df["drive-wheels"] =="4wd"
price_df = price_df[~mask]
Get a subset of your data with this one-liner (the hyphen in the column name means you need bracket access rather than attribute access):
price_df2 = price_df[price_df['drive-wheels'] != '4wd']
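For context, drop(index='4wd') fails because drop looks for a row whose index label is '4wd', not for a cell value. If you prefer to keep using drop, you can first collect the index labels of the matching rows. The sketch below uses a tiny invented frame in place of the automobile dataset:

```python
import pandas as pd

# Hypothetical stand-in for the automobile price data
price_df = pd.DataFrame(
    {
        "drive-wheels": ["fwd", "4wd", "rwd", "4wd"],
        "price": [13495, 17450, 16500, 15250],
    }
)

# Index labels of the rows whose drive-wheels value is '4wd'
rows_4wd = price_df.index[price_df["drive-wheels"] == "4wd"]

# drop() removes rows by index label, so pass it those labels
price_df2 = price_df.drop(index=rows_4wd)
print(price_df2["drive-wheels"].value_counts())
```

The result is the same as the boolean-mask versions above; which one to use is mostly a matter of taste.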

KeyError with Pandas CSV

Getting KeyError: Revenue
My CSV file
Product,Revenue
Onetap Master,538.07
Aimware Masterpack,306.06
Personal Config,159.94
Aimware Lua,29.95
Config Swap,22.76
The code
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(open('sales.csv'),index_col=1, sep=',')
print(df.columns.tolist())
pd.value_counts(df['Revenue']).plot.bar()
plt.show()
When I use Product instead of Revenue it works just fine
Simply drop index_col=1 from your pd.read_csv() call, and it works. You can also skip the open() and sep=',' parts of pd.read_csv():
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('sales.csv')
print(df.columns.tolist()) # returns ['Product', 'Revenue']
df['Revenue'].value_counts().plot.bar()  # pd.value_counts() is deprecated in recent pandas
plt.show()
When you use index_col=1 you make the Revenue column the index and the dataframe looks like:
                    Product
Revenue
538.07        Onetap Master
306.06   Aimware Masterpack
159.94      Personal Config
29.95           Aimware Lua
22.76           Config Swap
So it is a single-column dataframe, which can be made evident by examining df.columns: it is just ['Product'].
TL;DR: if you want to use Revenue as a column, do not put it into the index.
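The effect of index_col can be checked without the file by reading the same CSV text from memory. This is a small sketch using a shortened copy of the data from the question; the variable names are arbitrary:

```python
import io

import pandas as pd

csv_text = "Product,Revenue\nOnetap Master,538.07\nAimware Lua,29.95\n"

# With index_col=1, Revenue becomes the index and is no longer a column
indexed = pd.read_csv(io.StringIO(csv_text), index_col=1)
indexed_columns = indexed.columns.tolist()

# Without index_col, both fields are regular columns
plain = pd.read_csv(io.StringIO(csv_text))
plain_columns = plain.columns.tolist()

# reset_index() moves the index back into a regular column
restored = indexed.reset_index()

print(indexed_columns)
print(plain_columns)
print(restored.columns.tolist())
```

If the index is needed elsewhere, indexed.index still holds the Revenue values, so df['Revenue'] can be replaced by df.index in that case.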

Python Pandas: print the csv data in order with columns

Hi, I am new to Python. I am using pandas to read the csv file data and print it. The code is shown below:
import numpy as np
import pandas as pd
import codecs
from pandas import Series, DataFrame
dframe = pd.read_csv("/home/vagrant/geonlp_japan_station.csv",sep=',',
encoding="Shift-JIS")
print (dframe.head(2))
but the data is printed as follows (I just give an example to show it):
However, I want the data to be aligned by column, as follows:
I don't know how to make the printed data clearer. Thanks in advance!
You can check the unicode-formatting options and set:
pd.set_option('display.unicode.east_asian_width', True)
I tested it with the UTF-8 version of the csv:
dframe = pd.read_csv("test/geonlp_japan_station/geonlp_japan_station_20130912_u.csv")
and the output seems to be better aligned:
pd.set_option('display.unicode.east_asian_width', True)
print(dframe)
pd.set_option('display.unicode.east_asian_width', False)
print(dframe)
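If you only want the wider alignment for a single print, pd.option_context can set the option temporarily instead of toggling it with set_option. A minimal sketch with an invented two-row frame (the station names are examples, not taken from the csv):

```python
import pandas as pd

# Hypothetical sample with full-width characters
df = pd.DataFrame({"駅名": ["東京", "新大阪"], "code": [1, 2]})

# The option is active only inside the with-block
with pd.option_context("display.unicode.east_asian_width", True):
    aligned = df.to_string()
    print(aligned)

# Outside the block the default (False) is back in effect
default = df.to_string()
print(default)
```

Whether the columns actually line up on screen also depends on the terminal font rendering full-width characters at double width.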
