Using Dask with Python causes issues when running Pandas code - python

I am trying to work with Dask because my dataframe has become too large for pandas to process on its own. I read my dataset in as follows and get a result that looks odd; I'm not sure why it's not outputting the dataframe:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
import dask.bag as db
import json
%matplotlib inline
Leads = db.read_text('Leads 6.4.18.txt')
Leads
This returns (instead of my pandas dataframe):
dask.bag<bag-fro..., npartitions=1>
Then when I try to rename a few columns:
Leads_updated = Leads.rename(columns={'Business Type':'Business_Type','Lender Type':'Lender_Type'})
Leads_updated
I get:
AttributeError: 'Bag' object has no attribute 'rename'
Can someone please explain what I am doing wrong? The objective is to use Dask for all these steps, since the data is too big for regular Python/Pandas. My understanding is that the syntax under Dask should be the same as in Pandas.

Related

Plotting dataframes using matplotlib in Python IDE

I am trying to plot a dataframe obtained from the get_data_yahoo function in pandas_datareader.data, using matplotlib.pyplot in a Python IDE, and I am getting a KeyError for the x-coordinate in prices.plot no matter what I try. Please help!
I have tried this out :-
import matplotlib.pyplot as plt
from pandas import Series,DataFrame
import pandas_datareader.data as pdweb
import datetime
prices=pdweb.get_data_yahoo(['CVX','XOM','BP'],start=datetime.datetime(2020,2,24),
end=datetime.datetime(2020,3,20))['Adj Close']
prices.plot(x="Date",y=["CVX","XOM","BP"])
plt.imshow()
plt.show()
And I have tried this as well:-
prices=DataFrame(prices.to_dict())
prices.plot(x="Timestamp",y=["CVX","XOM","BP"])
plt.imshow()
plt.show()
Please Help...!!
P.S: I am also getting some kind of warning, please explain about it if you could :)
The issue is that the Date column isn't an actual column when you import the data. It's an index. So just use:
prices = prices.reset_index()
before plotting. This converts the index into a column and generates a new, integer-labelled index.
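A small sketch of this fix, using a made-up price table in place of the get_data_yahoo result (the numbers and the three-day date range are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the downloaded prices: the dates live in the index.
dates = pd.date_range("2020-02-24", periods=3, freq="D", name="Date")
prices = pd.DataFrame(
    {
        "CVX": [100.0, 98.5, 97.0],
        "XOM": [55.0, 54.2, 53.1],
        "BP": [33.0, 32.4, 31.9],
    },
    index=dates,
)

print("Date" in prices.columns)  # False: Date is the index name, not a column

prices = prices.reset_index()    # move the index into a regular column
print("Date" in prices.columns)  # True

# With matplotlib installed, the original call can now find its x column:
# prices.plot(x="Date", y=["CVX", "XOM", "BP"])
```

The plot call is left commented out so the snippet runs without matplotlib.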
Also, regarding the warnings: Pandas is full of them and they are super annoying! You can turn them off with the warnings module from the standard Python library.
import warnings
warnings.filterwarnings('ignore')
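Silencing everything can also hide warnings that point at real problems. A narrower sketch, using only the standard library; which category to ignore is an assumption here and depends on the warning you actually see:

```python
import warnings

# Ignore a single category instead of every warning.
warnings.filterwarnings("ignore", category=FutureWarning)

# Or silence warnings only around one noisy call.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    warnings.warn("noisy", UserWarning)  # suppressed inside this block
```

Filters changed inside the catch_warnings block are restored when the block exits.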

CSV data structure printing in messy format

I've exported an Excel file to a CSV where all the columns and entries look correct and normal. However, when I load it into a data frame and print the head, the output becomes messy and unreadable because the columns are not aligned.
As you can see in the image, the values are not neatly under user_id.
https://imgur.com/a/gbWaTwi
I'm using the following code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
then
df1 = pd.read_csv('../doc.csv', low_memory=False)
df1.head
Call head() and print the result. Writing .head without the parentheses only references the method; it doesn't call it.
print(df1.head())
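To see the difference, here is a tiny sketch with an invented three-row frame standing in for the CSV (the user_id values are made up):

```python
import pandas as pd

# Hypothetical stand-in for the data loaded from doc.csv.
df1 = pd.DataFrame({"user_id": [101, 102, 103]})

method = df1.head    # no parentheses: just a reference to the method
result = df1.head()  # with parentheses: a DataFrame of the first rows

print(callable(method))  # True
print(result)
```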

Del Command Not Executing Properly Pandas

I have a CSV file that I am uploading into Jupyter and I am trying to delete multiple columns at once. I thought the "DEL" command would be the best but I can't get it to work.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
% matplotlib inline
tmbd_movies = pd.read_csv('tmdb-movies.csv')
tmbd_movies.head()
del(tmbd_movies['imdb_id','homepage','tagline','keywords','overview'])
The goal was to remove the following columns:
'imdb_id', 'homepage', 'tagline', 'keywords', 'overview'
You want this:
tmbd_movies.drop(columns=['imdb_id','homepage','tagline','keywords','overview'], inplace=True)
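The del statement removes one key at a time, so a comma-separated list inside the brackets is treated as a single tuple key rather than five column names. A minimal sketch of the drop approach on an invented two-row table (only the column names come from the question; id, original_title, and all values are made up):

```python
import pandas as pd

# Hypothetical miniature of the movie table.
tmbd_movies = pd.DataFrame(
    {
        "id": [1, 2],
        "imdb_id": ["tt0000001", "tt0000002"],
        "homepage": ["", ""],
        "tagline": ["a", "b"],
        "keywords": ["x", "y"],
        "overview": ["...", "..."],
        "original_title": ["First", "Second"],
    }
)

to_drop = ["imdb_id", "homepage", "tagline", "keywords", "overview"]

# Returns a new frame without the listed columns; the original is unchanged.
trimmed = tmbd_movies.drop(columns=to_drop)
print(list(trimmed.columns))  # ['id', 'original_title']
```

Assigning the result to a new name, instead of using inplace=True, keeps the original frame available if you need those columns later.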

Scatter_Matrix Will Not Display Using Pandas and

Working through following the Machine Learning Tutorial:
http://machinelearningmastery.com/machine-learning-in-python-step-by-step/
Specifically, Section 4.2. Unfortunately, my code is throwing an error
NameError: name 'scatter_matrix' is not defined
Here is my code:
import pandas
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
scatter_matrix(dataset)
plt.show()
There's at least one Stack Overflow question on scatter_matrix, but I haven't been able to figure out what's missing.
Pandas scatter_matrix - plot categorical variables
You will have to import it like this:
from pandas.plotting import scatter_matrix
Because you've imported pandas as pd, older pandas versions also let you call it like this:
pd.scatter_matrix(dataset)
However, pandas.scatter_matrix() is deprecated and is no longer available in recent pandas releases. Use pandas.plotting.scatter_matrix() instead.

How to change the data types of a column in Pandas when read_csv()

I am not able to change the data type of a specific column. I tried the code below, but none of it works for me.
I get ValueError: could not convert string to float: '?'
When I searched around, I found mentions of issues with Pandas version support.
data.Global_intensity = data.Global_intensity.astype(pd.np.float)
data.Global_intensity.apply(float)
data['Global_intensity'] = data['Global_intensity'].astype('float')
Here are the versions of modules I am working with.
python: 3.5.2.final.0
pandas: 0.19.0
numpy: 1.11.1
part of the code
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
data = pd.read_csv('Desktop/household_power_consumptions.csv')
print (data.head())
print ('\n Data Types:')
data['Global_intensity'] = data['Global_intensity'].astype('float')
print (data.dtypes)
What I need is to convert the Global_intensity column from object to float so that I can use it with the matplotlib.pylab library.
Thank you.
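The error message suggests the column contains literal '?' entries, which cannot be parsed as numbers. One possible approach, sketched on an invented three-row excerpt (the values, the Voltage column, and the assumption that '?' marks a missing reading are illustrative, not taken from the real file):

```python
import io
import pandas as pd

# Hypothetical excerpt standing in for the CSV file.
raw = io.StringIO(
    "Global_intensity,Voltage\n"
    "18.4,234.84\n"
    "?,233.63\n"
    "23.0,233.29\n"
)

# Option 1: declare '?' as a missing-value marker while reading.
data = pd.read_csv(raw, na_values="?")
print(data["Global_intensity"].dtype)  # float64

# Option 2: read as-is, then convert; unparseable entries become NaN.
raw.seek(0)
data2 = pd.read_csv(raw)
data2["Global_intensity"] = pd.to_numeric(
    data2["Global_intensity"], errors="coerce"
)
print(data2["Global_intensity"].dtype)  # float64
```

Either way the '?' rows end up as NaN, so they may need to be dropped or filled before plotting.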
