CSV data structure printing in messy format - python

I've exported an Excel file to a CSV where all the columns and entries look correct and normal. However, when I put it into a data frame and print the head, the structure becomes very messy and unreadable because the columns are not lined up.
As you can see in the image, the values are not neatly under user_id.
https://imgur.com/a/gbWaTwi
I'm using the following code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
then
df1 = pd.read_csv('../doc.csv', low_memory=False)
df1.head

Do this: print the result of actually calling head(). Just saying .head isn't enough, because that only refers to the method without invoking it.
print(df1.head())
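
If the frame simply has too many columns to fit on one line, pandas truncates and wraps the output, which can also look messy. Widening the display settings may help; a minimal sketch (the numbers are just examples to adjust for your screen):

import pandas as pd
# Show every column instead of truncating, and allow a wider printed line
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)
df1 = pd.read_csv('../doc.csv', low_memory=False)
print(df1.head())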

Related

Pyplot directly on yfinance object is fast. Pyplot on equivalent csv is slow

Pyplot directly on data from yfinance
Here's a little script which loads data into pyplot from yfinance:
import yfinance as yf
import matplotlib.pyplot as plt
data = yf.Ticker('MSFT').history(period="max").reset_index()[["Date", "Open"]]
plt.plot(data["Date"], data["Open"])
plt.show()
The UI loads quickly and is quite responsive.
Pyplot on equivalent data from csv
Here's a similar script which writes the data to CSV, loads it from CSV, then plots the graph using the data from CSV:
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
data = yf.Ticker('MSFT').history(period="max").reset_index()[["Date", "Open"]]
data.to_csv('out.csv')
from_csv = pd.read_csv('out.csv', index_col=0)
plt.plot(from_csv["Date"], from_csv["Open"])
plt.show()
This time:
The UI loads much more slowly
Zooming is slow
Panning is slow
Resizing is slow
The horizontal axes labels don't display clearly
Question
I'd like to avoid hitting the yfinance API each time the script is run, so as not to burden their systems unnecessarily. (I'd include a bit more logic than in the script above to avoid accessing the API when a CSV is already available; I kept the example simple for the sake of demonstration.)
Is there a way to get the CSV version to result in a pyplot UI that is as responsive as the direct-from-API version?
When loading from CSV, the Date column needs to be converted to an actual date value.
from_csv["Date"] = pd.to_datetime(from_csv["Date"])
Here's a fast version of the above script:
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
data = yf.Ticker('MSFT').history(period="max").reset_index()[["Date", "Open"]]
data.to_csv('out.csv')
from_csv = pd.read_csv('out.csv', index_col=0)
from_csv["Date"] = pd.to_datetime(from_csv["Date"])
plt.plot(from_csv["Date"], from_csv["Open"])
plt.show()
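
Equivalently, the conversion can be done while reading, using read_csv's parse_dates option (same result, one less line):

from_csv = pd.read_csv('out.csv', index_col=0, parse_dates=['Date'])  # 'Date' is parsed to datetime on load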

Pandas read_csv gives decimal column numbers

I've been pulling my hair out trying to make a bipartite graph from a CSV file, and so far all I have is a pandas matrix with decimal column numbers.
My code so far is just
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# import pyexcel as pe
# import pyexcel.ext.xlsx
from networkx.algorithms import bipartite
mat = pd.read_csv("networkdata3.csv")
# mat = pd.read_excel("networkdata1.xlsx",sheet_name="sheet_name_1")
print(mat.info)
sand = nx.from_pandas_adjacency(mat)
and I have no clue what I'm doing wrong. Initially I was trying to read it in as the original xlsx file, but then I just converted it to a CSV and it started reading. I assume I can't make the graph because the column numbers are decimals, and the error it spits out claims that the column numbers don't match up. So how else should I be doing this to actually start making some progress?
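
For reference, nx.from_pandas_adjacency expects a square DataFrame whose index and columns carry the same node labels, which is what the "don't match up" error is complaining about. A minimal sketch under the assumption that the first CSV column holds the row labels:

import pandas as pd
import networkx as nx
# Use the first column as row labels so the index and columns name the same nodes
mat = pd.read_csv("networkdata3.csv", index_col=0)
mat.info()  # info is a method; calling it prints the summary
sand = nx.from_pandas_adjacency(mat)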

Python decimal issue

I need to analyze data with Python. I created a CSV file in Excel beforehand. The table looks fine in Excel, but when I display it with Python, one column shows all of its numbers as decimals. Please help!
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("covıd.csv" , sep=';', encoding='latin-1')
df
The Total Case column shows up as decimals and won't change back. It wasn't like this before.
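
One likely explanation (assuming the column contains blank cells): pandas converts an integer column to float as soon as it has any missing values, so whole numbers get printed with a decimal point. The nullable Int64 dtype keeps them as integers; a sketch with the column name taken from the question:

df["Total Case"] = df["Total Case"].astype("Int64")  # nullable integer dtype; column name assumed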

Del Command Not Executing Properly Pandas

I have a CSV file that I am uploading into Jupyter and I am trying to delete multiple columns at once. I thought the "DEL" command would be the best but I can't get it to work.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
tmbd_movies = pd.read_csv('tmdb-movies.csv')
tmbd_movies.head()
del(tmbd_movies['imdb_id','homepage','tagline','keywords','overview'])
The goal was to remove the following columns:
'imdb_id', 'homepage', 'tagline', 'keywords', 'overview'
You want this:
tmbd_movies.drop(columns=['imdb_id','homepage','tagline','keywords','overview'], inplace=True)
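
del itself is not wrong, but it removes only one column per statement; indexing with a tuple of names, as in the question, raises a KeyError. If you prefer del, it would look like this:

del tmbd_movies['imdb_id']
del tmbd_movies['homepage']
del tmbd_movies['tagline']
del tmbd_movies['keywords']
del tmbd_movies['overview']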

Using Dask with Python causes issues when running Pandas code

I am trying to work with Dask because my dataframe has become so large that pandas by itself can't process it. I read my dataset in as follows and get a result that looks odd, and I'm not sure why it's not outputting the dataframe:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
import dask.bag as db
import json
%matplotlib inline
Leads = db.read_text('Leads 6.4.18.txt')
Leads
This returns (instead of my pandas dataframe):
dask.bag<bag-fro..., npartitions=1>
Then when I try to rename a few columns:
Leads_updated = Leads.rename(columns={'Business Type':'Business_Type','Lender Type':'Lender_Type'})
Leads_updated
I get:
AttributeError: 'Bag' object has no attribute 'rename'
Can someone please explain what I am not doing correctly? The objective is to just use Dask for all these steps, since the data is too big for regular Python/pandas. My understanding is that the syntax used under Dask should be the same as pandas.
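
One thing worth noting: dask.bag.read_text only gives a bag of raw text lines, which is why there is no rename on it. The pandas-like API lives in dask.dataframe; a rough sketch, assuming the file is tab-delimited text with a header row:

import dask.dataframe as dd
# read_csv returns a lazy, pandas-like Dask DataFrame (the separator here is an assumption)
Leads = dd.read_csv('Leads 6.4.18.txt', sep='\t')
Leads_updated = Leads.rename(columns={'Business Type': 'Business_Type', 'Lender Type': 'Lender_Type'})
print(Leads_updated.head())  # head() computes just the first partition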
