I used a PySpark kernel to download some data from AWS S3 and convert it to pandas.
My first cell looks like this:
import pandas as pd

liq_banco = pd.DataFrame()
meu_diretorio = 's3://bucket/'
meu_arquivo = 'file_name'
from_s3 = spark.read.format('csv').option('header', 'true').option('sep', ';').load(f'{meu_diretorio}/{meu_arquivo}.csv')
data_df = from_s3.toPandas()
My second cell looks like this:
%matplotlib inline
print(liq_banco)
But when I try to print or plot graphics in another cell, JupyterLab shows this message:
'NameError: name 'data_df ' is not defined'
data_df is a DataFrame with data.
How can I fix it?
Sorry for my English...
I tried using magic commands, but had no success.
I am trying to run a script over 900 files using the Spyder platform; it aims to delete the first 3 rows of data and certain columns. I looked into other similar questions but was unable to achieve the intended results.
My code for one text file is as follows:
import pandas as pd
mydataset = pd.read_csv('vectors_0001.txt')
df = pd.DataFrame(mydataset)
df.drop(df.iloc[:,:2], inplace = True, axis = 1)
df.drop([0,1,3], axis = 0, inplace = True)
df = df.dropna(axis = 0, subset=['Column3','Column4'])
Then I want to modify the code above so it can be applied to the consecutive text files; all the text file names are vectors_0001, vectors_0002, ..., vectors_0900. I tried to do something similar, but I keep getting errors. Take the one below as an example:
(Note: that 'u [m/s]', 'v [m/s]' are the columns I want to keep for further data analysis and the other columns I want to get rid of.)
import glob
import os.path
import sys
import pandas as pd
dir_of_interest = sys.argv[1] if len(sys.argv) > 1 else '.'
files = glob.glob(os.path.join(dir_of_interest, "*.txt"))
for file in files:
    with open('file.txt', 'w') as f:
        f.writelines(3:)
    df = pd.read_csv("*.txt")
    df_new = df[['u [m/s]', 'v [m/s]']
    df_new.to_csv('*.txt', header=True, index=None)
    with open('file.txt','r+') as f:
        print(f.read())
However, when I tried to run it, I got the error:
f.writelines(3:)
^
SyntaxError: invalid syntax
I really want to get this figured out and move on to my data analysis. Please and thank you in advance.
I'm not totally sure what you are trying to achieve here, but you're using the writelines function incorrectly. It accepts a list as an argument:
https://www.w3schools.com/python/ref_file_writelines.asp
You're giving it "3:", which is not valid syntax. Maybe you want to give it a slice of an existing list?
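For the broader goal of processing all 900 files, here is a minimal pandas sketch. The column names come from the question; the skiprows value, the helper function, and the _clean output suffix are assumptions you would adjust to your actual file layout:

```python
import glob
import os
import pandas as pd

def clean_vector_file(path, out_path):
    # skiprows=3 drops the first three lines, replacing the attempted
    # f.writelines(3:) slice; the header row is assumed to come
    # right after those three lines
    df = pd.read_csv(path, skiprows=3)
    # keep only the two velocity columns and drop incomplete rows
    df_new = df[['u [m/s]', 'v [m/s]']].dropna()
    df_new.to_csv(out_path, index=False)
    return df_new

# process every vectors_*.txt in the current directory
for path in sorted(glob.glob('vectors_*.txt')):
    root, ext = os.path.splitext(path)
    clean_vector_file(path, root + '_clean' + ext)
```

Writing each result to a new file (rather than overwriting the original, as the `open('file.txt', 'w')` in the question would) keeps the raw data intact if the cleaning step needs to be rerun.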
I'm trying to move two dataframes from notebook1 to notebook2
I've tried using nbimporter:
import nbimporter
import notebook1 as nb1
nb1.df()
Which returns:
AttributeError: module 'notebook1' has no attribute 'df' (it does)
I also tried using ipynb, but that didn't work either.
I would just write it to an Excel file and read it back, but the index gets messed up when reading it in the other notebook.
You could use a magic command (that's literally what it's called, not me being cute lol) called store. It works like this:
In notebook A:
df = pd.DataFrame(...)
%store df # Store the variable df in the IPython database
Then in another notebook B:
%store -r # This will load variables from the IPython database
df
An advantage of this approach is that you won't run into problems with datatypes changing or indexes getting messed up. This will work with variable types other than pandas dataframes too.
The official documentation describes some more features here
You could do something like this to save it as a csv:
df.to_csv('example.csv')
And then while accessing it in another notebook simply use:
df = pd.read_csv('example.csv', index_col=0)
I propose using pickle to save and then load your dataframe.
From the first notebook
df.to_pickle("./df.pkl")
then from the second notebook
df = pd.read_pickle("./df.pkl")
I'm an entry-level Python user who just started teaching myself to use Python for data analytics. These days I'm practicing with global suicide rate data in a Jupyter Notebook on Kaggle. I ran into some problems formatting my result. How can I turn my result, which is in several lists, into a well-formatted table?
The dataset I'm using is global suicide data. For the following section of the code, I want to retrieve all country information for min_year (which is 1985) and max_year (which is 2016).
So what I expect as my output is something like this (just an example):
Following is my code:
country_1985 = data[data['year'] == min_year].country.unique()
country_2016 = data[data['year'] == max_year].country.unique()
print([country_1985], [country_2016])
The result shows like this:
However, I don't want those in a list. I'd like them shown in a table format, something like this:
I tried pandas.DataFrame too, but couldn't make it work either... Could anyone help me solve my problem?
Update:
Thanks to @Code Pope for the code, the explanation, and the patience!
import pandas as pd
import numpy as np

country_1985 = data[data['year'] == min_year].country.unique()
country_2016 = data[data['year'] == max_year].country.unique()
country_1985 = pd.DataFrame(country_1985.categories)
country_2016 = pd.DataFrame(country_2016.categories)

# Following is the code from @Code Pope
from IPython.display import display_html
def display_side_by_side(dataframe1, dataframe2):
    modified_HTML = dataframe1.to_html() + dataframe2.to_html()
    display_html(modified_HTML.replace('table', 'table style="display:inline"'), raw=True)

display_side_by_side(country_1985, country_2016)
Then it looks like this:
Updated Output
As you say you are using Jupyter Notebook, you can change the HTML of your dataframes before displaying them. Use the following function:
from IPython.display import display_html

def display_side_by_side(dataframe1, dataframe2):
    modified_HTML = dataframe1.to_html() + dataframe2.to_html()
    display_html(modified_HTML.replace('table', 'table style="display:inline"'), raw=True)

# then call the function with your two dataframes
display_side_by_side(country_1985, country_2016)
I am using Spyder as my Python IDE.
I tried to run this Python code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
path = os.getcwd() + '\ex1data1.txt'
data = pd.read_csv(path, header = None, names = ['Population', 'Profit'])
data.head()
In theory it is supposed to show me a table of data, since I am using data.head() at the end, but when I run it, it only shows:
I thought it was supposed to display a new window with the data table.
What is wrong?
You are calling data.head() in a .py script. A bare expression in a script is evaluated but not echoed the way it is in an interactive console. Try print(data.head())
You want to print your data.head(), so do that:
print(data.head())
I have figured out how to print the data from an Excel spreadsheet using a for loop, but now I would like to export each column as a different variable so I can manipulate them, for example to plot a graph using plot.ly.
What I have used so far is:
import xlrd
book = xlrd.open_workbook('filelocation/file.xlsx')
sheet = book.sheet_by_index(0)
for j in range(1,4):
    for i in range(2,8785):
        print "%d" % sheet.cell_value(i,j)
which just prints all the numbers from the spreadsheet into my terminal, which is not that useful.
What I would like is something like this:
import xlrd
book = xlrd.open_workbook('filelocation/file.xlsx')
sheet = book.sheet_by_index(0)
for j= 1:
    for i in range(2,8785):
        Time = "%s" % sheet.cell_value(i,j)
for j= 2:
    for i in range(2,8785):
        SYS = "%s" % sheet.cell_value(i,j)
which would declare different variables for each column. But as I understand from the error message, I seem to be using the for loops wrong. I am not that familiar with for loops in Python; I have only really used them in MATLAB.
* EDIT * Fixed the indentation in the question; it was fine in the original code and is not the source of the error.
I like pandas for all this sort of thing.
You can create a DataFrame object which will hold all the data you're looking for:
import pandas as pd
df = pd.read_excel('myfile.xlsx', sheet_name='Sheet1')
Now you can access each column by its name out of that dataframe, so if you had a column called 'mynumbers' you would get it by doing:
print(df['mynumbers'])
or you could iterate over all columns using:
for col in df.columns:
    print(df[col])
Then you can do whatever you like, including some built-in plotting, visualisation and stats, if you have a look around the docs.
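For the original goal of turning each column into its own variable, a minimal sketch; the 'Time' and 'SYS' column headers here are hypothetical, so substitute your sheet's actual headers:

```python
import pandas as pd

def columns_as_dict(df):
    # map each column header to a plain list of its values,
    # so every column can be used as its own variable
    return {name: df[name].tolist() for name in df.columns}

# With the spreadsheet from the question this would be something like:
# df = pd.read_excel('filelocation/file.xlsx', sheet_name='Sheet1')
# cols = columns_as_dict(df)
# time, sys_values = cols['Time'], cols['SYS']  # hypothetical headers
```

This avoids declaring one named variable per column by hand; each list is then ready to pass to a plotting library.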