Writing pandas/numpy statements in Python functions

I am working on multiple data sets with similar attributes (column names) in a Jupyter notebook, and it is really tiresome to re-run the same commands for each data set to achieve the same result. Can anyone tell me how to automate the process so I can run it for various data sets? Let's say I'm running the following commands for one data set in the notebook:
data = pd.read_csv(r"/d/newfolder/test.csv", low_memory=False)
data.head()
list(data.columns)
data_new=data.sort_values(by='column_name')
Now I'd like to save all of these commands in one function and run it for different data sets in the notebook.
What are the possible ways to do this? Thanks in advance.

IIUC, your issue is that something like print(df) doesn't render as prettily as putting df on the last line of a Jupyter cell.
You can get the pretty output whenever you want (as long as your Jupyter is up to date) by using display!
Modifying your code:
def process_data(file):
    data = pd.read_csv(file, low_memory=False)
    display(data.head())
    display(data.columns)
    data_new = data.sort_values(by='column_name')
    display(data_new.head())
process_data(r"/d/newfolder/test.csv")
This will output data.head(), data.columns, and data_new.head() from a single cell.
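One way to reuse such a function across files (the CSV names and sort column below are made up, and this version returns the sorted frame instead of displaying it, so it also works outside a notebook):

```python
import pandas as pd

def process_data(file, sort_col):
    # same steps as above, returning the sorted frame
    data = pd.read_csv(file, low_memory=False)
    return data.sort_values(by=sort_col)

# two tiny CSVs so the sketch runs end to end
pd.DataFrame({"column_name": [3, 1, 2]}).to_csv("a.csv", index=False)
pd.DataFrame({"column_name": [9, 7, 8]}).to_csv("b.csv", index=False)

for path in ["a.csv", "b.csv"]:
    result = process_data(path, "column_name")
    print(path, result["column_name"].tolist())
    # a.csv [1, 2, 3]
    # b.csv [7, 8, 9]
```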

Related

Getting an error code in Python that df is not defined

I am new to data science, and after a few lessons on importing data in Python, I tried the following code in my Jupyter notebook but keep getting an error saying df is not defined. I need help.
The code I wrote is as follows:
import pandas as pd
url = "https://api.worldbank.org/v2/en/indicator/SH.TBS.INCD?downloadformat=csv"
df = pd.read_csv(url)
After running the third line, I got a series of errors in my Jupyter notebook, but the one that stood out was "df not defined".
The problem here is that your data is a ZIP file containing multiple CSV files. You need to download the data, unpack the ZIP file, and then read one CSV file at a time.
If you can give more details on the problem (e.g. screenshots), debugging will be easier.
One possibility for the error is that the content served at that URL (https://api.worldbank.org/v2/en/indicator/SH.TBS.INCD?downloadformat=csv) is a ZIP file, which pandas cannot process directly.
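The steps described above (download, unpack the ZIP, read one CSV at a time) can be sketched as follows. An in-memory ZIP with a made-up file name stands in for the downloaded bytes so the example is self-contained; in practice you would fetch them first, e.g. with requests.get(url).content:

```python
import io
import zipfile
import pandas as pd

# Stand-in for the downloaded bytes: a small ZIP built in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("API_SH.TBS.INCD.csv", "Country,Value\nFrance,10\nSpain,12\n")

# Unpack the archive and read each CSV member into its own dataframe.
frames = {}
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    for name in zf.namelist():
        if name.endswith(".csv"):
            frames[name] = pd.read_csv(zf.open(name))

df = frames["API_SH.TBS.INCD.csv"]
print(df.shape)  # (2, 2)
```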

How to Pythonically add new cells in a Jupyter Notebook

I have a lot of different files that I'm trying to load into pandas in a Pythonic way, and I'd like each to land in its own cell to keep things readable. I actually have 36 different variables, but to keep things simple, here is an example with three dataframes.
Let's say I'm loading CSV files like these into dataframes, but in different, automatically generated cells:
file_list = ['df1.csv', 'df2.csv', 'df3.csv']
name_list = ['df1', 'df2', 'df3']
I could easily create three different cells and type:
df1 = pd.read_csv('df1.csv')
But there are dozens of different CSVs, and I want to do similar things like deleting columns, so there has to be an easier way.
I've done something such as:
var_list = []
for file, name in zip(file_list, name_list):
    var_name = name
    var_file = pd.read_csv(file)
    var_list.append((file, name, var_file))
print(var_list)
But this all occurs in the same cell.
I looked at the IPython docs, since I believe that is the relevant package here, but I couldn't find anything. I appreciate your help.
From what I understand, you want to load the content of several .csv files into several pandas dataframes and run a repeatable process on each of them. Since you're not sure they will load correctly, but you still want to get the most out of them, you want each process to run in its own Jupyter cell.
As ddejohn pointed out, I don't know if that's the best option, but I think it's a cool question. The following code generates several cells, each with a common structure but different variables (in my example, I simply sort each loaded dataframe by age). It is based on "How to programmatically create several new cells in a Jupyter notebook", which should get the credit if this is indeed what you were looking for:
from IPython.core.getipython import get_ipython
import pandas as pd

def create_new_cell(contents):
    shell = get_ipython()
    payload = dict(
        source='set_next_input',
        text=contents,
        replace=False,
    )
    shell.payload_manager.write_payload(payload, single=False)

def get_df(file_name, df_name):
    content = "{df} = pd.read_csv('{file}', names=['Name', 'Age', 'Height'])\n"\
              "{df}.sort_values(by='Age', inplace=True)\n"\
              "{df}".format(df=df_name, file=file_name)
    create_new_cell(content)

file_list = ['filename_1.csv', 'filename_2.csv']
name_list = ['df1', 'df2']
for file, name in zip(file_list, name_list):
    get_df(file, name)

Moving dataframes between notebooks

I'm trying to move two dataframes from notebook1 to notebook2
I've tried using nbimporter:
import nbimporter
import notebook1 as nb1
nb1.df()
Which returns:
AttributeError: module 'notebook1' has no attribute 'df' (it does)
I also tried using ipynb but that didn't work either
I would just write it to an Excel file and read it back, but the index gets messed up when reading it in the other notebook.
You could use a magic command (literally what it's called, not me being cute lol) called store. It works like this:
In notebook A:
df = pd.DataFrame(...)
%store df # Store the variable df in the IPython database
Then in another notebook B:
%store -r # This will load variables from the IPython database
df
An advantage of this approach is that you won't run into problems with datatypes changing or indexes getting messed up. This will work with variable types other than pandas dataframes too.
The official documentation describes some more features here.
You could do something like this to save it as a csv:
df.to_csv('example.csv')
And then while accessing it in another notebook simply use:
df = pd.read_csv('example.csv', index_col=0)
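A quick round trip showing that index_col=0 restores the index (file name as in the answer; the sample data is made up):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
df.to_csv('example.csv')                        # index written as the first column
df2 = pd.read_csv('example.csv', index_col=0)   # first column read back as the index
print(df2.index.tolist())  # ['a', 'b', 'c']
```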
I propose using pickle to save and then load your dataframe.
From the first notebook
df.to_pickle("./df.pkl")
then from the second notebook
df = pd.read_pickle("./df.pkl")
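For what it's worth, pickle keeps dtypes and the index exactly; a quick round trip (the datetime column and sample values are just an illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"when": pd.to_datetime(["2021-01-01", "2021-06-15"]), "v": [1.5, 2.5]},
    index=["a", "b"],
)
df.to_pickle("./df.pkl")          # in the first notebook
df2 = pd.read_pickle("./df.pkl")  # in the second notebook
print(df.equals(df2))  # True: dtypes and the index survive the round trip
```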

Writing results of several cells in jupyter to one file (no overwriting)

I have a jupyter notebook where I run the same simulation using many different combinations of parameters (essentially, to simulate different versions of environment and their effect on the results). Let's say that the result of each run is an image and a 2d array of all relevant metrics for my system. I want to be able to keep the images in notebook, but save the arrays all in one place, so that I can work with them later on if needed.
Ideally I would save them into an external file with the following format:
'Experiment environment version i' (or some other description)
2d array
and every time I run a new simulation (a new cell), the results would be appended to this file until I close it.
Any ideas how to end up with such an external summary file?
If you have Excel available, you could use pandas to write the results to a spreadsheet (or you could use pandas to write to a CSV). See the documentation here, but essentially you would do the following when appending and/or using a new sheet:
import pandas as pd

# mode='a' appends sheets to an existing workbook; the file must already
# exist (use mode='w' on the first run).
with pd.ExcelWriter('results.xlsx', mode='a') as writer:
    for i in results:
        df.to_excel(writer, sheet_name='Result' + i)
You will need to have your array in a dataframe df; there are lots of tutorials on how to put an array into pandas.
After a bit of trial and error, here is a general answer for writing to txt without pandas (otherwise see jaybeesea's answer):
import numpy as np

with open("filename.txt", "a+") as f:
    f.write("Comment 1 \n")
    f.write("%s \n" % np.array2string(array, separator=' , '))
Every time you run it, it appends to the file.
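If you plan to read the arrays back programmatically later, a binary alternative to the txt approach is np.savez, which keeps every run's array in one file keyed by its description (the experiment names and metrics below are made up):

```python
import numpy as np

results = {}
# simulate three experiment runs, each producing a 2d metrics array
for i in range(3):
    metrics = np.arange(4).reshape(2, 2) * (i + 1)
    results[f"experiment_{i}"] = metrics

# one file holding all arrays, keyed by description
np.savez("summary.npz", **results)

loaded = np.load("summary.npz")
print(sorted(loaded.files))  # ['experiment_0', 'experiment_1', 'experiment_2']
print(loaded["experiment_2"])
```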

print data_frame.head gives output not as a nice table

I have a data set taken from Kaggle, and I want to get the result shown here.
So I took that code, changed it a bit, and ran this:
# get titanic & test csv files as a DataFrame
titanic_df = pd.read_csv("./input/train.csv")
test_df = pd.read_csv("./input/test.csv")
# preview the data
print titanic_df.head()
This works, as it outputs the right data, but not as neatly as in the tutorial... Can I make it right?
Here is my output (Python 2, Spyder):
Try using a Jupyter notebook if you have not used one before. In a plain IPython console, the text will wrap and be shown on multiple lines. On Kaggle, what you are seeing is itself a Jupyter notebook.
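If you stay in a plain console, you can also widen pandas' display settings so head() doesn't wrap (the option values and sample frame below are just examples):

```python
import pandas as pd

pd.set_option("display.width", 200)       # let wide frames print on one line
pd.set_option("display.max_columns", 20)  # show more columns before truncating

df = pd.DataFrame({f"col{i}": range(3) for i in range(8)})
print(df.head().to_string())  # a fixed-width table that does not wrap
```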
