I have a lot of different files that I'm trying to load into pandas in a Pythonic way, while also splitting the work across different cells to keep things readable. I actually have 36 different variables, but to keep things simple I'll show an example with three dataframes.
Say I'm loading CSV files like these into dataframes, each in its own automatically generated cell:
file_list = ['df1.csv', 'df2.csv', 'df3.csv']
name_list = ['df1', 'df2', 'df3']
I could easily create three different cells and type:
df1 = pd.read_csv('df1.csv')
But there are dozens of different CSVs, and I want to do similar things to each of them (like deleting columns), so there has to be an easier way.
I've done something like this:
var_list = []
for file, name in zip(file_list, name_list):
    var_name = name
    var_file = pd.read_csv(file)
    var_list.append((file, name, var_file))
print(var_list)
But this all occurs in the same cell.
Now, I've looked at the IPython docs, since I believe that's the package involved here, but I couldn't find anything. I'd appreciate your help.
From what I understand, you need to load the content of several .csv files into several pandas dataframes, and you want to execute a repeatable process for each of them. You're not sure they will all load correctly, but you still want to get the most out of them, and to that end you want to run each process in its own Jupyter cell.
As pointed out by ddejohn, I don't know if that's the best option, but anyway, I think it's a cool question. The following code generates several cells, each with a common structure but different variables (as an example, I simply sort each loaded dataframe by age). It is based on How to programmatically create several new cells in a Jupyter notebook, which should get the credit, if it is indeed what you were looking for:
from IPython.core.getipython import get_ipython
import pandas as pd
def create_new_cell(contents):
    shell = get_ipython()
    payload = dict(
        source='set_next_input',
        text=contents,
        replace=False,
    )
    # ask the frontend to create a new input cell containing `contents`
    shell.payload_manager.write_payload(payload, single=False)
def get_df(file_name, df_name):
    content = "{df} = pd.read_csv('{file}', names=['Name', 'Age', 'Height'])\n"\
              "{df}.sort_values(by='Age', inplace=True)\n"\
              "{df}"\
              .format(df=df_name, file=file_name)
    create_new_cell(content)
file_list = ['filename_1.csv', 'filename_2.csv']
name_list = ['df1', 'df2']
for file, name in zip(file_list, name_list):
    get_df(file, name)
Related
How can I display multiple pandas functions created in Python in the same CSV file
So I have multiple data tables saved as pandas dataframes, and I want to output all of them into the same CSV for ease of access. However, I am not really sure of the best way to go about this, as I want to maintain each dataframe's inherent structure (i.e. columns and index), which makes it hard to simply combine them all into one single dataframe.
You have 2 choices:
Either you combine them first (pd.concat()), with all the advantages and limitations of that approach, then you can call .to_csv() and it will produce one file. If they are structurally the same, this is great because you will be able to read the file back in again.
Or, you call .to_csv() multiple times and save the output in a "buffer", which you can then write out (see here). This is probably the only way if your DataFrames are structurally very different, but it will be a mess to read them back later.
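For reference, here is a minimal sketch of both options (the frame contents and the output file name are placeholders, not from the question):

import io
import pandas as pd

file_1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})   # placeholder frames
file_2 = pd.DataFrame({'x': ['p'], 'y': ['q']})

# Option 1: concatenate first, then write once
# (works best when the frames share the same structure)
pd.concat([file_1, file_2]).to_csv('combined.csv', index=False)

# Option 2: render each frame's CSV text into one buffer,
# then write the buffer out in a single go
buffer = io.StringIO()
for df in (file_1, file_2):
    df.to_csv(buffer, index=False)
    buffer.write('\n')                    # blank line between the blocks
with open('combined.csv', 'w', newline='') as f:
    f.write(buffer.getvalue())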
Is .json output an option for what you want to do?
Thanks a lot for the comment, Kingotto. I used the first option, added this code, and it helped me arrange my functions horizontally and export the file to CSV like this:
frames = pd.concat([file_1, file_2, file_3], axis = 1)
# save the dataframe
frames.to_csv('Combined.csv', index = False)
I am learning how to use python.
For the project I am working on, I have hundreds of datasheets containing a City, Species, and Time (speciesname.csv).
I also have a single datasheet that has all cities in the world with their latitude and longitude point (cities.csv).
My goal is to have 2 more columns for latitude and longitude (from cities.csv) in every (speciesname.csv) datasheet, corresponding to the location of each species.
I am guessing my workflow will look something like this:
Go into speciesname.csv file and find the location on each line
Go into cities.csv and search for the location from speciesname.csv
Copy the corresponding latitude and longitude into new columns in speciesname.csv.
I have been unsuccessful in my search for a blog post or someone else with a similar question. I don't know where to start, so any starting point would be very helpful.
Thank you.
You can achieve it in many ways.
The simplest way I can think of to approach this problem is:
collect all cities.csv data inside a dictionary {"cityname":(lat,lon), ...}
read your speciesname.csv line by line, and for each line look up its city name as the key in the dictionary.
when you find a correspondence, add all the data from the line plus the lat & lon, separated by commas, to a buffer string that ends with a "\n" character.
when the loop over the lines has finished, the buffer string will contain all the data and can be passed to the write-to-file function.
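A rough sketch of that approach (the column names 'city', 'species', 'time', 'lat', 'lon' and the output file name are just assumptions about your headers):

import csv

# build {city: (lat, lon)} from cities.csv
with open('cities.csv', newline='') as f:
    cities = {row['city']: (row['lat'], row['lon']) for row in csv.DictReader(f)}

# collect every speciesname.csv line plus its coordinates in a buffer string
buffer = ''
with open('speciesname.csv', newline='') as f:
    for row in csv.DictReader(f):
        if row['city'] in cities:                     # correspondence found
            lat, lon = cities[row['city']]
            buffer += ','.join([row['city'], row['species'], row['time'], lat, lon]) + '\n'

# write the buffer out in one go
with open('speciesname_with_coords.csv', 'w') as f:
    f.write(buffer)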
Here is a little program that should work if you put it in the same folder as your separate CSVs. I'm assuming you just have 2 sheets, one with the cities and another with the species. Your description saying the cities info is in hundreds of datasheets is confusing, since you then say it's all in one CSV.
This program turns the two separate CSV files into pandas dataframe format which can then be joined on the common city column. Then it creates a new CSV from the joined data frame.
In order for this program to work, you need to install pandas, which is a library specifically for dealing with data in tabular (spreadsheet) format. I don't know what system you are on, so you'll have to find your own instructions from here:
https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html
This is the version to use if your CSVs do not have a header, i.e. when the first row is already data.
# necessary for the functions like pd.read_csv
import pandas as pd
species_column_names = ['city','species','time']
speciesname = pd.read_csv('speciesname.csv', names=species_column_names, header=None)
cities_column_names = ['city','lat','long']
cities = pd.read_csv('cities.csv', names=cities_column_names, header=None)
# this joining function relies on both tables having a 'city' column
combined = speciesname.join(cities.set_index('city'), on='city')
combined_csv = combined.to_csv()
If your files already have headers, use this version instead, which skips the first row; since I don't know how the headers are spelled/capitalized, we ignore them and join on our own lower-case column names:
import pandas as pd
species_column_names = ['city','species','time']
speciesname = pd.read_csv('speciesname.csv', names=species_column_names, skiprows=1, header=None)  # skiprows=1 skips the existing header row
cities_column_names = ['city','lat','long']
cities = pd.read_csv('cities.csv', names=cities_column_names, skiprows=1, header=None)  # skiprows=1 skips the existing header row
# this joining function relies on both tables having a 'city' column
combined = speciesname.join(cities.set_index('city'), on='city')
combined_csv = combined.to_csv()
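Note that both versions only build the CSV text in memory (combined_csv). If you want the result on disk as well, write it out, for example (the output file name here is just an example):

combined.to_csv('speciesname_with_latlong.csv', index=False)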
I am working with pandas, and I've just modified a table.
Now I would like to see my table in Excel, but it's just for a quick look, and I will have to modify the table again later on, so I don't want to save my table anywhere.
In other words, the solution
my_df = pd.DataFrame()
item_path = "my/path"
my_df.to_csv("my/path")
os.startfile(os.path.normpath(item_path))
is not what I want. I would like to obtain the same behavior without saving the DataFrame as a CSV first.
#Something like:
my_df = pd.DataFrame()
start_excel(table_to_load = my_df) #Opens excel with a COPY of my_df
Note
To quickly explore a DataFrame, df.head() is the way to go, but I want to open my DataFrame from a Tkinter application, so I need to use an external program to open this temporary table.
You can have a quick look using:
<dataframe_name>.head()
It will display the top 5 rows by default.
Or you can simply specify how many rows you want:
<dataframe_name>.head(<rows_you_want>)
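If, as the note in the question says, you need an external program rather than .head(), one possible sketch is to write the frame to a throwaway temporary file and hand that to the OS (open_in_excel is just an illustrative name; to_excel requires openpyxl, and os.startfile is Windows-only):

import os
import tempfile
import pandas as pd

def open_in_excel(df: pd.DataFrame) -> None:
    # reserve a temporary .xlsx path, write a copy of the frame there,
    # and let the OS open it with the default application (Excel)
    with tempfile.NamedTemporaryFile(suffix='.xlsx', delete=False) as tmp:
        path = tmp.name
    df.to_excel(path, index=False)
    os.startfile(path)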
I have 10 files which I need to work on.
I need to import those files using pd.read_csv to turn them all into dataframes along with usecols as I only need the same two specific columns from each file.
I then need to search the two columns for a specific entry in the rows, like 'abcd', and have Python return a new df which includes all the rows it appeared in for each file.
Is there a way I could do this using a for loop? So far I've only got a list of all the paths to the 10 files.
So far what I do for one file without the for loop is:
df = pd.read_csv(r'filepath', header=2, usecols=['Column1', 'Column2'])
search_df = df.loc[df['Column1'] == 'abcd']
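For completeness, a minimal sketch of the loop described above (file_paths stands in for your list of 10 paths):

import pandas as pd

file_paths = [r'path\to\file1.csv', r'path\to\file2.csv']   # your 10 paths here

frames = []
for path in file_paths:
    df = pd.read_csv(path, header=2, usecols=['Column1', 'Column2'])
    frames.append(df.loc[df['Column1'] == 'abcd'])

# one dataframe holding every matching row from every file
search_df = pd.concat(frames, ignore_index=True)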
So what I'm trying to do is the following:
I have 300+ CSVs in a certain folder. What I want to do is open each CSV and take only the first row of each.
What I wanted to do was the following:
import os
list_of_csvs = os.listdir() # puts all the names of the csv files into a list.
The above generates a list for me like ['file1.csv','file2.csv','file3.csv'].
This is great and all, but where I get stuck is the next step. I'll demonstrate this using pseudo-code:
import pandas as pd
for index, file in enumerate(list_of_csvs):
    df{index} = pd.read_csv(file)
Basically, I want my for loop to iterate over my list_of_csvs object and read the first item into df1, the 2nd into df2, etc. But upon trying to do this I just realized: I have no idea how to change the variable being assigned to when doing the assignment inside an iteration!
That's what prompts my question. I managed to find another way to get my original job done, no problem, but this issue of doing variable assignment over an iteration is something I haven't been able to find clear answers on!
If I understand your requirement correctly, we can do this quite simply. Let's use pathlib instead of os; pathlib was added in Python 3.4+.
import pandas as pd
from pathlib import Path

csvs = Path.cwd().glob('*.csv')  # returns a generator of the matching csv paths
# change Path.cwd() to Path(your_path) if the script is in a different location

dfs = {}  # let's hold the csvs in this dictionary
for file in csvs:
    dfs[file.stem] = pd.read_csv(file, nrows=3)  # change nrows (number of rows) to your spec

# or with a dict comprehension
dfs = {file.stem: pd.read_csv(file) for file in Path(r'location\of\your\files').glob('*.csv')}
This will return a dictionary of dataframes, with the key being the CSV file name; .stem gives the file name without the extension.
much like
{
'csv_1' : dataframe,
'csv_2' : dataframe
}
If you want to concat these, then do:
df = pd.concat(dfs)
The index will be the csv file name.
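For instance, a small follow-up sketch (source_file is just an illustrative column name):

df = pd.concat(dfs)             # MultiIndex: (csv file name, original row number)
df = df.reset_index(level=0)    # move the file name into a regular column
df = df.rename(columns={'level_0': 'source_file'})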