make custom spreadsheets with python

I have a pandas data frame with two columns:
years of experience and salary
I want to save a CSV file with these two columns and also have some stats at the head of the file, as in the image:
Is there any option to handle this with pandas or any other library, or do I have to write a script that builds the file line by line, adding the commas between fields?

Pandas does not support what you want to do here. The problem is that your format is not valid CSV. The CSV RFC (RFC 4180) states that "each record is located on a separate line", implying that a line corresponds to a record, with an optional header line. Your format adds the average and max values, which do not correspond to records.
As I see it, you have three paths to go from here:
i. You create two separate data frames and map them to separate csv files (to be super precise, three), one with your records, one with the additional values.
ii. You write your data frame to csv first, then open that file and insert your additional values at the top (see the sketch after this list).
iii. If your goal is an import into Excel, however, #gefero's suggestion is the right hint: try using the xlsxwriter package to write directly to cells in a spreadsheet.
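A minimal sketch of option ii, assuming a frame with the two columns from the question (the column names, sample values, and stats labels are illustrative):
import pandas as pd

df = pd.DataFrame({'experience': [1, 2, 3],
                   'salary': [40000, 50000, 60000]})  # illustrative data

with open('salaries.csv', 'w') as f:
    # the stats lines are not valid CSV records, so write them by hand
    f.write(f"average,{df['salary'].mean()}\n")
    f.write(f"max,{df['salary'].max()}\n")
    # then let pandas append the actual records below them
    df.to_csv(f, index=False)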

You can read the file as two separate parts (stats and csv)
Reading the stats:
import pandas
number_of_stats_rows = 3
stats = pandas.read_csv(file_path, nrows=number_of_stats_rows, header=None).fillna('')
Reading the remaining file:
other_data = pandas.read_csv(file_path, skiprows=number_of_stats_rows).fillna('')

Take a look at xlsxwriter. Perhaps it's what you are looking for.
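For what it's worth, a minimal sketch of that route, assuming the two-column frame from the question (the names and cell positions are illustrative, and XlsxWriter produces .xlsx, not csv):
import pandas as pd

df = pd.DataFrame({'experience': [1, 2, 3],
                   'salary': [40000, 50000, 60000]})  # illustrative data

with pd.ExcelWriter('salaries.xlsx', engine='xlsxwriter') as writer:
    # leave the top rows free for the stats block
    df.to_excel(writer, sheet_name='Sheet1', startrow=3, index=False)
    worksheet = writer.sheets['Sheet1']
    worksheet.write('A1', 'average')
    worksheet.write('B1', df['salary'].mean())
    worksheet.write('A2', 'max')
    worksheet.write('B2', df['salary'].max())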

Related

displaying multiple pandas dataframes created in python in the same csv file

So I have multiple data tables saved as pandas dataframes, and I want to output all of them into the same CSV for ease of access. However, I am not really sure of the best way to go about this, as I want to maintain each dataframe's inherent structure (i.e. columns and index), so I can combine them all into one single dataframe.
You have two choices:
Either you combine them first (pd.concat()) with all the advantages and limitations of that approach, then you can call .to_csv() and it will produce one file. If they are structurally the same, this is great because you will be able to read the file back in.
Or you call .to_csv() multiple times and save the output in a "buffer", which you can then write out (see here, and the sketch below). This is probably the only way if your DataFrames are very different from a structural perspective, but it makes them a mess to read back later.
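A minimal sketch of the buffer approach, assuming two illustrative frames (df.to_csv() returns the CSV as a string when no path is given):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})         # illustrative frames with
df2 = pd.DataFrame({'x': [3], 'y': [4]})  # different structures

buffer = ''
for df in (df1, df2):
    # each frame keeps its own header; a blank line separates the tables
    buffer += df.to_csv(index=False) + '\n'

with open('combined.csv', 'w') as f:
    f.write(buffer)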
Is .json output an option for what you want to do?
Thanks a lot for the comment Kingotto, I used the first option, added this code, and it was able to help me arrange my functions horizontally and export the file to CSV like this:
frames = pd.concat([file_1, file_2, file_3], axis=1)
# save the dataframe
frames.to_csv('Combined.csv', index=False)

How do I pull specific data from one file and add it to another file in a specific spot?

I am learning how to use python.
For the project I am working on, I have hundreds of datasheets containing a City, Species, and Time (speciesname.csv).
I also have a single datasheet that has all cities in the world with their latitude and longitude point (cities.csv).
My goal is to have 2 more columns for latitude and longitude (from cities.csv) in every (speciesname.csv) datasheet, corresponding to the location of each species.
I am guessing my workflow will look something like this:
Go into speciesname.csv file and find the location on each line
Go into cities.csv and search for the location from speciesname.csv
Copy the corresponding latitude and longitude into new columns in speciesname.csv.
I have been unsuccessful in my search for a blog post or someone else with a similar question. I don't know where to start, so any starting point would be very helpful.
Thank you.
You can achieve this in many ways.
The simplest way I can think of to approach this problem is:
collect all the cities.csv data inside a dictionary {"cityname": (lat, lon), ...}
read your speciesname.csv line by line, and for each line look up its city name as a key in the dictionary
when you find a correspondence, add all the data from the line plus the lat and lon, separated by commas, to a buffer string ending with a "\n" char
when the loop over the lines has ended, your buffer string contains all the data and can be passed to the write-to-file function (see the sketch after this list)
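A minimal sketch of that approach with the standard csv module, assuming neither file has a header row, cities.csv holds city,lat,lon and speciesname.csv holds city,species,time (all of these layouts are assumptions from the question):
import csv

# collect all cities.csv data inside a dictionary {"cityname": (lat, lon), ...}
with open('cities.csv', newline='') as f:
    cities = {row[0]: (row[1], row[2]) for row in csv.reader(f)}

buffer = ''
with open('speciesname.csv', newline='') as f:
    for row in csv.reader(f):
        city = row[0]
        if city in cities:  # correspondence found in the dictionary
            lat, lon = cities[city]
            buffer += ','.join(row + [lat, lon]) + '\n'

# the buffer now contains all the data and can be written out
with open('speciesname_with_coords.csv', 'w') as f:
    f.write(buffer)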
Here is a little program that should work if you put it in the same folder as your separate CSVs. I'm assuming you just have two sheets, one with the cities and another with the species. Your description saying the species info is in hundreds of datasheets is confusing, since then you say it's all in one csv.
This program turns the two separate CSV files into pandas dataframe format which can then be joined on the common city column. Then it creates a new CSV from the joined data frame.
In order for this program to work, you need to install pandas, which is a library specifically for dealing with things in tabular (spreadsheet) format. I don't know what system you are on, so you'll have to find your own instructions from here:
https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html
This is the version for when your CSVs do not have a header, i.e. when the first row is just data.
# necessary for the functions like pd.read_csv
import pandas as pd
species_column_names = ['city','species','time']
speciesname = pd.read_csv('speciesname.csv', names=species_column_names, header=None)
cities_column_names = ['city','lat','long']
cities = pd.read_csv('cities.csv', names=cities_column_names, header=None)
# this joining function relies on both tables having a 'city' column
combined = speciesname.join(cities.set_index('city'), on='city')
combined_csv = combined.to_csv()  # with no path, to_csv() returns the CSV as a string; pass a filename to write a file
If you already have headers in both files, use this version instead; it skips the first row of each file, since I don't know how your headers are spelled/capitalized and we are joining based on all-lowercase custom column names:
import pandas as pd
species_column_names = ['city','species','time']
speciesname = pd.read_csv('speciesname.csv', names=species_column_names, skiprows=1, header=None)  # skiprows=1 drops the file's own header row
cities_column_names = ['city','lat','long']
cities = pd.read_csv('cities.csv', names=cities_column_names, skiprows=1, header=None)
# this joining function relies on both tables having a 'city' column
combined = speciesname.join(cities.set_index('city'), on='city')
combined_csv = combined.to_csv()  # again, pass a filename to write an actual file

Convert CSV file to CSV with the same number of columns, via the command line

I downloaded several CSV files from a finance site. These files are inputs to a Python script that I wrote. The rows in the CSV files don't all have the same number of values (i.e. columns). In fact, on blank lines there are no values at all.
This is what the first few line of the downloaded file look like :
Performance Report
Date Produced,14-Feb-2020
When I attempt to add the rows to a pandas DataFrame, the script incurs a "mismatched columns" error.
I got around this by opening the files in macOS Numbers and manually exporting each file to CSV. However, I don't want to do this each time I download a CSV file from the finance site. I have googled for ways to automate this but have not been successful.
This is what the first few lines of the "Numbers" exported csv file looks like:
,,,,,,,
Performance Report,,,,,,
Date Produced,14-Feb-2020,,,,,
,,,,,,,
I have tried playing with the dialect value of the csv module's reader but have not been successful.
I have also tried appending the columns manually in the Python script, without success.
Essentially, midway down the CSV file is the table that I place into the DataFrame. Below is an example.
Asset Name,Opening Balance $,Purchases $,Sales $,Change in Value $,Closing Balance $,Income $,% Return for Period
Asset A,0.00,35.25,66.00,26.51,42.74,5.25,-6.93
...
...
Sub Total,48.86,26,12.29,-16.7,75.82,29.06,
That table prior to exporting via "Numbers" looks like so:
Asset Name,Opening Balance $,Purchases $,Sales $,Change in Value $,Closing Balance $,Income $,% Return for Period
Asset A,0.00,35.25,66.00,26.51,42.74,5.25,-6.93
...
...
Sub Total,48.86,26,12.29,-16.7,75.82,29.06
Above, the Sub Total row does not have a value in the last column, and does not represent it as ,"" (which would make all rows have an equal number of columns).
Does anyone have any ideas on how I can automate the Numbers export process? Any help would be appreciated. I presume these are varying formats of CSV.
In pandas read_csv you can skip rows. If the number of header rows is consistent, then:
pd.read_csv('myfile.csv', skiprows=2)
If the first few lines are not consistent, or the problem is actually deeper within the file, then you might experiment with try: and except:. Without more information on what the data file looks like, I can't come up with a more specific example (a generic sketch follows).
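For illustration only, a generic sketch of that idea (the filename and the fallback are assumptions; on_bad_lines needs pandas 1.3+, older versions use error_bad_lines=False):
import pandas as pd
from pandas.errors import ParserError

try:
    df = pd.read_csv('myfile.csv', skiprows=2)
except ParserError:
    # fall back to skipping malformed lines entirely
    df = pd.read_csv('myfile.csv', skiprows=2, on_bad_lines='skip')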
There are many ways to do this in your script, rather than adding commas by means of separate programs.
One way is to preprocess the file in memory in your script before handing it to pandas, as in the following sketch.
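A minimal sketch of that preprocessing idea, assuming (based on the samples above) that the table starts at the 'Asset Name' header line; the filename is illustrative:
import io
import pandas as pd

with open('report.csv') as f:
    lines = f.readlines()

# drop everything above the table; we assume the real header
# is the first line starting with 'Asset Name'
start = next(i for i, line in enumerate(lines) if line.startswith('Asset Name'))

df = pd.read_csv(io.StringIO(''.join(lines[start:])))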
That said, when you are using pandas you should use the built-in power of pandas. You have not shared what the actual data rows look like, and without that no one can actually help you.
I would look into using the following two kwargs of read_csv to get the job done (sketched below):
skiprows as a callable, i.e. make your own function and use it as a filter to filter unwanted rows away
error_bad_lines set to False to just ignore errors and deal with them once the data is in the dataframe
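A sketch of both kwargs together (the cut-off of 5 rows is an assumption about where the table starts; on pandas 1.3+ error_bad_lines was replaced by on_bad_lines):
import pandas as pd

df = pd.read_csv(
    'report.csv',
    skiprows=lambda i: i < 5,  # callable filter: skip the report lines above the table
    on_bad_lines='skip',       # use error_bad_lines=False on older pandas
)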

Creating a dataframe from a csv file in pandas: column issue

I have a messy text file that I need to sort into columns in a dataframe so I can do the data analysis I need to do. Here is the messy looking file:
[image: messy text file]
I can read it in as a csv file, that looks a bit nicer using:
import pandas as pd
data = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt')
print(data)
And this prints out the data aligned, but the issue is that the output is [640 rows x 1 column]. And I need to separate it into multiple columns and manipulate it as a dataframe.
I have tried a number of solutions using StringIO that have worked here before, but nothing seems to be doing the trick.
Try passing delim_whitespace=True (link to docs ^):
df = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt', delim_whitespace=True)
Your input file is actually not in CSV format.
As you provided only a .png picture, it is not even clear whether this file is divided into rows or not.
If not, you have to start by "cutting" the content into individual lines and reading the content from the output file, i.e. the result of this cutting.
I think this is the first step before you can use either read_csv or read_table (of course, with delim_whitespace=True).
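As a sketch of that cutting step, assuming the raw file is one flat run of whitespace-separated values with a known number of columns (both the filename and n_cols are assumptions):
import io
import pandas as pd

n_cols = 4  # assumed number of values per record

with open('phx_30kV_indepth_0_0_outfile.txt') as f:
    tokens = f.read().split()

# cut the flat token stream into lines of n_cols values each
rows = [' '.join(tokens[i:i + n_cols]) for i in range(0, len(tokens), n_cols)]

df = pd.read_csv(io.StringIO('\n'.join(rows)), delim_whitespace=True, header=None)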

Pyspark - write a dataframe into 2 different csv files

I want to save a single DataFrame into 2 different csv files (splitting the DataFrame) - one would include just the header and another would include the rest of the rows.
I want to save the 2 files under the same directory so Spark handling all the logic would be the best option if possible instead of splitting the csv file using pandas.
what would be the most efficient way to do this?
Thanks for your help!
Let's assume you've got a Dataset called "df".
You can:
Option one: write twice:
df.write.option("header", "false").csv(...)
df.limit(1).write.option("header", "true").csv(...) // as far as I remember, someone had problems with saving a DataFrame without rows -> you must write at least one row and then manually cut this row using the normal Java or Python file API (see the sketch below)
Or you can write once with header = true and then manually cut the header out and place it in a new file using the normal Java API.
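Putting option one together as a PySpark sketch, assuming the output goes to the local filesystem so the extra row can be cut with the plain Python file API (all paths are illustrative):
import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true").csv("input.csv")  # illustrative input

# rows only, no header
df.write.option("header", "false").csv("out/rows")

# header plus one data row (Spark may write nothing for an empty DataFrame)
df.limit(1).write.option("header", "true").csv("out/header_tmp")

# keep only the first (header) line of the part file Spark produced
part = glob.glob("out/header_tmp/part-*.csv")[0]
with open(part) as f:
    header = f.readline()
with open("out/header.csv", "w") as f:
    f.write(header)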
Data, without header:
df.to_csv("filename.csv", header=False)
Header, without data:
df_new = pd.DataFrame(data=None, columns=df_old.columns)  # data=None makes sure no rows are copied to the new dataframe
df_new.to_csv("header.csv", index=False)  # illustrative second filename, so the data file is not overwritten
