First post here.
I am very new to programming, so sorry if this is confusing.
I built a dataset by collecting several different data series online. All the data are in one xlsx file (one column per series), which I converted to csv afterwards because my teacher only showed us how to use csv files in Python.
I installed pandas and had it read my csv file, but it doesn't seem to understand that I have multiple columns: it reads everything as a single column. As a result, I can't get the info on each series (and so I can't transform the data).
I tried df.info() and df.info(verbose=True, show_counts=True) but the result is the same.
len(df.columns) = 1, which shows it doesn't see that each series has its own column.
len(df) = 1923, which is right.
I was expecting that : https://imgur.com/a/UROKtxN (different project, not the same database)
database used: https://imgur.com/a/Wl1tsYb
And I have that instead : https://imgur.com/a/iV38YNe
database used: https://imgur.com/a/VefFrL4
I don't know; the two look pretty similar to me, so why doesn't mine work? :((
Thanks.
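In case it helps: a common cause of a single-column read is that Excel exported the csv with a semicolon (or tab) separator instead of a comma. A minimal sketch that lets pandas sniff the delimiter; "mydata.csv" is a placeholder filename:

import pandas as pd

# a minimal sketch, assuming the delimiter is not a comma (Excel in some
# locales exports csv with ";"); "mydata.csv" is a placeholder filename
df = pd.read_csv("mydata.csv", sep=None, engine="python")  # sniff the delimiter
print(len(df.columns))  # should now match the number of data series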
Related
My goal is to import a table of astrophysical data that I have saved to my computer (obtained from matching 2 other tables in TOPCAT, if you know it) and extract certain relevant columns, on which I hope to then do further manipulations. I am a complete beginner in Python, so I apologise for basic errors. I've done my best to solve my problem on my own, but I'm a bit lost.
This is the script I have written so far:
import pandas as pd
input_file = "location\\filename"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The file that I'm trying to import is listed as having file type "File" in my drive. I've looked at this file in Notepad and it has a lot of descriptive bumf in the first few rows, so to try and get rid of this I've used "skiprows", as you can see. The data in the file is separated column-wise by vertical bars, at least that's how it appears in Notepad.
The problem is that when I try to extract the first column using "usecols", it instead returns what appears to be the first row in the command window, as well as a load of vertical bars between the values. I assume it is somehow not interpreting the table correctly, not understanding what's a column and what's a row.
What I've tried: Modifying the file and saving it in a different filetype. This gives the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'location\\filename'
Despite the fact that the new file is saved in exactly the same location.
I've tried using "pd.read_table" instead of csv, but this doesn't seem to change anything (nor does it give me an error).
When I've tried to extract multiple columns (i.e. "usecols=[1,2]") I get the following error:
ValueError: Usecols do not match columns, columns expected but not found: [1, 2]
My hope is that someone with experience can give some insight into what's likely going on to cause these problems.
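One plausible diagnosis, given the vertical bars: the file may be pipe-delimited, so passing the separator explicitly could fix both the single-column read and the usecols error. A sketch, where the path and the skiprows count are placeholders from the question:

import pandas as pd

# assumes the TOPCAT output is pipe-delimited; path and skiprows are placeholders
input_file = "location\\filename"
dataset = pd.read_csv(input_file, sep="|", skiprows=12, usecols=[1])
print(dataset.head())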
Maybe you can try dataset.iloc[:, 0]. With iloc you can extract the column or row you want by index. [:, 0] means all the rows of the first column.
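For instance, a minimal illustration of that indexing (the frame here is made up):

import pandas as pd

# a made-up frame, just to show the indexing
dataset = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
first_column = dataset.iloc[:, 0]  # all rows of the first column -> a Series
first_row = dataset.iloc[0, :]     # first row across all columns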
The file is incorrectly named.
I expect that you are reading a csv, xlsx, or txt file, so the (Windows) path would look similar to this:
import pandas as pd
input_file = "C:\\python\\tests\\test_csv.csv"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The error message tells you this:
No such file or directory: 'location\\filename'
Edit: I found a solution to my question. In short: look at the user manual for openpyxl instead of online tutorials. The tutorials raised errors when I tried them (I tried more than one), and their thought process was significantly different from the thought process in the user manual. I also ended up not using pandas as much as I thought I would.
I am trying to update certain values in an Excel file with multiple sheets based on user inputs and then write them back to the Excel file (without deleting the rest of the sheets). So far I have tried the code below, which seems to combine the data, but I didn't quite see how it applied to what I am doing, since I want to update part of one sheet instead of rewriting the whole Excel file. I have also tried a few other things with ExcelWriter, but I don't quite understand it, since it usually wipes all the data in the file (I may be using it wrong).
episode_dataframe = pd.read_excel(r'All_excerpts (Siena Copy)_test.xlsx', sheet_name=episode)
# episode is a string supplied by the user; this builds a DataFrame from that sheet
episode_dataframe.loc[(int(pass_num) - 1), 'Resources'] = resources
# resources is also a user-supplied string; this sets the corresponding cell in the DataFrame
path_R = open("All_excerpts (Siena Copy)_test.xlsx", "rb")
with pd.ExcelWriter(path_R) as writer:
    writer.book = openpyxl.load_workbook(path_R)
    # I copied this from [here][3]; I think it sets up the writer for to_excel, but I don't fully know
    episode_dataframe.to_excel(writer, sheet_name=episode, engine="openpyxl", if_sheet_exists='replace')
    # this should write the sheet's DataFrame back to the file, but I don't want it to delete the other sheets
Additionally, I have been running into a bunch of other smaller errors. A big one was 'Workbook' object has no attribute 'add_worksheet', even though I'm not trying to add a worksheet; I also could not get the posted solution to that error to work.
I am a bit of a novice at Python, so my code might be a bit of a mess.
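For anyone hitting the same wall, a minimal sketch of one way this is commonly done with current pandas (1.3+): open the workbook in append mode so the other sheets survive, replacing only the sheet being edited. The filename and the episode, pass_num, and resources variables mirror the question:

import pandas as pd

path = "All_excerpts (Siena Copy)_test.xlsx"
episode_dataframe = pd.read_excel(path, sheet_name=episode)
episode_dataframe.loc[int(pass_num) - 1, 'Resources'] = resources

# mode="a" appends to the existing workbook; only the named sheet is replaced,
# so the other sheets are left intact
with pd.ExcelWriter(path, mode="a", engine="openpyxl",
                    if_sheet_exists="replace") as writer:
    episode_dataframe.to_excel(writer, sheet_name=episode, index=False)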
I am racking my brain here and have read a lot of tutorials, sites, sample code, etc. Something is not clicking for me.
Here is my desired end state.
Select data from MSSQL - Sorted, not a problem
Open an Excel template (xlsx file) - Sorted, not a problem
Export data to this Excel template and save it with a different name - PROBLEM.
What I have achieved so far: (this works)
I can extract data from DB.
I can write that data to Excel using pandas; my line of code for doing that is: pd.read_sql(script, cnxn).to_excel(filename, sheet_name="Sheet1", startrow=19, encoding="utf-8")
The filename variable holds a new file name that I create on every pass of the for loop.
What my challenge is:
The data needs to be exported to a predefined template (the template has formatting that must be present in every file)
I can open the file and I can write to the file, but I do not know how to save that file under a different name on each iteration of the for loop
In my for loop I use this code:
# this does not work
df = pd.read_sql(script, cnxn)
writer = pd.ExcelWriter(SourcePath)  # opens the source document
df.to_excel(writer)
writer.save()  # how do I save the file under a different name?
Your help would be highly appreciated.
Your method works. The problem is that you don't need to write the data to the Excel file right after you read it from the database. My suggestion is to first read the data into different DataFrames.
df1 = pd.read_sql(script1, cnxn)
df2 = pd.read_sql(script2, cnxn)
df3 = pd.read_sql(script3, cnxn)
You can then write all the DataFrames together to one Excel file. You can refer to this link.
I hope this solution can help you. Have a nice weekend.
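To address the save-as part of the question directly, here is a minimal sketch, assuming pandas 1.4+ with openpyxl installed, where "template.xlsx", scripts, and cnxn stand in for the real template path, query list, and database connection: save a copy of the template under a new name each iteration, then write into that copy.

import pandas as pd
from openpyxl import load_workbook

# hypothetical names: "template.xlsx" is the formatted template, scripts is a
# list of SQL queries, and cnxn is an open database connection
for i, script in enumerate(scripts):
    df = pd.read_sql(script, cnxn)
    book = load_workbook("template.xlsx")
    out_name = f"report_{i}.xlsx"   # a different file name on every iteration
    book.save(out_name)             # the "save as": a fresh copy of the template
    # overlay writes into the existing sheet instead of wiping it
    with pd.ExcelWriter(out_name, mode="a", engine="openpyxl",
                        if_sheet_exists="overlay") as writer:
        df.to_excel(writer, sheet_name="Sheet1", startrow=19)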
I downloaded several CSV files from a finance site. These files are inputs to a Python script that I wrote. The rows in the CSV files don't all have the same number of values (i.e. columns); in fact, blank lines have no values at all.
This is what the first few lines of the downloaded file look like:
Performance Report
Date Produced,14-Feb-2020
When I attempt to add the rows to a pandas DataFrame, the script incurs a "mismatched columns" error.
I got around this by opening up the files in macOS Numbers and manually exporting each file to CSV. However, I don't want to do this every time I download a CSV file from the finance site. I have googled for ways to automate this but have not been successful.
This is what the first few lines of the "Numbers" exported csv file look like:
,,,,,,,
Performance Report,,,,,,
Date Produced,14-Feb-2020,,,,,
,,,,,,,
I have tried playing with the dialect value of the csv module's reader but have not been successful.
I also tried appending the columns manually in the Python script, but that did not work either.
Essentially, midway down the CSV file is the table that I place into the DataFrame. Below is an example.
Asset Name,Opening Balance $,Purchases $,Sales $,Change in Value $,Closing Balance $,Income $,% Return for Period
Asset A,0.00,35.25,66.00,26.51,42.74,5.25,-6.93
...
...
Sub Total,48.86,26,12.29,-16.7,75.82,29.06,
That table, prior to exporting via "Numbers", looks like this:
Asset Name,Opening Balance $,Purchases $,Sales $,Change in Value $,Closing Balance $,Income $,% Return for Period
Asset A,0.00,35.25,66.00,26.51,42.74,5.25,-6.93
...
...
Sub Total,48.86,26,12.29,-16.7,75.82,29.06
Above, the Sub Total row has no value in the last column, and the file does not represent that as a trailing empty field (,""), which would give every row an equal number of columns.
Does anyone have any ideas on how I can automate what the Numbers export does? Any help would be appreciated. I presume these are varying flavours of CSV.
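One possible way to skip the Numbers step entirely, sketched under the assumption that the widest rows have 8 fields (as in the sample) and that the downloaded file is called report.csv: passing an explicit column list lets pandas pad short rows, such as the Sub Total line, with NaN instead of raising an error.

import pandas as pd

# assumes report.csv is the raw download and no row has more than 8 fields
df = pd.read_csv("report.csv", header=None, names=range(8))
# short rows (preamble lines, the Sub Total line) are padded with NaN,
# so every row now has the same number of columns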
In pandas read_csv you can skip rows. If the number of header rows is consistent, then:
pd.read_csv("myfile.csv", skiprows=2)
If the first few lines are not consistent, or the problem is actually deeper within the file, then you might experiment with try: and except:. Without more information on what the data file looks like, I can't come up with a more specific example using them.
There are many ways to do this in your script rather than adding commas by means of a separate program.
One way is to preprocess the file in memory in your script before handing it to pandas.
That said, when you are using pandas you should use its built-in capabilities.
You have not shared what the actual data rows look like, and without that no one can really help you.
I would look into using the following 2 kwargs of read_csv to get the job done (see the sketch after this list).
skiprows as a callable, i.e. make your own function and use it as a filter to drop unwanted rows
error_bad_lines set to False, to just ignore errors and deal with them once the data is in the DataFrame
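A minimal sketch of both kwargs, with an assumed filename and offset; note that pandas 1.3+ deprecates error_bad_lines in favour of on_bad_lines="skip":

import pandas as pd

# assumed: report.csv is the file and the real table starts at row index 4
df = pd.read_csv("report.csv",
                 skiprows=lambda i: i < 4,   # callable: return True to skip that row
                 error_bad_lines=False)      # use on_bad_lines="skip" on pandas >= 1.3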
I want to save a single DataFrame into 2 different csv files (splitting the DataFrame) - one would include just the header and another would include the rest of the rows.
I want to save the 2 files in the same directory, so having Spark handle all the logic would be the best option, if possible, rather than splitting the csv file using pandas.
What would be the most efficient way to do this?
Thanks for your help!
Let's assume you've got a Dataset called "df".
You can:
Option one: write twice:
df.write.(...).option("header", "false").csv(....)
df.limit(1).write.(...).option("header", "true").csv(....) // as far as I remember, someone had problems with saving a DataFrame without rows -> you must write at least one row and then manually cut this row using the normal Java or Python file API
Or you can write once with header = true and then manually cut the header out and place it in a new file using the normal Java file API.
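A PySpark sketch of the write-twice option, assuming df is a Spark DataFrame and the output paths are placeholders; as noted above, the header file still carries one data row that you would trim afterwards:

# "df" is assumed to be an existing Spark DataFrame; paths are placeholders
df.write.option("header", "false").csv("out/data_only")
df.limit(1).write.option("header", "true").csv("out/header_plus_one_row")
# the part-file in the second directory starts with the header line; drop its
# single data row afterwards with ordinary Python file handling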
Data, without header:
df.to_csv("data.csv", header=False)
Header, without data:
df_new = pd.DataFrame(data=None, columns=df.columns)  # data=None makes sure no rows are copied to the new dataframe
df_new.to_csv("header.csv", index=False)  # index=False keeps the header row clean; a different name so the data file isn't overwritten