I'm using a simple code to import an Excel file. However, the command is combining the first two rows into one. I would like to keep it separated (as it is in the Excel file).
db = pd.read_excel('fileaddress', sheet_name='Sheet1')
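A minimal sketch of one likely fix, assuming the two rows that get combined are really a two-row header that pandas is collapsing into one; 'fileaddress' is the placeholder path from the question:
import pandas as pd

# Assumption: the first two rows of Sheet1 form a two-row header.
# header=[0, 1] keeps them as two separate header levels (a MultiIndex).
db = pd.read_excel('fileaddress', sheet_name='Sheet1', header=[0, 1])

# Alternatively, treat every row as plain data and assign no header at all.
raw = pd.read_excel('fileaddress', sheet_name='Sheet1', header=None)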
I'm using Python pandas to read a CSV file, calculate values and then create a new CSV file of the just-calculated values.
My CSV files have several columns. I have used sep=';', but then some of the cell values go missing (all of them start out in "General" format in Excel, but after creating the new CSV file they suddenly go missing and switch to "Custom" format). I have also used sep=',', and then I don't miss any values, but the final CSV is not very easy to read, because all of the values end up in the first and same column.
Any ideas? Thankful for any help!
There is a picture of what I got when using a semicolon as the separator.
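A minimal round-trip sketch, assuming the symptom comes from comma decimals colliding with the semicolon separator; the file names and the decimal=',' setting are placeholder assumptions, not taken from the question:
import pandas as pd

# Read with the same separator the file was written with;
# decimal=',' only matters if the numbers use comma decimals.
df = pd.read_csv('input.csv', sep=';', decimal=',')

# ... calculate the new values here ...

# Write with an explicit separator so Excel splits the columns,
# keeping the same decimal convention so no cell looks empty.
df.to_csv('output.csv', sep=';', decimal=',', index=False)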
I have 10 files which I need to work on.
I need to import those files using pd.read_csv to turn them all into dataframes, using usecols since I only need the same two specific columns from each file.
I then need to search the two columns for a specific entry in the rows, like 'abcd', and have Python return a new df which includes all the rows it appeared in, for each file.
Is there a way I could do this using a for loop? So far I've only got a list of all the paths to the 10 files.
So far what I do for one file without the for loop is:
df = pd.read_csv(r'filepath', header=2, usecols=['Column1', 'Column2'])
search_df = df.loc[df['Column1'] == 'abcd']
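A minimal sketch of the loop, assuming file_paths is the existing list of 10 paths and every file shares the layout used in the snippet above; the column names, header row and search value are the ones from the question:
import pandas as pd

matches = []
for path in file_paths:  # file_paths: the existing list of 10 file paths
    df = pd.read_csv(path, header=2, usecols=['Column1', 'Column2'])
    hit = df.loc[df['Column1'] == 'abcd'].copy()
    hit['source_file'] = path  # optional: remember which file each row came from
    matches.append(hit)

# One dataframe with every matching row from all 10 files.
search_df = pd.concat(matches, ignore_index=True)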
I am trying to read a csv file with some garbage at the top, but also garbage at the bottom of the interesting data. I need to read multiple files and the length of the interesting data varies. Is there a way to let the pd.read_csv command know that the dataframe ends at the first linebreak?
Example data (screenshot from excel):
I read the file with:
dataframe = pd.read_csv(file, skiprows=45)
Which nicely gives me a dataframe with 10 columns, with the headers on line 46 (see image). However, it continues past the #GARBAGE DATA row.
Important note: Neither the length of the data nor the length of the footer is of equal length in the different files I want to read.
Two ways you could implement this:
1) Use the skipfooter parameter of read_csv; it tells the function the number of lines at the bottom of the file to skip (skipfooter needs the Python parsing engine, so pass engine='python' to avoid the fallback warning):
pd.read_csv("in.csv", skiprows=45, skipfooter=2, engine="python")
2) Read the file as it is and later use the dropna function; this should drop the garbage rows:
df.dropna(inplace=True)
After using this command:
dataframe = pd.read_csv(file, skiprows=45)
You can use this command:
dataframe = dataframe.dropna(how='any')
This deletes a row if any empty value is found in it, so it removes all of the garbage rows below the data. Note that it also drops any real data row that happens to contain an empty cell.
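A hedged variation on the same idea: if the real rows can legitimately contain the odd empty cell, dropping only completely empty rows with how='all' is safer. The skiprows value is the one from the question; the file name is a placeholder:
import pandas as pd

file = "in.csv"  # placeholder path
dataframe = pd.read_csv(file, skiprows=45)

# Drop only rows where every column is empty, keeping real rows
# that merely have a missing value here and there.
dataframe = dataframe.dropna(how='all')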
I want to save a single DataFrame into 2 different csv files (splitting the DataFrame) - one would include just the header and another would include the rest of the rows.
I want to save the 2 files under the same directory, so having Spark handle all the logic would be the best option if possible, instead of splitting the csv file using pandas.
What would be the most efficient way to do this?
Thanks for your help!
Let's assume you've got a Dataset called "df".
You can:
Option one: write twice:
df.write.option("header", "false").csv(path)
df.limit(1).write.option("header", "true").csv(headerPath) // as far as I remember, someone had problems with saving a DataFrame without any rows -> you must write at least one row and then manually cut this row out using the normal Java or Python file API
Or you can write once with header = true and then manually cut the header out and place it in a new file using the normal Java API.
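A minimal PySpark sketch of option one, assuming an existing SparkSession named spark and placeholder paths; per the comment above, the single row in the header file still has to be stripped by hand afterwards:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true").csv("input.csv")  # placeholder input

# Rows only, no header line.
df.write.option("header", "false").csv("out/rows")

# Header plus one sacrificial row, because writing an empty DataFrame
# may not emit the header; remove that row manually afterwards.
df.limit(1).write.option("header", "true").csv("out/header")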
Data, without the header:
df.to_csv("data.csv", header=False)
Header, without the data (note the two calls must target different file names, or the second write overwrites the first):
df_new = pd.DataFrame(data=None, columns=df.columns)  # data=None makes sure no rows are copied into the new dataframe
df_new.to_csv("header.csv", index=False)  # index=False keeps the empty index column out of the header line
I have a pandas data frame with two columns:
years of experience and salary
I want to save a csv file with these two columns and also have some stats at the head of the file as in the image:
Is there any option to handle this with pandas or any other library, or do I have to write a script that builds the file line by line, adding the commas between fields?
Pandas does not support what you want to do here. The problem is that your format is not valid CSV. The RFC for CSV (RFC 4180) states that "each record is located on a separate line", implying that a line corresponds to a record, with an optional header line. Your format adds the average and max values, which do not correspond to records.
As I see it, you have three paths to go from here:
1) Create two separate data frames and map them to csv files (to be super precise it would be three), one with your records, one with the additional values.
2) Write your data frame to csv first, then open that file and insert your additional values at the top, as in the sketch below.
3) If your goal is an import into Excel, however, #gefero's suggestion is the right hint: try using the xlsxwriter package to write directly to cells in a spreadsheet.
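A minimal sketch of the second path, assuming a hypothetical frame with the two columns from the question and made-up stats labels; pandas can append to an already-open file handle:
import pandas as pd

# Hypothetical data standing in for the real frame.
df = pd.DataFrame({"years_experience": [1, 3, 5],
                   "salary": [40000, 55000, 70000]})

with open("salaries.csv", "w", newline="") as f:
    # The extra stats lines at the top of the file.
    f.write(f"average salary,{df['salary'].mean()}\n")
    f.write(f"max salary,{df['salary'].max()}\n")
    # The regular header line and records below them.
    df.to_csv(f, index=False)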
You can read the file as two separate parts (stats and csv)
Reading stats:
number_of_stats_rows = 3
stats = pandas.read_csv(file_path, nrows=number_of_stats_rows, header=None).fillna('')
Reading remaining file:
other_data = pandas.read_csv(file_path, skiprows=number_of_stats_rows).fillna('')
Take a look at xlsxwriter. Perhaps it's what you are looking for.
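A minimal xlsxwriter sketch, using the same hypothetical stats and columns as above; the cell positions are illustrative:
import xlsxwriter

workbook = xlsxwriter.Workbook("salaries.xlsx")
sheet = workbook.add_worksheet()

# Stats block at the top of the sheet.
sheet.write(0, 0, "average salary")
sheet.write(0, 1, 55000)
sheet.write(1, 0, "max salary")
sheet.write(1, 1, 70000)

# Header row and records below the stats.
sheet.write_row(3, 0, ["years_experience", "salary"])
for i, row in enumerate([(1, 40000), (3, 55000), (5, 70000)]):
    sheet.write_row(4 + i, 0, list(row))

workbook.close()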