I have a csv file with comments marked by '#'. I want to select only the table part and get it into a pandas dataframe. I could just check for the '#' marks and the table header and delete those lines, but that would not be dynamic enough: if the csv file is changed slightly, it won't work.
Please help me figure out a way to extract only the table part from this csv file.
There is a comment argument when you read in your file, but each metadata line has to start with that character, or it will not be treated as a comment.
import pandas as pd

# lines starting with '#' are skipped as comments; fields are separated by ';'
df = pd.read_csv('path/to/file.csv', sep=';', comment='#')
The csv format itself has no notion of comments, so if the comment argument doesn't cover your case, you have to delete the comment lines manually. Try checking from the end of the file and stop once a line contains '#' but no ';'.
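A rough sketch of that idea, assuming ';' is the field separator and that metadata lines contain '#' but no ';':

import pandas as pd
from io import StringIO

with open('path/to/file.csv') as f:
    lines = f.readlines()

# walk backwards from the end and keep lines until the first
# comment line (a line containing '#' but no ';') is hit
table = []
for line in reversed(lines):
    if '#' in line and ';' not in line:
        break
    table.append(line)
table.reverse()

df = pd.read_csv(StringIO(''.join(table)), sep=';')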
I have a .csv file that has (45211 rows, 1 column), but I need to create a new .csv file with (45211 rows, 17 columns).
These are the column names
age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
I've added a screenshot of the .csv file that I already have.
In pandas, the read_csv method has a sep option for setting the separator, which is ',' by default. To override it, you can use:
pandas.read_csv(<PATH_TO_CSV_FILE>, sep=';', header=0)
This will return a new dataframe with the correct format. The header=0 might not be needed, but it forces read_csv to treat the first line of the CSV file as the column headers.
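For instance, a minimal sketch that reads the semicolon-separated file and writes out a regular 17-column csv (the file names are placeholders):

import pandas as pd

# header=0 treats the first line as the column names
df = pd.read_csv('input.csv', sep=';', header=0)
# write a new comma-separated file; index=False avoids an extra index column
df.to_csv('output.csv', index=False)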
Open the CSV in Excel
Select all the data
Choose the Data tab atop the ribbon.
Select Text to Columns.
Ensure Delimited is selected and click Next.
Clear each box in the Delimiters section and instead choose Semicolon.
Click Finish.
I'm struggling with one task that could save plenty of time. I'm new to Python so please don't kill me :)
I've got a huge txt file with millions of records. I used to split them in MS Access with the delimiter "|", filter the data down to about 400K records, and then copy them to Excel.
So basically the file looks like:
What I would like to have:
I'm using Spyder, so it would be great to see the data in the Variable Explorer so I can easily check it and (after additional filters) export it to Excel.
I use LibreOffice so I'm not 100% sure about Excel, but if you change the .txt extension to .csv and open the file with Excel, it should let you change the delimiter from a comma to '|' and then import it directly. That works with LibreOffice Calc, anyway.
You have to split the file into lines, then split each line on the character '|' and map the data to a list of dicts.
with open('filename') as f:
    # split each line on '|'; the first two fields become 'id' and 'fname'
    data = [{'id': fields[0], 'fname': fields[1]}
            for fields in (line.strip().split('|') for line in f)]
You have to fill in the rest of the fields.
Doing this with pandas will be much easier
Note: I am assuming that each entry is on a new line.
import pandas as pd
data = pd.read_csv("data.txt", delimiter='|')
# Do something with the data here, or leave it as is if you just want to convert the text file to an Excel file
data.to_excel("data.xlsx")
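If you want to filter the data before exporting, something along these lines should work (the column name 'status' and its value are purely hypothetical):

import pandas as pd

data = pd.read_csv("data.txt", delimiter='|')
# hypothetical filter: keep only the rows where the 'status' column equals 'active'
filtered = data[data['status'] == 'active']
filtered.to_excel("data.xlsx", index=False)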
I am trying to code a function where I grab data from my database, which already works correctly.
This is my code for the headers prior to adding the actual records:
import csv

with open('csv_template.csv', 'a') as template_file:
    # create the writer, ready for appending
    template_writer = csv.writer(template_file, delimiter=',')
    # append the column names ahead of the actual data
    template_writer.writerow(['Arrangement_ID', 'Quantity', 'Cost'])
# the with block closes the file automatically, so no explicit close is needed
This is my code for the records, which is contained in a while loop; that loop is the main reason the two scripts are kept separate.
with open('csv_template.csv', 'a') as template_file:
    # create the writer, ready for appending
    template_writer = csv.writer(template_file, delimiter=',')
    # append the values fetched by the SQL statement on the current pass of the while loop
    template_writer.writerow([transactionWordData[0], transactionWordData[1], transactionWordData[2]])
# again, the with block closes the file automatically
Now once I have this data ready for Excel, I open the file in Excel and would like it to be in a format where I can print immediately. However, when I do print, the column width of the Excel cells is too small, which leads to the text being cut off during printing.
I have tried altering the default column width within Excel, hoping that it would keep that format permanently, but that doesn't seem to be the case: every time I re-open the csv file in Excel, it resets completely back to the default column width.
Here is my code for opening the csv file in Excel from Python; the commented-out line is the code I actually want to use once I can format the spreadsheet ready for printing.
import os

# find the absolute path of the csv file, wherever it sits in the directory tree
file_path = os.path.abspath("csv_template.csv")
# open the csv file in Excel, ready to print
os.startfile(file_path)
#os.startfile(file_path, 'print')
If anyone has any solutions to this or ideas please let me know.
Unfortunately I don't think this is possible for the CSV file format, since it is just plain-text comma-separated values and doesn't support formatting.
You wrote: "I have tried altering the default column width within excel but every time that I re-open the csv file in excel it seems to reset back to the default column width."
If you save the file in an Excel format (e.g. .xlsx) once you have edited it, that should solve the problem.
Alternatively, instead of using the csv library, you could use xlsxwriter, which does allow you to set the width of the columns in your code.
See https://xlsxwriter.readthedocs.io and https://xlsxwriter.readthedocs.io/worksheet.html#worksheet-set-column.
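A minimal sketch of that approach (the column width of 20 and the file name are arbitrary choices):

import xlsxwriter

workbook = xlsxwriter.Workbook('csv_template.xlsx')
worksheet = workbook.add_worksheet()
# widen columns A to C so the values are not cut off when printing
worksheet.set_column('A:C', 20)
worksheet.write_row(0, 0, ['Arrangement_ID', 'Quantity', 'Cost'])
workbook.close()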
Hope this helps!
The csv format is nothing more than a text file where the lines follow a given pattern: a fixed number of fields (your data) delimited by commas. In contrast, an .xlsx file is a binary file that contains formatting specifications. Therefore you may want to write to an Excel file instead, using the rich pandas library.
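For example, a small sketch that converts the csv produced earlier into a real workbook (assuming the same file name as above):

import pandas as pd

# read the generated csv and save it as .xlsx, a format that can store
# things like column widths, unlike plain csv
df = pd.read_csv('csv_template.csv')
df.to_excel('csv_template.xlsx', index=False)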
Since the values are strings, you can pad them with spaces and the width will adjust automatically. Do it like this:
template_writer.writerow(['Arrangement_ID ','Quantity ','Cost '])
I am trying to add a header to my existing csv file, which already has content in it. I am just wondering if there is any piece of code that could insert a header row at the top (such as ['name','age','salary','country']) without affecting the contents.
Also, this code is connected to an API, so I will run it multiple times. I'm wondering if it is possible to detect whether a header already exists, to avoid multiple header lines.
Thank you, and I hope you all have a good day!
Your question has 2 parts:
1) To add a header to your csv (when it does not exist)
In order to insert the header row, you can read the csv with the command below:
df = pd.read_csv('filename.csv', header=None, names=['name','age','salary','country'])
To create the csv with a header row without affecting the contents, you can use the command below:
df.to_csv('new_file_with_header.csv', header=True, index=False)  # index=False avoids writing an extra index column
2) The second part is a little tricky. To infer whether your file has a header or not, you will have to write a little code. I can give you the algorithm:
Read the csv with explicit column names:
df = pd.read_csv('filename.csv', header=None, names=['name','age','salary','country'])
Check the first row, first column of your csv: if it contains the value 'name', the header already exists, so write the csv from the 2nd row to the end; otherwise write it as is:
temp_var = df['name'].iloc[0]
if temp_var == 'name':
    # header already present: skip the duplicated header row
    df.iloc[1:].to_csv('new_file.csv', index=False)
else:
    df.to_csv('new_file.csv', index=False)
Hope this helps!!
Thanks,
Rohan Hodarkar
My CSV file has 3 columns: Name, Age and Sex, and the sample data is:
AlexÇ39ÇM
#Ç#SheebaÇ35ÇF
#Ç#RiyaÇ10ÇF
The column delimiter is 'Ç' and the record delimiter is '#Ç#'. Note that the first record doesn't have the record delimiter (#Ç#), but all other records do. Could you please tell me how to read this file and store it in a dataframe?
Both the csv and pandas modules support reading csv files directly. However, since you need to modify the file contents line by line before further processing, I suggest reading the file line by line, modifying each line as desired, and storing the processed data in a list for further handling.
The necessary steps include:
open file
read file line by line
remove the newline char (which is part of the line when using readlines())
replace record delimiter (since a record is equivalent to a line)
split lines at column delimiter
Since .split() returns a list of string elements, we get an overall list of lists, where each sub-list contains/represents the data of one line/record. Data formatted like this can be read by pandas.DataFrame.from_records(), which comes in quite handy at this point:
import pandas as pd

with open('myData.csv') as file:
    # `.strip()` removes the newline character from each line
    # `.replace('#;#', '')` removes the record delimiter '#;#' from each line
    # `.split(';')` splits at the column delimiter and returns a list of string elements
    lines = [line.strip().replace('#;#', '').split(';') for line in file.readlines()]

df = pd.DataFrame.from_records(lines, columns=['Name', 'Age', 'Sex'])
print(df)
Remarks:
I changed Ç to ; which worked better for me due to encoding issues. However, the basic idea of the algorithm is still the same.
Reading data manually like this can become quite resource-intensive, which might be a problem when handling larger files. There might be more elegant ways which I am not aware of. If you run into performance problems, try reading the file in chunks or look for more efficient implementations.
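For example, a sketch of the chunked variant, reusing the ';' delimiters from the code above (the chunk size is an arbitrary choice):

import pandas as pd

frames = []
chunk = []
with open('myData.csv') as file:
    for line in file:
        chunk.append(line.strip().replace('#;#', '').split(';'))
        if len(chunk) == 100000:  # arbitrary chunk size
            frames.append(pd.DataFrame.from_records(chunk, columns=['Name', 'Age', 'Sex']))
            chunk = []
if chunk:
    frames.append(pd.DataFrame.from_records(chunk, columns=['Name', 'Age', 'Sex']))
df = pd.concat(frames, ignore_index=True)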