Huge txt file with one column (text to columns in python) - python

I'm struggling with a task that, once solved, could save me plenty of time. I'm new to Python, so please don't kill me :)
I've got a huge txt file with millions of records. I used to split them in MS Access using the delimiter "|", filter the data down to about 400K records, and then copy the result to Excel.
So basically file looks like:
What I would like to have:
I'm using Spyder so it would be great to see data in variable explorer so I can easily check and (after additional filters) export it to excel.

I use LibreOffice, so I'm not 100% sure about Excel, but if you rename the .txt file to .csv and open it with Excel, it should let you change the delimiter from a comma to '|' and import the file directly. That works in LibreOffice Calc, anyway.

You have to read the file line by line, split each line on the '|' character, and map the fields to a list of dicts.
with open('filename') as f:
    data = [{'id': fields[0], 'fname': fields[1]}
            for fields in (line.rstrip('\n').split('|') for line in f)]
You have to fill in the rest of the fields.

Doing this with pandas will be much easier
Note: I am assuming that each entry is on a new line.
import pandas as pd
data = pd.read_csv("data.txt", delimiter='|')
# Do something here or let it be if you want to just convert text file to excel file
data.to_excel("data.xlsx")
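Since the goal is to get from millions of records down to about 400K, it may help to filter while reading, so that only the rows of interest are kept. Below is a minimal sketch using the standard csv module. It assumes the first line of the file holds the column names; the helper name filter_rows, the sample data, and the "country" column are hypothetical, because the question's sample data is not shown.

```python
import csv
import io

def filter_rows(lines, keep, delimiter="|"):
    """Return (header, rows), keeping only rows for which keep(row) is true.

    lines is any iterable of text lines, e.g. an open file.
    Assumes the first line contains the column names.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    rows = [row for row in reader if keep(row)]
    return reader.fieldnames, rows

# Hypothetical sample standing in for the real file.
sample = io.StringIO("id|fname|country\n1|Ann|PL\n2|Bob|DE\n3|Cid|PL\n")
header, rows = filter_rows(sample, lambda r: r["country"] == "PL")
print(header)
print(rows)
```

With a real file, pass the file object from open("data.txt", newline="") instead of the StringIO sample. The resulting list of dicts can be inspected in Spyder's variable explorer, or turned into a table with pandas.DataFrame(rows) before exporting.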

Related

Pandas read_excel get only last row

I have an excel that is generated daily and can have up to 50k+ rows. Is there a way to read only the last row (which is the sum of the columns)?
Right now I am just reading the entire sheet and keeping only the last row, but this takes up a huge amount of runtime.
My code:
df=pd.read_excel(filepath,header=1,usecols="O:AC")
df=df.tail(1)
Pandas is quite slow, especially with large in-memory data. You could consider a lazy-loading approach, for example with dask.
Otherwise, if the data is in a plain-text format such as .csv, you can read the file using open and take the last line:
with open(filepath, "r") as file:
    last_line = file.readlines()[-1]
I don't think there is a way to decrease the runtime when you read an Excel file.
When you read an Excel file, or one sheet of it, all of its data is loaded into memory. Even if you use the skiprows argument of pd.read_excel, the rows are only dropped after all the data has been loaded, so it can't decrease the runtime.
If you really want to decrease the time spent reading the file, you should save it in another format, such as .csv or .txt.
Also, you generally can't read Microsoft Excel files as text files using methods like readlines or read. You should either convert the files to another format first (a good choice is .csv, which can be read by the csv module) or use dedicated Python modules like pyexcel and openpyxl to read .xlsx files directly.
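If the daily file can be saved as .csv, as suggested above, the last row can be picked out without holding every line in memory. This is only a sketch under that assumption: the helper name last_csv_row is hypothetical, the file is still read from start to end once, and it assumes the last row contains no quoted fields with embedded line breaks.

```python
import csv
import io
from collections import deque

def last_csv_row(lines, delimiter=","):
    """Return the final row of a CSV stream as a list of strings.

    Iterates over the lines once and keeps only the most recent one.
    Returns None for an empty input.
    """
    tail = deque(lines, maxlen=1)
    if not tail:
        return None
    return next(csv.reader(tail, delimiter=delimiter))

# Hypothetical sample; with a real file use open(filepath, newline="").
sample = io.StringIO("a,b\n1,2\n3,4\n")
total_row = last_csv_row(sample)
print(total_row)
```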

Is there a way I can extract multiple pieces of data from multiple text files in python and save them as rows in a new .csv file?

Is there a way I can extract multiple pieces of data from a text file in python and save it as a row in a new .csv file? I need to do this for multiple input files and save the output as a single .csv file for all of the input files.
I have never used Python before so I am quite clueless. I have used matlab before and I know how I would do it in matlab if it was numbers (but unfortunately it is text which is why I am trying python). So to be clear I need a new line in the .csv output file for each "ID" in the input files.
An example of the data is shown below (2 separate files):
EXAMPLE DATA - FILE 1:
id,ARI201803290
version,2
info,visteam,COL
info,hometeam,ARI
info,site,PHO01
info,date,2018/03/29
data,er,corbp001,2
id,ARI201803300
version,2
info,visteam,COL
info,hometeam,ARI
info,site,PHO01
info,date,2018/03/30
data,er,delaj001,0
EXAMPLE DATA - FILE 2:
id,NYN201803290
version,2
info,visteam,SLN
info,hometeam,NYN
info,site,NYC20
info,usedh,false
info,date,2018/03/29
data,er,famij001,0
id,NYN201803310
version,2
info,visteam,SLN
info,hometeam,NYN
info,site,NYC20
info,date,2018/03/31
data,er,gselr001,0
I'm hoping to get the data in a .csv format with all the details from one "id" on 1 line. There are multiple "id's" per text file and there are multiple files. I want to repeat this process for multiple text files so the outputs are in the same .csv output file. I want the output to look as follows in the .csv file, with each piece of info as a new cell:
ARI201803290 COL ARI PHO01 2018/03/29 2
ARI201803300 COL ARI PHO01 2018/03/30 0
NYN201803290 SLN NYN NYC20 2018/03/29 0
NYN201803310 SLN NYN NYC20 2018/03/31 0
If I was doing it in matlab I'd use a for loop and if statement and say
j=1
k=1
for i=1:size(file1, 1)
    if file1(i,1)==id
        output(k,1)=file1(i,2)
        k=k+1
    elseif file1(i,1)==info
        output(j,2)=file1(i,3)
        j=j+1
    etc.....
However I obviously can't do this in matlab because I have comma separated text files, not a matrix. Does anyone have any suggestions how I can translate my idea to python code? Or any other suggestion. I am super new to python so willing to try anything that might work.
Thank you very much in advance!
Python is very flexible and can do these jobs very easily.
There are a lot of csv tools/modules in Python to handle pretty much all types of csv and Excel files. However, I prefer to handle a csv the same as a text file, because a csv is simply a text file with comma-separated text, and simple is better than complicated.
Below is the code, with comments to explain most of it. You can tweak it to match your needs exactly.
import os
input_folder = 'myfolder/'  # path of folder containing the text files on your disk
# create a list with file names with their full paths using list comprehension
data_files = [os.path.join(input_folder, file) for file in os.listdir(input_folder)]
# open our csv file for writing
csv = open('myoutput.csv', 'w')  # better to open files with context manager like below but i am trying to show you different methods
def write_to_csv(line):
    print(line)
    csv.write(line)
# loop thru your text files
for file in data_files:
    with open(file, 'r') as f:  # use context manager to open files (best practice)
        buff = []
        for line in f:
            line = line.strip()  # remove spaces and new lines
            line = line.split(',')  # split line to list of values
            if buff and line[0] == 'id':  # hit another 'id'
                write_to_csv(','.join(buff) + '\n')
                buff = []
            buff.append(line[-1])  # add the last word in line
        write_to_csv(','.join(buff) + '\n')
csv.close()  # must close any open file handles opened manually "no context manager i.e. no with"
output:
ARI201803290,2,COL,ARI,PHO01,2018/03/29,2
ARI201803300,2,COL,ARI,PHO01,2018/03/30,0
NYN201803290,2,SLN,NYN,NYC20,false,2018/03/29,0
NYN201803310,2,SLN,NYN,NYC20,2018/03/31,0
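The output above also contains the "version" and "usedh" values, which the question did not ask for. A possible variation, sketched here with the standard csv module, keeps only selected fields per "id". The names WANTED_INFO, parse_games, and convert are hypothetical, and the sketch assumes at most one "data,er" line per id, as in the sample data; with several such lines the last one would win.

```python
import csv
import io

# Assumed selection of "info" keys, taken from the desired output in the question.
WANTED_INFO = ("visteam", "hometeam", "site", "date")

def _to_row(game):
    """Flatten one game dict into: id, wanted info values, earned runs."""
    return [game["id"]] + [game.get(k, "") for k in WANTED_INFO] + [game.get("er", "")]

def parse_games(lines):
    """Yield one output row per 'id' record found in lines."""
    game = None
    for fields in csv.reader(lines):
        if not fields:
            continue  # skip blank lines
        kind = fields[0]
        if kind == "id" and len(fields) >= 2:
            if game is not None:
                yield _to_row(game)  # previous record is complete
            game = {"id": fields[1]}
        elif game is None:
            continue  # ignore anything before the first id
        elif kind == "info" and len(fields) >= 3 and fields[1] in WANTED_INFO:
            game[fields[1]] = fields[2]
        elif kind == "data" and len(fields) >= 4 and fields[1] == "er":
            game["er"] = fields[3]
    if game is not None:
        yield _to_row(game)

def convert(paths, out_path):
    """Write the rows from every input file into a single output CSV."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="") as f:
                writer.writerows(parse_games(f))

# Small in-memory check using one record from FILE 2 of the question.
sample = io.StringIO(
    "id,NYN201803290\nversion,2\ninfo,visteam,SLN\ninfo,hometeam,NYN\n"
    "info,site,NYC20\ninfo,usedh,false\ninfo,date,2018/03/29\ndata,er,famij001,0\n"
)
rows = list(parse_games(sample))
print(rows)
```

Keeping the parsing in a generator separates it from file handling, so convert can be pointed at any list of paths, for example the data_files list built with os.listdir.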

How do I increase the default column width of a csv file so that when I open the file all of the text fits correctly?

I am trying to code a function where I grab data from my database, which already works correctly.
This is my code for the headers prior to adding the actual records:
import csv
with open('csv_template.csv', 'a', newline='') as template_file:
    # declares the variable template_writer ready for appending
    template_writer = csv.writer(template_file, delimiter=',')
    # appends the column names of the excel table prior to adding the actual physical data
    template_writer.writerow(['Arrangement_ID', 'Quantity', 'Cost'])
# the with block closes the file automatically, so no explicit close() is needed
This is my code for the records which is contained in a while loop and is the main reason that the two scripts are kept separate.
import csv
with open('csv_template.csv', 'a', newline='') as template_file:
    # declares the variable template_writer ready for appending
    template_writer = csv.writer(template_file, delimiter=',')
    # appends the data of the current fetched values of the sql statement within the while loop to the csv file
    template_writer.writerow([transactionWordData[0], transactionWordData[1], transactionWordData[2]])
# the with block closes the file automatically, so no explicit close() is needed
Once I have this data ready, I open the file in Excel, and I would like it to be in a format that I can print immediately. However, when I print, the column width of the Excel cells is too small, which leads to the text being cut off.
I have tried altering the default column width within Excel, hoping that it would keep that format permanently, but that doesn't seem to be the case: every time I re-open the csv file in Excel, it resets completely back to the default column width.
Here is my code for opening the csv file in Excel using Python. The commented-out line is the code I actually want to use once I can format the spreadsheet ready for printing.
import os
# finds the absolute path of the csv file, wherever it is in the file directories
file_path = os.path.abspath("csv_template.csv")
# opens the csv file in excel ready to print (os.startfile is Windows-only)
os.startfile(file_path)
# os.startfile(file_path, 'print')
If anyone has any solutions to this or ideas please let me know.
Unfortunately, I don't think this is possible for the CSV file format, since CSV files are just plaintext comma-separated values and don't support formatting.
You wrote: "I have tried altering the default column width within excel but every time that I re-open the csv file in excel it seems to reset back to the default column width."
If you save the file in an Excel format once you have edited it, that should solve this problem.
Alternatively, instead of the csv library you could use xlsxwriter, which does allow you to set the width of the columns in your code.
See https://xlsxwriter.readthedocs.io and https://xlsxwriter.readthedocs.io/worksheet.html#worksheet-set-column.
Hope this helps!
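As a rough illustration of the xlsxwriter suggestion, the sketch below sizes each column to its longest cell. The helper names column_widths and write_xlsx, the padding value, and the sample data row are assumptions made for the example; the header names are the ones from the question. xlsxwriter is a third-party package, so it is imported inside write_xlsx and the width calculation can be tried without it.

```python
def column_widths(rows, padding=2):
    """Return one width per column: longest cell text in that column plus padding."""
    widths = []
    for row in rows:
        for i, cell in enumerate(row):
            length = len(str(cell)) + padding
            if i >= len(widths):
                widths.append(length)
            else:
                widths[i] = max(widths[i], length)
    return widths

def write_xlsx(path, rows):
    """Write rows to an .xlsx file with each column sized to its content."""
    # Third-party dependency (pip install XlsxWriter), imported lazily on purpose.
    import xlsxwriter

    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    for r, row in enumerate(rows):
        worksheet.write_row(r, 0, row)  # write one row starting at column 0
    for c, width in enumerate(column_widths(rows)):
        worksheet.set_column(c, c, width)  # width is in approximate character units
    workbook.close()

# Header from the question plus one made-up data row.
rows = [["Arrangement_ID", "Quantity", "Cost"], [1, 12, 4.5]]
widths = column_widths(rows)
print(widths)
```

Calling write_xlsx("template.xlsx", rows) would then produce a workbook whose column widths are stored in the file, unlike a .csv.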
The csv format is nothing more than a text file whose lines follow a given pattern, that is, a fixed number of fields (your data) delimited by commas. In contrast, an .xlsx file is a binary file that also contains specifications about the format. Therefore you may want to write to an Excel file instead, using the rich pandas library.
You can pad the header strings with trailing spaces, since they are just strings, so that the width adjusts automatically. Do it like this:
template_writer.writerow(['Arrangement_ID ','Quantity ','Cost '])

Python Writing To CSV All in One Cell

I'm trying to shave the top 7 lines off a csv file.
There is probably a more concise way to do this, but right now I am reading one file and writing each line other than the first 7 to another file. When I write to the file though, all the contents for the line show up in the first cell instead of spread out in organized columns.
Here is my code:
with open('file1.csv', 'r') as file_org:
    with open("file2.csv", "w") as file_stripped:
        writer = csv.writer(file_stripped)
        for i, line in enumerate(file_org, -7):
            if i >= 0:
                writer.writerow([line])
Thank you!
When reading a csv file you need to specify the separator if it is not the default ','; in some exports it is ';'. You can find the usage in the csv module's manual. You should also open the file in a text editor and look at its actual content, rather than relying on some other tool like Excel.
If you don't intend to change the content, you could just treat the rows as ordinary lines of text, or split and join them manually with
line.split(";")
or
";".join(split_line)
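The code in the question puts everything in one cell because writerow([line]) passes the whole line as a single field. One way around this, sketched below, is to parse each line with csv.reader and hand the parsed rows to csv.writer, skipping the first rows with itertools.islice. The helper name drop_leading_rows is hypothetical, and the delimiter is a parameter because the file's real separator is not shown.

```python
import csv
import io
from itertools import islice

def drop_leading_rows(src, dst, n=7, delimiter=","):
    """Copy CSV rows from src to dst, skipping the first n rows."""
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter)
    writer.writerows(islice(reader, n, None))

# In-memory check: seven junk lines followed by two real rows.
src = io.StringIO("".join(f"junk{i}\n" for i in range(7)) + "a,b,c\n1,2,3\n")
dst = io.StringIO()
drop_leading_rows(src, dst)
print(repr(dst.getvalue()))
```

With real files, open both with newline="" (for example open("file2.csv", "w", newline="")) so that the csv module controls the line endings.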

Csv blank rows problem with Excel

I have a csv file which contains rows from a sqlite3 database. I wrote the rows to the csv file using python.
When I open the csv file with MS Excel, a blank row appears below every row, but the file looks fine in Notepad (without any blank rows).
Does anyone know why this is happening and how I can fix it?
Edit: I used the strip() function for all the attributes before writing a row.
Thanks.
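You're using open('file.csv', 'w'). In Python 2, try open('file.csv', 'wb'), because the Python 2 csv module requires output files to be opened in binary mode.
In Python 3, keep text mode but pass newline='' instead: open('file.csv', 'w', newline='').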
The first thing that comes to mind (just an idea) is that you might have used "\r\n" as the row delimiter (which is shown as one line break in Notepad), but Excel expects only "\n" or only "\r", and so it interprets this as two line breaks.
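For Python 3, a minimal sketch of writing rows without the extra blank lines looks like this. The csv writer already ends each row with "\r\n", and opening the file with newline="" stops Python from translating that line ending a second time on Windows. The helper name write_rows and the sample rows are made up for the example.

```python
import csv
import os
import tempfile

def write_rows(path, rows):
    """Write rows to a CSV file so that Excel shows no blank rows in between."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

# Write two made-up rows to a temporary file and look at the raw bytes.
path = os.path.join(tempfile.mkdtemp(), "file.csv")
write_rows(path, [(1, "a"), (2, "b")])
with open(path, "rb") as f:
    raw = f.read()
print(raw)
```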
