Read_csv in Pandas where file structure is inconsistent - python

Having trouble with reading a csv file into a pandas dataframe where the line endings are not standard.
Here is my code:
df_feb = pd.read_csv(data_location, sep=",", nrows=500, header=None, skipinitialspace=True, encoding='utf-8')
Here is the output (personal info scratched out):
Output
This is what the input data looks like:
The above output splits what should be a single line into 4 lines. A new line should start for every phone number (phone number = scratched out bit).
I am aiming to have each line look like this:
Goal output
Thank you in advance for your help!

If the file format follows some rule (rather than a unique format for each record), then I suggest you write your own conversion tool.
Here is what the tool should do:
Read the file as plain text.
Combine 4 lines into 1 record/class object (as I see in the picture, each record seems to span 4 lines).
Parse each line (split by comma, tab, whatever you have) to get the attributes.
Write the attributes to another file, separated by tab (or comma) => your csv.
Now you can load your csv into Pandas.
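A minimal sketch of such a conversion tool, assuming exactly 4 physical lines per record and comma-separated fields; the sample data here is made up, and with a real file you would replace the io.StringIO objects with open() calls:

```python
import csv
import io

# Hypothetical sample: each logical record was split across 4 physical lines.
broken = io.StringIO(
    "Alice,555-0100\nextra1\nextra2\nextra3\n"
    "Bob,555-0101\nextra4\nextra5\nextra6\n"
)
out = io.StringIO()
writer = csv.writer(out)

# Read as plain text, drop blank lines, then group every 4 lines into one record
raw = [line.strip() for line in broken if line.strip()]
for i in range(0, len(raw), 4):
    # Glue the lines back together, split on commas, write one CSV row
    record = ','.join(raw[i:i + 4])
    writer.writerow(record.split(','))
```

The resulting CSV can then be loaded normally with pd.read_csv().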

How can I read certain characters from a .txt file and write them to a .csv file in Python?

So I'm currently trying to use Python to create a neat and tidy .csv file from a .txt file. The first stage is to get some 8-digit numbers into one column called 'Number'. I've created the header and just need to put each number from each line into the column. What I want to know is, how do I tell Python to read the first eight characters of each line in the .txt file (which correspond to the number I'm looking for) and then write them to the .csv file? This is probably very simple but I'm only new to Python!
So far, I have something which looks like this:
import csv

with open(r'C:/Users/test1.txt') as rf:
    with open(r'C:/Users/test2.csv', 'w', newline='') as wf:
        outputDictWriter = csv.DictWriter(wf, ['Number'])
        outputDictWriter.writeheader()
        writeLine = rf.read(8)
        for line in rf:
            wf.write(writeLine)
You can use pandas:
import pandas as pd
df = pd.read_csv(r'C:/Users/test2.txt')
df.to_csv(r'C:/Users/test2.csv')
Here is how to read the first 8 characters of each line in a file and store them in a list:
with open('file.txt', 'r') as f:
    lines = [line[:8] for line in f.readlines()]
You can use a regex to select the digits. Search for the pattern:
import re
match = re.search(r'\w*\d{8}', line)
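A small runnable version of the regex idea, using hypothetical sample lines; the pattern r'\d{8}' grabs the 8-digit number from each line:

```python
import re

# Hypothetical sample lines; the first 8 characters are the number.
lines = ["12345678 some trailing text", "87654321,other stuff"]
numbers = [re.search(r'\d{8}', line).group() for line in lines]
print(numbers)  # ['12345678', '87654321']
```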
Just go one step back and read again what you need:
read the first eight characters of each line in the .txt file (which correspond to the number I'm looking for) and then write them to the .csv file
Now forget Python and explain what is to be done in pseudo code:
open txt file for reading
open csv file for writing (beware end of line is expected to be \r\n for a CSV file)
write the header to the csv file
loop reading the txt file 1 line at a time until end of file
extract 8 first characters from the line
write them to the csv file, ended with a \r\n
close both files
Ok, time to convert above pseudo code to Python language:
with open('C:/Users/test1.txt') as rf, open('C:/Users/test2.csv', 'w', newline='\r\n') as wf:
    print('Number', file=wf)
    for line in rf:
        print(line.rstrip()[:8], file=wf)

Huge txt file with one column (text to columns in python)

I'm struggling with one task that could save plenty of time. I'm new to Python, so please don't kill me :)
I've got huge txt file with millions of records. I used to split them in MS Access, delimiter "|", filtered data so I can have about 400K records and then copied to Excel.
So basically file looks like:
What I would like to have:
I'm using Spyder so it would be great to see data in variable explorer so I can easily check and (after additional filters) export it to excel.
I use LibreOffice, so I'm not 100% sure about Excel, but if you change the .txt to .csv and try to open the file with Excel, it should let you change the delimiter from a comma to '|' and then import it directly. That works with LibreOffice Calc, anyway.
You have to split the file into lines, then split each line on the '|' character and map the data to a list of dicts.
with open('filename') as f:
    data = [{'id': fields[0], 'fname': fields[1]}
            for fields in (line.split('|') for line in f)]
You have to fill in the rest of the fields.
Doing this with pandas will be much easier
Note: I am assuming that each entry is on a new line.
import pandas as pd
data = pd.read_csv("data.txt", delimiter='|')
# Do something here or let it be if you want to just convert text file to excel file
data.to_excel("data.xlsx")
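Since the question mentions filtering down to ~400K records before exporting, a filter can be applied in pandas before writing the Excel file; the column names and threshold below are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical '|'-delimited sample standing in for the real file.
raw = io.StringIO("id|fname|age\n1|Alice|30\n2|Bob|25\n3|Carol|41\n")
df = pd.read_csv(raw, delimiter='|')

# Apply a filter before exporting, e.g. keep only rows with age >= 30.
filtered = df[df['age'] >= 30]
```

filtered.to_excel('data.xlsx') would then write only the filtered rows (to_excel requires the openpyxl package).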

How to read csv file which has column delimiter as well record delimiter

My CSV file has 3 columns: Name,Age and Sex and sample data is:
AlexÇ39ÇM
#Ç#SheebaÇ35ÇF
#Ç#RiyaÇ10ÇF
The column delimiter is 'Ç' and the record delimiter is '#Ç#'. Note that the first record doesn't have the record delimiter (#Ç#), but all the other records do. Could you please tell me how to read this file and store it in a dataframe?
Both csv and pandas module support reading csv-files directly. However, since you need to modify the file contents line by line before further processing, I suggest reading the file line by line, modify each line as desired and store the processed data in a list for further handling.
The necessary steps include:
open file
read file line by line
remove newline char (which is part of the line when using readlines())
replace record delimiter (since a record is equivalent to a line)
split lines at column delimiter
Since .split() returns a list of string elements we get an overall list of lists, where each 'sub-list' contains/represents the data of a line/record. Data formatted like this can be read by pandas.DataFrame.from_records() which comes in quite handy at this point:
import pandas as pd

with open('myData.csv') as file:
    # `.strip()` removes the newline character from each line
    # `.replace('#;#', '')` removes '#;#' from each line
    # `.split(';')` splits at the given string and returns a list of the string elements
    lines = [line.strip().replace('#;#', '').split(';') for line in file.readlines()]

df = pd.DataFrame.from_records(lines, columns=['Name', 'Age', 'Sex'])
print(df)
Remarks:
I changed Ç to ; which worked better for me due to encoding issues. However, the basic idea of the algorithm is still the same.
Reading data manually like this can become quite resource-intensive which might be a problem when handling larger files. There might be more elegant ways, which I am not aware of. When getting problems with performance, try to read the file in chunks or have a look for more effective implementations.
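The same algorithm also works with the original 'Ç' delimiter; here it is run on the question's sample data held in memory:

```python
import io

import pandas as pd

# The question's sample data, with the original 'Ç' delimiters.
sample = "AlexÇ39ÇM\n#Ç#SheebaÇ35ÇF\n#Ç#RiyaÇ10ÇF\n"
lines = [line.strip().replace('#Ç#', '').split('Ç') for line in io.StringIO(sample)]
df = pd.DataFrame.from_records(lines, columns=['Name', 'Age', 'Sex'])
```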

Python error in processing lines from a file

I wrote a Python script in Windows 8.1 using the Sublime Text editor, and I just tried to run it from the terminal in OS X Yosemite, but I get an error.
My error occurs when parsing the first line of a .CSV file. This is the slice of the code
lines is an array where each element is a line of the file, read as a string
we split the string by the desired delimiter
we skip the first line because that is the header information (else condition)
For the last index in the for loop, i = numlines - 1 = the number of lines in the file - 2
We only add one to the value of i because the last line in the file is blank
for i in range(numlines):
    if i == numlines-1:
        dataF = lines[i+1].split(',')
    else:
        dataF = lines[i+1].split(',')
    dataF1 = list(dataF[3])
    del(dataF1[len(dataF1)-1])
    del(dataF1[len(dataF1)-1])
    del(dataF1[0])
    f[i] = ''.join(dataF1)
return f
All the lines in the csv file looks like this (with the exception of the header line):
"08/06/2015","19:00:00","1","410"
So it saves the single line into an array where each element corresponds to one of the 4 values separated by commas in a line of the CSV file. Then we take the 3 element in the array, "410" ,and create a list that should look like
['"','4','1','0','"','\n']
(and it does when run from windows)
but it instead looks like
['"','4','1','0','"','\r','\n']
and so when I concatenate this string based off the above code I get 410" instead of 410.
My question is: Where did the '\r' term come from? It is non-existent in the original files when ran by a windows machine. At first I thought it was the text format so I saved the CSV file to a UTF-8, that didn’t work. I tried changing the tab size from 4 to 8 spaces, that didn’t work. Running out of ideas now. Any help would be greatly appreciated.
Thanks
The "\r" is the line separator. The "\r\n" is also a line separator. Different platforms have different line separators.
A simple fix: if you read a line from a file yourself, then line.rstrip() will remove the whitespace from the line end.
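Applied to a line like the one in the question, rstrip() removes the trailing '\r\n' before splitting:

```python
# A line as read from a Windows-created file, with a trailing '\r\n'.
line = '"08/06/2015","19:00:00","1","410"\r\n'
fields = line.rstrip().split(',')
print(fields[-1])  # '"410"' -- no stray '\r' left over
```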
A proper fix: use Python's standard CSV reader. It will skip the blank lines and comments, will properly handle quoted strings, etc.
Also, when working with long lists, it helps to stop thinking about them as index-addressed 'arrays' and use the 'stream' or 'sequential reading' metaphor.
So the typical way of handling a CSV file is something like:
import csv

with open('myfile.csv') as f:
    reader = csv.reader(f)
    # We assume that the file has 3 columns; adjust to taste
    for (first_field, second_field, third_field) in reader:
        # do something with the field values of the current line here
        pass

How to copy specific data out of a file using python?

I have some large data files and I want to copy out certain pieces of data on each line, basically an ID code. The ID code has a | on one side and a space on the other. I was wondering would it be possible to pull out just the ID. Also I have two data files, one has 4 ID codes per line and the other has 23 per line.
At the moment I'm thinking something like copying each line from the data file, then subtract the strings from each other to get the desired ID code, but surely there must be an easier way! Help?
Here is an example of a line from the data file that I'm working with
cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327
and from this line I would want to output on separate lines
Wood_4286
EIK58010
AEV64487.1
PSEBR_a4327
Use the re module for such a task. The following code shows you how to extract the IDs from a string (it works for any number of IDs as long as they are structured the same way).
import re
s = """cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327"""
results = re.findall(r'\|([^ ]*)', s)  # list of ids extracted from the string
print('\n'.join(results))  # pretty output
Output:
Wood_4286
EIK58010
AEV64487.1
PSEBR_a4327
To write the output to a file:
with open('out.txt', mode='w') as filehandle:
    filehandle.write('\n'.join(results))
For more information, see the re module documentation.
If all your lines have the given format, a simple split is enough:
#split by '|' and the result by space
ids = [x.split()[0] for x in line.split("|")[1:]]
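Extended to a whole file (in-memory here for illustration), the same split collects every ID:

```python
import io

# In-memory stand-in for the data file; each line has the same structure.
text = io.StringIO(
    "cluster8032: WoodR1|Wood_4286 Q8R1|EIK58010 F113|AEV64487.1 NFM421|PSEBR_a4327\n"
)
all_ids = []
for line in text:
    all_ids.extend(x.split()[0] for x in line.split("|")[1:])
print(all_ids)  # ['Wood_4286', 'EIK58010', 'AEV64487.1', 'PSEBR_a4327']
```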
