I am pulling in a text file that contains several kinds of data: a serial number, type information, and a log of CSV data:
A123>
A123>read sn
sn = 12143
A123>read cms-sn
cms-sn = 12143-00000000-0000
A123>read fw-rev
fw-rev = 1.3, 1.3
A123>read log
log =
1855228,1,0,41,-57,26183,25,22,21,22,0,0,0,89,2048,500,0,0
1855240,1,0,33,0,26319,25,22,22,23,0,0,0,89,2048,500,0,0
2612010,1,0,41,-82,26122,20,21,21,21,0,0,0,87,2048,500,0,0
2612142,1,0,49,301,27607,21,22,21,21,0,0,0,81,2048,500,0,0
Here is the code I have so far:
import pandas as pd

lines = []  # declare an empty list named "lines"
with open('03-22-2019.txt', 'rt') as in_file:  # open the file
    for line in in_file:  # for each line of text in in_file,
        lines.append(line.rstrip('\n'))  # add that line to our list, stripping the newline
while '' in lines:
    lines.remove('')
lines = [x for x in lines if 'A123' not in x]  # delete all lines containing 'A123'
for element in lines:  # for each element in our list,
    print(element)  # print it
split_line = lines[0].split() # create list with serial number line
Serial_Num = split_line[-1]
print(Serial_Num)
split_line = lines[1].split() # go to line with CMS SN
CMS_SN = split_line[-1]
print(CMS_SN)
split_line = lines[2].split()
Firm_Rev_1 = split_line[-1]
Firm_Rev_2 = split_line[-2]
print(Firm_Rev_1)
print(Firm_Rev_2)
# Problem section starts here!
start_data = lines.index("log =") + 1 #<<<<<<<<<<
data = [x for x in lines[start_data:].split(",")] #<<<<<<<<<<
#dfObj = pd.DataFrame(lines[start_data:-1].split(",")) #<<<<<<<<<<
The problem comes up when I try to import the log section of the data into a dataframe and split the CSV values into their own columns.
How do I programmatically find the start of the log data, and read the data from there to the end into a Pandas dataframe?
It looks like you're pretty close.
# this will get you a list of lists for each line.
data = [line.split(',') for line in lines[start_data:]]
# This should construct your data frame
dfObj = pd.DataFrame(data=data, columns=[...])  # supply your own list of column names
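Putting the pieces together, here is a minimal end-to-end sketch on an in-memory copy of the sample data; the column names are hypothetical, since the question doesn't name the log fields:

```python
import pandas as pd

# In-memory stand-in for the cleaned-up lines from the file
lines = [
    "sn = 12143",
    "cms-sn = 12143-00000000-0000",
    "fw-rev = 1.3, 1.3",
    "log =",
    "1855228,1,0,41,-57,26183,25,22,21,22,0,0,0,89,2048,500,0,0",
    "1855240,1,0,33,0,26319,25,22,22,23,0,0,0,89,2048,500,0,0",
]

start_data = lines.index("log =") + 1       # first line after "log ="
data = [line.split(',') for line in lines[start_data:]]

# Hypothetical column names -- substitute the real field names
cols = ['timestamp'] + ['field{}'.format(i) for i in range(1, 18)]
dfObj = pd.DataFrame(data=data, columns=cols)
print(dfObj.shape)  # (2, 18)
```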
So I've got this code I've been working on for a few days. I need to iterate through a set of CSVs, find the rows which don't have the same number of columns as row 2, and strip them out of the new CSV. I've gotten the code to this point, but I'm stuck on how to use slicing to strip out the broken rows.
Say each row in file A is supposed to have 10 columns, and for some reason row 2,000 logs with only 7 columns. What is the best way to get the code to strip row 2,000 out of the new CSV?
#Comments to the right
for f in TD_files:                                              #FOR ALL TREND FILES:
    with open(f, newline='', encoding='latin1') as g:           #open file as read
        r = csv.reader(line.replace('\0', '') for line in g)    #reader over the lines while stripping nulls
        data = [line for line in r]                             #set list to all data in file
    for j in range(0, len(data)):
        if data[j][2] != data[j-1][2] and j != 0:               #compare column 2 of rows j and j-1
            print('Index Not Equal')                            #print debug
    data[0] = TDmachineID                                       #add machine ID line
    data[1] = trendHeader                                       #add trend header line
    with open(f, 'w', newline='') as g:                         #open file as write
        w = csv.writer(g)                                       #declare write variable
        w.writerows(data)
[screenshot: the row to strip]
EDIT
Since you loop through the whole data anyway, I would replace the \0 in the same list comprehension where you check the length. It looks cleaner to me and works the same.
with open(f, newline='', encoding='latin1') as g:
    raw_data = csv.reader(g)
    data = [[elem.replace('\0', '') for elem in line] for line in raw_data if len(line) == 10]
data[0] = TDmachineID
data[1] = trendHeader
old answer:
You could add a condition to your list comprehension if the list has the length 10.
with open(f, newline='', encoding='latin1') as g:
    r = csv.reader(line.replace('\0', '') for line in g)
    data = [line for line in r if len(line) == 10]  # condition to check if the line is added to your data
data[0] = TDmachineID
data[1] = trendHeader
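To see the filter in action without touching real trend files, here is a small sketch on hypothetical in-memory data where the third row has only 7 columns:

```python
import csv
import io

# Hypothetical sample: the third row has only 7 columns instead of 10
sample = (
    "a,b,c,d,e,f,g,h,i,j\n"
    "0,1,2,3,4,5,6,7,8,9\n"
    "0,1,2,3,4,5,6\n"
)

raw_data = csv.reader(io.StringIO(sample))
# Keep only 10-column rows, stripping nulls in the same comprehension
data = [[elem.replace('\0', '') for elem in line]
        for line in raw_data if len(line) == 10]
print(len(data))  # 2 -- the short row is dropped
```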
Hi guys, I hope you're doing fine!
How do I parse a text file, extracting specific values by index position, append the values to a list, and then convert it to a pandas dataframe? So far I was able to write the code below.
TEXT SAMPLE:
header:0RCPF049100000084220210407
body:1927907801100032G 00sucess
1067697546140032G 00sucess
1053756666000032G 00sucess
1321723368900032G 00sucess
1037673956810032G 00sucess
For example, the first line is the header, and from it I just need the date, which is at the following index positions:
date_from_header = linhas[0][18:26]
The rest of the values are in the body.
import csv
import pandas as pd
headers = ["data_mov", "chave_detalhe", "cpf_cliente", "cd_clube",
           "cd_operacao", "filler", "cd_retorno", "tx_recusa"]

# This is the actual code
with open('RCPF0491.20210407.1609.txt', "r") as f:
    linhas = [linha.rstrip() for linha in f.readlines()]
for i in range(0, len(linhas)):
    data_mov = linhas[0][18:26]
    chave_detalhe = linhas[1][0:1]
    cpf_cliente = linhas[1][1:12]
    cd_clube = linhas[1][12:16]
    cd_operacao = linhas[1][16:17]
    filler = linhas[1][17:40]
    cd_retorno = linhas[1][40:42]
    tx_recusa = linhas[1][42:100]
data = [data_mov, chave_detalhe, cpf_cliente, cd_clube, cd_operacao, filler, cd_retorno, tx_recusa]
The intended result looks like this:
data_mov chave_detalhe cpf_cliente cd_clube cd_operacao filler cd_retorno tx_recusa
'20210407' '1' 92790780110 '0032' 'G' 'blank space' '00' 'sucesso'
'20210407' '1' 92790780110 '0032' 'G' 'blank space' '00' 'sucesso'
'20210407' '1' 92790780110 '0032' 'G' 'blank space' '00' 'sucesso'
Using the approach from stackoverflow.com/a/10851479/1581658:
def parse_file(filename):
    indices = [0, 1, 12, 16, 17, 18, 20]  # list the indices to split on
    parsed_data = []  # returned array, one entry per line
    with open(filename) as f:
        header = next(f)  # skip the header
        data_mov = header[18:26]  # and get data_mov from it
        for line in f:  # loop through the remaining lines
            # split each line at the indices
            parts = [data_mov] + [line.rstrip()[i:j] for i, j in zip(indices, indices[1:] + [None])]
            parsed_data.append(parts)
    return parsed_data

print(parse_file("filename.txt"))
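The slicing trick above pairs each start index with the next one via zip, with a final None so the last slice runs to the end of the line. A toy example of the idiom:

```python
line = "AABBBBCCD"
indices = [0, 2, 6, 8]

# zip([0, 2, 6, 8], [2, 6, 8, None]) yields the slice pairs
# (0, 2), (2, 6), (6, 8), (8, None)
parts = [line[i:j] for i, j in zip(indices, indices[1:] + [None])]
print(parts)  # ['AA', 'BBBB', 'CC', 'D']
```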
Thanks to SamBob for the help; here is the final solution in case anyone needs it:
import itertools
import pandas as pd

pd.options.display.width = 0

def parse_file(filename):
    indices = [0, 1, 12, 16, 17, 18, 42]  # list of indices
    parsed_data = []  # returned as a list
    with open(filename) as f:
        header = next(f)
        data_mov = header[18:26]
        for line in itertools.islice(f, 1, 100):
            # split according to the indices
            parts = [data_mov] + [line.rstrip()[i:j] for i, j in zip(indices, indices[1:] + [None])]
            parsed_data.append(parts)
    # convert to dataframe
    cols = ['data_mov', 'chave_detalhe', 'cpf_cliente', 'cd_clube', 'cd_operacao', 'filler', 'cd_retorno', 'tx_recusa']
    df = pd.DataFrame(parsed_data, columns=cols)
    return df

df = parse_file("filename.txt")
I am trying to read a .txt file and save each column of data as a list; each column contains a variable which I will later use to plot a graph. Most answers I've found recommend opening the file, reading it, and then either splitting or saving the columns as a list. The data in the .txt is as follows:
0 1.644231726
0.00025 1.651333945
0.0005 1.669593478
0.00075 1.695214575
0.001 1.725409504
The delimiter is a space ' ' or a tab '\t'. I have used the following code to try to append the columns to my variables:
import csv

with open('./rvt.txt') as file:
    readfile = csv.reader(file, delimiter='\t')
    time = []
    rim = []
    for line in readfile:
        t = line[0]
        r = line[1]
        time.append(t)
        rim.append(r)
print(time, rim)
However, when I try to print the lists time and rim using print(time, rim), I get the following error message:
r = line[1]
IndexError: list index out of range
I am, however, able to print 'time' alone if I comment out the r = line[1] and rim.append(r) parts. How do I approach this problem? Thank you in advance!
I would suggest the following:
import pandas as pd
df = pd.read_csv('./rvt.txt', sep='\t', header=None, names=[...])  # supply a list with your column names
Then you can use list(your_column) to work with your columns as lists
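Because the file actually contains runs of spaces rather than single tabs, a regex separator is the more robust choice with pandas; a minimal sketch on an in-memory copy of the sample (the column names time/rim are taken from the question):

```python
import io
import pandas as pd

sample = (
    "0        1.644231726\n"
    "0.00025  1.651333945\n"
    "0.0005   1.669593478\n"
)

# sep=r'\s+' treats any run of spaces/tabs as one delimiter,
# which sidesteps the empty-field problem entirely
df = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None, names=['time', 'rim'])

time = list(df['time'])
rim = list(df['rim'])
print(time)  # [0.0, 0.00025, 0.0005]
```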
The problem is with the delimiter. The dataset contains runs of multiple spaces ' '.
When you use '\t' and print each line, you can see the line is not being split on the delimiter, e.g.:
['0 1.644231726']
['0.00025 1.651333945']
['0.0005 1.669593478']
['0.00075 1.695214575']
['0.001 1.725409504']
To get the desired result you can use a space as the delimiter and filter out the empty values:
readfile = csv.reader(file, delimiter=" ")
time, rim = [], []
for line in readfile:
    line = list(filter(lambda x: len(x), line))
    t = line[0]
    r = line[1]
Here is the code to do this:
import csv

with open('./rvt.txt') as file:
    readfile = csv.reader(file, delimiter=" ")
    time = []
    rim = []
    for line in readfile:
        line = [x for x in line if x]  # drop empty fields produced by repeated spaces
        t = line[0]
        r = line[1]
        time.append(t)
        rim.append(r)
print(time, rim)
I need help sorting a list from a text file. I'm reading a .txt file, adding some data, sorting it by population change %, and finally writing the result to a new text file.
The only thing giving me trouble now is the sort. I think the for-statement syntax is the issue: I'm unsure where in the code to add the sort and how to apply it to the output of the loop.
The population change data I am trying to sort by is the [1] item in the list.
#Read file into script
NCFile = open(r"C:\filelocation\NC2010.txt")
#Open a file for writing
PopulationChange = open(r"C:\filelocation\Sorted_Population_Change_Output.txt", "w")
#Read everything into lines, except for the first (header) row
lines = NCFile.readlines()[1:]
#Pull relevant data and create the population change variable
for aLine in lines:
    dataRow = aLine.split(",")
    countyName = dataRow[1]
    population2000 = float(dataRow[6])
    population2010 = float(dataRow[8])
    popChange = ((population2010 - population2000) / population2000) * 100
    outputRow = countyName + ", %.2f" % popChange + "%\n"
    PopulationChange.write(outputRow)
NCFile.close()
PopulationChange.close()
You can fix your issue with a couple of minor changes. Split the line as you read it in and loop over the sorted lines:
lines = [aLine.split(',') for aLine in NCFile][1:]
#Pull relevant data and create the population change variable
for dataRow in sorted(lines, key=lambda row: row[1]):
    population2000 = float(dataRow[6])
    population2010 = float(dataRow[8])
    ...
However, if this is a csv you might want to look into the csv module. In particular, DictReader will read in the data as a list of dictionaries keyed on the header row. I'm making up the field names below but you should get the idea. You'll notice I sort the data based on 'countyName' as it is read in:
from csv import DictReader, DictWriter

with open(r"C:\filelocation\NC2010.txt") as NCFile:
    reader = DictReader(NCFile)
    data = sorted(reader, key=lambda row: row['countyName'])

for row in data:
    population2000 = float(row['population2000'])
    population2010 = float(row['population2010'])
    popChange = ((population2010 - population2000) / population2000) * 100
    row['popChange'] = "{0:.2f}".format(popChange)

with open(r"C:\filelocation\Sorted_Population_Change_Output.txt", "w") as PopulationChange:
    # extrasaction='ignore' skips row keys not listed in fieldnames
    writer = DictWriter(PopulationChange, fieldnames=['countyName', 'popChange'],
                        extrasaction='ignore')
    writer.writeheader()
    writer.writerows(data)
This will give you a 2-column csv of ['countyName', 'popChange']. You will need to substitute the correct fieldnames for your file.
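To see the DictReader-plus-sort pattern without touching a real file, here is a sketch on hypothetical in-memory data (the field names and figures are made up):

```python
import csv
import io

# Hypothetical CSV matching the made-up field names above
sample = (
    "countyName,population2000,population2010\n"
    "Wake,627846,900993\n"
    "Bertie,19773,21282\n"
)

reader = csv.DictReader(io.StringIO(sample))
# DictReader yields one dict per row, keyed on the header;
# sorted() orders the rows by county name as they are read in
data = sorted(reader, key=lambda row: row['countyName'])
print([row['countyName'] for row in data])  # ['Bertie', 'Wake']
```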
You need to read all of the lines in the file before you can sort them. I've created a list called change to hold tuple pairs of the population change and the county name. This list is sorted and then saved.
with open("NC2010.txt") as NCFile:
    lines = NCFile.readlines()[1:]

change = []
for line in lines:
    row = line.split(",")
    county_name = row[1]
    population_2000 = float(row[6])
    population_2010 = float(row[8])
    pop_change = ((population_2010 / population_2000) - 1) * 100
    change.append((pop_change, county_name))
change.sort()

output_rows = ["{0}, {1:.2f}\n".format(pair[1], pair[0]) for pair in change]

with open("Sorted_Population_Change_Output.txt", "w") as PopulationChange:
    PopulationChange.writelines(output_rows)
I used a list comprehension to generate the output rows, which swaps each pair back into the desired order, i.e. county name first.
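The tuple trick works because Python compares tuples element by element, so sorting the (change, name) pairs orders them by the change %. A toy sketch with made-up county figures:

```python
# Hypothetical (pop_change, county_name) pairs
change = [(4.17, "Wake"), (-2.30, "Bertie"), (0.95, "Durham")]
change.sort()  # tuples compare element-by-element, so this sorts by the change %
print(change[0][1])  # 'Bertie' -- the county with the largest decline
```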
The original file format is like this
ID DC_trip
AC A9999
SY DC,Foggy_bottom,22201,H_St.
SY DC,Smithsonian,12345,14th_St.
//
ID ...
AC ...
SY ...
SY ...
SY ...
I want to convert it to .csv file format and transform it into
DC_trip,A9999,DC,Foggy_bottom,22201,H_St.
DC_trip,A9999,DC,Smithsonian,12345,14th_St.
.
.
.
I tried to use an if statement and elif:
if lines.find('ID'):
    lines[5:]
elif lines.find('SY'):
    lines[5:]
If I do it this way, each time I can only get one value.
Could someone give me a recommendation? Thank you.
Assuming the data in the original file is tab separated, you can use the csv module, and do this:
import csv

data = []
# Extract the second column from the input file
# and store it in data
with open('input') as in_file:
    csv_reader = csv.reader(in_file, delimiter='\t')
    for row in csv_reader:
        data.append(row[1])

# The first two values in data form the prefix
# for every line of your output file
suffix = ','.join(data[:2])

# Prepend it to the rest of the values
# and write them out to the output file.
with open('output', 'w') as op_file:
    for item in data[2:]:
        op_file.write('{},{}\n'.format(suffix, item))
If the data in the original file is delimited by space, you would replace the first part with:
data = []
with open('file1') as in_file:
    for line in in_file:
        data.append(line.strip().split())
data = [a[1] for a in data if len(a) > 1]  # skip separator lines like '//'
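Either way, the transform itself is the same; here is a self-contained sketch of the space-delimited variant on an in-memory copy of one record block from the question:

```python
# Hypothetical in-memory version of one record block
record = [
    "ID DC_trip",
    "AC A9999",
    "SY DC,Foggy_bottom,22201,H_St.",
    "SY DC,Smithsonian,12345,14th_St.",
]

# Keep the second field of each line
data = [line.split()[1] for line in record]
# The first two values (ID and AC) prefix every output row
prefix = ','.join(data[:2])
rows = ['{},{}'.format(prefix, item) for item in data[2:]]
print(rows[0])  # DC_trip,A9999,DC,Foggy_bottom,22201,H_St.
```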