I'm pretty new to Python and coding in general, so sorry in advance for any dumb questions. My program needs to split an existing log file into several *.csv files (run1.csv, run2.csv, ...) based on the keyword 'MYLOG'. When the keyword appears, it should start copying the two desired columns into a new file until the keyword appears again. When finished, there should be as many csv files as there are occurrences of the keyword.
53.2436 EXP MYLOG: START RUN specs/run03_block_order.csv
53.2589 EXP TextStim: autoDraw = None
53.2589 EXP TextStim: autoDraw = None
55.2257 DATA Keypress: t
57.2412 DATA Keypress: t
59.2406 DATA Keypress: t
61.2400 DATA Keypress: t
63.2393 DATA Keypress: t
...
89.2314 EXP MYLOG: START BLOCK scene [specs/run03_block01.csv]
89.2336 EXP Imported specs/run03_block01.csv as conditions
89.2339 EXP Created sequence: sequential, trialTypes=9
...
[EDIT]: The output per file (run*.csv) should look like this:
onset type
53.2436 EXP
53.2589 EXP
53.2589 EXP
55.2257 DATA
57.2412 DATA
59.2406 DATA
61.2400 DATA
...
The program creates as many run*.csv files as needed, but I can't store the desired columns in my new files. When finished, all I get are empty csv files. If I change the counter check to == 1, it creates just one big file with the desired columns.
Thanks again!
import csv
QUERY = 'MYLOG'
with open('localizer.log', 'rt') as log_input:
i = 0
for line in log_input:
if QUERY in line:
i = i + 1
with open('run' + str(i) + '.csv', 'w') as output:
reader = csv.reader(log_input, delimiter = ' ')
writer = csv.writer(output)
content_column_A = [0]
content_column_B = [1]
for row in reader:
content_A = list(row[j] for j in content_column_A)
content_B = list(row[k] for k in content_column_B)
writer.writerow(content_A)
writer.writerow(content_B)
Looking at the code, there are a few things that are possibly wrong:
the csv reader should take a file object (or another iterable of lines), not a single line.
the reader delimiter should not be a single space character, as the actual delimiter in your logs looks like a variable number of space characters.
the looping logic seems a bit off, mixing up files, lines and rows.
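To illustrate the delimiter point, here is a minimal sketch comparing csv.reader with a single-space delimiter against str.split(). The sample line and its spacing are assumptions made for the demonstration, not taken from the real log file.

```python
import csv
import io

# One log line with several spaces between fields (the spacing is assumed).
line = "53.2436   EXP   MYLOG: START RUN"

# csv.reader with a single-space delimiter yields an empty field
# for every extra space between two values.
csv_fields = next(csv.reader(io.StringIO(line), delimiter=" "))

# str.split() with no argument collapses any run of whitespace.
split_fields = line.split()

print(csv_fields[:4])
print(split_fields[:2])
```

With the csv approach the second column is no longer at index 1, which is why splitting on arbitrary whitespace is the safer choice here.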
You may be looking at something like the code below (pending clarification in the question):
import csv

NEW_LOG_DELIMITER = 'MYLOG'


def write_buffer(_index, buffer):
    """
    This function takes an index and a buffer.
    The buffer is just an iterable of iterables (e.g. a list of lists).
    Each buffer item is a row of values.
    """
    filename = 'run{}.csv'.format(_index)
    # newline='' stops the csv module from writing blank lines on Windows
    with open(filename, 'w', newline='') as output:
        writer = csv.writer(output)
        writer.writerow(['onset', 'type'])  # adding the heading
        writer.writerows(buffer)


current_buffer = []
_index = 1

with open('localizer.log', 'rt') as log_input:
    for line in log_input:
        # will deal ok with multi-space as long as
        # you don't care about the last column
        fields = line.split()[:2]
        if NEW_LOG_DELIMITER not in line or not current_buffer:
            # If it's the first line (the current_buffer is empty)
            # or the line does NOT contain "MYLOG" then
            # collect it until it's time to write it to file.
            current_buffer.append(fields)
        else:
            write_buffer(_index, current_buffer)
            _index += 1
            current_buffer = [fields]  # EDIT: fixed bug, new buffer should not be empty

if current_buffer:
    # We are now out of the loop,
    # if there's an unwritten buffer then write it to file.
    write_buffer(_index, current_buffer)
You can use pandas to simplify this problem.
Import pandas and read in log file.
import pandas as pd
df = pd.read_fwf('localizer2.log', header=None)
df.columns = ['onset', 'type', 'event']
df.set_index('onset', inplace=True)
Set a flag where the third column starts with 'MYLOG', then take the cumulative sum so that each run gets its own number.
df['flag'] = 0
df.loc[df.event.str[:5] == 'MYLOG', 'flag'] = 1
df.flag = df['flag'].cumsum()
Save each run as a separate run*.csv file
for i in range(1, df.flag.max()+1):
    # onset is the index, so this writes the onset and type columns
    df.loc[df.flag == i, ['type']].to_csv('run{0}.csv'.format(i))
EDIT:
Looks like your format is different than I originally assumed, so I changed the code to use pd.read_fwf. My localizer.log file was a copy and paste of your original data; I hope this works for you. I assumed from the original post that the file has no headers. If it does have headers, remove header=None and df.columns = ['onset', 'type', 'event'].
I'm writing something that will take two CSVs: #1 is a list of emails with the number received for each; #2 is a catalog of every email address on record, with the number of emails received per reporting period and the date annotated at the top of each column.
import csv
from datetime import datetime
datestring = datetime.strftime(datetime.now(), '%m-%d')
storedEmails = []
newEmails = []
sortedList = []
holderList = []
with open('working.csv', 'r') as newLines, open('archive.csv', 'r') as oldLines: #readers to make lists
f1 = csv.reader(newLines, delimiter=',')
f2 = csv.reader(oldLines, delimiter=',')
print ('Processing new data...')
for row in f2:
storedEmails.append(list(row)) #add archived data to a list
storedEmails[0].append(datestring) #append header row with new date column
for col in f1:
if col[1] == 'email' and col[2] == 'To Address': #new list containing new email data
newEmails.append(list(col))
counter = len(newEmails)
n = len(storedEmails[0]) #using header row len to fill zeros if no email received
print(storedEmails[0])
print (n)
print ('Updating email lists and tallies, this could take a minute...')
with open ('archive.csv', 'w', newline='') as toWrite: #writer to overwrite old csv
writer = csv.writer(toWrite, delimiter=',')
for i in newEmails:
del i[:3] #strip useless identifiers from data
if int(i[1]) > 30: #only keep emails with sufficient traffic
sortedList.append(i) #add these emails to new sorted list
for i in storedEmails:
for entry in sortedList: #compare stored emails with the new emails, on match append row with new # of emails
if i[0] == entry[0]:
i.append(entry[1])
counter -=1
else:
holderList.append(entry) #if no match, it is a new email that meets criteria to land itself on the list
break #break inner loop after iteration of outer email, to move to next email and avoid multiple entries
storedEmails = storedEmails + holderList #combine lists for archived csv rewrite
for i in storedEmails:
if len(i) < n:
i.append('0') #if email on list but didnt have any activity this period, append with 0 to keep records intact
writer.writerow(i)
print('SortedList', sortedList)
print (len(sortedList))
print('storedEmails', storedEmails)
print(len(storedEmails))
print('holderList',holderList)
print(len(holderList))
print ('There are', counter, 'new emails being added to the list.')
print ('All done!')
CSV's will look similar to this.
working.csv:
1,asdf#email.com,'to address',31
2,fsda#email.com,'to address',19
3,zxcv#email.com,'to address',117
4,qwer#gmail.com,'to address',92
5,uiop#fmail.com,'to address',11
archive.csv:
date,01-sep
asdf#email.com,154
fsda#email.com,128
qwer#gmail.com,77
ffff#xmail.com,63
What I want after processing is:
date,01-sep,27-sep
asdf#email.com,154,31
fsda#email.com,128,19
qwer#gmail.com,77,92
ffff#xmail.com,63,0
zxcv#email.com,0,117
I'm not sure where I've gone wrong, but it keeps producing duplicate entries. Some of the functionality is there, but I've been at it for too long and I'm getting tunnel vision trying to figure out what is wrong with my loops.
I know my zero-filler section at the end is wrong as well, as it appends onto the end of a newly created record instead of filling in zeros up to its first appearance.
I'm sure there are far more efficient ways to do this. I'm new to programming, so it's probably overly complicated and messy. Initially I tried to compare CSV to CSV and realized that wasn't possible, since you can't read and write the same file at the same time, so I switched to using lists, which I also know won't work forever due to memory limitations once the lists get big.
-EDIT-
Using Trenton's pandas solution:
I ran a script on working.csv so it instead produces the following:
asdf#email.com,1000
bsdf#gmail.com,500
xyz#fmail.com,9999
I have modified your solution to reflect this change:
import pandas as pd
from datetime import datetime
import csv
# get the date string
datestring = datetime.strftime(datetime.now(), '%d-%b')
# filter original list to grab only emails of interest
with open('working.csv', 'r') as fr, open('writer.csv', 'w', newline='') as fw:
    reader = csv.reader(fr, delimiter=',')
    writer = csv.writer(fw, delimiter=',')
    for row in reader:
        if row[1] == 'Email' and row[2] == 'To Address':
            writer.writerow([row[3], row[4]])
# read archive
arch = pd.read_csv('archive.csv')
# rename columns
arch.rename(columns={'email': 'date'}, inplace=True)
# read working, but only the two columns that are needed
working = pd.read_csv('writer.csv', header=None, usecols=[0, 1])  # I assume usecols isn't necessary anymore, but I'm not sure
# rename columns
working.rename(columns={0: 'email', 1: datestring}, inplace=True)
# only emails greater than 30 or already in arch
working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]
# merge
arch_updated = pd.merge(arch, working, on='email', how='outer').fillna(0)
# save to csv
arch_updated.to_csv('archive.csv', index=False)
I apparently still have no idea how this works because I'm now getting :
Traceback (most recent call last):
File "---/agsdga.py", line 29, in <module>
working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]
File "---\Python\Python38-32\lib\site-packages\pandas\core\generic.py", line 5130, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'email'
Process finished with exit code 1
-UPDATE-
It is working now as:
import pandas as pd
from datetime import datetime
import csv
# get the date string
datestring = datetime.strftime(datetime.now(), '%d-%b')
with open('working.csv', 'r') as fr, open('writer.csv', 'w', newline='') as fw:
    reader = csv.reader(fr, delimiter=',')
    writer = csv.writer(fw, delimiter=',')
    for row in reader:
        if row[1] == 'Email' and row[2] == 'To Address':
            writer.writerow([row[3], row[4]])
# read archive
arch = pd.read_csv('archive.csv')
# rename columns
arch.rename(columns={'date': 'email'}, inplace=True)
# read working, but only the two columns that are needed
working = pd.read_csv('writer.csv', header=None, usecols=[0, 1])
# rename columns
working.rename(columns={0: 'email', 1: datestring}, inplace=True)
# only emails greater than 30 or already in arch
working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]
# merge
arch_updated = pd.merge(arch, working, on='email', how='outer').fillna(0)
# save to csv
arch_updated.to_csv('archive.csv', index=False)
The errors above were caused because I changed
arch.rename(columns={'date': 'email'}, inplace=True)
to
arch.rename(columns={'email': 'date'}, inplace=True)
I ran into further complications because I had stripped the header row from the test archive, thinking the header didn't matter; even with header=None I still got issues. I'm still not clear why the header is so important when we assign our own column names for the dataframe, but it's working now. Thanks for all the help!
I'd load the data with pandas.read_csv
.rename some columns
Renaming the columns in working depends on the column index, since working.csv has no column headers.
When the working dataframe is created, look at it to verify that the correct columns have been loaded and that the correct column index is being used for renaming.
The date column of arch should really be named email, because a header identifies the values below it, not the other column headers.
Once the column name has been changed in archive.csv, the rename will no longer be required.
pandas.merge on the email column.
Since both dataframes have a column renamed with email, the merged result will only have one email column.
If the merge occurs on two different column names, then the result will have two columns containing email addresses.
pandas: Merge, join, concatenate and compare
As long as the columns in the files are consistent, this should work without modification
import pandas as pd
from datetime import datetime
# get the date string
datestring = datetime.strftime(datetime.now(), '%d-%b')
# read archive
arch = pd.read_csv('archive.csv')
# rename columns
arch.rename(columns={'date': 'email'}, inplace=True)
# read working, but only the two columns that are needed
working = pd.read_csv('working.csv', header=None, usecols=[1, 3])
# rename columns
working.rename(columns={1: 'email', 3: datestring}, inplace=True)
# only emails greater than 30 or already in arch
working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]
# merge
arch_updated = pd.merge(arch, working, on='email', how='outer').fillna(0)
# save to csv
arch_updated.to_csv('archive.csv', index=False)
# display(arch_updated)
email 01-sep 27-Aug
asdf#email.com 154.0 31.0
fsda#email.com 128.0 19.0
qwer#gmail.com 77.0 92.0
ffff#xmail.com 63.0 0.0
zxcv#email.com 0.0 117.0
So, the problem is that you have two sets of data. Both store their data under a "key" entry (the email) plus an additional piece of data, and you want to condense them into one store. Recognizing that both sets share the same kind of key simplifies this greatly.
Imagine each key as being the name of a bucket. Each bucket needs two pieces of info, one piece from one csv and the other piece from the other csv.
Now, I must take a small detour to explain a dictionary in Python. Here is a definition borrowed from elsewhere:
A dictionary is a collection which is unordered, changeable and indexed.
A collection is a container, like a list, that holds data. Unordered and indexed means that the dictionary is not accessed like a list, where the data is reached by its position. Instead, the dictionary is accessed using keys, which can be anything like a string or a number (technically the key must be hashable, but that's too in-depth for now). Since Python 3.7 a dict does remember insertion order, but you still look values up by key, not by position. And finally, changeable means that the data stored in the dictionary can be changed (once again, oversimplified).
Example:
dictionary = dict()
key = "Something like a string or a number!"
dictionary[key] = "any kind of value can be stored here! Even lists and other dictionaries!"
print(dictionary[key]) # Would print the above string
Here is the structure that I suggest you use instead of most of your lists:
dictionary[email] = [item1, item2]
This way, you can avoid using multiple lists and massively simplify your code. If you are still unsure about dictionaries, there are a lot of articles and videos on how to use them. Good luck!
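As a rough sketch of the bucket idea applied to the two files from the question: the function name merge_counts, the in-memory sample text, and the two-column layout of the new data are assumptions for illustration only; the threshold of 30 comes from the question.

```python
import csv
import io

# Hypothetical file contents, modelled on the samples in the question.
archive_text = "date,01-sep\nasdf#email.com,154\nffff#xmail.com,63\n"
working_text = "asdf#email.com,31\nzxcv#email.com,117\n"


def merge_counts(archive_file, working_file, new_label, minimum=30):
    """Return (header, {email: [count per period]}) with one period added."""
    reader = csv.reader(archive_file)
    header = next(reader) + [new_label]
    periods = len(header) - 1          # count columns after the merge
    tallies = {row[0]: row[1:] for row in reader if row}

    new_counts = {row[0]: row[1] for row in csv.reader(working_file) if row}

    for email, count in new_counts.items():
        if email not in tallies and int(count) > minimum:
            # New address: back-fill zeros for the periods it missed.
            tallies[email] = ["0"] * (periods - 1)
    for email, counts in tallies.items():
        # Known addresses without traffic this period get a zero.
        counts.append(new_counts.get(email, "0"))
    return header, tallies


header, tallies = merge_counts(
    io.StringIO(archive_text), io.StringIO(working_text), "27-sep"
)
print(header)
for email, counts in tallies.items():
    print(email, counts)
```

Because each email is a key, an address can only ever appear once, which removes the duplicate-entry problem without any nested loops.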
I have a file which I read in as a string. In sublime the file looks like this:
Filename
Dataset
Level
Duration
Accuracy
Speed Ratio
Completed
file_001.mp3
datasetname_here
value
00:09:29
0.00%
7.36x
2019-07-18
file_002.mp3
datasetname_here
value
00:22:01
...etc.
in Bash:
['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', ...etc.
I want to split this into a 7-column csv. As you can see, the pattern repeats every 7 lines. I know I can use a for loop and modulus to read each line. I have done this successfully before.
How can I use pandas to read things into columns?
I don't know how to approach the pandas library. I have looked at other examples, and all of them seem to start with a csv.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('file', help="this is the file you want to open")
args = parser.parse_args()
print("file name:", args.file)

with open(args.file, 'r') as word:
    print(word.readlines())  # here is where I was making sure it read in properly
    # here is where I will start to manipulate the data
This is the Bash output:
['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', ...]
First remove '\n':
raw_data = ['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', '0.01%\n', '7.39x\n', '2019-07-20\n']
raw_data = [string.replace('\n', '') for string in raw_data]
Then pack your data in 7-length arrays inside a big array:
data = [raw_data[x:x+7] for x in range(0, len(raw_data),7)]
Finally read your data as a DataFrame, the first row contains the name of the columns:
import pandas as pd
df = pd.DataFrame(data[1:], columns=data[0])
print(df.to_string())
Filename Dataset Level Duration Accuracy Speed Ratio Completed
0 file_001.mp3 datasetname_here value 00:09:29 0.00% 7.36x 2019-07-18
1 file_002.mp3 datasetname_here L1 00:20:01 0.01% 7.39x 2019-07-20
Try this:
import pandas as pd

with open("data.txt") as f:
    list_str = f.readlines()

list_str = [s.strip() for s in list_str]  # remove \n (a list, so it can be sliced below)
n = 7
list_str = [list_str[k:k+n] for k in range(0, len(list_str), n)]
df = pd.DataFrame(list_str[1:])
df.columns = list_str[0]
df.to_csv("Data_generated.csv", index=False)
pandas is not specifically a library for reading data into columns. It can read and write many formats (comma-separated values is one of them) and is mainly used as a Python data analysis tool.
The best way to learn it is to read the documentation and practice.
I think you don't have to use pandas or any other library. My approach:
data = []
row = []
with open(args.file, 'r') as file:
    for line in file:
        row.append(line.strip())  # strip the trailing newline
        if len(row) == 7:
            data.append(row)
            row = []
How does it work?
The for loop reads the file line by line.
Add the line to row
When row's length is 7, it's completed and you can add the row to data
Create a new list for row
Repeat
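The steps above can be sketched end to end, including the final write to CSV with the standard csv module. The helper name lines_to_rows and the in-memory sample are assumptions for illustration; in practice the lines would come from the opened file and the writer would target a real file opened with newline=''.

```python
import csv
import io


def lines_to_rows(lines, width=7):
    """Group stripped, non-empty lines into rows of `width` fields."""
    row = []
    for line in lines:
        value = line.strip()
        if not value:
            continue          # ignore blank lines
        row.append(value)
        if len(row) == width:
            yield row
            row = []          # start collecting the next record


# Sample lines copied from the question.
sample_lines = [
    "Filename\n", "Dataset\n", "Level\n", "Duration\n", "Accuracy\n",
    "Speed Ratio\n", "Completed\n",
    "file_001.mp3\n", "datasetname_here\n", "value\n", "00:09:29\n",
    "0.00%\n", "7.36x\n", "2019-07-18\n",
]

rows = list(lines_to_rows(sample_lines))

out = io.StringIO()            # stands in for an output file
csv.writer(out).writerows(rows)
print(out.getvalue())
```

The first row holds the column names, so it becomes the header line of the CSV without any special handling.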
I have an issue which has already been discussed in several topics; nevertheless, I would like to go a bit deeper and maybe find a better solution.
So the idea is to go through "huge" (50 to 60GB) .csv files with Python, find the lines which satisfy some conditions, extract them and finally store them in a second variable for further analysis.
Initially the problem was for R scripts, which I manage with a sparklyr connection, or sometimes some gawk code in bash (see awk, or gawk), to extract the data I need and then analyse it with R/Python.
I would like to resolve this issue exclusively with Python, to avoid mixing languages like bash/Python or bash/R (unix). So far I use open(...) as x and go through the file line by line, and it kind of works, but it's awfully slow. Going through the file is pretty fast (~500.000 lines per second, which is fine even for 58M lines), but when I try to store the data, the speed drops to ~10 lines per second. For an extraction of ~300.000 lines, that's unacceptable.
I tried several solutions, and I'm guessing that they are not optimal (poor Python code? :( ) and that better solutions exist.
Solution 1: go through the file, split each line into a list, check the conditions, and if they hold, put the line in a numpy matrix and vstack it on each matching iteration (very slow)
import csv
import numpy
import pandas
from tqdm import tqdm
date_first = '2008-11-01'
date_last = '2008-11-10'
a = numpy.array(['colnames']*35) #data is 35 columns
index = list()
with open("data.csv", "r") as f:
    for line in tqdm(f, unit = " lines per"):
        line = line.split(sep = ";") # csv with ";" ...
        date_file = line[1][0:10] # date stored in the 2nd column
        if date_file >= date_first and date_file <= date_last : #data extraction concern a time period (one month for example)
            line=numpy.array(line) #go to numpy
            a=numpy.vstack((a, line)) #stack it
Solution 2 : the same but store the line in a pandas data.frame with a row index if conditions ok (very slow)
import csv
import numpy
import pandas
from tqdm import tqdm
date_first = '2008-11-01'
date_last = '2008-11-10'
row = 0 #row index
a = pandas.DataFrame(numpy.zeros((0,35)))#data is 35 columns
with open("data.csv", "r") as f:
    for line in tqdm(f, unit = " lines per"):
        line = line.split(sep = ";")
        date_file = line[1][0:10]
        if date_file>=date_first and date_file<=date_last :
            a.loc[row] = line #store the line in the pd.data.frame at the position row
            row = row + 1 #go to next row
Solution 3 : the same, but instead of storing the line somewhere, which is the main issue for me, keep an index of the matching rows and then re-open the csv to fetch only those rows (even slower: going through the file to find the indexes is fast enough, but fetching the indexed rows is awfully slow)
import csv
import numpy
import pandas
from tqdm import tqdm
date_first = '2008-11-01'
date_last = '2008-11-10'
row = 0
index = list()
with open("data.csv", "r") as f:
    f = csv.reader(f, delimiter = ";")
    for line in tqdm(f, unit = " lines per"):
        # csv.reader already returns each line as a list of fields
        date_file = line[1][0:10]
        if date_file>=date_first and date_file<=date_last :
            index.append(row) # 0-based, to match enumerate() below
        row = row + 1
with open("data.csv") as f:
    reader=csv.reader(f, delimiter = ";")
    interestingrows=[row for idx, row in enumerate(reader) if idx in index]
The idea is to keep only the data which satisfies the condition, here an extraction for a specific month. I do not understand where the problem comes from; saving the data somewhere (vstack, or writing into a pd.DataFrame) is definitely the bottleneck. I'm pretty sure I'm doing something wrong, but I'm not sure where or what.
The data is a csv with 35 columns and over 57M rows.
Thanks for reading
O.
Appends to dataframes and numpy arrays are very expensive because each append must copy the entire data to a new memory location. Instead, you can try reading the file in chunks, processing the data, and appending back out. Here I've picked a chunk size of 100,000 but you can obviously change this.
I don't know the column names of your CSV so I guessed at 'date_file'. This should get you close:
import pandas as pd
date_first = '2008-11-01'
date_last = '2008-11-10'
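# sep=';' because the file in the question is semicolon-separated
df = pd.read_csv("data.csv", sep=';', chunksize=100000)
first_chunk = True
for chunk in df:
    dates = chunk['date_file'].str[:10]
    chunk = chunk[(dates >= date_first) & (dates <= date_last)]
    # write the header only once, then append the remaining chunks
    chunk.to_csv('output.csv', mode='w' if first_chunk else 'a', header=first_chunk, index=False)
    first_chunk = False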
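If you prefer to keep the line-by-line loop, the copying cost described above can also be avoided by collecting matching rows in a plain Python list and converting once at the end. This is a minimal sketch, not a benchmarked solution; the function name filter_rows and the tiny in-memory sample are assumptions, while the date position (second column, first 10 characters) and the ";" separator follow the question.

```python
import csv
import io


def filter_rows(lines, date_first, date_last, date_col=1, delimiter=";"):
    """Return rows whose date (first 10 chars of date_col) is in range."""
    kept = []
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) > date_col and date_first <= row[date_col][:10] <= date_last:
            # list.append is amortized O(1): existing rows are not copied
            kept.append(row)
    return kept


# Hypothetical sample standing in for the large file.
sample = io.StringIO(
    "a;2008-10-31 23:59:00;x\n"
    "b;2008-11-01 00:00:00;y\n"
    "c;2008-11-10 12:00:00;z\n"
    "d;2008-11-11 00:00:00;w\n"
)

rows = filter_rows(sample, "2008-11-01", "2008-11-10")
print(rows)
```

After the loop, the list can be turned into an array or a DataFrame in a single step (for example numpy.array(rows) or pandas.DataFrame(rows)), so the expensive conversion happens once instead of once per matching line.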
# Program to combine data from two CSV files
cdc_list gets updated after the second call of read_csv:
overall_list = []

def read_csv(filename):
    file_read = open(filename,"r").read()
    file_split = file_read.split("\n")
    string_list = file_split[1:len(file_split)]
    #final_list = []
    for item in string_list:
        int_fields = []
        string_fields = item.split(",")
        string_fields = [int(x) for x in string_fields]
        int_fields.append(string_fields)
        #final_list.append()
        overall_list.append(int_fields)
    return(overall_list)

cdc_list = read_csv("US_births_1994-2003_CDC_NCHS.csv")
print(len(cdc_list)) #3652
total_list = read_csv("US_births_2000-2014_SSA.csv")
print(len(total_list)) #9131
print(len(cdc_list)) #9131
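The behaviour in your output follows from overall_list being created once, at module level: every call to read_csv appends to that same list and returns it, so cdc_list and total_list are two names for one list object, which is why len(cdc_list) changes after the second call. Creating the list inside the function gives each call its own result.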
However, if all you want to do is merge two CSVs (assuming they both have the same columns), you can use pandas' read_csv and concat functions and the DataFrame method to_csv to achieve this in 3 lines of code (not including imports):
import pandas as pd
# Read the first CSV file into a pandas DataFrame object
df = pd.read_csv("first.csv")
# Read the 2nd CSV file and concatenate it (DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, pd.read_csv("second.csv")], ignore_index=True)
# Write the merged DataFrame object (with both CSVs' data) to file
df.to_csv("merged.csv")
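For the hand-written reader in the question, a minimal sketch of a version that builds a fresh list on every call might look like this. The name read_rows, the file-object parameter and the sample numbers are assumptions made so the example runs without the real data files.

```python
import io


def read_rows(file_obj):
    """Parse comma-separated integer rows, skipping the header line."""
    rows = []                  # created per call, so results are independent
    lines = file_obj.read().split("\n")
    for item in lines[1:]:
        if item:               # ignore the empty string after a trailing newline
            rows.append([int(x) for x in item.split(",")])
    return rows


# Hypothetical stand-ins for the two CSV files.
first = read_rows(io.StringIO("year,births\n1994,8096\n1995,7772\n"))
second = read_rows(io.StringIO("year,births\n2000,9083\n"))

print(len(first), len(second))
```

Because rows is local to the function, the second call can no longer change the list returned by the first one; combining the two results is then an explicit first + second.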
I have a data file that has 14 lines of header. In the header, there is the metadata for the latitude-longitude coordinates and time. I am currently using
pandas.read_csv(filename, delimiter=",", header=14)
to read in the file but this just gets the data and I can't seem to get the metadata. Would anyone know how to read in the information in the header? The header looks like:
CSD,20160315SSIO
NUMBER_HEADERS = 11
EXPOCODE = 33RR20160208
SECT_ID = I08
STNBBR = 1
CASTNO = 1
DATE = 20160219
TIME = 0558
LATITUDE = -66.6027
LONGITUDE = 78.3815
DEPTH = 462
INSTRUMENT_ID = 0401
CTDPRS,CTDPRS_FLAG,CTDTMP,CTDTMP_FLAG
DBAR,,ITS-90,,PSS-78
You have to parse the metadata header yourself, but you can do it elegantly in one pass, and even use the values on the fly, for example to extract data from them or to check that the file is well-formed.
First, open the file yourself:
f = open(filename)
Then, do the work to parse each metadata line and extract data from it. For the sake of the explanation, I'm just skipping these rows:
for i in range(13): # skip the first 13 lines that are useless for the columns definition
    f.readline() # use the resulting string for metadata extraction
Now the file pointer is positioned on the single header line you want to use to load the DataFrame. The cool thing is that read_csv accepts file objects! Thus you can start loading your DataFrame right away:
pandas.read_csv(f, sep=",")
Note that I don't use the header argument, as I understand from your description that only that one last header line is useful for your dataframe. You can build on this example and adjust the header parsing and the number of rows to skip.
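As a sketch of what "use the resulting string for metadata extraction" could look like, the function below reads a fixed number of header lines and keeps every "KEY = value" pair in a dict. The name read_metadata and the shortened in-memory sample are assumptions; with the real file you would pass the opened file and the actual number of metadata lines.

```python
import io


def read_metadata(file_obj, n_lines):
    """Consume n_lines from file_obj; return 'KEY = value' pairs as a dict."""
    meta = {}
    for _ in range(n_lines):
        line = file_obj.readline()
        key, sep, value = line.partition("=")
        if sep:                # lines without '=' (e.g. the first one) are skipped
            meta[key.strip()] = value.strip()
    return meta


# Shortened, hypothetical version of the header shown in the question.
sample = io.StringIO(
    "CSD,20160315SSIO\n"
    "NUMBER_HEADERS = 11\n"
    "DATE = 20160219\n"
    "TIME = 0558\n"
    "LATITUDE = -66.6027\n"
    "LONGITUDE = 78.3815\n"
    "CTDPRS,CTDPRS_FLAG,CTDTMP,CTDTMP_FLAG\n"
)

meta = read_metadata(sample, 6)
lat, lon = float(meta["LATITUDE"]), float(meta["LONGITUDE"])

# The file pointer now sits on the column-name line.
next_line = sample.readline().strip()
print(meta["DATE"], meta["TIME"], lat, lon)
```

In real use, instead of reading next_line by hand, the same file object would be handed to pandas.read_csv at that point, as described above.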
Although the following method does not use Pandas, I was able to extract the header information.
import csv

with open(fname) as csvfile:
    forheader_IO2016 = csv.reader(csvfile, delimiter=',')
    header_IO2016 = []
    for row in forheader_IO2016:
        header_IO2016.append(row[0])  # keep the first field of every line

# each metadata line looks like "KEY = value"; pick the value part
date = header_IO2016[7].split(" ")[2]
time = header_IO2016[8].split(" ")[2]
lat = float(header_IO2016[9].split(" ")[2])
lon = float(header_IO2016[10].split(" ")[4])