I have a file which I read in as a string. In Sublime Text the file looks like this:
Filename
Dataset
Level
Duration
Accuracy
Speed Ratio
Completed
file_001.mp3
datasetname_here
value
00:09:29
0.00%
7.36x
2019-07-18
file_002.mp3
datasetname_here
value
00:22:01
...etc.
and printed from my script in the terminal (Bash) it looks like this:
['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', ...etc.
I want to split this into a 7-column csv. As you can see, the values repeat every 7th line. I know I could use a for loop and modulus to read each line; I have done that successfully before.
How can I use pandas to read things into columns?
I don't know how to approach the pandas library. I have looked at other examples and they all seem to start from a csv.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('file', help="this is the file you want to open")
args = parser.parse_args()
print("file name:", args.file)
with open(args.file, 'r') as word:
    print(word.readlines())  # here is where I was making sure it read in properly
    # here is where I will start to manipulate the data
This is the printed output:
['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', ...]
First remove '\n':
raw_data = ['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', '0.01%\n', '7.39x\n', '2019-07-20\n']
raw_data = [string.replace('\n', '') for string in raw_data]
Then pack your data into length-7 lists inside a bigger list:
data = [raw_data[x:x+7] for x in range(0, len(raw_data),7)]
Finally, read your data into a DataFrame; the first chunk contains the column names:
import pandas as pd
df = pd.DataFrame(data[1:], columns=data[0])
print(df.to_string())
Filename Dataset Level Duration Accuracy Speed Ratio Completed
0 file_001.mp3 datasetname_here value 00:09:29 0.00% 7.36x 2019-07-18
1 file_002.mp3 datasetname_here L1 00:20:01 0.01% 7.39x 2019-07-20
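If the end goal is the 7-column csv file rather than the printed table, the DataFrame can be written straight out (a one-line follow-up; output.csv is just a placeholder name):
df.to_csv('output.csv', index=False)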
Try this:
import pandas as pd

n = 7  # number of fields per record

with open("data.txt") as f:
    list_str = f.readlines()

list_str = [s.strip() for s in list_str]  # remove the trailing '\n'
list_str = [list_str[k:k + n] for k in range(0, len(list_str), n)]

df = pd.DataFrame(list_str[1:])
df.columns = list_str[0]
df.to_csv("Data_generated.csv", index=False)
Pandas is not just a tool for reading data into columns; it is a Python data-analysis library that can read and write many formats (comma-separated values being one of them).
The best place to learn is the official documentation, plus practice.
The output of the above code is the generated Data_generated.csv file.
I don't think you have to use pandas or any other library for this. My approach:
data = []
row = []
with open(args.file, 'r') as file:
    for line in file:
        row.append(line.strip())  # drop the trailing '\n'
        if len(row) == 7:
            data.append(row)
            row = []
How does it work?
The for loop reads the file line by line.
Each line is appended to row.
When row holds 7 values, the record is complete, so it is appended to data.
A new empty list is started for the next row.
Repeat until the file ends.
To actually produce the csv, the collected rows can then be written out with the csv module, as sketched below.
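A minimal sketch of that writing step (it assumes an output filename of output.csv; data[0] is the header record collected above):
import csv

with open('output.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerows(data)  # the first collected row is the header, the rest are records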
Related
I have a large semicolon-delimited text file that weighs in at a little over 100GB. It comprises ~18,000,000 rows of data and 772 columns.
The columns are: 'sc16' (int), 'cpid' (int), 'type' (str), 'pubyr' (int) and then 768 columns labeled 'dim_0', 'dim_1', 'dim_2' ... 'dim_767', all ints.
The file is already arranged/sorted by sc16 and pubyr so that each combination of sc16+pubyr are grouped together in ascending order.
What I'm trying to do is get the average of each 'dim_' column for each unique combination of sc16 & pubyr, then output the row to a new dataframe and save the final result to a new text file.
The problem is that in my script below, the processing gradually gets slower and slower until it's just creeping along by row 5,000,000. I'm working on a machine with 96GB of RAM, and I'm not used to working with a file so large I can't simply load it into memory. This is my first attempt trying to work with something like itertools, so no doubt I'm being really inefficient. Any help you can provide would be much appreciated!
import itertools
import pandas as pd

# Step 1: create an empty dataframe to store the mean values
mean_df = pd.DataFrame(columns=['sc16', 'pubyr'] + [f"dim_{i}" for i in range(768)])

# Step 2: open the file and iterate through the rows
with open('C:\Python_scratch\scibert_embeddings_sorted.txt') as f:
    counter = 0
    total_lines = sum(1 for line in f)
    f.seek(0)
    for key, group in itertools.groupby(f, key=lambda x: (x.split(';')[0], x.split(';')[3])):  # group by the first (sc16) and fourth (pubyr) column
        sc16, pubyr = key
        rows = [row.strip().split(';') for row in group]
        columns = rows[0]
        rows = rows[1:]
        # Step 3: convert the group of rows to a dataframe
        group_df = pd.DataFrame(rows, columns=columns)
        # Step 4: calculate the mean for the group
        mean_row = {'sc16': sc16, 'pubyr': pubyr}
        for col in group_df.columns:
            if col.startswith('dim_'):
                mean_row[col] = group_df[col].astype(float).mean()
        # Step 5: append the mean row to the mean dataframe
        mean_df = pd.concat([mean_df, pd.DataFrame([mean_row])], ignore_index=True)
        counter += len(rows)
        print(f"{counter} of {total_lines}")

# Step 6: save the mean dataframe to a new file
mean_df.to_csv('C:\Python_scratch\scibert_embeddings_mean.txt', sep=';', index=False)
You might not want to use Pandas at all, since your data is already neatly pre-sorted and all.
Try something like this; it uses numpy to make dim-wise averaging fast, but is plain Python otherwise. It processes a 43,000-line example file I generated in about 7.6 seconds on my machine, and I don't see a reason why this should slow down over time. (If you know your file won't have a header line or empty lines, you could get rid of those checks.)
Your original code also spent extra time parsing the read lines over and over again; this uses a generator that does that only once.
import itertools
import operator

import numpy as np


def read_embeddings_file(filename):
    # Read the (pre-sorted) embeddings file,
    # yielding tuples of ((sc16, pubyr) and a list of dimensions).
    with open(filename) as in_file:
        for line in in_file:
            if not line or line.startswith("sc16"):  # Header or empty line
                continue
            line = line.split(";")
            sc16, cpid, type, pubyr, *dims = line
            # list(map(... is faster than the equivalent listcomp
            yield (sc16, pubyr), list(map(int, dims))


def main():
    output_name = "scibert_embeddings_mean.txt"
    input_name = "scibert_embeddings_sorted.txt"

    with open(output_name, "w") as out_f:
        print("sc16", "pubyr", *[f"dim_{i}" for i in range(768)], sep=";", file=out_f)

        counter = 0
        for group, group_contents in itertools.groupby(
            read_embeddings_file(input_name),
            key=operator.itemgetter(0),  # Group by (sc16, pubyr)
        ):
            dims = [d[1] for d in group_contents]
            # Calculate the mean of each dimension
            mean_dims = np.mean(np.array(dims).astype(float), axis=0)
            # Write group to output
            print(*group, *mean_dims, sep=";", file=out_f)
            # Print progress
            counter += len(dims)
            print(f"Processed: {counter}; group: {group}, entries in group: {len(dims)}")


if __name__ == "__main__":
    main()
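Note that this only works because the file is pre-sorted: itertools.groupby merges consecutive items with the same key, not all items with that key. A tiny illustration of that behaviour:
import itertools

data = [("a", 1), ("a", 2), ("b", 3), ("a", 4)]
groups = [(k, [v for _, v in g]) for k, g in itertools.groupby(data, key=lambda t: t[0])]
print(groups)  # [('a', [1, 2]), ('b', [3]), ('a', [4])] -- the second 'a' forms its own group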
I read a CSV file into Python and organized it into lists.
I need to print the median carat for the 'Premium' category (highlighted in yellow in the attached screenshot).
Here is my code:
diamonds_file = open('diamonds.csv', 'r')
lines = diamonds_file.readlines()
table = []
for i in range(len(lines)):
    lines[i] = lines[i].replace('\n', '')
    splitted_line = lines[i].split(',')
    print(splitted_line)
The attached screenshot shows the output of this code.
I hope you are allowed to use external libraries.
import statistics

diamonds_file = open('diamonds.csv', 'r')
lines = diamonds_file.readlines()
values = []
for i in range(len(lines)):
    lines[i] = lines[i].replace('\n', '')
    splitted_line = lines[i].split(',')
    if splitted_line[1] == '"Premium"':
        values.append(float(splitted_line[0]))
print(statistics.median(values))
Without an external library:
diamonds_file = open('diamonds.csv', 'r')
lines = diamonds_file.readlines()
values = []
for i in range(len(lines)):
    lines[i] = lines[i].replace('\n', '')
    splitted_line = lines[i].split(',')
    if splitted_line[1] == '"Premium"':
        values.append(float(splitted_line[0]))
# the median needs the values sorted
values.sort()
n = len(values)
if n % 2 == 1:
    print(values[n // 2])
else:
    print((values[n // 2 - 1] + values[n // 2]) / 2)
Read the csv into pandas...
import pandas as pd
df = pd.read_csv('diamonds.csv')
If the csv has no header row, pass header=None to read_csv so the columns get integer labels; then select columns by position (as I do below) or rename them...and continue.
df = pd.read_csv('diamonds.csv', header=None)
df_Premium = df[df[1] == 'Premium']
stats = df_Premium.describe()
display(stats)
The median appears as the 50% row of the describe() output.
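If the file does have a header row and you just want the number itself, the median can also be read off in one line (a sketch; it assumes the columns are named carat and cut, as in the standard diamonds dataset):
import pandas as pd

df = pd.read_csv('diamonds.csv')  # header row with 'carat' and 'cut' columns assumed
print(df[df['cut'] == 'Premium']['carat'].median())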
Use the pandas library; it is a data-analysis library made for exactly this.
import pandas as pd
df = pd.read_csv("diamonds.csv")
Now the table is stored in a DataFrame df.
You want the median of a specific metric per category:
df.groupby('cut').median()
This shows the median of every numerical column.
Now select the specific column you need:
df.groupby('cut').median()['carat']
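To pull out just the Premium value from that result (a small follow-up using the same assumed column names):
print(df.groupby('cut')['carat'].median().loc['Premium'])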
def premiummedian(splitted_line):
    premium_carat = []
    for line in splitted_line:
        if line[1] == "Premium":
            premium_carat.append(float(line[0]))
    # the values must be sorted before taking the median
    premium_carat.sort()
    n = len(premium_carat)
    if n % 2 == 1:  # odd length: return the middle element
        return premium_carat[n // 2]
    else:  # even length: return the average of the two middle elements
        return (premium_carat[n // 2 - 1] + premium_carat[n // 2]) / 2
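A sketch of how this function could be called with the rows from the question's loop; it assumes carat is the first field and cut the second, and strips quotes in case the raw values are quoted (as in the earlier answers):
table = []
with open('diamonds.csv', 'r') as diamonds_file:
    for line in diamonds_file:
        table.append([field.strip('"') for field in line.strip().split(',')])

print(premiummedian(table[1:]))  # skip the header row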
So I'm having a problem printing the max and min values from a file. The file has over 3000 lines and looks like this:
3968 #number of lines
2000-01-03, 3.738314
2000-01-04, 3.423135
2000-01-05, 3.473229
...
...
2015-10-07, 110.779999
2015-10-08, 109.50
2015-10-09, 112.120003
This is my current code. I have no idea how to make it work: right now it only prints the 3968 value (because, as a string, it is the largest), but I want the largest and smallest values from the second column (all the stock prices).
def apple():
    stock_file = open('apple_USD.txt', 'r')
    data = stock_file.readlines()
    data = data[0:]
    stock_file.close()
    print(max(data))
Your current code outputs the "correct" output by chance, since it is using string comparison.
Consider this:
with open('test.txt') as f:
    lines = [line.split(', ') for line in f.readlines()[1:]]
# lines is a list of lists, each sub-list represents a line in a format [date, value]
max_value_date, max_value = max(lines, key=lambda line: float(line[-1].strip()))
print(max_value_date, max_value)
# '2015-10-09' '112.120003'
Your current code reads each line as a string and then finds the max and min lines of that list. You could use pandas to read the file as a CSV, load it into a data frame, and then run your min/max operations on the data frame.
Hope the following answers your question:
stocks = []
data = data[1:]
for d in data:
    stocks.append(float(d.split(',')[1]))
print(max(stocks))
print(min(stocks))
I recommend the pandas module for working with tabular data, and its read_csv function in particular. It is very well documented, optimized, and very popular for this purpose. You can install it with pip install pandas.
I created a dummy file with your format and stored it in a file called test.csv:
3968 #number of lines
2000-01-03, 3.738314
2000-01-04, 3.423135
2000-01-05, 3.473229
2015-10-07, 110.779999
2015-10-08, 109.50
2015-10-09, 112.120003
Then, to parse the file you can do as follows. The names parameter defines the column names, and skiprows skips the first line.
#import module
import pandas as pd
#load file
df = pd.read_csv('test.csv', names=['date', 'value'], skiprows=[0])
#get max and min values
max_value = df['value'].max()
min_value = df['value'].min()
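If you also need the dates on which those extremes occurred, idxmax/idxmin can look them up (a short sketch built on the same df as above):
max_date = df.loc[df['value'].idxmax(), 'date']
min_date = df.loc[df['value'].idxmin(), 'date']
print(max_date, max_value)
print(min_date, min_value)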
You want to extract the second column into a float using float(datum.split(', ')[1].strip()), and ignore the first line.
def apple():
    stock_file = open('apple_USD.txt', 'r')
    data = stock_file.readlines()
    data = data[1:]  # ignore first line
    stock_file.close()
    data = [datum.split(', ') for datum in data]
    max_value_date, max_value = max(data, key=lambda data: float(data[-1].strip()))
    print(max_value_date, max_value)
Or you can do it in a simpler way: make a list of prices and then get the maximum and minimum, like this:
# as the first line in your txt is not data
datanew = data[1:]
prices = []
for line in datanew:
    line_after = line.split(',')
    price = line_after[1]
    prices.append(float(price))
maxprice = max(prices)
minprice = min(prices)
I have an issue that has already been discussed in several topics; nevertheless, I would like to go a bit deeper and maybe find a better solution.
The idea is to go through "huge" (50 to 60 GB) .csv files with Python, find the lines that satisfy some conditions, extract them, and finally store them in a second variable for further analysis.
Initially I handled the problem with R scripts, which I manage with a sparklyr connection, or occasionally some gawk code in Bash (see awk, or gawk), to extract the data I need and then analyse it with R/Python.
I would like to solve this issue exclusively with Python; the idea is to avoid mixing languages like Bash/Python or Bash/R (Unix). So far I use open() and go through the file line by line, and it kind of works, but it's awfully slow. For example, going through the file is pretty fast (~500,000 lines per second, which is fine even for 58M lines), but when I try to store the data, the speed drops to ~10 lines per second. For an extraction of ~300,000 lines, that's unacceptable.
I have tried several solutions and I'm guessing they are not optimal (poor Python code? :( ) and that better solutions exist.
Solution 1: go through the file, split each line into a list, check the conditions, and if they hold, put the line into a numpy matrix with vstack on every matching iteration (very slow).
import csv
import numpy
import pandas
from tqdm import tqdm

date_first = '2008-11-01'
date_last = '2008-11-10'

a = numpy.array(['colnames'] * 35)  # data is 35 columns
index = list()

with open("data.csv", "r") as f:
    for line in tqdm(f, unit=" lines per"):
        line = line.split(sep=";")  # csv with ";" ...
        date_file = line[1][0:10]  # date stored in the 2nd column
        if date_file >= date_first and date_file <= date_last:  # data extraction concerns a time period (one month for example)
            line = numpy.array(line)  # go to numpy
            a = numpy.vstack((a, line))  # stack it
Solution 2: the same, but store the line in a pandas DataFrame at a running row index when the conditions hold (very slow).
import csv
import numpy
import pandas
from tqdm import tqdm

date_first = '2008-11-01'
date_last = '2008-11-10'

row = 0  # row index
a = pandas.DataFrame(numpy.zeros((0, 35)))  # data is 35 columns

with open("data.csv", "r") as f:
    for line in tqdm(f, unit=" lines per"):
        line = line.split(sep=";")
        date_file = line[1][0:10]
        if date_file >= date_first and date_file <= date_last:
            a.loc[row] = line  # store the line in the pd.DataFrame at position row
            row = row + 1  # go to next row
Solution 3: the same, but instead of storing the line somewhere (which is the main issue for me), keep an index of the satisfying rows and then read the csv again keeping only the rows I need (even slower; going through the file to find the indexes is fast enough, but reading the indexed rows is awfully slow).
import csv
import numpy
import pandas
from tqdm import tqdm

date_first = '2008-11-01'
date_last = '2008-11-10'

row = 0
index = list()

with open("data.csv", "r") as f:
    reader = csv.reader(f, delimiter=";")
    for line in tqdm(reader, unit=" lines per"):
        # csv.reader already splits each line on ";"
        date_file = line[1][0:10]
        row = row + 1
        if date_file >= date_first and date_file <= date_last:
            index.append(row)

with open("data.csv") as f:
    reader = csv.reader(f, delimiter=";")
    interestingrows = [row for idx, row in enumerate(reader) if idx in index]
The idea is to keep only the data that satisfies the condition, here an extraction for a specific month. I do not understand where the problem comes from; saving the data somewhere (with vstack, or by writing into a pd.DataFrame) is definitely the issue. I'm pretty sure I'm doing something wrong, but I'm not sure where or what.
The data is a csv with 35 columns and over 57M rows.
Thanks for reading.
O.
Appends to dataframes and numpy arrays are very expensive because each append must copy the entire data to a new memory location. Instead, you can read the file in chunks, filter each chunk, and append the matching rows to the output file. Here I've picked a chunk size of 100,000, but you can obviously change this.
I don't know the column names of your CSV, so I guessed 'date_file' for the date column. This should get you close:
import pandas as pd

date_first = '2008-11-01'
date_last = '2008-11-10'

df = pd.read_csv("data.csv", sep=';', chunksize=100000)

for chunk in df:
    chunk = chunk[(chunk['date_file'].str[:10] >= date_first)
                  & (chunk['date_file'].str[:10] <= date_last)]
    chunk.to_csv('output.csv', mode='a')
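One caveat with mode='a': every chunk appends its own header row plus an index column. A sketch of the same loop that avoids both (still guessing the 'date_file' column name and the ';' separator):
import pandas as pd

date_first = '2008-11-01'
date_last = '2008-11-10'

first_chunk = True
for chunk in pd.read_csv("data.csv", sep=';', chunksize=100000):
    mask = ((chunk['date_file'].str[:10] >= date_first)
            & (chunk['date_file'].str[:10] <= date_last))
    chunk[mask].to_csv('output.csv',
                       mode='w' if first_chunk else 'a',
                       header=first_chunk,
                       index=False)
    first_chunk = False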
I'm pretty new to Python and coding in general, so sorry in advance for any dumb questions. My program needs to split an existing log file into several .csv files (run1.csv, run2.csv, ...) based on the keyword 'MYLOG'. When the keyword appears, it should start copying the two desired columns into a new file until the keyword appears again. When finished, there need to be as many csv files as there are keyword occurrences.
53.2436 EXP MYLOG: START RUN specs/run03_block_order.csv
53.2589 EXP TextStim: autoDraw = None
53.2589 EXP TextStim: autoDraw = None
55.2257 DATA Keypress: t
57.2412 DATA Keypress: t
59.2406 DATA Keypress: t
61.2400 DATA Keypress: t
63.2393 DATA Keypress: t
...
89.2314 EXP MYLOG: START BLOCK scene [specs/run03_block01.csv]
89.2336 EXP Imported specs/run03_block01.csv as conditions
89.2339 EXP Created sequence: sequential, trialTypes=9
...
[EDIT]: The output per file (run*.csv) should look like this:
onset type
53.2436 EXP
53.2589 EXP
53.2589 EXP
55.2257 DATA
57.2412 DATA
59.2406 DATA
61.2400 DATA
...
The program creates as many run*.csv files as needed, but I can't store the desired columns in my new files. When finished, all I get are empty csv files. If I change the counter check to == 1, it creates just one big file with the desired columns.
Thanks again!
import csv

QUERY = 'MYLOG'

with open('localizer.log', 'rt') as log_input:
    i = 0
    for line in log_input:
        if QUERY in line:
            i = i + 1
            with open('run' + str(i) + '.csv', 'w') as output:
                reader = csv.reader(log_input, delimiter=' ')
                writer = csv.writer(output)
                content_column_A = [0]
                content_column_B = [1]
                for row in reader:
                    content_A = list(row[j] for j in content_column_A)
                    content_B = list(row[k] for k in content_column_B)
                    writer.writerow(content_A)
                    writer.writerow(content_B)
Looking at the code, there are a few things that are possibly wrong:
the csv reader should take a file handler, not a single line.
the reader delimiter should not be a single space character as it looks like the actual delimiter in your logs is a variable number of multiple space characters.
the looping logic seems to be a bit off, confusing files/lines/rows a bit.
You may be looking at something like the code below (pending clarification in the question):
import csv

NEW_LOG_DELIMITER = 'MYLOG'


def write_buffer(_index, buffer):
    """
    This function takes an index and a buffer.
    The buffer is just an iterable of iterables (ex a list of lists)
    Each buffer item is a row of values.
    """
    filename = 'run{}.csv'.format(_index)
    with open(filename, 'w') as output:
        writer = csv.writer(output)
        writer.writerow(['onset', 'type'])  # adding the heading
        writer.writerows(buffer)


current_buffer = []
_index = 1

with open('localizer.log', 'rt') as log_input:
    for line in log_input:
        # will deal ok with multi-space as long as
        # you don't care about the last column
        fields = line.split()[:2]
        if not NEW_LOG_DELIMITER in line or not current_buffer:
            # If it's the first line (the current_buffer is empty)
            # or the line does NOT contain "MYLOG" then
            # collect it until it's time to write it to file.
            current_buffer.append(fields)
        else:
            write_buffer(_index, current_buffer)
            _index += 1
            current_buffer = [fields]  # EDIT: fixed bug, new buffer should not be empty

if current_buffer:
    # We are now out of the loop,
    # if there's an unwritten buffer then write it to file.
    write_buffer(_index, current_buffer)
You can use pandas to simplify this problem.
Import pandas and read in the log file:
import pandas as pd
df = pd.read_fwf('localizer2.log', header=None)
df.columns = ['onset', 'type', 'event']
df.set_index('onset', inplace=True)
Set a flag where the third column starts with 'MYLOG':
df['flag'] = 0
df.loc[df.event.str[:5] == 'MYLOG', 'flag'] = 1
df.flag = df['flag'].cumsum()
Save each run as a separate run*.csv file
for i in range(1, df.flag.max() + 1):
    df.loc[df.flag == i, 'event'].to_csv('run{0}.csv'.format(i))
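If the goal is the two-column onset/type layout shown in the question rather than the event text, a small variation of this last loop (a sketch using the same DataFrame) would be:
for i in range(1, df.flag.max() + 1):
    run = df.loc[df.flag == i, 'type'].reset_index()  # columns: onset, type
    run.to_csv('run{0}.csv'.format(i), index=False)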
EDIT:
It looks like your format is different than I originally assumed, so I changed the code to use pd.read_fwf. My localizer.log file was a copy and paste of your original data; I hope this works for you. I assumed from the original post that it did not have headers. If it does have headers, remove header=None and the df.columns = ['onset', 'type', 'event'] line.