Reading in tab-delimited txt file in chunks? - python

I have a txt file, and here is a snippet of the first few lines:
C A10231 A1 171|171 HER
C B23098 A1 171|171 HEF
C A03295 A2 171|171 HAF
I want to create a running list of every time the third column reads something other than "A1", and also keep track of how many times "A1" appears. Is there a way to import this file into a pandas df without causing a memory error?
If not, how can I process the txt file using the following rules:
Keep a running count of every time the third column reads "A1"
If the third column is not "A1", append the value to a list.
Find the amount of rows in the txt file
I essentially want to create three outputs. One output is the count of A1, another is a list of everything that isn't A1, e.g. non_A1 = ['A2','B3','B4','V6'...], and the last is the total number of rows.

All you need to do is process each line as you read it; no need to store anything more than your accumulated results and the current line in memory at any given time, and certainly no need to build a full dataframe from the contents of the file.
row_count = 0
a1_count = 0
non_a1 = []

with open("file.tsv") as f:
    for line in f:
        row = line.strip().split('\t')
        row_count += 1
        if row[2] == 'A1':
            a1_count += 1
        else:
            non_a1.append(row[2])

Since you tagged your question with Pandas, you can use:
import pandas as pd

count_A1 = 0
non_A1 = set()
num_rows = 0

for chunk in pd.read_csv('/home/damien/data.txt', sep='\t', usecols=[2],
                         header=None, chunksize=1):
    count_A1 += chunk[2].eq('A1').sum()
    non_A1 |= set(chunk.loc[chunk[2].ne('A1'), 2].unique().tolist())
    num_rows += chunk.shape[0]
Output:
>>> count_A1
2
>>> list(non_A1)
['A2']
>>> num_rows
3

Using pandas for this trivial task is overkill
a1_count = 0
line_count = 0
others = []

with open('foo.tsv') as tsv:
    for line in tsv:
        if (ax := line.split()[2]) == 'A1':
            a1_count += 1
        else:
            others.append(ax)
        line_count += 1

In a similar vein to @Corralien's answer, but using the categorical datatype, which gives memory savings for large amounts of data that fall into a limited number of categories:
import pandas as pd

# Create some test data
fname = "reading_tsv_in_chunks.tsv"
with open(fname, "w") as fid:
    for i in range(1000):
        fid.write("C\tA10231\tA1\t171|171\tHER\nC\tB23098\tA1\t171|171\tHEF\nC\tA03295\tA2\t171|171\tHAF\nC\tA02225\tA3\t171|171\tHAX\nC\tA012325\tA4\t171|171\tHAY\n")

# Read as categorical
df = pd.read_csv(fname, sep="\t", header=None, names=["category",], usecols=[2,], dtype="category")
print(f"Contents of df:\n{df.describe()}\n")
print(f"Memory usage with categorical dtype:\n{df.memory_usage()}\n\n")

# Read as non-categorical
df2 = pd.read_csv(fname, sep="\t", header=None, names=["category",], usecols=[2,])
print(f"Contents of df2:\n{df2.describe()}\n")
print(f"Memory usage WITHOUT categorical dtype:\n{df2.memory_usage()}\n\n")

# Process as necessary, e.g.
a1_count = sum(len(values) for category, values in df.groupby("category")["category"] if category == "A1")
non_a1_count = sum(len(values) for category, values in df.groupby("category")["category"] if category != "A1")
print(f"A1 count: {a1_count}\n")
print(f"Non-A1 count: {non_a1_count}")

Related

Calculate averages over subgroups of data in extremely large (100GB+) CSV file

I have a large semicolon-delimited text file that weighs in at a little over 100GB. It comprises ~18,000,000 rows of data and 772 columns.
The columns are: 'sc16' (int), 'cpid' (int), 'type' (str), 'pubyr' (int) and then 768 columns labeled 'dim_0', 'dim_1', 'dim_2' ... 'dim_767', which are all ints.
The file is already arranged/sorted by sc16 and pubyr so that each combination of sc16+pubyr are grouped together in ascending order.
What I'm trying to do is get the average of each 'dim_' column for each unique combination of sc16 & pubyr, then output the row to a new dataframe and save the final result to a new text file.
The problem is that in my script below, the processing gradually gets slower and slower until it's just creeping along by row 5,000,000. I'm working on a machine with 96GB of RAM, and I'm not used to working with a file so large I can't simply load it into memory. This is my first attempt trying to work with something like itertools, so no doubt I'm being really inefficient. Any help you can provide would be much appreciated!
import itertools
import pandas as pd

# Step 1: create an empty dataframe to store the mean values
mean_df = pd.DataFrame(columns=['sc16', 'pubyr'] + [f"dim_{i}" for i in range(768)])

# Step 2: open the file and iterate through the rows
with open('C:\Python_scratch\scibert_embeddings_sorted.txt') as f:
    counter = 0
    total_lines = sum(1 for line in f)
    f.seek(0)
    # group by the first (sc16) and fourth (pubyr) column
    for key, group in itertools.groupby(f, key=lambda x: (x.split(';')[0], x.split(';')[3])):
        sc16, pubyr = key
        rows = [row.strip().split(';') for row in group]
        columns = rows[0]
        rows = rows[1:]
        # Step 3: convert the group of rows to a dataframe
        group_df = pd.DataFrame(rows, columns=columns)
        # Step 4: calculate the mean for the group
        mean_row = {'sc16': sc16, 'pubyr': pubyr}
        for col in group_df.columns:
            if col.startswith('dim_'):
                mean_row[col] = group_df[col].astype(float).mean()
        # Step 5: append the mean row to the mean dataframe
        mean_df = pd.concat([mean_df, pd.DataFrame([mean_row])], ignore_index=True)
        counter += len(rows)
        print(f"{counter} of {total_lines}")

# Step 6: save the mean dataframe to a new file
mean_df.to_csv('C:\Python_scratch\scibert_embeddings_mean.txt', sep=';', index=False)
You might not want to use Pandas at all, since your data is already neatly pre-sorted and all.
Try something like this; it uses numpy to make dim-wise averaging fast, but is plain Python otherwise. It processes a 43,000-line example file I generated in about 7.6 seconds on my machine, and I don't see a reason why this should slow down over time. (If you know your file won't have a header line or empty lines, you could get rid of those checks.)
Your original code also spent extra time parsing the read lines over and over again; this uses a generator that does that only once.
import itertools
import operator

import numpy as np


def read_embeddings_file(filename):
    # Read the (pre-sorted) embeddings file,
    # yielding tuples of ((sc16, pubyr), list of dimensions).
    with open(filename) as in_file:
        for line in in_file:
            if not line.strip() or line.startswith("sc16"):  # Header or empty line
                continue
            line = line.split(";")
            sc16, cpid, type, pubyr, *dims = line
            # list(map(...)) is faster than the equivalent listcomp
            yield (sc16, pubyr), list(map(int, dims))


def main():
    output_name = "scibert_embeddings_mean.txt"
    input_name = "scibert_embeddings_sorted.txt"
    with open(output_name, "w") as out_f:
        print("sc16", "pubyr", *[f"dim_{i}" for i in range(768)], sep=";", file=out_f)
        counter = 0
        for group, group_contents in itertools.groupby(
            read_embeddings_file(input_name),
            key=operator.itemgetter(0),  # Group by (sc16, pubyr)
        ):
            dims = [d[1] for d in group_contents]
            # Calculate the mean of each dimension
            mean_dims = np.mean(np.array(dims).astype(float), axis=0)
            # Write group to output
            print(*group, *mean_dims, sep=";", file=out_f)
            # Print progress
            counter += len(dims)
            print(f"Processed: {counter}; group: {group}, entries in group: {len(dims)}")


if __name__ == "__main__":
    main()

How to parse the log data which is in form of nested [key=value] format using python pandas

I have huge sensor log data in the form of [key=value] pairs, and I need to parse the data column-wise.
I found this code for my problem:
import pandas as pd

lines = []
with open('/path/to/test.txt', 'r') as infile:
    for line in infile:
        if "," not in line:
            continue
        else:
            lines.append(line.strip().split(","))

row_names = []
column_data = {}
max_length = max(*[len(line) for line in lines])

for line in lines:
    while len(line) < max_length:
        line.append(f'{len(line)-1}=NaN')

for line in lines:
    row_names.append(" ".join(line[:2]))
    for info in line[2:]:
        (k, v) = info.split("=")
        if k in column_data:
            column_data[k].append(v)
        else:
            column_data[k] = [v]

df = pd.DataFrame(column_data)
df.index = row_names
print(df)
df.to_csv('/path/to/test.csv')
The above code works when the data is in the form "Priority=0, X=776517049", but my data looks like [Priority=0][X=776517049] and there is no separator between two columns. How can I do this in Python? I am sharing a link to the sample data here (the raw data, and below it the expected parsed data which I produced manually): https://docs.google.com/spreadsheets/d/1EVTVL8RAkrSHhZO48xV1uEGqOzChQVf4xt7mHkTcqzs/edit?usp=sharing Kindly check this link.
I've downloaded the sheet as CSV.
Since your file has multiple tables on one sheet, I've limited the read to 100 rows; you can remove that parameter.
import pandas as pd

raw = pd.read_csv(
    "logdata - Sheet1.csv",  # filename
    skiprows=1,              # skip the first row
    nrows=100,               # use 100 rows, remove in your example
    usecols=[0],             # only use the first column
    header=None,             # your dataset has no column names
)
Then you can use a regex to extract the values:
df = raw[0].str.extract(r"\[Priority=(\d*)\] \[GPS element=\[X=(\d*)\] \[Y=(\d*)\] \[Speed=(\d*)\]")
and set column names:
df.columns = ["Priority", "X", "Y", "Speed"]
result:
Priority X Y Speed
0 0 776517049 128887449 4
1 0 776516816 128887733 0
2 0 776516816 128887733 0
3 0 776516833 128887166 0
4 0 776517200 128886133 0
5 0 776516883 128885933 8
.....................................
99 0 776494483 128908783 0
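If the exact layout varies from line to line, an alternative (my suggestion, not part of the answer above) is to pull every innermost [key=value] pair out with a generic regex and build the frame from dictionaries; a minimal sketch, with two sample lines reconstructed as an approximation of the raw data from the regex and parsed output above:
import re
import pandas as pd

pattern = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")

def parse_line(line):
    # Returns the innermost pairs, e.g. {'Priority': '0', 'X': '776517049', 'Y': '128887449', 'Speed': '4'}
    return dict(pattern.findall(line))

lines = [
    "[Priority=0] [GPS element=[X=776517049] [Y=128887449] [Speed=4]]",
    "[Priority=0] [GPS element=[X=776516816] [Y=128887733] [Speed=0]]",
]
df = pd.DataFrame([parse_line(line) for line in lines])
print(df)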

Python 3: what is the best way to iterate over each value of a column?

I am new to Python and would like some advice on the simplest way to iterate over a given column of data.
My input file looks like this:
Col1,Col2,Col3
593457863416,959345934754,9456968345233455
487593748734,485834896965,4958558475345
694568245543,34857495345,494589589209
...
What I would like to do is add 100 to all items in column 2. So the output would like this:
Col1,Col2,Col3
593457863416,959345934854,9456968345233455
487593748734,485834897065,4958558475345
694568245543,34857495445,494589589209
...
Here is my code so far:
import csv

with open("C:/Users/r00t/Desktop/test/sample.txt") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    output_list = []
    for row in csv_reader:
        if line_count == 0:
            print(f'{", ".join(row)}')
            line_count += 1
        else:
            temp_list = []
            output_row = int(row[1])
            output_row = output_row + 100
            temp_list = [row[0], row[1], row[2]]
            output_list = [[row[0], output_row, row[2]]]
            print(output_list)
            line_count += 1
This code doesn't seem optimal. Is there a way to avoid specifying an index for each row? And what happens when my file has more than 3 columns?
Thank you!
-r
I suggest using csv.DictReader(). Each row will now be in a dictionary, with keys being the column name, and the value being the row value.
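A minimal sketch of that idea, assuming the sample.txt layout from the question (the output filename and the use of csv.DictWriter are my assumptions):
import csv

with open("sample.txt", newline='') as src, open("sample_out.txt", "w", newline='') as dst:
    reader = csv.DictReader(src)                                # rows come back as dicts keyed by column name
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["Col2"] = str(int(row["Col2"]) + 100)               # add 100 to Col2, as in the question
        writer.writerow(row)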
You can use Series-based value addition, positional indexing with .iloc, or an in-place update without pandas.
Simplest way (in pandas):
df["column2"] = df["column2"] + 100
With .iloc (in pandas):
df.iloc[:, 1] = df.iloc[:, 1] + 100
Without pandas:
import csv

file_read = csv.reader(open('/tmp/test.csv'))
file_data_in_list = list(file_read)

# Since you have three columns, you can simply go through index 1 and add 100 there
for index in range(len(file_data_in_list)):
    if index > 0:  # skip the header row
        # csv values are strings, so convert before adding 100 to each line of the 2nd column
        file_data_in_list[index][1] = str(int(file_data_in_list[index][1]) + 100)
# Now you can use file_data_in_list; it doesn't require extra variables and the replacement is in place.
It is better to use a column-based data structure for these operations.
Here I have used pandas:
import pandas as pd
df = pd.read_csv('C:/Users/r00t/Desktop/test/sample.txt')
# df1 = df+100
edit-1
df['Col2'] = df['Col2'] + 100
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.add.html
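For reference, the add() method from the linked docs does the same thing on a column; a one-line sketch:
# Equivalent to df['Col2'] + 100, using the add() method
df['Col2'] = df['Col2'].add(100)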
Here is a suggestion on how to do it: use pandas, which is very handy for dealing with data.
import pandas as pd
df = pd.read_csv("sample.txt")
print(df)
# I am basically getting all the rows of column index 1 (which is Col2)
df.iloc[:, 1] = df.iloc[:, 1] + 100
print(df)
# I could also use the column name
df["Col3"] = df["Col3"] + 1

Comparing 2 lists and returning mismatches

I'm struggling with 2 CSV files which I have imported.
The CSV files look like this:
csv1
planet,diameter,discovered,color
sceptri,33.41685587,28-11-1611 05:15, black
...
csv2
planet,diameter,discovered,color
sceptri,33.41685587,28-11-1611 05:15, blue
...
In both CSV files there are the same planets, but in a different order and sometimes with different values (a mismatch).
The data for each planet (diameter, discovered and color) has been entered independently. I want to cross-check the two sheets and find all the fields that are mismatched, then generate a new file that contains one line per error with a description of the error.
for example:
sceptri: mismatch (black/blue)
Here is my code so far:
import csv

with open('planets1.csv') as csvfile:
    a = csv.reader(csvfile, delimiter=',')
    data_a = list(a)
    for row in a:
        print(row)

with open('planets2.csv') as csvfile:
    b = csv.reader(csvfile, delimiter=',')
    data_b = list(b)
    for row in b:
        print(row)

print(data_a)
print(data_b)
c = [data_a]
d = [data_b]
thank you in advance for your help!
Assuming the names of the planets are correct in both files, here is my proposal:
# Working with lists of lists, which could come from reading the csv files:
csv1 = [["sceptri", 33.41685587, "28-11-1611 05:15", "black"],
        ["foo", 35.41685587, "29-11-1611 05:15", "black"],
        ["bar", 38.7, "29-11-1611 05:15", "black"]]
csv2 = [["foo", 35.41685587, "29-11-1611 05:15", "black"],
        ["bar", 38.17, "29-11-1611 05:15", "black"],
        ["sceptri", 33.41685587, "28-11-1611 05:15", "blue"]]

# A list to contain the errors:
new_file = []
# A dict to check if a planet has already been processed:
a_dict = {}

# Let's read all planet data:
for planet in csv1 + csv2:
    # Check if planet is already a key in a_dict:
    if planet[0] in a_dict:
        # Yes, sir, need to check discrepancies.
        if a_dict[planet[0]] != planet[1:]:
            # We have some differences in some values.
            # Put both sets of values in python sets to get the differences:
            error = set(planet[1:]) ^ set(a_dict[planet[0]])
            # Append [planet_name, diff_param1, diff_param2] to new_file:
            new_file.append([planet[0]] + list(error))
    else:
        # The planet name becomes a dict key, the other params are the key's value:
        a_dict[planet[0]] = planet[1:]

print(new_file)
# [['bar', 38.17, 38.7], ['sceptri', 'black', 'blue']]
The list new_file may then be saved as a new file; see Writing a list to file.
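For example, a minimal sketch (the output filename and the message format, taken from the question's desired 'sceptri: mismatch (black/blue)' output, are my assumptions):
# Write one line per mismatch, e.g. "sceptri: mismatch (black/blue)"
with open("mismatches.txt", "w") as out:
    for planet, *diffs in new_file:
        out.write(f"{planet}: mismatch ({'/'.join(map(str, diffs))})\n")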
I'd suggest using Pandas for a task like this.
Firstly, you'll need to read the csv contents into dataframe objects. This can be done as follows:
import pandas as pd
# make a dataframe from each csv file
df1 = pd.read_csv('planets1.csv')
df2 = pd.read_csv('planets2.csv')
You may want to declare names for each column if your CSV file doesn't have them.
colnames = ['col1', 'col2', ..., 'coln']
df1 = pd.read_csv('planets1.csv', names=colnames, index_col=0)
df2 = pd.read_csv('planets2.csv', names=colnames, index_col=0)
# use index_col=0 if csv already has an index column
For the sake of reproducible code, I will define dataframe objects without a csv below:
import pandas as pd
# example column names
colnames = ['A','B','C']
# example dataframes
df1 = pd.DataFrame([[0,3,6], [4,5,6], [3,2,5]], columns=colnames)
df2 = pd.DataFrame([[1,3,1], [4,3,6], [3,6,5]], columns=colnames)
Note that df1 looks like this:
A B C
---------------
0 0 3 6
1 4 5 6
2 3 2 5
And df2 looks like this:
A B C
---------------
0 1 3 1
1 4 3 6
2 3 6 5
The following code compares the dataframes, concatenates the comparison into a new dataframe, and then saves the result to a CSV:
# define the condition you want to check for (i.e., mismatches)
mask = (df1 != df2)
# df1[mask], df2[mask] will replace matched values with NaN (Not a Number), and leave mismatches
# dropna(how='all') will remove rows filled entirely with NaNs
errors_1 = df1[mask].dropna(how='all')
errors_2 = df2[mask].dropna(how='all')
# add labels to column names
errors_1.columns += '_1' # for planets 1
errors_2.columns += '_2' # for planets 2
# you can now combine horizontally into one big dataframe
errors = pd.concat([errors_1,errors_2],axis=1)
# if you want, reorder the columns of `errors` so compared columns are next to each other
errors = errors.reindex(sorted(errors.columns), axis=1)
# if you don't like the clutter of NaN values, you can replace them with fillna()
errors = errors.fillna('_')
# save to a csv
errors.to_csv('mismatches.csv')
The final result looks something like this:
A_1 A_2 B_1 B_2 C_1 C_2
-----------------------------
0 0 1 _ _ 6 1
1 _ _ 5 3 _ _
2 _ _ 2 6 _ _
Hope this helps.
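One caveat, since the question says the planets appear in a different order in the two files: the element-wise != comparison assumes the rows are aligned, so you would likely want to index both frames by planet and sort them first. A minimal sketch, assuming the 'planet' column from the question:
# Align both frames on the planet name before comparing element-wise
df1 = pd.read_csv('planets1.csv').set_index('planet').sort_index()
df2 = pd.read_csv('planets2.csv').set_index('planet').sort_index()
mask = (df1 != df2)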
This kind of problem can be solved by sorting the rows from the csv files, and then comparing the corresponding rows to see if there are differences.
This approach uses a functional style to perform the comparisons and will compare any number of csv files.
It assumes that the csvs contain the same number of records, and that the columns are in the same order.
import contextlib
import csv


def compare_files(readers):
    colnames = [next(reader) for reader in readers][0]
    sorted_readers = [sorted(r) for r in readers]
    for gen in [compare_rows(colnames, rows) for rows in zip(*sorted_readers)]:
        yield from gen


def compare_rows(colnames, rows):
    col_iter = zip(*rows)
    # Be sure we're comparing the same planets.
    planets = set(next(col_iter))
    assert len(planets) == 1, planets
    planet = planets.pop()
    # Skip the planet column name: that column was already consumed above.
    for (colname, *vals) in zip(colnames[1:], col_iter):
        if len(set(*vals)) > 1:
            yield f"{planet} mismatch {colname} ({'/'.join(*vals)})"


def main(outfile, *infiles):
    with contextlib.ExitStack() as stack:
        csvs = [stack.enter_context(open(fname)) for fname in infiles]
        readers = [csv.reader(f) for f in csvs]
        with open(outfile, 'w') as out:
            for result in compare_files(readers):
                out.write(result + '\n')


if __name__ == "__main__":
    main('mismatches.txt', 'planets1.csv', 'planets2.csv')

Python 3 Count number of rows in a CSV

I'm having trouble getting the row count in a Python 3 environment after migrating from 2.7. After several attempts, the number of rows returned is one. How do I get around a DeprecationWarning: 'U' mode is deprecated in Python 3?
input_file = open("test.csv","rU")
reader_file = csv.reader(input_file)
value = len(list(reader_file))
In the case of using Python 3, I've tried the following approach, but I'm still stuck with a 1.
input_file = open("test.csv","rb")
reader_file = csv.reader(input_file)
value = len(list(reader_file))
If you are using pandas, you can do that easily without much code.
import pandas as pd
df = pd.read_csv('filename.csv')
## Fastest would be using length of index
print("Number of rows ", len(df.index))
## If you want the column and row count then
row_count, column_count = df.shape
print("Number of rows ", row_count)
print("Number of columns ", column_count)
input_file = open("test.csv","rb") #rb is a read-in-binary format and
#you can't count the number of row from binary format file
with open("text.csv",'r') as f:
file = f.readlines()
print(len(file))
# Data in my text file
# a
# b
# c
# d
# e
#The output of above code is
#5 means number of rows is 5
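As a side note (my addition, not from the answers above): in Python 3 the usual replacement for the old 'rU' mode is to open the file in text mode with newline='', which is what the csv module documentation recommends; a minimal sketch:
import csv

# Text mode with newline='' replaces the Python 2 'rU' universal-newlines mode
with open("test.csv", newline='') as f:
    value = sum(1 for _ in csv.reader(f))
print(value)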
