Python 3 Count number of rows in a CSV - python

I'm having trouble getting the row count in a Python 3 environment after migrating from 2.7. After several attempts, the number of rows returned is always 1. How do I get around the DeprecationWarning: 'U' mode is deprecated in Python 3?
input_file = open("test.csv","rU")
reader_file = csv.reader(input_file)
value = len(list(reader_file))
In Python 3 I've tried the following approach, but I'm still stuck with a count of 1.
input_file = open("test.csv","rb")
reader_file = csv.reader(input_file)
value = len(list(reader_file))

If you are using pandas, you can do this easily without much code.
import pandas as pd
df = pd.read_csv('filename.csv')
## Fastest would be using length of index
print("Number of rows ", len(df.index))
## If you want the column and row count then
row_count, column_count = df.shape
print("Number of rows ", row_count)
print("Number of columns ", column_count)

input_file = open("test.csv", "rb")  # "rb" opens the file in binary mode;
# csv.reader in Python 3 needs text, so you can't count rows from a binary-mode file.
with open("text.csv", 'r') as f:
    file = f.readlines()
print(len(file))
# Data in my text file:
# a
# b
# c
# d
# e
# The output of the above code is 5, i.e. the number of rows is 5.
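For completeness: in Python 3 the direct replacement for the old 'rU' mode is to open the file in text mode with newline='' (which is what the csv module documentation recommends) and count the rows from csv.reader. A minimal sketch, assuming test.csv is the file in question:
import csv

with open("test.csv", newline='') as input_file:
    reader_file = csv.reader(input_file)
    value = sum(1 for _ in reader_file)
print(value)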

Related

Reading in tab-delimited txt file in chunks?

I have a txt file, and here is a snippet of the first few lines:
C A10231 A1 171|171 HER
C B23098 A1 171|171 HEF
C A03295 A2 171|171 HAF
I want to create a running list of every time the third column reads something other than "A1", and also keep track of how many times "A1" appears. Is there a way to import this file into a pandas df without causing a memory error?
If not, how can I process the txt file using the following rules:
Keep a running count of every time the third column reads "A1"
If the third column is not "A1", append the value to a list.
Find the amount of rows in the txt file
I essentially want to create three outputs. One output is the count of "A1", another is a list of everything that isn't "A1" (non_A1 = ['A2','B3','B4','V6', ...]), and the last is the total number of rows.
All you need to do is process each line as you read it; no need to store anything more than your accumulated results and the current line in memory at any given time, and certainly no need to build a full dataframe from the contents of the file.
row_count = 0
a1_count = 0
non_a1 = []
with open("file.tsv") as f:
for line in f:
row = line.strip().split('\t')
row_count += 1
if row[2] == 'A1':
a1_count += 1
else:
non_a1.append(row[2])
Since you tagged your question with pandas, you can process the file in chunks:
import pandas as pd

count_A1 = 0
non_A1 = set()
num_rows = 0
for chunk in pd.read_csv('/home/damien/data.txt', sep='\t', usecols=[2], header=None, chunksize=1):
    count_A1 += chunk[2].eq('A1').sum()
    non_A1 |= set(chunk.loc[chunk[2].ne('A1'), 2].unique().tolist())
    num_rows += chunk.shape[0]
Output:
>>> count_A1
2
>>> list(non_A1)
['A2']
>>> num_rows
3
Using pandas for this trivial task is overkill
a1_count = 0
line_count = 0
others = []
with open('foo.tsv') as tsv:
    for line in tsv:
        if (ax := line.split()[2]) == 'A1':
            a1_count += 1
        else:
            others.append(ax)
        line_count += 1
In a similar vein to @Corralien's answer, but using the categorical dtype, which saves memory when a large amount of data falls into a limited number of categories:
import pandas as pd
# Create some test data
fname = "reading_tsv_in_chunks.tsv"
with open("reading_tsv_in_chunks.tsv", "w") as fid:
for i in range(1000):
fid.write("C\tA10231\tA1\t171|171\tHER\nC\tB23098\tA1\t171|171\tHEF\nC\tA03295\tA2\t171|171\tHAF\nC\tA02225\tA3\t171|171\tHAX\nC\tA012325\tA4\t171|171\tHAY\n")
# Read as categorical
df = pd.read_csv(fname, sep="\t", header=None, names=["category",], usecols=[2,], dtype="category")
print(f"Contents of df:\n{df.describe()}\n")
print(f"Memory usage of with categorical dtype:\n{df.memory_usage()}\n\n")
# Read as non-categorical
df2 = pd.read_csv(fname, sep="\t", header=None, names=["category",], usecols=[2,])
print(f"Contents of df2:\n{df2.describe()}\n")
print(f"Memory usage of WITHOUT categorical dtype:\n{df2.memory_usage()}\n\n")
# Process as necessary e.g.
a1_count = sum([ len(values) for category, values in df.groupby("category")["category"] if category=="A1"])
non_a1_count = sum([ len(values) for category, values in df.groupby("category")["category"] if category!="A1"])
print(f"A1 count: {a1_count}\n")
print(f"Non-A1 count: {non_a1_count}")

Calculate averages over subgroups of data in extremely large (100GB+) CSV file

I have a large semicolon-delimited text file that weighs in at a little over 100GB. It comprises ~18,000,000 rows of data and 772 columns.
The columns are: 'sc16' (int), 'cpid' (int), 'type' (str), 'pubyr' (int), and then 768 columns labeled 'dim_0', 'dim_1', ..., 'dim_767', which are all ints.
The file is already arranged/sorted by sc16 and pubyr so that each combination of sc16 + pubyr is grouped together in ascending order.
What I'm trying to do is get the average of each 'dim_' column for each unique combination of sc16 & pubyr, then output the row to a new dataframe and save the final result to a new text file.
The problem is that in my script below, the processing gradually gets slower and slower until it's just creeping along by row 5,000,000. I'm working on a machine with 96GB of RAM, and I'm not used to working with a file so large I can't simply load it into memory. This is my first attempt trying to work with something like itertools, so no doubt I'm being really inefficient. Any help you can provide would be much appreciated!
import itertools
import pandas as pd
# Step 1: create an empty dataframe to store the mean values
mean_df = pd.DataFrame(columns=['sc16', 'pubyr'] + [f"dim_{i}" for i in range(768)])
# Step 2: open the file and iterate through the rows
with open('C:\Python_scratch\scibert_embeddings_sorted.txt') as f:
    counter = 0
    total_lines = sum(1 for line in f)
    f.seek(0)
    for key, group in itertools.groupby(f, key=lambda x: (x.split(';')[0], x.split(';')[3])):  # group by the first (sc16) and fourth (pubyr) column
        sc16, pubyr = key
        rows = [row.strip().split(';') for row in group]
        columns = rows[0]
        rows = rows[1:]
        # Step 3: convert the group of rows to a dataframe
        group_df = pd.DataFrame(rows, columns=columns)
        # Step 4: calculate the mean for the group
        mean_row = {'sc16': sc16, 'pubyr': pubyr}
        for col in group_df.columns:
            if col.startswith('dim_'):
                mean_row[col] = group_df[col].astype(float).mean()
        # Step 5: append the mean row to the mean dataframe
        mean_df = pd.concat([mean_df, pd.DataFrame([mean_row])], ignore_index=True)
        counter += len(rows)
        print(f"{counter} of {total_lines}")
# Step 6: save the mean dataframe to a new file
mean_df.to_csv('C:\Python_scratch\scibert_embeddings_mean.txt', sep=';', index=False)
You might not want to use Pandas at all, since your data is already neatly pre-sorted and all.
Try something like this; it uses numpy to make dim-wise averaging fast, but is plain Python otherwise. It processes a 43,000-line example file I generated in about 7.6 seconds on my machine, and I don't see a reason why this should slow down over time. (If you know your file won't have a header line or empty lines, you could get rid of those checks.)
Your original code also spent extra time parsing the read lines over and over again; this version uses a generator that does that only once. (The main reason your version keeps slowing down is the repeated pd.concat onto an ever-growing mean_df, which copies all the accumulated rows on every iteration.)
import itertools
import operator
import numpy as np

def read_embeddings_file(filename):
    # Read the (pre-sorted) embeddings file,
    # yielding tuples of ((sc16, pubyr) and a list of dimensions).
    with open(filename) as in_file:
        for line in in_file:
            if not line or line.startswith("sc16"):  # Header or empty line
                continue
            line = line.split(";")
            sc16, cpid, type, pubyr, *dims = line
            # list(map(... is faster than the equivalent listcomp
            yield (sc16, pubyr), list(map(int, dims))

def main():
    output_name = "scibert_embeddings_mean.txt"
    input_name = "scibert_embeddings_sorted.txt"
    with open(output_name, "w") as out_f:
        print("sc16", "pubyr", *[f"dim_{i}" for i in range(768)], sep=";", file=out_f)
        counter = 0
        for group, group_contents in itertools.groupby(
            read_embeddings_file(input_name),
            key=operator.itemgetter(0),  # Group by (sc16, pubyr)
        ):
            dims = [d[1] for d in group_contents]
            # Calculate the mean of each dimension
            mean_dims = np.mean(np.array(dims).astype(float), axis=0)
            # Write group to output
            print(*group, *mean_dims, sep=";", file=out_f)
            # Print progress
            counter += len(dims)
            print(f"Processed: {counter}; group: {group}, entries in group: {len(dims)}")

if __name__ == "__main__":
    main()

How to parse the log data which is in form of nested [key=value] format using python pandas

I have huge sensor log data in [key=value] form, and I need to parse the data column-wise.
I found this code for my problem:
import pandas as pd

lines = []
with open('/path/to/test.txt', 'r') as infile:
    for line in infile:
        if "," not in line:
            continue
        else:
            lines.append(line.strip().split(","))

row_names = []
column_data = {}
max_length = max(*[len(line) for line in lines])

for line in lines:
    while(len(line) < max_length):
        line.append(f'{len(line)-1}=NaN')

for line in lines:
    row_names.append(" ".join(line[:2]))
    for info in line[2:]:
        (k,v) = info.split("=")
        if k in column_data:
            column_data[k].append(v)
        else:
            column_data[k] = [v]

df = pd.DataFrame(column_data)
df.index = row_names
print(df)
df.to_csv('/path/to/test.csv')
The above code is suitable when the data is in the form "Priority=0, X=776517049", but my data is something like [Priority=0][X=776517049] and there is no separator between two columns. How can I do this in Python? I'm sharing a link with the sample data (the raw data, and below it the expected parsed data, which I did manually): https://docs.google.com/spreadsheets/d/1EVTVL8RAkrSHhZO48xV1uEGqOzChQVf4xt7mHkTcqzs/edit?usp=sharing
I've downloaded it as CSV.
Since your file has multiple tables on one sheet, I've limited the read to 100 rows; you can remove that parameter.
raw = pd.read_csv(
    "logdata - Sheet1.csv",  # filename
    skiprows=1,              # skip the first row
    nrows=100,               # use 100 rows, remove in your example
    usecols=[0],             # only use the first column
    header=None              # your dataset has no column names
)
Then you can use a regex to extract the values:
df = raw[0].str.extract(r"\[Priority=(\d*)\] \[GPS element=\[X=(\d*)\] \[Y=(\d*)\] \[Speed=(\d*)\]")
and set column names:
df.columns = ["Priority", "X", "Y", "Speed"]
result:
Priority X Y Speed
0 0 776517049 128887449 4
1 0 776516816 128887733 0
2 0 776516816 128887733 0
3 0 776516833 128887166 0
4 0 776517200 128886133 0
5 0 776516883 128885933 8
.....................................
99 0 776494483 128908783 0
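If the set of keys is not fixed in advance, a more general (hedged) sketch is to pull every innermost [key=value] pair with str.extractall and pivot the keys into columns. This assumes the raw lines are in raw[0] as above, that no key repeats within a line, and that nested groups such as [GPS element=[X=...] ...] only contribute their innermost pairs:
pairs = raw[0].str.extractall(r"\[(?P<key>[^=\[\]]+)=(?P<value>[^\[\]]*)\]")
df = pairs.reset_index().pivot(index="level_0", columns="key", values="value")
# Extracted values are strings; convert individual columns as needed, e.g.
df["Priority"] = pd.to_numeric(df["Priority"])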

sort one giant string into 7 columns

I have a file which I read in as a string. In sublime the file looks like this:
Filename
Dataset
Level
Duration
Accuracy
Speed Ratio
Completed
file_001.mp3
datasetname_here
value
00:09:29
0.00%
7.36x
2019-07-18
file_002.mp3
datasetname_here
value
00:22:01
...etc.
in Bash:
['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', ...etc.
I want to split this into a 7-column csv. As you can see, the values repeat every 7 lines. I know I can use a for loop and modulus to read each line. I have done this successfully before.
How can I use pandas to read things into columns?
I don't know how to approach the Pandas library. I have looked at other examples and all seem to start with csv.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('file', help="this is the file you want to open")
args = parser.parse_args()
print("file name:", args.file)
with open(args.file, 'r') as word:
    print(word.readlines())  ### here is where I was making sure it read in properly
    ### here is where I will start to manipulate the data
First remove '\n':
raw_data = ['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', '0.01%\n', '7.39x\n', '2019-07-20\n']
raw_data = [string.replace('\n', '') for string in raw_data]
Then pack your data in 7-length arrays inside a big array:
data = [raw_data[x:x+7] for x in range(0, len(raw_data),7)]
Finally, read your data into a DataFrame; the first row contains the column names:
import pandas as pd

df = pd.DataFrame(data[1:], columns=data[0])
print(df.to_string())
Filename Dataset Level Duration Accuracy Speed Ratio Completed
0 file_001.mp3 datasetname_here value 00:09:29 0.00% 7.36x 2019-07-18
1 file_002.mp3 datasetname_here L1 00:20:01 0.01% 7.39x 2019-07-20
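If you would rather have pandas do the reading as well, a hedged sketch (assuming the file's line count is an exact multiple of 7, with the filename "data.txt" used here for illustration) is to read it as a single column and reshape:
import pandas as pd

lines = pd.read_csv("data.txt", header=None)[0]  # one field per line; "data.txt" is an assumed filename
table = lines.to_numpy().reshape(-1, 7)          # 7 fields per record
df = pd.DataFrame(table[1:], columns=table[0])   # the first group of 7 is the header
df.to_csv("output.csv", index=False)             # "output.csv" is an assumed output name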
Try this:
import pandas as pd

with open("data.txt") as f:
    list_str = f.readlines()
list_str = list(map(lambda s: s.strip(), list_str))  # Remove \n (wrap in list() so it can be sliced in Python 3)
n = 7
list_str = [list_str[k:k+n] for k in range(0, len(list_str), n)]
df = pd.DataFrame(list_str[1:])
df.columns = list_str[0]
df.to_csv("Data_generated.csv", index=False)
Pandas is not just a library for reading data into columns. It supports reading and writing many formats (comma-separated values being one of them) and is mainly used as a Python-based data analysis tool.
The best place to learn is the documentation, together with practice.
I think you don't have to use pandas or any other library. My approach:
data = []
row = []
with open(args.file, 'r') as file:
    for line in file:
        row.append(line)
        if len(row) == 7:
            data.append(row)
            row = []
How does it work?
The for loop reads the file line by line.
Add the line to row
When row's length is 7, it's completed and you can add the row to data
Create a new list for row
Repeat
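As a hedged continuation of this approach: the loop above only collects the rows, so to produce the 7-column CSV the question asks for you could write data out with csv.writer, stripping the trailing newlines (the first 7-line group is treated as the header row, and "output.csv" is an assumed name):
import csv

with open("output.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in data:
        writer.writerow([field.strip() for field in row])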

How can I combine two csv files into one, by adding one column to the end of the first one?

I simply need to add the column of the second CSV file to the first CSV file.
Example CSV file #1
Time Press RH Dewpt Alt
Value Value Value Value Value
For N number of rows.
Example CSV file #2
SmoothedTemperature
Value
I simply want to make it
Time Press RH Dewpt Alt SmoothedTemperature
Value Value Value Value Value Value
Also, one file has headers and the other does not.
Here is sample code of what I have so far; however, the output is the final row of file #1 repeated, with the full data set of file #2 next to it.
## specifying what they want to open
File = open(askopenfilename(), 'r')
## reading in the other file
Averaged = open('Moving_Average_Adjustment.csv', 'r')
## opening the new file via raw input to write to
filename = raw_input("Enter desired filename, EX: YYYYMMDD_SoundingNumber_Time.csv; must end in csv")
New_File = open(filename, 'wb')

R = csv.reader(File, delimiter=',')
## i feel the issue is here in my loop, i don't know how to print the first columns
## then also print the last column from the other CSV file on the end to make it mesh well
Write_New_File = csv.writer(New_File)
data = ["Time,Press,Dewpt,RH,Alt,AveragedTemp"]
Write_New_File.writerow(data)

for i, line in enumerate(R):
    if i <= (header_count + MovingAvg/2):
        continue
    Time,Press,Temp,Dewpt,RH,Ucmp,Vcmp,spd,Dir,Wcmp,Lon,Lat,Ele,Azi,Alt,Qp,Qt,Qrh,Qu,Qv,QdZ = line

for i, line1 in enumerate(Averaged):
    if i == 1:
        continue
    SmoothedTemperature = line1
    Calculated_Data = [Time,Press,Dewpt,RH,Alt,SmoothedTemperature]
    Write_New_File.writerow(Calculated_Data)
If you want to go down this path, pandas makes csv manipulation very easy. Say your first two sample tables are in files named test1.csv and test2.csv:
>>> import pandas as pd
>>> test1 = pd.read_csv("test1.csv")
>>> test2 = pd.read_csv("test2.csv")
>>> test3 = pd.concat([test1, test2], axis=1)
>>> test3
Time Press RH Dewpt Alt SmoothedTemperature
0 1 2 3 4 5 6
[1 rows x 6 columns]
This new table can be saved to a .csv file with the DataFrame method to_csv.
If, as you mention, one of the files has no headers, you can specify this when reading the file:
>>> test2 = pd.read_csv('test2.csv', header=None)
and then change the header row manually in pandas.
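Putting those two pieces together, a minimal sketch (assuming the headerless file's single column should be named SmoothedTemperature, and "combined.csv" as the output filename):
import pandas as pd

test1 = pd.read_csv("test1.csv")
test2 = pd.read_csv("test2.csv", header=None)
test2.columns = ["SmoothedTemperature"]    # set the header row manually
test3 = pd.concat([test1, test2], axis=1)
test3.to_csv("combined.csv", index=False)  # assumed output filename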
