I have a CSV-file which looks like this:
a,date,b
,2020-10-26 09:06:07,
,2020-10-26 16:15:20,
,2020-10-27 08:04:54,
,2020-10-28 22:09:16,
My question is:
Can I summarize my CSV so that it looks like this? (in a new CSV):
date, count
2020-10-26,2
2020-10-27,1
2020-10-28,1
So that every row which has data from the same day is summarized.
This can be accomplished quite simply with the following logic, using either core Python or pandas, whichever suits you best.
Read the source CSV file.
Count the occurrences of each date.
Write the counts to a new CSV file.
Using only core Python
counts = {}

# Open source CSV and extract only dates.
with open('dates.csv') as f:
    dates = [i.strip().split(',')[1].split(' ')[0] for i in f][1:]

# Count date occurrences.
for i in dates:
    counts[i] = counts.get(i, 0) + 1

# Write the output to a new CSV file.
with open('dates_out.csv', 'w') as f:
    f.write('date,count\n')
    for k, v in counts.items():
        f.write(f'{k},{v}\n')
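As a small variation, the counting step can be delegated to collections.Counter, which does the same tallying as the dict bookkeeping above; a minimal sketch over the same dates.csv:

from collections import Counter

# Extract the dates exactly as above, then let Counter tally them.
with open('dates.csv') as f:
    dates = [i.strip().split(',')[1].split(' ')[0] for i in f][1:]

counts = Counter(dates)  # e.g. Counter({'2020-10-26': 2, '2020-10-27': 1, ...})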
Using pandas
import pandas as pd
# Read the source CSV into a DataFrame.
df = pd.read_csv('dates.csv')
# Convert the `date` column to a `datetime` object and return the `date` part only.
df['date'] = pd.to_datetime(df['date']).dt.date
# Count occurrences and store the results to a new CSV file.
(df['date']
    .value_counts()
    .sort_index()
    .reset_index()
    .rename(columns={'index': 'date', 'date': 'count'})
    .to_csv('dates_out.csv', index=False))
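A note on versions: with pandas 2.0+, value_counts().reset_index() already returns date and count columns, so the rename step above is version-sensitive. A version-stable alternative using groupby, sketched against the same dates.csv:

import pandas as pd

df = pd.read_csv('dates.csv')
df['date'] = pd.to_datetime(df['date']).dt.date

# size() counts the rows in each date group, giving the same date,count output.
(df.groupby('date')
   .size()
   .rename('count')
   .reset_index()
   .to_csv('dates_out.csv', index=False))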
Output
$ cat dates_out.csv
date,count
2020-10-26,2
2020-10-27,1
2020-10-28,1
Source input file
For completeness, here are the contents of my testing source file, dates.csv.
col1,date,col3
a,2020-10-26 09:06:07,b
a,2020-10-26 16:15:20,b
a,2020-10-27 08:04:54,b
a,2020-10-28 22:09:16,b
Something like the below ('zz.txt' is your data)
from collections import defaultdict

data = defaultdict(int)
with open('zz.txt') as f:
    lines = [line.strip() for line in f.readlines()][1:]
for line in lines:
    # Each stripped line looks like ',2020-10-26 09:06:07,', so the date
    # sits between index 1 and the first space.
    data[line[1:line.find(' ')]] += 1
print(data)
output
defaultdict(<class 'int'>, {'2020-10-26': 2, '2020-10-27': 1, '2020-10-28': 1})
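Since the question asks for a new CSV rather than a printout, a short follow-up sketch that writes the counts in the requested date,count layout ('zz_out.txt' is a hypothetical output name):

with open('zz_out.txt', 'w') as f:
    f.write('date,count\n')
    for date, count in sorted(data.items()):
        f.write(f'{date},{count}\n')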
Related
I have a large set of CSV files. Approx. 15 000 files. And would like to figure out how to join them together as one file for data processing.
Each file is in a simple pattern with a timestamp that corresponds to the period of time represented by the data in each CSV file.
Ex.
file1.csv
2021-07-23 08:00:00
Unit.Device.No03.ErrorCode;11122233
Unit.Device.No04.ErrorCode;0
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0
file2.csv
2021-07-23 08:15:00
Unit.Device.No03.ErrorCode;0
Unit.Device.No04.ErrorCode;44556666
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0
Each file starts with the timestamp. I would like to join all the files in a directory, and transpose the "Unit.Device" to columns. And then use the original header as a timestamp column. For each file add a new row with the corresponding "ErrorCode" to each column.
Like this:
Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode..
2021-07-23 08:00:00;11122233;0;0;0;0....
2021-07-23 08:15:00;0;44556666;0;0;0....
Any simple tools for this, or Python routines?
Thanks for the reply on my first question here!
I will also contribute a solution for this problem.
I did some reading up on Pandas after I found something similar to what I wanted to do. I found that the transpose method was very easy to use, and put together this snippet of Python code instead.
import os

import pandas as pd

folder = 'in'
frames = []
for filename in os.scandir(folder):
    if filename.is_file():
        print('Working on file', filename.path)
        df = pd.read_csv(filename.path, encoding='utf-16', sep=';', header=[0])
        # Transpose data with timestamp header to columns
        df_transposed = df.T
        frames.append(df_transposed)
# DataFrame.append was removed in pandas 2.0, so concatenate the frames instead.
df_out = pd.concat(frames)
df_out.to_csv('output.csv')
Try the following Pandas approach:
import glob

import pandas as pd

dfs = []
for csv_filename in glob.glob('./file*.csv'):
    print('Working on file', csv_filename)
    # Read the CSV file, assuming no header and two columns
    df = pd.read_csv(csv_filename, sep=';', names=[0, 1], header=None)
    # Transpose from the 2nd row onward (skipping the timestamp row)
    df_transposed = df[1:].T
    # Use the first transposed row (the device names) as the column names
    df_transposed.columns = df_transposed.iloc[0]
    # Copy the timestamp into the transposed dataframe as a datetime value
    df_transposed['Timestamp'] = pd.to_datetime(df.iloc[0, 0])
    # Remove the first row (containing the names)
    df_transposed = df_transposed[1:]
    dfs.append(df_transposed)

# Concatenate all dataframes together and sort by Timestamp
df_output = pd.concat(dfs).sort_values(by='Timestamp')
# Sort the header columns and output to a CSV file
df_output.reindex(sorted(df_output.columns), axis=1).to_csv('output.csv', index=None)
Alternatively, it could be done using standard Python:
from datetime import datetime
import csv
import glob

data = []
fieldnames = set()

for fn in glob.glob('file*.csv'):
    with open(fn) as f_input:
        csv_input = csv.reader(f_input, delimiter=';')
        timestamp = next(csv_input)[0]
        row = {'Timestamp': timestamp}
        for device, error_code in csv_input:
            row[device] = error_code
            fieldnames.add(device)
        data.append(row)

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=['Timestamp', *sorted(fieldnames)], delimiter=';')
    csv_output.writeheader()
    csv_output.writerows(sorted(data, key=lambda x: datetime.strptime(x['Timestamp'], '%Y-%m-%d %H:%M:%S')))
This gives output.csv as:
Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode;Unit.Device.No11.ErrorCode
2021-07-23 08:00:00;11122233;0;0;0
2021-07-23 08:15:00;0;44556666;0;0
How does this work
First iterate over all .csv files in a given folder.
For each file, open it using a csv.reader().
Read the header row as a special case, storing the value as a Timestamp entry in a dictionary row.
For each row, store additional key value entries in the row dictionary.
Keep a note of each device name using a set.
Append the complete row into a data list.
It is now possible to create an output.csv file. The full list of column names can be assigned as fieldnames and a csv.DictWriter() used.
Write the header.
Use writerows() to write all the data rows sorted by timestamp. To do this convert each row's Timestamp entry into a datetime value for sorting.
This approach will also work if the CSV files happen to have different types of devices e.g. Unit.Device.No42.ErrorCode.
I have a csv file containing sensor data where one row is of the following format
1616580317.0733, {'Roll': 0.563820598084682, 'Pitch': 0.29817540218781163, 'Yaw': 60.18415650363684, 'gyroX': 0.006687641609460116, 'gyroY': -0.012394784949719908, 'gyroZ': -0.0027120113372802734, 'accX': -0.12778355181217196, 'accY': 0.24647256731987, 'accZ': 9.763526916503906}
Where the first column is a timestamp and the remainder is a dictionary like object containing various measured quantities.
I want to read this into a pandas DataFrame with the columns
["Timestamp","Roll","Pitch","Yaw","gyroX","gyroY","gyroZ","accX","accY","accZ"]. What would be an efficient way of doing this? The file is 600MB, so it's not a trivial number of lines that need to be parsed.
I'm not sure where you are getting the seconds column from.
The code below parses each row into a timestamp and dict. Then adds the timestamp to the dictionary that will eventually become a row in the dataframe.
import json

import pandas as pd

def read_file(filename):
    chunk_size = 20000
    entries = []
    chunks = []
    with open(filename, "r") as fh:
        for line in fh:
            # Split on the first comma only: the timestamp, then the dict-like part.
            timestamp, data_dict = line.split(",", 1)
            # The dicts use single quotes; swap them for double quotes so the
            # text parses as valid JSON.
            data_dict = json.loads(data_dict.replace("'", '"'))
            data_dict["timestamp"] = float(timestamp)
            entries.append(data_dict)
            if len(entries) == chunk_size:
                chunks.append(pd.DataFrame(entries))
                entries = []
    if entries:
        chunks.append(pd.DataFrame(entries))
    # DataFrame.append was removed in pandas 2.0; concatenate the chunks instead.
    return pd.concat(chunks, ignore_index=True)

read_file("sample.txt")
I think you should convert your csv file to JSON format and then look at this guide on transforming a dictionary into a pandas dataframe: https://www.delftstack.com/fr/howto/python-pandas/how-to-convert-python-dictionary-to-pandas-dataframe/
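Along the same lines, a minimal sketch that parses each line's dict with ast.literal_eval (which accepts the single quotes as-is) and builds the DataFrame in one go; 'sample.txt' stands in for the real file:

import ast

import pandas as pd

rows = []
with open('sample.txt') as fh:
    for line in fh:
        timestamp, rest = line.split(',', 1)
        # literal_eval safely parses the single-quoted dict text.
        record = ast.literal_eval(rest.strip())
        record['Timestamp'] = float(timestamp)
        rows.append(record)

df = pd.DataFrame(rows)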
How do I write the dates of a dataframe to a file?
import csv
import pandas as pd

writeFile = open("dates.csv", "w+")
writer = csv.writer(writeFile)
dates = pd.DataFrame(pd.date_range(start='01-09-2019', end='30-09-2019'))
Convert2List = dates.values.tolist()
for row in Convert2List:
    writer.writerow(row)
writeFile.close()
My actual values are:
1.54699E+18
1.54708E+18
1.54716E+18
1.54725E+18
1.54734E+18
And the expected values should be:
01-09-2019
02-09-2019
03-09-2019
If you have a pandas dataframe, you can just use the pandas.DataFrame.to_csv method and set its parameters (see the documentation).
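For example, a sketch using to_csv's date_format parameter to render the dd-mm-yyyy strings the question expects (ISO start/end strings avoid the month-first parsing issue discussed below):

import pandas as pd

dates = pd.DataFrame(pd.date_range(start='2019-09-01', end='2019-09-30'))
# date_format controls how datetime values are written to the file.
dates.to_csv('dates.csv', index=False, header=False, date_format='%d-%m-%Y')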
Pandas has a write-to-file function built in. Try:
import pandas as pd
dates = pd.DataFrame(pd.date_range(start = '01-09-2019', end = '30-09-2019'))
#print(dates) # check here that the dates are parsed correctly.
dates.to_csv('dates.csv') # writes the dataframe directly to a file.
The dates.csv file gives me:
,0
0,2019-01-09
1,2019-01-10
2,2019-01-11
3,2019-01-12
...snippet...
262,2019-09-28
263,2019-09-29
264,2019-09-30
Note that pandas parses '01-09-2019' month-first by default, so the range above runs from 9 January to 30 September. Switching to an unambiguous date order gives the expected September range:
dates = pd.DataFrame(pd.date_range(start = '2019-09-01', end = '2019-09-30'))
Gives:
Rows 0 through 29: one entry for each of the 30 days of September.
Furthermore, to format the dates as dd-mm-yyyy strings:
dates[0] = pd.to_datetime(dates[0]).apply(lambda x:x.strftime('%d-%m-%Y'))
Gives you:
01-09-2019
02-09-2019
03-09-2019
...etc.
I have a 1 million line CSV file. I want to call a lookup function on each row's first column and append its result as a new column in the same CSV (if possible).
What I want is something like this:
for each row in dataframe:
    string = row[1]
    result = lookupFunction(string)
    row.append(result)
I know I could do it using Python's CSV library by opening my CSV, reading each row, doing my operation, and writing the results to a new CSV.
This is my code using Python's CSV library
with open(rawfile, 'r') as f:
    with open(newFile, 'a') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=' ')
        for line in f:
            # do operation
However I really want to do it with Pandas because it would be something new to me.
This is what my data looks like
77,#oshkosh # tannersville pa,,PA,US
82,#osithesakcom ca,,CA,US
88,#osp open records or,,OR,US
89,#ospbco tel ord in,,IN,US
98,#ospwmnwithn return in,,IN,US
99,#ospwmnwithn tel ord in,,IN,US
100,#osram sylvania inc ma,,MA,US
106,#osteria giotto montclair nj,,NJ,US
Any help and guidance will be appreciated. Thanks.
Here is a simple example of adding two columns into a new column from your csv file:
import pandas as pd
df = pd.read_csv("yourpath/yourfile.csv")
df['newcol'] = df['col1'] + df['col2']
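For the lookup use case in the question, the same pattern works with apply; a sketch with a hypothetical stand-in lookup and made-up column names for the headerless sample data:

import pandas as pd

def lookup_function(value):
    # Hypothetical stand-in for the real lookup.
    return value.upper()

# The sample rows have no header, so assign names when reading.
df = pd.read_csv('yourpath/yourfile.csv', header=None,
                 names=['id', 'name', 'blank', 'state', 'country'])

# Apply the lookup to one column and store the results as a new column.
df['result'] = df['name'].apply(lookup_function)

df.to_csv('yourpath/yourfile_out.csv', index=False)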
Create a df and csv:
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2], B=[3, 4]))
df.to_csv('test_add_column.csv')
Read the csv into dfromcsv:
dfromcsv = pd.read_csv('test_add_column.csv', index_col=0)
Create a new column:
dfromcsv['C'] = df['A'] * df['B']
dfromcsv
Write the csv:
dfromcsv.to_csv('test_add_column.csv')
Read it again:
dfromcsv2 = pd.read_csv('test_add_column.csv', index_col=0)
dfromcsv2
I am dealing with a csv file that contains three columns and three rows of numeric data. The csv data file simply looks like the following:
Colum1,Colum2,Colum3
1,2,3
1,2,3
1,2,3
My question is how to write Python code that takes a single value from one of the columns and performs a specific operation. For example, let's say I want to take the first value in 'Colum1' and subtract it from the sum of all the values in the column.
Here is my attempt:
import csv

f = open('columns.csv')
rows = csv.DictReader(f)
value_of_single_row = 0.0
for i in rows:
    value_of_single_Row += float(i) # trying to isolate a single value here!
print value_of_single_row - sum(float(r['Colum1']) for r in rows)
f.close()
Based on the code you provided, I suggest you take a look at the docs for the preferred approach to reading through a csv file:
How to use CsvReader
with that being said, you can modify the beginning of your code slightly to this:
import csv

with open('data.csv', 'rb') as f:
    rows = csv.DictReader(f)
    for row in rows:
        pass # perform operation per row
From there you now have access to each row.
This should give you what you need to do proper row-by-row operations.
What I suggest you do is play around with printing out your rows to see what your data looks like. You will see that each row being outputted is a dictionary.
So if you were going through each row, you can just simply do something like this:
for row in rows:
    row['Colum1'] # or row.get('Colum1')
    # to do some math to add everything in Colum1
    s += float(row['Colum1'])
So all of that will look like this:
import csv

s = 0
with open('data.csv', 'rb') as f:
    rows = csv.DictReader(f)
    for row in rows:
        s += float(row['Colum1'])
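To finish the specific calculation from the question (the first value in Colum1 minus the column's total), one small extension of the same pattern, sketched against the data.csv layout above:

import csv

values = []
with open('data.csv') as f:
    for row in csv.DictReader(f):
        values.append(float(row['Colum1']))

# The first value in Colum1 minus the sum of all values in the column.
print(values[0] - sum(values))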
You can do pretty much all of this with pandas
import pandas as pd

Location = r'path/test.csv'
df = pd.read_csv(Location, names=['Colum1', 'Colum2', 'Colum3'])
df = df[1:] # Drop the header row, which was read in as data because names= was supplied
print df
# Use .loc for the assignment; chained indexing via .xs may write to a copy.
df.loc[1, 'Colum1'] = int(df.loc[1, 'Colum1']) + 5
print df
You can write back to your csv using df.to_csv('File path', index=False, header=True). Having header=True will add the headers back in.
To do it more along the lines of what you have, you can do this:
import csv

Location = r'C:/Users/tnabrelsfo/Documents/Programs/Stack/test.csv'
data = []
with open(Location, 'r') as f:
    for line in f:
        data.append(line.replace('\n', '').replace(' ', '').split(','))
data = data[1:]
print data
data[1][1] = 5
print data
It will read in each row, cut out the column names, and then let you modify the values by index.
So here is my simple solution using the pandas library. Suppose we have a sample.csv file:
import pandas as pd
df = pd.read_csv('sample.csv') # df is now a DataFrame
df['Colum1'] = df['Colum1'] - df['Colum1'].sum() # replace each value in the column with the value minus the column sum
print df
df.to_csv('sample.csv', index=False) # save dataframe back to csv file
You can also use the map function to apply an operation to one column. For example:
import pandas as pd
df = pd.read_csv('sample.csv')
col_sum = df['Colum1'].sum() # sum of the first column
df['Colum1'] = df['Colum1'].map(lambda x: x - col_sum)