Reading csv file formatted as dictionary into pandas - python

I have a csv file containing sensor data where one row is of the following format
1616580317.0733, {'Roll': 0.563820598084682, 'Pitch': 0.29817540218781163, 'Yaw': 60.18415650363684, 'gyroX': 0.006687641609460116, 'gyroY': -0.012394784949719908, 'gyroZ': -0.0027120113372802734, 'accX': -0.12778355181217196, 'accY': 0.24647256731987, 'accZ': 9.763526916503906}
where the first column is a timestamp and the remainder is a dictionary-like object containing various measured quantities.
I want to read this into a pandas DataFrame with the columns
["Timestamp","Roll","Pitch","Yaw","gyroX","gyroY","gyroZ","accX","accY","accZ"]. What would be an efficient way of doing this? The file is 600MB, so it's not a trivial number of lines that need to be parsed.

I'm not sure where you are getting the seconds column from.
The code below parses each row into a timestamp and a dict, then adds the timestamp to the dictionary that will eventually become a row in the dataframe.
import json
import pandas as pd

def read_file(filename):
    chunk_size = 20000
    entries = []
    chunks = []
    with open(filename, "r") as fh:
        for line in fh:
            # Split off the timestamp; the rest of the line is the dict literal.
            timestamp, data_dict = line.split(",", 1)
            # The dict uses single quotes, so swap them for double quotes before parsing as JSON.
            data_dict = json.loads(data_dict.replace("'", '"'))
            data_dict["timestamp"] = float(timestamp)
            entries.append(data_dict)
            if len(entries) == chunk_size:
                chunks.append(pd.DataFrame(entries))
                entries = []
    if entries:
        chunks.append(pd.DataFrame(entries))
    # DataFrame.append was removed in pandas 2.0, so build per-chunk frames and concatenate them.
    return pd.concat(chunks, ignore_index=True)

read_file("sample.txt")

I think you should convert your csv file to JSON format and then look at this page on how to transform a dictionary into a pandas DataFrame: https://www.delftstack.com/fr/howto/python-pandas/how-to-convert-python-dictionary-to-pandas-dataframe/
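If converting the whole file isn't convenient, the dict literals can also be parsed line by line and handed to pandas directly. A minimal sketch of that idea, assuming the file layout shown in the question and using ast.literal_eval instead of a JSON round-trip:
import ast
import pandas as pd

records = []
with open("sample.txt") as fh:
    for line in fh:
        timestamp, rest = line.split(",", 1)
        record = ast.literal_eval(rest.strip())   # safely parse the {'Roll': ..., ...} literal
        record["Timestamp"] = float(timestamp)
        records.append(record)

# A list of dicts converts directly into a DataFrame, one column per key.
df = pd.DataFrame.from_records(records)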

Related

Reading in tab-delimited txt file in chunks?

I have a txt file, and here is a snippet of the first few lines:
C A10231 A1 171|171 HER
C B23098 A1 171|171 HEF
C A03295 A2 171|171 HAF
I want to create a running list of every time the third column reads something other than "A1", and also keep track of how many times "A1" appears. Is there a way to import this file into a pandas df without causing a memory error?
If not, how can I process the txt file using the following rules:
Keep a running count of every time the third column reads "A1"
If the third column is not "A1", append the value to a list.
Find the number of rows in the txt file
I essentially want to create three outputs. One is the count of A1, another is a list of everything that isn't A1, e.g. non_A1 = ['A2','B3','B4','V6', ...], and the last is the total number of rows.
All you need to do is process each line as you read it; no need to store anything more than your accumulated results and the current line in memory at any given time, and certainly no need to build a full dataframe from the contents of the file.
row_count = 0
a1_count = 0
non_a1 = []

with open("file.tsv") as f:
    for line in f:
        row = line.strip().split('\t')
        row_count += 1
        if row[2] == 'A1':
            a1_count += 1
        else:
            non_a1.append(row[2])
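For the three sample rows shown in the question, this leaves:
print(row_count)  # 3
print(a1_count)   # 2
print(non_a1)     # ['A2']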
Since you tagged your question with pandas, you can use:
count_A1 = 0
non_A1 = set()
num_rows = 0

for chunk in pd.read_csv('/home/damien/data.txt', sep='\t', usecols=[2], header=None, chunksize=1):
    count_A1 += chunk[2].eq('A1').sum()
    non_A1 |= set(chunk.loc[chunk[2].ne('A1'), 2].unique().tolist())
    num_rows += chunk.shape[0]
Output:
>>> count_A1
2
>>> list(non_A1)
['A2']
>>> num_rows
3
Using pandas for this trivial task is overkill
a1_count = 0
line_count = 0
others = []

with open('foo.tsv') as tsv:
    for line in tsv:
        if (ax := line.split()[2]) == 'A1':
            a1_count += 1
        else:
            others.append(ax)
        line_count += 1
In a similar vein to @Corralien, but using the categorical dtype, which saves memory when a large amount of data falls into a limited number of categories:
import pandas as pd

# Create some test data
fname = "reading_tsv_in_chunks.tsv"
with open(fname, "w") as fid:
    for i in range(1000):
        fid.write("C\tA10231\tA1\t171|171\tHER\nC\tB23098\tA1\t171|171\tHEF\nC\tA03295\tA2\t171|171\tHAF\nC\tA02225\tA3\t171|171\tHAX\nC\tA012325\tA4\t171|171\tHAY\n")

# Read as categorical
df = pd.read_csv(fname, sep="\t", header=None, names=["category"], usecols=[2], dtype="category")
print(f"Contents of df:\n{df.describe()}\n")
print(f"Memory usage WITH categorical dtype:\n{df.memory_usage()}\n\n")

# Read as non-categorical
df2 = pd.read_csv(fname, sep="\t", header=None, names=["category"], usecols=[2])
print(f"Contents of df2:\n{df2.describe()}\n")
print(f"Memory usage WITHOUT categorical dtype:\n{df2.memory_usage()}\n\n")

# Process as necessary, e.g.
a1_count = sum(len(values) for category, values in df.groupby("category")["category"] if category == "A1")
non_a1_count = sum(len(values) for category, values in df.groupby("category")["category"] if category != "A1")
print(f"A1 count: {a1_count}\n")
print(f"Non-A1 count: {non_a1_count}")

Python: Summarize values from CSV

I have a CSV-file which looks like this:
a,date,b
,2020-10-26 09:06:07,
,2020-10-26 16:15:20,
,2020-10-27 08:04:54,
,2020-10-28 22:09:16,
My question is:
Can I summarize my CSV so that it looks like this? (in a new CSV):
date, count
2020-10-26,2
2020-10-27,1
2020-10-28,1
So that every row which has data from the same day is summarized.
This can be accomplished quite simply using the following logic, with either core Python or pandas, whichever suits you best.
Read the source CSV file.
Count the occurrences of each date.
Write the counts to a new CSV file.
Using only core Python
counts = {}

# Open the source CSV and extract only the dates.
with open('dates.csv') as f:
    dates = [i.strip().split(',')[1].split(' ')[0] for i in f][1:]

# Count date occurrences.
for i in dates:
    counts[i] = counts.get(i, 0) + 1

# Write the output to a new CSV file.
with open('dates_out.csv', 'w') as f:
    f.write('date,count\n')
    for k, v in counts.items():
        f.write(f'{k},{v}\n')
Using pandas
import pandas as pd

# Read the source CSV into a DataFrame.
df = pd.read_csv('dates.csv')

# Convert the `date` column to a `datetime` object and return the `date` part only.
df['date'] = pd.to_datetime(df['date']).dt.date

# Count occurrences and store the results to a new CSV file.
(df['date']
 .value_counts()
 .sort_index()
 .reset_index()
 .rename(columns={'index': 'date', 'date': 'count'})
 .to_csv('dates_out.csv', index=False))
Output
$ cat dates_out.csv
date,count
2020-10-26,2
2020-10-27,1
2020-10-28,1
Source input file
For completeness, here are the contents of my testing source file, dates.csv.
col1,date,col3
a,2020-10-26 09:06:07,b
a,2020-10-26 16:15:20,b
a,2020-10-27 08:04:54,b
a,2020-10-28 22:09:16,b
Something like the below ('zz.txt' is your data):
from collections import defaultdict

data = defaultdict(int)
with open('zz.txt') as f:
    lines = [line.strip() for line in f.readlines()][1:]

for line in lines:
    # Each data line starts with a comma, so the slice from 1 to the first space is the date part.
    data[line[1:line.find(' ')]] += 1

print(data)
output
defaultdict(<class 'int'>, {'2020-10-26': 2, '2020-10-27': 1, '2020-10-28': 1})
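The question asked for the summary in a new CSV; writing the counts out (a small extension, not part of the original answer) could look like:
with open('dates_out.csv', 'w') as out:
    out.write('date,count\n')
    for day, count in sorted(data.items()):
        out.write(f'{day},{count}\n')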

Speed up my data reading in python?

My current code looks like this:
import pandas as pd
import csv
import matplotlib.pyplot as plt
def data_reader(filename, rowname):
    with open(filename, newline='') as fp:
        yield from (row[1:] for row in csv.reader(fp, skipinitialspace=True)
                    if row[0] == rowname)
File = 'data.csv'
ASA = pd.DataFrame.from_records(data_reader(File, 'ASA'))
GDS = pd.DataFrame.from_records(data_reader(File, 'GDS'))
SCD = pd.DataFrame.from_records(data_reader(File, 'SCD'))
ASF = pd.DataFrame.from_records(data_reader(File, 'ASF'))
ADC = pd.DataFrame.from_records(data_reader(File, 'ADC'))
DFS = pd.DataFrame.from_records(data_reader(File, 'DFS'))
DCS = pd.DataFrame.from_records(data_reader(File, 'DCS'))
DFDS = pd.DataFrame.from_records(data_reader(File, 'DFDS'))
It is reading data that looks like this:
legend, useless data, useless data, DCS, useless data, sped, air, xds, sas, dac
legend, useless data, useless data, GDS, useless data, sped, air
Legend, useless data, useless data, ASA, useless data, sped, air, gnd
ASA, 231, 123, 12
GDS, 12, 1
DCS, 13, 12, 123, 12, 4
ASA, 123, 132, 12
and so on, for a couple of million rows...
I am trying to write an IF statement that looks something like this:
pd.DataFrame.from_records(data_reader(
if rowname = 'ASA'
ASA.append(row)
elif rowname = 'GDS'
GDS.append(row)
and so on. Would this be faster? Currently it takes about 1 minute to run my code and plot one graph. I am sure it will take much longer when I have about 10-15 plots to do. I have tried different ways of writing the if/elif statement but I am not having any luck.
Reading from disk is the bottleneck here, so we should try to avoid reading the file more than once. If you have enough memory to parse the entire CSV into a dict of lists, then you could use:
import csv
import collections
import pandas as pd

def data_reader(filename):
    dfs = collections.defaultdict(list)
    columns = dict()
    with open(filename, newline='') as fp:
        for row in csv.reader(fp, skipinitialspace=True):
            key = row[0].upper()
            if key == 'LEGEND':
                name = row[3]
                columns[name] = row
            else:
                dfs[key].append(row[1:])
    for key in dfs:
        num_cols = max(len(row) for row in dfs[key])
        dfs[key] = pd.DataFrame(dfs[key], columns=columns[key][-num_cols:])
    return dfs

filename = 'data.csv'
dfs = data_reader(filename)
for key in dfs:
    print(dfs[key])
The loop
for row in csv.reader(fp, skipinitialspace=True):
    key = row[0].upper()
    ...
    dfs[key].append(row[1:])
loads the CSV into a dict, dfs. The dict keys are strings like 'ASA',
'GDS' and 'DCS'. The dict values are lists of lists.
The other loop
for key in dfs:
    ...
    dfs[key] = pd.DataFrame(dfs[key], columns=columns[key][-num_cols:])
converts the lists of lists to DataFrames.
The if-statement:
if key == 'LEGEND':
    name = row[3]
    columns[name] = row
else:
    dfs[key].append(row[1:])
records the row in the columns dict if the row begins with LEGEND (with or without capitalization), or otherwise records the row in the dfs dict.
Later in the for-loop:
for key in dfs:
    num_cols = max(len(row) for row in dfs[key])
    dfs[key] = pd.DataFrame(dfs[key], columns=columns[key][-num_cols:])
The keys are strings such as 'ASA'. For each key, the number of columns is
obtained by finding the maximum length of the rows in dfs[key].
columns[key] returns the corresponding legend row for key.
columns[key][-num_cols:] returns the last num_cols values from that row.
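For example, for the ASA legend row in the sample data (a worked illustration, not part of the original answer):
legend_row = ['Legend', 'useless data', 'useless data', 'ASA', 'useless data', 'sped', 'air', 'gnd']
num_cols = 3                  # the longest ASA data row has 3 values after the key
legend_row[-num_cols:]        # ['sped', 'air', 'gnd'] -> the column names used for dfs['ASA']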
The result returned by data_reader is a dict of DataFrames:
In [211]: dfs['ASA']
Out[211]:
sped air gnd
0 231 123 12
1 123 132 12
In [212]: dfs['GDS']
Out[212]:
sped air
0 12 1
In [213]: dfs['DCS']
Out[213]:
sped air xds sas dac
0 13 12 123 12 4
You should be able to do something like this:
df = pd.read_csv('data.csv', header=None)
ASA = df.loc[df[0] == "ASA"]
# etc ...
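One caveat, since the sample rows have different numbers of fields: read_csv may need an explicit, wide-enough set of column names to parse them. A sketch, with 10 as an assumed upper bound on the field count:
import pandas as pd

# names=range(10) is an assumed upper bound on the number of fields per row.
df = pd.read_csv('data.csv', header=None, names=range(10))
ASA = df.loc[df[0] == "ASA"]
GDS = df.loc[df[0] == "GDS"]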

How to Perform Mathematical Operation on One Value of a CSV file?

I am dealing with a csv file that contains three columns and three rows containing numeric data. The csv data file simply looks like the following:
Colum1,Colum2,Colum3
1,2,3
1,2,3
1,2,3
My question is how to write Python code that takes a single value from one of the columns and performs a specific operation on it. For example, let's say I want to take the first value in 'Colum1' and subtract it from the sum of all the values in the column.
Here is my attempt:
import csv

f = open('columns.csv')
rows = csv.DictReader(f)
value_of_single_row = 0.0
for i in rows:
    value_of_single_row += float(i)  # trying to isolate a single value here!
print value_of_single_row - sum(float(r['Colum1']) for r in rows)
f.close()
Based on the code you provided, I suggest you take a look at the docs to see the preferred approach for reading through a csv file. Take a look here:
How to use CsvReader
With that being said, you can modify the beginning of your code slightly to this:
import csv

with open('data.csv', 'rb') as f:
    rows = csv.DictReader(f)
    for row in rows:
        # perform operation per row
From there you now have access to each row.
This should give you what you need to do proper row-by-row operations.
What I suggest you do is play around with printing out your rows to see what your data looks like. You will see that each row that comes out is a dictionary.
So as you go through each row, you can simply do something like this:
for row in rows:
    row['Colum1']  # or row.get('Colum1')
    # to do some math to add everything in Colum1
    s += float(row['Colum1'])
So all of that will look like this:
import csv

s = 0
with open('data.csv', 'rb') as f:
    rows = csv.DictReader(f)
    for row in rows:
        s += float(row['Colum1'])
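To get closer to what the question actually asks (subtracting the first value of Colum1 from the column's sum), the same loop can also remember the first value it sees. A sketch of that idea, written for Python 3 and not part of the original answer:
import csv

total = 0.0
first = None
with open('data.csv', newline='') as f:
    for row in csv.DictReader(f):
        value = float(row['Colum1'])
        if first is None:
            first = value            # remember the first value in Colum1
        total += value

print(total - first)                 # the first value subtracted from the column sum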
You can do pretty much all of this with pandas
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
import sys
import os

Location = r'path/test.csv'
df = pd.read_csv(Location, names=['Colum1', 'Colum2', 'Colum3'])
df = df[1:]  # Drop the original header row, which was read as data because names= was supplied
print df
df.loc[1, 'Colum1'] = int(df.loc[1, 'Colum1']) + 5  # .loc avoids the chained-assignment pitfall of df.xs(1)['Colum1'] = ...
print df
You can write back to your csv using df.to_csv('File path', index=False, header=True). Setting header=True will add the headers back in.
To do this more along the lines of what you have, you can do it like this:
import csv

Location = r'C:/Users/tnabrelsfo/Documents/Programs/Stack/test.csv'
data = []
with open(Location, 'r') as f:
    for line in f:
        data.append(line.replace('\n', '').replace(' ', '').split(','))
data = data[1:]
print data
data[1][1] = 5
print data
It will read in each row, cut out the column names, and then you can modify the values by index.
So here is my simple solution using the pandas library. Suppose we have a sample.csv file:
import pandas as pd
df = pd.read_csv('sample.csv') # df is now a DataFrame
df['Colum1'] = df['Colum1'] - df['Colum1'].sum() # here we replace the column by subtracting sum of value in the column
print df
df.to_csv('sample.csv', index=False) # save dataframe back to csv file
You can also use the map function to apply an operation to one column, for example:
import pandas as pd
df = pd.read_csv('sample.csv')
col_sum = df['Colum1'].sum() # sum of the first column
df['Colum1'] = df['Colum1'].map(lambda x: x - col_sum)

How to filter out data into unique pandas dataframes from a combined csv of multiple datatypes?

Sample csv
time,type,-1,
time,type,0,w
time,type,1,a,12,b,13,c,15,name,apple
time,type,5,r,2,s,43,t,45,u,67,style,blue,font,13
time,type,11,a,12,c,15
time,type,5,r,2,s,43,t,45,u,67,style,green,font,15
time,type,1,a,12,b,13,c,15,name,apple
time,type,11,a,12,c,15
time,type,5,r,2,s,43,t,45,u,67,style,green,font,15
time,type,1,a,12,b,13,c,15,name,apple
time,type,5,r,2,s,43,t,45,u,67,style,yellow,font,9
time,type,19,b,12
type,19,b,42
I would like to filter each of the following "type,1", "type,5", "type,11", "type,19" into a separate pandas DataFrame for further analysis. What's the best way to do it? [Also, I will be ignoring "type,0" and "type,-1".]
Sample Code
import pandas as pd
type1_header = ['type','a','b','c','name']
type5_header = ['type','r','s','t','u','style','font']
type11_header = ['type','a','c']
type19_header = ['type','b']
type1_data = pd.read_csv(file_path_to_csv, usecols=[2,4,6,8,10] , names=type1_header)
type5_data = pd.read_csv(file_path_to_csv, usecols=[2,4,6,8,10,12,14] , names=type5_header)
import pandas as pd

headers = {1: ['a', 'b', 'c', 'name'],
           5: ['r', 's', 't', 'u', 'style', 'font'],
           }

usecols = {1: [4, 6, 8, 10],
           5: [4, 6, 8, 10, 12, 14],
           }

frames = {}
for h in headers:
    frames[h] = pd.DataFrame(columns=headers[h])

count = 0
for line in open('irreg.csv'):
    row = line.strip().split(',')  # strip the trailing newline before splitting
    count += 1
    ID = int(row[2])
    row_subset = []
    if ID in frames:
        for col in usecols[ID]:
            row_subset.append(row[col])
        frames[ID].loc[len(frames[ID])] = row_subset
    else:
        print('WARNING: line %d: type %s not found' % (count, row[2]))
That said, how often do you need to do this, and how often does the data change? For a one-off it's probably easiest to split up the incoming csv file, e.g. by running
grep type,19 irreg.csv > 19.csv
at the command line, and then import each csv according to its headers and usecols.
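Once the file has been split that way, each piece can be read with its own column list; a sketch for the type-19 rows, reading the 19.csv produced by the grep above and then selecting and renaming the relevant columns:
import pandas as pd

# 19.csv is the output of `grep type,19 irreg.csv > 19.csv` above.
type19_data = pd.read_csv('19.csv', header=None)
type19_data = type19_data[[2, 4]].rename(columns={2: 'type', 4: 'b'})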
