Averaging out every four elements in a CSV file - python

I have a CSV file with, say, $n=100$ elements, so the file looks like an $n$-dimensional vector. The question is: how can I average every 4 elements and save the results in a new CSV file?
For example I generate a list of random numbers:
import random
import pandas as pd

my_random_list = []
for i in range(0, 9):
    n = random.randint(1, 100)
    my_random_list.append(n)

df = pd.DataFrame(my_random_list)
df.to_csv('my_csv.csv', index=False, header=None)
This is similar to my code. Now, I want to create a new CSV (because I already have the data in CSV form) where I average the first 4 elements and save the result, then the next 4, and so on. So I will end up with a CSV file with only 25 elements.

Use DataFrame.groupby with integer division of the index to form groups of 4 values, and aggregate with mean:
import numpy as np
import pandas as pd

np.random.seed(2021)
df = pd.DataFrame({'a': np.random.randint(1, 10, size=10)})
print(df)
   a
0  5
1  6
2  1
3  7
4  6
5  9
6  7
7  7
8  7
9  7
df1 = df.groupby(df.index // 4).mean()
print(df1)
      a
0  4.75
1  7.25
2  7.00
Detail:
print(df.index // 4)
Int64Index([0, 0, 0, 0, 1, 1, 1, 1, 2, 2], dtype='int64')
All together:
import pandas as pd

df = pd.read_csv(file, header=None)
df1 = df.groupby(df.index // 4).mean()
df1.to_csv('my_csv.csv', index=False, header=None)
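If the data is purely numeric and its length is an exact multiple of 4, a plain-NumPy sketch of the same idea (reshape into rows of four and take the row means; the filenames are placeholders):
import numpy as np

values = np.loadtxt('my_csv.csv')           # 1-D array of n values
means = values.reshape(-1, 4).mean(axis=1)  # one mean per group of 4
np.savetxt('new_csv.csv', means)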

import pandas as pd
import random
import csv

# FIRST PART -- GENERATES THE ORIGINAL CSV FILE
my_random_list = []
for i in range(0, 100):
    n = random.randint(1, 100)
    my_random_list.append(n)
df = pd.DataFrame(my_random_list)
df.to_csv('my_csv.csv', index=False, header=None)

# SECOND PART -- POPULATES A LIST WITH THE CONTENTS OF THE
# ORIGINAL CSV FILE
file_CSV = open('my_csv.csv')
data_CSV = csv.reader(file_CSV)
list_CSV = list(data_CSV)

# THIRD PART -- GENERATES A NEW LIST CONTAINING
# THE AVERAGE OF EVERY FOURTH ELEMENT
# AND ITS THREE PREDECESSORS
new_list = []
for i in range(0, len(list_CSV), 4):
    s = int(list_CSV[i + 0][0])
    s = s + int(list_CSV[i + 1][0])
    s = s + int(list_CSV[i + 2][0])
    s = s + int(list_CSV[i + 3][0])
    s = s / 4
    new_list.append(s)

# FOURTH PART -- GENERATES A NEW CSV
df = pd.DataFrame(new_list)
df.to_csv('new_csv.csv', index=False, header=None)
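For reference, the third part can also be written without index arithmetic by chunking the flat list of values; a minimal sketch (again assuming the number of values is a multiple of 4):
values = [int(row[0]) for row in list_CSV]
new_list = [sum(chunk) / 4 for chunk in zip(*[iter(values)] * 4)]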

Related

Find unique elements of column with chunksize pandas

Given a sample(!) data frame:
test =
time  clock
1     1
1     1
2     2
2     2
3     3
3     3
I was trying to do some operations with pandas chunksize:
for df in pd.read_csv("...path...", chunksize=10):
    time_spam = df.time.unique()
    detector_list = df.clock.unique()
But this only ever operates on one chunk at a time: with a chunksize of 10 it gives me just 10 rows, and each pass overwrites the previous results.
P.S. It is sample data
Please try:
for df in pd.read_csv("...path...", chunksize=10, iterator=True):
    time_spam = df.time.unique()
    detector_list = df.clock.unique()
You need to use the iterator flag as described here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking
Here's how you can create lists of unique elements while parsing the chunks:
# Initialize lists
time_spam = []
detector_list = []

# Cycle over each chunk
for df in pd.read_csv("...path...", chunksize=10):
    # Add elements if not already in the list
    time_spam += [t for t in df['time'].unique() if t not in time_spam]
    detector_list += [c for c in df['clock'].unique() if c not in detector_list]
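A set makes the membership test cheaper on large files; a minimal sketch of the same accumulation (the path is a placeholder):
time_spam = set()
detector_list = set()
for df in pd.read_csv("...path...", chunksize=10):
    time_spam.update(df['time'].unique())
    detector_list.update(df['clock'].unique())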
File test.csv:
col1,col2
1,1
1,2
1,3
1,4
2,1
2,2
2,3
2,4
Code:
import pandas as pd

col1, col2 = [], []
for df in pd.read_csv('test.csv', chunksize=3):
    col1.append(df.col1)
    col2.append(df.col2)
Results:
print(pd.concat(col1).unique())
[1 2]
print(pd.concat(col2).unique())
[1 2 3 4]

Use previous row value for calculating log

I have a DataFrame, as presented in the spreadsheet below; it has a column A.
https://docs.google.com/spreadsheets/d/1h3ED1FbkxQxyci0ETQio8V4cqaAOC7bIJ5NvVx41jA/edit?usp=sharing
I have been trying to create a new column like A_output, which uses the previous row value and the current row value for finding the natural log:
df.apply(custom_function, axis=1)  # with a custom function
But I am not sure how to access the previous row's value.
The only thing I have tried is converting the values into a list, performing my operation, and appending the result back to the dataframe, something like this:
output = []
previous_value = 100
for value in df['A'].values:
    output.append(np.log(value / previous_value))
    previous_value = value
df['A_output'] = output
This is going to be an extremely expensive operation. What's the best way to approach this problem?
Another way, with rolling():
import pandas as pd
import numpy as np

data = np.random.normal(loc=5., size=(6, 1))
df = pd.DataFrame(columns=['A'], data=data)
# raw=True passes each window as a plain NumPy array, so x[0] and x[1] index by position
df['output'] = df['A'].rolling(2).apply(lambda x: np.log(x[1] / x[0]), raw=True)
init_val = 3.
df.loc[0, 'output'] = np.log(df['A'][0] / init_val)  # <-- manually assign value for the first item
print(df)
#           A    output
# 0  7.257160  0.883376
# 1  4.579390 -0.460423
# 2  4.630148  0.011023
# 3  5.153198  0.107029
# 4  6.004917  0.152961
# 5  6.633857  0.099608
If you want to apply the same operation on multiple columns:
import pandas as pd
import numpy as np

data = np.random.normal(loc=5., size=(6, 2))
df = pd.DataFrame(columns=['A', 'B'], data=data)
df[['output_A', 'output_B']] = df.rolling(2).apply(lambda x: np.log(x[1] / x[0]), raw=True)
init_val = 3.
df.loc[0, 'output_A'] = np.log(df['A'][0] / init_val)
df.loc[0, 'output_B'] = np.log(df['B'][0] / init_val)
print(df)
#           A         B  output_A  output_B
# 0  7.289657  4.986245  0.887844  0.508071
# 1  5.690721  5.010605 -0.247620  0.004874
# 2  5.773812  5.129814  0.014495  0.023513
# 3  4.417981  6.395500 -0.267650  0.220525
# 4  4.923170  5.363723  0.108270 -0.175936
# 5  5.279008  5.327365  0.069786 -0.006802
We can use Series.shift, and then .loc to assign the base value to the first row.
Let's assume we have the following dataframe:
df = pd.DataFrame({'A': np.random.randint(1, 10, 5)})
print(df)
   A
0  8
1  3
2  3
3  1
4  5
df['A_output'] = np.log(df['A'] / df['A'].shift())
df.loc[0, 'A_output'] = np.log(df.loc[0, 'A'] / 100)
print(df)
   A  A_output
0  8 -2.525729
1  3 -0.980829
2  3  0.000000
3  1 -1.098612
4  5  1.609438
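Since log(a/b) = log(a) - log(b), the same column can also be computed as a log followed by diff; a minimal equivalent sketch (100 is the same base value as above):
df['A_output'] = np.log(df['A']).diff()
df.loc[0, 'A_output'] = np.log(df.loc[0, 'A'] / 100)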

Python: Pivot Table/group by specific conditions

I'm trying to change the structure of my data from a text file (.txt); the data looks like this:
:1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J
And I would like to transform it into this format (like a pivot table in Excel, where the column name is the character between the colons and each group always starts with :1:):
Group  :1:  :2:  :3:  :4:
1      A    B    C
2      D    E    F    G
3      H    I    J
Does anyone have any idea? Thanks in advance.
First create the DataFrame with read_csv and header=None, because there is no header in the file:
import pandas as pd
temp=u""":1:A
:2:B
:3:C
:1:D
:2:E
:3:F
:4:G
:1:H
:3:I
:4:J"""
from io import StringIO

#after testing replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), header=None)
print(df)
      0
0  :1:A
1  :2:B
2  :3:C
3  :1:D
4  :2:E
5  :3:F
6  :4:G
7  :1:H
8  :3:I
9  :4:J
Extract the original column with DataFrame.pop, strip the leading and trailing : with Series.str.strip, and split the values into 2 new columns with Series.str.split. Then create the groups by comparing with Series.eq (==) against the string '1' and taking the Series.cumsum, build a MultiIndex with DataFrame.set_index, and finally reshape with Series.unstack:
df[['a','b']] = df.pop(0).str.strip(':').str.split(':', expand=True)
df1 = df.set_index([df['a'].eq('1').cumsum(), 'a'])['b'].unstack(fill_value='')
print(df1)
a  1  2  3  4
a
1  A  B  C
2  D  E  F  G
3  H  I  J
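A pivot-style variant of the same idea, for readers who prefer pivot_table over unstack; a sketch reusing the a/b columns built above (the 'Group' name is just a label for the index):
df1 = df.pivot_table(index=df['a'].eq('1').cumsum().rename('Group'),
                     columns='a', values='b', aggfunc='first', fill_value='')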
Use:
# Reading the text file (assuming it is stored in CSV format; you can also use pd.read_fwf)
df = pd.read_csv('SO.csv', header=None)
# Splitting the data into two columns
ndf = df.iloc[:, 0].str.split(':', expand=True).iloc[:, 1:]
# Grouping and creating a dataframe, then dropping NaNs
res = ndf.groupby(1)[2].apply(pd.DataFrame).apply(lambda x: pd.Series(x.dropna().values))
# Post processing (optional)
res.columns = [':' + ndf[1].unique()[i] + ':' for i in range(ndf[1].nunique())]
res.index = range(1, res.shape[0] + 1)
res.index.name = 'Group'
res
       :1: :2: :3: :4:
Group
1       A   B   C
2       D   E   F   G
3       H   I   J
Another way to do this:
#read the file
with open("t.txt") as f:
    content = f.readlines()

#Create a dictionary and read each line from the file, keeping the column names (e.g. ':1:') as keys and the row values (e.g. 'A') as values in the dictionary.
my_dict = {}
for v in content:
    key = v.rstrip()[0:3]    # take the key ':1:'
    value = v.rstrip()[3]    # take the value 'A'
    my_dict.setdefault(key, []).append(value)

#convert the dictionary to a dataframe and transpose it
df = pd.DataFrame.from_dict(my_dict, orient='index').transpose()
df
The output will look like this:
  :1:   :2: :3:   :4:
0   A     B   C     G
1   D     E   F     J
2   H  None   I  None

Array not appending

I am trying to import data from a file and then add it to an array. I know that this is not the best way to add elements to a numpy array. Nevertheless, why is the data not appending? The last element of the CSV is 1.1, and that's all I get when I do print(dd).
import csv
import numpy as np

with open('C:\\Users\\jez40\\.PyCharmCE2018.2\\8_Data.csv', 'r') as data_file:
    data = csv.reader(data_file, delimiter=',')
    for i in data:
        t = []
        d = []
        dd = []
        t.append([float(i[0])])
        d.append([float(i[1])])
        dd.append([float(i[2])])

t = np.array(t)
d = np.array(d)
dd = np.array(dd)
print(dd)
The root of your problem lies in the fact that on every iteration of your loop you re-assign t, d and dd to empty lists []. If your end goal is to acquire numpy arrays for these variables, I would recommend using pd.read_csv() to convert your csv file to a dataframe. Take this sample csv:
t,d,dd
1,2,3
4,5,6
7,8,9
Using pd.read_csv():
df = pd.read_csv(r'C:\Users\jez40\.PyCharmCE2018.2\8_Data.csv')
Gives:
   t  d  dd
0  1  2   3
1  4  5   6
2  7  8   9
Then you can query your columns to return them as pd.Series():
t = df['t']
d = df['d']
dd = df['dd']
Or you can convert them to np.array():
t = np.array(df['t'])
d = np.array(df['d'])
dd = np.array(df['dd'])
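If you want to stay with the csv module instead, the fix implied above is simply to move the list initialisation out of the loop; a minimal sketch:
import csv
import numpy as np

t, d, dd = [], [], []
with open(r'C:\Users\jez40\.PyCharmCE2018.2\8_Data.csv', 'r') as data_file:
    for row in csv.reader(data_file, delimiter=','):
        t.append(float(row[0]))
        d.append(float(row[1]))
        dd.append(float(row[2]))
t, d, dd = np.array(t), np.array(d), np.array(dd)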

Add calculated column to pandas dataframe on the fly while iterating over the lines of a csv file?

I have a large space-separated input file input.csv, which I can't hold in memory:
## Header
# More header here
A B
1 2
3 4
If I use the iterator=True argument for pandas.read_csv, then it returns a TextFileReader / TextParser object. This allows filtering the file on the fly and only selecting rows for which column A is greater than 2.
But how do I add a third column to the dataframe on the fly without having to loop over all of the data once more?
Specifically I want column C to be equal to column A multiplied by the value in a dictionary d, which has the value of column B as its key; i.e. C = A*d[B].
Currently I have this code:
import pandas
d = {2: 2, 4: 3}
TextParser = pandas.read_csv('input.csv', sep=' ', iterator=True, comment='#')
df = pandas.concat([chunk[chunk['A'] > 2] for chunk in TextParser])
print(df)
Which prints this output:
   A  B
1  3  4
How do I get it to print this output (C = A*d[B]):
   A  B  C
1  3  4  9
You can use a generator to work on the chunks one at a time:
Code:
def on_the_fly(the_csv):
    d = {2: 2, 4: 3}
    chunked_csv = pd.read_csv(
        the_csv, sep=r'\s+', iterator=True, comment='#')
    for chunk in chunked_csv:
        rows_idx = chunk['A'] > 2
        chunk.loc[rows_idx, 'C'] = chunk[rows_idx].apply(
            lambda x: x.A * d[x.B], axis=1)
        yield chunk[rows_idx]
Test Code:
from io import StringIO
import pandas as pd

data = StringIO(u"""#
A B
1 2
3 4
4 4
""")

df = pd.concat([c for c in on_the_fly(data)])
print(df)
Results:
   A  B     C
1  3  4   9.0
2  4  4  12.0
