def get_df():
    df = pd.DataFrame()
    os.chdir("C:/Users/s/Desktop/P")
    for file in os.listdir():
        if file.endswith('.csv'):
            av_a = np.average(a, axis=0)
            np.savetxt('merged_average.csv', av_a, delimiter=',')
I've tried to save it, but each new file overwrites the output and deletes the previous results.
At the moment, your code is a bit hard to read, as you are declaring variables which are not used (df) and using variables which are not declared (a). In the future, try to give a minimal reproducible example of your problematic code.
I'll still try to give you an interpreted answer:
If you want to store multiple columns from different files next to each other, the job becomes simpler if you first collect all columns and then save them to the file in a single action.
Here is an interpretation of your code:
def get_df():
    # create an empty list to collect all results
    average_results = []
    os.chdir("C:/Users/s/Desktop/P")
    for file in os.listdir():
        if file.endswith('.csv'):
            a = something(file)  # unknown to me
            average_results.append(np.average(a, axis=0))
    # convert the results to a 2d numpy matrix,
    # optionally transpose it to get the desired data orientation
    data = np.array(average_results).transpose()
    # save the full dataset
    np.savetxt('merged_average.csv', data, delimiter=',')
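The collect-then-save idea can be made runnable under one assumption: that every CSV holds purely numeric, comma-separated data without a header row, so that np.loadtxt can stand in for the unknown something(file). The function name merge_column_averages and its two parameters are hypothetical and only serve as an illustration.

```python
import glob
import os

import numpy as np


def merge_column_averages(in_dir, out_path):
    """Average each CSV in in_dir column-wise and save the averages side by side."""
    average_results = []
    # sorted() keeps the column order stable between runs
    for path in sorted(glob.glob(os.path.join(in_dir, "*.csv"))):
        # Assumption: numeric, comma-separated data with no header row
        a = np.loadtxt(path, delimiter=",", ndmin=2)
        average_results.append(np.average(a, axis=0))
    # One output column per input file; this requires that every file
    # has the same number of columns
    data = np.array(average_results).transpose()
    np.savetxt(out_path, data, delimiter=",")
    return data
```

It is probably safest to write the output outside the input folder; otherwise a second run would pick up merged_average.csv as one of the inputs.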
I have a txt file with following format:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],"values":[["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Testcustomer",null,null,null,null,-196,196,-196,null,null],["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Testcustomer",null,null,null,null,null,null,null,null,null],["2017-10-06T08:50:25.349Z",null,null,2596,null,null,null,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,80700],["2017-10-06T08:50:25.35Z",null,null,null,null,null,null,null,null,null,1956,"41762721","Testkunde",null,null,null,null,null,null,null,null,null],["2017-10-06T09:20:05.742Z",null,null,null,null,null,67.98999,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,null]]}]}]}
...
So in the text file everything is saved in one line. CSV file is not available.
I would like to load it into a pandas DataFrame. When I use read_csv:
df = pd.read_csv('time-series-data.txt', sep = ",")
the output of print(df) is something like [0 rows x 3455.. columns]
So currently everything is read in as one line. However, I would like to have 22 columns (time, ActivePower0, CosPhi0, ...). I would appreciate any tips, thank you very much.
Is a pandas DataFrame even suitable for this? The text files are up to 2 GB in size.
Here's an example which can read the file you posted.
Here's the test file, named test.json:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],
"values":[
["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Test-customer",null,null,null,null,-196,196,-196,null,null],
["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Test-customer",null,null,null,null,null,null,null,null,null]]}]}]}
Here's the python code used to read it in:
import json
import pandas as pd

# Read test file.
# This reads the entire file into memory at once. If this is not
# possible for you, you may want to look into something like ijson:
# https://pypi.org/project/ijson/
with open("test.json", "rb") as f:
    data = json.load(f)

# Get the first element of results list, and first element of series list
# You may need a loop here, if your real data has more than one of these.
subset = data['results'][0]['series'][0]
values = subset['values']
columns = subset['columns']
df = pd.DataFrame(values, columns=columns)
print(df)
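If the real data contains more than one entry under results or series, the loop mentioned in the comment could look roughly like the sketch below. The function name series_to_frame and the extra series_name column are assumptions added for illustration; they do not come from the original data.

```python
import pandas as pd


def series_to_frame(data):
    """Build one DataFrame from every series in every result of the parsed JSON."""
    frames = []
    for result in data.get("results", []):
        for series in result.get("series", []):
            frame = pd.DataFrame(series["values"], columns=series["columns"])
            # Hypothetical extra column: keeps rows traceable after concatenation
            frame["series_name"] = series.get("name")
            frames.append(frame)
    if not frames:
        return pd.DataFrame()
    # Series with different column sets are aligned by name; gaps become NaN
    return pd.concat(frames, ignore_index=True)
```

Here data is the dictionary returned by json.load, so the call would be df = series_to_frame(data).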
From the 2-year data, find the top 10 readings/rows of AWND. Store the result in a .csv file and name it top10AWND.csv. The new file will have all columns from filteredData.csv, but only the rows with the top 10 AWND values.
Small portion of the filteredData.csv:
I am using Python 3.8 and Pandas.
I need to find the top 10 readings of AWND in my filteredData.csv file. Then, I need to store the results in a new file. The new file needs to have the columns STATION, NAME, DATE, Month, AWND, and SNOW for the top 10 readings.
I am not sure how to go about doing this. This is what I have so far, and it does not work. One error I run into is "TypeError: list indices must be integers or slices, not list" on the filtered_weather = line in the code.
import numpy as np
import pandas as pd
import re

for filename in ['filteredData.csv']:
    file = pd.read_csv(filename)
    all_loc = dict(file['AWND'].value_counts()).keys()
    most_loc = list(all_loc)[:10]
filtered_weather = ['filteredData.csv'][['STATION','NAME','DATE','Month','AWND','SNOW']] #Select the column names that you want
filtered_weather.to_csv('top10AWND.csv',index=False)
You can do something like this:
# This is not necessary unless you want to read several files
for filename in ['filteredData.csv']:
    file = pd.read_csv(filename)
    file = file.sort_values('AWND', ascending=False).head(10)

# If it's only one file you can do
#
# file = pd.read_csv(filename)
# file = file.sort_values('AWND', ascending=False).head(10)

# Since you want to keep all the columns, you can just write the dataframe to the file
file.to_csv('top10AWND.csv', index=False)
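As an alternative sketch, DataFrame.nlargest does the sort-and-head step in one call, and a column list can be applied afterwards if only some columns are wanted. The helper name top_n_rows and its parameters are hypothetical.

```python
import pandas as pd


def top_n_rows(df, column="AWND", n=10, keep_columns=None):
    """Return the n rows with the largest values in `column`, highest first."""
    top = df.nlargest(n, column)
    if keep_columns is not None:
        # Restrict the output to the requested columns, in the given order
        top = top[keep_columns]
    return top


# Possible usage, assuming filteredData.csv has these columns:
# weather = pd.read_csv('filteredData.csv')
# wanted = ['STATION', 'NAME', 'DATE', 'Month', 'AWND', 'SNOW']
# top_n_rows(weather, keep_columns=wanted).to_csv('top10AWND.csv', index=False)
```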
Hi everyone, I am currently working with data like the following:
[Image: Example of original data file]
There are a total of 51 files, each with more than 800 repeating columns, e.g. (Time, ID, x1, x2, ID, x1, x2, ...), and the columns are all unlabelled. Within a file, each row has a different number of columns, something like this:
[Image: Shape of one data file]
I need to merge all 51 files into one file and then stack the columns vertically like this:
[Image: Example of output file]
So for each timestamp, each student will have a specific row with their location x, y.
Can anyone please help me with this? Thanks.
I used the following code to merge CSV files with different numbers of columns, but the output file is twice the size of the originals (e.g. 100 MB vs. 50 MB). My approach was to find the maximum number of columns and pad every row to that width. However, this created a lot of missing values in the data, which increased the size of the output file.
import os
import glob
import pandas as pd

# Raw strings keep the backslashes in the Windows paths from being read as escapes
def concatenate(indir=r"C:\Test Files", outfile=r"F:\Research Assitant\PROJECT_Position Data\Test File\Concatenate.csv"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    for filename in fileList:
        ### Loop over each line
        with open(filename, 'r') as f:
            ### Skip first four lines
            for _ in range(4):
                next(f)
            ### Get the numbers of columns in each line
            col_count = [len(l.split(",")) for l in f.readlines()]
        ### Read the current csv file
        df = pd.read_csv(filename, header=None, delimiter=",", names=range(max(col_count)),
                         skiprows=4, keep_default_na=False, na_values=[""])
        ### Append to the list
        dfList.append(df)
    concatDf = pd.concat(dfList, axis=0)
    concatDf.to_csv(outfile, index=None)
Is there any way to reduce the size of the output file? Or is there a more efficient way to deal with heterogeneous CSV files in Python?
And how do I stack the columns vertically after merging all the CSV files?
# working_folder and file_name must be defined beforehand
with open(os.path.join(working_folder, file_name)) as f:
    student_data = []
    for line in f:
        row = line.strip().split(",")
        time_column = row[0]
        results = row[1:]
        # not counting the time column, the data repeats in groups of 4 values
        number_of_results = len(results) // 4
        for i in range(number_of_results):
            data = [time_column] + results[i*4: (i+1)*4]
            student_data.append(data)
df = pd.DataFrame(student_data, columns=["Time", "ID", "Name", "x1", "x2"])
print(df)
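To cover all files at once, the same reshaping can be wrapped in a function, as in the sketch below. It assumes the four header lines from the question and the group layout (ID, Name, x1, x2) used in the snippet above; the function name stack_student_rows and its parameters are hypothetical. Because every output row has the same five fields, no padding with missing values is needed, which should also keep the merged file smaller.

```python
import csv
import glob
import os

import pandas as pd


def stack_student_rows(folder, group_columns=("ID", "Name", "x1", "x2"), skip_lines=4):
    """Reshape ragged rows (Time, then repeated groups) into one row per group."""
    group_size = len(group_columns)
    stacked = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        with open(path, newline="") as f:
            for line_number, row in enumerate(csv.reader(f)):
                if line_number < skip_lines or not row:
                    continue  # skip header lines and blank lines
                time_value, results = row[0], row[1:]
                # Integer division ignores an incomplete trailing group
                for i in range(len(results) // group_size):
                    group = results[i * group_size:(i + 1) * group_size]
                    stacked.append([time_value] + group)
    # All values are still strings at this point; convert types as needed
    return pd.DataFrame(stacked, columns=["Time", *group_columns])
```

The result can then be written once with to_csv(outfile, index=False).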
I am dealing with a csv file that contains three columns and three rows containing numeric data. The csv data file simply looks like the following:
Colum1,Colum2,Colum3
1,2,3
1,2,3
1,2,3
My question is how to write Python code that takes a single value from one of the columns and performs a specific operation on it. For example, let's say I want to take the first value in 'Colum1' and subtract it from the sum of all the values in the column.
Here is my attempt:
import csv
f = open('columns.csv')
rows = csv.DictReader(f)
value_of_single_row = 0.0
for i in rows:
    value_of_single_Row += float(i) # trying to isolate a single value here!
print value_of_single_row - sum(float(r['Colum1']) for r in rows)
f.close()
Based on the code you provided, I suggest you take a look at the documentation to see the preferred approach for reading through a csv file. Take a look here:
How to use CsvReader
With that being said, you can modify the beginning of your code slightly to this:
import csv
# In Python 3, open the file in text mode with newline='' for the csv module
with open('data.csv', newline='') as f:
    rows = csv.DictReader(f)
    for row in rows:
        # perform operation per row
        pass
From there you now have access to each row.
This should give you what you need to do proper row-by-row operations.
What I suggest you do is play around with printing out your rows to see what your data looks like. You will see that each row being outputted is a dictionary.
So if you were going through each row, you can just simply do something like this:
for row in rows:
    row['Colum1']  # or row.get('Colum1')
    # to do some math to add everything in Colum1
    # (s must be initialised to 0 before the loop)
    s += float(row['Colum1'])
So all of that will look like this:
import csv
s = 0
with open('data.csv', newline='') as f:
    rows = csv.DictReader(f)
    for row in rows:
        s += float(row['Colum1'])
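For the operation actually asked about (the column sum minus the column's first value), one option is to read the column into a list first. A csv.DictReader can only be iterated once, which is why summing over rows again after a loop yields nothing. The function name sum_minus_first below is hypothetical, and the sketch assumes Python 3.

```python
import csv


def sum_minus_first(path, column="Colum1"):
    """Return the column total minus the column's first value."""
    with open(path, newline="") as f:
        # Materialise the column once, since the reader cannot be re-read
        values = [float(row[column]) for row in csv.DictReader(f)]
    if not values:
        raise ValueError("no data rows found")
    return sum(values) - values[0]
```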
You can do pretty much all of this with pandas:
import pandas as pd

Location = r'path/test.csv'
df = pd.read_csv(Location, names=['Colum1','Colum2','Colum3'])
df = df[1:]  # Remove the header row, which was read in as data because names= was given
print(df)
# Assign through .loc; chained indexing such as df.xs(1)['Colum1'] = ... may modify a copy
df.loc[1, 'Colum1'] = int(df.loc[1, 'Colum1']) + 5
print(df)
You can write back to your csv using df.to_csv('File path', index=False, header=True). Setting header=True will add the headers back in.
To do this more along the lines of what you already have, you can do it like this:
Location = r'C:/Users/tnabrelsfo/Documents/Programs/Stack/test.csv'
data = []
with open(Location, 'r') as f:
    for line in f:
        data.append(line.replace('\n','').replace(' ','').split(','))
data = data[1:]
print(data)
data[1][1] = 5
print(data)
It will read in each row, cut out the column names, and then you can modify the values by index.
Here is my simple solution using the pandas library. Suppose we have a sample.csv file:
import pandas as pd
df = pd.read_csv('sample.csv')  # df is now a DataFrame
df['Colum1'] = df['Colum1'] - df['Colum1'].sum()  # subtract the column sum from every value in the column
print(df)
df.to_csv('sample.csv', index=False)  # save dataframe back to csv file
You can also use the map function to apply an operation to one column, for example:
import pandas as pd
df = pd.read_csv('sample.csv')
col_sum = df['Colum1'].sum() # sum of the first column
df['Colum1'] = df['Colum1'].map(lambda x: x - col_sum)
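As a small sketch of how the two pandas snippets relate to the original question: subtracting the sum directly from the column gives the same values as the map version, and the sum minus the first value needs only positional access with iloc. Both function names are hypothetical.

```python
import pandas as pd


def subtract_column_sum(df, column="Colum1"):
    """Return the column with its own sum subtracted from every value."""
    return df[column] - df[column].sum()


def sum_minus_first_pandas(df, column="Colum1"):
    """Return the column total minus the value in the column's first row."""
    # iloc[0] selects by position, so it works regardless of the index labels
    return df[column].sum() - df[column].iloc[0]
```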