I have data in 10 individual csv files. Each csv file has just one row of data entries (500000 data points, no headers, etc.). Three questions:
How can I transform the data to be one column with 500000 rows?
Is it better to import them into one numpy array (500000 x 10) to analyze them? If so, how can one do this?
Or is it better to import them into one DataFrame (500000 x 10) to analyze it?
Assume you have a list of file names called files. Then:
df = pd.concat([pd.read_csv(f, header=None) for f in files], ignore_index=True)
df is a 10 x 500000 DataFrame. Make it 500000 x 10 with df.T.
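If you decide a NumPy array suits your analysis better (question 2), a minimal sketch continuing from the df built above:
df_t = df.T               # 500000 x 10 DataFrame
arr = df_t.to_numpy()     # the same data as a 500000 x 10 NumPy array
print(arr.shape)          # (500000, 10)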
The answers to 2 and 3 depend on your task.
First, read all 10 csv files:
import os, csv, numpy
import pandas as pd

my_csvs = os.listdir('path to folder with 10 csvs')  # selects all files in the folder
os.chdir('path to folder with 10 csvs')
list_of_columns = []
for file in my_csvs:
    column = []
    with open(file, 'r', newline='') as f:
        reader = csv.reader(f)
        for row in reader:         # each file holds a single row of values
            column.extend(row)     # extend, so column is a flat list of values
    list_of_columns.append(column)
This is how you get a list of column lists. Next, transform them into a pandas DataFrame, a numpy array, or whatever you feel comfortable working with.
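A minimal sketch of that last step, assuming list_of_columns was built as above (the csv module returns strings, so the values are cast to float):
import numpy as np
import pandas as pd

# stack the 10 columns side by side: shape (500000, 10)
arr = np.array(list_of_columns, dtype=float).T

# or, if you prefer a labelled structure
df = pd.DataFrame(arr, columns=[f'file_{i}' for i in range(arr.shape[1])])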
I have a .csv file with over 50k rows. I would like to divide it into smaller chunks and save them as separate .csv files. I'm not sure if pandas is the best approach here (if not, I'm open to any suggestions).
My goal: read the file, identify the number of rows in the dataframe, divide the dataframe into chunks (3000 rows per file including the header row), and save them as separate .csv files.
My code so far:
import os
import pandas as pd
i = 0
while os.path.exists("output/path/chunk%s.csv" % i):
    i += 1
size = 3000
df = pd.read_csv('/input/path/input.csv')
list_of_dfs = [df.loc[i:i+size-1, :] for i in range(0, len(df), size)]
for x in list_of_dfs:
    x.to_csv('/output/path/chunk%s.csv' % i, index=False)
The above code didn't throw any error, but it created only one file ('chunk0.csv') with 1439 rows instead of 3000.
Could someone help me with this? Thanks in advance!
Use DataFrame.groupby, passing the integer division of the index values by size, then loop and write to files using f-strings. (In your original code every chunk is written to the same file, because i never changes inside the for loop, so each write overwrites the previous one and only the last, 1439-row chunk remains.)
size = 3000
df = pd.read_csv('/input/path/input.csv')
for i, g in df.groupby(df.index // size):
    g.to_csv(f'/output/path/chunk{i}.csv', index=False)
You may be interested in the chunksize parameter of pd.read_csv.
You can use it this way:
size = 3000
filename = '/input/path/input.csv'
for i, chunk in enumerate(pd.read_csv(filename, chunksize=size)):
    chunk.to_csv(f"output/path/chunk{i}.csv", index=False)
Hi everyone, I am currently working on data like the following:
[Example of original data file]
There are a total of 51 files, each with more than 800 repeating columns, e.g. (Time, ID, x1, x2, ID, x1, x2, ...); the columns are all unlabelled. Within a file, each row has a different number of columns, something like this: [Shape of one data file]
I need to merge all 51 files into one file, and then stack the columns vertically like this:
[Example of output file]
So for each timestamp, each student will have a specific row with their location x,y.
Can anyone please help me with this? Thanks.
I used the following code to merge CSV files with different numbers of columns, but the output file is twice the size of the originals (e.g. 100 MB vs 50 MB). My approach was to combine the files using the maximum number of columns and pad each row out to that width. However, this approach created a lot of missing values in the data, thus increasing the size of the output file.
import os
import glob
import pandas as pd
def concatenate(indir=r"C:\Test Files", outfile=r"F:\Research Assitant\PROJECT_Position Data\Test File\Concatenate.csv"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    for filename in fileList:
        ### Loop over each line
        with open(filename, 'r') as f:
            ### Skip the first four lines
            for _ in range(4):
                next(f)
            ### Get the number of columns in each remaining line
            col_count = [len(l.split(",")) for l in f.readlines()]
        ### Read the current csv file
        df = pd.read_csv(filename, header=None, delimiter=",", names=range(max(col_count)),
                         skiprows=4, keep_default_na=False, na_values=[""])
        ### Append to the list
        dfList.append(df)
    concatDf = pd.concat(dfList, axis=0)
    concatDf.to_csv(outfile, index=None)
Is there any way to reduce the size of the output files? Or a more efficient way to deal with heterogeneous CSV files in Python?
And how do I stack the columns vertically after merging all the CSV files?
import os
import pandas as pd

# working_folder and file_name are assumed to point at one of your 51 files
with open(os.path.join(working_folder, file_name)) as f:
    student_data = []
    for line in f:
        row = line.strip().split(",")
        number_of_results = round(len(row[1:]) / 4)  # not counting the time column, the data repeats every 4 columns
        time_column = row[0]
        results = row[1:]
        for i in range(number_of_results):
            data = [time_column] + results[i*4: (i+1)*4]
            student_data.append(data)

df = pd.DataFrame(student_data, columns=["Time", "ID", "Name", "x1", "x2"])
df
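To cover all 51 files, one approach (a sketch, assuming every csv in working_folder follows the same layout and that the 4-column grouping above is correct) is to repeat this per file and concatenate the results:
import glob
import os
import pandas as pd

def parse_file(path):
    student_data = []
    with open(path) as f:
        for line in f:
            row = line.strip().split(",")
            time_column = row[0]
            results = row[1:]
            for i in range(len(results) // 4):  # 4 columns per student block
                student_data.append([time_column] + results[i*4:(i+1)*4])
    return pd.DataFrame(student_data, columns=["Time", "ID", "Name", "x1", "x2"])

# working_folder is assumed; merge every csv in it into one vertically stacked frame
all_files = glob.glob(os.path.join(working_folder, "*.csv"))
merged = pd.concat([parse_file(p) for p in all_files], ignore_index=True)
merged.to_csv("merged_stacked.csv", index=False)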
I have a folder with more or less 10 JSON files, each between 500 and 1000 MB in size.
Each file contains about 1,000,000 lines like the following:
{
  "dateTime": "2019-01-10 01:01:000.0000",
  "cat": 2,
  "description": "This description",
  "mail": "mail#mail.com",
  "decision": [{"first": "01", "second": "02", "third": "03"}, {"first": "04", "second": "05", "third": "06"}],
  "Field001": "data001",
  "Field002": "data002",
  "Field003": "data003",
  ...
  "Field999": "data999"
}
My goal is to analyze it with pandas, so I would like to load the data coming from all the files into a DataFrame.
If I loop over all the files, Python crashes because I don't have enough free resources to manage the data.
For my purpose I only need a DataFrame with two columns, cat and dateTime, from all the files, which I suppose is lighter than a whole DataFrame with all the columns. I have tried to read only these two columns with the following snippet:
Note: at the moment I am working with only one file; once I have fast reader code I will loop over all the other files (A.json, B.json, ...).
import pandas as pd
import json
import os.path
from glob import glob

cols = ['cat', 'dateTime']
df = pd.DataFrame(columns=cols)

file_name = 'this_is_my_path/File_A.json'
with open(file_name, encoding='latin-1') as f:
    for line in f:
        data = json.loads(line)
        lst_dict = {'cat': data['cat'], 'dateTime': data['dateTime']}
        df = df.append(lst_dict, ignore_index=True)
The code works, but it is very, very slow: it takes more than one hour for one file, while reading the whole file and storing it into a DataFrame usually takes me 8-10 minutes.
Is there a way to read only two specific columns and append them to a DataFrame in a faster way?
I have tried to read the whole JSON file and store it into a DataFrame, then drop all columns but 'cat' and 'dateTime', but it seems to be too heavy for my MacBook.
I had the same problem. I found out that appending a dict to a DataFrame is very, very slow; extract the values into a list instead. In my case it took 14 s instead of 2 h.
import json
import pandas as pd

cols = ['cat', 'dateTime']
data = []
file_name = 'this_is_my_path/File_A.json'
with open(file_name, encoding='latin-1') as f:
    for line in f:
        doc = json.loads(line)
        lst = [doc['cat'], doc['dateTime']]
        data.append(lst)
df = pd.DataFrame(data=data, columns=cols)
Will this help?
Step 1.
Read your JSON file with pandas.read_json().
Step 2.
Then filter your 2 columns out of the DataFrame, for example as sketched below.
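A minimal sketch of those two steps, assuming the file is newline-delimited JSON (one object per line, as your own json.loads loop implies) and reading it in chunks so memory use stays bounded; the path and chunk size are placeholders:
import pandas as pd

file_name = 'this_is_my_path/File_A.json'  # placeholder path

pieces = []
# lines=True reads one JSON object per line; chunksize returns an iterator of DataFrames
for chunk in pd.read_json(file_name, lines=True, chunksize=100_000):
    pieces.append(chunk[['cat', 'dateTime']])

df = pd.concat(pieces, ignore_index=True)
print(df.shape)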
Let me know if you still face any issue.
Thanks
I have some csv files and I want to copy a specific column from all of them and save the columns side by side in a new csv file. But the following code adds them all into a single column.
Also, in total I have to go through almost 20M data points, so I don't want to store everything in a single dataframe and save it at the end.
Here is my code:
import os
import glob
import pandas as pd
k = glob.glob("*.csv")
colu = "Close"
file = "merged.csv"
temp_dirr = "./temp/"
if not os.path.exists(temp_dirr):
    os.makedirs(temp_dirr)
filename = temp_dirr + file
df = pd.read_csv(k[0])[colu].dropna()
df.to_csv(filename, header=False, index=False)
for i in k[1:]:
    df = pd.read_csv(i)[colu].dropna()
    df.to_csv(filename, mode="a", header=False, index=False)
And here is the output merged.csv file:
23.6
1065
23.45
1150
172.7
11098
11443.3
But I want the output file to be like this:
23.6 172.7
1065 11098
23.45 11443.3
1150
Here the folder has 2 csv files, and the two columns are the "Close" column of those 2 files. So how do I add them column-wise?
You can do it this way:
def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], axis=1)
fmask = '*.csv'
# column numbers are starting from 0, so 9th column has index 8
df = get_merged_csv(glob.glob(fmask), usecols=[8])
df.to_csv(filename,mode="a",header=False,index=False)
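If you would rather select the column by name than by position, the usecols parameter of pd.read_csv also accepts column labels. A small variation, assuming each file has a header row containing a 'Close' column:
import glob
import pandas as pd

def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], axis=1)

# pick the 'Close' column from every csv and lay them out side by side
df = get_merged_csv(glob.glob('*.csv'), usecols=['Close'])
df.to_csv('merged.csv', index=False)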
I'm not sure how to do this using Python, but in R it is very easy.
Merge all columns in File1 and Column12 in File2.
import pandas as pd

file1 = pd.read_table('C:\\Users\\your_path_here\\Book1.csv', delimiter=',', header=None)
file2 = pd.read_table('C:\\Users\\your_path_here\\Book2.csv', delimiter=',', header=None)

file2_short = file2.iloc[:, 12:13]
#print (file2_short)

frames = [file1, file2_short]
new = pd.concat(frames, axis=1)  # axis=1 places the columns side by side

new.to_csv('C:\\Users\\your_path_here\\newfile.csv')
I am dealing with a csv file that contains three columns and three rows of numeric data. The csv data file simply looks like the following:
Colum1,Colum2,Colum3
1,2,3
1,2,3
1,2,3
My question is how to write Python code that takes a single value from one of the columns and performs a specific operation. For example, let's say I want to take the first value in 'Colum1' and subtract it from the sum of all the values in that column.
Here is my attempt:
import csv
f = open('columns.csv')
rows = csv.DictReader(f)
value_of_single_row = 0.0
for i in rows:
    value_of_single_Row += float(i)  # trying to isolate a single value here!
print value_of_single_row - sum(float(r['Colum1']) for r in rows)
f.close()
Based on the code you provided, I suggest you take a look at the docs to see the preferred approach for reading through a csv file. Take a look here:
How to use CsvReader
With that being said, you can modify the beginning of your code slightly to this:
import csv
with open('data.csv', 'rb') as f:
    rows = csv.DictReader(f)
    for row in rows:
        # perform operation per row
From there you now have access to each row.
This should give you what you need to do proper row-by-row operations.
What I suggest you do is play around with printing out your rows to see what your data looks like. You will see that each row that gets output is a dictionary.
So as you go through each row, you can simply do something like this:
for row in rows:
    row['Colum1']  # or row.get('Colum1')
    # to do some math to add everything in Colum1
    s += float(row['Colum1'])
So all of that will look like this:
import csv
s = 0
with open('data.csv', 'rb') as f:
    rows = csv.DictReader(f)
    for row in rows:
        s += float(row['Colum1'])
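To finish the example from the question (subtract the first value of Colum1 from the column's sum), one possible sketch, written for Python 3 and assuming the same 'Colum1' header:
import csv

values = []
with open('data.csv', newline='') as f:
    rows = csv.DictReader(f)
    for row in rows:
        values.append(float(row['Colum1']))

result = sum(values) - values[0]  # subtract the first value from the column total
print(result)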
You can do pretty much all of this with pandas:
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
import sys
import os
Location = r'path/test.csv'
df = pd.read_csv(Location, names=['Colum1','Colum2','Colum3'])
df = df[1:] #Remove the headers since they're unnecessary
print df
df.loc[1, 'Colum1'] = int(df.loc[1, 'Colum1']) + 5  # modify a single cell directly with .loc
print df
You can write back to your csv using df.to_csv('File path', index=False, header=True). Having header=True will add the headers back in.
To do this more along the lines of what you have, you can do it like this:
import csv

Location = r'C:/Users/tnabrelsfo/Documents/Programs/Stack/test.csv'
data = []
with open(Location, 'r') as f:
    for line in f:
        data.append(line.replace('\n', '').replace(' ', '').split(','))

data = data[1:]
print data
data[1][1] = 5
print data
It will read in each row, cut out the column names, and then you can modify the values by index.
So here is my simple solution using the pandas library. Suppose we have a sample.csv file:
import pandas as pd
df = pd.read_csv('sample.csv') # df is now a DataFrame
df['Colum1'] = df['Colum1'] - df['Colum1'].sum()  # subtract the column's sum from every value in the column
print df
df.to_csv('sample.csv', index=False) # save dataframe back to csv file
You can also use the map function to apply an operation to one column, for example:
import pandas as pd
df = pd.read_csv('sample.csv')
col_sum = df['Colum1'].sum() # sum of the first column
df['Colum1'] = df['Colum1'].map(lambda x: x - col_sum)