I am a beginner in Python. I have about 1000 CSV files (1.csv, 2.csv, ..., 1000.csv). Each CSV file has about 3,000,000,000 rows and 14 variables. I would like to clean the data in each CSV file first, using the same process for every file:
sum variable A and variable B,
count C by sorting on date; if the number of records in C for one day is greater than 50, drop those records.
Next, save the cleaned data into a new CSV file. Finally, append all 1000 new CSV files into one CSV file.
I have some code as follows, but it imports all the CSV files first and then cleans the data, which is very inefficient. I would like to clean the data in each CSV file first and then append the new CSV files. Can anyone help me with this? Any help will be appreciated.
This is what I understand from your question: I read each file and add a new column for the summation. Then I sort the values and drop any row where C is greater than 50. After that, I save the update. Before you do this, you should copy your original files, or save the results under different file names (the code below saves each cleaned file with a _clean.csv suffix).
import glob
import os
import pandas as pd

path = "./data/"
all_files = glob.glob(os.path.join(path, "*.csv"))  # make a list of file paths

for file in all_files:
    # Get the file name without the extension
    file_name = os.path.splitext(os.path.basename(file))[0]
    df = pd.read_csv(file)  # read from the full path, not the bare name
    df['new_column'] = df['A'] + df['B']
    df = df.sort_values(by='C')  # sort_values returns a new frame, so assign it back
    df.drop(df.loc[df['C'] > 50].index, inplace=True)
    # Save each cleaned file under a new name so the original is kept
    df.to_csv(os.path.join(path, file_name + "_clean.csv"), index=False)
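The last step in the question, appending all the cleaned files into one CSV, is not covered above. A minimal sketch, assuming the cleaned files were saved with the _clean.csv suffix used in the loop and that combined.csv is an acceptable output name:

import glob
import os
import pandas as pd

path = "./data/"
# Gather only the cleaned files written by the loop above
clean_files = glob.glob(os.path.join(path, "*_clean.csv"))

# Read and concatenate them once, then write the combined file
combined = pd.concat((pd.read_csv(f) for f in clean_files), ignore_index=True)
combined.to_csv(os.path.join(path, "combined.csv"), index=False)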
I'm a newbie when it comes to Python, with a bit more experience in MATLAB. I'm currently trying to write a script that loops through a folder to pick up all the .csv files, extracts column 14 from csv file 1 and adds it to column 1 of a new table, extracts column 14 from csv file 2 and adds it to column 2 of the new table, and so on, building up a table of column 14 from all the csv files in the folder. I'd ideally like the headers of the new table to show the filename that each column 14 was extracted from.
I've taken into account that Python is zero-indexed, so I've double-checked that it reads the desired column, but as my code stands I can only get it to print all the files' 14th columns in one array, and I'm not sure how to split it up to put it into a table. Perhaps via a dataframe, although I'm not entirely sure how they work.
Any help would be greatly appreciated!
Code attached below:
import os
import csv

pathName = "D:/GLaDOS-CAMPUS/data/TestData-AB/"

# Collect the names of all .csv files in the folder
csvFiles = []
for fileName in os.listdir(pathName):
    if fileName.endswith(".csv"):
        csvFiles.append(fileName)
print(csvFiles)

for name in csvFiles:
    # "rU" mode is deprecated; open with newline="" as the csv module expects
    with open(os.path.join(pathName, name), "r", newline="") as file:
        reader = csv.reader(file, delimiter=',')
        for row in reader:  # csv.reader yields rows, not columns
            print(row[13])  # 14th field of each row (zero-indexed)
Finding files
I'm not sure whether your way of finding files is right, since I don't have a folder of csv files to test with. But I can say it is much better to use glob to get the list of files:
from glob import glob
files = glob("/Path/To/Files/*.csv")
This will return all csv files.
Reading CSV files
Now we need to find a way to read all the files and get the 14th column (index 13). I don't know if it is overkill, but I prefer to use pandas and numpy for this.
To read a column of a csv file using pandas one can use:
pd.read_csv(file, usecols=[COL])
Now we can loop over the files and get the 14th column of each:
columns = [pd.read_csv(file, usecols=[13]).values[:, 0] for file in files]
Notice that .values converts each column to a numpy array.
Merging all columns
In columns, each file's extracted column is an element of a list, so technically they are rows, not columns.
Now we should take the transpose of the array so they become columns:
pd.DataFrame(np.transpose(columns))
The code
The whole code would look like:
from glob import glob
import pandas as pd
import numpy as np
files = glob("/Path/To/Files/*.csv")
columns = [pd.read_csv(file, usecols=[13]).values[:, 0] for file in files]
print(pd.DataFrame(np.transpose(columns)))
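The question also asked for the headers of the new table to show the filename each column came from. A sketch of one way to do that, assuming the files may have different lengths (so pd.concat is used instead of transposing a possibly ragged array):

from glob import glob
import os
import pandas as pd

files = glob("/Path/To/Files/*.csv")

# Read column 14 (index 13) of each file as a Series keyed by its file name
columns = {
    os.path.basename(f): pd.read_csv(f, usecols=[13]).iloc[:, 0]
    for f in files
}

# Align the columns side by side; shorter files are padded with NaN
table = pd.concat(columns, axis=1)
print(table)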
I have several large .text files that I want to consolidate into one .csv file. However, each of the files is too large to import into Excel on its own, let alone all together.
I want to use pandas to analyze the data, but don't know how to get the files all in one place.
How would I go about reading the data directly into Python, or into Excel as a .csv file?
The data in question is the 2019-2020 Contributions by individuals file on the FEC's website.
You can convert each of the files to csv and then concatenate them to form one final csv file:
import glob
import os

import pandas as pd

csv_path = 'pathtonewcsvfolder'  # use your path
text_path = "path/to/textfiles"

# Convert each text file to its own csv file
for x, filename in enumerate(os.listdir(text_path)):
    df = pd.read_fwf(os.path.join(text_path, filename))  # read from the full path
    df.to_csv(os.path.join(csv_path, 'log' + str(x) + '.csv'), index=False)

# Then concatenate all the csv files into one
all_csv_files = glob.iglob(os.path.join(csv_path, "*.csv"))
converted_df = pd.concat((pd.read_csv(f) for f in all_csv_files), ignore_index=True)
converted_df.to_csv('converted.csv', index=False)
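If the original text files are themselves too large to fit in memory, reading them in chunks avoids loading any file whole. A sketch under that assumption; the chunksize value and the output name converted.csv are illustrative:

import os

import pandas as pd

text_path = "path/to/textfiles"
out_file = "converted.csv"

first_chunk = True
for filename in os.listdir(text_path):
    # Stream each file in pieces instead of loading it whole
    for chunk in pd.read_fwf(os.path.join(text_path, filename), chunksize=100000):
        chunk.to_csv(out_file, mode="w" if first_chunk else "a",
                     header=first_chunk, index=False)
        first_chunk = False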
Suppose I have 10 Excel files in a directory, and I want to iterate over them, remove the rows of each that meet certain conditions (for example, a cell containing a null value), save the updated file, and move the updated file into a new directory. I have to remove only the rows, not the columns.
How can I achieve this with Python?
Thanks in advance.
I would propose having a look at the pandas DataFrame. With it you can easily import from and export to Excel files.
In your code you would iterate over your files with a for loop, remove the desired rows from the read-in DataFrames, and export them to Excel files again.
I have written some semi-pseudo code for you. Hope this helps. Store this code in the folder of your xlsx files.
import glob
import os
import shutil

import pandas as pd

# Create a new folder if it does not exist
if not os.path.exists("New"):
    os.makedirs("New")

# Store all files in a list
filenames = glob.glob("*.xlsx")

# Iterate through your files
for file in filenames:
    # Create a dataframe from each file
    df = pd.read_excel(file)

    # Insert your conditions here, e.g. get a specific value:
    # val = df.iloc[0, 1]

    # Drop the matching rows from your df; drop() and dropna() return a
    # new frame, so assign the result back. E.g. to drop rows with nulls:
    df = df.dropna()

    # Write back to the Excel file
    df.to_excel(file, index=None)

    # Move the updated file to the new folder
    shutil.move(file, os.path.join("New", file))
    print(df)
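If the condition is value-based rather than null-based, boolean indexing works the same way. A small illustration with a hypothetical frame and column names:

import pandas as pd

# Hypothetical frame standing in for one of the Excel files
df = pd.DataFrame({"status": ["ok", "invalid"], "amount": [10, -5]})

# Keep only rows where the hypothetical column "status" is not "invalid"
df = df[df["status"] != "invalid"]

# Drop rows where the hypothetical column "amount" is negative
df = df[df["amount"] >= 0]
print(df)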
I am pretty new to Python in general, but am trying to make a script that takes data from certain files in a folder and puts it into an Excel spreadsheet.
The code I have will find the file type that I want in my specified folder, and then make a list with the full file paths.
import os

file_paths = []
for folder, subs, files in os.walk('C://Users/Dir'):
    for filename in files:
        if filename.endswith(".log") or filename.endswith(".txt"):
            file_paths.append(os.path.abspath(os.path.join(folder, filename)))
It will also take a specific file path, pull data from the correct column, and put it into excel in the correct cells.
import pandas as pd

for i in range(len(file_paths)):
    fields = ['RDCR']
    data = pd.read_table(file_paths[i], sep=r"\s+", names=fields, usecols=[3])
Where I am having trouble is making the read_table iterate through my list of files and put the data into an excel sheet where every time it reads a new file it moves over one column in the spreadsheet.
Ideally, the for loop would see how long the file_paths list is, and use that as the range. It would then use the file_paths[i] to input the file names into the read_table one by one.
What happens is that it finds the length of file_paths, and instead of iterating through the files in it one by one, it just inputs the data from the last file on the list.
Any help would be much appreciated! Thank you!
Try concatenating all of the files at once and writing to Excel a single time.
from glob import glob

import pandas as pd

files = glob('C://Users/Dir/*.log') + glob('C://Users/Dir/*.txt')

def read_file(f):
    fields = ['RDCR']
    return pd.read_table(
        f, sep=r"\s+",
        names=fields, usecols=[3])

# Concatenate side by side (one column per file) and write once;
# to_excel returns None, so don't assign its result to df
df = pd.concat([read_file(f) for f in files], axis=1)
df.to_excel('out.xlsx', index=False)
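If you also want each column in the spreadsheet labeled with the file it came from, one way is to pass the paths as keys to pd.concat; a sketch under that assumption (the default index is kept because pandas cannot write MultiIndex columns with index=False):

from glob import glob
import os

import pandas as pd

files = glob('C://Users/Dir/*.log') + glob('C://Users/Dir/*.txt')

def read_file(f):
    fields = ['RDCR']
    return pd.read_table(f, sep=r"\s+", names=fields, usecols=[3])

# keys= adds a column level naming each column after its source file
df = pd.concat([read_file(f) for f in files], axis=1,
               keys=[os.path.basename(f) for f in files])
df.to_excel('out.xlsx')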
I am a Python beginner trying to solve this task:
I have multiple (125) .csv files (48 rows and 5 columns each), and I am trying to make a new file that will contain the first row and the last row (written as a single row) from every .csv file I have.
To get you started, here is how you can generate the list of files and open them using pandas. This generates a list of csv files from a directory, iterates over the list, and opens each as a pandas DataFrame. It then builds a new frame from the first and last rows of each csv file. I am not sure how you want to create one row out of two, though, so hopefully this is a starting point for you (one way to combine the two rows is sketched after the code).
import os

import pandas as pd

# Get all csv files in the current directory, or specify the directory in listdir
csv_files = [file for file in os.listdir(".") if file.endswith(".csv")]

# Load all the files as dataframes
dataframes = {}
for x in range(len(csv_files)):
    dataframes[x] = pd.read_csv(csv_files[x])

# Get the first and last row from each dataframe (loaded csv);
# DataFrame.append was removed in pandas 2.0, so collect and concat instead
rows = []
for item in dataframes:
    rows.append(dataframes[item].iloc[[0, -1]])
result_df = pd.concat(rows)

# Write to a csv file
result_df.to_csv("resulting.csv")