New file from data from multiple files - python

I am a Python beginner and trying to solve this task:
I have 125 .csv files (48 rows and 5 columns each), and I am trying to make a new file that contains the first row and the last row (written as a single row) from every .csv file I have.

To get you started, here is how you can generate the list of files and open them with pandas. This builds a list of csv files from a directory, iterates over it, opens each file as a pandas DataFrame, and then collects the first and last rows of each one. I am not sure how you want to create one row out of two, though, so hopefully this is a starting point for you.
import os
import pandas as pd

# Get all csv files in the current directory, or pass a directory to listdir.
csv_files = [file for file in os.listdir(".") if file.endswith(".csv")]

# Load every file as a DataFrame.
dataframes = [pd.read_csv(path) for path in csv_files]

# Collect the first and last row from each DataFrame (loaded csv).
# Note: DataFrame.append was removed in pandas 2.x, so collect rows in a
# list and build the result in one go.
rows = []
for df in dataframes:
    rows.append(df.iloc[0])
    rows.append(df.iloc[-1])
result_df = pd.DataFrame(rows)

# Write to a csv file.
result_df.to_csv("resulting.csv")
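If "written in a single row" means putting the last row's values next to the first row's in one wide row per file, here is a minimal sketch; the "_first"/"_last" suffixes are my own assumption about the desired layout:

rows = []
for path in csv_files:
    df = pd.read_csv(path)
    # Suffix the column labels so the two rows can sit side by side.
    first = df.iloc[0].add_suffix("_first")
    last = df.iloc[-1].add_suffix("_last")
    rows.append(pd.concat([first, last]))
pd.DataFrame(rows).to_csv("resulting.csv", index=False)

This gives one row per input file with 10 columns (5 from the first row, 5 from the last).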

Related

Python script to read csv files every 15 minutes and discard the csv file that has been read and update dataframe every time a new csv file is read

I have a local directory which automatically receives multiple csv files every 15 minutes from a measured-values provider. I want to read these files into a pandas DataFrame, create a new column for each csv file (or add new rows if the column already exists), and then load the DataFrame into a database. I am using a for loop to read the files and merging on timestamp. How can I discard the csv files that have already been read, so that when the python script runs again after 15 minutes it reads only the new ones and updates the existing DataFrame instead of creating a new one from the current values? Also, the timestamps in my csv files overlap. How can I merge the csv files with the same timestamp and keep only the last value received for a particular timestamp? Right now, if I do a merge on timestamp, I get duplicate rows.
The more time passes, the longer the for loop takes to iterate over all the csv files and the longer the DataFrame takes to load, because the csv files keep accumulating; and if I delete the csv files manually, the older values are lost from the DataFrame.
import glob
import pandas as pd

extension = 'csv'
all_filenames = [i for i in glob.glob('*Wind*.{}'.format(extension))]
al = pd.read_csv('202216_14345.123_0000.Wind15min_202203161432.csv', sep=';')
al['#timestamp'] = pd.to_datetime(al['#timestamp'])
frames = []
for f in all_filenames:
    df = pd.read_csv(f, sep=';')
    df['#timestamp'] = pd.to_datetime(df['#timestamp'])
    #print(df.head(5))
    frames.append(df)
    #result = pd.concat(frames, axis=1, join='inner')
    result = pd.merge(al, df, how='outer')
    al = result
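One hedged sketch for the two follow-up questions (the "processed" folder name is my own choice, not from the thread): after reading each file, move it into a subfolder so the next run only sees new files, and deduplicate on timestamp keeping the last value received:

import glob
import os
import shutil
import pandas as pd

processed_dir = "processed"  # hypothetical folder name
os.makedirs(processed_dir, exist_ok=True)

frames = []
for f in glob.glob('*Wind*.csv'):
    df = pd.read_csv(f, sep=';')
    df['#timestamp'] = pd.to_datetime(df['#timestamp'])
    frames.append(df)
    # Move the file out of the watch folder so the next run skips it.
    shutil.move(f, os.path.join(processed_dir, f))

if frames:
    new_data = pd.concat(frames, ignore_index=True)
    # Keep only the last value received for each timestamp.
    new_data = (new_data.sort_values('#timestamp')
                        .drop_duplicates(subset='#timestamp', keep='last'))

You would then append new_data to your database table instead of rebuilding the whole DataFrame each run.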

How to read in excel files from a folder and join them into a single df?

First-time poster here! I have perused these forums for a while and I am taken aback by how supportive this community is.
My problem involves several Excel files with the same name, column headers, and data types that I am trying to read in with pandas. After reading them in, I want to compare the column 'Agreement Date' across all the DataFrames and create a yes/no column indicating whether they match. I then want to export the DataFrame.
I am still learning Python and Pandas so I am struggling with this task. This is my code so far:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# read .xlsx file into a list
allfiles = glob.glob(xlpath + "*.xls")
# for loop to read in all files
for excelfiles in allfiles:
    raw_excel = pd.read_excel(allfiles)
    # place all the pulled dataframe into a list
    list = [raw_excel]
From here, though, I am quite lost. How do I join all of my files together on my ID column and then compare the 'Agreement Date' column? Any help would be greatly appreciated!
THANKS!!
In your loop you need to hand the looped value, not the whole list, to read_excel.
You have to append to the list within the loop, otherwise only the last item will end up in it.
Do not overwrite Python builtins such as list, or you can run into some difficult-to-debug behavior.
Here's what I would change:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# get file name list of .xlsx files in the directory
allfiles = glob.glob(xlpath + "*.xls")
# for loop to read in all files & place each resulting dataframe in a list
dataframes_list = []
for file in allfiles:
    dataframes_list.append(pd.read_excel(file))
You can then concatenate the DataFrames into one (DataFrame.append was deprecated and later removed, so use pd.concat):
merged_df = pd.concat(dataframes_list, ignore_index=True)
Use ignore_index=True if the indexes overlap and cause problems. If they are already distinct and you want to keep them, set it to False.
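For the 'Agreement Date' comparison you mention, here is a minimal sketch, assuming each file shares an 'ID' key column to join on (the column names here are my guesses, adjust to your data):

# Hypothetical: flag "yes" when every file agrees on the date for a given ID.
same_date = merged_df.groupby("ID")["Agreement Date"].transform("nunique") == 1
merged_df["Dates Match"] = same_date.map({True: "yes", False: "no"})
merged_df.to_excel(xlpath + "combined.xlsx", index=False)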

Extracting a column from a collection of csv files and constructing a new table with said data

I'm a newbie when it comes to Python, with a bit more experience in MATLAB. I'm currently trying to write a script that loops through a folder to pick up all the .csv files, extracts column 14 from csv file 1 and adds it to column 1 of a new table, extracts column 14 from csv file 2 and adds it to column 2 of the new table, and so on, building up a table of column 14 from all the csv files in the folder. Ideally I'd like the headers of the new table to show the filename that each column 14 was extracted from.
I've accounted for Python being zero-indexed and double-checked that it reads the desired column, but as my code stands, I can only get it to print all the files' 14th columns in one array, and I'm not sure how to split it up to put it into a table. Perhaps via a DataFrame, although I'm not entirely sure how those work.
Any help would be greatly appreciated!
Code attached below:
import os
import sys
import csv

pathName = "D:/GLaDOS-CAMPUS/data/TestData-AB/"
numFiles = []
fileNames = os.listdir(pathName)
for fileName in fileNames:
    if fileName.endswith(".csv"):
        numFiles.append(fileName)
print(numFiles)

for i in numFiles:
    file = open(os.path.join(pathName, i), newline="")
    reader = csv.reader(file, delimiter=',')
    for column in reader:
        print(column[13])
Finding files
I'm not sure whether your way of finding files is right, since I don't have a folder of csv files to test against, but it is cleaner to use glob to get the list of files:
from glob import glob
files = glob("/Path/To/Files/*.csv")
This will return all csv files.
Reading CSV files
Now we need a way to read all the files and get the 14th column (index 13). I don't know if it is overkill, but I prefer to use pandas and numpy for this.
To read a single column of a csv file using pandas, one can use:
pd.read_csv(file, usecols=[COL])
Now we can loop over the files and get the 14th column of each:
columns = [pd.read_csv(file, usecols=[13]).values[:, 0] for file in files]
Notice we converted all values to numpy arrays.
Merging all columns
In columns, each extracted column is an element of a list, so technically these are rows, not columns.
Now we should get the transpose of the array so it will become columns:
pd.DataFrame(np.transpose(columns))
The code
The whole code would look like:
from glob import glob
import pandas as pd
import numpy as np
files = glob("/Path/To/Files/*.csv")
columns = [pd.read_csv(file, usecols=[13]).values[:, 0] for file in files]
print(pd.DataFrame(np.transpose(columns)))
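The question also asks for the source filename as each column header; a small extension of the above, assuming every file contributes the same number of rows:

import os
# Use each file's base name as its column header
# (assumes equal row counts across files).
named = pd.DataFrame({os.path.basename(f): col
                      for f, col in zip(files, columns)})
print(named)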

Removing entire row with specific values in cells of excel file

Suppose I have 10 Excel files in a directory. I want to iterate over them, remove the rows of each that meet certain conditions (e.g. a cell contains a value like null), save the updated file, and move the updated file into a new directory. I have to remove only the rows, not the columns.
How can I achieve this with Python?
Thanks in advance.
I would propose having a look at the pandas DataFrame; it makes importing from and exporting to Excel files easy.
In your code you would iterate over your files with a for loop, remove the desired rows from the DataFrames you read in, and export them to Excel files again.
I have written semi-pseudo code for you; hope this helps. Run this code from the folder containing your xlsx files.
import glob
import os
import pandas as pd
import shutil

# Create a new folder if it does not exist:
if not os.path.exists("New"):
    os.makedirs("New")

# Store all files in a list
filenames = glob.glob("*.xlsx")

# Iterate through your files
for file in filenames:
    # Create a dataframe from the file
    df = pd.read_excel(file)
    # Insert your conditions here:
    # ...
    # e.g. get a specific value
    # val = df.iloc[0, 1]
    # Drop the matching rows from your df, e.g.
    df = df.drop(df.index[0])
    # Write back to the excel file
    df.to_excel(file, index=None)
    # Move the updated file to the new folder
    shutil.move(file, "New/" + file)
    print(df)
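For the null-value case mentioned in the question specifically, one possible condition (my assumption of what "cell contains values like null" means) is:

# Drop every row that has at least one empty/NaN cell.
df = df.dropna(axis=0, how='any')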

Repeating same processes for multiple csv files

I am a beginner in Python. I have about 1000 CSV files (1.csv, 2.csv, ..., 1000.csv). Each CSV file has about 3,000,000,000 rows and 14 variables. I would like to first clean the data in each CSV file using the same process:
sum variable A and variable B,
count C sorted by date, and if the number of records in C for one day is greater than 50, drop it.
Next, save the cleaned data into a new CSV file. Finally, append all 1000 new CSV files into one CSV file.
I have some code as follows, but it imports all the CSV files first and then processes them, which is very inefficient. I would like to clean the data in each CSV file first and then append the new CSV files. Can anyone help me with this? Any help will be appreciated.
This is what I understand from your question: I read each file, add a new column for the summation, then sort the values and drop any row whose value of C is greater than 50, and save the update. Before you do this, copy your original files, or save the results under a different file name.
import glob
import os
import pandas as pd

path = "./data/"
all_files = glob.glob(os.path.join(path, "*.csv"))  # make list of paths
for file in all_files:
    # Get the file name without extension
    file_name = os.path.splitext(os.path.basename(file))[0]
    df = pd.read_csv(file)
    df['new_column'] = df['A'] + df['B']
    df = df.sort_values(by='C')
    df.drop(df.loc[df['C'] > 50].index, inplace=True)
    # Save under a new name so the original file is not overwritten
    df.to_csv(os.path.join(path, file_name + "_cleaned.csv"), index=False)
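The question also asks to append all 1000 cleaned files into one CSV. A hedged sketch (the "combined_cleaned.csv" name is my own) that streams each cleaned file straight into one combined CSV, so only one file is held in memory at a time:

import glob
import os
import pandas as pd

path = "./data/"
# Keep the combined file outside ./data/ so the glob below does not pick it up.
combined = "combined_cleaned.csv"
first = True
for file in glob.glob(os.path.join(path, "*.csv")):
    df = pd.read_csv(file)
    df['new_column'] = df['A'] + df['B']
    df.drop(df.loc[df['C'] > 50].index, inplace=True)
    # Write the header only for the first file, then append.
    df.to_csv(combined, mode='w' if first else 'a', header=first, index=False)
    first = False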
