I have a folder with several csv files. Each file has several columns, but the ones I'm interested in are id, sequence (3 different sequences) and time. See an example of a csv file below. I have over 40 csv files; each file belongs to a different participant.
id    sequence    time
300   1           2
300   2           3
300   1           3
...
I need to calculate the average time for the 3 different sequences for each participant. The code I currently have combines the csv files into one dataframe, selects the columns I'm interested in (id, sequence and time), calculates the average for each person for the 3 conditions, pivots the table to wide format (the format I need) and exports this as a csv file. I am not sure if this is the best way to do it, but it works. However, for some sequences 'time' is 0, and I want to exclude those rows from the averages. How do I do this? Many thanks in advance.
import glob
import pandas as pd

filenames = sorted(glob.glob('times*.csv'))
df = pd.concat((pd.read_csv(filename) for filename in filenames))
df_new = df[["id", "sequence", "time"]]
# note the closing parenthesis after the groupby column list
df_new_ave = df_new.groupby(['id', 'sequence'])['time'].mean().reset_index(name='Avg_Score')
df_new_ave_wide = df_new_ave.pivot(index='id', columns='sequence', values='Avg_Score')
df_new_ave_wide.to_csv('final_ave.csv', encoding='utf-8', index=True)
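One way to leave out the zero times is to filter those rows before the groupby; a minimal sketch of that step, reusing the names above (the rest of the pipeline stays unchanged):

# drop rows where time is 0 so they don't enter the averages
df_nonzero = df_new[df_new["time"] != 0]
df_new_ave = df_nonzero.groupby(['id', 'sequence'])['time'].mean().reset_index(name='Avg_Score')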
I have multiple inventory tables like so:
line no    -1 qty     -2 qty
1          -          3
2          42.1 FT    -
3          5          -
4          -          10 FT
5          2          1
6          6.7        -
or
line no    qty
1          2
2          4.5 KG
3          5
4
5          13
6          AR
I want to create a logic check for the quantity column using Python. (The table may have more than one qty column and I need to be able to check all of them. In both examples, I have the tables formatted as dataframes.)
Acceptable criteria:
integer with or without "EA" (meaning each)
"AR" (as required)
integer or float with unit of measure
if multiple QTY columns, then "-" is also accepted (first table)
I want to return a list per page containing the line no. of each row where the quantity value is missing (line 4, second table) or does not meet the acceptance criteria (line 6, table 1). If the line passes the checks, then return True.
I have tried:
qty_col = [col for col in df.columns if 'qty' in col]
df['corr_qty'] = np.where(qty_col.isnull(), False, df['line_no'])
but qty_col here is just a list of column names, and this yields the following error:
AttributeError: 'list' object has no attribute 'isnull'
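For what it's worth, the error comes from calling .isnull() on a plain Python list of column names rather than on the dataframe itself. Below is a rough sketch of one way to apply the acceptance criteria per qty column; the check_page helper and the regular expression are my own assumptions, and the column names follow the tables above:

import re

# hypothetical pattern for the stated criteria: an integer optionally followed by "EA",
# "AR", or an integer/float followed by a unit of measure
ACCEPT = re.compile(r'^(\d+(\s*EA)?|AR|\d+(\.\d+)?\s+[A-Z]+)$')

def check_page(df):
    qty_cols = [col for col in df.columns if 'qty' in col]
    multi = len(qty_cols) > 1    # "-" is only acceptable when there are multiple qty columns
    bad_lines = []
    for _, row in df.iterrows():
        values = [str(row[c]).strip() for c in qty_cols]
        ok = all(bool(ACCEPT.match(v)) or (multi and v == '-') for v in values)
        if not ok:
            bad_lines.append(row['line no'])
    return bad_lines if bad_lines else True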
Intro and Suggestions:
Welcome to Stack Overflow. A general tip when asking questions on S.O.: include as much information as possible. In addition, always identify the libraries you want to use and the approach you would accept, since there can be multiple solutions to the same problem; it looks like you've done that.
Also, it is best to share all, or at least most, of your attempted solutions so others can follow your thought process and work out the best approach for a potential solution.
The Solution:
It wasn't clear if the solution you are looking for required that you read the PDF to create the dataframe or if converting the PDF to a CSV and processing the data using the CSV was sufficient. I took the latter approach.
import tabula as tb
import pandas as pd

# PDF file path
input_file_path = "/home/hackernumber7/Projects/python/resources/Pandas_Sample_Data.pdf"
# CSV file path
output_file_path = "/home/hackernumber7/Projects/python/resources/Pandas_Sample_Data.csv"

# Read the PDF directly (alternative approach, not used here)
# id = tb.read_pdf(input_file_path, pages='all')

# Convert the PDF to CSV
cv = tb.convert_into(input_file_path, output_file_path, "csv", pages="all")

# Read the initial data
id = pd.read_csv(output_file_path, delimiter=",")
# Print the initial data
print(id)

# Create the dataframe with the qty column
df = pd.DataFrame(id, columns=['qty'])
# Print the data as a DataFrame object; boolean values where the qty condition is met
print(df.notna())
I've been trying to create a script that loops through and merges the CSVs in a folder, calculates the averages of specific columns and exports the results to a single file. So far I've managed to create the logic for it, but I'm struggling with identifying each column in the resulting CSV: these columns should be named after the 3 files that were averaged.
I've listed the files in the current directory using glob; all the files are named with the pattern:
AA_XXXX-b.
where AA is a sample number, XXXX is a short sample description and b is the repetition (1-2 for duplicates, 1-3 for triplicates, etc.). I thought of using the list generated when listing the files and somehow merging all repetitions of a sample into a single item with a format like:
AA_XXXX_1-N,
where N is the number of repetitions, and storing the merged names in a list so I can use them to name the averaged columns in the final file, but I couldn't think of or find anything similar. I apologize if this question was already asked.
Edit:
Here's an example of what I'm trying to do:
This is what the data in the individual csvs looks like:
Filename: 01_NoCons-1
Wavelength (nm)    abs
901.5391           0.523718
902.8409           0.516127
905.4431           0.521074
908.0434           0.516442
909.3429           0.510993
Filename: 01_NoCons-2
Wavelength (nm)    abs
901.5391           0.523718
902.8409           0.516127
905.4431           0.521074
908.0434           0.516442
909.3429           0.510993
Filename: 01_NoCons-3
Wavelength (nm)    abs
901.5391           0.523718
902.8409           0.516127
905.4431           0.521074
908.0434           0.516442
909.3429           0.510993
And after concatenating and calculating the average of the 3 abs columns, the result is transferred to a new table that already contains the Wavelength column, like so:
Filename: Final table
Wavelength (nm)    01_NoCons_1-3
901.5391           0.523718
902.8409           0.516127
905.4431           0.521074
908.0434           0.516442
909.3429           0.510993
This process is repeated for every set of sample repetitions, and I'd like the resulting column name to identify which set it was generated from, such as 01_NoCons_1-3, which indicates that the column is the average of repetitions 1 to 3 of sample 01_NoCons.
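A rough sketch of one way to do this: group the files by their sample prefix, average the abs columns of each group, and name the resulting column after the sample and the number of repetitions. The file pattern and column names follow the examples above; the .csv extension, the output name and the assumption that all repetitions share the same wavelength grid are mine:

import glob
import os
import re
import pandas as pd

# collect files like "01_NoCons-1.csv" and group them by their "AA_XXXX" prefix
groups = {}
for path in sorted(glob.glob('*-*.csv')):
    name = os.path.splitext(os.path.basename(path))[0]
    match = re.match(r'(.+)-(\d+)$', name)        # split "01_NoCons-1" into prefix and repetition
    if match:
        groups.setdefault(match.group(1), []).append(path)

result = None
for prefix, paths in groups.items():
    dfs = [pd.read_csv(p) for p in paths]
    # average the abs column across all repetitions of this sample
    avg = pd.concat([d['abs'] for d in dfs], axis=1).mean(axis=1)
    col_name = f'{prefix}_1-{len(paths)}'          # e.g. "01_NoCons_1-3"
    if result is None:
        result = dfs[0][['Wavelength (nm)']].copy()
    result[col_name] = avg

result.to_csv('averages.csv', index=False)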
I have a use case where I need to compare a file from an S3 bucket with the output of a SQL query.
Below is how I am reading the s3 file.
if s3_data is None:
    s3_data = pd.read_feather(BytesIO(obj.get()['Body'].read()))
else:
    s3_data = s3_data.append(pd.read_feather(BytesIO(obj.get()['Body'].read())), ignore_index=True)
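As an aside, appending inside a loop copies the accumulated frame on every iteration; a minimal sketch of collecting the pieces and concatenating once (the surrounding loop over bucket objects is an assumption, only the read call is taken from the snippet above):

from io import BytesIO
import pandas as pd

frames = []
for obj in bucket.objects.all():      # hypothetical iteration over the S3 objects
    frames.append(pd.read_feather(BytesIO(obj.get()['Body'].read())))

s3_data = pd.concat(frames, ignore_index=True)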
Below is how I am reading from the database.
db_data=pandas.read_sql
Now I need to compare s3_data with db_data. Both of these dataframes are huge with as much as 2 million rows of data each.
The format of the data is somewhat like below.
name | Age | Gender
--------------------
Peter| 30 | Male
Tom | 24 | Male
Riya | 28 | Female
Now I need to validate whether exactly the same rows, with the same column data, exist in both the s3 file and the db.
I tried using dataframe.compare(), but the kind of results it gives is not what I am looking for. The position of a row in the dataframe is not relevant for me, so if 'Tom' appears in row 1 or 3 in one of the dataframes and at a different position in the other, it should still pass the equality validation as long as the record itself, with the same column values, is present in both. dataframe.compare() does not help me in this case.
The alternative approach I took was to use csv_diff. I merged all the columns into one in the following way and saved each dataframe as a csv locally, creating two csv files: one for the s3 data and one for the db data.
data
------------
Peter+30+Male
Tom+24+Male
Riya+28+Female
Then, using below code, I am comparing the files.
from csv_diff import load_csv, compare

s3_file = load_csv(open("s3.csv"), key="data")
db_file = load_csv(open("db.csv"), key="data")
diff = compare(s3_file, db_file)
This works, but it is not very performant, as I first have to write huge csv files (as big as 500 MB) to local storage and then read and compare them this way.
Is there a better way to handle the comparison without writing files to local storage, while still ensuring that entire records (with every column value compared for a given row) are checked for equality irrespective of the position of the row in the dataframe?
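For reference, a minimal sketch of one in-memory approach: count identical records on each side and compare the counts, so row order doesn't matter and duplicates are handled (s3_data and db_data are the dataframes built above):

import pandas as pd

# count how many times each full record appears on each side
s3_counts = s3_data.groupby(list(s3_data.columns)).size().rename('s3_rows')
db_counts = db_data.groupby(list(db_data.columns)).size().rename('db_rows')

diff = pd.concat([s3_counts, db_counts], axis=1).fillna(0)
mismatches = diff[diff['s3_rows'] != diff['db_rows']]

# an empty result means every record, with identical column values, appears
# the same number of times in both dataframes, regardless of row position
print(mismatches)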
I am reading in a large CSV file using Python Pandas that is a little over 2 GB in size. Ultimately, what I am attempting to do is add a "Date" column at the first index of the file and transpose the file from 364 rows and approximately 360,000 columns to only three columns ("Date", "Location" and "Data") with many, many rows instead. This will then be written out to a newly transposed CSV file.
For a little more context, each of the 364 rows represents each day of the year. For each day (and each row), there are thousands and thousands of site locations (these are the columns), each containing a measurement taken at the location.
The file looks like this right now:
Index Location #1 Location #2 Location #359000...
0 Measurement Measurement Measurement
1 Measurement Measurement Measurement
2 Measurement Measurement Measurement
3 Measurement Measurement Measurement
364... Measurement Measurement Measurement
I have attempted to add the new column by creating a date column using the Pandas "date_range" function and then inserting that column into a new dataframe.
import pandas as pd

# read in csv file
df = pd.read_csv('Path to file')

# define 'Date' column, one entry per daily row
date_col = pd.date_range(start='1/1/2001', periods=364, freq='D')

# add 'Date' at the 0th index to be the first column (insert modifies df in place)
df.insert(0, 'Date', date_col)

# rearrange columns into rows, indexed by date and location
long_df = df.set_index('Date').unstack().to_frame('Data').swaplevel().sort_index()

# write out to new csv file, labelling the two index levels
long_df.to_csv('Transposed_csv_file', index=True, index_label=['Date', 'Location'])
The output I am looking for is a transposed CSV file that looks like this:
Date Location Data
1/1/2001 Location No. 1 Measurement 1
1/1/2001 Location No. 2 Measurement 2
1/1/2001 Location No. 3 Measurement 3
Once January 1st is complete, it will move on to January 2nd, like so:
1/2/2001 Location No. 1 Measurement 1
1/2/2001 Location No. 2 Measurement 2
1/2/2001 Location No. 3 Measurement 3
This pattern will repeat all the way to the end at 12/31/2001.
Three columns -- many rows. Basically, I am transposing from an X position to a Y position formatted CSV file.
What's happening right now is that when I attempt to run these lines of code, I can see via the task manager that my memory is slowly being consumed, climbing beyond 96%. I have 32 GB of RAM. There is no way a 2 GB CSV file being read in by Pandas and written out as another large transposed file should consume that much memory. I'm not sure what I am doing wrong, or whether there is a better method I can use to achieve the results I want. Thank you for your help.
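One possible way to keep memory bounded is to melt and write the data one daily row at a time instead of reshaping everything at once; a rough sketch under the assumptions above (364 daily rows, location columns, and hypothetical file names), trading speed for memory:

import pandas as pd

dates = pd.date_range(start='1/1/2001', periods=364, freq='D')

# read one daily row at a time instead of loading the whole 2 GB file
for i, chunk in enumerate(pd.read_csv('Path to file', chunksize=1)):
    chunk = chunk.drop(columns=['Index'], errors='ignore')   # drop the original index column if present
    chunk.insert(0, 'Date', dates[i])
    long_chunk = chunk.melt(id_vars='Date', var_name='Location', value_name='Data')
    long_chunk.to_csv('Transposed_csv_file', mode='w' if i == 0 else 'a',
                      header=(i == 0), index=False)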
To aggregate and find values per second, I am doing the following in Python using pandas; however, in the output logged to a file the columns don't appear in the order shown here. Somehow the column names are sorted, and hence TotalDMLsSec shows up before UpdateTotal and UpdatesSec.
'DeletesTotal': x['Delete'].sum(),
'DeletesSec': x['Delete'].sum()/VSeconds,
'SelectsTotal': x['Select'].sum(),
'SelectsSec': x['Select'].sum()/VSeconds,
'UpdateTotal': x['Update'].sum(),
'UpdatesSec': x['Update'].sum()/VSeconds,
'InsertsTotal': x['Insert'].sum(),
'InsertsSec': x['Insert'].sum()/VSeconds,
'TotalDMLsSec':(x['Delete'].sum()+x['Update'].sum()+x['Insert'].sum())/VSeconds
})
)
df.to_csv('/home/summary.log', sep='\t', encoding='utf-8-sig')
Apart from the above question, I have a couple of other questions:
Despite writing in csv format, all values/columns appear in a single column when opened in Excel. Is there any way to write the CSV so it loads properly?
Can rows be sorted based on one column (let's say InsertsSec) by default when writing to the csv file?
Any help here would be really appreciated.
Assume that your DataFrame is something like this:
Deletes Selects Updates Inserts
Name
Xxx 20 10 40 50
Yyy 12 32 24 11
Zzz 70 20 30 20
Then both total and total per sec can be computed as:
total = df.sum().rename('Total')
VSeconds = 5 # I assumed some value
tps = (total / VSeconds).rename('Total per sec')
Then you can add both above rows to the DataFrame:
df = df.append(total).append(tps)
The downside is that all numbers are converted to float. But in Pandas there is no other way, as each column must have values of one type.
Then you can e.g. write it to a CSV file (with totals included).
This is how I ended up doing it:
df.to_excel(vExcelFile, 'All')
vSortedDF = df.sort_values(['Deletes%'], ascending=False)
vSortedDF.loc[vSortedDF['Deletes%'] > 5, ['DeletesTotal', 'DeletesSec', 'Deletes%']].to_excel(vExcelFile, 'Top Delete objects')
vExcelFile.save()
For CSV, instead of using the \t separator I used ',' and it worked just fine.
df.to_csv('/home/summary.log', sep=',', encoding='utf-8-sig')
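Putting the ordering and sorting questions together, a minimal sketch (column names are taken from the question; the aggregated DataFrame df itself is assumed):

ordered_cols = ['DeletesTotal', 'DeletesSec', 'SelectsTotal', 'SelectsSec',
                'UpdateTotal', 'UpdatesSec', 'InsertsTotal', 'InsertsSec',
                'TotalDMLsSec']

out = (df.reindex(columns=ordered_cols)                 # keep the columns in the intended order
         .sort_values('InsertsSec', ascending=False))   # sort rows by one column before writing

# a ',' separator opens as separate columns in Excel without extra steps
out.to_csv('/home/summary.log', sep=',', encoding='utf-8-sig')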