Efficient way of concatenating and grouping pandas dataframes - python

I have around 150 CSV files in the following format:
| Product Name | Cost | Manufacturer | Country |
|--------------|------|--------------|---------|
| P_0          | 5    | Pfizer       | Finland |
| P_1          | 10   | BioNTech     | Sweden  |
| P_2          | 12   | Pfizer       | Denmark |
| P_3          | 11   | J&J          | Finland |
Each CSV represents daily data. So the file for the previous date would look like:
| Product Name | Cost | Manufacturer | Country |
|--------------|------|--------------|---------|
| P_0          | 7    | Pfizer       | Finland |
| P_1          | 15   | BioNTech     | Sweden  |
| P_2          | 17   | Pfizer       | Denmark |
| P_3          | 10   | J&J          | Finland |
I would like to create a time series dataset where I can track the price of a product given a manufacturer in a given country over time.
So for example I want to be able to show the price development of product P_1 made by BioNTech in Sweden as:
| Date       | Price |
|------------|-------|
| 17/10/2022 | 15    |
| 18/10/2022 | 10    |
My attempt:
Each CSV has the date as part of its name (e.g., 'data_17-10-2022'). I have created a list containing the paths to all of the CSV files; I iterate through this list, convert each CSV to a pandas dataframe, append each one to a list, concatenate them, and then perform a groupby operation.
import re
from datetime import datetime

import pandas as pd


def create_ts(data):
    df_list = []
    for file in data:
        match = re.search(r'\d{2}-\d{2}-\d{4}', file)  # get date from file name
        date = datetime.strptime(match.group(), '%d-%m-%Y').date()
        df = pd.read_csv(file, sep=";")
        df["date"] = date  # create a new column in each df that contains the date
        df_list.append(df)
    return df_list

df_concat = pd.concat(create_ts(my_files))
df_group = df_concat.groupby(["Manufacturer", "Country", "Product Name"])
This returns what I am after. However, it is very slow (when I tried it for a random country, manufacturer and product name it took nearly 10 minutes to run).
The problem (I think) is that each CSV is approximately 40 MB (180,000 rows and 20 columns, of which I drop around 10 irrelevant ones).
Is there anything I can do to speed this up? I tried installing modin but I got an error saying I need VS C++ v.14 and my work computer does not allow me to install programs without going through a very long process with the IT department.

Fundamentally your reading approach is fine: as far as I know, reading and then concatenating the dataframes is the best approach. There are some marginal improvements you can get by using the usecols and dtype parameters in read_csv, but this is very dependent on what your data looks like:
| Method                      | Time                | Relative           |
|-----------------------------|---------------------|--------------------|
| Original                    | 0.1512130000628531  | 1.5909069397118787 |
| Only load columns you need  | 0.09676750004291534 | 1.0180876465175188 |
| Use dtype parameter         | 0.09504829999059439 | 1.0                |
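Applied to the loader in the question, that might look something like the sketch below (the kept columns and their dtypes are assumptions based on the sample data shown, so adjust them to your real files and the ~10 columns you actually keep):

import pandas as pd

# hypothetical subset of the ~20 columns, plus explicit dtypes
USECOLS = ["Product Name", "Cost", "Manufacturer", "Country"]
DTYPES = {
    "Product Name": "string",
    "Cost": "float64",
    "Manufacturer": "category",
    "Country": "category",
}

df = pd.read_csv("data_17-10-2022.csv", sep=";", usecols=USECOLS, dtype=DTYPES)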
I think to get significant performance improvement you probably want to look at caching at some point in the process as dankal444 mentions.
What you cache depends on how the data is changing, but assuming the files do not change once you have received them, I would probably cache the loaded dataframe along with the set of included files, something like this:
import pickle

dst = './fastreading.pkl'
contained_files = set()

# save the set of included files together with the combined dataframe
with open(dst, 'wb') as f:
    pickle.dump((contained_files, df), f)

# load them back
with open(dst, 'rb') as f:
    contained_files2, df2 = pickle.load(f)
You could then check whether a file is already in the set of contained files during your loading process. I am using pickle here, but there are other, faster ways of loading/saving dataframes; there is some benchmark data here.
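A minimal sketch of how that check could work, reusing create_ts from the question (the cache path and the helper name load_with_cache are assumptions):

import os
import pickle

import pandas as pd

CACHE = './fastreading.pkl'  # hypothetical cache location


def load_with_cache(files):
    """Load only the CSVs that are not already in the cached dataframe."""
    if os.path.exists(CACHE):
        with open(CACHE, 'rb') as f:
            contained_files, df = pickle.load(f)
    else:
        contained_files, df = set(), pd.DataFrame()

    new_files = [path for path in files if path not in contained_files]
    if new_files:
        frames = create_ts(new_files)  # create_ts as defined in the question
        if not df.empty:
            frames = [df] + frames
        df = pd.concat(frames)
        contained_files.update(new_files)
        with open(CACHE, 'wb') as f:
            pickle.dump((contained_files, df), f)
    return df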
If you are worried that the files will change, you could include a timestamp or a checksum in your contained files set.
The other thing I would recommend is running a profiler. That should give you a good idea of where the time is spent.
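For example, with the standard library's cProfile (my_files here stands for your list of paths, and the output filename is arbitrary):

import cProfile
import pstats

# profile the whole load-and-concatenate step and dump the stats to disk
cProfile.run('pd.concat(create_ts(my_files))', 'load_profile.prof')

# show the 20 most expensive calls by cumulative time
pstats.Stats('load_profile.prof').sort_stats('cumulative').print_stats(20)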
read_csv test code:
import pandas as pd
import numpy as np
import timeit
iterations = 10
item_count = 5000
path = './fasterreading.csv'
data = {c: [i/2 for i in range(item_count)] for c in [chr(c) for c in range(ord('a'), ord('z') + 1)]}
dtypes = {c: np.float64 for c in data.keys()}
df = pd.DataFrame(data)
df.to_csv(path)
# attempt to negate file system caching effect
timeit.timeit(lambda: pd.read_csv(path), number=5)
t0 = timeit.timeit(lambda: pd.read_csv(path), number=iterations)
t1 = timeit.timeit(lambda: pd.read_csv(path, usecols=['a', 'b', 'c']), number=iterations)
t2 = timeit.timeit(lambda: pd.read_csv(path, usecols=['a', 'b', 'c'], dtype=dtypes), number=iterations)
tmin = min(t0, t1, t2)
print(f'| Method | Time | Relative |')
print(f'|-----------------------------|----------------------|----------------------|')
print(f'| Original | {t0} | {t0 / tmin} |')
print(f'| Only load columns you need | {t1} | {t1 / tmin} |')
print(f'| Use dtype parameter | {t2} | {t2 / tmin} |')

Related

Python read excel distinguish number type

I have a column in my Excel spreadsheet that contains different types of numbers (i.e., decimal, currency, percentage).
I need to read them into my DF in Python and know which ones are which.
The Excel table looks like:
| Group  | Q2_2022 | Q3_2022 | Q4_2022 | Goal | Comments |
|--------|---------|---------|---------|------|----------|
| Team A | 25      | 24      | 25      | 24   | meets    |
| Team B | 18%     | 18%     | 19%     | 18%  | Q4 over  |
| Team C | $200    | $225    | $218    | $220 | Q4 under |
df = pd.read_excel(file_one, sheet_name='Sheet One')
I need df['Goal'] to include the symbol if it exists.
So I need to be able to tell which rows are tracking goals which way. I do not have any control over the source data. Is there any way to do this when I read the data into the pandas dataframe?
Edited
Based on the solution by @Timeless below. Headed in the right direction but getting errors.
You can approach this by using the cell's number_format attribute from openpyxl.
from openpyxl import load_workbook
from collections import defaultdict
import pandas as pd

wb = load_workbook("/tmp/file.xlsx")
ws = wb["Sheet1"]

data = defaultdict(list)
for row in ws.iter_rows(2):
    for cell, col in zip(row, ws[1]):
        fmt, v1, colname = cell.number_format, cell.value, col.value
        v2 = f"${v1}" if "$" in fmt else f"{v1*100:g}%" if fmt == "0%" else v1
        data[colname].append(v2)

df = pd.DataFrame(data)
Output:
print(df)
Group Q2_2022 Q3_2022 Q4_2022 Goal
0 1 25 24 25 24
1 2 18% 18% 19% 18%
2 3 $200 $225 $218 $220

Matching cells in CSV to return calculation

I am trying to create a program that will take the most recent 30 CSV files of data within a folder and calculate totals of certain columns. There are 4 columns of data, with the first column being the identifier and the rest being the data related to the identifier. Here's an example:
file1
Asset X Y Z
12345 250 100 150
23456 225 150 200
34567 300 175 225
file2
Asset X Y Z
12345 270 130 100
23456 235 190 270
34567 390 115 265
I want to be able to match the asset # in both CSVs to return each column's value and then perform calculations on each column. Once I have completed those calculations I intend to graph various data as well. So far the only thing I have been able to complete is extracting ALL the data from the CSV files using the following code:
import glob

import pandas as pd

csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\FDR*.csv')

listData = []
for files in csvfile:
    df = pd.read_csv(files, index_col=0)
    listData.append(df)

concatenated_data = pd.concat(listData, sort=False)
group = concatenated_data.groupby('ASSET')[['Slip Expense ($)', 'Net Win ($)']].sum()
group.to_csv("C:\\Users\\tdjones\\Desktop\\Python Work Files\\Test\\NewFDRConcat.csv", header=('Slip Expense', 'Net Win'))
I am very new to Python so any and all direction is welcome. Thank you!
I'd probably also set the asset number as the index while you're reading the data, since this can help with sifting through data. So
rd = pd.read_csv(files, index_col=0)
Then you can do as Alex Yu suggested and just pick all the data from a specific asset number out when you're done using
asset_data = rd.loc[asset_number, column_name]
You'll generally need to format the data in the DataFrame before you append it to the list if you only want specific inputs. Exactly how to do that naturally depends specifically on what you want i.e. what kind of calculations you perform.
If you want a function that just returns all the data for one specific asset, you could do something along the lines of
import glob

import pandas as pd


def get_asset(asset_number):
    csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')

    asset_data = []
    for file in csvfile:
        # keep only the lines whose first field matches the asset number
        data = [line for line in open(file, 'r').read().splitlines()
                if line.split(',')[0] == str(asset_number)]
        for line in data:
            asset_data.append(line.split(','))

    return pd.DataFrame(asset_data, columns=['Asset', 'X', 'Y', 'Z'], dtype=float)
How well the above performs is going to depend on how large the dataset is that you're going through. A method like this needs to search through every line and perform several high-level functions on each one, so it could become problematic if you have millions of lines of data in each file.
Also, the above assumes that all data elements are strings of numbers (so they can be cast to integers or floats). If that's not the case, leave the dtype argument out of the DataFrame definition, but keep in mind that everything returned will then be stored as a string.
I suppose you need to add a pandas.concat of your listData to your code.
So it will became:
csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')

listData = []
for files in csvfile:
    rd = pd.read_csv(files)
    listData.append(rd)

concatenated_data = pd.concat(listData)
After that you can use aggregate functions on this concatenated_data DataFrame, such as concatenated_data['A'].max(), concatenated_data['A'].count(), groupby, etc.
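For example, reusing the column names from the snippet in the question (treat them as placeholders for your real headers):

# total the two money columns per asset across all concatenated files
totals = (
    concatenated_data
    .groupby('ASSET')[['Slip Expense ($)', 'Net Win ($)']]
    .sum()
)
totals.to_csv('NewFDRConcat.csv')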

Pandas - DateTime within X amount of minutes from row

I am not entirely sure of the best way to ask or phrase this question, so I will lay out my problem, dataset, thoughts on the method, and end goal, and hopefully it will be clear by the end.
My problem:
My company dispatches workers and will load up dispatches to a single employee even if they are already on a dispatch. This is due to a limitation in the software we use. If an employee receives two dispatches within 30 minutes, we call this a double dispatch.
We are analyzing our dispatching efficiency and I am running into a bit of a head-scratcher. I need to run through our 100k-row database and add a column that reads as a dummy variable: 1 for double, 0 for normal. BUT, since (a) we dispatch multiple people and (b) our records are not ordered by dispatch time, I need to determine how often a dispatch occurs to the same person within 30 minutes.
Dataset:
The dataset is incredibly massive due to poor organization in our data warehouse, but these are the columns I will need for my calculation:
| Tech Name  | Dispatch Time (PST) |
|------------|---------------------|
| John Smith | 1/1/2017 12:34      |
| Jane Smith | 1/1/2017 12:46      |
| John Smith | 1/1/2017 18:32      |
| John Smith | 1/1/2017 18:50      |
My Thoughts:
The way I would do it is clunky and would only work in one direction, not backwards. I would more or less write my code as:
import pandas as pd

df = pd.read_excel('data.xlsx')
df.sort_values('Dispatch Time (PST)', inplace=True)

tech_name = None
dispatch_time = pd.to_datetime('1/1/1900 00:00:00')

for index, row in df.iterrows():
    if tech_name is None:
        tech_name = row['Tech Name']
    else:
        if dispatch_time + pd.Timedelta('0 days 00:30:00') > row['Tech Dispatch Time (PST)'] and row['Tech Name'] == tech_name:
            row['Double Dispatch'] = 1
            dispatch_time = row['Tech Dispatch Time (PST)']
        else:
            dispatch_time = row['Tech Dispatch Time (PST)']
            tech_name = row['Tech Name']
This has many problems: it is slow, and it only tracks dispatches going backwards, not forwards, so I would miss many of them.
End Goal:
My goal is to have a dataset I can then plug back into Tableau for my report by adding on one column that reads as that dummy variable so I can filter and calculate on that.
I appreciate your time and help and let me know if any more details are necessary.
Thank you!
------------------ EDIT -------------
Added an edit to make the question clearer, as I failed to do so earlier.
Question: Is pandas the best tool to use to iterate over my dataframe and check, for each dispatch datetime, whether there is a record that matches the tech's name AND is less than 30 minutes away from this record?
If so, how could I improve my algorithm or approach? If not, what would the best tool be?
Desired output: an additional column that records whether a dispatch happened within a 30-minute window, as a dummy variable (1 for True, 0 for False). I need to see when double dispatches are occurring and how many records are true double dispatches, not just a count that says there were 100 instances of double dispatch involving over 200 records. I need to be able to sort and see each record.
Hello, I think I found a solution. It's slow and only compares one index before or after, but cases with 3 dispatches within thirty minutes represent less than 0.5% of records for us.
import pandas as pd
import numpy as np
import datetime as dt

dispatch = 'Tech Dispatched Date-Time (PST)'
tech = 'CombinedTech'

df = pd.read_excel('combined_data.xlsx')
df.sort_values(dispatch, inplace=True)
df.reset_index(inplace=True)
df['Double Dispatch'] = np.NaN

writer = pd.ExcelWriter('final_output.xlsx', engine='xlsxwriter')

dispatch_count = 0
time = dt.timedelta(minutes=30)

for index, row in df.iterrows():
    try:
        tech_one = df[tech].loc[(index - 1)]
        dispatch_one = df[dispatch].loc[(index - 1)]
    except KeyError:
        tech_one = None
        dispatch_one = pd.to_datetime('1/1/1990 00:00:00')
    try:
        tech_two = df[tech].loc[(index + 1)]
        dispatch_two = df[dispatch].loc[(index + 1)]
    except KeyError:
        tech_two = None
        dispatch_two = pd.to_datetime('1/1/2020 00:00:00')

    first_time = dispatch_one + time
    second_time = pd.to_datetime(row[dispatch]) + time
    dispatch_pd = pd.to_datetime(row[dispatch])

    if tech_one == row[tech] or tech_two == row[tech]:
        if first_time > row[dispatch] or second_time > dispatch_two:
            df.set_value(index, 'Double Dispatch', 1)
            dispatch_count += 1
        else:
            df.set_value(index, 'Double Dispatch', 0)
            dispatch_count += 1

print(dispatch_count)  # This was to monitor total # of records being pushed through

df.to_excel(writer, sheet_name='Sheet1')
writer.save()
writer.close()
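As a side note, here is a vectorized sketch of the same idea using a sort plus a per-tech time difference; the column names follow the snippet above and should be treated as assumptions:

import pandas as pd

dispatch = 'Tech Dispatched Date-Time (PST)'  # column names assumed from the snippet above
tech = 'CombinedTech'

df = pd.read_excel('combined_data.xlsx')
df[dispatch] = pd.to_datetime(df[dispatch])
df = df.sort_values([tech, dispatch])

# time gap to the previous / next dispatch by the same tech
gap_prev = df.groupby(tech)[dispatch].diff()
gap_next = df.groupby(tech)[dispatch].diff(-1).abs()

# flag rows with another dispatch for the same tech within 30 minutes
df['Double Dispatch'] = (
    (gap_prev <= pd.Timedelta(minutes=30)) | (gap_next <= pd.Timedelta(minutes=30))
).astype(int)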

Split a very large pandas dataframe into smaller ones efficiently, knowing the large one is ordered

I'm reading a very large (15M lines) CSV file into a pandas dataframe. I then want to split it into smaller ones (ultimately creating smaller CSV files, or a pandas panel...).
I have working code but it's VERY slow. I believe it's not taking advantage of the fact that my dataframe is 'ordered'.
The df looks like:
ticker date open high low
0 AAPL 1999-11-18 45.50 50.0000 40.0000
1 AAPL 1999-11-19 42.94 43.0000 39.8100
2 AAPL 1999-11-22 41.31 44.0000 40.0600
...
1000 MSFT 1999-11-18 45.50 50.0000 40.0000
1001 MSFT 1999-11-19 42.94 43.0000 39.8100
1002 MSFT 1999-11-22 41.31 44.0000 40.0600
...
7663 IBM 1999-11-18 45.50 50.0000 40.0000
7664 IBM 1999-11-19 42.94 43.0000 39.8100
7665 IBM 1999-11-22 41.31 44.0000 40.0600
I want to take all rows where ticker=='AAPL' and make a dataframe with them. Then all rows where ticker=='MSFT', and so on. The number of rows for each ticker is NOT the same, and the code has to adapt. I might load in a new 'large' CSV where everything is different.
This is what I came up with:
import pandas as pd

# Read database
alldata = pd.read_csv('./alldata.csv')

# get a list of all unique tickers present in the database
alltickers = alldata.iloc[:, 0].unique()

# write data of each ticker in its own csv file
for ticker in alltickers:
    print('Creating csv for ' + ticker)
    # get data for current ticker
    tickerdata = alldata.loc[alldata['ticker'] == ticker]
    # remove column with ticker symbol (will be the file name) and reindex as
    # we're grabbing from somewhere in a large dataframe
    tickerdata = tickerdata.iloc[:, 1:13].reset_index(drop=True)
    # write csv
    tickerdata.to_csv('./split/' + ticker + '.csv')
This takes forever to run. I thought it was the file I/O, but I commented out the CSV-writing part of the for loop and found that this line is the problem:
tickerdata = alldata.loc[alldata['ticker'] == ticker]
I wonder if pandas is looking in the WHOLE dataframe every single time. I do know that the dataframe is in order of ticker. Is there a way to leverage that?
Thank you very much!
Dave
The easiest way to do this is to create a dictionary of the dataframes using a dictionary comprehension and pandas groupby:
dodf = {ticker: sub_df for ticker, sub_df in alldata.groupby('ticker')}
dodf['IBM']
ticker date open high low
7663 IBM 1999-11-18 45.50 50.0 40.00
7664 IBM 1999-11-19 42.94 43.0 39.81
7665 IBM 1999-11-22 41.31 44.0 40.06
It makes sense that creating a boolean index of length 15 million, and doing it repeatedly, is going to take a little while. Honestly, for splitting the file into subfiles, I think Pandas is the wrong tool for the job. I'd just use a simple loop to iterate over the lines in the input file, writing them to the appropriate output file as they come. This doesn't even have to load the whole file at once, so it will be fairly fast.
import itertools as it

tickers = set()

with open('./alldata.csv') as f:
    headers = next(f)
    for ticker, lines in it.groupby(f, lambda s: s.split(',', 1)[0]):
        with open('./split/{}.csv'.format(ticker), 'a') as w:
            if ticker not in tickers:
                w.writelines([headers])
                tickers.add(ticker)
            w.writelines(lines)
Then you can load each individual split file using pd.read_csv() and turn that into its own DataFrame.
If you know that the file is ordered by ticker, then you can skip everything involving the set tickers (which tracks which tickers have already been encountered). But that's a fairly cheap check.
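If you then want the split files back in memory as dataframes, one hedged sketch (the './split/' path mirrors the code above and is otherwise an assumption):

import glob
import os

import pandas as pd

# build a dict of per-ticker dataframes from the split files
frames = {
    os.path.splitext(os.path.basename(path))[0]: pd.read_csv(path)
    for path in glob.glob('./split/*.csv')
}
print(frames['AAPL'].head())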
Probably, the best approach is to use groupby. Suppose:
>>> df
ticker v1 v2
0 A 6 0.655625
1 A 2 0.573070
2 A 7 0.549985
3 B 32 0.155053
4 B 10 0.438095
5 B 26 0.310344
6 C 23 0.558831
7 C 15 0.930617
8 C 32 0.276483
Then group:
>>> grouped = df.groupby('ticker', as_index=False)
Finally, iterate over your groups:
>>> for g, df_g in grouped:
... print('creating csv for ', g)
... print(df_g.to_csv())
...
creating csv for A
,ticker,v1,v2
0,A,6,0.6556248347252436
1,A,2,0.5730698850517599
2,A,7,0.5499849530664374
creating csv for B
,ticker,v1,v2
3,B,32,0.15505313728451087
4,B,10,0.43809490694469133
5,B,26,0.31034386153099336
creating csv for C
,ticker,v1,v2
6,C,23,0.5588311692150466
7,C,15,0.930617426953476
8,C,32,0.2764826801584902
Of course, here I am printing a csv, but you can do whatever you want.
Using groupby is great, but it does not take advantage of the fact that the data is presorted and so will likely have more overhead compared to a solution that does. For a large dataset, this could be a noticeable slowdown.
Here is a method which is optimized for the sorted case:
import pandas as pd
import numpy as np

alldata = pd.read_csv("tickers.csv")
tickers = np.array(alldata.ticker)

# use numpy to compute change points, should
# be super fast and yield performance boost over groupby:
change_points = np.where(
    tickers[1:] != tickers[:-1])[0].tolist()

# add last point in as well to get last ticker block
change_points += [tickers.size - 1]

prev_idx = 0
for idx in change_points:
    ticker = alldata.ticker[idx]
    print('Creating csv for ' + ticker)
    # get data for current ticker
    tickerdata = alldata.iloc[prev_idx: idx + 1]
    tickerdata = tickerdata.iloc[:, 1:13].reset_index(drop=True)
    tickerdata.to_csv('./split/' + ticker + '.csv')
    prev_idx = idx + 1

pandas groupby for multiple data frames/files at once

I have multiple huge tsv files that I'm trying to process using pandas. I want to group by 'col3' and 'col5'. I've tried this:
import pandas as pd
df = pd.read_csv('filename.txt', sep = "\t")
g2 = df.drop_duplicates(['col3', 'col5'])
g3 = g2.groupby(['col3', 'col5']).size().sum(level=0)
print(g3)
It works fine so far and prints an output like this:
yes 2
no 2
I'd like to be able to aggregate the output from multiple files, i.e., group by these two columns across all the files at once and print one combined output with the total number of occurrences of 'yes', 'no', or whatever that attribute could be. In other words, I'd now like to apply groupby to multiple files at once. If a file doesn't have one of these columns, it should be skipped and the process should move on to the next file.
This is a nice use case for blaze.
Here's an example using a couple of reduced files from the nyctaxi dataset. I've purposely split a single large file into two files of 1,000,000 lines each:
In [16]: from blaze import Data, compute, by
In [17]: ls
trip10.csv trip11.csv
In [18]: d = Data('*.csv')
In [19]: expr = by(d[['passenger_count', 'medallion']], avg_time=d.trip_time_in_secs.mean())
In [20]: %time result = compute(expr)
CPU times: user 3.22 s, sys: 393 ms, total: 3.61 s
Wall time: 3.6 s
In [21]: !du -h *
194M trip10.csv
192M trip11.csv
In [22]: len(d)
Out[22]: 2000000
In [23]: result.head()
Out[23]:
passenger_count medallion avg_time
0 0 08538606A68B9A44756733917323CE4B 0
1 0 0BB9A21E40969D85C11E68A12FAD8DDA 15
2 0 9280082BB6EC79247F47EB181181D1A4 0
3 0 9F4C63E44A6C97DE0EF88E537954FC33 0
4 0 B9182BF4BE3E50250D3EAB3FD790D1C9 14
Note: This will perform the computation with pandas, using pandas' own chunked CSV reader. If your files are in the GB range you're better off converting to a format such as bcolz or PyTables, as these are binary formats designed for data analysis on huge files. CSVs are just blobs of text with conventions.
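As a rough illustration of that conversion, here is a sketch that appends the CSVs into a single HDF5 table (backed by PyTables, so it needs the tables package installed) using pandas' chunked reader; the file names and chunk size are assumptions:

import glob

import pandas as pd

# stream each CSV into one on-disk HDF5 table without loading everything into memory
with pd.HDFStore('trips.h5', mode='w') as store:
    for path in sorted(glob.glob('trip1*.csv')):
        for chunk in pd.read_csv(path, chunksize=500_000):
            store.append('trips', chunk, data_columns=['passenger_count', 'medallion'])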
One way to do it is to concatenate the dfs. It can eat up a lot of memory. How huge are the files?
filelist = ['file1.txt', 'file2.txt']
df = pd.concat([pd.read_csv(x, sep="\t") for x in filelist], axis=0)
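To also skip files that are missing either column (as the question asks), a hedged variation on the same idea:

import pandas as pd

filelist = ['file1.txt', 'file2.txt']

frames = []
for path in filelist:
    df = pd.read_csv(path, sep="\t")
    if {'col3', 'col5'}.issubset(df.columns):  # skip files without both columns
        frames.append(df.drop_duplicates(['col3', 'col5']))

combined = pd.concat(frames, axis=0)
print(combined.groupby(['col3', 'col5']).size().groupby(level=0).sum())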
