Scalable way to union rows among thousands of tables - python

I have about 50k csv-like files separated by spaces, each with tens of millions of rows. The first column is always a string without any spaces, the second is always a positive integer, and there is no missing data. In this problem, I am ONLY interested in the first column, so please ignore the second column. Here are toy examples of two such csv files.
example1.csv
f1 x
f2 x
f5 x
f7 x
...
example2.csv
f1 x
f2 x
f3 x
f4 x
f6 x
...
As you can see, the feature sets in the two files overlap but are not the same. What I want to do is combine the data from all 50k csv files and transform it into the following form.
file_name f1 f2 f3 f4 f5 f6 f7 ....
example1.csv 1 1 0 0 1 0 1 ...
example2.csv 1 1 1 1 0 1 0 ...
...
So it's basically constructing a matrix of file_name x feature_id: if the feature_id exists in the file, the entry is 1, otherwise 0. The parsing here is relatively simple; the focus is on scalability, and the number of rows may go up to billions in future projects. I have access to machines with up to one or two terabytes of memory and 100 cores, so I guess memory is less of a concern. My naive implementation below works well on toy examples, but it gets too slow for real ones and appears to hang when it reaches about 310000 lines in the first file, and I am not sure why. (Do you know why? My intuition says it may have something to do with defaultdict; I'm not sure how it's implemented and it might be expensive to use.) I would like the solution to be reasonably fast. The solution is preferably in Python, but other languages are fine, too.
import os
import gzip
from collections import defaultdict

import pandas as pd

# collect paths to all csv-like files (one path per line in file_list.txt)
with open('file_list.txt') as inf:
    inputs = [os.path.abspath(line.strip()) for line in inf]
inputs_map = dict(zip(inputs, range(len(inputs))))

# res maps feature_id -> list of 0/1 indicators, one slot per input file
res = defaultdict(lambda: [0] * len(inputs))
for k, infile in enumerate(inputs):
    print(k, infile)
    source_file_id = inputs_map[infile]
    # start parsing the csv-like file; only the first column matters
    with gzip.open(infile, 'rt') as inf:
        for kl, line in enumerate(inf):
            feature_id = line.split()[0]
            res[feature_id][source_file_id] = 1
            if (kl + 1) % 10000 == 0:
                print('File {0}'.format(k), 'Line {0}'.format(kl + 1), infile)

df = pd.DataFrame(res)
df.index = inputs
print('starting writing to disk...')
df.T.to_csv('output.csv')

import pandas as pd

# convert whatever is in the second column to a constant 1
xto1 = lambda x: 1

def read_ex(fn):
    # read only the first two columns: the feature id becomes the index,
    # the second column is mapped to 1; squeeze=True returns a Series
    s = pd.read_csv(
        fn, sep=' ', header=None,
        index_col=0, usecols=[0, 1],
        converters={1: xto1},
        names=[None, fn],
        squeeze=True)
    return s

fs = ['example1.csv', 'example2.csv']
# stack the per-file Series (keyed by file name), then pivot the feature ids
# into columns, filling missing features with 0
pd.concat([read_ex(f) for f in fs], keys=fs).unstack(fill_value=0)
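If the matrix should also be written to disk in the layout shown in the question, a minimal follow-up sketch (the file_name index label and the output path are assumptions, not part of the original answer) could be:
result = pd.concat([read_ex(f) for f in fs], keys=fs).unstack(fill_value=0)
result.index.name = 'file_name'  # hypothetical label matching the desired header
result.to_csv('output.csv')      # hypothetical output path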

Related

Efficient way of concatenating and grouping pandas dataframes

I have around 150 CSV files in the following format:
Product Name  Cost  Manufacturer  Country
P_0           5     Pfizer        Finland
P_1           10    BioNTech      Sweden
P_2           12    Pfizer        Denmark
P_3           11    J&J           Finland
Each CSV represents daily data. So the file for the previous date would look like:
Product Name  Cost  Manufacturer  Country
P_0           7     Pfizer        Finland
P_1           15    BioNTech      Sweden
P_2           17    Pfizer        Denmark
P_3           10    J&J           Finland
I would like to create a time series dataset where I can track the price of a product given a manufacturer in a given country over time.
So for example I want to be able to show the price development of product P_1 made by BioNTech in Sweden as:
Date        Price
17/10/2022  15
18/10/2022  10
My attempt:
Each CSV has the date as part of its name (e.g., 'data_17-10_2022'). So I have created a list that contains the paths to all of the CSV files, and then I iterate through this list, convert each CSV to a pandas dataframe, add each of them to a list, and then concatenate them, after which I perform some groupby operations.
import re
from datetime import datetime

import pandas as pd

def create_ts(data):
    df_list = []
    for file in data:
        match = re.search(r'\d{2}-\d{2}-\d{4}', file)  # get date from file name
        date = datetime.strptime(match.group(), '%d-%m-%Y').date()
        df = pd.read_csv(file, sep=";")
        df["date"] = date  # create a new column in each df that contains the date
        df_list.append(df)
    return df_list

df_concat = pd.concat(create_ts(my_files))
df_group = df_concat.groupby(["Manufacturer", "Country", "Product Name"])
This returns what I am after. However, it is very slow (when I tried it for a random country, manufacturer and product name it took nearly 10 minutes to run).
The problem (I think) is that each CSV is approximately 40MB (180,000 rows and 20 columns, of which I drop around 10 irrelevant columns).
Is there anything I can do to speed this up? I tried installing modin but I got an error saying I need VS C++ v.14 and my work computer does not allow me to install programs without going through a very long process with the IT department.
Fundamentally your reading approach is fine: as far as I know, reading and then concatenating the dataframes is the best approach. There are some marginal improvements you can get if you use the usecols and dtype parameters in read_csv, but this is very dependent on what your data looks like:
Method                       Time                 Relative
Original                     0.1512130000628531   1.5909069397118787
Only load columns you need   0.09676750004291534  1.0180876465175188
Use dtype parameter          0.09504829999059439  1.0
I think to get significant performance improvement you probably want to look at caching at some point in the process as dankal444 mentions.
What you cache depends on how the data is changing, but assuming the files do not change once you have received them, I would probably cache the loaded dataframe together with the set of included files, something like:
import pickle

dst = './fastreading.pkl'
contained_files = set()

# df is assumed to be the concatenated DataFrame built earlier;
# save it together with the set of files it was built from
with open(dst, 'wb') as f:
    pickle.dump((contained_files, df), f)

# load both back later
with open(dst, 'rb') as f:
    contained_files2, df2 = pickle.load(f)
You could then check if the file is in the list of contained files in your loading process. I am using pickle here, but there are other faster ways of loading/saving dataframes; there is some benchmark data here.
If you are worried that the files will change, you could include a timestamp or a checksum in your contained files list.
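As a rough illustration of that check (a sketch only, reusing the hypothetical fastreading.pkl cache from the snippet above and the create_ts/my_files names from the question):
import os
import pickle

import pandas as pd

cache_path = './fastreading.pkl'  # hypothetical cache file from the snippet above

if os.path.exists(cache_path):
    with open(cache_path, 'rb') as f:
        contained_files, df = pickle.load(f)
else:
    contained_files, df = set(), pd.DataFrame()

# only parse files that are not already in the cache
new_files = [fn for fn in my_files if fn not in contained_files]
if new_files:
    frames = create_ts(new_files)
    if len(df):
        frames = [df] + frames
    df = pd.concat(frames)
    contained_files.update(new_files)
    with open(cache_path, 'wb') as f:
        pickle.dump((contained_files, df), f)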
The other thing I would recommend is running a profiler. This should give you a good idea where the time is spent.
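For example, the loading step from the question could be profiled with the standard library's cProfile (a minimal sketch, assuming my_files and create_ts are defined as in the question):
import cProfile

# print a cumulative-time breakdown of where the loading step spends its time
cProfile.run('df_list = create_ts(my_files)', sort='cumtime')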
read_csv test code:
import pandas as pd
import numpy as np
import timeit
iterations = 10
item_count = 5000
path = './fasterreading.csv'
data = {c: [i/2 for i in range(item_count)] for c in [chr(c) for c in range(ord('a'), ord('z') + 1)]}
dtypes = {c: np.float64 for c in data.keys()}
df = pd.DataFrame(data)
df.to_csv(path)
# attempt to negate file system caching effect
timeit.timeit(lambda: pd.read_csv(path), number=5)
t0 = timeit.timeit(lambda: pd.read_csv(path), number=iterations)
t1 = timeit.timeit(lambda: pd.read_csv(path, usecols=['a', 'b', 'c']), number=iterations)
t2 = timeit.timeit(lambda: pd.read_csv(path, usecols=['a', 'b', 'c'], dtype=dtypes), number=iterations)
tmin = min(t0, t1, t2)
print(f'| Method | Time | Relative |')
print(f'|-----------------------------|----------------------|---------------------|')
print(f'| Original | {t0} | {t0 / tmin} |')
print(f'| Only load columns you need | {t1} | {t1 / tmin} |')
print(f'| Use dtype parameter | {t2} | {t2 / tmin} |')

Save each Excel-spreadsheet-row with header in separate .txt-file (saved as a parameter-sample to be read by simulation programs)

I'm a building energy simulation modeller with an Excel-question to enable automated large-scale simulations using parameter samples (samples generated using Monte Carlo). Now I have the following question in saving my samples:
I want to save each row of an Excel-spreadsheet in a separate .txt-file in a 'special' way to be read by simulation programs.
Let's say, I have the following excel-file with 4 parameters (a,b,c,d) and 20 values underneath:
a b c d
2 3 5 7
6 7 9 1
3 2 6 2
5 8 7 6
6 2 3 4
Each row of this spreadsheet represents a simulation-parameter-sample.
I want to store each row in a separate .txt-file as follows (so 5 '.txt'-files for this spreadsheet):
'1.txt' should contain:
a=2;
b=3;
c=5;
d=7;
'2.txt' should contain:
a=6;
b=7;
c=9;
d=1;
and so on for files '3.txt', '4.txt' and '5.txt'.
So basically matching the header with its corresponding value underneath for each row in a separate .txt-file ('header equals value;').
Is there an Excel add-in that does this, or is it better to use some VBA code? Does anybody have an idea?
(I'm quite experienced in simulation modelling but not in programming, hence this rather easy parameter-sample-saving question in Excel. Solutions in Python are also welcome if that's easier for you.)
My idea would be to use Python along with pandas, as it's one of the most flexible solutions and your use case might expand in the future.
I'm going to try to keep this as simple as possible. I'm assuming that you have Python, that you know how to install packages via pip or conda, and that you are ready to run a Python script on whatever system you are using.
First your script needs to import pandas and read the file into a DataFrame:
import pandas as pd
df = pd.read_excel('path/to/your/file.xlsx')
(Note that you might need to install an Excel reader such as openpyxl or xlrd in addition to pandas.)
Now you have a powerful data structure that you can manipulate in plenty of ways. I guess the most intuitive one would be to loop over all items. Use string formatting, which is best explained over here, and put the strings together the way you need them:
for row in df.index:
    s = ""
    for col in df.columns:
        s += "{}={};\n".format(col, df[col][row])
    print(s)
Now you just need to write to a file using Python's built-in open. I'll just name the files by the index of the row, but this solution will overwrite older text files created by earlier runs of this script. You might want to add something unique like the date and time or the name of the file you read, or increment the file name further with multiple runs of the script, for example like this.
All together we get:
import pandas as pd

df = pd.read_excel('path/to/your/file.xlsx')
file_count = 0
for row in df.index:
    s = ""
    for col in df.columns:
        s += "{}={};\n".format(col, df[col][row])
    file = open('test_{:03}.txt'.format(file_count), "w")
    file.write(s)
    file.close()
    file_count += 1
Note that it's probably not the most elegant way and that there are one-liners out there, but since you are not a programmer I thought you might prefer a more intuitive way that you can tweak yourself easily.
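For reference, a more compact variant of the same idea (a sketch, not part of the original answer, using the same assumed file path and a similar naming scheme) could use iterrows:
import pandas as pd

df = pd.read_excel('path/to/your/file.xlsx')
for i, row in df.iterrows():
    # one file per spreadsheet row, e.g. test_000.txt, test_001.txt, ...
    with open('test_{:03}.txt'.format(i), 'w') as f:
        f.write(''.join('{}={};\n'.format(col, val) for col, val in row.items()))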
I got this to work in Excel. You can expand the length of the variables x,y and z to match your situation and use LastRow, LastColumn methods to find the dimensions of your data set. I named the original worksheet "Data", as shown below.
Sub TestExportText()
    Dim Hdr(1 To 4) As String
    Dim x As Long
    Dim y As Long
    Dim z As Long
    For x = 1 To 4
        Hdr(x) = Cells(1, x)
    Next x
    x = 1
    For y = 1 To 5
        ThisWorkbook.Sheets.Add After:=Sheets(Sheets.Count)
        ActiveSheet.Name = y
        For z = 1 To 4
            With ActiveSheet
                .Cells(z, 1) = Hdr(z) & "=" & Sheets("Data").Cells(x + 1, z) & ";"
            End With
        Next z
        x = x + 1
        ActiveSheet.Move
        ActiveWorkbook.ActiveSheet.SaveAs Filename:="File" & y & ".txt", FileFormat:=xlTextWindows
        ActiveWorkbook.Close SaveChanges:=False
    Next y
End Sub
If you can save your Excel spreadsheet as a CSV file, then this Python script will do what you want.
with open('data.csv') as infile:
    data_list = [line.rstrip('\n').split(',') for line in infile]

# data_list[0] is the header row; each following row becomes its own .txt file
for counter in range(1, len(data_list)):
    with open(str(counter) + '.txt', 'w') as outfile:
        for i in range(len(data_list[counter])):
            outfile.write(data_list[0][i] + '=' + data_list[counter][i] + ';\n')

Matching cells in CSV to return calculation

I am trying to create a program that will take the most recent 30 CSV files of data within a folder and calculate totals of certain columns. There are 4 columns of data, with the first column being the identifier and the rest being the data related to the identifier. Here's an example:
file1
Asset X Y Z
12345 250 100 150
23456 225 150 200
34567 300 175 225
file2
Asset X Y Z
12345 270 130 100
23456 235 190 270
34567 390 115 265
I want to be able to match the asset # in both CSVs to return each column's value and then perform calculations on each column. Once I have completed those calculations I intend on graphing various data as well. So far the only thing I have been able to complete is extracting ALL the data from the CSV files using the following code:
import glob

import pandas as pd

csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\FDR*.csv')
listData = []
for files in csvfile:
    df = pd.read_csv(files, index_col=0)
    listData.append(df)

concatenated_data = pd.concat(listData, sort=False)
group = concatenated_data.groupby('ASSET')[['Slip Expense ($)', 'Net Win ($)']].sum()
group.to_csv("C:\\Users\\tdjones\\Desktop\\Python Work Files\\Test\\NewFDRConcat.csv", header=('Slip Expense', 'Net Win'))
I am very new to Python so any and all direction is welcome. Thank you!
I'd probably also set the asset number as the index while you're reading the data, since this can help with sifting through data. So
rd = pd.read_csv(files, index_col=0)
Then you can do as Alex Yu suggested and just pick all the data from a specific asset number out when you're done using
asset_data = rd.loc[asset_number, column_name]
You'll generally need to format the data in the DataFrame before you append it to the list if you only want specific inputs. Exactly how to do that naturally depends on what you want, i.e. what kind of calculations you perform.
If you want a function that just returns all the data for one specific asset, you could do something along the lines of
import glob

import pandas as pd

def get_asset(asset_number):
    csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')
    asset_data = []
    for file in csvfile:
        # keep only the lines whose first field matches the requested asset number
        data = [line for line in open(file, 'r').read().splitlines()
                if line.split(',')[0] == str(asset_number)]
        for line in data:
            asset_data.append(line.split(','))
    return pd.DataFrame(asset_data, columns=['Asset', 'X', 'Y', 'Z'], dtype=float)
How well the above performs is going to depend on how large the dataset is that you're going through. Something like the above method needs to search through every line and perform several high-level operations on each line, so it could potentially be problematic if you have millions of lines of data in each file.
Also, the above assumes that all data elements are strings of numbers (so they can be cast to integers or floats). If that's not the case, leave the dtype argument out of the DataFrame definition, but keep in mind that everything returned will then be stored as a string.
I suppose that you just need to add pandas.concat of your listData to your code.
So it will become:
csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')
listData = []
for files in csvfile:
    rd = pd.read_csv(files)
    listData.append(rd)

concatenated_data = pd.concat(listData)
After that you can use aggregate functions on this concatenated_data DataFrame, such as concatenated_data['A'].max(), concatenated_data['A'].count(), groupby operations, etc.
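For instance, the per-asset column totals the question describes could look roughly like this (a sketch; the column names 'Asset', 'X', 'Y', 'Z' come from the sample data and the output path is an assumption):
# sum each data column per asset across all concatenated files
totals = concatenated_data.groupby('Asset')[['X', 'Y', 'Z']].sum()
totals.to_csv('asset_totals.csv')  # hypothetical output path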

pandas groupby for multiple data frames/files at once

I have multiple huge tsv files that I'm trying to process using pandas. I want to group by 'col3' and 'col5'. I've tried this:
import pandas as pd
df = pd.read_csv('filename.txt', sep = "\t")
g2 = df.drop_duplicates(['col3', 'col5'])
g3 = g2.groupby(['col3', 'col5']).size().sum(level=0)
print(g3)
It works fine so far and prints an output like this:
yes 2
no 2
I'd like to be able to aggregate the output from multiple files, i.e., to be able to group by these two columns in all the files at once and print one common output with total number of occurrences of 'yes' or 'no' or whatever that attribute could be. In other words, I'd now like to use groupby on multiple files at once. And if a file doesn't have one of these columns, it should be skipped and should go to the next file.
This is a nice use case for blaze.
Here's an example using a couple of reduced files from the nyctaxi dataset. I've purposely split a single large file into two files of 1,000,000 lines each:
In [16]: from blaze import Data, compute, by
In [17]: ls
trip10.csv trip11.csv
In [18]: d = Data('*.csv')
In [19]: expr = by(d[['passenger_count', 'medallion']], avg_time=d.trip_time_in_secs.mean())
In [20]: %time result = compute(expr)
CPU times: user 3.22 s, sys: 393 ms, total: 3.61 s
Wall time: 3.6 s
In [21]: !du -h *
194M trip10.csv
192M trip11.csv
In [22]: len(d)
Out[22]: 2000000
In [23]: result.head()
Out[23]:
passenger_count medallion avg_time
0 0 08538606A68B9A44756733917323CE4B 0
1 0 0BB9A21E40969D85C11E68A12FAD8DDA 15
2 0 9280082BB6EC79247F47EB181181D1A4 0
3 0 9F4C63E44A6C97DE0EF88E537954FC33 0
4 0 B9182BF4BE3E50250D3EAB3FD790D1C9 14
Note: This will perform the computation with pandas, using pandas' own chunked CSV reader. If your files are in the GB range you're better off converting to a format such as bcolz or PyTables, as these are binary formats designed for data analysis on huge files. CSVs are just blobs of text with conventions.
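As a rough illustration of that kind of conversion (a sketch, assuming the tables package is installed; the file and key names are made up):
import pandas as pd

# one-off conversion of a big CSV into a binary HDF5 table
df = pd.read_csv('trip10.csv')
df.to_hdf('trip10.h5', key='trips', format='table')

# later analyses can load the binary file much faster than re-parsing the CSV
trips = pd.read_hdf('trip10.h5', 'trips')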
One way to do it is to concatenate the dfs. It can eat up a lot of memory. How huge are the files?
filelist = ['file1.txt', 'file2.txt']
df = pd.concat([pd.read_csv(x, sep="\t") for x in filelist], axis=0)
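If some files might be missing one of the columns (as the question mentions), a variant that skips them could look roughly like the following (a sketch, assuming tab-separated files and the column names from the question):
import pandas as pd

filelist = ['file1.txt', 'file2.txt']
frames = []
for fn in filelist:
    df = pd.read_csv(fn, sep="\t")
    # skip files that do not contain both grouping columns
    if not {'col3', 'col5'}.issubset(df.columns):
        continue
    frames.append(df.drop_duplicates(['col3', 'col5']))

combined = pd.concat(frames)
print(combined.groupby(['col3', 'col5']).size().groupby(level=0).sum())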

Python data wrangling issues

I'm currently stumped by some basic issues with a small data set. Here are the first three lines to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
The issues I am having after using read_csv to create a DataFrame:
The presence of commas in certain categorical values (such as Prize_Pool) results in Python considering these entries as strings. I need to convert these to floats in order to make certain calculations. I've used Python's replace() function to get rid of the commas, but that's as far as I've gotten.
The category Contest_Date_EST contains timestamps, but some are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice in which repeated entry or entries are removed, but at the moment I'd just like to be able to filter the data with unique timestamps.
Use thousands=',' argument for numbers that contain a comma
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check that Prize_Pool is numerical:
In [3]: type(d.loc[0, 'Prize_Pool'])
Out[3]: numpy.float64
To drop duplicate rows, keeping the first observed (you can also keep the last):
In [7]: d.drop_duplicates('Contest_Date_EST', keep='first')
Out[7]:
Sport Entry \
0 NBA NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...
Contest_Date_EST Place Points Winnings_Non_Ticket Winnings_Ticket \
0 2015-03-01 13:00:00 35 283.25 13.33 0
Contest_Entries Entry_Fee Prize_Pool Places_Paid
0 171 20 3000 35
Edit: Just realized you're using pandas - should have looked at that. I'll leave this here for now in case it's applicable, but if it gets downvoted I'll take it down by virtue of peer pressure :) I'll try and update it to use pandas later tonight.
Seems like itertools.groupby() is the tool for this job. Something like this?
import csv
import itertools

class CsvImport():
    def Run(self, filename):
        # Get the formatted rows from the CSV file
        rows = self.readCsv(filename)
        for key in rows.keys():
            print("\nKey: " + key)
            i = 1
            for value in rows[key]:
                print("\nValue {index} : {value}".format(index=i, value=value))
                i += 1

    def readCsv(self, fileName):
        with open(fileName, 'r') as csvfile:
            reader = csv.DictReader(csvfile)
            # Keys may or may not be pulled in with extra space by DictReader()
            # The next line simply creates a small dict of stripped keys to original padded keys
            keys = {key.strip(): key for key in reader.fieldnames}
            # Format each row into the final string
            groupedRows = {}
            for k, g in itertools.groupby(reader, lambda x: x["Contest_Date_EST"]):
                groupedRows[k] = [self.normalizeRow(list(v.values())) for v in g]
            return groupedRows

    def normalizeRow(self, row):
        row[9] = float(row[9].replace(',', ''))  # "Prize_Pool" is the 10th field in the sample header
        # and so on
        return row

if __name__ == "__main__":
    CsvImport().Run("./Test1.csv")
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)
