Merge Many CSV Files Leads to Kernel Death - python

I need to preprocess a lot of CSV tables to feed them to an autoencoder.
Using pandas, I read all these tables as data frames. Then I need to merge them on a shared key (id): merged = pd.merge(df, df1, on='id', how='left').
However, after a couple of merges the resulting table becomes very big and kills the kernel. This is the last shape I got before the kernel died: merged.shape = (29180782, 71), and I still need to merge many more tables.
All the tables look like this, but with more rows and columns (the value in each column represents a category):
df:                          df1:
     id  a  b  c  d              id  e  f  g  h
0  2000  1  1  1  3           2000  1  1  1  1
1  2001  2  1  1  3           2001  2  0  0  3
2  2002  1  3  1  2           2002  1  3  1  2
3  2003  2  2  1  1           2003  1  0  1  1
I have tried feather, but it doesn't help. I also tried downcasting the column types, df['a'] = pd.to_numeric(df['a'], downcast='unsigned'), but saw no difference in table size. The last solution that came to my mind was using chunks. I tried the code below with different chunk sizes, but the kernel died again:
for chunk in pd.read_csv('df1', chunksize=100000, low_memory=False):
    df = pd.merge(df, chunk, on='id', how='left')
So I decided to write to a file instead of using a variable, to keep the kernel from dying. First, I saved the last merged table to a CSV file so that I could read it back in chunks for the next merge.
lastmerged.to_csv(r'/Desktop/lastmerged.csv', index=False)
And then:
from csv import writer
for chunk in pd.read_csv('lastmerged.csv', chunksize=100000, low_memory=False):
    newmerge = pd.merge(df1, chunk, on='id', how='right')
    with open('newmerge.csv', 'a+', newline='') as write_obj:
        csv_writer = writer(write_obj)
        for i in range(len(newmerge)):
            csv_writer.writerow(newmerge.loc[i, :])
I did try this piece of code on some small tables and got the desired result, but for my real tables it ran for so long that I had to stop the kernel :| Besides, the code doesn't seem efficient!
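(For reference, the same chunked write can be done without the Python-level row loop by appending each merged chunk with DataFrame.to_csv in append mode; a minimal sketch, assuming df1 is already in memory and the same file names as above:
import pandas as pd
first = True
for chunk in pd.read_csv('lastmerged.csv', chunksize=100000, low_memory=False):
    newmerge = pd.merge(df1, chunk, on='id', how='right')
    # append each merged chunk; write the header only for the first one
    newmerge.to_csv('newmerge.csv', mode='a', header=first, index=False)
    first = False
)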
In a nutshell, my question is how to merge tables that keep getting larger and larger without running out of memory and killing the kernel.
P.S. I have already tried Google Colab, Jupyter, and the terminal; they all behave the same.

You can collect them in a list and use
total_df = pd.concat([df1,df2,df3,df4...,dfn],axis = 1)
You can also use
for name in filenames:
    df = pd.concat([df, pd.read_csv(name, index_col=False)])
This way, you can get past the memory problem.
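A minimal runnable sketch of that idea, assuming the table paths are collected in a list like the filenames loop above (the paths here are hypothetical) and that every table carries the shared id column; setting id as the index lets a single concat line the tables up the way the merges did:
import pandas as pd

filenames = ['df1.csv', 'df2.csv', 'df3.csv']  # hypothetical paths
# read each table once and key it by the shared id column
frames = [pd.read_csv(name).set_index('id') for name in filenames]
# one concat along the columns joins everything on id in a single pass
total_df = pd.concat(frames, axis=1)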

You can convert your pandas dataframes to Dask dataframes and then merge them with dd.merge().
import dask.dataframe as dd
d_df = dd.from_pandas(df, chunksize=10000)
For data that fits into RAM, pandas is often faster and easier to use than Dask DataFrame, but when RAM is the bottleneck, Dask can work on the data out of core, spilling to disk as needed.
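A minimal sketch of that approach, assuming the tables live in CSV files that share the id column (the file names here are hypothetical):
import dask.dataframe as dd

left = dd.read_csv('df.csv')
right = dd.read_csv('df1.csv')
# the merge is lazy; partitions are only loaded when the result is computed or written
merged = dd.merge(left, right, on='id', how='left')
# write straight to disk instead of materialising the whole result in RAM
merged.to_csv('merged-*.csv', index=False)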

Related

Analyzing workflow status, add sum, write workflow process in data

I have a CSV file that contains a record of a workflow: for each timestamp it contains the status. So I do have the time and day when something was done; however, I have already sorted the data in ascending order, which is sufficient for the next step, so the timestamps are not included in this sample data. My sample data looks like this (the CSV files Example1.csv and Example2.csv are attached; the preview in Google looks wrong because the decimal "," separator is not recognized properly):
As I said, these files are already sorted in ascending order, and the status can be thought of as steps in a workflow: work started, continued, finished, clean up.
Now I want to detect suspicious entries, for example if someone finished work without actually starting it, or other unusual "patterns". What I would like to have is an overview of all the different workflows.
1.
I would like to have the count/number of occurrences per unique workflow. I managed to implement this. My code is as follows:
import pandas as pd
from collections import OrderedDict
df=pd.read_csv(r'C:\Users\PC\Desktop\Example2.csv', sep=";", decimal=",", encoding="utf-8-sig")
df['Status']=df['Status'].astype(str)
df['Status'].fillna('No', inplace=True)
df=df.groupby(['Worker'])['Status'].apply('|'.join).reset_index()
df=df.groupby(['Status']).count()
df = df.rename(columns={'Worker': 'Count'})
#df['Sum']=df.groupby(['Amount']).sum()
df.to_csv(r'C:\Users\PC\Desktop\outtest.csv', sep=';', encoding="utf-8-sig")
Which works. I get the following output:
or in case of using numbers:
Which is exactly what I want. Here I can see for example that two workers started work and then directly cleaned up.
2.
Now I would like to have the sum of the amounts too. The amount per worker is always the same, so this number does not vary within a worker; for example, as shown in the sample data, worker 1 always has 2500,24. What I would like to have is this output:
I tried to implement it by adding a simple line:
df['Sum']=df.groupby(['Amount']).sum()
But this throws an error. The reason is that Amount is simply not available at this step. I could not figure out how to get this working.
How can I add the sum?
3.
I would like to "write the type of workflow which was counted for this worker" back to my original data file. So in my original data it should look like this (for simplicity reasons lets take the version where the status is represented with numbers):
How can I implement this?
(I thought about this, and it does not actually need to be combined with the results from my previous code. I basically just need to expand/transpose the status for each worker and write it to a new variable/column. The problem here is that I do not know in advance how many statuses/steps a worker has. So somehow I need to implement "if the next entry belongs to the same worker, then attach the status value to an existing variable with a "|"", and this becomes my new column. But maybe I am wrong here and there is another implementation.)
To calculate the sum of amounts, we can first group by the Worker column to get the workflow and the amount for each worker (I'm taking 'first' for the amount since it is the same for all rows of the same worker). Then we group by again on the workflow (which is in the Status column after the first groupby) and calculate counts and sums:
df = pd.read_csv('Example2.csv', sep=';', decimal=',')
df['Status'] = df['Status'].astype(str)
z = df.groupby('Worker').agg({
    'Status': '|'.join,
    'Amount': 'first',
}).groupby('Status')['Amount'].agg(['count', 'sum']).reset_index()
# save and output
z.to_csv('outtest.csv', sep=';')
z
Output:
Status count sum
0 Started work 1 2900.00
1 Started work|Clean up 2 3600.18
2 Started work|Continued work|Finished|Clean up 2 6700.74
3 Started work|Continued work|Finished|Clean up|... 1 4200.98
To add workflow as a column, we can use transform:
df = pd.read_csv('Example1.csv', sep=';', decimal=',')
df['Status'] = df['Status'].astype(str)
# add workflow column
df['workflow'] = df.groupby('Worker')['Status'].transform('|'.join)
# save and output
df.to_csv('Example1_with_workflow.csv', sep=';', decimal=',')
df
Output (using the numeric Example1.csv here to make it more readable, but will work with either of them, of course):
Worker Status Amount workflow
0 1 1 2500.24 1|2|3|4
1 1 2 2500.24 1|2|3|4
2 1 3 2500.24 1|2|3|4
3 1 4 2500.24 1|2|3|4
4 2 1 2400.00 1|4
5 2 4 2400.00 1|4
6 3 1 4200.98 1|2|3|4|5
7 3 2 4200.98 1|2|3|4|5
8 3 3 4200.98 1|2|3|4|5
9 3 4 4200.98 1|2|3|4|5
10 3 5 4200.98 1|2|3|4|5
11 4 1 1200.18 1|4
12 4 4 1200.18 1|4
13 5 1 4200.50 1|2|3|4
14 5 2 4200.50 1|2|3|4
15 5 3 4200.50 1|2|3|4
16 5 4 4200.50 1|2|3|4
17 6 1 2900.00 1
P.S. If I read correctly, in (1) there was no question as everything worked as expected, right?

Memory Error: happening on Linux but not Mac OS

I have a big pandas dataframe (7 GiB) that I read from a CSV file. I need to merge this dataframe with another, much smaller one; let's say its size is negligible.
I'm aware that a merge operation in pandas keeps the two dataframes being merged in memory plus the merged result. Since I have only 16 GiB of RAM, the merge fails with a memory error when I run it on Linux (my system itself consumes around 3-4 GiB).
I also tried to run the merge on a Mac, with 16 GiB as well. The system consumes about 3 GiB of RAM by default. The merge completed on the Mac, with the memory going no higher than 10 GiB.
How is this possible? The version of pandas is the same, the dataframe is the same. What is happening here?
Edit:
Here is the code I use to read/merge my files:
# Read the data for the stations, stored in a separate file
stations = pd.read_csv("stations_with_id.csv", index_col=0)
stations.set_index("id_station")

list_data = list()
data = pd.DataFrame()

# Merge all pollutants data in one dataframe
# Probably not the most optimized approach ever...
for pollutant in POLLUTANTS:
    path_merged_data_per_pollutant = os.path.join("raw_data", f"{pollutant}_merged")
    print(f"Pollutant: {pollutant}")
    for f in os.listdir(path_merged_data_per_pollutant):
        if ".csv" not in f:
            print(f"passing {f}")
            continue
        print(f"loading {f}")
        df = pd.read_csv(
            os.path.join(path_merged_data_per_pollutant, f),
            sep=";",
            na_values="mq",
            dtype={"concentration": "float64"},
        )
        # Drop useless columns and translate useful ones to English
        # Do that here to limit memory usage
        df = df.rename(index=str, columns=col_to_rename)
        df = df[list(col_to_rename.values())]
        # Date formatted as YYYY-MM
        df["date"] = df["date"].str[:7]
        df.set_index("id_station")
        df = pd.merge(df, stations, left_on="id_station", right_on="id_station")
        # Filter entries to France only (only the metropolitan area) based on GPS coordinates
        df = df[(df.longitude > -5) & (df.longitude < 12)]
        list_data.append(df)
    print("\n")

data = pd.concat(list_data)
The only column that is not a string is concentration, and I specify the type when I read the csv.
The stations dataframe is < 1 MiB.
macOS has had compressed memory since Mavericks. If your dataframe is not literally random data, it won't take up the full 7 GiB in RAM.
There are ways to get compressed memory on Linux as well, but this isn't necessarily enabled. It depends on your distro and configuration.
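Whichever OS you are on, it can also help to check how much RAM the frame actually occupies once loaded, since the in-memory size can differ a lot from the 7 GiB CSV on disk; a quick check (df standing for the already-loaded frame):
# in-memory footprint in GiB, including object/string columns
print(df.memory_usage(deep=True).sum() / 2**30)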

Force Pandas to keep multiple columns with the same name

I'm building a program that collects data and adds it to an ongoing Excel sheet weekly (read_excel() and concat() with the new data). The issue I'm having is that I need the columns to have the same name for presentation purposes (it doesn't look great with x.1, x.2, ...).
I only need this on the final output. Is there any way to accomplish this? Would it be too time consuming to modify pandas?
You can create a list of custom headers that will be written to Excel:
newColNames = ['x','x','x'.....]
df.to_excel(path,header=newColNames)
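A small runnable sketch of that, assuming a three-column frame, a hypothetical output path, and that an Excel writer such as openpyxl is installed:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['x.1', 'x.2', 'x.3'])
newColNames = ['x', 'x', 'x']
# header only renames the columns in the written file, not in the DataFrame
df.to_excel('report.xlsx', header=newColNames, index=False)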
You can add spaces to the end of the column names. They will appear the same in Excel, but pandas can tell the difference.
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['x','x ','x '])
df
x x x
0 1 2 3
1 4 5 6
2 7 8 9
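Writing that frame out then shows three identically-labelled columns in the spreadsheet, since Excel does not display the trailing spaces; for example (output path hypothetical):
# the padded names survive pandas round-trips but render simply as 'x' in Excel
df.to_excel('weekly_report.xlsx', index=False)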

Create different dataframes from 1 excel file using selected columns

I have a large data frame with dates in the first column and stock names at the top, with columns of price data.
Date      Stock 1  Stock 2  Stock 3
1/2/2001     2.77     6.00    11.00
1/3/2001     2.89     6.08    11.10
1/4/2001     2.86     6.33    11.97
1/5/2001     2.80     6.58    12.40
What I want to do is make multiple dataframes from this one file, each with the date and the price of one stock. So essentially in this example you would have 4 dataframes (the file has more than 1000 stocks, so this is just a sample). The dataframes would be:
DF1 = Date and Stock 1
DF2 = Date and Stock 2
DF3 = Date and Stock 3
DF4 = Date and Stock 4
I am then going to take each dataframe and add more columns to each of them once they are created.
I was reading through previous questions and came up with usecols, but I can't seem to get the syntax right. Can someone help me out? Also, if there is a better way to do this, please advise. Since I have more than 1000 stocks, speed in running through the file is important.
This is what I have so far, but I am not sure I am heading down the most efficient path. It gives the following error (among others, it seems):
ValueError: The elements of 'usecols' must either be all strings or all integers
df2 = pd.read_csv('file.csv')
# read in the file to get the column headers
for i in df2:
    a = 0
    # always want the 1st (date) column as the 1st column in each DF
    d = pd.read_csv('file.csv', usecols=[a, i])
    # read in the file with the proper columns: always column 0,
    # plus column 1, next loop columns 0,2, next loop 0,3, etc.
    dataf[i] = pd.DataFrame(d)  # actually create the DataFrame
It also seems inefficient to have to read in the file each time. Maybe there is a way to read in the file once and then create the dataframes. Any help would be appreciated.
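For reference, the ValueError above is raised because usecols mixes an integer position (a) with a column label (i); the entries must be all integers or all strings. A small sketch of both valid forms, with hypothetical column names:
# all integer positions: the first column plus the fourth column
d = pd.read_csv('file.csv', usecols=[0, 3])
# or all labels: the date column plus one stock column
d = pd.read_csv('file.csv', usecols=['Date', 'Stock 3'])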
Consider building a list of integer pairings ([0,1], [0,2], [0,3], etc.) to slice the master dataframe by columns. Then iteratively append the dataframes to a list, which is the preferred setup (one container holding similarly structured elements) to avoid thousands of dfs flooding your global environment.
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, '%m/%d/%Y')
masterdf = pd.read_csv("DataFile.csv", parse_dates=[0], date_parser=dateparse)

colpairs = [[0, i] for i in range(1, len(masterdf.columns))]

dfList = []
for cols in colpairs:
    # integer positions, so slice with iloc
    dfList.append(masterdf.iloc[:, cols])

print(len(dfList))
print(dfList[0].head())
print(dfList[1].head())
Alternatively, consider a dictionary of dataframes with stock names as keys for a container, where colpairs carry string literal pairings as opposed to integers:
colpairs = [['Date', masterdf.columns[i]] for i in range(1, len(masterdf.columns))]
dfDict = {}
for cols in colpairs:
    dfDict[cols[1]] = masterdf[cols]
print(len(dfDict))
print(dfDict['Stock 1'].head())
print(dfDict['Stock 2'].head())
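Since the plan is to add more columns to each of these frames afterwards, note that the dictionary values above are column slices of masterdf, so it is safer to take a copy before mutating one; a hedged example, with the Return column purely illustrative:
stock1 = dfDict['Stock 1'].copy()
# add a derived column, e.g. the day-over-day percentage change of the price
stock1['Return'] = stock1['Stock 1'].pct_change()
print(stock1.head())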

Reading values in column x from specific worksheets using pandas

I am new to python and have looked at a number of similar problems on SO, but cannot find anything quite like the problem that I have and am therefore putting it forward:
I have an .xlsx dataset with data spread across eight worksheets and I want to do the following:
sum the values in the 14th column in each worksheet (the format, layout and type of data (scores) is the same in column 14 across all worksheets)
create a new worksheet with all summed values from column 14 in each worksheet
sort the totaled scores from highest to lowest
plot the summed values in a bar chart to compare
I cannot even begin this process because I am struggling at the first point. I am using pandas and am having trouble reading the data from one specific worksheet: I only seem to be able to read the data from the first worksheet (I print the outcome to see what is being read in).
My first attempt produces an 'Empty DataFrame':
import pandas as pd
y7data = pd.read_excel('Documents\\y7_20161128.xlsx', sheetname='7X', header=0,index_col=0,parse_cols="Achievement Points",convert_float=True)
print y7data
I also tried the following, but it only exported the entire first worksheet's data rather than the whole document (I am trying this so that I can understand how to export all of the data). I chose this approach thinking that exporting the data to a .csv might give me a clearer view of what went wrong, but I am none the wiser:
import pandas as pd
import numpy as np
y7data = pd.read_excel('Documents\\y7_20161128.xlsx')
y7data.to_csv("results.csv")
I have tried a number of different ways to specify which column to read within each worksheet, but cannot get this to work; it only ever produces the results for the first worksheet.
How can I, firstly, read the data from column 14 in every worksheet, and then carry out the rest of the steps?
Any guidance would be much appreciated.
UPDATE (for those using Enthought Canopy and struggling with openpyxl):
I am using the Enthought Canopy IDE and was constantly receiving an error message about openpyxl not being installed, no matter what I tried. For those of you having the same problem, save yourself lots of time and read this post. In short, register for an Enthought Canopy account (it's free), then run this command via the Canopy Command Prompt:
enpkg openpyxl 1.8.5
I think you can use this sample file:
First, read column 14 of each sheet into a list of DataFrames called y7data:
y7data = [pd.read_excel('y7_20161128.xlsx', sheetname=i, parse_cols=[13]) for i in range(3)]
print (y7data)
[ a
0 1
1 5
2 9, a
0 4
1 2
2 8, a
0 5
1 8
2 5]
Then concat all the columns together; I add keys, which are used for the x axis in the graph. Then sum all columns, remove the second level of the MultiIndex (a, a, a in the sample data) with reset_index, and finally sort_values:
print (pd.concat(y7data, axis=1, keys=['a','b','c']))
a b c
a a a
0 1 4 5
1 5 2 8
2 9 8 5
summed = (pd.concat(y7data, axis=1, keys=['a','b','c'])
            .sum()
            .reset_index(drop=True, level=1)
            .sort_values(ascending=False))
print (summed)
c 18
a 15
b 14
dtype: int64
Create a new DataFrame df, set the column names, and write it with to_excel:
df = summed.reset_index()
df.columns = ['a','summed']
print (df)
a summed
0 c 18
1 a 15
2 b 14
If you need to add a new sheet, use this solution:
from openpyxl import load_workbook
book = load_workbook('y7_20161128.xlsx')
writer = pd.ExcelWriter('y7_20161128.xlsx', engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
df.to_excel(writer, "Main", index=False)
writer.save()
Finally, plot with Series.plot.bar:
import matplotlib.pyplot as plt
summed.plot.bar()
plt.show()
From what I understand, your immediate problem is managing to load the 14th column from each of your worksheets.
You could use ExcelFile.parse instead of read_excel and loop over your sheets.
xls_file = pd.ExcelFile('Documents\\y7_20161128.xlsx')
worksheets = ['Sheet1', 'Sheet2', 'Sheet3']
series = [xls_file.parse(sheet, parse_cols=[13]) for sheet in worksheets]
# combine the per-sheet columns side by side
df = pd.concat(series, axis=1)
And from that, sum() your columns and keep going.
Using ExcelFile and then ExcelFile.parse() has the advantage of loading your Excel file only once and then iterating over each worksheet. Using read_excel causes the Excel file to be reloaded on every iteration, which is wasteful.
Documentation for pandas.ExcelFile.parse.
