I have a csv file where I validate each cell by some rule based on its column.
df.drop(df[~validator].index, inplace=True)
The validator here can be different functions, checking e.g. whether a cell is integer-like, or whether a string inside a cell is shorter than 10 characters, etc. So a cell alone has all the information needed to be validated, without requiring any other cells from the same row or the same column.
And I have this:
bad_dfs = []
for validator, error in people_csv_validators:
    bad_dfs.append(df.loc[~validator])
    df.drop(df[~validator].index, inplace=True)
bad_df = pd.concat(bad_dfs)
Previously the dataframes were smaller than 1M rows with 20 columns or fewer. The column count hasn't changed, but the row count has increased by a lot, and I want to be able to process this with a fixed amount of memory. So I figured I'd chunk it, since the validation doesn't depend on anything outside the cell.
Now, I know I can just pass a chunk argument to the read_csv I have, then write to a csv file chunk by chunk with mode="a", but I heard about dask and a couple of other libraries that do something similar underneath with their dataframe class, and I figured there might be other methods to do this.
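For reference, the plain-pandas chunked version I have in mind looks roughly like this (just a sketch; it assumes each validator can be applied to a single chunk on its own and returns a boolean mask, and path/output_path/people_csv_validators are the same placeholders as above):
import pandas as pd

first_chunk = True
for chunk in pd.read_csv(path, chunksize=1_000_000):
    bad_chunks = []
    for validator, error in people_csv_validators:
        mask = validator(chunk)              # boolean mask, True = valid
        bad_chunks.append(chunk.loc[~mask])  # keep the rejected rows
        chunk = chunk.loc[mask]              # continue with the valid rows
    chunk.to_csv(output_path, mode="w" if first_chunk else "a",
                 header=first_chunk, index=False)
    # bad rows for this chunk could be written to a separate file the same way
    first_chunk = False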
Is there any standard way of doing this, like
df = pd.read_csv(path, chunk_in_the_background_and_write_to_this_file=output_path, chunk_count=10^6)
some_row_based_operations(df)
# It automatically reads the first 10^6 rows and processes them,
# then writes them to `output_path` and then reads the next 10^6 rows and so on
Again, this is rather a simple thing but I want to know if there is a canonical way.
The rough code to do this with dask is as follows:
import dask.dataframe as dd

# let's use ddf for dask df
ddf = dd.read_csv(path)  # can also provide a list of files

def some_row_based_operations(df):
    # a function that accepts and returns a pandas df,
    # implementing the required logic
    return df

# the line below is fine only if the function is row-based
# (no dependencies across different rows)
modified_ddf = ddf.map_partitions(some_row_based_operations)

# the single_file kwarg is only needed if you want one file at the end
modified_ddf.to_csv(output_path, single_file=True)
One caution: with the approach above there should be no inplace changes to the df inside some_row_based_operations, but hopefully making a change like the one below is feasible:
# change this: df.drop(df[~validator].index, inplace=True)
# also note, that this logic should be part of `some_row_based_operations`
df = df.drop(df[~validator].index)
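For the validation in the question, some_row_based_operations could look roughly like this (a sketch only, assuming each validator is a callable that takes a pandas df and returns a boolean mask of valid rows; collecting the bad rows would work analogously):
def some_row_based_operations(df):
    for validator, error in people_csv_validators:
        mask = validator(df)   # True = the row passes this validator
        df = df.loc[mask]      # keep only valid rows, no inplace drop
    return df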
Related
Download the Data Here
Hi, I have data something like below, and would like to multi-label it.
Something like this: [image: target]
But the problem here is that data is lost when I multi-label it, something like below:
[image: issue]
using the following code:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(sparse_output=True)
df_enc = df.drop('movieId', axis=1).join(df.movieId.str.join('|').str.get_dummies())
Can someone help me? Feel free to download the dataset, thank you.
That column, when read in with pandas, will be stored as a string, so first we need to convert it to an actual list.
From there, use .explode() to expand that list into a Series (where the index will match the index it came from, and the values will be the values in that list).
Then crosstab that Series so that each distinct value becomes its own column, with a 1 in the rows where it occurs.
Then join that back up with the dataframe on the index values.
Keep in mind, when you do one-hot encoding with high cardinality, your table will blow up into a huge, wide table. I just did this on the first 20 rows and ended up with 233 columns; with the 225,000+ rows it'll take a while (maybe a minute or so) to process, and you end up with close to 1300 columns. This may be too complex for machine learning to do anything useful with (although it might work with deep learning). You could still try it and see what you get. What I would suggest is to find a way to simplify it a bit to make it less complex, perhaps by combining movie ids into a set number of genres or something like that, and then test whether simplifying it improves your model/performance.
import pandas as pd
from ast import literal_eval

df = pd.read_csv('ratings_action.csv')
# the movieId column is read in as a string, so parse it into real lists
df.movieId = df.movieId.apply(literal_eval)
# one row per (user, movie id) pair, original index preserved
s = df['movieId'].explode()
# crosstab into one 0/1 column per movie id, then join back on the index
df = df[['userId']].join(pd.crosstab(s.index, s))
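Alternatively, since MultiLabelBinarizer is already imported in the question, a roughly equivalent sketch (applied to the frame right after the literal_eval conversion, before the crosstab line above):
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(df['movieId'])    # one 0/1 column per movie id
genres = pd.DataFrame(encoded, columns=mlb.classes_, index=df.index)
df_enc = df[['userId']].join(genres)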
Can somebody help me solve the problem below?
I have a CSV which is relatively large, with over 1 million rows x 4000 columns. Case ID is one of the first column headers in the csv. Now I need to extract the complete rows belonging to a few case IDs, which are documented in a list of faulty IDs.
Note: I don't know the indices of the required case IDs.
Example: the CSV is production_data.csv and the faulty IDs are faulty_Id = [50055, 72525, 82998, 1555558].
Now, we need to extract the complete rows for faulty_Id = [50055, 72525, 82998, 1555558].
Best Regards
If your faulty ID is present under a header in the csv file, you can use a pandas dataframe: read_csv, set the index to the ID column, and extract rows based on the faulty IDs. For more info, attach sample data of the csv.
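A chunked sketch along those lines, so the whole file never has to fit in memory (the column name 'Case ID' is an assumption here, adjust it to the real header):
import pandas as pd

faulty_ids = [50055, 72525, 82998, 1555558]

pieces = []
for chunk in pd.read_csv('production_data.csv', chunksize=100_000):
    # keep only rows whose ID is in the faulty list
    pieces.append(chunk[chunk['Case ID'].isin(faulty_ids)])

faulty_rows = pd.concat(pieces)
faulty_rows.to_csv('faulty_rows.csv', index=False)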
"CSV, which is relatively large with over 1 million rows X 4000 columns"
As CSVs are just text files and it is probably too big to be feasible to load as a whole, I suggest using the fileinput built-in module. If the ID is the 1st column, create extractfaults.py as follows:
import fileinput

faulty = ["50055", "72525", "82998", "1555558"]

for line in fileinput.input():
    if fileinput.lineno() == 1:  # lineno() is 1-based, so this is the header
        print(line, end='')
    elif line.split(",", 1)[0] in faulty:
        print(line, end='')
and use it the following way:
python extractfaults.py data.csv > faultdata.csv
Explanation: keep lines which are either the 1st line (the header) or have one of the provided IDs (I used the optional 2nd .split argument to limit the number of splits to 1). Note the usage of end='', as fileinput keeps the original newlines. My solution assumes that IDs are not quoted and the ID is the first column; if either of these does not hold, feel free to adjust the code to your purposes.
The best way for you is to use a database like Postgres or MySQL. You can copy your data to the database first and then easily operate on rows and columns. A flat file in your case is not the best solution, since you need to load all the data from the file into memory to be able to process it, and opening the file takes a lot of time on top of that.
Let data be a giant pandas dataframe. It has many methods, and these methods do not modify it in place but return a new dataframe. How then am I supposed to perform multiple operations while maximizing performance?
For example, say I want to do
data = data.method1().method2().method()
where method1 could be set_index, and so on.
Is this the way you are supposed to do it? My worry is that pandas creates a copy every time I call a method, so that three copies of my data are made in the above, when in reality all I want is to modify the original one.
So is it faster to say
data = data.method1(inplace=True)
data = data.method2(inplace=True)
data = data.method3(inplace=True)
This is just way too verbose for me?
Yes, you can do that, i.e. apply methods one after the other. Since you overwrite data, you are not creating 3 copies, but only keeping one copy:
data = data.method1().method2().method()
However, regarding your second example, the general rule is to either write over the existing dataframe or do it "inplace", but not both at the same time: methods called with inplace=True return None, so data = data.method1(inplace=True) would actually leave you with data set to None.
data = data.method1()
or
data.method1(inplace=True)
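To see why combining both goes wrong, a small sketch (sort_values stands in for method1):
import pandas as pd

data = pd.DataFrame({"a": [3, 1, 2]})

# non-inplace: the method returns a new frame, which we rebind to the name
data = data.sort_values("a")

# inplace: the frame is modified in place and the call returns None,
# so assigning the result would throw the dataframe away
other = pd.DataFrame({"a": [3, 1, 2]})
result = other.sort_values("a", inplace=True)
print(result)   # None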
I have a pandas dataframe that holds file paths to .wav data. Can I use the pandas DataFrame.plot() function to plot the data referenced by those paths?
Example:
typical usage:
df.plot()
what I'm trying to do:
df.plot(df.path_to_data)???
I suspect some combination of apply and lambda will do the trick, but I'm not very familiar with these tools.
No, that isn't possible. plot operates on the data held in the pd.DataFrame object itself; here, df only contains the paths, not the data behind them. What you'd need to do is:
Load your dataframe using pd.read_* (usually, pd.read_csv(file)) and assign to df
Now call df.plot
So, in summary, you need -
df = pd.read_csv(filename)
... # some processing here (if needed)
df.plot()
As for the question of whether this can be done "without loading data in memory": you can't plot data that isn't in memory. If you want to, you can limit the number of rows you read, or you can load the file efficiently by reading it in chunks. You can also write code to aggregate/summarise the data, or sample it.
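For example, a couple of sketches of those ideas (nrows to limit what is read, chunks to aggregate; filename is the same placeholder as above):
import pandas as pd

# plot only the first 100k rows
pd.read_csv(filename, nrows=100_000).plot()

# or summarise chunk by chunk and plot the (much smaller) summary
chunk_means = [chunk.mean(numeric_only=True)
               for chunk in pd.read_csv(filename, chunksize=100_000)]
pd.DataFrame(chunk_means).plot()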
I think you first need to create the DataFrame, obviously by read_csv, and then call DataFrame.plot:
pd.read_csv('path_to_data').plot()
But if need plot DataFrames created from paths from column in DataFrame:
df.path_to_data.apply(lambda x: pd.read_csv(x).plot())
Or use custom function:
def f(x):
    pd.read_csv(x).plot()

df.path_to_data.apply(f)
Or use loop:
for x in df.path_to_data:
    pd.read_csv(x).plot()
Say we have a (huge) csv file with a header "data".
I load this file with pandas (imported as pd):
pd_data = pd.read_csv(csv_file)
Now, say the file has multiple column headers and many rows, so I would like to increase reading speed and reduce memory usage.
Is there a difference in performance between using:
(1) pd.read_csv(csv_file, usecols=['data'])
versus
(2) pd.read_csv(csv_file, usecols=[0])
(let us assume that 'data' is always column 0).
The reason for using (1) is the peace of mind that no matter where 'data' is actually stored in the file, we will load the right column.
Using (2) does require us to trust that column 0 is indeed 'data'.
The reason why I think (2) is faster is that (1) may have some overhead in looking for the column (unless it stops searching after matching the first column, of course).
Questions:
Which approach is faster among (1) and (2)?
Is there an even faster approach?
Thank you.