merge two big dataframes - python

I have two big dataframes: one contains 3M rows and the other contains 2M rows.
1st dataframe:
              sacc_id$                 id$        creation_date
0   0011200001LheyyAAB  5001200000gxTeGAAU  2017-05-30 13:25:07
2nd dataframe:
              sacc_id$        opp_line_id$  oppline_creation_date
0   001A000000hAUn8IAG  a0WA000000BYKoWMAX              2013-10-26
I need to merge them:
case = pd.merge(limdata, df_case, left_on='sacc_id$', right_on='sacc_id$')
But I get a memory error:
pandas/_libs/join.pyx in pandas._libs.join.inner_join()
MemoryError:
Is there another way to do this efficiently? I read in some discussions here that Dask can help, but I do not understand how to use it in this context.
Any help would be appreciated.
Thank you.

I suggest using Dask when you are dealing with large dataframes. Dask supports the Pandas dataframe and NumPy array data structures, and can either run on your local computer or scale up to run on a cluster.
You can easily convert your Pandas dataframe to a Dask dataframe, which is made up of smaller, partitioned Pandas dataframes and therefore supports a subset of the Pandas query syntax.
Here is an example of how you can do it:
import dask.dataframe as dd

limdata = dd.read_csv(path_to_file_1)
df_case = dd.read_csv(path_to_file_2)
case = dd.merge(limdata, df_case, left_on='sacc_id$', right_on='sacc_id$')
There are best-practice tips on how to partition your dataframes to get better performance; I recommend reading up on them. It is also good practice not to have special characters like $ in your column names.
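Here is a rough sketch that combines both tips; the cleaned-up column name "sacc_id" and the output file pattern are illustrative assumptions, not from the original post:
import dask.dataframe as dd

# Read both files lazily and drop the special character from the join key.
limdata = dd.read_csv(path_to_file_1).rename(columns={'sacc_id$': 'sacc_id'})
df_case = dd.read_csv(path_to_file_2).rename(columns={'sacc_id$': 'sacc_id'})

# With the same key name on both sides, `on=` is enough.
case = dd.merge(limdata, df_case, on='sacc_id')

# Nothing is computed yet; write the result out partition by partition
# instead of calling .compute(), which would pull everything into memory.
case.to_csv('merged-*.csv', index=False)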

Related

Accessing a value from Dask using .loc

For the life of me, I can't figure out how to combine these two dataframes. I am using the newest versions of all the software involved, including Python, Pandas and Dask.
#pandasframe has 10k rows and 3 columns -
['monkey','banana','furry']
#daskframe has 1.5m rows, 1 column, 135 partitions -
row.index: 'monkey_banana_furry'
row.mycolumn = 'happy flappy tuna'
My dask dataframe has a string as its index for accessing.
When I do daskframe.loc[index_str] it returns a dask dataframe, but I thought it was supposed to return one single specific row, and I don't know how to access the row/value that I need from that dataframe. What I want is to input the index and get back one specific value.
What am I doing wrong?
Even pandas.DataFrame.loc doesn't return a scalar if you don't specify a label for the columns.
Anyway, to get a scalar in your case, you first need to call dask.dataframe.DataFrame.compute so you get a pandas object back (since dask.dataframe.DataFrame.loc returns a dask dataframe), and only then can you use the pandas .loc.
Assuming dfd is your dask dataframe, try this:
dfd.loc[index_str].compute().loc[index_str, "mycolumn"]
Or this:
dfd.loc[index_str, "mycolumn"].compute().iloc[0]

Values of the columns are null and swapped in pyspark dataframe

I am using pyspark==2.3.1. I performed data preprocessing with pandas and now I want to convert my preprocessing function from pandas to pyspark. But while reading the CSV file with pyspark, a lot of values become null in a column that actually has values. If I try to perform any operation on this dataframe, it swaps the values of that column with other columns. I also tried different versions of pyspark. Please let me know what I am doing wrong. Thanks.
Result from pyspark (screenshot):
The values of the column "property_type" come out as null, but the actual dataframe has values there instead of null.
CSV file (screenshot):
But pyspark works fine with small datasets.
In our case we faced a similar issue. Things you need to check:
Check whether your data contains " (double quotes); pyspark treats it specially and the data can end up looking malformed.
Check whether your csv data has multiline records.
We handled this situation by specifying the following configuration:
spark.read.options(header=True, inferSchema=True, escape='"').option("multiline",'true').csv(schema_file_location)
Are you limited to the CSV file format?
Try parquet. Just save your DataFrame in pandas with .to_parquet() instead of .to_csv(). Spark works really well with this format.
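A minimal sketch of that round trip, assuming a SparkSession is available; the file names are illustrative and writing parquet from pandas needs pyarrow or fastparquet installed:
import pandas as pd
from pyspark.sql import SparkSession

df = pd.read_csv("preprocessed.csv")            # your preprocessed pandas frame
df.to_parquet("preprocessed.parquet", index=False)

spark = SparkSession.builder.getOrCreate()
sdf = spark.read.parquet("preprocessed.parquet")
sdf.printSchema()   # types come from the parquet metadata, not from inference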

How to merge Python Dask Dataframes into one on columns?

I've got a little problem. I have two dask dataframes in the following format:
#DF1.csv
DATE|EVENTNAME|VALUE
#DF2.csv
DATE|EVENTNAME0|EVENTNAME1|...|EVENTNAMEX
I want to merge the values from DF1.csv into DF2.csv, at time t (DATE) and column (EVENTNAME). I use Dask at the moment because I'm working with huge datasets (~50 GB). I noticed that you can't use direct assignment of values in Dask, so I tried dd.Series.where:
df[nodeid].where(time, value)  # results in an error
I also tried a merge, but the resulting Dask dataframe had no partitions, which results in a MemoryError because all the data is loaded into memory when I use the .to_csv('data-*.csv') method. It should be easy to merge the dataframes, but I have no clue at the moment. Is there a Dask pro who could help me out?
Edit://
This works well in pandas but not with dask:
for row in df.iterrows():
    df2.loc[row[0], row[1][0]] = row[1][1]
I tried something like this:
for row in df.iterrows():
    df2[row[1][0]] = df2[row[1][0]].where(row[0], row[1][1])
# Results in an error => raise ValueError('Array conditional must be same shape as '
Any ideas?
For everyone who is interested, you can use:
# DF1
df.pivot(index='date', columns='event', values='value')  # to create DF2, memory efficient
See also: https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
Before, it took a huge amount of time, was horribly memory hungry, and did not bring up the results I was looking for. Just use Pandas pivot if you are trying to alter your dataframe schema.
Edit:// And there is no reason to use Dask anymore, which speeds up the whole process even further ;)
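For anyone following along, here is a small made-up illustration of that pivot step (the dates, event names and values are invented):
import pandas as pd

# Long format, like DF1: one row per (date, event) pair.
df1 = pd.DataFrame({
    'date':  ['2019-01-01', '2019-01-01', '2019-01-02'],
    'event': ['EVENTNAME0', 'EVENTNAME1', 'EVENTNAME0'],
    'value': [1.0, 2.0, 3.0],
})

# Wide format, like DF2: one row per date, one column per event,
# with NaN where an event did not occur on that date.
df2 = df1.pivot(index='date', columns='event', values='value')
print(df2)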

Best way to compare Pandas dataframe with csv file

I have a number of tests where the Pandas dataframe output needs to be compared with a static baseline file. My preferred option for the baseline file format is the csv format for its readability and easy maintenance within Git. But if I were to load the csv file into a dataframe, and use
A.equals(B)
where A is the output dataframe and B is the dataframe loaded from the CSV file, inevitably there will be differences because the csv file does not record datatypes and the like. So my rather contrived solution is to write dataframe A to a CSV file, load it back the same way as B, and then ask whether they are equal.
Does anyone have a better solution that they have been using for some time without any issues?
If you are worried about the datatypes of the csv file, you can load it as a dataframe with specific datatypes as follows:
import pandas as pd
B = pd.read_csv('path_to_csv.csv', dtype={"col1": "int", "col2": "float64", "col3": "object"})
This ensures that each column of the csv is read as a particular data type.
After that you can just compare the dataframes easily by using
A.equals(B)
EDIT:
If you need to compare a lot of pairs, another way to do it would be to compare the hash values of the dataframes instead of comparing each row and column of the individual dataframes:
hashA = hash(A.values.tobytes())
hashB = hash(B.values.tobytes())
Now compare these two hash values, which are just integers, to check whether the original dataframes are the same or not.
Be careful though: I am not sure whether the data types of the original dataframes would matter or not. Be sure to check that.
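If the dtype question is a concern, one hedged alternative (not what the answer above used) is pandas' own hashing utility, pandas.util.hash_pandas_object, which hashes the actual cell values row by row, including object columns:
import pandas as pd
from pandas.util import hash_pandas_object

# A and B are the two dataframes from the question.
# hash_pandas_object returns one uint64 per row (index included by default);
# summing gives a single fingerprint per frame that can be compared cheaply.
same = hash_pandas_object(A).sum() == hash_pandas_object(B).sum()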
I came across a solution that does work for my case by making use of Pandas testing utilities.
from pandas.testing import assert_frame_equal  # pandas.util.testing in older pandas versions
Then call it from within a try except block where check_dtype is set to False.
try:
    assert_frame_equal(A, B, check_dtype=False)
    print("The dataframes are the same.")
except AssertionError:
    print("Please verify data integrity.")
(A != B).any(axis=1) returns a Series of Boolean values that tells you which rows are equal and which ones aren't ...
Boolean values are internally represented by 1s and 0s, so you can do a sum() to check how many rows were not equal.
sum((A != B).any(axis=1))
If you get an output of 0, that would mean all rows were equal.
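A quick made-up example of how that reads in practice:
import pandas as pd

# Two small frames with the same shape and column order (made-up data).
A = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
B = pd.DataFrame({'x': [1, 9, 3], 'y': ['a', 'b', 'z']})

mismatch = (A != B).any(axis=1)
print(int(mismatch.sum()))   # 2 -> two rows differ
print(A[mismatch])           # inspect the offending rows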

Working on 50 million rows in pandas (python)

I am working on a dataframe of 50 million rows in pandas. I need to run through a column and extract specific parts of the text. The column has string values that follow 4 or 5 patterns. I need to extract the text and replace the original string. I am using the apply function and regex for this, and it takes close to a day to execute. I feel this is inefficient, or is this normal? Is there an approach I am missing to make it faster?
Here are the docs:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
http://pandas.pydata.org/pandas-docs/stable/text.html#extracting-substrings
Replacing text is easy. No, a day isn't normal. Get rid of all the lists you had in an earlier version of this post; you don't need them. Add columns to the dataframe if you need more space for data. Learn the data types to make the data smaller.
import pandas as pd
df = pd.DataFrame()  # import your data at this step
df['column'].str.extract(regex_thingy_here)
I'd write more but you took the code down.
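As a hypothetical illustration of the pattern (the column name, sample strings and regex below are made up, since the original code was removed):
import pandas as pd

df = pd.DataFrame({'column': ['ID-123 north', 'ID-456 south', 'ID-789 east']})

# Vectorized extraction in one pass over the column, no .apply needed.
df['column'] = df['column'].str.extract(r'ID-(\d+)', expand=False)
print(df['column'].tolist())   # ['123', '456', '789']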
