Combining two dataframes with pandas/numpy - python

I'm trying to write a script that gets data from two different databases and produces a CSV file with the combined data.
I managed to get the data using psycopg2 and pandas read_sql_query, turned the results into two different dataframes, and all of that works great. I wrote all of it with only a little information about those databases, so I used databases I had and some simple queries.
All of that is on my github:
https://github.com/tomasz-urban/sql-db-get
Now, with more detailed info about what needs to be done, I'm stuck...
In the first database there are user limitations, lim_val_1 and lim_val_2, together with user_id (a couple thousand rows). The second one holds usage, with val_1 and val_2 gathered periodically (a couple hundred thousand rows).
I need to get the rows where users reach their limits (it doesn't matter whether it is lim_val_1 or lim_val_2 or both; I need all of them).
To visualize it better, there are some simple tables in the link:
Databases info with output
My last approach:
result_query = df2.loc[(df2['val_1'] == df1['lim_val_1']) & (df2['val_2'] == df1['lim_val_2'])]
output_data = pd.DataFrame(result_query)
and I'm getting an error:
"ValueError: Can only compare identically-labeled Series objects"
I cannot label those columns the same, so I think this solution will not work for me. I also tried merging, with no result.
Could anyone help me with this?

Since df1 and df2 have different numbers of rows, you cannot compare the values directly; try joining the data frames and then selecting the rows where the conditions are met.
If user_id is the index of both dataframes:
df_3 = df1.join(df2)
df_3[
    (df_3['val_1'] == df_3['lim_val_1']) |
    (df_3['val_2'] == df_3['lim_val_2'])
]
You might want to replace == with >= in case the limits can be surpassed.
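To put that together end to end, here is a minimal sketch with made-up sample values, assuming user_id is set as the index on both frames and using >= so that exceeded limits are also caught:

import pandas as pd

# Hypothetical limits per user (df1) and usage rows (df2); user_id as the index.
df1 = pd.DataFrame({'user_id': [1, 2], 'lim_val_1': [10, 20], 'lim_val_2': [5, 8]}).set_index('user_id')
df2 = pd.DataFrame({'user_id': [1, 1, 2], 'val_1': [10, 3, 19], 'val_2': [2, 1, 8]}).set_index('user_id')

# Align every usage row with that user's limits, then keep rows at or over either limit.
df_3 = df2.join(df1)
over_limit = df_3[(df_3['val_1'] >= df_3['lim_val_1']) | (df_3['val_2'] >= df_3['lim_val_2'])]
over_limit.to_csv('over_limit.csv')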

Related

Converting for loop to numpy calculation for pandas dataframes

So I have a Python script that compares two dataframes and finds any rows that are not in both. It currently iterates through a for loop, which is slow.
I want to improve the speed of the process, and I know that iteration is the problem. However, I haven't had much luck using various pandas/numpy methods such as merge and where.
A couple of caveats:
The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
I want to only use the column names from one of the dataframes.
df_new represents new information to be checked against what is currently on file (df_current)
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if not df_new[new_col_name][i] in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue
# create a xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a bit unclear what you want without examples of each dataframe. But if you want to test unique IDs in differently named columns of two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe:
test_ids = df2['cola_id'].unique().tolist()
then filter the first dataframe for those IDs:
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works; it was supplied to me by someone much smarter:
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
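As a quick illustration of why that one-liner works, here is a sketch using the example key columns from the question (partno / mfgr_num); ~isin keeps only the df_new rows whose key is absent from df_current, and the output keeps df_new's columns only:

import pandas as pd

# Toy versions of the two frames, keyed on differently named columns.
df_current = pd.DataFrame({'partno': [123, 456], 'description': ['Logo T-Shirt', 'Knitted Shirt']})
df_new = pd.DataFrame({'mfgr_num': [456, 789], 'desc': ['Knitted Shirt', 'Logo Vest']})

current_col_name = 'partno'
new_col_name = 'mfgr_num'

# Vectorized anti-join: no Python-level loop over rows.
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
print(df_out)  # only the 789 / Logo Vest row remains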

How to compute multiple counts with different conditions on a pyspark DataFrame, fast?

Let's say I have this pyspark Dataframe:
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('BE',), ('France',), ('Latvia',)])
And let's say I want to collect various statistics about this data. For example, I might want to know how many rows use a 2-character country code and how many use longer country names:
count_short = data.where(F.length(F.col('Country')) == 2).count()
count_long = data.where(F.length(F.col('Country')) > 2).count()
This works, but when I want to collect many different counts based on different conditions, it becomes very slow even for tiny datasets. In Azure Synapse Studio, where I am working, every count takes 1-2 seconds to compute.
I need to do 100+ counts, and it takes multiple minutes to compute for a dataset of 10 rows. And before somebody asks, the conditions for those counts are more complex than in my example. I cannot group by length or do other tricks like that.
I am looking for a general way to do multiple counts on arbitrary conditions, fast.
I am guessing that the reason for the slow performance is that for every count call, my pyspark notebook starts some Spark processes that have significant overhead. So I assume that if there was some way to collect these counts in a single query, my performance problems would be solved.
One possible solution I thought of is to build a temporary column that indicates which of my conditions have been matched, and then call countDistinct on it. But then I would have individual counts for all combinations of condition matches. I also noticed that depending on the situation, the performance is a bit better when I do data = data.localCheckpoint() before computing my statistics, but the general problem still persists.
Is there a better way?
Function "count" can be replaced by "sum" with condition (Scala):
data.select(
  sum(
    when(length(col("Country")) === 2, 1).otherwise(0)
  ).alias("two_characters"),
  sum(
    when(length(col("Country")) > 2, 1).otherwise(0)
  ).alias("more_than_two_characters")
)
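Since the question is about pyspark, here is a sketch of the same single-select idea in Python, reusing the question's sample data; one Spark job returns both counts:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('BE',), ('France',), ('Latvia',)])

# Each condition becomes a 0/1 value that is summed; all sums are computed in one pass.
counts = data.select(
    F.sum(F.when(F.length(F.col('Country')) == 2, 1).otherwise(0)).alias('two_characters'),
    F.sum(F.when(F.length(F.col('Country')) > 2, 1).otherwise(0)).alias('more_than_two_characters')
).collect()[0]
print(counts['two_characters'], counts['more_than_two_characters'])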
While one way is to combine multiple queries into one, the other way is to cache the dataframe that is being queried again and again.
By caching the dataframe, we avoid re-evaluating it each time count() is invoked.
data.cache()
A few things to keep in mind: if you are applying multiple actions on your dataframe, there are a lot of transformations, and you are reading the data from an external source, then you should definitely cache the dataframe before you apply any action on it.
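For example, a sketch of how that looks with the data frame and conditions from the question (the first action materializes the cache; later counts reuse it):

from pyspark.sql import functions as F

data.cache()
data.count()  # triggers evaluation so the data is actually cached

count_short = data.where(F.length(F.col('Country')) == 2).count()
count_long = data.where(F.length(F.col('Country')) > 2).count()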
The answer provided by @pasha701 works, but you will have to keep adding columns for every country-code length you want to analyse.
You can use the code below to get the counts of the different country-code lengths in one single dataframe.
# import statements
from pyspark.sql.functions import *

# sample dataframe
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('ACE',), ('BE',), ('France',), ('Latvia',)])

# add a column that gives the length of each country code
data1 = data.withColumn("CountryLength", length(col('Country')))

# column names for the final output
outputcolumns = ["CountryLength", "RecordsCount"]

# select the CountryLength column, convert it to an RDD and run a map/reduce
# to count how many rows share each length
countrieslength = (data1.select("CountryLength")
                   .rdd.map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .toDF(outputcolumns)
                   .select("CountryLength.CountryLength", "RecordsCount"))

# display or show the dataframe to see the output
display(countrieslength)
If you want to apply multiple filter conditions to this dataframe, you can cache it and then get the counts for different combinations of records based on the country-code length.
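For what it's worth, the same per-length counts can be had without dropping to the RDD API; a sketch, reusing data1 from the snippet above:

# DataFrame-only equivalent: count records per country-code length.
countrieslength = data1.groupBy("CountryLength").count().withColumnRenamed("count", "RecordsCount")
countrieslength.show()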

How can I get the difference between two dataframes based on one column?

I have a dataframe (allPop) and a geodataframe (allTracts). I'm merging them on the column GEOID, which they both share:
newTracts = allTracts.merge(allPop, on='GEOID')
My problem is that I'm losing data on this merge, which conceptually shouldn't be happening. Each of the records in allPop should match one of the records from allTracts, but newTracts has a couple hundred fewer records than allPop. I'd like to be able to look at the records not being included in the merge to try to diagnose the problem. Is there a way to do this? Or else, can I find the difference between allPop and allTracts based on their 'GEOID' columns? I've seen how to do this when both dataframes have all the same column names/types, but can I do it based on only one column? I'm not sure what the output for this would look like, but lists of the GEOIDs that aren't being merged, from both dataframes, would be good. Or else the dataframes themselves, without the records that were merged. Thanks!
You can use the isin method in pandas:
badPop = allPop[~allPop['GEOID'].isin(allTracts['GEOID'])]
You can also use the indicator option of the merge method along with how='outer' to find the offending rows.
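A sketch of that indicator approach, with hypothetical stand-ins for the two frames; any row tagged left_only or right_only exists in only one of them:

import pandas as pd

# Hypothetical stand-ins for allTracts and allPop, keyed on GEOID.
allTracts = pd.DataFrame({'GEOID': ['01001', '01002', '01003'], 'geometry': ['g1', 'g2', 'g3']})
allPop = pd.DataFrame({'GEOID': ['01001', '01003', '01004'], 'population': [100, 300, 400]})

# indicator=True adds a _merge column saying which side each row came from.
merged = allTracts.merge(allPop, on='GEOID', how='outer', indicator=True)
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['GEOID', '_merge']])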

Is there a standard way of fixing missing values in pandas index column?

The problem
I'm working with a data set which has been given to me as a csv file with lines of the form id,data. I would like to work with this data in a pandas dataframe, with the id as the index.
Unfortunately, somewhere along the data pipeline, my csv file has a number of rows where the ids are missing. Fortunately, the rows of my data are not fully independent, so I can recreate the missing values: each row is linked to its predecessor, and I have access to an oracle which, when given an id, can give me all of its data, including the id of its predecessor.
My question is therefore whether there's a simple way of filling in these missing values in my dataframe.
My Solution
I don't have much experience working with pandas, but after playing around for a bit I came up with the following approach. I start by reading the csv file into a dataframe without setting the index, so I end up with a RangeIndex. I then:
1. Find the location of the rows with missing ids
2. Add 1 to the index to get the children of each row
3. Ask the oracle for the parents of each child
4. Merge the parents and children on the child id
5. Subtract one from the index again, and set the parent ids
In code:
children = df.loc[df[df['id'].isna()].index + 1, 'id']
parents = pd.Series({x.id: x.parent_id for x in ask_oracle(children)},
                    name='parent_id')
combined = pd.merge(children, parents, left_on='id', right_index=True)
combined.set_index(combined.index - 1, inplace=True)
df.loc[combined.index, 'id'] = combined['parent_id']
This works, but I'm 95% sure it's going to look like scary black magic in a few months' time.
In particular, I'm unhappy with:
- The way I get the location of the NaN rows. Three lots of df[ in one line is just too many.
- The manual fiddling with the indices I have to do to get the rows to match up.
Does anyone have any suggestions for a better way of doing things?
The format of the input data is fixed, as are the properties of the oracle, but if there's a smarter way of organising my dataframe, I'm more than happy to hear it.
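One possible way to tighten the same five steps, sketched with a stub oracle and made-up data standing in for the real ask_oracle and csv described above:

import pandas as pd
from types import SimpleNamespace

# Made-up data: the row at position 2 is missing its id; its successor has id 'c'.
df = pd.DataFrame({'id': ['a', 'b', None, 'c'], 'data': [1, 2, 3, 4]})

def ask_oracle(ids):
    # Stub for the real oracle: returns objects carrying id and parent_id.
    lookup = {'c': 'b2'}  # pretend 'c' reports that its predecessor's id is 'b2'
    return [SimpleNamespace(id=i, parent_id=lookup[i]) for i in ids]

# Steps 1-2: locate the rows with a missing id and step to their successors.
missing_idx = df.index[df['id'].isna()]
child_ids = df.loc[missing_idx + 1, 'id']

# Steps 3-5: map each successor's id to its parent id and write it back,
# using positional alignment instead of shifting the index by hand.
parent_of = {row.id: row.parent_id for row in ask_oracle(child_ids)}
df.loc[missing_idx, 'id'] = child_ids.map(parent_of).to_numpy()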

MemoryError when I merge two Pandas data frames

I searched almost all over the internet and somehow none of the approaches seem to work in my case.
I have two large csv files (each with a million+ rows and about 300-400 MB in size). They load fine into data frames using the read_csv function without having to use the chunksize parameter.
I even performed certain minor operations on this data like new column generation, filtering, etc.
However, when I try to merge these two frames, I get a MemoryError. I have even tried to use SQLite to accomplish the merge, but in vain. The operation takes forever.
Mine is a Windows 7 PC with 8GB RAM. The Python version is 2.7
Thank you.
Edit: I tried chunking methods too. When I do this, I don't get MemoryError, but the RAM usage explodes and my system crashes.
When you merge data using pandas.merge it will use df1's memory, df2's memory and the merged frame's memory. I believe that is why you get a memory error. You should export df2 to a csv file, read it back with the chunksize option, and merge it chunk by chunk.
There might be a better way, but you can try this.
For large data sets you can use the chunksize option in pandas.read_csv.
import pandas as pd

df1 = pd.read_csv("yourdata.csv")
df2 = pd.read_csv("yourdata2.csv")
df2_key = df2.Colname2

# create an empty bucket to save the result
df_result = pd.DataFrame(columns=(df1.columns.append(df2.columns)).unique())
df_result.to_csv("df3.csv", index_label=False)

# to keep only the data which appears in df1 (a left join), uncomment the
# next two lines; it is not needed here
# df_result = df1[df1.Colname1.isin(df2.Colname2) != True]
# df_result.to_csv("df3.csv", index_label=False, mode="a")

# delete df2 to save memory
del df2

def preprocess(x):
    # merge each chunk against df1 and append the result to df3.csv
    df_chunk = pd.merge(df1, x, left_on="Colname1", right_on="Colname2")
    df_chunk.to_csv("df3.csv", mode="a", header=False, index=False)

reader = pd.read_csv("yourdata2.csv", chunksize=1000)  # chunksize depends on your column count and memory
[preprocess(r) for r in reader]
This will save the merged data to df3.csv.
The reason you might be getting MemoryError: Unable to allocate .. could be duplicates or blanks in your dataframe. Check the column you are joining on (when using merge) and see if you have duplicates or blanks. If so, get rid of them with:
df.drop_duplicates(subset='column_name', keep=False, inplace=True)
Then re-run your python/pandas code. This worked for me.
In general, the chunked version suggested by @T_cat works great.
However, the memory explosion might be caused by joining on columns that have NaN values.
So you may want to exclude those rows from the join.
see: https://github.com/pandas-dev/pandas/issues/24698#issuecomment-614347153
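A sketch of that exclusion, reusing the placeholder column names from the chunked answer above; rows with NaN keys can match each other in a merge and blow the result up, so drop them first:

# Keep only rows whose join key is present before merging.
df1 = df1[df1["Colname1"].notna()]
df2 = df2[df2["Colname2"].notna()]
df3 = pd.merge(df1, df2, left_on="Colname1", right_on="Colname2")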
Maybe the left data frame has NaN in the merging column, which causes the final merged dataframe to bloat.
If it is acceptable, fill the merging column in the left data frame with zeros:
df['left_column'] = df['left_column'].fillna(0)
Then do the merge. See what you get.
