Combining two dataframes in Pandas Python [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I would like to combine two dataframes.
I would like to combine them in such a way that rows with the same acct are matched up.
For example, acct 10 should have values in CME and NISSAN while the rest are zeros.

I think you can use df.combine_first():
It updates null elements with the value in the same location in the other DataFrame.
df2.combine_first(df1)
Also, you can try:
pd.concat([df1.set_index('acct'),df2.set_index('acct')],axis=1).reset_index()
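A minimal sketch of the combine_first approach, using made-up df1/df2 stand-ins (the original frames are not shown in the question); 'acct' is set as the index first so alignment happens on accounts rather than on row position:
import pandas as pd

# Hypothetical stand-ins for the original dataframes (not shown in the question)
df1 = pd.DataFrame({'acct': [10, 30], 'CME': [1728005.0, 6237.0]})
df2 = pd.DataFrame({'acct': [10, 30], 'GOODM': [None, 30569.3], 'NISSAN': [1397464.2, None]})

combined = df2.set_index('acct').combine_first(df1.set_index('acct')).reset_index()
print(combined.fillna(0))  # fill remaining NaNs with zeros, as the question asks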

It looks like what you're trying to do is merge these two DataFrames.
You can use df.merge to merge the two. Since you want to match on the acct column, set the on keyword arg to "acct" and set how to "inner" to keep only those rows that appear in both DataFrames.
For example:
merged = df1.merge(df2, how="inner", on="acct")
Output:
+------+--------------------+------------------+-------------------+-----------+--------------------+-------------------+--------------------+
| acct | GOODM | KIS | NISSAN | CME | HKEX | OSE | SGX |
+------+--------------------+------------------+-------------------+-----------+--------------------+-------------------+--------------------+
| 10 | | | 1397464.227495019 | 1728005.0 | 0.0 | | |
| 30 | 30569.300965712766 | 4299649.75104102 | | 6237.0 | | | |
+------+--------------------+------------------+-------------------+-----------+--------------------+-------------------+--------------------+
If you want to fill empty values with zeroes, you can use df.fillna(0).
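For instance, a small follow-up sketch building on the merge above (use how="outer" instead of "inner" if accounts appearing in only one frame should also be kept and zero-filled):
merged = df1.merge(df2, how="inner", on="acct").fillna(0)  # replace missing cells with 0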

Related

How to vertically stack completely different DataFrames one below other?

I have 3 different DataFrames as below:
Dataframe 1 (df1):
Dataframe 2 (df2):
Dataframe 3 (df3):
I wish to vertically stack these dataframes as a whole on top of each other.
The result should look like this:
I tried using pd.concat with axis=0, but I am unable to achieve the desired result.
Instead, this is what I am getting:
How can I achieve the desired dataframe?
First of all: you don't want to append/concatenate df3 to the other two because it's an entirely different table. You would concatenate DataFrames containing the same kind of data, i.e. the same columns. If df3 is missing a column that df2 has, this column would be empty for all rows originating from df3. This is what happened when you concatenated df3 to the other two. The result shows what I am trying to say in my first sentence: they share no columns and are two completely separate tables.
Another thing that might help you get along with pandas: pandas is preferably used on stacked data (I don't know if it's officially called "stacked", but I'll call it that for now), while your df1 and df2 are pivot tables:
Pivot table:
| Country/Sex | male | female |
|-------------|------|--------|
| India | 10 | 20 |
| China | 30 | 40 |
| USA | 50 | 60 |
Stacked table (preferably use this with pandas):
| Country | sex | number |
|---------|--------|--------|
| India | male | 10 |
| India | female | 20 |
| China | male | 30 |
| China | female | 40 |
| USA | male | 50 |
| USA | female | 60 |
You can switch between these two forms using df.pivot() / df.unstack() and df.stack() (or pd.melt()), as described here. It may also help with the problem that Training: 1 and Validation: 1 are interpreted as columns when they are probably just the names of the tables. In a stacked table you could simply add another column specifying each row as either Training or Validation.
Your df3 is already stacked, df1 and df2 are pivot tables.
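To make this concrete, a small sketch converting the toy pivot table above into the stacked form and back; the column names 'Country', 'male', 'female' come from the example tables, and df.melt()/df.pivot() are standard pandas methods:
import pandas as pd

pivot = pd.DataFrame({
    'Country': ['India', 'China', 'USA'],
    'male': [10, 30, 50],
    'female': [20, 40, 60],
})

# Wide (pivot) -> long (stacked): one row per Country/sex combination
stacked = pivot.melt(id_vars='Country', var_name='sex', value_name='number')

# Long (stacked) -> wide (pivot) again
wide_again = stacked.pivot(index='Country', columns='sex', values='number').reset_index()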

Efficient way to write Pandas groupby codes by eliminating repetition

I have a DataFrame as below.
import pandas as pd

df = pd.DataFrame({
    'Country': ['A','A','A','A','A','A','B','B','B'],
    'City': ['C 1','C 1','C 1','B 2','B 2','B 2','C 1','C 1','C 1'],
    'Date': ['7/1/2020','7/2/2020','7/3/2020','7/1/2020','7/2/2020','7/3/2020','7/1/2020','7/2/2020','7/3/2020'],
    'Value': [46,90,23,84,89,98,31,84,41]
})
I need to create two averages:
Firstly, with both Country and City as the grouping criteria.
Secondly, with only the Country.
To achieve this, we can easily write the code below:
df.groupby(['Country','City']).agg('mean')
+---------+------+-------+
| Country | City | Value |
+---------+------+-------+
| A | B 2 | 90.33 |
| +------+-------+
| | C 1 | 53 |
+---------+------+-------+
| B | C 1 | 52 |
+---------+------+-------+
df.groupby(['Country']).agg('mean')
+---------+-------+
| Country | Value |
+---------+-------+
| A | 71.67 |
+---------+-------+
| B | 52 |
+---------+-------+
The only change between the two snippets above is the City grouping criterion; apart from that, everything is the same, so there is a clear repetition/duplication of code (especially when it comes to complex scenarios).
Now my question is: is there any way we could write one piece of code that covers both scenarios at once? DRY - Don't Repeat Yourself.
What I have in mind is something like below:
Choice = 'City'  # here I set either 'City' or None based on the requirement; if None, the code below should ignore that criterion
df.groupby(['Country', Choice]).agg('mean')
Is this possible? Or what is the best way to write the above code efficiently, without repetition?
I am not sure exactly what you want to accomplish, but why not just use an if?
columns = ['Country']
if Choice:
    columns.append(Choice)
df.groupby(columns).agg('mean')
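If this pattern comes up often, one way to keep it DRY is to wrap it in a small helper; this is a sketch under the assumption that the mean of the Value column is what's needed, and the function name group_mean is made up for illustration:
def group_mean(df, extra=None):
    """Mean of 'Value', grouped by Country and optionally by one extra column such as 'City'."""
    keys = ['Country'] + ([extra] if extra else [])
    return df.groupby(keys)['Value'].mean()

country_city_avg = group_mean(df, 'City')  # Country + City averages
country_avg = group_mean(df)               # Country-only averages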

Creating new column from API lookup using groupby

I have a dataframe of weather data that looks like this:
+----+------------+----------+-----------+
| ID | Station_ID | Latitude | Longitude |
+----+------------+----------+-----------+
| 0 | 6010400 | 52.93 | -82.43 |
| 1 | 6010400 | 52.93 | -82.43 |
| 2 | 6010400 | 52.93 | -82.43 |
| 3 | 616I001 | 45.07 | -77.88 |
| 4 | 616I001 | 45.07 | -77.88 |
| 5 | 616I001 | 45.07 | -77.88 |
+----+------------+----------+-----------+
I want to create a new column called postal_code using an API lookup based on the latitude and longitude values. I cannot perform a lookup for each row in the dataframe, as that would be inefficient: there are over 500,000 rows but only 186 unique Station_IDs. It is also infeasible due to rate limiting on the API I need to use.
I believe I need to perform a groupby transform but can't quite figure out how to get it to work correctly.
Any help with this would be greatly appreciated.
I believe you can use groupby only for aggregations, which is not what you want here.
First, combine 'Latitude' and 'Longitude'. This gives a new column of tuples.
df['coordinates'] = list(zip(df['Latitude'],df['Longitude']))
Then you can use this 'coordinates' column to collect all unique (Latitude, Longitude) pairs in a set, so there are no duplicates.
set(df['coordinates'])
Then fetch the postal_codes of these coordinates using API calls as you said and store them as a dict.
Then you can use this dict to populate postal codes for each row.
postal_code_dict = {'key':'value'} #sample dictionary
df['postal_code'] = df['coordinates'].apply(lambda x: postal_code_dict[x])
Hope this helps.
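Putting those steps together, here is a minimal end-to-end sketch; lookup_postal_code is a hypothetical placeholder for the real rate-limited API call, and .map() is used so each unique coordinate pair is looked up only once:
import pandas as pd

def lookup_postal_code(lat, lon):
    """Hypothetical placeholder for the real (rate-limited) reverse-geocoding API call."""
    raise NotImplementedError

df['coordinates'] = list(zip(df['Latitude'], df['Longitude']))

# One API call per unique coordinate pair (186 calls instead of 500,000+)
postal_code_dict = {coords: lookup_postal_code(*coords) for coords in set(df['coordinates'])}

# .map returns NaN for unmapped keys instead of raising, unlike dict indexing inside apply
df['postal_code'] = df['coordinates'].map(postal_code_dict)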

Spark parquet assign index within group [duplicate]

This question already has answers here:
How do I get a SQL row_number equivalent for a Spark RDD?
(4 answers)
Closed 4 years ago.
I would like to know the most efficient way to generate an index column
that uniquely identifies a record within each group of label:
+-------+-------+-------+
| label | value | index |
+-------+-------+-------+
| a | v1 | 0 |
+-------+-------+-------+
| a | v2 | 1 |
+-------+-------+-------+
| a | v3 | 2 |
+-------+-------+-------+
| a | v4 | 3 |
+-------+-------+-------+
| b | v5 | 0 |
+-------+-------+-------+
| b | v6 | 1 |
+-------+-------+-------+
My actual data is very large and each group of label has the same number of records. The index column will be used for a pivot.
I could do the usual sort, then a for loop that increments the index and resets it whenever the current label differs from the previous one, but a faster and more efficient way is always welcome.
EDIT: got my answer from the suggested question:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# row_number() is 1-based, so subtract 1 to get the 0-based index shown above
df = df.withColumn(
    "index",
    F.row_number().over(Window.partitionBy("label").orderBy("value")) - 1
)
Thank you for all your help!
You can use window functions to create a rank-based column while partitioning on the label column. However, this requires an ordering, in this case on value:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
window = Window.partitionBy(df['label']).orderBy(df['value'])
df.withColumn('index', row_number().over(window))
This will give a new column index with values starting from 1 (to start from 0, simply subtract 1 from the expression above). The values will be assigned in the order given by the value column.
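Since the question mentions the index will be used for a pivot, a short follow-up sketch using the window defined above; the choice of F.first as the aggregation is an assumption, reasonable here because each label/index pair holds exactly one value:
from pyspark.sql import functions as F

indexed = df.withColumn('index', row_number().over(window) - 1)  # 0-based index

# One row per label, one column per index value
wide = indexed.groupBy('label').pivot('index').agg(F.first('value'))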

How to compare two pandas dataframes and remove duplicates on one file without appending data from other file [duplicate]

This question already has answers here:
set difference for pandas
(12 answers)
Closed 4 years ago.
I am trying to compare two csv files using pandas dataframes. One is a master sheet that is going to have data appended to it daily (test_master.csv). The second is a daily report (test_daily.csv) that contains the data I want to append to the test_master.csv.
I am creating two pandas dataframes from these files:
import pandas as pd
dfmaster = pd.read_csv(test_master.csv)
dfdaily = pd.read_csv(test_daily.csv)
I want the daily list to be compared to the master list to see if there are any rows on the daily list that are already in the master list. If so, I want to remove those duplicates from dfdaily. I then want to write the remaining non-duplicate data to dfmaster.
The duplicate data will always be an entire row. My plan was to iterate through the sheets row by row to make the comparison.
I realize I could append my daily data to the dfmaster dataframe and use drop_duplicates to remove the duplicates. I cannot figure out how to remove the duplicates in the dfdaily dataframe, though. And I need to be able to write the dfdaily data back to test_daily.csv (or another new file) without the duplicate data.
Here is an example of what the dataframes could look like.
test_master.csv
+-------------+-------------+-------------+
| column 1    | column 2    | column 3    |
+-------------+-------------+-------------+
| 1           | 2           | 3           |
| 4           | 5           | 6           |
| 7           | 8           | 9           |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
+-------------+-------------+-------------+
test_daily.csv
+-------------+-------------+-------------+
| column 1 | column 2 | column 3 |
+-------------+-------------+-------------+
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+-------------+-------------+-------------+
Desired output is:
test_master.csv
+-------------+-------------+-------------+
| column 1 | column 2 | column 3 |
+-------------+-------------+-------------+
| 1 | 2 | 3 |
| 4 | 5 | 6 |
| 7 | 8 | 9 |
| duplicate 1 | duplicate 1 | duplicate 1 |
| duplicate 2 | duplicate 2 | duplicate 2 |
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+-------------+-------------+-------------+
test_daily.csv
+----------+----------+----------+
| column 1 | column 2 | column 3 |
+----------+----------+----------+
| 10 | 11 | 12 |
| 13 | 14 | 15 |
+----------+----------+----------+
Any help would be greatly appreciated!
EDIT
I incorrectly thought solutions from the set difference question solved my problem. I ran into certain cases where those solutions did not work; I believe it had something to do with index labels, as mentioned in a comment by Troy D below. Troy D's solution is the one I am now using.
Try this:
I create two example DataFrames and then set rows 2-4 of the daily frame to be duplicates of the master:
import numpy as np
import pandas as pd

test_master = pd.DataFrame(np.random.rand(3, 3), columns=['A', 'B', 'C'])
test_daily = pd.DataFrame(np.random.rand(5, 3), columns=['A', 'B', 'C'])
test_daily.iloc[1:4] = test_master[:3].values
print(test_master)
print(test_daily)
output:
A B C
0 0.009322 0.330057 0.082956
1 0.197500 0.010593 0.356774
2 0.147410 0.697779 0.421207
A B C
0 0.643062 0.335643 0.215443
1 0.009322 0.330057 0.082956
2 0.197500 0.010593 0.356774
3 0.147410 0.697779 0.421207
4 0.973867 0.873358 0.502973
Then, add a multiindex level to identify which data is from which dataframe:
test_master['master'] = 'master'
test_master.set_index('master', append=True, inplace=True)
test_daily['daily'] = 'daily'
test_daily.set_index('daily', append=True, inplace=True)
Now merge as you suggested and drop duplicates:
merged = pd.concat([test_master, test_daily])  # DataFrame.append is deprecated in recent pandas; concat is equivalent here
merged = merged.drop_duplicates().sort_index()
print(merged)
output:
A B C
master
0 daily 0.643062 0.335643 0.215443
master 0.009322 0.330057 0.082956
1 master 0.197500 0.010593 0.356774
2 master 0.147410 0.697779 0.421207
4 daily 0.973867 0.873358 0.502973
There you see the combined dataframe with the origin of the data labeled in the index. Now just slice for the daily data:
idx = pd.IndexSlice
print(merged.loc[idx[:, 'daily'], :])
output:
A B C
master
0 daily 0.643062 0.335643 0.215443
4 daily 0.973867 0.873358 0.502973
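As a side note, a common alternative to the multi-index approach is an anti-join via merge with indicator=True, which keeps only the daily rows that do not already appear in the master. A minimal sketch, assuming both frames share exactly the same columns:
import pandas as pd

# Anti-join on all shared columns: keep daily rows that are not already in the master
merged = dfdaily.merge(dfmaster.drop_duplicates(), how='left', indicator=True)
new_daily = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

# Append only the genuinely new rows to the master and write both files back out
updated_master = pd.concat([dfmaster, new_daily], ignore_index=True)
new_daily.to_csv('test_daily.csv', index=False)
updated_master.to_csv('test_master.csv', index=False)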
