I have 3 different DataFrames as below:
Dataframe 1 (df1):
Dataframe 2 (df2):
Dataframe 3 (df3):
I wish to stack these dataframes vertically, as whole tables, one on top of the other.
The result should look like this:
I tried using pd.concat with axis=0, but I am unable to achieve the desired result.
Instead, this is what I am getting:
How can I achieve the desired dataframe?
First of all: You don't want to append/concatenate df3 to the other two because it's an entirely different table. You would concatenate Dataframes containing the same kind of data, i.e. the same columns. If df3 is missing a column df2 has, this column would be empty for all values originating from df3. This is what happened when you concatenated df3 to the other two. The result shows you what I try to say with my first sentence: They share no columns and are two completely separate tables.
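For example, a minimal sketch with made-up frames shaped like the tables below:

import pandas as pd

df12 = pd.DataFrame({"Country": ["India", "China"], "male": [10, 30], "female": [20, 40]})
df3 = pd.DataFrame({"Country": ["India"], "sex": ["male"], "number": [10]})

# The union of all columns is used; the non-shared ones (sex/number vs. male/female)
# come back as NaN for the rows that don't have them.
print(pd.concat([df12, df3], axis=0, ignore_index=True))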
Another thing that might help you to get along with pandas is: Pandas is preferably used on stacked data (I don't know if it's officially called stacked but I'll call it like this for now) while your df1 and df2 are pivot tables:
Pivot table:
| Country/Sex | male | female |
|-------------|------|--------|
| India | 10 | 20 |
| China | 30 | 40 |
| USA | 50 | 60 |
Stacked table (preferably use this with pandas):
| Country | sex | number |
|---------|--------|--------|
| India | male | 10 |
| India | female | 20 |
| China | male | 30 |
| China | female | 40 |
| USA | male | 50 |
| USA | female | 60 |
You can switch between these two using pd.pivot() and DataFrame.stack()/unstack() as described here. It may also help with the problem that Training: 1 and Validation: 1 are interpreted as columns when they are probably just the names of the tables: in a stacked table you could simply add another column specifying each row as either Training or Validation.
Your df3 is already stacked, df1 and df2 are pivot tables.
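A minimal sketch of switching between the two shapes, using the example tables above (melt()/pivot() is one way to do it, stack()/unstack() is another):

import pandas as pd

pivot = pd.DataFrame({"male": [10, 30, 50], "female": [20, 40, 60]},
                     index=pd.Index(["India", "China", "USA"], name="Country"))

# pivot -> stacked: the column labels become a "sex" column
stacked = pivot.reset_index().melt(id_vars="Country", var_name="sex", value_name="number")

# stacked -> pivot: the "sex" values become columns again
back = stacked.pivot(index="Country", columns="sex", values="number")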
Related
I am building multiple dataframes from a SQL query that contains a lot of left joins which is producing a bunch of duplicate values. I am familiar with pd.drop_duplicates() as I use it regularly in my other scripts, however, I can't get this particular one to work.
I am trying to drop_duplicates on a subset of 2 columns. Here is my code:
df = pd.read_sql("query")
index = []
for i in range(len(df)):
    index.append(i)
df['index'] = index
df.set_index([df['index']])
df2 = df.groupby(['SSN', 'client_name', 'Evaluation_Date']).substance_use_name.agg(' | '.join).reset_index()
df2.shape  # equals (182, 4)
df3 = pd.concat([df, df2], axis=1, join='outer').drop_duplicates(keep=False)
df3.drop_duplicates(subset=['client_name', 'Evaluation_Date'], keep='first', inplace=True)
df3 returns 791 rows of data (which is the exact number of rows my original query returns). After the drop_duplicates call I expected to have only 190 rows of data; however, it only brings the data down to 301 rows. When I do df3.to_excel(r'file_path.xlsx') and remove duplicates manually on the same subset in Excel, it works just fine and gives me the 190 rows I expect. I'm not sure why.
I noticed in other similar questions regarding this topic that pandas cannot drop duplicates if a date field is a dtype 'object' and that it must be changed to a datetime, however, my date field is already a datetime.
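One thing I have not ruled out yet - just a guess - is that the remaining rows differ in ways that are easy to miss (stray whitespace, letter case, or a time component on the date). Something like this normalisation sketch would show it:

import pandas as pd

check = df3.copy()
check['client_name'] = check['client_name'].astype(str).str.strip().str.lower()
check['Evaluation_Date'] = pd.to_datetime(check['Evaluation_Date']).dt.normalize()
print(len(check.drop_duplicates(subset=['client_name', 'Evaluation_Date'], keep='first')))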
Data frame looks like this:
ID | substance1 | substance2 | substance3 | substance4
01 | drug | null | null | null
01 | null | drug | null | null
01 | null | null | drug | null
01 | null | null | null | drug
02 | drug | null | null | null
so on and so forth. I want to merge the rows into one row so it looks like this:
ID | substance1 | substance2 | substance3 | substance4
01 | drug | drug | drug | drug
02 | drug | drug | drug | drug
So on and so forth. Does that make better sense?
Would anyone be able to help me with this?
Thanks!
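For illustration, here is a toy version of the table (made-up values, same layout) together with one possible collapse - taking the first non-null value per ID:

import pandas as pd

df = pd.DataFrame({
    'ID': ['01', '01', '01', '01', '02'],
    'substance1': ['drug', None, None, None, 'drug'],
    'substance2': [None, 'drug', None, None, None],
    'substance3': [None, None, 'drug', None, None],
    'substance4': [None, None, None, 'drug', None],
})

# first() takes the first non-null value in each column per ID,
# which collapses the one-drug-per-row layout into a single row per ID
merged = df.groupby('ID', as_index=False).first()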
I have a csv file (n types of products rated by users):
Simplified illustration of source table
--------------------------------
User_id | Product_id | Rating |
--------------------------------
1 | 00 | 3 |
1 | 02 | 5 |
2 | 01 | 1 |
2 | 00 | 2 |
2 | 02 | 2 |
I load it into a pandas dataframe and I want to transform it, converting the rating values from rows to columns in the following way:
as a result of the conversion the number of rows will remain the same, but there will be 6 additional columns
3 columns (p0rt, p1rt, p2rt) each correspond to a product type. They need to contain the rating given by the user in this row to that product. Only one of the three columns per row can have a rating; the other two must be zeros/nulls
3 columns (uspr0rt, uspr1rt, uspr2rt) need to contain all product ratings provided by that user, repeated on each of the user's rows; values in columns related to products unrated by this user must be zeros/nulls
Desired output
------------------------------------------------------
User_id |p0rt |p1rt |p2rt |uspr0rt |uspr1rt |uspr2rt |
------------------------------------------------------
1 | 3 | 0 | 0 | 3 | 0 | 5 |
1 | 0 | 0 | 5 | 3 | 0 | 5 |
2 | 0 | 1 | 0 | 2 | 1 | 2 |
2 | 2 | 0 | 0 | 2 | 1 | 2 |
2 | 0 | 0 | 2 | 2 | 1 | 2 |
I will greatly appreciate any help with this. The actual number of distinct product_ids/product types is ~60,000 and the number of rows in the file is ~400mln, so performance is important.
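To make the target concrete, here is a toy-sized sketch of the transformation I am after, using the small example table above (get_dummies/pivot_table is just one way to express it, and it ignores the memory problem):

import pandas as pd

df = pd.DataFrame({
    'User_id':    [1, 1, 2, 2, 2],
    'Product_id': [0, 2, 1, 0, 2],
    'Rating':     [3, 5, 1, 2, 2],
})

# p0rt..p2rt: one column per product, holding only the rating given in this row
per_row = (pd.get_dummies(df['Product_id'], prefix='p', prefix_sep='')
             .mul(df['Rating'], axis=0)
             .add_suffix('rt'))

# uspr0rt..uspr2rt: all ratings of the user, repeated on each of that user's rows
per_user = (df.pivot_table(index='User_id', columns='Product_id',
                           values='Rating', aggfunc='first', fill_value=0)
              .add_prefix('uspr').add_suffix('rt'))

result = (df[['User_id']]
          .join(per_row)
          .merge(per_user, left_on='User_id', right_index=True, how='left'))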
Update 1
I tried using pivot_table, but the dataset is too large for it to work (I wonder if there is a way to do it in batches):
df = pd.read_csv('product_ratings.csv')
df = df.pivot_table(index=['User_id', 'Product_id'], columns='Product_id', values='Rating')
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 983.4 GiB for an array with shape (20004, 70000000) and data type float64
Update 2
I tried "chunking" the data and applied pivot_table to a smaller chunk (240mln rows and "only" 1300 types of products) as a test, but this didn't work either:
My code:
df = pd.read_csv('minified.csv', nrows=99999990000, dtype={0:'int32',1:'int16',2:'int8'})
df_piv = pd.pivot_table(df, index=['product_id', 'user_id'], columns='product_id', values='rating', aggfunc='first', fill_value=0).fillna(0)
Outcome:
IndexError: index 1845657558 is out of bounds for axis 0 with size 1845656426
This is a known, unresolved pandas issue: IndexError: index 1491188345 is out of bounds for axis 0 with size 1491089723.
I think I'll try Dask next; if that does not work, I guess I'll need to write the data reshaper myself in C++ or another lower-level language.
I have 2 DataFrames as follows
DataFrame 1
DataFrame 2
I want to merge these 2 DataFrames, so that each row of DataFrame 2 (its col1/col2 pair) is matched against the corresponding index/column combination in DataFrame 1.
That is, I want to append another column to DataFrame 2, name it "weight", and store the matched value there.
For example,
|   | col1 | col2   | relationship | weight |
|---|------|--------|--------------|--------|
| 0 | Andy | Claude | 0            | 1      |
| 1 | Andy | Frida  | 20           | 1      |
and so on. How to do this?
Use DataFrame.join with DataFrame.stack, which turns df1 into a Series with a MultiIndex:
df2 = df2.join(df1.stack().rename('weight'), on=['col1','col2'])
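A self-contained example with made-up data shaped like the question (the actual names and weights are assumptions):

import pandas as pd

# df1: a name-by-name matrix whose cells are the weights
df1 = pd.DataFrame({'Claude': [1, 2], 'Frida': [1, 3]},
                   index=['Andy', 'Bert'])
# df2: the pair list that should receive the looked-up weight
df2 = pd.DataFrame({'col1': ['Andy', 'Andy'],
                    'col2': ['Claude', 'Frida'],
                    'relationship': [0, 20]})

# stack() turns df1 into a Series keyed by (row label, column label),
# which join() can look up via df2's col1/col2 values
out = df2.join(df1.stack().rename('weight'), on=['col1', 'col2'])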
I have a DataFrame as below.
df = pd.DataFrame({
'Country':['A','A','A','A','A','A','B','B','B'],
'City':['C 1','C 1','C 1','B 2','B 2','B 2','C 1','C 1','C 1'],
'Date':['7/1/2020','7/2/2020','7/3/2020','7/1/2020','7/2/2020','7/3/2020','7/1/2020','7/2/2020','7/3/2020'],
'Value':[46,90,23,84,89,98,31,84,41]
})
I need to create 2 averages:
Firstly, with both Country and City as the grouping criteria
Secondly, with only the Country as the criterion
To achieve this, we can easily write the code below:
df.groupby(['Country','City']).agg('mean')
+---------+------+-------+
| Country | City | Value |
+---------+------+-------+
| A | B 2 | 90.33 |
| +------+-------+
| | C 1 | 53 |
+---------+------+-------+
| B | C 1 | 52 |
+---------+------+-------+
df.groupby(['Country']).agg('mean')
+---------+-------+
| Country | Value |
+---------+-------+
| A | 71.67 |
+---------+-------+
| B | 52 |
+---------+-------+
The only change between the above 2 snippets is the City key in the groupby; apart from that, everything is the same, so there is a clear repetition/duplication of code (especially when it comes to complex scenarios).
Now my question is: is there any way we could write one piece of code that covers both scenarios at once? DRY - Don't Repeat Yourself.
What I have in mind is something like below.
Choice = 'City'  # <<-- Here I type either 'City' or None based on the requirement. E.g. if None, the code below should ignore that criterion.
df.groupby(['Country', Choice]).agg('mean')
Is this possible? Or what is the best way to write the above code efficiently, without repetition?
I am not sure exactly what you want to accomplish, but why not just use an if?
columns = ['Country']
if Choice:
    columns.append(Choice)
df.groupby(columns).agg('mean')
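For example, wrapped in a small helper (I am selecting only the numeric Value column so the string Date column does not get in the way of the mean - adjust to your real data):

def country_mean(df, choice=None):
    keys = ['Country']
    if choice:
        keys.append(choice)
    return df.groupby(keys)['Value'].mean()

country_mean(df, 'City')   # average per Country + City
country_mean(df)           # average per Country only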
I would like to combine two dataframes such that rows with the same acct are aligned. For example, acct 10 should have values in CME and NISSAN while the rest are zeros.
I think you can use df.combine_first():
It will update null elements with the value in the same location in the other DataFrame.
df2.combine_first(df1)
Also, you can try:
pd.concat([df1.set_index('acct'), df2.set_index('acct')], axis=1).reset_index()
It looks like what you're trying to do is merge these two DataFrames.
You can use df.merge to merge the two. Since you want to match on the acct column, set the on keyword arg to "acct" and set how to "inner" to keep only those rows that appear in both DataFrames.
For example:
merged = df1.merge(df2, how="inner", on="acct")
Output:
+------+--------------------+------------------+-------------------+-----------+--------------------+-------------------+--------------------+
| acct | GOODM | KIS | NISSAN | CME | HKEX | OSE | SGX |
+------+--------------------+------------------+-------------------+-----------+--------------------+-------------------+--------------------+
| 10 | | | 1397464.227495019 | 1728005.0 | 0.0 | | |
| 30 | 30569.300965712766 | 4299649.75104102 | | 6237.0 | | | |
+------+--------------------+------------------+-------------------+-----------+--------------------+-------------------+--------------------+
If you want to fill empty values with zeroes, you can use df.fillna(0).
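Putting both steps together:

merged = df1.merge(df2, how="inner", on="acct").fillna(0)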