I have two dfs that have the same columns and contain the same kind of information, but from different sources:
df_orders = pd.DataFrame({'id':[1,2,3],'model':['A1','A3','A6'], 'color':['Red','Blue','Green']})
df_billed = pd.DataFrame({'id':[1,6,7],'model':['A1','A7','B1'], 'color':['Purple','Pink','Red']})
Then I do a left merge on df_billed by id and add suffixes, as the column names overlap:
merge_df = pd.merge(df_billed,df_orders,on='id',how='left',suffixes=('_order','_billed'))
Results in
   id | model_order | color_order | model_billed | color_billed
0   1 | A1          | Purple      | A1           | Red
1   6 | A7          | Pink        | NaN          | NaN
2   7 | B1          | Red         | NaN          | NaN
After this merge the _order columns are the fully populated ones, and I would like a dataframe where, if there is no billed info, we take the order info instead, with the suffixes removed:
   id | model_billed | color_billed
0   1 | A1           | Red
1   6 | A7           | Pink
2   7 | B1           | Red
Ideally I thought of doing a combine_first to coalesce the columns and renaming them at the end, but that looks a bit dirty in code, so I'm looking for a cleaner, better-designed solution.
You can just use .fillna() and use the _order columns to fill the NaNs:
merge_df['model_billed'] = merge_df['model_billed'].fillna(merge_df['model_order'])
merge_df['color_billed'] = merge_df['color_billed'].fillna(merge_df['color_order'])
Output
merge_df[['id', 'model_billed', 'color_billed']]
id model_billed color_billed
0 1 A1 Red
1 6 A7 Pink
2 7 B1 Red
UPDATE
If there are more such columns, you can just use a loop like this:
col_names = ['model', 'color']
for col in col_names:
    merge_df[col + '_billed'] = merge_df[col + '_billed'].fillna(merge_df[col + '_order'])
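If you also want the suffixes gone afterwards, as mentioned in the question, a small follow-up sketch (reusing the same merge_df) is to keep only the filled _billed columns and strip the suffix with rename:
# keep the coalesced columns and drop the '_billed' suffix from their names
result = (merge_df[['id', 'model_billed', 'color_billed']]
          .rename(columns=lambda c: c.replace('_billed', '')))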
Hello!
I have loaded a few datasets. The only thing they have in common is that they have the same column names, BUT the number of columns/rows and the data are different, so it looks like I cannot use merge or concat because there is nothing in common by ID, for example. I want to stack each df on top of the other and leave the "extra" columns with NaN values.
df1:
| Column A | Column B |
| -------- | -------- |
| ID 1 | Cell 2 |
| ID 2 | Cell 4 |
df2:
| Column A | Column B | ColumnC |
| -------- | -------- | ------- |
| ID 3     | Cell 2   | info    |
| ID 4     | Cell 4   | info    |
I want something like this:
df:
| Column A | Column B | ColumnC |
| -------- | -------- | ------- |
| ID 1     | Cell 2   | NaN     |
| ID 2     | Cell 4   | NaN     |
| ID 3     | Cell 2   | info    |
| ID 4     | Cell 4   | info    |
Thanks a lot for your time!
I have tried something like df = pd.concat(['df1','df2'], axis=1) and also merge.
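A minimal sketch of one way to get the desired result (assuming df1 and df2 are built as shown above): pd.concat along the row axis (axis=0) aligns on column names and fills the missing ColumnC with NaN for the rows coming from df1.
import pandas as pd

df1 = pd.DataFrame({'Column A': ['ID 1', 'ID 2'],
                    'Column B': ['Cell 2', 'Cell 4']})
df2 = pd.DataFrame({'Column A': ['ID 3', 'ID 4'],
                    'Column B': ['Cell 2', 'Cell 4'],
                    'ColumnC': ['info', 'info']})

# stack the frames vertically; columns are matched by name and any column
# missing from one frame is filled with NaN in its rows
df = pd.concat([df1, df2], axis=0, ignore_index=True)
print(df)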
I have DataFrame with almost 500 rows and 3 columns.
One of the columns holds a string of dates; most cells have unique dates, but some cells share a common date and some cells appear empty.
I'm trying to find the frequency of each date across all the cells.
df | Number_of_dates | Date
---+-----------------+---------------------------------------------------
0  | 0.0             | []
1  | 3.0             | ['2006-01-01' '2006-03-22' '2019-07-29']
2  | 8.0             | ['2006-01-01' '2006-04-13' '2006-07-18' '2006-...
3  | 1.0             | ['2006-07-18']
4  | 1.0             | ['2019-07-29']
5  | 0.0             | []
6  | 397.0           | ['2019-01-02' '2019-01-03' '2019-01-04' '2019-...
Result:
df_1 | Date       | Frequency
-----+------------+----------
0    | 2006-01-01 | 2
1    | 2006-03-22 | 1
2    | 2006-04-13 | 1
3    | 2006-07-18 | 2
4    | 2019-07-29 | 3
It would be very helpful if you could provide some guidance.
Thanks in advance
additional information:
I noticed that each cell has a string value instead of a list
Sample DataFrame
d = {"Date":[ "['2005-02-02' '2005-05-04' '2005-08-03' '2005-11-02' '2006-02-01' '2006-05-03']",
"['2006-01-31' '2006-02-01' '2006-03-16'\n '2006-06-13']",
"['2005-10-12' '2005-10-13' '2005-10-14'\n '2005-10-17']",
"[]",
"['2005-07-25' '2005-07-26' '2005-07-27'\n '2005-07-28' '2005-07-29' '2005-08-01' '2005-08-02' '2005-08-03'\n '2005-08-04' '2005-08-05']",
"['2005-03-15' '2005-03-16' '2005-03-17'\n '2005-03-18' '2005-03-21' '2005-03-22' '2005-03-23' '2005-03-24' \n'2005-03-28' '2005-03-29' '2005-03-30' '2005-03-31' '2005-04-01'\n '2005-04-04']",
"['2005-03-16' '2005-03-17' '2005-07-27'\n '2006-06-13']",
"['2005-02-02' '2005-05-04' '2005-03-16' '2005-03-17']",
"[]"
]
}
df = pd.DataFrame(d)
Use DataFrame.explode with GroupBy.size:
#create list from sample data
df['Date'] = df['Date'].str.strip('[]').str.split()
df_1 = df.explode('Date').groupby('Date').size().reset_index(name='Frequency')
print (df_1.head(10))
Date Frequency
0 '2005-02-02' 2
1 '2005-03-15' 1
2 '2005-03-16' 3
3 '2005-03-17' 3
4 '2005-03-18' 1
5 '2005-03-21' 1
6 '2005-03-22' 1
7 '2005-03-23' 1
8 '2005-03-24' 1
9 '2005-03-28' 1
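If you also want the surrounding quotes removed from the dates (as in the desired result), a small variation on the list-building step above, dropping the quotes before splitting, should do it:
# strip the brackets, drop the quotes, then split on whitespace
df['Date'] = (df['Date'].str.strip('[]')
                        .str.replace("'", '', regex=False)
                        .str.split())
df_1 = df.explode('Date').groupby('Date').size().reset_index(name='Frequency')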
So, I have a data frame like this (the important column is the third one):
  | ABC | DEF | fruit
--+-----+-----+-------
1 | 12  | LO  | banana
2 | 45  | KA  | orange
3 | 65  | JU  | banana
4 | 25  | UY  | grape
5 | 23  | TE  | apple
6 | 28  | YT  | orange
7 | 78  | TR  | melon
I want to keep the rows that have the 5 most occurring fruits and drop the rest, so I made a variable holding the fruits to keep as a list, like this:
fruits = df['fruit'].value_counts()
fruits_to_keep = fruits[:5].reset_index()
fruits_to_keep.drop(['fruit'], inplace=True, axis=1)
fruits_to_keep = fruits_to_keep.to_numpy()
fruits_to_keep = fruits_to_keep.tolist()
fruits_to_keep
[['banana'], ['orange'], ['apple'], ['melon'], ['grape']]
I have the feeling that I made unnecessary steps, but anyway, the problem arises when I try to select the rows containing those fruits_to_keep
df = df.set_index('fruit')
df = df.loc[fruits_to_keep,:]
Then I get the Key Error saying that "None of [Index([('banana',), \n ('orange',), \n ('apple',)...... dtype='object', name='fruit')] are in the [index]"
I also tried:
df[df.fruit in fruits_to_keep]
But then I get the following error:
('Lengths must match to compare', (43987,), (1,))
Obs.: I actually have 43k rows, many 'fruits' that I don't want on the dataframe and 30k+ rows with the 5 most occurring 'fruits'
Thanks in advance!
To keep the rows with the top N values you can use value_counts and isin.
By default, value_counts returns the elements in descending order of frequency.
N = 5
df[df['col'].isin(df['col'].value_counts().index[:N])]
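Applied to the example above, where the column is named fruit:
# keep only the rows whose fruit is among the 5 most frequent ones
N = 5
top_fruits = df['fruit'].value_counts().index[:N]
df = df[df['fruit'].isin(top_fruits)]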
I have a dataset based on different weather stations for several variables (Temperature, Pressure, etc.):
stationID | Time | Temperature | Pressure |...
----------+------+-------------+----------+
123 | 1 | 30 | 1010.5 |
123 | 2 | 31 | 1009.0 |
202 | 1 | 24 | NaN |
202 | 2 | 24.3 | NaN |
202 | 3 | NaN | 1000.3 |
...
and I would like to create a pivot table that would show the number of NaNs and non-NaNs per weather station, such that:
stationID | nanStatus | Temperature | Pressure |...
----------+-----------+-------------+----------+
123 | NaN | 0 | 0 |
| nonNaN | 2 | 2 |
202 | NaN | 1 | 2 |
| nonNaN | 2 | 1 |
...
Below I show what I have done so far, which works (in a cumbersome way) for Temperature. But how can I get the same for both variables, as shown above?
import pandas as pd
import numpy as np
df = pd.DataFrame({'stationID':[123,123,202,202,202], 'Time':[1,2,1,2,3],'Temperature':[30,31,24,24.3,np.nan],'Pressure':[1010.5,1009.0,np.nan,np.nan,1000.3]})
dfnull = df.isnull()
dfnull['stationID'] = df['stationID']
dfnull['tempValue'] = df['Temperature']
dfnull.pivot_table(values=["tempValue"], index=["stationID","Temperature"], aggfunc=len,fill_value=0)
The output is:
                          tempValue
stationID | Temperature
123       | False                2
202       | False                2
          | True                 1
UPDATE: thanks to @root:
In [16]: df.groupby('stationID')[['Temperature','Pressure']].agg([nans, notnans]).astype(int).stack(level=1)
Out[16]:
Temperature Pressure
stationID
123 nans 0 0
notnans 2 2
202 nans 1 2
notnans 2 1
Original answer:
In [12]: %paste
def nans(s):
return s.isnull().sum()
def notnans(s):
return s.notnull().sum()
## -- End pasted text --
In [37]: df.groupby('stationID')[['Temperature','Pressure']].agg([nans, notnans]).astype(np.int8)
Out[37]:
Temperature Pressure
nans notnans nans notnans
stationID
123 0 2 0 2
202 1 2 2 1
I'll admit this is not the prettiest solution, but it works. First define two temporary columns TempNaN and PresNaN:
df['TempNaN'] = df['Temperature'].apply(lambda x: 'NaN' if x != x else 'NonNaN')  # x != x is True only for NaN
df['PresNaN'] = df['Pressure'].apply(lambda x: 'NaN' if x != x else 'NonNaN')
Then define your results DataFrame using a MultiIndex:
Results = pd.DataFrame(
    index=pd.MultiIndex.from_tuples(
        list(zip(*[sorted(list(df['stationID'].unique()) * 2),
                   ['NaN', 'NonNaN'] * df['stationID'].nunique()])),
        names=['stationID', 'NaNStatus']))
Store your computations in the results DataFrame:
Results['Temperature'] = df.groupby(['stationID','TempNaN'])['Temperature'].apply(lambda x: x.shape[0])
Results['Pressure'] = df.groupby(['stationID','PresNaN'])['Pressure'].apply(lambda x: x.shape[0])
And fill the blank values with zero:
Results.fillna(value=0,inplace=True)
You can loop over the columns if that is easier. For example:
Results = pd.DataFrame(
    index=pd.MultiIndex.from_tuples(
        list(zip(*[sorted(list(df['stationID'].unique()) * 2),
                   ['NaN', 'NonNaN'] * df['stationID'].nunique()])),
        names=['stationID', 'NaNStatus']))
for col in ['Temperature', 'Pressure']:
    df[col + 'NaN'] = df[col].apply(lambda x: 'NaN' if x != x else 'NonNaN')
    Results[col] = df.groupby(['stationID', col + 'NaN'])[col].apply(lambda x: x.shape[0])
    df.drop([col + 'NaN'], axis=1, inplace=True)
Results.fillna(value=0, inplace=True)
d = {'stationID':[], 'nanStatus':[], 'Temperature':[], 'Pressure':[]}
for station_id, data in df.groupby(['stationID']):
    temp_nans = data.isnull().Temperature.mean() * data.isnull().Temperature.count()
    pres_nans = data.isnull().Pressure.mean() * data.isnull().Pressure.count()
    d['stationID'].append(station_id)
    d['nanStatus'].append('NaN')
    d['Temperature'].append(temp_nans)
    d['Pressure'].append(pres_nans)
    d['stationID'].append(station_id)
    d['nanStatus'].append('nonNaN')
    d['Temperature'].append(data.isnull().Temperature.count() - temp_nans)
    d['Pressure'].append(data.isnull().Pressure.count() - pres_nans)
df2 = pd.DataFrame.from_dict(d)
print(df2)
The result is:
Pressure Temperature nanStatus stationID
0 0.0 0.0 NaN 123
1 2.0 2.0 nonNaN 123
2 2.0 1.0 NaN 202
3 1.0 2.0 nonNaN 202
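For completeness, a sketch of yet another way (not from the answers above) that avoids helper functions: count the non-NaN values per station, derive the NaN counts from the group sizes, and stack the two frames into the requested two-level index.
cols = ['Temperature', 'Pressure']
non_nan = df.groupby('stationID')[cols].count()   # non-NaN values per station
totals = df.groupby('stationID').size()           # total rows per station
nan_counts = non_nan.rsub(totals, axis=0)         # NaN = total rows - non-NaN

result = (pd.concat({'NaN': nan_counts, 'nonNaN': non_nan},
                    names=['nanStatus', 'stationID'])
            .swaplevel()
            .sort_index())
print(result)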
I have the following problem, which I am stuck on and unfortunately cannot resolve by myself or with the help of similar questions I found on Stack Overflow.
To keep it simple, I'll give a short example of my problem:
I have a DataFrame with several columns, one of which indicates the ID of a user. The same user may have several entries in this data frame:
| | userID | col2 | col3 |
+---+-----------+----------------+-------+
| 1 | 1 | a | b |
| 2 | 1 | c | d |
| 3 | 2 | a | a |
| 4 | 3 | d | e |
Something like this. Now I want to know the number of rows that belong to a certain userID. For this I tried df.groupby('userID').size(), whose result I want to use for another simple calculation, like a division.
But when I try to save the results of the calculation in a separate column, I keep getting NaN values.
Is there a way to solve this so that I get the result of the calculations in a separate column?
Thanks for your help!
edit//
To make clear how my output should look: the upper dataframe is my main data frame, so to say. Besides this frame I have a second frame looking like this:
| | userID | value | value/appearances |
+---+-----------+----------------+-------+
| 1 | 1 | 10 | 10 / 2 = 5 |
| 3 | 2 | 20 | 20 / 1 = 20 |
| 4 | 3 | 30 | 30 / 1 = 30 |
So in the column 'value/appearances' I basically want the number in the value column divided by the number of appearances of that user in the main dataframe. For the user with ID=1 this would be 10/2, as this user has a value of 10 and 2 rows in the main dataframe.
I hope this makes it a bit clearer.
IIUC you want to do the following: groupby on 'userID', call transform on the grouped column, and pass 'size' to identify the method to call:
In [54]:
df['size'] = df.groupby('userID')['userID'].transform('size')
df
Out[54]:
userID col2 col3 size
1 1 a b 2
2 1 c d 2
3 2 a a 1
4 3 d e 1
What you tried:
In [55]:
df.groupby('userID').size()
Out[55]:
userID
1 2
2 1
3 1
dtype: int64
When assigned back to the df, the result aligns on the df index rather than on userID, so the values end up misaligned and the last row becomes NaN:
In [57]:
df['size'] = df.groupby('userID').size()
df
Out[57]:
userID col2 col3 size
1 1 a b 2
2 1 c d 1
3 2 a a 1
4 3 d e NaN
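For the second frame in the question, a minimal sketch (assuming it is called df_values and has columns userID and value, names not given in the question) would map the group sizes onto it and divide:
# appearances of each userID in the main dataframe
sizes = df.groupby('userID').size()

# df_values is the second frame from the question (hypothetical name)
df_values['value/appearances'] = df_values['value'] / df_values['userID'].map(sizes)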