How to merge two datasets with different time ranges? - Python

I have two datasets that look like this:
df1:
Date     City     State  Quantity
2019-01  Chicago  IL     35
2019-01  Orlando  FL     322
...      ...      ...    ...
2021-07  Chicago  IL     334
2021-07  Orlando  FL     4332
df2:
Date     City     State  Sales
2020-03  Chicago  IL     30
2020-03  Orlando  FL     319
...      ...      ...    ...
2021-07  Chicago  IL     331
2021-07  Orlando  FL     4000
My Date column has dtype period[M] in both datasets. I have tried df1.join(df2, how='outer') and df2.join(df1, how='outer'), but the rows don't line up correctly: essentially, in 2019-01 I end up with the sales for 2020-03. I have not been able to use merge() because I would have to merge on a combination of City, State and Date. How can I join these two datasets so that my output is as follows:
Date     City     State  Quantity  Sales
2019-01  Chicago  IL     35        NaN
2019-01  Orlando  FL     322       NaN
...      ...      ...    ...       ...
2021-07  Chicago  IL     334       331
2021-07  Orlando  FL     4332      4000

You can outer-merge. By not specifying the columns to merge on, you merge on the intersection of the columns in both DataFrames (in this case, Date, City and State).
out = df1.merge(df2, how='outer').sort_values(by='Date')
Output:
Date City State Quantity Sales
0 2019-01 Chicago IL 35.0 NaN
1 2019-01 Orlando FL 322.0 NaN
4 2020-03 Chicago IL NaN 30.0
5 2020-03 Orlando FL NaN 319.0
2 2021-07 Chicago IL 334.0 331.0
3 2021-07 Orlando FL 4332.0 4000.0
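If you prefer to be explicit about the keys (for instance because the frames share other column names you do not want to merge on), merge also accepts a list of columns. A minimal sketch with the column names from the question:
# merge on the combination of Date, City and State explicitly;
# any other shared columns would then get _x/_y suffixes instead of being used as keys
out = df1.merge(df2, on=['Date', 'City', 'State'], how='outer').sort_values(by='Date')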

Related

Fill one dataframe with values from another

I have this DataFrame, which has null values that haven't been populated correctly.
Unidad Precio Combustible Año_del_vehiculo Caballos \
49 1 1000 Gasolina 1998.0 50.0
63 1 800 Gasolina 1998.0 50.0
88 1 600 Gasolina 1999.0 54.0
107 1 3100 Diésel 2008.0 54.0
244 1 2000 Diésel 1995.0 60.0
... ... ... ... ... ...
46609 1 47795 Gasolina 2016.0 420.0
46770 1 26900 Gasolina 2011.0 450.0
46936 1 19900 Gasolina 2007.0 510.0
46941 1 24500 Gasolina 2006.0 514.0
47128 1 79600 Gasolina 2017.0 612.0
Comunidad_autonoma Marca_y_Modelo Año_Venta Año_Comunidad \
49 Islas Baleares CITROEN AX 2020 2020Islas Baleares
63 Islas Baleares SEAT Arosa 2021 2021Islas Baleares
88 Islas Baleares FIAT Seicento 2020 2020Islas Baleares
107 La Rioja TOYOTA Aygo 2020 2020La Rioja
244 Aragón PEUGEOT 205 2019 2019Aragón
... ... ... ... ...
46609 La Rioja PORSCHE Cayenne 2020 2020La Rioja
46770 Cataluña AUDI RS5 2020 2020Cataluña
46936 Islas Baleares MERCEDES-BENZ Clase M 2020 2020Islas Baleares
46941 La Rioja MERCEDES-BENZ Clase E 2020 2020La Rioja
47128 Islas Baleares MERCEDES-BENZ Clase E 2021 2021Islas Baleares
Fecha Año Super_95 Diesel Comunidad Salario en euros anuales
49 2020-12-01 NaN NaN NaN NaN NaN
63 2021-01-01 NaN NaN NaN NaN NaN
88 2020-12-01 NaN NaN NaN NaN NaN
107 2020-12-01 NaN NaN NaN NaN NaN
244 2019-03-01 NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
46609 2020-12-01 NaN NaN NaN NaN NaN
46770 2020-07-01 NaN NaN NaN NaN NaN
46936 2020-10-01 NaN NaN NaN NaN NaN
46941 2020-11-01 NaN NaN NaN NaN NaN
47128 2021-01-01 NaN NaN NaN NaN NaN
I need to fill the gasoline, diesel and salary columns with the values from the following DataFrame:
Año Super_95 Diesel Comunidad Año_Comunidad Fecha \
0 2020 1.321750 1.246000 Navarra 2020Navarra 2020-01-01
1 2020 1.301000 1.207250 Navarra 2020Navarra 2020-02-01
2 2020 1.224800 1.126200 Navarra 2020Navarra 2020-03-01
3 2020 1.106667 1.020000 Navarra 2020Navarra 2020-04-01
4 2020 1.078750 0.986250 Navarra 2020Navarra 2020-05-01
.. ... ... ... ... ... ...
386 2021 1.416600 1.265000 La rioja 2021La rioja 2021-08-01
387 2021 1.431000 1.277000 La rioja 2021La rioja 2021-09-01
388 2021 1.474000 1.344000 La rioja 2021La rioja 2021-10-01
389 2021 1.510200 1.382000 La rioja 2021La rioja 2021-11-01
390 2021 1.481333 1.348667 La rioja 2021La rioja 2021-12-01
Salario en euros anuales
0 27.995,96
1 27.995,96
2 27.995,96
3 27.995,96
4 27.995,96
.. ...
386 21.535,29
387 21.535,29
388 21.535,29
389 21.535,29
390 21.535,29
The idea is to fill the columns of the first DataFrame with values from the second whenever the Año_Comunidad key matches: for example, for a NaN in a row where Año_Comunidad is 2020Islas Baleares, fill it with the gasoline price from the row of the other table where Año_Comunidad is also 2020Islas Baleares; if it is 2020Aragón, match it with 2020Aragón, and so on. I had thought of something like this:
analisis['Super_95'].fillna(analisis2['Super_95'].apply(lambda x: x if x=='2020Islas Baleares' else np.nan), inplace=True)
The second dataframe is the result of doing a merge, and those null values have not been filled.
df1 = df1.merge(df2, on='Año_Comunidad')
As a result you'll have one DataFrame where columns with the same name get the suffix _x for the first DataFrame and _y for the second one.
Now to fill in the blanks you can do this for each column:
df1.loc[df1["Año_x"].isnull(),'Año_x'] = df1["Año_y"]
If a row in Año is empty, it will be filled with data from second table that we merged earlier.
You can do this in a loop over all the columns:
cols = ['Año', 'Super_95', 'Diesel', 'Comunidad', 'Salario en euros anuales']
for col in cols:
    df1.loc[df1[col + "_x"].isnull(), col + '_x'] = df1[col + '_y']
And finally you can drop the merged columns:
for col in cols:
    df1 = df1.drop(col + '_y', axis=1)
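Putting the steps together, here is a consolidated sketch (column names as in the question; since both frames also carry a monthly Fecha column, I assume you want to merge on both keys so rows are not duplicated — drop Fecha from the key list if that is not what you need):
# merge the car listings (df1) with the fuel/salary lookup table (df2)
merged = df1.merge(df2, on=['Año_Comunidad', 'Fecha'], how='left', suffixes=('_x', '_y'))
# for each duplicated column, keep df1's value where present, otherwise take df2's
cols = ['Año', 'Super_95', 'Diesel', 'Comunidad', 'Salario en euros anuales']
for col in cols:
    merged[col] = merged[col + '_x'].fillna(merged[col + '_y'])
    merged = merged.drop(columns=[col + '_x', col + '_y'])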

How to balance panel data by adding missing rows with no information?

I have an unbalanced dataset, named unbalanced.df, that looks as follows:
Date     ID     City         State  Quantity
2019-01  10001  Los Angeles  CA     500
2019-02  10001  Los Angeles  CA     995
2019-03  10001  Los Angeles  CA     943
2019-01  10002  Houston      TX     4330
2019-03  10002  Houston      TX     2340
2019-01  10003  Sacramento   CA     235
2019-02  10003  Sacramento   CA     239
2019-03  10003  Sacramento   CA     233
As you can see, Houston does not have 2019-02 as a Date. This happens all throughout my panel data with different cities.
I want to make this panel balanced by adding NaN rows for the missing dates, so that the new panel data, balanced.df, looks like this:
Date     ID     City         State  Quantity
2019-01  10001  Los Angeles  CA     500
2019-02  10001  Los Angeles  CA     995
2019-03  10001  Los Angeles  CA     943
2019-01  10002  Houston      TX     4330
2019-02  10002  Houston      TX     NaN
2019-03  10002  Houston      TX     2340
2019-01  10003  Sacramento   CA     235
2019-02  10003  Sacramento   CA     239
2019-03  10003  Sacramento   CA     233
In this case, I have an absolute minimum date and an absolute maximum date, and I want to make sure that all cities cover the same dates. How can I fill my panel with NaN rows so that each ID, City and State has the same number of rows?
One option that offers an efficient abstraction is complete from pyjanitor, which adds the missing rows for the combination of Date against the group ('ID', 'City', 'State'):
# pip install pyjanitor
import pandas as pd
import janitor
df.complete(('ID', 'City', 'State'), 'Date')
Date ID City State Quantity
0 2019-01 10001 Los Angeles CA 500.0
1 2019-02 10001 Los Angeles CA 995.0
2 2019-03 10001 Los Angeles CA 943.0
3 2019-01 10002 Houston TX 4330.0
4 2019-02 10002 Houston TX NaN
5 2019-03 10002 Houston TX 2340.0
6 2019-01 10003 Sacramento CA 235.0
7 2019-02 10003 Sacramento CA 239.0
8 2019-03 10003 Sacramento CA 233.0
Try this: use a MultiIndex and reindex:
mapp = df.set_index('ID')[['City', 'State']].drop_duplicates()
df1 = df.set_index(['Date', 'ID'])\
        .reindex(pd.MultiIndex.from_product([df['Date'].unique(),
                                             df['ID'].unique()],
                                            names=['Date', 'ID']))\
        .reset_index()
df1.assign(City=df1['ID'].map(mapp['City']), State=df1['ID'].map(mapp['State']))
Output:
Date ID City State Quantity
0 2019-01 10001 Los Angeles CA 500.0
1 2019-01 10002 Houston TX 4330.0
2 2019-01 10003 Sacramento CA 235.0
3 2019-02 10001 Los Angeles CA 995.0
4 2019-02 10002 Houston TX NaN
5 2019-02 10003 Sacramento CA 239.0
6 2019-03 10001 Los Angeles CA 943.0
7 2019-03 10002 Houston TX 2340.0
8 2019-03 10003 Sacramento CA 233.0
If you have a lot of columns, then you can use merge instead of assign:
df[['ID', 'City', 'State']].drop_duplicates().merge(df1[['Date', 'ID', 'Quantity']], on='ID')
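If you would rather stay in plain pandas without building the MultiIndex by hand, a compact sketch is to pivot Date out to columns and stack it back while keeping the empty cells (this assumes pandas >= 1.1, where pivot accepts a list of index columns):
# pivot Date to columns, then stack it back without dropping NaN,
# which materialises one row per (ID, City, State, Date) combination
balanced = (df.pivot(index=['ID', 'City', 'State'], columns='Date', values='Quantity')
              .stack(dropna=False)
              .rename('Quantity')
              .reset_index())
balanced = balanced[['Date', 'ID', 'City', 'State', 'Quantity']]  # restore the original column order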

How to delete rows where a column has non-NaN values

Input Dataframe(df)
Country Region Date Value.....
ABW NaN 01-01-2020 123
ABW NaN 02-01-2020 1234
ABW NaN 03-01-2020 3242
USA NaN 04-01-2020 4354
USA NaN 05-01-2020 43543
USA NaN 06-01-2020 34534
USA NaN 07-01-2020 435
USA WA 08-01-2020 43345
USA WA 09-01-2020 345
USA WV 10-01-2020 345
.
.
.
.
Expected Output(df1)
Country Region Date Value.....
ABW NaN 01-01-2020 123
ABW NaN 02-01-2020 1234
ABW NaN 03-01-2020 3242
USA NaN 04-01-2020 4354
USA NaN 05-01-2020 43543
USA NaN 06-01-2020 34534
USA NaN 07-01-2020 435
.
.
.
.
So from the above dataframe you can see that the column 'Region' has NaN as well as non-null values. I'd like to remove every row where the 'Region' column has a non-NaN value.
Also, after performing the above operation, if I wanted to remove the Region column entirely, how can I do that in the fastest possible way (the frame has 10k+ columns)?
FINAL Expected Output
Country Date Value.....
ABW 01-01-2020 123
ABW 02-01-2020 1234
ABW 03-01-2020 3242
USA 04-01-2020 4354
USA 05-01-2020 43543
USA 06-01-2020 34534
USA 07-01-2020 435
Here's the code I tried
df1=df1.isnull(df1['Region'])
Error
df1=df.isnull(df['Region'])
TypeError: isnull() takes 1 positional argument but 2 were given
Using @BEN_YO's suggestion, this is what I did, and it works fine:
filtered_df = df1[df1['Region'].isnull()]
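For the follow-up about removing the Region column entirely, a short sketch (continuing from the filtered_df above):
# drop the now-redundant Region column; passing a list to columns=
# keeps this a single cheap operation even with many columns
filtered_df = filtered_df.drop(columns=['Region'])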

df.dropna is not dropping NaN values from rows where all values are NaN

I'm trying to clean a PDF to turn it into a file for geocoding. I've been using tabula-py to extract the PDF and have had pretty good results up until the point of removing rows that are entirely empty. I'm not even sure this is an efficient way to do it.
I've tried the majority of solutions SO has recommended and I still can't quite figure it out. I've set inplace=True, axis=0 and 1, how='all', and tried indexing out the NaN values, but that didn't work either.
import pandas as pd
import tabula
pd.set_option('display.width', 500)
df = tabula.read_pdf("C:\\Users\\Jack\\Documents\\Schoolwork\\Schoolwork\\WICResearch\\RefDocs\\wicclinicdirectory.pdf", pages='all', guess = False, pandas_options={'header': None})
df.columns = ["County_Name", "Clinic_Number", "Clinic_Name", "Address", "City", "Zip_Code", "Phone_Number", "Hours_of_Operation"]
df.drop(["Phone_Number", "Hours_of_Operation"], axis = 1, inplace = True)
#mass of code here that removes unwanted repeated column headers as by product of tabula reading PDFs.
df.drop(["Clinic_Name"], axis = 1, inplace = True)
df[['ClinicNum','ClinicName']] = df.Clinic_Number.apply(lambda x: pd.Series(str(x).split(" ", maxsplit = 1)))
df.drop(["Clinic_Number"], axis = 1, inplace = True)
#df[~df.isin(['NaN', 'NaT']).any(axis=1)]
#df.dropna(axis= 0, how ='all', inplace = True)
NaNIndex = df.index[df.isnull().all(1)]
print(NaNIndex)
print(df)
The above code gives this output:
Index: []
County_Name Address City Zip_Code ClinicNum ClinicName
0 NaN NaN Ohio WIC Clinic Locations NaN nan NaN
1 NaN NaN NaN NaN Clinic NaN
3 Adams 9137 State Route 136 West Union 45693 100 Adams County WIC Program
4 NaN NaN NaN NaN nan NaN
5 NaN NaN NaN NaN nan NaN
6 Allen 940 North Cable Road, Suite 4 Lima 45805 200 Allen County WIC Program
7 Ashland 934 Center Street Ashland 44805 300 Ashland County WIC Program
8 NaN Suite E NaN NaN nan NaN
9 Ashtabula Geneva Community Center Geneva 44041 403 Geneva WIC Clinic
10 NaN 62 West Main Street NaN NaN nan NaN
11 Ashtabula Jefferson United Methodist Church Jefferson 44047 402 Jefferson WIC Clinic
12 NaN 125 East Jefferson Street NaN NaN nan NaN
13 Ashtabula Conneaut Human Resource Center Conneaut 44030 401 Conneaut WIC Clinic
14 NaN 327 Mill Street NaN NaN nan NaN
15 Ashtabula 3225 Lake Avenue Ashtabula 44004 400 Ashtabula County WIC Program
16 NaN NaN NaN NaN nan NaN
18 NaN NaN NaN NaN Clinic NaN
20 Ashtabula St. Mary's Catholic Church Orwell 44076 490 Orwell WIC Clinic
And what I'd like is:
Index: []
County_Name Address City Zip_Code ClinicNum ClinicName
3 Adams 9137 State Route 136 West Union 45693 100 Adams County WIC Program
6 Allen 940 North Cable Road, Suite 4 Lima 45805 200 Allen County WIC Program
7 Ashland 934 Center Street Ashland 44805 300 Ashland County WIC Program
8 NaN Suite E NaN NaN nan NaN
9 Ashtabula Geneva Community Center Geneva 44041 403 Geneva WIC Clinic
10 NaN 62 West Main Street NaN NaN nan NaN
11 Ashtabula Jefferson United Methodist Church Jefferson 44047 402 Jefferson WIC Clinic
12 NaN 125 East Jefferson Street NaN NaN nan NaN
13 Ashtabula Conneaut Human Resource Center Conneaut 44030 401 Conneaut WIC Clinic
14 NaN 327 Mill Street NaN NaN nan NaN
15 Ashtabula 3225 Lake Avenue Ashtabula 44004 400 Ashtabula County WIC Program
18 NaN NaN NaN NaN Clinic NaN
20 Ashtabula St. Mary's Catholic Church Orwell 44076 490 Orwell WIC Clinic
I am able to create the data frame I want with the correct headings, but it still does not remove the all-NaN rows, or else it removes everything. I'd also like to be able to merge the partially filled continuation rows into the corresponding record above them so that each clinic is on one line.
I'm also not sure how reproducible I can get this as I have fiddled around with tabula quite a bit trying to get this pdf converted.
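One likely culprit, judging from the output above: str(x).split(...) turns a real NaN into the literal string 'nan', so those cells no longer count as missing as far as dropna is concerned. A minimal sketch of a fix, assuming the only such artefacts are the 'nan' strings produced by the split:
import numpy as np
# restore genuine missing values where the split produced the string 'nan',
# then drop the rows in which every column is missing
df = df.replace('nan', np.nan)
df = df.dropna(axis=0, how='all')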

Python Pandas pivot with values equal to simple function of specific column

import pandas as pd
olympics = pd.read_csv('olympics.csv')
Edition NOC Medal
0 1896 AUT Silver
1 1896 FRA Gold
2 1896 GER Gold
3 1900 HUN Bronze
4 1900 GBR Gold
5 1900 DEN Bronze
6 1900 USA Gold
7 1900 FRA Bronze
8 1900 FRA Silver
9 1900 USA Gold
10 1900 FRA Silver
11 1900 GBR Gold
12 1900 SUI Silver
13 1900 ZZX Gold
14 1904 HUN Gold
15 1904 USA Bronze
16 1904 USA Gold
17 1904 USA Silver
18 1904 CAN Gold
19 1904 USA Silver
I can pivot the data frame to get an aggregate value:
pivot = olympics.pivot_table(index='Edition', columns='NOC', values='Medal', aggfunc='count')
NOC AUT CAN DEN FRA GBR GER HUN SUI USA ZZX
Edition
1896 1.0 NaN NaN 1.0 NaN 1.0 NaN NaN NaN NaN
1900 NaN NaN 1.0 3.0 2.0 NaN 1.0 1.0 2.0 1.0
1904 NaN 1.0 NaN NaN NaN NaN 1.0 NaN 4.0 NaN
Rather than having the total number of medals in values=, I would like each value to be a tuple (a triple) of (#Gold, #Silver, #Bronze), with (0, 0, 0) instead of NaN.
How do I do that succinctly and elegantly?
No need to use pivot_table, as pivot is perfectly fine with a tuple for a value. The steps:
use value_counts to count all medals
create a MultiIndex for all combinations of Edition, NOC and Medal
reindex with fill_value=0
counts = df.groupby(['Edition', 'NOC']).Medal.value_counts()
# full index of every Edition x NOC x Medal combination, missing ones filled with 0
mux = pd.MultiIndex.from_product(
    [c.values for c in counts.index.levels], names=counts.index.names)
counts = counts.reindex(mux, fill_value=0).unstack('Medal')
# order the counts as the question asks: (#Gold, #Silver, #Bronze)
counts = counts[['Gold', 'Silver', 'Bronze']]
# collapse each row into a tuple and move NOC to columns
pd.Series([tuple(l) for l in counts.values.tolist()], counts.index).unstack()
