Replace Unnamed values in date column with true values - python

I'm working on this raw data frame that needs some cleaning. So far, I have transformed this xlsx file
into this pandas dataframe:
print(df.head(16))
date technician alkalinity colour uv ph turbidity \
0 2020-02-01 00:00:00 Catherine 24.5 33 0.15 7.24 1.53
1 Unnamed: 2 NaN NaN NaN NaN NaN 2.31
2 Unnamed: 3 NaN NaN NaN NaN NaN 2.08
3 Unnamed: 4 NaN NaN NaN NaN NaN 2.2
4 Unnamed: 5 Michel 24 35 0.152 7.22 1.59
5 Unnamed: 6 NaN NaN NaN NaN NaN 1.66
6 Unnamed: 7 NaN NaN NaN NaN NaN 1.71
7 Unnamed: 8 NaN NaN NaN NaN NaN 1.53
8 2020-02-02 00:00:00 Catherine 24 NaN 0.145 7.21 1.44
9 Unnamed: 10 NaN NaN NaN NaN NaN 1.97
10 Unnamed: 11 NaN NaN NaN NaN NaN 1.91
11 Unnamed: 12 NaN NaN 33.0 NaN NaN 2.07
12 Unnamed: 13 Michel 24 34 0.15 7.24 1.76
13 Unnamed: 14 NaN NaN NaN NaN NaN 1.84
14 Unnamed: 15 NaN NaN NaN NaN NaN 1.72
15 Unnamed: 16 NaN NaN NaN NaN NaN 1.85
temperature
0 3
1 NaN
2 NaN
3 NaN
4 3
5 NaN
6 NaN
7 NaN
8 3
9 NaN
10 NaN
11 NaN
12 3
13 NaN
14 NaN
15 NaN
From here, I want to combine the rows so that I only have one row for each date. The values in each row will be the means of the respective columns, i.e.:
print(new_df.head(2))
date time alkalinity colour uv ph turbidity temperature
0 2020-02-01 00:00:00 24.25 34 0.151 7.23 1.83 3
1 2020-02-02 00:00:00 24 33.5 0.148 7.23 1.82 3
How can I accomplish this when I have Unnamed values in my date column? Thanks!

Try setting the values to NaN and then use ffill:
df.loc[df.date.str.contains('Unnamed', na=False), 'date'] = np.nan
df.date = df.date.ffill()
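The forward-fill idea above can be carried through to the one-row-per-date result the question asks for. The following is only a sketch on a made-up toy frame (the values and the `is_placeholder` name are assumptions, not the asker's data); with the real spreadsheet, columns read in as object dtype may first need `pd.to_numeric`.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the question: real dates mixed with 'Unnamed: N' placeholders.
df = pd.DataFrame({
    "date": [pd.Timestamp("2020-02-01"), "Unnamed: 2", "Unnamed: 3",
             pd.Timestamp("2020-02-02"), "Unnamed: 5"],
    "technician": ["Catherine", np.nan, "Michel", "Catherine", np.nan],
    "turbidity": [1.5, 2.5, 2.0, 1.0, 3.0],
    "ph": [7.24, np.nan, 7.22, 7.21, np.nan],
})

# Mark the placeholder rows; astype(str) keeps this safe for Timestamp values.
is_placeholder = df["date"].astype(str).str.contains("Unnamed")

# Blank out the placeholders, convert to datetime, then carry the last date forward.
dates = pd.to_datetime(df["date"].where(~is_placeholder))
df["date"] = dates.ffill()

# One row per date; numeric_only=True skips text columns such as 'technician'.
new_df = df.groupby("date").mean(numeric_only=True).reset_index()
print(new_df)
```

Because `mean` skips NaN by default, a column that is only filled in on some rows (such as `ph` here) is averaged over the rows that actually have a value.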

If I understand correctly, you want to drop the rows that contain 'Unnamed' in the date column, right?
Please look here:
https://stackoverflow.com/a/27360130/12790501
The solution would be something like this:
df = df.drop(df[df['date'].astype(str).str.contains('Unnamed')].index)
Edit:
No, I would like to replace those Unnamed values with the date so I
could then use the groupby('date') function to return the mean values
for the columns
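In that case, you can simply iterate over the whole table and carry the last real date forward:
last_date = ''
for i in df.index:
    # str() guards against Timestamp values, which do not support `in`
    if 'Unnamed' not in str(df.at[i, 'date']):
        last_date = df.at[i, 'date']
    else:
        df.at[i, 'date'] = last_date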

If the 'date' column is of type object (i.e. string), you can loop over the numbers in the placeholders, provided they follow a regular pattern as in the output shown:
for n in range(2, 9):
    df.loc[df['date'] == 'Unnamed: ' + str(n), 'date'] = your_value

Related

Pandas: Filling NaN values in dataframe with monthly mean

The dataframe I am working with is as follows:
date AA1 AB2 AC3 AD4
0 1996-01-01 00:00:00 NaN NaN NaN NaN
1 1996-01-01 01:00:00 NaN 19.2 NaN NaN
2 1996-01-01 02:00:00 NaN 16.4 NaN NaN
3 1996-01-01 03:00:00 NaN 23.5 NaN NaN
4 1996-01-01 04:00:00 20.4 NaN NaN NaN
... ... ... ... ... ...
219164 2020-12-31 20:00:00 13.4 NaN 23.0 26.6
219165 2020-12-31 21:00:00 14.2 NaN 19.6 28.3
219166 2020-12-31 22:00:00 13.5 NaN 17.9 20.5
219167 2020-12-31 23:00:00 NaN NaN 16.7 20.7
219168 2021-01-01 00:00:00 NaN NaN NaN NaN
These are hourly data readings taken from different sensors from the year 1996 to 2021.
My goal is to be able to fill the NaN values with the monthly mean for each of the columns based on the date.
I have tried grouping the data and getting the monthly means for the group, though I am not sure where to go from here to transfer the grouped means to the original, larger dataframe, filling in some of the NaN values.
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
tem = df.groupby(['year', 'month']).mean().reset_index()
The resulting dataframe looks like this, with fewer indices because of the grouping:
year month AA1 AB2 AC3 AD4
0 1996 1 20.1 18.3 NaN NaN
1 1996 2 NaN NaN NaN NaN
2 1996 3 NaN NaN NaN NaN
3 1996 4 NaN NaN NaN NaN
4 1996 5 NaN NaN NaN NaN
... ... ... ... ... ... ...
296 2020 9 NaN NaN 15.7 20.2
297 2020 10 NaN NaN 15.3 19.7
298 2020 11 NaN NaN 26.7 25.9
299 2020 12 NaN NaN 24.6 25.3
300 2021 1 NaN NaN NaN NaN
Any advice on how I can implement this would be helpful. In the end, I need the original dataset indices, dates and columns, but with the NaN values filled with the means calculated from the monthly groups. The months with all NaN values can be ignored for the time being.
Assuming your date column is of type datetime64 or equivalent:
df['AA1'] = df['AA1'].fillna(df.groupby(df.date.dt.month)['AA1'].transform('mean'))
Or looping over all your columns (except the date column):
for col in df.columns.drop('date'):
    df[col] = df[col].fillna(df.groupby(df.date.dt.month)[col].transform('mean'))
If you only want the mean of the month in that specific year, add df.date.dt.year to the groupby keys:
for col in df.columns.drop('date'):
    df[col] = df[col].fillna(df.groupby([df.date.dt.year, df.date.dt.month])[col].transform('mean'))
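The per-column loop can also be written as a single fill, since `fillna` accepts a DataFrame and aligns it by index and column. This is a sketch on a small invented frame (the values and the `keys`, `value_cols`, `monthly_mean` and `filled` names are assumptions for illustration), grouping by year and month.

```python
import numpy as np
import pandas as pd

# Small stand-in for the hourly sensor data (values are made up).
df = pd.DataFrame({
    "date": pd.to_datetime(["1996-01-01 00:00", "1996-01-01 01:00", "1996-01-01 02:00",
                            "1996-02-01 00:00", "1996-02-01 01:00"]),
    "AA1": [20.0, np.nan, 22.0, np.nan, np.nan],
    "AB2": [np.nan, 19.0, 17.0, 10.0, np.nan],
})

# Group keys: calendar year and month of each reading.
keys = [df["date"].dt.year.rename("year"), df["date"].dt.month.rename("month")]
value_cols = list(df.columns.drop("date"))

# transform('mean') returns a frame shaped like the original rows,
# holding each row's (year, month) mean for every sensor column.
monthly_mean = df.groupby(keys)[value_cols].transform("mean")

# fillna with a DataFrame fills each NaN from the matching cell.
filled = df.fillna(monthly_mean)
print(filled)
```

Months in which a sensor has no readings at all have a NaN mean, so those cells simply stay NaN, which matches the "can be ignored for the time being" requirement.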

Row by row mapping keys of dictionary of dataframes to new dictionary of dataframes

I have two dictionaries of data frames LP3 and ExeedenceDict. The ExeedenceDict is a dictionary of 4 dataframes with keys 'two','ten','twentyfive','onehundred'. The LP3 dictionary has keys 'LP_DevilMalad', 'LP_Bloomington', 'LP_DevilEvans', 'LP_Deep', 'LP_Maple', 'LP_CubMaple', 'LP_Cottonwood', 'LP_Mill', 'LP_CubNrPreston'
Edit: I am not sure of the most concise way to title this question, but I think the title suits what I am asking.
There is a column in each dataframe within the ExeedenceDict that has row values equal to the keys in the LP3 dictionary.
Below is a 'blank' dataframe for two in the ExeedenceDict that I created. Using the code:
ExeedenceDF = []
cols = ['Location','Size','Annual Exceedence', 'With Reg Skew','Without Reg Skew','5% Lower','95% Upper']
for i in range(5):
    i = pd.DataFrame(columns=cols)
    i['Location'] = LP_names
    i['Size'] = [39.8,24,34,29.7,21.2,53.7,61.7,27.6,31.6]
    ExeedenceDF.append(i)
ExeedenceDict = {'two':ExeedenceDF[0], 'ten':ExeedenceDF[1], 'twentyfive':ExeedenceDF[2], 'onehundred':ExeedenceDF[3]}
Location Size Annual Exceedence With Reg Skew Without Reg Skew 5% Lower 95% Upper
0 LP_DevilMalad 39.8 NaN NaN NaN NaN NaN
1 LP_Bloomington 24.0 NaN NaN NaN NaN NaN
2 LP_DevilEvans 34.0 NaN NaN NaN NaN NaN
3 LP_Deep 29.7 NaN NaN NaN NaN NaN
4 LP_Maple 21.2 NaN NaN NaN NaN NaN
5 LP_CubMaple 53.7 NaN NaN NaN NaN NaN
6 LP_Cottonwood 61.7 NaN NaN NaN NaN NaN
7 LP_Mill 27.6 NaN NaN NaN NaN NaN
8 LP_CubNrPreston 31.6 NaN NaN NaN NaN NaN
Below is the dataframe for the key LP_DevilMalad in the LP3 dictionary. This dictionary was built by reading in data from 10 excel spreadsheets. Using the code:
LP_names = ['LP_DevilMalad', 'LP_Bloomington', 'LP_DevilEvans', 'LP_Deep', 'LP_Maple', 'LP_CubMaple', 'LP_Cottonwood', 'LP_Mill', 'LP_CubNrPreston']
for i, df in enumerate(LP_Data):
    LP_Data[i] = LP_Data[i].dropna()
    LP_Data[i]['Annual Exceedence'] = 1 / LP_Data[i]['Annual Exceedence']
    LP_Data[i] = LP_Data[i].loc[LP_Data[i]['Annual Exceedence'].isin([2, 10, 25, 100])]
LP3 = {k:v for (k,v) in zip(LP_names, LP_Data)}
'LP_DevilMalad': Annual Exceedence With Reg Skew Without Reg Skew Log Variance of Est \
6 2.0 21.4 22.4 0.0091
9 10.0 46.5 44.7 0.0119
10 25.0 60.2 54.6 0.0166
12 100.0 81.4 67.4 0.0270
5% Lower 95% Upper
6 14.1 31.2
9 32.1 85.7
10 40.6 136.2
12 51.3 250.6
I am having trouble matching the keys of LP3 to the Location column of the ExeedenceDict dataframes and copying the corresponding column values across. The goal is a script that does all of this iteratively, ideally with some sort of dictionary comprehension.
The caveat is that the 'two' dataframe takes only the row with index 6 from each LP3 dataframe, 'ten' takes the row with index 9, 'twentyfive' the row with index 10, and 'onehundred' the row with index 12.
The goal dataframe for key 'two' in ExeedenceDict, based on the two dataframes above, would look something like the table below.
The remaining rows would be filled with the index-6 values from the other dataframes in the LP3 dictionary.
Location Size Annual Exceedence With Reg Skew Without Reg Skew 5% Lower 95% Upper
0 LP_DevilMalad 39.8 2 21.4 22.4 14.1 31.2
1 LP_Bloomington 24.0 NaN NaN NaN NaN NaN
2 LP_DevilEvans 34.0 NaN NaN NaN NaN NaN
3 LP_Deep 29.7 NaN NaN NaN NaN NaN
4 LP_Maple 21.2 NaN NaN NaN NaN NaN
5 LP_CubMaple 53.7 NaN NaN NaN NaN NaN
6 LP_Cottonwood 61.7 NaN NaN NaN NaN NaN
7 LP_Mill 27.6 NaN NaN NaN NaN NaN
8 LP_CubNrPreston 31.6 NaN NaN NaN NaN NaN
Can't test it without a reproducible example, but I would do something along the lines:
index_map = {
    "two": 6,
    "ten": 9,
    "twentyfive": 10,
    "onehundred": 12
}
col_of_interest = ["Annual Exceedence", "With Reg Skew", "Without Reg Skew", "5% Lower", "95% Upper"]
for index_key, df in ExeedenceDict.items():
    lp_index = index_map[index_key]
    for lp_val in df['Location'].values:
        df.loc[df['Location'] == lp_val, col_of_interest] = LP3[lp_val].loc[lp_index, col_of_interest].values
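A different way to get the same tables, shown here only as a sketch, is to build each one directly from LP3 by selecting rows on the 'Annual Exceedence' value instead of relying on the index labels 6, 9, 10 and 12. The two LP3 frames below are cut-down stand-ins (the LP_Bloomington numbers are invented), and `sizes` and `periods` are hypothetical helper names.

```python
import pandas as pd

cols = ["Annual Exceedence", "With Reg Skew", "5% Lower", "95% Upper"]

# Cut-down stand-ins for two LP3 frames; LP_Bloomington values are made up.
LP3 = {
    "LP_DevilMalad": pd.DataFrame(
        [[2.0, 21.4, 14.1, 31.2], [10.0, 46.5, 32.1, 85.7]], index=[6, 9], columns=cols),
    "LP_Bloomington": pd.DataFrame(
        [[2.0, 11.0, 8.0, 15.0], [10.0, 30.0, 20.0, 45.0]], index=[6, 9], columns=cols),
}
sizes = pd.Series({"LP_DevilMalad": 39.8, "LP_Bloomington": 24.0})
periods = {"two": 2, "ten": 10}

ExeedenceDict = {}
for key, period in periods.items():
    # Take the single row for this return period from every location.
    rows = {name: frame.loc[frame["Annual Exceedence"] == period].iloc[0]
            for name, frame in LP3.items()}
    # Dict of Series -> one column per location; transpose to one row per location.
    table = pd.DataFrame(rows).T
    table.insert(0, "Size", sizes)  # aligned on the location names
    ExeedenceDict[key] = table.rename_axis("Location").reset_index()

print(ExeedenceDict["two"])
```

Selecting by value avoids depending on which row labels happen to survive `dropna()` in each spreadsheet, at the cost of assuming every LP3 frame contains exactly the return periods of interest.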

Getting NaN when multiplying these two columns from different dataframes in pandas

I'm trying to multiply columns from two different dataframes into a new df. The first dataframe (df1) contains the prices for different items, and the column header is the date. The second dataframe (df2) contains the quantity of each item.
df1
Date 1990-01-03 1990-01-04 1990-01-05 ... 2020-04-09 2020-04-14 2020-04-15
AAAAAAA 1.11 1.11 1.09 ... 102.22 103.46 103.96
BBBBBBB NaN NaN NaN ... 308.70 314.95 314.10
CCCCCCC NaN NaN NaN ... 65.34 58.72 56.18
DDDDDDD 5.52 5.51 5.53 ... 104.50 106.03 NaN
EEEEEEE NaN NaN NaN ... 1211.45 1269.23 NaN
FFFFFFF NaN NaN NaN ... 36.14 36.85 NaN
GGGGGGG 93.35 94.37 94.37 ... 1564.00 1537.50 1482.50
HHHHHHH NaN NaN NaN ... 45.69 46.68 46.24
IIIIIII NaN NaN NaN ... 75.10 74.88 74.40
JJJJJJJ 328.76 328.25 327.74 ... 6168.00 6448.00 6296.00
KKKKKKK NaN NaN NaN ... 23.49 23.50 24.04
LLLLLLL 4.45 4.41 4.34 ... 36.55 35.96 NaN
MMMMMMM 1.96 1.96 1.94 ... 141.23 146.03 NaN
NNNNNNN 1.09 1.09 1.09 ... 267.99 287.05 NaN
OOOOOOO 1.09 1.09 1.08 ... 201.53 207.17 NaN
PPPPPPP NaN NaN NaN ... 98.00 100.80 100.50
QQQQQQQ NaN NaN NaN ... 129.00 128.40 124.20
RRRRRRR NaN NaN NaN ... 140.60 141.45 139.60
[18 rows x 7658 columns]
and df2
Symbol Average Purchase Price Quantity
0 AAAAAAA 49.980 320.0
1 BBBBBBB 239.125 120.0
2 CCCCCCC 223.040 40.0
3 DDDDDDD 90.370 100.0
4 EEEEEEE 701.300 10.0
5 FFFFFFF 35.150 120.0
6 GGGGGGG 1259.000 700.0
7 HHHHHHH 32.050 250.0
8 IIIIIII 53.300 240.0
9 JJJJJJJ 6805.000 130.0
10 KKKKKKK 27.590 1000.0
11 LLLLLLL 82.120 170.0
12 MMMMMMM 106.470 150.0
13 NNNNNNN 95.970 308.0
14 OOOOOOO 81.420 150.0
15 PPPPPPP 39.690 60.0
16 QQQQQQQ 35.270 104.0
17 RRRRRRR 68.240 12.0
however when I use the function:
date = '2020-04-14'
total = df2[['Quantity']].mul(df1[date], axis=0)
print(total)
(Ideally, I'd like to do it for every date but I'm just learning so I thought I'd start out with one date)
I get:
Quantity
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
AAAAAAA NaN
BBBBBBB NaN
CCCCCCC NaN
DDDDDDD NaN
EEEEEEE NaN
FFFFFFF NaN
GGGGGGG NaN
HHHHHHH NaN
IIIIIII NaN
JJJJJJJ NaN
KKKKKKK NaN
LLLLLLL NaN
MMMMMMM NaN
NNNNNNN NaN
OOOOOOO NaN
PPPPPPP NaN
QQQQQQQ NaN
RRRRRRR NaN
how can I solve this?
It is an index problem. The index of the product shows that the first dataframe is indexed by symbol, while the second has a sequential integer index, so no labels match and every product is NaN. Assuming that no symbol is repeated in either dataframe, you could set Symbol as the index of the second one:
date = '2020-04-14'
total = df2.set_index('Symbol')[['Quantity']].mul(df1[date], axis=0)
print(total)
it gives:
Quantity
Symbol
AAAAAAA 33107.2
BBBBBBB 37794.0
CCCCCCC 2348.8
DDDDDDD 10603.0
EEEEEEE 12692.3
FFFFFFF 4422.0
GGGGGGG 1076250.0
HHHHHHH 11670.0
IIIIIII 17971.2
JJJJJJJ 838240.0
KKKKKKK 23500.0
LLLLLLL 6113.2
MMMMMMM 21904.5
NNNNNNN 88411.4
OOOOOOO 31075.5
PPPPPPP 6048.0
QQQQQQQ 13353.6
RRRRRRR 1697.4
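The question also mentions wanting this for every date. Once the quantities are indexed by symbol, the whole price table can be multiplied in one call. This is a sketch using two symbols and two dates taken from the tables above; the `quantity` and `values` names are assumptions.

```python
import pandas as pd

# Two symbols and two dates copied from the question's tables.
df1 = pd.DataFrame(
    {"2020-04-09": [102.22, 308.70], "2020-04-14": [103.46, 314.95]},
    index=["AAAAAAA", "BBBBBBB"],
)
df2 = pd.DataFrame({"Symbol": ["AAAAAAA", "BBBBBBB"], "Quantity": [320.0, 120.0]})

# Index the quantities by symbol so they align with df1's row labels.
quantity = df2.set_index("Symbol")["Quantity"]

# axis=0 matches the Series against the row index and repeats it across all date columns.
values = df1.mul(quantity, axis=0)
print(values)
```

Dates on which a price is NaN stay NaN in the result, so a later `values.sum()` per date would skip those positions by default.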
The problem is in the indexing: your data frames have different indices. To make your code work, unify the indices of both data frames with the pandas.DataFrame.reset_index() method. You can use the following code.
>>> df1.reset_index(inplace=True)
This changes the index of df1 to integers from 0 to 17, the same index that df2 has. Note that the rows are then matched by position, so this only gives correct results if both data frames list the symbols in the same order.

Replacing labels with names using merge

I am trying to figure out how to do a merge. I have a labels.csv containing the names that should replace the numbers in the matching field of my dat.csv.
My dat.csv is as follows:
Id,Help in household,Maths,Reading,Science,Social
11011001001,4,20.37,,27.78,
11011001002,3,12.96,,38.18,
11011001003,4,27.78,70,,
11011001004,4,,56.67,,36
11011001005,1,,,14.55,8.33
11011001006,4,,23.33,,30
11011001007,4,40.74,70,,
11011001008,3,,26.67,,22.92
11011001009,2,24.07,,25.45,
11011001010,4,18.52,26.67,,
11011001012,2,37.04,16.67,,
11011001013,4,20.37,,20,
11011001014,2,,,29.63,35.42
11011001015,4,27.78,66.67,,
11011001016,0,18.52,,,
11011001017,4,,,42.59,32
11011001018,2,16.67,,,
11011001019,3,,,21.82,
11011001020,4,,20,,16
11011001021,1,,,18.52,16.67
My labels.csv is as follows:
Column,Name,Level,Rename
Help in household,Every day,4,Every day
Help in household,Never,1,Never
Help in household,Once a month,2,Once a month
Help in household,Once a week,3,Once a week
my programme is as follows:
import pandas as pd
df = pd.read_csv('dat.csv')
labels = pd.read_csv('labels.csv')
df=df.merge(labels,left_on='Help in household',right_on='Name',how='left')
print(df)
However, the names do not appear as I want them to.
STUID Help in household Maths % Reading % Science % Social % \
0 11011001001 4 20.37 NaN 27.78 NaN
1 11011001002 3 12.96 NaN 38.18 NaN
2 11011001003 4 27.78 70.00 NaN NaN
3 11011001004 4 NaN 56.67 NaN 36.00
4 11011001005 1 NaN NaN 14.55 8.33
5 11011001006 4 NaN 23.33 NaN 30.00
6 11011001007 4 40.74 70.00 NaN NaN
7 11011001008 3 NaN 26.67 NaN 22.92
8 11011001009 2 24.07 NaN 25.45 NaN
9 11011001010 4 18.52 26.67 NaN NaN
10 11011001012 2 37.04 16.67 NaN NaN
11 11011001013 4 20.37 NaN 20.00 NaN
12 11011001014 2 NaN NaN 29.63 35.42
13 11011001015 4 27.78 66.67 NaN NaN
14 11011001016 0 18.52 NaN NaN NaN
15 11011001017 4 NaN NaN 42.59 32.00
16 11011001018 2 16.67 NaN NaN NaN
17 11011001019 3 NaN NaN 21.82 NaN
18 11011001020 4 NaN 20.00 NaN 16.00
19 11011001021 1 NaN NaN 18.52 16.67
Column Name Level Rename
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
11 NaN NaN NaN NaN
12 NaN NaN NaN NaN
13 NaN NaN NaN NaN
14 NaN NaN NaN NaN
15 NaN NaN NaN NaN
16 NaN NaN NaN NaN
17 NaN NaN NaN NaN
18 NaN NaN NaN NaN
19 NaN NaN NaN NaN
What am I doing wrong?
Okay, is this what you want?
df['Name'] = df['Help in household'].map(labels.set_index('Level')['Name'])
Output:
Id Help in household Maths Reading Science Social \
0 11011001001 4 20.37 NaN 27.78 NaN
1 11011001002 3 12.96 NaN 38.18 NaN
2 11011001003 4 27.78 70.00 NaN NaN
3 11011001004 4 NaN 56.67 NaN 36.00
4 11011001005 1 NaN NaN 14.55 8.33
5 11011001006 4 NaN 23.33 NaN 30.00
6 11011001007 4 40.74 70.00 NaN NaN
7 11011001008 3 NaN 26.67 NaN 22.92
8 11011001009 2 24.07 NaN 25.45 NaN
9 11011001010 4 18.52 26.67 NaN NaN
10 11011001012 2 37.04 16.67 NaN NaN
11 11011001013 4 20.37 NaN 20.00 NaN
12 11011001014 2 NaN NaN 29.63 35.42
13 11011001015 4 27.78 66.67 NaN NaN
14 11011001016 0 18.52 NaN NaN NaN
15 11011001017 4 NaN NaN 42.59 32.00
16 11011001018 2 16.67 NaN NaN NaN
17 11011001019 3 NaN NaN 21.82 NaN
18 11011001020 4 NaN 20.00 NaN 16.00
19 11011001021 1 NaN NaN 18.52 16.67
Name
0 Every day
1 Once a week
2 Every day
3 Every day
4 Never
5 Every day
6 Every day
7 Once a week
8 Once a month
9 Every day
10 Once a month
11 Every day
12 Once a month
13 Every day
14 NaN
15 Every day
16 Once a month
17 Once a week
18 Every day
19 Never
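For completeness, the original merge returns NaN because it compares the numeric codes in 'Help in household' with the text in 'Name'; joining on 'Level' instead gives the same result as the map above. The sketch below uses three rows from dat.csv and the labels table, typed in by hand rather than read from the files.

```python
import pandas as pd

# Three rows from dat.csv, including the unmatched code 0.
df = pd.DataFrame({
    "Id": [11011001001, 11011001002, 11011001016],
    "Help in household": [4, 3, 0],
})
labels = pd.DataFrame({
    "Column": ["Help in household"] * 4,
    "Name": ["Every day", "Never", "Once a month", "Once a week"],
    "Level": [4, 1, 2, 3],
})

# Join the numeric code against 'Level' (not 'Name'), then drop the helper column.
merged = df.merge(
    labels[["Level", "Name"]],
    left_on="Help in household",
    right_on="Level",
    how="left",
).drop(columns="Level")
print(merged)
```

With `how="left"` every row of `df` is kept, and a code with no entry in labels.csv (0 here) gets NaN in 'Name'.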

Pandas: Merge data with different timing

I have two data frames that contain time-series data that are on different ranges. One starts earlier, and ends earlier. Also, one is monthly and one is quarterly. However, the index of both is in the form of YYYY-MM-DD. Is there a cute way of merging these dataframes using "Python" and "Pandas"?
Thanks!
/edit
One set:
DATE GDP GPDI NFLS
0 1947-01-01 243.1 35.9 112.815
1 1947-04-01 246.3 34.5 111.253
2 1947-07-01 250.1 34.9 113.023
3 1947-10-01 260.3 43.2 111.440
The other one:
DATE INDPRO M08354USM310NNBR GDP
(...)
334 1946-11-01 13.3916 NaN NaN
335 1946-12-01 13.4721 NaN NaN
336 1947-01-01 13.6332 42.8 NaN
337 1947-02-01 13.7137 42.5 NaN
Together I would like to join them, such that
DATE INDPRO M08354USM310NNBR GDP GPDI NFLS
1946-11-01 13.3916 NaN NaN NaN NaN
1946-12-01 13.4721 NaN NaN NaN NaN
1947-01-01 13.6332 42.8 243.1 35.9 112.815
1947-02-01 13.7137 42.5 NaN NaN NaN
(...)
Just perform an outer merge; the fact that the periods are different and mostly don't overlap actually suits you:
merged = df1.merge(df2, on='DATE', how='outer')
merged
Out[54]:
DATE GDP_x GPDI NFLS INDPRO M08354USM310NNBR GDP_y
0 1947-01-01 243.1 35.9 112.815 13.6332 42.8 NaN
1 1947-04-01 246.3 34.5 111.253 NaN NaN NaN
2 1947-07-01 250.1 34.9 113.023 NaN NaN NaN
3 1947-10-01 260.3 43.2 111.440 NaN NaN NaN
4 1946-11-01 NaN NaN NaN 13.3916 NaN NaN
5 1946-12-01 NaN NaN NaN 13.4721 NaN NaN
6 1947-02-01 NaN NaN NaN 13.7137 42.5 NaN
[7 rows x 7 columns]
You can then rename, fill, or drop the erroneous 'GDP_y' column.
To sort by the merged 'DATE' column, call sort_values (DataFrame.sort has since been removed from pandas):
In [57]:
merged.sort_values('DATE')
Out[57]:
DATE GDP_x GPDI NFLS INDPRO M08354USM310NNBR GDP_y
4 1946-11-01 NaN NaN NaN 13.3916 NaN NaN
5 1946-12-01 NaN NaN NaN 13.4721 NaN NaN
0 1947-01-01 243.1 35.9 112.815 13.6332 42.8 NaN
6 1947-02-01 NaN NaN NaN 13.7137 42.5 NaN
1 1947-04-01 246.3 34.5 111.253 NaN NaN NaN
2 1947-07-01 250.1 34.9 113.023 NaN NaN NaN
3 1947-10-01 260.3 43.2 111.440 NaN NaN NaN
[7 rows x 7 columns]
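One possible way to tidy the duplicated GDP column and sort in a single pass is sketched below. The frames are shortened versions of the two in the question; the `quarterly` and `monthly` names and the suffixes are assumptions, and `combine_first` is one option among several for collapsing the two GDP columns.

```python
import numpy as np
import pandas as pd

# Shortened versions of the two frames in the question.
quarterly = pd.DataFrame({
    "DATE": pd.to_datetime(["1947-01-01", "1947-04-01"]),
    "GDP": [243.1, 246.3],
    "GPDI": [35.9, 34.5],
})
monthly = pd.DataFrame({
    "DATE": pd.to_datetime(["1946-12-01", "1947-01-01", "1947-02-01"]),
    "INDPRO": [13.4721, 13.6332, 13.7137],
    "GDP": [np.nan, np.nan, np.nan],
})

# Outer merge keeps every date from both frames; explicit suffixes name the GDP clash.
merged = monthly.merge(quarterly, on="DATE", how="outer",
                       suffixes=("_monthly", "_quarterly"))

# Prefer the quarterly GDP, falling back to the monthly column where it is missing.
merged["GDP"] = merged["GDP_quarterly"].combine_first(merged["GDP_monthly"])
merged = (merged.drop(columns=["GDP_monthly", "GDP_quarterly"])
                .sort_values("DATE")
                .reset_index(drop=True))
print(merged)
```

If the monthly frame's GDP column is known to be empty, dropping it before the merge avoids the suffixes altogether.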
