Merging columns and removing duplicates with Pandas

I need to merge similar columns and remove duplicates (entries with the same date). The data frame:
Albumin C-reactive protein CRP Ferritin Haemoglobin Hb Iron Nancy Index Plasma Platelets Transferrin saturation % Transferrin saturations UCEIS (0 to 8) WCC White Cell Count test_date
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 12.35 2016-04-17 23:00:00
1 NaN NaN NaN NaN 133.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN 406.0 NaN NaN NaN NaN NaN 2016-04-17 23:00:00
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 406.0 NaN NaN NaN NaN NaN 2016-04-17 23:00:00
4 NaN 32.2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
5 36.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
6 NaN NaN NaN 99.7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016-04-17 23:00:00
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 25.0 NaN NaN NaN NaN 2016-04-17 23:00:00
12 36.0 NaN 32.2 99.7 NaN 133.0 NaN NaN NaN 406.0 NaN 25.0 NaN 12.35 NaN 2016-04-17 23:00:00
14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN NaN 2016-04-25 23:00:00
79 34.0 NaN 5.4 55.9 NaN 133.0 NaN NaN NaN 372.0 NaN 28.0 NaN 7.99 NaN 2016-06-12 23:00:00
I need to get:
Albumin CRP Ferritin Hb Nancy Index Plasma Platelets Transferrin saturations UCEIS (0 to 8) WCC test_date
12 36.0 32.2 99.7 133.0 NaN NaN 406.0 25.0 NaN 12.35 2016-04-17 23:00:00
14 NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN 2016-04-25 23:00:00
79 34.0 5.4 55.9 133.0 NaN NaN 372.0 28.0 NaN 7.99 2016-06-12 23:00:00
So the column 'C-reactive protein' should be merged with 'CRP', 'Haemoglobin' with 'Hb', and 'Transferrin saturation %' with 'Transferrin saturations'.
I can easily remove duplicates with .drop_duplicates(), but the trick is to remove not only rows with the same date, but also to make sure that values duplicated across equivalent columns are collapsed. For example, 'C-reactive protein' at row 4 has the same value as 'CRP' at row 12, and both rows have the same entry date. Given all that, I need only a 'CRP' column with the value 32.2 for the date 2016-04-17 (plus the other unique columns).
EDIT
Some entries are really duplicates (absolutely identical, due to system glitches), for example (last three rows, on 2016-06-20, indices '803' and '122'). Is the solution below capable of removing such identical rows?
P.S. Thanks for the amazing and general solution for duplicate, but not identical entries.
Albumin C-reactive protein CRP Ferritin Haemoglobin Hb Iron Nancy Index Plasma Platelets Transferrin saturation % Transferrin saturations UCEIS (0 to 8) WCC White Cell Count setName test_date
735 39.0 NaN 0.4 52.0 NaN 144.0 NaN NaN NaN 197.0 NaN 25.0 NaN 4.88 NaN Bloods 2016-05-31 23:00:00
803 40.0 NaN 0.2 81.0 NaN 147.0 NaN NaN NaN 234.0 NaN 35.0 NaN 8.47 NaN Bloods 2016-06-20 23:00:00
347 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN Research Bloods 2016-06-20 23:00:00
122 40.0 NaN 0.2 81.0 NaN 147.0 NaN NaN NaN 234.0 NaN 35.0 NaN 8.47 NaN Bloods 2016-06-20 23:00:00

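I think you need groupby with the columns renamed by a dict. Note that the dict keys must match the column names exactly: the key 'Hemoglobin' below does not match the 'Haemoglobin' column, which is why 'Haemoglobin' and 'Hb' both still appear in the output; use 'Haemoglobin' as the key to merge them: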
d = {'C-reactive protein':'CRP', 'Hemoglobin':'Hb',
'Transferrin saturation %':'Transferrin saturations'}
df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max()
print (df)
Albumin CRP Ferritin Haemoglobin Hb Iron \
test_date
2016-04-17 23:00:00 36.0 32.2 99.7 133.0 133.0 NaN
2016-04-25 23:00:00 NaN NaN NaN NaN NaN NaN
2016-06-12 23:00:00 34.0 5.4 55.9 NaN 133.0 NaN
Nancy Index Plasma Platelets Transferrin saturations \
test_date
2016-04-17 23:00:00 NaN NaN 406.0 25.0
2016-04-25 23:00:00 NaN NaN NaN NaN
2016-06-12 23:00:00 NaN NaN 372.0 28.0
UCEIS (0 to 8) WCC White Cell Count
test_date
2016-04-17 23:00:00 NaN 12.35 12.35
2016-04-25 23:00:00 7.0 NaN NaN
2016-06-12 23:00:00 NaN 7.99 NaN
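On newer pandas, where groupby(axis=1) is deprecated, the same idea can be sketched with a transpose. This is only a minimal illustration on made-up rows, not the answerer's exact code; the `synonyms` mapping and the tiny frame are assumptions chosen to mirror the question.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's frame: two spellings per test.
df = pd.DataFrame({
    "C-reactive protein": [32.2, np.nan, np.nan],
    "CRP": [np.nan, 32.2, 5.4],
    "Haemoglobin": [np.nan, 133.0, np.nan],
    "Hb": [133.0, np.nan, 133.0],
    "test_date": pd.to_datetime(
        ["2016-04-17 23:00", "2016-04-17 23:00", "2016-06-12 23:00"]
    ),
})

# Long column name -> short column name (keys must match the columns exactly).
synonyms = {"C-reactive protein": "CRP", "Haemoglobin": "Hb"}

# Step 1: collapse rows that share a date; max() skips NaN.
per_date = df.groupby("test_date").max()

# Step 2: rename, then merge columns that now share a name.
# Transposing turns the duplicated column labels into index labels,
# which groupby(level=0) can combine without the deprecated axis=1.
merged = per_date.rename(columns=synonyms).T.groupby(level=0).max().T

print(merged)
```

Because max() is used twice, this variant keeps a single value per test and date, so it suits data where the repeated readings are true duplicates.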
A more general solution is to reshape with melt, remove the duplicates, and then build the wide DataFrame back:
d = {'C-reactive protein':'CRP', 'Hemoglobin':'Hb',
'Transferrin saturation %':'Transferrin saturations'}
df = df.rename(columns=d).groupby(axis=1, level=0).max()
df = pd.melt(df, id_vars='test_date').dropna(subset=['value']).drop_duplicates()
df = df.groupby(['test_date','variable'])['value'] \
.apply(lambda x: pd.Series(x.values)) \
.unstack(1) \
.reset_index(level=1, drop=True) \
.reset_index() \
.rename_axis(None,axis=1)
print (df)
test_date Albumin CRP Ferritin Hb Platelets \
0 2016-04-17 23:00:00 1000.0 32.2 99.7 1000.0 406.0
1 2016-04-17 23:00:00 36.0 NaN NaN 133.0 NaN
2 2016-04-25 23:00:00 NaN NaN NaN NaN NaN
3 2016-06-12 23:00:00 34.0 5.4 55.9 133.0 372.0
Transferrin saturations UCEIS (0 to 8) WCC White Cell Count
0 25.0 NaN 12.35 12.35
1 NaN NaN NaN NaN
2 NaN 7.0 NaN NaN
3 28.0 NaN 7.99 NaN
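The reshape idea can also be written so that the synonym mapping is applied to the melted `variable` values instead of the column labels. The sketch below is an assumption-laden illustration (the frame, the `synonyms` dict and the helper column `n` are all made up) showing that two different readings for the same test on the same date both survive, while exact repeats are dropped.

```python
import numpy as np
import pandas as pd

synonyms = {"C-reactive protein": "CRP"}  # hypothetical mapping

df = pd.DataFrame({
    "test_date": pd.to_datetime(["2016-04-17 23:00"] * 3),
    "CRP": [32.0, np.nan, 32.0],                 # 32.0 is recorded twice
    "C-reactive protein": [np.nan, 8.0, np.nan],  # a different reading
})

# Long format: one row per (date, test, value); drop the empty cells.
tidy = (
    df.melt(id_vars="test_date", var_name="test", value_name="value")
      .dropna(subset=["value"])
      .assign(test=lambda d: d["test"].replace(synonyms))
      .drop_duplicates()
)

# Number the remaining readings within each (date, test) pair so that
# several distinct values can sit on separate rows after unstacking.
tidy = tidy.assign(n=tidy.groupby(["test_date", "test"]).cumcount())

wide = (
    tidy.set_index(["test_date", "n", "test"])["value"]
        .unstack("test")
        .reset_index()
)
wide.columns.name = None

print(wide)
```

Here the repeated 32.0 collapses to one row, and the 8.0 reading is kept on a second row for the same date.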

What @jezrael was saying is that if you had a situation like this:
Albumin C-reactive protein CRP test_date
0 NaN NaN 32 2016-04-17 23:00:00
1 NaN 8.0 NaN 2016-04-17 23:00:00
then his method would erase the 8.0 reading and keep only the 32. This is because the line df = df.groupby('test_date').max().rename(columns=d).groupby(axis=1, level=0).max() works in three steps. First:
df = df.groupby('test_date').max() # selects the max of each column
# while collapsing rows that share a 'test_date'
which for my truncated example would give:
Albumin C-reactive protein CRP test_date
0 NaN 8.0 32 2016-04-17 23:00:00
then .rename(columns=d), giving:
Albumin CRP CRP test_date
0 NaN 8.0 32 2016-04-17 23:00:00
then .groupby(axis=1, level=0).max(), which groups the columns that share a name and takes the max across them within each row (instead of down each column), giving:
Albumin CRP test_date
0 NaN 32 2016-04-17 23:00:00
This last step is where you run the highest risk of losing data.
Alternative
I would split the original data into two frames first:
# keep 'test_date' in both halves so that equal values on different dates are not treated as duplicates
df1 = df[["test_date", "C-reactive protein", "Haemoglobin", ...]]
df2 = df[["test_date", "CRP", "Hb"]]
# then rename
df2 = df2.rename(columns={"CRP":"C-reactive protein", "Hb":"Haemoglobin", ...})
# use concat to stack them on top of one another
df3 = pd.concat([df1, df2]) # i've run out of names
df3 = df3.drop_duplicates() # perhaps also drop NAs?
but this is only necessary if you have multiple non-duplicate entries for the same test on the same day.
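A runnable version of the split-and-stack idea might look like the following. It is a sketch under assumptions: only one pair of synonymous columns is shown, the sample values are invented, and 'test_date' is carried in both halves so that duplicates are judged per date.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "test_date": pd.to_datetime(["2016-04-17 23:00"] * 3),
    "Haemoglobin": [np.nan, 140.0, np.nan],
    "Hb": [133.0, np.nan, 133.0],  # 133.0 is recorded twice
})

# Split by naming convention, then give both halves the same column name.
long_names = df[["test_date", "Haemoglobin"]]
short_names = df[["test_date", "Hb"]].rename(columns={"Hb": "Haemoglobin"})

# Stack the halves, drop empty cells, then drop exact repeats.
stacked = (
    pd.concat([long_names, short_names], ignore_index=True)
      .dropna(subset=["Haemoglobin"])
      .drop_duplicates()
      .reset_index(drop=True)
)

print(stacked)
```

Both the 140.0 and the 133.0 readings remain, each on its own row, which is the case the max-based approach cannot represent.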

Related

Pandas - Create Separate Columns in DataFrame Based on a Specific Column's Values

Let's say I have a simple Pandas DataFrame where one column contains a country name and another column contains some value. For example:
# Import Python Libraries
import numpy as np
import pandas as pd
# Create Sample DataFrame
df = pd.DataFrame(data={'Country': ['United States','United States','United States','United States', \
'United States','United States','United States','United States', \
'United States','United States','United States','United States', \
'Canada','Canada','Canada','Canada','Canada','Canada','Mexico', \
'Mexico','Mexico','Mexico','England','England','England','England', \
'England','England','England','England','England','England','England', \
'England','England','England','France','France','France','Spain','Germany', \
'Germany','Germany','Germany','Germany','Germany','Germany','Germany', \
'Germany','Germany'], 'Value': np.random.randint(1000, size=50)})
Which generates:
print(df.head())
Index Country Value
0 United States 943
1 United States 567
2 United States 534
3 United States 700
4 United States 470
My question is, what is the easiest way in Python to convert this DataFrame into one where each country has its own column and all the values of that country are listed in that column? In other words, how can I easily create a DataFrame where the number of columns is the unique count of countries in the 'Country' column, and that each column's length will vary depending on the number of times the corresponding country appears in the original DataFrame?
Here is sample code that provides a solution:
# Store Unique Country Names in Variable
columns = df['Country'].unique()
# Create Individual Country DataFrames
df_0 = df[df['Country'] == columns[0]]['Value'].values.tolist()
df_1 = df[df['Country'] == columns[1]]['Value'].values.tolist()
df_2 = df[df['Country'] == columns[2]]['Value'].values.tolist()
df_3 = df[df['Country'] == columns[3]]['Value'].values.tolist()
df_4 = df[df['Country'] == columns[4]]['Value'].values.tolist()
df_5 = df[df['Country'] == columns[5]]['Value'].values.tolist()
df_6 = df[df['Country'] == columns[6]]['Value'].values.tolist()
# Create Desired Output DataFrame
data_dict = {columns[0]: df_0, columns[1]: df_1, columns[2]: df_2, columns[3]: df_3, columns[4]: df_4, columns[5]: df_5, columns[6]: df_6}
new_df = pd.DataFrame({k:pd.Series(v[:len(df)]) for k,v in data_dict.items()})
Which generates:
print(new_df)
United States Canada Mexico England France Spain Germany
0 838.0 135.0 496.0 568.0 71.0 588.0 811.0
1 57.0 118.0 268.0 716.0 422.0 NaN 107.0
2 953.0 396.0 850.0 860.0 707.0 NaN 318.0
3 251.0 294.0 815.0 888.0 NaN NaN 633.0
4 127.0 466.0 NaN 869.0 NaN NaN 910.0
5 892.0 824.0 NaN 776.0 NaN NaN 472.0
6 11.0 NaN NaN 508.0 NaN NaN 466.0
7 563.0 NaN NaN 299.0 NaN NaN 200.0
8 864.0 NaN NaN 568.0 NaN NaN 637.0
9 810.0 NaN NaN 78.0 NaN NaN 392.0
10 268.0 NaN NaN 106.0 NaN NaN NaN
11 389.0 NaN NaN 153.0 NaN NaN NaN
12 NaN NaN NaN 217.0 NaN NaN NaN
13 NaN NaN NaN 941.0 NaN NaN NaN
While the above code works, it's obviously not a tenable solution for larger data sets. What is the most efficient way of generating this result from the original DataFrame?
Thank you!

Probably not the most performant solution out there, but it will get everything top justified.
df1 = df.groupby('Country').Value.agg(list).apply(pd.Series).T
df1.columns.name=None
Output: df1
Canada England France Germany Mexico Spain United States
0 653.0 187.0 396.0 491.0 251.0 433.0 919.0
1 215.0 301.0 25.0 107.0 755.0 NaN 435.0
2 709.0 581.0 858.0 691.0 158.0 NaN 166.0
3 626.0 706.0 NaN 572.0 767.0 NaN 352.0
4 516.0 999.0 NaN 393.0 NaN NaN 906.0
5 847.0 688.0 NaN 780.0 NaN NaN 489.0
6 NaN 722.0 NaN 19.0 NaN NaN 322.0
7 NaN 728.0 NaN 166.0 NaN NaN 753.0
8 NaN 765.0 NaN 299.0 NaN NaN 155.0
9 NaN 956.0 NaN 449.0 NaN NaN 438.0
10 NaN 41.0 NaN NaN NaN NaN 588.0
11 NaN 43.0 NaN NaN NaN NaN 796.0
12 NaN 485.0 NaN NaN NaN NaN NaN
13 NaN 218.0 NaN NaN NaN NaN NaN
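The other option is to make use of Coldspeed's justify function and Yuca's pivot output. Note that justify is a user-defined helper (it is not part of pandas or NumPy), so it has to be defined before this runs:
import numpy as np
df2 = df.pivot(index=None, columns='Country', values='Value')
df2 = pd.DataFrame(
    justify(df2.values, invalid_val=np.nan, axis=0, side='up'),
    columns=df2.columns
).dropna(axis=0, how='all')  # keyword arguments are required on current pandas
df2.columns.name=None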
Output: df2
Canada England France Germany Mexico Spain United States
0 653 187 396 491 251 433 919
1 215 301 25 107 755 NaN 435
2 709 581 858 691 158 NaN 166
3 626 706 NaN 572 767 NaN 352
4 516 999 NaN 393 NaN NaN 906
5 847 688 NaN 780 NaN NaN 489
6 NaN 722 NaN 19 NaN NaN 322
7 NaN 728 NaN 166 NaN NaN 753
8 NaN 765 NaN 299 NaN NaN 155
9 NaN 956 NaN 449 NaN NaN 438
10 NaN 41 NaN NaN NaN NaN 588
11 NaN 43 NaN NaN NaN NaN 796
12 NaN 485 NaN NaN NaN NaN NaN
13 NaN 218 NaN NaN NaN NaN NaN
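Since the justify helper itself is not shown here, the following is a small stand-in of my own, not Coldspeed's function. It assumes numeric data and only handles the "push valid values to the top of each column" case; the name `push_values_up` is hypothetical.

```python
import numpy as np
import pandas as pd


def push_values_up(frame: pd.DataFrame) -> pd.DataFrame:
    """Move the non-NaN values of each column to the top, keeping their order."""
    values = frame.to_numpy(dtype=float)
    # Stable argsort on the NaN mask: valid cells (False) sort before NaN (True),
    # and ties keep their original relative order.
    order = np.argsort(np.isnan(values), axis=0, kind="stable")
    packed = np.take_along_axis(values, order, axis=0)
    return pd.DataFrame(packed, columns=frame.columns).dropna(axis=0, how="all")


df = pd.DataFrame({"Country": ["A", "A", "B"], "Value": [1, 2, 3]})
wide = df.pivot(columns="Country", values="Value")  # staircase of NaN
packed = push_values_up(wide)
packed.columns.name = None

print(packed)
```

The stable sort is the design choice that matters: it preserves the original order of the values within each country.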

Use groupby, cumcount, and unstack with T:
df.set_index(['Country',df.groupby('Country').cumcount()])['Value'].unstack().T
Output:
Country Canada England France Germany Mexico Spain United States
0 535.0 666.0 545.0 522.0 581.0 525.0 394.0
1 917.0 130.0 76.0 882.0 563.0 NaN 936.0
2 344.0 376.0 960.0 442.0 247.0 NaN 819.0
3 760.0 272.0 NaN 604.0 976.0 NaN 975.0
4 745.0 199.0 NaN 512.0 NaN NaN 123.0
5 654.0 102.0 NaN 114.0 NaN NaN 690.0
6 NaN 570.0 NaN 318.0 NaN NaN 568.0
7 NaN 807.0 NaN 523.0 NaN NaN 385.0
8 NaN 18.0 NaN 890.0 NaN NaN 451.0
9 NaN 26.0 NaN 635.0 NaN NaN 282.0
10 NaN 871.0 NaN NaN NaN NaN 771.0
11 NaN 122.0 NaN NaN NaN NaN 505.0
12 NaN 0.0 NaN NaN NaN NaN NaN
13 NaN 578.0 NaN NaN NaN NaN NaN
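To see why this works, it helps to look at what cumcount produces on a tiny frame. The data below is invented for illustration; the steps are the same as in the one-liner above.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["US", "US", "Canada", "US", "Canada"],
    "Value": [10, 20, 30, 40, 50],
})

# cumcount numbers the rows within each country: 0, 1, 2, ...
position = df.groupby("Country").cumcount()

# (Country, position) is now a unique key, so unstack can spread the
# positions into columns; .T flips the result to one column per country.
wide = df.set_index(["Country", position])["Value"].unstack().T

print(position.tolist())
print(wide)
```

The within-group counter plays the role of the missing row index, which is what makes the values land at the top of each column.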

pd.pivot takes you halfway there. The issue is that your index carries no information about the position within each country, so the non-NaN values are not moved to the top of the df:
df.pivot(index=None, columns='Country', values = 'Value')
Country Canada England France ... Mexico Spain United States
0 NaN NaN NaN ... NaN NaN 992.0
1 NaN NaN NaN ... NaN NaN 814.0
2 NaN NaN NaN ... NaN NaN 489.0
3 NaN NaN NaN ... NaN NaN 943.0
4 NaN NaN NaN ... NaN NaN 574.0
5 NaN NaN NaN ... NaN NaN 428.0
6 NaN NaN NaN ... NaN NaN 907.0
7 NaN NaN NaN ... NaN NaN 899.0
8 NaN NaN NaN ... NaN NaN 379.0
9 NaN NaN NaN ... NaN NaN 130.0

Moving average in a pandas dataframe on valid values (non empty rows) [duplicate]

I have a df like this:
a001 a002 a003 a004 a005
time_axis
2017-02-07 1 NaN NaN NaN NaN
2017-02-14 NaN NaN NaN NaN NaN
2017-03-20 NaN NaN 2 NaN NaN
2017-04-03 NaN 3 NaN NaN NaN
2017-05-15 NaN NaN NaN NaN NaN
2017-06-05 NaN NaN NaN NaN NaN
2017-07-10 NaN 6 NaN NaN NaN
2017-07-17 4 NaN NaN NaN NaN
2017-07-24 NaN NaN NaN 1 NaN
2017-08-07 NaN NaN NaN NaN NaN
2017-08-14 NaN NaN NaN NaN NaN
2017-08-28 NaN NaN NaN NaN 5
I would like to compute, for each row, a rolling mean over the previous 3 valid values (skipping empty rows) and save it in another df:
last_3
time_axis
2017-02-07 1 # still there is only a row
2017-02-14 1 # only a valid value(in the first row) -> average is the value itself
2017-03-20 1.5 # average on the previous rows (only 2 rows contain value-> (2+1)/2
2017-04-03 2 # average on the previous rows with non-NaN values(2017-02-14 excluded) (2+3+1)/3
2017-05-15 2 # Same reason as the previous row
2017-06-05 2 # Same reason
2017-07-10 3.6 # Now the considered values are:2,3,6
2017-07-17 4.3 # considered values: 4,6,3
2017-07-24 3.6 # considered values: 1,4,6
2017-08-07 3.6 # no new values in this row, so again 1,4,6
2017-08-14 3.6 # same reason
2017-08-28 3.3 # now the considered values are: 5,1,4
I tried deleting the empty rows from the first dataframe and then applying rolling and mean, but I think it is the wrong approach (df1 in my example already exists):
df2 = df.dropna(how='all')
df1['last_3'] = df2.mean(axis=1).rolling(window=3, min_periods=3).mean()

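I think you need min_periods=1, so that the first rows are averaged over the values available so far, followed by ffill to carry the last average over the empty rows: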
df2 = df.dropna(how='all')
df['last_3'] = df2.mean(axis=1).rolling(window=3, min_periods=1).mean()
df['last_3'] = df['last_3'].ffill()
print (df)
a001 a002 a003 a004 a005 last_3
2017-02-07 1.0 NaN NaN NaN NaN 1.000000
2017-02-14 NaN NaN NaN NaN NaN 1.000000
2017-03-20 NaN NaN 2.0 NaN NaN 1.500000
2017-04-03 NaN 3.0 NaN NaN NaN 2.000000
2017-05-15 NaN NaN NaN NaN NaN 2.000000
2017-06-05 NaN NaN NaN NaN NaN 2.000000
2017-07-10 NaN 6.0 NaN NaN NaN 3.666667
2017-07-17 4.0 NaN NaN NaN NaN 4.333333
2017-07-24 NaN NaN NaN 1.0 NaN 3.666667
2017-08-07 NaN NaN NaN NaN NaN 3.666667
2017-08-14 NaN NaN NaN NaN NaN 3.666667
2017-08-28 NaN NaN NaN NaN 5.0 3.333333
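The same steps can be checked on a shortened copy of the question's data. This is a sketch with an explicit reindex in place of the column assignment; the variable names are my own.

```python
import numpy as np
import pandas as pd

nan = np.nan
idx = pd.to_datetime([
    "2017-02-07", "2017-02-14", "2017-03-20",
    "2017-04-03", "2017-05-15", "2017-07-10",
])
df = pd.DataFrame({
    "a001": [1, nan, nan, nan, nan, nan],
    "a002": [nan, nan, nan, 3, nan, 6],
    "a003": [nan, nan, 2, nan, nan, nan],
}, index=idx)

# One value per non-empty row (each row holds at most one reading here).
valid = df.dropna(how="all").mean(axis=1)

# Rolling mean over the last 3 valid rows; min_periods=1 lets the first
# rows use however many values exist so far.
rolled = valid.rolling(window=3, min_periods=1).mean()

# Put the result back on the full index and carry it over the empty rows.
last_3 = rolled.reindex(df.index).ffill()

print(last_3)
```

The rolling window never sees the empty rows because they are dropped before rolling and only restored afterwards.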

Pandas Match row to column values

I have a JSON output which I am trying to get into Excel.
What I am trying to do is use the WEIGHT as the column header.
I could get this output using some loops.
What I want is to have all the weights as column headers in the first row, with each item's value filled in where it exists and NaN otherwise.
Desired Output:
page = requests.get(mainurl)
data = json.loads(page.text)
for i in data['categories']:
    for j in i['items']:
        if a == 1: # so changes and appends keys per category (highlighted)
            a=2 # so not true in this loop
            s=tuple(j['prices'].keys())
            ws.append(s)
        PVAL=list(j['prices'].values())
        ws.append(PVAL)# append the value
    a=1 # makes true next category
p= []
for i in price: # I know this is absolute madness but dicts were getting sorted
    i = str(i).replace("'",'').replace('{','').replace('}','')# get price values
    p.append(i)
### append in excel
Note : As you can tell by the above code, I am a complete Beginner. And the above code could have been pretty with 2-3 lines of Pandas :(
I am now tinkering with Pandas to do it since I think it will be faster and better.
JsonOutput
MAJOR EDIT:
So I didn't have much time so I did this:
for i in data['categories']:
    for j in i['items']:
        PVAL=j['prices']
        try:
            ounce = PVAL['ounce']
        except:
            ounce = 'NaN'
        try:
            gram = PVAL['gram']
        except:
            gram = 'NaN'
        try:
            twograms = PVAL['two_grams']
        except:
            twograms='NaN'
        try:
            quarter=PVAL['quarter']
        except:
            quarter='NaN'
        try:
            eighth=PVAL['eighth']
        except:
            eighth='NaN'
        try:
            halfO=PVAL['half_ounce']
        except:
            halfO='NaN'
        try:
            unit = PVAL['unit']
        except:
            unit='NaN'
        try:
            halfgram = PVAL['half_gram']
        except:
            halfgram='NaN'
        name= j['name']
        cat = j['category_name']
        listname = j['listing_name']
        namel.append(name)
        catl.append(cat)
        listnamel.append(listname)
        halfOl.append(halfO)
        halfgraml.append(halfgram)
        unitl.append(unit)
        eighthl.append(eighth)
        twogramsl.append(twograms)
        quarterl.append(quarter)
        ouncel.append(ounce)
        graml.append(gram)
Then these lists are appended in Excel.
I know it is not Pythonic, but I am still trying to find out a good way to do it in Pandas.

As my rep is still low, I cannot post a comment yet, so I will post this here and edit it if further clarifications are provided.
I don't see any WEIGHTs in the desired output. If I understand the JSON file correctly, you are iterating over prices given a weight unit. Is the expected output a loop over each item that lists the price per weight unit, with NaN where a weight unit is not available? Is there a list of possible weight units?
Pandas also has a read_json function and can thus load this directly into a Pandas dataframe.
-- edited ---
Apologies for the delay. Please see the answer below.
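import json
import pandas as pd
from cytoolz.dicttoolz import merge  # third-party package
# replace the block below with your own JSON loader
with open('sample.json') as json_dta:
    dict_dta = json.load(json_dta)
list_columns = ['id', 'name', 'category_name', 'ounce', 'gram', 'two_grams', 'quarter', 'eighth','half_ounce','unit','half_gram']
# one row per item: walk categories -> items
# (pd.json_normalize replaces the older pd.io.json.json_normalize)
items = pd.json_normalize(dict_dta, ['categories', ['items']])
# expand each item's 'prices' into one column per weight unit; missing units become NaN
prices = items.prices.apply(lambda y: pd.Series(merge(y)))
df = items.drop(columns='prices').join(prices)[list_columns]
The above results in: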
id name category_name ounce gram two_grams quarter eighth half_ounce unit half_gram
0 10501503 Recon Indica 99.0 9.0 0.0 40.0 25.0 70.0 NaN NaN
1 11614583 Kush Dawg Indica 99.0 9.0 0.0 40.0 25.0 70.0 NaN NaN
2 8602219 OG Kush Indica 99.0 9.0 0.0 40.0 25.0 70.0 NaN NaN
3 11448858 Poison OG Outdoor Sativa 69.0 9.0 0.0 40.0 25.0 50.0 NaN NaN
4 11731126 SunBurn 2.0 Outdoor Sativa 69.0 0.0 0.0 0.0 0.0 0.0 NaN NaN
5 6412418 Poison OG Sativa 99.0 9.0 18.0 40.0 25.0 70.0 NaN NaN
6 8982466 Sativa Trim Sativa 30.0 0.0 0.0 0.0 0.0 15.0 NaN NaN
7 11545434 Chupacabra Outdoor Hybrid 69.0 9.0 0.0 40.0 25.0 50.0 NaN NaN
8 11458944 Platinum Girl Scout Cookies Outdoor Hybrid 69.0 9.0 0.0 40.0 25.0 50.0 NaN NaN
9 11296163 Bubblegum Hybrid 99.0 9.0 0.0 40.0 25.0 70.0 NaN NaN
10 11614623 C4 Hybrid 99.0 9.0 0.0 40.0 25.0 70.0 NaN NaN
11 11333124 Chem Dawg Outdoor Hybrid 69.0 9.0 0.0 40.0 25.0 50.0 NaN NaN
12 11458988 Candy Kush Hybrid 99.0 9.0 0.0 40.0 25.0 70.0 NaN NaN
13 10501592 Candy Kush Outdoor Hybrid 69.0 9.0 0.0 40.0 25.0 50.0 NaN NaN
14 9123804 ZOOTROCKS LemonGrass Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
15 9412336 Cherry Limeade 100mg Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
16 4970503 Peanut Budda Buddha, 100mg REC Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
17 9412238 Golden Strawberry Puck 100mg REC - CO Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
18 9412232 Cherry Puck 100mg REC - CO Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
19 9412228 Assorted Sour Pucks 100mg REC - CO Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
20 6454686 Assorted Fruity Pucks 100mg REC - CO Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
21 9412295 Sour Gummies Sativa 100mg, Recreational Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
22 7494303 Cheeba Chews Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
23 9411974 Mile High Mint, 100mg REC Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
24 9411972 Boulder Bar, 100mg Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
25 9412286 Sour Gummies Indica 100mg, Recreational Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
26 9412242 Watermelon Puck 100mg - REC Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
27 10066310 Coffee & Doughnuts Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
28 10065124 Wildflower Honey Edible NaN NaN NaN NaN NaN NaN 24.0 NaN
29 10064962 Clover Honey Edible NaN NaN NaN NaN NaN NaN 24.0 NaN
30 9412290 Sour Gummies Peach 100mg, Recreational Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
31 5926966 Stratos 100mg Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
32 10066271 Salt & Nibs Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
33 10065225 Yampa Valley Honey Edible NaN NaN NaN NaN NaN NaN 24.0 NaN
34 9412873 Fruit Punch Mints 100mg Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
35 9412251 Sour Gummies Hybrid 100mg, Recreational Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
36 9412922 Dutch Girl Carmel Waffle, 100mg Edible NaN NaN NaN NaN NaN NaN 20.0 NaN
37 6790292 Hybrid Distillate Jar Concentrate NaN 36.0 0.0 NaN NaN NaN NaN 0.0
38 6379060 Hybrid Cartridge Concentrate NaN 25.0 0.0 NaN NaN NaN NaN 18.0
39 9009149 Pure Cannabis Oil Hybrid Concentrate NaN 25.0 0.0 NaN NaN NaN NaN 0.0
40 9400145 Pure Cannabis Oil Sativa Concentrate NaN 25.0 0.0 NaN NaN NaN NaN 0.0
41 9409961 Sativa Cartridge Concentrate NaN 25.0 0.0 NaN NaN NaN NaN 18.0
42 9400121 Pure Cannabis Oil Indica Concentrate NaN 25.0 0.0 NaN NaN NaN NaN 0.0
43 9409954 Indica Cartridge Concentrate NaN 25.0 0.0 NaN NaN NaN NaN 18.0
44 9400467 Indica Distillate Jar Concentrate NaN 36.0 0.0 NaN NaN NaN NaN 0.0
45 9691836 PWO Wax by Mahatma Concentrate NaN 25.0 0.0 NaN NaN NaN NaN 0.0
46 9409970 Sativa Distillate Jar Concentrate NaN 36.0 0.0 NaN NaN NaN NaN 0.0
47 6134675 Bongs Gear NaN NaN NaN NaN NaN NaN 40.0 NaN
48 5993354 Small Glass Pipes Gear NaN NaN NaN NaN NaN NaN 10.0 NaN
49 4393434 Large Glass Pipes Gear NaN NaN NaN NaN NaN NaN 20.0 NaN
50 5941409 Pain Relief Salve, 2oz Topicals NaN NaN NaN NaN NaN NaN 26.0 NaN
51 8768835 THC Pain Stick Topicals NaN NaN NaN NaN NaN NaN 20.0 NaN
52 6370279 FORIA Pleasure (30ml) Spray Bottle Topicals NaN NaN NaN NaN NaN NaN 55.0 NaN
53 8911546 Bath Soak Topicals NaN NaN NaN NaN NaN NaN 20.0 NaN
54 9123854 FORIA Relief (2-pack) Suppositories Topicals NaN NaN NaN NaN NaN NaN 24.0 NaN
55 4187102 1 Gram Strain Specific-Prerolls Preroll NaN NaN NaN NaN NaN NaN 9.0 NaN
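For comparison, here is a dependency-light sketch of the same reshaping that avoids cytoolz. It assumes each item's 'prices' is a plain dict keyed by weight unit, as the question's loops suggest; the inline `data` and the `weight_units` list are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the parsed JSON (same nesting as in the question).
data = {"categories": [
    {"items": [
        {"name": "Item A", "category_name": "Indica",
         "prices": {"ounce": 99.0, "gram": 9.0}},
        {"name": "Item B", "category_name": "Edible",
         "prices": {"unit": 20.0}},
    ]},
]}
weight_units = ["ounce", "gram", "unit", "half_gram"]

# Flatten categories -> items into one list of item dicts.
items = [item for category in data["categories"] for item in category["items"]]

# json_normalize expands the nested dict into 'prices.<unit>' columns.
flat = pd.json_normalize(items)
flat.columns = [
    c.replace("prices.", "", 1) if c.startswith("prices.") else c
    for c in flat.columns
]

# reindex fixes the column order and adds any unit that never occurs as NaN.
table = flat.reindex(columns=["name", "category_name"] + weight_units)

print(table)
```

From here `table.to_excel(...)` could write the sheet, provided an Excel writer engine is installed.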

Sorting the columns of a pandas dataframe

In [1015]: gp2
Out[1015]:
department MOBILE QA TA WEB MOBILE QA TA WEB
minutes minutes minutes minutes growth growth growth growth
period
2016-12-24 NaN NaN 140.0 400.0 NaN NaN 0.0 260.0
2016-12-25 NaN NaN NaN 80.0 NaN NaN NaN -320.0
2016-12-26 NaN NaN NaN 20.0 NaN NaN NaN -60.0
2016-12-27 NaN 45.0 NaN 180.0 NaN 25.0 NaN 135.0
2016-12-28 600.0 NaN NaN 15.0 420.0 NaN NaN -585.0
... ... ... ... ... ... ... ... ...
2017-01-03 NaN NaN NaN 80.0 NaN NaN NaN -110.0
2017-01-04 20.0 NaN NaN NaN -60.0 NaN NaN NaN
2017-02-01 120.0 NaN NaN NaN 100.0 NaN NaN NaN
2017-02-02 45.0 NaN NaN NaN -75.0 NaN NaN NaN
2017-02-03 NaN 45.0 NaN 30.0 NaN 0.0 NaN -15.0
I need MOBILE.minutes and MOBILE.growth to be one after another.
I tried this
In [1019]:gp2.columns = gp2.columns.sort_values()
In [1020]: gp2
Out[1020]:
department MOBILE QA TA WEB
growth minutes growth minutes growth minutes growth minutes
period
2016-12-24 NaN NaN 140.0 400.0 NaN NaN 0.0 260.0
2016-12-25 NaN NaN NaN 80.0 NaN NaN NaN -320.0
2016-12-26 NaN NaN NaN 20.0 NaN NaN NaN -60.0
2016-12-27 NaN 45.0 NaN 180.0 NaN 25.0 NaN 135.0
2016-12-28 600.0 NaN NaN 15.0 420.0 NaN NaN -585.0
... ... ... ... ... ... ... ... ...
2017-01-03 NaN NaN NaN 80.0 NaN NaN NaN -110.0
2017-01-04 20.0 NaN NaN NaN -60.0 NaN NaN NaN
2017-02-01 120.0 NaN NaN NaN 100.0 NaN NaN NaN
2017-02-02 45.0 NaN NaN NaN -75.0 NaN NaN NaN
2017-02-03 NaN 45.0 NaN 30.0 NaN 0.0 NaN -15.0
That sorted only the column labels; the values were not moved to match them.

Just use df.sort_index:
df = df.sort_index(level=[0, 1], axis=1)
print(df)
MOBILE QA TA WEB
growth minutes growth minutes growth minutes growth minutes
period
2016-12-24 NaN NaN NaN NaN 0.0 140.0 260.0 400.0
2016-12-25 NaN NaN NaN NaN NaN NaN -320.0 80.0
2016-12-26 NaN NaN NaN NaN NaN NaN -60.0 20.0
2016-12-27 NaN NaN 25.0 45.0 NaN NaN 135.0 180.0
2016-12-28 420.0 600.0 NaN NaN NaN NaN -585.0 15.0
2017-01-03 NaN NaN NaN NaN NaN NaN -110.0 80.0
2017-01-04 -60.0 20.0 NaN NaN NaN NaN NaN NaN
2017-02-01 100.0 120.0 NaN NaN NaN NaN NaN NaN
2017-02-02 -75.0 45.0 NaN NaN NaN NaN NaN NaN
2017-02-03 NaN NaN 0.0 45.0 NaN NaN -15.0 30.0
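The difference between relabeling and sorting can be seen on a one-row frame. This is a minimal sketch; the single row reuses the 2016-12-28 values for MOBILE and WEB from the question, and the variable names are assumptions.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ("MOBILE", "minutes"), ("WEB", "minutes"),
    ("MOBILE", "growth"), ("WEB", "growth"),
])
gp2 = pd.DataFrame(
    [[600.0, 15.0, 420.0, -585.0]],
    index=pd.Index(["2016-12-28"], name="period"),
    columns=columns,
)

# sort_index reorders the columns together with their data.
by_department = gp2.sort_index(axis=1, level=[0, 1])

# Assigning sorted labels only renames the columns in place:
# the data keeps its old order and ends up under the wrong headers.
relabelled = gp2.copy()
relabelled.columns = gp2.columns.sort_values()

print(by_department)
print(relabelled)
```

In `relabelled`, the 600.0 minutes value now sits under ('MOBILE', 'growth'), which is the mismatch described in the question.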

Get daily averages of monthly database

I have a long list of data structured in the following way
Date, Time, Temperature, Moisture, Accumulated precipitation
1/01/2011, 00:00, 23, 50, 2,
1/01/2011, 00:15, 22, 45, 1,
1/01/2011, 00:30, 20, 39, 0,
1/01/2011, 01:00, 25, 34, 0,
1/01/2011, 01:15, 23, 50, 0,
.
.
.
.
1/01/2011, 23:45, 22, 40, 0,
.
.
.
.
31/01/2011, 00:00, 23, 45, 0,
How can I get the daily averages of the variables Temperature and Moisture for each of the 31 days of the month?

This is the sort of thing that the pandas library is good at. The basic idea is that you can read data into objects called DataFrames, kind of like an Excel sheet, and then you can do neat things to them. Starting from a temps.csv file I made up to look like yours:
>>> import pandas as pd
>>> df = pd.read_csv("temps.csv", index_col=False, parse_dates=[[0,1]], skipinitialspace=True)
>>> df = df.rename(columns={"Date_Time": "Time"})
>>> df = df.set_index("Time")
>>> df
Temperature Moisture Accumulated precipitation
Time
2011-01-01 00:00:00 23 50 2
2011-01-01 00:15:00 22 45 1
2011-01-01 00:30:00 20 39 0
2011-01-01 01:00:00 25 34 0
2011-01-01 01:15:00 23 50 0
2011-01-01 23:45:00 22 40 0
2011-01-02 00:00:00 123 250 32
2011-01-02 00:15:00 122 245 31
2011-01-02 00:30:00 120 239 30
2011-01-02 01:00:00 125 234 30
2011-01-02 01:15:00 123 250 30
2011-01-02 23:45:00 122 240 30
Once we have the frame in a nice shape, we can easily resample to daily means (older pandas took the mean by default; current versions need an explicit .mean()):
>>> df.resample("D").mean()
Temperature Moisture Accumulated precipitation
Time
2011-01-01 22.5 43 0.5
2011-01-02 122.5 243 30.5
Or get the max or min:
>>> df.resample("D").max()
Temperature Moisture Accumulated precipitation
Time
2011-01-01 25 50 2
2011-01-02 125 250 32
>>> df.resample("D").min()
Temperature Moisture Accumulated precipitation
Time
2011-01-01 20 34 0
2011-01-02 120 234 30
Et cetera. Note that this is just the brute average of the recorded data points each day: if you want to resample differently to account for the different distance between measurements, that's easy too. If you're going to be doing data processing in Python, it's definitely worth reading through the 10 minute overview to see if it might be helpful.
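A self-contained variant that does not need a file on disk is sketched below. The inline CSV is invented, and the explicit day-first format string is my own choice to avoid ambiguity between day and month; the source answer relied on read_csv's date parsing instead.

```python
import io

import pandas as pd

# Made-up sample in the question's layout (day/month/year, 15-minute steps).
csv_text = """Date,Time,Temperature,Moisture
01/01/2011,00:00,23,50
01/01/2011,00:15,22,45
02/01/2011,00:00,123,250
02/01/2011,00:15,122,245
"""

df = pd.read_csv(io.StringIO(csv_text))

# Combine the two text columns and parse them with an explicit format.
df["Timestamp"] = pd.to_datetime(
    df["Date"] + " " + df["Time"], format="%d/%m/%Y %H:%M"
)

# Daily mean of the two variables of interest.
daily = (
    df.set_index("Timestamp")[["Temperature", "Moisture"]]
      .resample("D")
      .mean()
)

print(daily)
```

Selecting the two columns before resampling keeps text columns such as Date and Time out of the average.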

Using the suggestions on a different dataset, I did the following:
df = pd.read_csv('path-tracks.csv', index_col= 'Date', parse_dates=[0])
df
Lat Lon ID Moisture Temperature Category
Date
2004-02-05 06:45:00 19.7 -95.2 1 45 -38 CCM
2004-02-05 07:45:00 19.7 -94.7 1 34 -48 CCM
2004-02-05 08:45:00 19.3 -93.9 1 57 -60 CCM
2004-02-05 09:45:00 19.0 -93.5 1 89 -58 CCM
2004-02-05 10:45:00 19.0 -92.8 1 34 -50 CCM
2004-02-05 11:45:00 19.2 -92.6 1 23 -40 CCM
2004-02-05 12:45:00 19.9 -93.0 1 10 -43 CCM
2004-02-05 13:15:00 20.0 -92.8 1 50 -32 CCM
2004-05-30 04:45:00 23.1 -100.2 2 45 -45 SCME
2004-05-30 05:45:00 23.2 -100.0 2 68 -56 SCME
2004-05-30 06:45:00 23.3 -100.0 2 90 -48 SCME
2004-05-30 07:45:00 23.3 -100.2 2 100 -32 SCME
2004-05-31 03:15:00 23.4 -99.0 3 12 -36 SCM
2004-05-31 04:15:00 23.5 -98.9 3 34 -46 SCM
2004-05-31 05:15:00 23.6 -98.7 3 56 -68 SCM
2004-05-31 06:15:00 23.7 -98.8 3 78 -30 SCM
Now I try to get the daily sum as follows:
df.resample('D',how='sum')
I get the following:
Lat Lon ID Moisture Temperature
Date
2004-02-06 155.8 -748.5 8 342 -369
2004-02-07 NaN NaN NaN NaN NaN
[rows for 2004-02-08 through 2004-05-29 omitted: every value is NaN]
2004-05-30 NaN NaN NaN NaN NaN
2004-05-31 92.9 -400.4 8 303 -181
2004-06-01 94.2 -395.4 12 180 -180
Did I do something wrong? Why does the result not take the date 2004-02-05 06:45:00 into account? How do I fix this error?
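In the output above the sums themselves match the 2004-02-05 rows, but they appear under 2004-02-06. One possible explanation, offered only as an assumption, is that the daily bins were labeled by their right edge. The sketch below shows how the bin edges and labels can be stated explicitly with the current resample API; the three-row frame is invented.

```python
import pandas as pd

idx = pd.to_datetime([
    "2004-02-05 06:45", "2004-02-05 07:45", "2004-02-07 03:15",
])
tracks = pd.DataFrame(
    {"Moisture": [45, 34, 12], "Temperature": [-38, -48, -36]},
    index=idx,
)

# closed="left", label="left": each bin covers [00:00, 24:00) and is
# named after the day it starts on.
# min_count=1 leaves days without any rows as NaN instead of 0.
daily = tracks.resample("D", closed="left", label="left").sum(min_count=1)

# Drop the days that had no observations at all.
non_empty = daily.dropna(how="all")

print(daily)
print(non_empty)
```

With these settings the two 2004-02-05 rows are summed under 2004-02-05, and the empty day in between can be dropped afterwards.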
