How to add a new column with column names based on conditioned values? - python

I have a table that contains active COVID cases per country over a period of time. The columns are the country name and dates.
I need to find the max value of active cases per country and the corresponding date of that max value. I have created a list of max values but can't manage to create a column with the corresponding date.
I have written the following loop, but it returns only one date (the last one - 5/2/20):
for row in active_cases_data[column]:
    if row in max_cases:
        active_cases_data['date'] = column
[screenshot of df and resulting column]
The table looks like this:

country  4/29/20  4/30/20  5/1/20  5/2/20
Italy         67      105     250     240
I need an extra "date" column holding the date of the largest number in the row (in Italy's case that is 5/1/20, for the value 250), like this:

country  4/29/20  4/30/20  5/1/20  5/2/20  date
Italy         67      105     250     240  5/1/20

In pandas we try not to use Python loops unless we REALLY need them.
I suppose that your dataset looks something like this:
import pandas as pd

df = pd.DataFrame({"Country": ["Poland", "Ukraine", "Czechia", "Russia"],
                   "2021.12.30": [12, 23, 43, 43],
                   "2021.12.31": [15, 25, 40, 50],
                   "2022.01.01": [18, 27, 41, 70],
                   "2022.01.02": [21, 22, 42, 90]})
#    Country  2021.12.30  2021.12.31  2022.01.01  2022.01.02
# 0   Poland          12          15          18          21
# 1  Ukraine          23          25          27          22
# 2  Czechia          43          40          41          42
# 3   Russia          43          50          70          90
Short way:
You use idxmax(), after excluding the column with the country name:
df['Date'] = df.loc[:, df.columns != "Country"].idxmax(axis=1)
#    Country  2021.12.30  2021.12.31  2022.01.01  2022.01.02        Date
# 0   Poland          12          15          18          21  2022.01.02
# 1  Ukraine          23          25          27          22  2022.01.01
# 2  Czechia          43          40          41          42  2021.12.30
# 3   Russia          43          50          70          90  2022.01.02
Just be aware of running this line multiple times - it takes every column except the excluded one ("Country"), so on a second run the freshly added "Date" column would be considered as well.
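If the line may be re-run (e.g. in a notebook), one way to make it safe is to restrict idxmax() to the date columns explicitly, so a previously added "Date" column is never picked up - a minimal sketch using the df defined above:

# consider only the date columns, ignoring "Country" and any existing "Date"
date_cols = [c for c in df.columns if c not in ("Country", "Date")]
df["Date"] = df[date_cols].idxmax(axis=1)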
Long way:
First, I would transform the data from a wide to a long table:
df2 = df.melt(id_vars="Country", var_name="Date", value_name="Cases")
#     Country        Date  Cases
# 0    Poland  2021.12.30     12
# 1   Ukraine  2021.12.30     23
# 2   Czechia  2021.12.30     43
# 3    Russia  2021.12.30     43
# 4    Poland  2021.12.31     15
# ...
# 15   Russia  2022.01.02     90
With the long table we can find the needed rows in many different ways, for example:
df2 = df2.sort_values(by=["Country", "Cases", "Date"],
                      ascending=[True, False, False])
df2.groupby("Country").first().reset_index()
#    Country        Date  Cases
# 0  Czechia  2021.12.30     43
# 1   Poland  2022.01.02     21
# 2   Russia  2022.01.02     90
# 3  Ukraine  2022.01.01     27
By setting the last entry of the ascending parameter you can control which date is used in case of a tie.
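Another option on the long table df2 is to pick the row of the maximum Cases per Country directly with idxmax - a sketch; in case of a tie it keeps the first maximum in the current row order:

result = df2.loc[df2.groupby("Country")["Cases"].idxmax()].reset_index(drop=True)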

Related

Fill blank cells of a pandas dataframe column by matching with another dataframe column

I have a pandas dataframe, let's call it df1, that looks like this (the following is just a sample to give an idea of the dataframe):
Ac      Tp     Id       2020  2021  2022
Efecty  FC     IQ_EF     100   200    45
Asset   FC                52    48    15
Debt    P&G    IQ_DEBT    45    58    15
Tax     Other             48    45    78
And I want to fill the blank spaces in the 'Id' column using the following auxiliary dataframe, let's call it df2 (again, this is just a sample):
Ac      Tp     Id
Efecty  FC     IQ_EF
Asset   FC     IQ_AST
Debt    P&G    IQ_DEBT
Tax     Other  IQ_TAX
Income  BAL    IQ_INC
Invest  FC     IQ_INV
To get the df1 dataframe looking like this:

Ac      Tp     Id       2020  2021  2022
Efecty  FC     IQ_EF     100   200    45
Asset   FC     IQ_AST     52    48    15
Debt    P&G    IQ_DEBT    45    58    15
Tax     Other  IQ_TAX     48    45    78
I tried with this line of code but it did not work:
df1['Id'] = df1['Id'].mask(df1('nan')).fillna(df1['Ac'].map(df2('Ac')['Id']))
Can you guys help me?
Merge the two frames on the Ac and Tp columns and assign the Id column from this result to df1.Id. This works similarly to Excel's VLOOKUP functionality.
ac_tp = ['Ac', 'Tp']
df1['Id'] = df1[ac_tp].merge(df2[[*ac_tp, 'Id']])['Id']
In a similar vein you could also try:
df1['Id'] = (df1.merge(df2, on=['Ac', 'Tp'])
                .pipe(lambda d: d['Id_x'].mask(d['Id_x'].isnull(), d['Id_y'])))

       Ac     Tp       Id  2020  2021  2022
0  Efecty     FC    IQ_EF   100   200    45
1   Asset     FC   IQ_AST    52    48    15
2    Debt    P&G  IQ_DEBT    45    58    15
3     Tax  Other   IQ_TAX    48    45    78
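For reference, a minimal self-contained reproduction of the first approach, with small stand-in frames built from the sample tables above (the values are just the sample data, not the real dataset):

import numpy as np
import pandas as pd

# stand-ins for the frames shown above
df1 = pd.DataFrame({"Ac": ["Efecty", "Asset", "Debt", "Tax"],
                    "Tp": ["FC", "FC", "P&G", "Other"],
                    "Id": ["IQ_EF", np.nan, "IQ_DEBT", np.nan],
                    "2020": [100, 52, 45, 48],
                    "2021": [200, 48, 58, 45],
                    "2022": [45, 15, 15, 78]})
df2 = pd.DataFrame({"Ac": ["Efecty", "Asset", "Debt", "Tax", "Income", "Invest"],
                    "Tp": ["FC", "FC", "P&G", "Other", "BAL", "FC"],
                    "Id": ["IQ_EF", "IQ_AST", "IQ_DEBT", "IQ_TAX", "IQ_INC", "IQ_INV"]})

ac_tp = ["Ac", "Tp"]
df1["Id"] = df1[ac_tp].merge(df2[[*ac_tp, "Id"]])["Id"]
print(df1)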

Python: How to return the most occurring value in each row, depending on fixed columns?

I have a dataframe as below:
import pandas as pd

# initialise data of lists.
data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Book1': [20, 21, 19, 18],
        'Book2': [20, '', 12, 20],
        'Book3': [31, 21, 17, 16],
        'Book4': [31, 19, 18, 16]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output.
print(df)

Name   Book1  Book2  Book3  Book4
Tom       20     20     31     31
nick      21            21     19
krish     19     12     17     18
jack      18     20     16     16
I wish to get the output below, which compares the Book1, Book2, Book3 and Book4 columns. For the Tom row there are two 20s and two 31s; since the counts are tied, it should prefer the value that comes first (Book1), so the Output column is 20. For the nick row there are two 21s and one 19, so it takes the most frequent number, which is 21. For the krish row no number is repeated, so I want the output fixed to "Mix".
Output column as below:
Name   Book1  Book2  Book3  Book4  Output
Tom       20     20     31     31      20
nick      21            21     19      21
krish     19     12     17     18     Mix
jack      18     20     16     16      16
Anyone have ideas? I saw there is a mode function, but it does not seem applicable to this case. Please help, thanks.
Use value_counts:
max_val = lambda x: x.value_counts().index[0] \
                    if x.value_counts().iloc[0] > 1 else 'Mix'
df['Output'] = df.filter(like='Book').apply(max_val, axis=1)
print(df)

# Output:
    Name Book1 Book2 Book3 Book4 Output
0    Tom    20    20    31    31     20
1   nick    21          21    19     21
2  krish    19    12    17    18    Mix
3   jack    18    20    16    16     16
Update
If you use Python >= 3.8, you can use the walrus operator (avoiding a double call to value_counts):
max_val = lambda x: v.index[0] if (v := x.value_counts()).iloc[0] > 1 else 'Mix'
df['Output'] = df.filter(like='Book').apply(max_val, axis=1)
We can use your idea on mode to get your desired output. First, we need to convert the relevant columns to numeric data types (coercing the blank entry to NaN):
import numpy as np

temp = (df
        .filter(like='Book')
        .apply(pd.to_numeric, errors='coerce')  # blanks become NaN
        .mode(1)
        )

# compute for values
# nulls exist only if there are duplicates
output = np.where(temp.notna().all(1),
                  # value if True
                  'Mix',
                  # if False, pick the first modal value
                  temp.iloc[:, 0])

df.assign(output=output)
    Name Book1 Book2 Book3 Book4 output
0    Tom    20    20    31    31   20.0
1   nick    21          21    19   21.0
2  krish    19    12    17    18    Mix
3   jack    18    20    16    16   16.0

Pandas DataFrame - Issue regarding column formatting

I have a .txt file that holds the data regarding the total number of queries with valid names. The text inside the file came from a SQL Server 19 query output. The database used consists of the results of an algorithm that retrieves the brands most similar to the query inserted. The file looks something like this:
2 16, 42, 44 A MINHA SAÚDE
3 34 !D D DUNHILL
4 33 #MEGA
5 09 (michelin man)
5 12 (michelin man)
6 33 *MONTE DA PEDRA*
7 35 .FOX
8 33 #BATISTA'S BY PITADA VERDE
9 12 #COM
10 41 + NATUREZA HUMANA
11 12 001
12 12 002
13 12 1007
14 12 101
15 12 102
16 12 104
17 37 112 PC
18 33 1128
19 41 123 PILATES
The 1st column has the Query identifier, the 2nd one has the brand classes where the Query can be located and the 3rd one is the Query itself (the spaces came from the SQL Server output formatting).
I then made a Pandas DataFrame in Google Colaboratory where I wanted the columns to be like the ones in the text file. However, when I ran the code, the resulting columns did not come out like that (screenshot of the output omitted).
The code that I wrote is here:
# Dataframe with the total number of queries with valid names:
df = pd.DataFrame(pd.read_table("/content/drive/MyDrive/data/classes/100/queries100.txt",
                                header=None,
                                names=["Query ID", "Query Name", "Classes Where Query is Present"]))
df
I think that this happens because of the commas in the 2nd column but I'm not quite sure. Any suggestions on why this is happening? I already tried read_csv and read_fwf and they were even worse in terms of formatting.
You can use pd.read_fwf() in this case, as your columns have fixed widths:
import pandas as pd

df = pd.read_fwf(
    "/content/drive/MyDrive/data/classes/100/queries100.txt",
    colspecs=[(0, 20), (21, 40), (40, 1000)],
    header=None,
    names=["Query ID", "Query Name", "Classes Where Query is Present"]
)
df.head()

#    Query ID  Query Name  Classes Where Query is Present
# 0         2  16, 42, 44                   A MINHA SAÚDE
# 1         3          34                    !D D DUNHILL
# 2         4          33                           #MEGA
# 3         5          09                  (michelin man)
# 4         5          12                  (michelin man)
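If the widths in the real file are not perfectly fixed, a regex separator with read_csv is a possible alternative - a sketch that assumes the columns are separated by runs of two or more spaces (and that the class list only ever uses a single space after each comma):

import pandas as pd

df = pd.read_csv(
    "/content/drive/MyDrive/data/classes/100/queries100.txt",
    sep=r"\s{2,}",      # split on 2+ consecutive spaces (assumption about the layout)
    engine="python",    # the regex separator requires the python engine
    header=None,
    names=["Query ID", "Query Name", "Classes Where Query is Present"]
)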

Pandas: Convert annual data to decade data

Background
I want to determine the global cumulative value of a variable for different decades from 1990 to 2014, i.e. the 1990s, 2000s and 2010s (3 decades separately). I have annual data for different countries. However, data availability is not uniform.
Existing questions
Uses R: 1
Following questions look at date formatting issues: 2, 3
Answers to these questions do not address the current question.
Current question
How to obtain a global sum for the period of different decades using features/tools of Pandas?
Expected outcome
1990-2000 x1
2000-2010 x2
2010-2015 x3
Method used so far
data_binned = data_pivoted.copy()
decade = []

# obtaining decade values for each country
for i in range(1960, 2017):
    if i in list(data_binned):
        # adding the columns into the decade list
        decade.append(i)
    if i % 10 == 0:
        # adding large header so that newly created columns are set at the end of the dataframe
        data_binned[i * 10] = data_binned.apply(lambda x: sum(x[j] for j in decade), axis=1)
        decade = []

for x in list(data_binned):
    if x < 3000:
        # removing non-decade columns
        del data_binned[x]

# renaming the decade columns
new_names = [int(x / 10) for x in list(data_binned)]
data_binned.columns = new_names

# computing global values
global_values = data_binned.sum(axis=0)
This is a non-optimal method, owing to my limited experience with Pandas. Kindly suggest a better method which uses features of Pandas. Thank you.
If I had a pandas.DataFrame called df looking like this:
>>> df = pd.DataFrame(
...     {
...         1990: [1, 12, 45, 67, 78],
...         1999: [1, 12, 45, 67, 78],
...         2000: [34, 6, 67, 21, 65],
...         2009: [34, 6, 67, 21, 65],
...         2010: [3, 6, 6, 2, 6555],
...         2015: [3, 6, 6, 2, 6555],
...     }, index=['country_1', 'country_2', 'country_3', 'country_4', 'country_5']
... )
>>> print(df)
           1990  1999  2000  2009  2010  2015
country_1     1     1    34    34     3     3
country_2    12    12     6     6     6     6
country_3    45    45    67    67     6     6
country_4    67    67    21    21     2     2
country_5    78    78    65    65  6555  6555
I could make another pandas.DataFrame called df_decades with decade statistics like this:
>>> df_decades = pd.DataFrame()
>>>
>>> for decade in set([(col // 10) * 10 for col in df.columns]):
...     cols_in_decade = [col for col in df.columns if (col // 10) * 10 == decade]
...     df_decades[f'{decade}-{decade + 9}'] = df[cols_in_decade].sum(axis=1)
>>>
>>> df_decades = df_decades[sorted(df_decades.columns)]
>>> print(df_decades)
           1990-1999  2000-2009  2010-2019
country_1          2         68          6
country_2         24         12         12
country_3         90        134         12
country_4        134         42          4
country_5        156        130      13110
The idea behind this is to iterate over all decades derived from the column names in df, filter the columns that belong to each decade, and aggregate them.
Finally, I could merge these data frames together, so that df is enriched with the decade statistics from the second data frame df_decades.
>>> df = pd.merge(left=df, right=df_decades, left_index=True, right_index=True, how='left')
>>> print(df)
           1990  1999  2000  2009  2010  2015  1990-1999  2000-2009  2010-2019
country_1     1     1    34    34     3     3          2         68          6
country_2    12    12     6     6     6     6         24         12         12
country_3    45    45    67    67     6     6         90        134         12
country_4    67    67    21    21     2     2        134         42          4
country_5    78    78    65    65  6555  6555        156        130      13110
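For comparison, a more compact sketch of the same decade aggregation, run on the original df from the start of this answer (before the decade columns are merged in) and assuming the column labels are integer years: group the transposed frame by decade and sum.

# build a decade label for every year column, then aggregate per decade
decade_labels = [f"{(y // 10) * 10}-{(y // 10) * 10 + 9}" for y in df.columns]
df_decades = df.T.groupby(decade_labels).sum().T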

Python pandas random sample by row

I have a dataframe of samples, with a country column. The number of records per country is:
d1.groupby("country").size()
country
Australia 21
Cambodia 58
China 280
India 133
Indonesia 195
Malaysia 138
Myanmar 51
Philippines 49
Singapore 1268
Taiwan 47
Thailand 273
Vietnam 288
How do I select, say, 100 random samples from each country, if that country has > 100 samples? (if the country has <= 100 samples, do nothing). Currently, I do this for, say, Singapore:
names_nonsg_ls = []
names_sg_ls = []

# if the country is not SG, add it to names_nonsg_ls.
# else, add it to names_sg_ls, which will be subsampled later.
for index, row in d0.iterrows():
    if str(row["country"]) != "Singapore":
        names_nonsg_ls.append(str(row["header"]))
    else:
        names_sg_ls.append(str(row["header"]))

# Select 100 random names from names_sg_ls
names_sg_ls = random.sample(names_sg_ls, 100)

# Form the list of names to retain
names_ls = names_nonsg_ls + names_sg_ls

# create new dataframe
d1 = d0.loc[d0["header"].isin(names_ls)]
But manually making a new list for each country that has > 100 names is just poor form, not to mention that I first have to manually pick out the countries with > 100 names.
You can group by country, then sample based on the group size:
d1.groupby("country", group_keys=False).apply(lambda g: g.sample(100) if len(g) > 100 else g)
Example:
df = pd.DataFrame({
    'A': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'd'],
    'B': list(range(9))
})
df.groupby('A', group_keys=False).apply(lambda g: g.sample(3) if len(g) > 3 else g)

#    A  B
# 2  a  2
# 0  a  0
# 1  a  1
# 4  b  4
# 5  b  5
# 6  b  6
# 7  c  7
# 8  d  8
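An equivalent sketch without the branch samples min(len(g), 100) rows per group; random_state is optional and only added here for reproducibility:

d1_sampled = d1.groupby("country", group_keys=False).apply(
    lambda g: g.sample(n=min(len(g), 100), random_state=0)
)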
