I have a question about splitting a column of comma-separated values into separate columns in pandas.
For example, I currently do something like the following with a for loop, but it takes a very long time. I want to turn
| Index | Value |
| ----- | ----- |
| 0 | 1 |
| 1 | 1,3 |
| 2 | 4,6,8 |
| 3 | 1,3 |
| 4 | 2,7,9 |
into
| Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| ----- | - | - | - | - | - | - | - | - | - |
| 0 | 1 | | | | | | | | |
| 1 | 1 | | 3 | | | | | | |
| 2 | | | | 4 | | 6 | | 8 | |
| 3 | 1 | | 3 | | | | | | |
| 4 | | 2 | | | | | 7 | | 9 |
I wonder if there are any packages or built-in functions that can help with this, rather than writing a for loop to map all the indexes.
Assuming the "Value" column contains strings, you can use str.split and pivot like so:
value = df["Value"].str.split(",").explode().astype(int).reset_index()
output = value.pivot(index="index", columns="Value", values="Value")
output = output.reindex(range(value["Value"].min(), value["Value"].max()+1), axis=1)
>>> output
Value 1 2 3 4 5 6 7 8 9
index
0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 1.0 NaN 3.0 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN 4.0 NaN 6.0 NaN 8.0 NaN
3 1.0 NaN 3.0 NaN NaN NaN NaN NaN NaN
4 NaN 2.0 NaN NaN NaN NaN 7.0 NaN 9.0
Input df:
df = pd.DataFrame({"Value": ["1", "1,3", "4,6,8", "1,3", "2,7,9"]})
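An alternative sketch (assuming the same input df) uses str.get_dummies, where the column labels become the matched values themselves:

dummies = df["Value"].str.get_dummies(sep=",")
dummies.columns = dummies.columns.astype(int)
# keep the value where it occurred, NaN elsewhere, then fill in missing columns
out = dummies.mul(dummies.columns, axis=1).where(dummies.astype(bool))
out = out.reindex(range(out.columns.min(), out.columns.max() + 1), axis=1)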
I'm trying to figure out a way to do the equivalent of:
COUNTIF(Col2,Col4,Col6,Col8,Col10,Col12,Col14,Col16,Col18,">=0.05")
SUMIF(Col2,Col4,Col6,Col8,Col10,Col12,Col14,Col16,Col18,">=0.05")
My attempt:
import pandas as pd
df=pd.read_excel(r'C:\\Users\\Downloads\\Prepped.xls') #Please use: https://github.com/BeboGhattas/temp-repo/blob/main/Prepped.xls
df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float) #changing dtype to float
#unconditional sum
df['sum']=df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float).sum(axis=1)
Whatever I put below won't work:
#sum if
df['greater-than-0.05']=df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float).sum([c for c in col if c >= 0.05])
| | # | word | B64684807 | B64684807Measure | B649845471 | B649845471Measure | B83344143 | B83344143Measure | B67400624 | B67400624Measure | B85229235 | B85229235Measure | B85630406 | B85630406Measure | B82615898 | B82615898Measure | B87558236 | B87558236Measure | B00000009 | B00000009Measure | 有效竞品数 | 关键词抓取时间 | 搜索量排名 | 月搜索量 | 在售商品数 | 竞争度 |
|---:|----:|:--------|------------:|:-------------------|-------------:|:-------------------------|------------:|:-------------------------|------------:|:-------------------|------------:|:-------------------|------------:|:-------------------|------------:|:-------------------|------------:|-------------------:|------------:|:-------------------|-------------:|:--------------------|-------------:|-----------:|-------------:|---------:|
| 0 | 1 | word 1 | 0.055639 | [主要流量词] | 0.049416 | nan | 0.072298 | [精准流量词, 主要流量词] | 0.00211 | nan | 0.004251 | nan | 0.007254 | nan | 0.074409 | [主要流量词] | 0.033597 | nan | 0.000892 | nan | 9 | 2022-10-06 00:53:56 | 5726 | 326188 | 3810 | 0.01 |
| 1 | 2 | word 2 | 0.045098 | nan | 0.005472 | nan | 0.010791 | nan | 0.072859 | [主要流量词] | 0.003423 | nan | 0.012464 | nan | 0.027396 | nan | 0.002825 | nan | 0.060989 | [主要流量词] | 9 | 2022-10-07 01:16:21 | 9280 | 213477 | 40187 | 0.19 |
| 2 | 3 | word 3 | 0.02186 | nan | 0.05039 | [主要流量词] | 0.007842 | nan | 0.028832 | nan | 0.044385 | [精准流量词] | 0.001135 | nan | 0.003866 | nan | 0.021035 | nan | 0.017202 | nan | 9 | 2022-10-07 00:28:31 | 24024 | 81991 | 2275 | 0.03 |
| 3 | 4 | word 4 | 0.000699 | nan | 0.01038 | nan | 0.001536 | nan | 0.021512 | nan | 0.007658 | nan | 5e-05 | nan | 0.048682 | nan | 0.001524 | nan | 0.000118 | nan | 9 | 2022-10-07 00:52:12 | 34975 | 53291 | 30970 | 0.58 |
| 4 | 5 | word 5 | 0.00984 | nan | 0.030248 | nan | 0.003006 | nan | 0.014027 | nan | 0.00904 | [精准流量词] | 0.000348 | nan | 0.000414 | nan | 0.006721 | nan | 0.00153 | nan | 9 | 2022-10-07 02:36:05 | 43075 | 41336 | 2230 | 0.05 |
| 5 | 6 | word 6 | 0.010029 | [精准流量词] | 0.120739 | [精准流量词, 主要流量词] | 0.014359 | nan | 0.002796 | nan | 0.002883 | nan | 0.028747 | [精准流量词] | 0.007022 | nan | 0.017803 | nan | 0.001998 | nan | 9 | 2022-10-07 00:44:51 | 49361 | 34791 | 517 | 0.01 |
| 6 | 7 | word 7 | 0.002735 | nan | 0.002005 | nan | 0.005355 | nan | 6.3e-05 | nan | 0.000772 | nan | 0.000237 | nan | 0.015149 | nan | 2.1e-05 | nan | 2.3e-05 | nan | 9 | 2022-10-07 09:48:20 | 53703 | 31188 | 511 | 0.02 |
| 7 | 8 | word 8 | 0.003286 | [精准流量词] | 0.058161 | [主要流量词] | 0.013681 | [精准流量词] | 0.000748 | [精准流量词] | 0.002684 | [精准流量词] | 0.013916 | [精准流量词] | 0.029376 | nan | 0.019792 | nan | 0.005602 | nan | 9 | 2022-10-06 01:51:53 | 58664 | 27751 | 625 | 0.02 |
| 8 | 9 | word 9 | 0.004273 | [精准流量词] | 0.025581 | [精准流量词] | 0.014784 | [精准流量词] | 0.00321 | [精准流量词] | 0.000892 | nan | 0.00223 | nan | 0.005315 | nan | 0.02211 | nan | 0.027008 | [精准流量词] | 9 | 2022-10-07 01:34:28 | 73640 | 20326 | 279 | 0.01 |
| 9 | 10 | word 10 | 0.002341 | [精准流量词] | 0.029604 | nan | 0.007817 | [精准流量词] | 0.000515 | [精准流量词] | 0.001865 | [精准流量词] | 0.010128 | [精准流量词] | 0.015378 | nan | 0.019677 | nan | 0.003673 | nan | 9 | 2022-10-07 01:17:44 | 80919 | 17779 | 207 | 0.01 |
So my question is:
How can I do the SUMIF and COUNTIF on this exact table? (It should use col2, col4, etc., because every file will have the same format but different headers, so using df['B64684807'] isn't helpful.)
Sample file can be found at:
https://github.com/BeboGhattas/temp-repo/blob/main/Prepped.xls
IIUC, you can use a boolean mask:
df2 = df.iloc[:, [2,4,6,8,10,12,14,16,18]].astype(float)
m = df2.ge(0.05)
df['countif'] = m.sum(axis=1)
df['sumif'] = df2.where(m).sum(axis=1)
output (last 3 columns only):
sum countif sumif
0 0.299866 3 0.202346
1 0.241317 2 0.133848
2 0.196547 1 0.050390
3 0.092159 0 0.000000
4 0.075174 0 0.000000
5 0.206376 1 0.120739
6 0.026360 0 0.000000
7 0.147246 1 0.058161
8 0.105403 0 0.000000
9 0.090998 0 0.000000
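If the same logic has to run on many files with the same layout, a small helper keeps the column positions and threshold in one place (a sketch; count_sum_if is a made-up name and df is the frame loaded from the Excel file):

def count_sum_if(frame, positions, threshold=0.05):
    # count and sum only the values >= threshold in the given column positions
    vals = frame.iloc[:, positions].astype(float)
    mask = vals.ge(threshold)
    return mask.sum(axis=1), vals.where(mask).sum(axis=1)

df['countif'], df['sumif'] = count_sum_if(df, [2, 4, 6, 8, 10, 12, 14, 16, 18])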
I have this massive dataframe which has three different columns of values under each heading.
As an example, first it looked something like this:
|   | 0 | 1    | 2    | 3    | .. |
|---|---|------|------|------|----|
| 0 | a | 7.3  | 9.1  | NaN  | .. |
| 1 | b | 2.51 | 4.8  | 6.33 | .. |
| 2 | c | NaN  | NaN  | NaN  | .. |
| 3 | d | NaN  | 3.73 | NaN  | .. |
Columns 1, 2 and 3 all belong together. For simplicity of the program I used integers for the dataframe index and columns.
But now that it has finished calculating, I changed the columns to the appropriate strings:
|   | 0 | Heading 1 | Heading 1 | Heading 1 | .. |
|---|---|-----------|-----------|-----------|----|
| 0 | a | 7.3       | 9.1       | NaN       | .. |
| 1 | b | 2.51      | 4.8       | 6.33      | .. |
| 2 | c | NaN       | NaN       | NaN       | .. |
| 3 | d | NaN       | 3.73      | NaN       | .. |
Everything runs perfectly smoothly up until this point, but here's where I'm stuck.
All I want to do is merge the three "Heading 1" labels into one giant cell, so that it looks something like this:
|   | 0 | Heading 1              | .. |
|---|---|------|------|------|----|
| 0 | a | 7.3  | 9.1  | NaN  | .. |
| 1 | b | 2.51 | 4.8  | 6.33 | .. |
| 2 | c | NaN  | NaN  | NaN  | .. |
| 3 | d | NaN  | 3.73 | NaN  | .. |
But everything I find online merges the entire columns, values included.
I'd really appreciate it if someone could help me out here!
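A minimal sketch of one way to get a single "Heading 1" label spanning the three sub-columns is a MultiIndex header (the small frame below is rebuilt from the example above; the values themselves stay untouched):

import numpy as np
import pandas as pd

df = pd.DataFrame({0: ["a", "b", "c", "d"],
                   1: [7.3, 2.51, np.nan, np.nan],
                   2: [9.1, 4.8, np.nan, 3.73],
                   3: [np.nan, 6.33, np.nan, np.nan]})
# group columns 1-3 under one shared top-level label
df.columns = pd.MultiIndex.from_tuples(
    [("", 0), ("Heading 1", 1), ("Heading 1", 2), ("Heading 1", 3)])
print(df)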
I have two dataframes of different sizes. I want to join the dataframes and, after combining them, replace the NaN values with the values from the smaller dataframe.
dataframe1:
|            | symbol | value1 | value2 | Occurance |
|------------|--------|--------|--------|-----------|
| 2020-07-31 | A      | 193.5  | 186.05 | 3         |
| 2020-07-17 | A      | 372.5  | 359.55 | 2         |
| 2020-07-21 | A      | 387.8  | 382.00 | 1         |
dataframe2:
|            | x     | y     | z     | symbol |
|------------|-------|-------|-------|--------|
| 2020-10-01 | 448.5 | 453.0 | 443.8 | A      |
I tried concatenating and replacing the NaN values with the values of dataframe2.
I tried df1 = pd.concat([dataframe2, dataframe1], axis=1). The result is given below, but I am looking for the result shown under "Result desired". How can I achieve that?
Result given:
|            | X     | Y     | Z     | symbol | symbol | value1 | value2 | Occurance |
|------------|-------|-------|-------|--------|--------|--------|--------|-----------|
| 2020-07-31 | NaN   | NaN   | NaN   | NaN    | A      | 193.5  | 186.05 | 3         |
| 2021-05-17 | NaN   | NaN   | NaN   | NaN    | A      | 372.5  | 359.55 | 2         |
| 2021-05-21 | NaN   | NaN   | NaN   | NaN    | A      | 387.8  | 382.00 | 1         |
| 2020-10-01 | 448.5 | 453.0 | 443.8 | A      | NaN    | NaN    | NaN    | NaN       |
Result desired:
|            | X     | Y     | Z     | symbol | symbol | value1 | value2 | Occurance |
|------------|-------|-------|-------|--------|--------|--------|--------|-----------|
| 2020-10-01 | 448.5 | 453.0 | 443.8 | A      | A      | 193.5  | 186.05 | 3         |
| 2020-10-01 | 448.5 | 453.0 | 443.8 | A      | A      | 372.5  | 359.55 | 2         |
| 2020-10-01 | 448.5 | 453.0 | 443.8 | A      | A      | 387.8  | 382.00 | 1         |
| 2020-10-01 | 448.5 | 453.0 | 443.8 | A      | NaN    | NaN    | NaN    | NaN       |
Please note the datetime needs to be the same as in the desired result. In short, I want to replicate the single row of dataframe2 across the NaN values of dataframe1. A solution avoiding a for loop would be great.
Could you try sorting your dataframe by the index to check what the output looks like?
df1.sort_index()
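A sketch of another route, assuming dataframe2 holds exactly one row that should be repeated next to every row of dataframe1 (the frames below are rebuilt from the question; the date index handling is simplified to a RangeIndex):

import pandas as pd

dataframe1 = pd.DataFrame(
    {"symbol": ["A", "A", "A"],
     "value1": [193.5, 372.5, 387.8],
     "value2": [186.05, 359.55, 382.00],
     "Occurance": [3, 2, 1]},
    index=pd.to_datetime(["2020-07-31", "2020-07-17", "2020-07-21"]))
dataframe2 = pd.DataFrame(
    {"x": [448.5], "y": [453.0], "z": [443.8], "symbol": ["A"]},
    index=pd.to_datetime(["2020-10-01"]))

# repeat the single dataframe2 row once per dataframe1 row, then put the two side by side
left = pd.concat([dataframe2] * len(dataframe1), ignore_index=True)
right = dataframe1.reset_index(drop=True)
result = pd.concat([left, right], axis=1)
print(result)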
Suppose I have a table like this:
+----------+------------+----------+------------+----------+------------+-------+
| a_name_0 | id_qname_0 | a_name_1 | id_qname_1 | a_name_2 | id_qname_2 | count |
+----------+------------+----------+------------+----------+------------+-------+
| country | 1 | NAN | NAN | NAN | NAN | 100 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 2 | city | NAN | NAN | NAN | 20 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 2 | city | NAN | NAN | NAN | 80 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 3 | age | 4 | sex | 6 | 40 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 3 | age | 5 | sex | 7 | 60 |
+----------+------------+----------+------------+----------+------------+-------+
and I want to LEFT JOIN it with the following table on the a_name columns in pandas:
+----+---------+-------+-------+-------+
| id | a_name | c01 | c02 | c03 |
+----+---------+-------+-------+-------+
| 1 | country | dtr1 | dtr2 | dtr3 |
+----+---------+-------+-------+-------+
| 2 | region | dtc1 | dtc2 | dtc3 |
+----+---------+-------+-------+-------+
| 3 | city | dta1 | dta2 | dta3 |
+----+---------+-------+-------+-------+
| 4 | age | dtCo1 | dtCo2 | dtCo3 |
+----+---------+-------+-------+-------+
| 5 | sex | dts1 | dts2 | dts3 |
+----+---------+-------+-------+-------+
I want to add the columns c01, c02 and c03 for each value (country, region, city, age, sex) appearing in the columns a_name_0, a_name_1 and a_name_2 of the first table.
Obviously I would need to add three new columns for each of the values that appear in the a_name_0, a_name_1 and a_name_2 columns, otherwise my table would have a different number of rows. The rest of the row values should be empty, NA or NaN, whatever.
Expected output:
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| a_name_0 | c01_0 | c02_0 | c03_0 | id_qname_0 | a_name_1 | c01_1 | c02_1 | c03_1 | id_qname_1 | a_name_2 | c01_2 | c02_2 | c03_2 | id_qname_2 | count |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| country | dtCo1 | dtCo2 | dtCo3 | 1 | NAN | NAN | NAN | NAN | NAN | NAN | NAN | NAN | NAN | NAN | 70 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | dtr1 | dtr2 | dtr2 | 2 | city | dtc1 | dtc2 | dtc3 | NAN | NAN | NAN | NAN | NAN | NAN | 20 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | | | | 2 | city | | | | NAN | NAN | | | | NAN | 20 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | | | | 3 | age | | | | 4 | sex | | | | 6 | 40 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
| region | | | | 3 | age | | | | 5 | sex | | | | 7 | 60 |
+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+----------+-------+-------+-------+------------+-------+
Explanation:
I'm building a data warehouse table which will serve data analytics purposes. The quotes table (1st table) should be populated with various projects' quote info (table 2), which needs to be visually represented.
Use:
import numpy as np
import pandas as pd

# convert the count column to the index so all other columns can be processed in groups
df1 = df1.set_index('count')
# group columns by the value after the last underscore
c = df1.columns.str.rsplit('_').str[-1]
# remove the unnecessary id column from df2
df2 = df2.drop('id', axis=1)
# list of per-group DataFrames
dfs = []
# iterate over the groups
for i, x in df1.groupby(c, axis=1):
    # rename columns so they match and to avoid duplicated column names
    df2.columns = [f'a_name_{i}'] + (df2.columns + f'_{i}').tolist()[1:]
    # left join
    x = x.merge(df2, on=f'a_name_{i}', how='left')
    # set rows duplicated by the a_name column to NaN
    m = x.duplicated(subset=[x.columns[0]])
    x.iloc[m.to_numpy(), 2:] = np.nan
    # move the id_qname column to the end
    x[f'id_qname_{i}'] = x.pop(f'id_qname_{i}')
    # append to the list
    dfs.append(x)
# join everything together and add the count column back from the index
df = pd.concat(dfs, axis=1).assign(count=df1.index)
print(df)
a_name_0 c01_0 c02_0 c03_0 id_qname_0 a_name_1 c01_0_1 c02_0_1 c03_0_1 \
0 country dtr1 dtr2 dtr3 1 NaN NaN NaN NaN
1 region dtc1 dtc2 dtc3 2 city dta1 dta2 dta3
2 region NaN NaN NaN 2 city NaN NaN NaN
3 region NaN NaN NaN 3 age dtCo1 dtCo2 dtCo3
4 region NaN NaN NaN 3 age NaN NaN NaN
id_qname_1 a_name_2 c01_0_1_2 c02_0_1_2 c03_0_1_2 id_qname_2 count
0 NaN NaN NaN NaN NaN NaN 100
1 NaN NaN NaN NaN NaN NaN 20
2 NaN NaN NaN NaN NaN NaN 80
3 4.0 sex dts1 dts2 dts3 6.0 40
4 5.0 sex NaN NaN NaN 7.0 60
Merge the dataframes using outer joins and specify the columns (from each dataframe) on which you want to join the tables.
# Sample data
>>> A
name_1 name_2 values
0 a b 1
1 b c 2
2 c b 3
3 d a 4
>>> B
name values
0 a 1
1 b 2
2 c 3
>>> C
name values
0 a 10
1 b 20
2 c 30
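For reference, the sample frames above can be rebuilt from the printed output like this (a sketch, so the merges below are reproducible):

import pandas as pd

A = pd.DataFrame({"name_1": list("abcd"), "name_2": list("bcba"),
                  "values": [1, 2, 3, 4]})
B = pd.DataFrame({"name": list("abc"), "values": [1, 2, 3]})
C = pd.DataFrame({"name": list("abc"), "values": [10, 20, 30]})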
Making use of the merge() method, you can specify the columns you want to merge the dataframes on. Setting the how parameter to 'outer' specifies an outer join, which fills unmatched data points with NaN.
# Merging
>>> merge1 = A.merge(B, left_on='name_1', right_on='name', how='outer')
>>> merge1
name_1 name_2 values_x name values_y
0 a b 1 a 1.0
1 b c 2 b 2.0
2 c b 3 c 3.0
3 d a 4 NaN NaN
>>> merge = merge1.merge(C, left_on='name_2', right_on='name', how='outer')
>>> merge
name_1 name_2 values_x name_x values_y name_y values
0 a b 1 a 1.0 b 20
1 c b 3 c 3.0 b 20
2 b c 2 b 2.0 c 30
3 d a 4 NaN NaN a 10
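If the helper key columns are not needed afterwards, they can be dropped (a small follow-up sketch based on the column names visible in the output above):

merge = merge.drop(columns=["name_x", "name_y"])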
I have a dataframe (df) that looks like this:
+---------+-------+------------+----------+
| subject | pills | date | strength |
+---------+-------+------------+----------+
| 1 | 4 | 10/10/2012 | 250 |
| 1 | 4 | 10/11/2012 | 250 |
| 1 | 2 | 10/12/2012 | 500 |
| 2 | 1 | 1/6/2014 | 1000 |
| 2 | 1 | 1/7/2014 | 250 |
| 2 | 1 | 1/7/2014 | 500 |
| 2 | 3 | 1/8/2014 | 250 |
+---------+-------+------------+----------+
When I use reshape in R, I get what I want:
reshape(df, idvar = c("subject","date"), timevar = 'strength', direction = "wide")
+---------+------------+--------------+--------------+---------------+
| subject | date | strength.250 | strength.500 | strength.1000 |
+---------+------------+--------------+--------------+---------------+
| 1 | 10/10/2012 | 4 | NA | NA |
| 1 | 10/11/2012 | 4 | NA | NA |
| 1 | 10/12/2012 | NA | 2 | NA |
| 2 | 1/6/2014 | NA | NA | 1 |
| 2 | 1/7/2014 | 1 | 1 | NA |
| 2 | 1/8/2014 | 3 | NA | NA |
+---------+------------+--------------+--------------+---------------+
Using pandas:
df.pivot_table(index=['subject','date'], columns='strength')
+---------+------------+-------+------+------+
|         |            | pills |      |      |
+---------+------------+-------+------+------+
|         | strength   | 250   | 500  | 1000 |
+---------+------------+-------+------+------+
| subject | date       |       |      |      |
+---------+------------+-------+------+------+
| 1       | 10/10/2012 | 4     | NA   | NA   |
|         | 10/11/2012 | 4     | NA   | NA   |
|         | 10/12/2012 | NA    | 2    | NA   |
+---------+------------+-------+------+------+
| 2       | 1/6/2014   | NA    | NA   | 1    |
|         | 1/7/2014   | 1     | 1    | NA   |
|         | 1/8/2014   | 3     | NA   | NA   |
+---------+------------+-------+------+------+
How do I get exactly the same output as in R with pandas? I only want one header row.
After pivoting, convert the dataframe to records and then back to a dataframe (pivoted below is the result of the pivot above):
flattened = pd.DataFrame(pivoted.to_records())
# subject date ('pills', 250) ('pills', 500) ('pills', 1000)
#0 1 10/10/2012 4.0 NaN NaN
#1 1 10/11/2012 4.0 NaN NaN
#2 1 10/12/2012 NaN 2.0 NaN
#3 2 1/6/2014 NaN NaN 1.0
#4 2 1/7/2014 1.0 1.0 NaN
#5 2 1/8/2014 3.0 NaN NaN
You can now "repair" the column names, if you want:
flattened.columns = [hdr.replace("('pills', ", "strength.").replace(")", "") \
for hdr in flattened.columns]
flattened
# subject date strength.250 strength.500 strength.1000
#0 1 10/10/2012 4.0 NaN NaN
#1 1 10/11/2012 4.0 NaN NaN
#2 1 10/12/2012 NaN 2.0 NaN
#3 2 1/6/2014 NaN NaN 1.0
#4 2 1/7/2014 1.0 1.0 NaN
#5 2 1/8/2014 3.0 NaN NaN
It's awkward, but it works.
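A sketch of a shorter route with the same input df from the question: pivot the pills column directly, then flatten the strength columns and reset the index.

pivoted = df.pivot_table(index=["subject", "date"], columns="strength", values="pills")
pivoted.columns = [f"strength.{c}" for c in pivoted.columns]
flattened = pivoted.reset_index()
print(flattened)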