I have the following data:
data = ['10 20 10 36 30 33 400 400 -1 -1',
'100 50 50 30 60 27 70 24 -2 -2 700 700',
'300 1000 80 21 90 18 100 15 110 12 120 9 900 900 -3 -3',
'30 90 130 6 140 3 -4 -4 1000 1000']
data = [e.split() for e in data]
concentration = [np.array(row[3::2], dtype=int) for row in data]
I want to set the values in my variable (concentration) that are not within the interval (0-50) to False/0. So I did the following code:
for row in range(len(concentration)):
    for element in range(len(concentration[row])):
        if 0 > concentration[row][element] or concentration[row][element] > 50:
            concentration[row][element] = False
            print("Error: Index {:} in time is out of range".format(element))
I get the following output, and my concentration variable looks like this:
Array of int64 [36 33 0 0]
Array of int64 [30 27 24 0 0]
Array of int64 [21 18 15 12 0 0]
Array of int64 [6 3 0 0]
Now I want to redefine my variable (concentration) so that the values are sorted and only the True/1 values remain (the values which are not False/0). I want my new concentration variable to look like this:
Array of int64 [33 36]
Array of int64 [24 27 30]
Array of int64 [12 15 18 21]
Array of int64 [3 6]
Thanks for the help so far!
You can solve your problem this way:
initial_data = ['10 20 10 36 30 33 400 400 -1 -1',
'100 50 50 30 60 27 70 24 -2 -2 700 700',
'300 1000 80 21 90 18 100 15 110 12 120 9 900 900 -3 -3',
'30 90 130 6 140 3 -4 -4 1000 1000']
result = [sorted(filter(lambda x: 0 < x < 50,
list(map(int, elem.split()))[3::2])) for elem in initial_data]
print(result)
# [[33, 36], [24, 27, 30], [9, 12, 15, 18, 21], [3, 6]]
If you need numpy arrays instead of lists, you can add a conversion to the list comprehension:
result = [np.array(sorted(filter(lambda x: 0 < x < 50,
                                 list(map(int, elem.split()))[3::2])), dtype=int)
          for elem in initial_data]
print(result)
# [array([33, 36]), array([24, 27, 30]), array([ 9, 12, 15, 18, 21]), array([3, 6])]
UPDATE
To redefine your concentration variable with the desired result, you can use the following construction:
concentration = list(map(lambda x: np.sort(x[x > 0]), concentration))
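For illustration, here is a minimal sketch applying this construction to the zeroed arrays shown in the question (the input arrays are copied from the output above):

import numpy as np

concentration = [np.array([36, 33, 0, 0]),
                 np.array([30, 27, 24, 0, 0]),
                 np.array([21, 18, 15, 12, 0, 0]),
                 np.array([6, 3, 0, 0])]
# keep only the positive entries of each row, then sort them
concentration = list(map(lambda x: np.sort(x[x > 0]), concentration))
print(concentration)
# [array([33, 36]), array([24, 27, 30]), array([12, 15, 18, 21]), array([3, 6])]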
If my data looks like this
Index Country ted_Val1 sam_Val1 ... ted_Val10 sam_Val10
1 Australia 1 3 ... 20 5
2 Bambua 12 33 ... 15 56
3 Tambua 14 34 ... 10 58
df = pd.DataFrame([["Australia", 1, 3, 20, 5],
["Bambua", 12, 33, 15, 56],
["Tambua", 14, 34, 10, 58]
], columns=["Country", "ted_Val1", "sam_Val1", "ted_Val10", "sam_Val10"]
)
I'd like to subtract all 'sam_' columns from the corresponding 'ted_' columns using a list, creating a new column starting with 'diff_' such that:
Index Country    ted_Val1 sam_Val1 diff_Val1 ... ted_Val10 sam_Val10 diff_Val10
1     Australia  1        3        -2        ... 20        5         15
2     Bambua     12       33       -21       ... 15        56        -41
3     Tambua     14       34       -20       ... 10        58        -48
so far I've got:
calc_vars = ['ted_Val1',
'sam_Val1',
'ted_Val10',
'sam_Val10']
for i in calc_vars:
    df_diff['dif_' + str(i)] = df.['ted_' + str(i)] - df.['sam_' + str(i)]
but I'm getting errors, and I'm not sure where to go from here. As a warning, this is dummy data, and the names can contain several underscores.
IIUC you can use filter to choose the columns for subtraction (assuming your columns are properly sorted like your sample):
print(pd.concat([df, pd.DataFrame(df.filter(like="ted").to_numpy()-df.filter(like="sam").to_numpy(),
                                  columns=["diff"+i.split("_")[-1] for i in df.columns if "ted_Val" in i])], axis=1))
Country ted_Val1 sam_Val1 ted_Val10 sam_Val10 diff1 diff10
0 Australia 1 3 20 5 -2 15
1 Bambua 12 33 15 56 -21 -41
2 Tambua 14 34 10 58 -20 -48
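If you prefer column names in the diff_Val... style from the question, a small variation of the same idea (a sketch under the same assumptions about the ted_/sam_ column layout):

diff = pd.DataFrame(df.filter(like="ted").to_numpy() - df.filter(like="sam").to_numpy(),
                    columns=["diff_" + c.split("_", 1)[-1] for c in df.columns if c.startswith("ted_")])
print(pd.concat([df, diff], axis=1))

Splitting on the first underscore only keeps the full suffix, so names with several underscores keep working.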
Try this:
calc_vars = ['ted_Val1', 'sam_Val1', 'ted_Val10', 'sam_Val10']
# extract even & odd values from calc_vars
# ['ted_Val1', 'ted_Val10'], ['sam_Val1', 'sam_Val10']
for ted, sam in zip(calc_vars[::2], calc_vars[1::2]):
    df['diff_' + ted.split("_")[-1]] = df[ted] - df[sam]
Edit: if columns are not sorted,
ted_cols = sorted(df.filter(regex=r"ted_Val\d+"), key=lambda x: x.split("_")[-1])
sam_cols = sorted(df.filter(regex=r"sam_Val\d+"), key=lambda x: x.split("_")[-1])

for ted, sam in zip(ted_cols, sam_cols):
    df['diff_' + ted.split("_")[-1]] = df[ted] - df[sam]
Country ted_Val1 sam_Val1 ted_Val10 sam_Val10 diff_Val1 diff_Val10
0 Australia 1 3 20 5 -2 15
1 Bambua 12 33 15 56 -21 -41
2 Tambua 14 34 10 58 -20 -48
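Since the question warns that names can contain several underscores, pairing the columns explicitly by their shared suffix is safer than relying on position; a minimal sketch, assuming every ted_ column in df has a matching sam_ column:

# pair columns by the suffix after the prefix, e.g. "ted_foo_Val1"
# and "sam_foo_Val1" share the suffix "foo_Val1"
suffixes = [c.split("_", 1)[1] for c in df.columns if c.startswith("ted_")]
for suffix in suffixes:
    df["diff_" + suffix] = df["ted_" + suffix] - df["sam_" + suffix]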
Background
I want to determine the global cumulative value of a variable for different decades from 1990 to 2014, i.e. 1990, 2000, and 2010 (3 decades, separately). I have annual data for different countries. However, data availability is not uniform.
Existing questions
One uses R: 1
The following questions look at date formatting issues: 2, 3
The answers to these questions do not address the current question.
Current question
How to obtain a global sum for the period of different decades using features/tools of Pandas?
Expected outcome
1990-2000 x1
2000-2010 x2
2010-2015 x3
Method used so far
data_binned = data_pivoted.copy()
decade = []

# obtaining decade values for each country
for i in range(1960, 2017):
    if i in list(data_binned):
        # adding the columns into the decade list
        decade.append(i)
    if i % 10 == 0:
        # adding large header so that newly created columns are set at the end of the dataframe
        data_binned[i * 10] = data_binned.apply(lambda x: sum(x[j] for j in decade), axis=1)
        decade = []

for x in list(data_binned):
    if x < 3000:
        # removing non-decade columns
        del data_binned[x]

# renaming the decade columns
new_names = [int(x / 10) for x in list(data_binned)]
data_binned.columns = new_names

# computing global values
global_values = data_binned.sum(axis=0)
This is a non-optimal method, owing to my limited experience with Pandas. Kindly suggest a better method which uses the features of Pandas. Thank you.
If I had a pandas.DataFrame called df looking like this:
>>> df = pd.DataFrame(
... {
... 1990: [1, 12, 45, 67, 78],
... 1999: [1, 12, 45, 67, 78],
... 2000: [34, 6, 67, 21, 65],
... 2009: [34, 6, 67, 21, 65],
... 2010: [3, 6, 6, 2, 6555],
... 2015: [3, 6, 6, 2, 6555],
... }, index=['country_1', 'country_2', 'country_3', 'country_4', 'country_5']
... )
>>> print(df)
1990 1999 2000 2009 2010 2015
country_1 1 1 34 34 3 3
country_2 12 12 6 6 6 6
country_3 45 45 67 67 6 6
country_4 67 67 21 21 2 2
country_5 78 78 65 65 6555 6555
I could make another pandas.DataFrame called df_decades with decade statistics, like this:
>>> df_decades = pd.DataFrame()
>>>
>>> for decade in set([(col // 10) * 10 for col in df.columns]):
... cols_in_decade = [col for col in df.columns if (col // 10) * 10 == decade]
... df_decades[f'{decade}-{decade + 9}'] = df[cols_in_decade].sum(axis=1)
>>>
>>> df_decades = df_decades[sorted(df_decades.columns)]
>>> print(df_decades)
1990-1999 2000-2009 2010-2019
country_1 2 68 6
country_2 24 12 12
country_3 90 134 12
country_4 134 42 4
country_5 156 130 13110
The idea behind this is to iterate over all decades implied by the column names in df, filter the columns that belong to each decade, and aggregate them.
Finally, I could merge these data frames together, so my data frame df is enriched with the decade statistics from the second data frame df_decades.
>>> df = pd.merge(left=df, right=df_decades, left_index=True, right_index=True, how='left')
>>> print(df)
1990 1999 2000 2009 2010 2015 1990-1999 2000-2009 2010-2019
country_1 1 1 34 34 3 3 2 68 6
country_2 12 12 6 6 6 6 24 12 12
country_3 45 45 67 67 6 6 90 134 12
country_4 67 67 21 21 2 2 134 42 4
country_5 78 78 65 65 6555 6555 156 130 13110
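As a more compact alternative (a sketch, applied to the original df from the start of this answer, assuming its column labels are integer years), the same decade bucketing can be expressed with a groupby over the transposed frame:

>>> # map each year column to its decade and sum within each group;
>>> # transposing twice makes the grouping run over the columns
>>> decades = (df.columns // 10) * 10
>>> df_decades = df.T.groupby(decades).sum().T
>>> df_decades.columns = [f'{d}-{d + 9}' for d in df_decades.columns]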
I have this test table in a pandas DataFrame:
Leaf_category_id session_id product_id
0 111 1 987
3 111 4 987
4 111 1 741
1 222 2 654
2 333 3 321
This is an extension of my previous question, which was answered by @jazrael (view answer).
So, after getting the values in the product_id column as follows (just an assumption, a little different from the output of my previous question):
|product_id |
---------------------------
|111,987,741,34,12 |
|987,1232 |
|654,12,324,465,342,324 |
|321,741,987 |
|324,654,862,467,243,754 |
|6453,123,987,741,34,12 |
and so on.
I want to create a new column in which every value in a row is paired as a bigram with the next one, and the last number in the row is combined with the first one. For example:
|product_id |Bigram
-------------------------------------------------------------------------
|111,987,741,34,12 |(111,987),**(987,741)**,(741,34),(34,12),(12,111)
|987,1232 |(987,1232),(1232,987)
|654,12,324,465,342,32 |(654,12),(12,324),(324,465),(465,342),(342,32),(32,654)
|321,741,987 |(321,741),**(741,987)**,(987,321)
|324,654,862 |(324,654),(654,862),(862,324)
|123,987,741,34,12 |(123,987),(987,741),(34,12),(12,123)
Ignore the ** (I'll explain later why I starred those entries).
The code to achieve the bigrams is:
for i in df.Leaf_category_id.unique():
    print(df[df.Leaf_category_id == i].groupby('session_id')['product_id'].apply(lambda x: list(zip(x, x[1:]))).reset_index())
From this df, I want to take the Bigram column and build one more column named frequency, which gives me the frequency with which each bigram occurred.
Note: (987,741) and (741,987) are to be considered the same; the duplicate entry should be removed, and thus the frequency of (987,741) should be 2.
The same goes for (34,12): it occurs two times, so its frequency should be 2.
|Bigram
---------------
|(111,987),
|**(987,741)**
|(741,34)
|(34,12)
|(12,111)
|**(741,987)**
|(987,321)
|(34,12)
|(12,123)
The final result should be:
|Bigram | frequency |
--------------------------
|(111,987) | 1
|(987,741) | 2
|(741,34) | 1
|(34,12) | 2
|(12,111) | 1
|(987,321) | 1
|(12,123) | 1
I am hoping to find an answer here. Please help me; I have elaborated as much as possible.
Try this code:
from itertools import combinations
import pandas as pd

df = pd.read_csv("data.csv", index_col=0)

# consecutive
grouped_consecutive_product_ids = df.groupby(['Leaf_category_id', 'session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in zip(x, x[1:])]).reset_index()
df1 = pd.DataFrame(grouped_consecutive_product_ids)
s = df1.product_id.apply(lambda x: pd.Series(x)).unstack()
df2 = pd.DataFrame(s.reset_index(level=0, drop=True)).dropna()
df2.rename(columns={0: 'Bigram'}, inplace=True)
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count')
bigram_frequency_consecutive = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index()
del bigram_frequency_consecutive["index"]
For combinations (all possible bigrams):
from itertools import combinations
import pandas as pd

df = pd.read_csv("data.csv", index_col=0)

# combinations
grouped_combination_product_ids = df.groupby(['Leaf_category_id', 'session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in combinations(x, 2)]).reset_index()
df1 = pd.DataFrame(grouped_combination_product_ids)
s = df1.product_id.apply(lambda x: pd.Series(x)).unstack()
df2 = pd.DataFrame(s.reset_index(level=0, drop=True)).dropna()
df2.rename(columns={0: 'Bigram'}, inplace=True)
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count')
bigram_frequency_combinations = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index()
del bigram_frequency_combinations["index"]
where data.csv contains
Leaf_category_id,session_id,product_id
0,111,1,111
3,111,4,987
4,111,1,741
1,222,2,654
2,333,3,321
5,111,1,87
6,111,1,34
7,111,1,12
8,111,1,987
9,111,4,1232
10,222,2,12
11,222,2,324
12,222,2,465
13,222,2,342
14,222,2,32
15,333,3,321
16,333,3,741
17,333,3,987
18,333,3,324
19,333,3,654
20,333,3,862
21,222,1,123
22,222,1,987
23,222,1,741
24,222,1,34
25,222,1,12
The resultant bigram_frequency_consecutive will be
Bigram freq
0 (12, 34) 2
1 (12, 324) 1
2 (12, 654) 1
3 (12, 987) 1
4 (32, 342) 1
5 (34, 87) 1
6 (34, 741) 1
7 (87, 741) 1
8 (111, 741) 1
9 (123, 987) 1
10 (321, 321) 1
11 (321, 741) 1
12 (324, 465) 1
13 (324, 654) 1
14 (324, 987) 1
15 (342, 465) 1
16 (654, 862) 1
17 (741, 987) 2
18 (987, 1232) 1
The resultant bigram_frequency_combinations will be
Bigram freq
0 (12, 32) 1
1 (12, 34) 2
2 (12, 87) 1
3 (12, 111) 1
4 (12, 123) 1
5 (12, 324) 1
6 (12, 342) 1
7 (12, 465) 1
8 (12, 654) 1
9 (12, 741) 2
10 (12, 987) 2
11 (32, 324) 1
12 (32, 342) 1
13 (32, 465) 1
14 (32, 654) 1
15 (34, 87) 1
16 (34, 111) 1
17 (34, 123) 1
18 (34, 741) 2
19 (34, 987) 2
20 (87, 111) 1
21 (87, 741) 1
22 (87, 987) 1
23 (111, 741) 1
24 (111, 987) 1
25 (123, 741) 1
26 (123, 987) 1
27 (321, 321) 1
28 (321, 324) 2
29 (321, 654) 2
30 (321, 741) 2
31 (321, 862) 2
32 (321, 987) 2
33 (324, 342) 1
34 (324, 465) 1
35 (324, 654) 2
36 (324, 741) 1
37 (324, 862) 1
38 (324, 987) 1
39 (342, 465) 1
40 (342, 654) 1
41 (465, 654) 1
42 (654, 741) 1
43 (654, 862) 1
44 (654, 987) 1
45 (741, 862) 1
46 (741, 987) 3
47 (862, 987) 1
48 (987, 1232) 1
In the above case, the grouping is done by both Leaf_category_id and session_id.
We are going to pull out the values from product_id, create bigrams that are sorted (and thus deduplicated), count them to get the frequency of each, and then populate a data frame.
from collections import Counter

import pandas as pd

# assuming your data frame is called 'df' and each product_id entry is a list of ids
bigrams = [list(zip(x, x[1:])) for x in df.product_id.values.tolist()]
# sort each pair so that (987, 741) and (741, 987) count as the same bigram
bigram_set = [tuple(sorted(xx)) for x in bigrams for xx in x]
freq_dict = Counter(bigram_set)
df_freq = pd.DataFrame([[k, v] for k, v in freq_dict.items()], columns=['bigram', 'freq'])
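The question also asks for a wrap-around pair (the last id combined with the first); if you want that behavior, a minimal sketch changing only the zip (it assumes each product_id row is a list, as above):

# pair each id with its successor, and the last id with the first
bigrams = [list(zip(x, x[1:] + x[:1])) for x in df.product_id.values.tolist()]

The counting steps above stay the same.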