How do I join two columns into another separate column in Pandas? - python

Any help would be greatly appreciated. This is probably easy, but I'm new to Python.
I want to combine two columns, Latitude and Longitude, into a column called Location.
For example:
The first row of Latitude will have a value of 41.864073 and the first row of Longitude will have a value of -87.706819.
I would like the 'Location' column to display 41.864073, -87.706819.
Please and thank you.

Setup
import pandas as pd

df = pd.DataFrame(dict(lat=range(10, 20), lon=range(100, 110)))
zip
This should be better than using apply: zip pairs the two columns in a single pass instead of calling a Python function once per row.
df.assign(location=[*zip(df.lat, df.lon)])
lat lon location
0 10 100 (10, 100)
1 11 101 (11, 101)
2 12 102 (12, 102)
3 13 103 (13, 103)
4 14 104 (14, 104)
5 15 105 (15, 105)
6 16 106 (16, 106)
7 17 107 (17, 107)
8 18 108 (18, 108)
9 19 109 (19, 109)
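If you want to verify that claim yourself, a quick comparison in IPython/Jupyter could look like this (a sketch; exact timings vary by machine and frame size):
import pandas as pd

df = pd.DataFrame(dict(lat=range(10, 20), lon=range(100, 110)))

# zip pairs the two columns in a single pass
%timeit df.assign(location=[*zip(df.lat, df.lon)])

# apply calls a Python function once per row, which is typically slower
%timeit df.apply(tuple, axis=1)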
list variant
Though I'd still suggest tuples, since they are immutable and hashable:
df.assign(location=df[['lat', 'lon']].values.tolist())
lat lon location
0 10 100 [10, 100]
1 11 101 [11, 101]
2 12 102 [12, 102]
3 13 103 [13, 103]
4 14 104 [14, 104]
5 15 105 [15, 105]
6 16 106 [16, 106]
7 17 107 [17, 107]
8 18 108 [18, 108]
9 19 109 [19, 109]

I question the usefulness of this column, but you can generate it by applying the tuple callable over the columns.
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['lon', 'lat'])
>>> df
   lon  lat
0    1    2
1    3    4
>>> df['Location'] = df.apply(tuple, axis=1)
>>> df
   lon  lat Location
0    1    2   (1, 2)
1    3    4   (3, 4)
If there are other columns than 'lon' and 'lat' in your dataframe, use
df['Location'] = df[['lon', 'lat']].apply(tuple, axis=1)

Data from Pir
# .values.T yields the lat row and the lon row; zip(*...) pairs them back into (lat, lon) tuples
df['New'] = tuple(zip(*df[['lat', 'lon']].values.T))
df
Out[106]:
lat lon New
0 10 100 (10, 100)
1 11 101 (11, 101)
2 12 102 (12, 102)
3 13 103 (13, 103)
4 14 104 (14, 104)
5 15 105 (15, 105)
6 16 106 (16, 106)
7 17 107 (17, 107)
8 18 108 (18, 108)
9 19 109 (19, 109)

I definitely learned something from W-B and timgeb. My idea was to just convert to strings and concatenate. I posted my answer in case you wanted the result as a string. Otherwise it looks like the answers above are the way to go.
import pandas as pd

Dic = {'Latitude': [41.864073], 'Longitude': [-87.706819]}
DF = pd.DataFrame.from_dict(Dic)
DF['Location'] = DF['Latitude'].astype(str) + ', ' + DF['Longitude'].astype(str)
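pandas also has Series.str.cat for this kind of concatenation; a minimal sketch on the same frame:
DF['Location'] = DF['Latitude'].astype(str).str.cat(DF['Longitude'].astype(str), sep=', ')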

Related

How to read a file into a dataframe where the values are grouped in parentheses?

I need to import a text file that is formatted as such
(201803, 95), (201708, 61)
(202002, 88), (201501, 88)
(202004, 76), (201602, 35)
(201012, 113), (201012, 115)
(201009, 114), (201708, 114)
(201603, 31), (201206, 78)
And turn it into a dataframe with this format where the first column is a date and the second column is an integer for temperature.
Year Temp
201803 95
201708 61
202002 88
201501 88
202004 76
201602 35
201012 113
201012 115
201009 114
201708 114
201603 31
201206 78
I tried to import it using pandas.read_csv, but I get a dataframe with 4 columns, where each line's two pairs land in the same row.
0 1 2 3
0 (201803 95) (201708 61)
1 (202002 88) (201501 88
Pandas imports the file exactly as it sees it. To group things the way you want, you'll need to quote each pair in the file so it parses as a single field, or simply restructure the data so the pairs are stacked one per line, matching your desired output.
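As an illustration, here is one way to do that restructuring in code (a sketch, assuming the input sits in a file called data.txt formatted exactly like the sample above):
import re
import pandas as pd

rows = []
with open('data.txt') as f:
    for line in f:
        # every '(year, temp)' group becomes one output row
        for year, temp in re.findall(r'\((\d+),\s*(\d+)\)', line):
            rows.append((int(year), int(temp)))

df = pd.DataFrame(rows, columns=['Year', 'Temp'])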

I want to redefine a list of int64 arrays so that the values inside the arrays are sorted and only contain values which are not False

I have the following data:
data = ['10 20 10 36 30 33 400 400 -1 -1',
        '100 50 50 30 60 27 70 24 -2 -2 700 700',
        '300 1000 80 21 90 18 100 15 110 12 120 9 900 900 -3 -3',
        '30 90 130 6 140 3 -4 -4 1000 1000']
data = [e.split() for e in data]
concentration = [np.array(row[3::2], dtype=np.int64) for row in data]
I want to set the values in my variable (concentration) that are not within the interval 0-50 to False/0. So I wrote the following code:
for row in range(len(concentration)):
    for element in range(len(concentration[row])):
        if 0 > concentration[row][element] or concentration[row][element] > 50:
            concentration[row][element] = False
            print("Error: Index {:} in time is out of range".format(element))
I get the following output, and my concentration variable looks like this:
Array of int64 [36 33 0 0]
Array of int64 [30 27 24 0 0]
Array of int64 [21 18 15 12 0 0]
Array of int64 [6 3 0 0]
Now I want to redefine my variable (concentration) so that the values are sorted and only contain True/1 values (values which are not False/0). I want my new concentration variable to look like this:
Array of int64 [33 36]
Array of int64 [24 27 30]
Array of int64 [12 15 18 21]
Array of int64 [3 6]
Thanks for the help so far!
You can solve your problem this way:
initial_data = ['10 20 10 36 30 33 400 400 -1 -1',
                '100 50 50 30 60 27 70 24 -2 -2 700 700',
                '300 1000 80 21 90 18 100 15 110 12 120 9 900 900 -3 -3',
                '30 90 130 6 140 3 -4 -4 1000 1000']
result = [sorted(filter(lambda x: 0 < x < 50,
                        list(map(int, elem.split()))[3::2]))
          for elem in initial_data]
print(result)
# [[33, 36], [24, 27, 30], [9, 12, 15, 18, 21], [3, 6]]
If you need numpy arrays instead of lists, you can add the conversion to the list comprehension:
result = [np.array(sorted(filter(lambda x: 0 < x < 50,
                                 list(map(int, elem.split()))[3::2])),
                   dtype=np.int64)
          for elem in initial_data]
print(result)
# [array([33, 36]), array([24, 27, 30]), array([ 9, 12, 15, 18, 21]), array([3, 6])]
UPDATE
To redefine your concentration variable with the desired result, you can use the following construction, which keeps only the positive entries of each array and sorts them:
concentration = list(map(lambda x: np.sort(x[x > 0]), concentration))

subtract two columns in a data frame if they have the same ending in a loop

If my data looks like this
Index Country ted_Val1 sam_Val1 ... ted_Val10 sam_Val10
1 Australia 1 3 ... 20 5
2 Bambua 12 33 ... 15 56
3 Tambua 14 34 ... 10 58
df = pd.DataFrame([["Australia", 1, 3, 20, 5],
["Bambua", 12, 33, 15, 56],
["Tambua", 14, 34, 10, 58]
], columns=["Country", "ted_Val1", "sam_Val1", "ted_Val10", "sam_Val10"]
)
I'd like to subtract each 'sam_' column from its matching 'ted_' column using a list, creating a new column starting with 'dif_', such that:
Index Country ted_Val1 sam_Val1 diff_Val1 ... ted_Val10 sam_Val10 diff_val10
1 Australia 1 3 -2 ... 20 5 -15
2 Bambua 12 33 12 ... 15 56 -41
3 Tambua 14 34 14... 10 58 -48
so far I've got:
calc_vars = ['ted_Val1',
             'sam_Val1',
             'ted_Val10',
             'sam_Val10']

for i in calc_vars:
    df_diff['dif_' + str(i)] = df.['ted_' + str(i)] - df.['sam_' + str(i)]
but I'm getting errors and am not sure where to go from here. As a warning, this is dummy data and there can be several underscores in the names.
IIUC you can use filter to choose the columns for subtraction (assuming your columns are properly sorted like your sample):
print(pd.concat([df,
                 pd.DataFrame(df.filter(like="ted").to_numpy() - df.filter(like="sam").to_numpy(),
                              columns=["diff" + i.split("_")[-1] for i in df.columns if "ted_Val" in i])],
                axis=1))
Country ted_Val1 sam_Val1 ted_Val10 sam_Val10 diff1 diff10
0 Australia 1 3 20 5 -2 15
1 Bambua 12 33 15 56 -21 -41
2 Tambua 14 34 10 58 -20 -48
try this,
calc_vars = ['ted_Val1', 'sam_Val1', 'ted_Val10', 'sam_Val10']
# extract even & odd values from calc_vars
# ['ted_Val1', 'ted_Val10'], ['sam_Val1', 'sam_Val10']
for ted, sam in zip(calc_vars[::2], calc_vars[1::2]):
    df['diff_' + ted.split("_")[-1]] = df[ted] - df[sam]
Edit: if columns are not sorted,
ted_cols = sorted(df.filter(regex=r"ted_Val\d+"), key=lambda x: x.split("_")[-1])
sam_cols = sorted(df.filter(regex=r"sam_Val\d+"), key=lambda x: x.split("_")[-1])

for ted, sam in zip(ted_cols, sam_cols):
    df['diff_' + ted.split("_")[-1]] = df[ted] - df[sam]
Country ted_Val1 sam_Val1 ted_Val10 sam_Val10 diff_Val1 diff_Val10
0 Australia 1 3 20 5 -2 15
1 Bambua 12 33 15 56 -21 -41
2 Tambua 14 34 10 58 -20 -48
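One caveat with the edit above: sorting on the raw suffix is lexicographic, so 'Val2' would sort after 'Val10'. If you have more than these two value numbers, a numeric sort key is safer (a sketch):
import re

ted_cols = sorted(df.filter(regex=r"ted_Val\d+"), key=lambda x: int(re.search(r"\d+$", x).group()))
sam_cols = sorted(df.filter(regex=r"sam_Val\d+"), key=lambda x: int(re.search(r"\d+$", x).group()))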

Pandas: Convert annual data to decade data

Background
I want to determine the global cumulative value of a variable for different decades starting from 1990 to 2014 i.e. 1990, 2000, 2010 (3 decades separately). I have annual data for different countries. However, data availability is not uniform.
Existing questions
Uses R: 1
Following questions look at date formatting issues: 2, 3
Answers to these questions do not address the current question.
Current question
How to obtain a global sum for the period of different decades using features/tools of Pandas?
Expected outcome
1990-2000 x1
2000-2010 x2
2010-2015 x3
Method used so far
data_binned = data_pivoted.copy()
decade = []

# obtaining decade values for each country
for i in range(1960, 2017):
    if i in list(data_binned):
        # adding the columns into the decade list
        decade.append(i)
    if i % 10 == 0:
        # adding a large header so that newly created columns are set at the end of the dataframe
        data_binned[i * 10] = data_binned.apply(lambda x: sum(x[j] for j in decade), axis=1)
        decade = []

for x in list(data_binned):
    if x < 3000:
        # removing non-decade columns
        del data_binned[x]

# renaming the decade columns
new_names = [int(x / 10) for x in list(data_binned)]
data_binned.columns = new_names

# computing global values
global_values = data_binned.sum(axis=0)
This is a non-optimal method because of less experience in using Pandas. Kindly suggest a better method which uses features of Pandas. Thank you.
If I had pandas.DataFrame called df looking like this:
>>> df = pd.DataFrame(
... {
... 1990: [1, 12, 45, 67, 78],
... 1999: [1, 12, 45, 67, 78],
... 2000: [34, 6, 67, 21, 65],
... 2009: [34, 6, 67, 21, 65],
... 2010: [3, 6, 6, 2, 6555],
... 2015: [3, 6, 6, 2, 6555],
... }, index=['country_1', 'country_2', 'country_3', 'country_4', 'country_5']
... )
>>> print(df)
1990 1999 2000 2009 2010 2015
country_1 1 1 34 34 3 3
country_2 12 12 6 6 6 6
country_3 45 45 67 67 6 6
country_4 67 67 21 21 2 2
country_5 78 78 65 65 6555 6555
I could make another pandas.DataFrame called df_decades with decades statistics like this:
>>> df_decades = pd.DataFrame()
>>>
>>> for decade in set([(col // 10) * 10 for col in df.columns]):
... cols_in_decade = [col for col in df.columns if (col // 10) * 10 == decade]
... df_decades[f'{decade}-{decade + 9}'] = df[cols_in_decade].sum(axis=1)
>>>
>>> df_decades = df_decades[sorted(df_decades.columns)]
>>> print(df_decades)
1990-1999 2000-2009 2010-2019
country_1 2 68 6
country_2 24 12 12
country_3 90 134 12
country_4 134 42 4
country_5 156 130 13110
The idea is to iterate over all decades implied by the column names in df, filter the columns belonging to each decade, and aggregate them.
Finally, I could merge these data frames together, so my data frame df could be enriched by decades statistics from the second data frame df_decades.
>>> df = pd.merge(left=df, right=df_decades, left_index=True, right_index=True, how='left')
>>> print(df)
1990 1999 2000 2009 2010 2015 1990-1999 2000-2009 2010-2019
country_1 1 1 34 34 3 3 2 68 6
country_2 12 12 6 6 6 6 24 12 12
country_3 45 45 67 67 6 6 90 134 12
country_4 67 67 21 21 2 2 134 42 4
country_5 78 78 65 65 6555 6555 156 130 13110
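As an aside, the same decade sums can be computed more directly (a sketch, assuming integer year columns as in df above) by grouping on each column's decade:
# map every year column to the start of its decade, then sum within each group
decades = (df.columns // 10) * 10
df_decades = df.T.groupby(decades).sum().T
df_decades.columns = [f'{d}-{d + 9}' for d in df_decades.columns]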

create a bigram from a column in pandas df

I have this test table in a pandas dataframe:
Leaf_category_id session_id product_id
0 111 1 987
3 111 4 987
4 111 1 741
1 222 2 654
2 333 3 321
This is an extension of my previous question, which was answered by #jazrael (view answer).
So after getting the values in the product_id column as below (just an assumption, slightly different from the output of my previous question):
|product_id |
---------------------------
|111,987,741,34,12 |
|987,1232 |
|654,12,324,465,342,324 |
|321,741,987 |
|324,654,862,467,243,754 |
|6453,123,987,741,34,12 |
and so on.
I want to create a new column in which every value in a row is paired into a bigram with the next one, and the last number in the row is combined with the first one. For example:
|product_id |Bigram
-------------------------------------------------------------------------
|111,987,741,34,12 |(111,987),**(987,741)**,(741,34),(34,12),(12,111)
|987,1232 |(987,1232),(1232,987)
|654,12,324,465,342,32 |(654,12),(12,324),(324,465),(465,342),(342,32),(32,654)
|321,741,987 |(321,741),**(741,987)**,(987,321)
|324,654,862 |(324,654),(654,862),(862,324)
|123,987,741,34,12 |(123,987),(987,741),(34,12),(12,123)
Ignore the ** (I'll tell you later why I starred those).
The code to achieve the bigrams is:
for i in df.Leaf_category_id.unique():
    print(df[df.Leaf_category_id == i].groupby('session_id')['product_id']
          .apply(lambda x: list(zip(x, x[1:]))).reset_index())
From this df, I want to take the bigram column and make one more column named frequency, which gives me the frequency with which each bigram occurred.
Note*: (987,741) and (741,987) are to be considered the same, so one duplicate entry should be removed and the frequency of (987,741) should be 2.
Similar is the case with (34,12): it occurs two times, so its frequency should be 2.
|Bigram
---------------
|(111,987),
|**(987,741)**
|(741,34)
|(34,12)
|(12,111)
|**(741,987)**
|(987,321)
|(34,12)
|(12,123)
Final Result should be.
|Bigram | frequency |
--------------------------
|(111,987) | 1
|(987,741) | 2
|(741,34) | 1
|(34,12) | 2
|(12,111) | 1
|(987,321) | 1
|(12,123) | 1
I am hoping to find an answer here. Please help me; I have elaborated as much as possible.
Try this code:
import pandas as pd

# pd.DataFrame.from_csv was removed; read_csv with index_col=0 is its replacement
df = pd.read_csv("data.csv", index_col=0)

# consecutive
grouped_consecutive_product_ids = df.groupby(['Leaf_category_id', 'session_id'])['product_id'].apply(
    lambda x: [tuple(sorted(pair)) for pair in zip(x, x[1:])]).reset_index()
df1 = pd.DataFrame(grouped_consecutive_product_ids)
s = df1.product_id.apply(lambda x: pd.Series(x)).unstack()
df2 = pd.DataFrame(s.reset_index(level=0, drop=True)).dropna()
df2.rename(columns={0: 'Bigram'}, inplace=True)
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count')
bigram_frequency_consecutive = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index()
del bigram_frequency_consecutive["index"]
For combinations (all possible bigrams):
from itertools import combinations
import pandas as pd

df = pd.read_csv("data.csv", index_col=0)

# combinations
grouped_combination_product_ids = df.groupby(['Leaf_category_id', 'session_id'])['product_id'].apply(
    lambda x: [tuple(sorted(pair)) for pair in combinations(x, 2)]).reset_index()
df1 = pd.DataFrame(grouped_combination_product_ids)
s = df1.product_id.apply(lambda x: pd.Series(x)).unstack()
df2 = pd.DataFrame(s.reset_index(level=0, drop=True)).dropna()
df2.rename(columns={0: 'Bigram'}, inplace=True)
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count')
bigram_frequency_combinations = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index()
del bigram_frequency_combinations["index"]
where data.csv contains
Leaf_category_id,session_id,product_id
0,111,1,111
3,111,4,987
4,111,1,741
1,222,2,654
2,333,3,321
5,111,1,87
6,111,1,34
7,111,1,12
8,111,1,987
9,111,4,1232
10,222,2,12
11,222,2,324
12,222,2,465
13,222,2,342
14,222,2,32
15,333,3,321
16,333,3,741
17,333,3,987
18,333,3,324
19,333,3,654
20,333,3,862
21,222,1,123
22,222,1,987
23,222,1,741
24,222,1,34
25,222,1,12
The resultant bigram_frequency_consecutive will be
Bigram freq
0 (12, 34) 2
1 (12, 324) 1
2 (12, 654) 1
3 (12, 987) 1
4 (32, 342) 1
5 (34, 87) 1
6 (34, 741) 1
7 (87, 741) 1
8 (111, 741) 1
9 (123, 987) 1
10 (321, 321) 1
11 (321, 741) 1
12 (324, 465) 1
13 (324, 654) 1
14 (324, 987) 1
15 (342, 465) 1
16 (654, 862) 1
17 (741, 987) 2
18 (987, 1232) 1
The resultant bigram_frequency_combinations will be
Bigram freq
0 (12, 32) 1
1 (12, 34) 2
2 (12, 87) 1
3 (12, 111) 1
4 (12, 123) 1
5 (12, 324) 1
6 (12, 342) 1
7 (12, 465) 1
8 (12, 654) 1
9 (12, 741) 2
10 (12, 987) 2
11 (32, 324) 1
12 (32, 342) 1
13 (32, 465) 1
14 (32, 654) 1
15 (34, 87) 1
16 (34, 111) 1
17 (34, 123) 1
18 (34, 741) 2
19 (34, 987) 2
20 (87, 111) 1
21 (87, 741) 1
22 (87, 987) 1
23 (111, 741) 1
24 (111, 987) 1
25 (123, 741) 1
26 (123, 987) 1
27 (321, 321) 1
28 (321, 324) 2
29 (321, 654) 2
30 (321, 741) 2
31 (321, 862) 2
32 (321, 987) 2
33 (324, 342) 1
34 (324, 465) 1
35 (324, 654) 2
36 (324, 741) 1
37 (324, 862) 1
38 (324, 987) 1
39 (342, 465) 1
40 (342, 654) 1
41 (465, 654) 1
42 (654, 741) 1
43 (654, 862) 1
44 (654, 987) 1
45 (741, 862) 1
46 (741, 987) 3
47 (862, 987) 1
48 (987, 1232) 1
In the above case it groups by both Leaf_category_id and session_id.
We are going to pull out the values from product_id, create bigrams that are sorted (so reversed pairs deduplicate), count them to get the frequencies, and then populate a data frame.
from collections import Counter
import pandas as pd

# assuming your data frame is called 'df' and product_id holds lists of ids
bigrams = [list(zip(x, x[1:])) for x in df.product_id.values.tolist()]
# sort each pair so (a, b) and (b, a) collapse into the same key
bigram_list = [tuple(sorted(xx)) for x in bigrams for xx in x]
freq_dict = Counter(bigram_list)
df_freq = pd.DataFrame([[bigram, freq] for bigram, freq in freq_dict.items()],
                       columns=['bigram', 'freq'])
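For a quick sanity check, feed it a toy frame (a sketch; product_id is assumed to hold lists of ids, as in the question):
df = pd.DataFrame({'product_id': [[111, 987, 741], [741, 987]]})
# running the snippet above then yields df_freq with
# (111, 987) -> 1 and (741, 987) -> 2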
