If I have a data frame which has float columns like below
Pickup_longitude Pickup_latitude
1176807 -73.929321 40.746761
753359 -73.940964 40.679981
1378672 -73.924011 40.824677
302960 -73.845108 40.754841
827558 -73.937073 40.820759
I want to concatenate the latitude and longitude into a single column as "lat,long".
I wrote the code below for a sample of three rows, but I am wondering whether there is a faster method than converting to string with .astype(str). I initially tried str(), but that also pulls the index values into the result.
trip_data_sample['lat_long_pickup']=trip_data_sample["Pickup_latitude"][:3].astype(str)+","+\
trip_data_sample["Pickup_longitude"].astype(str)
You could create tuples using a list comprehension and indexing the dataframe:
df['lat_long'] = [', '.join(str(x) for x in y) for y in map(tuple, df[['Pickup_latitude', 'Pickup_longitude']].values)]
df looks like this now:
>>> df
Pickup_longitude Pickup_latitude lat_long
1176807 -73.929321 40.746761 40.746761, -73.929321
753359 -73.940964 40.679981 40.679981, -73.940964
1378672 -73.924011 40.824677 40.824677, -73.924011
302960 -73.845108 40.754841 40.754841, -73.845108
827558 -73.937073 40.820759 40.820759, -73.937073
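As a rough sketch of two other ways to build the same column, the snippet below pairs the columns with zip and an f-string, and separately uses Series.str.cat. The three sample rows are copied from the question; the variable name lat_long_cat is only illustrative, and no timing claim is made here, so it would be worth measuring on the real data.

```python
import pandas as pd

# Three sample rows taken from the question.
df = pd.DataFrame(
    {
        "Pickup_longitude": [-73.929321, -73.940964, -73.924011],
        "Pickup_latitude": [40.746761, 40.679981, 40.824677],
    },
    index=[1176807, 753359, 1378672],
)

# Option 1: walk both columns in parallel and format each pair as "lat, long".
df["lat_long"] = [
    f"{lat}, {lon}"
    for lat, lon in zip(df["Pickup_latitude"], df["Pickup_longitude"])
]

# Option 2: stay inside pandas string methods.
lat_long_cat = df["Pickup_latitude"].astype(str).str.cat(
    df["Pickup_longitude"].astype(str), sep=", "
)

print(df["lat_long"].iloc[0])
```

Both options iterate over values only, so the index never ends up in the strings.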
I have a DataFrame that contains two columns, 'A_List' and 'B_List', which are of the string dtype. I have converted these to lists and I would like to now perform element wise addition of the elements in the lists at specific indices. I have attached an example of the csv file I'm using. When I do the following, I am getting an output that is joining the elements at the specified indices as opposed to finding their sum. What may I try differently to achieve the sum instead?
For example, when I do row["A_List"][0] + row["B_List"][3], the desired output would be 0.16 (since 0.1+0.06 = 0.16). Instead, I am getting 0.10.06 as my answer.
import pandas as pd
df = pd.read_csv('Example.csv')
# Get rid of the brackets []
df["A_List"] = df["A_List"].apply(lambda x: x.strip("[]"))
df["B_List"] = df["B_List"].apply(lambda x: x.strip("[]"))
# Convert the string dtype of values into a list
df["A_List"] = df["A_List"].apply(lambda x: x.split())
df["B_List"] = df["B_List"].apply(lambda x: x.split())
for i, row in df.iterrows():
print(row["A_List"][0] + row["B_List"][3])
You need to turn the individual values into floats when parsing your string lists.
You can do this in one step with DataFrame.applymap, which applies the given function to every element one at a time, together with a lambda whose list comprehension wraps str.strip and str.split.
import pandas as pd
df = pd.DataFrame(
{
"A_List": ["[0.1 0.2 0.3]", "[1.1 1.2 1.3]"],
"B_List": ["[0.9 0.8 0.7]", "[0.4 0.3 0.2]"],
}
)
df[["A_List", "B_List"]] = df[["A_List", "B_List"]].applymap(
lambda x: [float(v) for v in x.strip("[]").split()]
)
for i, row in df.iterrows():
print(row["A_List"][0] + row["B_List"][2])
prints
0.7999999999999999
1.3
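DataFrame.applymap is deprecated in recent pandas releases (2.1 and later) in favor of DataFrame.map, so a variant that avoids both may be useful. The sketch below assumes the same two-row sample data and uses Series.apply per column; the helper name parse_float_list is made up for illustration. It also rounds the sums, which is one possible way to hide floating-point noise such as 0.7999999999999999.

```python
import pandas as pd


def parse_float_list(text):
    """Turn a string such as '[0.1 0.2 0.3]' into [0.1, 0.2, 0.3]."""
    return [float(v) for v in text.strip("[]").split()]


df = pd.DataFrame(
    {
        "A_List": ["[0.1 0.2 0.3]", "[1.1 1.2 1.3]"],
        "B_List": ["[0.9 0.8 0.7]", "[0.4 0.3 0.2]"],
    }
)

# Series.apply works the same way across old and new pandas versions.
for col in ["A_List", "B_List"]:
    df[col] = df[col].apply(parse_float_list)

# Add the first element of A_List to the third element of B_List, row by row.
sums = [round(a[0] + b[2], 10) for a, b in zip(df["A_List"], df["B_List"])]
print(sums)
```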
I am trying to import csv data into a pandas dataframe. To do this I am doing the following:
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,index_col=False,header=None)
index = pd.MultiIndex.from_arrays((columns, units, descr))
df.columns = index
df.columns.names = ['Name','Unit','Description']
df = df.apply(pd.to_numeric)
data['isotherm'] = df
This produces e.g. the following table:
In: data['isotherm']
Out:
Name Relative_Pressure Volume_STP
Unit - ccm/g
Description p/p0
0 0.042691 29.3601
1 0.078319 30.3071
2 0.129529 31.1643
3 0.183355 31.8513
4 0.233435 32.3972
5 0.280847 32.8724
However if I only want to get the values of the column Relative_Pressure I get this output:
In: data['isotherm']['Relative_Pressure'].values
Out:
array([[0.042691],
[0.078319],
[0.129529],
[0.183355],
[0.233435],
[0.280847]])
Of course, I could flatten every column I want to use:
x = [item for sublist in data['isotherm']['Relative_Pressure'].values for item in sublist]
However, this would take a lot of extra effort and would also reduce readability. How can I make sure the data is flat for the whole data frame?
array([[...]]) is not a list of lists, but a 2D numpy array. (I'm not sure why the values are returned as a single-column 2D array rather than a 1D array here, though. When I create a primitive DataFrame, a single column's values are returned as a 1D array.)
You can flatten it using numpy's built-in array methods, e.g.
x = data['isotherm']['Relative_Pressure'].values.flatten()
Edit: This might be caused by the MultiIndex.
The direct way of indexing into one column belonging to your MultiIndex object is with a tuple as follows:
data[('isotherm', 'Relative_Pressure')]
which will return a Series object whose .values attribute will give you the expected 1D array. The docs discuss this here
You should be careful using chained indexing like data['isotherm']['Relative_Pressure'] because you won't know whether you are dealing with a copy of the data or a view of the data. Search Stack Overflow for pandas' SettingWithCopyWarning for more details, or read the docs here.
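To illustrate why the 2D array shows up, here is a small sketch that rebuilds a frame with the same three column levels as in the question (Name, Unit, Description). The data values are the first three rows from the question; everything else is an assumption about how the frame is laid out. With three levels, a partial key leaves a DataFrame behind, while a full tuple with one entry per level returns a Series.

```python
import pandas as pd

# Three-level column index, mirroring the layout shown in the question.
columns = pd.MultiIndex.from_arrays(
    [["Relative_Pressure", "Volume_STP"], ["-", "ccm/g"], ["p/p0", ""]],
    names=["Name", "Unit", "Description"],
)
df = pd.DataFrame(
    [[0.042691, 29.3601], [0.078319, 30.3071], [0.129529, 31.1643]],
    columns=columns,
)

# Partial key: the remaining levels survive, so the result is a DataFrame.
partial = df["Relative_Pressure"].values

# Full key (one entry per level): the result is a Series.
full = df[("Relative_Pressure", "-", "p/p0")].values

# Alternative: keep only the 'Name' level on a copy, then index normally.
flat = df.copy()
flat.columns = flat.columns.get_level_values("Name")
by_name = flat["Relative_Pressure"].values

print(partial.shape, full.shape, by_name.shape)
```

Dropping levels on a copy keeps the unit and description metadata available on the original frame.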
I'd like to concatenate two columns in pandas. Each column consists of a list of floating points of 1x4 elements. I'd like to merge two columns such that the output is a vector of 1x8. The below shows a snippet of the dataframe
ue,bs
"[1.27932459e-01 7.83234197e-02 3.24789420e-02 4.34971932e-01]","[2.97806183e-01 2.32453145e-01 3.10236304e-01 1.69975788e-02]"
"[0.05627587 0.4113416 0.02160842 0.20420576]","[1.64862491e-01 1.35556330e-01 2.59050065e-02 1.42498115e-02]"
To concatenate two columns, I do the following:
df['ue_bs'] = zip(df_join['ue'], df_join['bs'])
With this, I get a new column 'ue_bs' which contains the following for the first row of df['ue_bs']:
(array([1.27932459e-01, 7.83234197e-02, 3.24789420e-02, 4.34971932e-01]),
array([2.97806183e-01, 2.32453145e-01, 3.10236304e-01, 1.69975788e-02]))
However, they are still two arrays. In order to merge them, I did it as follows:
a = df['ue_bs'][0]
np.concatenate((a[0], a[1]), axis=0)
Then, I got
array([1.27932459e-01, 7.83234197e-02, 3.24789420e-02, 4.34971932e-01,
2.97806183e-01, 2.32453145e-01, 3.10236304e-01, 1.69975788e-02])
Is there a neat way of doing this in a single line of code, instead of having to loop through df['ue_bs'] and call np.concatenate() on each row?
To concatenate two lists in Python, the easiest way is to use +. The same is true when concatenating pandas columns whose elements are lists. You can simply do:
df['ue_bs'] = df['ue'] + df['bs']
If the columns hold numpy arrays, + would add them element-wise instead, so first convert them into normal Python lists before the concatenation:
df['ue_bs'] = df['ue'].apply(lambda x: x.tolist()) + df['bs'].apply(lambda x: x.tolist())
Create 2d numpy array and then numpy.hstack:
a = np.array(df[['ue','bs']].values.tolist())
df['ue_bs'] = np.hstack((a[:, 0], a[:, 1])).tolist()
print (df.loc[0, 'ue_bs'])
[0.127932459, 0.0783234197, 0.032478942, 0.434971932,
0.297806183, 0.232453145, 0.310236304, 0.0169975788]
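Another possible one-liner, sketched here under the assumption that both columns already hold 1D numpy arrays, is to zip the columns and call np.concatenate on each pair. The sample values below are invented for illustration and are not the asker's data.

```python
import numpy as np
import pandas as pd

# Made-up sample data: each cell holds a 1D array of four floats.
df = pd.DataFrame(
    {
        "ue": [np.array([0.1, 0.2, 0.3, 0.4]), np.array([0.5, 0.6, 0.7, 0.8])],
        "bs": [np.array([1.0, 2.0, 3.0, 4.0]), np.array([5.0, 6.0, 7.0, 8.0])],
    }
)

# zip yields one (ue, bs) pair per row; np.concatenate joins the pair end to end.
df["ue_bs"] = [np.concatenate(pair) for pair in zip(df["ue"], df["bs"])]

print(df["ue_bs"].iloc[0])
```

This keeps each cell as a numpy array of length 8 rather than converting to Python lists.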
I'm trying to create the following data frame
new_df = pd.DataFrame(data = percentage_default, columns =
df['purpose'].unique())
The variables I'm using are as follows
percentage_default = [0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748]
df['purpose'].unique() = array(['debt_consolidation', 'credit_card', 'all_other',
'home_improvement', 'small_business', 'major_purchase',
'educational'], dtype=object)
When I try to create this data frame I get the following error:
Shape of passed values is (1, 7), indices imply (7, 7)
To me it seemed like the shape of the values and the indices were the same. Could someone explain what I'm missing here?
Thanks!
You're creating a dataframe from a list. Calling pd.DataFrame(your_list), where your_list is a simple homogeneous list, will create a single row for every element in that list. For your input:
percentage_default = [0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748]
pandas will create a dataframe like this:
Column
0.15238817285822592
0.11568938193343899
0.16602316602316602
0.17011128775834658
0.2778675282714055
0.11212814645308924
0.20116618075801748
Because of this, your dataframe only has one column. You're trying to pass multiple column names, which is confusing pandas.
If you wish to create a dataframe from a list with multiple columns, you need to nest more lists or tuples inside your original list. Each nested tuple/list will become a row in the dataframe, and each element in the nested tuple/list will become a new column. See this:
percentage_default = [(0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748)] # nested tuple
We have one nested tuple in this list, so our dataframe will have 1 row with n columns, where n is the number of elements in the nested tuple (7). We can then pass your 7 column names:
percentage_default = [(0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748)]
col_names = ['debt_consolidation', 'credit_card', 'all_other',
'home_improvement', 'small_business', 'major_purchase',
'educational']
new_df = pd.DataFrame(percentage_default, columns = col_names)
print(new_df)
debt_consolidation credit_card all_other home_improvement \
0 0.152388 0.115689 0.166023 0.170111
small_business major_purchase educational
0 0.277868 0.112128 0.201166
Try rewriting your data as a dictionary that maps each column name to its value:
percentage_default = {
'debt_consolidation': 0.15238817285822592,
'credit_card': 0.11568938193343899,
...
}
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
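One caveat with the dictionary form: passing a dict of plain scalars straight to pd.DataFrame raises a ValueError because there is nothing to use as the row index. A minimal sketch of one way around that is to wrap the dict in a list so that it becomes a single row. Only the first three name/value pairs from the thread are used here.

```python
import pandas as pd

# First three name/value pairs from the question; the rest follow the same pattern.
percentage_default = {
    "debt_consolidation": 0.15238817285822592,
    "credit_card": 0.11568938193343899,
    "all_other": 0.16602316602316602,
}

# A list of dicts is read as a list of rows: one dict, one row.
new_df = pd.DataFrame([percentage_default])

print(new_df)
```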
I'm trying to apply a function to a pandas dataframe. The function requires two np.array inputs and fits them using a well-defined model.
The problem is that I can't apply this function to the selected columns, because their rows contain lists read from a JSON file rather than np.array objects.
Now, I've tried different solutions:
#Here is where I discover the problem
train_df['result'] = train_df.apply(my_function(train_df['col1'],train_df['col2']))
#so I've tried to cast the Series before passing them to the function in both these ways:
X_col1_casted = train_df['col1'].dtype(np.array)
X_col2_casted = train_df['col2'].dtype(np.array)
doesn't work.
X_col1_casted = train_df['col1'].astype(np.array)
X_col2_casted = train_df['col2'].astype(np.array)
doesn't work.
X_col1_casted = train_df['col1'].dtype(np.array)
X_col2_casted = train_df['col2'].dtype(np.array)
doesn't work.
What I'm thinking of doing now is a long procedure like this:
starting from the uncast column Series, convert them to lists, iterate over them, apply the function to each element after converting it with np.array(), and append the results to a temporary list. Once done, I would convert this list into a new column. (I don't know yet whether this will work.)
Does anyone know how to help me?
EDIT:
I add one example to be clear:
The function assumes its input is two np.arrays. Right now it receives two lists, since they are retrieved from a JSON file. The situation is this one:
col1 col2 result
[1,2,3] [4,5,6] [5,7,9]
[0,0,0] [1,2,3] [1,2,3]
Clearly the function is not a plain sum but a function of my own. For the moment, assume that this sum only works when starting from arrays and not from lists. What should I do?
Thanks in advance
Use apply to convert each element to its equivalent array:
df['col1'] = df['col1'].apply(lambda x: np.array(x))
type(df['col1'].iloc[0])
numpy.ndarray
Data:
df = pd.DataFrame({'col1': [[1,2,3],[0,0,0]]})
df
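To go one step further and fill the result column, a possible sketch is to convert each pair of lists with np.asarray and call the function row by row. The function my_function below is a hypothetical stand-in that just adds its inputs, since the real model-fitting function is not shown in the question.

```python
import numpy as np
import pandas as pd


def my_function(a, b):
    # Hypothetical stand-in for the real function; it expects two numpy arrays.
    return a + b


train_df = pd.DataFrame(
    {
        "col1": [[1, 2, 3], [0, 0, 0]],
        "col2": [[4, 5, 6], [1, 2, 3]],
    }
)

# Convert each row's lists to arrays, then call the function once per row.
train_df["result"] = [
    my_function(np.asarray(a), np.asarray(b))
    for a, b in zip(train_df["col1"], train_df["col2"])
]

print(train_df)
```

This leaves col1 and col2 as lists and only converts them at call time, which avoids storing a second copy of the data as arrays.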