I'd like to concatenate two columns in pandas. Each column consists of a list of floating points of 1x4 elements. I'd like to merge two columns such that the output is a vector of 1x8. The below shows a snippet of the dataframe
ue,bs
"[1.27932459e-01 7.83234197e-02 3.24789420e-02 4.34971932e-01]","[2.97806183e-01 2.32453145e-01 3.10236304e-01 1.69975788e-02]"
"[0.05627587 0.4113416 0.02160842 0.20420576]","[1.64862491e-01 1.35556330e-01 2.59050065e-02 1.42498115e-02]"
To concatenate two columns, I do the following:
df['ue_bs'] = zip(df_join['ue'], df_join['bs'])
With this, I get a new column 'ue_bs' which contains the following for the first row of df['ue_bs']:
(array([1.27932459e-01, 7.83234197e-02, 3.24789420e-02, 4.34971932e-01]),
array([2.97806183e-01, 2.32453145e-01, 3.10236304e-01, 1.69975788e-02]))
However, they are still two arrays. In order to merge them, I did it as follows:
a = df['ue_bs'][0]
np.concatenate((a[0], a[1]), axis=0)
Then, I got
array([1.27932459e-01, 7.83234197e-02, 3.24789420e-02, 4.34971932e-01,
2.97806183e-01, 2.32453145e-01, 3.10236304e-01, 1.69975788e-02])
I am wondering is there a neat way of doing this in single line of code, instead of having to loop through df['ue_bs'] and perform np.concatenate()?
To concatinate two lists in python, the easiest way is to use +. The same is true when concating columns in pandas. You can simply do:
df['ue_bs'] = df['ue'] + df['bs']
If the column type is numpy arrays you can first convert them into normal python lists before the concatination:
df['ue_bs'] = df['ue'].apply(lambda x: x.tolist()) + df['bs'].apply(lambda x: x.tolist())
Create 2d numpy array and then numpy.hstack:
a = np.array(df[['ue','bs']].values.tolist())
df['ue_bs'] = np.hstack((a[:, 0], a[:, 1])).tolist()
print (df.loc[0, 'ue_bs'])
[0.127932459, 0.0783234197, 0.032478942, 0.434971932,
0.297806183, 0.232453145, 0.310236304, 0.0169975788]
Related
I am working with matching two separate dataframes on first name using HMNI's fuzzymerge.
On output each row returns a key like: (May, 0.9905315373004635)
I am trying to separate the Name and Score into their own columns. I tried the below code but don't quite get the right output - every row ends up with the same exact name/score in the new columns.
for i, v in enumerate(matched.key):
matched['MatchedNameFinal'] = (matched.key[i][0][0])
matched['MatchedNameScore'] = (matched.key[i][0][1])
matched[['consumer_name_first', 'key','MatchedNameFinal', 'MatchedNameScore']]
first when going over rows in pandas is better to use apply
matched['MatchedNameFinal'] = matched.key.apply(lambda x: x[0][0])
matched['MatchedNameScore'] = matched.key.apply(lambda x: x[0][1])
and in your case I think you are missing a tab in the for loop
for i, v in enumerate(matched.key):
matched['MatchedNameFinal'] = (matched.key[i][0][0])
matched['MatchedNameScore'] = (matched.key[i][0][1])
Generally, you want to avoid using enumerate for pandas because pandas functions are vectorized and much faster to execute.
So this solution won't iterate using enumerate.
First you turn the list into single tuple per row.
matched.key.explode()
Then use zip to split the tuple into 2 columns.
matched['col1'], matched['col2'] = zip(tuples)
Do all in 1 line.
matched['MatchedNameFinal'], matched['MatchedNameScore'] = zip(*matched.key.explode())
I am trying to import csv data into a pandas dataframe. To do this I am doing the following:
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,index_col=False,header=None)
index = pd.MultiIndex.from_arrays((columns, units, descr))
df.columns = index
df.columns.names = ['Name','Unit','Description']
df = df.apply(pd.to_numeric)
data['isotherm'] = df
This produces e.g. the following table:
In: data['isotherm']
Out:
Name Relative_Pressure Volume_STP
Unit - ccm/g
Description p/p0
0 0.042691 29.3601
1 0.078319 30.3071
2 0.129529 31.1643
3 0.183355 31.8513
4 0.233435 32.3972
5 0.280847 32.8724
However if I only want to get the values of the column Relative_Pressure I get this output:
In: data['isotherm']['Relative_Pressure'].values
Out:
array([[0.042691],
[0.078319],
[0.129529],
[0.183355],
[0.233435],
[0.280847]])
Of course I could now for every column I want to use flatten
x = [item for sublist in data['isotherm']['Relative_Pressure'].values for item in sublist]
However this would lead to a lot of extra effort and would also reduce the readability. How can I for the whole data frame make sure the data is flat?
array([[...]]) is not a list of lists, but a 2D numpy array. (I'm not sure why the values are returned as a single-column 2D array rather than a 1D array here, though. When I create a primitive DataFrame, a single column's values are returned as a 1D array.)
You can concatenate and flatten them using numpy's built-in functions, eg.
x = data['isotherm']['Relative_Pressure'].flatten()
Edit: This might be caused by the MultiIndex.
The direct way of indexing into one column belonging to your MultiIndex object is with a tuple as follows:
data[('isotherm', 'Relative_Pressure')]
which will return a Series object whose .values attribute will give you the expected 1D array. The docs discuss this here
You should be careful using chained indexing like data['isotherm']['Relative_Pressure'] because you won't know if you are dealing with a copy of the data or a view of the data. Please do a SO search of pandas' SettingWithCopyWarning for more details or read the docs here.
If I have a data frame which has float columns like below
Pickup_longitude Pickup_latitude
1176807 -73.929321 40.746761
753359 -73.940964 40.679981
1378672 -73.924011 40.824677
302960 -73.845108 40.754841
827558 -73.937073 40.820759
I want to concatenate the lat-long as ("lat","long") in one column.
I did below code for sample three rows but I was wondering is there a faster method instead of converting to string using .astype(str). I initially tried using str() but that also takes the index values into it.
trip_data_sample['lat_long_pickup']=trip_data_sample["Pickup_latitude"][:3].astype(str)+","+\
trip_data_sample["Pickup_longitude"].astype(str)
You could create tuples using a list comprehension and indexing the dataframe:
df['lat_long'] = [', '.join(str(x) for x in y) for y in map(tuple, df[['Pickup_latitude', 'Pickup_longitude']].values)]
df looks like this now:
>>> df
Pickup_longitude Pickup_latitude lat_long
1176807 -73.929321 40.746761 40.746761, -73.929321
753359 -73.940964 40.679981 40.679981, -73.940964
1378672 -73.924011 40.824677 40.824677, -73.924011
302960 -73.845108 40.754841 40.754841, -73.845108
827558 -73.937073 40.820759 40.820759, -73.937073
I'm trying to apply a function to a pandas dataframe, such a function required two np.array as input and it fit them using a well defined model.
The point is that I'm not able to apply this function starting from the selected columns since their "rows" contain list read from a JSON file and not np.array.
Now, I've tried different solutions:
#Here is where I discover the problem
train_df['result'] = train_df.apply(my_function(train_df['col1'],train_df['col2']))
#so I've tried to cast the Series before passing them to the function in both these ways:
X_col1_casted = trai_df['col1'].dtype(np.array)
X_col2_casted = trai_df['col2'].dtype(np.array)
doesn't work.
X_col1_casted = trai_df['col1'].astype(np.array)
X_col2_casted = trai_df['col2'].astype(np.array)
doesn't work.
X_col1_casted = trai_df['col1'].dtype(np.array)
X_col2_casted = trai_df['col2'].dtype(np.array)
does'nt work.
What I'm thinking to do now is a long procedure like:
starting from the uncasted column-series, convert them into list(), iterate on them apply the function to the np.array() single elements, and append the results into a temporary list. Once done I will convert this list into a new column. ( clearly, I don't know if it will work )
Does anyone of you know how to help me ?
EDIT:
I add one example to be clear:
The function assume to have as input two np.arrays. Now it has two lists since they are retrieved form a json file. The situation is this one:
col1 col2 result
[1,2,3] [4,5,6] [5,7,9]
[0,0,0] [1,2,3] [1,2,3]
Clearly the function is not the sum one, but a own function. For a moment assume that this sum can work only starting from arrays and not form lists, what should I do ?
Thanks in advance
Use apply to convert each element to it's equivalent array:
df['col1'] = df['col1'].apply(lambda x: np.array(x))
type(df['col1'].iloc[0])
numpy.ndarray
Data:
df = pd.DataFrame({'col1': [[1,2,3],[0,0,0]]})
df
I have a dataframe that looks likes this:
Sweep Index
Sweep0001 0 -70.434570
1 -67.626953
2 -68.725586
3 -70.556641
4 -71.899414
5 -69.946289
6 -63.964844
7 -73.974609
...
Sweep0039 79985 -63.964844
79986 -66.406250
79987 -67.993164
79988 -68.237305
79989 -66.894531
79990 -71.411133
I want to slice out different ranges of Sweeps.
So for example, I want Sweep0001 : Sweep0003, Sweep0009 : Sweep0015, etc.
I know I can do this in separate lines with ix, i.e.:
df.ix['Sweep0001':'Sweep0003']
df.ix['Sweep0009':'Sweep0015']
And then put those back together into one dataframe (I'm doing this so I can average sweeps together, but I need to select some of them and remove others).
Is there a way to do that selection in one line though? I.e. without having to slice each piece separately, followed by recombining all of it into one dataframe.
Use Pandas IndexSlice
import pandas as pd
idx = pd.IndexSlice
df.loc[idx[["Sweep0001", "Sweep0002", ..., "Sweep0003", "Sweep0009", ..., "Sweep0015"]]
You can retrieve the labels you want this way:
list1 = df.index.get_level_values(0).unique()
list2 = [x for x in list1]
list3 = list2[1:4] #For your Sweep0001:Sweep0003
list3.extend(list2[9:16]) #For you Sweep0009:Sweep0015
df.loc[idx[list3]] #Note that you need one set of "[]"
#less around "list3" as this list comes
#by default with its own set of "[]".
In case you want to also slice by columns you can use:
df.loc[idx[list3],:] #Same as above to include all columns.
df.loc[idx[list3],:"column label"] #Returns data up to that "column label".
More information on slicing is on the Pandas website (http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers) or in this similar Stackoverflow Q/A: Python Pandas slice multiindex by second level index (or any other level)