convert list from data series into a suitable data frame - python

I have a list of data series that looks something like this:
list = np.array([[0.32689251, 0.32677079, 0.32649432, 0.32594585, 0.32532732, 0.32509514, 0.32503138, 0.32492934, 0.324797, 0.32458332], [0.32689251, 0.32677079, 0.32649432, 0.32594585, 0.32532732, 0.32509514, 0.32503138, 0.32492934, 0.324797, 0.32458332], [0.32689251, 0.32677079, 0.32649432, 0.32594585, 0.32532732, 0.32509514, 0.32503138, 0.32492934, 0.324797, 0.32458332]])
I need to convert it to a pandas DataFrame with one column per data series and one row per value, i.e. shape (10, 3) rather than (3, 10).
With the following code, I only get a DataFrame of shape (3, 10):
df = pd.DataFrame(list)

Try this. I would not recommend using a built-in name like "list" for variables.
Shadowing it might just create noise, if not errors.
df = pd.DataFrame(list).transpose()
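As a minimal sketch of the idea (array truncated for brevity, and the variable renamed so it no longer shadows the built-in list):
import numpy as np
import pandas as pd

series_list = np.array([[0.3269, 0.3268, 0.3265],
                        [0.3269, 0.3268, 0.3265],
                        [0.3269, 0.3268, 0.3265]])

df = pd.DataFrame(series_list).transpose()
print(df.shape)  # (3, 3) here; (10, 3) with the original 10-value series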


pandas transform n columns to n/3 columns and n/3 rows

I have a dataframe as shown below:
data = {
'key':['k1','k2'],
'name_M1':['name', 'name'],'area_M1':[1,2],'length_M1':[11,21],'breadth_M1':[12,22],
'name_M2':['name', 'name'],'area_M2':[1,2],'length_M2':[11,21],'breadth_M2':[12,22],
'name_M3':['name', 'name'],'area_M3':[1,2],'length_M3':[11,21],'breadth_M3':[12,22],
'name_M4':['name', 'name'],'area_M4':[1,2],'length_M4':[11,21],'breadth_M4':[12,22],
'name_M5':['name', 'name'],'area_M5':[1,2],'length_M5':[11,21],'breadth_M5':[12,22],
'name_M6':['name', 'name'],'area_M6':[1,2],'length_M6':[11,21],'breadth_M6':[12,22],
}
df = pd.DataFrame(data)
Input data looks like below in wide format
I would like to convert it into a time-based long format like below. We call it time-based because each row holds 3 months of data, and each subsequent row is shifted by 1 month.
For example, a sample of the data looks like below (with only one column for each month):
k1,Area_M1,Area_M2,Area_M3,Area_M4,Area_M5,Area_M6
I would like to convert it like below (subsequent rows are shifted by one month)
k1,Area_M1,Area_M2,Area_M3
K1,Area_M2,Area_M3,Area_M4
K1,Area_M3,Area_M4,Area_M5
K1,Area_M4,Area_M5,Area_M6
But in the real data, instead of one column for each month, I have multiple columns for each month, and all of those columns need to be converted/transformed. I tried something like the below, but it doesn't work:
pd.wide_to_long(df, stubnames=["name_1st", "area_1st", "length_first", "breadth_first",
                               "name_2nd", "area_2nd", "length_2nd", "breadth_2nd",
                               "name_3rd", "area_3rd", "length_3rd", "breadth_3rd"],
                i="key", j="name",
                sep="_", suffix=r"(?:\d+|n)").reset_index()
But I expect my output to be like the below.
(Updated error screenshot not reproduced here.)
This is pretty ugly, but I'm not exactly sure of an easier way to do this. Perhaps you could melt everything and do a rolling pivot, but it's not really much different.
This approach just slices columns 0:12, 4:16, etc. until the end, renaming each slice and concatenating them all together.
import pandas as pd
import numpy as np
data = {
'key':['k1','k2'],
'name_M1':['name', 'name'],'area_M1':[1,2],'length_M1':[11,21],'breadth_M1':[12,22],
'name_M2':['name', 'name'],'area_M2':[1,2],'length_M2':[11,21],'breadth_M2':[12,22],
'name_M3':['name', 'name'],'area_M3':[1,2],'length_M3':[11,21],'breadth_M3':[12,22],
'name_M4':['name', 'name'],'area_M4':[1,2],'length_M4':[11,21],'breadth_M4':[12,22],
'name_M5':['name', 'name'],'area_M5':[1,2],'length_M5':[11,21],'breadth_M5':[12,22],
'name_M6':['name', 'name'],'area_M6':[1,2],'length_M6':[11,21],'breadth_M6':[12,22],
}
df = pd.DataFrame(data)
df = df.set_index('key')
s = 4  # columns per month (name, area, length, breadth)
n = 3  # months per output window
cols = [
    'name_1st', 'area_1st', 'length_1st', 'breadth_1st',
    'name_2nd', 'area_2nd', 'length_2nd', 'breadth_2nd',
    'name_3rd', 'area_3rd', 'length_3rd', 'breadth_3rd',
]
# slice columns 0:12, 4:16, 8:20, 12:24, relabel each window and stack the slices
n_windows = int((df.shape[1] - (s * n)) / n)  # 4 windows for this data
output = pd.concat(
    (df.iloc[:, i * s:12 + i * s].set_axis(cols, axis=1) for i in range(n_windows)),
    ignore_index=True, axis=0,
).set_index(np.tile(df.index, n_windows))
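As a quick sanity check (my own note, not from the original answer), the result on this sample data should have the two keys repeated once per window:
print(output.shape)           # (8, 12): 2 keys x 4 windows, 12 columns per window
print(output.index.tolist())  # ['k1', 'k2', 'k1', 'k2', 'k1', 'k2', 'k1', 'k2']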

Why does pandas.to_numeric result in a list of lists?

I am trying to import csv data into a pandas dataframe. To do this I am doing the following:
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,index_col=False,header=None)
index = pd.MultiIndex.from_arrays((columns, units, descr))
df.columns = index
df.columns.names = ['Name','Unit','Description']
df = df.apply(pd.to_numeric)
data['isotherm'] = df
This produces e.g. the following table:
In: data['isotherm']
Out:
Name Relative_Pressure Volume_STP
Unit - ccm/g
Description p/p0
0 0.042691 29.3601
1 0.078319 30.3071
2 0.129529 31.1643
3 0.183355 31.8513
4 0.233435 32.3972
5 0.280847 32.8724
However if I only want to get the values of the column Relative_Pressure I get this output:
In: data['isotherm']['Relative_Pressure'].values
Out:
array([[0.042691],
[0.078319],
[0.129529],
[0.183355],
[0.233435],
[0.280847]])
Of course, I could now flatten every column I want to use:
x = [item for sublist in data['isotherm']['Relative_Pressure'].values for item in sublist]
However, this would take a lot of extra effort and also reduce readability. How can I make sure the data is flat for the whole data frame?
array([[...]]) is not a list of lists, but a 2D numpy array. (I'm not sure why the values are returned as a single-column 2D array rather than a 1D array here, though. When I create a primitive DataFrame, a single column's values are returned as a 1D array.)
You can concatenate and flatten them using numpy's built-in functions, e.g.
x = data['isotherm']['Relative_Pressure'].values.flatten()
Edit: This might be caused by the MultiIndex.
The direct way of indexing into one column of your MultiIndex DataFrame is with a tuple covering all column levels, as follows:
data['isotherm'][('Relative_Pressure', '-', 'p/p0')]
which will return a Series object whose .values attribute will give you the expected 1D array. The docs discuss this here
You should be careful using chained indexing like data['isotherm']['Relative_Pressure'] because you won't know if you are dealing with a copy of the data or a view of the data. Please do a SO search of pandas' SettingWithCopyWarning for more details or read the docs here.
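A minimal sketch reproducing the behaviour (column level values assumed from the table above):
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [['Relative_Pressure', 'Volume_STP'], ['-', 'ccm/g'], ['p/p0', '']],
    names=['Name', 'Unit', 'Description'])
df = pd.DataFrame(np.random.rand(6, 2), columns=columns)

df['Relative_Pressure'].values.shape                 # (6, 1): still a DataFrame with one sub-column
df[('Relative_Pressure', '-', 'p/p0')].values.shape  # (6,): a Series, 1D as expected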

Using DataFrame.ISIN() with a series made of series

I am trying to filter the rows of a DataFrame depending on whether the value of a column is part of a series, like so:
pack_data_clean=pack_data_clean[pack_data_clean["actual_box_barcode"].isin(ref_list)==True]
Where pack_data_clean is the dataframe to be filtered, actual_box_barcode the column to check and ref_list the series with the values that will remain in the dataframe.
However, ref_list is not a regular series of values but a series made of series; let's see:
ref_list=["BG1", "BG2", "BG3", "BG4", A1_refs, A2_refs, A3_refs, C1_refs, C2_refs, C3_refs, C4_refs, C5_refs
, E0_refs, E1_refs, E3_refs, E4_refs, E6_refs, E7_refs, E36_refs]
Where A1_refs e.g. is:
A1_refs=["EEC", "ENC", "EZC"]
Right now my code is only filtering the rows that are "BG1", "BG2", "BG3", "BG4" but I would like to filter as well "EEC", "ENC", "EZC" and so on.
Could you please help me create this filter accordingly?
Thanks,
Eduardo
If it's a list of regular objects and series, you need to get it into a consistent series format. This might help (with my_list being your original mixed list):
ref_list = pd.Series(i for i in my_list if type(i) != pd.Series).append(list(j for j in my_list if type(j) == pd.Series))
If it's already a Series of Series, try flattening the series before using it, like so:
ref_list = ref_list.ravel()
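Note that the nested entries in the question (e.g. A1_refs) are actually plain Python lists, and Series.append was removed in pandas 2.0, so a plain-Python flatten may be the simpler route. A minimal sketch, assuming the nested entries are lists like A1_refs above:
import pandas as pd

A1_refs = ['EEC', 'ENC', 'EZC']
ref_list = ['BG1', 'BG2', 'BG3', 'BG4', A1_refs]

flat_refs = []
for item in ref_list:
    if isinstance(item, (list, tuple, pd.Series)):
        flat_refs.extend(item)   # expand anything list-like
    else:
        flat_refs.append(item)   # keep scalars as-is

# flat_refs == ['BG1', 'BG2', 'BG3', 'BG4', 'EEC', 'ENC', 'EZC']
# pack_data_clean = pack_data_clean[pack_data_clean['actual_box_barcode'].isin(flat_refs)]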

Can't Create pandas DataFrame in python (Wrong Shape)

I'm trying to create the following data frame:
new_df = pd.DataFrame(data=percentage_default, columns=df['purpose'].unique())
The variables I'm using are as follows
percentage_default = [0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748]
df['purpose'].unique() returns array(['debt_consolidation', 'credit_card', 'all_other',
'home_improvement', 'small_business', 'major_purchase',
'educational'], dtype=object)
When I try to create this data frame I get the following error:
Shape of passed values is (1, 7), indices imply (7, 7)
To me it seemed like the shape of the values and indices were the same. Could someone explain what I'm missing here?
Thanks!
You're creating a dataframe from a list. Calling pd.DataFrame(your_list) where your_list is a simple homogeneous list will create a single row for every element in that list. For your input:
percentage_default = [0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748]
pandas will create a dataframe like this:
Column
0.15238817285822592
0.11568938193343899
0.16602316602316602
0.17011128775834658
0.2778675282714055
0.11212814645308924
0.20116618075801748
Because of this, your dataframe only has one column. You're trying to pass multiple column names, which is confusing pandas.
If you wish to create a dataframe from a list with multiple columns, you need to nest more lists or tuples inside your original list. Each nested tuple/list will become a row in the dataframe, and each element in the nested tuple/list will become a new column. See this:
percentage_default = [(0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748)] # nested tuple
We have one nested tuple in this list, so our dataframe will have 1 row with n columns, where n is the number of elements in the nested tuple (7). We can then pass your 7 column names:
percentage_default = [(0.15238817285822592,
0.11568938193343899,
0.16602316602316602,
0.17011128775834658,
0.2778675282714055,
0.11212814645308924,
0.20116618075801748)]
col_names = ['debt_consolidation', 'credit_card', 'all_other',
'home_improvement', 'small_business', 'major_purchase',
'educational']
new_df = pd.DataFrame(percentage_default, columns = col_names)
print(new_df)
debt_consolidation credit_card all_other home_improvement \
0 0.152388 0.115689 0.166023 0.170111
small_business major_purchase educational
0 0.277868 0.112128 0.201166
Try rewriting your data in the following way:
percentage_default = {
'debt_consolidation': 0.15238817285822592,
'credit_card': 0.11568938193343899,
...
}
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
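One caveat on that suggestion (my note, not from the original answer): a dict of all-scalar values needs an explicit index, e.g.
percentage_default = {
    'debt_consolidation': 0.15238817285822592,
    'credit_card': 0.11568938193343899,
    # ... remaining purposes
}
new_df = pd.DataFrame(percentage_default, index=[0])  # an all-scalar dict requires an index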

Concatenating two floats into one column in pandas

If I have a data frame which has float columns like below
Pickup_longitude Pickup_latitude
1176807 -73.929321 40.746761
753359 -73.940964 40.679981
1378672 -73.924011 40.824677
302960 -73.845108 40.754841
827558 -73.937073 40.820759
I want to concatenate the lat-long as ("lat","long") in one column.
I wrote the code below for a sample of three rows, but I was wondering whether there is a faster method than converting to string using .astype(str). I initially tried str(), but that also pulls the index values in.
trip_data_sample['lat_long_pickup']=trip_data_sample["Pickup_latitude"][:3].astype(str)+","+\
trip_data_sample["Pickup_longitude"].astype(str)
You could create tuples using a list comprehension and indexing the dataframe:
df['lat_long'] = [', '.join(str(x) for x in y) for y in map(tuple, df[['Pickup_latitude', 'Pickup_longitude']].values)]
df looks like this now:
>>> df
Pickup_longitude Pickup_latitude lat_long
1176807 -73.929321 40.746761 40.746761, -73.929321
753359 -73.940964 40.679981 40.679981, -73.940964
1378672 -73.924011 40.824677 40.824677, -73.924011
302960 -73.845108 40.754841 40.754841, -73.845108
827558 -73.937073 40.820759 40.820759, -73.937073
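An alternative sketch (my own, not from the original answer) that builds the strings directly by zipping the two columns, without the intermediate tuples:
df['lat_long'] = [f'{lat}, {lon}' for lat, lon in
                  zip(df['Pickup_latitude'], df['Pickup_longitude'])]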
