pandas unravel list to columns - python

I am quite new to pandas, and I have a list that looks like this:
something=[10,20,30,40,50]
When I convert it to a pandas DataFrame, however, the entire list ends up as one element:
dataset = pd.DataFrame({'something': something,
                        'something2': something2},
                       columns=['something', 'something2'])
and I get:
something
0 [10,20,30,40,50]
What I would like is:
0 1 2 3 4
0 10 20 30 40 50
i.e. list elements as individual columns.

You can do this using pd.DataFrame.from_records:
In [323]: df = pd.DataFrame.from_records([something])
In [324]: df
Out[324]:
0 1 2 3 4
0 10 20 30 40 50
For multiple lists, you can simply do this:
In [337]: something2 = [101,201,301,401,501]
In [338]: df = pd.DataFrame.from_records([something, something2])
In [339]: df
Out[339]:
0 1 2 3 4
0 10 20 30 40 50
1 101 201 301 401 501
EDIT: after OP's comment.
If you want all the lists combined into a single row spanning multiple columns, you can try this:
In [349]: something
Out[349]: [10, 20, 30, 40, 50]
In [350]: something2
Out[350]: [101, 201, 301, 401, 501]
In [351]: something.extend(something2)
In [353]: df = pd.DataFrame.from_records([something])
In [354]: df
Out[354]:
0 1 2 3 4 5 6 7 8 9
0 10 20 30 40 50 101 201 301 401 501
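If you'd rather not mutate something in place, concatenating the lists when building the frame should give the same single row:
# list concatenation leaves both original lists untouched
df = pd.DataFrame.from_records([something + something2])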

pandas DataFrame.from_dict could help:
something=[10,20,30,40,50]
something2 = [25,30,22,1,5]
data = {'something':something,'something2':something2}
pd.DataFrame.from_dict(data,orient='index')
0 1 2 3 4
something 10 20 30 40 50
something2 25 30 22 1 5
If you don't care about the index labels and want integers instead, reset_index should suffice:
pd.DataFrame.from_dict(data,orient='index').reset_index(drop=True)
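That should give:
    0   1   2   3   4
0  10  20  30  40  50
1  25  30  22   1   5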

If you pass a dictionary to DataFrame, pandas treats each key as a column name by default, so you don't need to specify column names again unless you want different ones.
I tried the following example:
import pandas as pd
something1=[10,20,30,40,50]
something2=[101,201,301,401,501]
pd.DataFrame([something1,something2])
Output
0 1 2 3 4
0 10 20 30 40 50
1 101 201 301 401 501
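For contrast, passing the same lists as a dict turns each list into a column, one element per row:
pd.DataFrame({'something1': something1, 'something2': something2})

   something1  something2
0          10         101
1          20         201
2          30         301
3          40         401
4          50         501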
Let me know whether this works for you.

Related

How to apply a function to dataframe with data clusters/neighborhoods separately?

Consider the following table. The first column, Data1, contains values that cluster into groups: some around 100 and some around 200. How can I apply a function that deals with each cluster separately, perhaps by excluding data points whose values are too far apart to count as neighbors?
Data1 Value1
99 1
100 2
101 3
102 4
199 5
200 6
201 7
... ...
For example, I want to generate a third column called "Result1" that sums each Data1 cluster's corresponding Value1 entries. The result would look something like this, where 1+2+3+4=10 and 5+6+7=18:
Data1 Value1 Result1
99 1 10
100 2 10
101 3 10
102 4 10
199 5 18
200 6 18
201 7 18
... ... ...
Try merge_asof:
data = [100, 200]
labels = pd.merge_asof(df, pd.DataFrame({'label': data}),
                       left_on='Data1', right_on='label',
                       direction='nearest')['label']
df['Result1'] = df.groupby(labels)['Value1'].transform('sum')
Output:
Data1 Value1 Result1
0 99 1 10
1 100 2 10
2 101 3 10
3 102 4 10
4 199 5 18
5 200 6 18
6 201 7 18
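If you want to reproduce this, a minimal frame built from the question's data (df is assumed to hold these columns):
import pandas as pd

df = pd.DataFrame({'Data1': [99, 100, 101, 102, 199, 200, 201],
                   'Value1': [1, 2, 3, 4, 5, 6, 7]})
# note: merge_asof requires both frames to be sorted on the join keys,
# which holds here since Data1 and the label list are ascending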
In your case, a simple mask ought to do.
mask = df["Data1"] < 150
df.loc[mask, "Result1"] = df.loc[mask, "Value1"].sum()
df.loc[~mask, "Result1"] = df.loc[~mask, "Value1"].sum()
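If the cluster centers aren't known in advance, a gap-based grouping avoids hard-coding them. A sketch, assuming Data1 is sorted and clusters are separated by jumps larger than some threshold (50 here is an assumption):
# start a new group wherever the gap to the previous value exceeds the threshold
groups = (df['Data1'].diff() > 50).cumsum()
df['Result1'] = df.groupby(groups)['Value1'].transform('sum')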

Is there a way to avoid while loops using pandas in order to speed up my code?

I'm writing code to merge several dataframes together using pandas.
Here is my first table :
Index Values Intensity
1 11 98
2 12 855
3 13 500
4 24 140
and here is the second one:
Index Values Intensity
1 21 1000
2 11 2000
3 24 0.55
4 25 500
With these two dfs, I concatenate and drop_duplicates the Values columns, which gives me the following df:
Index Values Intensity_df1 Intensity_df2
1 11 0 0
2 12 0 0
3 13 0 0
4 24 0 0
5 21 0 0
6 25 0 0
I would like to recover the intensity of each value in each DataFrame. For this purpose, I'm iterating through each line of each df, which is very inefficient. Here is the code I use:
m = 0
while m < len(num_df):
    n = 0
    while n < len(df3):
        temp_intens_abs = df[m]['Intensity'][df3['Values'][n] == df[m]['Values']]
        if temp_intens_abs.empty:
            merged.at[n, "Intensity_df%s" % df[m]] = 0
        else:
            merged.at[n, "Intensity_df%s" % df[m]] = pandas.to_numeric(temp_intens_abs, errors='coerce')
        n = n + 1
    m = m + 1
The resulting df3 looks like this at the end:
Index Values Intensity_df1 Intensity_df2
1 11 98 2000
2 12 855 0
3 13 500 0
4 24 140 0.55
5 21 0 1000
6 25 0 500
My question is: is there a way to directly recover "present" values in a df by comparing two columns using pandas? I've tried several solutions using numpy but without success. Thanks in advance for your help.
You can try joining these dataframes: df3 = df1.merge(df2, on="Values")
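Note that this inner merge only keeps Values present in both frames. To match the desired output above, which also keeps values found in just one frame, an outer merge with a zero fill should work; a sketch with column names taken from the example:
import pandas as pd

df1 = pd.DataFrame({'Values': [11, 12, 13, 24],
                    'Intensity': [98, 855, 500, 140]})
df2 = pd.DataFrame({'Values': [21, 11, 24, 25],
                    'Intensity': [1000, 2000, 0.55, 500]})

# how='outer' keeps every value; suffixes mark which frame each intensity came from
merged = df1.merge(df2, on='Values', how='outer', suffixes=('_df1', '_df2')).fillna(0)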

Python: Memory efficient, quick lookup in python for 100 million pairs of data?

This is my first time asking a question on here, so apologies if I am doing something wrong.
I am looking to create some sort of dataframe/dict/list where I can check if the ID in one column has seen a specific value in another column before.
For example for one pandas dataframe like this (90 million rows):
ID Another_ID
1 10
1 20
2 50
3 10
3 20
4 30
And another like this (10 million rows):
ID Another_ID
1 30
2 30
2 50
2 20
4 30
5 70
I want to end up with a third column that is like this:
ID Another_ID seen_before
1 30 0
2 30 0
2 50 1
2 20 0
4 30 1
5 70 0
I am looking for a memory efficient but quick way to do this, any ideas? Thanks!
Merge is a good idea; here you want to merge on both columns:
df1['seen_before'] = 1
df2.merge(df1, on=['ID', 'Another_ID'], how='left')
Output:
ID Another_ID seen_before
0 1 30 NaN
1 2 30 NaN
2 2 50 1.0
3 2 20 NaN
4 4 30 1.0
5 5 70 NaN
Note: this assumes that df1 has no duplicates. If you are not sure about this, replace df1 with df1.drop_duplicates() in merge.
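A sketch that turns the NaN/1.0 column into the 0/1 integers shown in the desired output, under the same caveat about duplicates:
out = df2.merge(df1.drop_duplicates(), on=['ID', 'Another_ID'], how='left')
out['seen_before'] = out['seen_before'].fillna(0).astype(int)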
Note: the how argument matters in merge; see the comments in the code as well. np.where is quite efficient, but I have never worked with 100 million rows, so I'd ask the OP to let us know how it goes.
Code:
import pandas as pd
import numpy as np
left = pd.DataFrame(data={'ID': [1, 1, 2, 3, 3, 4], 'Another_ID': [10, 20, 50, 10, 20, 30]})
right = pd.DataFrame(data={'ID': [1, 2, 2, 2, 4, 5], 'Another_ID': [30, 30, 50, 20, 30, 70]})
print(left, '\n', right)
res = pd.merge(left, right, how='right', on='ID')
# Another_ID_x showed up as float despite dtype as int on both right and left
res.fillna(value=0, inplace=True) # required for astype to work in next step
res['Another_ID_x'] = res['Another_ID_x'].astype(int)
res['Another_ID_x'] = np.where(res.Another_ID_x == res.Another_ID_y, 1, 0 )
res.rename(columns={'Another_ID_x': 'seen_before'}, inplace=True)
res.drop_duplicates(inplace=True)
print(res)
Output:
   ID  Another_ID
0   1          10
1   1          20
2   2          50
3   3          10
4   3          20
5   4          30
   ID  Another_ID
0   1          30
1   2          30
2   2          50
3   2          20
4   4          30
5   5          70
   ID  seen_before  Another_ID_y
0   1            0            30
2   2            0            30
3   2            1            50
4   2            0            20
5   4            1            30
6   5            0            70
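Given the row counts in the question, an index-membership test is another option that avoids materializing a merged frame. A sketch, assuming the 90-million-row frame is df1 and the 10-million-row frame is df2:
# build (ID, Another_ID) pair indexes from each frame and test membership directly
seen = pd.MultiIndex.from_frame(df1[['ID', 'Another_ID']])
pairs = pd.MultiIndex.from_frame(df2[['ID', 'Another_ID']])
df2['seen_before'] = pairs.isin(seen).astype(int)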
Update:
Thanks to everybody for all the replies on my first post!
#Quang Hong's solution worked amazingly well in this case, as there were so many rows.
The total time it took on my laptop was 36.6s.

Extend and fill a Pandas DataFrame to match another

I have two Pandas DataFrames A and B.
They have an identical index (weekly dates) up to a point: the series ends at the beginning of the year for A and goes on for a number of observations in frame B. I need to extend data frame A to have the same index as frame B, filling each column with its own last value.
Thank you in advance.
Tikhon
EDIT: thank you for the advice on the question. What I need is for dfA_before to look at dfB and become dfA_after:
print(dfA_before)
a b
0 10 100
1 20 200
2 30 300
print(dfB)
a b
0 11 111
1 22 222
2 33 333
3 44 444
4 55 555
print(dfA_after)
a b
0 10 100
1 20 200
2 30 300
3 30 300
4 30 300
This should work:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [10, 20, 30], 'b': [100, 200, 300]})
df2 = pd.DataFrame({'a': [11, 22, 33, 44, 55], 'b': [111, 222, 333, 444, 555]})

# solution: repeat the last row of df1 enough times to match df2's length
last = df1.iloc[-1].to_numpy()
df3 = pd.DataFrame(np.tile(last, (len(df2) - len(df1), 1)),
                   columns=df1.columns)
# note: DataFrame.append was removed in pandas 2.0; pd.concat([df1, df3], ignore_index=True) is the modern equivalent
df4 = df1.append(df3, ignore_index=True)

# method 2: grow df1 one row at a time
for _ in range(len(df2) - len(df1)):
    df1.loc[len(df1)] = df1.loc[len(df1) - 1]

# method 3: append a copy of the last row repeatedly
for _ in range(df2.shape[0] - df1.shape[0]):
    df1 = df1.append(df1.loc[len(df1) - 1], ignore_index=True)
# result
a b
0 10 100
1 20 200
2 30 300
3 30 300
4 30 300
Probably very inefficient - I am a beginner:
dfA_New = dfB.copy()
dfA_New.loc[:] = 0                            # placeholder values
dfA_New.loc[:] = dfA.loc[:]                   # aligned assignment; rows missing from dfA become NaN
dfA_New.fillna(method='ffill', inplace=True)  # carry dfA's last values forward
dfA = dfA_New
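For comparison, reindex plus a forward fill expresses the same idea directly; a sketch using the question's frames:
# align dfA to dfB's index, then carry dfA's last row forward
dfA_after = dfA.reindex(dfB.index).ffill()
# the temporary NaNs upcast integer columns to float; cast back if needed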

Insert values in pandas dataframe using list values corresponding to column

I have a dataframe:
id val size
1 100
2 500
3 300
I have a nested list L = [[300,20],[100,45],[500,12]].
I want to fill the size column with the second element of the sublist whose first element matches val.
i.e. my final dataframe should look like:
id val size
1 100 45
2 500 12
3 300 20
Another way, using merge:
In [1417]: df.merge(pd.DataFrame(L, columns=['val', 'size']), on='val')
Out[1417]:
id val size
0 1 100 45
1 2 500 12
2 3 300 20
First initialise a mapping:
In [132]: mapping = dict([[300,20],[100,45],[500,12]]); mapping
Out[132]: {100: 45, 300: 20, 500: 12}
Now, you can use either df.replace or df.map.
Option 1
Using df.replace:
In [137]: df['size'] = df.val.replace(mapping)
In [138]: df
Out[138]:
id val size
0 1 100 45
1 2 500 12
2 3 300 20
Option 2
Using df.map:
In [140]: df['size'] = df.val.map(mapping)
In [141]: df
Out[141]:
id val size
0 1 100 45
1 2 500 12
2 3 300 20
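The practical difference between the two shows up for values missing from the mapping: map yields NaN for them, while replace leaves them untouched. A quick sketch:
s = pd.Series([100, 999])  # 999 is not a key in the mapping
s.map(mapping)             # -> 45.0, NaN
s.replace(mapping)         # -> 45, 999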
