Concatenation in Pandas Python

I am trying to concatenate these three columns, but the code I am using gives me the output below (I changed the format of all the columns to string):
Income_Status_Number  Income_Stability_Number  Product_Takeup_Number  Permutation
                   1                        1                      2        1.012
                   2                        1                      3        2.013
                   1                        1                      1        1.011
This is the code I used:
df['Permutation']=df['Income_Status_Number'].astype(str)+""+df['Income_Stability_Number'].astype(str)+""+df['Product_Takeup_Number'].astype(str)
But I want my output to look like this:
Income_Status_Number  Income_Stability_Number  Product_Takeup_Number  Permutation
                   1                        1                      2          112
                   2                        1                      3          213
                   1                        1                      1          111
Please help.

The issue is that the first column is being treated as a float instead of an int. A simple way to solve this is to sum the values with multipliers, which puts each number in the correct decimal place and lets pandas treat the result as an int:
df['Permutation'] = df['Income_Status_Number']*100 + df['Income_Stability_Number']*10 + df['Product_Takeup_Number']
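A minimal, self-contained sketch with the question's sample data (this assumes every column holds single-digit codes; multi-digit values would need the string approach instead):

import pandas as pd

df = pd.DataFrame({'Income_Status_Number':    [1, 2, 1],
                   'Income_Stability_Number': [1, 1, 1],
                   'Product_Takeup_Number':   [2, 3, 1]})

# Each code lands in its own decimal place, so the result stays an integer
df['Permutation'] = (df['Income_Status_Number'] * 100
                     + df['Income_Stability_Number'] * 10
                     + df['Product_Takeup_Number'])

print(df['Permutation'].tolist())  # [112, 213, 111]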
Another solution is to use astype(int).astype(str) to convert the numbers first, but that solution is somewhat slower:
10000 runs each:
as_type  Total: 9.7106s   Avg: 971059.8162ns
maths    Total: 7.0491s   Avg: 704909.3242ns
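For reference, a sketch of how such a comparison might be timed with timeit (the exact harness behind the numbers above isn't shown, so this is an assumption):

import timeit
import pandas as pd

df = pd.DataFrame({'Income_Status_Number':    [1.0, 2.0, 1.0],  # float, as in the question
                   'Income_Stability_Number': [1, 1, 1],
                   'Product_Takeup_Number':   [2, 3, 1]})

def as_type():
    return (df['Income_Status_Number'].astype(int).astype(str)
            + df['Income_Stability_Number'].astype(int).astype(str)
            + df['Product_Takeup_Number'].astype(int).astype(str))

def maths():
    return (df['Income_Status_Number'] * 100
            + df['Income_Stability_Number'] * 10
            + df['Product_Takeup_Number'])

print('as_type:', timeit.timeit(as_type, number=10000))
print('maths:  ', timeit.timeit(maths, number=10000))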

It looks like your first column is being read as a float right before you convert it to a string.
df['Permutation'] = (df['Income_Status_Number'].astype(int).astype(str)
                     + df['Income_Stability_Number'].astype(int).astype(str)
                     + df['Product_Takeup_Number'].astype(int).astype(str))
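One caveat worth noting: if the column is float because it contains missing values (a common reason), astype(int) will raise on the NaNs. A sketch of one possible workaround using pandas' nullable Int64 dtype (not part of the original answer):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])  # float dtype because of the NaN
# s.astype(int) would raise ValueError here
print(s.astype('Int64').astype(str).tolist())  # ['1', '2', '<NA>']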

Try the following code to add a 'Permutation' column to your data frame, formatted the way you wanted:
df['Permutation'] = df[df.columns[0:]].apply(lambda x: ''.join(x.dropna().astype(int).astype(str)),axis=1)
which gives you the following dataframe:
df
   Income_Status_Number  Income_Stability_Number  Product_Takeup_Number  Permutation
0                     1                        1                      2          112
1                     2                        1                      3          213
2                     1                        1                      1          111

I hope this one will work for you (note the empty-string separator; joining with a space would give '1 1 2' rather than '112'):
df['Permutation'] = df[df.columns].apply(lambda x: ''.join(x.dropna().astype(int).astype(str)), axis=1)

Related

Modifying pandas row value based on its length

I have a column in my pandas dataframe with the following values that represent hours worked in a week.
0                        40
1                40h / week
2    46.25h/week on average
3                        11
I would like to check every row, and if the value is longer than 2 characters, extract only the number of hours from it.
I have tried the following:
df['Hours_per_week'].apply(lambda x: (x.extract('(\d+)') if(len(str(x)) > 2) else x))
However I am getting the AttributeError: 'str' object has no attribute 'extract' error.
It looks like you could require an h after the number:
df['Hours_per_week'].str.extract(r'(\d{2}\.?\d*)h', expand=False)
Output:
0      NaN
1       40
2    46.25
3      NaN
Name: Hours_per_week, dtype: object
Assuming the series data are strings, try this:
df['Hours_per_week'].str.extract(r'(\d+)')
Why not extract a float pattern right away, i.e. \d+\.?\d+?
>>> s = pd.Series(['40', '40h / week', '46.25h/week on average', '11'])
>>> s.str.extract(r"(\d+\.?\d+)")
       0
0     40
1     40
2  46.25
3     11
Two-digit values will still match either way.
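An end-to-end sketch combining the ideas above: extract the number whether or not a unit suffix is present, then convert to float (Hours_clean is a hypothetical column name, not from the question):

import pandas as pd

df = pd.DataFrame({'Hours_per_week': ['40', '40h / week',
                                      '46.25h/week on average', '11']})

# \d+\.?\d* also matches bare integers and single digits
extracted = df['Hours_per_week'].str.extract(r'(\d+\.?\d*)', expand=False)
df['Hours_clean'] = extracted.astype(float)
print(df['Hours_clean'].tolist())  # [40.0, 40.0, 46.25, 11.0]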

Filter dataframe based on matching values from two columns

I have a dataframe like as shown below
cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'Label': [1, 2, 3, 0, 0]})
I would like to filter the dataframe based on the below criteria
cdf['Id'] == cdf['Label']  # the first 3 rows match in both columns of cdf
I tried the below
flag = np.where[cdf['Id'].eq(cdf['Label'])==True,1,0]
final_df = cdf[cdf['flag']==1]
but I got the below error
TypeError: 'function' object is not subscriptable
I expect my output to be like as shown below
   Id  Label
0   1      1
1   2      2
2   3      3
I think you're overthinking this. Just compare the columns:
>>> cdf[cdf['Id'] == cdf['Label']]
   Id  Label
0   1      1
1   2      2
2   3      3
Your particular error, though, comes from calling np.where with square brackets, e.g. np.where[...], which is wrong. You should be using np.where(...) instead, but the above solution is bound to be as fast as it gets ;)
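For completeness, a sketch of the original flag-based approach with the brackets fixed (this reconstructs the OP's intent; it isn't from the answer above):

import numpy as np
import pandas as pd

cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5], 'Label': [1, 2, 3, 0, 0]})

cdf['flag'] = np.where(cdf['Id'].eq(cdf['Label']), 1, 0)  # parentheses, not brackets
final_df = cdf[cdf['flag'] == 1]
print(final_df[['Id', 'Label']])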
You can also check query:
cdf.query('Id == Label')
Out[248]:
   Id  Label
0   1      1
1   2      2
2   3      3

Pandas - How to extract values from a large DF without any 'keys' using another DF's values?

I've got one large matrix as a pandas DF without any 'keys', just plain numbers on top. A smaller version, just to demonstrate the problem here, would look like this input:
M=pd.DataFrame(np.random.rand(4,5))
What I want to accomplish is using another given DF as reference that has a structure like this
N=pd.DataFrame({'A':[2,2,2],'B':[2,3,4]})
...to extract the values from the large DF, where the values of 'A' correspond to the ROW number and the 'B' values to the COLUMN number of the large DF, so that the expected output would look like this:
Large DF:
          0         1         2         3         4
0  0.766275  0.910825  0.378541  0.775416  0.639854
1  0.505877  0.992284  0.720390  0.181921  0.501062
2  0.439243  0.416820  0.285719  0.100537  0.429576
3  0.243298  0.560427  0.162422  0.631224  0.033927
Small DF:
   A  B
0  2  2
1  2  3
2  2  4
Expected output:
   A  B  extracted values
0  2  2          0.285719
1  2  3          0.100537
2  2  4          0.429576
So far I've tried different version of something like this
N['extracted'] = M.iloc[N['A'].astype(int):,N['B'].astype(int)]
..but it keeps failing with an error saying
TypeError: cannot do positional indexing on RangeIndex with these indexers
[0 2
1 2
2 2
Which approach would be best? Is this job better accomplished by converting the DFs into numpy arrays?
Thanks for the help!
I think you want to use the apply function. This goes row by row through your data set.
N['extracted'] = N.apply(lambda row: M.iloc[row['A'], row['B']], axis=1)
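As the question hints, a numpy-based alternative avoids the row-by-row apply entirely; a sketch using fancy indexing (not part of the original answer):

import numpy as np
import pandas as pd

M = pd.DataFrame(np.random.rand(4, 5))
N = pd.DataFrame({'A': [2, 2, 2], 'B': [2, 3, 4]})

# Index the underlying array with row/column position pairs in one shot
N['extracted'] = M.to_numpy()[N['A'].to_numpy(), N['B'].to_numpy()]
print(N)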

Subtract 2 values from one another within 1 column after groupby

I am very sorry if this is a very basic question but unfortunately, I'm failing miserably at figuring out the solution.
I need to subtract the first value within a column (in this case column 8 in my df) from the last value and divide this by a number (e.g. 60) after having applied groupby to my pandas df, to get one value per id. The final output would ideally look something like this:
id
1    1523
2    1644
I have the actual equation which works on its own when applied to the entire column of the df:
(df.iloc[-1,8] - df.iloc[0,8])/60
However I fail to combine this part with the groupby function. Among others, I tried apply, which doesn't work.
df.groupby(['id']).apply((df.iloc[-1,8] - df.iloc[0,8])/60)
I also tried creating a function with the equation part and then doing apply(func), but so far none of my attempts have worked. Any help is much appreciated, thank you!
Demo:
In [204]: df
Out[204]:
   id  val
0   1   12
1   1   13
2   1   19
3   2   20
4   2   30
5   2   40
In [205]: df.groupby(['id'])['val'].agg(lambda x: (x.iloc[-1] - x.iloc[0])/60)
Out[205]:
id
1    0.116667
2    0.333333
Name: val, dtype: float64
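If you prefer the named-function route the OP attempted, a short sketch (span_per_minute is a hypothetical name):

def span_per_minute(s):
    # last minus first value within the group, scaled by 60
    return (s.iloc[-1] - s.iloc[0]) / 60

df.groupby('id')['val'].agg(span_per_minute)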

Pandas Dataframes: how to build them efficiently

I have a file with 1M rows that I'm trying to read into 20 DataFrames. I do not know in advance which row belongs to which DataFrame or how large each DataFrame will be. How can I process this file into DataFrames efficiently? I've tried to do this several different ways. Here is what I currently have:
data = pd.read_csv(r'train.data', sep=" ", header=None)  # Not slow

def collectData(row):
    id = row[0]
    df = dictionary[id]  # Row content determines which dataframe this row belongs to
    next = len(df.index)
    df.loc[next] = row

data.apply(collectData, axis=1)
It's very slow. What am I doing wrong? If I just apply an empty function, my code runs in 30 sec. With the actual function it takes at least 10 minutes, and I'm not sure whether it would even finish.
Here are a few sample rows from the dataset:
1 1 4
1 2 2
1 3 10
1 4 4
The full dataset is available here (if you click on Matlab version)
Your approach is not a vectorized one, because you apply a Python function row by row.
Rather than creating 20 dataframes, make a dictionary containing an index (in range(20)) for each key[0]. Then add this information to your DataFrame:
data['dict'] = data[0].map(dictionary)
Then reorganize:
data2 = data.reset_index().set_index(['dict', 'index'])
data2 looks like:
            0  1   2
dict index
12   0      1  1   4
     1      1  2   2
     2      1  3  10
     3      1  4   4
     4      1  5   2
....
and data2.loc[i] is one of the dataframes you want.
EDIT:
It seems that dictionary is described in train.label.
You can build the dictionary beforehand with:
with open(r'train.label') as f:
    u = f.readlines()
v = [int(x) for x in u]  # len(v) = 11269 = data[0].max()
dictionary = dict(zip(range(1, len(v)+1), v))
Since the full data set fits easily into memory, the following should be fairly quick:
data_split = {i: data[data[0] == i] for i in range(1, 21)}
# to access each dataframe, do a dictionary lookup, i.e.
data_split[2].head()
     0   1  2
769  2  12  4
770  2  16  2
771  2  23  4
772  2  27  2
773  2  29  6
You may also want to reset the indices or copy the data frame when you're slicing it into smaller data frames.
Additional reading:
copy
reset_index
view-vs-copy
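A sketch of the slicing with both of those suggestions applied, under the same 20-group assumption:

# copy() plus reset_index(drop=True) yields fresh frames, sidestepping view-vs-copy surprises
data_split = {i: data[data[0] == i].copy().reset_index(drop=True)
              for i in range(1, 21)}
data_split[2].head()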
If you want to build them efficiently, I think you need some good raw materials:
wood
cement
Both are robust and durable.
Try to avoid using hay, as the dataframe can be blown down with a little wind.
Hope that helps
