Sequential names of columns in a DataFrame - python

I work in Python.
I have a large DataFrame df1 (25000 x 484) where, except for the first 4 columns, all the others can be divided into groups of 4 and have sequential numbers.
To be clear, not considering the first 4 columns, this is how the column headers look:
comp_type_1 / tag_1 / length_1 / value_1 / comp_type_2 / tag_2 / length_2 / value_2 / comp_type_3 / tag_3 / length_3 / value_3 ...
I would like to create df2 such that it contains only the columns length_i, where i goes from 1 to the last number (120). Is there a way to achieve that, considering that part of the column name stays the same and only the number changes?
Thanks!

If I understand the question correctly, this is what you're looking for.
# setup
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(3, 12)),
                  columns=["comp_type_1", "tag_1", "length_1", "value_1",
                           "comp_type_2", "tag_2", "length_2", "value_2",
                           "comp_type_3", "tag_3", "length_3", "value_3"])

# column filter: keep only the columns whose name contains 'length'
df2 = df[[col for col in df.columns if 'length' in col]]
Output (df2):
   length_1  length_2  length_3
0        91        81        23
1        42        92        50
2        61        79        76

Given a dataframe df, you can filter on the columns:
df = df.filter(regex="length")
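If other column names could also contain the substring "length", an anchored pattern is safer (a sketch, assuming the length_<number> naming scheme from the question):
# keep only columns named exactly 'length_<number>'
df2 = df.filter(regex=r"^length_\d+$")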

Related

How to call a column by combining a string and another variable in a python dataframe?

Imagine I have a dataframe with these variables and values:
ID  Weight  LR Weight  UR Weight  Age  LS Age  US Age  Height  LS Height  US Height
1   63      50         80         20   18      21      165     160        175
2   75      50         80         22   18      21      172     160        170
3   49      45         80         17   18      21      180     160        180
I want to create the following additional variables:
ID  Flag_Weight  Flag_Age  Flag_Height
1   1            1         1
2   1            0         0
3   1            0         1
These flags symbolize that the main variable values (e.g. Weight, Age and Height) are between the corresponding lower and upper limits, whose column names may start with different two-letter prefixes (in this dataframe I gave four examples: LR, UR, LS, US, but in my real dataframe I have more), and whose limit values sometimes differ from ID to ID.
Can you help me create these flags, please?
Thank you in advance.
You can reshape using a temporary MultiIndex:
(df.set_index('ID')
   # split each column name into (limit prefix, measure), e.g. 'LR Weight' -> ('L', 'Weight')
   .pipe(lambda d: d.set_axis(pd.MultiIndex.from_frame(
       d.columns.str.extract(r'(^[LU]?).*?\s*(\S+)$')),
       axis=1))
   .stack()  # one row per (ID, measure); the prefixes '', 'L', 'U' become columns
   .assign(flag=lambda d: d[''].between(d['L'], d['U']).astype(int))
   ['flag'].unstack().add_prefix('Flag_').reset_index()
)
Output:
   ID  Flag_Age  Flag_Height  Flag_Weight
0   1         1            1            1
1   2         0            0            1
2   3         0            1            1
So, if I understood correctly, you want to add columns with these new variables. The simplest solution for this would be df.insert().
You could make it something like this:
df.insert(position at which to insert the new column, name of the new column, values of the new column)
You can make up the new values in pretty much every way you can imagine: just copying a column, or simple mathematical operations like +, -, *, /, can be performed. But you can also apply a whole function, which returns the flags based on your conditions, as the values of the new column.
If the new columns can just be appended, you can even just make up a new column like this:
df['new column name'] = any values you want
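A minimal sketch of both approaches, using the Weight columns from the question (the between() flag logic is an assumption about how you would compute the values):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Weight': [63, 75, 49],
                   'LR Weight': [50, 50, 45],
                   'UR Weight': [80, 80, 80]})

flags = df['Weight'].between(df['LR Weight'], df['UR Weight']).astype(int)

# insert at a specific position (here: position 4, right after 'UR Weight')
df.insert(4, 'Flag_Weight', flags)

# or simply append the new column at the end
df['Flag_Weight_appended'] = flags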
I hope this helped.

Changing values in dataframe based on cell and column name

I have a dataframe
df = pd.DataFrame([[0, 1, 2]], columns=['3m3a', '1z6n', '11p66d'])
Now I would like to apply 2 * value * (the last number in the column name), e.g. for the last column: 2 * 2 * 66.
df.apply(lambda x: 2*x) for step 1.
Step 2 is the hardest part.
I can make a new dataframe like df2 = df.stack().reset_index().apply(lambda x: x[re.search('[a-zA-Z]+', x).end():]) and then multiply by 2.
What's a more pythonic way?
For DataFrame:
   3m3a  1z6n  11p66d
0     0     1       2
You can use .columns.str.extract and then DataFrame.multiply:
# grab the trailing number from each column name, e.g. '11p66d' -> 66
vals = df.columns.str.extract(r"(\d+)[a-z]*?$").T.astype(int)
df = df.multiply(2 * vals.values, axis=1)
print(df)
Prints:
   3m3a  1z6n  11p66d
0     0    12     264
Late to the party, and having found almost the same answer, but using a negative look-behind regex:
newdf = df.multiply(
    2 * df.columns.str.extract(r'.*(?<!\d)(\d+)\D*').astype(int).values.ravel(),
    axis=1)
>>> newdf
   3m3a  1z6n  11p66d
0     0    12     264
Thank you, both work.
What if I would like to split the column name into 2 parts, one up to and including the first letter, and the second the part after?
df.columns.str.split(r"(\d+\D+)", n=1, expand=True)
works, but gives me 3 parts, with the first one blank.
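For the follow-up, a sketch that captures the two parts directly instead of splitting, so no blank leading field appears (assuming names of the form digits + letter + rest, like '11p66d'):
# first group: digits up to and including the first letter; second group: the rest
parts = df.columns.str.extract(r'^(\d+\D)(.*)$')
print(parts)
#      0    1
# 0   3m   3a
# 1   1z   6n
# 2  11p  66d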

Filtering rows on multiple string conditions at the same column

I want to filter a dataframe on multiple conditions. Let's say I have one column called 'detail'; I want to get a dataframe where the 'detail' column values match the following:
detail = unidecode.unidecode(str(row['detail']).lower())
So now I have all detail rows unidecoded and lowercased, and I then want to extract the rows that start with some substring like:
detail.startswith('bomb')
And finally also take the rows where another integer column equals 100.
I tried to do this but obviously it doesn't work:
llista_dfs['df_bombes'] = df_filtratge[df_filtratge['detail'].str.lower().startswith('bomb') or df_filtratge['family']==100]
The line above is what I would like to execute, but I'm not sure of the syntax to achieve this in a single line of code (if that's possible).
That's an example of what the code should do:
Initial table:
detail family
0 bòmba 90
1 boMbá 87
2 someword 100
3 someotherword 65
4 Bombá 90
Result table:
detail family
0 bòmba 90
1 boMbá 87
2 someword 100
4 Bombá 90
Actually #user3483203's comment is the right solution, as to filter in pandas you use & and | instead of and and or (a corrected version of the asker's line is shown after the code below). In any case, if you want to get rid of unidecode, you might use this solution:
import pandas as pd

txt = """0 bòmba 90
1 boMbá 87
2 someword 100
3 someotherword 65
4 Bombá 90"""

# rebuild the example frame from the raw text (drop the leading index column)
df = [list(filter(lambda x: x != '', t.split(' ')))[1:]
      for t in txt.split("\n")]
df = pd.DataFrame(df, columns=["details", "family"])
df["family"] = df["family"].astype(int)

# strip accents without unidecode: NFKD-decompose, drop the combining marks, lowercase
cond1 = (df["details"].str.normalize('NFKD')
                      .str.encode('ascii', errors='ignore')
                      .str.decode('utf-8')
                      .str.lower()
                      .str.startswith('bomba'))
cond2 = df["family"] == 100
df[cond1 | cond2]
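And here is the asker's original line fixed as the comment suggests (a sketch reusing the asker's names llista_dfs and df_filtratge; note the .str accessor for startswith and | instead of or; accents are not stripped here, so the normalization step above is still needed for values like 'bòmba'):
mask = (df_filtratge['detail'].str.lower().str.startswith('bomb')
        | (df_filtratge['family'] == 100))
llista_dfs['df_bombes'] = df_filtratge[mask]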

Pandas: Calculate remaining time in grouping

I have a requirement to sort a table by date, starting from the oldest. The total field is created by grouping the name and kind fields and applying sum. Now, for each row, I need to calculate the remaining time within the same name-kind grouping.
The csv looks like this:
date      name  kind  duration  total  remaining
1-1-2017  a     1     10        100    ? should be 90
2-1-2017  b     1     5         35     ? should be 30
3-1-2017  a     2     3         50     ? should be 47
4-1-2017  b     2     1         25     ? should be 24
5-1-2017  a     1     8         100    ? should be 82
6-1-2017  b     1     2         35     ? should be 28
7-1-2017  a     2     3         50     ? should be 44
8-1-2017  b     2     6         25     ? should be 18
...
My question is how do I calculate the remaining value while having the DataFrame grouped by name and kind?
My initial approach was to shift the column and add the values from duration to each other, like this:
df['temp'] = df.groupby(['name', 'kind'])['duration'].apply(lambda x: x.shift() + x)
and then:
df['duration'] = df.apply(lambda x: x['total'] - x['temp'], axis=1)
But it did not work as expected.
Is there a clean way to do it, or is using iloc, ix, or loc somehow the way to go?
Thanks.
You could do something like:
df["cumsum"] = df.groupby(['name', 'kind'])["duration"].cumsum()
df["remaining"] = df["total"] - df["cumsum"]
You may need to be careful about resetting the index.
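A quick check of this approach on the question's data (a sketch; the date column is omitted for brevity):
import pandas as pd

df = pd.DataFrame({
    'name':     ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'kind':     [1, 1, 2, 2, 1, 1, 2, 2],
    'duration': [10, 5, 3, 1, 8, 2, 3, 6],
    'total':    [100, 35, 50, 25, 100, 35, 50, 25],
})

# remaining = total minus the running sum of duration within each name-kind group
df["remaining"] = df["total"] - df.groupby(['name', 'kind'])["duration"].cumsum()
print(df["remaining"].tolist())  # [90, 30, 47, 24, 82, 28, 44, 18]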

Pandas: add crosstab totals

How can I add to my crosstab an additional row and an additional column for the totals?
df = pd.DataFrame({"A": np.random.randint(0,2,100), "B" : np.random.randint(0,2,100)})
ct = pd.crosstab(new.A, new.B)
ct
I thought I would add the new column (obtained by summing over the rows) by
ct["Total"] = ct.0 + ct.1
but this does not work.
In fact, pandas.crosstab already provides an option, margins, which does exactly what you want.
> df = pd.DataFrame({"A": np.random.randint(0,2,100), "B": np.random.randint(0,2,100)})
> pd.crosstab(df.A, df.B, margins=True)
B     0   1  All
A
0    26  21   47
1    25  28   53
All  51  49  100
Basically, by setting margins=True, the resulting frequency table will add an "All" column and an "All" row that compute the subtotals.
This is because 'attribute-like' column access does not work with integer column names. Using the standard indexing:
In [122]: ct["Total"] = ct[0] + ct[1]
In [123]: ct
Out[123]:
B   0   1  Total
A
0  26  24     50
1  30  20     50
See the warnings at the end of this section in the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
When you want to work with the rows, you can use .loc:
In [126]: ct.loc["Total"] = ct.loc[0] + ct.loc[1]
In this case ct.loc["Total"] is equivalent to ct.loc["Total", :]
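For the example above, applying both steps would leave ct looking like this (totals derived from the frame shown):
B       0   1  Total
A
0      26  24     50
1      30  20     50
Total  56  44    100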
You should use margins=True together with crosstab for this. That should do the job!
