Python - Column in CSV file contains multiple delimiters and results

I have quite a large CSV file with several plain columns (no embedded delimiters) and one column, "X&Y", whose values use three delimiters.
The main delimiter is ";", which separates days of results.
The second delimiter is ":", which separates the results within a day (I only need 2 of the possible 6 results).
The third delimiter is "/", which separates the result date from the calendar value of the result.
Because the "X&Y" column holds many delimited results and there are a lot of rows, I want to avoid looping over it as much as possible.
Col1  Col2  X&Y
A     B     20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6
AA    BB    20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66
I want to see:
Col1  Col2  Date      CalendarValue  X   Y
A     B     20200331  1D             1   2
A     B     20200401  2D             3   4
A     B     20200402  3D             5   6
AA    BB    20210330  1Y             11  22
AA    BB    20220330  2Y             33  44
AA    BB    20230330  3Y             55  66
import pandas as pd
df = pd.DataFrame({
    'Col1': ['A', 'AA'],
    'Col2': ['B', 'BB'],
    'Col3': ['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6',
             '20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66'],
})

Here is a solution you can try: split on the ";" delimiter, then explode to transform each day's result into its own row, extract the fields with a regex, and finally concat the frames to get the resultant frame.
import pandas as pd
import re

df = pd.DataFrame({'Col1': ['A', 'AA'], 'Col2': ['B', 'BB'],
                   'Col3': ['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6',
                            '20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66']})

# One list element per day
df['Col3'] = df['Col3'].str.split(";")

# Extract the named fields from each day's string
extract_ = re.compile(r"(?P<Date>\w+)/(?P<CalendarValue>\w+):+(?P<X>.+):(?P<Y>.+)")

pd.concat([
    df.drop(columns='Col3'),
    df['Col3'].explode().str.extract(extract_, expand=True)
], axis=1)
Out[*]:
  Col1 Col2      Date CalendarValue   X   Y
0    A    B  20200331            1D   1   2
0    A    B  20200401            2D   3   4
0    A    B  20200402            3D   5   6
1   AA   BB  20210330            1Y  11  22
1   AA   BB  20220330            2Y  33  44
1   AA   BB  20230330            3Y  55  66
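If you prefer to avoid extract, a rough sketch of the same idea using explode plus str.split (this assumes, as in the sample data, that the four unused result slots between the calendar value and X are always empty):
import pandas as pd
import re

df = pd.DataFrame({'Col1': ['A', 'AA'], 'Col2': ['B', 'BB'],
                   'Col3': ['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6',
                            '20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66']})

# One row per day's result
long = df.assign(Col3=df['Col3'].str.split(';')).explode('Col3')

# '/' separates the date; any run of ':' separates the remaining fields
parts = long['Col3'].str.split(re.compile(r'/|:+'), expand=True)
parts.columns = ['Date', 'CalendarValue', 'X', 'Y']

print(pd.concat([long.drop(columns='Col3'), parts], axis=1))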

Related

Dividing all observations in a data frame by a particular observation of a column with Python

I have a df with more than 300 rows and more than 4000 columns. A sample of the df looks like this:
AB  BC  DA  DC  FF
40  50   4  10  60
10  20  10   5  20
I want to create another df by dividing all the observations by the cells of column DC, so that I will have a df that looks like this:
AB  BC  DA   DC   FF
4   5   0.4  1    6
1   2   1    0.2  2
An idea that came to my mind was iterrows, but I could not find my way around it.
Any better suggestion on how to do this?
This should get you what you want:
for column in df.columns:
    if column == 'DC':
        pass  # do nothing and skip the divisor column itself
    else:
        df[column] = df[column] / df['DC']
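For a frame this wide, a vectorized sketch that avoids the Python loop: df.div with axis=0 divides every column by the DC value in the same row (note that DC itself then becomes 1):
import pandas as pd

df = pd.DataFrame({'AB': [40, 10], 'BC': [50, 20], 'DA': [4, 10],
                   'DC': [10, 5], 'FF': [60, 20]})

# Divide each row by its own DC value
result = df.div(df['DC'], axis=0)
print(result)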

How to find a string only if one condition is met, and drop the value from the cell in a dataframe but not the row itself?

I have a df that looks like this:
df:
col1  col2
28    24 and 24
11    .1 for .1
3     43
I want to create logic that I can apply to a list of columns (not all the columns in the dataframe): if a cell contains both an integer and a string, the value in that cell gets replaced with an empty string, but the row itself is not dropped.
The new df should look like this:
col1  col2
28
11
3     43
how would I do this?
Using to_numeric: values that are not purely numeric are coerced to NaN and then filled with an empty string.
l = ['col2']
df[l] = df[l].apply(pd.to_numeric, errors='coerce').fillna('')
df
Out[32]:
   col1 col2
0    28
1    11
2     3   43
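If you would rather keep the surviving values exactly as they appear (to_numeric converts them to numbers, e.g. '43' becomes 43.0), here is a sketch using a mask instead; the letter-matching regex is an assumption based on the sample data:
import pandas as pd

df = pd.DataFrame({'col1': [28, 11, 3], 'col2': ['24 and 24', '.1 for .1', '43']})
cols = ['col2']
for c in cols:
    # Blank out any cell whose text contains a letter
    mask = df[c].astype(str).str.contains(r'[A-Za-z]', na=False)
    df.loc[mask, c] = ''
print(df)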

update one dataframe with data from another, for one specific column - Pandas and Python

I'm trying to update one dataframe with data from another, for one specific column called 'Data'. Both dataframes have a unique ID column called 'ID', and both have a 'Data' column. I want the 'Data' values from df2 to overwrite the corresponding entries in df1's 'Data', but only for the rows that exist in df1. Where there is no corresponding 'ID' in df2, the df1 entry should remain.
import pandas as pd
from io import StringIO
data1 = '''\
ID Data Data1
1 AA BB
2 AB BF
3 AC BK
4 AD BL'''
data2 = '''\
ID Data
1 AAB
3 AAL
4 MNL
5 AAP
6 MNX
8 DLP
9 POW'''
df1 = pd.read_csv(StringIO(data1), sep=r'\s+')
df2 = pd.read_csv(StringIO(data2), sep=r'\s+')
Expected output (new df3):
ID Data Data1
1 AAB BB
2 AB BF
3 AAL BK
4 MNL BL
df2 is a master list of values which never changes and has thousands of entries, whereas df1 sometimes only has a few hundred entries.
I have looked at pd.merge and combine_first, however I can't seem to get the right combination.
df3 = pd.merge(df1, df2, on='ID', how='left')
Any help much appreciated.
Create new dataframe
Here is one way making use of update:
df3 = df1[:].set_index('ID')
df3['Data'].update(df2.set_index('ID')['Data'])
df3.reset_index(inplace=True)
Or we could use maps/dicts and reassign (Python >= 3.5)
m = {**df1.set_index('ID')['Data'], **df2.set_index('ID')['Data']}
df3 = df1[:].assign(Data=df1['ID'].map(m))
Python < 3.5:
m = df1.set_index('ID')['Data']
m.update(df2.set_index('ID')['Data'])
df3 = df1[:].assign(Data=df1['ID'].map(m))
Update df1
Are you open to updating df1 in place? In that case:
df1.update(df2)
Or if ID not index:
m = df2.set_index('ID')['Data']
df1.loc[df1['ID'].isin(df2['ID']), 'Data'] = df1['ID'].map(m)
Or:
df1.set_index('ID',inplace=True)
df1.update(df2.set_index('ID'))
df1.reset_index(inplace=True)
Note: There might be something that makes more sense :)
Full example:
import pandas as pd
from io import StringIO
data1 = '''\
ID Data Data1
1 AA BB
2 AB BF
3 AC BK
4 AD BL'''
data2 = '''\
ID Data
1 AAB
3 AAL
4 MNL
5 AAP
6 MNX
8 DLP
9 POW'''
df1 = pd.read_csv(StringIO(data1), sep=r'\s+')
df2 = pd.read_csv(StringIO(data2), sep=r'\s+')
m = {**df1.set_index('ID')['Data'], **df2.set_index('ID')['Data']}
df3 = df1[:].assign(Data=df1['ID'].map(m))
print(df3)
Returns:
   ID Data Data1
0   1  AAB    BB
1   2   AB    BF
2   3  AAL    BK
3   4  MNL    BL
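Since you already tried pd.merge, here is a sketch along those lines as well: left-join df2's 'Data' under a suffix, then fall back to df1's value wherever df2 has no matching ID:
df3 = df1.merge(df2, on='ID', how='left', suffixes=('_old', ''))
df3['Data'] = df3['Data'].fillna(df3['Data_old'])
df3 = df3.drop(columns='Data_old')[['ID', 'Data', 'Data1']]
print(df3)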

Stacking columns one below another when the column names are the same

I have a huge data set in a pandas data frame. It looks something like this
df = pd.DataFrame([[1,2,3,4],[31,14,13,11],[115,613,1313,1]], columns=['c1','c1','c2','c2'])
Here the first two columns have the same name, so they should be concatenated into a single column so that the values sit one below another. The dataframe should look something like this:
df1 = pd.DataFrame([[1,3],[31,13],[115,1313],[2,4],[14,11],[613,1]], columns=['c1','c2'])
Note: My original dataframe has many columns, so I cannot use a simple concat call to stack them. I also tried the stack function, apart from concat. What can I do?
Use groupby + cumcount to create a pd.MultiIndex. Reassign the columns with the new pd.MultiIndex and stack:
df = pd.DataFrame(
    [[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
    columns=['c1', 'c1', 'c2', 'c2'])
df1 = df.copy()
# The second index level numbers each duplicate of a column name: 0, 1, ...
df1.columns = [df.columns, df.columns.to_series().groupby(level=0).cumcount()]
print(df1.stack().reset_index(drop=True))
    c1    c2
0    1     3
1    2     4
2   31    13
3   14    11
4  115  1313
5  613     1
Or with a bit of creativity, in one line
df.T.set_index(
    df.T.groupby([df.columns]).cumcount(),
    append=True
).unstack().T.reset_index(drop=True)
    c1    c2
0    1     3
1    2     4
2   31    13
3   14    11
4  115  1313
5  613     1
You could melt the dataframe, then number the entries within each column to use as an index for the new dataframe, and then unstack it back, like this:
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
    columns=['c1', 'c1', 'c2', 'c2'])
df1 = (pd.melt(df, var_name='column')
         .assign(n=lambda x: x.groupby('column').cumcount())
         .set_index(['n', 'column'])
         .unstack())
df1.columns = df1.columns.get_level_values(1)
print(df1)
Which produces
column   c1    c2
n
0         1     3
1        31    13
2       115  1313
3         2     4
4        14    11
5       613     1
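One more sketch, building the result with plain concat by collecting every column that shares a name (this assumes each duplicated name repeats the same number of times):
import pandas as pd
from collections import defaultdict

df = pd.DataFrame(
    [[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
    columns=['c1', 'c1', 'c2', 'c2'])

# Gather the columns for each name, then stack them end to end
groups = defaultdict(list)
for name, col in df.items():
    groups[name].append(col.reset_index(drop=True))
df1 = pd.DataFrame({name: pd.concat(cols, ignore_index=True)
                    for name, cols in groups.items()})
print(df1)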

Does groupby concatenate the columns?

I have a "1000 rows * 4 columns" DataFrame:
a     b   c   d
1     aa  93  4
2     bb  32  3
...
1000  nn  78  2
[1283 rows x 4 columns]
and I use groupby to group them based on 3 of the columns:
df = df.groupby(['a','b','c']).sum()
print(df)
             d
a    b  c
1    aa 93  12
2    bb 32  53
...
1000 nn 78  38
[1283 rows x 1 columns]
However, the result gives me a "1000 rows * 1 column" dataframe. So my question is: does groupby concatenate the columns into one column? If yes, how can I prevent that? I want to plot my data after grouping it, but I can't, since it only sees one column instead of all 4.
Edit: when I call the columns I only get the last column, meaning 'a', 'b' and 'c' are no longer read as columns. Why is that, and how can I mark them as columns again?
df.columns
Index([u'd'], dtype='object')
You can do it this way:
df.groupby(['a','b','c'], as_index=False).sum()
or:
df.groupby(['a','b','c']).sum().reset_index()
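A quick sketch of the difference on made-up data: with as_index=False the grouping keys stay as regular columns, so all four columns remain available for plotting:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['aa', 'aa', 'bb'],
                   'c': [93, 93, 32], 'd': [4, 8, 3]})

out = df.groupby(['a', 'b', 'c'], as_index=False).sum()
print(out.columns)  # Index(['a', 'b', 'c', 'd'], dtype='object')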
