I have a column with values like 1ST:[70]2ND:[71]3RD:[71]S1:[71]4TH:[77]5TH:[78]6TH:[78]S2:[78]FIN:[75] in a CSV, and I need to extract the merged content into separate columns. How do I do this in pandas?
I need output like:
1ST 2ND 3RD S1 4TH 5TH 6TH S2 FIN
0 70 71 71 71 77 78 78 78 75
Here I have pasted a few rows of that column's values.
1ST:[80]2ND:[79]3RD:[75]S1:[78]4TH:[76]5TH:[80]6TH:[87]S2:[81]FIN:[80]
1ST:[75]2ND:[74]3RD:[81]S1:[77]4TH:[80]5TH:[78]6TH:[87]S2:[82]FIN:[80]
1ST:[58]2ND:[54]3RD:[65]S1:[59]4TH:[80]5TH:[72]6TH:[74]S2:[75]FIN:[67]
1ST:[90]2ND:[91]3RD:[82]S1:[88]4TH:[84]5TH:[88]6TH:[87]S2:[86]FIN:[87]
1ST:[83]2ND:[79]3RD:[82]S1:[81]4TH:[85]5TH:[84]6TH:[90]S2:[86]FIN:[84]
In the dataframe I have one column containing the values above. I need to split it into different columns, with the values going into rows.
Your question is a little confusing. What structure do you want the solution to produce?
Your file has values like this:
1ST:[70]2ND:[71]3RD:[71]S1:[71]4TH:[77]5TH:[78]6TH:[78]S2:[78]FIN:[75]
Do you want the output like this:
1ST 2ND 3RD S1 4TH 5TH 6TH S2 FIN
0 70 71 71 71 77 78 78 78 75
or like this:
0 1
0 1ST 70
1 2ND 71
2 3RD 71
3 S1 71
4 4TH 77
5 5TH 78
6 6TH 78
7 S2 78
8 FIN 75
Now, here is an approach to get that output from the given input:
import pandas as pd
# assume your input is a single string (the same works row by row on a CSV)
file_val = "1ST:[70]2ND:[71]3RD:[71]S1:[71]4TH:[77]5TH:[78]6TH:[78]S2:[78]FIN:[75]"
df = pd.DataFrame([i.split(':') for i in file_val.replace('[',"").split(']') if i!=""])
print(df)
0 1
0 1ST 70
1 2ND 71
2 3RD 71
3 S1 71
4 4TH 77
5 5TH 78
6 6TH 78
7 S2 78
8 FIN 75
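If you want the wide one-row layout instead, you can pivot this long frame; a minimal sketch:
# pivot the long two-column frame above into a single wide row
wide = df.set_index(0).T.reset_index(drop=True).rename_axis(None, axis=1)
print(wide)
#   1ST 2ND 3RD  S1 4TH 5TH 6TH  S2 FIN
# 0  70  71  71  71  77  78  78  78  75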
Please share a snapshot of the CSV file or a couple of rows, so that I can generate the output per your requirement.
Coming back to the final solution in your requested format:
# reading data
with open('sample.csv') as f:
    dat = f.read()
# splitting into rows
dat1 = dat.split('\n')
# method to convert each row to a dict
def row_to_dict(row):
    return dict([i.split(":") for i in row.replace('[', "").split(']') if i != ""])
# now apply the method to each non-empty row of dat1 and build a single
# dataframe out of it; that is the final output
res = pd.DataFrame([row_to_dict(x) for x in dat1 if x])
print(res)
1ST 2ND 3RD S1 4TH 5TH 6TH S2 FIN
0 80 79 75 78 76 80 87 81 80
1 75 74 81 77 80 78 87 82 80
2 58 54 65 59 80 72 74 75 67
3 90 91 82 88 84 88 87 86 87
4 83 79 82 81 85 84 90 86 84
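If the raw strings are already sitting in a DataFrame column, a pandas-native sketch with Series.str.findall does the same job (the column name 'scores' is just an assumption for illustration):
import pandas as pd

raw = pd.DataFrame({'scores': [
    "1ST:[70]2ND:[71]3RD:[71]S1:[71]4TH:[77]5TH:[78]6TH:[78]S2:[78]FIN:[75]",
    "1ST:[80]2ND:[79]3RD:[75]S1:[78]4TH:[76]5TH:[80]6TH:[87]S2:[81]FIN:[80]",
]})
# findall yields a list of (name, value) pairs per row; dict() turns each list into a mapping
pairs = raw['scores'].str.findall(r'([A-Z0-9]+):\[(\d+)\]')
res = pd.DataFrame([dict(p) for p in pairs])
print(res)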
And here is the same result in R:
# read the raw file; each line is one record
a1 = read.csv("c:/Users/Dell/Desktop/NewText.txt", header = FALSE)
a1$V1 = as.character(a1$V1)
l = list()
for (i in 1:nrow(a1)) {
  g1 = strsplit(a1$V1[i], "]")     # split the record into NAME:[VALUE chunks
  g1 = strsplit(g1[[1]], ":\\[")   # split each chunk into name and value
  g2 = data.frame(g1)              # one column per (name, value) pair
  g2[] <- lapply(g2, as.character)
  colnames(g2) = g2[1, ]           # first row holds the column names
  g2 = g2[-1, ]                    # keep only the values row
  l[[i]] = g2
}
l = do.call('rbind', l)
Related
I have a DataFrame and I need to create a new column which contains the second largest value of each row in the original DataFrame.
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1))
Desired output:
0 1 2 3 4 5 6 7 8 9 penultimate
0 52 69 62 7 20 69 38 10 57 17 62
1 52 94 49 63 1 90 14 76 20 84 90
2 78 37 58 7 27 41 27 26 48 51 58
3 6 39 99 36 62 90 47 25 60 84 90
4 37 36 91 93 76 69 86 95 69 6 93
5 5 54 73 61 22 29 99 27 46 24 73
6 71 65 45 9 63 46 4 93 36 18 71
7 85 7 76 46 65 97 64 52 28 80 85
How can this be done in as little code as possible?
You could use NumPy for this:
import numpy as np
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
df['penultimate'] = np.sort(df.values, 1)[:, -2]
print(df)
Using NumPy here is faster than a row-wise apply. Note that np.sort counts duplicates, so if the maximum is tied, the second largest is the maximum itself.
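If the frames are wide and speed matters, np.partition does a partial sort instead of a full one; a sketch with the same tie-counting semantics:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1))
# np.partition only guarantees the element at position -2 is in its sorted place,
# which is all we need for the second largest (duplicates counted)
df['penultimate'] = np.partition(df.values, -2, axis=1)[:, -2]
print(df)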
Here is a simple lambda alternative! Note that .unique() gives the second largest distinct value, which differs from the NumPy answer whenever the maximum is tied.
# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
# Output
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
df['penultimate'] = out
print(df)
Cheers!
I have data like this:
A B C D E F
35 1 2 35 25 65
40 5 7 47 57 67
20 1 8 74 58 63
35 1 2 37 28 69
40 5 7 49 58 69
20 1 8 74 58 63
35 1 2 47 29 79
40 5 7 55 77 87
20 1 8 74 58 63
Here we can see that columns A, B and C have replicas that are repeated across various rows. I want to shuffle the rows so that the replicas end up in consecutive rows, without deleting any of them. The output should look like this:
A B C D E F
35 1 2 35 25 65
35 1 2 37 28 69
35 1 2 47 29 79
40 5 7 47 57 67
40 5 7 49 58 69
40 5 7 55 77 87
20 1 8 74 58 63
20 1 8 74 58 63
20 1 8 74 58 63
When I use pandas.DataFrame.duplicated, it only flags the duplicated rows. How can I keep all the identical rows together using groupby?
Here is code that achieves the result you asked for (which doesn't require either explicit shuffling or sorting, but merely grouping your existing df by columns A,B,C):
df_shuf = pd.concat(group for _, group in df.groupby(['A', 'B', 'C'], sort=False))
print(df_shuf.to_string(index=False))
A B C D E F
35 1 2 35 25 65
35 1 2 37 28 69
35 1 2 47 29 79
40 5 7 47 57 67
40 5 7 49 58 69
40 5 7 55 77 87
20 1 8 74 58 63
20 1 8 74 58 63
20 1 8 74 58 63
Notes:
I couldn't figure out how to do df.reindex in-place on the grouped object, but we can get by without it.
You don't need pandas.DataFrame.duplicated, since df.groupby(['A','B','C']) already puts all duplicates in the same group.
df.groupby(..., sort=False) is faster; use it whenever you don't need the groups sorted.
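An alternative that avoids concatenating group by group, as a sketch: number the groups in order of first appearance with ngroup, then stable-sort the row positions.
import pandas as pd

df = pd.DataFrame({'A': [35, 40, 20, 35, 40, 20, 35, 40, 20],
                   'B': [1, 5, 1, 1, 5, 1, 1, 5, 1],
                   'C': [2, 7, 8, 2, 7, 8, 2, 7, 8],
                   'D': [35, 47, 74, 37, 49, 74, 47, 55, 74],
                   'E': [25, 57, 58, 28, 58, 58, 29, 77, 58],
                   'F': [65, 67, 63, 69, 69, 63, 79, 87, 63]})
# ngroup labels each (A, B, C) group by first appearance; a stable argsort
# then brings equal labels together without disturbing their original order
order = df.groupby(['A', 'B', 'C'], sort=False).ngroup().argsort(kind='stable')
df_shuf = df.iloc[order]
print(df_shuf.to_string(index=False))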
I have a dictionary of dataframes:
In[4]: df_dict
Out[4]:
1: A B C D
0 68 0 98 83
1 10 71 36 69
2 49 57 59 40
3 54 28 64 37
4 70 58 91 29
2: A B C D
0 17 69 59 7
1 79 66 72 53
2 81 37 26 34
3 0 63 80 15
4 20 55 64 86
3: A B C D
0 14 79 91 14
1 89 86 57 59
2 42 18 7 51
3 22 85 63 35
4 10 12 46 92
If I want to add the string "JAN" to every value in column B in each dataframe of the dictionary, how would I do that? For example for the dataframe with key == 1, I would want the values in column B to be [0JAN, 71JAN, 57JAN, 28JAN, 58JAN] and I would want that for each column B in the dictionary. Assume the current values of column B are already formatted as strings.
Re-create the dictionary:
df_dict = {k: df.assign(B=df['B'].astype(str) + 'JAN') for k, df in df_dict.items()}
Alternatively, assign in place; this is slightly cheaper:
for df in df_dict.values():
    df['B'] = df['B'].astype('str') + 'JAN'
Or,
# iterate over the items (credit: jpp)
for k, v in df_dict.items():
    df_dict[k]['B'] = v['B'].astype('str') + 'JAN'
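A quick check on a toy dictionary (values taken from the frame with key 1 above):
import pandas as pd

df_dict = {1: pd.DataFrame({'A': [68, 10], 'B': ['0', '71']})}
for df in df_dict.values():
    df['B'] = df['B'].astype(str) + 'JAN'
print(df_dict[1]['B'].tolist())  # ['0JAN', '71JAN']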
Below I am using pandas to read my csv file in the following format:
dataframe = pandas.read_csv("test.csv", header=None, usecols=range(2,62), skiprows=1)
dataset = dataframe.values
How can I delete the first value in the very last column in the dataframe and then delete the last row in the dataframe?
Any ideas?
You can shift the last column up to get rid of the first value, then drop the last row.
df.assign(E=df.E.shift(-1)).drop(df.index[-1])
MVCE:
import numpy as np

np.random.seed(123)
df = pd.DataFrame(np.random.randint(0, 100, (10, 5)), columns=list('ABCDE'))
Output:
A B C D E
0 91 83 40 17 94
1 61 5 43 87 48
2 3 69 73 15 85
3 99 53 18 95 45
4 67 30 69 91 28
5 25 89 14 39 64
6 54 99 49 44 73
7 70 41 96 51 68
8 36 3 15 94 61
9 51 4 31 39 0
df.assign(E=df.E.shift(-1)).drop(df.index[-1]).astype(int)
Output:
A B C D E
0 91 83 40 17 48
1 61 5 43 87 85
2 3 69 73 15 45
3 99 53 18 95 28
4 67 30 69 91 64
5 25 89 14 39 73
6 54 99 49 44 68
7 70 41 96 51 61
8 36 3 15 94 0
or in two steps:
df[df.columns[-1]] = df[df.columns[-1]].shift(-1)
df = df[:-1]
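Tied back to the read_csv setup at the top of the question, the same shift-then-trim idea looks like this (a sketch, assuming test.csv exists as described):
import pandas as pd

dataframe = pd.read_csv("test.csv", header=None, usecols=range(2, 62), skiprows=1)
last = dataframe.columns[-1]
dataframe[last] = dataframe[last].shift(-1)  # shifts values up, discarding the first value
dataframe = dataframe[:-1]                   # drops the last (now incomplete) row
dataset = dataframe.values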
I want to extract a substring from a string which conforms to a certain regex. The regex is:
(\[\s*(\d)+ byte(s)?\s*\](\s*|\d|[A-F]|[a-f])+)
Which effectively means that all of these strings get accepted:
[4 bytes] 66 74 79 70 33 67 70 35
[ 4 bytes ] 66 74 79 70 33 67 70 35
[1 byte] 66 74 79 70 33 67 70 35
I want to extract only the amount of bytes (just the number) from this string. I thought of doing this with re.search, but I'm not sure if that will work. What would be the cleanest and most performant way of doing this?
Use match.group to get the groups your regular expression defines:
import re
s = """[4 bytes] 66 74 79 70 33 67 70 35
[ 4 bytes ] 66 74 79 70 33 67 70 35
[1 byte] 66 74 79 70 33 67 70 35"""
r = re.compile(r"(\[\s*(\d+) byte(s)?\s*\](\s*|\d|[A-F]|[a-f])+)")
for line in s.split("\n"):
    m = r.match(line)
    if m:
        print(m.group(2))
The first group matches the whole accepted string; the second captures just the number, e.g. 4. Note that (\d)+, as in your original pattern, captures only the last digit of a multi-digit count, so the group is written (\d+) here.
Output:
4
4
1
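Since you mention re.search: if you only ever need the number, a narrower pattern is enough; a minimal sketch:
import re

s = """[4 bytes] 66 74 79 70 33 67 70 35
[ 4 bytes ] 66 74 79 70 33 67 70 35
[1 byte] 66 74 79 70 33 67 70 35"""
# capture just the digits between the brackets
for line in s.split("\n"):
    m = re.search(r"\[\s*(\d+)\s*bytes?\s*\]", line)
    if m:
        print(m.group(1))  # prints 4, 4, 1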