Python: split by (different) n spaces

I have lines like this:
2 20 164 "guid" Some name^7 0 ip.a.dd.res:port -21630 25000
6 30 139 "guid" Other name^7 0 ip.a.dd.res:port 932 25000
I would like to split this, but the problem is that there is a different number of spaces between these "words"...
How can I do this?

Python's split function doesn't care about the number of spaces:
>>> ' 2 20 164 "guid" Some name^7 0 ip.a.dd.res:port -21630 25000'.split()
['2', '20', '164', '"guid"', 'Some', 'name^7', '0', 'ip.a.dd.res:port', '-21630', '25000']

Have you tried split()? It will "compress" spaces, so after split you will get:
'2', '20', '164', '"guid"' etc.

>>> l = "1 2 4 'ds' 5 66"
>>> l
"1 2 4 'ds' 5 66"
>>> l.split(' ')
['1', '', '', '2', '', '', '4', "'ds'", '5', '', '66']
>>> [x for x in l.split()]
['1', '2', '4', "'ds'", '5', '66']

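Just use the split() function with no arguments. It then splits on runs of whitespace (much like the regex \s+), that is, any kind and any number of whitespace characters, and ignores leading and trailing whitespace.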
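A minimal sketch of the difference, using a line shaped like the ones in the question (the exact spacing here is made up for illustration):

```python
import re

line = '  2   20  164 "guid"   Some name^7   0 ip.a.dd.res:port  -21630 25000'

# No argument: split on runs of whitespace, ignoring leading/trailing whitespace
fields = line.split()

# Explicit single space: every extra space produces an empty string
naive = line.split(' ')

# Regex equivalent of the no-argument form (strip first to avoid empty ends)
regex_fields = re.split(r'\s+', line.strip())

print(fields)
print(regex_fields == fields)
print('' in naive)
```

Note that a name containing a space, such as Some name^7, still comes out as two separate fields with any of these approaches.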

Related

python pandas substring based on columns values

Given the following df:
data = {'Description': ['with lemon', 'lemon', 'and orange', 'orange'],
'Start': ['6', '1', '5', '1'],
'Length': ['5', '5', '6', '6']}
df = pd.DataFrame(data)
print (df)
I would like to substring the "Description" based on what is specified in the other columns as start and length, here the expected output:
data = {'Description': ['with lemon', 'lemon', 'and orange', 'orange'],
'Start': ['6', '1', '5', '1'],
'Length': ['5', '5', '6', '6'],
'Res': ['lemon', 'lemon', 'orange', 'orange']}
df = pd.DataFrame(data)
print (df)
Is there a way to make it dynamic or another compact way?
df['Res'] = df['Description'].str[1:2]

You need to loop, a list comprehension will be the most efficient (python ≥3.8 due to the walrus operator, thanks @I'mahdi):
df['Res'] = [s[(start:=int(a)-1):start+int(b)] for (s,a,b)
             in zip(df['Description'], df['Start'], df['Length'])]
Or using pandas for the conversion (thanks @DaniMesejo):
df['Res'] = [s[a:a+b] for (s,a,b) in
             zip(df['Description'],
                 df['Start'].astype(int)-1,
                 df['Length'].astype(int))]
output:
Description Start Length Res
0 with lemon 6 5 lemon
1 lemon 1 5 lemon
2 and orange 5 6 orange
3 orange 1 6 orange
handling non-integers / NAs
df['Res'] = [s[a:a+b] if pd.notna(a) and pd.notna(b) else 'NA'
for (s,a,b) in
zip(df['Description'],
pd.to_numeric(df['Start'], errors='coerce').convert_dtypes()-1,
pd.to_numeric(df['Length'], errors='coerce').convert_dtypes()
)]
output:
Description Start Length Res
0 with lemon 6 5 lemon
1 lemon 1 5 lemon
2 and orange 5 6 orange
3 orange 1 6 orange
4 pinapple xxx NA NA NA
5 orangiie NA NA NA

Given that the fruit name of interest always seems to be the final word in the description column, you might be able to use a regex extract approach here.
df["Res"] = df["Description"].str.extract(r'(\w+)$')

You can use .map to go through the Series. Use split(' ') to separate the words on spaces, and take the last word in the resulting list with [-1].
df['RES'] = df['Description'].map(lambda x: x.split(' ')[-1])

Categorizing pandas dataframe series values

I am new to pandas and trying to automatically create categories and group the values.
My dataframe:
df = pd.DataFrame({'Slug': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
'Position': ['0', '1', '2', '3', '4', '0', '1', '2', '3', '0', '1', '2'],
'Brand': ['Mazda', 'BMW', 'Ford', 'Fiat', 'Dodge', 'Mazda', 'BMW', 'Ford', 'Fiat', 'BMW', 'Ford', 'Fiat'],
'Sessions': ['70', '', '', '', '', '60', '', '', '', '50', '', ''],
'Transactions': ['1', '', '', '', '', '2', '', '', '', '3', '', ''],
'Ecommerce': ['1', '', '', '', '', '3', '', '', '', '4', '', ''],
'CTR': ['10', '', '', '', '', '15', '', '', '', '5', '', ''],
'All': ['11', '', '', '', '', '1', '', '', '', '4', '', '']})
I am trying to answer a question: which layout of brands has the best conversion. The Position column declares the order in which brands are written down on the site:
Example:
0 A #Ford
1 B #BMW
2 C #Fiat
3 D #Dodge
The question is whether having Ford in second place and BMW in first would lead to more conversions.
The first thing I am trying to do is to generate a category for each unique group; there are about 10 different brands and 100 different ways they are set up.
For example:
Group1 could be:
0 A #Ford
1 B #BMW
2 C #Fiat
3 D #Dodge
Group2 could be:
0 B #BMW
1 A #Ford
2 C #Fiat
3 D #Dodge
Then my DataFrame would look like this:
Slug Group Sessions Transactions Ecommerce CTR All
a 1 70 1 1 10 10
b 2 60 2 3 15 11
c 1 60 2 3 15 11
d 3 60 2 3 15 11
e 2 60 2 3 15 11
Groups are categorized by the position and the brand column.
Slug could be understood as a country. For example, in country a, a layout of group 1 achieves 70 sessions; in country b, a layout of group 2 achieves 60 sessions, and so on.
Then I could compare the groups' performance, measured in sessions, transactions and the other column values which I have in my DataFrame.
The parameters like transactions, session and others are for the entire layout of brands, for example:
0 Ford
1 BMW
2 Fiat
3 Dodge
# this layout achieved 70 sessions and 5 conversions
So my question could be divided into 3 separate parts:
1) How could I generate groups of position and brand?
2) Maybe some of you have run into a similar task and know a method for determining the best layout of brands?
3) I've tried a bit of machine learning; maybe you could suggest which model I could apply to my problem.
Thank you for your suggestions.
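For part 1, one possible sketch in plain Python (no pandas), assuming a layout is fully described by its brands in position order. The rows below are hypothetical and only mirror the Group1/Group2 example from the question; the names layouts, group_ids and slug_group are made up for illustration:

```python
# (slug, position, brand) rows; hypothetical data mirroring Group1/Group2 above
rows = [
    ("a", "0", "Ford"), ("a", "1", "BMW"), ("a", "2", "Fiat"), ("a", "3", "Dodge"),
    ("b", "0", "BMW"), ("b", "1", "Ford"), ("b", "2", "Fiat"), ("b", "3", "Dodge"),
    ("c", "0", "Ford"), ("c", "1", "BMW"), ("c", "2", "Fiat"), ("c", "3", "Dodge"),
]

# Collect (position, brand) pairs per slug
layouts = {}
for slug, position, brand in rows:
    layouts.setdefault(slug, []).append((int(position), brand))

# Give every distinct ordered tuple of brands its own group number
group_ids = {}
slug_group = {}
for slug, items in layouts.items():
    layout = tuple(brand for _, brand in sorted(items))
    slug_group[slug] = group_ids.setdefault(layout, len(group_ids) + 1)

print(slug_group)
```

Slugs that share the same ordered layout end up with the same group number, which can then be joined back to the per-slug metrics.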

Create a dictionary from lists, overwrite duplicate keys

I have my code below. I am trying to create a dictionary from my lists extracted from a txt file but the loop overwrites the previous information:
f = open('data.txt','r')
lines = f.readlines()
lines = [line.rstrip('\n') for line in open('data.txt')]
columns=lines.pop(0)
for i in range(len(lines)):
    lines[i]=lines[i].split(',')
dictt={}
for line in lines:
    dictt[line[0]]=line[1:]
print('\n')
print(lines)
print('\n')
print(dictt)
I know I have to play with:
for line in lines:
    dictt[line[0]] = line[1:]
part, but what can I do? Do I have to use numpy? If so, how?
My lines list is :
[['USS-Enterprise', '6', '6', '6', '6', '6'],
['USS-Voyager', '2', '3', '0', '4', '1'],
['USS-Peres', '10', '4', '0', '0', '5'],
['USS-Pathfinder', '2', '0', '0', '1', '2'],
['USS-Enterprise', '2', '2', '2', '2', '2'],
['USS-Voyager', '2', '1', '0', '1', '1'],
['USS-Peres', '8', '5', '0', '0', '4'],
['USS-Pathfinder', '4', '0', '0', '2', '1']]
My dict becomes:
{'USS-Enterprise': ['2', '2', '2', '2', '2'],
'USS-Voyager': ['2', '1', '0', '1', '1'],
'USS-Peres': ['8', '5', '0', '0', '4'],
'USS-Pathfinder': ['4', '0', '0', '2', '1']}
taking only the last ones, I want to add the values together. I am really confused.

You are trying to append multiple values for the same key. You can use defaultdict for that, or modify your code and utilize the get method for dictionaries.
for line in lines:
    dictt[line[0]] = dictt.get(line[0], []) + line[1:]
This looks up each key, assigns line[1:] if the key has not been seen yet, and if it is a duplicate, appends those values onto the previous values. Note that assigning the result of .extend(...) would not work here, because extend modifies the list in place and returns None.

dict_output = {}
for line in list_input:
    if line[0] not in dict_output:
        dict_output[line[0]] = line[1:]
    else:
        dict_output[line[0]] += line[1:]
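The defaultdict variant mentioned above could look like the sketch below. It also shows an element-wise sum, under the assumption that "add the values together" in the question means summing the numbers column by column; if concatenation is what is wanted, only the first loop is needed. The names merged and totals are made up for illustration.

```python
from collections import defaultdict

lines = [['USS-Enterprise', '6', '6', '6', '6', '6'],
         ['USS-Voyager', '2', '3', '0', '4', '1'],
         ['USS-Peres', '10', '4', '0', '0', '5'],
         ['USS-Pathfinder', '2', '0', '0', '1', '2'],
         ['USS-Enterprise', '2', '2', '2', '2', '2'],
         ['USS-Voyager', '2', '1', '0', '1', '1'],
         ['USS-Peres', '8', '5', '0', '0', '4'],
         ['USS-Pathfinder', '4', '0', '0', '2', '1']]

# Concatenate: missing keys start out as an empty list
merged = defaultdict(list)
for name, *values in lines:
    merged[name].extend(values)

# Element-wise sum (assumption: this is what "add together" means)
totals = {}
for name, *values in lines:
    numbers = [int(v) for v in values]
    if name in totals:
        totals[name] = [a + b for a, b in zip(totals[name], numbers)]
    else:
        totals[name] = numbers

print(dict(merged))
print(totals)
```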

EDIT: You subsequently clarified in comments that your input has duplicate keys, and you want later rows to overwrite earlier ones.
ORIGINAL ANSWER: The input is not a dictionary, it's a CSV file. Just use pandas.read_csv() to read it:
import pandas as pd
df = pd.read_csv('my.csv', sep=r'\s+', header=None)
df
0 1 2 3 4 5
0 USS-Enterprise 6 6 6 6 6
1 USS-Voyager 2 3 0 4 1
2 USS-Peres 10 4 0 0 5
3 USS-Pathfinder 2 0 0 1 2
4 USS-Enterprise 2 2 2 2 2
5 USS-Voyager 2 1 0 1 1
6 USS-Peres 8 5 0 0 4
7 USS-Pathfinder 4 0 0 2 1
Seems your input didn't have a header row. If your input columns had names, you can add them with df.columns = ['Ship', 'A', 'B', 'C', 'D', 'E'] or whatever.
If you really want to write a dict output (beware of duplicate keys being suppressed), see df.to_dict()

Python if/else statement confusion

How can you write an if/else statement in Python when you have a file with both text and numbers? Let's say I want to replace the values in the third-to-last column of the file below. I want an if/else statement that replaces values <5, or a dot ".", with a zero and, if possible, uses that value as an integer for a sum.
A quick and dirty solution using awk would look like this, but I'm curious about how to handle this type of data with Python:
awk -F"[ :]" '{if ( (!/^#/) && ($9<5 || $9==".") ) $9="0" ; print }'
So how do you solve this problem?
Thanks
Input file:
##Comment1
#Header
sample1 1 2 3 4 1:0:2:1:.:3
sample2 1 4 3 5 1:3:2:.:3:3
sample3 2 4 6 7 .:0:6:5:4:0
Desired output:
##Comment1
#Header
sample1 1 2 3 4 1:0:2:0:0:3
sample2 1 4 3 5 1:3:2:0:3:3
sample3 2 4 6 7 .:0:6:5:4:0
SUM = 5
Result so far
['sample1', '1', '2', '3', '4', '1', '0', '2', '0', '0', '3\n']
['sample2', '1', '4', '3', '5', '1', '3', '2', '0', '3', '3\n']
['sample3', '2', '4', '6', '7', '.', '0', '6', '5', '4', '0']
Here's what I have tried so far:
import re
data=open("inputfile.txt", 'r')
for line in data:
    if not line.startswith("#"):
        nodots = line.replace(":.",":0")
        final_nodots=re.split('\t|:',nodots)
        if (int(final_nodots[8]))<5:
            final_nodots[8]="0"
            print (final_nodots)
        else:
            print(final_nodots)

import re
data=open("inputfile.txt", 'r')
sums = 0
for line in data:
    if not line.startswith("#"):
        nodots = line.replace(".","0")
        # list() turns the matched string into single characters, colons
        # included, so index 6 is the fourth colon-separated value
        final_nodots=list(re.findall(r'\d:.+\d+',nodots)[0])
        if (int(final_nodots[6]))<5:
            final_nodots[6]="0"
        print(final_nodots)
        sums += int(final_nodots[6])
print(sums)
You were pretty close, but your final_nodots ends up split only on : and not on the first few numbers, so your 8 should have been a 3. After that, just add a sums counter to keep track of that slot.
['sample1 1 2 3 4 1', '0', '2', '0', '0', '3\n']
There are better ways to achieve what you want but I just wanted to fix your code.
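One of those other ways might look like the sketch below. It assumes, as in the awk one-liner, that only the fourth colon-separated value of the last column is adjusted, that comment lines start with "#", and that the values are whole numbers or "."; the function name zero_small and its parameters are made up for illustration.

```python
def zero_small(lines, index=3, threshold=5):
    """Zero the chosen colon-separated value when it is '.' or below threshold."""
    out, total = [], 0
    for line in lines:
        line = line.rstrip()
        # Pass comments, headers and blank lines through untouched
        if not line or line.startswith("#"):
            out.append(line)
            continue
        last = line.split()[-1]          # e.g. "1:0:2:1:.:3"
        head = line[:-len(last)]         # everything before the last column
        parts = last.split(":")
        if parts[index] == "." or int(parts[index]) < threshold:
            parts[index] = "0"
        total += int(parts[index])
        out.append(head + ":".join(parts))
    return out, total


sample = [
    "##Comment1",
    "#Header",
    "sample1 1 2 3 4 1:0:2:1:.:3",
    "sample2 1 4 3 5 1:3:2:.:3:3",
    "sample3 2 4 6 7 .:0:6:5:4:0",
]
result, total = zero_small(sample)
print("\n".join(result))
print("SUM =", total)
```

To run it on a file, the open file object can be passed in place of sample, since the function only iterates over lines.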

Separate a file in paragraphs

I have a file like this:
cluster number 1
1
2
3
cluster number 2
1
2
3
cluster number x
1
2
3
I want to split this file into paragraphs, one per cluster number, like this:
cluster number 1
1
2
3
I tried to search for an answer, but I can't work it out.
Thanks for your help!

Use a regular expression:
import re
input_text = "..."
r = re.findall(r"(cluster number (\d+)\n\n(\d+)\n\n(\d+)\n\n(\d+))", input_text)
print(r)
This code returns the list below:
[('cluster number 1\n\n1\n\n2\n\n3', '1', '1', '2', '3'),
('cluster number 2\n\n1\n\n2\n\n3', '2', '1', '2', '3')]

As recommended, you should use regular expressions. Perhaps the re.split function would be suitable here:
>>> l = re.split(r'cluster number (?:\d+)', x)[1:]
>>> [a.split() for a in l]
[['1', '2', '3'], ['1', '2', '3'], ...]
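If each paragraph should keep its "cluster number" heading, as in the question, a zero-width lookahead is one option. This is a sketch that assumes Python 3.7 or later (where re.split accepts patterns that can match an empty string) and that every paragraph starts with a line beginning "cluster number "; the sample text is typed in by hand here rather than read from the file.

```python
import re

text = """cluster number 1
1
2
3
cluster number 2
1
2
3
cluster number x
1
2
3
"""

# Split right before each line that starts with "cluster number ",
# without consuming the heading itself
pieces = re.split(r"^(?=cluster number )", text, flags=re.MULTILINE)

# Drop the empty piece before the first heading and trim trailing newlines
paragraphs = [p.strip() for p in pieces if p.strip()]

for paragraph in paragraphs:
    print(paragraph)
    print("---")
```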
