How to convert a string size 1 into Dataframe? - python

I just got back into coding and came across this issue.
How do I turn a single string into a dataframe, where every five lines of the string become one row?
The string looks like this:
"Jane Doe
Male-52
City- NYC
$36,000
total salary
Amy sam
Female-65
City- NYC
$38,000
total salary
.....
.....
and so on
"
How do I turn it into a dataframe like this?
Name      Sex     Age  City  Total Salary
Jane Doe  Male    52   NYC   36,000
Amy Sam   Female  65   NYC   38,000
......
My code is
elements = driver.find_elements_by_xpath("""//*[@id="file"]""")
data = "".join([element.text for element in elements])

import pandas
s = """Jane Doe
Male-52
City- NYC
$36,000
total salary
Amy sam
Female-65
City- NYC
$38,000
total salary"""
import re
df = pandas.DataFrame(re.findall(r"(\w+ \w+)\n(\w+)-(\d+)\nCity- (\w+)\n\$(.*)", s),
columns=["name","sex","age","city","salary"])
print(df)
This is one way to solve it.

This should work for any number of columns; you just have to pass appropriate column names to the DataFrame afterwards. You will also have to clean up the columns and delete the unnecessary ones after the reshaping is done.
Edited to include the entire code and output
import pandas as pd
mystr = """Jane Doe
Male-52
City- NYC
$36,000
total salary
Amy sam
Female-65
City- NYC
$38,000
total salary"""
num_columns = 5
df = pd.Series(mystr.split("\n"), name="data")
pd.DataFrame(df.values.reshape((int(df.shape[0]/num_columns), num_columns)))
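The cleanup step mentioned above could look roughly like the sketch below. It assumes every record is exactly five lines in the order shown in the question; the intermediate column names (`sex_age`, `label`) and the variable names `raw` and `clean` are made up for illustration.

```python
import pandas as pd

mystr = """Jane Doe
Male-52
City- NYC
$36,000
total salary
Amy sam
Female-65
City- NYC
$38,000
total salary"""

num_columns = 5
lines = mystr.split("\n")

# One row per record: slice the flat list of lines into chunks of five.
raw = pd.DataFrame(
    [lines[i:i + num_columns] for i in range(0, len(lines), num_columns)],
    columns=["name", "sex_age", "city", "salary", "label"],  # illustrative names
)

# Split and strip the raw text; the constant "total salary" label is dropped.
sex_age = raw["sex_age"].str.split("-")
clean = pd.DataFrame({
    "name": raw["name"],
    "sex": sex_age.str[0],
    "age": sex_age.str[1].astype(int),
    "city": raw["city"].str.replace("City-", "", regex=False).str.strip(),
    "salary": raw["salary"].str.replace(r"[$,]", "", regex=True).astype(int),
})
print(clean)
```

If the scraped text ever contains a record with a missing line, the chunks will shift, so it may be worth checking that `len(lines)` is a multiple of `num_columns` first.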

Related

add a column in python with rows number

I have a dataset like this:
state  sex  birth  player
QLD    m    1993   Dave
QLD    m    1992   Rob
Now I would like to create an additional column, the ID, which is equal to the row number plus 1.
df = df.assign(ID=range(len(df)))
Unfortunately the first ID is zero. How can I make the IDs start at 1?
I want this output:
state  sex  birth  player  ID
QLD    m    1993   Dave    1
QLD    m    1992   Rob     2
but I got this:
state  sex  birth  player  ID
QLD    m    1993   Dave    0
QLD    m    1992   Rob     1
How can I add a column that starts at one and gives every row a unique number, so 2 for the second row, 3 for the third, and so on?
You can try this:
import pandas as pd
df['ID'] = pd.Series(range(len(df))) + 1
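One caveat with the line above: assigning a Series aligns on the index, so it only produces 1, 2, 3, ... when df has the default 0..n-1 index. A minimal sketch of an alternative that does not depend on the index is shown below; the sample frame and its index values (10, 20) are made up to illustrate the point.

```python
import pandas as pd

# Sample data from the question, with a deliberately non-default index.
df = pd.DataFrame(
    {
        "state": ["QLD", "QLD"],
        "sex": ["m", "m"],
        "birth": [1993, 1992],
        "player": ["Dave", "Rob"],
    },
    index=[10, 20],
)

# A plain range is assigned by position, so no index alignment takes place.
df["ID"] = range(1, len(df) + 1)
print(df)
```

The same idea also works with the `assign` call from the question: `df.assign(ID=range(1, len(df) + 1))`.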

Add values in columns if criteria from another column is met

I have the following DataFrame
import pandas as pd
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500]}
df = pd.DataFrame(data=d)
Client  Salesperson  Amount  Salesperson 2  Amount2
1       John         1000    Bob            400
2       John         1000    Richard        200
3       Bob          0       John           300
4       Richard      500     Tom            500
I need to create some sort of "sumif" statement (like the one in Excel) that adds up the amount each salesperson is due. I don't know how to iterate over each row, but I want it to add the values in "Amount" and "Amount2" for each of the salespersons.
Then I need to be able to see the amount per salesperson.
Expected output (ideally in a DataFrame as well):
Sales Person  Total Amount
John          2300
Bob           400
Richard       700
Tom           500
There are multiple ways of solving this. One option is to use pd.concat to stack the required columns and then use groupby:
merged_df = pd.concat([df[['Salesperson','Amount']], df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2':'Salesperson','Amount2':'Amount'})])
merged_df.groupby('Salesperson',as_index = False)['Amount'].sum()
you get
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
Edit: If you have another pair of salesperson/amount, you can add that to the concat
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500], 'Salesperson 3':['Nick','Richard','Sam','Bob'],
'Amount3':[400,800,100,400]}
df = pd.DataFrame(data=d)
merged_df = pd.concat([df[['Salesperson','Amount']], df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2':'Salesperson','Amount2':'Amount'}), df[['Salesperson 3', 'Amount3']].rename(columns={'Salesperson 3':'Salesperson','Amount3':'Amount'})])
merged_df.groupby('Salesperson',as_index = False)['Amount'].sum()
Salesperson Amount
0 Bob 800
1 John 2300
2 Nick 400
3 Richard 1500
4 Sam 100
5 Tom 500
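If more salesperson/amount pairs keep being added, writing each rename by hand gets long. A possible sketch that builds the concat list in a loop is below; the `pairs` list is an assumption that you maintain yourself, and the variable names are illustrative.

```python
import pandas as pd

d = {'Client': [1, 2, 3, 4],
     'Salesperson': ['John', 'John', 'Bob', 'Richard'],
     'Amount': [1000, 1000, 0, 500],
     'Salesperson 2': ['Bob', 'Richard', 'John', 'Tom'],
     'Amount2': [400, 200, 300, 500],
     'Salesperson 3': ['Nick', 'Richard', 'Sam', 'Bob'],
     'Amount3': [400, 800, 100, 400]}
df = pd.DataFrame(data=d)

# (name column, amount column) pairs to stack on top of each other.
pairs = [('Salesperson', 'Amount'),
         ('Salesperson 2', 'Amount2'),
         ('Salesperson 3', 'Amount3')]

# Give every pair the same two column names, then stack them vertically.
stacked = pd.concat(
    [df[[person, amount]].rename(columns={person: 'Salesperson', amount: 'Amount'})
     for person, amount in pairs],
    ignore_index=True,
)

totals = stacked.groupby('Salesperson', as_index=False)['Amount'].sum()
print(totals)
```

This should give the same totals as the hand-written concat above; only the way the list of frames is built differs.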
Edit 2: Another solution using pandas wide_to_long
df = df.rename({'Salesperson':'Salesperson 1','Amount':'Amount1'}, axis='columns')
reshaped_df = pd.wide_to_long(df, stubnames=['Salesperson','Amount'], i='Client', j='num', suffix=r'\s?\d+').reset_index(drop=True)
The above will reshape df,
Salesperson Amount
0 John 1000
1 John 1000
2 Bob 0
3 Richard 500
4 Bob 400
5 Richard 200
6 John 300
7 Tom 500
8 Nick 400
9 Richard 800
10 Sam 100
11 Bob 400
A simple groupby on reshaped_df will give you required output
reshaped_df.groupby('Salesperson', as_index = False)['Amount'].sum()
One option is to tidy the dataframe into long form, where all the Salespersons are in one column, and the amounts are in another, then you can groupby and get the aggregate.
Let's use pivot_longer from pyjanitor to transform to long form:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
index="Client",
names_to=".value",
names_pattern=r"([a-zA-Z]+).*",
)
.groupby("Salesperson", as_index = False)
.Amount
.sum()
)
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
The .value tells the function to keep only the parts of the column names that match it as headers. The columns follow a pattern: they start with text (either Salesperson or Amount) and may or may not end with a number. This pattern is captured in names_pattern. .value is paired with the regex group in the brackets; whatever falls outside the group does not matter in this case.
Once transformed into long form, it is easy to groupby and aggregate. The as_index parameter allows us to keep the output as a dataframe.
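To see what the regex in names_pattern captures, here is a small illustration using only Python's re module. It demonstrates the pattern itself, not how pyjanitor works internally; the names `cols` and `headers` are just for this example.

```python
import re

# Same regex as in names_pattern: one group of letters, then anything.
pattern = re.compile(r"([a-zA-Z]+).*")

cols = ["Salesperson", "Amount", "Salesperson 2", "Amount2"]

# The captured group is the part that ends up as the new column header.
headers = [pattern.match(col).group(1) for col in cols]
print(headers)  # ['Salesperson', 'Amount', 'Salesperson', 'Amount']
```

Columns that share a captured header are the ones that get stacked into the same output column.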

flatten dataframe containing list with multiple dictionaries

I currently have a pandas dataframe with 100+ columns, produced by pd.json_normalize(), and there is one particular column (children) that looks something like this:
name age children address... 100 more columns
Mathew 20 [{name: Sam, age:5}, {name:Ben, age: 10}] UK
Linda 30 [] USA
What I would like for the dataframe to look like is:
name age children.name children.age address... 100 more columns
Mathew 20 Sam 5 UK
Mathew 20 Ben 10 UK
Linda 30 USA
There can be any number of dictionaries within the list. Thanks for the help in advance!
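One possible approach, sketched under the assumption that the children column holds real Python lists of dicts (not strings), is to explode the lists into rows and then expand each dict into its own columns. The variable names `exploded`, `children` and `flat` are illustrative, and only a few of the 100+ columns are reproduced here.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Mathew", "Linda"],
    "age": [20, 30],
    "children": [
        [{"name": "Sam", "age": 5}, {"name": "Ben", "age": 10}],
        [],
    ],
    "address": ["UK", "USA"],
})

# One row per list element; an empty list becomes a single row holding NaN.
exploded = df.explode("children").reset_index(drop=True)

# Expand each dict into columns; NaN placeholders become empty dicts,
# which produce a row of missing values.
children = pd.DataFrame(
    [c if isinstance(c, dict) else {} for c in exploded["children"]],
    index=exploded.index,
).add_prefix("children.")

flat = pd.concat([exploded.drop(columns="children"), children], axis=1)
print(flat)
```

Rows without children keep their other columns and get missing values in the children.* columns, so numeric child fields such as children.age end up as floats.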

Take value from a duplicate row and create a new column pandas

I have the following dataframe:
Year Name Town Vehicle
2000 John NYC Truck
2000 John NYC Car
2010 Jim London Bike
2010 Jim London Car
I would like to condense this dataframe to one row per Year/Name/Town so that my end result is:
Year Name Town Vehicle Vehicle2
2000 John NYC Truck Car
2010 Jim London Bike Car
I'm guessing it is some sort of df.groupby statement, but I'm not sure how to create the new column. Any help would be much appreciated!
Use GroupBy.cumcount for counter with reshape by Series.unstack:
g = df.groupby(['Year', 'Name','Town']).cumcount()
df1 = (df.set_index(['Year', 'Name','Town', g])['Vehicle']
.unstack()
.add_prefix('Vehicle')
.reset_index())
print (df1)
Year Name Town Vehicle0 Vehicle1
0 2000 John NYC Truck Car
1 2010 Jim London Bike Car
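The output above names the columns Vehicle0 and Vehicle1, while the question asked for Vehicle and Vehicle2. A small variation that renames the unstacked columns to match could look like this; the sample frame is rebuilt from the question and the name `wide` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2000, 2000, 2010, 2010],
    "Name": ["John", "John", "Jim", "Jim"],
    "Town": ["NYC", "NYC", "London", "London"],
    "Vehicle": ["Truck", "Car", "Bike", "Car"],
})

# Counter within each Year/Name/Town group: 0 for the first row, 1 for the next.
g = df.groupby(["Year", "Name", "Town"]).cumcount()

wide = df.set_index(["Year", "Name", "Town", g])["Vehicle"].unstack()

# Counter 0 becomes "Vehicle", counter 1 becomes "Vehicle2", and so on.
wide.columns = ["Vehicle" if i == 0 else f"Vehicle{i + 1}" for i in wide.columns]
wide = wide.reset_index()
print(wide)
```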

Calculating grouped by % based on if there are contained values in numerator and unique column value in denominator

I am trying to compute a ratio (%) per Rep: the number of businesses (Business column) that have at least one Service row equal to Food or Beverage, divided by the number of unique businesses for that Rep. I am having trouble with it.
Original df:
Rep | Business | Service
Cindy Shakeshake Food
Cindy Shakeshake Outdoor
Kim BurgerKing Beverage
Kim Burgerking Phone
Kim Burgerking Car
Nate Tacohouse Food
Nate Tacohouse Car
Tim Cofeeshop Coffee
Tim Coffeeshop Seating
Cindy Italia Seating
Cindy Italia Coffee
Desired Output:
Rep | %
Cindy .5
Kim 1
Nate 1
Tim 0
Where % is the number of Businesses Cindy has with at least 1 Food or Beverage row, divided by all unique Businesses in df for her.
I am trying something like below:
s = (df.assign(Service=df.Service.isin(['Food','Beverage']).astype(int))
.groupby('Rep')
.agg({'Business':'nunique','Service':'count'}))
s['Service']/s['Business']
But this doesn't give me what I'm looking for: the Service column just counts all rows in df for Cindy (in this case 4), and the Business column isn't giving me an accurate number of businesses where she has Food or Beverage.
Thanks for looking and possible help in advance.
I think you need to aggregate with sum to count the matched values:
df1 = (df.assign(Service=df.Service.isin(['Food','Beverage']).astype(int))
.groupby('Rep')
.agg({'Business':'nunique','Service':'sum'}))
print (df1)
Business Service
Rep
Cindy 2 1
Kim 2 1
Nate 1 1
Tim 2 0
s = df1['Service']/df1['Business']
print (s)
Cindy 0.5
Kim 0.5
Nate 1.0
Tim 0.0
dtype: float64
There is a small mistake that you made in your code here:
s=(df.assign(Service=df.Service.isin(['Food','Beverage']).astype(int))
.groupby('Rep')
.agg({'Business':'nunique','Service':'count'}))
s['Service']/s['Business']
You would need to change 'Service':'count' to 'Service':'sum'. count just counts the number of rows that each Rep has. With sum, it counts the number of rows for each Rep whose Service is either Food or Beverage.
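Note that both answers give Kim 0.5 while the desired output shows 1. In the sample data "BurgerKing" and "Burgerking" differ in case, so nunique counts them as two businesses, and sum counts matching rows rather than businesses. A hedged sketch that flags each business once and then averages per Rep is below. Lower-casing the business name is an assumption that the case difference is a typo; the names `flag`, `per_business` and `pct` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Rep": ["Cindy", "Cindy", "Kim", "Kim", "Kim", "Nate", "Nate",
            "Tim", "Tim", "Cindy", "Cindy"],
    "Business": ["Shakeshake", "Shakeshake", "BurgerKing", "Burgerking",
                 "Burgerking", "Tacohouse", "Tacohouse", "Cofeeshop",
                 "Coffeeshop", "Italia", "Italia"],
    "Service": ["Food", "Outdoor", "Beverage", "Phone", "Car", "Food", "Car",
                "Coffee", "Seating", "Seating", "Coffee"],
})

# True for rows whose service is Food or Beverage.
flag = df["Service"].isin(["Food", "Beverage"])

# One value per (Rep, Business): 1 if the business has at least one match.
# Lower-casing is an assumption about the data, not part of the original answers.
per_business = flag.groupby([df["Rep"], df["Business"].str.lower()]).any().astype(int)

# Share of each Rep's businesses that have at least one match.
pct = per_business.groupby(level="Rep").mean()
print(pct)
```

With the sample data this reproduces the desired output (Cindy 0.5, Kim 1, Nate 1, Tim 0).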
