I have a txt file that I read in through python that comes like this:
Text File:
18|Male|66|180|Brown
23|Female|67|120|Brown
16|71|192|Brown
22|Male|68|185|Brown
24|Female|62|100|Blue
One of the rows has missing data and the problem is that when I read it into a dataframe it appears like this:
Age Gender Height Weight Eyes
0 18 Male 66 180 Brown
1 23 Female 67 120 Brown
2 16 71 192 Brown NaN
3 22 Male 68 185 Brown
4 24 Female 62 100 Blue
I'm wondering if there is a way to shift the row that has missing data over without shifting all columns.
Here is what I have so far:
import pandas as pd
df = pd.read_csv('C:/Documents/file.txt', sep='|', names=['Age','Gender', 'Height', 'Weight', 'Eyes'])
df_full = df.loc[df['Gender'].isin(['Male','Female'])]
df_missing = df.loc[~df['Gender'].isin(['Male','Female'])]
df_missing = df_missing.shift(1,axis=1)
df_final = pd.concat([df_full, df_missing])
I was hoping to separate out the rows with missing data, shift their columns by one, and then recombine them with the rows that have no missing data. But I'm not sure how to shift the columns starting at a certain point. This is the result I'm trying to get to:
Age Gender Height Weight Eyes
0 18 Male 66 180 Brown
1 23 Female 67 120 Brown
2 16 NaN 71 192 Brown
3 22 Male 68 185 Brown
4 24 Female 62 100 Blue
It doesn't really matter how I get it done, but the files I'm using have thousands of rows so I cannot fix them individually. Any help is appreciated. Thanks!
Selectively shift a portion of each of the rows that have missing values.
# Series.append was removed in pandas 2.0; pd.concat is the modern equivalent.
df.apply(lambda r: pd.concat([r[:1], r[1:].shift()])
         if r['Gender'] not in ['Male', 'Female']
         else r, axis=1)
The misaligned column data for each affected record is realigned, with NaN inserted where the value was missing in the input text.
Age Gender Height Weight Eyes Age Gender Height Weight Eyes
1 23 Female 67 120 Brown 1 23 Female 67 120 Brown
2 16 71 192 Brown NaN ======> 2 16 NaN 71 192 Brown
For a single record, this'll do it:
df.loc[2] = pd.concat([df.loc[2][:1], df.loc[2][1:].shift()])
Starting at the 'Gender' column, data is shifted right. The default fill is 'NaN'. The 'Age' column is preserved.
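For all affected rows at once, the same idea can be expressed with a boolean mask and a single shift. A minimal runnable sketch on a tiny frame mirroring the question's data:

```python
import numpy as np
import pandas as pd

# Tiny frame mirroring the question; row 2 is the misaligned one.
df = pd.DataFrame(
    [[18, "Male", 66, 180, "Brown"],
     [23, "Female", 67, 120, "Brown"],
     [16, 71, 192, "Brown", np.nan]],
    columns=["Age", "Gender", "Height", "Weight", "Eyes"],
)

# For bad rows, shift everything after 'Age' one column to the right.
bad = ~df["Gender"].isin(["Male", "Female"])
cols = df.columns[1:]
df.loc[bad, cols] = df.loc[bad, cols].shift(1, axis=1)
```

This avoids a per-row `apply`, which matters for files with thousands of rows.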
RegEx could help here.
Searching for ^(\d+\|)(\d) and replacing with $1|$2 inserts one vertical bar (an empty Gender field) where Gender is missing: "group 1 + | + group 2".
This can be done in almost any text editor (Notepad++, VS Code, Sublime, etc.).
See the example at this link: https://regexr.com/50gkh
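The same substitution can be scripted in Python before the file is read; note that Python's re module writes backreferences in the replacement as \1, \2 rather than $1, $2. A minimal sketch on two sample lines:

```python
import re

rows = ["18|Male|66|180|Brown", "16|71|192|Brown"]
# Insert an empty Gender field when the second field starts with a digit.
fixed = [re.sub(r"^(\d+\|)(\d)", r"\1|\2", row) for row in rows]
# fixed[1] == "16||71|192|Brown"
```

read_csv then parses the empty field as NaN in the Gender column.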
Many of us know that the syntax for a Vlookup function on Excel is as follows:
=vlookup([lookup value], [lookup table/range], [column selected], [approximate/exact match (optional)])
I want to do something on Python with a lookup table (in dataframe form) that looks something like this:
Name Date of Birth ID#
Jack 1/1/2003 0
Ryan 1/8/2003 1
Bob 12/2/2002 2
Jack 3/9/2003 3
...and so on. Note how the two Jacks are assigned different ID numbers because they are born on different dates.
Say I have something like a gradebook (again, in dataframe form) that looks like this:
Name Date of Birth Test 1 Test 2
Jack 1/1/2003 89 91
Ryan 1/8/2003 92 88
Jack 3/9/2003 93 79
Bob 12/2/2002 80 84
...
How do I make it so that the result looks like this?
ID# Name Date of Birth Test 1 Test 2
0 Jack 1/1/2003 89 91
1 Ryan 1/8/2003 92 88
3 Jack 3/9/2003 93 79
2 Bob 12/2/2002 80 84
...
It seems to me that the "lookup value" would involve multiple columns of data ('Name' and 'Date of Birth'). I kind of know how to do this in Excel, but how do I do it in Python?
Turns out that I can just do
pd.merge([lookup value], [lookup table], on=['Name', 'Date of Birth'])
which produces
Name Date of Birth Test 1 Test 2 ID#
Jack 1/1/2003 89 91 0
Ryan 1/8/2003 92 88 1
Jack 3/9/2003 93 79 3
Bob 12/2/2002 80 84 2
...
Then all that's needed is to move the last column to the front.
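A runnable sketch of the whole lookup, using small frames built from the tables above, including the column reorder:

```python
import pandas as pd

# Lookup table: (Name, Date of Birth) -> ID#
lookup = pd.DataFrame({
    "Name": ["Jack", "Ryan", "Bob", "Jack"],
    "Date of Birth": ["1/1/2003", "1/8/2003", "12/2/2002", "3/9/2003"],
    "ID#": [0, 1, 2, 3],
})
# Gradebook to be annotated with ID#
grades = pd.DataFrame({
    "Name": ["Jack", "Ryan", "Jack", "Bob"],
    "Date of Birth": ["1/1/2003", "1/8/2003", "3/9/2003", "12/2/2002"],
    "Test 1": [89, 92, 93, 80],
    "Test 2": [91, 88, 79, 84],
})

# Merge on both key columns, then move ID# to the front.
merged = pd.merge(grades, lookup, on=["Name", "Date of Birth"])
merged = merged[["ID#"] + [c for c in merged.columns if c != "ID#"]]
```

Merging on the list of both columns is what makes the two Jacks resolve to different IDs.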
This is my data frame:
Name Age Stream Percentage
0 A 21 Math 88
1 B 19 Commerce 92
2 C 20 Arts 95
3 D 18 Biology 70
0 E 21 Math 88
1 F 19 Commerce 92
2 G 20 Arts 95
3 H 18 Biology 70
I want to write a different Excel file for each subject in one loop, so I should get 4 Excel files, one per subject.
I tried this, but it didn't work:
n = 0
for subjects in df.stream:
    df.to_excel("sub" + str(n) + ".xlsx")
    n += 1
I think groupby is helpful here, and you can use enumerate to keep track of the index.
for i, (group, group_df) in enumerate(df.groupby('stream')):
    group_df.to_excel('sub{}.xlsx'.format(i))
    # Alternatively, to name the file based on the stream...
    group_df.to_excel('sub{}.xlsx'.format(group))
group is going to be the name of the stream.
group_df is going to be a sub-dataframe containing all the data in that group.
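To see what groupby yields without actually writing files (to_excel also requires an engine such as openpyxl to be installed), here is a small self-contained sketch that collects each group into a dict instead:

```python
import pandas as pd

# Hypothetical frame with two rows per stream, mirroring the question.
df = pd.DataFrame({
    "Name": list("ABCDEFGH"),
    "stream": ["Math", "Commerce", "Arts", "Biology"] * 2,
})

# Stand-in for group_df.to_excel(...): collect each group by its filename.
groups = {}
for group, group_df in df.groupby("stream"):
    groups[f"sub{group}.xlsx"] = group_df
```

Each key is the filename that would be written; each value is the sub-DataFrame that would land in that file.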
I created a dataframe and I am trying to substitute the categorical variables with numerical values that I calculated via pivot tables. In my code, I iterate through the whole dataframe: if a cell in one of the categorical columns matches an element of 'sublist_names', it should be replaced by the element of 'sublist_values' at the same position.
For example, the first value of the 'Name' column is the string 'tom'. 'tom' is the 7th element in 'sublist_names', which means it should be replaced by the 7th element in 'sublist_values', which is 150.
I was able to obtain all the needed values, but I am not sure how to do this last step by iterating over the whole dataframe instead of working column by column.
I hope I explained clearly, but for any questions feel free to ask.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = [['tom', 10, 6, 'brown', 200],
        ['nick', 15, 5.10, 'red', 150],
        ['juli', 14, 5.5, 'black', 170],
        ['peter', 10, 6, 'blue', 290],
        ['axel', 15, 5.10, 'yellow', 190],
        ['william', 14, 5.5, 'yellow', 170],
        ['tom', 10, 6, 'orange', 100],
        ['tom', 15, 5.10, 'brown', 150],
        ['angela', 14, 5.5, 'black', 160],
        ['peter', 10, 6, 'purple', 220],
        ['nick', 15, 5.10, 'orange', 150],
        ['aroon', 14, 5.5, 'red', 170]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'height', 'color', 'weight'])
categorical_variables = df.select_dtypes('object')  # categorical variables
categ_var_list = list(categorical_variables)
print(categ_var_list)
condition_pivot_list_names = []
pivot_values_list = []
for i in categ_var_list:
    condition_pivot = df.pivot_table(index=i, values='weight', aggfunc=np.mean)
    pivot_names = condition_pivot.index.values.tolist()
    condition_pivot_list_names.append(pivot_names)
    pivot_values_draft = condition_pivot.values.tolist()
    pivot_values = [i[0] for i in pivot_values_draft]
    pivot_values_list.append(pivot_values)
print(condition_pivot_list_names, 'condition pivot list names')
print(pivot_values_list, 'pivot values list')
sublist_names = [sublists for sublists in condition_pivot_list_names]
print(sublist_names)
sublist_values = [sublists1 for sublists1 in pivot_values_list]
print(sublist_values)
def myfunc(x):
    if x in sublist_names:
        index = sublist_names.index(x)
        return sublist_values[index]
    return x
df['Name'] = df['Name'].apply(lambda x: myfunc(x))
print(df['Name'])
This is what print(df['Name']) shows:
0 tom
1 nick
2 juli
3 peter
4 axel
5 william
6 tom
7 tom
8 angela
9 peter
10 nick
11 aroon
And this is what should show:
0 150
1 150
2 170
3 255
4 190
5 170
6 150
7 150
8 160
9 255
10 150
11 170
You have two categorical columns, Name and color. So you can do something like this:
df['Name'] = df['Name'].apply(lambda x: myfunc(x))
Then you create a function myfunc() which receives x from the code above. The code above iterates over the column and passes each row's value to the function, one by one. Inside the function you define the logic that converts the categorical value, something like this:
def myfunc(x):
    if x in sublist_names:
        index = sublist_names.index(x)
        return sublist_values[index]
    return x
Do the same thing for the color column.
Try this:
df.Name = np.where(df.groupby('Name', as_index=False)['Name'].cumcount().eq(0), df.Name, df.weight)
Output:
Name Age height color weight
0 tom 10 6.0 brown 200
1 nick 15 5.1 red 150
2 juli 14 5.5 black 170
3 peter 10 6.0 blue 290
4 axel 15 5.1 yellow 190
5 william 14 5.5 yellow 170
6 100 10 6.0 orange 100
7 150 15 5.1 brown 150
8 angela 14 5.5 black 160
9 220 10 6.0 purple 220
10 150 15 5.1 orange 150
11 aroon 14 5.5 red 170
Okay, I see your problem. Just put the code below before the function declaration; it flattens the nested lists so the membership test works on plain values.
sub_names = []
sub_values = []
for i in sublist_names:
    sub_names.extend(i)
for i in sublist_values:
    sub_values.extend(i)
Also, don't forget to update the variable names in myfunc().
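As an alternative to flattening the lists at all, the same name-to-mean-weight replacement can be done directly with groupby and map. A sketch on a trimmed frame with just the relevant columns:

```python
import pandas as pd

# Trimmed frame: only the Name and weight columns from the question.
data = [["tom", 200], ["nick", 150], ["juli", 170], ["peter", 290],
        ["axel", 190], ["william", 170], ["tom", 100], ["tom", 150],
        ["angela", 160], ["peter", 220], ["nick", 150], ["aroon", 170]]
df = pd.DataFrame(data, columns=["Name", "weight"])

# Replace each name with the mean weight of all rows sharing that name.
df["Name"] = df["Name"].map(df.groupby("Name")["weight"].mean())
```

The same pattern applied to the color column handles the second categorical variable, with no intermediate pivot lists needed.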
I'm new to Python and to working with data manipulation.
I have a dataframe
df3
Out[22]:
Breed Lifespan
0 New Guinea Singing Dog 18
1 Chihuahua 17
2 Toy Poodle 16
3 Jack Russell Terrier 16
4 Cockapoo 16
.. ... ...
201 Whippet 12--15
202 Wirehaired Pointing Griffon 12--14
203 Xoloitzcuintle 13
204 Yorkie--Poo 14
205 Yorkshire Terrier 14--16
As you observe above, some of the lifespans are in a range like 14--16. The datatype of [Lifespan] is
type(df3['Lifespan'])
Out[24]: pandas.core.series.Series
I want it to reflect the average of these two numbers i.e. 15. I do not want any ranges. Just the average as a single digit. How do I do this?
Using split and expand=True
df = pd.DataFrame({'Breed': ['Dog1', 'Dog2'],
                   'Lifespan': [12, '14--15']})
df['Lifespan'] = (df['Lifespan']
                  .astype(str).str.split('--', expand=True)
                  .astype(float).mean(axis=1)
                  )
df
# Breed Lifespan
# 0 Dog1 12.0
# 1 Dog2 14.5
I am a beginner. I've looked all over and read a bunch of related questions but can't quite figure this out. I know I am the problem and that I'm missing something, but I'm hoping someone will be kind and help me out. I am attempting to convert data from one video game (a college basketball simulation) into data consistent with another video game's (pro basketball simulation) format.
I have a DF that has columns:
Name, Pos, Height, Weight, Shot, Points
With values such as:
Jon Smith, C, 84, 235, Exc, 19.4
Greg Jones, PG, 72, 187, Fair, 12.0
I want to create a new column, "InsideScoring". What I'd like to do is assign each player a randomly generated number within a certain range based on the position they played, height, weight, shot rating, and points scored.
I tried a bunch of attempts like:
df1['InsideScoring'] = 0
df1.loc[(df1.Pos == "C") &
        (df1.Height > 82) &
        (df1.Points > 19.0) &
        (df1.Weight > 229), 'InsideScoring'] = np.random.randint(85, 100)
When I do this, all the players (row at column "InsideScoring") that meet the criteria get assigned the same value between 85 and 100 rather than a random mix of numbers between 85 and 100.
Eventually, what I want to do is go through the list of players and based on those four criteria, assign values from different ranges. Any ideas appreciated.
Pandas: Create a new column with random values based on conditional
Numpy "where" with multiple conditions
My recommendation would be to use np.select here. You set up your conditions, your outputs, and you're good to go. However, to avoid iteration, but also to avoid assigning the same random value to every column that meets the condition, create random values equal to the length of your DataFrame, and select from those:
Setup
df = pd.DataFrame({
    'Name': ['Chris', 'John'],
    'Height': [72, 84],
    'Pos': ['PG', 'C'],
    'Weight': [165, 235],
    'Shot': ['Amazing', 'Fair'],
    'Points': [999, 25]
})
Name Height Pos Weight Shot Points
0 Chris 72 PG 165 Amazing 999
1 John 84 C 235 Fair 25
Now set up your ranges and your conditions (Create as many of these as you like):
cond1 = df.Pos.eq('C') & df.Height.gt(80) & df.Weight.gt(200)
cond2 = df.Pos.eq('PG') & df.Height.lt(80) & df.Weight.lt(200)
range1 = np.random.randint(85, 100, len(df))
range2 = np.random.randint(50, 85, len(df))
df.assign(InsideScoring=np.select([cond1, cond2], [range1, range2]))
Name Height Pos Weight Shot Points InsideScoring
0 Chris 72 PG 165 Amazing 999 72
1 John 84 C 235 Fair 25 89
Now to verify this doesn't assign values more than once:
df = pd.concat([df]*5)
... # Setup the ranges and conditions again
df.assign(InsideScoring=np.select([cond1, cond2], [range1, range2]))
Name Height Pos Weight Shot Points InsideScoring
0 Chris 72 PG 165 Amazing 999 56
1 John 84 C 235 Fair 25 96
0 Chris 72 PG 165 Amazing 999 74
1 John 84 C 235 Fair 25 93
0 Chris 72 PG 165 Amazing 999 63
1 John 84 C 235 Fair 25 97
0 Chris 72 PG 165 Amazing 999 55
1 John 84 C 235 Fair 25 95
0 Chris 72 PG 165 Amazing 999 60
1 John 84 C 235 Fair 25 90
And we can see that random values are assigned, even though they all match one of two conditions. While this is less memory efficient than iterating and picking a random value, since we are creating a lot of unused numbers, it will still be faster as these are vectorized operations.
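Condensing the approach above into one self-contained script (the generator seed is only there so a run is reproducible; drop it for real use):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Ten rows alternating between the two example players.
df = pd.DataFrame({
    "Name": ["Chris", "John"] * 5,
    "Pos": ["PG", "C"] * 5,
    "Height": [72, 84] * 5,
    "Weight": [165, 235] * 5,
})

cond1 = df.Pos.eq("C") & df.Height.gt(80) & df.Weight.gt(200)
cond2 = df.Pos.eq("PG") & df.Height.lt(80) & df.Weight.lt(200)
# One random draw per row, so rows matching the same condition
# still get different values.
range1 = rng.integers(85, 100, len(df))
range2 = rng.integers(50, 85, len(df))
df["InsideScoring"] = np.select([cond1, cond2], [range1, range2])
```

Rows matching neither condition fall through to np.select's default of 0, matching the original `df1['InsideScoring'] = 0` initialization.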