Hello, I'm trying to get the previous row's value in a specific column if that value contains "-".
This is my code:
count = -1
for i, row in df1.iterrows():
    count = count + 1
    if row["SUBCAPITULO"] == " " and count > 0 and "-" in df1.loc[count-1:"SUBCAPITULO"]:
        row["SUBCAPITULO"] = df1.loc[count-1:"SUBCAPITULO"]
Use shift:
Sample:
>>> df
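   SUBCAPITULO
0        dash-
1
2       comma,
3
4         dot.
# Select the column explicitly so that only SUBCAPITULO is overwritten;
# na=False treats the missing value produced by shift() as "no dash".
df.loc[(df['SUBCAPITULO'] == ' ') &
       (df['SUBCAPITULO'].shift().str.contains('-', na=False)),
       'SUBCAPITULO'] = df['SUBCAPITULO'].shift()
>>> df
   SUBCAPITULO
0        dash-
1        dash-
2       comma,
3
4         dot.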
Desired output has not been posted and the request is unclear.
Taking these comments into account:
"trying to get the previous row value in a specific column if this value contains a dash"
I think the OP wants to fill the value containing the dash over multiple rows.
Maybe this helps...
import pandas as pd
df = pd.read_csv('test.csv')
print(df, '\n\n')
'''
Shows:
   Other_data       SUBCAPITULO
0        qwer               NaN
1        vfds               NaN
2        sdfg  1.01 – TORRE – 1
3        hfgt               NaN
4        jkiu          capitulo
5        bvcd  2.01 – TORRE – 1
6        grnc               NaN
7        sdfg          capitulo
8        poij               NaN
9        fghg  2.01 – TORRE – 1
'''
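# Walk upwards from the last row, copying a value that contains "–"
# into the row above it when that row is NaN.
for i in reversed(df.index):
    if i >= 1:
        if '–' in str(df.loc[i, 'SUBCAPITULO']):
            if str(df.loc[i-1, 'SUBCAPITULO']) == 'nan':
                df.loc[i-1, 'SUBCAPITULO'] = df.loc[i, 'SUBCAPITULO']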
print(df)
'''
Shows:
   Other_data       SUBCAPITULO
0        qwer  1.01 – TORRE – 1
1        vfds  1.01 – TORRE – 1
2        sdfg  1.01 – TORRE – 1
3        hfgt               NaN
4        jkiu          capitulo
5        bvcd  2.01 – TORRE – 1
6        grnc               NaN
7        sdfg          capitulo
8        poij  2.01 – TORRE – 1
9        fghg  2.01 – TORRE – 1
'''
print('\n')
df1.reset_index(drop=True, inplace=True)
for i in range(1, len(df1)):
    if df1.loc[i, 'SUBCAPITULO'] == " " and "-" in df1.loc[i-1, 'SUBCAPITULO']:
        df1.loc[i, 'SUBCAPITULO'] = df1.loc[i-1, 'SUBCAPITULO']
df1.dropna(inplace=True)
The problem was that I had to reset the index before the loop.
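To illustrate why the reset matters: .loc looks rows up by label, so i-1 only means "the previous row" when the labels are 0, 1, 2, and so on. Here is a minimal sketch with a made-up frame (the demo data and its gapped index are assumptions, not the OP's data):

```python
import pandas as pd

# Hypothetical frame whose index has gaps, e.g. after rows were filtered out.
demo = pd.DataFrame({"SUBCAPITULO": ["1.01 - TORRE", " ", "capitulo", " "]},
                    index=[0, 2, 5, 7])

try:
    demo.loc[1, "SUBCAPITULO"]      # label 1 does not exist in this index
    lookup_failed = False
except KeyError:
    lookup_failed = True

# After resetting, labels coincide with positions, so i-1 is the previous row.
demo = demo.reset_index(drop=True)
for i in range(1, len(demo)):
    if demo.loc[i, "SUBCAPITULO"] == " " and "-" in demo.loc[i - 1, "SUBCAPITULO"]:
        demo.loc[i, "SUBCAPITULO"] = demo.loc[i - 1, "SUBCAPITULO"]

print(lookup_failed)
print(demo["SUBCAPITULO"].tolist())
```

Only the blank row directly below a value containing "-" is filled; the last blank row follows "capitulo" and is left alone.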
Ideally, my data frame looks like this:
S.no  Names
1     Nanda Govind Gajre
2     deepmala mohan shinde
3     jyoti dakore
4     Sonavane Ashanamdev
5     VIMAL BHIKAJI RATHOD
6     ARCHAN DATTARAO KADAM
"Names" column is a combination of First name, Middle name, and Last name
Here, I want the first letter of each word to be in uppercase and the rest in lowercase.
Desired output:
S.no  Names
1     Nanda Govind Gajre
2     Deepmala Mohan Shinde
3     Jyoti Dakore
4     Sonavane Ashanamdev
5     Vimal Bhikaji Rathod
6     Archan Dattarao Kadam
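Example
import pandas as pd

# Build a DataFrame (not a plain dict) so that .apply and .str are available
df = pd.DataFrame({"S.no": [1, 2, 3, 4, 5, 6],
                   "Names": ["Nanda Govind Gajre", "deepmala mohan shinde",
                             "jyoti dakore", "Sonavane Ashanamdev",
                             "VIMAL BHIKAJI RATHOD", "ARCHAN DATTARAO KADAM"]})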
Apply str.title on "Names" column:
df["Names"] = df["Names"].apply(str.title)
print(df)
Prints:
   S.no                  Names
0     1     Nanda Govind Gajre
1     2  Deepmala Mohan Shinde
2     3           Jyoti Dakore
3     4    Sonavane Ashanamdev
4     5   Vimal Bhikaji Rathod
5     6  Archan Dattarao Kadam
Or:
df["Names"] = df["Names"].str.title()
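One practical difference between the two forms shows up when the column has missing values. The sketch below uses a hypothetical Series (not the OP's data) with one NaN: .str.title() passes the NaN through, while .apply(str.title) raises a TypeError because NaN is a float.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing name.
names = pd.Series(["deepmala mohan shinde", np.nan, "VIMAL BHIKAJI RATHOD"])

titled = names.str.title()          # the missing value is passed through

try:
    names.apply(str.title)          # str.title cannot be called on a float NaN
    apply_failed = False
except TypeError:
    apply_failed = True

print(titled.tolist())
print(apply_failed)
```

If the Names column may contain gaps, the .str accessor is therefore the safer of the two.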
I have this dataframe, df_pm:
                                Player  GameWeek  Minutes  \
PlayerMatchesDetailID
1                              Alisson         1       90
2                      Virgil van Dijk         1       90
3                         Joseph Gomez         1       90

     ForTeam   AgainstTeam  \
1  Liverpool  Norwich City
2  Liverpool  Norwich City
3  Liverpool  Norwich City

   Goals  ShotsOnTarget  ShotsInBox  CloseShots  \
1      0              0           0           0
2      1              1           1           1
3      0              0           0           0

   TotalShots  Headers  GoalAssists  ShotOnTargetCreated  \
1           0        0            0                    0
2           1        1            0                    0
3           0        0            0                    0

   ShotInBoxCreated  CloseShotCreated  TotalShotCreated  \
1                 0                 0                 0
2                 0                 0                 0
3                 0                 0                 1

   HeadersCreated
1               0
2               0
3               0
this second dataframe, df_melt:
   MatchID  GameWeek        Date             Team  Home  \
0    46605         1  2019-08-09        Liverpool  Home
1    46605         1  2019-08-09     Norwich City  Away
2    46606         1  2019-08-10  AFC Bournemouth  Home

          AgainstTeam
0        Norwich City
1           Liverpool
2    Sheffield United
3     AFC Bournemouth
...
575  Sheffield United
576  Newcastle United
577       Southampton
and this snippet, which uses both:
match_ids = []
home_away = []
dates = []
# For each row in the player matches dataframe...
for row in df_pm.itertuples():
    # Look up the match id from the team matches dataframe
    team = row.ForTeam
    againstteam = row.AgainstTeam
    gameweek = row.GameWeek
    print(team, againstteam, gameweek)
    match_id = df_melt.loc[(df_melt['GameWeek']==gameweek)
                           &(df_melt['Team']==team)
                           &(df_melt['AgainstTeam']==againstteam),
                           'MatchID'].item()
    date = df_melt.loc[(df_melt['GameWeek']==gameweek)
                       &(df_melt['Team']==team)
                       &(df_melt['AgainstTeam']==againstteam),
                       'Date'].item()
    home = df_melt.loc[(df_melt['GameWeek']==gameweek)
                       &(df_melt['Team']==team)
                       &(df_melt['AgainstTeam']==againstteam),
                       'Home'].item()
    match_ids.append(match_id)
    home_away.append(home)
    dates.append(date)
On the first iteration, this prints:
Liverpool Norwich City 1
But then I get this error:
Traceback (most recent call last):
  File "tableau_data_generation.py", line 166, in <module>
    'MatchID'].item()
  File "/Users/me/anaconda2/envs/data_science/lib/python3.7/site-packages/pandas/core/base.py", line 652, in item
    return self.values.item()
ValueError: can only convert an array of size 1 to a Python scalar
Printing the whole df_melt dataframe, I see that these four date values are flawed:
540    46875        28  TBC       Aston Villa  Home
541    46875        28  TBC  Sheffield United  Away
...
548    46879        28  TBC   Manchester City  Home
549    46879        28  TBC           Arsenal  Away
How do I fix this?
When you call item() on a Series in pandas 0.25, you should also receive:
FutureWarning: `item` has been deprecated and will be removed in a future version
Series.item() was deprecated in version 0.25.0 (the deprecation was reverted
in 1.0.0), so check which version of Pandas you use and consider upgrading it.
The ValueError itself means that the selection did not contain exactly one
element: item() only works on a Series (or a Numpy array) of size 1.
You can still call item() on the underlying Numpy array, but it has the
same size-1 requirement:
df_melt.loc[...].values.item()
Another option is to use iloc[0], which returns the first match even when
several rows meet the criteria, so you can also change your code to:
df_melt.loc[...].iloc[0]
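To see how the two accessors behave for zero, one, or several matching rows, here is a small sketch. The miniature df_melt and the select and raises helpers are made up for illustration, not taken from the question:

```python
import pandas as pd

# Hypothetical miniature of df_melt with one fixture listed twice.
df_melt = pd.DataFrame({
    "MatchID": [46605, 46605, 46700, 46701],
    "GameWeek": [1, 1, 2, 2],
    "Team": ["Liverpool", "Norwich City", "Arsenal", "Arsenal"],
})

def select(gameweek, team):
    """Return the MatchID values for the given gameweek and team."""
    mask = (df_melt["GameWeek"] == gameweek) & (df_melt["Team"] == team)
    return df_melt.loc[mask, "MatchID"]

def raises(func, exc):
    """Return True if calling func() raises exc."""
    try:
        func()
    except exc:
        return True
    return False

one = select(1, "Liverpool")     # exactly one match
none = select(3, "Liverpool")    # no match
two = select(2, "Arsenal")       # two matches

print(one.item())                                 # single element: works
print(raises(none.item, ValueError))              # empty selection
print(raises(two.item, ValueError))               # more than one element
print(two.iloc[0])                                # iloc[0] takes the first match
print(raises(lambda: none.iloc[0], IndexError))   # but fails on an empty selection
```

So the error in the question points at a lookup that matched either no row or more than one row; printing the selection for the failing combination should show which.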
Edit
The iloc[0] solution above can still raise an exception (IndexError) if no
row in df_melt meets the given criteria.
To make your code robust against such cases (and return some default value),
you can add a function that gets the given attribute (attr, actually a
column) from the first row meeting the given criteria (gameweek, team,
and againstteam):
def getAttr(gameweek, team, againstteam, attr, default=None):
    xx = df_melt.loc[(df_melt['GameWeek'] == gameweek)
                     & (df_melt['Team'] == team)
                     & (df_melt['AgainstTeam'] == againstteam)]
    return default if xx.empty else xx.iloc[0].loc[attr]
Then, instead of all three ... = df_melt.loc[...].item() statements, run:
match_id = getAttr(gameweek, team, againstteam, 'MatchID', default=-1)
date = getAttr(gameweek, team, againstteam, 'Date')
home = getAttr(gameweek, team, againstteam, 'Home', default='????')
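As an alternative to looking rows up one at a time, the same three columns can probably be attached in one step with a left merge. This swaps the loop for a different technique; it is only a sketch on made-up miniatures of the two frames, and the lookup name is hypothetical:

```python
import pandas as pd

# Hypothetical miniatures of the two frames from the question.
df_pm = pd.DataFrame({
    "Player": ["Alisson", "Virgil van Dijk", "Unknown Player"],
    "GameWeek": [1, 1, 1],
    "ForTeam": ["Liverpool", "Liverpool", "Everton"],
    "AgainstTeam": ["Norwich City", "Norwich City", "Chelsea"],
})
df_melt = pd.DataFrame({
    "MatchID": [46605, 46605],
    "GameWeek": [1, 1],
    "Date": ["2019-08-09", "2019-08-09"],
    "Team": ["Liverpool", "Norwich City"],
    "Home": ["Home", "Away"],
    "AgainstTeam": ["Norwich City", "Liverpool"],
})

# Keep one row per (GameWeek, Team, AgainstTeam) so the merge cannot
# multiply player rows if a fixture is listed twice.
lookup = (df_melt
          .drop_duplicates(["GameWeek", "Team", "AgainstTeam"])
          .rename(columns={"Team": "ForTeam"}))

# how="left" keeps every player row; unmatched rows get NaN instead of an error.
merged = df_pm.merge(lookup, on=["GameWeek", "ForTeam", "AgainstTeam"], how="left")

print(merged[["Player", "MatchID", "Date", "Home"]])
```

Rows without a matching fixture end up with NaN in MatchID, Date and Home, which plays the role of the default value in getAttr.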
I am adding a mock dataframe to exemplify my problem.
I have a large dataframe in which some columns are missing values.
I would like to create some extra boolean columns in which 1 corresponds to a non-missing value in the row and 0 corresponds to a missing value.
import numpy as np
import pandas as pd

names = ['Banana, Andrew Something (Maria Banana)', np.nan, 'Willis, Mr. Bruce (Demi Moore)', 'Crews, Master Terry', np.nan]
room = [100, 330, 212, 111, 222]
hotel_loon = {'Name' : pd.Series(names), 'Room' : pd.Series(room)}
hotel_loon_df = pd.DataFrame(hotel_loon)
In another question I found on Stack Overflow, the answers were thorough and clear about how to keep track of all the columns that have missing values, but not of specific ones.
I tried a few variations of that code (namely using where), but I did not manage to create what I wanted, which would be something like this:
                                      Name  Room  Name_present  Room_present
0  Banana, Andrew Something (Maria Banana)   100             1             1
1                                      NaN   330             0             1
2           Willis, Mr. Bruce (Demi Moore)   212             1             1
3                      Crews, Master Terry   111             1             1
4                                      NaN   222             0             1
Thank you for your time. I am sure that in the end it is going to be trivial, but for some reason I got stuck.
To save some typing, use DataFrame.notnull, add some suffixes, and join the result back.
pd.concat([df, df.notnull().astype(int).add_suffix('_present')], axis=1)
                                      Name  Room  Name_present  Room_present
0  Banana, Andrew Something (Maria Banana)   100             1             1
1                                      NaN   330             0             1
2           Willis, Mr. Bruce (Demi Moore)   212             1             1
3                      Crews, Master Terry   111             1             1
4                                      NaN   222             0             1
You can use .isnull() for your case, and change the type from bool to int:
hotel_loon_df['Name_present'] = (~hotel_loon_df['Name'].isnull()).astype(int)
hotel_loon_df['Room_present'] = (~hotel_loon_df['Room'].isnull()).astype(int)
Out[1]:
                                      Name  Room  Name_present  Room_present
0  Banana, Andrew Something (Maria Banana)   100             1             1
1                                      NaN   330             0             1
2           Willis, Mr. Bruce (Demi Moore)   212             1             1
3                      Crews, Master Terry   111             1             1
4                                      NaN   222             0             1
The ~ operator negates the boolean mask element-wise, so ~isnull() marks the values that are present.
If you are only tracking NaN fields, you can use the isnull() function.
# Map the boolean result to 0 (missing) / 1 (present) and assign it back,
# instead of calling replace(..., inplace=True) on a selected column.
df['name_present'] = df['name'].isnull().map({True: 0, False: 1})
df['room_present'] = df['room'].isnull().map({True: 0, False: 1})
We can do this in a concise manner by using DataFrame.isnull:
hotel_loon_df[['Name_present', 'Room_present']] = (~hotel_loon_df.isnull()).astype(int)
                                      Name  Room  Name_present  Room_present
0  Banana, Andrew Something (Maria Banana)   100             1             1
1                                      NaN   330             0             1
2           Willis, Mr. Bruce (Demi Moore)   212             1             1
3                      Crews, Master Terry   111             1             1
4                                      NaN   222             0             1
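Since the question asks about tracking specific columns rather than all of them, here is a minimal sketch that restricts the same idea to a chosen subset. The cols_to_track list is a hypothetical name introduced for illustration:

```python
import numpy as np
import pandas as pd

hotel_loon_df = pd.DataFrame({
    "Name": ["Banana, Andrew Something (Maria Banana)", np.nan,
             "Willis, Mr. Bruce (Demi Moore)", "Crews, Master Terry", np.nan],
    "Room": [100, 330, 212, 111, 222],
})

# Hypothetical choice: only track the columns listed here.
cols_to_track = ["Name"]

# 1 where a value is present, 0 where it is missing, with a suffix per column.
flags = hotel_loon_df[cols_to_track].notnull().astype(int).add_suffix("_present")
result = hotel_loon_df.join(flags)

print(result)
```

Adding "Room" to cols_to_track would produce a Room_present column as well.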
I have searched a lot but couldn't find a solution to this particular case. I want to remove any rows whose lists contain three or fewer strings (items). The issue is described in more detail further down.
I'm preparing an LDA topic model on a large Swedish database in pandas and have limited the test case to 1000 rows. I'm only concerned with one specific column, and my approach so far has been as follows:
import sqlite3
import pandas as pd

con = sqlite3.connect('/Users/mo/EXP/NAV/afm.db')
sql = """
select * from stillinger limit 1000
"""
dfs = pd.read_sql(sql, con)
plb = """
select PLATSBESKRIVNING from stillinger limit 1000
"""
dfp = pd.read_sql(plb, con); dfp
Then I've defined two regular expressions. The first pattern replaces every run of characters that are not letters with a space, while keeping the Swedish and Norwegian language-specific letters. The second pattern removes words of three characters or fewer:
rep = {
    'PLATSBESKRIVNING': {
        r'[^A-Za-zÅåÄäÖöÆØÅæøå]+': ' ',
        r'\W*\b\w{1,3}\b': ' '}
}
p0 = (pd.DataFrame(dfp['PLATSBESKRIVNING'].str.lower()).replace(rep, regex=True).
      drop_duplicates('PLATSBESKRIVNING').reset_index(drop=True)); p0
PLATSBESKRIVNING
0 medrek rekrytering söker uppdrag manpower h...
1 familj barn tjejer kille söker pair ...
2 uppgift blir tillsammans medarbetare leda ...
3 behov operasjonssykepleiere langtidsoppdr...
4 detta perfekta jobbet arbetstiderna vardaga...
5 familj paris barn söker älskar barn v...
6 alla inom cafe restaurang förekommande arbets...
.
.
Creating a pandas Series:
s0 = p0['PLATSBESKRIVNING']
Then:
ts = s0.str.lower().str.split();ts
0 [medrek, rekrytering, söker, uppdrag, manpower...
1 [familj, barn, tjejer, kille, söker, pair, vil...
2 [uppgift, blir, tillsammans, medarbetare, leda...
3 [behov, operasjonssykepleiere, langtidsoppdrag...
4 [detta, perfekta, jobbet, arbetstiderna, varda...
5 [familj, paris, barn, söker, älskar, barn, vil...
6 [alla, inom, cafe, restaurang, förekommande, a...
7 [diskare, till, cafe, dubbel, sökes, arbetet, ...
8 [diskare, till, thelins, konditori, sökes, arb...
Removing the stop words (mswl is my stop word list):
r = s0.str.split().apply(lambda x: [item for item in x if item not in mswl]);r
0 [uppdrag, bemanningsföretag, erbjuds, tillägg,...
1 [föräldrarna, citycentre, stort, tomt, mamman,...
2 [utveckla, övergripande, strategiska, frågor, ...
3 [erfaring, sykepleier, legitimasjon]
4 [arbetstiderna, vardagar, härliga, människor, ...
5 [paris, utav, badrum, båda, yngsta, endast, fö...
6 [förekommande, emot, utbildning]
7 []
8 [thelins]
9 [paris, baby, månader, våning, delar, badrum, ...
Creating a new DataFrame and removing the empty brackets:
dr = pd.DataFrame(r)
dr0 = dr[dr.astype(str)['PLATSBESKRIVNING'] != '[]'].reset_index(drop=True); dr0
PLATSBESKRIVNING
0 [uppdrag, bemanningsföretag, erbjuds, tillägg,...
1 [föräldrarna, citycentre, stort, tomt, mamman,...
2 [utveckla, övergripande, strategiska, frågor, ...
3 [erfaring, sykepleier, legitimasjon]
4 [arbetstiderna, vardagar, härliga, människor, ...
5 [paris, utav, badrum, båda, yngsta, endast, fö...
6 [förekommande, emot, utbildning]
7 [thelins]
8 [paris, baby, månader, våning, delar, badrum, ...
Maintaining the string:
dr1 = dr0['PLATSBESKRIVNING'].apply(str); len(dr1),type(dr1), dr1
0 ['uppdrag', 'bemanningsföretag', 'erbjuds', 't...
1 ['föräldrarna', 'citycentre', 'stort', 'tomt',...
2 ['utveckla', 'övergripande', 'strategiska', 'f...
3 ['erfaring', 'sykepleier', 'legitimasjon']
4 ['arbetstiderna', 'vardagar', 'härliga', 'männ...
5 ['paris', 'utav', 'badrum', 'båda', 'yngsta', ...
6 ['förekommande', 'emot', 'utbildning']
7 ['thelins']
8 ['paris', 'baby', 'månader', 'våning', 'delar'...
My issue now is that I want to remove any rows whose lists contain three or fewer strings, e.g. rows 3, 6 and 7. The desired result would look like this:
0 ['uppdrag', 'bemanningsföretag', 'erbjuds', 't...
1 ['föräldrarna', 'citycentre', 'stort', 'tomt',...
2 ['utveckla', 'övergripande', 'strategiska', 'f...
3 ['arbetstiderna', 'vardagar', 'härliga', 'männ...
4 ['paris', 'utav', 'badrum', 'båda', 'yngsta', ...
5 ['paris', 'baby', 'månader', 'våning', 'delar'...
.
.
How can I obtain this? I'm also wondering whether this could be done in a neater way, since my approach seems clumsy and cumbersome.
I would also like to remove both the index and the column name for the LDA topic modelling, so that I can write the result to a text file without the header and the index numbers. I have tried:
dr1.to_csv('LDA1.txt',header=None,index=False)
But this wraps each list of strings in quotation marks, "['word1', 'word2', 't.. ]", in the file.
Any suggestions would be much appreciated.
Best regards
Mo
Just measure the number of items in each list and filter out the rows with a length of 3 or less:
dr0['length'] = dr0['PLATSBESKRIVNING'].apply(len)
cond = dr0['length'] > 3
dr0 = dr0[cond]
You can use apply(len) to build the filter and then store the selected rows in whichever dataframe variable you like, i.e.
df[df['PLATSBESKRIVNING'].apply(len) > 3]
Output :
PLATSBESKRIVNING
0 [uppdrag, bemanningsföretag, erbjuds, nice]
1 [föräldrarna, citycentre, stort, tomt]
2 [utveckla, övergripande, strategiska, fince]
4 [arbetstiderna, vardagar, härliga, männ]
5 [paris, utav, badrum, båda, yngsta]
8 [paris, baby, månader, våning, delar]
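The second part of the question (writing the result to a text file without header, index or quotation marks) is not covered above. One possible approach, sketched here on made-up data, is to join each list into a space-separated line and write the lines directly instead of going through to_csv. The sample lists and the temporary output path are assumptions for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the tokenised column from the question.
dr0 = pd.DataFrame({"PLATSBESKRIVNING": [
    ["uppdrag", "bemanningsföretag", "erbjuds", "tillägg"],
    ["erfaring", "sykepleier", "legitimasjon"],
    [],
    ["thelins"],
    ["paris", "baby", "månader", "våning", "delar"],
]})

tokens = dr0["PLATSBESKRIVNING"]
# Keep only rows with more than three tokens (this also drops empty lists).
kept = tokens[tokens.str.len() > 3].reset_index(drop=True)

# Join each list into one space-separated line of plain text.
lines = kept.str.join(" ")

# Hypothetical output location; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), "LDA1.txt")
with open(out_path, "w", encoding="utf-8") as fh:
    fh.write("\n".join(lines) + "\n")

with open(out_path, encoding="utf-8") as fh:
    written = fh.read().splitlines()
print(written)
```

Each line of the file then holds one document as plain words, with no brackets, quotes, header or index.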