Filling in a pandas column based on existing number of strings - python

I have a pandas data-frame that looks like this:
ID  Hobby     Name
1   Travel    Kevin
2   Photo     Andrew
3   Travel    Kevin
4   Cars      NaN
5   Photo     Andrew
6   Football  NaN
...           (1303 rows)
The number of distinct Names filled in might be larger than 2 as well. I would like to end up with the entire Name column filled, the NaN rows split equally among the existing names (or with one name getting +1 when the split is uneven). I already store the total number of names in a variable; in the above case it's 2. I tried filtering and counting by each name, but I don't know how to do this when the number of names is dynamic.
Expected Dataframe:
ID  Hobby     Name
1   Travel    Kevin
2   Photo     Andrew
3   Travel    Kevin
4   Cars      Kevin
5   Photo     Andrew
6   Football  Andrew
I tried: replacing NaN with 0 in the Name column using fillna, filtering the column to end up with a dataframe that has only the NaN fields, then using len(df) to get the number of NaNs, and from there creating 2 dataframes, each containing half of the df. But I think this approach is completely wrong, as I do not always have 2 names; there could be 2, 3, 4, etc. (this is given by a dictionary).
Any help highly appreciated
Thanks.

It's difficult to tell, but I think you need ffill:
df['Name'] = df['Name'].ffill()
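If the goal really is to split the missing rows equally among the existing names (rather than carry the last seen name forward), here is a minimal sketch of one way to do it, assuming any assignment order is acceptable (my addition, not part of the answer above):
import numpy as np

# Names already present in the column, e.g. ['Kevin', 'Andrew'].
names = df['Name'].dropna().unique()
mask = df['Name'].isna()

# np.resize repeats `names` cyclically until it covers every missing row,
# so each name receives an equal share (off by at most one).
df.loc[mask, 'Name'] = np.resize(names, mask.sum())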


Merging/concatenating two datasets on a specific column (different lengths) [duplicate]

I have two different datasets
df1
Name    Surname  Age  Address
Julian  Ross     34   Main Street
Mary    Jane     52   Cook Road
(1200 rows)
df2
Name    Country  Telephone
Julian  US       NA
(800 rows)
df1 contains the full list of unique names; df2 contains fewer rows, as many Names were not added.
I would like to get a final dataset with the full list of names in df1 (and all the fields that are there) plus the fields in df2. I would then expect a final dataset of length 1200, with some empty fields corresponding to the names missing from df2.
I have tried as follows:
pd.concat([df1.set_index('Name'),df2.set_index('Name')], axis=1, join='inner')
but it returns the length of the smaller dataset (i.e. 800).
I have also tried
df1.merge(df2, how = 'inner', on = ['Name'])
... same result.
I am not totally familiar with joining/merging/concatenating functions, even after reading the documentation at https://pandas.pydata.org/docs/user_guide/merging.html .
I know that this question will probably be a duplicate of some others, and I will be happy to delete it if necessary, but I would be really grateful if you could provide some help and explain how to get the expected result:
df
Name    Surname  Age  Address      Country  Telephone
Julian  Ross     34   Main Street  US       NA
Mary    Jane     52   Cook Road
IIUC, use pd.merge like below:
>>> df1.merge(df2, how='left', on='Name')
     Name Surname  Age      Address Country Telephone
0  Julian    Ross   34  Main Street      US       NaN
1    Mary    Jane   52    Cook Road     NaN       NaN
If you want to keep the number of rows of df1, you have to use how='left'; note that this preserves the length only when there are no duplicate names in df2.
Read Pandas Merging 101
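Side note (my addition, not part of the original answer): if you want the merge to fail loudly instead of silently duplicating rows when Name is not unique, merge accepts a validate argument:
# Raises pandas.errors.MergeError if 'Name' is not unique on both sides.
df1.merge(df2, how='left', on='Name', validate='one_to_one')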

Efficient Way to Find Partial Duplicates in Dataset

I have a large dataset (~20,000 rows) consisting of persons and their info, and am looking for a way to identify potential duplicate persons within this dataset. These duplicates are not necessarily perfect matches since they have been entered manually and some contain typos.
ex)
   LastName   MiddleName  FirstName  DOB
1  Farmer     Berry       Dave       1/1/2004
2  Place      D.          Tom        8/4/2001
3  Famrer     B.          Dave       01/01/2004
4  Ander      Kate        Linda      12/26/1954
5  Place jr.  David       Tom        8/4/2001
...
In this case rows 1 and 3, and rows 2 and 5, would need to be flagged as duplicates. The only solution I have been able to come up with is O(n^2): iterating through the entire dataset for each record, comparing fields for partial matches and flagging rows when the matching criteria are met.
Is there a more elegant solution for this?
Edit, for clarity: it is possible that none of the fields contain an exact match. People are manually entering all of these individuals very quickly, so there is a lot of opportunity for typos/incorrect information:
   LastName    MiddleName  FirstName  DOB
1  John-adams  T.          Samuel     1/15/2021
2  Jhon-adams  Tom         Sam        10/15/2021
These 2 rows should be flagged as potential duplicates.
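No answer was posted here, but a common way to get below O(n^2) is blocking: only compare rows that share some cheap key, then fuzzy-match within each block. A rough sketch of the idea using only pandas and the standard library, blocking on DOB and comparing first/last names only for brevity (note: as the edit points out, a DOB with a typo would escape its block, so in practice you would block on several keys and union the results):
import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({
    'LastName':  ['Farmer', 'Place', 'Famrer', 'Ander', 'Place jr.'],
    'FirstName': ['Dave', 'Tom', 'Dave', 'Linda', 'Tom'],
    'DOB':       ['1/1/2004', '8/4/2001', '01/01/2004', '12/26/1954', '8/4/2001'],
})
df['DOB'] = pd.to_datetime(df['DOB'])  # normalizes 1/1/2004 vs 01/01/2004

def similar(a, b, threshold=0.7):
    # Fuzzy string match; 0.7 is an arbitrary cut-off, tune it on real data.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

pairs = []
for _, block in df.groupby('DOB'):      # compare only rows sharing a DOB
    idx = block.index.tolist()
    for i, a in enumerate(idx):
        for b in idx[i + 1:]:
            if (similar(df.at[a, 'LastName'], df.at[b, 'LastName'])
                    and similar(df.at[a, 'FirstName'], df.at[b, 'FirstName'])):
                pairs.append((a, b))

print(pairs)  # [(1, 4), (0, 2)] -> rows 2 & 5 and rows 1 & 3 in the table above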

How do I subset with .isin (seems like it doesn't work properly)?

I'm a student at Moscow State University and I'm doing a small research project about suburban railroads. I crawled information from Wikipedia about all the stations in the Moscow region, and now I need to subset those that are Moscow Central Diameter 1 (a railway line) stations. I have a list of Diameter 1 stations (d1_names), and what I'm trying to do is subset the whole dataframe (suburban_rail) with the isin pandas method. The problem is that it returns only 2 stations (the first one and the last one), though I'm pretty sure there are more, because using str.contains with the supposedly absent stations returns what I was looking for (so they are in the dataframe). I've already checked the spelling and tried to apply strip() to each element of both the dataframe and the stations list. Attached are several screenshots of my code.
(Screenshots: the suburban_rail dataframe; the stations list used to subset; what isin returns; manual checks for the Bakovka and Nemchinovka stations.)
Thanks in advance!
Next time provide a minimal reproducible example, such as the one below:
suburban_rail = pd.DataFrame({'station_name': ['a','b','c','d'], 'latitude': [1,2,3,4], 'longitude': [10,20,30,40]})
d1_names = pd.Series(['a','c','d'])
suburban_rail
station_name latitude longitude
0 a 1 10
1 b 2 20
2 c 3 30
3 d 4 40
Now, to answer your question: using .loc the problem is solved:
suburban_rail.loc[suburban_rail.station_name.isin(d1_names)]
station_name latitude longitude
0 a 1 10
2 c 3 30
3 d 4 40
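For what it's worth (my addition, and only a guess, since the original data isn't shown): station names scraped from Wikipedia often carry stray whitespace or non-breaking spaces (\xa0), which makes isin miss visually identical strings. Normalizing both sides first usually fixes it:
# Replace non-breaking spaces and trim whitespace on both sides
# before comparing with isin.
clean = suburban_rail['station_name'].str.replace('\xa0', ' ').str.strip()
d1_clean = d1_names.str.replace('\xa0', ' ').str.strip()
subset = suburban_rail.loc[clean.isin(d1_clean)]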

How do I use pandas to count the number of times a name and type occur together within a 60-day period from the first instance?

My dataframe is this:
Date        Name      Type   Description  Number
2020-07-24  John Doe  Type1  NaN          NaN
2020-08-10  Jo Doehn  Type1  NaN          NaN
2020-08-15  John Doe  Type1  NaN          NaN
2020-09-10  John Doe  Type2  NaN          NaN
2020-11-24  John Doe  Type1  NaN          NaN
I want the Number column to hold the instance number within the 60-day period. So for entry 1, Number should just be 1, since it's the first instance; the same goes for entry 2, since it's a different name. Entry 3, however, should have 2 in the Number column, since it's the second instance of John Doe and Type1 within the 60-day period starting 7/24 (the first instance date). Entry 4 would be 1 as well, since the Type is different. Entry 5 would also be 1, since it's outside the 60-day period starting 7/24; however, any entries after this with John Doe, Type1 would fall in a new 60-day period starting 11/24.
Sorry, I know this is a pretty loaded question with a lot of aspects to it, but I'm trying to get up to speed on dataframes again and I'm not sure where to begin.
As a starting point, you could create a pivot table. (The assign statement just creates a temporary column of ones, to support counting.) In the example below, each row is a date, and each column is a (name, type) pair.
Then, use the resample function (to get one row for every calendar day), and the rolling function (to sum the numbers in the 60-day window).
x = (df.assign(temp=1)
       .pivot_table(index='Date',            # 'Date' must be datetime dtype
                    columns=['Name', 'Type'],
                    values='temp',
                    aggfunc='count',
                    fill_value=0)
    )
x.resample('1d').sum().rolling(60).sum()
Can you post sample data in text format (for copy/paste)?
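Note that a trailing rolling window is not quite the rule described in the question, where the 60-day period is anchored at the first instance and restarts after it expires. A direct sketch of that rule (my own addition, assuming Date can be parsed and sorted):
import pandas as pd

def number_in_window(dates, window_days=60):
    # Count instances within a period anchored at the first instance;
    # once a date falls more than `window_days` after the anchor,
    # start a new period there and reset the count.
    out, anchor, count = [], None, 0
    for d in dates:
        if anchor is None or (d - anchor).days > window_days:
            anchor, count = d, 1
        else:
            count += 1
        out.append(count)
    return out

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
df['Number'] = (df.groupby(['Name', 'Type'])['Date']
                  .transform(lambda s: pd.Series(number_in_window(s), index=s.index)))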

how to deal with a copy-pasted table in pandas - reshaping a column vector

I have a table I copied from a webpage which, when pasted into LibreOffice Calc or Excel, occupies a single cell, and when pasted into a notebook becomes a 3507x1 column. If I import this as a pandas dataframe using pd.read_csv, I see the same 3507x1 column, and I'd now like to reshape it into the 501x7 array that it started as.
I thought I could recast it as a numpy array, reshape it as I am familiar with doing in numpy, and then put it back into a df, but the to_numpy methods of pandas seem to want to work with a Series object (not a DataFrame), and attempts to read the file into a Series using e.g.
ser= pd.Series.from_csv('billionaires')
led to tokenizing errors. Is there some simple way to do this? Maybe I should throw in the towel on this direction and read from the HTML instead?
A simple copy-paste does not give you any clear column separator, so it's impossible to do this easily. You have only spaces, but spaces may or may not appear inside the column values too (like in the name or country), so it's impossible to give DataFrame.read_csv a column separator. However, if I copy-paste the table into a file, I notice some regularity.
If you know regex, you can try using pandas.Series.str.extract. This method extracts capture groups in a regex pattern as columns of a DataFrame. The regex is applied to each element / string of the series.
You can then try to find a regex pattern to capture the various elements of the row to split them into separate columns.
df = pd.read_csv('data.txt', names=['A'])  # no header in the file
ss = df['A']
rdf = ss.str.extract(r'(\d)\s+(.+)(\$[\d\.]+B)\s+([+-]\$[\d\.]+[BM])\s+([+-]\$[\d\.]+B)\s+([\w\s]+)\s+([\w\s]+)')
Here I tried to write a regex for the table in the link; the result on the first rows seems pretty good.
0 1 2 3 4 5 6
0 1 Jeff Bezos $121B +$231M -$3.94B United States Technology
1 3 Bernard Arnault $104B +$127M +$35.7B France Consumer
2 4 Warren Buffett $84.9B +$66.3M +$1.11B United States Diversified
3 5 Mark Zuckerberg $76.7B -$301M +$24.6B United States Technology
4 6 Amancio Ortega $66.5B +$303M +$7.85B Spain Retail
5 7 Larry Ellison $62.3B +$358M +$13.0B United States Technology
6 8 Carlos Slim $57.0B -$331M +$2.20B Mexico Diversified
7 9 Francoise Bettencourt Meyers $56.7B -$1.12B +$10.5B France Consumer
8 0 Larry Page $55.7B +$393M +$4.47B United States Technology
I used DataFrame.read_csv to read the file, since Series.from_csv is deprecated.
I found that converting to a numpy array was far easier than I had realized: the numpy asarray method can handle a df (and, conveniently enough, it works for general objects, not just numbers).
df = pd.read_csv('billionaires', sep='\n')
print(df.shape)
# -> (3507, 1)
n = np.asarray(df)
m = np.reshape(n, [-1, 7])
df2 = pd.DataFrame(m)
df2.head()
0 1 2 3 4 \
0 0 Name Total net worth $ Last change $ YTD change
1 1 Jeff Bezos $121B +$231M -$3.94B
2 2 Bill Gates $107B -$421M +$16.7B
3 3 Bernard Arnault $104B +$127M +$35.7B
4 4 Warren Buffett $84.9B +$66.3M +$1.11B
5 6
0 Country Industry
1 United States Technology
2 United States Technology
3 France Consumer
4 United States Diversified
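As the question itself suggests, reading the HTML directly is often simpler than repairing the pasted column. A minimal sketch (the URL is a placeholder; pd.read_html also requires lxml or html5lib to be installed):
import pandas as pd

# pd.read_html returns a list of DataFrames, one per <table> on the page.
tables = pd.read_html('https://example.com/billionaires')  # placeholder URL
df = tables[0]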
