Create a new column in pandas using a value of a row - python

First of all, this is not a duplicate! I have searched through several SO questions as well as the Pandas docs, and I have not found anything conclusive on how to create a new column with a row value, like this and this!
Imagine I have the following table: I open an .xls and create a dataframe from it. As this is a small example taken from the real problem, I created this simple Excel table which can be easily reproduced:
What I want now is to find the row that has "Population Month Year" (I will be looking at different .xls files, but the structure is the same: population, month and year).
xls='population_example.xls'
sheet_name='Sheet1'
df = pd.read_excel(xls, sheet_name=sheet_name, header=0, skiprows=2)
df
What I thought is:
Get the value of that row with startswith
Create a column, parsing that value in Python and getting the month and year out of it.
I have tried several things similar to this:
dff=df[s.str.startswith('Population')]
dff
But the errors keep coming. The code above, specifically, raises this error:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
I have several guesses:
I am not properly understanding how Series in pandas work, even after reading the docs. I did not even think of using them, but startswith looks like the thing I am looking for.
If I handle this properly, I might get a NaN error, but I cannot use df.dropna() yet, as I would lose that row's value (Population April 2017)!
Edit:
The problem with using this:
df[df['Area'].str.startswith('Population')]
is that it will also check the NaN values.
And this:
df['Area'].str.startswith('Population')
Will give me a True/False/NaN set of values, which I am not sure how to use.

Thanks to @Erfan, I got to the solution:
Using the line of code from the comments properly, instead of the way I was trying, I managed to run:
dff=df[df['Area'].str.startswith('Population', na=False)]
dff
Which would output: Population and household forecasts, 2016 to 20... NaN NaN NaN NaN NaN NaN
Now I can access this value like
value=dff.iloc[0][0]
value
To get the string I was looking for: 'Population and household forecasts, 2016 to 2041, prepared by .id , the population experts, April 2019.'
And I can python around with this to create the desired column. Thank you!
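For completeness, a minimal sketch of that last parsing step, assuming the month name and four-digit year always sit at the end of the string as in the example above (the Month/Year column names are just illustrative):
import re

import pandas as pd

df = pd.DataFrame({'Area': ['Whatever1', 'Whatever2'],
                   'Population': [3867, 1675]})
value = ('Population and household forecasts, 2016 to 2041, prepared by .id , '
         'the population experts, April 2019.')

# Pull the trailing "Month Year" pair out of the sentence and broadcast it.
match = re.search(r'([A-Z][a-z]+)\s+(\d{4})\.?$', value)
if match:
    month, year = match.groups()
    df['Month'] = month
    df['Year'] = int(year)
print(df)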

You could try:
import pandas as pd
import numpy as np
pd.DataFrame({'Area': [f'Whatever{i+1}' for i in range(3)] + [np.nan, 'Population April 2017.'],
              'Population': [3867, 1675, 1904, np.nan, np.nan]}).to_excel('population_example.xls', index=False)
df = pd.read_excel('population_example.xls').fillna('')
population_date = (df[df.Area.str.startswith('Population')]
                   .Area.values[0]
                   .lstrip('Population ').rstrip('.').split())
Result:
['April', '2017']
Or (if Population Month Year is always on the last row):
df.iloc[-1, 0].lstrip('Population ').rstrip('.').split()
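One hedged caveat on the above: str.lstrip('Population ') strips any of those characters from the left rather than the literal word, so it happens to work for this string but can eat extra letters on other inputs. A safer sketch using a regex for the prefix:
import re

text = 'Population April 2017.'

# Remove the literal "Population " prefix and the trailing dot, then split.
month, year = re.sub(r'^Population\s+', '', text).rstrip('.').split()
print(month, year)  # April 2017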

Related

Dropping Rows that Contain a Specific String wrapped in square brackets?

I'm trying to drop rows in a column which contain strings wrapped in square brackets. I want to drop all values that contain the strings '[removed]' or '[deleted]'.
My df looks like this:
Comments
1 The main thing is the price appreciation of the token (this determines the gains or losses more
than anything). Followed by the ecosystem for the liquid staking asset, the more opportunities
and protocols that accept the asset as collateral, the better. Finally, the yield for staking
comes into play.
2 [deleted]
3 [removed]
4 I could be totally wrong, but sounds like destroying an asset and claiming a loss, which I
believe is fraudulent. Like someone else said, get a tax guy - for this year anyway and then
you'll know for sure. Peace of mind has value too.
I have tried df[df["Comments"].str.contains("removed")==False]
But when I try to save the dataframe, it is still not removed.
EDIT:
My full code
import pandas as pd
sol2020 = pd.read_csv("Solana_2020_Comments_Time_Adjusted.csv")
sol2021 = pd.read_csv("Solana_2021_Comments_Time_Adjusted.csv")
df = pd.concat([sol2021, sol2020], ignore_index=True, sort=False)
df[df["Comments"].str.contains("deleted")==False]
df[df["Comments"].str.contains("removed")==False]
Try this.
I have created a dataframe with a Comments column, using my own comments, but it should work for you:
import pandas as pd
sample_data = { 'Comments': ['first comment whatever','[deleted]','[removed]','last comments whatever']}
df = pd.DataFrame(sample_data)
data = df[df["Comments"].str.contains("deleted|removed")==False]
print(data)
Output I got:
Comments
0 first comment whatever
3 last comments whatever
You can do it like this:
new_df = df[~(df['Comments'].str.startswith('[') & df['Comments'].str.endswith(']'))].reset_index(drop=True)
Output:
>>> new_df
Comments
0 The main thing is the price appreciation of th...
3 I could be totally wrong, but sounds like dest...
That will remove all rows where the value of the Comments column for that row starts with [ and ends with ].
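A side note on the original code: the filtered frame was never assigned back to df, which is why the rows were still there when saving; and if the Comments column can contain missing values, str.contains will propagate NaN into the mask. A minimal sketch combining both points (assuming the column is named Comments):
import pandas as pd

df = pd.DataFrame({'Comments': ['first comment', '[deleted]', '[removed]', None]})

# na=False makes missing comments count as "no match", so NaN rows survive
# the filter instead of breaking the boolean mask; remember to assign back.
df = df[~df['Comments'].str.contains('deleted|removed', na=False)].reset_index(drop=True)
print(df)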

Selective summation of columns in a pandas dataframe

The COVID-19 tracking project (api described here) provides data on many aspects of the pandemic. Each row of the JSON is one day's data for one state. As many people know, the pandemic is hitting different states differently -- New York and its neighbors hardest first, with other states being hit later. Here is a subset of the data:
date,state,positive,negative
20200505,AK,371,22321
20200505,CA,56212,723690
20200505,NY,321192,707707
20200505,WY,596,10319
20200504,AK,370,21353
20200504,CA,54937,692937
20200504,NY,318953,688357
20200504,WY,586,9868
20200503,AK,368,21210
20200503,CA,53616,662135
20200503,NY,316415,669496
20200503,WY,579,9640
20200502,AK,365,21034
20200502,CA,52197,634606
20200502,NY,312977,646094
20200502,WY,566,9463
To get the entire data set I am doing this:
import pandas as pd
all_states = pd.read_json("https://covidtracking.com/api/v1/states/daily.json")
I would like to be able to summarize the data by adding up the values for one column, but only for certain states; and then adding up the same column, for the states not included before. I was able to do this, for instance:
not_NY = all_states[all_states['state'] != 'NY'].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
This creates a new dataframe from all_states, grouped by date, and summing for all the states that are not "NY". What I want to do, though, is exclude multiple states with something like a "not in" function (this doesn't work):
not_tristate = all_states[all_states['state'] not in ['NY','NJ','CT']].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
Is there a way to do that? An alternate approach I tried is to create a new dataframe as a pivot table, with one row per date, one column per state, like this:
pivot_states = all_states.pivot_table(index = 'gooddate', columns = 'state', values = 'hospitalizedCurrently', aggfunc='sum')
but this still leaves me with creating new columns from summing only some columns. In SQL, I would solve the problem like this:
SELECT all_states.Date AS [Date], Sum(IIf([all_states]![state] In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS tristate, Sum(IIf([all_states]![state] Not In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS not_tristate
FROM all_states
GROUP BY all_states.Date
ORDER BY all_states.Date;
The end result I am looking for is like this (using the sample data above and summing on the 'positive' column, with 'NY' standing in for 'tristate'):
date,not_tristate,tristate,total
20200502,53128,312977,366105
20200503,54563,316415,370978
20200504,55893,318953,374846
20200505,57179,321192,378371
Any help would be welcome.
To get the expected output, you can groupby on date and an np.where over whether the states isin the states you want, sum on positive, then unstack and assign to get the total column:
df_f = (all_states.groupby(['date',
                            np.where(all_states['state'].isin(["NY", "NJ", "CT"]),
                                     'tristate', 'not_tristate')])
                  ['positive'].sum()
                  .unstack()
                  .assign(total=lambda x: x.sum(axis=1)))
print(df_f)
not_tristate tristate total
date
20200502 53128 312977 366105
20200503 54563 316415 370978
20200504 55893 318953 374846
20200505 57179 321192 378371
Or with pivot_table, you get a similar result with:
print(all_states.assign(state=np.where(all_states['state'].isin(["NY", "NJ", "CT"]),
                                       'tristate', 'not_tristate'))
                .pivot_table(index='date', columns='state', values='positive',
                             aggfunc='sum', margins=True))
state not_tristate tristate All
date
20200502 53128 312977 366105
20200503 54563 316415 370978
20200504 55893 318953 374846
20200505 57179 321192 378371
All 220763 1269537 1490300
You can exclude multiple states by using isin with a NOT (~) sign:
all_states[~(all_states['state'].isin(["NY", "NJ", "CT"]))]
So, your code would be:
not_tristate = all_states[~(all_states['state'].isin(['NY','NJ','CT']))].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
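If you want the two groups side by side in the layout of the expected output, one hedged sketch (shown on a tiny stand-in frame; the real data would use hospitalizedCurrently instead of positive) is to run the isin filter and its negation and merge the two aggregations on date:
import pandas as pd

# Tiny frame standing in for all_states.
all_states = pd.DataFrame({
    'date': [20200505, 20200505, 20200504, 20200504],
    'state': ['CA', 'NY', 'CA', 'NY'],
    'positive': [56212, 321192, 54937, 318953],
})

tristate_mask = all_states['state'].isin(['NY', 'NJ', 'CT'])

tristate = (all_states[tristate_mask]
            .groupby('date', as_index=False)['positive'].sum()
            .rename(columns={'positive': 'tristate'}))
not_tristate = (all_states[~tristate_mask]
                .groupby('date', as_index=False)['positive'].sum()
                .rename(columns={'positive': 'not_tristate'}))

result = not_tristate.merge(tristate, on='date')
result['total'] = result['not_tristate'] + result['tristate']
print(result)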

pandas dataframe throwing an empty list

I have a table where the column names are not really organized: different years of data appear with different column numbers.
So I have to access the data through specific column names.
I am using this syntax to access a column.
df = df[["2018/12"]]
But when I just want to extract numbers under that column, using
df.iloc[0,0]
it throws an error like
single positional indexer is out-of-bounds
So I am using
df.loc[0]
but it has the column name with the numeric data.
How can I extract just the number of each row?
Below is the CSV data
Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12,Trend
Net Sales,"31,634","49,924","62,051","68,137","72,590",
""
Net increase,"-17,909","-16,962","-34,714","-26,220","-29,721",
Net Received,-,-,-,-,-,
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661",
When writing this dumb question, I was just a beginner who did not even know what I wanted to ask.
The OP's question comes down to "getting the row as a list", since he ended his post asking how to get the numbers (though he wrote "number", maybe by mistake) of each row.
The answer is that he made the mistake of using double square brackets in his example, and that caused problems.
The solution is to use df = df["2018/12"] instead of df = df[["2018/12"]].
As for the things I (me, at the time of writing this) mentioned, I will answer them one by one:
Let's say the table looks like this
Unnamed: 0 2018/12 country drives_right
0 US 809 United States True
1 AUS 731 Australia False
2 JAP 588 Japan False
3 IN 18 India False
4 RU 200 Russia True
5 MOR 70 Morocco True
6 EG 45 Egypt True
1> df = df[["2018/12"]]
: this will output a dataframe which only has the column "2018/12", plus the index column on the left side.
2> df.iloc[0,0]
Now, since from 1> we have a new dataframe with only one column (apart from the index column holding the index values), this will output the first element of that column.
In the example above, the outcome will be 809, since it's the first element of the column.
3>
But when I just want to extract numbers under that column, using
df.iloc[0,0]
-> doesn't make sense if you want to extract all the numbers. It will just output the single element 809 from the sub-dataframe you created using df = df[["2018/12"]].
it throws an error like
single positional indexer is out-of-bounds
Maybe you are confused about the outcome. (Maybe in this case "df" is the one from before your subset assignment df = df[["2018/12"]]?) Since df = df[["2018/12"]] outputs a dataframe, it will work fine.
3
So I am using
df.loc[0]
but it has the column name with the numeric data.
: Yes, df.loc[0] on df = df[["2018/12"]] will return the column name along with the first element of that column.
4.
How can I extract just the number of each row?
You mean "numbers" of each row right?
Use this:
print(df["2018/12"].values.tolist())
In terms of finding columns or rows with varying names, and then accessing each of those rows and columns, you should think about using regex, for example:
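A minimal sketch of that idea using DataFrame.filter, which accepts a regex over column labels (the pattern here just matches year/month style headers like 2018/12):
import pandas as pd

df = pd.DataFrame({'Closing Date': ['Net Sales', 'Net Paid'],
                   '2017/12': [68137, -9375],
                   '2018/12': [72590, -10661],
                   'Trend': ['', '']})

# Keep only the columns whose names look like YYYY/MM, whatever years the file has.
year_cols = df.filter(regex=r'^\d{4}/\d{2}$')
print(year_cols.columns.tolist())     # ['2017/12', '2018/12']
print(year_cols['2018/12'].tolist())  # [72590, -10661]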

Pandas Cleaning up

I have an excel file in this format and I am trying to read it into Pandas and clean it up:
I read in the file with read_excel and created a MultiIndex header starting from row 7 ([2013, 2016, 2017, ...]):
df= pd.read_excel(PATH_CY_TABLE, header= [7,8,9])
This is how it read in:
Ideally, I want to clean up to look something like this:
What steps can I follow to get it this format?
A couple of things I have tried are:
1. Removing level 1 of the MultiIndex, where the column names appear as 'Unnamed...':
df.columns= df.columns.get_level_values(1)
This gives me an error: IndexError: Too many levels: Index has only 1 level, not 2
2. Stacking the column indices:
df.stack()
This gives me an error: TypeError: '>' not supported between instances of 'str' and 'int'
I tried this:
df.columns=df.columns.get_level_values(0)
And this gave me the first level of the MultiIndex as [2013, 2013, 2013, 2016, 2016, 2016, ...]. But I want the output df to have two levels of indices here: Level 0 and Level 3.
As a first step I am looking to remove the 'Unnamed...' column names. I have tried to post the df as text output instead of pictures, but I am unsure how to do it the correct way: when I copy-paste from a Jupyter notebook, it comes out all messed up. I am quite new to posting questions here, so I'm still working my way around.
I still wasn't able to find a better way to post my output, but I worked out a way to clean up the file to the desired result:
I sliced level 0 of the MultiIndex to match the year I want (2017):
df1 = df
df1 = df1.iloc[:, df1.columns.get_level_values(0) == 2017]
Out:
Number MOE1 (±) Rate
Total..........................................… 323156.0 123.0 X
NaN NaN NaN NaN
Any health plan……………….……...… 294613.0 662.0 91.2
NaN NaN NaN NaN
.Any private plan2,3……………………… 217007.0 1158.0 67.2
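As a hedged follow-up on the 'Unnamed...' names themselves: if the middle header level contains nothing but those placeholders, dropping that whole level is one way to get down to two levels. A small sketch on a toy frame shaped like the output above (the exact level number depends on how the real file was read):
import pandas as pd

# Toy columns shaped like (year, unnamed placeholder, measure).
cols = pd.MultiIndex.from_tuples([
    (2017, 'Unnamed: 1_level_1', 'Number'),
    (2017, 'Unnamed: 2_level_1', 'MOE1 (±)'),
    (2017, 'Unnamed: 3_level_1', 'Rate'),
])
df = pd.DataFrame([[323156.0, 123.0, 'X']], columns=cols)

# Drop the middle header level, which only holds 'Unnamed: ...' placeholders.
df.columns = df.columns.droplevel(1)
print(df.columns.tolist())  # [(2017, 'Number'), (2017, 'MOE1 (±)'), (2017, 'Rate')]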

Pandas Dataframes - How do you maintain an index post a group by/aggregation operation?

This should be easy but I'm having a surprisingly annoying time at it. The code below shows me doing a Pandas groupby operation so I can calculate variance by symbol. Unfortunately what happens is that the aggregation command seems to get rid of the integer index, so I am trying to create a new integer list and add this as a column to the table and set as a new index.
vardataframe = voldataframe.groupby('Symbol')
vardataframe = vardataframe.aggregate(np.var)
vardataframe['newindex']= np.arange(1,(len(vardataframe)+1))
vardataframe.set_index(['newindex'])
vardataframe = vardataframe.ix[:,['newindex','Symbol','volatility']]
However what comes out is the below vardataframe.head() result, which does not properly change the index of the table from Symbol back to numeric. And this hurts me in a line or two when I try to do a merge command.
newindex Symbol volatility
Symbol
A 1 NaN 0.000249
AA 2 NaN 0.000413
AAIT 3 NaN 0.000237
AAL 4 NaN 0.001664
AAME 5 NaN 0.001283
As you can see, the problems with the above are that there are now two Symbol columns and the index hasn't been set correctly. What I'd like to do is get rid of the second Symbol column and make newindex the new index. Anyone know what I'm doing wrong here? (Perhaps a misunderstanding of the ix command). Much appreciated!
You can use as_index=False to preserve the integer index. You need only one line to do what you need:
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
A couple of things in your code:
vardataframe.set_index(['newindex'])
will set newindex as the index, but it returns a new dataframe which is not used. You can do vardataframe.set_index(['newindex'], inplace=True) if you want this.
vardataframe.ix[:,['newindex','Symbol','volatility']]
gives you a Symbol column of all NaN because Symbol is not a column of vardataframe; it only exists in its index. Querying a non-existent column with ix gives all NaN. As @user2600939 mentioned, you can do vardataframe.reset_index(inplace=True) (or vardataframe = vardataframe.reset_index()) to put Symbol back as a column.
Instead of making a new index manually, just reset it:
df = df.reset_index()
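A quick usage sketch of the as_index=False route on a toy frame, showing that Symbol stays an ordinary column and the result already has the plain 0..n-1 index needed for a later merge:
import pandas as pd

voldataframe = pd.DataFrame({'Symbol': ['A', 'A', 'AA', 'AA'],
                             'volatility': [0.01, 0.03, 0.02, 0.06]})

# Symbol remains a regular column and the index is a fresh RangeIndex.
vardataframe = voldataframe.groupby('Symbol', as_index=False).var()
print(vardataframe)
#   Symbol  volatility
# 0      A      0.0002
# 1     AA      0.0008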
