Data Frame - changing of the nested variable - python

We are discussing data that is imported from excel
ene2 = pd.read_excel('Energy Indicators.xls', index=False)
recently I asked in post, where answers were clear, straightforward and brought success.
Changing Values of elements in Pandas Datastructure
However I went steps further, and I have similar (sic!) problem, where assigning variable does not change anything.
Lets consider Data Structure
print(ene2.head())
Country Energy Supply Energy Supply per Capita % Renewable's
15 NaN Gigajoules Gigajoules %
16 Afghanistan 321000000 10 78.6693
17 Albania 102000000 35 100
18 Algeria1 1959000000 51 0.55101
19 American Samoa ... ... 0.641026
238 Viet Nam 2554000000 28 45.3215
239 Wallis and Futuna Islands 0 26 0
240 Yemen 344000000 13 0
241 Zambia 400000000 26 99.7147
242 Zimbabwe 480000000 32 52.5361
243 NaN NaN NaN NaN
244 NaN NaN NaN NaN
where some countries have index (like Algieria1 or Australia12)
I want to change those names to become just Algieria, Australia and so on.
There is in total 20 entries that suppose to be changed.
I developed a method to do it, which at the last step fails..
for value in ene2['Country']:
if type(value) == float: # to cover NaN values
continue
x = re.findall("\D+\d", value) # to find those countries/elements which are with number
while len(x) > 0: # this shows elements with number, otherwise answer is [], which is 0
for letters in x: # to touch letters
right = letters[:-1] # and get rid of the last number
ene2.loc[ene2['Country'] == value, 'Country'] = right # THIS IS ELEMENT WHICH FAILS <= it does not chagne the value
x = re.findall("\D+\d", value) # to bring the new value to the while loop
Code above should make the task, to finally remove all the indexes from the names,
however the code - ene2.loc[...] which used to work previously, here, where is nested, just do nothing.
What could be the case that this exchange does not work, how can I overcome the problem a) in a old style way b) in the Panda way?

The code suggest you already use pandas, so why not use the built-in replace method with regex?
df = pd.DataFrame(data=["Afghanistan","Albania", "Algeria1", "Algeria9999"], columns=["Country"])
df["Country_clean"] = df["Country"].str.replace(r'\d+$', '')
output:
print(df["Country_clean"])
0 Afghanistan
1 Albania
2 Algeria
3 Algeria
Name: Country, dtype: object

Related

Create new column with multiple values in Python

I have a dataframe, which has name of Stations and Links of Measured value of each Station for 2 days
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/EITZE/W/measurements.json?start=P2D
1 RETHEM https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/RETHEM/W/measurements.json?start=P2D
.......
685 BORGFELD https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/BORGFELD/W/measurements.json?start=P2D
To take data from json isn't a big problem.
But then I realized, that json-link from each station has multiple values from different time, so I don't know how to add these values from each time to a specific station.
I tried to get all the values from json, but I can't define, which values from which station, because it's just too many.
Anyone have a solution for me?
The Dataframe i would like to have, should look like this!
Station Timestamp Value
0 EITZE 2022-07-31T00:30:00+02:00 15
1 EITZE 2022-07-31T00:45:00+02:00 15
.......
100 RETHEM 2022-07-31T00:30:00+02:00 15
101 RETHEM 2022-07-31T00:45:00+02:00 20
.......
xxxx BORGFELD 2022-08-02T00:32:00+02:00 608
Starting with this example data frame:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/res...
1 RETHEM https://www.pegelonline.wsv.de/webservices/res...
You could leverage apply to populate an accumulation data frame.
import requests
import json
Define the function to be used by apply
def get_link(x):
global accum_df
r = requests.get(x['Link'])
if r.status_code == 200:
ldf = pd.DataFrame(json.loads(r.text))
ldf['station'] = x['Station']
accum_df = pd.concat([accum_df,ldf])
else:
print(r.status_code) # handle the error
return None
Apply it
accum_df = pd.DataFrame()
df.apply(get_link, axis=1)
print(accum_df)
Result
timestamp value station
0 2022-07-31T02:00:00+02:00 220.0 EITZE
1 2022-07-31T02:15:00+02:00 220.0 EITZE
2 2022-07-31T02:30:00+02:00 220.0 EITZE
3 2022-07-31T02:45:00+02:00 220.0 EITZE
4 2022-07-31T03:00:00+02:00 219.0 EITZE
.. ... ... ...
181 2022-08-02T00:00:00+02:00 23.0 RETHEM
182 2022-08-02T00:15:00+02:00 23.0 RETHEM
183 2022-08-02T00:30:00+02:00 23.0 RETHEM
184 2022-08-02T00:45:00+02:00 23.0 RETHEM
185 2022-08-02T01:00:00+02:00 23.0 RETHEM

How can I merge these two datasets on 'Name' and 'Year'?

I am new in this field and stuck on this problem. I have two datasets
all_batsman_df, this df has 5 columns('years','team','pos','name','salary')
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, this df has 31 columns
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year', 'name'. But the problem is, these both data frames has different names like in the first dataset, it has name 'Glenn Davis' but in second dataset it has 'Glen Davis'.
Now, I want to know that How can I merge both of them using difflib library even it has different names?
Any help will be appreciated ...
Thanks in advance.
I have used this code which I got in a question asked at this platform but it is not working for me. I am adding a new column after matching names in both of the datasets. I know this is not a good approach. Kindly suggest, If i can do it in a better way.
df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year'] # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']
for comp_a, addr_a in df_a[['Year','Name']].values:
for ixb, (comp_b, addr_b) in enumerate(df_b[['years','name']].values):
if cdifflib.CSequenceMatcher(None,comp_a,comp_b).ratio() > .6:
df_b.loc[ixb,'merge_year'] = comp_a # creates a merge key in df_b
if cdifflib.CSequenceMatcher(None,addr_a, addr_b).ratio() > .6:
df_b.loc[ixb,'merge_name'] = addr_a # creates a merge key in df_b
merged_df = pd.merge(df_a,df_b,on=['merge_name','merge_years'],how='inner')
You can do
import difflib
df_b['name'] = df_b['name'].apply(lambda x: \
difflib.get_close_matches(x, df_a['name'])[0])
to replace names in df_b with closest match from df_a, then do your merge. See also this post.
Let me get to your problem by assuming that you have to make a data set with 2 columns and the 2 columns being 1. 'year' and 2. 'name'
okay
1. we will 1st rename all the names which are wrong
I hope you know all the wrong names from all_batting_statistics_df using this
all_batting_statistics_df.replace(regex=r'^Glen.$', value='Glenn Davis')
once you have corrected all the spellings, choose the smaller one which has the names you know, so it doesn't take long
2. we need both data sets to have the same columns i.e. only 'year' and 'name'
use this to drop the columns we don't need
all_batsman_df_1 = all_batsman_df.drop(['team','pos','salary'])
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk','Name','Age','Tm','Lg','G','PA','AB','R','Summary'], axis=1)
I cannot see all the 31 columns so I left them, you have to add to the above code
3. we need to change the column names to look the same i.e. 'year' and 'name' using python dataframe rename
df_new_1 = all_batting_statistics_df(colums={'Year': 'year', 'Name':'name'})
4. next, to merge them
we will use this
all_batsman_df.merge(df_new_1, left_on='year', right_on='name')
FINAL THOUGHTS:
If you don't want to do all this find a way to export the data set to google sheets or microsoft excel and use edit them with those advanced software, if you like pandas then its not that difficult you will find a way, all the best!

Matching two pandas series: How to find a string element from one series in another series and then create a new column

I am currently working on cleaning up a car emissions data set. This is what the data set looks like (only included first 10 rows):
import pandas as pd
cars_em_df = pd.DataFrame({'manufacturer_name_mapped': ['FIAT', 'FIAT','FIAT','FIAT','FIAT'],
'commercial_name':['124 gt multiair auto', '500l wagon pop star t-jet',
'doblo combi 1.4 95', 'panda 0.9t sge 85 natural power', 'punto 1.4 77 lpg'],
'fuel_type_mapped':['Petrol', 'Petrol', 'Petrol', 'NG-Biomethane', 'LPG'],
'file_year':[2018, 2018, 2018, 2018, 2018], 'emissions': [153,158,165,86,114]})
I am mostly interested in column 'commercial_name'. The end-goal is to add another column to this dataframe that shows the 'cleaned up' version of 'commercial_name'. I have a separate pandas series that contains the 'correct' names that should be used instead of these 'messy' names.
real_model_names = pd.Series(['uno', '147', 'panda', 'punto', '166', '4c', 'brera', 'giulia',
'giulietta', 'gtv'])
These are all strings as well. So as an example, I would like to look up in every row of 'commercial_name' whether it contains any of the names from the 'real_model_names series'. E.g. 'punto' from 'real_model_names' can be found in the entry 'punto 1.4 77 lpg' from the 'commercial_name' column. So then I would like (in a new column in car_em_df) to have 'punto' next to it. If it cannot be found, I would like the original 'messy' name to be shown.
I tried to define a function that I would then apply along the 'commercial_name' column. I tried this:
def str_ops(series):
for i in real_model_names:
if i in series:
return series.replace(series, i)
else:
return series
And as a next step I would apply this function and add it to the dataframe as a new column:
commercial_name_cleaned = cars_em_df.commercial_name.apply(str_ops)
cars_em_df.insert(3,value=commercial_name_cleaned,column='commercial_name_cleaned')
However, this just doesn't do anything. The new column just shows the exact same entries as 'commercial_name'.
Does anyone know how to solve this problem? Is there a better way to do this?
Thanks a lot in advance!
Your loop was on the right track. The most readable and direct way I can think of to do this:
def str_ops(x):
for y in real_model_names:
if y in x:
return y
return x
cars_em_df['commercial_name_cleaned'] = cars_em_df['commercial_name'].apply(str_ops)
# Result
cars_em_df
manufacturer_name_mapped commercial_name fuel_type_mapped file_year emissions commercial_name_cleaned
0 FIAT 124 gt multiair auto Petrol 2018 153 124 gt multiair auto
1 FIAT 500l wagon pop star t-jet Petrol 2018 158 500l wagon pop star t-jet
2 FIAT doblo combi 1.4 95 Petrol 2018 165 doblo combi 1.4 95
3 FIAT panda 0.9t sge 85 natural power NG-Biomethane 2018 86 panda
4 FIAT punto 1.4 77 lpg LPG 2018 114 punto

Tabula-py returns '...' on one specific column in df. everything else seems to work,

Expected behavior:
Read PDF, extract all table data into pandas df.
Actual behavior:
Reads PDF fine, extracts most table data and saves it to a debugging.txt with fp.write(df). One column (names) usually only returns '...' when I view the debugging.txt, or watch the terminal print it.
It's like 9/10 times returning ... - sometimes just the first page, but the rest are fine. Sometimes they're all ok... It seems weird.
(I may be an idiot and it might be shortening it because its by far the longest string by 2-3x. But my Google Fu is failing me)
Sample Input (Names covered for privacy):
Sample Output:
21 121 87 59 2003 ... NaN NaN NaN
22 122 86 59 2026 ... NaN NaN NaN
23 123 85 60 2038 ... NaN NaN NaN
24 124 84 60 2050 ... NaN NaN NaN
25 125 83 61 2056 ... NaN NaN NaN
26 126 82 61 2095 ... NaN NaN NaN
Code:
pagecount = 0
for filename in os.listdir(SPLITDIR):
print("Working on: {}".format(filename))
if not filename.endswith(".pdf"):
print("I dont think {} is a PDF".format(filename))
continue
pagedf = read_pdf(SPLITPATH.format(pagecount) pages='all')
#print(pagedf)
debugextract.write(str(pagedf))
pagedf = pd.DataFrame(pagedf)
print(pagedf)
pagecount += 1
This doesn't come from tabula but ipython or Jupyter's display setting.
See also https://github.com/chezou/tabula-py/issues/216#issuecomment-581837621

Pandas - Count the number of rows that would be true for a function - for each input row

I have a dataframe that needs a column added to it. That column needs to be a count of all the other rows in the table that meet a certain condition, that condition needs to take in input both from the "input" row and the "output" row.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
I'd want the height and weight of the row, as well as the height and weight of the other rows in a function, so I can do something like:
def example_function(height1, weight1, height2, weight2):
if height1 > height2 and weight1 < weight2:
return True
else:
return False
And it would just sum up all the True's and give that sum in the column.
Is something like this possible?
Thanks in advance for any ideas!
Edit: Sample input:
id name height weight country
0 Adam 70 180 USA
1 Bill 65 190 CANADA
2 Chris 71 150 GERMANY
3 Eric 72 210 USA
4 Fred 74 160 FRANCE
5 Gary 75 220 MEXICO
6 Henry 61 230 SPAIN
The result would need to be:
id name height weight country new_column
0 Adam 70 180 USA 1
1 Bill 65 190 CANADA 1
2 Chris 71 150 GERMANY 3
3 Eric 72 210 USA 1
4 Fred 74 160 FRANCE 4
5 Gary 75 220 MEXICO 1
6 Henry 61 230 SPAIN 0
I believe it will need to be some sort of function, as the actual logic I need to use is more complicated.
edit 2:fixed typo
You can add booleans, like this:
count = ((df.height1 > df.height2) & (df.weight1 < df.weight2)).sum()
EDIT:
I test it a bit and then change conditions with custom function:
def f(x):
#check boolean mask
#print ((df.height > x.height) & (df.weight < x.weight))
return ((df.height < x.height) & (df.weight > x.weight)).sum()
df['new_column'] = df.apply(f, axis=1)
print (df)
id name height weight country new_column
0 0 Adam 70 180 USA 2
1 1 Bill 65 190 CANADA 1
2 2 Chris 71 150 GERMANY 3
3 3 Eric 72 210 USA 1
4 4 Fred 74 160 FRANCE 4
5 5 Gary 75 220 MEXICO 1
6 6 Henry 61 230 SPAIN 0
Explanation:
For each row compare values and for count simply sum values True.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
As far as I understand, you want to assign to a new column something like
df['num_heigher_and_leighter'] = df.apply(lambda r: ((df.height > r.height) & (df.weight < r.weight)).sum(), axis=1)
However, your text description doesn't seem to match the outcome, which is:
0 2
1 3
2 0
3 1
4 0
5 0
6 6
dtype: int64
Edit
As in any other case, you can use a named function instead of a lambda:
df = ...
def foo(r):
return ((df.height > r.height) & (df.weight < r.weight)).sum()
df['num_heigher_and_leighter'] = df.apply(foo, axis=1)
I'm assuming you had a typo and want to compare heights with heights and weights with weights. If so, you could count the number of persons taller OR heavier like so:
>>> for i,height,weight in zip(df.index,df.height, df.weight):
... cnt = df.loc[((df.height>height) & (df.weight>weight)), 'height'].count()
... df.loc[i,'thing'] = cnt
...
>>> df
name height weight country thing
0 Adam 70 180 USA 2.0
1 Bill 65 190 CANADA 2.0
2 Chris 71 150 GERMANY 3.0
3 Eric 72 210 USA 1.0
4 Fred 74 160 FRANCE 1.0
5 Gary 75 220 MEXICO 0.0
6 Henry 61 230 SPAIN 0.0
Here for instance, no person is Heavier than Henry, and no person is taller than Gary. If that's not what you intended, it should be easy to modify the & above to a | instead or switching out the > to a <.
When you're more accustomed to Pandas, I suggest you use Ami Tavory excellent answer instead.
PS. For the love of god, use the Metric system for representing weight and height, and convert to whatever for presentation. These numbers are totally nonsensical for the world population at large. :)

Categories