I have a data set like this :
Id 1456 1457 1458 1459 1460
MSSubClass 60 20 70 20 20
MSZoning RL RL RL RL RL
LotFrontage 62 85 66 68 75
LotArea 7917 13175 9042 9717 9937
Street Pave Pave Pave Pave Pave
Alley NaN NaN NaN NaN NaN
LotShape Reg Reg Reg Reg Reg
LandContour Lvl Lvl Lvl Lvl Lvl
I converted the string columns to pandas categoricals. Now I need to convert them to numerical data. To do that, I take the output of the following:
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))
1stFlrSF 0.000000
2ndFlrSF 0.000000
3SsnPorch 0.000000
Alley 0.937671
BedroomAbvGr 0.000000
BldgType 0.000000
BsmtCond 0.025342
BsmtExposure 0.026027
BsmtFinSF1 0.000000
Wherever the value is non-zero, I convert that column to numerical values.
train_cats(df_raw)  # convert strings to pandas categoricals
op1 = df_raw.isnull().sum().sort_index() / len(df_raw)
i = 0
while i < op1.shape[0]:
    if op1[i] != 0.0:
        variable_name = op1.index[i]
        df_raw.variable_name = df_raw.variable_name.cat.codes  # <---- this is the line that fails
    i += 1
So, written out by hand for the Alley column (whose null fraction is non-zero), the statement would be:
df_raw.Alley = df_raw.Alley.cat.codes
Alley needs to be passed as a variable name.
My question is: how can I pass a variable name there instead of a literal value, so that I can loop through the columns? I tried #variable_name, but it just gives me errors.
Maybe I am doing this wrong. Would there be a better way of doing this?
Your help would be very much appreciated.
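The usual fix is bracket indexing: df[column_name] accepts a name held in a variable, whereas attribute access (df.column_name) only works with a literal column name. A minimal sketch, assuming train_cats(df_raw) has already converted the string columns to categoricals:
# Fraction of missing values per column, as in the question.
null_fractions = df_raw.isnull().sum().sort_index() / len(df_raw)

for column_name in null_fractions[null_fractions != 0].index:
    if str(df_raw[column_name].dtype) == 'category':
        # Bracket indexing works with a variable; cat.codes maps each
        # category to an integer (missing values become -1).
        df_raw[column_name] = df_raw[column_name].cat.codes
The dtype check matters because some columns with missing values (e.g. LotFrontage) are numeric, not categorical, and have no .cat accessor.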
I have a dataframe which has the names of stations and, for each station, a link to its measured values for the past 2 days:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/EITZE/W/measurements.json?start=P2D
1 RETHEM https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/RETHEM/W/measurements.json?start=P2D
.......
685 BORGFELD https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations/BORGFELD/W/measurements.json?start=P2D
Taking data from the JSON isn't a big problem.
But then I realized that the JSON link for each station holds multiple values from different times, so I don't know how to attach these values to their specific station.
I tried to get all the values from the JSON at once, but I can't tell which values belong to which station, because there are just too many.
Does anyone have a solution for me?
The dataframe I would like to have should look like this:
Station Timestamp Value
0 EITZE 2022-07-31T00:30:00+02:00 15
1 EITZE 2022-07-31T00:45:00+02:00 15
.......
100 RETHEM 2022-07-31T00:30:00+02:00 15
101 RETHEM 2022-07-31T00:45:00+02:00 20
.......
xxxx BORGFELD 2022-08-02T00:32:00+02:00 608
Starting with this example data frame:
Station Link
0 EITZE https://www.pegelonline.wsv.de/webservices/res...
1 RETHEM https://www.pegelonline.wsv.de/webservices/res...
You could leverage apply to populate an accumulation data frame.
import requests
import json
import pandas as pd

Define the function to be used by apply:
def get_link(x):
    global accum_df
    r = requests.get(x['Link'])
    if r.status_code == 200:
        # build a frame from the JSON measurements and tag it with the station name
        ldf = pd.DataFrame(json.loads(r.text))
        ldf['station'] = x['Station']
        accum_df = pd.concat([accum_df, ldf])
    else:
        print(r.status_code)  # handle the error
    return None
Apply it
accum_df = pd.DataFrame()
df.apply(get_link, axis=1)
print(accum_df)
Result
timestamp value station
0 2022-07-31T02:00:00+02:00 220.0 EITZE
1 2022-07-31T02:15:00+02:00 220.0 EITZE
2 2022-07-31T02:30:00+02:00 220.0 EITZE
3 2022-07-31T02:45:00+02:00 220.0 EITZE
4 2022-07-31T03:00:00+02:00 219.0 EITZE
.. ... ... ...
181 2022-08-02T00:00:00+02:00 23.0 RETHEM
182 2022-08-02T00:15:00+02:00 23.0 RETHEM
183 2022-08-02T00:30:00+02:00 23.0 RETHEM
184 2022-08-02T00:45:00+02:00 23.0 RETHEM
185 2022-08-02T01:00:00+02:00 23.0 RETHEM
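As a design note: mutating a global inside apply works, but it is easy to get wrong, and apply's return value goes unused here. A common alternative, sketched under the same assumptions (a df with Station and Link columns), collects one frame per station and concatenates once:
import requests
import pandas as pd

frames = []
for station, link in zip(df['Station'], df['Link']):
    r = requests.get(link)
    if r.status_code == 200:
        ldf = pd.DataFrame(r.json())  # the body is a list of {timestamp, value} dicts
        ldf['station'] = station
        frames.append(ldf)
    else:
        print(station, r.status_code)  # handle the error as needed

accum_df = pd.concat(frames, ignore_index=True)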
We are discussing data that is imported from Excel:
ene2 = pd.read_excel('Energy Indicators.xls', index=False)
Recently I asked in a post where the answers were clear, straightforward, and brought success:
Changing Values of elements in Pandas Datastructure
However, I went a few steps further, and now I have a similar (sic!) problem, where the assignment does not change anything.
Let's consider the data structure:
print(ene2.head())
Country Energy Supply Energy Supply per Capita % Renewable's
15 NaN Gigajoules Gigajoules %
16 Afghanistan 321000000 10 78.6693
17 Albania 102000000 35 100
18 Algeria1 1959000000 51 0.55101
19 American Samoa ... ... 0.641026
238 Viet Nam 2554000000 28 45.3215
239 Wallis and Futuna Islands 0 26 0
240 Yemen 344000000 13 0
241 Zambia 400000000 26 99.7147
242 Zimbabwe 480000000 32 52.5361
243 NaN NaN NaN NaN
244 NaN NaN NaN NaN
where some countries have a number appended (like Algeria1 or Australia12).
I want to change those names to become just Algeria, Australia, and so on.
There are in total 20 entries that are supposed to be changed.
I developed a method to do it, which fails at the last step:
import re

for value in ene2['Country']:
    if type(value) == float:  # to cover NaN values
        continue
    x = re.findall(r"\D+\d", value)  # find the countries/elements that end with a number
    while len(x) > 0:  # x is [] (length 0) once no match remains
        for letters in x:
            right = letters[:-1]  # get rid of the last number
            ene2.loc[ene2['Country'] == value, 'Country'] = right  # THIS IS THE ELEMENT WHICH FAILS <= it does not change the value
        x = re.findall(r"\D+\d", value)  # bring the new value to the while loop
The code above should complete the task and finally remove all the numbers from the names; however, the ene2.loc[...] line, which used to work previously, does nothing here where it is nested.
What could be the reason this assignment does not work, and how can I overcome the problem (a) in an old-style way, (b) in the pandas way?
The code suggests you already use pandas, so why not use the built-in replace method with a regex?
df = pd.DataFrame(data=["Afghanistan", "Albania", "Algeria1", "Algeria9999"], columns=["Country"])
df["Country_clean"] = df["Country"].str.replace(r'\d+$', '', regex=True)
Output:
print(df["Country_clean"])
0 Afghanistan
1 Albania
2 Algeria
3 Algeria
Name: Country_clean, dtype: object
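Applied to the question's ene2 frame, this is a one-liner. Note that recent pandas versions require regex=True explicitly, since Series.str.replace now defaults to literal matching:
# Strip any trailing digits from the country names; NaN entries pass through untouched.
ene2['Country'] = ene2['Country'].str.replace(r'\d+$', '', regex=True)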
I am new in this field and stuck on this problem. I have two datasets
all_batsman_df, this df has 5 columns('years','team','pos','name','salary')
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, this df has 31 columns
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year' and 'name'. But the problem is that the two data frames spell some names differently: the first dataset has the name 'Glenn Davis', but the second has 'Glen Davis'.
Now, I want to know: how can I merge the two using the difflib library even though the names differ?
Any help will be appreciated ...
Thanks in advance.
I have used this code, which I got from a question asked on this platform, but it is not working for me. I am adding new merge-key columns after matching names in both of the datasets. I know this is not a good approach. Kindly suggest if I can do it in a better way.
import cdifflib
import pandas as pd

df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year']  # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']
for comp_a, addr_a in df_a[['Year', 'Name']].values:
    for ixb, (comp_b, addr_b) in enumerate(df_b[['years', 'name']].values):
        if cdifflib.CSequenceMatcher(None, comp_a, comp_b).ratio() > .6:
            df_b.loc[ixb, 'merge_year'] = comp_a  # creates a merge key in df_b
        if cdifflib.CSequenceMatcher(None, addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb, 'merge_name'] = addr_a  # creates a merge key in df_b
merged_df = pd.merge(df_a, df_b, on=['merge_name', 'merge_year'], how='inner')
You can do
import difflib
df_b['name'] = df_b['name'].apply(
    lambda x: difflib.get_close_matches(x, df_a['name'])[0])
to replace names in df_b with the closest match from df_a, then do your merge. See also this post.
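One caveat: get_close_matches returns an empty list when no candidate clears its similarity cutoff (0.6 by default), so indexing with [0] can raise an IndexError. A defensive variant (the closest_name helper is my illustration, not part of the original answer) might fall back to the original name:
import difflib

def closest_name(name, candidates):
    # get_close_matches returns [] when nothing clears the cutoff,
    # so keep the original name rather than raising IndexError.
    matches = difflib.get_close_matches(name, candidates)
    return matches[0] if matches else name

df_b['name'] = df_b['name'].apply(lambda x: closest_name(x, df_a['name']))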
Let me get to your problem by assuming that you have to make a data set with 2 columns, the 2 columns being 1. 'year' and 2. 'name'.
1. We will first fix all the names which are wrong.
I hope you know all the wrong names in all_batting_statistics_df; using this:
all_batting_statistics_df = all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
Once you have corrected all the spellings, choose the smaller data set, the one whose names you know, so it doesn't take long.
2. We need both data sets to have the same columns, i.e. only 'year' and 'name'.
Use this to drop the columns we don't need:
all_batsman_df_1 = all_batsman_df.drop(['team', 'pos', 'salary'], axis=1)
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk', 'Age', 'Tm', 'Lg', 'G', 'PA', 'AB', 'R', 'Pos Summary'], axis=1)
I cannot see all the 31 columns, so I left them out; you have to add the rest to the code above.
3. We need to change the column names to match, i.e. 'year' and 'name', using pandas' DataFrame rename:
df_new_1 = all_batting_statistics_df_1.rename(columns={'Year': 'year', 'Name': 'name'})
4. Next, to merge them we will use this:
all_batsman_df_1.merge(df_new_1, on=['year', 'name'])
FINAL THOUGHTS:
If you don't want to do all this, find a way to export the data sets to Google Sheets or Microsoft Excel and edit them with those tools; if you like pandas, it's not that difficult, you will find a way. All the best!
Expected behavior:
Read PDF, extract all table data into pandas df.
Actual behavior:
Reads the PDF fine, extracts most table data, and saves it to a debugging.txt with fp.write(str(df)). One column (names) usually only returns '...' when I view debugging.txt or watch the terminal print it.
It's like 9/10 times returning '...': sometimes it's just the first page and the rest are fine, sometimes they're all OK. It seems weird.
(I may be an idiot, and it might be shortening it because it's by far the longest string by 2-3x. But my Google Fu is failing me.)
Sample Input (Names covered for privacy):
Sample Output:
21 121 87 59 2003 ... NaN NaN NaN
22 122 86 59 2026 ... NaN NaN NaN
23 123 85 60 2038 ... NaN NaN NaN
24 124 84 60 2050 ... NaN NaN NaN
25 125 83 61 2056 ... NaN NaN NaN
26 126 82 61 2095 ... NaN NaN NaN
Code:
import os
import pandas as pd
from tabula import read_pdf  # tabula-py

pagecount = 0
for filename in os.listdir(SPLITDIR):
    print("Working on: {}".format(filename))
    if not filename.endswith(".pdf"):
        print("I don't think {} is a PDF".format(filename))
        continue
    pagedf = read_pdf(SPLITPATH.format(pagecount), pages='all')
    # print(pagedf)
    debugextract.write(str(pagedf))
    pagedf = pd.DataFrame(pagedf)
    print(pagedf)
    pagecount += 1
This doesn't come from tabula but from IPython's or Jupyter's display settings.
See also https://github.com/chezou/tabula-py/issues/216#issuecomment-581837621
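If that is the case, the data is intact and only the printed representation is truncated. Raising pandas' display limits and writing with to_string() instead of str() should show the full strings. A small sketch:
import pandas as pd

# Show every column and never truncate long cell contents when printing.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# With pages='all', tabula's read_pdf typically returns a list of DataFrames.
for table in pagedf:
    print(table)                           # no more '...' in wide columns
    debugextract.write(table.to_string())  # full text in the debug file too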
I have a dataframe:
LF RF LR RR
11 22 33 44
23 43 23 12
33 23 12 43
What I want to accomplish is a calculation whose purpose is to identify which column within each row has the lowest value, and to express that value as a percentage of the mean of the remaining columns.
For example:
Identify the min value in row 1, which is 11, and its column name (LF). The mean of the remaining columns is (22+33+44)/3 = 33. Then we calculate the ratio 11/33 = 0.333.
Expected output:
LF RF LR RR Min_Col dif(%)
11 22 33 44 LF 0.333
23 43 23 12 RR 0.404
33 23 12 43 LR 0.364
A proper way of writing the equation would be:
(min_value) / (sum_of_remaining_cols / 3)
Note: I need a column that indicates, for each row, which column is the lowest. (This is a program to identify problems, so within the error message we want to be able to tell the user which column is giving the problems.)
EDITED:
My code (df_inter is the original df, which I slice with .loc to get only the columns needed for this calculation):
df_exc = df_inter.loc[:,['LF_Strut_Pressure', 'RF_Strut_Pressure', 'LR_Strut_Pressure' ,'RR_Strut_Pressure']]
df_exc['dif(%)'] = df_exc.min(1) * 3 / (df_exc.sum(1) - df_inter.min(1))
df_exc['Min_Col'] = df_exc.iloc[:, :-1].idxmin(1)
print(df_exc)
My Output:
LF_Strut RF_Strut LR_Strut RR_Strut dif(%) Min_Col
truck_id
EX7057 0.000000 0.000000 0.000000 0.000000 0.0000 LF_Strut
EX7105 0.000000 0.000000 0.000000 0.000000 0.0000 LF_Strut
EX7106 0.000000 0.000000 0.000000 0.000000 0.0000 LF_Strut
EX7107 0.000000 0.000000 0.000000 0.000000 0.0000 LF_Strut
TD6510 36588.000000 36587.000 36587.00000 36587.00 0.8204 RF_Strut
TD6511 36986.000000 36989.000 36987.00000 36989.00 0.8220 LF_Strut
TD6512 27704.000000 27705.000 27702.00000 27705.00 0.7757 LR_Strut
The problem is: when doing the calculation for TD6510, 36587 / ((36587 + 36587 + 36588) / 3) = 0.9999999..., not 0.8204. I tried replicating where 0.8204 came from, but was unsuccessful. Thanks for all the help and support.
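A plausible cause, judging only from the code shown: the denominator subtracts df_inter.min(1), the row minimum of the original frame, which can pick up smaller values from columns that were dropped, deflating the ratio. Using df_exc's own row minimum instead would match the hand calculation:
# Same formula, but subtracting df_exc's own row minimum (not df_inter's).
# Run this before adding the Min_Col column, so only the four strut columns
# take part in min() and sum().
df_exc['dif(%)'] = df_exc.min(axis=1) * 3 / (df_exc.sum(axis=1) - df_exc.min(axis=1))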
First compute the ratio, then use idxmin to label the minimum column:
df['dif(%)']=df.min(1)*3/(df.sum(1)-df.min(1))
df['Min_Col']=df.iloc[:,:-1].idxmin(1)
df
LF RF LR RR dif(%) Min_Col
0 11 22 33 44 0.333333 LF
1 23 43 23 12 0.404494 RR
2 33 23 12 43 0.363636 LR
I wrote the text in a file called "textfile.txt". This should be useful:
import pandas as pd

df = pd.read_csv('textfile.txt', sep=' ')
df['min'] = df[['LF', 'RF', 'LR', 'RR']].min(axis=1)
df['sum_3'] = df[['LF', 'RF', 'LR', 'RR']].sum(axis=1) - df['min']
df['sum_3_div3'] = df['sum_3'] / 3
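These last two lines are my addition, not part of the original snippet; they finish the calculation with the ratio and the minimum-column name:
# Ratio of the row minimum to the mean of the other three columns,
# plus the name of the column holding the minimum.
df['dif(%)'] = df['min'] / df['sum_3_div3']
df['Min_Col'] = df[['LF', 'RF', 'LR', 'RR']].idxmin(axis=1)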
You can just do the usual calculation; the min column is given by idxmin:
# find the mins in each row
mins = df.min(axis=1)
# compute mean of the other values
other_means = (df.sum(1) - mins).div(df.shape[1]-1)
(mins /other_means)*100
Output:
0 33.333333
1 40.449438
2 36.363636
dtype: float64
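Note this output is scaled by 100. To match the fractional dif(%) and Min_Col columns from the expected output, you could assign back without the scaling:
df['Min_Col'] = df.idxmin(axis=1)  # computed while df still holds only numeric columns
df['dif(%)'] = mins / other_means  # reuse the series computed above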
Using idxmin and df.mask() with df.isin() and df.min():
final = df.assign(Min_Col=df.idxmin(1),
Diff=df.min(1).div(df.mask(df.isin(df.min(1))).mean(1)))
print(final)
LF RF LR RR Min_Col Diff
0 11 22 33 44 LF 0.333333
1 23 43 23 12 RR 0.404494
2 33 23 12 43 LR 0.363636