Python + Pandas + dataframe : couldn't append one dataframe to another - python

I have two big CSV files. I have converted them to Pandas dataframes. Both of them have columns with the same names and in the same order: event_name, category, category_id, description. I want to append one dataframe to the other and then write the resulting dataframe to a CSV. I wrote this code for that:
# appending a new dataframe to the older dataframe
data = pd.read_csv("dataset.csv")
data1 = pd.read_csv("dataset_new.csv")
dfs = [data, data1]
pd.concat([df.squeeze() for df in dfs], ignore_index=True)
dfs = pd.DataFrame(columns=['event_name','category', 'category_id', 'description'])
dfs.to_csv('dataset_append.csv', encoding='utf-8', index=False)
I wanted to show you the output of print(dfs) but I couldn't, because Stack Overflow shows the following error when the output is too long:
Body is limited to 30000 characters; you entered 32132.
Would you please share a code snippet that you have used successfully to append Pandas dataframes?
Edit1:
print(dfs)
Output:
---------------------------------------------------------
[ Unnamed: 10 Unnamed: 100 Unnamed: 101 Unnamed: 102 Unnamed: 103 \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN
15 NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN
18 NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN NaN
20 NaN NaN NaN NaN NaN
21 NaN NaN NaN NaN NaN
22 NaN NaN NaN NaN NaN
23 NaN NaN NaN NaN NaN
24 NaN NaN NaN NaN NaN
25 NaN NaN NaN NaN NaN
26 NaN NaN NaN NaN NaN
27 NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN
29 NaN NaN NaN NaN NaN
... ... ... ... ... ...
1159 NaN NaN NaN NaN NaN
1160 NaN NaN NaN NaN NaN
1161 NaN NaN NaN NaN NaN
1162 NaN NaN NaN NaN NaN
Unnamed: 104 Unnamed: 105 Unnamed: 106 Unnamed: 107 Unnamed: 108 \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN
... ... ... ... ... ...
1161 NaN NaN NaN NaN NaN
1162 NaN NaN NaN NaN NaN
... Unnamed: 94 \
0 ... NaN
1 ... NaN
2 ... NaN
3 ... NaN
4 ... NaN
5 ... NaN
6 ... NaN
7 ... NaN
8 ... NaN
9 ... NaN
10 ... NaN
11 ... NaN
12 ... NaN
13 ... NaN
14 ... NaN
15 ... NaN
16 ... NaN
17 ... NaN
18 ... NaN
19 ... NaN
20 ... NaN
21 ... NaN
22 ... NaN
23 ... NaN
24 ... NaN
25 ... NaN
26 ... NaN
27 ... NaN
28 ... NaN
29 ... NaN
... ... ...
1133 ... NaN
1134 ... NaN
1135 ... NaN
1136 ... NaN
1137 ... NaN
1138 ... NaN
1139 ... NaN
1140 ... NaN
1141 ... NaN
1142 ... NaN
1143 ... NaN
1144 ... NaN
1145 ... NaN
1146 ... NaN
1147 ... NaN
1148 ... NaN
1149 ... NaN
1150 ... NaN
1151 ... NaN
1152 ... NaN
1153 ... NaN
1154 ... NaN
1155 ... NaN
1156 ... NaN
1157 ... NaN
1158 ... NaN
1159 ... NaN
1160 ... NaN
1161 ... NaN
1162 ... NaN
Unnamed: 95 Unnamed: 96 Unnamed: 97 Unnamed: 98 Unnamed: 99 \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
... ... ... ... ... ...
1133 NaN NaN NaN NaN NaN
1134 NaN NaN NaN NaN NaN
1135 NaN NaN NaN NaN NaN
1136 NaN NaN NaN NaN NaN
category category_id \
0 Business 2
1 stage shows 33
2 Literature 15
3 Science & Technology 22
4 health 11
5 Science & Technology 22
6 Outdoor 19
7 stage shows 33
8 nightlife 30
9 fashion & lifestyle 6
10 Government & Activism 25
11 stage shows 33
12 Religion & Spirituality 21
13 Outdoor 19
14 management 17
15 Science & Technology 22
16 nightlife 30
17 Outdoor 19
18 FAMILy & kids 5
19 fashion & lifestyle 6
20 FAMILy & kids 5
21 games 10
22 hobbies 32
23 hobbies 32
24 Religion & Spirituality 21
25 health 11
26 fashion & lifestyle 6
27 career & education 31
28 health 11
29 arts 1
... ... ...
1133 Sports & Fitness 23
1134 Sports & Fitness 23
1135 Sports & Fitness 23
1136 Sports & Fitness 23
1137 Sports & Fitness 23
1138 Sports & Fitness 23
1139 Sports & Fitness 23
1140 Sports & Fitness 23
1141 Sports & Fitness 23
1142 Sports & Fitness 23
1143 Sports & Fitness 23
1144 Sports & Fitness 23
1145 Sports & Fitness 23
1146 Sports & Fitness 23
1147 Sports & Fitness 23
1148 Sports & Fitness 23
1149 Sports & Fitness 23
1150 Sports & Fitness 23
1151 Sports & Fitness 23
1152 Sports & Fitness 23
1153 Sports & Fitness 23
1154 Sports & Fitness 23
1155 Sports & Fitness 23
1156 Sports & Fitness 23
1157 Sports & Fitness 23
1158 Sports & Fitness 23
1159 Sports & Fitness 23
1160 Sports & Fitness 23
1161 Sports & Fitness 23
1162 Sports & Fitness 23
description \
0 Josh Talks in partnership with Facebook is all...
1 Unwind on the strums of Guitar & immerse your...
2 Book review for grade 3 and above learners. 3 ...
3 ..About Organizer:.This is the official page f...
4 Blood Donation is organized under the banner o...
5 A day "Etched with Innovation and Learning" to...
6 Our next destination for Fun with us is "Goa" ...
7 Enjoy the Soulful and Unplugged Performance of...
8 Get ready with your dance shoes on as our favo...
9 FESTIVE HUES -- a fashion and lifestyle exhibi...
10 On Aug. 8, Dr. Ambedkar presides over the Depr...
11 It's A Rapper Boys..And M Write A New Rap song...
12 The Spiritual Makeover..A weekend workshop tha...
13 Our next destination for Fun with us is "Goa" ...
14 Project Management is all about getting the th...
15 World Conference Next Generation Testing 2018 ...
16 ..About Organizer:.Whitefield is now #Sherlocked!
17 On occasion of 72th Independence Day , Udaan O...
18 *Smilofy Special Superstar*.A Talent hunt for ...
19 ITEEHA is coming back to Bengaluru, after a fa...
20 This is an exciting course for kids to teach t...
21 ..About Organizer:.PPG Lounge is a next genera...
22 Touch Feel Try & Buy the latest #car and #bike...
23 Sniper Media is organising an exclusive semina...
24 He has all sorts of powers and able solve any ...
25 registration fee 50/₹ we r providing free c...
26 World Biggest Pageant Miss & Mrs World Queen a...
27 ..About Organizer:.Canam Consultants - India's...
28 Innopharm is an effort to bring innovations in...
29 The first Central India Art and Design Expo - ...
... ...
1133 As the cricket fever grips the country again, ...
1134 An evening of fun, food, drinks and rooting fo...
1135 The time has come, who will take their place S...
1136 Do you want to prove that Age is not a barrier...
1137 We Invite All The Corporate Companies To Be A ...
1138 PlayTM happy to announce you that conducting o...
1139 A Mix of fun rules and cricketing skills. Afte...
1140 Shuttle Swap presents Singles, Doubles and Mix...
1141 Yonex Mavis 350 Shuttle will be used State/Nat...
1142 Light up the FIFA World Cup with Bud90 Match S...
1143 We are charmed to launch the SVSEVENTZ.COM 5-A...
1144 We corephysio FC invite you for our first foot...
1145 After completing the 2nd season of Bangalore S...
1146 As the cricket fever grips the country again, ...
1147 Introducing BOX Cricket Super 6 Corporate Cric...
1148 After the sucess of '1st Matt & Mudd T20 Leagu...
1149 Hi All, It is my pleasure to officially announ...
1150 Sign up: Get early updates, free movie voucher...
1151 About VIVO Pro Kabaddi 2018: A new season of t...
1152 The Hero Indian Super League (ISL) is India's ...
1153 Limited time offer: Free Paytm Movie Voucher w...
1154 The 5th edition of the Indian Super League is ...
1155 Calling all Jamshedpur FC fans! Here's your ch...
1156 Empower yourself and progress towards a health...
1157 Making people happy when they feel that its en...
1158 LOVE YOGA ?- but too busy with work during the...
1159 The coolest way to tour the city ! Absorb the ...
1160 Ready to be a part of India's Biggest Walkatho...
1161 The event will comprise of the following Open ...
1162 RUN FOR CANCER CHILDREN On world Cancer Day 3r...
event_name
0 Josh Talks Hyderabad 2018
1 Guitar Night With Ashmik Patil
2 Book Review - August 2018 - 2
3 Csaw'18
4 Blood donation camp
5 Rajasthan Youth Innovation and Technical Intel...
6 Goa – Fun All the Way!!! - Mom N Kids
7 The AnshUdhami Project LIVE at Tales & Spirits...
8 Friday Fiesta featuring Pearl
9 FESTIVE HUES
10 Nagpur
11 Yo Yo Deep SP The Rapper
12 The Spiritual Makeover
13 Goa Fun All the Way - Women Only group Tour
14 MS Project 2016 - A one day seminar
15 World Conference Next Generation Testing
16 Weekend Booster - Happy Hour
17 Ladies Only Camping : Freedom To Travel (Seaso...
18 Special superstar
19 Malaysian Batik Workshop
20 EQ Enhancement Course (5-10 years)
21 CS:GO Tournament 2018 - PPGL
22 Auto Mall at Mantri Square Bangalore
23 A Seminar by Ojas Rajani (Bollywood celebrity ...
24 rishikesh katti greatest Spirituality guru of ...
25 free BMD camp held on 26 jan 2018
26 Miss and Mrs Bhopal Madhya Pradesh India World...
27 USA, Canada & Singapore Application Days 2018
28 Innopharm 3
29 Kalasrishti Art and Design Expo
... ...
1133 Asia cup live screening at la casa Brewery+ ki...
1134 Asia Cup 2018 live screening at La Casa Brewer...
1135 FIFA FINAL AT KORAMANGALA TETTO - With #fifa#f...
1136 Womenasia Indoor Cricket Championship
1137 Switch Hit Corporate Cricket Tournament
1138 PlayTM Sports Arena Box Cricket league
1139 The Box Cricket League Edition II (16-17-18 No...
1140 Shuttle Swap Badminton Tournament - With Singl...
1141 SPARK BADMINTON LEAGUE - OCT 14th 2018
1142 Bud90 Match Screenings at Loft38
1143 5 A-Side Football Tournament
1144 5 vs 5 Football league - With Back 2 Track events
1145 Bangalore Sports Carnival Table Tennis Juniors...
1146 Asia cup live screening at la casa Brewery+ ki...
1147 Super 6 Corporate Cricket League
1148 Coolulu is organizing MATT & MUD T20 Cricket L...
1149 United Sportzs Pure Corporate Cricket season-10
1150 Sign up for updates on the VIVO Pro Kabaddi Se...
1151 VIVO Pro Kabaddi - UP Yoddha vs Patna Pirates ...
1152 HERO Indian Super League 2018-19: Kerala Blast...
1153 HERO ISL: FC Goa Memberships
1154 Hero Indian Super League 2018-19: Delhi Dynamo...
1155 HERO Indian Super League 2018-19: Jamshedpur F...
1156 Yoga Therapy Classes in Bangalore
1157 Saree Walkathon
1158 Weekend Yoga Teachers Training Program
1159 Bangalore Walks
1160 Oxfam Trailwalker Bengaluru
1161 TAD Pune 2018 (Triathlon Aquathlon Duathlon)
1162 RUN FOR CANCER CHILDREN
[1163 rows x 241 columns], event_name category \
0 Musical Camping at Dahanu Chiku farm outdoor
1 Adventure Camping at Wada outdoor
2 Kaas Plateau Tour outdoor
3 Pawna Lake Camping, kevre, Lonavala outdoor
4 Night Trek and Camping at Korigad Fort outdoor
5 PARAMOTORING outdoor
6 WATERFALL TREK & BEACH CAMPING (NAGALAPURAM: N... outdoor
7 Happiest Land On Earth - Bhutan outdoor
8 4 Days serial hiking in Sahyadris - Sep 29 to ... outdoor
9 Ride To Valparai outdoor
10 Dzongri Trek - Gateway to Kanchenjunga Mountain outdoor
11 Skandagiri Night Trek With Camping outdoor
12 Kalsubai Trek | Plan The Unplanned outdoor
13 Bike N Hike Skandagiri outdoor
14 Unplanned Stories - Episode 6 - Travel Tales outdoor
15 Feast on authentic flavors from Goa! outdoor
16 The Boot Camp outdoor
17 The HandleBards: Romeo and Juliet at Ranga Sha... outdoor
18 Workshop on Metagenomic Sequencing on the Grid... Science & Technology
19 Aerovision Science & Technology
20 Electric Vehicle Technology Workshop Science & Technology
21 BPM Strategy Summit Science & Technology
22 Summit of Interior Designers & Architecture Science & Technology
23 SMART ASIA India Expo& Summit Science & Technology
24 A Smart City Life Exhibition Science & Technology
25 OPEN SOURCE INDIA Science & Technology
26 SolarRoofs India Bangalore Science & Technology
27 International Conference on Innovative Researc... Science & Technology
28 International Conference on Business Managemen... Science & Technology
29 DevOn Summit Bangalore - Digital Transformations Science & Technology
.. ... ...
144 Asia cup live screening at la casa Brewery+ ki... Sports & Fitness
145 Asia Cup 2018 live screening at La Casa Brewer... Sports & Fitness
146 FIFA FINAL AT KORAMANGALA TETTO - With #fifa#f... Sports & Fitness
147 Womenasia Indoor Cricket Championship Sports & Fitness
148 Switch Hit Corporate Cricket Tournament Sports & Fitness
149 PlayTM Sports Arena Box Cricket league Sports & Fitness
150 The Box Cricket League Edition II (16-17-18 No... Sports & Fitness
151 Shuttle Swap Badminton Tournament - With Singl... Sports & Fitness
152 SPARK BADMINTON LEAGUE - OCT 14th 2018 Sports & Fitness
153 Bud90 Match Screenings at Loft38 Sports & Fitness
170 Bangalore Walks Sports & Fitness
171 Oxfam Trailwalker Bengaluru Sports & Fitness
172 TAD Pune 2018 (Triathlon Aquathlon Duathlon) Sports & Fitness
173 RUN FOR CANCER CHILDREN Sports & Fitness
category_id description \
0 19 Dear All Camping Lovers, Come take camping exp...
1 19 Our Adventure campsite at Wada is developed wi...
2 19 Type: Eco Tour Height: 3937 FT above MSL (Appr...
3 19 Our Pawna Lake Camping site is located near Ke...
4 19 Type: Hill Fort Height: 3050 Feet above MSL (A...
23 22 Making 'Smart Cities Mission' a Reality The SM...
24 22 A Smart City Life A Smart City Life Exhibition...
25 22 Asia's No. 1 Convention on Open Source Started...
26 22 The conference will offer an excellent platfor...
27 22 Provides a leading forum for the presentation ...
28 22 Provide opportunity for the global participant...
29 22 The biggest event about Digital Transformation...
.. ... ...
144 23 As the cricket fever grips the country again, ...
145 23 An evening of fun, food, drinks and rooting fo...
146 23 The time has come, who will take their place S...
147 23 Do you want to prove that Age is not a barrier...
148 23 We Invite All The Corporate Companies To Be A ...
149 23 PlayTM happy to announce you that conducting o...
150 23 A Mix of fun rules and cricketing skills. Afte...
151 23 Shuttle Swap presents Singles, Doubles and Mix...
152 23 Yonex Mavis 350 Shuttle will be used State/Nat...
153 23 Light up the FIFA World Cup with Bud90 Match S...
154 23 We are charmed to launch the SVSEVENTZ.COM 5-A...
155 23 We corephysio FC invite you for our first foot...
156 23 After completing the 2nd season of Bangalore S...
157 23 As the cricket fever grips the country again, ...
158 23 Introducing BOX Cricket Super 6 Corporate Cric...
159 23 After the sucess of '1st Matt & Mudd T20 Leagu...
160 23 Hi All, It is my pleasure to officially announ...
161 23 Sign up: Get early updates, free movie voucher...
162 23 About VIVO Pro Kabaddi 2018: A new season of t...
163 23 The Hero Indian Super League (ISL) is India's ...
164 23 Limited time offer: Free Paytm Movie Voucher w...
165 23 The 5th edition of the Indian Super League is ...
166 23 Calling all Jamshedpur FC fans! Here's your ch...
167 23 Empower yourself and progress towards a health...
168 23 Making people happy when they feel that its en...
169 23 LOVE YOGA ?- but too busy with work during the...
170 23 The coolest way to tour the city ! Absorb the ...
171 23 Ready to be a part of India's Biggest Walkatho...
172 23 The event will comprise of the following Open ...
173 23 RUN FOR CANCER CHILDREN On world Cancer Day 3r...
Unnamed: 4 Unnamed: 5
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
24 NaN NaN
25 NaN NaN
26 NaN NaN
27 NaN NaN
28 NaN NaN
29 NaN NaN
.. ... ...
144 NaN NaN
145 NaN NaN
146 NaN NaN
147 NaN NaN
148 NaN NaN
149 NaN NaN
[174 rows x 6 columns]]

What's wrong with a simple:
pd.concat([df1, df2], ignore_index=True).to_csv('File.csv', index=False)
This works as-is when both frames have the same columns.
A more verbose way to extract specific columns would be:
(pd.concat([df1[['event_name','category', 'category_id', 'description']],
df2[['event_name','category', 'category_id', 'description']]],
ignore_index=True)
.to_csv('File.csv', index=False))
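Separate notes:
You are initializing a new, empty DataFrame with just the column names and then writing that to the CSV; the result of pd.concat is never assigned to anything.
Why are you using .squeeze(), which converts a single-column frame to a 1-D object?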
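Since the printed output shows many "Unnamed: N" columns in one of the files, it may help to keep only the four named columns while reading. Here is a minimal sketch of that idea. The in-memory CSV text is made up for illustration (it stands in for dataset.csv and dataset_new.csv), and the trailing commas in the first file are an assumption about where the "Unnamed" columns come from.

```python
import io
import pandas as pd

cols = ["event_name", "category", "category_id", "description"]

# Hypothetical stand-ins for dataset.csv and dataset_new.csv.
# The trailing commas in the first file produce "Unnamed: N" columns.
old_csv = io.StringIO(
    "event_name,category,category_id,description,,\n"
    "Josh Talks,Business,2,A talk,,\n"
)
new_csv = io.StringIO(
    "event_name,category,category_id,description\n"
    "Kaas Plateau Tour,outdoor,19,Eco tour\n"
)

# usecols drops everything except the four named columns.
data = pd.read_csv(old_csv, usecols=cols)
data1 = pd.read_csv(new_csv, usecols=cols)

# Assign the result of concat; it does not modify its inputs.
combined = pd.concat([data, data1], ignore_index=True)

# Write to an in-memory buffer here; pass a file path in real use.
out = io.StringIO()
combined.to_csv(out, index=False)
print(out.getvalue())
```

With real files you would pass the file names to pd.read_csv and the output path to to_csv instead of the StringIO objects.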

Related

python - find duplicates in a column, replace values in another column for that duplicate

I have a dataframe that consists of video game titles on various platforms. It contains, among other values, the name, the critics' average score and the users' average score. Many of the rows are missing the user score, critic score and/or ESRB rating.
What I'd like to do is replace the missing rating, critic and user scores with those for the same game on a different platform (assuming they exist). I'm not quite sure how to approach this. (Note: I don't want to drop the duplicate names, because they aren't truly duplicate rows.)
Here is a sample chunk of the dataframe (I've removed some unrelated columns to make it manageable):
    name                                          platform  critic_score  user_score  rating
0   Wii Sports                                    Wii       76.0          8.0         E
1   Super Mario Bros.                             NES       NaN           NaN         NaN
2   Mario Kart Wii                                Wii       82.0          8.3         E
3   Wii Sports Resort                             Wii       80.0          8.0         E
4   Pokemon Red/Pokemon Blue                      GB        NaN           NaN         NaN
5   Tetris                                        GB        NaN           NaN         NaN
6   New Super Mario Bros.                         DS        89.0          8.5         E
7   Wii Play                                      Wii       58.0          6.6         E
8   New Super Mario Bros. Wii                     Wii       87.0          8.4         E
9   Duck Hunt                                     NES       NaN           NaN         NaN
10  Nintendogs                                    DS        NaN           NaN         NaN
11  Mario Kart DS                                 DS        91.0          8.6         E
12  Pokemon Gold/Pokemon Silver                   GB        NaN           NaN         NaN
13  Wii Fit                                       Wii       80.0          7.7         E
14  Kinect Adventures!                            X360      61.0          6.3         E
15  Wii Fit Plus                                  Wii       80.0          7.4         E
16  Grand Theft Auto V                            PS3       97.0          8.2         M
17  Grand Theft Auto: San Andreas                 PS2       95.0          9.0         M
18  Super Mario World                             SNES      NaN           NaN         NaN
19  Brain Age: Train Your Brain in Minutes a Day  DS        77.0          7.9         E
20  Pokemon Diamond/Pokemon Pearl                 DS        NaN           NaN         NaN
21  Super Mario Land                              GB        NaN           NaN         NaN
22  Super Mario Bros. 3                           NES       NaN           NaN         NaN
23  Grand Theft Auto V                            X360      97.0          8.1         M
24  Grand Theft Auto: Vice City                   PS2       95.0          8.7         M
25  Pokemon Ruby/Pokemon Sapphire                 GBA       NaN           NaN         NaN
26  Brain Age 2: More Training in Minutes a Day   DS        77.0          7.1         E
27  Pokemon Black/Pokemon White                   DS        NaN           NaN         NaN
28  Gran Turismo 3: A-Spec                        PS2       95.0          8.4         E
29  Call of Duty: Modern Warfare 3                X360      88.0          3.4         M
Now, there don't happen to be any duplicates that stick out in these first 30 lines, but for instance I have 007: Quantum of Solace on the PS3, Wii, DS, PC and X360. Between all of the platforms I have a mean score for both users and critics, as well as a rating.
As requested, here is a sample of some duplicated values:
index name platform critic_Score user_score rating
3862 Frozen: Olaf's Quest DS NaN NaN NaN
3358 Frozen: Olaf's Quest 3DS NaN NaN NaN
1785 007: Quantum of Solace PS3 65 6.6 T
3120 007: Quantum of Solace Wii 54 7.5 T
9507 007: Quantum of Solace DS 65 NaN T
4475 007: Quantum of Solace PS2 NaN NaN NaN
1285 007: Quantum of Solace X360 65 7.1 T
14658 007: Quantum of Solace PC 70 6.3 T
2243 007: The World is not Enough PS 61 6.7 T
1204 007: The World is not Enough N64 NaN NaN NaN
I've separated my duplicates into their own dataframe (df is my original games dataframe, df2 is the duplicates dataframe):
df2 = df[df.duplicated(['name'], keep=False)]
df2 = df2.sort_values(['name'])
So I can see my duplicates and their values, but of course I don't want to fill in 8500 missing values from duplicates by hand.
I can find the duplicated names, but I don't know how to fill the NaN values with the "good" values from the other platforms.
I'm at a loss for how to begin this and would appreciate any input on a direction.
Now, to add another step: in my example above of the 007 game, the critic and user scores aren't the same across platforms (the PS3 game got a 65, the Wii game got a 54 and the PC game a 70). Calculating the mean across platforms would be the ideal solution, but I'll settle for ANY of the platforms if that is too complex (as you might have guessed, I am very new to Python).
I appreciate any time and effort you have to share on my behalf.
Regards,
Jared
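I'm pretty sure pandas.DataFrame.groupby is what you need. Select the numeric score columns before taking the mean, since recent pandas versions raise an error when asked to average text columns such as platform or rating:
df.groupby("name")[["critic_score", "user_score"]].mean()
If you want to join these results with your dataframe, you can use:
df.join(df.groupby("name")[["critic_score", "user_score"]].mean(), on="name", rsuffix="_mean")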
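To actually fill the gaps rather than just compute the means, one option is groupby(...).transform, which returns a result aligned row by row with the original frame. This is only a sketch on a small made-up sample modelled on the question's data; taking the first available rating per game is an assumption about what is acceptable for the rating column.

```python
import numpy as np
import pandas as pd

# Small illustrative sample modelled on the question's data.
df = pd.DataFrame({
    "name": ["007: Quantum of Solace"] * 3 + ["Frozen: Olaf's Quest"] * 2,
    "platform": ["PS3", "Wii", "PS2", "DS", "3DS"],
    "critic_score": [65.0, 54.0, np.nan, np.nan, np.nan],
    "user_score": [6.6, 7.5, np.nan, np.nan, np.nan],
    "rating": ["T", "T", np.nan, np.nan, np.nan],
})

score_cols = ["critic_score", "user_score"]

# Per-game mean of each score, broadcast back to every row of that game.
group_means = df.groupby("name")[score_cols].transform("mean")

# Only the missing cells are filled; existing scores are kept as they are.
df[score_cols] = df[score_cols].fillna(group_means)

# For the text rating, use the first non-missing rating of the same game.
first_rating = df.groupby("name")["rating"].transform("first")
df["rating"] = df["rating"].fillna(first_rating)

print(df)
```

Games that have no score on any platform (Frozen: Olaf's Quest in this sample) stay NaN, since there is nothing to borrow from.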

how to convert every row as column and value before colon into column name

I am reading a file called kids_cvc with the header=None option. Every row in this file starts with a letter combination followed by a colon, like ab:, ad: etc. I want each row to become a column, with the prefix that starts the line (e.g. ab) used as the column name.
Below is my dataframe:
>>> df = pd.read_csv("kids_cvc",error_bad_lines=False, header=None)
b'Skipping line 2: expected 13 fields, saw 14\nSkipping line 5: expected 13 fields, saw 14\nSkipping line 6: expected 13 fields, saw 16\nSkipping line 7: expected 13 fields, saw 14\nSkipping line 8: expected 13 fields, saw 15\nSkipping line 9: expected 13 fields, saw 14\nSkipping line 20: expected 13 fields, saw 19\nSkipping line 21: expected 13 fields, saw 16\nSkipping line 23: expected 13 fields, saw 14\nSkipping line 24: expected 13 fields, saw 16\nSkipping line 27: expected 13 fields, saw 14\n'
>>> df
0 1 2 3 4 5 6 7 8 9 10 11 12
0 ab: cab dab gab jab lab nab tab blab crab grab scab stab slab
1 ad: bad dad had lad mad pad sad tad glad NaN NaN NaN NaN
2 an: ban can fan man pan ran tan van clan plan scan than NaN
3 ed: bed fed led red wed bled bred fled pled sled shed NaN NaN
4 eg: beg keg leg peg NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 et: bet get jet let met net pet set vet wet yet fret NaN
6 en: den hen men pen ten then when NaN NaN NaN NaN NaN NaN
7 eck: beck deck neck peck check fleck speck wreck NaN NaN NaN NaN NaN
8 ell: bell cell dell jell sell tell well yell dwell shell smell spell swell
9 it: bit fit hit kit lit pit sit wit knit quit slit spit NaN
10 id: bid did hid kid lid rid skid slid NaN NaN NaN NaN NaN
11 ig: big dig fig gig jig pig rig wig zig twig NaN NaN NaN
12 im: dim him rim brim grim skim slim swim trim whim NaN NaN NaN
13 ish: fish dish wish swish NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 ob: cob gob job lob mob rob sob blob glob knob slob snob NaN
15 og: bog cog dog fog hog jog log blog clog frog NaN NaN NaN
16 ock: dock lock rock sock tock block clock flock rock shock smock stock NaN NaN
17 ut: but cut gut hut jut nut rut shut NaN NaN NaN NaN NaN
18 ub: cub hub nub rub sub tub grub snub stub NaN NaN NaN NaN
19 ug: bug dug hug jug lug mug pug rug tug drug plug slug snug
20 um: bum gum hum mum sum chum drum glum plum scum slum NaN NaN
21 un: bun fun gun nun pun run sun spun stun NaN NaN NaN NaN
22 ud: bud cud dud mud spud stud thud NaN NaN NaN NaN NaN NaN
23 uck: buck duck luck muck puck suck tuck yuck chuck cluck pluck stuck truck
24 ush: gush hush lush mush rush blush brush crush flush slush NaN NaN NaN
When I use transpose I get the following:
>>> df.T
0 1 2 3 4 5 6 7 ... 17 18 19 20 21 22 23 24
0 ab: cab ad: bad an: ban ed: bed eg: beg et: bet en: den eck: beck ... ut: but ub: cub ug: bug um: bum un: bun ud: bud uck: buck ush: gush
1 dab dad can fed keg get hen deck ... cut hub dug gum fun cud duck hush
2 gab had fan led leg jet men neck ... gut nub hug hum gun dud luck lush
3 jab lad man red peg let pen peck ... hut rub jug mum nun mud muck mush
4 lab mad pan wed NaN met ten check ... jut sub lug sum pun spud puck rush
5 nab pad ran bled NaN net then fleck ... nut tub mug chum run stud suck blush
6 tab sad tan bred NaN pet when speck ... rut grub pug drum sun thud tuck brush
7 blab tad van fled NaN set NaN wreck ... shut snub rug glum spun NaN yuck crush
8 crab glad clan pled NaN vet NaN NaN ... NaN stub tug plum stun NaN chuck flush
9 grab NaN plan sled NaN wet NaN NaN ... NaN NaN drug scum NaN NaN cluck slush
10 scab NaN scan shed NaN yet NaN NaN ... NaN NaN plug slum NaN NaN pluck NaN
11 stab NaN than NaN NaN fret NaN NaN ... NaN NaN slug NaN NaN NaN stuck NaN
12 slab NaN NaN NaN NaN NaN NaN NaN ... NaN NaN snug NaN NaN NaN truck NaN
[13 rows x 25 columns]
Desired output:
ab ad an ed eg et en eck
0 cab bad ban bed beg bet den beck
1 dab dad can fed keg get hen deck
2 gab had fan led leg jet men neck
3 jab lad man red peg let pen peck
4 lab mad pan wed NaN met ten check
5 nab pad ran bled NaN net then fleck
6 tab sad tan bred NaN pet when speck
7 blab tad van fled NaN set NaN wreck
8 crab glad clan pled NaN vet NaN NaN
9 grab NaN plan sled NaN wet NaN NaN
10 scab NaN scan shed NaN yet NaN NaN
11 stab NaN than NaN NaN fret NaN NaN
12 slab NaN NaN NaN NaN NaN NaN NaN
Try this:
df = pd.read_clipboard(sep=r'\s\s+') #Import your csv here
df[['i', '0']] = df['0'].str.split(':', expand=True) #Split first column on ':'
df.set_index('i').T #set_index and transpose
Output:
i ab ad an ed eg et en eck ell it ... og ock ut ub ug um un ud uck ush
0 cab bad ban bed beg bet den beck bell bit ... bog dock but cub bug bum bun bud buck gush
1 dab dad can fed keg get hen deck cell fit ... cog lock cut hub dug gum fun cud duck hush
2 gab had fan led leg jet men neck dell hit ... dog rock gut nub hug hum gun dud luck lush
3 jab lad man red peg let pen peck jell kit ... fog sock hut rub jug mum nun mud muck mush
4 lab mad pan wed NaN met ten check sell lit ... hog tock jut sub lug sum pun spud puck rush
5 nab pad ran bled NaN net then fleck tell pit ... jog block nut tub mug chum run stud suck blush
6 tab sad tan bred NaN pet when speck well sit ... log clock rut grub pug drum sun thud tuck brush
7 blab tad van fled NaN set NaN wreck yell wit ... blog flock shut snub rug glum spun NaN yuck crush
8 crab glad clan pled NaN vet NaN NaN dwell knit ... clog rock NaN stub tug plum stun NaN chuck flush
9 grab NaN plan sled NaN wet NaN NaN shell quit ... frog shock smock NaN NaN drug scum NaN NaN cluck slush
10 scab NaN scan shed NaN yet NaN NaN smell slit ... NaN stock NaN NaN plug slum NaN NaN pluck NaN
11 stab NaN than NaN NaN fret NaN NaN spell spit ... NaN NaN NaN NaN slug NaN NaN NaN stuck NaN
12 slab NaN NaN NaN NaN NaN NaN NaN swell NaN ... NaN NaN NaN NaN snug NaN NaN NaN truck NaN
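Note that read_csv skipped several lines in the question because the rows have different numbers of fields. A possible alternative is to parse the lines by hand and let pandas pad the shorter columns with NaN. This is a sketch that assumes each line looks like "ab: cab,dab,gab" (prefix, colon, comma-separated words); the in-memory text below is a made-up stand-in for the kids_cvc file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the kids_cvc file; rows have different lengths.
raw = io.StringIO(
    "ab: cab,dab,gab\n"
    "ad: bad,dad\n"
    "eg: beg\n"
)

columns = {}
for line in raw:
    line = line.strip()
    if not line:
        continue  # ignore blank lines
    # Everything before the first colon becomes the column name.
    key, _, rest = line.partition(":")
    words = [w.strip() for w in rest.split(",") if w.strip()]
    columns[key.strip()] = pd.Series(words)

# Series of unequal length are aligned on their index and padded with NaN.
wide = pd.DataFrame(columns)
print(wide)
```

Because no line is handed to the CSV parser, nothing gets skipped for having "too many" fields.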

Why is it that when I use pandas to scrape a table from a website, it skips the middle columns and only prints the first 2 and last 2?

I am currently working on a program that scrapes Yahoo Finance Earnings Calendar Page and stores the data in a file. I am able to scrape the data but I am confused as to why it only scrapes the first 2 and last 2 columns. I also tried to do the same with a table on Wikipedia for List of S&P 500 Companies and am running into the same problem. Any help is appreciated.
Yahoo Finance Code
import csv
import pandas as pd
earnings = pd.read_html('https://finance.yahoo.com/calendar/earnings?day=2019-11-19')[0]
fileName = "testFile"
with open(fileName + ".csv", mode='w') as csv_file:
writer = csv.writer(csv_file)
writer.writerow([earnings])
print(earnings)
Wikipedia Code
import pandas as pd
url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url) # Returns list of all tables on page
sp500_table = tables[0] # Select table of interest
print(sp500_table)
~EDIT~
Here is the output I get from the Yahoo Finance Code
" Symbol Company ... Reported EPS Surprise(%)
0 WUBA 58.com Inc ... NaN NaN
1 ARMK Aramark ... NaN NaN
2 AFMD Affimed NV ... NaN NaN
3 NJR New Jersey Resources Corp ... NaN NaN
4 ECCB Eagle Point Credit Company Inc ... NaN NaN
5 TOUR Tuniu Corp ... NaN NaN
6 EIC Eagle Point Income Company Inc ... NaN NaN
7 KSS Kohls Corp ... NaN NaN
8 JKS JinkoSolar Holding Co Ltd ... NaN NaN
9 DL China Distance Education Holdings Ltd ... NaN NaN
10 TJX TJX Companies Inc ... NaN NaN
11 HD Home Depot Inc ... NaN NaN
12 PAGS PagSeguro Digital Ltd ... NaN NaN
13 ESE ESCO Technologies Inc ... NaN NaN
14 RADA Rada Electronic Industries Ltd ... NaN NaN
15 RADA Rada Electronic Industries Ltd ... NaN NaN
16 DAVA Endava PLC ... NaN NaN
17 FALC FalconStor Software Inc ... NaN NaN
18 GVP GSE Systems Inc ... NaN NaN
19 TDG TransDigm Group Inc ... NaN NaN
20 PPDF PPDAI Group Inc ... NaN NaN
21 GRBX Greenbox Pos ... NaN NaN
22 THMO Thermogenesis Holdings Inc ... NaN NaN
23 MMS Maximus Inc ... NaN NaN
24 NXTD NXT-ID Inc ... NaN NaN
25 URBN Urban Outfitters Inc ... NaN NaN
26 SINT SINTX Technologies Inc ... NaN NaN
27 ORNC Oranco Inc ... NaN NaN
28 LAIX LAIX Inc ... NaN NaN
29 MDT Medtronic PLC ... NaN NaN
[30 rows x 6 columns]"
Here is the output I get from Wikipedia Code
Symbol Security ... CIK Founded
0 MMM 3M Company ... 66740 1902
1 ABT Abbott Laboratories ... 1800 1888
2 ABBV AbbVie Inc. ... 1551152 2013 (1888)
3 ABMD ABIOMED Inc ... 815094 1981
4 ACN Accenture plc ... 1467373 1989
5 ATVI Activision Blizzard ... 718877 2008
6 ADBE Adobe Systems Inc ... 796343 1982
7 AMD Advanced Micro Devices Inc ... 2488 1969
8 AAP Advance Auto Parts ... 1158449 1932
9 AES AES Corp ... 874761 1981
10 AMG Affiliated Managers Group Inc ... 1004434 1993
11 AFL AFLAC Inc ... 4977 1955
12 A Agilent Technologies Inc ... 1090872 1999
13 APD Air Products & Chemicals Inc ... 2969 1940
14 AKAM Akamai Technologies Inc ... 1086222 1998
15 ALK Alaska Air Group Inc ... 766421 1985
16 ALB Albemarle Corp ... 915913 1994
17 ARE Alexandria Real Estate Equities ... 1035443 1994
18 ALXN Alexion Pharmaceuticals ... 899866 1992
19 ALGN Align Technology ... 1097149 1997
20 ALLE Allegion ... 1579241 1908
21 AGN Allergan, Plc ... 1578845 1983
22 ADS Alliance Data Systems ... 1101215 1996
23 LNT Alliant Energy Corp ... 352541 1917
24 ALL Allstate Corp ... 899051 1931
25 GOOGL Alphabet Inc Class A ... 1652044 1998
26 GOOG Alphabet Inc Class C ... 1652044 1998
27 MO Altria Group Inc ... 764180 1985
28 AMZN Amazon.com Inc. ... 1018724 1994
29 AMCR Amcor plc ... 1748790 NaN
.. ... ... ... ... ...
475 VIAB Viacom Inc. ... 1339947 NaN
476 V Visa Inc. ... 1403161 NaN
477 VNO Vornado Realty Trust ... 899689 NaN
478 VMC Vulcan Materials ... 1396009 NaN
479 WAB Wabtec Corporation ... 943452 NaN
480 WMT Walmart ... 104169 NaN
481 WBA Walgreens Boots Alliance ... 1618921 NaN
482 DIS The Walt Disney Company ... 1001039 NaN
483 WM Waste Management Inc. ... 823768 1968
484 WAT Waters Corporation ... 1000697 1958
485 WEC Wec Energy Group Inc ... 783325 NaN
486 WCG WellCare ... 1279363 NaN
487 WFC Wells Fargo ... 72971 NaN
488 WELL Welltower Inc. ... 766704 NaN
489 WDC Western Digital ... 106040 NaN
490 WU Western Union Co ... 1365135 1851
491 WRK WestRock ... 1636023 NaN
492 WY Weyerhaeuser ... 106535 NaN
493 WHR Whirlpool Corp. ... 106640 1911
494 WMB Williams Cos. ... 107263 NaN
495 WLTW Willis Towers Watson ... 1140536 NaN
496 WYNN Wynn Resorts Ltd ... 1174922 NaN
497 XEL Xcel Energy Inc ... 72903 1909
498 XRX Xerox ... 108772 1906
499 XLNX Xilinx ... 743988 NaN
500 XYL Xylem Inc. ... 1524472 NaN
501 YUM Yum! Brands Inc ... 1041061 NaN
502 ZBH Zimmer Biomet Holdings ... 1136869 NaN
503 ZION Zions Bancorp ... 109380 NaN
504 ZTS Zoetis ... 1555280 NaN
[505 rows x 9 columns]
As you can see, in both examples the table omits the columns in the middle and only displays the first 2 and last 2.
~EDIT#2~
Making this change to the code now displays all columns, but it does so in two separate tables instead. Any idea as to why it does this?
fileName = "yahooFinance_Pandas"
with pd.option_context('display.max_columns', None): # more options can be specified also
with open(fileName + ".csv", mode='w') as csv_file:
writer = csv.writer(csv_file)
writer.writerow([earnings])
OUTPUT
" Symbol Company Earnings Call Time \
0 WUBA 58.com Inc Before Market Open
1 ARMK Aramark Before Market Open
2 AFMD Affimed NV TAS
3 NJR New Jersey Resources Corp Before Market Open
4 ECCB Eagle Point Credit Company Inc Before Market Open
5 TOUR Tuniu Corp Before Market Open
6 EIC Eagle Point Income Company Inc Before Market Open
7 KSS Kohls Corp Before Market Open
8 JKS JinkoSolar Holding Co Ltd Before Market Open
9 DL China Distance Education Holdings Ltd After Market Close
10 TJX TJX Companies Inc Before Market Open
11 HD Home Depot Inc Before Market Open
12 PAGS PagSeguro Digital Ltd TAS
13 ESE ESCO Technologies Inc After Market Close
14 RADA Rada Electronic Industries Ltd TAS
15 RADA Rada Electronic Industries Ltd Before Market Open
16 DAVA Endava PLC TAS
17 FALC FalconStor Software Inc After Market Close
18 GVP GSE Systems Inc TAS
19 TDG TransDigm Group Inc Before Market Open
20 PPDF PPDAI Group Inc Before Market Open
21 GRBX Greenbox Pos Time Not Supplied
22 THMO Thermogenesis Holdings Inc After Market Close
23 MMS Maximus Inc TAS
24 NXTD NXT-ID Inc TAS
25 URBN Urban Outfitters Inc After Market Close
26 SINT SINTX Technologies Inc Time Not Supplied
27 ORNC Oranco Inc Time Not Supplied
28 LAIX LAIX Inc After Market Close
29 MDT Medtronic PLC TAS
EPS Estimate Reported EPS Surprise(%)
0 0.82 NaN NaN
1 0.69 NaN NaN
2 -0.17 NaN NaN
3 0.28 NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 0.86 NaN NaN
8 0.83 NaN NaN
9 0.33 NaN NaN
10 0.66 NaN NaN
11 2.52 NaN NaN
12 0.29 NaN NaN
13 1.06 NaN NaN
14 -0.02 NaN NaN
15 -0.02 NaN NaN
16 21.21 NaN NaN
17 NaN NaN NaN
18 0.03 NaN NaN
19 5.16 NaN NaN
20 0.26 NaN NaN
21 NaN NaN NaN
22 -0.12 NaN NaN
23 0.94 NaN NaN
24 NaN NaN NaN
25 0.57 NaN NaN
26 NaN NaN NaN
27 NaN NaN NaN
28 -0.32 NaN NaN
29 1.28 NaN NaN "
~EDIT#3~
Made this change as you requested, @Alex:
earnings.to_csv(r'C:\Users\akkir\Desktop\pythonSelenium\export_dataframe.csv', index = None)
OUTPUT
Symbol,Company,Earnings Call Time,EPS Estimate,Reported EPS,Surprise(%)
ATTO,Atento SA,TAS,0.09,0.03,-66.67
ALPN,Alpine Immune Sciences Inc,TAS,-0.68,-0.62,8.82
ALPN,Alpine Immune Sciences Inc,Time Not Supplied,-0.68,-0.62,8.82
HOLI,Hollysys Automation Technologies Ltd,TAS,0.48,0.49,2.08
IDSA,Industrial Services of America Inc,After Market Close,,,
AGRO,Adecoagro SA,TAS,-0.01,,
ATOS,Atossa Genetics Inc,TAS,-0.52,-0.36,30.77
AXAS,Abraxas Petroleum Corp,TAS,0.03,0.02,-33.33
ACIU,AC Immune SA,TAS,0.17,0.25,47.06
ARCO,Arcos Dorados Holdings Inc,TAS,0.08,0.13,62.5
WTER,Alkaline Water Company Inc,Time Not Supplied,-0.07,-0.07,
ALNA,Allena Pharmaceuticals Inc,Before Market Open,-0.49,-0.57,-16.33
AEYE,AudioEye Inc,TAS,-0.26,-0.27,-3.85
APLT,Applied Therapeutics Inc,Before Market Open,-0.49,-0.63,-28.57
ALT,Altimmune Inc,TAS,-0.19,-0.73,-284.21
ABEOW,Abeona Therapeutics Inc,TAS,,,
ACER,Acer Therapeutics Inc,After Market Close,-0.57,-0.52,8.77
SRNN,Southern Banc Company Inc,Time Not Supplied,,,
SPB,Spectrum Brands Holdings Inc,Before Market Open,1.11,1.13,1.8
BIOC,Biocept Inc,TAS,-0.27,-0.25,7.41
IDXG,Interpace Biosciences Inc,TAS,-0.19,-0.19,
GTBP,GT Biopharma Inc,After Market Close,,,
MTNB,Matinas BioPharma Holdings Inc,Time Not Supplied,-0.03,-0.03,
MTNB,Matinas BioPharma Holdings Inc,TAS,-0.03,-0.03,
XELB,Xcel Brands Inc,After Market Close,0.12,0.06,-50.0
BBI,Brickell Biotech Inc,After Market Close,,,
SNBP,Sun Biopharma Inc,Before Market Open,,,
BZH,Beazer Homes USA Inc,TAS,0.51,0.08,-84.31
SELB,Selecta Biosciences Inc,TAS,-0.33,-0.26,21.21
BEST,BEST Inc,Before Market Open,,0.01,
CBPO,China Biologic Products Holdings Inc,TAS,0.88,1.4,59.09
TPCS,TechPrecision Corp,TAS,,,
LK,Luckin Coffee Inc,Before Market Open,-0.37,-0.32,13.51
CYD,China Yuchai International Ltd,Before Market Open,0.45,0.17,-62.22
CCF,Chase Corp,After Market Close,,,
SMCI,Super Micro Computer Inc,After Market Close,,,
AUMN,Golden Minerals Co,TAS,,,
PGR,Progressive Corp,Before Market Open,1.3,1.33,2.31
PUMP,ProPetro Holding Corp,TAS,0.51,0.33,-35.29
CPLG,CorePoint Lodging Inc,TAS,-0.44,-0.22,50.0
CHNG,Change Healthcare Inc,After Market Close,0.27,0.27,
NOVC,Novation Companies Inc,Time Not Supplied,,,
WFCF,Where Food Comes From Inc,Before Market Open,,,
CYCCP,Cyclacel Pharmaceuticals Inc,After Market Close,,,
ISCO,International Stem Cell Corp,Before Market Open,,,
CPA,Copa Holdings SA,TAS,2.23,2.45,9.87
CSCO,Cisco Systems Inc,TAS,0.81,0.84,3.7
GMDA,Gamida Cell Ltd,TAS,-0.36,-0.3,16.67
CHRA,Charah Solutions Inc,TAS,-0.05,-0.11,-120.0
MNI,McClatchy Co,TAS,-1.01,-0.16,84.16
ENSV,Enservco Corp,TAS,-0.06,-0.1,-66.67
TK,Teekay Corp,TAS,,,
SANW,S&W Seed Co,TAS,-0.15,-0.15,
SANW,S&W Seed Co,Before Market Open,-0.15,-0.15,
CMCM,Cheetah Mobile Inc,TAS,0.14,0.49,250.0
CYRN,Cyren Ltd,TAS,-0.07,-0.06,14.29
CATS,Catasys Inc,TAS,-0.32,-0.52,-62.5
GLAD,Gladstone Capital Corp,TAS,0.21,0.21,
PING,Ping Identity Holding Corp,After Market Close,0.01,0.13,1200.0
CRWS,Crown Crafts Inc,Before Market Open,0.18,0.18,
CTRP,Ctrip.Com International Ltd,After Market Close,0.29,,
GFF,Griffon Corp,After Market Close,0.33,0.4,21.21
CLIR,Clearsign Technologies Corp,After Market Close,,,
DMAC,DiaMedica Therapeutics Inc,After Market Close,,,
DSSI,Diamond S Shipping Inc,Time Not Supplied,-0.12,-0.19,-58.33
DSSI,Diamond S Shipping Inc,TAS,-0.12,-0.19,-58.33
DYAI,Dyadic International Inc,After Market Close,,,
ONE,OneSmart International Education Group Ltd,Before Market Open,,,
EFOI,Energy Focus Inc,Before Market Open,-0.15,-0.08,46.67
EDAP,Edap Tms SA,TAS,0.04,0.03,-25.0
EYEN,Eyenovia Inc,Before Market Open,-0.34,-0.29,14.71
EQS,EQUUS Total Return Inc,After Market Close,,,
SENR,Strategic Environmental & Energy Resources Inc,Before Market Open,,,
EPSN,Epsilon Energy Ltd,TAS,,,
GRMM,Grom Social Enterprises Inc,Before Market Open,,,
ECOR,"electroCore, Inc.",TAS,-0.31,-0.36,-16.13
SD,SandRidge Energy Inc,TAS,,,
ENR,Energizer Holdings Inc,TAS,0.81,0.93,14.81
ELMD,Electromed Inc,TAS,0.01,0.12,1100.0
EVK,Ever-Glory International Group Inc,TAS,,,
FTEK,Fuel Tech Inc,After Market Close,-0.03,-0.05,-66.67
FVRR,Fiverr International Ltd,Before Market Open,-0.19,-0.12,36.84
SGRP,SPAR Group Inc,TAS,,,
NSEC,National Security Group Inc,Time Not Supplied,,,
SNDL,Sundial Growers Inc,TAS,-0.08,,
SNDL,Sundial Growers Inc,Before Market Open,-0.08,,
TCOM,Trip.com Group Ltd,TAS,,,
RAVE,Rave Restaurant Group Inc,TAS,,,
SLGG,Super League Gaming Inc,After Market Close,-0.36,-0.43,-19.44
HI,Hillenbrand Inc,After Market Close,0.73,0.76,4.11
HROW,Harrow Health Inc,TAS,-0.24,-0.29,-20.83
NVGS,Navigator Holdings Ltd,TAS,-0.07,-0.01,85.71
INFU,InfuSystem Holdings Inc,Before Market Open,,,
OSW,OneSpaWorld Holdings Ltd,Before Market Open,0.12,0.11,-8.33
VIPS,Vipshop Holdings Ltd,TAS,0.17,0.25,47.06
PRTH,Priority Technology Holdings Inc,After Market Close,-0.12,-0.08,33.33
TGC,Tengasco Inc,TAS,,,
PRSP,Perspecta Inc,After Market Close,0.51,0.54,5.88
REED,Reed's Inc,After Market Close,-0.11,-0.14,-27.27
WSTL,Westell Technologies Inc,After Market Close,,,
As far as I can tell, this has nothing to do with the data and everything to do with the representation. Only the first and last columns are printed, to keep the output from being massive and difficult to read. You can even see at the end of your output that your DataFrame has 9 columns.
Take a look here if you want to print the entire thing. You could also use .info to get some general information on your columns.
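A minimal sketch of the display behaviour described above, using a hypothetical 30-column frame (the column names and the option values 4 and 1000 are assumptions for illustration). It also suggests why the output in EDIT#2 came out as two tables: once every column is shown, pandas wraps the printed frame at display.width and marks the break with a trailing backslash, so widening the display as well keeps everything in one block.

```python
import pandas as pd

# Hypothetical wide frame: 3 rows, 30 columns named col_0 ... col_29.
df = pd.DataFrame(
    [[i * j for j in range(30)] for i in range(3)],
    columns=[f"col_{j}" for j in range(30)],
)

# With a small column limit the repr keeps the outer columns and
# replaces the middle ones with "...". The data itself is untouched.
with pd.option_context("display.max_columns", 4):
    truncated = repr(df)

# No column limit and a wide display: every column is printed in one block.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    full = repr(df)

print(truncated)
print(full)
df.info()  # per-column dtypes and non-null counts, independent of display options
```

The options are only active inside the with blocks, so the rest of a script keeps the default, compact output.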
Thanks to @AlexanderCécile for the help with this issue.
For those interested in how he fixed my issue, the code is below.
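import pandas as pd
from datetime import date
# Display options only affect printing, not to_csv; set_option applies them globally
# (a bare pd.option_context(...) call outside a `with` block has no effect).
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# read the first HTML table on the page into a DataFrame
earnings = pd.read_html('https://finance.yahoo.com/calendar/earnings?day=2019-11-13')[0]
# write every row and column to CSV, without the index
earnings.to_csv(r'C:\Users\<user>\Desktop\earnings_{}.csv'.format(date.today()), index=False)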

df.dropna is not dropping NaN values from rows where all values are NaN

I'm trying to clean a pdf to turn it into a file for geocoding. I've been using tabula-py to rip the pdf and have had pretty good results up until the point of removing rows that are empty entirely. I'm not even sure if this is an efficient way to do this.
I've tried the majority of solutions SO has recommended to me and I still can't quite figure it out. I've set inplace = True, axis = 0 and 1, how="all". Tried indexing out the NaN values and that didn't work either.
import pandas as pd
import tabula
pd.set_option('display.width', 500)
df = tabula.read_pdf("C:\\Users\\Jack\\Documents\\Schoolwork\\Schoolwork\\WICResearch\\RefDocs\\wicclinicdirectory.pdf", pages='all', guess = False, pandas_options={'header': None})
df.columns = ["County_Name", "Clinic_Number", "Clinic_Name", "Address", "City", "Zip_Code", "Phone_Number", "Hours_of_Operation"]
df.drop(["Phone_Number", "Hours_of_Operation"], axis = 1, inplace = True)
#mass of code here that removes unwanted repeated column headers as by product of tabula reading PDFs.
df.drop(["Clinic_Name"], axis = 1, inplace = True)
df[['ClinicNum','ClinicName']] = df.Clinic_Number.apply(lambda x: pd.Series(str(x).split(" ", maxsplit = 1)))
df.drop(["Clinic_Number"], axis = 1, inplace = True)
#df[~df.isin(['NaN', 'NaT']).any(axis=1)]
#df.dropna(axis= 0, how ='all', inplace = True)
NaNIndex = df.index[df.isnull().all(1)]
print(NaNIndex)
print(df)
The above code gives this output:
Index: []
County_Name Address City Zip_Code ClinicNum ClinicName
0 NaN NaN Ohio WIC Clinic Locations NaN nan NaN
1 NaN NaN NaN NaN Clinic NaN
3 Adams 9137 State Route 136 West Union 45693 100 Adams County WIC Program
4 NaN NaN NaN NaN nan NaN
5 NaN NaN NaN NaN nan NaN
6 Allen 940 North Cable Road, Suite 4 Lima 45805 200 Allen County WIC Program
7 Ashland 934 Center Street Ashland 44805 300 Ashland County WIC Program
8 NaN Suite E NaN NaN nan NaN
9 Ashtabula Geneva Community Center Geneva 44041 403 Geneva WIC Clinic
10 NaN 62 West Main Street NaN NaN nan NaN
11 Ashtabula Jefferson United Methodist Church Jefferson 44047 402 Jefferson WIC Clinic
12 NaN 125 East Jefferson Street NaN NaN nan NaN
13 Ashtabula Conneaut Human Resource Center Conneaut 44030 401 Conneaut WIC Clinic
14 NaN 327 Mill Street NaN NaN nan NaN
15 Ashtabula 3225 Lake Avenue Ashtabula 44004 400 Ashtabula County WIC Program
16 NaN NaN NaN NaN nan NaN
18 NaN NaN NaN NaN Clinic NaN
20 Ashtabula St. Mary's Catholic Church Orwell 44076 490 Orwell WIC Clinic
And what I'd like is:
Index: []
County_Name Address City Zip_Code ClinicNum ClinicName
3 Adams 9137 State Route 136 West Union 45693 100 Adams County WIC Program
6 Allen 940 North Cable Road, Suite 4 Lima 45805 200 Allen County WIC Program
7 Ashland 934 Center Street Ashland 44805 300 Ashland County WIC Program
8 NaN Suite E NaN NaN nan NaN
9 Ashtabula Geneva Community Center Geneva 44041 403 Geneva WIC Clinic
10 NaN 62 West Main Street NaN NaN nan NaN
11 Ashtabula Jefferson United Methodist Church Jefferson 44047 402 Jefferson WIC Clinic
12 NaN 125 East Jefferson Street NaN NaN nan NaN
13 Ashtabula Conneaut Human Resource Center Conneaut 44030 401 Conneaut WIC Clinic
14 NaN 327 Mill Street NaN NaN nan NaN
15 Ashtabula 3225 Lake Avenue Ashtabula 44004 400 Ashtabula County WIC Program
18 NaN NaN NaN NaN Clinic NaN
20 Ashtabula St. Mary's Catholic Church Orwell 44076 490 Orwell WIC Clinic
I am able to create the data frame I want with the correct headings, but it still does not remove the all-NaN rows, or it removes everything. I'd also like to merge the rows that are only partly NaN into the corresponding rows so that each clinic is on one line.
I'm also not sure how reproducible I can get this as I have fiddled around with tabula quite a bit trying to get this pdf converted.
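One detail in the output above may explain the behaviour: the ClinicNum column shows a lowercase nan, which is what str() returns for a missing value. That is an ordinary string, so isnull() and dropna(how='all') do not count it as missing and the row is never "all NaN". A minimal sketch, assuming a hypothetical three-row miniature of the frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the frame in the question. ClinicNum was built
# with str(x), so the missing entry became the string "nan".
df = pd.DataFrame({
    "County_Name": ["Adams", np.nan, "Allen"],
    "Zip_Code": ["45693", np.nan, "45805"],
    "ClinicNum": [str(x) for x in ["100", np.nan, "200"]],
})

# Row 1 survives: "nan" is a string, so the row is not entirely missing.
dropped = df.dropna(axis=0, how="all")

# Turn the placeholder string back into a real missing value, then drop.
cleaned = df.replace("nan", np.nan).dropna(axis=0, how="all")

print(len(dropped), len(cleaned))
```

Rows that only hold leftover header text such as Clinic would still need their own filter; this sketch only covers the string-versus-NaN issue.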

Python Pandas pivot with values equal to simple function of specific column

import pandas as pd
olympics = pd.read_csv('olympics.csv')
Edition NOC Medal
0 1896 AUT Silver
1 1896 FRA Gold
2 1896 GER Gold
3 1900 HUN Bronze
4 1900 GBR Gold
5 1900 DEN Bronze
6 1900 USA Gold
7 1900 FRA Bronze
8 1900 FRA Silver
9 1900 USA Gold
10 1900 FRA Silver
11 1900 GBR Gold
12 1900 SUI Silver
13 1900 ZZX Gold
14 1904 HUN Gold
15 1904 USA Bronze
16 1904 USA Gold
17 1904 USA Silver
18 1904 CAN Gold
19 1904 USA Silver
I can pivot the data frame to have some aggregate value
pivot = olympics.pivot_table(index='Edition', columns='NOC', values='Medal', aggfunc='count')
NOC AUT CAN DEN FRA GBR GER HUN SUI USA ZZX
Edition
1896 1.0 NaN NaN 1.0 NaN 1.0 NaN NaN NaN NaN
1900 NaN NaN 1.0 3.0 2.0 NaN 1.0 1.0 2.0 1.0
1904 NaN 1.0 NaN NaN NaN NaN 1.0 NaN 4.0 NaN
Rather than the total number of medals as the value, I would like a tuple (a triple) of (#Gold, #Silver, #Bronze), with (0, 0, 0) in place of NaN.
How do I do that succinctly and elegantly?
There is no need to use pivot_table here, as a plain pivot (unstack) is perfectly fine with a tuple as the value. The steps:
use value_counts to count the medals per Edition and NOC
create a MultiIndex of all combinations of editions, countries and medals
reindex with fill_value=0
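# count the medals of each type per (Edition, NOC); `olympics` is the frame from the question
counts = olympics.groupby(['Edition', 'NOC']).Medal.value_counts()
# build every (Edition, NOC, Medal) combination so that missing ones become 0
mux = pd.MultiIndex.from_product(
    [c.values for c in counts.index.levels], names=counts.index.names)
counts = counts.reindex(mux, fill_value=0).unstack('Medal')
# the column order here determines the order inside each tuple
counts = counts[['Bronze', 'Silver', 'Gold']]
# one tuple per (Edition, NOC), then spread NOC across the columns
pd.Series([tuple(l) for l in counts.values.tolist()], counts.index).unstack()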
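For anyone who wants to try the idea without the CSV file, here is a self-contained sketch on a hypothetical five-row sample. It swaps in pd.crosstab for the groupby/value_counts step (an alternative, not the code above) and orders each tuple as (Gold, Silver, Bronze), as asked in the question.

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's data.
olympics = pd.DataFrame({
    "Edition": [1896, 1896, 1900, 1900, 1900],
    "NOC": ["AUT", "FRA", "FRA", "FRA", "USA"],
    "Medal": ["Silver", "Gold", "Bronze", "Silver", "Gold"],
})

order = ["Gold", "Silver", "Bronze"]

# Rows: the (Edition, NOC) pairs that occur; columns: medal type.
counts = pd.crosstab([olympics["Edition"], olympics["NOC"]], olympics["Medal"])
counts = counts.reindex(columns=order, fill_value=0)

# One tuple per (Edition, NOC), then spread NOC across the columns.
triples = pd.Series([tuple(row) for row in counts.to_numpy().tolist()],
                    index=counts.index)
table = triples.unstack("NOC")

# Pairs that never occur are NaN after unstack; replace them with (0, 0, 0).
table = table.apply(
    lambda col: col.map(lambda v: v if isinstance(v, tuple) else (0, 0, 0)))

print(table)
```

The element-wise map is used for the final fill because fillna does not accept a tuple as the fill value.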
