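In principle, this chapter covers ground I already know from selenium and BeautifulSoup, but I had never seen it implemented this way, which made it interesting.
Then, before I had even started, an ImportError about lxml came up, which made it more interesting still.
If the error had come from some elaborate feature, I would have shrugged and moved on,
but this looked like the most basic of basic functions failing to run,
so I could not just leave it alone.
In the end I found a fix and got it running.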
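The post does not spell out what the fix was. As a minimal sketch, assuming the cause was simply a missing parser package: `pd.read_html()` tries lxml first and otherwise falls back to bs4 plus html5lib, so it can help to check which of those can be imported before calling it. The helper name `missing_html_parsers` is made up for this illustration.

```python
import importlib.util


def missing_html_parsers(candidates=("lxml", "bs4", "html5lib")):
    """Return the names in `candidates` that cannot be imported in this environment."""
    return [name for name in candidates if importlib.util.find_spec(name) is None]


missing = missing_html_parsers()
if "lxml" in missing:
    # Assumption: installing the package is the usual remedy, e.g. `pip install lxml`
    print("lxml is not installed; try: pip install lxml")
else:
    print("lxml is available")
```

If lxml shows up as missing, installing it and restarting the Python session (or notebook kernel) is the usual next step.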
# Importing Data from the Web with pd.read_html()
import pandas as pd
url = 'https://en.wikipedia.org/wiki/1976_Summer_Olympics_medal_table'
pd.read_html(url)
'''
[ 1976 Summer Olympics medals 1976 Summer Olympics medals.1
0 Location Montreal, Canada
1 Highlights Highlights
2 Most gold medals Soviet Union (49)
3 Most total medals Soviet Union (125)
4 ← 1972 Olympics medal tables 1980 → ← 1972 Olympics medal tables 1980 →,
0
0 Part of a series on
1 1976 Summer Olympics
2 Bid process (bid details) Boycott Development ...
3 .mw-parser-output .navbar{display:inline;font-...,
Rank NOC Gold Silver Bronze Total
0 1 Soviet Union 49 41 35 125
1 2 East Germany 40 25 25 90
2 3 United States 34 35 25 94
3 4 West Germany 10 12 17 39
4 5 Japan 9 6 10 25
5 6 Poland 7 6 13 26
6 7 Bulgaria 6 9 7 22
7 8 Cuba 6 4 3 13
8 9 Romania 4 9 14 27
9 10 Hungary 4 5 13 22
10 11 Finland 4 2 0 6
11 12 Sweden 4 1 0 5
12 13 Great Britain 3 5 5 13
13 14 Italy 2 7 4 13
14 15 France 2 3 4 9
15 16 Yugoslavia 2 3 3 8
16 17 Czechoslovakia 2 2 4 8
17 18 New Zealand 2 1 1 4
18 19 South Korea 1 1 4 6
19 20 Switzerland 1 1 2 4
20 21 Jamaica 1 1 0 2
21 21 North Korea 1 1 0 2
22 21 Norway 1 1 0 2
23 24 Denmark 1 0 2 3
24 25 Mexico 1 0 1 2
25 26 Trinidad and Tobago 1 0 0 1
26 27 Canada* 0 5 6 11
27 28 Belgium 0 3 3 6
28 29 Netherlands 0 2 3 5
29 30 Portugal 0 2 0 2
30 30 Spain 0 2 0 2
31 32 Australia 0 1 4 5
32 33 Iran 0 1 1 2
33 34 Mongolia 0 1 0 1
34 34 Venezuela 0 1 0 1
35 36 Brazil 0 0 2 2
36 37 Austria 0 0 1 1
37 37 Bermuda 0 0 1 1
38 37 Pakistan 0 0 1 1
39 37 Puerto Rico 0 0 1 1
40 37 Thailand 0 0 1 1
41 Totals (41 NOCs) Totals (41 NOCs) 198 199 216 613,
Olympics Athlete Country Medal \
0 1976 Summer Olympics Valentin Khristov Bulgaria NaN
1 1976 Summer Olympics Blagoy Blagoev Bulgaria NaN
2 1976 Summer Olympics Zbigniew Kaczmarek Poland NaN
Event Ref
0 Weightlifting, Men's 110 kg [11]
1 Weightlifting, Men's 82.5 kg [12]
2 Weightlifting, Men's 67.5 kg [13] ,
vte Olympic Games medal tables \
0 Olympic medal All-time Olympic Games medal tab...
1 Summer Olympic Games
2 Winter Olympic Games
3 Lists of Olympic medalists List of stripped Ol...
vte Olympic Games medal tables.1
0 Olympic medal All-time Olympic Games medal tab...
1 1896 1900 1904 1908 1912 1920 1924 1928 1932 1...
2 1924 1928 1932 1936 1948 1952 1956 1960 1964 1...
3 Lists of Olympic medalists List of stripped Ol... ,
vte Summer Olympics medal table leaders by year \
0 .mw-parser-output .div-col{margin-top:0.3em;co...
vte Summer Olympics medal table leaders by year.1
0 .mw-parser-output .div-col{margin-top:0.3em;co... ]
'''
type(pd.read_html(url))
# list
pd.read_html(url)[0]
'''
1976 Summer Olympics medals 1976 Summer Olympics medals.1
0 Location Montreal, Canada
1 Highlights Highlights
2 Most gold medals Soviet Union (49)
3 Most total medals Soviet Union (125)
4 ← 1972 Olympics medal tables 1980 → ← 1972 Olympics medal tables 1980 →
'''
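# This was the key point of the lecture for me.
# When the instructor used index 0, the table with Rank and the other columns came back, but for me index 0 returned something else,
# so I went through the list one index at a time.
# It turned out that index 2 holds the table with the Rank column I was after.
# Checking https://en.wikipedia.org/wiki/1976_Summer_Olympics_medal_table
# showed that a few extra tables had quietly been added to the page.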
wik_1976 = pd.read_html(url)[2]
wik_1976.head()
'''
Rank NOC Gold Silver Bronze Total
0 1 Soviet Union 49 41 35 125
1 2 East Germany 40 25 25 90
2 3 United States 34 35 25 94
3 4 West Germany 10 12 17 39
4 5 Japan 9 6 10 25
'''
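Since the position of the medal table changed after the lecture was recorded, a hard-coded index like `[2]` may break again. One possible way around that is to pick the table by its column names instead. This is only a sketch: `find_table` is a hypothetical helper, and the small DataFrames below are stand-ins for the list that `pd.read_html(url)` returns, so the example runs without network access.

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return (index, DataFrame) for the first table that has all required columns."""
    required = set(required_columns)
    for i, table in enumerate(tables):
        # Column labels can be non-strings (e.g. 0), so compare them as text
        if required.issubset(set(map(str, table.columns))):
            return i, table
    raise ValueError(f"no table has columns {sorted(required)}")


# Stand-ins for the first three tables of the 1976 page shown above
tables = [
    pd.DataFrame({
        "1976 Summer Olympics medals": ["Location"],
        "1976 Summer Olympics medals.1": ["Montreal, Canada"],
    }),
    pd.DataFrame({0: ["Part of a series on"]}),
    pd.DataFrame({
        "Rank": ["1", "2"],
        "NOC": ["Soviet Union", "East Germany"],
        "Gold": [49, 40],
        "Total": [125, 90],
    }),
]

idx, medal_table = find_table(tables, ["Rank", "Gold", "Total"])
print(idx)  # 2
```

With the real page, the same helper could be used as `tables = pd.read_html(url)` followed by `find_table(tables, ["Rank", "Gold", "Total"])`, which also fetches the page only once. `pd.read_html()` additionally accepts a `match` argument that keeps only the tables containing the given text, which may be enough on its own.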
wik_1976.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rank 42 non-null object
1 NOC 42 non-null object
2 Gold 42 non-null int64
3 Silver 42 non-null int64
4 Bronze 42 non-null int64
5 Total 42 non-null int64
dtypes: int64(4), object(2)
memory usage: 2.1+ KB
'''
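In the `info()` output, `Rank` is `object` rather than `int64`, most likely because the last row holds the text `Totals (41 NOCs)` in that column. A rough sketch of one way to handle this, assuming the totals row is not needed: convert `Rank` to numbers, drop the rows that fail to convert, and cast the rest to integers. The `sample` frame is a three-row stand-in built from values in the output above, not the full table.

```python
import pandas as pd

# Three rows copied from the output above; the last one is the totals row
sample = pd.DataFrame({
    "Rank": ["1", "2", "Totals (41 NOCs)"],
    "NOC": ["Soviet Union", "East Germany", "Totals (41 NOCs)"],
    "Gold": [49, 40, 198],
    "Silver": [41, 25, 199],
    "Bronze": [35, 25, 216],
    "Total": [125, 90, 613],
})

# Non-numeric ranks (the totals row) become NaN instead of raising an error
rank = pd.to_numeric(sample["Rank"], errors="coerce")

# Keep only the rows with a numeric rank, then store Rank as integers
cleaned = sample[rank.notna()].copy()
cleaned["Rank"] = rank[rank.notna()].astype("int64")

print(cleaned.dtypes["Rank"])  # int64
```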
url2 = 'https://en.wikipedia.org/wiki/1996_Summer_Olympics_medal_table'
pd.read_html(url2)
'''
[ 1996 Summer Olympics medals 1996 Summer Olympics medals.1
0 Location Atlanta, United States
1 Highlights Highlights
2 Most gold medals United States (44)
3 Most total medals United States (101)
4 ← 1992 Olympics medal tables 2000 → ← 1992 Olympics medal tables 2000 →,
0
0 Part of a series on
1 1996 Summer Olympics
2 Bid process (bid details) Venues Marketing (ma...
3 .mw-parser-output .navbar{display:inline;font-...,
Rank Nation Gold Silver Bronze Total
0 1 United States* 44 32 25 101
1 2 Russia 26 21 16 63
2 3 Germany 20 18 27 65
3 4 China 16 22 12 50
4 5 France 15 7 15 37
.. ... ... ... ... ... ...
75 71 Mozambique 0 0 1 1
76 71 Puerto Rico 0 0 1 1
77 71 Tunisia 0 0 1 1
78 71 Uganda 0 0 1 1
79 Totals (79 nations) Totals (79 nations) 271 273 298 842
[80 rows x 6 columns],
vte Olympic Games medal tables \
0 Olympic medal All-time Olympic Games medal tab...
1 Summer Olympic Games
2 Winter Olympic Games
3 Lists of Olympic medalists List of stripped Ol...
vte Olympic Games medal tables.1
0 Olympic medal All-time Olympic Games medal tab...
1 1896 1900 1904 1908 1912 1920 1924 1928 1932 1...
2 1924 1928 1932 1936 1948 1952 1956 1960 1964 1...
3 Lists of Olympic medalists List of stripped Ol... ,
vte Summer Olympics medal table leaders by year \
0 .mw-parser-output .div-col{margin-top:0.3em;co...
vte Summer Olympics medal table leaders by year.1
0 .mw-parser-output .div-col{margin-top:0.3em;co... ]
'''
pd.read_html(url2)[2]
'''
Rank Nation Gold Silver Bronze Total
0 1 United States* 44 32 25 101
1 2 Russia 26 21 16 63
2 3 Germany 20 18 27 65
3 4 China 16 22 12 50
4 5 France 15 7 15 37
... ... ... ... ... ... ...
75 71 Mozambique 0 0 1 1
76 71 Puerto Rico 0 0 1 1
77 71 Tunisia 0 0 1 1
78 71 Uganda 0 0 1 1
79 Totals (79 nations) Totals (79 nations) 271 273 298 842
80 rows × 6 columns
'''
wik_1996 = pd.read_html(url2)[2]
wik_1996.info()
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rank 80 non-null object
1 Nation 80 non-null object
2 Gold 80 non-null int64
3 Silver 80 non-null int64
4 Bronze 80 non-null int64
5 Total 80 non-null int64
dtypes: int64(4), object(2)
memory usage: 3.9+ KB
'''
wik_1976.to_csv('wik_1976.csv', index=False)
wik_1996.to_csv('wik_1996.csv', index=False)
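To see what `index=False` does, here is a small round-trip sketch: write a DataFrame to CSV and read it back with `pd.read_csv()`. The file name `wik_sample.csv` and the two-row `sample` frame are made up for this example, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for the medal table
sample = pd.DataFrame({
    "NOC": ["Soviet Union", "East Germany"],
    "Gold": [49, 40],
    "Total": [125, 90],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wik_sample.csv")
    # index=False leaves the row index out of the file
    sample.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(list(restored.columns))  # ['NOC', 'Gold', 'Total']
print(restored.shape)          # (2, 3)
```

Without `index=False`, the row index would be written as an extra unnamed first column and would come back as `Unnamed: 0` when the file is read.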