pandas Basics 15: Importing Data From Web Site

by 다니엘의 개발 이야기 2022. 7. 28.

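In principle, this chapter covers ground I already know from selenium and BeautifulSoup, but I had never seen it implemented this way, which made it interesting.

Before I could even start, an ImportError for lxml came up, which only made me more curious.

If something highly advanced had failed, I would have shrugged and moved on, but this looked like the most basic of basic features, and I could not just leave it broken.

In the end I found a fix and got the code to run.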
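
The post does not record which fix was used. As a minimal sketch of how one might diagnose it: `pd.read_html` depends on optional parser packages (it tries `lxml` first and can fall back to `bs4` together with `html5lib`), so an ImportError mentioning lxml usually means the package is missing from the active environment. The helper name `available_html_parsers` below is hypothetical, and the assumption is that installing the missing package (for example with `pip install lxml`) resolves the error.

```python
from importlib.util import find_spec


def available_html_parsers():
    """Hypothetical helper: list the read_html flavors whose packages are importable."""
    flavors = []
    # The default flavor needs the lxml package.
    if find_spec("lxml") is not None:
        flavors.append("lxml")
    # The fallback flavor needs both bs4 (BeautifulSoup) and html5lib.
    if find_spec("bs4") is not None and find_spec("html5lib") is not None:
        flavors.append("bs4")
    return flavors


# An empty list means pd.read_html has no parser to work with in this environment.
print(available_html_parsers())
```

If the list comes back empty, installing one of these parser packages into the same environment that runs the notebook is the likely next step.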

 

# Importing Data from the Web with pd.read_html()

import pandas as pd

url = 'https://en.wikipedia.org/wiki/1976_Summer_Olympics_medal_table'
pd.read_html(url)

'''
[           1976 Summer Olympics medals        1976 Summer Olympics medals.1
 0                             Location                     Montreal, Canada
 1                           Highlights                           Highlights
 2                     Most gold medals                    Soviet Union (49)
 3                    Most total medals                   Soviet Union (125)
 4  ← 1972 Olympics medal tables 1980 →  ← 1972 Olympics medal tables 1980 →,
                                                    0
 0                                Part of a series on
 1                               1976 Summer Olympics
 2  Bid process (bid details) Boycott Development ...
 3  .mw-parser-output .navbar{display:inline;font-...,
                 Rank                  NOC  Gold  Silver  Bronze  Total
 0                  1         Soviet Union    49      41      35    125
 1                  2         East Germany    40      25      25     90
 2                  3        United States    34      35      25     94
 3                  4         West Germany    10      12      17     39
 4                  5                Japan     9       6      10     25
 5                  6               Poland     7       6      13     26
 6                  7             Bulgaria     6       9       7     22
 7                  8                 Cuba     6       4       3     13
 8                  9              Romania     4       9      14     27
 9                 10              Hungary     4       5      13     22
 10                11              Finland     4       2       0      6
 11                12               Sweden     4       1       0      5
 12                13        Great Britain     3       5       5     13
 13                14                Italy     2       7       4     13
 14                15               France     2       3       4      9
 15                16           Yugoslavia     2       3       3      8
 16                17       Czechoslovakia     2       2       4      8
 17                18          New Zealand     2       1       1      4
 18                19          South Korea     1       1       4      6
 19                20          Switzerland     1       1       2      4
 20                21              Jamaica     1       1       0      2
 21                21          North Korea     1       1       0      2
 22                21               Norway     1       1       0      2
 23                24              Denmark     1       0       2      3
 24                25               Mexico     1       0       1      2
 25                26  Trinidad and Tobago     1       0       0      1
 26                27              Canada*     0       5       6     11
 27                28              Belgium     0       3       3      6
 28                29          Netherlands     0       2       3      5
 29                30             Portugal     0       2       0      2
 30                30                Spain     0       2       0      2
 31                32            Australia     0       1       4      5
 32                33                 Iran     0       1       1      2
 33                34             Mongolia     0       1       0      1
 34                34            Venezuela     0       1       0      1
 35                36               Brazil     0       0       2      2
 36                37              Austria     0       0       1      1
 37                37              Bermuda     0       0       1      1
 38                37             Pakistan     0       0       1      1
 39                37          Puerto Rico     0       0       1      1
 40                37             Thailand     0       0       1      1
 41  Totals (41 NOCs)     Totals (41 NOCs)   198     199     216    613,
                Olympics             Athlete   Country  Medal  \
 0  1976 Summer Olympics   Valentin Khristov  Bulgaria    NaN   
 1  1976 Summer Olympics      Blagoy Blagoev  Bulgaria    NaN   
 2  1976 Summer Olympics  Zbigniew Kaczmarek    Poland    NaN   
 
                           Event   Ref  
 0   Weightlifting, Men's 110 kg  [11]  
 1  Weightlifting, Men's 82.5 kg  [12]  
 2  Weightlifting, Men's 67.5 kg  [13]  ,
                       vte Olympic Games medal tables  \
 0  Olympic medal All-time Olympic Games medal tab...   
 1                               Summer Olympic Games   
 2                               Winter Olympic Games   
 3  Lists of Olympic medalists List of stripped Ol...   
 
                     vte Olympic Games medal tables.1  
 0  Olympic medal All-time Olympic Games medal tab...  
 1  1896 1900 1904 1908 1912 1920 1924 1928 1932 1...  
 2  1924 1928 1932 1936 1948 1952 1956 1960 1964 1...  
 3  Lists of Olympic medalists List of stripped Ol...  ,
      vte Summer Olympics medal table leaders by year  \
 0  .mw-parser-output .div-col{margin-top:0.3em;co...   
 
    vte Summer Olympics medal table leaders by year.1  
 0  .mw-parser-output .div-col{margin-top:0.3em;co...  ]
'''
type(pd.read_html(url))
# list
pd.read_html(url)[0]

'''
    1976 Summer Olympics medals	1976 Summer Olympics medals.1
0	Location	Montreal, Canada
1	Highlights	Highlights
2	Most gold medals	Soviet Union (49)
3	Most total medals	Soviet Union (125)
4	← 1972 Olympics medal tables 1980 →	← 1972 Olympics medal tables 1980 →
'''
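# This was the key point of this lecture.
# When the instructor used index 0, the table with Rank and the other columns came back, but for me index 0 returned something else,
# so I simply went through the list in order.
# It turned out that the table at index 2 was the one I wanted, the one containing Rank.

# I checked https://en.wikipedia.org/wiki/1976_Summer_Olympics_medal_table
# and saw that a few tables had quietly been added to the page.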

wik_1976 = pd.read_html(url)[2]
wik_1976.head()

'''
Rank	NOC	Gold	Silver	Bronze	Total
0	1	Soviet Union	49	41	35	125
1	2	East Germany	40	25	25	90
2	3	United States	34	35	25	94
3	4	West Germany	10	12	17	39
4	5	Japan	9	6	10	25
'''
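
Since the medal table moved from index 0 to index 2 as the page changed, a hard-coded index may break again. One possible alternative, sketched here, is to pick the table by its column names instead. The helper `pick_table` is hypothetical (not part of pandas); it only assumes each table exposes a `.columns` attribute, as a DataFrame does.

```python
def pick_table(tables, required_columns):
    """Hypothetical helper: return the first table containing all required columns."""
    required = set(required_columns)
    for table in tables:
        # Compare as strings so integer or tuple column labels do not raise.
        labels = {str(label) for label in table.columns}
        if required <= labels:
            return table
    raise ValueError(f"no table has columns {sorted(required)}")


# Usage with the page from this post (needs network access):
# import pandas as pd
# wik_1976 = pick_table(pd.read_html(url), ["Rank", "Gold", "Silver", "Bronze", "Total"])
```

`pd.read_html` also accepts a `match` argument (a string or regex) that keeps only tables containing matching text, which can narrow the list before any selection like this.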
wik_1976.info()

'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Rank    42 non-null     object
 1   NOC     42 non-null     object
 2   Gold    42 non-null     int64 
 3   Silver  42 non-null     int64 
 4   Bronze  42 non-null     int64 
 5   Total   42 non-null     int64 
dtypes: int64(4), object(2)
memory usage: 2.1+ KB
'''
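
The `info()` output above lists `Rank` as `object` rather than `int64`, most likely because the last row holds the text "Totals (41 NOCs)" in that column. A minimal sketch of one way to handle this, assuming the totals row is not needed: coerce `Rank` to numbers and drop the rows that fail to parse. The helper name `drop_non_numeric_ranks` and the small `sample` frame are illustrative stand-ins, not part of the original lecture.

```python
import pandas as pd

# Stand-in for the scraped table; the three rows are copied from the output above.
sample = pd.DataFrame(
    {
        "Rank": ["1", "2", "Totals (41 NOCs)"],
        "NOC": ["Soviet Union", "East Germany", "Totals (41 NOCs)"],
        "Gold": [49, 40, 198],
    }
)


def drop_non_numeric_ranks(df, rank_col="Rank"):
    """Hypothetical helper: keep only rows whose rank parses as a number."""
    # Entries such as "Totals (41 NOCs)" cannot be parsed and become NaN.
    rank = pd.to_numeric(df[rank_col], errors="coerce")
    keep = rank.notna()
    cleaned = df.loc[keep].copy()
    # Store the parsed ranks as integers in the copy.
    cleaned[rank_col] = rank[keep].astype("int64")
    return cleaned


cleaned = drop_non_numeric_ranks(sample)
print(cleaned.dtypes)
```

The same call could be applied to `wik_1976` before saving, if an integer `Rank` column is preferred.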


url2 = 'https://en.wikipedia.org/wiki/1996_Summer_Olympics_medal_table'
pd.read_html(url2)

'''
[           1996 Summer Olympics medals        1996 Summer Olympics medals.1
 0                             Location               Atlanta, United States
 1                           Highlights                           Highlights
 2                     Most gold medals                   United States (44)
 3                    Most total medals                  United States (101)
 4  ← 1992 Olympics medal tables 2000 →  ← 1992 Olympics medal tables 2000 →,
                                                    0
 0                                Part of a series on
 1                               1996 Summer Olympics
 2  Bid process (bid details) Venues Marketing (ma...
 3  .mw-parser-output .navbar{display:inline;font-...,
                    Rank               Nation  Gold  Silver  Bronze  Total
 0                     1       United States*    44      32      25    101
 1                     2               Russia    26      21      16     63
 2                     3              Germany    20      18      27     65
 3                     4                China    16      22      12     50
 4                     5               France    15       7      15     37
 ..                  ...                  ...   ...     ...     ...    ...
 75                   71           Mozambique     0       0       1      1
 76                   71          Puerto Rico     0       0       1      1
 77                   71              Tunisia     0       0       1      1
 78                   71               Uganda     0       0       1      1
 79  Totals (79 nations)  Totals (79 nations)   271     273     298    842
 
 [80 rows x 6 columns],
                       vte Olympic Games medal tables  \
 0  Olympic medal All-time Olympic Games medal tab...   
 1                               Summer Olympic Games   
 2                               Winter Olympic Games   
 3  Lists of Olympic medalists List of stripped Ol...   
 
                     vte Olympic Games medal tables.1  
 0  Olympic medal All-time Olympic Games medal tab...  
 1  1896 1900 1904 1908 1912 1920 1924 1928 1932 1...  
 2  1924 1928 1932 1936 1948 1952 1956 1960 1964 1...  
 3  Lists of Olympic medalists List of stripped Ol...  ,
      vte Summer Olympics medal table leaders by year  \
 0  .mw-parser-output .div-col{margin-top:0.3em;co...   
 
    vte Summer Olympics medal table leaders by year.1  
 0  .mw-parser-output .div-col{margin-top:0.3em;co...  ]
'''
pd.read_html(url2)[2]

'''

    Rank	Nation	Gold	Silver	Bronze	Total
0	1	United States*	44	32	25	101
1	2	Russia	26	21	16	63
2	3	Germany	20	18	27	65
3	4	China	16	22	12	50
4	5	France	15	7	15	37
...	...	...	...	...	...	...
75	71	Mozambique	0	0	1	1
76	71	Puerto Rico	0	0	1	1
77	71	Tunisia	0	0	1	1
78	71	Uganda	0	0	1	1
79	Totals (79 nations)	Totals (79 nations)	271	273	298	842
80 rows × 6 columns
'''
wik_1996 = pd.read_html(url2)[2]
wik_1996.info()

'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Rank    80 non-null     object
 1   Nation  80 non-null     object
 2   Gold    80 non-null     int64 
 3   Silver  80 non-null     int64 
 4   Bronze  80 non-null     int64 
 5   Total   80 non-null     int64 
dtypes: int64(4), object(2)
memory usage: 3.9+ KB
'''
wik_1976.to_csv('wik_1976.csv', index = False)
wik_1996.to_csv('wik_1996.csv', index = False)
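
Each `pd.read_html(url)` call above downloads and parses the page again. A small sketch of one way to avoid the repeated requests is to wrap the reader in a cache so every URL is fetched once. The wrapper `make_cached_reader` is hypothetical and works with any one-argument function that returns a list of tables.

```python
from functools import lru_cache


def make_cached_reader(reader):
    """Hypothetical wrapper: call reader(url) at most once per distinct URL."""

    @lru_cache(maxsize=None)
    def cached(url):
        # A tuple keeps the cached container itself from being reordered.
        # The tables inside are still shared objects, so copy before modifying them.
        return tuple(reader(url))

    return cached


# Usage with the URLs from this post (needs network access):
# import pandas as pd
# read_tables = make_cached_reader(pd.read_html)
# wik_1976 = read_tables(url)[2].copy()
# wik_1996 = read_tables(url2)[2].copy()
```

Storing the result of a single `pd.read_html(url)` call in a variable achieves the same thing for a short notebook; the cache is mainly useful when the same pages are read from several places.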