본문 바로가기
개발일지/Pandas

pandas 판다스 기초 24, **매우 중요** heatmap, plot, sns, 등 순위및 종류 분석 (근데 호흡이 너무 김...)

by 개발에정착하고싶다 2022. 8. 22.
320x100

흐음... 정말 집중력 어떻게 안되나....

호흡이 이렇게 긴 강의를 들을때는 그냥 속수 무책이다.

통계관련 강의가 나오기도 해서 대부분은 알아듣고, 이해가 안되는 부분도 있지만

근본적으로는 흥미를 잃는 느낌이랄까

 

한편으로는 흥미가 만땅이지만 한편으로는 흥미가 없달까..

 

거참...

 

내가 작성했던 코드도 있는데,

결과적으로 봤을때 그건 무의미할 정도로 호흡이 너무 길다.

따라서 내가 썼던 코드는 생략하도록 하겠다.


# Explanatory Data Analysis Challenge (Olympic Games)
## Data Import and Inspection

Import the Datasets Summer (__summer.csv__), Winter (__winter.csv__) and dictionary (__dictionary.csv__) and Inspect! 

import pandas as pd

summer = pd.read_csv('summer.csv')
winter = pd.read_csv('winter.csv')
dic = pd.read_csv('dictionary.csv')


## Merging and Concatenating

# 1. __Merge__ Summer and Winter (one row for each Medal awarded in any Olympic Games) and save the merged DataFrame in __olympics__. 
# 2. An __additional column__ (e.g. "Edition") shall indicate the Edition -> __Summer or Winter__.
# 3. Add the __full Country name__ from dictionary to olympics (e.g. France for FRA).

pd.concat([summer, winter], axis = 0, keys = ['Summer', 'Winter'],
         names = ['Edition']).reset_index().drop(columns = 'level_1')
'''

Edition	Year	City	Sport	Discipline	Athlete	Country	Gender	Event	Medal
0	Summer	1896	Athens	Aquatics	Swimming	HAJOS, Alfred	HUN	Men	100M Freestyle	Gold
1	Summer	1896	Athens	Aquatics	Swimming	HERSCHMANN, Otto	AUT	Men	100M Freestyle	Silver
2	Summer	1896	Athens	Aquatics	Swimming	DRIVAS, Dimitrios	GRE	Men	100M Freestyle For Sailors	Bronze
3	Summer	1896	Athens	Aquatics	Swimming	MALOKINIS, Ioannis	GRE	Men	100M Freestyle For Sailors	Gold
4	Summer	1896	Athens	Aquatics	Swimming	CHASAPIS, Spiridon	GRE	Men	100M Freestyle For Sailors	Silver
...	...	...	...	...	...	...	...	...	...	...
36930	Winter	2014	Sochi	Skiing	Snowboard	JONES, Jenny	GBR	Women	Slopestyle	Bronze
36931	Winter	2014	Sochi	Skiing	Snowboard	ANDERSON, Jamie	USA	Women	Slopestyle	Gold
36932	Winter	2014	Sochi	Skiing	Snowboard	MALTAIS, Dominique	CAN	Women	Snowboard Cross	Silver
36933	Winter	2014	Sochi	Skiing	Snowboard	SAMKOVA, Eva	CZE	Women	Snowboard Cross	Gold
36934	Winter	2014	Sochi	Skiing	Snowboard	TRESPEUCH, Chloe	FRA	Women	Snowboard Cross	Bronze
36935 rows × 10 columns
'''
# axis 가 0이면 행을 기준으로 움직인다.
# 멀티인덱스로써 keys를 지정해줬다.
# 해당 멀티인덱스의 이름은 names다

pd.concat([summer, winter], axis = 0, keys = ['Summer', 'Winter'],
         names = ['Edition'])

'''
	Year	City	Sport	Discipline	Athlete	Country	Gender	Event	Medal
Edition										
Summer	0	1896	Athens	Aquatics	Swimming	HAJOS, Alfred	HUN	Men	100M Freestyle	Gold
1	1896	Athens	Aquatics	Swimming	HERSCHMANN, Otto	AUT	Men	100M Freestyle	Silver
2	1896	Athens	Aquatics	Swimming	DRIVAS, Dimitrios	GRE	Men	100M Freestyle For Sailors	Bronze
3	1896	Athens	Aquatics	Swimming	MALOKINIS, Ioannis	GRE	Men	100M Freestyle For Sailors	Gold
4	1896	Athens	Aquatics	Swimming	CHASAPIS, Spiridon	GRE	Men	100M Freestyle For Sailors	Silver
...	...	...	...	...	...	...	...	...	...	...
Winter	5765	2014	Sochi	Skiing	Snowboard	JONES, Jenny	GBR	Women	Slopestyle	Bronze
5766	2014	Sochi	Skiing	Snowboard	ANDERSON, Jamie	USA	Women	Slopestyle	Gold
5767	2014	Sochi	Skiing	Snowboard	MALTAIS, Dominique	CAN	Women	Snowboard Cross	Silver
5768	2014	Sochi	Skiing	Snowboard	SAMKOVA, Eva	CZE	Women	Snowboard Cross	Gold
5769	2014	Sochi	Skiing	Snowboard	TRESPEUCH, Chloe	FRA	Women	Snowboard Cross	Bronze
36935 rows × 9 columns
'''
# reset_index를 해주면 멀티인덱스로 잡아놨던것들이 자동으로 다 풀린다.
# 아직은 drop(columns)에 쓰인 level_1이 어디서 비롯되었는지 잘 모르겠다.

olympics = pd.concat([summer, winter], axis = 0, keys= ['Summer', 'Winter'],
         names = ['Edition']).reset_index().drop(columns = 'level_1')
olympics
dic.head()

'''
Country	Code	Population	GDP per Capita
0	Afghanistan	AFG	32526562.0	594.323081
1	Albania	ALB	2889167.0	3945.217582
2	Algeria	ALG	39666519.0	4206.031232
3	American Samoa*	ASA	55538.0	NaN
4	Andorra	AND	70473.0	NaN
'''
# Country_x의 값과 Code의 값이 같았다.
# 굳이 2개의 데이터가 필요 없으므로, Code 열의 값을 모두 drop시킨다.

olympics = olympics.merge(dic.iloc[:,:2], how = 'left',
              left_on = 'Country', right_on = 'Code').drop(columns = ['Code'])


## Data Cleaning (Part 1)

# 1. If you haven´t done it yet: Assign appropriate __Column Headers__ to Country Codes (e.g. __"Code"__) and full Country Names (e.g. __"Country"__).
# 2. __Remove Spaces__ from column headers in olympics and dictionary.
# 3. For some Country Codes, there is no corresponding __full Country Name__ available (e.g. for "URS") -> __missing values__ in olympics. Identify these Country Codes and search the Web for the full Country Names. __Replace__ missing values! (Alternatively, you can find a Solution for this at the end of this Notebook!) 

# 컬럼 이름 바꾸기

olympics.rename(columns = {'Country_x':'Code', 'Country_y':'Country'}, inplace = True)
# 컬럼 이름 바꾸기
dic.rename(columns = {'GDP per Capita':'GDP'}, inplace = True)
dic

'''
Country	Code	Population	GDP
0	Afghanistan	AFG	32526562.0	594.323081
1	Albania	ALB	2889167.0	3945.217582
2	Algeria	ALG	39666519.0	4206.031232
3	American Samoa*	ASA	55538.0	NaN
4	Andorra	AND	70473.0	NaN
...	...	...	...	...
196	Vietnam	VIE	91703800.0	2111.138024
197	Virgin Islands*	ISV	103574.0	NaN
198	Yemen	YEM	26832215.0	1406.291651
199	Zambia	ZAM	16211767.0	1304.879014
200	Zimbabwe	ZIM	15602751.0	924.143819
201 rows × 4 columns
'''
olympics.info()

'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36935 entries, 0 to 36934
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Edition     36935 non-null  object
 1   Year        36935 non-null  int64 
 2   City        36935 non-null  object
 3   Sport       36935 non-null  object
 4   Discipline  36935 non-null  object
 5   Athlete     36935 non-null  object
 6   Country     36931 non-null  object
 7   Gender      36935 non-null  object
 8   Event       36935 non-null  object
 9   Medal       36935 non-null  object
dtypes: int64(1), object(9)
memory usage: 2.8+ MB
'''
olympics.loc[olympics.Country.isnull()].Code.value_counts()

'''
URS    2489
GDR     987
ROU     642
FRG     584
TCH     487
YUG     442
EUN     283
EUA     281
ZZX      48
SRB      31
ANZ      29
RU1      17
MNE      14
TTO      10
BOH       7
BWI       5
SGP       4
IOP       3
Name: Code, dtype: int64
'''
old_c = olympics.loc[olympics.Country.isnull(), :].Code.value_counts().index
old_c

'''
Index(['URS', 'GDR', 'ROU', 'FRG', 'TCH', 'YUG', 'EUN', 'EUA', 'ZZX', 'SRB',
       'ANZ', 'RU1', 'MNE', 'TTO', 'BOH', 'BWI', 'SGP', 'IOP'],
      dtype='object')
'''
data = ["Soviet Union", "East Germany", "Romania", "West Germany", "Czechoslovakia",
                               "Yugoslavia", "Unified Team", "Unified Team of Germany", "Mixed teams", "Serbia",
                              "Australasia", "Russian Empire", "Montenegro", "Trinidad and Tobago", "Bohemia", 
                              "West Indies Federation", "Singapore", "Independent Olympic Participants"]
mapper = pd.Series(index = old_c, name = 'Country', data = data)

'''
URS                        Soviet Union
GDR                        East Germany
ROU                             Romania
FRG                        West Germany
TCH                      Czechoslovakia
YUG                          Yugoslavia
EUN                        Unified Team
EUA             Unified Team of Germany
ZZX                         Mixed teams
SRB                              Serbia
ANZ                         Australasia
RU1                      Russian Empire
MNE                          Montenegro
TTO                 Trinidad and Tobago
BOH                             Bohemia
BWI              West Indies Federation
SGP                           Singapore
IOP    Independent Olympic Participants
Name: Country, dtype: object
'''
# Country 컬럼에 해당하는 필드 값중에서
# olympics 변수의 Country 컬럼에 결측값이 있는지 확인

olympics.Country.isnull()
# 위의 결측값이 Country 컬럼에 있는 행들 중에서
# Code 행에 대한 값을 더해주자.

olympics.loc[olympics.Country.isnull()].Code.value_counts()

'''
URS    2489
GDR     987
ROU     642
FRG     584
TCH     487
YUG     442
EUN     283
EUA     281
ZZX      48
SRB      31
ANZ      29
RU1      17
MNE      14
TTO      10
BOH       7
BWI       5
SGP       4
IOP       3
Name: Code, dtype: int64
'''

# 이 국가들에 결측값이 많은 이유는, 더이상 존재하지 않는 나라인 확률이 높기 때문이다.
# 내코드
# 내 코드는 쓰임의 모습은 맞는것같은데, 기능측면에 있어서 완전 틀렸다.
# 그 증거로 전체 인덱스값이 조회된다.
# olympics.Country.isnull()

# 해답코드
missing_C_index = olympics.loc[olympics.Country.isnull()].index
olympics.loc[missing_C_index, 'Code'].map(mapper)

'''
132         Mixed teams
133         Mixed teams
134         Mixed teams
135         Mixed teams
136         Mixed teams
              ...      
33961      Unified Team
33977    Czechoslovakia
33978    Czechoslovakia
33979    Czechoslovakia
33980    Czechoslovakia
Name: Code, Length: 6367, dtype: object
'''
olympics.Country.fillna(olympics.Code.map(mapper), inplace = True)

'''
0               Hungary
1               Austria
2                Greece
3                Greece
4                Greece
              ...      
36930    United Kingdom
36931     United States
36932            Canada
36933    Czech Republic
36934            France
Name: Country, Length: 36935, dtype: object
'''
# 결과적으로 missing_C_index에는 결측값이 Country에 들어있던 값들의 인덱스 모음집이였다.
# 그리고 위의 fillna 작업등을 통해서 결측값을 채워주었다.

olympics.loc[missing_C_index]

'''
	Edition	Year	City	Sport	Discipline	Athlete	Code	Gender	Event	Medal	Country
132	Summer	1896	Athens	Tennis	Tennis	FLACK, Edwin	ZZX	Men	Doubles	Bronze	Mixed teams
133	Summer	1896	Athens	Tennis	Tennis	ROBERTSON, George Stuart	ZZX	Men	Doubles	Bronze	Mixed teams
134	Summer	1896	Athens	Tennis	Tennis	BOLAND, John	ZZX	Men	Doubles	Gold	Mixed teams
135	Summer	1896	Athens	Tennis	Tennis	TRAUN, Friedrich	ZZX	Men	Doubles	Gold	Mixed teams
136	Summer	1896	Athens	Tennis	Tennis	KASDAGLIS, Dionysios	ZZX	Men	Doubles	Silver	Mixed teams
...	...	...	...	...	...	...	...	...	...	...	...
33961	Winter	1992	Albertville	Skiing	Freestyle Skiing	KOZHEVNIKOVA, Yelizaveta	EUN	Women	Moguls	Silver	Unified Team
33977	Winter	1992	Albertville	Skiing	Ski Jumping	GODER, Tomas	TCH	Men	K120 Team (90M)	Bronze	Czechoslovakia
33978	Winter	1992	Albertville	Skiing	Ski Jumping	JEZ, Frantisek	TCH	Men	K120 Team (90M)	Bronze	Czechoslovakia
33979	Winter	1992	Albertville	Skiing	Ski Jumping	PARMA, Jiri	TCH	Men	K120 Team (90M)	Bronze	Czechoslovakia
33980	Winter	1992	Albertville	Skiing	Ski Jumping	SAKALA, Jaroslav	TCH	Men	K120 Team (90M)	Bronze	Czechoslovakia
6367 rows × 11 columns
'''


## Data Cleaning (Part 2)

olympics.info()

'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 36935 entries, 0 to 36934
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Edition     36935 non-null  object
 1   Year        36935 non-null  int64 
 2   City        36935 non-null  object
 3   Sport       36935 non-null  object
 4   Discipline  36935 non-null  object
 5   Athlete     36935 non-null  object
 6   Code        36931 non-null  object
 7   Gender      36935 non-null  object
 8   Event       36935 non-null  object
 9   Medal       36935 non-null  object
 10  Country     36931 non-null  object
dtypes: int64(1), object(10)
memory usage: 4.4+ MB
'''

# Code와 Country 컬럼에 각 4개씩의 결측값이 있다.
# 왜 그런지 필터링으로 찾아보자.
# 결측값 찾기 필터 마스크
olympics.loc[olympics.Code.isna()]

'''

    Edition	Year	City	Sport	Discipline	Athlete	Code	Gender	Event	Medal	Country
29603	Summer	2012	London	Athletics	Athletics	Pending	NaN	Women	1500M	Gold	NaN
31072	Summer	2012	London	Weightlifting	Weightlifting	Pending	NaN	Women	63KG	Gold	NaN
31091	Summer	2012	London	Weightlifting	Weightlifting	Pending	NaN	Men	94KG	Silver	NaN
31110	Summer	2012	London	Wrestling	Wrestling Freestyle	KUDUKHOV, Besik	NaN	Men	Wf 60 KG	Silver	NaN
'''
# 결측값이 있는 행을 버리자.

olympics.dropna(subset = ['Code'], inplace = True)
olympics.info()

'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 36931 entries, 0 to 36934
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Edition     36931 non-null  object
 1   Year        36931 non-null  int64 
 2   City        36931 non-null  object
 3   Sport       36931 non-null  object
 4   Discipline  36931 non-null  object
 5   Athlete     36931 non-null  object
 6   Code        36931 non-null  object
 7   Gender      36931 non-null  object
 8   Event       36931 non-null  object
 9   Medal       36931 non-null  object
 10  Country     36931 non-null  object
dtypes: int64(1), object(10)
memory usage: 3.4+ MB
'''

# 정말 신기하다. 결측값이 있던 컬럼의 4개 행을 버리니
# 행의 일부만 버려진게 아니라 해당 행은 모두 버려졌다.
# 값이 추가되거나 버려지고나서는 반드시 reset을 해주어야 한다.

olympics.reset_index(drop = True, inplace = True)
olympics.nunique()

'''
Edition           2
Year             33
City             41
Sport            48
Discipline       80
Athlete       26494
Code            148
Gender            2
Event           733
Medal             3
Country         145
dtype: int64
1

'''
olympics.Medal = olympics.Medal.astype('category')
olympics.info()

'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36931 entries, 0 to 36930
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Edition     36931 non-null  object  
 1   Year        36931 non-null  int64   
 2   City        36931 non-null  object  
 3   Sport       36931 non-null  object  
 4   Discipline  36931 non-null  object  
 5   Athlete     36931 non-null  object  
 6   Code        36931 non-null  object  
 7   Gender      36931 non-null  object  
 8   Event       36931 non-null  object  
 9   Medal       36931 non-null  category
 10  Country     36931 non-null  object  
dtypes: category(1), int64(1), object(9)
memory usage: 2.9+ MB
'''
olympics.Medal.sort_values()

'''
36930    Bronze
25565    Bronze
25562    Bronze
25559    Bronze
11084    Bronze
          ...  
10811    Silver
10810    Silver
10809    Silver
25834    Silver
18465    Silver
Name: Medal, Length: 36931, dtype: category
Categories (3, object): ['Bronze', 'Gold', 'Silver']
'''
olympics.Medal.cat.set_categories(['Bronze', 'Silver', 'Gold'], ordered = True, inplace = True)
olympics.Medal.sort_values()

'''
36930    Bronze
8916     Bronze
15339    Bronze
27698    Bronze
8919     Bronze
          ...  
25833      Gold
10807      Gold
10806      Gold
10804      Gold
0          Gold
Name: Medal, Length: 36931, dtype: category
Categories (3, object): ['Bronze' < 'Silver' < 'Gold']
'''


## What are the most successful countries of all times?

import matplotlib.pyplot as plt
import seaborn as sns

olympics

'''
	Edition	Year	City	Sport	Discipline	Athlete	Code	Gender	Event	Medal	Country
0	Summer	1896	Athens	Aquatics	Swimming	HAJOS, Alfred	HUN	Men	100M Freestyle	Gold	Hungary
1	Summer	1896	Athens	Aquatics	Swimming	HERSCHMANN, Otto	AUT	Men	100M Freestyle	Silver	Austria
2	Summer	1896	Athens	Aquatics	Swimming	DRIVAS, Dimitrios	GRE	Men	100M Freestyle For Sailors	Bronze	Greece
3	Summer	1896	Athens	Aquatics	Swimming	MALOKINIS, Ioannis	GRE	Men	100M Freestyle For Sailors	Gold	Greece
4	Summer	1896	Athens	Aquatics	Swimming	CHASAPIS, Spiridon	GRE	Men	100M Freestyle For Sailors	Silver	Greece
...	...	...	...	...	...	...	...	...	...	...	...
36926	Winter	2014	Sochi	Skiing	Snowboard	JONES, Jenny	GBR	Women	Slopestyle	Bronze	United Kingdom
36927	Winter	2014	Sochi	Skiing	Snowboard	ANDERSON, Jamie	USA	Women	Slopestyle	Gold	United States
36928	Winter	2014	Sochi	Skiing	Snowboard	MALTAIS, Dominique	CAN	Women	Snowboard Cross	Silver	Canada
36929	Winter	2014	Sochi	Skiing	Snowboard	SAMKOVA, Eva	CZE	Women	Snowboard Cross	Gold	Czech Republic
36930	Winter	2014	Sochi	Skiing	Snowboard	TRESPEUCH, Chloe	FRA	Women	Snowboard Cross	Bronze	France
36931 rows × 11 columns
'''
olympics.Country.value_counts()

'''
United States            5238
Soviet Union             2489
United Kingdom           1799
Germany                  1665
France                   1548
                         ... 
Virgin Islands*             1
Guyana                      1
Netherlands Antilles*       1
Iraq                        1
Bermuda*                    1
Name: Country, Length: 145, dtype: int64
'''
# 빈도가 높은 국가를 순위화 해서 10개 rank화

top_10 = olympics.Country.value_counts().head(10)

'''
United States     5238
Soviet Union      2489
United Kingdom    1799
Germany           1665
France            1548
Italy             1488
Sweden            1477
Canada            1274
Australia         1204
Hungary           1091
Name: Country, dtype: int64
'''
top_10.plot(kind = 'bar', fontsize = 15, figsize = (14,10))
plt.title('Top 10 Countries by Medals', fontsize = 15)
plt.ylabel('Medals', fontsize =14)
plt.show()

olympics_10 = olympics.loc[olympics.Country.isin(top_10.index)]
olympics.loc[olympics.Country.isin(top_10.index)]

'''
	Edition	Year	City	Sport	Discipline	Athlete	Code	Gender	Event	Medal	Country
0	Summer	1896	Athens	Aquatics	Swimming	HAJOS, Alfred	HUN	Men	100M Freestyle	Gold	Hungary
6	Summer	1896	Athens	Aquatics	Swimming	HAJOS, Alfred	HUN	Men	1200M Freestyle	Gold	Hungary
11	Summer	1896	Athens	Athletics	Athletics	LANE, Francis	USA	Men	100M	Bronze	United States
12	Summer	1896	Athens	Athletics	Athletics	SZOKOLYI, Alajos	HUN	Men	100M	Bronze	Hungary
13	Summer	1896	Athens	Athletics	Athletics	BURKE, Thomas	USA	Men	100M	Gold	United States
...	...	...	...	...	...	...	...	...	...	...	...
36924	Winter	2014	Sochi	Skiing	Snowboard	KOBER, Amelie	GER	Women	Parallel Slalom	Bronze	Germany
36926	Winter	2014	Sochi	Skiing	Snowboard	JONES, Jenny	GBR	Women	Slopestyle	Bronze	United Kingdom
36927	Winter	2014	Sochi	Skiing	Snowboard	ANDERSON, Jamie	USA	Women	Slopestyle	Gold	United States
36928	Winter	2014	Sochi	Skiing	Snowboard	MALTAIS, Dominique	CAN	Women	Snowboard Cross	Silver	Canada
36930	Winter	2014	Sochi	Skiing	Snowboard	TRESPEUCH, Chloe	FRA	Women	Snowboard Cross	Bronze	France
19273 rows × 11 columns
'''
plt.figure(figsize = (20,10))
sns.set(font_scale = 1.5, palette = 'dark')

# order에 rank를 먹여줌으로 rank별로 정렬이 되었다.
sns.countplot(data = olympics_10, x = 'Country',order = top_10.index)
plt.title('Top 10 Countries by Medals', fontsize = 20)
plt.show()

plt.figure(figsize = (20,10))
sns.set(font_scale = 1.5, palette = 'dark')
sns.countplot(data = olympics_10, x = 'Country', hue = 'Edition', order = top_10.index)
plt.title('Top 10 Countries by Medals', fontsize = 20)
plt.show()

plt.figure(figsize = (20,10))
sns.set(font_scale = 1.5, palette = 'dark')
sns.countplot(data = olympics_10, x = 'Edition', hue = 'Country', hue_order = top_10.index)
plt.title('Top 10 Countries by Medals', fontsize = 20)
plt.show()

plt.figure(figsize= (20,10))
sns.set(font_scale = 1.5, palette = 'dark')
sns.countplot(data = olympics_10, x = 'Country', hue = 'Medal', order = top_10.index,
             hue_order = ['Gold', 'Silver', 'Bronze'], palette = ['gold', 'silver', 'brown'])
plt.title('Top 10 Countries by Medals', fontsize = 20)
plt.show()

plt.figure(figsize = (20,10))
sns.set(font_scale = 1.5, palette = 'dark')
sns.countplot(data = olympics_10, x = 'Medal', hue = 'Country',
             order = ['Gold', 'Silver', 'Bronze'], hue_order = top_10.index)
plt.title('Top 10 Countries by Medals', fontsize = 20)
plt.show()



## Do GDP, Population and Politics matter?

olympics

'''
	Edition	Year	City	Sport	Discipline	Athlete	Code	Gender	Event	Medal	Country
0	Summer	1896	Athens	Aquatics	Swimming	HAJOS, Alfred	HUN	Men	100M Freestyle	Gold	Hungary
1	Summer	1896	Athens	Aquatics	Swimming	HERSCHMANN, Otto	AUT	Men	100M Freestyle	Silver	Austria
2	Summer	1896	Athens	Aquatics	Swimming	DRIVAS, Dimitrios	GRE	Men	100M Freestyle For Sailors	Bronze	Greece
3	Summer	1896	Athens	Aquatics	Swimming	MALOKINIS, Ioannis	GRE	Men	100M Freestyle For Sailors	Gold	Greece
4	Summer	1896	Athens	Aquatics	Swimming	CHASAPIS, Spiridon	GRE	Men	100M Freestyle For Sailors	Silver	Greece
...	...	...	...	...	...	...	...	...	...	...	...
36926	Winter	2014	Sochi	Skiing	Snowboard	JONES, Jenny	GBR	Women	Slopestyle	Bronze	United Kingdom
36927	Winter	2014	Sochi	Skiing	Snowboard	ANDERSON, Jamie	USA	Women	Slopestyle	Gold	United States
36928	Winter	2014	Sochi	Skiing	Snowboard	MALTAIS, Dominique	CAN	Women	Snowboard Cross	Silver	Canada
36929	Winter	2014	Sochi	Skiing	Snowboard	SAMKOVA, Eva	CZE	Women	Snowboard Cross	Gold	Czech Republic
36930	Winter	2014	Sochi	Skiing	Snowboard	TRESPEUCH, Chloe	FRA	Women	Snowboard Cross	Bronze	France
36931 rows × 11 columns
'''
# 기본적으로 margins = True를 해주면
# 컬럼기준으로, 행 기준으로 Total이 생성된다.

medals_per_country = pd.crosstab(index = olympics.Country, columns = olympics.Medal, 
                                 margins = True, margins_name = 'Total')

'''
Medal	Bronze	Silver	Gold	Total
Country				
Afghanistan	2	0	0	2
Algeria	8	2	5	15
Argentina	91	99	69	259
Armenia	8	2	1	11
Australasia	5	4	20	29
...	...	...	...	...
West Indies Federation	5	0	0	5
Yugoslavia	119	180	143	442
Zambia	1	1	0	2
Zimbabwe	1	4	18	23
Total	12288	12238	12405	36931
146 rows × 4 columns
'''
medals_per_country.drop('Total', axis = 0, inplace = True)
medals_per_country

'''
Medal	Bronze	Silver	Gold	Total
Country				
Afghanistan	2	0	0	2
Algeria	8	2	5	15
Argentina	91	99	69	259
Armenia	8	2	1	11
Australasia	5	4	20	29
...	...	...	...	...
West Germany	227	191	166	584
West Indies Federation	5	0	0	5
Yugoslavia	119	180	143	442
Zambia	1	1	0	2
Zimbabwe	1	4	18	23
145 rows × 4 columns
'''
medals_per_contry = medals_per_country[['Total', 'Gold', 'Silver', 'Bronze']].sort_values(by = ['Total', 'Gold', 'Silver', 'Bronze'], ascending= False)

'''
Medal	Total	Gold	Silver	Bronze
Country				
United States	5238	2402	1571	1265
Soviet Union	2489	1088	724	677
United Kingdom	1799	580	632	587
Germany	1665	589	504	572
France	1548	444	526	578
...	...	...	...	...
Iraq	1	0	0	1
Macedonia	1	0	0	1
Mauritius	1	0	0	1
Niger	1	0	0	1
Togo	1	0	0	1
145 rows × 4 columns
'''
medals_per_country.head()

'''
Medal	Bronze	Silver	Gold	Total
Country				
Afghanistan	2	0	0	2
Algeria	8	2	5	15
Argentina	91	99	69	259
Armenia	8	2	1	11
Australasia	5	4	20	29
'''
dic

'''
	Country	Code	Population	GDP
0	Afghanistan	AFG	32526562.0	594.323081
1	Albania	ALB	2889167.0	3945.217582
2	Algeria	ALG	39666519.0	4206.031232
3	American Samoa*	ASA	55538.0	NaN
4	Andorra	AND	70473.0	NaN
...	...	...	...	...
196	Vietnam	VIE	91703800.0	2111.138024
197	Virgin Islands*	ISV	103574.0	NaN
198	Yemen	YEM	26832215.0	1406.291651
199	Zambia	ZAM	16211767.0	1304.879014
200	Zimbabwe	ZIM	15602751.0	924.143819
201 rows × 4 columns
'''
medals_per_country = medals_per_country.merge(dic, how = 'left', left_index = True,
                        right_on = 'Country').drop(columns = ['Code']).set_index('Country')

'''

Bronze	Silver	Gold	Total	Population	GDP
Country						
Afghanistan	2	0	0	2	32526562.0	594.323081
Algeria	8	2	5	15	39666519.0	4206.031232
Argentina	91	99	69	259	43416755.0	13431.878340
Armenia	8	2	1	11	3017712.0	3489.127690
Australasia	5	4	20	29	NaN	NaN
...	...	...	...	...	...	...
West Germany	227	191	166	584	NaN	NaN
West Indies Federation	5	0	0	5	NaN	NaN
Yugoslavia	119	180	143	442	NaN	NaN
Zambia	1	1	0	2	16211767.0	1304.879014
Zimbabwe	1	4	18	23	15602751.0	924.143819
145 rows × 6 columns
'''
olympics.nunique()

'''
Edition           2
Year             33
City             41
Sport            48
Discipline       80
Athlete       26494
Code            148
Gender            2
Event           733
Medal             3
Country         145
dtype: int64
'''
# lambda를 이용한 컬럼간 필드값 scala 방식으로 합치기
# 매우 중요

olympics['Games'] = olympics.apply(lambda x: str(x.Year) + ' ' + x.City, axis = 1)

'''
0        1896 Athens
1        1896 Athens
2        1896 Athens
3        1896 Athens
4        1896 Athens
            ...     
36926     2014 Sochi
36927     2014 Sochi
36928     2014 Sochi
36929     2014 Sochi
36930     2014 Sochi
Length: 36931, dtype: object
'''
# 각 컬럼별로 고유한 필드값이 몇개인가.

olympics.nunique()

'''
Edition           2
Year             33
City             41
Sport            48
Discipline       80
Athlete       26494
Code            148
Gender            2
Event           733
Medal             3
Country         145
Games            49
dtype: int64
'''
olympics.Games.nunique()

# 49
olympics.Games.value_counts()

'''
2008 Beijing                   2042
2000 Sydney                    2015
2004 Athens                    1998
2012 London                    1945
1996 Atlanta                   1859
1992 Barcelona                 1705
1988 Seoul                     1546
1984 Los Angeles               1459
1980 Moscow                    1387
1976 Montreal                  1305
1920 Antwerp                   1298
1972 Munich                    1185
1968 Mexico                    1031
1964 Tokyo                     1010
1952 Helsinki                   889
1956 Melbourne / Stockholm      885
1912 Stockholm                  885
1924 Paris                      884
1960 Rome                       882
1936 Berlin                     875
1948 London                     814
1908 London                     804
1928 Amsterdam                  710
1932 Los Angeles                615
2014 Sochi                      612
2006 Turin                      531
2010 Vancouver                  529
1900 Paris                      512
2002 Salt Lake City             481
1904 St Louis                   470
1998 Nagano                     447
1994 Lillehammer                343
1992 Albertville                325
1988 Calgary                    264
1984 Sarajevo                   222
1980 Lake Placid                218
1976 Innsbruck                  210
1972 Sapporo                    200
1968 Grenoble                   199
1964 Innsbruck                  185
1896 Athens                     151
1956 Cortina d\'Ampezzo          150
1960 Squaw Valley               147
1948 St.Moritz                  140
1952 Oslo                       136
1924 Chamonix                   118
1932 Lake Placid                116
1936 Garmisch Partenkirchen     108
1928 St.Moritz                   89
Name: Games, dtype: int64
'''
olympics.groupby('Country').apply(lambda x: x.Games.nunique())

'''
Country
Afghanistan                2
Algeria                    6
Argentina                 18
Armenia                    4
Australasia                2
                          ..
West Germany              12
West Indies Federation     1
Yugoslavia                18
Zambia                     2
Zimbabwe                   3
Length: 145, dtype: int64
'''
medals_per_country['Total_Games'] = olympics.groupby('Country').apply(lambda x: x.Games.nunique())
medals_per_country

'''
	Bronze	Silver	Gold	Total	Population	GDP	Total_Games
Country							
Afghanistan	2	0	0	2	32526562.0	594.323081	2
Algeria	8	2	5	15	39666519.0	4206.031232	6
Argentina	91	99	69	259	43416755.0	13431.878340	18
Armenia	8	2	1	11	3017712.0	3489.127690	4
Australasia	5	4	20	29	NaN	NaN	2
...	...	...	...	...	...	...	...
West Germany	227	191	166	584	NaN	NaN	12
West Indies Federation	5	0	0	5	NaN	NaN	1
Yugoslavia	119	180	143	442	NaN	NaN	18
Zambia	1	1	0	2	16211767.0	1304.879014	2
Zimbabwe	1	4	18	23	15602751.0	924.143819	3
145 rows × 7 columns
'''
medals_per_country.rank(ascending= False)

'''

Bronze	Silver	Gold	Total	Population	GDP	Total_Games
Country							
Afghanistan	100.0	135.0	122.5	117.0	37.0	115.0	107.5
Algeria	78.0	94.5	71.0	83.5	31.0	79.0	73.0
Argentina	34.0	33.0	35.0	34.0	29.0	44.0	28.0
Armenia	78.0	94.5	91.5	87.0	101.0	89.0	84.0
Australasia	86.5	82.0	51.5	67.0	NaN	NaN	107.5
...	...	...	...	...	...	...	...
West Germany	21.0	24.0	19.0	21.0	NaN	NaN	40.0
West Indies Federation	86.5	135.0	122.5	101.0	NaN	NaN	131.0
Yugoslavia	29.0	26.0	24.0	26.0	NaN	NaN	28.0
Zambia	114.5	113.0	122.5	117.0	57.0	105.0	107.5
Zimbabwe	114.5	82.0	54.0	74.5	59.0	109.0	93.0
145 rows × 7 columns
'''


## Statistical Analysis and Hypothesis Testing with scipy

medals_per_country

'''
	Bronze	Silver	Gold	Total	Population	GDP	Total_Games
Country							
Afghanistan	2	0	0	2	32526562.0	594.323081	2
Algeria	8	2	5	15	39666519.0	4206.031232	6
Argentina	91	99	69	259	43416755.0	13431.878340	18
Armenia	8	2	1	11	3017712.0	3489.127690	4
Australasia	5	4	20	29	NaN	NaN	2
...	...	...	...	...	...	...	...
West Germany	227	191	166	584	NaN	NaN	12
West Indies Federation	5	0	0	5	NaN	NaN	1
Yugoslavia	119	180	143	442	NaN	NaN	18
Zambia	1	1	0	2	16211767.0	1304.879014	2
Zimbabwe	1	4	18	23	15602751.0	924.143819	3
145 rows × 7 columns
'''
medals_per_country.drop(columns = ['Gold', 'Silver', 'Bronze'], inplace = True)
medals_per_country

'''
	Total	Population	GDP	Total_Games
Country				
Afghanistan	2	32526562.0	594.323081	2
Algeria	15	39666519.0	4206.031232	6
Argentina	259	43416755.0	13431.878340	18
Armenia	11	3017712.0	3489.127690	4
Australasia	29	NaN	NaN	2
...	...	...	...	...
West Germany	584	NaN	NaN	12
West Indies Federation	5	NaN	NaN	1
Yugoslavia	442	NaN	NaN	18
Zambia	2	16211767.0	1304.879014	2
Zimbabwe	23	15602751.0	924.143819	3
145 rows × 4 columns
'''
medals_per_country

'''
	Total	Population	GDP	Total_Games
Country				
Afghanistan	2	32526562.0	594.323081	2
Algeria	15	39666519.0	4206.031232	6
Argentina	259	43416755.0	13431.878340	18
Armenia	11	3017712.0	3489.127690	4
Australasia	29	NaN	NaN	2
...	...	...	...	...
West Germany	584	NaN	NaN	12
West Indies Federation	5	NaN	NaN	1
Yugoslavia	442	NaN	NaN	18
Zambia	2	16211767.0	1304.879014	2
Zimbabwe	23	15602751.0	924.143819	3
145 rows × 4 columns
'''
# corr는 상관계수 표현이다.
# +1은 완전한 양의 상관관계
# -1은 완전한 음의 상관관계
# 0은 완전히 상관관계가 없음을 의미한다.

medals_per_country.corr(method = 'pearson')

'''

Total	Population	GDP	Total_Games
Total	1.000000	0.206606	0.437131	0.712917
Population	0.206606	1.000000	-0.089976	0.123700
GDP	0.437131	-0.089976	1.000000	0.563540
Total_Games	0.712917	0.123700	0.563540	1.000000
'''

# pearson method는 '선형 관계'만 파악한다.
# 따라서 상관관계 판단에 제한적이다.
# corr 상관계수
# 따라서 spearman method를 써줌으로써 선형인지 여부와 관계없이 통계적 테스트가 정규 분포를 가정하지 않는다.
# 덜 제한적이다. 라고 할 수 있다.
medals_per_country.corr(method = 'spearman')

'''
Total	Population	GDP	Total_Games
Total	1.000000	0.419755	0.458478	0.927611
Population	0.419755	1.000000	-0.239097	0.402098
GDP	0.458478	-0.239097	1.000000	0.498056
Total_Games	0.927611	0.402098	0.498056	1.000000
'''
medals_per_country.rank(ascending=False).corr(method = 'pearson')

'''
	Total	Population	GDP	Total_Games
Total	1.000000	0.419634	0.458508	0.927611
Population	0.419634	1.000000	-0.239567	0.401448
GDP	0.458508	-0.239567	1.000000	0.497109
Total_Games	0.927611	0.401448	0.497109	1.000000
'''


# 가설검정 하기
import scipy.stats as stats

#### Hypothesis 1: There is no relationship between Total Medals and Population

stats.spearmanr(medals_per_country.Total, medals_per_country.Population,
               nan_policy = 'omit')

# SpearmanrResult(correlation=0.41975527887063924, pvalue=8.132240868116897e-07)
r, p_value = stats.spearmanr(medals_per_country.Total, medals_per_country.Population,
                            nan_policy = 'omit')
r

# 0.41975527887063924
#### Reject Hypothesis 1 -> There is a significant (positive) relationship between Total Medals and Population

---

### Hypothesis 2: There is no relationship between Total Medals and GDP per capita

r, p_value = stats.spearmanr(medals_per_country.Total, medals_per_country.GDP,
               nan_policy = 'omit')
r
# 0.45847787983426685
p_value

# 1.5758169913239602e-07


#### Reject Hypothesis 2 -> There is a significant (positive) relationship between Total Medals and GDP per Capita

---

#### Hypothesis 3: There is no relationship between Total Medals and Participations

medals_per_country.head()

'''
Total	Population	GDP	Total_Games
Country				
Afghanistan	2	32526562.0	594.323081	2
Algeria	15	39666519.0	4206.031232	6
Argentina	259	43416755.0	13431.878340	18
Armenia	11	3017712.0	3489.127690	4
Australasia	29	NaN	NaN	2
'''
r, p_value = stats.spearmanr(medals_per_country.head(50).Total,
               medals_per_country.head(50).Total_Games)
r

# 0.9642475115792023
p_value

# 2.4562642832742088e-29


## Aggregating and Ranking

sum_vs_win = pd.crosstab(olympics.Country, olympics.Edition)

'''

Edition	Summer	Winter
Country		
Afghanistan	2	0
Algeria	15	0
Argentina	259	0
Armenia	11	0
Australasia	29	0
...	...	...
West Germany	490	94
West Indies Federation	5	0
Yugoslavia	435	7
Zambia	2	0
Zimbabwe	23	0
145 rows × 2 columns
'''
men_vs_women = pd.crosstab(olympics.Country, olympics.Gender)

'''
Gender	Men	Women
Country		
Afghanistan	2	0
Algeria	12	3
Argentina	187	72
Armenia	11	0
Australasia	27	2
...	...	...
West Germany	450	134
West Indies Federation	5	0
Yugoslavia	379	63
Zambia	2	0
Zimbabwe	0	23
145 rows × 2 columns
'''
medals_by_cats = pd.concat([sum_vs_win, men_vs_women], axis = 1)

'''

Summer	Winter	Men	Women
Country				
Afghanistan	2	0	2	0
Algeria	15	0	12	3
Argentina	259	0	187	72
Armenia	11	0	11	0
Australasia	29	0	27	2
...	...	...	...	...
West Germany	490	94	450	134
West Indies Federation	5	0	5	0
Yugoslavia	435	7	379	63
Zambia	2	0	2	0
Zimbabwe	23	0	0	23
145 rows × 4 columns
'''
medals_by_cats['Total'] = medals_by_cats.Summer + medals_by_cats.Winter
medals_by_cats

'''

Summer	Winter	Men	Women	Total
Country					
Afghanistan	2	0	2	0	2
Algeria	15	0	12	3	15
Argentina	259	0	187	72	259
Armenia	11	0	11	0	11
Australasia	29	0	27	2	29
...	...	...	...	...	...
West Germany	490	94	450	134	584
West Indies Federation	5	0	5	0	5
Yugoslavia	435	7	379	63	442
Zambia	2	0	2	0	2
Zimbabwe	23	0	0	23	23
145 rows × 5 columns
'''
medals_by_cats = pd.concat([sum_vs_win, men_vs_women], axis = 1)

'''
	Summer	Winter	Men	Women
Country				
Afghanistan	2	0	2	0
Algeria	15	0	12	3
Argentina	259	0	187	72
Armenia	11	0	11	0
Australasia	29	0	27	2
...	...	...	...	...
West Germany	490	94	450	134
West Indies Federation	5	0	5	0
Yugoslavia	435	7	379	63
Zambia	2	0	2	0
Zimbabwe	23	0	0	23
145 rows × 4 columns
'''
medals_by_cats['Total'] = medals_by_cats.Summer + medals_by_cats.Winter
medals_by_cats.sort_values('Total', ascending=False, inplace = True)
medals_by_cats

'''
	Summer	Winter	Men	Women	Total
Country					
United States	4585	653	3618	1620	5238
Soviet Union	2049	440	1807	682	2489
United Kingdom	1720	79	1470	329	1799
Germany	1305	360	1119	546	1665
France	1396	152	1351	197	1548
...	...	...	...	...	...
Iraq	1	0	1	0	1
Senegal	1	0	1	0	1
Macedonia	1	0	1	0	1
Mauritius	1	0	1	0	1
Guatemala	1	0	1	0	1
145 rows × 5 columns
'''
medals_by_cats.sort_values('Total', ascending = False, inplace = True)
medals_by_cats

'''
	Summer	Winter	Men	Women	Total
Country					
United States	4585	653	3618	1620	5238
Soviet Union	2049	440	1807	682	2489
United Kingdom	1720	79	1470	329	1799
Germany	1305	360	1119	546	1665
France	1396	152	1351	197	1548
...	...	...	...	...	...
Barbados	1	0	1	0	1
Cyprus	1	0	1	0	1
Virgin Islands*	1	0	1	0	1
Bermuda*	1	0	1	0	1
Guatemala	1	0	1	0	1
145 rows × 5 columns
'''
ranks = medals_by_cats.rank(ascending=False, method = 'average')
ranks

'''
	Summer	Winter	Men	Women	Total
Country					
United States	1.0	1.0	1.0	1.0	1.0
Soviet Union	2.0	4.0	2.0	2.0	2.0
United Kingdom	3.0	19.0	3.0	11.0	3.0
Germany	5.0	7.0	7.0	4.0	4.0
France	4.0	14.0	4.0	16.5	5.0
...	...	...	...	...	...
Barbados	132.5	95.5	127.0	120.0	133.5
Cyprus	132.5	95.5	127.0	120.0	133.5
Virgin Islands*	132.5	95.5	127.0	120.0	133.5
Bermuda*	132.5	95.5	127.0	120.0	133.5
Guatemala	132.5	95.5	127.0	120.0	133.5
145 rows × 5 columns
'''
# 총 메달수 상위 50개 국을 다루는 게 좋은데
# 이유는 하위 그룹에서는 한두명의 특출나게 잘 하는 선수에 의해서 순위가 왜곡될 수 있기 때문이다.
top_50 = ranks.head(50)
plt.figure(figsize = (50,10))
sns.heatmap(top_50.T, cmap = 'RdYlGn_r', annot = True, fmt = '2.0f')
plt.show()



## Summer Games vs. Winter Games - does Geographical Location matter?

top_50.sort_values('Summer')
top_50.sort_values('Winter')
plt.figure(figsize = (50,10))
sns.heatmap(top_50.sort_values(by = 'Summer').T, cmap = 'RdYlGn_r', annot = True, fmt = '2.0f')
plt.show()

plt.figure(figsize = (50,10))
sns.heatmap(top_50.sort_values(by = 'Winter').T, cmap = 'RdYlGn_r',annot = True, fmt = '2.0f')
plt.show()

top_50.Summer.sub(top_50.Winter).sort_values()

'''
Country
Brazil                    -70.5
Cuba                      -68.5
Argentina                 -63.5
India                     -60.5
Greece                    -58.5
Jamaica                   -56.5
Pakistan                  -55.5
South Africa              -52.0
Mexico                    -52.0
Kenya                     -50.5
Turkey                    -49.5
Nigeria                   -48.5
Uruguay                   -47.5
Iran                      -46.5
Romania                   -25.5
Hungary                   -22.0
Australia                 -20.0
Denmark                   -18.5
Spain                     -18.5
United Kingdom            -16.0
Yugoslavia                -11.0
New Zealand               -10.5
France                    -10.0
Japan                      -8.0
Bulgaria                   -8.0
China                      -6.0
Netherlands                -5.0
Italy                      -5.0
Poland                     -4.0
Belgium                    -3.0
Soviet Union               -2.0
Germany                    -2.0
East Germany               -1.0
United States               0.0
Korea, South                1.0
Sweden                      3.0
Russia                      4.0
Ukraine                     4.5
West Germany                5.0
Unified Team of Germany     7.0
Croatia                     9.5
Unified Team               11.0
Canada                     13.0
Norway                     14.0
Belarus                    15.0
Finland                    17.0
Czechoslovakia             17.0
Switzerland                20.0
Austria                    29.0
Czech Republic             31.0
dtype: float64
'''
rank_diff= top_50.Summer.sub(top_50.Winter).sort_values().to_frame()
rank_diff

'''

0
Country	
Brazil	-70.5
Cuba	-68.5
Argentina	-63.5
India	-60.5
Greece	-58.5
Jamaica	-56.5
Pakistan	-55.5
South Africa	-52.0
Mexico	-52.0
Kenya	-50.5
Turkey	-49.5
Nigeria	-48.5
Uruguay	-47.5
Iran	-46.5
Romania	-25.5
Hungary	-22.0
Australia	-20.0
Denmark	-18.5
Spain	-18.5
United Kingdom	-16.0
Yugoslavia	-11.0
New Zealand	-10.5
France	-10.0
Japan	-8.0
Bulgaria	-8.0
China	-6.0
Netherlands	-5.0
Italy	-5.0
Poland	-4.0
Belgium	-3.0
Soviet Union	-2.0
Germany	-2.0
East Germany	-1.0
United States	0.0
Korea, South	1.0
Sweden	3.0
Russia	4.0
Ukraine	4.5
West Germany	5.0
Unified Team of Germany	7.0
Croatia	9.5
Unified Team	11.0
Canada	13.0
Norway	14.0
Belarus	15.0
Finland	17.0
Czechoslovakia	17.0
Switzerland	20.0
Austria	29.0
Czech Republic	31.0
'''
plt.figure(figsize = (35,5))
sns.heatmap(rank_diff.T, cmap = 'RdBu', annot = True, fmt = '2.0f', center = 0)
plt.show()



## Men vs. Women - does Culture & Religion matter?

plt.figure(figsize = (50, 10))
sns.heatmap(top_50.sort_values(by = 'Men').T, cmap = 'RdYlGn_r', annot = True, fmt = '2.0f')
plt.show()

plt.figure(figsize = (50,10))
sns.heatmap(top_50.sort_values(by = 'Women').T, cmap = 'RdYlGn_r', annot = True, fmt = '2.0f')
plt.show()

rank_diff2 = top_50.Men.sub(top_50.Women).sort_values().to_frame()

'''
	0
Country	
Pakistan	-83.0
Uruguay	-74.0
Iran	-72.0
India	-37.5
Belgium	-22.0
Croatia	-15.5
Turkey	-13.5
Yugoslavia	-13.0
Italy	-13.0
Switzerland	-13.0
France	-12.5
Finland	-12.0
Sweden	-10.5
Denmark	-9.5
Czechoslovakia	-9.0
United Kingdom	-8.0
Kenya	-7.5
Hungary	-6.0
Mexico	-5.0
Poland	-5.0
Norway	-5.0
Unified Team of Germany	-4.0
Austria	-4.0
New Zealand	-3.0
Spain	-1.0
West Germany	-1.0
United States	0.0
Soviet Union	0.0
Japan	0.0
South Africa	1.0
Cuba	1.0
Argentina	1.0
Canada	2.0
Czech Republic	2.0
Greece	2.0
Nigeria	2.5
Germany	3.0
Netherlands	5.0
Australia	7.0
East Germany	9.0
Brazil	9.0
Russia	10.0
Bulgaria	11.0
Unified Team	13.0
Korea, South	13.5
Ukraine	15.0
Belarus	15.5
Jamaica	16.5
Romania	18.0
China	27.0
'''
plt.figure(figsize = (35,5))
sns.heatmap(rank_diff2.T, cmap = 'RdBu_r', annot = True, fmt = '2.0f', center = 0)
plt.show()



## Do Traditions matter?

olympics.Sport.value_counts()

'''
Aquatics             4170
Athletics            3637
Rowing               2667
Gymnastics           2307
Skiing               1781
Fencing              1613
Ice Hockey           1563
Football             1497
Hockey               1422
Skating              1256
Wrestling            1211
Shooting             1150
Sailing              1109
Cycling              1105
Basketball           1012
Canoe / Kayak        1002
Volleyball            994
Handball              973
Equestrian            939
Boxing                894
Weightlifting         591
Judo                  491
Bobsleigh             452
Biathlon              420
Baseball              335
Archery               329
Tennis                296
Rugby                 192
Softball              180
Luge                  180
Modern Pentathlon     180
Curling               172
Badminton             144
Table Tennis          144
Taekwondo             112
Tug of War             94
Canoe                  81
Polo                   66
Lacrosse               59
Golf                   30
Triathlon              24
Cricket                24
Rackets                10
Croquet                 8
Water Motorsports       5
Basque Pelota           4
Roque                   3
Jeu de paume            3
Name: Sport, dtype: int64
'''
sports = olympics.Sport.value_counts().index

olympics.Country.value_counts().head(20)

'''
United States     5238
Soviet Union      2489
United Kingdom    1799
Germany           1665
France            1548
Italy             1488
Sweden            1477
Canada            1274
Australia         1204
Hungary           1091
Russia            1031
Norway            1011
East Germany       987
Netherlands        973
Finland            890
China              889
Japan              851
Switzerland        665
Romania            642
Korea, South       616
Name: Country, dtype: int64
'''
top_20 = olympics.Country.value_counts().head(20).index

pd.crosstab(olympics.Country, olympics.Sport)

'''
Sport	Aquatics	Archery	Athletics	Badminton	Baseball	Basketball	Basque Pelota	Biathlon	Bobsleigh	Boxing	...	Softball	Table Tennis	Taekwondo	Tennis	Triathlon	Tug of War	Volleyball	Water Motorsports	Weightlifting	Wrestling
Country																					
Afghanistan	0	0	0	0	0	0	0	0	0	0	...	0	0	2	0	0	0	0	0	0	0
Algeria	0	0	7	0	0	0	0	0	0	6	...	0	0	0	0	0	0	0	0	0	0
Argentina	3	0	5	0	0	24	0	0	0	24	...	0	0	1	6	0	0	12	0	2	0
Armenia	0	0	0	0	0	0	0	0	0	1	...	0	0	0	0	0	0	0	0	4	6
Australasia	11	0	1	0	0	0	0	0	0	1	...	0	0	0	1	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
West Germany	62	0	67	0	0	0	0	14	22	6	...	0	0	0	3	0	0	0	0	7	9
West Indies Federation	0	0	5	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Yugoslavia	91	0	2	0	0	96	0	0	0	11	...	0	4	0	0	0	0	24	0	0	16
Zambia	0	0	1	0	0	0	0	0	0	1	...	0	0	0	0	0	0	0	0	0	0
Zimbabwe	7	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
145 rows × 48 columns
'''
by_sport = pd.crosstab(olympics.Country, olympics.Sport)
by_sport = by_sport.rank(ascending=False, method = 'average')
plt.figure(figsize = (30,10))
sns.heatmap(by_sport, cmap = 'RdYlGn_r', vmin = 1, vmax = 6, linewidth = 1)
plt.show()

300x250