[Big Data Analysis] 06. Statistical Analysis II

by 시원00 2023. 7. 25. 14:16

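Statistical Analysis

Project 6. Analyzing Titanic survival rates

Core concept: correlation analysis
Data collection: the Titanic dataset
Data preparation: filling missing values with the median and the mode
Data exploration
1. Checking basic information: info()
2. Exploring the data with charts: pie(), countplot()
Data modeling
1. Computing correlation coefficients between all variables
2. Computing the correlation coefficient between two chosen variables
Visualizing the results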

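1. Understanding the core concept

Correlation analysis

- A method for analyzing what kind of linear relationship exists between two variables

- A statistical technique that measures how strongly variables are related to each other

- The correlation coefficient expresses the degree of that relationship

| It only shows how strongly two variables are associated; it does not explain cause and effect

| An index that summarizes the strength (0 to 1) and the direction (+, -) of the relationship in a single number

| It takes a value between -1 and +1

Positive coefficient: when one variable increases, the other also increases

Negative coefficient: when one variable increases, the other decreases

Rough guide for the absolute value of the coefficient:

| 0.0 ~ 0.2: almost no correlation

| 0.2 ~ 0.4: weak correlation

| 0.4 ~ 0.6: moderate correlation

| 0.6 ~ 0.8: strong correlation

| 0.8 ~ 1.0: very strong correlation

Simple correlation analysis (two variables) / multiple correlation analysis (three or more variables)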
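
To make the definition concrete, here is a minimal sketch that computes the Pearson coefficient by hand and maps it onto the rough scale above. The helper names `pearson_r` and `strength_label` and the toy numbers are made up for illustration; they are not part of the original project.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: sum of co-deviations divided by the product of the deviation norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

def strength_label(r):
    """Map |r| onto the rough verbal scale listed above (hypothetical helper)."""
    a = abs(r)
    if a < 0.2:
        return "almost none"
    if a < 0.4:
        return "weak"
    if a < 0.6:
        return "moderate"
    if a < 0.8:
        return "strong"
    return "very strong"

# Toy data, invented for this example
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

r = pearson_r(hours, score)
print(round(r, 3), strength_label(r))
```

The sign gives the direction and the absolute value gives the strength, which is why the label is computed from `abs(r)`.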

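2. Data collection

seaborn sample dataset

- matplotlib and seaborn are the standard Python tools for statistical data visualization; seaborn's load_dataset() also provides sample datasets such as titanic (it fetches them from seaborn's online data repository on first use)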

import seaborn as sns

titanic = sns.load_dataset("titanic")
titanic.to_csv('titanic.csv', index = False)

Check the saved CSV file

3. Data preparation

Check whether the data needs cleaning

- Missing values are found in age, embarked, deck, and embark_town

age -> replace with the median; embarked, deck, embark_town -> replace with the mode

print(titanic.isnull().sum())  # check missing values

titanic['age'] = titanic['age'].fillna(titanic['age'].median())  # fill missing ages with the median

print(titanic['embarked'].value_counts())  # check the mode: S
titanic['embarked'] = titanic['embarked'].fillna('S')   # fill embarked with the mode

print(titanic['deck'].value_counts())   # check the mode: C
titanic['deck'] = titanic['deck'].fillna('C')   # fill deck with the mode

print(titanic['embark_town'].value_counts())    # check the mode: Southampton
titanic['embark_town'] = titanic['embark_town'].fillna('Southampton')   # fill embark_town with the mode

print(titanic.isnull().sum())  # confirm that no missing values remain

Out:

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
S    644
C    168
Q     77
Name: embarked, dtype: int64
C    59
B    47
D    33
E    32
A    15
F    13
G     4
Name: deck, dtype: int64
Southampton    644
Cherbourg      168
Queenstown      77
Name: embark_town, dtype: int64
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64
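
The steps above read the mode from the value_counts() output and then type it in by hand. As a small sketch of the same median/mode imputation without hard-coded values, the mode can be taken directly with `Series.mode()`. The frame `df` below is made-up toy data, not rows from the real Titanic dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (invented for illustration)
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "embarked": ["S", "C", None, "S", "Q"],
})

# Numeric column: fill with the median of the observed values
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: mode() returns a Series because ties are possible,
# so take its first entry as the fill value
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])

print(df)
print(df.isnull().sum())
```

Computing the fill value from the data keeps the code correct if the dataset changes, at the cost of hiding which value was actually used, so printing it once is still worthwhile.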

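4. Data exploration

Checking basic information

- Use info() to get an overview of the data, and value_counts() to count survivors (1) and non-survivors (0)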

titanic.info()
print(titanic.survived.value_counts())

Out:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
#   Column       Non-Null Count  Dtype
---  ------       --------------  -----
0   survived     891 non-null    int64
1   pclass       891 non-null    int64
2   sex          891 non-null    object
3   age          891 non-null    float64
4   sibsp        891 non-null    int64
5   parch        891 non-null    int64
6   fare         891 non-null    float64
7   embarked     891 non-null    object
8   class        891 non-null    category
9   who          891 non-null    object
10  adult_male   891 non-null    bool
11  deck         891 non-null    category
12  embark_town  891 non-null    object
13  alive        891 non-null    object
14  alone        891 non-null    bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
0    549
1    342
Name: survived, dtype: int64

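Visual data exploration 1

- Draw the survival rates of male and female passengers as pie charts (matplotlib)

import matplotlib.pyplot as plt

# value_counts() sorts by frequency: for men 0 (died) comes first, for women 1 (survived) comes first,
# so the color lists are reversed to keep red = died and grey = survived in both charts
male_color=['red', 'grey']
female_color=['grey', 'red']

plt.subplot(1,2,1)
titanic['survived'][titanic['sex']=='male'].value_counts().plot.pie(explode=[0,0.1], colors=male_color, autopct='%1.1f%%', shadow=True)
            #explode: how far each wedge is pulled out from the center    #autopct: format string for the percentage labels
plt.title('Survived(Male)')

plt.subplot(1,2,2)
titanic['survived'][titanic['sex']=='female'].value_counts().plot.pie(explode=[0,0.1], colors=female_color, autopct='%1.1f%%', shadow=True)
plt.title('Survived(Female)')

plt.show()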

Out:

Visual data exploration 2

- Draw the number of survivors by passenger class as a bar chart (seaborn)

Visualizing frequency counts: countplot()

import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x='pclass', hue='survived', data=titanic)
plt.title('Pclass vs. Survived')
plt.show()

Out:
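
The counts that countplot() draws as bars can also be read as a table. The following is a minimal sketch using `pd.crosstab` on a made-up toy frame (`toy` is not the real dataset); on the actual data the same call would take `titanic['pclass']` and `titanic['survived']`.

```python
import pandas as pd

# Toy passengers, invented for illustration
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 3, 3, 3],
    "survived": [1, 0, 1, 0, 0, 1],
})

# Rows: passenger class, columns: survived (0/1), cells: number of passengers
counts = pd.crosstab(toy["pclass"], toy["survived"])
print(counts)
```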

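5. Data modeling

A model that analyzes how passenger attributes correlate with survival on the Titanic

- Correlation analysis: corr()

- Correlation coefficient: Pearson correlation coefficient

The Pearson coefficient can only be computed for numeric columns, so numeric_only=True leaves out the string and category columns

# correlation coefficients between all numeric variables
titanic_corr = titanic.corr(method='pearson', numeric_only=True)
titanic_corr.to_csv('titanic_corr.csv')   # keep the index so each row stays labeled with its variable name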

Out:

# correlation coefficient between two specific variables
print("survived - adult_male : ", titanic['survived'].corr(titanic['adult_male']))
print("survived - fare : ", titanic['survived'].corr(titanic['fare']))

Out:

survived - adult_male : -0.5570800422053257
survived - fare : 0.2573065223849625
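
Because corr() only works on numeric columns, a text column such as sex has to be encoded before it can take part. Here is a minimal sketch that turns sex into a 0/1 indicator, assuming a made-up toy frame; the column name `is_female` is introduced here for illustration and does not exist in the seaborn dataset.

```python
import pandas as pd

# Toy passengers, invented for illustration
toy = pd.DataFrame({
    "sex":      ["male", "female", "female", "male", "male", "female"],
    "survived": [0, 1, 1, 0, 1, 1],
})

# Encode the text column as a 0/1 indicator so Pearson r can be computed
toy["is_female"] = (toy["sex"] == "female").astype(int)

r = toy["survived"].corr(toy["is_female"])
print("survived - is_female : ", round(r, 3))

# The string column is still excluded; the new indicator is included
print(toy.corr(method="pearson", numeric_only=True))
```

This is the same idea that makes the adult_male result above possible: a two-valued column is treated as 0 and 1.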

6. Visualizing the results

Visualizing the correlation analysis with scatter plots

- Use pairplot() from the seaborn library

sns.pairplot(titanic, hue='survived')
plt.show()

Out:

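Visualizing the relationship between two variables

- Analysis

The point plot shows the mean of survived (the survival rate) for each pclass and sex, so the values read off it are survival rates, not correlation coefficients.

(1) For women, the survival rate is 0.4 or higher in every class

-> survival is closely tied to pclass.

-> in particular, the survival rate of 1st-class women is close to 1

(2) For men, the survival rate is 0.4 or lower in every class

-> survival stays low whatever the class, so the link between pclass and survived is much weaker.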

sns.catplot(x='pclass', y='survived', hue='sex', data=titanic, kind='point')
plt.show()

Out:
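
The values plotted by the point plot can be checked numerically: with its default settings, each point is the mean of survived within one sex and pclass group. Here is a minimal sketch of that calculation with groupby on a made-up toy frame (`toy` is not the real dataset); on the actual data, `titanic` would take its place.

```python
import pandas as pd

# Toy passengers, invented for illustration
toy = pd.DataFrame({
    "sex":      ["female", "female", "male", "male", "female", "male"],
    "pclass":   [1, 3, 1, 3, 3, 3],
    "survived": [1, 0, 0, 0, 1, 1],
})

# Mean of the 0/1 survived column = survival rate per (sex, pclass) group;
# unstack turns pclass into columns for easier reading
rates = toy.groupby(["sex", "pclass"])["survived"].mean().unstack("pclass")
print(rates)
```

Reading the table row by row gives the same comparison as the plot: one line per sex, one value per class.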

Extra. Plotting the male/female share of deaths by class

survived_color=['red', 'grey']

for i, pclass in enumerate([1, 2, 3], start=1):
    plt.subplot(1, 3, i)
    # sex of the passengers in this class who did not survive
    died = titanic.loc[(titanic['pclass'] == pclass) & (titanic['survived'] == 0), 'sex']
    died.value_counts().plot.pie(explode=[0,0.1], colors=survived_color, autopct='%1.1f%%', shadow=True)
    plt.title(f'Deaths by sex, class {pclass}')

plt.show()

Out:

FIN.
