PyArrow vs. Pandas - CSV 읽기 쓰기 성능 비교
포스트
취소

PyArrow vs. Pandas - CSV 읽기 쓰기 성능 비교

PyArrow 와 Pandas 의 CSV 읽기 쓰기 속도를 비교합니다. (유튜브 강좌 따라하기)

1. PyArrow

import pyarrow.csv as csv

2. PyArrow vs. Pandas : CSV 읽기 쓰기 시간 측정

출처: PyArrow vs. Pandas for managing CSV files - How to Speed Up Data Loading

1) 준비

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
import random
import string
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.csv as csv
from datetime import datetime

def gen_random_string(length: int = 32) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

gen_random_string()
# ==> 'K46RG2R07MLZCO94PMFBQ1Q2GDJE2WFE'

dt = pd.date_range(start=datetime(2000,1,1), end=datetime(2021,1,1), freq='min')
dt[:10]
# ==> 
# DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:01:00',
#                '2000-01-01 00:02:00', '2000-01-01 00:03:00',
#                '2000-01-01 00:04:00', '2000-01-01 00:05:00',
#                '2000-01-01 00:06:00', '2000-01-01 00:07:00',
#                '2000-01-01 00:08:00', '2000-01-01 00:09:00'],
#               dtype='datetime64[ns]', freq='T')

%%time

np.random.seed = 42
df_size = len(dt)
# ==> 11046241

%%time

df = pd.DataFrame({
    'date': dt,
    'a': np.random.rand(df_size),
    'b': np.random.rand(df_size),
    'c': np.random.rand(df_size),
    'd': np.random.rand(df_size),
    'e': np.random.rand(df_size),
    'str1': [gen_random_string() for x in range(df_size)],
    'str2': [gen_random_string() for x in range(df_size)]
})

df.head()
# ==> shape (11046241, 8)

#                  date         a         b         c         d         e  \
# 0 2000-01-01 00:00:00  0.673229  0.000314  0.087757  0.062980  0.407930   
# 1 2000-01-01 00:01:00  0.036759  0.775502  0.740333  0.093106  0.854366   
# 2 2000-01-01 00:02:00  0.767371  0.938403  0.903407  0.486610  0.976815   
# 3 2000-01-01 00:03:00  0.463437  0.661311  0.695746  0.543487  0.539629   
# 4 2000-01-01 00:04:00  0.692024  0.853330  0.098762  0.303125  0.199427   
#                                str1                              str2  
# 0  3Q0OXWM88JVSGHWP8YNE85VJ7A5FSVCD  TZMB94OPLZMN5H0T8Z587DSWX4LV4SQP  
# 1  RHGFUMXXE37R8V6E8635BZ7D0MXD1NYU  WFKQYOYE6HX523DX5OS052NTX5FAGVIG  
# 2  P8GUK7XAAA18K75B83O83YYAL4VSNF1F  3BI8Q4IHNV2SIOU352LSFLZ44FSQSBQ0  
# 3  8YRSAJCLRQKFXY9H81GXTG5AOMPDM52B  8NLQMMUL0GO5CVA7W8819C3VBBZ5M4JZ  
# 4  8HOX5CK95685XSFKXRRVSGRO5IOG9YD8  RC2YRDZXIQSWQ8FJ1WPZSTH3TCCKSM92  

2) Pandas : CSV 읽기 쓰기

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
%%time
# CPU times: user 54.2 s, sys: 1.66 s, total: 55.8 s Wall time: 56 s

df.to_csv('temp/csv_pandas.csv', index=False)

%%time
# CPU times: user 2min 59s, sys: 1.41 s, total: 3min Wall time: 3min 1s

df.to_csv('temp/csv_pandas.csv.gz', index=False, compression='gzip')

%%time
# CPU times: user 13.4 s, sys: 804 ms, total: 14.2 s Wall time: 14.2 s

df1 = pd.read_csv('temp/csv_pandas.csv')

%%time
# CPU times: user 20.8 s, sys: 966 ms, total: 21.8 s Wall time: 21.8 s

df2 = pd.read_csv('temp/csv_pandas.csv.gz')

3) PyArrow : CSV 읽기 쓰기

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
%%time
# CPU times: user 5.76 s, sys: 271 ms, total: 6.03 s Wall time: 6.04 s

csv.write_csv(df_pa_table, 'temp/csv_pyarrow.csv')

%%time
# CPU times: user 1min 21s, sys: 502 ms, total: 1min 22s Wall time: 1min 22s

with pa.CompressedOutputStream('temp/csv_pyarrow.csv.gz', 'gzip') as out:
    csv.write_csv(df_pa_table, out)

%%time
# CPU times: user 2.54 s, sys: 1.47 s, total: 4.01 s Wall time: 1.1 s

df_pa_1 = csv.read_csv('temp/csv_pyarrow.csv')

%%time
# CPU times: user 9.1 s, sys: 565 ms, total: 9.67 s Wall time: 7.05 s

df_pa_2 = csv.read_csv('temp/csv_pyarrow.csv.gz')

Pandas DataFrame 을 pyarrow Table 로 변환

  • Date 타입을 정수형 np.int64 로 변환
  • from_pandas 로 df 를 pa_table 로 생성
1
2
3
4
5
6
7
df_pa = df.copy()
df_pa['date'] = df_pa['date'].values.astype(np.int64) // 10 ** 9

%%time
# CPU times: user 476 ms, sys: 184 ms, total: 660 ms Wall time: 529 ms

df_pa_table = pa.Table.from_pandas(df_pa)

pyarrow Table 을 Pandas DataFrame 으로 변환

  • to_pandas 로 DataFrame 을 변환
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
df_pa_1.schema
# ==>
# date: int64
# a: double
# b: double
# c: double
# d: double
# e: double
# str1: string
# str2: string

%%time
# CPU times: user 2.88 s, sys: 1.49 s, total: 4.38 s Wall time: 5.13 s

df_pd_1 = df_pa_1.to_pandas()

3. PyArrow vs. Pandas : CSV 읽기 쓰기 성능 비교

1) CSV 쓰기 성능 비교

CSV 쓰기 성능 비교

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import matplotlib.pyplot as plt

write_values = [56, 181, 6.04, 82]

mesures = ['Pandas Write (CSV)','Pandas Write(CSV.GZ)','PyArrow Write(CSV)','PyArrow Write(CSV.GZ)']

fig = plt.figure(figsize=(10, 4))

# creating the bar plot
ax = plt.barh(mesures, write_values, color='maroon')
plt.bar_label(ax, label_type='edge', padding=5, fmt='%.1f sec')

# plt.xlabel("Courses offered")
# plt.ylabel("No. of students enrolled")
plt.title("Pandas vs. PyArrow | Write time in seconds")
plt.grid(axis='x', color='black', linestyle='-.')
plt.grid(axis='y')
#plt.show()
plt.savefig('temp/09-pyarrow-vs-pandas-csv-write.png')

2) CSV 읽기 성능 비교

CSV 읽기 성능 비교

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import matplotlib.pyplot as plt

read_values  = [14.2, 21.8, 1.1, 7.05]

mesures = ['Pandas Read(CSV)','Pandas Read(CSV.GZ)','PyArrow Read(CSV)','PyArrow Read(CSV.GZ)']

fig = plt.figure(figsize=(10, 4))

# creating the bar plot
ax = plt.barh(mesures, read_values, color='maroon')
plt.bar_label(ax, label_type='edge', padding=5, fmt='%.1f sec')

# plt.xlabel("Courses offered")
# plt.ylabel("No. of students enrolled")
plt.title("Pandas vs. PyArrow | Write time in seconds")
plt.grid(axis='x', color='black', linestyle='-.')
plt.grid(axis='y')
#plt.show()
plt.savefig('temp/09-pyarrow-vs-pandas-csv-read.png')

9. Review

  • Pandas 외에 PyArrow 를 익히자.
  • Spark 없이 대량의 데이터를 가공할 때에는 PyArrow 가 유용하다.

 
 

끝!   읽어주셔서 감사합니다.

이 기사는 저작권자의 CC BY 4.0 라이센스를 따릅니다.