2020.03.09
1. What is TensorFlow Extended?
- TensorFlow Extended (TFX) is a TensorFlow-based platform for performant machine learning in production, first designed for use within Google, but now mostly open sourced.
- An end-to-end platform for deploying production ML pipelines
- Site: https://www.tensorflow.org/tfx/
2. TFX core libraries
- TFX components as building blocks: TensorFlow Data Validation, TensorFlow Transform, TensorFlow Model Analysis, and TensorFlow Serving.
2.1 TensorFlow Data Validation
- References
Introduction: https://www.tensorflow.org/tfx/guide/tfdv
Caveat: after running "!pip install -q pyarrow==0.15.0", click Runtime > Restart runtime in the menu. If you skip the restart and continue, the following error occurs later:
AttributeError: module 'pyarrow.types' has no attribute 'is_large_list'
- Summary
✓ TensorFlow Data Validation identifies anomalies in training and serving data, and can automatically create a schema by examining the data.
That includes looking at descriptive statistics, inferring a schema, checking for and fixing anomalies, and checking for drift and skew in our dataset.
✓ Internally, TFDV uses Apache Beam's data-parallel processing framework to scale the computation of statistics over large datasets.
Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines.
Beam is a framework for distributed data processing that can run in several environments: on Google Cloud (Dataflow), on a single PC (multi-threaded), or on a Spark cluster (https://bcho.tistory.com/1221)
- Example walkthrough
a. Visualize statistics
tfdv.visualize_statistics() uses Facets to create a succinct visualization of our training data.
Facets Overview takes input feature data from any number of datasets, analyzes them feature by feature, and visualizes the analysis. (https://pair-code.github.io/facets/)
$ head /tmp/tmpishmrvk8/data/train/data.csv | cut -c-100
pickup_community_area,fare,trip_start_month,trip_start_hour,trip_start_day,trip_start_timestamp,pick
22,12.85,3,11,7,1393673400,41.920451512,-87.679954768,41.877406123,-87.621971652,0.0,,17031320400,Ca
22,5.45,8,21,7,1439675100,41.920451512,-87.679954768,41.906771332,-87.681025231,1.2,,17031241300,Cas
…
$
b. Infer a schema
tfdv.infer_schema() to create a schema for our data, tfdv.display_schema() to display the inferred schema
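To make the idea of schema inference concrete, here is a rough pure-Python sketch that scans CSV rows once and records a type and a presence fraction per column. The helper name infer_simple_schema, the dictionary layout, and the sample rows are assumptions for illustration only; TFDV's real inference is far more thorough (domains, value counts, shapes, and so on).

```python
import csv
import io


def _value_type(text):
    """Classify one non-empty CSV cell as INT, FLOAT, or BYTES."""
    try:
        int(text)
        return "INT"
    except ValueError:
        pass
    try:
        float(text)
        return "FLOAT"
    except ValueError:
        return "BYTES"


def infer_simple_schema(csv_text):
    """Hypothetical helper: infer a per-column type and presence fraction."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    schema = {}
    for name in rows[0].keys():
        values = [row[name] for row in rows if row[name] != ""]
        types = {_value_type(v) for v in values}
        if types <= {"INT"}:
            col_type = "INT"
        elif types <= {"INT", "FLOAT"}:
            col_type = "FLOAT"
        else:
            col_type = "BYTES"
        schema[name] = {
            "type": col_type,
            # Fraction of rows in which the column has a value.
            "min_fraction": len(values) / len(rows),
        }
    return schema


# Made-up sample rows; the company column is empty in the first row.
sample = "pickup_community_area,fare,company\n22,12.85,\n22,5.45,Example Cab Co\n"
schema = infer_simple_schema(sample)
print(schema)
```

The type names mirror the ones that appear in schema.pbtxt (INT, FLOAT, BYTES), and min_fraction plays the same role as the presence field there.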
c. Check evaluation data for errors & Compare evaluation data with training data
tfdv.generate_statistics_from_csv() to compute stats for evaluation data.
tfdv.validate_statistics() to compare evaluation data with training data.
d. Check for evaluation anomalies
tfdv.validate_statistics() to check eval data for errors by validating the eval data stats using the previously inferred schema.
This is especially important for categorical features, where we want to identify the range of acceptable values.
e. Fix evaluation anomalies in the schema
If an anomaly truly indicates a data error, then the underlying data should be fixed. Otherwise, we can simply update the schema to include the values in the eval dataset.
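The check-then-fix loop for a categorical feature can be sketched in a few lines of plain Python. This is only an illustration of the idea; the function name and the sample values are made up, and TFDV itself does this through tfdv.validate_statistics() and the schema's domain.

```python
def find_unexpected_values(domain, eval_values):
    """Return eval values that are missing from the schema domain, in first-seen order."""
    seen = set()
    unexpected = []
    for value in eval_values:
        if value not in domain and value not in seen:
            seen.add(value)
            unexpected.append(value)
    return unexpected


# Domain inferred from training data (hypothetical values).
payment_type_domain = {"Cash", "Credit Card"}
eval_payment_types = ["Cash", "Prcard", "Credit Card", "Prcard"]

anomalies = find_unexpected_values(payment_type_domain, eval_payment_types)
print(anomalies)  # ['Prcard']

# If the new value is legitimate rather than a data error, extend the domain.
payment_type_domain.update(anomalies)
print(find_unexpected_values(payment_type_domain, eval_payment_types))  # []
```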
f. Schema Environments
In supervised learning we need to include labels in our dataset, but when we serve the model for inference the labels will not be included.
In some cases introducing slight schema variations is necessary. Environments can be used to express such requirements.
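A minimal sketch of how environments relax the schema, assuming a dictionary-based mini-schema invented for this note. The field name not_in_environment matches the schema field TFDV uses; the feature names, the choice of tips as the label, and the helper function are illustrative assumptions.

```python
# Hypothetical mini-schema: each feature lists the environments that omit it.
SCHEMA = {
    "fare": {"not_in_environment": set()},
    "trip_start_hour": {"not_in_environment": set()},
    "tips": {"not_in_environment": {"SERVING"}},  # assumed label, absent at serving
}


def missing_features(schema, columns, environment):
    """Return features expected in `environment` but absent from `columns`."""
    expected = [
        name
        for name, spec in schema.items()
        if environment not in spec["not_in_environment"]
    ]
    return [name for name in expected if name not in columns]


serving_columns = {"fare", "trip_start_hour"}
print(missing_features(SCHEMA, serving_columns, "TRAINING"))  # ['tips']
print(missing_features(SCHEMA, serving_columns, "SERVING"))   # []
```

The same serving data is an anomaly under the TRAINING environment (the label is missing) but valid under SERVING.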
g. Check for drift and skew
TFDV performs this check by comparing the statistics of the different datasets based on the drift/skew comparators specified in the schema.
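✓ Drift
• Drift detection is supported for categorical features, between consecutive spans of data (span N and span N+1), for example between training data from different days
• Drift is expressed as an L-infinity distance, and a warning can be raised when it exceeds the allowed threshold
• Choosing the right distance threshold is an iterative process that requires domain knowledge and experimentation
✓ Skew
• Schema Skew: the training and serving data do not follow the same schema
• Feature Skew: the feature generation logic changes
• Distribution Skew: the distributions of the training and serving data differ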
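The L-infinity distance used for drift can be computed by hand: normalize the value counts of each span and take the largest absolute difference across categories. The sketch below assumes made-up payment-type data and a hypothetical threshold of 0.2; it is not TFDV's implementation.

```python
from collections import Counter


def l_infinity_distance(values_a, values_b):
    """Largest absolute difference between the normalized value frequencies.

    Both inputs must be non-empty.
    """
    freq_a, freq_b = Counter(values_a), Counter(values_b)
    total_a, total_b = len(values_a), len(values_b)
    categories = set(freq_a) | set(freq_b)
    return max(
        abs(freq_a[c] / total_a - freq_b[c] / total_b) for c in categories
    )


# Made-up categorical feature values for two consecutive days.
day_n = ["Cash"] * 6 + ["Credit Card"] * 4
day_n_plus_1 = ["Cash"] * 3 + ["Credit Card"] * 7

THRESHOLD = 0.2  # hypothetical tolerance; tuning it takes domain knowledge
distance = l_infinity_distance(day_n, day_n_plus_1)
if distance > THRESHOLD:
    print(f"drift warning: L-infinity distance {distance:.2f} exceeds {THRESHOLD}")
```

Here the share of "Cash" moves from 0.6 to 0.3, so the distance is about 0.3 and the warning fires.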
h. Freeze the schema
Create the schema with tfdv.infer_schema(), review the evaluation anomalies and update the schema where needed, then save it to disk
$ cat /tmp/tmpishmrvk8/chicago_taxi_output/schema.pbtxt
feature {
name: "pickup_community_area"
type: INT
presence {
min_fraction: 1.0
min_count: 1
}
shape {
dim {
size: 1
}
}
}
…
$
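Steps b through h can be strung together roughly as follows. This is a sketch, not the tutorial's code: the function name and its arguments are hypothetical, tfdv.write_schema_text() is a TFDV call that these notes do not mention, and the import sits inside the function only so the snippet can be loaded on a machine without TFDV (a real script would import at the top).

```python
def build_and_freeze_schema(train_csv, eval_csv, schema_path):
    """Infer a schema from training data, validate eval data, and save the schema."""
    # Assumption: tensorflow-data-validation is installed in the environment.
    import tensorflow_data_validation as tfdv

    # Compute statistics for both datasets.
    train_stats = tfdv.generate_statistics_from_csv(data_location=train_csv)
    eval_stats = tfdv.generate_statistics_from_csv(data_location=eval_csv)

    # Infer the schema from training statistics, then check eval data against it.
    schema = tfdv.infer_schema(statistics=train_stats)
    anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

    # Freeze the schema by writing it to disk as text (schema.pbtxt).
    tfdv.write_schema_text(schema, schema_path)
    return schema, anomalies
```

In practice the anomalies would be reviewed, and the schema adjusted, before the schema is written out.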
- Package
tensorflow-data-validation
2.2 TensorFlow Transform
- References
Example: data preprocessing (beginner) - https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/transform/simple.ipynb
Data preprocessing (advanced) - https://colab.research.google.com/github/tensorflow/tfx/blob/master/docs/tutorials/transform/census.ipynb
- Summary
✓ The Feature Engineering Component of TensorFlow Extended (TFX)
✓ TensorFlow Transform is a library for preprocessing input data for TensorFlow, including creating features that require a full pass over the training dataset.
• Normalize an input value by using the mean and standard deviation
• Convert strings to integers by generating a vocabulary over all of the input values
• Convert floats to integers by assigning them to buckets, based on the observed data distribution
✓ Using the same graph for both training and serving can prevent skew, since the same transformations are applied in both stages.
In order to understand tf.Transform and how it works with Apache Beam, you'll need to know a little bit about Apache Beam itself.
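✓ Transformed data is written in TFRecord format
A TFRecord file is a binary format for storing TensorFlow training data and similar records; the data is serialized to the file using Google's Protocol Buffers format
The data format is one of the important factors that determine TensorFlow training performance
Because it simplifies the TensorFlow code and helps performance, converting training data to TFRecord during preprocessing is recommended whenever possible (especially for image data!)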
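The "full pass over the training dataset" idea from the summary above can be illustrated without any TensorFlow code: an analysis phase computes constants (mean, standard deviation, vocabulary, bucket boundaries) from all of the training data, and a per-example transform phase reuses them. The function names, the alphabetical vocabulary order, and the sample values are assumptions of this sketch; tf.Transform does the equivalent with its own analyzers on Apache Beam.

```python
import bisect
import statistics


def analyze(ages, occupations, num_buckets=2):
    """Full pass over the training data: collect the constants each transform needs."""
    sorted_ages = sorted(ages)
    # Bucket boundaries at evenly spaced positions of the observed, sorted data.
    boundaries = [
        sorted_ages[len(sorted_ages) * i // num_buckets]
        for i in range(1, num_buckets)
    ]
    return {
        "mean": statistics.mean(ages),
        "stddev": statistics.pstdev(ages),
        # Simplification: ids follow alphabetical order.
        "vocabulary": {word: i for i, word in enumerate(sorted(set(occupations)))},
        "boundaries": boundaries,
    }


def transform(age, occupation, constants):
    """Per-example transform that reuses the constants from the analysis pass."""
    return {
        "age_z": (age - constants["mean"]) / constants["stddev"],
        "occupation_id": constants["vocabulary"].get(occupation, -1),
        "age_bucket": bisect.bisect_right(constants["boundaries"], age),
    }


# Made-up training values.
train_ages = [20, 30, 40, 50]
train_occupations = ["Sales", "Adm-clerical", "Sales", "Exec-managerial"]
constants = analyze(train_ages, train_occupations)

# The same constants serve training and serving, which is what prevents skew.
print(transform(50, "Sales", constants))
```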
- Example walkthrough (advanced)
✓ Training data (Input)
$ head -n 2 adult.data
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
$
✓ Transformed data (Output)
$ ls -l /tmp/*_transformed-*
-rw-r--r-- 1 root root 6325420 Mar 10 03:22 /tmp/test_transformed-00000-of-00001
-rw-r--r-- 1 root root 12650868 Mar 10 03:22 /tmp/train_transformed-00000-of-00001
$ cat /tmp/test_transformed-00000-of-00001
# TFRecord is a binary format; parse each record as a tf.train.Example to inspect it
import tensorflow as tf

filenames = "/tmp/train_transformed-00000-of-00001"
raw_dataset = tf.data.TFRecordDataset(filenames)
for raw_record in raw_dataset.take(2):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)
features {
feature {
key: "age"
value {
float_list {
value: 0.301369845867157
}
}
}
feature {
key: "capital-gain"
value {
float_list {
value: 0.02174021676182747
}
}
}
…
- Package
tensorflow-transform
2.3 TensorFlow Model Analysis
- References
Introduction: https://www.tensorflow.org/tfx/guide/tfma
- Summary
✓ TensorFlow Model Analysis allows you to perform model evaluations in the TFX pipeline, and view resultant metrics and plots in a Jupyter notebook
• metrics computed on entire training and holdout dataset, as well as next-day evaluations
• tracking metrics over time
• model quality performance on different feature slices
✓ An EvalSavedModel needs to be exported during training, which is a special SavedModel containing annotations for the metrics, features, labels, and so on.
- Example walkthrough
a. Slicing and Dicing
Create an EvalResult with tfma.run_model_analysis(), then pass it to tfma.view.render_slicing_metrics() to create a SlicingMetricsViewer.
b. Tracking Model Performance Over Time
Use render_time_series() to see how models trained at different times compare.
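What "model quality on different feature slices" means can be shown with a small hand-rolled computation: group the evaluation examples by the value of one feature and compute a metric per group. The function name and the evaluation records are made up for illustration; TFMA computes such slices at scale and renders them interactively.

```python
from collections import defaultdict


def accuracy_by_slice(examples, slice_key):
    """Group examples by the value of `slice_key` and compute accuracy per group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for example in examples:
        slice_value = example[slice_key]
        total[slice_value] += 1
        correct[slice_value] += int(example["label"] == example["prediction"])
    return {value: correct[value] / total[value] for value in total}


# Made-up evaluation records sliced by the trip_start_hour feature.
examples = [
    {"trip_start_hour": 8, "label": 1, "prediction": 1},
    {"trip_start_hour": 8, "label": 0, "prediction": 1},
    {"trip_start_hour": 21, "label": 1, "prediction": 1},
    {"trip_start_hour": 21, "label": 0, "prediction": 0},
]
print(accuracy_by_slice(examples, "trip_start_hour"))  # {8: 0.5, 21: 1.0}
```

An overall accuracy of 0.75 would hide the fact that the model is noticeably worse on the hour-8 slice.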
- Package
tensorflow-model-analysis
2.4 TensorFlow Serving
- References
Introduction: https://www.tensorflow.org/tfx/guide/serving
- Summary
✓ TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments.
✓ To load our trained model into TensorFlow Serving we first need to save it in SavedModel format. This will create a protobuf file in a well-defined directory hierarchy,
and will include a version number. TensorFlow Serving allows us to select which version of a model, or "servable" we want to use when we make inference requests.
Example: saved model
/tmp/1
├── assets
├── saved_model.pb
└── variables
    ├── variables.data-00000-of-00002
    ├── variables.data-00001-of-00002
    └── variables.index
✓ TensorFlow Serving
Server API: https://www.tensorflow.org/tfx/serving/api_docs/cc/
REST Client API: https://www.tensorflow.org/tfx/serving/api_rest
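As a sketch of what a REST inference request looks like, the helper below only builds the URL and JSON body and sends nothing. The URL shape (/v1/models/NAME[/versions/N]:predict) and the "instances" body follow the REST API page linked above; the helper name, port 8501, the model name chicago-taxi, and the feature in the instance are assumptions for illustration.

```python
import json


def build_predict_request(host, model_name, instances, version=None, port=8501):
    """Return the (url, body) pair for a TensorFlow Serving REST predict call."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        # Pin a specific servable version instead of the server's default.
        path += f"/versions/{version}"
    url = f"http://{host}:{port}{path}:predict"
    body = json.dumps({"instances": instances})
    return url, body


url, body = build_predict_request(
    "localhost", "chicago-taxi", [{"trip_start_hour": 11}], version=1
)
print(url)   # http://localhost:8501/v1/models/chicago-taxi/versions/1:predict
print(body)  # {"instances": [{"trip_start_hour": 11}]}
```

The pair can then be sent with any HTTP client as a POST request.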
- OS Package / command
tensorflow-model-server (Ubuntu) / tensorflow_model_server
3. TFX Components
3.1 References
- https://www.tensorflow.org/tfx/tutorials/tfx/components
3.2 Summary
- TFX Components & Pipeline
Components: ExampleGen, StatisticsGen, SchemaGen, ExampleValidator, Transform, Trainer, Evaluator, ModelValidator, Pusher
- Orchestration
In a production deployment of TFX, you will use an orchestrator such as Apache Airflow, Kubeflow Pipelines, or Apache Beam to orchestrate a pre-defined pipeline graph of TFX components.
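At its core, an orchestrator runs the components in an order that respects their data dependencies. The sketch below models the pipeline as a dependency graph and asks the standard library for a valid execution order. The wiring between components is an assumption based on how the taxi pipeline is usually described, and graphlib needs Python 3.9 or later (newer than the 3.7 used elsewhere in these notes).

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each component maps to the components whose outputs it consumes (assumed wiring).
PIPELINE = {
    "ExampleGen": set(),
    "StatisticsGen": {"ExampleGen"},
    "SchemaGen": {"StatisticsGen"},
    "ExampleValidator": {"StatisticsGen", "SchemaGen"},
    "Transform": {"ExampleGen", "SchemaGen"},
    "Trainer": {"Transform", "SchemaGen"},
    "Evaluator": {"ExampleGen", "Trainer"},
    "ModelValidator": {"ExampleGen", "Trainer"},
    "Pusher": {"Trainer", "ModelValidator"},
}

# static_order() yields every component after all of its dependencies.
execution_order = list(TopologicalSorter(PIPELINE).static_order())
print(execution_order)
```

Real orchestrators such as Airflow add scheduling, retries, and monitoring on top of this ordering.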
- Metadata
In a production deployment of TFX, you will access metadata through the ML Metadata (MLMD) API. MLMD stores metadata properties in a database such as MySQL or SQLite, and stores the metadata payloads in a persistent store such as on your filesystem.
- How TFX components relate to the TFX libraries
Using the Transform component
from tfx.components import Transform
…
# Performs transformations and feature engineering in training and serving.
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=infer_schema.outputs['schema'],
    module_file=module_file)
- Related packages
/Users/yoosungjeon/tfx-env/lib/python3.7/site-packages/tfx/components/__init__.py
from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen
from tfx.components.example_validator.component import ExampleValidator
from tfx.components.model_validator.component import ModelValidator
from tfx.components.pusher.component import Pusher
from tfx.components.schema_gen.component import SchemaGen
from tfx.components.statistics_gen.component import StatisticsGen
from tfx.components.trainer.component import Trainer
from tfx.components.transform.component import Transform
/Users/yoosungjeon/tfx-env/lib/python3.7/site-packages/tfx/components/transform/component.py
…
from tfx.components.transform import executor
…
class Transform(base_component.BaseComponent):
    …
    SPEC_CLASS = TransformSpec
    EXECUTOR_SPEC = executor_spec.ExecutorClassSpec(executor.Executor)
    …
    def __init__(
        self,
        examples: types.Channel = None,
        schema: types.Channel = None,
        module_file: Optional[Union[Text, data_types.RuntimeParameter]] = None,
        …
/Users/yoosungjeon/tfx-env/lib/python3.7/site-packages/tfx/components/transform/executor.py
…
from tensorflow_transform import impl_helper
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.beam import analyzer_cache
from tensorflow_transform.beam import common as tft_beam_common
from tensorflow_transform.saved import saved_transform_io
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema
from tensorflow_transform.tf_metadata import metadata_io
from tensorflow_transform.tf_metadata import schema_utils
…
class Executor(base_executor.BaseExecutor):
…
4. TFX Airflow Tutorial
4.1 References
- https://www.tensorflow.org/tfx/tutorials/tfx/airflow_workshop
4.2 Summary
- You’re learning how to create an ML pipeline using TFX
✓ TFX pipelines are appropriate when you will be deploying a production ML application
✓ TFX pipelines are appropriate when datasets are large
✓ TFX pipelines are appropriate when training/serving consistency is important
✓ TFX pipelines are appropriate when version management for inference is important
✓ Google uses TFX pipelines for production ML
- Test environment
macOS 10.15, Python 3.7.6, Airflow 1.10.3
- Airflow pipeline (taxi model, final)
- If an error occurs during the tutorial, see "4.4 Tutorial Troubleshooting" (Good luck)
4.3 Tutorial walkthrough
a. Setup your environment
$ brew update
$ brew install python
$ brew install git
$ export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
$ cd
$ virtualenv -p python3 tfx-env
$ source ~/tfx-env/bin/activate
$ git clone https://github.com/tensorflow/tfx.git
## The setup script (setup_demo.sh) installs TFX and Airflow, and configures Airflow
$ ./tfx/tfx/examples/airflow_workshop/setup/setup_demo.sh
b. Start Airflow and Jupyter
## Open a new terminal window, and in that window ...
$ source ~/tfx-env/bin/activate
$ airflow webserver -p 8080
## Open another new terminal window, and in that window ...
$ source ~/tfx-env/bin/activate
$ airflow scheduler
## Open yet another new terminal window, and in that window ...
## Assuming that you've cloned the TFX repo into ~/tfx
$ source ~/tfx-env/bin/activate
$ cd ~/tfx/tfx/examples/airflow_workshop/notebooks
$ jupyter notebook
c. Dive into your data
c.1 Complete the Python source (uncomment the relevant lines) and run Trigger DAG in Airflow
c.2 ExampleGen ingests and splits the input dataset.
[tfx.components]
from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen
[Flow diagram]
[Result]
$ tree CsvExampleGen
CsvExampleGen
└── examples
    └── 1
        ├── eval
        │   ├── data_tfrecord-00000-of-00004.gz
        │   ├── data_tfrecord-00001-of-00004.gz
        │   ├── data_tfrecord-00002-of-00004.gz
        │   └── data_tfrecord-00003-of-00004.gz
        └── train
            ├── data_tfrecord-00000-of-00004.gz
            ├── data_tfrecord-00001-of-00004.gz
            ├── data_tfrecord-00002-of-00004.gz
            └── data_tfrecord-00003-of-00004.gz
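The train/eval split shown in the tree above can be thought of as hashing each record into a bucket. The sketch below assumes a 2:1 train/eval ratio and uses MD5 purely to get a deterministic hash; the function name and the record strings are made up, and this is not ExampleGen's actual code.

```python
import hashlib


def assign_split(record, train_buckets=2, eval_buckets=1):
    """Deterministically map a record to 'train' or 'eval' by hashing it."""
    digest = hashlib.md5(record.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % (train_buckets + eval_buckets)
    return "train" if bucket < train_buckets else "eval"


# Made-up records standing in for CSV rows.
records = [f"row-{i}" for i in range(300)]
splits = {"train": [], "eval": []}
for record in records:
    splits[assign_split(record)].append(record)

print(len(splits["train"]), len(splits["eval"]))
```

Because the split depends only on the record content, re-running the step sends each record to the same split every time.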
c.3 StatisticsGen calculates statistics for the dataset.
[tfx.components]
from tfx.components import StatisticsGen
[Flow diagram]
[Result]
$ tree StatisticsGen
StatisticsGen
└── statistics
    └── 3
        ├── eval
        │   └── stats_tfrecord
        └── train
            └── stats_tfrecord
c.4 SchemaGen examines the statistics and creates a data schema.
[tfx.components]
from tfx.components import SchemaGen
[Flow diagram]
[Result]
$ tree SchemaGen
SchemaGen
└── schema
    └── 4
        └── schema.pbtxt
$ head SchemaGen/schema/4/schema.pbtxt
feature {
name: "payment_type"
value_count {
min: 1
max: 1
}
type: BYTES
domain: "payment_type"
presence {
min_fraction: 1.0
…
$
c.5 ExampleValidator looks for anomalies and missing values in the dataset.
[tfx.components]
from tfx.components import ExampleValidator
[Flow diagram]
[Result]
$ tree ExampleValidator
ExampleValidator
└── anomalies
    └── 5
        └── anomalies.pbtxt
$ cat ExampleValidator/anomalies/5/anomalies.pbtxt
…
anomaly_info {
key: "company"
value {
description: "Examples contain values missing from the schema: 3094 - 24059 G.L.B. Cab Co (<1%), 3319 - CD Cab Co (<1%), 4053 - 40193 Adwar H. Nikola (<1%), 4197 - Royal Star (<1%), 5006 - Salifu Bawa (<1%), 5724 - KYVI Cab Inc (<1%), 585 - 88805 Valley Cab Co (<1%), 6743 - Luhak Corp (<1%). "
severity: ERROR
short_description: "Unexpected string values"
reason {
type: ENUM_TYPE_UNEXPECTED_STRING_VALUES
short_description: "Unexpected string values"
…
$
d. Feature engineering
- Transform performs feature engineering on the dataset.
- Transform Graph ?
The output of tf.Transform is exported as a TensorFlow graph which you can use for both training and serving. Using the same graph for both training and serving can prevent skew,
since the same transformations are applied in both stages.
[tfx.components]
from tfx.components import Transform
[Flow diagram]
[Result]
$ tree Transform
Transform
├── transform_graph
│   └── 29
│       ├── metadata
│       │   └── schema.pbtxt
│       ├── transform_fn
│       │   ├── assets
│       │   │   ├── vocab_compute_and_apply_vocabulary_1_vocabulary
│       │   │   └── vocab_compute_and_apply_vocabulary_vocabulary
│       │   ├── saved_model.pb
│       │   └── variables
│       └── transformed_metadata
│           └── schema.pbtxt
└── transformed_examples
    └── 29
        ├── eval
        │   ├── transformed_examples-00000-of-00004.gz
        │   ├── transformed_examples-00001-of-00004.gz
        │   ├── transformed_examples-00002-of-00004.gz
        │   └── transformed_examples-00003-of-00004.gz
        └── train
            ├── transformed_examples-00000-of-00004.gz
            ├── transformed_examples-00001-of-00004.gz
            ├── transformed_examples-00002-of-00004.gz
            └── transformed_examples-00003-of-00004.gz
e. Training
- Trainer trains the model using TensorFlow Estimators
[tfx.components]
from tfx.proto import trainer_pb2
from tfx.components import Trainer
[Result]
$ tree Trainer
Trainer
└── model
    └── 31
        ├── eval_model_dir
        │   └── 1583306504
        │       ├── assets
        │       │   ├── vocab_compute_and_apply_vocabulary_1_vocabulary
        │       │   └── vocab_compute_and_apply_vocabulary_vocabulary
        │       ├── saved_model.pb
        │       └── variables
        │           ├── variables.data-00000-of-00001
        │           └── variables.index
        └── serving_model_dir
            ├── checkpoint
            ├── eval_chicago-taxi-eval
            │   └── events.out.tfevents.1583306478.ysjeon-MBP
            ├── events.out.tfevents.1583306464.ysjeon-MBP
            ├── export
            │   └── chicago-taxi
            │       └── 1583306503
            │           ├── assets
            │           │   ├── vocab_compute_and_apply_vocabulary_1_vocabulary
            │           │   └── vocab_compute_and_apply_vocabulary_vocabulary
            │           ├── saved_model.pb
            │           └── variables
            │               ├── variables.data-00000-of-00001
            │               └── variables.index
            ├── graph.pbtxt
            ├── model.ckpt-10000.data-00000-of-00001
            ├── model.ckpt-10000.index
            └── model.ckpt-10000.meta
f. Analyzing model performance
- Evaluator performs deep analysis of the training results.
[tfx.components]
from tfx.proto import evaluator_pb2
from tfx.components import Evaluator
g. Ready for production
- ModelValidator ensures that the model is "good enough" to be pushed to production.
- Pusher deploys the model to a serving infrastructure.
[tfx.components]
from tfx.proto import pusher_pb2
from tfx.components import ModelValidator
from tfx.components import Pusher
[Result]
$ tree ModelValidator Pusher
ModelValidator
└── blessing
    └── 45
        └── BLESSED
Pusher
└── pushed_model
    └── 47
        └── 1583306503
            ├── assets
            │   ├── vocab_compute_and_apply_vocabulary_1_vocabulary
            │   └── vocab_compute_and_apply_vocabulary_vocabulary
            ├── saved_model.pb
            └── variables
                ├── variables.data-00000-of-00001
                └── variables.index
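The BLESSED marker in the ModelValidator output suggests a simple gate: validate, leave a marker file, and push only if the marker is there. The sketch below imitates that idea on a temporary directory. The accuracy comparison, the function names, and the NOT_BLESSED file name are assumptions, not the components' actual logic.

```python
import tempfile
from pathlib import Path


def bless_if_good_enough(candidate_accuracy, baseline_accuracy, blessing_dir):
    """Write a marker file recording whether the candidate may be pushed."""
    blessed = candidate_accuracy >= baseline_accuracy
    marker = "BLESSED" if blessed else "NOT_BLESSED"  # NOT_BLESSED is an assumed name
    Path(blessing_dir).mkdir(parents=True, exist_ok=True)
    (Path(blessing_dir) / marker).touch()
    return blessed


def push_if_blessed(blessing_dir):
    """A Pusher-like step proceeds only when the BLESSED marker exists."""
    return (Path(blessing_dir) / "BLESSED").exists()


with tempfile.TemporaryDirectory() as root:
    blessing_dir = Path(root) / "ModelValidator" / "blessing" / "45"
    bless_if_good_enough(0.82, 0.80, blessing_dir)  # made-up metric values
    should_push = push_if_blessed(blessing_dir)
    print(should_push)  # True
```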
4.4 Tutorial Troubleshooting
- TS #1
▷ Problem
$ airflow webserver -p 8080
…
ImportError: cannot import name 'CsvExampleGen' from 'tfx.components' (/Users/yoosungjeon/tfx-env/lib/python3.7/site-packages/tfx/components/__init__.py)
▷ Solution
$ vi airflow/dags/taxi_pipeline.py
# from tfx.components import CsvExampleGen
from tfx.components.example_gen.csv_example_gen.component import CsvExampleGen
…
- TS #2 : File already exists in database
▷ Problem
$ airflow webserver -p 8080
…
[libprotobuf ERROR google/protobuf/descriptor_database.cc:58] File already exists in database:
[libprotobuf FATAL google/protobuf/descriptor.cc:1370] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
libc++abi.dylib: terminating with uncaught exception of type google::protobuf::FatalException: CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
zsh: abort airflow webserver -p 8080
$
▷ Solution
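Would airflow initdb or airflow resetdb have fixed this? I did not know those commands when the problem occurred, so I found and applied the workaround below.
Upgrading pyarrow (pyarrow-0.14.1 -> pyarrow-0.15.1) resolves the problem; because of dependencies, tfx had to be upgraded as well.
As a side effect, TensorFlow was also updated to 2.0, so parts of the source had to be modified (see TS #4)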
$ pip3 uninstall tfx && pip3 install tfx
Found existing installation: tfx 0.14.0rc1
Uninstalling tfx-0.14.0rc1:
…
Successfully installed absl-py-0.8.1 apache-beam-2.17.0 dill-0.3.0 ml-metadata-0.21.2 pyarrow-0.15.1 pyyaml-3.13 tensorflow-data-validation-0.21.2 tensorflow-metadata-0.21.1 tensorflow-model-analysis-0.21.4 tensorflow-serving-api-2.1.0 tensorflow-transform-0.21.0 tfx-0.21.0 tfx-bsl-0.21.3
$
- TS #3
▷ Problem: the following code raises an error in the Jupyter notebook (http://localhost:8888/notebooks/step4.ipynb)
schemas = store.get_artifacts_of_type_df(tfx_utils.TFXArtifactTypes.SCHEMA)
assert len(schemas.URI) == 1
schema_uri = schemas.URI.iloc[0] + 'schema.pbtxt'
AttributeError: 'DataFrame' object has no attribute 'URI'
▷ Solution: hard-code the location of the model schema
# schemas = store.get_artifacts_of_type_df(tfx_utils.TFXArtifactTypes.SCHEMA)
# assert len(schemas.URI) == 1
# schema_uri = schemas.URI.iloc[0] + 'schema.pbtxt'
schema_uri="/Users/yoosungjeon/airflow/tfx/pipelines/taxi/Transform/transform_graph/14/metadata/schema.pbtxt"
- TS #4
▷ Problem: the following code raises an error in the Jupyter notebook (http://localhost:8888/notebooks/step4.ipynb)
55 return tf.squeeze(
---> 56 tf.sparse_to_dense(tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),
57 default_value),
58 axis=1)
AttributeError: module 'tensorflow' has no attribute 'sparse_to_dense'
▷ Solution: tf.sparse_to_dense() is deprecated and no longer available from r2.0 on; use tf.sparse.to_dense() instead
▷ Problem
93 outputs[_transformed_name(_LABEL_KEY)] = tf.where(
---> 94 tf.is_nan(taxi_fare),
AttributeError: module 'tensorflow' has no attribute 'is_nan'
▷ Solution
Up to TensorFlow r1.15, both tf.math.is_nan() and the alias tf.is_nan() were available; from r2.0 on, only tf.math.is_nan() is provided
Change tf.is_nan() to tf.math.is_nan() in the source
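Put together, the two TS #4 fixes might look like the sketch below. The function names and the default_value line are assumptions, since the tracebacks above show only fragments of the original code. TensorFlow is imported inside each function only so the snippet can be loaded without it; a real module file would import it at the top.

```python
def fill_in_missing(x):
    """Densify a rank-2 SparseTensor with at most one value per row."""
    import tensorflow as tf  # assumption: TensorFlow 2.x is installed

    # Assumed default: empty string for string features, 0 otherwise.
    default_value = "" if x.dtype == tf.string else 0
    return tf.squeeze(
        tf.sparse.to_dense(  # replaces tf.sparse_to_dense(...)
            tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1]),
            default_value,
        ),
        axis=1,
    )


def fare_is_missing(taxi_fare):
    """Element-wise NaN check on the fare tensor."""
    import tensorflow as tf  # assumption: TensorFlow 2.x is installed

    return tf.math.is_nan(taxi_fare)  # replaces tf.is_nan(...)
```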
- TS #5
▷ Problem
In the Jupyter notebook (http://localhost:8888/notebooks/step5.ipynb), TensorBoard shows no data when launched
No dashboards are active for the current data set.
▷ Solution: set TENSORBOARD_LOGDIR explicitly, as below
#os.environ['TENSORBOARD_LOGDIR'] = tensorboard_logdir
os.environ['TENSORBOARD_LOGDIR'] = "/Users/yoosungjeon/airflow/tfx/pipelines/taxi/Trainer/model/31/serving_model_dir"
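Instead of hard-coding the run id (31 above), the newest Trainer output could be located programmatically. This is a sketch under the assumption that the layout is <pipeline root>/Trainer/model/<numeric run id>/serving_model_dir, as in the tree shown earlier; the helper name is made up and the demo runs on a throwaway directory.

```python
import tempfile
from pathlib import Path


def latest_serving_model_dir(pipeline_root):
    """Return the serving_model_dir of the Trainer run with the highest numeric id."""
    model_root = Path(pipeline_root) / "Trainer" / "model"
    run_ids = [int(p.name) for p in model_root.iterdir() if p.name.isdigit()]
    if not run_ids:
        raise FileNotFoundError(f"no Trainer runs under {model_root}")
    return model_root / str(max(run_ids)) / "serving_model_dir"


# Demonstrate on a temporary directory tree that mimics the layout.
with tempfile.TemporaryDirectory() as root:
    for run_id in ("14", "31"):
        (Path(root) / "Trainer" / "model" / run_id / "serving_model_dir").mkdir(parents=True)
    logdir = latest_serving_model_dir(root)
    print(logdir.relative_to(root).as_posix())  # Trainer/model/31/serving_model_dir
```

The returned path would then be assigned to os.environ['TENSORBOARD_LOGDIR'] in the notebook.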