Feature engineering for machine learning: principles and techniques for data scientists
Authors: | Zheng, Alice; Casari, Amanda |
---|---|
Format: | Book |
Language: | English |
Published: | First edition, April 2018. Beijing: O'Reilly, 2018 |
Subjects: | |
Links: | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=030355604&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
Extent: | xiii, 200 pages, illustrations, diagrams |
ISBN: | 9781491953242 |
Internal format
MARC
LEADER | 00000nam a2200000 c 4500 | ||
---|---|---|---|
001 | BV044963084 | ||
003 | DE-604 | ||
005 | 20190312 | ||
007 | t| | ||
008 | 180528s2018 xx a||| |||| 00||| eng d | ||
020 | |a 9781491953242 |9 978-1-491-95324-2 | ||
035 | |a (OCoLC)1039175935 | ||
035 | |a (DE-599)HBZHT019156765 | ||
040 | |a DE-604 |b ger |e rda | ||
041 | 0 | |a eng | |
049 | |a DE-20 |a DE-739 |a DE-1046 |a DE-861 |a DE-573 |a DE-473 |a DE-384 |a DE-945 | ||
084 | |a ST 302 |0 (DE-625)143652: |2 rvk | ||
100 | 1 | |a Zheng, Alice |e Verfasser |0 (DE-588)1156465044 |4 aut | |
245 | 1 | 0 | |a Feature engineering for machine learning |b principles and techniques for data scientists |c Alice Zheng and Amanda Casari |
264 | 1 | |a First edition |c April 2018 | |
264 | 1 | |a Beijing |b O'Reilly |c 2018 | |
300 | |a xiii, 200 Seiten |b Illustrationen, Diagramme | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
650 | 0 | 7 | |a Datenanalyse |0 (DE-588)4123037-1 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a NumPy |0 (DE-588)1192378229 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a pandas |g Software |0 (DE-588)1192378490 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Rohdaten |0 (DE-588)4875810-3 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Merkmalsextraktion |0 (DE-588)4314440-8 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Maschinelles Lernen |0 (DE-588)4193754-5 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Datenaufbereitung |0 (DE-588)4148865-9 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a Maschinelles Lernen |0 (DE-588)4193754-5 |D s |
689 | 0 | 1 | |a Datenanalyse |0 (DE-588)4123037-1 |D s |
689 | 0 | |5 DE-604 | |
689 | 1 | 0 | |a Maschinelles Lernen |0 (DE-588)4193754-5 |D s |
689 | 1 | 1 | |a Datenaufbereitung |0 (DE-588)4148865-9 |D s |
689 | 1 | 2 | |a Rohdaten |0 (DE-588)4875810-3 |D s |
689 | 1 | 3 | |a Merkmalsextraktion |0 (DE-588)4314440-8 |D s |
689 | 1 | 4 | |a NumPy |0 (DE-588)1192378229 |D s |
689 | 1 | 5 | |a pandas |g Software |0 (DE-588)1192378490 |D s |
689 | 1 | |8 1\p |5 DE-604 | |
700 | 1 | |a Casari, Amanda |e Verfasser |0 (DE-588)1156465192 |4 aut | |
856 | 4 | 2 | |m Digitalisierung UB Passau - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=030355604&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
883 | 1 | |8 1\p |a cgwrk |d 20201028 |q DE-101 |u https://d-nb.info/provenance/plan#cgwrk | |
943 | 1 | |a oai:aleph.bib-bvb.de:BVB01-030355604 |
Record in the search index
_version_ | 1819256717067681792 |
---|---|
adam_text | Table of Contents
Preface....................................................................vii
1. The Machine Learning Pipeline........................................ 1
Data 1
Tasks 1
Models 2
Features 3
Model Evaluation 3
2. Fancy Tricks with Simple Numbers....................................... 5
Scalars, Vectors, and Spaces 6
Dealing with Counts 8
Binarization 8
Quantization or Binning 10
Log Transformation 15
Log Transform in Action 19
Power Transforms: Generalization of the Log Transform 23
Feature Scaling or Normalization 29
Min-Max Scaling 30
Standardization (Variance Scaling) 31
ℓ2 Normalization 32
Interaction Features 35
Feature Selection 38
Summary 39
Bibliography 39
3. Text Data: Flattening, Filtering, and Chunking........................ 41
Bag-of-X: Turning Natural Text into Flat Vectors 42
Bag-of-Words 42
Bag-of-n-Grams 45
Filtering for Cleaner Features 47
Stopwords 48
Frequency-Based Filtering 48
Stemming 51
Atoms of Meaning: From Words to n-Grams to Phrases 52
Parsing and Tokenization 52
Collocation Extraction for Phrase Detection 52
Summary 59
Bibliography 60
4. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf............. 61
Tf-Idf: A Simple Twist on Bag-of-Words 61
Putting It to the Test 63
Creating a Classification Dataset 64
Scaling Bag-of-Words with Tf-Idf Transformation 65
Classification with Logistic Regression 66
Tuning Logistic Regression with Regularization 68
Deep Dive: What Is Happening? 72
Summary 75
Bibliography 76
5. Categorical Variables: Counting Eggs in the Age of Robotic Chickens..... 77
Encoding Categorical Variables 78
One-Hot Encoding 78
Dummy Coding 79
Effect Coding 82
Pros and Cons of Categorical Variable Encodings 83
Dealing with Large Categorical Variables 83
Feature Hashing 84
Bin Counting 87
Summary 94
Bibliography 96
6. Dimensionality Reduction: Squashing the Data Pancake with PCA.......... 99
Intuition 99
Derivation 101
Linear Projection 102
Variance and Empirical Variance 103
Principal Components: First Formulation 104
Principal Components: Matrix-Vector Formulation 104
General Solution of the Principal Components 105
Transforming Features 105
Implementing PCA 106
PCA in Action 106
Whitening and ZCA 108
Considerations and Limitations of PCA 109
Use Cases 111
Summary 112
Bibliography 113
7. Nonlinear Featurization via K-Means Model Stacking................. 115
k-Means Clustering 117
Clustering as Surface Tiling 119
k-Means Featurization for Classification 122
Alternative Dense Featurization 127
Pros, Cons, and Gotchas 128
Summary 130
Bibliography 131
8. Automating the Featurizer: Image Feature Extraction and Deep Learning. 133
The Simplest Image Features (and Why They Don’t Work) 134
Manual Feature Extraction: SIFT and HOG 135
Image Gradients 135
Gradient Orientation Histograms 139
SIFT Architecture 143
Learning Image Features with Deep Neural Networks 144
Fully Connected Layers 144
Convolutional Layers 146
Rectified Linear Unit (ReLU) Transformation 150
Response Normalization Layers 151
Pooling Layers 153
Structure of AlexNet 153
Summary 157
Bibliography 157
9. Back to the Feature: Building an Academic Paper Recommender........... 159
Item-Based Collaborative Filtering 159
First Pass: Data Import, Cleaning, and Feature Parsing 161
Academic Paper Recommender: Naive Approach 161
Second Pass: More Engineering and a Smarter Model 167
Academic Paper Recommender: Take 2 167
Third Pass: More Features = More Information 173
Academic Paper Recommender: Take 3 174
Summary 176
Bibliography 177
A. Linear Modeling and Linear Algebra Basics......................... 179
Index................................................................ 193
|
any_adam_object | 1 |
author | Zheng, Alice Casari, Amanda |
author_GND | (DE-588)1156465044 (DE-588)1156465192 |
author_facet | Zheng, Alice Casari, Amanda |
author_role | aut aut |
author_sort | Zheng, Alice |
author_variant | a z az a c ac |
building | Verbundindex |
bvnumber | BV044963084 |
classification_rvk | ST 302 |
ctrlnum | (OCoLC)1039175935 (DE-599)HBZHT019156765 |
discipline | Informatik |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>02317nam a2200517 c 4500</leader><controlfield tag="001">BV044963084</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20190312 </controlfield><controlfield tag="007">t|</controlfield><controlfield tag="008">180528s2018 xx a||| |||| 00||| eng d</controlfield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781491953242</subfield><subfield code="9">978-1-491-95324-2</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)1039175935</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)HBZHT019156765</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rda</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-20</subfield><subfield code="a">DE-739</subfield><subfield code="a">DE-1046</subfield><subfield code="a">DE-861</subfield><subfield code="a">DE-573</subfield><subfield code="a">DE-473</subfield><subfield code="a">DE-384</subfield><subfield code="a">DE-945</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 302</subfield><subfield code="0">(DE-625)143652:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Zheng, Alice</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)1156465044</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Feature engineering for machine learning</subfield><subfield code="b">principles and techniques for data scientists</subfield><subfield code="c">Alice Zheng and Amanda Casari</subfield></datafield><datafield tag="264" ind1=" " 
ind2="1"><subfield code="a">First edition</subfield><subfield code="c">April 2018</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Beijing</subfield><subfield code="b">O'Reilly</subfield><subfield code="c">2018</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">xiii, 200 Seiten</subfield><subfield code="b">Illustrationen, Diagramme</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Datenanalyse</subfield><subfield code="0">(DE-588)4123037-1</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">NumPy</subfield><subfield code="0">(DE-588)1192378229</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">pandas</subfield><subfield code="g">Software</subfield><subfield code="0">(DE-588)1192378490</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Rohdaten</subfield><subfield code="0">(DE-588)4875810-3</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Merkmalsextraktion</subfield><subfield code="0">(DE-588)4314440-8</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Maschinelles Lernen</subfield><subfield 
code="0">(DE-588)4193754-5</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Datenaufbereitung</subfield><subfield code="0">(DE-588)4148865-9</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Maschinelles Lernen</subfield><subfield code="0">(DE-588)4193754-5</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Datenanalyse</subfield><subfield code="0">(DE-588)4123037-1</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="689" ind1="1" ind2="0"><subfield code="a">Maschinelles Lernen</subfield><subfield code="0">(DE-588)4193754-5</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="1" ind2="1"><subfield code="a">Datenaufbereitung</subfield><subfield code="0">(DE-588)4148865-9</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="1" ind2="2"><subfield code="a">Rohdaten</subfield><subfield code="0">(DE-588)4875810-3</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="1" ind2="3"><subfield code="a">Merkmalsextraktion</subfield><subfield code="0">(DE-588)4314440-8</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="1" ind2="4"><subfield code="a">NumPy</subfield><subfield code="0">(DE-588)1192378229</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="1" ind2="5"><subfield code="a">pandas</subfield><subfield code="g">Software</subfield><subfield code="0">(DE-588)1192378490</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="1" ind2=" "><subfield code="8">1\p</subfield><subfield code="5">DE-604</subfield></datafield><datafield tag="700" ind1="1" ind2=" 
"><subfield code="a">Casari, Amanda</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)1156465192</subfield><subfield code="4">aut</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Passau - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=030355604&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="883" ind1="1" ind2=" "><subfield code="8">1\p</subfield><subfield code="a">cgwrk</subfield><subfield code="d">20201028</subfield><subfield code="q">DE-101</subfield><subfield code="u">https://d-nb.info/provenance/plan#cgwrk</subfield></datafield><datafield tag="943" ind1="1" ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-030355604</subfield></datafield></record></collection> |
id | DE-604.BV044963084 |
illustrated | Illustrated |
indexdate | 2024-12-20T18:15:26Z |
institution | BVB |
isbn | 9781491953242 |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-030355604 |
oclc_num | 1039175935 |
open_access_boolean | |
owner | DE-20 DE-739 DE-1046 DE-861 DE-573 DE-473 DE-BY-UBG DE-384 DE-945 |
owner_facet | DE-20 DE-739 DE-1046 DE-861 DE-573 DE-473 DE-BY-UBG DE-384 DE-945 |
physical | xiii, 200 Seiten Illustrationen, Diagramme |
publishDate | 2018 |
publishDateSearch | 2018 |
publishDateSort | 2018 |
publisher | O'Reilly |
record_format | marc |
spellingShingle | Zheng, Alice Casari, Amanda Feature engineering for machine learning principles and techniques for data scientists Datenanalyse (DE-588)4123037-1 gnd NumPy (DE-588)1192378229 gnd pandas Software (DE-588)1192378490 gnd Rohdaten (DE-588)4875810-3 gnd Merkmalsextraktion (DE-588)4314440-8 gnd Maschinelles Lernen (DE-588)4193754-5 gnd Datenaufbereitung (DE-588)4148865-9 gnd |
subject_GND | (DE-588)4123037-1 (DE-588)1192378229 (DE-588)1192378490 (DE-588)4875810-3 (DE-588)4314440-8 (DE-588)4193754-5 (DE-588)4148865-9 |
title | Feature engineering for machine learning principles and techniques for data scientists |
title_auth | Feature engineering for machine learning principles and techniques for data scientists |
title_exact_search | Feature engineering for machine learning principles and techniques for data scientists |
title_full | Feature engineering for machine learning principles and techniques for data scientists Alice Zheng and Amanda Casari |
title_fullStr | Feature engineering for machine learning principles and techniques for data scientists Alice Zheng and Amanda Casari |
title_full_unstemmed | Feature engineering for machine learning principles and techniques for data scientists Alice Zheng and Amanda Casari |
title_short | Feature engineering for machine learning |
title_sort | feature engineering for machine learning principles and techniques for data scientists |
title_sub | principles and techniques for data scientists |
topic | Datenanalyse (DE-588)4123037-1 gnd NumPy (DE-588)1192378229 gnd pandas Software (DE-588)1192378490 gnd Rohdaten (DE-588)4875810-3 gnd Merkmalsextraktion (DE-588)4314440-8 gnd Maschinelles Lernen (DE-588)4193754-5 gnd Datenaufbereitung (DE-588)4148865-9 gnd |
topic_facet | Datenanalyse NumPy pandas Software Rohdaten Merkmalsextraktion Maschinelles Lernen Datenaufbereitung |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=030355604&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT zhengalice featureengineeringformachinelearningprinciplesandtechniquesfordatascientists AT casariamanda featureengineeringformachinelearningprinciplesandtechniquesfordatascientists |