Start studying: Saturday, December 7th
Teaching materials, etc.:
・ Miyuki Oshige, "Details! Python3 Introductory Note" (Sotec, 2017): completed on Thursday, December 19th
・ Progate Python course (5 courses in total): completed on Saturday, December 21st
・ Andreas C. Müller, Sarah Guido, "(Japanese title) Machine learning starting with Python" (O'Reilly Japan, 2017): completed on Saturday, December 23
・ Kaggle: Real or Not? NLP with Disaster Tweets: submitted on Saturday, December 28th, adjusted through Friday, January 3rd
・ **Wes McKinney, "(Japanese title) Introduction to data analysis by Python" (O'Reilly Japan, 2018)**: January 4th (Sat) ~
Finished reading through Chapter 8, Data Wrangling (p. 276).
・ A strength of pandas is its many functions for reading tabular data into DataFrame objects: read_csv, read_table, read_excel, read_html, and so on. Some of these readers infer column types automatically, so you do not always have to specify the details. Default delimiter (sep): read_csv uses a comma, read_table uses \t (horizontal tab). To split on anything else, pass it as the sep argument.
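A minimal sketch of the default delimiters and the sep argument. The in-memory strings are made-up sample data standing in for files on disk.

```python
import io

import pandas as pd

# Hypothetical sample data; in practice these would be file paths.
csv_text = "a,b,c\n1,2,x\n3,4,y\n"
tsv_text = "a\tb\tc\n1\t2\tx\n3\t4\ty\n"
pipe_text = "a|b|c\n1|2|x\n3|4|y\n"

df_csv = pd.read_csv(io.StringIO(csv_text))             # default sep=","
df_tsv = pd.read_table(io.StringIO(tsv_text))           # default sep="\t"
df_pipe = pd.read_csv(io.StringIO(pipe_text), sep="|")  # custom delimiter

# Type inference: a and b become integer columns, c a string column.
print(df_csv.dtypes)
```

All three calls should produce the same DataFrame, which shows that only the delimiter differs.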
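・ Files with irregular formats that the read functions cannot handle directly: pass the open file to csv.reader and collect the rows it yields into a list called lines. Split that into a header line and data lines, then build a dictionary of columns with a dictionary comprehension and zip(*values), which can be passed to DataFrame.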
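The manual route through csv.reader might look like the following sketch. The raw string is a made-up stand-in for an opened file, and the variable names are only illustrative.

```python
import csv
import io

import pandas as pd

# Hypothetical file contents; with a real file, use open(path) instead.
raw = "a,b,c\n1,2,3\n4,5,6\n"

lines = list(csv.reader(io.StringIO(raw)))  # each row is a list of strings
header, values = lines[0], lines[1:]        # split header from data lines

# zip(*values) transposes rows into columns; pair each column with its header.
data_dict = {h: v for h, v in zip(header, zip(*values))}

frame = pd.DataFrame(data_dict)
print(frame)
```

Note that csv.reader does no type inference, so every value is still a string at this point.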
・ JSON (JavaScript Object Notation): one of the standard formats for exchanging data over HTTP requests between a web browser and an application.
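As a rough illustration, JSON text can be turned into Python objects with the standard json module and then into a DataFrame. The JSON string below is invented for the example.

```python
import json

import pandas as pd

# Hypothetical JSON payload, such as a web API might return.
obj = """
{"name": "Taro",
 "siblings": [{"name": "Hana", "age": 30},
              {"name": "Ken", "age": 38}]}
"""

result = json.loads(obj)     # JSON text -> Python dict
siblings = pd.DataFrame(result["siblings"], columns=["name", "age"])
as_text = json.dumps(result) # Python dict -> JSON text

print(siblings)
```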
・ Data in HTML / XML format can also be read and written. Read it with a read function, skip unneeded rows and set indexes as needed, collect the records into dictionaries, and finally turn them into a DataFrame. This is what is usually called scraping: shaping the data so that it can be used. Kaggle provides a lot of well-organized data, so it is rarely needed there, but it is a technique likely to be used a lot in practice.
・ HDF5: a file format for storing scientific array data. The HDF5 library is written in C and reads and writes data efficiently, which makes it a good choice for large datasets.
・ Excel files can also be read (read_excel), and data can be read from a SQL database (read_sql).
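For the SQL side, a small sketch using an in-memory SQLite database from the standard library. The table name and rows are made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical table in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (a TEXT, b REAL, c INTEGER)")
con.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [("Atlanta", 1.25, 6), ("Tallahassee", 2.6, 3)],
)
con.commit()

# read_sql runs the query and returns the result as a DataFrame.
df = pd.read_sql("SELECT * FROM test", con)
con.close()

print(df)
```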
・ Handling of missing values: dropna drops every row that contains a missing value (NA, NaN); fillna fills them in. ffill and bfill fill each gap with the preceding or following value. Passing how='all' drops only the rows in which every value is NA. To apply the same operations to columns, specify axis=1. Giving fillna a dictionary fills each column with a different value, and inplace=True overwrites the existing object. Giving fillna the result of data.mean() fills the holes with each column's arithmetic mean.
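The options above can be sketched on a small made-up frame. Forward filling is written here with the ffill method rather than an argument to fillna, which is an assumption about what works across recent pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column c is entirely missing, row 2 is entirely missing.
data = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, 8.0],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

dropped_any = data.dropna()                    # drop rows with any NA
dropped_all = data.dropna(how="all")           # drop rows where every value is NA
dropped_cols = data.dropna(axis=1, how="all")  # same idea, applied to columns

filled_dict = data.fillna({"a": 0, "b": -1})   # a different fill value per column
filled_mean = data.fillna(data.mean())         # fill with each column's mean
filled_fwd = data.ffill()                      # carry the previous value forward

print(filled_mean)
```

Because column c has no values at all, its mean is NaN and it stays empty after the mean fill.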
・ Data transformation: duplicated returns a Series of booleans marking rows that repeat an earlier row, and drop_duplicates removes the rows marked True. map converts element by element and accepts either a function or a dictionary. (Several of these methods, such as map, replace, and fillna, accept a dictionary.) Values can also be substituted with replace, which I think I have seen more often on Kaggle: replace(a, b) changes the first argument to the second.
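A short sketch of the three operations, using invented sample values.

```python
import pandas as pd

# Hypothetical frame: row 2 repeats row 0.
data = pd.DataFrame({"k1": ["one", "two", "one", "two"], "k2": [1, 1, 1, 2]})
dup_mask = data.duplicated()      # True where a row repeats an earlier row
deduped = data.drop_duplicates()  # keeps only the rows marked False

# map with a dictionary: convert each element through a lookup table.
food = pd.Series(["bacon", "Pastrami", "bacon"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}
animal = food.str.lower().map(meat_to_animal)

# replace(a, b): change every occurrence of a to b.
values = pd.Series([1.0, -999.0, 2.0, -999.0])
cleaned = values.replace(-999, 0)

print(deduped)
print(animal.tolist())
print(cleaned.tolist())
```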
・ Discretization and binning: put the bin edges in a list and pass it to pandas cut as an argument to divide the data into bins.
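For instance, binning a list of ages might look like this. The ages and bin edges are sample values chosen for illustration.

```python
import pandas as pd

# Hypothetical ages and bin edges.
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Each value is assigned to an interval such as (18, 25]; the right edge is inclusive.
cats = pd.cut(ages, bins)

print(cats.codes)                       # integer bin number for each age
print(pd.Series(cats).value_counts())   # how many ages fall in each bin
```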
・ Detection of outliers: data[(np.abs(data) > 3).any(axis=1)] selects every row that has at least one element whose absolute value exceeds the threshold (3 here is just an example). With data[np.abs(data) > 3] = np.sign(data) * 3 the values can be capped to the range -3 to 3, using sign, which returns -1, 0, or 1 according to the sign of each element.
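Both expressions can be tried on a tiny made-up frame. The threshold of 3 is, as in the note, only an example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two out-of-range values (-3.5 and 4.2).
data = pd.DataFrame({
    "x": [0.5, -3.5, 1.0, 2.0],
    "y": [4.2, 0.1, -0.3, 2.9],
})

# Rows in which at least one value has an absolute value above 3.
outlier_rows = data[(np.abs(data) > 3).any(axis=1)]

# Cap the values to [-3, 3]: np.sign gives -1, 0, or 1, so sign * 3 is the limit.
capped = data.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

print(outlier_rows)
print(capped)
```

For this symmetric case, data.clip(-3, 3) should give the same capped result.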
・ Random sampling: np.random.permutation(5) returns the integers 0 to 4 in random order, and passing that array to take reorders the rows accordingly. The sample method selects a random subset without replacement; to sample with replacement, pass replace=True.
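A minimal sketch of permutation, take, and sample. The data and the random_state values are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-row frame.
df = pd.DataFrame(np.arange(20).reshape(5, 4))

sampler = np.random.permutation(5)  # the integers 0..4 in random order
shuffled = df.take(sampler)         # rows reordered according to sampler

# sample draws without replacement by default.
subset = df.sample(n=3, random_state=0)

# With replace=True the same element may be drawn more than once,
# so n can exceed the number of elements.
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True, random_state=0)

print(shuffled)
print(draws.tolist())
```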
・ Retrieving indicator variables: start from an empty list, loop over the column with a for statement, and call extend(x.split('|')) to add the pieces split on | to the list. Extracting them with pandas unique gives the list of distinct components (p. 229). I also used split often on Kaggle.
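The loop described above, sketched on an invented movies table. The last line uses Series.str.get_dummies as a shortcut to the indicator columns; that is an addition for comparison, not necessarily the route the book takes.

```python
import pandas as pd

# Hypothetical data: each row lists several genres separated by "|".
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        "Action|Crime|Thriller",
    ],
})

all_genres = []
for x in movies["genres"]:
    all_genres.extend(x.split("|"))  # add every piece to one flat list

# Distinct genres, in order of first appearance.
genres = pd.Series(all_genres).unique()

# One 0/1 indicator column per genre.
dummies = movies["genres"].str.get_dummies("|")

print(list(genres))
print(dummies)
```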
・ Regular expressions: use the re module: compile, findall, regex.match, ...
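A brief sketch of those re functions. The sample strings and the simplified e-mail pattern are made up for illustration and are not a complete address validator.

```python
import re

# Split on runs of whitespace with a compiled pattern.
text = "foo    bar\t baz  \tqux"
regex = re.compile(r"\s+")
parts = regex.split(text)
gaps = regex.findall(text)  # the whitespace runs themselves

# Simplified, hypothetical e-mail pattern.
email = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", flags=re.IGNORECASE)
contacts = "Dave dave@example.com, Steve steve@example.org"

found = email.findall(contacts)      # every match in the string
at_start = email.match(contacts)     # match only looks at the start of the string
whole = email.match("wes@example.net")

print(parts)
print(found)
```

match returns None for contacts because the string does not begin with an address, while findall scans the whole string.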
・ Hierarchical index: data with two or more index levels, for example an outer level a and an inner level b that each take the values 1, 2, 3. You can pivot between the long and wide forms with unstack and stack (for example, turning the inner level into column labels).
・ swaplevel changes the order of the levels, and sort_index sorts. Passing level as an argument specifies which level to sort by: 0, 1, ... counting from the outside.
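The hierarchical-index operations can be sketched on a small Series. The level names outer and inner are made up for readability.

```python
import pandas as pd

# Hypothetical Series with a two-level index.
data = pd.Series(
    range(6),
    index=[["a", "a", "a", "b", "b", "b"], [1, 2, 3, 1, 2, 3]],
)
data.index.names = ["outer", "inner"]

wide = data.unstack()      # inner level becomes the column labels
long_again = wide.stack()  # columns go back into the inner index level

swapped = data.swaplevel("outer", "inner")     # inner is now level 0
sorted_by_inner = swapped.sort_index(level=0)  # sort by the new outermost level

print(wide)
print(sorted_by_inner)
```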
・ DataFrame columns can be turned into the index with set_index: set_index('a') makes the values of column a the new index. reset_index is the reverse.
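A quick round trip, using a made-up two-column frame.

```python
import pandas as pd

# Hypothetical frame.
frame = pd.DataFrame({"a": ["x", "y", "z"], "b": [1, 2, 3]})

indexed = frame.set_index("a")    # column a becomes the index (and leaves the columns)
restored = indexed.reset_index()  # the index goes back to being a regular column

print(indexed)
print(restored)
```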
・ Joining and combining: merge, concat, join. merge performs an inner join by default, so only keys common to both sides appear in the result. Specify how='outer' to include every key, even those present on only one side.
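The difference between the default inner join and how='outer' shows up on two small made-up frames that share only some keys.

```python
import pandas as pd

# Hypothetical frames: keys b and c are shared, a and d are not.
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [4, 5, 6]})

inner = pd.merge(left, right, on="key")               # default how="inner"
outer = pd.merge(left, right, on="key", how="outer")  # keep keys from both sides

# concat simply stacks the frames; columns missing on one side become NaN.
stacked = pd.concat([left, right], ignore_index=True)

print(inner)
print(outer)
```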
・ stack drops missing values by default, but you can keep them by passing dropna=False. More generally, most variations seem to be available through arguments, so when there is something you want to do, it is worth checking the arguments first.