csv_utility_help.md

csv_columns_summary.py

usage: csv_columns_summary.py [-h] [-v] [--columns COLUMN[,COLUMN...]]
                              [--function {all,count,sum,avg,min,max,std,median}]
                              [--count STRING] [--output FILE]
                              CSV_FILE

statistics summary for each column of a CSV file

positional arguments:
  CSV_FILE              files to read, if empty, stdin is used

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --columns COLUMN[,COLUMN...]
                        columns to apply the function to, default=all columns
                        that have numeric values
  --function {all,count,sum,avg,min,max,std,median}
                        function to apply
  --count STRING        count occurrences of the given value. if given,
                        '--function' is ignored.
  --output FILE         path of output file, default=stdout

example:
  csv_columns_summary.py --function=sum --columns=A,B test1.csv
  csv_columns_summary.py --count=1 test1.csv
  csv_columns_summary.py --function=all --columns=A,B test1.csv| csvlook -I
| columns | count | sum | avg                | min | max | std                | median |
| ------- | ----- | --- | ------------------ | --- | --- | ------------------ | ------ |
| A       | 2.0   | 5.0 | 1.6666666666666667 | 0.0 | 4.0 | 2.0816659994661326 | 1.0    |
| B       | 2.0   | 7.0 | 2.3333333333333335 | 0.0 | 5.0 | 2.5166114784235836 | 2.0    |
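
For reference, the same kind of per-column summary can be reproduced with plain pandas. This is only a minimal sketch of the idea, not the script itself; it assumes a file 'test1.csv' whose numeric columns are to be summarized.

```python
# minimal sketch: per-column summary of the numeric columns (assumes test1.csv)
import pandas as pd

df = pd.read_csv("test1.csv")
numeric = df.select_dtypes("number")        # default: all columns that have numeric values
summary = numeric.agg(["count", "sum", "mean", "min", "max", "std", "median"]).T
summary.index.name = "columns"
print(summary.to_csv())
```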


csv_combine.py

usage: csv_combine.py [-h] [-v]
                      [--mode {first,bigger,smaller,add,sub,mul,div,mod,pow,lt,le,gt,ge,ne,eq,function}]
                      [--function EXP] [--prologe CODE;[CODE;CODE;...]]
                      [--boolean_by_number] [--output_file FILE]
                      CSV_FILE CSV_FILE_or_VALUE

complement missing data with another CSV's data, element-wise

positional arguments:
  CSV_FILE              first csv file to complement
  CSV_FILE_or_VALUE     second csv file or scalar float value

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --mode {first,bigger,smaller,add,sub,mul,div,mod,pow,lt,le,gt,ge,ne,eq,function}
                        combine mode
  --function EXP        lambda function for function mode
  --prologe CODE;[CODE;CODE;...]
                        pieces of python code to pre-load, for use in
                        expression of '--function'.
  --boolean_by_number   for logical results, use 1/0 instead of True/False
  --output_file FILE    path of output file,default=stdout

remark:
  This processes columns that have only numeric values.

  The bigger, smaller, and similar modes are available only for columns that have numeric values; other columns are handled by combine_first.
  In the compare modes (lt, gt, and so on), NaN always leads to False as the result.

  The function given by '--function' takes two pandas.Series as arguments. See the documentation of pandas.DataFrame.combine:
    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.combine.html#pandas.DataFrame.combine
  When you use '--function', you may use '--prologe' to load external modules:
    ex. '--prologe="import numpy as np;"' or '--prologe="import your_module;"'

    | mode     | numeric only | description                                       |
    | -------- | ------------ | ------------------------------------------------- |
    | first    |              | combine_first                                     |
    | bigger   | O            | select the bigger value                           |
    | smaller  | O            | select the smaller value                          |
    | function | O            | apply the given function                          |
    | add      | O            | addition                                          |
    | sub      | O            | subtraction                                       |
    | mul      | O            | multiplication                                    |
    | div      | O            | division                                          |
    | mod      | O            | modulo                                            |
    | pow      | O            | power                                             |
    | lt       | O            | results of 'less than' are True/False             |
    | le       | O            | results of 'less than or equal' are True/False    |
    | gt       | O            | results of 'greater than' are True/False          |
    | ge       | O            | results of 'greater than or equal' are True/False |
    | ne       | O            | results of 'not equal' are True/False             |
    | eq       | O            | results of 'equal' are True/False                 |

  The second argument is a path to a second csv file or a scalar float value.
  When a scalar value is given, a second csv with the same shape as the first csv is created internally, with the given value in all elements.
  Because this is inefficient, you should not use a scalar value as the second argument for big csv data.
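
The element-wise behaviour can be pictured with pandas.DataFrame.combine, which the remark above points to. Below is a rough sketch of the 'bigger' mode only, under the assumption that both files share the same columns; it is not the script's actual implementation.

```python
# rough sketch of the 'bigger' mode via pandas.DataFrame.combine (t1.csv / t2.csv as in the examples)
import pandas as pd

df1 = pd.read_csv("t1.csv")
df2 = pd.read_csv("t2.csv")

numeric_columns = df1.select_dtypes("number").columns
# for each pair of columns, keep the element-wise bigger value
bigger = df1[numeric_columns].combine(df2[numeric_columns], lambda s1, s2: s1.where(s1 >= s2, s2))
print(bigger)
```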

example:
  csv_combine.py --mode bigger t1.csv t2.csv| csvlook -I
  | A   | B   | C | D |
  | --- | --- | - | - |
  | 1.0 | 1.0 | 1 | a |
  | 3.0 | 4.0 | 5 | b |
  | 5.0 | 5.0 | 3 | c |
  | 6.0 |     | 4 | d |

  csv_combine.py --mode smaller t1.csv t2.csv| csvlook -I
  | A   | B   | C | D |
  | --- | --- | - | - |
  | 0.0 | 0.0 | 0 | a |
  | 2.0 | 2.0 | 0 | b |
  | 3.0 | 5.0 | 3 | c |
  | 6.0 |     | 4 | d |

  csv_combine.py --mode function --function "lambda s1,s2: s1 if s1.sum() > s2.sum() else s2" t1.csv t2.csv |csvlook -I
  | A   | B   | C | D |
  | --- | --- | - | - |
  | 0.0 | 1.0 | 1 | a |
  | 3.0 | 4.0 | 0 | b |
  | 5.0 | 5.0 | 3 | c |
  | 6.0 |     | 4 | d |

  csvlook -I t1.csv
  | A | B | C | D |
  | - | - | - | - |
  | 1 | 0 | 1 | a |
  | 2 | 2 | 0 | b |
  | 3 |   | 3 |   |
  |   |   | 4 | d |

  csvlook -I t2.csv
  | A | B | C | D |
  | - | - | - | - |
  | 0 | 1 | 0 | a |
  | 3 | 4 | 5 | b |
  | 5 | 5 |   | c |
  | 6 |   |   |   |


csv_correlation.py

usage: csv_correlation.py [-h] [-v] [--mode {auto,cross,partial}]
                          [--nlags INT] [--sampling INT] [--na_value FLOAT]
                          [--output FILE]
                          CSV_FILE COLUMN[,COLUMN...]

evaluate cross/auto correlation between columns

positional arguments:
  CSV_FILE              files to read, if empty, stdin is used
  COLUMN[,COLUMN...]    columns to process

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --mode {auto,cross,partial}
                        correlation mode
  --nlags INT           maximum number of lags,default=100
  --sampling INT        step interval of rows for sampling
  --na_value FLOAT      value to fill for NaN,default=0
  --output FILE         path of output file, default=stdout

remark:
  In results, prefix 'ac_' means a result of auto-correlation, and prefix 'cc_' means a result of cross-correlation.
  Prefixes 'ac_ci_l' and 'ac_ci_u' mean the lower and upper bounds of the 95% confidence interval for no correlation.

  For '--mode=cross', nlags means number of records in output. If you want all records from cross-correlation, set nlags=0.
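
For orientation, auto-correlation values with a 95% confidence interval can be computed with statsmodels; this is a minimal sketch assuming a column named COL_0000 in test_correlation.csv, not a description of the script's internals.

```python
# minimal sketch: auto-correlation of one column with a 95% confidence interval
import pandas as pd
from statsmodels.tsa.stattools import acf

df = pd.read_csv("test_correlation.csv")
values = df["COL_0000"].fillna(0)                       # analogue of '--na_value', default 0
ac, confint = acf(values, nlags=100, alpha=0.05)        # analogue of '--nlags', default 100
result = pd.DataFrame({"ac_COL_0000": ac,
                       "ac_ci_l_COL_0000": confint[:, 0],
                       "ac_ci_u_COL_0000": confint[:, 1]})
print(result.head())
```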

example:
  csv_correlation.py test_correlation.csv COL_0000,COL_0001
  csv_correlation.py test_correlation.csv COL_0000|head|csvlook -I
| index | ac_COL_0000            | ac_ci_l_COL_0000     | ac_ci_u_COL_0000    |
| ----- | ---------------------- | -------------------- | ------------------- |
| 0     | 1.0                    | 1.0                  | 1.0                 |
| 1     | 0.07018334701195321    | -0.06840703542301456 | 0.20877372944692096 |
| 2     | -0.031686282848955645  | -0.17095764718861142 | 0.10758508149070015 |
| 3     | 0.06474599533914761    | -0.07466376744626992 | 0.20415575812456516 |
| 4     | -0.07187457627030945   | -0.21186070987120875 | 0.06811155733058984 |
| 5     | -0.0032344642148376297 | -0.14392762546855406 | 0.13745869703887878 |
| 6     | -0.02065593286982406   | -0.16135052234546393 | 0.12003865660581581 |
| 7     | 0.03586195035334148    | -0.10489087472112046 | 0.1766147754278034  |
| 8     | 0.05144023922871804    | -0.08948797647058224 | 0.19236845492801832 |

  csv_correlation.py --mode=cross test_correlation.csv COL_0000,COL_0001,COL_0002
  csv_correlation.py --mode=cross --sampling=2 test_correlation.csv COL_0000,COL_0001,COL_0002|head|csvlook -I
| index | cc_COL_0000_COL_0001  | cc_COL_0000_COL_0002  | cc_COL_0001_COL_0002  |
| ----- | --------------------- | --------------------- | --------------------- |
| 0     | -0.07832200979116527  | -0.0361744688777645   | -0.0383920692904824   |
| 1     | -0.0584437912103411   | 0.02924305243353182   | 0.0014283173939956392 |
| 2     | -0.004634262357976521 | 0.009863911035045745  | 0.0700311412765593    |
| 3     | 0.08549028836897214   | 0.145849764523322     | -0.07607057576313002  |
| 4     | 0.06411213630824009   | -0.009025766566690439 | -0.043516364265988865 |
| 5     | 0.053392718054984536  | -0.04056558651200204  | 0.09684720026396708   |
| 6     | -0.10900425075345083  | -0.022320478554022246 | 0.0459223360399405    |
| 7     | 0.014787564430673562  | -0.02152087172311092  | -0.00444335370431942  |
| 8     | -0.02779304465519147  | -0.027346286491568755 | 0.12710493528359032   |


csv_crosstable.py

usage: csv_crosstable.py [-h] [-v] [--values COLUMN]
                         [--aggregator {sum,min,max,mean,median,prod,count_nonzero}]
                         [--row_names NAME[,NAME...]]
                         [--column_names NAME[,NAME...]]
                         [--normalize {all,index,column}]
                         [--suppress_all_zero] [--output_file FILE]
                         CSV_FILE ROW_COLUMN[,ROW_COLUMN...]
                         COLUMN[,COLUMN...]

make cross-matching table from csv file

positional arguments:
  CSV_FILE              files to read, if empty, stdin is used
  ROW_COLUMN[,ROW_COLUMN...]
  COLUMN[,COLUMN...]

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --values COLUMN       name of column for value
  --aggregator {sum,min,max,mean,median,prod,count_nonzero}
                        aggregator function for '--values', default='sum'
  --row_names NAME[,NAME...]
                        names of rows in output,default=names of given rows
  --column_names NAME[,NAME...]
                        names of columns in output,default=names of given
                        columns
  --normalize {all,index,column}
                        normalization mode, see 'remark'
  --suppress_all_zero   suppress outputting columns whose elements are all NaN
  --output_file FILE    path of output file,default=stdout

remark:

  For '--normalize', the following are available.
   'all'    : normalize over all values.
   'index'  : normalize over each row.
   'column' : normalize over each column.
  see 'pandas.crosstab https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html'
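
The remark above refers to pandas.crosstab; the following minimal sketch reproduces the first two examples below directly in pandas (assuming test_crosstable.csv as shown).

```python
# minimal pandas.crosstab sketch (assumes test_crosstable.csv as in the examples below)
import pandas as pd

df = pd.read_csv("test_crosstable.csv")

# counts of COL001 x COL002 combinations
print(pd.crosstab(df["COL001"], df["COL002"]))

# sum of COL004 per COL001 x COL002 cell ('--values=COL004', default aggregator 'sum')
print(pd.crosstab(df["COL001"], df["COL002"], values=df["COL004"], aggfunc="sum"))
```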

example:

# input data for example
cat test_crosstable.csv|csvlook -I
| COL001 | COL002 | COL003 | COL004 | COL005 |
| ------ | ------ | ------ | ------ | ------ |
| A      | a      | 1      | 2      | Z      |
| B      | b      | 1      | 3      | X      |
| C      | c      | 1      | 4      | Y      |
| D      | d      | 1      | 5      | Z      |
| A      | b      | 1      | 6      | V      |
| B      | c      | 1      | 7      | W      |
| C      | d      | 1      | 8      | S      |
| D      | a      | 1      | 9      | T      |

csv_crosstable.py test_crosstable.csv COL001 COL002|csvlook -I
| COL001 | a | b | c | d |
| ------ | - | - | - | - |
| A      | 1 | 1 | 0 | 0 |
| B      | 0 | 1 | 1 | 0 |
| C      | 0 | 0 | 1 | 1 |
| D      | 1 | 0 | 0 | 1 |

csv_crosstable.py test_crosstable.csv --values=COL004 COL001 COL002|csvlook -I
| COL001 | a   | b   | c   | d   |
| ------ | --- | --- | --- | --- |
| A      | 2.0 | 6.0 |     |     |
| B      |     | 3.0 | 7.0 |     |
| C      |     |     | 4.0 | 8.0 |
| D      | 9.0 |     |     | 5.0 |

csv_crosstable.py test_crosstable.csv --values=COL004 --aggregator=prod COL001 COL002,COL005
COL002,a,a,a,a,a,a,a,b,b,b,b,b,b,b,c,c,c,c,c,c,c,d,d,d,d,d,d,d
COL005,S,T,V,W,X,Y,Z,S,T,V,W,X,Y,Z,S,T,V,W,X,Y,Z,S,T,V,W,X,Y,Z
COL001,,,,,,,,,,,,,,,,,,,,,,,,,,,,
A,,,,,,,2.0,,,6.0,,,,,,,,,,,,,,,,,,
B,,,,,,,,,,,,3.0,,,,,,7.0,,,,,,,,,,
C,,,,,,,,,,,,,,,,,,,,4.0,,8.0,,,,,,
D,,9.0,,,,,,,,,,,,,,,,,,,,,,,,,,5.0

# with '--suppress_all_zero'
csv_crosstable.py test_crosstable.csv --values=COL004 --aggregator=prod --suppress_all_zero COL001 COL002,COL005
COL002,a,a,b,b,c,c,d,d
COL005,T,Z,V,X,W,Y,S,Z
COL001,,,,,,,,
A,,2.0,6.0,,,,,
B,,,,3.0,7.0,,,
C,,,,,,4.0,8.0,
D,9.0,,,,,,,5.0


csv_dummy.py

usage: csv_dummy.py [-h] [-v] [--output FILE] [--quote]
                    [--mode {rand,int,hex,header,arbitrarily}]
                    [--headers FILE]
                    ROWS [COLS]

generate dummy data of csv

positional arguments:
  ROWS                  number of data rows
  COLS                  number of columns

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --output FILE         output file
  --quote               quote the value of each cell
  --mode {rand,int,hex,header,arbitrarily}
                        value mode of each cell: hex={ir:04x}{ic:04x},
                        rand=random, int=consecutive integers,
                        header=definition in headers file
  --headers FILE        list file that has names of columns: csv format as one
                        record, or one name per line

remark:
  when '--mode=header' is given, you can use 'np.random.*'.
  see "Random Generator - NumPy v1.19 Manual https://numpy.org/doc/stable/reference/random/generator.html"

  In header mode,
  'index' means a serial number (0-based), 'int' means ic+ncols*ir,
  'rand' means a uniform random number in [0,1], 'random.*' means using a function in np.random.*.
  'datetime' means a time string of now()+(ir+ic) seconds, or a fixed-frequency time string with 'datetime:<start>[:<freq>]', where '<freq>' is optional.
  '<start>' is formatted with '%Y-%m-%d %H:%M:%S'. Note: ':' must be escaped by a back-slash, see example.
  about '<freq>', see 'pandas documentation https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases'
  'fixed' means a fixed value,
  'list' means a randomly chosen value from a list, or from a dictionary with given probabilities.
  'format' means a formatted string with 'row', 'col', 'time', 'random' variables, where 'time' is the current time (%H%M%S) and 'random' is a random integer in [0,100].
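
As an illustration of the 'list' definition with probabilities, the behaviour is essentially a weighted random choice; this is only a hypothetical sketch of the idea with NumPy, not the script's generator.

```python
# hypothetical sketch of a weighted 'list' choice such as list:{"A0001":3,"B0010":2}
import numpy as np

rng = np.random.default_rng()
definition = {"A0001": 3, "B0010": 2}                 # value -> relative probability
values = list(definition)
weights = np.array(list(definition.values()), dtype=float)

column = rng.choice(values, size=5, p=weights / weights.sum())
print(column)
```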

example:
  csv_dummy.py --mode header --headers bit-pattern-headers.txt 200 > bit-pattern-headers.csv
  csv_dummy.py --headers=headers.txt --mode=header 5 
ABC000,ABC001,ABC002,ABC003,ABC004,ABC005,ABC006,ABC007,ABC008,ABC009,ABC010
0,1,2020-11-01 09:48:38,2020-12-01 08:05:10,ABCDEF,"A0001",0.9769355181667144,"00000007",file_000_008_068_094836.txt,A0001,13
1,16,2020-11-01 09:48:39,2020-12-01 13:05:10,ABCDEF,"A0001",0.9537397926065723,"00010007",file_001_008_043_094836.txt,B0010,13
2,31,2020-11-01 09:48:40,2020-12-01 18:05:10,ABCDEF,"A0001",0.6544350953595085,"00020007",file_002_008_003_094836.txt,B0010,5
3,46,2020-11-01 09:48:41,2020-12-01 23:05:10,ABCDEF,"A0001",0.2262489413244111,"00030007",file_003_008_054_094836.txt,B0010,17
4,61,2020-11-01 09:48:42,2020-12-02 04:05:10,ABCDEF,"A0001",0.23743226355108915,"00040007",file_004_008_022_094836.txt,A0001,17

  header.txt example:
  ABC000 index
  ABC001 int
  ABC002 datetime
  ABC003 datetime:2020-12-01 08\:05\:10:5h
  ABC004 fixed: ABCDEF
  ABC005 q:list:["A0001","A0001","A0001","A0001","B0010"]
  ABC006 rand
  ABC007 q:hex
  ABC008 format:"file_{row:03d}_{col:03d}_{random:03d}_{time}.txt"
  ABC009 list:{"A0001":3,"B0010":2}
  ABC010 list:[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
  ABC011 random.beta(0.2,10.)
  ABC012 random.normal()
  ABC013 random.exponential()
  ABC014


csv_histogram.py

usage: csv_histogram.py [-h] [-v] [--nbins INT]
                        [--nbin_modes {square-root,sturges,rice,doane,s_and_s,freedman_diaconis}]
                        [--range LOWER_X,UPPER_X]
                        [--facets COLUMN[,COLUMN...]] [--output FILE]
                        CSV_FILE COLUMN [WEIGHT_COLUMN]

make histogram from csv file

positional arguments:
  CSV_FILE              files to read, if empty, stdin is used
  COLUMN                name of column to make histogram
  WEIGHT_COLUMN         name of column as weight

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --nbins INT           number of bins, default=20
  --nbin_modes {square-root,sturges,rice,doane,s_and_s,freedman_diaconis}
                        method to evaluate number of bins. if given, '--nbins'
                        is ignored.
  --range LOWER_X,UPPER_X
                        range of x
  --facets COLUMN[,COLUMN...]
                        facets
  --output FILE         path of output file

remark:
  '--range' and 'weight_column' are available only for a numerical 'column'.

  about '--nbin_modes', see Histogram - Wikipedia https://en.wikipedia.org/wiki/Histogram .
  NOTE: 's_and_s' means Shimazaki and Shinomoto's choice.
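
For orientation only, two of the bin-count rules named above are easy to compute by hand; the sketch below shows Sturges and Freedman-Diaconis for an arbitrary numeric sample (it does not reproduce the script's output).

```python
# sketch: Sturges and Freedman-Diaconis bin counts for a numeric sample
import numpy as np
import pandas as pd

x = pd.Series(np.random.default_rng(0).normal(size=1000))   # any numeric column would do

n = len(x)
sturges = int(np.ceil(np.log2(n) + 1))

iqr = x.quantile(0.75) - x.quantile(0.25)
bin_width = 2 * iqr / n ** (1 / 3)                           # Freedman-Diaconis bin width
freedman_diaconis = int(np.ceil((x.max() - x.min()) / bin_width))

print("sturges:", sturges, "freedman_diaconis:", freedman_diaconis)
```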

example:
  csv_histogram.py --nbins=100 --output=- big_sample_arb.csv  COL_0008|less
  csv_histogram.py --nbins=100 --output=- --range=0.5,1.0 big_sample_arb.csv  COL_0008 COL_0033|less
  csv_histogram.py --facets=B,C,D test_hist.csv E
  csv_histogram.py --facets=B,C,D test_hist.csv A


csv_lmfit.py

usage: csv_lmfit.py [-h] [-v] [--model_definition MODEL_NAME|JSON]
                    [--xrange X_MIN,X_MAX] [--output_file OUTPUT]
                    [--remove_offset] [--print_parameters]
                    [--print_sample_model] [--list_internal]
                    [CSV_FILE] [COLUMN] [COLUMN]

fitting function to data in csv file

positional arguments:
  CSV_FILE              files to read, if empty, stdin is used
  COLUMN                name of x column
  COLUMN                name of y column

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --model_definition MODEL_NAME|JSON
                        model definition, 'model_name|json string or @json
                        file', see example
  --xrange X_MIN,X_MAX  range of X
  --output_file OUTPUT  path of output file
  --remove_offset       remove dc offset
  --print_parameters    only print parameters
  --print_sample_model  print sample code for model into stdout
  --list_internal       show list of internal models

remark:
  if the model name starts with "@", the name is treated as a class of lmfit.model. see '--list_internal'.
  if the model name starts with "%", the name is treated as a function that is defined internally. see '--list_internal'.

example:
  csv_lmfit.py --print_sample_model
  csv_lmfit.py --list_internal
  MDL='local_model1|{"a": {"value": 0, "min": -1, "max": 5, "vary": true}, "b": {"value": 1, "min": -3, "max": 3, "vary": true}, "c": {"value": 1, "min": -1, "max": 1, "vary": true}}'
  python3 lmfit_test.py --model_definition="${MDL}" lmfit_sample.csv
  csv_lmfit.py --model_definition="${MDL}" --model_definition="@LorentzianModel" --output=test_result.csv lmfit_sample.csv X Y
  # read parameters from json file  
  csv_lmfit.py --model_definition="%BetaDistributionModel|@lmfit_test_data_beta.json" --output=test_result.csv lmfit_test_data_beta.csv X Y

#-- template for local_model1.py
csv_lmfit.py --print_sample_model lmfit_sample.csv X Y > local_model1.py
# and edit this source.

#-- lmfit_sample.py for lmfit_sample.csv
# -- DEBUG
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt

npts = 150
x_values = np.linspace(-10, 10, npts)
a = 3
b = 2
c = 1
y_values = a * x_values**2 + b * x_values + c

d_df = pd.DataFrame(columns=["X", "Y"])
d_df["X"] = x_values
d_df["Y"] = y_values
d_df.to_csv("lmfit_sample.csv", index=False)
#--


csv_meltpivot.py

usage: csv_meltpivot.py [-h] [-v] [--mode {melt,pivot}] [--category_name NAME]
                        [--value_name NAME] [--values COLUMN[,COLUMN...]]
                        [--single_index_columns] [--output FILE]
                        CSV_FILE KEY_COLUMN[,KEY_COLUMN...]
                        [COLUMN[,COLUMN...]]

melting or pivoting csv file

positional arguments:
  CSV_FILE              files to read, if empty, stdin is used
  KEY_COLUMN[,KEY_COLUMN...]
                        names of key columns(index columns)
  COLUMN[,COLUMN...]    names of value columns

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --mode {melt,pivot}   mode, default=melt
  --category_name NAME  name of the column that stores the names of columns
  --value_name NAME     name of the column that stores the value of each column
  --values COLUMN[,COLUMN...]
                        value columns for pivot mode
  --single_index_columns
                        single index for columns for pivot mode
  --output FILE         path of output file

remark:
  The columns given by '--values' must have a numeric data type.

  For mode=melt, when the 'columns' argument is not given, all columns other than 'key_columns' are used as 'columns'.

  For mode=pivot, np.sum is used as the aggregator function.
  'columns' must be given. NaNs in the result are filled with 0.
  Without '--values', all values are assumed to be 1 (pandas.get_dummies() is used).
  For '--single_index_columns', the name of the first column given as 'columns' is used
  to collapse the multiindex of columns into a single index for columns.
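
Both modes follow standard pandas reshaping; the sketch below shows the corresponding melt and pivot_table calls for the small test_melt_pivot.csv shown at the end of the examples (it is a simplification, not the script itself).

```python
# sketch: pandas melt / pivot_table for test_melt_pivot.csv (COL001, COL002, COL003)
import pandas as pd

df = pd.read_csv("test_melt_pivot.csv")

# melt: key column COL001, remaining columns become (Category, Value) pairs
melted = df.melt(id_vars=["COL001"], var_name="Category", value_name="Value")
print(melted)

# pivot: COL003 aggregated with sum per COL001 x COL002, NaN filled with 0
pivoted = pd.pivot_table(df, index="COL001", columns="COL002",
                         values="COL003", aggfunc="sum", fill_value=0)
print(pivoted)
```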

example:
  csv_meltpivot.py --category_name=Category --value_name=Value bit-pattern-headers.csv ABC002 BIT000,BIT001,BIT003,BIT004 | head
| ABC002              | Category | Value |
| ------------------- | -------- | ----- |
| 2020-12-01 10:10:15 | BIT000   | 0     |
| 2020-12-01 10:10:17 | BIT000   | 1     |
| 2020-12-01 10:10:19 | BIT000   | 0     |
| 2020-12-01 10:10:21 | BIT000   | 0     |
| 2020-12-01 10:10:23 | BIT000   | 0     |
| 2020-12-01 10:10:25 | BIT000   | 1     |
| 2020-12-01 10:10:27 | BIT000   | 0     |
| 2020-12-01 10:10:29 | BIT000   | 0     |

# the results above were stored in 'test.csv'.
  csv_meltpivot.py --mode pivot --value=Value --single_index_columns test.csv ABC002 Category | csvlook -I
| ABC002              | BIT000 | BIT001 | BIT003 | BIT004 |
| ------------------- | ------ | ------ | ------ | ------ |
| 2020-12-01 10:10:15 | 0      | 1      | 0      | 0      |
| 2020-12-01 10:10:17 | 1      | 0      | 1      | 0      |
| 2020-12-01 10:10:19 | 0      | 0      | 0      | 0      |
| 2020-12-01 10:10:21 | 0      | 0      | 1      | 1      |
| 2020-12-01 10:10:23 | 0      | 1      | 0      | 1      |
| 2020-12-01 10:10:25 | 1      | 0      | 0      | 0      |
| 2020-12-01 10:10:27 | 0      | 1      | 0      | 0      |
| 2020-12-01 10:10:29 | 0      | 1      | 1      | 0      |
| 2020-12-01 10:10:31 | 1      | 1      | 1      | 1      |

  csv_meltpivot.py --mode pivot --values=COL003 test_melt_pivot.csv COL001 COL002 | csvlook -I
| COL001 | CAT0 | CAT1 | CAT2 |
| ------ | ---- | ---- | ---- |
| A      | 1    | 0    | 0    |
| B      | 0    | 0    | 0    |
| C      | 0    | 0    | 2    |

 csv_meltpivot.py --mode pivot --values=COL003 --single_index_columns test_melt_pivot.csv COL001 COL002 | csv_meltpivot.py - COL001| csvlook -I
| COL001 | Category | Value |
| ------ | -------- | ----- |
| A      | CAT1     | 0     |
| B      | CAT1     | 0     |
| C      | CAT1     | 0     |
| A      | CAT0     | 1     |
| B      | CAT0     | 0     |
| C      | CAT0     | 0     |
| A      | CAT2     | 0     |
| B      | CAT2     | 0     |
| C      | CAT2     | 2     |

  csv_meltpivot.py --mode pivot test_melt_pivot.csv COL001 COL002|csvlook -I
%Inf:csv_meltpivot: pivoting without values
| COL001 | COL003 | CAT0 | CAT1 | CAT2 |
| ------ | ------ | ---- | ---- | ---- |
| A      | 1      | 1    | 0    | 0    |
| B      | 0      | 0    | 1    | 0    |
| C      | 2      | 0    | 0    | 1    |

  csv_meltpivot.py --mode pivot test_melt_pivot.csv COL001 COL002,COL003|csvlook -I
%Inf:csv_meltpivot: pivoting without values
| COL001 | COL002_CAT0 | COL002_CAT1 | COL002_CAT2 | COL003_0 | COL003_1 | COL003_2 |
| ------ | ----------- | ----------- | ----------- | -------- | -------- | -------- |
| A      | 1           | 0           | 0           | 0        | 1        | 0        |
| B      | 0           | 1           | 0           | 1        | 0        | 0        |
| C      | 0           | 0           | 1           | 0        | 0        | 1        |

# in the above examples, this data was used.
 cat test_melt_pivot.csv|csvlook -I
| COL001 | COL002 | COL003 |
| ------ | ------ | ------ |
| A      | CAT0   | 1      |
| B      | CAT1   | 0      |
| C      | CAT2   | 2      |


csv_multiindex_columns.py

usage: csv_multiindex_columns.py [-h] [-v] [--nrows INT] [--output FILE]
                                 [--to_single] [--only_header]
                                 CSV_FILE

handle a csv with multiindex columns

positional arguments:
  CSV_FILE       files to read, if empty, stdin is used

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit
  --nrows INT    number of rows as header,default=2
  --output FILE  path of output file, default=stdout
  --to_single    convert to a single header for columns
  --only_header  parse only header rows

remark:

example:

cat test_multiindex_columns.csv
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0

csv_multiindex_columns.py test_multiindex_columns.csv
,,C,C,C,D,D,D
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0

cat test_multiindex_columns.csv
,,E,,,F,G,
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0

 csv_multiindex_columns.py --only_header --nrows=3 test_multiindex_columns.csv
,,E,E,E,F,G,G
,,C,C,C,D,D,D
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8

csv_multiindex_columns.py --nrows=3 --to_single test_multiindex_columns.csv |csvlook -I
| A | B | E_C_X | E_C_Y | E_C_Z | F_D_X | G_D_Y | G_D_Z |
| - | - | ----- | ----- | ----- | ----- | ----- | ----- |
| 1 | 2 | 3     | 4     | 5     | 6     | 7     | 8     |
| 3 | 4 | 5     | 6     | 7     | 8     | 9     | 0     |
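
The same reshaping can be sketched with plain pandas by forward-filling the header rows and joining the levels; this is a rough sketch for the three-row-header file shown above, not the script's implementation.

```python
# rough sketch: forward-fill a 3-row header and flatten it to a single level
import pandas as pd

raw = pd.read_csv("test_multiindex_columns.csv", header=None)
nrows = 3                                        # analogue of '--nrows'
header = raw.iloc[:nrows].ffill(axis=1)          # fill blank header cells to the right
body = raw.iloc[nrows:].reset_index(drop=True)
body.columns = ["_".join(levels.dropna().astype(str)) for _, levels in header.items()]
print(body.to_csv(index=False))
```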


csv_print_html.py

usage: csv_print_html.py [-h] [-v] [--title TITLE]
                         [--columns COLUMNS[,COLUMNS...]] [--fp_precision INT]
                         [--trim_null STRING=(all or ROW:ROW,COL:COL)]
                         [--highlight FLOAT[=(all or ROW:ROW,COL:COL)]]
                         [--min_in_column (all or ROW:ROW,COL:COL)]
                         [--max_in_column (all or ROW:ROW,COL:COL)]
                         [--min_in_row (all or ROW:ROW,COL:COL)]
                         [--max_in_row (all or ROW:ROW,COL:COL)]
                         [--gradient (all or ROW:ROW,COL:COL)]
                         [--bar (all or ROW:ROW,COL:COL)]
                         [--max_column_width WIDTH]
                         [--column_width COLUMN:WIDTH[,COLUMN:WIDTH..]]
                         [--part_color STRING:COLOR[,STRING:COLOR...]]
                         [--search_on_html] [--datatable] [--output_file FILE]
                         [--minify]
                         CSV_FILE

print html table made of csv with estimation

positional arguments:
  CSV_FILE              files to read, if empty, stdin is used

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --title TITLE         Title of table
  --columns COLUMNS[,COLUMNS...]
                        names of columns to do
  --fp_precision INT    precision of float, default=2
  --trim_null STRING=(all or ROW:ROW,COL:COL)
                        triming null value
  --highlight FLOAT[=(all or ROW:ROW,COL:COL)]
                        highlighting more than threshold
  --min_in_column (all or ROW:ROW,COL:COL)
                        highlighting minimum cell in each column
  --max_in_column (all or ROW:ROW,COL:COL)
                        highlighting maximum cell in each column
  --min_in_row (all or ROW:ROW,COL:COL)
                        highlighting minimum cell in each row
  --max_in_row (all or ROW:ROW,COL:COL)
                        highlighting maximum cell in each row
  --gradient (all or ROW:ROW,COL:COL)
                        gradient mode
  --bar (all or ROW:ROW,COL:COL)
                        histogram of each column
  --max_column_width WIDTH
                        maximum width of all columns, default='200pm'
  --column_width COLUMN:WIDTH[,COLUMN:WIDTH..]
                        widths of columns
  --part_color STRING:COLOR[,STRING:COLOR...]
                        part color for string, the color code is one of the css color codes.
  --search_on_html      enable searching on the html
  --datatable           datatable mode
  --output_file FILE    path of output file
  --minify              minify html

remark:
  Only elements in columns whose dtype is int64/float64 will be processed.
  So in the following, "8" at Row=4, Col=D is not evaluated for 'max'/'min' and so on.
    | A | B | C | D |
    | - | - | - | - |
    | 1 | 0 | 1 | a |
    | 2 | 2 | 0 | b |
    | 3 |   | 3 |   |
    |   |   | 4 | d |
    | 4 | 6 | 7 | 8 |

  When '--bar' is used, options other than '--trim_null' are not available.

  For '--part_color', if you want to use a comma (,) or colon (:) in a word, it must be escaped with a back-slash (see the examples).

  NOTE:NOW, '--max_column_width' is not available.
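
Many of these options line up with pandas Styler methods; a minimal sketch of the idea is shown below (pandas >= 1.3 for Styler.to_html). It only hints at the behaviour and is not the script's actual rendering.

```python
# minimal pandas Styler sketch: gradient plus per-column max/min highlighting
import pandas as pd

df = pd.read_csv("t1.csv")
numeric_columns = df.select_dtypes("number").columns

styled = (df.style
            .background_gradient(subset=numeric_columns, axis=0)   # '--gradient' analogue
            .highlight_max(subset=numeric_columns, axis=0)         # '--max_in_column' analogue
            .highlight_min(subset=numeric_columns, axis=0))        # '--min_in_column' analogue

with open("test.html", "w") as f:
    f.write(styled.to_html())
```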

example:
  csv_print_html.py --highlight=2 --gradient=all --max_in_col=all --min_in_col=all t1.csv > test.html
  csv_print_html.py --bar=all --trim_null="-"=all t1.csv > test.html
  csv_print_html.py --min_in_col=0:1 t1.csv > test.html
  csv_print_html.py --min_in_col=,A:B t1.csv > test.html
  csv_print_html.py --highlight=2 --title=HightLight=2 t1.csv > test.html
  csv_print_html.py --bar=all --trim_null="-"=all --columns=A,B,C t1.csv > test.html
  csv_print_html.py --min_in_row=all t1.csv > test.html
  csv_print_html.py --part_color="Joseph:red,Beesley\, Mr. Lawrence:blue" titanic.csv > test.html
  csv_print_html.py --part_color='Joseph:red,Beesley\, Mr. Lawrence:blue,female:green,C\d+:black' titanic.csv > test.html
  csv_print_html.py --part_color='Joseph:red,Beesley\, Mr. Lawrence:blue,female:green,C\d+:black' --column_width="Name:128px" titanic.csv > test.html


csv_print_html_oia.py

usage: csv_print_html_oia.py [-h] [-v] [--title TITLE]
                             [--columns COLUMNS[,COLUMNS...]]
                             [--part_color STRING:COLOR[,STRING:COLOR...]]
                             [--search_on_html] [--output_file FILE]
                             [--minify]
                             CSV_FILE COLUMNS [COLUMNS ...]

print html table made of csv with estimation

positional arguments:
  CSV_FILE              file to read, if empty, stdin is used
  COLUMNS               column names of Observation/Investigation/Action

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --title TITLE         Title of table
  --columns COLUMNS[,COLUMNS...]
                        names of additional columns
  --part_color STRING:COLOR[,STRING:COLOR...]
                        part color for string, the color code is one of the css color codes.
  --search_on_html      enable searching on the html
  --output_file FILE    path of output file
  --minify              minify html

remark:

  For '--part_color', if you want to use a comma (,) or colon (:) in a word, it must be escaped with a back-slash (see the examples).

example:
  cat test3.csv
IDX,B,C,O,I,A
1,A,Sample1,Observation1:this is a pen,Investigation1:Atre you there?,Action1: nothing to do
2,B,Sample2,Observation2:this is a pen,Investigation2:Atre you there?,Action2: nothing to do
3,C,Sample3,Observation3:this is a pen,Investigation2:Atre you there?,Action3: nothing to do

  csv_print_html_oia.py --columns=IDX,B,C --part_color='this:red' test3.csv  O I A > test.html
  csv_print_html_oia.py --columns=IDX,B,C --part_color='バリ島:red,米国:green,潜水艦:blue,海軍:black' --search_on_html test3.csv  O I A > test.html


csv_print_html_tl.py

usage: csv_print_html_tl.py [-h] [-v] [--title TITLE]
                            [--datetime_format FORMAT] [--group_column COLUMN]
                            [--headline_column COLUMN]
                            [--columns COLUMNS[,COLUMNS...]]
                            [--part_color STRING:COLOR[,STRING:COLOR...]]
                            [--group_by_part_color] [--media COLUMN[:COLUMN]]
                            [--cdn] [--output_file FILE] [--minify]
                            CSV_FILE COLUMN_OF_DATETIME COLUMN [COLUMN ...]

print html table made of csv with estimation

positional arguments:
  CSV_FILE              file to read, if empty, stdin is used
  COLUMN_OF_DATETIME    name of column of datetime
  COLUMN                column names of Observation/Investigation/Action

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --title TITLE         title and description
  --datetime_format FORMAT
                        format string for column of datetime,default=%Y-%m-%d
                        %H:%M:%S
  --group_column COLUMN
                        names of column to make group
  --headline_column COLUMN
                        names of column for short text
  --columns COLUMNS[,COLUMNS...]
                        names of additional columns
  --part_color STRING:COLOR[,STRING:COLOR...]
                        part color for string, the color code is one of the css color codes.
  --group_by_part_color
                        grouping by result of part_color
  --media COLUMN[:COLUMN]
                        columns for media
  --cdn                 using CDN(cdn.knightlab.com), default=local
  --output_file FILE    path of output file
  --minify              minify html

remark:

  The generated html uses the TimelineJS3 scripts. If you want to use the CDN (cdn.knightlab.com), use '--cdn'.
  Without '--cdn', those scripts must be downloaded from 'cdn.knightlab.com' and
  stored in a 'timeline3' directory in the same directory as the html.
  see "Loading Files Locally" section in Timeline https://timeline.knightlab.com/docs/instantiate-a-timeline.html .

  For '--part_color', if you want to use a comma (,) or colon (:) in a word, it must be escaped with a back-slash (see the examples).
  With '--group_by_part_color', the words given in '--part_color' are used as the group of each event.
  When '--part_color' is used, a gantt chart for plantUML is also created;
  the value of the column given by '--headline_column' is then used as the task name.

  about '--media':
  TimelineJS3 has a function to show media (image, video, etc.) for each event.
  Using the '--media' option enables this function:
  the path of the media file is given by this option. The optional second column is a column that has the title string of each media file.

  about TimelineJS3, see the following:
  GitHub - NUKnightLab/TimelineJS3: TimelineJS v3: A Storytelling Timeline built in JavaScript. http://timeline.knightlab.com https://github.com/NUKnightLab/TimelineJS3

example:
  cat test3.csv
IDX,B,DT,C,O,I,A,image,title_of_image
1,A,2021-06-01 10:00:00,Sample1,Observation1:this is a pen,Investigation1:Atre you there?,Action1: nothing to do,pattern.png,sample
2,B,2021-07-01 10:00:00,Sample2,Observation2:this is a pen,Investigation2:Atre you there?,Action2: nothing to do,pattern.png,sample
3,C,2021-08-01 10:00:00,Sample3,Observation3:this is a pen,Investigation2:Atre you there?,Action3: nothing to do,pattern.png,sample

  csv_print_html_tl.py --columns=IDX,B,C --part_color='pen:red,action2:blue,observation3:black' --output=test.html test3.csv DT O I A
  csv_print_html_tl.py --columns=IDX,B,C --part_color='pen:red,action2:blue,observation3:black' --output=test.html --group_column=B test3.csv DT O I A
  csv_print_html_tl.py --columns=IDX,B,C --part_color='pen:red,action2:blue,observation3:black' --output=test.html --group_column=B --headline_column=B test3.csv DT O I A
  csv_print_html_tl.py --columns=IDX,B,C --part_color='pen:red,action2:blue,observation3:black' --output=test.html --group_column=B --headline_column=B --title="Title
description" test3.csv DT O I A
  csv_print_html_tl.py --columns=IDX,B,C --part_color='pen:red,action2:blue,observation3:black' --output=test.html --group_by_part_color --headline_column=B --title="Title\ndescription" test3.csv DT O I A
  csv_print_html_tl.py --columns=IDX,B,C --part_color='pen:red,action2:blue,observation3:black' --output=test.html --group_by_part_color --headline_column=B --media=image:title_of_image --title="Title
description" test3.csv DT O I A


csv_query.py

usage: csv_query.py [-h] [-v] [--encoding STR] [--query_file FILE]
                    [--columns COLUMNS[,COLUMNS,[COLUMNS,...]]] [--output STR]
                    CSV_FILE [STR]

do query for CSV data

positional arguments:
  CSV_FILE              path of csv file
  STR                   query string

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --encoding STR        encoding, default=utf-8
  --query_file FILE     path of query file
  --columns COLUMNS[,COLUMNS,[COLUMNS,...]]
                        names of columns to output, default=all
  --output STR          path of output, default is stdout

remark:
 for encoding, see "Codec registry and base classes https://docs.python.org/3/library/codecs.html#standard-encodings"
 you can use str.contains(),str.endswith(),str.startswith(),str.match() and 'in' in query string.

 When '--query_file' is used, its lines are joined with 'or' and used as the query string.
 In the query file, lines that start with '#' are ignored as comment lines.
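
The query syntax matches what pandas.DataFrame.query accepts; the sketch below shows the same kind of query written directly against a DataFrame. String accessor methods need engine='python', and reading everything as str is an assumption made here so that comparisons like "001001" work.

```python
# sketch: the same query written with pandas.DataFrame.query
import pandas as pd

df = pd.read_csv("big_sample_arb.csv", dtype=str)    # read as strings (assumption for the example)
result = df.query('COL_0002=="001001" and COL_0006.str.contains("PAT001")', engine="python")
print(result.to_csv(index=False))
```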

example:
 csv_query.py big_sample_arb.csv 'COL_0002=="001001" and COL_0006=="PAT001"'
 csv_query.py big_sample_arb.csv 'COL_0002=="001001" and COL_0006.str.contains("PAT001")'
 csv_query.py --query_file=test_query.txt test1.csv

 cat test_query.txt
 # test
 B==2
 C==6


csv_sample.py

usage: csv_sample.py [-h] [-v] [--range START,END] [--random] [--skip INT]
                     [--output FILE]
                     CSV_FILE SAMPLE_SIZE

derive sample records from big csv file

positional arguments:
  CSV_FILE           files to read, if empty, stdin is used
  SAMPLE_SIZE        size of sample: ex 100 50% 0.5

optional arguments:
  -h, --help         show this help message and exit
  -v, --version      show program's version number and exit
  --range START,END  range for sampling: [0,1.0]
  --random           random sampling
  --skip INT         skip sampling
  --output FILE      path of output file: default=stdout

example:
  csv_sample.py --random --range=0.25,0.75 --output=test.csv big_sample_arb.csv 300
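
For orientation, the example above can be approximated with plain pandas when the whole file fits in memory; this sketch is only an approximation of the behaviour, not the script.

```python
# sketch: random sample of 300 rows taken from the middle 50% of the rows (in-memory only)
import pandas as pd

df = pd.read_csv("big_sample_arb.csv")
start, end = int(0.25 * len(df)), int(0.75 * len(df))     # analogue of '--range=0.25,0.75'
sample = df.iloc[start:end].sample(n=300, random_state=0).sort_index()
sample.to_csv("test.csv", index=False)
```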


csv_stack_trime.py

usage: csv_stack_trime.py [-h] [-v] [--include COLUMNS[,COLUMNS...]]
                          --definition OLD_COL:NEW_COL:APPEND_COL=VALUE[:...]
                          [--columns COL1,COL2,...] [--output FILE]
                          CSV_FILE

split one column that has multiple meanings into several columns

positional arguments:
  CSV_FILE              files to read, if empty, stdin is used

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --include COLUMNS[,COLUMNS...]
                        included columns, with csv format
  --definition OLD_COL:NEW_COL:APPEND_COL=VALUE[:...]
                        definition string to make translation
  --columns COL1,COL2,...
                        list of columns at output
  --output FILE         path of output file

remark:
  column definitions
      old_column_name:new_column_name:append_column=fixed_value:append_column_name=fixed_value....
  the appended column may be used as a category that is derived from the old_column_name as one of its meanings,
  and may be used as a facet or category for csv_plot_*.

  If the argument of '--definition' or '--columns' starts with '@', it is read from the file of that name.

  if a fixed_value starts with '@', the remaining value without '@' is treated as a column name,
  so the appended column holds a copy of that column; this can represent an attribute column that is BOUND to the old_column_name.
    ex. Cat1_Attr,Cat2_Attr,Attribute,Category1_value,Category2_value
        => Category1_value:value:Category=1:Attr=@Cat1_Attr,Category2_value:value:Category=2:Attr=@Cat2_Attr

  if several new_columns are given simultaneously, the result will contain NaN in each other's rows (see example).

example1:
  csv_stack_trime.py --definition=P1C1:C1:P=P1,P1C2:C2:P=P1,P2C1:C1:P=P2,P2C2:C2:P=P2 --include=N --output=- test3.csv

  input:
  P1C1,P1C2,P2C1,P2C2,N
  1,0,1,0,A
  1,0,0,1,B
  1,0,1,0,C
  1,0,1,0,D

  output:
  C1,C2,N,P
  1,,A,P1
  1,,B,P1
  1,,C,P1
  1,,D,P1
  ,0,A,P1
  ,0,B,P1
  ,0,C,P1
  ,0,D,P1
  1,,A,P2
  0,,B,P2
  1,,C,P2
  1,,D,P2
  ,0,A,P2
  ,1,B,P2
  ,0,C,P2
  ,0,D,P2

example2:
  csv_stack_trime.py --definition=@csv_stack_trim_defs.txt --include=COL3,COL4 csv_stack_trime.csv
  col_defs="COL1:NEW_COL1:COLA=ABC:COLB=CDEF,COL2:NEW_COL2:COLA=ZXY:COLC=1234:COLD=@COL7"
  inc_cols="COL3,COL4"
  csv_stack_trime.py --definition=${col_defs} --include=${inc_cols} csv_stack_trime.csv | csvlook -I

  input:
  COL1,COL2,COL3,COL4,COL5,COL6,COL7
  1,2,3,4,5,6,A
  7,8,9,10,11,12,B
  13,14,15,16,17,18,C
  19,20,21,22,23,24,D

  output:
  | COL3 | COL4 | COLA | COLB | COLC | COLD | NEW_COL1 | NEW_COL2 |
  | ---- | ---- | ---- | ---- | ---- | ---- | -------- | -------- |
  | 3.0  | 4.0  | ABC  | CDEF |      |      | 1        |          |
  | 9.0  | 10.0 | ABC  | CDEF |      |      | 7        |          |
  | 15.0 | 16.0 | ABC  | CDEF |      |      | 13       |          |
  | 21.0 | 22.0 | ABC  | CDEF |      |      | 19       |          |
  | 3.0  | 4.0  | ZXY  |      | 1234 | A    |          | 2        |
  | 9.0  | 10.0 | ZXY  |      | 1234 | B    |          | 8        |
  | 15.0 | 16.0 | ZXY  |      | 1234 | C    |          | 14       |
  | 21.0 | 22.0 | ZXY  |      | 1234 | D    |          | 20       |


csv_status.py

usage: csv_status.py [-h] [-v] [--columns COLUMNS]
                     [--mode {count,sum,avg,std,min,max,mode,median,rank,sem,skew,var,mad,kurt,quantile25,quantile50,quantile75,nunique,cumsum,cumprod,vrange,notzero,zero,morethan,lessthan,positive,negative}]
                     [--group COLUMN] [--arguments ARG[,ARG...]]
                     [--output FILE]
                     CSV_FILE

derive statistical information for each column

positional arguments:
  CSV_FILE              files to read, if empty, stdin is used

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --columns COLUMNS     name or index of columns as csv format
  --mode {count,sum,avg,std,min,max,mode,median,rank,sem,skew,var,mad,kurt,quantile25,quantile50,quantile75,nunique,cumsum,cumprod,vrange,notzero,zero,morethan,lessthan,positive,negative}
                        statistics mode to apply to each column
  --group COLUMN        name of columns to make group with '--mode'
  --arguments ARG[,ARG...]
                        arguments for some mode
  --output FILE         path of output csv file, default=stdout

remark:
  For '--mode', results are produced only for columns that have numerical values.
  For '--mode mode', the result is 2D data, and if several items have the same count, the result has more than one row.

  With '--group', each statistic is computed over values grouped by the given group column; a small pandas sketch follows the mode list below.

  description of mode:
    count :Count non-NA cells for each column
    sum   :Return the sum of the values for index
    avg   :Return the mean of index
    std   :Return sample standard deviation over index
    min   :Return the minimum of the values for index
    max   :Return the maximum of the values for index
    mode  :without '--group': Get the mode(s) of each element along index
    median:Return the median of the values for index
    rank  :Compute numerical data ranks (1 through n) along index
    sem   :Return unbiased standard error of the mean over index
    skew  :Return unbiased skew over index
    var   :Return unbiased variance over index
    mad   :without '--group':Return the mean absolute deviation of the values for index
    kurt  :without '--group': Return unbiased kurtosis over index
    quantile25:Return values at the given quantile over(=25%)
    quantile50:Return values at the given quantile over(=50%)=median
    quantile75:Return values at the given quantile over(=75%)
    nunique:Count distinct observations over index
    cumsum:Return cumulative sum over index
    cumprod:Return cumulative product over index
    vrange:Return range of value(max-min)
    notzero:Count not zero elements over index
    zero  :Count zero elements over index
    morethan:Count elements over index that have more than the given value
    lessthan:Count elements over index that have less than the given value
    positive:Count elements over index that are more than 0
    negative:Count elements over index that are less than 0
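
As referenced above, the grouped case maps onto an ordinary pandas groupby/aggregate; the sketch below uses placeholder names ('some.csv', 'GROUP_COLUMN') and shows only the 'sum' mode.

```python
# sketch: '--mode sum' with and without '--group' (placeholder file and column names)
import pandas as pd

df = pd.read_csv("some.csv")
numeric_columns = df.select_dtypes("number").columns

# without '--group': one value per numeric column
print(df[numeric_columns].sum())

# with '--group GROUP_COLUMN': the same statistic per group
print(df.groupby("GROUP_COLUMN")[numeric_columns].sum())
```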

example:
  csv_status.py --mode morethan --arguments=3740 bit-pattern-headers.csv
  csv_uty.py --change_timefreq='D=ABC002:%Y-%m-%d %H\:%M\:%S:floor:10s' bit-pattern-headers.csv|\
     csv_status.py --mode sum --group D -|csv_uty.py --drop_columns=ABC000,ABC001 - |\
     csv_uty.py --stack=D - |csv_plot_bar.py --output=test.html --animation_column=D - category stacked_result

  csv_status.py --columns=2,5 big_sample_headers.csv
  == information for columns: ABC001
  count    1116.000000
  mean     4461.000000
  std      2578.446044
  min         1.000000
  25%      2231.000000
  50%      4461.000000
  75%      6691.000000
  max      8921.000000
  Name: ABC001, dtype: float64

  == information for columns: ABC004
  count      1116
  unique        2
  top       A0001
  freq        897
  Name: ABC004, dtype: object
  unique values:
  ('A0001', 897)
  ('B0010', 219)


csv_trim_header.py

usage: csv_trim_header.py [-h] [-v] [--rows INT] [--add_column_index] CSV_FILE

DEPRECATED: for a csv file, derive column names in one row from column names in multiple rows

positional arguments:
  CSV_FILE            files to read, if empty, stdin is used

optional arguments:
  -h, --help          show this help message and exit
  -v, --version       show program's version number and exit
  --rows INT          number of header rows: default=1
  --add_column_index  add index to column name

remark:
  this script is deprecated. use 'csv_multiindex_columns.py'.

example:

  csv_trim_header.py --rows=2 test_trim_header.csv
A1_B1,A1_B2,A2_B3,A2_B4,A3_B5,A3_B6,A3_B7,A3_B8

  csv_trim_header.py --rows=2 --add_column_index test_trim_header.csv
A1_B1_00000,A1_B2_00001,A2_B3_00002,A2_B4_00003,A3_B5_00004,A3_B6_00005,A3_B7_00006,A3_B8_00007

cat test_trim_header.csv
A1,,A2,,A3,,
B1,B2,B3,B4,B5,B6,B7,B8


csv_trimtime.py

usage: csv_trimtime.py [-h] [-v]
                       [--get_range_of_time COLUMN[:datetime_format]:unit]
                       [--sort_datetime COLUMN_NAME:datetime_format]
                       [--timestamp COLUMN_1:COLUMN_0]
                       [--add_time_column COLUMN:start:freq]
                       [--reformat COLUMN:in_format[:out_format]]
                       [--gap COLUMN=definition[,COLUMN=definition...]]
                       [--time_gap COLUMN=definition[,COLUMN=definition...]]
                       [--calculate_time_diff COLUMN=definition]
                       [--calculate_elapsed_time COLUMN:definition]
                       [--change_timefreq COLUMN=definition[,COLUMN=definition...]]
                       [--resample COLUMN[:time_format]:freq:COLUMN_TO_RESAMPLE[,COLUMN...]]
                       [--resample_function {nearest,count,sum,min,max,mean,std,linear,quadratic,cubic,spline,barycentric,polynomial,krogh,piecewise_polynomial,pchip,akima,cubicspline}]
                       [--select_hours COLUMN[:time_format]:start_time,end_time]
                       [--select_datetime COLUMN[:time_format]:start_time,end_time]
                       [--output FILE]
                       CSV_FILE

trimming columns that have time data

positional arguments:
  CSV_FILE              files to read, if empty, stdin is used

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --get_range_of_time COLUMN[:datetime_format]:unit
                        range of time of column, available unit:H,M,S if you
                        use comma or colon in expression, those must be
                        escaped with back-slash
  --sort_datetime COLUMN_NAME:datetime_format
                        sort datetime, format of
                        definition=column_name[:datetime_format].default
                        datetime_format='%Y-%m-%d %H:%M:%S'. if you use comma
                        or colon in expression, those must be escaped with
                        back-slash
  --timestamp COLUMN_1:COLUMN_0
                        convert date time column(COLUMN_0) into
                        timestamp(COLUMN_1)
  --add_time_column COLUMN:start:freq
                        add new column of time series
  --reformat COLUMN:in_format[:out_format]
                        change format of datetime of column if you use comma
                        or colon in expression, those must be escaped with
                        back-slash
  --gap COLUMN=definition[,COLUMN=definition...]
                        group by value gap, format of definition is
                        'group_column_name=col_name:gap'. if you use comma or
                        colon in expression, those must be escaped with back-
                        slash
  --time_gap COLUMN=definition[,COLUMN=definition...]
                        group by time gap[seconds], format of definition is
                        'group_column_name=col_name:datetime_format:gap'.unit
                        of 'gap' is second. if you use comma or colon in
                        expression, those must be escaped with back-slash
  --calculate_time_diff COLUMN=definition
                        calculate difference[seconds] of datetime
                        column, format of definition is
                        'column_name=col_name[:datetime_format[:step]]'.format
                        is datetime format, default='%Y-%m-%d %H:%M:%S'.
                        'step' is integer value, default=1. if you use comma
                        or colon in expression, those must be escaped with
                        back-slash
  --calculate_elapsed_time COLUMN:definition
                        calculate elapsed time[seconds] of datetime
                        column, format of definition is 'column_name=col_name[:
                        datetime_format[:origin]]'.format is datetime format,
                        default='%Y-%m-%d %H:%M:%S'. 'origin' is datetime
                        which format is '%Y-%m-%d %H:%M:%S'.if 'origin' was
                        omitted, value at first row will be used as origin. if
                        you use comma or colon in expression, those must be
                        escaped with back-slash
  --change_timefreq COLUMN=definition[,COLUMN=definition...]
                        change datetime frequency unit: format of definition is 
                        'new_column_name=old_col_name:datetime_format:method:f
                        requency'. if you use comma or colon in expression,
                        those must be escaped with back-slash
  --resample COLUMN[:time_format]:freq:COLUMN_TO_RESAMPLE[,COLUMN...]
                        aggregating values of column resampled by time
                        frequency, using function is defined by '--
                        resample_function'
  --resample_function {nearest,count,sum,min,max,mean,std,linear,quadratic,cubic,spline,barycentric,polynomial,krogh,piecewise_polynomial,pchip,akima,cubicspline}
                        aggregation function for '--resample', default=mean
  --select_hours COLUMN[:time_format]:start_time,end_time
                        select hours range, ex: 14:00-18:00, for every day.
                        'start_time' and 'end_time' have the format %H:%M or
                        %H:%M:%S
  --select_datetime COLUMN[:time_format]:start_time,end_time
                        select datetime range, 'start_time' and 'end_time'
                        have the same format as target column
  --output FILE         path of output csv file, default=stdout

remark:
  # Time series analysis with pandas https://ourcodingclub.github.io/tutorials/pandas-time-series/

  When '--get_range_of_time' is given, only the range of time is printed to stdout, without other processing.
  Available 'unit' to print is one of 'H'our, 'M'inute, 'S'econd. See example.

  For '--change_timefreq', available methods are floor, ceil, round. About the format string, you may find the answer in the following:
  'datetime - Basic date and time types https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior'.
  About 'frequency', you may check the document in the following:
  'Time series / date functionality https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases'.

  To make groups according to gaps in serial values or datetimes, '--gap' or '--time_gap' is available.
  Such a group is useful for plotting with 'csv_plot_*' or printing status with 'csv_status'.
  For '--time_gap', about the time format string, see '--change_timefreq' above.
  With '--gap', numeric data other than datetimes is handled.
  For '--gap' and '--time_gap', the given gap should be positive.

  If you want to use commas and colon in expression of '--change_timefreq' and others, those must be escaped by back-slash. see examples.

  processing order:
    sort datetime, convert into timestamp, add time column, reformat, gap, time gap, time diff, change timefreq, resample,
    select datetime range, select hours range
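
The '--change_timefreq' methods (floor, ceil, round) correspond to the pandas datetime accessor; the sketch below shows only the floor case, reusing the column names from the examples that follow, and is not the script itself.

```python
# sketch of '--change_timefreq D=ABC002:%Y-%m-%d %H:%M:%S:floor:10s' with plain pandas
import pandas as pd

df = pd.read_csv("bit-pattern-headers.csv")
parsed = pd.to_datetime(df["ABC002"], format="%Y-%m-%d %H:%M:%S")
df["D"] = parsed.dt.floor("10s").dt.strftime("%Y-%m-%d %H:%M:%S")
print(df[["ABC002", "D"]].head().to_csv(index=False))
```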

example:

  csv_trimtime.py --get_range_of_time='A:M' test_trimtime.csv
%inf:csv_uty:get time of range:A:13106.0 mins
  csv_trimtime.py --get_range_of_time='A:M' test_trimtime.csv | sed -E 's/^.*:([0-9.]+) mins/\1/'
13106.0

  csv_trimtime.py --sort_datetime=date a10.csv

  csv_trimtime.py --timestamp="D:A" test_trimtime.csv
  csv_trimtime.py --add_time_column="D:2020-12-01 12\:12\:12:5s" test_trimtime.csv
  csv_trimtime.py --reformat="A:%Y-%m-%d %H\:%M\:%S:%Y/%m/%dT%H\:%M\:%S" test_trimtime.csv
  csv_trimtime.py --gap="GC=C:1" test_trimtime.csv

  csv_trimtime.py --time_gap="GA=A::61" test_trimtime.csv
%Inf:csv_trimtime:groupby time gap
%inf:csv_trimtime:groupby_time_gap:['GA=A::61']
%inf:csv_trimtime:groupby_time_gap:column=GA:number of groups=4:max count in each group=3
A,B,C,GA
2020-11-14 10:00:00,1,19,0
2020-11-13 10:00:00,1,19,1
2020-11-13 10:01:00,2,18,1
2020-11-13 10:02:00,3,7,1

  csv_trimtime.py --change_timefreq='D=ABC002:%Y-%m-%d %H\:%M\:%S\:floor\:30s' big_sample_headers.csv |\
                                                              csv_plot_histogram.py --animation_column=D --output=test.html - ABC005

  # in following example, column 'D' will be created as column of timestamp, and by those dataframe will be made into group and stacked.
  # at plot, the timestamp column 'D' will be used as animation key frames.
  csv_trimtime.py --change_timefreq='D=ABC002:%Y-%m-%d %H\:%M\:%S:floor:10s' bit-pattern-headers.csv|\
     csv_status.py --mode sum --group D -|csv_uty.py --drop_columns=ABC000,ABC001 - |\
     csv_uty.py --stack=D - |csv_plot_bar.py --output=bit-pattern-headers_10sec_sum.html --animation_column=D --yrange=0,1 - category stacked_result

  csv_trimtime.py --resample="A:%Y-%m-%d %H\:%M\:%S:2min:B,C" --resample_func=mean test_trimtime.csv

  csv_trimtime.py --select_hours="A:10\:00\:00,11\:00\:00" test_trimtime.csv
  csv_trimtime.py --select_hours="A:10\:00\:00,1\:00\:00pm" test_trimtime.csv
  csv_trimtime.py --select_datetime="date:%Y-%m-%d:2007-01-01,2007-12-01" a10.csv

  csv_trimtime.py --calculate_time_diff="TD=A:%Y-%m-%d %H\:%M\:%S:1" test_trimtime.csv
A,B,C,TD
2020-11-14 10:00:00,1,19,
2020-11-13 10:00:00,1,19,-86400.0
2020-11-13 10:01:00,2,18,60.0

  csv_trimtime.py --calculate_elapsed_time="G=date:%Y-%m-%d:2007-08-01 00\:00\:00" a10.csv
date,value,G
2007-06-01,20.681002,-5270400.0
2007-07-01,21.834889999999998,-2678400.0
2007-08-01,23.93020353,0.0
2007-09-01,22.93035694,2678400.0
2007-10-01,23.26333992,5270400.0


csv_uty.py

usage: csv_uty.py [-h] [-v] [--serial_column COLUMN[:STEP]]
                  [--drop_columns_regex REGEX]
                  [--drop_columns COLUMN[,COLUMN[,COLUMN...]]
                  [--drop_rows INT[,INT]|INT-INT]
                  [--drop_na_columns COLUMN[,COLUMN[,COLUMN...]]
                  [--drop_duplicated COLUMN[,COLUMN[,COLUMN...]]
                  [--prologe CODE;[CODE;CODE;...]]
                  [--change_timefreq COLUMN=definition[,COLUMN=definition...]]
                  [--add_columns COLUMN=expr[,COLUMN=expr...]]
                  [--trim_columns COLUMN=CODE[,COLUMN=CODE[,COLUMN=CODE...]]
                  [--type_columns COLUMN=type[,COLUMN=type..]]
                  [--fillna COLUMN=value[,COLUMN=value...]]
                  [--replace COLUMN=JSON[,COLUMN=JSON...]]
                  [--split_into_rows COLUMN[:SEPARATOR[,COLUMN:SEPARATOR]]]
                  [--split_into_columns COLUMN[:SEPARATOR[,COLUMN:SEPARATOR]]]
                  [--decompose_bit_string COLUMN[:NBITS[,COLUMN...]]]
                  [--rename_columns OLD_NAME:NEW_NAME[,OLD_NAME:NEW_NAME...]]
                  [--sort [sort_order|]COLUMN[,COLUMN...]]
                  [--sort_datetime [sort_order|]COLUMN:FORMAT]
                  [--stack COLUMN] [--transpose]
                  [--output_format {csv,hdf,parquet,pickel,json,feather,stata}]
                  [--columns_regex COLUMN[,COLUMN[,COLUMN...]]
                  [--columns COLUMN[,COLUMN[,COLUMN...]] [--output FILE]
                  CSV_FILE

utility for CSV file

positional arguments:
  CSV_FILE              file to read. if "-", stdin is used

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --serial_column COLUMN[:STEP]
                        add a new column that has continuous numbers, 0-base.
                        If STEP was given, stepped numbering is used.
  --drop_columns_regex REGEX
                        pattern of column names to drop
  --drop_columns COLUMN[,COLUMN[,COLUMN...]
                        names of columns to drop
  --drop_rows INT[,INT]|INT-INT
                        index of rows to drop, 0-base
  --drop_na_columns COLUMN[,COLUMN[,COLUMN...]
                        names of columns to check NA and to drop. if 'all',
                        rows are dropped with how='any'
  --drop_duplicated COLUMN[,COLUMN[,COLUMN...]
                        names of columns to check duplicated rows and to drop
                        others than first. if 'all', all columns are used to
                        check
  --prologe CODE;[CODE;CODE;...]
                        pieces of python code to pre-load, for use in
                        expression of '--add_columns'.
  --change_timefreq COLUMN=definition[,COLUMN=definition...]
                        [DEPRECATED] change datetime frequency unit: format of
                        definition is 'new_column_name=old_col_name:format:
                        method:frequency'. if you use a comma or colon in the
                        expression, those must be escaped with back-slash
  --add_columns COLUMN=expr[,COLUMN=expr...]
                        names and expressions of columns to add or replace,
                        with csv format. if you use comma in expression, the
                        comma must be escaped with back-slash
  --trim_columns COLUMN=CODE[,COLUMN=CODE[,COLUMN=CODE...]
                        piece of python code for each column to replace and
                        output
  --type_columns COLUMN=type[,COLUMN=type..]
                        data type for each column:type=str, int, float, bin,
                        oct, hex
  --fillna COLUMN=value[,COLUMN=value...]
                        fill NA for each column. if the value starts with '@',
                        an internal function will be used, see remark.
  --replace COLUMN=JSON[,COLUMN=JSON...]
                        replace value for each column
  --split_into_rows COLUMN[:SEPARATOR[,COLUMN:SEPARATOR]]
                        split each element value with csv format and store
                        those into rows, default of separator=','
  --split_into_columns COLUMN[:SEPARATOR[,COLUMN:SEPARATOR]]
                        split each element value with flag format and store
                        those into columns, default of separator='|'
  --decompose_bit_string COLUMN[:NBITS[,COLUMN...]]
                        decompose string as bit pattern. ex 01010101
  --rename_columns OLD_NAME:NEW_NAME[,OLD_NAME:NEW_NAME...]
                        rename columns
  --sort [sort_order|]COLUMN[,COLUMN...]
                        sorting for columns, sort_order=ascending or descending
  --sort_datetime [sort_order|]COLUMN:FORMAT
                        sorting for columns as datetime, sort_order=ascending
                        or descending
  --stack COLUMN        name of column to make group with stacking
  --transpose           transpose dataframe
  --output_format {csv,hdf,parquet,pickel,json,feather,stata}
                        output format
  --columns_regex COLUMN[,COLUMN[,COLUMN...]
                        pattern of column names to output
  --columns COLUMN[,COLUMN[,COLUMN...]
                        names of columns to output
  --output FILE         path of output csv file, default=stdout

remark:
  Process of '--serial_column' is done first, so the result of '--serial_column' may be used as a column for '--add_columns' and others.
  Note: the result of '--serial_column' will be affected by subsequent processing such as '--drop_na_columns' and others.

  All columns are read as "str", so you may convert the type of those if needed, using '--add_columns', '--trim_columns', '--type_columns'.

  For '--add_columns', an available column name must match '[\w:;@_;:@_]+'.
  At '--add_columns', column names may be referenced as '${NAME}' on the right-hand side of each definition, see examples.

  At '--trim_columns', the right-hand side of each definition is a lambda function,
  which will be applied by Series.map, see examples.

  If you want to use commas in the expression of '--add_columns' and '--trim_columns', the comma must be escaped by back-slash. see examples.
  For '--add_columns', values of a column that start with '0b', '0o' or '0x' are converted into integers internally,
  but at output the values of those columns are formatted back into the original format.

  [DEPRECATED] '--change_timefreq' was deprecated. use 'csv_trimtime'.
  For '--change_timefreq', the available methods are floor, ceil and round. About the format string, you may find the answer in
  'datetime  Basic date and time types https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior'.
  About 'frequency', you may check the document in
  'Time series / date functionality https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases'.
  If you want to use commas and colons in the expression of '--change_timefreq', those must be escaped by back-slash. see examples.

  After other processing, '--trim_columns' is applied immediately before output.
  When '--drop_duplicated' was given, the first row that has the same values of the columns is retained.

  For '--decompose_bit_string', new columns are added, whose names have '_Bnnn' as a suffix.
  NBITS means the minimum number of columns for the decomposed results.
  If a value has the prefix '0b', '0o' or '0x', or is a string of decimal digits, the value is decoded into a binary string and decomposed.
  The value of an element is treated as a bit pattern string, so the results are in reversed order:
   ex: COL_A:10011 => [COL_A_B000:"1",COL_A_B001:"1",COL_A_B002:"0",COL_A_B003:"0",COL_A_B004:"1"]
  Note: if there are "100" and "200" in the same column, "100" is treated as the binary string "0b100",
        but "200" is treated as a decimal value and the result is ["1","1","0","0","1","0","0","0"].

  For '--stack', the columns in the header of the result are [COLUMN,'category','stacked_result'],
  where COLUMN is the name of the column given by '--stack'.

  For '--fillna', if '@interpolate' was used as the value, NaN values are filled by linear interpolation, both inside and outside.
  If '@forward' was used as the value, the last valid observation is propagated forward to the next valid one.
  If '@backward' was used as the value, the next valid observation is used to fill the gap.
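
  As a minimal pandas sketch, these fill modes presumably correspond to the following operations
  (the sample Series here is only an illustration, not taken from the tool itself):

    import pandas as pd
    s = pd.Series([None, 1.0, None, 3.0, None])
    s.interpolate(method="linear", limit_direction="both")  # '@interpolate'
    s.ffill()                                               # '@forward'
    s.bfill()                                               # '@backward'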

  For '--replace', definition has json format as dict: '{"old_value":"new_value",...}' 
  NOTE: In definition, old value must be string. ex: '{"1":"A"}'
  If you want to use commas in definition of '--replace', those must be escaped by back-slash. see examples.

  If you want to use commas and colon in expression of '--split_into_rows' and '--split_into_columns', 
  those must be escaped by back-slash. see examples.

  For '--split_into_columns', the separator is a single character or a regexp.
  If named groups are used in the regexp, their names become the names of the columns in the result. see example.

  output format:
    csv      : comma-separated values (csv) file
    hdf      : HDF5 file using HDFStore
    parquet  : the binary parquet format, which requires the pyarrow module
    pickel   : Pickle (serialize) object to file
    json     : JSON string
    feather  : binary Feather format, that requires pyarrow module
    stata    : Stata dta format

  processing order:
   add serial column, dropping columns(regex), dropping rows, dropping na, dropping duplicated, adding columns that were changed with time frequency,
   adding columns, trimming columns, type columns, filling value for na, replace values, split into rows, split into columns,
   decompose bits string, sort/sort by datetime, rename columns, stacking

SECURITY WARNING:
  this uses 'exec' for '--add_columns', '--prologe' and '--trim_columns' without any sanitization.

example:
  csv_uty.py --serial_column=serial:100 test1.csv
  csv_uty.py --drop_columns=A,B --drop_rows=1 test1.csv
  csv_uty.py --drop_na_columns=all test1.csv | csv_uty.py --serial_column=serial:100 -
  csv_uty.py --drop_rows=0-1 test1.csv
  csv_uty.py --drop_na_columns=P1C1,P1C2,P2C1 test3.csv
  csv_uty.py --drop_duplicated=P1C1 test3.csv

  csv_uty.py --add_columns='NCOL1="PAT001",NCOL2=12345,NCOL3=${A}+${B}' test1.csv
  csv_uty.py --add_columns='E=(${D}.fillna("0").apply(lambda x: int(x\,2)) & 0b10)' test1.csv
  acol='NCOL1="PAT001",NCOL2=12345,NCOL3=(${D}.fillna("0").apply(lambda x: int(x\,0)) & 0b10)!=0'
  csv_uty.py --add_columns="${acol}" test1.csv
  acol='NCOL1="PAT001",NCOL2=12345,NCOL3=np.sin(${A}.fillna("0").astype(float))'
  csv_uty.py --prologe='import numpy as np;' --add_columns="${acol}" --columns=NCOL1,NCOL2,NCOL3,A,B,C,D test1.csv
  acol='NCOL1=${A}.map(lambda x: format(int(x)\, "#010x")\, na_action="ignore")'
  csv_uty.py --prologe='import numpy as np;' --add_columns="${acol}" --columns=NCOL1,A,B,C,D test1.csv
  # Series  pandas 1.1.3 documentation https://pandas.pydata.org/pandas-docs/stable/reference/series.html#accessors
  csv_uty.py --add_columns='D=${ABC002}.str.replace(r":\d+$"\,":00"\,regex=True)' big_sample_headers.csv |\
                                                              csv_plot_histogram.py --output=test.html --animation_column=D - ABC005
  csv_uty.py --add_columns='D=pd.to_datetime(${ABC002}\,format="%Y-%m-%d %H:%M:%S")' big_sample_headers.csv
  csv_uty.py --add_columns='D=pd.to_datetime(${ABC002}\,format="%Y-%m-%d %H:%M:%S"),E=${D}.dt.floor("30s")' big_sample_headers.csv |\
                                                              csv_plot_histogram.py --animation_column=E --output=test.html - ABC005

  # the same as above
  #[DEPRECATE] '--change_timefreq' was deprecated. use 'csv_trimtime'.
  csv_uty.py --change_timefreq='D=ABC002:%Y-%m-%d %H\:%M\:%S:floor:30s' big_sample_headers.csv |\
                                                              csv_plot_histogram.py --animation_column=D --output=test.html - ABC005

  csv_uty.py --trim_columns=D="lambda x: int(x\,0)" test1.csv # convert binary string into decimal value.
  csv_uty.py --type_columns=A=float,B=bin test2.csv

  csv_uty.py --decompose_bit_string=D:16 test1.csv |csvlook -I
  csv_uty.py --decompose_bit_string=A,B,C,D --rename_columns=A_B000:BIT_A,A_B001:BIT_B test1.csv

  csv_uty.py --rename_columns=A:abc,B:def test1.csv

  csv_uty.py --stack=ABC002 bit-pattern-headers.csv

  csv_uty.py --fillna=A=1,B=2,C="A B" test1.csv
  csv_uty.py --fillna=B=@interpolate test1.csv
  csv_uty.py --fillna=A=@forward t1.csv
  csv_uty.py --fillna=A=@backward t1.csv

  csv_uty.py --replace='A={"1":"A"\,"2":"B"},D={"a":1\,"b":0}' t1.csv

  csv_uty.py --split_into_rows="COL003" test_explode.csv
  csv_uty.py --split_into_rows="COL002:\:,COL003" test_explode.csv
  csv_uty.py --split_into_rows="COL002:\:" test_explode.csv |csvlook -I
  | COL001 | COL002 | COL003 | COL004   |
  | ------ | ------ | ------ | -------- |
  | A      | 1      | 1,2,3  | F1|F2|F3 |
  | A      | 2      | 1,2,3  | F1|F2|F3 |
  | A      | 3      | 1,2,3  | F1|F2|F3 |
  | B      | 2      | 4,5,6  | F2       |
  | C      | 3      | 7,8,9  | F1|F3    |

  csv_uty.py --split_into_columns="COL002:\:,COL004" test_explode.csv|csvlook -I
  | COL001 | COL002 | COL003 | COL004   | 1 | 2 | 3 | F1 | F2 | F3 |
  | ------ | ------ | ------ | -------- | - | - | - | -- | -- | -- |
  | A      | 1:2:3  | 1,2,3  | F1|F2|F3 | 1 | 1 | 1 | 1  | 1  | 1  |
  | B      | 2      | 4,5,6  | F2       | 0 | 1 | 0 | 0  | 1  | 0  |
  | C      | 3      | 7,8,9  | F1|F3    | 0 | 0 | 1 | 1  | 0  | 1  |

  csv_uty.py --split_into_columns="COL002:(?P\w+)(?P\d+),COL004" test_explode.csv|csvlook -I
  | COL001 | COL002 | COL003 | COL004   | alpha | D | F1 | F2 | F3 |
  | ------ | ------ | ------ | -------- | ----- | - | -- | -- | -- |
  | A      | 1:2:3  | 1,2,3  | F1|F2|F3 |       |   | 1  | 1  | 1  |
  | B      | AB2    | 4,5,6  | F2       | AB    | 2 | 0  | 1  | 0  |
  | C      | D3     | 7,8,9  | F1|F3    | D     | 3 | 1  | 0  | 1  |

  # in the following example, column 'D' will be created as a timestamp column, and by it the dataframe will be grouped and stacked.
  # at plotting, the timestamp column 'D' will be used as the animation key frames.
  # [DEPRECATE] '--change_timefreq' was deprecated. use 'csv_trimtime'.
  csv_uty.py --change_timefreq='D=ABC002:%Y-%m-%d %H\:%M\:%S:floor:10s' bit-pattern-headers.csv|\
     csv_status.py --mode sum --group D -|csv_uty.py --drop_columns=ABC000,ABC001 - |\
     csv_uty.py --stack=D - |csv_plot_bar.py --output=bit-pattern-headers_10sec_sum.html --animation_column=D --yrange=0,1 - category stacked_result

  csv_uty.py --sort=ABC004 test_sort.csv
  csv_uty.py --sort="desc|ABC004,ABC005" test_sort.csv
  csv_uty.py --sort_datetime="ABC002" test_sort.csv
  csv_uty.py --sort_datetime="desc|ABC002" test_sort.csv
  csv_uty.py --sort_datetime="ABC002:%Y-%m-%d %H\:%M\:%S" test_sort.csv

  csv_uty.py --output_format=hdf --output=test.dat bit-pattern-headers.csv

  input: test1.csv
  A,B,C,D
  1,2,3,0b01010
  4,5,6,0b01

  input: test3.csv
  P1C1,P1C2,P2C1,P2C2,N
  1,0,1,0,A
  1,0,0,1,B
  1,0,1,0,C
  1,0,1,0,D
  ,1,1,1,E
  ,,1,1,F
  1,1,,1,G


csv_window_rolling.py

usage: csv_window_rolling.py [-h] [-v] [--index COLUMN]
                             [--window_type {boxcar,triang,blackman,hamming,bartlett,parzen,bohman,blackmanharris,nuttall,barthann}]
                             [--window_function {sum,min,max,mean,median,std,count,var,skew,kurt,corr,cov}]
                             [--output_file FILE]
                             CSV_FILE COLUMN[,COLUMN..] INT

apply rolling-window processing to column data, like a moving average

positional arguments:
  CSV_FILE              files to read, if empty, stdin is used
  COLUMN[,COLUMN..]     columns to process
  INT                   size of window

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --index COLUMN        column of independent values, default=id of rows
  --window_type {boxcar,triang,blackman,hamming,bartlett,parzen,bohman,blackmanharris,nuttall,barthann}
                        type of window function
  --window_function {sum,min,max,mean,median,std,count,var,skew,kurt,corr,cov}
                        function to do
  --output_file FILE    path of output file, default=stdout

remark:
   If window_type is not None, only one of ["sum", "mean", "var", "std"] is available as window_function.
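
   A rough pandas sketch of what the tool presumably computes (the file name, column name and window size
   follow the examples below; this is an illustration, not the script's actual code):

     import pandas as pd
     df = pd.read_csv("big_sample_rand.csv")
     # plain moving average over a 20-row window
     df["COL_0000_mean"] = df["COL_0000"].rolling(window=20).mean()
     # weighted window, analogous to '--window_type=boxcar --window_function=std' (needs scipy)
     df["COL_0000_std"] = df["COL_0000"].rolling(window=20, win_type="boxcar").std()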

example:
  csv_window_rolling.py big_sample_rand.csv COL_0000 20
  csv_window_rolling.py --window_type=boxcar --window_function=std big_sample_rand.csv COL_0000 20


csv_write_excel.py

usage: csv_write_excel.py [-h] [-v] [--columns COLUMN[,COLUMN...]]
                          [--image_column COLUMN]
                          [--image_output_column COLUMN]
                          [--image_size WIDTH,HEIGHT] [--hyperlink]
                          [--output EXCEL_FILE]
                          CSV_FILE

write csv to ms-excel file with images

positional arguments:
  CSV_FILE              csv file to read

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --columns COLUMN[,COLUMN...]
                        names of columns to write: default=all
  --image_column COLUMN
                        name of column for path of a image file
  --image_output_column COLUMN
                        name of column to insert the image, default='Image'
  --image_size WIDTH,HEIGHT
                        maximum size of image at cell of excel,
                        default='120,120'
  --hyperlink           enable hyperlink for file path or url
  --output EXCEL_FILE   path of file to output

remark:
  The values, except for numerics, will be treated as strings on Excel.

  Datetime values in the csv file are written as strings into cells of the Excel file.
  So to treat them as datetime on Excel, the values must be converted with '=DATEVALUE()' and an appropriate cell format.

  With '--hyperlink', a cell gets a hyperlink attribute if its value is a URL with the http/https/file protocol or
  a value that is usable as a path of a file.

  You may execute the following VBA on Excel to set a property of the image objects.
How to automatically change all pictures to move and size with cells in Excel? https://www.extendoffice.com/documents/excel/4923-excel-picture-move-and-size-with-cells-default.html

Sub MoveAndSizeWithCells()
    Dim xPic As Picture
    On Error Resume Next
    Application.ScreenUpdating = False
    For Each xPic In ActiveSheet.Pictures
        xPic.Placement = xlMoveAndSize
    Next
    Application.ScreenUpdating = True
End Sub

example:
  csv_write_excel.py --image_column=file_name --output=images.xlsx images.csv
  csv_write_excel.py --image_column=file_name --output=images.xlsx --hyperlink images.csv

csv_cut.sh

print range of rows and range of columns in csv file
Usage: csv_cut.sh [-i] csv_file range_of_rows range_of_columns
arguments:
  csv_file        : csv file
  range_of_rows   : range of rows to print with csv format, 1-base
                    also the keyword 'all' is available.
                    ex. 10,20 10,+10 all
  range_of_columns: range of columns to print with csv format, 1-base
                    also name of column or 'all' is available.
                    ex. 10,20 10,+10 COL_0000,COL_0002

options:
  -i :add index column

example:
  csv_cut.sh -i big_sample_int.csv 100,+3 1000,+3|csvlook -I
  -- range of rows   : 100-103
  -- range of columns: 1000-1003
  | index | COL_0999 | COL_1000 | COL_1001 | COL_1002 |
  | ----- | -------- | -------- | -------- | -------- |
  | 100   | 111483   | 111484   | 111485   | 111486   |
  | 101   | 112599   | 112600   | 112601   | 112602   |
  | 102   | 113715   | 113716   | 113717   | 113718   |
  | 103   | 114831   | 114832   | 114833   | 114834   |

  csv_cut.sh -i big_sample_int.csv 100,+3 500,+3|csvtool transpose -|csvlook -I
  -- range of rows   : 100-103
  -- range of columns: 500-503
  | index    | 100    | 101    | 102    | 103    |
  | -------- | ------ | ------ | ------ | ------ |
  | COL_0499 | 110983 | 112099 | 113215 | 114331 |
  | COL_0500 | 110984 | 112100 | 113216 | 114332 |
  | COL_0501 | 110985 | 112101 | 113217 | 114333 |
  | COL_0502 | 110986 | 112102 | 113218 | 114334 |


csv_dtype.sh

to estimate data type of each column
Usage: csv_dtype.sh [-i] [-c columns] [-r number] [-v] csv_file
arguments:
  csv_file : path of csv file
options:
  -i        : print index of column, 1-base
  -c columns: indexes of columns to parse, 1-base ex.1,3-10
  -r number : number of rows to scan, default=10
  -v        : verbose

example:
  csv_dtype.sh -c 1-100 big_sample_arb.csv | awk '$2=="binary" {print $0}'
  csv_dtype.sh -i -c 1-20 -v big_sample_arb.csv

remark:
  


csv_dummy.sh

make simple dummy record with csv format
Usage: csv_dummy.sh [-q] number_of_data_rows number_of_columns
options:
  -q : quoting cell value.



csv_function.sh

replace contents of columns with values that were processed by 'function'.
Usage: csv_function.sh [-a suffix] csv_file columns function_name
Usage: csv_function.sh -l

arguments:
  csv_file      : path of csv file
  columns       : columns to apply the function to, 1-base, ex. 1-2,3
  function_name : program, path of a shell script, or implicit function (see '-l')

options:
  -l: print available implicit functions
  -a suffix: append mode, the suffix is used for the new column name.

remark:
  the program or script receives the values of the columns as arguments,
  and writes the results in csv format, with the same number of fields.

  sample of script (bin2dec):
       #!/bin/bash
       # arguments: the values of the selected columns, one value per argument
       INS=("${@}")
       RES=""
       BASE=2
       for v in "${INS[@]}"; do
           if [[ ${v} =~ [^0-9] ]]; then
               # non-numeric value: keep it as it is
               RES="${RES}${v},"
           else
               # numeric value: convert from base 2 into decimal
               RES="${RES}$((${BASE}#${v})),"
           fi
       done

       if [[ ${#RES} == 0 ]]; then
           echo "${INS[@]}"
       else
           RES=${RES::-1}   # drop the trailing comma
           echo "${RES}"
       fi

example:
 csv_function.sh test.csv 1 bin2dec

 input:
 A,B
 10,01
 11,10

 output:
 A,B
 2,01
 3,10



csv_get_col_index.sh

to get the index of a column or the name of a column, 1-base
Usage: csv_get_col_index.sh csv_file column_name_or_index



csv_head.sh

command like head for csv.
Usage: csv_head.sh [-t] [-c range_of_columns] [-r number_of_rows] csv_file 
options:
  -c range_of_columns: range of columns as 1-base index ex.:16,20-31
  -r number_of_rows  : number of rows to view
  -t                 : tail mode
remark:
  as assumption, there is only one header row in csv file.



csv_hist_console.sh

plot histogram by ascii character.
Usage: csv_hist_console.sh [-p] [-s] [-d] [-l] [-b bins] csv_file x_column [y_column]
arguments:
  csv_file : path of csv file
  x_column : column index of x, 0-base
  y_column : column index of y, 0-base
options:
  -p      : line plot, y_column is required.
  -s      : scatter plot, y_column is required.
  -d      : heat map plot, y_column is required.
  -l      : histogram, y_column is not required.
  -b bins : number of bins for '-d' and '-l', default=20 

remark:
 at least one of '-p', '-s', '-d', '-l' must be given.
 GitHub - sgreben/jp: dead simple terminal plots from JSON data. single binary, no dependencies. linux, osx, windows. 
  https://github.com/sgreben/jp

example:
   csv_hist_console.sh -l test_plot.csv 1



csv_join.sh

join two csv file into one csv file
Usage: csv_join.sh [-m mode] columns_1 csv_file_1 columns_2 csv_file_2
Usage: csv_join.sh -s [-m mode] columns csv_file_1 csv_file_2
options:
  -s     : use same names of columns to join
  -m mode: join mode, l=left, r=right, f=full, default=l
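
remark:
  Roughly, the join modes correspond to the usual left/right/full joins. As a pandas illustration
  (not the script's implementation; file and column names follow the examples below):

    import pandas as pd
    df1 = pd.read_csv("test1.csv")
    df2 = pd.read_csv("test2.csv")
    df1.merge(df2, on=["A", "B"], how="left")   # -m l (default)
    df1.merge(df2, on=["A", "B"], how="right")  # -m r
    df1.merge(df2, on=["A", "B"], how="outer")  # -m f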

example:
  csv_join.sh A,B test1.csv A,B test2.csv
  csv_join.sh -s A,B test1.csv test2.csv



csv_matrix.sh

convert matrix data into (x,y,v).
Usage: csv_matrix.sh [-n] csv_file
options:
  -n : no header mode

remark:
  output file may be used by csv_plot_heatmap.py, and so on.

example1:
  csv_matrix.sh test_matrix.csv |csv_plot_heatmap.py --format=html --side_hist - ROW_ID COL_ID value >test.html

example2:
  csv_matrix.sh test_matrix.csv
input:
,A,B,C,D,E
Z,1,2,3,4,5
Y,6,7,8,9,10

output:
ROW_ID,COL_ID,value
Z,A,1
Z,B,2
Z,C,3
Z,D,4
Z,E,5
Y,A,6
Y,B,7
Y,C,8
Y,D,9
Y,E,10
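
For reference, roughly the same reshaping in pandas (an illustration only, not what the script does
internally; the file names follow example2 above):

  import pandas as pd
  df = pd.read_csv("test_matrix.csv", index_col=0)    # first column becomes the row labels
  long = df.stack().reset_index()                     # (row, column) pairs with their values
  long.columns = ["ROW_ID", "COL_ID", "value"]
  long.to_csv("test_matrix_long.csv", index=False)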



csv_print.sh

to print contents around given (row,col) in csv file.
Usage: csv_print.sh [-n] [-c int]  [-r int] csv_file row col
arguments:
  row : index of row, integer
  col : index of column, integer
options:
  -c int: half width of view for columns
  -r int: half width of view for rows
  -n    : row number mode

remark:
  as assumption, there is only one header row in csv file.



csv_print_head.sh

to print names of columns in csv file
Usage: csv_print_head.sh [-t] [-c columns] [-n index_width] [-r nrows] csv_file

options:
  -t             : table format
  -c columns     : index of columns to print, ex: 1-10,51-60
  -n index_width : number of width of index, default=5
  -r nrows       : multiindex columns mode, nrows is number of rows of headers

remark:
  as assumption, there is only one header row in csv file.



csv_quick_view.sh

Usage: csv_quick_view.sh [-s word] csv_file
options:
  -s word: word to search

remark:
  see the following
  Pretty CSV viewing on the Command Line - Stefaan Lippens inserts content here https://www.stefaanlippens.net/pretty-csv.html


csv_search_rc.sh

to search keyword in csv file and to return the position:(row,col)
Usage: csv_search_rc.sh [-w] [-l] keyword csv_file
arguments:
  keyword : keyword to search, grep regex pattern is available.
  csv_file: csv file

options:
  -l: one result as one line
  -w: keyword as word

remark:
  as assumption, there is only one header row in csv file.



csv_split_columns.sh

to split ONE csv file, which may have too many columns, into several csv files.
Usage: csv_split_columns.sh [-n limit_columns] index_column csv_file
arguments:
  index_column : label of the column to use as the index column.
                 this column is included in every result file.
                 if a number is given, it is treated as the 0-base index of the field.
options:
  -n limit_columns : maximum number of columns that one csv file has.
                     this limit may come from a limitation of a database.
                     default is 2000
remark:
  as assumption, there is only one header row in csv file.

example:
csv_split_columns.sh 0 sample.csv
csv_split_columns.sh ID sample.csv
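
A minimal pandas sketch of the same idea, splitting a wide table into chunks while keeping the
index column in every output file (the 'ID' column and the 2000-column limit follow the description
above; the output file names are placeholders):

  import pandas as pd
  df = pd.read_csv("sample.csv", dtype=str)
  index_col, limit = "ID", 2000
  others = [c for c in df.columns if c != index_col]
  step = limit - 1                     # leave room for the index column in each chunk
  for i in range(0, len(others), step):
      part = df[[index_col] + others[i:i + step]]
      part.to_csv(f"sample_part{i // step:02d}.csv", index=False)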



csv_sqlite_insert.sh

insert contents of a csv file into a sqlite DB
Usage: csv_sqlite_insert.sh [-d db_name] [-s number_of_rows] [-p primary_keys] table_name csv_file
options:
  -d db_name:  name of database
  -p primary_keys: primary keys, with csv format
  -s number_of_rows: number of scanned rows for creating the table


csv_stack.sh

simply append csv files according to columns or rows.
Usage: csv_stack.sh [-m dir] [-c categories] [-r n] csv_file csv_file [csv_file ...]
options:
  -r n          : number of rows of headers, default=1
  -m dir        : direction of stack, v or h, default=v
  -c categories : names of categories for the csv files, with csv format.
                  this option is available only for vertical mode.
remark:
 when the '-c' option was used, the number of categories must be equal to the number of csv files.
 if this option was given, each record in the result has a category column.

example:
  csv_stack.sh -c p1,p2 test1.csv test2.csv
  A,B,Category
  1,2,p1
  3,4,p1
  5,6,p1
  7,8,p2
  9,10,p2
  11,12,p2
  13,14,p2



csv_status_xsv.sh

print status of each column by 'xsv'
Usage: csv_status_xsv.sh [-c columns] [-f number] csv_file
options:
  -c columns: columns to analyze, 1-base, default=all
  -f number : threshold of frequency, less than this number, default=(number of records)/10


csv_to_db_csvkit.sh

to insert contents of a csv file into a db by csvkit
Usage: csv_to_db_csvkit.sh [-i] [-d db_name] [-s number_of_rows] [-p primary_keys] table_name csv_file
arguments:
  table_name : name of the table
  csv_file   : path of the csv file

options:
  -i        : only print the SQL to make the db table
  -d db_name: name of the database
              a url is available: ex. sqlite:///db_file, mysql://user:password@localhost:3306/database_name
              if db_name is given without a protocol, sqlite3 is assumed.
              if this option was omitted, a sqlite3 db file, whose name is table_name+".sqlite", will be created.
  -p primary_keys: primary keys, with csv format
  -s number_of_rows: number of scanned rows for creating the table
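
example:
  # an illustrative invocation based on the options above; the db URL, table name and
  # file name are placeholders, not taken from the original document.
  csv_to_db_csvkit.sh -d sqlite:///sample.sqlite -p ID sampleTBL sample.csv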


csv_to_db_shell.sh

to insert csv data into database
Usage: csv_to_db_shell.sh [-d db_type] [-u db_user] [-p] [-c columns] db_name table_name index_name csv_file [sql]

arguments:
   db_name    : name of database. if db_type was "sqlite3", then that means name of file.
   table_name : name of table
   index_name : name of column that means index/primary key.
   csv_file   : name of csv file, that has names of columns at first row.
   sql        : sql statements that will be executed first.

options:
   -d db_type : type of database, [mysql,sqlite3]: default=sqlite3
   -u db_user : user name for database
   -c columns : selected columns. ex. 1-5,20-30
   -p         : dump sql string, without execution

example: 
  csv_to_db_shell.sh -d mysql -u root sampleDB testTBL ID sample.csv "use sampleDB;DROP TABLE IF EXISTS testTBL;"
  csv_to_db_shell.sh -d sqlite3 sampleDB.sqlite3 testTBL ID sample.csv "DROP TABLE IF EXISTS testTBL;"

remark:
  the character encoding of csv_file must be utf-8.
  in the db, the data type of each column is VARCHAR(256) for mysql or TEXT for sqlite3.
  if you want to change the type, use '-p' and edit the sql statements by hand before executing them.
  or you may use 'head -20 csv_file | csvsql -i db_type' to make the sql statements for creating the table.
  As assumption, there is only one header row in csv file.

  Also you can use 'csvsql --overwrite --tables table_name --db db_type:///db_file --insert csv_file'
  


csv_to_html.sh

convert csv file into html format
Usage: csv_to_html.sh [-c columns] [-r row0,row1] csv_file
arguments:
  csv_file : path of csv file
options:
  -c columns: list of columns, ex. 1,5-10
  -r row0,row1 : range of rows to print


csv_wc.sh

command like 'wc' for csv
Usage: csv_wc.sh (-c|-r|-t) csv_file
options:
  -c : number of columns
  -r : number of rows
  -t : line terminator
remark:
  as assumption, there is only one header row in csv file.
  number of rows means the count excluding the header row.